Got Scraping
Got Scraping is a small but powerful got extension whose purpose is to send browser-like requests out of the box. This is essential in the web scraping industry to blend in with regular website traffic.
$ npm install got-scraping
Note:
- Node.js >=15.10.0 is required due to instability of HTTP/2 support in lower versions.
The Got Scraping package is built using the got.extend(...) functionality, so it supports all the features Got has.
Interested in what's under the hood?
const { gotScraping } = require('got-scraping');

gotScraping
    .get('https://apify.com')
    .then(({ body }) => console.log(body));
proxyUrl
Type: string
URL of the HTTP- or HTTPS-based proxy. HTTP/2 proxies are supported as well.
const { gotScraping } = require('got-scraping');

gotScraping
    .get({
        url: 'https://apify.com',
        proxyUrl: 'http://username:password@myproxy.com:1234',
    })
    .then(({ body }) => console.log(body));
useHeaderGenerator
Type: boolean
Default: true
Whether to generate browser-like headers.
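If you prefer to supply every header yourself, the generator can be switched off per request. A minimal sketch, mirroring the other examples in this article:

const response = await gotScraping({
    url: 'https://apify.com/',
    // Skip the automatic browser-like header generation for this request.
    useHeaderGenerator: false,
});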
headerGeneratorOptions
See the HeaderGeneratorOptions docs.
const response = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    headerGeneratorOptions: {
        browsers: [
            {
                name: 'chrome',
                minVersion: 87,
                maxVersion: 89,
            },
        ],
        devices: ['desktop'],
        locales: ['de-DE', 'en-US'],
        operatingSystems: ['windows', 'linux'],
    },
});
sessionToken
A non-primitive unique object that describes the current session. By default, it is undefined, so new headers are generated for every request. Headers generated with the same sessionToken never change.
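As a minimal sketch, reusing the same token object makes repeated requests carry identical generated headers (the empty object below is just a placeholder; any non-primitive value works):

const sessionToken = {}; // hypothetical session key

const first = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    sessionToken,
});

const second = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    sessionToken, // same token, so the generated headers stay identical
});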
Thanks to the included header-generator package, you can choose various browsers from different operating systems and devices. It generates all the headers automatically so you can focus on the important stuff instead.
Yet another goal is to simplify the usage of proxies. Just pass the proxyUrl option and you are set. Got Scraping automatically detects the HTTP protocol that the proxy server supports. After the connection is established, it does another ALPN negotiation for the end server. Once that is complete, Got Scraping can proceed with HTTP requests.
Using the same HTTP version that browsers do is important as well. Most modern browsers use HTTP/2, so Got Scraping makes use of it too. Fortunately, this is already supported by Got: it automatically handles ALPN protocol negotiation to select the best available protocol.
HTTP/1.1 headers are always automatically formatted in Pascal-Case. However, there is an exception: x- headers are not modified in any way.
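As a minimal sketch of that rule (the header names below are hypothetical, and Got's standard http2 option is assumed here to force HTTP/1.1):

const response = await gotScraping({
    url: 'https://apify.com/',
    http2: false, // use HTTP/1.1 so the Pascal-Case formatting applies
    headers: {
        'custom-header': 'value',   // sent as 'Custom-Header'
        'x-custom-header': 'value', // x- headers are left untouched
    },
});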
By default, Got Scraping uses an insecure HTTP parser, which allows it to access websites with non-spec-compliant web servers.
Last but not least, Got Scraping comes with an updated TLS configuration. Some websites build a fingerprint of it and compare it with those of real browsers. While Node.js doesn't support OpenSSL 3 yet, the current configuration should still work flawlessly.
To get more detailed information about the implementation, please refer to the source code.
This package can only generate the standard attributes. You might want to add the referer header if necessary. Please bear in mind that these headers are made for GET requests for HTML documents. If you want to make POST requests, or GET requests for any other content type, you should alter these headers according to your needs. You can do so by passing a headers option or writing a custom Got handler.
This package should provide a solid start for your browser request emulation process. All websites are built differently, and some of them might require some additional special care.
const response = await gotScraping({
    url: 'https://apify.com/',
    headers: {
        'user-agent': 'test',
    },
});
For more advanced usage please refer to the Got documentation.
You can parse JSON with this package too, but please bear in mind that the request header generation is done specifically for the HTML content type. You might want to alter the generated headers to match the browser ones.
const response = await gotScraping({
    responseType: 'json',
    url: 'https://api.apify.com/v2/browser-info',
});
This section covers possible errors that might happen due to different site implementations.
RequestError: Client network socket disconnected before secure TLS connection was established
The error above can be a result of the server not supporting the provided TLS settings. Try changing the ciphers parameter to either undefined or a custom value.
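A minimal sketch of that workaround, assuming the cipher list is exposed through Got's https.ciphers option and that the URL below stands in for the affected site:

const response = await gotScraping({
    url: 'https://example.com/', // hypothetical affected site
    https: {
        // undefined falls back to the Node.js defaults; a custom cipher
        // string can be supplied instead.
        ciphers: undefined,
    },
});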
Author: Apify
Source Code: https://github.com/apify/got-scraping
License:
This package provides an asynchronous HTTP client for PHP based on Amp. Its API simplifies standards-compliant HTTP resource traversal and RESTful web service consumption without obscuring the underlying protocol. The library manually implements HTTP over TCP sockets; as such, it has no dependency on ext/curl.
This package can be installed as a Composer dependency.
composer require amphp/http-client
Additionally, you might want to install the nghttp2 library to take advantage of FFI, which speeds things up and reduces memory usage on PHP 7.4.
Documentation is bundled within this repository in the docs directory.
More extensive code examples reside in the examples directory.
amphp/http-client follows the semver semantic versioning specification like all other amphp packages.
Everything in an Internal namespace or marked as @internal is not public API and therefore not covered by BC guarantees.
- 4.x: Stable and recommended version.
- Legacy version. Use amphp/artax as the package name instead.
- No longer maintained. Use amphp/artax as the package name instead.
- No longer maintained. Use amphp/artax as the package name instead.
If you discover any security-related issues, please email me@kelunik.com instead of using the issue tracker.
Author: Amphp
Source Code: https://github.com/amphp/http-client
License: MIT license
Have you ever wondered how companies started to maintain and store big data? Well, flash drives were only just becoming prevalent at the start of the millennium. But with the advancement of the internet and technology, the big data analytics industry is projected to reach $103 billion by 2027, according to Statista.
As the need to store big data and access it instantly increases at an alarming rate, scraping and web crawling technologies are becoming more and more useful. Today, companies mainly use web scraping technology to regulate prices, calculate consumer satisfaction indexes, and assess market intelligence. Read on to find the uses of cloud-based web scraping for big data apps.
…
#data-analytics #web-scraping #big-data #cloud based web scraping for big data applications #big data applications #cloud based web scraping
An enhanced http client for Golang
This package provides an HTTP client for your HTTP requests. You can send requests quickly with it. If you want to contribute to this package, please fork the repository and create a pull request.
Installation
$ go get -u github.com/bozd4g/go-http-client/
Usage
package main

import (
	"encoding/json"
	"fmt"

	client "github.com/bozd4g/go-http-client"
)

type Todo struct {
	Id        int
	UserId    int
	Title     string
	Completed bool
}

func main() {
	httpClient := client.New("https://jsonplaceholder.typicode.com/")
	request, err := httpClient.Get("posts/10")
	if err != nil {
		panic(err)
	}

	response, err := httpClient.Do(request)
	if err != nil {
		panic(err)
	}

	var todo Todo
	err = json.Unmarshal(response.Get().Body, &todo)
	if err != nil {
		panic(err)
	}
	fmt.Println(todo.Title) // Lorem ipsum dolor sit amet

	// or
	var todo2 Todo
	response, err = httpClient.Do(request)
	if err == nil {
		response.To(&todo2)
		fmt.Println(todo2.Title) // Lorem ipsum dolor sit amet
	} else {
		fmt.Println(err.Error())
	}
}
You can call these functions from your application.
| Function | Has Params |
|---|---|
| Get(endpoint string) | - |
| GetWith(endpoint string, params interface{}) | Yes |
| Post(endpoint string) | - |
| PostWith(endpoint string, params interface{}) | Yes |
| Patch(endpoint string) | - |
| PatchWith(endpoint string, params interface{}) | Yes |
| Put(endpoint string) | - |
| PutWith(endpoint string, params interface{}) | Yes |
| Delete(endpoint string) | - |
| DeleteWith(endpoint string, params interface{}) | Yes |
| Do() (Response, error) | - |
| To(value interface{}) | - |
Author: Bozd4g
Source Code: https://github.com/bozd4g/go-http-client
License: MIT License
Patron is a Ruby HTTP client library based on libcurl. It does not try to expose the full "power" (read complexity) of libcurl but instead tries to provide a sane API while taking advantage of libcurl under the hood.
First, you instantiate a Session object. You can set a few default options on the Session instance that will be used by all subsequent requests:
sess = Patron::Session.new
sess.timeout = 10
sess.base_url = "http://myserver.com:9900"
sess.headers['User-Agent'] = 'myapp/1.0'
You can set options with a hash in the constructor:
sess = Patron::Session.new({ :timeout => 10,
                             :base_url => 'http://myserver.com:9900',
                             :headers => {'User-Agent' => 'myapp/1.0'} })
Or set the options in a block:
sess = Patron::Session.new do |patron|
  patron.timeout = 10
  patron.base_url = 'http://myserver.com:9900'
  patron.headers = {'User-Agent' => 'myapp/1.0'}
end
Output debug log:
sess.enable_debug "/tmp/patron.debug"
The Session is used to make HTTP requests.
resp = sess.get("/foo/bar")
Requests return a Response object:
if resp.status < 400
  puts resp.body
end
The GET, HEAD, PUT, POST and DELETE operations are all supported.
sess.put("/foo/baz", "some data")
sess.delete("/foo/baz")
You can ship custom headers with a single request:
sess.post("/foo/stuff", "some data", {"Content-Type" => "text/plain"})
By themselves, Patron::Session objects are not thread-safe (each Session holds a single curl_state pointer from initialization to garbage collection). At this time, Patron has no support for the curl_multi_* family of functions for doing concurrent requests. However, the code that interacts with libCURL does unlock the Ruby GVL, so using multiple Session objects in different threads enables a high degree of parallelism. For sharing a pool of sessions between threads, we recommend the excellent connection_pool gem by Mike Perham.
patron_pool = ConnectionPool.new(size: 5, timeout: 5) { Patron::Session.new }
patron_pool.with do |session|
  session.get(...)
end
Sharing Session objects between requests will also allow you to benefit from persistent connections (connection reuse), see below.
Patron follows the libCURL guidelines on connection reuse. If you create the Session object once and use it for multiple requests, the same libCURL handle is going to be used across these requests and if requests go to the same hostname/port/protocol the connection should get reused.
When performing the libCURL request, Patron goes out of its way to unlock the GVL (global VM lock) to allow other threads to be scheduled in parallel. The GVL is released when the libCURL request starts, briefly re-acquired to run the progress callback (if one has been configured), and then released again until the libCURL request has been performed and the response has been read in full. This allows one to execute multiple libCURL requests in parallel, as well as perform other activities on other MRI threads that are currently active in the process.
Patron 1.0 and up requires MRI Ruby 2.3 or newer. The 0.x versions support Ruby 1.9.3 and are tagged and developed on the v0.x branch.
A recent version of libCURL is required. We recommend at least 7.19.4 because it supports limiting the protocols, and that is very important for security - especially if you follow redirects.
On OSX the provided libcurl is sufficient if you are not using fork+SSL combination (see below). You will have to install the libcurl development packages on Debian or Ubuntu. Other Linux systems are probably similar. For Windows we do not have an established build instruction at the moment, unfortunately.
Currently, an issue is at play with OSX builds of curl which use Apple's SecureTransport. Such builds (which Patron links against) cause segfaults when performing HTTPS requests in forked subprocesses. If you need to check whether your system is affected, run the Patron test suite by performing
$ bundle install && bundle exec rspec
in the Patron install directory. Most default curl configurations on OSX (both the Apple-shipped version and the version available via Homebrew) are linked to SecureTransport and are likely to be affected. This issue may also manifest in forking webserver implementations (such as Unicorn or Passenger) and in forking job execution engines (such as Resque), so even though you may not be using fork() directly, your server engine might be doing it for you.
To circumvent the issue, you need to build curl with OpenSSL via Homebrew. When doing so, curl will use OpenSSL as its SSL driver. You also need to change the Patron compile flag:
$ brew install curl-openssl && \
gem install patron -- --with-curl-config=/usr/local/opt/curl-openssl/bin/curl-config
You can also save this parameter for all future Bundler-driven gem installs by setting this flag in Bundler proper:
$ bundle config build.patron --with-curl-config=/usr/local/opt/curl-openssl/bin/curl-config
sudo gem install patron
Author: toland
Source Code: https://github.com/toland/patron
License: MIT license