Desmond Gerber

May 8, 2022

Got-scraping: HTTP Client Made for Scraping Based on Got

Got Scraping

Got Scraping is a small but powerful got extension whose purpose is to send browser-like requests out of the box. This is essential in the web scraping industry to blend in with regular website traffic.

Installation

$ npm install got-scraping

Note:

  • Node.js >=15.10.0 is required due to instability of HTTP/2 support in lower versions.

API

The Got Scraping package is built using the got.extend(...) functionality, so it supports all the features Got has.

Interested in what's under the hood?

const { gotScraping } = require('got-scraping');

gotScraping
    .get('https://apify.com')
    .then(({ body }) => console.log(body))

options

proxyUrl

Type: string

URL of the HTTP or HTTPS based proxy. HTTP/2 proxies are supported as well.

const { gotScraping } = require('got-scraping');

gotScraping
    .get({
        url: 'https://apify.com',
        proxyUrl: 'http://username:password@myproxy.com:1234',
    })
    .then(({ body }) => console.log(body))

useHeaderGenerator

Type: boolean
Default: true

Whether to generate browser-like headers.
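
If you prefer to supply the headers yourself, you can turn the generator off. A minimal sketch, assuming you pass your own headers alongside the option (the user-agent value here is only an illustration):

const response = await gotScraping({
    url: 'https://apify.com/',
    // Disable automatic browser-like header generation.
    useHeaderGenerator: false,
    headers: {
        'user-agent': 'my-crawler/1.0',
    },
});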

headerGeneratorOptions

See the HeaderGeneratorOptions docs.

const response = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    headerGeneratorOptions: {
        browsers: [
            {
                name: 'chrome',
                minVersion: 87,
                maxVersion: 89
            }
        ],
        devices: ['desktop'],
        locales: ['de-DE', 'en-US'],
        operatingSystems: ['windows', 'linux'],
    }
});

sessionToken

A non-primitive unique object which describes the current session. By default, it's undefined, so new headers will be generated every time. Headers generated with the same sessionToken never change.
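
For example, the sketch below reuses one token across two requests so they share the same generated headers (the target URL is only illustrative):

// Any non-primitive value can act as the token.
const sessionToken = {};

const first = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    sessionToken,
});

// Same token, so the generated headers match the first request.
const second = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    sessionToken,
});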

Under the hood

Thanks to the included header-generator package, you can choose various browsers from different operating systems and devices. It generates all the headers automatically so you can focus on the important stuff instead.

Yet another goal is to simplify the usage of proxies. Just pass the proxyUrl option and you are set. Got Scraping automatically detects the HTTP protocol that the proxy server supports. After the connection is established, it does another ALPN negotiation for the end server. Once that is complete, Got Scraping can proceed with HTTP requests.

Using the same HTTP version that browsers do is important as well. Most modern browsers use HTTP/2, so Got Scraping makes use of it too. Fortunately, Got already supports this: it automatically handles ALPN protocol negotiation to select the best available protocol.

HTTP/1.1 headers are always automatically formatted in Pascal-Case (for example, user-agent becomes User-Agent). However, there is an exception: x- headers are not modified in any way.

By default, Got Scraping uses an insecure HTTP parser, which allows access to websites with non-spec-compliant web servers.

Last but not least, Got Scraping comes with an updated TLS configuration. Some websites fingerprint it and compare the result against real browsers. While Node.js doesn't support OpenSSL 3 yet, the current configuration should still work flawlessly.

To get more detailed information about the implementation, please refer to the source code.

Tips

This package generates only the standard header attributes, so you might want to add a referer header yourself if necessary. Please bear in mind that the generated headers are meant for GET requests for HTML documents. If you want to make POST requests or GET requests for any other content type, you should alter these headers according to your needs. You can do so by passing a headers option or writing a custom Got handler.
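
As a minimal sketch of such an adjustment (the URL is a placeholder, not a real endpoint), a POST request with a JSON payload might look like this:

const response = await gotScraping({
    url: 'https://example.com/api/search',
    method: 'POST',
    // Got serializes this object and sets the JSON content type.
    json: { query: 'test' },
    headers: {
        // Ask for JSON instead of an HTML document.
        accept: 'application/json',
    },
    responseType: 'json',
});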

This package should provide a solid start for your browser request emulation process. All websites are built differently, and some of them might require some additional special care.

Overriding request headers

const response = await gotScraping({
    url: 'https://apify.com/',
    headers: {
        'user-agent': 'test',
    },
});

For more advanced usage please refer to the Got documentation.

JSON mode

You can parse JSON with this package too, but please bear in mind that the request header generation is done specifically for the HTML content type. You might want to alter the generated headers to match the ones a browser would send when requesting JSON.

const response = await gotScraping({
    responseType: 'json',
    url: 'https://api.apify.com/v2/browser-info',
});

Error recovery

This section covers possible errors that might happen due to different site implementations.

RequestError: Client network socket disconnected before secure TLS connection was established

The error above can be a result of the server not supporting the provided TLS settings. Try changing the ciphers parameter to either undefined or a custom value.
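
A hedged sketch of that change, assuming the ciphers parameter is passed through Got's https options (check the Got documentation for the exact location in your version):

const response = await gotScraping({
    url: 'https://example.com',
    https: {
        // undefined falls back to the Node.js default cipher list;
        // a custom OpenSSL cipher string can be supplied instead.
        ciphers: undefined,
    },
});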

Author: Apify
Source Code: https://github.com/apify/got-scraping 
License: 

#node #http #client 


Http-client: Async HTTP/1.1+2 Client for PHP Based on Amp

Http client

This package provides an asynchronous HTTP client for PHP based on Amp. Its API simplifies standards-compliant HTTP resource traversal and RESTful web service consumption without obscuring the underlying protocol. The library manually implements HTTP over TCP sockets; as such it has no dependency on ext/curl.

Features

Installation

This package can be installed as a Composer dependency.

composer require amphp/http-client

Additionally, you might want to install the nghttp2 library to take advantage of FFI, which speeds up requests and reduces memory usage on PHP 7.4.

Documentation

Documentation is bundled within this repository in the docs directory.

Examples

More extensive code examples reside in the examples directory.

Versioning

amphp/http-client follows the semver semantic versioning specification like all other amphp packages.

Everything in an Internal namespace or marked as @internal is not public API and therefore not covered by BC guarantees.

4.x

Stable and recommended version.

3.x

Legacy version. Use amphp/artax as package name instead.

2.x

No longer maintained. Use amphp/artax as package name instead.

1.x

No longer maintained. Use amphp/artax as package name instead.

Security

If you discover any security-related issues, please email me@kelunik.com instead of using the issue tracker.

Download Details:

Author: Amphp
Source Code: https://github.com/amphp/http-client 
License: MIT license

#php #http #client #async 

Ray Patel

June 9, 2021

Cloud Based Web Scraping for Big Data Applications 

Have you ever wondered how companies started to maintain and store big data? Well, flash drives were only prevalent at the start of the millennium. But with the advancement of the internet and technology, the big data analytics industry is projected to reach $103 billion by 2027, according to Statista.

As the need to store big data and access it instantly increases at an alarming rate, scraping and web crawling technologies are becoming more and more useful. Today, companies mainly use web scraping technology to regulate prices, calculate consumer satisfaction indexes, and gather business intelligence. Read on to learn about the uses of cloud-based web scraping for big data apps.

What is Web Scraping?

How Cloud-Based Web Scraping Benefits an Organisation

#data-analytics #web-scraping #big-data #cloud

Nigel Uys

April 27, 2022

Go-http-client: An Enhanced HTTP Client for Golang

go-http-client

An enhanced HTTP client for Golang

This package provides an HTTP client for your HTTP requests. You can send requests quickly with this package. If you want to contribute to this package, please fork the repository and create a pull request.

Installation

$ go get -u github.com/bozd4g/go-http-client/

Usage

import (
    "encoding/json"
    "fmt"
    client "github.com/bozd4g/go-http-client"
)

type Todo struct {
    Id        int
    UserId    int
    Title     string
    Completed bool
}

func main() {
    httpClient := client.New("https://jsonplaceholder.typicode.com/")
    request, err := httpClient.Get("posts/10")
    
    if err != nil {
        panic(err)
    }
    
    response, err := httpClient.Do(request)
    if err != nil {
        panic(err)
    }
    
    var todo Todo
    err = json.Unmarshal(response.Get().Body, &todo)
    if err != nil {
        panic(err)
    }
    fmt.Println(todo.Title) // Lorem ipsum dolor sit amet

    // or  
    var todo2 Todo     
    response, err = httpClient.Do(request)
    if err == nil {
        response.To(&todo2)
        fmt.Println(todo2.Title) // Lorem ipsum dolor sit amet
    } else {
        fmt.Println(err.Error())
    }
}

Functions

You can call these functions from your application.

Function | Has Params
Get(endpoint string) | -
GetWith(endpoint string, params interface{}) | Yes
Post(endpoint string) | -
PostWith(endpoint string, params interface{}) | Yes
Patch(endpoint string) | -
PatchWith(endpoint string, params interface{}) | Yes
Put(endpoint string) | -
PutWith(endpoint string, params interface{}) | Yes
Delete(endpoint string) | -
DeleteWith(endpoint string, params interface{}) | Yes
Do() (Response, error) | -
To(value interface{}) | -

Documentation on go.dev 🔗

Author: Bozd4g
Source Code: https://github.com/bozd4g/go-http-client 
License: MIT License

#go #golang #http #client 

Royce Reinger

July 5, 2022

Patron: Ruby HTTP Client Based on Libcurl

Patron

Patron is a Ruby HTTP client library based on libcurl. It does not try to expose the full "power" (read: complexity) of libcurl but instead tries to provide a sane API while taking advantage of libcurl under the hood.

Usage

First, you instantiate a Session object. You can set a few default options on the Session instance that will be used by all subsequent requests:

sess = Patron::Session.new
sess.timeout = 10
sess.base_url = "http://myserver.com:9900"
sess.headers['User-Agent'] = 'myapp/1.0'

You can set options with a hash in the constructor:

sess = Patron::Session.new({ :timeout => 10,
                             :base_url => 'http://myserver.com:9900',
                             :headers => {'User-Agent' => 'myapp/1.0'} } )

Or the set options in a block:

sess = Patron::Session.new do |patron|
    patron.timeout = 10
    patron.base_url = 'http://myserver.com:9900'
    patron.headers = {'User-Agent' => 'myapp/1.0'}
end

Output debug log:

sess.enable_debug "/tmp/patron.debug"

The Session is used to make HTTP requests.

resp = sess.get("/foo/bar")

Requests return a Response object:

if resp.status < 400
  puts resp.body
end

The GET, HEAD, PUT, POST and DELETE operations are all supported.

sess.put("/foo/baz", "some data")
sess.delete("/foo/baz")

You can ship custom headers with a single request:

sess.post("/foo/stuff", "some data", {"Content-Type" => "text/plain"})

Threading

By itself, Patron::Session objects are not thread safe (each Session holds a single curl_state pointer from initialization to garbage collection). At this time, Patron has no support for the curl_multi_* family of functions for doing concurrent requests. However, the actual code that interacts with libCURL does unlock the Ruby GVL, so using multiple Session objects in different threads enables a high degree of parallelism. For sharing a pool of sessions between threads we recommend using the excellent connection_pool gem by Mike Perham.

patron_pool = ConnectionPool.new(size: 5, timeout: 5) { Patron::Session.new }
patron_pool.with do |session|
  session.get(...)
end

Sharing Session objects between requests will also allow you to benefit from persistent connections (connection reuse), see below.

Persistent connections

Patron follows the libCURL guidelines on connection reuse. If you create the Session object once and use it for multiple requests, the same libCURL handle is going to be used across these requests and if requests go to the same hostname/port/protocol the connection should get reused.

Performance with parallel requests

When performing the libCURL request, Patron goes out of its way to unlock the GVL (global VM lock) to allow other threads to be scheduled in parallel. The GVL is released when the libCURL request starts, briefly re-acquired to provide the progress callback (if one has been configured), and then released again until the libCURL request has been performed and the response has been read in full. This allows one to execute multiple libCURL requests in parallel, as well as perform other activities on other MRI threads that are currently active in the process.

Requirements

Patron 1.0 and up requires MRI Ruby 2.3 or newer. The 0.x versions support Ruby 1.9.3 and these versions get tagged and developed on the v0.x branch.

A recent version of libCURL is required. We recommend at least 7.19.4 because it supports limiting the protocols, and that is very important for security - especially if you follow redirects.

On OSX the provided libcurl is sufficient if you are not using fork+SSL combination (see below). You will have to install the libcurl development packages on Debian or Ubuntu. Other Linux systems are probably similar. For Windows we do not have an established build instruction at the moment, unfortunately.

Forking webservers on macOS and SSL

Currently, an issue is at play with OSX builds of curl which use Apple's SecureTransport. Such builds (which Patron links to) cause segfaults when performing HTTPS requests in forked subprocesses. If you need to check whether your system is affected, run the Patron test suite by performing

$ bundle install && bundle exec rspec

in the Patron install directory. Most default curl configurations on OSX (both the Apple-shipped version and the version available via Homebrew) are linked to SecureTransport and are likely to be affected. This issue may also manifest in forking webserver implementations (such as Unicorn or Passenger) and in forking job execution engines (such as Resque), so even though you may not be using fork() directly, your server engine might be doing it for you.

To circumvent the issue, you need to build curl with OpenSSL via Homebrew. When doing so, curl will use OpenSSL as its SSL driver. You also need to change the Patron compile flag:

$ brew install curl-openssl && \
    gem install patron -- --with-curl-config=/usr/local/opt/curl-openssl/bin/curl-config

You can also save this parameter for all future Bundler-driven gem installs by setting this flag in Bundler proper:

$ bundle config build.patron --with-curl-config=/usr/local/opt/curl-openssl/bin/curl-config

Installation

sudo gem install patron

Author: toland
Source Code: https://github.com/toland/patron 
License: MIT license

#ruby #http #client