Elian Harber

Geziyor, Blazing Fast Web Crawling & Scraping Framework for Go

Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.  

Features

  • JS Rendering
  • 5,000+ Requests/Sec
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8
  • Proxy management (Single, Round-Robin, Custom)

See scraper Options for all custom settings.
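
Many of these features are plain fields on Options. The sketch below is illustrative rather than exhaustive; the field names match the Options struct at the time of writing (parseFunc stands for a callback you define), so check scraper Options before relying on them:

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs:                   []string{"http://quotes.toscrape.com/"},
    ParseFunc:                   parseFunc,
    ConcurrentRequests:          32,          // global concurrency limit
    ConcurrentRequestsPerDomain: 8,           // concurrency limit per domain
    RequestDelay:                time.Second, // constant delay between requests
    RequestDelayRandomize:       true,        // randomize the delay around RequestDelay
    UserAgent:                   "my-crawler/1.0",
}).Start()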

Status

We highly recommend using Geziyor with Go modules.

Usage

This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.

package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start()
}

// quotesParse exports every quote on the page, then follows the "next" link.
func quotesParse(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}

See tests for more usage examples.

Documentation

Installation

go get -u github.com/geziyor/geziyor

If you want to make JS rendered requests, make sure you have Chrome installed.

NOTE: macOS limits the maximum number of open file descriptors. If you want to make concurrent requests over 256, you need to increase limits. Read this for more.
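
For example, you can raise the limit for the current shell session before starting your crawler (the value here is illustrative):

ulimit -n 10240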

Making Normal Requests

Initial requests start with the StartURLs []string field in Options. Geziyor makes concurrent requests to those URLs. After the response is read, ParseFunc func(g *Geziyor, r *Response) is called.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

If you want to create the first requests manually, set StartRequestsFunc; StartURLs won't be used when you create requests yourself. You can make requests using Geziyor methods:

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.Get("https://httpbin.org/anything", g.Opt.ParseFunc)
        g.Head("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Making JS Rendered Requests

JS rendered requests can be made using the GetRendered method. By default, Geziyor uses the locally installed Chrome binary to start a browser. Set the BrowserEndpoint option to use a different Chrome instance, such as "ws://localhost:3000".

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
    //BrowserEndpoint: "ws://localhost:3000",
}).Start()

Extracting Data

We can extract HTML elements using response.HTMLDoc. HTMLDoc is a goquery Document.

HTMLDoc is available on the Response if the response is HTML and can be parsed with Go's built-in HTML parser. If the response isn't HTML, response.HTMLDoc will be nil.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            log.Println(s.Find("span.text").Text(), s.Find("small.author").Text())
        })
    },
}).Start()

Exporting Data

You can export data automatically using exporters: just send data to the Geziyor.Exports channel and configure one of the available exporters.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            g.Exports <- map[string]interface{}{
                "text":   s.Find("span.text").Text(),
                "author": s.Find("small.author").Text(),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start()
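
If the built-in exporters don't fit, you can write your own. Below is a minimal sketch of a custom exporter that prints each item to stdout; it assumes export.Exporter is the single-method interface Export(exports chan interface{}) error, so verify the interface in the export package of your version before using it:

// StdoutExporter is a hypothetical custom exporter that writes
// every exported item to standard output.
type StdoutExporter struct{}

// Export drains the exports channel until Geziyor closes it.
func (e *StdoutExporter) Export(exports chan interface{}) error {
    for item := range exports {
        fmt.Printf("%+v\n", item)
    }
    return nil
}

Register it the same way as the built-ins: Exporters: []export.Exporter{&StdoutExporter{}}.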

Custom Requests - Passing Metadata To Callbacks

You can create custom requests with client.NewRequest, then dispatch them with geziyor.Do(request, callback):

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        req, _ := client.NewRequest("GET", "https://httpbin.org/anything", nil)
        req.Meta["key"] = "value"
        g.Do(req, g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println("This is our data from request: ", r.Request.Meta["key"])
    },
}).Start()

Proxy - Use proxy per request

If you want to use a proxy for your requests and you have a single proxy, you can just set the HTTP_PROXY and HTTPS_PROXY environment variables, and Geziyor will use them.

You can also rotate proxies per request by setting the ProxyFunc option to client.RoundRobinProxy, or to any custom proxy selection function you want. See client/proxy.go for how to implement such a function.

Proxies can be HTTP, HTTPS, or SOCKS5.

Note: If you use the http scheme for a proxy, it will be used for HTTP requests but not for HTTPS requests.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs:         []string{"http://httpbin.org/anything"},
    ParseFunc:         parseFunc,
    ProxyFunc:         client.RoundRobinProxy("http://some-http-proxy.com", "https://some-https-proxy.com", "socks5://some-socks5-proxy.com"),
}).Start()
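
You can also write your own selection function. The fragment below, meant to be dropped into Options, routes requests to different proxies by host. It assumes ProxyFunc has the same signature as net/http's Transport.Proxy, func(*http.Request) (*url.URL, error), which is what client.RoundRobinProxy appears to return; confirm against client/proxy.go. The host rule and proxy addresses are hypothetical:

// Send one site through SOCKS5 and everything else through HTTP.
ProxyFunc: func(req *http.Request) (*url.URL, error) {
    if strings.HasSuffix(req.URL.Hostname(), "example.com") {
        return url.Parse("socks5://some-socks5-proxy.com")
    }
    return url.Parse("http://some-http-proxy.com")
},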

Benchmark

8,748 requests per second on a MacBook Pro 15" (2016)

See tests for this benchmark function:

>> go test -run none -bench Requests -benchtime 10s
goos: darwin
goarch: amd64
pkg: github.com/geziyor/geziyor
BenchmarkRequests-8         200000        108710 ns/op
PASS
ok      github.com/geziyor/geziyor    22.861s

Author: geziyor
Source Code: https://github.com/geziyor/geziyor 
License: MPL-2.0

#go #golang


Ajay Kapoor

10 Top Web Development Frameworks for Assured Success of Your Project - PixelCrayons

Web development frameworks are a powerful way for businesses to build distinctive web apps, as they provide the tools and libraries developers need.

Most businesses want distinctive web applications that perform well and drive traffic to the site; such apps are essential given how competitive the digital world is.

Developers rely on the libraries and templates provided by frameworks to build interactive, user-friendly web applications. Frameworks also improve the efficiency, performance, and productivity of web development work.

Before going deeper, let's have a quick glance at a few facts and figures that illustrate the utility of frameworks.

As per Statista, 35.9% of developers used React in 2020.
25.1% of developers used the Angular framework worldwide.
According to SimilarTech, 2,935 websites use the Spring framework, most popular among the News and Media domain.

What is a Framework?
A framework is a set of tools that paves the way for web developers to create rich and interactive web apps. It comprises libraries, templates, and specific software tools, and it lets developers build an application without rewriting the same code over and over.

There are two categories of frameworks: the back-end framework, known as the server-side, and the front-end framework, known as the client-side.

The back-end framework covers the portion of a web application that you cannot see; it communicates with the front end. The front end, on the other hand, is the part of the web that users see and experience.

For example, what you see in an app is the front-end part, while the processing behind your interactions with it is the back end.

Read the full blog here

Hence, depending on your web application requirements, you can hire web developers from India's best web development companies. In no time, you will be among those reaping the benefits of using web development frameworks for their applications.

#web-development-frameworks #web-frameworks #top-web-frameworks #best-web-development-frameworks

AutoScraper Introduction: Fast and Light Automatic Web Scraper for Python

In the last few years, web scraping has been one of my frequent day-to-day tasks. I wondered whether I could make it smart and automatic to save lots of time. So I made AutoScraper!

The project code is available on Github.

This project automates web scraping to make it easy. It takes a URL or the HTML content of a web page, plus a list of sample data we want to scrape from that page; the data can be text, a URL, or any HTML tag value. It learns the scraping rules and returns similar elements. You can then use the learned object with new URLs to get similar content, or the exact same elements, from those pages!

Installation

Install the latest version from the git repository using pip:

$ pip install git+https://github.com/alirezamika/autoscraper.git

How to use

Getting similar results

Say we want to fetch all related post titles on a Stack Overflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

## We can add one or multiple candidates here.
## You can also put urls here to retrieve urls.
wanted_list = ["How to call an external command?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

#python #web-scraping #web-crawling #data-scraping #website-scraping #open-source #repositories-on-github #web-development

Go-web-framework-benchmark: Go Web Framework Benchmark

go-web-framework-benchmark

This benchmark suite aims to compare the performance of Go web frameworks. It is inspired by Go HTTP Router Benchmark, but differs from it: that suite compares the performance of routers, while this one compares whole HTTP request processing.

Last Test Updated: 2020-05

test environment

  • CPU: QEMU Virtual CPU (1.8 GHz, 8 cores)
  • Memory: 16G
  • Go: go1.16.3 linux/amd64
  • OS: CentOS Linux release 7.5.1804 (Core)

Motivation

When I investigated the performance of Go web frameworks, I found Go HTTP Router Benchmark, created by Julien Schmidt, who also developed the high-performance HTTP router httprouter. I thought I had the full performance picture until I wrote a piece of code to mock real business logic:

api.Get("/rest/hello", func(c *XXXXX.Context) {
	sleepTime, _ := strconv.Atoi(os.Args[1]) // e.g. 10 (milliseconds)
	if sleepTime > 0 {
		time.Sleep(time.Duration(sleepTime) * time.Millisecond)
	}

	c.Text("Hello world")
})

When I used the above code to test those web frameworks, the time taken by route selection turned out not to matter much in the whole HTTP request processing, even though route-selection performance differs greatly between frameworks.

So I created this project to compare the performance of web frameworks across connection handling, route selection, and handler processing. It mocks business logic and lets you set a specific processing time.

You can get some interesting results if you use it to test.

Implementation

When you test a web framework, this test suite starts a simple HTTP server implemented with that framework. It is a real HTTP server that serves only one GET URL: "/hello".

When the server processes this URL, it sleeps n milliseconds in the handler. This mocks business logic such as:

  • read data from sockets
  • write data to disk
  • access databases
  • access cache servers
  • invoke other microservices
  • ……

It includes a test.sh script that runs those tests automatically.

It uses wrk as the load generator.
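
To make the setup concrete, here is a minimal sketch of such a test server written against the standard library only. The real per-framework implementations live in server.go; the port and argument handling here are illustrative:

package main

import (
	"net/http"
	"os"
	"strconv"
	"time"
)

func main() {
	// Processing time in milliseconds, passed on the command line (e.g. 10).
	sleepTime := 0
	if len(os.Args) > 1 {
		sleepTime, _ = strconv.Atoi(os.Args[1])
	}
	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		if sleepTime > 0 {
			time.Sleep(time.Duration(sleepTime) * time.Millisecond)
		}
		w.Write([]byte("Hello world"))
	})
	http.ListenAndServe(":8080", nil)
}

A wrk run against it might look like wrk -t16 -c5000 -d30s http://localhost:8080/hello, where -t, -c, and -d set threads, connections, and duration (the values here are illustrative).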

Basic Test

The first test case is to mock 0 ms, 10 ms, 100 ms, 500 ms processing time in handlers.

Benchmark (Round 3): 5,000 concurrent clients.

Latency (Round 3): the real processing time measured at the web servers. Smaller is better.

Allocs (Round 3): heap allocations by the web servers while the test is running, in MB. Smaller is better.

If we enable HTTP pipelining, the results are as follows:

benchmark pipelining (Round 2)

Concurrency Test

With 30 ms of processing time, the results for 100, 1,000, and 5,000 clients are:

concurrency (Round 3)

Latency (Round 3)

If we enable HTTP pipelining, the results are as follows:

concurrency pipelining(Round 2)

CPU-Bound Case Test

cpu-bound (5000 concurrency)

Usage

You should install this package first if you want to run this test.

go get github.com/smallnest/go-web-framework-benchmark

It takes a while to install a large number of dependencies that need to be downloaded. Once that command completes, you can run:

cd $GOPATH/src/github.com/smallnest/go-web-framework-benchmark
go build -o gowebbenchmark *.go
./test.sh

It will generate test results in processtime.csv and concurrency.csv. You can modify test.sh to execute your customized test cases.

  • If you also want to generate latency data and allocation data, you can run the script:
./test-latency.sh
  • If you don't want to use keepalive, you can run:
./test-latency-nonkeepalive.sh
  • If you want to test http pipelining, you can run:
./test-pipelining.sh
  • If you want to test only some of the web frameworks, you can modify the test script and keep just the frameworks you select:
……
web_frameworks=( "default" "ace" "beego" "bone" "denco" "echov1" "echov2standard" "echov2fasthttp" "fasthttp-raw" "fasthttprouter" "fasthttp-routing" "gin" "gocraftWeb" "goji" "gojiv2" "gojsonrest" "gorestful" "gorilla" "httprouter" "httptreemux" "lars" "lion" "macaron" "martini" "pat" "r2router" "tango" "tiger" "traffic" "violetear" "vulcan")
……
  • If you want to test all cases, you can run:
./test-all.sh

Plot

You can run the shell script plot.sh in the testresults directory; it generates all images in its parent directory.

Add new web framework

New Go web frameworks are welcome. You can follow the steps below and send me a pull request.

  1. add your web framework link in README
  2. add a hello implementation in server.go
  3. add your web framework in libs.sh

Please add your web framework alphabetically.

Tested web frameworks (in alphabetical order)

Only stable web frameworks are tested.

Some libraries are no longer maintained, and the test code has removed them.

Author: Smallnest
Source Code: https://github.com/smallnest/go-web-framework-benchmark 
License: Apache-2.0

#go #golang #web #framework #benchmark 

Any Alpha

Top 3 Golang Web Frameworks In 2021

Golang is one of the most powerful and popular tools used to write APIs and web frameworks. Google's Go, also known as Golang, compiles quickly to fast native code, and it is used by specialists and software engineers across many sectors, largely because they find Go easy to use. It is a frequent choice for web and mobile app development and ranks highly among web programming languages.

Top 3 Golang web frameworks in 2021:

1. Martini: Martini is said to be a low-profile framework with a small community, but it is known for unique capabilities such as injecting various data sets and working with handlers of different types. It is very active, and there are more than twenty plug-ins, which could also explain the need for add-ons. It handles routing, middleware, and other common techniques.

2. Buffalo: Buffalo is known for fast application development. It covers the complete process of starting a project from scratch and provides end-to-end facilities for back-end web building. Buffalo comes with a dev command that rebuilds your whole binary and shows the changes in front of you as you work. It is best thought of as an ecosystem for rapid app development.

3. Gorilla: Gorilla is the largest and longest-running Go web framework. It can serve both small and large needs for any user. It also has the biggest English-speaking community, and it comes with robust WebSocket features, so you can attach REST codes to the endpoints instead of relying on a third-party service like Pusher.

So, these are some web frameworks that can be used with the Go language. Each framework has unique strengths, but all of them are excellent. If your developer is in search of one, this is where you can find the best.

#top 3 golang web frameworks in 2021 #golang #framework #web-service #web #web-development