Louis Jones

Web Scraping with JavaScript and Puppeteer

Learn the basics of web scraping in JavaScript and Node.js using Puppeteer in this tutorial. JavaScript and Node.js offer various libraries that make web scraping easier. For simple data extraction, you can use Axios to fetch API responses or a website's HTML.

Welcome to the world of web scraping! Have you ever needed data from a website but found it hard to access it in a structured format? This is where web scraping comes in.

Using scripts, we can extract the data we need from a website for various purposes, such as creating databases, doing some analytics, and even more.

Disclaimer: Be careful when doing web scraping. Always make sure you're scraping sites that allow it, and that you perform this activity within ethical and legal limits.

JavaScript and Node.js offer various libraries that make web scraping easier. For simple data extraction, you can use Axios to fetch API responses or a website's HTML.
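To make that concrete, here's a minimal sketch of the simpler approach (assuming you've installed Axios with npm install axios), fetching the raw HTML of the same sandbox site we'll scrape later in this article:

import axios from "axios";

// Fetch the raw HTML of a page with a single GET request
const { data: html } = await axios.get("http://quotes.toscrape.com/");

// `html` is a plain string; you'd still need to parse it (for example, with Cheerio) to extract data
console.log(html.slice(0, 200));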

But if you're looking to do more advanced tasks, including automation, you'll need libraries such as Puppeteer, Cheerio, or Nightmare (don't worry: the name may be Nightmare, but it's not that bad to use 😆).

I'll introduce the basics of web scraping in JavaScript and Node.js using Puppeteer in this article. I structured the writing to show you the basics of fetching information from a website and clicking a button (for example, moving to the next page).

At the end of this introduction, I'll recommend ways to practice and learn more by improving the project we just created.

Prerequisites

Before diving in and scraping our first page together using JavaScript, Node.js, and the HTML DOM, I'd recommend having a basic understanding of these technologies. It'll improve your learning and understanding of the topic.

Let's dive in! 🤿

How to Initialize Your First Puppeteer Scraper

New project...new folder! First, create the first-puppeteer-scraper-example folder on your computer. It'll contain the code of our future scraper.

mkdir first-puppeteer-scraper-example

Create a new project folder using mkdir

Now, it's time to initialize your Node.js repository with a package.json file. It holds information about the repository and its npm packages, such as the Puppeteer library.

npm init -y

Initialize the package.json file using the npm init command

After typing this command, you should find this package.json file in your repository tree.

{
  "name": "first-puppeteer-scraper-example",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "description": ""
}

package.json initialized with the npm init -y command

Before proceeding, we must ensure the project is configured to handle ES modules (so we can use the import syntax). To do so, you can add the "type": "module" entry at the end of the configuration.

{
  "name": "first-puppeteer-scraper-example",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "description": "",
  "type": "module"
}

package.json file after enabling ES modules

The last step of our scraper initialization is to install the Puppeteer library. Here's how:

npm install puppeteer

Install Puppeteer with the npm install command

Wow! We're there – we're ready to scrape our first website together. 🤩

How to Scrape Your First Piece of Data

In this article, we'll use the ToScrape website as our learning platform. This online sandbox provides two projects specifically designed for web scraping, making it a great starting point to learn the basics such as data extraction and page navigation.

For this beginner's introduction, we'll specifically focus on the Quotes to Scrape website.

How to Initialize the Script

In the project repository root, you can create an index.js file. This will be our application entry point.

To keep it simple, our script consists of one function in charge of getting the website's quotes (getQuotes).

In the function's body, we will need to follow different steps:

  • Start a Puppeteer session with puppeteer.launch (it'll instantiate a browser variable that we'll use for manipulating the browser)
  • Open a new page/tab with browser.newPage (it'll instantiate a page variable that we'll use for manipulating the page)
  • Change the URL of our new page to http://quotes.toscrape.com/ with page.goto

Here's the commented version of the initial script:

import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });
};

// Start the scraping
getQuotes();

What do you think of running our scraper and seeing the output? Let's do it with the command below:

node index.js

Start our Node.js application with the node index.js command

After doing this, you should have a brand new browser application started with a new page and the website Quotes to Scrape loaded onto it. Magic, isn't it? 🪄

Quotes to Scrape homepage loaded by our initial script

Note: For this first iteration, we're not closing the browser. This means you will need to close the browser to stop the running application.

How to Fetch the First Quote

Whenever you want to scrape a website, you'll have to play with the HTML DOM. What I recommend is to inspect the page and start navigating the different elements to find what you need.

In our case, we'll follow the baby-step principle and start by fetching the first quote's text and author.

After browsing the page HTML, we can notice that each quote is encapsulated in a <div> element with the class name quote (class="quote"). This is important information because the scraping works with CSS selectors (for example, .quote).

Browser inspector with the first quote <div> selected

An example of how each quote is rendered in the HTML

Now that we have this knowledge, we can return to our getQuotes function and improve our code to select the first quote and extract its data.

We will need to add the following after the page.goto instruction:

  • Extract data from our page HTML with page.evaluate (it'll execute the function passed as a parameter in the page context and return the result)
  • Get the quote HTML node with document.querySelector (it'll fetch the first <div> with the classname quote and return it)
  • Get the quote text and author from the previously extracted quote HTML node with quote.querySelector (it'll extract the elements with the classname text and author under <div class="quote"> and return them)

Here's the updated version with detailed comments:

import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });

  // Get page data
  const quotes = await page.evaluate(() => {
    // Fetch the first element with class "quote"
    const quote = document.querySelector(".quote");

    // Fetch the sub-elements from the previously fetched quote element
    // Get the displayed text and return it (`.innerText`)
    const text = quote.querySelector(".text").innerText;
    const author = quote.querySelector(".author").innerText;

    return { text, author };
  });

  // Display the quotes
  console.log(quotes);

  // Close the browser
  await browser.close();
};

// Start the scraping
getQuotes();

Something interesting to point out is that the function we use for selecting an element is the same one available in the browser inspector. Here's an example:

After running the document.querySelector instruction in the browser inspector, we have the first quote as an output (like on Puppeteer)
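If you want to reproduce that check yourself, open the DevTools console on the Quotes to Scrape page and run the same selector (a quick, optional verification, not part of the scraper itself):

// Run in the browser DevTools console on http://quotes.toscrape.com/
document.querySelector(".quote");
// => the first <div class="quote"> element, the same node our page.evaluate callback works with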

Let's run our script one more time and see what we have as an output:

{
  text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  author: 'Albert Einstein'
}

Output of our script after running node index.js

We did it! Our first scraped element is here, right in the terminal. Now, let's expand it and fetch all the current page quotes. 🔥

How to Fetch All Current Page Quotes

Now that we know how to fetch one quote, let's tweak our code a bit to get all the quotes and extract their data one by one.

Previously we used document.querySelector to select the first matching element (the first quote). To be able to fetch all quotes, we will need the document.querySelectorAll function instead.

We'll need to follow these steps to make it work:

  • Replace document.querySelector with document.querySelectorAll (it'll fetch all <div> elements with the classname quote and return them)
  • Convert the fetched elements to an array with Array.from(quoteList) (a NodeList has no map method, so we turn it into a proper array first)
  • Move our previous code to get the quote text and author inside the loop and return the result (it'll extract the elements with the classname text and author under <div class="quote"> for each quote)

Here's the code update:

import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });

  // Get page data
  const quotes = await page.evaluate(() => {
    // Fetch all elements with the class "quote"
    const quoteList = document.querySelectorAll(".quote");

    // Convert the quoteList to an iterable array
    // For each quote fetch the text and author
    return Array.from(quoteList).map((quote) => {
      // Fetch the sub-elements from the previously fetched quote element
      // Get the displayed text and return it (`.innerText`)
      const text = quote.querySelector(".text").innerText;
      const author = quote.querySelector(".author").innerText;

      return { text, author };
    });
  });

  // Display the quotes
  console.log(quotes);

  // Close the browser
  await browser.close();
};

// Start the scraping
getQuotes();

As a result, if we run our script one more time, we should get a list of quotes as output. Each element of this list should have a text and an author property.

[
  {
    text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
    author: 'Albert Einstein'
  },
  {
    text: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    author: 'J.K. Rowling'
  },
  {
    text: '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
    author: 'Albert Einstein'
  },
  {
    text: '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
    author: 'Jane Austen'
  },
  {
    text: "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
    author: 'Marilyn Monroe'
  },
  {
    text: '“Try not to become a man of success. Rather become a man of value.”',
    author: 'Albert Einstein'
  },
  {
    text: '“It is better to be hated for what you are than to be loved for what you are not.”',
    author: 'André Gide'
  },
  {
    text: "“I have not failed. I've just found 10,000 ways that won't work.”",
    author: 'Thomas A. Edison'
  },
  {
    text: "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
    author: 'Eleanor Roosevelt'
  },
  {
    text: '“A day without sunshine is like, you know, night.”',
    author: 'Steve Martin'
  }
]

Output of our script after running node index.js

Good job! All the quotes from the first page are now scraped by our script. 👏

How to Move to the Next Page

Our script is now able to fetch all the quotes for one page. What would be interesting is clicking the "Next" button at the bottom of the page and doing the same on the second page.

"Next" button at the Quotes to Scrape page bottom

Back in our browser inspector, let's find out how we can target this element using CSS selectors.

As we can notice, the "Next" button is placed under an unordered list <ul> with a pager classname (<ul class="pager">). This list has an element <li> with a next classname (<li class="next">). Finally, there is a link anchor <a> that points to the second page (<a href="/page/2/">).

In CSS, if we want to target this specific link there are different ways to do that. We can do:

  • .next > a: risky, because if there is another element with .next as a parent containing a link, we could end up clicking the wrong one.
  • .pager > .next > a: safer, because we make sure the link is inside the .next element under the .pager parent. There is little risk of this hierarchy appearing more than once.
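You can sanity-check both candidates in the browser DevTools console before wiring one into the scraper (a quick, optional check):

// Run in the browser DevTools console on http://quotes.toscrape.com/
document.querySelector(".next > a");          // works here, but only because this page has a single .next element
document.querySelector(".pager > .next > a"); // the same link, explicitly scoped to the pager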

An example of how the "Next" button is rendered in the HTML

To click this button, at the end of our script after the console.log(quotes);, you can add the following: await page.click(".pager > .next > a");.

Since we're now closing the browser with await browser.close(); after all instructions are done, you need to comment out this instruction to see the second page opened in the scraper browser.

It's temporary and for testing purposes, but the end of our getQuotes function should look like this:

  // Display the quotes
  console.log(quotes);

  // Click on the "Next page" button
  await page.click(".pager > .next > a");

  // Close the browser
  // await browser.close();

After this, if you run the scraper again, your browser should stop on the second page once all instructions have been processed:

Quotes to Scrape second page loaded after clicking the "Next" button

It’s Your Time! Here’s What You Can Do Next:

Congrats on reaching the end of this introduction to scraping with Puppeteer! 👏

Now it's your turn to improve the scraper and make it get more data from the Quotes to Scrape website. Here's a list of potential improvements you can make:

  • Navigate between all pages using the "Next" button and fetch the quotes on all the pages (a starting sketch follows below).
  • Fetch the quote's tags (each quote has a list of tags).
  • Scrape the author's about page (by clicking on the author's name on each quote).
  • Categorize the quotes by tags or authors (it's not 100% related to the scraping itself, but that can be a good improvement).

Feel free to be creative and do any other things you see fit 🚀
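To get you started on the first improvement, here's a minimal sketch (my own rough take, not part of the original project) that keeps clicking "Next" until the button disappears, reusing the selectors we already found:

// Inside getQuotes, replacing the single-page extraction
const allQuotes = [];

while (true) {
  // Scrape the quotes on the current page (same logic as before)
  const quotes = await page.evaluate(() => {
    const quoteList = document.querySelectorAll(".quote");
    return Array.from(quoteList).map((quote) => ({
      text: quote.querySelector(".text").innerText,
      author: quote.querySelector(".author").innerText,
    }));
  });
  allQuotes.push(...quotes);

  // Stop when there is no "Next" button left on the page
  const nextButton = await page.$(".pager > .next > a");
  if (!nextButton) break;

  // Click "Next" and wait for the following page to finish loading
  await Promise.all([
    page.waitForNavigation({ waitUntil: "domcontentloaded" }),
    page.click(".pager > .next > a"),
  ]);
}

console.log(allQuotes);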

Scraper Code Is Available on GitHub

Check out the latest version of our scraper on GitHub! You're free to save, fork, or utilize it as you see fit.

=> First Puppeteer Scraper (example)

Successful Scraping Start: Thanks for reading the article!

I hope this article gave you a valuable introduction to web scraping using JavaScript and Puppeteer. Writing this was a pleasure, and I hope you found it informative and enjoyable.

Join me on Twitter for more content like this. I regularly share content to help you grow your web development skills and would love to have you join the conversation. Let's learn, grow, and inspire each other along the way!

Original article source at https://www.freecodecamp.org

#javascript #puppeteer #webscraping #node 

Web Scraping with JavaScript and Puppeteer

Cascadia.jl: A CSS Selector library in Julia

Cascadia

A CSS Selector library in Julia.

Inspired by, and mostly a direct translation of, the Cascadia CSS Selector library, written in Go, by @andybalholm.

This package depends on the Gumbo.jl package by @porterjamesj, which is a Julia wrapper around Google's Gumbo HTML parser library.

Usage

Usage is simple. Use Gumbo to parse an HTML string into a document, create a Selector from a string, and then use eachmatch to get the nodes in the document that match the selector. Alternatively, use sel"<selector string>" to do the same thing as Selector. The eachmatch function returns an array of elements which match the selector. If no match is found, a zero element array is returned. For unique matches, the array contains one element. Thus, check the length of the array to test whether a selector matches.

using Cascadia
using Gumbo

n = parsehtml("<p id=\"foo\"><p id=\"bar\">")
s = Selector("#foo")
sm = sel"#foo"
eachmatch(s, n.root)
# 1-element Array{Gumbo.HTMLNode,1}:
#  Gumbo.HTMLElement{:p}

eachmatch(sm, n.root)
# 1-element Array{Gumbo.HTMLNode,1}:
#  Gumbo.HTMLElement{:p}

Note: The top level matching function name has changed from matchall in v0.6 to eachmatch in v0.7 and higher to reflect the change in Julia base.

Webscraping Example

The primary use case for this library is to enable web scraping -- the automatic extraction of information from HTML pages. As an example, consider the following code, which returns a list of questions that have been tagged with julia-lang on StackOverflow.

using Cascadia, Gumbo, HTTP

r = HTTP.get("http://stackoverflow.com/questions/tagged/julia-lang")
h = parsehtml(String(r.body))

qs = eachmatch(Selector(".question-summary"),h.root)

println("StackOverflow Julia Questions (votes  answered?  url)")

for q in qs
    votes = nodeText(eachmatch(Selector(".votes .vote-count-post "), q)[1])
    answered = length(eachmatch(Selector(".status.answered"), q)) > 0 || length(eachmatch(Selector(".status.answered-accepted"), q)) > 0
    href = eachmatch(Selector(".question-hyperlink"), q)[1].attributes["href"]
    println("$votes  $answered  http://stackoverflow.com$href")
end

This code produces the following output:

StackOverflow Julia Questions (votes  answered?  url)

0  false  http://stackoverflow.com/questions/59361325/how-to-get-a-rolling-window-regression-in-julia
0  true  http://stackoverflow.com/questions/59356818/how-i-translate-python-code-into-julia-code
-2  false  http://stackoverflow.com/questions/59354720/how-to-fix-this-error-in-julia-throws-same-error-for-all-packages-not-found-i
-1  true  http://stackoverflow.com/questions/59354407/julia-package-for-geocoding
1  false  http://stackoverflow.com/questions/59350631/jupyter-lab-precompile-error-for-kernel-1-0-after-adding-kernel-1-3
0  true  http://stackoverflow.com/questions/59348461/genie-framework-does-not-install-under-julia-1-2
...
2  true  http://stackoverflow.com/questions/59300202/julia-package-install-fail-with-please-specify-by-known-name-uuid
2  false  http://stackoverflow.com/questions/59297379/how-do-i-transfer-my-packages-after-installing-a-new-julia-version

Note that this returns the elements on the first page of the query results. Getting the values from subsequent pages is left as an exercise for the reader.

Current Status

This library should work with almost all CSS selectors. Please raise an issue if you find any that don't work. However, note that CSS pseudo elements are not yet supported.

Specifically, the following selector types are tested and known to work.

Selector
address
*
#foo
li#t1
*#t4
.t1
p.t1
div.teST
.t1.fail
p.t1.t2
p[title]
address[title="foo"]
[title ~= foo]
[title~="hello world"]
[lang|="en"]
[title^="foo"]
[title$="bar"]
[title*="bar"]
.t1:not(.t2)
div:not(.t1)
li:nth-child(odd)
li:nth-child(even)
li:nth-child(-n+2)
li:nth-child(3n+1)
li:nth-last-child(odd)
li:nth-last-child(even)
li:nth-last-child(-n+2)
li:nth-last-child(3n+1)
span:first-child
span:last-child
p:nth-of-type(2)
p:nth-last-of-type(2)
p:last-of-type
p:first-of-type
p:only-child
p:only-of-type
:empty
div p
div table p
div > p
p ~ p
p + p
li, p
p +/*This is a comment*/ p
p:contains("that wraps")
p:containsOwn("that wraps")
:containsOwn("inner")
p:containsOwn("block")
div:has(#p1)
div:has(:containsOwn("2"))
body :has(:containsOwn("2"))
body :haschild(:containsOwn("2"))
p:matches([\d])
p:matches([a-z])
p:matches([a-zA-Z])
p:matches([^\d])
p:matches(^(0|a))
p:matches(^\d+$)
p:not(:matches(^\d+$))
div :matchesOwn(^\d+$)
[href#=(fina)]:not([href#=(\/\/[^\/]+untrusted)])
[href#=(^https:\/\/[^\/]*\/?news)]
:input

Download Details:

Author: Algocircle
Source Code: https://github.com/Algocircle/Cascadia.jl 
License: View license

#julia #css #webscraping 

Cascadia.jl: A CSS Selector library in Julia
Noah Saunders

Web Scraping Financial News using Python

Learn how to extract financial news seamlessly using Python. Learn techniques to gather unstructured finance data using the Python library BeautifulSoup and transform it into structured data.

We often have plenty of unstructured data available for free on the internet. Some of this data may be useful when combined with other structured or unstructured data available in the organization. What if I could fetch the desired unstructured data from the web, transform it into a structured format, combine it with my other data, and preprocess the combined data, so that I can extract valuable insights to facilitate quick and better data-driven decision making?

The good news is that there are techniques such as web scraping which can help us solve the problem of data gathering at scale and build curated datasets. This course will help you achieve that goal.

In this course we give you hands-on experience of how to build and automate the process of generating a curated dataset from raw HTML text scraped from the web.

What you’ll learn

  •        Automate the process of gathering unstructured data which is in the form of raw HTML.
  •        Learn to web scrape Financial News of specific listed companies on the Stock Market.
  •        Use the BeautifulSoup4 Python library for web scraping - Install, Exception Handling, Advanced HTML Parsing.
  •        How to traverse a single domain to fetch data from many HTML pages.
  •        Process gathered (scraped) data, transform it into structured JSON format, and save it as CSV.

Are there any course requirements or prerequisites?

  •        Basic Python 3 programming
  •        Basic HTML knowledge

Who this course is for:

  •        Beginner Python developers who would like to learn web scraping techniques
  •        Anybody who wants to learn how to transform unstructured data into structured format
  •        Anybody who wants to learn how to scrape news (e.g. financial news) from web portals
  •        Anybody who wants to gather and transform unstructured data from the web for their Machine Learning (NLP, Text Analytics) projects

#webscraping #python 

Web Scraping Financial News using Python

Web Scraping in Python with Beautiful Soup

Learn how to automate data extraction and web scraping using Python and Beautiful Soup. Build a web scraper with Python and Beautiful Soup.

In today’s competitive world everybody is looking for ways to innovate and make use of new technologies. Web scraping (also called web data extraction or data scraping) provides a solution for those who want to get access to structured web data in an automated fashion. Web scraping is useful if the public website you want to get data from doesn’t have an API, or it does but provides only limited access to the data.

Web scraping is the process of collecting structured web data in an automated fashion. It’s also called web data extraction. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.  In this course we are going to extract data using Python and a Python module called Beautiful Soup.

What you’ll learn

  •        Set up a data extraction environment
  •        Extract/scrape data from the web
  •        Build a web scraping tool
  •        Prototype a web scraping tool
  •        Inspect HTML elements
  •        Extract data using Beautiful Soup

Are there any course requirements or prerequisites?

  •        Requirements are covered in the course.

Who this course is for:

  •        Beginners to web scraping and data extraction

Subscribe: https://www.youtube.com/@learntocode922/featured 

#webscraping #python #beautifulsoup

Web Scraping in Python with Beautiful Soup
Sheldon Grant

Beginners Guide to Web Scraping with Python

Web Scraping with Python

Imagine you have to pull a large amount of data from websites and you want to do it as quickly as possible. How would you do it without manually going to each website and getting the data? Well, “Web Scraping” is the answer. Web Scraping just makes this job easier and faster. 

In this article on Web Scraping with Python, you will learn about web scraping in brief and see how to extract data from a website with a demonstration. I will be covering the following topics:

  • Why is Web Scraping Used?
  • What Is Web Scraping?
  • Is Web Scraping Legal?
  • Why is Python Good For Web Scraping?
  • How Do You Scrape Data From A Website?
  • Libraries used for Web Scraping
  • Web Scraping Example : Scraping Flipkart Website

Why is Web Scraping Used?

Web scraping is used to collect large amounts of information from websites. But why does someone have to collect such large amounts of data from websites? To find out, let's look at the applications of web scraping:

  • Price Comparison: Services such as ParseHub use web scraping to collect data from online shopping websites and use it to compare the prices of products.
  • Email address gathering: Many companies that use email as a medium for marketing use web scraping to collect email IDs and then send bulk emails.
  • Social Media Scraping: Web scraping is used to collect data from Social Media websites such as Twitter to find out what’s trending.
  • Research and Development: Web scraping is used to collect a large set of data (Statistics, General Information, Temperature, etc.) from websites, which are analyzed and used to carry out Surveys or for R&D.
  • Job listings: Details regarding job openings and interviews are collected from different websites and then listed in one place so that they are easily accessible to the user.

What is Web Scraping?

Web scraping is an automated method used to extract large amounts of data from websites. The data on websites is unstructured. Web scraping helps collect this unstructured data and store it in a structured form. There are different ways to scrape websites, such as online services, APIs, or writing your own code. In this article, we'll see how to implement web scraping with Python.


Is Web Scraping Legal?

Talking about whether web scraping is legal or not, some websites allow web scraping and some don't. To know whether a website allows web scraping or not, you can look at the website's “robots.txt” file. You can find this file by appending “/robots.txt” to the URL that you want to scrape. For this example, I am scraping the Flipkart website. So, to see the “robots.txt” file, the URL is www.flipkart.com/robots.txt.

Why is Python Good for Web Scraping?

Here is the list of features of Python which makes it more suitable for web scraping.

  • Ease of Use: Python Programming is simple to code. You do not have to add semi-colons “;” or curly-braces “{}” anywhere. This makes it less messy and easy to use.
  • Large Collection of Libraries: Python has a huge collection of libraries such as Numpy, Matplotlib, Pandas etc., which provide methods and services for various purposes. Hence, it is suitable for web scraping and for further manipulation of extracted data.
  • Dynamically typed: In Python, you don’t have to define datatypes for variables, you can directly use the variables wherever required. This saves time and makes your job faster.
  • Easily Understandable Syntax: Python syntax is easily understandable mainly because reading a Python code is very similar to reading a statement in English. It is expressive and easily readable, and the indentation used in Python also helps the user to differentiate between different scope/blocks in the code. 
  • Small code, large task: Web scraping is used to save time. But what’s the use if you spend more time writing the code? Well, you don’t have to. In Python, you can write small codes to do large tasks. Hence, you save time even while writing the code.
  • Community: What if you get stuck while writing the code? You don’t have to worry. Python has one of the biggest and most active communities, where you can seek help.

How Do You Scrape Data From A Website?

When you run the code for web scraping, a request is sent to the URL that you have mentioned. As a response to the request, the server sends the data and allows you to read the HTML or XML page. The code then parses the HTML or XML page, finds the data, and extracts it.

To extract data using web scraping with python, you need to follow these basic steps:

  1. Find the URL that you want to scrape
  2. Inspect the page
  3. Find the data you want to extract
  4. Write the code
  5. Run the code and extract the data
  6. Store the data in the required format 

Now let us see how to extract data from the Flipkart website using Python.


Libraries used for Web Scraping 

As we know, Python has various applications, and there are different libraries for different purposes. In our further demonstration, we will be using the following libraries:

  • Selenium:  Selenium is a web testing library. It is used to automate browser activities.
  • BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that are helpful for extracting data easily.
  • Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format. 

Web Scraping Example : Scraping Flipkart Website

Pre-requisites:

  • Python 2.x or Python 3.x with Selenium, BeautifulSoup, pandas libraries installed
  • Google-chrome browser
  • Ubuntu Operating System

Let’s get started!

Step 1: Find the URL that you want to scrape

For this example, we are going to scrape the Flipkart website to extract the Price, Name, and Rating of Laptops. The URL for this page is https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.

Step 2: Inspecting the Page

The data is usually nested in tags. So, we inspect the page to see under which tag the data we want to scrape is nested. To inspect the page, just right click on the element and click on “Inspect”.

Inspect Button - Web Scraping with Python - Edureka

When you click on the “Inspect” tab, you will see a “Browser Inspector Box” open.

Inspecting page - Web Scraping with Python - Edureka

Step 3: Find the data you want to extract

Let’s extract the Price, Name, and Rating, each of which is nested in its own “div” tag.

Step 4: Write the code

First, let’s create a Python file. To do this, open the terminal in Ubuntu and type gedit <your file name> with .py extension.

I am going to name my file “web-s”. Here’s the command:

gedit web-s.py

Now, let’s write our code in this file. 

First, let us import all the necessary libraries:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

To configure the webdriver to use the Chrome browser, we have to set the path to chromedriver:

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")

Refer to the code below to open the URL:

products=[] #List to store name of the product
prices=[] #List to store price of the product
ratings=[] #List to store rating of the product
driver.get("<a href="https://www.flipkart.com/laptops/">https://www.flipkart.com/laptops/</a>~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&amp;amp;amp;amp;amp;amp;amp;amp;amp;uniq")

Now that we have written the code to open the URL, it’s time to extract the data from the website. As mentioned earlier, the data we want to extract is nested in <div> tags. So, I will find the div tags with those respective class names, extract the data, and store the data in a variable. Refer to the code below:

content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)

Step 5: Run the code and extract the data

To run the code, use the below command:

python web-s.py

Step 6: Store the data in a required format

After extracting the data, you might want to store it in a format. This format varies depending on your requirement. For this example, we will store the extracted data in a CSV (Comma Separated Value) format. To do this, I will add the following lines to my code:

df = pd.DataFrame({'Product Name':products,'Price':prices,'Rating':ratings}) 
df.to_csv('products.csv', index=False, encoding='utf-8')

Now, I’ll run the whole code again.

A file named “products.csv” is created, and this file contains the extracted data.


I hope you guys enjoyed this article on “Web Scraping with Python”. I hope this blog was informative and has added value to your knowledge. Now go ahead and try web scraping. Experiment with different modules and applications of Python.

If you wish to know about Web Scraping With Python on Windows platform, then the below video will help you understand how to do it or you can also join our Python Master course.

Web Scraping With Python | Python Tutorial | Web Scraping Tutorial | Edureka

This Edureka live session on “WebScraping using Python” will help you understand the fundamentals of scraping along with a demo to scrape some details from Flipkart.

Got a question regarding “web scraping with Python”? You can ask it on the edureka! Forum and we will get back to you at the earliest, or you can join our Python Training in Hobart today.

To get in-depth knowledge on Python Programming language along with its various applications, you can enroll here for live online Python training with 24/7 support and lifetime access.

Original article source at: https://www.edureka.co/

#webscraping #python 

Beginners Guide to Web Scraping with Python
Reid Rohan

6 Essential Web Scraping Frameworks with JavaScript

In today's post we will learn about 6 Essential Web Scraping Frameworks with JavaScript.

What is Web Scraping?

Web scraping is the process of collecting structured web data in an automated fashion. It’s also called web data extraction. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.

Table of contents:

  • Webparsy - NodeJS lib and cli for scraping websites using Puppeteer and YAML.
  • Node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
  • Node-simplecrawler - Flexible event driven crawler for node.
  • Crawlee - Node.js and TypeScript library that crawls with Cheerio, JSDOM, Playwright and Puppeteer while enhancing them with anti-blocking features, queue, storages and more.
  • Ayakashi.io - The next generation web scraping framework. 
  • Pjscrape - A web-scraping framework written in Javascript, using PhantomJS and jQuery.

1 - Webparsy: NodeJS lib and cli for scraping websites using Puppeteer and YAML.

Overview

You can use WebParsy either as cli from your terminal or as a NodeJS library.

Cli

Install webparsy:

$ npm i webparsy -g
$ webparsy example/_weather.yml --customFlag "custom flag value"
Result:

{
  "title": "Madrid, España Pronóstico del tiempo y condiciones meteorológicas - The Weather Channel | Weather.com",
  "city": "Madrid, España",
  "temp": 18
}

Library

const webparsy = require('webparsy')
const parsingResult = await webparsy.init({
  file: 'jobdefinition.yml',
  flags: { ... } // optional
})

Methods

init(options)

options:

One of yaml, file or string is required.

  • yaml: A yaml npm module instance of the scraping definition.
  • string: The YAML definition, as a plain string.
  • file: The path for the YAML file containing the scraping definition.

Additionally, you can pass a flags object property to input additional values to your scraping process.

View on Github

2 - Node-crawler: Web Crawler/Spider for NodeJS + server-side jQuery.

Install

$ npm install crawler

Basic usage

const Crawler = require('crawler');

const c = new Crawler({
    maxConnections: 10,
    // This will be called for each crawled page
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default
            //a lean implementation of core jQuery designed specifically for the server
            console.log($('title').text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);

Slow down

Use rateLimit to slow down when you are visiting web sites.

const Crawler = require('crawler');

const c = new Crawler({
    rateLimit: 1000, // `maxConnections` will be forced to 1
    callback: (err, res, done) => {
        console.log(res.$('title').text());
        done();
    }
});

c.queue(tasks);//between two tasks, minimum time gap is 1000 (ms)

View on Github

3 - Node-simplecrawler: Flexible event driven crawler for node.

Installation

npm install --save simplecrawler

Getting Started

Initializing simplecrawler is a simple process. First, you require the module and instantiate it with a single argument. You then configure the properties you like (e.g. the request interval), register a few event listeners, and call the start method. Let's walk through the process!

After requiring the crawler, we create a new instance of it. We supply the constructor with a URL that indicates which domain to crawl and which resource to fetch first.

var Crawler = require("simplecrawler");

var crawler = new Crawler("http://www.example.com/");

You can initialize the crawler with or without the new operator. Being able to skip it comes in handy when you want to chain API calls.

var crawler = Crawler("http://www.example.com/")
    .on("fetchcomplete", function () {
        console.log("Fetched a resource!")
    });

By default, the crawler will only fetch resources on the same domain as that in the URL passed to the constructor. But this can be changed through the crawler.domainWhitelist property.
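For example, whitelisting a couple of extra domains could look like this (a small sketch; the domain names are placeholders):

// Also allow the crawler to fetch resources hosted on these additional domains
crawler.domainWhitelist = ["example.com", "cdn.example.com"];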

Now, let's configure some more things before we start crawling. Of course, you probably want to ensure you don't take down your web server. Decrease the concurrency from the default of five simultaneous requests and increase the request interval from the default 250 ms, like this:

crawler.interval = 10000; // Ten seconds
crawler.maxConcurrency = 3;

You can also define a max depth for links to fetch:

crawler.maxDepth = 1; // Only first page is fetched (with linked CSS & images)
// Or:
crawler.maxDepth = 2; // First page and discovered links from it are fetched
// Or:
crawler.maxDepth = 3; // Etc.

View on Github

4 - Crawlee: Node.js and TypeScript library that crawls with Cheerio, JSDOM, Playwright and Puppeteer while enhancing them with anti-blocking features, queue, storages and more.

Installation

We recommend visiting the Introduction tutorial in Crawlee documentation for more information.

Crawlee requires Node.js 16 or higher.

With Crawlee CLI

The fastest way to try Crawlee out is to use the Crawlee CLI and choose the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.

npx crawlee create my-crawler
cd my-crawler
npm start

Manual installation

If you prefer adding Crawlee into your own project, try the example below. Because it uses PlaywrightCrawler we also need to install Playwright. It's not bundled with Crawlee to reduce install size.

npm install crawlee playwright

import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

By default, Crawlee stores data to ./storage in the current working directory. You can override this directory via Crawlee configuration. For details, see Configuration guide, Request storage and Result storage.

View on Github

5 - Ayakashi.io: The next generation web scraping framework. 

Ayakashi helps you build scraping and automation systems that are

  • easy to build
  • simple or sophisticated
  • highly performant
  • maintainable and built for change

Powerful querying and data models

Ayakashi's way of finding things in the page and using them is done with props and domQL.
Directly inspired by the relational database world (and SQL), domQL makes DOM access easy and readable no matter how obscure the page's structure is.
Props are the way to package domQL expressions as re-usable structures which can then be passed around to actions or to be used as models for data extraction.

High level builtin actions

Ready made actions so you can focus on what matters.
Easily handle infinite scrolling, single page navigation, events and more.
Plus, you can always build your own actions, either from scratch or by composing other actions.

Preload code on pages

Need to include a bunch of code, a library you made or a 3rd party module and make it available on a page?
Preloaders have you covered.

Control how you save your data

Automatically save your extracted data to all major SQL engines, JSON and CSV.
Need something more exotic or the ability to control exactly how the data is persisted?
Package and plug your custom logic as a script.

View on Github

6 - Pjscrape: A web-scraping framework written in Javascript, using PhantomJS and jQuery.

Overview

pjscrape is a framework for anyone who's ever wanted a command-line tool for web scraping using Javascript and jQuery. Built for PhantomJS, it allows you to scrape pages in a fully rendered, Javascript-enabled context from the command line, no browser required.

Features

  • Client-side, Javascript-based scraping environment with full access to jQuery functions
  • Easy, flexible syntax for setting up one or more scrapers
  • Recursive/crawl scraping
  • Delay scrape until a "ready" condition occurs
  • Load your own scripts on the page before scraping
  • Modular architecture for logging and writing/formatting scraped items
  • Client-side utilities for common tasks
  • Growing set of unit tests

View on Github

Thank you for following this article. 

Related videos:

Introduction To Web Scraping With Javascript

#javascript #webscraping #frameworks 

6 Essential Web Scraping Frameworks with JavaScript
Reid Rohan

7 Best Network Libraries for JavaScript Web Scraping

In today's post we will learn about 7 of the best network libraries for JavaScript web scraping.

What is a Network?

A network consists of two or more computers that are linked in order to share resources (such as printers and CDs), exchange files, or allow electronic communications. The computers on a network may be linked through cables, telephone lines, radio waves, satellites, or infrared light beams.

Two very common types of networks include:

  • Local Area Network (LAN)
  • Wide Area Network (WAN)

You may also see references to a Metropolitan Area Network (MAN), a Wireless LAN (WLAN), or a Wireless WAN (WWAN).

Table of contents:

  • Socks5-http-client - SOCKS v5 HTTP client implementation in JavaScript for Node.js.
  • Rest - RESTful HTTP client for JavaScript.
  • Wreck - HTTP Client Utilities.
  • Got - Simplified HTTP requests.
  • Node-fetch - A light-weight module that brings window.fetch to Node.js.
  • Bent - Functional HTTP client for Node.js w/ async/await.
  • Axios - Promise based HTTP client for the browser and node.js.

1 - Socks5-http-client: SOCKS v5 HTTP client implementation in JavaScript for Node.js.

var shttp = require('socks5-http-client');

shttp.get('http://www.google.com/', function(res) {
	res.setEncoding('utf8');
	res.on('readable', function() {
		console.log(res.read()); // Log response to console.
	});
});

URLs are parsed using url.parse. You may also pass an options hash as the first argument to get or request.

Options

Specify the socksHost and socksPort options if your SOCKS server isn't running on localhost:1080. Tor runs its SOCKS server on port 9050 by default, for example.

Specify a username and password using socksUsername and socksPassword.
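Putting those options together, a request through a non-default SOCKS server might look roughly like this (a sketch; the host, port, and credentials are placeholders):

var shttp = require('socks5-http-client');

// Pass an options hash instead of a URL string
shttp.get({
	host: 'en.wikipedia.org',
	path: '/wiki/SOCKS',
	socksHost: '127.0.0.1',  // your SOCKS server (defaults to localhost)
	socksPort: 9050,         // e.g. Tor's default SOCKS port
	socksUsername: 'user',   // only needed if the server requires auth
	socksPassword: 'secret'
}, function(res) {
	res.setEncoding('utf8');
	res.on('readable', function() {
		console.log(res.read());
	});
});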

Using with Tor

Works great for making HTTP requests through Tor.

Make sure a Tor server is running locally and run node example/tor http://en.wikipedia.org/wiki/SOCKS to test.

View on Github

2 - Rest: RESTful HTTP client for JavaScript.

Just enough client, as you need it. Make HTTP requests from a browser or Node.js applying only the client features you need. Configure a client once, and share it safely throughout your application. Easily extend with interceptors that wrap the request and/or response, or MIME type converters for rich data formats.

Usage

Using rest.js is easy. The core clients provide limited functionality around the request and response lifecycle. The request and response objects are normalized to support portability between different JavaScript environments.

The return value from a client is a promise that is resolved with the response when the remote request finishes.

The core client behavior can be augmented with interceptors. An interceptor wraps the client and transforms the request and response. For example: an interceptor may authenticate a request, or reject the promise if an error is encountered. Interceptors may be combined to create a client with the desired behavior. A configured interceptor acts just like a client. The core clients are basic, they only know the low level mechanics of making a request and parsing the response. All other behavior is applied and configured with interceptors.

Interceptors are applied to a client by wrapping. To wrap a client with an interceptor, call the wrap method on the client providing the interceptor and optionally a configuration object. A new client is returned containing the interceptor's behavior applied to the parent client. It's important to note that the behavior of the original client is not modified, in order to use the new behavior, you must use the returned client.

Making a basic request:

var rest = require('rest');

rest('/').then(function(response) {
    console.log('response: ', response);
});

In this example, you can see that the request object is very simple: it's just a string representing the path. The request may also be a proper object containing other HTTP properties.

The response should look familiar as well; it contains all the fields you would expect, including the response headers (which many clients ignore).

Working with JSON:

If you paid attention when executing the previous example, you may have noticed that the response.entity is a string. Often we work with more complex data types. For this, rest.js supports a rich set of MIME type conversions with the MIME Interceptor. The correct converter will automatically be chosen based on the Content-Type response header. Custom converters can be registered for a MIME type; more on that later...

var rest, mime, client;

rest = require('rest'),
mime = require('rest/interceptor/mime');

client = rest.wrap(mime);
client({ path: '/data.json' }).then(function(response) {
    console.log('response: ', response);
});

Before an interceptor can be used, it needs to be configured. In this case, we will accept the default configuration, and obtain a client. Now when we see the response, the entity will be a JS object instead of a String.

View on Github

3 - Wreck: HTTP Client Utilities.

wreck is part of the hapi ecosystem and was designed to work seamlessly with the hapi web framework and its other components (but works great on its own or with other frameworks). If you are using a different web framework and find this module useful, check out hapi – they work even better together.

Installation:

npm: npm install @hapi/wreck

yarn: yarn add @hapi/wreck

Usage

const Wreck = require('@hapi/wreck');

const example = async function () {

    const { res, payload } = await Wreck.get('http://example.com');
    console.log(payload.toString());
};

try {
    example();
}
catch (ex) {
    console.error(ex);
}

View on Github

4 - Got: Simplified HTTP requests.

Install

npm install got

Warning: This package is native ESM and no longer provides a CommonJS export. If your project uses CommonJS, you'll have to convert to ESM or use the dynamic import() function. Please don't open issues for questions regarding CommonJS / ESM. You can also use Got v11 instead which is pretty stable. We will backport security fixes to v11 for the foreseeable future.
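If you're stuck on CommonJS for now, the dynamic import() escape hatch mentioned above could look roughly like this (a sketch, assuming Got v12 or later):

// my-module.cjs: load the ESM-only got package from CommonJS
async function fetchBody(url) {
	const { default: got } = await import('got');
	return got(url).text();
}

fetchBody('https://httpbin.org/anything').then(console.log);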

Take a peek

A quick start guide is available.

JSON mode

Got has a dedicated option for handling JSON payload.
Furthermore, the promise exposes a .json<T>() function that returns Promise<T>.

import got from 'got';

const {data} = await got.post('https://httpbin.org/anything', {
	json: {
		hello: 'world'
	}
}).json();

console.log(data);
//=> {"hello": "world"}

For advanced JSON usage, check out the parseJson and stringifyJson options.

For more useful tips like this, visit the Tips page.

View on Github

5 - Node-fetch: A light-weight module that brings window.fetch to Node.js.

Installation

Current stable release (3.x) requires at least Node.js 12.20.0.

npm install node-fetch

Loading and configuring the module

ES Modules (ESM)

import fetch from 'node-fetch';

CommonJS

node-fetch from v3 is an ESM-only module - you are not able to import it with require().

If you cannot switch to ESM, please use v2 which remains compatible with CommonJS. Critical bug fixes will continue to be published for v2.

npm install node-fetch@2

Alternatively, you can use the async import() function from CommonJS to load node-fetch asynchronously:

// mod.cjs
const fetch = (...args) => import('node-fetch').then(({default: fetch}) => fetch(...args));

Providing global access

To use fetch() without importing it, you can patch the global object in node:

// fetch-polyfill.js
import fetch, {
  Blob,
  blobFrom,
  blobFromSync,
  File,
  fileFrom,
  fileFromSync,
  FormData,
  Headers,
  Request,
  Response,
} from 'node-fetch'

if (!globalThis.fetch) {
  globalThis.fetch = fetch
  globalThis.Headers = Headers
  globalThis.Request = Request
  globalThis.Response = Response
}

// index.js
import './fetch-polyfill'

// ...

View on Github

6 - Bent: Functional HTTP client for Node.js w/ async/await.

Usage

const bent = require('bent')

const getJSON = bent('json')
const getBuffer = bent('buffer')

let obj = await getJSON('http://site.com/json.api')
let buffer = await getBuffer('http://site.com/image.png')

As you can see, bent is a function that returns an async function.

Bent takes options which constrain what is accepted by the client. Any response that falls outside the constraints will generate an error.

You can provide these options in any order, and Bent will figure out which option is which by inspecting the option's type and content.

const post = bent('http://localhost:3000/', 'POST', 'json', 200);
const response = await post('cars/new', {name: 'bmw', wheels: 4});

If you don't set a response encoding ('json', 'string' or 'buffer') then the native response object will be returned after the statusCode check.

In Node.js, we also add decoding methods that match the Fetch API (.json(), .text() and .arrayBuffer()).

const bent = require('bent')

const getStream = bent('http://site.com')

let stream = await getStream('/json.api')
// status code
stream.status // 200
stream.statusCode // 200
// optionally decode
const obj = await stream.json()
// or
const str = await stream.text()

View on Github

7 - Axios: Promise based HTTP client for the browser and node.js.

Installing

Using npm:

$ npm install axios

Using bower:

$ bower install axios

Using yarn:

$ yarn add axios

Using pnpm:

$ pnpm add axios

Using jsDelivr CDN:

<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>

Using unpkg CDN:

<script src="https://unpkg.com/axios/dist/axios.min.js"></script>

Example

note: CommonJS usage

In order to gain the TypeScript typings (for intellisense / autocomplete) while using CommonJS imports with require() use the following approach:

const axios = require('axios').default;

// axios.<method> will now provide autocomplete and parameter typings

Performing a GET request

const axios = require('axios').default;

// Make a request for a user with a given ID
axios.get('/user?ID=12345')
  .then(function (response) {
    // handle success
    console.log(response);
  })
  .catch(function (error) {
    // handle error
    console.log(error);
  })
  .then(function () {
    // always executed
  });

// Optionally the request above could also be done as
axios.get('/user', {
    params: {
      ID: 12345
    }
  })
  .then(function (response) {
    console.log(response);
  })
  .catch(function (error) {
    console.log(error);
  })
  .then(function () {
    // always executed
  });  

// Want to use async/await? Add the `async` keyword to your outer function/method.
async function getUser() {
  try {
    const response = await axios.get('/user?ID=12345');
    console.log(response);
  } catch (error) {
    console.error(error);
  }
}

View on Github

Thank you for following this article. 

Related videos:

Web Scraping with Puppeteer & Node.js: Chrome Automation

#javascript #network #webscraping 

7 Best Network with JavaScript Web Scraping
Michael Kitas

1661786991

Python Selenium Tutorial #11 - Heroku Deployment CLI & GitHub

🧾 This Selenium tutorial is designed for beginners and shows how to use the Python Selenium library to perform web scraping and web testing and to create website bots. Selenium provides a high-level Python API for controlling browsers such as Chrome/Chromium and Firefox through their driver executables (ChromeDriver and geckodriver). Selenium runs with a visible browser window by default but can be configured to run headless. Buildpacks used for the Heroku deployment: heroku/python, heroku/google-chrome, heroku/chromedriver. Environment variable: CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver

⏱️ Timestamps ⏱️

Intro (0:00)

Setup Selenium Script (0:33)

Heroku CLI Deployment (4:46)

GitHub Deployment (11:46)

#selenium #python #webscraping

Python Selenium Tutorial #11 - Heroku Deployment CLI & GitHub
Thierry Perret

1661455320

How to Implement Web Scraping with Go

Web scraping is an essential tool that every developer uses at some point in their career. Hence, it is essential for developers to understand what a web scraper is, as well as how to build one.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

In other words, web scraping is a process for extracting data from websites and is used in many cases, ranging from data analysis to lead generation. The task can be completed manually or can be automated through a script or software.

There is a variety of use cases for web scraping. Take a look at a few:

Collecting data: The most useful application of web scraping is data collection. Data is compelling, and analyzing it in the right way can put one company ahead of another. Web scraping is an essential tool for data collection: writing a simple script can make collecting data much more accessible and faster than doing the work manually. Moreover, the data can also be entered into a spreadsheet to be better visualized and analyzed.

Performing market research and generating leads: Performing market research and generating leads are crucial web scraping tasks. Emails, phone numbers, and other important information from various websites can be scraped and later used for these important tasks.

Building price comparison tools: You may have noticed browser extensions that alert you to a price change for products on e-commerce platforms. Such tools are also built using web scrapers.

In this article, you'll learn how to build a simple web scraper using Go.

Robert Griesemer, Rob Pike, and Ken Thompson created the Go programming language at Google, and it has been on the market since 2009. Go, also known as Golang, has many brilliant features. Getting started with Go is quick and straightforward. As a result, this relatively recent language is gaining a lot of traction in the developer world.

Implementing Web Scraping with Go

Support for concurrency has made Go a fast, powerful language, and because the language is easy to get started with, you can build your web scraper with only a few lines of code. Two libraries are very popular for building web scrapers with Go:

  1. goquery
  2. Colly

In this article, you'll use Colly to implement the scraper. At first, you'll learn the basics of building a scraper, and you'll implement a URL scraper from a Wikipedia page. Once you know the basic building blocks of web scraping with Colly, you'll level up the skills and implement a more advanced scraper.

Prerequisites

Before going any further in this article, make sure that the following tools and libraries are installed on your computer. You'll need the following:

  • A basic understanding of Go
  • Go (preferably the latest version—1.17.2, as of writing this article)
  • An IDE or text editor of your choice (Visual Studio Code preferred)
  • The Go extension for the IDE (if available)

Understanding Colly and the Collector Component

The Colly package is used to build web crawlers and scrapers. It is based on Go's net/http and the goquery package. The goquery package gives a jQuery-like syntax in Go to target HTML elements. This package alone is also used to build scrapers.

The main component of Colly is the Collector. According to the docs, the Collector component manages the network communications, and it is also responsible for the callbacks attached to it while a Collector job is running. This component is configurable: you can modify the UserAgent string, add Authentication headers, or restrict or allow URLs with the help of this component.

Understanding Colly Callbacks

Callbacks can also be added to the Collector component. The Colly library has callbacks, such as OnHTML and OnRequest. You can refer to the documentation to learn about all the callbacks. These callbacks run at different points in the life cycle of the Collector. For example, the OnRequest callback is run just before the Collector makes an HTTP request.

The OnHTML method is the most commonly used callback when building web scrapers. It allows you to register a callback for the Collector for when it reaches a specific HTML tag on the web page.

Initializing the Project Directory and Installing Colly

Before you start writing code, you have to initialize the project directory. Open the IDE of your choice and open a folder where you will save all your project files. Now, open a terminal window and locate your directory. Then, type the following command in the terminal:

go mod init github.com/Username/Project-Name

In the command above, change github.com to the domain where you store your files, such as Bitbucket or Gitlab. Also, change Username to your username and Project-Name to whatever project name you'd like to give it.

Once you've typed the command and hit enter, you'll find that a new file is created with the name go.mod. This file holds the information about the direct and indirect dependencies that the project needs. The next step is to install the Colly dependency. To install the dependency, type the following command in the terminal:

go get -u github.com/go-colly/colly/...

This will download the Colly library and generate a new file called go.sum. You can now find the dependency in the go.mod file. The go.sum file lists the checksum of the direct and indirect dependencies, along with the version. You can read more about the go.sum and go.mod files in the Go modules documentation.

Building a Basic Scraper

Now that you have the project directory set up with the necessary dependency, you can continue with writing some code. The basic scraper aims to scrape all the links from a specific Wikipedia page and print them on the terminal. This scraper is built to get you comfortable with the building blocks of the Colly library.

Create a new file in the folder with a .go extension—for example, main.go. All the logic will go into this file. Start by writing package main. This line tells the compiler that the package should compile as an executable program instead of a shared library.

package main

The next step is to start writing the main function. If you're using Visual Studio Code, it will automatically import the necessary packages. Otherwise, in the case of other IDEs, you may have to do it manually. Colly's Collector is initialized with the following lines of code:

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("en.wikipedia.org"),
    )
}

Here, the NewCollector is initialized, and as an option, en.wikipedia.org is passed as an allowed domain. The same Collector can also be initialized without passing any options to it. Now, if you save the file, Colly will be automatically imported into your main.go file; if not, add the following lines after the package main line:

import (
    "fmt"

    "github.com/gocolly/colly"
)

The lines above import two packages into the main.go file. The first package is the fmt package, and the second is the Colly library.

Now, open this URL in your browser. This is the Wikipedia page on web scraping. The web scraper is going to scrape all the links from this page. Understanding the browser developer tools well is an invaluable skill in web scraping. Open the browser inspect tools by right-clicking on the page and selecting Inspect. This will open the page inspector. You'll be able to see the entire HTML, CSS, network calls, and other important information from here. For this example in particular, find the mw-parser-output div:

Wikipedia in the developer tools

This div element contains the body of the page. Targeting the links inside this div will provide all the links used inside the article.

Next, you'll use the OnHTML method. Here's the remaining code for the scraper:

// Find and print all links
    c.OnHTML(".mw-parser-output", func(e *colly.HTMLElement) {
        links := e.ChildAttrs("a", "href")
        fmt.Println(links)
    })
    c.Visit("https://en.wikipedia.org/wiki/Web_scraping")

The OnHTML method takes two parameters. The first parameter is the HTML element; reaching it will execute the callback function, which is passed as the second parameter. Inside the callback function, the links variable is assigned to a method that returns all the child attributes matching the element's attributes. The e.ChildAttrs("a", "href") function returns a slice of strings of all the links inside the mw-parser-output div. The fmt.Println(links) function prints the links in the terminal.

Finally, visit the URL using the c.Visit("https://en.wikipedia.org/wiki/Web_scraping") command. The complete scraper code will look like this:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("en.wikipedia.org"),
    )

    // Find and print all links
    c.OnHTML(".mw-parser-output", func(e *colly.HTMLElement) {
        links := e.ChildAttrs("a", "href")
        fmt.Println(links)
    })
    c.Visit("https://en.wikipedia.org/wiki/Web_scraping")
}

Running this code with the command go run main.go will get all the links on the page.

Scraping Table Data

W3Schools HTML table

To scrape the table data, you can either remove the code you wrote inside c.OnHTML or create a new project by following the same steps mentioned above. To create and write a CSV file, you'll use the encoding/csv library available in Go. Here's the starter code:

package main

import (
    "encoding/csv"
    "log"
    "os"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("Could not create file, err: %q", err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()
}

Inside the main function, the first action is to define the file name. Here, it is defined as data.csv. Then, using the os.Create(fName) method, the file is created with the name data.csv. If any error occurs during the file creation, it will also log the error and exit the program. The defer file.Close() command will close the file when the surrounding function returns.

The writer := csv.NewWriter(file) command initializes the CSV writer to write to the file, and writer.Flush() will flush everything from the buffer to the writer.

Once the file creation process is complete, the scraping process can be started. This is similar to the example above.

Next, add the lines of code below after the end of the defer writer.Flush() line:

c := colly.NewCollector()
    c.OnHTML("table#customers", func(e *colly.HTMLElement) {
        e.ForEach("tr", func(_ int, el *colly.HTMLElement) {
            writer.Write([]string{
                el.ChildText("td:nth-child(1)"),
                el.ChildText("td:nth-child(2)"),
                el.ChildText("td:nth-child(3)"),
            })
        })
        fmt.Println("Scraping Complete")
    })
    c.Visit("https://www.w3schools.com/html/html_tables.asp")

In this code, Colly is being initialized. Colly uses the ForEach method to iterate through the content. Because the table has three columns, or td elements, three columns are selected using the nth-child pseudo-selector. el.ChildText returns the text inside the element. Putting it inside the writer.Write method writes the elements to the CSV file. Finally, the print statement prints a message when the scraping is complete. Because this code doesn't target the table headers, it won't print the header. The complete code for this scraper will look like this:

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"

    "github.com/gocolly/colly"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("Could not create file, err: %q", err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    c := colly.NewCollector()
    c.OnHTML("table#customers", func(e *colly.HTMLElement) {
        e.ForEach("tr", func(_ int, el *colly.HTMLElement) {
            writer.Write([]string{
                el.ChildText("td:nth-child(1)"),
                el.ChildText("td:nth-child(2)"),
                el.ChildText("td:nth-child(3)"),
            })
        })
        fmt.Println("Scraping Complete")
    })
    c.Visit("https://www.w3schools.com/html/html_tables.asp")
}

Once successful, the output will appear like this:

CSV output file in Excel

Conclusion

In this article, you learned what web scrapers are, as well as some use cases and how they can be implemented with Go, with the help of the Colly library.

However, the methods described in this tutorial are not the only possible ways to implement a scraper. Consider experimenting with it yourself and finding new ways to do it. Colly can also work together with the goquery library to build an even more powerful scraper.

Depending on your use case, you can modify Colly to meet your needs. Web scraping is very handy for keyword research, brand protection, promotion, website testing, and many other things. So, knowing how to build your own web scraper can help you become a better developer.

Link: https://www.scrapingbee.com/blog/web-scraping-go/

#webscraping #go #golang 

How to Implement Web Scraping with Go
Nat Grady

1660224720

Rvest: Simple Web Scraping for R

Overview

rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup and RoboBrowser.

If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting the robots.txt and not hammering the site with too many requests.

Installation

# The easiest way to get rvest is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just rvest:
install.packages("rvest")

Usage

library(rvest)

# Start by reading a HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a css selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars %>% html_elements("section")
films
#> {xml_nodeset (7)}
#> [1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
#> [2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
#> [3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
#> [4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
#> [5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
#> [6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
#> [7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films %>% 
  html_element("h2") %>% 
  html_text2()
title
#> [1] "The Phantom Menace"      "Attack of the Clones"   
#> [3] "Revenge of the Sith"     "A New Hope"             
#> [5] "The Empire Strikes Back" "Return of the Jedi"     
#> [7] "The Force Awakens"

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films %>% 
  html_element("h2") %>% 
  html_attr("data-id") %>% 
  readr::parse_integer()
episode
#> [1] 1 2 3 4 5 6 7

If the page contains tabular data you can convert it directly to a data frame with html_table():

html <- read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")

html %>% 
  html_element(".tracklist") %>% 
  html_table()
#> # A tibble: 29 × 4
#>    No.   Title                       `Performer(s)`                       Length
#>    <chr> <chr>                       <chr>                                <chr> 
#>  1 1.    "\"Everything Is Awesome\"" "Tegan and Sara featuring The Lonel… 2:43  
#>  2 2.    "\"Prologue\""              ""                                   2:28  
#>  3 3.    "\"Emmett's Morning\""      ""                                   2:00  
#>  4 4.    "\"Emmett Falls in Love\""  ""                                   1:11  
#>  5 5.    "\"Escape\""                ""                                   3:26  
#>  6 6.    "\"Into the Old West\""     ""                                   1:00  
#>  7 7.    "\"Wyldstyle Explains\""    ""                                   1:21  
#>  8 8.    "\"Emmett's Mind\""         ""                                   2:17  
#>  9 9.    "\"The Transformation\""    ""                                   1:46  
#> 10 10.   "\"Saloons and Wagons\""    ""                                   3:38  
#> # … with 19 more rows

Code of Conduct

Please note that the rvest project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Download Details:

Author: Tidyverse
Source Code: https://github.com/tidyverse/rvest 
License: Unknown, MIT licenses found

#r #html #webscraping 

Rvest: Simple Web Scraping for R
Thierry Perret

Thierry Perret

1658854629

How to Implement a Web Scraper in Rust

Suppose you want to get some information from a website, such as stock prices, the latest listings, or the most recent posts. The easiest way to do that is to connect to an API. If the website has a free API, you can simply request the information you need.

Otherwise, there is always the second option: web scraping.

Instead of connecting to an "official" resource, you can use a bot to crawl the website's content and parse it for the elements you need.

In this article, you'll learn how to implement web scraping with the Rust programming language. You'll use two Rust libraries, reqwest and scraper, to fetch the list of IMDb's top one hundred movies.

Implementing a Web Scraper in Rust

You're going to set up a fully functional web scraper in Rust. Your scraping target will be IMDb, a database of movies, TV series, and other media.

In the end, you'll have a Rust program that can scrape the top one hundred movies by user rating at any given moment.

This tutorial assumes you already have Rust and Cargo (Rust's package manager) installed. If you don't, follow the official documentation to install them.

Creating the Project and Adding Dependencies

To start, you need to create a basic Rust project and add all the dependencies you'll be using. This is best done with Cargo.

To generate a new project for a Rust binary, run:

cargo new web_scraper

Next, add the required libraries to the dependencies. For this project, you'll use reqwest and scraper.

Open the web_scraper folder in your favorite code editor and open the Cargo.toml file. At the end of the file, add the libraries:

[dependencies]

reqwest = {version = "0.11", features = ["blocking"]}
scraper = "0.12.0"

You can now move on to src/main.rs and start building your web scraper.

Getting the Website's HTML

Scraping a page usually means getting the page's HTML code and then parsing it to find the information you need. So you'll need to make the IMDb page's code available to your Rust program. To do that, you first need to understand how browsers work, since the browser is your usual way of interacting with web pages.

To display a web page, the browser (the client) sends an HTTP request to the server, which responds with the page's source code. The browser then renders this code.

HTTP has several request types, such as GET (to fetch the contents of a resource) and POST (to send information to the server). To get the code of an IMDb web page into your Rust program, you'll need to mimic the browser's behavior by sending an HTTP GET request to IMDb.

In Rust, you can use reqwest for this. This commonly used Rust library provides the features of an HTTP client. It can do many of the things a regular browser can do, such as opening pages, logging in, and storing cookies.

To request the code of a page, you can use the reqwest::blocking::get method:

fn main() {

    let response = reqwest::blocking::get(
        "https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc&count=100",
    )
    .unwrap()
    .text()
    .unwrap();

}

response will now contain the full HTML code of the requested page.

Extracting Information from the HTML

The trickiest part of a web scraping project is usually extracting the specific information you need from the HTML document. A commonly used tool for this in Rust is the scraper library. It works by parsing the HTML document into a tree structure, which you can then query with CSS selectors for the elements you're interested in.

The first step is to parse your entire HTML document using the library:

    let document = scraper::Html::parse_document(&response);

Next, find and select the parts you need. To do that, you'll have to inspect the website's code and find a set of CSS selectors that uniquely identify those elements.

The easiest way to do this is with your regular browser: find the element you need, then inspect its code:

How to inspect an element

In the case of IMDb, the element you need is the movie's name. When you inspect it, you'll see that it's wrapped in an <a> tag:

<a href="/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>

Unfortunately, this tag isn't unique. Since there are many <a> tags on the page, it wouldn't be a good idea to scrape them all, because most of them aren't the elements you need. Instead, find the tag that's unique to movie titles, then navigate to the <a> tag inside it.

In this case, you can use the lister-item-header class:

<h3 class="lister-item-header">
    <span class="lister-item-index unbold text-primary">1.</span>
    <a href="/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>
    <span class="lister-item-year text-muted unbold">(1994)</span>
</h3>

Now you need to build a query using the scraper::Selector::parse method.

You'll give it the h3.lister-item-header>a selector. In other words, it finds the <a> tags whose parent is an <h3> tag with the lister-item-header class.

Use the following query:

    let title_selector = scraper::Selector::parse("h3.lister-item-header>a").unwrap();

You can now apply this query to your parsed document with the select method. To get the actual movie titles instead of HTML elements, you'll map each HTML element to the HTML it contains:

    let titles = document.select(&title_selector).map(|x| x.inner_html());

titles is now an iterator containing the names of all one hundred top titles.

All that's left to do is print these names. To do that, start by zipping your list of titles with the numbers from 1 to 100. Then call the for_each method on the resulting iterator, which prints each element of the iterator on its own line:

    titles
        .zip(1..101)
        .for_each(|(item, number)| println!("{}. {}", number, item));

Your web scraper is now complete.

Here's the complete code of the scraper:

fn main() {
    let response = reqwest::blocking::get(
        "https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc&count=100",
    )
    .unwrap()
    .text()
    .unwrap();

    let document = scraper::Html::parse_document(&response);

    let title_selector = scraper::Selector::parse("h3.lister-item-header>a").unwrap();

    let titles = document.select(&title_selector).map(|x| x.inner_html());

    titles
        .zip(1..101)
        .for_each(|(item, number)| println!("{}. {}", number, item));
}

If you save the file and run it with cargo run, you should get the current list of the top one hundred movies:

1. The Shawshank Redemption
2. The Godfather
3. The Dark Knight
4. The Lord of the Rings: The Return of the King
5. Schindler's List
6. The Godfather: Part II
7. 12 Angry Men
8. Pulp Fiction
9. Inception
10. The Lord of the Rings: The Two Towers
...

Conclusion

In this tutorial, you learned how to use Rust to build a simple web scraper. Rust isn't a popular scripting language, but as you've seen, it gets the job done quite easily.

This is only a starting point for web scraping in Rust. There are many ways to upgrade this scraper, depending on your needs.

Here are a few options you can try as an exercise:

  • Parse the data into a custom structure: you can create a typed Rust struct that holds the movie data. This makes it easier to print the data and to work with it later in your program (a minimal sketch of the first two ideas follows this list).
  • Save the data to a file: instead of printing the movie data, you could write it to a file.
  • Create a Client that logs into an IMDb account: you might want IMDb to display movies according to your preferences before scraping them. For example, IMDb shows movie titles in the language of the country you live in. If that's an issue, you'll need to configure your IMDb preferences and then build a web scraper that can log in and scrape with those preferences.
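
As a rough illustration of the first two ideas, here is a minimal sketch that collects the scraped titles into a custom struct and writes them to a file. It reuses the selector and dependencies from the tutorial; the Movie struct and the movies.txt filename are illustrative choices, not part of the original article:

use std::fs::File;
use std::io::Write;

// Illustrative struct holding the data for one scraped movie.
struct Movie {
    rank: u32,
    title: String,
}

fn main() {
    let response = reqwest::blocking::get(
        "https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc&count=100",
    )
    .unwrap()
    .text()
    .unwrap();

    let document = scraper::Html::parse_document(&response);
    let title_selector = scraper::Selector::parse("h3.lister-item-header>a").unwrap();

    // Collect the titles into typed values instead of printing them directly.
    let movies: Vec<Movie> = document
        .select(&title_selector)
        .map(|x| x.inner_html())
        .zip(1..101)
        .map(|(title, rank)| Movie { rank, title })
        .collect();

    // Write one movie per line to a plain text file instead of stdout.
    let mut file = File::create("movies.txt").unwrap();
    for movie in &movies {
        writeln!(file, "{}. {}", movie.rank, movie.title).unwrap();
    }
}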

Sometimes, however, working with CSS selectors isn't enough. You might need a more advanced solution that simulates the actions taken by a real browser. In that case, you can use thirtyfour, Rust's UI testing library, for more powerful web scraping.
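
As a very rough sketch of what that could look like (not part of the original article), the snippet below drives a real browser through WebDriver. It assumes a recent thirtyfour release and the tokio runtime added to Cargo.toml, plus a ChromeDriver instance already running on localhost:9515; exact method names may differ slightly between thirtyfour versions.

use thirtyfour::prelude::*;

// Illustrative sketch: requires thirtyfour and tokio as dependencies,
// and chromedriver already listening on localhost:9515.
#[tokio::main]
async fn main() -> WebDriverResult<()> {
    let caps = DesiredCapabilities::chrome();
    let driver = WebDriver::new("http://localhost:9515", caps).await?;

    // Load the page in a real browser session, so JavaScript runs too.
    driver
        .goto("https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc&count=100")
        .await?;

    // Query the same CSS selector as before, but through the browser.
    let titles = driver.find_all(By::Css("h3.lister-item-header>a")).await?;
    for (title, number) in titles.iter().zip(1..101) {
        println!("{}. {}", number, title.text().await?);
    }

    driver.quit().await?;
    Ok(())
}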

Link: https://www.scrapingbee.com/blog/web-scraping-rust/

#rust #webscraping 

How to Implement a Web Scraper in Rust