Hugo JS

Introduction

The process of collecting information from a website (or websites) is often referred to as either web scraping or web crawling. Web scraping is the process of scanning a webpage/website and extracting information out of it, whereas web crawling is the process of iteratively finding and fetching web links starting from a URL or list of URLs.

While there are differences between the two, you might have heard the two words used interchangeably. Although this article will be a guide on how to scrape information, the lessons learned here can very easily be used for the purposes of ‘crawling’.
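To make the distinction concrete, here is a minimal sketch of the crawling side: a breadth-first loop that starts from a list of seed URLs and keeps following links it has not seen before. The names `crawl`, `getLinks` and `maxPages` are hypothetical and not part of this article's project; `getLinks` stands in for whatever scraping step returns the links found on a page.

```javascript
// Breadth-first crawl sketch. `getLinks(url)` is an assumed async callback
// that resolves to an array of URLs found on the given page.
async function crawl(seeds, getLinks, maxPages = 50) {
  const seen = new Set(seeds);   // every URL queued so far
  const queue = [...seeds];      // URLs waiting to be fetched
  const visited = [];            // URLs fetched, in visit order

  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    visited.push(url);

    // The scraping step: extract the links from this page.
    const links = await getLinks(url);
    for (const link of links) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}
```

The scraping work happens inside `getLinks`; the crawler only decides which page to visit next, which is why the two ideas combine so easily.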

Hopefully I don’t need to spend much time on why we would scrape data from an online resource: quite simply, if there is data you want to collect from a website, scraping is how you go about it. And if you would rather avoid the tedium of going through each page of a website manually, there are now tools that can automate the process.

I’ll also take a moment to add that web scraping is a legal grey area. You will be erring on the side of legality if you are collecting data for personal use and it is data that is otherwise freely available. Scraping data that is not otherwise freely available is where things get murky. Many websites also have policies on how their data can be used, so please bear those policies in mind. With all of that out of the way, let’s get into it.

For the purposes of demonstration, I will be scraping my own website and will be downloading a copy of the scraped data. In doing so, we will:

  1. Set up an environment that lets us watch the automation if we choose to (the alternative is to run it in what is known as a ‘headless’ browser; more on that later);
  2. Automate the visit to my website;
  3. Traverse the DOM;
  4. Collect pieces of data;
  5. Download pieces of data;
  6. Learn how to handle asynchronous requests;
  7. And my favourite bit: end up with a complete project that we can reuse whenever we want to scrape data.
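As a rough preview of how these steps fit together, here is a minimal sketch using Puppeteer. It is not the final project: the function name `scrapePage`, the choice of collecting every link on the page, and the example URL are all assumptions made for illustration. The Puppeteer module is passed in as an argument so that the snippet loads even before Puppeteer is installed.

```javascript
// Sketch only: `scrapePage` is a hypothetical helper, not this article's code.
// `puppeteer` is the module returned by require('puppeteer').
async function scrapePage(puppeteer, url, { headless = false } = {}) {
  // Step 1: headless: false opens a visible browser so we can watch.
  const browser = await puppeteer.launch({ headless });
  try {
    // Step 2: automate the visit.
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Steps 3 and 4: traverse the DOM inside the page and collect data.
    // Here we (arbitrarily) gather the text and target of every <a> element.
    const links = await page.$$eval('a', (anchors) =>
      anchors.map((a) => ({ text: a.textContent.trim(), href: a.href }))
    );
    return links;
  } finally {
    // Always close the browser, even if a step above throws.
    await browser.close();
  }
}

// Example usage (assumes `npm install puppeteer` has been run):
//   scrapePage(require('puppeteer'), 'https://example.com').then(console.log);
```

Every Puppeteer call returns a promise, which is why the function is `async` and each step is awaited; that is the asynchronous handling mentioned in step 6.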

To do all of this, we will use two things: Node.js and Puppeteer. Chances are you have already heard of Node.js, so we won’t go into what it is; just know that we will be using one Node.js module: fs (File System).
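For a sense of what the fs module will be doing for us, here is a minimal sketch that writes some scraped records to disk as JSON and reads them back. The helper names `saveRecords` and `loadRecords`, the sample record, and the output file name are placeholders chosen for illustration, not part of the finished project.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write scraped records to disk as pretty-printed JSON.
function saveRecords(filePath, records) {
  fs.writeFileSync(filePath, JSON.stringify(records, null, 2), 'utf8');
}

// Hypothetical helper: read the records back to check what was saved.
function loadRecords(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

// Placeholder data standing in for whatever the scraper collects.
const sample = [{ title: 'Example post', url: 'https://example.com/post' }];

// Write to the system temp directory so the sketch does not clutter a project.
const outFile = path.join(os.tmpdir(), 'scraped-data.json');
saveRecords(outFile, sample);
console.log(loadRecords(outFile).length); // prints 1
```

The synchronous calls keep the sketch short; in a longer-running scraper the promise-based `fs.promises.writeFile` would fit more naturally alongside Puppeteer's asynchronous calls.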

Let’s briefly explain what Puppeteer is.


How to Scrape data from a Website with JavaScript