Build a web scraper with Node

Web scraping is a data extraction technique in which you pull information from websites. In this tutorial, we will show you how to build a simple web scraper with Node.

Web scraping gathers information from a website through automated scripts, which makes it practical to collect large amounts of data from websites that offer no official API.

The process of web scraping can be broken down into two main steps:

  1. Fetching the HTML source code of the website through an HTTP request or by using a headless browser.
  2. Parsing the raw data to extract just the information you’re interested in.

We’ll examine both steps during the course of this tutorial. By the end, you should be able to apply the same approach to build scrapers for other websites.


Prerequisites

To complete this tutorial, you need to have Node.js (version 8.x or later) and npm installed on your computer. This page contains instructions on how to install or upgrade your Node installation to the latest version.


Getting started

Create a new scraper directory for this tutorial and initialize it with a package.json file by running npm init -y from the project root.

Next, install the dependencies that we’ll need to build the web scraper:

    npm install axios cheerio puppeteer --save

Here’s what each one does:

  • Axios: A promise-based HTTP client for Node.js and the browser.
  • Cheerio: An implementation of core jQuery for Node.js. Cheerio makes it easy to select, edit, and view DOM elements.
  • Puppeteer: A Node.js library for controlling Google Chrome or Chromium.

You may need to wait a bit for the installation to complete as the puppeteer package needs to download Chromium as well.


Scrape a static website with Axios and Cheerio

To demonstrate how you can scrape a website using Node.js, we’re going to set up a script to scrape the Premier League website for some player stats. Specifically, we’ll scrape the website for the top 20 goalscorers in Premier League history and organize the data as JSON.

Create a new pl-scraper.js file in the root of your project directory and populate it with the following code:

    // pl-scraper.js

    const axios = require('axios');

    // The query string carries the site's filter parameters
    const url = 'https://www.premierleague.com/stats/top/players/goals?se=-1&cl=-1&iso=-1&po=-1';

    // Request the page and print the raw HTML returned by the server
    axios(url)
      .then(response => {
        const html = response.data;
        console.log(html);
      })
      .catch(console.error);

If you run the code with node pl-scraper.js, a long string of HTML will be printed to the console. But how can you parse the HTML for the exact data you need? That’s where Cheerio comes in.
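Before handing a response to a parser, it can be worth confirming that the server actually returned HTML rather than, say, JSON. By default axios rejects the promise for status codes outside the 2xx range, so the catch handler above already covers those cases; the sketch below only checks the body type and the content-type header. The helper name isHtmlResponse and both sample objects are hypothetical, hand-written stand-ins that mimic the headers and data fields of an axios response.

```javascript
// Returns true when an axios-style response looks like an HTML document.
function isHtmlResponse(response) {
  const contentType = String((response.headers && response.headers['content-type']) || '');
  return typeof response.data === 'string' && contentType.toLowerCase().includes('text/html');
}

// Hand-written objects with the same shape as an axios response (illustration only).
const htmlResponse = {
  headers: { 'content-type': 'text/html; charset=utf-8' },
  data: '<html><body><p>Hello</p></body></html>',
};
const jsonResponse = {
  headers: { 'content-type': 'application/json' },
  data: { message: 'not a page' },
};

console.log(isHtmlResponse(htmlResponse)); // true
console.log(isHtmlResponse(jsonResponse)); // false
```

In a scraper, a check like this could sit at the top of the then callback so that unexpected responses are reported instead of being parsed.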

Cheerio allows us to use jQuery methods to parse an HTML string and extract whatever information we want from it. But before you write any code, let’s examine the exact data that we need through the browser dev tools.

Open this link in your browser, and open the dev tools on that page. Use the inspector tool to highlight the body of the table listing the top goalscorers in Premier League history.

As you can see, the table body has a class of statsTableContainer. We can select all the rows using Cheerio like this: $('.statsTableContainer > tr'). Go ahead and update the pl-scraper.js file to look like this:

    // pl-scraper.js

    const axios = require('axios');
    const cheerio = require('cheerio');

    const url = 'https://www.premierleague.com/stats/top/players/goals?se=-1&cl=-1&iso=-1&po=-1';

    axios(url)
      .then(response => {
        const html = response.data;
        // Load the HTML string so it can be queried with jQuery-style selectors
        const $ = cheerio.load(html);
        // Select every row that is a direct child of the table body
        const statsTable = $('.statsTableContainer > tr');
        console.log(statsTable.length);
      })
      .catch(console.error);

Unlike jQuery, which operates on the browser DOM, Cheerio has no document to work with until you give it one, so we load the HTML into Cheerio before querying it. After loading the HTML, we select all 20 rows in .statsTableContainer and store a reference to the selection in statsTable. You can run the code with node pl-scraper.js and confirm that the length of statsTable is exactly 20.

The next step is to extract the rank, player name, nationality and number of goals from each row. We can achieve that using the following script:

    // pl-scraper.js

    const axios = require('axios');
    const cheerio = require('cheerio');

    const url = 'https://www.premierleague.com/stats/top/players/goals?se=-1&cl=-1&iso=-1&po=-1';

    axios(url)
      .then(response => {
        const html = response.data;
        const $ = cheerio.load(html);
        const statsTable = $('.statsTableContainer > tr');
        const topPremierLeagueScorers = [];

        // Inside each(), `this` refers to the current table row
        statsTable.each(function () {
          const rank = $(this).find('.rank > strong').text();
          const playerName = $(this).find('.playerName > strong').text();
          const nationality = $(this).find('.playerCountry').text();
          const goals = $(this).find('.mainStat').text();

          // Collect the extracted fields as one object per player
          topPremierLeagueScorers.push({
            rank,
            name: playerName,
            nationality,
            goals,
          });
        });

        console.log(topPremierLeagueScorers);
      })
      .catch(console.error);

Here, we are looping over the selection of rows and using the find() method to extract the data that we need, organize it and store it in an array. Now, we have an array of JavaScript objects that can be consumed anywhere else.
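Note that text() returns strings, which may include surrounding whitespace, so the values usually need a little tidying before they are saved. The following is a minimal sketch of that step, assuming the object shape built above. The helper name normalizeScorer, the output file name top-scorers.json, and the sample rows are hypothetical; the rows are placeholders, not real statistics.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Trim the scraped strings and convert the numeric fields to numbers.
function normalizeScorer(raw) {
  return {
    rank: parseInt(raw.rank.replace(/\D/g, ''), 10),
    name: raw.name.trim(),
    nationality: raw.nationality.trim(),
    goals: parseInt(raw.goals.replace(/\D/g, ''), 10),
  };
}

// Placeholder rows standing in for scraped data (not real statistics).
const sampleRows = [
  { rank: '1.', name: ' Player A ', nationality: '\n Country A ', goals: ' 100 ' },
  { rank: '2.', name: 'Player B', nationality: 'Country B', goals: '90' },
];

const scorers = sampleRows.map(normalizeScorer);

// Serialize with two-space indentation so the file is easy to read.
const json = JSON.stringify(scorers, null, 2);

// The OS temp directory is used here only to keep the sketch side-effect free;
// a real script would more likely write next to pl-scraper.js.
const outputPath = path.join(os.tmpdir(), 'top-scorers.json');
fs.writeFileSync(outputPath, json);
console.log(`Saved ${scorers.length} rows to ${outputPath}`);
```

In pl-scraper.js, the same helper could be applied to topPremierLeagueScorers with map() just before logging or saving the result.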


Scrape a dynamic website using Puppeteer

Some websites rely exclusively on JavaScript to load their content. Requesting the HTML with an HTTP library like axios will not work for these sites, because axios returns the server’s response as-is and does not execute any JavaScript the way a browser would.

This is where Puppeteer comes in. It is a library that allows you to control a headless browser from a Node.js script. A perfect use case for this library is scraping pages that require JavaScript execution.
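One rough way to judge whether a page needs a headless browser is to look at how much visible text the raw HTML contains before any script has run. The sketch below is a heuristic under stated assumptions, not a reliable detector: the helper name visibleTextLength and both HTML snippets are made up for illustration, and the regex-based tag stripping is deliberately crude.

```javascript
// Count the characters of text left after removing <script>/<style> blocks and all tags.
function visibleTextLength(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim().length;
}

// Server-rendered markup: the data is already present in the HTML.
const staticHtml =
  '<html><body><table><tr><td>Player A</td><td>10</td></tr></table></body></html>';

// Client-rendered markup: an empty mount point that a script fills in later.
const dynamicHtml =
  '<html><body><div id="root"></div><script>render()</script></body></html>';

console.log(visibleTextLength(staticHtml));  // 11
console.log(visibleTextLength(dynamicHtml)); // 0
```

If the HTML fetched over plain HTTP contains almost no text while the page shows plenty of content in the browser, the content is probably rendered by JavaScript.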

Let’s examine how Puppeteer can help us scrape news headlines from r/news since the newer version of Reddit requires JavaScript to render content on the page.

It appears that each headline is wrapped in an anchor tag that links to the discussion on that headline. Although the class names have been obfuscated, we can select each headline by targeting each h2 inside any anchor tag that links to the discussion page.

Create a new reddit-scraper.js file and add the following code into it:

    // reddit-scraper.js

    const cheerio = require('cheerio');
    const puppeteer = require('puppeteer');

    const url = 'https://www.reddit.com/r/news/';

    puppeteer
      .launch()
      .then(browser => {
        return browser
          .newPage()
          .then(page => page.goto(url).then(() => page.content()))
          .then(
            // Close the browser on success or failure so the Node process can exit
            html => browser.close().then(() => html),
            err => browser.close().then(() => { throw err; })
          );
      })
      .then(html => {
        const $ = cheerio.load(html);
        const newsHeadlines = [];
        // Each headline is an h2 inside a link to its discussion page
        $('a[href*="/r/news/comments"] > h2').each(function() {
          newsHeadlines.push({
            title: $(this).text(),
          });
        });

        console.log(newsHeadlines);
      })
      .catch(console.error);

This code launches a Puppeteer instance, navigates to the provided URL, and returns the HTML content once the page has loaded and its JavaScript has had a chance to run. We then use Cheerio as before to parse and extract the desired data from the HTML string.
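The same flow can also be written with async/await. One caveat with reading the content right after navigation is that elements rendered later may not be there yet, so the sketch below waits for the selector explicitly with page.waitForSelector and extracts the text inside the page with page.$$eval instead of Cheerio. Treat it as one possible variant under stated assumptions: the function names scrapeHeadlines and main are hypothetical, the selector is the one used earlier in this tutorial (whether it still matches Reddit's current markup is not guaranteed), and puppeteer is required lazily inside main so that scrapeHeadlines can be tried with any object that behaves like a Puppeteer browser.

```javascript
// Extract headline text using an already-launched browser.
// `browser` is expected to behave like a Puppeteer Browser instance.
async function scrapeHeadlines(browser, url, selector) {
  const page = await browser.newPage();
  try {
    await page.goto(url);
    // Wait until at least one matching element has been rendered.
    await page.waitForSelector(selector);
    // The callback runs inside the page and returns plain data to Node.
    return await page.$$eval(selector, nodes =>
      nodes.map(node => ({ title: node.textContent.trim() }))
    );
  } finally {
    // Close the tab whether or not the extraction succeeded.
    await page.close();
  }
}

async function main(url) {
  // Required lazily so scrapeHeadlines can be exercised without launching Chromium.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const headlines = await scrapeHeadlines(browser, url, 'a[href*="/r/news/comments"] > h2');
    console.log(headlines);
  } finally {
    // Always close the browser so the Node process can exit.
    await browser.close();
  }
}

// Runs only when a URL is passed on the command line, for example:
//   node reddit-scraper.js https://www.reddit.com/r/news/
if (process.argv[2]) {
  main(process.argv[2]).catch(console.error);
}
```

Passing the browser in as an argument keeps the launch and cleanup logic in one place and lets several pages share a single Chromium instance.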


Wrap up

In this tutorial, we learned how to set up web scraping in Node.js. We looked at scraping methods for both static and dynamic websites, which should give you a solid starting point for scraping data from other websites.

You can find the complete source code used for this tutorial in this GitHub repository.


Originally published by Ayooluwa Isaiah at https://pusher.com
