Getting Started with Web Scraping with Node.js

Web scraping is a technique used for retrieving data from websites.

So what’s web scraping anyway? It involves automating away the laborious task of collecting information from websites.

There are a lot of use cases for web scraping: you might want to collect prices from various e-commerce sites for a price comparison site. Or perhaps you need flight times and hotel/Airbnb listings for a travel site. Maybe you want to collect emails from various directories for sales leads, or use data from the internet to train machine learning/AI models. Or you might even want to build a search engine like Google!

Getting started with web scraping is easy, and the process can be broken down into two main parts:

  • acquiring the data using an HTTP request library or a headless browser,
  • and parsing the data to get the exact information you want.

This guide will walk you through the process with the popular Node.js request-promise module, CheerioJS, and Puppeteer. Working through the examples in this guide, you will learn all the tips and tricks you need to become a pro at gathering any data you need with Node.js!

We will be gathering a list of all the names and birthdays of U.S. presidents from Wikipedia and the titles of all the posts on the front page of Reddit.

First things first: Let’s install the libraries we’ll be using in this guide (Puppeteer will take a while to install as it needs to download Chromium as well).

npm install --save request request-promise cheerio puppeteer

Making your first request

Next, let’s open a new text file (name the file potusScraper.js), and write a quick function to get the HTML of the Wikipedia “List of Presidents” page.

const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(function(html){
    //success!
    console.log(html);
  })
  .catch(function(err){
    //handle error
    console.error(err);
  });

Output:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of Presidents of the United States - Wikipedia</title>
...
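
The request and request-promise packages were deprecated in 2020, although they still install and run. As a hedged alternative, here is a minimal sketch of the same request using the fetch API that ships with Node.js 18 and later. The helper name getHtml and the injectable fetchImpl parameter are assumptions made for illustration; they are not part of the original example.

```javascript
// Hypothetical helper: download a page's HTML with the fetch API built into Node.js 18+.
// fetchImpl can be swapped for a stub, so the function can be tried without network access.
async function getHtml(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    // fetch resolves on HTTP errors such as 404, so check the status explicitly
    throw new Error('Request failed with status ' + response.status);
  }
  return response.text();
}

// Usage (requires network access):
// getHtml('https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States')
//   .then(function(html) { console.log(html); })
//   .catch(function(err) { console.error(err); });
```

Unlike request-promise, fetch does not reject on HTTP error statuses, which is why the sketch checks response.ok itself.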

Using Chrome DevTools

Cool, we got the raw HTML from the web page! But now we need to make sense of this giant blob of text. To do that, we’ll need to use Chrome DevTools to allow us to easily search through the HTML of a web page.

Using Chrome DevTools is easy: simply open Google Chrome and right-click on the element you would like to scrape (in this case I am right-clicking on George Washington, because we want to get links to all of the individual presidents’ Wikipedia pages).

Now, simply click “Inspect”, and Chrome will bring up its DevTools pane, allowing you to easily inspect the page’s source HTML.

Parsing HTML with Cheerio.js

Awesome, Chrome DevTools is now showing us the exact pattern we should be looking for in the code (a “big” tag with a hyperlink inside of it). Let’s use Cheerio.js to parse the HTML we received earlier to return a list of links to the individual Wikipedia pages of U.S. presidents.

const rp = require('request-promise');
const cheerio = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(function(html){
    //success! parse the HTML once, then query it
    const $ = cheerio.load(html);
    console.log($('big > a').length);
    console.log($('big > a'));
  })
  .catch(function(err){
    //handle error
    console.error(err);
  });

Output:

45
{ '0':
  { type: 'tag',
    name: 'a',
    attribs: { href: '/wiki/George_Washington', title: 'George Washington' },
    children: [ [Object] ],
    next: null,
    prev: null,
    parent:
      { type: 'tag',
        name: 'big',
        attribs: {},
        children: [Array],
        next: null,
        prev: null,
        parent: [Object] } },
  '1':
    { type: 'tag'
  ...

We check to make sure there are exactly 45 elements returned (the number of U.S. presidents), meaning there aren’t any extra hidden “big” tags elsewhere on the page. Now, we can go through and grab a list of links to all 45 presidential Wikipedia pages by getting them from the “attribs” section of each element.

const rp = require('request-promise');
const cheerio = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(function(html){
    //success!
    const $ = cheerio.load(html);
    const wikiUrls = [];
    // read the href from the "attribs" of every matched link
    $('big > a').each(function(i, element) {
      wikiUrls.push(element.attribs.href);
    });
    console.log(wikiUrls);
  })
  .catch(function(err){
    //handle error
    console.error(err);
  });

Output:

[
  '/wiki/George_Washington',
  '/wiki/John_Adams',
  '/wiki/Thomas_Jefferson',
  '/wiki/James_Madison',
  '/wiki/James_Monroe',
  '/wiki/John_Quincy_Adams',
  '/wiki/Andrew_Jackson',
  ...
]
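
Note that these hrefs are relative paths, so they need the site’s origin in front of them before they can be requested. Here is a small sketch of one way to do that with the URL class built into Node.js. The helper name toAbsoluteUrl and the BASE_URL constant are made up for this illustration.

```javascript
// Origin of the site the relative links were scraped from
const BASE_URL = 'https://en.wikipedia.org';

// Hypothetical helper: resolve an href against the base URL.
// Hrefs that are already absolute are returned unchanged.
function toAbsoluteUrl(href, base = BASE_URL) {
  return new URL(href, base).href;
}

const relativeUrls = ['/wiki/George_Washington', '/wiki/John_Adams'];
const absoluteUrls = relativeUrls.map(function(href) {
  return toAbsoluteUrl(href);
});

console.log(absoluteUrls);
// [ 'https://en.wikipedia.org/wiki/George_Washington',
//   'https://en.wikipedia.org/wiki/John_Adams' ]
```

Plain string concatenation works too when every href is known to start with a slash; the URL class simply handles the other cases without extra checks.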

Now we have a list of all 45 presidential Wikipedia pages. Let’s create a new file (named potusParse.js), which will contain a function to take a presidential Wikipedia page and return the president’s name and birthday. First things first, let’s get the raw HTML from George Washington’s Wikipedia page.

const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/George_Washington';

rp(url)
  .then(function(html) {
    console.log(html);
  })
  .catch(function(err) {
    //handle error
    console.error(err);
  });

Output:

<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>George Washington - Wikipedia</title>
...

Let’s once again use Chrome DevTools to find the syntax of the code we want to parse, so that we can extract the name and birthday with Cheerio.js.

So we see that the name is in an element with the class “firstHeading” and the birthday is in an element with the class “bday”. Let’s modify our code to use Cheerio.js to extract the text of these two elements.

const rp = require('request-promise');
const cheerio = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/George_Washington';

rp(url)
  .then(function(html) {
    // parse the HTML once, then query it by class name
    const $ = cheerio.load(html);
    console.log($('.firstHeading').text());
    console.log($('.bday').text());
  })
  .catch(function(err) {
    //handle error
    console.error(err);
  });

Output:

George Washington
1732-02-22

Putting it all together

Perfect! Now let’s wrap this up into a function and export it from this module.

const rp = require('request-promise');
const cheerio = require('cheerio');

// Fetch one president's Wikipedia page and return their name and birthday
const potusParse = function(url) {
  return rp(url)
    .then(function(html) {
      const $ = cheerio.load(html);
      return {
        name: $('.firstHeading').text(),
        birthday: $('.bday').text(),
      };
    })
    .catch(function(err) {
      //handle error
      console.error(err);
    });
};

module.exports = potusParse;

Now let’s return to our original file potusScraper.js and require the potusParse.js module. We’ll then apply it to the list of wikiUrls we gathered earlier.

const rp = require('request-promise');
const cheerio = require('cheerio');
const potusParse = require('./potusParse');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(function(html) {
    //success!
    const $ = cheerio.load(html);
    const wikiUrls = [];
    // read the href from the "attribs" of every matched link
    $('big > a').each(function(i, element) {
      wikiUrls.push(element.attribs.href);
    });
    // parse every president's page; the hrefs are relative, so add the origin
    return Promise.all(
      wikiUrls.map(function(url) {
        return potusParse('https://en.wikipedia.org' + url);
      })
    );
  })
  .then(function(presidents) {
    console.log(presidents);
  })
  .catch(function(err) {
    //handle error
    console.log(err);
  });

Output:


[
  { name: 'George Washington', birthday: '1732-02-22' },
  { name: 'John Adams', birthday: '1735-10-30' },
  { name: 'Thomas Jefferson', birthday: '1743-04-13' },
  { name: 'James Madison', birthday: '1751-03-16' },
  { name: 'James Monroe', birthday: '1758-04-28' },
  { name: 'John Quincy Adams', birthday: '1767-07-11' },
  { name: 'Andrew Jackson', birthday: '1767-03-15' },
  { name: 'Martin Van Buren', birthday: '1782-12-05' },
  { name: 'William Henry Harrison', birthday: '1773-02-09' },
  { name: 'John Tyler', birthday: '1790-03-29' },
  { name: 'James K. Polk', birthday: '1795-11-02' },
  { name: 'Zachary Taylor', birthday: '1784-11-24' },
  { name: 'Millard Fillmore', birthday: '1800-01-07' },
  { name: 'Franklin Pierce', birthday: '1804-11-23' },
  { name: 'James Buchanan', birthday: '1791-04-23' },
  { name: 'Abraham Lincoln', birthday: '1809-02-12' },
  { name: 'Andrew Johnson', birthday: '1808-12-29' },
  { name: 'Ulysses S. Grant', birthday: '1822-04-27' },
  { name: 'Rutherford B. Hayes', birthday: '1822-10-04' },
  { name: 'James A. Garfield', birthday: '1831-11-19' },
  { name: 'Chester A. Arthur', birthday: '1829-10-05' },
  { name: 'Grover Cleveland', birthday: '1837-03-18' },
  { name: 'Benjamin Harrison', birthday: '1833-08-20' },
  { name: 'Grover Cleveland', birthday: '1837-03-18' },
  { name: 'William McKinley', birthday: '1843-01-29' },
  { name: 'Theodore Roosevelt', birthday: '1858-10-27' },
  { name: 'William Howard Taft', birthday: '1857-09-15' },
  { name: 'Woodrow Wilson', birthday: '1856-12-28' },
  { name: 'Warren G. Harding', birthday: '1865-11-02' },
  { name: 'Calvin Coolidge', birthday: '1872-07-04' },
  { name: 'Herbert Hoover', birthday: '1874-08-10' },
  { name: 'Franklin D. Roosevelt', birthday: '1882-01-30' },
  { name: 'Harry S. Truman', birthday: '1884-05-08' },
  { name: 'Dwight D. Eisenhower', birthday: '1890-10-14' },
  { name: 'John F. Kennedy', birthday: '1917-05-29' },
  { name: 'Lyndon B. Johnson', birthday: '1908-08-27' },
  { name: 'Richard Nixon', birthday: '1913-01-09' },
  { name: 'Gerald Ford', birthday: '1913-07-14' },
  { name: 'Jimmy Carter', birthday: '1924-10-01' },
  { name: 'Ronald Reagan', birthday: '1911-02-06' },
  { name: 'George H. W. Bush', birthday: '1924-06-12' },
  { name: 'Bill Clinton', birthday: '1946-08-19' },
  { name: 'George W. Bush', birthday: '1946-07-06' },
  { name: 'Barack Obama', birthday: '1961-08-04' },
  { name: 'Donald Trump', birthday: '1946-06-14' }
]
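
One thing to keep in mind: Promise.all starts all 45 requests at the same moment. If you would rather be gentler on the server, here is a rough sketch of running the requests in small batches instead. The helper mapInBatches, the batch size, and the fakeParse stand-in are all hypothetical; in the real script you would pass potusParse and the full list of URLs.

```javascript
// Hypothetical helper: run an async function over a list in fixed-size batches,
// so that at most `batchSize` calls are in flight at any time.
async function mapInBatches(items, batchSize, fn) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // wait for the whole batch to finish before starting the next one
    const batchResults = await Promise.all(
      batch.map(function(item) {
        return fn(item);
      })
    );
    results.push(...batchResults);
  }
  return results;
}

// Stand-in for potusParse so the sketch runs without network access
function fakeParse(url) {
  return new Promise(function(resolve) {
    setTimeout(function() {
      resolve({ url: url });
    }, 10);
  });
}

mapInBatches(['/wiki/A', '/wiki/B', '/wiki/C'], 2, fakeParse).then(function(results) {
  console.log(results.length); // 3
});
```

The results come back in the same order as the input list, just as they do with a single Promise.all.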

Voilà! A list of the names and birthdays of all 45 U.S. presidents. Using just the request-promise module and Cheerio.js should allow you to scrape the vast majority of sites on the internet.

Rendering JavaScript Pages

Recently, however, many sites have begun using JavaScript to generate dynamic content on their websites. This causes a problem for request-promise and other similar HTTP request libraries (such as axios and fetch), because they only get the response from the initial request, but they cannot execute the JavaScript the way a web browser can.
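
To make the problem concrete, here is a toy illustration with made-up markup (it is not Reddit’s actual HTML). The helper stripScripts is hypothetical and uses a simple regular expression only for demonstration; it shows that the content we want is not in the markup until the script has run.

```javascript
// A stand-in for what an HTTP library receives from a JavaScript-driven page
const rawHtml = [
  '<html><body>',
  '<div id="posts"></div>',
  '<script>',
  "  document.getElementById('posts').innerHTML = '<h2>First post</h2>';",
  '</script>',
  '</body></html>',
].join('\n');

// Hypothetical helper: drop <script> blocks, leaving only the markup
// that exists before any JavaScript has been executed.
function stripScripts(html) {
  return html.replace(/<script[\s\S]*?<\/script>/gi, '');
}

const staticMarkup = stripScripts(rawHtml);
console.log(staticMarkup.includes('<h2>')); // false: the <h2> only appears once the script runs
```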

Thus, to scrape sites that require JavaScript execution, we need another solution. In our next example, we will get the titles for all of the posts on the front page of Reddit. Let’s see what happens when we try to use request-promise as we did in the previous example.

const rp = require('request-promise');
const url = 'https://www.reddit.com';

rp(url)
  .then(function(html){
    //success!
    console.log(html);
  })
  .catch(function(err){
    //handle error
    console.error(err);
  });

Here’s what the output looks like:

<!DOCTYPE html><html
lang="en"><head><title>reddit: the front page of the
internet</title>
...

Hmmm…not quite what we want. That’s because getting the actual content requires you to run the JavaScript on the page! With Puppeteer, that’s no problem.

Puppeteer is an extremely popular new module brought to you by the Google Chrome team that allows you to control a headless browser. This is perfect for programmatically scraping pages that require JavaScript execution. Let’s get the HTML from the front page of Reddit using Puppeteer instead of request-promise.

const puppeteer = require('puppeteer');
const url = 'https://www.reddit.com';

puppeteer
  .launch()
  .then(function(browser) {
    return browser
      .newPage()
      .then(function(page) {
        return page.goto(url).then(function() {
          return page.content();
        });
      })
      .then(function(html) {
        console.log(html);
      })
      .finally(function() {
        // close the browser so the Node.js process can exit
        return browser.close();
      });
  })
  .catch(function(err) {
    //handle error
    console.error(err);
  });

Output:

<!DOCTYPE html><html lang="en"><head><link
  href="//c.amazon-adsystem.com/aax2/apstag.js" rel="preload"
  as="script">
...

Nice! The page is filled with the correct content!

Now we can use Chrome DevTools like we did in the previous example.

It looks like Reddit is putting the titles inside “h2” tags. Let’s use Cheerio.js to extract the h2 tags from the page.

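const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const url = 'https://www.reddit.com';

puppeteer
  .launch()
  .then(function(browser) {
    return browser
      .newPage()
      .then(function(page) {
        return page.goto(url).then(function() {
          return page.content();
        });
      })
      .then(function(html) {
        // parse the rendered HTML and print the text of every h2
        const $ = cheerio.load(html);
        $('h2').each(function() {
          console.log($(this).text());
        });
      })
      .finally(function() {
        // close the browser so the Node.js process can exit
        return browser.close();
      });
  })
  .catch(function(err) {
    //handle error
    console.error(err);
  });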

Output:

Russian Pipeline. Upvote so that this is the first image people see when they Google “Russian Pipeline”
John F. Kennedy Jr. Sitting in the pilot seat of the Marine One circa 1963
I didn't take it as a compliment.
How beautiful is this
Hustle like Faye
The power of a salt water crocodile's tail.
I'm 36, and will be dead inside of a year.
F***ing genius.
TIL Anthony Daniels, who endured years of discomfort in the C-3PO costume, was so annoyed by Alan Tudyk (Rogue One) playing K-2SO in the comfort of a motion-capture suit that he cursed at Tudyk. Tudyk later joked that a "fuck you" from Daniels was among the highest compliments he had ever received.
Reminder about the fact UC Davis paid over $100k to remove this photo from the internet.
King of the Hill reruns will start airing on Comedy Central July 24th
[Image] Slow and steady
White House: Trump open to Russia questioning US citizens
Godzilla: King of the Monsters Teaser Banner
He tried
Soldier reunited with his dog after being away.
Hiring a hitman on yourself and preparing for battle is the ultimate extreme sport.
Two paintballs colliding midair
My thoughts & prayers are with those ears
When even your fantasy starts dropping hints
Elon Musk's apology is out
"When you're going private so you plant trees to throw some last shade at TDNW before you vanish." Thanos' farm advances. The soul children will have full bellies. 1024 points will give him the resources to double, and irrigate, his farm. (See comment)
Some leaders prefer chess, others prefer hungry hippos. Travis Chapman, oil, 2018
The S.S. Ste. Claire, retired from ferrying amusement park goers, now ferries The Damned across the river Styx.
A soldier is reunited with his dog
*hits blunt*
Today I Learned
Black Panther Scene Representing the Pan-African Flag
The precision of this hydraulic press.
Let bring the game to another level
When you're fighting a Dark Souls boss and you gamble to get 'just one extra hit' in instead of rolling out of range.
"I check for traps"
Anon finds his home at last
He’s hungry
Being a single mother is a thankless job.
TIL That when you're pulling out Minigun, you're actually pulling out suitcase that then transforms into Minigun.
OMG guys don’t look!!! 🙈🙈🙈
hyubsama's emote of his own face denied for political reasons because twitch thinks its a picture of Kim Jong Un

Node.js for Beginners - Learn Node.js from Scratch (Step by Step)

Welcome to my course "Node.js for Beginners - Learn Node.js from Scratch". This course will guide you step by step through the basics and theory of every part. It contains hands-on, practical examples, without neglecting theory, so that you can understand coding in Node.js better. If you have no previous knowledge of or experience with Node.js, you will like that the course begins with the basics. If you already have some experience programming in Node.js, this course can still teach you something new. Learn to use Node.js like a professional. This comprehensive course will prepare you to work in the real world as an expert!

What you’ll learn:

  • Basics of Node
  • Modules
  • NPM in Node
  • Events
  • Email
  • Uploading Files
  • Advanced Node

Top 7 Most Popular Node.js Frameworks You Should Know

Node.js is an open-source, cross-platform, runtime environment that allows developers to run JavaScript outside of a browser.

One of the main advantages of Node is that it enables developers to use JavaScript on both the front-end and the back-end of an application. This not only makes the source code of any app cleaner and more consistent, but it significantly speeds up app development too, as developers only need to use one language.

Node is fast, scalable, and easy to get started with. Its default package manager is npm, which means it also sports the largest ecosystem of open-source libraries. Node is used by companies such as NASA, Uber, Netflix, and Walmart.

But Node doesn't come alone. It comes with a plethora of frameworks. A Node framework can be pictured as the external scaffolding that you can build your app in. These frameworks are built on top of Node and extend the technology's functionality, mostly by making apps easier to prototype and develop, while also making them faster and more scalable.
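
To get a feel for what a framework saves you from writing, here is a rough sketch of hand-rolled routing with nothing but Node's built-in http module. The route table, the paths, and the route helper are invented for this illustration and are not taken from any of the frameworks below.

```javascript
const http = require('http');

// Hypothetical route table: "METHOD path" mapped to a handler
const routes = {
  'GET /': function() {
    return { status: 200, body: 'Home' };
  },
  'GET /health': function() {
    return { status: 200, body: 'OK' };
  },
};

// Look up the handler for a request, falling back to a 404 response
function route(method, path) {
  const handler = routes[method + ' ' + path];
  return handler ? handler() : { status: 404, body: 'Not found' };
}

const server = http.createServer(function(req, res) {
  const result = route(req.method, req.url);
  res.writeHead(result.status, { 'Content-Type': 'text/plain' });
  res.end(result.body);
});

// server.listen(3000); // uncomment to serve on a port of your choice
```

Even this tiny version has to handle lookups, status codes, and headers by hand, which is the kind of plumbing a framework takes over.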

Below are 7 of the most popular Node frameworks at this point in time (ranked from high to low by GitHub stars).

Express

With over 43,000 GitHub stars, Express is the most popular Node framework. It brands itself as a fast, unopinionated, and minimalist framework. Express acts as middleware: it helps set up and configure routes to send and receive requests between the front-end and the database of an app.

Express provides lightweight, powerful tools for HTTP servers. It's a great framework for single-page apps, websites, hybrids, or public HTTP APIs. It supports over fourteen different template engines, so developers aren't forced into any specific one.

Meteor

Meteor is a full-stack JavaScript platform. It allows developers to build real-time web apps, i.e. apps where code changes are pushed to all browsers and devices in real-time. Additionally, servers send data over the wire, instead of HTML. The client renders the data.

The project has over 41,000 GitHub stars and is built to power large projects. Meteor is used by companies such as Mazda, Honeywell, Qualcomm, and IKEA. It has excellent documentation and a strong community behind it.

Koa

Koa is built by the same team that built Express. It uses ES6 generators (and, since Koa 2, async functions), which allow developers to work without callbacks. Developers also have more control over error handling. Koa has no middleware within its core, which gives developers more control over configuration, but it also means that traditional Node middleware (e.g. req, res, next) won't work with Koa.

Koa already has over 26,000 GitHub stars. The Express developers built Koa because they wanted a lighter framework that was more expressive and more robust than Express.
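
As a general illustration of what working without callbacks looks like, here is a small sketch in plain Node.js (it does not use Koa's API). The function doubleLater is made up for the example; util.promisify is part of Node's standard library.

```javascript
const { promisify } = require('util');

// A made-up callback-style function: doubles a number after a short delay
function doubleLater(n, callback) {
  setTimeout(function() {
    callback(null, n * 2); // error-first callback convention
  }, 10);
}

// Callback style: each dependent step nests one level deeper
doubleLater(2, function(err, first) {
  if (err) throw err;
  doubleLater(first, function(err, second) {
    if (err) throw err;
    console.log('callbacks:', second); // callbacks: 8
  });
});

// Promise style: util.promisify wraps the same function so it can be awaited
const doubleLaterAsync = promisify(doubleLater);

async function doubleTwice(n) {
  const first = await doubleLaterAsync(n);
  return doubleLaterAsync(first);
}

doubleTwice(2).then(function(result) {
  console.log('async/await:', result); // async/await: 8
});
```

Both versions do the same work; the second simply reads top to bottom and lets errors propagate through the returned promise.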

Sails

Sails is a real-time, MVC framework for Node that's built on Express. It supports auto-generated REST APIs and comes with an easy WebSocket integration.

The project has over 20,000 stars on GitHub and is compatible with almost all databases (MySQL, MongoDB, PostgreSQL, Redis). It's also compatible with most front-end technologies (Angular, iOS, Android, React, and even Windows Phone).

Nest

Nest has over 15,000 GitHub stars. It uses progressive JavaScript and is built with TypeScript, which means it comes with strong typing. It combines elements of object-oriented programming, functional programming, and functional reactive programming.

Nest is packaged in such a way it serves as a complete development kit for writing enterprise-level apps. The framework uses Express, but is compatible with a wide range of other libraries.

LoopBack

LoopBack is a framework that allows developers to quickly create REST APIs. It has an easy-to-use CLI wizard and allows developers to create models either on their schema or dynamically. It also has a built-in API explorer.

LoopBack has over 12,000 GitHub stars and is used by companies such as GoDaddy, Symantec, and the Bank of America. It's compatible with many REST services and a wide variety of databases (MongoDB, Oracle, MySQL, PostgreSQL).

Hapi

Similar to Express, hapi serves data by intermediating between server-side and client-side. As such, it can serve as a substitute for Express. Hapi allows developers to focus on writing reusable app logic in a modular and prescriptive fashion.

The project has over 11,000 GitHub stars. It has built-in support for input validation, caching, authentication, and more. Hapi was originally developed to handle all of Walmart's mobile traffic during Black Friday.

Node JS Full Course Learn Node.js in 7 Hours

This Edureka Node.js Full Course video will help you learn Node.js along with practical demonstrations. This Node.js Tutorial for Beginners is ideal for both beginners and professionals who want to master the most prominently used JavaScript backend framework. Below are the topics covered in this Node.js tutorial video:
2:32 What is Node.js?
3:22 Client-Server Architecture
4:12 Multi-Threaded Model
6:13 Single-Threaded Model
7:43 Multi-Threaded vs Event-Driven
9:45 Uber Old Architecture
11:10 Uber New Architecture
12:30 What is Node.js?
13:05 Success Stories
14:20 Node.js Trend
14:40 Node.js Features
16:25 Node.js Installation
16:50 Node.js First Example
17:30 Blocking vs Non-blocking
18:50 Demo
23:50 Node.js Modules
23:50 NPM
25:10 Global Objects
26:55 File System
30:30 Callbacks
31:45 Event
33:05 HTTP
34:50 Hands On
1:09:45 Node.js Tutorial
1:10:45 What is Node.js?
1:12:10 Features of Node.js
1:13:00 Node.js Architecture
1:14:55 NPM(Node Package Manager)
1:16:20 Node.js Modules
1:16:30 Node.js Modules Types
1:16:35 Core Modules
1:16:55 Local Modules
1:17:10 3rd Party Modules
1:18:35 JSON File
1:23:30 Data Types
1:25:35 Variables
1:26:40 Operators
1:27:45 Functions
1:29:10 Objects
1:29:55 File Systems
1:33:50 Events
1:34:20 HTTP Module
1:40:02 Events
1:44:37 HTTP Module
1:45:27 Creating a Web Server using Node.js
1:45:42 Express.js
1:46:57 Demo
1:58:37 Node.js NPM Tutorial
1:59:37 What is NPM?
2:03:12 Main Functions of NPM
2:04:27 Need For NPM
2:08:07 NPM Packages
2:17:42 NPM Installation
2:18:12 JSON File
2:31:32 Node.js Express Tutorial
2:32:02 Introduction to Express.js
2:32:32 Features of Express.js
2:35:27 Getting Started with Express.js
2:39:42 Routing Methods
2:44:57 Hands-On
2:48:12 Building RESTful API with Node.js
2:48:27 What is REST API?
2:49:42 Features of REST API
2:51:12 Principles of REST API
2:56:37 Methods of REST API
2:59:52 Building REST API with Node.js
3:24:07 Node.js MySQL Tutorial
3:24:32 What is MySQL?
3:25:13 Advantages of Using MySQL with Node.js
3:27:38 MySQL Installation
3:44:23 Node.js MongoDB Tutorial
3:44:58 What is NoSQL?
3:47:53 NoSQL Databases
3:48:38 Introduction to MongoDB
3:52:48 Features of MongoDB
3:53:03 MongoDB Installation
4:36:08 Node.js Docker Tutorial
4:36:38 What is Docker?
4:39:13 Docker Working
4:41:43 Docker Basics
4:41:48 DockerFile
4:42:03 Docker Images
4:42:23 Docker Container
4:44:38 Why use Node.js with Docker?
4:45:18 Demo: Node.js with Docker
4:58:38 MEAN Stack Application Tutorial
4:59:18 What is MEAN Application?
4:59:53 MongoDB
5:00:28 Express
5:01:13 Angular
5:01:23 Node.js
5:02:17 RESTful API
5:03:02 Contact List MEAN App
6:17:57 Node.js Interview Questions
