How to build a Web Crawler using Node workers

In this Node tutorial, we will learn how to build a web crawler that scrapes currency exchange rates from a website and saves them to a database, using Node worker threads to run these operations.

Introduction

A web crawler, often shortened to crawler or sometimes called a spider-bot, is a bot that systematically browses the internet typically for the purpose of web indexing. These internet bots can be used by search engines to improve the quality of search results for users. In addition to indexing the world wide web, crawling can also be used to gather data (known as web scraping).

The process of web scraping can be quite taxing on the CPU, depending on the site’s structure and the complexity of the data being extracted. To optimize and speed up this process, we will make use of Node workers (threads), which are useful for CPU-intensive operations.

In this article, we will learn how to build a web crawler that scrapes a website and stores the data in a database. This crawler bot will perform both operations using Node workers.

Prerequisites
  1. Basic knowledge of Node.js
  2. Yarn or NPM (we’ll be using Yarn)
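  3. A system configured to run Node code (preferably version 12 or later, where worker_threads is available without a flag; version 10.5.0 introduced the module behind the --experimental-worker flag)

Installation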

Launch a terminal and create a new directory for this tutorial:

$ mkdir worker-tutorial
$ cd worker-tutorial

Initialize the directory by running the following command:

$ yarn init -y

We need the following packages to build the crawler:

  • Axios: a promise-based HTTP client for the browser and Node.js
  • Cheerio: a lightweight implementation of jQuery which gives us access to the DOM on the server
  • Firebase database: a cloud-hosted NoSQL database. If you’re not familiar with setting up a Firebase database, check out the documentation and follow steps 1-3 to get started

Let’s install the packages listed above with the following command:

$ yarn add axios cheerio firebase-admin

Hello workers

Before we start building the crawler using workers, let’s go over some basics. You can create a test file hello.js in the root of the project to run the following snippets.

Registering a worker

A worker can be initialized (registered) by importing the Worker class from the worker_threads module like this:

// hello.js

const { Worker } = require('worker_threads');

new Worker("./worker.js");
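In practice it helps to know when a worker fails or finishes. The following is a minimal sketch, assuming the worker source is passed inline through the eval option (the tutorial itself points the worker at a file) and using a hypothetical startWorker helper that is not part of the original project:

```javascript
const { Worker } = require('worker_threads');

// Hypothetical helper: starts a worker from inline source and
// resolves with its exit code, or rejects if the worker throws.
function startWorker(source) {
  return new Promise((resolve, reject) => {
    // eval: true makes Node treat the first argument as code, not a file path
    const worker = new Worker(source, { eval: true });
    worker.on('error', reject); // uncaught exception inside the worker
    worker.on('exit', resolve); // exit code, 0 on a clean exit
  });
}

startWorker('console.log("worker started")').then((code) => {
  console.log(`worker exited with code ${code}`);
});
```

Listening for both events means a crash in the worker surfaces in the main thread instead of going unnoticed.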

Hello world

Printing out Hello World with workers is as simple as running the snippet below:

// hello.js

const { Worker, isMainThread }  = require('worker_threads');
if(isMainThread){
    new Worker(__filename);
} else{
    console.log("Worker says: Hello World"); // prints 'Worker says: Hello World'
}

This snippet pulls in the Worker class and the isMainThread boolean from the worker_threads module:

  • isMainThread tells us whether the code is running inside the main thread or a worker thread
  • new Worker(__filename) registers a new worker with the __filename variable which, in this case, is hello.js

Communication with workers

When a new worker (thread) is spawned, there is a messaging port that allows inter-thread communications. Below is a snippet which shows how to pass messages between workers (threads):

// hello.js

const { Worker, isMainThread, parentPort }  = require('worker_threads');

if (isMainThread) {
    const worker =  new Worker(__filename);
    worker.once('message', (message) => {
        console.log(message); // prints 'Worker thread: Hello!'
    });
    worker.postMessage('Main Thread: Hi!');
} else {
    parentPort.once('message', (message) => {
        console.log(message) // prints 'Main Thread: Hi!'
        parentPort.postMessage("Worker thread: Hello!");
    });
}

In the snippet above, the main thread creates a worker, registers a one-time listener for the worker's reply using worker.once(), and sends a message to the worker using worker.postMessage(). Inside the worker, parentPort.once() listens for the message from the parent thread, logs it, and replies using parentPort.postMessage().

Running the code produces the following output:

Main Thread: Hi!
Worker thread: Hello!
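
Messages are not limited to strings. As a rough sketch, the request and reply can be wrapped in a promise so the main thread can await the worker's answer. The countInWorker helper, the inline worker source, and the sample rates below are assumptions made for illustration only:

```javascript
const { Worker } = require('worker_threads');

// Inline worker source (kept in a string so the sketch fits in one file).
const workerSource = `
  const { parentPort } = require('worker_threads');
  parentPort.once('message', (rates) => {
    // Objects arrive as copies (structured clone), not shared references.
    parentPort.postMessage({ count: Object.keys(rates).length });
  });
`;

// Hypothetical helper: sends an object to a worker and resolves with its reply.
function countInWorker(rates) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.postMessage(rates);
  });
}

// Made-up sample values, not real exchange rates.
countInWorker({ EUR: '0.9', GBP: '0.8' }).then((reply) => {
  console.log(reply.count); // 2
});
```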

Building the crawler

Let’s build a basic web crawler that uses Node workers to crawl and write to a database. The crawler will complete its task in the following order:

  1. Fetch (request) HTML from the website
  2. Extract the HTML from the response
  3. Traverse the DOM and extract the table containing exchange rates
  4. Format table elements (tbody, tr, and td) and extract exchange rate values
  5. Store exchange rate values in an object and send it to a worker thread using worker.postMessage()
  6. Accept message from parent thread in worker thread using parentPort.on()
  7. Store message in Firestore (Firebase database)

Let’s create two new files in our project directory:

  1. main.js – for the main thread
  2. dbWorker.js – for the worker thread

The source code for this tutorial is available here on GitHub. Feel free to clone it, fork it or submit an issue.

Main thread (main.js)

In the main thread, we will scrape the IBAN website for the current exchange rates of popular currencies against the US dollar. We will import axios and use it to fetch the HTML from the site using a simple GET request.

We will also use cheerio to traverse the DOM and extract data from the table element. To know the exact elements to extract, we will open the IBAN website in our browser and inspect the page with dev tools.

The inspector shows the table element with the classes table table-bordered table-hover downloads. This will be a great starting point and we can feed that into our cheerio root element selector:

// main.js

const axios = require('axios');
const cheerio = require('cheerio');
const url = "https://www.iban.com/exchange-rates";

fetchData(url).then( (res) => {
    if (!res) {
        return; // request failed; the error was already logged
    }
    const html = res.data;
    const $ = cheerio.load(html);
    const statsTable = $('.table.table-bordered.table-hover.downloads > tbody > tr');
    statsTable.each(function() {
        let title = $(this).find('td').text();
        console.log(title);
    });
})

async function fetchData(url){
    console.log("Crawling data...")
    try {
        // make http call to url; axios rejects on non-2xx responses by default
        let response = await axios(url);
        if(response.status !== 200){
            console.log("Error occurred while fetching data");
            return;
        }
        return response;
    } catch (err) {
        console.log(err);
    }
}

Running the code above with Node prints the combined cell text of each table row.


Going forward, we will update the main.js file so that we can properly format our output and send it to our worker thread.
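
One detail of fetchData worth a closer look is what happens when the request fails. A minimal sketch, assuming the HTTP client is passed in as a parameter so the failure path can be exercised with stubs instead of real requests. The fetchDataWith function and both stub clients are hypothetical and not part of the original project:

```javascript
// Same shape as fetchData, but the HTTP client is injected (an assumption
// made here so the function can run without network access).
async function fetchDataWith(client, url) {
  try {
    const response = await client(url);
    if (response.status !== 200) {
      console.log('Error occurred while fetching data');
      return undefined;
    }
    return response;
  } catch (err) {
    // A rejected request ends up here instead of crashing the caller.
    console.log(`Request failed: ${err.message}`);
    return undefined;
  }
}

// Hypothetical stubs standing in for axios.
const okClient = async () => ({ status: 200, data: '<html></html>' });
const failingClient = async () => { throw new Error('network down'); };

fetchDataWith(okClient, 'https://example.test').then((res) => console.log(res.status)); // 200
fetchDataWith(failingClient, 'https://example.test').then((res) => console.log(res)); // undefined
```

Because the function returns undefined on failure, callers should check the result before reading its data property.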

Updating the main thread

To properly format our output, we need to get rid of white space and tabs since we will be storing the final output in JSON. Let’s update the main.js file accordingly:

// main.js
[...]
const { Worker } = require('worker_threads'); // needed to spawn dbWorker.js
let workDir = __dirname+"/dbWorker.js";

const mainFunc = async () => {
  const url = "https://www.iban.com/exchange-rates";
  // fetch html data from iban website
  let res = await fetchData(url);
  if(!res || !res.data){
    console.log("Invalid data Obj");
    return;
  }
  const html = res.data;
  // mount html page to the root element
  const $ = cheerio.load(html);

  let dataObj = new Object();
  const statsTable = $('.table.table-bordered.table-hover.downloads > tbody > tr');
  //loop through all table rows and get table data
  statsTable.each(function() {
    let title = $(this).find('td').text(); // get the text in all the td elements
    let newStr = title.split("\t"); // convert text (string) into an array
    newStr.shift(); // strip off empty array element at index 0
    formatStr(newStr, dataObj); // format array string and store in an object
  });

  return dataObj;
}

mainFunc().then((res) => {
    if (!res) {
        return; // nothing was crawled, so there is nothing to store
    }
    // start worker
    const worker = new Worker(workDir); 
    console.log("Sending crawled data to dbWorker...");
    // send formatted data to worker thread 
    worker.postMessage(res);
    // listen to message from worker thread
    worker.on("message", (message) => {
        console.log(message)
    });
});

[...]

function formatStr(arr, dataObj){
    // regex to match all the words before the first digit
    let regExp = /[^A-Z]*(^\D+)/ 
    let newArr = arr[0].split(regExp); // split array element 0 using the regExp rule
    dataObj[newArr[1]] = newArr[2]; // store object 
}

In the snippet above, we are doing more than data formatting; after the mainFunc() has been resolved, we pass the formatted data to the worker thread for storage.
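
To see what formatStr does to a single row, it can be run on its own. This is a small sketch using made-up input strings; the real text scraped from the site may be shaped differently:

```javascript
function formatStr(arr, dataObj) {
  // The capture group grabs everything before the first digit,
  // so split() returns ['', <name>, <rate>].
  const regExp = /[^A-Z]*(^\D+)/;
  const newArr = arr[0].split(regExp);
  dataObj[newArr[1]] = newArr[2];
}

const rates = {};
// Hypothetical row strings: currency name immediately followed by the rate.
formatStr(['US Dollar1.0000'], rates);
formatStr(['Euro0.9134'], rates);

console.log(rates['US Dollar']); // 1.0000
console.log(rates['Euro']);      // 0.9134
```

Because split() is called with a capturing group, the captured currency name is included in the result array, which is why index 1 holds the name and index 2 holds the rate.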

Worker thread (dbWorker.js)

In this worker thread, we will initialize firebase and listen for the crawled data from the main thread. When the data arrives, we will store it in the database and send a message back to the main thread to confirm that data storage was successful.

The snippet that takes care of the aforementioned operations can be seen below:

// dbWorker.js

const { parentPort } = require('worker_threads');
const admin = require("firebase-admin");

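// firebase credentials
// Note: the Admin SDK authenticates with a service account (the credential option)
// or Application Default Credentials, not with web client keys such as apiKey.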
let firebaseConfig = {
    apiKey: "XXXXXXXXXXXX-XXX-XXX",
    authDomain: "XXXXXXXXXXXX-XXX-XXX",
    databaseURL: "XXXXXXXXXXXX-XXX-XXX",
    projectId: "XXXXXXXXXXXX-XXX-XXX",
    storageBucket: "XXXXXXXXXXXX-XXX-XXX",
    messagingSenderId: "XXXXXXXXXXXX-XXX-XXX",
    appId: "XXXXXXXXXXXX-XXX-XXX"
};

// Initialize Firebase
admin.initializeApp(firebaseConfig);
let db = admin.firestore();
// get current date in D-M-YYYY format (getMonth() is zero-based, so add 1)
let date = new Date();
let currDate = `${date.getDate()}-${date.getMonth() + 1}-${date.getFullYear()}`;
// receive crawled data from main thread
parentPort.once("message", (message) => {
    console.log("Received data from mainWorker...");
    // store data gotten from main thread in database
    db.collection("Rates").doc(currDate).set({
        rates: JSON.stringify(message)
    }).then(() => {
        // send data back to main thread if operation was successful
        parentPort.postMessage("Data saved successfully");
    })
    .catch((err) => console.log(err))    
});

Note: To set up a database on Firebase, please visit the Firebase documentation and follow steps 1-3 to get started.
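
The document ID in dbWorker.js is built from the parts of the current date. Here is a hedged sketch of a zero-padded variant, which keeps IDs the same length and accounts for getMonth() being zero-based. The formatDateKey name is hypothetical and does not appear in the original project:

```javascript
// Hypothetical helper: formats a Date as DD-MM-YYYY using local time.
function formatDateKey(date) {
  const day = String(date.getDate()).padStart(2, '0');
  // getMonth() returns 0 for January, so add 1 before formatting.
  const month = String(date.getMonth() + 1).padStart(2, '0');
  return `${day}-${month}-${date.getFullYear()}`;
}

console.log(formatDateKey(new Date(2020, 0, 5))); // 05-01-2020
```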

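Running main.js (which spawns dbWorker.js) with Node logs progress messages from both threads and, once the write succeeds, the confirmation Data saved successfully.

You can now check your Firebase database, where the crawled data is stored in the Rates collection under a document named after the current date.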

Final notes

Although web crawling can be fun, it can also be against the law if you use data to commit copyright infringement. It is generally advised that you read the terms and conditions of the site you intend to crawl, to know their data crawling policy beforehand. You can learn more in the Crawling Policy section of this page.

Worker threads do not guarantee that your application will be faster. When used well, they move CPU-intensive tasks off the main thread, which keeps the main thread free to handle other work and makes the application feel more responsive.
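
As an illustration of that point, the sketch below hands a CPU-bound loop to a worker while a timer keeps firing on the main thread. The sumInWorker helper, the inline worker source, and the loop itself are assumptions chosen for the example and are not taken from the crawler:

```javascript
const { Worker } = require('worker_threads');

// Inline worker source: adds the integers from 1 to workerData.limit.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  let sum = 0;
  for (let i = 1; i <= workerData.limit; i++) sum += i;
  parentPort.postMessage(sum);
`;

// Hypothetical helper: runs the loop in a worker and resolves with the total.
function sumInWorker(limit) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { limit } });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// The timer keeps running because the loop is not blocking the main thread.
const timer = setInterval(() => console.log('main thread is still responsive'), 100);

sumInWorker(1e7).then((sum) => {
  clearInterval(timer);
  console.log(sum); // 50000005000000
});
```

Running the same loop directly on the main thread would delay the timer callbacks until the loop finished.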

Conclusion

In this tutorial, we learned how to build a web crawler that scrapes currency exchange rates and saves them to a database. We also learned how to use worker threads to run these operations.

The source code for the snippets in this tutorial is available on GitHub. Feel free to clone it, fork it or submit an issue.

Introduction to Electron: Build Desktop App using Node and JavaScript

Introduction to Electron: Build Desktop App using Node and JavaScript. In this Electron tutorial, Felix will give a technical introduction to Electron. He’ll cover the basics and explain both benefits and challenges of using Node.js and JavaScript to build major desktop applications. How to build your first Desktop App with JavaScript using Electron.

Electron: Desktop Apps With JavaScript

Chances are high that you’re already using desktop software built with JavaScript and Node.js: Apps like Visual Studio Code, Slack, or WhatsApp use the framework Electron to combine native code with the conveniences of Node.js and web technologies.

In this talk, Felix will give a technical introduction to Electron. Building a small code editor live on stage, he’ll cover the basics and explain both benefits and challenges of using Node.js and JavaScript to build major desktop applications.

JavaScript Tutorial: if-else Statement in JavaScript

This JavaScript tutorial is a step-by-step guide to if-else statements in JavaScript. It covers JavaScript's conditional statements: if, if-else, nested-if, and if-else-if. These statements allow you to control the flow of your program's execution based upon conditions known only during run time.

Decision making in programming is similar to decision making in real life. In programming, too, we face situations where we want a certain block of code to be executed when some condition is fulfilled.
A programming language uses control statements to control the flow of execution of the program based on certain conditions. These are used to cause the flow of execution to advance and branch based on changes to the state of a program.

JavaScript’s conditional statements:

  • if
  • if-else
  • nested-if
  • if-else-if

These statements allow you to control the flow of your program’s execution based upon conditions known only during run time.

  • if: The if statement is the simplest decision-making statement. It is used to decide whether a certain statement or block of statements will be executed, i.e. if a certain condition is true then a block of statements is executed, otherwise not.
    Syntax:
if(condition) 
{
   // Statements to execute if
   // condition is true
}

Here, condition after evaluation will be either true or false. The if statement converts the value of the condition to a boolean: if the result is true, it executes the block of statements under it.
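
A short sketch of how non-boolean values behave as conditions. The branch function is a hypothetical helper written for this illustration:

```javascript
// Hypothetical helper: reports whether the if block ran for a given value.
function branch(value) {
  if (value) {
    return 'block executed';
  }
  return 'block skipped';
}

console.log(branch(1));      // block executed
console.log(branch('text')); // block executed
console.log(branch([]));     // block executed (an empty array is truthy)
console.log(branch(0));      // block skipped
console.log(branch(''));     // block skipped
console.log(branch(null));   // block skipped
```
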
If we do not provide the curly braces ‘{‘ and ‘}’ after if( condition ) then by default if statement will consider the immediate one statement to be inside its block. For example,

if(condition)
   statement1;
   statement2;

// Here if the condition is true, if block 
// will consider only statement1 to be inside 
// its block.
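
The same behaviour can be checked with a runnable sketch. The withoutBraces function is a hypothetical example, not part of the original tutorial:

```javascript
// Hypothetical example: records which statements run when braces are omitted.
function withoutBraces(condition) {
  const log = [];
  if (condition)
    log.push('statement1'); // the only statement controlled by the if
  log.push('statement2');   // not part of the if, so it always runs
  return log;
}

console.log(withoutBraces(true));  // [ 'statement1', 'statement2' ]
console.log(withoutBraces(false)); // [ 'statement2' ]
```

Wrapping the body in braces, even for a single statement, avoids this kind of surprise when a second line is added later.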

Example:

<script type="text/javascript"> 

// JavaScript program to illustrate If statement 

var i = 10; 

if (i > 15) 
    document.write("i is greater than 15"); 

// This statement is always executed because, without braces, 
// only the first statement belongs to the if 
document.write("I am Not in if"); 

</script> 

Output:

I am Not in if

  • if-else: The if statement alone executes a block of statements when a condition is true and skips it when the condition is false. But what if we want to do something else when the condition is false? This is where the else statement comes in. We can use the else statement with the if statement to execute a block of code when the condition is false.
    Syntax:
if (condition)
{
    // Executes this block if
    // condition is true
}
else
{
    // Executes this block if
    // condition is false
}


Example:

<script type="text/javascript"> 

// JavaScript program to illustrate If-else statement 

var i = 10; 

if (i < 15) 
    document.write("i is smaller than 15"); 
else
    document.write("i is 15 or greater"); 

</script> 

Output:

i is smaller than 15
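
For readers who prefer to run the example outside the browser, here is an equivalent sketch that returns the message instead of writing it to the page. The compareTo15 function name is made up for this illustration:

```javascript
// Hypothetical function: exactly one of the two branches runs per call.
function compareTo15(i) {
  if (i < 15) {
    return 'i is smaller than 15';
  } else {
    return 'i is 15 or greater';
  }
}

console.log(compareTo15(10)); // i is smaller than 15
console.log(compareTo15(20)); // i is 15 or greater
```
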

  • nested-if: A nested if is an if statement that is the target of another if or else. In other words, JavaScript allows us to place an if statement inside another if statement.
    Syntax:
if (condition1) 
{
   // Executes when condition1 is true
   if (condition2) 
   {
      // Executes when condition2 is true
   }
}

Example:

<script type="text/javascript"> 

// JavaScript program to illustrate nested-if statement 

var i = 10; 

if (i == 10) { 

    // First nested if statement 
    if (i < 15) 
        document.write("i is smaller than 15<br>"); 

    // Second nested if-else statement 
    // Both are only evaluated because the outer condition (i == 10) is true 
    if (i < 12) 
        document.write("i is smaller than 12 too"); 
    else
        document.write("i is 12 or greater"); 
} 
</script> 

Output:

i is smaller than 15
i is smaller than 12 too
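
A minimal sketch of nesting where the inner condition is only evaluated once the outer one has passed. The describeNested function and its messages are hypothetical:

```javascript
// Hypothetical function: the inner if is reached only when i < 15.
function describeNested(i) {
  if (i < 15) {
    if (i < 12) {
      return 'i is smaller than 12';
    }
    return 'i is between 12 and 14';
  }
  return 'i is 15 or greater';
}

console.log(describeNested(10)); // i is smaller than 12
console.log(describeNested(13)); // i is between 12 and 14
console.log(describeNested(20)); // i is 15 or greater
```
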

  • if-else-if ladder: Here, a program can decide among multiple options. The conditions are evaluated from the top down. As soon as one of the conditions is true, the statement associated with that if is executed, and the rest of the ladder is bypassed. If none of the conditions is true, then the final else statement is executed.
    Syntax:
if (condition)
    statement;
else if (condition)
    statement;
.
.
else
    statement;


Example:

<script type="text/javascript"> 
// JavaScript program to illustrate if-else-if ladder 

var i = 20; 

if (i == 10) 
    document.write("i is 10"); 
else if (i == 15) 
    document.write("i is 15"); 
else if (i == 20) 
    document.write("i is 20"); 
else
    document.write("i is not present"); 
</script> 

Output:

i is 20
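
The ladder can also be written as a function that returns from the first matching branch. This is a sketch only: the describeLadder name is made up, and strict equality (===) is used here as a stylistic choice where the example above uses ==:

```javascript
// Hypothetical function: conditions are tested top down and the
// first match wins; the final else handles everything else.
function describeLadder(i) {
  if (i === 10) {
    return 'i is 10';
  } else if (i === 15) {
    return 'i is 15';
  } else if (i === 20) {
    return 'i is 20';
  } else {
    return 'i is not present';
  }
}

console.log(describeLadder(20)); // i is 20
console.log(describeLadder(7));  // i is not present
```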

Learn NPM - The Node Package Manager for JavaScript

Learn the fundamentals of NPM, the Node Package Manager for JavaScript, in this crash course for beginners. NPM stands for Node Package Manager and is used mainly to download and install JavaScript packages. NPM ships with Node.js by default, so we need to install Node.js first in order to use it.

Free JavaScript Tutorial - NPM for Beginners - Fast Track

NPM stands for Node Package Manager and is used mainly to download and install JavaScript packages. This course is beneficial for any web developer who wants to make life easier by using code that is already written instead of starting from scratch.

NPM ships with Node.js by default, so we need to install Node.js first in order to use it. This is a short course full of useful content and tips.

What you'll learn

  • Students will learn the fundamentals of NPM