The Introduction to Web Scraping with Node JS

What is web scraping?

Web scraping is extracting data from a website. Why would someone want to scrape the web? Here are four examples:

  • Scraping social media sites to find trending data
  • Scraping email addresses from websites that publish public emails
  • Scraping data from another website to use on your own site
  • Scraping online stores for sales data, product pictures, etc.

Warnings

Web scraping is against most websites’ terms of service. Your IP address may be banned from a website if you scrape too frequently or maliciously.

What will we need?

For this project we’ll be using Node.js. If you’re not familiar with Node, check out my 3 Best Node.JS Courses.

We’ll also be using two open-source npm modules to make today’s task a little easier:

  • request-promise: Request is a simple HTTP client that allows us to make quick and easy HTTP calls.
  • cheerio: jQuery for Node.js. Cheerio makes it easy to select, edit, and view DOM elements.

Project Setup

Create a new project folder. Within that folder, create an index.js file. We’ll need to install and require our dependencies. Open up your command line, and install and save request, request-promise, and cheerio:

    npm install --save request request-promise cheerio

Then require them in our index.js file:

    const rp = require('request-promise');
    const cheerio = require('cheerio');

Setting up the Request

request-promise accepts an object as input, and returns a promise. The options object needs to do two things:

  • Pass in the URL we want to scrape.
  • Tell Cheerio to load the returned HTML so that we can use it.

Here’s what that looks like:

    const options = {
      uri: 'https://www.yourURLhere.com',
      transform: function (body) {
        return cheerio.load(body);
      }
    };

The uri key is simply the website we want to scrape. Note that it must be a quoted string. The transform key tells request-promise to take the returned body and load it into Cheerio before returning it to us.

Awesome. We’ve successfully set up our HTTP request options!

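
To make the transform step more concrete, here is a minimal sketch of the flow with no network access and no dependencies. It assumes a hypothetical stand-in called fakeRequest in place of request-promise, and a transform that only wraps the body instead of calling cheerio.load. The names fakeRequest, demoOptions, and demoResult are mine, for illustration only.

```javascript
// Hypothetical stand-in for request-promise, used only to illustrate the flow.
// It never touches the network and resolves with a hard-coded HTML string.
function fakeRequest(options) {
  const body = '<html><body><h1>Hello</h1></body></html>';
  // Hand the body to `transform` (if one was provided) and
  // resolve with whatever the transform returns.
  const result = typeof options.transform === 'function'
    ? options.transform(body)
    : body;
  return Promise.resolve(result);
}

const demoOptions = {
  uri: 'https://www.yourURLhere.com', // placeholder, never actually requested
  // In the real code this would be: return cheerio.load(body);
  // Here the body is only wrapped so the sketch has no dependencies.
  transform: function (body) {
    return { length: body.length, raw: body };
  }
};

const demoResult = fakeRequest(demoOptions);

demoResult
  .then((data) => {
    // `data` is the transformed value, not the raw body string
    console.log(data.length, data.raw);
  })
  .catch((err) => {
    console.log(err);
  });
```

The point of the sketch is that whatever transform returns is what arrives in .then(). With the real options object, that value is the Cheerio instance for the loaded page.
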
Here’s what your code should look like so far:

    const rp = require('request-promise');
    const cheerio = require('cheerio');

    const options = {
      uri: 'https://www.yourURLhere.com',
      transform: function (body) {
        return cheerio.load(body);
      }
    };

Make the Request

Now that the options are taken care of, we can actually make our request. The boilerplate in the documentation for that looks like this:

    rp(OPTIONS)
      .then(function (data) {
        // REQUEST SUCCEEDED: DO SOMETHING
      })
      .catch(function (err) {
        // REQUEST FAILED: ERROR OF SOME KIND
      });

We pass our options object to request-promise, then wait to see whether the request succeeds or fails. If it succeeds, we do something with the returned data; if it fails, we handle the error.

Knowing what the documentation says to do, let’s create our own version:

    rp(options)
      .then(($) => {
        console.log($);
      })
      .catch((err) => {
        console.log(err);
      });

The code is pretty similar. The big difference is that I’ve used arrow functions. I’ve also logged the returned data from our HTTP request.

Let’s test that everything is working so far. Replace the placeholder uri with the website you want to scrape. Then, open up your console and type:

    node index.js

    // LOGS THE FOLLOWING:
    { [Function: initialize]
      fn:
       initialize {
         constructor: [Circular],
         _originalRoot:
          { type: 'root',
            name: 'root',
            namespace: 'http://www.w3.org/1999/xhtml',
            attribs: {},
            ...

If you don’t see an error, then everything is working so far, and you just made your first scrape!

Here is the full code of our boilerplate:

    const rp = require('request-promise');
    const cheerio = require('cheerio');

    const options = {
      uri: 'https://www.yourURLhere.com',
      transform: function (body) {
        return cheerio.load(body);
      }
    };

    rp(options)
      .then(($) => {
        console.log($);
      })
      .catch((err) => {
        console.log(err);
      });
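
One optional addition to the request step, separate from the boilerplate itself: the warnings at the top of this article note that scraping too frequently can get your IP address banned. Below is a minimal sketch of one way to space out requests when you have several pages to fetch. The helper names sleep, scrapeSequentially, and stubFetch are hypothetical, the one-second default delay is an arbitrary assumption, and the stub stands in for a real request so the sketch runs without network access.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch each URI in turn, pausing between requests.
// `fetchPage` is any function that returns a promise for one page, e.g.
// (uri) => rp({ uri, transform: (body) => cheerio.load(body) })
async function scrapeSequentially(uris, fetchPage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < uris.length; i += 1) {
    if (i > 0) {
      await sleep(delayMs); // no pause needed before the first request
    }
    results.push(await fetchPage(uris[i]));
  }
  return results;
}

// Usage with a stubbed fetchPage (assumption: replace with a real request).
const stubFetch = (uri) => Promise.resolve(`fetched ${uri}`);

scrapeSequentially(['https://example.com/a', 'https://example.com/b'], stubFetch, 50)
  .then((pages) => {
    console.log(pages);
  })
  .catch((err) => {
    console.log(err);
  });
```

Requests run one at a time here on purpose. Firing them all at once with Promise.all would be faster, but it is exactly the kind of burst that tends to get a scraper blocked.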

Using the Data

What good is our web scraper if it doesn’t actually return any useful data? This is where the fun begins.

There are numerous things you can do with Cheerio to extract the data that you want. First and foremost, Cheerio’s selector implementation is nearly identical to jQuery’s. So if you know jQuery, this will be a breeze. If not, don’t worry, I’ll show you.

Selectors

The selector method allows you to traverse and select elements in the document. You can get data and set data using a selector. Imagine we have the following HTML in the website we want to scrape:

    <ul>
      <li class="large">New York</li>
      <li id="medium">Portland</li>
      <li class="small">Salem</li>
    </ul>

We can select IDs using (#), classes using (.), and elements by their tag names, e.g. div.

    $('.large').text()          // New York
    $('#medium').text()         // Portland
    $('li[class=small]').html() // Salem
Looping

Just like jQuery, we can also iterate through multiple elements with the each() function. Using the same HTML code as above, we can collect the inner text of each li with the following code:

    const cities = [];
    $('li').each(function (i, elem) {
      cities[i] = $(this).text();
    });
    // cities: ['New York', 'Portland', 'Salem']

Finding

Imagine we have two lists on our web site:

    <ul id="cities">
      <li class="large">New York</li>
      <li id="c-medium">Portland</li>
      <li class="small">Salem</li>
    </ul>
    <ul id="towns">
      <li class="large">Bend</li>
      <li>Hood River</li>
      <li class="small">Madras</li>
    </ul>

We can select each list using its ID, then find the small city/town within each list:

    $('#cities').find('.small').text() // Salem
    $('#towns').find('.small').text()  // Madras

find() searches all descendant DOM elements, not just the immediate children shown in this example.

Children

children() is similar to find(). The difference is that children() only searches the immediate children of the selected element.

    $('#cities').children('#c-medium').text(); // Portland

Text & HTML

Up until this point, all of my examples have used the .text() function, which gets the text of the selected element. You can also use .html() to return the inner HTML of the selected element, or the static $.html() to get the element’s own markup. The selectors below are scoped to #towns because both lists contain a .large item:

    $('#towns .large').text() // Bend
    $('#towns .large').html() // Bend (inner HTML; this li contains only text)
    $.html('#towns .large')   // <li class="large">Bend</li>

Additional Methods

There are more methods than I can count, and all of them are covered in the Cheerio documentation.

Chrome Developer Tools

Don’t forget, the Chrome Developer Tools are your friend. In Google Chrome, you can easily find element, class, and ID names using: CTRL + SHIFT + C

[Image: Finding class names with Chrome dev tools]

As the image above shows, I’m able to hover over an element on the page, and the element name and class name of the selected element are shown in real time!

Limitations

As Jaye Speaks points out: "MOST websites modify the DOM using JavaScript. Unfortunately Cheerio doesn’t resolve parsing a modified DOM. Dynamically generated content from procedures leveraging AJAX, client-side logic, and other async procedures are not available to Cheerio."

Remember, this is an introduction to basic scraping. To get started, you’ll need to find a static website with minimal DOM manipulation.

Go forth and scrape!

Thanks for reading. You should now have the tools necessary to go forth and scrape static websites!
