Avav Smith

Web Scraping With NodeJS and Puppeteer

Introduction

If you’ve checked my profile, you know I’m not in charge of the technical side of Koopol. So this post may seem to come out of nowhere.

Why?

In a sentence, Koopol helps brands and resellers automate online price monitoring. By nature, Koopol is an innovative, technical SaaS solution.

Passionate about innovation and technical matters, I was always a bit frustrated to be out of touch with development, I mean up-to-date development ;-).

That’s why I decided to reopen my programming chapter and improve my technical skills alongside my main activities.

As a project owner, I’m convinced that being aware of, understanding, and anticipating the technical challenges the technical team is facing is crucial for the healthy development of Koopol.

In this article, I would like to show how easy it can be to start web scraping. Don’t be afraid: even if you are not a developer, I’m sure many of you are interested in this important subject.

Web scraping is useful in many situations: as soon as you have to copy and paste data from multiple sources, whether as a business developer, a salesperson, or even a recruiter, the challenge is always the same: gathering the relevant data.

Again, keep in mind that this article is written for non-technical people, like me.

That being said, the following lines walk through how I taught myself to start web scraping.

Here is an overview of what we will cover:

  • Code Editor: Visual Studio Code
  • Programming language: JavaScript (Node.js)
  • Web Scraping library: the famous Puppeteer

Ready? Let’s start.

Code Editor: Visual Studio Code

In this tutorial, I will use Visual Studio Code. You can download the latest version here: https://code.visualstudio.com/

Here are a few advantages of the IDE:

  • Terminal: Visual Studio Code has an integrated terminal, which saves you from constantly switching between the editor and a separate terminal window to run your code. This is efficient.
  • Integrated Git: Visual Studio Code includes Git support to track every change you make to your code. In other words (for beginners), it lets you go back through the code history if you make a mistake and want to undo it.
  • Auto Save: Visual Studio Code can save your code automatically, so if anything goes wrong while you are programming, you can be sure to get your code back and keep the focus on what matters.
  • Extensions: Visual Studio Code supports a huge number of extensions thanks to its large developer community. For instance, you can add syntax-highlighting extensions to help you find your way around your code.

This blog post does not aim to promote this particular editor. If you feel more comfortable with another one, please use it!

Part I — Install the Environment

A. How to install Node.js on a Mac?

Before using Puppeteer, we first need to set up our development environment. Since Puppeteer is a JavaScript library, it needs a Node.js environment to run in. Don’t worry, it only takes a few minutes…

Step 1

Open the Terminal

Step 2.a: If you do have Node.js installed

Enter the following command to check which Node.js version is already installed.

node -v

Note that the command below updates npm itself, not Node.js (to update Node.js, download the latest installer from nodejs.org or use a version manager such as nvm). To update npm, run:

npm i -g npm

If you get a lot of checkPermissions warnings, you may need to run the command as a superuser:

sudo npm i -g npm

In that case, the Terminal will probably ask you to type your password.

Step 2.b: If you do not have Node.js already installed

  1. Go to nodejs.org and download the latest version for macOS. I recommend the LTS version, labeled “Recommended for Most Users”.
  2. When the download finishes, double-click the .pkg file to install it.

[Image: Node.js download page, https://nodejs.org/en/]

3. Go through the entire installation process


4. When the installation is complete, open the Terminal and run the command below to verify that Node.js is installed correctly and to check its version.

node -v

If a version is displayed, you are ready for the next part.

B. How to install the Puppeteer library?

Puppeteer is a Node.js library that lets you control and automate a Chrome browser, but in a headless way. OK, that sounds a bit confusing, so let’s take a moment to unpack it.

Headless Chrome is shipping in Chrome 59. It’s a way to run the Chrome browser in a headless environment. Essentially, running Chrome without chrome! It brings all modern web platform features provided by Chromium and the Blink rendering engine to the command line.

Why is that useful?

A headless browser is a great tool for automated testing and server environments where you don’t need a visible UI shell. For example, you may want to run some tests against a real web page, create a PDF of it, or just inspect how the browser renders an URL.

Source: https://developers.google.com/web/updates/2017/04/headless-chrome

Now you understand the purpose of the Puppeteer library.
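
To make this concrete, once Puppeteer is installed (we’ll do that in a moment), here is a minimal sketch of one of those use cases: saving a page as a PDF with the page.pdf() method (the URL is just an example):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch() // headless by default
  const page = await browser.newPage()
  await page.goto('https://example.com')
  // render the current page to a PDF file (PDF generation only works in headless mode)
  await page.pdf({ path: 'example.pdf', format: 'A4' })
  await browser.close()
})()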

So, how do we install it in the right place? Open the Terminal, navigate to wherever you like (e.g. your Desktop), and create a dedicated directory for our web scraping project:

mkdir project1

Now move into the project1 directory and install Puppeteer there by running the commands below:

cd project1
npm install puppeteer

npm is the package manager that comes bundled with the Node.js we installed earlier. In other words, it manages the Puppeteer installation for you, in the directory where you run it. Running the command above also downloads and bundles a recent version of Chromium that Puppeteer is guaranteed to work with.
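
If you want to check that the installation worked, here is a minimal sketch that launches the bundled Chromium and prints its version. Save it, for instance, as check.js (a hypothetical filename of mine) inside project1 and run node check.js:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch()
  // prints something like "HeadlessChrome/79.0.3945.0"
  console.log(await browser.version())
  await browser.close()
})()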

That’s it.

Now, we can start web scraping.

Part II — Web Scraping

Now, the best part. Let’s start scraping the web. Oh, OK, let’s start with a single page…

Here are our scraping objectives:

  1. Launch Puppeteer and go to a specific product page
  2. Scrape the title, description, prices, and SKU of the product

Step 1 — Starting

Scraping Permissions

Let’s start with a very simple example: https://www.theslanket.com.

Before scraping, make sure the website does not forbid it in its robots.txt file. In our case, let’s check: https://www.theslanket.com/robots.txt

# Hello Robots and Crawlers!  We're glad you are here, but we would
# prefer you not create hundreds and hundreds of carts.
User-agent: *
Disallow: /cgi-bin/UCEditor
Disallow: /cgi-bin/UCSearch
Disallow: /cgi-bin/UCReviewHelpful
Disallow: /cgi-bin/UCMyAccount
Disallow: /merchant/signup/signup2Save.do
Disallow: /merchant/signup/signupSave.do
Crawl-delay: 5
# Sitemap files
Sitemap: https://www.theslanket.com/sitemapsdotorg_index.xml

Our product pages are not disallowed there, so let’s go.
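
By the way, you can also read a robots.txt file programmatically. Here is a minimal sketch using Node’s built-in https module (no extra library needed):

const https = require('https');

// fetch and print the robots.txt file of the site we want to scrape
https.get('https://www.theslanket.com/robots.txt', (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => console.log(body));
});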

Objectives

Let’s consider a random product, like https://www.theslanket.com/shop/the-stroller-slanket/TBS-RUBY-WINE.html

[Image: the product page, with the five elements to scrape highlighted in red]

I highlighted the 5 elements we will scrape in red:

  1. ProductTitle
  2. NormalPrice
  3. DiscountedPrice
  4. ShortDescription
  5. SKU

Step 2 — Scraping

Create a Node.js file

Create a new file and name it SlanketScraping.js. Save it in your project directory, in our case project1.

Create a browser instance

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch()
})()

(Optional) You can pass an options object to puppeteer.launch(). In our case, let’s pass two options.

headless: this option controls whether Chromium is displayed while Puppeteer is browsing. As explained earlier, Puppeteer runs Chrome headless by default. As a beginner, though, I recommend starting with headless: false so you can see what is happening and debug. You can always switch it back to true, and no window will appear.

slowMo: the slow-motion option slows Puppeteer down by the given number of milliseconds per operation. It can be used in many situations; here, it lets us watch what the browser is doing and avoids hammering the server… We will set it to 250 ms (milliseconds). By default, slowMo is 0 ms, i.e. full speed.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
      headless: false,
      slowMo: 250,
  })
})()

Next, we will use the newPage() method on the browser to get a page object. If you are working with headless: false, you will see a new tab appear.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
      headless: false,
      slowMo: 250,
  })
  const page = await browser.newPage()
})()

[Image: What you should see]

Next, we will pass the URL we want to scrape. To do that, let’s call the goto() method on the page object to load the page.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
      headless: false,
      slowMo: 250,
  })
  const page = await browser.newPage()
  await page.goto('https://www.theslanket.com/shop/the-stroller-slanket/TBS-RUBY-WINE.html')

  await browser.close()
})()

Here we launched Puppeteer, navigated to the specific product page we want to scrape, and then closed the browser. At this stage, we haven’t scraped anything yet; we are only browsing.
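
One caveat: if goto() throws (a timeout, a typo in the URL…), the script exits without ever reaching browser.close(), leaving a Chromium process behind. A slightly more defensive variant, as a sketch, wraps the navigation in try/finally:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
      headless: false,
      slowMo: 250,
  })
  try {
    const page = await browser.newPage()
    await page.goto('https://www.theslanket.com/shop/the-stroller-slanket/TBS-RUBY-WINE.html')
  } finally {
    // always close the browser, even if navigation failed
    await browser.close()
  }
})()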

Let’s scrape the 5 elements we described earlier.

Get the page content

Once the page has loaded from the URL, we will use the evaluate() method to run code inside the page and get its content.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
      headless: false,
      slowMo: 250,
  })
  const page = await browser.newPage()
  await page.goto('https://www.theslanket.com/shop/the-stroller-slanket/TBS-RUBY-WINE.html')
  const results = await page.evaluate(() => {
    //... elements to scrape
  })
  await browser.close()
})()

Inside the evaluate() callback, we will target the elements we want to scrape using specific selectors.

Finding the right selectors can sometimes be tricky. If you need more information about selectors, I recommend reading the following documentation on the topic. Trust me, you will need it.

https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector
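
For reference, here are a few common selector patterns you can try directly in the browser console (the names below are generic examples, not all taken from the Slanket page):

document.querySelector('#main')          // first element with id="main"
document.querySelector('.price.sale')    // first element carrying both classes
document.querySelector('div.text')       // first <div> with class "text"
document.querySelector('.widget .text')  // first ".text" nested inside a ".widget"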

Back to our selectors. The best practice I would recommend is to use the Google Chrome console to define and test your selectors. To open the console:

  • open the target URL in Chrome,
  • right-click (or ctrl+click on a Mac) the element you want to scrape; let’s start with the ProductTitle, and
  • select Inspect

A panel called Elements opens on the right side of the page, scrolled to the element you clicked on.

[Image: Google Chrome > Inspect element by clicking on the title]

The title we are looking for, “The Stroller Slanket - Ruby Wine”, sits inside the <div class="text"> element shown in the Elements panel.

So, let’s try to select it directly in the Google Chrome console. Next to the Elements panel, click on the Console tab.

document.querySelector('.text').innerText

The answer is empty: “”.

[Image: The result is empty…]

Damn, we failed.
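
Why did it fail? querySelector() returns only the first element matching the selector, and the page most likely contains an earlier, unrelated element with class text whose innerText is empty. You can verify this in the console by listing every match:

// list all ".text" elements and their (possibly empty) inner text
document.querySelectorAll('.text').forEach((el, i) => {
  console.log(i, JSON.stringify(el.innerText));
});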

Let’s try the parent div, as follows:

document.querySelector('.widget.widget-itemtitle').innerText

[Image: Here is our title!]

So this selector works: it gives us the ProductTitle of the product.

Let’s add the selector to our code and see if Puppeteer can scrape it.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
      headless: false,
      slowMo: 250,
  })
  const page = await browser.newPage()
  await page.goto('https://www.theslanket.com/shop/the-stroller-slanket/TBS-RUBY-WINE.html')
  const results = await page.evaluate(() => {
    // our new selector
    return document.querySelector('.widget.widget-itemtitle').innerText;
  })
  // log the results to the screen
  console.log(results)
  await browser.close()
})()

In the Visual Studio Code terminal, from your project1 directory, run:

node SlanketScraping.js

The terminal should log:

The Stroller Slanket - Ruby Wine

Let’s add the other elements, following the same methodology we used to scrape the title. Since we are scraping several elements, we will return an object containing all five.

Here are the selectors for the five elements:

ProductTitle: document.querySelector('.widget.widget-itemtitle').innerText,
NormalPrice: document.querySelector('.price').innerText,
DiscountedPrice: document.querySelector('.price.sale').innerText,
ShortDescription: document.querySelector('.widget-itemdescription-excerpt').innerText,
SKU: document.querySelector('.widget.widget-itemsku').innerText,
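
One caution before assembling the full script: if a selector matches nothing (for example, .price.sale on a product that is not discounted), querySelector() returns null and reading .innerText throws an error. As a sketch, a more defensive variant uses a small helper that falls back to null (textOf is a hypothetical name of mine, not part of Puppeteer):

// inside page.evaluate(): return null instead of crashing when a selector misses
const textOf = (selector) => {
  const el = document.querySelector(selector);
  return el ? el.innerText : null;
};

// usage: ProductTitle: textOf('.widget.widget-itemtitle'),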

Full Code

Here is the full code of our example.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
      headless: false,
      slowMo: 250,
  })
  const page = await browser.newPage()
  await page.goto('https://www.theslanket.com/shop/the-stroller-slanket/TBS-RUBY-WINE.html')
  const results = await page.evaluate(() => {
    return {
      ProductTitle: document.querySelector('.widget.widget-itemtitle').innerText,
      NormalPrice: document.querySelector('.price').innerText,
      DiscountedPrice: document.querySelector('.price.sale').innerText,
      ShortDescription: document.querySelector('.widget-itemdescription-excerpt').innerText,
      SKU: document.querySelector('.widget.widget-itemsku').innerText,
    }
  })
  // log the results to the screen
  console.log(results)
  await browser.close()
})()
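
If you want to keep the data rather than just print it, a natural next step is writing results to disk with Node’s built-in fs module. As a sketch, insert the following right after the evaluate() call (results.json is just an example filename):

const fs = require('fs');

// serialize the scraped object to a readable JSON file
fs.writeFileSync('results.json', JSON.stringify(results, null, 2));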

Conclusion

This article aimed to walk you through a very first, simple scraping exercise using Puppeteer.

In the future, I will publish more “complex” web scraping missions.

I hope you learned a few things, and that it will help you develop and improve your web scraping skills.

By the way…

At Koopol, we want everybody to be able to scrape. As mentioned, it can be useful in all kinds of projects. So if this interests you, whatever your current job, don’t hesitate to drop us an email (info@koopol.com). We will be more than happy to meet you, and who knows, maybe we’ll work together?

Scrape safely.

#nodejs #javascript #Puppeteer #node-js
