Learn the basics of web scraping in JavaScript and Node.js using Puppeteer in this tutorial.
Welcome to the world of web scraping! Have you ever needed data from a website but found it hard to access in a structured format? This is where web scraping comes in.
Using scripts, we can extract the data we need from a website for various purposes, such as creating databases, doing some analytics, and much more.
Disclaimer: Be careful when doing web scraping. Always make sure you're scraping sites that allow it, and performing this activity within ethical and legal limits.
JavaScript and Node.js offer various libraries that make web scraping easier. For simple data extraction, you can use Axios to fetch API responses or a website's HTML.
But if you're looking to do more advanced tasks, including automation, you'll need libraries such as Puppeteer, Cheerio, or Nightmare (don't worry: the name is Nightmare, but it's not that bad to use 😆).
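To give you an idea of that simpler approach, here's a minimal sketch that uses Axios to download a page's HTML (it assumes you've installed the axios package; since there's no browser involved, content that a site builds with JavaScript won't be rendered):
import axios from "axios";

const getHomePageHtml = async () => {
  // Download the raw HTML of the page (no browser, no JavaScript execution)
  const { data: html } = await axios.get("http://quotes.toscrape.com/");
  console.log(`Fetched ${html.length} characters of HTML`);
};

getHomePageHtml();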
I'll introduce the basics of web scraping in JavaScript and Node.js using Puppeteer in this article. I structured the writing to show you how to fetch information from a website and how to click a button (for example, to move to the next page).
At the end of this introduction, I'll recommend ways to practice and learn more by improving the project we just created.
Before diving in and scraping our first page together using JavaScript, Node.js, and the HTML DOM, I'd recommend having a basic understanding of these technologies. It'll improve your learning and understanding of the topic.
Let's dive in! 🤿
New project...new folder! First, create the first-puppeteer-scraper-example
folder on your computer. It'll contain the code of our future scraper.
mkdir first-puppeteer-scraper-example
Create a new project folder using mkdir
Now, it's time to initialize your Node.js repository with a package.json file. This file holds information about the repository and its npm dependencies, such as the Puppeteer library.
npm init -y
Initialize the package.json file using the npm init command
After typing this command, you should find this package.json
file in your repository tree.
{
  "name": "first-puppeteer-scraper-example",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}
package.json initialized with the npm init -y command
Before proceeding, we must ensure the project is configured to handle ES6 features such as import statements. To do so, add the "type": "module" instruction at the end of the configuration.
{
  "name": "first-puppeteer-scraper-example",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "type": "module"
}
package.json
file after enabling the ES6 features
The last step of our scraper initialization is to install the Puppeteer library. Here's how:
npm install puppeteer
Install Puppeteer with the npm install
command
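Once the installation completes, npm adds Puppeteer to the dependencies section of your package.json. It should look something like this (your exact version number will likely differ):
"dependencies": {
  "puppeteer": "^19.6.2"
}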
Wow! We're there – we're ready to scrape our first website together. 🤩
In this article, we'll use the ToScrape website as our learning platform. This online sandbox provides two projects specifically designed for web scraping, making it a great starting point to learn the basics such as data extraction and page navigation.
For this beginner's introduction, we'll specifically focus on the Quotes to Scrape website.
In the project repository root, you can create an index.js
file. This will be our application entry point.
To keep it simple, our script consists of one function in charge of getting the website's quotes (getQuotes
).
In the function's body, we will need to follow different steps:
- Start a Puppeteer session with puppeteer.launch (it'll instantiate a browser variable that we'll use for manipulating the browser)
- Open a new page with browser.newPage (it'll instantiate a page variable that we'll use for manipulating the page)
- Open the http://quotes.toscrape.com/ website with page.goto
Here's the commented version of the initial script:
import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });
};

// Start the scraping
getQuotes();
What do you think of running our scraper and seeing the output? Let's do it with the command below:
node index.js
Start our Node.js application with the node index.js
command
After doing this, you should have a brand new browser application started with a new page and the website Quotes to Scrape loaded onto it. Magic, isn't it? 🪄
Quotes to Scrape homepage loaded by our initial script
Note: For this first iteration, we're not closing the browser. This means you will need to close the browser to stop the running application.
Whenever you want to scrape a website, you'll have to play with the HTML DOM. What I recommend is to inspect the page and start navigating the different elements to find what you need.
In our case, we'll follow the baby step principle and start fetching the first quote, author, and text.
After browsing the page HTML, we can notice a quote is encapsulated in a <div>
element with a class name quote
(class="quote"
). This is important information because the scraping works with CSS selectors (for example, .quote).
Browser inspector with the first quote <div> selected
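For reference, each quote is rendered in the page's HTML roughly like this (a simplified sketch from inspecting the page; the real markup also carries itemprop attributes and extra links, but the class names are the ones that matter for our selectors):
<div class="quote">
  <span class="text">“The world as we have created it is a process of our thinking. …”</span>
  <span>by <small class="author">Albert Einstein</small></span>
  <div class="tags">
    <a class="tag" href="...">change</a>
    ...
  </div>
</div>
An example of how each quote is rendered in the HTML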
Now that we have this knowledge, we can return to our getQuotes
function and improve our code to select the first quote and extract its data.
We will need to add the following after the page.goto
instruction:
- page.evaluate (it'll execute the function passed as a parameter in the page context and return the result)
- document.querySelector (it'll fetch the first <div> with the classname quote and return it)
- quote.querySelector (it'll extract the elements with the classnames text and author under <div class="quote"> and return them)

Here's the updated version with detailed comments:
import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });

  // Get page data
  const quotes = await page.evaluate(() => {
    // Fetch the first element with class "quote"
    const quote = document.querySelector(".quote");

    // Fetch the sub-elements from the previously fetched quote element
    // Get the displayed text and return it (`.innerText`)
    const text = quote.querySelector(".text").innerText;
    const author = quote.querySelector(".author").innerText;

    return { text, author };
  });

  // Display the quotes
  console.log(quotes);

  // Close the browser
  await browser.close();
};

// Start the scraping
getQuotes();
Something interesting to point out is that the function we use for selecting an element is the same one you can run in the browser inspector. Here's an example:
After running the document.querySelector instruction in the browser inspector, we get the first quote as output (just like in Puppeteer)
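If you want to try it yourself, open the developer console on the Quotes to Scrape page and run the same selector (a quick illustrative check; the exact rendering of the output depends on your browser):
document.querySelector(".quote");
// => <div class="quote">…</div> (the first quote element on the page)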
Let's run our script one more time and see what we have as an output:
{
text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
author: 'Albert Einstein'
}
Output of our script after running node index.js
We did it! Our first scraped element is here, right in the terminal. Now, let's expand it and fetch all the current page quotes. 🔥
Now that we know how to fetch one quote, let's tweak our code a bit to get all the quotes and extract their data one by one.
Previously, we used document.querySelector to select the first matching element (the first quote). To fetch all the quotes, we will need the document.querySelectorAll function instead.
We'll need to follow these steps to make it work:
- Replace document.querySelector with document.querySelectorAll (it'll fetch all <div> elements with the classname quote and return them)
- Use Array.from(quoteList) (it'll ensure the list of quotes is iterable)
- Extract the text and author under <div class="quote"> for each quote

Here's the code update:
import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });

  // Get page data
  const quotes = await page.evaluate(() => {
    // Fetch all elements with class "quote"
    const quoteList = document.querySelectorAll(".quote");

    // Convert the quoteList to an iterable array
    // For each quote fetch the text and author
    return Array.from(quoteList).map((quote) => {
      // Fetch the sub-elements from the previously fetched quote element
      // Get the displayed text and return it (`.innerText`)
      const text = quote.querySelector(".text").innerText;
      const author = quote.querySelector(".author").innerText;

      return { text, author };
    });
  });

  // Display the quotes
  console.log(quotes);

  // Close the browser
  await browser.close();
};

// Start the scraping
getQuotes();
As an end result, if we run our script one more time, we should have a list of quotes as an output. Each element of this list should have a text and an author property.
[
{
text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
author: 'Albert Einstein'
},
{
text: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
author: 'J.K. Rowling'
},
{
text: '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
author: 'Albert Einstein'
},
{
text: '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
author: 'Jane Austen'
},
{
text: "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
author: 'Marilyn Monroe'
},
{
text: '“Try not to become a man of success. Rather become a man of value.”',
author: 'Albert Einstein'
},
{
text: '“It is better to be hated for what you are than to be loved for what you are not.”',
author: 'André Gide'
},
{
text: "“I have not failed. I've just found 10,000 ways that won't work.”",
author: 'Thomas A. Edison'
},
{
text: "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
author: 'Eleanor Roosevelt'
},
{
text: '“A day without sunshine is like, you know, night.”',
author: 'Steve Martin'
}
]
Output of our script after running node index.js
Good job! All the quotes from the first page are now scraped by our script. 👏
Our script is now able to fetch all the quotes for one page. What would be interesting is clicking the "Next" button at the bottom of the page and doing the same on the second page.
"Next" button at the Quotes to Scrape page bottom
Back in our browser inspector, let's find out how we can target this element using CSS selectors.
As we can notice, the next button is placed under an unordered list <ul>
with a pager
classname (<ul class="pager">
). This list has an element <li>
with a next
classname (<li class="next">
). Finally, there is a link anchor <a>
that links to the second page (<a href="/page/2/">
).
In CSS, if we want to target this specific link, there are different ways to do it. We can use:
- .next > a: risky, because if there is another element with .next as a parent containing a link, we'd end up clicking that one instead.
- .pager > .next > a: safer, because we make sure the link is inside the .pager parent element, under the .next element. There is a low risk of having this hierarchy more than once.
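Putting that structure together, the pager markup looks roughly like this (a simplified sketch based on the description above, not copied verbatim from the site):
<ul class="pager">
  <li class="next">
    <a href="/page/2/">Next →</a>
  </li>
</ul>
An example of how the "Next" button is rendered in the HTML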
To click this button, you can add the following at the end of our script, after the console.log(quotes) instruction: await page.click(".pager > .next > a");
Since we're closing the browser with await browser.close(); after all instructions are done, you need to comment out this instruction to see the second page opened in the scraper browser.
It's temporary and for testing purposes, but the end of our getQuotes
function should look like this:
// Display the quotes
console.log(quotes);
// Click on the "Next page" button
await page.click(".pager > .next > a");
// Close the browser
// await browser.close();
After this, if you run our scraper again, the browser should stop on the second page once all instructions have been processed:
Quotes to Scrape second page loaded after clicking the "Next" button
Congrats on reaching the end of this introduction to scraping with Puppeteer! 👏
Now it's your turn to improve the scraper and make it get more data from the Quotes to Scrape website. Here are a few potential improvements you could make: scrape every page by following the "Next" button, extract each quote's tags, or save the results to a file or database.
Feel free to be creative and do anything else you see fit 🚀
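If you want a head start on the first idea, here's one possible sketch of a page loop. It reuses the same selectors we used in this article and keeps clicking "Next" until the button disappears; treat it as a rough draft to adapt rather than a finished solution (the getAllQuotes name and the navigation-waiting logic are just one way to structure it):
import puppeteer from "puppeteer";

// Rough sketch: collect the quotes from every page by following the "Next" button
const getAllQuotes = async () => {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
  const page = await browser.newPage();

  await page.goto("http://quotes.toscrape.com/", { waitUntil: "domcontentloaded" });

  const allQuotes = [];

  while (true) {
    // Scrape the quotes of the current page (same logic as before)
    const quotes = await page.evaluate(() =>
      Array.from(document.querySelectorAll(".quote")).map((quote) => ({
        text: quote.querySelector(".text").innerText,
        author: quote.querySelector(".author").innerText,
      }))
    );
    allQuotes.push(...quotes);

    // Stop when there is no "Next" button anymore (we reached the last page)
    const nextButton = await page.$(".pager > .next > a");
    if (!nextButton) break;

    // Otherwise, click it and wait for the next page to finish loading
    await Promise.all([
      page.waitForNavigation({ waitUntil: "domcontentloaded" }),
      nextButton.click(),
    ]);
  }

  console.log(allQuotes);

  // Close the browser
  await browser.close();
};

// Start the scraping
getAllQuotes();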
Check out the latest version of our scraper on GitHub! You're free to save, fork, or utilize it as you see fit.
=> First Puppeteer Scraper (example)
I hope this article gave you a valuable introduction to web scraping using JavaScript and Puppeteer. Writing this was a pleasure, and I hope you found it informative and enjoyable.
Join me on Twitter for more content like this. I regularly share content to help you grow your web development skills and would love to have you join the conversation. Let's learn, grow, and inspire each other along the way!
Original article source at https://www.freecodecamp.org
#javascript #puppeteer #webscraping #node