Introduction to web scraping with Node.js

An introduction to web scraping with Node.js. Web scraping comes in when we need to collect information from different web pages.

For a long time, whenever I wanted to create websites for practice, I would visit a website, open the console and try to get the content I needed - all this to avoid using lorem ipsum, which I absolutely hate.

A few months ago I heard of web scraping (hey, better late than never, right?), and it seems to do a similar thing to what I tried to do manually.

Today I’m going to explain how to web scrape with Node.

Setting up

We’ll be using three packages to accomplish this.

  • Axios is a “promise based HTTP client for the browser and node.js” and we’ll use it to get the HTML from any chosen website.
  • Cheerio is like jQuery but for the server. We’ll use it as a way to pick content from the Axios results.
  • fs is a node module which we’ll use to write the fetched content into a JSON file.

Let’s start setting up the project. First create a folder, then cd to it in the terminal.

To initialise the project just run npm init and follow the steps (you can just hit enter to everything). When the initial setup is complete you’ll have created a package.json file.

Now we need to install the two packages we listed above:

npm install --save axios cheerio


(Remember fs is already part of node, we do not need to install anything for it)

You’ll see that the above packages are installed under the node_modules directory. They are also listed inside the package.json file.

Our mission is to get the posts we’ve written on dev.to and store them in a JSON file.

Create a JavaScript file in your project folder, call it devtoList.js if you like.

First require the packages we installed

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');


Now let’s get the contents from dev.to

axios.get('https://dev.to/aurelkurtula')
    .then((response) => {
        if (response.status === 200) {
            const html = response.data;
            const $ = cheerio.load(html);
        }
    }, (error) => console.log(error));


In the first line we get the contents from the specified URL. As already stated, axios is promise based, so inside then we check that the response was successful and get the data.

If you console.log response.data you’ll see the HTML markup from the URL. Then we load that HTML into cheerio (jQuery would do this for us behind the scenes). To drive the point home, let’s replace response.data with hard-coded HTML:

const html = '<h3 class="title">I have a bunch of questions on how to behave when contributing to open source</h3>'
// cheerio.load returns the $ function, which we then query like jQuery
const $ = cheerio.load(html)
console.log($('h3').text())


That returns the string without the h3 tag.

Select the content

At this point you would open the console on the website you want to scrape and find the content you need.

Inspecting the page shows that every article has the class single-article, the title is an h3 tag, and the tags are inside an element with the class tags.

axios.get('https://dev.to/aurelkurtula')
    .then((response) => {
        if (response.status === 200) {
            const html = response.data;
            const $ = cheerio.load(html);
            let devtoList = [];
            // One entry per article on the page
            $('.single-article').each(function(i, elem) {
                devtoList[i] = {
                    title: $(this).find('h3').text().trim(),
                    url: $(this).children('.index-article-link').attr('href'),
                    tags: $(this).find('.tags').text().split('#')
                          .map(tag => tag.trim())
                          .filter(function(n){ return n != "" })
                }
            });
        }
    }, (error) => console.log(error));


The above code is easy to follow. We loop through each node with the class single-article. Then we find the only h3, get the text from it and trim() the redundant white space. The url is just as simple: we get the href from the relevant anchor tag.

Getting the tags is simple too. We first get them all as a string (#tag1 #tag2), then we split that string (wherever # appears) into an array. Next we map through each value in the array to trim() the white space, and finally we filter out any empty values (mostly caused by the trimming).
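To see the tag handling on its own, here is a minimal sketch that runs without cheerio. The parseTags helper and the sample string are made up for illustration; the string stands in for whatever $(this).find('.tags').text() might return.

```javascript
// Hypothetical helper: the same split / trim / filter chain as in the scraper.
function parseTags(rawTags) {
  return rawTags
    .split('#')                   // "#node #js" -> ["", "node ", "js"]
    .map((tag) => tag.trim())     // strip the surrounding white space
    .filter((tag) => tag !== ''); // drop the empty entries left by the split
}

// Sample input, assumed for this sketch only.
const tags = parseTags('  #node  #javascript #webdev ');
console.log(tags); // [ 'node', 'javascript', 'webdev' ]
```

The empty strings come from the text before the first # and from any stray white space, which is why the final filter is needed.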

The declaration of an empty array (let devtoList = []) outside the loop allows us to populate it from within.

That would be it. The devtoList array object has the data we scraped from the website. Now we just want to store this data into a JSON file so that we can use it elsewhere.

axios.get('https://dev.to/aurelkurtula')
    .then((response) => {
        if (response.status === 200) {
            const html = response.data;
            const $ = cheerio.load(html);
            let devtoList = [];
            $('.single-article').each(function(i, elem) {
                devtoList[i] = {
                    title: $(this).find('h3').text().trim(),
                    url: $(this).children('.index-article-link').attr('href'),
                    tags: $(this).find('.tags').text().split('#')
                          .map(tag => tag.trim())
                          .filter(function(n){ return n != "" })
                }
            });
            // Drop any empty entries before writing the file
            const devtoListTrimmed = devtoList.filter(n => n != undefined)
            fs.writeFile('devtoList.json',
                          JSON.stringify(devtoListTrimmed, null, 4),
                          (err) => {
                              // Only report success when there was no error
                              if (err) return console.log(err);
                              console.log('File successfully written!');
                          })
        }
    }, (error) => console.log(error));


The original devtoList array object might have empty values, so we just trim them away. Then we use the fs module to write to a file (above I named it devtoList.json), the content of which is the array object converted into JSON.
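The trimming and writing step can be tried in isolation. This is only a sketch under stated assumptions: the scraped array, the compact helper and the temporary file name are all made up, and the file goes to the OS temp directory rather than the project folder.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical scraped data with a hole at index 1, as can happen when
// an array is filled by index and one index is never assigned.
const scraped = [];
scraped[0] = { title: 'First post', tags: ['node'] };
scraped[2] = { title: 'Third post', tags: ['javascript'] };

// Hypothetical helper: remove missing entries before serialising.
function compact(list) {
  return list.filter((item) => item !== undefined && item !== null);
}

// Pretty-print with 4 spaces, as in the article.
const json = JSON.stringify(compact(scraped), null, 4);

// Assumed output location for this sketch.
const outFile = path.join(os.tmpdir(), 'devtoList-example.json');
fs.writeFile(outFile, json, (err) => {
  if (err) {
    console.error('Could not write file:', err.message);
    return;
  }
  console.log('File successfully written to', outFile);
});
```

Without the filter, JSON.stringify would write the missing entry as null, which is usually not what you want in the output file.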

And that’s all it takes!

The code above can be found on GitHub.

Top 7 Most Popular Node.js Frameworks You Should Know

Node.js is an open-source, cross-platform runtime environment that allows developers to run JavaScript outside of a browser. In this post, you'll see seven of the most popular Node frameworks at this point in time (ranked from high to low by GitHub stars).

One of the main advantages of Node is that it enables developers to use JavaScript on both the front-end and the back-end of an application. This not only makes the source code of any app cleaner and more consistent, but it significantly speeds up app development too, as developers only need to use one language.

Node is fast, scalable, and easy to get started with. Its default package manager is npm, which means it also sports the largest ecosystem of open-source libraries. Node is used by companies such as NASA, Uber, Netflix, and Walmart.

But Node doesn't come alone. It comes with a plethora of frameworks. A Node framework can be pictured as the external scaffolding that you can build your app in. These frameworks are built on top of Node and extend the technology's functionality, mostly by making apps easier to prototype and develop, while also making them faster and more scalable.

Below are seven of the most popular Node frameworks at this point in time (ranked from high to low by GitHub stars).

Express

With over 43,000 GitHub stars, Express is the most popular Node framework. It brands itself as a fast, unopinionated, and minimalist framework. Express acts as middleware: it helps set up and configure routes to send and receive requests between the front-end and the database of an app.

Express provides lightweight, powerful tools for HTTP servers. It's a great framework for single-page apps, websites, hybrids, or public HTTP APIs. It supports over fourteen different template engines, so developers aren't locked into a particular one, and it doesn't impose any specific ORM either.

Meteor

Meteor is a full-stack JavaScript platform. It allows developers to build real-time web apps, i.e. apps where code changes are pushed to all browsers and devices in real-time. Additionally, servers send data over the wire, instead of HTML. The client renders the data.

The project has over 41,000 GitHub stars and is built to power large projects. Meteor is used by companies such as Mazda, Honeywell, Qualcomm, and IKEA. It has excellent documentation and a strong community behind it.

Koa

Koa is built by the same team that built Express. It uses ES6 features that allow developers to work without callbacks. Developers also have more control over error handling. Koa has no middleware within its core, which gives developers more control over configuration, but it also means that traditional Node middleware (e.g. req, res, next) won't work with Koa.

Koa already has over 26,000 GitHub stars. The Express developers built Koa because they wanted a lighter framework that was more expressive and more robust than Express. You can find out more about the differences between Koa and Express here.

Sails

Sails is a real-time, MVC framework for Node that's built on Express. It supports auto-generated REST APIs and comes with an easy WebSocket integration.

The project has over 20,000 stars on GitHub and is compatible with almost all databases (MySQL, MongoDB, PostgreSQL, Redis). It's also compatible with most front-end technologies (Angular, iOS, Android, React, and even Windows Phone).

Nest

Nest has over 15,000 GitHub stars. It uses progressive JavaScript and is built with TypeScript, which means it comes with strong typing. It combines elements of object-oriented programming, functional programming, and functional reactive programming.

Nest is packaged in such a way that it serves as a complete development kit for writing enterprise-level apps. The framework uses Express, but is compatible with a wide range of other libraries.

LoopBack

LoopBack is a framework that allows developers to quickly create REST APIs. It has an easy-to-use CLI wizard and allows developers to create models either based on their schema or dynamically. It also has a built-in API explorer.

LoopBack has over 12,000 GitHub stars and is used by companies such as GoDaddy, Symantec, and the Bank of America. It's compatible with many REST services and a wide variety of databases (MongoDB, Oracle, MySQL, PostgreSQL).

Hapi

Similar to Express, hapi serves data by intermediating between server-side and client-side. As such, it can serve as a substitute for Express. Hapi allows developers to focus on writing reusable app logic in a modular and prescriptive fashion.

The project has over 11,000 GitHub stars. It has built-in support for input validation, caching, authentication, and more. Hapi was originally developed to handle all of Walmart's mobile traffic during Black Friday.

Node.js Tutorial for Beginners | Node.js Crash Course | Node.js Certification Training

This course is designed for professionals who aspire to be application developers and gain expertise in building real-time, highly scalable applications in Node.js. The following professionals can go for this course:

Why learn Node.js?

Node.js uses JavaScript - a language known to millions of developers worldwide - thus giving it a much lower learning curve even for complete beginners. Using Node.js you can build simple Command Line programs or complex enterprise level web applications with equal ease. Node.js is an event-driven, server-side, asynchronous development platform with lightning speed execution. Node.js helps you to code the most complex functionalities in just a few lines of code...

A Beginner Guide To Node.js (Basic Introduction To Node.js)

Node.js is a very popular, free and open-source, cross-platform JavaScript runtime for server-side programming, built on Google Chrome’s V8 JavaScript engine. It is used by thousands of developers around the world to develop mobile and web applications. According to the Stack Overflow survey, Node.js was one of the most popular choices for building web applications in 2018.

Introduction

In this article, you will gain a deeper understanding of Node, learn how Node.js works and why it is so popular among developers and startups. It is used not only by startups but also by big companies such as eBay, Microsoft, GoDaddy and PayPal.

Why is Node.js so popular

It is fast, very fast

It’s a JavaScript runtime built on Google Chrome’s V8 JavaScript engine, which means Node.js and the JavaScript executed in your browser run on the same engine. That makes it very fast compared with many other server-side programming languages.

It uses an event-driven, non-blocking model

Node.js uses an event-driven, non-blocking I/O model that makes it very lightweight and efficient.
Now let’s look at that statement in more detail. Here I/O refers to input/output.

Event-driven programming is a paradigm in which the control flow of a program is determined by the occurrence of events. These events are monitored by code known as an event listener. If you come from a JavaScript background then you most probably know what event listeners are. In short, an event listener is a procedure or function that waits for an event to occur. In JavaScript, onload, onclick and onblur are among the most common event listeners.
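On the server side, Node exposes the same idea through its built-in events module. The following is a minimal sketch; the 'order' event name and the orders and log variables are made up for illustration.

```javascript
const EventEmitter = require('events');

const orders = new EventEmitter();
const log = [];

// The listener: a function that waits for the 'order' event to occur.
orders.on('order', (item) => {
  log.push(`received ${item}`);
});

// Nothing has run yet; the listener is only called when the event is emitted.
orders.emit('order', 'coffee');
orders.emit('order', 'tea');

console.log(log); // [ 'received coffee', 'received tea' ]
```

The program never calls the listener directly. It only announces that something happened, and whoever registered an interest reacts to it.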

Blocking I/O takes time and blocks other functions while it runs. Consider a scenario where we want to fetch data from the database for two different users. With blocking I/O we cannot get the data of the second user until the first user's request has completed. Since JavaScript is single-threaded, the alternative would be to start a new thread every time we want to fetch user data. This is where non-blocking I/O comes in.

Example of Blocking I/O operation

const fs = require('fs');
const contents = fs.readFileSync('package.json').toString();
console.log(contents);

In non-blocking I/O operations, you can get the user2 data without waiting for the completion of the user1 request. You can initiate both requests in parallel. Non-blocking I/O eliminates the need for multi-threading, since the system can handle multiple requests at the same time. That is the main reason it is so fast.

Example of Non-blocking I/O operation

const fs = require('fs');
fs.readFile('package.json', function (err, buf) {
    // Stop here if the file could not be read
    if (err) throw err;
    console.log(buf.toString());
});
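To make the difference visible, the sketch below records the order in which things happen during a non-blocking read. The temporary file name and the order array are assumptions for this example. The file is created first so the sketch does not depend on your project layout.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// A small throwaway file (hypothetical name) to read back.
const file = path.join(os.tmpdir(), 'non-blocking-example.txt');
fs.writeFileSync(file, 'hello');

const order = [];

// Start the read; the callback runs later, once the data is available.
fs.readFile(file, 'utf8', (err, text) => {
  if (err) throw err;
  order.push(`callback: read "${text}"`);
  console.log(order);
});

// This line runs right away, before the callback above.
order.push('main: continued without waiting');
```

When the array is finally printed, the "main" entry comes first and the "callback" entry second, because the script carried on while the file was being read.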

Note: You can learn more about the event loop and other things by going through this link.

What is Node Package Manager (NPM)

It is the official package manager for Node. It is bundled with Node and installed automatically when you install Node on your system. It is used to install new packages and manage them in useful ways. NPM installs packages in two modes, local and global. In local mode, NPM installs packages in the node_modules directory of the current working directory, a location owned by the current user. Global packages are installed in the directory where Node is installed, a location owned by the root user.

What is package.json

package.json is a plain JSON text file which manages all the packages you have installed in your Node application. Every Node.js application should have this file in its root directory to describe the application metadata. A simple package.json file looks like the one below:

{
    "name": "codesquery",
    "version": "1.0.0",
    "repository": {
        "type": "git",
        "url": "github_repository_url"
    },
    "dependencies": {
        "async": "0.8.0",
        "express": "4.2.x"
    }
}

In the above file, name and version are mandatory for the package.json file and the rest is optional.
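Because package.json is plain JSON, it can be read and checked like any other JSON document. Here is a small sketch of that idea; the missingRequiredFields helper and the inlined sample file are hypothetical, not part of npm.

```javascript
// A hypothetical package.json, inlined as a string for the example.
const raw = `{
  "name": "codesquery",
  "version": "1.0.0",
  "dependencies": { "express": "4.2.x" }
}`;

// Hypothetical helper: list which of the two mandatory fields are absent.
function missingRequiredFields(pkg) {
  return ['name', 'version'].filter(
    (field) => typeof pkg[field] !== 'string' || pkg[field] === ''
  );
}

const pkg = JSON.parse(raw);
console.log(missingRequiredFields(pkg));              // []
console.log(missingRequiredFields({ name: 'demo' })); // [ 'version' ]
console.log(Object.keys(pkg.dependencies));           // [ 'express' ]
```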

Installing Node.js

  • On Windows, you can install Node.js using the installer provided on the official Node.js website. Follow the installer's instructions and Node.js will be installed on your Windows system.
  • On Linux, you can install Node.js by adding the PPA to your system and then installing Node.js. Run the commands below in the terminal:
sudo apt-get install curl python-software-properties
curl -sL https://deb.nodesource.com/setup_10.x | sudo -E bash -
sudo apt-get install nodejs

  • On macOS, download the macOS installer from the official Node.js website. Then run the installer, accept the license and select the destination.

Test Node.js Installation

You can test the Node.js installation by typing the command below in the terminal:

node -v

If node.js was installed successfully then you will see the installed version of the node in the terminal.

Frameworks and Tools

As Node.js gained popularity among developers, many frameworks were built for it, covering different types of use. Here are some of the best-known Node.js frameworks:

  • Express.js is the most popular framework for Node.js development. Many popular websites are powered by Express.js because it is lightweight.
  • Hapi.js is a powerful and robust framework for developing APIs. It has features like input validation, configuration-based functionality, error handling, caching and logging.
  • Meteor.js is one of the most used frameworks in Node.js web application development. It is backed by a huge community of developers, tutorials and good documentation.
  • Socket.io is used to build real-time web applications such as chat systems and analytics. It allows bi-directional data flow between the web client and the server.
  • Koa.js is another widely used framework for building web applications with Node.js. It is backed by the team behind Express.js. It allows you to ditch callbacks and improve error handling.

Conclusion

Today, Node.js is shaping the future of web and application development. This is just the basics of how Node.js works. If you want to build a scalable web application using Node.js then you need to know more than this.

By now you have got the basic idea of Node.js, and it is time to build something with it. You can start by creating a simple server using Node.js and then connect it to MongoDB to perform basic CRUD operations.
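As a starting point for that first server, here is a minimal sketch using only Node's built-in http module. The handleRequest name, the PORT environment variable and the greeting text are assumptions for this example, and the MongoDB part is left out.

```javascript
const http = require('http');

// Request handler: replies to every request with a plain-text greeting.
function handleRequest(req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello from Node.js\n');
}

const server = http.createServer(handleRequest);

// Listen only when a port is given, e.g. PORT=3000 node server.js
// (the PORT variable and the file name are assumptions for this sketch).
if (process.env.PORT) {
  server.listen(Number(process.env.PORT), () => {
    console.log(`Listening on port ${process.env.PORT}`);
  });
}
```

With the server running, opening http://localhost:3000 in a browser (or whichever port you chose) should show the greeting.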