Create Your First Python Web Crawler Using Scrapy

In this tutorial, the focus will be on one of the best frameworks for web crawling called Scrapy. You will learn the basics of Scrapy and how to create your first web crawler or spider. Furthermore, the tutorial gives a demonstration of extracting and storing the scraped data.

Scrapy is a web crawling framework written in Python that crawls websites and extracts data in an efficient manner.

You can use the extracted data for further processing or data mining, store it in spreadsheets, or feed it into any other business need.

Scrapy Architecture

The architecture of Scrapy contains five main components:

  1. Scrapy Engine
  2. Scheduler
  3. Downloader
  4. Spiders
  5. Item Pipelines

Scrapy Engine

The Scrapy engine is the main component of Scrapy: it controls the data flow between all the other components and triggers events when certain actions occur.

Scheduler

The scheduler receives the requests sent by the engine, queues them, and feeds them back to the engine when it asks for the next request.

Downloader

The objective of the downloader is to fetch all the web pages and send them to the engine. The engine then sends the web pages to the spider.

Spiders

Spiders are the classes you write to parse websites and extract data from them.

Item Pipeline

The item pipeline processes the items after the spiders have extracted them, typically cleaning, validating, and storing the scraped data.
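A minimal pipeline component might look like the sketch below. The class name and the clean-up it performs are purely illustrative, and such a component only runs if it is enabled under ITEM_PIPELINES in settings.py:

class CleanupPipeline:
    # Called once for every item the spiders yield
    def process_item(self, item, spider):
        # Strip surrounding whitespace from every string value in the item
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item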

Installing Scrapy

You can simply install Scrapy along with its dependencies by using the Python Package Manager (pip).

Run the following command to install Scrapy in Windows:

pip install scrapy

However, the official installation guide recommends installing Scrapy in a virtual environment, because Scrapy's dependencies may conflict with other Python system packages, which can affect other scripts and tools.

Therefore, we will create a virtual environment to provide an encapsulated development environment.

In this tutorial, we will install a virtual environment first and then continue with the installation of Scrapy.

1. Install the virtualenv package:

pip install virtualenv

2. On Windows, also install virtualenvwrapper-win:

pip install virtualenvwrapper-win

3. Add the Python Scripts folder to your PATH so you can use the Python commands globally:

set PATH=%PATH%;C:\Users\hp\appdata\local\programs\python\python37-32\scripts

4. Create a virtual environment:

mkvirtualenv ScrapyTut

Here ScrapyTut is the name of our environment.

5. Create your project folder and connect it with the virtual environment:
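For example, assuming a hypothetical folder named ScrapyTut inside your current directory, you would create it and move into it before binding it in the next step:

mkdir ScrapyTut
cd ScrapyTut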

6. Bind the virtual environment to the current working directory:

setprojectdir .

7. If you want to turn off the virtual environment mode, simply use deactivate as below:

deactivate

8. If you want to work on the project again, use the workon command along with the name of your project:

workon ScrapyTut

Now that we have our virtual environment, we can continue with the installation of Scrapy.

pip install scrapy

Note that when installing Twisted, you may encounter an error such as:

Microsoft Visual C++ 14.0 is required

To fix this error, you will have to install the Microsoft Visual C++ Build Tools.

After this installation, if you get another error like the following:

error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\link.exe' failed with exit status 1158

Simply download the wheel for Twisted that matches your version of Python and copy it into your current working directory.

Now run the following command:

pip install Twisted-18.9.0-cp37-cp37m-win32.whl

Now, everything is ready to create our first crawler, so let’s do it.

Create a Scrapy Project

Before writing a Scrapy code, you will have to create a Scrapy project using the startproject command like this:

scrapy startproject myFirstScrapy

That will generate the project directory with the following contents:

The spiders folder contains the spiders.

Here the scrapy.cfg file is the configuration file. Inside the myFirstScrapy folder we will have the following files:
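A typical layout looks like this; the exact files may vary slightly between Scrapy versions:

myFirstScrapy/
    scrapy.cfg            # deploy configuration file
    myFirstScrapy/        # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where your spiders live
            __init__.py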

Create a Spider

After creating the project, navigate to the project directory and generate your spider along with the website URL that you want to crawl by executing the following command:

scrapy genspider jobs www.python.org

The result will be like the following:

Our “jobs” spider folder will be like this:

In the Spiders folder, we can have multiple spiders within the same project.

Now let’s go through the content of our newly created spider. Open the jobs.py file which contains the following code:

import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['www.python.org']
    start_urls = ['http://www.python.org/']

    def parse(self, response):
        pass

Here JobsSpider is a subclass of scrapy.Spider. The name variable is the name of our spider, assigned when the spider was created; it is used to run the spider. The allowed_domains list contains the domains this spider is allowed to crawl.

The start_urls list contains the URLs where the crawling begins, in other words the initial URLs of the crawl. Then we have the parse method, which parses the content of the page.
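For instance, a minimal parse implementation could simply pull the page title out of the response; this is only an illustration, not part of the final spider:

    def parse(self, response):
        # Illustrative only: extract the <title> text from the downloaded page
        title = response.xpath('//title/text()').get()
        yield {'title': title}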

To crawl the jobs page of our site, we need to add one more link to the start_urls property as below:

start_urls = ['http://www.python.org/',
              'https://www.python.org/jobs/']

As we want to crawl more than one page, it is recommended to subclass the spider from the CrawlSpider class instead of scrapy.Spider. For this, you will have to import the following module:

from scrapy.spiders import CrawlSpider

Our class will look like the following:

class JobsSpider(CrawlSpider): …

The next step is to initialize the rules variable. The rules variable defines the navigation rules that will be followed when crawling the site. To use the rules object, import the following class:

from scrapy.spiders import Rule

The rules variable contains one or more Rule objects. Each Rule uses a LinkExtractor to decide which links to follow, so import that class as well:

from scrapy.linkextractors import LinkExtractor

The rules variable will look like the following:

rules = (
    Rule(LinkExtractor(allow=(), restrict_css=('.list-recent-jobs',)),
         callback="parse_item",
         follow=True),
)

Here allow is used to specify which links should be extracted. In our example, however, we restrict extraction with the restrict_css parameter, so only links inside elements with the specified CSS class are extracted.

The callback parameter specifies the method that will be called when parsing the page. The .list-recent-jobs class covers all the jobs listed on the page. You can check the class of an element by right-clicking it on the web page and selecting Inspect.

In the example, we called the spider’s parse_item method instead of parse.

The content of the parse_item method is as follows:

    def parse_item(self, response):
        print('Extracting…' + response.url)

This will print Extracting… along with the URL currently being extracted. For example, a link https://www.python.org/jobs/3698/ is extracted. So on the output screen, Extracting…https://www.python.org/jobs/3698/ will be printed.
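Putting the snippets above together, the jobs spider now looks roughly like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JobsSpider(CrawlSpider):
    name = 'jobs'
    allowed_domains = ['www.python.org']
    start_urls = ['http://www.python.org/',
                  'https://www.python.org/jobs/']

    rules = (
        Rule(LinkExtractor(allow=(), restrict_css=('.list-recent-jobs',)),
             callback="parse_item",
             follow=True),
    )

    def parse_item(self, response):
        # Print the URL of every job page the crawler visits
        print('Extracting…' + response.url)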

To run the spider, navigate to your project folder and type in the following command:

scrapy crawl jobs

The output will be like the following:

In this example, we set follow=True, which means the crawler keeps following links that match the rule until no more matching links are found, that is, when the list of jobs ends.

If you want to see only the print output, without Scrapy's log messages, you can use the following command:

scrapy crawl --nolog jobs

The output will be like the following:

Congratulations! You’ve built your first web crawler.

Scrapy Basics

Now that we can crawl web pages, let's play with the crawled content a little.

Selectors

You can use selectors to select parts of the data from the crawled HTML. Selectors extract data from HTML using XPath and CSS expressions through response.xpath() and response.css() respectively, just as we used a CSS class to select data in the previous example.

Consider the following example, where we declare a string with HTML tags. Using the Selector class, we extract the text of the h1 tag with Selector.xpath:

>>> from scrapy.selector import Selector
>>> body = '<html><body><h1>Heading 1</h1></body></html>'
>>> Selector(text = body).xpath('//h1/text()').get()
'Heading 1'
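The same text can be extracted with a CSS selector instead of XPath:

>>> Selector(text = body).css('h1::text').get()
'Heading 1'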

Items

Scrapy can return the extracted data as plain Python dicts.

In addition, Scrapy provides the Item class. We can use item objects as containers for the scraped data.

Items provide a simple syntax to declare fields. The syntax is like the following:

>>> import scrapy
>>> class Job(scrapy.Item):
...     company = scrapy.Field()
...

The Field object specifies the metadata for each field.
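For example, you can attach metadata such as a serializer when declaring a field; the serializer key shown here is one that Scrapy's feed exporters understand:

>>> class Job(scrapy.Item):
...     # any keyword arguments passed to Field() are stored as field metadata
...     company = scrapy.Field(serializer=str)
...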

You may notice when the Scrapy project is created, an items.py file is also created in our project directory. We can modify this file to add our items as follows:

import scrapy


class MyfirstscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    location = scrapy.Field()
    # url is needed later, when we also store the crawled URL in the item
    url = scrapy.Field()

Here we have added the fields we need. You can use this class from your spider file to initialize the items as follows:

    def parse_item(self, response):
        # MyfirstscrapyItem comes from the project's items module (myFirstScrapy.items)
        locations = response.css('.text > .listing-company > .listing-location > a::text').extract()
        for x in locations:
            item = MyfirstscrapyItem()
            item['location'] = x
            yield item

In the above code, we have used the css method of response to extract the data.

On our web page, we have a div with the class text. Inside this div there is a heading with the class listing-company; inside this heading there is a span tag with the class listing-location; and finally there is an a tag that contains some text. This text is extracted using the extract() method.

Finally, we will loop through all the items extracted and call the items class.

Instead of doing all this in the crawler, we can also test our crawler by using only one statement while working in the Scrapy shell. We will demonstrate Scrapy shell in a later section.

Item Loaders

The items scraped by the spider are populated using an Item Loader. You can use item loaders to extend the parsing rules.

After extracting items, we can populate the items in the item loader with the help of selectors.

The syntax for Item loader is as follows:

from scrapy.loader import ItemLoader
from jobs.items import Job

def parse(self, response):
    l = ItemLoader(item=Job(), response=response)
    l.add_xpath('company', '//li[@class="listing-company"]/text()')
    return l.load_item()
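Item loaders become more useful once you add input and output processors. The following sketch assumes the Job item from above; note that newer Scrapy versions import the processors from itemloaders.processors instead of scrapy.loader.processors:

from scrapy.loader import ItemLoader
# in newer Scrapy versions: from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader.processors import MapCompose, TakeFirst

from jobs.items import Job


class JobLoader(ItemLoader):
    default_item_class = Job
    # strip whitespace from every value as it is collected
    default_input_processor = MapCompose(str.strip)
    # keep only the first matching value instead of a list
    default_output_processor = TakeFirst()


def parse(self, response):
    l = JobLoader(response=response)
    l.add_css('company', '.listing-company::text')
    return l.load_item()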

Scrapy Shell

Scrapy shell is a command-line tool that lets developers test the parser without going through the crawler itself. With Scrapy shell, you can debug your code easily. The main purpose of Scrapy shell is to test the data extraction code.

We use the Scrapy shell to test the data extracted by CSS and XPath expression when performing crawl operations on a website.

You can activate Scrapy shell from the current project using the shell command:

scrapy shell

If you want to parse a specific web page, use the shell command along with the link of the page:

scrapy shell https://www.python.org/jobs/3659/

To extract the location of the job, simply run the following command in the shell:

response.css('.text > .listing-company > .listing-location > a::text').extract()

The result will be like this:

Similarly, you can extract any data from the website.

To get the current working URL, you can use the command below:

response.url

This is how you extract all the data in Scrapy. In the next section, we will save this data into a CSV file.

Storing the data

Let's use response.css in our actual crawler code. We will store the value returned by this statement in a variable and then store it in the item. Replace the parse_item method (the callback named in our Rule) with the following code:

    def parse_item(self, response):
        location = response.css('.text > .listing-company > .listing-location > a::text').extract()
        item = MyfirstscrapyItem()
        item['location'] = location
        item['url'] = response.url
        yield item

Here we stored the result of response.css in a variable called location. Then we assigned this value to the location field of the item, a MyfirstscrapyItem instance, along with the page URL.

Execute the following command to run your crawler and store the result into a CSV file:

scrapy crawl jobs -o ScrappedData.csv

This will generate a CSV file in the project directory.
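Scrapy's feed exports support other formats as well; for example, you can write the same data to a JSON file just by changing the output file extension:

scrapy crawl jobs -o ScrappedData.json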

Scrapy is a very easy framework for crawling web pages, and this was just the beginning. If you liked the tutorial and are hungry for more, tell us in the comments below which Scrapy topic you would like to read about next.

Further reading:

Extracting data from various sheets with Python

Guide to R and Python in a Single Jupyter Notebook

Getting Started With RabbitMQ: Python

Set up file uploads to S3 via Django in 10 minutes

Positional-only arguments in Python

Post Multipart Form Data in Python with Requests: Flask File Upload Example

Get Modular with Python Functions

Six Python Tips for Beginners

Web Scraping 101 in Python

5 Python Frameworks You Should Learn 2019
