Hands-on with Scrapy

Having covered the theoretical aspects of Scrapy in part 1, it’s now time for some practical examples. I will put that theory to work in three examples of increasing complexity:

  • A single request & response: extracting a city’s weather from a weather site
  • Multiple requests & responses: extracting book details from a dummy online book store
  • Image scraping

You can download these examples from my GitHub page. This is the second part of a four-part tutorial series on web scraping using Scrapy and Selenium.

Important note:

Before you try to scrape any website, please go through its robots.txt file. You can access it by appending /robots.txt to the site’s domain, e.g. www.google.com/robots.txt. There you will see which paths the site allows and disallows crawlers to access. You should only request pages that the rules under User-agent: * permit, i.e. those covered by Allow: directives and not excluded by Disallow: directives.
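Python’s standard library can parse these rules for you. The sketch below uses urllib.robotparser on an illustrative set of rules (not any real site’s file) to check whether a URL may be fetched and what crawl delay applies:

```python
# Sketch: checking robots.txt rules with Python's standard library.
# The rules below are illustrative, not taken from any real site.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# Disallowed path: crawlers must not fetch it
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
# Any path not disallowed is allowed
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
# Minimum seconds to wait between requests
print(rp.crawl_delay("*"))                                    # 10
```

In practice you would call rp.set_url("https://example.com/robots.txt") followed by rp.read() to fetch the live file instead of parsing a hard-coded list.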


Example 1 — Handling single request & response by extracting a city’s weather from a weather site

Our goal for this example is to extract today’s weather report for the city of Chennai from weather.com. The extracted data must contain the temperature, air quality and condition/description. You are free to choose your own city; just provide the URL of your city’s page in the spider’s code. As pointed out earlier, the site allows data to be scraped provided there is a crawl delay of at least 10 seconds, i.e. you have to wait at least 10 seconds before requesting another URL from weather.com. This is stated in the site’s robots.txt:

User-agent: *
Crawl-delay: 10

I have created a new Scrapy project using the scrapy startproject command and created a basic spider using

scrapy genspider -t basic weather_spider weather.com
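This command generates a spider skeleton in the project’s spiders directory, roughly like the following (the class name is derived from the weather_spider argument):

```python
import scrapy


class WeatherSpiderSpider(scrapy.Spider):
    name = 'weather_spider'
    allowed_domains = ['weather.com']
    start_urls = ['http://weather.com/']

    def parse(self, response):
        # Scrapy calls this callback with the response
        # downloaded from each URL in start_urls
        pass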

The first task when starting to code is to comply with the site’s policy. To honour weather.com’s crawl delay, we need to add the following line to our Scrapy project’s settings.py file:

DOWNLOAD_DELAY = 10

This line makes every spider in our project wait at least 10 seconds before making a new URL request. We can now start to code our spider.
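To make the goal concrete, here is a minimal sketch of what the spider could look like. The CSS selectors and the city URL are placeholder assumptions, not weather.com’s actual markup; inspect the live page with your browser’s developer tools to find the real class names:

```python
import scrapy


class WeatherSpider(scrapy.Spider):
    name = 'weather_spider'
    allowed_domains = ['weather.com']
    # Replace with the URL of your own city's page on weather.com
    start_urls = ['https://weather.com/weather/today/l/<your-city-code>']

    def parse(self, response):
        # These selectors are hypothetical: substitute the real
        # class names you find by inspecting the page source
        yield {
            'temperature': response.css('span.temperature::text').get(),
            'air_quality': response.css('span.air-quality::text').get(),
            'condition': response.css('div.condition::text').get(),
        }
```

Yielding a plain dict from parse is enough for Scrapy to treat it as a scraped item, which you can export with, for example, scrapy crawl weather_spider -o weather.json.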
