Scraping with Scrapy and Django Integration

Create a Django project, with admin and database. Create an app and add it to installed apps. Define the data structure, i.e. the item, which here is our Django model. Install Scrapy.

What is scraping?

Scraping, also known as web data extraction, web harvesting, or data mining, is the automated process of retrieving any wanted information (e.g. images, text, videos) from a web page by simulating human interaction with it. This is done by a piece of software called a scraper.

The scraper makes a GET request to a website and parses the HTML response. It then searches the HTML for the data required and repeats the process until we have collected all the data we want.
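That request-and-parse loop can be sketched with just the Python standard library. The tag name and sample HTML below are made up for illustration; in a real scraper the HTML would come from a GET request (e.g. via `urllib.request.urlopen(url).read()`):

```python
from html.parser import HTMLParser

# A toy parser that collects the text of every <h3> tag,
# standing in for "searching for the data required within the html"
class TeamNameParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.names.append(data.strip())

# In a real scraper this string would be the body of the HTTP response
html = "<div><h3>Ada</h3><h3>Grace</h3></div>"
parser = TeamNameParser()
parser.feed(html)
print(parser.names)  # ['Ada', 'Grace']
```

Libraries like Scrapy or Beautiful Soup do exactly this, but with far more convenient selector APIs than a hand-rolled parser.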

It is useful for quickly accessing and analysing large amounts of data, which can be stored in a CSV file, or in a database depending on our needs!
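As a toy illustration of the "store in a CSV file" option, again with only the standard library (the field names and sample data are made up for the example):

```python
import csv
import tempfile

# Rows a scraper might have collected (made-up sample data)
rows = [
    {"name": "Ada", "fun_fact": "Wrote the first program"},
    {"name": "Grace", "fun_fact": "Coined the term 'debugging'"},
]

# Write the scraped rows out as CSV
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "fun_fact"])
    writer.writeheader()
    writer.writerows(rows)
    path = f.name

# Read them back to check the round trip
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["name"])  # Ada
```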

There are many reasons to do web scraping, such as lead generation and market analysis. However, when scraping websites, you must always be careful not to violate the terms and conditions of the sites you are scraping, or to violate anyone’s privacy. This is why scraping is thought to be a little controversial, but it can save many hours of searching through sites and logging the data manually. All of those hours saved mean a great deal of money saved.

There are many different libraries that can be used for web scraping, e.g. Selenium and PhantomJS. In Ruby you can also use the Nokogiri gem to write your own Ruby-based scraper. Another popular option is Beautiful Soup, which is well known among Python devs.

At Theodo, we needed a web scraping tool with the ability to follow links, and as Python developers the solution we opted for was the Django framework combined with an open source web scraping framework called Scrapy.

Scrapy and Django

Scrapy allows us to define data structures and write data extractors, and comes with built-in CSS and XPath selectors that we can use to extract the data, the Scrapy shell, and built-in JSON, CSV, and XML output. There is also a built-in FormRequest class which allows you to mock a login and is easy to use out of the box.

Websites tend to have countermeasures against excessive requests, so by default Scrapy randomises the time between requests, which can help avoid getting banned. Scrapy can also be used for automated testing and monitoring.
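This throttling is controlled by settings; a minimal sketch of the relevant lines in a project's settings file (the delay value here is illustrative):

```python
# scraper/settings.py -- values are illustrative
DOWNLOAD_DELAY = 2  # base delay, in seconds, between requests to the same site
# With this enabled (the default), Scrapy waits between 0.5x and 1.5x
# DOWNLOAD_DELAY, so the interval between requests is randomised
RANDOMIZE_DOWNLOAD_DELAY = True
```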

Django has an integrated admin which makes it easy to access the database, along with easy filtering and sorting of data, and an import/export library that allows us to export the data.

Scrapy also used to have a built-in class called DjangoItem, which now lives in an easy-to-use external library. The DjangoItem library provides us with an item class that uses the fields defined in a Django model, just by specifying which model it is related to. The class also provides a method to create and populate the Django model instance with the item data from the pipeline. This library allows us to integrate Scrapy and Django easily and means we can also have access to all the data directly in the admin!

So what happens?


Let’s start from the spider. Spiders are the core of the scraper: they make the requests to our defined URLs, parse the responses, and extract information from them to be processed in the items.

Scrapy has a start_requests method which generates a request for each URL. When Scrapy fetches a website according to the request, it passes the response to the callback method specified in the request object. The callback method can generate an item from the response data or generate another request.

What happens behind the scenes? Every time we start a Scrapy task, we start a crawler to do it. The spider defines how to perform the crawl (i.e. which links to follow). The crawler has an engine to drive its flow. When a crawler starts, it will get the spider from its queue, which means a crawler can have more than one spider. The next spider will then be started by the crawler and scheduled to crawl the webpage by the engine. The engine's middlewares drive the flow of the crawler; the middlewares are organised in chains to process requests and responses.
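The idea of middlewares organised in chains can be illustrated in plain Python. This is a sketch of the pattern only, not Scrapy's actual middleware API; the function names and request fields are made up:

```python
# Each middleware takes a request, modifies it, and passes it on
def add_user_agent(request):
    request.setdefault("headers", {})["User-Agent"] = "my-bot"
    return request

def add_timeout(request):
    request["timeout"] = 10
    return request

# The chain: every request flows through each middleware in order
REQUEST_MIDDLEWARES = [add_user_agent, add_timeout]

def process_request(request):
    for middleware in REQUEST_MIDDLEWARES:
        request = middleware(request)
    return request

request = process_request({"url": "https://example.com"})
print(request["headers"])  # {'User-Agent': 'my-bot'}
```

Scrapy runs a similar chain for responses on the way back, which is how features like the randomised delay and retry logic hook into the flow.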


Selectors can be used to parse a web page and generate an item. They select parts of the HTML document specified by either XPath or CSS expressions. XPath selects nodes in XML documents (and can also be used on HTML documents), while CSS is a language for applying styles to HTML documents; CSS selectors use the HTML classes, ids and tag names to select the data within the tags. Behind the scenes, Scrapy uses the cssselect library to transform these CSS selectors into XPath selectors.

CSS vs Xpath

data = response.css("div.team-popup-wrap.st-about-employee-pop-up")
data = response.xpath("//div[@class='team-popup-wrap st-about-employee-pop-up']")
Short but sweet: when dealing with classes, ids and tag names, use CSS selectors; if you have no class name and just know the content of the tag, use XPath selectors. Either way, Chrome dev tools can help: you can copy an element’s unique CSS selector or copy its XPath selector. This gives you a basis, though you may have to tweak it! Two more helper tools are the XPath Helper extension and an XPath cheatsheet. Selectors are also chainable.

Items and Pipeline

Items produce the output: they are used to structure the data parsed by the spider. The Item Pipeline is where the data is processed once the items have been extracted from the spiders. Here we can run tasks such as validation and storing items in a database.

How I did it

Here’s an example of how we can integrate Scrapy and Django. (This tutorial uses Scrapy 1.5.1, djangoitem 1.1.1 and Django 2.1.4.)

Let’s scrape the data off the Theodo UK Team Page and integrate it into a Django Admin Panel:

  1. Generate a Django project with integrated admin and database
  2. Create an app and add it to installed apps
  3. Define the data structure, i.e. the item, which here is our Django model:

      from django.db import models

      class TheodoTeam(models.Model):
          name = models.CharField(max_length=150)
          image = models.CharField(max_length=150)
          fun_fact = models.TextField(blank=True)

          class Meta:
              verbose_name = "theodo UK team"

  4. Install Scrapy
  5. Run

      scrapy startproject scraper

  6. Connect the two using DjangoItem:

      from scrapy_djangoitem import DjangoItem
      from theodo_team.models import TheodoTeam

      class TheodoTeamItem(DjangoItem):
          django_model = TheodoTeam

  7. The Spider – Spiders have a start_urls class attribute which takes a list of URLs. The URLs are then used by the start_requests method to create the initial requests for your spider. Then, using the response and selectors, select the data required.

      import scrapy
      from scraper.items import TheodoTeamItem

      class TheodoSpider(scrapy.Spider):
          name = "theodo"
          start_urls = [""]

          # this is what start_urls does behind the scenes
          # def start_requests(self):
          #     urls = ['',]
          #     for url in urls:
          #         yield scrapy.Request(url=url, callback=self.parse)

          def parse(self, response):
              data = response.css("div.team-popup-wrap.st-about-employee-pop-up")

              for line in data:
                  item = TheodoTeamItem()
                  item["name"] = line.css("div.h3 h3::text").extract_first()
                  item["image"] = line.css("img.img-team-popup::attr(src)").extract_first()
                  item["fun_fact"] = line.css("div.p-small p::text").extract().pop()
                  yield item

  8. Pipeline – use it to save the items to the database. DjangoItem provides a save method that creates and saves the model instance for us:

      class TheodoTeamPipeline(object):
          def process_item(self, item, spider):
              item.save()
              return item

  9. Activate the Item Pipeline component in scraper/settings.py – the integer value represents the order in which pipelines run when there is more than one:

      ITEM_PIPELINES = {"scraper.pipelines.TheodoTeamPipeline": 300}

  10. Create a Django command to run the Scrapy crawl – this initialises Django in the scraper and is needed to be able to access Django in the spider. It lives in the app's management/commands/ directory (the file name, e.g. crawl.py, becomes the command name):

      from django.core.management.base import BaseCommand
      from scraper.spiders import TheodoSpider
      from scrapy.crawler import CrawlerProcess
      from scrapy.utils.project import get_project_settings

      class Command(BaseCommand):
          help = "Release the spiders"

          def handle(self, *args, **options):
              process = CrawlerProcess(get_project_settings())
              process.crawl(TheodoSpider)
              process.start()

  11. Run the crawl command to save the items to the database

Project Structure:


Challenges and problems encountered:

Selectors!! Selectors are not one size fits all. Different selectors are needed for every website, and if the layout changes constantly, they require upkeep. It can also be difficult to find all the required data without manipulating it. This occurs when tags have no class name or when data is not consistently stored in the same tag.

An example of how complicated selectors can get:
    segments = response.css("tr td[rowspan]")

    for segment in segments:
        rowspan = int(segment.css("::attr(rowspan)").extract_first())
        all_td_after_segment = segment.xpath(
            "./../following-sibling::tr[position()<={}]/td".format(rowspan - 1)
        )

        line = all_td_after_segment[0]
        data = line.xpath("./descendant-or-self::a/text()")
        more_data = line.xpath(
            "substring-after(substring-before(./strong/text()[preceding-sibling::a], '%'), '\xa0')"
        ).extract_first()
As you can see, setting up the scraper is not the hard part! I think integrating Scrapy and Django is a desirable, efficient and speedy solution to be able to store data from a website into a database.

Originally published by *Henriette Brand*.

