Scraping with Scrapy and Django Integration

What is scraping?

Scraping, also known as web data extraction, web harvesting, or even spying, is the automated extraction of data from websites. It is done by a scraper: software that simulates human interaction with a web page to retrieve any wanted information (e.g. images, text, videos).

The scraper makes a GET request to a website and parses the HTML response. It then searches that HTML for the data it needs and repeats the process until all the wanted data has been collected.
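To make that loop concrete, here is a minimal, dependency-free sketch of the "parse the HTML and search for the data" step, using only Python's standard library. The inline snippet stands in for a real GET response body, and the markup and names are invented for illustration:

```python
from html.parser import HTMLParser

# Invented page snippet standing in for a real GET response body.
HTML = """
<html><body>
  <div class="team-member"><h3>Ada</h3></div>
  <div class="team-member"><h3>Grace</h3></div>
</body></html>
"""

class TeamParser(HTMLParser):
    """Collects the text content of every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        # Only keep non-whitespace text that sits inside an <h3>.
        if self.in_h3 and data.strip():
            self.names.append(data.strip())

parser = TeamParser()
parser.feed(HTML)
print(parser.names)  # ['Ada', 'Grace']
```

A real scraper would fetch each page over HTTP and feed it into a parser like this; frameworks such as Scrapy wrap this whole request-parse-repeat cycle for you.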

It is useful for quickly accessing and analysing large amounts of data, which can be stored in a CSV file, or in a database depending on our needs!

There are many reasons to do web scraping, such as lead generation and market analysis. However, when scraping websites, you must always be careful not to violate the terms and conditions of the sites you are scraping, or anyone’s privacy. This is why scraping is considered a little controversial, but it can save many hours of searching through sites and logging data manually, and all of those hours saved mean a great deal of money saved.

There are many different libraries that can be used for web scraping, e.g. Selenium and PhantomJS. In Ruby you can also use the Nokogiri gem to write your own Ruby-based scraper. Another popular library is Beautiful Soup, which is well known among Python devs.

At Theodo, we needed a web scraping tool with the ability to follow links, and as Python developers the solution we opted for was the Django framework combined with an open source web scraping framework called Scrapy.

Scrapy and Django

Scrapy allows us to define data structures and write data extractors, and comes with built-in CSS and XPath selectors for extracting the data, an interactive shell (the Scrapy shell), and built-in JSON, CSV, and XML output. There is also a built-in FormRequest class which allows you to mock a login and is easy to use out of the box.

Websites tend to have countermeasures against excessive requests, so Scrapy randomises the delay between requests by default, which can help you avoid getting banned. Scrapy can also be used for automated testing and monitoring.
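For illustration, these delays are controlled from the Scrapy project's settings.py; the values below are just an example (DOWNLOAD_DELAY defaults to 0, and RANDOMIZE_DOWNLOAD_DELAY is on by default):

```python
# scraper/settings.py (illustrative values)

# Base delay, in seconds, between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# With this enabled (the default), Scrapy waits between 0.5x and 1.5x
# DOWNLOAD_DELAY, so requests do not arrive at a robotic fixed rhythm.
RANDOMIZE_DOWNLOAD_DELAY = True
```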

Django has an integrated admin which makes it easy to access the database. That comes along with easy filtering and sorting of data, and an import/export library that allows us to export data.

Scrapy also used to have a built-in class called DjangoItem, which now lives in an easy-to-use external library. The DjangoItem library gives us an item class that uses the fields defined in a Django model, just by specifying which model it relates to. The class also provides a save method to create and populate the Django model instance with the item data from the pipeline. This library allows us to integrate Scrapy and Django easily and means we also get access to all the data directly in the admin!

So what happens?

Spiders

Let’s start with the spider. Spiders are the core of the scraper: they make the requests to our defined URLs, parse the responses, and extract the information from them to be processed in the items.

Scrapy has a start_requests method which generates a request for each URL. When Scrapy has fetched a website, it passes the response to the callback method specified in the request object. The callback method can generate an item from the response data or generate another request.

What happens behind the scenes? Every time we start a Scrapy task, we start a crawler to do it. The spider defines how to perform the crawl (i.e. which links to follow). The crawler has an engine that drives its flow. When a crawler starts, it gets a spider from its queue, which means a crawler can have more than one spider. The engine then schedules the spider’s requests and drives the crawl of the webpages. The middlewares, organised in chains, process the requests and responses as they pass through the engine.
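As a sketch of what one link in such a chain looks like: a downloader middleware is just a class exposing process_request and process_response hooks, called in the order configured in DOWNLOADER_MIDDLEWARES. The class below is an illustrative example (not part of the tutorial project) that rotates the User-Agent header per request:

```python
import random

class RotateUserAgentMiddleware:
    """Illustrative downloader middleware: pick a User-Agent per request."""

    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
    ]

    def process_request(self, request, spider):
        # Mutate the outgoing request; returning None lets the chain continue.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None

    def process_response(self, request, response, spider):
        # Responses pass through this link unchanged.
        return response
```

It would be enabled with something like `DOWNLOADER_MIDDLEWARES = {"scraper.middlewares.RotateUserAgentMiddleware": 400}` in the Scrapy settings.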

Selectors

Selectors can be used to parse a web page and pick out the data for an item. They select parts of the HTML document specified by either XPath or CSS expressions. XPath is a language for selecting nodes in XML documents (it can also be used on HTML documents), while CSS is a language for applying styles to HTML documents; CSS selectors use tag names and the HTML class and id attributes to select the data within the tags. In the background, Scrapy uses the cssselect library to transform these CSS selectors into XPath selectors.

CSS vs Xpath

data = response.css("div.st-about-employee-pop-up") 
data = response.xpath("//div[@class='team-popup-wrap st-about-employee-pop-up']")

Short but sweet: when dealing with classes, ids and tag names, use CSS selectors. If you have no class name and only know the content of the tag, use XPath selectors. Either way, Chrome dev tools can help: you can copy an element’s unique CSS selector, or copy its XPath selector. This gives you a starting point, though you may have to tweak it! Two more helpful tools are [**XPath helper**](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl "**XPath helper**") and [**this**](https://devhints.io/xpath "**this**") cheatsheet. Selectors are also chainable.

Items and Pipeline

Items produce the output. They are used to structure the data parsed by the spider. The Item Pipeline is where the data is processed once the items have been extracted from the spiders. Here we can run tasks such as validation and storing items in a database.

How I did it

Here’s an example of how we can integrate Scrapy and Django. (This tutorial uses Scrapy 1.5.1, djangoitem 1.1.1, and Django 2.1.4.)

Let’s scrape the data off the Theodo UK Team Page and integrate it into a Django Admin Panel:

  1. Install Django
  2. Generate a Django project with integrated admin and database
  3. Create an app and add it to INSTALLED_APPS
  4. Define the data structure, i.e. the item, i.e. our Django model.
## models.py
      from django.db import models

      class TheodoTeam(models.Model):
        name = models.CharField(max_length=150)
        image = models.CharField(max_length=150)
        fun_fact = models.TextField(blank=True)

        class Meta:
            verbose_name = "theodo UK team"

  5. Install Scrapy

  6. Run:

    scrapy startproject scraper

   7. Connect using DjangoItem
## items.py
      from scrapy_djangoitem import DjangoItem
      from theodo_team.models import TheodoTeam

      class TheodoTeamItem(DjangoItem):
        django_model = TheodoTeam

  8. The Spider – Spiders have a start_urls class attribute which takes a list of URLs. The URLs will then be used by the start_requests method to create the initial requests for your spider. Then, using the response and selectors, select the data required.
    import scrapy
    from scraper.items import TheodoTeamItem

    class TheodoSpider(scrapy.Spider):
      name = "theodo"
      start_urls = ["https://www.theodo.co.uk/team"]

      # this is what start_urls does
      # def start_requests(self):
      #     urls = ['https://www.theodo.co.uk/team',]
      #     for url in urls:
      #       yield scrapy.Request(url=url, callback=self.parse)

      def parse(self, response):
          data = response.css("div.st-about-employee-pop-up")

          for line in data:
              item = TheodoTeamItem()
              item["name"] = line.css("div.h3 h3::text").extract_first()
              item["image"] = line.css("img.img-team-popup::attr(src)").extract_first()
              item["fun_fact"] = line.css("div.p-small p::text").extract().pop()
              yield item

  9. Pipeline – use it to save the items to the database
## pipelines.py
    class TheodoTeamPipeline(object):
      def process_item(self, item, spider):
          item.save()
          return item

  10. Activate the Item Pipeline component – the integer value sets the order in which multiple pipelines run (lower numbers run first)
## scraper/settings.py
      ITEM_PIPELINES = {"scraper.pipelines.TheodoTeamPipeline": 300}

  11. Create a Django command to run the Scrapy crawl – this initialises Django in the scraper and is needed to be able to access Django from the spider.
## commands/crawl.py

    from django.core.management.base import BaseCommand
    from scraper.spiders import TheodoSpider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    class Command(BaseCommand):
      help = "Release the spiders"

      def handle(self, *args, **options):
          process = CrawlerProcess(get_project_settings())

          process.crawl(TheodoSpider)
          process.start()

  12. Run python manage.py crawl to save the items to the database

Project Structure:

    scraper
        management
            commands
                crawl.py
        spiders
            theodo_team_spider.py
        apps.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
    theodo_team
        admin
        migrations
        models

Challenges and problems encountered:

Selectors!! Selectors are not one size fits all. Different selectors are needed for every website, and if the layout changes constantly they require upkeep. It can also be difficult to find all the data required without manipulating it. This happens when tags have no class name, or when data is not stored consistently in the same tag.

An example of how complicated selectors can get:

    segments = response.css("tr td[rowspan]")

    for segment in segments:
        rowspan = int(segment.css("::attr(rowspan)").extract_first())
        all_td_after_segment = segment.xpath(
            "./../following-sibling::tr[position()<={}]/td".format(rowspan - 1)
        )

        line = all_td_after_segment[0]
        data = line.xpath("./descendant-or-self::a/text()")
        more_data = line.xpath(
            "substring-after(substring-before(./strong/text()[preceding-sibling::a], '%'), '\xa0')"
        ).extract_first()

As you can see, setting up the scraper is not the hard part! I think integrating Scrapy and Django is a desirable, efficient and speedy solution for storing data from a website in a database.

*Originally published by **Henriette Brand** at blog.theodo.com*

#django #python
