Automated Web Scraping with Python and Celery

This is part 2 of building a web scraping tool with Python. We’ll be integrating Celery, a task management system, into our web scraping project.

Part 1, Building an RSS feed scraper with Python, illustrated how we can use Requests and Beautiful Soup.

In part 3 of this series, Making a web scraping application with Python, Celery, and Django, I will be demonstrating how to integrate a web scraping tool into web applications.

Background:

In a previous article, I created a simple RSS feed reader that scrapes information from HackerNews using Requests and BeautifulSoup (see the code on GitHub). We’ll now build on that code to create a task management system with scheduled scraping.

The next logical step in collecting data from websites that change frequently (e.g., an RSS feed that displays X number of items at a time) is to scrape on a regular basis. In the previous scraping example, we ran our code from the command line on demand; however, this isn’t a scalable solution. To automate this, we’ll add Celery to create a task queueing system with periodic runs.

I will be using the following:

Python 3.7+
Requests
BeautifulSoup 4
A text editor (I use Visual Studio Code)
Celery — Distributed task queue
RabbitMQ — Message broker

Note: All library dependencies are listed in the requirements.txt and Pipfile/Pipfile.lock.



codeburst.io
