There are already plenty of resources on building web scrapers with Python, most of which rely either on a combination of the well-known packages urllib and beautifulsoup4, or on Selenium. When you need to scrape a JavaScript-heavy web page, or when the required level of interaction with the content cannot be achieved by simply sending URL requests, Selenium is very likely your preferred choice. I don’t want to go into the details of how to set up your scraping script or the best practices for running it reliably; I will just refer to this and this resource, which I found particularly helpful.
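To make concrete what such a Selenium scraper looks like, here is a minimal sketch of a headless Chrome script that waits for JavaScript-rendered content before reading it. The URL and the CSS selector are placeholders for illustration only, not part of the setup described later in this post.

```python
# Minimal sketch of a Selenium scraper for a JavaScript-heavy page.
# The target URL and CSS selector below are placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")    # run without a visible browser window
options.add_argument("--no-sandbox")  # often needed in containerized environments

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    # Wait until the JavaScript-rendered element is present before reading it.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1"))
    )
    print(element.text)
finally:
    driver.quit()
```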
The problem we want to solve in this post is: how can I, as a Data Analyst/Data Scientist, set up an orchestrated and fully managed process for running a Selenium scraper with a minimum of DevOps effort? The main use case for such a setup is a managed, scheduled solution that runs all your scraping jobs in the cloud.
The tools we are going to use are:
- Selenium
- Apache Airflow
- Google Cloud Platform
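To give a first impression of the orchestration side, here is a minimal sketch of an Airflow DAG that schedules a scraping task once a day, assuming Airflow 2.x. The DAG id, the schedule, and the scrape_site callable are illustrative placeholders rather than the actual setup built in this post.

```python
# Minimal sketch of an Airflow DAG (Airflow 2.x) that runs a scraping job daily.
# The DAG id, schedule, and scrape_site() body are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_site():
    # Placeholder for the actual Selenium scraping logic.
    print("Running the Selenium scraper...")


with DAG(
    dag_id="selenium_scraper",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    scrape_task = PythonOperator(
        task_id="scrape",
        python_callable=scrape_site,
    )
```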