There are already plenty of resources on building web scrapers in Python, most of which rely either on the well-known combination of urllib and beautifulsoup4 or on Selenium. When you have to scrape a JavaScript-heavy web page, or when you need a level of interaction with the content that cannot be achieved by simply sending URL requests, Selenium is very likely your preferred choice. I don’t want to go into the details of how to set up your scraping script or the best practices for running it reliably; I just want to point to this and this resource, which I found particularly helpful.
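To make the difference concrete, here is a minimal sketch of the kind of interaction a plain URL request cannot cover: clicking a button and waiting for JavaScript-rendered content to appear. The URL and the CSS selectors are placeholders for illustration, and the snippet assumes a chromedriver is available locally.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is installed and on the PATH
try:
    driver.get("https://example.com")  # placeholder URL

    # Trigger the JavaScript that loads the content we are after
    driver.find_element(By.CSS_SELECTOR, "button.load-more").click()

    # Wait until the dynamically rendered elements show up in the DOM
    items = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.result"))
    )
    print([item.text for item in items])
finally:
    driver.quit()
```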

The problem we want to solve in this post is: how can I, as a Data Analyst or Data Scientist, set up an orchestrated, fully managed process to run a Selenium scraper with a minimum of DevOps effort? The main use case for such a setup is a managed, scheduled solution for running all your scraping jobs in the cloud.

The tools we are going to use are:

  • **Google Cloud Composer** to schedule jobs and orchestrate workflows
  • **Selenium** as a framework to scrape websites
  • **Google Kubernetes Engine** to deploy a Selenium remote driver as a containerized application in the cloud (see the sketch after this list)
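Roughly, the pieces fit together as in the sketch below: a Selenium standalone browser (for example the selenium/standalone-chrome image) runs as a service on Google Kubernetes Engine, and a DAG on Cloud Composer talks to it through Selenium's remote WebDriver. The service hostname, the scraped URL, and the DAG settings are assumptions for illustration, and the PythonOperator import path follows the Airflow 1.10 line that Composer was shipping.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import
from selenium import webdriver


def scrape_page():
    """Connect to the Selenium service on GKE and fetch one page."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # Placeholder endpoint: replace with the hostname/IP of your
    # selenium-chrome service running on Kubernetes.
    driver = webdriver.Remote(
        command_executor="http://selenium-chrome:4444/wd/hub",
        options=options,
    )
    try:
        driver.get("https://example.com")  # placeholder URL
        return driver.title
    finally:
        driver.quit()


with DAG(
    dag_id="selenium_scraper",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="scrape_page", python_callable=scrape_page)
```

Running the browser as a separate service keeps the Composer workers lightweight: they only need the selenium Python package, not a full browser installation.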

