An introduction to web scraping and to serverless cloud services.

The purpose of this article is to present a systematic approach to reading an RSS news feed and processing its content to web scrape news articles. The challenge is to extract the text of articles published on different websites without making any strong assumption about the structure of a web page.

The overall solution is described in three steps:

  1. A message is published in Cloud Pub/Sub with the URL of a news RSS feed,
  2. A first Cloud Function is triggered by that message. It extracts each article listed in the RSS feed, stores it in Cloud Storage and publishes one message per article in Cloud Pub/Sub for further usage,
  3. A second Cloud Function is triggered by each of those messages. It web scrapes the article page, stores the resulting text in Cloud Storage and publishes a message in Cloud Pub/Sub for further usage.
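The three steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the article's actual implementation. It makes the following assumptions: the Pub/Sub message body is JSON with a "url" key, the feed is RSS 2.0, and article text lives in <p> tags (a simple heuristic that avoids relying on any site-specific layout). The names handle_feed_message, fetch, store and publish are hypothetical; in a real deployment the three callables would wrap an HTTP client, the Cloud Storage client library and the Pub/Sub client library.

```python
import base64
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def decode_pubsub_message(event):
    """Decode the base64 `data` field of a Pub/Sub-triggered event.

    Assumption: the payload is a JSON document.
    """
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def parse_rss_items(rss_xml):
    """Return title, link and publication date for each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [
        {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        }
        for item in root.iter("item")
    ]


class ParagraphExtractor(HTMLParser):
    """Collect text found inside <p> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._current = []
        self.paragraphs = []

    def flush(self):
        # Normalise whitespace and keep only non-empty paragraphs.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # also covers a previous <p> that was never closed
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._current.append(data)


def extract_article_text(html):
    """Step 3 heuristic: join the text of all paragraphs of a page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up a trailing unclosed paragraph
    return "\n".join(parser.paragraphs)


def handle_feed_message(event, fetch, store, publish):
    """Step 2: read the feed named in the message, then store and announce each article.

    `fetch`, `store` and `publish` are hypothetical injected callables.
    """
    feed_url = decode_pubsub_message(event)["url"]  # "url" key is an assumption
    items = parse_rss_items(fetch(feed_url))
    for index, item in enumerate(items):
        body = json.dumps(item)
        store(f"articles/{index}.json", body)
        publish(body.encode("utf-8"))
    return len(items)
```

Passing the network and storage operations in as callables keeps the parsing logic testable without any cloud credentials. The paragraph heuristic is deliberately crude: pages that place their text outside <p> tags would need a different strategy.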

#google-cloud-platform #google-cloud-functions #google-cloud-pubsub #python #cloud

Extract RSS News Feeds using Python and Google Cloud Services