With various attempts underway to curb the effect of COVID19 on the world, research and innovative measures depend on insights gained from the right data. Much of the data needed to support this work is not available through an Application Programming Interface (API) or as a downloadable file in a format like ‘.csv’; it can only be accessed as part of a web page. All code snippets can be found here.

Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who analyzes large datasets, the ability to scrape data from the web is a useful skill to have.
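As a rough illustration of the "extract and process" idea, here is a minimal sketch that uses only the Python standard library. The names `fetch_html`, `extract_title`, and `TitleParser` are hypothetical helpers introduced for this example, and the User-Agent string is a placeholder; this is not necessarily how the rest of the article does it.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (the 'extract' step)."""
    # Placeholder User-Agent; some sites reject requests that send none.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Return the page title from raw HTML (the 'process' step)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

In use, `extract_title(fetch_html(url))` would return the title of the page at `url`. Before scraping a real site, check its terms of use and robots.txt.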

Worldometers is a credible source of COVID19 data from around the world. In this article, we will learn how to use Python to scrape the COVID19 data depicted below from its web page into a Dask dataframe.

Why Dask dataframe?

Pandas has been one of the most popular data science tools in the Python programming language for data wrangling and analysis. It has limitations when it comes to big data, however, because its algorithms assume the whole dataset fits in local memory.

Dask, an open-source and freely available Python library, addresses this. It provides ways to scale Pandas, Scikit-Learn, and NumPy workloads beyond a single core or the memory of a single machine. In the context of this article, the dataset keeps growing, which makes Dask a suitable tool to use.

Elements of a web page

Before we delve into web scraping proper, let's clear up the difference between a web page and a website. A web page is a single document, whereas a website is a collection of related web pages under a common domain. A browser reaches a website by resolving its domain name through DNS and then requests individual pages over HTTP. A website covers a range of content spread across its pages, while each web page contains more specific information.

There are four (4) basic elements of a web page:

  1. Structure
  2. Function
  3. Content
  4. Aesthetics

These elements are implemented through, but not limited to, programmable components such as HTML, which contains the main content of the page; CSS, which adds styling to make the page look nicer; and JavaScript (JS), which adds interactivity to web pages.

When we perform web scraping, we’re interested in extracting information from the main content of the web page, which makes a good understanding of HTML important.
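To show why the HTML structure matters, here is a small sketch that pulls the rows out of an HTML table using the standard library's `html.parser`. The `SAMPLE_HTML` markup, the `example_table` id, and the `TableParser` class are all hypothetical and made up for this example; they do not reflect the actual markup of any real site, and the values are placeholders.

```python
from html.parser import HTMLParser

# Hypothetical markup with placeholder values, for illustration only.
SAMPLE_HTML = """
<table id="example_table">
  <tr><th>Country</th><th>Total Cases</th></tr>
  <tr><td>Country A</td><td>1,200</td></tr>
  <tr><td>Country B</td><td>350</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows

# Turn "1,200" into the integer 1200 so the values can be analyzed.
cases = {country: int(count.replace(",", "")) for country, count in records}
print(header)
print(cases)
```

The parser only works because it knows that rows live in `tr` tags and cells in `th` or `td` tags, which is exactly the kind of structural knowledge scraping depends on.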


Scraping COVID19 Data Using Python