A vast amount of data is publicly available online, and knowing how to retrieve and analyze it is an extremely useful skill. In this tutorial, we will use the Python requests and Beautiful Soup libraries to scrape such data quickly. By the end of this tutorial, you will be able to request a webpage using the requests library, parse it using the Beautiful Soup library, and then create a DataFrame with the scraped data using the pandas library. I will be using a Jupyter notebook for my code.

What is Web Scraping?

Web scraping is the process of gathering or extracting data from websites. This process can be broken down into two major steps: making an HTTP request to a specified source (such as a website) using the requests library, and then parsing the response using the Beautiful Soup library. For this tutorial, we will scrape the batting averages for India’s international cricket team published on ESPNcricinfo (the full URL appears in the request below).
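To make the second step concrete before we get there, here is a minimal sketch of parsing a table with Beautiful Soup. It assumes the bs4 package is installed (pip install beautifulsoup4) and uses a small hand-written HTML string in place of a downloaded page. The player names and numbers are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page; the names and
# numbers below are invented for illustration only.
html = """
<table>
  <tr><th>Player</th><th>Ave</th></tr>
  <tr><td>Player A</td><td>45.50</td></tr>
  <tr><td>Player B</td><td>38.25</td></tr>
</table>
"""

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every header or data cell, one list per table row.
rows = []
for tr in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# The first row holds the column names; the rest hold the data.
header, records = rows[0], rows[1:]
print(header)
print(records)
```

In a real scrape, the html string would come from the text of the response returned by the requests library rather than being typed in by hand.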

Making an HTTP Request

First, we will use the requests library to make an HTTP request to a website in order to get data from a webpage, such as its source code.

To begin, we need to make sure to install the requests library. You can do so with the following command:

pip install requests

We must then import the requests module in our code:

import requests

Next, we will use the get method to request a webpage. We must pass this method the URL of the page we want to request:

source = requests.get('https://stats.espncricinfo.com/ci/engine/records/averages/batting.html?class=2;current=2;id=6;type=team')

The get method returns a response object that we saved to the source variable. This response object is the server’s response to our HTTP request.

If we just print this object, we get this output: <Response [200]>. This tells us that it is a response object, with an HTTP status code of 200. An HTTP status code of 200 tells us that this HTTP request was successful.
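Rather than printing the response and reading the status code by eye, we can check it in code. The sketch below is one possible way to do this: check_response and fetch_html are hypothetical helper names (they are not part of requests), and the 10-second timeout is an arbitrary choice. To keep the example runnable without a network connection, the demonstration builds Response objects by hand instead of contacting a server.

```python
import requests


def check_response(response):
    """Return True for a 2xx status; raise requests.HTTPError for 4xx/5xx."""
    response.raise_for_status()  # raises only for 4xx and 5xx status codes
    return 200 <= response.status_code < 300


def fetch_html(url, timeout=10):
    """Hypothetical helper: request a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    check_response(response)
    return response.text


# Offline demonstration: these Response objects are built by hand, so only
# the status-code logic is exercised and no request is actually sent.
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(check_response(ok_response))

try:
    check_response(missing_response)
except requests.HTTPError as error:
    print("Request failed:", error)
```

With a live connection, calling fetch_html with the URL from the get call above would return the page source as a string, or raise an exception if the server reported an error.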

To view the different attributes and methods of this object (or any Python object), you can use the dir function. The help function gives us more information about these attributes and methods.

dir(source) 

help(source)
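The output of dir can be long because it includes many names that start with an underscore. As a rough sketch, we can filter those out to see only the public attributes and methods. The example below uses an empty Response object built by hand (an assumption made so the code runs offline); the same filtering works on the source variable from a real request.

```python
import requests

# An empty Response built by hand, used only so the example runs offline.
response = requests.Response()

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(response) if not name.startswith("_")]
print(public_names)

# Separate methods from plain attributes and properties. Looking the names
# up on the class avoids evaluating properties, some of which expect a real
# server response.
methods = [
    name for name in public_names
    if callable(getattr(type(response), name, None))
]
print(methods)
```

Names such as status_code and text show up in the public list, while json and raise_for_status appear among the methods.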
