Cricket is one of my favorite sports (although I can’t play it to save my life). With the World Cup drawing close, I am sure millions of cricket fans are trying to predict who is going to take the cricket glory home. I previously built a predictor for the Indian Twenty20 cricket league, the IPL, which is available at iplpredictormatches.pythonanywhere.com. I am hoping to build a similar predictor for the World Cup, too. This article covers the first step: the data gathering phase.

Data gathering can take up 70 to 80% of the total time you dedicate to a project. To gather the data, I am going to use web scraping, since all major cricket data is available on the web and can be accessed easily that way. HowStat is an excellent, well-structured cricket statistics site that I will be using in this article. Another great option is espncricinfo.com.

For this article, I will carry out only two tasks:

  1. Finding all the players who have ever played an ODI match, and
  2. Finding the scores of all the players in each year, along with how many matches they played in that year.

Let’s start with the first task. For web scraping, we need the following basic libraries, which we import first:

Filename: scrapping.py


import pandas as pd  # DataFrames and file operations
from bs4 import BeautifulSoup as soup  # HTML parsing for scraping
from urllib.request import urlopen as ureq  # Fetches the page behind a URL
import numpy as np  # Numeric helpers
import re  # Regular expressions for cleaning text

Next, we will write code for web scraping using Beautiful Soup:

For the URL, I go to the HowStat website. It splits its player lists by the starting letter of the player's name, so for simplicity we begin with the players listed under A.
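Because the lists are split by letter, covering every player eventually means visiting one page per letter. The sketch below builds those addresses. It assumes, without every letter having been checked, that each page follows the same pattern as the A page, with only the `Group` query parameter changing. The helper name `player_list_url` is made up for illustration.

```python
from string import ascii_uppercase
from urllib.parse import urlencode

# Address of the player-list page, taken from the URL used in this article.
BASE_URL = "http://howstat.com/cricket/Statistics/Players/PlayerList.asp"


def player_list_url(group, country="ALL"):
    """Build the player-list URL for one starting letter (hypothetical helper)."""
    # Assumption: every letter uses the same Country/Group query parameters.
    return BASE_URL + "?" + urlencode({"Country": country, "Group": group})


# One URL per letter, A through Z.
urls = [player_list_url(letter) for letter in ascii_uppercase]
print(urls[0])
```

Each entry in `urls` could then be passed to `ureq` in turn, ideally with a short pause between requests so the site is not hit too quickly.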

The URL for the A page is http://howstat.com/cricket/Statistics/Players/PlayerList.asp?Country=ALL&Group=A. Open this link and press Ctrl+Shift+I to inspect the HTML. This shows you where the data you need sits in the HTML, which matters because we will be scraping that HTML directly. Next, since we need the data for every player listed, we have two options:

  1. Take each data item individually.
  2. Take the whole table.

Obviously, the second option is more appealing and requires fewer lines of code.
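To make the "whole table" option concrete, here is a minimal sketch of reading every row of a table in one pass with Beautiful Soup. The HTML string, the `players` class name, and the column names are all stand-ins invented for this example; the real HowStat markup has to be checked in the inspector and will likely differ. The function name `table_to_rows` is also hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; names and values are placeholders.
SAMPLE_HTML = """
<table class="players">
  <tr><th>Name</th><th>Country</th><th>ODIs</th></tr>
  <tr><td>Player One</td><td>India</td><td>12</td></tr>
  <tr><td>Player Two</td><td>Australia</td><td>3</td></tr>
</table>
"""


def table_to_rows(html, table_class):
    """Return the table's data rows as dicts keyed by the header cells."""
    page = BeautifulSoup(html, "html.parser")
    table = page.find("table", class_=table_class)
    if table is None:
        return []  # No matching table on the page.

    header = None
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if not cells:
            continue  # Skip rows without any cells.
        if header is None:
            header = cells  # Treat the first non-empty row as the header.
        else:
            rows.append(dict(zip(header, cells)))
    return rows


rows = table_to_rows(SAMPLE_HTML, "players")
print(rows)
```

The resulting list of dicts can be handed straight to `pd.DataFrame(rows)`. All values come back as strings, so numeric columns still need converting afterwards.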

