English Premier League

Data is the fuel of every machine learning or deep learning model. Without data, the models are useless. Before building and training a model, we should explore and understand the data at hand. By understanding, I mean identifying the correlations, structures, distributions, characteristics, and trends in the data. A comprehensive understanding of the data is very useful in building a robust, well-designed model, and we can draw valuable conclusions simply by exploring it.

In this post, I will walk through an exploratory data analysis of the English Premier League 2019–2020 season dataset, which is available on Kaggle.

Let’s start by reading the data into a Pandas dataframe:

import numpy as np
import pandas as pd

df_epl = pd.read_csv("../input/epl-stats-20192020/epl2020.csv")
print(df_epl.shape)
(576, 45)

The dataset has 576 rows and 45 columns. To display all the columns, we need to adjust the display.max_columns setting.

pd.set_option("display.max_columns",45)

df_epl.head()

![](https://miro.medium.com/max/816/1*Ep38v5KYegOGQVmfsSmH8Q.png)

The table does not fit on the screen, but we can see all the columns by using the scroll bar. The dataset includes the statistics for 288 games. There are 576 rows because each game is represented by two rows, one from the home team's side and one from the away team's side. For instance, the first two rows represent the “Liverpool-Norwich” game.

The first column (“Unnamed: 0”) is redundant so we can just drop it:

df_epl.drop(['Unnamed: 0'], axis=1, inplace=True)
df_epl = df_epl.reset_index(drop=True)
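
A column named “Unnamed: 0” usually appears when a saved row index is read back as ordinary data. If that is the case here (an assumption, since the file's origin is not described), the column can be absorbed at read time instead of dropped afterwards. The sketch below uses a small hypothetical CSV held in memory rather than the real file:

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first, unnamed column is a saved index.
csv_text = ",teamId,scored\n0,Liverpool,4\n1,Norwich,1\n"

# Default read: the unnamed column shows up as "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())

# index_col=0 uses that column as the index, so there is nothing to drop.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())
```

Either approach leaves the same columns; reading with index_col=0 just saves a step.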


The dataset includes lots of different statistics about games.

*   xG, xGA: Expected goals for team and opponent
*   scored, missed: Goals scored and conceded
*   xpts, pts: Expected and received points
*   wins, draws, losses: Binary variables showing the result of the game
*   tot_goal, tot_con: Total goals scored and conceded from the beginning of the season

There are also basic stats such as shots, shots on target, corner kicks, yellow cards, and red cards. We also have information about the date and time of the games.
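
Column descriptions like the ones above can be sanity-checked before any analysis. The sketch below assumes the usual league scoring (3 points for a win, 1 for a draw) and runs on a hypothetical three-row frame; only the column names are taken from the dataset:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset description.
toy = pd.DataFrame({
    "wins":   [1, 0, 0],
    "draws":  [0, 1, 0],
    "losses": [0, 0, 1],
    "pts":    [3, 1, 0],
})

# Exactly one of the three result flags should be set in each row.
one_result = (toy[["wins", "draws", "losses"]].sum(axis=1) == 1).all()

# Standard league scoring: 3 points for a win, 1 for a draw.
pts_consistent = (toy["pts"] == 3 * toy["wins"] + toy["draws"]).all()

print(one_result, pts_consistent)
```

Running the same two checks on df_epl would quickly flag any rows where the result flags and the points disagree.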

Let’s start with days:

df_epl.matchDay.value_counts()

Most of the games are played on Saturdays.
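
Raw counts are easier to compare as shares. The following is a minimal sketch on hypothetical day labels (the real matchDay values may be coded differently). Because each game occupies two rows, the row counts are halved to get the number of games:

```python
import pandas as pd

# Hypothetical day labels, two rows per game as in the dataset.
match_day = pd.Series(
    ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Mon", "Mon"],
    name="matchDay",
)

rows_per_day = match_day.value_counts()           # rows, not games
games_per_day = rows_per_day // 2                 # each game has two rows
share_per_day = match_day.value_counts(normalize=True)

print(games_per_day)
print(share_per_day.round(2))
```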

We can quickly create the league standings based on the total number of points earned so far. Because points only accumulate over the season, the maximum value in the tot_points column for each team is its most up-to-date total:

df_epl[['teamId','tot_points']].groupby('teamId').max().sort_values(by='tot_points', ascending=False)[:10]

I only displayed the first 10 teams. If you are a football (i.e. soccer) fan, you may have heard of Liverpool's dominance in the English Premier League this season. Liverpool leads by 25 points.
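
The lead itself can be computed rather than read off the table. This sketch applies the same groupby-and-max idea to hypothetical teams and point totals, then subtracts second place from first:

```python
import pandas as pd

# Hypothetical teams and running totals; only the column names are from the dataset.
toy = pd.DataFrame({
    "teamId":     ["A", "B", "A", "B", "C", "C"],
    "tot_points": [3, 1, 6, 2, 0, 3],
})

# The largest running total per team is its current points tally.
standings = (
    toy.groupby("teamId")["tot_points"]
       .max()
       .sort_values(ascending=False)
)

# Gap between first and second place.
lead = standings.iloc[0] - standings.iloc[1]
print(standings)
print("Lead:", lead)
```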

Advancements in technology and data science have introduced new stats to football. One relatively new type is “expected” stats, such as expected goals and expected points. Let’s check how close the expected and actual values are. There are different ways to make the comparison. One way is to check the distribution of the difference:

#Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
%matplotlib inline

plt.figure(figsize=(10,6))
plt.title("Expected vs Actual Goals - Distribution of Difference", fontsize=18)
#Positive values mean the team scored fewer goals than expected
diff_goal = df_epl.xG - df_epl.scored
#kdeplot replaces the deprecated distplot(..., hist=False)
sns.kdeplot(diff_goal, color='blue')
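
Another way to compare the two is with a few summary numbers. The sketch below is one possible approach on hypothetical values (only the xG and scored column names come from the dataset): the mean and standard deviation of the difference, plus the linear correlation between expected and actual goals.

```python
import pandas as pd

# Hypothetical values; only the column names xG and scored are from the dataset.
toy = pd.DataFrame({
    "xG":     [2.5, 0.5, 1.5, 1.0],
    "scored": [4, 1, 1, 0],
})

diff_goal = toy["xG"] - toy["scored"]

# A mean near zero suggests expected goals are unbiased on average;
# the standard deviation shows how far single games stray from expectation.
mean_diff = diff_goal.mean()
std_diff = diff_goal.std()
print(mean_diff, std_diff)

# Linear (Pearson) correlation between expected and actual goals.
corr = toy["xG"].corr(toy["scored"])
print(corr)
```

The same three lines work on df_epl.xG and df_epl.scored directly.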


towardsdatascience.com
