‘Exploratory data analysis’ is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.
These words are attributed to the prominent American mathematician John Tukey. He played a key role in the development of statistics, contributing dozens of brilliant ideas on ways of **collecting** and **analyzing** data. On top of that, he coined the term ‘exploratory data analysis’ (EDA).
Let’s take a look at the meaning that is hidden behind this term.
**Exploratory Data Analysis** (EDA) is a set of approaches that includes univariate, bivariate and multivariate visualization techniques, dimensionality reduction, and cluster analysis.
The *main goal* of EDA is to get a thorough understanding of the data and draw attention to its most important features, in order to prepare it for more advanced analysis techniques and for feeding into machine learning algorithms. It also helps to generate hypotheses about the data, detect anomalies and reveal its structure.
You should never neglect data exploration — skipping this significant stage of any data science or machine learning project may lead to generating inaccurate models or wrong data analysis results.
While exploring, look at your data from as many angles as possible, since the devil is always in the details.
*Graphical techniques* are the most natural for the human mind, so plotting shouldn’t be underestimated. These techniques usually include depicting the data using box-and-whisker plots, histograms, lag plots, standard deviation plots, Pareto charts, scatter plots, bar and pie charts, violin plots, correlation matrices, and more.
The goal of this tutorial is to share my experience of exploring and visualizing data before starting a predictive analytics project. I hope to inspire you to dig for insights in your data, just as Tukey encouraged statisticians to pay more attention to this approach.
Before starting a new data visualization project, it’s crucial to understand your long-term goals.
Today we’ll be working with the Medical Appointment No Shows dataset that contains information about the patients’ appointments.
Each patient’s record is characterized by the following features:
We aim to understand **why** people who receive treatment instructions do not show up at the next appointment time. In other words, what are the contributing factors for missing appointments?
But this is the long-term goal. Before digging deeper, we should try answering the following questions:
This list is not complete — you can extend it with additional questions that come to your mind during the analysis.
In this tutorial, we’ll try visualizing data in Python. I assume you know how to work with basic Python libraries. Let’s import the ones we’ll need for working with the data:
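The import cell did not survive in this copy of the article. A minimal sketch, assuming the data handling only needs pandas and NumPy (the aliases are the conventional ones, not confirmed by the original):

```python
# Conventional aliases; assumed, since the original import cell is not shown.
import numpy as np
import pandas as pd

# Quick sanity check that both libraries are importable.
print("pandas", pd.__version__, "| numpy", np.__version__)
```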
We’ll use Plotly as the primary charting library. It’s an open-source Python package built on top of plotly.js, which in turn builds on d3.js and stack.gl, and it offers sophisticated charts that can meet the requirements of most projects. Because Plotly is high-level, it is convenient to work with, and for this reason I prefer it to matplotlib.
Another thing I admire about Plotly is the **interactivity** it brings to exploring data with charts.
We’ll use it offline, so there is no need to create an account and no limit on the number of charts we can build.
After you’ve downloaded the data from Kaggle, the next step to take is to build a pandas DataFrame based on the CSV data. Here is a tutorial which will make you comfortable with working with pandas.
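Loading the data could look roughly like this. To keep the sketch runnable without the download, it reads two made-up rows from an in-memory string. The column layout and the file name mentioned in the comment are assumptions about the Kaggle file, not something shown in this article:

```python
import io

import pandas as pd

# Two made-up rows in the (assumed) column layout of the Kaggle file.
# With the real download you would pass the file path instead, for example
# pd.read_csv("KaggleV2-May-2016.csv")  # file name is an assumption
sample_csv = io.StringIO(
    "PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,"
    "Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,"
    "SMS_received,No-show\n"
    "1001,5001,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,CENTRO,0,1,0,0,0,0,No\n"
    "1002,5002,M,2016-04-27T08:36:51Z,2016-05-03T00:00:00Z,8,CENTRO,0,0,0,0,0,1,Yes\n"
)

df = pd.read_csv(sample_csv)
print(df.shape)
```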
Let’s remove some columns that we will not need so as to make data processing faster:
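The article does not say which columns were dropped. A plausible sketch, assuming the two identifier columns are the ones removed (the column names and the choice itself are assumptions):

```python
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame({
    "PatientId": [1001, 1002],
    "AppointmentID": [5001, 5002],
    "Gender": ["F", "M"],
    "Age": [62, 8],
})

# Identifier columns carry no analytical signal here (assumed choice).
df = df.drop(columns=["PatientId", "AppointmentID"])
print(df.columns.tolist())
```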
Before cleaning the data, let’s check the quality of the data and data types of each column.
Information about the dataframe
Here you can also check the amount of memory used by the dataframe.
Use the `head()` method to display the first five rows of the dataframe:
Check the overall number of samples and features:
We have 110527 records and 12 features.
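The three inspection steps above can be sketched with standard pandas calls. The frame below is a toy stand-in, and `df.shape` is my assumption for how the row and column counts were obtained:

```python
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": [62, 8, 37],
    "No-show": ["No", "Yes", "No"],
})

# Data types, non-null counts and memory footprint of every column.
df.info(memory_usage="deep")

# First rows of the dataframe (five by default).
print(df.head())

# Number of samples (rows) and features (columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```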
*Cleaning the data is an art* that should be mastered before starting any data science or machine learning project. It makes data easier to investigate and build visualizations around.
After checking the data types of the features, you may have noticed that the ScheduledDay and AppointmentDay features have an object data type.
To make dealing with date features easier, let’s convert the type of ‘ScheduledDay’ and ‘AppointmentDay’ to datetime64[ns]. This gives you access to useful date methods and attributes.
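A minimal sketch of that conversion, assuming the raw values are ISO 8601 strings with a trailing `Z`. Recent pandas versions parse such strings as timezone-aware, so the sketch drops the timezone to get plain datetimes:

```python
import pandas as pd

# Toy rows in the assumed raw string format.
df = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-03T00:00:00Z"],
})

for col in ["ScheduledDay", "AppointmentDay"]:
    # Parse as UTC, then drop the timezone to obtain naive datetimes.
    df[col] = pd.to_datetime(df[col], utc=True).dt.tz_localize(None)

print(df.dtypes)
```

Once the columns are datetimes, the `.dt` accessor exposes attributes such as `.dt.year` or `.dt.hour`.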
Another way is to convert the column types while reading the data.
To do this, pass a list of the column names that should be treated as dates to the `parse_dates` parameter of the `read_csv` method. This way they will be formatted in a readable way:
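A short sketch of parsing the dates at load time. It reads from an in-memory string with made-up rows so that it runs without the real file:

```python
import io

import pandas as pd

# Made-up rows; with the real data you would pass the CSV path instead.
csv_text = io.StringIO(
    "ScheduledDay,AppointmentDay,Age\n"
    "2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62\n"
    "2016-04-27T08:36:51Z,2016-05-03T00:00:00Z,8\n"
)

# Columns listed in parse_dates are converted to datetimes while reading.
df = pd.read_csv(csv_text, parse_dates=["ScheduledDay", "AppointmentDay"])
print(df.dtypes)
```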
It’s also a good idea to convert string columns to the categorical data type, which saves memory by making the dataframe smaller. The memory usage of a categorical column is proportional to the number of categories plus the length of the data.
In addition, a categorical column will be treated as a categorical variable by most statistical Python libraries.
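The memory effect is easy to check on a toy frame. The column names and values below are only illustrative:

```python
import pandas as pd

# Repetitive string columns, as is typical for categorical features.
df = pd.DataFrame({
    "Gender": ["F", "M"] * 5000,
    "Neighbourhood": ["CENTRO", "JARDIM CAMBURI"] * 5000,
})

before = df.memory_usage(deep=True).sum()

for col in ["Gender", "Neighbourhood"]:
    df[col] = df[col].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

The gain shrinks as the number of distinct values approaches the number of rows, so this conversion pays off mainly for low-cardinality columns.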
Sometimes the data can be inconsistent. For example, if an appointment day comes before the scheduled day, then something is wrong and we need to swap their values.
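One way to sketch the swap, assuming both columns are already datetimes and that only calendar dates should be compared (an appointment day without a time of day would otherwise look earlier than a same-day booking):

```python
import pandas as pd

df = pd.DataFrame({
    "ScheduledDay": pd.to_datetime(["2016-05-10 08:00", "2016-05-02 09:30"]),
    "AppointmentDay": pd.to_datetime(["2016-05-04", "2016-05-09"]),
})

# Rows where the appointment date precedes the scheduling date.
swapped = df["AppointmentDay"].dt.normalize() < df["ScheduledDay"].dt.normalize()

# Build both corrected columns first, then assign them back together.
scheduled = df["ScheduledDay"].where(~swapped, df["AppointmentDay"])
appointment = df["AppointmentDay"].where(~swapped, df["ScheduledDay"])
df["ScheduledDay"], df["AppointmentDay"] = scheduled, appointment

print(df)
```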
You may have noticed that our features contain typing errors.
Let’s rename misspelled column names:
Optionally, you can rename the “No-show” column to “Presence” and its values to ‘Present’ and ‘Absent’ to avoid any misinterpretation.
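Both renaming steps can be sketched as follows. The misspelled names `Hipertension` and `Handcap` are my assumption about which columns are meant, and the mapping assumes that “No” in the raw column means the patient did attend:

```python
import pandas as pd

df = pd.DataFrame({
    "Hipertension": [1, 0],
    "Handcap": [0, 0],
    "No-show": ["No", "Yes"],
})

# Fix the (assumed) misspellings and give the target a clearer name.
df = df.rename(columns={
    "Hipertension": "Hypertension",
    "Handcap": "Handicap",
    "No-show": "Presence",
})

# "No" = the patient did not miss the appointment (assumed encoding).
df["Presence"] = df["Presence"].map({"No": "Present", "Yes": "Absent"})
print(df)
```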
Now that our dataset is neat and accurate, let’s move ahead to extending the dataset with new features.
We can add a new feature to the dataset — ‘Waiting Time Days’ to check how long the patient needs to wait for the appointment day.
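A possible way to derive this feature, assuming both date columns are already datetimes. The times of day are stripped first so that a same-day appointment counts as zero days of waiting:

```python
import pandas as pd

df = pd.DataFrame({
    "ScheduledDay": pd.to_datetime(["2016-04-29 18:38:08", "2016-04-27 08:36:51"]),
    "AppointmentDay": pd.to_datetime(["2016-04-29", "2016-05-03"]),
})

# Difference between the two calendar dates, expressed in whole days.
df["Waiting Time Days"] = (
    df["AppointmentDay"].dt.normalize() - df["ScheduledDay"].dt.normalize()
).dt.days

print(df["Waiting Time Days"].tolist())
```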
Another new feature may be ‘WeekDay’ — a weekday of an appointment. With this feature, we can analyze on which days people don’t show up more often.
Similarly, add the ‘Month’ and ‘Hour’ features:
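These calendar features can be sketched with the `.dt` accessor. Taking ‘Hour’ from the scheduling timestamp is an assumption on my part, made because the appointment day may carry no time of day:

```python
import pandas as pd

df = pd.DataFrame({
    "ScheduledDay": pd.to_datetime(["2016-04-29 18:38:08", "2016-04-27 08:36:51"]),
    "AppointmentDay": pd.to_datetime(["2016-04-29", "2016-05-03"]),
})

df["WeekDay"] = df["AppointmentDay"].dt.day_name()   # weekday of the appointment
df["Month"] = df["AppointmentDay"].dt.month_name()   # month of the appointment
df["Hour"] = df["ScheduledDay"].dt.hour              # hour of booking (assumed source)

print(df[["WeekDay", "Month", "Hour"]])
```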
Let’s check whether there are null values in each column in this elegant way:
Alternatively, if you want to check an individual column for the presence of null values, you can do it this way:
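Both checks can be sketched like this. The toy frame deliberately contains one missing value so the output is not all zeros, unlike the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age, for illustration only.
df = pd.DataFrame({"Age": [62, np.nan, 8], "Gender": ["F", "M", "M"]})

# Number of null values in every column.
null_counts = df.isnull().sum()
print(null_counts)

# Does a single column contain any null values?
age_has_nulls = df["Age"].isnull().any()
print(age_has_nulls)
```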
We are lucky — there are no null values in our dataset.
Still, what are the strategies to address missing values?
Analyzing existing techniques and approaches, I’ve come to the conclusion that the most popular strategies for dealing with missing data are:
Once you’ve cleaned the data, it’s time to inspect it more closely.
Perform the following steps:
The best charts for *visualizing proportions* are pie and donut charts, treemaps, and stacked area and bar charts. Let’s use a pie chart:
It’s clear that only 20.2% of patients didn’t show up while 79.8% were present on the appointment day.
A **box-and-whisker plot** handles this task best:
With this interactive plot, you can see that the second quartile of the data (the median) is 37.
That means that 50% of patients are younger than 37 and the other 50% are older than 37.
Upper quartile means that 75% of the age values fall below 55. Lower quartile means that 25% of age values fall below 18.
The range of age values from lower to upper quartile is called the interquartile range. From the plot, you can conclude that 50% of patients are aged 18 to 55 years.
If you take a look at the whiskers, you’ll find the greatest value (excluding outliers), which is 102.
Our data contains only one outlier — a patient with age 115. The lowest value is 0 which is quite possible since the patients can be small children.
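Another insight from this plot is that the data is positively skewed. The box itself is almost symmetric (Quartile 3 − Quartile 2 = 18, while Quartile 2 − Quartile 1 = 19), but the upper whisker, which stretches from 55 up to 102, is much longer than the lower one, which stretches from 18 down to 0.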
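The numbers a box plot displays can also be computed directly. The sketch below uses a small made-up age sample, chosen so that the quartiles match the ones quoted above, and applies the common 1.5 × IQR rule for flagging outliers. Plotly may use a slightly different quartile method, so treat the fences as approximate:

```python
import pandas as pd

# Made-up ages, picked so that Q1 = 18, median = 37 and Q3 = 55.
ages = pd.Series([0, 5, 18, 18, 30, 37, 37, 45, 55, 55, 80, 115], name="Age")

q1, q2, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: spread of the middle 50% of values

# Points beyond 1.5 * IQR from the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, q2, q3, iqr)
print(outliers.tolist())
```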
For this, we can use the same box plot, this time grouped by the “Presence” column.
You can see that people don’t show up mostly on Tuesdays and Wednesdays.
Possible techniques that can be applied to this data later:
That’s it for now! You’ve finished exploring the dataset but you can continue revealing insights.
Hopefully, this simple project will help you grasp the basic idea of EDA. I encourage you to experiment with data and different types of visualizations to figure out the best way to get the most out of your data.
Exploratory Data Analysis (EDA) is a very common and important practice among data scientists. It is the process of looking at tables upon tables of data from different angles in order to understand it fully. Gaining a good understanding of the data helps us clean and summarize it, which in turn brings out insights and trends that were otherwise unclear.
EDA has no hard-and-fast set of rules to follow, unlike ‘data analysis’, for example. People who are new to the field tend to confuse the two terms, which are similar but differ in purpose. Unlike EDA, data analysis leans more towards applying probability and statistical methods to reveal facts and relationships among different variables.
Coming back, there is no right or wrong way to perform EDA. It varies from person to person; however, there are some commonly followed guidelines, which are listed below.
We will look at how some of these are implemented using the well-known ‘Home Credit Default Risk’ dataset available on Kaggle. The data contains information about the loan applicant at the time of applying for the loan. It contains two types of scenarios:
on at least one of the first Y instalments of the loan in our sample,
We’ll only be working with the application data files for the sake of this article.
If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.
In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.
Exploratory Data Analysis (EDA) is one of the most important aspects of every data science or data analysis problem. It gives us a greater understanding of our data and can uncover hidden insights that aren’t obvious at first. The first article I wrote on Medium was also about performing EDA, in R. This post will focus more on graphical EDA in Python using matplotlib, regression lines and even a motion chart!
The datasets we are using for this article can be obtained from Gapminder, drilling down into _Population_, _Gender Equality in Education_ and _Income_.
The _Population_ data contains yearly estimates of the resident population, grouped by country, between 1800 and 2018.
The _Gender Equality in Education_ data contains yearly data between 1970 and 2015 on the female-to-male ratio in schools among 25- to 34-year-olds, covering primary, secondary and tertiary education across different countries.
The _Income_ data contains yearly data on income per person, adjusted for differences in purchasing power (in international dollars), across different countries for the period between 1800 and 2018.
Let’s first plot the population data over time, focusing mainly on three countries: Singapore, the United States and China. We will use the matplotlib library to plot three line charts on the same figure.
```python
import pandas as pd
import matplotlib.pyplot as plt

# In a Jupyter notebook, also run: %matplotlib inline

# Read in the data
population = pd.read_csv('./population.csv')

# Plot one line per country
plt.plot(population.Year, population.Singapore, label="Singapore")
plt.plot(population.Year, population.China, label="China")
plt.plot(population.Year, population["United States"], label="United States")

# Add legend, labels and title
plt.legend(loc='best')
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('Population Growth over time')
plt.show()
```
EDA is a way to understand what the data is all about. It is very important because it helps us understand outliers and the relationships between features in the data with the help of graphs and plots.
EDA is a time-consuming process, as we need to build visualizations of different features using libraries like Matplotlib, seaborn, etc.
There is a way to automate this process with a single line of code using the library Pandas Visual Analysis.
Let’s understand the different sections in the user interface: