Sintek Ong



Exploratory Data Analysis with Python: Medical Appointments Data

“‘Exploratory data analysis’ is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.”


These words are attributed to John Tukey, a prominent American mathematician. He played a key role in the development of statistics, contributing dozens of brilliant ideas about ways of **collecting** and **analyzing** data. He also introduced the term ‘Exploratory data analysis’ (EDA).

Let’s take a look at the meaning that is hidden behind this term.

What is EDA

**Exploratory Data Analysis** (EDA) is a set of approaches that includes univariate, bivariate and multivariate visualization techniques, dimensionality reduction, and cluster analysis.

The *main goal* of EDA is to get a full understanding of the data and draw attention to its most important features, in order to prepare it for more advanced analysis techniques and for feeding into machine learning algorithms. It also helps to generate hypotheses about the data, detect anomalies and reveal its structure.

You should never neglect data exploration — skipping this significant stage of any data science or machine learning project may lead to generating inaccurate models or wrong data analysis results.

While exploring, you should look at your data from as many angles as possible, since the devil is always in the details.

What EDA techniques are used

*Graphical techniques* are the most natural for the human mind, so plotting shouldn’t be underestimated. These techniques usually include depicting the data using box and whisker plots, histograms, lag plots, standard deviation plots, Pareto charts, scatter plots, bar and pie charts, violin plots, correlation matrices, and more.


The goal of this tutorial is to share my experience of exploring and visualizing the data before starting a predictive analytics project. I hope to inspire you to get insights into data, just as Tukey encouraged statisticians to pay more attention to this approach.

Step 1. Define your objectives

Before starting a new data visualization project, it’s crucial to understand your long-term goals.

Today we’ll be working with the Medical Appointment No Shows dataset that contains information about the patients’ appointments.

Each patient’s record is characterized by the following features:

  • PatientID: a unique identifier of a patient
  • AppointmentID: a unique identifier of an appointment
  • Gender
  • ScheduledDay: the day on which the appointment was booked
  • AppointmentDay: the actual date of the appointment
  • Age: the patient’s age
  • Neighborhood: the neighborhood of each patient
  • Scholarship: Does the patient receive a scholarship?
  • Hypertension: Does the patient have hypertension?
  • Diabetes
  • Alcoholism
  • Handicap
  • SMS_received: Has the patient received an SMS reminder?
  • No_show: Has the patient decided not to show up?

We aim to understand **why** people who receive treatment instructions do not show up at the next appointment time. In other words, what are the contributing factors for missing appointments?

But this is the long-term goal. Before digging deeper, we should try answering the following questions:

  • What is the ratio of people who miss appointments to those who don’t?
  • Who misses appointments more often: men or women?
  • What is the most popular month/day/hour for not showing up?
  • What is the age distribution of patients?

This list is not complete — you can extend it with additional questions that come to your mind during the analysis.

Prepare your workspace

In this tutorial, we’ll try visualizing data in Python. I assume you know how to work with basic Python libraries. Let’s import the ones we’ll need for working with the data:
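A minimal sketch of the data-handling imports, assuming pandas and NumPy are the libraries meant here. The display option is an optional extra, not something the tutorial requires.

```python
import numpy as np   # numerical helpers
import pandas as pd  # tabular data handling

# Optional: show more columns when printing wide frames while exploring.
pd.set_option("display.max_columns", 50)
```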

We’ll use Plotly as the primary charting library. It’s an open-source Python package built on plotly.js, which in turn builds on d3.js, and it offers sophisticated charts that can meet the requirements of most projects. Because Plotly is high-level, it is more convenient to work with, and for this reason I prefer it to matplotlib.

Another thing I admire about Plotly is the **interactivity** it brings to exploring data with charts.

We’ll use it offline, so there is no need to create an account and no limit on the number of charts we can build.

Reading the data

After you’ve downloaded the data from Kaggle, the next step to take is to build a pandas DataFrame based on the CSV data. Here is a tutorial which will make you comfortable with working with pandas.
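Loading the file might look like the sketch below. To keep it runnable without the download, it reads three made-up rows from an in-memory string. The raw column headers, including their spellings, are an assumption about the Kaggle file and should be checked against df.columns on the real data.

```python
import io
import pandas as pd

# Three invented rows that mimic the assumed layout of the Kaggle CSV.
sample_csv = io.StringIO(
    "PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,"
    "Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,"
    "SMS_received,No-show\n"
    "1001,5001,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,CENTRO,0,1,0,0,0,0,No\n"
    "1002,5002,M,2016-04-27T08:36:51Z,2016-05-03T00:00:00Z,8,CENTRO,0,0,0,0,0,1,Yes\n"
    "1003,5003,F,2016-04-29T16:19:04Z,2016-05-02T00:00:00Z,37,CENTRO,1,0,0,0,0,0,No\n"
)

# With the real download, pass the path of the CSV file instead of sample_csv.
df = pd.read_csv(sample_csv)
print(df.shape)
```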

Let’s remove some columns that we will not need so as to make data processing faster:
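The text does not say which columns are removed. The sketch below assumes it is the two identifier columns, which would be consistent with the 12 features reported later; the frame itself is a made-up stand-in.

```python
import pandas as pd

# Made-up stand-in for the loaded data.
df = pd.DataFrame({
    "PatientId": [1001, 1002],
    "AppointmentID": [5001, 5002],
    "Gender": ["F", "M"],
    "Age": [62, 8],
})

# Assumption: the identifiers are the columns we do not need.
# errors="ignore" keeps the call from failing if a column is already gone.
df = df.drop(columns=["PatientId", "AppointmentID"], errors="ignore")
print(list(df.columns))
```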

Profiling the data

Before cleaning the data, let’s check the quality of the data and data types of each column.

Information about the dataframe

Here you can also check the amount of memory used by the dataframe. Use the head() method to display the first five rows of the dataframe:

Check the overall number of samples and features by using the .shape attribute:
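The three profiling calls mentioned above can be sketched together on a small made-up frame; on the real data they are called in exactly the same way.

```python
import pandas as pd

# Made-up stand-in for the appointments data.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M", "M"],
    "Age": [62, 8, 37, 21, 55, 0],
    "No-show": ["No", "Yes", "No", "No", "Yes", "No"],
})

df.info(memory_usage="deep")  # dtypes, non-null counts and memory footprint
print(df.head())              # first five rows
n_rows, n_cols = df.shape     # (number of records, number of features)
print(n_rows, n_cols)
```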

We have 110527 records and 12 features.

Cleaning & preparing the data

*Cleaning the data is an art* that should be mastered in the first place before starting any data science or machine learning project. It makes data easier to investigate and build visualizations around.

After you’ve checked the data types of features, you may have noticed that ScheduledDay and AppointmentDay features have an object data type.

To make dealing with date features easier, let’s convert the type of ‘ScheduledDay’ and ‘AppointmentDay’ to datetime64[ns]. You need this to get access to the useful datetime methods and attributes.
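One way to do the conversion is pd.to_datetime. The sketch assumes the raw values are ISO 8601 strings with a trailing Z, as in the made-up rows below, and drops the timezone so that the columns are plain (timezone-naive) datetimes.

```python
import pandas as pd

# Made-up rows in the assumed raw string format.
df = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-03T00:00:00Z"],
})

for col in ["ScheduledDay", "AppointmentDay"]:
    # Parse as UTC, then drop the timezone to get a naive datetime column.
    df[col] = pd.to_datetime(df[col], utc=True).dt.tz_localize(None)

print(df.dtypes)
```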

Another way is to convert types of columns while reading the data.

To do this, pass a list of the names of the columns that should be treated as dates to the parse_dates parameter of the read_csv method. This way they will be parsed as dates when the file is loaded:
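A small sketch of the parse_dates route, reading made-up rows from an in-memory string instead of the real file. With timestamps that carry a Z suffix, pandas is expected to return timezone-aware (UTC) columns.

```python
import io
import pandas as pd

# Made-up rows in the assumed raw format.
sample_csv = io.StringIO(
    "ScheduledDay,AppointmentDay,Age\n"
    "2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62\n"
    "2016-04-27T08:36:51Z,2016-05-03T00:00:00Z,8\n"
)

# The listed columns are parsed as dates while the file is read.
df = pd.read_csv(sample_csv, parse_dates=["ScheduledDay", "AppointmentDay"])
print(df.dtypes)
```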

Also, it’s a good idea to convert string data types to categorical because this data type helps you save some memory by making the dataframe smaller. The memory usage of categorical variables is proportional to the number of categories + the length of the data.
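The memory effect can be checked directly. The sketch below uses a hypothetical column with two distinct values repeated many times and compares its footprint before and after the conversion.

```python
import pandas as pd

# Hypothetical column: few distinct values, many rows.
df = pd.DataFrame({"Gender": ["F", "M", "F", "F"] * 1000})

before = df["Gender"].memory_usage(deep=True)   # bytes used as strings
df["Gender"] = df["Gender"].astype("category")
after = df["Gender"].memory_usage(deep=True)    # bytes used as a categorical

print(before, after)
```

The saving shrinks as the number of distinct values approaches the number of rows, so the conversion pays off mainly for columns like Gender or Neighborhood.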

Also, a categorical column will be treated as a categorical variable by most statistical Python libraries.

Sometimes the data can be inconsistent. For example, if an appointment day comes before the scheduled day, then something is wrong and we need to swap their values.
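One possible way to do the swap is sketched below on two made-up rows. It compares calendar dates only, on the assumption that ScheduledDay carries a time of day while AppointmentDay does not, so that a same-day booking is not flagged by mistake.

```python
import pandas as pd

# Made-up rows; the second one has the two dates in the wrong order.
df = pd.DataFrame({
    "ScheduledDay": pd.to_datetime(["2016-04-27 08:36:51", "2016-05-10 09:00:00"]),
    "AppointmentDay": pd.to_datetime(["2016-05-03", "2016-05-04"]),
})

# Flag rows where the appointment date precedes the booking date.
swapped = df["AppointmentDay"].dt.normalize() < df["ScheduledDay"].dt.normalize()

# For flagged rows, take the value from the other column.
new_sched = df["ScheduledDay"].where(~swapped, df["AppointmentDay"])
new_appt = df["AppointmentDay"].where(~swapped, df["ScheduledDay"])
df["ScheduledDay"], df["AppointmentDay"] = new_sched, new_appt

print(df)
```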

Prettify the column names

You may have noticed that our features contain typing errors.

Let’s rename misspelled column names:
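A sketch of the renaming step. The raw spellings on the left-hand side are an assumption about the Kaggle file; the names on the right follow the feature list given earlier.

```python
import pandas as pd

# Empty frame with the assumed raw column names.
df = pd.DataFrame(
    columns=["Hipertension", "Handcap", "Neighbourhood", "SMS_received", "No-show"]
)

df = df.rename(columns={
    "Hipertension": "Hypertension",
    "Handcap": "Handicap",
    "Neighbourhood": "Neighborhood",  # match the spelling used in this tutorial
})
print(list(df.columns))
```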

Optionally, you can rename “No-show” column to “Presence” and its values to ‘Present’ and ‘Absent’ so as to avoid any misinterpretation.
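The optional renaming could be sketched as follows, assuming the column holds the strings 'Yes' and 'No', where 'No' means the patient did not miss the appointment.

```python
import pandas as pd

# Made-up values in the assumed raw encoding.
df = pd.DataFrame({"No-show": ["No", "Yes", "No"]})

df = df.rename(columns={"No-show": "Presence"})
# "No" (no no-show) means the patient was present.
df["Presence"] = df["Presence"].map({"No": "Present", "Yes": "Absent"})
print(df["Presence"].tolist())
```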

Now that our dataset is neat and accurate, let’s move ahead to extending the dataset with new features.

Feature engineering

We can add a new feature to the dataset — ‘Waiting Time Days’ to check how long the patient needs to wait for the appointment day.
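A minimal sketch of the waiting-time feature on two made-up rows. Both dates are normalized to midnight first, so that the time of day in ScheduledDay does not produce a negative result for same-day appointments.

```python
import pandas as pd

# Made-up rows with datetime columns.
df = pd.DataFrame({
    "ScheduledDay": pd.to_datetime(["2016-04-29 18:38:08", "2016-04-27 08:36:51"]),
    "AppointmentDay": pd.to_datetime(["2016-04-29", "2016-05-03"]),
})

# Whole days between booking and appointment.
df["Waiting Time Days"] = (
    df["AppointmentDay"].dt.normalize() - df["ScheduledDay"].dt.normalize()
).dt.days
print(df["Waiting Time Days"].tolist())
```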

Another new feature may be ‘WeekDay’ — a weekday of an appointment. With this feature, we can analyze on which days people don’t show up more often.

Similarly, add ‘Month’, ‘Hour’ features:
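These calendar features come straight from the .dt accessor. In the sketch below, taking the hour from ScheduledDay is an assumption, made because AppointmentDay appears to carry no time of day; the rows are made up.

```python
import pandas as pd

# Made-up rows with datetime columns.
df = pd.DataFrame({
    "ScheduledDay": pd.to_datetime(["2016-04-29 18:38:08", "2016-04-27 08:36:51"]),
    "AppointmentDay": pd.to_datetime(["2016-04-29", "2016-05-03"]),
})

df["WeekDay"] = df["AppointmentDay"].dt.day_name()   # e.g. "Friday"
df["Month"] = df["AppointmentDay"].dt.month_name()   # e.g. "April"
df["Hour"] = df["ScheduledDay"].dt.hour              # hour of booking (assumption)
print(df[["WeekDay", "Month", "Hour"]])
```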

Dealing with missing values

Let’s check whether there are null values in each column in this elegant way:

Alternatively, if you want to check an individual column for the presence of null values, you can do it this way:
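Both checks can be sketched with isnull(), shown here on a small made-up frame that contains no missing values.

```python
import pandas as pd

# Made-up frame without missing values.
df = pd.DataFrame({"Age": [62, 8, 37], "Gender": ["F", "M", "F"]})

null_counts = df.isnull().sum()            # number of nulls in each column
age_has_nulls = df["Age"].isnull().any()   # check for a single column

print(null_counts)
print(age_has_nulls)
```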

We are lucky — there are no null values in our dataset.

Still, what are the strategies to address missing values?

Analyzing existing techniques and approaches, I’ve come to the conclusion that the most popular strategies for dealing with missing data are:

  • Leaving them as they are.
  • Removing them with dropna().
  • Filling NA/NaN values with fillna().
  • Replacing missing values with expected values (mean) or zeros.
  • Dropping a column in case the number of its missing values exceeds a certain threshold (e.g., > 50% of values).
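The strategies above can be sketched on a made-up frame with a few gaps. The Notes column is hypothetical and exists only to demonstrate the threshold rule.

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values.
df = pd.DataFrame({
    "Age": [62.0, np.nan, 37.0, 21.0],
    "Scholarship": [0.0, 1.0, np.nan, 0.0],
    "Notes": [np.nan, np.nan, np.nan, "x"],  # hypothetical, mostly empty
})

dropped_rows = df.dropna(subset=["Age"])          # remove rows lacking Age
filled_zero = df["Scholarship"].fillna(0)         # replace gaps with zeros
filled_mean = df["Age"].fillna(df["Age"].mean())  # replace gaps with the mean

# Drop columns in which more than 50% of the values are missing.
too_sparse = df.columns[df.isnull().mean() > 0.5]
trimmed = df.drop(columns=too_sparse)
print(list(trimmed.columns))
```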

Exploring the dataset

Once you’ve cleaned the data, it’s time to inspect it more profoundly.

Perform the following steps:

  • Check unique values in all the columns

  • Take a look at basic statistics of the numerical features:
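Both steps can be sketched with unique(), nunique() and describe(), shown here on a made-up frame.

```python
import pandas as pd

# Made-up stand-in for the cleaned data.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F"],
    "Age": [62, 8, 37, 21],
    "SMS_received": [0, 1, 0, 0],
})

for col in df.columns:
    print(col, df[col].unique())   # distinct values per column

n_unique = df.nunique()            # number of distinct values per column
stats = df.describe()              # count, mean, std, min, quartiles, max
print(stats)
```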

Plotting data

  • Check patients distribution by gender.

The best charts for *visualizing proportions* are pie, donut charts, treemaps, stacked area and bar charts. Let’s use a pie chart:

  • Check how many people didn’t show up at the appointment date:
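The shares behind such a chart can be computed with value_counts(normalize=True). The rows below are made up and assume the Presence column with the values 'Present' and 'Absent', so the numbers will not match the real dataset.

```python
import pandas as pd

# Made-up rows: four patients present, one absent.
df = pd.DataFrame({"Presence": ["Present"] * 4 + ["Absent"]})

# Percentage of each outcome, rounded to one decimal place.
share = df["Presence"].value_counts(normalize=True).mul(100).round(1)
print(share)
```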

It’s clear that only 20.2% of patients didn’t show up while 79.8% were present on the appointment day.

  • Measure the variability of the Age data.

A **box & whiskers plot** handles this task best:

With this interactive plot, you can see that the second quartile of the data (the median) is 37.

That means that 50% of patients are younger than 37 and the other 50% are older than 37.

Upper quartile means that 75% of the age values fall below 55. Lower quartile means that 25% of age values fall below 18.

The range of age values from lower to upper quartile is called the interquartile range. From the plot, you can conclude that 50% of patients are aged 18 to 55 years.

If you take a look at the whiskers, you’ll find the greatest value (excluding outliers), which is 102.

Our data contains only one outlier — a patient with age 115. The lowest value is 0 which is quite possible since the patients can be small children.

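Another insight from this plot is that the data is positively skewed. The box itself is almost symmetric (Quartile 3 - Quartile 2 = 18, Quartile 2 - Quartile 1 = 19), but the upper whisker, which runs from 55 to 102, is far longer than the lower one, which runs from 18 down to 0.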
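The whisker and outlier positions can be checked with a little arithmetic on the quartiles quoted above. The sketch assumes the common rule that points beyond 1.5 times the interquartile range from the box count as outliers; whether the chart used exactly this rule is an assumption.

```python
# Quartiles as reported in the text.
q1, q2, q3 = 18, 37, 55

iqr = q3 - q1                   # interquartile range: 37
upper_fence = q3 + 1.5 * iqr    # 110.5
lower_fence = q1 - 1.5 * iqr    # -37.5

# 102 lies inside the upper fence, 115 lies outside it,
# and the minimum age of 0 is well above the lower fence.
print(iqr, upper_fence, lower_fence)
```

Under this rule, 102 can be the end of the upper whisker, 115 is flagged as an outlier, and no low value is flagged because the lower fence is negative.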

  • Check the interquartile ranges of the age of those people who show up and those who don’t.

For this, we can use the same box plot, grouped by the “Presence” column.

  • Analyze the age ranges of men and women:

  • Check the frequency of showing up and not showing up by gender
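The numbers behind such a chart can be obtained with pd.crosstab. The rows below are made up and assume the Presence column introduced earlier.

```python
import pandas as pd

# Made-up rows.
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "F"],
    "Presence": ["Present", "Absent", "Present", "Present", "Absent", "Present"],
})

counts = pd.crosstab(df["Gender"], df["Presence"])                    # raw counts
rates = pd.crosstab(df["Gender"], df["Presence"], normalize="index")  # share within each gender
print(counts)
print(rates)
```

Comparing the row-normalized rates is usually fairer than comparing raw counts when one gender has many more appointments than the other.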

  • Find the weekdays on which people miss appointments most often:
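A sketch of the weekday breakdown, assuming the WeekDay and Presence columns created earlier. The rows are made up, so the result does not reflect the real data.

```python
import pandas as pd

# Made-up rows.
df = pd.DataFrame({
    "WeekDay": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Tuesday", "Friday"],
    "Presence": ["Absent", "Absent", "Present", "Absent", "Absent", "Present"],
})

# Number of missed appointments per weekday.
absent_by_day = df.loc[df["Presence"] == "Absent", "WeekDay"].value_counts()

# Share of missed appointments among all appointments on that weekday.
rate_by_day = df["Presence"].eq("Absent").groupby(df["WeekDay"]).mean()

print(absent_by_day)
print(rate_by_day)
```

Raw counts favor the busiest days, so it is worth looking at the rate as well before drawing conclusions.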

You can see that people don’t show up mostly on Tuesdays and Wednesdays.

What’s next

Possible techniques that can be applied to this data later:

  • Unsupervised ML techniques, namely KMeans clustering or hierarchical clustering (but don’t forget to scale the features!). Clustering may help to identify groups of patients that share common features.
  • Analyze which variables have explanatory power for the “No-show” column.

Bringing it all together

That’s it for now! You’ve finished exploring the dataset but you can continue revealing insights.

Hopefully, this simple project will be helpful in grasping the basic idea of EDA. I encourage you to experiment with the data and with different types of visualizations to figure out the best way to get the most out of your data.

#python #data-science

Ray Patel


Top 30 Python Tips and Tricks for Beginners

Welcome to my blog. In this article, you are going to learn the top 10 Python tips and tricks.

1) Swap two numbers.

2) Reversing a string in Python.

3) Create a single string from all the elements in a list.

4) Chaining Of Comparison Operators.

5) Print The File Path Of Imported Modules.

6) Return Multiple Values From Functions.

7) Find The Most Frequent Value In A List.

8) Check The Memory Usage Of An Object.

#python #python hacks tricks #python learning tips #python programming tricks #python tips #python tips and tricks #python tips and tricks advanced #python tips and tricks for beginners #python tips tricks and techniques #python tutorial #tips and tricks in python #tips to learn python #top 30 python tips and tricks for beginners

HI Python



Exploratory Data Analysis in Python: What You Need to Know?

Exploratory Data Analysis (EDA) is a very common and important practice followed by all data scientists. It is the process of looking at tables and tables of data from different angles in order to understand it fully. Gaining a good understanding of data helps us to clean and summarize it, which then brings out the insights and trends which were otherwise unclear.

EDA has no hard-and-fast set of rules to follow, unlike ‘data analysis’, for example. People who are new to the field tend to confuse the two terms, which are similar but differ in their purpose. Unlike EDA, data analysis is more inclined towards the use of probabilities and statistical methods to reveal facts and relationships among different variables.

Coming back, there is no right or wrong way to perform EDA. It varies from person to person; however, there are some commonly followed guidelines, which are listed below.

  • Handling missing values: Null values can be seen when all the data may not have been available or recorded during collection.
  • Removing duplicate data: It is important to prevent any overfitting or bias created by training the machine learning algorithm on repeated data records.
  • Handling outliers: Outliers are records that differ drastically from the rest of the data and don’t follow the trend. They can arise due to certain exceptions or inaccuracies during data collection.
  • Scaling and normalizing: This is only done for numerical variables. Most of the time the variables differ greatly in their range and scale, which makes it difficult to compare them and find correlations.
  • Univariate and Bivariate analysis: Univariate analysis is usually done by seeing how one variable is affecting the target variable. Bivariate analysis is carried out between any 2 variables, which can be numerical, categorical or both.

We will look at how some of these are implemented using a very famous ‘Home Credit Default Risk’ dataset available on Kaggle here. The data contains information about the loan applicant at the time of applying for the loan. It contains two types of scenarios:

  • The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
  • All other cases: All other cases when the payment is paid on time.

We’ll be only working on the application data files for the sake of this article.

#data science #data analysis #data analysis in python #exploratory data analysis in python

Siphiwe Nair


Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you should probably think about your data architecture and possible best practices.

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

Hertha Walsh


Graphical Approach to Exploratory Data Analysis in Python

Exploratory Data Analysis (EDA) is one of the most important aspects of every data science or data analysis problem. It gives us a greater understanding of our data and can possibly unravel hidden insights that aren’t that obvious to us. The first article I wrote on Medium is also on performing EDA in R; you can check it out here. This post will focus more on graphical EDA in Python using matplotlib, a regression line and even a motion chart!


The dataset we are using for this article can be obtained from Gapminder by drilling down into _Population_, _Gender Equality in Education_ and _Income_.

The _Population_ data contains yearly data regarding the estimated resident population, grouped by countries around the world, between 1800 and 2018.

The _Gender Equality in Education_ data contains yearly data between 1970 and 2015 on the ratio of females to males in schools among 25 to 34 year olds, covering primary, secondary and tertiary education across different countries.

The _Income _data contains yearly data of income per person adjusted for differences in purchasing power (in international dollars) across different countries around the world, for the period between 1800 and 2018.

EDA on Population

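Let’s first plot the population data over time, focusing mainly on three countries: Singapore, the United States and China. We will use the matplotlib library to plot 3 different line charts on the same figure.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## read in data
population = pd.read_csv('./population.csv')
## plot for the 3 countries (the "Singapore" and "China" column names are assumed
## to follow the same pattern as "United States")
plt.plot(population.Year,population["Singapore"],label="Singapore")
plt.plot(population.Year,population["United States"],label="United States")
plt.plot(population.Year,population["China"],label="China")
## add legends, labels and title
plt.legend()
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('Population Growth over time')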

#exploratory-data-analysis #data-analysis #data-science #data-visualization #python

Aketch Rachel


Exploratory Data Analysis in Few Seconds

EDA is a way to understand what the data is all about. It is very important because it helps us to understand the outliers and the relationships between features in the data with the help of graphs and plots.

EDA is a time-consuming process, as we need to create visualizations of different features using libraries like Matplotlib, seaborn, etc.

There is a way to automate this process with a single line of code using the library Pandas Visual Analysis.

About Pandas Visual Analysis

  1. It is an open-source python library used for Exploratory Data Analysis.
  2. It creates an interactive user interface to visualize datasets in Jupyter Notebook.
  3. Visualizations created can be downloaded as images from the interface itself.
  4. It has a selection type that will help to visualize patterns with and without outliers.


  1. Installation
  2. Importing Dataset
  3. EDA using Pandas Visual Analysis

Understanding Output

Let’s understand the different sections in the user interface :

  1. Statistical Analysis: This section will show the statistical properties like Mean, Median, Mode, and Quantiles of all numerical features.
  2. Scatter Plot: It shows the distribution between 2 different features with the help of a scatter plot. You can choose the features to be plotted on the X and Y axes from the dropdown.
  3. Histogram: It shows the distribution between 2 different features with the help of a histogram.

#data-analysis #machine-learning #data-visualization #data-science #data analysis #exploratory data analysis