The Best Exploratory Data Analysis with Pandas Profiling

Table of Contents

  1. Introduction
  2. Overview
  3. Variables
  4. Interactions
  5. Correlations
  6. Missing Values
  7. Sample
  8. Summary
  9. References

Introduction

There are countless ways to perform exploratory data analysis (EDA) in Python (and in R). I do most of mine in the popular Jupyter Notebook. Once I realized there was a library that could summarize my dataset with just one line of code, I made sure to utilize it for every project, reaping countless benefits from the ease of this EDA tool. The EDA step should be performed first before executing any Machine Learning models for all Data Scientists, therefore, the kind and intelligent developers from Pandas Profiling [2] have made it easy to view your dataset in a beautiful format, while also describing the information well in your dataset.

The Pandas Profiling report serves as this excellent EDA tool that can offer the following benefits: overview, variables, interactions, correlations, missing values, and a sample of your data. I will be using randomly generated data to serve as an example of this useful tool.

Overview

Image for post

Overview example. Screenshot by Author [3].

The overview tab in the report provides a quick glance at how many variables and observations you have or the number of rows and columns. It will also perform a calculation to see how many of your missing cells there are compared to the whole dataframe column. Additionally, it will point out duplicate rows as well and calculate that percentage. This tab is most similar to part of the describe function from Pandas, while providing a better user-interface (UI) experience.

The overview is broken into dataset statistics and variable types. You can also refer to warnings and reproduction for more specific information on your data.

I will be discussing variables, which are also referred to as columns or features of your dataframe

#technology #towards-data-science #data-science #machine-learning #education #pandas

What is GEEK

Buddha Community

The Best Exploratory Data Analysis with Pandas Profiling
bindu singh

bindu singh

1647351133

Procedure To Become An Air Hostess/Cabin Crew

Minimum educational required – 10+2 passed in any stream from a recognized board.

The age limit is 18 to 25 years. It may differ from one airline to another!

 

Physical and Medical standards –

  • Females must be 157 cm in height and males must be 170 cm in height (for males). This parameter may vary from one airline toward the next.
  • The candidate's body weight should be proportional to his or her height.
  • Candidates with blemish-free skin will have an advantage.
  • Physical fitness is required of the candidate.
  • Eyesight requirements: a minimum of 6/9 vision is required. Many airlines allow applicants to fix their vision to 20/20!
  • There should be no history of mental disease in the candidate's past.
  • The candidate should not have a significant cardiovascular condition.

You can become an air hostess if you meet certain criteria, such as a minimum educational level, an age limit, language ability, and physical characteristics.

As can be seen from the preceding information, a 10+2 pass is the minimal educational need for becoming an air hostess in India. So, if you have a 10+2 certificate from a recognized board, you are qualified to apply for an interview for air hostess positions!

You can still apply for this job if you have a higher qualification (such as a Bachelor's or Master's Degree).

So That I may recommend, joining Special Personality development courses, a learning gallery that offers aviation industry courses by AEROFLY INTERNATIONAL AVIATION ACADEMY in CHANDIGARH. They provide extra sessions included in the course and conduct the entire course in 6 months covering all topics at an affordable pricing structure. They pay particular attention to each and every aspirant and prepare them according to airline criteria. So be a part of it and give your aspirations So be a part of it and give your aspirations wings.

Read More:   Safety and Emergency Procedures of Aviation || Operations of Travel and Hospitality Management || Intellectual Language and Interview Training || Premiere Coaching For Retail and Mass Communication |Introductory Cosmetology and Tress Styling  ||  Aircraft Ground Personnel Competent Course

For more information:

Visit us at:     https://aerofly.co.in

Phone         :     wa.me//+919988887551 

Address:     Aerofly International Aviation Academy, SCO 68, 4th Floor, Sector 17-D,                            Chandigarh, Pin 160017 

Email:     info@aerofly.co.in

 

#air hostess institute in Delhi, 

#air hostess institute in Chandigarh, 

#air hostess institute near me,

#best air hostess institute in India,
#air hostess institute,

#best air hostess institute in Delhi, 

#air hostess institute in India, 

#best air hostess institute in India,

#air hostess training institute fees, 

#top 10 air hostess training institute in India, 

#government air hostess training institute in India, 

#best air hostess training institute in the world,

#air hostess training institute fees, 

#cabin crew course fees, 

#cabin crew course duration and fees, 

#best cabin crew training institute in Delhi, 

#cabin crew courses after 12th,

#best cabin crew training institute in Delhi, 

#cabin crew training institute in Delhi, 

#cabin crew training institute in India,

#cabin crew training institute near me,

#best cabin crew training institute in India,

#best cabin crew training institute in Delhi, 

#best cabin crew training institute in the world, 

#government cabin crew training institute

 iOS App Dev

iOS App Dev

1620466520

Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you should probably think about your data architecture and possible best practices.

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points from our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

Aketch  Rachel

Aketch Rachel

1625001660

Exploratory Data Analysis in Few Seconds

EDA is a way to understand what the data is all about. It is very important as it helps us to understand the outliers, relationship of features within the data with the help of graphs and plots.

EDA is a time taking process as we need to make visualizations between different features using libraries like Matplot, seaborn, etc.

There is a way to automate this process by a single line of code using the library Pandas Visual Analysis.

About Pandas Visual Analysis

  1. It is an open-source python library used for Exploratory Data Analysis.
  2. It creates an interactive user interface to visualize datasets in Jupyter Notebook.
  3. Visualizations created can be downloaded as images from the interface itself.
  4. It has a selection type that will help to visualize patterns with and without outliers.

Implementation

  1. Installation
  2. 2. Importing Dataset
  3. 3. EDA using Pandas Visual Analysis

Understanding Output

Let’s understand the different sections in the user interface :

  1. Statistical Analysis: This section will show the statistical properties like Mean, Median, Mode, and Quantiles of all numerical features.
  2. Scatter Plot-It shows the Distribution between 2 different features with the help of a scatter plot. you can choose features to be plotted on the X and Y axis from the dropdown.
  3. Histogram-It shows the distribution between 2 Different features with the help of a Histogram.

#data-analysis #machine-learning #data-visualization #data-science #data analysis #exploratory data analysis

Tia  Gottlieb

Tia Gottlieb

1594144380

Exploratory Data Analysis — Passport Numbers in Pandas

According to the ICAO standard, the passport number should be up to 9 characters long and can contain numbers and letters. During your work as an analyst, you can come along a data set containing the passports and you will be asked to explore it.

I have recently worked with one such set and I’d like to share the steps of this analysis with you, including:

  • Number of records
  • Duplicated records
  • Length of the records
  • Analysis of the leading and trailing zeros
  • Appearance of the character at a specific position
  • Where do letters appear in the string using regular expressions (regexes)
  • Length on the sequence of letters
  • Is there a common prefix
  • Anonymize/Randomize the data while keeping the characteristics of the dataset

You can go through the steps with me. Get the (randomized data) from github. It also contains the Jupiter notebook with all the steps.

Basic Dataset Exploration

First, let’s load the data. Since the dataset contains only one column, it’s quite straightforward.

# import the packages which will be used
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"path\data.csv")
df.info()

The .info() command will tell us that we have 10902 passports in the dataset and all are imported as “object” which means that the format is string.

Duplicates

As an initial step of any analysis should be the check if there are some duplicated values. In our case, there are, so we will remove them using pandas’s .drop_duplicates() method.

print(len(df["PassportNumber"].unique()))# if lower than 10902 there are duplicates

df.drop_duplicates(inplace=True) # or df = df.drop_duplicates()

Length of the passports

Usually, you continue with the check of the longest and the shortest passport.

[In]: df["PassportNumber"].agg(["min","max"])
[Out]: 
   min 000000050
   max ZXD244549
   Name: PassportNumber, dtype: object

You might become happy, that all the passports are 9 characters long, but you would be misled. The data are having string format so that the lowest “string” value is the one which starts with the highest number of zeros and the largest the one which has the most zeds at the beginning.

# ordering of the strings is not the same as order of numbers
0 > 0001 > 001 > 1 > 123 > AB > Z > Z123 > ZZ123

In order to see the length of the passports let’s look at their length.

[In]: df["PassportNumber"].apply(len).agg(["min","max"])
[Out]: 
   min 3
   max 17
   Name: PassportNumber, dtype: object

In the contracts to our initial belief, the shortest passport contains only 3 characters while the longest is 17 (way over the expected maximum of 9) characters long.

Let’s expand our data frame with the 'len'column so that we can have a look at examples:

# Add the 'len' column
df['len'] = df["PassportNumber"].apply(len)

# look on the examples having the maximum lenght
[In]: df[df["len"]==df['len'].max()]
[Out]: 
   PassportNumber    len
   25109300000000000 17
   27006100000000000 17
# look on the examples having the minimum lenght
[In]: df[df["len"]==df['len'].min()]
[Out]: 
   PassportNumber    len
   179               3
   917               3
   237               3

The 3 digit passport numbers look suspicious, but they meet the ICAO criteria, but the longest ones are obviously too long, however, they contain quite many trailing zeros. Maybe someone just added the zeros in order to meet some data storage requirements.

Let’s have a look at the overall length distribution of our data sample.

# calculate count of appearance of various lengths
counts_by_value = df["len"].value_counts().reset_index()
separator = pd.Series(["|"]*df["len"].value_counts().shape[0])
separator.name = "|"
counts_by_index = df["len"].value_counts().sort_index().reset_index()
lenght_distribution_df = pd.concat([counts_by_value, separator, counts_by_index], axis=1)
# draw the chart
ax = df["len"].value_counts().sort_index().plot(kind="bar")
ax.set_xlabel("position")
ax.set_ylabel("number of records")
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.05))

Distribution of the passport lengths of the data sample

We see, that the most passports number in our sample, are 7, 8 or 9 characters long. Quite a few are however 10 or 12 characters long, which is unexpected.

Leading and trailing zeros

Maybe the long passports are having leading or trailing zeros, like our example with 17 characters.

In order to explore these zero-pads let’s add two more columns to our data set — ‘leading_zeros’ and ‘trailing_zeros’ to contain the number of leading and trailing zeros.

# number of leading zeros can be calculated by subtracting the length of the string l-stripped off the leading zeros from the total length of the string

df["leading_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.lstrip("0")))
# similarly the number of the trailing zeros can be calculated by subtracting the length of the string r-stripped off the leading zeros from the total length of the string
df["trailing_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.rstrip("0")))

Then we can easily display the passport which has more than 9 characters to check if the have any leading or trailing zeros:

[In]: df[df["len"]>9]
[Out]:
   PassportNumber  len  leading_zeros trailing_zeros
   73846290957     11   0             0
   N614226700      10   0             2
   WC76717593      10   0             0
   ...

#pandas #exploratory-data-analysis #data-analysis #dataset #data analysis

Gerhard  Brink

Gerhard Brink

1624272463

How Are Data analysis and Data science Different From Each Other

With possibly everything that one can think of which revolves around data, the need for people who can transform data into a manner that helps in making the best of the available data is at its peak. This brings our attention to two major aspects of data – data science and data analysis. Many tend to get confused between the two and often misuse one in place of the other. In reality, they are different from each other in a couple of aspects. Read on to find how data analysis and data science are different from each other.

Before jumping straight into the differences between the two, it is critical to understand the commonalities between data analysis and data science. First things first – both these areas revolve primarily around data. Next, the prime objective of both of them remains the same – to meet the business objective and aid in the decision-making ability. Also, both these fields demand the person be well acquainted with the business problems, market size, opportunities, risks and a rough idea of what could be the possible solutions.

Now, addressing the main topic of interest – how are data analysis and data science different from each other.

As far as data science is concerned, it is nothing but drawing actionable insights from raw data. Data science has most of the work done in these three areas –

  • Building/collecting data
  • Cleaning/filtering data
  • Organizing data

#big data #latest news #how are data analysis and data science different from each other #data science #data analysis #data analysis and data science different