PANDAS: Put Away Novice Data Analyst Status

Pandas, or as I call it, Put Away Novice Data Analyst Status, is a powerful open-source data analysis and manipulation library. It helps you run many kinds of operations on your data and generate reports about it. I am going to split this into two articles:

Basics: covered in this story. I will go through the basic Pandas functions that give you an overview of how to start working with Pandas and how it can save you a lot of time.
Advanced: covers advanced functions that make it easy to solve complex analysis problems, with topics like styling, plotting, reading multiple files, etc. Part 2 is still a work in progress, and I am aiming to release it next week. Stay tuned.

Before starting, make sure you have installed Pandas. If not, you can use one of the following commands to install it.

# If you are using Anaconda Distribution (recommended)
conda install -c conda-forge pandas

# install pandas using pip
pip install pandas

# import pandas in notebook or python script
import pandas as pd

If you have not yet set up an environment for your data analytics or data science projects, you can refer to my blog post ‘How to Start your Data Science Journey’, which walks you through the different libraries you can use and how to install them.

For this exercise, I will use the famous Titanic dataset. I recommend downloading the data and notebook from GitHub, copying them into your environment, and following along.

ankitgoel1602/data-science

PANDAS: Put Away Novice Data Analytics Status. This repository gives an overview of different Pandas APIs which you can use…

github.com

For more details about the data, you can refer to Kaggle.

Let’s begin. I have tried to follow the general flow of a data analysis: starting with reading the data and then going through the different steps of the analysis process.

1. Reading data using read_csv or read_excel

The starting point of any data analysis is acquiring the dataset. Pandas provides functions to read data from different formats. The most commonly used ones are:

read_csv( )

This allows you to read a CSV file.

pd.read_csv('path_to_your_csv_file.csv')

Pandas provides many options you can configure, such as column names, data types, or the number of rows to read. Check out the Pandas read_csv API for more details.

Options that Pandas provides for read_csv. Source: Pandas documentation.
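
As a rough illustration of these options, the sketch below reads a small made-up CSV from an in-memory string (so it runs without any file on disk) and uses three read_csv parameters: usecols, dtype, and nrows. The column names and values are invented for the example and are not the Titanic data.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a file on disk.
csv_text = """PassengerId,Name,Age,Fare
1,Alice,22,7.25
2,Bob,38,71.28
3,Carol,26,7.92
4,Dave,35,53.10
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["PassengerId", "Age", "Fare"],             # read only these columns
    dtype={"PassengerId": "int32", "Fare": "float32"},  # force column types
    nrows=3,                                            # read only the first 3 data rows
)

print(subset.shape)   # (3, 3)
print(subset.dtypes)
```

With a real file, you would pass its path instead of the io.StringIO object; the other arguments stay the same.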

read_excel( )

This allows you to read an Excel file.

pd.read_excel('path_to_your_excel_file.xlsx')

Like read_csv, read_excel offers a rich set of options that let you pick a particular sheet by name, set data types, or limit the number of rows to read. Check out the Pandas read_excel API for more details.

And that’s not all: Pandas supports many other file formats. Do check out the Pandas documentation if your data is in a different format.
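
As one example of another supported reader, here is a minimal sketch using read_json on a tiny JSON string. The records are made up for illustration; with a real file you would pass its path instead of the in-memory string.

```python
import io

import pandas as pd

# Hypothetical JSON records; each object becomes one row.
json_text = '[{"name": "Alice", "age": 22}, {"name": "Bob", "age": 38}]'

people = pd.read_json(io.StringIO(json_text))

print(people.shape)             # (2, 2)
print(people["age"].tolist())   # [22, 38]
```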

Let’s read the Titanic dataset that we are going to use here with read_csv:

# You can get it from the Github link provided.

# Loading Titanic Dataset into titanic_data variable
titanic_data = pd.read_csv('titanic_train.csv')

This will create a Pandas DataFrame (similar to a table) and store it in the titanic_data variable.

Next, we will see how to get more details about the data we loaded.

2. Explore data using the head, tail, or sample.

Once we have loaded the data, we would like to review it. Pandas provides different APIs that we can use to explore the data.

head( )

This is like the TOP clause in SQL and gives us the first ’n’ records of the DataFrame.

# Selecting top 5 (n=5) records from the DataFrame
titanic_data.head(5)

Top 5 records from Titanic Dataset. Source: Author.

tail( )

This gives us the ’n’ records from the end of the DataFrame.

# Selecting last 5 (n=5) records from the DataFrame
titanic_data.tail(5)

Last 5 records from Titanic Dataset. Source: Author.

sample( )

This picks ’n’ records at random from the data. Note: the output of this command can differ from run to run.

titanic_data.sample(5)

5 random records from Titanic Dataset. Source: Author.
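
To see the three calls side by side, here is a small sketch on a made-up ten-row DataFrame (not the Titanic data). It also shows one way to make sample( ) repeatable: passing random_state, an optional seed argument, so the same rows come back on every run.

```python
import pandas as pd

# Made-up toy data with ids 1..10.
df = pd.DataFrame({"id": range(1, 11), "value": [x * 10 for x in range(1, 11)]})

first_three = df.head(3)   # rows with id 1, 2, 3
last_three = df.tail(3)    # rows with id 8, 9, 10
default_head = df.head()   # n defaults to 5 when omitted

# The same seed gives the same "random" rows on every call.
sample_a = df.sample(3, random_state=42)
sample_b = df.sample(3, random_state=42)

print(first_three["id"].tolist())   # [1, 2, 3]
print(last_three["id"].tolist())    # [8, 9, 10]
print(len(default_head))            # 5
print(sample_a.equals(sample_b))    # True
```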

3. Data Dimensions using shape

Once we have the data, we need to know how many rows and columns we are dealing with, and the Pandas shape attribute gives us exactly that. Let’s see:

# shape of the dataframe; note there are no parentheses at the end because it is a property of the dataframe
titanic_data.shape

(891, 12)

(891, 12) means we have 891 rows and 12 columns in our Titanic dataset.
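
Because shape is a plain tuple, it can be unpacked or indexed like any other tuple. The sketch below uses a tiny made-up DataFrame to show this; the variable names are only illustrative.

```python
import pandas as pd

# Made-up data: 3 rows, 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = df.shape   # unpack the (rows, columns) tuple

print(n_rows, n_cols)       # 3 2
print(df.shape[0])          # 3, the row count on its own
print(len(df))              # 3, len() also returns the number of rows
```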
