Learn to group the data and summarize in several different ways, to use aggregate functions, data transformation, filter, map.

Source: Unspalsh by Ilona Froehlich

Groupby is a very popular function in Pandas. This is very good at summarising, transforming, filtering, and a few other very essential data analysis tasks. In this article, I will explain the application of groupby function in detail with example.

For this article, I will use a ‘Students Performance’ dataset from Kaggle. Please feel free download the dataset from here:

Contribute to rashida048/Datasets development by creating an account on GitHub.

Here I am importing the necessary packages and the dataset:

```
import pandas as pd
import numpy as np
df = pd.read_csv('StudentsPerformance.csv')
df.head()
```

Groupby function splits the dataset based on criteria that you define. Here I am showing the process behind the groupby function. It will give you an idea, how much work we may have to do if we would not have groupby function. I will make a new smaller dataset of two columns only to demonstrate in this section. The columns are ‘gender’ and ‘reading score’.

```
test = df[['gender', 'reading score']]
test.head()
```

**Let’s find out the average reading score gender-wise**

First, we need to split the dataset based on gender. Generate the data for females only.

```
female = test['gender'] == 'female'
test[female].head()
```

In the same way, generate the data for the males,

```
male = test['gender'] == 'male'
test[male].head()
```

Using females and males dataset above to calculate the mean reading score for females and males respectively.

```
fe_avg = test[female]['reading score'].mean()
male_avg = test[male]['reading score'].mean()
print(fe_avg, male_avg)
```

The mean reading score of females is 72.608 and the mean reading score for males is 65.473. Now, make a DataFrame for the mean reading score of females and males.

`df_reading = pd.DataFrame({'Gender': ['female', 'male'], 'reading score': [fe_avg, male_avg]})`

Now, **let’s solve the same problem with the groupby function.** Splitting the data based on gender and applying the ‘mean’ on it with just one simple line of code:

`test.groupby('gender').mean()`

This small piece of code gives the same result.

I will use the original dataset ‘df’ now. Make groups of ‘race/ethnicity’.

```
race = df.groupby('race/ethnicity')
print(race)
```

Output: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023339DCE940>

It returns an object. Now check the datatype of ‘race’.

`type(race)`

Output: pandas.core.groupby.generic.DataFrameGroupBy

So, we generated a DataFrameGroupBy object. Calling groups on this DataFrameGroupBy object will return the indices of each group.

```
race.groups
#Here is the Output:
{'group A': Int64Index([ 3, 13, 14, 25, 46, 61, 62, 72, 77, 82, 88, 112, 129, 143, 150, 151, 170, 228, 250, 296, 300, 305, 327, 356, 365, 368, 378, 379, 384, 395, 401, 402, 423, 428, 433, 442, 444, 464, 467, 468, 483, 489, 490, 506, 511, 539, 546, 571, 575, 576, 586, 589, 591, 597, 614, 623, 635, 651, 653, 688, 697, 702, 705, 731, 741, 769, 778, 805, 810, 811, 816, 820, 830, 832, 837, 851, 892, 902, 911, 936, 943, 960, 966, 972, 974, 983, 985, 988, 994], dtype='int64'), 'group B': Int64Index([ 0, 2, 5, 6, 7, 9, 12, 17, 21, 26, ... 919, 923, 944, 946, 948, 969, 976, 980, 982, 991], dtype='int64', length=190), 'group C': Int64Index([ 1, 4, 10, 15, 16, 18, 19, 23, 27, 28, ... 963, 967, 971, 975, 977, 979, 984, 986, 996, 997], dtype='int64', length=319), 'group D': Int64Index([ 8, 11, 20, 22, 24, 29, 30, 33, 36, 37, ... 965, 970, 973, 978, 981, 989, 992, 993, 998, 999], dtype='int64', length=262), 'group E': Int64Index([ 32, 34, 35, 44, 50, 51, 56, 60, 76, 79, ... 937, 949, 950, 952, 955, 962, 968, 987, 990, 995], dtype='int64', length=140)}
```

Have a look at the output above. Groupby function splits the data into subgroups and you can now see the indices of each subgroup. That’s great! But only the indices are not enough. We need to see the real data of each group. The function ‘get_group’ helps with that.

`race.get_group('group B')`

I am showing the part of the results here. The original output is much bigger.

**Find the size of each group**

Calling size on the ‘race’ object will give the size of each group

`race.size()`

**Loop over each group**

You can loop over the groups. Here is an example:

```
for name, group in race:
print(name, 'has', group.shape[0], 'data')
```

**Grouping by multiple variables**

In all the examples above, we only grouped by one variable. But grouping by multiple variables is also possible. Here I am grouping by ‘race/ethnicity’ and ‘gender’. This should return the number of data of each race segregated by gender.

`df.groupby(['gender', 'race/ethnicity']).size()`

This example aggregates the data using ‘size’. There are other aggregate functions as well. Here is the list of all the aggregate functions:

sum()

mean()

size()

count()

std()

var()

sem()

min()

median()

Please try them out. Just replace any of these aggregate functions instead of the ‘size’ in the above example.

**Using multiple aggregate functions**

The way we can use groupby on multiple variables, using multiple aggregate functions is also possible. This next example will group by ‘race/ethnicity and will aggregate using ‘max’ and ‘min’ functions.

data-science pandas data-analysis towards-data-science python data analysis

🔵 Intellipaat Data Science with Python course: https://intellipaat.com/python-for-data-science-training/In this Data Science With Python Training video, you...

3 Python Pandas Tricks for Efficient Data Analysis. Pandas is one of the predominant data analysis tools which is highly appreciated among data scientists. It provides numerous flexible and versatile functions to perform efficient data analysis. In…

In this tutorial, you will know about the TED TALKS DATA ANALYSIS project from scratch.

In Conversation With Dr Suman Sanyal, NIIT University,he shares his insights on how universities can contribute to this highly promising sector and what aspirants can do to build a successful data science career.

Getting Started with Data Science with Python. we will explore more about pandas library and its uses in data science.