If you’re a data scientist, you likely spend a lot of time cleaning and manipulating data for use in your applications. One of the core libraries for preparing data is the Pandas library for Python.
In a previous post, we explored the background of Pandas and the basic usage of a Pandas DataFrame, the core data structure in Pandas. Check out that post if you want to get up to speed with the basics of Pandas.
In this post, we’ll explore a few of the core methods on Pandas DataFrames. These methods help you segment and review your DataFrames during your analysis.
We’ll cover the groupby, count, and value_counts methods.
Let’s get started.
Pandas is typically used for exploring and organizing large volumes of tabular data, like a super-powered Excel spreadsheet. Often, you’ll want to organize a pandas DataFrame into subgroups for further analysis.
For example, perhaps you have stock ticker data in a DataFrame, as we explored in the last post. Your Pandas DataFrame might look as follows:
>>> df
          date symbol     open     high      low    close    volume
0   2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73   4974877
1   2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17   6167358
2   2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43   3681522
3   2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95   3996001
4   2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95   4957017
5   2019-03-01   AAPL   174.28   175.15   172.89   174.97  25886167
6   2019-03-04   AAPL   175.69   177.75   173.97   175.85  27436203
7   2019-03-05   AAPL   175.94   176.00   174.54   175.53  19737419
8   2019-03-06   AAPL   174.67   175.49   173.94   174.52  20810384
9   2019-03-07   AAPL   173.87   174.44   172.02   172.50  24796374
10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99   1450316
11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80   1446047
12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03   1443174
13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86   1099289
14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30   1166559
Perhaps we want to analyze this stock information on a symbol-by-symbol basis rather than combining Amazon (“AMZN”) data with Google (“GOOG”) data or that of Apple (“AAPL”).
This is where the Pandas groupby method is useful. You can use groupby to chunk up your data into subsets for further analysis.
Let’s do some basic usage of groupby to see how it’s helpful.
In your Python interpreter, enter the following commands:
>>> import pandas as pd
>>> import numpy as np
>>> url = 'https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv'
>>> df = pd.read_csv(url)
>>> df
          date symbol     open     high      low    close    volume
0   2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73   4974877
1   2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17   6167358
2   2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43   3681522
3   2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95   3996001
4   2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95   4957017
5   2019-03-01   AAPL   174.28   175.15   172.89   174.97  25886167
6   2019-03-04   AAPL   175.69   177.75   173.97   175.85  27436203
7   2019-03-05   AAPL   175.94   176.00   174.54   175.53  19737419
8   2019-03-06   AAPL   174.67   175.49   173.94   174.52  20810384
9   2019-03-07   AAPL   173.87   174.44   172.02   172.50  24796374
10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99   1450316
11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80   1446047
12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03   1443174
13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86   1099289
14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30   1166559
In the steps above, we’re importing the Pandas and NumPy libraries, then setting up a basic DataFrame by downloading CSV data from a URL. We print our DataFrame to the console to see what we have.
Now, let’s group our DataFrame using the stock symbol. The easiest and most common way to use groupby is by passing one or more column names. For our example, we’ll use “symbol” as the column name for grouping:
>>> symbols = df.groupby('symbol')
>>> print(symbols.groups)
{'AAPL': Int64Index([5, 6, 7, 8, 9], dtype='int64'), 'AMZN': Int64Index([0, 1, 2, 3, 4], dtype='int64'), 'GOOG': Int64Index([10, 11, 12, 13, 14], dtype='int64')}
The printed groups output can be a little hard to read at first. It shows that we have three groups: AAPL, AMZN, and GOOG. For each group, it lists the index labels of the rows in the original DataFrame that belong to that group.
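For a quicker summary of a grouping, you can use the size method, which reports how many rows fell into each group, and len, which gives the number of groups. Here's a sketch using a small inline DataFrame standing in for the stock data above:

```python
import pandas as pd

# Small stand-in for the stock DataFrame above
df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AAPL', 'AAPL', 'GOOG'],
    'close':  [1671.73, 1696.17, 174.97, 175.85, 1140.99],
})

symbols = df.groupby('symbol')

# size() counts the rows that landed in each group
sizes = symbols.size()
print(sizes)

# len() gives the number of distinct groups
print(len(symbols))
```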
The input to groupby is quite flexible. You can choose to group by multiple columns. For example, if we had a year column available, we could group by both stock symbol and year to perform year-over-year analysis on our stock data.
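To sketch what multi-column grouping might look like (using hypothetical two-year data, since our sample covers only March 2019), you could derive a year column from the date and then group on both columns:

```python
import pandas as pd

# Hypothetical data spanning two years to illustrate multi-column grouping
df = pd.DataFrame({
    'date':   ['2018-03-01', '2019-03-01', '2018-03-01', '2019-03-01'],
    'symbol': ['AMZN', 'AMZN', 'AAPL', 'AAPL'],
    'close':  [1500.00, 1671.73, 170.00, 174.97],
})

# Derive a year column from the date string
df['year'] = pd.to_datetime(df['date']).dt.year

# Group on both columns; each group key is now a (symbol, year) tuple
means = df.groupby(['symbol', 'year'])['close'].mean()
print(means)
```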
In the previous example, we passed a column name to the groupby method. You can also pass your own function to the groupby method. This function will receive an index number for each row in the DataFrame and should return a value that will be used for grouping. This can provide significant flexibility for grouping rows using complex logic.
As an example, imagine we want to group our rows depending on whether the stock price increased on that particular day. We would use the following:
>>> def increased(idx):
...     return df.loc[idx].close > df.loc[idx].open
...
>>> df.groupby(increased).groups
{False: Int64Index([2, 3, 4, 7, 8, 9, 13, 14], dtype='int64'), True: Int64Index([0, 1, 5, 6, 10, 11, 12], dtype='int64')}
First, we define a function called increased, which receives an index. It returns True if the close value for that row in the DataFrame is higher than the open value; otherwise, it returns False.
When we pass that function into the groupby() method, our DataFrame is grouped into two groups based on whether the stock’s closing price was higher than the opening price on the given day.
After you’ve created your groups using the groupby function, you can perform some handy data manipulation on the resulting groups.
In our example above, we created groups of our stock tickers by symbol. Let’s now find the mean trading volume for each symbol.
>>> symbols['volume'].agg(np.mean)
symbol
AAPL    23733309.4
AMZN     4755355.0
GOOG     1321077.0
Name: volume, dtype: float64
To complete this task, you specify the column on which you want to operate—“volume”—then use Pandas’ agg method to apply NumPy’s mean function. The result is the mean volume for each of the three symbols. From this, we can see that AAPL’s trading volume is an order of magnitude larger than AMZN and GOOG’s trading volume.
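agg isn't limited to a single function. As a sketch on a trimmed-down version of the data, passing a list of function names produces one column per statistic:

```python
import pandas as pd

# Trimmed-down version of the stock data from above
df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AAPL', 'AAPL'],
    'volume': [4974877, 6167358, 25886167, 27436203],
})

# Passing a list of function names to agg yields one column per statistic
stats = df.groupby('symbol')['volume'].agg(['mean', 'min', 'max'])
print(stats)
```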
Iteration is a core programming pattern, and few languages have nicer syntax for iteration than Python. Python’s built-in list comprehensions and generators make iteration a breeze.
Pandas groupby is no different, as it provides excellent support for iteration. You can loop over the groupby result object using a for loop:
>>> for symbol, group in symbols:
...     print(symbol)
...     print(group)
...
AAPL
         date symbol    open    high     low   close    volume
5  2019-03-01   AAPL  174.28  175.15  172.89  174.97  25886167
6  2019-03-04   AAPL  175.69  177.75  173.97  175.85  27436203
7  2019-03-05   AAPL  175.94  176.00  174.54  175.53  19737419
8  2019-03-06   AAPL  174.67  175.49  173.94  174.52  20810384
9  2019-03-07   AAPL  173.87  174.44  172.02  172.50  24796374
AMZN
         date symbol     open     high      low    close   volume
0  2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73  4974877
1  2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17  6167358
2  2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43  3681522
3  2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95  3996001
4  2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95  4957017
GOOG
          date symbol     open     high      low    close   volume
10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99  1450316
11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80  1446047
12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03  1443174
13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86  1099289
14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30  1166559
Each iteration on the groupby object will return two values. The first value is the identifier of the group, which is the value for the column(s) on which they were grouped. The second value is the group itself, which is a Pandas DataFrame object.
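This pattern makes per-group calculations straightforward. For example, here's a sketch (on a trimmed-down DataFrame) that finds the widest daily trading range for each symbol by looping over the groups:

```python
import pandas as pd

# Trimmed-down version of the stock data from above
df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AAPL', 'AAPL'],
    'high':   [1674.26, 1709.43, 175.15, 177.75],
    'low':    [1651.00, 1674.36, 172.89, 173.97],
})

# Each iteration yields the group key and a DataFrame of that group's rows
widest = {}
for symbol, group in df.groupby('symbol'):
    widest[symbol] = (group['high'] - group['low']).max()
    print(f'{symbol}: {widest[symbol]:.2f}')
```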
If you want more flexibility to manipulate a single group, you can use the get_group method to retrieve a single group.
>>> aapl = symbols.get_group('AAPL')
>>> aapl
         date symbol    open    high     low   close    volume
5  2019-03-01   AAPL  174.28  175.15  172.89  174.97  25886167
6  2019-03-04   AAPL  175.69  177.75  173.97  175.85  27436203
7  2019-03-05   AAPL  175.94  176.00  174.54  175.53  19737419
8  2019-03-06   AAPL  174.67  175.49  173.94  174.52  20810384
9  2019-03-07   AAPL  173.87  174.44  172.02  172.50  24796374
>>> type(aapl)
<class 'pandas.core.frame.DataFrame'>
In the example above, we use the Pandas get_group method to retrieve all AAPL rows. To retrieve a particular group, you pass the identifier of the group into the get_group method. This method returns a Pandas DataFrame, which we can manipulate as needed.
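Because the result is an ordinary DataFrame, any DataFrame method applies to it directly. A sketch on a small inline DataFrame:

```python
import pandas as pd

# Small stand-in for the stock DataFrame above
df = pd.DataFrame({
    'symbol': ['AMZN', 'AAPL', 'AAPL'],
    'close':  [1671.73, 174.97, 175.85],
})

# get_group returns a plain DataFrame, so any DataFrame method works on it
aapl = df.groupby('symbol').get_group('AAPL')
mean_close = aapl['close'].mean()
print(mean_close)
```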
If you’re working with a large DataFrame, you’ll need to use various heuristics for understanding the shape of your data. In this section, we’ll look at Pandas count and value_counts, two methods for evaluating your DataFrame.
The count method will show you the number of values for each column in your DataFrame. Using our DataFrame from above, we get the following output:
>>> df.count()
date      15
symbol    15
open      15
high      15
low       15
close     15
volume    15
dtype: int64
The output isn’t particularly helpful for us, as each of our 15 rows has a value for every column. However, this can be very useful where your data set is missing a large number of values. Using the count method can help to identify columns that are incomplete. From there, you can decide whether to exclude the columns from your processing or to provide default values where necessary.
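To see count flag an incomplete column, here's a sketch with a deliberately missing value: count tallies only the non-missing entries in each column, so the gap shows up immediately.

```python
import pandas as pd
import numpy as np

# A DataFrame with a gap: count() tallies only non-missing values per column
df = pd.DataFrame({
    'symbol': ['AMZN', 'AAPL', 'GOOG'],
    'close':  [1671.73, np.nan, 1140.99],
})

counts = df.count()
print(counts)
# symbol has 3 values, but close has only 2, flagging the missing entry
```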
For our case, the value_counts method is more useful. This method returns the number of occurrences of each unique value in a particular column. If you have continuous variables, like our price and volume columns, you can provide an optional “bins” argument to separate the values into half-open bins.
Let’s use the Pandas value_counts method to view the shape of our “volume” column.
>>> df['volume'].value_counts(bins=4)
(1072952.085, 7683517.5]     10
(20851974.5, 27436203.0]      3
(14267746.0, 20851974.5]      2
(7683517.5, 14267746.0]       0
Name: volume, dtype: int64
In the output above, Pandas has created four separate bins for our volume column and shows us the number of rows that land in each bin.
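Without the bins argument, value_counts simply tallies how often each distinct value occurs, which is handy for categorical columns like symbol. A sketch on a small inline DataFrame:

```python
import pandas as pd

# Small stand-in with repeated symbols
df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
})

# Without bins, value_counts tallies occurrences of each distinct value,
# sorted by frequency
counts = df['symbol'].value_counts()
print(counts)
```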
Both count() and value_counts() are great utilities for quickly understanding the shape of your data.
In this post, we learned about groupby, count, and value_counts - three of the main methods in Pandas.
Pandas is a powerful tool for manipulating data once you know the core operations and how to use it. New to Pandas or Python? Download Kite to supercharge your workflow. Kite provides line-of-code completions while you’re typing for faster development, as well as examples of how others are using the same methods.
#python #data-science #pandas