It is very common to want to segment a Pandas DataFrame by runs of consecutive values. Dealing with consecutive values is rarely easy, whether in SQL or in Pandas. Standard SQL provides a set of window functions to help with this kind of manipulation, but Pandas has fewer of them ready to hand. Fortunately, Python offers several workarounds, and some of them are even simpler than the classic window functions.
In this article, I’ll demonstrate how to group a Pandas DataFrame by consecutive identical values, where each value may appear once or several times in a row.
If you are not yet sure what problem we’re trying to solve, don’t worry. The sample data generated below should make it clear:
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'v': [1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9]
})
In the screenshot, the green lines mark the boundaries between the groups that we expect.
In this example, grouping these values is easy: a simple groupby('v') is enough. However, that only works because the values in v happen to be monotonically increasing, so equal values are always next to each other.
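As a quick check of the simple case, the sketch below rebuilds the first sample DataFrame and groups it with groupby('v'). Collecting the items into lists is only my choice for display, and the variable name groups is not from the original article.

```python
import pandas as pd

# Same sample data as above: v never decreases
df = pd.DataFrame({
    'item': list('ABCDEFGHIJKLM'),
    'v': [1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9]
})

# Confirm the assumption that plain groupby('v') relies on here
print(df['v'].is_monotonic_increasing)

# Collect the items that fall into each group of equal values
groups = df.groupby('v')['item'].apply(list)
print(groups)
```

Because equal values are always adjacent in this data, each group produced by groupby('v') is also a run of consecutive rows.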
What if the value column is not monotonically increasing?
df = pd.DataFrame({
'item':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'],
'v':[1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 3]
})
In this new example, we added a row at index 13 (the 14th row) whose value is v == 3 again. If we simply groupby('v'), that row will be put in the same group as the rows at index 2, 3 and 4, which is not what we want.
In other words, the row at index 13 should be in a separate group, because it is not consecutive with the other rows where v == 3.
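To see the problem concretely, here is a small sketch that applies plain groupby('v') to the second sample DataFrame. The variable name by_value is only illustrative.

```python
import pandas as pd

# Second sample: the last row repeats v == 3 after a gap
df = pd.DataFrame({
    'item': list('ABCDEFGHIJKLMN'),
    'v': [1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 3]
})

# Plain groupby ignores row position and merges every row with the same v
by_value = df.groupby('v')['item'].apply(list)
print(by_value.loc[3])  # ['C', 'D', 'E', 'N']
```

Item N ends up alongside C, D and E even though it is not adjacent to them, which is exactly the behaviour we want to avoid.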
The basic idea is to create a helper column that we can group by. It must hold the same value for consecutive equal values of the original column, and a different value whenever the original value changes. We can build it with cumsum(). Here are the intuitive steps.
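One possible way to carry out this idea is sketched below. It is a minimal illustration rather than a definitive implementation: the names changed, group_id and runs are my own, and comparing each value with shift() is one common way to detect where a value changes.

```python
import pandas as pd

df = pd.DataFrame({
    'item': list('ABCDEFGHIJKLMN'),
    'v': [1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 3]
})

# Step 1: flag rows whose value differs from the previous row.
# The first row is compared with NaN, so it is always flagged True.
changed = df['v'] != df['v'].shift()

# Step 2: a running total of the flags goes up by one at every change,
# so each run of consecutive equal values shares one id.
# (group_id is a hypothetical helper column name.)
df['group_id'] = changed.cumsum()

# Step 3: group by the helper column instead of by v itself
runs = df.groupby('group_id')['item'].apply(list)
print(runs)
```

With this helper column, item N gets its own group (id 10) instead of being merged with C, D and E (id 3), even though all four rows have v == 3.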