Many models take cumulative variables as input, such as rainfall over a few hours or company revenue over several months. Unfortunately, the data sources are often not aligned in time. One sensor reports every odd hour, another every even hour. One company reports in May and another in June.

If you don’t want to wait for all your data sources to collect their inputs, or when it isn’t appropriate to feed the model with data from different time frames, you have to spread your measurements over the same periods of time. Pandas can natively do part of the job, but in this article we’ll explore how to upsample with an average, which requires a little extra coding.

You can run all the examples from this tutorial in the notebook shared on GitHub: Upsample_to_average.ipynb

Resampling

Downsample

Resampling in Python’s Pandas lets you turn more frequent values into less frequent ones (**downsampling**), e.g. hourly data to a daily sum, count and average, or daily values to monthly ones.

## input data
CAT|DATE|VALUE
abc|0101|10
abc|0102|20
abc|0103|15
## downsample
[IN]: df.groupby("CAT").resample("W", on="DATE").agg({"VALUE":["sum","count","mean","min","max","first","last"]})
[OUT]:
CAT|DATE|SUM|COUNT|MEAN|MIN|MAX|FIRST|LAST
abc|0107| 45|    3|  15| 10| 20|   10|  15
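
For readers who want to reproduce the weekly downsample end to end, here is a minimal runnable sketch. The year 2018 is an assumption, chosen only because its first week ends on Sunday, January 7, which matches the `0107` label in the output; the names `df` and `weekly` are illustrative. It sets `DATE` as the index instead of passing `on="DATE"`, which should give the same weekly bins.

```python
import pandas as pd

# Three daily values for one category (the year 2018 is assumed for illustration).
df = pd.DataFrame({
    "CAT": ["abc", "abc", "abc"],
    "DATE": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "VALUE": [10, 20, 15],
})

# Downsample each category to weekly bins ("W" = weeks ending on Sunday)
# and compute several aggregates of VALUE per bin.
weekly = (
    df.set_index("DATE")
      .groupby("CAT")["VALUE"]
      .resample("W")
      .agg(["sum", "count", "mean", "min", "max", "first", "last"])
)

print(weekly)
```

The three daily rows collapse into a single weekly row per category, which is why the row count goes down.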

It’s called downsampling because the number of data rows decreases. You can apply sum, count, mean (for the average), median, min, max, first or last. Based on daily inputs you can resample to weeks, months, quarters or years, but also to semi-months; see the complete list of resample options in the pandas documentation. You can also resample to multiples, e.g. 5H for groups of 5 hours.
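
Resampling to a multiple of a base frequency can be sketched as follows. The hourly series and the names `hourly` and `five_hourly` are made up for illustration. The sketch passes a `pd.Timedelta` rather than the string alias, since recent pandas versions prefer the lower-case spelling `5h` over `5H`.

```python
import pandas as pd

# Ten hourly readings with values 1..10, starting at midnight (illustrative data).
hourly = pd.Series(
    range(1, 11),
    index=pd.date_range("2018-01-01", periods=10, freq=pd.Timedelta(hours=1)),
    name="VALUE",
)

# Group the readings into 5-hour bins and aggregate each bin.
five_hourly = hourly.resample(pd.Timedelta(hours=5)).agg(
    ["sum", "median", "min", "max", "first", "last"]
)

print(five_hourly)
```

With this input the ten hourly rows fall into two bins, one starting at 00:00 and one at 05:00.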


Upsample with an average in Pandas