Pandas DataFrame has a built-in method sort_values() to sort values by the given variable(s). The method itself is fairly straightforward to use. However, it doesn't work for custom sorting, for example:
clothing sizes: XS, S, M, L, and XL
months: Jan, Feb, Mar, Apr, and so on
days of the week: Mon, Tue, Wed, Thu, Fri, Sat, and Sun
In this article, we are going to take a look at how to do a custom sort on a Pandas DataFrame.
Please check out my GitHub repo for the source code.
Suppose we have a dataset about a clothing store:
import pandas as pd

# Toy clothing-store data: one row per item, with its size label
df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})
Data made by author
We can see that each cloth has a size value, and the data should be sorted in the following order:
XS for extra small
S for small
M for medium
L for large
XL for extra large
However, you will get the following output when calling sort_values('size').
The output is not what we want, but it is technically correct. Under the hood, sort_values() sorts values in numerical order for numeric data and in alphabetical (lexicographic) order for object data such as strings.
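To make the default behaviour concrete, here is a minimal sketch that rebuilds the sample data and sorts it by size. The variable name default_sorted is only illustrative and does not come from the original code.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# 'size' is an object (string) column, so the sort is alphabetical
default_sorted = df.sort_values('size')
print(default_sorted)

# The sizes come back as L, M, S, S, XL, XS rather than XS, S, M, L, XL
print(default_sorted['size'].tolist())
```

Alphabetically, 'L' comes before 'M' and 'XL' comes before 'XS', which is why the result does not follow the natural size order.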
Here are two common solutions:
CategoricalDtype
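As a rough sketch of the CategoricalDtype approach, assuming the size order listed earlier: the column is cast to an ordered categorical type, after which sort_values() follows the category order instead of the alphabet. The names size_order and custom_sorted are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same sample data as above
df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# Declare the custom order; ordered=True makes the categories comparable
size_order = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL'], ordered=True)

# Cast the column to the ordered categorical type
df['size'] = df['size'].astype(size_order)

# sort_values() now respects the declared category order
custom_sorted = df.sort_values('size')
print(custom_sorted)
```

One thing to keep in mind with this sketch: any value that is not in the category list becomes NaN during the cast, so the list should cover every size that appears in the data.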