A Pandas DataFrame has a built-in method, sort_values(), that sorts rows by the given variable(s). The method is fairly straightforward to use, but by default it doesn’t handle custom orderings such as:

  • the t-shirt size: XS, S, M, L, and XL
  • the month: Jan, Feb, Mar, Apr, etc.
  • the day of the week: Mon, Tue, Wed, Thu, Fri, Sat, and Sun

In this article, we are going to take a look at how to do a custom sort on a Pandas DataFrame.

Please check out my GitHub repo for the source code.

Take a look at the problem

Suppose we have a dataset about a clothing store:

import pandas as pd

# Sample clothing data: one row per item, with its size label
df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})


We can see that each item of clothing has a size value, and the data should be sorted in the following order:

  • XS for extra small
  • S for small
  • M for medium
  • L for large
  • XL for extra large

However, calling sort_values('size') does not produce this order. The sizes come back alphabetically instead: L, M, S, S, XL, XS.

The output is not what we want, but it is technically correct. Under the hood, sort_values() sorts numeric data in numerical order and object (string) data alphabetically, comparing character by character.
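To see this default behavior concretely, here is a minimal sketch using the same clothing data. The variable name sorted_sizes is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# Default sort on a string column: values are compared character by
# character, so 'L' < 'M' < 'S' < 'XL' < 'XS'.
sorted_sizes = df.sort_values('size')['size'].tolist()
print(sorted_sizes)  # ['L', 'M', 'S', 'S', 'XL', 'XS']
```

Note that 'XL' lands before 'XS' and both land after 'S', which is exactly the opposite of what a size ordering needs.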

Here are two common solutions:

  1. Create a new column for custom sorting
  2. Cast data to category type with orderedness using CategoricalDtype
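Both approaches can be sketched roughly as follows, using the clothing data from earlier. This is an illustration under stated assumptions rather than the article's exact code: the names size_order, rank, size_rank, by_rank, size_type, and by_category are made up for the example, and kind='mergesort' is chosen only so that rows with equal sizes keep their original relative order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# The desired custom order, smallest to largest (hypothetical name).
size_order = ['XS', 'S', 'M', 'L', 'XL']

# Solution 1: map each size to its position in size_order, sort by that
# helper column, then drop it again.
rank = {size: position for position, size in enumerate(size_order)}
by_rank = (
    df.assign(size_rank=df['size'].map(rank))
      .sort_values('size_rank', kind='mergesort')
      .drop(columns='size_rank')
)

# Solution 2: cast the column to an ordered categorical type so that
# sort_values() follows the category order instead of alphabetical order.
size_type = CategoricalDtype(categories=size_order, ordered=True)
by_category = df.astype({'size': size_type}).sort_values('size', kind='mergesort')

print(by_rank['size'].tolist())      # ['XS', 'S', 'S', 'M', 'L', 'XL']
print(by_category['size'].tolist())  # ['XS', 'S', 'S', 'M', 'L', 'XL']
```

One difference worth noting: the helper column is temporary, while the categorical type stays attached to the column, so later sorts on it follow the same order without any extra steps.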

