Pandas: Categories

Author

Nicolas Vandeput

BOOK

Data Science for Supply Chain Forecast

If you struggle with big DataFrames in Python, there is one very useful trick: categories. I have not seen it taught in any of the online Python classes I’ve come across so far, so I think it deserves an article.

What are categories?

Data sets often contain fields that should only hold specific values. For example, the department of an employee (Finance/HR/Marketing), the transaction type (Sales/Returns), the genre of a movie (Horror/Comedy/Action) and so on. In such cases you typically want to be sure that nobody can put an unacceptable value in the field (to go back to the example above, the department cannot be “0XAD123”).

Categories in Pandas

Typically, if you have such a data set in pandas, the type of the field will be object. But the object type is suboptimal for two reasons:

  1. It does not prevent you from using wrong categories (like a wrong department).
  2. It takes a lot of space.
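
To make the first point concrete, here is a minimal sketch. The `employees` DataFrame and its values are made up for illustration. It shows that an object column accepts any value without complaint.

```python
import pandas as pd

# Hypothetical data, only for illustration
employees = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cho"], "department": ["Finance", "HR", "Marketing"]}
)
# Force the plain object type so the behaviour does not depend on the pandas version
employees["department"] = employees["department"].astype(object)

# Nothing stops us from writing a value that is not a real department
employees.loc[1, "department"] = "0XAD123"

print(employees["department"].dtype)
print(employees["department"].unique())
```

The wrong value only shows up if you go looking for it, for instance with `unique()`.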

In pandas, you can use the type category to get both lower memory usage and data protection. To turn a DataFrame column into a category, just do this (note the quotes around "category"):

df["column"] = df["column"].astype("category")

This automatically creates a category for every distinct value currently present in the column. As the example below shows, this can easily cut the reported memory usage by 50% or more.
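
Before the full example, a small sketch of what the conversion does. The `department` column and its values are hypothetical. It lists the categories pandas inferred and the integer codes stored for each row, then tries to write a value that is not a known category. The exception type has changed between pandas versions, so the sketch catches both candidates.

```python
import pandas as pd

# Hypothetical column, only for illustration
df = pd.DataFrame({"department": ["Finance", "HR", "Marketing", "HR", "Finance"]})
df["department"] = df["department"].astype("category")

# One category per distinct value, stored once
categories = list(df["department"].cat.categories)
# Each row only stores a small integer pointing at its category
codes = df["department"].cat.codes.tolist()
print(categories)  # ['Finance', 'HR', 'Marketing']
print(codes)       # [0, 1, 2, 1, 0]

# Writing an unknown value is refused instead of silently accepted
try:
    df.loc[0, "department"] = "0XAD123"
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)
```

Storing small integer codes instead of full strings is where the memory saving comes from.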

Example

 
import numpy as np
import pandas as pd

# DataFrame initialization
cat = ["Type A", "Type B", "Type D", "Type E"]
size = 100_000
df = pd.DataFrame(np.random.choice(cat, size=size), columns=["Type"])
df["Quantity"] = np.random.randint(0, 100, size)

# See the initial size of the DataFrame (around 1.1 MB for 100,000 rows)
# df.info() prints its report itself, so there is no need to wrap it in print()
df.info()

# Transform the Type column into a category
df["Type"] = df["Type"].astype("category")

# Check the size of the DataFrame again (around 0.5 MB for 100,000 rows)
df.info()
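
By default, `df.info()` does not count the strings stored in an object column, so it can understate the saving. A minimal sketch of a per-column comparison with `memory_usage(deep=True)` follows. The variable names are mine and the exact byte counts depend on your platform and pandas version, so none are quoted here.

```python
import numpy as np
import pandas as pd

# Same kind of data as in the example above, with a fixed seed
rng = np.random.default_rng(0)
types = ["Type A", "Type B", "Type D", "Type E"]
as_object = pd.Series(rng.choice(types, size=100_000), dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the memory of the Python strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(f"ratio:    {category_bytes / object_bytes:.1%}")
```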

Limitations

  • If you perform a join on a DataFrame with categories, the categories can revert to the type object. This is very unfortunate. In my experience it happened even with proper categories on both DataFrames; as far as I know, recent pandas versions keep the category type on the merge key only when both sides use exactly the same categories.
  • If you add data to the DataFrame with new categories, or if you change a value to a new category, you will get an error unless the category was defined beforehand. This is rather annoying. You can work around it by reverting to the plain object type or by declaring the new category first, like this:
df["Type"] = df["Type"].cat.add_categories(["Type New"])
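
The two limitations above can be sketched as follows. The DataFrames and values are made up for illustration, and the merge behaviour depends on your pandas version, so the sketch prints the resulting type for you to check on your own installation.

```python
import pandas as pd

# Limitation 1: merging on a categorical key.
# Both sides share exactly the same categorical type (an assumption of this sketch).
shared_dtype = pd.CategoricalDtype(["Type A", "Type B"])
left = pd.DataFrame({"Type": ["Type A", "Type B"], "Quantity": [10, 20]})
right = pd.DataFrame({"Type": ["Type A", "Type B"], "Price": [1.5, 2.5]})
left["Type"] = left["Type"].astype(shared_dtype)
right["Type"] = right["Type"].astype(shared_dtype)

merged = left.merge(right, on="Type")
print(merged["Type"].dtype)  # check whether the key is still a category

# Limitation 2: a new value must be declared as a category before it is used.
df = pd.DataFrame({"Type": pd.Categorical(["Type A", "Type B"])})
df["Type"] = df["Type"].cat.add_categories(["Type New"])
df.loc[0, "Type"] = "Type New"  # accepted now that the category exists
print(df["Type"].tolist())
```

If the merged key does come back as object, converting it again with `astype("category")` after the merge is a simple fallback.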
