Category Encoders Analysis (in Python)



In many practical ML activities, a dataset will contain categorical variables. This is even more common in an enterprise context, where most of the attributes are categorical. These variables take distinct discrete values. For example, the size of an organization can be Small, Medium, or Large, and geographic regions can be values such as Americas, Asia Pacific, and Europe. Some ML algorithms, notably certain tree-based implementations, can handle this type of data directly.

However, many algorithms do not accept categorical data directly, so these attributes need to be encoded into numerical values for further processing. There are various methods to encode categorical data; some extensively used methods are described in the following section:

Label encoding: As the name implies, label encoding converts categorical labels into numerical labels. It is better suited for ordinal categorical data. The labels are always between 0 and n-1, where n is the number of classes.

One-hot encoding: This is also known as dummy coding. In this method, dummy columns are generated for each class of a categorical attribute/predictor. For each dummy predictor, the presence of a value is represented by 1, and its absence is represented by 0.

Frequency-based encoding: In this method, the frequency of each class is calculated first. Then the relative frequency of each class (its count divided by the total number of observations) is computed. This relative frequency is assigned as the encoded value for each of the attribute's levels, as sketched below.
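
For illustration, here is a minimal sketch using pandas; the DataFrame and the column name region_freq are made up for the example:

  import pandas as pd

  df = pd.DataFrame({'region': ['Americas', 'Asia Pacific', 'Americas',
                                'Europe', 'Americas', 'Europe']})

  # Relative frequency of each class = count / total number of rows
  freq = df['region'].value_counts(normalize=True)

  df['region_freq'] = df['region'].map(freq)
  print(df)

(category_encoders also provides a CountEncoder with a normalize option for this style of encoding.)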

Target mean encoding: In this method, each class of the categorical predictors is encoded as a function of the mean of the target. This method can only be used in a supervised learning problem where there is a target feature.

Binary encoding: The classes are first transformed to numerical values. Then these numerical values are converted to their equivalent binary strings, which are then split into separate columns. Each binary digit becomes an independent column.

Hash encoding: This method is also commonly known as feature hashing. A hash function is used to map data to a number. This method may assign different classes to the same bucket (a collision), but it is useful when there are hundreds of categories or classes present for an input feature.

Ref: Sibanjan Das & Umit Mert Cakmak, Hands-On Automated Machine Learning: A Beginner’s Guide to Building Automated Machine Learning Systems Using AutoML and Python (Packt, 2018)

Notes about Python package “category_encoders” 

Label Encoder (from Scikit-learn):
- Suitable for both nominal and ordinal data.
- In the case of nominal data, it can be a little less accurate for some datasets.
- Each value in the column is assigned an integer between 0 and N-1, where N is the number of distinct categories in the column.
- Limitation: even if there is no order or relation between the categories, this encoding imposes one, which can lead to inaccurate results. (Use One Hot Encoder to overcome this problem.) The sketch below shows the arbitrary ordering.
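
For example, a minimal sketch with scikit-learn; the values are made up. Note that classes_ is sorted alphabetically, so the assigned integers carry no real meaning:

  from sklearn.preprocessing import LabelEncoder

  le = LabelEncoder()
  codes = le.fit_transform(['Small', 'Large', 'Medium', 'Small'])
  print(codes)        # [2 0 1 2]
  print(le.classes_)  # ['Large' 'Medium' 'Small']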

One Hot Encoder:
- Best suited for nominal data.
- Works for ordinal data too.
- Avoids the artificial ordering problem of label encoding, so it is a safe choice for nominal data.
- Comparatively fast, but takes more space.
- It generates a separate binary column for each category present in the encoded column, as the sketch below shows.
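
For example, a minimal sketch with category_encoders; the DataFrame is made up, and use_cat_names simply keeps the original category names in the generated column labels:

  import category_encoders as ce
  import pandas as pd

  df = pd.DataFrame({'size': ['Small', 'Large', 'Medium', 'Small']})

  ohe = ce.OneHotEncoder(cols=['size'], use_cat_names=True)
  print(ohe.fit_transform(df))
  # One column per category: size_Small, size_Large, size_Medium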

Ordinal Encoder:
- Best fit for ordinal data.
- Similar to the Label Encoder from scikit-learn.
- Like the Label Encoder, it assigns each category an integer; category_encoders' OrdinalEncoder uses 1 to N by default, and the order can be controlled explicitly, as sketched below.
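
A minimal sketch with category_encoders; the DataFrame is made up. Unlike the Label Encoder, OrdinalEncoder accepts an explicit mapping so the integers can follow the true order of the levels:

  import category_encoders as ce
  import pandas as pd

  df = pd.DataFrame({'size': ['Small', 'Large', 'Medium', 'Small']})

  # Explicit mapping preserves the real order: Small < Medium < Large
  oe = ce.OrdinalEncoder(mapping=[{'col': 'size',
                                   'mapping': {'Small': 1, 'Medium': 2, 'Large': 3}}])
  print(oe.fit_transform(df))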

Target/Mean Encoding:
- Replaces each category with the mean of the target variable for that category.
- Requires a target, so it can only be used in supervised learning tasks.
- Suitable for high-cardinality features, since it adds no extra dimensions.
- Steps must be taken to avoid overfitting/response leakage, as noted in the sketch below.
- Good fit for nominal and ordinal data.
- Works for both classification and regression targets (category_encoders' TargetEncoder supports binomial and continuous targets).
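
A minimal sketch with category_encoders; the data is made up. The smoothing parameter blends each per-category mean with the global target mean; in practice the encoder should be fitted on training folds only, to limit leakage:

  import category_encoders as ce
  import pandas as pd

  X = pd.DataFrame({'size': ['Small', 'Large', 'Medium', 'Small', 'Large', 'Small']})
  y = pd.Series([1, 0, 1, 0, 0, 1])

  # smoothing shrinks rare-category means toward the global mean
  te = ce.TargetEncoder(cols=['size'], smoothing=1.0)
  print(te.fit_transform(X, y))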

Binary Encoder:
- Converts each category to an integer, then to that integer's binary representation.
- Each binary digit creates one feature column.
- Best suited for high-cardinality ordinal data.
- There might be some information loss. See the sketch below.
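
A minimal sketch with category_encoders; the data is made up. Three categories receive ordinal codes 1 to 3, which fit in two binary digits:

  import category_encoders as ce
  import pandas as pd

  df = pd.DataFrame({'region': ['Americas', 'Asia Pacific', 'Europe', 'Americas']})

  be = ce.BinaryEncoder(cols=['region'])
  print(be.fit_transform(df))
  # 3 categories -> 2 columns (region_0, region_1) instead of 3 one-hot columns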

BaseN Encoder:
- BaseN converts the integer codes of the categories to their base-N representation.
- By default it uses base=2, like the Binary Encoder, but the base can be set to any number.
- Increasing the base reduces the number of output dimensions, as sketched below.
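
A minimal sketch with category_encoders; the data is made up. Eight categories need four base-2 digits but only two base-4 digits:

  import category_encoders as ce
  import pandas as pd

  df = pd.DataFrame({'code': list('abcdefgh')})  # 8 distinct categories

  bn = ce.BaseNEncoder(cols=['code'], base=4)
  print(bn.fit_transform(df))  # 2 output columns instead of 4 with base=2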

Hashing Encoder:
- Hashing is the process of transforming a string of characters into a (usually shorter) fixed-length value using an algorithm that represents the original string.

- Uses md5 by default to map each value into a fixed number of output columns, which we can set using the parameter “n_components”.

For example:

import category_encoders as ce

# pandas_dataframe is a DataFrame whose columns 'B', 'M' and 'P' get hashed
# into n_components=5 buckets; columns not listed in cols pass through as-is
he = ce.HashingEncoder(cols=['B', 'M', 'P'], n_components=5).fit_transform(pandas_dataframe)
print(he)

Output:
   col_0  col_1  col_2  col_3  col_4    BP
0      1      2      0      0      0    10
1      0      2      0      1      0   180
2      0      1      0      1      1    16
3      1      1      0      1      0   201
4      0      0      2      0      1  8136
...

- Note how the column BP, which was not listed in cols, passes through unencoded.
- Advantageous when the cardinality of a category is very high.
- With hashing, the number of dimensions is far lower than with encodings such as One Hot Encoding.
- Suitable for high-cardinality ordinal and nominal data.

Helmert Encoder:
- In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.
- Suitable for regression algorithms.
- Preferred for ordinal data; see the sketch below.
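
A minimal sketch with category_encoders; the data is made up. The contrast coders in this package also emit an intercept column by default:

  import category_encoders as ce
  import pandas as pd

  df = pd.DataFrame({'size': ['Small', 'Medium', 'Large', 'Small']})

  henc = ce.HelmertEncoder(cols=['size'])
  print(henc.fit_transform(df))
  # N levels -> N-1 contrast columns, plus an 'intercept' column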

~ ~ ~

Scikit-learn has these (present in package “sklearn.preprocessing”):
1. LabelEncoder
2. OneHotEncoder
3. OrdinalEncoder

A helping note about preserving a LabelEncoding for a category using its dictionary representation (Ref).

LabelEncoder is basically a dictionary. You can extract and use it for future encoding:

  from sklearn.preprocessing import LabelEncoder

  le = LabelEncoder()
  le.fit(X)  # X holds the categorical values to encode

  le_dict = dict(zip(le.classes_, le.transform(le.classes_)))
  
Retrieve the label for a single new item; if the item is unseen, fall back to '[Unknown]':

  le_dict.get(new_item, '[Unknown]')

Retrieve labels for a DataFrame column:

  df[your_col].apply(lambda x: le_dict.get(x, '[Unknown]'))

~ ~ ~
