Comparing StringIndexer (PySpark), LabelEncoder (scikit-learn), OrdinalEncoder (scikit-learn), OrdinalEncoder (category_encoders)



Here we compare the following four category encoders:
1. StringIndexer (PySpark)
2. LabelEncoder (scikit-learn)
3. OrdinalEncoder (scikit-learn)
4. OrdinalEncoder (category_encoders)

About LabelEncoder

6.9. Transforming the prediction target (y)

These are transformers that are not intended to be used on features, only on supervised learning targets. See also Transforming target in regression if you want to transform the prediction target for learning, but evaluate the model in the original (untransformed) space.
6.9.1. Label binarization

LabelBinarizer is a utility class to help create a label indicator matrix from a list of multi-class labels:

>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer()
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]]) 

For multiple labels per instance, use MultiLabelBinarizer:

>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> lb.classes_
array([1, 2, 3]) 

6.9.2. Label encoding

LabelEncoder is a utility class to help normalize labels such that they contain only values between 0 and n_classes-1. This is sometimes useful for writing efficient Cython routines. LabelEncoder can be used as follows:

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6]) 

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels:

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris'] 

Ref: scikit-learn.org (Preprocessing Targets)

Additional Note:

% Encode target labels with value between 0 and n_classes-1.

% This transformer should be used to encode target values, i.e. y, and not the input X.

Ref: LabelEncoder Docs
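
Since LabelEncoder is meant for the target, a minimal sketch of typical usage (the labels here are made up) is encoding y before fitting a classifier:

>>> from sklearn.preprocessing import LabelEncoder
>>> y = ['spam', 'ham', 'ham', 'spam']
>>> LabelEncoder().fit_transform(y)  # classes_ sorted: ['ham', 'spam']
array([1, 0, 0, 1])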

About 'category_encoders' OrdinalEncoder 

Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in this case, we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.

Ref: 'category_encoders' OrdinalEncoder
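
To illustrate the optional mapping dict mentioned above, here is a minimal sketch (the column name and ordering are invented for the example); the parameter takes a list of per-column dicts, each with 'col' and 'mapping' keys:

import pandas as pd
from category_encoders.ordinal import OrdinalEncoder

df_sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'M']})

# encode using a known true order instead of letting the encoder choose
enc = OrdinalEncoder(mapping=[{'col': 'size',
                               'mapping': {'S': 1, 'M': 2, 'L': 3}}])
enc.fit_transform(df_sizes)   # size -> 1, 2, 3, 2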

About StringIndexer 

StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), and four ordering options are supported (default = "frequencyDesc"):

1. "frequencyDesc": descending order by label frequency (most frequent label assigned 0)
2. "frequencyAsc": ascending order by label frequency (least frequent label assigned 0)
3. "alphabetDesc": descending alphabetical order
4. "alphabetAsc": ascending alphabetical order

Unseen labels will be put at index numLabels if the user chooses to keep them. If the input column is numeric, it is cast to string and the string values are indexed. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.

from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show() 
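
With the default "frequencyDesc" ordering, 'a' (3 occurrences) maps to 0.0, 'c' (2 occurrences) to 1.0, and 'b' (1 occurrence) to 2.0, so indexed.show() should print something like:

+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          0.0|
|  1|       b|          2.0|
|  2|       c|          1.0|
|  3|       a|          0.0|
|  4|       a|          0.0|
|  5|       c|          1.0|
+---+--------+-------------+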

~ ~ ~ ~ ~

IndexToString

Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.

Ref: StringIndexer
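
A minimal sketch of that round trip, reusing the indexed DataFrame from the StringIndexer example above (the output column name is arbitrary); by default IndexToString recovers the labels from the metadata StringIndexer attached to the index column:

from pyspark.ml.feature import IndexToString

converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory")
converted = converter.transform(indexed)
converted.select("id", "categoryIndex", "originalCategory").show()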

About scikit-learn's OrdinalEncoder and OneHotEncoder

To convert categorical features to integer codes, we can use the OrdinalEncoder. This estimator transforms each categorical feature to one new feature of integers (0 to n_categories - 1):

>>> enc = preprocessing.OrdinalEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.transform([['female', 'from US', 'uses Safari']])
array([[0., 1., 1.]])

Such integer representation can, however, not be used directly with all scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily).

Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.

Continuing the example above:

>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder()
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]]) 
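
For reference, the column order of the one-hot output follows enc.categories_, i.e. the sorted categories learned per feature:

>>> enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]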

Code 

import pandas as pd
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OrdinalEncoder as soe
from category_encoders.ordinal import OrdinalEncoder as coe

df = pd.DataFrame({
    "col1": ['A', 'A', 'A', 'B', 'B', 'C'],
    "col2": ['A', 'B', 'B', 'C', 'C', 'C'],
    "col3": ['A', 'A', 'B', 'B', 'C', 'C'],
    "col4": ['C', 'B', 'B', 'A', 'A', 'A']
})

# LabelEncoder works on one 1-D array at a time, so fit/transform per column
le = LabelEncoder()
le = le.fit(df.col1)
col1_le = le.transform(df.col1)

le = le.fit(df.col2)
col2_le = le.transform(df.col2)

le = le.fit(df.col3)
col3_le = le.transform(df.col3)

le = le.fit(df.col4)
col4_le = le.transform(df.col4)

df_le = pd.DataFrame({'le1': col1_le, 'le2': col2_le, 'le3': col3_le, 'le4': col4_le})
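
As an aside, the repeated fit/transform above can be written more compactly with DataFrame.apply; a sketch (each column gets its own freshly fitted encoder):

df_le = df.apply(lambda col: LabelEncoder().fit_transform(col))
df_le.columns = ['le1', 'le2', 'le3', 'le4']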

# ~ ~ ~

# category_encoders' OrdinalEncoder handles a whole DataFrame at once
# and returns a DataFrame (return_df=True)
coe_var = coe(drop_invariant=False, return_df=True)

coe_var = coe_var.fit(df)

df_coe = coe_var.transform(df)

df_coe.columns = ['coe1', 'coe2', 'coe3', 'coe4']

# ~ ~ ~

# scikit-learn's OrdinalEncoder also takes the whole DataFrame,
# but transform returns a numpy array of floats
soe_var = soe().fit(df)

arr_soe = soe_var.transform(df)

df_soe = pd.DataFrame({
    'soe1': arr_soe.T[0],
    'soe2': arr_soe.T[1],
    'soe3': arr_soe.T[2],
    'soe4': arr_soe.T[3]
})

# ~ ~ ~

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession  # main entry point for DataFrame and SQL functionality

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(df)

# one StringIndexer per column, fitted with the default "frequencyDesc" ordering
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(sdf)
            for column in sdf.columns]

pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(sdf).transform(sdf)

df_r.show()

# side-by-side comparison of the three pandas-based encodings
pd.concat([df_le, df_soe, df_coe], axis=1)
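
Given the ordering rules summarized in the final note below, the concatenated frame should look roughly like this (le*/soe* assign 0 to n-1 alphabetically; coe* assigns 1 to n by order of appearance):

   le1  le2  le3  le4  soe1  soe2  soe3  soe4  coe1  coe2  coe3  coe4
0    0    0    0    2   0.0   0.0   0.0   2.0     1     1     1     1
1    0    1    0    1   0.0   1.0   0.0   1.0     1     2     1     2
2    0    1    1    1   0.0   1.0   1.0   1.0     1     2     2     2
3    1    2    1    0   1.0   2.0   1.0   0.0     2     3     2     3
4    1    2    2    0   1.0   2.0   2.0   0.0     2     3     3     3
5    2    2    2    0   2.0   2.0   2.0   0.0     3     3     3     3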
Final Note

% category_encoders OrdinalEncoder: assigns numbers from 1 to n to the n categories in the order they appear.

% StringIndexer: assigns numbers from 0 to n-1 to the n categories based on category frequency, going from high to low. This order is configurable via stringOrderType; the four options ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc") are the ones described in the StringIndexer section above (default = "frequencyDesc").

>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed",
...                               handleInvalid="error", stringOrderType="alphabetDesc")

Ref: pyspark.ml.feature.StringIndexer

% scikit-learn's LabelEncoder and OrdinalEncoder encode categories identically: both consider the alphabetical order of the distinct categorical values and assign them values from 0 to n-1. The only difference is that LabelEncoder returns integers while OrdinalEncoder returns floats.

Few More Code Snippets

coe_var = coe(drop_invariant=False, return_df=True)

df = pd.DataFrame({
    "col1": ['A', 'A', 'A', 'B', 'B', 'C'],
    "col2": ['A', 'B', 'B', 'C', 'C', 'C'],
    "col3": ['A', 'A', 'B', 'B', 'C', 'C'],
    "col4": ['C', 'B', 'B', 'A', 'A', 'A'],
    "col5": ['C', 'C', 'B', 'B', 'A', 'A']
})

coe_var.fit_transform(df)
soe_var = soe().fit(df)
arr_soe = soe_var.transform(df)  # numpy array of floats

df_soe = pd.DataFrame({
    'soe1': arr_soe.T[0],
    'soe2': arr_soe.T[1],
    'soe3': arr_soe.T[2],
    'soe4': arr_soe.T[3],
    'soe5': arr_soe.T[4]
})

print(df_soe)
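
Again assuming scikit-learn's alphabetical ordering, the printed frame should look like:

   soe1  soe2  soe3  soe4  soe5
0   0.0   0.0   0.0   2.0   2.0
1   0.0   1.0   0.0   1.0   2.0
2   0.0   1.0   1.0   1.0   1.0
3   1.0   2.0   1.0   0.0   1.0
4   1.0   2.0   2.0   0.0   0.0
5   2.0   2.0   2.0   0.0   0.0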
