Saturday, July 4, 2020

One-Hot Encoding with PySpark, Pandas, Category Encoders and scikit-learn



Using PySpark:

import pyspark
print(pyspark.__version__)

3.0.0

from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for DataFrame and SQL functionality.
from pyspark.ml import Pipeline

from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, FeatureHasher
from pyspark.sql.functions import col

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

values = [("K1","a", 5, 'x'), ("K2","a", 5, 'x'), ("K3","b", 5, 'x'), ("K4","b", 10, 'x')]
columns = ['key', 'alphabet', 'd1', 'd0']
df = sqlCtx.createDataFrame(values, columns)

+---+--------+---+---+
|key|alphabet| d1| d0|
+---+--------+---+---+
| K1|       a|  5|  x|
| K2|       a|  5|  x|
| K3|       b|  5|  x|
| K4|       b| 10|  x|
+---+--------+---+---+ 

Ref: spark.apache.org

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
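The rule quoted above can be sketched in a few lines of plain Python. This is only an illustration of the mapping, not Spark's implementation; the function name one_hot and its drop_last argument are made up for this example.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by a vector of all zeros.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:  # the dropped category leaves the vector all zeros
        vec[index] = 1.0
    return vec


print(one_hot(2, 5))                   # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4, 5))                   # [0.0, 0.0, 0.0, 0.0]
print(one_hot(4, 5, drop_last=False))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```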

Note: OneHotEncoder accepts only numeric columns (category indices), so passing the string column key directly fails:
encoder = OneHotEncoder(inputCol="key", outputCol="key_vector", dropLast = True)
encoder = encoder.fit(df)
df = encoder.transform(df)

Error:
IllegalArgumentException: requirement failed: Column key must be of type numeric but was actually of type string. 

FeatureHasher is supposed to return an output similar to OneHotEncoder's, but it does not. Its output is inconsistent: distinct categories can land on the same vector index, and the mapping changes with numFeatures.

From the documentation:
Since a simple modulo is used to transform the hash function to a vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices. 
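To see why hashing cannot guarantee one slot per category, here is a minimal sketch of the hashing trick in plain Python. It is an assumption-laden stand-in: Spark documents MurmurHash 3 applied to the string "column_name=value", while this sketch uses MD5 from the standard library, so the indices it prints will not match Spark's. The helper names hashed_index and hash_encode are hypothetical.

```python
import hashlib


def hashed_index(column, value, num_features):
    """Map 'column=value' to a vector index: stable hash, then simple modulo."""
    token = f"{column}={value}".encode("utf-8")
    digest = hashlib.md5(token).digest()  # stand-in hash, NOT Spark's MurmurHash 3
    return int.from_bytes(digest[:4], "big") % num_features


def hash_encode(column, value, num_features):
    """Return a dense vector with a single 1.0 at the hashed index."""
    vec = [0.0] * num_features
    vec[hashed_index(column, value, num_features)] = 1.0
    return vec


keys = ["K1", "K2", "K3", "K4"]
for nf in (3, 4, 5):
    # Each key gets some index in [0, nf); nothing forces the indices to differ.
    print(nf, [hashed_index("key", k, nf) for k in keys])
```

With four keys and numFeatures = 3, at least two keys must share an index, whatever hash function is used. Larger values of numFeatures make collisions less likely but never rule them out.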

for nf in [3, 4, 5]:
    df = df.drop('key_vector')
    encoder = FeatureHasher(numFeatures = nf, inputCols=["key"], outputCol="key_vector")
    # FeatureHasher is a Transformer, not an Estimator: it has no fit() method
    df = encoder.transform(df)

    #SparseVector(int size, int[] indices, double[] values) 
    temp = df.collect()
    for i in temp:
        print(i.key_vector, i.key_vector.toArray())

OUTPUT:
(3,[1],[1.0]) [0. 1. 0.]
(3,[0],[1.0]) [1. 0. 0.]
(3,[0],[1.0]) [1. 0. 0.]
(3,[1],[1.0]) [0. 1. 0.]
(4,[2],[1.0]) [0. 0. 1. 0.]
(4,[1],[1.0]) [0. 1. 0. 0.]
(4,[1],[1.0]) [0. 1. 0. 0.]
(4,[2],[1.0]) [0. 0. 1. 0.]
(5,[0],[1.0]) [1. 0. 0. 0. 0.]
(5,[3],[1.0]) [0. 0. 0. 1. 0.]
(5,[1],[1.0]) [0. 1. 0. 0. 0.]
(5,[4],[1.0]) [0. 0. 0. 0. 1.] 

for nf in [2, 3, 4, 5]:
    df = df.drop('alphabet_vector')
    encoder = FeatureHasher(numFeatures = nf, inputCols=["alphabet"], outputCol="alphabet_vector")
    # encoder = encoder.fit(df) # AttributeError: 'FeatureHasher' object has no attribute 'fit'
    df = encoder.transform(df)

    #SparseVector(int size, int[] indices, double[] values) 
    temp = df.collect()
    for i in temp:
        print(i.alphabet_vector, ' ## ', i.alphabet_vector.toArray())

OUTPUT:
(2,[1],[1.0])  ##  [0. 1.]
(2,[1],[1.0])  ##  [0. 1.]
(2,[1],[1.0])  ##  [0. 1.]
(2,[1],[1.0])  ##  [0. 1.]
(3,[1],[1.0])  ##  [0. 1. 0.]
(3,[1],[1.0])  ##  [0. 1. 0.]
(3,[1],[1.0])  ##  [0. 1. 0.]
(3,[1],[1.0])  ##  [0. 1. 0.]
(4,[3],[1.0])  ##  [0. 0. 0. 1.]
(4,[3],[1.0])  ##  [0. 0. 0. 1.]
(4,[1],[1.0])  ##  [0. 1. 0. 0.]
(4,[1],[1.0])  ##  [0. 1. 0. 0.]
(5,[2],[1.0])  ##  [0. 0. 1. 0. 0.]
(5,[2],[1.0])  ##  [0. 0. 1. 0. 0.]
(5,[4],[1.0])  ##  [0. 0. 0. 0. 1.]
(5,[4],[1.0])  ##  [0. 0. 0. 0. 1.] 
      

Fix: Converting String categories into One-hot encoded values using StringIndexer and OneHotEncoder

df = df.drop('alphabet_vector_1', 'alphabet_vector_2', 'indexedAlphabet')
alphabetIndexer = StringIndexer(inputCol="alphabet", outputCol="indexedAlphabet").fit(df)
df = alphabetIndexer.transform(df)

encoder = OneHotEncoder(inputCol="indexedAlphabet", outputCol="alphabet_vector_1", dropLast = True)
encoder = encoder.fit(df)
df = encoder.transform(df)

#SparseVector(int size, int[] indices, double[] values)
temp = df.collect()
for i in temp:
    print(i.alphabet_vector_1, " ## ", i.alphabet_vector_1.toArray())

encoder = OneHotEncoder(inputCol="indexedAlphabet", outputCol="alphabet_vector_2", dropLast = False)
encoder = encoder.fit(df)
df = encoder.transform(df)

temp = df.collect()
for i in temp:
    print(i.alphabet_vector_2, " ## ", i.alphabet_vector_2.toArray())

OUTPUT:
(1,[0],[1.0])  ##  [1.]
(1,[0],[1.0])  ##  [1.]
(1,[],[])  ##  [0.]
(1,[],[])  ##  [0.]
(2,[0],[1.0])  ##  [1. 0.]
(2,[0],[1.0])  ##  [1. 0.]
(2,[1],[1.0])  ##  [0. 1.]
(2,[1],[1.0])  ##  [0. 1.]

df = df.drop('key_vector', 'indexedKey')
alphabetIndexer = StringIndexer(inputCol="key", outputCol="indexedKey").fit(df)
df = alphabetIndexer.transform(df)

encoder = OneHotEncoder(inputCol="indexedKey", outputCol="key_vector", dropLast = False)
encoder = encoder.fit(df)
df = encoder.transform(df)

#SparseVector(int size, int[] indices, double[] values)
temp = df.collect()
for i in temp:
    print(i.key_vector, " ## ", i.key_vector.toArray())

Output:
(4,[0],[1.0])  ##  [1. 0. 0. 0.]
(4,[1],[1.0])  ##  [0. 1. 0. 0.]
(4,[2],[1.0])  ##  [0. 0. 1. 0.]
(4,[3],[1.0])  ##  [0. 0. 0. 1.]
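The two-step idea (index the strings first, then one-hot encode the indices) can be sketched without Spark. This sketch assumes the ordering I understand StringIndexer to use by default in Spark 3.0: most frequent label first, with ties broken alphabetically. The helper names are invented for the example and are not part of any library.

```python
from collections import Counter


def fit_string_index(values):
    """Assign indices by descending frequency, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot_labels(values, drop_last=False):
    """Index the labels, then one-hot encode each row as a dense list."""
    mapping = fit_string_index(values)
    size = len(mapping) - 1 if drop_last else len(mapping)
    rows = []
    for v in values:
        vec = [0.0] * size
        if mapping[v] < size:  # the dropped last category stays all zeros
            vec[mapping[v]] = 1.0
        rows.append(vec)
    return rows


alphabet = ["a", "a", "b", "b"]
print(fit_string_index(alphabet))                 # {'a': 0, 'b': 1}
print(one_hot_labels(alphabet, drop_last=True))   # [[1.0], [1.0], [0.0], [0.0]]
print(one_hot_labels(alphabet, drop_last=False))  # [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

For the alphabet column this reproduces the dense arrays printed by the Spark code above, for both dropLast settings.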

How does it treat numeric columns?

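The numeric values are interpreted directly as category indices, so the value 5 sets position 5 of the vector.

df = df.drop('d1_vector_2')
encoder = OneHotEncoder(inputCol="d1", outputCol="d1_vector_2", dropLast = False)
encoder = encoder.fit(df)
df = encoder.transform(df)

temp = df.collect()
for i in temp:
    print(i.d1_vector_2, i.d1_vector_2.toArray())

Output:
(10,[5],[1.0]) [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
(10,[5],[1.0]) [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
(10,[5],[1.0]) [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
(10,[],[]) [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]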

Using Pandas

import pandas as pd

values = [("K1", "a", 5, 'x'), ("K2","a", 5, 'x'), ("K3","b", 5, 'x'), ("K4","b", 10, 'x')]
columns = ['key', 'alphabet', 'd1', 'd0']
df = pd.DataFrame(values, columns = columns)

pd.get_dummies(data=df, columns=['key', 'alphabet'])

Output:
d1  d0  key_K1  key_K2  key_K3  key_K4  alphabet_a  alphabet_b
 5   x       1       0       0       0           1           0
 5   x       0       1       0       0           1           0
 5   x       0       0       1       0           0           1
10   x       0       0       0       1           0           1

pd.get_dummies(data=df, columns=['key', 'alphabet'], drop_first=True)

Output:
d1  d0  key_K2  key_K3  key_K4  alphabet_b
 5   x       0       0       0           0
 5   x       1       0       0           0
 5   x       0       1       0           1
10   x       0       0       1           1
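One caveat when re-running the pandas example today: as far as I know, pandas 2.0 and later return boolean (True/False) dummy columns by default instead of 0/1. A small sketch, assuming the same toy data, that requests integer columns explicitly through the dtype parameter of get_dummies:

```python
import pandas as pd

values = [("K1", "a", 5, "x"), ("K2", "a", 5, "x"), ("K3", "b", 5, "x"), ("K4", "b", 10, "x")]
df = pd.DataFrame(values, columns=["key", "alphabet", "d1", "d0"])

# dtype=int forces 0/1 columns regardless of the installed pandas default.
dummies = pd.get_dummies(df, columns=["key", "alphabet"], dtype=int)

print(dummies.columns.tolist())
print(dummies[["alphabet_a", "alphabet_b"]].values.tolist())
```

The untouched columns d1 and d0 stay first, followed by one indicator column per category.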

Using SciKit Learn

Ref: scikit-learn.org

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc = enc.fit(df[['key', 'alphabet']])
print("OneHotEncoder on key and alphabet. sparse = True, drop = none")
print(enc.transform(df[['key', 'alphabet']]))

enc = OneHotEncoder(sparse=False)
enc = enc.fit(df[['key', 'alphabet']])
print("OneHotEncoder on key and alphabet. sparse = False, drop = none")
print(enc.transform(df[['key', 'alphabet']]))

enc = OneHotEncoder(sparse=False, drop = 'first')
enc = enc.fit(df[['key', 'alphabet']])
print("OneHotEncoder on key and alphabet. sparse = False, drop = first")
print(enc.transform(df[['key', 'alphabet']]))

OUTPUT:
OneHotEncoder on key and alphabet. sparse = True, drop = none
(0, 0) 1.0
(0, 4) 1.0
(1, 1) 1.0
(1, 4) 1.0
(2, 2) 1.0
(2, 5) 1.0
(3, 3) 1.0
(3, 5) 1.0
OneHotEncoder on key and alphabet. sparse = False, drop = none
[[1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]
 [0. 0. 0. 1. 0. 1.]]
OneHotEncoder on key and alphabet. sparse = False, drop = first
[[0. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 1.]
 [0. 0. 1. 1.]]

Scikit-learn's LabelBinarizer

Ref: scikit-learn.org

from sklearn import preprocessing

lb = preprocessing.LabelBinarizer(sparse_output=False)
lb.fit(df[['key']])
lb.transform(df[['key']])

array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])
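A note for readers on newer scikit-learn releases: to my knowledge the sparse argument of OneHotEncoder was renamed to sparse_output in version 1.2 and the old name was later removed, so sparse=False may raise an error. One way to sidestep the rename, sketched here on the same toy data, is to keep the default sparse output and densify it with toarray():

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

values = [("K1", "a", 5, "x"), ("K2", "a", 5, "x"), ("K3", "b", 5, "x"), ("K4", "b", 10, "x")]
df = pd.DataFrame(values, columns=["key", "alphabet", "d1", "d0"])

# Default output is a SciPy sparse matrix; toarray() converts it to a dense
# NumPy array without touching the renamed sparse / sparse_output argument.
enc = OneHotEncoder(drop="first")
dense = enc.fit_transform(df[["key", "alphabet"]]).toarray()

print(enc.categories_)  # categories learned per input column
print(dense)
```

The dense array should match the drop = first output shown above.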

Using Package: category_encoders

To install this package with conda run one of the following:

conda install -c conda-forge category_encoders
conda install -c conda-forge/label/gcc7 category_encoders
conda install -c conda-forge/label/cf201901 category_encoders
conda install -c conda-forge/label/cf202003 category_encoders

Ref: anaconda.org

from category_encoders import OneHotEncoder

cat_features = ['key', 'alphabet']
enc = OneHotEncoder(cols = cat_features)
enc.fit(df)

OneHotEncoder(cols=['key', 'alphabet'], drop_invariant=False,
              handle_missing='value', handle_unknown='value', return_df=True,
              use_cat_names=False, verbose=0)

enc.transform(df)

Output:
key_1  key_2  key_3  key_4  alphabet_1  alphabet_2  d1  d0
    1      0      0      0           1           0   5   x
    0      1      0      0           1           0   5   x
    0      0      1      0           0           1   5   x
    0      0      0      1           0           1  10   x
