Sunday, October 30, 2022

Do we need all the one-hot features?


import pandas as pd
df = pd.read_csv('in/titanic_train.csv')
df.head()

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'])
enc_pc_df.head()

Hypothetical Question

Q: If I remove one column (it could be the first or the last) from a one-hot feature matrix with n columns, can I reproduce the original matrix from the remaining n-1 columns? In other words: how do we "put back" the dropped column?

A: For each row, look at the remaining n-1 values. If none of them is 1, the dropped value for that row is 1; otherwise it is 0.

Assumption made: the original one-hot matrix has n columns, one per category.
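This rule is easy to write down in code. The snippet below is only a sketch: put_back_dropped_column() is an illustrative name, and it assumes 'remaining' is a DataFrame holding the n-1 dummy columns left after one column was dropped.

def put_back_dropped_column(remaining, dropped_name):
    # If no '1' appears among the remaining n-1 values in a row,
    # the dropped value for that row is 1; otherwise it is 0.
    restored = remaining.copy()
    restored[dropped_name] = (remaining.sum(axis=1) == 0).astype(int)
    return restored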

Conclusion: Removing one column from the one-hot feature matrix causes no data loss, because each column's value is fully determined by the values of the remaining n-1 columns.

So, how do we remove this redundancy between the columns?

We drop the first column from the one-hot feature matrix:

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'], drop_first = True)
enc_pc_df.head()

The default value of the "drop_first" parameter is False.
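As a quick check, here is a sketch that reuses the put_back_dropped_column() helper from above. It assumes 'Pclass' in the Titanic data takes the values 1, 2 and 3, so drop_first removes the 'Pclass_1' dummy column.

full = pd.get_dummies(df, columns = ['Pclass'])                       # Pclass_1, Pclass_2, Pclass_3
reduced = pd.get_dummies(df, columns = ['Pclass'], drop_first = True) # Pclass_2, Pclass_3

# Put back Pclass_1 from the two remaining dummy columns and compare.
restored = put_back_dropped_column(reduced[['Pclass_2', 'Pclass_3']], 'Pclass_1')
print((restored['Pclass_1'] == full['Pclass_1']).all())  # Expected: True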
Tags: Technology,Machine Learning,
