survival8: Do we need all the one hot features?

Sunday, October 30, 2022

Do we need all the one hot features?


import pandas as pd

# Load the Titanic training data and preview the first rows
df = pd.read_csv('in/titanic_train.csv')
df.head()




# One-hot encode 'Pclass', keeping all of its columns
enc_pc_df = pd.get_dummies(df, columns=['Pclass'])
enc_pc_df.head()




Hypothetical Question
Q: If I remove one column (the first, the last, or any other) from a one-hot feature matrix with 'n' columns, can I reproduce the original matrix from the remaining 'n-1' columns?

Or, rephrasing the question: how do we get back the original matrix, i.e. 'put back the dropped column'?

Answer:
If: none of the remaining n-1 values in a row is '1', then the dropped value for that row is 1.
Else: the dropped value is 0.
Assumptions made: the original matrix has 'n' columns, and every row contains exactly one '1' (no row has a missing category).
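The rule above can be sketched in a few lines of pandas. This is a minimal illustration, assuming a small made-up 'Pclass' column (the name `toy` and its values are hypothetical, not the Titanic data) with no missing values. It drops the first one-hot column and then rebuilds it from the remaining n-1 columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic 'Pclass' column.
toy = pd.DataFrame({"Pclass": [3, 1, 2, 3, 1]})

# Full one-hot matrix with n = 3 columns; cast to int so the code behaves
# the same whether get_dummies returns bool or integer columns.
full = pd.get_dummies(toy, columns=["Pclass"]).astype(int)

# Drop one column (here the first) to get the n-1 column matrix.
dropped_name = full.columns[0]
reduced = full.drop(columns=[dropped_name])

# Rule from the answer: the dropped value is 1 exactly when the remaining
# n-1 values in the row contain no 1, i.e. when their sum is 0.
recovered = (reduced.sum(axis=1) == 0).astype(int)

# Put the recovered column back in its original position.
rebuilt = reduced.copy()
rebuilt.insert(0, dropped_name, recovered)

print(rebuilt.equals(full))
```

If the category column contained missing values, some rows would hold no '1' at all, and this rule would wrongly assign them to the dropped category.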

Conclusion: Removing one column from the one-hot feature matrix causes no data loss.
The value in any one column is fully determined by the values in the other n-1 columns.
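The relation between the columns can also be written arithmetically: since each row holds exactly one '1', the n columns of a row sum to 1, so any column equals 1 minus the sum of the others. The sketch below checks this for every column in turn, again on made-up data (`toy` is a hypothetical example, not the Titanic file).

```python
import pandas as pd

# Hypothetical toy data; any categorical column without missing values works.
toy = pd.DataFrame({"Pclass": [2, 3, 1, 1, 3, 2]})
full = pd.get_dummies(toy, columns=["Pclass"]).astype(int)

# Every row holds exactly one 1, so the n columns always sum to 1.
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# Hence any single column equals 1 minus the sum of the other n-1 columns,
# whichever column is chosen (first, last, or one in between).
relation_holds = {}
for col in full.columns:
    others = full.drop(columns=[col])
    relation_holds[col] = bool((full[col] == 1 - others.sum(axis=1)).all())

print(relation_holds)
```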

So, how do we remove this redundancy?

We drop the first column from the one-hot feature matrix.

# One-hot encode 'Pclass' again, this time dropping its first column
enc_pc_df = pd.get_dummies(df, columns=['Pclass'], drop_first=True)
enc_pc_df.head()

The default value of the "drop_first" parameter is False, so all n columns are kept unless it is set to True.
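To see the effect of the parameter side by side, here is a small sketch that compares the column names produced with the default setting and with drop_first=True. The frame `toy` is a hypothetical example with three classes, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data with three distinct classes.
toy = pd.DataFrame({"Pclass": [1, 2, 3, 1]})

# Default call: drop_first is False, so all n = 3 columns are produced.
default_cols = list(pd.get_dummies(toy, columns=["Pclass"]).columns)

# drop_first=True removes the column for the first category level.
dropped_cols = list(
    pd.get_dummies(toy, columns=["Pclass"], drop_first=True).columns
)

print(default_cols)   # ['Pclass_1', 'Pclass_2', 'Pclass_3']
print(dropped_cols)   # ['Pclass_2', 'Pclass_3']
```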