import pandas as pd

# Load the Titanic training data
df = pd.read_csv('in/titanic_train.csv')
df.head()

# One-hot encode 'Pclass': one indicator column per class
enc_pc_df = pd.get_dummies(df, columns=['Pclass'])
enc_pc_df.head()
Hypothetical Question
Q: If I remove one column (the first, the last, or any other) from a one-hot feature matrix with 'n' columns, can I reproduce the original matrix from the remaining 'n-1' columns?
Rephrased: how do we 'put back' the dropped column?
Answer:
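If: a row has no '1' among its remaining n-1 values, then the dropped value for that row is 1.
Else: the dropped value is 0.
Assumptions made: the original matrix has 'n' columns, and every row has exactly one '1' (each row belongs to exactly one of the n categories).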
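The rule above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: the small 'toy' frame and its values are made up to stand in for the Titanic 'Pclass' column, and the choice to drop 'Pclass_1' is arbitrary.

```python
import pandas as pd

# Toy data standing in for the Titanic 'Pclass' column (hypothetical values).
toy = pd.DataFrame({"Pclass": [3, 1, 2, 3, 1]})

# Full one-hot matrix with n = 3 columns, stored as integers.
full = pd.get_dummies(toy, columns=["Pclass"], dtype=int)

# Remove one column by hand, leaving n-1 = 2 columns.
reduced = full.drop(columns=["Pclass_1"])

# Rule from the text: the dropped value is 1 only when the
# remaining values in the row contain no 1.
restored = (reduced.sum(axis=1) == 0).astype(int)

# Put the column back in its original position and compare.
rebuilt = reduced.copy()
rebuilt.insert(0, "Pclass_1", restored)
print(rebuilt.equals(full))
```

The same rule works for whichever column is removed, as long as every row has exactly one '1' in the full matrix.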
Conclusion: Removing one column from the one-hot feature matrix causes no data loss.
Each column's value is fully determined by the values of the other n-1 columns, so the full matrix carries redundant information.
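One way to see this relation is to check that the indicator columns in every row add up to 1. The snippet below is a small sketch on made-up data (the 'toy' frame is hypothetical and only mimics 'Pclass'); it is meant as a sanity check, not a general test.

```python
import pandas as pd

# Toy data standing in for the Titanic 'Pclass' column (hypothetical values).
toy = pd.DataFrame({"Pclass": [3, 1, 2, 3, 1]})
full = pd.get_dummies(toy, columns=["Pclass"], dtype=int)

# Every row holds exactly one 1, so the indicator columns always add up to 1.
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# Hence any one column equals 1 minus the sum of the others.
others = full[["Pclass_2", "Pclass_3"]].sum(axis=1)
print((full["Pclass_1"] == 1 - others).all())
```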
So, how do we remove this dependency between the columns?
We drop the first column from the one-hot feature matrix.
enc_pc_df = pd.get_dummies(df, columns=['Pclass'], drop_first=True)
enc_pc_df.head()
The default value of the "drop_first" parameter is False, so get_dummies keeps all n columns unless you set it to True.
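To compare the two settings side by side, here is a short sketch on a made-up frame (the 'toy' data is hypothetical and only imitates the 'Pclass' column). It lists the column names produced with the default and with drop_first=True.

```python
import pandas as pd

# Toy data standing in for the Titanic 'Pclass' column (hypothetical values).
toy = pd.DataFrame({"Pclass": [3, 1, 2, 3, 1]})

# Default (drop_first=False): one indicator column per class.
default_cols = list(pd.get_dummies(toy, columns=["Pclass"]).columns)

# drop_first=True: the indicator for the first class level is left out.
dropped_cols = list(
    pd.get_dummies(toy, columns=["Pclass"], drop_first=True).columns
)

print(default_cols)
print(dropped_cols)
```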