import pandas as pd
df = pd.read_csv('in/titanic_train.csv')
df.head()
enc_pc_df = pd.get_dummies(df, columns = ['Pclass'])
enc_pc_df.head()

Hypothetical Question
Q: If I remove one column (it could be the first, it could be the last) from a one-hot feature matrix with 'n' columns, can I reproduce the same matrix from the remaining 'n-1' columns? Rephrasing the question: how do we get back the original matrix, or 'put back the dropped column'?

Answer: For each row, if there is no '1' among the remaining n-1 values, then the dropped value for that row is 1. Otherwise it is 0.

Assumptions made: the one-hot matrix has 'n' columns, and each row contains exactly one '1', which is what one-hot encoding produces when no value is missing.

Conclusion: Removing one column from the one-hot feature matrix causes no data loss. The value of any one column is fully determined by the values of the other n-1 columns.
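The rule in the answer can be checked on a small example. This is only a sketch: the `toy` DataFrame is made-up data standing in for the Titanic 'Pclass' column, and it assumes every row has exactly one '1' in the full matrix.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file) with three passenger classes.
toy = pd.DataFrame({"Pclass": [1, 3, 2, 3, 1]})

# Full one-hot matrix with n = 3 columns, stored as integers.
full = pd.get_dummies(toy, columns=["Pclass"], dtype=int)

# Remove one column (here the first) so that n-1 columns remain.
reduced = full.drop(columns=["Pclass_1"])

# Put the column back: its value is 1 exactly when the rest of the row has no 1.
rebuilt = reduced.copy()
rebuilt.insert(0, "Pclass_1", (reduced.sum(axis=1) == 0).astype(int))

# Compare the rebuilt matrix with the original one.
print(rebuilt.equals(full))
```

The same idea works whichever column is removed. Only the position passed to `insert` changes.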
So, how do we remove this redundancy between the columns?

We drop the first column from the one-hot feature matrix.

enc_pc_df = pd.get_dummies(df, columns = ['Pclass'], drop_first = True)
enc_pc_df.head()

The default value of the "drop_first" parameter is False.
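To see what "drop_first" changes without needing the Titanic file, here is a small sketch on made-up data. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: one column to encode and one to leave untouched.
toy = pd.DataFrame({"Pclass": [1, 3, 2, 3, 1], "Survived": [1, 0, 1, 0, 0]})

# Default behaviour (drop_first=False): one column per class.
full = pd.get_dummies(toy, columns=["Pclass"])

# With drop_first=True, the column for the first class is left out.
reduced = pd.get_dummies(toy, columns=["Pclass"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced frame, a row with 0 in both 'Pclass_2' and 'Pclass_3' belongs to class 1.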
Sunday, October 30, 2022
Do we need all the one hot features?