We are going to try out scikit-learn's MinMaxScaler on two features of a dataset.

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Training data: two columns with 20 values each
    train_df = pd.DataFrame({'A': list(range(1000, 3000, 100)),
                             'B': list(range(1000, 5000, 200))})

    # Test data: two columns with 42 values each
    test_df = pd.DataFrame({'A': list(range(-200, 4000, 100)),
                            'B': sorted(list(range(1000, 4900, 100)) + [1050, 1150, 1250])})

    # feature_range: tuple (min, max), default=(0, 1)
    scaler_a = MinMaxScaler(feature_range=(0, 10))
    scaler_b = MinMaxScaler(feature_range=(0, 10))

    # Fit each scaler on the training data and scale it
    train_df['a_skl'] = scaler_a.fit_transform(train_df[['A']])
    train_df['b_skl'] = scaler_b.fit_transform(train_df[['B']])

    print(train_df[0:1])
    print(train_df[-1:])

Output:

          A     B  a_skl  b_skl
    0  1000  1000    0.0    0.0
           A     B  a_skl  b_skl
    19  2900  4800   10.0   10.0

    # Scale the test data with the scalers fitted on the training data
    test_df['a_skl'] = scaler_a.transform(test_df[['A']])
    test_df['b_skl'] = scaler_b.transform(test_df[['B']])

    print(test_df[0:1])
    print(test_df[-1:])

Output:

         A     B     a_skl  b_skl
    0 -200  1000 -6.315789    0.0
           A     B      a_skl  b_skl
    41  3900  4800  15.263158   10.0

    # Minimum and maximum of each feature in the training and test data
    train_df_minmax_a = train_df['A'].agg([np.min, np.max])
    train_df_minmax_b = train_df['B'].agg([np.min, np.max])

    test_df_minmax_a = test_df['A'].agg([np.min, np.max])
    test_df_minmax_b = test_df['B'].agg([np.min, np.max])

    print(train_df_minmax_a)
    print(train_df_minmax_b)

Output:

    amin    1000
    amax    2900
    Name: A, dtype: int64
    amin    1000
    amax    4800
    Name: B, dtype: int64

The problem

We have two features, A and B. In the training data, A ranges from 1000 to 2900 and B ranges from 1000 to 4800. In the test data, A ranges from -200 to 3900 and B ranges from 1000 to 4800.

On the test data, B is converted to values between 0 and 10, as specified by feature_range in the MinMaxScaler definition. A, however, is converted to the range -6.3 to 15.26, because scaler_a was fitted on the narrower training range of A and the test values fall outside it.

Result: A and B are still in different ranges on the test data.
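To see where the out-of-range values for A come from, the scaling can be reproduced by hand. The sketch below assumes the formula given in the scikit-learn documentation for MinMaxScaler, X_scaled = (X - data_min) / (data_max - data_min) * (max - min) + min, and reads the fitted bounds from the scaler's data_min_ and data_max_ attributes. The variable names train_a, test_a, manual and sklearn_scaled are chosen here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same feature A as in the post
train_a = pd.DataFrame({'A': list(range(1000, 3000, 100))})
test_a = pd.DataFrame({'A': list(range(-200, 4000, 100))})

scaler = MinMaxScaler(feature_range=(0, 10))
scaler.fit(train_a)

# Bounds learned from the training data only
data_min = scaler.data_min_[0]
data_max = scaler.data_max_[0]
r_min, r_max = scaler.feature_range

# Apply the documented formula by hand to the test data
manual = (test_a['A'] - data_min) / (data_max - data_min) * (r_max - r_min) + r_min

# Let scikit-learn do the same thing
sklearn_scaled = scaler.transform(test_a)[:, 0]

print(data_min, data_max)
print(manual.iloc[0], manual.iloc[-1])
print(sklearn_scaled[0], sklearn_scaled[-1])
```

Both computations should agree: any test value below the training minimum or above the training maximum maps outside the requested feature_range.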
Fix

We should be able to anticipate the range we are going to observe in the test data or in a real-time / production setting. Next, we define a min-max scaler of our own. For A, we set the expected minimum to -500, and for B, we set the expected minimum to 800. (The maximums can be handled similarly; here we keep the training maximums.)

    # Target range of the scaled values
    r_min = 0
    r_max = 10

    def getMinMax(cell, amin, amax):
        # Scale one value from [amin, amax] to [r_min, r_max]
        a = cell - amin
        x_std = a / (amax - amin)
        x_scaled = x_std * (r_max - r_min) + r_min
        return x_scaled

    # Expected minimums: -500 for A, 800 for B; maximums taken from the training data
    test_df['a_gmm'] = test_df['A'].apply(lambda x: getMinMax(x, -500, train_df_minmax_a.amax))
    test_df['b_gmm'] = test_df['B'].apply(lambda x: getMinMax(x, 800, train_df_minmax_b.amax))

    print(test_df)

Output:

           A     B  a_skl  b_skl  a_gmm  b_gmm
    0   -200  1000  -6.31   0.00   0.88   0.50
    1   -100  1050  -5.78   0.13   1.17   0.62
    2      0  1100  -5.26   0.26   1.47   0.75
    ...
    39  3700  4600  14.21   9.47  12.35   9.50
    40  3800  4700  14.73   9.73  12.64   9.75
    41  3900  4800  15.26   10.0  12.94   10.0

With the expected minimums in place, the low end of both features now falls inside the 0 to 10 range. A still overshoots at the top (12.94) because we kept the training maximum of 2900. Adjusting the expected maximum in the same way we adjusted the expected minimum brings the scaled values of both features into the same range, and the issue is fixed.

References

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
https://scikit-learn.org/stable/modules/preprocessing.html
https://benalexkeen.com/feature-scaling-with-scikit-learn/
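As a follow-up to the fix described in this post, one way to handle the expected minimum and the expected maximum together is to fit MinMaxScaler on the anticipated bounds instead of on the training data. This is only a sketch: the minimums (-500 and 800) are the ones used in the post, while the maximums (4000 and 5000) and the names expected_bounds, a_exp and b_exp are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same test data as in the post
test_df = pd.DataFrame({
    'A': list(range(-200, 4000, 100)),
    'B': sorted(list(range(1000, 4900, 100)) + [1050, 1150, 1250]),
})

# Anticipated bounds per feature: first row = expected minimum, second row = expected maximum.
# The maximums (4000, 5000) are assumed values, not taken from the post.
expected_bounds = pd.DataFrame({'A': [-500, 4000], 'B': [800, 5000]})

# Fitting on the two boundary rows makes them the scaler's data_min_ and data_max_
scaler = MinMaxScaler(feature_range=(0, 10))
scaler.fit(expected_bounds)

scaled = pd.DataFrame(scaler.transform(test_df[['A', 'B']]),
                      columns=['a_exp', 'b_exp'], index=test_df.index)

# Smallest and largest scaled value of each feature
print(scaled.agg(['min', 'max']))
```

As long as the observed values stay within the anticipated bounds, both features stay inside the 0 to 10 range. Values outside those bounds would still fall outside it, so the bounds need to be chosen with some margin.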
Sunday, June 21, 2020
Working with scikit-learn's MinMaxScaler and defining our own