We are going to try out scikit-learn's MinMaxScaler for two features of a dataset.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Two lists with 20 values
train_df = pd.DataFrame({'A': list(range(1000, 3000, 100)), 'B': list(range(1000, 5000, 200))})
# Two lists with 42 values
test_df = pd.DataFrame({'A': list(range(-200, 4000, 100)),
                        'B': sorted(list(range(1000, 4900, 100)) + [1050, 1150, 1250])})
scaler_a = MinMaxScaler(feature_range = (0, 10)) # feature_range: tuple (min, max), default=(0, 1)
scaler_b = MinMaxScaler(feature_range = (0, 10)) # feature_range: tuple (min, max), default=(0, 1)
train_df['a_skl'] = scaler_a.fit_transform(train_df[['A']])
train_df['b_skl'] = scaler_b.fit_transform(train_df[['B']])
print(train_df[0:1])
print(train_df[-1:])
Output:
A B a_skl b_skl
0 1000 1000 0.0 0.0
A B a_skl b_skl
19 2900 4800 10.0 10.0
test_df['a_skl'] = scaler_a.transform(test_df[['A']])
test_df['b_skl'] = scaler_b.transform(test_df[['B']])
print(test_df[0:1])
print(test_df[-1:])
Output:
A B a_skl b_skl
0 -200 1000 -6.315789 0.0
A B a_skl b_skl
41 3900 4800 15.263158 10.0
train_df_minmax_a = train_df['A'].agg([np.min, np.max])
train_df_minmax_b = train_df['B'].agg([np.min, np.max])
test_df_minmax_a = test_df['A'].agg([np.min, np.max])
test_df_minmax_b = test_df['B'].agg([np.min, np.max])
print(train_df_minmax_a)
print(train_df_minmax_b)
Output:
amin 1000
amax 2900
Name: A, dtype: int64
amin 1000
amax 4800
Name: B, dtype: int64
The problem
We have two features, A and B. In the training data, A ranges from 1000 to 2900 and B ranges from 1000 to 4800.
In the test data, A ranges from -200 to 3900 and B ranges from 1000 to 4800.
On the test data, B is converted to values between 0 and 10, as specified by feature_range in the MinMaxScaler definition.
But A in the test data is converted to the range -6.32 to 15.26, because the scaler reuses the minimum and maximum it learned from the training data.
Result: A and B are still in different ranges on the test data.
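To see where -6.32 and 15.26 come from, the min-max formula can be applied by hand with the bounds the scaler stored during fitting. This is a minimal sketch: the helper name manual_scale is made up for illustration, while data_min_ and data_max_ are attributes that a fitted MinMaxScaler exposes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same training values for feature A as in the post.
train_a = pd.DataFrame({'A': list(range(1000, 3000, 100))})

scaler = MinMaxScaler(feature_range=(0, 10))
scaler.fit(train_a)

# Bounds learned from the training data only.
data_min = scaler.data_min_[0]  # 1000.0
data_max = scaler.data_max_[0]  # 2900.0

def manual_scale(x, lo=0, hi=10):
    """Apply the min-max formula using the training bounds (illustrative helper)."""
    x_std = (x - data_min) / (data_max - data_min)
    return x_std * (hi - lo) + lo

# Smallest and largest values of A in the test data.
test_values = np.array([-200, 3900])
manual = manual_scale(test_values)
from_sklearn = scaler.transform(pd.DataFrame({'A': test_values})).ravel()

print(manual)        # approximately [-6.32, 15.26]
print(from_sklearn)  # same values as the manual calculation
```

Any test value below 1000 ends up below 0, and any value above 2900 ends up above 10, because the denominator is fixed by the training range.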
Fix
To fix this, we need to anticipate the range of values we are going to observe in test data or in real-time use in production.
Next, we define a MinMaxScaler of our own.
For A, we set the expected minimum to -500 and for B, we set the expected minimum to 800. (The maximums should be handled similarly; in the code below they are still the training maximums.)
r_min = 0
r_max = 10
def getMinMax(cell, amin, amax):
    # Map cell linearly from [amin, amax] to [r_min, r_max]
    a = cell - amin
    x_std = a / (amax - amin)
    x_scaled = x_std * (r_max - r_min) + r_min
    return x_scaled
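# Expected minimums are set by hand; the maximums are taken from the training data.
# Reading the maximum with .max() avoids depending on the 'amax' label, which varies across pandas/NumPy versions.
test_df['a_gmm'] = test_df['A'].apply(lambda x: getMinMax(x, -500, train_df['A'].max()))
test_df['b_gmm'] = test_df['B'].apply(lambda x: getMinMax(x, 800, train_df['B'].max()))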
print(test_df)
Output:
A B a_skl b_skl a_gmm b_gmm
0 -200 1000 -6.31 0.00 0.88 0.50
1 -100 1050 -5.78 0.13 1.17 0.62
2 0 1100 -5.26 0.26 1.47 0.75
...
39 3700 4600 14.21 9.47 12.35 9.50
40 3800 4700 14.73 9.73 12.64 9.75
41 3900 4800 15.26 10.0 12.94 10.0
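The same calculation can also be written without global variables and without a per-cell apply, since the arithmetic works on a whole pandas Series at once. The sketch below is one possible variant, not the code used to produce the table above; the function name min_max_scale and its parameter names are chosen here for illustration.

```python
import pandas as pd

def min_max_scale(values, expected_min, expected_max, feature_range=(0, 10)):
    """Linearly map [expected_min, expected_max] onto feature_range."""
    r_min, r_max = feature_range
    x_std = (values - expected_min) / (expected_max - expected_min)
    return x_std * (r_max - r_min) + r_min

# Feature A of the test data from the post.
test_a = pd.Series(range(-200, 4000, 100))

# Expected minimum -500, training maximum 2900 (same bounds as a_gmm).
scaled = min_max_scale(test_a, expected_min=-500, expected_max=2900)

print(scaled.iloc[0], scaled.iloc[-1])  # about 0.88 and 12.94
```

The first and last values agree with the a_gmm column, which shows that the upper end still overshoots 10 while the maximum is left at the training value.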
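Just as we adjusted the expected minimum for the test data, we also have to adjust the expected maximum to bring all scaled values into the same range. In the output above, a_gmm still reaches 12.94 because the maximum for A is still the training maximum of 2900.
Once both bounds cover the values seen in the test data, the issue is fixed.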
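One way to adjust both bounds while still using scikit-learn is to fit MinMaxScaler on a two-row frame holding the expected minimum and maximum of each feature instead of the training data. This is a sketch under stated assumptions: the minimums (-500 and 800) come from this post, but the maximums (4000 and 5000) are hypothetical values picked for illustration, and the column names a_exp and b_exp are made up here.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Test data from the post.
test_df = pd.DataFrame({
    'A': list(range(-200, 4000, 100)),
    'B': sorted(list(range(1000, 4900, 100)) + [1050, 1150, 1250]),
})

# Expected bounds per feature: first row is the minimum, second row the maximum.
# The maximums 4000 and 5000 are assumed values, not taken from the post.
expected_bounds = pd.DataFrame({'A': [-500, 4000], 'B': [800, 5000]})

scaler = MinMaxScaler(feature_range=(0, 10))
scaler.fit(expected_bounds)

scaled = pd.DataFrame(scaler.transform(test_df[['A', 'B']]),
                      columns=['a_exp', 'b_exp'])

print(scaled.min())
print(scaled.max())
```

With these bounds every scaled test value stays inside 0 to 10 for both features. If some values may still fall outside the expected bounds, MinMaxScaler also accepts clip=True (scikit-learn 0.24 and later), which clips transformed values to feature_range.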
Sunday, June 21, 2020
Working with skLearn's MinMax scaler and defining our own