Thursday, June 1, 2023

Ch 3 - Scaling and Normalization

Scaling vs. Normalization: What's the difference?

One of the reasons it's easy to confuse scaling and normalization is that the terms are sometimes used interchangeably and, to make it even more confusing, the two are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that:
- in scaling, you're changing the range of your data, while
- in normalization, you're changing the shape of the distribution of your data.
Ref: Kaggle

Scaling

Why scale data? Example:
            Marks1   Marks2   Marks3
Student 1      280       70       60
Student 2      200       60       55
Student 3      270       40       30
Euclidean distance(s1, s2) = 80.78
Euclidean distance(s1, s3) = 43.59
Euclidean distance(s2, s3) = 76.97
The Euclidean distance between Student 1 and Student 2 is dominated by Marks1. The same is true for Student 2 and Student 3: there, too, the distance is high mainly because of the difference in the Marks1 attribute. For Student 1 and Student 3 the distance is not high, because their Marks1 values are close to each other (280 and 270), unlike 200 & 270 or 200 & 280.
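As a quick check of this, here is a minimal sketch (assuming numpy, scipy and scikit-learn's MinMaxScaler, since the original doesn't show code at this point) that computes the pairwise distances before and after scaling:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.preprocessing import MinMaxScaler

# Marks of the three students; columns are Marks1, Marks2, Marks3
marks = np.array([[280, 70, 60],
                  [200, 60, 55],
                  [270, 40, 30]], dtype=float)

# Pairwise distances on the raw data, in the order (s1,s2), (s1,s3), (s2,s3)
print(pdist(marks))   # roughly [80.78, 43.59, 76.97]; Marks1 dominates every pair

# Rescale each column to [0, 1] and recompute: no single subject dominates any more
scaled = MinMaxScaler().fit_transform(marks)
print(pdist(scaled))
```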

Why do we have to scale / normalize the input for an artificial neural network?

There are two reasons why we have to normalize input features before feeding them to a neural network.

Reason 1: If one feature in the dataset is much bigger in scale than the others, that feature becomes dominating and, as a result, the predictions of the neural network will not be accurate. Example: in employee data, if we consider Age and Salary, Age is a two-digit number while Salary can be 7 or 8 digits (1 million, etc.). In that case, Salary will dominate the prediction of the neural network. But if we normalize those features, the values of both will lie in the range 0 to 1.

Reason 2: Forward propagation in a neural network involves the dot product of the weights with the input features. So, if the values are very high (for image or non-image data), computing the output takes a lot of time as well as memory, and the same is the case during backpropagation. Consequently, the model converges slowly if the inputs are not normalized. Example: in image classification the inputs are large, since the value of each pixel ranges from 0 to 255, so normalization is especially important here (a small sketch follows below).

Mentioned below are other instances where normalization is very important:
- K-Means
- K-Nearest Neighbours
- Principal Component Analysis (PCA)
- Gradient Descent
Ref: StackOverflow
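As a tiny illustration of Reason 2's example, here is a sketch (with made-up pixel values, not code from the original post) of bringing pixel intensities into the [0, 1] range before they reach the network:

```python
import numpy as np

# A made-up 2x2 grayscale "image" with pixel intensities in [0, 255]
pixels = np.array([[0, 64],
                   [128, 255]], dtype=np.float32)

# Dividing by the maximum possible intensity brings every value into [0, 1]
normalized = pixels / 255.0
print(normalized)
```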
The z-score is 0.67 in both cases. A z-score is understood as how many standard deviations a point is away from the mean: z = (x - mean) / std.
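A tiny numeric illustration of that formula (the mark, mean, and standard deviation below are made up; the post's own worked figures are not reproduced here):

```python
# Made-up numbers: a mark of 70 in a subject whose class mean is 60
# and whose standard deviation is 15
mark, mean, std = 70, 60, 15

z = (mark - mean) / std
print(round(z, 2))   # 0.67 -> the mark sits 0.67 standard deviations above the mean
```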

What if there is an outlier in the data?

When there is an outlier (on the higher side) in the data, MinMaxScaler maps the outlier to 1 and squashes all the other points to values near 0. So, if you suspect there are outliers in the data, don't use MinMaxScaler; use StandardScaler.
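A small sketch of this behaviour, assuming numpy and scikit-learn and using made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with a single large outlier (1000)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# MinMaxScaler pins the outlier to 1 and squashes the rest near 0
print(MinMaxScaler().fit_transform(x).ravel())
# roughly [0.    0.001 0.002 0.003 1.   ]

# StandardScaler is not bounded to [0, 1]; the outlier lands about
# 2 standard deviations above the mean, the rest sit around -0.5
print(StandardScaler().fit_transform(x).ravel())
```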

Now in code

Min Max Scaler
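The code for this section isn't included in the text above, so here is a minimal sketch of MinMaxScaler applied to the marks table from the earlier example:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The marks table from the example above
df = pd.DataFrame({"Marks1": [280, 200, 270],
                   "Marks2": [70, 60, 40],
                   "Marks3": [60, 55, 30]})

# MinMaxScaler applies x_scaled = (x - min) / (max - min) column by column
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_minmax)   # every column now lies in the range [0, 1]
```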

Standard Scaler
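Similarly, a minimal sketch of StandardScaler on the same table:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same marks table as above
df = pd.DataFrame({"Marks1": [280, 200, 270],
                   "Marks2": [70, 60, 40],
                   "Marks3": [60, 55, 30]})

# StandardScaler applies x_scaled = (x - mean) / std column by column
scaler = StandardScaler()
df_std = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_std)          # each column now has mean 0 and unit variance
print(df_std.mean())   # ~0 for every column
```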

Q1: Which subject is easy?
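The data for this exercise isn't reproduced in the text. One way to approach it (an assumption about the intended approach, in line with the z-score discussion above) is to compare each subject's mean and spread, here with made-up marks:

```python
import pandas as pd

# Hypothetical marks of the same five students in two subjects
df = pd.DataFrame({"Maths":   [55, 60, 65, 70, 75],
                   "History": [80, 85, 88, 90, 95]})

# A noticeably higher class mean (with a similar spread) suggests that
# subject is the "easier" one for this group of students
print(df.mean())
print(df.std())
```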

Q2: Applying z-score normalization using Scikit-Learn on a Pandas DataFrame
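A sketch of z-score normalization on a pandas DataFrame using scikit-learn's StandardScaler (the DataFrame contents below are made up):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up employee data
df = pd.DataFrame({"Age":    [25, 32, 47, 51, 62],
                   "Salary": [40000, 55000, 90000, 120000, 150000]})

# StandardScaler performs z-score normalization: z = (x - mean) / std
df_z = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(df_z)
```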

Q3: Visualizing what MinMaxScaler and StandardScaler do to the data.
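A sketch of one way to do the visualization with matplotlib histograms (the skewed random sample is an assumption; the original post's plots aren't reproduced here):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A skewed, made-up sample so the change in range is easy to see
rng = np.random.default_rng(42)
x = rng.exponential(scale=100, size=1000).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=30)
axes[0].set_title("Original")
axes[1].hist(MinMaxScaler().fit_transform(x), bins=30)
axes[1].set_title("MinMaxScaler")
axes[2].hist(StandardScaler().fit_transform(x), bins=30)
axes[2].set_title("StandardScaler")
plt.tight_layout()
plt.show()
```

MinMaxScaler compresses the sample into [0, 1] and StandardScaler centres it at 0 with unit variance, but in both cases the shape of the histogram is unchanged, which is the "scaling changes the range, not the shape" point from the top of the post.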

Tags: Data Analytics, Technology, Python