Highlighting Algorithm Followed For Smoothing The Data
1. Decide which kind of binning to use:
- Equal frequency
- Equal width
2. Once the data is binned, decide which value will represent each bin:
- mean
- median
- boundary
3. Replace every value in a bin with the value computed by the method selected in Step 2.
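The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `smooth_by_bins` and its parameters are made up here, it covers equal-frequency binning by position on already sorted input, it assumes the number of values divides evenly by the number of bins, and it handles only the mean and median options.

```python
from statistics import mean, median

def smooth_by_bins(sorted_values, n_bins, method="mean"):
    """Hypothetical helper: equal-frequency binning by position, then smoothing.

    Assumes sorted_values is sorted and len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    summarize = {"mean": mean, "median": median}[method]
    smoothed = []
    for start in range(0, len(sorted_values), size):
        bin_values = sorted_values[start:start + size]  # Step 1: one equal-frequency bin
        value = summarize(bin_values)                   # Step 2: the bin's representative value
        smoothed.extend([value] * len(bin_values))      # Step 3: replace every value in the bin
    return smoothed

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
by_mean = smooth_by_bins(prices, 3, "mean")
by_median = smooth_by_bins(prices, 3, "median")
print(by_mean)
print(by_median)
```

With three bins of four values each, every value in the first bin becomes 9.75 under the mean option and 10.5 under the median option.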
Smoothing (noisy data)
Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition them into three bins by each of the following methods.
Equal-frequency partitioning
What is Smoothing by bin mean/median/boundary?
How do we define the first bin?
Take the case where each bin holds three values, which means four bins over the 12 records. The first bin must then enclose 5, 10 and 11.
(4.5, 11.5]: This is a correct choice, but let’s look at what Pandas does.
What Pandas has created is:
(4.999, 12.5]: The range excludes 4.999 and starts just above it. It includes 12.5 and ends there.
Is it wrong? No. It still contains exactly 5, 10 and 11.
Next bin:
(12.5, 42.5]: Does it wrap the elements 13, 15 and 35? Yes.
So the next bin starts just above 42.5.
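The edges 12.5 and 42.5 can be reproduced with the standard library. This sketch assumes four quantile bins and assumes the edges are placed by linear interpolation between the sorted values, which is what `statistics.quantiles(..., method="inclusive")` does and which is consistent with the intervals shown above. It does not call Pandas itself.

```python
from statistics import quantiles

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# "inclusive" treats the data as containing its own minimum and maximum
# and interpolates linearly between neighbouring sorted values.
inner_edges = quantiles(prices, n=4, method="inclusive")
edges = [min(prices)] + inner_edges + [max(prices)]

# Count the values in each right-closed interval (left, right].
# The first interval also takes the minimum, which is why Pandas
# nudges its left edge down to 4.999.
counts = []
for i, (left, right) in enumerate(zip(edges, edges[1:])):
    in_bin = [p for p in prices if left < p <= right or (i == 0 and p == left)]
    counts.append(len(in_bin))

print(inner_edges)
print(counts)
```

Each of the four intervals ends up with three values, so any edge between 11 and 13 (such as 11.5 or 12.5) defines the same first bin.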
What is Smoothing by bin mean/median/boundary?
Each value in a bin is replaced by the bin's mean, its median, or its nearest boundary.
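Smoothing by bin boundary (bins follow equal-frequency partitioning, four values per bin):
Bin 1: 5, 13, 13, 13
5 is itself the boundary value 5, while 10 and 11 are closer to the boundary value 13 than to 5.
Bin 2: 15, 15, 55, 55
Here 35 is equally far (20) from both boundaries, 15 and 55; this example assigns it to the lower one.
Bin 3: 72, 72, 215, 215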
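The boundary results above can be checked with a few lines of Python. The function name `smooth_by_boundaries` is made up for this sketch, and sending ties to the lower boundary is an assumption chosen so that 35 in Bin 2 becomes 15, as in the listing above.

```python
def smooth_by_boundaries(bin_values):
    """Replace each value in a sorted bin with the nearer of the bin's min and max.

    Assumption: a value equally far from both boundaries goes to the lower one.
    """
    low, high = bin_values[0], bin_values[-1]
    return [low if v - low <= high - v else high for v in bin_values]

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# Equal-frequency bins by position: three bins of four values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]
smoothed = [smooth_by_boundaries(b) for b in bins]

for number, values in enumerate(smoothed, start=1):
    print(f"Bin {number}: {values}")
```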
Smoothing by equal-frequency binning using the mean of each bin
1. Create the bins. In code: pd.qcut()
2. Group the data according to bins. In code: df.groupby()
3. Find the mean of each group. In code: df.groupby().mean()
4. Create a map of bin labels and mean values. In code, it is essentially a dictionary (simply key-value pairs) that looks like this:
{ '(4.999, 14.333]': 9.75, '(14.333, 60.667]': 38.75, '(60.667, 215.0]': 145.75 }
5. Populate a new column containing the mean of each bin for each data point.
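The five steps might look like this with Pandas. Treat it as a sketch rather than the exact code behind the numbers above: the column names `price`, `bin` and `price_smoothed` are made up, and the bins are converted to their string labels so that the map in step 4 is an ordinary dictionary keyed by label.

```python
import pandas as pd

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
df = pd.DataFrame({"price": prices})

# Step 1: three equal-frequency bins; keep each bin's label as a plain string.
df["bin"] = pd.qcut(df["price"], q=3).astype(str)

# Steps 2 and 3: group the prices by bin label and take the mean of each group.
bin_means = df.groupby("bin")["price"].mean()

# Step 4: a dictionary mapping each bin label to its mean.
mean_by_bin = bin_means.to_dict()

# Step 5: a new column holding the mean of the bin each price belongs to.
df["price_smoothed"] = df["bin"].map(mean_by_bin)

print(mean_by_bin)
print(df)
```

Keeping the original `price` column next to `price_smoothed` makes it easy to compare the raw and smoothed values row by row.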
Equal-width partitioning
The width of each interval is (215 - 5)/3 = 70.
Perform smoothing by bin mean/median/boundary.
Bins using equal-width partitioning:
Elements: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Domain for bin 1: 5 up to, but not including, 75 (= 5 + 70). It holds 5, 10, 11, 13, 15, 35, 50, 55 and 72.
Domain for bin 2: 75 up to, but not including, 145 (= 75 + 70). It holds only 92.
Domain for bin 3: 145 up to and including 215, the maximum of the input data set. It holds 204 and 215.
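The equal-width split can be reproduced in plain Python as a quick check. This is a sketch with made-up variable names; it treats every bin as closed on the left and open on the right, except the last bin, which also includes the maximum.

```python
from statistics import mean

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3

low, high = min(prices), max(prices)
width = (high - low) / n_bins                      # (215 - 5) / 3
edges = [low + i * width for i in range(n_bins + 1)]

bins = [[] for _ in range(n_bins)]
for p in prices:
    # Clamp the index so the maximum value lands in the last bin.
    index = min(int((p - low) // width), n_bins - 1)
    bins[index].append(p)

# Smoothing by bin mean; this would fail on an empty bin, which does not occur here.
bin_means = [mean(b) for b in bins]

print(edges)
print(bins)
print(bin_means)
```

Unlike equal-frequency binning, the bins here are very uneven: nine values fall in the first bin, one in the second and two in the third.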