Monday, July 24, 2023

Smoothing (Part of Data Analytics Course)

Highlighting Algorithm Followed For Smoothing The Data

1. Decide which kind of binning you want to use:

- Equal frequency

- Equal width

2. Once you have binned the data, decide which value will represent each bin:

- mean

- median

- boundary

3. Replace each value in a bin with the value computed by the method selected in Step 2.
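The three steps can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `smooth_by_bins` and its parameters are made up for this example, it assumes equal-frequency bins and a value count that divides evenly by the number of bins, and it covers only the mean and median choices. The sample values are the sales prices used later in this post.

```python
from statistics import mean, median


def smooth_by_bins(values, n_bins, reducer):
    """Equal-frequency binning followed by smoothing (hypothetical helper)."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]   # Step 1: form one bin
        representative = reducer(bin_values)       # Step 2: mean or median
        # Step 3: every value in the bin is replaced by the representative
        smoothed.extend([representative] * len(bin_values))
    return smoothed


prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
by_mean = smooth_by_bins(prices, 3, mean)
by_median = smooth_by_bins(prices, 3, median)
print(by_mean)
print(by_median)
```

Passing the reducer as an argument keeps the binning step separate from the choice made in Step 2, so swapping mean for median does not touch the rest of the code.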

Smoothing (noisy data)

Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Partition them into three bins by each of the following methods.

Equal-frequency partitioning

What is Smoothing by bin mean/median/boundary?







How do we define the first bin?

We need a bin that encloses 5, 10 and 11. (The intervals below split the 12 values into four bins of three values each.)

(4.5, 11.5]: This is also correct but let’s look at Pandas.

What Pandas has created is:

(4.999, 12.5]: The range starts just above 4.999 (4.999 itself is excluded) and ends at 12.5 (12.5 is included).

Is it wrong? No.

Next bin:

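(12.5, 42.5]: Is it wrapping the elements 13, 15 and 35? Yes.

The next bin would start at 42.5. Can we say this? Yes: each interval excludes its left edge and includes its right edge, so the next bin picks up just above 42.5.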
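The intervals quoted above can be reproduced with `pd.qcut`. The sketch below assumes the data was split into four quantile bins (`q=4`), since that is the setting that yields edges at 12.5 and 42.5 for these 12 values. The variable names are chosen for this example only.

```python
import pandas as pd

prices = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Assumption: four equal-frequency bins, i.e. three values per bin.
quartile_bins = pd.qcut(prices, q=4)

# The categories are pandas Interval objects, open on the left and
# closed on the right.
intervals = quartile_bins.cat.categories
counts = quartile_bins.value_counts(sort=False)

print(intervals)
print(counts)
```

The first interval starts slightly below the minimum value so that 5 itself falls inside a left-open bin.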

What is Smoothing by bin mean/median/boundary?

Each value in a bin is replaced by the bin's mean, its median, or the nearest bin boundary.

On smoothing by bin-boundary (bins follow equal-frequency partitioning):

Bin 1: 5, 13, 13, 13

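Here 5 and 13 are the bin boundaries and stay as they are, while 10 and 11 are closer to 13 than to 5, so both become 13.

Bin 2: 15, 15, 55, 55

In this bin, 35 is equally far from 15 and 55; the tie is resolved toward the lower boundary, 15.

Bin 3: 72, 72, 215, 215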
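The boundary results above can be checked with a short plain-Python sketch. The function name `smooth_by_boundaries` is hypothetical, the bins are assumed to be equal-frequency bins of four values each, and ties are sent to the lower boundary so that 35 maps to 15 as in the worked result.

```python
def smooth_by_boundaries(values, n_bins):
    """Replace each value by the nearest boundary of its equal-frequency bin."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        low, high = bin_values[0], bin_values[-1]  # the bin boundaries
        for value in bin_values:
            # "<=" sends ties (such as 35 between 15 and 55) to the lower boundary
            smoothed.append(low if value - low <= high - value else high)
    return smoothed


prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
boundary_smoothed = smooth_by_boundaries(prices, 3)
print(boundary_smoothed)
```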

Original:

Smoothing by equal-frequency binning using the mean of each bin

1. Create the bins. In code: pd.qcut()
2. Group the data according to the bins. In code: df.groupby()
3. Find the mean of each group. In code: df.groupby().mean()
4. Create a map of bin labels to mean values. In code, this is essentially a dictionary (simply key-value pairs) that looks like this:
{ '(4.999, 14.333]': 9.75, '(14.333, 60.667]': 38.75, '(60.667, 215.0]': 145.75 }
5. Populate a new column containing the mean of each bin for each data point.
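Put together, the five steps might look like the following pandas sketch. The column names `price`, `bin` and `price_smoothed` are assumptions made for this example, and the original notebook may have organized the code differently.

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]})

# Step 1: create three equal-frequency bins.
df["bin"] = pd.qcut(df["price"], q=3)

# Steps 2 and 3: group by bin and take the mean of each group.
# observed=True restricts the result to bins that actually contain data.
bin_means = df.groupby("bin", observed=True)["price"].mean()

# Step 4: a dictionary from bin label (as text) to the mean of that bin.
mean_by_label = {str(interval): value for interval, value in bin_means.items()}

# Step 5: a new column holding the bin mean for every data point.
df["price_smoothed"] = df["bin"].astype(str).map(mean_by_label)

print(mean_by_label)
print(df)
```

Converting the interval labels to strings is only done here to mirror the dictionary shown in step 4; `groupby(...).transform("mean")` would be a shorter route to the same column.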

Equal-width partitioning

The width of each interval is (215 - 5)/3 = 70.

Perform Smoothing by bin mean/median/boundary.

Bins using equal width partitioning.

Elements: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Domain for bin-1: 5 up to, but not including, 75 (= 5 + 70)

Domain for bin-2: 75 up to, but not including, 145 (= 75 + 70)

Domain for bin-3: 145 up to and including 215 (= 145 + 70), the largest value in the input data set
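A plain-Python sketch of this equal-width partitioning, followed by smoothing with the bin mean, is shown below. The helper name `bin_index` is made up for this example. The bins are treated as closed on the left and open on the right, with the maximum value 215 placed in the last bin.

```python
from statistics import mean

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3
low, high = min(prices), max(prices)
width = (high - low) / n_bins  # (215 - 5) / 3 = 70.0


def bin_index(value):
    """Index of the equal-width bin holding value (hypothetical helper)."""
    # min(...) keeps the maximum value (215) inside the last bin
    return min(int((value - low) // width), n_bins - 1)


# Partition the values into the three equal-width bins.
bins = [[] for _ in range(n_bins)]
for price in prices:
    bins[bin_index(price)].append(price)

# Smooth by bin mean: each value is replaced by the mean of its bin.
bin_means = [mean(b) for b in bins]
smoothed = [bin_means[bin_index(price)] for price in prices]

print(bins)
print(bin_means)
```

Unlike equal-frequency binning, the bins here are very uneven: nine of the twelve values land in the first bin, one in the second, and two in the third.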
