Showing posts with label Data Analytics. Show all posts

Wednesday, June 10, 2026

Quiz on "Modeling data distributions" (Unit 4, Jun 2026)

1:

Solution:

Code:

"""
Revenue:
Mean: 500
Stdev: 125

Fixed Monthly Costs: 225

Profit = Revenue - Fixed Monthly Costs
So MeanProfit = MeanRevenue - Fixed Monthly Costs = 500 - 225 = 275
StdevProfit = StdevRevenue = 125

"""

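The reasoning above (subtracting a constant shifts the mean but leaves the spread unchanged) can be checked numerically. This is a minimal sketch; the `revenues` list is a made-up sample, chosen only so that its mean is 500 and its population standard deviation is 125, and is not part of the quiz.

```python
from statistics import mean, pstdev

# Hypothetical revenue sample (not from the quiz): mean 500, population stdev 125
revenues = [375, 625, 375, 625]
fixed_costs = 225

# Profit = Revenue - Fixed Monthly Costs, applied to every observation
profits = [r - fixed_costs for r in revenues]

print("Revenue mean/stdev:", mean(revenues), pstdev(revenues))
print("Profit  mean/stdev:", mean(profits), pstdev(profits))
```

The profit mean drops by exactly 225, while both standard deviations stay the same.
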
2:

Code:

b_small = 1  # For x less than 2
h_small = 0.5

area_small = b_small * h_small / 2  # 0.25

b_full = 4
h_full = 0.5
area_full = b_full * h_full / 2  # 1.0

# Fraction of the total area that lies in the small region
print(area_small / area_full)  # 0.25 / 1.0 = 0.25

3:

Solution:

4:


5:

Solution:

Code:

"""
** BRAINSTORMING **

mean = 21.02
sd = 2

mean_1sd_less = mean - sd = 21.02 - 2 = 19.02
mean_1sd_more = mean + sd = 21.02 + 2 = 23.02

mean_2sd_less = mean - 2*sd = 21.02 - 4 = 17.02
mean_2sd_more = mean + 2*sd = 21.02 + 4 = 25.02


Empirical Rule: 
68% of the data is between 19.02 and 23.02 (within 1 sd of the mean)
95% of the data is between 17.02 and 25.02 (within 2 sds of the mean)
99.7% of the data is between 15.02 and 27.02 (within 3 sds of the mean)

*** BUT 25 != 25.02 ***

"""

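The empirical-rule percentages in the brainstorming notes can be compared with the exact normal-curve areas. This is a rough sketch that assumes the data is modeled as a normal distribution with the mean and sd above; the `shares` dictionary is only an illustrative name.

```python
from statistics import NormalDist

mean = 21.02
sd = 2

# Assumption: the data follows a normal distribution with this mean and sd
dist = NormalDist(mu=mean, sigma=sd)

shares = {}
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    # Share of the distribution that lies between low and high
    shares[k] = dist.cdf(high) - dist.cdf(low)
    print(f"within {k} sd ({low:.2f} to {high:.2f}): {shares[k]:.4f}")
```

The three shares come out close to the 68%, 95% and 99.7% quoted by the rule.
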
6:


7:

Solution:

Code:

mean = 66000
sd = 22000

# How do we determine the z-score with an area of 0.05 above it using Python?

from scipy.stats import norm

# Method 1: Use the percent point function (ppf)
# The ppf takes the cumulative probability to the LEFT.
# Since the area above is 0.05, the area below is 1 - 0.05 = 0.95.
z_score = norm.ppf(0.95)

# Method 2: Use the inverse survival function (isf)
# The isf directly takes the upper tail probability.
z_score_alt = norm.isf(0.05)

print(f"z-score (ppf): {z_score}")     # 1.6448536269514729
print(f"z-score (isf): {z_score_alt}") # 1.6448536269514729

print("--- Using standard \"statistics\" Package ---")

from statistics import NormalDist

# Standard normal distribution (mu=0, sigma=1)
z = NormalDist().inv_cdf(0.95)
print(z)  # 1.6448536269514722

x = mean + z * sd
print(f"Value corresponding to z-score: {x}")  # approximately 102186.78
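
As a cross-check, the value can also be found without computing a z-score first, by building the distribution with its own mean and sd. A small sketch using only the standard `statistics` module (the variable names `x_direct` and `x_from_z` are illustrative):

```python
from statistics import NormalDist

mean = 66000
sd = 22000

# Value with 5% of the distribution above it (95% below it)
x_direct = NormalDist(mu=mean, sigma=sd).inv_cdf(0.95)

# Same value via the standard normal z-score
z = NormalDist().inv_cdf(0.95)
x_from_z = mean + z * sd

print(round(x_direct, 2), round(x_from_z, 2))
```

Both routes should agree up to floating-point rounding.
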



8:

Code:

mean = 87
sd = 8

l = 104.6
h = 108.2

# How do we determine the z-scores corresponding to these values using Python's standard statistics package?

from statistics import NormalDist

# z-score: how many standard deviations l lies above the mean
z_l = (l - mean) / sd

print(f"z-score for {l}: {z_l:.4f}")
# How do we determine the percentage of data below l using Python's standard statistics package?

# We can use the cumulative distribution function (CDF) of the normal distribution.
# The CDF gives us the probability that a random variable from the distribution is less than or equal to a certain value.

# Evaluate the standard normal CDF (mu=0, sigma=1) at the z-score
cdf_l = NormalDist().cdf(z_l)
print(f"Percentage of data below {l}: {cdf_l * 100:.4f}%")


z_h = (h - mean) / sd
cdf_h = NormalDist().cdf(z_h)
print(f"z-score for {h}: {z_h:.4f}")
print(f"Percentage of data below {h}: {cdf_h * 100:.4f}%")

answer = cdf_h - cdf_l
print(f"Percentage of data between {l} and {h}: {answer * 100:.4f}%")
print(f"Proportion of data between {l} and {h}: {answer:.4f}")

output = """
z-score for 104.6: 2.2000
Percentage of data below 104.6: 98.6097%
z-score for 108.2: 2.6500
Percentage of data below 108.2: 99.5975%
Percentage of data between 104.6 and 108.2: 0.9879%
Proportion of data between 104.6 and 108.2: 0.0099
"""
print()
print("--- CORRECT OUTPUT ---")
print(output)
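
The same proportion can be obtained without converting to z-scores, by giving `NormalDist` the mean and sd directly. A minimal sketch (the name `proportion` is illustrative):

```python
from statistics import NormalDist

# Normal distribution with the mean and sd from the question
dist = NormalDist(mu=87, sigma=8)

l = 104.6
h = 108.2

# Area under the curve between l and h
proportion = dist.cdf(h) - dist.cdf(l)
print(f"Proportion of data between {l} and {h}: {proportion:.4f}")
```
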


9:

10:

Tags: Mathematical Foundations for Data Science,Data Analytics,

Tuesday, June 9, 2026

Quiz on "Summarizing quantitative data" (Unit 3, Jun 2026)

1:

>>> import numpy as np
>>> l1 = [12.5, 11.5, 11.0, 24.0, 13.0]
>>> mean1 = np.mean(l1)
>>> median1 = np.median(l1)
>>> 
>>> l2 = [12.5, 11.5, 11.0, 13.0]
>>> mean2 = np.mean(l2)
>>> median2 = np.median(l2)
>>> 
>>> mean2 - mean1
-2.4000000000000004
>>> median2 - median1
-0.5
>>> 
>>> print(mean1, median1, mean2, median2)
14.4 12.5 12.0 12.0

2:

import numpy as np

l = [4, 5, 7, 7, 7, 8, 10, 11, 11, 13, 13, 14]

q1 = np.percentile(l, 25)
q3 = np.percentile(l, 75)

print("q1, q3:", q1, q3) # 7.0, 11.5

# NumPy >= 1.22.0 (using the 'method' parameter)
q3_nearest = np.percentile(l, 75, method='nearest')  # Returns 11
q3_lower   = np.percentile(l, 75, method='lower')    # Returns 11
q3_higher  = np.percentile(l, 75, method='higher')   # Returns 13

print("q3_nearest, q3_lower, q3_higher:")
print(q3_nearest, q3_lower, q3_higher)


# # NumPy < 1.22.0 (using the older 'interpolation' parameter)
# q3_nearest = np.percentile(l, 75, interpolation='nearest')  # Returns 11

print("---  CALCULATION AS PER KHAN ACADEMY ---")

median = np.median(l)
print("Median:", median) # 9.0

lower_half = [x for x in l if x <= median]
upper_half = [x for x in l if x >= median]

q1 = np.median(lower_half)
q3 = np.median(upper_half)
print("Q1:", q1) # 7.0
print("Q3:", q3) # 12.0
Output
q1, q3: 7.0 11.5
q3_nearest, q3_lower, q3_higher:
11 11 13
---  CALCULATION AS PER KHAN ACADEMY ---
Median: 9.0
Q1: 7.0
Q3: 12.0
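
The list-comprehension split above works here because the median (9.0) is not itself a data value. When the median does occur in the data, comparing against it puts those values in both halves. A sketch of an index-based split follows, assuming the convention in which the median is left out of both halves when the count is odd; `quartiles_by_halves` is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower_half = data[:half]    # first half (middle value excluded if n is odd)
    upper_half = data[-half:]   # last half (middle value excluded if n is odd)
    return float(np.median(lower_half)), float(np.median(upper_half))

l = [4, 5, 7, 7, 7, 8, 10, 11, 11, 13, 13, 14]
q1, q3 = quartiles_by_halves(l)
print("Q1, Q3, IQR:", q1, q3, q3 - q1)
```
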

3:

4:

5:

import numpy as np 

l = [35, 39, 39, 43, 43, 44]

print(np.mean(l)) 

6:

import numpy as np

l = [1, 2, 3, 3, 4, 4, 4, 6]

q1 = np.percentile(l, 25)
q3 = np.percentile(l, 75)

print(q1, q3)

iqr = q3 - q1
print(iqr)


print("--- CALCULATION AS PER KHAN ACADEMY ---")

m = np.median(l)

lower = [x for x in l if x <= m]
upper = [x for x in l if x >= m]

print(lower, upper)
q1 = np.median(lower)
q3 = np.median(upper)
print(q1, q3)

iqr = q3 - q1
print(iqr)

7:

8:

q1 = 2
q3 = 5
iqr = q3 - q1
print(iqr)

l = [1] + [2] * 7 + [3] * 5 + [5] * 3 + [6] * 2 + [7, 9]

print(l)

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(lower_bound, upper_bound)

outliers = [x for x in l if x < lower_bound or x > upper_bound]
print(outliers)
print(len(outliers))
Tags: Data Analytics,Mathematical Foundations for Data Science,

Thursday, October 12, 2023

Similarity, dissimilarity and distance (Part of Data Analytics Course)

Measuring Similarity/Dissimilarity

Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

Numerical (interval-scaled) variables

A = (2, 4)

B = (1, 3)

Euclidean distance = math.sqrt(pow(2-1, 2) + pow(4-3, 2))

A = (2, 7, 5)

B = (4, 8, 9)

ed = math.sqrt(pow(2-4, 2) + pow(7-8, 2) + pow(5-9, 2))
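
The two calculations above can be wrapped in a small function that works for any number of dimensions. A minimal sketch; `euclidean` is just an illustrative name.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_2d = euclidean((2, 4), (1, 3))        # square root of 2
d_3d = euclidean((2, 7, 5), (4, 8, 9))  # square root of 21

print(round(d_2d, 4), round(d_3d, 4))
```
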

Euclidean distance is the direct, straight-line distance between two points.

Whereas:

Manhattan distance is measured along the axes: it is the sum of the absolute differences along the x and y axes.

[Figure: Euclidean distance vs. Manhattan distance]

Manhattan Distance

Manhattan distance is inspired by a traveler going from one point to another in the city of Manhattan.

Euclidean distance: the straight-line distance between two points.

Manhattan distance: the city-block distance, that is, the distance you would cover going from place to place by car.

Minkowski Distance

Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):

Compute the Minkowski distance between the two objects, using p = 3.

In plain English: the pth root of the sum of the pth powers of the absolute values of the differences.

For Manhattan distance, p = 1 in the Minkowski distance.

For Euclidean distance, p = 2 in the Minkowski distance.

Manhattan and Euclidean distances are special cases of Minkowski.
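
The plain-English definition translates almost word for word into code. Below is a minimal sketch for the two objects given above; `minkowski` is an illustrative helper name, and p = 1 and p = 2 are included to show the Manhattan and Euclidean special cases.

```python
def minkowski(a, b, p):
    """pth root of the sum of the pth powers of the absolute differences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

obj1 = (22, 1, 42, 10)
obj2 = (20, 0, 36, 8)

# Absolute differences are 2, 1, 6 and 2
for p in (1, 2, 3):
    print(f"p = {p}: {minkowski(obj1, obj2, p):.4f}")
```

For p = 3 the sum of cubes is 8 + 1 + 216 + 8 = 233, so the distance is the cube root of 233.
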

Similarity/Dissimilarity

Briefly outline how to compute the dissimilarity between objects described by:

  • Asymmetric binary variables

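If both states of a binary attribute carry the same weight, the attribute is symmetric. Let’s say we have the contingency table for objects i and j:

           j = 1    j = 0
i = 1        q        r
i = 0        s        t

Here q is the number of attributes that equal 1 for both objects (cell i=1, j=1), r is the number that equal 1 for i but 0 for j, s is the number that equal 0 for i but 1 for j, and t is the number that equal 0 for both.

If the binary attributes are asymmetric, the Jaccard coefficient is often used: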

J = q / (q + r + s)

J = # elements in intersection / # elements in the union

Jaccard similarity (aka coefficient or score) ranges from 0 to 1.

Jaccard dissimilarity (aka distance) = 1 – Jaccard Similarity

The reason ‘t’ is not counted in the Jaccard coefficient is easiest to see with a shopping-cart example: it would not make sense to count the items that are missing from both cart I and cart J. Counting the items that are missing from both carts would inflate the similarity.

If the attributes are given as binary and asymmetric, t is not counted. If they are symmetric, the similarity may include ‘t’ in both the numerator and the denominator (this form is known as the simple matching coefficient):

J = (q + t) / (q + r + s + t)
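
Both formulas can be sketched in a few lines. The two carts below are made-up data, and the helper names are illustrative; q, r, s and t are counted as: both 1, only the first object 1, only the second object 1, and both 0.

```python
def binary_counts(i, j):
    """Count the four agreement/disagreement cells for two binary vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # present in both
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # only in i
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # only in j
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # absent from both
    return q, r, s, t

def jaccard_similarity(i, j):
    """Asymmetric case: ignores t. Undefined if both vectors are all zeros."""
    q, r, s, _ = binary_counts(i, j)
    return q / (q + r + s)

def simple_matching(i, j):
    """Symmetric case: counts t as agreement."""
    q, r, s, t = binary_counts(i, j)
    return (q + t) / (q + r + s + t)

# Hypothetical carts over six products (1 = item is in the cart)
cart_i = [1, 1, 0, 0, 0, 1]
cart_j = [1, 0, 0, 0, 1, 1]

print("Jaccard similarity:", jaccard_similarity(cart_i, cart_j))
print("Jaccard distance:", 1 - jaccard_similarity(cart_i, cart_j))
print("Simple matching:", round(simple_matching(cart_i, cart_j), 4))
```

The two items absent from both carts raise the simple matching score above the Jaccard score, which is the inflation described above.
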

Similarity/Dissimilarity

Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

  • Categorical variables

A categorical variable is a generalization of the binary variable in that it can take on more than two states.

The dissimilarity between two objects i and j can be computed as:

d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
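
A short sketch of this mismatch ratio, assuming the usual form d(i, j) = (p - m) / p. The two objects and the function name are hypothetical.

```python
def categorical_dissimilarity(obj_i, obj_j):
    """Fraction of variables on which the two objects are in different states."""
    p = len(obj_i)                                      # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

# Hypothetical objects described by three categorical variables
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")

d = categorical_dissimilarity(obj_i, obj_j)
print(round(d, 4))  # two of the three variables match
```
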

Similarity/Dissimilarity

Between text data (cosine similarity):

Cosine similarity is equal to the dot product of the two feature vectors divided by the product of their magnitudes. Geometrically, it is equal to the cosine of the angle between the two vectors.

V    D1   D2   D3   D4   D5   D6   D7   D8   D9   D10
A    1    1    0    1    1    0    0    1    1    0
B    1    0    1    0    1    1    1    0    0    0
C    0    1    0    0    1    1    0    1    0    1
D    1    0    0    0    0    1    0    1    0    0
E    1    0    1    0    1    1    0    0    0    0

Cosine similarity is commonly seen in text data.

Which two out of these vectors are closest?

And, which two out of these vectors are farthest?
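
One way to answer both questions is to compute the cosine similarity for every pair of vectors and pick the largest and smallest values. This is a minimal sketch in plain Python; here "closest" is read as the highest cosine similarity and "farthest" as the lowest, and the helper name is illustrative.

```python
import math
from itertools import combinations

# Vectors A to E over documents D1 to D10
vectors = {
    "A": [1, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 1, 1, 1, 0, 0, 0],
    "C": [0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
    "D": [1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    "E": [1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
}

def cosine_similarity(u, v):
    """Dot product divided by the product of the magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Similarity for every unordered pair of vectors
scores = {
    (m, n): cosine_similarity(vectors[m], vectors[n])
    for m, n in combinations(vectors, 2)
}

closest = max(scores, key=scores.get)
farthest = min(scores, key=scores.get)
print("closest:", closest, round(scores[closest], 4))
print("farthest:", farthest, round(scores[farthest], 4))
```
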

A      1   1   0   1   1   0   0   1   1   0
B      1   0   1   0   1   1   1   0   0   0
A.B    1   0   0   0   1   0   0   0   0   0

The element-wise products contain two ones, so the dot product is:

A.B = 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 0

= 2

|A| = math.sqrt(1**2 + 1**2 + 0 + 1**2 + 1**2 + 0 + 0 + 1**2 + 1**2 + 0)

= 2.449489742783178

|B| = math.sqrt(1**2 + 0 + 1**2 + 0 + 1**2 + 1**2 + 1**2 + 0 + 0 + 0)

= 2.23606797749979

Cos similarity (A, B) --> cos(theta) = A.B / (|A| * |B|)

Cosine similarity = 2 / (2.449489742783178 * 2.23606797749979)

= 0.3651 (approximately)

Cosine Similarity

A = (4, 4)

B = (4, 0)

Cosine similarity between A and B is: cos(theta)

Note: It is very easy to visualize it in 2D.

Formula wise: Cosine similarity = A.B / (|A| * |B|)

This formula comes from rearranging the dot product formula:

A.B = |A| * |B| * cos(theta)
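
For the 2D example A = (4, 4) and B = (4, 0), the rearranged formula can be evaluated directly, and the angle recovered from the result. A minimal sketch using only the `math` module:

```python
import math

A = (4, 4)
B = (4, 0)

dot = A[0] * B[0] + A[1] * B[1]  # 4*4 + 4*0 = 16
norm_a = math.hypot(*A)          # length of A
norm_b = math.hypot(*B)          # length of B

# cos(theta) = A.B / (|A| * |B|)
cos_theta = dot / (norm_a * norm_b)
theta_degrees = math.degrees(math.acos(cos_theta))

print(round(cos_theta, 4), round(theta_degrees, 1))
```

A lies on the diagonal and B on the x-axis, so the angle between them is 45 degrees.
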


Cosine Similarity

Find cosine similarity between the following two sentences:

IIT Delhi course on data analytics IIT

IIT Delhi course on data analytics

Term         tf for d1   tf for d2
IIT              2           1
Delhi            1           1
Course           1           1
On               1           1
Data             1           1
Analytics        1           1

d1.d2 = 2*1 + 1*1 + 1*1 + 1*1 + 1*1 + 1*1 = 7

|d1| = sqrt(2^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(9)

|d2| = sqrt(1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(6)

Cosine similarity = d1.d2 / (|d1| * |d2|) = 7 / sqrt(9*6) = 0.95

Cosine Similarity

Sent 1: I bought a new car today.

Sent 2: I already have a new car.

Step 1: Create vectors.

Word       Vector 1   Vector 2   Product of components
I              1          1          1
bought         1          0          0
a              1          1          1
new            1          1          1
car            1          1          1
today          1          0          0
already        0          1          0
have           0          1          0

V1 = 1,1,1,1,1,1,0,0

V2 = 1,0,1,1,1,0,1,1

v1.v2 = 1 + 0 + 1 + 1 + 1 + 0 + 0 + 0 = 4

|v1| = math.sqrt(1**2 + 1**2 + 1**2 + 1**2 + 1**2 + 1**2 + 0 + 0) = 2.45

|v2| = math.sqrt(1**2 + 0 + 1**2 + 1**2 + 1**2 + 0 + 1**2 + 1**2) = 2.45

Cosine sim = 4 / (2.45 * 2.45)

= 0.67
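
The hand calculation can be reproduced with a small bag-of-words sketch in plain Python. The tokenization rule (lowercase, then runs of letters, digits and apostrophes) is an assumption made for this example, and `sentence_cosine` is an illustrative name.

```python
import math
import re
from collections import Counter

def sentence_cosine(s1, s2):
    """Cosine similarity between the word-count vectors of two sentences."""
    # Assumed tokenization: lowercase, then runs of letters, digits, apostrophes
    c1 = Counter(re.findall(r"[a-z0-9']+", s1.lower()))
    c2 = Counter(re.findall(r"[a-z0-9']+", s2.lower()))
    # Only words present in both sentences contribute to the dot product
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

sim_cars = sentence_cosine("I bought a new car today.", "I already have a new car.")
sim_iit = sentence_cosine(
    "IIT Delhi course on data analytics IIT",
    "IIT Delhi course on data analytics",
)
print(round(sim_cars, 2), round(sim_iit, 2))
```
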

Cosine Similarity

Cosine Similarity: It is a bag-of-words based model, i.e., the order of the words does not matter.

Sent1: I bought a new car today.

Sent2: I haven’t bought a new car today.

Step 1: Creating vectors

Cosine similarity is based on whether the same words have been used in the two sentences.

“How similar are two sentences in terms of the words (irrespective of their meaning) used in them?”

Cosine Similarity

Find cosine similarity between the following two sentences:

Brand new course on data analytics

Test-1 is scheduled later this month

Cosine similarity = 0.0

Steps To Follow In Code To Find The Cosine Similarity Between Two Sentences

1. Load the libraries like Scikit Learn.

Projects might include other libraries like Pandas, NumPy, NLTK (Natural Language Toolkit).

2. Load the data (i.e. your two sentences)

3. Convert them into vector form using CountVectorizer.

Note: There are other ways to convert a sentence into a vector, but we want to keep it simple for the first class.

4. Next, you can use this method: sklearn.metrics.pairwise.cosine_similarity() to find the cosine similarity.

Note: These are high-level steps. You can compute many other similarity/dissimilarity measures by following the same steps with minor modifications, such as swapping Scikit Learn for SciPy.
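
The steps above can be sketched as follows, assuming scikit-learn is installed. The two sentences are the car example from earlier. One choice here is an assumption rather than part of the steps: `CountVectorizer` by default ignores one-letter words such as "I" and "a", so a custom `token_pattern` is passed to keep them and match the hand calculation.

```python
# Step 1: load the libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 2: load the data (the two sentences)
sentences = [
    "I bought a new car today.",
    "I already have a new car.",
]

# Step 3: convert the sentences into count vectors.
# The default token pattern drops one-letter words; this one keeps them.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(sentences)

# Step 4: pairwise cosine similarity; entry [0, 1] compares the two sentences
similarity = cosine_similarity(counts)[0, 1]
print(round(float(similarity), 2))
```

With the default `token_pattern` the vocabulary would be smaller and the score would differ from the hand-computed value.
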

Thank You!

Tags: Technology,Data Analytics,