Wednesday, October 11, 2023

Similarity, dissimilarity and distance (Part of Data Analytics Course)

Measuring Similarity/Dissimilarity

Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

Numerical (interval-scaled) variables

A = (2, 4)

B = (1, 3)

Euclidean distance = math.sqrt(pow(2-1, 2) + pow(4-3, 2)) = math.sqrt(2) ≈ 1.414

A = (2, 7, 5)

B = (4, 8, 9)

ed = math.sqrt(pow(2-4, 2) + pow(7-8, 2) + pow(5-9, 2)) = math.sqrt(21) ≈ 4.583

Euclidean distance is the direct, straight-line distance between two points, whereas Manhattan distance is measured along the x and y axes: you move horizontally and vertically, never diagonally.

[Figure: Euclidean (straight-line) path vs. Manhattan (grid) path between two points]

Manhattan Distance

Manhattan distance is inspired by a traveler going from one point to another through the grid-like streets of the city of Manhattan.

Euclidean distance: the straight-line distance between two points.

Manhattan distance: the city-block distance, i.e., the distance you would cover going from place to place in a car along a street grid.
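A minimal sketch of both measures in plain Python (the helper names euclidean and manhattan are my own), applied to the points from the example above:

import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

A = (2, 7, 5)
B = (4, 8, 9)
print(euclidean(A, B))  # math.sqrt(21) ~ 4.583
print(manhattan(A, B))  # 2 + 1 + 4 = 7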

Minkowski Distance

Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):

Compute the Minkowski distance between the two objects, using p = 3.

In plain English: the Minkowski distance of order p is the p-th root of the sum of the p-th powers of the absolute differences.

For Manhattan distance, set p = 1 in the Minkowski distance.

For Euclidean distance, set p = 2 in the Minkowski distance.

Manhattan and Euclidean distances are special cases of Minkowski.
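A small sketch for the exercise above in plain Python (the helper name minkowski is my own):

def minkowski(a, b, p):
    # p-th root of the sum of p-th powers of absolute differences.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

X = (22, 1, 42, 10)
Y = (20, 0, 36, 8)
print(minkowski(X, Y, 3))  # (8 + 1 + 216 + 8) ** (1/3) = 233 ** (1/3) ~ 6.153
print(minkowski(X, Y, 1))  # Manhattan: 2 + 1 + 6 + 2 = 11
print(minkowski(X, Y, 2))  # Euclidean: sqrt(45) ~ 6.708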

Similarity/Dissimilarity

Briefly outline how to compute the dissimilarity between objects described by:

  • Asymmetric binary variables

If all binary attributes have the same weight, they are symmetric. For two objects i and j, suppose we have the 2x2 contingency table:

         j = 1   j = 0
i = 1    q       r
i = 0    s       t

Here the cell (i = 1, j = 1) holds q, the number of attributes equal to 1 for both objects; r counts attributes that are 1 for i but 0 for j; s counts the reverse; and t counts attributes that are 0 for both.

If the binary attributes are asymmetric, the Jaccard coefficient is often used:

J = q / (q + r + s)

J = (# elements in the intersection) / (# elements in the union)

Jaccard similarity (aka coefficient or score) ranges from 0 to 1.

Jaccard dissimilarity (aka distance) = 1 – Jaccard Similarity

The reason t is not considered in the Jaccard coefficient: take the shopping-cart example. It would not make sense to count the items that are missing from both cart i and cart j; counting the items absent from both carts would artificially inflate the similarity.

If the attributes are given as binary and asymmetric, t is not counted. If they are symmetric, the formula includes t in both the numerator and the denominator (this is known as the simple matching coefficient):

J = (q + t) / (q + r + s + t)
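A minimal sketch in plain Python, assuming each object is a 0/1 vector (the function name and example vectors are mine):

def jaccard_similarity(i, j):
    # q: attributes equal to 1 in both objects.
    # r + s: attributes equal to 1 in exactly one of the two objects.
    # t (0 in both) is deliberately ignored.
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r_plus_s = sum(1 for a, b in zip(i, j) if a != b)
    return q / (q + r_plus_s)

cart_i = (1, 0, 1, 1, 0)
cart_j = (1, 1, 1, 0, 0)
print(jaccard_similarity(cart_i, cart_j))      # q = 2, r + s = 2 -> 0.5
print(1 - jaccard_similarity(cart_i, cart_j))  # Jaccard dissimilarity = 0.5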

Similarity/Dissimilarity

Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

  • Categorical variables

A categorical variable is a generalization of the binary variable in that it can take on more than two states.

The dissimilarity between two objects i and j can be computed as:

d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
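A short sketch in plain Python (the function name and example objects are made up for illustration):

def categorical_dissimilarity(i, j):
    # d(i, j) = (p - m) / p, where m is the number of matching states
    # and p is the total number of variables.
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

obj_i = ("red", "circle", "small")
obj_j = ("red", "square", "small")
print(categorical_dissimilarity(obj_i, obj_j))  # (3 - 2) / 3 ~ 0.33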

Similarity/Dissimilarity

Between text-type data:

Cosine similarity is equal to the dot product of the corresponding feature vectors divided by the product of their magnitudes. Geometrically, it is equal to the cosine of the angle between the two vectors.

V    D1   D2   D3   D4   D5   D6   D7   D8   D9   D10
A    1    1    0    1    1    0    0    1    1    0
B    1    0    1    0    1    1    1    0    0    0
C    0    1    0    0    1    1    0    1    0    1
D    1    0    0    0    0    1    0    1    0    0
E    1    0    1    0    1    1    0    0    0    0

Cosine similarity is commonly seen in text data.

Which two of these vectors are closest? And which two are farthest? (A code sketch to check all pairs follows; the manual computation for the pair A, B comes after it.)
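A sketch in plain Python that scores every pair from the table above (the variable names are my own):

import math

vectors = {
    "A": (1, 1, 0, 1, 1, 0, 0, 1, 1, 0),
    "B": (1, 0, 1, 0, 1, 1, 1, 0, 0, 0),
    "C": (0, 1, 0, 0, 1, 1, 0, 1, 0, 1),
    "D": (1, 0, 0, 0, 0, 1, 0, 1, 0, 0),
    "E": (1, 0, 1, 0, 1, 1, 0, 0, 0, 0),
}

def cosine(u, v):
    # Dot product divided by the product of the magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

names = sorted(vectors)
for x in range(len(names)):
    for y in range(x + 1, len(names)):
        u, v = names[x], names[y]
        # Highest score = closest pair; lowest score = farthest pair.
        print(u, v, round(cosine(vectors[u], vectors[v]), 3))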

A      1   1   0   1   1   0   0   1   1   0
B      1   0   1   0   1   1   1   0   0   0
A.B    1   0   0   0   1   0   0   0   0   0

There are two ones in the product row, so:

sum(A.B) = 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 0 = 2


|A| = math.sqrt(1**2 + 1**2 + 0 + 1**2 + 1**2 + 0 + 0 + 1**2 + 1**2 + 0) = math.sqrt(6)

= 2.449489742783178

|B| = math.sqrt(1**2 + 0 + 1**2 + 0 + 1**2 + 1**2 + 1**2 + 0 + 0 + 0) = math.sqrt(5)

= 2.23606797749979

Cosine similarity(A, B) = cos(theta) = A.B / (|A| * |B|)

Cosine similarity = 2 / (2.449 * 2.236)

≈ 0.365
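The same computation as a compact, runnable check in plain Python:

import math

A = (1, 1, 0, 1, 1, 0, 0, 1, 1, 0)
B = (1, 0, 1, 0, 1, 1, 1, 0, 0, 0)

dot = sum(a * b for a, b in zip(A, B))     # 2
norm_a = math.sqrt(sum(a * a for a in A))  # sqrt(6) ~ 2.449
norm_b = math.sqrt(sum(b * b for b in B))  # sqrt(5) ~ 2.236
print(dot / (norm_a * norm_b))             # ~ 0.365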


Cosine Similarity

A = (4, 4)

B = (4, 0)

Cosine similarity between A and B is: cos(theta)

Note: It is very easy to visualize it in 2D.

Formula-wise: Cosine similarity = A.B / (|A| * |B|)

This formula comes from rearranging the dot-product formula:

A.B = |A| * |B| * cos(theta)

[Figure: vectors A = (4, 4) and B = (4, 0) drawn from the origin (0, 0)]
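Working the numbers, as a quick check in plain Python (math.hypot gives the vector magnitude):

import math

A, B = (4, 4), (4, 0)
dot = A[0] * B[0] + A[1] * B[1]                      # 4*4 + 4*0 = 16
cos_theta = dot / (math.hypot(*A) * math.hypot(*B))  # 16 / (sqrt(32) * 4)
print(cos_theta)                                     # ~0.7071, i.e., theta = 45 degrees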

Cosine Similarity

Find cosine similarity between the following two sentences:

IIT Delhi course on data analytics IIT

IIT Delhi course on data analytics

Term frequencies:

Term        tf in d1   tf in d2
IIT         2          1
Delhi       1          1
course      1          1
on          1          1
data        1          1
analytics   1          1

d1.d2 = 2*1 + 1*1 + 1*1 + 1*1 + 1*1 + 1*1 = 7

|d1| = sqrt(2^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(9) = 3

|d2| = sqrt(1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(6)

Cosine similarity = d1.d2 / (|d1| * |d2|) = 7 / (3 * sqrt(6)) ≈ 0.95

Cosine Similarity

Sent 1: I bought a new car today.

Sent 2: I already have a new car.

Step 1: Create vectors.

Word      Vector 1   Vector 2   Product of components
I         1          1          1
bought    1          0          0
a         1          1          1
new       1          1          1
car       1          1          1
today     1          0          0
already   0          1          0
have      0          1          0

Cosine Similarity

V1 = 1,1,1,1,1,1,0,0

V2 = 1,0,1,1,1,0,1,1

v1.v2 = 1 + 0 + 1 + 1 + 1 + 0 + 0 + 0 = 4

|v1| = math.sqrt(1**2 + 1**2 + 1**2 + 1**2 + 1**2 + 1**2 + 0 + 0) = math.sqrt(6) ≈ 2.449

|v2| = math.sqrt(1**2 + 0 + 1**2 + 1**2 + 1**2 + 0 + 1**2 + 1**2) = math.sqrt(6) ≈ 2.449

Cosine sim = 4 / (2.449 * 2.449) = 4/6 ≈ 0.67

Cosine Similarity

Cosine similarity is a bag-of-words measure, i.e., the order of words does not matter.

Sent1: I bought a new car today.

Sent2: I haven’t bought a new car today.

Step 1: Creating vectors

Cosine similarity is based only on whether the same words have been used in the two sentences. Here the two sentences differ by a single word ("haven't"), so they score as highly similar even though their meanings are opposite.

"How similar are two sentences in terms of the words (irrespective of their meaning) used in them?"

Cosine Similarity

Find cosine similarity between the following two sentences:

Brand new course on data analytics

Test-1 is scheduled later this month

Cosine similarity = 0.0 (the two sentences share no words, so the dot product, and hence the similarity, is zero).

In Code, the Following Steps Would Be Used to Find the Cosine Similarity Between Two Sentences

1. Load the libraries, e.g., Scikit-learn.

Projects might also include other libraries like Pandas, NumPy, and NLTK (Natural Language Toolkit).

2. Load the data (i.e., your two sentences).

3. Convert them into vector form using CountVectorizer.

Note: there are other ways to convert a sentence into a vector as well, but we want to keep it simple for the first class.

4. Next, use sklearn.metrics.pairwise.cosine_similarity() to find the cosine similarity.

Note: These are high-level steps. You can compute many other similarity/dissimilarity measures by following these steps with minor modifications, such as swapping Scikit-learn for SciPy. A minimal sketch of these steps follows.
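A minimal sketch, assuming scikit-learn is installed, reusing the two IIT Delhi sentences from earlier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 2: the data -- the two sentences.
sentences = [
    "IIT Delhi course on data analytics IIT",
    "IIT Delhi course on data analytics",
]

# Step 3: convert the sentences into term-frequency vectors.
vectors = CountVectorizer().fit_transform(sentences)

# Step 4: cosine similarity between the two vectors.
print(cosine_similarity(vectors[0], vectors[1]))  # [[~0.95]]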

Thank You!

Tags: Technology, Data Analytics
