Sunday, April 24, 2022

Similarity and Dissimilarity in Data Mining


How to compute the dissimilarity between objects described by the following types of variables:

Numerical (interval-scaled) variables

Use Euclidean distance or Manhattan distance.

For two objects i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn):

Euclidean distance: d(i, j) = ((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)^(1/2)

Manhattan distance: d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|

Minkowski distance (a generalization of both): d(i, j) = (|x_i1 - x_j1|^p + |x_i2 - x_j2|^p + ... + |x_in - x_jn|^p)^(1/p)

Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8), compute the Minkowski distance between them using p = 3:

d = (|22 - 20|^3 + |1 - 0|^3 + |42 - 36|^3 + |10 - 8|^3)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15

For the Manhattan distance, set p = 1 in the Minkowski distance; for the Euclidean distance, set p = 2.
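As a sketch, the worked example above can be checked with a few lines of Python (the function name `minkowski` is mine, not from the post):

```python
# Minkowski distance between two numeric tuples; p = 1 gives the
# Manhattan distance and p = 2 the Euclidean distance.
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x = (22, 1, 42, 10)
y = (20, 0, 36, 8)
print(round(minkowski(x, y, 3), 2))  # 6.15, matching the worked example
print(minkowski(x, y, 1))            # Manhattan: 2 + 1 + 6 + 2 = 11.0
```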

Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

Asymmetric binary variables

A binary attribute is symmetric if both of its states are equally important, and asymmetric if one state (conventionally coded 1, e.g. a positive medical test result) carries more information than the other. For two objects i and j, build the 2x2 contingency table:

             object j
               1    0
object i  1    q    r
          0    s    t

where q is the number of attributes equal to 1 for both objects, r the number equal to 1 for i but 0 for j, s the number equal to 0 for i but 1 for j, and t the number equal to 0 for both. For asymmetric binary attributes, the 0-0 matches (cell t) are considered unimportant and are ignored; the dissimilarity based on the Jaccard coefficient is then:

d(i, j) = (r + s) / (q + r + s)

In terms of set operations, the Jaccard coefficient for sets A and B is: J = |A intersection B| / |A union B|
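A minimal Python sketch of the Jaccard-based dissimilarity for asymmetric binary vectors (the function name and the example objects `jack` and `mary` are invented for illustration):

```python
# Jaccard dissimilarity for asymmetric binary vectors: 0-0 matches (t)
# carry no information and are left out of the denominator.
def jaccard_dissimilarity(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1-1 matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1-0 mismatches
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0-1 mismatches
    return (r + s) / (q + r + s)

jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
print(jaccard_dissimilarity(jack, mary))  # q=2, r=0, s=1 -> 1/3
```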

Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

Categorical variables

A categorical (nominal) variable is a generalization of the binary variable in that it can take on more than two states. The dissimilarity between two objects i and j can be computed with the simple matching approach:

d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
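A short Python sketch of this simple matching dissimilarity (the example objects are made up for illustration):

```python
# Dissimilarity for categorical variables: d(i, j) = (p - m) / p,
# where p is the number of variables and m the number of matching states.
def categorical_dissimilarity(i, j):
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
print(categorical_dissimilarity(obj_i, obj_j))  # m=2, p=3 -> 1/3
```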

For Text

Use cosine similarity: represent each document as a term-frequency vector, and measure the similarity between two documents x and y as the cosine of the angle between their vectors, cos(x, y) = (x . y) / (||x|| ||y||).
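A minimal pure-Python sketch of cosine similarity on term-frequency vectors (the vectors are invented for illustration):

```python
import math

# Cosine similarity: dot product of the two vectors divided by the
# product of their Euclidean norms; 1 means identical direction,
# 0 means the vectors share no terms.
def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc1 = (3, 2, 0, 5)  # term counts for document 1
doc2 = (1, 0, 0, 0)  # term counts for document 2
print(round(cosine_similarity(doc1, doc2), 2))  # 0.49
```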

Tags: Technology, Machine Learning, Data Visualization
