Sunday, April 24, 2022

Similarity and Dissimilarity in Data Mining


How to compute the dissimilarity between objects described by the following types of variables:

Numerical (interval-scaled) variables

Use Euclidean distance or Manhattan distance.

For two objects i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn):

Euclidean distance: d(i, j) = ((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)^(1/2)

Manhattan distance: d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|

Minkowski distance (a generalization of both): d(i, j) = (|x_i1 - x_j1|^p + |x_i2 - x_j2|^p + ... + |x_in - x_jn|^p)^(1/p)

Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8), compute the Minkowski distance between them using p = 3:

d = (|22 - 20|^3 + |1 - 0|^3 + |42 - 36|^3 + |10 - 8|^3)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15

For the Manhattan distance, set p = 1 in the Minkowski distance; for the Euclidean distance, set p = 2.
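As a sketch, the worked example above can be checked with a few lines of Python (the function name `minkowski` is mine, not from the post):

```python
# Minkowski distance between two numeric tuples; p = 1 gives the
# Manhattan distance and p = 2 the Euclidean distance.
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x = (22, 1, 42, 10)
y = (20, 0, 36, 8)
print(round(minkowski(x, y, 3), 2))  # 6.15, matching the worked example
print(minkowski(x, y, 1))            # Manhattan: 2 + 1 + 6 + 2 = 11.0
```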

Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

Asymmetric binary variables

A binary attribute is symmetric if both of its states are equally important, and asymmetric if one state (conventionally coded 1, e.g. a positive medical test result) carries more information than the other. For two objects i and j, build the 2x2 contingency table:

             object j
               1    0
object i  1    q    r
          0    s    t

where q is the number of attributes equal to 1 for both objects, r the number equal to 1 for i but 0 for j, s the number equal to 0 for i but 1 for j, and t the number equal to 0 for both. For asymmetric binary attributes, the 0-0 matches (cell t) are considered unimportant and are ignored; the dissimilarity based on the Jaccard coefficient is then:

d(i, j) = (r + s) / (q + r + s)

In terms of set operations, the Jaccard coefficient for sets A and B is: J = |A intersection B| / |A union B|
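A minimal Python sketch of the Jaccard-based dissimilarity for asymmetric binary vectors (the function name and the example objects `jack` and `mary` are invented for illustration):

```python
# Jaccard dissimilarity for asymmetric binary vectors: 0-0 matches (t)
# carry no information and are left out of the denominator.
def jaccard_dissimilarity(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1-1 matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1-0 mismatches
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0-1 mismatches
    return (r + s) / (q + r + s)

jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
print(jaccard_dissimilarity(jack, mary))  # q=2, r=0, s=1 -> 1/3
```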

Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

Categorical variables

A categorical (nominal) variable is a generalization of the binary variable in that it can take on more than two states. The dissimilarity between two objects i and j can be computed with the simple matching approach:

d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
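A short Python sketch of this simple matching dissimilarity (the example objects are made up for illustration):

```python
# Dissimilarity for categorical variables: d(i, j) = (p - m) / p,
# where p is the number of variables and m the number of matching states.
def categorical_dissimilarity(i, j):
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
print(categorical_dissimilarity(obj_i, obj_j))  # m=2, p=3 -> 1/3
```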

For Text

Use cosine similarity: represent each document as a term-frequency vector, and measure the similarity between two documents x and y as the cosine of the angle between their vectors, cos(x, y) = (x . y) / (||x|| ||y||).
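A minimal pure-Python sketch of cosine similarity on term-frequency vectors (the vectors are invented for illustration):

```python
import math

# Cosine similarity: dot product of the two vectors divided by the
# product of their Euclidean norms; 1 means identical direction,
# 0 means the vectors share no terms.
def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc1 = (3, 2, 0, 5)  # term counts for document 1
doc2 = (1, 0, 0, 0)  # term counts for document 2
print(round(cosine_similarity(doc1, doc2), 2))  # 0.49
```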

Tags: Technology, Machine Learning, Data Visualization
