How to compute the dissimilarity between objects described by the following types of variables:
Numerical (interval-scaled) variables
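Use the Euclidean distance or the Manhattan distance; both are special cases of the Minkowski distance. For two objects i = (x_i1, ..., x_in) and j = (x_j1, ..., x_jn) described by n numerical variables:
Euclidean distance: d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)
Manhattan distance: d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|
Minkowski distance: d(i, j) = (|x_i1 - x_j1|^p + |x_i2 - x_j2|^p + ... + |x_in - x_jn|^p)^(1/p), where p >= 1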
Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8), compute the Minkowski distance between the two objects using p = 3:
d = (|22 - 20|^3 + |1 - 0|^3 + |42 - 36|^3 + |10 - 8|^3)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15
The Manhattan distance is the Minkowski distance with p = 1, and the Euclidean distance is the Minkowski distance with p = 2.
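To check the arithmetic for the numerical case, here is a minimal Python sketch of the Minkowski distance applied to the two tuples from the example. The function name `minkowski` is only an illustrative choice, and the sketch assumes both objects have the same number of variables.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length numeric tuples."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    # Sum the p-th powers of the absolute differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

manhattan = minkowski(x, y, 1)  # p = 1 gives the Manhattan distance
euclidean = minkowski(x, y, 2)  # p = 2 gives the Euclidean distance
cubic = minkowski(x, y, 3)      # p = 3, as in the worked example

print(round(manhattan, 2), round(euclidean, 2), round(cubic, 2))
# prints: 11.0 6.71 6.15
```

A single function covers all three distances because only the order p changes.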
Asymmetric binary variables
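A binary attribute is symmetric if both of its states are equally important and carry the same weight; it is asymmetric if one state (usually coded 1) is more important than the other. Suppose the contingency table for objects i and j is:

                 object j = 1    object j = 0
object i = 1          q               r
object i = 0          s               t

Here q is the number of attributes that equal 1 for both objects, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both. For asymmetric binary attributes the negative matches t are considered unimportant and are ignored, so the dissimilarity is d(i, j) = (r + s) / (q + r + s). The Jaccard coefficient is the corresponding similarity: J(i, j) = q / (q + r + s) = 1 - d(i, j). In terms of set operations, if A and B are the sets of attributes that equal 1 in objects i and j, the Jaccard coefficient becomes J = |A intersection B| / |A union B|.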
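The Jaccard computation for asymmetric binary attributes can be sketched in Python as follows. The names `binary_counts`, `jaccard_similarity`, `q`, `r`, `s`, `t` and the two example vectors are illustrative assumptions, not part of the original exercise: `q` counts attributes that are 1 in both objects, `r` those that are 1 only in the first, `s` those that are 1 only in the second, and `t` those that are 0 in both. Returning a similarity of 1.0 for two all-zero objects is also an assumption, since the ratio is undefined in that case.

```python
def binary_counts(i, j):
    """Return the contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 only in i
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 1 only in j
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # 0 in both
    return q, r, s, t


def jaccard_similarity(i, j):
    """Jaccard coefficient: negative matches (t) are ignored."""
    q, r, s, _ = binary_counts(i, j)
    if q + r + s == 0:
        return 1.0  # assumption: two all-zero objects are treated as identical
    return q / (q + r + s)


def asymmetric_binary_dissimilarity(i, j):
    """Dissimilarity for asymmetric binary attributes: 1 - Jaccard."""
    return 1.0 - jaccard_similarity(i, j)


# Hypothetical example objects with six binary attributes each.
obj_i = (1, 0, 1, 0, 0, 0)
obj_j = (1, 0, 1, 0, 1, 0)

print(binary_counts(obj_i, obj_j))  # prints: (2, 0, 1, 3)
print(jaccard_similarity(obj_i, obj_j))
print(asymmetric_binary_dissimilarity(obj_i, obj_j))
```

The same value comes out of the set form: with A = {0, 2} and B = {0, 2, 4} as the index sets of the 1-valued attributes, the intersection has 2 elements and the union has 3.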
Categorical variables
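A categorical variable is a generalization of the binary variable in that it can take on more than two states. The dissimilarity between two objects i and j can be computed as the ratio of mismatches:
d(i, j) = (p - m) / p
where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
For Text
Use cosine similarity between the term-frequency vectors x and y of the two documents: sim(x, y) = (x . y) / (||x|| ||y||). A common choice of dissimilarity is then 1 - sim(x, y).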
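The last two measures can be sketched together in Python: the mismatch ratio (p - m) / p for categorical variables, with m and p as defined above, and cosine similarity for text represented as term-frequency vectors. The function names and the example data are hypothetical, and the sketch assumes each document has already been turned into a vector of term counts over a shared vocabulary.

```python
import math


def categorical_dissimilarity(i, j):
    """Mismatch ratio (p - m) / p for two objects with categorical variables."""
    p = len(i)                                  # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)  # number of matching states
    return (p - m) / p


def cosine_similarity(x, y):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical objects described by three categorical variables.
obj_a = ("red", "small", "round")
obj_b = ("red", "large", "round")
print(categorical_dissimilarity(obj_a, obj_b))  # 2 matches out of 3 variables

# Hypothetical term-frequency vectors over a shared four-word vocabulary.
doc_1 = (5, 0, 3, 0)
doc_2 = (3, 0, 2, 0)
print(cosine_similarity(doc_1, doc_2))
```

Cosine similarity depends only on the direction of the vectors, so a long document and a short one with the same word proportions score close to 1.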
Sunday, April 24, 2022
Similarity and Dissimilarity in Data Mining