Thursday, January 14, 2021

Multi-label Classification using Python



What is Multi-Label Classification? 

See this image:

[Image: an outdoor scene with a house, trees, and clouds]

What if I ask you whether this image contains a house? The answer would be YES or NO. Now consider another question: which of the following labels are relevant to this picture? House: Yes, Tree: Yes, Beach: No, Cloud: Yes, Mountain: No, Animal: No. Problems like this, where each instance has a set of target variables, are known as multi-label classification problems. So, is there any difference between these two cases? Clearly yes, because in the second case each image may carry a different combination of these labels.

Multi-Label vs. Multi-Class
For any movie, the Central Board of Film Certification issues a certificate depending on the contents of the movie. For example, a movie may be rated 'U/A' (meaning 'Parental Guidance for children below the age of 12 years'). There are other certificate classes, like 'A' (Restricted to Adults) or 'U' (Unrestricted Public Exhibition), but each movie can be assigned only one of these certificates. In short, there are multiple categories but each instance is assigned exactly one, so such problems are known as multi-class classification problems.

Now look at the genres of the same movie: it may be categorized into both comedy and romance. The difference this time is that each movie could fall into one or more categories. Since each instance can be assigned multiple categories, these problems are known as multi-label classification problems, where we have a set of target labels.

Techniques for Solving a Multi-Label Classification Problem

Basically, there are three methods to solve a multi-label classification problem, namely:
1. Problem Transformation
2. Adapted Algorithm
3. Ensemble Approaches

4.1 Problem Transformation

In this method, we try to transform our multi-label problem into one or more single-label problems. This can be carried out in three different ways:
a. Binary Relevance
b. Classifier Chains
c. Label Powerset

Note: These techniques are available in the package "skmultilearn.problem_transform".

4.1.1 Binary Relevance

This is the simplest technique; it basically treats each label as a separate single-class classification problem. For example, let us consider a data set like the one shown below, where X is the independent feature and the Y's are the target variables.
In binary relevance, this problem is broken into 4 different single-class classification problems, as shown in the figure below.
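As a rough illustration of how binary relevance is used in practice, here is a minimal sketch with scikit-multilearn. The synthetic data set from make_multilabel_classification, the GaussianNB base classifier, and all parameter values are illustrative assumptions, not taken from the original post:

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from skmultilearn.problem_transform import BinaryRelevance

# Illustrative synthetic multi-label data: 4 labels, about 2 active per sample.
X, y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=4, n_labels=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# One independent binary classifier is trained for each of the 4 labels.
clf = BinaryRelevance(classifier=GaussianNB())
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)  # sparse (n_samples x n_labels) indicator matrix

# Subset accuracy: a sample counts as correct only if every label matches.
print(accuracy_score(y_test, predictions))

Binary relevance is fast and simple, but it ignores any correlation between labels, which is exactly what the next two techniques try to address.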
4.1.2 Classifier Chains

In this method, the first classifier is trained just on the input data, and then each subsequent classifier is trained on the input space plus the outputs of all the previous classifiers in the chain. Let's try to understand this with an example. In the data set given below, we have X as the input space and the Y's as the labels.
In classifier chains, this problem would be transformed into 4 different single-label problems, as shown below. The yellow colored part is the input space and the white part represents the target variable.
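Here is a similar sketch with ClassifierChain, under the same illustrative assumptions (synthetic data and a GaussianNB base classifier, chosen only for demonstration):

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from skmultilearn.problem_transform import ClassifierChain

X, y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=4, n_labels=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Each classifier in the chain sees the original features plus the outputs
# of all the classifiers that come before it, so label correlations can be
# exploited.
clf = ClassifierChain(classifier=GaussianNB())
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))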
4.1.3 Label Powerset

In this method, we transform the problem into a multi-class problem: one multi-class classifier is trained on all unique label combinations found in the training data. Let's understand it with an example.
In this example, we find that x1 and x4 have the same labels; similarly, x3 and x6 have the same set of labels. So, label powerset transforms this problem into a single multi-class problem, as shown below.
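The same illustrative setup can be used with LabelPowerset (again, the data set and base classifier below are assumptions for demonstration, not taken from the post):

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from skmultilearn.problem_transform import LabelPowerset

X, y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=4, n_labels=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Every distinct label combination seen in y_train becomes one class of a
# single multi-class problem, so label correlations are captured directly.
clf = LabelPowerset(classifier=GaussianNB())
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

The drawback is that the number of classes grows with the number of label combinations, and a combination never seen during training can never be predicted.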
So, label powerset gives a unique class to every label combination that is present in the training set.

4.2 Adapted Algorithm

Adapted algorithms, as the name suggests, adapt an algorithm to directly perform multi-label classification, rather than transforming the problem into different subsets of problems. For example, the multi-label version of kNN is MLkNN; a quick sketch on randomly generated data appears at the end of this section. Scikit-learn also provides inbuilt support for multi-label classification in some algorithms, like Random Forest and Ridge regression, so you can call them directly and predict the output. You can check the scikit-multilearn library if you wish to learn more about other types of adapted algorithms.

4.3 Ensemble Approaches

Ensemble methods often produce better results. The scikit-multilearn library provides several ensemble classification methods, which you can use to obtain better results (see the RakelD example below).

A GitHub Repository for Fuzzy-kNN

This code is not an efficient implementation, as it takes a long time: it performs the distance calculations during the "fit()" method. Link: Fuzzy-kNN
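Below is a minimal sketch of the adapted-algorithm approach with MLkNN, followed by one of scikit-multilearn's ensemble classifiers (RakelD). The synthetic data, the value of k, the labelset size, and the base classifier are all illustrative assumptions:

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from skmultilearn.adapt import MLkNN
from skmultilearn.ensemble import RakelD

X, y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=4, n_labels=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Adapted algorithm: MLkNN looks at the labels of the k nearest neighbours
# and uses Bayesian inference to decide each label for a new sample.
# (Depending on library versions, MLkNN may expect scipy sparse inputs;
# wrapping the arrays with scipy.sparse.lil_matrix() helps in that case.)
mlknn = MLkNN(k=10)
mlknn.fit(X_train, y_train)
print(accuracy_score(y_test, mlknn.predict(X_test)))

# Ensemble approach: RakelD partitions the label space into small disjoint
# subsets and trains one Label Powerset classifier per subset.
rakel = RakelD(base_classifier=GaussianNB(), labelset_size=2)
rakel.fit(X_train, y_train)
print(accuracy_score(y_test, rakel.predict(X_test)))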
