kNN classification parallelized using MapReduce


TITLE:
Improving kNN algorithm's performance on financial transaction data by customizing the process of text vectorization and finding ways to parallelize the kNN algorithm

ABSTRACT:

The main goal of this project is to provide a customized version of the kNN algorithm that classifies bank statements into one of the 'Transaction Classes' (domain-specific categories). It achieves better performance by using a customized vectorizer that does not discard domain-specific data or keywords by treating them as stop words.
Next, we look to improve the algorithm further by parallelizing it based on the MapReduce paradigm. We measure and compare the performance of the linear version and the MapReduce version and look for areas where further improvements can be made.
The implementation is in Python, and the MapReduce version of the kNN program runs on the Hadoop platform.

1. INTRODUCTION

1.1. Challenges

At present, categorization of financial bank statements is done manually using hard-and-fast rules, for example: if the keyword 'NEFT' is present, the category is 'NEFT Payment'; if the keyword 'RTGS' is present, the category is 'RTGS Payment'.

Such rules break down when the data is read from an error-prone source like a printed paper or a scanned image, or when the data is entered manually by typing, so that keywords like 'NEFT' are smudged into 'NFFT' or 'NEET'.

To overcome this difficulty, we introduce Machine Learning into the process to make an intelligent guess of the class of a financial transaction by looking at the bank statement's signature.

1.2. Solution

First, we build a customized vectorization process that treats a transaction statement differently from plain English text; in particular, it does not remove serial numbers and keywords like 'NEFT' during the stop-word removal step.

Second, for classifying financial bank statements into appropriate classes, the algorithm of choice is the k-Nearest Neighbor (kNN) algorithm. kNN is one of the simplest and most widely used algorithms for classification. We implement a linear version of the algorithm in Python, where the similarity measure is the cosine similarity between the input transaction record and the training transaction records.

Third, we seek to improve the kNN algorithm by parallelizing it based on the MapReduce paradigm. We run it on the Hadoop platform and collect the performance metrics.

Fourth, we do a comparative study of the performance of the linear and parallelized versions of kNN and seek insights for further improvements.

2. Literature Review

2.1. Vectorization

Vectorization is the process of tokenizing a collection of text documents, building a vocabulary of the words, and representing each document as a numeric vector over that vocabulary.

For example, suppose we have four documents as follows:
d1: “new york times”
d2: “new york post”
d3: “los angeles times”
d4: “new york times new york post”

Then the vector space representation, using the count of each vocabulary term, is as follows:

Tab. 1, "Term Frequency Matrix"

      angeles  los  new  post  times  york
d1    0        0    1    0     1      1
d2    0        0    1    1     0      1
d3    1        1    0    0     1      0
d4    0        0    2    1     1      2

The above table is also known as the term frequency matrix.
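The same matrix can be reproduced with a few lines of plain Python. This is an illustrative sketch rather than the project's actual vectorizer code:

```python
# Minimal sketch: build the term frequency matrix shown in Tab. 1.
from collections import Counter

docs = {
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
    "d4": "new york times new york post",
}

# Vocabulary: all distinct tokens, sorted alphabetically.
vocab = sorted({token for text in docs.values() for token in text.split()})

# One count vector per document, ordered by the vocabulary.
tf_matrix = {
    name: [Counter(text.split())[term] for term in vocab]
    for name, text in docs.items()
}

print(vocab)            # ['angeles', 'los', 'new', 'post', 'times', 'york']
print(tf_matrix["d4"])  # [0, 0, 2, 1, 1, 2]
```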

2.2. Cosine Similarity

Cosine similarity is a number in the range [0, 1] describing the similarity between two documents taken as vectors derived from the 'Term Frequency Matrix', '1' being highest similarity and '0' being no similarity.

Continuing the example from section 2.1., let us compute the similarity between documents 'd3' and 'd4':

d3: [1, 1, 0, 0, 1, 0]
d4: [0, 0, 2, 1, 1, 2]

Using the formula for cosine similarity:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

Fig. 1, "Cosine Similarity"
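The similarity values quoted below can be reproduced with a short Python sketch (ours, not the project's implementation) operating directly on the count vectors:

```python
# Cosine similarity between two term-frequency vectors.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

d2 = [0, 0, 1, 1, 0, 1]
d3 = [1, 1, 0, 0, 1, 0]
d4 = [0, 0, 2, 1, 1, 2]

print(cosine_similarity(d3, d4))  # ≈ 0.18
print(cosine_similarity(d2, d4))  # ≈ 0.91
```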
The cosine of the angle between the two vectors is: 0.1825

Similarly, the similarity between d2 and d4 is calculated as follows:

d2: [0, 0, 1, 1, 0, 1]
d4: [0, 0, 2, 1, 1, 2]

The cosine of the angle between the two vectors is: 0.9128

These vectors are 6-dimensional. A virtue of cosine similarity is that it turns a comparison that is hard to visualize into one that can be visualized: here a cosine of 0.9128 corresponds to an angle of roughly 24 degrees, which is some 'distance' away from zero degrees, or perfect agreement.

2.3. kNN (k-Nearest Neighbor) algorithm

kNN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.

kNN can be used for classification, where the output is a class membership (a discrete value): an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. It can also be used for regression, where the output is a value for the object (a continuous value): this value is the average (or median) of the values of its k nearest neighbors.

The algorithm can be summarized as:
1. A positive integer k is specified, along with a new sample.
2. We select the k entries in our database which are closest to the new sample.
3. We find the most common classification of these entries.
4. This is the classification we give to the new sample.

As a simple example, suppose we have the height, weight and T-shirt size of some customers and we need to predict the T-shirt size of a new customer given only their height and weight. We plot the data on a scatter plot and see which class a new data point (shown as a yellow filled circle) belongs to when the value of k is '5'.

Fig. 2, "An example of kNN"
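To make the voting step concrete, here is a small self-contained sketch with made-up height/weight/size data; the numbers are purely illustrative and are not taken from the project's dataset:

```python
# Illustrative kNN classifier (k = 5) on made-up customer data.
# Each training record is (height_cm, weight_kg, t_shirt_size).
import math
from collections import Counter

training = [
    (158, 58, "M"), (160, 59, "M"), (163, 61, "M"), (165, 62, "M"),
    (168, 65, "L"), (170, 68, "L"), (173, 70, "L"), (175, 75, "L"),
]

def classify(query, k=5):
    qh, qw = query
    # Euclidean distance from the query point to every training point.
    distances = sorted(
        (math.hypot(h - qh, w - qw), size) for h, w, size in training
    )
    # Majority vote among the k nearest neighbors.
    votes = Counter(size for _, size in distances[:k])
    return votes.most_common(1)[0][0]

print(classify((166, 63)))  # 'M' for this made-up data
```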
2.4. MapReduce paradigm

An algorithm following the MapReduce paradigm can be divided into three phases: 'Map', 'Shuffle' and 'Reduce'.

• Map stage: The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: This stage is the combination of the 'Shuffle' stage and the 'Reduce' stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.

An example of the MapReduce paradigm is the word count problem. Suppose we have the text "foo foo quux labs foo bar quux". The mapper might produce key-value pairs containing the count of the individual tokens in the following form:

foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

This output is then sorted and passed to the reducer, which produces the final counts:

bar 1
foo 3
labs 1
quux 2

2.5. Performance metrics

2.5.1. Recall

Recall is "how many of the true positives were recalled (found)", i.e. how many of the correct hits were also found.

Recall = TP / (TP + FN)

where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative.

2.5.2. Precision

Precision is how many of the returned hits were true positives, i.e. how many of the found instances were correct hits.

Precision = TP / (TP + FP)
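These definitions translate directly into Python. The following sketch is our own illustration; the example figures anticipate the worked example in section 2.5.4 below:

```python
# Recall, precision and F-score from raw true/false positive/negative counts.
def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f_score(p, r):
    # Harmonic mean of precision and recall ("balanced F-score", section 2.5.3).
    return 2 * p * r / (p + r)

# First data set from the worked example in section 2.5.4:
p1 = precision(tp=12, fp=9)   # 0.5714 -> 57.14 %
r1 = recall(tp=12, fn=3)      # 0.80   -> 80 %
print(p1, r1, f_score(p1, r1))
```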
A well-known image from Wikipedia explains precision and recall using a Venn-style diagram and makes the two easy to visualize:

Fig. 3, "Precision and Recall"

2.5.3. F-score

A measure that combines precision and recall is their harmonic mean, called the "balanced F-score":

F = 2 * (precision * recall) / (precision + recall)

2.5.4. Micro averaging

As we are dealing with multiclass classification, we use the averaging techniques called 'micro averaging', 'macro averaging' and 'weighted averaging'.

In the micro-average method, you sum up the individual true positives, false positives, and false negatives of the system over the different sets, and then apply them to get the statistics.

For example, for one set of data, suppose the system's True positives (TP1) = 12, False positives (FP1) = 9 and False negatives (FN1) = 3. Then precision (P1) and recall (R1) are 57.14% and 80%.

For a different set of data, suppose the system's True positives (TP2) = 50, False positives (FP2) = 23 and False negatives (FN2) = 9. Then precision (P2) and recall (R2) are 68.49% and 84.75%.

Now, the average precision and recall of the system using the micro-average method are:

Micro-average of precision = (TP1 + TP2) / (TP1 + TP2 + FP1 + FP2) = (12 + 50) / (12 + 50 + 9 + 23) = 65.96%
Micro-average of recall = (TP1 + TP2) / (TP1 + TP2 + FN1 + FN2) = (12 + 50) / (12 + 50 + 3 + 9) = 83.78%

The micro-average F-score is simply the harmonic mean of these two figures.

2.5.5. Macro averaging

The method is straightforward: just take the average of the precision and recall of the system on the different sets. For the example above:

Macro-average precision = (P1 + P2) / 2 = (57.14 + 68.49) / 2 = 62.82%
Macro-average recall = (R1 + R2) / 2 = (80 + 84.75) / 2 = 82.38%

The macro-average F-score is simply the harmonic mean of these two figures.

2.5.6. Weighted averaging

Weighted average = w1 * x1 + w2 * x2 + ... + wn * xn

where w = relative weight (%) and x = value.

The weighted average formula is used to calculate the average value of a particular set of numbers with different levels of relevance. The weights are expressed as a percentage of the total relevancy, so all weights together should sum to 100%, or '1'.

For comparison, the arithmetic mean of 1, 2 and 3 is (1 + 2 + 3) / 3 = 2. The weighted average instead looks at how relevant each number is: say that 1 occurs only 10% of the time while 2 and 3 each occur 45% of the time. These percentages are the weights, and the weighted average is 1 * 0.10 + 2 * 0.45 + 3 * 0.45 = 2.35.

2.5.7. Use of Pandas to do SQL-like analysis on the classifier output

Because we are using Pandas for storing and maintaining the data, we can also run SQL-like operations on it for further analysis, for example to count the transactions from the test data that were wrongly classified:

Tab. 2, "Classifier Results in Pandas DataFrame"
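As an illustration of such SQL-like analysis (the DataFrame contents and the column names 'actual_class' / 'predicted_class' are assumptions, not the project's actual schema), a misclassification count can be obtained as follows:

```python
# Sketch of an SQL-like analysis of classifier output with Pandas.
# Column names and rows are illustrative only.
import pandas as pd

results = pd.DataFrame({
    "actual_class":    ["NEFT Payment", "RTGS Payment", "NEFT Payment", "Cheque"],
    "predicted_class": ["NEFT Payment", "NEFT Payment", "NEFT Payment", "Cheque"],
})

# Equivalent of: SELECT actual_class, COUNT(*) FROM results
#                WHERE actual_class <> predicted_class GROUP BY actual_class
wrong = results[results["actual_class"] != results["predicted_class"]]
print(wrong.groupby("actual_class").size())
```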
3. Implementation Overview

3.1. Feature Engineering

For feature engineering, we use Pandas.

Handling of missing data:
a) Fill zeros in the numeric columns where the value is 'NaN'.
b) Fill 'Missing' in the string columns where the value is 'NaN'.

Adding new derived columns to aid classification:
a) Add a column to aid similarity between transactions depending upon the transaction type ('debit' or 'credit').
b) Add a new column that has the 'Particulars' data but with 'Separator' characters replaced by a space and digits replaced with zeros. This facilitates tokenization in the classification algorithms.
c) Add a new column, based on the column added in step (b), that does not contain the classification keywords. This column is used as the input during training, and the output is the class to which the transaction belongs.

3.2. Vectorization

The vectorization of financial transaction statements is customized to not remove anything from the data, as every character in a transaction statement is meaningful and any change to any character is effectively corruption of the data. For our categorization problem, we care about the presence of certain keywords and the overall structure of the transaction statement.

We further try to improve the performance of the algorithm by changing arbitrary-length serial numbers (with lengths varying from, say, 5 to 7 characters) into a fixed-length string literal containing all zeros, i.e. the string literal "00000". This increases the similarity between two serial numbers, and hence between transaction statements that follow the same structure. For example, "ABCD 123456 789" becomes "ABCD 00000 00000".

Let us see what difference this makes. Suppose two transactions are:
• NEFT 123456 789
• NEET 0987 65

The cosine similarity between these two is: 0

After modifying the transaction statements, they become:
• NEFT 00000 00000
• NEET 00000 00000

The cosine similarity is now: 0.8 (a small sketch of this normalization follows below).
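The following is a minimal sketch of the normalization idea and its effect on cosine similarity; it is our illustration, not the project's exact vectorizer code:

```python
# Replace digit runs with a fixed literal, then compare cosine similarity.
import math
import re
from collections import Counter

def normalize(statement):
    # Replace every run of digits with the fixed literal "00000".
    return re.sub(r"\d+", "00000", statement)

def cosine(a, b):
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    return dot / (math.sqrt(sum(v * v for v in va.values())) *
                  math.sqrt(sum(v * v for v in vb.values())))

t1, t2 = "NEFT 123456 789", "NEET 0987 65"
print(cosine(t1, t2))                        # 0.0 -> no token overlap at all
print(cosine(normalize(t1), normalize(t2)))  # 0.8 -> shared structure now counts
```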
3.3. Implementing the linear kNN algorithm for classification

• This kNN classification program makes use of the similarity measure between the input 'transaction statement' and the statements in the training set. The vector space model is used for representing the text input, and cosine similarity is used to measure similarity.
• The parameter 'k' for the algorithm is set to the square root of the total number of data points in the training set.
• After classification, we record the performance metrics for the algorithm, and we can drill down into the classifier's behaviour by loading the results into a Pandas DataFrame.

3.4. Implementing the parallelized kNN algorithm based on the MapReduce paradigm

To parallelize kNN, we break it into parts called Mapper and Reducer that run on machines named likewise: 'Map machines' for the Mapper code and 'Reduce machines' for the Reducer code. To identify the class of a query 'Transaction', we take the following steps:

a) We divide the training records into 'N' parts, where N equals the number of Map machines.
b) We feed the training records and the query 'transaction' to the Map machines.
c) The Map machines generate [key, value] pairs of the form [class of training data point, distance of the query point from the training point].
d) For each class there is a Reduce machine. A Reduce machine takes the input from the Map machines belonging to its particular class and generates the 'Top 15 pairs of [Class, Distance] for class X' (say). The top record is the one with the minimum distance (equivalently, the highest cosine similarity, which is what the implementation emits).
e) Next, there is another Reduce machine that generates the 'Top 15 pairs of [Class, Distance]', with all the classes participating as input.
f) Note that this last Reduce machine finds the top-15 candidates for identifying the class of the query 'Transaction'.
g) The Reduce machines can process the data incrementally from the data stream, without the need to store all the candidates coming from the Map machines at the previous level.

Diagrammatically:

Fig. 4, "High Level Diagram Showing Nodes Running the Parallelized kNN"
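For concreteness, below is a condensed Hadoop Streaming style sketch of a Mapper and a Reducer for one query 'Transaction'. It is an illustration under assumptions: the training data is assumed to be available on each node as a local file 'training.tsv' with lines of "class<TAB>particular", k is fixed at 15, and the helper functions are ours rather than the project's actual mapper.py/reducer.py.

```python
# mapper.py -- emits "[first level classification]\t[cosine similarity]"
# for the query read from STDIN against every training record.
import math
import sys
from collections import Counter

def vector(text):
    return Counter(text.split())

def cosine(va, vb):
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Training records assumed shipped to every Map node as a local file.
training = [line.rstrip("\n").split("\t", 1)
            for line in open("training.tsv", encoding="utf-8")]

for query in sys.stdin:                 # each input line: one test 'particular'
    qv = vector(query.strip())
    for label, particular in training:
        print("%s\t%f" % (label, cosine(qv, vector(particular))))
```

```python
# reducer.py -- keeps the 15 most similar [class, similarity] pairs seen
# on STDIN and prints the majority class among them.
import heapq
import sys
from collections import Counter

top = []                                # min-heap of (similarity, class) pairs
for line in sys.stdin:
    label, sim = line.strip().split("\t")
    heapq.heappush(top, (float(sim), label))
    if len(top) > 15:                   # drop the least similar pair
        heapq.heappop(top)

votes = Counter(label for _, label in top)
print(votes.most_common(1)[0][0])       # predicted class of the query
```

Such scripts are typically submitted through the Hadoop Streaming jar, with the two files passed via the -mapper and -reducer options.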
4. In-depth Implementation Review

4.1. A look at the data

Only the relevant columns are shown here.

Tab. 3, "A look at the data"
'Particulars' is the field that we read and classify into the classes shown in the column 'First Level Classification'; the reason why a class applies to a particular is given in the column 'Basis of Classification'. The 'Basis of Classification' column provides some domain knowledge and intuition about the data, but we are not going to populate this column after running our Machine Learning algorithm for classification.

4.2. Pseudo-code for the linear version of kNN

Step 1: Load the CSV data into a Pandas DataFrame.
Step 2: Do feature engineering to handle missing data and add new derived columns to aid classification.
Step 3: Generate the test set and training set by randomly taking out 10% of the data for testing.
Step 4: Build a vectorizer that changes arbitrary-length serial numbers (with lengths varying from, say, 5 to 7 characters) into the fixed-length string literal containing all zeros, i.e. "00000".
Step 5: Define two functions as part of implementing kNN. The first one, "calculateDistances", calculates the distance of the query point from every other point.
Step 6: The second one, "getClassLabel", runs an SQL-like operation on the Pandas DataFrame from the last step, which has the columns 'Class' and 'Similarity'. Essentially, it sorts the data in descending order of similarity, retrieves the top 15 entries, and returns the majority class among those 15.

4.3. Pseudo-code for the parallelized version of kNN

4.3.1. Pseudo-code for the Mapper

Step 1: For each item coming from 'Standard Input', repeat 'Step 2'.
Comment for 'Step 1': From standard input, the Mapper code takes the test data as input.
Step 2: For each item in the training data set, repeat steps '3' to '5'.
Step 3: Get vectors for the query particular (coming from the test data) and the training particular (coming from the training data).
Step 4: Normalize the vectors obtained in 'Step 3'.
Step 5: Print the 'First level classification' of the training particular and the cosine similarity between the query particular and the training particular.
Comment for 'Step 5': The output syntax for 'Step 5' is "[first level classification]\t[cosine similarity]".

4.3.2. Pseudo-code for the Reducer

# Input comes from STDIN
Step 1: for line in sys.stdin:
Step 1.1: remove leading and trailing whitespace
Step 1.2: parse the input we got from mapper.py
Step 1.3: append to the list of the top 15 instances with the highest similarity
Step 2: outputString = the majority class in the 'top 15' list
Step 3: append the output strings from multiple runs to a file
Step 4: print this MapReduce output ('outputString') to 'stdout'

4.4. Generating test data and training data

Step 1: Generate a random permutation of the numbers from 1 to the number of rows in the complete dataset.
Step 2: Reorder the Pandas DataFrame containing the complete dataset according to the permutation obtained in 'Step 1'.
Step 3: Define the percentage of rows you want in the test data.
Step 4: Take the first part ('test data') from the Pandas DataFrame (containing the complete data) in proportion to the percentage defined in 'Step 3'.
Step 5: Take the last part ('training data') from the Pandas DataFrame (containing the complete data) in proportion to "1 - the percentage defined in 'Step 3'".
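A minimal sketch of the split described in section 4.4, assuming the complete dataset is already loaded into a Pandas DataFrame called 'df' (the function and variable names are illustrative, not the project's):

```python
# Split a DataFrame into test and training parts via a random permutation.
import numpy as np
import pandas as pd

def split_test_train(df, test_fraction=0.10):
    permutation = np.random.permutation(len(df))  # Step 1
    shuffled = df.iloc[permutation]               # Step 2
    n_test = int(len(df) * test_fraction)         # Step 3
    test_data = shuffled.iloc[:n_test]            # Step 4: first part
    train_data = shuffled.iloc[n_test:]           # Step 5: remaining part
    return test_data, train_data

# Usage: test_df, train_df = split_test_train(df)
```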
4.5. Information on the Tools Used

4.5.1 Host OS: Windows 10 Home (64-bit)
4.5.2 Processor: Intel(R) Core(TM) i7-4510U CPU @ 2.00 GHz / 2.60 GHz
4.5.3 Virtual Machine: VirtualBox 5.2.18
4.5.4 Guest OS: Ubuntu 18.04 (64-bit)
4.5.5 Hadoop installation (running on the guest OS): 3.1.1 (64-bit)
4.5.6 Java Runtime Environment (running on the guest OS): OpenJDK version "10.0.2" (64-bit)
4.5.7 Language (running on the guest OS): Python 3.6.5
4.5.8 IDE: Visual Studio Code 1.27.2 (64-bit)

5. Conclusion and Recommendations

5.1. Conclusion

kNN is one of the simplest and most widely used classification algorithms, thanks to its competitive performance with respect to other well-known algorithms. We implemented a linear version and a parallelized version of kNN. The classification performance metrics of the two versions were of a similar order, but the running time of the parallelized version improves in proportion to the number of Map machines and Reduce machines deployed. Metrics from one of the test runs are shown below:

Micro averaged: Precision: 0.91, Recall: 0.91, F-score: 0.91
Macro averaged: Precision: 0.76, Recall: 0.78, F-score: 0.76
Weighted average: Precision: 0.91, Recall: 0.91, F-score: 0.90

Count of transactions from the test data that were wrongly classified:

Tab. 4, "Classifier Results in Pandas DataFrame"
5.2. Recommendations

- The software used was of the latest development versions.
- A processor with a 64-bit register size and instruction set is required.
- If more advanced technologies appear, in terms of software or hardware, or if improvements appear in future releases of the software, they should be used for any further experimentation or work.

6. Directions for future work

From the knowledge and expertise gained while working on kNN:

- We can work to improve the performance of the linear version of kNN.
- We can compare its performance with industry-grade machine learning libraries.
- We can find out whether parallelized versions of other Machine Learning algorithms, like Naïve Bayes, Linear Regression, Logistic Regression, k-Means and Decision Trees, can be built.
- If we are able to parallelize the pseudo-code of other Machine Learning algorithms, we can implement them as well.
- After the above work, we can do a comparative study of the other algorithms against kNN.

7. Summary

The report discusses in detail the classification technique kNN and the two variations that are compared: the linear version and the parallelized version. It covers the process of vectorization, classification using linear kNN, and parallelized kNN using MapReduce. We run the Mapper code and Reducer code on the Hadoop platform, on a guest OS of Ubuntu hosted on Windows. The aim of the project is as follows: improving the kNN algorithm's performance on financial transaction data by customizing the process of text vectorization and finding ways to parallelize the kNN algorithm.

8. References

1. Montgomery, Douglas C., Cheryl L. Jennings and Murat Kulahci. Introduction to Time Series Analysis and Forecasting. New Jersey: John Wiley and Sons Inc., 2008.
2. Shumway, Robert H. and David S. Stoffer. Time Series Analysis and Its Applications. New York: Springer, 2011.
3. Cowpertwait, Paul S.P. Introductory Time Series with R. New York: Springer, 2006.
4. Mitchell, Tom M. Machine Learning. New York: McGraw-Hill, 1997.
5. McKinney, Wes. Python for Data Analysis. Sebastopol: O'Reilly Media, 2012.
6. Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn and TensorFlow. Sebastopol: O'Reilly Media, 2017.
7. Mostofa Patwary, Nadathur Satish, Narayanan Sundaram, Jilalin Liu, Peter Sadowski, Evan Racah, Suren Byna, Craig Tull, Wahid Bhimji, Prabhat, Pradeep Dubey. Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures. Intel HPC Developer Conference, 2016. Berkeley: Lawrence Berkeley National Lab / Intel Corporation, 2016.
