Thursday, June 18, 2020

The importance of posing the right question in machine learning, and of data analysis and data preprocessing



In this post we are going to demonstrate the importance of posing the right question (framing the ML use case, i.e., the ML problem, correctly) and the importance of understanding the data and presenting it to the ML model in the correct form.

The problem we are considering is the classification of data based on two columns, viz. 'alphabet' and 'number'.

+--------+------+------+
|alphabet|number|animal|
+--------+------+------+
|       A|     1|   Cat|
|       A|     3|   Cat|
|       A|     5|   Cat|
|       A|     7|   Cat|
|       A|     9|   Cat|
|       A|     0|   Dog|
|       A|     2|   Dog|
|       A|     4|   Dog|
|       A|     6|   Dog|
|       A|     8|   Dog|
|       B|     1|   Dog|
|       B|     3|   Dog|
|       B|     5|   Dog|
|       B|     7|   Dog|
|       B|     9|   Dog|
|       B|     0|   Cat|
|       B|     2|   Cat|
|       B|     4|   Cat|
|       B|     6|   Cat|
|       B|     8|   Cat|
+--------+------+------+

We have to tell whether the "animal" is 'Cat' or 'Dog' based on 'alphabet' and 'number'.
The rules are as follows (a minimal sketch of the rule as a function appears after the list):
If alphabet is A and number is odd, the animal is Cat.
If alphabet is A and number is even, the animal is Dog.
If alphabet is B and number is odd, the animal is Dog.
If alphabet is B and number is even, the animal is Cat.
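
To make the rule concrete, here is a minimal sketch of it as a plain Python function (the name 'classify' is ours, not part of the Spark code below). Note that the label is effectively an XOR of the two conditions "alphabet is A" and "number is odd":

def classify(alphabet, number):
    # 'Cat' exactly when (alphabet == 'A') agrees with (number is odd)
    return 'Cat' if (alphabet == 'A') == (number % 2 == 1) else 'Dog'

assert classify('A', 1) == 'Cat' and classify('A', 0) == 'Dog'
assert classify('B', 1) == 'Dog' and classify('B', 0) == 'Cat'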

We have written the following DecisionTreeClassifier code for this task, and we are going to see how data preprocessing helps with this problem.

from pyspark import SparkContext
from pyspark.sql import SQLContext # Main entry point for DataFrame and SQL functionality.
from pyspark.ml import Pipeline

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

a = list()
for i in range(0, 100000, 2):
    a.append(('A', i+1, 'Cat'))
b = list()
for i in range(0, 100000, 2):
    b.append(('A', i, 'Dog'))
c = list()
for i in range(0, 100000, 2):
    c.append(('B', i+1, 'Dog'))
d = list()
for i in range(0, 100000, 2):
    d.append(('B', i, 'Cat'))
 
l = a + b + c + d

df = sqlCtx.createDataFrame(l, ['alphabet', 'number', 'animal'])

alphabetIndexer = StringIndexer(inputCol="alphabet", outputCol="indexedAlphabet").fit(df)
df = alphabetIndexer.transform(df)
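
# df now has an extra 'indexedAlphabet' column of doubles. StringIndexer
# orders labels by frequency (ties broken alphabetically), so here we
# expect 'A' -> 0.0 and 'B' -> 1.0.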

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["indexedAlphabet", "number"], outputCol="features")

# VectorAssembler does not have a "fit" method.
# VectorAssembler does not work on the "alphabet" column directly; it throws "IllegalArgumentException: Data type string of column alphabet is not supported". It does work on "indexedAlphabet".

df = assembler.transform(df)
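
# 'features' is now a single vector column; e.g. under the indexing above,
# the row ('A', 1) yields the feature vector [0.0, 1.0].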

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="animal", outputCol="label").fit(df)

df = labelIndexer.transform(df)

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

# Chain the stages in a Pipeline. The indexers were already applied to df
# above, so the pipeline here contains only the tree.
# pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
pipeline = Pipeline(stages=[dt])

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = df.randomSplit([0.95, 0.05])
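# A seed may be passed for a reproducible split,
# e.g. df.randomSplit([0.95, 0.05], seed=42) -- the seed value is arbitrary.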

# Train model.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
# predictions.select("prediction", "label", "features").show()


# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
# Note: 'precision' and 'recall' are not valid values for the 'metricName'
# arg; without metricName="accuracy" the evaluator defaults to "f1".

accuracy = evaluator.evaluate(predictions)
#print("Test Error = %g" % (1.0 - accuracy))
print(df.count())
print("Accuracy = %g" % (accuracy))

treeModel = model.stages[0]
print(treeModel) # summary only 
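
To see what the model actually learned, the full tree with its split thresholds can be printed via the toDebugString property of the trained tree model:

print(treeModel.toDebugString)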

With this code, we observe the following results:

df.count(): 20000
Accuracy = 0.439774

df.count(): 200000
Accuracy = 0.471752

df.count(): 2000000
Accuracy = 0.490135 

What went wrong? 

We posed a simple classification problem to the DecisionTreeClassifier, but we did not simplify the data by turning the numbers into an 'odd or even' indicator. This makes learning so hard for the model that even with 2 million data points, the classification accuracy stood at about 49%, i.e., no better than chance. A back-of-the-envelope calculation below shows why.
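
A sketch of the arithmetic, assuming Spark's default maxDepth of 5 for DecisionTreeClassifier:

max_depth = 5                # Spark's default maxDepth (assumption)
max_leaves = 2 ** max_depth  # at most 32 leaf regions in the tree
distinct_numbers = 100000    # 'number' ranges over 0..99999
# Each split on the raw 'number' is a threshold, so every leaf covers a
# contiguous interval of numbers. But the label flips between every pair
# of consecutive integers, so a correct tree would need roughly one leaf
# per number -- 32 leaves against ~100000 needed, hence near-chance accuracy.
print(max_leaves, "leaves available vs ~", distinct_numbers, "needed")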

~ ~ ~

The code changes to produce the simplified data are:
Change 1: replace 'number' with an 'odd or even' (parity) indicator.
Change 2: reduce the number of data points.

a = list()
for i in range(0, 1000, 2):
    a.append(('A', (i+1) % 2, 'Cat'))
b = list()
for i in range(0, 1000, 2):
    b.append(('A', i % 2, 'Dog'))
c = list()
for i in range(0, 1000, 2):
    c.append(('B', (i+1) % 2, 'Dog'))
d = list()
for i in range(0, 1000, 2):
    d.append(('B', i % 2, 'Cat'))
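
As an aside, the same parity indicator could be derived inside Spark from the original DataFrame, instead of regenerating the lists; a minimal sketch (the column name 'parity' is ours):

from pyspark.sql.functions import col

# Replace the raw number with its parity: 1.0 for odd, 0.0 for even
df = df.withColumn("parity", (col("number") % 2).cast("double"))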
 
Result:

df.count(): 200
Accuracy = 0.228571

df.count(): 2000
Accuracy = 1 
