A simple demonstration of how important data is for machine learning


We have a very simple tabular dataset, generated programmatically, with the following structure:

   investorID  pcSoldAtProfit  pcHeldAtLoss dispBias
0          29            0.17          0.89        n
1          68            0.16          0.08        n
2          23            0.97          0.07        n
3          80            0.20          0.67        n
4           2            0.74          0.77        y

Here, the features are "pcSoldAtProfit" and "pcHeldAtLoss", and the target variable is "dispBias".

The simple rule used to generate this data is: if pcSoldAtProfit and pcHeldAtLoss are both greater than 0.7, then dispBias is "y"; otherwise dispBias is "n".
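
A minimal sketch of how such a dataset could be generated (the column names and the labelling rule come from above; the use of numpy/pandas, the uniform sampling, the ID range, and the seed are my assumptions, not the post's exact code):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # assumed seed, for reproducibility
n = 100                           # number of rows to generate

df = pd.DataFrame({
    "investorID": rng.integers(0, 100, size=n),   # assumed ID range
    "pcSoldAtProfit": rng.random(n).round(2),     # uniform in [0, 1)
    "pcHeldAtLoss": rng.random(n).round(2),       # uniform in [0, 1)
})

# The labelling rule from the post: 'y' only when both features exceed 0.7
df["dispBias"] = np.where(
    (df["pcSoldAtProfit"] > 0.7) & (df["pcHeldAtLoss"] > 0.7), "y", "n"
)
print(df.head())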

This rule is so straightforward that even a human could guess it from a few samples of the data, but now let's look at how the Decision Tree algorithm performs on various sizes of this dataset...
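
A sketch of the kind of experiment meant here, continuing from the data-generation code above; the choice of scikit-learn and the specific parameters are assumptions rather than the post's exact setup:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = df[["pcSoldAtProfit", "pcHeldAtLoss"]]   # features
y = df["dispBias"]                           # target

# 25% of the rows held out for testing, matching the first run below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
misclassified = (clf.predict(X_test) != y_test).sum()
print("misclassified test instances:", misclassified)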

With 100 rows and a 25% test split:


We see here that the algorithm misclassified 3 instances of the test data.

And this does not improve until we increase the number of data points to 10,000 (with a 10% test split):
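
The scaled-up run might look like the following sketch, reusing the generation and training code above with a larger n and a smaller test fraction (same assumptions as before):

# Regenerate the dataset with 10,000 rows instead of 100
n = 10_000
df = pd.DataFrame({
    "investorID": rng.integers(0, 100, size=n),
    "pcSoldAtProfit": rng.random(n).round(2),
    "pcHeldAtLoss": rng.random(n).round(2),
})
df["dispBias"] = np.where(
    (df["pcSoldAtProfit"] > 0.7) & (df["pcHeldAtLoss"] > 0.7), "y", "n"
)

# This time hold out only 10% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    df[["pcSoldAtProfit", "pcHeldAtLoss"]], df["dispBias"],
    test_size=0.10, random_state=0
)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("misclassified test instances:",
      (clf.predict(X_test) != y_test).sum())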


So we conclude that even for a very simple rule, an ML algorithm needs a large amount of training data to learn it reliably.
