Since the last blogpost, I left my feature engineering journey behind and can now finally start looking at some models.
No Free Lunch Theorem
During my Machine Learning journey, I looked at 3 models (‘she’s a model and she’s looking good…’ – Kraftwerk). Why didn’t we stick with only one?
Well, in Machine Learning there is something called the “No Free Lunch” Theorem, which – in short – states that there is no one-size-fits-all and that there is no one super model that works best for every problem.
Every model oversimplifies the underlying problem and tries to build a hypothesis that comes as close as possible to predicting the real outcome of the problem, but it nevertheless remains an approximation of the real underlying function.
Depending on the underlying problem and function outcomes, some models simply perform better than others.
There are some companies out there that say, “we only do deep learning”. Well, to me it feels like those companies are throwing away a lot of potentially great models, and wouldn’t it be a pity if one of those models were a better fit for your particular problem?
If a construction company said, “we have decided to rely only on hammers from now on”, the only type of problem that company would be able to solve is driving nails into all kinds of materials. Too bad if you’re actually expecting them to build you a house instead of a woodshed, isn’t it?
Bearing this in mind, I do have both a rule of thumb and some experience about which models might perform better than others in my case.
Whatever model you might ultimately choose, the process always comes down to this:
Select your model
Okay, enough preaching about model selection, let’s get down to business.
Gradient Boosting Machine
I have to admit, I initially knew very little about this algorithm, but I’ve gotten to know it in a little more detail since visiting JavaOne in San Francisco last year. I attended a conference talk in which it was used in an air traffic delay prediction task. I took notes to remember it (analog, can you imagine) but didn’t really pay too much attention. However, later that evening I had a small discussion with some fellow attendees in which the same talk came up again, as well as H2O.ai, a promising Machine Learning platform.
So, when starting to look into models, this is my first candidate to kick things off.
Let me quickly explain what the idea behind GBM is. It is an ensemble technique based on combining multiple decision trees boosted using gradient descent.
An ensemble algorithm starts from the idea that combining multiple models achieves higher accuracy than trying to capture the hypothesis function in one single model. The advantage here is that when combining multiple hypothesis spaces of different models, the single hypothesis of the trained ensemble algorithm is not necessarily restricted to those different hypothesis spaces.
In GBM the same is done by combining multiple regression trees that keep on refining the predictive result. The idea here is that multiple weak predictors lead to a single strong predictor. So, in my case, each next regression tree tries to correct the errors (here: the misclassifications) left by the previous trees, and so on.
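To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch of boosting written from scratch. It is not H2O's implementation and not the model used in this project: it assumes a one-dimensional regression problem with squared error (where the negative gradient is simply the residual), uses one-split "stumps" as weak learners, and the data points, function names and parameter values are made up for illustration.

```python
# Minimal gradient boosting for squared error with one-split "stumps".
# Everything here (names, data, parameters) is illustrative only.

def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit each new stump to the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up toy data: three rough "steps".
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 5.1, 4.8]

one_tree = fit_gbm(xs, ys, n_trees=1)
twenty_trees = fit_gbm(xs, ys, n_trees=20)
mse_one = mean_squared_error(one_tree, xs, ys)
mse_twenty = mean_squared_error(twenty_trees, xs, ys)
```

Comparing `mse_one` with `mse_twenty` shows the training error shrinking as trees are added. That is exactly why the training score alone says little about generalization: with enough trees, the ensemble can fit the training data almost perfectly.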
The advantage of using H2O.ai here is that it comes with a nice UI to start testing the model on small datasets to see how it performs. Another bonus is that it has a nice API that can be accessed through R, Python, Java and… Scala. It can be used on a single machine, or launched on a remote machine and accessed through its service layer.
Once you have a decently performing model, H2O offers quite some possibilities to productize your model. Compare this for instance to scikit-learn – undoubtedly a great framework as well but limited to Python – and you’ll quickly appreciate H2O for what it offers you in terms of getting a good model into production. Mind you, I’m not saying that it can’t be done with scikit-learn, it is just much more work when it has to be done in a “non-Python environment”.
Ok, but how does my model perform? Not very well, actually. The problem with ensemble techniques is one of overfitting. Well, overfitting is actually a problem lurking around the corner of any algorithm, but the lack of generalization on my problem is a real challenge.
Remember that my dataset is relatively small and that I am definitely missing important features here, so I need to be cautious not to overfit on “noisy” feature windows.
Bearing in mind that Machine Learning is a process of frustration-iteration, I kept on tuning the model, but it never got any really good results on my test set. Overfitting proved to be too big a hurdle, even when using cross-validation and fiddling around with the number of estimators and so on. I never managed to reach a nice compromise that both generalized well and got a decent score on my test set.
Check out a screenshot of the confusion matrix produced in one of my iterations on a small dataset illustrating the results I got from using GBM.
You’ll clearly notice that the algorithm had no problems getting everything right on the training set, and the validation set also scored a staggering 96%. Unfortunately, the model is hopelessly overfitted, since it only scored 50% on our test set. After a couple of iterations, I managed to get it up to 80%, but the progress made was just not good enough to warrant a long life.
The whole idea behind this frustration-iteration approach is that the more you iterate, the less frustrated you (should) get with the results. So as soon as you are not truly happy with the ratio between the two factors, you simply have to bail and see if there are other ways to tackle the problem.
Combatting overfitting here is a hard nut to crack with such a limited dataset, which was also imbalanced as an added bonus. So, to be perfectly honest, this wasn’t a real surprise.
So, duly and quickly abandoning the GBM approach, I now look at something else that might be a better fit for our problem: a Feed Forward Neural Network.
A couple of small notes though:
– Does this make GBM a bad approach in general?
No, of course it doesn’t, it is just a bad fit for my case, but this doesn’t mean I’ll never look at GBM again. To be fair, it is still on my agenda, because I think that – as soon as our dataset starts to grow to a decent size and the classes balance out a little – the GBM algorithm still has the potential to deliver.
– Was my effort completely wasted? No, of course it wasn’t, I definitely gained more insights and I managed to establish a nice baseline for myself fairly quickly.
Artificial Neural Network
Respect the classics, man! It’s a neural net! Yes, ANNs are used in a variety of tasks and are here to stay. It was a no-brainer to put it to the test. H2O.ai also supports neural nets and I decided to stick to the framework while working on this algorithm.
I’m not going to explain how a Neural Network works. The interested reader will find plenty of papers out there that explain how this popular algorithm works, and I think that nowadays it’s the best known and understood algorithm out there.
Okay, but does it perform well? Yes, it actually does! It generalizes really well and scores quite well on my test set. This is the first model I actually put into production and use in a real-time testing cycle. It initially had its issues but, iteration after iteration, it performed better and better.
Here’s another confusion matrix of a subset of my data for the ANN.
As you’ll notice, the trained model makes a few more errors on the dataset compared to the GBM. But, in contrast, it scores much better on my test set and as such generalizes much better to unseen data. I encountered some local minima problems along the way, but managed to overcome them pretty soon.
I also trained the model on a small dataset a number of times to check its stability.
This resulted in the following boxplot (at a given point in time).
The ANN already starts to get a decent score with only one hidden layer of 50 neurons. Using one hidden layer with more neurons yields only minimal improvements in return. Two layers of 100 neurons each, however, still boast great performance, getting a decent 90% on the validation set and 80 to 85% on the test set. The boxplot above is the result of such a neural network with 2 hidden layers.
Any outliers (three of them) are always on the bottom of the boxplot and are to be considered as “lucky” models. I omitted them for clarity.
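For readers who want to reproduce this kind of stability check, the numbers behind a boxplot are easy to compute by hand. The sketch below is only an illustration: the accuracy scores are invented (they are not the results from this project), the function name is hypothetical, and it assumes the common convention of flagging points more than 1.5 times the interquartile range away from the box as outliers.

```python
import statistics


def boxplot_summary(scores):
    """Return quartiles and outliers using the usual 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [s for s in scores if s < lower_fence or s > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Invented accuracy scores from ten hypothetical training runs.
scores = [0.81, 0.83, 0.84, 0.84, 0.85, 0.85, 0.86, 0.87, 0.88, 0.62]
summary = boxplot_summary(scores)
```

A narrow box means that retraining gives roughly the same model every time. A wide box or many outliers suggest the result depends heavily on the random initialization or on how the data was split.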
I spent the most time on tuning parameters such as activation and loss functions, which gave me much more improvement in the results compared to fiddling with the network topology itself.
Take the activation function, for example. This is the function responsible for producing the output of a node (i.e. a neuron) depending on the input it receives, using the weights as a base for that calculation. Depending on the output, the activated node contributes more or less to the nodes that follow, and so on.
A neuron is said to be firing, to a greater or lesser degree, or not firing, depending on that output.
Now there are a number of activation functions out there and H2O.ai leverages some of them out of the box.
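As a rough illustration of what such a function does, here is a small sketch of a single neuron with three common activation functions. To my knowledge, tanh, rectifier and maxout (each with a dropout variant) are among the options H2O exposes for its deep learning models, but the code below is plain Python written for this explanation, not H2O's code, and the inputs, weights and biases are made up.

```python
import math


def weighted_sum(inputs, weights, bias):
    """The raw input of a neuron: inputs times weights, plus a bias."""
    return sum(i * w for i, w in zip(inputs, weights)) + bias


def tanh(z):
    """Squashes the raw input into the range (-1, 1)."""
    return math.tanh(z)


def rectifier(z):
    """Passes positive input through unchanged and outputs 0 otherwise."""
    return max(0.0, z)


def maxout(inputs, weight_sets, biases):
    """Computes several weighted sums and keeps the largest one."""
    return max(weighted_sum(inputs, w, b) for w, b in zip(weight_sets, biases))


# Made-up values for one neuron with three inputs.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.6]
bias = 0.1

z = weighted_sum(inputs, weights, bias)
tanh_out = tanh(z)            # negative, so this neuron "pushes back"
rectifier_out = rectifier(z)  # zero, so this neuron does not fire
maxout_out = maxout(inputs, [weights, [0.1, 0.1, 0.1]], [bias, 0.0])
```

The same raw input gives three different outputs, which is why swapping the activation function can change the behaviour of the whole network without touching its topology.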
To find the best activation function for our model, I looked into them in detail and experimented with them. Putting them to the test is easy: I quickly went through 50 iterations of producing a model, trying a different activation function and printing out the boxplot.
Et voilà, the result:
Not going into too much detail here, but you’ll notice that just by swapping the activation function the boxplots have an entirely different layout. To be fair, the dataset was still very imbalanced at the moment I created these plots and the data on some games was very sparse. Think of Puckman with no button presses registered. Maxout with dropout, the boxplot down on the left, handles these problems much better and produces a much more stable result.
This model is actually the one currently in use in my demo setup. But is it really the best model I produced? Hard to say, since there is a strong contender. I looked into SVM as well after our ToThePoint intern brought in the knowledge from her master’s thesis. Like I said earlier, Support Vector Machines were entirely new to me, but I was more than happy to indulge!
Support Vector Machine
Although I didn’t explain the ANN in any kind of detail, I’ll do a quick rundown of SVM.
First of all, SVM is a binary classification method. Given a set of labelled data points, it will try to find a hyperplane that maximizes the distance between the two classes of data points. The classes may be hard to separate linearly in a two-dimensional space, but this might not be the case in a higher-dimensional space.
The best hyperplane maximizing the margin between the two classes is then selected to build the classifier. The data points are represented as vectors and the ones forming the maximum margin of the hyperplane are called the support vectors.
Now, when we have new, unseen data to classify, we need to map each data point to that higher-dimensional space. This is where the so-called kernel trick comes into play: it helps us to optimize this problem by computing the similarity between data points in that space without performing the mapping explicitly.
The kernel function can even map our data points into an infinite-dimensional space if needed. There are a couple of kernels out there and in our case, we used RBF (Radial Basis Function).
The SVM algorithm can be used in a multi-class classification problem with a couple of techniques such as the “one vs all” approach.
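For the curious, the RBF kernel itself is a one-liner: it turns the squared distance between two points into a similarity between 0 and 1, with `gamma` controlling how quickly that similarity drops off. The sketch below uses the standard formula exp(-gamma * ||x - y||²). The function name and the sample points are made up for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """Similarity between two points: 1.0 when identical, towards 0 when far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Made-up points in a two-dimensional feature space.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
near = rbf_kernel([1.0, 2.0], [1.5, 2.0], gamma=0.5)
far = rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma=0.5)
```

A larger `gamma` makes the similarity drop off faster, so each training point only influences its immediate neighbourhood. That is one reason why this parameter has such an impact on overfitting.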
Let’s look at some results. First off, check out this accuracy boxplot representing the stability of our model.
The next picture shows a confusion matrix of our resulting model.
If you look carefully, you’ll notice that the confusion matrix has a different layout compared to the matrices I showed you previously. And yes, this is because H2O.ai doesn’t offer any SVM support (yet?). Therefore, I turned to scikit-learn to implement this algorithm.
To use the SVM algorithm I need to take additional preprocessing steps, such as scaling the data so that each feature maps to a range of [0, 1]. Another thing we need to change is the output of the data: categorical attributes should be mapped to numeric data, because each data instance is to be represented as a vector.
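These two preprocessing steps can be sketched in a few lines of plain Python. This is an illustration of the idea, not the code used in the project: the feature values are invented, the function names are hypothetical, and in practice scikit-learn's own preprocessing classes would be the more obvious choice.

```python
def fit_min_max(rows):
    """Learn the (min, max) of every feature column from the training data."""
    columns = list(zip(*rows))
    return [(min(column), max(column)) for column in columns]


def apply_min_max(rows, ranges):
    """Rescale every feature to [0, 1] using previously learned ranges."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, (low, high) in zip(row, ranges)
        ])
    return scaled


def encode_labels(labels):
    """Map each category name to an integer, in alphabetical order."""
    mapping = {name: index for index, name in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Invented feature rows (two features per sample) and their labels.
train_rows = [[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]]
train_labels = ["Street Fighter", "Mortal Kombat", "Puckman"]

ranges = fit_min_max(train_rows)
scaled_rows = apply_min_max(train_rows, ranges)
encoded_labels, label_mapping = encode_labels(train_labels)
```

Note that the ranges are learned once on the training data and then reused for new data. Fitting the scaler again on the test set would quietly leak information and make the scores look better than they are.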
As stated earlier, we used RBF as kernel method. RBF is a good default kernel to start with when dealing with data that needs mapping to a higher dimensional space.
Remember that earlier on the PCA analysis already gave me a hint for this.
An SVM with an RBF kernel needs two hyperparameters: C, the penalty for misclassified points, and gamma, the width of the kernel. Finding the ideal hyperparameters is tricky, and I use nested cross-validation to tune them.
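Nested cross-validation is easier to show than to describe: an inner loop picks the best C and gamma, and an outer loop measures how well that whole tuning procedure generalizes. Below is a minimal scikit-learn sketch under stated assumptions: the data is synthetic (generated with `make_classification`), and the grid values and fold counts are arbitrary choices for illustration, not the ones used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the real dataset.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Scaling lives inside the pipeline so it is refitted on every training fold.
pipeline = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Arbitrary example grid; real ranges depend on the data.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose C and gamma. Outer loop: score the tuned model on held-out folds.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
mean_accuracy = outer_scores.mean()
```

The point of the outer loop is that the folds used for scoring were never seen while choosing the hyperparameters, so `mean_accuracy` is a more honest estimate than the best score reported by the grid search itself.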
As you can see, the accuracy is higher than 92.5% for most classes, except for the Mortal Kombat prediction, which is (obviously) confused with Street Fighter.
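A per-class accuracy like this can be read straight off a confusion matrix: for each class, divide the number of correct predictions (the diagonal) by the number of samples that really belong to that class (the row total). The sketch below assumes rows are the actual classes and columns the predicted ones. The counts and class names are invented and do not correspond to the matrices shown in this post.

```python
def per_class_accuracy(matrix, class_names):
    """Fraction of each actual class (row) that was predicted correctly."""
    result = {}
    for index, name in enumerate(class_names):
        row_total = sum(matrix[index])
        result[name] = matrix[index][index] / row_total if row_total else 0.0
    return result


# Invented counts: rows are actual classes, columns are predicted classes.
class_names = ["class A", "class B", "class C"]
confusion = [
    [48, 2, 0],
    [1, 45, 4],
    [0, 10, 40],
]

accuracy = per_class_accuracy(confusion, class_names)
```

In this made-up example "class C" is mistaken for "class B" ten times out of fifty. That is the kind of confusion between two similar classes that an overall accuracy figure would hide.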
The resulting SVM model’s numbers are pretty good and it scores equally well or even better than the ANN.
I haven’t put it to the test yet, though, and I am really curious how it will perform, first of all on my test set, but even more so when it comes to making real-time predictions.
The problem I see here is that I’ll have much more work putting this model in production (no H2O.ai SVM support here yet unfortunately, so come on guys 😉), and thus it is still on my TODO list.
For that I need to serialize/de-serialize our model, called “pickling” in Python, and then expose it as a service using Flask for instance.
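The pickling part is short. Here is a minimal sketch of a round trip with a scikit-learn SVM, assuming a tiny invented training set and arbitrary hyperparameters. In a real setup the bytes would be written to a file and loaded by the service process (the Flask part is left out here).

```python
import pickle

from sklearn.svm import SVC

# Tiny invented training set: two well-separated classes.
X_train = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
y_train = [0, 0, 1, 1]

model = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train)

# Serialize the trained model to bytes and restore it again.
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should behave exactly like the original.
X_new = [[0.05, 0.05], [0.95, 0.95]]
original_predictions = model.predict(X_new).tolist()
restored_predictions = restored.predict(X_new).tolist()
```

Two practical caveats: only unpickle files you trust, since loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version it was trained with.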
Remember that this is an advantage of H2O.ai which has much better support for this.
H2O.ai Sparkling Water does support SVM, though, and I will check up on their progress in the near future.
Hopefully I find the time this summer to see if it can knock our ANN off its throne during demos.
One small note though: at one point during my efforts I ended up with the following accuracy matrix, which clearly still had room for some improvement.
While I was tweaking our algorithm and getting increasingly improved results, I decided to add a new class to it, yielding the following accuracy matrix.
Notice the fierce drop in accuracy on the Mortal Kombat class. Confident about our model, I again added a new class, Street Fighter, and that one immediately crushed the other class.
Just to show you that you need to keep monitoring, optimizing and tuning your model. Your model is something that breathes and lives and gains maturity over time. Please govern it, and never be too confident about it. When this happens in production this could be a nice “fatality” (pun intended).
I think I managed to get a decent model out of our data. Putting it in production and seeing it actually score really well on our test set is a real joy.
I would never have guessed that it would be able to get 9 out of 10 predictions right when I started this fun project.
I learned a lot about the different algorithms and was able to tune them to get to these results.
So, in summary, these are the key take-aways:
- There is no free lunch in Machine Learning
- Measure the performance of your model and know what the best performance metric is
- Train your model as you will use it
- (Get to) know your model
- Frustration – iteration: it’s an iterative process, be ready for it
- Govern your model when in production, monitor, log, optimize, tune!
I would like to thank our intern, Wanqiu Zhang, big time. She brought some really great insights to the table while feature engineering and building our models. Her input is sincerely valued!
Next week I’ll walk you through the technical architecture and shed some light on how to put the model into production, stay tuned!
Credits: blogpost by Kevin Smeyers, Machine Learning Master at ToThePoint