In the last blogpost we discussed the problem and started describing our task environment.
We structured our data, got rid of the noise, and already identified some difficulties we’ll have to cope with. More precisely, the task environment for our classification agent is quite complex, because of the many unknown factors.
For instance, each game has multiple options, multiple levels and multiple characters you can choose from and each human player has a different skill level depending on the game.
Next, we suggested a couple of approaches to do the actual machine learning.
The k-NN (k-nearest neighbors) approach was a possible candidate, but we believe we are missing a couple of features to use this method to its full extent. For instance, both the player's identity and their skill level would be nice extras.
Our gut feeling pointed us in the direction of option 3 though.
Feature Engineering: final step
We split the time series into intervals and discretize the values in each window. We'll slide that window over the entire game in order to extract a training set, a validation set and a test set. There are a couple of options for the discretization, one of which is to discretize by frequency.
It is easy to implement and should give us fast results.
In short, look at the following diagram of a Mortal Kombat game:
We see 2 sliding window approaches, fixed size and increasing size.
Fixed size sliding windows approach:
Step 1: Define a window of x seconds.
Step 2: Gather data from the controller usage in that window.
Step 3: Shift the window y seconds further.
Step 4: Gather data from the controller usage again.
Repeat steps 3 and 4 until you hit the end of the entire game’s time series.
As a result of following the above procedure, all of our windows are of equal size.
We can choose to divide the game into a fixed number of equal intervals, or simply use a fixed size of x seconds.
This windowed approach is shifted as indicated with the red arrows in the picture.
To science it up a little, this approach is defined as follows:
Given a time series T, a collection of timestamps, a fixed window size x and a sliding step y, we get the windows [0, x], [y, x + y], [2y, x + 2y], and so on: in general [ky, x + ky] for k = 0, 1, 2, …
The advantage of this approach is that certain patterns or subsequences might get discovered faster because of the small windows and the fact that the time dimension is the same for each interval record.
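A minimal sketch of the fixed-size variant, assuming the controller log is a list of (timestamp, control) events; all names here are illustrative, not our actual code:

```python
# Fixed-size sliding windows: window of x seconds, shifted by y seconds.
def fixed_windows(events, game_length, x, y):
    """Yield the controls used in each window [start, start + x)."""
    start = 0.0
    while start + x <= game_length:
        yield [c for (t, c) in events if start <= t < start + x]
        start += y

# A toy 5-second game: (timestamp, control) pairs.
events = [(0.5, "jd"), (1.2, "b1"), (2.8, "jl"), (4.1, "b1")]
windows = list(fixed_windows(events, game_length=5.0, x=2.0, y=1.0))
```

Every window spans exactly x seconds, and consecutive windows overlap when y is smaller than x.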
Incremental size sliding windows approach:
The next option is incrementing the window size by y seconds each step and counting the frequencies over that growing window. The idea behind this approach is that we can fade out level differences in some particular games.
Given a time series T, a collection of timestamps, an initial window size x and a sliding step y, we get the windows [0, x], [0, x + y], [0, x + 2y], and so on: in general [0, x + ky] for k = 0, 1, 2, …
Choosing this option might come with a cost as we might lose critical information. But on the positive side, we might have a better generalization of the button usages over an entire game level.
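The incremental variant can be sketched in the same illustrative event format: the window always starts at 0 and only its end grows by y seconds per step.

```python
# Incremental sliding windows: the window end grows, the start stays at 0.
def incremental_windows(events, game_length, x, y):
    """Yield the controls used in each window [0, end), end = x, x + y, ..."""
    end = x
    while end <= game_length:
        yield [c for (t, c) in events if t < end]
        end += y

events = [(0.5, "jd"), (1.2, "b1"), (2.8, "jl"), (4.1, "b1")]
windows = list(incremental_windows(events, game_length=5.0, x=2.0, y=1.0))
# The last window covers the whole game, so early patterns get averaged out.
```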
An example! Let’s look at my old-time favorite, Sonic 2. In a level called Spring Yard Zone 1, Sonic starts by running to the right and jumping like usual.
A regular game storyboard for Sonic, but at a certain point he has to get past moving blocks. He doesn’t really have to do much: the player just has to wait on a moving block and step off at the right time to descend further into the level and continue his way to the end, as he would in the normal storyboard.
Let’s dig a little further into the moving block scenario. In the fixed size sliding window approach, we might capture only a few button and joystick movements, label them as Sonic and feed them to our model. Although the moving block scenario is not very “characteristic” for Sonic, it might still be a pattern that the model can use to recognize that it is actually Sonic that you are playing.
In the incremental size sliding windows approach the moving block scenario will be averaged out by its previous movements, representing gameplay that is actually more characteristic of Sonic.
Whatever approach we choose, it is imperative that we handle the predictions the same way as we trained the model. In our case with limited datasets, (yes, our dataset is limited although we had plenty of volunteers playing games), we notice that the incremental size sliding windows approach gives us the best results, so we continue to use this approach and never look back.
Ok, we defined our sliding windows approach, but what kind of data can we gather from the controllers…? We take the easy way out here and just count the number of times a certain button is used in that interval window. This gives us for instance the following output:
As you’ll notice there is no button usage in this example, and indeed, this window is labelled as Puckman. Unfortunately, we lose some important information here… the button combinations get lost in translation, a common problem when changing data dimensionality. To solve this issue, we’ll introduce some cross features to make that information visible to the model.
Each time a control is pressed in combination with another control, the count for that particular combo is increased. For instance, consider the following case where both joystick down and button one are pressed simultaneously.
We increase the jd (joystick down) feature by one and the b1 (button one) frequency by one. Finally, the new cross feature comes into play: we increase the jd_b1 feature (joystick down in combination with button one) by one as well.
In this way we represent the gameplay by a combination matrix, where each row represents a control and the values in that row are the numbers of combinations with the other controls.
The diagonal of the matrix holds the number of combinations of a control with itself, so these represent the single-control features. Each number to the right of the diagonal is then a cross feature.
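A sketch of how such a combination matrix could be built; the input format (one set of simultaneously pressed controls per sampled tick) is an assumption on my part, not our actual pipeline:

```python
import itertools

CONTROLS = ["jd", "jl", "jr", "b1", "b2", "b3", "b4", "b5", "b6"]
IDX = {c: i for i, c in enumerate(CONTROLS)}

def combination_matrix(samples):
    """samples: one set of simultaneously pressed controls per tick."""
    n = len(CONTROLS)
    m = [[0] * n for _ in range(n)]
    for pressed in samples:
        for c in pressed:
            m[IDX[c]][IDX[c]] += 1     # diagonal: single-control count
        for a, b in itertools.combinations(pressed, 2):
            m[IDX[a]][IDX[b]] += 1     # cross feature, e.g. jd_b1
            m[IDX[b]][IDX[a]] += 1     # keep the matrix symmetric
    return m

# jd pressed 3 times, twice together with b1.
m = combination_matrix([{"jd", "b1"}, {"jd"}, {"jd", "b1"}])
```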
Here is an example of a combination matrix of a Mortal Kombat game. Notice that the player never used button four and five in this window interval.
Let’s have a look at the first line of the combination matrix.
Joystick down was used 94 times in total, which maps to feature “jd”.
In combination with joystick left, 48 times, which maps to feature “jd_jl”.
In combination with joystick right, 19 times, which maps to feature “jd_jr”.
In combination with button one, 4 times, which maps to feature “jd_b1”.
In combination with button two, 0 times, which maps to feature “jd_b2”.
In combination with button three, 14 times, which maps to feature “jd_b3”.
In combination with button four, 0 times, which maps to feature “jd_b4”.
In combination with button five, 0 times, which maps to feature “jd_b5”.
In combination with button six, 24 times, which maps to feature “jd_b6”.
Notice that, for instance, 24 appears both in the upper right and in the lower left corner. This makes total sense: if joystick down is used in combination with button six, then button six is used in combination with joystick down as well, so the matrix is symmetric.
Extracting all features from this combination matrix yields the following graphs. In this case I picked 4 random windows of Mortal Kombat games.
If you compare this graph with the previous ones, you’ll notice it shows a bit more similarity. You’ll also notice we added 2 more features here, namely the total button presses overall and the total joystick movements overall. I’ll explain further on why I came up with these two.
Looking at some other games now: in the following graph I picked Galaxian, Streetfighter, Puckman and Bomberman as examples.
You’ll notice that indeed there are some clear distinctions between the 4 games. Although…, have a look at the previous Mortal Kombat graph and the StreetFighter graph. They are virtually the same 🙁 And yes, getting both these games into the model was a real challenge.
Am I happy yet?
I’m writing this blogpost like I had all things figured out before even beginning my machine learning assignment. Of course, this is not the case, it was in fact an iterative process, each time criticizing what would be the best option and moving on from there.
For instance, remember our Hello World problem from last post?
The 4 games I picked as an example were easy to recognize, so I didn’t need any cross features. I got a decent model with only 11 features!
I then added a new game to the classification model and everything went pear-shaped from then on. So that’s why I added additional features into the mix. Some features I kept, other features were ditched from the list.
For instance, there was a feature constructed on the window size to game time ratio. The problem arising there was that this feature became too dominant. Why? Well, a closer look immediately exposed the reason: some games are just shorter in game time, but ‘more intense’ than other games, and as such their window sizes are less spread out. The feature was briefly tested but got kicked out of our feature set quite swiftly.
Principal Component Analysis
What I also did, trying to gain some more insights into our data, was a principal component analysis. PCA is actually a really good name, since it clearly states what it does: it identifies the principal components in your data.
A small side note here: a principal component is not a feature.
Let me explain briefly how we can use PCA in our case.
If you take a close look at our data you’ll notice that we have no negative button presses and no negative amount of joystick movements… uhum.
Now imagine I just look at the total amount of button presses and joystick movements to simplify our case even further. This way I can make a 2D representation of our data points. Then, when plotting our data observations on an X/Y axis, all data points will land in the upper right quadrant of the plot.
Something like this:
Now what I want to do is find the line where the data is most spread out when projected onto it. The further the data points are spread out, the easier it will be to make distinctions between them.
Because ‘spread out’ is not really something you can grasp, the actual measure of spread is called the variance in statistics. Now, let’s say you can actually find a line that maximizes the variance between all data points…. Well, that exact line is the principal component in our example. And that line would serve as a much better x-axis than the one we used.
If I now calculate the principal component that is perpendicular to the x-axis I found, I get myself a new y-axis. I’m not going into mathematical detail about how a principal component is calculated, but the ‘lines’ I mentioned earlier can be computed, and the one with the biggest value (eigenvalue) is considered the first principal component.
Now, all this sounds very exciting and you’re probably wondering why on earth you would want to use this analysis (are you indeed?). Well, let’s say you found a third principal component but with a very small eigenvalue, practically equal to zero. What can you conclude looking at it? You found a decent x-axis, a new y-axis, but the third component which maps to the z-axis is barely contributing to your new axis system. So basically, you can get rid of this dimension.
In our example, let’s say you have 3 dimensions, total button presses, total joystick movements and the level of cheering registered by an IoT noise sensor.
Doing principal component analysis, you’ll notice that the third dimension, the level of cheering, doesn’t really contribute and is pretty useless…. So, you can confidently boot this feature from your data.
In conclusion, with principal component analysis you can estimate and visualize the amount of information your data contains.
PCA – the real deal
Now, in the previous section I kind of tried to explain principal component analysis by dumbing it down a bit. Let’s get back to our case study and see what we can visualize applying PCA to our data.
For this I made use of the scikit-learn open source framework. So, get set for some Python code. Scikit-learn is really powerful and has great documentation, so you should definitely look into it when doing PCA. The user guide gives much more detailed information (and examples!) about what you can do with PCA.
Import python libraries
Import our data (preprocessed in a csv).
List the labels in our dataset
Transform the full data and then standardize the dataset.
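The loading and standardization steps can be sketched as follows. Since the original csv isn't reproduced here, a tiny inline stand-in takes its place; the column names and labels are my assumptions:

```python
from io import StringIO

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Inline stand-in for the preprocessed csv (in reality: pd.read_csv on the
# real file), with one row per window and a "label" column per game.
csv = StringIO(
    "jd,jl,b1,label\n"
    "94,48,4,MortalKombat\n"
    "10,2,0,Puckman\n"
    "80,40,6,StreetFighter\n"
)
df = pd.read_csv(csv)
labels = sorted(df["label"].unique())        # list the games in the dataset

X = df.drop(columns=["label"]).values
X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance
```

Standardizing matters here: PCA maximizes variance, so without it the features with the largest raw counts would dominate the components.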
In order to make a plot I only consider the 2 most important principal components so that I can plot the PCA in 2D.
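The projection onto the 2 strongest components looks roughly like this; a random matrix stands in for our standardized feature matrix, which isn't reproduced here:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the standardized feature matrix from the previous step.
rng = np.random.default_rng(42)
X_std = rng.normal(size=(200, 20))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)  # share of variance per component
```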
As you can see above, the best principal component only explains 28% of the variance between data points. This is a low score for a principal component; ideally, the first components should already account for the majority of the explained variance.
Now let’s plot this to a 2D graph.
The scatter plot looks like the following:
The beauty of this graph is that everyone can see the real challenge in our case study. The two bad boys Streetfighter and Mortal Kombat are giving me a hard time. 🙂 They are scattered everywhere and the 2 principal components with the biggest eigenvalues have trouble explaining the variance between those data points.
Instead of reducing dimensions like in the previous case, we see that all features will be much needed in order to make a decent classification…. Of course, this doesn’t come as a surprise at all, since the games are virtually the same and we knew upfront that our constraint of only using joystick movements and button usage would be a hard nut to crack.
Am I there yet?
Unfortunately, not…. We have another problem. Imbalanced data. This will be a problem we’ll have to deal with in most machine learning challenges, and this case study is no different.
To train a model, any model in fact, you need a heck of a lot of data and not just any data, but real quality data. Why? Well, a model is basically a pipeline… if you feed the pipeline crappy data, you’ll get crappy predictions… the “shit in, shit out” paradigm definitely applies here.
Take a look at this output from my preprocessing step. It lists the 7 most popular games in the dataset. There are only 88 high quality games of Bomberman left after filtering and cleansing the dataset… on the other side of the spectrum is Streetfighter with 446 quality games; it’s one of the most popular games on our arcade cabinet.
Now what’s the problem here? Easy to explain… let’s say we’re doing a binary classification between Bomberman and Streetfighter. Would you be happy with an 80% test set score for your model? Maybe, but the real problem is that if we guess Streetfighter no matter what, we already score more than 80% on our test set!
This problem is called the accuracy paradox and is one of the typical consequences of imbalanced datasets, where the majority class starts to dominate the smaller classes in the trained model.
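The baseline follows directly from the game counts in the preprocessing output above:

```python
# Majority-class baseline for the Bomberman vs Streetfighter example,
# using the quality-game counts from the preprocessing output (88 vs 446).
bomberman, streetfighter = 88, 446
baseline = streetfighter / (bomberman + streetfighter)
print(f"always guessing Streetfighter already scores {baseline:.1%}")
```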
Of course, the “80% score” on the test set was kind of a shortcut on my part, because the accuracy of a model in a binary classification is typically calculated differently, and a bad model like that would never score well against those metrics.
For instance, ROC curves and AUC scores, F1 scores and confusion matrices would immediately set off flashing warning lights in this case.
There are a couple of ways to cope with this issue, one of which is to just gather more data. And yes, I successfully organized an Arcade game tournament amongst the ToThePoint colleagues in order to gain more data. The showdown finalists know how to play the games really well, so the data quality came as a bonus as the rounds progressed all the way to the finals.
Another way to handle this problem is by over- or undersampling the dataset, which I did as well. Some frameworks support this out of the box, meaning it can be done with minimal effort.
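For instance, oversampling the minority class can be done with scikit-learn's resample helper (imbalanced-learn's RandomOverSampler does the same job); the arrays below are stand-ins for the real windowed feature vectors:

```python
import numpy as np
from sklearn.utils import resample

# Stand-in feature matrices: 88 minority vs 446 majority windows.
X_bomberman = np.zeros((88, 5))
X_streetfighter = np.ones((446, 5))

# Draw with replacement until both classes are the same size.
X_bomberman_up = resample(X_bomberman, replace=True,
                          n_samples=len(X_streetfighter), random_state=0)
# Both classes now contribute 446 samples to the training set.
```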
If you’re still reading this, my deepest sympathy goes out to you. Yes, 2 blogposts about the data gathering and feature engineering step… does this come as a surprise to you? I sincerely hope not, since you’ll have to take into account that feature engineering is and will always be the most complex and most important part of your machine learning trip to perfection.
AI, and machine learning in particular, doesn’t come cheap and there really is no free ticket. It is not just a matter of handing the data over to a framework, letting it go and seeing what happens.
A good machine learning model is the result of a nice engineering challenge solved by people with the right skills and business knowledge. It is not called Data Science just because the name sounds cool… there is definitely more to it than meets the eye.
Now enough said about our data, next blogpost I’ll dive into our models!
Will you meet me there?
Credits: blogpost by Kevin Smeyers, Machine Learning Master at ToThePoint