Are you ready for your internship at ToThePoint?
KMeansK - Build a web visitor clustering pipeline using the KMeans algorithm
Build a real-time processing pipeline for webserver logs that uses K-Means clustering to categorise web visitors based on their behaviour, with a real-time visualisation of the clustering algorithm at work.
Description of the assignment
- Our ToThePoint website generates a fair amount of webserver logs, and these hide a treasure of information. To get a general idea of who our visitors are, we’d like to start clustering them into categories.
- This raw data will be captured and stored in a Hadoop cluster. At the same time, we’d like to publish it as Kafka events so we can ‘investigate’ it with stream processing and check whether our visitors can be categorized.
- To accomplish this, we’re looking to develop a KMeans algorithm.
- Before we can unleash a machine learning algorithm, we need to fine-tune the data by implementing some pre-processing.
- Once the pre-processing and the actual processing (KMeans) are in place, we’d like this algorithm to be visualized in real time, because KMeans lends itself well to visualizing real-time data in a 2D environment: you can see where the centroids are and how they move. We’re looking to build this using ReactJS and D3.js.
- Iteratively design and develop a machine learning model
- Develop a data processing pipeline
- Pre-processing steps
- KMeans algorithm
- Put your design into production
- Build a real-time live visualization of the results
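To give a feel for the pre-processing and KMeans steps listed above, here is a minimal Python sketch. It is not the pipeline you will build: the record layout `(visitor_id, path, response_bytes)`, the two chosen features (request count and mean response size) and the function names `visitor_features` and `kmeans` are all assumptions made for illustration, and a real implementation would run on streaming data rather than an in-memory list.

```python
import math
import random
from collections import defaultdict


def visitor_features(records):
    """Pre-processing sketch: turn log records into one 2D point per visitor.

    Each record is assumed to be (visitor_id, path, response_bytes).
    The resulting point is (number of requests, mean response size in KB).
    """
    counts = defaultdict(int)
    total_bytes = defaultdict(int)
    for visitor, _path, size in records:
        counts[visitor] += 1
        total_bytes[visitor] += size
    return {
        v: (float(counts[v]), total_bytes[v] / counts[v] / 1000.0)
        for v in counts
    }


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length tuples.

    Returns (centroids, assignments), where assignments[i] is the index
    of the centroid that points[i] was assigned to in the last
    assignment step.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, assignments


if __name__ == "__main__":
    # Made-up sample records, purely for demonstration.
    logs = [
        ("10.0.0.1", "/", 1200), ("10.0.0.1", "/jobs", 900),
        ("10.0.0.2", "/", 1100),
        ("10.0.0.3", "/", 1000), ("10.0.0.3", "/blog", 45000),
        ("10.0.0.3", "/blog/1", 52000), ("10.0.0.3", "/blog/2", 48000),
    ]
    features = visitor_features(logs)
    centroids, labels = kmeans(list(features.values()), k=2)
    for visitor, label in zip(features, labels):
        print(visitor, features[visitor], "-> cluster", label)
```

The centroid positions after each iteration are exactly what a 2D visualization would plot, which is why the update step keeps them as plain coordinate tuples.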
What you will gain
- Learn how to design an end-to-end data processing pipeline
- AND put this in production
- Gain knowledge about stream processing
- Gain knowledge and experience in machine learning
- You will get to know Hadoop
- You will gain experience in powerful visualization libraries such as D3.js
- That lovely feeling you get knowing your design will be effectively used in production
What you need
- You have shown an interest in a challenging but instructive assignment
- You’d like to explore Machine Learning and stream processing techniques
- Using Spark, Python or Scala does not scare you at all
- You know what ReactJS is, or are eager to learn
- You like to learn about data visualization
- You like to learn a heck of a lot in a relatively short period of time
Technologies you'll be using
Location of your assignment
Veldkant 33B, 2550 Kontich
Kevin Smeyers – Technical lead machine learning ToThePoint