Data analytics and machine learning are used extensively by software companies to obtain business insights. One important application is the recommender system, which lets an application proactively suggest items that users are likely to be interested in. In music streaming, Spotify has pioneered song-user matching with a hybrid approach to predicting users’ preferences. KKBox, a major competitor of Spotify in Asia, is trying to catch up and has announced a data science challenge on Kaggle.
The objective of our project is to take on this challenge: given a song that a KKBox user has listened to once, predict whether the user will listen to it again within the following month. Our predictions are benchmarked against those of participants worldwide on Kaggle. The rest of this report describes the dataset, data preprocessing and model training in detail, then evaluates and discusses the results of the different approaches.
The training set contains 7,377,418 examples, associated with 34,379 users and 2,296,833 songs. The three raw data sources (training examples, users and songs) are stored in separate .csv files.
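Because the examples, users and songs live in separate files, they have to be joined on their shared keys before features can be built. The following is a minimal single-machine sketch of such a join using only the Python standard library. It is not the Spark code used in the project, and the column names (msno, song_id, target, city, song_length) and the tiny in-memory tables are assumptions made for illustration.

```python
import csv
import io

# Tiny in-memory stand-ins for the three .csv files. The column names
# and values are assumptions for illustration, not the real schema.
TRAIN_CSV = "msno,song_id,target\nu1,s1,1\nu1,s2,0\nu2,s1,1\n"
MEMBERS_CSV = "msno,city\nu1,13\nu2,5\n"
SONGS_CSV = "song_id,song_length\ns1,200000\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def index_by(rows, key):
    """Build a lookup table from a key column to its row."""
    return {row[key]: row for row in rows}


def join_sources(train_rows, member_rows, song_rows):
    """Left-join user and song attributes onto each training example."""
    members = index_by(member_rows, "msno")
    songs = index_by(song_rows, "song_id")
    joined = []
    for event in train_rows:
        row = dict(event)
        # Keep the example even when user or song metadata is missing.
        for key, value in members.get(event["msno"], {}).items():
            row.setdefault(key, value)
        for key, value in songs.get(event["song_id"], {}).items():
            row.setdefault(key, value)
        joined.append(row)
    return joined


joined = join_sources(
    read_rows(TRAIN_CSV), read_rows(MEMBERS_CSV), read_rows(SONGS_CSV)
)
```

A left join is used here so that examples whose song or user is absent from the metadata files are kept rather than silently dropped; whether to keep or discard such rows is a preprocessing decision.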
With this many records, training a model on a single machine can take days. Apache Spark and its MLlib library are well suited to datasets of this size and demonstrate the power of distributed computing in data analytics tasks.
Real-life datasets typically contain significant noise, corruption and inconsistency. To make unbiased and accurate predictions, we put considerable effort into cleansing and transforming the data so that it follows an organized structure that meets MLlib's requirements and makes good use of Spark. The overall workflow is shown below.
After preprocessing, we split the data into an 80% training set and a 20% evaluation set. We then feed the training set to four different models: Random Forest, Gradient-Boosted Trees, Naive Bayes and Collaborative Filtering. Features are mapped to a Vector and combined with the labels to form LabeledPoint instances, as required by the MLlib models.
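The split and the feature mapping described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's Spark code: LabeledExample is a hypothetical stand-in for MLlib's LabeledPoint, and the feature names, the target column and the seed are invented for the example.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for MLlib's LabeledPoint: a label plus a
# numeric feature vector.
LabeledExample = namedtuple("LabeledExample", ["label", "features"])


def to_labeled(row, feature_names, label_name="target"):
    """Map one preprocessed row to a (label, feature vector) pair."""
    features = [float(row[name]) for name in feature_names]
    return LabeledExample(label=float(row[label_name]), features=features)


def split_80_20(examples, seed=42):
    """Shuffle reproducibly, then cut into 80% training / 20% evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Ten made-up rows; the column names are assumptions for illustration.
rows = [
    {"city": i % 5, "song_length": 180000 + i, "target": i % 2}
    for i in range(10)
]
examples = [to_labeled(r, ["city", "song_length"]) for r in rows]
train_set, eval_set = split_80_20(examples)
```

In Spark, `RDD.randomSplit([0.8, 0.2])` plays the role of `split_80_20`, with the difference that it samples each record independently and so yields approximately, rather than exactly, the requested proportions.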
We have built several models for predicting repeated listens on the test dataset. All of them achieve over 50% accuracy, which meets our target. In fact, Random Forest and Gradient-Boosted Trees already beat the benchmark set by the Kaggle competition (60%).
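For reference, accuracy here means the fraction of evaluation examples whose predicted label matches the true label. A minimal sketch of the comparison against the 60% benchmark is given below; the function names and the sample predictions are made up for illustration and are not the project's results.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def beats_benchmark(score, benchmark=0.60):
    """True when the score is strictly above the benchmark."""
    return score > benchmark


# Made-up predictions and labels, purely to exercise the functions.
sample_predictions = [1, 0, 1, 1, 0]
sample_labels = [1, 0, 0, 1, 0]
sample_score = accuracy(sample_predictions, sample_labels)
```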