[I recently moved the Jupyter notebooks for this problem to GitHub. If you want to skip the commentary, you may download the notebooks here: Part 1, Part 2]
This data set was posted to the UCI Machine Learning Repository a couple of weeks ago. It is a straight-up spam/ham classification problem. The original source of the data is here.
The creators of this data set (Tiago, Tulio) collected 1,956 comments posted on five YouTube videos and classified each as spam or ham. We will build a machine learning model that learns from this data and test the model's performance.
The data is spread across five similar but distinct data sets. We will take two passes through it. In our first pass, we will consider only the first data set, based on a video by the artist Psy, and build a simple model using a Naive Bayes classifier.
In our second pass, we will merge all five data sets into one unified data set and build a single model that predicts whether a comment is spam. For this pass, we will try multiple classifiers, pick the one with the best accuracy score, and tune that model further to improve performance.
As we have done in the past, we will follow our established workflow for building and testing machine learning models: read the data, perform data cleanup where necessary, split the data set, transform it as needed, select a model, train the model, test the model, and determine next steps.
- The data set has five columns: comment_id, author, date, content, and class. Though not labeled explicitly, class 1 appears to represent spam and class 0 ham.
- Of these, the only relevant feature is content, with class being the target.
- The columns of interest do not have any missing data, so no data cleanup is necessary.
- The spam/ham distribution is nearly even. This is good: if one class dominated the data set, the model could end up biased toward that class.
- We will train on 80% of the data set and use the remaining 20% as our test set.
- Since the feature column is text, we will need to convert it to a numeric representation. To do this, we will use scikit-learn's CountVectorizer class. CountVectorizer builds a document-term matrix: each row represents a comment, each column a word in the training vocabulary, and each cell holds the number of times that word appears in that comment (mostly zeros, which is why the matrix is stored in sparse form). The result of this operation is a sparse matrix with 1,564 rows (80% of the comments) and 3,810 columns (the number of unique words in the training set); see the first sketch after this list.
- Next, we will build the model:
- For our first pass, we will build a model using the MultinomialNB class (second sketch below).
- For our second pass, we will take a more elaborate approach, choosing from eight different models. We will evaluate each with 10-fold cross-validation and select the model with the best accuracy score; in this case, the DecisionTreeClassifier performed best. We will then tune this model with GridSearchCV to settle on the parameter values that produce the highest accuracy (third sketch below).
- Once the model is built, we will evaluate it on the held-out test set.
- For the Psy data set, our accuracy score was 97%
- For the combined data set, our accuracy score was 93%
- As our final step, we can review the confusion matrix and take a look at the false positives and false negatives (fourth sketch below).
- If our model allows it, we can also list the most and least spammy words, i.e., the words with the highest and lowest spam-to-ham ratios (final sketch below).
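For reference, here is a minimal sketch of the split-and-vectorize step described above. The file name is an assumption based on how the UCI archive packages the data, and the column names follow the description earlier in this post:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Load the first data set; the file name is an assumption
df = pd.read_csv('Youtube01-Psy.csv')

# Hold out 20% of the comments as the test set
X_train, X_test, y_train, y_test = train_test_split(
    df['content'], df['class'], test_size=0.2, random_state=1)

# Learn the vocabulary from the training set only, then transform both sets
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # sparse document-term matrix
X_test_dtm = vect.transform(X_test)
```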
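The first-pass model is then a few lines on top of these variables. MultinomialNB is scikit-learn's Naive Bayes classifier for count features, and accuracy_score gives us the test metric reported above:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Train Naive Bayes on the document-term matrix
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

# Evaluate on the held-out test set
y_pred = nb.predict(X_test_dtm)
print(accuracy_score(y_test, y_pred))
```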
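For the second pass, the model comparison and tuning could look like the sketch below. This is illustrative, not the post's exact setup: only four of the eight candidate classifiers are shown, and the parameter grid is an assumption:

```python
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

candidates = {
    'MultinomialNB': MultinomialNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=1),
}

# Score each candidate with 10-fold cross-validation on the training set
for name, model in candidates.items():
    scores = cross_val_score(model, X_train_dtm, y_train,
                             cv=10, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f}')

# Tune the best performer; this grid is an assumption, not the post's exact one
param_grid = {'max_depth': [None, 10, 25, 50],
              'min_samples_split': [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                    cv=10, scoring='accuracy')
grid.fit(X_train_dtm, y_train)
print(grid.best_params_, grid.best_score_)
```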
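Reviewing the confusion matrix and pulling out the misclassified comments might look like this, continuing from the test-set variables above:

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

# False positives: ham comments the model flagged as spam
print(X_test[(y_pred == 1) & (y_test.values == 0)])

# False negatives: spam comments the model let through
print(X_test[(y_pred == 0) & (y_test.values == 1)])
```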
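Finally, one way to rank words by spam-to-ham ratio for the Naive Bayes model is to use the per-class token counts it stores after fitting. This sketch assumes scikit-learn 1.0+ (for get_feature_names_out) and that class 0 is ham and class 1 is spam, as noted above:

```python
import numpy as np

tokens = np.array(vect.get_feature_names_out())

# Per-class token frequencies; adding 1 avoids division by zero
ham_freq = (nb.feature_count_[0] + 1) / nb.class_count_[0]
spam_freq = (nb.feature_count_[1] + 1) / nb.class_count_[1]
spam_to_ham = spam_freq / ham_freq

# Words with the highest and lowest spam-to-ham ratios
order = spam_to_ham.argsort()
print('most spammy:', tokens[order[-10:]])
print('least spammy:', tokens[order[:10]])
```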