Case Study Three

Spam Email Classifier Using Clustering and Naive Bayes

David Grijalva, Nicole Norelli, & Mingyang Nick YU

10/01/2021

Abstract

The following deliverable investigated the classification of emails as Spam or non-Spam. Feature extraction was performed via Count Vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer. Naive Bayes was the primary statistical method used for classification. Additionally, KMeans clustering was explored as a way to create new features and improve classification results. Ultimately, both the model using the TF-IDF Vectorizer alone and the model combining the TF-IDF Vectorizer with an additional KMeans clustering feature were recommended to IT for a final decision.

1. Introduction

This case study focused on predicting whether an email was "Spam" (junk mail) or "not Spam" (referred to as "ham" in the provided data) using Naive Bayes. The data consisted of 9,353 sample emails labeled as either Spam or not Spam. The text from each email was first cleaned and then vectorized. Both a Count Vectorizer and a TF-IDF Vectorizer were implemented to compare their results when used with Naive Bayes for classification. In addition, KMeans clustering was used to create new features and determine whether those features could improve the results of the Naive Bayes classification.

Vectorizers

After unnecessary text was removed, the text was vectorized. Two techniques were explored in this case study. The first, Count Vectorizer, tokenizes the text and then creates a matrix of token counts. This matrix indicates how many times each token appeared in each email. The second technique, TF-IDF Vectorizer, applies the same technique as Count Vectorizer, but follows with a transformation to normalize the count matrix. Instead of a raw representation of the word counts, a formula is applied to more heavily weight the words that appear less frequently within the corpus. TF-IDF for a term (t) in a document (d) is calculated by multiplying the term frequency (tf) in each document by the inverse document frequency (idf) of the term:

$tfidf(t,d) = tf(t,d) \cdot idf(t)$

The inverse document frequency is calculated by taking the log of the total number of documents divided by the document frequency of the term and then adding 1. The purpose of adding 1 is to ensure that terms that occur in all documents are not ignored.
$idf(t) = \log\frac{n}{df(t)} + 1$
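
To make the two representations concrete, the following minimal sketch (assuming scikit-learn and a hypothetical three-email corpus) contrasts the raw counts produced by CountVectorizer with the re-weighted values produced by TfidfVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; the actual study used the cleaned email bodies.
corpus = [
    "free money win money now",
    "meeting agenda for monday",
    "win a free vacation now",
]

# Raw token counts: each row is a document, each column a vocabulary term.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TF-IDF: counts are re-weighted so terms common across the corpus
# contribute less than terms that appear in only a few documents.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)
print(tfidf.toarray().round(3))
```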

Naive Bayes

Once a vectorizer was applied, Naive Bayes was used to perform the classification task for this case study. It is based on Bayes Rule:

$P(A|B) = \frac{P(B|A)P(A)} {P(B)}$


This rule can be read as "the probability of A given B is the probability of B given A times the probability of A all divided by the probability of B." Bayes Rule can be extended for multiple variables:

$P(A|B,C) = P(A|\mathbf{x}) = \frac{P(\mathbf{x}|A)P(A)}{P(\mathbf{x})}, \quad \text{where } \mathbf{x} = (B, C)$


An assumption of Naive Bayes is that all "x" variables are independent. For this case study, Multinomial Naive Bayes was used to calculate the probability of each email being either Spam or not Spam. Multinomial Naive Bayes is used for multinomially distributed data, such as word vectors. For each email, the probability of each of the two outcome classes (Spam or not Spam) is calculated using each feature (token) in the email, and the class with the higher probability is chosen for the classification result. For example, if the outcome Spam is represented by "S" and not Spam by "R" for an email with three words represented by "B", "C", and "D":

$P(R|B,C,D) \propto P(R)P(B|R)P(C|R)P(D|R)$

$P(S|B,C,D) \propto P(S)P(B|S)P(C|S)P(D|S)$


The class with the higher score (argmax) is chosen for classification of the email.
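
As a minimal sketch of this scoring step (assuming scikit-learn and a hypothetical labeled toy corpus), MultinomialNB computes the per-class scores internally; predict returns the argmax class, and predict_proba exposes the normalized class probabilities.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; the study used the full email corpus.
texts = ["win free money now", "project meeting at noon",
         "free vacation win now", "lunch meeting tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# alpha is the smoothing term that handles tokens unseen in a class.
nb = MultinomialNB(alpha=0.1)
nb.fit(X, labels)

new_email = vec.transform(["free money meeting"])
print(nb.predict_proba(new_email))  # per-class probabilities
print(nb.predict(new_email))        # argmax class
```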

Cluster Analysis

KMeans Cluster

In an attempt to improve classification results, cluster analysis was used to create new features. Each instance was assigned to a cluster, and this new feature was incorporated into the Naive Bayes classification. Cluster analysis is an unsupervised technique, meaning there are no targets for the data. It is used to find structure and relationships, and a number of different methods can be used for clustering. All methods are based on the idea of distance and finding data that are "close" to each other, and many different ways of calculating distance can be applied. For this case study, the KMeans method was explored. With the KMeans algorithm, the number of clusters is first specified, and then the data are separated into that number of groups by minimizing the within-cluster sum-of-squares (Scikit-learn documentation). Centroids for each cluster are first chosen randomly, and instances are assigned to the nearest centroid. The centroids are then updated and instances are relabeled, and the process repeats until the centroids stop moving. This results in the specified number of clusters. Each instance is labeled with its cluster, and the cluster feature can then be incorporated into the classification technique.
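
A minimal sketch of the KMeans procedure, using hypothetical two-dimensional data in place of the vectorized emails, is shown below; fit_predict returns the cluster label that becomes the new feature for each instance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data standing in for the vectorized emails.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Specify the number of clusters, then minimize within-cluster sum-of-squares.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

print(cluster_labels[:10])   # cluster id assigned to each instance
print(kmeans.inertia_)       # within-cluster sum-of-squares
```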

Determining the number of clusters to specify for the KMeans model is not as straightforward as selecting the model with the lowest inertia (mean within-cluster sum-of-squares), because inertia decreases as the number of clusters increases. To get a basic idea of the best number of clusters, a graph of the number of clusters versus inertia can be examined; typically, an elbow in the graph indicates an ideal number of clusters. A more refined alternative is to use a silhouette score (Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow). A silhouette score is calculated using the silhouette coefficient:

$\text{Silhouette Coefficient} = \frac{b - a}{\max(a, b)}$



Where "b" is the mean nearest-cluster distance and "a" is the mean intra-cluster distance (Scikit-learn documentation). Values range from -1 to 1, with 1 being best and -1 being worst. Values around 0 suggest overlapping clusters, and negative values suggest instances assigned to wrong clusters. Silhouette coefficients can also be visualized using silhouette diagrams.


2. Methods

Initial Data Observations

The study dataset contained 9,353 observations from two classes: the Spam class represented 25% of all emails, while the Ham class represented 75%. The dataset contained a total of 22 encoding types. Of these, 14 were accepted and 8 were dropped because they contained unusual encodings, including foreign languages.

Parsing emails

The email data needed to be properly parsed in order to apply a vectorizer and fit a Naive Bayes model. Because the emails were initially provided in several folders, creating one dataframe with each email's location and its label (spam/ham) was necessary. The first step in parsing the emails was to write a nested for loop that detected each email's location within the data directory. This loop iterated through all the folders inside the data parent directory. If a subdirectory's name contained either "spam" or "ham", the absolute paths of all of that folder's contents were appended to a dataframe with two columns: column 1, called "email_location", contained the absolute path of each file, and column 2, called "label", contained the "ham" or "spam" label.
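
A minimal sketch of this first step is shown below; the data/ directory layout and the variable names are assumptions based on the description above.

```python
import os
import pandas as pd

rows = []
data_dir = "data"  # assumed parent directory containing "spam"/"ham" subfolders
for root, dirs, files in os.walk(data_dir):
    folder = os.path.basename(root).lower()
    if "spam" in folder or "ham" in folder:
        label = "spam" if "spam" in folder else "ham"
        for name in files:
            rows.append({
                "email_location": os.path.abspath(os.path.join(root, name)),
                "label": label,
            })

# One row per email: its absolute path and its spam/ham label.
emails_df = pd.DataFrame(rows, columns=["email_location", "label"])
```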

To properly read and parse the emails, a series of utility functions was created to handle encoding and email structure (particularly multipart emails), convert HTML to plain text, and remove unwanted characters. These functions were used to identify the desired text from each email file and change it into the proper format for use with a vectorizer.

load_email

The load_email function reads an email from an absolute path, decodes it using the "LATIN1" encoding, and returns an email.message.Message object from the Python Standard Library.
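
A minimal sketch of load_email, based on the description above (Latin-1 decoding and the standard library email parser):

```python
import email

def load_email(path):
    """Read a raw email file, decoding it as Latin-1, and return a Message object."""
    with open(path, "r", encoding="latin1") as f:
        return email.message_from_file(f)
```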

get_email_structure

The get_email_structure function reads an email object and uses the .get_payload() method to get the email's payload. It then checks whether the payload is a list object. If it is, the function returns a string stating "multipart" followed by the structure of every element inside the list. If the payload is not a list object, the function returns the string from the .get_content_type() method.
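
A sketch of get_email_structure consistent with this description, recursing into list payloads for multipart messages:

```python
def get_email_structure(msg):
    """Return a string describing the content structure of an email message."""
    payload = msg.get_payload()
    if isinstance(payload, list):
        # Multipart message: describe each sub-part recursively.
        parts = ", ".join(get_email_structure(part) for part in payload)
        return "multipart({})".format(parts)
    return msg.get_content_type()
```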

html_to_plain

The html_to_plain function is a very simple function that converts an email in HTML format to plain text using the BeautifulSoup package. Additionally, the function eliminates all "\n" characters by replacing them with white space.
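
A sketch of html_to_plain based on this description, assuming BeautifulSoup's built-in html.parser:

```python
from bs4 import BeautifulSoup

def html_to_plain(html):
    """Convert an HTML email body to plain text."""
    text = BeautifulSoup(html, "html.parser").get_text()
    # Replace newlines with spaces, as described above.
    return text.replace("\n", " ")
```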

email_to_plain

The email_to_plain function converts the email content to plain text. It first reads an email object and gets the email structure using the get_email_structure function. A for loop using the .walk() method iterates over all the parts of the message object, and for each part it gets the content type using the .get_content_type() method. If the part's content type is not "text/plain" or "text/html", the function returns the part's payload. If the part's content type is "text/plain", the function returns the part's payload as plain text. In every other case, the function assumes the part is in HTML format and returns the result of html_to_plain, which converts the HTML to plain text.
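
A sketch consistent with this description is shown below; as an assumption, non-text parts (such as multipart containers and attachments) are skipped rather than returned, and the html_to_plain helper sketched above is reused.

```python
def email_to_plain(msg):
    """Extract a plain-text body from an email message object."""
    for part in msg.walk():
        content_type = part.get_content_type()
        if content_type not in ("text/plain", "text/html"):
            # Skip containers and attachments and keep looking for text.
            continue
        payload = part.get_payload(decode=True)
        text = payload.decode("latin1", errors="replace") if payload else ""
        if content_type == "text/plain":
            return text
        # text/html part: convert it to plain text.
        return html_to_plain(text)
    return ""
```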

get_email

The get_email function is the main function used to parse the emails. It first checks the email structure using the get_email_structure function and, based on that structure, chooses the best course of action to parse the email. If the email structure is of type "multipart/alternative", it returns the structure as a string. If not, it returns a string with the email content obtained from the email_to_plain function.
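
A sketch of get_email based on this description, reusing the load_email, get_email_structure, and email_to_plain helpers sketched above; taking a file path as input is an assumption based on how the function is later applied to the email_location column.

```python
def get_email(path):
    """Main parsing helper: return the text content of the email at the given path."""
    msg = load_email(path)
    structure = get_email_structure(msg)
    if msg.get_content_type() == "multipart/alternative":
        # As described above, fall back to returning the structure string.
        return str(structure)
    return str(email_to_plain(msg))
```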

clean_email

The clean_email function is very simple. Its logic is to replace the selected text from a string with a blank space.
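
A sketch of clean_email consistent with how it is used later (removing "\n" and "\t" characters); the exact signature is an assumption.

```python
def clean_email(text):
    """Replace selected characters in the email text with a blank space."""
    # Newlines and tabs are the characters removed in this study.
    return text.replace("\n", " ").replace("\t", " ")
```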

get_encoding

The get_encoding function reads an email object and uses the .get_content_charset() method to get the encoding.
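
A one-line sketch of get_encoding based on this description:

```python
def get_encoding(msg):
    """Return the character set declared by an email message object."""
    return msg.get_content_charset()
```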

find_not_valid_chars

The find_not_valid_chars function flags strings with no valid characters. To do this, it first checks that the value is actually a string and not a None object. If it is a string, the function uses a regex pattern to identify whether there are no alphanumeric characters in the string. If the regex pattern matches, the function returns True; if not, it returns False.
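
A sketch of find_not_valid_chars based on this description; the exact regex pattern is an assumption.

```python
import re

def find_not_valid_chars(text):
    """Return True when a string contains no alphanumeric characters at all."""
    if not isinstance(text, str):
        return False
    # Matches only if the entire string consists of non-alphanumeric characters.
    return re.fullmatch(r"[^A-Za-z0-9]*", text) is not None
```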

The second step to parse the emails was to apply the get_email function to the email_location column and create a new column called "raw_email" with the parsed email raw text.

The third step was to clean and inspect the raw text. This step had three main parts: removing unwanted characters from the emails, identifying and inspecting emails that contained invalid characters, and removing emails with unusual encoding.

To remove unwanted characters from the emails, the clean_email function was used. The function helped remove all "\n" and "\t" characters.

To identify and inspect emails containing invalid characters, the find_not_valid_chars function was used. This created a new dataframe column called "not_valid" containing the boolean values returned by the function. Using dataframe filtering, every row with a "True" value in the "not_valid" column was returned and inspected.

To remove unwanted encoding, an encoding column was added to the dataframe using the get_encoding function. Then all unusual encodings were filtered out of the dataframe. To get the list of unusual encodings, a for loop was run over the email objects and each encoding was checked using the .get_content_charset() method. If the encoding was not part of the previously approved encoding list, it was added to the unusual encoding list.

Finally, one email whose body was null was deleted prior to training.

Train a supervised model

Each of the supervised models was created using a pipeline. A pipeline allowed for the streamlining of token transformation using CountVectorizer() or TfidfVectorizer() and fitting the MultinomialNB with either vectorizer while trying various parameters to tune the models. Pipelines prevent data leakage when using grid search with 10-fold cross-validation in order to narrow down the best parameters for the text transformers and the models. Due to the nature of how the email dataframe was built, the email labels were in sequential order. To address this, data were first shuffled before 10-fold cross-validation to introduce more randomness for each cross-validation split. This shuffle reduced the differences between test scores on the cross-validation splits. A random state was also set for reproducibility.

First Classification Model: MultinomialNB using TfidfVectorizer

A pipeline was created for the TfidfVectorizer and MultinomialNB objects under the variable 'pipeline' to tune the alpha, TF-IDF max features, and ngram range hyperparameters and assess predictions. The TfidfVectorizer object was also set to remove English stop words, an optional parameter offered by the Scikit-learn API. The grid had the following hyperparameter options:

The model was fitted using a GridSearchCV object containing the pipeline and the grid, with "roc_auc" as the scoring method. The best scoring combination had a TF-IDF vocabulary size of 12,000, a TF-IDF ngram range of (1,1) (meaning only unigrams were used), and an alpha value of 0.1.
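
The exact candidate values searched are not reproduced here; the sketch below shows a plausible pipeline and grid consistent with this description and the reported best parameters (max_features of 12,000, ngram_range of (1,1), alpha of 0.1). The candidate lists, the emails_df dataframe, and the binary label encoding are assumptions for illustration. The CountVectorizer pipeline described next follows the same pattern, with CountVectorizer in place of TfidfVectorizer.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, KFold

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])

# Hypothetical grid; each list contains the reported best value.
grid = {
    "tfidf__max_features": [8000, 10000, 12000],
    "tfidf__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}

# Shuffled 10-fold cross-validation with a fixed random state, scored by ROC AUC.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, grid, scoring="roc_auc", cv=cv)

# Hypothetical fit call on the parsed email dataframe, with spam encoded as 1:
# search.fit(emails_df["raw_email"], (emails_df["label"] == "spam").astype(int))
# print(search.best_params_)
```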

Second Classification Model: MultinomialNB using CountVectorizer

A pipeline was created for the CountVectorizer and MultinomialNB objects under the variable 'cv_pipeline' to tune the alpha, Count Vectorizer (cv) max features, and ngram range hyperparameters and assess predictions. The CountVectorizer object was also set to remove English stop words, an optional parameter offered by the Scikit-learn API. The grid had the following hyperparameter options:

The model was fitted using a GridSearchCV object containing the cv_pipeline and the grid, with "roc_auc" as the scoring method. The best scoring combination had a Count Vectorizer vocabulary size of 12,000, an ngram range of (2,2) (meaning only bigrams were used), and an alpha value of 0.1.

Exploring additional labels using KMeans Clustering

For clustering, the KMeans algorithm introduced above was used. No pipelines were used for this section, and the fitting of each object was done separately.

First Clustering Model: using CountVectorizer

To fit the clustering model, the team first created a variable called countVec, which was assigned a CountVectorizer using the best hyperparameters discovered during the previous GridSearch. The parameters used were:

The email raw text was fitted and transformed using the CountVectorizer object, and the transformed data was assigned to a variable called cv_X. To select the best number for $k$, the elbow method (Fig. 1) as well as the silhouette coefficient (Fig. 2) were utilized. To determine the elbow, inertia (within-cluster sum-of-squares) was plotted against the number of clusters. The point where the average distance from the centroid suddenly falls appears to be around $k=3$. However, plotting the silhouette coefficient for numbers of clusters around $k = 3$ (specifically $k = 3, 4, 5, 6$) revealed a problem: the vast majority of the emails (all except 3) were assigned to one cluster, so this method could hardly provide additional benefit and was not pursued further.

Figure 1: Inertia vs. Number of Clusters with Elbow identified
Figure 2: Silhouette Diagram of Silhouette Coefficient for Clusters of 3, 4, 5, and 6

Second Clustering Model: using TfidfVectorizer

To fit the clustering model, the team first created a variable called tfidf_vectorizer, which was assigned a TfidfVectorizer using the best hyperparameters from the previous GridSearch results with TfidfVectorizer and Multinomial Naive Bayes. The parameters were:

The email raw text was fitted and transformed using the TF-IDF object, and the transformed data was assigned to a variable called X_cluster. To select the best number for $k$, the elbow method (Fig. 3) as well as the silhouette coefficient (Fig. 4) were utilized. Inertia (within-cluster sum-of-squares) was plotted against the number of clusters. The point where the average distance from the centroid suddenly falls appears to be around $k=10$; however, the elbow is not very distinct in the Inertia vs. Number of Clusters plot.

Plotting the silhouette coefficient for numbers of clusters around $k = 10$ (specifically $k = 8, 9, 10, 11$) showed that $k = 9$ generated the highest silhouette coefficient for each group, and the size of each group was more uniform at $k = 9$. Thus, nine clusters were chosen for KMeans.

Figure 3: Inertia vs. Number of Clusters with Elbow identified
Figure 4: Silhouette Diagram of Silhouette Coefficient for Clusters of 8, 9, 10, and 11

Third Classification Model: MultinomialNB using TfidfVectorizer and Clustering as preprocessing

The last model created was a MultinomialNB classification model using both TfidfVectorizer and Clustering as preprocessing steps.

The email text data was fitted and transformed to TF-IDF format using the best parameters from the first classification model (TfidfVectorizer), and the result was assigned to the X_cluster variable. The KMeans model was then fitted on top of the TF-IDF results using nine clusters, as described above.

The first step was to fit the KMeans model on X_cluster, which had been transformed to TF-IDF form using the best variables identified through GridSearch. Once the KMeans model was fitted, the .predict method was used to get the predictions on the X_cluster dataset, and these were assigned to the kmeans_tf_label variable. Each of the cluster labels was then one-hot encoded using the get_dummies method provided by the Pandas API, and the one-hot encoded labels were assigned to the cluster_label_dummies variable. The numpy hstack method was used to merge the cluster_label_dummies and X_cluster variables. The result was assigned to the X_train variable, an inclusive dataset containing both the clustering labels (one-hot encoded) and the TF-IDF vectors.

Finally, the X_train variable was fitted to a MultinomialNB model with an alpha of 0.1, which was the best parameter found during the previous GridSearch.
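
A minimal sketch of this third model is shown below, using a hypothetical toy corpus in place of the parsed emails. As assumptions: scipy.sparse.hstack is used to combine the sparse TF-IDF matrix with the one-hot encoded cluster labels (the report describes numpy's hstack, which requires dense arrays), and two clusters are used instead of nine because of the tiny toy data.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the cleaned email text and labels.
texts = ["win free money now", "project meeting at noon",
         "free vacation win now", "lunch meeting tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

# TF-IDF transform using hyperparameters from the earlier grid search.
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=12000)
X_cluster = tfidf_vectorizer.fit_transform(texts)

# Fit KMeans and predict a cluster label for every email.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_cluster)
kmeans_tf_label = kmeans.predict(X_cluster)

# One-hot encode the cluster labels and append them to the TF-IDF matrix.
cluster_label_dummies = pd.get_dummies(kmeans_tf_label)
X_train = hstack([X_cluster, cluster_label_dummies.values.astype(float)])

# Fit the final Naive Bayes model with the best alpha from the grid search.
nb = MultinomialNB(alpha=0.1).fit(X_train, labels)
```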


3. Results

Models

Table 1 shows a comparison of the three different models attempted using Multinomial Naive Bayes: TF-IDF alone, TF-IDF with the KMeans clustering feature, and Count Vectorizer alone. It can be observed that TF-IDF and TF-IDF with clustering generated the highest overall accuracy at 99%.

Precision for a class is the proportion of instances the model predicts to be that class that actually belong to it. Recall for a class is the proportion of instances in that class that the model correctly identifies. Using the TF-IDF with clustering model as an example, precision for ham is 0.986, which means that among all the emails predicted to be ham, 98.6% actually are ham. Recall for ham is 0.995, which means that among all the emails that actually are ham, the model identified 99.5% of them. Precision and recall for each class under each model are shown in Table 1.

| Model | Ham precision | Ham recall | Spam precision | Spam recall | Overall accuracy |
|---|---|---|---|---|---|
| TFIDF alone | 0.985 | 0.998 | 0.994 | 0.955 | 0.99 |
| TFIDF with clustering | 0.986 | 0.995 | 0.986 | 0.958 | 0.99 |
| Count Vectorizer | 0.969 | 0.997 | 0.990 | 0.907 | 0.97 |

Table 1: Overall performance for each model

The confusion matrices for TF-IDF (Fig. 5) and TF-IDF with clustering (Fig. 6) display the prediction details of each model when making predictions on the entire dataset.

Between the TF-IDF and TF-IDF with clustering models, the choice is a trade-off between whether predicting ham correctly or predicting spam correctly is more important. If the company can tolerate a few more occasional spam emails and wants as few ham emails as possible classified as spam, the TF-IDF model is preferred. Conversely, if the company prefers a stricter spam filter and can tolerate occasionally going into the spam folder to find an email that is ham, then the TF-IDF with clustering model is preferred. Both models do a superb job of predicting ham or spam correctly with simple cleaning of the email body and then applying Multinomial Naive Bayes to the TF-IDF features (or TF-IDF plus cluster labels) with the entire email body text as the corpus.

Figure 5: Confusion matrix for TFIDF
Figure 6: Confusion matrix for TFIDF with Clustering

4. Conclusion

The model that used TF-IDF alone with Naive Bayes was better at predicting ham emails correctly, with occasional additional errors on spam prediction, while the model that used TF-IDF, clustering, and Naive Bayes was stricter in identifying spam, with occasionally more ham emails identified as spam. Both can be recommended to an IT department, which can ultimately decide which model to use based on company requirements. Count Vectorizer with Naive Bayes generated a much less ideal recall score on spam prediction: among all the spam emails, the Count Vectorizer model identified them correctly only 90.7% of the time. This is around 5% lower than either the TF-IDF alone or TF-IDF with clustering models, so this method is not recommended to the IT department.

Although it was not investigated directly, one can infer that term frequency-inverse document frequency, which takes into consideration how relevant a word is to a document within a collection of documents, might work better than simply counting the words that appear in each document. Certain words might appear more often in spam than in ham emails, and TF-IDF is better at highlighting their relative importance than simple counts.

Future directions for building a more robust spam filter could include adding the subject field or the sender's email address as additional features, as well as trying various other machine learning models to keep improving precision and recall for both classes. However, the current scores for both TF-IDF and TF-IDF with clustering are already very high, and any additional improvement might be deemed trivial by an IT department. It is also worth noting that Multinomial Naive Bayes is a very fast, probability-based machine learning algorithm compared to other methods. If an IT department has many more emails to classify than the sample emails provided in this study, and the company is looking to launch the email filter soon, Multinomial Naive Bayes, which has already generated ideal trial results, can save the company time and money.

Appendix - Code

Classification - TFIDF

Classification - CountVectorizer

Clustering - CountVectorizer

Based on the above analysis of the number of clusters k, the elbow method seems more informative, and k=3 seems worth trying.

However, based on the clustering labels shown, clustering using CountVectorizer does not seem to provide added benefit.

Clustering - TFIDF

k around 10 seems to be what the elbow method with inertia identifies, but further exploration shows that k=9 generates the best silhouette scores across all clusters.