Application of Naive Bayes for Filtering Email Spam

Author: Palak Kumar

We are heavily dependent on email for our day-to-day communications. Email is one of the major channels of information exchange for both personal and professional communication. One of its problems, however, is that once spammers get access to our email addresses, they start bombarding us with unwanted junk emails. Spam emails are unsolicited messages sent for marketing, phishing, or scams. Sifting through them to find the important emails becomes an onerous task. Email service providers have therefore been developing and refining algorithms to identify and classify spam and make life easier for their users. Emails flagged as spam can be redirected to a junk mail folder or deleted so that they do not distract users and let them focus on valid and genuine messages.

There is a constant struggle between spammers and spam detection tools. Spammers modify their emails so as not to be flagged as spam and constantly adopt new methods and techniques to fool the email servers. Email service providers, on the other hand, are constantly developing new tools and methods to detect spam and keep it out of users' inboxes.

Spam detection is a hard problem. The methods used to classify spam are not foolproof: sometimes genuine emails get flagged as spam, and some spam still makes it into our inboxes. A number of factors are analysed to determine whether an incoming email is spam, such as:
  • Missing or incorrect Sender Policy Framework (SPF) record
  • Sending emails to a large number of recipients
  • Sender’s IP address or domain is blacklisted
  • Using a lot of hyperlinks or shortened hyperlinks
  • Using linked or hosted images
  • Using specific keywords that are not commonly used in regular communication
In this article, we will discuss the use of specific keywords to identify spam. Spam emails can often be identified by the frequent appearance of phrases such as “Congratulations you won a lottery” or “Please enter your bank account number”. These patterns are recognized by computer algorithms and used to determine the nature of an email, as illustrated in the sketch below.
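
As a simple illustration of this idea, the short Python sketch below counts how many words from a keyword list appear in an email. The keyword list and function name are our own assumptions for illustration, not part of any particular spam filter.

# Minimal sketch of keyword-based flagging (hypothetical keyword list).
SPAM_KEYWORDS = {"lottery", "winner", "congratulations", "bank", "account"}

def keyword_hits(email_text):
    # Count how many flagged keywords appear in the email body.
    words = email_text.lower().split()
    return sum(1 for w in words if w in SPAM_KEYWORDS)

print(keyword_hits("Congratulations you won a lottery"))  # 2 hits

A real filter would combine such keyword signals with the other factors listed above rather than rely on them alone.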

Bayes Theorem

One of the tools that can be used to detect spam is the Naive Bayes algorithm. This algorithm is based on Bayes' theorem, which lets us compute an unknown conditional probability of one event given another from the individual probabilities of each event and the reverse conditional probability.

Let us understand Bayes' theorem and see how we can use it to calculate the probabilities needed to identify spam. Say there are two events, A and B. A could be any event, such as getting heads on a coin toss or rolling a number less than 2 on a die. In our case, A is the event that an email is spam. Similarly, B could be any event; in our example, B is the presence of a particular word in the email. When we write P(A), read as the probability of event A occurring, we are talking about the probability of an email being spam. For example, if 20 out of the 100 emails you get in a day are spam, then P(A) = 0.20 (20%). This by itself is not that useful; we need to be able to tell whether a specific email we receive is spam. Hence, we talk about conditional probabilities, denoted P(A | B) and read as the probability of event A occurring given that B has occurred. So, if we know the words contained in an email, we can compute the conditional probability that this specific email is spam. If P(A | B) is high, there is a high likelihood that the email is spam; if it is low, the email is probably genuine. Now, the question becomes: how do we compute P(A | B)? This is where Bayes' theorem comes into play.

Bayes' theorem states that the probability of event A given B is equal to the probability of event B given A, multiplied by the probability of A and divided by the probability of B:

P(A | B) = P(B | A) * P(A) / P(B)
The advantage of this theorem is that it expresses P(A | B) in terms of P(B | A), which is easier to compute. P(B | A) is the probability of finding a specific word in the email given that we know it is spam, and this can easily be obtained from historical data: if we have a list of spam emails received in the past, we can count how often specific words occur in them. Hence, the equation can be used to compute P(A | B), since the remaining probabilities P(A), P(B), and P(B | A) are all relatively easy to compute.
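
As a quick sanity check of the formula, the Python sketch below plugs illustrative numbers into Bayes' theorem. The numbers are our own assumptions for illustration and are not taken from the example that follows.

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
# Illustrative numbers only (assumed, not from the article's dataset).
p_spam = 0.20              # P(A): prior probability that an email is spam
p_word_given_spam = 0.50   # P(B | A): word appears in known spam emails
p_word = 0.15              # P(B): word appears in any email, spam or not

p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.667 -> fairly likely to be spam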

In order to determine these probabilities, we need historical data to train the algorithm. Let us take a simple example to illustrate the calculations, assuming a training dataset of 5 emails we have received, of which 3 are spam and 2 are not.
[Table 1: Training dataset of 5 emails labelled as spam or not spam]
We usually remove commonly used words such as “a”, “an”, “the”, “on”, etc. from the emails and do not consider them for further analysis. Let us say we pick the following words for analysis: “Share”, “Send”, “Password”, “Account”. The probability of occurrence of each word in the spam and non-spam emails can be calculated as shown in the following table.
[Table 2: Probability of each keyword occurring in spam and non-spam emails]
The overall probability of spam emails is P(spam) = 3/5, and the probability of non-spam emails is P(not spam) = 2/5.

Let us now assume we receive a new email containing the keywords “Share password”. Using Bayes' theorem, we can calculate the following:
P(“Share AND Password” | Spam) = P(Share | Spam) * P(Password | Spam) = 1/3 * 1/3 = 1/9
P(“Share AND Password” | Not Spam) = 1/2 * 0/2 = 0
Note that we have assumed the keywords are independent, so the probability of the two words occurring together is the product of the individual probabilities.
P(Spam | “Share AND Password”) = (1/9 * 3/5) / (1/9 * 3/5 + 0 * 2/5) = 1.0
Here, the denominator is P(“Share AND Password”), obtained by summing the contributions of the spam and non-spam cases.
Since the probability is high (greater than 0.5), we can conclude that this email is probably a spam email.
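
The short Python sketch below simply reproduces this hand calculation, using the word probabilities and class priors from the tables above; the variable names are our own.

# Reproduces the hand calculation above for the email "Share password".
from fractions import Fraction as F

p_spam, p_not = F(3, 5), F(2, 5)                  # class priors from the 5 training emails
likelihood = {                                     # P(word | class) from Table 2 above
    "share":    {"spam": F(1, 3), "not": F(1, 2)},
    "password": {"spam": F(1, 3), "not": F(0, 2)},
}

num_spam, num_not = p_spam, p_not
for w in ["share", "password"]:                    # naive independence: multiply per-word terms
    num_spam *= likelihood[w]["spam"]
    num_not *= likelihood[w]["not"]

print(num_spam / (num_spam + num_not))             # 1 -> classified as spam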

Of course, in the real world, we would use millions of records to train the algorithm, and it would probably contain thousands of keywords. But the basic approach and concepts are the same.
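
As a rough idea of what this looks like in code at larger scale, the sketch below applies the same approach using scikit-learn's CountVectorizer and MultinomialNB on a handful of made-up emails. This is our own illustration under assumed data, not the software used in this article.

# Minimal sketch: Naive Bayes spam classifier on toy data with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Congratulations you won a lottery",       # made-up training data
    "Please enter your bank account number",   # labels: 1 = spam, 0 = not spam
    "Meeting moved to 3 pm tomorrow",
    "Please share the project report",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer(stop_words="english")     # turns each email into word counts
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)                 # multinomial Naive Bayes on word counts

new_email = ["You won a lottery, share your account number"]
print(model.predict(vectorizer.transform(new_email)))  # [1] -> flagged as spam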

Sigma Magic Software Example

Let us use software to perform the computations for the Naive Bayes algorithm. In this article, we have used the Sigma Magic software to run the Naive Bayes algorithm. Keywords are extracted from the historical emails and used to train the model: the input variables are the keywords found in the emails, and the response variable is the probability of finding each word in a spam email. The first 70% of the training records are used to train the model, and the remaining rows are used to test how well the algorithm detects spam. Depending on the words contained in a test email, we can estimate the probability of it being spam; the predictions are shown in the rightmost column. The model fits the data relatively well (especially on the training data set), although there are some keywords for which it does not perform as well, as shown in the confusion matrix.

[Figure: Sigma Magic Naive Bayes example output with predictions and confusion matrix]
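
The Python sketch below mimics the workflow described above (train on the first 70% of the records, test on the rest, then compute a confusion matrix) using scikit-learn on a small made-up dataset. It is only an approximation of the steps described, not the Sigma Magic software itself.

# Sketch of the train/test evaluation workflow with scikit-learn (assumed toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

emails = [                                         # made-up labelled emails
    "Congratulations you won a lottery claim your prize now",
    "Meeting moved to 3 pm tomorrow",
    "Please enter your bank account number to verify",
    "Please share the project report by Friday",
    "Urgent update your password to claim the prize",
    "Lunch at the new cafe today",
    "You won a free prize click the link now",
    "Can you review my code before the meeting",
    "Verify your account password immediately",
    "The quarterly report is attached",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]            # 1 = spam, 0 = not spam

X = CountVectorizer(stop_words="english").fit_transform(emails)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, shuffle=False)       # first 70% train, last 30% test

model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.predict_proba(X_test)[:, 1])           # estimated probability of spam per test email
print(confusion_matrix(y_test, y_pred))            # rows: actual class, columns: predicted class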

Advantages & Limitations of Naive Bayes Algorithm

In general, the Naive Bayes algorithm is a powerful tool that can be used to detect spam. It is relatively simple to program and use in the real world. The basic advantages and disadvantages of this algorithm are as follows:

Advantages of Naive Bayes Algorithm
  • It needs relatively little training data to train the model
  • The model is simple and easy to use
  • It can handle both continuous and discrete data
  • Even though it is a rather simple model, it works remarkably well
  • The algorithm is quick and can produce results in a very short time
Disadvantages of Naive Bayes Algorithm
  • It assumes all features are independent, which may not be true in reality.
  • A larger data set is required to make reliable predictions.
  • With small data sets, precision will be lower.
  • It requires labelled training data in order to train the model.
  • Since spammers keep changing their tactics, the model needs to be updated periodically.

