naive – POC FARM

Hi there!

Spam filtering comes up in many places. Whether it is email, SMS or any other communication medium, spammers will try to get your attention.

We need to filter those spam messages. There are many algorithms to do that.

I use the Naïve Bayes method in this example. If you are not familiar with Naïve Bayes, you can learn more here.

In the Naïve Bayes method, the system learns from experience. Initially we need to teach it how to categorize spam and ham by providing some sample spam and ham messages.
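For reference, the textbook Naïve Bayes rule for this two-class problem can be written as below. The symbols are introduced here for illustration only and do not appear in the script: w_1, ..., w_n stand for the keywords found in a message, and the priors P(spam) and P(ham) would be estimated from the share of spam and ham among the sample messages.

```latex
P(\text{spam} \mid w_1, \dots, w_n)
  = \frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}
         {P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
          + P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})}
```

The "naïve" part is the assumption that the keywords occur independently of each other within a class, which is what allows the product over individual keywords.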

Logic used:

Step 1:

We need some sample spam and ham messages. Load them into separate variables first.

Then we need a keyword list: the keywords we might encounter in spam and ham messages.

Ex: money, account password, urgent, etc.
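As a minimal sketch of Step 1, the snippet below writes a small sample file and reads it back one message per line with scan(). The temporary file and the two messages are made up for illustration; they are not the contents of the post's spam.txt.

```r
# Hypothetical sample file: one message per line (not the post's spam.txt)
sample_file <- tempfile(fileext = ".txt")
writeLines(c("win money now",
             "urgent: check your account password"),
           sample_file)

# sep = "\n" makes every line one element of the resulting character vector
spam <- scan(sample_file, what = "character", sep = "\n", quiet = TRUE)

# Keyword list, using the examples named above
keywords <- c("money", "account password", "urgent")

print(length(spam))   # number of sample messages that were loaded
print(keywords)
```

Ham messages can be loaded the same way into a second variable.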

Step 2:

Build a matrix that stores the keywords in one dimension and the number of times they appear in spam and in ham messages in the other.
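One way to sketch Step 2 is shown below. It uses base R's gregexpr() with literal matching, so the snippet has no package dependency. The sample messages and the helper name count_keyword are assumptions for illustration.

```r
# Hypothetical sample data (not taken from the post's files)
spam <- c("urgent: verify your account password",
          "win money, urgent offer, money back")
ham <- c("send the money for lunch",
         "see you at noon")
keywords <- c("money", "account password", "urgent")

# Count literal occurrences of one keyword across a vector of messages
count_keyword <- function(texts, keyword) {
  hits <- gregexpr(keyword, texts, fixed = TRUE)
  # gregexpr() returns -1 for "no match", so only positive positions count
  sum(vapply(hits, function(h) sum(h > 0), integer(1)))
}

# Rows are keywords, columns are the two classes
counts <- cbind(
  spam = vapply(keywords, count_keyword, integer(1), texts = spam),
  ham  = vapply(keywords, count_keyword, integer(1), texts = ham)
)
print(counts)
```

Keeping the counts numeric and naming the rows after the keywords means a later lookup such as counts["money", "spam"] does not depend on row order.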

Step 3:

Now load the new message that is yet to be filtered.

Count how many times each keyword occurs in it.
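The counting in Step 3 can be sketched as follows. The message text is invented, and lowercasing the message first is an optional choice that is not part of the original script: literal matching is case-sensitive, so "URGENT" would otherwise not count as "urgent".

```r
keywords <- c("money", "account password", "urgent")

# Hypothetical incoming message
message_text <- paste("URGENT: send money now to keep your",
                      "account password safe. Money back guaranteed.")

# Count literal occurrences of one keyword in a single message
count_keyword <- function(text, keyword) {
  hits <- gregexpr(keyword, text, fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr() returns -1 when there is no match
}

# Case-sensitive counts versus counts on the lowercased message
raw_counts   <- vapply(keywords, count_keyword, integer(1), text = message_text)
lower_counts <- vapply(keywords, count_keyword, integer(1),
                       text = tolower(message_text))

print(rbind(raw = raw_counts, lowercased = lower_counts))
```

If the message is lowercased, the sample messages should be lowercased too, so that training and scoring count keywords the same way.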

Step 4:

Use the matrix calculated above as a reference and apply the Naïve Bayes formula to find out whether the new message is spam or ham.
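For comparison, here is a minimal sketch of a textbook multinomial Naïve Bayes decision over keyword counts, with add-one (Laplace) smoothing. It is not the scoring used in the script below, which applies a simpler frequency-difference score. All counts and sample sizes are made-up numbers chosen for illustration.

```r
# Hypothetical training counts: occurrences of each keyword per class
spam_counts <- c("money" = 8, "account password" = 4, "urgent" = 6)
ham_counts  <- c("money" = 3, "account password" = 1, "urgent" = 0)
n_spam <- 10  # assumed number of sample spam messages
n_ham  <- 30  # assumed number of sample ham messages

# Hypothetical keyword counts in the new message (the result of Step 3)
msg_counts <- c("money" = 1, "account password" = 0, "urgent" = 2)

# Add-one smoothing so that a zero count never leads to log(0)
p_word_spam <- (spam_counts + 1) / (sum(spam_counts) + length(spam_counts))
p_word_ham  <- (ham_counts + 1) / (sum(ham_counts) + length(ham_counts))

# Log prior plus the sum of log likelihoods for each class
log_spam <- log(n_spam / (n_spam + n_ham)) + sum(msg_counts * log(p_word_spam))
log_ham  <- log(n_ham / (n_spam + n_ham)) + sum(msg_counts * log(p_word_ham))

# Normalise the two scores into a posterior probability of spam
p_spam <- 1 / (1 + exp(log_ham - log_spam))
label <- if (p_spam > 0.5) "spam" else "ham"

print(p_spam)
print(label)
```

Working in log space avoids numeric underflow when many keywords are multiplied together, and the 0.5 threshold is only one possible choice.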

Here is the R script:

# Clean up the workspace (removes all objects except functions)
rm(list = setdiff(ls(), lsf.str()))

library(stringr) # stringr provides str_count(), used to count keyword matches
#################################################################################
# Read the sample spam, sample ham and keyword lists (one entry per line)
#################################################################################
ham = scan('ham.txt',
           what = 'character', comment.char = ';', sep = "\n")
spam = scan('spam.txt',
            what = 'character', comment.char = ';', sep = "\n")
keywords = scan('KeyWords.txt',
                what = 'character', comment.char = ';', sep = "\n")

#################################################################################
# Count how often each keyword appears in the spam and ham samples
#################################################################################
keyLength <- length(keywords)
# Start from NULL (not 0) so that row i of each matrix belongs to keywords[i]
matSpam <- NULL
matHam <- NULL
for (i in seq_len(keyLength)) {
  tDF <- c(keywords[i], sum(str_count(spam, keywords[i])))
  matSpam <- rbind(matSpam, tDF)
  tDF <- c(keywords[i], sum(str_count(ham, keywords[i])))
  matHam <- rbind(matHam, tDF)
}
#################################################################################
# Read the data to be evaluated. message.txt contains all messages to be
# evaluated, one message per line.
#################################################################################
message = scan('message.txt',
               what = 'character', comment.char = ';', sep = "\n")
#################################################################################
# Categorize the message as spam or ham
#################################################################################
# Count keyword occurrences in the new message(s)
matScore <- NULL
for (i in seq_len(keyLength)) {
  tDF <- c(keywords[i], sum(str_count(message, keywords[i])))
  matScore <- rbind(matScore, tDF)
}

# Apply the formula: weight each keyword count by the difference between its
# per-message frequency in the spam samples and in the ham samples
lengthSpam <- length(spam)
lengthHam <- length(ham)
totalScore <- 0
for (i in seq_len(keyLength)) {
  totalScore <- totalScore + as.numeric(matScore[i, 2]) *
    ((as.numeric(matSpam[i, 2]) / lengthSpam) - (as.numeric(matHam[i, 2]) / lengthHam))
}
# Take the absolute value and scale to a percentage
totalScore <- abs(totalScore) * 100
print("Percentage of being spam")
print(100 - totalScore)

Output:

Find materials for this post on my GitHub.