IMPROVING NAIVE BAYESIAN SPAM FILTERING
Mid Sweden University Department for Information Technology and Media Spring 2005
Spam, or unsolicited e-mail, has become a major problem for companies and private users. This thesis explores the problems associated with spam and several approaches that attempt to deal with it. The most appealing methods are those that are easy to maintain and show satisfactory performance. Statistical classifiers form such a group of methods, as their ability to filter spam is based on knowledge gathered from previously collected and classified e-mails. A learning algorithm that uses the Naive Bayesian classifier has shown promising results in separating spam from legitimate mail. Tokenization, probability estimation, and feature selection are processes performed prior to classification, and all have a significant influence on the performance of spam filtering. The main objective of this work is to examine and empirically test the currently known techniques used for each of these processes and to investigate possibilities for improving classifier performance. Firstly, it is shown how a filter and wrapper approach can be used to find tokenization delimiter subsets that improve classification. After this, seven probability estimators are tested and compared in order to determine which of them improve performance. Finally, a survey of commonly used methods for the feature selection process is performed and recommendations for their use are presented.
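The pipeline summarized above (tokenization, probability estimation, classification) can be sketched in Python. The corpora, the delimiter set, and the use of the Laplace ("add-one") estimator here are illustrative assumptions for the sketch, not the thesis's actual data or final configuration.

```python
import re
from math import log

# Tiny illustrative training corpora (hypothetical examples, not the thesis data).
spam_docs = ["buy cheap pills now", "cheap pills cheap offer", "win money now"]
ham_docs = ["meeting agenda for monday", "please review the report", "lunch on monday"]

def tokenize(text, delimiters=" "):
    # Split on the chosen delimiter set; the thesis studies which
    # delimiter subsets yield the best classification.
    return [t for t in re.split("[" + re.escape(delimiters) + "]+", text.lower()) if t]

def count_tokens(docs):
    counts = {}
    for doc in docs:
        for tok in tokenize(doc):
            counts[tok] = counts.get(tok, 0) + 1
    return counts

spam_counts, ham_counts = count_tokens(spam_docs), count_tokens(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(token, counts):
    # Laplace estimate -- just one of the probability estimators the
    # thesis compares; others weight unseen tokens differently.
    total = sum(counts.values())
    return log((counts.get(token, 0) + 1) / (total + len(vocab)))

def classify(text):
    # Naive Bayes: assume token occurrences are independent given the
    # class, sum log-probabilities, and pick the higher-scoring class.
    n = len(spam_docs) + len(ham_docs)
    spam_score = log(len(spam_docs) / n)
    ham_score = log(len(ham_docs) / n)
    for tok in tokenize(text):
        spam_score += log_prob(tok, spam_counts)
        ham_score += log_prob(tok, ham_counts)
    return "spam" if spam_score > ham_score else "good"

print(classify("cheap pills"))     # spam
print(classify("monday meeting"))  # good
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, which is the standard trick for Naive Bayesian filters.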
I would like to thank my supervisor Iskra Popova for guiding me in how to write a thesis. She has helped me to improve the structure and pointed out where the content needed enrichment. For me it has been a learning process. I also want to thank everyone who makes an effort to fight spam, and there are a lot of you!
classifier: A person or machine that sorts out the constituents of a substance.
clique: A maximal complete subgraph of a graph.
complete graph: A graph in which all vertices are connected with each other.
corpus: A collection of natural language text used for accumulating statistics. More specifically, in this thesis a corpus is a map between a word (token) and its frequency.
degrees of freedom: The number of values that are free to vary in a statistical calculation.
delimiter: A character that marks the beginning or end of a unit of data.
entropy: A measure of the disorder that exists in a system.
feature: A prominent part or characteristic.
inducer: A machine learning algorithm that produces a classifier which, in turn, assigns a class to each instance.
misclassification: An actual spam classified as good, or an actual good classified as spam.
monotone: The function f is monotone if, whenever x ≤ y, then f(x) ≤ f(y). Stated differently, a monotone function is one that preserves order.
mutually exclusive: Describing two events, conditions, or variables that cannot occur at once.
n-gram: n features are considered at a time.
null hypothesis: Predicts that two distributions are the same.
probability distribution: A list of the probabilities associated with each value of a random variable. Throughout this work discrete (non-continuous) probability distributions are used. The probability mass function is denoted by p(xi), where xi is a discrete random variable.
sample space: The set of all possible outcomes of an experiment.
significance level: The decision criterion for accepting or rejecting the null hypothesis.
space: A character used to separate words.
spam: Unsolicited, usually commercial, e-mail sent to a large number of addresses.
token: A distinguishing characteristic (feature).
transitivity: In mathematics, a binary relation R over a set X is transitive if it holds for all a, b, and c in X that if a is related to b and b is related to c, then a is related to c.
unigram: Only one feature...