As a second classifier for this task, Multinomial Naive Bayes (MNB) was chosen for the comparison.
Interestingly, even though MNB is fairly simple compared to other available classifiers, it has been used in many projects and documented in many research papers as a classifier that gives relatively good results while being efficient and computationally inexpensive [11, 28]. MNB is based on the Bayes assumption, which considers all the features to be independent of each other. Even though this is incorrect and the features are actually dependent, it has been shown to work well in practice without any good explanation [11].
That is why this assumption is called "naive". In order to understand how Naive Bayes classifiers work (MNB is only one type), one needs to understand Bayes' theorem:

\[ P(w_j \mid x_i) = \frac{P(x_i \mid w_j)\,P(w_j)}{P(x_i)} \]

What we are trying to calculate is the probability of having $w_j$ when $x_i$ happened (posterior probability), where $w_j$ is a class ($j = 1, 2, \dots, m$) and $x$ is a vector of features ($i = 1, 2, \dots, n$) [28]. $P(x_i \mid w_j)$ is the probability of having $x_i$ given that it belongs to the class $w_j$ (conditional probability), $P(w_j)$ is the probability of the class (prior probability), while $P(x_i)$ is the probability of the features independent of the class (evidence) [28]. With this formula, what we are trying to do is maximize the posterior probability given the training data so as to formulate the decision rule [28].
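As a minimal numeric sketch of how the theorem and the decision rule fit together, consider a two-class case; all the probabilities below are invented purely for illustration, and the class names are hypothetical:

```python
# Made-up conditional probabilities P(x | w_j) and priors P(w_j)
# for an illustrative two-class (spam/ham) problem.
p_x_given_w = {"spam": 0.8, "ham": 0.1}   # conditional probability
p_w = {"spam": 0.3, "ham": 0.7}           # prior probability

# Evidence P(x): the same feature vector summed over both classes
p_x = sum(p_x_given_w[w] * p_w[w] for w in p_w)

# Posterior P(w_j | x) from Bayes' theorem, for each class
posterior = {w: p_x_given_w[w] * p_w[w] / p_x for w in p_w}

# Decision rule: pick the class that maximizes the posterior
prediction = max(posterior, key=posterior.get)
```

Note that the evidence is the same for every class, so it only rescales the posteriors; the decision rule is determined by the numerator alone.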
In order to calculate the conditional probability, we go back to the Bayes assumption that the features are independent, that is, that a change in the probability of one feature will have no effect on the probability of any other feature [28]. Under this assumption, we can calculate the conditional probabilities of the samples from the training data. Given the vector $x$, we calculate the conditional probability with the following formula:

\[ P(x \mid w_j) = \prod_{i=1}^{n} P(x_i \mid w_j) \]

that is, by multiplying the probabilities of each feature given the class $w_j$. Each of these is simply a frequency of that feature: the number of times the feature appears in the class $w_j$ divided by the count of all the features of that class [28]. The prior probability $P(w_j)$ refers to the "prior knowledge" we have about our data: in general, how probable it is to encounter a class $w_j$ in our training data. It is calculated by dividing the number of times $w_j$ appeared in our data by the total count of class labels [28]. The probability of the evidence, $P(x)$, is the probability of a particular vector of features appearing, independent of the class.
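The feature frequencies and priors above can be sketched directly from counts; the tiny corpus and the class labels here are hypothetical, chosen only to make the arithmetic easy to follow:

```python
from collections import Counter

# Hypothetical toy training data: (class, tokenized document) pairs.
train = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["win", "prize"]),
    ("ham",  ["meeting", "tomorrow", "now"]),
]

# Feature counts per class: how often each word occurs in each class.
feature_counts = {}
for label, words in train:
    feature_counts.setdefault(label, Counter()).update(words)

def cond_prob(word, label):
    # P(x_i | w_j): occurrences of the feature in the class,
    # divided by the count of all features of that class.
    counts = feature_counts[label]
    return counts[word] / sum(counts.values())

labels = [label for label, _ in train]

def prior(label):
    # P(w_j): occurrences of the class divided by the total count of labels.
    return labels.count(label) / len(labels)
```

For this toy data, "win" appears 2 times among the 5 spam features, so `cond_prob("win", "spam")` is 0.4, and 2 of the 3 documents are spam, so `prior("spam")` is 2/3.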
In order to calculate the evidence, we use this formula:

\[ P(x) = P(x \mid w_j)\,P(w_j) + P(x \mid \neg w_j)\,P(\neg w_j) \]

where $\neg w_j$ refers to the occurrences when a particular set of features is not encountered in the class $w_j$ [28].

When classifying text, we will usually encounter words in the test set that were not present in the training set. This would result in the class-conditional probability being zero, which would then turn the whole result into zero, since the class prior is multiplied by the conditional probability. This problem is resolved by using a smoothing technique, in our case Laplace correction. This means that we add $+1$ to the numerator and the size of our vocabulary $V$ to the denominator, as follows:

\[ \hat{P}(x_i \mid w_j) = \frac{N_{x_i, w_j} + 1}{N_{w_j} + V} \]

where $N_{x_i, w_j}$ is the number of times feature $x_i$ appears in samples from class $w_j$, while $N_{w_j}$ is the total count of all features in class $w_j$.
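Putting the pieces together, a Laplace-smoothed MNB scorer can be sketched as follows; the training corpus and class names are again invented for illustration, and the sketch works in log space (a common practical choice, not something prescribed above) to avoid numeric underflow when many small probabilities are multiplied:

```python
import math
from collections import Counter

# Hypothetical toy training data: (class, tokenized document) pairs.
train = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["win", "prize"]),
    ("ham",  ["meeting", "tomorrow", "now"]),
]

counts = {}                              # N_{x_i, w_j} per class
for label, words in train:
    counts.setdefault(label, Counter()).update(words)

vocab = {w for _, words in train for w in words}
V = len(vocab)                           # vocabulary size added to the denominator

def smoothed(word, label):
    # Laplace correction: (N_{x_i,w_j} + 1) / (N_{w_j} + V),
    # so unseen words get a small nonzero probability instead of zero.
    c = counts[label]
    return (c[word] + 1) / (sum(c.values()) + V)

def predict(words):
    labels = [label for label, _ in train]
    scores = {}
    for label in counts:
        # Sum of logs instead of a product of probabilities.
        log_prior = math.log(labels.count(label) / len(labels))
        scores[label] = log_prior + sum(math.log(smoothed(w, label)) for w in words)
    return max(scores, key=scores.get)
```

A word never seen in training, such as `"lottery"` here, now contributes $1/(N_{w_j}+V)$ rather than zeroing out the whole product, so the classifier can still rank the classes.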