
Naïve Bayesian Classification

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
  • Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, . . . , xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, . . ., An.
  • Suppose that there are m classes, C1, C2, . . . , Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

    P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.

    Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,

    P(Ci|X) = P(X|Ci) P(Ci) / P(X).
  • As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is,
    P(C1) = P(C2) = . . . = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D.
  • Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

    P(X|Ci) = ∏(k = 1 to n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × . . . × P(xn|Ci).
    We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), . . . , P(xn|Ci) from the training tuples. Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci), we consider the following:
    • If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
    • If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined by:

      Figure Nb.05:  g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x - μ)² / (2σ²)),   so that   P(xk|Ci) = g(xk, μCi, σCi).

    These equations may appear daunting, but hold on! We need to compute μCi and σCi, which are the mean (i.e., average) and standard deviation, respectively, of the values of attribute Ak for training tuples of class Ci. We then plug these two quantities into Figure Nb.05, together with xk, in order to estimate P(xk|Ci).
    For example, let X = (35, $40,000), where A1 and A2 are the attributes age and income, respectively. Let the class label attribute be buys_computer. The associated class label for X is yes (i.e., buys_computer = yes). Let’s suppose that age has not been discretized and therefore exists as a continuous-valued attribute. Suppose that from the training set, we find that customers in D who buy a computer are 38 ± 12 years of age. In other words, for attribute age and this class, we have μ = 38 years and σ = 12. We can plug these quantities, along with x1 = 35 for our tuple X, into Figure Nb.05 in order to estimate P(age = 35|buys_computer = yes).
  • In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

    P(X|Ci)P(Ci) > P(X|Cj)P(Cj)   for 1 ≤ j ≤ m, j ≠ i.

    In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum. A small Python sketch of the whole procedure is given right after this list.
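
To make the preceding steps concrete, here is a minimal Python sketch of a naïve Bayesian classifier that estimates P(xk|Ci) from relative frequencies for categorical attributes and from the Gaussian density of Figure Nb.05 for continuous ones. The class NaiveBayes, the helper gaussian(), and the tiny training set are illustrative assumptions made for this post, not part of the original example, and refinements such as Laplacian smoothing for zero counts are left out on purpose.

import math
from collections import defaultdict

def gaussian(x, mu, sigma):
    # g(x, mu, sigma) = exp(-(x - mu)^2 / (2*sigma^2)) / (sqrt(2*pi) * sigma), as in Figure Nb.05
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

class NaiveBayes:
    def __init__(self, attribute_types):
        # attribute_types: "categorical" or "continuous" for each attribute A1, ..., An
        self.attribute_types = attribute_types

    def fit(self, X, y):
        n = len(self.attribute_types)
        self.classes = sorted(set(y))
        # Class priors: P(Ci) = |Ci,D| / |D|
        self.priors = {c: y.count(c) / len(y) for c in self.classes}
        # Per-class, per-attribute statistics used to estimate P(xk|Ci)
        self.stats = {c: [None] * n for c in self.classes}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            for k in range(n):
                values = [row[k] for row in rows]
                if self.attribute_types[k] == "categorical":
                    counts = defaultdict(int)
                    for v in values:
                        counts[v] += 1
                    self.stats[c][k] = (counts, len(values))
                else:
                    mu = sum(values) / len(values)
                    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
                    self.stats[c][k] = (mu, sigma)   # sketch assumes sigma > 0
        return self

    def predict(self, x):
        best_class, best_score = None, -1.0
        for c in self.classes:
            score = self.priors[c]                    # start from P(Ci)
            for k, xk in enumerate(x):
                if self.attribute_types[k] == "categorical":
                    counts, total = self.stats[c][k]
                    score *= counts[xk] / total       # P(xk|Ci); 0 for unseen values (no smoothing)
                else:
                    mu, sigma = self.stats[c][k]
                    score *= gaussian(xk, mu, sigma)  # P(xk|Ci) = g(xk, muCi, sigmaCi)
            if score > best_score:                    # keep the Ci that maximizes P(X|Ci)P(Ci)
                best_class, best_score = c, score
        return best_class

# Toy usage with the running example's attributes (age, income); the training data is made up.
X_train = [(25, 30000), (35, 40000), (45, 60000), (28, 55000), (52, 20000), (40, 75000)]
y_train = ["yes", "yes", "yes", "yes", "no", "no"]

nb = NaiveBayes(["continuous", "continuous"]).fit(X_train, y_train)
print(nb.predict((35, 40000)))    # predicted buys_computer label for X = (35, $40,000)

# The worked example's P(age = 35 | buys_computer = yes) with mu = 38 and sigma = 12:
print(gaussian(35, 38, 12))       # roughly 0.032

One practical note on this sketch: multiplying many small probabilities can underflow in floating point, so real implementations often sum log-probabilities instead of taking the plain product used here to mirror the equations above.
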
"How effective are Bayesian classifiers ?"
Various empirical studies of this classifier in comparison to decision tree and neural network classifiers have found it to be comparable in some domains. In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. However, in practice this is not always the case, owing to inaccuracies in the assumptions made for its use, such as class conditional independence, and the lack of available probability data.
Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes’ theorem. For example, under certain assumptions, it can be shown that many neural network and curve-fitting algorithms output the maximum posteriori hypothesis, as does the naïve Bayesian classifier.
