Supervised Learning: Classification
Naive Bayes Classifier
1. Introduction
Naive Bayes is called ‘naive’ because it assumes that all of the features in a data set are equally important and independent of one another. These assumptions are rarely true in real-world scenarios, yet the Naive Bayes algorithm often performs surprisingly well. It is a supervised learning algorithm used for classification. Its main advantage is that it requires very little computational power and therefore remains fast even on large data sets.
2. Key Terms

Prior probability is the proportion of each class of the dependent variable (target) in the data set.

Likelihood is the probability of observing a particular feature value given that an observation belongs to a particular class.

Marginal likelihood is the proportion of each value of an independent variable (predictor) in the data set.
These terms might not be clear yet. Let's dive into an example that shows exactly what Naive Bayes does and illustrates each of these terms.
3. Example with Explanation
Below is a training data set of weather conditions and the corresponding target variable ‘Play’ (indicating whether a game was played). We need to classify whether players will play or not based on the weather condition. Let’s follow the steps below to perform Naive Bayes:
Step 1: Convert the data set into a frequency table (also called a contingency table).
Step 2: Create a likelihood table.
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
3.1. Step 1 and Step 2
Let's go over the first two steps. These steps will also help us understand prior probability, likelihood and marginal likelihood.
[Table: Data set]
[Table: Contingency table]
Step 1: To convert the data set into a contingency table, let the first column represent the feature (Weather) and the first row represent the classes (Yes, No). Then calculate the totals of each row and each column.
Likelihood Table
Step 2: To make a Likelihood Table, convert the frequencies into probabilities by dividing every observed and marginal frequency by the overall total (14).
[Likelihood tables for Weather and for Play]
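Steps 1 and 2 can be sketched in a few lines of Python. The rows below are an assumption: they follow the classic 14-observation weather/play data set commonly used with this example (5 Sunny, 4 Overcast, 5 Rainy days; 9 "Yes", 5 "No"). If your table differs, the code still works, only the numbers change.

```python
from collections import Counter

# Hypothetical observations, assumed to match the classic 14-row
# weather/play data set (not reproduced in the article's lost table)
data = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
    ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rainy", "No"),
]

# Step 1: frequency (contingency) table -- counts per (weather, class) cell
freq = Counter(data)
total = len(data)

# Step 2: likelihood table -- divide every observed frequency
# by the overall total (14)
likelihood_table = {cell: count / total for cell, count in freq.items()}
# e.g. likelihood_table[("Sunny", "No")] is 3/14
```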
The terms Likelihood, Marginal Likelihood, and Prior Probability (or Class Prior Probability, as it relates to the classes "Yes" and "No") mentioned above are highlighted in the tables as follows:
Likelihood (lime color)
Example: P(Overcast | "Yes")
Marginal Likelihood or Predictor Prior Probability (red color)
Example: P(Sunny)
Prior Probability or Class Prior Probability (olive drab color)
Example: P("No")
Likelihood = P(Feature | Class)
Prior Probability = P(Class)
Marginal Likelihood = P(Feature)
So, we can now see that:

Likelihood = P(Feature | Class)

Marginal Likelihood = P(Feature)

Prior Probability = P(Class)
Likelihood is simply the probability of a feature within a class. For example, to calculate P(Sunny | "Yes"), where Sunny is a feature and "Yes" is a class, we count all the "Yes" days (the days we went to Play, ignoring the "No"s) on which the weather was Sunny, and divide by the total number of "Yes" days in our data set.
Marginal Likelihood is the probability of a feature. For example, to calculate P(Sunny), we count all the Sunny days and divide by the total number of observed days in our data set.
Prior Probability, or Class Prior Probability, is the probability of a class. For example, to calculate P("No"), we count all the "No" days, i.e. the days we did not go to Play, and divide by the total number of observed days in our data set.
Posterior probability is the revised probability of an event occurring after taking new information into account. It is discussed in more detail later in this article.
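As a sanity check, the three quantities can be computed directly from the counts. The numbers below assume the classic table (9 "Yes" days, 5 "No" days, 5 Sunny days of which 2 were "Yes"); they are an assumption, since the original table image is not reproduced here.

```python
# Counts assumed from the classic weather/play example
total_days = 14
yes_days, no_days = 9, 5
sunny_days = 5        # all Sunny rows
sunny_and_yes = 2     # Sunny rows where Play = "Yes"

likelihood = sunny_and_yes / yes_days   # P(Sunny | "Yes") = 2/9
marginal = sunny_days / total_days      # P(Sunny)         = 5/14
prior_no = no_days / total_days         # P("No")          = 5/14
```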
[Color-coded likelihood tables for Weather and for Play]
3.2. Step 3
Use Bayes' Formula to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.
P(c | x) = P(x | c) · P(c) / P(x)

where P(x | c) is the Likelihood, P(c) is the Class Prior Probability, P(c | x) is the Posterior Probability, and P(x) is the Predictor Prior Probability.
In the formula above, ’c’ denotes a class and ’x’ denotes the features. Next, let’s look at P(x). As you can see, the denominator is the only term that is a function of the data (features) alone; it is not a function of the class we are currently looking at, so it is the same for all classes. Traditionally in Naive Bayes classification, we drop this denominator when making a prediction, as it does not affect the final outcome of the classifier:

P(c | x) ∝ P(x | c) · P(c)    (1)
To make it more interesting, let’s assume we have an additional feature, Wind:
Let’s assume we want to predict the class for the data with the following features:
In order to make a prediction, we need to compare the posterior probabilities of each class after observing the input data. For this purpose we will use expression (1). Do not forget that Naive Bayes assumes independence of the features. To keep the formulas compact, we will use the following notation: ’X1’ for ’Weather’, ’X2’ for ’Wind’, and ’C’ for ’Class’.
First, we estimate the score for going to Play (i.e. the class = "Yes") for Wind = Moderate, Weather = Sunny:

P(C = Yes | X) ∝ P(X1 = Sunny | C = Yes) · P(X2 = Moderate | C = Yes) · P(C = Yes)

Second, we estimate the score for not going to Play (i.e. the class = "No") for the same features:

P(C = No | X) ∝ P(X1 = Sunny | C = No) · P(X2 = Moderate | C = No) · P(C = No)
As you can see from calculations above:
0.0098 > 0.0036 → P(C = Yes | X) > P(C = No | X).
Thus, our prediction is that the given data point has class "Yes", meaning that we are going to Play.
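The comparison above can be sketched as follows. The Weather likelihoods follow the classic table (P(Sunny | Yes) = 2/9, P(Sunny | No) = 3/5), but the Wind likelihoods are made-up placeholders, since the Wind table is not reproduced in this text; the scores will therefore not match 0.0098 and 0.0036 exactly, though the mechanics are the same.

```python
def posterior_score(class_prior, feature_likelihoods):
    # Unnormalized posterior: P(c) * product of P(x_i | c),
    # relying on the naive independence assumption
    score = class_prior
    for p in feature_likelihoods:
        score *= p
    return score

# P(Yes)=9/14, P(Sunny|Yes)=2/9; P(Moderate|Yes)=6/9 is a made-up value
score_yes = posterior_score(9 / 14, [2 / 9, 6 / 9])
# P(No)=5/14, P(Sunny|No)=3/5;  P(Moderate|No)=1/5 is a made-up value
score_no = posterior_score(5 / 14, [3 / 5, 1 / 5])

prediction = "Yes" if score_yes > score_no else "No"
```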
4. Assumption on Independent Variables: Why Naive?
The rule has a very simple derivation that follows directly from the relationship between joint and conditional probabilities.
First, note that

P(A | B) = P(A ∩ B) / P(B)    (2)

Next, we can write the conditional probability P(B | A):

P(B | A) = P(B ∩ A) / P(A)    (3)

From equation (3) we can express P(B ∩ A) as P(B | A) · P(A).
Probability theory tells us that P(A ∩ B) = P(B ∩ A). Substituting P(B | A) · P(A) for P(A ∩ B) in equation (2), we get Bayes’ formula:

P(A | B) = P(B | A) · P(A) / P(B)
The denominator is the only term that is a function of the data (features) alone; it is not a function of the class we are currently looking at, so it is a constant (the same for all classes). Traditionally in Naive Bayes classification, we drop this denominator when making a prediction, as it does not affect the final outcome of the classifier:

P(A | B) ∝ P(B | A) · P(A)
Let's present an example:
Suppose your data consists of a feature vector X = {x1, x2, ..., x10} and class labels Y = {y1, y2, ..., y5}. A Bayes classifier then identifies the correct class label as the one that maximizes the following formula:

P(y) · P(x1, x2, ..., x10 | y)

So far, this is still not naive. However, the joint likelihood P(x1, x2, ..., x10 | y) is hard to calculate, so we assume the features to be independent given the class; this is what we call the naive assumption. Hence, we end up with the following formula instead:

P(y) · P(x1 | y) · P(x2 | y) · ... · P(x10 | y)
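This final formula can be sketched in log space: summing logarithms instead of multiplying probabilities avoids numeric underflow when many small likelihoods are involved. The class priors below reuse the weather example; every conditional probability is an assumed toy value, and the tiny constant stands in for a zero count (a crude stand-in for proper smoothing).

```python
import math

def predict(priors, likelihoods, x):
    # argmax over classes y of  log P(y) + sum_i log P(x_i | y)
    best_class, best_logp = None, -math.inf
    for y, prior in priors.items():
        logp = math.log(prior) + sum(
            math.log(likelihoods[y][i][xi]) for i, xi in enumerate(x)
        )
        if logp > best_logp:
            best_class, best_logp = y, logp
    return best_class

# Hypothetical two-feature example (Weather, Wind); Wind values are made up
priors = {"Yes": 9 / 14, "No": 5 / 14}
likelihoods = {
    "Yes": [{"Sunny": 2 / 9, "Overcast": 4 / 9, "Rainy": 3 / 9},
            {"Moderate": 6 / 9, "Strong": 3 / 9}],
    "No":  [{"Sunny": 3 / 5, "Overcast": 1e-9, "Rainy": 2 / 5},
            {"Moderate": 1 / 5, "Strong": 4 / 5}],
}
print(predict(priors, likelihoods, ["Sunny", "Moderate"]))  # prints Yes
```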
In simple terms, there are two aspects to remember when describing Naive Bayes:

Bayes theorem allows us to determine posterior probabilities from our priors, when presented with new evidence. (more simply, it is a method of revising existing predictions given new evidence)

Naive Bayes classifier is naive as it assumes that the presence (absence) of a particular feature of a class is unrelated to the presence (absence) of any other feature, given the class variable.

Naive Bayes is a generative model: it models the joint distribution of the features X and the target Y, P(X, Y) = P(Y) · P(X | Y), and from it derives the posterior probability P(Y | X).
For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
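The apple example can be sketched with made-up per-feature probabilities; the point is only that the naive assumption lets us multiply them as if redness, roundness, and size were independent given the class.

```python
# Made-up probabilities for a toy "is it an apple?" classifier
p_apple, p_not_apple = 0.5, 0.5
feat_given_apple     = {"red": 0.8, "round": 0.9, "about_4in": 0.7}
feat_given_not_apple = {"red": 0.3, "round": 0.4, "about_4in": 0.2}

def naive_score(prior, cond):
    # Multiply per-feature likelihoods as if they were independent
    score = prior
    for feature in ("red", "round", "about_4in"):
        score *= cond[feature]
    return score

is_apple = naive_score(p_apple, feat_given_apple) > \
           naive_score(p_not_apple, feat_given_not_apple)
```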