Naive Bayes

Posted by Leonard Meng on March 2, 2018

0 Probability theory

\[\begin{gathered} P(X\cap y)=P(X|y)P(y)=P(y|X)P(X)\\ P(y|X)=\frac{P(y)P(X|y)}{P(X)} \end{gathered}\]
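As a quick sanity check, here is a tiny Python sketch of the last line (Bayes' rule). The three input probabilities are made-up illustrative values, not taken from any dataset in this post.

```python
# Made-up illustrative probabilities (not from this post's dataset).
p_y = 0.3            # P(y)
p_x_given_y = 0.8    # P(X|y)
p_x = 0.5            # P(X)

# Bayes' rule: P(y|X) = P(y) * P(X|y) / P(X)
p_y_given_x = p_y * p_x_given_y / p_x
print(p_y_given_x)   # 0.48
```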

1. Naive Bayes

1.1 Principle of Naive Bayes

We all know that Naive Bayes is a probabilistic classification model. So what does Naive Bayes actually do? Given that some features X are observed, we want to predict what the label y should be. From the probability formulas above, $P(X)$, $P(y)$, and $P(X|y)$ can all be estimated easily from data, so $P(y|X)$ can be computed directly. Which y should we pick? We compute $P(y|X)$ for every possible value of y and choose the one with the highest probability. In other words, this is what is often referred to as computing the posterior probability from prior probabilities. That is:

\[\begin{aligned} \hat{y} &=\text{argmax}_{y \in Y}P(y|x) \\ &=\text{argmax}_{y \in Y}\frac{P(x|y)P(y)}{P(x)}\\ &=\text{argmax}_{y \in Y}P(x|y)P(y)\\ &=\text{argmax}_{y \in Y}P(x_1,x_2,x_3...,x_M|y)P(y) \end{aligned}\]

The last line above is not easy to compute directly, but we can make an assumption: the features of X are conditionally independent given y. The formula can then be rewritten as follows:

\[\begin{aligned} P\left(x_{1}, x_{2}, \ldots, x_{M} \mid y\right) P(y) & \approx P\left(x_{1} \mid y\right) P\left(x_{2} \mid y\right) \ldots P\left(x_{M} \mid y\right) P(y) \\ &=P(y) \prod_{m=1}^{M} P\left(x_{m} \mid y\right) \end{aligned}\]

Now we have a formula that we can actually compute. If anything is still unclear, don’t worry: the next section walks through an example of this process.
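Before the worked example, here is a minimal Python sketch of the decision rule above. It assumes the class priors $P(y)$ and the per-feature conditionals $P(x_m|y)$ have already been estimated; the dictionary layout (`priors`, `conditionals`) is just one hypothetical way to store them.

```python
def predict(x, priors, conditionals):
    """Return argmax_y P(y) * prod_m P(x_m | y).

    priors:       {class_label: P(y)}
    conditionals: {class_label: [ {feature_value: P(x_m = value | y)} for each feature m ]}
    """
    best_class, best_score = None, -1.0
    for y, p_y in priors.items():
        score = p_y
        for m, value in enumerate(x):
            # A value never seen together with class y contributes probability 0
            # (this is exactly the problem that smoothing fixes later in this post).
            score *= conditionals[y][m].get(value, 0.0)
        if score > best_score:
            best_class, best_score = y, score
    return best_class
```

The next section fills in `priors` and `conditionals` with counts from a concrete dataset.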

2 An Example

2.1 Dataset

| Headache | Sore | Temperature | Cough | Diagnosis |
|-|-|-|-|-|
| severe | mild | high | yes | Flu |
| no | severe | normal | yes | Cold |
| mild | mild | normal | yes | Flu |
| mild | no | normal | no | Cold |
| severe | severe | normal | yes | Flu |

The dataset above records whether a patient was diagnosed with the flu or a cold given various symptoms. I will work through Headache as an example; for the remaining features I will just give the results. There are three flu cases, so $P(Flu) = \frac{3}{5}$. Of those three flu cases, two have a severe headache, so $P(Headache = severe|Flu) = \frac{2}{3}$; one has a mild headache, so $P(Headache = mild|Flu) = \frac{1}{3}$; and none has no headache, so $P(Headache = no|Flu) = 0$. All the probabilities are given directly below:
\[\begin{gathered} P(Flu) = 3/5 \quad P(Cold) = 2/5 \\ P(Headache = severe|Flu) = 2/3 \quad P(Headache = severe|Cold) = 0/2 \\ P(Headache = mild|Flu) = 1/3 \quad P(Headache = mild|Cold) = 1/2 \\ P(Headache = no|Flu) = 0/3 \quad P(Headache = no|Cold) = 1/2 \\ P(Sore = severe|Flu) = 1/3 \quad P(Sore = severe|Cold) = 1/2 \\ P(Sore = mild|Flu) = 2/3 \quad P(Sore = mild|Cold) = 0/2 \\ P(Sore = no|Flu) = 0/3 \quad P(Sore = no|Cold) = 1/2 \\ P(Temp = high|Flu) = 1/3 \quad P(Temp = high|Cold) = 0/2 \\ P(Temp = normal|Flu) = 2/3 \quad P(Temp = normal|Cold) = 2/2 \\ P(Cough = yes|Flu) = 3/3 \quad P(Cough = yes|Cold) = 1/2 \\ P(Cough = no|Flu) = 0/3 \quad P(Cough = no|Cold) = 1/2 \end{gathered}\]

Now let’s compute a prediction. Suppose a person has a mild headache, a severe sore throat, a normal body temperature, and no cough. Which is more probable, a cold or the flu? First assume the person has the flu and compute the flu score under these features, then do the same for a cold.

\[\begin{gathered} P(\text{Flu})\times P(H=m \mid Flu) P(S=s \mid Flu) P(T=n \mid Flu) P(C=n \mid Flu) \\ = \frac{3}{5} \times\left(\frac{1}{3}\right)\left(\frac{1}{3}\right)\left(\frac{2}{3}\right)\left(\frac{0}{3}\right)=0 \end{gathered}\] \[\begin{gathered} P(\text{Cold})\times P(H=m |Cold) P(S=s |Cold) P(T=n|Cold) P(C=n|Cold) \\ = \frac{2}{5} \times\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)\left(\frac{2}{2}\right)\left(\frac{1}{2}\right)=0.05 \\ \end{gathered}\]

From the calculation of the probability, we can predict that this person has a cold.
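For completeness, here is a sketch that rebuilds the probability tables from the five training rows and reproduces the two scores above (0 for Flu, 0.05 for Cold). The variable names and data layout are my own; only the numbers come from the dataset in Section 2.1.

```python
from collections import Counter, defaultdict

# Training rows from Section 2.1. Feature order: Headache, Sore, Temperature, Cough.
data = [
    (("severe", "mild",   "high",   "yes"), "Flu"),
    (("no",     "severe", "normal", "yes"), "Cold"),
    (("mild",   "mild",   "normal", "yes"), "Flu"),
    (("mild",   "no",     "normal", "no"),  "Cold"),
    (("severe", "severe", "normal", "yes"), "Flu"),
]

class_counts = Counter(y for _, y in data)        # {"Flu": 3, "Cold": 2}
feature_counts = defaultdict(Counter)             # (class, feature index) -> value counts
for x, y in data:
    for m, value in enumerate(x):
        feature_counts[(y, m)][value] += 1

def score(x, y):
    """P(y) * prod_m P(x_m | y), using raw (unsmoothed) counts."""
    s = class_counts[y] / len(data)
    for m, value in enumerate(x):
        s *= feature_counts[(y, m)][value] / class_counts[y]
    return s

# Mild headache, severe sore throat, normal temperature, no cough.
query = ("mild", "severe", "normal", "no")
print(score(query, "Flu"))    # 0.0
print(score(query, "Cold"))   # 0.05
```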

3 Smoothing of Naive Bayes

Now let’s consider the following situation. Suppose a person has a severe headache, a mild sore throat, a high temperature, and no cough. What does our calculation look like?

Cold:

\[\begin{aligned} P(\text{Cold}) & \times P(H=s \mid Cold) P(S=m \mid Cold) P(T=h \mid Cold) P(C=n \mid Cold) \\ &= \frac{2}{5} \times\left(\frac{0}{2}\right)\left(\frac{0}{2}\right)\left(\frac{0}{2}\right)\left(\frac{1}{2}\right)=0 \end{aligned}\]

Flu:

\[\begin{aligned} P(\text{Flu}) & \times P(H=s \mid Flu) P(S=m \mid Flu) P(T=h \mid Flu) P(C=n \mid Flu) \\ &= \frac{3}{5} \times\left(\frac{2}{3}\right)\left(\frac{2}{3}\right)\left(\frac{1}{3}\right)\left(\frac{0}{3}\right)=0 \end{aligned}\]

In this case, the score for both classes is 0. Why? Because a feature value that never appears together with a class in the training data gets a conditional probability of 0, and a single zero factor wipes out the entire product, no matter how well the other features match. Smoothing exists to solve this problem. Two types of smoothing are described below.

3.1 Epsilon Smoothing

Epsilon smoothing is very easy to understand: every probability of 0 is replaced with $\epsilon$, a tiny positive constant. We then get the following results:

Cold:

\[\begin{aligned} P(\text{Cold}) & \times P(H=s \mid Cold) P(S=m \mid Cold) P(T=h \mid Cold) P(C=n \mid Cold) \\ &= \frac{2}{5} \times(\epsilon)(\epsilon)(\epsilon)\left(\frac{1}{2}\right)=\frac{\epsilon^{3}}{5} \end{aligned}\]

Flu:

\[\begin{aligned} P(\text{Flu}) & \times P(H=s \mid Flu) P(S=m \mid Flu) P(T=h \mid Flu) P(C=n \mid Flu) \\ &= \frac{3}{5} \times\left(\frac{2}{3}\right)\left(\frac{2}{3}\right)\left(\frac{1}{3}\right)(\epsilon)=\frac{12 \epsilon}{135}=\frac{4 \epsilon}{45} \end{aligned}\]

Since $\epsilon$ is tiny, we can compare the powers of $\epsilon$ in the two scores: the higher the power, the smaller the score. The cold score contains $\epsilon^{3}$ while the flu score contains only a single $\epsilon$, so this person is predicted to have the flu.
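A minimal sketch of epsilon smoothing on this example, assuming a concrete value for $\epsilon$ (the post leaves it unspecified, so `EPS = 1e-9` below is just an illustrative choice):

```python
# Epsilon smoothing: any conditional probability that would be 0 is
# replaced with a tiny constant EPS (an assumed, illustrative value).
EPS = 1e-9

def smoothed(count, class_total):
    """P(x_m = value | y) with zero estimates replaced by EPS."""
    p = count / class_total
    return p if p > 0 else EPS

# Example from Section 3: severe headache, mild sore, high temp, no cough.
p_flu  = 3/5 * smoothed(2, 3) * smoothed(2, 3) * smoothed(1, 3) * smoothed(0, 3)
p_cold = 2/5 * smoothed(0, 2) * smoothed(0, 2) * smoothed(0, 2) * smoothed(1, 2)
print(p_flu, p_cold)   # Flu wins: its score has one factor of EPS, Cold's has three.
```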

3.1.1 Pros and Cons of Epsilon Smoothing

Advantages

Simple: the calculation barely changes; every probability of 0 is just replaced with $\epsilon$.

Disadvantages

  1. Epsilon smoothing has a fatal flaw: it adds probability mass out of thin air, because every 0 becomes a positive number, so the conditional probabilities for a class no longer sum to 1.
  2. On paper we can leave $\epsilon$ symbolic, but when implementing the algorithm we must pick a concrete value, and choosing that value is another problem in itself.

3.2 Laplace smoothing

The idea behind Laplace smoothing is also very simple. Every time we estimate a conditional probability, we add $\alpha$ to the numerator ($\alpha$ is an arbitrary positive number, usually 1 for convenience of calculation) and add the number of possible values of the current feature multiplied by $\alpha$ to the denominator. Here is the formula, where $M$ is the number of values feature $x_m$ can take:

\[P(x_{m}=j| y=k)=\frac{\alpha+\text{count}\left(y=k, x_{m}=j\right)}{M \alpha+\text{count}(y=k)}\]

Here are some of the probabilities after Laplace smoothing:

\[\begin{array}{lcc} & \text{original estimate} & \text{smoothed estimate }(\alpha=1) \\ P(\text{Headache}=\text{severe} \mid \text{Flu}) & 2/3 & (2+1)/(3+3)=3/6 \\ P(\text{Headache}=\text{mild} \mid \text{Flu}) & 1/3 & (1+1)/(3+3)=2/6 \\ P(\text{Headache}=\text{no} \mid \text{Flu}) & 0/3 & (0+1)/(3+3)=1/6 \\ P(\text{Cough}=\text{yes} \mid \text{Flu}) & 3/3 & (3+1)/(3+2)=4/5 \\ P(\text{Cough}=\text{no} \mid \text{Flu}) & 0/3 & (0+1)/(3+2)=1/5 \end{array}\]
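A small sketch of the Laplace-smoothed estimate from the formula above, reproducing a few of the table entries; the function name and signature are my own, not a library API.

```python
def laplace(count_value_and_class, count_class, num_values, alpha=1.0):
    """(count(y=k, x_m=j) + alpha) / (count(y=k) + num_values * alpha)."""
    return (count_value_and_class + alpha) / (count_class + num_values * alpha)

# Headache takes 3 values (severe, mild, no); Cough takes 2 (yes, no).
print(laplace(2, 3, 3))  # P(Headache=severe|Flu) = 3/6 = 0.5
print(laplace(0, 3, 3))  # P(Headache=no|Flu)     = 1/6 ≈ 0.167
print(laplace(0, 3, 2))  # P(Cough=no|Flu)        = 1/5 = 0.2
```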

3.2.1 Advantages and Disadvantages of Laplace Smoothing

Advantages

  1. The probabilities after Laplace smoothing still sum to 1
  2. It reduces variance: estimates are pulled toward the center, so the distribution is more concentrated

Disadvantages

  1. It is not well suited to small datasets, because it can drastically change the estimated probability distribution
  2. It is safe to use on large datasets, but it introduces some bias
  3. Still the same question: how do we choose the value of $\alpha$?

3.3 Other Smoothing

Good-Turing, Kneser-Ney, Regression

4 Advantages and Disadvantages of Naive Bayes

4.1 Advantages

  1. Simple logic, easy to implement
  2. Low resource consumption in the classification process

4.2 Disadvantages

That is why Naive Bayes is called "naive"

Looking back at our derivation, we made the assumption that all features are conditionally independent given the class. In reality this assumption almost never holds, and that is both the main weakness of the model and the reason it is called naive.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You are welcome to repost this article; please credit the source (Lingjun Meng's Blog), keep the content complete, and retain this statement.