Decision Tree

Posted by Leonard Meng on March 8, 2018

Chapter 0 Dataset

| Outlook  | Temperature | Humidity | Windy | Play |
|----------|-------------|----------|-------|------|
| sunny    | hot         | high     | FALSE | no   |
| sunny    | hot         | high     | TRUE  | no   |
| overcast | hot         | high     | FALSE | yes  |
| rainy    | mild        | high     | FALSE | yes  |
| rainy    | cool        | normal   | FALSE | yes  |
| rainy    | cool        | normal   | TRUE  | no   |
| overcast | cool        | normal   | TRUE  | yes  |
| sunny    | mild        | high     | FALSE | no   |
| sunny    | cool        | normal   | FALSE | yes  |
| rainy    | mild        | normal   | FALSE | yes  |
| sunny    | mild        | normal   | TRUE  | yes  |
| overcast | mild        | high     | TRUE  | yes  |
| overcast | hot         | normal   | FALSE | yes  |
| rainy    | mild        | high     | TRUE  | no   |
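
For the calculations later in this post it is convenient to have this table in code. Below is a minimal Python sketch (my own; the names `CSV` and `DATA` are not from any library) that loads the dataset into a list of dictionaries. The later snippets reuse `DATA`.

```python
import csv
import io

# The weather dataset from the table above, one row per instance.
CSV = """Outlook,Temperature,Humidity,Windy,Play
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no"""

# Each instance becomes a dict, e.g. {'Outlook': 'sunny', ..., 'Play': 'no'}.
DATA = list(csv.DictReader(io.StringIO(CSV)))
```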

Chapter 1 Decision Tree

1.1 From Decision Stump to Decision Tree

1.1.1 0-R Baseline

There are several commonly used baselines in machine learning. Among them, the 0-R baseline simply finds the most frequent label in the training data and predicts that label for every instance.

[Figure: class counts for Play in the training set: 9 yes, 5 no]

Since yes is the majority label in the training set (9 of 14 instances), we predict play=yes for every instance.

Total errors = $\frac{5}{14}$
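
A few lines of Python (my own sketch, reusing the `DATA` list from the Chapter 0 snippet) reproduce the 0-R baseline and its error rate:

```python
from collections import Counter

# 0-R: predict the majority label for every instance.
label_counts = Counter(row["Play"] for row in DATA)    # Counter({'yes': 9, 'no': 5})
majority_label, majority_count = label_counts.most_common(1)[0]

errors = len(DATA) - majority_count
print(majority_label, f"{errors}/{len(DATA)}")         # yes 5/14
```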

1.1.2 1-R Baseline (Decision Stump)

The 1-R baseline builds on the 0-R baseline by selecting a single attribute and predicting the majority label within each of its values.

Using Outlook:

[Figure: class counts per Outlook value: sunny 2 yes / 3 no, overcast 4 yes / 0 no, rainy 3 yes / 2 no]

According to these counts, when outlook=sunny we predict play=no, and when outlook=overcast or outlook=rainy we predict play=yes.

Total errors = $\frac{4}{14}$
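
The same check extends to 1-R: for each attribute, predict the majority class within each attribute value and count the mistakes. A sketch reusing `DATA`; `one_r_errors` is my own helper name:

```python
from collections import Counter, defaultdict

def one_r_errors(data, attribute, label="Play"):
    """Number of training errors made by a 1-R stump on `attribute`."""
    by_value = defaultdict(Counter)
    for row in data:
        by_value[row[attribute]][row[label]] += 1
    # Each branch predicts its own majority class, so the errors at a branch
    # are the instances that do not belong to that majority class.
    return sum(sum(counts.values()) - max(counts.values())
               for counts in by_value.values())

for attr in ["Outlook", "Temperature", "Humidity", "Windy"]:
    print(attr, f"{one_r_errors(DATA, attr)}/{len(DATA)}")
# Outlook gives 4/14 (Humidity also happens to reach 4/14 on this data).
```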

1.1.3 Decision Tree

A decision tree generalises the 1-R baseline (decision stump): instead of splitting on a single attribute, it splits on a sequence of attributes, one per level of the tree.

[Figure: a decision tree built on the weather dataset using several attributes]

When do we stop growing the tree?

  1. All the leaf nodes are (largely) dominated by a single class (that is the leaf nodes are nearly pure).
  2. All attributes have been used.

Chapter 2 Decision Tree Generation Algorithm (ID3)

We have already understood the use of decision trees. So how should a decision tree be optimized? In other words, how do we generate an optimal decision tree?

==Constructing an optimal decision tree is NP-hard==, so in practice we grow the tree greedily.

2.0 Criterion for Attribute Selection

We want to get the smallest tree (Occam's Razor; generalisability): prefer the shortest hypothesis that fits the data. Arguments in favour:

  • There are fewer short hypotheses than long hypotheses
    • a short hypothesis that fits the data is unlikely to be a coincidence
    • a long hypothesis that fits the data might be a coincidence

2.1 Introduction

Basic method: recursive divide-and-conquer

```
FUNCTION ID3(Root)
    IF all instances at Root have the same class
    THEN stop
    ELSE
        Step 1: Select a new attribute to use in partitioning the Root node instances
        Step 2: Create a branch for each attribute value and partition the Root node
                instances according to each value
        Step 3: Call ID3(LEAF_i) for each leaf node LEAF_i
```
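
To make the recursion concrete, here is a minimal, runnable Python sketch of ID3 (my own simplified implementation, not reference code). It selects attributes by information gain, which is defined formally in the next sections, reuses `DATA` from Chapter 0, and includes an optional `min_gain` threshold of the kind discussed in Section 2.6.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attribute, label="Play"):
    """Entropy reduction obtained by partitioning `data` on `attribute`."""
    before = entropy([row[label] for row in data])
    after = 0.0
    for value in set(row[attribute] for row in data):
        subset = [row[label] for row in data if row[attribute] == value]
        after += len(subset) / len(data) * entropy(subset)
    return before - after

def id3(data, attributes, label="Play", min_gain=0.0):
    labels = [row[label] for row in data]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop if the node is pure or no attributes remain: return a leaf.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Step 1: select the attribute with the largest information gain.
    best = max(attributes, key=lambda a: info_gain(data, a, label))
    if info_gain(data, best, label) <= min_gain:   # optional threshold (Section 2.6)
        return majority
    # Steps 2-3: create a branch per value of `best` and recurse on each partition.
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in data if r[best] == value],
                              remaining, label, min_gain)
                   for value in set(row[best] for row in data)}}

tree = id3(DATA, ["Outlook", "Temperature", "Humidity", "Windy"])
print(tree)   # the root splits on Outlook, matching the calculation in Section 2.3
```

The returned tree is a nested dictionary: internal nodes map an attribute to its value branches, and leaves are class labels.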

2.2 Preliminaries

2.2.1 Entropy

==The expected (average) level of surprise or uncertainty.==

  • ==Low probability event==: if it happens, it’s big news! High surprise! ==High information!==, ==unpredictable==
  • ==High probability event==: it was likely to happen anyway. Not very surprising. ==Low information!==, ==predictable==

Definition

A measure of ==unpredictability==. The level of unpredictability (surprise) for a single event $i$ is its self-information:

\[\text{self-info}(i) = \log_2 \frac{1}{P(i)} = -\log_2 P(i)\]

Given a probability distribution, the information (in bits) required to predict an event is the distribution's entropy or information value. The entropy of a discrete random event $x$ with possible outcomes $x_1, \ldots, x_n$ is:

\[\begin{aligned} H(x) &= \sum_{i=1}^{n} P(x_i)\, \text{self-info}(x_i) \\ &= -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \end{aligned}\]

where $0 \log_2 0 \stackrel{\text{def}}{=} 0$.

Example 1

Biased coin, 55 flips: 50 heads, 5 tails:

\[\begin{aligned} H &= -\left[\frac{50}{55} \log_2\left(\frac{50}{55}\right) + \frac{5}{55} \log_2\left(\frac{5}{55}\right)\right] \\ &\approx 0.44 \text{ bits} \end{aligned}\]

A (roughly) fair coin, 55 flips: 30 heads, 25 tails:

\[\begin{aligned} H &= -\left[\frac{30}{55} \log_2\left(\frac{30}{55}\right) + \frac{25}{55} \log_2\left(\frac{25}{55}\right)\right] \\ &\approx 0.99 \text{ bits} \end{aligned}\]

The more uncertainty, the higher the entropy.

Example 2

In the context of decision trees, we look at the class distribution at a node:

  • $50\ \mathrm{Y}$ instances, $5\ \mathrm{N}$ instances:

\[H = -\left[\frac{50}{55} \log_2\left(\frac{50}{55}\right) + \frac{5}{55} \log_2\left(\frac{5}{55}\right)\right] \approx 0.44 \text{ bits}\]

  • $30\ \mathrm{Y}$ instances, $25\ \mathrm{N}$ instances:

\[H = -\left[\frac{30}{55} \log_2\left(\frac{30}{55}\right) + \frac{25}{55} \log_2\left(\frac{25}{55}\right)\right] \approx 0.99 \text{ bits}\]

We want to classify with high certainty. ==We want leaves with low entropy!==
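
These numbers are easy to verify; a quick, standalone Python check (my own snippet):

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution; 0 * log2(0) is treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([50/55, 5/55]), 2))    # 0.44 -- biased coin / 50 Y, 5 N node
print(round(entropy([30/55, 25/55]), 2))   # 0.99 -- near-fair coin / 30 Y, 25 N node
print(round(entropy([0.5, 0.5]), 2))       # 1.0  -- two equally likely outcomes (maximum)
```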

2.2.2 Mean Information

The weighted average of the entropy over the children after the split.

We calculate the mean information for a tree stump with $m$ attribute values as:

\[\text{Mean Info}(x_1, \ldots, x_m) = \sum_{i=1}^{m} P(x_i) H(x_i)\]

where $H(x_i)$ is the entropy of the class distribution for the instances at node $x_i$, and $P(x_i)$ is the proportion of instances at sub-node $x_i$.
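
Anticipating the Outlook example in Section 2.3, this formula can be written directly in Python (standalone sketch; `mean_info` is my own helper name):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mean_info(children):
    """children: (P(x_i), class distribution at x_i) pairs, one per child node."""
    return sum(weight * entropy(probs) for weight, probs in children)

# Outlook stump: sunny 2 yes / 3 no, overcast 4 yes / 0 no, rainy 3 yes / 2 no.
outlook = [(5/14, [2/5, 3/5]), (4/14, [4/4, 0/4]), (5/14, [3/5, 2/5])]
print(mean_info(outlook))   # ~0.6935, the 0.693 used in Section 2.3
```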

2.2.3 Information Gain

Decision tree with low entropy: class is more predictable.

‘==Reduction of entropy== before and after the data is partitioned using the attribute A’.

Select the attribute that has ==largest information gain==: the most entropy (==uncertainty==) is reduced, class is ==most predictable==.

If the entropy ==decreases==, then we have a better tree (more predictable)

  • We determine which attribute $R_A$ (with values $x_1, \ldots, x_m$) best partitions the instances at a given root node $R$ according to information gain (IG):

\[\begin{aligned} IG(R_A \mid R) &= H(R) - \text{mean-info}(R_A) \\ &= H(R) - \sum_{i=1}^{m} P(x_i) H(x_i) \end{aligned}\]

  • Here $H(R)$ is the entropy before splitting, and $\text{mean-info}(R_A)$ is the weighted entropy after splitting on the attribute's values.

2.3 Example

2.3.1 Entropy of the Root Node

\[\begin{aligned} H(R) &= -\left(\frac{9}{14} \log_2\left(\frac{9}{14}\right) + \frac{5}{14} \log_2\left(\frac{5}{14}\right)\right) \\ &= -(-0.4098 - 0.5305) = 0.940 \end{aligned}\]

2.3.2 Entropy of the Child Nodes

Here I will calculate the entropy of each child node when splitting on Outlook.

[Figure: class counts per Outlook value: sunny 2 yes / 3 no, overcast 4 yes / 0 no, rainy 3 yes / 2 no]

\[\begin{aligned} H(\text{rainy}) &= -\left(\frac{3}{5} \log_2\left(\frac{3}{5}\right) + \frac{2}{5} \log_2\left(\frac{2}{5}\right)\right) \\ &= -(-0.4422 - 0.5288) = 0.971 \\ H(\text{overcast}) &= -\left(\frac{4}{4} \log_2\left(\frac{4}{4}\right) + \frac{0}{4} \log_2\left(\frac{0}{4}\right)\right) \\ &= 0 \\ H(\text{sunny}) &= -\left(\frac{2}{5} \log_2\left(\frac{2}{5}\right) + \frac{3}{5} \log_2\left(\frac{3}{5}\right)\right) \\ &= 0.971 \end{aligned}\]

2.3.3 Mean Info

\[\begin{aligned} \text{Mean Info(Outlook)} &= P(\text{rainy}) H(\text{rainy}) + P(\text{overcast}) H(\text{overcast}) + P(\text{sunny}) H(\text{sunny}) \\ &= \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.693 \end{aligned}\]

Mean Info(Temperature) = 0.911

Mean Info(Humidity) = 0.787

Mean Info(Windy) = 0.892

2.3.4 Information Gain

\[\begin{aligned} IG(R_A \mid R) &= H(R) - \text{mean-info}(R_A) \\ &= H(R) - \sum_{i=1}^{m} P(x_i) H(x_i) \\ &\color{red}{IG(\text{outlook} \mid R) = 0.247} \\ &IG(\text{temperature} \mid R) = 0.029 \\ &IG(\text{humidity} \mid R) = 0.152 \\ &IG(\text{windy} \mid R) = 0.048 \end{aligned}\]

Based on the above calculation, we should choose ==outlook== as the splitting attribute at the root.
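
The whole calculation in this section can be reproduced with a short script (my own sketch, reusing `DATA` from Chapter 0):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def mean_info(data, attribute, label="Play"):
    branches = {}
    for row in data:
        branches.setdefault(row[attribute], []).append(row[label])
    return sum(len(g) / len(data) * entropy(g) for g in branches.values())

H_root = entropy([row["Play"] for row in DATA])
print(f"H(R) = {H_root:.3f}")                              # H(R) = 0.940
for attr in ["Outlook", "Temperature", "Humidity", "Windy"]:
    print(f"IG({attr}) = {H_root - mean_info(DATA, attr):.3f}")
# IG(Outlook) = 0.247, IG(Temperature) = 0.029, IG(Humidity) = 0.152, IG(Windy) = 0.048
```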

2.4 Shortcomings of Information Gain

2.4.1 Shortcomings

Information gain tends to ==prefer highly-branching attributes:==

  • A subset of instances is more likely to be homogeneous (pure) if it contains only a few instances
  • An attribute with many values will have fewer instances at each child node

This ==may result in overfitting / fragmentation==

2.4.2 Example

[Figure: a stump splitting on a highly-branching attribute; each child node contains very few instances and is (nearly) pure]

==Mean info = 0==, so the information gain is maximal ($IG = H(R)$), even though such a fragmented split tells us little about unseen instances.

2.5 Solution: Gain Ratio

2.5.1 Split info (SI)

Split info (SI) is the entropy of the split itself, i.e. how evenly the instances are distributed across the attribute's values:

\[SI(R_A \mid R) = H(R_A) = -\sum_{i=1}^{m} P(x_i) \log_2 P(x_i)\]

\[\begin{aligned} SI(\text{outlook} \mid R) &= -\left(\frac{5}{14} \log_2\left(\frac{5}{14}\right) + \frac{4}{14} \log_2\left(\frac{4}{14}\right) + \frac{5}{14} \log_2\left(\frac{5}{14}\right)\right) \\ &= 1.577 \end{aligned}\]

2.5.2 Gain Ratio (GR)

Gain ratio (GR) reduces the bias of information gain towards highly-branching attributes by normalising IG by the split information:

\[\begin{aligned} GR(R_A \mid R) &= \frac{IG(R_A \mid R)}{SI(R_A \mid R)} = \frac{IG(R_A \mid R)}{H(R_A)} \\ &= \frac{H(R) - \sum_{i=1}^{m} P(x_i) H(x_i)}{-\sum_{i=1}^{m} P(x_i) \log_2 P(x_i)} \end{aligned}\]

For the weather dataset:

\[\begin{array}{ll} IG(\text{outlook} \mid R) &= 0.247 \\ SI(\text{outlook} \mid R) &= 1.577 \\ GR(\text{outlook} \mid R) &= 0.156 \\ IG(\text{humidity} \mid R) &= 0.152 \\ SI(\text{humidity} \mid R) &= 1.000 \\ GR(\text{humidity} \mid R) &= 0.152 \\ IG(\text{temperature} \mid R) &= 0.029 \\ SI(\text{temperature} \mid R) &= 1.557 \\ GR(\text{temperature} \mid R) &= 0.019 \\ IG(\text{windy} \mid R) &= 0.048 \\ SI(\text{windy} \mid R) &= 0.985 \\ GR(\text{windy} \mid R) &= 0.049 \end{array}\]

Based on this calculation, we should still choose outlook.
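
The split-info and gain-ratio columns can be verified the same way; the split info is just the entropy of the branch sizes. A sketch reusing `DATA` from Chapter 0 (`gain_ratio` is my own helper name):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(data, attribute, label="Play"):
    branches = {}
    for row in data:
        branches.setdefault(row[attribute], []).append(row[label])
    n = len(data)
    ig = entropy([row[label] for row in data]) - sum(
        len(g) / n * entropy(g) for g in branches.values())
    si = entropy([row[attribute] for row in data])   # entropy of the split itself
    return ig / si

for attr in ["Outlook", "Humidity", "Temperature", "Windy"]:
    print(f"GR({attr}) = {gain_ratio(DATA, attr):.3f}")
# GR(Outlook) = 0.156, GR(Humidity) = 0.152, GR(Temperature) = 0.019, GR(Windy) = 0.049
```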

2.6 Stopping criteria

  • We recurse until the instances at a node ==are of the same class==
  • This is consistent with our usage of entropy: if all of the instances are of a single class, the ==entropy of the distribution is 0==
  • Considering other attributes cannot “improve” an entropy of 0 — the Info Gain is 0 by definition
  • The Info Gain/Gain Ratio allows us to choose the (seemingly) best attribute at a given node
  • However, it is also an approximate indication of how much absolute improvement we expect from partitioning the data according to the values of a given attribute
  • An Info Gain of 0 means that there is no improvement; a very small improvement is often unjustifiable
  • Typical modification of ID3: choose best attribute only if IG/GR is greater than some threshold τ
  • Other similar approaches use pruning — post-process the tree to remove undesirable branches (with few instances, or small IG/GR improvements)
  • We might observe improvement through every layer of the tree
  • We then run out of attributes, even though one or more leaves could be improved further
  • Fall back to majority class label for instances at a leaf with a mixed distribution — unclear what to do with ties
  • Possibly can be taken as evidence that the given attributes are insufficient for solving the problem

Chapter 3 Discussion

3.1 Hypothesis Space Search in ID3

ID3 (and DT learning in general) is an instance of ==combinatorial optimization==

  • ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.

  • The hypothesis space searched by ID3 is the set of possible decision trees.
  • ID3 performs a greedy simple-to-complex search through this hypothesis space (with no backtracking)
    • beginning with the empty tree
    • considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data

3.2 Pros / Cons of Decision Trees

3.2.1 Pros

  • Highly regarded among basic supervised learners
  • Fast to train, even faster to classify
  • Very transparent (probably the most interpretable of all classification algorithms!)

3.2.2 Cons

  • Prone to overfitting
    • Driving the training entropy to 0 encourages the tree to memorise the training data
  • Loss of information for continuous variables
  • Complex calculation if there are many classes
  • No guarantee to return the globally optimal decision
  • Information gain: Bias for attributes with greater no. of values.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You are welcome to repost this article; please credit the source (Lingjun Meng's Blog) and keep the content and this statement intact.