Recall that in a decision tree, we break a complex decision boundary into simple boundaries, and decide which region a point falls in by checking a few simple conditions as we walk down the tree.
We need to figure out how to build such a tree: which attributes to split on, and which values to split at, so that the paths of the tree distinguish the data.
One way to figure this out is by examining the entropy of the distribution of our data. Entropy tells us how surprised we are, on average, to see new values in a sequence; it essentially measures the spread (uncertainty) of our data.
To calculate the entropy of a joint distribution, H(X, Y), we take
H(X, Y) = - sum over x in X of (sum over y in Y of p(x, y)log2(p(x, y)))
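As a rough sketch of how these quantities can be estimated from data using empirical frequencies (the function names and toy values below are illustrative, not part of the notes):

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = -sum_x p(x) log2 p(x), estimated from a list of observed values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def joint_entropy(xs, ys):
    """H(X, Y) = -sum_{x,y} p(x, y) log2 p(x, y), estimated from paired observations."""
    return entropy(list(zip(xs, ys)))

# A fair coin has 1 bit of entropy; two independent fair bits have 2 bits jointly.
print(entropy(["H", "T", "H", "T"]))                      # 1.0
print(joint_entropy(["H", "T", "H", "T"], [0, 1, 1, 0]))  # 2.0
```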
If one of the variables is given, we plug in that value and then sum over the other variable. This gives us the conditional entropy for that specific value:
H(X|Y=y) = - sum over x in X of (p(x | y)log2(p(x | y)))
H(X|Y) = sum over y in Y of p(y)H(X|Y=y)
It's the sum, over all possible outcomes y, of these specific conditional entropies, each weighted by its probability p(y).
This tells us how much entropy remains in a random variable once we know the variable it is conditioned on. We can see this as a measure of how much impact one variable has on another.
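Continuing the sketch above (reusing the `entropy` helper and the `Counter` import), the conditional entropy could be estimated as:

```python
def conditional_entropy(xs, ys):
    """H(X|Y) = sum_y p(y) H(X|Y=y), estimated from paired observations."""
    n = len(ys)
    h = 0.0
    for y, count in Counter(ys).items():
        xs_given_y = [x for x, y2 in zip(xs, ys) if y2 == y]  # restrict to Y = y
        h += (count / n) * entropy(xs_given_y)                # weight H(X|Y=y) by p(y)
    return h
```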
IG(X|Y) = H(X) - H(X|Y)
This is referred to as the information gain in X due to Y.
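Using the same illustrative helpers, the information gain of an attribute with respect to the class labels can then be computed directly from the definition:

```python
def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y): how much knowing Y reduces the uncertainty in X."""
    return entropy(xs) - conditional_entropy(xs, ys)

# Toy data: the label is perfectly determined by the attribute, so the
# information gain equals the label's full entropy (1 bit here).
labels    = ["yes", "yes", "no", "no"]
attribute = ["sunny", "sunny", "rainy", "rainy"]
print(information_gain(labels, attribute))  # 1.0
```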
Finding the optimal decision tree is NP-hard, so we're going to aim for a simple approximation. Our first approach will be a simple, greedy approach, where we build up the tree node by node. Steps:
- At each node, pick the attribute that gives the highest information gain on the data reaching that node (see the sketch below).
- Split the data on that attribute and recurse on each branch.
- Stop when a node is pure (all one label) or there are no attributes left to split on.
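A minimal sketch of this greedy, ID3-style procedure, reusing the `information_gain` helper from above. It assumes the data is a list of dicts with categorical attributes plus a "label" key, which is purely a convention chosen for this example:

```python
from collections import Counter

def best_attribute(rows, attributes, label_key="label"):
    """Pick the attribute with the highest information gain on these rows."""
    labels = [r[label_key] for r in rows]
    return max(attributes,
               key=lambda a: information_gain(labels, [r[a] for r in rows]))

def build_tree(rows, attributes, label_key="label"):
    """Greedily split on the best attribute and recurse on each branch."""
    labels = [r[label_key] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]        # leaf: majority label
    attr = best_attribute(rows, attributes, label_key)
    children = {}
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        children[value] = build_tree(subset, remaining, label_key)
    return (attr, children)                                 # internal node: test + branches

def classify(tree, row):
    """Walk down the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[row[attr]]                          # unseen values would need handling
    return tree
```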
How big should the tree be? It needs to be small, but not too small: the tree has to distinguish your data sufficiently well.
However, it also can't be too big, to avoid computational complexity and to avoid overfitting.
We generally want to apply Occam's Razor: go for the simplest hypothesis that still explains the data.
We often get around this by regularizing the construction process (for example, by limiting the tree's depth or pruning it after construction).
In KNN, we have piecewise decision boundaries, whereas in decision trees we have axis-aligned, tree-structured boxes.
One way that we can classify multi-class data is to have K - 1 classifiers, each solving a two-class problem of separating class C_k from the points not in that class. This is called the 1 vs all algorithm.
However, this means that where the classifiers' regions collide there is more than one good answer: parts of the space are claimed by multiple classes.
So instead we introduce two-way classifiers, one for each possible pair of classes (K(K - 1)/2 of them). Points are classified according to a majority vote among these functions. This is called the 1 vs 1 classifier. However, it might still not classify all of the space unambiguously, since the pairwise votes can tie (two-way relations aren't transitive).
We get around this by making K discriminant functions, one per class with its own weights, and classifying each point according to whichever function gives the maximum value.
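As a sketch of this idea: one simple way (an assumption for illustration, not necessarily the method from the notes) to obtain K scoring functions is least-squares regression onto one-hot class targets; classification then takes the argmax over the K scores, so every point gets exactly one class.

```python
import numpy as np

def fit_k_discriminants(X, y, num_classes):
    """Fit K linear scoring functions (one per class) by least squares on one-hot targets."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature
    T = np.eye(num_classes)[y]                       # one-hot targets, shape (N, K)
    W, *_ = np.linalg.lstsq(X1, T, rcond=None)       # weights, shape (D + 1, K)
    return W

def predict(W, X):
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    scores = X1 @ W                                  # one score per class for each point
    return np.argmax(scores, axis=1)                 # pick the class with the max score

# Toy usage: three well-separated clusters, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)
W = fit_k_discriminants(X, y, num_classes=3)
print((predict(W, X) == y).mean())                   # should be close to 1.0 on this easy data
```

Because every point is assigned to the single highest-scoring function, there are no ambiguous or unclassified regions, unlike the 1 vs all and 1 vs 1 schemes.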