We use cross-validation to estimate how well our model performs, measured by its error rate on data it was not trained on. After training the model on the training set, we try it on a separate dataset (the validation set) that was held back from training.
In cross-validation, we have three types:
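Heavy use of cross-validation can itself lead to overfitting, because we end up choosing whichever model happens to do best on the validation data. To guard against this, we can hold out an additional test set that is never used during validation, and use it only to confirm that the best model generalizes. Cross-validation can also be very time-consuming, since the model has to be retrained for every split.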
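As a rough illustration of the idea above, here is a minimal sketch of k-fold cross-validation, which is one common way of splitting the data. The names k_fold_indices, cross_val_error, train_fn and predict_fn are made up for this sketch and do not come from any library. It assumes there are at least k examples, and that any final test set has already been set aside before calling it.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_error(train_fn, predict_fn, X, y, k=5, seed=0):
    """Average error rate over k folds; each fold is the validation set once.

    train_fn(X_train, y_train) returns a model (hypothetical interface).
    predict_fn(model, x) returns the predicted class of one point.
    Assumes len(X) >= k, so that no fold is empty.
    """
    folds = k_fold_indices(len(X), k, seed)
    errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        # Error rate on the fold that was left out.
        wrong = sum(predict_fn(model, X[j]) != y[j] for j in val_idx)
        errors.append(wrong / len(val_idx))
    return sum(errors) / k
```

The returned value is the estimate we would compare across candidate models; the held-out test set is only touched once, after the best model has been picked.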
k-nearest neighbours is a classification method that works out the class of a point by examining the k nearest training points. If k = 1, the point is given the class of its single nearest neighbour. Otherwise, we average the class values of the k nearest points and set the class to be the one nearest to that average, which for two classes amounts to a majority vote. To avoid ties between two classes, k is usually odd.
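A minimal sketch of this method, assuming points are tuples of numbers and distance is Euclidean (the notes don't specify a distance measure). The function name knn_classify is made up for this sketch. It uses a majority vote rather than averaging, which gives the same answer when there are two classes and k is odd.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Order the training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Count the class labels of the k closest points.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to copying the class of the nearest point. Sorting every training point for each query is the simplest approach, not the fastest.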
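Recall that a decision tree classifies a point by guiding it down the tree, checking a condition on one of the point's attributes at each node. Because each condition splits the space along a single attribute, the resulting decision boundary is a more complex shape built up from several rectangles.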
Also recall that the entropy of a set S is E(S) = -P+ log2(P+) - P- log2(P-), where P+ and P- are the proportions of positive and negative examples in S. Entropy is 0 when every example has the same class and 1 when the two classes are evenly split.
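The entropy formula above can be sketched directly in code. The function name entropy and its count-based arguments are choices made for this sketch; it also takes 0 * log2(0) to be 0, which is the usual convention for a set with only one class.

```python
import math

def entropy(num_pos, num_neg):
    """Entropy of a set with num_pos positive and num_neg negative examples."""
    total = num_pos + num_neg
    result = 0.0
    for count in (num_pos, num_neg):
        if count > 0:  # skip empty classes: 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result
```

For example, entropy(5, 5) is 1.0 (an even split) and entropy(10, 0) is 0.0 (a pure set).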
When constructing a decision tree for categorical data, we can repeatedly partition our data on the different attributes that affect the end classification, until each leaf holds a set of data that all has the same class.
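The partitioning described above can be sketched as a short recursive function. This is a simplified illustration, not a full algorithm: it assumes each row is a dict of categorical values, splits on the attributes in the order given (rather than picking the attribute with the best entropy reduction), and falls back to the majority class if the attributes run out before a leaf is pure. The names build_tree and classify are made up for this sketch.

```python
from collections import Counter

def build_tree(rows, attributes, target):
    """Build a tree from rows (list of dicts) by partitioning on attributes.

    A leaf is a class label; an internal node is (attribute, {value: subtree}).
    Assumes class labels are not tuples, so leaves and nodes can be told apart.
    """
    classes = [row[target] for row in rows]
    # Stop when the data is pure or no attributes remain; use the majority class.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    # Partition the rows by their value for the chosen attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr], []).append(row)
    return (attr, {value: build_tree(subset, rest, target)
                   for value, subset in partitions.items()})

def classify(tree, row):
    """Guide a row down the tree until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]  # raises KeyError for an unseen value
    return tree
```

Taking attributes in a fixed order keeps the sketch short; choosing the split that lowers entropy the most at each node would usually give a smaller tree.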
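There are a few things we need to watch out for. If the training data contains incorrect examples, such as noisy or mislabelled points, a tree grown until every leaf is pure will fit those errors too, and we'll overfit.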