Decision Trees
Common Decision Tree Algorithms
Gini Index
Chi-Square
Information Gain
Reduction in Variance
Mean Squared Error (MSE)
When performing regression with CART trees (i.e. the target values are continuous) we can evaluate a split using its MSE. The MSE of node $m$ is computed as follows:

$$MSE_m = \frac{1}{n_m} \sum_{i \in D_m} \left(\hat{y}_m - y^{(i)}\right)^2 \qquad \text{where} \qquad \hat{y}_m = \frac{1}{n_m} \sum_{i \in D_m} y^{(i)}$$

$D_m$: training examples in node $m$
$n_m$: total number of training examples in node $m$
$y^{(i)}$: target value of $i$-th example
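As an informal illustration (not part of the original lesson), a node's MSE can be computed directly with NumPy; the function name and data below are made up for the example:

```python
import numpy as np

def node_mse(y):
    """MSE of a node: mean squared deviation of the targets from the node's mean prediction."""
    y = np.asarray(y, dtype=float)
    y_hat = y.mean()                     # the node predicts the mean target value
    return np.mean((y - y_hat) ** 2)

# Example: targets of the training examples that landed in one node
print(node_mse([3.0, 4.5, 5.0, 4.0]))    # 0.546875
```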
Probably Approximately Correct (PAC)
PAC is a learning model characterized by learning from examples.
In its simplest form, consider $f$ as the target function to be learnt. The learner (the computational model) is provided with some random examples in the form of $(x, f(x))$.
Here, $x$ is a binary input instance of length $n$, and $f(x)$ is the value (boolean TRUE or FALSE) of the target function at that instance.
Based on these examples, the learner must succeed in deducing the target function, which is expressed as:
$f: \{0, 1\}^n \rightarrow \{0, 1\}$
"Probably": if the learner outputs a hypothesis with the desired probability and confidence, the results are probably correct.
"Approximately": a hypothesis is approximately correct if its error over the distribution of inputs is bounded by some predefined interval.
Training Process
The process of training a decision tree and predicting the target features of query instances is as follows:
Present a dataset of training examples containing features/predictors and a target (similar to the classifiers we have seen earlier).
Train the tree model by making splits on the target using the values of the predictors. The predictor to split on is selected following the idea of feature selection, using measures like "information gain" and "Gini index". We shall cover these shortly.
The tree is grown until some stopping criterion is reached. This could be a set depth of the tree or any other similar measure.
Show a new set of features to the tree, with an unknown class, and let the example propagate through the trained tree. The resulting leaf node represents the class prediction for this data. A minimal scikit-learn sketch of these steps is shown below.
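As a rough illustration of these steps, here is a minimal sketch using scikit-learn (the iris dataset and the parameter values are placeholders chosen for the example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. A dataset of features/predictors and a target
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2-3. Train the tree; splits are chosen by an impurity measure ("gini" or "entropy"),
#      and growth stops once a set depth is reached
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# 4. Propagate unseen examples through the trained tree; the leaf reached gives the prediction
print(clf.predict(X_test[:5]))
print("test accuracy:", clf.score(X_test, y_test))
```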
Entropy
The idea of entropy is usually derived from Shannon's description of entropy, which measures the information gained against some cost incurred in the process.
Entropy is a measure of Disorder or Uncertainty.
The entropy of a variable is the "amount of information" contained in the variable.
In terms of data, we can informally describe entropy as:
Entropy is an indicator of how messy your data is.
A high entropy always reflects "messed-up" data with low/no information content: the uncertainty about the content of the data remains (almost) the same after viewing it as it was before the data was available.
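For reference (standard Shannon entropy; the symbols below are our own addition), for a variable that takes $k$ possible values with probabilities $p_1, \dots, p_k$:

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i$$

For example, a two-class node with a 50/50 split has $H = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit (maximum disorder), while a pure node has $H = 0$.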
Information Gain
Information gain is an impurity/uncertainty-based criterion that uses entropy as the impurity measure.
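Concretely (a standard formulation, added here for reference using the entropy $H$ defined above), the information gain of splitting a set $S$ on a feature $A$ is the entropy of the parent minus the weighted entropy of the children:

$$IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

where $S_v$ is the subset of $S$ for which feature $A$ takes the value $v$; the split with the highest information gain is preferred.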
Tree Pruning
Now that we know how to grow a decision tree using Python and scikit-learn, let's move on and practice optimizing the classifier. This means that decision trees offer some options we can tweak before the actual learning takes place.
A decision tree grown beyond a certain level of complexity leads to over-fitting. If we keep growing our tree, carrying on with poor predictors that have no real impact on accuracy, we will eventually a) slow down the learning, and b) cause overfitting.
This process of trimming decision trees to optimize the learning process is called "Tree Pruning".
We can prune our trees using the following parameters:
Maximum Depth
Reduce the depth of the tree to build a more generalized tree. Set the depth of the tree to 3, 5, 10, etc., depending on verification against test data
Minimum Samples for a Split
Restrict the minimum number of samples a node must contain before it can be split
Minimum Leaf Sample Size
The minimum size of terminal (leaf) nodes can be fixed to 30, 100, 300, or 5% of the total samples
Maximum Leaf Nodes
Reduce the number of leaf nodes
Maximum Features
Maximum number of features to consider when splitting a node
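These controls map directly onto scikit-learn constructor arguments; a minimal sketch (the particular values are arbitrary examples, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one pruning control described above
pruned_tree = DecisionTreeClassifier(
    max_depth=5,            # Maximum Depth
    min_samples_split=30,   # Minimum samples required to split a node
    min_samples_leaf=0.05,  # Minimum leaf size (here 5% of the training samples)
    max_leaf_nodes=20,      # Maximum Leaf Nodes
    max_features="sqrt",    # Maximum Features considered per split
    random_state=42,
)
```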
CART training algorithm
In this lab we will focus on the CART algorithm (Classification and Regression Trees) for regression.
The CART algorithm builds a binary tree in which every non-leaf node has exactly two children (corresponding to a yes/no answer).
Given a set of training examples and their labels, the algorithm repeatedly splits the training examples into two subsets using a feature $k$ and a feature threshold $t_k$, such that samples with the same label are grouped together.
At each node, the algorithm selects the split that produces the smallest mean squared error (MSE) (alternatively, we could use the mean absolute error).
So at each step, the algorithm selects the parameters $\theta = (k, t_k)$ that minimize the following cost function:

$$J(k, t_k) = \frac{n_{left}}{n_{total}} \, MSE_{left} + \frac{n_{right}}{n_{total}} \, MSE_{right}$$

$D$: remaining training examples
$n_{total}$: total number of remaining training examples
$k, t_k$: feature and feature threshold
$n_{left}, n_{right}$: number of samples in the left/right subset
$MSE_{left}, MSE_{right}$: MSE of the left/right subset
This step is repeated recursively until the maximum allowable depth is reached or the current number of samples $n_{total}$ drops below some minimum number. The original equations can be found here.
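To make the cost function concrete, here is a rough NumPy sketch of one greedy split search (an illustration of the idea under our own naming, not the scikit-learn implementation):

```python
import numpy as np

def mse(y):
    """MSE of a node that predicts the mean of its targets."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(X, y):
    """Exhaustively try every feature k and threshold t_k, returning the pair that
    minimizes J = (n_left/n_total) * MSE_left + (n_right/n_total) * MSE_right."""
    n_total, n_features = X.shape
    best = (None, None, np.inf)            # (k, t_k, cost)
    for k in range(n_features):
        for t_k in np.unique(X[:, k]):
            left, right = y[X[:, k] <= t_k], y[X[:, k] > t_k]
            if len(left) == 0 or len(right) == 0:
                continue
            cost = len(left) / n_total * mse(left) + len(right) / n_total * mse(right)
            if cost < best[2]:
                best = (k, t_k, cost)
    return best

# Tiny made-up dataset: one feature, targets jump after x = 3
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2])
print(best_split(X, y))   # best split is on feature 0 at threshold 3.0
```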