Decision Trees

Common Decision Tree Algorithms

  1. Gini Index

  2. Chi-Square

  3. Information Gain

  4. Reduction in Variance

Mean Squared Error (MSE)

When performing regression with CART trees (i.e. the target values are continuous), we can evaluate a split using its MSE. The MSE of node $m$ is computed as follows:

\begin{equation} \hat{y}_m = \frac{1}{n_{m}} \sum_{i \in D_m} y_i \end{equation}

\begin{equation} MSE_m = \frac{1}{n_{m}} \sum_{i \in D_m} (y_i - \hat{y}_m)^2 \end{equation}

  • $D_m$: the training examples in node $m$

  • $n_m$: the total number of training examples in node $m$

  • $y_i$: the target value of the $i$-th example
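To connect the formula to code, here is a minimal NumPy sketch, where node_targets is a hypothetical array holding the target values $y_i$ of the examples that fall into node $m$:

import numpy as np

# hypothetical target values of the training examples in node m
node_targets = np.array([3.2, 2.8, 3.9, 4.1, 3.5])

y_hat_m = node_targets.mean()                   # node prediction (mean target)
mse_m = np.mean((node_targets - y_hat_m) ** 2)  # MSE of node m

print('Node prediction: {}, node MSE: {}'.format(y_hat_m, mse_m))

The snippet below then evaluates a fitted tree regressor on held-out data with scikit-learn's metrics; it assumes regressor, X_test and y_test were created earlier in the lab.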

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score

y_pred = regressor.predict(X_test)

print('MSE score: {}'.format(mse(y_test, y_pred)))
print('R^2 score: {}'.format(r2_score(y_test, y_pred)))

Probably Approximately Correct (PAC)

PAC is a learning model characterized by learning from examples.

In its simplest form, consider $f$ as the target function to be learned; the learner (the computational model) is provided with some random examples in the form of $(X, f(X))$.

Here, $X$ is a binary input instance of length $n$, and $f(X)$ is the value (boolean TRUE or FALSE) of the target function at that instance.

Based on these examples, the learner must succeed in deducing the target function $f$, which is expressed as:

$f : \{0, 1\}^n \to \{0, 1\}$.

"Probably: " If learner outputs the data with desired probability and confidence, the results are probably correct

"Approximately: ." A hypothesis is approximately correct if its error over the distribution of inputs is bounded by some predefined interval.

Training Process

The process of training a decision tree and predicting the target features of query instances is as follows:

  1. Present a dataset of training examples containing features/predictors and a target (similar to the classifiers we have seen earlier).

  2. Train the tree model by making splits that separate the target values using the values of the predictors. The predictor to split on is selected following the idea of feature selection, using measures like "information gain" and the "Gini index". We shall cover these shortly.

  3. The tree is grown until some stopping criterion is reached. This could be a set depth of the tree or any other similar measure.

  4. Show a new set of features to the tree, with an unknown class, and let the example propagate through the trained tree. The resulting leaf node represents the class prediction for this data.
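A minimal scikit-learn sketch of these four steps (the synthetic dataset and parameter values below are only illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. a dataset with features/predictors and a target (synthetic here)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2./3. train the tree, choosing splits by an impurity measure
#       and stopping once the set depth is reached
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# 4. propagate unseen examples through the trained tree to get class predictions
print(clf.predict(X_test[:5]))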

Entropy

The idea of entropy used here is derived from Shannon's description of entropy, which measures the amount of uncertainty in a variable, and hence the information gained when that uncertainty is resolved.

import pandas as pd
from scipy import stats

# Input: a pandas Series
def ent(data):
    # calculate the probability of each unique value
    p_data = data.value_counts() / len(data)
    # feed the probabilities to scipy to get the entropy
    # (natural log by default; pass base=2 for entropy in bits)
    entropy = stats.entropy(p_data)
    return entropy
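As a quick sanity check of the helper above: a perfectly mixed binary series has maximal entropy (ln 2 with the natural-log default), while a pure series has zero entropy.

print(ent(pd.Series([0, 1, 0, 1])))  # ~0.693 = ln(2): maximally mixed
print(ent(pd.Series([1, 1, 1, 1])))  # 0.0: perfectly pure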

Entropy is a measure of Disorder or Uncertainty.

The entropy of a variable is the "amount of information" contained in the variable.

In terms of data, we can informally describe entropy as:

Entropy is an indicator of how messy your data is.

High entropy reflects "messed-up" data with little usable information content: the uncertainty about the content of the data after viewing it remains (almost) the same as before the data was available.

Information Gain

Information gain is an impurity/uncertainty-based splitting criterion that uses entropy as the impurity measure: it is the entropy of the parent node minus the weighted average entropy of the child nodes produced by a split.
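A minimal sketch of that calculation, reusing the ent helper from the Entropy section (the parent/child split below is hypothetical):

# hypothetical class labels before and after a candidate split
parent = pd.Series([0, 0, 0, 1, 1, 1])
left = pd.Series([0, 0, 0, 1])
right = pd.Series([1, 1])

# information gain = parent entropy - weighted average entropy of the children
w_left, w_right = len(left) / len(parent), len(right) / len(parent)
info_gain = ent(parent) - (w_left * ent(left) + w_right * ent(right))
print(info_gain)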

Tree Pruning

Now that we know how to grow a decision tree using Python and scikit-learn, let's move on and look at optimizing the classifier. Decision trees offer a number of options that we can tweak before the actual learning takes place.

A decision tree grown beyond a certain level of complexity leads to over-fitting. If we keep growing the tree, carrying on with poor predictors that have no real impact on accuracy, we will eventually a) slow down the learning, and b) cause overfitting.

This process of trimming decision trees to optimize the learning process is called "Tree Pruning".

We can prune our trees using the following parameters:

  • Maximum Depth

Reduce the depth of the tree to build a more generalized tree. Set the depth of the tree to, e.g., 3, 5, or 10, choosing the value after verification on test data.

  • Minimum Samples for a Split

Restrict the minimum number of samples a node must contain before it can be split

  • Minimum Leaf Sample Size

The minimum number of samples in terminal (leaf) nodes can be fixed at 30, 100, 300, or 5% of the total

  • Maximum Leaf Nodes

Reduce the number of leaf nodes

  • Maximum Features

Maximum number of features to consider when splitting a node
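These options map directly onto scikit-learn's constructor parameters; a sketch with purely illustrative values that should be tuned against held-out data:

from sklearn.tree import DecisionTreeClassifier

pruned_tree = DecisionTreeClassifier(
    max_depth=5,           # Maximum Depth
    min_samples_split=30,  # minimum samples a node needs before it may be split
    min_samples_leaf=30,   # Minimum Leaf Sample Size
    max_leaf_nodes=20,     # Maximum Leaf Nodes
    max_features=None,     # Maximum Features per split (None = use all features)
)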

CART Training Algorithm

In this lab we will focus on the CART algorithm (Classification and Regression Trees) for regression.

The CART algorithm builds a binary tree in which every non-leaf node has exactly two children (corresponding to a yes/no answer).

Given a set of training examples and their labels, the algorithm repeatedly splits the training examples $D$ into two subsets $D_{left}, D_{right}$ using a feature $f$ and a feature threshold $t_f$, such that samples with the same label (or similar target values, for regression) are grouped together.

At each node, the algorithm selects the split $\theta = (f, t_f)$ that produces the smallest mean squared error (MSE) (alternatively, we could use the mean absolute error).

So at each step, the algorithm selects the parameters $\theta$ that minimize the following cost function:

\begin{equation} J(D, \theta) = \frac{n_{left}}{n_{total}} MSE_{left} + \frac{n_{right}}{n_{total}} MSE_{right} \end{equation}

  • $D$: remaining training examples

  • $n_{total}$: number of remaining training examples

  • $\theta = (f, t_f)$: feature and feature threshold

  • $n_{left}/n_{right}$: number of samples in the left/right subset

  • $MSE_{left}/MSE_{right}$: MSE of the left/right subset

This step is repeated recursively until the maximum allowable depth is reached or the current number of samples $n_{total}$ drops below some minimum number. The original equations can be found here.
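To make the split-selection step concrete, the sketch below does an exhaustive search over features and thresholds for a single node, minimizing the weighted-MSE cost $J(D, \theta)$ above (a toy illustration, not scikit-learn's optimized implementation):

import numpy as np

def node_mse(y):
    # MSE of a node: mean squared deviation from the node's mean prediction
    return np.mean((y - y.mean()) ** 2)

def best_split(X, y):
    # greedy search for theta = (feature, threshold) minimizing J(D, theta)
    n_total = len(y)
    best_f, best_t, best_cost = None, None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            cost = (len(left) / n_total) * node_mse(left) + (len(right) / n_total) * node_mse(right)
            if cost < best_cost:
                best_f, best_t, best_cost = f, t, cost
    return best_f, best_t, best_cost

# toy data: the first feature carries the signal
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
y = np.where(X[:, 0] > 0.5, 10.0, 0.0) + rng.normal(scale=0.1, size=20)
print(best_split(X, y))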
