What is a Decision Tree?

The following was my solution to a question on my university Data Science exam. For this question we were expected to explain our chosen model to someone with a non-statistics university degree.

For this question I chose a decision tree, also known as CART.

My answer is listed below.

Data Science: Exam Question 3

Outline

CART stands for Classification and Regression Tree. The CART model is a tree-based model that can be used for both regression and classification. Regression is used when we would like to predict a numerical value, such as what the temperature will be tomorrow, while classification is used when we would like to predict a category, for example whether it is going to rain tomorrow (yes or no). Fortunately, CART models can be used for either kind of problem.

A tree-based model is one that uses nested if/then statements.

Consider the example of predicting whether you should take an umbrella to work tomorrow. A tree-based model could answer this question in the following way:

Is the forecast for MORE than 30 degrees tomorrow? Then I don't need an umbrella.

Is the forecast for LESS than 30 degrees tomorrow? Then I need to ask if it’s going to rain.

Is it forecast to rain tomorrow? Then I will need an umbrella.

Is it NOT forecast to rain tomorrow? Then I will NOT need an umbrella.

From this example, one can get an idea of how a tree-based model works.
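To make the "nested if/then" idea concrete, here is a minimal sketch (my own illustration, not part of the exam answer) that writes the umbrella decision above as ordinary code; the 30-degree threshold is just the made-up value from the example.

```python
def need_umbrella(forecast_temp: float, rain_forecast: bool) -> bool:
    """Toy decision tree for the umbrella example (thresholds are made up)."""
    if forecast_temp > 30:
        # Hot forecast: no umbrella needed.
        return False
    else:
        # Cooler forecast: ask whether rain is forecast.
        if rain_forecast:
            return True
        else:
            return False


print(need_umbrella(25, True))   # True: cool and rainy, take an umbrella
print(need_umbrella(35, False))  # False: hot forecast, leave it at home
```

Each if statement plays the role of one internal node of the tree, and each return plays the role of a leaf.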

[Decision tree diagram: image from Towards Data Science]

The above diagram showcases another well-illustrated example of how a tree-based algorithm works.

In more complex terms, what is occurring at each step is that the data set is split into two groups in the way that minimises the prediction error. This splitting is repeated until the maximum tree depth is reached.
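As a sketch of what "split so that the error is minimised" means (my illustration, not from the exam answer), the snippet below searches for the single best split of one numeric feature when each side of the split is predicted by its group mean; a real CART implementation repeats this search over every feature at every node.

```python
import numpy as np


def best_split(x: np.ndarray, y: np.ndarray):
    """Find the split of x that minimises the total squared error when each
    side of the split is predicted by its own mean (a regression-style split)."""
    best_threshold, best_error = None, np.inf
    for threshold in np.unique(x)[:-1]:  # candidate split points
        left, right = y[x <= threshold], y[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Tiny made-up data set: forecast temperature (x) and umbrella sales (y).
x = np.array([18.0, 20.0, 24.0, 29.0, 31.0, 35.0])
y = np.array([40.0, 38.0, 35.0, 12.0, 10.0, 8.0])
print(best_split(x, y))  # the best split falls between the cool and hot days
```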

Next there can be a pruning stage. This step assumes that the tree has probably fit the training data too closely and will not perform as well on a test set. The tree is therefore pruned: the least important branches are removed. This means the final model is simpler and will likely generalise better to unseen data.
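A common way this is implemented is cost-complexity pruning. The sketch below (using scikit-learn and its built-in iris data set, neither of which is part of the original answer) shows how increasing the pruning penalty `ccp_alpha` removes the least important branches and leaves a smaller tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unpruned tree keeps splitting until its leaves are (nearly) pure ...
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# ... while a larger cost-complexity penalty prunes away the least
# important branches, leaving a simpler tree.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print("unpruned leaves:", full_tree.get_n_leaves())
print("pruned leaves:  ", pruned_tree.get_n_leaves())
```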

Advantages and Disadvantages

There are many advantages to tree-based models such as CART.

  1. They are easy to explain. Trees can also be drawn as diagrams, such as the one above, which make the conclusions of the model very simple to understand.
  2. The approach is similar to how a human would reason about the problem. The logic in the umbrella example above is very close to how these algorithms actually work.

However, there are also some disadvantages:

  1. Trees may not have the best predictive accuracy.
  2. Trees can change dramatically with small changes in the data set.

To address these problems, more complicated tree-based methods such as bagging, boosting and random forests are often used.
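As a quick illustration of the accuracy point (again using scikit-learn and its toy iris data set, which are my additions rather than part of the original answer), an ensemble of trees will usually score at least as well as a single tree under cross validation:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare a single tree with a random forest (an ensemble of trees)
# using 5-fold cross-validated accuracy.
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(f"single tree: {tree_acc:.3f}, random forest: {forest_acc:.3f}")
```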

Tuning

Tuning parameters are those that are not estimated from the data by a formula, but must be chosen by the user. In a CART model, the parameters that can be tuned are the tree depth and the cost-complexity (pruning) parameter.

The model would be tuned by fitting CART models with different tree depths and seeing which model has the lowest error rate. For example, a CART of depth 1 would be fit, then depth 2, and so on up to depth 10.
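A minimal sketch of that tuning loop (using scikit-learn, with the cross validation described below used to score each depth) might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit a CART for each depth from 1 to 10 and keep the depth with the
# highest cross-validated accuracy (i.e. the lowest error rate).
scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=5
    ).mean()
    for depth in range(1, 11)
}
best_depth = max(scores, key=scores.get)
print("best depth:", best_depth, "accuracy:", round(scores[best_depth], 3))
```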

In the case of classification, the best model could be chosen using the misclassification rate, that is, the percentage of outcomes that were classified incorrectly.

In the case of regression, the best model could be chosen as the one with the lowest RMSE, that is, the lowest root mean squared error between the predicted values and the actual values.
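For reference, if there are $n$ observations with actual values $y_i$ and predicted values $\hat{y}_i$, the RMSE is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

so smaller values mean the predictions sit closer to the actual values.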

These error rates are measured using cross validation. The data is first split into a number of parts. The model is fit on all parts except one, and the error rate is measured on the part that was held out. This is repeated so that every part is held out once. If the data is split into $K$ portions, then there are $K$ error estimates. This gives a much better idea of how the model is performing than a single training and testing split.
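A minimal sketch of $K$-fold cross validation for a tree of fixed depth (scikit-learn again; the depth and the choice of $K = 5$ are arbitrary values for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split the data into K parts; each part takes one turn as the test set
# while the model is fit on the remaining K - 1 parts.
errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Misclassification rate on the held-out part.
    errors.append(1 - model.score(X[test_idx], y[test_idx]))

print("K error estimates:", np.round(errors, 3))
print("mean error:", round(np.mean(errors), 3))
```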

In the same way as for tree depth, cross validation can also be used to find the optimal cost-complexity parameter $\alpha$, which controls how heavily the tree is pruned.
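A sketch of that idea, scoring each candidate $\alpha$ from scikit-learn's cost-complexity pruning path with cross validation (illustrative only, not the method from the original answer):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alpha values come from the cost-complexity pruning path of a
# fully grown tree; each one is scored with 5-fold cross-validated accuracy.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
scores = {
    alpha: cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    for alpha in alphas
}
best_alpha = max(scores, key=scores.get)
print("best alpha:", round(best_alpha, 4))
```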

The final model will then be fit using the best parameters found during the tuning process.