Tree-Based Methods
Tree-based methods, like kNN, are conceptually simple approaches for solving both classification and regression problems. These methods involve building a tree- or flowchart-like structure, with the root of the tree at the top, from which a series of nodes and binary branches follow, similar to the sample tree shown in Figure 2-3. To make a prediction, you start at the root node and move down the tree, following a path that depends on the answers given to each question encountered at the nodes. The predicted outcome is found once a terminal (or leaf) node is reached. Given the transparency with which decisions are made, these types of methods are often referred to as white-box models.
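To make this walk down the tree concrete, here is a minimal sketch of the prediction process. The Node class, the feature names (injury, lane_blocked), and the example tree are all hypothetical, chosen only to mirror the kind of incident-duration flowchart shown in Figure 2-3:

```python
# A minimal sketch of predicting with a flowchart-like decision tree.
# The Node class and the example tree below are hypothetical, for illustration.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[str] = None      # question asked at an internal node
    yes: Optional["Node"] = None       # branch taken when the answer is "yes"
    no: Optional["Node"] = None       # branch taken when the answer is "no"
    prediction: Optional[str] = None   # set only on leaf (terminal) nodes


def predict(node: Node, sample: dict) -> str:
    """Start at the root and follow branches until a leaf is reached."""
    while node.prediction is None:     # still at an internal node
        node = node.yes if sample[node.feature] else node.no
    return node.prediction


# A two-level hypothetical tree: "injury involved?" then "lane blocked?"
tree = Node(
    feature="injury",
    yes=Node(prediction="long duration"),
    no=Node(
        feature="lane_blocked",
        yes=Node(prediction="medium duration"),
        no=Node(prediction="short duration"),
    ),
)

print(predict(tree, {"injury": False, "lane_blocked": True}))  # medium duration
```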
Figure 2-3 A hypothetical decision tree for predicting incident duration
In the simple hypothetical example illustrated in Figure 2-3, the input features are binary with a “yes” or “no” response, but this need not be the case. In fact, any type of variable (e.g., ordinal, nominal, or continuous) can be used as an input feature for creating the binary branches of the tree. For example, if a continuous variable is selected at a node, a threshold must be chosen to decide which of the two branches to follow: one branch is taken if the value of the variable is greater than the threshold, the other if it is less. Given a training dataset, a tree-based model partitions the data into as many non-overlapping subsets as there are leaves. In other words, based on its feature vector, each training sample lands on one and only one of the leaves.
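The sketch below illustrates this partitioning property using scikit-learn, whose apply() method returns the index of the leaf each sample lands in. The toy data is made up for illustration:

```python
# A sketch showing that a fitted tree assigns every training sample to
# exactly one leaf. The toy data below is randomly generated.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 3 continuous input features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary response

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
leaf_ids = tree.apply(X)                 # one leaf index per training sample

print(tree.get_n_leaves(), "leaves")     # number of non-overlapping subsets
print(np.bincount(leaf_ids).sum())       # == 100: each sample in exactly one leaf
```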
To construct a decision tree, one needs a tree-partitioning process that determines which input features to use and in what sequence. The most common approach is a greedy process: at each node, the variable that maximizes the reduction in error is selected, which guarantees local (but not necessarily global) optimality. The total reduction in error across all nodes attributed to a variable is known as its feature importance and is useful in variable selection. The partitioning process starts at the root node. Each feature (and each of its possible thresholds, if it is a continuous variable) is tried one by one to determine which yields the most desirable outcome. If solving a classification problem, the feature (or, in more general terms, the decision rule) that splits the data into the most homogeneous partitions is selected as the best option. A metric called the Gini impurity, or Gini index [1], is commonly used to measure how homogeneous the two partitions are. If solving a regression problem, the criterion becomes minimizing the sum of squared errors: for the samples in each partition, the values of the response variable are averaged and taken as the prediction, and the differences between the actual and predicted responses are squared and summed.
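Here is a minimal sketch of this greedy split search for one continuous feature in a classification setting, under the assumption that Gini impurity is the homogeneity measure; the helper names and toy data are my own, not from any library. For regression, gini() would simply be replaced by the sum of squared errors around the partition mean:

```python
# A minimal sketch of the greedy split search: for one continuous feature,
# try each candidate threshold and keep the one that makes the two
# partitions most homogeneous (lowest weighted Gini impurity).
import numpy as np


def gini(labels: np.ndarray) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_threshold(x: np.ndarray, y: np.ndarray):
    """Return the threshold minimizing the weighted impurity of the split."""
    best_t, best_impurity = None, np.inf
    for t in np.unique(x)[:-1]:                    # candidate thresholds
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        impurity = w * gini(left) + (1 - w) * gini(right)
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))                        # (3.0, 0.0): a perfect split
```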
The process above is repeated recursively at each new node, splitting the data into further subsets. Another important aspect of building a tree is deciding when to stop partitioning. Very large trees can overfit the training data, and such complex models may not generalize well. Conceptually, the process should stop when further partitioning adds no more value. For example, we can stop splitting when the best candidate split at a node reduces the impurity by less than a preset threshold. This, however, requires selecting a threshold value, and there is no single threshold that is accepted across all applications. Alternatively, one can continue splitting until the error on a validation dataset stops decreasing, thereby eliminating the reliance on a preset stopping threshold.
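The threshold-based stopping rule is available directly in scikit-learn as the min_impurity_decrease parameter: a node is split only if the split reduces the (weighted) impurity by at least that amount. In the sketch below, the threshold value 0.01 and the noisy toy data are arbitrary, chosen only to show that the rule yields a smaller tree:

```python
# A sketch of threshold-based stopping: compare an unrestricted tree with
# one that only splits when impurity drops by at least 0.01 (arbitrary value).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(500) < 0.2     # 20% label noise, so a full tree overfits
y[flip] = 1 - y[flip]

unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)
restricted = DecisionTreeClassifier(
    min_impurity_decrease=0.01, random_state=0
).fit(X, y)

# The stopping rule yields a much smaller, simpler tree.
print(unrestricted.get_n_leaves(), restricted.get_n_leaves())
```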
Overall, tree-based methods are easy to understand and implement, and they can handle both numerical and categorical data. They do not require the analyst to identify the best subset of features, since the process explained above implicitly selects the most useful variables. This is especially useful when dealing with a large number of explanatory variables. Further details on tree-based methods can be found in the literature [1].
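This implicit feature selection can be inspected after fitting: scikit-learn exposes the total impurity reduction attributed to each feature as feature_importances_. The toy data below (one informative feature and two pure-noise features) is made up for illustration:

```python
# A brief sketch of feature importance: only feature 0 drives the response,
# so it should receive nearly all of the importance.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
print(tree.feature_importances_)  # feature 0 should dominate
```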