Support Vector Machines (SVMs)
A Support Vector Machine (SVM), in its most basic form, is a binary classification algorithm in which a hyperplane is used as a discriminant (or decision boundary) to separate the data into two groups. For two-dimensional data, this hyperplane is a straight line; in general, the hyperplane has one dimension less than the data. The main computational task in SVMs is finding the best separating hyperplane such that data belonging to one group fall on one side of the plane while data belonging to the other group fall on the other side. The best hyperplane is defined as the one that lies farthest from the closest training samples, for which the labels or classes are known. Such a decision boundary, e.g., the dashed line shown in the rightmost plot in Figure 2-5, is expected to perform and generalize better than any other. To find this hyperplane, mathematical optimization techniques have been developed and are widely available as part of many machine-learning libraries. One of the major advantages of SVMs is that the resulting optimization problem is convex and therefore has a global optimum, which is not the case in neural networks.
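As a rough sketch of this workflow, the following Python snippet (using scikit-learn, with made-up two-dimensional data and an arbitrarily large C to approximate a hard margin; none of these choices come from the text above) fits a linear SVM and reports the resulting hyperplane and support vectors.

import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable two-dimensional point clouds
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # "negative" class
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])  # "positive" class
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM described above
model = SVC(kernel="linear", C=1e6)
model.fit(X, y)

print("w =", model.coef_[0])               # hyperplane coefficients
print("b =", model.intercept_[0])          # hyperplane offset
print("support vectors:", model.support_vectors_)
print("prediction for (3, 3):", model.predict([[3.0, 3.0]]))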
SVMs have been extended to solve nonlinear classification problems where the decision boundary is not linear in the original space. There are also support-vector clustering algorithms to handle and categorize unlabeled data. Support vector regression (SVR) models have also been developed.
The main idea behind SVMs is best explained through an example with two-dimensional data, as in Figure 2-5. In this illustrative example, the data points are linearly separable into two distinct categories without any point being misclassified. All points to the left of the decision boundary, i.e., the dashed line, belong to a “negative” class, whereas the ones to its right belong to a “positive” class. Out of an infinite number of possible hyperplanes (straight lines) that can separate the data, an SVM model finds the specific hyperplane that has the largest margin. The margin simply refers to the gap or distance, measured perpendicularly from the hyperplane, to the closest point or points on either side of the hyperplane. An easier way to visualize the margin is shown in Figure 2-6, where two solid lines parallel to hyperplane h, one on each side of h, are pushed outwards until they reach a data point. The distance between these two parallel lines (l+ and l− in Figure 2-6) is the margin. The linear decision boundary h returned by an SVM model lies right in the middle of this margin, at equal distance from the closest points of either class.
Any linear decision boundary can be written as h = {x | w·x + b = 0}, where x represents the multidimensional feature vector, w the vector of coefficients, and b a scalar. In this expression, the “·” between w and x is the well-known dot product. The label for a new observation x′ is predicted by simply evaluating the expression w·x′ + b: if it turns out to be greater than zero, the predicted label is “positive,” otherwise “negative.” As shown in Figure 2-6, the value of this expression is set to +1 and −1 for lines l+ and l−, respectively. While setting them to +1 and −1 is an arbitrary choice (any other constant would work too), it is convenient when solving the mathematical problem. With this construct, it is clear that w·x + b ≥ 1 for all “positive” training samples and w·x + b ≤ −1 for all points in the “negative” class. For the special points that lie on the lines l+ and l−, these constraints become equalities, and these points are called the support vectors; hence the name support vector machines.
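For illustration only, the short sketch below hard-codes hypothetical values of w and b and predicts labels from the sign of w·x′ + b; the numbers are made up, not taken from Figure 2-6.

import numpy as np

w = np.array([1.0, 1.0])   # hypothetical coefficient vector
b = -6.0                   # hypothetical scalar offset

def predict(x_new):
    # The sign of w.x + b decides the class; points on l+ and l- satisfy
    # w.x + b = +1 and -1 exactly.
    return "positive" if np.dot(w, x_new) + b > 0 else "negative"

print(predict(np.array([5.0, 4.0])))   # w.x + b =  3.0 -> "positive"
print(predict(np.array([2.0, 1.0])))   # w.x + b = -3.0 -> "negative"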
In an SVM model, the objective is to find the parameters w and b that produce the largest margin. Through basic algebra, it can be shown that the size of the margin is equal to 2/‖w‖, where ‖w‖ is the length or norm of the vector w. Maximizing 2/‖w‖ is equivalent to minimizing ‖w‖. Since it is mathematically more convenient, ‖w‖²/2 (or equivalently Σⱼ wⱼ²/2) is used as the objective function to be minimized in formulating the SVM problem. In summary, with this objective function and the set of inequality constraints discussed above for the training samples, a quadratic convex optimization problem is obtained. This is typically solved through a Lagrangian formulation in which finding the unknowns involves computations on the order of N², where N is the number of training samples. In other words, the computation time for training an SVM model grows proportionally to N².
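Written out, with yᵢ ∈ {−1, +1} denoting the class labels, the hard-margin problem described above takes the following form (a restatement in the notation of this section):

minimize over w and b:   ½ ‖w‖²
subject to:              yᵢ (w·xᵢ + b) ≥ 1,   i = 1, …, N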
In many situations, problems that are not linearly separable need to be solved. To address such cases, slack variables are introduced. The linear constraints in SVM force the largest margin to be found only where all points are correctly classified, which can lead to very small margins that do not generalize well. Slack variables relax the linear constraints to allow for misclassified points while maintaining a larger margin. This new margin, which allows for some misclassification, is called the soft margin.
The slack variable ξᵢ is defined as the distance by which point i goes beyond the boundary of its category, as shown in Figure 2-7. Slack variables allow some points to be classified incorrectly to some extent. By sacrificing some accuracy, the SVM achieves a larger margin, which increases its flexibility and generalizability. However, there is a tradeoff between achieving a large margin and accuracy, and the balance is found by penalizing the tolerance to errors in the cost function (Eq. (2)). A high value of C will only allow points very close to the decision boundary to be misclassified, while a low value of C is more tolerant.
(1/2) wᵀw + C ∑ᵢ₌₁ᴺ ξᵢ     (2)
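As a hedged sketch of how this penalty behaves in practice, the snippet below (scikit-learn, synthetic overlapping classes, illustrative C values) fits linear soft-margin SVMs for several values of C and prints the resulting margin width 2/‖w‖ and training accuracy.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping two-dimensional classes, so no hyperplane separates them perfectly
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(1.5, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(model.coef_)   # soft-margin width 2/||w||
    print(f"C = {C:>6}: margin width = {margin:.3f}, "
          f"training accuracy = {model.score(X, y):.2f}")
# A small C tolerates more misclassified points and yields a wider margin;
# a large C shrinks the margin to reduce training errors.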
For most practical problems, the given training data might not be linearly separable. Sometimes the overlap of the classes is so large that introducing a simple slack variable will not provide a good solution. In these cases, it might be a good idea to transform the data through a mapping function φ(x) and project the points to a higher-dimensional space Z, as shown in Figure 2-8. By doing this, the nonlinear problem is linearized in the new feature space, and linear separation becomes possible, as shown with the green line in the new space. The exact form of this mapping function does not need to be known explicitly, since kernel functions can be used instead.
The kernel function is defined as K(xᵢ, xⱼ) = φ(xᵢ)ᵀ φ(xⱼ). The essence of the kernel method lies in the fact that the explicit calculation of the coordinates in the Z space is not needed; instead, the inner product of the points in the Z space is computed directly. This operation is computationally cheaper than the explicit computation of the coordinates in the Z space. This approach is called the kernel trick.
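A small sketch can make the kernel trick concrete. For the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)² in two dimensions, the mapping φ is known explicitly, so the inner product in the Z space and the kernel value can be compared directly (the example data here are arbitrary):

import numpy as np

def phi(x):
    # Explicit mapping to the three-dimensional feature space Z for this kernel:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, z):
    # The kernel gives the same inner product without ever forming phi
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))   # inner product in the Z space: 16.0
print(kernel(x, z))             # same value via the kernel trick: 16.0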
Selecting a kernel function requires a good understanding of the characteristics of the underlying data. There are many different types of kernel functions; some of the most common are the polynomial, radial basis, and hyperbolic tangent kernel functions shown in Eq. (3)-(5).
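In their commonly used forms (with parameter names following common convention rather than the original equations), these kernels can be written as:

Polynomial:           K(xᵢ, xⱼ) = (xᵢ·xⱼ + c)ᵈ
Radial basis (RBF):   K(xᵢ, xⱼ) = exp(−γ ‖xᵢ − xⱼ‖²)
Hyperbolic tangent:   K(xᵢ, xⱼ) = tanh(κ xᵢ·xⱼ + c)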
Extensions
SVM has been extended to solve regression problems as well. Support vector regression (SVR) models still maintain the key features of the SVM models explained above and follow a similar mathematical approach to find the best-fitting hyperplane. In contrast to ordinary least-squares regression, an SVR model ignores errors within a threshold called ε. As an illustration, these ε-bounds are shown for one-dimensional data in Figure 2-9. For points falling outside the ε-bounds, in its basic form, a linear and symmetrical loss function (right panel in Figure 2-9) is used to compute the total loss or prediction error (i.e., rᵢ − ε) needed for the objective function. To balance model complexity and prediction error, just as in SVM for classification, the objective function also includes the wᵀw term as in Equation 2, where this balancing is achieved by tuning the hyperparameter C. As in SVM, the hyperplane f(x) is represented in terms of support vectors, which are the training samples that lie outside the ε-boundaries. A major advantage of SVR is that its computational complexity does not depend on the dimensionality of the input space [5].
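As a minimal sketch (scikit-learn's SVR on synthetic one-dimensional data; the ε and C values are arbitrary), the snippet below fits a linear SVR and reports how many training samples end up as support vectors:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.linspace(0.0, 10.0, 60).reshape(-1, 1)          # one-dimensional inputs
y = 0.5 * X.ravel() + rng.normal(0.0, 0.3, size=60)    # noisy linear target

# epsilon sets the width of the error-insensitive band; C balances model
# complexity against errors outside the band, as in Equation 2.
model = SVR(kernel="linear", epsilon=0.2, C=1.0).fit(X, y)

print("number of support vectors:", len(model.support_))
print("prediction at x = 5:", model.predict([[5.0]]))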
SVMs have been extended to solve the clustering problem as well. In support vector clustering (SVC), the data points are first mapped to a higher-dimensional feature space through a kernel function. Then, the smallest sphere enclosing the data in the feature space is found. When this sphere is mapped back to the original space, it results in a set of contours that cluster the points into groups, as in Figure 2-10. More details can be found in the original paper on SVC [6].
To use an SVM model for either a classification or a regression problem, one needs to select a kernel function and values for the hyperparameters. For the soft-margin SVM, the optimal value of C can be found through a grid search over different C values combined with cross-validation. For kernel SVM, the hyperparameters to be optimized depend on which kernel function is utilized. For example, when using a radial basis kernel function, the hyperparameters that need to be optimized are the kernel width parameter (commonly denoted γ) and the C penalty term. The optimal combination can be found through a grid search and cross-validation.
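A short sketch of this tuning loop, assuming scikit-learn and an illustrative parameter grid and dataset, could look like the following:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1.0).astype(int)   # nonlinearly separated labels

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))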
Through the kernel trick, support vector machines enable fitting linear models to data in a high-dimensional feature space and allow transforming the optimization problem into a convex quadratic problem. This makes it possible to find a globally optimal solution for the model being fit, a significant advantage of SVM over other models. However, this model is valid only under the preselected hyperparameters; for each value of the hyperparameters, a new SVM model is fit. Therefore, the overall process of finding the best SVM model requires numerous trials. Grid search is commonly used as an approach for tuning the hyperparameters while checking the model performance on validation data.
Support vector machines have been widely applied in transportation literature to solve various problems. SVMs have been used for incident detection [7] and forecasting of volumes [8] on freeways; detecting travel mode [9] and vehicle stops [10] from smartphone sensor data; imputing missing data in activity-based travel diaries [11]; predicting motor vehicle crashes [12, 13]; modeling travel mode choice [14]; and many other problems and applications.