Supervised Learning_Complete Draft
Last updated
Machine Learning Primer
[This will be a separate Chapter]
Author: Mecit Cetin
The k-Nearest Neighbor (kNN) algorithm is one of the simplest supervised learning methods for solving both classification and regression problems. It has only one parameter to tune, called k, and is easy to understand and implement. To make a prediction for a given data point, it relies on the labels of the k closest points in its vicinity. The underlying assumption is therefore that points with similar values of the response variable (e.g., the same label) lie close to one another. The input vector or input space can have any number of dimensions d.
To implement this algorithm, the number of neighbors (i.e., k) needs to be set to a positive integer. Once k is known, a prediction can be made by relying on the training samples. For example, in solving a binary classification problem, the kNN algorithm runs the following steps to predict the label of a new observation $x_s$:
Find the k training examples that are closest to the sth observation, and name them $N_k(x_s)$, the k-neighbors of a d-dimensional $x_s$. Since there are only two classes, the label of each training sample i can be represented by a zero or one, i.e., $y_i \in \{0, 1\}$.
Using the labels of these k-neighbors, we then predict the label of the sth observation, $\hat{y}_s$, as follows:
$$\hat{y}_s = \frac{1}{k} \sum_{i \in N_k(x_s)} y_i$$
The equation above simply yields the fraction of neighbors that have class 1 as their labels. If $\hat{y}_s > 0.5$, the label of the sth observation is 1; otherwise it is 0 (i.e., the majority vote wins). As an illustration, Figure 2‑1 shows 2-dimensional data with binary labels. The class of a new point, depicted by a star, will be predicted as B when k is set to 3 since there are more class B points within its k-neighborhood.
Figure 2‑1 Sample data with two classes and three nearest neighbors for a given point
In implementing a kNN classification algorithm for a new data point, these five main steps are typically followed: (i) set k to a positive integer; (ii) calculate the distances (e.g., Euclidean) between the new observation and all the training data; (iii) sort the distances and determine the k nearest neighbors; (iv) get the categories (classes) of these neighbors; and (v) estimate the category of the new observation based on the majority vote. If a regression problem is being solved, only step (v) needs to be modified. In that case, the mean or median of the response values of the k neighbors can be computed to estimate the dependent variable for the new observation.
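The five steps can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `knn_predict` and the toy data are assumptions made for the example, and only the standard library is used.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Step (ii): Euclidean distance from x_new to every training point.
    distances = [(math.dist(x, x_new), label) for x, label in zip(train_X, train_y)]
    # Step (iii): sort by distance and keep the k closest.
    distances.sort(key=lambda pair: pair[0])
    neighbors = distances[:k]
    # Steps (iv)-(v): collect the neighbor labels and take the majority vote.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters labeled "A" and "B".
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_y = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(train_X, train_y, (6.2, 6.1), k=3)  # step (i): k = 3
```

For regression, the last two lines of the function would instead return the mean or median of the neighbors' response values.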
As indicated above, the kNN method has a single parameter, k, that needs to be tuned. As in any other ML algorithm, the value selected for the parameter will impact the performance of the model. Figure 2‑2 shows how the decision boundary changes depending on the selected value of k. Making k too small results in irregular boundaries and overfitting to the training data. On the other hand, a very large k may eliminate important local nonlinearities, since a larger number of samples will influence the decision boundary and make it smoother. In Figure 2‑2, when k = 1, all the training samples are correctly classified, whereas for k = 15 there are some misclassifications. However, the training error should not be the criterion for selecting the best model parameter; otherwise, k would always be set to 1, because each training sample is then its own nearest neighbor. Instead, the performance of the model under varying k should be assessed with a validation dataset. The k value that minimizes the error on the validation data is expected to perform and generalize better.
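The difference between training error and validation error can be illustrated with a small sketch. The data, the candidate values of k, and the helper names below are hypothetical; two training labels are deliberately noisy so that k = 1 fits the training set perfectly yet does worse on the validation points.

```python
import math
from collections import Counter

def knn_predict(train, x_new, k):
    """Majority vote among the k training points closest to x_new."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x_new))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, data, k):
    """Fraction of points in data that kNN (using train) misclassifies."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)

# Hypothetical data: class 0 at small coordinates, class 1 at large ones.
# The points (3, 3) and (4, 4) carry deliberately noisy labels.
train = [((1, 1), 0), ((2, 1), 0), ((1, 2), 0), ((2, 2), 0), ((3, 3), 1),
         ((5, 5), 1), ((6, 5), 1), ((5, 6), 1), ((6, 6), 1), ((4, 4), 0)]
validation = [((1.5, 1.5), 0), ((2.8, 2.8), 0), ((4.4, 4.4), 1), ((5.5, 5.5), 1)]

candidate_ks = [1, 3, 5]
train_errors = {k: error_rate(train, train, k) for k in candidate_ks}
val_errors = {k: error_rate(train, validation, k) for k in candidate_ks}
# Select k by validation error, not training error.
best_k = min(candidate_ks, key=lambda k: val_errors[k])
```

In this toy case the training error is zero for k = 1, but the validation error is lowest for a larger k, which is the behavior the paragraph above describes.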
Given its simplicity, the kNN method is commonly used and often performs well, especially when the decision boundary is very irregular [1]. Since finding the k nearest neighbors requires computing the distance to every training data point, the basic kNN algorithm is not computationally efficient for large training sets. However, algorithms that improve the computation speed have been developed [1].
Figure 2‑2 The effects of k on decision boundaries for solving binary classification problems
Tree-based methods, like kNN, are conceptually simple and can solve both classification and regression problems. These methods involve building a tree- or flowchart-like structure with the root of the tree at the top, from which a series of nodes and binary branches follow, similar to the sample tree shown in Figure 2‑3. To make a prediction, you start at the root node and go down the tree, following a path that depends on the answers given to each question encountered at the nodes. The predicted outcome is found once a terminal or leaf node is reached. Given the transparency with which decisions are made, these types of methods are often referred to as white-box models.
Figure 2‑3 A hypothetical decision tree for predicting incident duration
In the simple hypothetical example illustrated in Figure 2‑3, the input features are binary with a “yes” or “no” response, but this need not be the case. In fact, it is possible to use any type of variable (e.g., ordinal, nominal, and continuous) as an input feature for creating the binary branches of the trees. For example, if a continuous variable is selected at a node, then a threshold should be selected to decide which one of the two branches is followed. One of the branches is followed depending on whether the value of the variable is greater (or less) than this threshold. Given a training dataset, a tree-based model partitions the data into as many non-overlapping subsets as the number of leaves. In other words, using the feature vector, each one of the training samples should land on one and only one of the leaves.
In order to construct a decision tree, one needs a tree partitioning process for determining which input features to use and in what sequence. The most common approach is to employ a greedy process that guarantees only local optimality. This is done by selecting, at each node, the variable that maximizes the reduction in error. The total reduction in error from all nodes attributed to a variable is known as feature importance and is useful in variable selection. The tree partitioning process starts with the root node. Each feature (and its possible thresholds, if the variable is continuous) is tried one by one to determine which feature yields the most desired outcome. If solving a classification problem, the feature (the decision rule in more general terms) that splits the data into the most homogeneous partitions is selected as the best option. To measure how homogeneous the two partitions are, a metric called Gini impurity or index [1] is commonly used. If solving a regression problem, the criterion becomes the minimization of the sum of squares. For the samples in each partition, the values of the response variable are averaged and taken to be the prediction. The differences between the actual and predicted responses are then squared and summed to obtain the sum of squares.
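As a rough sketch of the split search at a single node, the snippet below evaluates candidate thresholds on one continuous feature using the common textbook definition of Gini impurity, one minus the sum of squared class proportions. The function names, the midpoint-threshold convention, and the toy data are assumptions for illustration only.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try each midpoint threshold on one continuous feature and return the
    (threshold, weighted_impurity) pair with the lowest weighted Gini impurity."""
    distinct = sorted(set(values))
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical feature values and binary duration labels.
values = [1, 2, 3, 4, 5, 6]
labels = ["short", "short", "short", "long", "long", "long"]
threshold, impurity = best_split(values, labels)
```

A full tree builder would run this search over every feature at a node, keep the best rule, and then recurse on the two resulting partitions.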
The process above is repeated recursively at all other nodes to split the data into further subsets. Another important aspect of building a tree is deciding when to stop the partitioning. Very large trees can overfit the training data, and such complex models may not generalize well. Conceptually, the process should stop when further partitioning does not add more value. For example, we can stop splitting when the best candidate split at a node reduces the impurity by less than a preset threshold. This then necessitates selecting a threshold value, and no single threshold is accepted across all applications. Alternatively, one can continue splitting until the error on a validation dataset reaches its minimum, thereby eliminating the reliance on a preset threshold for stopping.
Overall, tree-based methods are easy to understand and implement and can handle both numerical and categorical data. These methods do not require the analyst to identify the best subset of features since the process explained above implicitly selects the most useful variables. This is useful especially when dealing with a large number of explanatory variables. Further details on tree-based methods can be found in the literature [1].
[Anyone volunteering to write a subsection on Random Forests? A volunteer contributing significant content will be added as an author/co-author.]
Author: David Reinke
Boosting is a way of combining a set of weak classifiers into an ensemble that forms a much stronger classifier. This is accomplished through an iterative process in which new weak learners are added to further reduce the model residual. In effect, it creates a “committee” of models, using weighted results from each model.
The basic precept of boosting is the following:
If we have a weak classifier that performs even slightly better than random, we can combine a series of weak classifiers into a strong classifier.
Figure 2‑4 Illustration of AdaBoost
AdaBoost algorithm
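1. Initialize the data weighting coefficients by setting $w_n^{(1)} = 1/N$ for $n = 1, \dots, N$, where N is the number of training samples.
2. For $m = 1, \dots, M$:
(a) Fit a weak classifier $y_m(x)$ to the training data by minimizing the weighted error function $J_m = \sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n)$, where $t_n \in \{-1, +1\}$ is the label of sample n and $I(\cdot)$ is the indicator function.
(b) Evaluate the quantities
$$\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}$$
and then set the weights
$$\alpha_m = \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$$
(c) Update the data weighting coefficients:
$$w_n^{(m+1)} = w_n^{(m)} \exp\{\alpha_m I(y_m(x_n) \neq t_n)\}$$
3. Make predictions using the final model given by
$$Y_M(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)$$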
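The AdaBoost steps can be sketched with one-dimensional decision stumps as the weak learners. This is a minimal illustration under stated assumptions: labels are coded as -1 and +1, the stump learner and the toy data are hypothetical, and the weight formula follows the common form in which the classifier weight is the log of (1 - error) / error.

```python
import math

def fit_stump(xs, ts, weights):
    """Weak learner: pick the threshold/sign stump with the lowest weighted error."""
    best = None
    distinct = sorted(set(xs))
    thresholds = [distinct[0] - 1] + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    for thr in thresholds:
        for sign in (1, -1):
            preds = [sign if x > thr else -sign for x in xs]
            err = sum(w for w, p, t in zip(weights, preds, ts) if p != t)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best  # (weighted error, threshold, sign)

def adaboost(xs, ts, rounds):
    n = len(xs)
    weights = [1.0 / n] * n                            # step 1: uniform weights
    model = []
    for _ in range(rounds):                            # step 2
        err, thr, sign = fit_stump(xs, ts, weights)    # (a) fit weak classifier
        eps = err / sum(weights)                       # (b) weighted error rate
        eps = min(max(eps, 1e-10), 1 - 1e-10)          # guard against log(0)
        alpha = math.log((1 - eps) / eps)              # classifier weight
        model.append((alpha, thr, sign))
        for i, (x, t) in enumerate(zip(xs, ts)):       # (c) boost misclassified points
            if (sign if x > thr else -sign) != t:
                weights[i] *= math.exp(alpha)
    return model

def predict(model, x):
    """Step 3: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (sign if x > thr else -sign) for alpha, thr, sign in model)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can separate: + + + - - - + + +
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ts = [1, 1, 1, -1, -1, -1, 1, 1, 1]
model = adaboost(xs, ts, rounds=3)
predictions = [predict(model, x) for x in xs]
```

On this toy set a single stump misclassifies at least three points, while the weighted vote of three stumps classifies all nine training points correctly.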
Boosting has the remarkable effect that it seems relatively impervious to overfitting.
A full description of boosting, and its applications in a number of areas, can be found in the book by the developers of the method [3].
Authors: Mecit Cetin and Ilyas Ustun
A Support Vector Machine (SVM), in its most basic form, refers to a binary classification algorithm where a hyperplane is used as a discriminant (or decision boundary) to separate the data into two groups. For two-dimensional data, this hyperplane is a straight line. In general, the hyperplane has one less dimension than the data. The main computational task in SVMs is finding the best separating hyperplane such that data belonging to one group fall on one side of the plane and data belonging to the other group fall on the other side. The best hyperplane is defined as the one that lies farthest from the closest training samples, for which the labels or classes are known. Such a decision boundary, e.g., the dashed line shown on the rightmost plot in Figure 2‑5, is assumed to perform and generalize better than any other. To find this hyperplane, mathematical optimization techniques have been developed and are widely available as part of many machine-learning libraries. One of the major advantages of SVMs is that the resulting mathematical optimization problem is convex and therefore has a global optimum, which is not the case for neural networks.
SVMs have been extended to solve nonlinear classification problems where the decision boundary is not linear in the original space. There are also support-vector clustering algorithms to handle and categorize unlabeled data. Support vector regression (SVR) models have also been developed.
Figure 2‑5 Sample hyperplanes (straight lines) indicated with dashed lines for two-dimensional data
Linearly Separable Data
The main idea behind SVMs is best explained through an example with two-dimensional data as in Figure 2‑5. In this illustrative example, the data points are linearly separable into two distinct categories without any point being misclassified. All points to the left of the decision boundary, i.e., the dashed line, belong to a “negative” class, whereas the ones to its right belong to a “positive” class. Out of an infinite number of possible hyperplanes (straight lines) that can separate the data, an SVM model finds the specific hyperplane that has the largest margin. The margin simply refers to the gap or distance measured perpendicularly from the hyperplane to the closest point or points on either side of the hyperplane. An easier way to visualize the margin is shown in Figure 2‑6, where two solid lines parallel to hyperplane h, one on each side of h, are pushed outwards until they reach a data point. The distance between these two parallel lines (l+ and l_ in Figure 2‑6) is the margin. The linear decision boundary h returned by an SVM model will be right in the middle of this margin, at an equal distance from the closest points of either class.
Figure 2‑6 SVM decision boundary and margin
Any linear decision boundary can be written as $w \cdot x + b = 0$, where x represents the multidimensional feature vector, w the vector of coefficients, and b a scalar. In the given expression, the “$\cdot$” between w and x is the well-known dot product. The label for a new observation x’ is predicted by simply evaluating the expression $w \cdot x' + b$. If this turns out to be greater than zero, the predicted label will be “positive,” otherwise “negative.” As shown in Figure 2‑6, the value of this expression is set to +1 and -1 for lines l+ and l_, respectively. While setting them to +1 and -1 is an arbitrary choice (any other constant would work too), it simplifies the mathematical problem. With this construct, $w \cdot x + b \geq 1$ for all “positive” training samples and $w \cdot x + b \leq -1$ for all points in the “negative” class. For the special points that lie on the lines l+ and l_, these constraints become equalities, and these points are called the support vectors; hence the name support vector machines.
In an SVM model, the objective is to find the parameters w and b that produce the largest margin. Through basic algebra, it can be demonstrated that the size of the margin is equal to $2/\|w\|$, where $\|w\|$ is the length or norm of vector w. Maximizing $2/\|w\|$ is equivalent to minimizing $\|w\|$. Since it is mathematically more convenient, $\|w\|^2$ or equivalently $\frac{1}{2}\|w\|^2$ is used as the objective function to be minimized in formulating the SVM problem. In summary, with this objective function and the set of inequality constraints discussed above for the training samples, a convex quadratic optimization problem is obtained. This is typically solved through a Lagrangian formulation, where finding the unknowns involves computations on the order of $N^2$, where N is the number of training samples. In other words, the computation time for training an SVM model grows in proportion to $N^2$.
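To make the roles of w and b concrete, the sketch below evaluates the decision expression and the margin width for a given hyperplane. The parameter values are hypothetical (they are not the result of solving the optimization problem), and the margin formula is the distance between the two parallel lines where the expression equals +1 and -1, i.e., 2 divided by the norm of w.

```python
import math

def decision_value(w, b, x):
    """Evaluate w . x + b for a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_label(w, b, x):
    """Positive if the decision value is greater than zero, otherwise negative."""
    return "positive" if decision_value(w, b, x) > 0 else "negative"

def margin_width(w):
    """Distance between the lines w . x + b = +1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical parameters for two-dimensional data: the boundary is the line x1 = 3.
w, b = (1.0, 0.0), -3.0
label = predict_label(w, b, (5.0, 2.0))
width = margin_width(w)
on_positive_line = decision_value(w, b, (4.0, 1.0))   # a point lying on l+
```

Scaling w and b by the same constant leaves the boundary unchanged but rescales the margin formula, which is why the values on the two margin lines are fixed at +1 and -1 before optimizing.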
Not Linearly Separable Data
In many situations, the data to be classified are not linearly separable. To address such cases, slack variables are introduced. The linear constraints in the basic SVM only admit margins for which all the points are correctly classified. This can lead to very small margins that do not generalize well. Slack variables relax the linear constraints to allow for misclassified points while maintaining a larger margin. This new margin, which allows for some misclassification, is called a soft margin.
The slack variable $\xi_i$ is defined as the distance by which point i goes beyond the boundary of its category, as shown in Figure 2‑7. Slack variables allow some points to be misclassified to some extent. By sacrificing some accuracy on the training data, the SVM achieves a larger margin, which increases its flexibility and generalizability. However, there is a tradeoff between achieving a large margin and accuracy, and the balance is found by penalizing the tolerance to errors in the cost function (Eq.(2)) through the penalty parameter C. A higher value of C will only allow points very close to the decision boundary to be misclassified, while a low value of C is more tolerant.
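The tradeoff can be illustrated with the standard textbook soft-margin objective, half the squared norm of w plus C times the sum of the slacks, which is assumed here to be the form the cost function takes. In this sketch the labels are coded as -1 and +1, the slack is measured in units of the decision value (it equals the geometric distance only when the norm of w is 1), and all parameter values and data points are hypothetical.

```python
def soft_margin_objective(w, b, C, X, y):
    """Return (objective, slacks) for labels y in {-1, +1}."""
    slacks = []
    for xi, yi in zip(X, y):
        functional_margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        # Slack is zero for points on or beyond their margin line.
        slacks.append(max(0.0, 1.0 - functional_margin))
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical hyperplane x1 = 3 and five training points.
w, b = (1.0, 0.0), -3.0
X = [(5.0, 1.0), (4.0, 2.0), (3.5, 0.0), (1.0, 1.0), (3.2, 2.0)]
y = [1, 1, 1, -1, -1]

objective, slacks = soft_margin_objective(w, b, 1.0, X, y)
# The third point sits inside the margin (slack 0.5); the last point is on
# the wrong side of the boundary (slack 1.2); the others need no slack.
```

Raising C makes each unit of slack more expensive, so the optimizer would accept a narrower margin to reduce the violations; lowering C has the opposite effect.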
Figure 2‑7 Slack variables allow for imperfect separation and provide some tolerance [4].
Nonlinear Classification and Kernels
For most practical problems, the given training data might not be linearly separable. Sometimes the classes overlap so much that introducing slack variables alone cannot provide a good solution. In these cases, it can help to transform the data through a mapping function and project the points to a higher dimension Z, as shown in Figure 2‑8. By doing this, the nonlinear problem is linearized in the new feature space, and linear separation becomes possible, as shown with the green line in the new space. The exact form of this mapping function need not be known explicitly, since kernel functions can be used instead.
Figure 2‑8 Transformation of data through a kernel
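The kernel function is defined as $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, where $\phi$ is the function that maps the points to the Z space. The essence of the kernel method lies in the fact that the coordinates in the Z space never need to be calculated explicitly; only the inner products of the points in the Z space are used, and the kernel function returns them directly from the original coordinates. This operation is computationally cheaper than the explicit computation of the coordinates in the Z space. This approach is called the kernel trick.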
Selection of kernel function requires a good understanding of the characteristics of the underlying data. There are many different types of kernel functions, some of which are polynomial, radial basis, and hyperbolic tangent kernel functions as shown in Eq.(3)-(5).
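The three kernel families named above are sketched below in commonly used textbook forms. The parameter names (`degree`, `c`, `gamma`, `kappa`) and default values are assumptions and may differ from the notation in Eq.(3)-(5). The explicit map `phi` is included only to check the kernel trick for one case: for two-dimensional inputs, the degree-2 polynomial kernel equals an inner product in a six-dimensional space.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def polynomial_kernel(a, b, degree=2, c=1.0):
    return (dot(a, b) + c) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """Radial basis (Gaussian) kernel; gamma controls the kernel width."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def tanh_kernel(a, b, kappa=0.1, c=0.0):
    return math.tanh(kappa * dot(a, b) + c)

def phi(x):
    """Explicit feature map whose inner product equals polynomial_kernel
    with degree=2 and c=1 for two-dimensional inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

a, b = (1.0, 2.0), (3.0, -1.0)
k_implicit = polynomial_kernel(a, b)   # computed in the original 2-D space
k_explicit = dot(phi(a), phi(b))       # computed in the 6-D feature space
```

Both computations give the same value, but the kernel needs only one dot product in the original space, and the saving grows quickly with the input dimension and polynomial degree.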
SVM has been extended to solve regression problems as well. Support vector regression (SVR) models still maintain the key features of the SVM models explained above and follow a similar mathematical approach to find the best fitting hyperplane. In contrast to ordinary least-squares regression, an SVR model ignores errors within a threshold called $\epsilon$. As an illustration, these $\epsilon$-bounds are shown for one-dimensional data in Figure 2‑9. For points falling outside the $\epsilon$-bounds, in its basic form, a linear and symmetrical function (right panel in Figure 2‑9) is used to compute the total loss or prediction error needed for the objective function. To balance model complexity and prediction error, just as in SVM for classification, the objective function also includes the norm of the weight vector as in Equation 2, and this balancing is achieved by tuning the hyperparameter C. As in SVM, the hyperplane, f(x), is represented in terms of support vectors, which are the training samples that lie on or outside the $\epsilon$-boundaries. A major advantage of SVR is that its computational complexity does not depend on the dimensionality of the input space [5].
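The error function described above, zero inside the bounds and linear outside, can be written in a couple of lines. The function names and the sample values below are hypothetical and serve only to illustrate the shape of the loss.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon-bounds, linear in the excess error outside them."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def total_loss(ys, preds, epsilon):
    return sum(epsilon_insensitive_loss(y, p, epsilon) for y, p in zip(ys, preds))

# Hypothetical observations and predictions.
ys = [10.0, 12.0, 15.0, 20.0]
preds = [10.5, 11.0, 18.0, 20.0]
loss = total_loss(ys, preds, epsilon=1.0)
# Only the third point (absolute error 3.0) falls outside the bounds,
# so it alone contributes to the total loss.
```

With a squared-error loss every nonzero residual would contribute, whereas here residuals of 0.5 and 1.0 are ignored entirely.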
Figure 2‑9 One-dimensional linear regression with $\epsilon$-bounds (left) and the $\epsilon$-insensitive error function used by the support vector regression machine (right). The graph on the right panel is from Elements of Statistical Learning [1]
The SVMs have been extended to solve the clustering problem as well. In support vector clustering (SVC), the data points are mapped to a higher dimensional feature space through a kernel function first. Then, the smallest sphere enclosing the data in the feature space is found. When this sphere is mapped back to the original space, it results in a set of contours that cluster points into groups as in Figure 2‑10. More details can be found in the original paper on SVC [6].
Figure 2‑10 Sample clustering results from an SVC model as the scale parameter of the Gaussian kernel is varied [6]
To use an SVM model for either a classification or regression problem, one needs to select a kernel function and values for the hyperparameters. For the soft margin SVM, the optimal value for C can be found through a grid search over different C values combined with cross-validation. For kernel SVM, the hyperparameters to be optimized depend on which kernel function is utilized. For example, when using a radial basis kernel function, the hyperparameters that need to be optimized are the width parameter of the kernel function (commonly denoted $\gamma$ or $\sigma$) and the C penalty term. The optimal combination can be found through a grid search and cross-validation.
Through the kernel trick, support vector machines enable fitting linear models to data in a high-dimensional feature space and allow the optimization problem to be posed as a convex quadratic problem. This allows finding a globally optimal solution for the model being fit, a significant advantage of SVM over other models. However, this solution is optimal only for the preselected hyperparameters. For each value of the hyperparameters, a new SVM model is fit. Therefore, the overall process of finding the best SVM model requires numerous trials. Grid search is commonly used for tuning the hyperparameters while checking the model performance on validation data.
Support vector machines have been widely applied in transportation literature to solve various problems. SVMs have been used for incident detection [7] and forecasting of volumes [8] on freeways; detecting travel mode [9] and vehicle stops [10] from smartphone sensor data; imputing missing data in activity-based travel diaries [11]; predicting motor vehicle crashes [12, 13]; modeling travel mode choice [14]; and many other problems and applications.
Authors: Sherif Ishak, Franco Trifiro, and Mohammadeeza K. Hosseini
Neural networks (NNs), or connectionist systems, have experienced a resurgence of interest in recent years as a paradigm of computation and knowledge representation. After a first surge of attempts to simulate the functioning of the human brain using artificial neurons in the 1950s and 1960s, this AI subdiscipline did not receive much attention until the 1990s. The resurgence has been due mainly to the appearance of faster digital computers that can simulate large networks and the discovery of new NN architectures and more powerful learning mechanisms. The new network architectures, for the most part, are not meant to duplicate the operation of the human brain, but rather to draw inspiration from known facts about how the brain works.
NNs are concerned with processing the information by a learning process and by adaptively responding to inputs in accordance with a learning rule. These powerful models are composed of many simulated neurons or simple computational units that are connected in such a way that they are able to learn in a manner similar to how human brains learn. This distributed architecture makes NNs particularly appropriate for solving nonlinear problems and input–output mapping problems. The usual application of NNs is in the area of learning and generalization of knowledge and patterns. They are not suitable for expert reasoning and they have poor explanation capabilities.
While there are several definitions for NNs, the following definition emphasizes the key features of such models. An NN can be defined as a distributed, adaptive, generally nonlinear learning machine built by interconnecting different processing elements (PEs) [15]. The functionality of NNs is based on the interconnectivity between the PEs. Each PE receives connections from other PEs and/or itself. The connectivity defines the topology of the NN and plays a role at least as important as the PEs in the NN’s functionality. The signals transmitted via the connections are controlled by adjustable parameters called weights, $w_{ij}$.
A typical PE structure is depicted in Figure 2‑10 as a nonlinear (static) function applied to the sum of all the PE’s inputs. Because NNs store knowledge in a distributed fashion through the connection weights between PEs, and acquire that knowledge through a learning process that modifies the connection strengths between PEs, they tend to resemble the human brain in functionality.
There are many types of NN architectures, each designed to address a class of problems such as system identification, function approximation, nonlinear prediction, control, pattern recognition, clustering, feature extraction, and others. NNs may also be classified as either static or dynamic. Static networks represent good function approximators with the ability to build long-term memory into their synaptic weights during training. On the other hand, dynamic networks have a built-in mechanism to produce an output based on more than one time instant in the past, establishing what is commonly referred to as short-term memory.
Figure ‑ Example of a neural network.
The development process of NN models is typically carried out in two stages: training and testing. During the training stage, an NN learns from the patterns presented in an existing dataset. The performance of the network is subsequently evaluated using a testing dataset that is composed of patterns the network was never exposed to before. Because the learned knowledge is extracted from training datasets, NNs are considered both model-based and data-driven systems. Usually the learning phase uses an algorithm to adjust the connection weights based on a given dataset of input–output pairs. Training patterns are presented to the network repeatedly until the error of the overall output is minimized. The presentation of all patterns once to the network is called an epoch and results in an adjustment of the connection weights such that the network performance is improved. The training stage is terminated when the error drops below a prespecified threshold value or when the number of epochs exceeds a prespecified limit. Another method to control the training stage is to monitor the network performance (errors) during training on a cross-validation (CV) dataset, which is usually smaller than the learning dataset. The role of CV is to test the network’s generalization capabilities during the training process. If the network is overtrained, a sudden degradation of the network performance on the CV data will trigger the training process to stop.
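The stopping rule based on the CV dataset can be sketched as follows. The `patience` parameter (how many consecutive epochs without improvement are tolerated) is an assumption introduced for the example, as are the error values; in practice the errors would be computed from the network after each epoch.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the index of the epoch whose weights should be kept.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break  # the CV error keeps degrading: stop training
    return best_epoch

# Hypothetical CV errors per epoch: improving at first, then degrading.
cv_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_at = early_stopping_epoch(cv_errors, patience=2)
```

Here the CV error is lowest at the fourth epoch (index 3), so the weights from that epoch would be retained even though the training error might still be decreasing afterwards.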
There are different types of NNs. The most commonly used NN architecture is the multilayer perceptron (MLP). MLP is a static NN that has been extensively used in many transportation applications due to its simplicity and ability to perform nonlinear pattern classification and function approximation. It is, therefore, considered the most widely implemented network topology by many researchers (e.g., [16] [17]). With enough hidden neurons, an MLP can in principle approximate any continuous function to arbitrary accuracy.
MLP consists of three types of layers: input, hidden, and output. It has a one-directional flow of information, from the input layer, through the hidden layer or layers, and then to the output layer, which provides the response of the network to the input stimuli. In this type of network, there are generally three distinct types of neurons organized in layers. The input layer contains as many neurons as the number of input variables. The hidden neurons, which are contained in one or more hidden layers, process the information and encode the knowledge within the network. The hidden layer receives, processes, and passes the input data to the output layer. The selection of the number of hidden layers and the number of neurons within each affects the accuracy and performance of the network. The output layer produces the output vector that is compared with the target.
Figure 2‑11 depicts an example of MLP topology. A weight coefficient is associated with each of the connections between any two neurons inside the network. Information processing at the neuron level is done by an “activation function” that controls the output of each one.
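The flow of information through an MLP can be sketched as a forward pass in which each neuron applies its activation function to the weighted sum of its inputs. The layer sizes, weight values, and the choice of a tanh hidden activation with a linear output are assumptions for illustration; in practice the weights would be learned during training.

```python
import math

def layer_forward(inputs, weights, biases, activation):
    """One layer: each neuron applies the activation to its weighted input sum."""
    return [activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def mlp_forward(x, hidden_w, hidden_b, output_w, output_b):
    hidden = layer_forward(x, hidden_w, hidden_b, math.tanh)       # hidden layer
    return layer_forward(hidden, output_w, output_b, lambda z: z)  # linear output

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5]]
output_b = [0.2]
output = mlp_forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b)
```

Each inner list in `hidden_w` holds the connection weights into one hidden neuron, so the shape of the weight lists directly mirrors the network topology.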
NNs train through adaptation of their connection weights based on examples provided in a training set. The training is performed iteratively until the error between the computed and the real output over all training patterns is minimized. Output errors are calculated by comparing the desired output with the actual output. Therefore, it is possible to calculate an error function that is used to propagate the error back to the hidden layer and to the input layer in order to modify the weights. This iterative procedure is carried out until the error at the output layer is reduced to a prespecified minimum or for a prespecified number of epochs. The back-propagation algorithm is most commonly used for training MLP and is based on minimizing the sum of squared errors between the desired and actual outputs.
Validation of an already trained NN requires testing the network performance on a set of out-of-sample data, called testing data, which is composed of data that was never presented to the network before. If the error obtained in both training and testing phases is satisfactory, the NN is considered adequately developed and thus can be used for practical applications.
Figure ‑ Example of MLP network topology.
The NNs learn to approximate the mapping function between the pairs of input and output data using an iterative training process in a supervised learning approach. Depending on the type or the format of the output, regression and classification are the two main types of supervised learning problems. Classification refers to the problems when the output is a finite set of discrete labels or categories. For example, in the case of traffic prediction, if the prediction is in the form of explicit classes such as congested and uncongested states, the traffic prediction is a classification problem. On the other hand, regression refers to the problems when the output is numerical and continuous. When applying NN to predict the traffic state in the form of continuous numbers such as flow, density, or average speed, the prediction is considered as a regression problem.
Knowing the differences between these two types of supervised learning problems helps with the design of the output layer and selection of the loss function for the training process. For regression, the output of the network is in the form of continuous values, and the linear activation function is commonly used for the output layer. For classification, the output of the network is usually in the form of probability for each class or category. In this case, the softmax activation function is used for the last layer, which converts the output of the previous layers into probabilities that sum to one.
Different loss functions are adopted in the training of the regression and classification problems. Training of the NN is an iterative process, and the network parameters are adjusted at every step based on the performance (e.g., error) of the network estimated using the loss function. The loss function selected for classification problems for the networks with probability outputs is usually from the cross-entropy loss function family. Cross-entropy loss function enables optimizing the network parameters to return a higher probability of the correct class and lower probabilities for the other classes. For the case of regression, the loss function considers the differences between the actual and predicted values. The mean squared error is a popular loss function for the regression problems. The parameters of the network are adjusted to reduce the residuals between the prediction and the actual output variables.
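The output-layer and loss-function choices for the two problem types can be illustrated with a short sketch. The function names and numeric values are hypothetical; the softmax subtracts the largest score before exponentiating, a common trick for numerical stability that does not change the result.

```python
import math

def softmax(scores):
    """Convert raw output-layer scores into probabilities that sum to one."""
    shift = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Classification: hypothetical scores for the classes [congested, uncongested].
probs = softmax([2.0, 0.5])
loss_if_congested = cross_entropy(probs, 0)
loss_if_uncongested = cross_entropy(probs, 1)

# Regression: hypothetical observed and predicted average speeds.
mse = mean_squared_error([55.0, 60.0], [50.0, 62.0])
```

Because the network assigns the higher probability to the congested class, the cross-entropy loss is small when that class is correct and large when it is not, which is what drives the weights toward higher probabilities for the correct class.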
In addition to the basic MLP architecture, several other advanced topologies have been developed in the past few years to meet the needs of different types of applications. Although NNs and other soft computing constituents may perform exceptionally well when used individually, the development of practical and efficient intelligent tools may require a synergistic integration of several topologies to form hybrid systems. In fact, the computational intelligence and soft computing fields have witnessed in the past few decades intensive research interest in integrating different computing paradigms, such as fuzzy set theory, genetic algorithms (GAs), and NNs, to generate more efficient hybrid systems. The emphasis is placed on the synergistic, rather than the competitive, way the individual tools act to enhance each other’s application domain. The purpose is to provide flexible information processing systems that can exploit the tolerance for imprecision, uncertainty, approximate reasoning, and partial information to achieve tractability, robustness, low solution cost, and close resemblance to human-like decision making [18].
For example, a combination of NN and fuzzy set models, or a neuro–fuzzy model, may consolidate the advantages of both techniques. When combined, they can be easily trained and have the known convergence and stability properties of NNs, and they can also provide a certain amount of functional transparency through rule dependency, which is important for understanding the solution of a problem. NNs and GAs can be combined to solve optimization problems. In this hybrid approach, the NN is used to represent observed functions of unknown shape, and the GA is used to obtain the final result of the optimization problem. Examples of advanced and hybrid NN topologies include:
Modular networks,
Coactive neuro–fuzzy inference system (CANFIS),
Recurrent neural networks (RNN),
Jordan–Elman network,
Partially recurrent network (PRN), and
Time-lagged feed-forward network (TLFN).
Modular Network
Modular networks are a special class of multiple parallel feed-forward MLPs. The input is processed by several MLPs, and the results are then recombined. The topology is composed of two primary components: local expert networks and a gating network ([19] [15]). Figure 2‑12 shows the topology of a modular network. The basic idea is linked to the concept of “divide and conquer,” where a complex system is better attacked when divided into smaller problems whose solutions lead to the solution of the entire system. In a modular network, a given task is split up among several local expert networks, which reduces the load on each expert compared with a single network that must learn to generalize over the entire input space. The modular NN architecture then builds a bigger network by using the modules as building blocks. A very common method is to construct an architecture that supports a division of the complex task into simpler tasks.
All modules are NNs. The architecture of a single module is simpler, and the subnetworks are smaller than a monolithic network. Because of these structural modifications, the task a module must learn is generally easier than the whole task of the network, which makes a single module easier to train. In a further step, the modules are connected into a network of modules rather than a network of neurons. The modules are independent to a certain degree, which allows the system to work in parallel. This NN type offers specialization of a function in each sub-module and does not require full interconnectivity between the MLP layers. A gating network eventually combines the outputs of the local experts to produce an overall output. For this modular approach, a control system is always necessary to enable the modules to work together in a useful way. An evaluation using different real-world data sets showed that the architecture is very useful for high-dimensional input vectors. For certain domains, the learning speed and the generalization performance of the modular system are significantly better than those of a monolithic multilayer feed-forward network [20].
Figure 2‑12 Example of the modular network topology.
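The division of labor between local experts and a gating network can be sketched as follows. This is a minimal illustration under simplifying assumptions: each expert is reduced to a single linear unit rather than a full MLP, the gate uses a softmax to form its mixing weights, and all names and weight values are hypothetical.

```python
import math

def linear(weights, bias, x):
    """A single linear unit: dot(weights, x) + bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def softmax(scores):
    """Turn gate scores into positive mixing weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def modular_forward(x, experts, gate):
    """Combine the local experts' outputs using the gating network's weights."""
    expert_outputs = [linear(w, b, x) for w, b in experts]
    gate_weights = softmax([linear(w, b, x) for w, b in gate])
    output = sum(g * y for g, y in zip(gate_weights, expert_outputs))
    return output, gate_weights

# Two hypothetical experts, each given as (weights, bias), and one gate score per expert.
experts = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
gate = [([5.0, 0.0], 0.0), ([0.0, 5.0], 0.0)]

output, gate_weights = modular_forward([2.0, -1.0], experts, gate)
```

For this input the gate assigns almost all of the weight to the first expert, so the overall output is very close to that expert's output of 2.0.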
Coactive Neuro–Fuzzy Inference System
CANFIS belongs to a more general class of adaptive neuro-fuzzy inference systems (ANFIS) [19]. CANFIS may be used as a universal approximator of any nonlinear function. The characteristics of CANFIS are emphasized by the advantages of integrating NN with fuzzy inference systems (FIS) in the same topology. The powerful capability of CANFIS stems from pattern-dependent weights between the consequent layer and the fuzzy association layer. The architecture of CANFIS is illustrated in Figure 2‑13.
The fundamental component of CANFIS is a fuzzy neuron that applies membership functions (MFs) to the inputs. Two membership functions are commonly used: general bell and Gaussian [21]. The network also contains a normalization axon that scales the output into the range of 0 to 1. The second major component of CANFIS is a modular network that applies functional rules to the inputs. The number of modular networks matches the number of network outputs, and the number of processing elements in each network corresponds to the number of MFs. CANFIS also has a combiner axon that applies the MF outputs to the modular network outputs. Finally, the combined outputs are channeled through a final output layer, and the error is backpropagated to both the MFs and the modular networks.
The function of each layer is described as follows. Each node in Layer 1 is the membership grade of a fuzzy set (A, B, C, or D) and specifies the degree to which the given input belongs to one of the fuzzy sets. The fuzzy sets are defined by three membership functions. Layer 2 receives input in the form of the product of all output pairs from the first layer. The third layer has two components. The upper component applies the membership functions to each of the inputs, while the lower component is a representation of the modular network that computes, for each output, the sum of all the firing strengths. The fourth layer calculates the weight normalization of the output of the two components from the third layer and produces the final output of the network.
Figure 2‑13 Example of CANFIS network topology.
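To make the membership and normalization steps more concrete, the sketch below evaluates the two membership functions named above and forms normalized firing strengths for two inputs with two fuzzy sets each. The function names, the textbook forms of the bell and Gaussian functions used here, and all parameter values are illustrative assumptions rather than the exact CANFIS formulation in [19] or [21].

```python
import math

def gaussian_mf(x, center, sigma):
    """Gaussian membership grade in (0, 1], equal to 1 at the center."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def bell_mf(x, a, b, center):
    """General bell membership grade: 1 / (1 + |(x - center) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - center) / a) ** (2 * b))

def normalized_firing_strengths(x1, x2, sets_1, sets_2):
    """Membership grades per input, products of all grade pairs, then normalization."""
    grades_1 = [gaussian_mf(x1, c, s) for c, s in sets_1]
    grades_2 = [gaussian_mf(x2, c, s) for c, s in sets_2]
    strengths = [g1 * g2 for g1 in grades_1 for g2 in grades_2]
    total = sum(strengths)
    return [s / total for s in strengths]

# Hypothetical fuzzy sets given as (center, sigma): A and B for input 1, C and D for input 2.
sets_1 = [(0.0, 1.0), (2.0, 1.0)]
sets_2 = [(0.0, 1.0), (2.0, 1.0)]

weights = normalized_firing_strengths(0.0, 0.0, sets_1, sets_2)
half_grade = bell_mf(1.0, a=1.0, b=2, center=0.0)   # 0.5: the bell function equals 0.5 at distance a
```

The normalized strengths sum to 1, and the pair (A, C) dominates because both inputs sit at the centers of those two sets.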
Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a network topology designed to accommodate temporal dynamic behavior and to process sequences of inputs or outputs. RNNs were first introduced in the 1980s ([22-25]) to capture the order of, and temporal dependencies between, observations. RNNs have gained a lot of attention recently, especially in the field of natural language processing (NLP), owing to improvements in computational power and the use of graphics processing units (GPUs). RNNs are also well suited to time series forecasting because they can learn the temporal dependencies between observations. One advantage of NNs over conventional linear models in time series forecasting is their ability to learn and approximate nonlinear relations between the input and output features [26]. Moreover, the RNN eliminates the need for a predefined time window for input or output features and is also capable of modeling multivariate sequences [27]. Given these capabilities in processing multivariate sequence data with temporal dependencies, the RNN has potential applications in the field of transportation, such as forecasting the state of a system and its evolution over time.
RNNs capture long-range time dependencies between the observations in sequence data [28]. The input and output sequences are denoted as (x(1), x(2), …, x(T)) and (y(1), y(2), …, y(T)), and each data point x(t) or y(t) can be a real-valued vector with any shape. The sequence represents a temporal or ordinal aspect of the data, and the length of the sequence can be finite or infinite. An RNN is a feedforward type of network with additional recurrent edges that connect to the next time steps. A recurrent edge creates a loop by connecting a node to itself, which allows information to be carried across time while the input is received one step at a time. The recurrent nodes of the RNN maintain information from past events in their internal memories, known as the hidden state. Figure 2‑14 presents a simple RNN node with a recurrent connection. The output of the recurrent node, y(t), at any time depends on the hidden state, h(t), of the node at that time, according to equation 1:
$$y(t) = \sigma_y\left(W_y\, h(t) + b_y\right) \qquad (1)$$
In this equation, $W_y$ and $b_y$ are the trainable network parameters (weights and biases), and $\sigma_y$ can be any activation function.
The hidden state, $h(t)$, is the memory of the recurrent node and is estimated based on the current input, $x(t)$, and the previous hidden state of the node, $h(t-1)$, according to equation 2:
$$h(t) = \sigma_h\left(W_h\, x(t) + U_h\, h(t-1) + b_h\right) \qquad (2)$$
Here $W_h$, $U_h$, and $b_h$ are the trainable parameters, and $\sigma_h$ is the activation function.
Figure 2‑14 Simple RNN Node with Recurrent Connection.
Consequently, the output of the recurrent node at each time step depends on the current input and the previous hidden state of the node (i.e., the memory of the node). The dynamics of the recurrent node over time can be unfolded, as illustrated in Figure 2‑14, which shows a simple RNN node with an input sequence and an output sequence. By unfolding, the network is unrolled into the full network for the complete sequence. In the unfolded network, the RNN creates one layer for each time step, with weights shared across all time steps. The RNN is trained across all the time steps using the backpropagation through time (BPTT) algorithm [29]. Depending on the architecture of the RNN, the network may or may not have an output for every time step of the input sequence, and it may or may not require an input for every time step of the output sequence. The main feature of the RNN is that it maintains a hidden state that is the memory of the node and captures information about the sequence.
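The unfolding of a recurrent node over a sequence can be sketched in a few lines. The example assumes the common update h(t) = tanh(W_h x(t) + U_h h(t-1) + b_h) with a linear output y(t) = W_y h(t) + b_y; the parameter names, the choice of tanh, the zero initial state, and all weight values are assumptions made for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_forward(inputs, W_h, U_h, b_h, W_y, b_y):
    """Unroll one recurrent layer over a sequence, reusing the same weights at every step."""
    h = [0.0] * len(b_h)                 # initial hidden state (the node's memory)
    hidden_states, outputs = [], []
    for x in inputs:                     # the input is received one time step at a time
        pre = [a + b + c for a, b, c in zip(matvec(W_h, x), matvec(U_h, h), b_h)]
        h = [math.tanh(p) for p in pre]  # new hidden state from current input and previous state
        y = [a + b for a, b in zip(matvec(W_y, h), b_y)]
        hidden_states.append(h)
        outputs.append(y)
    return hidden_states, outputs

# Hypothetical sizes: 1 input feature, 2 hidden units, 1 output.
inputs = [[1.0], [0.0], [0.0]]
W_h = [[0.5], [-0.3]]
U_h = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.0]
W_y = [[1.0, -1.0]]
b_y = [0.0]

hidden_states, outputs = rnn_forward(inputs, W_h, U_h, b_h, W_y, b_y)
```

Although the second and third inputs are zero, the hidden state stays nonzero at those steps, which is the memory effect described above.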
At each iteration of the training process, every parameter of the network is updated based on the gradient of the loss function, in the direction that reduces the total loss (i.e., error) of the network. When the gradient is estimated by backpropagation across many time steps, it can become exponentially large or exponentially small because of the many multiplications required by the chain rule. An exploding or vanishing gradient prevents the parameters of the network from converging to the values that minimize the error during training. For this reason, training an RNN is considered challenging.
Techniques such as truncated backpropagation through time (TBPTT) help keep the gradient from getting too large or too small. TBPTT limits the number of time steps over which the error can be propagated, which can restrict the ability of the network to learn long-range dependencies in sequences. Fortunately, modern RNN architectures such as the Long Short-Term Memory (LSTM) network address the vanishing or exploding gradient problem while retaining a high capability for learning long-range dependencies. The LSTM network is a type of RNN with recurrent connections but with advanced LSTM blocks instead of simple recurrent nodes. The core component of the LSTM block is the cell, which serves as its memory. This component decides what to keep in or remove from memory, considering the previous state and the current input. Most recent RNN architectures use LSTM blocks, and this architecture is discussed in more detail in the deep learning component of this document.
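The effect of repeated multiplication through the chain rule can be seen in a deliberately simplified setting. The sketch assumes a scalar, linear recurrence h(t) = w * h(t-1), for which the derivative of h(T) with respect to h(0) is simply w multiplied by itself T times; the function name and the two weight values are illustrative.

```python
def gradient_through_time(recurrent_weight, steps):
    """d h(T) / d h(0) for the scalar linear recurrence h(t) = w * h(t-1).
    The chain rule multiplies by the same factor w once per time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_weight
    return grad

vanishing = gradient_through_time(0.5, 50)   # about 8.9e-16
exploding = gradient_through_time(1.5, 50)   # about 6.4e+08
```

A recurrent weight slightly below or above 1 in magnitude is enough to shrink or inflate the gradient by many orders of magnitude over 50 steps.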
Jordan-Elman Network
The Jordan–Elman network is also referred to as the simple recurrent network (SRN) [17]. It is a single hidden-layer feed-forward network with feedback connections from the outputs of the hidden-layer neurons to the input of the hidden layer [15]. It was originally developed to learn temporal sequences or time-varying patterns. As shown in Figure 2‑15, the network contains context units, located in the upper portion, that replicate the hidden-layer output signals.
The context units are introduced to resolve conflicts arising from patterns that are similar yet result in dissimilar outputs. The feedback provides a mechanism to discriminate between identical patterns occurring at different times. The context units are referred to as a low-pass filter that creates a weighted average output of some of the more recent past inputs. They are also called “memory units” since they tend to remember information from past events. The training phase of this network is achieved by adapting all the weights using standard back-propagation procedures. More details on this topology can be found in [17] and [21].
Figure 2‑15 Example of Jordan-Elman network topology.
Partially Recurrent Network
PRN is considered a simplified version of the Jordan–Elman network without hidden neurons. It is composed of an input layer of source and feedback nodes, and an output layer, which is composed of two types of computation nodes: output neurons and context neurons. The output neurons produce the overall output, while the context neurons provide feedback to the input layer after a time delay. The topological structure of the network is illustrated in Figure 2‑16. More details can be found in [30] and [21].
Time-Lagged Feed-Forward Network
In dynamic NNs, time is explicitly included in mapping input-output relationships. As a special type, the TLFN extends nonlinear mapping capabilities with time representation by integrating linear filter structures in a feed-forward network. This type of topology is also called a focused TLFN and has memory only at the input layer. The TLFN is composed of a feed-forward arrangement of memory and nonlinear processing elements (PEs). It has some of the advantages of feed-forward networks, such as stability, and can also capture information in input time signals. Figure 2‑17 shows a simplified topological structure of the focused TLFN, in which memory PEs are attached to the input layer only. The input-output mapping is performed in two stages: a linear time-representation stage at the memory PE layer and a nonlinear static stage between the representation layer and the output layer. Further details underlying the mathematical operations of TLFN can be found in [17], [15], and [21].
Figure 2‑16 Example of PRN topology.
Figure 2‑17 Example of TLFN topology.
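The two-stage mapping of the focused TLFN can be sketched with the simplest possible memory structure. The example assumes a tapped delay line as the linear memory stage and a single tanh unit as the nonlinear static stage; both choices, the function names, and the weight values are illustrative assumptions, and other memory structures are possible.

```python
import math

def tapped_delay_line(signal, num_taps):
    """Linear memory stage: for each time t, return [x(t), x(t-1), ..., x(t-num_taps+1)].
    Samples before the start of the signal are treated as zero."""
    windows = []
    for t in range(len(signal)):
        window = [signal[t - k] if t - k >= 0 else 0.0 for k in range(num_taps)]
        windows.append(window)
    return windows

def static_stage(window, weights, bias):
    """Nonlinear static stage: here a single tanh unit applied to the delayed inputs."""
    return math.tanh(sum(w * x for w, x in zip(weights, window)) + bias)

signal = [1.0, 2.0, 3.0, 4.0]                       # hypothetical input time signal
windows = tapped_delay_line(signal, num_taps=3)
outputs = [static_stage(w, weights=[0.5, -0.25, 0.1], bias=0.0) for w in windows]
```

Because memory exists only at the input, the static stage sees each window as an ordinary feature vector and can be trained like any feed-forward network.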
Deep neural networks (DNN) refer to networks with multiple layers of neurons between the input and output layers. Increasing the number of layers (i.e., depth) and using nonlinear activation functions improves the ability of the network to learn more complicated relationships between the input and output features. Adding more layers enables the network to transform features at different levels of abstraction, forming higher-level abstractions layer by layer from lower-level features. This hierarchical feature transformation allows the network to learn a more complex mapping function from input variables to output variables. Empirically, increasing the depth of the network can improve its generalization [31]. Training DNNs may become challenging because of the potential for vanishing or exploding gradients, which result from the many multiplications required by the chain rule across many layers [32]. Still, training DNNs is attainable with a good choice of activation function, proper initialization of the network parameters, and a suitable network architecture. The increased capability of the network to learn more complex relations between the input and output features, along with its improved generalization, are the main reasons for the popularity and superiority of deep neural networks in a wide variety of tasks such as computer vision, natural language processing, and machine translation. DNN architectures such as Convolutional Neural Networks (CNN) and deep LSTM networks are discussed in more detail in the following sections of this document.
Author: Jidong Yang
Introduction
Convolutional Neural Networks, typically referred to as CNN or ConvNet, are perhaps the greatest success story of biologically inspired artificial intelligence [31]. Their success is well known for tasks such as image classification, video analysis, object detection and tracking, and natural language processing, and the domain of their applications continues to expand. CNNs are constructed with convolutional layers that mimic mammalian visual cortexes, which contain neurons that individually respond to small regions of the visual field, as discovered by Hubel and Wiesel [33], who later received the Nobel Prize for their work. In a convolutional layer, each neuron receives input from a local subarea of the previous layer, named the receptive field. For example, for an image, the receptive field is typically of a square shape (e.g., 3x3) with depth (e.g., 3 channels for RGB color images). CNN also uses a parameter-sharing scheme, in which each filter (i.e., the same set of weights) scans or slides over the entire input area, resulting in significantly fewer parameters (or weights) than traditional multilayer neural networks, in which each neuron receives input from every element of the previous layer. When the receptive field grows to the same size as the previous layer, the convolutional layer is the same as the fully connected or dense layer in traditional multilayer neural networks. In this sense, CNN can be considered a more generic form that contains traditional multilayer neural networks as a special case. This built-in versatility makes CNN extremely scalable with large inputs (e.g., image data). Figure 2‑18 illustrates the concept of parameter sharing with a simple one-dimensional (1D) example.
Figure 2‑18 Effect of parameter sharing – comparison of CNN versus traditional multilayer neural networks
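The saving from parameter sharing can be made concrete by counting trainable parameters for a one-dimensional input. The sketch assumes an input of length 1000 mapped to 998 outputs, once by a dense layer and once by a single 1D filter of size 3 (stride 1, no padding); the sizes and function names are illustrative.

```python
def dense_layer_params(num_inputs, num_outputs):
    """Fully connected: one weight per input-output pair plus one bias per output."""
    return num_inputs * num_outputs + num_outputs

def conv1d_layer_params(kernel_size, in_channels, num_filters):
    """Convolution: each filter has kernel_size * in_channels shared weights plus one bias,
    regardless of how long the input is."""
    return (kernel_size * in_channels + 1) * num_filters

dense = dense_layer_params(num_inputs=1000, num_outputs=998)             # 998998
conv = conv1d_layer_params(kernel_size=3, in_channels=1, num_filters=1)  # 4
```

The convolutional count does not grow with the input length, which is what makes the approach scale to large inputs.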
In practice, CNN has been widely used to deal with images. However, its applications are not limited to image data. CNN can discover or learn patterns from any data or signals with certain structures in spatial and/or temporal dimensions. For example, Abdoli, et al. [34] classified environmental sound based on a 1D CNN (Figure 2‑19), which learns a representation directly from the audio signal.
Figure 2‑19 The architecture of the proposed end-to-end 1D CNN for environmental sound classification [34]
Each convolutional layer typically has a number of “filters” to extract or assemble specific features from the previous layer. In the context of image processing, a filter is a set of numbers, typically oriented in the form of one or multiple square matrices. Each filter looks for a particular type of features, such as vertical edges or horizontal edges (note: we will have an example on this shortly). The filter in CNN is analogous to an optical filter, which selectively transmits light of different wavelengths. For example, to be able to identify a unique object such as a Stop Sign, a hierarchy of filters would be needed to extract specific features at different levels, including color (red and white), edges (vertical, horizontal, and slanting lines), shape (octagon), and four letters (STOP). Zeiler, et al. [35] showed that CNN can capture image information in a variety of forms: the low-level edges, mid-level edge junctions, high-level object parts and complete objects (see Figure 2‑20).
Figure 2‑20 Top-down parts-based image decomposition with an adaptive deconvolutional network. Each column corresponds to a different input image under the same model. Row 1 shows a single activation of a 4th layer feature map projected into image space. Conditional on the activations in the layer above, we also take a subset of 5, 25 and 125 active features in layers 3, 2 and 1 respectively and visualize them in image space (rows 2–4). The activations reveal mid and high level primitives learned by our model. In practice there are many more activations such that the complete set sharply reconstructs the entire image from each layer [35].
The extracted features from images are normally general and can be used for many different tasks, such as classification, as shown in Figure 2‑21.
Figure 2‑21 CNN can solve the classification problem based on a hierarchy of features extracted from images [31]
Now, let us dig a bit deeper to understand how the “magic” happens. When applying a filter or kernel (the term “filter” is commonly used in CNN) to extract information from an image, a convolution operation is carried out between the filter and the image. For brevity and without loss of generality, we will use a grayscale image, which has one channel as compared to three-channel RGB images. The pixel value or gray scale is represented by one byte (i.e., 8 bits), which encodes 2^8 = 256 integer values, with 0 for Black and 255 for White. Any number in between represents Gray at an intermediate level. A simple grayscale image is shown in Figure 2‑22. The image has two distinct features: one vertical edge and one horizontal edge. How can we extract those features separately by applying two different filters?
Figure 2‑22 Image with a size of 9 pixels by 9 pixels
To accomplish this task, we could use the two filters (i.e., matrices) in Figure 2‑23. Those filters are also referred to as Sobel kernels [36].
(a) Vertical (b) Horizontal
Figure 2‑23 Sobel kernel edge filters
To apply the filters, a convolution operation is performed, which is discussed in detail in the following section.
Convolution Operation
Convolution is a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other. It is defined as the integral of the product of the two functions after one is reversed and shifted and can be written in Eq. 1 [37].
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \qquad (1)$$
Since we mostly deal with discrete inputs, such as pixels for images, the convolution operation reduces to an element-wise multiplication of the filter with the underlying image patch followed by a summation (i.e., a dot product), as shown in Figure 2‑24.
Figure 2‑24 Pixel-wise convolution
As illustrated in Figure 2‑24, the pixel value in the original image is replaced by a new pixel value in the filtered image after convolution. To see the convolution operation in action, an animation is viewable at [38].
Given the 3x3 filters in Figure 2‑23, padding is needed to keep the feature map the same size as the input image. Normally, zero padding is used, with zero-value pixels added at the edges of the image, as shown in Figure 2‑25.
Figure 2‑25 Illustration of zero padding
The same filter “scans” the image by sliding one pixel at a time (referred to as stride 1) horizontally and vertically till it scans the whole image. One step of horizontal scanning for the top row of the image is illustrated in Figure 2‑26.
Figure 2‑26 Filter sliding horizontally with stride 1.
The filtered image in Figure 2‑27 shows that the vertical edge has been extracted.
Figure 2‑27 Illustration of applying the vertical edge filter
Similarly, Figure 2‑28 shows the result of applying the horizontal filter.
Figure 2‑28 Illustration of applying the horizontal edge filter
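The padding, sliding, and multiply-and-sum steps described above can be put together in a short sketch. It uses the standard Sobel kernel values, which are assumed to correspond to the filters in the figure, and a small hypothetical image with a single vertical edge rather than the 9 by 9 image of the example. Like most CNN libraries, the function does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a zero-padded `image` and return the feature map."""
    rows, cols = len(image), len(image[0])
    k = len(kernel)
    # Zero padding: surround the image with a border of zero-value pixels.
    padded = [[0] * (cols + 2 * padding) for _ in range(rows + 2 * padding)]
    for r in range(rows):
        for c in range(cols):
            padded[r + padding][c + padding] = image[r][c]
    out_rows = (rows + 2 * padding - k) // stride + 1
    out_cols = (cols + 2 * padding - k) // stride + 1
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = 0
            for a in range(k):          # element-wise multiply, then sum
                for b in range(k):
                    total += kernel[a][b] * padded[i * stride + a][j * stride + b]
            row.append(total)
        feature_map.append(row)
    return feature_map

SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]      # responds to vertical edges
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # responds to horizontal edges

# Hypothetical 5x6 image: black (0) on the left half, white (255) on the right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]

vertical_map = conv2d(image, SOBEL_VERTICAL, stride=1, padding=1)
horizontal_map = conv2d(image, SOBEL_HORIZONTAL, stride=1, padding=1)
```

With a padding of 1 the feature maps keep the 5x6 size of the input. In the interior rows, the vertical filter responds strongly at the two columns next to the edge and the horizontal filter gives zero; the zero-padded border can produce additional responses where bright pixels meet the padding.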
Note the purpose of the example above is to illustrate how convolution and feature extraction work in general. Some variations have been proposed, including transposed convolution or deconvolution, separable convolution, deformable convolution [39], dilated convolutions or atrous convolution [40].
In CNN, the filters or weights are parameters to be learned from data during the training stage. A convolution layer often consists of a stack of filters, resulting in a stack of feature maps. This can be visualized in Figure 2‑29 as each convolutional layer transforms a 3D input volume to a 3D output volume of neuron activations.
Figure 2‑29 Illustration of convolution layers and feature extraction
As shown in Figure 2‑29, the input layer is an image, for which the width and height would be the dimensions of the image, and the depth would be 3 (i.e., Red, Green, Blue channels). The sequential convolution operations would commonly reduce the width and height of subsequent 3D volume. The depth of the subsequent 3D volume corresponds to the number of filters used, which is a hyperparameter of CNN. The number of filters determines the number of feature maps to generate from the 3D input volume in the previous layer. It should be pointed out that the connectivity for each filter along the depth axis is always equal to the depth of the input volume. In other words, each filter itself is a 3D shape with the depth equal to the depth of the input volume (see the square-based prism in dash lines in Figure 2‑29). For classification problems, the last layer is typically a fully connected layer resulting in a 1x1xN shape volume (N is the number of classes).
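The relationship between the input volume, the filters, and the output volume can be summarized with two small helper functions. They assume the usual output-size rule (input size minus filter size plus twice the padding, divided by the stride, plus one); the function names and the 32x32x3 example are illustrative.

```python
def conv_output_shape(width, height, filter_size, num_filters, stride=1, padding=0):
    """Shape (width, height, depth) of the volume produced by a convolution layer.
    The output depth equals the number of filters."""
    out_w = (width - filter_size + 2 * padding) // stride + 1
    out_h = (height - filter_size + 2 * padding) // stride + 1
    return out_w, out_h, num_filters

def conv_num_parameters(filter_size, input_depth, num_filters):
    """Each filter spans the full depth of the input volume and carries one bias."""
    return (filter_size * filter_size * input_depth + 1) * num_filters

# Hypothetical layer: ten 5x5 filters applied to a 32x32 RGB image.
shape = conv_output_shape(32, 32, filter_size=5, num_filters=10)                 # (28, 28, 10)
num_params = conv_num_parameters(filter_size=5, input_depth=3, num_filters=10)   # 760
```

Without padding the width and height shrink from 32 to 28, while the depth becomes 10, one feature map per filter.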
Typical Layer Structure
Four major types of layers are commonly used to construct CNN. They are Convolution Layer, Activation Layer (e.g., ReLU layer), Pooling Layer, and Fully Connected Layer. Convolution layers constitute the main tissue of CNN while the other layers complement the convolutional layers to create a functional CNN for a specific application. We will discuss those layers in detail.
To be able to learn complex high-level features, which are naturally nonlinear, CNN structure should be designed to account for such nonlinearity. By resembling the mechanism of brain neurons, the nonlinearity can be achieved by an activation function. In modern CNN practice, the commonly used activation function is called rectified linear unit (ReLU), which can be written as:
$$f(x) = \max(0, x) \qquad (2)$$
ReLU passes positive values through unchanged and suppresses negative values to zero, so it is a piecewise linear function whose nonlinearity comes from the kink at zero. Albeit simple, ReLU offers many advantages over its traditional counterparts (e.g., sigmoid or tanh functions), such as better gradient propagation, efficient computation, and scale invariance. Note that the gradients of ReLU are very simple: zero for negative x and 1.0 for positive x, as shown in Figure 2‑30. Some variants of ReLU (e.g., Parametric or Leaky ReLU, Exponential Linear Unit, etc.) have been proposed and may work better in certain situations. For this primer, we will focus on ReLU. Interested readers are referred to He, et al. [41] for Leaky ReLU and Clevert, et al. [42] for Exponential Linear Unit.
Figure 2‑30 ReLU activation function
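ReLU and its gradient take only a few lines of code. The sketch below applies them element-wise to a small grid of hypothetical filtered values; the function names are illustrative, and the gradient at exactly x = 0 is set to zero here by convention.

```python
def relu(x):
    """max(0, x): positive values pass through, negative values become zero."""
    return max(0.0, x)

def relu_gradient(x):
    """Derivative of ReLU: 0 for negative x and 1 for positive x (0 is used at x = 0)."""
    return 1.0 if x > 0 else 0.0

def relu_map(feature_map):
    """Apply ReLU element-wise; the shape of the input is unchanged."""
    return [[relu(v) for v in row] for row in feature_map]

filtered = [[-1020.0, 0.0, 1020.0], [-510.0, 255.0, 0.0]]   # hypothetical filtered values
activation_map = relu_map(filtered)
# activation_map == [[0.0, 0.0, 1020.0], [0.0, 255.0, 0.0]]
```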
For example, after ReLU is applied to the filtered image in Figure 2‑27, it becomes the activation map (or feature map) shown in Figure 2‑31.
Figure 2‑31 Activation map
In CNN literature, ReLU is often treated as a layer, which will not change the shape of the input volume. It simply performs element-wise ReLU computation to the input volume. In practice, every convolution layer is typically followed by a ReLU layer, resulting in a stack of activation maps or feature maps (as shown in Figure 2‑29). The number of feature maps is indicated by the depth of the 3D volume in Figure 2‑29. An example of how those layers are organized in a CNN is shown in Figure 2‑32.
Figure 2‑32 Classification example using CNN (Source: (cs231n, n.d.)); a live demo can be found at [43]
As noted in Figure 2‑32, a pooling layer was inserted after every two successive pairs of Conv/ReLU layers. A pooling layer basically performs down-sampling, which reduces the dimensions of the previous layer and allows the following layer to capture feature structure at a larger scale. Pooling layers also reduce sensitivity to small movements (e.g., shifting) of feature positions in the original image. The feature maps from convolutional layers keep track of the “precise” positions of features and are therefore sensitive to the location of the features in the original image. The pooling layers, in contrast, retain relative (instead of precise) locations of the features (or the coarse structure of feature elements), which results in a lower-resolution version of the feature maps that is more robust to changes in the position of the features in the original image. This is often referred to as local translation invariance.
It should be noted that down-sampling can also be achieved by larger strides of convolutional layers. However, use of a pooling layer is more common and robust for this purpose. Two typical pooling functions are max pooling and average pooling. Figure 2‑33 shows an example of 2x2 average pooling, which outputs the average value of the pooled area. In practice, max pooling has been found to work better than average pooling for computer vision tasks.
Figure 2‑33 Illustration of 2x2 average pooling
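Both pooling functions can be sketched with one helper that slides a window over the feature map and keeps either the maximum or the average of each window. The function name, the `mode` argument, and the 4x4 feature map are illustrative assumptions.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample a 2D feature map with max or average pooling."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - size + 1, stride):
        row = []
        for j in range(0, cols - size + 1, stride):
            window = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool2d(feature_map, mode="max")        # [[7, 8], [9, 6]]
avg_pooled = pool2d(feature_map, mode="average")    # [[4.0, 5.0], [4.5, 3.0]]
```

With a 2x2 window and a stride of 2, the 4x4 map is reduced to 2x2, so each output value summarizes a region of the input rather than a precise position.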
In summary, a stack of convolutional layers can extract features at different levels. The ReLU layers add “nonlinearity” to permit learning of complex features. The pooling layers can provide better local translation invariance and capture feature structures at larger scales. Figure 2‑34 illustrates the working concept of CNN. Note that ReLU layers are not shown to reduce clutter. The use of the “peeping a leopard through a pipe” example in Figure 2‑34 is to echo a famous Chinese metaphor for a biased, narrow-minded person who only has a partial view of a big problem.
Figure 2‑34 Do you see a spot or the leopard?
In modern practice, if the training data set is large, a batch normalization (BatchNorm) layer is often used after each ReLU layer to scale the activations. Batch normalization provides an elegant way of reparametrization, which significantly reduces the problem of coordinating updates across many layers of a deep neural network [31].
It has been shown that BatchNorm makes the optimization landscape significantly smoother. This ensures, in particular, that the gradients are more predictive and thus allow for use of larger range of learning rates and faster network convergence [44]. Additionally, it makes network training less sensitive to the weight initialization.
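The core computation of batch normalization during training can be sketched for a single feature. The activations of a mini batch are shifted to zero mean and scaled to unit variance, and then rescaled and shifted by two learnable values, named `gamma` and `beta` here. The function name, the default values, and the small constant `eps` are illustrative assumptions; at inference time, implementations typically use running averages instead of the statistics of the current batch.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]       # hypothetical activations of one unit over a mini batch
normalized = batch_norm(activations)     # zero mean, variance close to 1
```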
To prevent overfitting problems, dropout was proposed as an effective regularizer [45]. Typically, dropout is conducted in a random fashion which is equivalent to sampling many “thinned” networks from the original network. The word “thinned” arises from the fact that the original network is reduced because some of its neurons are dropped in a random fashion during training. Srivastava, et al. [45] found that as a side effect of dropout, the activations of hidden units become sparse, even when no sparsity inducing regularizers are present. An example of dropout is illustrated in Figure 2‑35.
Figure 2‑35 Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped [45].
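Random thinning of a layer can be sketched as follows. The example uses "inverted" dropout, a common implementation variant in which the retained activations are rescaled during training so that no adjustment is needed at test time; this differs in bookkeeping from the formulation in [45]. The function name, the retention probability of 0.5, and the fixed random seed are illustrative assumptions.

```python
import random

def dropout(activations, keep_prob, rng):
    """Inverted dropout: drop each unit with probability 1 - keep_prob and rescale
    the survivors by 1 / keep_prob so the expected activation is unchanged."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)                   # fixed seed so the thinned network is repeatable
hidden = [1.0] * 10                      # hypothetical activations of a hidden layer
thinned = dropout(hidden, keep_prob=0.5, rng=rng)
```

Each call samples a different thinned network; every entry of `thinned` is either 0.0 (dropped) or 2.0 (kept and rescaled).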
Parameters and Hyperparameters
There are two types of parameters: model parameters and model hyperparameters. Model parameters are considered internal to the model and their values are learned from data during the training stage. Model hyperparameters are often considered external to the model and their values are normally set prior to training. More recently, trainable hyperparameters have been introduced, such as those for BatchNorm layers [46]. Typical parameters and hyperparameters by layers are summarized in Table 2‑1.
Table 2‑1 Parameters and Hyperparameters by Layers
| Layers | Parameters | Hyperparameters |
| --- | --- | --- |
| Conv | Weights and biases | Number of filters (i.e., the depth of each layer); filter size; filter stride; amount of zero padding |
| Activation (e.g., ReLU) | None | Choice of the activation function (e.g., ReLU, sigmoid, tanh, etc.) |
| Normalization (e.g., BatchNorm) | None | BatchNorm parameters [46] |
| Pooling | None | Pooling size and stride |
| FC | Weights and biases | Number of neurons |
| Dropout* | None | Probability of retaining or dropping |
*Dropout, although generally referred to as a “layer” in practice, is an effective regularization technique to combat overfitting.
Summary of Key Features
The key features of CNN can be better understood in contrast to traditional multilayer neural networks. The following features are thought to be unique to CNN.
Local connectivity: It mimics visual cortexes, which contain neurons that individually respond to small regions of the visual field. It permits location-invariant features to be extracted at lower-level layers, which can be assembled to understand complex features at larger scales.
Shared weights: It reduces the number of parameters to be learned and makes the network more scalable and easier to train.
Multiple feature maps: They allow a rich set of features to be captured at different levels and scales.
Pooling: Often referred to as down-sampling, it results in a lower-resolution version of the feature maps that is more robust to changes in the position of the features in the original image. It also captures the feature structure at a larger scale.
Training of CNN
This section describes how learning occurs. The weights that connecting neurons in different layers are parameters of a network, which are learned through the training process. Training a CNN is similar to training a traditional multilayer neural network. Backpropagation algorithm with gradient descent is typically used. At start of training, all weights and biases are initialized by certain random schemes. A series of example pairs (with the input and the known output, i.e., the label) are presented to the network. The network propagates each input through all the layers (e.g., convolutional layers, ReLU layers, pooling layers, etc.) to reach the output, which is compared with the correct answer (i.e., label). The disagreement between the network output and the label is computed based on a loss function (e.g., Root-Mean-Square, Cross-entropy, etc.). For classification, cross-entropy loss is commonly used to measure the performance of a classification model whose output is a probability. Cross entropy measures the distance between model-predicted distribution and the true distribution. As an example, let us look at a simple binary case. Suppose we have a large image data set and each image in the data set contains either a cat or a dog. We want to compare the predicted distribution with the label to see the difference. The difference here is referred to as loss in the form of cross-entropy. For this dog and cat example, we will compute cross-entropy (CE) as follows:
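$$CE = -\left[ P(\text{dog}) \log Q(\text{dog}) + P(\text{cat}) \log Q(\text{cat}) \right] \qquad (3)$$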
where, Q is the probability distribution predicted by the model and P is the actual probability distribution according to the label.
Note Eq. 3 is purposely color-coded, where Red indicates the model-predicted probability distribution and Green indicates the true distribution by the label. For example, if predicted distribution for a particular image is:
.2
And the actual image label is a “dog”, i.e., the true distribution is:
Then, the cross-entropy is computed as:
To give a sense of how the loss behaves, a few more sample calculations are shown in Table 3‑2‑2.
Table 3‑2‑2 Example calculations of cross-entropy
As shown in Table 3‑2‑2, the cross-entropy (loss) decreases as the predicted distribution gets closer to the true distribution (the label) and increases as the predicted distribution moves away from it. The goal of training is to minimize this cross-entropy loss by adjusting the weights and biases.
To generalize to multi-class (n classes) models, the cross-entropy can be written as:
$$CE = -\sum_{i=1}^{n} P_i \log Q_i \qquad (4)$$
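The cross-entropy calculation described above can be sketched in a few lines of Python. This is a minimal sketch that assumes the natural logarithm; the probability values are hypothetical and are not the ones used in the chapter's own example or table.

```python
import math

def cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """Cross-entropy between a true distribution P and a predicted distribution Q.

    Uses the natural logarithm (an assumption); eps guards against log(0).
    """
    return -sum(p * math.log(max(q, eps)) for p, q in zip(true_dist, predicted_dist))

# Hypothetical values, ordered as [dog, cat]; the label is "dog".
label = [1.0, 0.0]
good_prediction = [0.9, 0.1]
poor_prediction = [0.3, 0.7]

loss_good = cross_entropy(label, good_prediction)
loss_poor = cross_entropy(label, poor_prediction)
print(loss_good, loss_poor)
```

The prediction that puts more probability on the true class yields the smaller loss, and the same function handles any number of classes.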
A very large dataset is normally needed to train a CNN. For example, ImageNet [47] contains over 14 million images. Computing gradients over the entire training set (referred to as batch optimization) is often impractical with limited memory and computing power. Another issue with batch optimization is that it is hard to incorporate new data in an ‘online’ setting. A simple strategy to address both issues is to compute the gradient and update the parameters using only a single or a few training examples (i.e., a mini batch). Stochastic gradient descent (SGD) is a commonly used algorithm of this kind; it takes one mini batch of the training data set at a time. The “stochastic” part of the name comes from the notion that a mini batch is in fact a random sample of the training data set. Once a mini batch is processed, the total loss and gradient are computed, and the weights are updated layer by layer in a backward fashion (backpropagation). This continues until all data in the training set have been processed, which is called an epoch. The algorithm stops when the loss falls below a predetermined threshold or a maximum number of epochs is reached. The training and validation accuracy and loss curves can be plotted and compared to check for overfitting or underfitting.
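The mini-batch procedure described above can be sketched on a toy problem. The example below fits a one-variable linear model rather than a CNN, so the gradient is written out analytically instead of being obtained by backpropagation; the function name, data, learning rate, and thresholds are all hypothetical choices made for illustration.

```python
import random

def sgd_linear_fit(xs, ys, batch_size=4, learning_rate=0.05,
                   max_epochs=500, loss_threshold=1e-6, seed=0):
    """Fit y = w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    loss = float("inf")
    epochs_run = 0
    # Stop when the loss is below the threshold or the epoch limit is reached.
    while epochs_run < max_epochs and loss >= loss_threshold:
        rng.shuffle(indices)  # each mini batch is a random sample of the data
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * error * xs[i] / len(batch)
                grad_b += 2.0 * error / len(batch)
            # One parameter update per mini batch
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        epochs_run += 1  # one full pass over the training set is one epoch
        loss = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return w, b, epochs_run, loss

# Hypothetical noise-free data generated from y = 3x + 1
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b, epochs_run, final_loss = sgd_linear_fit(xs, ys)
print(w, b, epochs_run, final_loss)
```

The two stopping conditions in the loop mirror the loss threshold and the maximum number of epochs mentioned in the text.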
Unlike traditional multilayer networks, which are shallow (i.e., have a small number of layers), a CNN is often composed of many layers (i.e., a deep structure), which makes training quite challenging. Figure 2‑37 shows the architecture of the VGG16 network [48], which predicts the 1000 categories in ImageNet [47].
Figure 2‑37 VGG16 network architecture [48, 49]
This deep structure requires a long path to back-propagate gradients from the output layer all the way to the input layer. At each step backward, the gradient of the current layer is multiplied by the gradient of the previous layer according to the chain rule. If the gradients are very small (i.e., close to zero), the repeated multiplication leads to vanishing gradients, a common problem when training CNNs with traditional sigmoid or tanh activation functions. As shown in Figure 2‑38, when the input value becomes large in magnitude (either positive or negative, as indicated by the regions in red), the gradient (red curve) approaches zero.
Figure 2‑38 Activation Functions –Sigmoid and Tanh
Since the gradient determines the amount of adjustment made to the weights and biases, when gradients approach zero, little or no adjustment is made and learning effectively stops. This is one of the main reasons that ReLU activation has become so popular for training deeper modern neural networks, such as CNNs. Krizhevsky, et al. [50] demonstrated that deep CNNs with ReLUs train several times faster than their equivalents with tanh units.
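A small numerical sketch can make the vanishing-gradient effect concrete. It multiplies the local gradient of an activation function across a chain of layers, ignoring the weights entirely (a simplifying assumption); the pre-activation value of 2.0 and the depth of 16 layers are hypothetical.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(local_grad, pre_activation, num_layers):
    """Product of identical local activation gradients over num_layers (weights ignored)."""
    return local_grad(pre_activation) ** num_layers

sigmoid_chain = chained_gradient(sigmoid_grad, 2.0, 16)
relu_chain = chained_gradient(relu_grad, 2.0, 16)
print(sigmoid_chain, relu_chain)
```

Under these assumptions the sigmoid chain shrinks to a vanishingly small number, while the ReLU chain stays at 1 for positive inputs.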
Other common techniques for mitigating the vanishing gradient problem include batch normalization layers [46] and the use of residual blocks [51]. Batch normalization normalizes the input to each layer so that most values fall within the green region in Figure 2‑38. Residual blocks add skip connections that bypass a block of layers, which reduces the number of activations a gradient must pass through. Both techniques can lead to a much smoother loss landscape, which enables a broader range of learning rates and makes training significantly faster.
Transfer Learning
Transfer learning refers to the concept that knowledge gained in solving one problem can be applied to a different but related problem. Although computing resources for deep learning are becoming more accessible and affordable, training a CNN on a large dataset, such as ImageNet, can still be a daunting task. Because many images share similar features at certain levels, features learned from large image datasets can be leveraged for various computer vision tasks or applications. For example, the convolutional layers in a pretrained network can be used as a feature extractor for a new image classification problem. However, the final fully connected (FC) layer needs to be customized to match the number of classes required by the new problem. In this case, the weights and biases of the pretrained convolutional layers are frozen (i.e., they are not allowed to change during training) and only the weights and biases of the new FC layers are adjusted. As such, transfer learning significantly reduces the number of trainable parameters and the amount of data required for training.
Recent Development
This section summarizes representative work highlighting recent advancements in CNNs.
AlexNet, ZFNet, VggNet, and GoogLeNet
AlexNet [50] was the winner of the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). The architecture of AlexNet is shown in Figure 2‑39. AlexNet used a parallelization scheme to spread the network across two GTX 580 3GB GPUs, and training took about five to six days.
Figure 2‑39 An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264– 4096–4096–1000 [50].
Since AlexNet’s success, CNNs have become the most popular method for image recognition tasks. The ILSVRC 2013 winner was also a CNN, known as ZFNet [52]. VGG-Net [48] took second place in ILSVRC 2014. It uses multiple 3x3 filters in place of the 11x11 and 5x5 filters used in the first and second convolutional layers of AlexNet. This change allows the network to learn more complex features at low cost. VGG-Net demonstrated that multiple stacked filters with small kernel sizes in convolutional layers perform better than one layer with a larger kernel size.
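The cost argument for stacking small kernels can be checked with simple arithmetic. The sketch below counts weights (biases ignored) for convolutional layers that keep the channel count fixed and computes the receptive field of stacked stride-1 convolutions; the channel count of 64 is a hypothetical value, and the helper names are introduced only for this illustration.

```python
def conv_weights(kernel_size, channels, num_layers=1):
    """Weights in num_layers stacked conv layers, each mapping `channels` to `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_size, num_layers):
    """Side length of the receptive field of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

channels = 64  # hypothetical channel count
two_3x3 = conv_weights(3, channels, num_layers=2)
one_5x5 = conv_weights(5, channels)
print(stacked_receptive_field(3, 2), two_3x3, one_5x5)
```

Two stacked 3x3 layers cover the same 5x5 receptive field as a single 5x5 layer with fewer weights, and they add an extra nonlinearity between the layers.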
Continued research has focused on how to make networks deeper with low computational requirements. However, increasing depth brings two issues: increased cost in both time and memory, and gradient diffusion, which can be understood as the unavoidable loss of information during back-propagation through layers. The ReLU function was unable to solve the sparsity in earlier layers of a very deep network, meaning that the derivatives of most nodes are 0. To deal with those issues, networks with new architectures were proposed. The Inception network was originally proposed by Szegedy, et al. [53]. So far, Inception has four versions. Inception v1, also named GoogLeNet, introduced a module called Inception that creatively used 1x1 convolutional kernels for dimensionality reduction. This Inception module gave GoogLeNet better performance than VGG-Net with less computation time, and GoogLeNet won ILSVRC 2014. GoogLeNet is deeper than VGG16 but has fewer parameters.
ResNet
Because the depth of representations is of central importance for many visual recognition tasks, yet deeper neural networks are difficult to train, a residual learning framework was introduced by He, et al. [51] to ease the training of substantially deeper networks. In particular, the layers were explicitly reformulated as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions (see Figure 2‑40).
Figure 2‑40 Residual learning: a building block (He, et al., 2016)
As explained by He, et al. [51], it is unlikely that identity mappings are optimal, but the reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping than to learn the function as a new one. Their experiments showed that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning. This was later supported by Li, et al. [54], who explored the structure of neural loss functions and the effect of loss landscapes on generalization using visualization methods. As shown in Figure 2‑41, the ResNet (i.e., with skip connections) has a much smoother loss surface, which explains why ResNet is easier to train.
Figure 2‑41 The loss surfaces of ResNet-56 with/without skip connections [54]
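The residual reformulation can be sketched with plain Python lists. The block below computes F(x) + x, where F is assumed here to be two linear layers with a ReLU in between; the ReLU that the original building block applies after the addition is omitted so that the identity behavior is easy to see. All names, weights, and inputs are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def linear(weights, values):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def residual_block(x, w1, w2):
    """Return F(x) + x, where F is two linear layers with a ReLU in between."""
    residual = linear(w2, relu(linear(w1, x)))
    return [r + xi for r, xi in zip(residual, x)]  # skip connection adds the input back

x = [0.5, -1.0, 2.0]
zero = [[0.0] * 3 for _ in range(3)]
identity_output = residual_block(x, zero, zero)
print(identity_output)
```

With all-zero weights the block reduces to the identity mapping, so the layers only need to learn a small perturbation around it rather than the full mapping.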
U-net: Fully Convolutional Network
The name U-net comes from its U-shaped architecture (see Figure 2‑42). U-net is a fully convolutional network (FCN) and consists of three sections: contraction (left), bottleneck (bottom), and expansion (right). The contraction section is made of several blocks, each of which has two 3x3 convolution layers followed by 2x2 max pooling (down-sampling). As seen in Figure 2‑42, the number of feature maps doubles after each block, which permits the network to learn complex structures. The expansion section consists of the same number of blocks, in symmetry with the contraction section. Each expansion block combines the feature maps from the corresponding contraction block at the same level with a 2x2 up-sampling layer. This ensures that the features learned at different levels are used to reconstruct an image.
Unlike classification applications, where the last few layers are typically fully connected layers followed by a softmax layer that produces a probability distribution across classes, fully convolutional networks (FCNs) consist of only convolutional layers and are commonly used for semantic segmentation (i.e., predicting a class for each pixel). Although U-net was first introduced for biomedical image segmentation [55], it has since been applied widely in other domains, including autonomous vehicles, geo-sensing, etc. Some results of U-net for cell segmentation are shown in Figure 2‑43.
Figure 2‑42 U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations [55].
Figure 2‑43 Results on the ISBI cell tracking challenge. (a) part of an input image of the “PhC-U373” data set. (b) Segmentation result (cyan mask) with manual ground truth (yellow border) (c) input image of the “DIC-HeLa” data set. (d) Segmentation result (random colored masks) with manual ground truth (yellow border) [55].
R-CNN, Fast R-CNN, and Faster R-CNN
Girshick, et al. [56] showed that rich features generated by a single CNN could be used for both classification and localization tasks. It was found that most features learned in the convolutional layers are general and can be used for feature-based tasks in different domains. Some CNNs pre-trained on the large ImageNet dataset have been successfully fine-tuned for smaller networks with minimal modification. The architecture of the proposed Regions with CNN features (R-CNN) is shown in Figure 2‑44.
Figure 2‑44 R-CNN architecture [56]
Taking an image as input, R-CNN creates region proposals or bounding boxes through selective search (step 2 in Figure 2‑44). After the proposals are created, R-CNN warps each region to a standard square size (224 x 224) and passes it to a CNN, a modified version of AlexNet, to compute features (step 3 in Figure 2‑44). A Support Vector Machine (SVM) is then applied on the final layer of the CNN to classify the regions by object type (step 4 in Figure 2‑44). Once a region is classified as a certain type of object, a linear regression is performed on the coordinates of the region to output a tighter bounding box. In R-CNN, three different models are trained separately: (1) the CNN to extract image features, (2) the SVM classifier, and (3) the regression model to tighten the bounding boxes.
Even though region proposal by selective search reduces computational time compared with the sliding-window method, R-CNN is still quite slow, mainly because every single region proposal requires a forward pass of the CNN. To avoid the repeated passes of the CNN, Fast R-CNN [57] was proposed; its architecture is shown in Figure 2‑45.
Figure 2‑45 Fast R-CNN architecture [57].
Fast R-CNN uses Region of Interest (ROI) pooling, which shares the CNN features from a single forward pass of an image across its sub-regions. The features in each region are then pooled. In addition, Fast R-CNN replaced the SVM classifier with a softmax layer for the classification task and added a linear regression layer in parallel to the softmax layer to generate bounding box coordinates. By doing so, a single network can handle both classification and localization. Despite these improvements, one bottleneck remains in Fast R-CNN: the region proposal by selective search. To solve this bottleneck, Faster R-CNN was proposed [58]. Instead of executing a separate algorithm for selective search, Faster R-CNN uses the image features extracted from the forward pass of the CNN to generate region proposals. This results in nearly cost-free region proposals. As seen in Figure 2‑46, a single CNN is used for both the generation of region proposals and classification.
Figure 2‑46 Faster R-CNN is a single, unified network for object detection [58].
Mask R-CNN
Compared with R-CNN, Fast R-CNN, and Faster R-CNN, which generate bounding boxes rather than the actual shapes of objects, Mask R-CNN [59] extends Faster R-CNN to image segmentation at the pixel level. In Mask R-CNN, a convolutional backbone architecture is used for feature extraction over an image, followed by a network head for bounding-box recognition (classification and regression). A mask branch is added for mask prediction. The Mask R-CNN head architecture with the Feature Pyramid Network (FPN) backbone is shown in Figure 2‑47. Some results on the COCO (Common Objects in Context) dataset are shown in Figure 2‑48.
Figure 2‑47 Head architecture of mask R-CNN plus Faster R-CNN [59]
Figure 2‑48 Keypoint detection results and predicted segmentation masks [59]
SSD and YOLO
To balance accuracy and speed, recent work on CNNs has focused on one-stage object detection, which aims to deliver real-time inference. The two most popular architectures are Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO). These models perform detection in one shot, or one pass of the network, which considerably reduces the computational time for inference and makes them capable of real-time classification and detection.
For illustration purposes, the architecture of SSD based on the VGG-16 base network is shown in Figure 2‑49. The SSD network combines predictions from multiple feature maps of different resolutions to handle objects of various sizes. It eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network, providing a unified framework for both training and inference [60].
Figure 2‑49 Single shot multibox detector. (Liu, et al., 2016)
YOLO was first introduced by Redmon, et al. [61], followed by YOLOv2 and YOLO9000 [62]. YOLO9000 can detect over 9000 object categories in real time. More recently, an incremental improvement was made, resulting in YOLOv3, which uses Darknet-53, consisting of 53 convolutional layers with some skip connections, as its feature extractor [63].
RetinaNet
Lin, et al. [64] discovered that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause of the inferior performance of single-stage object detection models (such as YOLO and SSD) compared with two-stage approaches (e.g., R-CNN). To address this class imbalance, they proposed a new loss function (Focal Loss) that reshapes the standard cross-entropy loss so that it down-weights the loss assigned to well-classified examples, focuses training on a sparse set of hard examples, and prevents the vast number of easy negatives from overwhelming the detector during training. For demonstration, a simple dense detector, RetinaNet, was designed and trained with the Focal Loss. The experiments showed that RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. The architecture of RetinaNet is shown in Figure 2‑50.
Figure 2‑50 The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture [64].
RetinaNet is composed of an FPN backbone network and two subnetworks. The FPN sits on top of a ResNet to generate a rich multi-scale convolutional feature pyramid. Two subnetworks are attached to this backbone, one for classification and one for localization.
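The down-weighting behavior of the Focal Loss mentioned above can be sketched for a single example. The sketch uses the commonly cited form FL(p_t) = -alpha (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class; the default values of gamma and alpha and the sample probabilities below are illustrative assumptions, not settings taken from this chapter.

```python
import math

def focal_loss(p_true_class, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t)."""
    return -alpha * (1.0 - p_true_class) ** gamma * math.log(p_true_class)

easy, hard = 0.95, 0.10   # hypothetical well-classified and hard examples

ce_easy = focal_loss(easy, gamma=0.0)  # gamma = 0 recovers standard cross-entropy
fl_easy = focal_loss(easy)
ce_hard = focal_loss(hard, gamma=0.0)
fl_hard = focal_loss(hard)
print(fl_easy / ce_easy, fl_hard / ce_hard)
```

The easy example keeps only a small fraction of its cross-entropy loss, while the hard example keeps most of it, so hard examples dominate the total loss.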
MobileNets
A class of efficient models named MobileNets was proposed by Howard, et al. [65]. MobileNets are often considered lightweight deep neural networks. They aim to reduce computational cost and make CNNs feasible on mobile devices by using depth-wise separable convolutions. Two simple global hyperparameters were introduced to efficiently trade off between latency and accuracy.
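The saving from depth-wise separable convolutions can be estimated by counting multiply-accumulate operations. The sketch below compares a standard convolution with a depth-wise convolution followed by a 1x1 point-wise convolution; the layer dimensions are hypothetical and the function names are introduced only for this illustration.

```python
def standard_conv_cost(kernel, in_ch, out_ch, fmap):
    """Multiply-accumulates for a standard convolution on an fmap x fmap output."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap

def separable_conv_cost(kernel, in_ch, out_ch, fmap):
    """Depth-wise convolution (one filter per input channel) plus 1x1 point-wise convolution."""
    depthwise = kernel * kernel * in_ch * fmap * fmap
    pointwise = in_ch * out_ch * fmap * fmap
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map
kernel, in_ch, out_ch, fmap = 3, 64, 128, 56
ratio = separable_conv_cost(kernel, in_ch, out_ch, fmap) / standard_conv_cost(kernel, in_ch, out_ch, fmap)
print(ratio)
```

The ratio equals 1/out_ch + 1/kernel^2, so with 3x3 kernels the separable form needs roughly one eighth to one ninth of the computation of a standard convolution.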
Deformable Convolution Networks
A key challenge in visual recognition is to accommodate geometric variations arising from object scale, pose, viewpoint, and part deformation. The CNNs discussed so far lack internal mechanisms to handle geometric transformations because they use fixed geometric structures, i.e., fixed sampling locations in convolution units, fixed-ratio pooling, fixed spatial bins for RoI pooling, etc. To address this issue, Dai, et al. [39] introduced two new modules to enhance the capability of CNNs to model geometric transformations.
Figure 2‑51 Illustration of 3 × 3 deformable convolution [39].
The first is deformable convolution, which enables free-form deformation of the sampling grid by adding 2D offsets to the regular grid sampling locations in the standard convolution (Figure 2‑51). The second is deformable Region of Interest (RoI) pooling, which adds an offset to each bin position in the regular bin partition of the previous RoI pooling (Figure 2‑52). In both modules, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes. Extensive experiments showed that learning dense spatial transformations in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation.
Figure 2‑52 Illustration of 3 × 3 deformable RoI pooling [39].
CenterNet
More recently, CenterNet has been introduced [66], which models an object as a single point, i.e., the center point of its bounding box. It uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and pose. CenterNet is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box-based detectors. Some of the results are shown in Figure 2‑53 and Figure 2‑54.
Figure 2‑53 An object is modeled as the center point of its bounding box. The bounding box size and other object properties are inferred from the keypoint feature at the center [66].
Figure 2‑54 Qualitative results. All images were picked thematically without considering the algorithm's performance. First row: object detection on COCO validation. Second and third rows: human pose estimation on COCO validation. For each pair, the results of center offset regression (left) and heatmap matching (right) are shown. Fourth and fifth rows: 3D bounding box estimation on KITTI validation. The authors show the projected bounding box (left) and bird's-eye view map (right). The ground truth detections are shown as solid red boxes. The center heatmap and 3D boxes are shown overlaid on the original image [66].
Exemplar Applications in Transportation
Given the remarkable capability of CNNs for feature extraction and visual recognition, they have been widely applied in the transportation field. The most popular applications are autonomous or self-driving vehicles that use vision sensors to sense their surroundings (fixed objects, traffic signals, signs, pavement markings, other vehicles, motorcycles, bicyclists, pedestrians, etc.) in order to control themselves and avoid potential collisions (e.g., Tesla, Mobileye). Other applications include, but are not limited to, traffic condition monitoring (e.g., congestion, incidents), truck taxonomy, etc.
YOLO models are among the most popular network architectures for transportation applications owing to their relatively high accuracy and real-time performance. For example, Sharma, et al. [67] used a YOLOv3 model for vehicle detection (Figure 2‑55) and, based on the detections, applied the SORT algorithm to track each vehicle (Figure 2‑56). In another study, Sharma, et al. [68] used both a traditional deep convolutional neural network and YOLO models to detect traffic congestion from camera images; the models performed well in nighttime and poor weather conditions. The study also explored a solution to the incident detection problem using semi-supervised techniques based on trajectory classification.
Figure 2‑55 Detection results using YOLOv3 [67].
Figure 2‑56 Tracking results [67].
In a different application, He, et al. [69] used a YOLO object detector to detect potential trucks in image frames (Figure 2‑57), followed by the popular DeepLabV2 model (which introduces multiple parallel atrous convolutional layers [40]) to estimate the vehicle shape. A decision tree classifier was then used to classify tractors and trailers.
Figure 2‑57 truck detection using YOLO [69]
Besides YOLO detectors, Yang, et al. [70] used the Inception v2 architecture to detect vehicles (Figure 2‑58) and then tracked them using an IoU-based method together with smoothing techniques (Figure 2‑59). The extracted trajectories were further analyzed to identify potential conflict events (Figure 2‑60).
Figure 2‑58 Detection and classification of vehicles from a traffic camera [70].
Figure 2‑59 Vehicle Tracking by IoU followed by a structured smoothing algorithm [70].
Figure 2‑60 Detection and Quantification of Live Traffic Conflict Events from a Traffic Camera. An example of rear-end conflict on the northbound approach of the intersection. Left: live camera view; Center: projected trajectories in the plan view; Right: quantified conflict event [70].
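To illustrate the intersection-over-union (IoU) measure that underlies IoU-based tracking, the sketch below computes the IoU of two axis-aligned boxes and uses it to associate a box from the previous frame with the best-overlapping detection in the current frame. This is a simplified, hypothetical association rule and not the method of [70]; the box coordinates and the 0.3 threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def match_detection(previous_box, candidates, threshold=0.3):
    """Index of the candidate with the highest IoU, or None if none reaches the threshold."""
    best_index, best_score = None, threshold
    for index, box in enumerate(candidates):
        score = iou(previous_box, box)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index

previous_box = (0.0, 0.0, 2.0, 2.0)
detections = [(5.0, 5.0, 7.0, 7.0), (0.5, 0.0, 2.5, 2.0)]
matched = match_detection(previous_box, detections)
print(matched)
```

A vehicle that moves only slightly between frames overlaps strongly with its previous box, so the detection with the highest IoU is taken as the same vehicle.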
CNNs have also been used for learning and predicting network-level traffic conditions. For example, Ma, et al. [71] proposed a CNN-based method that learns traffic as images and predicts large-scale, network-wide traffic speed with high accuracy. First, spatiotemporal traffic dynamics are converted to images describing the time and space relations of traffic flow via a two-dimensional time-space matrix (Figure 2‑61). Then, a CNN is applied to the image in two consecutive steps: abstract traffic feature extraction and network-wide traffic speed prediction.
Figure 2‑61 Illustration of traffic-to-image conversion on a network [71].
In addition to the traffic applications described previously, another popular area for CNN applications is pavement condition assessment using pavement surface images ([72], [73], [74], etc.). For example, Li & Zhao [74] designed a CNN by modifying AlexNet and then trained and validated it using a purpose-built database of 60,000 images. Figure 2‑62 shows the flow chart for developing a CNN for crack detection.
Figure 2‑62 flow chart of developing a CNN for cracks detection [74].
Dabiri & Heaslip [75] used CNN architectures to predict travel modes based only on raw GPS trajectories, where the modes are labeled as walk, bike, bus, driving, and train. Four fundamental motion characteristics of a moving object (speed, acceleration, jerk, and bearing rate) represent four channels in the input layer. Each channel has the shape (1×M), where M is the segment length (i.e., the number of GPS points that form the segment). The input structure is shown in Figure 2‑63. A variety of CNN configurations were evaluated, and the highest accuracy of 84.8% was achieved through an ensemble of the best CNN configuration.
Figure 2‑63 The four channel structure for a GPS segment [75]
Author(s): Samiul Hasan (Ph.D.) and Rezaur Rahman, University of Central Florida
Introduction
Recurrent neural networks (RNNs) are a class of neural networks that store relevant parts of the input variables and use this information to predict future outputs. RNNs repeatedly perform the same computational operation on every element of a sequence, and each output is calculated based on the previous computations (Figure 2‑64). This computational process captures the interdependency of sequential time series data when estimating the outputs, which is why RNNs process sequential data very well [76].
RNN Cell
Figure 2‑64 Complete Structure of RNN Cell
As shown in Figure 2‑65, an RNN can be considered as a chain of repeating modules. In standard RNNs (Figure 2‑64), this repeating module has a very simple structure, such as a single tanh layer. The hidden state, or memory cell, of this structure preserves information from the previous input variables. At time step $t$, the memory cell's current state $h(t)$ is a function of the input vector at the current time step, $x(t)$, and the hidden state at the previous time step, $h(t-1)$, so $h(t) = f(h(t-1), x(t))$. Its output at time step $t$, denoted by $y(t)$, is also a function of the previous state and the current input (Figure 2‑64). For standard RNN cells, the output and the hidden state at a given time step are the same.
Figure 2‑65 A Recurrent Neural Network Unrolled through Time [77]
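The recurrence described above, in which the hidden state depends on the current input and the previous hidden state, can be sketched as follows. The sketch assumes a single tanh layer of the form h(t) = tanh(W_x x(t) + W_h h(t-1) + b); the weight names, weight values, and the input sequence are hypothetical.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a standard RNN cell: h(t) = tanh(W_x x(t) + W_h h(t-1) + b)."""
    h_t = []
    for j in range(len(b)):
        total = b[j]
        total += sum(w_x[j][k] * x_t[k] for k in range(len(x_t)))
        total += sum(w_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(sequence, w_x, w_h, b):
    """Apply the same cell to every element of the sequence, carrying the hidden state."""
    h = [0.0] * len(b)
    outputs = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        outputs.append(h)  # for a standard cell, the output y(t) equals h(t)
    return outputs

# Hypothetical cell with 1 input and 2 hidden units
w_x = [[0.5], [-0.3]]
w_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.0]
outputs = run_rnn([[1.0], [0.0], [0.0]], w_x, w_h, b)
print(outputs)
```

Although the second and third inputs are zero, the outputs at those steps are nonzero because the hidden state carries information forward from the first input.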
Although RNNs can better capture nonlinearity in time series problems, they are weak at learning long-term dependencies because gradients vanish during the backpropagation process ([78, 79]). Moreover, traditional RNNs learn a time series sequence based on a predetermined time lag, and it is difficult to find an optimal time window size automatically ([78], [80]).
Long Short-Term Memory Neural Network
To overcome the disadvantages of RNNs, Hochreiter and Schmidhuber proposed the architecture of Long Short-Term Memory Neural Network (LSTM-NN) and an appropriate gradient-based algorithm to solve it [79]. The primary objectives of LSTM-NN are to capture long-term dependencies and determine the optimal time lag for time-series problems.
In an LSTM, the cell state (hidden state) is divided into two states: a short-term state (similar to that of an RNN) and a long-term state (Figure 2‑66). The long-term state stores information that captures the long-term dependencies between the current hidden state and previous hidden states over time. Traversing from left to right, the long-term state passes through a forget gate, where it drops some memories, and then adds some new memories via an addition operation (Figure 2‑66 and Figure 2‑67).
LSTM Cell
Figure 2‑66 Complete Structure of LSTM Cell [77]
Figure 2‑67 Long Short-Term Memory Neural Network Unrolled Over Time [81]
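As shown in Figure 2‑66, a fully connected LSTM cell contains four layers (sigmoid and tanh). The input vector $x(t)$ and the previous short-term state $h(t-1)$ are fed into these layers. The main layer uses a tanh activation function and outputs $g(t)$. The output from this layer is partially stored in the long-term state $c(t)$. The other three layers are gate controllers that use the logistic activation function, so their outputs range from 0 to 1. The forget gate $f(t)$ controls which parts of the long-term state should be erased. The input gate $i(t)$ decides which parts of the cell input $g(t)$ should be added to the long-term state. The output gate $o(t)$ finally controls which parts of the long-term state should be read and output at this time step as $y(t)$ ($= h(t)$). The equations for these operations can be written as follows:
Input gate: $i(t) = \sigma\left(W_{xi}\, x(t) + W_{hi}\, h(t-1) + b_i\right)$ (1)
Forget gate: $f(t) = \sigma\left(W_{xf}\, x(t) + W_{hf}\, h(t-1) + b_f\right)$ (2)
Output gate: $o(t) = \sigma\left(W_{xo}\, x(t) + W_{ho}\, h(t-1) + b_o\right)$ (3)
Cell input: $g(t) = \tanh\left(W_{xg}\, x(t) + W_{hg}\, h(t-1) + b_g\right)$ (4)
where $W_{xi}, W_{xf}, W_{xo}, W_{xg}$ are the weight matrices of each of the four layers for their connection to the input vector $x(t)$; $W_{hi}, W_{hf}, W_{ho}, W_{hg}$ are the weight matrices of each of the four layers for their connection to the short-term state $h(t-1)$; $b_i, b_f, b_o, b_g$ are the bias terms for each of the four layers; $\sigma$ represents the sigmoid function, $\sigma(z) = 1/(1 + e^{-z})$; and $\tanh$ represents the hyperbolic tangent function, $\tanh(z) = (e^{z} - e^{-z})/(e^{z} + e^{-z})$. Finally, the long-term and short-term states are calculated using the following equations, where $\otimes$ denotes element-wise multiplication:
Long-term state: $c(t) = f(t) \otimes c(t-1) + i(t) \otimes g(t)$ (5)
Short-term state: $y(t) = h(t) = o(t) \otimes \tanh\left(c(t)\right)$ (6)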
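The gate operations of an LSTM cell can be sketched in plain Python. The sketch assumes the standard formulation in which the input, forget, and output gates are sigmoid layers, the cell input is a tanh layer, the long-term state is updated as c(t) = f(t) * c(t-1) + i(t) * g(t), and the short-term state is h(t) = o(t) * tanh(c(t)). The parameter layout (a dictionary keyed by 'i', 'f', 'o', 'g') and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w_x, w_h, b, x_t, h_prev):
    """Compute W_x x(t) + W_h h(t-1) + b for one layer."""
    return [
        b[j]
        + sum(w * v for w, v in zip(w_x[j], x_t))
        + sum(w * v for w, v in zip(w_h[j], h_prev))
        for j in range(len(b))
    ]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps each of 'i', 'f', 'o', 'g' to a tuple (W_x, W_h, b)."""
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]    # input gate
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]    # forget gate
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]    # output gate
    g_t = [math.tanh(z) for z in affine(*params["g"], x_t, h_prev)]  # cell input
    # Long-term state: forget part of the old memory, then add new memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Short-term state (also the output): gated read of the long-term state
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Hypothetical cell with 1 input and 1 hidden unit; all weights and biases are zero,
# so every gate equals 0.5 and the cell input equals 0.
layer = ([[0.0]], [[0.0]], [0.0])
params = {name: layer for name in "ifog"}
h_t, c_t = lstm_step([1.0], [0.0], [2.0], params)
print(h_t, c_t)
```

With zero weights, the forget gate halves the previous long-term state and nothing new is added, which makes each gate's role easy to verify by hand.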
Application in transportation
Traffic forecasting involves predicting the future traffic state based on historical traffic data. Accurate traffic forecasting is one of the major challenges in transportation. In general, the complexity of traffic forecasting arises from sharp nonlinearities caused by transitions among free flow, breakdown, recovery, and congestion [82]. To deal with such discontinuities, researchers are relying on deep learning models, particularly LSTM models. LSTM models have created a new opportunity to capture the sharp discontinuities present in traffic flows by combining nonlinear functions (tanh, sigmoid, etc.) with gating operations that store valuable information over long data sequences. Hence, applications of LSTM models in transportation have made it possible to deal with more complex problems and big data challenges.
Recently, LSTM and its variants have been used in several short-term traffic state prediction problems, such as traffic speed prediction ([83], [84], [80]), travel time prediction [85], traffic flow prediction ([86], [82], [87]), and vehicular queue length prediction ([88], [89]). In short-term traffic prediction problems, the model learns to predict traffic variation over a short period of time.
Figure ‑ Traffic speed Prediction [81]
Figure ‑ Vehicle queue length prediction [89]
In addition to regular traffic conditions, studies have also explored the application of LSTM to traffic management during major events such as hurricane evacuations [81] and traffic incidents (accidents) [90]. During such events, traffic forecasting becomes more challenging because of irregular traffic flow variation induced by a sudden surge in traffic demand, so a more robust model is needed. LSTM models have nevertheless predicted traffic speed in such conditions with reasonable accuracy. Such applications have potential benefits in incident management, such as evacuation traffic routing.
The applications of LSTM in transportation are not limited to traffic state prediction. Other studies have utilized the model in real-time crash prediction [91] and crash severity analysis [92]. In real-time crash prediction, the model learns to predict the crash risk for a given location based on traffic state variation (speed, volume, occupancy) over a short interval of time. In most cases, LSTM performs better than traditional models in crash risk prediction. State-of-the-art crash prediction methods have gone beyond zone-wise crash risk analysis; researchers are exploiting LSTM models for network-level real-time crash risk analysis [93]. These methods will help traffic and safety engineers develop proactive measures to reduce the number of crashes.
On a final note for traffic forecasting, emerging methods are those that use a combination of different approaches to build a more effective model including the mixture of CNNs and LSTMs for short term traffic prediction [94]. These models have been shown to be able to utilize the capabilities of different methods simultaneously to explore and learn different spatiotemporal features in the training data to achieve more accurate results with better efficiency.
REFERENCES
1. Hastie, T., R. Tibshirani, and J.H. Friedman, The elements of statistical learning: data mining, inference, and prediction. 2nd ed. Springer series in statistics. 2009, New York, NY: Springer. xxii, 745 p.
2. Bishop, C.M., Pattern recognition and machine learning. 2006, New York: Springer.
3. Schapire, R.E. and Y. Freund, Boosting: Foundations and Algorithms. 2012, Cambridge, MA: MIT Press.
4. Zhang, Y. and Y. Xie, Travel mode choice modeling with support vector machines. Transportation Research Record: Journal of the Transportation Research Board, 2008(2076): p. 141-150.
5. Awad, M. and R. Khanna, Support Vector Regression, in Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System Designers, M. Awad and R. Khanna, Editors. 2015, Apress: Berkeley, CA. p. 67-80.
6. Ben-Hur, A., et al., Support Vector Clustering. Journal of Machine Learning Research, 2001. 2: p. 125-137.
7. Yuan, F. and R.L. Cheu, Incident detection using support vector machines. Transportation Research Part C-Emerging Technologies, 2003. 11(3-4): p. 309-328.
8. Zhang, Y.L. and Y.C. Xie, Forecasting of short-term freeway volume with v-support vector machines. Transportation Research Record, 2007(2024): p. 92-99.
9. Jahangiri, A. and H.A. Rakha, Applying Machine Learning Techniques to Transportation Mode Recognition Using Mobile Phone Sensor Data. IEEE Transactions on Intelligent Transportation Systems, 2015. 16(5): p. 2406-2417.
10. Cetin, M., I. Ustun, and O. Sahin. Classification Algorithms for Detecting Vehicle Stops from Smartphone Accelerometer Data. in The 95th Annual meeting of the Transportation Research Board. 2016. Washington DC: Transportation Research Board.
11. Yang, B., et al. A Data Imputation Method with Support Vector Machines for Activity-Based Transportation Models. in Foundations of Intelligent Systems. 2012. Berlin, Heidelberg: Springer Berlin Heidelberg.
12. Li, X., et al., Predicting motor vehicle crashes using Support Vector Machine models. Accident Analysis and Prevention, 2008. 40(4): p. 1611-1618.
13. Theofilatos, A., C. Chen, and C. Antoniou, Comparing Machine Learning and Deep Learning Methods for Real-Time Crash Prediction. Transportation Research Record, 2019. 2673(8): p. 169-178.
14. Zhang, Y. and Y. Xie, Travel Mode Choice Modeling with Support Vector Machines. Transportation Research Record, 2008. 2076(1): p. 141-150.
15. Principe, J.C., N.R. Euliano, and W.C. Lefebvre, Neural and adaptive systems: fundamentals through simulations. Vol. 672. 2000: John Wiley and Sons, Inc., New York.
16. Duda, R.O., P.E. Hart, and D.G. Stork, Pattern classification. 2001: John Wiley & Sons, Inc., New York.
17. Ham, F.M. and I. Kostanic, Principles of neurocomputing for science and engineering. 2001: McGraw-Hill Higher Education.
18. Pal, S.K., T.S. Dillon, and D.S. Yeung, Soft computing in case based reasoning. 2001: Springer-Verlag London Limited, Great Britain. 1-28.
19. Jang, J.-S.R., C.-T. Sun, and E. Mizutani, Neuro-fuzzy and soft computing-a computational approach to learning and machine intelligence. 1997: Prentice-Hall, N.J.
20. Ablameyko, S., et al., Neural Networks for Instrumentation, Measurement, and Related Industrial Applications. Series III: Computer and Systems Sciences. Vol. 185. 2003: IOS Press.
21. Lefebvre, C., Neuro Solutions, Version 4.10. NeuroDimension, Inc., Gainesville, FL, 2001.
22. Hopfield, J.J., Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 1982. 79(8): p. 2554-2558.
23. Rumelhart, D.E., G.E. Hinton, and R.J. Williams, Learning representations by back-propagating errors. Cognitive modeling, 1988. 5(3): p. 1.
24. Jordan, M., Serial order: a parallel distributed processing approach. Technical report, June 1985-March 1986. 1986, California Univ., San Diego, La Jolla (USA). Inst. for Cognitive Science.
25. Elman, J.L., Finding structure in time. Cognitive science, 1990. 14(2): p. 179-211.
26. Dorffner, G. Neural networks for time series processing. in Neural network world. 1996. Citeseer.
27. Malhotra, P., et al. Long short term memory networks for anomaly detection in time series. in Proceedings. 2015. Presses universitaires de Louvain.
28. Lipton, Z.C., J. Berkowitz, and C. Elkan, A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
29. Werbos, P.J., Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 1990. 78(10): p. 1550-1560.
30. Haykin, S., Neural networks: a comprehensive foundation. 1994: Macmillan College Publishing Company, Inc., Prentice Hall PTR.
32. Nielsen, M.A., Neural networks and deep learning. Vol. 25. 2015: Determination Press, San Francisco, CA, USA.
33. Hubel, D.H. and T.N. Wiesel, Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 1968. 195(1): p. 215-243.
34. Abdoli, S., P. Cardinal, and A.L. Koerich, End-to-end environmental sound classification using a 1d convolutional neural network. Expert Systems with Applications, 2019. 136: p. 252-263.
35. Zeiler, M.D., G.W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. in 2011 International Conference on Computer Vision. 2011. IEEE.
36. Sobel, I. and G. Feldman, An isotropic 3×3 image gradient operator. Machine vision for three-dimensional scenes, 1990: p. 376-379.
39. Dai, J., et al. Deformable convolutional networks. in Proceedings of the IEEE international conference on computer vision. 2017. arXiv preprint arXiv:1703.06211.
40. Chen, L.-C., et al., Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 2017. 40(4): p. 834-848.
41. He, K., et al. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. in Proceedings of the IEEE international conference on computer vision. 2015.
42. Clevert, D.-A., T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
43. Stanford University, n.d. Convolutional Neural Networks for Visual Recognition. [Online]
44. Santurkar, S., et al. How does batch normalization help optimization? in Advances in Neural Information Processing Systems. 2018.
45. Srivastava, N., et al., Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 2014. 15(1): p. 1929-1958.
46. Ioffe, S. and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
48. Simonyan, K. and A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
49. Neurohive, n.d. VGG16 – Convolutional Network for Classification and Detection. [Online]
50. Krizhevsky, A., I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. in Advances in neural information processing systems. 2012.
51. He, K., et al. Deep residual learning for image recognition. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
52. Zeiler, M.D. and R. Fergus. Visualizing and understanding convolutional networks. in European conference on computer vision. 2014. Springer.
53. Szegedy, C., et al. Going deeper with convolutions. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
54. Li, H., et al. Visualizing the loss landscape of neural nets. in Advances in Neural Information Processing Systems. 2018.
55. Ronneberger, O., P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. in International Conference on Medical image computing and computer-assisted intervention. 2015. Springer.
56. Girshick, R., et al. Rich feature hierarchies for accurate object detection and semantic segmentation. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
57. Girshick, R. Fast R-CNN. in Proceedings of the IEEE international conference on computer vision. 2015.
58. Ren, S., et al. Faster R-CNN: Towards real-time object detection with region proposal networks. in Advances in neural information processing systems. 2015.
59. He, K., et al. Mask R-CNN. in Proceedings of the IEEE international conference on computer vision. 2017.
60. Liu, W., et al. SSD: Single shot multibox detector. in European conference on computer vision. 2016. Springer.
61. Redmon, J., et al. You only look once: Unified, real-time object detection. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
62. Redmon, J. and A. Farhadi. YOLO9000: better, faster, stronger. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
63. Redmon, J. and A. Farhadi, Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
64. Lin, T.-Y., et al. Focal loss for dense object detection. in Proceedings of the IEEE international conference on computer vision. 2017.
65. Howard, A.G., et al., Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
66. Zhou, X., D. Wang, and P. Krähenbühl, Objects as points. arXiv preprint arXiv:1904.07850, 2019.
67. Sharma, A., et al., Portable Multi-Sensor System for Intersection Safety Performance Assessment. 2018, Ames: Iowa Department of Transportation.
68. Sharma, A., Chakraborty, P., Hawkins, N. & Knickerbocker, S., Automating Near Miss Crash Detection using Existing Traffic Cameras. 2019, Ames: Iowa Department of Transportation.
69. He, P., et al., Truck Taxonomy and Classification Using Video and Weigh-In Motion (WIM) Technology. 2019, Florida Department of Transportation: Tallahassee.
70. Yang, J.J., Y. Wang, and C.-C. Hung, Monitoring and Assessing Traffic Safety at Signalized Intersections Using Live Video Images. 2018, Atlanta: Georgia Department of Transportation.
71. Ma, X., et al., Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors, 2017. 17(4): p. 818.
72. Wang, K., et al. Deep learning for asphalt pavement cracking recognition using convolutional neural network. in Proc. Int. Conf. Airfield Highway Pavements. 2017.
73. Fan, Z., et al., Automatic pavement crack detection based on structured prediction with the convolutional neural network. arXiv preprint arXiv:1802.02208, 2018.
74. Li, S. and X. Zhao, Image-Based Concrete Crack Detection Using Convolutional Neural Network and Exhaustive Search Technique. Advances in Civil Engineering, 2019. 2019: p. 1-12.
75. Dabiri, S. and K. Heaslip, Inferring transportation modes from GPS trajectories using a convolutional neural network. Transportation research part C: emerging technologies, 2018. 86: p. 360-371.
76. Xu, J., et al., Real-time prediction of taxi demand using recurrent neural networks. IEEE Transactions on Intelligent Transportation Systems, 2017. 19(8): p. 2572-2581.
77. Rahman, R., Applications of Deep Learning Models for Traffic Prediction Problems, in University of Central Florida. 2019.
78. Gers, F.A., Schmidhuber, J., Cummins, F., Learning to forget: continual prediction with LSTM, in 1999 Ninth International Conference on Artificial Neural Networks ICANN 99. (Conf. Publ. No. 470). 1999. p. 1-19.
79. Hochreiter, S. and J. Schmidhuber, Long short-term memory. Neural computation, 1997. 9(8): p. 1735-1780.
80. Ma, X., et al., Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies, 2015. 54: p. 187-197.
81. Rahman, R. and S. Hasan. Short-Term Traffic Speed Prediction for Freeways During Hurricane Evacuation: A Deep Learning Approach. in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). 2018. IEEE.
82. Polson, N.G. and V.O. Sokolov, Deep learning for short-term traffic flow prediction. Transportation Research Part C: Emerging Technologies, 2017. 79: p. 1-17.
83. Cui, Z., et al., Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143, 2018: p. 22-25.
84. Epelbaum, T., et al., Deep Learning applied to Road Traffic Speed forecasting. arXiv preprint arXiv:1710.08266, 2017.
85. Duan, Y., Y. Lv, and F.-Y. Wang. Travel time prediction with LSTM neural network. in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). 2016. IEEE.
86. Luo, X., et al., Spatiotemporal traffic flow prediction with KNN and LSTM. Journal of Advanced Transportation, 2019. 2019.
87. Yang, B., et al., Traffic flow prediction using LSTM with feature enhancement. Neurocomputing, 2019. 332: p. 320-327.
88. Lee, S., et al., An advanced deep learning approach to real-time estimation of lane-based queue lengths at a signalized junction. Transportation research part C: emerging technologies, 2019. 109: p. 117-136.
89. Rahman, R. and S. Hasan, Real-time Signal Queue Length Prediction Using Long Short-Term Memory Neural Network, in Transportation Research Board 98th Annual Meeting. 2019, Transportation Research Board.
90. Yu, R., et al. Deep learning: A generic approach for extreme condition traffic forecasting. in Proceedings of the 2017 SIAM international Conference on Data Mining. 2017. SIAM.
91. Yuan, J., et al., Real-time crash risk prediction using long short-term memory recurrent neural network. Transportation research record, 2019. 2673(4): p. 314-326.
92. Sameen, M.I. and B. Pradhan, Severity prediction of traffic accidents with recurrent neural networks. Applied Sciences (Switzerland), 2017. 7(6): p. 476.
93. Ren, H., et al. A deep learning approach to the citywide traffic accident risk prediction. in 21st International Conference on Intelligent Transportation Systems (ITSC). 2018. IEEE.
94. Do, L.N.N., N. Taherifar, and H.L. Vu, Survey of neural network-based models for short-term traffic state prediction. WIREs Data Mining and Knowledge Discovery, 2019. 9(1): p. e1285.
While there are many boosting options for both regression and classification, Figure 2‑4 illustrates the operation of one of the most common boosting algorithms, known as AdaBoost (from “adaptive boosting” and pronounced accordingly [2]). At each stage $m$ we estimate our classifier $y_m(\mathbf{x})$ using a set of weights $w_i^{(m)}$ for the $n$ data points. From these we develop a set of weights $\alpha_m$ for the prediction of each classifier. The final prediction is based on a weighted sum of the outputs from each classifier using the $\alpha_m$ as weights.
The AdaBoost algorithm works as follows. We start out with a set of data of independent variables $\mathbf{x}_1, \ldots, \mathbf{x}_n$ and a set of target values (i.e., classifications) $t_1, \ldots, t_n$, where $t_i \in \{-1, +1\}$.
Initialize the weighting coefficients by setting $w_i^{(1)} = 1/n$ for $i = 1, \ldots, n$.
For $m = 1, \ldots, M$:
Fit a classifier $y_m(\mathbf{x})$ to the training data by minimizing the weighted error function
$$J_m = \sum_{i=1}^{n} w_i^{(m)} \, I\big(y_m(\mathbf{x}_i) \neq t_i\big)$$
where $I(y_m(\mathbf{x}_i) \neq t_i)$ is an indicator function that equals 1 when $y_m(\mathbf{x}_i) \neq t_i$ and 0 otherwise.
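The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes labels coded as -1/+1, one-dimensional inputs, and decision stumps (single-threshold rules) as the weak classifiers. The steps after the weighted fit (the weighted error rate `eps`, the classifier weight `alpha = ln((1 - eps) / eps)`, the re-weighting of misclassified points, and the sign of the weighted vote) follow the standard AdaBoost formulation in [2]. All function names and the toy data are hypothetical.

```python
import math

def stump_predict(x, thr, polarity):
    """Decision stump: returns `polarity` if x > thr, otherwise -polarity."""
    return polarity if x > thr else -polarity

def fit_stump(xs, ts, w):
    """Return (weighted error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    # Candidate thresholds: one below the minimum, then neighbor midpoints.
    thresholds = [values[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for polarity in (1, -1):
            err = sum(wi for wi, x, t in zip(w, xs, ts)
                      if stump_predict(x, thr, polarity) != t)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(xs, ts, n_rounds):
    """Train an ensemble of weighted stumps; returns [(alpha, thr, polarity)]."""
    n = len(xs)
    w = [1.0 / n] * n                        # uniform initial weights
    ensemble = []
    for _ in range(n_rounds):
        err, thr, pol = fit_stump(xs, ts, w)
        eps = max(err / sum(w), 1e-10)       # weighted error rate, kept > 0
        if eps >= 0.5:                       # no better than chance: stop
            break
        alpha = math.log((1 - eps) / eps)    # weight of this classifier
        ensemble.append((alpha, thr, pol))
        if err == 0:                         # perfect stump: nothing to re-weight
            break
        # Increase the weights of the misclassified points only.
        w = [wi * math.exp(alpha) if stump_predict(x, thr, pol) != t else wi
             for wi, x, t in zip(w, xs, ts)]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, thr, pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ts = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ts, n_rounds=3)
predictions = [predict(model, x) for x in xs]
```

On this toy set a single stump misclassifies at least two points, while the weighted vote of three stumps reproduces all six labels, which illustrates how re-weighting steers later classifiers toward the points earlier ones got wrong.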
31. Goodfellow, I., Y. Bengio, and A. Courville, Deep learning. 2016: MIT Press.
37. Wikipedia, n.d. Convolution. [Online], Available at: .
38. cs231n, n.d. Convolutional Neural Networks for Visual Recognition. [Online], Available at:
47. Stanford Vision Lab, n.d. ImageNet. [Online], Available at: .