Machine Learning versus Statistics
Data analysis and modeling are arguably the most important research tools in the field of transportation. Two schools of thought dominate transportation data analysis: statistical models and machine learning. Statistical models have traditionally been used to analyze transportation data, while machine learning techniques have recently gained considerable attention from transportation researchers and practitioners.
Statistical techniques comprise the mathematics of collecting, cleaning, organizing, modeling, and interpreting data or a sample of data. Machine learning techniques are algorithms that combine learning, adaptation, and evaluation, and, more generally, impose structure on unstructured and complex data. This section describes some of the differences and similarities between machine learning and traditional statistical models.
Modeling Framework
Machine learning techniques are known for their ability to work with multidimensional, complex, and non-linear data containing massive numbers of observations and variables (i.e., big data). Statistical models, on the other hand, rest on solid mathematical frameworks that make them capable of discovering relationships between variables, fitting a project-specific probability model, computing quantitative measures of confidence, and verifying statistical assumptions.
Statistical models are generally defined in terms of their mathematical formulation and their statistical assumptions, whereas machine learning techniques are mostly defined in terms of their architecture and learning algorithms.
Objectives
Statistical models largely emphasize inference: they provide a model, predictor, or classifier and offer insights into the data and its structure. Machine learning techniques find generalizable predictive patterns in the data and mostly emphasize implementation.
Statistical models are self-explanatory and help interpret the data by providing statistical measures such as marginal effects, elasticities, and the coefficients' signs, magnitudes, and significance. Machine learning techniques, on the other hand, provide an efficient, accurate, and flexible representation of the underlying and hidden relationships between the variables in the data, and are well known for their high predictive capability.
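To illustrate the kind of interpretative output a statistical model provides, here is a minimal sketch in Python with NumPy (the data, variable names, and settings are synthetic, chosen only for illustration): ordinary least squares on fabricated data, followed by the coefficients' signs, magnitudes, and t-statistics.

```python
import numpy as np

# Illustrative sketch only: fit a linear regression by ordinary least
# squares on synthetic data and inspect the interpretative measures a
# statistical model offers (coefficient signs, magnitudes, t-statistics).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)            # coefficient estimates
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])               # residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors
t_stats = beta / se                                     # t-statistics

for name, b, t in zip(["intercept", "x1", "x2"], beta, t_stats):
    print(f"{name}: coef={b:+.3f}, t={t:+.1f}")
# |t| greater than roughly 2 is the usual rule of thumb for
# significance at the 5 percent level
```

Each coefficient's sign and magnitude can be read directly, which is exactly the self-explanatory quality the text attributes to statistical models.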
Flexibility
Statistical models follow rigid assumptions and mostly yield one final model or conclusion, while machine learning techniques often yield more than one, since their learning process may settle in different local minima. Machine learning techniques are thus more flexible: their functional form is approximated through a learning process rather than fixed by multiple a priori statistical assumptions. Overall, machine learning techniques offer greater learning capacity, better generalization, and higher adaptability.
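The local-minima point can be sketched concretely. In this illustration (the loss function, learning rate, and starting points are hypothetical choices, not from the original text), plain gradient descent on a non-convex loss converges to a different solution depending on where it starts:

```python
import numpy as np

# Sketch of why ML training can yield more than one final model:
# gradient descent on a non-convex loss lands in different local
# minima depending on initialization. Loss and settings are
# hypothetical, chosen only for illustration.
def loss(x):
    return np.sin(3 * x) + 0.1 * x ** 2

def grad(x):
    return 3 * np.cos(3 * x) + 0.2 * x

minima = []
for x in (-2.0, 0.0, 2.0):          # three different initializations
    for _ in range(500):            # plain gradient descent
        x -= 0.01 * grad(x)
    minima.append(round(x, 2))

print(minima)  # three distinct local minima, one per starting point
```

A statistical model with a closed-form estimator, such as the OLS fit above, has no such dependence on a starting point, which is why it produces a single final model.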
Self-Exploratory
Since machine learning techniques mostly operate through hidden mechanisms, some researchers call them black boxes. In contrast, statistical models are self-exploratory: one can interpret them by examining the estimated coefficients, their signs, and other statistical tests, such as t-statistics or p-values.
Assumptions & Limitations
Statistical models mostly rest on multiple strict, a priori hypotheses and place restrictions on the data, distributions, and error terms. Multicollinearity, nonlinearity, autocorrelation, and non-normality are some of the situations in which statistical models run into problems. Machine learning techniques, on the other hand, are highly adaptable, with no prior specification of a model or error distribution.
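Multicollinearity, the first problem listed above, is easy to demonstrate. In this synthetic sketch (the data and the 0.01 noise scale are illustrative assumptions), two nearly identical predictors inflate the condition number of the X'X matrix, which is what makes coefficient estimates unstable:

```python
import numpy as np

# Sketch: multicollinearity in a synthetic design matrix. When one
# predictor is almost a copy of another, X'X becomes ill-conditioned
# and least-squares coefficient estimates turn unstable.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1

X_collinear = np.column_stack([x1, x2])
X_independent = np.column_stack([x1, rng.normal(size=n)])

cond_bad = np.linalg.cond(X_collinear.T @ X_collinear)
cond_ok = np.linalg.cond(X_independent.T @ X_independent)
print(cond_bad)   # very large: severe multicollinearity
print(cond_ok)    # modest: well-behaved design
```

A machine learning method with no such distributional or design-matrix requirements would simply train on either dataset, which is the adaptability the text describes.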
Data Requirements
Machine learning techniques require more data than statistical models. A greater number of observations in a machine learning model (such as a neural network) leads to a better learning fit and higher prediction accuracy. It has been common practice among machine learning modelers and transportation practitioners to set aside 10-20 percent of observations for cross-validation and another 20-30 percent for model evaluation. Thus, most machine learning models are developed using only 60-70 percent of the data, which requires an even greater number of observations.
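The split described above can be sketched as follows; the 70/10/20 proportions are one choice within the ranges the text gives, and the dataset size is arbitrary:

```python
import numpy as np

# Sketch of the practice described in the text: roughly 60-70 percent
# of observations for training, 10-20 percent for validation, and
# 20-30 percent for final evaluation (here: 70/10/20).
rng = np.random.default_rng(42)
n_obs = 1000
idx = rng.permutation(n_obs)              # shuffle before splitting

n_train = int(0.7 * n_obs)
n_val = int(0.1 * n_obs)

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 700 100 200
```

Because only the training portion is available for fitting, a dataset must be correspondingly larger to leave enough observations for learning.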
Predictive Ability & Accuracy
One of the major differences between machine learning and statistical models is their prediction ability. Machine learning techniques are designed to make the most accurate predictions possible, while statistical models emphasize inference about the relationships between variables. Machine learning techniques make predictions by applying general-purpose learning algorithms to find patterns in large databases, with minimal assumptions about the data distribution. However, the lack of an explicit model output can make machine learning techniques very difficult to interpret.
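A minimal example of such a general-purpose learner is k-nearest-neighbours regression, sketched here on synthetic data (the data, k, and function names are illustrative assumptions, not from the original text). It predicts from local patterns in the data rather than from an explicit fitted equation, which is also why its output resists interpretation:

```python
import numpy as np

# Sketch of a general-purpose learning algorithm with minimal
# distributional assumptions: k-nearest-neighbours regression.
# There is no explicit model output to inspect, only predictions.
def knn_predict(X_train, y_train, x_query, k=5):
    # distance from the query point to every training observation
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest
    return y_train[nearest].mean()         # average their responses

rng = np.random.default_rng(7)
X_train = rng.uniform(-3, 3, size=(500, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=500)

print(knn_predict(X_train, y_train, np.array([1.0])))  # close to sin(1)
```

The prediction tracks the true function well, yet there are no coefficients, signs, or significance tests to report, in contrast to the OLS example earlier in this section.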
Modeling Effort
Modeling effort can be divided into four stages:
A) Data preparation: since machine learning techniques have fewer assumptions and limitations, the data preparation stage is more difficult and more time-consuming for statistical models.
B) Model calibration (running the model): machine learning models can handle larger, more complex data, which requires more computation. Thus, the model calibration (model run) stage is more time-consuming for machine learning techniques than for statistical models.
C) Model validation (model improvement): since machine learning techniques have higher predictive ability than statistical models, improving a model's predictions is easier. Fewer statistical and mathematical assumptions make model improvement more straightforward for machine learning techniques.
D) Model interpretation: statistical models provide multiple interpretative measures, such as coefficients, signs, t-statistics, p-values, confidence intervals, marginal effects, and elasticities.
General Comparison

| | Statistical Models | Machine Learning |
| --- | --- | --- |
| Flexibility | | ✓ |
| Self-exploratory | ✓ | |
| Statistical framework | ✓ | |
| Probability maximization | ✓ | |
| Error minimization | | ✓ |
| Rigid assumptions | ✓ | |
Data Availability

| | Statistical Models | Machine Learning |
| --- | --- | --- |
| Small sample data | ✓ | |
| Dimension reduction | | ✓ |
| Complex data | | ✓ |
| Big data | | ✓ |
| Real-time data | | ✓ |
| Dealing with missing data | ✓ | ✓ |
| Dealing with outliers | ✓ | ✓ |
Modeling Purpose

| | Statistical Models | Machine Learning |
| --- | --- | --- |
| Hypothesis check | ✓ | ✓ |
| Interpretation | ✓ | |
| Comparison | ✓ | |
| Prediction | | ✓ |
Modeling Type

| | Statistical Models | Machine Learning |
| --- | --- | --- |
| Regression | ✓ | ✓ |
| Classification | ✓ | ✓ |
| Count outcomes | ✓ | ✓ |
| Discrete outcomes | ✓ | ✓ |
| Rule-based | | ✓ |
| Pattern recognition | | ✓ |
Modeling Effort

| | Statistical Models | Machine Learning |
| --- | --- | --- |
| Data preparation (more demanding) | ✓ | |
| Running time (longer) | | ✓ |
| Model validation / improvement (more straightforward) | | ✓ |
Post Modeling

| | Statistical Models | Machine Learning |
| --- | --- | --- |
| Sensitivity analysis | ✓ | ✓ |
| Elasticity / Marginal effect | ✓ | |
| Ranking variables' importance | | ✓ |
| Probability (significance) tests | ✓ | |
| Confidence intervals | ✓ | |
| Model learning improvement | | ✓ |