The Curse of Dimensionality

In machine learning we frequently deal with data sets of high dimensionality, i.e., data sets with a large number of variables. This can lead to the curse of dimensionality, a problem first formulated by the mathematician Richard Bellman.

Consider a data set where, without loss of generality, we can assume that all data have been scaled to lie in the interval [0,1]. If we have p variables, the data lie in a hypercube of p dimensions. Assume the data are distributed evenly over the hypercube.

Now consider how much of the data lie close to the center of the hypercube, i.e., within the sphere inscribed in it, as illustrated in Figure 1.1 for two dimensions.

Figure 1.1 - Circle embedded within a square

The inscribed circle encloses about 79% of the area of the square (π/4 ≈ 0.785); in other words, about 79% of the data points lie inside the circle, and only about 21% lie in the corner regions outside it.

Now consider three dimensions, where we have a sphere inscribed in a cube. The sphere encloses only about 52% of the volume (π/6 ≈ 0.524), and hence only about 52% of the data points. Even with only 10 dimensions, the inscribed hypersphere holds less than 0.3% of the volume, so more than 99.7% of the data lie in the "corners" rather than near the center.
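
These fractions follow directly from the formula for the volume of a d-dimensional ball of radius r, V_d(r) = π^(d/2) / Γ(d/2 + 1) · r^d. A minimal Python sketch (standard library only; the function name is our own) reproduces the numbers above:

```python
from math import pi, gamma

def inscribed_ball_fraction(d: int) -> float:
    """Fraction of the unit hypercube [0, 1]^d occupied by its inscribed ball (radius 0.5)."""
    return pi ** (d / 2) / gamma(d / 2 + 1) * 0.5 ** d

for d in (2, 3, 10, 100):
    print(f"d = {d:>3}: {inscribed_ball_fraction(d):.6f}")

# d =   2: 0.785398   -> the ~79% quoted above
# d =   3: 0.523599   -> the ~52% quoted above
# d =  10: 0.002490   -> less than 0.3%; over 99.7% of the data sit in the corners
# d = 100: 0.000000   -> about 1.9e-70; essentially everything is in the corners
```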

As the dimensionality increases, more and more of the data lie in the "corners" of the hypercube. When fitting models to high-dimensional data, most observations therefore sit near the edges of the space, and predictions for new points become extrapolations beyond the data rather than interpolations between data points.

Put another way, the number of data points required to adequately cover the range of possible values grows exponentially with the number of dimensions.
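
As a rough illustration of that exponential growth (a back-of-the-envelope sketch with a hypothetical helper, not a formal rule): if we want to cover each variable with about 10 distinct values, a grid over p variables needs roughly 10^p points:

```python
# Rough sketch: points needed to cover p variables at a resolution of
# k distinct values per variable, i.e., a full k^p grid.
def grid_points_needed(p: int, k: int = 10) -> int:
    return k ** p

for p in (1, 2, 3, 5, 10):
    print(f"p = {p:>2}: {grid_points_needed(p):>15,} points")

# p =  1:              10 points
# p =  2:             100 points
# p =  3:           1,000 points
# p =  5:         100,000 points
# p = 10:  10,000,000,000 points
```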

High-dimensional data therefore present difficult problems for building predictive models. Hence, dimensionality reduction is often a key goal when developing predictive machine learning models.

The curse of dimensionality is somewhat ameliorated by the fact that in many practical problems the data lie on, or close to, a manifold of fewer dimensions. Hence, dimensionality reduction often reduces to finding an appropriate description of the lower-dimensional manifold on which the data lie.
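
For example, the sketch below (our own synthetic toy data, using numpy; principal component analysis is covered in Chapter 3) generates three-dimensional observations that actually lie near a two-dimensional plane, and shows that two principal directions carry essentially all of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 observations that really live on a 2-D plane embedded in 3-D,
# plus a small amount of measurement noise (purely synthetic data).
latent = rng.uniform(size=(1000, 2))          # intrinsic 2-D coordinates
embed = np.array([[1.0, 0.0, 2.0],            # linear map from 2-D to 3-D
                  [0.0, 1.0, -1.0]])
X = latent @ embed + 0.01 * rng.normal(size=(1000, 3))

# PCA via an SVD of the centered data: the squared singular values show
# how much of the variance each principal direction carries.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
print(s**2 / np.sum(s**2))   # roughly [0.86, 0.14, 0.0002]: two directions suffice
```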
