
Data Ingestion



Data ingestion is the process of gathering data from different sources and consolidating it in one storage location. The two major types of data ingestion are batch processing and real-time processing. In batch processing, data is collected in groups at periodic intervals from multiple sources and then sent to a storage location. In real-time processing, each data item is sent to storage as soon as it is identified, without any grouping. Some platforms also use 'micro-batching', in which smaller groups are collected at shorter intervals and then sent for batch processing.
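The difference between the three modes can be illustrated with a minimal Python sketch. It is not tied to any particular ingestion tool: the `readings` list, the field names, and the in-memory lists standing in for storage are all hypothetical, and groups are formed by record count rather than by time interval to keep the example short.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def ingest_real_time(records: Iterable[dict], store: List[dict]) -> int:
    """Send each record to storage as soon as it arrives (no grouping)."""
    writes = 0
    for record in records:
        store.append(record)
        writes += 1
    return writes


def batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest_in_batches(records: Iterable[dict], store: List[dict], batch_size: int) -> int:
    """Write records to storage one group at a time; return the number of writes."""
    writes = 0
    for batch in batches(records, batch_size):
        store.extend(batch)
        writes += 1
    return writes


# Hypothetical detector readings (illustrative values only).
readings = [{"sensor_id": i % 3, "speed_mph": 50 + i} for i in range(10)]

batch_store: List[dict] = []
batch_writes = ingest_in_batches(readings, batch_store, batch_size=10)  # one large group

micro_store: List[dict] = []
micro_writes = ingest_in_batches(readings, micro_store, batch_size=3)   # small groups

real_time_store: List[dict] = []
real_time_writes = ingest_real_time(readings, real_time_store)          # no grouping
```

All three stores end up with the same records. What changes is how many write operations are issued and how long a record waits before it reaches storage, which is the trade-off a platform makes when it chooses between batch, micro-batch, and real-time ingestion.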

During data ingestion, the following challenges might arise:

  1. Maintaining ingestion speed as data size and complexity increase.

  2. Handling the ingestion of multiple complex data formats together.

  3. Managing the cost of ingestion when handling large volumes of data.

  4. Ensuring secure ingestion of data into the system.
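As an illustration of the second challenge, the sketch below ingests a CSV feed and a JSON feed and maps both onto one common record layout before storage. The feeds, the field names (`detector_id`, `speed_mph`), and the target layout are assumptions made up for this example; real sources will differ.

```python
import csv
import io
import json
from typing import Dict, Iterator, List

# Two hypothetical feeds describing the same kind of measurement in different formats.
CSV_FEED = "detector_id,speed_mph\nD1,61.5\nD2,48.0\n"
JSON_FEED = '[{"id": "D3", "speed": {"value": 55.2, "unit": "mph"}}]'


def from_csv(text: str) -> Iterator[Dict]:
    """Parse CSV rows and map them to the common record layout."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "detector_id": row["detector_id"],
            "speed_mph": float(row["speed_mph"]),
            "source": "csv",
        }


def from_json(text: str) -> Iterator[Dict]:
    """Parse nested JSON objects and map them to the common record layout."""
    for item in json.loads(text):
        yield {
            "detector_id": item["id"],
            "speed_mph": float(item["speed"]["value"]),
            "source": "json",
        }


# Every record now has the same keys, regardless of where it came from.
unified: List[Dict] = list(from_csv(CSV_FEED)) + list(from_json(JSON_FEED))
```

Keeping one small adapter per source format means a new format can be added without touching the code that stores or processes the unified records.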

Many data ingestion tools are available. A tool should be chosen so that it does not slow down the entire pipeline. An open-source tool has the added benefit of allowing users to write their own plug-ins for the system. The tool must comply with security standards to protect the data, and it should be easy to understand and manage, with little dependency on the developer. It is an additional incentive if the tool provides real-time insights about the data.

Figure 5-3 Popular Data Ingestion tools

Figure 5-3 shows some of the popular ingestion tools in use, which are described under Section 5.2.1. Once the right ingestion tool is in place, data from different sources is collected. The collected data needs a storage space where it can be easily handled, manipulated, and transported; hence the need for a suitable database. Selecting the right database is also imperative to the efficient working of the pipeline. Factors that could influence the selection range from data properties such as type, structure, and model to the use case, querying mechanism, and transaction speed. There are two types of databases: relational and non-relational (Figure 5-4).

Figure 5-4 Popular Databases
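The distinction between the two database types can be sketched with Python's standard library. The relational side uses an in-memory SQLite database with a fixed table schema. The non-relational side is only a stand-in: a plain dictionary of JSON documents that mimics how a document store accepts records with differing fields. The table name, keys, and values are hypothetical.

```python
import json
import sqlite3

# Relational: every row must fit the declared columns, and SQL is the query mechanism.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (detector_id TEXT NOT NULL, speed_mph REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("D1", 61.5), ("D1", 58.5), ("D2", 48.0)],
)
avg_by_detector = dict(
    conn.execute(
        "SELECT detector_id, AVG(speed_mph) FROM readings GROUP BY detector_id"
    )
)
conn.close()

# Non-relational (document-style stand-in): each document may carry its own fields.
documents = {
    "evt-1": json.dumps({"detector_id": "D1", "speed_mph": 61.5}),
    "evt-2": json.dumps(
        {"detector_id": "D2", "speed_mph": 48.0, "weather": {"condition": "rain"}}
    ),
}

# Queries inspect the documents themselves instead of relying on a fixed schema.
rainy_events = [
    key
    for key, raw in documents.items()
    if json.loads(raw).get("weather", {}).get("condition") == "rain"
]
```

The fixed schema makes aggregate queries such as the per-detector average straightforward, while the document layout tolerates records whose structure varies between sources. This is one way data structure and querying mechanism feed into the choice of database.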