
Introduction


Last updated 1 year ago

Implementing a machine learning algorithm is a demanding task that requires comprehensive planning and a clear outline of the tasks to be executed. A data pipeline provides exactly that: a basic structure that describes the path to be followed from the start of a task through to its end. Data pipelines have become popular because they make it easier to navigate the complex data analysis process (Figure 5-1).

Figure 5-1 A simple Data Pipeline structure

Planning is essential to the success of any project. Consider transportation companies that must manage their assets and fleets efficiently and sustainably. The process can be broken down into several steps: collecting data from the fleet, transmitting the data to the cloud, integrating it with enterprise data, and modeling and prediction to make the business more efficient. These steps map directly onto the stages of a data pipeline, and completing them one by one helps ensure a smooth flow.

By providing a structured outline of the solution at hand, data pipelines help identify complications that can arise while deploying the model. With real-world data, potential issues include attacks that corrupt data, security lapses, and high latency. A pipeline is particularly useful when multiple streams of data are ingested into a model. Designing a data pipeline allows one to anticipate these difficulties and equip the model to handle them better.
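As a minimal sketch of how a pipeline can anticipate such issues, the check below screens records from several streams before they reach a model. The field names (`stream`, `speed_mph`, `timestamp`), the plausible speed range, and the 5-second delay limit are all hypothetical choices made for illustration, not values from this primer.

```python
from typing import Any, Dict, List

MAX_DELAY_S = 5.0          # assumed limit on how old a record may be
SPEED_RANGE = (0.0, 120.0)  # assumed plausible speed range in mph


def validate_record(record: Dict[str, Any], now: float) -> List[str]:
    """Return a list of problems found in one record (empty if it is usable)."""
    problems = []
    speed = record.get("speed_mph")
    low, high = SPEED_RANGE
    # A missing, non-numeric, or out-of-range speed is treated as corrupt data.
    if not isinstance(speed, (int, float)) or not low <= speed <= high:
        problems.append("corrupt speed")
    # A record that arrives too long after it was measured indicates high latency.
    if now - record.get("timestamp", 0.0) > MAX_DELAY_S:
        problems.append("late arrival")
    return problems


# Hypothetical records arriving from three different data streams.
records = [
    {"stream": "loop_detector", "speed_mph": 55.0, "timestamp": 100.0},
    {"stream": "gps_probe", "speed_mph": -12.0, "timestamp": 99.0},
    {"stream": "camera", "speed_mph": 47.0, "timestamp": 80.0},
]
now = 101.0

report = {r["stream"]: validate_record(r, now) for r in records}
accepted = [r for r in records if not report[r["stream"]]]
```

Here only the `loop_detector` record is accepted; the other two are flagged so that a later stage can decide whether to drop, repair, or log them.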

Constructing a data pipeline is the first step towards implementing a machine learning solution. An ideal data pipeline for a machine learning problem consists of the following stages (Figure 5-2):

  1. Problem Definition

  2. Data Ingestion

  3. Data Preparation

  4. Data Segregation

  5. Model Training

  6. Model Evaluation

  7. Model Deployment

  8. Performance Monitoring

Each of these stages in a data pipeline is discussed in detail in the following sections.

Figure 5-2 Data Pipeline for Machine Learning
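The eight stages can be sketched as an ordered list of functions that pass a shared context from one stage to the next. This is only an illustration under simplifying assumptions: the records are made up, the "model" is a plain mean predictor, and names such as `run_pipeline` and `mae_threshold` are hypothetical rather than part of any particular framework.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def define_problem(ctx: Context) -> Context:
    ctx["target"] = "speed_mph"  # the quantity we want to predict
    return ctx


def ingest(ctx: Context) -> Context:
    # Hypothetical detector records; a real pipeline would read from a source.
    ctx["raw"] = [
        {"volume": 120, "speed_mph": 62.0},
        {"volume": 300, "speed_mph": 48.0},
        {"volume": 280, "speed_mph": None},
        {"volume": 150, "speed_mph": 60.0},
        {"volume": 410, "speed_mph": 35.0},
    ]
    return ctx


def prepare(ctx: Context) -> Context:
    # Drop records whose target value is missing.
    ctx["clean"] = [r for r in ctx["raw"] if r[ctx["target"]] is not None]
    return ctx


def segregate(ctx: Context) -> Context:
    # Simple ordered 75/25 split into training and test sets.
    split = int(0.75 * len(ctx["clean"]))
    ctx["train"], ctx["test"] = ctx["clean"][:split], ctx["clean"][split:]
    return ctx


def train(ctx: Context) -> Context:
    values = [r[ctx["target"]] for r in ctx["train"]]
    ctx["model"] = sum(values) / len(values)  # mean predictor as a stand-in
    return ctx


def evaluate(ctx: Context) -> Context:
    errors = [abs(r[ctx["target"]] - ctx["model"]) for r in ctx["test"]]
    ctx["mae"] = sum(errors) / len(errors)  # mean absolute error
    return ctx


def deploy(ctx: Context) -> Context:
    model = ctx["model"]
    ctx["predict"] = lambda record: model  # expose the model as a callable
    return ctx


def monitor(ctx: Context) -> Context:
    # Flag the model for retraining if the error exceeds an assumed threshold.
    ctx["needs_retraining"] = ctx["mae"] > ctx.get("mae_threshold", 10.0)
    return ctx


PIPELINE: List[Tuple[str, Stage]] = [
    ("Problem Definition", define_problem),
    ("Data Ingestion", ingest),
    ("Data Preparation", prepare),
    ("Data Segregation", segregate),
    ("Model Training", train),
    ("Model Evaluation", evaluate),
    ("Model Deployment", deploy),
    ("Performance Monitoring", monitor),
]


def run_pipeline(stages: List[Tuple[str, Stage]]) -> Context:
    ctx: Context = {"completed": []}
    for name, stage in stages:
        ctx = stage(ctx)
        ctx["completed"].append(name)
    return ctx


result = run_pipeline(PIPELINE)
```

Keeping each stage as a separate function means a single stage can be replaced (for example, swapping the mean predictor for a real model) without touching the rest of the workflow.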

Based on the data properties and the problem objective, different types of pipelines can be adopted:

  1. Batch Analytics: This type of pipeline is used when there is a large volume of data that can be updated or uploaded in batches at periodic intervals.

  2. Stream Analytics: These data pipelines focus on delivering real-time results through data analytics. The real-time output could be predictive analysis or visualizations, but it is imperative that data is ingested and processed with minimal latency to deliver the results effectively.
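The difference between the two pipeline types can be illustrated with a small sketch that computes the same statistic, a mean speed, in both styles. The function names and the speed values are hypothetical; the batch version waits for the full set of readings, while the stream version updates its result as each reading arrives.

```python
from typing import Iterable, Iterator, List


def batch_mean_speed(readings: List[float]) -> float:
    """Batch analytics: process the whole accumulated batch at once."""
    return sum(readings) / len(readings)


def stream_mean_speed(readings: Iterable[float]) -> Iterator[float]:
    """Stream analytics: emit an updated mean after every new reading."""
    count = 0
    mean = 0.0
    for speed in readings:
        count += 1
        mean += (speed - mean) / count  # incremental mean update
        yield mean


speeds = [60.0, 50.0, 40.0, 30.0]  # hypothetical speed readings in mph

batch_result = batch_mean_speed(speeds)
stream_results = list(stream_mean_speed(speeds))
```

The last value produced by the stream version matches the batch result, but the stream version also makes an intermediate estimate available after every reading, which is what a real-time application needs.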