Supervised Learning
Supervised learning is one of the most common machine learning methods that is widely studied in academia and used in industry. Supervised learning is a category of machine learning methods that maps the functional relationship, between a set of inputs X and outputs y, where y and C being the number of classes. The goal of supervised learning is to estimate the function based on labeled training dataset, and then the relationship can be used to predict future outputs given a new set of unlabeled inputs. It is called supervised learning because the model is estimated based on known labeled outputs in the training dataset. In comparison with unsupervised learning, the model outcome is often unknown and not observable from training dataset. The model building process includes problem formulation, training dataset collection, feature extraction, model fitting, model evaluation, and deployment. The following chart described the supervised learning model training process at high-level.
Types of Supervised learning Algorithm
The supervised learning is a regression problem if the output is continuous and a classification problem if the output is categorical. The classification problem can be binary classification if number of outcomes C = 2, or multiclass classification problem if C > 2. Supervised learning method is often an effective way to train a function if training data can be relatively easily collected. However, collecting training dataset can often be expensive or sometimes impractical. The quality of training dataset often dictates the model predictability and accuracy. Supervised learning also makes a critical assumption that the distribution of training examples is the same as testing samples, which is often violated in real-world applications. There are many different supervised learning algorithms. The choice of different models is often determined by problem formulation, variables types (continuous vs. discrete), number of predicted outcomes (binary vs. multiclass), model training (online vs. batch), and model interpretability.
Generalized Linear Regression Models
Linear Regression
Logistic Regression
Multinomial Logistic Regression
Support Vector Machine
Decision Tree
Naïve Bayes
Linear Discriminant Analysis
K-Nearest Neighbor
Neural Network
Examples of Supervised Learning:
Supervised learning is one of the most widely used machine learning methods. Some of the successful implementation including the following use cases:
Email Spam Filtering: Email can be classified into spam and non-spam emails based on the frequency for some of the key words used in the email, which is often processed by the technique called bag of words. The algorithm is based on training dataset that already labeled the email as spam or non-spam email. The algorithm is widely implemented in email spam filtering.
Crash Severity Modeling: Traffic crash data is a popular data source in transportation safety and crash outcomes can often be classified into different severity levels (KABCO Injury Classification). Researchers often used multinomial logistic regression model to identify the relationship between contributing factors and crash severity outcomes.
Image Classification in hand writing recognition and autonomous vehicle Object Detection: One of the common data sources in transportation is video image data collected from infrastructure and autonomous vehicles. We often need to classify types of vehicles pass through an intersection and whether there are pedestrians on the sidewalk. Convolutional neural network is often used to fit the labeled images and detect the objects when presented new images.
Last updated