Principal Component Analysis
Last updated
Principal component analysis (PCA) is a dimensionality reduction method that projects the original dataset onto a set of principal directions, where the first principal direction is the one along which the data has maximum variance. PCA is therefore achieved in two steps: (a) finding the most important principal directions, and (b) projecting the dataset onto these directions. Although the dataset can be fully represented by projecting it onto all of the principal directions, a small number of directions with the largest weights is generally adequate to capture the most important information in the original matrix (for data visualization, this number is usually no more than 3).
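The two steps above can be sketched in NumPy using an eigendecomposition of the covariance matrix. This is a minimal illustration, not a production implementation; library versions typically use the SVD instead for numerical stability, and the function name `pca` here is our own.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (n_samples x n_features) onto its top principal directions."""
    # Step (a): find the principal directions.
    X_centered = X - X.mean(axis=0)          # center each feature
    cov = np.cov(X_centered, rowvar=False)   # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort by variance, descending
    directions = eigvecs[:, order[:n_components]]
    # Step (b): project the centered data onto those directions.
    return X_centered @ directions

# Example: 100 samples with 5 features, reduced to 2 dimensions.
X = np.random.default_rng(0).normal(size=(100, 5))
Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

Note that the variance of the first projected coordinate is at least as large as that of the second, reflecting the ordering of the principal directions.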
A good example of the PCA process is the visualization of the Iris dataset, which contains 150 samples, each with 4 features, drawn from different species of iris. By projecting the dataset onto the two principal directions with the largest weights, we obtain the 2-D visualization shown in Figure 1, where each color represents one iris species.
Figure 1: PCA applied to the Iris dataset
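A projection like the one in Figure 1 can be reproduced in a few lines, assuming scikit-learn is available (its built-in copy of the Iris dataset and its `PCA` class are used here; the original figure may have been produced differently).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()                                 # 150 samples, 4 features
Z = PCA(n_components=2).fit_transform(iris.data)   # project onto top 2 directions
print(Z.shape)  # (150, 2)

# Each row of Z is a sample in the 2-D principal subspace; coloring the
# points by iris.target (the species label) gives a plot like Figure 1,
# e.g. with matplotlib: plt.scatter(Z[:, 0], Z[:, 1], c=iris.target)
```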
The applications of PCA are not limited to dimensionality reduction. Because it captures the most important information in a dataset, PCA can also be used for feature extraction, data denoising, pattern recognition, novelty detection, and so on. Unlike some other machine learning algorithms, such as neural networks, PCA is not a black box. In this section, we will unravel its computation process and explain the physical meaning of its results.