The Curse of Dimensionality
In machine learning we frequently deal with high-dimensional data sets, i.e., data sets with a large number of variables. This can lead to the curse of dimensionality, a problem first formulated by the mathematician Richard Bellman.
Consider a data set where, without loss of generality, we can assume that all data have been scaled to lie in the interval [0,1]. If we have p variables, the data lie in a hypercube of p dimensions. Assume the data are distributed evenly over the hypercube.
Now consider how much of the data lie near the center of the hypercube, i.e., within the sphere inscribed in it (radius 1/2), as illustrated below for two dimensions:

The circle encloses about 79% of the area (π/4 ≈ 0.785); in other words, about 79% of the uniformly distributed data points lie within a distance of 1/2 of the center, while the remaining 21% lie in the corners of the square.

Now consider three dimensions, where we have a sphere inscribed in a cube. The sphere encloses only about 52% of the volume (π/6 ≈ 0.524), and hence about 52% of the data points. Even if we are dealing with only 10 dimensions, the inscribed sphere holds less than 0.25% of the volume, so more than 99.7% of the data will lie in the “corners” rather than near the center.
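As a quick check on these percentages, here is a minimal sketch in Python (the function name `inscribed_sphere_fraction` is illustrative, not from the text) that evaluates the closed-form volume of a p-dimensional ball of radius 1/2, V_p = π^(p/2) (1/2)^p / Γ(p/2 + 1), relative to the unit hypercube:

```python
# Minimal sketch: fraction of the unit hypercube's volume covered by
# its inscribed hypersphere (radius 1/2), using the closed-form ball
# volume V_p(r) = pi^(p/2) * r^p / Gamma(p/2 + 1).
from math import pi, gamma

def inscribed_sphere_fraction(p: int) -> float:
    """Fraction of the unit hypercube enclosed by its inscribed sphere."""
    return pi ** (p / 2) * 0.5 ** p / gamma(p / 2 + 1)

for p in (2, 3, 10):
    print(f"p = {p:2d}: {inscribed_sphere_fraction(p):.4%}")
# p =  2: 78.5398%   -> the ~79% figure for the circle in the square
# p =  3: 52.3599%   -> the ~52% figure for the sphere in the cube
# p = 10:  0.2490%   -> >99.7% of the data lie in the "corners"
```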
As the dimensionality increases, more and more of the data lie in the “corners” of the hypercube. This means that when fitting models to high-dimensional data, more of the data points lie near the edges, and therefore we will be extrapolating beyond the data rather than interpolating between data points.
Put another way, the number of data points required to adequately cover the range of possible values grows exponentially with the number of dimensions.
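To make the exponential growth concrete, here is a small sketch; the choice of 10 grid points per axis (a spacing of 0.1) is an arbitrary illustration:

```python
# Minimal sketch: covering [0, 1]^p with a regular grid of spacing 0.1
# requires 10 points per axis, hence 10^p points in total.
points_per_axis = 10
for p in (1, 2, 3, 10):
    print(f"p = {p:2d}: {points_per_axis ** p:,} points")
# p =  1: 10
# p =  2: 100
# p =  3: 1,000
# p = 10: 10,000,000,000
```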
High-dimensional data therefore present some difficult problems for building predictive models. Hence, dimensionality reduction is often a key goal in developing predictive models in machine learning.
The curse of dimensionality is somewhat ameliorated by the fact that in many practical problems the data lie on, or near, a manifold of fewer dimensions. Hence, the problem of dimensionality reduction often reduces to that of finding an appropriate description of the lower-dimensional manifold that contains the data.
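As a hedged illustration of this idea (synthetic data, not a general recipe), the sketch below generates points on a 2-dimensional linear subspace, the simplest kind of manifold, embedded in 10-dimensional space, and uses a singular value decomposition to reveal that only two directions carry any variance; curved manifolds require nonlinear methods, but the principle is the same:

```python
# Minimal sketch: points that live on a 2-D plane embedded in 10-D space.
# The singular values of the centered data matrix expose the fact that
# only 2 of the 10 ambient dimensions carry variance.
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))    # intrinsic 2-D coordinates
embedding = rng.normal(size=(2, 10))  # fixed linear map into 10-D
X = latent @ embedding                # 500 points in R^10, rank 2

X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
print(np.round(singular_values, 3))
# Only the first two singular values are (numerically) nonzero: the data
# occupy a 2-D linear manifold inside the 10-D ambient space.
```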