# Data Science - Curse of Dimensionality

Learn Aster

## Intuition

When there are few columns of data, it's relatively easy for the algorithms to perform good data science - Machine Learning like Clustering, Classification etc.,

As the number columns or features go up, it becomes exponentially harder to do predictive science with the same level of accuracy. The # of rows of data that is required to do any useful modeling grows exponentially higher as we add more columns or features to a table. This is due to a phenomenon called the curse of dimensionality. The best way to understand intuitively is to think about combinatorial for this problem. The number of distinct possibilities that can exist in the data set grows exponentially as you add more columns - yes ? That's indeed the source of the problem.

## More about the curse

If you are collecting data for your data science project and you have 100 numeric columns and 1 million rows, you may think of that as a lot. But the fact you have 100 columns of data puts you on a 100 dimension hyperspace - yikes. Unfortunately, mathematically this is nowhere a friendly place like the 3 dimensions we experience. The 1 million rows of data when plotted on 100 dimensions, would be very sparse. There would be no meaning to draw a plane or create clusters in this hyperspace. It would mathematically appear like a bunch of random dots with no pattern to it. In order to work in this space effectively, you'd probably need something like 100 trillion rows to do any good data science. This is difficult because you'll never get that kind of data in real life (even with big data sources).

## Wikipedia Definition

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic optimization.[1][2]

Picture (c) Wikipedia.

## About text data

While we tend to think of columns or features as dimensions, the text data works a bit differently. In text, every word or term becomes a dimension. Text is cursed - no pun intended as it can get to 10000s of dimensions real quickly. Just trying to find what topic a paragraph of text has, you are probably looking at 50 to 100 dimensions right there ! This is why text models need a lot of data to train with! When doing classification on some unknown piece of text, the models that you use are probably trained on a large corpus of data to minimize the effects of the curse.

## Effect of Sampling with growing dimensions

The curse of dimensionality has a dramatic effect on sampling assumptions.

The sampling quality degrades as more columns are added to the data set and something to keep in mind when dealing with wide column data. Performing data science on samples of high dimension data can be a problem if not thought carefully.

## Dimension Reduction & Feature Selection

To do anything useful with a large column data set, there are techniques like 'dimension reduction' which transforms the data down to fewer columns, so the row requirement drops down quite a bit. This makes it practical to do useful data science.

• For numeric data, there is PCA (Principal Component Analysis). Linear and non-linear versions of PCA are available depending on the characteristics of the features.
• For text, techniques such as LSI (Latent Semantic Indexing) can be used.
• SAX (Symbolic Aggregation Approximation) for Time Series is a great way to reduce dimensions on streams preserving the properties.
• If we are using Neural Networks, AutoEncoder works really well.

The above are the most popular ones. There are a lot more ways to do dimension reduction beyond above.

You can certainly do feature selection, by algorithmically finding out which columns are more significant in your data and just use that, so you reduce the dimension that way. Technologies like Deep Learning would claim that you don't even need to do this!

Link to my blog on "Art of Dimensionality Reduction"

## Algorithm Cure

When there are way too many dimensions, this results in an artificial data shortage - thanks to the curse. One way to mitigate this is to try different algorithms or ensembles. Some algorithms or combinations of, deal a lot better with data sparseness than others. But as you get more data, it's would not be so much about Algorithm X vs Algorithm Y!

## All Data Rows Solution

When dealing with wide or high dimensional data, building predictive models on *ALL* rows of data would indeed be the best case scenario. You "cannot" do better than this. Use all the data you got! You can do feature selection, dimension reduction with an MPP analytical platform like TERADATA ASTER or other open source technologies to reduce the columns by scanning 100s of TB of data quickly.

Click here for John Thuma's youtube video on "how to do PCA on Aster"

Hope you enjoyed the blog post ...