Data Science - Art of Dimensionality Reduction

Teradata Employee
The ability to capture the prominent characteristics of an object in its shadow or silhouette ...

The above is possible only if one manages to get a "good" shadow or silhouette - in other words, if the "dimensions were reduced" correctly. Dimensionality reduction, also known as dimension reduction or Dim reduction, is both a science and an art form. In general it tends to be more science with numeric data and more art with text! Dim reduction can be a very handy technique for a data scientist who deals with massive amounts of data and features. The technique can drastically reduce downstream computational needs and can improve the accuracy of predictive analytics algorithms.

Dim reduction is something we all do quite a bit in everyday life - it doesn't necessarily have to do with data science. The concept is nonetheless very important for data science problems involving numeric and text data.

Common Sense Definition:

When we try to explain a very complex idea succinctly to someone, it works when we simplify it into bullet points that don't overlap. Just look at the many blogs and presentations out there that talk about '10 ways to be happy', 'N ways to move up in your career', 'X ways to do Y', etc. A small list of bullets or points that overlap as little as possible brings out clarity. The bullets together don't need to convey the entire information; if they cover even 90% of it and bring out the insight & value, we've successfully reduced the dimensions of the problem.

Doing Dim reduction in data science means that we've mathematically paraphrased the data into a few prominent dimensions or "angles" that capture the essence of the subject. This also makes for faster computation downstream in the workflow w/o loss of key information. With numeric data, Dim reduction removes what is known as multi-collinearity, i.e. mutually dependent columns. The process clears out the overlap in information across numerous columns or features.

Is Dim Reduction the same idea as Data Sampling?

NO - it's not! Sampling gives you a smaller set of "rows" that's representative of the original data set. It reduces the # of rows you work with, but doesn't reduce the # of columns or features.

With Dim reduction you get a few representative or significant columns (real or derived) that capture most of the variability and can be used for further analysis. You can still continue to work on all the rows of data for these columns w/o losing "key" information.
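
To make the rows-versus-columns distinction concrete, here is a minimal Python sketch on a made-up table. The table size, the 10% sampling rate and the choice of columns to keep are all assumptions for illustration; picking the first two columns is only a stand-in for a real Dim reduction method such as PCA.

```python
import random

# Synthetic table: 1,000 rows x 6 columns of made-up readings (illustrative only).
random.seed(0)
n_rows, n_cols = 1000, 6
table = [[random.random() for _ in range(n_cols)] for _ in range(n_rows)]

# Sampling: keep 10% of the rows; every column survives.
sampled = random.sample(table, k=n_rows // 10)

# Dim reduction in its crudest form: keep every row, but fewer columns.
# Keeping columns 0 and 1 is a placeholder; a real method would derive
# the columns to keep from the data itself.
keep = [0, 1]
reduced = [[row[j] for j in keep] for row in table]

print(len(sampled), len(sampled[0]))   # 100 6
print(len(reduced), len(reduced[0]))   # 1000 2
```

Sampling shrinks the table vertically, Dim reduction shrinks it horizontally, and the two can be combined when both rows and columns are too many.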

Dim Reduction in Numeric data:

Sensor data from end-point devices or large systems is a good candidate for Dim Reduction. Assume you are getting a stream of Temperature, CPU utilization, Fan speed, Vibration stats, Pressure, etc. every second. There could be 100s or 1000s of such variables that you have data on.

When we apply a predictive model to this data set to find something like an imminent component failure, it's often very difficult to get predictions with high accuracy (for both the positive & negative cases). Widely used algorithms such as Naive Bayes, and to a lesser degree Decision Trees, can lose accuracy with multi-collinearity, aka variable dependency. Naive Bayes in particular assumes that the input variables or features are independent of each other, and does best when that holds.

In most cases, there is very little discriminatory value when you select a few variables and look at them. When Temperature goes up, CPU utilization goes up and Fan speed goes up. When Temperature goes down, CPU utilization drops and Fan speed drops. Clearly there is dependency between these variables. The question then is: do we need all three sources for analysis? Why can't we keep Temperature and throw away CPU utilization and Fan speed? Unfortunately, the answer is not easy. If you look carefully, you are probably going to find that Temperature is NOT dependent on a fourth variable, say Pressure. So which ones do you keep and which ones do you throw away? If you compare dependencies one variable pair at a time to figure it out, you'll see how daunting it is to decide which entire variables or columns can be dropped.
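
The pair-at-a-time comparison described above can be sketched in Python as a pairwise Pearson correlation table. The sensor readings below are synthetic and the coefficients that tie CPU utilization and Fan speed to Temperature are invented for illustration; Pressure is generated independently on purpose.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic readings (assumed relationships, not real sensor data).
random.seed(1)
n = 500
temperature = [random.gauss(60, 10) for _ in range(n)]
cpu = [0.9 * t + random.gauss(0, 3) for t in temperature]      # follows temperature
fan = [40 * t + random.gauss(0, 150) for t in temperature]     # follows temperature
pressure = [random.gauss(100, 5) for _ in range(n)]            # independent

columns = {"temperature": temperature, "cpu": cpu, "fan": fan, "pressure": pressure}
names = list(columns)

# Correlation for every ordered pair of columns (an N x N table).
corr = {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

for a in names:
    print(a.ljust(12), " ".join(f"{corr[a, b]:+.2f}" for b in names))
```

Four columns give only 6 distinct pairs to inspect. With 100 columns there would be 100 x 99 / 2 = 4,950 pairs, which is why deciding by eye which columns to drop does not scale.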

Eigenvectors & Principal Component Analysis (PCA):

Linear Algebra - Without going into much detail, calculating the eigenvectors of the covariance (or correlation) matrix of 100s of numeric columns in a table generates a set of derived columns or features that are uncorrelated with each other. These derived columns are called the Principal Components of the data set, and the top N of them, ranked by how much variability they capture, are the ones we keep.
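
As a rough illustration of the eigenvector calculation, here is a minimal NumPy sketch. The function name `pca`, the synthetic sensor columns and the choice to standardize the columns first are assumptions made for this example, not a description of any particular product's implementation.

```python
import numpy as np

def pca(X, n_components):
    """Project X (rows = observations, columns = features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # center each column
    cov = np.cov(Xc, rowvar=False)              # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh handles symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort by variance, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]      # top eigenvectors, one per column
    scores = Xc @ components                    # the derived columns
    return scores, components, eigvals

# Synthetic sensor table (assumed relationships, illustrative only).
rng = np.random.default_rng(0)
n = 1000
temperature = rng.normal(60, 10, n)
X = np.column_stack([
    temperature,
    0.9 * temperature + rng.normal(0, 3, n),    # CPU utilization
    40 * temperature + rng.normal(0, 150, n),   # fan speed
    rng.normal(100, 5, n),                      # pressure, independent
])

# Standardize so that columns measured in large units do not dominate.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

scores, components, eigvals = pca(Z, n_components=2)
print(scores.shape)                             # (1000, 2)
print(np.corrcoef(scores, rowvar=False)[0, 1])  # numerically zero: derived columns are uncorrelated
```

The three correlated sensors collapse mostly into the first derived column, while the independent Pressure reading ends up largely in the second. The correlation between the derived columns is zero up to rounding error, which is the multi-collinearity being cleared out.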

Example: How about getting 5 derived columns out of 100 original columns with 90% of the variability captured?

Also known as Principal Component Analysis or PCA, this process reduces the dimension of the data quite a bit (for example, down to 5 from 100!), and the reduced data can be used for faster and often more accurate model training and prediction.
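
How many derived columns are needed to reach a target such as 90% can be read off the sorted eigenvalues. The sketch below assumes a synthetic table in which 100 observed columns are noisy mixtures of 5 hidden drivers; the sizes, the noise level and the variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols, n_latent = 2000, 100, 5

# Synthetic data: 100 observed columns built from 5 hidden drivers plus noise.
latent = rng.normal(size=(n_rows, n_latent))
mixing = rng.normal(size=(n_latent, n_cols))
X = latent @ mixing + 0.3 * rng.normal(size=(n_rows, n_cols))

# Eigenvalues of the covariance matrix = variance captured by each principal component.
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]   # largest first

# Cumulative share of the total variability captured by the top 1, 2, 3, ... components.
explained = np.cumsum(eigvals) / eigvals.sum()

# Smallest N whose components together capture at least 90% of the variability.
n_keep = int(np.searchsorted(explained, 0.90) + 1)
print(n_keep, round(float(explained[n_keep - 1]), 3))
```

Because only 5 drivers generate almost all of the variability in this made-up table, `n_keep` comes out at 5 or fewer even though there are 100 columns. On real data the cut-off is a judgment call between compactness and information kept.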

Art - PCA with Simplicity & Scale:

Teradata Aster Discovery platform can generate NxN pair-wise Stats Correlation in one pass and run large, scalable PCA that does dimension reduction over a large number of numeric columns - on BIG numeric data. All of this takes a very short time, as it's done in parallel using smart data partitioning. One can also run Training & Prediction Machine Learning (ML) algorithms in parallel. The ML algorithms can even be cascaded with the PCA output in a single query!

As a data scientist one can now run PCA with just a few lines of SQL code using the SQL/MR pre-built functions.

New to Teradata Aster? The Aster Express VM can be downloaded for playing with PCA & Stats Correlation on your desktop or laptop (needs 4 GB memory) - thanks to John Thuma for pointing this out. John also has YouTube videos on using the Aster Express VM & learning about Aster.

Other mainstream applications of PCA:

Sensor data is just one example of how you can exploit PCA's power. PCA is also popular in the areas of data compression and image processing/classification, and is used widely in taxonomy, biology, pharmacy, finance, agriculture, ecology, health & architecture.

Links for the relentless Data Scientist/Explorer ...

Hope you enjoyed this blog post !

Reposted from LinkedIn