My previous blog post, Dimensionality Reduction using PCA, was about lowering the dimensionality of numeric data. PCA was used to simplify training and prediction/classification, reducing computation time and improving accuracy.
In this blog post we will explore a counter-intuitive scenario. To do classification, we are going to generate additional dimensions for flat data, while smartly avoiding the resulting computational complexity. This is done through an algorithm called the Support Vector Machine, or SVM, which incorporates a cool idea called the Kernel Trick. IMHO, SVM is one of the most brilliant ideas in supervised classification in Machine Learning - the reason it's so wildly successful.
SVMs are great for text classification. A lot of white papers report 90-98% accuracy with SVM prediction on text, often higher than other common algorithms. A lot of Teradata Aster customers use it quite a bit in their text use cases for the same reason.
An academic example: you have two sets of articles or books, one written by Noam Chomsky and the other written by Ayn Rand. An SVM model can be trained on this known data using the labels Noam and Ayn. Now, if you are given a new chapter of text from an article or book, SVM is known to predict the correct author with 95%+ accuracy - Wow!
Imagine a deflated balloon with a bunch of colored dots - red and green - on its surface. We want to linearly separate the red and green dots. Linear means you are only allowed a point, a line, or a plane to separate them. Looking at this particular pattern of dots, it is near impossible to draw anything linear that separates the colors.
Let's do a trick - inflate the balloon. See below. You now have a much better chance of drawing a "linear plane" between the two colors and separating them into two groups!
This is exactly how Support Vector Machines work. By "inflating" the data - in other words, adding another dimension - we can make the separation linear and easier. This is the idea behind the Kernel Trick.
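The balloon inflation above can be sketched in a few lines of NumPy. Here is a minimal illustration (the data and the lifting function z = x² + y² are my own example, not from the original post): two rings of dots that no straight line can separate in 2D become trivially separable by a flat plane once we add a third dimension.

```python
# Sketch of the "balloon inflation" idea: points arranged in two rings are
# not linearly separable in 2D, but adding a third dimension
# z = x^2 + y^2 makes a flat plane enough to split them.
import numpy as np

rng = np.random.default_rng(0)

# Inner ring ("red" dots, radius 0.5) and outer ring ("green" dots, radius 2.0)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = np.column_stack([0.5 * np.cos(angles[:50]), 0.5 * np.sin(angles[:50])])
outer = np.column_stack([2.0 * np.cos(angles[50:]), 2.0 * np.sin(angles[50:])])

def lift(points):
    """Add a new dimension: squared distance from the origin."""
    z = (points ** 2).sum(axis=1)
    return np.column_stack([points, z])

inner3d, outer3d = lift(inner), lift(outer)

# In 3D, the horizontal plane z = 1.0 cleanly separates the two rings:
print(inner3d[:, 2].max())   # inner dots all sit near z = 0.25
print(outer3d[:, 2].min())   # outer dots all sit near z = 4.0
```

Nothing about the dots changed - we only gave the separator one more direction to work with, which is exactly the "inflation" in the analogy.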
Why is linear cool? Linear separators are mathematically easy to compute, and they keep the complexity manageable as dimensions pile up. This makes the process easy to generalize.
A dot on an X axis can separate the line into two parts. A line on an XY diagram can separate the area into two, while a plane in the XYZ system can separate the space into two. Separators in anything beyond 3 dimensions are called Hyperplanes! The goal of SVM is to find a hyperplane, possibly in a higher dimension than your current data, that separates the two classes or groups. Obviously, there can be many hyperplanes that separate both groups. We want the best one - the hyperplane at the optimal distance between the red and the green dots, i.e. the one with the maximum margin - so we can use it as a model for future prediction. The few training dots that sit closest to this hyperplane and pin down that margin are called the "Support Vectors" - hence the name of the algorithm. Finding that optimal hyperplane is one of the key challenges SVM tackles.
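A quick sketch of the maximum-margin idea, using scikit-learn's open-source SVM rather than the Aster implementation (the toy points and settings are my own assumptions): after fitting, the classifier exposes exactly which training dots ended up as support vectors.

```python
# Fit a linear SVM on two separable clusters and inspect the support
# vectors - the handful of points that determine the maximum-margin plane.
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters
X = np.array([[0.0, 0.0], [0.5, 0.5], [0.0, 1.0],
              [3.0, 3.0], [3.5, 2.5], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
clf.fit(X, y)

# Only the boundary-defining points show up here, not the whole data set
print(clf.support_vectors_)

# New points fall on one side of the learned hyperplane or the other
print(clf.predict([[1.0, 1.0], [3.0, 2.0]]))
```

Note that many separating lines exist for these clusters; SVM returns the single one that maximizes the gap to the nearest points of each color.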
Every time we "up" a dimension to classify the data, the computational complexity can, in theory, grow dramatically. The beauty of SVM is that the computation does not spin out of control, thanks to the Kernel Trick. The Kernel Trick not only makes it easier to define hyperplanes to classify the initial data, but also reduces the task of finding the most optimal one! Hence the popularity of SVM.
What's the Kernel Trick really? It's basically a dot product: the kernel function computes, using only the existing dimensions, the inner product two points would have in the higher-dimensional space - without ever constructing that space explicitly. Simple as it seems, it gives the algorithm leverage in a variety of ways, reducing the computational complexity being one of them.
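Here is a tiny worked example of that dot-product claim (the feature map and numbers are my own illustration): for the degree-2 polynomial kernel K(a, b) = (a · b)², computing the kernel directly in 2D gives exactly the same number as explicitly mapping both points into 3D and taking the dot product there.

```python
# The kernel trick in one comparison: same inner product, with and without
# ever building the higher-dimensional vectors.
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x, y = v
    return np.array([x * x, y * y, np.sqrt(2) * x * y])

def poly_kernel(a, b):
    """The same inner product, computed entirely in the original 2D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

print(np.dot(phi(a), phi(b)))   # explicit 3D mapping -> 121.0
print(poly_kernel(a, b))        # kernel trick, no mapping -> 121.0
```

The explicit map is cheap here, but for high-degree kernels (or the RBF kernel, whose feature space is infinite-dimensional) the mapped vectors are impractical or impossible to build - while the kernel stays a single cheap computation on the original data.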
There are several kernels available. The simplest of them is the linear kernel, and the more complex ones are non-linear. Commonly used kernels include the linear, polynomial, RBF (Gaussian), and sigmoid kernels, among many others.
Every kernel has its pros and cons (no free lunch anywhere!). However, you will find that the linear kernel is probably sufficient for most scenarios!
Having a lot of data to begin with beats any algorithm! However, when data is scarce and sparse, choosing the right algorithm can yield more accuracy. SVM performs really well on smaller data sets, often better than Naive Bayes, especially with text data. Text data is highly dimensional: every distinct word in the text is considered another dimension. Training and constructing hyperplanes to separate documents into category A vs. category B can be extremely time consuming (and near impossible) without a high-performing algorithm like SVM.
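To make the "every word is a dimension" point concrete, here is a hedged open-source sketch with scikit-learn (the Aster Sparse SVM operators work in-database; this shows the same idea, and the toy documents and labels are invented for illustration): even four short sentences produce a sparse, high-dimensional document-term matrix that a linear SVM separates easily.

```python
# Minimal text classification sketch: TF-IDF turns each document into a
# sparse vector with one dimension per vocabulary word, then a linear SVM
# finds a separating hyperplane in that high-dimensional space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the market and the economy grew",
        "the team won the final game",
        "stocks and bonds rallied today",
        "the coach praised the players"]
labels = ["finance", "sports", "finance", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse document-term matrix
print(X.shape)                       # 4 documents x one dimension per word

clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform(["the economy rallied"])))
```

The matrix is mostly zeros - each document uses only a few of the vocabulary's dimensions - which is exactly the sparsity that "Sparse SVM" implementations are built to exploit.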
SVM, by design, is a binary classifier. However, most field implementations of SVM (as in the Teradata Aster Analytic Foundation) leverage the core functionality to build a multi-class classifier, such as K-SVM or Sparse SVM. So you can predict more than two categories!
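The usual way libraries bolt multi-class support onto a binary SVM is to train several binary SVMs and combine their votes. As a sketch (using scikit-learn's SVC, not the Aster K-SVM operator, and a made-up three-cluster data set): with three classes, one-vs-one training builds one binary SVM per pair of classes, i.e. three of them.

```python
# Multi-class from binary: SVC trains a binary SVM per pair of classes
# (one-vs-one) and combines them to predict among 3 labels.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
y = np.array(["a", "a", "b", "b", "c", "c"])

clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

# 3 classes -> 3 pairwise binary SVMs -> 3 decision values per sample
print(clf.decision_function([[0.5, 0.5]]).shape)

# Each test point is assigned to the class that wins the pairwise votes
print(clf.predict([[0.5, 0.5], [5.0, 5.5], [10.0, 0.5]]))
```

One-vs-rest (one binary SVM per class against everything else) is the other common scheme; either way, the multi-class ability is layered on top of the binary core the blog describes.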
The Sparse SVM training and prediction operators in the Teradata Aster Analytic Foundation can be used by customers for supervised learning and prediction on huge amounts of text data (100s of TB). The kernel tricks, dimension increases, and search for an optimal hyperplane all happen seamlessly. The algorithm is implemented to exploit parallelism on the Teradata Aster MPP platform, so it runs really, really fast!
See John Thuma's Usage & Demo of Aster SparseSVM
Sparse SVM can be used not only for pure text classification, but also for other types of data that need supervised classification.
Also a list of broad use cases where SVMs can be used (from Wikipedia):
Hope you enjoyed my blog post!