Data Scientist - Unsupervised Learning & the AI promise ...

Teradata Employee

An image recognition program is shown 10 cat and dog photos. The program tries to decide whether each image is a cat or a dog. Here's how it's typically done today:

  • A data scientist trains a Deep Learning model ahead of time by showing it thousands of cat and dog images from cat and dog directories using Keras/TensorFlow. The model is typically a couple of convolution layers followed by some plain (dense) neural network layers, probably run on a PC or on a server with a GPU in AWS, Azure, etc.
  • Test it, then fine-tune the model using cross-validation, hyperparameter tuning, etc.
  • Run the program and score the 10 new photos given to you as either a cat or a dog. Do you get 99% accuracy? Awesome! Success!

What is odd with the problem & solution above?


There are three things:

  • The training step requires a ginormous amount of data for the model to get it right. Even though we are scoring just 10 photos at the end, the model typically has to learn from thousands of images over and over (multiple training passes, or epochs). The training data problem gets worse if we are trying to classify images into a big list of animals. “Transfer Learning” to some extent allows you to reuse existing models, but the data and training setup still need to be massaged to get the desired results!
  • Finding labeled data ready on day 1 is almost impossible. Whether it is weblogs, customer transactions, drone images of crops, or customer-uploaded images in a data lake, labeled data is always an afterthought for data science projects. How do you go back and label 5 years of data into 15 different categories in the first place, so you can predict the unknown in the future?
  • Also, if the input image that we are asked to recognize is not a cat or a dog, there is no easy way to say "I found a new animal". Let's say there are three kangaroos in the input. It's hard for the algorithm to detect that it found three *NEW* animals that are different from a cat or a dog AND that they are similar to each other.
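To make the third point concrete, here is a minimal sketch of one way to let a classifier say "unknown": a nearest-centroid classifier with a distance cutoff. Everything in it is an assumption for illustration: the 2-D feature vectors are made up (in practice they might come from an image embedding), and `classify_open_set` and the `max_dist` value are hypothetical, not part of any library.

```python
import math

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def classify_open_set(x, centroids, max_dist):
    """Return the nearest known label, or "unknown" if x is too far from all of them."""
    label, dist = min(
        ((name, math.dist(x, c)) for name, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= max_dist else "unknown"

# Made-up 2-D features for the labeled training photos.
train = {
    "cat": [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]],
    "dog": [[4.0, 4.0], [4.2, 3.9], [3.8, 4.1]],
}
centroids = {name: centroid(pts) for name, pts in train.items()}

# Two familiar animals followed by three "kangaroos" far from both classes.
new_photos = [[1.1, 0.9], [4.1, 4.0], [9.0, 1.0], [9.2, 1.1], [8.9, 0.8]]
labels = [classify_open_set(x, centroids, max_dist=1.5) for x in new_photos]

# Check whether the rejected photos are also close to each other.
unknowns = [x for x, lab in zip(new_photos, labels) if lab == "unknown"]
all_similar = all(math.dist(a, b) <= 1.5 for a in unknowns for b in unknowns)
print(labels, all_similar)
```

A plain classifier without the cutoff would be forced to call each kangaroo a cat or a dog. The hard part in practice is picking the cutoff, which this toy example simply assumes.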

Let's extend the use case above to finding new fraudulent behaviors, new diseases, new buying behaviors, new topics in customer comments, etc. In general, if we know *enough* examples ahead of time, today's AI can spot a new one in the future and classify it as one of the known examples with upwards of 95% accuracy.



Unsupervised Learning


For the uninitiated, unsupervised learning algorithms don't require data to be labeled. There are different algorithms that do unsupervised learning. Its promise is to sort and bin the data automatically into groups that have similar properties. Let us walk through some of the algorithms.


Traditional unsupervised algorithms require specifying the number of clusters to output ahead of time (K-Means, Latent Dirichlet Allocation, K-Modes, etc.). They are used for segmentation of products and customers and to group text by topic keywords. It is both an art and a science to find the number of clusters ahead of time, especially if the clusters are expected to be in the double digits.
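As a rough illustration of why the cluster count matters, here is a small pure-Python sketch of K-Means (Lloyd's algorithm) run with several values of k on made-up data. The `kmeans` helper, the naive "first k points" initialization, and the toy points are assumptions for illustration; a real project would more likely use a library implementation with better initialization.

```python
import math

def kmeans(points, k, max_iters=100):
    """Plain Lloyd's algorithm. Returns (assignments, inertia)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            new_centroids.append(
                [sum(c) / len(members) for c in zip(*members)]
                if members else centroids[j]
            )
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    # Inertia: sum of squared distances to the assigned centroid.
    inertia = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assign))
    return assign, inertia

# Three made-up blobs, interleaved so the naive init sees one point from each.
points = [(0, 0), (10, 10), (20, 0), (0, 1), (10, 11),
          (20, 1), (1, 0), (11, 10), (21, 0)]

inertia_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3, 4)}
assign3, _ = kmeans(points, 3)
for k, value in inertia_by_k.items():
    print(k, round(value, 2))
```

Inertia keeps shrinking as k grows, so it cannot pick k by itself. People usually look for the "elbow" where the improvement flattens out, which is easy with three blobs and much murkier with fifteen.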


Some of the more advanced and sophisticated algorithms, like SOM (Self-Organizing Maps) or GTM (Generative Topographic Mapping), let the data self-organize into similar groups. A recent favorite of mine, t-SNE (t-distributed Stochastic Neighbor Embedding), reduces high-dimensional data to a few dimensions and facilitates easy clustering & visualization. t-SNE is a nonlinear alternative to PCA (Principal Component Analysis), which is very popular with traditional analytics folks, and it often separates groups more clearly in a 2-D plot. The neural network version of the same dimension reduction idea can be done with AutoEncoders and also with energy-based models such as RBMs (Restricted Boltzmann Machines).

Another popular approach is to create a similarity graph across the inputs using Cosine or Euclidean distances and then run an algorithm like HCS (Highly Connected Subgraphs) clustering to partition the graph into clusters. In one fun output of graph clustering, pictured in the figure that accompanies this post, the algorithm found clusters of products that have similar properties. The products are the nodes, the edges represent similarity, and the clusters are automatically colored as they are identified as subgraphs. It's an expensive operation, but it is possible today to work with a large number of products or customers.
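Here is a simplified sketch of the similarity-graph idea. Note the swap: instead of HCS, it uses plain connected components as the partitioning step, which is much cruder but keeps the example short. The product names, their feature vectors, and the 0.9 threshold are all made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def similarity_graph(items, threshold):
    """Adjacency sets linking items whose cosine similarity >= threshold."""
    names = list(items)
    graph = {name: set() for name in names}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if cosine(items[u], items[v]) >= threshold:
                graph[u].add(v)
                graph[v].add(u)
    return graph

def connected_components(graph):
    """Group nodes reachable from each other (a stand-in for HCS, not HCS itself)."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical product feature vectors.
products = {
    "espresso": [9, 1, 0],
    "latte": [8, 2, 1],
    "green tea": [1, 9, 0],
    "black tea": [2, 8, 1],
    "granola bar": [0, 1, 9],
}
clusters = connected_components(similarity_graph(products, threshold=0.9))
print(clusters)
```

The pairwise comparison is quadratic in the number of items, which is where the expense mentioned above comes from.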

For text, we have techniques like Word2Vec, a form of neural embedding, that automatically infer connections between words that are used together in different places in sentences. And finally, to complete the list, you have neighbor distance-based algorithms such as KNN (K-Nearest Neighbors, which is more commonly used as a supervised classifier) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). I've only touched on my favorites in this blog post. There are a lot more algorithms not covered here. Some data scientists also use a combination of advanced algorithms along with traditional clustering techniques.
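DBSCAN is worth a quick sketch because, unlike K-Means, it does not need the number of clusters up front and it can mark outliers as noise. The version below is a minimal pure-Python take on the algorithm with brute-force neighbor search; the `eps` and `min_pts` values and the toy points are assumptions chosen to make the example work.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Brute-force search; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# Two dense made-up groups plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
dbscan_labels = dbscan(points, eps=1.5, min_pts=3)
print(dbscan_labels)
```

The cluster count falls out of the data, but the tuning burden moves to `eps` and `min_pts` instead.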


The unsolved problem

Given where the state of the art in Unsupervised Learning is today, its full promise still remains fairly elusive to a lot of practitioners - primarily because it requires a lot of tweaking, feature extraction, and manual iteration, and runs into problems with algorithm convergence, consistency, etc. Hyperparameter or model tuning is hard to do as there is no labeled reference to "tune to". Also, if the algorithm found X unique clusters after all the tweaking, how do we explain and label them so that they match human reasoning? The problem gets worse when X is not the right fit.
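One partial workaround for having nothing to "tune to" is an internal quality score computed from the data alone. The sketch below implements the mean silhouette coefficient in pure Python and compares a sensible labeling against a scrambled one. It is only a heuristic, not a substitute for labels; the toy points are made up, and the helper assumes at least two clusters and no duplicate points.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: near 1 is well separated, near or below 0 is poor."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the closest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the two visible groups
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the groups together

good_score = silhouette(points, good_labels)
bad_score = silhouette(points, bad_labels)
print(round(good_score, 2), round(bad_score, 2))
```

A score like this can rank candidate clusterings, but it says nothing about whether the clusters mean anything to a human, which is the explanation problem raised above.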

In contrast, Supervised Learning is delivering 95% to 99.95% accuracy with mature technologies like TensorFlow and standardized interfaces like Keras, with relative ease. Even explaining the models with Supervised Learning is becoming easier every day.

For AI to be "fully" useful in the future, learning from the unknown, self-organizing, labeling and classifying objects easily have to be solved first. As new data comes in, there is still the challenge of deciding whether to map it to existing learned labels or create new ones on the fly. Will the maturity and standardization of this happen in 2 years, 5 years or 10 years? Ask the scientists ...
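The "map to an existing label or create a new one" decision can be sketched with a very simple streaming scheme, sometimes called leader clustering. This is a toy under stated assumptions, not a solution to the open problem: `assign_or_create`, the fixed `radius`, and the stream of points are all hypothetical.

```python
import math

def assign_or_create(x, centroids, counts, radius):
    """Map x to the nearest existing cluster within `radius`, else start a new one."""
    if centroids:
        j = min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i]))
        if math.dist(x, centroids[j]) <= radius:
            counts[j] += 1
            # Running-mean update of the matched centroid.
            centroids[j] = [c + (xi - c) / counts[j]
                            for c, xi in zip(centroids[j], x)]
            return j
    # Nothing close enough: create a new cluster on the fly.
    centroids.append(list(x))
    counts.append(1)
    return len(centroids) - 1

centroids, counts = [], []
stream = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [0.9, 1.1], [5.1, 4.8], [9.0, 1.0]]
stream_labels = [assign_or_create(x, centroids, counts, radius=1.0) for x in stream]
print(stream_labels, len(centroids))
```

The result depends on the arrival order and on the radius, which is a small taste of why doing this robustly is still hard.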


If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake - Yann LeCun

In the meantime, for data science practitioners, there are still plenty of ways to cobble together techniques and even go into production, though the result will not be perfect or 100% automated. If you want to learn about fresh ideas or want to discuss using the state of the art for both supervised and unsupervised learning with existing Open Source tools or the Teradata Platform, you can always reach out to us!