An image recognition program is shown 10 labeled photos of cats and dogs. The program then tries to determine whether each new image is a cat or a dog. Here's how it's typically done today:
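As a minimal sketch of that supervised recipe (using scikit-learn on made-up feature vectors standing in for the photos; a real pipeline would extract features with a CNN):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up feature vectors standing in for 10 labeled cat/dog photos.
rng = np.random.default_rng(0)
cat_features = rng.normal(loc=0.0, scale=1.0, size=(5, 8))
dog_features = rng.normal(loc=3.0, scale=1.0, size=(5, 8))
X = np.vstack([cat_features, dog_features])
y = np.array(["cat"] * 5 + ["dog"] * 5)

# Train on the labeled examples, then classify a new, unseen photo.
model = LogisticRegression().fit(X, y)
new_photo = rng.normal(loc=3.0, scale=1.0, size=(1, 8))
print(model.predict(new_photo))  # classifies the new photo as a dog
```

The key point: every training example arrives with a label, and the model can only ever answer "cat" or "dog".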
What is odd with the problem & solution above?
There are three things:
Let's extend the use case above to finding new fraudulent behaviors, new diseases, new buying behaviors, new topics in customer comments, and so on. In general, if we know *enough* examples ahead of time, today's AI can spot a new instance in the future and classify it into one of the known categories with upwards of 95% accuracy.
WHAT IF THERE IS SOMETHING NEW AND VALUABLE, WITH MANY INSTANCES OF IT, AND YOU MISSED IT?
For the uninitiated, unsupervised learning algorithms don't require the data to be labeled. There are many algorithms that do unsupervised learning. Their promise is to sort and bin the data automatically into groups with similar properties. Let us walk through some of the algorithms.
Traditional unsupervised algorithms require specifying the number of clusters to output ahead of time (KMeans, Latent Dirichlet Allocation, KModes, etc.). They are used for segmenting products and customers, and for grouping text by topic keywords. Finding the right number of clusters ahead of time is part art, part science, especially when the clusters are expected to number in the double digits.
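A minimal KMeans sketch with scikit-learn on synthetic data makes the constraint concrete: the cluster count must be supplied up front, whether or not it matches the data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic customer segments; KMeans must be told k up front.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.5, size=(30, 2)),   # segment A
    rng.normal(5, 0.5, size=(30, 2)),   # segment B
    rng.normal(10, 0.5, size=(30, 2)),  # segment C
])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(len(set(kmeans.labels_)))  # 3 clusters, because we asked for 3
```

If we had asked for `n_clusters=7`, KMeans would happily return 7 clusters, right or not.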
Some of the more advanced algorithms, like SOM (Self-Organizing Maps) or GTM (Generative Topographic Mapping), are more sophisticated and let data organize itself into similar groups. A recent favorite of mine, t-SNE (t-distributed stochastic neighbor embedding), reduces high-dimensional data to a few dimensions and facilitates easy clustering and visualization. t-SNE can be seen as a nonlinear alternative to PCA (Principal Component Analysis), which is very popular with traditional analytics folks. The neural network version of the same dimensionality reduction could be done with autoencoders, and also with energy-based models such as RBMs (Restricted Boltzmann Machines).
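For the t-SNE step, a short sketch with scikit-learn on random data shows the shape of the operation: many dimensions in, two dimensions out, ready for plotting or clustering.

```python
import numpy as np
from sklearn.manifold import TSNE

# 50-dimensional synthetic data reduced to 2 dimensions for visualization.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 50))
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedded.shape)  # (100, 2)
```

The 2-D embedding is what you would feed into a scatter plot or a downstream clustering algorithm.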
Another popular approach is to create a similarity graph across the inputs using cosine or Euclidean distances, and then run an algorithm like HCS clustering (Highly Connected Subgraphs) to partition the graph into clusters. A fun output of graph clustering is shown below. The algorithm found clusters of products with similar properties: the products are the nodes, the edges represent similarity, and the clusters are automatically colored as they are identified as subgraphs. It's an expensive operation, but it is possible today to work with large numbers of products or customers.
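The graph-building half of that pipeline can be sketched in a few lines. Here connected components serve as a deliberately simplified stand-in for HCS (which additionally checks how densely connected each subgraph is), on toy product vectors:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.metrics.pairwise import cosine_similarity

# Toy product feature vectors: two obvious groups.
products = np.array([
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.9],
    [0.1, 0.9, 1.0],
])

# Build a similarity graph: an edge wherever cosine similarity clears a threshold.
sim = cosine_similarity(products)
adjacency = (sim > 0.8).astype(int)
np.fill_diagonal(adjacency, 0)

# Partition the graph (connected components as a simplified stand-in for HCS).
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)  # 2 clusters: products {0, 1} and {2, 3}
```

The similarity threshold (0.8 here) plays the same role as the tuning knobs discussed later: there is no labeled answer telling you where to set it.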
For text, we have techniques like Word2Vec, also known as neural embeddings, which automatically infer connections between words that are used together across sentences. And finally, to complete the list, there are neighbor-distance-based algorithms such as KNN (K-Nearest Neighbors) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). I've discussed my favorites in this blog post; there are many more algorithms not covered here. Some data scientists also use a combination of advanced algorithms along with traditional clustering techniques.
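DBSCAN is a good contrast to KMeans: a quick scikit-learn sketch shows it discovering the cluster count from density alone, and flagging outliers as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier; no cluster count is supplied.
rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(0, 0.2, size=(20, 2)),
    rng.normal(5, 0.2, size=(20, 2)),
    [[20.0, 20.0]],  # an outlier DBSCAN will mark as noise (label -1)
])
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(sorted(set(labels)))  # [-1, 0, 1]: noise plus two discovered clusters
```

The trade-off: instead of choosing k, you now choose `eps` and `min_samples`, so the tuning burden moves rather than disappears.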
The unsolved problem
Given where unsupervised learning stands today, its full promise still remains fairly elusive for many practitioners, primarily because it requires a lot of tweaking, feature extraction, and manual iteration, and suffers from problems with algorithm convergence and consistency. Hyperparameter or model tuning is hard to do because there is nothing to "tune to" as a reference. Also, if the algorithm found X unique clusters after all that tweaking, how do we explain and label them so they match human reasoning? The problem gets worse when X is not the right fit.
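With no labels to tune against, practitioners fall back on internal heuristics. One common one (a sketch, not a full solution) is the silhouette score, used here to pick k for KMeans on a toy dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs, so the "right" k is known to be 3.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0, 4, 8)])

# Score each candidate k with the silhouette coefficient (higher is better).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3 on this cleanly separated toy data
```

On real, messy data the silhouette curve is rarely this decisive, which is exactly the tuning pain described above.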
In contrast, Supervised Learning is delivering 95% to 99.95% accuracy with mature technologies like TensorFlow and standardized interfaces like Keras, with relative ease. Even explaining the models with Supervised Learning is becoming easier every day.
For AI to be "fully" useful in the future, learning from the unknown (self-organizing, labeling, and classifying objects easily) has to be solved first. As new data comes in, there is still the challenge of deciding whether to map it to existing labels or create new ones on the fly. Will the maturity and standardization of this happen in 2 years, 5 years, or 10 years? Ask the scientists ...
> If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don't know how to make the cake. - Yann LeCun
In the meantime, for data science practitioners, there are still plenty of ways to cobble techniques together and even go into production, though not perfectly or 100% automated. If you want to learn about fresh ideas, or want to discuss using the state of the art for both supervised and unsupervised learning on existing open source or the Teradata Platform, you can always reach out to us!