Finding Categories in Text with Cosine Similarity


Aster has a ton of great text functions, but many of the text modeling functions require a large amount of data to get good results. LDA, for example, does not perform well on smaller data sets. A multigenre text approach with cosine similarity can give better results with less data.


In this example we are using survey results from a furniture retailer. The survey asked questions about demographics, personal style, and shopping patterns. Insights are easy to draw from these responses because the data can be visualized directly in a BI tool. The challenging part is finding insights in the responses to the free-form survey question, "Is there anything else you would like to add?" There are about 1000 free-form text responses, and the customer wanted to cluster them into categories and get a clear visual representation of the clusters. 1000 responses is far too few for LDA, and LDA doesn't easily lend itself to a clear visualization, so cosine similarity was a good alternative.


The cosine similarity function returns a similarity score for each pair of text responses. Since every pair will have some score, a minimum score threshold should be established for grouping responses into clusters. For example, the visualization below shows responses that have been grouped by a very high similarity score. 
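To make the pairwise scoring concrete, here is a minimal pure-Python sketch. It is not the Aster function: the simple word tokenizer, the use of raw term counts as vectors, and the sample responses are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty response is similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_scores(responses):
    """Return {(i, j): score} for every pair of responses with i < j."""
    vectors = [Counter(tokenize(r)) for r in responses]
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


# Made-up responses, for illustration only
responses = [
    "Delivery was late and the sofa arrived damaged",
    "The sofa arrived damaged after a late delivery",
    "Loved the friendly staff in the store",
]
scores = pairwise_scores(responses)
```

Every pair gets a score, including the unrelated ones: responses 0 and 2 share only the word "the" and still score above zero, which is why a minimum threshold is needed before grouping.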



The clusters are tightly connected and well separated, but the higher the similarity threshold, the less of the data is included in the analysis, because the output only includes responses that have at least one other response with a similarity score above the threshold. A very long response may not score highly against a shorter response even when both cover the same topic. In that case the longer response is left out of the output if the threshold is too high.
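The trade-off can be sketched in a few lines of Python. The original post does not say how pairs above the threshold are turned into clusters, so this sketch assumes the simplest option: treat each qualifying pair as an edge and take connected components. The function name and the scores are hypothetical.

```python
from collections import defaultdict


def cluster_by_threshold(scores, threshold):
    """Group items into connected components of the graph whose edges are
    the pairs scoring at or above the threshold.

    scores: {(i, j): similarity}. Items with no qualifying pair are omitted.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), score in scores.items():
        if score >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    clusters = defaultdict(list)
    for item in list(parent):
        clusters[find(item)].append(item)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scores for five responses (0-4)
scores = {
    (0, 1): 0.82, (0, 2): 0.22, (1, 2): 0.18,
    (3, 4): 0.20, (0, 3): 0.03, (2, 4): 0.05,
}
strict = cluster_by_threshold(scores, 0.80)  # [[0, 1]]
loose = cluster_by_threshold(scores, 0.15)   # [[0, 1, 2], [3, 4]]
```

With the strict threshold, three of the five responses drop out of the output entirely. With the loose threshold, all five are kept, at the cost of looser clusters.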


AppCenter is very useful for choosing the threshold. In the app below, a user can set a similarity score threshold and get back a visualization of the clusters, the top words in each cluster, and a table of all the responses with their cluster IDs, so the user can manually check whether the groupings make sense.


Although the low-threshold visualization does not look as clean as the high-threshold picture, the low threshold was the best approach for the survey results. Many of the text responses were long, and responses on the same topic often had only 15-25% similarity.


From the table of top words in each cluster, it's now easy to give the clusters category names, so the customer can get a good sense of the free-form text content and the feedback can go to the appropriate department.
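A top-words table like this can be approximated with a simple word count per cluster. The sketch below is an illustration rather than the app's actual logic: the stop-word list, the responses, and the cluster assignments are all made up.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "are", "in", "of", "to", "for"}


def top_words(responses, clusters, n=2):
    """Return {cluster_id: [the n most common non-stop words]}."""
    result = {}
    for cluster_id, members in enumerate(clusters):
        counts = Counter()
        for idx in members:
            words = re.findall(r"[a-z']+", responses[idx].lower())
            counts.update(w for w in words if w not in STOP_WORDS)
        result[cluster_id] = [word for word, _ in counts.most_common(n)]
    return result


# Made-up responses and cluster assignments (lists of response indexes)
responses = [
    "Delivery was late and the sofa arrived damaged",
    "Late delivery, and the delivery driver was rude",
    "Prices in the store are too high",
    "High prices for the quality",
]
clusters = [[0, 1], [2, 3]]
words_by_cluster = top_words(responses, clusters)
```

Here cluster 0 is dominated by "delivery" and "late", and cluster 1 by "prices" and "high", which suggests category names such as "Delivery problems" and "Pricing".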