Art of Analytics: The Safety Cloud


safety-cloud.png

The Insights

What would you do if you could speedily read through reams of documents covering a wide range of content and then distill that content into distinct topics? It sounds like a worthy skill to have. We are in luck: the polychrome cloud shows how different topics are neatly clustered into self-contained blobs, each of which can be explored further to understand what is being discussed within it.

 

In today's world, unstructured notes are part of core communications in virtually every industry. From medical professionals who record patient observations, to auto technicians who note safety information, to retailers who track social media for consumer comments, to call centers that monitor customer calls, valuable information is often stored at vast scale as free-form text. What makes this more of a challenge is that every industry has its own argot embedded within its data. To get any meaningful analysis of this data (e.g., what is the predominant sentiment expressed by our retail customers?), one first has to manually codify these texts into a structured form, which is an expensive, time-consuming, and nearly impossible proposition at such data scales.

 

The large cloud visualization shown here is the outcome of applying advanced analytic algorithms that systematically isolate the topics and themes embedded within these vast data volumes. The green, purple, yellow, and red splotches are distinct topic clusters that have been extracted from the documents. Each topic is assigned its own color. One topic could be related to documents that discuss safety issues in a manufacturing process. Another could refer to emerging geopolitical tensions in the Asia-Pacific. Still another could refer to a specific discussion of a retailer's designer clothing products for men. Several smaller color groups indicate topics of lower frequency or incidence. Once we have isolated the topics into their respective color groups, we can dive deeper to explore the nature of the comments within each one.

 

The Analytics

This visualization was created using Teradata Aster Analytics. We used open-source news story data and applied two sets of advanced analytic algorithms. The first set included Term Frequency-Inverse Document Frequency (TF-IDF) and Cosine Similarity to isolate unique topic clusters within the documents. These clusters were visualized using Force Atlas graph clustering with the open-source Gephi libraries. The second set included NLP techniques such as Sentenizer, Parts of Speech Tagging, and Text Chunking, which were applied to discover unique phrases that could indicate sentiments, events, or entities embedded within each topic cluster.
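To make the first set of algorithms more concrete, here is a minimal Python sketch of TF-IDF weighting followed by cosine similarity. It is not the Aster Analytics implementation: the tokenizer, the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`), the particular TF-IDF variant (relative term frequency times the log of inverse document frequency), and the three sample sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Deliberately crude tokenizer (an assumption): lowercase alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample documents, not taken from the news data in the article.
docs = [
    "Factory safety inspection found a guard missing on the press line",
    "Press line safety guard replaced after the factory inspection",
    "Retailer launches designer clothing line for men",
]
vectors = tfidf_vectors(docs)
sim_safety = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print(f"safety vs safety:   {sim_safety:.3f}")
print(f"safety vs retailer: {sim_mixed:.3f}")
```

In this toy corpus the word "line" occurs in every document, so its inverse document frequency is zero and it contributes nothing to any similarity score. The two safety reports share several weighted terms and score well above the safety/retail pair, which is the kind of signal that lets documents on the same topic pull together into one cluster.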

 

The Benefits

At least three benefits are realized through this advanced analytic implementation:

 

First, parsing a large volume of text data, regardless of where that data resides, becomes easy because clear topical areas of discussion within these texts can be readily discovered. For example, retail businesses can quickly understand which product lines most of their customers focus on in their social media statements. Similarly, medical transcriptions can be quickly analyzed to understand primary patient afflictions.

 

Second, once we are able to deconstruct topics from masses of text, we can easily perform further analysis to extract specific issues within these topics. For example, retailers can take topics that are associated with their luxury goods to further understand the kinds of sentiments (e.g., positive, negative, neutral) that have been expressed about those products. The point is that understanding customer sentiments, patient issues, or product safety considerations helps organizations create products, policies, solutions, and procedures that ensure they can be a force for maximum good in the lives of their stakeholders.

 

Finally, identifying clusters enables analysts to form intelligent hypotheses around underlying causal factors. This type of unsupervised learning capability, where the number of topic clusters does not need to be specified in advance, ensures that unanticipated issues can be quickly grasped and acted upon with reasonable response times.
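One simple way to see how the number of clusters can emerge from the data rather than being chosen up front is to link any two documents whose similarity exceeds a threshold and then read off the connected groups. The Python sketch below illustrates that idea only. It is not the Force Atlas or Aster workflow described above, and the function name `similarity_clusters`, the threshold, and the similarity scores are hypothetical values chosen for the example.

```python
def similarity_clusters(sim, threshold):
    """Group item indices into connected components of the graph that
    links i and j whenever sim[i][j] >= threshold."""
    n = len(sim)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Illustrative, made-up pairwise similarity scores for five documents.
sim = [
    [1.0, 0.8, 0.1, 0.0, 0.0],
    [0.8, 1.0, 0.2, 0.0, 0.1],
    [0.1, 0.2, 1.0, 0.7, 0.0],
    [0.0, 0.0, 0.7, 1.0, 0.0],
    [0.0, 0.1, 0.0, 0.0, 1.0],
]
clusters = similarity_clusters(sim, threshold=0.5)
print(clusters)
```

With a threshold of 0.5, documents 0 and 1 form one group, documents 2 and 3 form another, and document 4 stands alone, so three clusters appear without the count ever being supplied. Lowering the threshold merges groups and raising it splits them, so the threshold replaces the cluster count as the knob an analyst would tune.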