Understand Your Customers Through Sentiment Analysis

Learn Data Science
Teradata Employee

Understanding your existing and prospective customers has always been key to delivering the right product or service at the optimal price to generate and grow revenue. Customers' constantly changing interests and demands can be instrumental in developing new products, enhancing existing ones, or matching the most appropriate product to the customer. In the past, collecting this information was limited to surveys or focus groups and constrained by the questions being asked; sometimes the most important information comes from questions that were never asked. Furthermore, delays in collecting and analyzing the information could lead to ineffective or outdated promotions.

The age of Big Data and the social network has broken through this barrier, and it is now possible to "listen" to the data. Now more than ever it is essential to manage the high velocity of data faster, smarter, and easier.

Collaborative Feedback

The objective becomes focused on collaborative feedback from all consumers of a product or service and on correlating that feedback with operational metrics.

Correlation is not cause and effect; rather, it helps you understand the relationship between two variables, such as sentiment score and revenue, so that if one variable moves in a specific direction, the other would be expected to move in proportion to the relationship. Correlation measures relationships from weak to strong in either the positive or negative direction.
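To make this concrete, here is a minimal sketch of measuring that relationship with a Pearson correlation coefficient. The weekly sentiment and revenue figures are invented for illustration; a real analysis would pull these from the sentiment pipeline and the data warehouse.

```python
import math

# Hypothetical weekly averages: sentiment score (-1..1) and revenue (in $k).
sentiment = [0.2, 0.5, 0.1, -0.3, 0.4, 0.6, -0.1, 0.3]
revenue   = [110, 125, 105,   88, 120, 132,   95, 118]

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(sentiment, revenue)
print(f"correlation: {r:.2f}")
```

A value near +1 or -1 indicates a strong relationship; a value near 0 indicates a weak one. In practice you would use a statistical package (for example `scipy.stats.pearsonr`) rather than hand-rolling the formula.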

This provides information that can be used to develop a predictive model that lets the business understand the impact of positive and negative comments. It is not just a rear-view mirror of what happened, but a predictor, with a degree of confidence, of what will happen to one variable when the other is moved in a certain direction.
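The simplest such predictor is an ordinary least-squares line fit to the two variables. The data below is the same invented sentiment/revenue series used for illustration; the point is only to show how a fitted relationship turns into a forward-looking estimate.

```python
# Hypothetical weekly data: average sentiment score and revenue ($k).
sentiment = [0.2, 0.5, 0.1, -0.3, 0.4, 0.6, -0.1, 0.3]
revenue   = [110, 125, 105,   88, 120, 132,   95, 118]

n = len(sentiment)
mx = sum(sentiment) / n
my = sum(revenue) / n

# Least-squares slope and intercept for revenue as a function of sentiment.
slope = (sum((x - mx) * (y - my) for x, y in zip(sentiment, revenue))
         / sum((x - mx) ** 2 for x in sentiment))
intercept = my - slope * mx

# Predict revenue if average sentiment improves to 0.7.
predicted = intercept + slope * 0.7
print(f"revenue = {intercept:.1f} + {slope:.1f} * sentiment")
print(f"predicted revenue at sentiment 0.7: ${predicted:.0f}k")
```

The "degree of confidence" the text mentions would come from the model's goodness of fit and prediction intervals, which a statistics library such as R's `lm` or Python's `statsmodels` reports alongside the coefficients.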

This can be used to take action and change what will happen in the future, extending analytic capabilities into a competitive advantage, assuming the analytic can be accelerated and integrated across the analytical ecosystem.

While correlation is not cause and effect, it is a process of having consumers guide you to where you may need to improve your product or service to keep up with popular trends. It is consumers producing collaborative feedback on what they have already experienced, and then identifying the impact it may have on your revenue.

Data Sources and the Data Lake

The challenge is the various sources of data, the volume, and the variety of formats in which the data may arrive. Sources of consumer sentiment may be as structured as survey results, as unstructured as free-form social media posts, on-line review text, or audio recordings, or somewhere in between, as when free-form text is embedded in a structured document such as XML or JSON.

Survey results are usually comments or feedback based on questions targeting a specific topic while social media posts and on-line reviews are open discussions that could include multiple topics.

The Hadoop Distributed File System (HDFS) simplifies the process of collecting and storing all this data, but you still need to manage that collection. Hadoop is still a file system: a collection of files and directories (or folders) that can be organized in any way the contributor of the data decides.

Have you ever had access to a shared drive that many people in your department could read and write, and watched it grow into such a disorganized collection that finding or understanding what was stored there became impossible, or at least intolerable? That is the potential hazard of a mismanaged data lake on Hadoop.

The concept of the Data Lake and the lure of Big Data are all about doing things faster and easier, but you still need to do them smarter. Using Hadoop as the Data Lake can make storing files faster and easier, but you still need to consider implementing governance patterns, establishing cohesive storage patterns, defining flexible and robust ingestion patterns, accelerating and optimizing analytic processing design patterns, and tracking data lineage through metadata management. Skipping these steps can turn your Data Lake into a Data Swamp. It could also land you in hot water if a regulatory audit requires you to produce data lineage to support business results or claims based on Big Data sources.

Keep in mind that the Data Lake is not just confined to Hadoop. The relational database management system combined with metadata management can also be managed as part of the Data Lake. Essentially the Data Lake can span across multiple platforms across the analytical ecosystem.

Analyzing the Sentiment

Once the data is collected, you need to analyze it. With so much data potentially covering a wide variety of topics, we come back to doing things smarter as well as faster and easier.

Consider the objective of the analysis, the business question. What is it about our product that gets people to buy it? What is it about our product that makes people consider it and then not buy it?

The traditional approach is marketing surveys or focus group conversations that home in on specific topics. While there may be some leeway to divert to more relevant topics, it is the social media posts and on-line reviews that can provide an abundance of information to answer questions that were never asked, or never even considered. This is because posts are typically unsolicited: some part of a consumer's experience prompted them to say something about it. It could be something bad about the product, something good about the product, something new they would like to see, something about the service, any of the above, all of the above, or any combination of additional topics, all in a single post or review.

The first order of business is cleaning the text of misspellings, improper grammar, and other anomalies that may impact the accuracy of predictions, while retaining the original meaning. This is followed by applying the appropriate statistical methods to predict sentiment by topic and correlating those scores with data in the warehouse to identify the potential impact on revenue, customer loyalty, operational costs, and more.
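A minimal sketch of that cleanup step might look like the following. The corrections dictionary here is purely illustrative; a production pipeline would use a full spelling-correction model or dictionary rather than a hand-picked map.

```python
import re

# Illustrative misspelling map (an assumption for this sketch, not a real
# lexicon); real systems use trained spelling-correction models.
CORRECTIONS = {"servise": "service", "grate": "great", "definately": "definitely"}

def clean_text(text):
    """Normalize free-form review text while keeping its meaning."""
    text = text.lower()
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # "sooooo" -> "soo"
    text = re.sub(r"[^a-z0-9\s']", " ", text)     # strip punctuation/emoticons
    words = [CORRECTIONS.get(w, w) for w in text.split()]
    return " ".join(words)

print(clean_text("The servise was GRATE!!! Sooooo happy :-)"))
```

Even this small amount of normalization helps a downstream model see "servise" and "service" as the same signal instead of two different tokens.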

There are numerous statistical methods that can be applied in the text analytic process to analyze and predict what the data is telling you. This can be a lengthy process, done with statistical equations applied through programming languages such as R, Python, or Java.

There is a lot to do, so the objective is to get it done faster, smarter, and easier than your competition. Since not many companies have reached this level, doing so can put you ahead.


There is an abundant supply of libraries and source code available from CRAN (the Comprehensive R Archive Network) and GitHub. That code needs to be examined and tested, and typically goes through a software life cycle of edit, compile, and test.

If the Aster Analytic Foundation (either on Hadoop or on its own Discovery Platform) is available, then its engineered and tested Text Analytic functions can be applied with parameters and trained to sanitize all the free-form text data. Functions can be strung together so that the output of one function feeds the next, creating a data analytic factory within a single SQL statement or script.

This can then be fed into functions that identify the specific topics being discussed, considering relationships of words, compound words and phrases, and detecting sarcasm or alternate meanings. The results can then be directed into functions that classify topics or identify the entities being discussed, and then extract sentiment scores that go beyond simple positive or negative, using topic classification or named-entity recognition to connect sentiment to a specific topic about a specific product or service.
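The idea of topic-scoped sentiment can be sketched in a few lines. This is only a toy stand-in for the trained Aster functions: the tiny topic and polarity lexicons below are invented for illustration, and real functions use statistical models rather than word lists.

```python
# Tiny illustrative lexicons (assumptions for this sketch).
TOPIC_TERMS = {"service": {"service", "staff", "waiter"},
               "food": {"food", "meal", "pizza"}}
POSITIVE = {"great", "friendly", "delicious"}
NEGATIVE = {"slow", "cold", "rude"}

def topic_sentiment(review):
    """Score each sentence's polarity and attribute it to mentioned topics."""
    scores = {}
    for sentence in review.lower().split("."):
        words = set(sentence.split())
        polarity = len(words & POSITIVE) - len(words & NEGATIVE)
        for topic, terms in TOPIC_TERMS.items():
            if words & terms:
                scores[topic] = scores.get(topic, 0) + polarity
    return scores

print(topic_sentiment("The pizza was delicious. The staff was rude and slow."))
```

Note how a single review yields different scores for different topics, which is exactly what a star rating alone cannot express.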

In the case of on-line reviews there are opportunities to compare sentiment to ratings: a rating is a subjective overview, while sentiment can be scored against every topic, adding a new dimension to the rating. There can be real differences, where a high rating does not represent the true sentiment of the author.

There is a lot of the scientific method that needs to be applied to the data and this requires time and effort to design and code.

The Aster Analytic Foundation (AAF) functions accelerate development because they have already been developed and tested in a well-equipped lab environment, and they are extensible beyond their original definition just like any other Java class. The functions themselves are embedded in the abstract language of SQL to facilitate development of the analytic solution.

This does not, by any means, simplify the data science but it does accelerate it. You still need to conduct the science and statistical methods behind the exploratory data analysis and choose the functions that must be applied to the data. You may integrate the use of multiple programming languages such as Python, R, Java, etc.


The appeal of unstructured (non-coupled) and loosely coupled data is that no extensive, lengthy work is needed to define a schema or structure before the data can be loaded. This doesn't mean a schema is never applied; it just doesn't need to be applied initially to get to the data, so acquisition is faster. To be smarter, though, the output of the analytical data set is defined in a schema that establishes relationships to business data, which is best managed in a data warehouse where tightly coupled data lives: data pertaining to revenue performance such as sales, costs, gross margin, customer traffic, and waste.
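This "schema on read" idea can be sketched simply: raw posts land in the lake as-is, and structure is projected onto them only when the analysis reads them. The JSON records below are invented for illustration.

```python
import json

# Raw, loosely structured posts as they might land in the data lake;
# note that fields come and go from record to record.
raw = '''
[{"user": "a1", "text": "Love the new blend!", "stars": 5},
 {"user": "b2", "text": "Checkout line was slow."},
 {"user": "c3", "stars": 3}]
'''

# Schema is applied on read: project only the fields the analysis needs,
# supplying defaults where the loosely coupled source omits them.
rows = [(p.get("user"), p.get("text", ""), p.get("stars"))
        for p in json.loads(raw)]
for row in rows:
    print(row)
```

The resulting uniform tuples are what you would then load into the warehouse table that joins against revenue data.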

Then the correlation can be made between the sentiment by topic/entity and revenue performance. This can give you insights into why revenue has dropped for a specific product, store location, service, etc., as well as let you drill down to more specific topics such as the type or provider of a service, an attribute of a product, or an attribute of a store location.

Even smarter is the grid computing approach, where data processing and integration are distributed among heterogeneous platforms, leaving the bulk of the data where it lives and retrieving only the data sets needed to satisfy the analytic. This means data is processed on the platform best suited to it. Bulk, set-theory-based manipulation of merged data sets with one-to-one or one-to-many relationships may be best approached with SQL managed by a robust database management system, while data sets with more complex processing requirements due to many-to-many relationships may be best approached with a procedural programming language on a NoSQL or distributed file system cluster. Teradata QueryGrid provides the means to combine multiple methods across multiple platforms in the analytical ecosystem, in parallel where possible.

This can also be correlated with other Big Data sources such as web visitor navigation, using the Aster Analytic Foundation nPath function to identify whether a path that led to an abandoned cart was preceded by a visit to an on-line review or social media post with a strong negative sentiment score.
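As a plain-Python stand-in for the pattern nPath would match (this is not the nPath syntax, just the idea), the check is: does a negative-review page view occur earlier in the session than a cart abandonment? The session data below is invented.

```python
# Hypothetical clickstream sessions (event names are assumptions).
sessions = {
    "s1": ["home", "negative_review", "product", "add_to_cart", "abandon"],
    "s2": ["home", "product", "add_to_cart", "checkout", "purchase"],
    "s3": ["search", "product", "add_to_cart", "abandon"],
}

def review_preceded_abandon(events):
    """True when a negative-review view occurs before a cart abandonment."""
    try:
        return events.index("negative_review") < events.index("abandon")
    except ValueError:       # one of the two events never happened
        return False

flagged = [sid for sid, ev in sessions.items() if review_preceded_abandon(ev)]
print(flagged)
```

In the real product, nPath expresses this as a symbol pattern over ordered events inside SQL, and runs it in parallel across the cluster instead of in a Python loop.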

The feedback loop is completed by then using the marketing products of the social media review site for smarter messaging to customers and prospects to influence the very specific sentiment topics that have the most impact to revenue generation and maximizing the marketing investment.


The data acquisition and the data science that produce algorithms answering these important business questions need to get into the hands of business users, and they need to be easy to use.

Teradata Listener provides a self-service ability to ingest and distribute fast moving data throughout the analytical ecosystem. This is done by pointing Listener to the data source and applying the algorithms within the analytical ecosystem to consume the data as it flows in.

The Teradata AppCenter is where many analytic applications can be parameterized with the underlying code developed by the data scientist and the connections to multiple platforms, technologies, and languages of the analytical ecosystem are hidden from the end-user. The end-user need only select the analytic app and provide the criteria under which to execute the app. The app's end result is often an interactive visualization to navigate through the results and to drill through scenarios that can identify what actions may be needed to change the future.

When your analytical ecosystem comprises the technologies and tools needed to accelerate and integrate analytic capabilities, it will produce results faster, smarter, and easier, and lead to your competitive advantage.