Sentiment Analysis with Teradata Aster

Learn Aster

In today’s world of social media, online customer reviews, and nearly instant feedback, sentiment extraction can be a powerful tool in the hands of business users. It can allow them to keep their finger on the pulse of customers’ attitudes toward their products and/or services. It can enable them to diagnose potential issues before they get out of hand. It can empower them to make changes to better serve customers and to find out what customers really want. In turn, customers appreciate businesses that listen.

What is sentiment extraction, exactly? It is the process of taking text-based data generated by an audience of interest on a topic of interest, flagging words that express emotion, and then aggregating the counts of those words in order to calculate an overall sentiment score, which can be calculated at either the document level or the sentence level. Teradata Aster’s Sentiment Extraction suite of functions is a very powerful tool that can be used in conjunction with Teradata Aster AppCenter in order to calculate sentiment scores and to visualize the results.

As a concrete example, suppose that you own a car company (that is itself a Teradata Aster customer) and that you are interested in developing a self-driving car. Before you invest your valuable time and resources on such a project, you want to get a general idea of how your customers feel about self-driving cars, as well as what specific concerns they have and what benefits they see from having self-driving cars on the road. If your company has social media feeds, then you can use them to collect data on what your customers are saying about self-driving cars.

To demonstrate, let’s assume that an open-source data set from CrowdFlower.com consists of user comments from your company’s social media feeds. In June 2015, the contributors of CrowdFlower.com collected a data set of over 7,000 tweets. They marked whether or not these tweets were relevant to the topic of self-driving cars, and, for the relevant tweets, they rated the sentiment on a scale of 1 to 5 (1 = stong negative, 5 = strong positive). The full data set can be downloaded here: http://www.crowdflower.com/data-for-everyone (see "Twitter Sentiment Analysis: Self-Driving Cars).

Once you have downloaded the data, you load the data into a Teradata Aster Database. To do this, first create a table using the following SQL code:


--Create table to hold social media data (drop table first, if it already exists).

DROP TABLE IF EXISTS [schema].tweets;

CREATE TABLE [schema].tweets(

        Id VARCHAR(15),

        golden VARCHAR(10),

        unit_state VARCHAR(15),

        trustedjudgements INT,

        lastjudgementat TIMESTAMP,

        sentiment VARCHAR(15),

        sentimentconfidence FLOAT,

        diffId VARCHAR(10),

        sentimentgold VARCHAR(15),

        sentimentgoldreason VARCHAR(255),

        tweettext VARCHAR(255)

)

DISTRIBUTE BY HASH(Id)

            COMPRESS LOW;


Let’s also assume that you are using Aster Express in a Windows environment. In order to load the data, take the following steps:

  1. Download the Windows Aster Client .zip file that is appropriate for your system; this can be downloaded here: http://downloads.teradata.com/download/aster/aster-client-tools-for-windows.
  2. Drill down until you see the ncluster_loader.exe file. Copy this file to the file on your local system where your data is saved.
  3. Connect to your Aster Express queen node.
  4. Open the Windows Command prompt. In the prompt, change directories to the file where you have both your data and the ncluster_loader.exe file saved.
  5. Run the following code in order to load your data:

ncluster_loader.exe –h [queen node IP address] –U [username] –w [password] –d [database] –c --verbose --skip-rows 1 --el-enabled [schema].tweets [Windows file path to where the input data set is saved. Include actual file name (with extension) in this path.]

IMPORTANT: Make sure that your Windows file path does not contain any spaces, or you will get an error when you try to run the code above.

Your data is now in a Teradata Aster database, inside the table that you created with SQL.

You can now use Teradata Aster’s Sentiment Extraction functions in order to calculate the sentiment score for this data. In this example, we will use the ExtractSentiment function with its built-in sentiment word dictionary in order to flag each comment as “positive,” “negative,” or “neutral.” We will then compare these results to the “true” sentiment that was assigned by the CrowdFlower.com contributors. (The ExtractSentiment function can also be run using the maximum entropy classification method instead of the sentiment word dictionary method, if desired). For this example, we will calculate sentiment scores on a document level, with each tweet being considered as a “document.” First, you run the following SQL code:

    --Clean out irrelevant tweets (marked ‘not_relevant’ by the CrowdFlower.com contributors)

    DROP TABLE IF EXISTS [schema].tweets2;

    CREATE TABLE [schema].tweets2

          DISTRIBUTE BY HASH(id)

          COMPRESS LOW

          AS (

          SELECT * FROM [schema].tweets 

                    WHERE sentiment NOT IN('not_relevant')

    );

    --Reformat sentiment column

    DROP TABLE IF EXISTS [schema].tweets3;

    CREATE TABLE [schema].tweets3

    DISTRIBUTE BY HASH(id)

    COMPRESS LOW AS (

SELECT *,

      CASE

            WHEN sentiment IN('4', '5') THEN 'POS'

            WHEN sentiment IN('3') THEN 'NEU'

            WHEN sentiment IN('1', '2') THEN 'NEG'

      END AS sentcat

FROM [schema].tweets2

    );

    --Use ExtractSentiment function - with default sentiment word dictionary

    --Wrap the results in a table statement

    DROP TABLE IF EXISTS [schema].allresults;

    CREATE TABLE [schema].allresults

    DISTRIBUTE BY HASH(id)

    COMPRESS LOW AS (

    SELECT * FROM ExtractSentiment (

    ON [schema].tweets3

    TEXT_COLUMN('tweettext')

    ACCUMULATE ('id', 'tweettext')

  )

    );

When you view your results, they look something like this:

Id

Tweettext

out_polarity

out_strength

out_sentiment_words

724324762

You know it's a good day when you see a [self-driving] car!

POS

2

"good 1. In total, positive score:1 negative score:0"

724327274

"Zombie cars, self driving cars, fast cars and...tanks"

NEU

0

"zombie -1, fast 1. In total, positive score:1 negative score:-1"

724327747

Zach Woods trapped in a rogue self-driving car is the funniest thing that has happened on Silicon Valley so far.

NEG

2

"trapped -1, rogue -1. In total, positive score:0 negative score:-2"

724327812

Does an #autonomous car make you want to get into a car more often? - @[company] not really

NEU

0

724327948

How much would you trust a #driverless car? -- @[company] far more than a human driver.

POS

2

"trust 1. In total, positive score:1 negative score:0"

724328160

If [company] creates self-driving taxi svc - Goober I may briefly forget the military robots thing. 'Splains wrinkled fenders

NEG

2

"wrinkled -1. In total, positive score:0 negative score:-1"

Table 1. Sample output – ExtractSentiment function

              Please note that sentiment analysis is not an exact science; you may get some comments that are misclassified. To calculate the recall and precision of your results, you use the EvaluateSentimentExtractor function and get the following results:

    --Running EvaluateSentimentExtractor

    SELECT * FROM EvaluateSentimentExtractor ( 

                ON (

              SELECT * FROM ExtractSentiment (

                                              ON kp186033.tweets3

                        TEXT_COLUMN('tweettext')

                        ACCUMULATE ('id', 'tweettext', 'sentcat')

              )

          ) PARTITION BY

        EXPECT_COLUMN('sentcat'

        RESULT_COLUMN('out_polarity')

    );

Evaluation_result

"positive record (total relevant, relevant, total retrieved): 1903 840 1975"

recall and precision: 0.44 0.43

"negative record (total relevant, relevant, total retrieved): 795 345 1042"

recall and precision: 0.43 0.33

"positive and negative record (total relevant, relevant, total retrieved): 2698 1185 3017"

recall and precision: 0.44 0.39

"neutral record (total relevant, relevant, total retrieved): 4242 2783 3923"

recall and precision: 0.66 0.71

Table 2. Recall and precision – classification of social media comments (positive, negative, or neutral)

You see that, when ExtractSentiment is used with its default sentiment word dictionary, the recall for positive comments is about 44%, while the precision for positive comments is about 43%. The recall measure means that, for all the comments that were truly positive, they were correctly flagged as being “positive” 44% of the time; the precision measure means that, for all the comments that ExtractSentiment flagged as “positive,” they truly were positive 43% of the time. For more on recall and precision, please see this article: https://en.wikipedia.org/wiki/Precision_and_recall. If you are not satisfied with the results obtained when using the default sentiment word dictionary, this dictionary can be downloaded, modified, and replaced in order to add new sentiment words or to change the score of existing words.

You then create a table that counts the number of positive, negative, and neutral comments using the following code:

    --Count number of positive, negative, and neutral comments

    DROP TABLE IF EXISTS [schema].countallresults;

    CREATE TABLE [schema].countallresults

    DISTRIBUTE BY HASH(out_polarity)

    COMPRESS LOW AS(

          SELECT out_polarity, count(out_polarity) AS commentcount

                    FROM [schema].allresults

                    GROUP BY out_polarity

    );

After getting the counts for the positive, negative, and neutral comments, you then create a bar chart using an app in Teradata Aster AppCenter in order to visualize the results. The app code should look like this:


    --Visualization code from AppCenter

    INSERT INTO app_center_visualizations  (json)

    SELECT json FROM Visualizer (

    ON "[schema].countallresults" PARTITION BYORDER BY out_polarity

    AsterFunction('custom')

    Title('Sentiment Analysis: Self-Driving Cars - Social Media Comments')

    VizType('bar')

    ColumnMap('x1=out_polarity','y1=commentcount')

    VizOptions('aggregation=unordered','coordinateaxes="Sentiment","",

    CATEGORICAL;"Count","",NUMERIC'

)

          );

You run the app and obtain the following results:


Bar Chart - Sentiment Analysis.PNG


Figure 1. Teradata Aster AppCenter Results: Sentiment Analysis – Social Media Comments on Self-Driving Cars

According to these results, customer sentiment toward self-driving cars is neutral overall, although substantially more positive comments have been made than negative comments. You decide that developing a self-driving car warrants further investigation. Of course, this is just one example of how Teradata Aster’s Sentiment Extraction suite of functions can be used to drive business decisions. If you have questions about these functions, please feel free to contact me at kate.phillips@teradata.com.