In-Database Python for Fine Food Reviews

Learn Data Science
Teradata Employee

Recently I reviewed the sentiment analysis data provided by a customer and wanted to get a quick idea of the positive/negative words in their customer survey notes. To accomplish this I created a python script that runs in-database and takes the output from the standard Aster sentiment analysis function to provide more information about the contents of the survey notes.

If you do not have a favorite text data set I suggest downloading "Amazon fine food reviews" from kaggle.  See reference section below for a download link.

Prep

First we create a table to stage our text data and load with ncluster_loader:

ncluster_loader -U myuser -d mydb --skip-rows 1 -c amzn_reviews Reviews.csv

Now we can run the ExtractSentiment function. For simplicity we will use the built in dictionary instead of building our own model. 

The function outputs a standard POS/NEU/NEG polarity including a strength indicator. One benefit of using a dictionary to analyze the text data is that we get an extra field in the output: "out_sentiment_words".  This field gives us more information about the words used to determine sentiment,  their frequency in the sentence or document and a total positive and negative score.

Way back in 2013 I created a perl script to parse this data and store the output in an easy to use table. The perl script is quite simple and is run using the Aster stream function. Many people dislike perl for its syntax.  For this blog post I decided to quickly convert the script to python 2.x and show you how to run python in database. Note that you can accomplish a similar result with the regexp function in combination with the pivot SQL/MR.

What do we need to run python in-database?

  1. specify the input data
  2. a python script
  3. the script has to be installed in Aster.
  4. define the output what we want

Let's review these requirements in more detail:

  1. By default the stream function expects tab delimited input. As an example we will take the id field from our fine food reviews dataset as the first field and the out_sentiment_words as the second field to pass to the stream function.
  2. Our python script will be called "parseSentiment.py":
  3. The python script is installed in Aster (in the system database table) using the act command-line tool.

    act -U myuser -d mydb -w -c "\remove parseSentiment.py"

    act -U myuser -d mydb -w -c "\install parseSentiment.py"

  4. We specify the fields we want to see in the output and their type.

Python in-database

Now that we know how to run the stream function we can dig a bit deeper into the python script.

We use the default tab delimiter to parse the input data fed to the stream function. A loop reads the data from STDIN. When we encounter an empty line we assume all data has been read and exit the script:

I will not discuss the simple data manipulation done by the script. What is important is that we output our desired results:

And our output fields have to match our definition in the OUTPUTS parameter for the stream function.

The last line in the script is also important. We need to properly flush the buffer to make sure we obtain all the output.

Note that in python 3.x you can type  print (i, end='') which will automatically flush the buffer.

Our output

Finally we review out result in the expanded format that we wanted.  For each document id we get a summary of the total positive and negative score (a simple count of the words) and a list of all the detected words.  The word_score is always 1 for positive words and -1 for negative words.

To present the results we can generate a wordcloud of all the top words based on their regular frequency.  

To get a better idea of the importance of the words we can run TF/IDF against all documents and join the result with our amzn_detail_output table to allow filtering on only the positive (where word_score = 1)  or negative words  (where word_score = -1).

Top 100 Positive words based on tf/idf scores

Top 100 Negative words based on tf/idf scores

Building a new classification model?

One option is to rely on external and third party sentiment analysis models. For example scoring the call notes for a telecom customer using publicly available movie reviews will not be that effective or relevant. Ideally a lot of time and effort is spent on manually reviewing the data and assigning the correct categories for sentiment analysis.  For those situations where time is limited this approach could be a potential alternative.

Now that we have identified many of the common positive and negative words we can manually review the remaining documents/survey notes that have few identified words.  Once those leftover entries have been manually categorized we can build and fine tune a new sentiment analysis model.  

References

Sample data set:

https://www.kaggle.com/snap/amazon-fine-food-reviews 

Belgian chocolates store display:

Verdonck | Winkel