Recently I reviewed the sentiment analysis data provided by a customer and wanted to get a quick idea of the positive and negative words in their customer survey notes. To accomplish this I created a Python script that runs in-database and takes the output of the standard Aster sentiment analysis function to provide more information about the contents of the survey notes.
If you do not have a favorite text data set, I suggest downloading "Amazon Fine Food Reviews" from Kaggle; see the reference section below for a download link.
First we create a staging table for the text data and load it with ncluster_loader:
ncluster_loader -U myuser -d mydb --skip-rows 1 -c amzn_reviews Reviews.csv
Now we can run the ExtractSentiment function. For simplicity we will use the built-in dictionary instead of building our own model.
The function outputs a standard POS/NEU/NEG polarity along with a strength indicator. One benefit of using a dictionary to analyze the text data is that we get an extra field in the output: "out_sentiment_words". This field tells us which words were used to determine sentiment, their frequency in the sentence or document, and the total positive and negative scores.
Way back in 2013 I created a Perl script to parse this data and store the output in an easy-to-use table. The Perl script is quite simple and is run using the Aster stream function. Since many people dislike Perl for its syntax, for this blog post I decided to quickly convert the script to Python 2.x and show you how to run Python in-database. Note that you can accomplish a similar result with the regexp function in combination with the Pivot SQL/MR.
Let's review these steps in more detail. First we install the Python script on the cluster with act, removing any previous version first:
act -U myuser -d mydb -w -c "\remove parseSentiment.py"
act -U myuser -d mydb -w -c "\install parseSentiment.py"
Now that we know how to run the stream function, we can dig a bit deeper into the Python script.
We use the default tab delimiter to parse the input data fed to the stream function. A loop reads the data from STDIN, and when we encounter an empty line we assume all data has been read and exit the script:
I will not discuss the simple data manipulation done by the script. What is important is that we output our desired results:
And our output fields have to match the definition in the OUTPUTS parameter of the stream function.
The last line in the script is also important: we need to flush the output buffer to make sure the stream function receives all of our output. In Python 2.x this is an explicit sys.stdout.flush(); note that in Python 3.x you can write print(i, end='', flush=True), which flushes the buffer for you.
Finally we review our result in the expanded format that we wanted. For each document id we get a summary of the total positive and negative scores (a simple count of the words) and a list of all the detected words. The word_score is always 1 for positive words and -1 for negative words.
To present the results we can generate a word cloud of the top words based on their raw frequency.
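The frequency counting behind such a word cloud can be sketched with the standard library (the words list below is made-up sample data; rendering the actual cloud image would be done with a separate tool such as the wordcloud Python package):

```python
from collections import Counter

# Hypothetical list of sentiment words pulled from the detail output table.
words = ["great", "love", "great", "bad", "love", "great"]

freq = Counter(words)          # word -> raw frequency
top = freq.most_common(2)      # the N most frequent words
print(top)                     # → [('great', 3), ('love', 2)]
```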
To get a better idea of the importance of the words, we can run TF/IDF against all documents and join the result with our amzn_detail_output table, which allows filtering on only the positive (word_score = 1) or negative (word_score = -1) words.
Top 100 positive words based on TF/IDF scores
Top 100 negative words based on TF/IDF scores
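The idea behind those TF/IDF scores can be sketched in a few lines of standard-library Python. The toy corpus here stands in for the review documents, and this is only one common TF/IDF variant (unsmoothed, natural log); in practice Aster computes this at scale in-database:

```python
import math

# Toy corpus: each document is a list of tokens (placeholder data).
docs = [
    ["great", "taste", "great", "price"],
    ["bad", "taste"],
    ["great", "service"],
]

def tf_idf(term, doc, docs):
    """Term frequency in one document times inverse document frequency.

    Assumes the term appears in at least one document (df > 0).
    """
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)   # number of documents containing term
    idf = math.log(len(docs) / df)
    return tf * idf

# "great" is frequent in doc 0 but also common across the corpus,
# so its IDF dampens the score.
score = tf_idf("great", docs[0], docs)
```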
One option is to rely on external and third-party sentiment analysis models, but these are not always a good fit: scoring the call notes of a telecom customer using a model built on publicly available movie reviews, for example, will not be very effective or relevant. Ideally a lot of time and effort is spent manually reviewing the data and assigning the correct categories for sentiment analysis. For situations where time is limited, this approach is a potential alternative.
Now that we have identified many of the common positive and negative words, we can manually review the remaining documents and survey notes that have few identified words. Once those leftover entries have been manually categorized, we can build and fine-tune a new sentiment analysis model.
Sample data set:
Belgian chocolates store display: