Text Analytics: Topic Clustering using TF_IDF, LDA, and cFilter


Over the past couple of weeks, two text analytics projects came across my desk, one in banking and one in healthcare, and I wanted to develop a repeatable pattern that others at these clients could reuse. The use cases had different flavors, but both came down to a couple of questions: How can I use the text coming in from people to better serve their interests? And how can I tag, identify, and respond to topics of interest? The data in question was either a complaint being filed or a patient visit. Neither customer had a training data set, so we had to use an unsupervised method to classify these text documents.

PLEASE TAKE NOTE: I still need to tune the model and do other data preparation steps, but this is a good pattern to start from.

The rest of this blog post will demonstrate how to use Aster text analytics to identify clusters of topics and then visualize and classify them. You can do it yourself if you want to. First we will go over the results, and then we will go into the code:

Data Source: Amazon Reviews

You can get it here: https://aster-community.teradata.com/groups/hackathon/content?filterID=contentstatus%5Bpublished%5D~...

Basic Approach:

   Tokenize - using nGram (I should have used Text_Parser, which would have let me stem and remove stop words)

   TF_IDF - mathematically find words of significance (see the formula just after this list)

   LDA - identify topics across all documents

   cFilter - collaborative filter to associate topic words across documents

   AppCenter Sigma visual
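
For reference, Aster's tf_idf function computes a variant of the standard TF-IDF weighting, which scores a term higher the more often it appears in a document and the fewer documents it appears in:

tf_idf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.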

Output/Visualizations:

Let's start with the punch line of the joke first, shall we? The visual below is the output of the cFilter statement that associates the topic words across thousands of Amazon reviews. You can see there are 4 modularity classes of documents where there should have been 5, but as you will see in the details, that actually makes sense: two of the clusters are very similar, so our visual tool did us a favor. I used LDA to identify 5 topics and 15 words per topic. These numbers can be tuned and changed in the code.

As you can see in the graphic below, each cluster has its own color: light blue, blue, green, and red. The red group is actually a combination of two of the LDA topic clusters identified.

ONE THING I WILL HAVE TO DO IS REMOVE STOP WORDS. As you can see, some words should be removed, such as: or, of, is... Doing this is easy: I would use Text_Parser instead of nGram, along the lines of the sketch below.
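
Here is a minimal sketch of what that tokenization step could look like with Text_Parser in place of nGram. The argument names below are from memory and vary between Aster Analytics versions, so treat this as a starting point and check the Text_Parser reference for your release:

CREATE FACT TABLE ss_amazon_parsed DISTRIBUTE BY HASH(r_id) AS
SELECT * FROM Text_Parser (
    ON ss_amazon_w_id
    TextColumn ('summary')
    CaseInsensitive ('true')    -- lowercase everything, as ToLowerCase did for nGram
    Stemming ('true')           -- reduce words to their stems
    RemoveStopWords ('true')    -- drop "or", "of", "is", and friends
    OutputByWord ('true')       -- emit one row per token, like the nGram output
    Accumulate ('r_id')
);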

The five topic clusters and words that LDA identified for my Amazon data set were:


-- TOPIC 0: are,these,they,them,chocolate,chips,sweet,snack,love,cookies,you,like,taste,great,sugar
-- TOPIC 1: was,amazon,price,product,order,were,at,from,box,store,in,buy,find,ordered,again
-- TOPIC 2: it,use,tea,make,hot,great,mix,is,with,easy,this,add,milk,you,or
-- TOPIC 3: coffee,tea,flavor,taste,it,cup,is,this,drink,like,not,strong,of,tried,good
-- TOPIC 4: food,dog,he,she,my,loves,her,dogs,treats,our,them,we,cat,to,it

I have not yet run the companion LDA inference function to associate the source records with the topic clusters; I will do that and then explore them more closely (a sketch of that call follows the list below). On first blush, though, it appears that the topic classifications could be:

0 – SWEET SNACKS (Bogie in the data is possibly CHIPS)
1 – AMAZON LOGISTICS (looks like comments from customers regarding Amazon operations?)
2 – HOT DRINKS – TEA
3 – HOT DRINKS – COFFEE and TEA
4 – PET FOOD – PET SUPPLY

So these could be my classifications for the topics identified in the Amazon reviews data source. Further exploration and tuning are required to refactor my code, and again, I should remove stop words.
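
Sketching ahead, the record-to-topic association would presumably use the LDAInference function that pairs with LDATrainer in Aster Analytics. Argument names may differ in your version, and the output table name here is hypothetical:

SELECT * FROM LDAInference (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('ss_amazon_tfidf_out')        -- same doc/term/weight input used for training
    ModelTable ('ss_amazon_ldamodel')         -- model produced by LDATrainer below
    OutputTable ('ss_amazon_lda_doc_topics')  -- hypothetical name for the per-document topic table
    DocIDColumn ('doc_id')
    WordColumn ('term')
    CountColumn ('tf_idf')
    OutputTopicNumber (1)                     -- keep only the best topic per document
);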

Please see my code below:

DDL for the Table:

drop table if exists ss_amazon;

create table ss_amazon
(
    prod_id varchar,
    uid varchar,
    uid_name varchar,
    helpful varchar,
    score varchar,
    time_val varchar,
    title varchar,
    summary varchar
)
distribute by hash(uid);

 

--BUILD ROW ID:
--NOTE: There is a better way of doing this (see the sketch just below), but I needed a clear row identifier.
CREATE VIEW v_ss_amazon AS
SELECT prod_id || '-' || uid || '-' || time_val AS r_id, * FROM ss_amazon;
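
A sketch of that "better way": a window function can mint a surrogate row ID instead of concatenating columns (assuming your Aster release supports row_number(), which standard nCluster SQL does):

CREATE VIEW v_ss_amazon AS
SELECT row_number() OVER (ORDER BY prod_id, uid, time_val) AS r_id,  -- unique sequential id per review
       *
FROM ss_amazon;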

I used nCluster_Loader to load my file:

--568453 total records
--ncluster_loader -U xxxx -w xxxx -d beehive -c public.ss_amazon amazon.txt --el-enabled --el-discard-errors

CREATE TABLE ss_amazon_w_id DISTRIBUTE BY HASH(uid) AS
SELECT * FROM v_ss_amazon;

 

SELECT * FROM ss_amazon_w_id LIMIT 10;

 

--TOKENIZE RECORDS USING NGRAM
--DROP TABLE ss_amazon_tfidf;

CREATE FACT TABLE ss_amazon_ngram DISTRIBUTE BY HASH(r_id) AS
SELECT * FROM nGram (
    ON ss_amazon_w_id
    TextColumn ('summary')
    Delimiter (' ')
    Grams ('1')
    Overlapping ('false')
    ToLowerCase ('true')
    Punctuation ('\[.,?\!\]')
    Reset ('\[.,?\!\]')
    Total ('false')
    Accumulate ('r_id')
);

 

--REVIEW RESULTS
SELECT * FROM ss_amazon_ngram ORDER BY 1,3,2 LIMIT 100;

 

--PERFORM TF_IDF

CREATE FACT TABLE ss_amazon_tfidf_in DISTRIBUTE BY HASH(term) AS
SELECT r_id AS doc_id, ngram AS term, frequency AS count
FROM ss_amazon_ngram;

SELECT * FROM ss_amazon_tfidf_in ORDER BY 1,3,2 LIMIT 100;

 

CREATE FACT TABLE ss_amazon_tfidf_out DISTRIBUTE BY HASH(doc_id) AS
SELECT * FROM tf_idf (
    ON TF (
        ON ss_amazon_tfidf_in PARTITION BY doc_id
    ) AS tf PARTITION BY term
    ON (SELECT COUNT (DISTINCT doc_id) FROM ss_amazon_tfidf_in
    ) AS doccount DIMENSION
);

 

--REVIEW RESULTS:
SELECT * FROM ss_amazon_tfidf_out ORDER BY doc_id LIMIT 100;

 

 

--NOW: latent Dirichlet allocation (LDA)

 

SELECT * FROM LDATrainer (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('ss_amazon_tfidf_out')
    ModelTable ('ss_amazon_ldamodel')
    OutputTable ('ss_amazon_ldaout1')
    DocIDColumn ('doc_id')
    WordColumn ('term')
    CountColumn ('tf_idf')
    TopicNumber (5)
    OutputTopicNumber (1)
    OutputTopicWordNumber (15)
);

 

 

SELECT * FROM ss_amazon_ldaout1 WHERE topicid = 2  LIMIT 1000;

 

--COLLABORATIVE FILTER
--DATA PREPARATION

create table ss_amazon_ldaout1_spl distribute by replication as
select distinct topicid, regexp_split_to_table(topicwords, ',') as all_words
from ss_amazon_ldaout1;

 

SELECT * FROM ss_amazon_ldaout1_spl LIMIT 100;
SELECT * FROM ss_amazon_tfidf_out LIMIT 100;

 

drop table if exists ss_amazon_ldaout1_cf_in;

create table ss_amazon_ldaout1_cf_in distribute by hash(doc_id) as
select * from ss_amazon_tfidf_out as b
inner join ss_amazon_ldaout1_spl as s
    on b.term = s.all_words;

 

--REVIEW RESULTS OF DATA PREPARATION:
SELECT * FROM ss_amazon_ldaout1_cf_in LIMIT 100;

 

--CFILTER:
SELECT * FROM CFilter (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('ss_amazon_ldaout1_cf_in')
    OutputTable ('ss_amazon_ldaout1_cf_out')
    InputColumns ('all_words')
    JoinColumns ('doc_id')
    OtherColumns ('topicid')
    DropTable ('true')
);
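
Before visualizing, it is worth eyeballing the strongest word-to-word associations directly. Assuming the CFilter output table carries its usual co-occurrence score column (it is named score in the releases I have used), something like this surfaces the top pairs:

SELECT * FROM ss_amazon_ldaout1_cf_out
ORDER BY score DESC   -- strongest word pairings first
LIMIT 20;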

 

--VISUALIZE RESULTS BY GOING DIRECTLY TO URL:
--BYPASS APPCENTER UI:
INSERT INTO public.viz
SELECT id, json, url FROM Visualizer (
    ON "ss_amazon_ldaout1_cf_out" PARTITION BY 1
    AsterFunction ('cfilter')
    Title ('GRAPH - AMAZON TOPIC/WORD CLUSTERS')
    URL ('visualization_server=10.25.98.32:444', 'profile=AIR', 'output_table=viz')
    VizType ('sigma')
);

 

SELECT * FROM public.viz;