Text analytics and Searching for Profanity

Learn Aster
Teradata Employee

First of all please accept my apologies for the use of pseudo profanity.  This is a real request I received this week and I thought I would share.

I was just recently asked by a client to extract profanity from client notes and other sources.  I found it to be a very simple use case and wanted to share it with the community.  I am sure there is more to this use case than just the matter at hand and hope to go back to them for the rest of the story.  They wanted to learn how to use a dictionary to find profanity in client/customer complaints.  I really do want to learn more about what they are trying to accomplish.

In pre 6.10 versions of Aster you would have to load a dictionary file on to the Aster Queen and then install that file.  This was a very cumbersome way of doing this and for some impossible as the queen might be locked down.  Now you can build an Aster table and load that dictionary. 

To implement this in Aster is very easy.  You create two tables and use one SQL-MR function.  The description and DDL is below.

  •      There is a dictionary table that holds the words or things you want to find:  dict
  •      There is a complaints or source table that holds the text you want to search:  complaint
  •       We then use the NER SQL-MR function in Aster to find the dict items contained in the complaints table.  Named entity recognition (NER) is a process for finding mentions of specified         entities in text.  For example, a simple news named-entity recognizer for English might find the person “John J. Smith” and the location “Seattle” in the text string “John J. Smith lives in       Seattle.”

--dict contains the dictionary of words you want to search for and  index:

CREATE TABLE dict (

        type VARCHAR,

          dict VARCHAR

)

DISTRIBUTE BY REPLICATION;

--Entries in the dict table:

type                     dict    

-------                 --------

LVL1                crap    

LVL2                bullshat

LVL3                fack 

--This table contains the actual customer notes or complaints to search through using the dict table entries.

CREATE TABLE complaint(

         id  INTEGER,

         complaint VARCHAR

)

DISTRIBUTE BY HASH(id);

id           complaint

-----      ------------------------------------------

2           This thing is bullshat

1           Please dont even pretend to give a crap   

3           The customer service rep didnt give a fack

3 record(s) selected [Fetch MetaData: 0ms] [Fetch Data: 203ms]

[Executed: 4/5/2016 9:20:05 PM] [Execution: 81ms]

--This is the single SQL-MR statement to look for the profanity in the complaint table from the terms used in the dict table.

SELECT * FROM NER(

ON complaint PARTITION BY ANY

ON dict as dict DIMENSION

TEXTCOLUMN('complaint')

ACCUMULATE('id', 'complaint')

);

id           complaint                                                           sn     entity            type                start          end      approach   

-----       ------------------------------------------                -----  ---------         -------            --------      ------       -----------

2           This thing is bullshat                                         1      bullshat           LVL2               4           4           DICT       

1           Please dont even pretend to give a crap             1      crap                LVL1                8           8           DICT       

3           The customer service rep didnt give a fack        1      fack                LVL3               8           8           DICT       

3 record(s) selected [Fetch MetaData: 0ms] [Fetch Data: 8s]

[Executed: 4/5/2016 9:12:55 PM] [Execution: 8s]