ASTER UNDERGROUND: Verbatim_Cleansing.zip

Aster Field Strong
Teradata Employee

This is a custom SQL-MR function built by the field and is not supported by Teradata engineering or warranty.  Please test first and use at your own risk.

Uses a dictionary to replace words in a dictionary table. 

I am attaching the SQL-MR function and changing the extension to .zi_ instead of .zip  

You may have to grant execute privs to the function to your user id.

You will need to get this installed on your system to use it.  let me know if you need help. Here is how it works:

Build two tables:

CREATE TABLE dict2 (

        type VARCHAR,

        dict VARCHAR

)

DISTRIBUTE BY REPLICATION;

/*  Table contains:

  type     dict

-------  -------

1     brown

2     cat

*/

CREATE TABLE complaint(

    id  INTEGER,

    complaint VARCHAR

)

DISTRIBUTE BY HASH(id);

/*

id     complaint

-----  ------------------------------------------

1      Please dont even pretend to give a crap   

3      The customer service rep didnt give a fack

5      The brown cat bullshat on the rug         

2      This thing is bullshat

4      The brown cat

*/

--the SQL MR statement to replace words in the dict2 table with the REPLACEMENT_TEXT string predicate:

We would still need to remove the messages that use the curse words and that can be done with the NER function.  I wanted to show that in two steps we can achieve your goals.

select *

from verbatim_cleansing(

   ON (select id, complaint from complaint) AS VERBATIM

   PARTITION BY id

   ON (select * from dict2) AS PHRASES

   DIMENSION

   KEY_COLUMN('id')

   VERBATIM_COLUMN('complaint')

   PHRASE_COLUMN('dict')

   REPLACEMENT_TEXT('~XX~')

   );

--OUTPUT OF THE FUNCTION:

id     complaint verbatim_cleansed

-----  ------------------------------------------ ------------------------------------------

1      Please dont even pretend to give a crap Please dont even pretend to give a crap   

3      The customer service rep didnt give a fack The customer service rep didnt give a fack

5      The brown cat bullshat on the rug The ~XX~ ~XX~ bullshat on the rug         

2      This thing is bullshat This thing is bullshat

4      The brown cat The ~XX~ ~XX~  

END======================================================================================================================================

KUDOS & CREDIT: Mo Patel and Mark Turner