Aster Field Strong

Looking Glass

Explore and Discover

The latest articles, videos, and blog posts speak to those interested in analytics-based decision making. Browse all content in the Teradata Data Science Community to gain valuable insights.


documentParser is a map function that pulls a variety of document files stored in HDFS (Hadoop Distributed File System) or in Aster (as of 2013-04-25) and parses them within nCluster. The parsed files can be output in one of four modes by specifying a parseMode of 'text', 'tokenize', 'email', or 'image'. 'text' extracts the plain-text portion of a given document and outputs it as a single varchar field. 'tokenize' is like 'text' but outputs each word on a separate line. 'email' parses out TO, FROM, CC, BCC, SUBJECT, and BODY fields and outputs each as a separate column; it works on plain-text RFC emails as well as Outlook .msg files. 'image' parses out EXIF metadata such as focal length, exposure time, ISO, etc.

In addition to the columns mentioned above, all of the modes output a "filename" column, which contains the full path and filename of the HDFS file; it is populated in all four operating modes. Thus, 'text' emits one row and two columns ("filename" and "content") for each document being parsed; 'email' and 'image' emit one row and multiple columns for each document or image; 'tokenize' emits many rows and two columns ("filename" and "word") for each document being parsed.

Under the covers, documentParser uses Apache Tika to parse documents. This simplifies the MR implementation, since Tika both detects the document format and extracts the plain-text portion of the document. Tika supports a wide variety of formats, including all Microsoft Office documents (Office '97-2010, including doc, docx, ppt, pptx, xls, xlsx), Outlook msg files, Outlook pst files, PDF, HTML, ePub, RDF, plain text, Apple iWork, image files (TIFF, jpg, etc.), and more. See the Apache Tika documentation for the complete list.
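A minimal invocation sketch for 'text' mode (the input clause and the PARSEMODE argument name are assumptions based on the description above; the attached PPTX has the actual signature):

SELECT filename, content
FROM documentParser(
    ON documents_to_parse              -- hypothetical input; the actual HDFS/Aster source syntax is in the PPTX
    PARSEMODE('text')                  -- hypothetical argument name; one of 'text', 'tokenize', 'email', 'image'
);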

See attached PPTX for full documentation

See the ncluster_pdfloader bash script for the base64 encoding and nCluster loader steps. NOTE: this may take some refactoring to work in your environment.


More to come later regarding use and syntax


GraphGen was put on a shelf to collect dust when the "official" Visualizer function was released. Those of us who believed in it continued to use it. As it is now clear that development and support for innovative technology initiatives like this must often be field-based, I have resurrected GraphGen to add a new feature.

Colleagues and I have clamored for years for the ability to color nodes based on values other than modularity class. While there is still a bit of additional work to be done, I feel comfortable releasing this new version of GraphGen, which includes this new Sigma graph functionality. There may still be some kinks to be ironed out, and I welcome feedback.

 

Please allow me to provide a couple of examples of where this new capability will be useful:

 

1. Market Basket Analysis

 

The Aster AppCenter "Retail - Market Basket & Product Recommender" app runs a collaborative-filtering-based market basket analysis to figure out which items are most frequently purchased with other items, i.e., product affinity. The resulting graph visualization looks like this:

[Screenshot: market basket graph with nodes colored by modularity class]

In this visualization, the nodes are colored according to "modularity class". GraphGen has used the Gephi library to run the modularity clustering algorithm and assign a modularity class value to each node. This is incredibly useful as we seek to understand how the grocery items are clustered, but other associated (dimension) values are likely interesting as well.

 

For example, now that we have seen which items cluster together based on how frequently they are bought together, how about contrasting that with where they are located in the store? In other words, is there an immediately noticeable correlation between which items are most frequently purchased together and the aisle in which they are stocked?

[Screenshot: graph with "aisle" selected from the Node Color pull-down menu]

You can see in the screenshot above that we have selected "aisle" from the Node Color pull-down menu, and we can now see which aisle is associated with each item. Additionally, we can flip over to the "Layout" tab to see the Legend, which shows which color is associated with which aisle number.

[Screenshot: Layout tab legend mapping colors to aisle numbers]

This type of analysis can tell us whether it might make sense to arrange items differently in the store, either to help shoppers complete their trips faster or to drive shoppers to different parts of the store.

 

2. Telco Call Networks

 

The Aster AppCenter "Telco - Network Finder" app visualizes which callers speak with one another most frequently. By constructing these networks, we can hopefully gain a better understanding of social influencers and trends across our customer population. The resulting graph visualization looks like this:

[Screenshot: call network graph with nodes colored by modularity class]

In this visualization, the nodes are again colored according to "modularity class." We are able to see which customers are speaking with one another, but again, there are other associated variables that might be interesting and revealing as we drill down further into these clusters. For example, do all of the customers in a specific cluster live in the same state? Are they similar in age? Here we select phone "make" to see whether customers in a given cluster are using the same make of phone.

[Screenshot: call network graph with nodes colored by phone make]

And with the legend for clarification:

[Screenshot: legend mapping colors to phone makes]

Perhaps customers who own Apple phones are more likely to speak with one another. Or perhaps it is those who own Nokia phones who form the tighter group. Or perhaps it makes no difference at all... It is the ability to test this hypothesis in seconds rather than minutes or hours that makes this feature a valuable add-on to the GraphGen Sigma graph visualization.


How does it work?

To use this new node coloration functionality, simply use the multi-input capability of GraphGen. You can read all about this in the GraphGen documentation. Any field that is included in the dimension table will be accessible via the Node Color drop-down menu.
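As a rough sketch only (the input tables and column names below are illustrative, not GraphGen's actual signature; see the GraphGen documentation for the real syntax), a multi-input call pairing an edge list with a dimension table might look like this:

SELECT * FROM GraphGen(
    ON (SELECT item1, item2, cnt FROM product_affinity) AS edges PARTITION BY 1   -- illustrative edge list
    ON (SELECT item, aisle FROM item_dim) AS nodes DIMENSION                      -- illustrative dimension table
);

Any column in the dimension input (aisle, in this illustration) would then appear in the Node Color drop-down menu.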


Where do I get the latest version?

Please speak to your friendly, neighborhood Teradata person. 

Enjoy!


Description

This SQL-MR function generates unique bigint ids for rows in an input table. The algorithm guarantees that the ids are unique, but it does not guarantee that they are strictly sequential (expect some gaps in the id numbers).

Why not ROW_NUMBER() ?

The biggest advantage of ID() over ROW_NUMBER() OVER (PARTITION BY 1) is that it generates ids for table rows in a way that scales linearly, because it requires no repartitioning operation; ID() uses a truly parallel algorithm. It is especially useful when you load data from a fresh data source and each row needs an id tag for text analysis or XML/JSON parsing, regardless of row order. As an example, if you have 12 billion rows in a table that you just loaded and need to id them uniquely, just use ID()!
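For comparison, the window-function approach mentioned above would look roughly like this (a sketch against the webclicks sample table shown below; PARTITION BY 1 forces every row through a single partition, which is why it does not scale):

-- Serial numbering with a window function: all rows land in one partition,
-- so the numbering cannot be spread across vworkers.
CREATE TABLE webclicks_with_rownum DISTRIBUTE BY HASH(id) AS
SELECT ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY time_stamp) AS id, *
FROM webclicks;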

Usage

SELECT * FROM ID(

       ON input_table

);

The output schema and data match the input table, with an additional bigint column called id.

Sample Input/Output

INPUT TABLE: webclicks


   
 userid | sessionid | time_stamp       | event
--------+-----------+------------------+-----------
 user1  | 0         | 2016-01-01 16:40 | home_page
 user1  | 0         | 2016-01-01 16:42 | sports
 user1  | 0         | 2016-01-01 16:46 | home_page
 user2  | 1         | 2016-02-01 10:10 | news_page

CREATE TABLE webclicks_with_id DISTRIBUTE BY HASH(id)

AS

SELECT * FROM id(

     ON webclicks

);

OUTPUT TABLE: webclicks_with_id

 id | userid | sessionid | time_stamp       | event
----+--------+-----------+------------------+-----------
 1  | user1  | 0         | 2016-01-01 16:40 | home_page
 2  | user1  | 0         | 2016-01-01 16:42 | sports
 3  | user1  | 0         | 2016-01-01 16:46 | home_page
 4  | user2  | 1         | 2016-02-01 10:10 | news_page

 

Installation

Download the ID.ZIP  attachment

Run \install id.zip in ACT to install it in the default schema
Run \install <schema>/id.zip in ACT to install it in <schema>
Provide execute permissions on the ID function to the users with
GRANT EXECUTE ON FUNCTION ID TO <role>|<user>;



Document_From_Table is an Aster function for automatic text/metadata extraction from 300+ document formats. It relies on the Apache Tika library to do the bulk of the work in extracting the contents. The function also includes the ability to OCR images using Google's Tesseract library; Tesseract must be installed on all workers in the cluster for this functionality to be enabled.

Installation

Download Document_From_Table, log into the Aster cluster using ACT, and \install the file.

Installation of Tesseract

The easiest way to install Tesseract is to download the leptonica, tesseract, and tesseract language data RPMs from the ifad repository. You can add the repository to your Linux package manager from: http://download.opensuse.org/repositories/home:/vjt:/ifad/openSUSE_11.4/

and then run 'zypper in tesseract', which should install everything needed; otherwise, you will have to download the liblept5, leptonica, libtesseract, and tesseract-ocr RPMs and install them manually on all the workers.
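For reference, on a SUSE-based worker the zypper steps might look like this (the repository alias 'ifad' is just an illustrative name):

# Add the ifad repository, refresh metadata, and install tesseract with its dependencies
zypper addrepo http://download.opensuse.org/repositories/home:/vjt:/ifad/openSUSE_11.4/ ifad
zypper refresh
zypper in tesseract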

Usage

Document_From_Table takes a table of bytea files as input and extracts/OCRs the text from them. To load files into Aster in this format, please use the Aster Loader tool. After populating the table with files, you can then use the document_from_table function with the following SQL-MR syntax:

SELECT * FROM document_from_table(

  ON public.file_load

  PARTITION BY filename

);



The Aster Loader Tool is designed to load files into a table in bytea format. This allows them to be used by many SQL-MR functions.

Download

GUI Usage

Fill in the appropriate fields and drag one or more files into the empty space. Please hit upload only once; there is currently no progress indication and the program will look like it has hung, but it is uploading.

Command Line Options

The loader uses arguments similar to ACT and the nCluster loader/export tools:

-h: host IP address
-d: database name
-t: table name
-U: username
-w: password
-W: prompt for password
-f: path/to/file/or/folder

Command Line Usage

java -jar /Users/user/aster_loader.jar -h 192.168.100.100 -W -d beehive -U db_superuser -t public.file_load -f /Users/user/Download/file.txt

Schema

CREATE TABLE public.file_load (

     filename varchar,

     content bytea

) DISTRIBUTE BY HASH (filename);


This is a custom SQL-MR function built by the field; it is not supported by Teradata engineering and carries no warranty. Please test first and use at your own risk.

It uses a dictionary table to replace matching words in a text column.

I am attaching the SQL-MR function with the extension changed to .zi_ instead of .zip.

You may have to grant execute privileges on the function to your user id.
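Assuming the function installs under the name verbatim_cleansing (the name used in the example below), the grant would look like this:

GRANT EXECUTE ON FUNCTION verbatim_cleansing TO <user>;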

You will need to get this installed on your system to use it. Let me know if you need help. Here is how it works:

Build two tables:

CREATE TABLE dict2 (

        type VARCHAR,

        dict VARCHAR

)

DISTRIBUTE BY REPLICATION;

/*  Table contains:

  type     dict

-------  -------

1     brown

2     cat

*/

CREATE TABLE complaint(

    id  INTEGER,

    complaint VARCHAR

)

DISTRIBUTE BY HASH(id);

/*

id     complaint

-----  ------------------------------------------

1      Please dont even pretend to give a crap   

3      The customer service rep didnt give a fack

5      The brown cat bullshat on the rug         

2      This thing is bullshat

4      The brown cat

*/

We would still need to remove the messages that contain the curse words; that can be done with the NER function. I wanted to show that we can achieve your goals in two steps.

--The SQL-MR statement to replace words from the dict2 table with the REPLACEMENT_TEXT string:

select *

from verbatim_cleansing(

   ON (select id, complaint from complaint) AS VERBATIM

   PARTITION BY id

   ON (select * from dict2) AS PHRASES

   DIMENSION

   KEY_COLUMN('id')

   VERBATIM_COLUMN('complaint')

   PHRASE_COLUMN('dict')

   REPLACEMENT_TEXT('~XX~')

   );

--OUTPUT OF THE FUNCTION:

 id | complaint                                  | verbatim_cleansed
----+--------------------------------------------+--------------------------------------------
  1 | Please dont even pretend to give a crap    | Please dont even pretend to give a crap
  3 | The customer service rep didnt give a fack | The customer service rep didnt give a fack
  5 | The brown cat bullshat on the rug          | The ~XX~ ~XX~ bullshat on the rug
  2 | This thing is bullshat                     | This thing is bullshat
  4 | The brown cat                              | The ~XX~ ~XX~


KUDOS & CREDIT: Mo Patel and Mark Turner


THIS IS A CUSTOM SQL-MR FUNCTION AND IS NOT SUPPORTED BY TERADATA ENGINEERING, CLIENT SUPPORT, OR THE FIELD. IF YOU FIND A BUG, PLEASE LET ME KNOW AND I WILL TRY TO FIX IT AS SOON AS POSSIBLE.

For an Aster PoV at a cable TV customer, I developed some custom SQL-MR functions to help handle certain datasets. Some of these functions were created to be reusable, though for a single purpose. However, we have found them very useful in several cases, so I have decided to share some of them.

Find attached the BooleanPivotMatrix SQL-MR function (a partition function, i.e., reduce-like). This function is similar to the CumulativePivotMatrix custom SQL-MR function and allows you to pivot detail-level information by partition and category into a table where you get a matrix of all the partitions (e.g. customers, users, ...) and whether each category appears (1) or not (0). The function allows up to 1597 columns.

Find attached a .rar file that includes:

  • Function jar file
  • User manual
  • Summary slide
  • Step by step first demo

The original goal of the function was to get a matrix of TV hours watched by customer and channel or categories:

summary.JPG

The demo includes a dataset to help you understand how it works. We have found it very useful for data mining, segmentation, minhash reverse engineering...
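As a very rough sketch only (the table and argument names below are hypothetical; the attached user manual documents the actual syntax), an invocation could look something like this:

SELECT * FROM BooleanPivotMatrix(
    ON tv_viewing_detail PARTITION BY customer_id    -- hypothetical detail-level input, one row per viewing event
    CATEGORY_COLUMN('channel')                       -- hypothetical argument: values of this column become the 0/1 matrix columns
);

The result would then be one row per customer with a 0/1 column per channel, up to the 1597-column limit mentioned above.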

If you need further information or find a bug let me know.

Regards,

Ignacio


THIS IS A CUSTOM SQL MR FUNCTION AND IS NOT SUPPORTED BY TERADATA ENGINEERING, CLIENT SUPPORT, OR THE FIELD. IF YOU FIND A BUG, PLEASE LET ME KNOW AND I WILL TRY TO FIX IT AS SOON AS POSSIBLE.

For an Aster PoV at a cable TV customer, I developed some custom SQL-MR functions to help handle certain datasets. Some of these functions were created to be reusable, though for a single purpose. However, we have found them very useful in several cases, so I have decided to share some of them.

Find attached the CumulativePivotMatrix SQL-MR function (a partition function, i.e., reduce-like). This function allows you to pivot detail-level information by partition, category, and value into a table where you get a matrix of all the partitions (e.g. customers, users, ...) and the accumulated value grouped by category. The function allows up to 1597 columns.

Find attached a .rar file that includes:

  • Function jar file
  • User manual
  • Summary slide
  • Step by step first demo

The original goal of the function was to get a matrix of TV hours watched by customer and channel or categories:

summary.JPG

The demo includes a dataset to help you understand how it works. We have found it very useful for data mining, segmentation, and text analytics combined with nGrams...

If you need further information or find a bug let me know.

Regards,

Ignacio


THIS IS A BETA CUSTOM SQL MR FUNCTION AND IS NOT SUPPORTED BY TERADATA ENGINEERING, CLIENT SUPPORT, OR THE FIELD.

Tweet_Streamer was created to stream live Twitter events directly into the Teradata Aster database. This SQL-MR function downloads tweets based on a specified number of tweets, keywords, and language. The idea is to stage the Twitter feeds, in JSON format, into a payload table. Since the Twitter data model may change at any time due to relevant business needs, it is best to acquire the complete JSON data and parse it later using another SQL-MR function, JsonDataExtractor(), a configurable utility that can extract specific nodes and values from a JSON structure.

The SQL-MR function is programmed to use an existing table with an auto-generated id column and a payload column of varchar type.

The main benefit of having this custom function is that we get direct access to live Twitter feeds in Aster instead of using other external applications such as Flume or HDFS. Execution can also be monitored via the AMC. This functionality can also work in conjunction with an existing SQL-MR function (load_tweets()) that fetches historical tweets for a limited number of attributes.

Usage

Syntax

SELECT * FROM Tweet_Streamer(
    ON (SELECT 1) PARTITION BY 1
    [TOTAL_TWEET_COUNT(<integer>)]
    [BATCH_SIZE(<integer>)]
    CONSUMER_KEY('<consumer_key>')
    CONSUMER_SECRET('<consumer_secret>')
    ACCESS_TOKEN('<access_token>')
    ACCESS_TOKEN_SECRET('<access_token_secret>')
    OUTPUT_TABLE('<table>')
    OUTPUT_TABLE_COL('<col>')
    [LANGUAGE('<language, e.g. en>')]
    KEYWORDS('<keyword1>', '<keyword2>', ..., '<keyword400>')
);

Create Twitter Developer App

1. Go to dev.twitter.com

2. At the bottom of the page, under the Tools, click on ‘Manage Your Apps’

3. Click on ‘Create New App’

4. Provide the details for the following

Name: your_app_name

Description: your_app_description

Website: any_valid_website, i.e. https://www.google.com

Callback URL: any_valid_website, i.e. https://www.google.com

5. Check ‘Agree’ on Developer Rules of the Road

6. Click on ‘Create your Twitter application’

7. Go to tab 'Keys and Access Tokens'.

8. At the bottom of the next page, click on 'Create your access token'. Refresh the page if the update is not reflected.

9. Congratulations, you have created your consumer and access tokens.

Sample Usage

SELECT * FROM Tweet_Streamer(

    ON (SELECT 1) PARTITION BY 1

    TOTAL_TWEET_COUNT(100000)

    BATCH_SIZE(200)

    CONSUMER_KEY('LHZY9ntmE11HGcgXGu9c4DLVr')

    CONSUMER_SECRET('w89uXXvyNEWnVOh4ePYYnMjeM9fG53FNJfWOxTXWF5TLmxe5QU')

    ACCESS_TOKEN('371407254-gp9zmtYawiajRlrSa83nFzuwCU0b7mtkZo9Qlexv')

    ACCESS_TOKEN_SECRET('1XJku1VLzFWfTdk1i6ApTotiqYIt5xypAmFvWXjwV2Od7')

    OUTPUT_TABLE('twitterpool')

    OUTPUT_TABLE_COL('json')

    LANGUAGE('en')

    KEYWORDS(':)')

);

The output table must exist before the function is run; for the example above:

CREATE TABLE twitterpool (
    id BIGSERIAL LOCAL,
    json VARCHAR
) DISTRIBUTE BY HASH(json);

Argument Semantics

i. ON (SELECT 1) PARTITION BY 1 [required]
   - A dummy input, required for syntactical reasons and to make sure the function runs on a single vworker.

ii. TOTAL_TWEET_COUNT
   - The total number of tweets to be downloaded/streamed into the target Aster table. Defaults to 1000 tweets if not specified.

iii. BATCH_SIZE
   - Controls the number of records inserted at a time. Defaults to 10 if not specified.

iv. CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET [required]
   - Required parameters used to authenticate with the Twitter API.
   - See the instructions under "Create Twitter Developer App" above for how to acquire these credentials.

v. OUTPUT_TABLE [required]
   - Indicates the table into which the JSON-formatted tweets will be downloaded.

vi. OUTPUT_TABLE_COL [required]
   - Indicates the varchar column into which the JSON-formatted tweets will be downloaded.

vii. LANGUAGE
   - Optional attribute specifying the language of the tweets. If not specified, tweets are downloaded without a language filter.
   - Refer to http://msdn.microsoft.com/en-us/library/ms533052(v=vs.85).aspx for the list of language codes.

viii. KEYWORDS [required]
   - Accepts a comma-delimited, single-quoted list of up to 400 keywords.
   - Refer to https://dev.twitter.com/rest/public/search, under the 'Query operators' section, for keyword operator behaviors.
