What's New in the Aster Underground: Cosine Similarity


THIS IS A BETA CUSTOM SQL MR FUNCTION AND IS NOT SUPPORTED BY TERADATA ENGINEERING, CLIENT SUPPORT, OR THE FIELD.

Binaries and a full example are available in the Aster Customer Zone.

Original MR Version

Given two vectors of words and their scores, the cosine similarity between the vectors is their dot product divided by the product of their L2 norms. The closer this value is to 1, the more similar the vectors are. More about cosine similarity at the wiki page.
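As a minimal sketch of the definition, assume each vector is held as a Python dict mapping a word to its score. The function name `cosine_similarity` and the sample data are illustrative only and are not part of the Aster function.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {word: score} dicts."""
    # Dot product: only words present in both vectors contribute.
    dot = sum(score * b[word] for word, score in a.items() if word in b)
    # L2 norm: square root of the sum of squared scores.
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: empty or all-zero vectors score 0.
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = {"storm": 3.0, "flood": 4.0}
doc2 = {"rescue": 2.0}

same = cosine_similarity(doc1, doc1)      # identical vectors
disjoint = cosine_similarity(doc1, doc2)  # no words in common
```

Identical vectors score 1 and vectors with no common words score 0, which matches the "closer to 1 is more similar" reading above.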

Input to the cos_similarity() function comprises two data streams:

Set of (vector, l2norm).

As an example from the use case of document clustering, each vector is a document (indicated by docid) and l2norm is the square root of the sum of the squared tfidf_scores of the words in that document.

Set of (vector1:vector2 pair, product for an element), where there are as many rows for a given (vector1:vector2) pair as there are elements common to these vectors.

Continuing the document clustering example, each pair is a pair of documents, and each row holds the product of the tfidf_scores (tfidf_score1*tfidf_score2) for one word the two documents share. There are as many rows per (vector1:vector2) pair as there are common words between the 2 documents.
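The two streams can be sketched in Python as follows. This is an illustration under assumptions: the sample rows, the helper names `build_item_l2norm` and `build_pairs_input`, and the "doc1:doc2" key format (taken from the vector1:vector2 notation above) are hypothetical. It also assumes each (docid, word) combination appears only once.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical tf_idf rows: (docid, word, tfidf_score).
tf_idf = [
    ("d1", "storm", 3.0), ("d1", "flood", 4.0),
    ("d2", "storm", 1.0), ("d2", "rescue", 2.0),
    ("d3", "rescue", 5.0),
]

def build_item_l2norm(rows):
    """Stream 1: one (docid -> l2norm) entry per document."""
    sum_sq = defaultdict(float)
    for docid, _word, score in rows:
        sum_sq[docid] += score * score
    return {docid: math.sqrt(total) for docid, total in sum_sq.items()}

def build_pairs_input(rows):
    """Stream 2: one (cos_pair, ab_val) row per word shared by two documents."""
    by_word = defaultdict(list)
    for docid, word, score in rows:
        by_word[word].append((docid, score))
    pairs = []
    for postings in by_word.values():
        # Sort so each pair is always emitted as smaller_id:larger_id.
        for (d1, s1), (d2, s2) in combinations(sorted(postings), 2):
            pairs.append((f"{d1}:{d2}", s1 * s2))  # assumed key format
    return pairs

item_l2norm = build_item_l2norm(tf_idf)
pairs_input = build_pairs_input(tf_idf)
```

Here "flood" appears in only one document, so it produces no pair rows. "storm" and "rescue" each produce one row.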

Output from the function is (cos_item1, cos_item2, cos_score): the item pair and its similarity score.

Example Syntax

select cos_item1, cos_item2, cos_score
from cos_similarity(
    ON (select docid, sqrt(sum(tfidf_score*tfidf_score)) as l2norm
        from tf_idf
        group by 1) as ITEM_L2NORM
    DIMENSION
    ON (select cos_pair, ab_val from cos_pairs_input) as PAIRS_INPUT
    PARTITION BY cos_pair
) T
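To make the data flow concrete, here is a rough Python sketch of what such a query could compute. It assumes the function sums ab_val within each cos_pair partition and divides by the two l2norm values. The internals of the real function, the ":" separator inside cos_pair, and the handling of zero norms are not documented here, so all of these are assumptions.

```python
from collections import defaultdict

def cos_similarity(item_l2norm, pairs_input):
    """Return (cos_item1, cos_item2, cos_score) rows.

    item_l2norm: {item: l2norm}, playing the role of the DIMENSION input.
    pairs_input: iterable of ("item1:item2", ab_val) rows.
    """
    dot = defaultdict(float)
    for cos_pair, ab_val in pairs_input:
        # Grouping by cos_pair mirrors PARTITION BY cos_pair.
        dot[cos_pair] += ab_val
    rows = []
    for cos_pair, total in dot.items():
        item1, item2 = cos_pair.split(":")  # assumed key format
        denom = item_l2norm[item1] * item_l2norm[item2]
        if denom == 0:
            continue  # sketch-only choice: skip pairs with a zero norm
        rows.append((item1, item2, total / denom))
    return rows

# Two documents with identical scores (3.0, 4.0) on the same two words.
item_l2norm = {"d1": 5.0, "d2": 5.0}
pairs_input = [("d1:d2", 9.0), ("d1:d2", 16.0)]
rows = cos_similarity(item_l2norm, pairs_input)
```

Because every partition needs the norms of both of its items, the norm table is small lookup data that must be visible everywhere. This is consistent with passing it as the DIMENSION input while the pair rows are partitioned.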

Document Clustering - Demo

Input is a corpus of documents of the form (docid, raw_data).

The goal is to cluster the documents based on their cosine similarities so that we can identify which descriptions are about similar incidents, people, places, geographic regions, and so on. A sample output from GraphGen after the analysis looks as follows:

doc_clustering.png