THIS IS A BETA CUSTOM SQL MR FUNCTION AND IS NOT SUPPORTED BY TERADATA ENGINEERING, CLIENT SUPPORT, OR THE FIELD.
Binaries and full example found in the Aster Customer Zone.
Original MR Version
Given 2 vectors of words and their scores, the cosine similarity between the vectors is their dot product divided by the product of their L2 norms. For non-negative scores such as tf-idf, the value lies between 0 and 1, and the closer it is to 1 the more similar the vectors are. More about cosine similarity at the wiki page.
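For reference, the standard definition can be written as follows. The symbols A, B, a_i and b_i are introduced here for illustration only and do not appear in the function's interface; a_i and b_i stand for the scores of word i in each vector.

```latex
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert_2 \, \lVert B \rVert_2}
           = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
```

Only words present in both vectors contribute a non-zero term to the numerator, which is why the second input stream needs rows only for the common elements.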
Input to the cos_similarity() function consists of 2 data streams:
Set of (vector, l2norm).
As an example from the use case of document clustering, each vector is a document (indicated by docid) and l2norm is the square root of the sum of the squared tfidf_scores over the words in that document.
Set of (vector1:vector2 pair, product for an element), where there are as many rows for a given (vector1:vector2) pair as there are elements common to these vectors.
Continuing the document clustering example, each pair is a pair of documents, each row holds the product of the tfidf_scores (tfidf_score1*tfidf_score2) for one word, and there are as many rows per (vector1:vector2) pair as there are common words between the 2 documents.
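The two input streams described above can be sketched in plain Python. This is only an illustration of the data shapes, not the way the streams are produced in Aster: the sample rows, the function name build_streams, and the "docid1:docid2" key format with a colon separator are assumptions made for the example.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical (docid, word, tfidf_score) rows; the values are made up.
tfidf_rows = [
    (1, "fire", 3.0),
    (1, "truck", 4.0),
    (2, "fire", 1.0),
    (2, "alarm", 2.0),
]

def build_streams(rows):
    """Return (item_l2norm, pairs_input) shaped like the two input streams."""
    by_doc = defaultdict(dict)
    for docid, word, score in rows:
        by_doc[docid][word] = score

    # Stream 1: one (docid, l2norm) entry per document.
    item_l2norm = {
        docid: math.sqrt(sum(s * s for s in words.values()))
        for docid, words in by_doc.items()
    }

    # Stream 2: one (cos_pair, ab_val) row per word shared by a document pair.
    pairs_input = []
    for d1, d2 in combinations(sorted(by_doc), 2):
        for word in by_doc[d1].keys() & by_doc[d2].keys():
            ab_val = by_doc[d1][word] * by_doc[d2][word]
            pairs_input.append((f"{d1}:{d2}", ab_val))
    return item_l2norm, pairs_input

item_l2norm, pairs_input = build_streams(tfidf_rows)
```

With these sample rows, documents 1 and 2 share only the word "fire", so the pair stream contains a single row for the pair "1:2".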
Output from the function is (cos_item1, cos_item2, cos_score), which is the item pair and its similarity score.
select cos_item1, cos_item2, cos_score
ON ( select docid, sqrt(sum(tfidf_score*tfidf_score)) as l2norm
group by 1) as ITEM_L2NORM
ON (select cos_pair, ab_val from cos_pairs_input ) as PAIRS_INPUT
PARTITION BY cos_pair
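As a rough illustration of the arithmetic behind the score, the following Python sketch combines the two streams the way the description above suggests: sum the products per pair, then divide by the product of the two L2 norms. It is not the SQL-MR implementation. The function body, the sep parameter, the colon-separated pair key, and the sample values are assumptions made for the example.

```python
from collections import defaultdict

def cos_similarity(item_l2norm, pairs_input, sep=":"):
    """Return (item1, item2, score) tuples from the two input streams.

    item_l2norm: dict mapping item id -> L2 norm of that item's vector
    pairs_input: iterable of (cos_pair, ab_val) rows, one per common element
    sep: assumed separator between the two ids inside cos_pair
    """
    # Sum the per-element products to get the dot product of each pair.
    dot = defaultdict(float)
    for cos_pair, ab_val in pairs_input:
        dot[cos_pair] += ab_val

    results = []
    for cos_pair, total in dot.items():
        item1, item2 = cos_pair.split(sep)
        denom = item_l2norm[item1] * item_l2norm[item2]
        if denom == 0:
            continue  # skip zero-length vectors to avoid dividing by zero
        results.append((item1, item2, total / denom))
    return results

# Made-up example: d1 = (3, 4) and d2 = (4, 3) over the same two words.
norms = {"d1": 5.0, "d2": 5.0}
pairs = [("d1:d2", 12.0), ("d1:d2", 12.0)]
scores = cos_similarity(norms, pairs)
```

Here the dot product is 12 + 12 = 24 and the norms multiply to 25, so the pair scores 0.96.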
Document Clustering - Demo
Input is a corpus of documents of the form (docid, raw_data).
The goal is to cluster the documents based on their cosine similarities so that we can identify which descriptions are about similar incidents, people, places, geographic regions, etc. A sample output from GraphGen after the analysis looks as follows: