Learn Data Science

- What's New in the Aster Underground: Cosine Simil...

12-01-2015
07:08 AM


THIS IS A BETA CUSTOM SQL MR FUNCTION AND IS NOT SUPPORTED BY TERADATA ENGINEERING, CLIENT SUPPORT, OR THE FIELD.

Binaries and a full example can be found in the Aster Customer Zone.

Original MR Version

Given two vectors of words and their scores, the cosine similarity is the dot product of the vectors divided by the product of their L2 norms. The closer this value is to 1, the more similar the vectors are. More about cosine similarity at the wiki page.
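To make the definition concrete, here is a minimal Python sketch that computes the cosine similarity of two sparse word-score vectors. It is only an illustration and not the SQL-MR function itself; the function name `cosine_similarity`, the dict representation, and the sample documents are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {word: score} dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(score * b[word] for word, score in a.items() if word in b)
    # L2 norm: square root of the sum of squared scores.
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical documents with made-up tf-idf scores.
doc1 = {"storm": 3.0, "flood": 4.0}
doc2 = {"storm": 4.0, "flood": 3.0}
doc3 = {"election": 2.0}

print(cosine_similarity(doc1, doc2))  # shares both words with doc1
print(cosine_similarity(doc1, doc3))  # no words in common
```

Documents with no words in common score 0, and a document compared with itself scores 1.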

Input to the cos_similarity() function comprises two data streams:

Set of (vector, l2norm).

As an example from the document clustering use case, each vector is a document (identified by docid) and l2norm is the square root of the sum of the squared tfidf_scores of its words.

Set of (vector1:vector2 pair, product for an element), where there are as many rows for a given (vector1:vector2) pair as there are elements common to these vectors.

Continuing the document clustering example, each pair is a pair of documents, and each row holds the product of tfidf_scores (tfidf_score1*tfidf_score2) for one word the two documents share, so there are as many rows per (vector1:vector2) pair as there are words common to the two documents.
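The two input streams described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code shipped with the function: the sample `tf_idf` rows, the helper names `item_l2norms` and `pair_products`, and the "docid1:docid2" string format for the pair key are all hypothetical, and each (docid, word) combination is assumed to appear only once.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical tf_idf rows: (docid, word, tfidf_score).
tf_idf = [
    (1, "storm", 3.0), (1, "flood", 4.0),
    (2, "storm", 4.0), (2, "flood", 3.0),
    (3, "flood", 2.0),
]

def item_l2norms(rows):
    """Stream 1: one (docid, l2norm) entry per document."""
    squares = defaultdict(float)
    for docid, _word, score in rows:
        squares[docid] += score * score
    return {docid: math.sqrt(total) for docid, total in squares.items()}

def pair_products(rows):
    """Stream 2: one (pair, product) row per word shared by two documents."""
    by_word = defaultdict(list)
    for docid, word, score in rows:
        by_word[word].append((docid, score))
    out = []
    for postings in by_word.values():
        # Sorting by docid keeps each pair key in a single orientation.
        for (d1, s1), (d2, s2) in combinations(sorted(postings), 2):
            out.append((f"{d1}:{d2}", s1 * s2))  # assumed key format
    return out

print(item_l2norms(tf_idf))
print(sorted(pair_products(tf_idf)))
```

Documents 1 and 2 share two words, so their pair appears in two rows, one per common word.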

Output from the function is (cos_item1, cos_item2, cos_score): the item pair and its similarity score.

Example Syntax

select cos_item1, cos_item2, cos_score
from cos_similarity(
    ON (select docid, sqrt(sum(tfidf_score*tfidf_score)) as l2norm
        from tf_idf
        group by 1) as ITEM_L2NORM
    DIMENSION
    ON (select cos_pair, ab_val from cos_pairs_input) as PAIRS_INPUT
    PARTITION BY cos_pair
) T
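The post does not show the internals of cos_similarity(), but from the inputs and output described, the computation can be approximated as: sum the ab_val rows for each pair, then divide by the product of the two items' l2norms. The Python sketch below follows that reading. The function name `cos_similarity_sketch`, the "item1:item2" key format, and the sample values are assumptions for illustration only.

```python
from collections import defaultdict

def cos_similarity_sketch(item_l2norm, pairs_input):
    """Return (cos_item1, cos_item2, cos_score) rows from the two input streams.

    item_l2norm: {item: l2norm}, standing in for the ITEM_L2NORM input.
    pairs_input: iterable of (cos_pair, ab_val), standing in for PAIRS_INPUT.
    """
    # Mirror of PARTITION BY cos_pair: accumulate the dot product per pair.
    dot = defaultdict(float)
    for cos_pair, ab_val in pairs_input:
        dot[cos_pair] += ab_val

    results = []
    for cos_pair, total in sorted(dot.items()):
        item1, item2 = cos_pair.split(":")  # assumed "item1:item2" key format
        score = total / (item_l2norm[item1] * item_l2norm[item2])
        results.append((item1, item2, score))
    return results

# Hypothetical inputs; item ids are strings because they come from the pair key.
item_l2norm = {"1": 5.0, "2": 5.0, "3": 2.0}
pairs_input = [("1:2", 12.0), ("1:2", 12.0), ("1:3", 8.0), ("2:3", 6.0)]

rows = cos_similarity_sketch(item_l2norm, pairs_input)
for row in rows:
    print(row)
```

Every pair needs the norms of both of its items, which is presumably why the norms input is passed as a DIMENSION input (available to every partition) while the pair rows are partitioned by cos_pair.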

Document Clustering - Demo

Input is a corpus of documents of the form (docid, raw_data).

The goal is to cluster the documents by their cosine similarities so that we can identify which descriptions are about similar incidents, people, places, geographic regions, etc. A sample output from GraphGen after the analysis looks as follows:

[Figure: sample GraphGen output]
