Aster Association Analysis - CFilter Example

Aster
Teradata Employee

Aster Association Analysis - CFilter Example

CFilter


What Is It?

This function performs collaborative filtering by using a series of SQL commands and SQL-MapReduce functions. You run this function by using an internal JDBC wrapper function.


Usage

SELECT * FROM CFilter (
ON (SELECT 1)
PARTITION BY 1
[DOMAIN('host:port')]
[DATABASE('db_name')]
[USERID('user_id')]
[PASSWORD('password')]
[SSLSETTINGS('SSLsettings')]
[SSLTRUSTSTOREPASSWORD('SSLtruststorepassword')]
INPUTTABLE('input_table_name')
OUTPUTTABLE('output_table_name')
INPUTCOLUMNS('source_column1', 'source_column2',...)
JOINCOLUMNS('join_column1', 'join_column2',...)
[OTHERCOLUMNS('other_column1', 'other_column2',...)]
[PARTITIONKEYCOLUMN('partitionKeyColumn1')]
[MAXSET('max_item_set')]
[DROPTABLE('true'|'false')]
);


Argument Required? Description
INPUTTABLE Yes Name of the input table containing the data to be filtered.
OUTPUTTABLE Yes Name of the output table into which the function writes the final results. If the output table already exists, then you should also pass the DROPTABLE (‘true’) argument, to drop it before writing the new results. Otherwise, an exception is thrown.
INPUTCOLUMNS Yes A list of input columns to filter. The column names are single-quoted and written in the comma-delimited format <'col1', 'col2', ...>.
JOINCOLUMNS Yes A list of columns to join on. The column names are single-quoted and written in the comma-delimited format <'col1', 'col2', ...>.
OTHERCOLUMNS No A comma-separated list of the names of other input columns to add to the output. These columns pass through the function unchanged. The column names should be singlequoted (<'col1', 'col2', ...>). The output is partitioned based on the columns specified by this argument. If this argument is not specified, the input data is regarded as belonging to one partition.
PARTITIONKEY No Single column used as partition key for the newly created output table. Default partitionKey is col1_item1.
MAXSET No Size of the maximum item set to be considered. Default is 100.
DROPTABLE No If you set this option to true, if the output table name already exists, the function drops the table. Default value is false.


Notes

  1. Specifying the same column in the OTHERCOLUMNS and JOINCOLUMNS arguments results in an incorrect counting in the groups.


Input Data

Data for the following example is based on listening habits prior to 2009 of last.fm users from http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html, courtesy of last.fm.

The original data set contained the listening data for 360 thousand users, amounting to almost 18 million rows (1.52 GB in size).

Therefore, the data set below is a cleaned subset of the original data set (1 thousand users and about 100 thousand rows) to save space and make it easier to work with the data.

Data: thousand_users.csv

Loading the Data: data_loading.sql


Run Time Analysis

Number of Rows Run Time
98542 00:00:40.014


Example

First, we'll get the top scores from CFilter for all artists:

SELECT * FROM CFilter (
ON (SELECT 1)
PARTITION BY 1
INPUTTABLE('ja186045.onekusers')
OUTPUTTABLE('ja186045.onekuserscfilter')
INPUTCOLUMNS('artist')
JOINCOLUMNS('user_id')
DROPTABLE('true')
);

SELECT * FROM ja186045.onekuserscfilter
ORDER BY score DESC
LIMIT 5;


col1_item1 col1_item2 cntb cnt1 cnt2 score support confidence lift z_score
deolinda coimbra 1 1 1 1 0.001 1 1000 -0.217661878
hazel spoonboy 1 1 1 1 0.001 1 1000 -0.217661878
downy triceratops 1 1 1 1 0.001 1 1000 -0.217661878
tua chaoze one 1 1 1 1 0.001 1 1000 -0.217661878
run devil run all out war 1 1 1 1 0.001 1 1000 -0.217661878

As can be seen, the results don't provide much value as the artist pairings with the highest scores are those that were listened to the least.

Let's look at a couple examples for more popular artists:

SELECT * FROM ja186045.onekuserscfilter
WHERE col1_item1 = 'kanye west' or col1_item2 = 'kanye west'
ORDER BY score DESC
LIMIT 4;


col1_item1 col1_item2 cntb cnt1 cnt2 score support confidence lift z_score
kanye west jay-z 20 80 34 0.147058824 0.02 0.25 7.352941176 18.88348159
jay-z kanye west 20 34 80 0.147058824 0.02 0.588235294 7.352941176 18.88348159
the game kanye west 13 16 80 0.13203125 0.013 0.8125 10.15625 11.84621821
kanye west the game 13 80 16 0.13203125 0.013 0.1625 10.15625 11.84621821

Since we looked at Kanye, we might as well look at Taylor Swift too because of their history.

SELECT * FROM ja186045.onekuserscfilter
WHERE col1_item1 = 'taylor swift' or col1_item2 = 'taylor swift'
ORDER BY score DESC
LIMIT 4;


col1_item1 col1_item2 cntb cnt1 cnt2 score support confidence lift z_score
taylor swift carrie underwood 4 10 5 0.32 0.004 0.4 80 2.798308144
carrie underwood taylor swift 4 5 10 0.32 0.004 0.8 80 2.798308144
kellie pickler taylor swift 2 2 10 0.2 0.002 1 100 0.787661463
taylor swift reba mcentire 2 10 2 0.2 0.002 0.2 100 0.787661463

The results from these two queries show that users typically listen to a particular genre when it comes to these artists. People who listen to Kanye West also listen to other rappers such as Jay-Z or The Game and those who listen to Taylor Swift also listen to other country artists such as Carrie Underwood, Kellie Pickler, and Reba McEntire. If this was performed on a more recent data set, the results would probably be different as Taylor Swift has transitioned from a country singer to a pop artist.

The output from this function could be used to create a recommendation engine, like the ones used by streaming services such as last.fm, Pandora, and Spotify. By finding users that have similar tastes in music, this system could provide artist suggestions that the user may like based on their past listening history.