Aster's Market Basket and Collaborative Filtering functions

Aster
Teradata Aster is an analytic platform that embeds MapReduce analytic processing with data stores. •Embedded MapReduce analytic processing and unique SQL-MapReduce® framework •Massively parallel data store for multistructured data •Intuitive tools and SQL-MapReduce libraries for rapid analytic development
Teradata Employee

Aster's Market Basket and Collaborative Filtering functions

Retailers mine transaction data to track purchasing behavior.  Some of the more popular are Market Basket and Collaborative Filtering

By understanding what products customers tend to purchase, a vendor can maximize their sales for that customer.  Armed with this information, an analyst can initiate:

  • Couponing and discounts
  • Cross-selling and upselling
  • Product placement - Put physically close together (cough medicine and tissues)
  • Timing promotions – For customers that purchase lawn mower, they also purchase crabgrass killer and lawn fertilizer the following month

This presentation will provide a brief overview of how both Market Basket and Collaborative Filtering works conceptually, then we'll move into coding this on the Aster Data platform to see the results.  Let's get started.

Overview

Market Basket (MB) is concerned with products/services purchased at the same time (basket).  That’s simple enough.  MB uses rules.  A rule has two measures (Confidence and Support).  Support measures how often items occur together, as a percentage of the total transactions. Confidence measures how much a particular item is dependent on another item (ie: When people buy liquor, they also buy infant products 50% of the time).  Let’s look at a quick example to understand these 2 measures.

Assume have following table:

People who bought

Also bought this

Support

Confidence

Banana

Dates

57 % (4/7)

80% (4/5)

 From here, it starts to get complicated, but wanted to give you notion that it’s not just seeing what’s in the basket.

Collaborative Filtering (CF) is similar but different (it’s not uncommon to use the output of MB as input to CF).  It makes predictions about the interests of one User by collecting information from many Users (collaborating).  Basically if User A has the same opinion as User B, A is more likely to have B’s opinion of a different issue than to have the opinion of a random User.  Key point is that predictions are specific to a user, based on info from many users.

Here’s an example I pulled straight out of Wikipedia:

"Typically, the workflow of a collaborative filtering system is:

  1. A user expresses his or her preferences by rating items (eg. books, movies or CDs) of the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.
  2. The system matches this user’s ratings against other users’ and finds the people with most “similar” tastes.
  3. With similar users, the system recommends items that the similar users have rated highly but not yet being rated by this user (presumably the absence of rating is often considered as the unfamiliarity of an item)"

Now that we have a fundamental understanding of MB and CF, let’s run some code on ACT (Aster Command Terminal) to see how this all works.

Aster function Market Basket (BASKET_GENERATOR)

We’ll be using the SALES_FACT table for this lab so here’s a partial listing at what’s inside.

QueryWant to find the 3 most popular products purchased in a basket on a per store basis.   

select store_id, product_id1, product_id2, product_id3, count(1)
from BASKET_GENERATOR
(on sales_fact
PARTITION BY store_id, basket_id
BASKET_ITEM ('product_id') BASKET_SIZE (3)
ACCUMULATE('store_id') COMBINATIONS('true'))
group by 1,2,3,4 order by 5 desc, 1 asc limit 10;


Armed with this information, I can now create a weekly flyer highlighting the Top 3 items on the cover page on a per store basis.

Here’s a listing of Arguments that is (mostly) self-explanatory.

PARTITION BY: Required.  Must specify the column(s) to partition by.  This specifies the sets for which combinations will generated and summed.

BASKET_ITEM: Required.  Name(s) of the input column(s)  that contains the items to be collected into baskets.  If more than one input column is specified, every unique combination of input column values is treated as one  item.  For example, if a single column is used, this is often the column containing the SKU that identifies an item that was sold.  If you wanted to further break down the results, you could specify both the SKU column and the month manufactured, color and/or size columns.

BASKET_SIZE: Required.  Integer number of items to be included in basket.  The default is 2 items.

ACCUMULATE: Optional. Names of input columns that will be returned as-is in the output. All input columns not named here are left out of the output. This must be a column that is part of the ON relation.

COMBINATIONS: Optional; defaults to 'true'.  If 'true' the function returns a basket for each unique combination of items.  If 'false' the function returns a basket for each unique permutation of items. Combinations are returned in lexicographical order.  For a combination, the order of the items doesn't matter (the basket "tomatoes and basil" is considered to be the same basket as "basil and tomatoes").  For a permutation, every a unique ordering of the items constitutes a unique basket.

ITEM_SET_MAX: (Type=int) [default=100]. This is the maximum number of items to be considered in a partition.  If the number of items in any partition exceeds ITEM_SET_MAX, no combinations (or permutations) will be emitted for that partition.


 That wraps it up for MB.  Let’s move on to CF.

Aster function Collaborative Filtering (cFilter)

 First off, here’s a partial listing of the table in the query.  I created this based on data in the AAF User’s Guide.


 

QueryI want to find which Products are frequently paired together for a special promotion next month. 

SELECT * FROM cfilter (
ON (SELECT 1)
PARTITION BY 1
password('beehive')
inputTable ('Collab') outputTable ('cfilter_output')
inputColumns ('item') joinColumns ('tranid'));


Notice this function accepts an INPUTTABLE and outputs an OUTPUTTABLE.  So to view the result set, you will have to SELECT from the OUTPUTTABLE name after the function has completed as shown below.

select * from cfilter_output order by score desc;

Hmmm.  It looks like Salsa and Chips have the highest affinity in my answer set.  So would always want to include these 2 together in any of my promotions.

The Arguments are:

DOMAIN : Optional.  Has the form, host:port .  The  host  is the Aster Database queen’s IP address or hostname. To specify an IPv6 address, enclose the host argu ment in square brackets, for example: [:: 1]:2406. The  port  is the port number that the queen  is listening on. The default is the Aster standard port number (2406). For example:  DOMAIN(10.51.23.100:2406)

DATABASE : Optional.  This is the name of the database  where the input table is present. Default database is beehive.

USERID: Optional.  The Aster Database user name of the user running this function. The default USERID is “beehive”.

PASSWORD : Required.  The Aster Database password of the user.

INPUTTABLE : Required.  Name of the input table whose data we will filter.

OUTPUTTABLE : Required.  Name of the output table into which we will write the final results. If the output table already exists, then you should also pass the DROPTABLE (‘yes’) argument, to drop it before writing the new results.  Otherwise, an exception will be thrown. The output table contains the columns listed in the section "Example Output from Collaborative Filter" below.

INPUTCOLUMNS : Required.  A list of input columns to collect. The column names are single quoted and written in the comma-delimited format <'col1', 'col2', ...>.

JOINCOLUMNS : Required.  A list of columns to join on.  The column names are single quoted and written in the comma-delimited format <'col1', 'col2', ...>.

OTHERCOLUMNS : Optional.  A list of other columns to output. These will pass through the function unchanged. The column names are single quoted and written in the comma-delimited format <'col1', 'col2', ...>.

PARTITIONKEY : Optional.  Single column used as partition key for the newly created output table. Default partitionKey is col1_item1.

MAXSET: Optional.  Size of the maximum item set to be considered. Default maxItemSet is 100.

DROPTABLE: Optional.  When this option is set to true , if the output tablename already exists, itwill be dropped. Default value is false.

In Conclusion

Market Basket and Collaborative Filters are two  ways to gain deeper insights into customer’s purchasing behavior.  With this knowledge in hand, it can help you drive increased sales volume for your company.

5 REPLIES
Enthusiast

Re: Aster's Market Basket and Collaborative Filtering functions

Aster express with the db version is 4.6.2 does not support the function "basket_generator?

And where to get the sample data you using in the sample?
Teradata Employee

Re: Aster's Market Basket and Collaborative Filtering functions

My examples were all done in Aster 4.6.3. Concerning the data, the sales_fact table was from the Aster 101 class that I took. I used NCLUSTER_LOADER to load that table since it's a pretty big file. For the CFilter, I just manually inserted that data. Here's the code:

create table Collab
(tranid int, dt date, storeid int, region text, item text, sku int, category text) distribute by hash (tranid);

insert into Collab values(1,'20100715'::date, 1, 'west', 'milk', 1, 'dairy');
insert into Collab values(1,'20100715'::date, 1, 'west', 'butter', 2, 'dairy');
insert into Collab values(1,'20100715'::date, 1, 'west', 'eggs', 3, 'dairy');
insert into Collab values(1,'19990715'::date, 1, 'west', 'flour', 4, 'baking');
insert into Collab values(1,'19990715'::date, 1, 'west', 'sugar', 5, 'baking');
insert into Collab values(1,'19990715'::date, 1, 'west', 'diapers', 6, 'baby');
insert into Collab values(2,'20100715'::date, 2, 'east', 'milk', 1, 'dairy');
insert into Collab values(2,'20100715'::date, 2, 'east', 'egg whites', 7, 'dairy');
insert into Collab values(2,'19990715'::date, 2, 'east', 'flour', 4, 'baking');
insert into Collab values(2,'19990715'::date, 2, 'east', 'sugar', 5, 'baking');
insert into Collab values(3,'20100715'::date, 3, 'east', 'milk', 1, 'dairy');
insert into Collab values(3,'20100715'::date, 3, 'east', 'eggs', 3 ,'dairy');
insert into Collab values(3,'20100715'::date, 3, 'east', 'flour', 4, 'baking');
insert into Collab values(3,'20100715'::date, 3, 'east', 'sugar', 5, 'baking');
insert into Collab values(4,'20100715'::date, 1, 'south', 'milk', 1, 'dairy');
insert into Collab values(4,'20100715'::date, 1, 'south', 'cereal', 8, 'breakfast');
insert into Collab values(3,'20100715'::date, 2, 'east', 'beer', 9, 'alcohol');
insert into Collab values(5,'20100715'::date, 2, 'east', 'diapers', 6, 'baby');
insert into Collab values(5,'20100715'::date, 2, 'east', 'milk', 1, 'dairy');
insert into Collab values(6,'20100715'::date, 1, 'east', 'beer', 9, 'alcohol');
insert into Collab values(6,'20100715'::date, 1, 'east', 'cereal', 8, 'breakfast');
insert into Collab values(7,'20100715'::date, 1, 'south', 'beer', 9, 'alcohol');
insert into Collab values(7,'20100715'::date, 1, 'south', 'diapers', 6, 'baby');
insert into Collab values(8,'20100715'::date, 2, 'east', 'beer', 9, 'alcohol');
insert into Collab values(9,'20100715'::date, 2, 'east', 'diapers', 6, 'baby');
insert into Collab values(10, '20100715'::date, 3, 'south', 'milk', 1, 'dairy');
insert into Collab values(11, '20100715'::date, 3, 'east', 'milk', 1, 'dairy');
insert into Collab values(11, '20100715'::date, 3, 'east', 'orange juice', 10, 'beverages');
insert into Collab values(12, '20100715'::date, 3, 'east', 'beer', 9, 'alcohol');
insert into Collab values(12, '20100715'::date, 3, 'east', 'red bull', 11, 'beverages');
insert into Collab values(13, '20100715'::date, 1, 'south', 'beer', 9 , 'alcohol');
insert into Collab values(13, '20100715'::date, 1, 'south', 'chips', 13, 'snacks');
insert into Collab values(14, '20100715'::date, 2, 'north', 'salsa', 12, 'snacks');
insert into Collab values(14, '20100715'::date, 2, 'north', 'chips', 13, 'snacks');
insert into Collab values(14, '20100715'::date, 2, 'north', 'beer', 9, 'alcohol');

Re: Aster's Market Basket and Collaborative Filtering functions

Hi Mark,

What do the following stmts in your sql query for collaborative filtering do ?

"ON (SELECT 1)
PARTITION BY 1"

Regards,
Teradata Employee

Re: Aster's Market Basket and Collaborative Filtering functions

My understanding for CF result is as below
ntb: count( item1 and item2 concurrently appeard by tranid level combination )
cnt1: count(item1 appeard by tranid level combination)
cnt2: count(item2 appeard by tranid level combination)
score: (cntb/cnt1) * (cntb/cnt2)
Teradata Employee

Re: Aster's Market Basket and Collaborative Filtering functions

Cntb = # of baskets containing 2 items
Cnt1 = For any basket which has item1, sum item1
Cnt2 = For any basket which has item2, sum item2
Score = (cntb/cnt1 * cntb/cnt2) which is chance of both being in the same basket with each other