Retailers mine transaction data to track purchasing behavior. Some of the more popular are Market Basket and Collaborative Filtering.
By understanding what products customers tend to purchase, a vendor can maximize their sales for that customer. Armed with this information, an analyst can initiate:
This presentation will provide a brief overview of how both Market Basket and Collaborative Filtering works conceptually, then we'll move into coding this on the Aster Data platform to see the results. Let's get started.
Market Basket (MB) is concerned with products/services purchased at the same time (basket). That’s simple enough. MB uses rules. A rule has two measures (Confidence and Support). Support measures how often items occur together, as a percentage of the total transactions. Confidence measures how much a particular item is dependent on another item (ie: When people buy liquor, they also buy infant products 50% of the time). Let’s look at a quick example to understand these 2 measures.
Assume have following table:
People who bought
Also bought this
57 % (4/7)
From here, it starts to get complicated, but wanted to give you notion that it’s not just seeing what’s in the basket.
Collaborative Filtering (CF) is similar but different (it’s not uncommon to use the output of MB as input to CF). It makes predictions about the interests of one User by collecting information from many Users (collaborating). Basically if User A has the same opinion as User B, A is more likely to have B’s opinion of a different issue than to have the opinion of a random User. Key point is that predictions are specific to a user, based on info from many users.
Here’s an example I pulled straight out of Wikipedia:
"Typically, the workflow of a collaborative filtering system is:
Now that we have a fundamental understanding of MB and CF, let’s run some code on ACT (Aster Command Terminal) to see how this all works.
We’ll be using the SALES_FACT table for this lab so here’s a partial listing at what’s inside.
Query: Want to find the 3 most popular products purchased in a basket on a per store basis.
select store_id, product_id1, product_id2, product_id3, count(1)
PARTITION BY store_id, basket_id
BASKET_ITEM ('product_id') BASKET_SIZE (3)
group by 1,2,3,4 order by 5 desc, 1 asc limit 10;
Armed with this information, I can now create a weekly flyer highlighting the Top 3 items on the cover page on a per store basis.
Here’s a listing of Arguments that is (mostly) self-explanatory.
PARTITION BY: Required. Must specify the column(s) to partition by. This specifies the sets for which combinations will generated and summed.
BASKET_ITEM: Required. Name(s) of the input column(s) that contains the items to be collected into baskets. If more than one input column is specified, every unique combination of input column values is treated as one item. For example, if a single column is used, this is often the column containing the SKU that identifies an item that was sold. If you wanted to further break down the results, you could specify both the SKU column and the month manufactured, color and/or size columns.
BASKET_SIZE: Required. Integer number of items to be included in basket. The default is 2 items.
ACCUMULATE: Optional. Names of input columns that will be returned as-is in the output. All input columns not named here are left out of the output. This must be a column that is part of the ON relation.
COMBINATIONS: Optional; defaults to 'true'. If 'true' the function returns a basket for each unique combination of items. If 'false' the function returns a basket for each unique permutation of items. Combinations are returned in lexicographical order. For a combination, the order of the items doesn't matter (the basket "tomatoes and basil" is considered to be the same basket as "basil and tomatoes"). For a permutation, every a unique ordering of the items constitutes a unique basket.
ITEM_SET_MAX: (Type=int) [default=100]. This is the maximum number of items to be considered in a partition. If the number of items in any partition exceeds ITEM_SET_MAX, no combinations (or permutations) will be emitted for that partition.
First off, here’s a partial listing of the table in the query. I created this based on data in the AAF User’s Guide.
Query: I want to find which Products are frequently paired together for a special promotion next month.
SELECT * FROM cfilter (
ON (SELECT 1)
PARTITION BY 1
inputTable ('Collab') outputTable ('cfilter_output')
inputColumns ('item') joinColumns ('tranid'));
Notice this function accepts an INPUTTABLE and outputs an OUTPUTTABLE. So to view the result set, you will have to SELECT from the OUTPUTTABLE name after the function has completed as shown below.
select * from cfilter_output order by score desc;
Hmmm. It looks like Salsa and Chips have the highest affinity in my answer set. So would always want to include these 2 together in any of my promotions.
The Arguments are:
DOMAIN : Optional. Has the form, host:port . The host is the Aster Database queen’s IP address or hostname. To specify an IPv6 address, enclose the host argu ment in square brackets, for example: [:: 1]:2406. The port is the port number that the queen is listening on. The default is the Aster standard port number (2406). For example: DOMAIN(10.51.23.100:2406)
DATABASE : Optional. This is the name of the database where the input table is present. Default database is beehive.
USERID: Optional. The Aster Database user name of the user running this function. The default USERID is “beehive”.
PASSWORD : Required. The Aster Database password of the user.
INPUTTABLE : Required. Name of the input table whose data we will filter.
OUTPUTTABLE : Required. Name of the output table into which we will write the final results. If the output table already exists, then you should also pass the DROPTABLE (‘yes’) argument, to drop it before writing the new results. Otherwise, an exception will be thrown. The output table contains the columns listed in the section "Example Output from Collaborative Filter" below.
INPUTCOLUMNS : Required. A list of input columns to collect. The column names are single quoted and written in the comma-delimited format <'col1', 'col2', ...>.
JOINCOLUMNS : Required. A list of columns to join on. The column names are single quoted and written in the comma-delimited format <'col1', 'col2', ...>.
OTHERCOLUMNS : Optional. A list of other columns to output. These will pass through the function unchanged. The column names are single quoted and written in the comma-delimited format <'col1', 'col2', ...>.
PARTITIONKEY : Optional. Single column used as partition key for the newly created output table. Default partitionKey is col1_item1.
MAXSET: Optional. Size of the maximum item set to be considered. Default maxItemSet is 100.
DROPTABLE: Optional. When this option is set to true , if the output tablename already exists, itwill be dropped. Default value is false.
Market Basket and Collaborative Filters are two ways to gain deeper insights into customer’s purchasing behavior. With this knowledge in hand, it can help you drive increased sales volume for your company.