I have a list of products and a count corresponding to the quantity sold in a single table. The data is laid out as such:
Product Name QTY_SOLD
I want to create a group ID where groups are broken up if the sum of the quantity sold is greater than 50. We can order by Product Name to get an output similar to the following.
Product Name QTY_SOLD GROUP_NBR
Keyboard 25 1
Laptop 45 1
Monitor 56 2
Mouse 23 3
TV 10 3
I created a case statement to create the output I need but if I want to change the group id cutoff from 50 to say 100 or if i get more products and quantities I have to keep changing the case statement. Is there an easy way to use either recursion or some other method to accomodate this?
FROM prod_list AS main,
, SUM(QTY_SOLD) OVER (ORDER BY PROD_NAME ROWS UNBOUNDED PRECEDING) RUNNING
SET GROUP_NBR =
WHEN RUNNING < 50 THEN 1
WHEN RUNNING > 50 AND RUNNING < 100 THEN 2
WHEN RUNNING > 100 AND RUNNING < 150 THEN 3
WHEN RUNNING > 150 AND RUNNING < 200 THEN 4
WHEN RUNNING > 200 AND RUNNING < 250 THEN 5
WHERE main.PROD_NAME = inr.PROD_NAME
I meant to say, I want to create a group ID where groups are created if the ROLLING sum of the quantity sold is greater than 50.
I was thinking I may be able to get away with simply distributing the records evenly based on an average or other distribution key. If I had data such as below I would want to distribute them into buckets somewhat evenly.
PROD A - 50
PROD B - 55
PROD C - 1000
PROD D - 25
PROD E - 32
PROD F - 800
One possible result
GRP 1 - (PROD A & D)
GRP 2 - (PROD B & E)
GRP 3 - (PROD C)
GRP 4 - (PROD F)
Is that possible?
Change your running sum to order on QtySold.
Join that result to SELECT SUM(QtySold) AS GrandTotal FROM Prod_List.
Replace your CASE expression with (running/GrandTotal) DIV <#of buckets you want>.
This distributes the products into a best approximation equal height histogram by their percentage contribution to the total quantity. If you have very large outliers then you might have missing buckets because the big one will cover multiple percentage buckets.
I tried to construct the query you recommended but when i ran it with over a hundred products it created almost as many buckets and didnt stick to the number of buckets i had wanted. I created another query using MOD and it gave me almost what i was looking for but didnt distribute as evenly as I had hoped for.
UPDATE main FROM prod_list AS main,
, SUM(QTY_SOLD) OVER (ORDER BY NUMOF ROWS UNBOUNDED PRECEDING) RUNNING
SET GROUP_NBR = RUNNING MOD <# of buckets>
WHERE main.PROD_NAME = inr.PROD_NAME ;
This is missing the join to and division by the grand total. That is necessary in order to make the running sum a percentage contribution. And I left off the multiplication of that result by 100 before the division by the number of buckets.
In your sample you order by NUMOF? rather than QTY_SOLD.
Is this the query you mean? I did the whole thing over so that I wouldnt have to do an insert and then an update. This still gives me some weird results what am I missing?
WITH PROD_CTE (PROD_NAME,QTY_SOLD) AS
, COUNT(DISTINCT PROD_ID) AS QTY_SOLD
GROUP BY 1
, (((RUNNING_SUM / TTL ) * 100 ) / 10) AS BUCKET
, SUM(QTY_SOLD) OVER (ORDER BY QTY_SOLD ROWS UNBOUNDED PRECEDING) AS RUNNING_SUM
SELECT SUM(QTY_SOLD) TTL
)sm ON 1=1
I ended up using a suggestion from the guys over at stack overflow.
First create a table that has the group id and the upper and lower limits of each bucket.
-- create the first entry for the recursive query
INSERT TMP_WORK_DB.GRP_NBRS VALUES (0,1,0,2000000);
INSERT TMP_WORK_DB.GRP_NBRS (GRP_NBR,LOWER_LIMIT, UPPER_LIMIT)
WITH RECURSIVE GRP_RECRSV (GRP_NBR, LOWER_LIMIT, UPPER_LIMIT)
1 AS GRP_NBR
GRP_NBR + 1
, LOWER_LIMIT + 2000000 -- set the interval to 2 million
, UPPER_LIMIT + 2000000 -- can be adjusted as needed
WHERE GRP_NBR < 120 -- needed a limit so that it would not be endless
SELECT * FROM GRP_RECRSV
Then use a simply WHERE BETWEEN on the data to determine which GROUP ID to assign.
-- delete the first entry because it was duplicated
DELETE FROM TMP_WORK_DB.GRP_NBRS WHERE GRP_NBR = 0;
-- set grp nbr using the limits table
WITH NUMOFPRODS (PROD_NAME,QTY,RUNNING) AS
, COUNT(DISTINCT PROD_ID) AS QTY
, SUM(QTY) OVER (ORDER BY QTY ROWS UNBOUNDED PRECEDING) RUNNING
GROUP BY 1
FROM NUMOFPRODS a
JOIN TMP_WORK_DB.GRP_NBRS b ON RUNNING BETWEEN LOWER_LIMIT AND UPPER_LIMIT