Join Issues - Need help in most efficient way to solve the problem

Database
Enthusiast

Join Issues - Need help in most efficient way to solve the problem

Hi,

I need some help to solve the following problem in the most efficient way. 

Consider Table1 - 

ShopNbr Item
1 100
1 101
1 102
2 101
2 104
3 100
3 102
3 103
3 104
3 105

I have provided a snapshot of the data above. The actual data will have close to 1000 shops and 300 items. Each shop need not have all 300 items.We have Shop-Item level data here.

In the above case, I need to choose 2 shops which will capture the maximum number of distinct items. i.e. If i choose Shops 1 and 2, I will get items 100, 101, 102, 104 ( 4 items) and similarly i need to check for combinations 2&3 and 1&3 and choose the one which captures most items. 

I can do this through a cross join but my problem is in my actual data I have close to 1000 shops. We need to know the most efficient way to do this.

In simple words I need the best combination (lets say I need to choose 5 out of 1000 shops), it would be 1000C5 iterations which is nearly 1000^5 which is not practical.

Also note we do not have access to stores procedures in our servers. 

Thanks in advance. Any help will be extremely useful :)

Karthik

14 REPLIES
Enthusiast

Re: Join Issues - Need help in most efficient way to solve the problem

Any suggestions ?

Enthusiast

Re: Join Issues - Need help in most efficient way to solve the problem

Dieter - Any suggestions ?

Senior Apprentice

Re: Join Issues - Need help in most efficient way to solve the problem

Hi Karthik,

i don't think there's a simple solution, especially without Stored Procedures.

As it's only 300 items you might use bit-functions to create a bit-string and then BITOR and COUNTSET to combine and count.

For two shops it's only 1000 * 1000 rows and maybe you can apply some rigid filter before you add the next shop to reduce the number of comparisons.

Enthusiast

Re: Join Issues - Need help in most efficient way to solve the problem

Dieter - Thanks a lot.

If I give a rule saying that always choose a shop which has maximum items, can I write a normal SQL code to achieve that ? 

In other words, among all 1000 shops I will identify 1 shop which has maximum items, select that shop, delete all the items chosen in that shop, and again identify 2nd shop with maximum items and so on till 5 shops or even 10 shops if needed. ( I will know that number of shops)

How easy is this ? I am not able to come up with a coding logic for this though in Teradata.

Thanks again.

Enthusiast

Re: Join Issues - Need help in most efficient way to solve the problem

CREATE VOLATILE TABLE T 
(
SHOP INT,
ITEM INT
)
UNIQUE PRIMARY INDEX(SHOP,ITEM)
ON COMMIT PRESERVE ROWS;
INSERT INTO T
VALUES (1,100);
INSERT INTO T
VALUES (1,101);
INSERT INTO T
VALUES (1,102);
INSERT INTO T
VALUES (2,101);
INSERT INTO T
VALUES (2,104);
INSERT INTO T
VALUES (3,100);
INSERT INTO T
VALUES (3,102);
INSERT INTO T
VALUES (3,103);
INSERT INTO T
VALUES (3,104);
INSERT INTO T
VALUES (3,105);

SELECT SHOP1, SHOP2 , SUM(CNT) AS TOTAL_CNT
FROM
(
SELECT T1.SHOP AS SHOP1, T2.SHOP AS SHOP2, COUNT(DISTINCT T1.ITEM) AS CNT
FROM
T T1
INNER JOIN
T T2
ON
T1.ITEM = T2.ITEM AND
T1.SHOP LT T2.SHOP
GROUP BY T1.SHOP, T2.SHOP
)A
GROUP BY 1,2
QUALIFY ROW_NUMBER() OVER (ORDER BY TOTAL_CNT DESC) =1
Senior Apprentice

Re: Join Issues - Need help in most efficient way to solve the problem

Hi Karthik,

imho you can't choose the shop with the maximum number of items, e.g.

shop 1: 2,3,4,5,6,7

shop 2:1,2,3,4

shop 3: 5,6,7,8

shop 1 + 2 or shop 1 + 3 has less items than shop 2+ 3

Enthusiast

Re: Join Issues - Need help in most efficient way to solve the problem

Agreed totally with you Dieter. But in the actual data we have we feel that the case you have explained would be pretty rare and we are okay with ignorning that though I understand we are compromising the accuracy of the model. 

But if we are fine with it, I still am not able to come up with the macro which can perform the same for me.

Enthusiast

Re: Join Issues - Need help in most efficient way to solve the problem

Dieter - Sorry again but any new ideas ?

Teradata Employee

Re: Join Issues - Need help in most efficient way to solve the problem

Although this is an interesting problem, but I would agree with Dieter in that, this is more of an algorithmic efficiency problem than data access efficiency problem. At 1000x300, your entire problem can be represented in 37.5K bytes of storage (or 300K if you use byte instead of bit). It will be more efficient to use SQL to retrieve entire data into memory and then solve the problem using procedural language rather than solve the problem using SQL. Formally, this class of problems belongs to combinatorial optimization problems (http://en.wikipedia.org/wiki/Combinatorial_optimization).

Good luck.