why should we use SET table ?

Enthusiast

why should we use SET table ?

A SET table forces Teradata to check for duplicate rows every time a row is inserted or updated, which can add a lot of overhead to those operations.
Why can't we use MULTISET tables for all the tables we load from other tables in the DWH, and use GROUP BY to remove duplicate rows on insert?
Enthusiast

Re: why should we use SET table ?

There's no reason to use a SET (as opposed to MULTISET) table if you don't need the automatic duplicate checking. By the way, if a table has a uniqueness constraint defined on some set of columns (i.e., a primary key or unique index), then there is absolutely no difference between SET and MULTISET, as only the uniqueness constraint will be enforced in either case.
Teradata Employee

Re: why should we use SET table ?

Actually, there IS a difference between SET and MULTISET even if you have a UNIQUE index:

CREATE TABLE T1 (c1 INTEGER NOT NULL, c2 INTEGER) PRIMARY INDEX (c1);
INSERT T1 VALUES (1,1);
INSERT T1 VALUES (1,2);
INSERT T1 VALUES (2,1);

CREATE SET TABLE T2_S (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);
CREATE MULTISET TABLE T2_MS (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);

INSERT T2_S SELECT c1 FROM T1;
/* succeeds - quietly eliminates the duplicate and inserts 2 rows */
INSERT T2_MS SELECT c1 FROM T1;
/* fails with duplicate key error */
Enthusiast

Re: why should we use SET table ?

Hi Jim Chapman / Fred Pluebell

I am concerned about the time taken to load SET vs. MULTISET tables.
In my application there are many SET tables created for storing temporary results.
Loading these SET tables takes a long time, so we are planning to change them to MULTISET and add a GROUP BY over all columns to remove duplicate rows; the table index is a Primary Index. Is this the correct way?
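For illustration, a minimal sketch of that pattern with made-up table and column names (stg_src as the source, tmp_result as the MULTISET target); the GROUP BY over all selected columns removes full-row duplicates, which is what the SET table would otherwise do implicitly:

/* Hypothetical MULTISET temp table, so no duplicate row check is done on insert */
CREATE MULTISET TABLE tmp_result (
  cust_id   INTEGER NOT NULL,
  order_id  INTEGER NOT NULL,
  order_amt DECIMAL(12,2)
) PRIMARY INDEX (cust_id);

/* GROUP BY on all selected columns removes full-row duplicates before the insert */
INSERT INTO tmp_result (cust_id, order_id, order_amt)
SELECT cust_id, order_id, order_amt
FROM stg_src
GROUP BY cust_id, order_id, order_amt;

SELECT DISTINCT gives the same result as the GROUP BY over all columns; which of the two performs better can depend on the data and the release, so it may be worth comparing the Explain output for both.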
Teradata Employee

Re: why should we use SET table ?

First, I'd suggest you verify that you have chosen a good Primary Index. If the number of rows with any given PI value is small, the SET table duplicate row checking overhead is small.

There may be cases in which an extremely non-unique PI is a reasonable choice; in those cases changing to MULTISET is certainly worth considering. Just be sure you aren't covering up the real problem.
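If it helps, a quick way to see how non-unique the PI actually is (a sketch only; my_table and pi_col are placeholder names):

/* Top 10 PI values by number of rows sharing them (placeholder names) */
SELECT TOP 10 pi_col, COUNT(*) AS rows_per_value
FROM my_table
GROUP BY pi_col
ORDER BY rows_per_value DESC;

/* Average and maximum rows per PI value */
SELECT AVG(CAST(rows_per_value AS FLOAT)) AS avg_rows,
       MAX(rows_per_value)                AS max_rows
FROM (SELECT pi_col, COUNT(*) AS rows_per_value
      FROM my_table
      GROUP BY pi_col) AS t;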

Re: why should we use SET table ?

(Somewhat of a late comment relative to the prior posts, but the internet is timeless and it will help any googlers).

Can someone quantify when they say that a SET table has overhead? I like the fact that a SET table discards duplicates for me. I want to know how it scales. If I want to store, say, person id and sku pairs for billions of rows, I simply insert into my SET table knowing that it will discard previously seen pairs. That's better than coding a proper NOT EXISTS or other technique, as I would have to with a MULTISET table. How far can I push this before the overhead is no longer worth it?
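(For reference, the MULTISET alternative being compared against would look roughly like the sketch below; person_sku_pairs, daily_feed and the column names are assumed placeholders.)

/* Hypothetical MULTISET equivalent: insert only pairs not already present */
INSERT INTO person_sku_pairs (person_id, sku)
SELECT DISTINCT s.person_id, s.sku
FROM daily_feed s
WHERE NOT EXISTS
  (SELECT 1
   FROM person_sku_pairs t
   WHERE t.person_id = s.person_id
     AND t.sku = s.sku);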

Senior Apprentice

Re: why should we use SET table ?

To insert a new row the system has to do a duplicate row check for all existing rows with the same RowHash value, i.e. all rows with the same PI value or another PI value which happens to hash the same.

And of course it's cumulative: each new row has to be compared against every row with that RowHash already in the table, so to insert N rows with the same RowHash you'll need (N*(N-1)) / 2 checks in total:

10 rows -> 45 dup row checks

100 -> 4950

1000 -> 499500

The rule of thumb is:

If the number of rows per value is low (maybe up to a few hundred) and all rows fit on a single datablock, it might be acceptable.

Dieter

Re: why should we use SET table ?

Dieter,

How does each instance of a dup row check equate to time? The formula is way beyond a linear increase, but I'm not seeing that in practice. I have a x00 million row SET table into which I insert x0 million rows daily, with net new rows at about 5%, and I'm not getting the time degradation that the formula would indicate.

What would you advise as a more efficient technique when I simply want to sweep new pairs into my SET pair table? As it is, I don't have to worry about ETL change-data coding.

Appreciate your feedback.

Julius

Senior Apprentice

Re: why should we use SET table ?

Hi Julius,

What's the average number of rows per value for this table?

The formula is only for comparing the overhead at different rows-per-PI counts: when the count is doubled, the number of dup checks needed to insert all those rows with the same PI is quadrupled.

When the number of rows per PI increases because new rows keep being inserted with an existing PI, you should see a linear increase of CPU usage in DBQL, e.g. existing number of rows per PI value: 50, inserting 1 row per day per PI.

After 50 days the number of rows per PI has doubled; inserting a new row now takes 99 dup checks vs. 49 on the first day, resulting in roughly double the CPU time per row.
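If DBQL logging is enabled, that trend is easy to watch; a rough sketch (the table name in the LIKE pattern is a placeholder):

/* Daily AMP CPU consumed by the insert into the SET table (assumes DBQL logging is on) */
SELECT CAST(StartTime AS DATE) AS run_date,
       SUM(AMPCPUTime)         AS total_amp_cpu
FROM DBC.QryLogV
WHERE QueryText LIKE 'INSERT INTO my_set_table%'
  AND CAST(StartTime AS DATE) >= CURRENT_DATE - 30
GROUP BY 1
ORDER BY 1;

If rows per PI are growing steadily, total_amp_cpu should grow roughly linearly day over day, as described above.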

Dieter

Re: why should we use SET table ?

SET tables will insert rows quickly when you start inserting, but become much slower as the table's record count reaches into the millions.

This is because a SET table won't allow an entire-row duplicate: as you insert new rows, the TD system has to check whether each row is already present. If the row is already present it simply isn't inserted, and no error message is given either. Because this is an automatic process, a lot of resources and processor time are required to do it. Initially it works fast; as the record count reaches millions, it slows down.

The likely cause is the 'duplicate row check'. This process slows down as you add more rows to the target table. It is a problem when you have too many rows per PI data value (well, technically per PI rowhash) AND the table is SET AND there are no unique indexes on the target table:

- As the process starts, there are relatively few rows for any given PI rowhash value.

- Duplicate row checking happens, but with only a few rows to check it doesn't appear to slow you down.

- As you add more data to the target table, the number of rows with the same PI rowhash grows, and you can now 'see' the processing slow down.

If the duplicate row check looks like the culprit, then a couple of things come to mind (a rough sketch of both follows below):

- Add a unique index.

- Change the table to MULTISET (providing, of course, that your data model/processing can handle this).

If the table has a unique index (either UPI or USI) then you will ** not ** have a problem with 'duplicate row checks'.

The only time you have problems with duplicate row checks is when:

- the table is SET, AND

- the table has NO unique indexes, AND

- you have lots of rows with the same PI rowhash value.

If any of the above conditions are NOT met then you will not have a problem with duplicate row checks during insert processing.
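To make those two remedies concrete, a rough sketch with made-up names (my_tgt, order_id); note that a table generally cannot be switched between SET and MULTISET in place, so the second option usually means recreating it:

/* Option 1: add a USI so the duplicate row check is replaced by a unique index check
   (the indexed column(s) must already be unique across the table) */
CREATE UNIQUE INDEX (order_id) ON my_tgt;

/* Option 2: recreate the table as MULTISET and swap it in
   (secondary indexes and statistics would need to be recreated afterwards) */
CREATE MULTISET TABLE my_tgt_ms AS my_tgt WITH DATA;
DROP TABLE my_tgt;
RENAME TABLE my_tgt_ms TO my_tgt;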

Thanks,

Sughesh I S