(Somewhat of a late comment relative to the prior posts, but the internet is timeless and it will help any googlers).
Can someone quantify when they say that a SET table has overhead? I like the fact that a SET table discards duplicates for me. I want to know how it scales. If I want to store, say, person id and sku pairs for billions of rows, I simply insert into my SET table knowing that it will discard prior pairs. That's simpler than writing a proper NOT EXISTS or using the other techniques I'd need if it were a MULTISET table. How far can I push this before the overhead is no longer worth it?
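For concreteness, this is roughly what I do today (table and column names here are just placeholders for this post):

CREATE SET TABLE person_sku_pairs
( person_id  BIGINT NOT NULL
, sku        BIGINT NOT NULL
) PRIMARY INDEX ( person_id );

-- the daily sweep: re-running it is harmless, the SET table
-- silently drops any pair it already holds (INSERT-SELECT path)
INSERT INTO person_sku_pairs (person_id, sku)
SELECT person_id, sku
FROM   staging_person_sku;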
To insert a new row the system has to do a duplicate row check against all existing rows with the same RowHash value, i.e. all rows with the same PI value (or a different PI value which happens to hash the same).
And of course it's cumulative: the k-th row has to be checked against the k-1 rows already in place, so to insert N rows with the same RowHash you'll need 1 + 2 + ... + (N-1) = (N*(N-1)) / 2 checks:
10 rows -> 45 dup row checks
100 -> 4950
1000 -> 499500
The rule of thumb is:
If the number of rows per value is low (maybe up to a few hundred) and all rows with the same value fit on a single data block, the overhead might be acceptable.
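To see where you stand against that rule of thumb, a quick profile of rows per PI value helps; something along these lines should do, reusing the placeholder names from the question (swap in your own table and PI column):

SELECT AVG(CAST(cnt AS FLOAT)) AS avg_rows_per_value
     , MAX(cnt)                AS max_rows_per_value
FROM ( SELECT person_id            -- the (NU)PI column(s) of the SET table
            , COUNT(*) AS cnt
       FROM   person_sku_pairs
       GROUP  BY 1
     ) AS per_value;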
How does each instance of a dup row check equate to time? The formula implies growth that is way beyond linear, but I'm not seeing that. I have a x00 million row SET table into which I insert x0 million rows daily, with net new rows around 5%, and I'm not getting the timewise degradation that the formula would indicate.
What would you advise as a more efficient technique when I simply want to sweep new pairs into my SET pair table? As it is, I don't have to worry about ETL change-data coding.
Appreciate your feedback.
what's the average number of rows per value for this table?
The formula is only meant to compare the overhead for different numbers of rows per PI: when the row count per PI doubles, the number of dup checks needed to insert all those rows with the same PI quadruples.
When the number of rows per PI increases because you keep inserting new rows with an existing PI, you should see a linear increase in CPU usage in DBQL. E.g. with an existing 50 rows per PI value and 1 new row per day per PI: after 50 days the number of rows per PI has doubled, and inserting a new row now takes 99 dup checks vs. 49 on the first day, i.e. roughly double the CPU time per row.
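To see that trend in numbers, and assuming DBQL query logging is enabled and you can read DBC.QryLogV, a query along these lines should show the daily CPU of the insert (the LIKE filter on the logged text is only one way to find the statements and only works if the text is captured; adapt it to however you identify the job):

SELECT CAST(StartTime AS DATE) AS run_date
     , SUM(AMPCPUTime)         AS amp_cpu_secs
     , SUM(TotalIOCount)       AS total_io
FROM   DBC.QryLogV
WHERE  CAST(StartTime AS DATE) >= CURRENT_DATE - 60
AND    QueryText LIKE '%person_sku_pairs%'
GROUP  BY 1
ORDER  BY 1;

If the rows per PI value really are growing, amp_cpu_secs per inserted row should creep up roughly linearly from day to day.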
SET tables insert rows quickly when the inserts start, but become much slower as the record count in the table reaches millions. This is because a SET table won't allow an entire-row duplicate: as you insert new rows, the TD system has to check whether each particular row is already present. If the row is already present it simply isn't inserted, and no error message is given either. Since this check happens automatically, it takes a lot of resources and processor time. Initially it works fast, but as the record count reaches millions it becomes noticeably slower.
A possible cause is the 'duplicate row check'. This process is likely to slow down
as you add more rows to the target table. This is a problem when you have
too many rows per PI data value (well, technically PI rowhash) AND the table
is SET AND there are no unique indexes on the target table.
- as the process starts, there are relatively few rows for any given PI
- duplicate row checking happens, but with only a few rows to check it
doesn't appear to slow you down
- as you add more data to the target table, the number of rows with the same PI rowhash grows, and you can now 'see' the processing slow down.
If the duplicate row check looks like the culprit then a couple of things come to mind (rough sketches after this list):
- add a unique index
- change the table to multiset (providing of course that your data
model/processing can handle this)
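Roughly what those two options look like, using the pair table from the original question as a stand-in (all names are assumptions, and you should test how your load handles duplicates afterwards, since the index route changes where the uniqueness check happens):

-- Option 1: keep the SET table, add a unique secondary index on the
-- full pair so the USI check is used instead of the row-by-row
-- duplicate scan
CREATE UNIQUE INDEX (person_id, sku) ON person_sku_pairs;

-- Option 2: go MULTISET and filter known pairs yourself
CREATE MULTISET TABLE person_sku_pairs_ms
( person_id  BIGINT NOT NULL
, sku        BIGINT NOT NULL
) PRIMARY INDEX ( person_id );

INSERT INTO person_sku_pairs_ms (person_id, sku)
SELECT DISTINCT s.person_id, s.sku      -- DISTINCT also removes dupes within the batch
FROM   staging_person_sku s
WHERE  NOT EXISTS
     ( SELECT 1
       FROM   person_sku_pairs_ms t
       WHERE  t.person_id = s.person_id
       AND    t.sku       = s.sku );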
If the table has a unique index (either UPI or USI) then you will ** not **
have a problem with 'duplicate row checks'.
The only time you have problems with duplicate row checks is when:
- the table is SET, AND
- the table has NO unique indexes, AND
- you have lots of rows with the same PI rowhash value.
If any of the above conditions are NOT met then you will not have a problem
with duplicate row checks during insert processing.
Sughesh I S