If I use a SET table, the Teradata database will perform a duplicate-row check on that table while loading data. When I use a UPI, it eliminates that duplicate search. Am I correct? Could someone please explain in detail?
You are correct. Basically, these are the cases to consider:
1. The table is SET and we create a UPI on it. Teradata checks the PI values for uniqueness and rejects duplicates; there is no need to scan the entire row, and all conditions are satisfied.
2. The table is MULTISET with a UPI. Here, checking that there are no duplicate PI values satisfies all conditions, so there is no need to scan the complete row.
3. The table is SET with a NUPI. The complete row needs to be compared to avoid duplicate rows, since the PI itself can have duplicates.
4. The table is MULTISET with a NUPI. No checks need to be done, as everything is allowed.
I hope that helped a little bit.
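The four combinations above can be sketched in DDL. The table and column names here are hypothetical, purely for illustration:

```sql
-- 1. SET table with a UPI: PI values checked for uniqueness; duplicates rejected.
CREATE SET TABLE orders_a (
    order_id    INTEGER,
    customer_id INTEGER,
    amount      DECIMAL(10,2)
) UNIQUE PRIMARY INDEX (order_id);

-- 2. MULTISET table with a UPI: same PI-value check, no full-row comparison.
CREATE MULTISET TABLE orders_b (
    order_id    INTEGER,
    customer_id INTEGER,
    amount      DECIMAL(10,2)
) UNIQUE PRIMARY INDEX (order_id);

-- 3. SET table with a NUPI: rows with the same hash must be compared in full
--    so that duplicate rows can be rejected.
CREATE SET TABLE orders_c (
    order_id    INTEGER,
    customer_id INTEGER,
    amount      DECIMAL(10,2)
) PRIMARY INDEX (customer_id);

-- 4. MULTISET table with a NUPI: no duplicate checks at all.
CREATE MULTISET TABLE orders_d (
    order_id    INTEGER,
    customer_id INTEGER,
    amount      DECIMAL(10,2)
) PRIMARY INDEX (customer_id);
```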
Your option four does indeed eliminate the duplicate-row and PI checks, but it is also the most susceptible to duplicate rows. The duplicate-tuple check should then be handled by some other mechanism: Changed Data Capture processes, ELT, or a Unique Secondary Index (USI). The USI will maintain uniqueness without affecting the distribution of the data across the AMPs.
At some point you'll hear or read a presentation by Alison Torres: the first concern when picking the PI of a Teradata table is the primary access path, with the caveat of avoiding major skewing. For that reason, you'll see a lot of tables with NUPIs and USIs on them. Depending on the size of the table, some quick testing will show that the increased I/O of maintaining the USI subtable is offset by getting AMP-local joins with large and/or frequently joined tables.
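As a sketch of that pattern (all names hypothetical): if the child table's NUPI is the same column the parent is distributed on, matching rows hash to the same AMP and the join can proceed AMP-locally, while a USI still enforces the child's own key:

```sql
-- Parent distributed on its key.
CREATE SET TABLE customer (
    customer_id INTEGER,
    name        VARCHAR(100)
) UNIQUE PRIMARY INDEX (customer_id);

-- Child: NUPI on the join column, USI for uniqueness.
CREATE MULTISET TABLE account (
    account_id  INTEGER,
    customer_id INTEGER,
    balance     DECIMAL(12,2)
) PRIMARY INDEX (customer_id)
  UNIQUE INDEX (account_id);

-- Rows of both tables with the same customer_id live on the same AMP,
-- so this join needs no row redistribution.
SELECT c.name, a.balance
FROM customer c
JOIN account a ON a.customer_id = c.customer_id;
```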
When we can treat some column in a table as unique, it helps performance, because checking duplicates on one column is much cheaper than comparing whole rows for duplicate values.
A MULTISET table is one that allows a whole row to be a duplicate of another, and it should be used only when the requirements specifically call for that.
Option one will incur the overhead of the UPI duplicate-value check plus the additional uniqueness check of the SET table. Option two is the one that makes the most sense for what you are asking.
I'd recommend you review option #3 in your first answer: if the table is SET with a NUPI, the complete row needs to be compared *unless* a unique secondary index (USI) is also defined, in which case the duplicate check is performed on the USI alone. A quite common practice in physical modeling is to use the Foreign Key as the NUPI, for best joining with a major entity on its Primary Key, and the Primary Key as a USI, enforcing data integrity with the added benefit of low resource utilization for duplicate checking.
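A minimal DDL sketch of that practice, with hypothetical names: the Foreign Key becomes the NUPI and the Primary Key becomes the USI, so uniqueness is enforced against the USI subtable rather than by full-row comparison:

```sql
CREATE MULTISET TABLE order_detail (
    order_detail_id INTEGER,   -- Primary Key of this entity
    order_id        INTEGER,   -- Foreign Key to the order header
    quantity        INTEGER
) PRIMARY INDEX (order_id)                  -- NUPI: matches the parent's PI for joining
  UNIQUE INDEX detail_pk (order_detail_id); -- USI: enforces the Primary Key
```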
I believe there is no such thing as an additional uniqueness-check overhead for option #1; the duplicate check is simply performed on the UPI.
Yes, to prevent a full-row comparison for duplicates in a SET table with a NUPI, normal practice is to add a uniqueness constraint on some column or columns.
This is a good discussion and a common question. It has become common to use MULTISET tables or unique indexes/constraints to avoid duplicate-row checks.
However, this can be overdone. If a NUPI has a reasonable number of duplicates then the cost of the duplicate row checking will be less than the cost of maintaining the separate structures or the impact of getting unintended duplicate rows in the table.
"reasonable" is defined as less than a couple data blocks of duplicate records.