I have a requirement to convert around 50 tables from set to multiset. I am trying to weigh the trade-off of keeping a table as set against the performance of inserting records into it. I did an explain plan on insert statements for a table defined as set and then altered it as set. Both inserts give me the same explain plan. I was expecting some additional steps in the plan for the set table because of the system-level duplicate check.
So my question is: does the explain plan show the additional steps involved in the system-level duplicate check performed for a set table?
You have really provided the answer in the first paragraph: there are no additional steps. The phrase "defined as set and then altered it as set" mentions "set" twice, but I guess you mean one "set" and one "multiset".
The reason is that the duplicate check is performed neither before nor after the insert, but during the insert, and only for the rows being inserted. The check is performed at the AMP level, not at the system level. Why? Remember that rows in regular Teradata tables are stored in PI rowhash sequence (PPI and NoPI tables are a bit different). When new rows are added, they are not appended at the end of the table, but placed into the proper data blocks to maintain the rowhash sequence. That makes it convenient to do the duplicate check during the insert: for a given row, the AMP locates the data block where the row should be inserted, and if a duplicate row already exists, it must be in that very data block (duplicate row, same PI value, same rowhash value). So it is convenient to check for duplicates right during the insert operation.
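To illustrate the behavior difference, here is a minimal sketch with hypothetical table names. Note that for a set table, a single-row INSERT ... VALUES of an exact duplicate fails with a duplicate row error, while an INSERT ... SELECT silently discards the duplicates instead:

```sql
-- Hypothetical tables: identical definitions except SET vs MULTISET
CREATE SET TABLE demo_set (
    cust_id INTEGER,
    txn_dt  DATE
) PRIMARY INDEX (cust_id);

CREATE MULTISET TABLE demo_multiset (
    cust_id INTEGER,
    txn_dt  DATE
) PRIMARY INDEX (cust_id);

INSERT INTO demo_set VALUES (1, DATE '2023-01-01');
-- Same PI value, same rowhash, same data block: the AMP finds the
-- existing row during the insert and rejects the duplicate.
INSERT INTO demo_set VALUES (1, DATE '2023-01-01');  -- duplicate row error

INSERT INTO demo_multiset VALUES (1, DATE '2023-01-01');
INSERT INTO demo_multiset VALUES (1, DATE '2023-01-01');  -- both rows kept
```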
I agree with Vlad: the number of steps will be the same for set and multiset tables, but the duplicate row check will be made at block level, byte by byte.
Thank you for agreeing with me. However, I believe that there is no need to dedicate an additional post to say only that. :)
Thanks all for the responses.
Yes, the statement has a typo; it should read: "I did an explain plan on insert statements for a table defined as set and then altered it to multiset."
The volume of data differs, and I have around 60 tables to consider.
So, all in all, I am going to look at the data volume I am expecting on each table. If a table's volume is expected to stay small even over a long period, I am planning to keep it as set.
For larger tables (in row count), I will have to check the trade-off between keeping the table as set and doing a GROUP BY on all fields after changing the table to multiset. This is where I am stuck, because the addition of the GROUP BY shows up in the plan with additional time, and I am not able to tell whether that extra time is better or worse than the AMP-level duplicate check for a set table.
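For reference, the de-duplicating load into a multiset table could look like this sketch (table and column names are hypothetical; grouping on all selected columns is equivalent to SELECT DISTINCT here):

```sql
-- Remove exact duplicates at load time instead of relying on
-- the set table's row-by-row duplicate check.
INSERT INTO target_multiset (cust_id, txn_dt, amount)
SELECT cust_id, txn_dt, amount
FROM   staging_tbl
GROUP BY cust_id, txn_dt, amount;
```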
It is possible to measure. Create two similar tables, one set and one multiset, with the same data. Insert the same new data into both tables, and then measure the CPU consumption in the DBQL journal, in the AMPCpuTime field. Even though the explains are the same, the CPU consumption will probably differ due to the duplicate checks.
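Assuming query logging is enabled on your system (BEGIN QUERY LOGGING) and the standard DBC views are in place, a comparison query could look like this; the table-name filter is just an example:

```sql
-- Compare CPU and I/O for the two inserts in the DBQL log.
SELECT QueryText,
       AMPCPUTime,
       TotalIOCount
FROM   DBC.QryLogV
WHERE  UserName = USER
AND    QueryText LIKE 'INSERT INTO demo%'
ORDER BY StartTime DESC;
```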
Because of the AMP-level mechanics mentioned above, the cost of the duplicate check does not directly depend on how big the table is. Rather, it depends on the demographics of the primary index. If the PI is nearly unique, the duplicate checks are cheap, because a new row only has to be compared with rows having the same PI value. However, if there are PI values with high numbers of rows per value, then inserting a new row forces a comparison against many rows with the same PI. That is CPU intensive.
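You can check those PI demographics up front. A sketch, with a hypothetical table and PI column:

```sql
-- Rows per PI value: high maximums mean many byte-by-byte
-- comparisons per insert on a set table.
SELECT TOP 10
       cust_id,
       COUNT(*) AS rows_per_pi
FROM   my_table
GROUP BY cust_id
ORDER BY rows_per_pi DESC;
```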
Should you have DBQL results that would prove otherwise, please do not hesitate to share.