How does Fload handle loading of duplicate rows in NOPI table in case of a restart?
If there is a DBS restart, FastLoad reloads data from the last checkpoint, it does not effect dupicate rows loading processing.
What happens on the DBS side is:
The DBS has a special recovery scheme for NoPI table that throws away only duplicates from data that are resent due to a restart. There is no sort on NoPI table and therefore true duplicates from the data source do not get thrown away. This way, true duplicates stay but duplicates from resend are all cleaned up.
Thanks Thomas :)
My understandng is :
1) While Floading into the tables with PI the data will be sorted and duplicate rows will be eliminated.
2) While Floading into the tables with NOPI the data will not be sorted, so the restart can be started from the point where it got interupted.
Please correct me if i am wrong...
What is the exact reason for removing duplicates in case of Primary index and duplicate records in NOPI table. Why the records get sorted in PI case? Please explain. @dnoeth
FastLoad was implemented when Teradata only knew SET tables (i.e. relational tables not allowing duplicates), so removing duplicate rows was essential.
As tables with a PI are stored by ascending RowHash (RowID) within AMP the sort is necessary anyway.
A NoPI table still has a ROWID, but the data is not sorted, thus there's no way to find duplicate rows.
If you need to load duplicate rows you can simply switch to the MultiLoad protocoll instead.
Thanks @dnoeth... But one final understanding: the sorting is only at amp level in each amp . When we do select * from target table the data will be fetched from all the amps parallely so we cannot see that sorted data. One more point is sorting is of hash row_id is internal process we cannot see sorted value at table level anyways after doing select * from table. Am I correct?
There's a ROWID, but it's normally not enabled and thus can't be used.
You can do
select TOP 100 HashAmp(HashBucket(HashRow(PI column))), HashRow(PI column) from table
As the TOP is probably done on a single AMP, the AMP number returned should be the same and the RowHash should increase. Well, more or less increase, because it probably returns the rows in the actual order on the data block, which might be not the same as the logical order based on the block's sorted row reference array.
Using a low lovel tool like ferret you should be able to see the (logical) order within an AMP/table.
If a sorted result is desired, then an appropriate ORDER BY must always be specified. There is no default sort order in Teradata.