Hello Teradata gurus/enthusiasts,
I was reading up on how TPump works and I have a question about TPump's serialization.
My understanding is that TPump establishes sessions based on the number of buffers created. If serialization is turned on, all rows with the same row hash are packed into the same buffer, to avoid deadlocks between AMPs working on the same rows through different sessions. Right?
So my question is this. Say there are three buffers and three sessions established by TPump with the RDBMS, and there are 10 records to send. The first 9 records are unique and the 10th is the same as the first. Will TPump fill the buffers as it reads (records 1, 2, 3 into the first buffer, then send that buffer since it belongs to its own session), or will it wait until all records are processed, notice that the 10th record matches the first, and then put both in the same buffer (so buffer 1 holds records 1 and 10, and buffers 2 and 3 hold the rest)?
If it waits, what happens with thousands of records? Will TPump hold everything until all rows with the same row hash are in the same buffer?
The number of buffers specified is per session. So, if I run a 10-session TPump job and specify 2 buffers, I will get 20 buffers. The purpose of the buffers is to accumulate additional updates (these could be updates, inserts, or deletes) for a session while that session is busy applying updates to the database.
The "KEY" specified in the layout determines how the incoming records are distributed among the sessions. The "KEY" does not have to be the same as the primary index of the target table. If you do not make it the same as the primary index of the table and if it's not a subset of the columns in the primary index, then you will likely have deadlocks since rows with the same primary index value could go to different sessions.
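To make the deadlock risk concrete, here is a minimal sketch in Python. It is only an illustration, not TPump's actual hashing: CRC32 stands in for whatever hash TPump uses, and the column names cust_id (assumed to be the primary index) and txn_id are hypothetical.

```python
import zlib

def session_for(key_values, num_sessions):
    """Pick a session index from the KEY column values.

    CRC32 is a stand-in hash chosen for repeatability; it is NOT TPump's real hash.
    """
    joined = "|".join(str(v) for v in key_values)
    return zlib.crc32(joined.encode("utf-8")) % num_sessions

NUM_SESSIONS = 4

# Hypothetical input: 20 rows that all share the same primary index value (cust_id = 42).
rows = [(42, txn_id) for txn_id in range(1, 21)]

# KEY = cust_id (the PI): every row hashes to the same session.
sessions_key_is_pi = {session_for((cust_id,), NUM_SESSIONS) for cust_id, _ in rows}

# KEY = txn_id (not part of the PI): rows with the same PI value spread over sessions.
sessions_key_not_pi = {session_for((txn_id,), NUM_SESSIONS) for _, txn_id in rows}

print("sessions used when KEY is the PI:", sorted(sessions_key_is_pi))
print("sessions used when KEY is not in the PI:", sorted(sessions_key_not_pi))
```

In the second case, several sessions end up updating rows with the same primary index value at the same time, which is the situation that can lead to deadlocks.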
In your example, if you had 10 sessions, TPump would hash the incoming rows over the sessions based on what you specify as the "KEY". It accumulates updates for each session until that session reaches the PACK factor. Assuming that records 1 and 10 had the same key, they would be sent to the same session. As each session reaches its pack factor, the updates for that session are applied to the database. While that is happening, a buffer is used to accumulate the next set of updates for the session, so that TPump does not have to wait for the database to finish applying that session's updates.
Based on the data values in the SERIALIZE ON columns, TPump determines which session will be used to send each row, and the row is placed in a buffer for that session. It's a hash/modulo calculation, conceptually similar to (but simpler than) how rows are assigned to AMPs within the database. Whenever a buffer is full or there is no more input, the buffer is sent.
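As a rough sketch of that routing applied to the 10-record example from the question (Python, with assumptions labeled): CRC32 stands in for the real hash, the function names are made up for illustration, and "pack" here simply means "records per buffer before it is sent". The point is that each record is routed as it is read, with no look-ahead over the rest of the input.

```python
import zlib
from collections import defaultdict

def session_for(key, num_sessions):
    """Hash/modulo session choice. CRC32 is a stand-in, not TPump's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_sessions

def route(records, num_sessions, pack):
    """Return the sends as (session, [keys]) in the order they would happen."""
    pending = defaultdict(list)   # one accumulating buffer per session
    sends = []
    for key in records:
        s = session_for(key, num_sessions)
        pending[s].append(key)
        if len(pending[s]) == pack:       # buffer full: send it now
            sends.append((s, pending.pop(s)))
    for s in sorted(pending):             # no more input: send partial buffers
        sends.append((s, pending[s]))
    return sends

# Nine distinct keys, then a repeat of the first one (record 10 == record 1).
records = [f"k{i}" for i in range(1, 10)] + ["k1"]
sends = route(records, num_sessions=3, pack=2)

for session, batch in sends:
    print(f"session {session} sends {batch}")
```

Both occurrences of k1 always go out on the same session, possibly in different buffers, so they are applied in order without TPump having to hold back earlier buffers.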
The SERIALIZE ON list should generally be a subset (or all) of the PI columns, chosen so that its values are fairly evenly distributed.
BTW, I would say TPump creates buffers based on the number of sessions established, rather than the other way around.