Can someone please provide me some information on how data from different AMPs will be grouped together before being sent over the BYNET?
For example, I'm trying to group data based on a country name which is spread across different AMPs.
I know each AMP will work on its local data and send the aggregated data to the BYNET, but who will take care of grouping the data from the different AMPs?
If rows for the same country name are present in AMP 1 and AMP 2, how will they be aggregated before being sent through the BYNET?
Thanks in advance.
The grouping will first be performed locally in the AMPs to create a single row per group. This phase may be optimized out if the query optimizer can tell that very little (or no) reduction will be made in the number of result rows (e.g. if grouping is done on a group key whose cardinality is similar to the cardinality of the table).
Then each of the local aggregation rows will be hashed on the grouping fields and the local aggregation result will be redistributed to the AMP for that hash. Thus all of the rows for a particular group key will end up on the same AMP, and the group keys as a whole will be spread across all the AMPs by the hashing on the group key. Each AMP will have to receive at most (number of AMPs - 1) rows for each group key. This phase will be optimized out if the optimizer can see that the local aggregation can produce only one result on one AMP per group key (e.g. if the group key is the PI).
Final aggregation is performed on the set of rows with a common group key by using the appropriate operations (e.g. COUNT is used for the local aggregation but a SUM of the counts is used in the global aggregation).
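The three phases above can be sketched in a few lines of Python. This is an illustrative model only, not Teradata's implementation: the AMP count, sample data, and `amp_for_key` hash function are all hypothetical stand-ins.

```python
# Sketch of two-phase (local + global) aggregation for COUNT(*) per country.
from collections import Counter, defaultdict

NUM_AMPS = 4

# Rows as they sit on each AMP (distributed by the table's primary index).
amp_rows = {
    0: ["India", "USA", "India"],
    1: ["India", "UK"],
    2: ["USA", "USA"],
    3: ["UK"],
}

# Phase 1: each AMP aggregates its own rows -> one partial row per group key.
local_partials = {amp: Counter(rows) for amp, rows in amp_rows.items()}

# Phase 2: hash each group key to pick a target AMP and redistribute, so all
# partials for one country land on the same AMP (stand-in for the row hash).
def amp_for_key(key):
    return hash(key) % NUM_AMPS

received = defaultdict(list)
for partials in local_partials.values():
    for country, count in partials.items():
        received[amp_for_key(country)].append((country, count))

# Phase 3: global aggregation -- SUM the partial COUNTs on the owning AMP.
final = {}
for rows in received.values():
    for country, count in rows:
        final[country] = final.get(country, 0) + count

print(final)  # totals: India=3, USA=3, UK=2
```

Note that the global phase sums the local counts rather than counting again, matching the COUNT-then-SUM pairing described above.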
The EXPLAIN output gives clues as to which phases are optimized out (e.g. global aggregation if the redistribution is performed, and local aggregation if the redistribution is optimized out).
Both local and global aggregations are optimized with in-memory algorithms which cache the group key and aggregate computations. Local aggregation reads from a source and does the local aggregation in the local cache. The redistribution is done directly from the cache onto the BYNET. As the rows arrive at the receiving AMP, the global aggregation is done in the global cache. This eliminates sorting large sets to do aggregation, and avoids writing and reading spools to get the process done.
Thanks a lot, ToddAWalter.
Could you please explain how the redistribution happens?
The locally aggregated rows need to go back to the PE to get the hash value and find the AMP number, right?
In order to perform sorting, all the rows should be on the same AMP, right? In that case, how is the data processed?
Thanks in advance.
Hashing happens in both the PE and the AMP. The hashing algorithm runs in both places.
Distribution of rows to AMPs happens in the BYNET software. A row sent with a row hash from either a PE or an AMP gets mapped to the proper AMP in the BYNET messaging software. It is not necessary to send the row to the PE for it to be sent to the right AMP.
Redistribution of aggregate rows happens the same as for any relation being prepared for a join or other processing. A field or fields are run through the hashing algorithm (in the case of aggregation, the fields are the group key fields) and the rows are sent through the BYNET to the right AMPs. For grouping, the redistribution after local aggregation results in all the partial aggregates for a group ending up on the same AMP, with the set of all groups distributed across all AMPs.
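The key property of hash redistribution is that the mapping from a value to an AMP is deterministic, so a PE and an AMP computing the hash independently always agree on the target. A minimal sketch, assuming an illustrative CRC-based hash and bucket count (Teradata's actual hash function and hash-bucket map differ):

```python
# Sketch of hash-based row redistribution: value -> row hash -> bucket -> AMP.
import zlib

NUM_AMPS = 4
HASH_BUCKETS = 65536  # illustrative stand-in for the hash-bucket map size

def row_hash(value):
    # Deterministic hash; a PE and an AMP running this get the same result.
    return zlib.crc32(str(value).encode())

def target_amp(value):
    bucket = row_hash(value) % HASH_BUCKETS
    return bucket % NUM_AMPS  # real systems use a configurable bucket->AMP map

# Every copy of a group key maps to the same AMP, wherever it is hashed,
# which is what lets partial aggregates for one group converge on one AMP.
for country in ["India", "USA", "UK", "Germany"]:
    print(country, "-> AMP", target_amp(country))
```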
I am not clear on your question about sorting. The point of the local and global aggregate caches is to perform the aggregation without sorting.
Thanks again, ToddAWalter.
Actually, I want to know the internal mechanism of the sorting operation. I'm wondering whether it uses redistribution of data or some other method?
Thank you so much for helping me explore further.
When ORDER BY is specified in a SELECT, the final step of the plan will be a sort (unless the optimizer determines that the data is already sorted appropriately from previous steps). Sorting is done locally in the AMPs. Each AMP sorts the portion of the last step's result that is created on that AMP. No redistribution is performed; whatever is created locally by the last step is the set of data that is sorted by that AMP. A nice side effect is achieved because sort is an O(N log N) algorithm and N is divided by the number of AMPs, resulting in very good sort performance. The local sort also makes extensive use of memory - creating sort run buffers and performing merges when the sorted set is very large - maximizing the memory residency of the sort.
The locally sorted results are placed in a spool file. A signal is sent to the application that the query has completed and the result is available for retrieval. The application can then start fetching rows from the database. To optimize the performance of fetch, the interfaces to Teradata work on a buffered basis. The first fetch goes to the database for a buffer of rows, then returns the first row to the application. Subsequent fetches return a row to the application from the buffer without requiring communication with the database. When the last record is fetched from the buffer, another buffer is retrieved from the database.
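The client-side buffering described above can be modeled with a small generator. This is a sketch only; `fetch_buffer` is a hypothetical stand-in for the driver's network round trip, not a real Teradata interface call:

```python
# Sketch of buffered row fetch: one round trip retrieves a buffer of rows,
# and individual application fetches are served from that buffer.
def fetch_buffer(result, offset, buffer_size):
    """Hypothetical 'round trip' to the database: up to buffer_size rows."""
    return result[offset:offset + buffer_size]

def fetch_rows(result, buffer_size=3):
    offset = 0
    while True:
        buf = fetch_buffer(result, offset, buffer_size)  # one round trip
        if not buf:
            break
        yield from buf          # rows served locally, no further communication
        offset += len(buf)

rows = list(fetch_rows(list(range(8)), buffer_size=3))
print(rows)  # all 8 rows, fetched in buffers of 3, 3, and 2
```

With a buffer size of 3, eight rows cost three data-bearing round trips instead of eight, which is the point of the buffered interface.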
Since each AMP sorted its results locally, there does not exist a globally sorted set. When the fetch request is received for a buffer of rows, a merge operation is executed to merge the top rows from each AMP's sorted result and place them into the buffer in globally ordered sequence. Logically, each AMP provides its next row; the merge operation looks at all the provided rows, chooses the top one relative to the sort order, and places it into the buffer. It then tells the AMP that provided that row to provide a new row, chooses the next top row for the buffer, and so on until the buffer is full. Of course there is a lot of optimization, with internal merge buffers in the nodes to coalesce the rows from the AMPs in that node and further buffering in the node to which the session is attached. These optimizations minimize the number of messages required to perform the merge and maximize the productive use of memory to buffer the process. The entire merge process is performed in the BYNET network software as a very low level protocol to make it as resource efficient as possible.
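The merge step above is essentially a k-way merge over the locally sorted per-AMP spools, emitted a buffer at a time. A minimal sketch using a heap (the real BYNET merge protocol, with its node-level coalescing, is far more elaborate; the spool contents and buffer size here are illustrative):

```python
# Sketch of the global merge: each AMP offers its next row, the merge takes
# the smallest and asks that AMP for another, filling client-sized buffers
# in globally sorted order.
import heapq

amp_spools = [      # each AMP's locally sorted result
    [1, 4, 9],
    [2, 3, 10],
    [5, 6, 7, 8],
]

def merged_buffers(spools, buffer_size):
    # Seed the heap with each AMP's first row: (value, amp, index-in-spool).
    heap = [(s[0], amp, 0) for amp, s in enumerate(spools) if s]
    heapq.heapify(heap)
    buf = []
    while heap:
        value, amp, i = heapq.heappop(heap)
        buf.append(value)
        if i + 1 < len(spools[amp]):           # that AMP provides its next row
            heapq.heappush(heap, (spools[amp][i + 1], amp, i + 1))
        if len(buf) == buffer_size:            # ship a full buffer to the client
            yield buf
            buf = []
    if buf:
        yield buf

for buffer in merged_buffers(amp_spools, buffer_size=4):
    print(buffer)  # [1, 2, 3, 4] then [5, 6, 7, 8] then [9, 10]
```

Because each spool is already sorted, the merge only ever compares one candidate row per AMP, which is why no AMP ever needs to hold the full result.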
Bringing all the rows together in one place and sorting them (on an AMP or a PE) would result in a large, skewed, single threaded operation that would not scale. The Teradata architecture for sorted result return guarantees scalability of even the largest sorted result sets. The system never has to bring the data together in one location, nor does it have to perform a global sort of the set end to end.
The buffer for the result of the merge is allocated by the BYNET software. It is allocated in the node containing the PE to which the session is attached, meaning that the buffers are spread across the PE-containing nodes. The size of the buffer is determined by the buffer size settings in the client software. The buffer exists only for the short time that the merge is running to fill it. It is then returned to the client that requested the fetch.