Authors: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair
Looking for a fresh perspective on collecting statistics, or just want a confirmation that you’re on the right track? Either way, you’ll want to read this quick summary of recommendations for Teradata 12 stats collection.
Statistical information is vital for the optimizer when query plans are built. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.
This recently-updated compilation of statistics collection recommendations are intended for sites that are on any of the Teradata 12 software release levels. All these recommendations apply to releases earlier than Teradata 12, with the exception of Point #4 and #5 under “Other Considerations”. Those Point #4 and #5 recommendations take advantages of optimizer enhancements that are part of the Teradata 12 release and are not relevant to earlier releases.
1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.
2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) will benefit from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, then random AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT nupi-column FROM table;
3. For a partitioned primary index table, it is recommended that you always collect statistics on:
4. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
5. Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). All-AMP random AMP sampling has these particular advantages:
* Uneven distribution exists when the High Mode Frequency (ModeFrequency column in interval zero) in the histogram is greater than the average rows-per-value (RPV) by a factor of 4 or more. RPV is calculated as Number of Rows / Number of Uniques.
** Any column which is over 95% unique is considered as a neary-unique column.
*** Correlated columns within a multicolumn statistic are columns where the value in one may influence, or predict the values in the second. For example in a nation table, there is a tight correlation between nationkey and nationname. In a customer table there might be a correlation, but a somewhat looser correlation, between customer zip code and customer income band.
Dependent columns are columns where the value in the one column will tend to directly influence the value in the second column. An example of a dependent column could be customer zip code, which is dependent on the customer state. If they both appeared in a multicolumn statistic they would be a dependency between them. Other columns where there is some dependency might be job title which is sometimes dependent on the industry segment, if they both were in a multicolumn stat.