Statistical information is vital for the optimizer when it builds query plans. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.
The following statistics collection recommendations are intended for sites that are on any of the Teradata 13.10 software release levels. Most of these recommendations apply to releases earlier than Teradata 13.10, however some may be specific to Teradata 13.10 only.
Collect Full Statistics
- Non-indexed columns used in predicates
- Single-column join constraints, if the column is not unique
- All NUSIs (but drop NUSIs that aren’t needed/used)
- USIs/UPIs if used in non-equality predicates (range constraints)
- Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
- Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
- PARTITION for all tables whether partitioned or not^{[1]}
Can Rely on Dynamic AMP Sampling ^{[2]}
- USIs or UPIs if only used with equality predicates
- NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
- See “Other Considerations” for additional points related to dynamic AMP sampling
Option to use USING SAMPLE
- Unique index columns
- Nearly-unique ^{[3]} columns or indexes
Collect Multicolumn Statistics
- Groups of columns that often appear together with equality predicates, if the first 16 bytes of the concatenated column values are sufficiently distinct. These statistics are used for single-table estimates.
- Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more that the combination of actual values are correlated, the greater the value of collecting multicolumn statistics.
Other Considerations
- Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
- NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) benefits from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, dynamic AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT nupi-column FROM table;
- For a partitioned primary index table, it is recommended that you always collect statistics on:
- PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
- The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
- For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
- (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
- (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey-based merge join costing.
- Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). For small tables, with less than 25 rows per AMP, all-AMP sampling is done automatically. It is also the default for volatile tables and sparse join indexes. All-AMP sampling comes with these tradeoffs:
- Dynamic all-AMP sampling provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
- Statistics extrapolation for any column in a table is triggered only when the optimizer detects that the table has grown. The growth is computed by comparing the current row count with the last known row count to the optimizer. If the default single-AMP dynamic sampling estimate of the current row count is not accurate (which can happen if the primary index is skewed), it is recommended to enable all-AMP sampling or re-collect PARTITION statistics.
- Parsing times for queries may increase when all AMPs are involved, as the queries that perform dynamic AMP sampling will have slightly more work to do. Note that dynamic AMP samples will stay in the dictionary cache until the periodic cache flush, or unless they are purged from the cache for some reason. Because they can be retrieved once and re-used multiple times, it is not expected that dynamic all-AMP samping will will cause additional overhead for all query executions.
- For temporal tables, follow all collection recommendations made above. However, statistics are currently not supported on BEGIN and END period types. That capability is planned for a future release.
^{[1]} Collecting PARTITION statistics on tables that have grown supports more accurate statistics extrapolations. Collecting on PARTITION is an extremely quick operation (as long as table is not over-partitioned).
^{[2] } This is sometimes referred to as a random AMP sample since, in early releases of Teradata, a random AMP was picked for obtaining the sample; however, in current releases, the term dynamic AMP sample is more appropriate.
^{[3] } Any column which is over 95% unique is considered as a nearly-unique column.
These recommendations were compiled by: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair