Statistics Collection Recommendations – Teradata 14.10, 15.0 and 15.10

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Teradata Employee

Statistical information is vital for the optimizer when it builds query plans. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

This recently-updated compilation of statistics collection recommendations are intended for sites that are on any of the Teradata Database 14.10, 15.0, or 15.10 software release levels. Some of these recommendations apply to releases earlier than Teradata Database 14.10 and some rely on new features available starting in Teradata Database 14.10.  Statistics collection functionality in the Teradata Database works the same in 14.10, 15.0 and 15.10 releases.

For greater detail on collecting statistics for Teradata Database 14.10, see the orange book titled:  Teradata Database 14.10 Statistics Enhancements by Rama Krishna Korlapati.

Contributors: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair, September 11, 2014


Collect Full Statistics

  • Non-indexed columns used in predicates
  • All NUSIs
  • USIs/UPIs if used in non-equality predicates (range constraints)
  • Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
  • Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
  • PARTITION for all partitioned tables undergoing upward growth
  • Partitioning columns of a row-partitioned table

Can Rely on Dynamic AMP Sampling

  • USIs or UPIs if only used with equality predicates
  • NUSIs with an even distribution of values
  • NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
  • See “Other Considerations” for additional points related to dynamic AMP sampling

Collect Multicolumn Statistics

  • Groups of columns that often appear together with equality predicates. These statistics are used for single-tables estimates.
  • Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more that the combination of actual values are correlated, the greater the value of collecting multicolumn statistics is in this situation.
  • Specify a name for such statistics, for ease of recollection, viewing, and/or dropping.

General Suggestions for the Statistics Collection Process

  • When multiple statistics on a table are collected for the first time, group all statistics with the same USING options into a single request.
  • After first time collections, collect all statistics being refreshed on a table into one statement, and if all are being refreshed, re-collect at the table level.
  • Do not rely on copied or transferred SUMMARY statistics or PARTITION statistics between tables within a system or across systems. Recollect them natively. This ensures that changes to the configuration and other internal details (such as how internal partitions are arranged) are available to the optimizer.
  • Recollect table-level SUMMARY statistics after data loading events in order to provide the optimizer with current table row counts and other detail. This operation runs very quickly and supports more effective extrapolations of statistics that were not able to be recollected after updates.
  • Do not drop and then recollect statistics, as history records for the statistic are lost when the statistic is dropped, making it less likely that the optimizer skips statistics collection or downgrades to sampling.

New Recommendations for Teradata Database 14.10

  • If migrating from a previous release, set the DBS Control internal field NoDot0Backdown to true in order to make use of the new Version 6 statistics histograms, which can carry update, delete and insert counts.
  • Enable DBQL USECOUNT logging for all important databases whose table row counts may change over time. This provides information about updates, deletes and inserts performed on each table within the logged databases and contributes to better extrapolations.
  • Benefit from the default system threshold option, which may allow some submitted statistics to be skipped, by turning on DBQL USECOUNT logging and building up statistic history records. Skipping can potentially reduce the resources required by statistics recollections.
  • Expect to perform several full collections before statistic skipping or automatic downgrade to sampling is considered.
  • Use the collection statement-level THRESHOLD option only for cases where there is a specific need to override the global threshold default.
  • Consider collecting statistics (and providing a name) on SQL expressions if they are frequently used in queries and if they reference columns from a single table.

Other Considerations

  1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
  2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) benefit from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, dynamic AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:

EXPLAIN SELECT DISTINCT NUPI-column FROM table;

  1. For a partitioned table, it is recommended that you always collect statistics on:
  • PARTITION. This tells the optimizer how many row partitions are empty, a histogram of how many rows are in each row partition, and the compression ratio for column partitions. This statistic is used for optimizer costing.
  • Any partitioning columns. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
  1. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
  • (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
  • (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
  1. Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). Dynamic all-AMP sampling has these particular advantages:
  • It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
  • Statistics extrapolation for any column in a table is not attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP dynamic AMP sampling is turned on. Because a dynamic AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate dynamic AMP sample row count is critical for determining if collected statistics are stale, or not.
  1. For temporal tables, follow all collection recommendations made above. Currently, statistics are not supported on BEGIN and END period types. That capability is planned for a future release.

[1] Any column which is over 95% unique is considered as a nearly-unique column.

[2] Dynamic AMP sampling is sometimes referred to as random AMP sampling.

53 Comments
Scout

@David_Roth

 

Thanks for reply.

1. Currenly sampling parameter SysSampleOption is disabled.

2.Its disabled because customer doesn't want sample stats to be collected.

3. Yes TASM is implemented and collect stats has given medium priority. No delay observed for collect stats sessions.

 

Above 3 points are same since last 6 months. Data growth is < 2%. Same collect stats used to complete in 26 min till last month. But its taking 3 hrs now. Any thing i need to change in TASM?

 

@carrie

Yes explain plan is same for both statemenets. No stats skipped as SysSampleOption is disabled.

About parallelly running  queries and skewed AMP, will surely investigate it.

 

As said before Even though SysSampleOption is diabled since 1 yr, collect stats was taking 26 min. Enabling SysSampleOption will enhance performance of collect stats now?? Are sample stats recommended on large tables(data growth < 2%) having billions of records?

Teradata Employee

Kiran,

 

There are several prerequisites required before the optimizer will consider sampling stats, once SysSampleOption has been turned on:

  • USECOUNT logging must be on for that database
  • SysChangeThresholdOption (which is on by default) must be turned off.  That means no optimizer-determined skipping will take place
  • There must be adequate history records collected and the optimizer must be able to detect a pattern within the history records as the table size changes over time
  • The usage type for the statistics must be ‘S’, meaning the histogram for this particular statistic is being used primarily for the SummaryInfo data only, and not being use for the detailed histogram data

 

That last bullet means that if the statistic has in the past been used mainly for table join purposes (which only require SummaryInfo data from the histogram and will have a usagetype = ‘S’), then the optimizer may choose to do sampling.  But if the statistic has been used for single-table selection estimations (will have a usagetype = ‘D’), optimizer-initiated sampling will not take place.

 

You have to evaluate the value you could be getting from skipping (with SysChangeTresholdOption on) and whether you want to give that up in order to get optimizer-initiated sampling.  You can’t have both.  The default settings favor skipping, which is why SysChangeThresholdOption is on by default and sampling is not.

 

So you can see that it is not guaranteed that you will get sampling with the SysSampleOption even when it is turned on.  All those factors must be satisfied.

 

No one can tell you for sure that sampling will be acceptable for your large tables.  It depends on whether or not you get acceptable plans with sampling or not.  You could try enforcing sampling yourself (with the USING SAMPLE N PERCENT) clause, starting at a high sampling level (like 30% or so) and see how the plans look, then reduce down the sampling over time.

 

Thanks, -Carrie

Scout

Thank you Carrie for you Feedback !