Statistics Collection Recommendations for Teradata 12

Authors: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair

Looking for a fresh perspective on collecting statistics, or just want a confirmation that you’re on the right track? Either way, you’ll want to read this quick summary of recommendations for Teradata 12 stats collection.

Statistical information is vital for the optimizer when query plans are built. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

This recently updated compilation of statistics collection recommendations is intended for sites that are on any of the Teradata 12 software release levels. All of these recommendations apply to releases earlier than Teradata 12, with the exception of Points #4 and #5 under “Other Considerations”. Those two recommendations take advantage of optimizer enhancements that are part of the Teradata 12 release and are not relevant to earlier releases.

Collect Full Statistics 

  • Non-indexed columns used in predicates
  • All NUSIs with an uneven distribution of values * 
  • NUSIs used in join steps
  • USIs/UPIs if used in non-equality predicates (range constraints)
  • Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
  • Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
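
As a simple illustration, full statistics on a non-indexed predicate column and on a NUSI might be collected with statements like these (Sales_Table, Store_Id, and Product_Id are hypothetical names, and the INDEX form assumes a NUSI is defined on Product_Id):

COLLECT STATISTICS ON Sales_Table COLUMN Store_Id;
COLLECT STATISTICS ON Sales_Table INDEX (Product_Id);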

Can Rely on Random AMP Sampling   

  • USIs or UPIs if only used with equality predicates
  • NUSIs with an even distribution of values
  • NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
  • See “Other Considerations” for additional points related to random AMP sampling

Option to use USING SAMPLE

  • Unique index columns
  • Nearly-unique columns or indexes**
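
As an illustration (Customer_Table and Customer_Id are hypothetical names), sampled statistics are requested by adding the USING SAMPLE clause:

COLLECT STATISTICS USING SAMPLE ON Customer_Table COLUMN Customer_Id;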

Collect Multicolumn Statistics

  • Groups of columns that often appear together in conditions with equality predicates, if the first 16 bytes of the concatenated column values are sufficiently distinct. These statistics will be used for single-table estimates.
  • Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them.*** With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more correlated the actual combinations of values are, the greater the value of collecting multicolumn statistics.
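
For instance, if customer state and zip code frequently appear together in join or equality conditions, a multicolumn statistic could be collected on the pair; the table and column names below are only illustrative:

COLLECT STATISTICS ON Customer_Table COLUMN (Cust_State, Cust_Zip);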

Other Considerations

1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.

2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that.  A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) will benefit from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, then random AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:

EXPLAIN SELECT DISTINCT nupi-column FROM table;
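
If you prefer to measure the uniqueness directly rather than read the estimate from the EXPLAIN output, a query along these lines will return the percentage of distinct values, at the cost of scanning the table (nupi_column and table_name are placeholders):

SELECT COUNT(DISTINCT nupi_column) * 100.0 / COUNT(*) FROM table_name;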

3. For a partitioned primary index table, it is recommended that you always collect statistics on:

  • PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
  • The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
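
Assuming a hypothetical PPI table named Sales_History that is partitioned on Sale_Date, those two collections would look like this:

COLLECT STATISTICS ON Sales_History COLUMN PARTITION;
COLLECT STATISTICS ON Sales_History COLUMN Sale_Date;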

4. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):

  • (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
  • (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
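
Continuing with the hypothetical Sales_History table, with Store_Id as the PI column and Sale_Date as the partitioning column, these would be collected as:

COLLECT STATISTICS ON Sales_History COLUMN (PARTITION, Store_Id);
COLLECT STATISTICS ON Sales_History COLUMN (PARTITION, Store_Id, Sale_Date);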

5. Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). All-AMP random AMP sampling has these particular advantages:

  • It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
  • Statistics extrapolation for any column in a table will not be attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP random AMP sampling is turned on. Because a random AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate random AMP sample row count is critical for determining if collected statistics are stale, or not.

 * Uneven distribution exists when the High Mode Frequency (ModeFrequency column in interval zero) in the histogram is greater than the average rows-per-value (RPV) by a factor of 4 or more.  RPV is calculated as Number of Rows / Number of Uniques.
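
As an illustrative calculation: if a NUSI column has 1,000,000 rows and 10,000 distinct values, the average RPV is 1,000,000 / 10,000 = 100 rows per value, so a High Mode Frequency of 400 or more in interval zero would indicate an uneven distribution by this rule of thumb.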

** Any column that is over 95% unique is considered a nearly-unique column.

*** Correlated columns within a multicolumn statistic are columns where the value in one may influence, or predict, the value in the other. For example, in a nation table there is a tight correlation between nationkey and nationname. In a customer table there might be a correlation, but a somewhat looser one, between customer zip code and customer income band.

Dependent columns are columns where the value in one column tends to directly influence the value in the other. An example of a dependent column is customer zip code, which is dependent on the customer state; if both appeared in a multicolumn statistic, there would be a dependency between them. Another pairing with some dependency might be job title and industry segment, since job title is sometimes dependent on industry segment, if both were in a multicolumn stat.

Fan

Re: Statistics Collection Recommendations for Teradata 12

Nice recommendations!! Thanks a lot. :D
Enthusiast

Re: Statistics Collection Recommendations for Teradata 12

With regard to multicolumn stats, can you please elaborate on "where there is either a dependency or some degree of correlation among them"?
What kind of dependency is referred to here? Can you please explain?
Teradata Employee

Re: Statistics Collection Recommendations for Teradata 12

I have incorporated a response to this question, along with a few other minor updates, into the original article above; see the footnotes on correlated and dependent columns.

Enthusiast

Re: Statistics Collection Recommendations for Teradata 12

First things first: this is a very interesting platform for Teradata developers to share their thoughts about Teradata.

Now coming to my question: suppose I have a query that retrieves records from a single table based on conditions in the WHERE clause. If we assume the WHERE clause conditions are on primary index columns, non-indexed columns, and PPI columns, how does collecting stats help the query? I assume this would only help in giving the estimates; however, for this type of query, no matter what the stats are on this table, the end-to-end run time should be the same for a single-table retrieval. Can you enlighten me on this?
Teradata Employee

Re: Statistics Collection Recommendations for Teradata 12

Good question.

If the access of a single table is by primary index in an equality condition, statistics are not used and this will be a single-AMP operation, with no table scan.

If the access of a single table by primary index uses a range constraint, then statistics on the PI will be used to produce reasonable time and row count estimates, but the plan, since it involves only a single table, will not change.

The accuracy of these types of estimates could be important because TASM can use estimated processing time to classify queries to different workloads that have different priorities. This makes the accuracy of statistics important, even when they do not change the plan, and this classification could impact the elapsed time of the query.

In addition, there is an option on system/object throttles called a step time threshold. This option allows work to run immediately (and never be delayed) if all the steps in the query have estimated processing times that are less than the specified step time threshold. This is a good technique to use when the throttle manages both short work and long work and you don't want the short work delayed. So accurate estimates are important for this feature to work as intended.

The same is true for collecting statistics on the partitioning column and on PARTITION. They may not change the plan if only one table is involved, but they allow for more accurate statistics for the purposes of TASM classification.

Fan

Re: Statistics Collection Recommendations for Teradata 12

What is the difference in the plan when stats are collected only on PARTITION or only on the PPI column, if the query joins on the PI and the PPI column and has single-table predicates on both PPI columns?
Teradata Employee

Re: Statistics Collection Recommendations for Teradata 12

It sounds like this join you are describing is between 2 PPI tables, both partitioned identically, and the join is on both the PI and the partitioning columns of both tables. In this case the partitioning column stats (on both tables) will be used to provide estimated time and row counts for the step(s) that access the tables. You may not get a plan change without partitioning column stats, but your estimates will be more accurate with those stats. That could be important for TASM classification purposes, or if there is a step time threshold on an object throttle associated with this query.

In the case of PARTITION, collecting or not collecting on this system-derived column will not change estimated row counts, and is unlikely to change the plan of a query as simple as the one you describe. However, in some other cases a less optimal join plan may be selected if there are no stats on PARTITION, and since collecting stats on PARTITION is such a quick operation to perform, having these statistics available may help other, more complex queries in the future.

In the absence of PARTITION statistics, the optimizer assumes rows are distributed evenly among all partitions. If this is not the case, you could end up with incorrect IO costing. This may affect single-table access paths when there are secondary indexes and/or it could affect the join order.
Enthusiast

Re: Statistics Collection Recommendations for Teradata 12

Hello Carrie...

I'm a big fan of your postings. Nice to find you here. It will surely bring quality to the forum.
I have a basic question: which table columns can go without collected statistics? In my present engagement, I have seen the trend of collecting statistics on all the columns and combinations of columns. This in turn increases the time taken for statistics collection.
Teradata Employee

Re: Statistics Collection Recommendations for Teradata 12

Thank you for your kind words, and for taking time to add a comment.

The most common place I see unneeded statistics collected is with multicolumn stats. This is especially true if they are lengthy (greater than 16 bytes) and if their first 16 bytes don't offer much differentiation. I think I have a blog posting from July that discusses dropping multicolumn stats under some conditions, so you might want to find that.

One of the reasons these are first-line candidates for dropping is that, in order for them to be used by the optimizer for single-table selection purposes, each column must be expressed in an equality condition. If most of your queries use these columns in a combination of equality and range selection, or don't use the columns together, the statistics are not providing you value.

If the site has been relying on HELPSTATS output to decide what to collect on, be aware that the feature goes overboard when it makes its recommendations for multicolumn stats, particularly in V2R6.2 and earlier releases. In Teradata 12, HELPSTATS does a better scoping job, and you won't see so many extreme recommendations.
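
If you want to see those recommendations for a particular query, the diagnostic can be enabled at the session level and combined with EXPLAIN, along these lines (the SELECT itself is just a placeholder):

DIAGNOSTIC HELPSTATS ON FOR SESSION;
EXPLAIN SELECT * FROM Sales_Table WHERE Store_Id = 100;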

For unique or near-unique columns, try using USING SAMPLE. It's a big savings, even if you can only apply it to a few columns here and there.