Statistics collection recommendations – Teradata 14.0


Statistical information is vital for the optimizer when it builds query plans.  But collecting statistics consumes time and resources.  By understanding and combining several different statistics-gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

This recently updated compilation of statistics collection recommendations is intended for sites that are on any of the Teradata 14.0 software release levels.  Some of these recommendations apply to releases earlier than Teradata 14.0; however, some rely on new features available only in Teradata 14.0.

Contributors:  Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair, February 12, 2013

Collect Full Statistics

• Non-indexed columns used in predicates
• All NUSIs
• USIs/UPIs if used in non-equality predicates (range constraints)
• Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
• Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
• PARTITION for all partitioned tables undergoing upward growth
• Partitioning columns of a row-partitioned table
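
As a sketch of the 14.0 syntax, a first-time full collection covering several of the cases above might look like this (the table and column names here are hypothetical, for illustration only):

COLLECT STATISTICS
    COLUMN (order_status),     -- non-indexed column used in predicates
    COLUMN (customer_id),      -- NUPI column
    COLUMN (PARTITION)         -- row-partition demographics
ON sandbox.orders;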

Can Rely on Random AMP Sampling

• USIs or UPIs if only used with equality predicates
• NUSIs with an even distribution of values
• NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
• See “Other Considerations” for additional points related to random AMP sampling

Option to use USING SAMPLE

• Unique index columns
• Nearly-unique[1] columns or indexes
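
For example, a sampled collection on a nearly-unique column (table and column names hypothetical) adds the USING SAMPLE clause to the same statement:

COLLECT STATISTICS USING SAMPLE
    COLUMN (order_id)
ON sandbox.orders;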

Collect Multicolumn Statistics

• Groups of columns that often appear together with equality predicates. These statistics will be used for single-table estimates.
• Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them.  With no multicolumn statistics collected, the optimizer assumes complete independence among the column values.  The more that the combination of actual values are correlated, the greater the value of collecting multicolumn statistics will be in this situation.
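
For instance, if two columns frequently appear together in equality predicates or joins and their values are correlated, a multicolumn collection (names hypothetical) would be:

COLLECT STATISTICS COLUMN (store_id, item_id) ON sandbox.sales;

Without this statistic, the optimizer multiplies the individual selectivities as if store_id and item_id were completely independent.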

1. When multiple statistics on a table are collected for the first time, group all statistics with the same USING options into a single request.
2. After first time collections, collect all statistics being refreshed on a table into one statement, and if all are being refreshed, re-collect at the table level.
3. Do not rely on copied or transferred SUMMARY statistics or PARTITION statistics between tables within a system or across systems.  Recollect them natively. This ensures that changes to the configuration and other internal details (such as how internal partitions are arranged) will be available to the optimizer.
4. Recollect SUMMARY statistics after data loads; this runs very quickly and provides up-to-date row counts to the optimizer.
5. In Teradata 14.0, PARTITION statistics are no longer required on nonpartitioned tables, as the new SUMMARY table replaces the function previously provided by PARTITION or primary index statistics in determining stale statistics.
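
The grouping, table-level refresh, and SUMMARY recommendations above can be sketched as follows (table and column names hypothetical).  A first-time collection grouped into a single request:

COLLECT STATISTICS
    COLUMN (col_a),
    COLUMN (col_b),
    COLUMN (col_a, col_b)
ON sandbox.tab;

Refreshing everything already defined on the table:

COLLECT STATISTICS ON sandbox.tab;

Refreshing just the row count after a load:

COLLECT SUMMARY STATISTICS ON sandbox.tab;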

Other Considerations

1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.
2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that.  A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) will benefit from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values.  However, if it is close to being 75% unique, then random AMP samples are adequate.  To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:

EXPLAIN SELECT DISTINCT nupi-column FROM table;

3. For a partitioned table, it is recommended that you always collect statistics on:
• PARTITION.  This tells the optimizer how many row partitions are empty, a histogram of how many rows are in each row partition, and the compression ratio for column partitions.  This statistic is used for optimizer costing.
• Any partitioning columns.  This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
4. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
• (PARTITION, PI).  This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions.  It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
• (PARTITION, PI, partitioning column).   This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination.  It is used in rowkey join costing.
5. Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default).  All-AMP random AMP sampling has these particular advantages:
• It provides a more accurate row count estimate for a table with a NUPI.  This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
• All-AMP random AMP samples are collected automatically when table-level SUMMARY statistics are collected.
6. For temporal tables, follow all collection recommendations made above.  Currently, statistics are not supported on BEGIN and END period types.  That capability is planned for a future release.
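
The partitioned-table recommendations above translate into statements like these, assuming a hypothetical table sandbox.txn that is row-partitioned on txn_date with primary index cust_id:

COLLECT STATISTICS COLUMN (PARTITION) ON sandbox.txn;
COLLECT STATISTICS COLUMN (txn_date) ON sandbox.txn;
COLLECT STATISTICS COLUMN (PARTITION, cust_id) ON sandbox.txn;
COLLECT STATISTICS COLUMN (PARTITION, cust_id, txn_date) ON sandbox.txn;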


[1] Any column that is over 95% unique is considered nearly unique.

Enthusiast

Hi Carrie,

Is there a plan in Teradata to enable multi column statistics with different ordering?

I mean you cannot have both (a,b) and (b,a) statistics on the same table.

Or by design it has to remain the same forever?

I guess for the optimizer it would be hard to decide which one to trust/choose.

Thanks,

Norbert

Hi Norbert,

There are no plans to change that convention.  It is by design that you can only choose one column ordering for a multicolumn statistic.

This is to avoid an explosion of statistics with different orderings on the same set of columns. Such an explosion could cause the optimizer to spend a lot of parsing time determining which set to use.  The assumption is that a single specific column ordering will be adequate.

Thanks, -Carrie

Enthusiast
Carrie,

Many tables have multiple stats collected against them to support ETL and reporting.  Is there a way to easily refresh all the stats currently collected for a table, without having to manually specify each one?  I could probably build some kind of proc to get the info from DBC and generate a series of collection statements, but was hoping there is a built-in mechanism.

Thanks!

Rich

Hi Rich,

Once you have defined all the stats on a table separately, you can recollect at the table level with one SQL command.  This will re-collect everything already defined against the table.

In 14.0 the syntax is:  COLLECT STATISTICS ON my_table;

Thanks, -Carrie

Enthusiast

I tried issuing the following to find uniqueness, but I don't see anything in the explain plan that provides this info.  Am I missing something?

1. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:

EXPLAIN SELECT DISTINCT nupi-column FROM table;

I'm not sure why your explain text doesn't give you estimated rows, but when I tried your SQL it worked for me.  Make sure you have collected stats on that column.

EXPLAIN SELECT DISTINCT l_orderkey FROM myli;

  1) First, we lock a distinct CAB."pseudo table" for read on a
     RowHash to prevent global deadlock for CAB.myli.
  2) Next, we lock CAB.myli for read.
  3) We do an all-AMPs SUM step to aggregate from CAB.myli by way
     of an all-rows scan with no residual conditions, grouping by
     field1 (CAB.myli.L_ORDERKEY).  Aggregate Intermediate Results
     are computed locally, then placed in Spool 1.  The size of
     Spool 1 is estimated with high confidence to be 2,000,000 rows
     (42,000,000 bytes).  The estimated time for this step is
     1 minute and 11 seconds.

This table has 7,999,406 rows:  2,000,000 / 7,999,406 = 0.25

Thanks, -Carrie

Enthusiast

Hi Carrie,

In Teradata 12, is there any option to find the number of rows and the average row size for a SI subtable?

While going through the SQL Data Definition Statements document, COLLECT DEMOGRAPHICS is described like this:

COLLECT DEMOGRAPHICS determines and writes the following information for each specified table on an AMP-wise basis into the DataDemographics table of the specified query capture database:

• Subtable type
• Subtable ID
• Cardinality
• Average row length
• Various system-related information

But the total number of rows is not available in this.  Which one can be used: COLLECT STATISTICS, COLLECT DEMOGRAPHICS, or COLLECT INDEX STATISTICS?

Not applicable

Hi all,

Can anyone help me out with the issue below?

We would like to know how to use a "parameterized macro" from Java - does Java really support it?

Basically, macros are outstanding for improving overall system performance.

Here is an example of a parameterized macro:

USING ENO INT, ENAME CHAR(5)
EXEC MAC_1 (:ENO, :ENAME);

The syntax is totally different for parameterized vs. non-parameterized macros, and I would really like to know how to use parameterized macros in Java.  Can anyone provide the flow of the Java coding and guidelines on parameterized macros in Java?

Not applicable

Can anyone point to some documentation on how to use parameterized macros from a Java/JDBC application?  I am trying to embed the syntax below in a Java application, and am curious how to invoke parameterized macros from Java.

As you know, parameterized macros are a good fit for passing values at run time, and they give more optimized results.  Can anyone help me out?

Syntax:

USING ENO INT
EXEC MAC(:ENO);

Enthusiast

Hi Carrie,

I am having two questions. Please clarify.

Q1. In Teradata 12, is there any option to find the number of rows and the average row size for a SI subtable?  Which one can be used: collect stats, collect demographics, or collect index statistics?

Q2. There is a table defined with PI (Sale_date, store_num, item_num) and a NUSI on item_nbr.  Queries currently perform a full table scan, even though the users are providing join criteria for all primary index columns.  80% of the queries access one to two weeks of data.  What can be done to avoid the FTS?  Can collecting table-level statistics help avoid the FTS?

gsasubi-

I am not able to answer your question on JAVA apps and macros.  Please post this question on one of the Teradata Q&A Forums if you want to get it out to a broader audience.  Thanks, -Carrie

trustngs-

1.) You can get the number of distinct values in a NUSI subtable by doing a HELP INDEX table-name.   But information as to the exact number of rows across all AMPs and the average row size are not externalized and therefore will not be available.  The collect stats statement does not use as input or output either NUSI row count or NUSI average row size.

2.)  If you are joining the PI of a large table by joining the different columns of its PI to smaller lookup tables, I do not think that you will get primary index access.  You will have to do a scan, or at least a row hash match scan, of the large table.  However, I would expect this to happen locally, without having to redistribute the large table.  If you do not have the exact values for the primary index columns at the time the query is submitted, you will not get primary index access into the large table.  If the large table is partitioned by date, you might be able to reduce the scan effort by relying on partition elimination.  You can consider dynamic partition elimination, where the query does not have the value for the partitioning column, but a join to a lookup table provides it.  From the Release 12.0 PPI orange book (which can be found in the orange book repository), Appendix D:

Dynamic partition elimination (DPE)  for product joins may occur when there is an equality constraint between the partitioning column of a partitioning expression of one table and a column of another table or spool.  This is useful when looking up rows in one table and matching those rows to rows in corresponding partitions (using a product join) instead of a product join to the entire table.  Only partitions required to answer the request are involved in the join.  When performing join steps, the qualifying combined partitions of the PPI table are dynamically determined based on the values of rows selected from the other table.  Instead of a product join against all the rows in the PPI table, a product join is only done for each set of rows of the other table that match against a single set of combined partitions.  Product join DPE may occur at one or more partitioning levels.

With a PPI table, always collect statistics on the partitioning column and the system-derived column PARTITION.

Thanks, -Carrie

Enthusiast

Carrie - great enumeration of what needs stats - my favorite topic... but it's missing an equally important topic: a discussion of how frequently to collect.

As it stands above, a developer will codify everything in the list above to be safe - with zero understanding of the actual cost and benefit.

We need to have guidance on the frequency and intervals stats are collected - at the same time as, and with the same weight of - what needs stats.

Experience in multiple large sites shows that un-rationalized stats collection will typically consume 15% of a platform, which really needs to be less than 2%.

/michael

Enthusiast

Hi Carrie,

Thank you very Much, got clarified now.

Hi Michael,

I agree with you that decisions about which stats to collect and how often are challenging today.  As you are probably aware, there is increasing attention being paid to enhancing the statistics collection process in Teradata to address these needs.  Here's a couple of examples:

In current software, you have the option of relying on extrapolation, which allows you to re-collect some statistics (especially on larger tables) less often.  Instead of recollecting daily, re-collecting once a week, for example, and relying on the optimizer to sense the degree of growth, patterns of change, and extrapolate statistics for you.  Whether or not extrapolation is accurate enough for you can be assessed in 14.0 and later releases using the new "HELP CURRENT STATISTICS ON table-name" syntax, which shows you the values the optimizer has extrapolated for all collected stats on the table.

A new option in 14.10 is the THRESHOLD option, which automatically skips re-collection of a statistic when it's not needed.  A USING THRESHOLD clause on the collect stats statement itself allows you to specify the number of days that must have passed since the last collection, or the percent of change that the table must have undergone since the last collection.  These then become thresholds that must be met before the recollection will be run.
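As a rough sketch of what that 14.10 clause looks like (table and column names hypothetical, and the exact option spellings should be confirmed against the 14.10 documentation):

COLLECT STATISTICS
    USING THRESHOLD 10 PERCENT AND THRESHOLD 7 DAYS
    COLUMN (cust_id)
ON sandbox.orders;

With a statement like this, the recollection is skipped unless at least 7 days have passed or the table has changed by at least 10 percent since the last collection.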

The direction of enhancements around statistics is to automate more decisions and reduce the manpower and platform resources (not to mention guesswork) that are required by most sites today.

Thanks, -Carrie

Not applicable

Thank you Carrie,

Would like to know the scenarios where parameterized macros are supported - for example, do they work only in the BTEQ utility?

Enthusiast

Hi Carrie,

Regarding sample stats: is it possible and feasible to specify the AMP(s) as well as the percentage for the sampling?

What I have in mind is a badly distributed column (by hash AMP/bucket/row), and then sampling from the AMP with the worst and/or best distribution.

Thanks,

Enthusiast

Hi Carrie,

When sample stats are collected on a table and it is joined to itself, the sample stats are ignored and a product join results - is this true?

Thanks,

I see your point, however there is no functionality to allow you to sample on only specific AMPs.  All AMPs will be sampled at the same percent.

The stats don't force any particular join types. The estimations derived from statistics are used as an input for cost formulas.  It is the costing that determines the join plan. Sampled stats (or any kind of stats) are never ignored.

Thanks, -Carrie

Fan

Hi Carrie,

What rule of thumb would you recommend for tables larger than 1 TB?

I totally understand that it depends on how the data changes in the table.

I was leaning more towards a 10% size change.  But for a table greater than 1 TB, is that OK?

Because the full collect stats runs really long.  My main concern is running stats during the ETL window, which needs resources.

And for updates there is no really good way to find 10%.

So what do you recommend other than the sample stats explained above?

We currently have TD 13.10.

Thanks

ppg.

Enthusiast

Thanks Carrie for the clarification.  Regarding the second question, we have seen that when sample stats were collected, the join took a very long time (a product join plan) and we had to abort it.  Then, after collecting full stats, the same job took a few minutes.

ppg-

The same recommendations for recollection of statistics apply no matter what the table size.  10% change is just a rough guess, but it is all that we have to recommend at this time.  Future releases of Teradata will give you more information about which statistics are used and will let the optimizer determine if recollection of stats is actually needed.

But today, the need for recollection will depend on whether or not you are still getting good query plans without refreshing statistics as often.  If a table is very large, some sites only recollect with a sampling percent, as you mention.  However, you need to make sure you have a large enough sample to adequately represent any change to the spread of values across the statistics columns in the table.

The other thing you can do, since you are not yet on 14.0, is collect PARTITION stats on the table immediately after any load activity against the table.  This is a very fast operation, and will provide information to the optimizer as to how much the table has grown (row count).  This can be used for extrapolation.  See the blog posting on statistics collection for 13.10.

Thanks, -Carrie

Enthusiast

Far too much CPU can be wasted on a poor stats collection regime, often unknowingly.

I wrote this last year to blow off steam:

http://blog.vldbsolutions.com/lies-damned-lies-statistics

I know, I should get out more etc ;-)

Cheers,

Paul.

Carrie,

Let's say we have a NUSI.  Is it sufficient to collect at the column level, index or both?

Collect
column(x),
index(x)
on db.tb;

Collecting statistics on an INDEX versus the COLUMN that makes up the index is just a syntax difference today.  The collection and storage of the stats information is the same, and the impact on optimizer cost estimation is the same.  There is one difference having to do with the collection itself: if the index is built before statistics are collected the first time, then the collection process will use the NUSI subtables, which are faster than scanning the table.  This is true with either syntax.  And either syntax will also cause subsequent collections to use the subtables.

There used to be a difference in which dictionary table the stats were stored in, prior to 14.0.   Column stats were stored in the DBC.TVFields table for the column syntax, while index stats were held in DBC.Indexes rows.  But as of 14.0, both are stored in the DBC.StatsTbl table.

I believe the two different syntax forms are offered mainly for historical reasons once you get to 14.0.  So collect with whichever you choose, but there is no advantage in collecting on both.  The only difference I have noticed is that if you use the INDEX syntax, the column must actually be indexed or you get an error.  But the explain text for both is identical.

Thanks, -Carrie

Enthusiast

Hi Carrie,

In TD 12, is there any option to store data in a table and collect stats on that data in the same table?  Which type of table (spool, journal, derived, GTT) can be used for this purpose?

Vasudev

I am unable to understand exactly what you are asking.  Teradata certainly allows you to insert data into a table you have defined.  It also allows you to collect statistics on columns of that table.  However, that is only for base tables, not spools or journal files.  The database controls those.

For more information on statistics, see Chapter 2 of the SQL Reference Statement and Transaction Processing manual.

Thanks, -Carrie

Enthusiast

Hi Carrie,

Sorry for making it complicated.  My question is this: in TD 12, I want to store data in a table, and for that table I also need to collect stats.  Out of spool, journal, derived, and GTT, which one supports this?

In Teradata 12.0 you can collect stats on base table columns/indexes, and you can also collect statistics on GTTs.  But with GTTs you can only collect single-column and index statistics.  Multicolumn statistics support for GTTs was added in TD 14.0.

Statistics cannot be collected on spool, journals, or derived tables.

Thanks, -Carrie

Enthusiast

Thank you very much. Got clarified now !!!

Fan

From TD 13 onwards we can collect stats on volatile tables too... That means we can have stats on spool, since volatile tables are stored in spool... Correct me if I am wrong...

No, it is not possible to collect stats on spool files in any release. There is no interface available that would support that.

Thanks, -Carrie

Not applicable

Hi Carrie,

I am new to Teradata.  I have a table emp (eid INTEGER, ename VARCHAR(30), bf DECIMAL(7,2), basic DECIMAL(7,2)).

Now my question: there are four columns, and I want to display an extra column as the sum (bf + basic) by adding those two columns.

thanks

hari krishna

Hari Krishna,

You can include the calculation as a column in the select list of the SQL statement and give it a name with the "AS" clause:

(bf+basic)  as sumup
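
Put into a complete query against your emp table, that would be, for example:

SELECT eid, ename, bf, basic, (bf + basic) AS sumup
FROM emp;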

Thanks, -Carrie

Fan

Hi Carrie,

Could you please explain collect stats on VOLATILE TABLES - where are the stats stored?  Is it in spool?

Thanks

Raj

Not applicable

Please guide me in finding solution of below problem.

I want to use a query in stored procedure which should either say Yes(Y) or No(N).

All I want to find whether or not STATS are collected on PI(Not other index or columns) of a table by passing TableName and DB Name?

Thanks

Saurabh

Raj,

When you collect stats on a volatile table, those stats are held in the memory of the parsing engine.  They are not written to the data dictionary.  They can be viewed using the same methods as you would use to view stats and/or the histogram of stats on a perm table, but they only stay in memory for the life of the transaction.

Thanks, -Carrie

Saurabh,

Once you are on 14.0 you can query the StatsTbl in the data dictionary.  The IndexID field in that table will = 1 for the primary index field(s).  There is a view on top of the StatsTbl (and a couple of other DBC tables) called StatsV that carries DatabaseName, TableName, and also IndexID.  If you pass values for those fields in a query, you can determine if a specific table has stats collected for the primary index.

In 14.0 there is also a view called TableStatsV.  In this case, use indexnumber = 1 to see if the primary index has stats collected.

sel indexnumber, tablename, databasename
from dbc.tablestatsv
where databasename = 'db_name'
and tablename = 'tbl_name'
and indexnumber = 1;

Thanks, -Carrie

Not applicable

Hi,

Will TLE stats collected (from production) influence the optimizer to choose the same plan on a test system loaded with test data (for which I have not collected stats)?

Thanks,

KSN

Target Level Emulation mimics the customer environment by capturing such detail as DDL, stats, RAS, cost profiles, DBS Control detail, etc. from one system and then placing that detail on another system.

You should expect to get the same query plans on the target system, even though you have not collected statistics there.

Thanks, -Carrie

Not applicable

Hi, I am new to Teradata.  Can anyone please explain collect statistics to me?

Enthusiast

Hi Carrie,

From TD 14, will "PARTITION" stats on PPI tables help the optimizer in stats extrapolation at all?

The reason I ask is that we have a daily script that collects stats on the keyword "PARTITION" on all PPI tables.  Once we go to TD 14, do you suggest changing the script to perform summary stats as well?

-Suhail

To venkatarayudu:

The official Teradata documentation has background and explanatory information about statistics collection.

This manual has some detail about statistics collection in 14.0:

SQL Data Definition Language Detailed Topics, Release 14.0, B035-1184-111A, July 2013

There are several other manuals you can read that provide detail on statistics collection as well.

Thanks, -Carrie

To Suhail,

It is always a good thing to collect stats on PARTITION for all partitioned tables.  It is not used for extrapolation in 14.0, but is important for optimizer decisions when building the query plan.  Plus it is a very fast operation, so there is little benefit in dropping it.

The difference in 14.0 is that you no longer need to collect PARTITION stats on non-partitioned tables.

Thanks, -Carrie

Enthusiast

Hello Carrie,

Thank you for your response. As I mentioned before, we have a daily script that collects PARTITION stats on all partitioned tables.

Post TD 14, should I change our script to collect summary stats as well on the same partitioned tables? Or will they be collected automatically once the PARTITION stats are collected?

Regards,

Suhail

Suhail,

You do not need to collect summary stats, unless no other stats have been collected on the table during the recent timeframe.  Summary stats are automatically collected every time normal statistics, including PARTITION stats, are collected on a table.

You can validate that by just collecting PARTITION stats on a table, then immediately after that collection looking at the summary stats for the table.  There is a timestamp column in the summary stats that should now reflect the time you collected on PARTITION.

To see summary stats, issue a SHOW SUMMARY STATISTICS VALUES ON table-name command.

Thanks, -Carrie

Not applicable

Hi Carrie,

In our system, incoming data from source files is not consistent - one day 10K rows, another day 10M rows.

So, to avoid spool space issues, we check the loaded data manually and trigger stats scripts when there is a 10 to 15% change in data volume.

Due to this we are missing some SLAs.

So, is there any way to design a stats script that triggers whenever there is a 10% or more change in the table data?

Ram,

I am not aware of any such script, but you could post your question on one of the Teradata forums and see if anyone in Professional Services or anyone working at a Teradata site has done something like that.  It is entirely possible.

Best regards, -Carrie

Fan

Hi Carrie,

I have observed a new thing while collecting stats on my table.  Recently we upgraded to TD 14.  After collecting stats on an index and a column, I can see in 'help stats' that it shows one more record for '*' (guessing it is for all columns of the table).

Is this a new feature of TD 14?  I haven't explicitly run any command to collect stats on all columns.

Sample output:

help stats <tableName>
----------------------
Date       Time       UniqueValues   Column Names
13/10/22   07:37:30   9,999,999      *
13/10/22   07:37:30   1              column_abc

Regards,

Chetan