Statistics Collection Recommendations – Teradata 14.10, 15.0 and 15.10


Statistical information is vital for the optimizer when it builds query plans. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

This recently updated compilation of statistics collection recommendations is intended for sites on any of the Teradata Database 14.10, 15.0, or 15.10 software release levels. Some of these recommendations apply to releases earlier than Teradata Database 14.10, and some rely on new features available starting in Teradata Database 14.10.  Statistics collection functionality in the Teradata Database works the same in the 14.10, 15.0, and 15.10 releases.

For greater detail on collecting statistics for Teradata Database 14.10, see the orange book titled:  Teradata Database 14.10 Statistics Enhancements by Rama Krishna Korlapati.

Contributors: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair, September 11, 2014


Collect Full Statistics

  • Non-indexed columns used in predicates
  • All NUSIs
  • USIs/UPIs if used in non-equality predicates (range constraints)
  • Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
  • Relevant columns and indexes on small tables (less than 100 rows per AMP), where full statistics are always needed
  • PARTITION for all partitioned tables undergoing upward growth
  • Partitioning columns of a row-partitioned table
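For illustration, a minimal sketch of collections of these kinds on a hypothetical sales table (the column names, and the assumption that a NUSI exists on customer_id, are mine, not from the original posting):

COLLECT STATISTICS COLUMN (order_status) ON sales;   -- non-indexed predicate column
COLLECT STATISTICS INDEX (customer_id) ON sales;     -- assuming a NUSI on customer_id
COLLECT STATISTICS COLUMN (PARTITION) ON sales;      -- row-partitioned table undergoing growth
COLLECT STATISTICS COLUMN (tran_date) ON sales;      -- the partitioning column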

Can Rely on Dynamic AMP Sampling

  • USIs or UPIs if only used with equality predicates
  • NUSIs with an even distribution of values
  • NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
  • See “Other Considerations” for additional points related to dynamic AMP sampling

Collect Multicolumn Statistics

  • Groups of columns that often appear together with equality predicates. These statistics are used for single-table estimates.
  • Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more correlated the actual combinations of values are, the greater the benefit of collecting multicolumn statistics in this situation.
  • Specify a name for such statistics, for ease of recollection, viewing, and/or dropping.
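For example, a hedged sketch of a named multicolumn collection (table, column, and statistic names hypothetical); the name can later be used to recollect, view, or drop the statistic:

COLLECT STATISTICS COLUMN (store_id, sales_date) AS store_date_stats ON sales;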

General Suggestions for the Statistics Collection Process

  • When multiple statistics on a table are collected for the first time, group all statistics with the same USING options into a single request.
  • After first-time collections, collect all statistics being refreshed on a table in one statement, and if all are being refreshed, recollect at the table level.
  • Do not rely on copied or transferred SUMMARY statistics or PARTITION statistics between tables within a system or across systems. Recollect them natively. This ensures that changes to the configuration and other internal details (such as how internal partitions are arranged) are available to the optimizer.
  • Recollect table-level SUMMARY statistics after data loading events in order to provide the optimizer with current table row counts and other detail. This operation runs very quickly and supports more effective extrapolations of statistics that were not able to be recollected after updates.
  • Do not drop and then recollect statistics, as history records for the statistic are lost when the statistic is dropped, making it less likely that the optimizer skips statistics collection or downgrades to sampling.
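A minimal sketch of the first-time grouping versus later refreshes (table and column names hypothetical):

COLLECT STATISTICS
   COLUMN (order_status),
   COLUMN (customer_id),
   COLUMN (store_id, sales_date) AS store_date_stats
ON sales;                                -- first time: one request, same USING options

COLLECT STATISTICS ON sales;             -- later: refresh everything at the table level

COLLECT SUMMARY STATISTICS ON sales;     -- after each load: quick table-level summary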

New Recommendations for Teradata Database 14.10

  • If migrating from a previous release, set the DBS Control internal field NoDot0Backdown to true in order to make use of the new Version 6 statistics histograms, which can carry update, delete and insert counts.
  • Enable DBQL USECOUNT logging for all important databases whose table row counts may change over time. This provides information about updates, deletes and inserts performed on each table within the logged databases and contributes to better extrapolations.
  • Benefit from the default system threshold option, which may allow some submitted statistics to be skipped, by turning on DBQL USECOUNT logging and building up statistic history records. Skipping can potentially reduce the resources required by statistics recollections.
  • Expect to perform several full collections before statistic skipping or automatic downgrade to sampling is considered.
  • Use the collection statement-level THRESHOLD option only for cases where there is a specific need to override the global threshold default.
  • Consider collecting statistics (and providing a name) on SQL expressions if they are frequently used in queries and if they reference columns from a single table.
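To make the USECOUNT and expression-statistics points concrete, a hedged sketch (the database, table, and statistic names are hypothetical):

BEGIN QUERY LOGGING WITH USECOUNT ON sales_db;   -- enables use counts and UDI counts for the database

COLLECT STATISTICS COLUMN (EXTRACT(YEAR FROM sales_date)) AS sales_year_stats ON sales;   -- named expression statistic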

Other Considerations

  1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
  2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far from 75% unique (for example, 90% unique, or on the other hand 60% unique or less) benefits from having statistics collected, including a NUPI composed of multiple columns, regardless of the length of the concatenated values. However, if it is close to 75% unique, dynamic AMP samples are adequate. To determine the uniqueness of a NUPI before collecting statistics, you can issue this SQL statement:

EXPLAIN SELECT DISTINCT NUPI-column FROM table;

  3. For a partitioned table, it is recommended that you always collect statistics on:
  • PARTITION. This tells the optimizer how many row partitions are empty, how many rows are in each row partition (as a histogram), and the compression ratio for column partitions. This statistic is used for optimizer costing.
  • Any partitioning columns. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
  4. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI), as shown in the sketch following this list:
  • (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge joins, as well as dynamic partition elimination.
  • (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
  5. Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). Dynamic all-AMP sampling has these particular advantages:
  • It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
  • Statistics extrapolation for any column in a table is not attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP dynamic AMP sampling is turned on. Because a dynamic AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate dynamic AMP sample row count is critical for determining whether collected statistics are stale.
  6. For temporal tables, follow all collection recommendations made above. Currently, statistics are not supported on BEGIN and END period types. That capability is planned for a future release.
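Returning to points 3 and 4 above, a minimal sketch of the recommended partitioned-table collections, assuming a hypothetical sales table with primary index customer_id and partitioning column sales_date:

COLLECT STATISTICS COLUMN (PARTITION) ON sales;
COLLECT STATISTICS COLUMN (sales_date) ON sales;                          -- partitioning column
COLLECT STATISTICS COLUMN (PARTITION, customer_id) ON sales;              -- PI values across partitions
COLLECT STATISTICS COLUMN (PARTITION, customer_id, sales_date) ON sales;  -- for rowkey join costing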

[1] Any column that is over 95% unique is considered a nearly-unique column.

[2] Dynamic AMP sampling is sometimes referred to as random AMP sampling.

53 Comments
Visitor

Dear Carrie

We are in the phase of enabling THRESHOLDs for our stats collection process. The big question now is which SYSTEM thresholds we should start with. Are there any good practice values available, or best practices for determining them?

Thanks Armin

Teradata Employee

Armin,

If you haven't seen it, take a look at my blog posting Statistics Threshold Functionality 101.  That may provide a little more understanding about how to use threshold.

In general, the recommendation is to rely primarily on the system threshold option, SysChangeThresholdOption.  You don't have to set it, but it does require that USECOUNT logging be on for the database.

  1. Turn on USECOUNT logging in DBQL for all databases for which statistics are being collected and where you are relying on system threshold, otherwise it will not do anything.
  2. Use the statement-level threshold only for special statistics that need to be handled differently from the system default or the DBA-defined global defaults.  Favor percent of change over number of days as your first choice for the type of threshold to use. 
  3. But if USECOUNT is not being logged, then rely on time-based thresholds and set DefaultTimeThreshold at your preferred number of days. 
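As an illustration of the statement-level option in point 2, a hedged sketch (table and column names hypothetical):

COLLECT STATISTICS USING THRESHOLD 10 PERCENT COLUMN (order_status) ON sales;   -- change-based
COLLECT STATISTICS USING THRESHOLD 7 DAYS COLUMN (order_status) ON sales;       -- time-based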

Thanks, -Carrie

I am not sure who I should send this to, so you are it.  I work at AAA in California, on the mainframe and also with Teradata tables.  I have to create interactive screens on the mainframe using a Dialog Manager screen with REXX to display the data.  I want to be able to read our Teradata tables using COBOL, but I don't know if this can be done.  If it can, could you tell me where I can look at some examples?  I do this all the time with DB2 tables but have never done it with Teradata.  The only other way I could do this would be to dump the tables down to the mainframe and then use those files, but most tables are way too large.  Hope you can help, and if not, have a good one.

Teradata Employee

Hi Terry,

I wish I could help you, but this is not an area of familiarity for me.  And I have no idea who might know something like this.   

I'd suggest you post this question on one of the Teradata Forums, and see if anyone else has ideas about it.  That will give your question much greater exposure than tagging it on as a comment at the bottom of this blog.  Or contact your Teradata account team and see if they can locate the appropriate resource within Teradata who might be able to advise you.  

Thanks, -Carrie

Enthusiast

Hi Carrie,

Thanks for the valuable information. We are planning to enable AutoStats from our lower to our higher environments. As part of testing, I worked through the following cases to ensure everything is working as expected.

Case 1: Insert, update, delete operations - during collection, once the system data-change threshold is reached, whether it collects or skips for these DML operations.

Case 2: Analyze job - whether it gives the right recommendations for all types of tables, especially PPI tables, join indexes, etc.

Case 3: Collect job - whether it runs as per the schedule and refreshes daily the PARTITION and partitioning-column stats, NUSIs, sample statistics, and DBC tables.

Case 4: If it is a big table (>1000 rows/AMP) with full statistics already collected, whether it goes for sample stats.

Case 5: If it is a small table (<1000 rows/AMP), whether it goes for full stats.

Case 6: Whether it gives a recommendation on a TIMESTAMP column and collects stats on it.

In your experience, are there any other cases I need to test so that we can have a smoother implementation of AutoStats?

Thanks in advance.

Enthusiast

Hi Carrie,

Is there any way to handle Auto stats (via the Auto scheduler) during a maintenance window? The question here is: assume the stats are scheduled to run during a maintenance window when the system is down. Is there any way to address this loss of stats?

Thanks in advance.

Teradata Employee

The only information I currently have on AutoStats has been posted on Developer Exchange at:

http://developer.teradata.com/blog/carrie/2013/12/easing-into-using-the-new-autostats-feature

In addition, you may find it helpful to read through the orange book titled  "Automated Statistics Management" which can be found on Teradata At Your Service.

I did notice that you have posted these same questions to bloggers Dieter Noeth and Marcio Moura.   If you wish to reach a broader audience, consider posting your questions on Teradata Forum:

http://forums.teradata.com/forum

Best regards, -Carrie

Teradata Employee

Harsha,

Why don't you talk to your Teradata account team and ask if they can help you locate a resource to answer questions you have on AutoStats.  Unfortunately, I am not that person, but I do know several members of your account team and if they contact me offline I could help them locate a resource that might be able to help.

Best regards, -Carrie

Enthusiast

Hi Carrie,

Thanks for your response,

Actually, I have already gone through all the forums on Developer Exchange related to stats, and I got a clear understanding from these amazing articles on AutoStats.  As part of the AutoStats implementation, I have already discussed with the Teradata PS team and gathered as much information as I could for testing. Since time was limited and everyone was at the partners meeting last week, I thought it better to post to Dieter and Marcio too, so that at least I could get an answer from one of them. Yes, I already posted in the Master forums but not in the regular forum.

Teradata Employee

Very valuable article. Thanks Carrie.

From Teradata 14 onwards, we have a feature wherein the individual stats for Col(A) and Col(B) can be used for Col(A,B) if stats on Col(A,B) have not been collected and the optimizer needs them. The same goes for Col(A) and Col(B) when individual stats are not collected but the multicolumn stat Col(A,B) is: the optimizer uses the stats information from Col(A,B) for the individual columns as well.

Let's say Query_01 uses only ColA, Query_02 uses only ColB, and Query_03 uses both ColA and ColB in the WHERE clause. Based on the enhancements explained, we could collect multicolumn stats on Col(A,B) and have them service all three queries. Or we could collect individual stats on ColA and ColB, with Query_03 being serviced implicitly. My question here is: which approach is better from an optimizer perspective? Stats-related questions are always best answered practically, yet I wanted an opinion from you.

Thanks,

Smarak

Teradata Employee

Individual column statistics and multicolumn statistics are not interchangeable.  For example, the optimizer will not use two individual column stats to replace a multicolumn stat that includes both columns, or vice versa.   That is why sites often collect multicolumn stats.  Multicolumn stats provide a value that cannot be provided by multiple single-column stats: they tell the optimizer about demographics related to the correlation of the individual columns, and which values appear together and how often. 

One exception is that the leading column of a multicolumn statistic can be used independently by the optimizer for single-table selection purposes, as though it were a single column.  However, the trailing columns in a multicolumn stat are not able to offer a similar benefit to the optimizer.  You have to collect single-column stats on trailing columns if you want that information to be available independently.

Maybe you were thinking about the capability starting in 14.0 where you can collect on a multicolumn stat and on the columns that make up the multicolumn stats in one collect statistics command, and those collections will overlap, saving resources.

The statistics recommendations in the posting above are correct: you need multicolumn stats if there is a correlation among the participating columns and if all the stats are used for single-table selection purposes.
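For what it's worth, a hedged sketch of the bundled collection mentioned above (table and statistic names hypothetical):

COLLECT STATISTICS
   COLUMN (ColA),
   COLUMN (ColB),
   COLUMN (ColA, ColB) AS stats_a_b
ON MyTable;

Issued as one statement, the individual and multicolumn collections can overlap and share the scan, saving resources.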

Thanks, -Carrie

Fan

Hi Carrie,

You mentioned above

"Collect full statistics on Partitioning columns of a row-partitioned table"

...

"6. For temporal tables, follow all collection recommendations made above. Currently, statistics are not supported on BEGIN and END period types. That capability is planned for a future release."

I have a bi-temporal partitioned table where I partition with RANGE_N on BEGIN/END(Valid_time) and BEGIN/END(Transaction_time). Since statistics are not supported on BEGIN and END period types, should I omit collecting statistics on the partitioning columns?

Br,

Socola

Teradata Employee

Socola,

Starting in 14.10, statistics can be collected on BEGIN/END(valid_time/transaction_time) since we support expression statistics. However, note that we don’t allow collection on the PERIOD type itself, such as valid_time/transaction_time, since it is not a basic type. Extracting the BEGIN/END portion of the PERIOD column results in a basic DATE or TIMESTAMP type, which is the reason we can collect stats on them.
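As a minimal sketch, assuming a hypothetical bi-temporal table named policy with a period column valid_time (names are mandatory for expression statistics):

COLLECT STATISTICS COLUMN (BEGIN(valid_time)) AS vt_begin_stats ON policy;
COLLECT STATISTICS COLUMN (END(valid_time)) AS vt_end_stats ON policy;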

Thanks, -Carrie

Teradata Employee

Hi Carrie,

First of all, please excuse me for commenting on an old post, but since my question relates to it, I had to. :-)

We need to perform a daily scheduled aggregation on a table with 25B+ rows and a PPI on DATE with INTERVAL '1' DAY. This table is loaded every hour, and its daily growth rate is ~10%.

If we don't collect stats on the PPI column before the aggregation, the whole aggregation process keeps running for 10+ hours. However, collecting stats immediately before the aggregation helps it complete within just ~20 minutes.

Keeping in view the above scenario, what would be your recommendations for the aggregation optimization?

Should we explicitly collect stats on the daily basis, immediately before the scheduled aggregation or is there any other way to optimize the aggregation?

Regards,

Wasif

Teradata Employee

Wasif,

It's always helpful to provide the database release that you are on when it comes to statistics, as many things have changed in current releases.

If you are on 14.10 or above, there are improvements made to the statistics extrapolation process that may help you.    If you have DBQL USECOUNT logging on the database where the underlying table resides, then the optimizer can make better extrapolations because any changes to the table row count will be captured in DBC.ObjectUsage and this can help to provide more reliable extrapolations.  In addition, if you build up history records on the partitioning column statistics, data in the history records will be used to detect patterns in the statistics with growth.  

I have found 14.10 extrapolations based on 1-day date partitioning columns to be fairly accurate, once adequate history records have been accumulated.   The optimizer expects that any column with a DATE data type is a rolling column, and that its number of distinct values will increase predictably over time.   This is reinforced by having history records at hand.

Also, you should recollect summary stats on the table (which is very quick) immediately following the load, to provide for more accurate determination of stale statistics, which is a prerequisite for good extrapolations.

Once you have 5 history records accumulated (from full statistics collections after the table has grown), then immediately after the load do a HELP CURRENT STATS command and it will show you the extrapolated values that the optimizer generates for the partitioning column.  That way you can see for yourself whether extrapolation is providing a number of distinct values for that column in line with what you expect after the load job completes.   Then run an explain of the aggregation and see if you are getting the "good" plan that leads to the lower completion time.  If you are getting the "good" plan through extrapolation, then you do not need to recollect stats immediately after each load.
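A hedged sketch of that post-load sequence (table name hypothetical; the HELP form follows the usage shown elsewhere in this thread and may vary by release):

COLLECT SUMMARY STATISTICS ON sales;     -- quick, immediately after each load
HELP CURRENT STATISTICS ON sales;        -- shows the extrapolated values the optimizer generates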

Thanks, -Carrie

Enthusiast

Hi Carrie. I detected a strange behavior on Teradata 14.10.

As soon as I execute the statement "HELP STATS ON TheTable;" on a table without any statistics collected, I receive the error message:

"no statistics defined on this table" (I can't remember the exact error code)

Nevertheless, it looks like Teradata silently collects statistics about table cardinality, because afterwards the cardinality estimations switch from "low confidence" to "high confidence" and the estimations match reality.

Nevertheless, "officially" there are still no statistics available on this table...

Do we have any information why this happens? 

Teradata Employee

In general, HELP STATS doesn’t do anything to the confidence levels.

However, one possibility could be random AMP sampling.   The HELP STATS  operation logic triggers all-AMP random AMP sampling, which may be impacting some confidence levels.

Before the statistics dictionary table (dbc.statstbl) is checked for the existence of statistics, we first fetch the table header. The random AMP samples get brought into the dictionary cache as part of accessing the table header.  Your first explain had no random AMP samples to look at, your second explain did.   I can't say for sure, but that could be what happened in your case.   

Thanks, -Carrie

Enthusiast

Hi Carrie,

Can you share any new recommendations for stats with Version 15?

Thanks

Manjeeth

Teradata Employee

I don't have a list of DBC stats recommendations focused especially on the 15.0 release.  What you can do is use the recommendations in the  blog posting for collecting stats on the dictionary tables in 14.0 or 14.10, then add stats to the collection scripts you plan to use in 15.0 to cover any new DBC tables,  as you find that you need to.

The best way to determine which dictionary stats you actually need to collect is to examine DBQL output and see which dictionary views are being used by your end users.  Then make sure all join columns, etc. within those views are covered.  That is likely to differ from site to site. 

Thanks, -Carrie

Enthusiast

Carrie,

Can we run collect stats while a table is being inserted into or updated, or vice versa? Are there any performance issues? Does it impact users who access those tables in the above situation?

Thanks,

Dinesh

Teradata Employee

Dinesh,

The stats collection operation uses an access lock on the table and can be run when that table is being read or updated by a query.  You can see this access lock if you do an explain on the collect statistics statement.

However, at the completion of a collect stats, all query plans using that table will be flushed from the request cache.  That may or may not have an impact on subsequent queries that access that table, depending on whether they were eligible to use cached plans and could have used them.

While you can do it, it's probably not ideal to run an update job against the table you are collecting stats on, as the stats will not reflect the actual state of the table when the updates are complete.   It is recommended to run stats collection after loading is complete, not during loading. 

In terms of query performance, when a select statement is executing, it has already had its plan built.   It will have used whatever statistics were available at the time the query began to build its query plan.  Even if updated statistics are collected while the query is executing, they will not change that query's plan.  So there will be no impact in terms of the plan changing.  

A collect statistics statement will use some level of resources. It is like an aggregation.  Depending on the priority at which statistics are executed, and the priority at which the query is executing, and the size and complexity of the collect statistics aggregation, the additional resource usage from the stats collection could have a performance impact on the running query. 
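As a quick check, an EXPLAIN of the collect statement shows the access lock mentioned above (table and column names hypothetical):

EXPLAIN COLLECT STATISTICS COLUMN (order_status) ON sales;
-- the resulting plan text includes a step that locks the table for access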

Thanks, - Carrie

Teradata Employee

Carrie,

We are running on TD 15.10. When I collect stats on a volatile table and execute an explain request after turning on HELPSTATS for the session, the optimizer is still recommending stats on the volatile table used in the query. The collect stats execution returns 0 rows updated, but HELP STATS on the volatile table shows values. How do I get the optimizer to recognize/use stats collected on volatile tables?

Thanks,

Gennar

Enthusiast

Carrie,

We have USECOUNT enabled for all our end-table databases. But I was wondering if there will be a negative system impact if we turn it ON for ALL databases, for example staging-table databases, view databases, etc., so we can see AccessCount and LastAccess in the Teradata Administrator.  Would I cause system problems if I enable it for ALL databases?

What is a recommended way to measure the impact on the system before and after enabling USECOUNT on a database?

Teradata Employee

Gennar,

Statistics for volatile tables (VTTs) are handled differently from statistics on base tables.  Statistics collected from VTTs  are not saved in DBC.StatsTbl. They are saved in memory for the given session.  They get deleted when the session is logged off.   They don’t accumulate history records and extrapolations are not performed against them.  

The reason you are seeing 0 rows updated when you collect stats on a VT table is because there isn't a histogram or an entry in DBC.StatsTbl, but just a place in memory where these stats stay for the duration of the session. 

The optimizer should be using the stats on a volatile table as it would any other table stats.  I would suggest you look at the estimates and confidence levels in the explain text to see whether the stats appear to be being used or not.  Because these stats are held in memory, it's possible the diagnostic HELPSTATS has some issues in correctly identifying the presence of stats, so it keeps on recommending stats even when they exist.   I think that is likely what is happening in your case.

If you see that the estimates/confidence levels in the explain don’t validate that VTT stats  are being used, then I would recommend you open an incident.  Otherwise, just ignore the recommendations you are getting from the diagnostic for stats that already exist.

Thanks, -Carrie

Teradata Employee

Nazy,

I have not heard of any issues with USECOUNT logging overhead being problematic. USECOUNT logging uses the same techniques as other DBQL collections. There is a USECOUNT DBQL cache that is flushed at pre-set intervals, and the cached entries are inserted into the database in a very efficient way. However, every site is different, so I cannot tell you for sure that you will not experience some overhead. But based on feedback from other customers, it is not likely to be of much magnitude.

What you can do is turn USECOUNT logging on for a subset of databases first, then add more databases gradually, and see if  you experience any increasing overhead. 

You could also post your question on Teradata Forum, and see if you can get some feedback from other customers as to how widely they use USECOUNT logging. 

The only way I know of to detect this overhead with any accuracy is by means of a very controlled test, where you run the same test script against the same data, with USECOUNT on and then with USECOUNT off, and see if there is any difference in throughput.   

There are many background tasks and monitoring activities that run in Teradata in the background.  It is a challenge measuring the impact of these types of things.

Thanks, -Carrie

Enthusiast

Is there a provision to recollect stats based on their names?

COLLECT STATS <Statistics Name> ON TABLENAME;

Teradata Employee

Hi Sanji,

Yes, you can recollect statistics by name, if a name exists.  This is particularly useful if you have multicolumn statistics.

From page 7 of the orange book titled:  Teradata Database 14.10 Statistics Enhancements:

"When you collect statistics on an index or column set, specifying a name for them is optional. A name is mandatory if statistics are for anything other than column references on expressions.

You can later use the name for recollections, copies, transfers, help information, show information, and for dropping the collected statistics. The rules for naming statistics are the same as the rules for naming database objects."

Thanks, -Carrie

Hi Carrie,

             Need a clarification on how skipping stats works.

1. We had a traditional automated stats collection process where 10% change was the criteria, which we replaced with the new skipping feature.

Question: once we enable this skip process, does a 10% change in data hold any value?

(Reason: our weekend stats collection is still based on 10%.)

2. After enabling this feature, do we still need sample stats?

Can you direct me to an orange book or an article on how the extrapolation works once this feature is enabled?

Happy Holidays

Also one more question: how is this stats skip related to the stats_mngr process?

Teradata Employee

Question #1:

I am not sure that I understand question #1:

                "Once we enable this skip process does 10% of change in data holds any value?"

The way it works is that you don't enable the skip process.  It is the system-level threshold that decides whether or not to skip.  The system-level threshold is on by default starting in 14.10.  If the other prerequisites for thresholding are in place (such as USECOUNT logging and adequate history records), then skipping will automatically be considered based on optimizer-determined thresholds.   You can turn off the system-level default if you wish; then you will not get any optimizer-initiated skipping. 

If you specify USING THRESHOLD 10 PERCENT, that will override any decision that the optimizer makes about threshold or skipping. You can use USING THRESHOLD either to override the system-level threshold functionality, or to specify a threshold when the system default has been turned off in DBS Control.  If you have turned off the system-level threshold, the USING clause means you can still get skipping, if that is what you want.

The recommendation is to let the system-level threshold make decisions, and make sure that USECOUNT logging is enabled, which is required for automatic thresholding (and potentially skipping) to happen.

Question #2:

If system level threshold is on, as it is by default, then optimizer-initiated sampling will not take place.  However, you can specify USING SAMPLE n PERCENT...if you wish, and that will work whether or not system threshold is on.

Please see the orange book on 14.10 statistics enhancements for more detail on these features. 

There is also an orange book on statistics extrapolation, called    "Statistics Extrapolation".  It is a little old, but still explains the basics.   Extrapolation works the same whether or not the system threshold is on or off.    If statistics are skipped some of the time, then extrapolation will actually play a larger role and will be applied for the stats that are somewhat out of date, but that have not yet reached the threshold that is applied to them.

Here is a response to the third question that was sent later, "How is this stats skip related to stats_mngr process ?":

If you are using the Stats Manager portlet (AutoStats), up-to-date (un-stale) stats will have a lower priority within the list of stats collection statements that AutoStats builds.  But they will be included in the job and will be submitted by a collect job, assuming the job's time duration has not expired.

If a change-based threshold is in place (either system default or with a USING clause), the actual collection will be skipped if these stats are still up to date at the time the collect stats statement runs and do not satisfy the threshold.

There is an orange book for AutoStats and Stats Manager titled:  Automated Statistics Management if you need more information.

Thanks -Carrie

Enthusiast

Hi Carrie,

Wanted to check: is there any way we can disable stats extrapolation? Will turning off all-AMP dynamic AMP sampling help here?

Also, can you suggest any article/tool to help interpret STATSUSAGE and XMLPLAN from DBC.DBQLXMLTBL?

Thanks

TDThrottle

Teradata Employee

Here is a link for an article here on Dev X that describes STATSUSAGE and XMLPLAN:

http://developer.teradata.com/database/articles/identifying-used-unused-and-missing-statistics

There is no way to turn off stats extrapolation.  However, extrapolation only takes place if statistics are stale.  Keeping stats up to date will eliminate the need for the optimizer to extrapolate.   Collecting summary stats frequently, even if you cannot collect full stats that often, will help ensure that extrapolations are more accurate.  If you think extrapolation is not working as it should, it would be advisable to open an incident with the support center.

Thanks, -Carrie

Enthusiast

Thank you Carrie.

In Teradata, do we have any way to collect statistics for a particular day, i.e., incremental statistics for partitioned tables? Something similar to Incremental Global Statistics in Oracle and SQL Server.


E.g.: a PPI table with a RANGE_N partition on TRAN_DATE BETWEEN 20160201 AND 201629 EACH 1

As data is loaded for the current date, I want to collect statistics only for the partition holding recent data and not for the entire table.

Thanks

TDThrottle

Teradata Employee

There is no method for collecting just incremental statistics in Teradata currently.   You can collect stats on expressions starting in 14.10, but that will not help you with a partitioned table.   There has been much discussion and thought given to the question you are asking, but there is no available technique for doing that at this time.

Thanks, -Carrie

Enthusiast

Hi Carrie,

I am using Teradata 14. Can I get a list of the stats recommended for all the tables in my database?

If yes, can you please give me a hint on the query we can use to check that?

Appreciate your help.

Thanks,

Katie

Teradata Employee

Katie,

The above posting constitutes overall recommendations on what statistics to collect.  You will have to evaluate which columns in your tables these recommendations apply to.   There is a similar such posting for 14.0 as well if you use this link:

http://developer.teradata.com/blog/carrie/2013/06/recommended-dictionary-statistics-to-collect-in-te...

I don't have any short cuts or SQL scripts to help with the selection of which columns/indexes to collect stats on.  But maybe if you post that inquiry on one of the Teradata forums, someone working at a customer site may have put something together. 

Thanks, -Carrie

Teradata Employee

Hi Carrie,

Good morning, I have a few questions.

One of our customers went to TD 14.10 last year, but NoDot0Backdown is still FALSE. I recently joined the team and am planning to propose setting it to TRUE.  Are there any disadvantages, risks, or wider impact on other database functionality if we make the change?

Does this parameter have any influence on stats extrapolation and auto stats sampling? (We do have UDI logging enabled, but all of the stats are still on Version 5.)

Teradata Employee

One more question.

Is there any way to check whether an extrapolation that happened was based on random/dynamic AMP sampling or on UDI information used by the optimizer? In other words, can we check which of the above two techniques was used to extrapolate the stats?

Thank You,

Teradata Employee

RESPONSE TO FIRST QUESTION:  

Even if you have USECOUNT logging on, the information cannot be stored in the statistics histogram, and therefore cannot be used for extrapolation or to properly assess changes over time from the history records, because you are not using Version 6 histograms.   Not being able to properly assess changes over time across the history records could also make it less likely that the optimizer will select a downgrade to sampling.  

You only get Version 6 histograms when you change the NoDot0Backdown flag.   You should get more accurate extrapolations with the V6 histograms.  

The only tradeoff in setting that flag to TRUE is that you will not be able to back down to an earlier release.   So if there is a possibility you may want to back down, you should wait until you are confident you want to remain on 14.10.    I don't know of any particular risks at this time in staying on 14.10 and setting the flag to TRUE, but neither have I been tracking this as a possible issue.   If you really are concerned, talk to the GSC and seek their advice.

Thanks, -Carrie

Teradata Employee

RESPONSE TO SECOND QUESTION:

If UDI counts are available, they will always be used instead of random AMP sampling.  So if you see that your histogram for that statistic carries UDI counts then you know that any extrapolation that took place used those counts.   If you don't see any UDI counts in the histogram, then assume that random AMP samples were used for extrapolation purposes.  

Thanks, -Carrie

Teradata Employee

Thank You Carrie, 

N/A
Hi Carrie,

We are on 14.10. We have set NoDot0Backdown = False in DBS Control.

Currently I have created a few Analyze jobs in Stats Manager for the databases where I enabled USECOUNT logging. An Analyze job runs for a long time (almost 72 hours) and gives only Missing and Deactivate recommendations, but no recommendations for stale stats.

I created the Analyze job based on the PDCRDATA database, with queries limited to 1 week.

Why are the Analyze jobs taking so long? Do we need to enable any logging other than STATSUSAGE and USECOUNT?

Is this due to the stats Version 5 we are currently using?

Thanks

Chandrasekhar

managed services

Teradata Employee

Chandrasekhar,

The NoDot0Backdown setting needs to be changed to TRUE. The logic used by Analyze jobs to detect staleness relies on UDI counts from USECOUNT logging in conjunction with the defined COLLECT STATS THRESHOLD option for each stat.  Change-based THRESHOLD logic requires Version 6 stats to track the UDI counts, which in turn requires NoDot0Backdown to be set to TRUE.  Only in Version 6 will the statistics histograms store those UDI counts.

There are a number of enhancements that have been made in the last year or two to improve the performance of StatsMgr Analyze jobs.  If you call the support center they can help advise you which ones are available for your platform.

Thanks, -Carrie

Visitor

Hi Carrie,

 

We recently moved to TD 15.10. We started observing that some statistics are skipped during collection. We have the following parameters disabled in dbscontrol.

What is new in TD 15.10 that is causing the skipping? We happened to notice it more on columns that are highly unique and where the table does not change much.

 

DBS Control Record - Optimizer Statistics Fields:

1. DefaultTimeThreshold = 0 (Disabled; Default value)
2. DefaultUserChangeThreshold = 0 (Disabled; Default value)
3. SysChangeThresholdOption = 3 (Disabled)
4. SysSampleOption = 2 (Disabled)
5. BLCStats = 0 (Enabled; Default value)

 

Teradata Employee

Haritha,

 

If SysChangeThresholdOption is disabled, there should not be any skipping that is managed by the optimizer. However, I have to say, the statistics that you describe as being skipped (nearly unique, not much change) are certainly the types of statistics that I would expect to be skipped if SysChangeThresholdOption were enabled.

 

If you haven’t yet done so, it might be a good idea to validate that skipping is actually taking place by running an Explain on the collect statistics statements that you believe were skipped.

 

Even if you have turned off the global threshold option, skipping that is enabled by the user could still take place. This could happen if the USING THRESHOLD option has been placed on some of your statistics collection statements. It would be a good idea to eliminate that being a factor by checking your stats collection scripts.  

 

Another reason for running an Explain or two is to see why the skipping took place. If a stat is skipped the Explain text will tell you whether it was skipped due to a user-defined threshold (resulting from a USING THRESHOLD clause) or by the system threshold.  

 

If you are getting skipping happening at the system level and SysChangeThresholdOption is disabled, then it is probably best to open an incident with the support center.

 

Thanks, -Carrie

Visitor

Thanks Carrie. It didn't occur to me to run Explain on the collect stats statement. I can see it is skipping based on a system time threshold. We have disabled it at the system level and do not have a user threshold defined, but it looks like the system is still applying one. Here is the explain. I will open an incident. Thanks for the help. 

 

2) Next, We SKIP collecting STATISTICS for ('GCLM_AGT_INIT '),
because the age of the statistics (50 days) does not exceed the
system-determined time threshold of 9999 days.
3) We lock DBC.StatsTbl for write on a RowHash.
4) We do a single-AMP UPDATE from DBC.StatsTbl by way of the primary
index "{LeftTable}.Field_2 = '7003574F0000'XB" with a residual
condition of ("DBC.StatsTbl.StatsId = 1").
5) We spoil the statistics cache for the table, view or query.
6) We spoil the parser's dictionary cache for the table.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.

Teradata Employee

Haritha,

 

Please open an incident for this.   There is a problem in the code that needs to be fixed that is causing this issue.

 

Thanks, -Carrie

 

 

Scout

Hi Carrie,

 

We had collected stats on one table for the first time as:  COLLECT STATISTICS COLUMN1(...),COLUMN2(....),.........COLUMNn ON tablename;

After that, to REFRESH the stats, we run statements as: COLLECT STATISTICS ON tablename;

 

That the first collection took time I can understand. The REFRESH was initially taking ~26 min for about a week (even though the DBS parameter SysSampleOption was disabled).

But now the same REFRESH statement is taking ~3 hrs, even though the data growth/change has been minimal, < 2%.

What could be the reason for the slowness of the collect stats REFRESH?

Teradata Employee

What does the explain say will happen with the "COLLECT STATISTICS ON tablename;" statement?

 

Is it not skipping columns it used to or are thresholds not what you expect?

 

Also, do you have TASM enabled? Remember resource usage is not the same as elapsed time.

 

Could your statistics workload be getting delayed or given a lower priority?

 

thanks

 

Dave

Teradata Employee

Kiran,

 

Let me add to what Dave says above. How quickly a stats collection completes will depend on what else is running at the same time, the priority of the stats collection request, whether or not the table has grown, whether the system is out of AMP worker tasks at the time, and whether there are any exclusive locks on the underlying objects.

 

Both the "COLLECT STATISTICS COLUMN1(...),COLUMN2(....),.........COLUMNn ON tablename"

and the "COLLECT STATISTICS ON tablename" should use the same approach and bundle together individual stats into a single table scan whereever possible.   You can see this by running explains on both statements.  If the plan is the same for both approaches then the reason for elapsed time differences have to do with the enviornmental factors mentioned above.   If stats are going to be skipped, that will show up in the explain as well, if you run the explain just before the stats collection.  

 

Collect stats is an all-AMP operation. If even one AMP is congested, that can slow down the entire process.

 

Thanks, -Carrie