Statistics collection recommendations – Teradata 14.0

Blog
The best minds from Teradata, our partners, and customers blog about relevant topics and features.

Statistical information is vital for the optimizer when it builds query plans.  But collecting statistics can involve time and resources.  By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

This recently-updated compilation of statistics collection recommendations are intended for sites that are on any of the Teradata 14.0 software release levels.   Some of these recommendations apply to releases earlier than Teradata 14.0, however some rely on new features available only in Teradata 14.0.

Contributors:  Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair, February 12, 2013

Collect Full Statistics

• Non-indexed columns used in predicates
• All NUSIs
• USIs/UPIs if used in non-equality predicates (range constraints)
• Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
• Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
• PARTITION for all partitioned tables undergoing upward growth
• Partitioning columns of a row-partitioned table

Can Rely on Random AMP Sampling

• USIs or UPIs if only used with equality predicates
• NUSIs with an even distribution of values
• NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
• See “Other Considerations” for additional points related to random AMP sampling

Option to use USING SAMPLE

• Unique index columns
• Nearly-unique[1] columns or indexes

Collect Multicolumn Statistics

• Groups of columns that often appear together with equality predicates. These statistics will be used for single-tables estimates.
• Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them.  With no multicolumn statistics collected, the optimizer assumes complete independence among the column values.  The more that the combination of actual values are correlated, the greater the value of collecting multicolumn statistics will be in this situation.

1. When multiple statistics on a table are collected for the first time, group all statistics with the same USING options into a single request.
2. After first time collections, collect all statistics being refreshed on a table into one statement, and if all are being refreshed, re-collect at the table level.
3. Do not rely on copied or transferred SUMMARY statistics or PARTITION statistics between tables within a system or across systems.  Recollect them natively. This ensures that changes to the configuration and other internal details (such as how internal partitions are arranged) will be available to the optimizer.
4. Recollect SUMMARY statistics after data loads, it runs very quickly and provides up-to-the-date row counts to the optimizer.
5. In Teradata 14.0, PARTITION statistics are no longer required on nonpartitioned tables, as the new SUMMARY table replaces the function previously provided by PARTITION or primary index statistics in determining stale statistics.

Other Considerations

1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.
2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that.  A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) will benefit from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values.  However, if it is close to being 75% unique, then random AMP samples are adequate.  To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:

EXPLAIN SELECT DISTINCT nupi-column FROM table;

1. For a partitioned table, it is recommended that you always collect statistics on:
• PARTITION.  This tells the optimizer how many row partitions are empty, a histogram of how many rows are in each row partition, and the compression ratio for column partitions.  This statistic is used for optimizer costing.
• Any partitioning columns.  This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
1. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
• (PARTITION, PI).  This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions.  It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
• (PARTITION, PI, partitioning column).   This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination.  It is used in rowkey join costing.
1. Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default).  All-AMP random AMP sampling has these particular advantages:
• It provides a more accurate row count estimate for a table with a NUPI.  This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
• All-AMP random AMP samples are collected automatically when table-level SUMMARY statistics are collected.
1. For temporal tables, follow all collection recommendations made above.  Currently, statistics are not supported on BEGIN and END period types.  That capability is planned for a future release.

Contributors:  Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair, February 12, 2013

[1] Any column which is over 95% unique is considered as a nearly-unique column.

Hi Chetan,

In 14.0 the HELP STATS statement has been enhanced to show a summary of the table's statistical information.  The first row of output, which will display an asterisk under Column Names, indicates the last known table row count for the table, from the latest summary statistics collection against the table.

This summary data for the table is collected every time a collect stats statement is executed against any column or index in the table.

If you issue the HELP STATS statement with "CURRENT", the column with the asterisk (the table summary information) will display the extrapolated value for the table, if the table has undergone growth since the last summary collection was performed:

HELP CURRENT STATISTICS ON table-name;

One of the nice things about using the syntax that includes CURRENT (new in 14.0) is that it will show you extrapolated values for each statistics in the table, not just the table row count.   That gives you an easy way to validate that extrapolation is working, and provides a window into how well the optimizer is doing when it performs extrapolation.

Thanks, -Carrie

Fan

Thats cool!

-Chetan

Chetan,

I am so sorry, but I must have missed answering this question at the time it was posted.

The asterisk under column names is indeed new in 14.0 and represents the number of rows in the table. Therefore there is no column name associated with that number of unique values.

If you have access to Teradat orange books, this is documented in Teradata Database 14.0 Statistics Enhancements  Page 16.  That orange book explains all of the new changes for statistics in the 14.0 release.

Enthusiast

Hi Carrie,

In TD 13.10 and older, when collecting on multiple columns, the order of the columns in which the optimizer analyzed the demographics was determine not by the order of the collect statement (i.e collect stats on table collumn a,b), but rather the order in which the columns were defined in the table.  In other words, it is my understanding, that if the columns were ordered (b,a) but the stats statement read (a,b) they would still be collected in (b,a) order.  Again, it is my understanding that if the b column is 24 bytes long and the a column is 2 bytes long, it is best to define them in (a,b) order to get the best histogram based on the 16 byte limit.  Do I have this right for 13.10 and earlier releases?  My other question is.... if I have this right, does anything change in 14.0?  I vaguely remember someone telling me this would change in 14.0.

Thanks,

Joe

Joe,

Your understanding is correct as to how pre-14.0 statistics collection on multiple columns work.

There are two enhancements in 14.0 that will apply here:

1.   The default value length stored in the statistics histogram has been increased to 25 bytes.

2.  You can expand the default value length that will be carried in the histogram for any given statistics by means of the new USING MAXVALUELENGTH.   You can go pretty high on this specification if you need to, but the tradeoff is that it may increase the size of the total histogram.   One nice thing about how this options works is that the value length field in the histogram will not pad unused bytes, so if a given value is smaller than the MAXVALUELENGTH specifies, the histogram will only hold characters up to the actual length of each particular value.

Thanks, -Carrie

Enthusiast

Is there a consenus on the age of statsitics that should be recollected, assuming < 10% change?  I have seen a number of references to Age of Collection but no discussion of the appropriate Age.

Hi Andrew,

Age, or the passage of time, is a less accurate way of determining when statistics need recollecting compared to using the percent of change as a threshold.  Some tables may change a lot in, say, 7 days, while others may change very little, and still others may not change at all.

Maybe one reason you don't hear much discussion around how to set that type of threshold is that it is inherently difficult to come up with a different ageing algorithm for each table's statistics.   And you would tend to consume unnecessary resources recollecting more often than needed if you base your time threshold on a single number of days for all tables.

What I've seen in terms of time as a threshold is pretty generic:   Recollect on all tables that undergo any level of change weekly, and the ones that undergo larger changes or are more critical, recollect daily.

You might want to post this question on Teradata Forum and see what other Teradata DBAs are doing in this regard.  You would probably get more detailed and anecdotal information doing that.

The new 14.10 statistics enhancements do rely primarily on percent of change as a threshold, but introduce more accurate ways of measuring table change.  There is a default setting in 14.10 called SysChangeThresholdOption which will examine each statistic when it is submitted for collection and decide whether or not a reasonable threshold of change has been met.  If it has not, then that statistic will not be recollected at that time.

This default system threshold setting has access to insert, delete and update counts for the table and only looks percent of change, not age.  However, it only does threshold evaluation if the correct (and new) DBQL logging has been turned on, logging that tracks the update counts for the table.

See the orange book Teradata Database 14.10 Statistics Enhancements for more detail on 14.10 statistics features.

Thanks, -Carrie

Fan

Hi Carrie,

In our environment, before applying compression we are dropping statistics and after applying compression using alter on the table we are recollecting the statistics.

However this is a time /resource consuming process to recollect statistics on huge data.

so instead of recollecting can we do copy stats like below after applying compression

collect stats on table a from table b;

Is there any difference in recollecting Vs Copying statistics?

Thanks,

Mani

That will work for you.   When you copy statistics, any statistics history records that have been accumulated are retained.

When you get on Teradata Database 14.0 you want to be cautious about dropping statistics as dropping will destroy the history records that have been building up with each recollection.  Those history records are used by the optimizer during the extrapolation process to detect patterns of change over time, and provide more robust extrapolations.

You could use only the table undering the alter, by exporting the statistics to a text file using the new command SHOW STATS VALUES ON <Tbl>,  then dropping the statistics, and then doing your ALTER, then importing them back after the ALTER is done.  The output text from the above command is in SQL format, ready to be sumitted. Any history records will  be included during the export and then will automatically be imported back in.

Thanks, - Carrie

Fan

Thank you Carrie, appreciate your help !!!

Enthusiast

Hello Carrie,

First of all , i would like to thank you for your time on writting / replying all these.

I would like to ask five questions ,

1.Is the order of the columns of DDL the default ordering on 14(.10) or we define it when we collect the stats?

2.On your previous post , on 13,10 recommendation , on one answer you posted that in multi - column stats , every single column is 25 bytes?Correct? Furthermore , if we will define a maxvaluelength , it will increase every column which is participated on multicolumn  or it will affect the total one?

3.When we have on where clauses , a combinanation of  cl1= xxx and cl2<> yyyy , it is recommended to put first the no-equal column on the defined stats?Meaning , collect stats.......(cl2,cl1)

4.Until 13.10 , it was recommended not to have a lot of stats defined.From my experience , i noticed that a lot of times,more stats meaning more troubles and bad estimations.In 14(.10) , we are free to add as many stats as we want (multicolumn,single and so on), without to confuse the parser? Moreover, in 13.10 , i noticed that if i had stats on date,timestamp columns and these columns were participated on range filters(between .....) , the estimation was totally wrong , so i was not selecting these one,in order to have a better plan.

5. Due to many reasons , some times we have a plan defined. In previous realeases , we could add trim()  or other functions , in order to force the parcer not to be reliable to stats in order to make some checks?How can we do this to 14(.10)?Is it available?Or like  in previous releases to add more where clauses to increase or decrease the estimation ..

Thank you very much for your time.

#1.  There is no default ordering for multicolumn stats in 14.0 and above.  The order you define the columns in the COLLECT STATS statement determines their ordering.

#2.  In 14.0 and above the combined combination of values is truncated at 25 bytes by default for the value columns in the histogram.  Each column in the multicolumn stats is not getting 25 bytes individually.   If you use MaxValueLength you can extend this default, but it will still represent the total combined length starting with the first column.   You can read more about MaxValueLength in the blog posting"  New Opportunities for Stats Collection in 14.0.

#3:  Yes.  That is the recommendation.

#4:  Statistics help provide information to the optimizer, and ideally they should provide value. If a particular statistic is not providing value, then it is recommended that you drop it, and call the support center to see if there is something in the software that can be improved.

#5:    I am not sure what the question is here, but if you are asking about ways to invalidate a statistic so the optimizer will ignore it, I am not the right person to help you with that.   If you want feedback on doing those kinds of things, I suggest you post the question on Teradata Forum or one of the other Q&A sites where people working at other sites can see it and have a chance to respond.

Thanks, -Carrie

Enthusiast

Hello Carrie ,

Thank you very much for your responses && time.

Enthusiast

Hello Carrie ,

I would like to ask three more question concerning the stats , if it is possible.

1.If i will make a new table with different DDL (change some columns from varchar(20) to (10) or verca ) , is it safe to copy the stats? Or, it would be better to collect them again?

2.If a table has no stats defined , and this table is been used to insert a simple aggregation to an other table ,  - obvious the expecting rows are wrong -  , does this influence TD on how much CPU/IO it will give to this transaction?(taking under consideration that the system is almost idle / there is no skew)

3.Ending , concerning the multi column stats which use the max value length , is it safe to collect stats using this feature  or not.Because TD on one document suggest that this on some cases may lead to mistakes. ( I know the most secure case is to make seperate test for every stat , but i would like to have a base and then check all the cases)

1.  You can copy stats whether or not a column definition in the source table has changed.  Copied stats are for an unchanged column will be fine.   And if the column changed does not involve any truncation of the data (if when going from 20 to 10 no values get truncated, or if going from 10 to 20) that is fine as well.  The only case where its a good idea to recollect the stats after such a copy is if truncation took place due to the change.

2.  I'm not sure I understand the problem that you are trying to solve here.   Resource distribution  does not depend on whether or not stats have been collected.   It has to do with how workoad management has been set up and the priority of the work.    Usually, if the system is nearly idle, resources will be available to process whatever work is running, no matter what its priority.

3.  It is safe to use USING MAXVALUELENGTH.  But be aware that it will increase the size of the histogram, so it might take longer to transfer it from disk or could put more pressure on the data dictionary.   I am not aware of any mistakes or errors that come from using this option, other than the tradeoffs of making this histogram larger which I have already mentioned.

Thanks, -Carrie

Enthusiast

Hello,

Thank you one more time for your replies.

Concerning the (2)  , i did not set the question correctly (and correct me again , since i am not very familiar with TASM).

I wanted to understund  , if simple queries will benefit from stats ,since we will have more accurate estimated times and rows on the plan , so TASM can classify these queries to appropriate workloads that have different priorities .

Even simple queries may benefit from stats collections whether or not you are using TASM.   And since statistics are collected on a table's columns/indexes, not on the query, it is possible other more complex queries are accessing the same table as the simple queries, and they will also benefit from stats.

You are correct that collecting stats and as a result having as close as possible estimates in query plans will make classification criteria more effective in TASM (if estimates are being used for classification purposes).

Thanks, -Carrie

Fan

Hi,

I'm just new to Teradata and my background is Oracle. When collecting Statistics in Teradata, after setting up a table for stats collection, do I need to have another script to refresh the stats on a daily basis?

Thanks

Chris

Hi Chris,

You don't really "set a table up" for stats collection, you just decide which stats you wish to collect and then build a script with collect stats statements for those columns or indexes and then run the script.   You can rerun the collect stats statements (using the same syntax as the initial collection if you wish) whenever you want to recollect stats for those columns or indexes.  Sometimes that is weekly, sometimes monthly, in some cases daily.  It depends on how much the table has grown since the previous collection.

If you want to recollect statistics on all statistics that exist on a given table at the same time, you can simply issue a table-level statistics collection statement.

When performed individually, the statement to initially collect stats and to recollect stats is the same.  So you don't really need another script, but you could have another script if you only wanted to recollect a subset of the stats.

In 14.10 you can put your stats recollections under the control of the Automated Statistics Management feature.  I have another blog posting that describes that a little bit.  If you are using the Automated Stats Manager feature, you never have to issue scripts to recollect.

Thanks, -Carrie

Chetan,

The first line that you see in the HELP STATS output in 14.0 represents the table row count in the UniqueValues column.   That is why it has an asterisk for the Column Names column.  It is not colunn-specific info, but rather table-level info.   Actually it represents the table's SUMMARY stats row count...If you were to recollect on the SUMMARY stats for that table that value would change.  SUMMARY stats are recollected automatically every time you recollect stats on any column or index for that table.  So this is how SUMMARY stats are shown in the HELP STATS output.

Thanks, -Carrie

Enthusiast

Hi,

Sorry for asking very basic question. But i wanted to know the difference.

I am having a table A like this

Col1 int,

Col2 int,

col3 int,

col4 int,

col5 int

PI(Col1)

NUSI1(Col2)

NUSI2(Col3,Col4)

Is there a diffrence between collecting stats using keyword index and columns?

Like this

Set1:

collect stats tableA on index NUSI1;

Collect stats tableA on index NUSI2;

(or)

Set 2:

collect stats tableA on column(col2);

collect stats tableA on columun(Col3,Col4);

Out of these two sets which one is better? Or is there no difference? Please advise.

Thanks,

Vasudev

Enthusiast

I got the answer for my question from Carrie's blog posted on 19-Sep-2012.

Carrie,

Please correct me if i am wrong.

Column level stats is helpful even if the indexes are dropped still stats will exist. This is applicable only for single column stats. For multi column stats - Even if it is collected using index or column it will be removed if index is dropped.

So, for single column stats better we can make use of stats using column instead of index.

Thanks,

Vasudev

Enthusiast

Adding on to the above one i am intrested to understand about the order of columns used in multi column stats. In TD 12 whether it makes any difference. Say taking the same above mentioned scenario. On TableA stats collected on NUSI2 like this this collect stats tableA on column(col3,col4); after some time NUSI is dropped for data load and recreated. Now can we collect stats by changing order like this. Collect stats tableA on column(col4,col3); Whether it makes any difference.

Thanks,

Vasudev

Enthusiast

Hi Carrie,

One of my table having 2,132,426,728 rows in a column which is also UPI for this table. It is on TD 14.10. While collecting stats on this table(Column) it is taking ~14K CPUs without using USING SAMPLE option. However while collecting stats with USING SAMPLE option it is taking more cpu. Can you please tell me why it is taking more cpu....?

Regards,

Arpit

Arpit,

USING SAMPLE specified on a collect statistics statement does not necessarily result in sampling taking place.   It means you are letting the optimizer decide whether or not to sample.   You would have to do an explain of the COLLECT STATS statement to see if sampling was taking place.  Or you could look in the SummaryInfo part of the statistics histogram and see whant the sampling percent field says.   Most likely, sampling is not taking place in your case so you are not seeing any reduction in CPU for the operation.

CPU usage for a statistic collection could vary somewhat from run to run.  There could be many issues involved in different levels of CPU, such as the number of rows in the table having been increased, or fewer data blocks being found in the FSG cache, or other system conditions.   You don’t say how much of a CPU difference you are seeing.  If the percent of difference is in the single digits, it's probably noise.

Unique primary indexes often do not require collecting statistics.    The optimizer recognizes such columns are unique if the table is defined as having a UPI.   The only use case I can think of where you do need stats on a UPI is if you are doing range constraints in your queries against the UPI, or any kind of non-equal conditions against the UPI.

See the 3rd bullet under "Collect Full Statistics" in the blog posting above.

Thanks, -Carrie

Enthusiast

Thanks Carrie; Got the point.

Actually our client was collecting full stats on this column(s) and I saw the opportunity to suggest them this feature.

Next time onwards first I will read the explain plan; Here when I hardcoded 2 percent in USING SAMPLE command i saw the diffrence in CPU.

If I see a candidate for USING SAMPLE; Is there any way to get the most accurate percentage of how much I need to put while collecting stats.

`I am very sorry if i skipped this point in your earlier replies.`

2% sampling should fine fine for a UPI column.  Sampling at low percentages is fairly accurate for any column that is unique or nearly-unique.

Unfortunately, there is no way I know of to derive the perfect sampling percent to use.   When the optimizer has control (when you code in USING SYSTEM SAMPLE, or USING SAMPLE), it attempts itself to set the optimal sampling percent at that point in time, based on looking at the detail within past full statistics collections, and trying to detect patterns within the statistic that it might need to account for, like skew.  But I cannot tell you first-hand that I have studied the choices the optimizer in this regard.  I have not.  But in my experience the optimizer tends to be conservative in its decisions and is probably doing a better job than either you or I could do in picking a good sampling percent.

But as you have found out, the optimizer will often not use sampling at all when you specify USING SYSTEM SAMPLE, because it doesn't have enough background information to be confident in selecting a good sampling percent, or doesn't believe the statistic is suitable for sampling.

Thanks, -Carrie

Not applicable

Hi Carrie,

Is there some harm if I define full statistics for some fields of the table and sample stats for the rest? I ask this because the client decided to collect sample statistics for all single fields and Teradata started to create Product Joins with one join field believing it wouldn’t get any rows. The joined field has 4.474 different values with two peaks (one value has 77% of data and 20% are nulls) and the client has 78 AMPs (TD 14.0). After collecting full statistics for this field the product join disappeared.

Regards

Christer

Christer,

There is no problem collecting sampled statstics on some columns and full statistics on others on the same table.   Full stats are often a better choice when there is noteable skew in the column, as your example below points out.   If you are sampling at a low percent, or even a moderate percent, it may be difficult to properly capture the degree of skew or the percent of null values.

Whatever statistics strategy is resulting in the best plans is usually the right action to take.

Thanks, -Carrie

Not applicable

Hi

I converted a Non Partitioned table to a Partitioned table (both same index). in the partitioned table i added new Stat Column(Partition).

I found that the new table (PPi) took more time to do collect stat than the old one (non partitioned).  When i checked the I/O and CPU that too was increased 3 times than the old table. IS it necessary to collect "Column(Partition)" for Partitioned tables?

Regards

Robin

Robin,

By any chance are you collecting statistics with the USING SAMPLE option? If sampled stats are being collected on a NPPI table, when you change to a PPI table the optimizer may decide to sample each partition for the stats being collected on the primary index and the partitioning columns.  This can take more time compared to a NPPI table where sampling is only at the beginning of the table.   Check your output and see if it is those two columns that are contributing to greater resource usage.

Adding statistics collection on the column PARTITION is not likely to add much time or resources.  The elapsed time for collecting on PARTITION is usually a second or two, as the process just reads the cylinder indexes, not the base table rows.

Thanks, -Carrie

Not applicable

Yes I am collecting Sample Stats. How can i reduce the I/O in this case.

Regards

Robin

Robin,

I don't know of any way of "tuning" a collect statistics statement, in the way you might try to tune a regular query.   A collect stats statement is going to scan the table and build the aggregate histogram.   You can make the sampling size smaller, but you have to be careful if you do that, as you might risk getting poorer plans as a result.

Thanks, -Carrie

Not applicable

I have partition defined for more than 20 fact tables in my database only 2 tables has this issue.

Enthusiast

Hello

We would like to ask you some questions, concerning the order of the fields in the statistics combinations of the Teradata tables:
We recollect once a week ALL the existing statistics on more than ten thousands production tables in order to refresh them. The order of the fields in combinations is not sorted out.
Do you know any mechanism or algorithm that may sort them out in the stats combinations?
Will it improve the performance in the TERADATA SQLs and if "YES" – significantly or not?
If this mechanism exists – so , should we execute it repeatedly ?

What are your recommendations about the size of MaxValueLength parameter (25 or 50) and how to define it ?
And to maintain ?

Thanks a lot!

For your first question, I am not clear on what you mean when you say: “The order of the fields in combinations is not sorted out. Do you know any mechanism or algorithm that may sort them out in the stats combinations?”

Do you want the individual columns within each multicolumn statistics to be ordered alphabetically (or some other order)? Or do you want all statistics being collected for a given table to be sorted alphabetically (or some other order)?

I don't think it will matter either way.  Within a given table, the optimizer will look at all the stats to-be-collected and will try to bundle up all ones where it makes sense into a single scan. How you order the individual stats in your Collect Statistics statement will not have any impact on that process.

The ordering of columns within a multicolumn statistic is usually best starting from small-sized columns to larger-sized columns in order to get more information in the values fields within the histogram.

For the second question, USING MAXVALUELENGTH comes with tradeoffs. Because there are tradeoffs, if you are getting adequate differentiation in the histogram between different combinations of values in different rows and the optimizer is providing good plans, stay with the default of 25 bytes.

Please refer to the webcast class from Teradata Education Network (TEN), course #5822 titled “Statistics Collection Basics”. It cover all the tradeoffs with USING MAXVALUELENGTH in great detail, more than I put into this response. Just a few points include:

• Larger MaxValueLengths require more space in the histogram
• Histogram is limited to 64 KB, so it may have to take space from somewhere else
• Number of histogram intervals may be reduced if increase MaxValueLength is set too high
• Need to balance increased value lengths with how much the number of detailed intervals is reduced
• Always check impact on overall histogram when you increase MaxValueLength

• Increasing MaxValueLength is useful if:
• The statistic is being used for single-table selection
• And larger value lengths are required for interval value differentiation

Thanks, -Carrie

Enthusiast

Hi Carrie,

Of course, I didn't intend that the fields in multicolumn statistics combination should be order by alphabetically.

I mentioned above the size of the fields..

You wrote:

"The ordering of columns within a multicolumn statistic is usually best starting from small-sized columns to

larger-sized columns in order to get more information in the values fields within the histogram".

So, You did meant the size of the fields in multicolumn combination!

My question was "how to sort out the fields?"

Do you know any existing mechanism or algorithm that may sort them out in the stats combinations? Because when i checkeגת The fields are not sorted in our system from small to large(field size)

Will it improve the performance in the TERADATA SQLs and if "YES" – significantly or not?

If this mechanism exists – so , should we execute it repeatedly ?

Concerning MAXVALUELENGTH , it's good enouph for me to get the default value suggestion.

Thanks again!

For your first question: “So, You did meant the size of the fields in multicolumn combination!”

No I did not mean that. When I said “a multicolumn statistic is usually best starting from small-sized columns to larger-sized columns” I meant the size of each individual column that participates in a multicolumn statistic, it is better to order those columns that will all be in the same statistic with the smaller first. I did not mean the size of the multicolumn statistic in combination.

Sorry, but I don’t understand what you mean by "how to sort out the fields?"

When you define the multicolumn statistics you define the order of the columns in whatever way you want, and that is what theordering of columns within the multicolumn statistic becomes. Whichever column you name first in the definition will be first column in the multicolumn statistic. There is no “sorting out” involved.

On your question: “Do you know any existing mechanism or algorithm that may sort them out in the stats combinations?”

Sorry, but I am not understanding what you mean by "sort them out". There is no sorting involved in defining or collecting multicolumn statistics.

On your question: “Will it improve the performance in the TERADATA SQLs and if "YES" – significantly or not?” Same response as above, I don’t know what “it” means.

Thanks, -Carrie

Enthusiast

Hi Carrie,

Actually we are talking about the same things.. I meant to the size of the individual columns in a multicolumn statistic.

Lets assume i have the following statistic-

collect statistics  column(desc, name, id) on db.table. (example 1)

the size of the column "desc" is - 50 (columnlength),  the size of the column "name" is 30 and the size of the column "id" id 3.

based on your recommendation the statistic should look like this-

collect statistics  column(id, name, desc) on db.table. (example 2)

So, My question about how to sort it out is- Because we have a lot of "unordered" statistics- do you know any automated way (like script ,algorithm etc..) that sort those statistics so that the smaller column would be first in the multicolumn statistics like in the second example above?

And to the last question about performance - what i meant to ask is- Is changing the statistics so that the smaller column would be first - like the second example i wrote above, may improve the performance of SQLs in Teradata?

Thanks a lot!

Yes, it is usually beneficial to put the shorter columns at the beginning of the multicolumn statistic, if the statistic is being used for single table selection. That is so the leading columns do not dominate the space allowable for values that are inserted into the histogram. If the multicolumn statistic is being used join processing, it is less important to order them by size, as the optimizer does not usually look at the particular values in the histogram when doing join planning. It looks primarily at the number of distinct values, which is carried in the SummaryInfo section of the histogram.  There is no truncation during the calculation of number of distinct values.

The one case when the smaller-column first is not the best choice, is if there is a date in the multicolumn statistic, and the date is used by queries in range constraints. In those cases it would be better to lead with the date, even if there are other columns that are narrower.

I do not know of any scripts or algorithms that will order the columns in each multicolumn statistic by size. You would need to do that individually.

The only potential performance advantage of ordering the columns by size is this: if that if the length of the combined values exceeds the maxvaluelength for the statistics (25 by default), then truncation at the end of the combined values will take place. With shorter columns defined first, you will be more likely to get representation within the multicolumn statistic values for a greater number of the columns. That may or not happen. And if you get the content of more columns in the histogram values, that doesn’t necessarily mean you will get better plans. And as I said earlier, if the multicolumn statistic is used for join planning, the ordering of the columns will not matter at all. Since the optimizer will not look at the values in the histogram when doing join planning.  So there are no guarantees it will make any difference at all.

Thanks, -Carrie