New opportunities for statistics collection in Teradata 14.0


Teradata 14.0 offers some very helpful enhancements to the statistics collection process. This posting discusses a few of the key ones and explains how you can use them to streamline your statistics collection process and make your statistics more effective.

For more detail on these and other statistics collection enhancements, please read the orange book titled Teradata 14.0 Statistics Enhancements, authored by Rama Korlapati, Teradata Labs. 

New USING options add greater flexibility

In Teradata 14.0 you may optionally specify a USING clause within the COLLECT STATISTICS statement. As an example, here are the three new USING options available in 14.0, with parameters you might use:

. . . USING MAXINTERVALS 300

. . . USING MAXVALUELENGTH 50

. . . USING SAMPLE 10 PERCENT

MAXINTERVALS allows you to increase or decrease the number of intervals, one statistic at a time, in the new version 5 statistics histogram. The default maximum number of intervals is 250, and the valid range is 0 to 500. A larger number of intervals can be useful if you have widespread skew on a column or index you are collecting statistics on, and you want more individual high-row-count values to be represented in the histogram. Each statistics interval highlights its single most popular value, which it designates as its “mode value”, and lists the number of rows that carry that value. By increasing the number of intervals, you provide the optimizer with an accurate row count for a greater number of popular values.
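
For illustration, a statement that raises the interval limit for a skewed column might look like this (the table and column names here are hypothetical):

COLLECT STATISTICS 
        USING MAXINTERVALS 500
        COLUMN ( O_STATUS )
ON CAB.orders;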

MAXVALUELENGTH lets you expand the length of the values contained in the histogram for that statistic. The new default length is 25 bytes, whereas previously it was 16. If needed, you can specify well over 1000 bytes for a maximum value length. No padding is done to the values in the histogram, so only values that actually need that length will incur the space (which is why the parameter is named MAXVALUELENGTH instead of VALUELENGTH). The 16-byte limit on value sizes in earlier releases was always padded to full size: even if your statistics value was one character, you used the full 16 bytes to represent it.

Another improvement around value lengths has to do with multicolumn statistics. In earlier releases the 16-byte limit for values in the intervals was taken from the beginning of the combined value string, and truncation took place at the end. Truncation still takes place at the end of the combined string in 14.0; however, in this more current release you can support a much longer combined value. In addition, during the aggregation process that builds up the statistical histogram, each column within the statistic will be able to have its first 32 bytes represented when determining the number of distinct values, ensuring that a significant number of bytes in each column can contribute to understanding how distinct the combination of column values is.
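
As a sketch of how this might be applied, a multicolumn statistic could widen the histogram value fields so more of the combined string survives truncation (the column names below are illustrative):

COLLECT STATISTICS 
        USING MAXVALUELENGTH 64
        COLUMN ( P_NAME, P_CATEGORY )
ON CAB.product;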

SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection level, rather than at the system level. This allows you to easily apply different levels of statistics sampling to different columns and indexes.
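
For example, a hypothetical collection that samples 10 percent of the rows for one column, while other statistics on the same table remain full collections:

COLLECT STATISTICS 
        USING SAMPLE 10 PERCENT
        COLUMN ( O_COMMENT )
ON CAB.orders;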

Here's another example of how this USING syntax might look:

COLLECT STATISTICS 
        USING MAXVALUELENGTH 50
        COLUMN ( P_NAME )
ON CAB.product;

Combining multiple collections in one statement

Statistics collection statements for the same table that share the same USING options, and that request full statistics (as opposed to sampled), can now be grouped syntactically. In fact, it is recommended that once you get on 14.0 you collect all such statistics on a table as one group. The optimizer will then look for opportunities to overlap the collections wherever possible, reducing the time to perform the statistics collection and the resources it uses.

Here is an example:

The old way:

COLLECT STATISTICS COLUMN 
(o_orderdatetime,o_orderID)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderdatetime)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderID)
ON Orders;

The new, recommended way: 

COLLECT STATISTICS 
  COLUMN (o_orderdatetime,o_orderID)
, COLUMN (o_orderdatetime)
, COLUMN (o_orderID)
ON Orders;

This is particularly useful when the same column appears in single-column and also multicolumn statistics, as in the example above. In those cases the optimizer will perform the most inclusive collection first (o_orderdatetime, o_orderID), and then re-use the spool built for that step to derive the statistics for the other two columns. Only a single table scan is required, instead of three table scans with the old approach.

Sometimes the optimizer will choose to perform separate collections (scans of the table) the first time it sees a set of bundled statistics. But based on demographics it has available from the first collection, it may come to understand that it can group future collections and use pre-aggregation and rollup enhancements to satisfy them all in one scan.   

But you have to remember to re-code your statistics collection statements when you get on 14.0 in order to experience these savings.

Summary statistics

New in Teradata 14.0, table-level statistics known as “summary statistics” are collected alongside the column or index statistics you request. Summary statistics do not cause their own histogram to be built; rather, they create a short listing of facts about the table undergoing collection that are held in the new DBC.StatsTbl. It is a very fast operation. Summary stats report on things such as the table’s row count, average block size, and some metrics around block-level compression and (in the future) temperature. An example of actual execution times that I ran is shown below, comparing regular column statistics collection against summary statistics collection for the same large table. Time is reported in MM:SS:

COLLECT STATISTICS ON Items COLUMN I_ProductID;
 Elapsed time (mm:ss):  9:55 

 COLLECT SUMMARY STATISTICS ON Items;
 Elapsed time (mm:ss):  00:01

You can request summary statistics for a table, but even if you never do, each individual statistics collection statement causes summary stats to be gathered. For this reason, it is recommended that you group your statistics collections against the same table into one statement, in order to avoid even the small overhead involved in building summary stats repetitively for the same table within the same script.

There are several benefits to having summary statistics. One critical advantage is that the optimizer now uses summary stats to get the most up-to-date row count for the table, in order to provide more accurate extrapolations. It no longer needs to depend on primary index or PARTITION stats, as was the case in earlier releases, to perform good extrapolations when it finds statistics on a table to be stale.

Here’s an example of what the most recent summary statistic for the Items table looks like:

SHOW SUMMARY STATISTICS VALUES ON Items;

COLLECT SUMMARY STATISTICS
            ON CAB.Items
            VALUES
(
 /** TableLevelSummary **/
 /* Version           */ 5,
 /* NumOfRecords      */ 50,
 /* Reserved1         */ 0.000000,
 /* Reserved2         */ 0.000000,
 
 /* SummaryRecord[1] */
 /* Temperature           */ 0,
 /* TimeStamp             */ TIMESTAMP '2011-12-29 13:30:46',
 /* NumOfAMPs             */ 160,
 /* OneAMPSampleEst       */ 5761783680,
 /* AllAMPSampleEst       */ 5759927040,
 /* RowCount              */ 5759985050,
 /* DelRowCount           */ 0,
 /* PhyRowCount           */ 5759927040,
 /* AvgRowsPerBlock       */ 81921.871617,
 /* AvgBlockSize          */ 65024.000000,
 /* BLCPctCompressed      */ 0.00,
 /* BLCBlkUcpuCost        */ 0.000000,
 /* BLCBlkURatio          */ 0.000000,
 /* RowSizeSampleEst      */ 148.000000,
 /* Reserved2             */ 0.000000,
 /* Reserved3             */ 0.000000,
 /* Reserved4             */ 0.000000
);

99 Comments
Teradata Employee

Srini,

The asterisk that you see in the first row of output for HELP STATS under the "Column Names" column indicates that you are viewing the row count of the table at the time the statistics were collected. The value in the "Unique Values" column of that first row is the number of rows in the table.

Thanks, -Carrie

Enthusiast

Thank you Carrie.

Enthusiast

Hi Carrie,

Starting with Teradata 12 or 13 (I cannot remember which), we collected stats on column PARTITION

on ALL our tables (partitioned tables and non-partitioned tables), as we were advised.

With TD 14.10 now, should we still keep that column PARTITION collection on non-partitioned tables?

Should we replace it with Summary stats?

Is the * we see when doing help stats the result of the summary stats?

Will collection on column PARTITION also update the Summary stats?

Thanks

Teradata Employee

Hi Nazy,

On 14.10 (and 14.0 as well) you no longer need to collect statistics on PARTITION, unless your table is a partitioned table.  For partitioned tables you should definitely collect on PARTITION.

You may not need to consciously replace PARTITION stats on non-partitioned tables with summary stats, as summary stats will automatically be collected at the time any column/index on the table has its stats collected.  Collection on PARTITION will update summary stats as well.  But if you have not collected any stats on the table for a while, you can explicitly collect just summary stats for the table by issuing the statement below. It's a very fast execution, and a good idea to use after loading activity on a table when you don't have time for full stats recollections.

COLLECT SUMMARY STATISTICS ON table-name;

The asterisk in HELP STATS tells you the number of rows in the table on which the statistics were taken.  It will change as summary stats are refreshed, even if other statistics on the table have not been refreshed.  So you could say that number is a result of summary stats.  That's the correct way to look at it.

Thanks, -Carrie

Enthusiast

Hi Carrie,

Thanks for nice explanation. 

I have query related to Summary statistics for the staging tables.

Following steps are performed before loading data into final Model tables.

Step 1:- Loading data into Staging table 

CREATE SET TABLE staging.ABC, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      RECORD_TYPE CHAR(2) CHARACTER SET LATIN NOT CASESPECIFIC,
      ACCOUNT_NO INTEGER NOT NULL,
      PARENT_ID INTEGER,
      CHILD_COUNT INTEGER,
      HIERARCHY_ID INTEGER
     )
PRIMARY INDEX ( ACCOUNT_NO );

Step 2:- Collecting statistics on staging table - staging.ABC

Step 3:- Rename staging.ABC to staging.ABC_XYZ; the renamed table staging.ABC_XYZ will be further used in batch processing and in transformation.

Step 4:- Re-create the staging table ipstaging.AR_A100 for the next run.

Questions:- 

1). Does Teradata recommend collecting stats after renaming the table to staging.ABC_XYZ? If the answer is yes, then I would like to know the reason.

2). Can renaming the table drop the random-AMP sample from the summary statistics history, since the table ID has been changed during the staging process? And might the optimizer extrapolate wrongly here, as the stats have not been refreshed after renaming the staging table staging.ABC_XYZ?

Thanks for your help in Advance and eagerly awaiting answer from you.

Regards

Pinal

Teradata Employee

Pinal,

Statistics rows in the new DBC.StatsTbl in 14.0 refer to table id not table name. So, there is no problem with renaming a table, the stats from the old name (collected and RAS) will still be usable.

See page 893 of the SQL Data Definition Language Detailed Topics manual for validation of this: 

Function of RENAME Table Requests

When you rename a table, Teradata Database only changes the table name. All statistics and

privileges belonging to the table remain with it under the new name.

It's easy to see for yourself that the stats are still there: create a table, add some rows, collect some stats, rename the table, and you will see the stats are associated with the renamed table. If they are outdated, it's always a good idea to recollect them, whether or not you have renamed the table. But the act of renaming does not in and of itself require a recollection.
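
A minimal sketch of that check (table and column names are made up):

CREATE TABLE test_t (c1 INTEGER, c2 INTEGER);
INSERT INTO test_t VALUES (1, 1);
COLLECT STATISTICS COLUMN (c1) ON test_t;
RENAME TABLE test_t TO test_t_renamed;
HELP STATISTICS test_t_renamed;  /* the c1 statistic still appears under the new name */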

Thanks, -Carrie

Teradata Employee

Srini,

My apologies, but I did not get notified of your comment.

The asterisk in the first line of the HELP STATS output represents the row count of the table.  It comes from the SUMMARY statistics that are collected for every table at the time any column or index stat is collected or recollected.  So you can think of it as representing the SUMMARY statistics for that table.  You will see it change as SUMMARY stats are refreshed.   It does not represent anything you have explicitly asked for.   But if you have even one stat collected on the table, you will see that row in HELP STATS.

Thanks, -Carrie

Teradata Employee

Ciao Carrie,

I have a very simple and quick question regarding STATS and CURRENT STATS.

I create a table T and define some stats without inserting any rows.

If I run

     HELP STATS ON T 

I get the right estimation: 0 rows for the summary ('*') row and for the column where I defined the stats.

If I run

     HELP CURRENT STATS ON T

I get 1 row on each stat.

Why ?

I'm asking this because, many times in a query with a lot of joins, the plan built by the optimizer can be very different with an estimation of 0 rows instead of 1 row!

Thanks & regards,

Pietro.

Teradata Employee

Hi Pietro,

Even when a table has no rows at the time you collect stats, the optimizer will always adjust that up to one row before it attempts to build the query plan.   The cost calculations used by the optimizer include many formulas.  Some of these formulas include division or multiplication by row count, so zero as a row count could be problematic.

You can see this adjustment in your HELP CURRENT STATS output, which displays the extrapolated row count. The optimizer always uses what it extrapolates.

To validate this, run an explain of a SELECT * on an empty table and you will see the plan specifying that one row will be returned.
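
A quick sketch of that validation (table and column names are illustrative):

CREATE TABLE t_empty (x1 INTEGER);
COLLECT STATISTICS COLUMN (x1) ON t_empty;
HELP CURRENT STATS ON t_empty;   /* extrapolated row count is shown as 1 */
EXPLAIN SELECT * FROM t_empty;   /* the plan estimates 1 row returned */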

Thanks, -Carrie

Enthusiast

Hi Carrie

I have SQL Assistant v15.00 and I cannot drop stats.

Can you help me with the syntax for dropping STAT?

CREATE MULTISET TABLE T1 (val INTEGER);
COLLECT STATISTICS T1 COLUMN val;
DROP STATISTICS T1;

then the line DROP STATISTICS T1; fails with the error "Query is invalid".

The code works OK when I use Assistant 13.11.

Peter Schwennesen

Teradata Employee

Hi Peter,

I tried your syntax on a pre-14.0 version of SQLA and it worked fine for me.  

Unfortunately, I'm not going to be able to help you with SQLA behavior or release changes.   You will need to find someone with client side background or experience with SQLA for that. 

I suggest that you post your question on the Teradata Forum, or see if you can find any blog postings here in Dev X related to client functionality.

Best regards, -Carrie

Enthusiast

I already did. Because there was new syntax, I thought that there was also a new form for DROP, but I now know that there is not, and that it is a bug in SQL Assistant version 15.00.00.00; it was fixed in version 15.00.00.03.

But this version is not to be found under downloads.

br

Peter Schwennesen

Enthusiast

Hi carrie,

In the 14.0 orange book, section 2.1.5 Column Ordering for Multicolumn Stats, it says that column ordering is honored, with the exception of statistics collected with an INDEX specification.

Would you please explain...  What order is used for indexes?

Teradata Employee

Nazy,

If the INDEX specification is included in the multicolumn collect stats statement, the columns will be reordered to match the index.  This is because the index columns are already ordered in a certain way, and stats that are being collected on it need to use column ordering that is the same as the index itself, or the statistics will be less useful for the index-oriented optimizations.  But if you collect using the COLUMN specification on the same columns, you can order them however you like.

Here is a simple example: 

Create table t1(x1 int, y1 int, d1 date);
Create index (x1, d1) on t1;

--COLLECT STATS with INDEX spec. Columns get reordered to match the index on these two columns.
collect stats index (d1, x1) on t1;

help stats t1;

 *** Help information returned. 2 rows.
 *** Total elapsed time was 1 second.

Date     Time     Unique Values        Column Names
-------- -------- -------------------- ------------
14/10/30 11:59:07                    1      *
14/10/30 11:59:07                    1      x1,d1

--Stats are collected with COLUMN spec. Column ordering is maintained.
collect stats column (d1, x1) on t1;

help stats t1;

 *** Help information returned. 2 rows.
 *** Total elapsed time was 1 second.

Date     Time     Unique Values        Column Names
-------- -------- -------------------- ------------------------------------
14/10/30 12:00:05                    1      *
14/10/30 12:00:05                    1      d1,x1

Even though you have asked for COLUMN statistics, if they are on the same columns as the index uses, but just in a different order, they will in many cases be used on behalf of the index even though they are not explicitly INDEX stats. But there are cases where the COLUMN-specified stats would not be useful when considering the index, such as if you have truncation of the combined values in the histogram and the optimizer needs the full values to determine whether or not to use an index.

Thanks, -Carrie

Enthusiast

Hi Carrie,

I have a MERGE query which runs 2-3 times a day on a table which is PPI. Suddenly yesterday the query ran for more than 5 hours, where the average execution time for the query is 15 minutes.

When I checked the latest EXPLAIN plan and the older EXPLAIN plan where the query completed in 15 minutes, I could see that the optimizer chose a different plan for each of the runs.

Could you please help me with this so that I can nail down this problem and avoid this situation in the future.

Thanks

Enthusiast

Thanks Carrie. That was very helpful...

I have another question.

If I understand correctly, if we drop stats on a table, we lose history.

If we drop a table, we will also lose history.

So, when we modify an existing table in a way that requires us to drop and recreate it (redo compression, redo partitioning, change the PI, etc.), for the most part the stats are to be the same as before; maybe we add a couple of column stats if columns were added, for example.

So, is there a way to still keep the history on the stats, because it's still the same table?

The orange book says to export stats by using show statistics values....   and then import.

How do you Import stats back to the new table?

Thanks

Nazy

Teradata Employee

Response to HateOra:

From a statistics point of view, make sure you have followed all the recommendations for PPI table statistics that are provided in this blog posting:

http://developer.teradata.com/blog/carrie/2013/02/statistics-collection-recommendations-teradata-14-...

Unfortunately, I am not in a position where I can debug your query, but if you post your question on Teradata Forum, and include the CREATE TABLE, query syntax, and the explains before and after, there might be someone there with strong query tuning experience that will look at your detail and come up with suggestions.  

A different plan may or may not be responsible for the slower execution time. Other things to check include:

- Did the query selection criteria require reading more data?

- Had the table and/or partition(s) increased in row count?

- Was the system heavily-loaded at the time the longer-running query executed?

- Were any system resources exhausted at the time of execution (such as AMP worker tasks or CPU)?

- Was the bad query executing at a lower priority?

- Was the query delayed by a throttle?

Thanks, -Carrie

Teradata Employee

Response to Nazy:

You can export stats, and then after you have inserted all rows into the new table, you can import them back. When you issue this command:

SHOW STATISTICS VALUES ON table-name;

You get COLLECT STATISTICS statements that include a VALUES clause, which causes all the values and histogram detail and history records that have been collected for each statistic in that table to be produced in the output.

When you submit those COLLECT STATISTICS commands in a script (they are just regular SQL) after the new table has been created (assuming the new table is named the same as the table you exported from, or that you have modified the table name in the COLLECT STATISTICS commands), it will copy in that histogram and history detail just as if you had manually collected the statistics one by one yourself.

Using the export (which creates the SQL to repopulate statistics after the table has been recreated) followed by the import (which can be submitted as regular SQL), you will preserve the histograms and the history records.
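
A sketch of the workflow, assuming the recreated table is named Orders_new:

SHOW STATISTICS VALUES ON Orders;
/* Save the generated COLLECT STATISTICS ... VALUES statements to a file,
   change "Orders" to "Orders_new" in each statement, then submit the file
   as ordinary SQL to re-create the histograms and history records. */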

Thanks, -Carrie

Enthusiast

Hi Carrie,

I am not sure I quite understand the SysSampleOption in DBSControl.

When it's set to 0 or 1 (the default), does it mean all stats collections are sampled, with the optimizer deciding on the percent? Or does it mean that only when we specify a sample percent will all recollections use that percent without our having to specify it?

Enthusiast

Asking because I see this:

show statistics column (orig_sys_id) on NA_EDW.WHSE_INV_LOC

COLLECT STATISTICS
                   -- default SYSTEM SAMPLE PERCENT
                   -- default SYSTEM THRESHOLD PERCENT
            COLUMN ( ORIG_SYS_ID )
                ON NA_EDW.WHSE_INV_LOC ;

Teradata Employee

Hi Nazy,

The SysSampleOption enables or disables system-determined sample percent globally. When it is enabled, the system decides whether or not to use sampling, and at what percent.

When an explicit sample percent is specified at the COLLECT STATS statement level, it overrides the system-determined sampling decisions and uses the specified sample percent for all recollections.
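
For example, an explicit specification like this (the 5 percent figure is illustrative) would override system-determined sampling for this statistic's recollections:

COLLECT STATISTICS 
        USING SAMPLE 5 PERCENT
        COLUMN ( ORIG_SYS_ID )
ON NA_EDW.WHSE_INV_LOC;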

In your second comment, those clauses (default SYSTEM SAMPLE PERCENT, for example) indicate that global defaults have been set for SYSTEM SAMPLE and SYSTEM THRESHOLD in DBSControl. What you show above in the SHOW STATS output indicates that this collect statistics statement is eligible for a system-determined sample and also system-determined thresholds.

Thanks, -Carrie

Not applicable

Hi Carrie,

Thanks for introducing me to the world of collect stats in TD 14.10.

With the TD 14.1 recommendations on collect stats, I could see savings of up to 33% in TotalIOCount, 47% in AMPCPUTime, 2% in SpoolUsage, and 38% in ReqIOKB. However, ReqPhysIO and ReqPhysIOKB showed drops in performance of up to -790% and -977% respectively.

Our box is an EDW 67xx series. Is this acceptable performance for this configuration? Can I recommend my team use the new way of collecting stats (by bundling the columns in a single collect)? Please advise.

Teradata Employee

Sudhakar,

Different statistics collections on different tables will behave differently when you move to either 14.0 or 14.10 from an earlier release.    I am glad to hear that you are seeing resource savings in some areas.  Under varying circumstances it is possible to experience greater resource requirements for certain stats collections on the newer releases, but I would expect this to be counterbalanced by savings elsewhere, with other stats.

It is advisable to recollect at the table level so the optimizer can attempt to bundle the stats collections for the entire table.

If you detect any resource usage increase with stats collections that is greater than 10%, I'd suggest you open an incident with the support center.

Thanks, -Carrie

I read the below on Developer Exchange earlier (for 13.10), and am wondering whether these guidelines are still good with the 14.10 version too, or has anything changed?

Guidelines for Stats Collection                                                                                                                                        

For a partitioned table, it is recommended that you always collect statistics on:

·         PARTITION: This tells the optimizer how many row partitions are empty, a histogram of how many rows are in each row partition, and the compression ratio for column partitions. This statistic is used for optimizer costing.

·         Any partitioning columns. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria

For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):

·   (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.

·   (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.

Teradata Employee

There is a posting here on Dev X that addresses 14.10 statistics recommendations. Information contained there should answer your question:

http://developer.teradata.com/blog/carrie/2014/09/statistics-collection-recommendations-teradata-dat...

Thanks, -Carrie

Enthusiast

Hello,

I just heard from someone that in TD 14.10 stats can be collected on only the NEW partition of a table.

And they are not talking about collecting stats on column PARTITION.

Can we collect stats on a specific partition of a table?

Teradata Employee

Hi Nazy,

Statistics are always collected on the entire table.  There is no capability in 14.10 to just collect stats on a single partition.  That capability may be added at some future time, but I cannot tell you when that might be.

Thanks, -Carrie

Teradata Employee

Hi Carrie,

I have some doubts. Please help:

1. If a table is being accessed only by its PI, do we still need to collect stats on the table?

2. If a table is well distributed (zero skewness) and is being accessed by PI or non-PI columns, do we need to collect stats on this table? Will random-AMP sampling help in this case the way stats do?

3. Is the process by which the optimizer searches for statistics in Teradata 14/14.10 the same as in Teradata 12? If anything differs, please help on this.

Teradata Employee

Hi Sandeep,

Here are responses to your questions:

1.  Whether or not you will need to collect stats depends on whether it is a UPI or a NUPI.  For example, a UPI does not generally require statistics, unless it is used in range constraints by the queries.  There are recommendations on stats collection for both UPIs and NUPIs on the Dev X blog posting titled: 

Statistics Collection Recommendations – Teradata Database 14.10

2.  Random AMP sampling only delivers an estimate on the table's row count and is not a substitute for stats collected on a column or an index that belongs to that table.   The same blog posting mentioned above has guidelines for when column-level statistics are recommended.   A column may have an uneven distribution of values across the rows in a table, even if the table's rows themselves are spread evenly across AMPs.   Zero skewness at the table level is not the same thing as zero skewness for a statistic's values.

3.  The process the optimizer uses to search for statistics is the same today as it was in 12.0. However, there are many new features added after 12.0 which may look for more stats or different kinds of stats.

Thanks, -Carrie

Not applicable

I believe I was told that I should collect statistics on both join columns and predicate columns. But I am not sure if I should collect statistics on the combined columns, or collect one statistic for the join columns and another for the predicate column. Also, does column order matter? For the example below, which COLLECT STATISTICS statement should I use?

SELECT     a.C1, b.C2
FROM       tableA a
INNER JOIN tableB b ON a.JC1 = b.JC1 AND a.JC2 = b.JC2
WHERE      a.C3 BETWEEN 1 AND 100;

(1) COLLECT STATISTICS COLUMN(JC1,JC2), COLUMN(C3) ON tableA;

(2) COLLECT STATISTICS COLUMN(JC1,JC2,C3) ON tableA;

(3) COLLECT STATISTICS COLUMN(C3,JC1,JC2) ON tableA;

Teradata Employee

The following recommendations for multicolumns statistics are made in my blog posting on statistics collection recommendations for 14.10:

Collect Multicolumn Statistics:

• Groups of columns that often appear together with equality predicates. These statistics are used for single-table estimates.

• Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more that the combination of actual values are correlated, the greater the value of collecting multicolumn statistics is in this situation.

• Specify a name for such statistics, for ease of recollection, viewing, and/or dropping.

In the absence of column correlation information, the optimizer assumes that columns in a multicolumn statistic are completely independent.  The optimizer assumes that the number of unique values of the join columns doesn’t get reduced after applying the single table predicates.  Further, it assumes that the single table predicates remove the rows evenly from every value group of the join column until there is one row per value. 

If the single-table predicate columns and join columns have a high degree of correlation, and the join selectivity estimate by itself is not accurate, then you want to collect base-table multicolumn stats with the single-table predicate columns as leading columns, followed by the join columns.
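
Using the columns from your example, that guideline suggests something like option (3), here sketched with an added statistics name for easier recollection and dropping (the name itself is arbitrary):

COLLECT STATISTICS 
        COLUMN ( C3, JC1, JC2 ) AS stats_C3_JC1_JC2
ON tableA;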

Thanks, -Carrie

Enthusiast

Hi Carrie, I believe summary statistics are not allowed on volatile tables, but they are still recommended when you do HELP STATS ON; collecting them throws the error "COLLECT failed. 3706: summary option is not allowed on volatile tables". I am on TD 14.

Does it do any good if stats are collected on volatile tables, though the engine recommends and allows it?

Teradata Employee

Statistics on volatile tables are not saved in DBC.StatsTbl the way other statistics on base tables are. Volatile table stats are saved in memory for the given session.

Currently, there is no slot in memory for SUMMARY stats. By nature, volatile tables are specific to a session and get deleted when the session is logged off. So, the assumption made by developers is that volatile table statistics don’t need any history records, extrapolations, etc. Hence the restriction.

In fact, the optimizer always does an all-AMP random AMP sampling for volatile tables.  This makes explicitly collecting SUMMARY stats less important.

Thanks, -Carrie

Not applicable

Hi Carrie,

I have a question.

Creating new stats on a modified table versus copying the stats from the existing table: will it make any difference from a time perspective?

Here the new stats which I am creating are exactly the same as on the old table.

Actually, I prefer to create new stats when moving any modified table to a new database, but if copying the stats from the existing table takes less time than collecting them, then it's better for me to copy. Can you please clarify my doubt on this?

Teradata Employee

If you wish to copy the statistics that already exist on table_A to a new table that is identical to table_A, call it table_B, the fastest and easiest way to do this is to export the stats from table_A and import them to table_B. There is no collection overhead when you do this, just the statistics histograms being copied from table_A and associated with table_B.

Before you do this, you need to be sure that the tables are identical, with the same number of rows and values, etc. 

Issue this command:   SHOW STATISTICS VALUES ON table_A;

Save the output to a text file, change the name "table_A" to "table_B" in all of the collect stats statements in the output, then submit the SQL statements.    This will copy all the histograms and their values onto table_B.

Try it on a small test table first to get familiar with how it works.

Thanks, -Carrie

Enthusiast

Carrie

Just wanted to check whether there is a need to use the keyword TEMPORARY when collecting stats on GLOBAL TEMPORARY tables from V14.0 onwards?

COLLECT STATISTICS
        INDEX ( COLUMN1, COLUMN2 )
        ,COLUMN (COLUMN3)
     ON $DATABASEVARIABLE.GT_TABLENAME;

Thanks

Teradata Employee

If you want to collect statistics on the instance of a global temp table within your session, you would need to include the word TEMPORARY.  If you do not include TEMPORARY, then stats will be collected on the GTT definition that is in the data dictionary (which has no rows).  That will have no impact on your session's copy of the GTT.   If you collect on several columns on the GTT definition by omitting the word TEMPORARY, then later, when you have a session with an instance of that GTT you can collect stats by the GTT table name (using TEMPORARY) and all of the stats associated with the GTT definition will be included in the table level collection for that instance.

So if you only want to collect stats on your session's instance of the GTT, include the word TEMPORARY.   But if you want to set up a template for stats collection on multiple columns of the GTT to be available for all instances of this GTT, then omit the word TEMPORARY.  After collecting the column stats against the dictionary definition, then when a session is using an instance of the GTT you can collect stats (with TEMPORARY) at the table level and all those predefined stats will be collected for your session.
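
A sketch of both variations, using the names from your example:

/* Template stats on the dictionary definition -- no TEMPORARY keyword: */
COLLECT STATISTICS COLUMN (COLUMN3) ON $DATABASEVARIABLE.GT_TABLENAME;

/* Stats on this session's instance, picking up all predefined template stats: */
COLLECT STATISTICS ON TEMPORARY $DATABASEVARIABLE.GT_TABLENAME;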

Thanks, -Carrie

Enthusiast

Thanks Carrie, very much appreciated

Enthusiast

Carrie, one more question - would the same rules apply to volatile tables, and would the syntax be:

COLLECT STATISTICS
        INDEX ( COLUMN1, COLUMN2 )
        ,COLUMN (COLUMN3)
     ON $DATABASEVARIABLE.VT_TABLENAME;

Thanks

Teradata Employee

The rules that apply to a global temp table in terms of stats collection do not apply to volatile tables. Global temp tables have a definition in the dictionary, but volatile tables do not. Volatile tables only exist within one session; they do not have a global version. Stats for VTTs are held only in the memory of the session, not in a DBC table.

Your syntax below should work.   Stats collection success on a VTT can be validated by running an explain that uses the column that has stats for selection and checking the confidence level and the estimated row counts.  

Thanks, -Carrie

Not applicable

Carrie,

When collecting stats in the following format, it was my understanding that the table is read once, similar to what I have been told about collecting at the table level.

COLLECT STATS
COLUMN(COL_A),
COLUMN(COL_B),
COLUMN(COL_C),
COLUMN(PARTITION)
ON <databasename>.<tablename>;

However, when reading the explain, it appears that it is doing steps similar to the following for each individual column:

We do a SUM step to aggregate from <database>.table by way of an all-rows scan.

Aggregate Intermediate Results are computed globally, then placed in Spool 35472.

Then we save the UPDATED STATISTICS from Spool 35472 (Last Use) into Spool 35474, which is built locally on a single AMP derived

from the hash of the table id

Questions:

1.  Is my understanding incorrect? 

2.  Will Teradata only do one pass when collecting at the table level only?

Thanks,

Mark

Teradata Employee

Hi Mark,

The optimizer will ATTEMPT to collect multiple stats in one pass of the table when you use the coding you show, but often it decides there is no cost savings in doing so and will revert to individual table scans/aggregations. The cases where it tends to bundle up multiple collections in a single table scan are when there are collections on multicolumn stats as well as collections on the component columns that make up the multicolumn stat. The other case where the optimizer frequently bundles up collections is when the columns have a low number of distinct values.

The optimizer has to be able to re-use the spool file it creates in the single pass to make the bundling a viable choice.  For that reason, columns with higher numbers of distinct values that have no relationship one to another almost always involve their own full table scan.  

It works the same whether you are using the syntax you show or recollecting at the table level. If the optimizer can detect a benefit in bundling multiple collect statements within a table, or a subset of them, it will. Otherwise it won't.

Thanks, -Carrie

Enthusiast

Hi carrie,

Is there a benefit in collecting stats on the ROWID column in a join index?

For the first time today, we saw a recommendation from Stats Manager to collect stats on the ROWID column of a join index. We have never collected stats on ROWID before.

The stats I collected on ROWID have the same number of unique values as the summary stats.

Teradata Employee

Hi Nazy,

I can't think of any reason there would be value in collecting stats on ROWID. RowIDs are often used in single-table join indexes as a means for the database code to get from the join index row directly to the unique base table row, such as:

CREATE JOIN INDEX PriceJI AS
SELECT o_price, o_orderkey, ordertbl.ROWID
FROM ordertbl
PRIMARY INDEX ( o_price );

Statistics will not add any performance advantage for a query that uses the above join index or one similar to it. There might be some relevance to stats collection if the rowID value were passed in the query, but that is not something we usually recommend doing. And even if that happened, since a rowID directly accesses a single row, similar to a unique primary index, collected statistics are never required, because they would not provide any useful information to the optimizer.

Thanks, -Carrie 

Enthusiast

Thanks for the clarification on ROWID...

I have another question. In the original blog you have the explanation below. Could you please expand on this? I am not sure I quite understand where that 32 bytes comes from. In order for the optimizer to use the first 32 bytes of each column in the multicolumn stats, do we need to have the USING MAXVALUELENGTH be > 32 for as many columns as are used in the multicolumn stats?

ex: tableA has stats on 4 columns (I am exaggerating these column lengths so I can make my question more clear - I hope!):

A is 10 bytes

B is 35 bytes

C is 42 bytes

D is 50 bytes

collect stats COLUMN (A,B,C,D) on TableA;     do we get the real number of unique values here, since B is > 25 and therefore the whole length is > 25?

collect stats USING MAXVALUELENGTH 70 COLUMN (A,B,C,D) on TableA;  does this give the correct number, since we only have 70 characters (instead of 137)?

collect statistics USING MAXVALUELENGTH 140 COLUMN (A,B,C,D) on TableA;  This should work since we cover the length... But does it only look at the first 32 bytes of columns B, C & D?

"..... In addition, during the aggregation process that builds up the statistical histogram, each column within the statistic will be able to have its first 32 bytes represented when determining the number of distinct values, ensuring that a significant number of bytes in each column can contribute to understanding how distinct the combination of column values is ....."

Teradata Employee

Hi Nazy,

The 32 bytes I mentioned are not directly related to MAXVALUELENGTH. MAXVALUELENGTH only applies to the number of characters that are stored in the histogram interval value fields at the time those values are moved into the histogram intervals.

The 32 bytes are used during the aggregation process to determine the number of distinct values and the interval buckets. The stats aggregation process uses the first 32 bytes of each column in the multicolumn stat when it performs its aggregation. This happens automatically, and you don't need to set anything up to make it happen. Also, you cannot increase or decrease that value.

After the aggregation process is complete, MAXVALUELENGTH is used to determine how many characters should be moved into the value fields in the histogram intervals. The default is 25.

So you do not need to change MAXVALUELENGTH to be 32 bytes for each column included.   

For your four columns (A, B, C, D), the number of unique values for the entire statistic will be calculated by taking the first 32 bytes from the beginning of each column and combining them. Depending on your demographics, the number of unique values that results should be pretty close to accurate.

With no MAXVALUELENGTH specified, the histogram value columns will contain all of column A (10 bytes) plus part of column B (the first 15 bytes) and nothing from column C or D. The greater you make MAXVALUELENGTH, the more complete the histogram interval information will be. And as you say, using MAXVALUELENGTH 140 will cover all four columns completely.

However, increasing MAXVALUELENGTH too greatly will tend to increase the size of the histogram. Usually the optimizer will reduce the number of intervals or the number of biased values to compensate, so the entire histogram doesn't grow too large. So it's best to only use a MAXVALUELENGTH up to the size that you actually need for differentiation.

Thanks, -Carrie

Enthusiast
Hi Carrie, thanks for the explanation...

We sometimes have tables with several multicolumn stats having the same number of unique values.

In this example:

collect stats column (A,B)     on tableA;
collect stats column (A,B,C)   on tableA;
collect stats column (A,B,C,D) on tableA;

are collected, because any of those combinations could be used in a join or filter...

All three collections give the same number of unique values. Even when I increase the MAXVALUELENGTH to the max length of 140, I still see the same number of unique values for all 3, and columns C & D do not have constant values... So that's kind of strange to me...

should we still collect all 3 sets?

Teradata Employee

Hi Nazy,

If you don't collect stats for each separate combination, then there will be no histogram where the optimizer can see the number of uniques and the interval values. So if you are using those different combinations of columns for different joins, it would be advisable to collect stats on them.

However, since there are overlaps in the columns represented, the optimizer should choose to scan the table one time for the collection of all three stats. It will first collect stats on the broadest set (A,B,C,D) and use the spool from that collection for the (A,B,C) and (A,B) stats. But for that to happen you will need to bundle all three statistics in the same COLLECT STATISTICS statement.
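
Using your three statistics, the bundled request would look like this (mirroring the grouped syntax from the original post):

COLLECT STATISTICS 
  COLUMN (A,B,C,D)
, COLUMN (A,B,C)
, COLUMN (A,B)
ON tableA;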

Thanks, -Carrie

Enthusiast

Hi Carrie, 

 

I have a question about the new functionality in TD 14, specifically MAXVALUELENGTH and the rollup & pre-aggregation of stats. But first of all I wanted to thank you for all the detailed information you have been providing through your blog.

You mentioned 

"The number of distinct values in a statistic is accurately determined at the time of the collection and aggregation process.  The full length of the field is used during that process, so it can know about all the distinct values in their entirety.  It is only after the process is complete, and the number of distinct values is determined that the values will be moved into the histogram.  It is the histogram value fields that cause truncation, but that will not impact the distinct number of values that have already been calculated.   The problem around this truncation within the histogram is that it may impact single table selection row estimates, as the optimizer may not be able to see which values are actually being represented in which field or which interval, if they have been truncated too severely."

 

Maxvaluelength

From my understanding of your post, collecting stats on a single column or a set of columns whose values extend well beyond 25 bytes will still show the right number of unique values; it's just that the single-table estimates may not be accurate, as the histogram data truncates the values which exceed 25 bytes.

1) In our environment, we have a lot of multicolumn stats without MAXVALUELENGTH (sometimes up to 30 columns together), and these stats are being used by the optimizer. How would I know if there is really a benefit from these multicolumn stats, or if they have been providing wrong estimates all the while because of the truncation?

2) And how do I determine in which scenarios MAXVALUELENGTH should be used and where it should be excluded? Or should I be using MAXVALUELENGTH for all the stats which exceed 25 bytes? I read somewhere that query optimization time can increase due to the increased histogram size, and that MAXVALUELENGTH should be used selectively.

 

Rollup & Pre aggregation of stats 

The Orange book on Stats enhancements on TD14 mentions that "When statistics are requested on both a single-column and a multicolumn on the same set of columns, the aggregation result used to produce multicolumn statistics is reused and rolled up to produce the single-column statistics"

 

Is this only applicable when single-column stats are part of the multicolumn stats, for example

Collect stats Column ( A,B ) , Column ( A ) , Column ( B ), or is it also applicable for multicolumn stats, such as

Collect stats Column ( A,B,C ) , Column ( A,B )  or  Collect stats Column ( A,B,C,D ) , Column ( B,C,D )?

 

Also, I was not clearly able to understand the difference between rollup & pre-aggregation - are they one and the same?

 

Thanks,

Karthik