Easy Statistics Recommendations - Statistics Wizard Feature

Tools
Tools covers the tools and utilities you use to work with Teradata and its supporting ecosystem. You'll find information on everything from the Teradata Eclipse plug-in to load/extract tools.
Teradata Employee

Easy Statistics Recommendations - Statistics Wizard Feature

The principle focus of query tuning is to provide reliable summary information about the data to the optimizer. This is done by collecting accurate statistics which are then stored in a synoptic data structure known as an interval histogram. The correct choice of the column and index sets on which Statistics should be collected can help the optimizer generate better query plans, dramatically improving query performance and reduce the collection overhead. It can be difficult to understand how the optimizer uses statistics as well as deciding what statistics are needed without an automated method to help. That automated method is the Teradata Statistics Wizard, which is a client-based GUI interface for obtaining statistics recommendations for particular queries or query workloads that are submitted for analysis.

Easy Feature of the Statistics Wizard

Statistical information availability is critical for optimal query plans, but Statistics collection is a time and resource consuming job. Appropriate and up-to-date Statistics can improve the performance of a query. As the Teradata Database becomes more powerful, applications and the queries they perform have also become more complex and sophisticated. The job to decide upon adequate and appropriate Statistics collection, re-collection of Statistics, and the correct balance between good query plans and the time required to collect Statistics on complex databases has become very complicated.

The Easy Feature of the Statistics Wizard simplifies this by automating the process of collection/re-collection of Statistics for particular database/tables by applying the best practices based upon the data demographics information available in the database system.

Approaches to determine the recommendations:

There are two main approaches in using the Easy Feature for Statistics recommendations:

1) Full Statistics Recommendations are based on the following attributes:

  • Age of last collection in days
  • Table growth
  • Table skew
  • General statistics rules
  • Search for new tables/ indexes

2) Sampled Statistics Recommendations are based on the following attributes:

  • Tables with greater than 1000 rows/AMP
  • Unique index columns
  • Highly unique indexes

Purposes of Statistics:

The following are the most important reasons for column and index statistics.

  • The optimizer uses statistics to decide whether it should generate a query plan that uses a secondary, hash, or join index instead of performing a full-table scan.
  • The optimizer uses statistics to estimate the cardinalities of intermediate spool files based on the qualifying conditions specified by a query.
  • The estimated cardinality of intermediate results is critical for the determination of both optimal join orders for tables and the kind of join method that should be used to make those joins.
  • For PPI tables, statistics collected on the PARTITION system-derived column permit the optimizer to better estimate costs. The system also uses PARTITION Statistics for estimates when predicates are based on the PARTITION column.

Sampled Statistics:

In many cases, sampled statistics are not accurate enough for the optimizer to generate an optimal or even a good join plan. However, it is sometimes true that statistics collected by sampling small subpopulations of table rows can be as good as those collected with a full-table scan. The value of collecting full-table statistics is that they provide the optimizer with the most accurate information that can be gathered for making the best possible query plan cost estimates.

Statistical accuracy is fundamentally important for any query plan because the effect of suboptimal access and join plans generated from inaccurate statistics, of which there can be many in the optimization of a complex query, is significant.

Statistics and optimizer estimates also play a critical role in Teradata Active System Management (TASM) configuration and strategy.

Comparison of Full Statistics Vs. Sample Statistics:

Statistics Method Characteristics Best Use
Full Statistics
  • Collects all Statistics for the data.
  • Is often time consuming.
  • Most accurate method of collecting Statistics.
  • Stored in interval histograms in the data dictionary
  • Best choice for columns or indexes with moderate or highly skewed data values.
  • Recommended for small tables (tables with fewer than 1,000 rows per AMP).
  • Recommended for selection columns having a moderate to low number of distinct values.
  • Recommended for most NUSIs and other selection columns.
  • Collection time on NUSIs is very fast.
  • Recommended for all column sets or index column sets where full Statistics add value, and where sampling does not provide satisfactory statistical estimates
Sampled Statistics
  • Collects all Statistics for the data, but limits the number of rows accessed in the table.
  • Significantly faster collection time.
  • Stored in interval histograms in the data dictionary.
  • Recommended for unique columns (or near-unique), unique indexes, and for columns or indexes that are highly singular.
  • May be preferable to no statistics for very large tables (tables with tens of billions of rows) where statistics collection is prohibitive.
  • Not recommended for small tables

Full Statistics Recommendations:

If the ‟Full Statistics Recommendations‟ radio button is selected, the Statistics Wizard gives Full Statistics Recommendations based on the following attributes for the selected objects:

  • Age of Last Collection in days
  • Table Growth in %
  • Table Skew

     All NUPIs

     All NUSIs

  • General Statistics Rules

     All Indexes for Join Index Table

     All NUSIs

     All VOSI

     All Partitioned Tables

     Search for New Tables/ Indexes

     Primary Index/Key

     Foreign Key

Example: lineitem table definition

CREATE SET TABLE lineitem, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
L_ORDERKEY INTEGER NOT NULL,
L_PARTKEY INTEGER NOT NULL,
L_QUANTITY FLOAT,
L_SHIPMODE CHAR(10)CHARACTER SET LATIN CASESPECIFIC NOT NULL DEFAULT 'IIA4 ',
L_SHIPINSTRUCT VARCHAR(45) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT 'IIA5',
L_SUPPKEY INTEGER,
L_LINENUMBER INTEGER,
L_EXTENDEDPRICE FLOAT,
L_LINESTATUS CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT 'D')
UNIQUE PRIMARY INDEX (L_ORDERKEY)
INDEX (L_QUANTITY)
INDEX ALL (L_PARTKEY, L_QUANTITY, L_SHIPMODE) ORDER BY VALUES (L_PARTKEY)
INDEX (L_SHIPINSTRUCT, L_SUPPKEY, L_LINENUMBER, L_EXTENDEDPRICE, L_LINESTATUS) ORDER BY VALUES (L_SUPPKEY)
INDEX (L_PARTKEY, L_QUANTITY, L_LINENUMBER) ORDER BY VALUES (L_LINENUMBER)
INDEX (L_LINENUMBER, L_EXTENDEDPRICE, L_LINESTATUS)
UNIQUE INDEX (L_PARTKEY);

Age of Last Collection in days:

Recollection of statistics for all columns is recommended if the days elapsed since the last collection exceeds a user-specified threshold.

Example: The last collection on the lineitem table exceeds the user-specified threshold (10 days from date of 05/19/2008).

HELP STATISTICS lineitem;
*** Help information returned. 3 rows.
*** Total elapsed time was 1 second.

Date Time Unique
Values Column Names
-------- -------- ------- --------------------------------------
03/16/2008 16:26:44 1,280 L_PARTKEY , L_QUANTITY, L_SHIPMODE
03/16/2008 16:26:44 1,280 L_PARTKEY , L_QUANTITY, L_LINENUMBER
05/19/2008 10:37:15 1,280 L_SHIPINSTRUCT,L_SUPPKEY,L_LINENUMBER,
L_EXTENDEDPRICE , L_LINESTATUS

Statistics were previously collected on (L_PARTKEY, L_QUANTITY, L_SHIPMODE) and (L_PARTKEY , L_QUANTITY, L_LINENUMBER) columns of lineitem table on 03/16/2008. The user selected 10 days as the recollection option and the last collected statistics date exceeded 10 days (from 05/19/2008). Teradata Statistics Wizard will recommend recollection of statistics on columns referred to in the first two rows. No recommendations are provided for last row because statistics were collected on same day (05/19/2008).

Statistics recollection recommendations are based on the previous collected Statistics method. (If originally Sample Statistics were collected on tables then Teradata Statistics Wizard will recommend recollection of Sample Statistics).

Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_SHIPMODE);
Collect Statistics on lineitem Column (L_PARTKEY, L_QUANTITY, L_LINENUMBER);

Table Growth:

Table growth recommends re-collection for all columns that have a change in row count exceeding a user-specified threshold since statistics were last collected. If the total rows count value from the existing statistics is less or greater than the user-specified threshold value then re-collection statistics recommendations are generated.

Example: 128 rows have been inserted to table lineitem since the last statistics colletion. (last row count 1280 + new inserted rows 128 = 1408) (refer to Help Statistics output in previous section).

Select count (*) from lineitem;
*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.

Count (*)
--------------
1408

Teradata Statistics Wizard provides the following statistics recollection recommendations. The row count of lineitem table has exceeded the user-specified threshold (10%).

Table growth recommendations (if full statistics were collected previously):

Collect Statistics on lineitem index (L_PARTKEY, L_QUANTITY, L_SHIPMODE);
Collect Statistics on lineitem index (L_SHIPINSTRUCT,L_SUPPKEY,L_LINENUMBER,
L_EXTENDEDPRICE, L_LINESTATUS);
Collect Statistics on lineitem index (L_PARTKEY, L_QUANTITY, L_LINENUMBER);

Statistics on Skewed Data:

When you collect statistics for skewed data, the optimizer can accommodate exceptional values.

Statistics reveal values that include the most frequent value in the table and the most frequent value per value range. The system divides the table into 100 groupings and maintains statistics on the most non-unique value/range.

Without collected statistics, the system derives row counts from a random AMP sample for:

  • Small tables (less than 1000 rows per AMP).
  • Unevenly distributed (skewed row distribution due to PI) tables.

Small tables often distribute unevenly. If rows in a table are not distributed evenly across all AMPs, random AMP samples may not represent the true total number of rows in the table.

Table Skew – All Non-Unique Primary Indexes:

Full Statistics availability is needed for highly skewed data for the best plan generation. The optimizer needs to be fully aware of such skew. Teradata Statistics Wizard recommends statistics collection on all non-unique primary indexes for tables that have table skew that exceed a user-configured threshold.

SQL to calculate table skew on all NUPI:

SELECT di.databasename, di.tablename, di.indexname, 
di.indextype, di.uniqueflag,
SUM(cast(t.currentperm as decimal(18,0))/(1024*1024)) perm_mb,
100*((cast(max(t.currentperm ) as decimal(18,3))
- cast(ave(t.currentperm ) as decimal(18,3))))
/(nullif(cast(max(t.currentperm) as decimal(18,3)),0)) as tableskew
FROM dbc.indices di, dbc.tablesize t WHERE di.databasename in ('<DBName')
AND di.tablename in ('<TableName>')
AND di.indextype = 'P'
AND di.uniqueflag = 'N'
AND di.databasename = t.databasename
AND di.tablename = t.tablename
GROUP BY di.databasename, di.tablename, di.indextype,
di.indexname, di.uniqueflag
ORDER BY di.databasename, di.tablename, di.indextype, di.indexname;

Table Skew – All Non-Unique Primary Index recommendations:

Collect Statistics on NUPITABLE index (NUPI_Col1); 

Table Skew – All Non-Unique Secondary Indexes:

Teradata Statistics Wizard recommends statistics collection on all non-unique secondary indexes for tables that have index columns skews that exceed a user-configured threshold.

SQL to calculate table skew on all NUSI:

SELECT di.databasename, di.tablename, di.indexname, 
di.indextype, di.uniqueflag,
SUM(cast(t.currentperm as decimal(18,0))/(1024*1024)) perm_mb,
100*((cast(max(t.currentperm ) as decimal(18,3))
- cast(ave(t.currentperm ) as decimal(18,3))))
/(nullif(cast(max(t.currentperm) as decimal(18,3)),0)) as tableskew
FROM dbc.indices di, dbc.tablesize t WHERE di.databasename in ('<DBName')
AND di.tablename in ('<TableName>')
AND di.indextype = 'S'
AND di.uniqueflag = 'N'
AND di.databasename = t.databasename
AND di.tablename = t.tablename
GROUP BY di.databasename, di.tablename, di.indextype,
di.indexname, di.uniqueflag
ORDER BY di.databasename, di.tablename, di.indextype, di.indexname;

Table Skew – All Non-Unique Secondary Index recommendations:

Collect Statistics on lineitem index (L_LINENUMBER, L_EXTENDEDPRICE ,L_LINESTATUS);
Collect Statistics on lineitem index (L_QUANTITY);

General Statistics Rules - All Indexes for Join Index:

The COLLECT STATISTICS statement collects demographics, computes Statistics from them, and writes the resulting data into individual entries for each individual base table and join index table on the system.

When join indexes are used for estimates, you might not need to collect statistics on the base table columns that are covered by the join index with some exceptions. For example, statistics are still required on base table index columns that have single-table predicates in order to do access path analysis. Similarly, statistics might be required if the covered columns are used as join columns or grouping columns in a GROUP BY clause.

Consider the following query and join index definition:

SELECT *
FROM t1, t2
WHERE nusi_col = 10
AND a1 > 20
AND t1.a1 = t2.a2;
CREATE JOIN INDEX ji AS
SELECT a1, b1
FROM t1
WHERE nusi_col = 10
AND a1 > 20;

In this case, the join index cardinality provides the final selectivity for t1; However, statistics on column a1 might still be required to make join cardinality, rows per value, and other cardinality estimates. Ideally, the statistics on a1 from the join index are more accurate than the base table statistics. The optimizer first attempts to locate the join column a1 in the join index and then attempts to use its statistics. If this column is not projected, or if the statistics are not available, then the optimizer looks for statistics on the base table.

When usable single-table join indexes are found for single-table cardinality estimations, and the columns referenced in the join predicates or GROUP BY clause are projected from the join index, then the Statistics Wizard recommends statistics on the join index columns if they are not available. For the previous example, the Statistics Wizard recommends collecting statistics on ji.a1 since the column a1 is projected in the join index. If this column is not projected, then the Statistics Wizard recommends collecting statistics on the base table columns t1.a1

Column statistics for join indexes and their underlying base tables are not interchangeable. You need to submit separate COLLECT STATISTICS statements for the columns in the join index and the source columns in the base tables. Join index tables and data tables are seen as separate entities to the optimizer. 

General Statistics Rules - All Non-Unique Secondary Indexes (ALL Option):

Collecting STATISTICS on NUSIs makes it possible for the optimizer to decide when to use the NUSI for access and when to do a full table scan.

Teradata Statistics Wizard recommends collect statistics on all non-unique secondary indexes due to the data skew possibilities described previously.

General - All Non-Unique Secondary Index recommendations:

Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_SHIPMODE);

General Statistics Rules - All VOSI (Value ordered NUSI):

Value-ordered NUSIs are useful for processing range conditions and conditions with an inequality on the secondary index column set.

By sorting the NUSI rows by data value, it is possible to search only a portion of the index subtable for a given range of key values. The major advantage of a value-ordered NUSI is in the performance of range queries.

The following example illustrates a value-ordered NUSI (defined by an ORDER BY clause that specifies the VALUES keyword option) and a query that would probably be solved more efficiently if the specified value-ordered NUSI were selected by the optimizer to access the requested rows. 

CREATE INDEX Idx_Date (o_orderdate)
ORDER BY VALUES (o_orderdate)
ON Orders;
SELECT *
FROM Orders
WHERE o_orderdate
BETWEEN ‘1997-10-01’
AND ‘1997-10-07’;

Teradata Statistics Wizard recommends statistics on all VOSIs.

General - All Value Ordered NUSI recommendations

Collect Statistics on lineitem Index (L_SHIPINSTRUCT, L_SUPPKEY, L_LINENUMBER,
L_EXTENDEDPRICE , L_LINESTATUS);
Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_LINENUMBER);

General Statistics Rules - All Partitioned Tables:

If there are any predicates on the partitioning columns of a PPI and the system does partition elimination, or if there are any join predicates on the primary index or the partitioning column of the PPI table, then the Statistics Wizard recommends single-column partition statistics on the system-derived PARTITION column with High Confidence.

If the system does partition elimination, then multicolumn PARTITION Statistics are recommended with Low Confidence on the system-derived PARTITION column and other join columns. This helps to derive the column correlations for the other join columns based on the qualified partitions.

For example, suppose you submit the following query:

SELECT *
FROM t1_ppi, t2
WHERE t1_ppi.partition_column BETWEEN 10 AND 20
AND t1_ppi.x1 = t2.y1;

If the qualified partitions are (4, 7) then it is helpful for the optimizer to derive the demographics of the join column t1_ppi.x1 for the given partitions (4, 7) in order to provide more accurate estimation and costing. The recommendations need to be implemented only if the partitioning column and the join columns are correlated. 

General Statistics Rules - Search for New Tables/Indexes:

This option will search all the new tables or indexes created after a specified date and generate the recommendations on new tables or newly added indexes.

For example, if the following new table is created after a specified date (i.e. CreateTimeStamp column of DBC.TVM table),

CREATE MULTISET TABLE supplier , NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
S_SUPPKEY INTEGER NOT NULL,
S_NAME CHAR(25) NOT NULL,
S_ADDRESS VARCHAR(40) NOT NULL,
S_NATIONKEY INTEGER NOT NULL,
S_PHONE CHAR(15) NOT NULL,
S_ACCTBAL DECIMAL(15,2) NOT NULL,
S_COMMENT VARCHAR(101) NOT NULL)
UNIQUE PRIMARY INDEX ( S_SUPPKEY )
INDEX ( S_NATIONKEY );
Teradata Statistics Wizard recommends the following Statistics collection:
Collect Statistics on supplier Index (S_SUPPKEY);
Collect Statistics on supplier Index (S_NATIONKEY);

Following are the recommendations when Statistics was not collected for any of ‘Process-Order’ tables as all tables are newly added after specified date.  

Sampled Statistics Recommendations:

Collecting statistics on a sample of the data reduces the resources required and the time to perform statistics collection.

If the columns are not indexed, then the rows are organized randomly, so the system just scans the first n percent it finds, where the value of n is determined by the relative presence or absence of skew in the data. Conceivably, the entire sample could be taken from the first data block on each AMP, depending on the system configuration and cardinality of the table being sampled. Particularly avoid using sampled statistics with skewed columns or indexes. For example, columns or indexes that are unique or nearly unique are uniformly distributed, and a good candidate for sampled statistics. However, sampling should not be considered for data that is highly skewed because the optimizer needs to be fully aware of such skew. In addition to uniformly distributed data, full statistics collection on NUSIs is so fast and efficient that there is not much collection overhead to improve upon going to sampling. Sample statistics for NUSIs is not worth the reduction of information available for the minimal collection overhead saved.

To summarize, sampled statistics are generally most appropriate for:

  • Data that is uniformly distributed
  • Near-unique or unique primary index columns
  • Very large tables, but only when collecting full statistics would be prohibitive 

Teradata Statistics Wizard recommends sampled Statistics based on following rules.

  • Tables with greater than 1000 rows/AMP
  • Unique columns
  • Highly unique columns or indexes
  • Sample Percentage (Default 2%)

Note: The quality of the statistics collected with sampling is not guaranteed to be as good as the quality of statistics collected on an entire table without sampling, but is most likely better than having no statistics at all. When you use sampled statistics rather than full-table statistics, you are trading reduced statistics collection time and resources for less accurate statistics.

Sampled Statistics Recommendations -Table with greater than 1000 rows/AMP:

Sampled Statistics on a small table must be avoided as small tables (fewer rows/AMPs) may end up with very inaccurate row counts. Sampled Statistics can be recommended on big tables (e.g. Size > 1000 rows/AMP).

Teradata Statistics Wizard recommends sample statistics for all tables with a size greater than 1000 rows/AMP. The default sample size is 2%.

Example: 
Total number of Amps:
Select HashAmp () +1;
*** Query completed. 3 rows found. One column returned.
*** Total elapsed time was 1 second.
(HASHAMP () +1)
-------------
2
Ie: The database contains 2 Amps.
Total number of rows per Amps
Select Count (*)/(Hashamp()+1) from lineitem;
*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.
(Count(*)/(HASHAMP()+1))
------------------------
647

No Sampled recommendations are generated by Teradata Statistics Wizard as the table must contain more than 1000 rows per AMP.

Inserting rows to table
Insert 707 rows to the existing table lineitem.
Select Count (*)/(Hashamp()+1) from lineitem;
*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.
(Count(*)/(HASHAMP()+1))
------------------------
1001

Now there are more than 1000 rows per AMP.

Sampled Statistics Recommendations:

Collect Statistics using sample on lineitem column (L_ORDERKEY);

Sampled Statistics Recommendations -Unique Columns:

Sampled Statistics are very reasonable for columns that are unique or almost unique, so Teradata Statistics Wizard recommends sampled statistics for all unique columns based on unique constrains defined on tables.

Example:

Collect Statistics using sample on lineitem columns (L_ORDERKEY);
Collect Statistics using sample on lineitem columns (L_PARTKEY);

Sampled Statistics Recommendations -Unique Indexes:

Sampled Statistics are very reasonable on unique indexes (like UPI) because of data uniqueness distribution, so Teradata Statistics Wizard recommends sampled statistics on all unique indexes based on unique index attributes on selected data objects.

Example:

Collect Statistics using sample on lineitem index (L_ORDERKEY);
Collect Statistics using sample on lineitem Index (L_PARTKEY);

Note the overlap in the unique column and unique index recommendations. The following will be recommendations
if both unique columns and unique indexes are selected:

Collect Statistics using sample on lineitem index (L_ORDERKEY);
Collect Statistics using sample on lineitem columns (L_PARTKEY);

Conclusion and Summary:

Statistics will always be the most complete when they are collected fully. But due to the time and resources required in the statistics collection effort, it makes sense to consider alternative approaches, if they can provide a similar level of accuracy.

USING SAMPLE Statistics

  • Significantly faster than full Statistics collection
  • Acceptable for columns or indexes that are “highly unique,” that is, whose number of distinct values approaches the row count of the table
  • Not recommended for small tables, that is, tables whose row count is less than 20 times the number of AMPs in the system

When to use:

For columns that are unique, such as unique primary indexes, or for columns or indexes that are close to unique. Some customers, experienced in USING SAMPLE, suggest that USING SAMPLE provides particular value on very large tables, tables with tens of billions of rows.

Full Statistics

  • Best choice on columns or indexes when data is moderately or greatly skewed
  • Recommended for small tables, that is, tables with fewer than 1000 rows per AMP
  • For selection columns whose number of distinct values is moderate or low
  • For most NUSI’s and other selection columns

When to use:

For all columns/indexes where collected Statistics add value, and where USING SAMPLE does not provide satisfactory estimates.

Statistics collection is an ongoing job. Because it is a highly-repetitive activity, it is worth taking some time to investigate the optimal way to combine these different approaches at your site.

Because a mixed approach to statistics collection requires column-by-column (or index-by-index) analysis, it may take some time and effort to apply comprehensively. However, by focusing only on the larger tables, where full statistics collection time is the greatest, significant savings can be experienced with a moderate effort.

6 REPLIES
Teradata Employee

Re: Easy Statistics Recommendations - Statistics Wizard Feature

Nice overview of the tool and options. I just want to stress to people the importance of the first line in your "when to use" summary. The output of the tool are "recommendations". You must apply, test and either keep or remove the recommended stats. Do not just apply *ALL* recommendations , only the ones that add value.

Dave
Teradata Employee

Re: Easy Statistics Recommendations - Statistics Wizard Feature

Agree with Dave. There are recommendations based on general Statistics rules. You must apply only the once that add value.

Statistics Wizard also recommended workload based Statistics recommendations based on optimizer plan of workload queries. Workload based Statistics recommendations are must needed statistics collection. You can apply all Statistics recommendations on workload/queries to improve the plan estimation.

This article addresses the process of Statistics recommendations based on general heuristics. It does not discuss how to determine appropriate Statistics recommendations for workloads (sets of queries). Please refer the Teradata Statistics Wizard User Guide for workload based Statistics recommendations.

Enthusiast

Re: Easy Statistics Recommendations - Statistics Wizard Feature

I Agree with this short but complete statistics recomendations. Really this very usefull descriptions to know about the stats wizards.I have one one questions?
What percentage data if changes then we should apply the stats and how frequently we should recomend the stats on the column ?
Teradata Employee

Re: Easy Statistics Recommendations - Statistics Wizard Feature

The thumb rule is to recollect the statistics when data changes for table minimum 10 percentages. You have to recollect statistics on column based on data changes in table. However, you can recommend statistics on column based on how frequently this column used in queries submitted to database (based on Range Access, Join Access and Value Access frequency of column). You can use workload based recommendation feature of stats wizards to get appropriate statistics recommendations based on optimizer plan of queries in workload (set of queries).
Enthusiast

Re: Easy Statistics Recommendations - Statistics Wizard Feature

Hello Sahu,

Thanks a lot for providing a quick summary for statistics collection recommendations.

I have following doubts in statistics collection using Teradata Statistics Wizard:-

1. What are stale statistics?

2. Collection of statistics is advisable to be collected as per recommendation for a particular workload or for all tables which are part of the workload?

3. If I define a workload using "Create from SQL statement" and then get recommendations for collecting statistics on that particular workload, recommendation would include statistics collection on columns/ indexes for the query mentioned in workload. Here it might happen that statistics are collected on some columns/ indexes which may be not required for other queries. In other words, would the statistics collected for a particular query be stale for other queries?

4. Is it advisable to collect statistics for all tables that are frequently used regularly?

Thanks,
Shrey
Teradata Employee

Re: Easy Statistics Recommendations - Statistics Wizard Feature

To answer the question about stale statistics, they are stats that no longer represent the distribution of the data in a column or index.  Stale stats influence the optimizer to do things in query plans based upon stats that are inaccurate;  therefore causing bad plans. 

Something to remember when you talk about frequency of recollection.  It is a change in the distribution of data that requires recollection and not necessarily a percentage of how many rows have been added or values updated.  Personally, I think the 10% figure of new rows is too conservative.  (e.g. a 10% gain in rows with all the same values of a data element would be the worst case scenario;  how would that change the distribution of data?  So, if I had 100000 rows with only 10 values, then the distribution may not be changed significantly, but if I have 10000 values, then definitely there would be a significant distribution change) --- so, my message is that number of values per column plays a role in how frequently to change as well as number of new rows.  My rule of thumb is somewhere between 10 and 20% depending upon number of values vs number of rows.