The principle focus of query tuning is to provide reliable summary information about the data to the optimizer. This is done by collecting accurate statistics which are then stored in a synoptic data structure known as an interval histogram. The correct choice of the column and index sets on which Statistics should be collected can help the optimizer generate better query plans, dramatically improving query performance and reduce the collection overhead. It can be difficult to understand how the optimizer uses statistics as well as deciding what statistics are needed without an automated method to help. That automated method is the Teradata Statistics Wizard, which is a clientbased GUI interface for obtaining statistics recommendations for particular queries or query workloads that are submitted for analysis.
Statistical information availability is critical for optimal query plans, but Statistics collection is a time and resource consuming job. Appropriate and uptodate Statistics can improve the performance of a query. As the Teradata Database becomes more powerful, applications and the queries they perform have also become more complex and sophisticated. The job to decide upon adequate and appropriate Statistics collection, recollection of Statistics, and the correct balance between good query plans and the time required to collect Statistics on complex databases has become very complicated.
The Easy Feature of the Statistics Wizard simplifies this by automating the process of collection/recollection of Statistics for particular database/tables by applying the best practices based upon the data demographics information available in the database system.
There are two main approaches in using the Easy Feature for Statistics recommendations:
1) Full Statistics Recommendations are based on the following attributes:
2) Sampled Statistics Recommendations are based on the following attributes:
The following are the most important reasons for column and index statistics.
In many cases, sampled statistics are not accurate enough for the optimizer to generate an optimal or even a good join plan. However, it is sometimes true that statistics collected by sampling small subpopulations of table rows can be as good as those collected with a fulltable scan. The value of collecting fulltable statistics is that they provide the optimizer with the most accurate information that can be gathered for making the best possible query plan cost estimates.
Statistical accuracy is fundamentally important for any query plan because the effect of suboptimal access and join plans generated from inaccurate statistics, of which there can be many in the optimization of a complex query, is significant.
Statistics and optimizer estimates also play a critical role in Teradata Active System Management (TASM) configuration and strategy.
Statistics Method  Characteristics  Best Use 
Full Statistics 


Sampled Statistics 


If the ‟Full Statistics Recommendations‟ radio button is selected, the Statistics Wizard gives Full Statistics Recommendations based on the following attributes for the selected objects:
All NUPIs
All NUSIs
All Indexes for Join Index Table
All NUSIs
All VOSI
All Partitioned Tables
Search for New Tables/ Indexes
Primary Index/Key
Foreign Key
CREATE SET TABLE lineitem, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
L_ORDERKEY INTEGER NOT NULL,
L_PARTKEY INTEGER NOT NULL,
L_QUANTITY FLOAT,
L_SHIPMODE CHAR(10)CHARACTER SET LATIN CASESPECIFIC NOT NULL DEFAULT 'IIA4 ',
L_SHIPINSTRUCT VARCHAR(45) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT 'IIA5',
L_SUPPKEY INTEGER,
L_LINENUMBER INTEGER,
L_EXTENDEDPRICE FLOAT,
L_LINESTATUS CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT 'D')
UNIQUE PRIMARY INDEX (L_ORDERKEY)
INDEX (L_QUANTITY)
INDEX ALL (L_PARTKEY, L_QUANTITY, L_SHIPMODE) ORDER BY VALUES (L_PARTKEY)
INDEX (L_SHIPINSTRUCT, L_SUPPKEY, L_LINENUMBER, L_EXTENDEDPRICE, L_LINESTATUS) ORDER BY VALUES (L_SUPPKEY)
INDEX (L_PARTKEY, L_QUANTITY, L_LINENUMBER) ORDER BY VALUES (L_LINENUMBER)
INDEX (L_LINENUMBER, L_EXTENDEDPRICE, L_LINESTATUS)
UNIQUE INDEX (L_PARTKEY);
Recollection of statistics for all columns is recommended if the days elapsed since the last collection exceeds a userspecified threshold.
Example: The last collection on the lineitem table exceeds the userspecified threshold (10 days from date of 05/19/2008).
HELP STATISTICS lineitem;
*** Help information returned. 3 rows.
*** Total elapsed time was 1 second.
Date Time Unique
Values Column Names
   
03/16/2008 16:26:44 1,280 L_PARTKEY , L_QUANTITY, L_SHIPMODE
03/16/2008 16:26:44 1,280 L_PARTKEY , L_QUANTITY, L_LINENUMBER
05/19/2008 10:37:15 1,280 L_SHIPINSTRUCT,L_SUPPKEY,L_LINENUMBER,
L_EXTENDEDPRICE , L_LINESTATUS
Statistics were previously collected on (L_PARTKEY, L_QUANTITY, L_SHIPMODE) and (L_PARTKEY , L_QUANTITY, L_LINENUMBER) columns of lineitem table on 03/16/2008. The user selected 10 days as the recollection option and the last collected statistics date exceeded 10 days (from 05/19/2008). Teradata Statistics Wizard will recommend recollection of statistics on columns referred to in the first two rows. No recommendations are provided for last row because statistics were collected on same day (05/19/2008).
Statistics recollection recommendations are based on the previous collected Statistics method. (If originally Sample Statistics were collected on tables then Teradata Statistics Wizard will recommend recollection of Sample Statistics).
Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_SHIPMODE);
Collect Statistics on lineitem Column (L_PARTKEY, L_QUANTITY, L_LINENUMBER);
Table growth recommends recollection for all columns that have a change in row count exceeding a userspecified threshold since statistics were last collected. If the total rows count value from the existing statistics is less or greater than the userspecified threshold value then recollection statistics recommendations are generated.
Example: 128 rows have been inserted to table lineitem since the last statistics colletion. (last row count 1280 + new inserted rows 128 = 1408) (refer to Help Statistics output in previous section).
Select count (*) from lineitem;
*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.
Count (*)

1408
Teradata Statistics Wizard provides the following statistics recollection recommendations. The row count of lineitem table has exceeded the userspecified threshold (10%).
Table growth recommendations (if full statistics were collected previously):
Collect Statistics on lineitem index (L_PARTKEY, L_QUANTITY, L_SHIPMODE);
Collect Statistics on lineitem index (L_SHIPINSTRUCT,L_SUPPKEY,L_LINENUMBER,
L_EXTENDEDPRICE, L_LINESTATUS);
Collect Statistics on lineitem index (L_PARTKEY, L_QUANTITY, L_LINENUMBER);
When you collect statistics for skewed data, the optimizer can accommodate exceptional values.
Statistics reveal values that include the most frequent value in the table and the most frequent value per value range. The system divides the table into 100 groupings and maintains statistics on the most nonunique value/range.
Without collected statistics, the system derives row counts from a random AMP sample for:
Small tables often distribute unevenly. If rows in a table are not distributed evenly across all AMPs, random AMP samples may not represent the true total number of rows in the table.
Full Statistics availability is needed for highly skewed data for the best plan generation. The optimizer needs to be fully aware of such skew. Teradata Statistics Wizard recommends statistics collection on all nonunique primary indexes for tables that have table skew that exceed a userconfigured threshold.
SQL to calculate table skew on all NUPI:
SELECT di.databasename, di.tablename, di.indexname,
di.indextype, di.uniqueflag,
SUM(cast(t.currentperm as decimal(18,0))/(1024*1024)) perm_mb,
100*((cast(max(t.currentperm ) as decimal(18,3))
 cast(ave(t.currentperm ) as decimal(18,3))))
/(nullif(cast(max(t.currentperm) as decimal(18,3)),0)) as tableskew
FROM dbc.indices di, dbc.tablesize t WHERE di.databasename in ('<DBName')
AND di.tablename in ('<TableName>')
AND di.indextype = 'P'
AND di.uniqueflag = 'N'
AND di.databasename = t.databasename
AND di.tablename = t.tablename
GROUP BY di.databasename, di.tablename, di.indextype,
di.indexname, di.uniqueflag
ORDER BY di.databasename, di.tablename, di.indextype, di.indexname;
Collect Statistics on NUPITABLE index (NUPI_Col1);
Teradata Statistics Wizard recommends statistics collection on all nonunique secondary indexes for tables that have index columns skews that exceed a userconfigured threshold.
SQL to calculate table skew on all NUSI:
SELECT di.databasename, di.tablename, di.indexname,
di.indextype, di.uniqueflag,
SUM(cast(t.currentperm as decimal(18,0))/(1024*1024)) perm_mb,
100*((cast(max(t.currentperm ) as decimal(18,3))
 cast(ave(t.currentperm ) as decimal(18,3))))
/(nullif(cast(max(t.currentperm) as decimal(18,3)),0)) as tableskew
FROM dbc.indices di, dbc.tablesize t WHERE di.databasename in ('<DBName')
AND di.tablename in ('<TableName>')
AND di.indextype = 'S'
AND di.uniqueflag = 'N'
AND di.databasename = t.databasename
AND di.tablename = t.tablename
GROUP BY di.databasename, di.tablename, di.indextype,
di.indexname, di.uniqueflag
ORDER BY di.databasename, di.tablename, di.indextype, di.indexname;
Collect Statistics on lineitem index (L_LINENUMBER, L_EXTENDEDPRICE ,L_LINESTATUS);
Collect Statistics on lineitem index (L_QUANTITY);
The COLLECT STATISTICS statement collects demographics, computes Statistics from them, and writes the resulting data into individual entries for each individual base table and join index table on the system.
When join indexes are used for estimates, you might not need to collect statistics on the base table columns that are covered by the join index with some exceptions. For example, statistics are still required on base table index columns that have singletable predicates in order to do access path analysis. Similarly, statistics might be required if the covered columns are used as join columns or grouping columns in a GROUP BY clause.
Consider the following query and join index definition:
SELECT *
FROM t1, t2
WHERE nusi_col = 10
AND a1 > 20
AND t1.a1 = t2.a2;
CREATE JOIN INDEX ji AS
SELECT a1, b1
FROM t1
WHERE nusi_col = 10
AND a1 > 20;
In this case, the join index cardinality provides the final selectivity for t1; However, statistics on column a1 might still be required to make join cardinality, rows per value, and other cardinality estimates. Ideally, the statistics on a1 from the join index are more accurate than the base table statistics. The optimizer first attempts to locate the join column a1 in the join index and then attempts to use its statistics. If this column is not projected, or if the statistics are not available, then the optimizer looks for statistics on the base table.
When usable singletable join indexes are found for singletable cardinality estimations, and the columns referenced in the join predicates or GROUP BY clause are projected from the join index, then the Statistics Wizard recommends statistics on the join index columns if they are not available. For the previous example, the Statistics Wizard recommends collecting statistics on ji.a1 since the column a1 is projected in the join index. If this column is not projected, then the Statistics Wizard recommends collecting statistics on the base table columns t1.a1
Column statistics for join indexes and their underlying base tables are not interchangeable. You need to submit separate COLLECT STATISTICS statements for the columns in the join index and the source columns in the base tables. Join index tables and data tables are seen as separate entities to the optimizer.
Collecting STATISTICS on NUSIs makes it possible for the optimizer to decide when to use the NUSI for access and when to do a full table scan.
Teradata Statistics Wizard recommends collect statistics on all nonunique secondary indexes due to the data skew possibilities described previously.
Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_SHIPMODE);
Valueordered NUSIs are useful for processing range conditions and conditions with an inequality on the secondary index column set.
By sorting the NUSI rows by data value, it is possible to search only a portion of the index subtable for a given range of key values. The major advantage of a valueordered NUSI is in the performance of range queries.
The following example illustrates a valueordered NUSI (defined by an ORDER BY clause that specifies the VALUES keyword option) and a query that would probably be solved more efficiently if the specified valueordered NUSI were selected by the optimizer to access the requested rows.
CREATE INDEX Idx_Date (o_orderdate)
ORDER BY VALUES (o_orderdate)
ON Orders;
SELECT *
FROM Orders
WHERE o_orderdate
BETWEEN ‘19971001’
AND ‘19971007’;
Teradata Statistics Wizard recommends statistics on all VOSIs.
General  All Value Ordered NUSI recommendations
Collect Statistics on lineitem Index (L_SHIPINSTRUCT, L_SUPPKEY, L_LINENUMBER,
L_EXTENDEDPRICE , L_LINESTATUS);
Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_LINENUMBER);
If there are any predicates on the partitioning columns of a PPI and the system does partition elimination, or if there are any join predicates on the primary index or the partitioning column of the PPI table, then the Statistics Wizard recommends singlecolumn partition statistics on the systemderived PARTITION column with High Confidence.
If the system does partition elimination, then multicolumn PARTITION Statistics are recommended with Low Confidence on the systemderived PARTITION column and other join columns. This helps to derive the column correlations for the other join columns based on the qualified partitions.
For example, suppose you submit the following query:
SELECT *
FROM t1_ppi, t2
WHERE t1_ppi.partition_column BETWEEN 10 AND 20
AND t1_ppi.x1 = t2.y1;
If the qualified partitions are (4, 7) then it is helpful for the optimizer to derive the demographics of the join column t1_ppi.x1 for the given partitions (4, 7) in order to provide more accurate estimation and costing. The recommendations need to be implemented only if the partitioning column and the join columns are correlated.
This option will search all the new tables or indexes created after a specified date and generate the recommendations on new tables or newly added indexes.
For example, if the following new table is created after a specified date (i.e. CreateTimeStamp column of DBC.TVM table),
CREATE MULTISET TABLE supplier , NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
S_SUPPKEY INTEGER NOT NULL,
S_NAME CHAR(25) NOT NULL,
S_ADDRESS VARCHAR(40) NOT NULL,
S_NATIONKEY INTEGER NOT NULL,
S_PHONE CHAR(15) NOT NULL,
S_ACCTBAL DECIMAL(15,2) NOT NULL,
S_COMMENT VARCHAR(101) NOT NULL)
UNIQUE PRIMARY INDEX ( S_SUPPKEY )
INDEX ( S_NATIONKEY );
Teradata Statistics Wizard recommends the following Statistics collection:
Collect Statistics on supplier Index (S_SUPPKEY);
Collect Statistics on supplier Index (S_NATIONKEY);
Following are the recommendations when Statistics was not collected for any of ‘ProcessOrder’ tables as all tables are newly added after specified date.
Collecting statistics on a sample of the data reduces the resources required and the time to perform statistics collection.
If the columns are not indexed, then the rows are organized randomly, so the system just scans the first n percent it finds, where the value of n is determined by the relative presence or absence of skew in the data. Conceivably, the entire sample could be taken from the first data block on each AMP, depending on the system configuration and cardinality of the table being sampled. Particularly avoid using sampled statistics with skewed columns or indexes. For example, columns or indexes that are unique or nearly unique are uniformly distributed, and a good candidate for sampled statistics. However, sampling should not be considered for data that is highly skewed because the optimizer needs to be fully aware of such skew. In addition to uniformly distributed data, full statistics collection on NUSIs is so fast and efficient that there is not much collection overhead to improve upon going to sampling. Sample statistics for NUSIs is not worth the reduction of information available for the minimal collection overhead saved.
To summarize, sampled statistics are generally most appropriate for:
Teradata Statistics Wizard recommends sampled Statistics based on following rules.
Note: The quality of the statistics collected with sampling is not guaranteed to be as good as the quality of statistics collected on an entire table without sampling, but is most likely better than having no statistics at all. When you use sampled statistics rather than fulltable statistics, you are trading reduced statistics collection time and resources for less accurate statistics.
Sampled Statistics on a small table must be avoided as small tables (fewer rows/AMPs) may end up with very inaccurate row counts. Sampled Statistics can be recommended on big tables (e.g. Size > 1000 rows/AMP).
Teradata Statistics Wizard recommends sample statistics for all tables with a size greater than 1000 rows/AMP. The default sample size is 2%.
Example:
Total number of Amps:
Select HashAmp () +1;
*** Query completed. 3 rows found. One column returned.
*** Total elapsed time was 1 second.
(HASHAMP () +1)

2
Ie: The database contains 2 Amps.
Total number of rows per Amps
Select Count (*)/(Hashamp()+1) from lineitem;
*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.
(Count(*)/(HASHAMP()+1))

647
No Sampled recommendations are generated by Teradata Statistics Wizard as the table must contain more than 1000 rows per AMP.
Inserting rows to table
Insert 707 rows to the existing table lineitem.
Select Count (*)/(Hashamp()+1) from lineitem;
*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.
(Count(*)/(HASHAMP()+1))

1001
Now there are more than 1000 rows per AMP.
Sampled Statistics Recommendations:
Collect Statistics using sample on lineitem column (L_ORDERKEY);
Sampled Statistics are very reasonable for columns that are unique or almost unique, so Teradata Statistics Wizard recommends sampled statistics for all unique columns based on unique constrains defined on tables.
Example:
Collect Statistics using sample on lineitem columns (L_ORDERKEY);
Collect Statistics using sample on lineitem columns (L_PARTKEY);
Sampled Statistics are very reasonable on unique indexes (like UPI) because of data uniqueness distribution, so Teradata Statistics Wizard recommends sampled statistics on all unique indexes based on unique index attributes on selected data objects.
Example:
Collect Statistics using sample on lineitem index (L_ORDERKEY);
Collect Statistics using sample on lineitem Index (L_PARTKEY);
Note the overlap in the unique column and unique index recommendations. The following will be recommendations
if both unique columns and unique indexes are selected:
Collect Statistics using sample on lineitem index (L_ORDERKEY);
Collect Statistics using sample on lineitem columns (L_PARTKEY);
Statistics will always be the most complete when they are collected fully. But due to the time and resources required in the statistics collection effort, it makes sense to consider alternative approaches, if they can provide a similar level of accuracy.
USING SAMPLE Statistics
When to use:
For columns that are unique, such as unique primary indexes, or for columns or indexes that are close to unique. Some customers, experienced in USING SAMPLE, suggest that USING SAMPLE provides particular value on very large tables, tables with tens of billions of rows.
Full Statistics
When to use:
For all columns/indexes where collected Statistics add value, and where USING SAMPLE does not provide satisfactory estimates.
Statistics collection is an ongoing job. Because it is a highlyrepetitive activity, it is worth taking some time to investigate the optimal way to combine these different approaches at your site.
Because a mixed approach to statistics collection requires columnbycolumn (or indexbyindex) analysis, it may take some time and effort to apply comprehensively. However, by focusing only on the larger tables, where full statistics collection time is the greatest, significant savings can be experienced with a moderate effort.
To answer the question about stale statistics, they are stats that no longer represent the distribution of the data in a column or index. Stale stats influence the optimizer to do things in query plans based upon stats that are inaccurate; therefore causing bad plans.
Something to remember when you talk about frequency of recollection. It is a change in the distribution of data that requires recollection and not necessarily a percentage of how many rows have been added or values updated. Personally, I think the 10% figure of new rows is too conservative. (e.g. a 10% gain in rows with all the same values of a data element would be the worst case scenario; how would that change the distribution of data? So, if I had 100000 rows with only 10 values, then the distribution may not be changed significantly, but if I have 10000 values, then definitely there would be a significant distribution change)  so, my message is that number of values per column plays a role in how frequently to change as well as number of new rows. My rule of thumb is somewhere between 10 and 20% depending upon number of values vs number of rows.