When is the right time to refresh statistics? - Part I


The appropriate time to refresh statistics is just before the query or the load process breaks, that is, just before the optimizer changes the access plan from a good plan to a bad one. However, the more relevant question is: how can I determine the right time to refresh statistics on any table or join index? To answer this question, there are a few items that are important to know and understand first, as listed below:

1. System workload

Start with the system workload, specifically the workload of the critical processes and reports, which usually means the processes and reports associated with an SLA. DBQL and ResUsage data provide the information you need to understand the overall system workload as well as those critical processes and reports.
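
For example, a simple DBQL summary like the sketch below can surface the heaviest consumers. This is only a minimal sketch: it assumes query logging is enabled and that you can read the DBC.QryLog view, and the seven-day window and the sort by CPU are arbitrary choices to adapt to your SLAs.

/* Minimal sketch: heaviest CPU consumers over the last 7 days (assumes DBQL is enabled). */
SELECT UserName
     , COUNT(*)          AS QueryCount
     , SUM(AMPCPUTime)   AS TotalAMPCPUTime
     , SUM(TotalIOCount) AS TotalIO
FROM DBC.QryLog
WHERE CAST(StartTime AS DATE) >= CURRENT_DATE - 7
GROUP BY 1
ORDER BY TotalAMPCPUTime DESC;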

2. Statistics collection recommendations

The statistics collection recommendations to be applied to any table or join index in your environment. The following is a summary of statistics collection recommendations sourced from the article Statistics Collection Recommendations for Teradata 12 by Carrie Ballinger; a syntax sketch follows these lists.

Collect Full Statistics

  • Non-indexed columns used in constraints and joins
  • All NUSIs with uneven distribution of values
  • NUSIs used in join steps
  • USIs/UPIs if used in range constraints
  • Value Ordered NUSIs
  • NUPIs
  • Relevant columns and indexes on small tables 

Collect Sample Statistics (very large tables)

  • Unique index columns
  • Nearly-unique columns or indexes

Collect Multicolumn Statistics

  • Group of columns used in constraints with equality predicates
  • Group of columns used in joins or aggregations

Collect PPI Statistics

  • PARTITION
  • Partition column

Collect PPI Statistics (TD12)

  • (PARTITION, PI)
  • (PARTITION, PI, Partition column)
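
The statements below are a minimal sketch of how these recommendations translate into TD12-style COLLECT STATISTICS syntax. The table GDW_AGG.SALES_FACT and its columns are hypothetical placeholders used only to illustrate the syntax; verify the exact options against your release.

/* Full statistics on a non-indexed join column or NUPI. */
COLLECT STATISTICS ON GDW_AGG.SALES_FACT COLUMN (STORE_ID);

/* Full statistics on a NUSI used in join steps. */
COLLECT STATISTICS ON GDW_AGG.SALES_FACT INDEX (ITEM_ID);

/* Sample statistics on a nearly-unique column of a very large table. */
COLLECT STATISTICS USING SAMPLE ON GDW_AGG.SALES_FACT COLUMN (ORDER_ID);

/* Multi-column statistics for columns used together in joins. */
COLLECT STATISTICS ON GDW_AGG.SALES_FACT COLUMN (STORE_ID, SALE_DT);

/* PPI statistics on the system-derived PARTITION column, and TD12 (PARTITION, PI) statistics. */
COLLECT STATISTICS ON GDW_AGG.SALES_FACT COLUMN PARTITION;
COLLECT STATISTICS ON GDW_AGG.SALES_FACT COLUMN (PARTITION, STORE_ID);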

3. Refresh Frequency

How often do the statistics need to be refreshed? This question needs to be answered with the system workload in mind; however, there are some recommendations based on best practices (see the sketch after these lists for a quick way to check statistics age and zero-value statistics).

After every load

  • Tables that are fully refreshed (reloaded) by the load process

Daily

  • PARTITION
  • Partition column
  • Value ordered NUSIs
  • NUSIs associated with critical processes/reports
  • Sample statistics
  • DBC tables

After changing 10% in size

  • Any table or join index that changed 10% in size

Every 90 days

  • Rolling tables
  • Join indexes associated with rolling tables
  • Any statistics that are 90 days old (almost-static tables)

 Zero Statistics

  • Any statistics showing zero unique values 

No Statistics

  • Any index with no statistics defined 
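
A quick way to spot statistics that are past the 90-day mark or that show zero unique values is HELP STATISTICS; its output includes the collection date, the number of unique values, and the column names. The sketch below reuses a table from the examples later in this article; the COUNT(*) comparison for the 10% rule assumes you track the previous row count yourself.

/* Collection date and unique-value count for every statistic on the table. */
HELP STATISTICS GDW_AGG.RPT_ITEM;

/* Current row count, to compare against the row count recorded at the last refresh
   (tracking that prior count is up to your own process). */
SELECT COUNT(*) FROM GDW_AGG.RPT_ITEM;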

4. Impact

What is the impact of collecting statistics on a column, an index, or a multi-column set of any table or join index, especially the very large ones? You can determine the impact of the collect statistics process by checking DBQL data and by looking at the Explain output of the collect statistics statements. High CPU impact is associated mainly with large tables; the uniqueness of the column is another factor.
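
One way to quantify that impact is to rank collect statistics statements by CPU in DBQL. This is a sketch, assuming DBQL is enabled and that the statement text is captured in DBC.QryLog; the 30-day window is arbitrary.

/* CPU and I/O consumed by collect statistics statements over the last 30 days. */
SELECT UserName
     , QueryText
     , AMPCPUTime
     , TotalIOCount
FROM DBC.QryLog
WHERE UPPER(QueryText) LIKE 'COLLECT STAT%'
  AND CAST(StartTime AS DATE) >= CURRENT_DATE - 30
ORDER BY AMPCPUTime DESC;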

UPI or NUPI

The impact is high for very large tables or join indexes because it requires an all-rows scan of the base table to collect statistics. Please see the explain below.

COLLECT STATISTICS ON GDW_AGG.RPT_ITEM INDEX( ITEM_ID);

Explanation

  1. First, we lock GDW_AGG.RPT_ITEM for access.
  2. Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_ITEM by way of an all-rows scan into Spool 1 (all_amps), which is built locally on the AMPs.
  3. Then we save the UPDATED STATISTICS from Spool 1 (Last Use) into Spool 3, which is built locally on the AMP derived from DBC.TVFields by way of the primary index.
  4. We lock DBC.TVFields for write on a RowHash.
  5. We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 3 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 4 (one-amp), which is built locally on the AMPs.
  6. We do a single-AMP MERGE into DBC.TVFields from Spool 4 (Last Use).
  7. We spoil the parser's dictionary cache for the table.
  8. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
  9. -> No rows are returned to the user as the result of statement 1. 

USI or NUSI

The impact is low even on very large tables or join indexes because it uses the secondary index sub-table to collect statistics. Please see the explain below. 

COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 INDEX ( DISTRICT_ID );

Explanation

  1. First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 for access.
  2. Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 by way of a traversal of index # 8 without accessing the base table into Spool 3 (all_amps), which is built locally on the AMPs.
  3. Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.TVFields by way of the primary index.
  4. We lock DBC.TVFields for write on a RowHash.
  5. We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
  6. We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
  7. We spoil the parser's dictionary cache for the table.
  8. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
  9. -> No rows are returned to the user as the result of statement 1. 

PARTITION

The impact is low even on very large tables because it uses the cylinder headers to determine the partitions and collect statistics, although the explain does not reflect that. Please see the explain below.

COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809 COLUMN PARTITION;

Explanation

  1. First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809 for access.
  2. Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809 by way of an all-rows scan into Spool 3 (all_amps), which is built locally on the AMPs.
  3. Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.Indexes by way of the primary index.
  4. We lock DBC.Indexes for write on a RowHash.
  5. We do a single-AMP MERGE DELETE to DBC.Indexes from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
  6. We do a single-AMP MERGE into DBC.Indexes from Spool 5 (Last Use).
  7. We spoil the parser's dictionary cache for the table.
  8. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
  9. -> No rows are returned to the user as the result of statement 1. 

Partition Column

The impact is high for very large tables or join indexes because it requires an all-rows scan to collect statistics. Please see the explain below.

COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809 COLUMN INV_DT;

Explanation

  1. First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809 for access.
  2. Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809 by way of an all-rows scan into Spool 3 (all_amps), which is built locally on the AMPs.
  3. Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.TVFields by way of the primary index.
  4. We lock DBC.TVFields for write on a RowHash.
  5. We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
  6. We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
  7. We spoil the parser's dictionary cache for the table.
  8. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
  9. -> No rows are returned to the user as the result of statement 1. 

Partition Column with NUSI

The impact is low even on very large tables or join indexes because it uses the secondary index sub-table to collect statistics. Please see the explain below. 

COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN BUSINESSDAY_DT;

Explanation

  1. First, we lock GDW_DM.SALES_DETAIL for access.
  2. Next, we do a COLLECT STATISTICS step from GDW_DM.SALES_DETAIL by way of a traversal of index # 12 without accessing the base table into Spool 3 (all_amps), which is built locally on the AMPs.
  3. Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.TVFields by way of the primary index.
  4. We lock DBC.TVFields for write on a RowHash.
  5. We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
  6. We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
  7. We spoil the parser's dictionary cache for the table.
  8. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
  9. -> No rows are returned to the user as the result of statement 1. 

Multi-Column

The impact is high for very large tables because it requires an all-rows scan to collect statistics. Also keep in mind that multi-column statistics values are truncated to 16 bytes. Please see the explain below.

COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN ( MIN_BUSINESSDAY_DT,SALES_LOCATION_ID );

Explanation

  1. First, we lock GDW_DM.SALES_DETAIL for access.
  2. Next, we do a COLLECT STATISTICS step from GDW_DM.SALES_DETAIL by way of an all-rows scan into Spool 3 (all_amps), which is built locally on the AMPs.
  3. Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.Indexes by way of the primary index.
  4. We lock DBC.Indexes for write on a RowHash.
  5. We do a single-AMP MERGE DELETE to DBC.Indexes from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
  6. We do a single-AMP MERGE into DBC.Indexes from Spool 5 (Last Use).
  7. We spoil the parser's dictionary cache for the table.
  8. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
  9. -> No rows are returned to the user as the result of statement 1. 

5. Refreshing statistics on a table or a join index?

What is the best way to refresh statistics on a table or a join index? The answer to this question usually depends on the size of the table and on how many CPU cycles are available to collect statistics. Consider the following best practices to avoid wasting resources if the collect statistics process needs to be aborted (see the sketch after these lists):

Refresh at the Table level

  • Global temporary tables
  • Small tables

Refresh at the Index/Column/Multi-Column level

  • All medium and large tables and join indexes 
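
As a sketch of the difference (the small lookup table name below is hypothetical; the second statement reuses the index from the earlier example):

/* Table-level refresh: one statement re-collects every statistic already defined on the
   table; convenient for global temporary tables and small tables. */
COLLECT STATISTICS ON GDW_AGG.SMALL_LOOKUP;

/* Column/index-level refresh: re-collects only the named statistic, so a long-running
   collection on a large table or join index can be scheduled and, if necessary, aborted
   without losing the work done on the other statistics. */
COLLECT STATISTICS ON GDW_AGG.RPT_ITEM INDEX (ITEM_ID);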

6. Next...

The next step is to create a process that automatically considers all items mentioned above and collects statistics efficiently without wasting system resources. Continue to Part II...

11 Comments
Great article, Marcio!! Looking forward to the techniques in automation of the statistics collection process.

Cheers!

Marcio, this is very useful information and will help with efficient stats management.
Trying to figure out an efficient way to identify stale columns whose statistics could be dropped to improve stats runtime and reduce resource waste.

Abraham, the best way to identify stale columns is to turn on DBQL object logging (the DBQL Object table) and run a monthly report on column usage by table/database.
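
A monthly column-usage report along those lines could look like the sketch below. It assumes object logging (the WITH OBJECTS option of BEGIN QUERY LOGGING) is enabled and that you can read DBC.DBQLObjTbl; columns that never show up in the result are candidates for dropping their statistics.

/* Column usage over the last month, by database and table. */
SELECT ObjectDatabaseName
     , ObjectTableName
     , ObjectColumnName
     , COUNT(*) AS UseCount
FROM DBC.DBQLObjTbl
WHERE ObjectType = 'Col'
  AND CAST(CollectTimeStamp AS DATE) >= ADD_MONTHS(CURRENT_DATE, -1)
GROUP BY 1, 2, 3
ORDER BY 1, 2, UseCount DESC;
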
Marcio,
This is a very good technique to get more value out of your Teradata System and BI Solution. Applying your techniques will keep business users happy. It’s a great add-on for the consulting toolbox.
Regards, Ulbricht
Marcio,
The information is presented in a very useful context, one that I can share with others to get them to consider other methods and techniques. This helps; thanks.

It usually takes less than five minutes to refresh stats on DBC tables.

Thanks for such a good article :)

Hi Marcio,

Thanks for such a nice article!!!

Hi Marcio,

I heard V14 has many new features for collect stats. One thing I would like to bring up here is summary stats. I know SUMMARY statistics are different from regular column or index statistics: they don't contain any histogram, and they provide up-to-date summary information to the optimizer in a quick and efficient way.

But what is the best way to decide the candidates for collecting summary stats?


Hi Marcio,

Thanks for the valuable information. We are planning to enable Auto Stats from the lower environments up to the higher ones. As part of testing, I tested the following cases to make sure everything is working as expected.

Case 1: Insert, update, and delete operations: during collection, once the system data-change threshold is reached, whether it collects or skips statistics for these DML operations.

Case 2: Analyze job: whether it gives the right recommendations for all types of tables, especially PPI tables, join indexes, etc.

Case 3: Collect job: whether it runs the job per the schedule and refreshes the PARTITION, partition columns, NUSIs, sample statistics, and DBC tables daily.

Case 4: If it is a big table (>1000 rows/AMP) with full statistics already collected, whether it switches to sample stats.

Case 5: If it is a small table (<1000 rows/AMP), whether it collects full stats.

Case 6: Whether it gives a recommendation for, and collects stats on, TIMESTAMP columns.

In your experience, are there any other cases I need to test so that we can have a smoother implementation of AutoStats?

Thanks in advance.


Hi Marcio,

Is there any solution to handle Auto Stats (via the auto scheduler) during a maintenance window? The question is: assume the stats are scheduled to run during a maintenance window when the system is down. Is there any solution to address this missed stats collection?

Thanks in advance.