How to simplify your statistics collection with a queue table


When you start designing a statistics collection process, two topics come up quickly:

  • stats on some tables are more important than others.
  • some stats on a single table are more important than others.

Both problems involve some kind of prioritization, which finally leads to a sorted list of COLLECT STATS statements (see Marcio Moura's blog When is the right time to refresh statistics?)

Processing this list (using a BTEQ script or a cursor in a Stored Procedure) could lead to another problem:

  • only a limited time frame for collection is available.

Therefore you might have to stop the process before all those stats are collected.

But now you have to take care that the remaining stats will be collected during the next run (and probably with a higher priority).


Introducing the queue table

Queue tables were implemented in V2R6, but they are still one of Teradata's lesser-known features. Their main purpose is to allow asynchronous event processing in FIFO order, and this perfectly matches our requirements:

  • When the process stops the remaining rows are still in the queue and will be processed first when the next run starts.
  • More statistics might be collected in the same period when multiple processes consume rows from the queue (this is really hard to implement without a queue - how to decide which stats should be collected by which process?).
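Under the hood, a consumer retrieves (and removes) the row with the oldest QITS via Teradata's SELECT AND CONSUME, which blocks until a row arrives when the queue is empty; the TOP 1 is mandatory. A minimal sketch of the consume step:

SELECT AND CONSUME TOP 1
       DatabaseName, TableName, IndexName, ColumnName, CollectLevel, SampleSize
FROM CollectStatsQueue;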

As an additional benefit, ETL jobs requiring some final COLLECT STATS steps might finish earlier when those statements can run asynchronously via the queue.


The attached script creates this queue table (CollectStatsQueue) plus two log tables (CollectStatsLog and CollectStatsLogDetails), a Stored Procedure (CollectStatsProc) to process the queue and a macro (StopQueue) to stop the processing. Syntax is based on a TD13.0 running in Teradata mode, older releases would require some modifications.

CollectStatsQueue

This is the main driver table for the collection process, a COLLECT STATS is triggered for each inserted row.
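The attached script contains the actual DDL; as a sketch only, a queue table is an ordinary table created with the QUEUE option, where the first column must be the QITS timestamp defined exactly as below (the remaining columns and their data types are assumptions based on the descriptions in this post):

CREATE TABLE CollectStatsQueue, QUEUE
( QITS         TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6)
, DatabaseName VARCHAR(30)  NOT NULL
, TableName    VARCHAR(30)  NOT NULL
, IndexName    VARCHAR(30)
, ColumnName   VARCHAR(4000)
, CollectLevel CHAR(1)      NOT NULL
, SampleSize   BYTEINT
);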

QITS

When your current process results in an ordered list, this order of execution is maintained by setting the queue's "Query Insertion Timestamp" column to

CURRENT_TIMESTAMP
 + (ROW_NUMBER() OVER (ORDER BY whatever_determines_the_order) * INTERVAL '0.000001' SECOND)


For a single-row insert a higher or lower priority might be assigned by simply adjusting this value, e.g.

CURRENT_TIMESTAMP - INTERVAL '10' MINUTE AS QITS -- higher priority
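Put together, a single-row insert might look like this (database, table and column names are illustrative placeholders):

INSERT INTO CollectStatsQueue
( QITS, DatabaseName, TableName, IndexName, ColumnName, CollectLevel, SampleSize )
VALUES
( CURRENT_TIMESTAMP - INTERVAL '10' MINUTE  -- jump ahead of rows queued "now"
, 'MyDatabase', 'MyTable'
, NULL                                      -- no index name
, 'TransactionDate'
, 'C'                                       -- column-level stats
, NULL                                      -- no sampling
);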

DatabaseName

TableName

Self-explanatory

IndexName

Optional; when supplied it is used instead of the column list

ColumnName

Comma-delimited list of column names (for column and index level stats)

CollectLevel

Collect on T (Table), C (Column) or I (Index) level

SampleSize

Uses SAMPLE when set to any non-NULL value
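For each consumed row the procedure presumably builds a statement matching the CollectLevel; the exact text comes from the attached script, but the three shapes look like this (object names are placeholders):

-- CollectLevel = 'T': refresh all existing stats on the table
COLLECT STATISTICS ON MyDatabase.MyTable;

-- CollectLevel = 'C': column (or multi-column) stats
COLLECT STATISTICS ON MyDatabase.MyTable COLUMN (TransactionDate);

-- CollectLevel = 'I': index stats, using IndexName when supplied
COLLECT STATISTICS ON MyDatabase.MyTable INDEX MyIndex;

-- a non-NULL SampleSize adds USING SAMPLE (pre-TD14 syntax)
COLLECT STATISTICS USING SAMPLE ON MyDatabase.MyTable COLUMN (TransactionDate);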

CollectStatsLog

Records info about every run of the CollectStatsProc including the number of statistics collected and the number of errors during the collect.

CollectStatsLogDetails

Records info for every failing (optionally also successful) COLLECT STATS.

CollectStatsProc

This Stored Procedure actually submits the COLLECT STATS for each row in the queue.

IN MaxTS

Process will stop when a COLLECT finishes after MaxTS or all rows in queue are processed.

If MaxTS is set to NULL the process will run until the StopQueue macro is executed.

IN LogDetails

If set to 'Y' info about each processed row is written to CollectStatsLogDetails, otherwise only failed collects.

IN MonitorInterval

Updates the CollectStatsLog table every MonitorInterval minutes with the number of statistics successfully collected or failed.

OUT RC

SQL error code; the process stops when there's any error outside of the actual COLLECT statement.

OUT Msg

Result of the execution: Success/Warning/Failed plus additional info.

StopQueue

Macro to stop all SPs processing the queue: invoke EXEC StopQueue; from any session.
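The attached script contains the real implementation; purely as an assumption (not necessarily how the script works), one common pattern for stopping queue consumers is a sentinel row that the procedure recognizes and exits on:

REPLACE MACRO StopQueue AS
( -- '*STOP*' is a hypothetical marker value the procedure would check for;
  -- each sentinel row is consumed by (and stops) one running instance
  INSERT INTO CollectStatsQueue (DatabaseName, TableName, CollectLevel)
  VALUES ('*STOP*', '*STOP*', 'T');
);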


Usage examples

CALL CollectStatsProc(CURRENT_TIMESTAMP(2)+INTERVAL '1' HOUR,'Y',5, rc, msg);

Run for up to one hour, log all rows in the CollectStatsLogDetails table and update the CollectStatsLog table every 5 minutes.

Can be stopped earlier by simply running EXEC StopQueue; from any session.

CALL CollectStatsProc(NULL,'N',10, rc, msg);

Run until StopQueue is executed, log only failed collects in the CollectStatsLogDetails table and update the CollectStatsLog table every 10 minutes.

Running in parallel

As the CollectStatsProc will probably be CALLed in a BTEQ script it's easy to run multiple copies:

.SET SESSIONS 2

.REPEAT 2

CALL CollectStatsProc...
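A more complete BTEQ sketch (logon details are placeholders; the CALL parameters follow the usage examples above; .SET SESSIONS must precede .LOGON, and .REPEAT applies to the following request):

.SET SESSIONS 2
.LOGON tdpid/username,password
.REPEAT 2
CALL CollectStatsProc(CURRENT_TIMESTAMP(2)+INTERVAL '1' HOUR,'N',5, rc, msg);
.LOGOFF
.QUIT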

Some remarks on running multiple instances of CollectStatsProc:

  • Collecting multiple stats on a small table using one row per index/column might result in a deadlock (when stats data is merged into the system tables); thus it's better to collect those stats on table level.
  • As stats collection is quite resource-intensive, you should check whether additional resources are available; multiple instances are mainly suitable for running in a dedicated batch window.

Sample script

The following example to recollect all existing stats within a database is based on my StatsInfo view:

INSERT INTO CollectStatsQueue
( QITS, DatabaseName, TableName, IndexName, ColumnName, CollectLevel, SampleSize)
SELECT
CURRENT_TIMESTAMP
+ (ROW_NUMBER() OVER (ORDER BY databasename, tablename) * INTERVAL '0.000001' SECOND)
,dt.*
FROM
(
SELECT DISTINCT
DatabaseName
,TableName
,CASE WHEN CollectLevel = 'T' THEN '' ELSE IndexName END AS IndexName
,CASE WHEN CollectLevel = 'T' THEN '' ELSE ColumnName END AS ColumnName
,CASE -- collect on table level for small tables
WHEN COUNT(*) OVER (PARTITION BY TableId) > 1
AND (NumRows < 100000
OR SUM(CollectDuration) OVER (PARTITION BY TableId) < INTERVAL '0:10' MINUTE TO SECOND)
THEN 'T'
WHEN StatsType IN ('Col', 'MCol', 'Part') THEN 'C'
ELSE 'I'
END AS CollectLevel
,SampleSize
FROM
StatsInfo
WHERE TableType <> 'TempTbl'
AND MissingStats = 'N'
AND DatabaseName = ...
) AS dt;

CALL CollectStatsProc(CURRENT_TIMESTAMP(2)+INTERVAL '10' MINUTE,'Y',1, rc, msg);

SELECT * FROM CollectStatsLog ORDER BY StartTS;

SELECT * FROM CollectStatsLogDetails ORDER BY StartTS;


Please provide any kind of feedback (especially about performance when running multiple instances).

19 Comments
Senior Supporter
Very nice idea! Thanks for sharing
Thanks for sharing, but unfortunately not finding any attachment here.
Junior Contributor
Strange, there should be an attachment area right of the headline of this blog entry. You could try to empty the browser cache.

Otherwise this is the direct link:
http://developer.teradata.com/sites/all/files/CollectStatsQueue.txt

Dieter
Enthusiast
Great post Dieter. Thank you for sharing!
Enthusiast
Thanks for sharing.

Hi,

I am facing an issue in running a teradata procedure.
Getting the error code as 3807,

This procedure is called via informatica mapping and this is a monthly process, suddenly having issue in inserting to a table through the proc.

Till last month this was running fine.

Please help me in resolving this issue.

Thanks,
Sasikala V
Enthusiast
Verify once, that the procedure and the table are qualified with the databasename. This may be one of the reasons behind the issue.
Enthusiast
Nice idea - We can also use other parameters. I have been able to get very good results considering other factors Like
- Update priority based on Frequency_Of_use (Partition, PI and PPI can get special attention)
- Consider Table / Column level Stats Collection.
- Set a Max limit on CPU to be used by Stats Process.
- Increment the priority of left overs to avoid starvation.
- split the table using MOD function to run # parallel threads based on state of system.

Najeeb
Enthusiast

Excellent demonstration of a practical use of a Queue table, exactly what I've been looking for.

Thanks for the informative post Dieter.

-Mike

Like the idea of the Stop Queue Macro, Dieter.

Think I'll take your concept and use it with my 'mixed workload' application that I'm currently building. 

Probably what I'll do is allow the Stored Procedure - the one which does the Consuming - to run 'forever',  unless we decide (either by choice, or by periodic necessity) to Stop the Queue.

So, thank you.

Enthusiast

Yes, the Stop Queue Macro!!!!! amazing. I was in a different world before this :).

Visitor

Dieter,

Are there any recommended changes to this process due to 14.10 new features?  Especially around the area of collecting stats at "table level" with the new syntax, i.e. collect stats column (col1), column (col2), index (column 6) on table...

Thanks!

Kevin

Junior Contributor

Hi Kevin,

sorry for the late answer, I overlooked your comment...

This works as-is in 14.10, but I didn't find a way to code combining multiple related stats (e.g. (col1), (col2), (col1,col2)) into one statement to get the benefits of the new optimization.

Visitor
SELECT
   CURRENT_TIMESTAMP  + (ROW_NUMBER() OVER (ORDER BY databasename, tablename) * INTERVAL '0.000001' SECOND),
   dt.*
FROM
 (
   SELECT DISTINCT
      DatabaseName                 
      ,TableName                    
      ,CASE WHEN CollectLevel = 'T' THEN '' ELSE IndexName END AS IndexName
      ,CASE WHEN CollectLevel = 'T' THEN '' ELSE ColumnName END AS ColumnName
      ,CASE -- collect on table level for small tables
          WHEN COUNT(*) OVER (PARTITION BY TableId) > 1 AND (NumRows < 100000 OR SUM(CollectDuration) OVER (PARTITION BY TableId) < INTERVAL '0:10' MINUTE TO SECOND) THEN 'T'
          WHEN StatsType IN ('Col', 'MCol', 'Part') THEN 'C'
          ELSE 'I'
       END AS CollectLevel
      ,SampleSize                   
   FROM
     sysdba.StatsInfo
   WHERE TableType <> 'TempTbl'
   AND MissingStats= 'N'
   AND DatabaseName = ....
    ) AS dt

Thanks Dieter!  No Problem! 

I was just testing the below/above/attached query for insert into the queue table, but it fails with multiple columns not existing in the statsinfo view.  The view here at this site might have been modified or was given to them by Teradata recently, I am not sure as I just started a few months ago.  Anyway the columns that are missing are numrows, collectduration & missingstats.  Do you have the original statsinfo view?

Also, they told me that Teradata had told them that it was better to collect at the table level (it was faster) than collecting at column level.  True???

Thanks!

Kevin

Visitor

Sorry, also samplesize column was missing from the statsinfo view as well

Kevin

Visitor

Dieter,  I just found your new documents on the statsinfo (from above link to statsinfo view), but with these changes how would you implement a change to the insert into the queue table?

Thanks again!

Kevin

Junior Contributor

Hi Kevin,

this queue approach was based on TD13/13.10, but my StatsInfo query changed in TD14, I removed some columns, see New StatsInfo query for TD14 for more details.

For TD14 it is actually recommended to collect multiple stats in a single statement. Not necessarily on table level, but for those which result in a more efficient collection (that's why I wrote I don't know how to find those which benefit from combining them):

New opportunities for statistics collection in Teradata 14.

And in TD14.10 you should look at AutoStats instead, which really simplifies collection:

Easing Into Using the New AutoStats Feature

Visitor

Thanks Dieter!

Enthusiast

I need to ask a question but not getting any option to ask...