Incremental statistics partitioned tables


We have been struggling to minimize the time it takes to run a full statistics collection on a 30-billion-record fact table that is currently partitioned into chunks of roughly 500 million records.


Every week, a new chunk may be added to the fact table, and some specific queries take a long time if statistics are not recollected after this addition. The issue is that a full stats run takes about 44 hours to complete!


How can we work around this issue? Is there a way to perform an incremental statistics collection instead of a full one? It does not seem reasonable to scan all the data again just to refresh stats.

Teradata Employee

Re: Incremental statistics partitioned tables

Is the full stats run a recollection at the table level?

If you know which statistics directly affect query performance, you only need to recollect those.

Are you using thresholds for frequency and percent change? How about sample statistics?
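For reference, the options above look something like the following on recent Teradata releases (14.10 and later); the table and column names here are placeholders, not from the original posts:

```sql
-- Recollect only the statistics that matter to the affected queries
COLLECT STATISTICS COLUMN (txn_date) ON sales_fact;

-- Let the optimizer skip recollection until ~5% of rows have changed
COLLECT STATISTICS
  USING THRESHOLD 5 PERCENT
  COLUMN (store_id)
  ON sales_fact;

-- Scan only a sample of rows instead of the full table
COLLECT STATISTICS
  USING SAMPLE 10 PERCENT
  COLUMN (customer_id)
  ON sales_fact;
```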




Re: Incremental statistics partitioned tables

1) Yes, the collection is done at the table level, covering about 60 columns out of roughly 200.


2) We are removing some columns from the collection, but the execution time is still long. This pruning process may take months before we figure out which columns can be left out.


3) We are using thresholds, but this only delays the full stats run by a few weeks, until changes exceed 5%; it then takes the same 44 hours over the weekend and conflicts with our backup schedule.


3.1) Sample stats are not currently used because those columns contain many repeated values, and as far as we can tell from the docs, sampling is not advisable in that case.


We are considering a more labor-intensive approach: splitting the fact table in two (2). The first table ("hot") will hold only the last 12-24 chunks, which are updated most often, while the older partitions will be kept in a second ("cold") table. Users will see a single picture of all the data through a VIEW that unions both tables. Full stats will run much faster on the newer/updated chunks, while the "cold" table will keep an almost frozen stats profile. Every semester, data will be moved from the "hot" table to the "cold" table and a full stats run will follow on the "cold" table, but that will be a scheduled activity occurring far less frequently.
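A rough sketch of that hot/cold split in Teradata-style SQL; all table, view, and column names here are invented for illustration:

```sql
-- "Hot" table: only the most recent chunks, restatted weekly
CREATE TABLE sales_fact_hot AS sales_fact WITH NO DATA;

-- "Cold" table: historical chunks with near-frozen statistics
CREATE TABLE sales_fact_cold AS sales_fact WITH NO DATA;

-- Single user-facing view over both tables
REPLACE VIEW sales_fact_v AS
  SELECT * FROM sales_fact_hot
  UNION ALL
  SELECT * FROM sales_fact_cold;

-- Every semester: migrate aged chunks, then restat only the cold table
INSERT INTO sales_fact_cold
  SELECT * FROM sales_fact_hot
  WHERE txn_date < CURRENT_DATE - INTERVAL '12' MONTH;

DELETE FROM sales_fact_hot
  WHERE txn_date < CURRENT_DATE - INTERVAL '12' MONTH;

-- Recollect all statistics already defined on the cold table
COLLECT STATISTICS ON sales_fact_cold;
```

One trade-off to keep in mind: the optimizer sees two tables behind the view, so join and aggregation plans against the view should be re-checked after the split.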