I want some info on stats collection. I am aware of the following facts:
1) How stats are collected and how stats help the optimizer choose a better plan. 2) The interval concept in stats. 3) The frequency at which stats should be refreshed.
But I am working on an automation process in which I am trying to calculate the % data change in a table. Only if this is more than 10% do I want to collect stats.
This way I can give a bigger window to my loads, and my CPU will be free for other activities. Other than these, what other benefits can I get by not refreshing table stats when the % data change is not more than 10%?
A slightly odd question, but please reply as per your understanding.
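To make the question concrete, the threshold check described above could be sketched as follows. This is a minimal Python illustration; the 10% figure comes from the question, but the function names and the idea of comparing row counts (rather than some other change metric) are assumptions, not a Teradata API:

```python
def pct_change(old_rows: int, new_rows: int) -> float:
    """Percentage change in row count since the last stats collection."""
    if old_rows == 0:
        return 100.0  # treat an empty baseline as fully changed
    return abs(new_rows - old_rows) / old_rows * 100.0

def should_collect_stats(old_rows: int, new_rows: int, threshold: float = 10.0) -> bool:
    """Collect stats only when the data change exceeds the threshold."""
    return pct_change(old_rows, new_rows) > threshold
```

For example, growth from 1,000,000 to 1,050,000 rows is a 5% change and would skip the refresh, while growth to 1,200,000 rows (20%) would trigger it.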
As I understand it, stats collection provides the rows per value for the intervals in the table, as well as the min and max values of the column. This helps the optimizer determine the cardinality that results from joining two tables together. Without stats, the optimizer must rely on random-AMP sampling to obtain these values, which can result in a less-than-optimal query plan.
Regarding the frequency of stats collection, I tend to use the growth in table size as well as the age of the statistics. Right now I compare the last access date and the date of the oldest stats collected on the table to decide when to refresh the table's stats.
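The age-based decision described above might look like the sketch below. The 30-day cutoff is an arbitrary example, and in a real Teradata environment the two dates would come from the dictionary views, which are not shown here; this only illustrates the comparison logic:

```python
from datetime import date, timedelta

def stats_are_stale(last_access: date, oldest_stats: date, today: date,
                    max_age: timedelta = timedelta(days=30)) -> bool:
    """Refresh stats when the table is still in use but its oldest
    statistics are older than max_age (30 days is an example value)."""
    recently_used = last_access >= today - max_age   # table is actively queried
    stats_too_old = today - oldest_stats > max_age   # oldest stats have aged out
    return recently_used and stats_too_old
```

The idea is to avoid refreshing stats on tables nobody queries anymore, even if their statistics are ancient.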
If it's a mid-sized table, you should collect full stats. If it's a very large EDW, then for tables in the tens of GB (for example, a 50 GB table) where you think the data has changed by more than 10%, it's better to go for random-AMP sampling; collecting full stats on those big tables would definitely be both time- and CPU-consuming.
Since you are automating the process, you can do this: 1. Know when your ETL starts and ends. Based on the non-ETL window, you can allot 2 hours to collect stats on all the tables you think need it. 2. Write the shell script so that if the collect stats run overshoots the 2-hour window, the job aborts.
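The abort-on-overrun idea in step 2 can be sketched with Python's `subprocess` timeout instead of a raw shell construct. This is only an illustration of the window logic; the command to launch (for example, a BTEQ script that runs the COLLECT STATISTICS statements) is a placeholder, not shown here:

```python
import subprocess

WINDOW_SECONDS = 2 * 60 * 60  # the 2-hour collect-stats window

def run_collect_stats(cmd: list[str], window: int = WINDOW_SECONDS) -> bool:
    """Run the collect-stats job; kill it and report failure if it
    overshoots the window."""
    try:
        subprocess.run(cmd, check=True, timeout=window)
        return True
    except subprocess.TimeoutExpired:
        # Job overshot the window; subprocess.run has already killed it.
        return False
    # Note: a non-zero exit (CalledProcessError) is deliberately left
    # to propagate, so a failed job is distinguishable from a timeout.
```

In a real deployment `cmd` would invoke whatever client utility drives the stats collection; the two-hour value would come from the known non-ETL window.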
For what it's worth, in my current environment I only run a collect stats process once a week, on the weekend. Sure, there are environments where you may not have that luxury. And as teradata_dba mentioned, on extremely large tables you can rely on either random-AMP sampling or consider using a sample percentage of the overall table when collecting stats. Using a sample for your stats collection will reduce the associated overhead.
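Pulling the thread's advice together, the size-based choice between full stats, sampled stats, and random-AMP sampling could be sketched as below. The GB thresholds and the sample percentage are illustrative assumptions for the automation script, not Teradata recommendations:

```python
def stats_strategy(table_gb: float) -> str:
    """Pick a collection strategy from table size (thresholds are examples)."""
    if table_gb < 10:
        return "FULL"        # small/mid table: full stats are affordable
    if table_gb < 50:
        return "SAMPLE 10"   # large table: sampled stats cut the CPU cost
    return "RANDOM_AMP"      # very large table: rely on random-AMP sampling
```

An automation job could combine this with the 10% change check: skip the table entirely below the threshold, otherwise collect using whichever strategy its size dictates.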