We have a situation where the statistics collection process we run after our load job is taking longer than the load job itself. The table has very good data distribution, but it contains close to 200 million rows and 400 columns. We are using the following as part of our COLLECT STATISTICS command:
COLLECT STATISTICS USING THRESHOLD 10 PERCENT AND THRESHOLD 7 DAYS
The problem, I believe, is that we refresh a good chunk of the table: we have to "replace" the last 2 years of history because the source system has no change data capture capabilities. Even though many of the rows have not actually changed, they still get deleted and re-inserted, and I'm not sure whether those replaced-but-identical rows count against the 10 percent change threshold.
In any event, I'm looking for ideas for improving this process without impacting end-user queries.
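One way to see whether the replace load is tripping the 10 percent change threshold is to check when each statistic on the table was actually last refreshed. A minimal sketch, assuming hypothetical database and table names (`mydb`, `big_table`); `DBC.StatsV` is the dictionary view that exposes statistics metadata, including the last collection timestamp:

```sql
-- Hypothetical names; shows when each statistic was last refreshed.
-- If the threshold is working, timestamps should lag the daily load.
SELECT StatsName, LastCollectTimeStamp, RowCount
FROM DBC.StatsV
WHERE DatabaseName = 'mydb'
  AND TableName = 'big_table'
ORDER BY LastCollectTimeStamp;
```

If every statistic shows a fresh timestamp after each load, the replaced rows are being counted as changed data and the THRESHOLD 10 PERCENT clause is not skipping any recollections.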
I would use SAMPLE STATISTICS wherever possible to reduce collection runtime and resource consumption, but first compare the output of USING SAMPLE against the full statistics to be sure there is no negative impact on query plans. I would start with SAMPLE STATISTICS on columns (indexed or not) which are unique or almost unique, since sampling estimates those reliably. The easiest way to compare the estimates from full and sampled statistics is to check the output of SHOW STATISTICS VALUES...
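As a sketch of that approach, with hypothetical names (`mydb.big_table`, a nearly unique `customer_id` column):

```sql
-- Collect sampled statistics on a nearly unique column (hypothetical names).
-- With no percent given, Teradata chooses the sample size; a fixed size
-- could also be requested, e.g. USING SAMPLE 10 PERCENT.
COLLECT STATISTICS USING SAMPLE
    COLUMN (customer_id)
    ON mydb.big_table;

-- Inspect the stored estimates (distinct values, high-frequency values)
-- and compare them against the output from a previous full collection.
SHOW STATISTICS VALUES COLUMN (customer_id) ON mydb.big_table;
```

If the sampled estimates for distinct values and skew are close to the full-collection numbers, the column is a safe candidate to keep on SAMPLE; if they diverge noticeably, revert that column to a full collection.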