Sampled Stats vs Time trade-off

Enthusiast

Sampled Stats vs Time trade-off

Our goal is to minimize stats collection time.

We currently use a combination of full / sampled stats based on row count, as sketched below:

- SAMPLE 50% for all of a table's statistics combinations for tables > 5 billion rows

- FULL for all of a table's statistics combinations for tables < 5 billion rows
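
For illustration, something like this is what we run today (Teradata 14 syntax; table and column names are just placeholders, and as far as I know on 13/13.10 only plain USING SAMPLE without an explicit percentage is available):

    /* Tables > 5 billion rows: sampled stats per column combination */
    COLLECT STATISTICS USING SAMPLE 50 PERCENT
        COLUMN (col1, col2) ON BigTable;

    /* Tables < 5 billion rows: full stats on the same combination */
    COLLECT STATISTICS
        COLUMN (col1, col2) ON SmallTable;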

This still takes a long time, so we came across this idea to minimize it:

- we will not decide full or sample by total row count, but by the number of values in each stats combination (a sketch follows the examples)

Ex:

if # of rows is 2 billion, but # of values for stats combination (col1,col2) = 10 billion

--> then we use sampled stats for this column combination specifically

if # of rows is 10 billion, but # of values for stats combination (col3,col4) = 2 billion

--> then we will not use sampled stats for this column combination specifically
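
A sketch of the check I have in mind, assuming "# of values" means the distinct value count of the column combination (all names are placeholders):

    /* Count the distinct (col1, col2) combinations; if the count is
       above our 5 billion threshold we would collect sampled stats
       for this combination, otherwise full stats */
    SELECT COUNT(*) AS value_cnt
    FROM (SELECT col1, col2
          FROM MyTable
          GROUP BY 1, 2) AS dv;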

Is it logical to use this approach, or do we need to stick to the total # of rows for the table?

Senior Apprentice

Re: Sampled Stats vs Time trade-off

Did you read Carrie Ballinger's "Statistics Recommendations" blogs for your release (13/13.10/14)?

For some really large tables there's no other way than sample stats.

You just have to be careful when a column is correlated to a partitioning column, e.g. partitioning by date plus another column year_month. Never collect sampled stats on year_month without also including the date, because sampling just starts reading the first blocks, and a partitioned table is sorted by partition number...
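
A sketch of what I mean (hypothetical names, Teradata 14 syntax):

    /* sales_date is the partitioning column, year_month is derived
       from it. Include the date in the sampled combination instead of
       sampling year_month alone, so the sample is not limited to the
       first (oldest) partitions */
    COLLECT STATISTICS USING SAMPLE 10 PERCENT
        COLUMN (sales_date, year_month) ON SalesHistory;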

Dieter

Enthusiast

Re: Sampled Stats vs Time trade-off

Thanks Dieter, Carrie's blog is very useful, and not sampling date columns is on our plan now.

I've noticed that when reducing the sample size, the number of values computed decreases, which is to be expected.
My concern about the point you mentioned: the sample always takes the first % of blocks specified, which can be the first / last blocks inserted.

For columns with very few distinct values (e.g. only 10 values for the column), a possibly bad plan may occur once one of these 10 values is missed in the sampled stats and then used in a predicate;
which is what the accuracy / stats-time trade-off is all about.

My question: is there any way we can change the way the sample is collected, other than picking the first blocks, e.g. random blocks for the same percentage?
Senior Apprentice

Re: Sampled Stats vs Time trade-off

For a table with a PI but no partitioning the sampling approach is usually ok: the table is sorted by RowHash, i.e. not by insert order, so it looks quite random (for a NoPI table it *is* chronological order).

When you SAMPLE on a partitioning column in a PPI table the optimizer is smart enough to know about that and samples the first x percent from *each* partition.
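
E.g. (hypothetical names again):

    /* sales_date is the partitioning column: here the sample is taken
       from the first x percent of *each* partition, not just the first
       blocks of the table */
    COLLECT STATISTICS USING SAMPLE 10 PERCENT
        COLUMN (sales_date) ON SalesHistory;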

But there's no way to change the optimizer's strategy (although it would be nice to have, given the problem with correlated dates).

Dieter