When sample statistics go wild

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Junior Contributor

The accuracy of sample statistics has been greatly improved over the last Teradata releases. So I usually try to use sample stats on most of the big tables and I found them to be reliable on many columns, not only the officially recommended unique or nearly-unique columns, e.g. on DATEs.

But there is a specific scenario when sample stats result in worst case optimizer plans:

The table is partitioned and there’s a dependency between the partitioning and the sampled column.


Background

The process used for sampling stats is not the same as SAMPLE in a SELECT, which is a truly random sample. It’s similar to a TOP, i.e. it simply starts scanning the first percents of the table. This would result in wrong data when the stats are on the partitioning column, so the optimizer is smart enough to recognize that and switches to scan the first percent of each partition.

Now consider following scenario:

CREATE MULTISET TABLE statstest(dt DATE, yearmonth INT)
PRIMARY INDEX (dt)
PARTITION BY RANGE_N(dt BETWEEN DATE '2010-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' DAY);

INSERT INTO statstest
SELECT
calendar_date
,EXTRACT(YEAR FROM calendar_date) * 100
+EXTRACT(MONTH FROM calendar_date)
FROM sys_calendar.calendar
WHERE calendar_date BETWEEN DATE '2010-01-01' AND CURRENT_DATE
SAMPLE WITH REPLACEMENT 100000;

COLLECT STATS USING SAMPLE 2 PERCENT COLUMN(yearmonth) ON statstest;

SELECT MIN(yearmonth), MAX(yearmonth), COUNT(DISTINCT yearmonth) FROM statstest;
201001 201410 58

SHOW STATS VALUES COLUMN(yearmonth) ON stats test;
...
/* MinVal */ 201001,
/* MaxVal */ 201002,
/* ModeVal */ 201001,
...
/* NumOfDistinctVals */ 2,

Ops, there are 58 months but according to the stats there are only two.

Why?

monthyear is directly correlated to dt and because all rows within a table are sorted by partition (in this case by date) only the oldes dates are fetched.

Any query with a WHERE-condition based on yearmonth outside of the estimated range might result in a really bad plan. For equality the optimizer will assume the average number of rows (50,000) but a BETWEEN will result in estimated with high confidence to be 1 row. You can imagine the performance of such a plan.

Solution

To solve this problem switch to full stats whenever you need to collect stats on a dependent column.

Caution: In TD14.10 the optimizer might automatically switch to sample stats causing this problem to appear a few weeks delayed. In that case better force full stats using the NO SAMPLE option.

12 Comments
Enthusiast

Hi Dieter,

I first collected sample (2% in this case) stat on table and then tried collected full stat using below SQL.

COLLECT STATS  COLUMN(yearmonth)  ON statstest ;

Now when I do help stat on table it still showing 2 distinct value for yearmonth. Do we need to first drop the STAT before switiching from sample stat to full stat?

Thanks,

Gyanendra

Junior Contributor

Hi Gyanendra,

in TD14+ the SAMPLE is automatically applied when you collect the next time. You need to switch it off explicitly using NO SAMPLE.

Enthusiast

Hi Dieter,

Thanks for the valuable information. Actually we are planning to enable Auto Stats from lower to higher environment. As part of testing, I did following testing on following cases to ensure whether everything is working as expected or not.

Case 1: Insert, update, delete operation - During collection, once system data change pattern reached, whether it is collecting or skipping for this DML operations.

Case 2: Analyze job - Whether it is giving right recommendations or not for all type of tables especially PPI tables, join indexes, etc

Case 3: Collect Job - Whether it is running the job as per the schedule or not and refreshing daily the PARTITION, Partition columns, NUSIs, sample statistics, DBC tables or not

Case 4: If it is a big table (>1000 rows/AMP) and already collected full statistics on full table, whether it is going for Sample stats or not

Case 5: If it is a small table (<1000 rows/AMP), whether it is going for full stats or not

Case 6: Whether it is giving recommendation on timestamp and collecting stats on Timestamp column or not

As per your experience, do you think, is there any other cases i need to do a test, so that we can have a smoother implementation of Autostats.

Thanks in advance.

Enthusiast

Hi Dieter,

Is there any solution to handle the Auto stats (via Auto scheduler) during maintenance window. the question here is, assume the stats are scheudled to run during this maintenance window where the system is down. Is there any solution to address this loss of stats.

Thanks in advance.

Junior Contributor

Sorry, but I don't have expertise in using AutoStats, yet.

Teradata Employee

Hi Harsha,

in the Autostats module in Viewpoint, you can choose which type of Sampling setting you want, by going to Actions -> Edit Collect Settings.

Regards,

E.

Enthusiast

Hello, would you be available for 1:1 training in Teradata?

Enthusiast

Hi Dieter,

Although I did Range partition a Date column (say Trx_Dt) and collected Partition stats, a simple "SELECT MIN(Trx_Dt) or MAX(Trx_Dt)" is doing FULL table scan (all-row scan)? I'm trying to understand why Teradata isn't knowing min/max values while it's a range partition. Through stats it has all the information - definitely it doen't have to scan through every partition for max(trx_dt). We have 7 billion record table with partition on Trx_Dt and each partition holds about 80 million records.

Thanks for your help with this!

Karteek

Enthusiast

Hello,

Thank you once more for all these new tricky updates on evert TD version.

One question ,  so we have to check all the PPI tables where the PI column is also a level of partition ?

Furthermore , this scenario is even more strange with Volatiles tables , there it works normally  :)

Thank you.

Enthusiast

Hi Dieter,

How to disable stats extrapolation in 15.00. Truning off All-AMP dynamic Amp sampling will help?

Thanks!

Junior Contributor

Afaik you can't turn of stats extrapolation and imho there's no reason to do so.

It's just a try to adjust the stats to the new count and if it's not working as expected you can simple recollect them.

All-AMP sampling simply increases the accuracy of the Random AMP Sample.

Enthusiast

Thank you Dieter