Enthusiast

Wrong Statistic Estimations of Optimizer Teradata 14.10?

Hi,

I have a row-partitioned table on which I collect the following statistics and then execute the SQL below:

COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;

SELECT * FROM TheDatabase.TheTable

WHERE COL1 = 'XYZ';

The value 'XYZ' does not exist in the table, as you can also see in the SHOW STATISTICS output below:

SHOW STATISTICS VALUES COLUMN(COL1) ON TheDatabase.TheTable;

 /** SummaryInfo **/ 

 /* Version               */ 6, 

 /* OriginalVersion       */ 6, 

 /* DBSVersion            */ '14.10.06.03', 

 /* UsageType             */ 'D', 

 /* ComplexStatInfo       */ 'ComplexStatInfo', 

 /* NumOfBiasedValues     */ 4, 

 /* NumOfEHIntervals      */ 9, 

 /* NumOfHistoryRecords   */ 0, 

 /* HighModeFreq          */ 26412500, 

 /* NumOfDistinctVals     */ 13, 

 /* NumOfRows             */ 65057255, 

 /** Biased: Value, Frequency **/ 

 /*   1 */   'Text0', 253267, 

 /*   2 */   'Original', 26412500, 

 /*   3 */   'Pay', 16767796, 

 /*   4 */   'Reduce', 21611177, 

 /** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/ 

 /*     1   */ 'Text1', 'Text1', 55, 55, 0, 0, 

 /*     2   */ 'Text2', 'Text2', 9840, 9840, 0, 0, 

 /*     3   */ 'Text3', 'Text3', 2, 2, 0, 0, 

 /*     4   */ 'Text4', 'Text4', 1965, 1965, 0, 0, 

 /*     5   */ 'XY.000360', 'XY.000360', 1, 1, 0, 0, 

 /*     6   */ 'XY.000390', 'XY.000390', 10, 10, 0, 0, 

 /*     7   */ 'XY.000403', 'XY.000403', 4, 4, 0, 0, 

 /*     8   */ 'XY.001249', 'XY.001249', 3, 3, 0, 0, 

 /*     9   */ 'Text5', 'Text5', 635, 635, 0, 0

The explain plan is the following:

Explain SELECT * FROM TheDatabase.TheTable

WHERE COL1 = 'XYZ'

  1) First, we lock a distinct TheDatabase."pseudo table"

     for read on a RowHash to prevent global deadlock for

     TheDatabase.TheTable.

  2) Next, we lock TheDatabase.TheTable for read.

  3) We do an all-AMPs RETRIEVE step from

     TheDatabase.TheTable by way of an all-rows

     scan with a condition of (

     "TheDatabase.TheTable.COL1 = 'XYZ'") into

     Spool 1 (group_amps), which is built locally on the AMPs.  The

     input table will not be cached in memory, but it is eligible for

     synchronized scanning.  The size of Spool 1 is estimated with high

     confidence to be 5,004,405 rows (3,337,938,135 bytes).  The

     estimated time for this step is 37.22 seconds.

  4) Finally, we send out an END TRANSACTION step to all AMPs involved

     in processing the request.

  -> The contents of Spool 1 are sent back to the user as the result of

     statement 1.  The total estimated time is 37.22 seconds.

As you can see, the optimizer estimates 5 million rows with high confidence, but in reality 0 rows are returned.

I don't understand where this estimate comes from. I would have expected the optimizer to know from the histogram(s) that no row with COL1 = 'XYZ' exists. Do you have any idea?

 

Roland Wenzlofsky
4 REPLIES
Senior Apprentice

Re: Wrong Statistic Estimations of Optimizer Teradata 14.10?

Hi Roland,

the optimizer never fully trusts stats :-)

It seems to assume that the user might have better knowledge and that this value might exist after all, so it falls back to the average number of rows per value.

In fact, this is usually not a problem: if the value does not exist, the result set will be empty anyway. I prefer overestimating 1 row as 5,000,000 to underestimating 5,000,000 rows as 1.

I know this behaviour from DATEs: when you change the condition to a BETWEEN where both values are outside of the stats range, you get back "1 row, low confidence" (unless the number of rows has increased and extrapolation kicks in).

In your case, WHERE COL1 BETWEEN 'XYZ' AND 'XYZa' might return the expected estimate of 1 row.
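
For example, the full statement would look like this (just a sketch, reusing the table and column names from your post):

Explain SELECT * FROM TheDatabase.TheTable

WHERE COL1 BETWEEN 'XYZ' AND 'XYZa';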

Enthusiast

Re: Wrong Statistic Estimations of Optimizer Teradata 14.10?

Thanks Dieter.

I tried it with BETWEEN and you are right: the estimate is 1 row.

I agree that it's better to overestimate than to underestimate, especially as a wrongly chosen join type could have a very negative impact (for example, a product join chosen because the optimizer thinks one of the tables is practically empty anyway).

Still, what looks strange to me is:

If I hit exactly one row in one of the histogram intervals, the optimizer fully trusts this value and estimates 1 row with high confidence (not considering extrapolation, etc.).

But if the selected value is not present in any of the histogram intervals, it falls back to the "average rows per value" calculation.
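
For illustration (based on the SHOW STATISTICS output in the first post), a query on one of the exactly known single-row values, e.g.

Explain SELECT * FROM TheDatabase.TheTable

WHERE COL1 = 'XY.000360';

should come back estimated as 1 row with high confidence, while the non-existing 'XYZ' gets the average calculation.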

Roland Wenzlofsky
Senior Apprentice

Re: Wrong Statistic Estimations of Optimizer Teradata 14.10?

If the value is within the range of known values but is not one of the exactly known values, the optimizer does this average calculation based on that specific interval: CEILING(OtherRows / OtherVals).
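
For example (hypothetical numbers, since in this thread's histogram all intervals have OtherVals = 0): an interval with OtherRows = 100 and OtherVals = 4 would give CEILING(100 / 4) = 25 rows for a value that falls into that interval but is not one of its exactly known values.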

If the value is outside of the known min/max range, it assumes CEILING(NumOfRows / NumOfDistinctVals). In your case there is no range of values at all; all 13 values are exactly known.
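
Plugging in the numbers from the SHOW STATISTICS output in the first post confirms this (a worked check, assuming the optimizer applied exactly that formula):

CEILING(NumOfRows / NumOfDistinctVals) = CEILING(65,057,255 / 13) = CEILING(5,004,404.2...) = 5,004,405

which is exactly the 5,004,405-row, high-confidence estimate in the original explain plan.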

Teradata Employee

Re: Wrong Statistic Estimations of Optimizer Teradata 14.10?

Hi all,

I found this thread while looking for information on a similar problem.

The only difference is that, in my case, the table is partitioned on the field used in the WHERE condition (actually the table has an MLPPI, using a well-known formula based on RANGE_N, hashing functions and the MOD operator). Stats on this field are collected and fresh.
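
For context, this is a purely illustrative sketch of what such an MLPPI definition typically looks like; the column names, date range and bucket count below are made up and are not my actual DDL:

CREATE TABLE TheDB.TheTable_demo
(
    field4    VARCHAR(30) NOT NULL,   -- the character field used in the WHERE condition
    txn_date  DATE NOT NULL,
    amount    DECIMAL(18,2)
)
PRIMARY INDEX (field4)
PARTITION BY (
    /* level 1: row partitioning on a date column */
    RANGE_N(txn_date BETWEEN DATE '2010-01-01' AND DATE '2019-12-31' EACH INTERVAL '1' MONTH),
    /* level 2: partitioning on the character column via HASHROW/HASHBUCKET and MOD */
    RANGE_N(HASHBUCKET(HASHROW(field4)) MOD 100 BETWEEN 0 AND 99 EACH 1)
);

With a layout like this, an equality predicate on field4 can be mapped to a hash bucket and allows partition elimination on the second level, while a range predicate (BETWEEN) on field4 generally cannot, which fits the two explain plans below.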

This is what I get, using equality conditions:

Explain LOCK ROW ACCESS SEL * FROM
TheDB.TheTable
WHERE
field4 = 'N0N_3x15ting_val';

  1) First, we lock TheDB.TheTable for access.
  2) Next, we do an all-AMPs RETRIEVE step from 102 partitions of
     TheDB.TheTable with a condition of (
     "TheDB.TheTable.field4 =
     'N0N_3x15ting_val'") into Spool 1 (group_amps), which is built
     locally on the AMPs.  The size of Spool 1 is estimated with high
     confidence to be 22,946,272 rows (4,382,737,952 bytes).  The
     estimated time for this step is 2.61 seconds.
  3) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 2.61 seconds.

And this is what I get using the BETWEEN as suggested by Dieter:

Explain LOCK ROW ACCESS SEL * FROM
TheDB.TheTable
WHERE
field4 BETWEEN 'N0N_3x15ting_val' AND 'N0N_3x15ting_vam';

  1) First, we lock TheDB.TheTable for access.
  2) Next, we do an all-AMPs RETRIEVE step from
     TheDB.TheTable by way of an all-rows scan with
     a condition of (
     "(TheDB.TheTable.field4 >= 'N0N_3x15ting_val') AND
     (TheDB.TheTable.field4 <=
     'N0N_3x15ting_vam')") into Spool 1 (group_amps), which is built
     locally on the AMPs.  The size of Spool 1 is estimated with high
     confidence to be 1 row (191 bytes).  The estimated time for this
     step is 1.88 seconds.
  3) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 1.88 seconds.

One thing to notice is that (probably because of the BETWEEN) the optimizer goes for a full-table scan rather than reading a partition subset. Anyway, and FWIW, the estimates are more favourable in the latter case.

In your opinion, is it still acceptable to go for an FTS (partitioned tables are often huge) in exchange for a good estimate? Or is it better to overestimate and read only a subset of the data?

Thanks for reading and for your answers!