Database

turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

02-10-2016
07:09 AM

02-10-2016
07:09 AM

Hi,

I have a row partitioned table with the following statistics, and execute below SQL:

COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;

SELECT * FROM TheDatabase.TheTable

WHERE COL1= 'XYZ';

The value 'XYZ' is not available, you can see it as well in below SHOW STATISTICS output:

SHOW STATISTICS VALUES COLUMN(COL1) ON TheDatabase.TheTable;

/** SummaryInfo **/

/* Version */ 6,

/* OriginalVersion */ 6,

/* DBSVersion */ '14.10.06.03',

/* UsageType */ 'D',

/* ComplexStatInfo */ 'ComplexStatInfo',

/* NumOfBiasedValues */ 4,

/* NumOfEHIntervals */ 9,

/* NumOfHistoryRecords */ 0,

/* HighModeFreq */ 26412500,

/* NumOfDistinctVals */ 13,

/* NumOfRows */ 65057255,

/** Biased: Value, Frequency **/

/* 1 */ 'Text0', 253267,

/* 2 */ 'Original', 26412500,

/* 3 */ 'Pay', 16767796,

/* 4 */ 'Reduce', 21611177,

/** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/

/* 1 */ 'Text1', 'Text1', 55, 55, 0, 0,

/* 2 */ 'Text2', 'Text2', 9840, 9840, 0, 0,

/* 3 */ 'Text3', 'Text3', 2, 2, 0, 0,

/* 4 */ 'Text4', 'Text4', 1965, 1965, 0, 0,

/* 5 */ 'XY.000360', 'XY.000360', 1, 1, 0, 0,

/* 6 */ 'XY.000390', 'XY.000390', 10, 10, 0, 0,

/* 7 */ 'XY.000403', 'XY.000403', 4, 4, 0, 0,

/* 8 */ 'XY.001249', 'XY.001249', 3, 3, 0, 0,

/* 9 */ 'Text5', 'Text5', 635, 635, 0, 0

The explain plan is the following:

The explain plan is the following:

Explain SELECT * FROM TheDatabase.TheTable

WHERE

COL1 = 'XYZ'

1) First, we lock a distinct TheDatabase."pseudo table"

for read on a RowHash to prevent global deadlock for

TheDatabase.TheTable.

2) Next, we lock TheDatabase.TheTable for read.

3) We do an all-AMPs RETRIEVE step from

TheDatabase.TheTable by way of an all-rows

scan with a condition of (

"TheDatabase.TheTable.COL1 = 'XYZ'") into

Spool 1 (group_amps), which is built locally on the AMPs. The

input table will not be cached in memory, but it is eligible for

synchronized scanning. The size of Spool 1 is estimated with high

confidence to be **5,004,405** rows (3,337,938,135 bytes). The

estimated time for this step is 37.22 seconds.

4) Finally, we send out an END TRANSACTION step to all AMPs involved

in processing the request.

-> The contents of Spool 1 are sent back to the user as the result of

statement 1. The total estimated time is 37.22 seconds.

As you can see: The optimizer estimates 5 Million rows with high confidence buit 0 rows are in reality returned.

I don't understand where this estimation comes from. I would have expected that the optimizer knows from the histogram(s) that no row with COL1= 'XYZ' exists. Do you have any idea?

I don't understand where this estimation comes from. I would have expected that the optimizer knows from the histogram(s) that no row with COL1= 'XYZ' exists. Do you have any idea?

Roland Wenzlofsky

5 REPLIES

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

02-10-2016
12:30 PM

02-10-2016
12:30 PM

Hi Roland,

the optimizer never fully trusts stats :-)

It seems to think that the user has a better knowledge and this value might exists, thus it's assuming the average number of rows per value.

In fact this is usually not a problem, if the value does not exist, the result set will be empty anyway. I prefer overestimating 1 row to be 5000000 over underestimating 5000000 to be 1 row.

I know this behaviour from DATEs, when you change the condition to a BETWEEN where both values are outside of the stats range you'll get back "low conficence 1 row" (unless the number of rows increased and extrapolation kicks in).

In your case *WHERE COL1 BETWEEN 'XYZ' AND 'XYZa'* might return the expected 1.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

02-11-2016
01:55 AM

02-11-2016
01:55 AM

Thanks Dieter.

I tried it with BETWEEN and you are right: the estimation is 1 row.

I agree that it's better to overestimate than to underestimate, especially as a wrongly chosen join type would probably have a very negative impact (for example if a product join is done as the optimizer thinks one of the tables is anyway empty).

Still, what looks strange to me is:

If I hit exactly one row in one of the histogram intervalls, the optimizer fully trusts into this value and estimates with high confidence 1 row (not considering extrapolation, etc.)

But if the selected value is not available in any of the histogram intervals, it is doing the "average rows per value" calculation.

Roland Wenzlofsky

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

02-11-2016
09:35 AM

02-11-2016
09:35 AM

If the value is within the range of known values but not one of the exactly known values the optimzer does this average calculation based on that specific interval **CEILING(OtherRows / OtherVals)***.*

If the value is outside of the known min/max range it's assuming **CEILING(NumOfRows / NumOfDistinctVals)**, in your case there's no range of values, all 13 values are exactly known.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

04-27-2016
09:52 AM

04-27-2016
09:52 AM

Hi all,

I've found this thread while looking for information on a similar problem.

The only difference is that, in my case, table is partitioned on the field used in the WHERE condition (actually table has a MLPPI, using a well-known formula based on RANGE_N, hashing functions and MOD operator). Stats on this field are collected and fresh.

This is what I get, using equality conditions:

Explain LOCK ROW ACCESS SEL * FROM

TheDB.TheTable

WHERE

field4 = 'N0N_3x15ting_val';

1) First, we lock TheDB.TheTable for access.

2) Next, we do an all-AMPs RETRIEVE step from 102 partitions of

TheDB.TheTable with a condition of (

"TheDB.TheTable.field4 =

'N0N_3x15ting_val'") into Spool 1 (group_amps), which is built

locally on the AMPs. The size of Spool 1 is estimated with high

confidence to be 22,946,272 rows (4,382,737,952 bytes). The

estimated time for this step is 2.61 seconds.

3) Finally, we send out an END TRANSACTION step to all AMPs involved

in processing the request.

-> The contents of Spool 1 are sent back to the user as the result of

statement 1. The total estimated time is 2.61 seconds.

And this is what I get using the BETWEEN as suggested by Dieter:

Explain LOCK ROW ACCESS SEL * FROM

TheDB.TheTable

WHERE

field4 BETWEEN 'N0N_3x15ting_val' AND 'N0N_3x15ting_vam';

1) First, we lock TheDB.TheTable for access.

2) Next, we do an all-AMPs RETRIEVE step from

TheDB.TheTable by way of an all-rows scan with

a condition of (

"(TheDB.TheTable.field4 >= 'N0N_3x15ting_val') AND

(TheDB.TheTable.field4 <=

'N0N_3x15ting_vam')") into Spool 1 (group_amps), which is built

locally on the AMPs. The size of Spool 1 is estimated with high

confidence to be 1 row (191 bytes). The estimated time for this

step is 1.88 seconds.

3) Finally, we send out an END TRANSACTION step to all AMPs involved

in processing the request.

-> The contents of Spool 1 are sent back to the user as the result of

statement 1. The total estimated time is 1.88 seconds.

One thing to be noticed is that (probably because of the BETWEEN) optimizer goes for an FTS, rather than getting a partition subset. Anyway and FWIW, estimates are more favourable in the latter case.

In your opinion, is it still acceptable going for an FTS (partitioned tables are often huge) and having a good estimate? Or is it better to overestimate and get a subset of data?

Thanks for reading and for answers!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

05-11-2017
04:59 PM

05-11-2017
04:59 PM

I agree in most cases the over estimation is probably the better way to error, but in some cases it is not. The situation that I am facing is on a left outer join where the join will fail to find a match 100% of the time because of a filter on a column (which has stats collected on), and then a column from the joined table is then used in a subsequent join. Because of the over estimation, the optimizer is thinking that first join will find matches a high percent of the time and uses a column from that joined table to redistibute spool on for the next join. Because the join failed to find any matches, the returned value is NULL and all of the rows in spool are being redistributed to one AMP. Without changing the SQL, the only way I've found so far to get around this is to insert one dummy row in the table with the value that the filter is looking for and collecting stats using no sample.