Huge Cardinality estimates for TIMESTAMP COLUMNS

Database
Enthusiast

Huge Cardinality estimates for TIMESTAMP COLUMNS

estimated cardinality for the following predicate towards timestamp column is far too high vs date column

1) Table DDL:

--------------------------------------

CREATE MULTISET TABLE tdb.subscription_offering_hist ,NO FALLBACK ,

xxxx    (

      Subscription_Id INTEGER,

      Offering_Id INTEGER,

      Subscription_Offering_Start_Dt TIMESTAMP(0),

......

     DWH_Entry_Date DATE FORMAT 'YY/MM/DD')

PRIMARY INDEX ( Subscription_Id );

2) predicate towards date column:

-----------------------------------------

SELECT count(Subscription_Id)

FROM devdb.soh

WHERE  CAST( '2012-05-31' AS DATE )  

BETWEEN DWH_Entry_Date

AND DWH_Entry_Date + 150

;

Actual rows returned : 162,480

Estimated rows returned: 161,085

3) predicate towards Timestamp column:

-----------------------------------------

SELECT Subscription_Id

FROM devdb.soh

WHERE CAST( CAST( '2012-05-31' AS DATE ) AS TIMESTAMP(0))

BETWEEN subscription_offering_start_dt

AND subscription_offering_start_dt  + ((TIME '23:59:59'  - TIME '00:00:00') HOUR TO SECOND)  

Actual rows returned : 1,724

Estimated rows returned: 96,960

What's strange that i've tried different values for the timestamp predicate and whichever the value use; optimizer always estimates 96,960 row; while it;s not the case if we run the predicate for date column; different date sure return different estimations.

i've collected needed stats; also this is a test table so it;s not a corrupted stats problem since i've dropped & recreated stats on original table and same issue presisted.

Regards

7 REPLIES
Senior Apprentice

Re: Huge Cardinality estimates for TIMESTAMP COLUMNS


When you check Explain you should see different confidence levels:

HIGH for #2

NO for #3, this indicates that stats could not be used, so it defaults to 1/3 of the rows due to the BETWEEN



For #2 the optimizer rewrote your condition first to

WHERE DWH_Entry_Date BETWEEN   

DATE '2012-05-31' - 150 AND DATE '2012-05-31'



and finally to

WHERE DWH_Entry_Date BETWEEN   

DATE '2012-01-02' - 150 AND DATE '2012-05-31'



But this kind of optimization is only done for INTEGERs and DATEs, not for TIMESTAMPs.



You have to rewrite it on your own to remove the calculation on the DWH_Entry_Date column for #3:



WHERE subscription_offering_start_dt

BETWEEN CAST(DATE '2012-05-31' AS TIMESTAMP(0))

AND CAST(DATE '2012-05-31' AS TIMESTAMP(0)) + INTERVAL '23:59:59' HOUR TO SECOND



or



WHERE subscription_offering_start_dt

BETWEEN TIMESTAMP '2012-05-31 00:00:00'

AND TIMESTAMP '2012-05-31 23:59:59'



This should result in HIGH confidence again.



Dieter


Enthusiast

Re: Huge Cardinality estimates for TIMESTAMP COLUMNS

Thank You Dieter for the important information:

However:

SELECT Subscription_Id

FROM devdb.soh

WHERE CAST( CAST( '2012-05-31' AS DATE ) AS TIMESTAMP(0)) + ((TIME '23:59:59'  - TIME '00:00:00') HOUR TO SECOND)  

BETWEEN subscription_offering_start_dt

AND subscription_offering_start_dt  + ((TIME '23:59:59'  - TIME '00:00:00') HOUR TO SECOND)  

EST: 76,574,794

ACT:  1,956,001

SELECT COUNT(Subscription_id) FROM  vdb.subscription_offering_hist c

WHERE c.subscription_offering_start_dt

BETWEEN TIMESTAMP '2012-05-31 00:00:00'

AND TIMESTAMP '2012-05-31 23:59:59';

EST: 74,069

ACT: 1,956,001

>> The size of Spool 1 is estimated with high confidence to be 74,069 rows !

Senior Apprentice

Re: Huge Cardinality estimates for TIMESTAMP COLUMNS

What's your TD release?

The high confidence indicates stale stats, could you show the output of HELP STATS for vdb.subscription_offering_hist?

Dieter

Enthusiast

Re: Huge Cardinality estimates for TIMESTAMP COLUMNS

i thought high confidence means stats are good  current !

we are running 13.10

 Date Time              Unique Values     Column Names

------ ------                ------------------   --------------------

12/07/14 23:07:50           43,775,363 Subscription_Id

12/07/13 16:13:55           151            Offering_Id

3 12/07/13 16:14:02         9,396,526   Subscription_Offering_Start_Dt

4 12/07/13 16:14:03         1,255,879   Subscription_Offering_End_Dttm

5 12/07/15 09:11:36                    1 PARTITION

Senior Apprentice

Re: Huge Cardinality estimates for TIMESTAMP COLUMNS

Sorry, i wasn't clear. Of course "high confidence" usually indicates that there are stats (and the optimizer trusts them).

But in this special case (EST: 74,069 vs. ACT: 1,956,001) i assumed the stats could be outdated, resulting in wrong estimates.

Your stats are current, so this under-estimation by the factor of 26 must have a different reason:

Is there an even distribution of values or are offerings sent on specific days within a month?

Are the stats sampled?

When you incease the range of dates is the estimate closer to the actual number

Dieter

Enthusiast

Re: Huge Cardinality estimates for TIMESTAMP COLUMNS

NO it;s not sampled.

with wider range ( the whole month)  the ratio changed from 1:10

SELECT Subscription_id FROM  tdb.subscription_offering_hist c

WHERE c.subscription_offering_start_dt

BETWEEN TIMESTAMP '2012-05-01 00:00:00'

AND TIMESTAMP '2012-05-31 23:59:59'

;

EST: 97,176,312

ACT: 11,057,406

i still think there another since we don't encounter such issue with date column.

Senior Apprentice

Re: Huge Cardinality estimates for TIMESTAMP COLUMNS

This is strange,

- increasing the range from 1 day to 31 days: 31x

- actual rows: 6x (-> not evenly ditributed)

- estimated rows: 1300x (74,069 -> 11,057,406)

Might be a problem specific to Timestamps and/or your actual data, did you already inspect the stats using "HELP STATS tdb.subscription_offering_hist COLUMN subscription_offering_start_dt"?

Maybe you should ask TD support if this is a known problem.

Dieter