Query on Skew-Sensitivity in the TD12 above Optimizer

Database
Enthusiast

Query on Skew-Sensitivity in the TD12 above Optimizer

Hi All,

I was reading Carrie's blog on skew sensitivity of TD12 optimizer.

http://developer.teradata.com/database/articles/skew-sensitivity-in-the-optimizer-new-in-teradata-12

Here at one point she mentioned, 

"Instead of assuming each AMP will receive 1.1 millions rows during row redistribution, the Teradata 12 optimizer assumes a worst-case redistribution of 6 million rows for at least one AMP. It uses that worst-case number of rows to adjust the estimated cost of the row redistribution choice."

Assume that TableA PI column is joined with TableB non-PI column. Definitely TableB will be redistributed to all AMP. Now TableB is a very large table. 

As per my understanding in v2r6 it would have assumed that each AMP has received those huge rows and processing in each AMP would have been skewed. (performance of all AMP is equivalent to the slowest AMP).

Does that not happen any more in optimizer?

I mean, currently (from TD12 onwards) will the optimizer be able to pin-point the worst case scenario of row redistribution and accordingly will change its method so that the processing is similar in each AMP?

Please help me to clarify my understanding.

Thanks

Santanu

Tags (3)
2 REPLIES
Enthusiast

Re: Query on Skew-Sensitivity in the TD12 above Optimizer

Hi All

I have not received any update on my above query. However, I found few things by myself. I am sharing that. Hope this might be helpful to others.

1. I create the below table.

CREATE TABLE SCPLN.CLASSES, NO FALLBACK

(

CLASSUID INTEGER NOT NULL,

COURSEUID INTEGER NOT NULL,

CREATEDDATE TIMESTAMP(6),

ETL_ACTION VARCHAR(1)

)

UNIQUE PRIMARY INDEX(CLASSUID)

;

where each course id may have many class id. So course:class has 1:many relation. Then I did

COLLECT STATS ON SCPLN.CLASSES COLUMN(COURSEUID);

COLLECT STATS ON SCPLN.CLASSES COLUMN(ETL_ACTION);

2. Now the 2nd table is

CREATE TABLE SCPLN.COURSES, NO FALLBACK

(

COURSEUID INTEGER NOT NULL,

COURSENAME VARCHAR(20)

)

UNIQUE PRIMARY INDEX(COURSEUID)

;

This table has around 30000 rows with even distribution.

3. Then I ran the below SQL with join between 2 tables.

SELECT O.CLASSUID, O.CREATEDDATE, C.COURSEUID, C.COURSENAME

FROM SCPLN.COURSES C

INNER JOIN 

SCPLN.CLASSES O

ON C.COURSEUID = O.COURSEUID

WHERE O.ETL_ACTION = 'I'

;

The inital count of CLASSES was 246077 out of which COURSEUID = 383712 had count 4529.

First time the explain of SQL said,

4) We do an all-AMPs RETRIEVE step from SCPLN.O by way of an

     all-rows scan with a condition of ("SCPLN.O.ETL_ACTION = 'I'")

     into Spool 2 (all_amps), which is redistributed by the hash code

     of (SCPLN.O.COURSEUID) to all AMPs.  The size of Spool 2 is

     estimated with high confidence to be 246,077 rows (7,628,387

     bytes).  The estimated time for this step is 0.08 seconds. 

  5) We do an all-AMPs JOIN step from SCPLN.C by way of an all-rows

     scan with no residual conditions, which is joined to Spool 2 (Last

     Use) by way of an all-rows scan.  SCPLN.C and Spool 2 are joined

     using a single partition hash join, with a join condition of (

     "SCPLN.C.COURSEUID = COURSEUID").  The result goes into Spool 1

     (group_amps), which is built locally on the AMPs.  The size of

     Spool 1 is estimated with low confidence to be 246,077 rows (

     15,502,851 bytes).  The estimated time for this step is 0.03

     seconds. 

4. I inserted another 100000 rows (approx). Now the row count for CLASSES is 327292 out of which COURSEUID = 383712 had count 85744.

5. I ran the SQL once again. Below is the explain

  4) We execute the following steps in parallel. 

       1) We do an all-AMPs RETRIEVE step from SCPLN.O by way of an

          all-rows scan with a condition of ("SCPLN.O.ETL_ACTION =

          'I'") into Spool 2 (all_amps), which is built locally on the

          AMPs.  The size of Spool 2 is estimated with high confidence

          to be 327,292 rows (10,146,052 bytes).  The estimated time

          for this step is 0.02 seconds. 

       2) We do an all-AMPs RETRIEVE step from SCPLN.C by way of an

          all-rows scan with no residual conditions into Spool 3

          (all_amps), which is duplicated on all AMPs.  The size of

          Spool 3 is estimated with high confidence to be 2,758,860

          rows (68,971,500 bytes).  The estimated time for this step is

          0.06 seconds. 

  5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an

     all-rows scan, which is joined to Spool 3 (Last Use) by way of an

     all-rows scan.  Spool 2 and Spool 3 are joined using a single

     partition hash join, with a join condition of ("COURSEUID =

     COURSEUID").  The result goes into Spool 1 (group_amps), which is

     built locally on the AMPs.  The size of Spool 1 is estimated with

     low confidence to be 327,292 rows (20,619,396 bytes).  The

     estimated time for this step is 0.05 seconds. 

So, I think optimizer is smart enough to decide whether data distribution is getting skewed or not. Accordingly it will change the join plan to make sure join processing is similar on each AMP.

Please correct me if I am missing any point.

Thanks

Santanu

N/A

Re: Query on Skew-Sensitivity in the TD12 above Optimizer

Hi santanu,

what was the conclusion of carrie, in skewd AMP, the join rocessing for all AMP will be similar or join processing for each AMP will be skewed , as i know

'NUPI’s can create irregular distributions, called "skewed distributions". AMPs that have more than an average number or rows will take longer for full table operations than the other AMPs will. Because an operation is not complete until all AMPs have finished, this will cause the operation to finish less quickly due to being underutilized. ' plus as you said performance is equivalent to the slowest AMP.therefore join processing on each AMP will be skewed or the Optimizer as carrie said will make join processing on each AMP similar?