Sliding Window Merge Join Question

Enthusiast

Sliding Window Merge Join Question

Hi,

I have a huge table with billions of rows that is joined against a table containing 350,000 rows. Both tables have the same PI and the same partitioning. Statistics are available and correct.

I would have expected a rowkey-based merge join, as PI and partitioning match.

But the Optimizer is doing the following:

It departitions the small table and then does a sliding-window merge join. I am on Teradata 14.10.

Why is this happening? Could this be less resource-consuming than the rowkey-based merge join?

Thanks in Advance

Roland

Roland Wenzlofsky
Senior Apprentice

Re: Sliding Window Merge Join Question

Hi Roland,

a rowkey-based merge join is the best you can get on a partitioned table, exactly the same as a PI-join on non-partitioned tables.

Did you double-check whether the partitioning column is also included in the join?

Otherwise, can you show the PPI definition and the actual join?
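For illustration, a minimal sketch (hypothetical table and column names) of a join that can qualify for a rowkey-based merge join, because it equates both the common PI and the partitioning column:

SELECT a.*
FROM   Tab1 a
JOIN   Tab2 b
ON     a.PI_Col   = b.PI_Col    -- common primary index column
AND    a.Part_Col = b.Part_Col  -- common RANGE_N partitioning column
;

If the partitioning column is missing from the join condition, the rows cannot be matched partition by partition, and the Optimizer typically falls back to a sliding-window merge join.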

Enthusiast

Re: Sliding Window Merge Join Question

Thanks Dieter. Here are the table definitions:

CREATE MULTISET TABLE CDR.Big ,NO FALLBACK ,

     NO BEFORE JOURNAL,

     NO AFTER JOURNAL,

     CHECKSUM = DEFAULT,

     DEFAULT MERGEBLOCKRATIO

     (

      SUBSCRIBER_ID INTEGER NOT NULL,

      START_DATE DATE FORMAT 'YYYY-MM-DD',

      START_TIME INTEGER FORMAT '99:99:99' COMPRESS 0,

      DURATION INTEGER COMPRESS

)

PRIMARY INDEX ( SUBSCRIBER_ID )

PARTITION BY RANGE_N(START_DATE  BETWEEN DATE '2015-03-01' AND DATE '2100-12-31' EACH INTERVAL '1' DAY );

CREATE MULTISET TABLE CDR.Small ,NO FALLBACK ,

     NO BEFORE JOURNAL,

     NO AFTER JOURNAL,

     CHECKSUM = DEFAULT,

     DEFAULT MERGEBLOCKRATIO

     (

      SUBSCRIBER_ID INTEGER NOT NULL,

      START_DATE DATE FORMAT 'YYYY-MM-DD' NOT NULL,

      START_TIME INTEGER FORMAT '99:99:99' NOT NULL,

      DURATION INTEGER

)

PRIMARY INDEX ( SUBSCRIBER_ID )

PARTITION BY RANGE_N(START_DATE  BETWEEN DATE '2015-03-01' AND DATE '2100-12-31' EACH INTERVAL '1' DAY );

This is the statement I am executing:

create multiset volatile table VTAB as

(

select

t01.*

FROM CDR.BIG          t01

JOIN CDR.SMALL        t02

ON    t01.START_DATE                = t02.START_DATE

AND   t01.START_TIME                = t02.START_TIME

AND   t01.SUBSCRIBER_ID             = t02.SUBSCRIBER_ID

) with data

primary index(SUBSCRIBER_ID)

on commit preserve rows;

As you can see from the explain output, the small table is partitioned exactly like the big table (small table: 70,000 rows; big table: hundreds of millions of rows). The PI is the same. Nevertheless, the small table is spooled locally and joined with a sliding-window merge join.

The only "special" thing is my data: the small table contains only 3 different dates, while the big table contains dates from many years. Could this be the reason?

1) First, we lock a distinct CDR."pseudo table" for read

     on a RowHash to prevent global deadlock for CDR.T01.

  2) Next, we lock a distinct SMALL."pseudo table" for read on

     a RowHash to prevent global deadlock for SMALL.T2.

  3) We lock CDR.T01 for read, and we lock SMALL.T2

     for read.

  4) We create the table header.

  5) We do an all-AMPs RETRIEVE step from SMALL.T2 by way of an

     all-rows scan with no residual conditions into Spool 2 (all_amps)

     (compressed columns allowed), which is built locally on the AMPs.

     Then we do a SORT to order Spool 2 by the hash code of (

     SMALL.T2.SUBSCRIBER_ID).  The size of Spool 2 is estimated with

     high confidence to be 74,894 rows (4,119,170 bytes).  The

     estimated time for this step is 0.01 seconds.

  6) We do an all-AMPs JOIN step from CDR.T01 by way of a

     RowHash match scan with no residual conditions, which is joined to

     Spool 2 (Last Use) by way of a RowHash match scan.

     CDR.T01 and Spool 2 are joined using a sliding-window

     merge join, with a join condition of (

     "((CDR.T01.START_DATE = START_DATE) AND

     ((CDR.T01.START_TIME = START_TIME) AND

      (CDR.T01.SUBSCRIBER_ID = SUBSCRIBER_ID )))))").  The result goes

     into Spool 1 (all_amps) (compressed columns allowed), which is

     built locally on the AMPs.  Then we do a SORT to order Spool 1 by

     the hash code of (CDR.T01.SUBSCRIBER_ID).  The size of Spool

     1 is estimated with low confidence to be 74,894 rows (24,715,020

     bytes).  The estimated time for this step is 1.24 seconds.

  7) We do an all-AMPs MERGE into MYUSER.VTAB from Spool 1 (Last Use).

     The size is estimated with low confidence to be 74,894 rows.  The

     estimated time for this step is 3.63 seconds.

  8) Finally, we send out an END TRANSACTION step to all AMPs involved

     in processing the request.

  -> No rows are returned to the user as the result of statement 1.

UPDATE:

After adding statistics, the Optimizer does a rowkey-based merge join, as expected:

COLLECT STATISTICS COLUMN (SUBSCRIBER_ID), COLUMN (START_DATE) ON CDR.SMALL;

  1) First, we lock a distinct CDR."pseudo table" for read

     on a RowHash to prevent global deadlock for CDR.T01.

  2) Next, we lock a distinct CDR."pseudo table" for read on

     a RowHash to prevent global deadlock for CDR.T02.

  3) We lock CDR.T01 for read, and we lock CDR.T02

     for read.

  4) We create the table header.

  5) We do an all-AMPs JOIN step from CDR.T02 by way of a

     RowHash match scan with no residual conditions, which is joined to

     CDR.T01 by way of a RowHash match scan with no residual

     conditions.  CDR.T02 and CDR.T01 are joined

     using a rowkey-based merge join, with a join condition of (

     "((CDR.T01.START_DATE =

     CDR.T02.START_DATE) AND

     ((CDR.T01.START_TIME =

     CDR.T02.START_TIME) AND (CDR.T01.SUBSCRIBER_ID =

     CDR.T02.SUBSCRIBER_ID)))").  The result goes into Spool 1

     (all_amps) (compressed columns allowed), which is built locally on

     the AMPs.  Then we do a SORT to order Spool 1 by the hash code of

     (CDR.T01.SUBSCRIBER_ID).  The size of Spool 1 is estimated

     with low confidence to be 77,831 rows (25,684,230 bytes).  The

     estimated time for this step is 0.04 seconds.

  6) We do an all-AMPs MERGE into OPER_DWH.vT01 from Spool 1 (Last Use).

     The size is estimated with low confidence to be 77,831 rows.  The

     estimated time for this step is 3.76 seconds.

  7) Finally, we send out an END TRANSACTION step to all AMPs involved

     in processing the request.

  -> No rows are returned to the user as the result of statement 1.
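For completeness, which statistics the Optimizer actually sees on a table can be double-checked with standard Teradata syntax; a minimal sketch:

HELP STATISTICS CDR.SMALL;  -- lists the collected statistics (column set, number of unique values, collection date)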

Still, it's strange to me. The estimates before collecting stats were already quite close to the real number of rows, and furthermore I would still expect a rowkey-based merge join to always be cheaper than a sliding-window merge join, independently of statistics.

BR,

Roland

Roland Wenzlofsky
Enthusiast

Re: Sliding Window Merge Join Question

I found another interesting behaviour when joining two tables with matching PI and partitioning. I have the following two tables:

CREATE MULTISET TABLE DWHPro.Sales2 ,NO FALLBACK ,

     NO BEFORE JOURNAL,

     NO AFTER JOURNAL,

     CHECKSUM = DEFAULT,

     DEFAULT MERGEBLOCKRATIO

     (

      SalesId INTEGER NOT NULL,

      Quantity DECIMAL(18,2),

      SalesDate DATE FORMAT 'YYYY-MM-DD')

PRIMARY INDEX ( SalesId )

PARTITION BY RANGE_N(SalesDate  BETWEEN DATE '1900-01-01' AND DATE '2016-12-31' EACH INTERVAL '1' DAY ,

 NO RANGE);

CREATE MULTISET TABLE DWHPro.Sales3 ,NO FALLBACK ,

     NO BEFORE JOURNAL,

     NO AFTER JOURNAL,

     CHECKSUM = DEFAULT,

     DEFAULT MERGEBLOCKRATIO

     (

      SalesId INTEGER NOT NULL,

      Quantity DECIMAL(18,2),

      SalesDate DATE FORMAT 'YYYY-MM-DD')

PRIMARY INDEX ( SalesId )

PARTITION BY RANGE_N(SalesDate  BETWEEN DATE '1900-01-01' AND DATE '2016-12-31' EACH INTERVAL '1' DAY ,

 NO RANGE);

Running the following query, which basically selects all partitions:

SELECT 

  t01.SalesId,

  t02.SalesId

 FROM Sales2 t01

 INNER JOIN

   Sales3 t02

ON

t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate

WHERE t01.SALESDATE BETWEEN DATE'1900-01-01' AND DATE'2016-12-31';

I get the following execution plan: both tables are redistributed by the hash of (SalesId, SalesDate) and a simple merge join is done:

  1) First, we lock a distinct DWHPRO."pseudo table" for read on a

     RowHash to prevent global deadlock for DWHPRO.t02.

  2) Next, we lock a distinct DWHPRO."pseudo table" for read on a

     RowHash to prevent global deadlock for DWHPRO.t01.

  3) We lock DWHPRO.t02 for read, and we lock DWHPRO.t01 for read.

  4) We execute the following steps in parallel.

       1) We do an all-AMPs RETRIEVE step from 42734 partitions of

          DWHPRO.t02 with a condition of ("(DWHPRO.t02.SalesDate >=

          DATE '1900-01-01') AND (DWHPRO.t02.SalesDate <= DATE

          '2016-12-31')") into Spool 2 (all_amps), which is

          redistributed by the hash code of (DWHPRO.t02.SalesId,

          DWHPRO.t02.SalesDate) to all AMPs.  Then we do a SORT to

          order Spool 2 by row hash.  The size of Spool 2 is estimated

          with high confidence to be 42,734 rows (897,414 bytes).  The

          estimated time for this step is 0.34 seconds.

       2) We do an all-AMPs RETRIEVE step from 42734 partitions of

          DWHPRO.t01 with a condition of ("(DWHPRO.t01.SalesDate <=

          DATE '2016-12-31') AND (DWHPRO.t01.SalesDate >= DATE

          '1900-01-01')") into Spool 3 (all_amps), which is

          redistributed by the hash code of (DWHPRO.t01.SalesId,

          DWHPRO.t01.SalesDate) to all AMPs.  Then we do a SORT to

          order Spool 3 by row hash.  The size of Spool 3 is estimated

          with high confidence to be 42,734 rows (897,414 bytes).  The

          estimated time for this step is 0.34 seconds.

  5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a

     RowHash match scan, which is joined to Spool 3 (Last Use) by way

     of a RowHash match scan.  Spool 2 and Spool 3 are joined using a

     merge join, with a join condition of ("(SalesId = SalesId) AND

     (SalesDate = SalesDate)").  The result goes into Spool 1

     (group_amps), which is built locally on the AMPs.  The size of

     Spool 1 is estimated with low confidence to be 42,734 rows (

     1,239,286 bytes).  The estimated time for this step is 0.31

     seconds.

  -> The contents of Spool 1 are sent back to the user as the result of

     statement 1.  The total estimated time is 0.66 seconds.

Afterwards, I removed just one date from the WHERE condition:


SELECT 

  t01.SalesId,

  t02.SalesId

 FROM Sales2 t01

 INNER JOIN

    Sales3 t02

ON

t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate

WHERE t01.SALESDATE BETWEEN DATE'1900-01-01' AND DATE'2016-12-30'; -- removed the 2016-12-31!

This time I got a rowkey-based merge join:

  1) First, we lock a distinct DWHPRO."pseudo table" for read on a

     RowHash to prevent global deadlock for DWHPRO.t02.

  2) Next, we lock a distinct DWHPRO."pseudo table" for read on a

     RowHash to prevent global deadlock for DWHPRO.t01.

  3) We lock DWHPRO.t02 for read, and we lock DWHPRO.t01 for read.

  4) We do an all-AMPs JOIN step from 42733 partitions of DWHPRO.t02 by

     way of a RowHash match scan with a condition of (

     "(DWHPRO.t02.SalesDate >= DATE '1900-01-01') AND

     (DWHPRO.t02.SalesDate <= DATE '2016-12-30')"), which is joined to

     42733 partitions of DWHPRO.t01 by way of a RowHash match scan with

     a condition of ("(DWHPRO.t01.SalesDate <= DATE '2016-12-30') AND

     (DWHPRO.t01.SalesDate >= DATE '1900-01-01')").  DWHPRO.t02 and

     DWHPRO.t01 are joined using a rowkey-based merge join, with a join

     condition of ("(DWHPRO.t01.SalesId = DWHPRO.t02.SalesId) AND

     (DWHPRO.t01.SalesDate = DWHPRO.t02.SalesDate)").  The result goes

     into Spool 1 (group_amps), which is built locally on the AMPs.

     The size of Spool 1 is estimated with low confidence to be 42,734

     rows (1,239,286 bytes).  The estimated time for this step is 0.28

     seconds.

  -> The contents of Spool 1 are sent back to the user as the result of

     statement 1.  The total estimated time is 0.28 seconds.

Being curious, I compared the following two statements. Both deliver the same result, but one does the standard merge join, while the other is "forced into" the rowkey-based merge join by handling 2016-12-31 separately:

SELECT -- Standard merge join

  t01.SalesId,

  t02.SalesId

 FROM Sales2 t01

 INNER JOIN

    Sales3 t02

ON

t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate

WHERE t01.SALESDATE BETWEEN DATE'1900-01-01' AND DATE'2016-12-31';


SELECT -- Forcing the Optimizer into the rowkey based merge join

  t01.SalesId,

  t02.SalesId

 FROM Sales2 t01

 INNER JOIN

   Sales3 t02

ON

t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate

WHERE t01.SALESDATE BETWEEN DATE'1900-01-01' AND DATE'2016-12-30'

UNION ALL

SELECT 

  t01.SalesId,

  t02.SalesId

 FROM Sales2 t01

 INNER JOIN

   Sales3 t02

ON

t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate

WHERE t01.SALESDATE = DATE'2016-12-31';


As expected, the rowkey-based merge join is much cheaper:

Rowkey-based merge join: 91 I/Os, 0.02 CPU seconds, 102 MB spool

Merge join: 272 I/Os, 0.24 CPU seconds, 1.8 GB spool
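Figures like these can be pulled from DBQL, provided query logging is enabled; a hedged sketch (assuming the DBC.QryLogV view of your release exposes these columns):

SELECT QueryID,
       TotalIOCount,  -- logical I/Os of the request
       AMPCPUTime,    -- CPU seconds summed over all AMPs
       SpoolUsage     -- peak spool usage in bytes
FROM   DBC.QryLogV
WHERE  UserName = USER
ORDER  BY StartTime DESC;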

While I don't understand why the Optimizer chooses a standard merge join if you select all partitions, it shows that there is a lot of potential for optimizing joins manually.

BR

Roland

Roland Wenzlofsky
MM
Fan

Re: Sliding Window Merge Join Question

Hello,

I think that the optimizer does not want to do a rowkey-based merge join if it estimates that one table has many partitions and all of them are very small.

But I do not know why it does this.

Additionally, it seems to change its mind if it can exclude at least one partition.

This also seems strange to me.

Kind regards, Martin

Enthusiast

Re: Sliding Window Merge Join Question

What were your stats definitions in all these cases?

Enthusiast

Re: Sliding Window Merge Join Question

Single-column stats on SALESID (the PI) and SALESDATE (the partitioning column). We are on TD 14.10, full stats (no sampling).

Roland Wenzlofsky
Enthusiast

Re: Sliding Window Merge Join Question

Can you please try adding stats on (PARTITION), (PARTITION, PI), and (PARTITION, PI, partitioning column) and check your results?

Below is the excerpt from Carrie's blog with the PARTITION statistics recommendations:

  • (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
  • (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
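A hedged sketch of what this suggestion amounts to for the Sales tables from this thread (TD 14.10 allows several column sets in one COLLECT STATISTICS statement):

COLLECT STATISTICS
  COLUMN (PARTITION),
  COLUMN (PARTITION, SalesId),
  COLUMN (PARTITION, SalesId, SalesDate)
ON DWHPro.Sales2;

COLLECT STATISTICS
  COLUMN (PARTITION),
  COLUMN (PARTITION, SalesId),
  COLUMN (PARTITION, SalesId, SalesDate)
ON DWHPro.Sales3;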
Enthusiast

Re: Sliding Window Merge Join Question

Hi,

I added the statistics you mentioned on both tables. Now I always get the traditional merge join with a redistribution step on (SalesId, SalesDate) for both tables at the beginning. Whether I select all partitions or (partitions - 1) no longer matters; the behaviour changed. But still, the rowkey-based merge join would be much cheaper, as I showed in my test setup.

I tried another setup, removing all statistics from Sales2 and Sales3. I still got the traditional merge join, for all partitions as well as for (partitions - 1).

The only setup where selecting (partitions - 1) leads to a rowkey-based merge join is when I have stats on both tables on columns SalesId and SalesDate (single-column stats).

It seems to be definitely related to statistics.

BTW, my test setup contains exactly 1 row per partition (in both tables).
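For anyone who wants to reproduce this, one row per daily partition can be generated from the system calendar; a minimal sketch (assuming the standard sys_calendar.calendar view is available):

INSERT INTO DWHPro.Sales2 (SalesId, Quantity, SalesDate)
SELECT ROW_NUMBER() OVER (ORDER BY calendar_date),  -- distinct PI value per row
       1.00,
       calendar_date                                -- exactly one date per daily partition
FROM   sys_calendar.calendar
WHERE  calendar_date BETWEEN DATE '1900-01-01' AND DATE '2016-12-31';

INSERT INTO DWHPro.Sales3
SELECT * FROM DWHPro.Sales2;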

I would assume that the optimizer knows in both cases (as it has statistics on all needed columns) that one query selects 69,999 rows and the other 70,000 rows; only one row of difference...

The assumption that 69,999 rows are cheaper to select with a rowkey-based merge join, while 70,000 rows are cheaper with data redistribution and a traditional merge join, is somehow not logical to me (and as my test showed, the traditional merge join is much more expensive).

BR

Roland

Roland Wenzlofsky
MM
Fan

Re: Sliding Window Merge Join Question

Hello, 

My question:

Given two tables with the same PI and the same partitioning, joined on the PI and the partitioning columns (and maybe some additional columns):

Can there be an advantage to departitioning one table (keeping the same PI) and then doing a sliding-window merge join, instead of doing a rowkey-based merge join?

I have no example where this could be advantageous.

Kind regards, Martin