Slow left join query

Database
N/A

Slow left join query

Hello,

This is a left join query and it consumes more than 40 minutes.

select

  table2.task_sk,

  table1.sub_sk,

  table1.sub_id,

  sum(table2.items) as items,

  sum(table2.amount) as amount,

  ...

from table1 left join table2

  on table1.sub_sk = table2.sub_sk and

  table2.date_sk between 17901 and 17907

group by

  table1.sub_id,

  table1.sub_sk,

  table2.task_sk;

table1 has 200000 records and table2 has 500 million records.

Can you suggest any area of improvements in above query for better performance?

Thanks in advance.

Nitin

Tags (1)
6 REPLIES

Re: Slow left join query

What are the Indexes on those tables? Are the stats refreshed?

Can you share the explain plan of the query?

N/A

Re: Slow left join query

Hello,

table1 index is sub_sk.

table2 index is contrib_sk.

Stats are collected on index, columns specified in ON and GROUP BY and table.

table2 is partitioned on usg_dt_sk.

  5) We do an all-AMPs RETRIEVE step from 7 partitions of MISC.table2

     with a condition of ("(MISC.table2.USG_DT_SK <= 17907.) AND

     (MISC.table2.USG_DT_SK >= 17901.)") into Spool 4 (all_amps)

     (compressed columns allowed), which is redistributed by the hash

     code of (MISC.table2.sub_sk) to all AMPs.  Then we do a SORT to

     order Spool 4 by row hash.  The size of Spool 4 is estimated with

     low confidence to be 14,814,310 rows (1,259,216,350 bytes).  The

     estimated time for this step is 7 minutes and 45 seconds.

  6) We do an all-AMPs JOIN step from MISC.table1 by way of a RowHash

     match scan with no residual conditions, which is joined to Spool 4

     (Last Use) by way of a RowHash match scan.  MISC.table1 and Spool

     4 are left outer joined using a merge join, with a join condition

     of ("MISC.table1.sub_sk = sub_sk").  The result goes into Spool 3

     (all_amps) (compressed columns allowed), which is built locally on

     the AMPs.  The size of Spool 3 is estimated with index join

     confidence to be 14,814,310 rows (1,466,616,690 bytes).  The

     estimated time for this step is 3 minutes and 16 seconds.

  7) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by

     way of an all-rows scan , grouping by field1 ( MISC.table1.sub_id

     ,MISC.table1.sub_sk ,MISC.table2.task_sk).  Aggregate Intermediate

     Results are computed globally, then placed in Spool 7.  The size

     of Spool 7 is estimated with low confidence to be 11,110,733 rows

     (1,633,277,751 bytes).  The estimated time for this step is 6

     minutes and 43 seconds.

  8) We do an all-AMPs RETRIEVE step from Spool 7 (Last Use) by way of

     an all-rows scan into Spool 1 (all_amps) (compressed columns

     allowed), which is redistributed by the hash code of (

     MISC.table2.task_sk) to all AMPs.  Then we do a SORT to order

     Spool 1 by row hash.  The size of Spool 1 is estimated with low

     confidence to be 11,110,733 rows (1,077,741,101 bytes).  The

     estimated time for this step is 6 minutes and 2 seconds.

  9) We do an all-AMPs MERGE into MISC.spa_out_join from Spool 1 (Last

     Use).  The size is estimated with low confidence to be 11,110,733

     rows.  The estimated time for this step is 1 second.

Nitin

Re: Slow left join query

You didn't mention which index types have been defined on the columns.

Anyways, table2.sub_sk doesn't have any index? The first step of the explain that you shared mentions that table2 is redistributed on the basis of table2_sub_sk, and as table2 has around 500 million records so this is a huge task.

You get the best query results when the qualifying rows of both the tables are on the same AMP and this is possible when they have the same Index.

The confidence level in the Explain plan is low in almost all the steps which is not a good sign caz the optimizer is not considering to use the stats. When were the stats refreshed last time?

Can you share the DDL of the tables?

N/A

Re: Slow left join query

CREATE TABLE MISC.table1

(

 sub_sk              NUMERIC(8) NOT NULL

  TITLE 'sub_sk' ,

 sub_id              VARCHAR(32) NOT NULL

  TITLE 'sub_id' ,

 PROCESSED_DTTM       TIMESTAMP(0) NOT NULL

  TITLE 'PROCESSED_DTTM')

primary index(sub_sk)

;

CREATE TABLE misc.table2

(

 contrib_sk           NUMERIC(8) NOT NULL

  TITLE 'contrib_sk',

 PROD_SK              NUMERIC(8) NOT NULL

  TITLE 'PROD_SK' compress 0.00,

 task_sk           NUMERIC(8) NOT NULL

  TITLE 'task_sk' compress 0.00,

 EVENT_TYPE_SK        NUMERIC(8) NOT NULL

  TITLE 'EVENT_TYPE_SK' compress 0.00,

 USG_DT_SK            NUMERIC(8) NOT NULL

  TITLE 'USG_DT_SK',

...

)

 PRIMARY INDEX (contrib_sk)

PARTITION BY RANGE_N(CAST((USG_DT_SK ) AS INTEGER) BETWEEN

17901  AND 17907  EACH 1 ,

17908  AND 17914  EACH 1 ,

...

)

Stats are refreshed.

N/A

Re: Slow left join query

Hi Nitin,

you might try an early aggregation similar to this one:

http://forums.teradata.com/forum/database/tuning-the-join#comment-23696

Of course it depends on your data, if and how this is possible.

Dieter

N/A

Re: Slow left join query

Dieter

Thanks for your response.

It worked. It takes 10 minutes.

Nitin