This left join query takes more than 40 minutes to run:
sum(table2.items) as items,
sum(table2.amount) as amount,
from table1 left join table2
on table1.sub_sk = table2.sub_sk and
table2.date_sk between 17901 and 17907
table1 has 200,000 records and table2 has 500 million records.
Can you suggest any areas of improvement in the above query for better performance?
Thanks in advance.
table1's index is sub_sk.
table2's index is contrib_sk.
Stats are collected on the indexes, on the columns specified in the ON and GROUP BY clauses, and on the tables.
table2 is partitioned on usg_dt_sk.
5) We do an all-AMPs RETRIEVE step from 7 partitions of MISC.table2
with a condition of ("(MISC.table2.USG_DT_SK <= 17907.) AND
(MISC.table2.USG_DT_SK >= 17901.)") into Spool 4 (all_amps)
(compressed columns allowed), which is redistributed by the hash
code of (MISC.table2.sub_sk) to all AMPs. Then we do a SORT to
order Spool 4 by row hash. The size of Spool 4 is estimated with
low confidence to be 14,814,310 rows (1,259,216,350 bytes). The
estimated time for this step is 7 minutes and 45 seconds.
6) We do an all-AMPs JOIN step from MISC.table1 by way of a RowHash
match scan with no residual conditions, which is joined to Spool 4
(Last Use) by way of a RowHash match scan. MISC.table1 and Spool
4 are left outer joined using a merge join, with a join condition
of ("MISC.table1.sub_sk = sub_sk"). The result goes into Spool 3
(all_amps) (compressed columns allowed), which is built locally on
the AMPs. The size of Spool 3 is estimated with index join
confidence to be 14,814,310 rows (1,466,616,690 bytes). The
estimated time for this step is 3 minutes and 16 seconds.
7) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan , grouping by field1 ( MISC.table1.sub_id
,MISC.table1.sub_sk ,MISC.table2.task_sk). Aggregate Intermediate
Results are computed globally, then placed in Spool 7. The size
of Spool 7 is estimated with low confidence to be 11,110,733 rows
(1,633,277,751 bytes). The estimated time for this step is 6
minutes and 43 seconds.
8) We do an all-AMPs RETRIEVE step from Spool 7 (Last Use) by way of
an all-rows scan into Spool 1 (all_amps) (compressed columns
allowed), which is redistributed by the hash code of (
MISC.table2.task_sk) to all AMPs. Then we do a SORT to order
Spool 1 by row hash. The size of Spool 1 is estimated with low
confidence to be 11,110,733 rows (1,077,741,101 bytes). The
estimated time for this step is 6 minutes and 2 seconds.
9) We do an all-AMPs MERGE into MISC.spa_out_join from Spool 1 (Last
Use). The size is estimated with low confidence to be 11,110,733
rows. The estimated time for this step is 1 second.
You didn't mention which index types have been defined on the columns.
Anyway, table2.sub_sk doesn't have any index? The first step of the explain you shared shows table2 being redistributed by the hash code of table2.sub_sk, and with around 500 million records in table2 that is a huge amount of work.
You get the best join performance when the qualifying rows of both tables land on the same AMP, and that happens when both tables share the same primary index (here, sub_sk).
The confidence level is low in almost all the steps of the explain plan, which is not a good sign, because it means the optimizer cannot rely on the statistics. When were the stats last refreshed?
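If changing table2's primary index is not practical, a single-table join index hashed on sub_sk is one way to co-locate the rows. This is only a sketch: the column list is guessed from your query, and whether the extra storage and maintenance overhead are acceptable depends on your workload.

```sql
-- Hypothetical sketch: pre-hash table2's rows by sub_sk so the join to
-- table1 (whose primary index is sub_sk) can be resolved AMP-locally
-- instead of redistributing ~500 million rows at query time.
CREATE JOIN INDEX MISC.table2_ji AS
SELECT sub_sk, task_sk, usg_dt_sk, items, amount
FROM MISC.table2
PRIMARY INDEX (sub_sk);
```

The optimizer can then rewrite the query against the join index automatically when it covers the referenced columns.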
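If the stats are stale, recollecting on the join and partitioning columns may tighten those estimates. A sketch, assuming the column names from your posted DDL:

```sql
-- Recollect statistics on the columns this plan depends on.
COLLECT STATISTICS COLUMN (sub_sk) ON MISC.table1;
COLLECT STATISTICS COLUMN (sub_sk) ON MISC.table2;
COLLECT STATISTICS COLUMN (usg_dt_sk) ON MISC.table2;
-- Stats on the system PARTITION column help the optimizer cost
-- the partition elimination on table2.
COLLECT STATISTICS COLUMN (PARTITION) ON MISC.table2;
```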
Can you share the DDL of the tables?
CREATE TABLE MISC.table1
sub_sk NUMERIC(8) NOT NULL
TITLE 'sub_sk' ,
sub_id VARCHAR(32) NOT NULL
TITLE 'sub_id' ,
PROCESSED_DTTM TIMESTAMP(0) NOT NULL
CREATE TABLE misc.table2
contrib_sk NUMERIC(8) NOT NULL
PROD_SK NUMERIC(8) NOT NULL
TITLE 'PROD_SK' compress 0.00,
task_sk NUMERIC(8) NOT NULL
TITLE 'task_sk' compress 0.00,
EVENT_TYPE_SK NUMERIC(8) NOT NULL
TITLE 'EVENT_TYPE_SK' compress 0.00,
USG_DT_SK NUMERIC(8) NOT NULL
PRIMARY INDEX (contrib_sk)
PARTITION BY RANGE_N(CAST((USG_DT_SK ) AS INTEGER) BETWEEN
17901 AND 17907 EACH 1 ,
17908 AND 17914 EACH 1 ,
Stats are refreshed.
You might try an early aggregation similar to this one:
Of course, whether and how this is possible depends on your data.
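Something along these lines (a sketch only; the select list and GROUP BY columns are guessed from the explain plan you posted, so adjust them to your real query). The idea is to aggregate table2 down to one row per (sub_sk, task_sk) inside the partition range first, so that a much smaller intermediate result is redistributed and merge-joined to table1:

```sql
-- Hypothetical early-aggregation rewrite: collapse table2 before the
-- join, shrinking the spool that must be redistributed by sub_sk.
SELECT t1.sub_id,
       t1.sub_sk,
       t2.task_sk,
       t2.items,
       t2.amount
FROM MISC.table1 AS t1
LEFT JOIN
     ( SELECT sub_sk,
              task_sk,
              SUM(items)  AS items,
              SUM(amount) AS amount
       FROM MISC.table2
       WHERE usg_dt_sk BETWEEN 17901 AND 17907
       GROUP BY sub_sk, task_sk
     ) AS t2
ON t1.sub_sk = t2.sub_sk;
```

This helps when many table2 rows share the same (sub_sk, task_sk); if the grouping keys are nearly unique, the pre-aggregation buys little.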