Help with query optimization having analytical functions

nsi
Teradata Employee


Hi,

I am relatively new to the Teradata database and working on performance optimization of the query below. It takes 1 hour to run and queries a table with around 1.6 billion records. I tried compression on the columns subs_dvc_inst_end_dttm (the value 2050-12-31 is present in about 16% of the rows) and subs_dvc_inst_rsn_cd (this column contains only 2 reason codes). I also collected multi-column stats on the columns (subs_id, subs_dvc_inst_rsn_cd) used in the PARTITION BY of the query. Still, nothing seems to help, and the explain plan just shows FTS/all-rows scans being done many times. Can anyone please suggest a way to avoid the redistribution happening so many times in this query and improve its performance?

 

Thanks 

Neha

 

Table PRD_EDW.SUBS_DVC_LST has about 1.6 billion rows. After the analytic functions and the QUALIFY, this query returns around 1.4 billion rows.

 

Query:

select
SDDL_.subs_id,
SDDL_.imei_val,
SDDL_.dvc_prod_id,
SDDL_.subs_dvc_inst_start_dttm,
max(SDDL_.subs_dvc_inst_end_dttm)
over (partition by SDDL_.subs_id, SDDL_.subs_dvc_inst_rsn_cd
      order by SDDL_.subs_dvc_inst_start_dttm, SDDL_.dvc_prod_id
      reset when SDDL_.dvc_prod_id <> min(SDDL_.dvc_prod_id)
         over (partition by SDDL_.subs_id, SDDL_.subs_dvc_inst_rsn_cd
               order by SDDL_.subs_dvc_inst_start_dttm, SDDL_.subs_id, SDDL_.dvc_prod_id
               rows between 1 preceding and 1 preceding)),
min(SDDL_.subs_dvc_inst_start_dttm) over (partition by SDDL_.subs_id, SDDL_.dvc_prod_id order by SDDL_.dvc_prod_id),
max(SDDL_.prcss_crtn_dttm) over (partition by SDDL_.subs_id),
SDDL_.subs_dvc_inst_rsn_cd
from PRD_EDW.SUBS_DVC_LST SDDL_


qualify rank() over (partition by subs_id, subs_dvc_inst_rsn_cd
                     order by subs_dvc_inst_start_dttm, dvc_prod_id
                     reset when dvc_prod_id <> min(dvc_prod_id)
                        over (partition by subs_id, subs_dvc_inst_rsn_cd
                              order by subs_dvc_inst_start_dttm, subs_id, subs_dvc_inst_rsn_cd, dvc_prod_id
                              rows between 1 preceding and 1 preceding)) = 1;

 

Table information:

show table PRD_EDW.SUBS_DVC_LST;

CREATE MULTISET TABLE PRD_EDW.SUBS_DVC_LST ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
subs_id INTEGER,
imei_val VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
dvc_prod_id INTEGER,
subs_dvc_inst_start_dttm TIMESTAMP(6),
subs_dvc_inst_end_dttm TIMESTAMP(6) COMPRESS (TIMESTAMP '2050-12-31 23:59:59.000000'),
dvc_inst_id BIGINT,
prcss_crtn_dttm TIMESTAMP(6),
subs_dvc_inst_rsn_cd VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('DMUD-NPSD','DMUD-PSD'))
PRIMARY INDEX ( dvc_inst_id );

 

Stats:

collect statistics column (dvc_prod_id),
column (dvc_inst_id),
column (subs_dvc_inst_rsn_cd),
column (subs_id,subs_dvc_inst_rsn_cd)
on PRD_EDW.SUBS_DVC_LST;

 

Explain Plan:

1) First, we lock a distinct prd_edw."pseudo table" for read on a
RowHash to prevent global deadlock for prd_edw.SDDL_.
2) Next, we lock prd_edw.SDDL_ for read.
3) We do an all-AMPs STAT FUNCTION step from prd_edw.SDDL_ by way of
an all-rows scan with no residual conditions into Spool 6 (Last
Use), which is redistributed by hash code to all AMPs. The result
rows are put into Spool 4 (all_amps) (compressed columns allowed),
which is built locally on the AMPs. The size is estimated with
high confidence to be 1,638,751,678 rows (291,697,798,684 bytes).
4) We do an all-AMPs STAT FUNCTION step from Spool 4 (Last Use) by
way of an all-rows scan into Spool 9 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 8 (all_amps) (compressed columns allowed), which is
built locally on the AMPs. The size is estimated with high
confidence to be 1,638,751,678 rows (262,200,268,480 bytes).
5) We do an all-AMPs STAT FUNCTION step from Spool 8 (Last Use) by
way of an all-rows scan into Spool 12 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 11 (all_amps) (compressed columns allowed), which is
built locally on the AMPs. The size is estimated with high
confidence to be 1,638,751,678 rows (255,645,261,768 bytes).
6) We do an all-AMPs STAT FUNCTION step from Spool 11 (Last Use) by
way of an all-rows scan into Spool 15 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 14 (all_amps) (compressed columns allowed), which is
built locally on the AMPs. The size is estimated with high
confidence to be 1,638,751,678 rows (229,425,234,920 bytes).
7) We do an all-AMPs STAT FUNCTION step from Spool 14 (Last Use) by
way of an all-rows scan into Spool 18 (Last Use), which is assumed
to be redistributed by value to all AMPs. The result rows are put
into Spool 17 (all_amps) (compressed columns allowed), which is
built locally on the AMPs. The size is estimated with high
confidence to be 1,638,751,678 rows (203,205,208,072 bytes).
8) We do an all-AMPs RETRIEVE step from Spool 17 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps) (compressed columns
allowed), which is built locally on the AMPs. The size of Spool 2
is estimated with high confidence to be 1,638,751,678 rows (
196,650,201,360 bytes). The estimated time for this step is 37.93
seconds.
9) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 24 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 22 (all_amps) (compressed columns allowed), which is
built locally on the AMPs. The size is estimated with high
confidence to be 1,638,751,678 rows (268,755,275,192 bytes).
10) We do an all-AMPs STAT FUNCTION step from Spool 22 (Last Use) by
way of an all-rows scan into Spool 27 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 26 (all_amps) (compressed columns allowed), which is
built locally on the AMPs. The size is estimated with high
confidence to be 1,638,751,678 rows (265,477,771,836 bytes).
11) We do an all-AMPs RETRIEVE step from Spool 26 (Last Use) by way of
an all-rows scan into Spool 1 (all_amps) (compressed columns
allowed), which is built locally on the AMPs. The size of Spool 1
is estimated with high confidence to be 1,638,751,678 rows (
203,205,208,072 bytes). The estimated time for this step is 37.93
seconds.
12) We do an all-AMPs STAT FUNCTION step from Spool 1 (Last Use) by
way of an all-rows scan into Spool 34 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 32 (all_amps) (compressed columns allowed), which is
built locally on the AMPs. The size is estimated with high
confidence to be 1,638,751,678 rows (347,415,355,736 bytes).
13) We do an all-AMPs STAT FUNCTION step from Spool 32 (Last Use) by
way of an all-rows scan into Spool 37 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 30 (group_amps), which is built locally on the AMPs.
The size is estimated with high confidence to be 1,638,751,678
rows (370,357,879,228 bytes).
14) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 30 are sent back to the user as the result
of statement 1.


Enthusiast

Re: Help with query optimization having analytical functions

Can you please collect stats as follows and share the results?

--Try 1:
COLLECT STATISTICS COLUMN
(
SUBS_ID
,IMEI_VAL
,DVC_PROD_ID
,SUBS_DVC_INST_START_DTTM
,SUBS_DVC_INST_END_DTTM
,SUBS_DVC_INST_RSN_CD
)
ON PRD_EDW.SUBS_DVC_LST;


COLLECT STATISTICS COLUMN
(
SUBS_ID
,IMEI_VAL
,DVC_PROD_ID
,SUBS_DVC_INST_START_DTTM
,SUBS_DVC_INST_RSN_CD
)
ON PRD_EDW.SUBS_DVC_LST;

--Try 2:
COLLECT STATISTICS COLUMN
(
DVC_INST_ID
,SUBS_ID
,IMEI_VAL
,DVC_PROD_ID
,SUBS_DVC_INST_START_DTTM
,SUBS_DVC_INST_END_DTTM
,SUBS_DVC_INST_RSN_CD
)
ON PRD_EDW.SUBS_DVC_LST;


COLLECT STATISTICS COLUMN
(
DVC_INST_ID
,SUBS_ID
,IMEI_VAL
,DVC_PROD_ID
,SUBS_DVC_INST_START_DTTM
,SUBS_DVC_INST_RSN_CD
)
ON PRD_EDW.SUBS_DVC_LST;

Looking for separate results for both tries.

Junior Contributor

Re: Help with query optimization having analytical functions

There's one STAT FUNCTION step for each different PARTITION BY/ORDER BY combination, so you should try to get multiple OLAP functions to share the same window definition.
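A minimal sketch of that effect (the column names are just taken from the table above, purely for illustration):

   SELECT
      subs_id,
      -- both MIN and MAX use the identical PARTITION BY/ORDER BY,
      -- so they can be computed in a single STAT FUNCTION step:
      Min(dvc_prod_id)
      Over (PARTITION BY subs_id
            ORDER BY subs_dvc_inst_start_dttm
            ROWS UNBOUNDED PRECEDING) AS min_prod,
      Max(dvc_prod_id)
      Over (PARTITION BY subs_id
            ORDER BY subs_dvc_inst_start_dttm
            ROWS UNBOUNDED PRECEDING) AS max_prod,
      -- a different window forces an additional STAT FUNCTION step,
      -- i.e. another sort/redistribution of the whole spool:
      Max(prcss_crtn_dttm)
      Over (PARTITION BY subs_id, dvc_prod_id) AS max_prcss
   FROM PRD_EDW.SUBS_DVC_LST;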

 

So let's start by removing some unnecessary code:

 

Over (PARTITION BY subs_id,subs_dvc_inst_rsn_cd
   -- ORDER BY subs_dvc_inst_start_dttm, subs_id, dvc_prod_id
   -- subs_id can be removed (already in PARTITION)
      ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id

Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
   -- ORDER BY subs_dvc_inst_start_dttm, subs_id, subs_dvc_inst_rsn_cd, dvc_prod_id
   -- both subs_id & subs_dvc_inst_rsn_cd can be removed (already in PARTITION)
      ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
-- Now both share the same PARTITION/ORDER
   
Min(subs_dvc_inst_start_dttm)
Over (PARTITION BY subs_id, dvc_prod_id
   -- ORDER BY dvc_prod_id)
   -- no ORDER BY needed (MIN) & dvc_prod_id already in PARTITION

This removes two STAT FUNCTION steps: 7 instead of 9.

 

But RESET WHEN is not well optimized: those two functions are defined in a similar way, yet they are not folded into the same step.

Rewriting RESET WHEN needs three OLAP steps (sometimes only two): first calculate a flag indicating the RESET WHEN condition, then create new partitions based on that flag, and finally do the calculation over those new partitions (other OLAP functions with the same window can sometimes be combined, too):

SELECT 
   subs_id,
   imei_val,
   dvc_prod_id,
   subs_dvc_inst_start_dttm,
   min_start,
   max_prcss,
   subs_dvc_inst_rsn_cd
FROM
 ( 
   SELECT 
      subs_id,
      imei_val,
      dvc_prod_id,
      subs_dvc_inst_start_dttm,
      min_start,
      max_prcss,
      subs_dvc_inst_rsn_cd,
      -- additional partitions based on RESET WHEN
      Sum(flag)
      Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
            ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id) AS reset_partition
   FROM 
    (
      SELECT
         subs_id,
         imei_val,
         dvc_prod_id,
         subs_dvc_inst_start_dttm,
         -- RESET WHEN condition
         CASE WHEN dvc_prod_id
            <> Min(dvc_prod_id)
               Over (PARTITION BY subs_id,subs_dvc_inst_rsn_cd
                     ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
                     ROWS BETWEEN 1 Preceding AND 1 Preceding) 
              THEN 1 
              ELSE 0
         END AS flag,

         Min(subs_dvc_inst_start_dttm)
         Over (PARTITION BY subs_id, dvc_prod_id) AS min_start,

         Max(prcss_crtn_dttm)
         Over (PARTITION BY subs_id) AS max_prcss,

         subs_dvc_inst_rsn_cd
      FROM SUBS_DVC_LST SDDL_
    ) AS dt
 ) AS dt
QUALIFY
   Rank()
   Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd, reset_partition
         ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id) = 1;

Now Explain shows only 5 STATS-steps :-)

 

Additionally, spool usage could be reduced by using correct data types:

subs_dvc_inst_rsn_cd (as this column has only 2 reason codes coming in)

subs_dvc_inst_rsn_cd VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('DMUD-NPSD','DMUD-PSD')

 

 

If there are actually only those two codes, it should be defined as a VARCHAR(9) instead (and UPPERCASE should be added):

VARCHARs are expanded to their full defined size when used in GROUP/PARTITION/ORDER BY: 20 vs. 9 bytes, saving 11 * 1,600,000,000 bytes, i.e. roughly 17 GB per spool (and there are two spools per step).

 

Of course it's usually not an easy task to modify the data model, but you might change the codes for the PARTITION BY to a BYTEINT using a CASE:

PARTITION BY subs_id, CASE subs_dvc_inst_rsn_cd WHEN 'DMUD-NPSD' THEN 0 ELSE 1 END

1 vs. 20  bytes.
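For example, the mapping could be computed once in the innermost derived table and then reused by all the windows (a sketch; the alias rsn_cd_b and the 0/1 mapping are illustrative, not part of the original query):

   -- innermost derived table: map the code once
   CASE subs_dvc_inst_rsn_cd WHEN 'DMUD-NPSD' THEN 0 ELSE 1 END AS rsn_cd_b,

   -- the outer OLAP functions then partition by the BYTEINT, e.g.:
   Rank()
   Over (PARTITION BY subs_id, rsn_cd_b, reset_partition
         ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id)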

 

 

 

nsi
Teradata Employee

Re: Help with query optimization having analytical functions

Hi,

Please find the stats below:

For Try 1:

Try 1.jpg

For Try 2:

Try 2.JPG

Thanks

Enthusiast

Re: Help with query optimization having analytical functions

After collecting these stats, please check if there is any impact on query execution time.

Please share the following:

Explain plan of SQL after TRY 1 stats:

 

Explain plan of SQL after TRY 2 stats:

 

nsi
Teradata Employee

Re: Help with query optimization having analytical functions

Thank you Dieter. The way you have broken this query into parts is commendable indeed.

 

However, max(subs_dvc_inst_end_dttm) has to be calculated over the partition (partition by subs_id, subs_dvc_inst_rsn_cd order by subs_dvc_inst_start_dttm, dvc_prod_id) that sits on top of the RESET WHEN logic. I think I can use a CASE statement and the flag value to evaluate max(subs_dvc_inst_end_dttm), but I am not sure how to evaluate it over that partition window.

 

Also, the sum(flag) is being used just as a placeholder for the reset partition, which is then referenced again in the QUALIFY clause, right?

Junior Contributor

Re: Help with query optimization having analytical functions (accepted solution)

Oops, of course, I forgot to add the MAX (and ROWS UNBOUNDED PRECEDING for the SUM(flag)):

 

SELECT 
   subs_id,
   imei_val,
   dvc_prod_id,
   subs_dvc_inst_start_dttm,
   min_start,
   max_prcss,
   subs_dvc_inst_rsn_cd,
   Max(subs_dvc_inst_end_dttm)
   Over(PARTITION BY subs_id,subs_dvc_inst_rsn_cd,reset_partition)
  -- no need for ORDER BY:
  -- ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id)
FROM
 ( 
   SELECT 
      subs_id,
      imei_val,
      dvc_prod_id,
      subs_dvc_inst_start_dttm,
      min_start,
      max_prcss,
      subs_dvc_inst_rsn_cd,
      subs_dvc_inst_end_dttm,
      -- additional partitions based on RESET WHEN
      Sum(flag)
      Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
            ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
            ROWS UNBOUNDED PRECEDING) AS reset_partition
   FROM 
    (
      SELECT
         subs_id,
         imei_val,
         dvc_prod_id,
         subs_dvc_inst_start_dttm,
         subs_dvc_inst_end_dttm,
         -- RESET WHEN condition
         CASE WHEN dvc_prod_id
            <> Min(dvc_prod_id)
               Over (PARTITION BY subs_id,subs_dvc_inst_rsn_cd
                     ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
                      ROWS BETWEEN 1 Preceding AND 1 Preceding) 
              THEN 1 
              ELSE 0
         END AS flag,
         
         Min(subs_dvc_inst_start_dttm)
         Over (PARTITION BY subs_id, dvc_prod_id) AS min_start,
         
         Max(prcss_crtn_dttm)
         Over (PARTITION BY subs_id) AS max_prcss,
         
         subs_dvc_inst_rsn_cd
      FROM SUBS_DVC_LST SDDL_
    ) AS dt
 ) AS dt
QUALIFY
   Rank()
   Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd, reset_partition
         ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id) = 1;

But it's still only 5 STAT steps.

 

 

Yep, the SUM(flag) calculates the additional partitions based on RESET WHEN.
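To illustrate with a few hypothetical rows (values chosen purely for illustration, timestamps shortened to dates): whenever dvc_prod_id differs from the previous row's value within a subs_id/subs_dvc_inst_rsn_cd group, flag becomes 1, and the running SUM(flag) opens a new reset_partition:

   subs_dvc_inst_start_dttm  dvc_prod_id  previous dvc_prod_id  flag  reset_partition
   2020-01-01                10           (none)                0     0
   2020-02-01                10           10                    0     0
   2020-03-01                20           10                    1     1   <- RESET WHEN would fire here
   2020-04-01                20           20                    0     1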

 

 

Junior Contributor

Re: Help with query optimization having analytical functions

 


Try 2.JPG

Btw, there's no need to collect the 1st stats: the 2nd is already unique, and adding another column will not increase uniqueness (in fact, you might not need any of those).

Of course, these stats should be collected using a forced sample, e.g. SAMPLE 1 PERCENT.
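Something along these lines, assuming a release that supports the USING clause on COLLECT STATISTICS (the column list is the 2nd one from Try 2 above, shown only as an example):

   COLLECT STATISTICS
      USING SAMPLE 1 PERCENT
      COLUMN (subs_id
             ,imei_val
             ,dvc_prod_id
             ,subs_dvc_inst_start_dttm
             ,subs_dvc_inst_rsn_cd)
   ON PRD_EDW.SUBS_DVC_LST;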

Enthusiast

Re: Help with query optimization having analytical functions

I am not sure if it applies to you, but I have seen larger Teradata installs (hundreds of nodes and thousands of AMPs) where there was a steep increase in CPU utilization post V12.

Teradata then provided a few customers with cost profiles (I don't know which ones), which dramatically reduced CPU and brought things back to how they were before.

 

You should check with your SA on this. 

And yes, follow through on the suggestions that Dieter provided.

When he talks, we listen!

 

Best,

Vinay Bagare

nsi
Teradata Employee

Re: Help with query optimization having analytical functions

Thanks, Dieter! I incorporated your suggestions and the query now runs in 7 minutes; the overall skew has also gone down :-) I really appreciate your help on this.

 

Thanks

Neha