Database


10-17-2017 11:33 PM

Hi,

I am relatively new to Teradata and am working on performance optimization of the query below. It takes 1 hour to run and queries a table with around 1.6 billion records. I tried compression on subs_dvc_inst_end_dttm (the value 2050-12-31 appears in about 16% of the rows) and on subs_dvc_inst_rsn_cd (only 2 reason codes occur in it). I also collected multi-column stats on (subs_id, subs_dvc_inst_rsn_cd), the columns in the query's PARTITION BY. Still, nothing seems to help, and the explain plan shows full-table scans being done many times. Can anyone please suggest a way to avoid the redistribution happening so many times in this query and improve its performance?

Thanks

Neha

Table PRD_EDW.SUBS_DVC_LST has about 1.6 billion rows. After the analytic functions and the QUALIFY, this query returns around 1.4 billion records.

Query:

select
    SDDL_.subs_id,
    SDDL_.imei_val,
    SDDL_.dvc_prod_id,
    SDDL_.subs_dvc_inst_start_dttm,
    max(SDDL_.subs_dvc_inst_end_dttm)
        over (partition by SDDL_.subs_id, SDDL_.subs_dvc_inst_rsn_cd
              order by SDDL_.subs_dvc_inst_start_dttm, SDDL_.dvc_prod_id
              reset when SDDL_.dvc_prod_id <> min(SDDL_.dvc_prod_id)
                  over (partition by SDDL_.subs_id, SDDL_.subs_dvc_inst_rsn_cd
                        order by SDDL_.subs_dvc_inst_start_dttm, SDDL_.subs_id, SDDL_.dvc_prod_id
                        rows between 1 preceding and 1 preceding)),
    min(SDDL_.subs_dvc_inst_start_dttm)
        over (partition by SDDL_.subs_id, SDDL_.dvc_prod_id
              order by SDDL_.dvc_prod_id),
    max(SDDL_.prcss_crtn_dttm) over (partition by SDDL_.subs_id),
    SDDL_.subs_dvc_inst_rsn_cd
from PRD_EDW.SUBS_DVC_LST SDDL_
qualify rank()
    over (partition by subs_id, subs_dvc_inst_rsn_cd
          order by subs_dvc_inst_start_dttm, dvc_prod_id
          reset when dvc_prod_id <> min(dvc_prod_id)
              over (partition by subs_id, subs_dvc_inst_rsn_cd
                    order by subs_dvc_inst_start_dttm, subs_id, subs_dvc_inst_rsn_cd, dvc_prod_id
                    rows between 1 preceding and 1 preceding)) = 1;

Table information:

show table PRD_EDW.SUBS_DVC_LST;

CREATE MULTISET TABLE PRD_EDW.SUBS_DVC_LST ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      subs_id INTEGER,
      imei_val VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
      dvc_prod_id INTEGER,
      subs_dvc_inst_start_dttm TIMESTAMP(6),
      subs_dvc_inst_end_dttm TIMESTAMP(6) COMPRESS (TIMESTAMP '2050-12-31 23:59:59.000000'),
      dvc_inst_id BIGINT,
      prcss_crtn_dttm TIMESTAMP(6),
      subs_dvc_inst_rsn_cd VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('DMUD-NPSD','DMUD-PSD'))
PRIMARY INDEX ( dvc_inst_id );

Stats:

collect statistics
    column (dvc_prod_id),
    column (dvc_inst_id),
    column (subs_dvc_inst_rsn_cd),
    column (subs_id, subs_dvc_inst_rsn_cd)
on PRD_EDW.SUBS_DVC_LST;

Explain Plan:

1) First, we lock a distinct prd_edw."pseudo table" for read on a RowHash to prevent global deadlock for prd_edw.SDDL_.

2) Next, we lock prd_edw.SDDL_ for read.

3) We do an all-AMPs STAT FUNCTION step from prd_edw.SDDL_ by way of an all-rows scan with no residual conditions into Spool 6 (Last Use), which is redistributed by hash code to all AMPs. The result rows are put into Spool 4 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (291,697,798,684 bytes).

4) We do an all-AMPs STAT FUNCTION step from Spool 4 (Last Use) by way of an all-rows scan into Spool 9 (Last Use), which is redistributed by hash code to all AMPs. The result rows are put into Spool 8 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (262,200,268,480 bytes).

5) We do an all-AMPs STAT FUNCTION step from Spool 8 (Last Use) by way of an all-rows scan into Spool 12 (Last Use), which is redistributed by hash code to all AMPs. The result rows are put into Spool 11 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (255,645,261,768 bytes).

6) We do an all-AMPs STAT FUNCTION step from Spool 11 (Last Use) by way of an all-rows scan into Spool 15 (Last Use), which is redistributed by hash code to all AMPs. The result rows are put into Spool 14 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (229,425,234,920 bytes).

7) We do an all-AMPs STAT FUNCTION step from Spool 14 (Last Use) by way of an all-rows scan into Spool 18 (Last Use), which is assumed to be redistributed by value to all AMPs. The result rows are put into Spool 17 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (203,205,208,072 bytes).

8) We do an all-AMPs RETRIEVE step from Spool 17 (Last Use) by way of an all-rows scan into Spool 2 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size of Spool 2 is estimated with high confidence to be 1,638,751,678 rows (196,650,201,360 bytes). The estimated time for this step is 37.93 seconds.

9) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by way of an all-rows scan into Spool 24 (Last Use), which is redistributed by hash code to all AMPs. The result rows are put into Spool 22 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (268,755,275,192 bytes).

10) We do an all-AMPs STAT FUNCTION step from Spool 22 (Last Use) by way of an all-rows scan into Spool 27 (Last Use), which is redistributed by hash code to all AMPs. The result rows are put into Spool 26 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (265,477,771,836 bytes).

11) We do an all-AMPs RETRIEVE step from Spool 26 (Last Use) by way of an all-rows scan into Spool 1 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 1,638,751,678 rows (203,205,208,072 bytes). The estimated time for this step is 37.93 seconds.

12) We do an all-AMPs STAT FUNCTION step from Spool 1 (Last Use) by way of an all-rows scan into Spool 34 (Last Use), which is redistributed by hash code to all AMPs. The result rows are put into Spool 32 (all_amps) (compressed columns allowed), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (347,415,355,736 bytes).

13) We do an all-AMPs STAT FUNCTION step from Spool 32 (Last Use) by way of an all-rows scan into Spool 37 (Last Use), which is redistributed by hash code to all AMPs. The result rows are put into Spool 30 (group_amps), which is built locally on the AMPs. The size is estimated with high confidence to be 1,638,751,678 rows (370,357,879,228 bytes).

14) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.

-> The contents of Spool 30 are sent back to the user as the result of statement 1.


10-17-2017 11:58 PM

Can you please collect stats as follows and share the results?

**--Try 1:**

COLLECT STATISTICS COLUMN
    (SUBS_ID, IMEI_VAL, DVC_PROD_ID, SUBS_DVC_INST_START_DTTM,
     SUBS_DVC_INST_END_DTTM, SUBS_DVC_INST_RSN_CD)
ON PRD_EDW.SUBS_DVC_LST;

COLLECT STATISTICS COLUMN
    (SUBS_ID, IMEI_VAL, DVC_PROD_ID, SUBS_DVC_INST_START_DTTM,
     SUBS_DVC_INST_RSN_CD)
ON PRD_EDW.SUBS_DVC_LST;

**--Try 2:**

COLLECT STATISTICS COLUMN
    (DVC_INST_ID, SUBS_ID, IMEI_VAL, DVC_PROD_ID, SUBS_DVC_INST_START_DTTM,
     SUBS_DVC_INST_END_DTTM, SUBS_DVC_INST_RSN_CD)
ON PRD_EDW.SUBS_DVC_LST;

COLLECT STATISTICS COLUMN
    (DVC_INST_ID, SUBS_ID, IMEI_VAL, DVC_PROD_ID, SUBS_DVC_INST_START_DTTM,
     SUBS_DVC_INST_RSN_CD)
ON PRD_EDW.SUBS_DVC_LST;

Looking for separate results for both tries.

10-18-2017 01:05 AM

There's one STAT-step for each different PARTITION/ORDER BY, thus one should try to get multiple OLAP-functions with the same window definition.

So let's start with removing some unnecessary code:

Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
      -- ORDER BY subs_dvc_inst_start_dttm, subs_id, dvc_prod_id
      -- subs_id can be removed (already in PARTITION)
      ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id

Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
      -- ORDER BY subs_dvc_inst_start_dttm, subs_id, subs_dvc_inst_rsn_cd, dvc_prod_id
      -- both subs_id & subs_dvc_inst_rsn_cd can be removed (already in PARTITION)
      ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
      -- now both share the same PARTITION/ORDER

Min(subs_dvc_inst_start_dttm)
Over (PARTITION BY subs_id, dvc_prod_id)
      -- ORDER BY dvc_prod_id removed: no ORDER BY needed for MIN,
      -- and dvc_prod_id is already in the PARTITION

This results in two STATS steps less, 7 vs. 9.

But RESET WHEN is not well optimized: those two functions are defined in a similar way, yet they are not folded into the same steps.

Rewriting RESET WHEN needs three OLAP steps (sometimes only two): first calculate a flag indicating the RESET WHEN condition, then create new partitions based on that flag, and finally do the calculation including those new partitions (sometimes other OLAP functions with the same window can be combined, too):

SELECT subs_id, imei_val, dvc_prod_id, subs_dvc_inst_start_dttm,
       min_start, max_prcss, subs_dvc_inst_rsn_cd
FROM
 (
   SELECT subs_id, imei_val, dvc_prod_id, subs_dvc_inst_start_dttm,
          min_start, max_prcss, subs_dvc_inst_rsn_cd,
          -- additional partitions based on RESET WHEN
          Sum(flag)
          Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
                ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id) AS reset_partition
   FROM
    (
      SELECT subs_id, imei_val, dvc_prod_id, subs_dvc_inst_start_dttm,
             -- RESET WHEN condition
             CASE
                WHEN dvc_prod_id <> Min(dvc_prod_id)
                                    Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
                                          ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
                                          ROWS BETWEEN 1 Preceding AND 1 Preceding)
                THEN 1
                ELSE 0
             END AS flag,
             Min(subs_dvc_inst_start_dttm) Over (PARTITION BY subs_id, dvc_prod_id) AS min_start,
             Max(prcss_crtn_dttm) Over (PARTITION BY subs_id) AS max_prcss,
             subs_dvc_inst_rsn_cd
      FROM SUBS_DVC_LST SDDL_
    ) AS dt
 ) AS dt
QUALIFY Rank()
        Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd, reset_partition
              ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id) = 1;

Now Explain shows only 5 STATS-steps :-)
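The flag-and-running-sum idea behind this rewrite can be traced with a minimal Python sketch (hypothetical sample data; this only illustrates the logic, not the Teradata execution):

```python
# Emulating RESET WHEN outside the database: flag each row whose
# dvc_prod_id differs from the previous row's (the RESET WHEN
# condition), then take a running sum of the flags to get the extra
# partitioning column. Rows are pre-sorted the way the window's
# ORDER BY would sort them.
rows = [  # (subs_dvc_inst_start_dttm, dvc_prod_id)
    (1, 10),
    (2, 10),
    (3, 20),  # dvc_prod_id changed -> RESET WHEN fires here
    (4, 20),
]

# CASE WHEN dvc_prod_id <> Min(dvc_prod_id) Over (...
#   ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) THEN 1 ELSE 0 END
flags = [
    1 if i > 0 and rows[i][1] != rows[i - 1][1] else 0
    for i in range(len(rows))
]

# Sum(flag) Over (... ROWS UNBOUNDED PRECEDING) -> reset_partition
reset_partition, running = [], 0
for f in flags:
    running += f
    reset_partition.append(running)

print(flags)            # [0, 0, 1, 0]
print(reset_partition)  # [0, 0, 1, 1]
```

Every row before the first change shares reset_partition 0, and each firing of the condition starts a new partition number, which is exactly what the QUALIFY above partitions by.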

Additionally, spool usage could be reduced with correct data types:

*subs_dvc_inst_rsn_cd (as this column has only 2 reason codes coming in)*

*subs_dvc_inst_rsn_cd VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('DMUD-NPSD','DMUD-PSD')*

If there are actually only those two codes, it should be defined as VARCHAR(9) instead (and UPPERCASE should be added):

VARCHARs are expanded to the full defined size when used in GROUP/PARTITION/ORDER BY: 20 vs. 9 bytes, saving 11 * 1,600,000,000 bytes of spool (and there are two spools per step).

Of course it's usually not an easy task to modify the data model, but for the PARTITION BY you might map the codes to a BYTEINT using a CASE:

PARTITION BY subs_id,
             CASE subs_dvc_inst_rsn_cd WHEN 'DMUD-NPSD' THEN 0 ELSE 1 END

1 vs. 20 bytes.
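As a rough back-of-the-envelope check (using the 1,638,751,678-row estimate from the explain plan; these are per-spool approximations, not measurements):

```python
# Spool bytes saved per copy of the reason-code column when the
# column is carried through a STAT FUNCTION step's spool.
rows = 1_638_751_678          # row estimate from the explain plan

varchar9_saving = (20 - 9) * rows   # VARCHAR(20) -> VARCHAR(9)
byteint_saving = (20 - 1) * rows    # VARCHAR(20) -> BYTEINT via CASE

print(f"{varchar9_saving:,}")  # 18,026,268,458 (roughly 17 GB per spool)
print(f"{byteint_saving:,}")   # 31,136,281,882 (roughly 29 GB per spool)
```

With two spools per step and many STAT FUNCTION steps, the savings multiply accordingly.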

10-18-2017 01:52 AM

Hi,

Please find the stats below:

For Try 1:

For Try 2:

Thanks

10-18-2017 02:06 AM

After collecting these stats, please check whether there is any impact on query execution time...

Please share following:

Explain plan of SQL after TRY 1 stats:

Explain plan of SQL after TRY 2 stats:

10-18-2017 02:18 AM

Thank you Dieter. The way you have broken this query into parts is commendable indeed.

However, max(subs_dvc_inst_end_dttm) has to be calculated over the partition (partition by subs_id, subs_dvc_inst_rsn_cd order by subs_dvc_inst_start_dttm, dvc_prod_id) that sits on top of the RESET WHEN logic. I think I can use a CASE expression and the flag value to evaluate max(subs_dvc_inst_end_dttm), but I am not sure how to evaluate it over that partition window.

Also, the sum(flag) is being used just as a placeholder for the reset partition, which is then referenced again in the QUALIFY clause, right?

10-18-2017 02:51 AM (Accepted Solution)

Oops, of course, I forgot to add the MAX (and ROWS UNBOUNDED PRECEDING for the SUM(flag)):

SELECT subs_id, imei_val, dvc_prod_id, subs_dvc_inst_start_dttm,
       min_start, max_prcss, subs_dvc_inst_rsn_cd,
       Max(subs_dvc_inst_end_dttm)
       Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd, reset_partition)
       -- no need for ORDER BY:
       -- ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
FROM
 (
   SELECT subs_id, imei_val, dvc_prod_id, subs_dvc_inst_start_dttm,
          min_start, max_prcss, subs_dvc_inst_rsn_cd, subs_dvc_inst_end_dttm,
          -- additional partitions based on RESET WHEN
          Sum(flag)
          Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
                ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
                ROWS UNBOUNDED PRECEDING) AS reset_partition
   FROM
    (
      SELECT subs_id, imei_val, dvc_prod_id, subs_dvc_inst_start_dttm,
             subs_dvc_inst_end_dttm,
             -- RESET WHEN condition
             CASE
                WHEN dvc_prod_id <> Min(dvc_prod_id)
                                    Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd
                                          ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id
                                          ROWS BETWEEN 1 Preceding AND 1 Preceding)
                THEN 1
                ELSE 0
             END AS flag,
             Min(subs_dvc_inst_start_dttm) Over (PARTITION BY subs_id, dvc_prod_id) AS min_start,
             Max(prcss_crtn_dttm) Over (PARTITION BY subs_id) AS max_prcss,
             subs_dvc_inst_rsn_cd
      FROM SUBS_DVC_LST SDDL_
    ) AS dt
 ) AS dt
QUALIFY Rank()
        Over (PARTITION BY subs_id, subs_dvc_inst_rsn_cd, reset_partition
              ORDER BY subs_dvc_inst_start_dttm, dvc_prod_id) = 1;

But still only 5 steps

Yep, the SUM(flag) calculates the additional partitions based on RESET WHEN.
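On a tiny hypothetical data set, the full pipeline of the rewrite (flag the RESET WHEN condition, accumulate it into reset_partition, then take MAX over the derived partitions) can be sketched in Python; column names mirror the query, and the values are made up for illustration:

```python
# One subs_id/subs_dvc_inst_rsn_cd group, sorted by
# (subs_dvc_inst_start_dttm, dvc_prod_id).
rows = [  # (subs_dvc_inst_start_dttm, dvc_prod_id, subs_dvc_inst_end_dttm)
    (1, 10, 100),
    (2, 10, 300),
    (3, 20, 200),  # dvc_prod_id changes -> new reset partition
]

# Steps 1+2: flag the RESET WHEN condition and accumulate it into
# reset_partition (Sum(flag) ... ROWS UNBOUNDED PRECEDING).
parts, part, prev = [], 0, None
for _, dvc, _ in rows:
    if prev is not None and dvc != prev:
        part += 1
    parts.append(part)
    prev = dvc

# Step 3: Max(subs_dvc_inst_end_dttm)
#         Over (PARTITION BY ..., reset_partition) -- no ORDER BY needed.
max_end = {}
for p, (_, _, end) in zip(parts, rows):
    max_end[p] = max(max_end.get(p, end), end)
max_per_row = [max_end[p] for p in parts]

print(parts)        # [0, 0, 1]
print(max_per_row)  # [300, 300, 200]
```

The first two rows share a reset partition, so they both get the partition-wide MAX (300), while the third row starts a new partition and keeps its own value.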

10-18-2017 02:56 AM

Btw, there's no need to collect the 1st set of stats: the 2nd is already unique, and adding another column will not increase uniqueness (in fact you might not need any of those).

Of course, these stats should be collected using a forced sample, e.g. SAMPLE 1 PERCENT.

10-18-2017 06:55 PM

I am not sure if it applies to you, but I have seen a larger Teradata installation (hundreds of nodes and thousands of AMPs) where there was a steep increase in CPU utilization post V12.

Teradata then provided a few customers with cost profiles (I don't know which) that dramatically reduced CPU and brought back the tranquility of before.

You should check with your SA on this.

And yes, follow through on the suggestions that Dieter provided.

When he talks, we listen!

Best,

Vinay Bagare

10-19-2017 02:36 AM

Thanks, Dieter! I incorporated your suggestions and the query now runs in 7 minutes; the overall skew has also reduced :-) I really appreciate your help on this.

Thanks

Neha