Spool space: query optimization

Enthusiast

Hello All,

 

I am running the following query and running out of spool space. I'm a beginner at querying with little experience in optimization, so I'm looking for some guidance; I've included the EXPLAIN log from Teradata:


SELECT
    regexp_replace(A.EN_TXN_LNG_DESC_LIN_2, '[^A-Z/@\-]', ' ', 1, 0, 'i') as NEW_DESC,
    COUNT(DISTINCT A.ar_id) as AR_COUNT

FROM ddwv01.pers_chequing_txn A
LEFT JOIN DDWV01.DEPOSIT_ACCOUNT B ON (A.AR_ID = B.AR_ID)

WHERE
    A.DR_CR_TYP_CD IN ('D')
    AND A.PST_DT BETWEEN '2017/01/01' AND '2018/05/18'
    AND A.EN_TXN_LNG_DESC_LIN_1 LIKE '%insur%'
    AND B.SNAP_DT IN ('2018/04/30')
    AND B.ACTIVE_INACTV_STTS IN ('A')

GROUP BY NEW_DESC

Log:
1) First, we lock DDWTA.T_DEPOSIT_ACCOUNT in view
ddwv01.DEPOSIT_ACCOUNT for access on a single partition, and we
lock DDWTA.T_PERS_CHEQUING_TXN in view ddwv01.pers_chequing_txn
for access.
2) Next, we do an all-AMPs RETRIEVE step from a single partition of
DDWTA.T_DEPOSIT_ACCOUNT in view ddwv01.DEPOSIT_ACCOUNT with a
condition of ("DDWTA.T_DEPOSIT_ACCOUNT in view
ddwv01.DEPOSIT_ACCOUNT.SNAP_DT = DATE '2018-04-30'") with a
residual condition of ("(DDWTA.T_DEPOSIT_ACCOUNT in view
ddwv01.DEPOSIT_ACCOUNT.SNAP_DT = DATE '2018-04-30') AND
(DDWTA.T_DEPOSIT_ACCOUNT in view
ddwv01.DEPOSIT_ACCOUNT.ACTIVE_INACTV_STTS = 'A')") into Spool 4
(all_amps) (compressed columns allowed), which is built locally on
the AMPs. Then we do a SORT to order Spool 4 by row hash. The
size of Spool 4 is estimated with low confidence to be 13,746,857
rows (508,633,709 bytes). The estimated time for this step is
0.08 seconds.
3) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
RowHash match scan, which is joined to 503 partitions of
DDWTA.T_PERS_CHEQUING_TXN in view ddwv01.pers_chequing_txn by way
of a RowHash match scan with a condition of (
"(DDWTA.T_PERS_CHEQUING_TXN in view
ddwv01.pers_chequing_txn.PST_DT >= DATE '2017-01-01') AND
((DDWTA.T_PERS_CHEQUING_TXN in view
ddwv01.pers_chequing_txn.PST_DT <= DATE '2018-05-18') AND
((DDWTA.T_PERS_CHEQUING_TXN in view
ddwv01.pers_chequing_txn.DR_CR_TYP_CD = 'D') AND
(DDWTA.T_PERS_CHEQUING_TXN in view
ddwv01.pers_chequing_txn.EN_TXN_LNG_DESC_LIN_1 LIKE '%insur%')))").
Spool 4 and DDWTA.T_PERS_CHEQUING_TXN are joined using a
sliding-window merge join, with a join condition of (
"DDWTA.T_PERS_CHEQUING_TXN.AR_ID = AR_ID"). The input table
DDWTA.T_PERS_CHEQUING_TXN will not be cached in memory. The
result goes into Spool 3 (all_amps) (compressed columns allowed),
which is built locally on the AMPs. The size of Spool 3 is
estimated with low confidence to be 488,741,637 rows (
36,655,622,775 bytes). The estimated time for this step is 12.49
seconds.
4) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan, and the grouping identifier in field 1.
Aggregate Intermediate Results are computed locally, then placed
in Spool 6. The aggregate spool file will not be cached in memory.
The size of Spool 6 is estimated with no confidence to be
366,556,228 rows (7,841,370,829,376 bytes). The estimated time
for this step is 31 minutes and 29 seconds.
5) We do an all-AMPs SUM step to aggregate from Spool 6 (Last Use) by
way of an all-rows scan, and the grouping identifier in field 2.
Aggregate Intermediate Results are computed globally, then placed
in Spool 8. The aggregate spool file will not be cached in memory.
The size of Spool 8 is estimated with no confidence to be
274,917,171 rows (2,939,139,475,161 bytes). The estimated time
for this step is 1 hour and 13 minutes.
6) We do an all-AMPs RETRIEVE step from Spool 8 (Last Use) by way of
an all-rows scan into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with no confidence
to be 274,917,171 rows (740,351,941,503 bytes). The estimated
time for this step is 1 minute and 29 seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 1 hour and 46 minutes.


What steps should I take to optimize the query? It seems the 'all-rows scan' steps are taking the most time and using the most memory.

 

Appreciate the help everyone!


Accepted Solutions
Teradata Employee

Re: Spool space: query optimization

What is the data type of A.EN_TXN_LNG_DESC_LIN_2? If it's, say, VARCHAR(50) then

 

    CAST(regexp_replace(A.EN_TXN_LNG_DESC_LIN_2, '[^A-Z/@\-]', ' ', 1, 0, 'i') AS VARCHAR(50)) as NEW_DESC,
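
For context, a sketch of the full query with that cast applied (the VARCHAR(50) length here is only an assumption; it should match the column's actual defined length):

    SELECT
        CAST(regexp_replace(A.EN_TXN_LNG_DESC_LIN_2, '[^A-Z/@\-]', ' ', 1, 0, 'i')
             AS VARCHAR(50)) as NEW_DESC,   -- cast back to the source length
        COUNT(DISTINCT A.ar_id) as AR_COUNT
    FROM ddwv01.pers_chequing_txn A
    LEFT JOIN DDWV01.DEPOSIT_ACCOUNT B ON (A.AR_ID = B.AR_ID)
    WHERE
        A.DR_CR_TYP_CD IN ('D')
        AND A.PST_DT BETWEEN '2017/01/01' AND '2018/05/18'
        AND A.EN_TXN_LNG_DESC_LIN_1 LIKE '%insur%'
        AND B.SNAP_DT IN ('2018/04/30')
        AND B.ACTIVE_INACTV_STTS IN ('A')
    GROUP BY NEW_DESC

The smaller NEW_DESC shrinks every spool that carries the column, including the sort key used for the COUNT DISTINCT.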

Junior Supporter

Re: Spool space: query optimization

Hello NoPunIntended,

Steps 4, 5 and 6 are 'all-rows scans', as they should be, since they scan the results of the prior steps. Please be more specific about which step is most problematic.

I notice you do a LEFT OUTER JOIN to "B", but no columns from "B" appear in the result.
It looks like you are trying to limit the result with conditions on "B", but as a LEFT JOIN they do not reduce the rows from "A". I think you intend an INNER JOIN to limit the rows from "A", which will reduce the rows produced by the join and allow your query to run; see the sketch below.
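
As a sketch, that rewrite only changes the join keyword; everything else stays as in the original query:

    SELECT
        regexp_replace(A.EN_TXN_LNG_DESC_LIN_2, '[^A-Z/@\-]', ' ', 1, 0, 'i') as NEW_DESC,
        COUNT(DISTINCT A.ar_id) as AR_COUNT
    FROM ddwv01.pers_chequing_txn A
    INNER JOIN DDWV01.DEPOSIT_ACCOUNT B ON (A.AR_ID = B.AR_ID)  -- was: LEFT JOIN
    WHERE ...   -- same conditions as before
    GROUP BY NEW_DESC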

Cheers

Teradata Frank, Certified Master
Teradata Employee

Re: Spool space: query optimization

With the equality predicates on "B", the optimizer is already doing an INNER join (notice no "OUTER" keyword mentioned in the EXPLAIN text).

 

NEW_DESC will be VARCHAR(8000) CHARACTER SET UNICODE by default, and is part of the sort key for the COUNT DISTINCT (so will be expanded to 16KB per row in spool for the key, plus the column itself). That's probably much larger than it needs to be. CAST it back to the original length, maybe even add TRANSLATE unicode_to_latin, to reduce the size of the spool needed.
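
A sketch of that cast (again assuming the column is VARCHAR(50); use its actual defined length):

    CAST(TRANSLATE(regexp_replace(A.EN_TXN_LNG_DESC_LIN_2, '[^A-Z/@\-]', ' ', 1, 0, 'i')
         USING unicode_to_latin)
         AS VARCHAR(50)) as NEW_DESC,

The TRANSLATE halves the byte width (LATIN stores one byte per character versus two for UNICODE), and the CAST trims the declared length from 8000 down to the real maximum.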

Junior Supporter

Re: Spool space: query optimization

The keyword OUTER in the FROM clause is optional (page 224 of the SQL DML manual, 16.10).
Thus LEFT JOIN, without INNER or OUTER specified, is an OUTER JOIN by default.
The EXPLAIN used to make this clear, but it no longer says INNER or OUTER.
As such, the filter on the "B" table is not applied to the "A" table.
One clue from the explain is the use of 503 partitions in step 3.

 

Teradata Frank, Certified Master
Teradata Employee

Re: Spool space: query optimization

No.

 

It is certainly true that LEFT JOIN and LEFT OUTER JOIN are equivalent syntax, requesting an outer join. But in this case, the optimizer recognized that (because of the predicates in the WHERE clause), any "non-matching" rows will just be filtered out anyway; an INNER JOIN will provide exactly the same result as the requested outer join. Since there are more optimization techniques available for INNER than OUTER joins, the optimizer chooses to substitute an INNER join in query planning.

 

It uses 503 partitions because that's the number of days in the range of PST_DT values (365 days in 2017 plus the 138 days from 2018-01-01 through 2018-05-18). Whether the join on AR_ID is INNER or OUTER makes no difference in the number of daily partitions that must be scanned.

Junior Supporter

Re: Spool space: query optimization

You are correct that the result of step 3 is the same for INNER and OUTER operation in this case (though not always).
However, they are executed differently:
The OUTER join joins all the rows on AR_ID, then applies the WHERE filters after the join.
The INNER join applies the WHERE filters during the join.
As a result, the spool for step 3 may get bigger, and processing will take longer.
OUTER joins should only be used when needed; specify INNER if that is the intended join.

 

Teradata Frank, Certified Master
Junior Contributor

Re: Spool space: query optimization

The optimizer is quite smart regarding Outer vs. Inner joins.

When you write an Outer Join but can't find the keyword OUTER in the EXPLAIN, it has been rewritten as an Inner Join; see Coding ON Clauses With WHERE Clauses for Outer Joins:

If a search condition on the inner table is placed in the WHERE clause, the join is logically equivalent to an inner join, even if you explicitly specify the keywords LEFT/RIGHT/FULL OUTER JOIN in the query. The Optimizer always treats such a join as an inner join to simplify the query, rewriting it to roll the entire complex process into a single step.
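
As a generic illustration of that rule (using the tables from this thread), the placement of the inner-table condition is what decides the semantics:

    -- Condition in WHERE: non-matching rows arrive with NULLs in B's columns
    -- and fail the test, so this is logically an inner join and the
    -- optimizer plans it as one.
    SELECT A.AR_ID
    FROM ddwv01.pers_chequing_txn A
    LEFT JOIN DDWV01.DEPOSIT_ACCOUNT B ON A.AR_ID = B.AR_ID
    WHERE B.SNAP_DT = DATE '2018-04-30';

    -- Condition in ON: rows from A with no qualifying match in B are kept,
    -- with NULLs for B's columns, so this remains a true outer join.
    SELECT A.AR_ID
    FROM ddwv01.pers_chequing_txn A
    LEFT JOIN DDWV01.DEPOSIT_ACCOUNT B
        ON A.AR_ID = B.AR_ID
        AND B.SNAP_DT = DATE '2018-04-30';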

 

And see the following Outer Join Case Study.

Enthusiast

Re: Spool space: query optimization

Thank you for the response FGroenen.

I tried using an Inner Join as an experiment, and it actually uses more spool space.

 

If I am not mistaken, these are the problematic steps, since they require the most time:

4) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan, and the grouping identifier in field 1.
Aggregate Intermediate Results are computed locally, then placed
in Spool 6. The aggregate spool file will not be cached in memory.
The size of Spool 6 is estimated with no confidence to be
366,556,228 rows (7,841,370,829,376 bytes). The estimated time
for this step is 31 minutes and 29 seconds.


5) We do an all-AMPs SUM step to aggregate from Spool 6 (Last Use) by
way of an all-rows scan, and the grouping identifier in field 2.
Aggregate Intermediate Results are computed globally, then placed
in Spool 8. The aggregate spool file will not be cached in memory.
The size of Spool 8 is estimated with no confidence to be
274,917,171 rows (2,939,139,475,161 bytes). The estimated time
for this step is 1 hour and 13 minutes.

 

Junior Contributor

Re: Spool space: query optimization

The Inner join yields exactly the same plan as the Outer.

 

Did you test Fred's advice?

Simply cast the result of the regexp_replace to a Varchar of a matching size.
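
If you don't know the defined length, one way to look it up (a sketch, assuming you can read the DBC dictionary views; the base table name is taken from the EXPLAIN) is:

    -- Show the column definition directly:
    HELP COLUMN ddwv01.pers_chequing_txn.EN_TXN_LNG_DESC_LIN_2;

    -- Or query the data dictionary (ColumnLength is in bytes):
    SELECT ColumnName, ColumnType, ColumnLength, CharType
    FROM DBC.ColumnsV
    WHERE DatabaseName = 'DDWTA'
      AND TableName = 'T_PERS_CHEQUING_TXN'
      AND ColumnName = 'EN_TXN_LNG_DESC_LIN_2';

Then use that length in the CAST, e.g. CAST(... AS VARCHAR(50)) if the column is VARCHAR(50).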

Enthusiast

Re: Spool space: query optimization


@dnoeth wrote:

The Inner join yields exactly the same as the Outer.

 

Did you test Fred's advice?

Simply cast the result of the regexp_replace to a Varchar of a matching size.


Would you be able to share the syntax / argument that I need to pass to get the matching size?