I have a multiset volatile table having 10k records and a table having 5 billion records. I have to create a multiset volatile table by inner joining these two tables. But, the query is failing due to spool space issue. Is there any way to tune this SQL, apart from increasing the spool space?
My first question would be if the inner join condition applies to the Primary Index (PI) of the larger table.
Otherwise the execution plan would probably try to redistribute those 5 billion records to Spool, preparing for join processing, and that would probably exhaust your user Spool quota.
So you can tune the SQL if the join condition is not currently applied to the PI of the larger table, check if this is feasible, depending on the volatile table design.
Skew or bad PI distribution could also be the problem, i.e. multiple records having same PI value in the result set would hash to same AMP and perhaps trigger the Spool space error (remeber the Spool quota is divided by the number of AMPs, and if just one AMP hits its own maximum the Spool space error occurs).
But this is a data demographics issue.
Normally to diagnose a problem like this and suggest SQL tuning one needs an EXPLAIN of the query and the DDL of the accessed tables.
To better understand the question we would need to know the SQL conditions relating the tables. Are these conditions selective or is it expected that the entire 5 billion rows would be included in the result? If the latter, then spool space for two copies of the 5 billion rows will be necessary - one for the join result and one for the final result. And of course those both would have all the columns of the result of the join.
This is the explain plan I got:
1) First, we lock a distinct QA_RMAP_1."pseudo table" for read on a
RowHash to prevent global deadlock for
2) Next, we lock a distinct QA_RMAP_1."pseudo table" for read on a
RowHash to prevent global deadlock for
3) We lock QA_RMAP_1.ENROLL_SNAPSHOT_DATE for read, and we lock
QA_RMAP_1.REVENUE_SUBSCRIPTION_PRODUCT_NO_DUP for read.
4) We create the table header.
5) We do an all-AMPs RETRIEVE step from
QA_RMAP_1.ENROLL_SNAPSHOT_DATE by way of an all-rows scan with a
condition of ("(QA_RMAP_1.ENROLL_SNAPSHOT_DATE.SNPSHT_ID = 2) AND
(QA_RMAP_1.ENROLL_SNAPSHOT_DATE.SNPSHT_TYP = 1)") into Spool 2
(all_amps), which is built locally on the AMPs. The size of Spool
2 is estimated with high confidence to be 156,627,426 rows (
5,795,214,762 bytes). The estimated time for this step is 13.76
6) We do an all-AMPs JOIN step from
QA_RMAP_1.REVENUE_SUBSCRIPTION_PRODUCT_NO_DUP by way of a RowHash
match scan with no residual conditions, which is joined to Spool 2
(Last Use) by way of a RowHash match scan.
QA_RMAP_1.REVENUE_SUBSCRIPTION_PRODUCT_NO_DUP and Spool 2 are
joined using a merge join, with a join condition of (
<= SNPSHT_DT) AND
> SNPSHT_DT) AND
CUST_ACCT_KEY ))))))"). The input table
QA_RMAP_1.REVENUE_SUBSCRIPTION_PRODUCT_NO_DUP will not be cached
in memory, but it is eligible for synchronized scanning. The
result goes into Spool 1 (all_amps), which is redistributed by
hash code to all AMPs. Then we do a SORT to order Spool 1 by row
hash. The result spool file will not be cached in memory. The
size of Spool 1 is estimated with low confidence to be
17,497,152,691 rows (23,848,619,117,833 bytes). The estimated
time for this step is 10 hours and 17 minutes.
7) We do an all-AMPs MERGE into
QA_RMAP_SQLETL.TMP_SUBSCRIPTION_SNAPSHOT_JOIN_INNER_1 from Spool 1
(Last Use). The size is estimated with low confidence to be
17,497,152,691 rows. The estimated time for this step is 1,336
hours and 9 minutes.
8) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.
1. Check the duplicates for the table 'REVENUE_SUBSCRIPTION_PRODUCT_NO_DUP' on PK columns.
2. Check for the Multi Column Stats on below combination. Always Stats Play a major role.
Run the below query.