Performance of Correlated Subquery!!!

Database
Enthusiast

Performance of Correlated Subquery!!!

It has been mentioned in the document that
"Correlated subqueries" are faster than Normal joins.

But in another document it has been mentioned any subquery including "Correlated subqueries" are always implemented as inner joins or product joins by the Optimiser.

If this is the case then how come correlated queries are faster than the regular joins when they are anyway implemented as normal joins??

Regards,
Annal T
11 REPLIES
Enthusiast

Re: Performance of Correlated Subquery!!!

I wonder which documents you are referring too ?

Explains would tell better as to what the optimizer is cooking up,

And given that most of the inner joins and their cor-related are not necessarily "equivalent" though they look so, because TD might have to do NULL checking, check dups etc depending on the conditions.

For example

SELECT E.*
FROM EMPLOYEE_00A E, DEPT_00A D
WHERE E.DEPTNO = D.DEPTNO;

And

SELECT E.*
FROM EMPLOYEE E
WHERE E.DEPTNO = (SELECT D.DEPTNO FROM DEPT D WHERE D.DEPTNO = E.DEPTNO);

looks temptingly equivalent, but is not necessarily the case.

This is because if the correlated query returns duplicate records (if deptno was not unique) the second query should error out where as the first one would just get a duplicate record for each employee having the deptno in duplicate in DEPT table.

Now if we change it to

SELECT E.*
FROM EMPLOYEE E
WHERE E.DEPTNO IN (SELECT D.DEPTNO FROM DEPT D WHERE D.DEPTNO = E.DEPTNO);

it's closer to the inner join, but not quite the same still. That's because even if there are dups on deptno, there won't be any duplicate EMP record returned (unlike the inner join which would return duplicate EMP records).

So there are lots of factors which determine whether two queries has the same semantics and will always result in the same output, the optimizer knows these differences and hence at times make different plans.

There are factors like uniqueness, null conditions, nature of set operation etc which determines whether two queries are equivalent.
Enthusiast

Re: Performance of Correlated Subquery!!!

Thanx for the info!!!
i now understand the differences that might arise in the result set when we use correlated subquery and the regular join.

But i would like to know the spot where the Query Performance takes a hit when we opt for regular join instead of the corresponding Correlated subquery.

Regards,
Annal T
Enthusiast

Re: Performance of Correlated Subquery!!!


considering that there are quite a bit of permutations and combinations, the best way would be to check the explains of both queries to see what's happening (and is the practical way, it's difficult/unnecessary to remember all that when you have a very good explain output available).

Personally if I have to use a correlated sub-query, I would try to use EXISTS, IN, = in that order this does not mean all these operators would always give you the same explain/results it depends on other factors as I mentioned before.

Now the good thing is that the optimizer is wise enough to make the best plan. So in those ideal situations when one method has the same symantics as the other one (ie will generate the same answer set for all given data values), the optimizer will generate the same plan irrespective of what operator you used.

So you might even find that the inner join and the correlated sub-query can at times have the same plan and performance.

So always check your explains !
Teradata Employee

Re: Performance of Correlated Subquery!!!

Hello Joel,

Really nice explanation above. 

When you say "SELECT E.*FROM EMPLOYEE EWHERE E.DEPTNO = (SELECT D.DEPTNO FROM DEPT D WHERE D.DEPTNO = E.DEPTNO);" error out when there are duplicates DEPTNO records in the DEPT table, what exactly you mean ?

Is there any such rule for co-related sub-query that Duplicates will cause error ??

Thanks in advance,

Smarak

Enthusiast

Re: Performance of Correlated Subquery!!!

Hi

Please refer the below example.

CREATE MULTISET TABLE EMP_TAB

(

EMP_ID INTEGER NOT NULL,

EMP_NAME VARCHAR(10),

EMP_DEPT INTEGER

)

UNIQUE PRIMARY INDEX(EMP_ID)

;

CREATE MULTISET TABLE DEPT_TAB

(

DEPT_ID INTEGER,

DEPT_NAME VARCHAR(10),

DEPT_CNTRY VARCHAR(10)

)

UNIQUE PRIMARY INDEX(DEPT_ID, DEPT_CNTRY)

;

INSERT INTO EMP_TAB VALUES (1,'Santanu',100);

INSERT INTO EMP_TAB VALUES (2,'Smarak',101);

INSERT INTO EMP_TAB VALUES (3,'Joe',100);

INSERT INTO DEPT_TAB VALUES (100,'TEST','India');

INSERT INTO DEPT_TAB VALUES (101,'DUMMY','India');

EMP_ID EMP_NAME EMP_DEPT

1 Santanu 100

2 Smarak 101

3 Joe 100

DEPT_ID DEPT_NAME DEPT_CNTRY

100 TEST India

101 DUMMY India

Now if we run the below query it will run fine.

SEL E.* FROM EMP_TAB E WHERE E.EMP_DEPT = (SEL D.DEPT_ID FROM DEPT_TAB D WHERE E.EMP_DEPT = D.DEPT_ID ) ; 

But if we insert another row to dept table as below (which is also valid).

INSERT INTO DEPT_TAB VALUES (100,'TEST','USA');

DEPT_ID DEPT_NAME DEPT_CNTRY

100 TEST India

100 TEST USA

101 DUMMY India

The subquery will return more than 1 value and the correlated subquery will be error out.

I think that is what Joel meant.

Thanks

Santanu

Teradata Employee

Re: Performance of Correlated Subquery!!!

Thanks Santanu for the examples. I wanted to understand why the presence of duplicates will cause an error. I did a similar testing as you elaborated. Now, it's clear to me. Pasting the code snippet for anyone else looking into the same topic. 

CREATE TABLE TIM_1_0605
(EMPID INTEGER, SALARY FLOAT, DEPTID INTEGER)
PRIMARY INDEX(EMPID);

INS INTO TIM_1_0605 VALUES(1,25000,01);
INS INTO TIM_1_0605 VALUES(2,35000,01);
INS INTO TIM_1_0605 VALUES(3,45000,01);
INS INTO TIM_1_0605 VALUES(4,25000,02);
INS INTO TIM_1_0605 VALUES(5,35000,02);
INS INTO TIM_1_0605 VALUES(6,55000,02);
INS INTO TIM_1_0605 VALUES(7,55000,03);

CREATE TABLE TIM_2_0605
(DEPTID INTEGER, DEPT_NAME VARCHAR(30))
PRIMARY INDEX(DEPTID);

INS INTO TIM_2_0605 VALUES(01,'SALES');
INS INTO TIM_2_0605 VALUES(02,'MARKET');
INS INTO TIM_2_0605 VALUES(02,'FINANCE');

SELECT A.* FROM TIM_1_0605 A INNER JOIN TIM_2_0605 B ON A.DEPTID EQ B.DEPTID; -- Returns 09 Rows

SELECT A.* FROM TIM_1_0605 A WHERE A.DEPTID EQ (SELECT B.DEPTID FROM TIM_2_0605 B WHERE A.DEPTID EQ B.DEPTID); --Errored Out Owing To "More Than 1 Values Returned By The Subquery"
SELECT A.* FROM TIM_1_0605 A WHERE A.DEPTID IN (SELECT B.DEPTID FROM TIM_2_0605 B); -- Returns 06 Rows As Duplicate Rows Are Discarded

Thanks,

Smarak

Enthusiast

Re: Performance of Correlated Subquery!!!

Hi

Nicely done. This will definitely be helpful for anyone seeking the answer.

Thanks

Santanu

Re: Performance of Correlated Subquery!!!

Hello terauser9,

It is true that correlated subqueries in teradata are faster than traditional joins as it uses a special technique called "shared spool" but it is not true for all cases.

Hello joedsilva,

Sorry to point out, not always a correlated subquery would through an error. The example cited by you uses an = operator which does not accept more than 1 value as either of its parameter, so it would throw an error in case of multiple values. There is alaways more than 1 way of writing SQL query to acheive the same result.

Hello Santanu84 & SmarakDas,

Thanks for sharing the examples and code snippet.

To understand the performance of correlated subqueries, lets see the below examples.

With the same code snippet as SmarakDas shared, I choose the SQL statement:

SELECT A.* FROM DBA_MAINT.EMP A WHERE A.DEPTID IN (SELECT B.DEPTID FROM DBA_MAINT.DEPT B);

which the optimizer would most likely process as:

SELECT A.* 
FROM DBA_MAINT.EMP A ,
(
SELECT DISTINCT DEPTID
FROM DBA_MAINT.DEPT
) B
WHERE A.DEPTID = B.DEPTID;

Typically, the subquery gets evaluated, distincted, indexed or hashed or sorted and then joined to table EMP.

Whereas, writing below SQL statement:

SELECT A.* FROM DBA_MAINT.EMP A WHERE EXISTS (SELECT B.DEPTID FROM DBA_MAINT.DEPT B WHERE A.DEPTID EQ B.DEPTID);

will be processed like

FOR DEPTID IN ( SELECT * FROM DBA_MAINT.EMP )
LOOP
IF ( EXISTS ( SELECT NULL FROM DBA_MAINT.DEPT B WHERE B.DEPTID = DEPTID )
THEN
--Fetch the record
END IF
END LOOP

which as you can see would always result in a full table scan of EMP.

CASE1: Correlated Subquery better than normal Subquery or Join

When table DEPT is huge (Select DEPTID from DEPT takes huge resource and time) but table EMP is quite small (executes SELECT NULL FROM DEPT B WHERE B.DEPTID = DEPTID is faster).

CASE2: Correlated Subquery worse than normal Subquery or Join

Table EMP is huge and subquery of DEPT is relatively smaller.

CASE3: Correlated Subquery nearly equal to normal Subquery or Join

When both the table EMP and the subquery of table DEPT is huge, they both would perform nearly similar.

Note:

  • Since the optimizer has query rewrite capabilities, it can rewrite it to a more efficient queries using joins.
  • All these examples have been provided excluding any other factors in mind. the queries would obviously depend on other factors like indexes and statistics which you could utilize to improve performance of any of these since now you know how the queries are being processed by the optimizer.
  • The examples have been tested on teradata system and the explanations have been cited from explain plans.

Correlated subqueries are very efficient in Teradata as compared to any other database systems mostly due to shared spool technique and query rewrite capabilities of optimizer.

Hope this helps.

Thanks,

Anish

Junior Contributor

Re: Performance of Correlated Subquery!!!

Hi Anish,

some remarks:

There's no special technique called "shared spool" for correlated subqueries (CS), the query is simply rewritten as join.

Joe D'Silva didn't claim that the mentioned CS always throws an error.

Your description how a CS is processed is the logical flow, but never actually done this way. So when "explanations have been cited from explain plans" can you show this explain?

I've never seen a CS using EXISTS being slower than IN/JOIN regardless of the table sizes as the optimizer rewrites them to joins anyway. They might be faster, especially when the inner query is the non-unique part, as the optimizer might do an "inclusion" join which can't be expressed with join-syntax.

A huge performance difference might also exist for NOT EXISTS vs. NOT IN and LEFT JOIN + WHERE IS NULL.

And many DBMSes will create a plan similar to Teradata.