Really nice explanation above.
When you say "SELECT E.*FROM EMPLOYEE EWHERE E.DEPTNO = (SELECT D.DEPTNO FROM DEPT D WHERE D.DEPTNO = E.DEPTNO);" error out when there are duplicates DEPTNO records in the DEPT table, what exactly you mean ?
Is there any such rule for co-related sub-query that Duplicates will cause error ??
Thanks in advance,
Thanks Santanu for the examples. I wanted to understand why the presence of duplicates will cause an error. I did a similar testing as you elaborated. Now, it's clear to me. Pasting the code snippet for anyone else looking into the same topic.
CREATE TABLE TIM_1_0605
(EMPID INTEGER, SALARY FLOAT, DEPTID INTEGER)
INS INTO TIM_1_0605 VALUES(1,25000,01);
INS INTO TIM_1_0605 VALUES(2,35000,01);
INS INTO TIM_1_0605 VALUES(3,45000,01);
INS INTO TIM_1_0605 VALUES(4,25000,02);
INS INTO TIM_1_0605 VALUES(5,35000,02);
INS INTO TIM_1_0605 VALUES(6,55000,02);
INS INTO TIM_1_0605 VALUES(7,55000,03);
CREATE TABLE TIM_2_0605
(DEPTID INTEGER, DEPT_NAME VARCHAR(30))
INS INTO TIM_2_0605 VALUES(01,'SALES');
INS INTO TIM_2_0605 VALUES(02,'MARKET');
INS INTO TIM_2_0605 VALUES(02,'FINANCE');
SELECT A.* FROM TIM_1_0605 A INNER JOIN TIM_2_0605 B ON A.DEPTID EQ B.DEPTID; -- Returns 09 Rows
SELECT A.* FROM TIM_1_0605 A WHERE A.DEPTID EQ (SELECT B.DEPTID FROM TIM_2_0605 B WHERE A.DEPTID EQ B.DEPTID); --Errored Out Owing To "More Than 1 Values Returned By The Subquery"
SELECT A.* FROM TIM_1_0605 A WHERE A.DEPTID IN (SELECT B.DEPTID FROM TIM_2_0605 B); -- Returns 06 Rows As Duplicate Rows Are Discarded
It is true that correlated subqueries in teradata are faster than traditional joins as it uses a special technique called "shared spool" but it is not true for all cases.
Sorry to point out, not always a correlated subquery would through an error. The example cited by you uses an = operator which does not accept more than 1 value as either of its parameter, so it would throw an error in case of multiple values. There is alaways more than 1 way of writing SQL query to acheive the same result.
Thanks for sharing the examples and code snippet.
To understand the performance of correlated subqueries, lets see the below examples.
With the same code snippet as SmarakDas shared, I choose the SQL statement:
SELECT A.* FROM DBA_MAINT.EMP A WHERE A.DEPTID IN (SELECT B.DEPTID FROM DBA_MAINT.DEPT B);
which the optimizer would most likely process as:
FROM DBA_MAINT.EMP A ,
SELECT DISTINCT DEPTID
WHERE A.DEPTID = B.DEPTID;
Typically, the subquery gets evaluated, distincted, indexed or hashed or sorted and then joined to table EMP.
Whereas, writing below SQL statement:
SELECT A.* FROM DBA_MAINT.EMP A WHERE EXISTS (SELECT B.DEPTID FROM DBA_MAINT.DEPT B WHERE A.DEPTID EQ B.DEPTID);
will be processed like
FOR DEPTID IN ( SELECT * FROM DBA_MAINT.EMP )
IF ( EXISTS ( SELECT NULL FROM DBA_MAINT.DEPT B WHERE B.DEPTID = DEPTID )
--Fetch the record
which as you can see would always result in a full table scan of EMP.
CASE1: Correlated Subquery better than normal Subquery or Join
When table DEPT is huge (Select DEPTID from DEPT takes huge resource and time) but table EMP is quite small (executes SELECT NULL FROM DEPT B WHERE B.DEPTID = DEPTID is faster).
CASE2: Correlated Subquery worse than normal Subquery or Join
Table EMP is huge and subquery of DEPT is relatively smaller.
CASE3: Correlated Subquery nearly equal to normal Subquery or Join
When both the table EMP and the subquery of table DEPT is huge, they both would perform nearly similar.
Correlated subqueries are very efficient in Teradata as compared to any other database systems mostly due to shared spool technique and query rewrite capabilities of optimizer.
Hope this helps.
There's no special technique called "shared spool" for correlated subqueries (CS), the query is simply rewritten as join.
Joe D'Silva didn't claim that the mentioned CS always throws an error.
Your description how a CS is processed is the logical flow, but never actually done this way. So when "explanations have been cited from explain plans" can you show this explain?
I've never seen a CS using EXISTS being slower than IN/JOIN regardless of the table sizes as the optimizer rewrites them to joins anyway. They might be faster, especially when the inner query is the non-unique part, as the optimizer might do an "inclusion" join which can't be expressed with join-syntax.
A huge performance difference might also exist for NOT EXISTS vs. NOT IN and LEFT JOIN + WHERE IS NULL.
And many DBMSes will create a plan similar to Teradata.