difference between distinct and group by

Analytics
Enthusiast

difference between distinct and group by

Hi,

When we fetch unique rows from a table which is better distinct or group by??
4 REPLIES
Highlighted
Fan

Re: difference between distinct and group by

"GROUP BY" improves perfomance compare to DISTINCT CLAUSE.
Junior Contributor

Re: difference between distinct and group by

It depends.

as a rule of thumb: If you don't know exactly about the data, better use GROUP BY.

DISTINCT is better for columns with a low number of rows per value: #rows < #AMPs
GROUP BY is better for columns with a large number of rows per value: #rows > #AMPs

TD13 will automatically rewrite DISTINCT to GROUP BY and apply some new algorithms to get faster response.

Dieter

Re: difference between distinct and group by

Since DISTINCT redistributes the rows immediately, more data may move between the AMPs, where as  GROUP BY that only sends unique values between the AMPs.

So, we can say that  GROUP BY sounds more efficient. 

    But when you assume that data is nearly unique in a table, GROUP BY will  spend more time attempting to eliminate duplicates that do not exist at all.Therefore, it is wasting its  time to check for duplicates the first time. Then, it must redistribute the same amount of data .

Let us see how these steps are used in each case for elimination of Duplicates

(can be found out using explain plan)

DISTINCT

1. It reads each row on AMP

2. Hashes the column value identified in the distinct clause of select statement.

3. Then redistributes the rows according to row value into appropriate AMP

4. Once  redistribution is completed , it

    a. Sorts data to group duplicates on each AMP

    b. Will remove all the duplicates on each amp and sends the original/unique value

P.s: There are cases when "Error : 2646 No more Spool Space " . In such cases try using GROUP BY.

GROUP BY

1. It reads all the rows part of GROUP BY

2. It will remove all duplicates in each AMP for given set of values using "BUCKETS" concept

3. Hashes the unique values on each AMP

4. Then it will re-distribute them to particular /appropriate AMP's

5. Once  redistribution is completed , it

    a. Sorts data to group duplicates on each AMP

    b. Will remove all the duplicates on each amp and sends the original/unique value

Hence it is better to  go for

  • GROUP BY  -  when Many duplicates
  • DISTINCT        -  when few or no duplicates
  • GROUP BY -  SPOOL space is exceeded
Enthusiast

Re: difference between distinct and group by

Hi Santosh,

Really very good explanation in distinct vs Group by

but just want to add something that  from TD 12 onwards , teradata is recommending to always use a Group by in place of Distinct in a query  and  Optimizer will automatically decide whether to apply distinct or group by internally for fetching the answer set for best performance

and also as suggested by Dieter above that TD13 will automatically rewrite Distinct to a GRoup by ..

Also as part of Teradata standards in various projects ,its told not to go with the DISTINCT option..

Cheers!

Nishant