I want to know which one is better for handling duplicate records: using GROUP BY or DISTINCT while selecting data from the source table, or creating the target table as a SET table?
Thanks in advance.
Nitin 'Raj' Srivastava
The classical answer to an unspecific question ;-> : it depends.
Can you share more information on the process? How many rows, how many duplicates, etc.? Where do the duplicates come from?
But even with this information the answer will be difficult to judge, as the system configuration can also make a difference. Running some tests in your own environment will give you the best answers.
Dedup can become an expensive operation; it should be done only once and as early as possible. A lazy design can result in a lot of GROUP BY and DISTINCT usage and will consume a huge amount of resources...
It again depends on the number of rows per value present in the table.
GROUP BY can be used if the table contains many duplicate rows per value, and DISTINCT in case of few duplicate rows per value.
Creating a SET table as the target makes the duplicate check cover the entire row. Using GROUP BY or DISTINCT when selecting data from your source tables checks only the columns specified in the GROUP BY or DISTINCT clause. The plans for GROUP BY and DISTINCT are similar, but the data set can make a difference.
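To illustrate the difference described above, here is a minimal sketch (table and column names are hypothetical, not from the original post):

```sql
-- A SET table silently rejects duplicate ENTIRE rows on INSERT...SELECT.
CREATE SET TABLE target_tbl (
    cust_id   INTEGER,
    order_dt  DATE,
    amount    DECIMAL(10,2)
) PRIMARY INDEX (cust_id);

-- DISTINCT deduplicates only on the columns listed in the SELECT:
INSERT INTO target_tbl
SELECT DISTINCT cust_id, order_dt, amount
FROM source_tbl;

-- Equivalent result using GROUP BY on the same columns:
INSERT INTO target_tbl
SELECT cust_id, order_dt, amount
FROM source_tbl
GROUP BY cust_id, order_dt, amount;
```

Note that if the SELECT list omits a column, DISTINCT/GROUP BY may merge rows that the SET table would have kept as distinct full rows.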
I think that SET tables are useful to guarantee integrity when you know the ETL development is going to be poor. Considering only the performance objective, it depends; you should test your own case.
Performance varies depending on the data.
DISTINCT is better when the data is nearly unique. When DISTINCT is used, the intermediate spool is sorted and the duplicates are discarded.
GROUP BY is better when the data has relatively few unique values; it works by performing an AMP-local grouping operation and then merging the partial result sets for final processing.
Also check the 13.0 release summary (see attachment).
The optimizer is getting smarter, and sometimes DISTINCT and GROUP BY result in the same plan.
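A quick way to check this on your own system is to compare the EXPLAIN output of both forms. A sketch (the table name is hypothetical):

```sql
-- Compare the plan the optimizer chooses for each form:
EXPLAIN SELECT DISTINCT cust_id FROM source_tbl;

EXPLAIN SELECT cust_id FROM source_tbl GROUP BY cust_id;
-- If the two EXPLAIN texts show the same aggregation/redistribution
-- steps, the optimizer has treated both queries identically.
```

If the plans differ, the step descriptions (sort-and-eliminate-duplicates vs. AMP-local aggregation) tell you which strategy the optimizer picked for your data.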