Performance Tips for Group by on all columns

Database

Performance Tips for Group by on all columns

Hi, 

I was working on tuning a query and I saw that the code was doing a group by on all columns being selected ( about 20 ). Even if i remove the group by condition, the select is not returning any duplicate rows, but the group by was put to eliminate duplicates if at all they exist. I noticed in viewpoint that the group by on all columns step is taking about 13 mins to complete. The target table is defined as Multiset table with NUPI, what should be the right approach here, should i remove the group by condition, define the table as Set with UPI or is this there any work around, maybe pre aggregation ? Please let me know your thoughts.

Thanks,

Karthik

Tags (1)
6 REPLIES

Re: Performance Tips for Group by on all columns

Hi Karthik,

Did you checked by doing explain of that query which will give more details about the performance issue and will also show if there is any Product join involved?

Is all the required stats are collected?

Re: Performance Tips for Group by on all columns

Hi Ravi, 

Yes I did take a look at the explain, the issue is not in the join step, as the steps leading to join completes in 5 mins, but the group by step takes about 13 mins to complete.

--Karthik

Re: Performance Tips for Group by on all columns

My suggestion is get back to the person or designer or modeler, why it is made thus. There may  be reasons for making it multiset, nupi , thus. So also for the query, you can get in touch with the end users. Multiset tables violate set theory semantics and the relational data model by permitting duplicate rows in a table. Now with latest technologies , complicated requirements, different data types , heteregeneous sources........ I am a strong proponent of duplicate rows.

N/A

Re: Performance Tips for Group by on all columns

mutiset tables became popular in some areas as they are sometimes faster while inserting data. There are cases where it makes sensce - but as you figured out if you have to pay a lot of resources to dedup with a group by it is not always a cleaver approach.

What you describe seems to be a case where the decission of multiset vs. set should be revisited. Seems you have some good arguments in hand for the discussion with the PDM owner.

Re: Performance Tips for Group by on all columns

Thank you Raja & Ulrich for your suggestions, It definitely was helpful.

--Karthik

Re: Performance Tips for Group by on all columns

Btw, if your varchar field(s) is/are very long and if placed in the sort key of spool file, then it can engender performance. Even collect stats can take more time for varchars which are very long.