I was working on tuning a query and I saw that the code was doing a group by on all columns being selected ( about 20 ). Even if i remove the group by condition, the select is not returning any duplicate rows, but the group by was put to eliminate duplicates if at all they exist. I noticed in viewpoint that the group by on all columns step is taking about 13 mins to complete. The target table is defined as Multiset table with NUPI, what should be the right approach here, should i remove the group by condition, define the table as Set with UPI or is this there any work around, maybe pre aggregation ? Please let me know your thoughts.
Did you checked by doing explain of that query which will give more details about the performance issue and will also show if there is any Product join involved?
Is all the required stats are collected?
Yes I did take a look at the explain, the issue is not in the join step, as the steps leading to join completes in 5 mins, but the group by step takes about 13 mins to complete.
My suggestion is get back to the person or designer or modeler, why it is made thus. There may be reasons for making it multiset, nupi , thus. So also for the query, you can get in touch with the end users. Multiset tables violate set theory semantics and the relational data model by permitting duplicate rows in a table. Now with latest technologies , complicated requirements, different data types , heteregeneous sources........ I am a strong proponent of duplicate rows.
mutiset tables became popular in some areas as they are sometimes faster while inserting data. There are cases where it makes sensce - but as you figured out if you have to pay a lot of resources to dedup with a group by it is not always a cleaver approach.
What you describe seems to be a case where the decission of multiset vs. set should be revisited. Seems you have some good arguments in hand for the discussion with the PDM owner.
Btw, if your varchar field(s) is/are very long and if placed in the sort key of spool file, then it can engender performance. Even collect stats can take more time for varchars which are very long.