Bad query performance in aggregation

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
Junior Supporter

Bad query performance in aggregation

 

SELECT ATTRB_ID,
       MDL_INSTNC_ID,
       INPUT_ATTRB_VAL_TX,
       DW_EFF_DT (NAMED DW_EFF_DT_MIN),
       MINIMUM(DW_ROW_NR) AS DW_ROW_NR_MIN
FROM   mytable W
GROUP BY ATTRB_ID,
       MDL_INSTNC_ID,
       INPUT_ATTRB_VAL_TX,
       DW_EFF_DT;

Hi All,

 

This is a very simple query where I am aggregating on a few columns, and it takes 30K CPU because the base table has 2bn records. I think the issue is that

INPUT_ATTRB_VAL_TX

is VARCHAR(2000). If I change it to VARCHAR(500) it takes less CPU, about 10K. However, we don't have much control here to change the datatype to VARCHAR(500), as this is how the data is expected in this column. So, what are the tuning options here?

 

Thanks 

Samir

4 REPLIES
Senior Apprentice

Re: Bad query performance in aggregation

Hi Samir,

 

Sorry, but I don't think there is one if you're looking at a single query.

 

If you need to GROUP BY the entire width of the column for functional reasons (i.e. to get the correct result) then that is what you have to do. What is happening internally is the data value in this column in each row is being expanded to the full 2000 characters for the aggregation processing, so that the dbms gives the correct answer. This is causing a lot of overflow in the aggregation processing buffer - and I can't see a way of increasing that for a single query.

 

I notice that in your example there is no WHERE clause, so you're aggregating all rows in the table. Again, if that is what you need for functional reasons then you've got to do it. If you can apply selection to reduce the number of rows being aggregated that might help.

 

There is an internal DBSControl setting which affects the aggregate cache buffer size. BUT remember that giving more memory to one process will mean less memory is available for other processes. You might need to talk to your DBA team about changes to DBSControl (Sorry but I don't know your role in your company).

 

HTH

Dave

Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Junior Supporter

Re: Bad query performance in aggregation

Thanks Dave for your reply.

We can't put a filter on the query: this is the stage data, it's a weekly job, and we have to process all of it.

Quote:
If you need to GROUP BY the entire width of the column for functional reasons (i.e. to get the correct result) then that is what you have to do. What is happening internally is the data value in this column in each row is being expanded to the full 2000 characters for the aggregation processing, so that the dbms gives the correct answer. This is causing a lot of overflow in the aggregation processing buffer - and I can't see a way of increasing that for a single query.

For your point above, I do agree with you, and it is very evident when you look at the explain of the query. However, I have a few questions here:

  1. When this is a VARCHAR column and each column value has its actual length stored with it, don't you think the optimizer should be expected to know the maximum length of the data in the field and use that instead of the 2000 defined on the table? To test this, I created another table with the problematic field as VARCHAR(500), and I see the CPU reducing.
  2. Are we saying that the Teradata database is not optimized to handle this kind of aggregation, or is this the way any RDBMS would handle it (please don't get me wrong here!)? Is there any other approach you would suggest, so that we can propose revisiting the design?
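The CPU difference in point 1 can be modeled with a minimal Python sketch (an illustration of the padding behaviour Dave describes, not Teradata's actual internals): if grouping keys are expanded to the declared column width, then the key size, and the work to build and compare keys, scales with that declared width rather than with the actual data length.

```python
# Illustration only: a toy model of an aggregation engine that pads
# VARCHAR grouping values to the column's *declared* width. The 4x
# key-size difference mirrors the CPU drop observed when moving the
# column from VARCHAR(2000) to VARCHAR(500).

def padded_group_key(value: str, declared_width: int) -> str:
    """Expand a value to the full declared column width."""
    return value.ljust(declared_width)

sample = "some short attribute value"  # actual data is much shorter than 2000

key_2000 = padded_group_key(sample, 2000)
key_500 = padded_group_key(sample, 500)

print(len(key_2000))  # 2000
print(len(key_500))   # 500
```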

I don't think we can change the DBS settings for this.

 

Thanks !

Samir

Senior Apprentice

Re: Bad query performance in aggregation

Hi,

 

Your questions are sensible, even obvious, but (for now at least) I don't think they change anything.

 

"When this is a VARCHAR column and each column value has its actual length stored with it, don't you think the optimizer should be expected to know the maximum length of the data in the field and use that instead of the 2000 defined on the table?"

 

You're correct that the dbms knows the actual length of the data value in each column in each row.

 

I think there is an optimisation (to only use the actual length) that might be applied when GROUP BY only includes a single column, but with multiple columns (as in your case) it is hard to do so.

 

When grouping by multiple columns the dbms effectively concatenates the data values from each column (in the GROUP BY clause) to build the grouping value. This is where the issue arises and also remember that the first job of the dbms when running a query is to produce the correct result - even at the cost of some performance.

 

Consider the following scenario which shows the values in two 'group by' columns (COL1 and COL2) for two data rows:

COL1  COL2 COL3
----  ---- ----
AB    C       1
A     BC      2

If we were to run the following query we would (correctly) expect the answer set to contain two rows, because the combination of columns COL1 and COL2 contains two values.

SELECT col1, col2,
       SUM(col3)
FROM table
GROUP BY col1, col2;

The way that Teradata handles this is to concatenate the values of COL1 and COL2 together (in memory) to find all the distinct grouping values.

To always do this correctly, it uses the full width of columns COL1 and COL2: if the columns are defined as VARCHAR, they are expanded to their full declared width. I think any other mechanism risks producing an incorrect result, which as a user you do not want.
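The two-row scenario above can be sketched in Python (an illustration of the principle, not Teradata's actual code path): naively concatenating the raw values collapses the two distinct groups, while padding each value to a fixed full width keeps them apart.

```python
from collections import defaultdict

# The two rows from the scenario above: (COL1, COL2, COL3)
rows = [("AB", "C", 1), ("A", "BC", 2)]

def group_sums(rows, make_key):
    """GROUP BY col1, col2 with SUM(col3), using make_key for the grouping value."""
    sums = defaultdict(int)
    for c1, c2, c3 in rows:
        sums[make_key(c1, c2)] += c3
    return sums

# Naive key: concatenate raw values. 'AB'+'C' == 'A'+'BC' == 'ABC',
# so the two distinct groups wrongly collapse into one.
naive = group_sums(rows, lambda c1, c2: c1 + c2)

# Full-width key: pad each VARCHAR value to its declared width first
# (width 4 here; 2000 in the original table). The groups stay distinct.
WIDTH = 4
padded = group_sums(rows, lambda c1, c2: c1.ljust(WIDTH) + c2.ljust(WIDTH))

print(len(naive))   # 1 -- wrong answer
print(len(padded))  # 2 -- correct answer
```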

 

Do other dbms's do it the same way? I have no idea.

 

If they don't then they probably have multiple code paths for performing aggregation based on the data types involved.

 

HTH

Dave

 

 

Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Teradata Employee

Re: Bad query performance in aggregation


Do other dbms's do it the same way? I have no idea. 


At the red company, the column values are hashed and the GROUP BY is done against the hash.

Much faster than using regular columns... but not an MPP system.
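The hashed-grouping idea can be sketched as follows (illustrative Python, not any vendor's implementation): hash a length-delimited encoding of the grouping columns, so the key stays a fixed small size regardless of the declared VARCHAR width, while ('AB','C') and ('A','BC') still land in different groups.

```python
import hashlib
from collections import defaultdict

rows = [("AB", "C", 1), ("A", "BC", 2)]

def hashed_key(*values: str) -> bytes:
    """Fixed-size grouping key: hash a length-delimited encoding of the
    column values, so 'AB'|'C' and 'A'|'BC' cannot collide by concatenation."""
    h = hashlib.sha256()
    for v in values:
        encoded = v.encode("utf-8")
        h.update(len(encoded).to_bytes(4, "big"))  # delimit each value by its length
        h.update(encoded)
    return h.digest()  # always 32 bytes, however wide the VARCHARs are

# GROUP BY col1, col2 with SUM(col3), keyed on the hash
sums = defaultdict(int)
for c1, c2, c3 in rows:
    sums[hashed_key(c1, c2)] += c3

print(len(sums))  # 2 -- the groups stay distinct
```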