help optimizing GROUP BY query

Enthusiast

help optimizing GROUP BY query

This is my query:

CREATE TABLE rtl.intermediate AS (
    SELECT
        customer_id,
        MAX(new_to) AS new_to,
        MIN(age) AS age,
        MIN(gender) AS gender,
        MIN(existing) AS existing
    FROM rtl.base
    WHERE country = 'China'
      AND product = 'cereal'
      AND dt BETWEEN '2015-01-01' AND '2016-01-01'
    GROUP BY customer_id
) WITH DATA
UNIQUE PRIMARY INDEX (customer_id, new_to, gender);

It currently takes about 10 seconds to run, and I would like to bring it down to about 2 seconds.  The rtl.base table is partitioned on the date column dt (one partition per 7 days) and has a primary index on (customer_id, product, country, dt).  I have collected statistics on PARTITION and on the age column.


This is the explain:

1) First, we lock a distinct rtl."pseudo table" for read on a
RowHash to prevent global deadlock for
rtl.base.
2) Next, we lock rtl.intermediate for
exclusive use, and we lock rtl.base for read.
3) We lock a distinct DBC."pseudo table" for read on a RowHash for
deadlock prevention.
4) We lock DBC.DBase for read on a RowHash.
5) We do a single-AMP ABORT test from DBC.DBase by way of the unique
primary index "Field_1 = 'rtl'" with a residual condition of (
"'0000BF0A'XB= DBC.DBase.Field_2").
6) We create the table header.
7) We do an all-AMPs SUM step to aggregate from 53 partitions of
rtl.base with a condition of (
"(rtl.base.dt >= DATE '2015-01-01') AND
((rtl.base.dt <= DATE '2016-01-01') AND
((rtl.base.country = 'CHN') AND
(rtl.base.product = 'cereal')))")
, grouping by field1 ( rtl.base.customer_id).
Aggregate Intermediate Results are computed globally, then placed
in Spool 3. The size of Spool 3 is estimated with no confidence
to be 8,142,324 rows (293,123,664 bytes). The estimated time for
this step is 0.28 seconds.
8) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 1 (all_amps), which is redistributed
by the hash code of (rtl.base.customer_id,
rtl.base.new_to,
rtl.base.gender) to all AMPs. Then we do a
SORT to order Spool 1 by row hash. The size of Spool 1 is
estimated with no confidence to be 8,142,324 rows (227,985,072
bytes). The estimated time for this step is 0.15 seconds.
9) We do an all-AMPs MERGE into
rtl.intermediate from Spool 1 (Last Use).
The size is estimated with no confidence to be 8,142,324 rows.
The estimated time for this step is 1 minute and 27 seconds.
10) We lock a distinct DBC."pseudo table" for write on a RowHash for
deadlock prevention, we lock a distinct DBC."pseudo table" for
write on a RowHash for deadlock prevention, and we lock a distinct
DBC."pseudo table" for write on a RowHash for deadlock prevention.
11) We lock DBC.Indexes for write on a RowHash, we lock DBC.TVFields
for write on a RowHash, we lock DBC.TVM for write on a RowHash,
and we lock DBC.AccessRights for write on a RowHash.
12) We execute the following steps in parallel.
1) We do a single-AMP ABORT test from DBC.TVM by way of the
unique primary index "Field_1 = '0000BF0A'XB, Field_2 =
'INTERMEDIATE'".
2) We do an INSERT into DBC.Indexes (no lock required).
3) We do an INSERT into DBC.Indexes (no lock required).
4) We do an INSERT into DBC.Indexes (no lock required).
5) We do an INSERT into DBC.TVFields (no lock required).
6) We do an INSERT into DBC.TVFields (no lock required).
7) We do an INSERT into DBC.TVFields (no lock required).
8) We do an INSERT into DBC.TVFields (no lock required).
9) We do an INSERT into DBC.TVFields (no lock required).
10) We do an INSERT into DBC.TVM (no lock required).
11) We INSERT default rights to DBC.AccessRights for
rtl.intermediate.
13) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.

thanks!

Junior Contributor

Re: help optimizing GROUP BY query

Why do you create a three-column UPI if customer_id is already unique? 

Simply use UNIQUE PRIMARY INDEX (customer_id) and there's no redistribution needed. 
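
For example (untested sketch; the SELECT is unchanged, only the index definition differs):

CREATE TABLE rtl.intermediate AS (
    SELECT
        customer_id,
        MAX(new_to) AS new_to,
        MIN(age) AS age,
        MIN(gender) AS gender,
        MIN(existing) AS existing
    FROM rtl.base
    WHERE country = 'China'
      AND product = 'cereal'
      AND dt BETWEEN '2015-01-01' AND '2016-01-01'
    GROUP BY customer_id
) WITH DATA
UNIQUE PRIMARY INDEX (customer_id);  -- single-column UPI: the aggregated rows already hash to the target AMPs, so the redistribute/sort before the MERGE goes away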

Did you create the base table?

You might try changing the PI, too. Can you post some column demographics (number of distinct values, rows per value) for customer_id, product, and country?

Enthusiast

Re: help optimizing GROUP BY query

thanks Dieter,

I added the secondary index including gender, as I later need to GROUP BY gender and get a COUNT(*).  I assumed that would need to be indexed, so I added it to the secondary index.  Should I not do that?
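
The later query is roughly this (simplified):

SELECT gender, COUNT(*) AS customers
FROM rtl.intermediate
GROUP BY gender;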

Yes, I did create the base table as well.  But it already has a UNIQUE PRIMARY INDEX (customer_id, product, country) etc.  What are you recommending that I change?

regards,

imran

Junior Contributor

Re: help optimizing GROUP BY query

Hi imran,

well, there's no secondary index in your table definition?

Stay away from defining UPIs on all your tables: you'll get perfectly distributed tables and horrible performance.

That's why I asked for additional info: you might be able to switch to a NUPI that matches your queries (e.g. PI(customer_id) to improve GROUP BY performance), or you could add additional partitioning; there's a sketch after the pasted criteria below.

<cut & paste>

Criteria for Choosing the Primary Index

Access = Maximize single-AMP operations

- Choose the column(s) most frequently used for JOIN (and WHERE conditions), usually a Primary or Foreign Key

Distribution = Optimize parallel processing

- Choose the column(s) that provide good distribution

- Better when there are more unique values and fewer rows per value

- Adding more columns might result in better distribution but less usability: the RowHash can only be calculated when all values are exactly known

Volatility = Reduce maintenance resource overhead (I/O)

- Choose a column with stable data values

- An UPDATE of the PI column effectively results in: DELETE row -> send row to new AMP -> INSERT row (plus maintaining all Secondary/Join Indexes)

The Primary Key of a table is always a perfect fit based on distribution and volatility, as it's unique and stable. But access is the most important criterion, and a PK might not be accessed at all. So data distribution has to be balanced with access usage in choosing a PI.

</cut & paste>
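
As a sketch of what I mean (the new table name and the date range are made up, adjust them to your data):

CREATE MULTISET TABLE rtl.base_nupi AS (
    SELECT * FROM rtl.base
) WITH DATA
PRIMARY INDEX (customer_id)  -- NUPI matching the GROUP BY / join column
PARTITION BY RANGE_N(dt BETWEEN DATE '2000-04-01' AND DATE '2016-12-31' EACH INTERVAL '1' MONTH);
-- add ", NO RANGE, UNKNOWN" inside RANGE_N if dt can be NULL or fall outside these dates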

Enthusiast

Re: help optimizing GROUP BY query

thanks Dieter,

I've now changed to a non-unique primary index, with a MULTISET table, and partitioning by month on the date range.

Doing the aggregate statistics is pretty fast (~ 3 seconds) on about 800 million rows of data with a few filters applied.

What takes longer is figuring out the first row chronologically of each customer that matches the filters, in the selected date range.  It's possible for a customer to purchase a particular item on two separate days, so right now they will be represented as two rows in the table.  For a metric like % of customers, I need only one row per customer, and would like to use the first row of each customer (by date).

I've tried the standard GROUP BY subquery:

SELECT
    AVG(age) AS age,
    AVG(spend) AS spend
FROM rtl.base t1
INNER JOIN
(
    SELECT
        customer_id,
        MIN(dt) AS dt
    FROM rtl.base
    WHERE country = 'China'
      AND product = 'cereal'
      AND dt BETWEEN '2015-01-01' AND '2016-01-01'
    GROUP BY customer_id
) t2 ON (t1.customer_id = t2.customer_id AND t1.dt = t2.dt);

Also tried ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY dt ASC) ..... QUALIFY row_no = 1

But it takes quite a bit of time (just under 30 seconds) on about 3 million rows.  Any suggestions on how to speed this up?

Enthusiast

Re: help optimizing GROUP BY query

Would it help if I added a second level of partitioning on customer_id, so that each AMP would have all the rows of data it needs for the GROUP BY locally?

thanks

Junior Contributor

Re: help optimizing GROUP BY query

Partitioning does not change the distribution of rows, just their internal order on each AMP; distribution is based only on the PI. What's your current PI?

I usually assume that OLAP functions are more efficient. Do you mean this query was performing slower?

SELECT
    AVG(age) AS age,
    AVG(spend) AS spend
FROM
(
    SELECT age, spend
    FROM rtl.base
    WHERE country = 'China'
      AND product = 'cereal'
      AND dt BETWEEN '2015-01-01' AND '2016-01-01'
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY dt ASC) = 1
) AS t;

Enthusiast

Re: help optimizing GROUP BY query

That actually seems to have slowed the query down even further; it now runs in 60 seconds.

This is how I partitioned it:

WITH DATA
PRIMARY INDEX (customer_id, dt, country, product)
PARTITION BY (
    RANGE_N(dt BETWEEN '2000-04-01' AND '2016-04-05' EACH INTERVAL '1' MONTH),
    RANGE_N(customer_id BETWEEN 1 AND 1000000000 EACH 2000000));

The explain statement says it should only take 3 seconds to run.

Even after collecting stats on the base table for customer_id and dt, performance remains slow (44 seconds).
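
For reference, the stats statements were roughly (older one-column-per-statement syntax):

COLLECT STATISTICS ON rtl.base COLUMN (customer_id);
COLLECT STATISTICS ON rtl.base COLUMN (dt);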

Any suggestions on improvement?

Junior Contributor

Re: help optimizing GROUP BY query

Statistics will hardly help; the estimates might be better, but the actual resource usage will not change.

In theory the fastest way should be:

PRIMARY INDEX (customer_id)
PARTITION BY (
    RANGE_N(dt BETWEEN '2000-04-01' AND '2016-12-31' EACH INTERVAL '1' MONTH)
)

Depending on the actual data this might result in a skewed table. What's the average and maximum number of rows per customer_id/month?
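
Something like this would show it (untested sketch):

SELECT
    AVG(CAST(cnt AS FLOAT)) AS avg_rows_per_customer_month,
    MAX(cnt) AS max_rows_per_customer_month
FROM (
    -- rows per customer_id per calendar month
    SELECT
        customer_id,
        EXTRACT(YEAR FROM dt) AS yr,
        EXTRACT(MONTH FROM dt) AS mth,
        COUNT(*) AS cnt
    FROM rtl.base
    GROUP BY 1, 2, 3
) AS t;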


Enthusiast

Re: help optimizing GROUP BY query

Hi Dieter!  Is this a one-man support forum?

The average is 1, the max is 55.

thanks,

imran