This is my query:
CREATE TABLE rtl.intermediate AS (
    SELECT
        customer_id,
        MAX(new_to) AS new_to,
        MIN(age) AS age,
        MIN(gender) AS gender,
        MIN(existing) AS existing
    FROM rtl.base
    WHERE country = 'CHN'
      AND product = 'cereal'
      AND dt BETWEEN DATE '2015-01-01' AND DATE '2016-01-01'
    GROUP BY customer_id
) WITH DATA
UNIQUE PRIMARY INDEX (customer_id, new_to, gender);
It currently takes about 10 seconds to run, and I would like to bring it down to 2 seconds. The rtl.base table is partitioned on date (every 7 days) and has a primary index on customer_id, product, country, date (called dt). I have collected statistics on the partition and the age column.
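Collecting those statistics in Teradata looks roughly like this (a sketch; exact syntax varies slightly by release):

```sql
-- Statistics on the partitioning expression, so the optimizer can
-- estimate how many partitions the date filter touches:
COLLECT STATISTICS COLUMN (PARTITION) ON rtl.base;

-- Statistics on the age column:
COLLECT STATISTICS COLUMN (age) ON rtl.base;
```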
This is the explain:
1) First, we lock a distinct rtl."pseudo table" for read on a
RowHash to prevent global deadlock for
2) Next, we lock rtl.intermediate for
exclusive use, and we lock rtl.base for read.
3) We lock a distinct DBC."pseudo table" for read on a RowHash for
4) We lock DBC.DBase for read on a RowHash.
5) We do a single-AMP ABORT test from DBC.DBase by way of the unique
primary index "Field_1 = 'rtl'" with a residual condition of (
6) We create the table header.
7) We do an all-AMPs SUM step to aggregate from 53 partitions of
rtl.base with a condition of (
"(rtl.base.dt >= DATE '2015-01-01') AND
((rtl.base.dt <= DATE '2016-01-01') AND
((rtl.base.country = 'CHN') AND
(rtl.base.product = 'cereal')))")
, grouping by field1 ( rtl.base.customer_id).
Aggregate Intermediate Results are computed globally, then placed
in Spool 3. The size of Spool 3 is estimated with no confidence
to be 8,142,324 rows (293,123,664 bytes). The estimated time for
this step is 0.28 seconds.
8) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 1 (all_amps), which is redistributed
by the hash code of (rtl.base.customer_id,
rtl.base.gender) to all AMPs. Then we do a
SORT to order Spool 1 by row hash. The size of Spool 1 is
estimated with no confidence to be 8,142,324 rows (227,985,072
bytes). The estimated time for this step is 0.15 seconds.
9) We do an all-AMPs MERGE into
rtl.intermediate from Spool 1 (Last Use).
The size is estimated with no confidence to be 8,142,324 rows.
The estimated time for this step is 1 minute and 27 seconds.
10) We lock a distinct DBC."pseudo table" for write on a RowHash for
deadlock prevention, we lock a distinct DBC."pseudo table" for
write on a RowHash for deadlock prevention, and we lock a distinct
DBC."pseudo table" for write on a RowHash for deadlock prevention.
11) We lock DBC.Indexes for write on a RowHash, we lock DBC.TVFields
for write on a RowHash, we lock DBC.TVM for write on a RowHash,
and we lock DBC.AccessRights for write on a RowHash.
12) We execute the following steps in parallel.
1) We do a single-AMP ABORT test from DBC.TVM by way of the
unique primary index "Field_1 = '0000BF0A'XB, Field_2 =
2) We do an INSERT into DBC.Indexes (no lock required).
3) We do an INSERT into DBC.Indexes (no lock required).
4) We do an INSERT into DBC.Indexes (no lock required).
5) We do an INSERT into DBC.TVFields (no lock required).
6) We do an INSERT into DBC.TVFields (no lock required).
7) We do an INSERT into DBC.TVFields (no lock required).
8) We do an INSERT into DBC.TVFields (no lock required).
9) We do an INSERT into DBC.TVFields (no lock required).
10) We do an INSERT into DBC.TVM (no lock required).
11) We INSERT default rights to DBC.AccessRights for
13) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.
Why do you create a three column UPI if customer_id is already unique?
Simply use UNIQUE PRIMARY INDEX (customer_id) and there's no redistribution needed.
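In other words, something like this (a sketch; the WHERE clause is inferred from the EXPLAIN output above, and only the index clause changes):

```sql
CREATE TABLE rtl.intermediate AS (
    SELECT
        customer_id,
        MAX(new_to) AS new_to,
        MIN(age) AS age,
        MIN(gender) AS gender,
        MIN(existing) AS existing
    FROM rtl.base
    WHERE country = 'CHN'
      AND product = 'cereal'
      AND dt BETWEEN DATE '2015-01-01' AND DATE '2016-01-01'
    GROUP BY customer_id
) WITH DATA
-- Single-column UPI: the aggregate spool is already hashed by
-- customer_id, so the final MERGE step needs no redistribution.
UNIQUE PRIMARY INDEX (customer_id);
```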
Did you create the base table?
You might try changing the PI, too. Can you add some demographics (e.g. row counts and number of distinct values) for customer_id, product, country?
I added the secondary index including gender, as I later need to GROUP BY gender and get a COUNT(*). I assumed that would need to be indexed, so I added it to the secondary index. Should I not do that?
Yes, I did create the base table as well. But it already has a UNIQUE PRIMARY INDEX (customer_id, product, country) etc. What are you recommending that I change?
Well, there's no secondary index in your table definition?
Stay away from defining UPIs on all your tables: you'll get perfectly distributed tables and horrible performance.
That's why I asked for additional info. You might be able to switch to a NUPI which matches your queries (e.g. PI(customer_id) to improve GROUP BY performance), or you can add additional partitioning.
<cut & paste>
Criteria for Choosing the Primary Index
I've now changed to a non-unique primary index, with a MULTISET table, and partitioning by month on the date range.
Doing the aggregate statistics is pretty fast (~ 3 seconds) on about 800 million rows of data with a few filters applied.
What takes longer is figuring out the first row chronologically of each customer that matches the filters, in the selected date range. It's possible for a customer to purchase a particular item on two separate days, so right now they will be represented as two rows in the table. For a metric like % of customers, I need only one row per customer, and would like to use the first row of each customer (by date).
I've tried the standard GROUP BY subquery:
Also tried ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY dt ASC) AS row_no ... QUALIFY row_no = 1
But it takes quite a bit of time (< 30 seconds) on about 3 million rows. Any suggestions on how to speed this up?
Would it help if I added a second level of partitioning on customer_id, so that each AMP would have all the rows of data it needs for the GROUP BY locally?
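The "standard GROUP BY subquery" pattern for first-row-per-customer typically looks like this (a sketch; it assumes at most one row per customer_id per dt within the range, otherwise the join can return duplicates):

```sql
SELECT b.*
FROM rtl.base AS b
JOIN (
    -- Earliest purchase date per customer within the window
    SELECT customer_id, MIN(dt) AS first_dt
    FROM rtl.base
    WHERE dt BETWEEN DATE '2015-01-01' AND DATE '2016-01-01'
    GROUP BY customer_id
) AS f
  ON  b.customer_id = f.customer_id
  AND b.dt = f.first_dt;
```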
Partitioning does not change the distribution of rows, just the internal order; distribution is based only on the PI. What's your current PI?
I usually assume that OLAP functions are more efficient; do you mean this query performed slower?
SELECT
    AVG(age) AS age,
    AVG(spend) AS spend
FROM (
    SELECT customer_id, age, spend
    FROM rtl.base
    WHERE dt BETWEEN DATE '2015-01-01' AND DATE '2016-01-01'
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY dt ASC) = 1
) AS dt;
That actually seems to have slowed down the query even further; it now runs in 60 seconds.
This is how I partitioned it:
Statistics will hardly help; the estimates might get better, but the actual resource usage will not change.
In theory the fastest way should be
PRIMARY INDEX (customer_id)
PARTITION BY (
    RANGE_N(dt BETWEEN DATE '2000-04-01' AND DATE '2016-12-31' EACH INTERVAL '1' MONTH)
);
Depending on the actual data this might result in a skewed table. What's the average and maximum number of rows per customer_id/month?
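One way to answer that is a quick profiling query along these lines (a sketch; it buckets by calendar month via EXTRACT, and the CAST avoids Teradata's integer truncation in AVG):

```sql
SELECT AVG(CAST(cnt AS FLOAT)) AS avg_rows,
       MAX(cnt)                AS max_rows
FROM (
    -- Rows per customer per calendar month
    SELECT customer_id,
           EXTRACT(YEAR FROM dt)  AS yr,
           EXTRACT(MONTH FROM dt) AS mth,
           COUNT(*)               AS cnt
    FROM rtl.base
    GROUP BY 1, 2, 3
) AS t;
```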