Re-writing a query interms of performance

Database
KVB
Enthusiast

Re-writing a query interms of performance

CREATE MULTISET TABLE KEY_T ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
DW_CREAT_DT DATE FORMAT 'YYYY-MM-DD' NOT NULL,
VISITOR_ID INTEGER NOT NULL,
SRCE_VISITOR_ID VARCHAR(255) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
SITE_NM VARCHAR(255) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
EFF_DT DATE FORMAT 'YYYY-MM-DD' NOT NULL,
DOMN_ID CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
DW_LOAD_TS TIMESTAMP(0) NOT NULL,
DW_UPDT_TS TIMESTAMP(0) NOT NULL)
UNIQUE PRIMARY INDEX ( DW_CREAT_DT ,VISITOR_ID )
PARTITION BY RANGE_N(DW_CREAT_DT BETWEEN DATE '2010-08-01' AND '2099-12-31' EACH INTERVAL '1' DAY )
UNIQUE INDEX ( SRCE_VISITOR_D ,SITE_URL_NM ,DOMN_ID );

CREATE MULTISET TABLE CUST_T ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
FIRST_DT DATE FORMAT 'yyyy-mm-dd' NOT NULL,
VISITOR_ID INTEGER NOT NULL,
CUST_ID INTEGER NOT NULL,
DOMN_ID CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
EFF_DT DATE FORMAT 'yyyy-mm-dd' NOT NULL,
DW_LOAD_TS TIMESTAMP(0) NOT NULL)
UNIQUE PRIMARY INDEX ( FIRST_DT ,VISITOR_ID,CUST_ID )
PARTITION BY RANGE_N(FIRST_DT BETWEEN DATE '2010-08-01' AND '2099-12-31' EACH INTERVAL '1' DAY );

Both the tables contains same number of rows. 1366380662

Here is my query

SELECT MIN(A.FIRST_DT), SRCE_VISITOR_ID, CUST_ID
FROM CUST_T A
INNER JOIN KEY_T B
ON A.FIRST_DT = B.DW_CREAT_DT
AND A.VISITOR_ID = B.VISITOR_ID
WHERE CUST_ID != 0
AND A.DOMN_ID = 'W'
AND B.DOMN_ID = 'W'
GROUP BY SRCE_VISITOR_ID, CUST_ID

Is there any other way to avoid all row scan ,even though there are 100% uniqueness.It is taking 15 minutes.

Please suggest any approach if there is any.

9 REPLIES
Enthusiast

Re: Re-writing a query interms of performance

Hi,

I can see that the two tables have UPIs with only 1 column matching. Have you thought of creating AJI and see the preformance?

Cheers,

Raja

Senior Apprentice

Re: Re-writing a query interms of performance

All those unique indexes seem suspicious, are they actually based on access/join paths or do they just implement the logical keys? A NUPI (FIRST_DT ,VISITOR_ID) on T_CUST would allow a direct join without preparartion.

Joining up to 1.300.000.000 rows and aggregating them is a lot of work, can you show the explain?

KVB
Enthusiast

Re: Re-writing a query interms of performance

Sure Deiter,Please find the below explain plan.It is showing as 2 min .But when I execute it,it takes more than 45 mins..

1) First, we lock KEY_T in view
KEY_1 for access, and we lock
CUST_T in view
CUST_1 for access.
2) Next, we do an all-AMPs RETRIEVE step from
CUST_T in view
CUST_1 by way of an all-rows scan
with a condition of ("(CUST_T in view
CUST_1.CUST_ID <> 0) AND
((CUST_T in view
CUST_1.DOMN_ID )= 'W')")
into Spool 4 (all_amps), which is redistributed by the rowkey of (
CUST_T.FIRST_DT,
CUST_T.VISITOR_ID) to all AMPs.
Then we do a SORT to partition Spool 4 by rowkey. The size of
Spool 4 is estimated with low confidence to be 417,736,528 rows (
10,861,149,728 bytes). The estimated time for this step is 48.61
seconds.
3) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
RowHash match scan, which is joined to
KEY_T in view
KEY_1 by way of a RowHash match
scan with a condition of ("(KEY_T in
view KEY_1.DOMN_ID )=
'W'"). Spool 4 and KEY_T are joined
using a rowkey-based merge join, with a join condition of (
"(FIRST_DT =
KEY_T.DW_CREAT_DT)
AND ((VISITOR_ID =
KEY_T.VISITOR_ID) AND
((DOMN_ID )=
(KEY_T.DOMN_ID )))"). The
input table KEY_T will not be cached
in memory. The result goes into Spool 3 (all_amps), which is
built locally on the AMPs. The size of Spool 3 is estimated with
no confidence to be 54,250,291 rows (5,967,532,010 bytes). The
estimated time for this step is 21.80 seconds.
4) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan , grouping by field1 (
KEY_T.SRCE_VISITOR_ID
,CUST_T.CUST_ID). Aggregate
Intermediate Results are computed globally, then placed in Spool 5.
The aggregate spool file will not be cached in memory. The size
of Spool 5 is estimated with no confidence to be 54,250,291 rows (
20,181,108,252 bytes). The estimated time for this step is 1
minute and 13 seconds.
5) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of
an all-rows scan into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with no confidence
to be 54,250,291 rows (6,618,535,502 bytes). The estimated time
for this step is 2.22 seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 2 minutes and 25 seconds.

Thanks
Senior Apprentice

Re: Re-writing a query interms of performance

The explain looks like expected, redistributing CUST_T and then a local join to KEY_T.

There's "no confidence" for join and aggregate (which stats are collected?), if the actual number is much larger than expected it will be much slower. But you should never use wall clock time as it's dependant on the load on your system.

Do you have access to the Querylog? The QryLogStepsV will have all details about estimated vs. actual.

How often is this query supposed to run?

KVB
Enthusiast

Re: Re-writing a query interms of performance

For CUST_T

Date Time Unique Values Column Names
13/11/24 1:57:25 1,090 FIRST_DT
13/11/24 1:58:15 163,402,379 VISITOR_ID
13/11/24 1:58:39 4,192,811 CUST_ID
13/11/24 1:58:52 4 DOMN_ID
13/11/28 5:28:33 1,367,554,969 FIRST_DT,VISITOR_ID,CUST_ID
13/11/27 20:37:06 1,094 PARTITION
13/11/24 2:01:55 1,105,915,258 FIRST_DT,VISITOR_ID

For KEY_T

Date Time Unique Values Column Names
13/11/02 23:26:44 1,069 DW_CREAT_DT
13/11/02 23:27:56 163,402,379 VISITOR_ID
13/11/02 23:38:49 856,216,070 SRCE_VISITOR_ID
13/11/02 23:39:50 1,086,403,953 DW_CREAT_DT,VISITOR_ID

these are the stats.
Enthusiast

Re: Re-writing a query interms of performance

Hi Bikky,

I was just analysing your scenaio, and got this link, just read the disadvantages of PPI and I hope you will be able to optimize your query.

http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/Database_Management/B035_1094_111...

Khurram
Senior Apprentice

Re: Re-writing a query interms of performance

There's no PARTITION stats on KEY_T and no re-collection for three weeks.

The UPI on T_CUST does not provide a better distribution than a NUPI (FIRST_DT ,VISITOR_ID), but prevents a fast AMP-local join.

And the number of rows is probbaly underestimated.

Btw, you'll probably not run a select like that returning millions of rows, is this part of an Insert/Select or Create Table As?

Enthusiast

Re: Re-writing a query interms of performance

Hi,

I think you need to revisit on PI (NUPI) selection , suiting the business requirement.

Do you really need partition day wise?

Cheers,

Raja

Enthusiast

Re: Re-writing a query interms of performance

If you have option to revise PI, you can improve performance of this query. You might need to implement following in CUST_T table. This will help to maintain uniquness, support to other queries using USI and improved performance of mentioned or similar to mentioned queries.

PRIMARY INDEX ( FIRST_DT ,VISITOR_ID )
PARTITION BY RANGE_N(FIRST_DT BETWEEN DATE '2010-08-01'
AND '2099-12-31' EACH INTERVAL '1' DAY )
UNIQUE INDEX ( FIRST_DT ,VISITOR_ID,CUST_ID );