Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tune

Database
Teradata Employee

Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tune

Hi All,

 

Could you please help me in tuning oppurtunity in below Query. its running 7 to 8 hours for 1 million records to process.

Insert MDM34_CRDM.XREF_2063
Select
m.CEC_ID,
m.ADDR_ID,
s.EXT_CNSMR_ID,
s.SRC_CD_IN,
s.BRAND_CD,
case when ADDRSIM(m.ADDRESS_LINE1,s.lct_Address1) = 0 then -4.248 else ADDRSIM(m.ADDRESS_LINE1,s.lct_Address1)*4.248 end
+ case when Amg_Name_Match(m.FIRST_NAME,s.FIRST_NAME) = 0 then -4.248 else Amg_Name_Match(m.FIRST_NAME,s.FIRST_NAME)*4.248 end
+ case when Amg_Name_Match(m.LAST_NAME,s.LAST_NAME) = 0 then -4.248 else Amg_Name_Match(m.LAST_NAME,s.LAST_NAME)*4.248 end score ,
null from
MDM34.VW_CEC_ADDR_PRFL m,MDM34_STAGE.SRC_CEC_WRK_ADDR s
where m.POSTAL_CODE=s.ZIP_CD_BASE and score >= 12.744;

 

ADDRSIM is Teradata product SysLib Function.

Name_Mtch Function is custom function and definition of that is below.

 

CREATE FUNCTION
syslib.Amg_Name_Match(string1 varchar(100), string2 varchar(100))
RETURNS INTEGER
LANGUAGE SQL
CONTAINS SQL
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN
CASE WHEN sndx(STRTOK(string1, '#',2)) = sndx(STRTOK(string2,'#',2)) THEN 1
ELSE
CASE WHEN editdistance_u(STRTOK(string1, '#',2),STRTOK(string2, '#',2))<=2 AND sndx(STRTOK(string1, '#',1)) = sndx(STRTOK(string2,'#',1)) THEN 1
ELSE 0
END
END;

 

Definition of Views and Tables which are referring by above insert-select query is below.

 

REPLACE VIEW MDM34.VW_CEC_ADDR_PRFL AS
LOCKING ROW FOR ACCESS SEL
CEC_PRFL.CEC_ID,
CEC_XREF.EXT_CNSMR_ID,
CEC_XREF.BRAND_CD,
CEC_XREF.SRC_CD_IN,
--COALESCE(CEC_PRFL.FIRST_NAME,CEC_PRFL.CEC_ID) FIRST_NAME,
--COALESCE(STRTOK(CEC_PRFL.LAST_NAME,' ',1),CEC_PRFL.CEC_ID) LAST_NAME,
cast(COALESCE(STRTOK(CEC_PRFL.LAST_NAME,' ',1),CEC_PRFL.CEC_ID)||'$'||COALESCE(STRTOK(CEC_PRFL.LAST_NAME,' ',1),CEC_PRFL.CEC_ID)||'#'||COALESCE(CEC_PRFL.FIRST_NAME,CEC_PRFL.CEC_ID)||'$'||COALESCE(CEC_PRFL.FIRST_NAME,CEC_PRFL.CEC_ID) AS VARCHAR(500)) FIRST_NAME,
CEC_PRFL.MIDDLE_NAME,
cast(COALESCE(CEC_PRFL.FIRST_NAME,CEC_PRFL.CEC_ID)||'$'||COALESCE(CEC_PRFL.FIRST_NAME,CEC_PRFL.CEC_ID)||'#'||COALESCE(STRTOK(CEC_PRFL.LAST_NAME,' ',1),CEC_PRFL.CEC_ID)||'$'||COALESCE(STRTOK(CEC_PRFL.LAST_NAME,' ',1),CEC_PRFL.CEC_ID) AS VARCHAR(500)) LAST_NAME,
CEC_PRFL.FULL_NAME,
CEC_PRFL.NAME_PREFIX,
CEC_PRFL.NAME_SUFFIX,
CEC_PRFL.NAME_TITLE,
CEC_PRFL.GNDR_CD,
CEC_PRFL.STATUS,
CEC_PRFL.PERSON_TYPE,
CEC_PRFL.IMS_NUM,
CEC_PRFL.NPI_NUM,
CEC_PRFL.BRTH_DT, --COALESCE(CEC_PRFL.BRTH_DT,CAST('9999-01-01' AS DATE)) AS NCOA_DATE
CEC_PRFL.DEATH_DATE,
CEC_ADDR.ADDR_ID,
CEC_ADDR.ADDR_TYPE,
CEC_ADDR.ADDR_PRF_IND,
CEC_ADDR.ADDRESS_LINE1,
CEC_ADDR.ADDRESS_LINE2,
CEC_ADDR.ADDRESS_LINE3,
CEC_ADDR.CITY,
CEC_ADDR.STATE_CD,
CEC_ADDR.COUNTRY,
--CEC_ADDR.POSTAL_CODE,
SUBSTR(CEC_ADDR.POSTAL_CODE,0,6) AS POSTAL_CODE,
CEC_ADDR.IS_ADDR_VALID,
CEC_ADDR.SYS_TARGET_ID,
CEC_ADDR.SYS_AUTH_ID,
CEC_ADDR.SYS_SOURCE,
CEC_ADDR.SYS_CREATED_BY,
CEC_ADDR.SYS_CREATION_DATE,
CEC_ADDR.SYS_ENT_STATE,
CEC_ADDR.SYS_LAST_MODIFIED_BY,
CEC_ADDR.SYS_LAST_MODIFIED_DATE,
CEC_ADDR.SYS_NC_TYPE,
CEC_ADDR.SYS_ERR_CODE,
CEC_ADDR.SYS_ERR_SVRTY
FROM
MDM34_MST.CEC_PRFL CEC_PRFL
INNER JOIN MDM34_MST.CEC_POSTL_ADDR CEC_ADDR
ON CEC_PRFL.CEC_ID = CEC_ADDR.CEC_ID
INNER JOIN MDM34.CEC_CROSS_REFERENCE CEC_XREF
ON CEC_PRFL.CEC_ID = CEC_XREF.CEC_ID;

 

And below is for another view.

 

show view MDM34_STAGE.SRC_CEC_WRK_ADDR;

REPLACE VIEW MDM34_STAGE.SRC_CEC_WRK_ADDR AS
SELECT
EXT_CNSMR_ID,
CNSMR_ID,
--FIRST_NAME,
cast(COALESCE(LAST_NAME,EXT_CNSMR_ID)||'$'||COALESCE(LAST_NAME,EXT_CNSMR_ID)||'#'||COALESCE(FIRST_NAME,EXT_CNSMR_ID)||'$'||COALESCE(FIRST_NAME,EXT_CNSMR_ID) AS VARCHAR(500)) FIRST_NAME,
MIDDLE_NAME,
cast(COALESCE(FIRST_NAME,EXT_CNSMR_ID)||'$'||COALESCE(FIRST_NAME,EXT_CNSMR_ID)||'#'||COALESCE(LAST_NAME,EXT_CNSMR_ID)||'$'||COALESCE(LAST_NAME,EXT_CNSMR_ID) AS VARCHAR(500)) LAST_NAME,
--LAST_NAME,
FULL_NAME,
NAME_PRFX,
NAME_SUFFIX,
CNSMR_TITLE,
STATUS,
/*i think not required
PERSON_TYPE,
RECIPIENT_TYPE, */
PERSON_TYPE ,
IMS_NUM ,
NPI_NUM ,
PRESCRIBER_FLAG,
--COALESCE(BRTH_DT,CAST('9999-01-01' AS DATE)) AS BRTH_DT,
BRTH_DT,
GNDR_CD,
OPT_STATUS,
SOURCE_CD,
SRC_CD_IN,
PROGRAM_CD,
ADDR_ID,
ADDR_TYPE,
ADDR_PRF_IND,

COALESCE(ADDR_LINE_1, EXT_CNSMR_ID) ADDR_LINE_1,
ADDR_LINE_2,
ADDR_LINE_3,
CITY,
STATE_CD,
ALL_AMGEN_OPT_OUT_STATUS
,ADDR_PERM_STATUS
,ADDR_SOURCE
,IS_ADDR_VALID,
SUBSTR(ZIP_CD_BASE,0,6) AS ZIP_CD_BASE,
ZIP_CD_SUFX,
COUNTRY,
GEO_LONG,
GEO_LAT,
MAIL_STATUS,
ADDR_START_DATE,
NCOA_DATE,
SUPPRESS_CD,
NCOA_BAD_ADDR_FLG,
COA_DATE,
COA_FLAG,
BRAND_CD,
DEATH_DATE,
CREATE_DATE,
RECORD_DATE,
COALESCE(LAST_UPDATE_DATE,CREATE_DATE) AS LAST_UPDATE_DATE,
IGNORE_ADDR,
AGE_RANGE,
DIVISION_CODE,
REGION_CODE,
ORG_ID,
ADDR_CREATION_DATE,
COALESCE(ADDR_UPDATE_DATE,ADDR_CREATION_DATE) AS ADDR_UPDATE_DATE,
CONSUMER_STATUS,
EXT_UNIQUE_ID,
AGE,
RMDB_CNSMR_ID,
COALESCE(lct_Address1, EXT_CNSMR_ID) lct_Address1,
lct_Address2,
lct_Address3,
lct_Locality,
lct_AdministrativeArea,
SUBSTR(lct_PostalCode,0,6) AS lct_PostalCode,
lct_ISO,
lct_AVC,
lct_Latitude,
lct_Longitude,
lct_GeoAccuracy,
lct_Organization,
lct_SubBuilding,
lct_Building,
lct_Premise,
lct_Thoroughfare,
lct_Telephone,
COALESCE(lct_Address1, EXT_CNSMR_ID || SRC_CD_IN) as lct_Junk,
JOB_RUN_ID,
SRC_JOB_RUN_ID
FROM MDM34_MST.CEC_WRK_ADDR
WHERE addr_type IS NOT NULL
AND
IS_CDI_PART = 'Y';

 

Could you please help me on this performance issue.

 

Thanks,

Chandra

8 REPLIES
Junior Supporter

Re: Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tu

Chandra,

 

View "m" is a join of 3 tables. The results of this is joined to the single table in view "s".

Each row of "m" is joined to all rows in "s" with the same POSTAL_CODE.

I'm guessing the result of this is more than your expected 1million rows.

During the join, the results of the 3 ADDRSIM compares are tested for result value 12.744.

For each of the 3 ADDRSIM calls, multiple complex functions are executed.

Overall a lot of work !

 

Did you test this with a single, small POSTAL_CODE ?

 

Instead of using "score >= 12.744", you may want to write the 3 compares from the function into the WHERE. That way the number of calls/functions gets reduced (the 1st failing compare flushes the next AND-ed conditions).

 

Key to reducing the work, is to eliminate row compares with single instructions.

You may have to experiment a bit and maybe add conditions like:

( (INDEX(m.FIRST_NAME,'A')>0 AND INDEX(s.FIRST_NAME,'A')>0)
OR (INDEX(m.FIRST_NAME,'E')>0 AND INDEX(s.FIRST_NAME,'E')>0)
OR (INDEX(m.FIRST_NAME,'I')>0 AND INDEX(s.FIRST_NAME,'I')>0)
OR (INDEX(m.FIRST_NAME,'O')>0 AND INDEX(s.FIRST_NAME,'O')>0)
OR (INDEX(m.FIRST_NAME,'U')>0 AND INDEX(s.FIRST_NAME,'U')>0) )

Assuming that all names have at least a 'AEIOU' character.

 

Frank

Teradata Master Frank
Teradata Employee

Re: Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tu

Hi Frank,

 

Yes. I have tested with small and single postal_code values.

 

I saw your below extra conditions. In the similar lines can we generate windowkey like cancatination of postal_code and first alphabet of address_line1 and first consonent of address_line1 for both s and m views. and use that window key in where condition to reduce rows comparisons. is that make sense?

 

Here is the problem i am facing how to get the first alphabet letter from address_line 1 and first consonent from addtess_line1. can you help me if you are aware of that. (from Ex address is like this 200 W Arbor Dr)

 

Thansk,

Chandra

Highlighted
Junior Contributor

Re: Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tu

There's no builtin ADDRSIM UDF.

 

The million records to process is probably the result, but to get this you create a huge m-to-n-join using m.POSTAL_CODE=s.ZIP_CD_BASE.

 

What's the number of rows in both tables and the avg/max number of rows per POSTAL_CODE/ZIP_CD_BASE?

Teradata Employee

Re: Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tu

Here is ADDRSIM function definition.

 

REPLACE FUNCTION syslib.ADDRSIM (string1 varchar(1000), string2 varchar(1000) )
RETURNS decimal(3,2)
LANGUAGE SQL
CONTAINS SQL
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN case when squeeze(string1)=squeeze(string2)

then
1
else
case when (ngramm_u(squeeze(lower(string1)),squeeze(lower(string2)),2)*2.00)/(chars(squeeze(lower(string1)))+chars(squeeze(lower(string2)))-2) >=.8


then 1
else

 

case when (ngramm_u(squeeze(lower(string1)),squeeze(lower(string2)),2)*2.00)/(chars(squeeze(lower(string1)))+chars(squeeze(lower(string2)))-2) >=.55
then (ngramm_u(squeeze(lower(string1)),squeeze(lower(string2)),2)*2.00)/((chars(squeeze(lower(string1)))+chars(squeeze(lower(string2)))-2))

 

else 0
end

end
end
;

 

We have per postal code average 50k and maximum to 100k.

 

After joining based on m and s views with postal code its giving us results as 440 millions approximately.

Junior Contributor

Re: Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tu

Well, you compare every row in the 1st table to every row in the 2nd table with the same postal code, e.g. 50k rows = 50000*50000 comparisons.

 

Each comparison is based on lots of nested functions, of course this consumes lots of CPU.

 

 

Junior Supporter

Re: Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tu

Chandra,
 
Besides the issue of how to find these letters, I don't think your logic would work.
Compare "200 W Arbor Dr", "200 West Arbor Dr", "200 Arbor Dr W".
Using the first consonants and vowels, these would not match.
 
You do have the right idea; eliminate joining rows that don't match with as simple as possible compares.
But to do it right, you probably end up with the complex compare you have now, which does consume large amounts of CPU.
 
Cheers, Frank
Teradata Master Frank
Teradata Employee

Re: Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tu

Thanks Frank.

 

But we have address standardized by Loqate before this query runs.

 

I think what ever problem you mentioned may not occur since we have standardized address coming as input to query.

 

 

Thanks,
Chandra

Junior Supporter

Re: Performance issue - below Query is running 7 hours for 1 million records. - Please help me to tu

Chandra,

 

In that case you may want try using first consonent and vowel, and see if it reduces the join.

 

To get these I would use simple CASE, assuming that they are within the first 10 characters

CASE WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,1,1)) > 0 THEN SUBSTR(AddrLine1,1,1)
WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,2,1)) > 0 THEN SUBSTR(AddrLine1,2,1)
WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,3,1)) > 0 THEN SUBSTR(AddrLine1,3,1)
WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,4,1)) > 0 THEN SUBSTR(AddrLine1,4,1)
WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,5,1)) > 0 THEN SUBSTR(AddrLine1,5,1)
WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,6,1)) > 0 THEN SUBSTR(AddrLine1,6,1)
WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,7,1)) > 0 THEN SUBSTR(AddrLine1,7,1)
WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,8,1)) > 0 THEN SUBSTR(AddrLine1,8,1)
WHEN INDEX('BCDFGHJKLMNPQRSTVWXYZ',SUBSTR(AddrLine1,9,1)) > 0 THEN SUBSTR(AddrLine1,9,1)
ELSE SUBSTR(AddrLine1,10,1) END AS FirstConsonant
CASE WHEN INDEX('AEIOU',SUBSTR(AddrLine1,1,1)) > 0 THEN SUBSTR(AddrLine1,1,1)
WHEN INDEX('AEIOU',SUBSTR(AddrLine1,2,1)) > 0 THEN SUBSTR(AddrLine1,2,1)
WHEN INDEX('AEIOU',SUBSTR(AddrLine1,3,1)) > 0 THEN SUBSTR(AddrLine1,3,1)
WHEN INDEX('AEIOU',SUBSTR(AddrLine1,4,1)) > 0 THEN SUBSTR(AddrLine1,4,1)
WHEN INDEX('AEIOU',SUBSTR(AddrLine1,5,1)) > 0 THEN SUBSTR(AddrLine1,5,1)
WHEN INDEX('AEIOU',SUBSTR(AddrLine1,6,1)) > 0 THEN SUBSTR(AddrLine1,6,1)
WHEN INDEX('AEIOU',SUBSTR(AddrLine1,7,1)) > 0 THEN SUBSTR(AddrLine1,7,1)
WHEN INDEX('AEIOU',SUBSTR(AddrLine1,8,1)) > 0 THEN SUBSTR(AddrLine1,8,1)
WHEN INDEX('AEIOU',SUBSTR(AddrLine1,9,1)) > 0 THEN SUBSTR(AddrLine1,9,1)
ELSE SUBSTR(AddrLine1,10,1) END AS FirstVowel

 

Teradata Master Frank