The query below takes around 30-35 minutes to store its results in a temporary table.
SELECT c.call_start_dt ,
       'xrx' AS source ,
       CASE
           WHEN c.postpaid_seq IS NOT NULL THEN 'X'
           WHEN c.prepaid_seq IS NOT NULL THEN 'Y'
       END AS source_system ,
       CASE
           WHEN c.prepaid_seq IS NOT NULL THEN c.postpaid_seq
           ...
       END AS ...
FROM vw_xerox.call_details c --pointing to call_details table (vw_xerox.call_details is a 1:1 view)
WHERE c.call_start_dt BETWEEN date '2015-03-24' - 120
                          AND (cast('2015-07-24' AS date) + 31)
/* The table call_details is partitioned per day and holds only months of data.
   The volume of this table is quite high (7,923,659,811 rows in total);
   partitioning is done on call start date
   and the PI is on call start date and call start time */
AND c.st_ind <> 'Q'
AND ... IN
    ( SELECT ...
      FROM xer.xerox_load_status_hist --Returns around 3999 rows
      WHERE src_system IN ('X', 'Y')
        AND ( ( run_dt <= date '2015-07-24'
                AND process_dt IS NOT NULL )
           OR ( active_ind = 'Y'
                AND run_dt = cast('2015-07-24' AS date)
                AND process_dt IS NULL ) ) )
AND ( src_system, src_batch_nr, ins_id, upd_id ) NOT IN
    /* 2. performance bottleneck also observed here */
    ( SELECT src_system, src_batch_nr, ins_id, upd_id
      FROM xer.xerox_loaded_usg
      WHERE src_system IN ('X', 'Y') --Returns around 3400 rows
    ) ;
/* In above script column and table name are replaced by some other names */
Is there a way to tune this SQL? Even the INSERT of 4 months of data from the call_details table into the temp table takes a significant amount of time!
and (src_system, src_batch_nr, ins_id, upd_id) not in /* 2. performance bottleneck observed here */
( select src_system, src_batch_nr, ins_id, upd_id from xer.xerox_loaded_usg
  where src_system in ('X','Y') --Returns around 3400 rows
)
For the above part, an alternative approach could be to delete from the temporary table those records that have a match in xerox_loaded_usg on the combination (src_system, src_batch_nr, ins_id, upd_id). This might be a costly operation, as we would be deleting from a big table for some combinations.
Would splitting the result set into small chunks help to make use of the day-wise partitioning? Thanks!
#1: Change the NOT IN to a NOT EXISTS:
and NOT EXISTS
 ( select *
   from xer.xerox_loaded_usg AS u
   where u.src_system in ('X','Y')
     AND c.src_system = u.src_system
     AND c.src_batch_nr = u.src_batch_nr
     AND c.ins_id = u.ins_id
     AND c.upd_id = u.upd_id
 )
#2: Can you show the Explain and (P)PI of all tables including the target table for the insert/select?
Thanks for your response.
I tried the approach you mentioned, but did not gain any improvement.
Note that c.src_system and c.src_batch_nr are derived columns.
The plan for the given timeframe was posted in the previous link; here the timeframe would be between CURRENT_DATE - 120 and CURRENT_DATE + 31.
xer.xerox_loaded_usg has its PI on (ins_id, src_batch_nr, upd_id)!!! -> This can also be a problem.
(Call_Start_Dt, Call_Start_Tms, originating_num, Terminating_Num, imei_nbr, Seq_Nr, Prepaid_Seq, postpaid_Seq, I_Seq) is the PI on the target table, which is a multiset table. Thanks!
You don't have any matching PIs and estimated time is close to actual.
Check QryLogStepsV for the actual vs. estimated numbers and skew.
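If access can be arranged, a query along these lines against the DBQL step log would show per-step actual vs. estimated rows and per-AMP CPU skew. This is only a sketch; column availability varies by Teradata release, and DBQL step logging must be enabled. The QueryID shown is hypothetical.

-- Sketch: compare estimated vs. actual rows and look for CPU skew per step.
-- The QueryID of the slow request can be found in DBC.QryLogV.
SELECT StepLev1Num,
       StepLev2Num,
       StepName,
       EstRowCount,
       RowCount,          -- actual rows produced by the step
       CPUTime,
       MaxAmpCPUTime,     -- much larger than CPUTime / #AMPs => skew
       IOCount
FROM   DBC.QryLogStepsV
WHERE  QueryID = 123456789012345    -- hypothetical QueryID
ORDER  BY StepLev1Num, StepLev2Num;

Steps where RowCount is far from EstRowCount point at stale or missing stats; a MaxAmpCPUTime much higher than the average per-AMP CPU points at skew.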
Yes, there are no matching PIs.
Unfortunately, I don't have access to the tables you mentioned.
I'm not sure whether the steps I have chosen below are correct. I need your suggestions on this.
1. Populate the data for each day (to make use of the partitioning and get an optimal plan) for the given timeframe into a temporary table, say T1. An INSERT for such a big timeframe is taking too much time.
I do understand that a bulk INSERT is much faster than an INSERT through a loop; I am trying to make use of the partitioning to get an optimal plan. I may be wrong here, but if spool space is low that may have an impact.
The IN (or EXISTS) part of the SQL could also be appended here.
The call_details table is a BIG table; it takes around 380 GB of space for 6 months of data. Alternatively, I would have preferred a CT (CREATE TABLE AS) and then dropping the partitions for dates greater than the upper bound value of the date (or less than the lower bound value of the date), appending rows as required or simply deleting rows, so that the result set holds only the data for the timeframe
c.call_start_dt BETWEEN date '2015-03-24'- 120 AND (
cast('2015-07-24' AS date) + 31)
I was also thinking of doing a JOIN instead of the IN. But in that case we would be joining a partitioned table with a non-partitioned table, and to my understanding the joining key of the partitioned table will be SORTed first, which is an overhead. Please correct me if I am WRONG here.
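If the day-wise chunking from step 1 is attempted, each statement could target a single partition, along these lines. This is only a sketch: T1 and the other names are the anonymized ones from above, and the SELECT list is abbreviated.

-- Sketch: load one day per INSERT so the optimizer can eliminate
-- all partitions of call_details except the one for that day.
-- Repeat (or generate) one statement per day in the timeframe.
INSERT INTO T1
SELECT c.call_start_dt, ...          -- same select list as the original query
FROM   vw_xerox.call_details c
WHERE  c.call_start_dt = DATE '2015-03-24' - 120   -- first day of the window
  AND  c.st_ind <> 'Q';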
2. Delete the data from T1 where c.st_ind = 'Q'. This way we can replace the <> with =. But again, deleting data from a huge table with a filter predicate may have a performance overhead.
3. Delete the data from T1 that has a match in xerox_loaded_usg on the combination (src_system, src_batch_nr, ins_id, upd_id). This might be a costly operation, since we would be deleting from a big table for some combinations, but this way we can avoid the NOT IN and replace it with an IN. Also, the tables have no matching PIs!
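Step 3 could be written as a DELETE with a correlated EXISTS rather than an IN. A sketch using the anonymized names from the thread; whether it beats filtering during the INSERT depends on the plan and on the non-matching PIs:

-- Sketch: purge combinations that were already loaded.
DELETE FROM T1
WHERE EXISTS
  ( SELECT 1
    FROM   xer.xerox_loaded_usg u
    WHERE  u.src_system IN ('X','Y')
      AND  u.src_system   = T1.src_system
      AND  u.src_batch_nr = T1.src_batch_nr
      AND  u.ins_id       = T1.ins_id
      AND  u.upd_id       = T1.upd_id );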
This is one approach that I can think of at this point in time.
Is there any other approach I am missing? Your valuable suggestions will surely help. Thanks once again for giving your valuable time.
IN will be rewritten to a JOIN anyway and the optimizer decides if/what/when to sort.
But before you try to change your approach, you should find out why, and which step, is performing badly.
For that you need data from DBQL, you should contact your DBA and ask for it.
Btw, what's your system type: 1xxx, 2xxx or 6xxx?
Is it a 120 AMP system?
Thanks once again for the correction. Yes indeed, it would be replaced by a JOIN.
It's a 120 AMP system. I need to get the information on system type and will get back to you.
Well, I tried an incremental approach to understand the performance one step at a time. Even with only the call start date predicate applied, for such a big timeframe it takes a significant amount of time. Note that the stats are up to date. So the only parts remaining are the IN, the NOT IN and the '<>'.
It has been observed that even for a single day, when the NOT IN part is appended and the data volume is high, it takes more time. Hence the conclusion is that the NOT IN part of the SQL is the performance bottleneck, which is why the above approach came to my mind. I am not really sure whether it works, as I have no prod-like setup on which to observe the performance.
Anyway, I will try to validate it by looking at the table you mentioned, with the help of the DBA.
I have a question here: why did you say that the tables should have matching PIs? If I am correct, joining a partitioned table to an unpartitioned table with the same primary index has an impact on performance; it results in performance degradation.
Well, I am more of an Oracle guy and new to Teradata tuning. It seems the journey is going to be an interesting one :)
Thanks for all the valuable suggestions and guidance.
NOT IN might result in an overly complicated plan, but NOT EXISTS never does.
Of course, joining a partitioned table to a non-partitioned one on their PI is slower than joining tables with the same PI and partitioning. But it's still faster than joining tables on non-matching PIs.
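One caveat worth adding for readers comparing the two predicates: NOT IN and NOT EXISTS are only equivalent when the subquery column cannot be NULL. A single NULL in the NOT IN list makes the predicate unknown for every row, so nothing qualifies (table and column names below are illustrative only):

-- If u.upd_id can be NULL, this returns no rows at all,
-- because x NOT IN (..., NULL) is never true:
SELECT t.* FROM t WHERE t.upd_id NOT IN (SELECT u.upd_id FROM u);

-- NOT EXISTS has no such pitfall (and usually gets a simpler plan):
SELECT t.* FROM t
WHERE NOT EXISTS (SELECT 1 FROM u WHERE u.upd_id = t.upd_id);

This NULL handling is also why the optimizer has to build the more complicated plan for NOT IN.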