Change data capture

Database

Change data capture

Hoping that someone can help with a situation I'm in.  We have a source table that we're pulling in from SQL Server where the source is 3 million rows, and the table has about 100 columns in it.  Once we pull it in, we have to do a full table comparison to see if any of the columns changed since the last load (CDC).  I've tried using a MINUS, and also a LEFT OUTER JOIN, but in both cases, I run out of spool space because of having to coalesce all of the columns...many of which are null.  I'm looking for suggestions on the best way to do this.

Thanks!

Larry

2 REPLIES

Re: Change data capture

Few ideas that crop up in my mind :)

100 columns!!!! Comparing all columns, source and target, database resource consumption will be high.Of course , if necessary(very rarely), you are 100% sure of any change, that is in vantage. Cyclic Redundancy check(CRC-code) is another option may be better but not that good too.

Can you ask the upstream folks to produce delta? Or you can think of transaction log mechanism where it sniffs for changes and mark the changes. SQL server must have in-built feature for CDC,or you can check with some tools.Triggers can affect performance, if it triggers for changes.Check te replication services available and do some programming to capture changes. Else you have to think of candidate columns that change, like timestamp,date, indicator fields.... if you have any.

End of the day, it is project specific requirement and data. So make choice judiciouly to live happily :)

N/A

Re: Change data capture

Hi Larry,

both joining on COALESCE and MINUS will result in redistribution steps, you better switch to NOT EXISTS instead:

SELECT *
FROM t1
WHERE NOT EXISTS
(SELECT * FROM t2
WHERE (t1.col1 = t2.col2 OR (t1.col1 IS NULL AND t2.col1 IS NULL))
AND (t1.col2 = t2.col2 OR (t1.col2 IS NULL AND t2.col2 IS NULL))
AND (t1.col3 = t2.col3 OR (t1.col3 IS NULL AND t2.col3 IS NULL))
...
AND (t1.col100 = t2.col100 OR (t1.col100 IS NULL AND t2.col100 IS NULL))

Assuming both table share the same PI this will result in an AMP-local join without any preparation step. No need to think about which value to use in COALESCE and the code is easily created automatically.