Detecting a Broken Prior Value/New Value Chain

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Fan

Problem: There are many cases where a database stores the previous value along with the replacement value. When an update happens, it is important to validate that (using CDC terms) the "before image" matches the most-recent "after image". If those don't match, then the chain of updates is broken.

The specific problem being solved here is where the prior value/new value pairs are stored in separate columns in the each row.

KeyPrior ValueNew ValueDate
1AB2011-02-05
1BC2011-02-06
1CD2011-02-07
1DE2011-02-08

These prior value/new value pairs generally build a chain. But how do you know when that chain is broken? We'll build an example and show the query that flags the records where the chain is broken.

Example

In the complete example below, a log table records division assignments as an employee of a large corporation (Volkswagen) moves from division to division. The table includes the "old" division and the "new" division in each record.

The "broken chain" happens when a record's "old" value does not match the previous record's "new" value for the same entity (in this case, an employee).

Set-based detection of these broken chains can be done using the scheme explained in the example below.

Test Table/Data

Employees that work for Volkswagen get moved from division to division. Here we track two workers, Luc and Ulrich.

-- ============================================
--DROP TABLE person_div_assgt_log;
CREATE VOLATILE TABLE person_div_assgt_log (
first_name varchar(10) NOT NULL,
old_div_name varchar(12) NULL,
new_div_name varchar(12) NOT NULL,
log_date date NOT NULL
) ON COMMIT PRESERVE ROWS;

-- ============================================
-- ===== Luc
-- Luc starts out working for Volkswagen
INSERT INTO person_div_assgt_log (
first_name,
old_div_name,
new_div_name,
log_date
)
VALUES (
'Luc',
NULL,
'Volkswagen',
DATE '2000-01-01'
);

-- Luc moves to Skoda
INSERT INTO person_div_assgt_log (
first_name,
old_div_name,
new_div_name,
log_date
)
VALUES (
'Luc',
'Volkswagen',
'Skoda',
DATE '2001-01-01'
);

-- Luc moves to Bentley
INSERT INTO person_div_assgt_log (
first_name,
old_div_name,
new_div_name,
log_date
)
VALUES (
'Luc',
'Skoda',
'Bentley',
DATE '2002-01-01'
);

-- ===== Ulrich
-- Ulrich starts out working for Skoda
INSERT INTO person_div_assgt_log (
first_name,
old_div_name,
new_div_name,
log_date
)
VALUES (
'Ulrich',
NULL,
'Skoda',
DATE '2000-01-01'
);

-- Ulrich moves to Audi
INSERT INTO person_div_assgt_log (
first_name,
old_div_name,
new_div_name,
log_date
)
VALUES (
'Ulrich',
'Skoda',
'Audi',
DATE '2002-01-01'
);

-- Ulrich moves to Lamborghini
INSERT INTO person_div_assgt_log (
first_name,
old_div_name,
new_div_name,
log_date
)
VALUES (
'Ulrich',
'Skoda', -- Incorrect "old" value. Should be 'Audi'
'Lamborghini',
DATE '2005-01-01'
);

-- Ulrich now works for Bugatti
INSERT INTO person_div_assgt_log (
first_name,
old_div_name,
new_div_name,
log_date
)
VALUES (
'Ulrich',
'Lamborghini',
'Bugatti',
DATE '2011-01-01'
);

Create Volatile Tables

These two volatile tables will allow us to compare old to new.

CREATE VOLATILE TABLE tOld (
first_name varchar(10) NOT NULL,
old_div_name varchar(12) NULL,
new_div_name varchar(12) NULL,
log_date date NOT NULL,
Rnk int NOT NULL
) ON COMMIT PRESERVE ROWS;

CREATE VOLATILE TABLE tNew (
first_name varchar(10) NOT NULL,
old_div_name varchar(12) NULL,
new_div_name varchar(12) NULL,
log_date date NOT NULL,
Rnk int NOT NULL
) ON COMMIT PRESERVE ROWS;

Load Volatile Tables

INSERT INTO tOld
SELECT
first_name,
old_div_name,
new_div_name,
log_date,
RANK() OVER (PARTITION BY first_name ORDER BY log_date) AS Rnk
FROM person_div_assgt_log;

INSERT INTO tNew
SELECT
first_name,
old_div_name,
new_div_name,
log_date,
RANK() OVER (PARTITION BY first_name ORDER BY log_date) - 1 AS Rnk
FROM person_div_assgt_log;

Notice that the second RANK() has "-1". This creates the offset that allows us to compare old to new.

Simple Data Queries

All Records

SELECT * FROM person_div_assgt_log ORDER BY first_name, log_date;

Results

first_nameold_div_namenew_div_namelog_date
LucNULLVolkswagen2000-01-01
LucVolkswagenSkoda2001-01-01
LucSkodaBentley2002-01-01
UlrichNULLSkoda2000-01-01
UlrichSkodaAudi2002-01-01
UlrichSkodaLamborghini2005-01-01
UlrichLamborghiniBugatti2011-01-01

tOld Records

SELECT * FROM tOld ORDER BY first_name, log_date;

Results

first_nameold_div_namenew_div_namelog_dateRnk
LucNULLVolkswagen2000-01-011
LucVolkswagenSkoda2001-01-012
LucSkodaBentley2002-01-013
UlrichNULLSkoda2000-01-011
UlrichSkodaAudi2002-01-012
UlrichSkodaLamborghini2005-01-013
UlrichLamborghiniBugatti2011-01-014

tNew Records

SELECT * FROM tNew ORDER BY first_name, log_date;

Results

first_nameold_div_namenew_div_namelog_dateRnk
LucNULLVolkswagen2000-01-010
LucVolkswagenSkoda2001-01-011
LucSkodaBentley2002-01-012
UlrichNULLSkoda2000-01-010
UlrichSkodaAudi2002-01-011
UlrichSkodaLamborghini2005-01-012
UlrichLamborghiniBugatti2011-01-013

Broken Chain Query

SELECT
o.first_name,
o.new_div_name AS Old_NewName,
o.Rnk AS oRnk,
n.old_div_name AS New_OldName,
n.Rnk AS nRnk
FROM tOld AS o
JOIN tNew AS n
ON o.first_name = n.first_name
AND o.Rnk = n.Rnk
-- The old record's "new" does not match the new record's "old"
AND o.new_div_name <> n.old_div_name
AND o.Rnk IS NOT NULL
AND n.Rnk IS NOT NULL
ORDER BY
o.first_name,
o.Rnk;

Results

first_nameNew_OldNameoRnkOld_NewNameRnk
UlrichAudi2Skoda2

This query successfully flagged the broken chain.

These tables show how the RANK() results line up the entries for comparison. They show only rows where the Rnk column matches between tOld and tNew.

Luc
tOldtNew
new_div_nameRnkold_div_nameRnk
Volkswagen1Volkswagen1
Skoda2Skoda2
Ulrich
tOldtNew
new_div_nameRnkold_div_nameRnk
Skoda1Skoda1
Audi2Skoda2
Lamborghini3Lamborghini3

Simpler Query without Volatile Tables

The volatile tables allow you to step through the data and understand it, but there is no real reason to use them. Here is the query without volatile tables to store the ranking.

SELECT
o.first_name,
o.new_div_name AS Old_NewName,
o.Rnk AS oRnk,
n.old_div_name AS New_OldName,
n.Rnk AS nRnk
FROM (
SELECT
first_name,
old_div_name,
new_div_name,
log_date,
RANK() OVER (PARTITION BY first_name ORDER BY log_date) AS Rnk
FROM person_div_assgt_log
) AS o
JOIN (
SELECT
first_name,
old_div_name,
new_div_name,
log_date,
RANK() OVER (PARTITION BY first_name ORDER BY log_date) - 1 AS Rnk
FROM person_div_assgt_log
) AS n
ON o.first_name = n.first_name
AND o.Rnk = n.Rnk
-- The old record's "new" does not match the new record's "old"
AND o.new_div_name <> n.old_div_name
ORDER BY
o.first_name,
o.Rnk;

The results are exactly the same.

Alternate Queries

These queries were suggested by Dieter Nöth in the comments and are important enough that they should be included in the body to make sure no one misses them. I modified them slightly to get results similar to the other queries, but the logic is his.

CTE Query

Dieter's first suggestion uses a single CTE. A nice simplification that also has performance implications that he explains below.

WITH cte (
first_name,
old_div_name,
new_div_name,
log_date,
Rnk
) AS (
SELECT
first_name,
old_div_name,
new_div_name,
log_date,
RANK() OVER (PARTITION BY first_name ORDER BY log_date) AS Rnk
FROM person_div_assgt_log
)
SELECT
o.first_name,
o.new_div_name AS New_OldName,
o.Rnk AS oRnk,
n.old_div_name AS Old_NewName,
n.Rnk - 1 AS nRnk
--n.Rnk AS nRnk
FROM cte AS o
JOIN cte AS n
ON o.first_name = n.first_name
AND o.Rnk = n.Rnk - 1
-- The old record's "new" does not match the new record's "old"
AND o.new_div_name <> n.old_div_name
ORDER BY
o.first_name,
o.Rnk;

Super-Compact Query

This query is amazingly compact considering what it's doing.

SELECT
first_name,
MIN(new_div_name)
OVER (
PARTITION BY first_name
ORDER BY log_date
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
) AS New_OldName,
old_div_name AS Old_NewName
FROM person_div_assgt_log
QUALIFY New_OldName <> Old_NewName;

Again, the results are (*almost) exactly the same. (* The "Super-Compact" query does not include rank values but correctly flags the offending records.)

4 Comments
Junior Contributor
Hi Rob,
two parts where this can be further simpliyfied:

#1:
Do the same RANK-calculation in both Derived Tables and then join on "o.Rnk = n.Rnk - 1".
Then the optimizer will re-use the spool once created, whereas different calculations result in two STAT steps.
And now you can use a single WITH instead of two Derived Tables:

WITH cte(
first_name,
old_div_name,
new_div_name,
log_date,
Rnk)
AS
(
SELECT
first_name,
old_div_name,
new_div_name,
log_date,
RANK() OVER (PARTITION BY first_name ORDER BY log_date) AS Rnk
FROM person_div_assgt_log
)

SELECT *
FROM cte AS o
JOIN cte AS n
ON o.first_name = n.first_name
AND o.Rnk = n.Rnk - 1
-- The old record's "new" does not match the new record's "old"
AND o.new_div_name <> n.old_div_name
ORDER BY
o.first_name,
o.Rnk;

#2:
If you only want to check the previous value, it's much easier to do it this way:
SELECT
first_name,
MIN(new_div_name)
OVER (PARTITION BY first_name
ORDER BY log_date
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS New_OldName,
old_div_name AS Old_NewName
FROM
person_div_assgt_log
QUALIFY New_OldName <> Old_NewName

This results in a single STAT step in explain.

Sometimes a query in Teradata is much easier than in SQL Server ;-)

Dieter
Fan
Dieter, those are great suggestions. Thank you!
Enthusiast
Dieter - The CTE solution is a great one. I need to keep the concept of the CTE in mind. I am sure it will come in handy one day. I've used the window aggregate approach before to organize data and "peak" ahead or behind.

RobG - Great blog topic. Thank you!

Hi 

I have a huge Fact Table with millions of rows. I am trying to connect all Current/Prev records to get full history . Sample Set of Data which exist in fact table is shown below.

Rec_No........Prev_Rec_No..........Next_rect_No...........P_ID.......S_TEXT_REF.. eff_dt .... end_dt

1                  Null                        103                                393         S77             01/02/2013   01/05/2013

102                 1                         757                                393         S77             01/06/2013    01/09/2013

757                 102                    855                                 393          S88            01/10/2013    01/15/2013

855                 757                    Null                                  393         S88            01/16/2013     12/31/9999

I am trying to produce above mentioned results via Sql. Kindly advice .