Identifying Duplicate Records with Time Difference Condition

Database
Fan

Identifying Duplicate Records with Time Difference Condition

I'm a bit of a novice when it comes to SQL and Teradata..but I am familiar with the following syntax for finding duplicates records

 

SELECT *
FROM TABLE
QULAIFY
  COUNT(*) OVER (PARTITION BY COL1,COL2,COL3) > 1

 

In my case however, I am dealing with time bound data where timestamp should be the unique identifier...in other words it is possible for records to otherwise be identical other than the timestamp. We have manually identified cases where duplicate records have erroneusly been sent/received within a couple minutes of each other. Given the nature of the data, we know it is not possible to receive two identical records within ~30 minutes of each other. So I want to be able to identify duplicate records, but only when the timestamps are within 30 minutes of one another. My first thought was to try to do something like this...

 

SELECT *
FROM TABLE
QULAIFY
  COUNT(*) OVER (PARTITION BY EXTRACT(HOUR FROM TIME_REC),COL1,COL2,COL3) > 1

 

...but obviously this will only return a hit if the two records were received within the same hour of the day (i.e. 2:15 & 2:37)...but not pick up instances that span over the top of the hour (i.e. 2:47 & 3:04)...not to mention it doesn't get to the grainularity of minutes that I'm after. My beginner level understanding is preventing me from seeing whether there are any other solutions that can help me identify duplicate points with a time difference condition.

 

Any thoughts? Thanks in advance!

3 REPLIES
Teradata Employee

Re: Identifying Duplicate Records with Time Difference Condition

In a subquery, you can compute the time difference between successive rows, e.g.

(MIN(TIME_REC) OVER (PARTITION BY COL1, COL2, COL3 ORDER BY TIME_REC ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)) - TIME_REC AS TIME_TO_NEXT_ROW

and/or

TIME_REC - (MIN(TIME_REC) OVER (PARTITION BY COL1, COL2, COL3 ORDER BY TIME_REC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)) AS TIME_SINCE_PRIOR_ROW

Note: Either MIN or MAX will work since you are limiting the window to at most one row.

 

Then filter in the outer query based on the time difference.

 

 

Fan

Re: Identifying Duplicate Records with Time Difference Condition

Thanks Fred...I can see how to incorporate your suggestions into a select statement to yeild the following output...

 

TS       DIFF1              DIFF2

05:30   ?                     17:00.0000

05:47   17:00.0000     ?

07:33   ?                     ?

 

but the real question is can this somehow be rolled into a qualify statement so that I can return just a count of the duplicates...more akin to the original example I posted?

Junior Contributor

Re: Identifying Duplicate Records with Time Difference Condition

No need to count:

QUALIFY DIFF1 IS NOT NULL OR DIFF2 IS NOT NULL