I'm a bit of a novice when it comes to SQL and Teradata..but I am familiar with the following syntax for finding duplicates records
SELECT * FROM TABLE QULAIFY COUNT(*) OVER (PARTITION BY COL1,COL2,COL3) > 1
In my case however, I am dealing with time bound data where timestamp should be the unique identifier...in other words it is possible for records to otherwise be identical other than the timestamp. We have manually identified cases where duplicate records have erroneusly been sent/received within a couple minutes of each other. Given the nature of the data, we know it is not possible to receive two identical records within ~30 minutes of each other. So I want to be able to identify duplicate records, but only when the timestamps are within 30 minutes of one another. My first thought was to try to do something like this...
SELECT * FROM TABLE QULAIFY COUNT(*) OVER (PARTITION BY EXTRACT(HOUR FROM TIME_REC),COL1,COL2,COL3) > 1
...but obviously this will only return a hit if the two records were received within the same hour of the day (i.e. 2:15 & 2:37)...but not pick up instances that span over the top of the hour (i.e. 2:47 & 3:04)...not to mention it doesn't get to the grainularity of minutes that I'm after. My beginner level understanding is preventing me from seeing whether there are any other solutions that can help me identify duplicate points with a time difference condition.
Any thoughts? Thanks in advance!
In a subquery, you can compute the time difference between successive rows, e.g.
(MIN(TIME_REC) OVER (PARTITION BY COL1, COL2, COL3 ORDER BY TIME_REC ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)) - TIME_REC AS TIME_TO_NEXT_ROW
TIME_REC - (MIN(TIME_REC) OVER (PARTITION BY COL1, COL2, COL3 ORDER BY TIME_REC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)) AS TIME_SINCE_PRIOR_ROW
Note: Either MIN or MAX will work since you are limiting the window to at most one row.
Then filter in the outer query based on the time difference.
Thanks Fred...I can see how to incorporate your suggestions into a select statement to yeild the following output...
TS DIFF1 DIFF2
05:30 ? 17:00.0000
05:47 17:00.0000 ?
07:33 ? ?
but the real question is can this somehow be rolled into a qualify statement so that I can return just a count of the duplicates...more akin to the original example I posted?