Need to Remove Duplicates from a dataset

Database

Need to Remove Duplicates from a dataset

Hi All ,

I need to remove only the Immediate duplicates if there is no change in some specific columns.

Incomming Data:

ID  STATUS  STORE UPDATE_DATE

1       A           S1         Oct-1

1       A           S3         Oct-2

2       S           S1         Oct-3 

1       S           S3         Oct-5

1       A           S3         Oct-8

2       S           S1         Oct-9

2       A           S1         Oct-9

OUTPUT Should be as below.

ID  STATUS  STORE UPDATE_DATE

1       A           S1         Oct-1

1       S           S3         Oct-5

1       A           S3         Oct-8

2       S           S1         Oct-3 

2       A           S1         Oct-9

Notice that i do not care if STORE Column changes for a Particular ID. If there is immediate change for an ID in STATUS then i want to keep that,Else i want to remove that record.

Need Solution in Analytical functions with Window Query only. No self Joins please.

Appreciate some quick help.

Thanks,

Rakesh

8 REPLIES
N/A

Re: Need to Remove Duplicates from a dataset

Assuming this should be based on update_date simply add

QUALIFY 
COALESCE( criteriaMIN(status)
OVER (PARTITION BY id
ORDER BY update_date
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
,'~~~~')
<> status

But if you actually got multiple rows for the same id/update_date combination you need to define some additional sort criteria.

Dieter

Re: Need to Remove Duplicates from a dataset

Thank you Dieter.

However if we have more than one column to track immediate changes then we cannot use this Query.(Here we have only STATUS)

Is there any alternative? Can we use Row_number?

Thanks,

Rakesh

N/A

Re: Need to Remove Duplicates from a dataset

Hi Rakesh,

you can simply add another ORed condition to check for change in another column.

Dieter

Re: Need to Remove Duplicates from a dataset

Thanks Dieter. Can we use to generate a row_number for each ID partition by the columns that we want to track change and pick up only the row_num=1 ?

I tried my best but looks like row_num function is not suppoorted by window function.

Please let me know your suggestions.

Thnaks,

Rakesh

N/A

Re: Need to Remove Duplicates from a dataset

Hi Rakesh,

it's ROW_NUMBER not ROW_NUM.

Dieter

Re: Need to Remove Duplicates from a dataset

Yes Dieter i just wrote that in a short version. So back to my question Can we use ROW_NUMBER function ? I am not able to use  with Window function. I mean can we assign a row_number for a particular window of records?

N/A

Re: Need to Remove Duplicates from a dataset

ROW_NUMBER() OVER (PARTITION BY id, status, store ORDER BY update_date) 

will not work for your problem as it will not flag a row where the same (status,store) combination is found in multiple rows, e.g.

1,a -- row_number -> 1 - returned

1,a -- row_number -> 2 - not returned

2,a -- row_number -> 1 - returned

1,a -- row_number -> 3 - not returned, but should be based on your description

Dieter

Re: Need to Remove Duplicates from a dataset

Thanks Dieter for the explanation. However i want the 4th record to be returned. How can i make that row_number=1 instead of 3 ? Or anyother way?