Remove duplicate records and load into another table

Database
Enthusiast

Remove duplicate records and load into another table

Hi There,

I have a table called Old Table as shown below which has duplicate records. The difference is only in the Key_Id column(Surrogat Key) which is sequencly generated in our process. Since all other columns are same, we treat this as dupliucates and we need to be deleting the records and insert only 1 record in a table called New Table as shown below. Appreciate if anyone could higlight me the easisiset and simplest way to achieve this.

Old Table

Key id          PROCESS_DATA_MONTH             DATA_SOURCE_ID         DATA_SOURCE_CODE         LOGICAL_DATA_SOURCE_CODE   GEOGRAPHY_CODE

2123                   6/1/2014                                      720                             998,649                                  901,584                            PHILADELPHIA PA

2124                   6/1/2014                                      720                             998,649                                 901,584                           PHILADELPHIA PA

2125                   6/1/2014                                      720                             998,649                                  901,584                            PHILADELPHIA PA

New Table

Key id         PROCESS_DATA_MONTH               DATA_SOURCE_ID        DATA_SOURCE_CODE       LOGICAL_DATA_SOURCE_CODE     GEOGRAPHY_CODE

2123                   6/1/2014                                       720                            998,649                                 901,584                            PHILADELPHIA PA

7 REPLIES
Supporter

Re: Remove duplicate records and load into another table

try the following in case all columns are not null

Insert into new table
select *
from oldTable
where keyid in (
select key_id
from OLD_Table
qualify key_id = min(key_id) as key_id over (partition by PROCESS_DATA_MONTH !! DATA_SOURCE_ID !! DATA_SOURCE_CODE !! LOGICAL_DATA_SOURCE_CODE GEOGRAPHY_CODE)
)

in case a column can be null you might change this column to coalesce(cast(columnname as varchar(100)) , 'XXXXnotNULLxxxx')...

Enthusiast

Re: Remove duplicate records and load into another table

insert into newtable
select
min(key_id)
,process_data_month
,data_source_id
,data_source_code
,logical_data_source_code
,geography_code
from oldtable
group by
process_data_month
,data_source_id
,data_source_code
,logical_data_source_code
,geography_code

will also work and avoid the olap f(x)... the same note re: null columns applies

Cheers

Enthusiast

Re: Remove duplicate records and load into another table

Hai thanks a lot...

Can you come again on the null values?? What if i had a NULL in the geography code? Like below..

Key id          PROCESS_DATA_MONTH             DATA_SOURCE_ID         DATA_SOURCE_CODE         LOGICAL_DATA_SOURCE_CODE   GEOGRAPHY_CODE

2123                   6/1/2014                                      720                             998,649                                  901,584                             ?

2124                   6/1/2014                                      720                             998,649                                  901,584                              ?

2125                   6/1/2014                                      720                             998,649                                  901,584                              ?

 

Enthusiast

Re: Remove duplicate records and load into another table

Hi there,

 I am trying to fetch the records leavign all the duplicates by the below 2 methods. Both are consuming time:(

while doing an explain the first query estimate time is around 45 min and for the 2nd query it is not displaying the estimated time.

Total count of the records in the table is around 31million.

Please help if any alternative.

1)  

select max(CONSUMER_PROMO_DIRECT_ID), 

PROCESS_DATA_MONTH,

DATA_SOURCE_ID,

DATA_SOURCE_CODE,

LOGICAL_DATA_SOURCE_CODE,

GEOGRAPHY_CODE

    from  old_table

group by PPROCESS_DATA_MONTH,DATA_SOURCE_ID,DATA_SOURCE_CODE,LOGICAL_DATA_SOURCE_CODE,GEOGRAPHY_CODE

2)

  select *from ddsd.CONSUMER_PROMO_DIRECT 

      qualify  row_number() over( partition by 

      PROCESS_DATA_MONTH,DATA_SOURCE_ID,DATA_SOURCE_CODE,LOGICAL_DATA_SOURCE_CODE

      ,GEOGRAPHY_CODE

 order by PROCESS_DATA_MONTH,DATA_SOURCE_ID,DATA_SOURCE_CODE,LOGICAL_DATA_SOURCE_CODE

      ,GEOGRAPHY_CODE desc) =1


Re: Remove duplicate records and load into another table

Hi,

All of them that is writen above is true. You can use min or max aggregation for surrogate keys and than you should group the other columns, or you can qualify the data as patition by DATA_SOURCE_ID, DATA_SOURCE_CODE order by Key_id (asc or desc what you would like to see as surrogate key) . 

Also when you create surrogate key tables source identifier column and primary key column of the source system must be unique primary index, if you need to access data from surrogate key tables  (usually semantic layers use surrogate key tables to extend datamarts with source primary key columns)  with surrogate key columns I recomend sparse index with surrogate key column primary index.

Regards,

Kadir Evciler

Junior Contributor

Re: Remove duplicate records and load into another table

Did you actually run the queries and checked DBQL?

Unless the PI of the table is the key_id the GROUP BY should be fast.

Btw, how can you insert duplicate values in a table, is there no Primary Key or does the loading process ignore it?

Enthusiast

Re: Remove duplicate records and load into another table

You have to remember that the hours, minutes, seconds in an explain plan are pronounced KOST.  There is no relevance to wall clock time.  Those measure labels in the explain plan are labels to describe the relative cost of each step, NO how long it will take.

Cheers.