Non duplicate row selection query

Database
Enthusiast

Non duplicate row selection query

Hi there,

Have a table with 1 billion records in which almost close to half the records are of exact row duplicates. I need to select one record from the set of duplicates and load into another table.The query that i used is as below. The DIRECT_ID   column used in the order by clause is a sequence generated id and hence i was picking the most recent genarated id for that set of duplicates records. But this is taking too much of time.

Appreciate for suggestions..

Insert into New_Consumer

select *from CONSUMER_TABLE 

qualify row_number() over ( partition by PROCESS_DATA_MONTH,                  

GEOGRAPHY_CODE,                

PRODUCT_NAME,                  

VARIABLE_NAME,                 

TIME_PERIOD,                   

VARIABLE_QUANTITY,             

OUTLET_NAME,                   

BRAND_UNBRAND,             

CAMPAIGN_ID,                   

VENDOR,                        

CHANNEL_NAME,                  

ACTIVE_FLAG,                   

CREATED_DATE,                  

CREATED_BY,                    

MODIFIED_DATE,                 

MODIFIED_BY,                   

BATCH_ID,                      

BATCH_RUN_NUMBER,              

IMS_NUM,                   

GEO_ID,                  

TIME_ID,                       

AR_CUSTOMER_ID,            

CUSTOMER_ID,                   

AR_PRODUCT_ID,             

PRODUCT_ID,                   

INDICATION,                    

PROGRAM_NAME,                  

PROGRAM_CODE,                  

MEDIA_SHOW_TIME,               

PROGRAM_DURATION,              

MEDIA_TYPE,                    

CAMPAIGN_NAME,                 

PLACEMENT_NAME,                

METRIC_TYPE,                   

VARIABLE_DESCRIPTION,          

CHANNEL_ID,                    

MARKET_IDENTIFIER,         

PRODUCT_IDENTIFIER,        

ABBREVIATION order by DIRECT_ID  desc)=1

Regards

John

17 REPLIES
Teradata Employee

Re: Non duplicate row selection query

Hello John,

You can try the below approach also & see if the Performance Gain is positive:

INSERT Into <Target Table>

Select Max(DIRECT_ID), <All Columns> From <Table Name> Group By <All Columns Except DIRECT_ID>

Aggregate Operation should give better performance than the PARTITIONING Operations.

Thanks,

Smarak

Enthusiast

Re: Non duplicate row selection query

Hi Smarak,

Thanks. Not much improvement. I ran the query for almost 15 minutes and then aborted it:(

Regards

John

Enthusiast

Re: Non duplicate row selection query

If you have access to DDLS,

Create a temp table as set table - Table 1

Insert the rows from the source to "Table 1"

Insert the rows from "Source"  minus "Table1" to target table based on DIRECT_ID.

i.e. Source - 1,2,1,3

Table 1  - 1,2,3

Target table - Source - Table1 - 1

This would work but not sure about time.

Thanks

Yuva.

Enthusiast

Re: Non duplicate row selection query

Hi John,

Apart from DIRECT_ID is there any other columns which you can use as a differentiating column between two records. For ex: date or timestamp column etc..

When you are inserting to New_Consumer table, do you want to select all distinct records from CONSUMER_TABLE table. Is this a one time activity or if it's a daily activity why you need to load all records?

Junior Contributor

Re: Non duplicate row selection query

There's no way to avoid the GROUP/PARTITION BY all columns.

Clean your data once and next time simply don't insert duplicates, your load process should take care of this.

There should be a logical Primary Key on your table, but that's not a sequence like DIRECT_ID.

Enthusiast

Re: Non duplicate row selection query

Hi Ravimans,

We do not have any other column that differentiates the records apart from the DIRECT_ID column.

Yes we want to select all the distinct records from consumer table and will be a one time activity..

Regards

John

Enthusiast

Re: Non duplicate row selection query

Hi Dnoeth,

Unfortunately there is no there key which differentiates the record and which i admit is poor design. Now living wih what we have, we are trying to do a one time activity of loading disitnct records into a new table from the table that has duplicates. Have used the same query as i have pasted above with partition and i did a explain too.. the explain is not showing any time that i might take to complete the process.. Any help or advice?

Regards

John

Enthusiast

Re: Non duplicate row selection query

Logic given by me may not work..since direct_id is different...My bad :(

Thanks

Yuva

Enthusiast

Re: Non duplicate row selection query

No Worries Yuva.. Any other ways? I have tried using Partition and group by but both are taking too long..Any help?

Regards

John