Generate SurrogateKey with a huge table

Enthusiast

Generate SurrogateKey with a huge table

Hello everybody,

I have a table which has over 20 billion records for now, and it'll get bigger every day.

So I have a problem with generating the surrogate key, because it takes too much time.

I use ROW_NUMBER() + SEL MAX_ID logic, but even the SEL MAX_ID statement alone took over 20 minutes.
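
For reference, the pattern looks roughly like this (just a sketch; the table and column names are made up):

-- illustrative sketch of the ROW_NUMBER() + MAX(skey) pattern
INSERT INTO target_table (skey, natural_key, col1)
SELECT
    mx.max_skey + ROW_NUMBER() OVER (ORDER BY s.natural_key) AS skey,
    s.natural_key,
    s.col1
FROM staging_table s
CROSS JOIN (SELECT COALESCE(MAX(skey), 0) AS max_skey FROM target_table) mx;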

For such a big table, should I use some other logic to create the surrogate key?

Thanks.

13 REPLIES
Enthusiast

Re: Generate SurrogateKey with a huge table

I used the same method when loading from staging to target. I assume you use a staging table too, do you? Please let me know.

I feel that, one way or the other, you have to pay that cost somewhere. My thought: for now, the only thing I can think of is

running the job during a low-traffic window (maybe at night).

Cheers,

Enthusiast

Re: Generate SurrogateKey with a huge table

First I create the surrogate key from staging, then I load the data to the target using that key.

What do you mean by "one way or the other"? Can you tell me more, please?

Thanks.

Enthusiast

Re: Generate SurrogateKey with a huge table


So yes, it is the same concept we used. I think doing it this way is better than using a sequence or identity number. There may be smarter ways of doing it, if someone can share.

Cheers,

Enthusiast

Re: Generate SurrogateKey with a huge table

Hi

To my knowledge, within Teradata there are three ways to generate a surrogate key:

1. Identity column
2. Hashing algorithm
3. Analytical function
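
For reference, option 1 is just a column attribute (a minimal sketch; the table and column names are made up):

-- surrogate key via a Teradata identity column
CREATE TABLE edw.dim_customer (
    customer_sk BIGINT GENERATED ALWAYS AS IDENTITY
                (START WITH 1 INCREMENT BY 1 NO CYCLE),
    customer_nk VARCHAR(50) NOT NULL
)
PRIMARY INDEX (customer_sk);
-- note: identity values are unique but not guaranteed gap-free or sequential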

The third option is the most effective. We can use either ROW_NUMBER() or CSUM() to generate the key, but as the table grows, so does the processing time.

Hopefully a newer Teradata version will provide a sequence generator option like Oracle's.

However, you can also try generating the surrogate key through an ETL/ELT tool.

I agree with Raja; if the experts here can share their experience of a smarter way, that would be great.

Thanking You

Santanu

Enthusiast

Re: Generate SurrogateKey with a huge table

Hi,

The third option is discussed in detail, with explanation from an expert, here:

http://forums.teradata.com/forum/database/best-way-to-generate-the-sequential-numbers-csum-or-identi...

On the sequence generator option like Oracle's: it has a pitfall too, e.g. if you use currval/nextval, a rerun can mess up your data when you need to go back for a correction.

Yes, in ETL tools: Ab Initio (next_in_sequence etc.), Informatica (sequence generator), SAP DS (key_generation), etc.

Cheers,

Teradata Employee

Re: Generate SurrogateKey with a huge table

How about storing the max value in a separate table instead of calculating it every time?

For example, let table sk_seed have two columns (tablename VARCHAR(128), next_sk BIGINT) with tablename being the PI; then pseudo code for generating surrogate keys would be something like:

BT;

-- lock the seed row for write in case of concurrent usage
UPDATE sk_seed SET next_sk = next_sk + 0 WHERE tablename = 'staging_table';

-- generate and use the surrogate keys
SELECT (SELECT next_sk FROM sk_seed WHERE tablename = 'staging_table')
       + ROW_NUMBER() OVER (ORDER BY natural_key) AS surrogate_key
FROM staging_table;

-- advance next_sk for the next iteration
UPDATE sk_seed
SET next_sk = next_sk + (SELECT COUNT(*) FROM staging_table)
WHERE tablename = 'staging_table';

ET;

One downside of this approach: if surrogate key generation isn't centralized through this logic, the sk_seed table may get out of sync and you'll get duplicate surrogate keys.

Teradata Employee

Re: Generate SurrogateKey with a huge table

Is there any compulsion to use ROW_NUMBER?

From my experience, CSUM and ROW_NUMBER can cause the response time to slow down considerably.

For smaller tables they work fine, but for larger tables I prefer [and am getting much better results with] the following logic:

SELECT
    SUM(1) OVER (ROWS UNBOUNDED PRECEDING) + AliasMAX.MAX_VALUE AS SK,
    Col1,
    Col2
FROM LRD.Table1 LRD
LEFT OUTER JOIN EDW.Table1 EDW
    ON EDW.ID = LRD.ID,
(
    SELECT COALESCE(MAX(ID), 0) AS MAX_VALUE
    FROM EDW.Table1
) AliasMAX

HTH!

Senior Apprentice

Re: Generate SurrogateKey with a huge table

Hi Adeel,

Did you check DBQL for the resources used?

SUM(1) or COUNT(*) OVER (ROWS UNBOUNDED PRECEDING) (without PARTITION BY) should be worse than ROW_NUMBER:

there will be an implicit ORDER BY over all columns, while ROW_NUMBER can sort by fewer than all columns, so SUM should consume more spool and more CPU.

If there's a PARTITION BY, then SUM actually doesn't sort -> faster than ROW_NUMBER.
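
To make the variants concrete (a sketch only; staging_table, natural_key and some_col are made-up names):

-- no PARTITION BY: implicit ORDER BY over all columns -> more spool and CPU
SELECT SUM(1) OVER (ROWS UNBOUNDED PRECEDING) AS rn, natural_key
FROM staging_table;

-- ROW_NUMBER can order by a subset of the columns
SELECT ROW_NUMBER() OVER (ORDER BY natural_key) AS rn, natural_key
FROM staging_table;

-- with PARTITION BY: no sort, but the running count restarts in each partition
SELECT SUM(1) OVER (PARTITION BY some_col ROWS UNBOUNDED PRECEDING) AS rn, natural_key
FROM staging_table;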

Teradata Employee

Re: Generate SurrogateKey with a huge table

Indeed Dieter .... I wrote SQL on-the-fly and missed PARTITION BY in there. Thanks! :)