I have a table that currently holds over 20 billion records, and it grows every day.
So I have a problem generating the surrogate key (skey), because it takes too much time.
I use the ROW_NUMBER() + SELECT MAX(id) logic, but even the SELECT MAX(id) statement alone takes over 20 minutes.
For such a big table, should I use some other logic to create the skey?
First I create the skey in staging, then I load the data into the target using that skey.
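For reference, the pattern I'm describing looks roughly like this (table and column names are simplified placeholders, not my actual objects):

```sql
-- read the current high-water mark from the target (this is the slow step),
-- then offset ROW_NUMBER() by it while loading from staging
INSERT INTO target_table (skey, natural_key, some_col)
SELECT (SELECT COALESCE(MAX(skey), 0) FROM target_table)
       + ROW_NUMBER() OVER (ORDER BY natural_key) AS skey,
       natural_key,
       some_col
FROM staging_table;
```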
What do you mean by "one way or the other"? Can you tell me more, please?
To my knowledge, there are three ways to generate a skey within Teradata:
1. Identity Column, 2. Hashing Algorithm, and 3. Analytical Function
The third option is the most effective. We can use either ROW_NUMBER() or CSUM() to generate the skey, but as the table grows, so does the processing time.
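For option 1, a sketch of what an identity-column definition looks like (table and column names are just placeholders):

```sql
-- Teradata assigns the skey values automatically at load time;
-- GENERATED ALWAYS means you cannot supply your own value on INSERT
CREATE TABLE target_table (
    skey        BIGINT GENERATED ALWAYS AS IDENTITY
                (START WITH 1 INCREMENT BY 1),
    natural_key VARCHAR(100)
)
PRIMARY INDEX (skey);
```

One caveat: identity values are handed out in ranges per AMP, so they are unique but not guaranteed to be gap-free or strictly sequential.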
Hopefully a new TD version will provide a sequence generator option like Oracle's.
However, you can also try generating the skey through an ETL/ELT tool.
I agree with Raja; if the experts here can share their experience of a smarter way, that would be great.
The third option is discussed in detail, with an explanation from an expert:
Regarding a sequence generator option like Oracle's: it has loopholes too. If you use CURRVAL, NEXTVAL, etc., a rerun can mess up your data, and then you have to go in for corrections.
Yes, in an ETL tool: Ab Initio (next_in_sequence etc.), Informatica (Sequence Generator), SAP DS (Key_Generation), etc.
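For what it's worth, the Oracle pattern being referred to looks like this (Oracle syntax, not Teradata; names are hypothetical):

```sql
CREATE SEQUENCE skey_seq START WITH 1 INCREMENT BY 1 CACHE 1000;

INSERT INTO target_table (skey, natural_key)
SELECT skey_seq.NEXTVAL, natural_key
FROM staging_table;
```

The rerun problem mentioned above comes from the fact that a sequence does not roll back with the transaction: a failed load still consumes NEXTVAL values, so reruns leave gaps and corrections get messy.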
How about storing the max value in a separate table instead of calculating it every time?
For example, let a table sk_seed have two columns (tablename varchar(128), next_sk bigint) with tablename as the PI; the pseudo code for generating surrogate keys would then be something like:
-- run all three statements in one transaction (BT/ET) so the row lock is held throughout
-- lock the seed row for write in case of concurrent usage
update sk_seed set next_sk = next_sk + 0 where tablename = 'staging_table';
-- generate and use surrogate keys
select (select next_sk from sk_seed where tablename = 'staging_table')
       + row_number() over(order by natural_key) as surrogate_key
from staging_table;
-- update next_sk for the next iteration
update sk_seed set next_sk = next_sk + (select count(*) from staging_table) where tablename = 'staging_table';
One downside of this approach: if surrogate-key generation isn't centralized to use the logic above, the sk_seed table may get out of sync and you'll get duplicate surrogate keys.
Is there any compulsion to use ROW_NUMBER?
From my experience, CSUM and ROW_NUMBER can slow the response time down considerably.
For smaller tables they work fine, but for larger tables I prefer the following logic (and am getting much better results):
Did you check DBQL for the resources used?
SUM(1) or COUNT(*) OVER (ROWS UNBOUNDED PRECEDING) (without PARTITION BY) should be worse than ROW_NUMBER:
there will be an implicit ORDER BY over all columns, while ROW_NUMBER can sort by fewer than all columns, so SUM should consume more spool and more CPU.
If there is a PARTITION BY, then SUM actually doesn't sort, so it is faster than ROW_NUMBER.
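To make the comparison concrete (table and column names are hypothetical):

```sql
-- ROW_NUMBER sorts, but only by the columns named in its ORDER BY:
SELECT ROW_NUMBER() OVER (ORDER BY natural_key) AS sk
FROM staging_table;

-- no explicit ORDER BY here, so Teradata imposes an implicit ordering
-- over all columns of the row -> more spool and more CPU:
SELECT SUM(1) OVER (ROWS UNBOUNDED PRECEDING) AS sk
FROM staging_table;
```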