Key Generation based on 3 columns

Database
Res
Enthusiast

Key Generation based on 3 columns

Hi , 

 

I need to insert records from staging to target  and generate the ID column for every new combination of A,B&C .  

For the first load , the target will be empty . There may be more than one record in the staging for the new combinations . Every new distinct combination should have the same ID . 

 

Has someone written this without using a stored proc ? Please share if this can be achieved in a single query. 

 

Example : 

 

TGT
A B C ID
1 8 27 1
1 8 27 1
1 4 18 2
1 4 18 2
1 7 26 3
1 9 28 4
1 9 28 4


STG
A B C ID
1 8 27 1
1 8 27 1
1 8 27 1
1 4 18 2
1 4 18 2
1 4 18 2
1 4 18 2
1 7 26 3
1 9 28 4
1 9 28 4
2 8 27 5
2 8 27 5
2 8 27 5
2 10 29 6
2 10 29 6
2 4 16 7
2 4 17 8

 

 

Thanks .

12 REPLIES
Teradata Employee

Re: Key Generation based on 3 columns

If the only purpose of an ID is to match STG to TGT, then the easy way to do this is to not bother with an ID at all.  Just define the Primary Index as (A,B,C) on both STG and TGT.

If you really want waste space with an extra ID column then you can insert it as hashbucket(hashrow(A,B,C)).  However: this will not only create unnecessary duplicate calculations when you MERGE STG into TGT, it is not guaranteed to be unique for every combination of A,B,C.  Qualifying explicitly on A,B,C is probably your best option.

Res
Enthusiast

Re: Key Generation based on 3 columns

I need the ID column to be stored in another Fact table and the issue here is not storage . And there are many more columns in the table other than A,B,C. 

 

Res
Enthusiast

Re: Key Generation based on 3 columns

Just to make it clear , the STG doesn't have the ID column : The earlier values just demonstrated how the values will be generated for the TGT from STG.

 

TGT
A B C ID
1 8 27 1
1 8 27 1
1 4 18 2
1 4 18 2
1 7 26 3
1 9 28 4
1 9 28 4


STG
A B C
1 8 27
1 8 27
1 8 27
1 4 18
1 4 18
1 4 18
1 4 18
1 7 26
1 9 28
1 9 28
2 8 27
2 8 27
2 8 27
2 10 29
2 10 29
2 4 16
2 4 17

 

Res
Enthusiast

Re: Key Generation based on 3 columns

Just to make it clear , the STG doesn't have the ID column : The earlier values just demonstrated how the values will be generated for the TGT from STG.

 

TGT
A B C ID
1 8 27 1
1 8 27 1
1 4 18 2
1 4 18 2
1 7 26 3
1 9 28 4
1 9 28 4


STG
A B C
1 8 27
1 8 27
1 8 27
1 4 18
1 4 18
1 4 18
1 4 18
1 7 26
1 9 28
1 9 28
2 8 27
2 8 27
2 8 27
2 10 29
2 10 29
2 4 16
2 4 17

 

Senior Apprentice

Re: Key Generation based on 3 columns

Hi,

 

Use the DENSE_RANK function.

DENSE_RANK() OVER(ORDER BY a,b,c) as combination_id

@GJColemanSorry but I don't think you can use HASH functions for this because it is possible for multiple, different combinations of columns A,B and C to generate the same result value.

 

Cheers,

Dave

 

Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Res
Enthusiast

Re: Key Generation based on 3 columns

Thanks @DaveWellman, but the dense will generate the ID again for every load. This is a delta . Hence the values already assigned an ID should be retained and new values to be given new ID's by incrementing the MAX value from the TGT. 

 

I have a lengthy solution ready , but would need a simpler one if possible. 

 

 

Senior Supporter

Re: Key Generation based on 3 columns

I would recommend to us a hash (sha2 for example) of column a,b,c - the resulting column is wide but you don't need to worry about hash collisions and it makes processing easy as you can calculate the value where and when ever you need it (within the DB or outside).

Otherwise you need:

1.
insert into target
select s.*, t.id
from stg as s
         join
        (select a,b,c,id from target group by 1,2,3,4) as t
        on s.a = t.a and s.b = t.b and s.c = t.c -- assuming that a,b,c are not null otherwise use coalesce

2. create volatile table vt_new_ids
    as
    (select stg.*,
DENSE_RANK() OVER(ORDER BY a,b,c) as combination_id
   from stg as s where (a,b,c) not exists in (select a,b,c from target t where )
   ) with data primary index (a,b,c) 

3. move the stg to target
   insert into target 
   select a,b,c,..., comibnation_id + (select max(id) from target) 
   from vt_new_ids
 

you need to consider transaction handling etc.
as mentioned the cartographic hash has fare more benefits then downsides due to the higher space requirements most of the cases

just my thoughts

 

Ulrich

Senior Apprentice

Re: Key Generation based on 3 columns

...so you could join the staging to final and add in the current maximum. Something like:

INSERT INTO target
SELECT ...
  ,(CASE
    WHEN tgt.combination_id IS NOT NULL THEN tgt.combination_id
   ELSE DENSE_RANK() OVER(ORDER BY stg.a, stg.b, stg.c) + MV.MaxID
   END) as combination_id

FROM staging AS STG
LEFT OUTER JOIN target AS tgt
  ON stg.a = tgt.a
  AND stg.b = TGT.b
  AND stg.c = TGT.c
CROSS JOIN (SELECT MAX(combination_id) AS MaxID FROM target) AS MV

I've shown the full CASE logic here but you could use COALESCE instead.

 

Is that any better for you?

This will not give you sequential values for the 'combination_id' column after the first load, which may or may not be important.

 

Cheers,

Dave

 

Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Teradata Employee

Re: Key Generation based on 3 columns

Yes! As I said, "it is not guaranteed to be unique for every combination of A,B,C.  Qualifying explicitly on A,B,C is probably your best option."

There is no algorithm for generating a hash value that is unique to each A-B-C combination.  Since you have duplicate-key rows, your best option is to store the values for A,B,C in the Fact table, and make A,B,C the Primary Index of TGT.

The only way I can think of to generate an ID-number is to create yet another table containing only the values A,B,C and a Teradata-generated Identity Column.  Then when you get a new STG table, join it to this table to get the ID Column value.  Also, any A-B-C values that are not already in this table have to be inserted so you can then get the new ID Column value.  If there are are more than 1000 rows in STG, you don't want to do this in a stored procedure cursor loop - use set SQL.  This all seems like a lot of effort just to reduce 3 values to 1.  I would just add A,B,C to the Fact table.