Let's talk about Surrogate Key Generation


Some data models use surrogate keys as part of the primary key/primary index definition; industry-standard models and homegrown models are both examples.

The use of surrogate keys is usually considered when loading different source systems, each with its own natural keys, into the same integrated data model. This implies converting the different natural keys into a standard surrogate key that can be used for all source systems.

Another important reason to use surrogate keys as part of the primary index is join performance: tables that share the same surrogate key as their primary index store matching rows on the same AMPs, so joins on that key avoid row redistribution.
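For instance, in a minimal sketch (the table and column names ACCOUNT, ACCOUNT_TXN, ACCT_BAL, and TXN_AMT are assumptions, not part of the model above), two tables that both define AR_ID as their primary index keep matching rows on the same AMP:

-- Hypothetical tables, both created with PRIMARY INDEX (AR_ID):
-- matching rows are co-located, so this join can be resolved AMP-locally.
SELECT A.AR_ID, A.ACCT_BAL, B.TXN_AMT
FROM DATABASENAME.ACCOUNT A
INNER JOIN DATABASENAME.ACCOUNT_TXN B
ON A.AR_ID = B.AR_ID;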

Surrogate keys can be generated via ETL/ELT tools or inside Teradata.

Let's talk about three different ways of generating surrogate keys inside Teradata.

  1. Identity Columns
  2. OLAP Function (CSUM)
  3. Hashing Algorithm

Identity Columns

This is a good way to generate surrogate keys if you don't have a dual active Teradata environment and the surrogate keys don't need to be generated in sequential order.

The identity column will generate different surrogate keys on different systems.

Also, each AMP reserves its own pool of numbers to use, and the size of this pool is defined by the following DBS Control parameter.

  • IdCol Batch Size
    • Description: Indicates the size of a pool of numbers reserved by a vproc for assigning identity values to rows inserted into an identity column table.
    • Valid Range: 1 through 1,000,000
    • Default Value: 100,000

On a 2-AMP system with IdCol Batch Size = 100,000, this means AMP 1 reserves the pool of numbers from 1 to 100,000 and AMP 2 reserves the pool from 100,001 to 200,000.

This option can be implemented by defining a key table containing an identity column.

A new surrogate key will be generated every time a new record is inserted into this key table, and there is no need to pass any value to the identity column when it is defined as follows (a sample INSERT is shown after the DDL).

The key table also provides redundancy in case of a disaster.

Surrogate Key Table

CREATE MULTISET TABLE DATABASENAME.AR_KEY ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
DATABLOCKSIZE = 130560 BYTES, CHECKSUM = DEFAULT
(
SRC_SYS_UNQ_KEY_TX VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
SRC_SYS_CD CHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
AR_ID DECIMAL(15,0) GENERATED ALWAYS AS IDENTITY
(START WITH 1
INCREMENT BY 1
MINVALUE -999999999999999
MAXVALUE 999999999999999
NO CYCLE),
CRE_RUN_ID INTEGER NOT NULL,
ICDW_ISRT_TS TIMESTAMP(0) NOT NULL DEFAULT CURRENT_TIMESTAMP(0))
UNIQUE PRIMARY INDEX ( SRC_SYS_UNQ_KEY_TX ,SRC_SYS_CD );
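For example, a minimal sketch of an insert into this key table (the values are made up): the identity column AR_ID is simply left out of the column list, and Teradata generates the next number from the AMP's reserved pool.

-- Hypothetical insert: no value is passed for AR_ID (GENERATED ALWAYS),
-- and ICDW_ISRT_TS falls back to its CURRENT_TIMESTAMP default.
INSERT INTO DATABASENAME.AR_KEY
(SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD, CRE_RUN_ID)
VALUES ('CUST-000123', 'CRM', 1001);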

There are some special considerations when the key table containing an identity column needs to be re-loaded after a disaster.

  • The key table needs to be re-created with new START WITH and MINVALUE numbers that exclude the numbers generated before, so the same number cannot be generated twice and create duplicates in the surrogate key (identity column).
  • The definition of the identity column needs to be changed from "GENERATED ALWAYS" to "GENERATED BY DEFAULT" so that the previously generated keys can be reloaded.
  • New surrogate keys will be generated only when NULLs (no values) are passed to the identity column.

So, if for example a surrogate key table whose highest identity column value is 999,999 needs to be reloaded because of a disaster, the new surrogate key table needs to be defined as follows (a sample reload is shown after the DDL).

Surrogate Key Table (Re-Load)

CREATE MULTISET TABLE DATABASENAME.AR_KEY ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
DATABLOCKSIZE = 130560 BYTES, CHECKSUM = DEFAULT
(
SRC_SYS_UNQ_KEY_TX VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
SRC_SYS_CD CHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
AR_ID DECIMAL(15,0) GENERATED BY DEFAULT AS IDENTITY
(START WITH 1000000
INCREMENT BY 1
MINVALUE 1000000
MAXVALUE 999999999999999
NO CYCLE),
CRE_RUN_ID INTEGER NOT NULL,
ICDW_ISRT_TS TIMESTAMP(0) NOT NULL DEFAULT CURRENT_TIMESTAMP(0))
UNIQUE PRIMARY INDEX ( SRC_SYS_UNQ_KEY_TX ,SRC_SYS_CD );
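A minimal sketch of the reload (values are made up): with GENERATED BY DEFAULT, explicit AR_ID values are kept as-is, while rows that pass NULL receive a new number from the new range.

-- Hypothetical reload: the original surrogate key (42) is supplied explicitly
-- and preserved.
INSERT INTO DATABASENAME.AR_KEY
(SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD, AR_ID, CRE_RUN_ID)
VALUES ('CUST-000123', 'CRM', 42, 1001);

-- Hypothetical new row: AR_ID is passed as NULL, so the identity column
-- generates the next number starting at 1,000,000.
INSERT INTO DATABASENAME.AR_KEY
(SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD, AR_ID, CRE_RUN_ID)
VALUES ('CUST-000999', 'CRM', NULL, 1002);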

OLAP Function (CSUM)

This is a good way to generate surrogate keys in sequential order. It can be used in a dual active environment if a dual load strategy is in place, or if the surrogate keys are generated on the primary system and then copied over to the secondary system, for example with Data Mover.

The CSUM function will generate the next surrogate key numbers only if the highest surrogate key already generated is provided as part of the expression.
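As a standalone sketch (the table name staging and column natural_key are assumptions), CSUM numbers the new rows 1, 2, 3, ... and the current maximum key is added on top:

-- Hypothetical example: 5000 stands in for the current MAX(AR_ID) in the key
-- table. CSUM(1, natural_key) returns the running count 1, 2, 3, ...
-- ordered by natural_key, so the new keys come out as 5001, 5002, 5003, ...
SELECT
natural_key
,CSUM(1, natural_key) + 5000 AS new_ar_id
FROM staging;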

This option can be implemented by developing a surrogate key generation process via a stored procedure together with a surrogate key table containing the natural key plus the surrogate key.

A new surrogate key is generated by the CSUM function every time a new record is inserted into the surrogate key table.

The key table also provides redundancy in case of a disaster.

Surrogate Key Table

CREATE MULTISET TABLE DATABASENAME.AR_KEY ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
DATABLOCKSIZE = 130560 BYTES, CHECKSUM = DEFAULT
(
SRC_SYS_UNQ_KEY_TX VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
SRC_SYS_CD CHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
AR_ID DECIMAL(15,0) NOT NULL,
CRE_RUN_ID INTEGER NOT NULL,
ICDW_ISRT_TS TIMESTAMP(0) NOT NULL DEFAULT CURRENT_TIMESTAMP(0))
UNIQUE PRIMARY INDEX ( SRC_SYS_UNQ_KEY_TX ,SRC_SYS_CD );

Surrogate Key Stored Procedure

BEGIN TRANSACTION;

-- CHECK THE HIGHEST SURROGATE KEY GENERATED

SET tmpMsg = 'SELECT (COALESCE(MAX(AR_ID),0)) FROM DATABASENAME.AR_KEY;';
SET cntr = 0;

LOCKING DATABASENAME.AR_KEY FOR WRITE
SELECT (COALESCE(MAX(AR_ID),0))
INTO cntr
FROM DATABASENAME.AR_KEY
;

IF SQLSTATE = '02000' THEN
SET cntr = 0;
END IF;

-- GENERATING NEW SURROGATE KEYS

SET tmpMsg = 'INSERT INTO DATABASENAME.AR_KEY...';

SET cntrkey = 0;

SET statmnt = '
INSERT INTO DATABASENAME.AR_KEY
SELECT
SRC_SYS_UNQ_KEY_TX
,SRC_SYS_CD
,(CSUM(1,SRC_SYS_UNQ_KEY_TX,SRC_SYS_CD) +' || cntr || ' ) AS AR_ID
,CRE_RUN_ID
,CURRENT_TIMESTAMP(0) AS ICDW_ISRT_TS
FROM
(SELECT
SRC_SYS_UNQ_KEY_TX
,SRC_SYS_CD
,CRE_RUN_ID
FROM DATABASENAME.' || 'AR_' || sourcecd || '_SA3
WHERE (SRC_SYS_UNQ_KEY_TX,SRC_SYS_CD) NOT IN
(SELECT
SRC_SYS_UNQ_KEY_TX
,SRC_SYS_CD
FROM DATABASENAME.AR_KEY
WHERE SRC_SYS_CD = ' || '''' || srcsyscd || ''''|| '
)
AND SRC_SYS_CD = ' || '''' || srcsyscd || ''''|| '
AND CRE_RUN_ID = ' || crerunid || '
AND ETL_ACTN_CD = ''I''
GROUP BY 1,2,3) T0
;';

CALL DBC.SysExecSQL(statmnt);

SET cntrkey = ACTIVITY_COUNT;

IF cntrkey = 0 THEN
SET tmpMsg = 'APP WARNING: NO NEW AR_IDs WERE GENERATED';
SET SQLMSG = TMPMSG;
SET SQLERRCDE = SQLCODE;
SET SQLSTTE = SQLSTATE;
SET SQL_ERR_CDE = SQLCODE;
SET SQL_STATE = SQLSTATE;
INSERT INTO DATABASENAME.SPROC_ERR_LOG
VALUES ('DATABASENAME','SPROC_GEN_AR_KEY','W',:SQL_ERR_CDE,:SQL_STATE,:TMPMSG,:crerunid,CURRENT_TIMESTAMP(0));
END IF;

END TRANSACTION;

Please notice that the maximum surrogate key value is added to the CSUM expression to generate the next surrogate key.

Please also notice the "BEGIN TRANSACTION;", "END TRANSACTION;", and "LOCKING DATABASENAME.AR_KEY FOR WRITE" statements inside the stored procedure code; they are there for concurrency reasons. They allow multiple surrogate key generation processes to run at the same time.

If two surrogate key generation processes run at the same time, the first process to place the write lock on the key table will generate new surrogate keys, while the second process waits until the first process's END TRANSACTION completes.
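Calling the procedure could look like this; a hedged sketch, since the parameter list is assumed from the variables sourcecd, srcsyscd, and crerunid used in the body and is not shown in the post.

-- Hypothetical call: generate keys from staging table DATABASENAME.AR_CRM_SA3
-- for source system code 'CRM01' and run id 20230101.
CALL DATABASENAME.SPROC_GEN_AR_KEY('CRM', 'CRM01', 20230101);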

There are some special considerations when the key table containing the surrogate keys needs to be re-loaded after a disaster.

  • The surrogate keys can't be re-generated using the stored procedure because the process could generate new surrogate keys different from the original ones.
  • The surrogate keys need to be copied from the target tables into the key table or restored from the latest backup.
  • Continue using the stored procedure after the key table is re-loaded.

Hashing Algorithm

This is a good way to generate surrogate keys if you have a dual active environment and the surrogate keys don't need to be generated in sequential order.

The hashing algorithm will generate the same surrogate keys on different Teradata systems but key collisions can happen.

Therefore, the hashing algorithm needs to generate a uniqueness value as part of the surrogate key generation process to eliminate key collisions.

This option can be implemented by developing a surrogate key generation process via a stored procedure together with a surrogate key table containing the natural key plus the surrogate key (hashing key plus uniqueness value).

The idea here is to generate the surrogate keys just once and store them in a key table. The key table is then used to assign the surrogate keys to all tables that require them.

The key table also provides redundancy in case of a disaster.

Surrogate Key Table

CREATE MULTISET TABLE DATABASENAME.AR_KEY ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
DATABLOCKSIZE = 130560 BYTES, CHECKSUM = DEFAULT
(
SRC_SYS_UNQ_KEY_TX VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
SRC_SYS_CD CHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
AR_ID DECIMAL(18,0) NOT NULL,
HASH_AR_ID DECIMAL(18,0) NOT NULL,
UNIQUENESS_VALUE INTEGER NOT NULL,
CRE_RUN_ID INTEGER NOT NULL,
ICDW_ISRT_TS TIMESTAMP(0) NOT NULL DEFAULT CURRENT_TIMESTAMP(0))
UNIQUE PRIMARY INDEX ( SRC_SYS_UNQ_KEY_TX ,SRC_SYS_CD );

Hash Key Temporary Table

CREATE MULTISET GLOBAL TEMPORARY TABLE DATABASENAME.HASH_AR_KEY ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
DATABLOCKSIZE = 130560 BYTES, CHECKSUM = DEFAULT,
LOG
(
SRC_SYS_UNQ_KEY_TX VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
SRC_SYS_CD CHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
HASH_AR_ID DECIMAL(18,0) NOT NULL,
CRE_RUN_ID INTEGER NOT NULL,
ICDW_ISRT_TS TIMESTAMP(0) NOT NULL DEFAULT CURRENT_TIMESTAMP(0))
PRIMARY INDEX ( HASH_AR_ID )
ON COMMIT PRESERVE ROWS;
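Before looking at the stored procedure, here is the hash key derivation as a standalone sketch, without the quote escaping that the dynamic SQL below requires (the staging table name AR_CRM_SA3 is an assumed example). Each of the four bytes of the row hash is prefixed with a zero byte, converted to a number by HASHBUCKET, and the four numbers are concatenated; the same natural key always yields the same value on any Teradata system.

-- Minimal sketch: derive the numeric hash key from the natural key columns.
SELECT
TRIM(HASHBUCKET('00'XB || SUBSTR(HASHROW(SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD), 1, 1) (BYTE(4))))
|| TRIM(HASHBUCKET('00'XB || SUBSTR(HASHROW(SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD), 2, 1) (BYTE(4))))
|| TRIM(HASHBUCKET('00'XB || SUBSTR(HASHROW(SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD), 3, 1) (BYTE(4))))
|| TRIM(HASHBUCKET('00'XB || SUBSTR(HASHROW(SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD), 4, 1) (BYTE(4)))) AS HASH_AR_ID
FROM DATABASENAME.AR_CRM_SA3;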

Surrogate Key Stored Procedure

-- GENERATING NEW HASHING KEYS

SET cntrkey = 0;
SET tmpMsg = 'INSERT INTO DATABASENAME.HASH_AR_KEY...';

SET statmnt = '
INSERT INTO DATABASENAME.HASH_AR_KEY
SELECT
T1.SRC_SYS_UNQ_KEY_TX
,T1.SRC_SYS_CD
,CAST(T1.AR_ID AS DECIMAL(18)) AS HASH_AR_ID
,T1.CRE_RUN_ID
,CURRENT_TIMESTAMP(0) AS ICDW_ISRT_TS
FROM
(SELECT
TRIM(HASHBUCKET('||''''||'00'||''''||'XB || SUBSTR(HASHROW(SRC_SYS_UNQ_KEY_TX,SRC_SYS_CD),1,1) (BYTE(4))))
||TRIM(HASHBUCKET('||''''||'00'||''''||'XB || SUBSTR(HASHROW(SRC_SYS_UNQ_KEY_TX,SRC_SYS_CD),2,1) (BYTE(4))))
||TRIM(HASHBUCKET('||''''||'00'||''''||'XB || SUBSTR(HASHROW(SRC_SYS_UNQ_KEY_TX,SRC_SYS_CD),3,1) (BYTE(4))))
||TRIM(HASHBUCKET('||''''||'00'||''''||'XB || SUBSTR(HASHROW(SRC_SYS_UNQ_KEY_TX,SRC_SYS_CD),4,1) (BYTE(4)))) AS AR_ID
,SRC_SYS_UNQ_KEY_TX
,SRC_SYS_CD
,CRE_RUN_ID
FROM DATABASENAME.' || 'AR_' || sourcecd || '_SA3
WHERE (SRC_SYS_UNQ_KEY_TX,SRC_SYS_CD) NOT IN
(SELECT
SRC_SYS_UNQ_KEY_TX
,SRC_SYS_CD
FROM DATABASENAME.AR_KEY
WHERE SRC_SYS_CD = ' || '''' || srcsyscd || ''''|| '
)
AND SRC_SYS_CD = ' || '''' || srcsyscd || ''''|| '
AND CRE_RUN_ID = ' || crerunid || '
AND ETL_ACTN_CD = ''I''
GROUP BY 1,2,3,4) T1
;';

CALL DBC.SysExecSQL(statmnt);

SET cntrkey = ACTIVITY_COUNT;

IF cntrkey = 0 THEN
SET tmpMsg = 'APP WARNING: NO NEW HASH_AR_IDs WERE GENERATED';
SET SQLMSG = TMPMSG;
SET SQLERRCDE = SQLCODE;
SET SQLSTTE = SQLSTATE;
SET SQL_ERR_CDE = SQLCODE;
SET SQL_STATE = SQLSTATE;
INSERT INTO DATABASENAME.SPROC_ERR_LOG
VALUES ('DATABASENAME','SPROC_GEN_AR_KEY','W',:SQL_ERR_CDE,:SQL_STATE,:TMPMSG,:crerunid,CURRENT_TIMESTAMP(0));
END IF;

-- GENERATING NEW SURROGATE KEYS

SET cntrkey = 0;
SET tmpMsg = 'INSERT INTO DATABASENAME.AR_KEY...';

SET statmnt = '
INSERT INTO DATABASENAME.AR_KEY
SELECT
T1.SRC_SYS_UNQ_KEY_TX
,T1.SRC_SYS_CD
,(CAST((TRIM(T1.HASH_AR_ID)||TRIM(SUM(T1.UNIQUENESS_VALUE) OVER (PARTITION BY T1.HASH_AR_ID ORDER BY T1.HASH_AR_ID ROWS UNBOUNDED PRECEDING))) AS DECIMAL(18))) + COALESCE(T2.UNIQUENESS_VALUE,0) AS AR_ID
,CAST(T1.HASH_AR_ID AS DECIMAL(18)) AS HASH_AR_ID
,(CAST(SUM(T1.UNIQUENESS_VALUE) OVER (PARTITION BY T1.HASH_AR_ID ORDER BY T1.HASH_AR_ID ROWS UNBOUNDED PRECEDING) AS INTEGER)) + COALESCE(T2.UNIQUENESS_VALUE,0) AS UNIQUENESS_VALUE
,T1.CRE_RUN_ID
,CURRENT_TIMESTAMP(0) AS ICDW_ISRT_TS
FROM
(SELECT
HASH_AR_ID,
SRC_SYS_UNQ_KEY_TX,
SRC_SYS_CD,
CRE_RUN_ID,
1 AS UNIQUENESS_VALUE
FROM DATABASENAME.HASH_AR_KEY
) T1
LEFT OUTER JOIN
(SELECT
HASH_AR_ID,
MAX(UNIQUENESS_VALUE) AS UNIQUENESS_VALUE
FROM DATABASENAME.AR_KEY
GROUP BY 1
) T2
ON T1.HASH_AR_ID = T2.HASH_AR_ID
;';

CALL DBC.SysExecSQL(statmnt);

SET cntrkey = ACTIVITY_COUNT;

IF cntrkey = 0 THEN
SET tmpMsg = 'APP WARNING: NO NEW AR_IDs WERE GENERATED';
SET SQLMSG = TMPMSG;
SET SQLERRCDE = SQLCODE;
SET SQLSTTE = SQLSTATE;
SET SQL_ERR_CDE = SQLCODE;
SET SQL_STATE = SQLSTATE;
INSERT INTO DATABASENAME.SPROC_ERR_LOG
VALUES ('DATABASENAME','SPROC_GEN_AR_KEY','W',:SQL_ERR_CDE,:SQL_STATE,:TMPMSG,:crerunid,CURRENT_TIMESTAMP(0));
END IF;

Please notice that the process is split into two steps: the first generates new hashing keys and the second generates new surrogate keys by adding the uniqueness value to the hashing keys.

The process also takes into consideration the highest uniqueness value already stored for each existing hashing key, so that key collisions are managed and the uniqueness value is assigned correctly, as in the walk-through below.
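As an illustrative walk-through with made-up numbers:

-- Existing row in AR_KEY:  HASH_AR_ID = 123456, UNIQUENESS_VALUE = 1, AR_ID = 1234561
-- Two new rows collide on HASH_AR_ID = 123456; their running sums are 1 and 2,
-- and the existing maximum uniqueness value (1) is added to both:
--   new row 1: UNIQUENESS_VALUE = 1 + 1 = 2, AR_ID = 1234561 + 1 = 1234562
--   new row 2: UNIQUENESS_VALUE = 2 + 1 = 3, AR_ID = 1234562 + 1 = 1234563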

There are some special considerations when the key table containing the surrogate keys needs to be re-loaded after a disaster, because of the uniqueness value.

  • The surrogate keys can't be re-generated using the stored procedure, because the uniqueness values could come out differently and produce new surrogate keys different from the original ones.
  • The surrogate keys need to be copied from the target tables into the key table or restored from the latest backup.
  • Continue using the stored procedure after the key table is re-loaded.
Comments
Enthusiast
I know there is documentation/literature that supports the use of one unique number generated per row in a table, irrespective of the natural key.
For example: the first time a row for customer "A" is created in the customer table, it is tagged with SK 100. If anything changes about that customer row in that table,
then a new row is created for the same customer with the latest values and tagged with a new SK, 101.
When you tie these rows to the transactions, it does not support as-was and as-is reporting very well. It only supports as-was reporting easily.
Switching from as-was to as-is causes a self join on the customer table, which (if the table is big) will cause performance issues.
Compare this to just date-locking the customer table and picking up the version of the customer you need.
Enthusiast
Use of identity columns has always caused issues.
1. It needs the columns to be DECIMAL(18).
2. If a system restart happens, you will see a huge jump in the sequence. The sequence does not matter, but the numbers become so big that you will need a column like DECIMAL(18); if a non-identity column is used, an INTEGER might suffice.
N/A
Greatly enjoyed this one, Marcio! Keep up the good work!

@Shiv_Giri, based on my understanding of the post, I don't believe the surrogate key tables that Marcio is talking about are meant to be the actual dimension tables. They're really just used as a translation, or "cross-reference," between the natural key(s) in the source systems and the data warehouse. Once you have this translation mechanism, you can populate your dimension tables accordingly, whether they're slowly-changing dimensions or not.
Enthusiast
Hi Reezun,
This is a good discussion. Once you assign the SKs, what other translation are you referring to?
As you know, even among transactions there are many types, and in some cases you want to track the changes to a transaction (like when an order got placed, processed, shipped, etc.); in those cases generating an SK for every change will be troublesome.
In simple scenarios, when you have a single instance of a row for a transaction and nothing will change on it, then yes, you can have a straightforward mechanism to assign SKs.

CSUM() is slow. Identity has some inconvenient restrictions. We kept the SK generation algorithm out of Teradata.
I don't agree that the CSUM() function is slow. I have used the CSUM function to generate millions of surrogate keys in a couple of seconds. The bottleneck can be the key table, not the CSUM function.
Enthusiast
I too have not seen any performance issues while generating surrogate keys for millions of records using CSUM. However, I am wondering how to bring in the key reuse concept here.
Let's say I have a transaction table holding 2 years' worth of data, deployed 3 years back, with the oldest year of data purged. What would be the most efficient way to reuse the keys, given that the surrogate key column's data type can hold only a finite set of numbers?
I would set an active indicator on the key table to flag the expired surrogate keys.
I would also change the logic of the process to include the maximum surrogate key value in the equation, and restart from the bottom when the maximum surrogate key value is reached.
Enthusiast
Sorry, I am totally new to Teradata. I have a question regarding the structure of your SK table.
You have declared (SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD) as the UPI.
In case of a type 2 change, I would need a different SK for the same natural key (SRC_SYS_UNQ_KEY_TX) and the same source (SRC_SYS_CD). In that instance the declared uniqueness will prevent me from generating it. Please clarify. Am I missing something specific to Teradata?
The process will not generate a new key for an existing (SRC_SYS_UNQ_KEY_TX, SRC_SYS_CD) in the key table, and the type 2 change will use the same surrogate key to apply the change on the target table.
Enthusiast
A type 2 change cannot use the same SK.
Type 2 creates a new record, and the SK is the PK.
The key table is not a type 2 table. The target tables are the type 2 tables where the SK is defined, but the PK on the target tables is usually the SK plus an effective timestamp. The SK will stay the same: a new record with a new effective timestamp will be created and the previous record will be expired.
Teradata Employee
Is there a summary table that compares the performance impact of the different approaches in terms of CPU, I/O, and elapsed time when inserting into the target table from a staging table?
CPU and I/O remain the same, but I am quite unsure about the TAT (turnaround time).