Underground: Generating Surrogate Keys with ID()

Aster Field Strong
Teradata Employee

Description

This SQL/MR function generates unique big int ids for rows in an input table. Though the algorithm guarantees the ids to be unique, it doesn't however guarantee to be strictly serial (expect some gaps in the id #s).

Why not ROW_NUMBER() ?

The biggest advantage of ID() vs ROW_NUMBER() OVER (PARTITION BY 1) is that it can generate ids for table rows that scales linearly, because it has no repartitioning operation. ID() uses a truly parallel algorithm. It is especially useful when you load data from a fresh data source and each row needs id tags for Text Analysis, XML or JSON Parsing regardless of the row order. As an example, If you have 12 billion rows in a table that you just loaded and need to id it uniquely, just use ID() !

Usage

SELECT * FROM ID(

       on input_table

);

The output schema and data matches the input table with an additional column called id which is bigint !

Sample Input/Output

INPUT TABLE: webclicks


   
useridsessionidtime_stampevent
user102016-01-01 16:40home_page
user102016-01-01 16:42sports
user102016-01-01 16:46home_page
user212016-02-01 10:10news_page

CREATE TABLE webclicks_with_id DISTRIBUTE BY HASH(id)

AS

SELECT * FROM id(

     ON webclicks

);

OUTPUT TABLE: webclicks_with_id

iduseridsessionidtime_stampevent
1user102016-01-01 16:40home_page
2user102016-01-01 16:42sports
3user102016-01-01 16:46home_page
4user212016-02-01 10:10news_page

 

Installation

Download the ID.ZIP  attachment

Run \install id.zip in ACT for installing default schema 
Run \install <schema>/id.zip in ACT to install it in <schema>
Provide execute permissions on the ID function to the users with
GRANT EXECUTE ON FUNCTION ID TO <role>|<user>;