CSUM() vs ROW_NUMBER() for generating Surrogate keys

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
New Member

CSUM() vs ROW_NUMBER() for generating Surrogate keys

Dont see any reason to believe CSUM() under performs ROW_NUMBER(). Did a test with ~422M records

Using CSUM :
Create volatile table vt_csum, no fallback, no log as
(
Sel mem_nm, csum(1,mem_nm) as mem_surg
from xxxtttviews.mem_hist
) with data  primary index(mem_surg) on commit preserve rows
;
Sel count(*) from vt_csum; --output: 422,670,591

Using Row_number :
Create volatile table vt_zzzrow, no fallback, no log as
(
Sel mem_nm, Row_Number() Over (order by mem_nm) as mem_surg
from xxxtttviews.mem_hist
) with data  primary index(mem_surg) on commit preserve rows
;
Sel count(*)  from vt_zzzrow; --output:422,670,591

Stats from dbc.DBQLogTbl:  Looks like CSUM performed better at CPU Time
Using CSUM :       TotalIOCount = 2,808,082.00 AMPCPUTime= 5,350.20 Parser CPU Time = 0.02 SpoolSpace = 144,843,587,584.00
Using Row_number : TotalIOCount = 2,758,403.00 AMPCPUTime= 6,397.18 Parser CPU Time = 0.03 SpoolSpace = 145,247,547,392.00


With a literal value(1) in Order by clause both queries returned "2646:  No more spool space" error

Create volatile table vt_zzzrow_noprt, no fallback, no log as
(
Sel mem_nm,Row_Number() Over (order by 1) as mem_surg
from xxxtttviews.mem_hist
) with data  primary index(mem_surg) on commit preserve rows
;

Create volatile table vt_csum_noprt, no fallback, no log as
(
Sel mem_nm,csum(1,1) as mem_surg
from xxxtttviews.mem_hist
) with data primary index(mem_surg) on commit preserve rows
;

1 REPLY
Junior Contributor

Re: CSUM() vs ROW_NUMBER() for generating Surrogate keys

CSUM is just an old, deprecated, non-Standard SQL syntax.

Of course CSUM and ROW_NUMBER are similar, but when you Google you will find mainly recommendations for CSUM(1,1). Probably because it's short and you don't have to add ORDER BY (of course it's still sorting by the value "1"). As you noticed this runs on a single AMP and causes extreme skew (or runs out of spool) when used for a larger number of rows.

 

Similar for SUM(1) or COUNT(*) OVER (ROWS UNBOUNDED PRECEDING), there's no ORDER BY, but Teradata simply orders by all columns, which increases spool & CPU over a ROW_NUMBER.

 

Adding PARTITION BY changes this, if there's no ORDER BY there's no sort, now Count(*) Over (PARTITION BY col ROWS Unbounded Preceding) is more efficient than ROW_NUMBER() OVER (PARTITION BY col ORDER BY whatever). Best case is a PARTITION BY PI-columns.