I have a data set where each row is unique, and I need to create a row_id for each row. However, the data set is so large that if I use ROW_NUMBER() to achieve this, I run into a 'no more spool space' error.
Then I figured I could use HASHROW to generate a row_id-like index. The primary index of my data set consists of multiple columns, say col1, col2, ..., col6, and I expected HASHROW(col1, col2, ..., col6) to generate as many distinct values as there are rows. However, a certain number of rows end up with the same hash value. Is that expected?
If the HASHROW method does not work, do you have other suggestions for generating a row_id-like index for a huge data set?
Thanks a lot!
Yes, getting duplicate values from the HASHROW function is expected, even if the data values that you are hashing are unique. HASHROW returns a 32-bit (4-byte) row hash, so there are only 2**32 (about 4.3 billion) possible values; for a table with hundreds of millions of rows, the birthday paradox makes collisions a near-certainty.
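As a quick sanity check you can quantify the collisions directly. This is only a sketch; the table and column names are placeholders for your own primary index columns:

```sql
-- Compare the row count with the number of distinct hash values.
-- The difference is the number of rows that share a hash with
-- at least one other row (i.e. the collisions).
SELECT COUNT(*)                                   AS total_rows
      ,COUNT(DISTINCT HASHROW(col1, col2, col3,
                              col4, col5, col6))  AS distinct_hashes
      ,COUNT(*)
       - COUNT(DISTINCT HASHROW(col1, col2, col3,
                              col4, col5, col6))  AS colliding_rows
FROM mydb.mytable;
```

If colliding_rows is greater than zero, HASHROW alone cannot serve as a unique row_id.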
On the 'no more spool space' error:
My first assumption is that the problem is skewing, so check the table and query skew first.
If that doesn't sort it, the following may work for you:
SELECT dt2.col1
      ,dt2.col2
      --,other columns from dt2
      ,(((dt3.group_sequence - 1) * dt3.max_rowcount) + dt2.rownumber) AS new_row_id
FROM (SELECT col1
            ,col2
            --,other columns
            ,ROW_NUMBER() OVER (PARTITION BY col1, col2
                                ORDER BY col1, col2) AS rownumber
      FROM t1) AS dt2
INNER JOIN
     (SELECT col1
            ,col2
            ,ROW_NUMBER() OVER (ORDER BY col1, col2) AS group_sequence
            ,MAX(rowcount) OVER ()                   AS max_rowcount
      FROM (SELECT col1
                  ,col2
                  ,COUNT(*) AS rowcount
            FROM t1
            GROUP BY 1, 2) AS dt1) AS dt3
ON  dt3.col1 = dt2.col1
AND dt3.col2 = dt2.col2;
The above is based around:
,(((dt3.group_sequence - 1) * dt3.max_rowcount) + dt2.rownumber) AS new_row_id
This will generate unique values for new_row_id in the result set. Those values will NOT be sequential. For example, if max_rowcount is 100, the rows in the group with group_sequence = 3 get ids 201 through 300; if that group has fewer than 100 rows, the rest of that range becomes a gap. To reduce the size of the gaps you can add more columns to the select list (and the GROUP BY clause) in DT1.
Depending on the actual volume of data, you may find that you have to CAST some of the numbers/counters to a larger data type to increase the available value range.
If that gives you what you want, wrap it inside a CREATE TABLE ... AS statement and you have a new table.
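A sketch of that wrapping step, with an illustrative table name; the body is the same query as above, elided here for brevity:

```sql
-- Materialise the result once so new_row_id is stored,
-- not recomputed on every query.
CREATE TABLE mydb.t1_with_id AS
( SELECT dt2.col1
        ,dt2.col2
        ,(((dt3.group_sequence - 1) * dt3.max_rowcount) + dt2.rownumber) AS new_row_id
  FROM ...  -- same derived tables DT2 and DT3 as in the query above
) WITH DATA
PRIMARY INDEX (new_row_id);
```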
Does that help?
Hi Dave @DaveWellman
Thanks for your detailed answer! Following your suggestions step by step, I first checked the skew factor of the table:
SELECT TABLENAME
      ,SUM(CURRENTPERM) / (1024 * 1024) AS CURRENTPERM
      ,(100 - (AVG(CURRENTPERM) / MAX(CURRENTPERM) * 100)) AS SKEWFACTOR
FROM DBC.TABLESIZE
WHERE DATABASENAME = 'MYDB'
  AND TABLENAME = 'SRP_VI'
GROUP BY 1;
The skew factor is 3.19, which is pretty low.
I also used the code below to check the average number of rows per row-hash value:
SEL COUNT(*) / COUNT(DISTINCT HASHROW(QT, TRMT_VRSN_ID, GUID, SESSION_SKEY,
                                      SESSION_START_DT, EVENT_TIMESTAMP, SEQNUM)) AS AVG_ROW_CNT
FROM MYDB.SRP_VI;
It shows that avg_row_cnt is 1, so we can now rule out 'high skewness' as the cause.
Then I checked the size of the table using the code below:
SEL DATABASENAME
   ,TABLENAME
   ,SUM(CURRENTPERM) / 1024 ** 3 AS CURRENT_GB -- in GB
FROM DBC.ALLSPACE
WHERE TABLENAME = 'SRP_VI'
  AND DATABASENAME = 'MYDB'
GROUP BY 1, 2
ORDER BY 1, 2;
The result shows the table size is around 17.06 GB, which is pretty large...
Thirdly, I already used an ORDER BY clause in the ROW_NUMBER function. Actually, since I want a row_id for the entire data set, I didn't use PARTITION BY but only ORDER BY (on the primary index columns, which together define the uniqueness of a row) in the ROW_NUMBER function.
I tried the main code you suggested, but it seems to run forever. I suppose that for a 17 GB data set, so many joins and derived tables are very slow. I am wondering whether there is a way to extract the ROWID (which is the row-hash value plus the uniqueness value) for each row; I think it is unique and could be used as a row_id-like index. Unfortunately, I didn't find a solution online. Do you have any idea?
Thanks in advance!
(Well done for working through those analysis items).
Teradata doesn't allow you to retrieve the ROWID of a row. You can get the row hash (via HASHROW), but not the uniqueness value.
In your analysis you've looked at the skew of the table. That can have an impact on the skew of a query, but please remember that the error you hit was 'no more spool space', which is a problem with the spool file(s) and has nothing directly to do with the table itself.
The table size is about 17GB. What is the spool space limit on your userid?
Is 17GB a big table? That really depends on the size of your TD system. On a 170 AMP system it is only 100MB per AMP. On a 340 AMP system it is only 50MB per AMP. I accept that isn't tiny but I don't think it is huge either.
You said that "Actually since I want a row_id for the entire data set I didn't use 'partition by' but only 'order by (the primary indexes which together defines the uniqueness of a row)' in ROW_NUMBER function". In my experience this is very likely to lead to skewing: without PARTITION BY, the whole table is sorted as a single group. You NEED the PARTITION BY in order to avoid the skewing. I understand that you need a row number across the entire data set, but that is exactly what the logic in my earlier reply does; that is why we have DT1 and DT3.
If that single process seems to be taking a long time, try breaking it into two.
Create the DT1/DT3 portion as a volatile table in one request.
Then join that volatile table into the main processing, replacing the DT1/DT3 processing with your volatile table (or even make it a permanent table if required).
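A sketch of that two-step split, using the same illustrative col1/col2 names as before (the volatile table name group_seq is my own). The idea is that each request then needs far less spool than the single combined query:

```sql
-- Step 1: materialise the small per-group numbering (the DT1/DT3 part).
CREATE VOLATILE MULTISET TABLE group_seq AS
( SELECT col1
        ,col2
        ,ROW_NUMBER() OVER (ORDER BY col1, col2) AS group_sequence
        ,MAX(rowcount) OVER ()                   AS max_rowcount
  FROM ( SELECT col1, col2, COUNT(*) AS rowcount
         FROM t1
         GROUP BY 1, 2 ) AS dt1
) WITH DATA
PRIMARY INDEX (col1, col2)
ON COMMIT PRESERVE ROWS;

-- Step 2: join the small volatile table back to the detail rows.
SELECT dt2.col1
      ,dt2.col2
      ,(((g.group_sequence - 1) * g.max_rowcount) + dt2.rownumber) AS new_row_id
FROM ( SELECT col1
             ,col2
             ,ROW_NUMBER() OVER (PARTITION BY col1, col2
                                 ORDER BY col1, col2) AS rownumber
       FROM t1 ) AS dt2
JOIN group_seq AS g
  ON g.col1 = dt2.col1
 AND g.col2 = dt2.col2;
```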
Does that help?
OLAP functions at least double the spool space (there are two spools within a STATS step) and the ORDER BY might also increase spool usage.
What's the DDL of that table and which of those columns are in ORDER BY?
What's the row count?
And can you show the Explain?
Hi Dnoeth @dnoeth
Actually, I asked a similar question before in this thread, and I tried your solution of concatenating the primary index columns to generate the inkey:
CREATE VOLATILE MULTISET TABLE SRP AS
( SEL A.INKEY
     ,QT
     ,TRMT_VRSN_ID
     --,EP_CHNL
     ,GUID
     ,SESSION_START_DT
     ,SESSION_SKEY
     ,EVENT_TIMESTAMP
     ,SEQNUM
     ,PAGE_TYPE
     ,KEYWORD
     ,A.ITEM_LIST
     ,N_ITEM
     ,CIID AS JOIN_ID
     ,B.IDX
     ,CAST(B.ITEM_ID AS DECIMAL(18, 0)) AS ITEM_ID
  FROM SRP_VI_R A
  INNER JOIN
     ( SEL *
       FROM TABLE (STRTOK_SPLIT_TO_TABLE(SRP_VI_R.INKEY, SRP_VI_R.ITEM_LIST, ',')
            RETURNS (OUTKEY  VARCHAR(128)
                    ,IDX     INT
                    ,ITEM_ID VARCHAR(50))) AS DT
     ) B
  ON A.INKEY = B.INKEY
  WHERE A.PAGE_TYPE = 'SRP'
) WITH DATA
PRIMARY INDEX (QT, GUID, SESSION_START_DT, SESSION_SKEY)
ON COMMIT PRESERVE ROWS;
But I got an error saying 'STRTOK: VARCHAR INKEY EXCEEDS THE MAX DEFINED'. There seems to be a constraint on the length of inkey and outkey: the maximum inkey length that works for me is 64, but my inkey is longer than that. Do you have any idea how to solve this problem?
Thanks a lot for your patience
I tried your code, but it failed while creating table DT1, again with the 'no more spool space' error. In the end I concatenated the primary index columns to generate a row_id-like index, which avoided the ROW_NUMBER() function altogether.
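For the record, the concatenation approach looks roughly like this. The column names, VARCHAR lengths, and the '~' delimiter are illustrative; the delimiter matters because without it, e.g., 'ab' || 'c' and 'a' || 'bc' would produce the same row_id for different column values:

```sql
-- Build a row_id by casting each primary index column to text
-- and joining the pieces with a delimiter that cannot appear
-- in the data itself.
SELECT TRIM(CAST(col1 AS VARCHAR(50))) || '~' ||
       TRIM(CAST(col2 AS VARCHAR(50))) || '~' ||
       TRIM(CAST(col3 AS VARCHAR(50)))        AS row_id
      ,t1.*
FROM t1;
```

Since the concatenated columns are exactly the ones that define row uniqueness, the resulting row_id is guaranteed unique, at the cost of being a wide VARCHAR rather than a compact number.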
I learned a lot from your answers. Appreciate it a lot, cheers!