Question on Hashrow and row_number()

Enthusiast

Hello,

 

I have a data set where each row is distinct, and I need to create a row_id for each row. However, the data set is so large that using row_number() for this runs into a 'no more spool space' error.

 

Then I figured I could use HASHROW to generate a row_id-like index. The primary index of my data set consists of multiple columns, say col1, col2, ..., col6, and I assumed HASHROW(col1, col2, ..., col6) would produce as many distinct values as there are rows. However, a certain number of rows end up with the same hash value. Is that expected?

 

If the HASHROW method does not work, do you have other suggestions for generating a row_id-like index for a huge data set?

 

Thanks a lot!

 

Lorraine

6 REPLIES
Junior Apprentice

Re: Question on Hashrow and row_number()

Hi Lorraine,

 

Yes, getting duplicate values from the HASHROW function is expected, even if the data values you are hashing are unique. HASHROW returns a 32-bit (4-byte) value, so collisions become unavoidable once the row count gets large.
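To see why, consider the birthday problem: with only 2^32 possible hash values, colliding pairs appear long before you have 2^32 rows. A back-of-the-envelope sketch in Python (the row counts here are hypothetical, for illustration only):

```python
# Approximate expected number of colliding pairs when hashing n distinct
# rows into a 32-bit space (HASHROW returns a 4-byte value).
# Birthday-problem approximation: E[pairs] ~ n * (n - 1) / (2 * 2**32)

def expected_collisions(n_rows: int, hash_bits: int = 32) -> float:
    """Expected count of colliding pairs for n_rows distinct inputs."""
    buckets = 2 ** hash_bits
    return n_rows * (n_rows - 1) / (2 * buckets)

# Even at 100,000 rows a collision is likely; at 100 million rows,
# over a million colliding pairs are expected.
print(expected_collisions(100_000))        # ~1.16
print(expected_collisions(100_000_000))    # ~1,164,153
```

So a differing hash proves two rows differ, but an equal hash never proves two rows are the same.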

 

On the 'no more spool space' error:

  • Have you checked whether this is due to skewing, or simply that your user's spool limit isn't big enough?
  • What is the volume of data being processed, and what is your spool limit?

 

Assuming the problem is skewing:

  • When using the ROW_NUMBER function, have you tried ordering by many (maybe all) of the columns in the table?

 

If that doesn't sort it, the following may work for you:

SELECT dt2.col1
      ,dt2.col2
      ,--other columns from dt2-- 
      ,(((dt3.group_sequence - 1) * dt3.max_rowcount) + dt2.rownumber) AS new_row_id
FROM (SELECT col1
            ,col2
            ,--other columns--
            ,ROW_NUMBER() OVER(PARTITION BY col1,col2 ORDER BY col1,col2) AS rownumber
      FROM t1
      ) AS dt2
INNER JOIN (SELECT col1, col2
                  ,ROW_NUMBER() OVER(ORDER BY col1, col2) AS group_sequence
                  ,MAX(rowcount) OVER() AS max_rowcount
            FROM (SELECT col1
                        ,col2
                        ,COUNT(*) AS rowcount
                  FROM t1
                  GROUP BY 1,2) AS dt1
           ) AS dt3
  ON dt3.col1 = dt2.col1
    AND dt3.col2 = dt2.col2

The above is based around:

  • an existing table called T1
  • the PI of this table comprises 'col1' and 'col2' - those columns need to be in the 'PARTITION BY' clause of the main derived table (DT2) in order to avoid skewing
  • DT2 generates a 'row number' per 'group' (the group being the PI columns).
  • DT3 identifies each 'group' and assigns a unique number to the group.
  • DT3 also finds the maximum number of rows in any 'group'.
  • The main query then uses this information to calculate a unique 'new_row_id' value for each row: (((dt3.group_sequence - 1) * dt3.max_rowcount) + dt2.rownumber) AS new_row_id

This will generate unique values for 'new_row_id' in the result set. Those values will NOT be sequential. To reduce the size of the 'gaps' you can add more columns into the select list (and GROUP BY clause) in DT1.

  • BUT (I think) the columns in the 'PARTITION BY' (DT2) need to match the 'GROUP BY' clause in DT1.
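The arithmetic behind new_row_id can be sketched in Python (the group sizes below are made up for illustration): each group gets its own non-overlapping block of max_rowcount IDs, so the values are unique but not sequential.

```python
# Sketch of (group_sequence - 1) * max_rowcount + rownumber using
# hypothetical PI groups; mirrors what DT1/DT3 compute in SQL.
groups = {"A": 3, "B": 5, "C": 2}            # rows per PI group (made up)
max_rowcount = max(groups.values())          # MAX(rowcount) OVER()

ids = []
for group_sequence, (_grp, nrows) in enumerate(sorted(groups.items()), start=1):
    for rownumber in range(1, nrows + 1):    # ROW_NUMBER() within the group
        ids.append((group_sequence - 1) * max_rowcount + rownumber)

print(ids)                          # [1, 2, 3, 6, 7, 8, 9, 10, 11, 12]
print(len(ids) == len(set(ids)))    # True - all IDs are unique
```

Note the gaps (4 and 5 are skipped) where a group has fewer than max_rowcount rows - exactly the non-sequential behaviour described above.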

 

Depending on the actual volume of data, you may find that you have to CAST some numbers/counters to a larger data type to increase the available value range.

 

If that gives you what you want, wrap it inside a 'CREATE TABLE AS' statement and you have a new table.

 

Does that help?

 

Cheers,

Dave

Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Enthusiast

Re: Question on Hashrow and row_number()

Hi Dave @DaveWellman

Thanks for your detailed answer! Following your suggestions step by step, I first checked the skew factor of the table:

SELECT 
TABLENAME,
SUM(CURRENTPERM) /(1024*1024) AS CURRENTPERM, 
(100 - (AVG(CURRENTPERM)/MAX(CURRENTPERM)*100)) AS SKEWFACTOR 
FROM 
DBC.TABLESIZE 
WHERE DATABASENAME = 'MYDB'
AND TABLENAME = 'SRP_VI'
GROUP BY 1;

The skew factor is 3.19, which is pretty low.  


I also used the code below to check the average number of rows per hash value:

SEL COUNT(*) / COUNT(DISTINCT(HASHROW(QT, TRMT_VRSN_ID, GUID, SESSION_SKEY, SESSION_START_DT, EVENT_TIMESTAMP, SEQNUM))) AS AVG_ROW_CNT
FROM MYDB.SRP_VI;

It shows that avg_row_cnt is 1, so we can now rule out high skewness as the cause.

 

Then I checked the size of the table using the code below:

SEL DATABASENAME,
	TABLENAME,
	SUM(CURRENTPERM) / 1024 ** 3 AS CURRENT_GB --IN GB
FROM DBC.ALLSPACE
WHERE TABLENAME = 'SRP_VI'
AND DATABASENAME = 'MYDB'
GROUP BY 1, 2
ORDER BY 1, 2;

The result shows the table size is around 17.06 GB, which is pretty large...

 

Thirdly, I already used an 'ORDER BY' clause in the ROW_NUMBER function. Actually, since I want a row_id for the entire data set, I didn't use 'PARTITION BY' but only 'ORDER BY' on the primary index columns, which together define the uniqueness of a row.

 

I tried the main code you suggested, but it seems to run forever. I suppose that for a data set of this size, so many joins and derived tables are very slow. I am wondering whether there is a way to extract the ROWID (the row-hash value plus the uniqueness value) for each row; I think it is unique and could be used as a row_id-like index. Unfortunately I didn't find a solution online. Do you have any ideas?

 

Thanks in advance!

 

Lorraine

 

Junior Apprentice

Re: Question on Hashrow and row_number()

Hi Lorraine,

 

(Well done for working through those analysis items).

 

Teradata doesn't allow you to retrieve the rowid for a row. You can get the row hash, but not the uniqueness value.

 

In your analysis you've looked at the skew of the table. This can affect the skew of a query, but please remember that the error you hit was 'no more spool space' - a problem with the spool file(s), not directly with the table itself.

 

  • Ideally, monitor your spool space while the query is running to check on the spool being used.
  • Alternatively, have the PeakSpool values for each AMP reset to 0, run the query, let it fail, then check the size and distribution of the PeakSpool values in dbc.DiskSpaceV.

The table size is about 17GB. What is the spool space limit on your userid?

 

Is 17GB a big table? That really depends on the size of your TD system. On a 170-AMP system it is only about 100MB per AMP; on a 340-AMP system, only about 50MB per AMP. I accept that isn't tiny, but I don't think it is huge either.
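The per-AMP figures above are simple division; a quick sketch using the numbers from this thread:

```python
# Per-AMP share of an evenly distributed 17 GB table.
table_mb = 17 * 1024  # 17 GB in MB

for amps in (170, 340):
    print(f"{amps} AMPs -> {table_mb / amps:.1f} MB per AMP")
# 170 AMPs -> 102.4 MB per AMP
# 340 AMPs -> 51.2 MB per AMP
```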

 

You said that "Actually since I want a row_id for the entire data set I didn't use 'partition by' but only 'order by (the primary indexes which together defines the uniqueness of a row)' in ROW_NUMBER function". In my experience this is very likely to lead to skewing. You NEED the 'PARTITION BY' in order to avoid the skewing. I understand that you need a row number across the entire data set, but that is what my earlier logic does - that is why we have DT1 and DT3.

 

If that single process seems to be taking a long time, try breaking it into two.

Create the DT1/DT3 portion as a volatile table in one request.

Then join that volatile table into the main processing, replacing the DT1/DT3 processing with your volatile table (or even make it a permanent table if required).

 

Does that help?

 

Cheers,

Dave

Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Junior Contributor

Re: Question on Hashrow and row_number()

OLAP functions at least double the spool usage (there are two spools within a STATS step), and the ORDER BY might increase it further.

What's the DDL of that table and which of those columns are in ORDER BY?

What's the row count?

And can you show the Explain?

Enthusiast

Re: Question on Hashrow and row_number()

Hi Dnoeth @dnoeth

 

Actually, I have asked a similar question before in this thread, and I tried your solution of concatenating the primary index columns to generate the inkey:

CREATE VOLATILE MULTISET TABLE SRP AS (
SEL A.INKEY,
	QT,
	TRMT_VRSN_ID,
--	EP_CHNL,
	GUID,
	SESSION_START_DT,
	SESSION_SKEY,
	EVENT_TIMESTAMP,
	SEQNUM,
	PAGE_TYPE,
	KEYWORD,
	A.ITEM_LIST,
	N_ITEM,
	CIID AS JOIN_ID, 
        B.IDX,
        CAST(B.ITEM_ID AS DECIMAL(18, 0)) ITEM_ID
FROM SRP_VI_R A
INNER JOIN (
SEL *
FROM
  TABLE (STRTOK_SPLIT_TO_TABLE(SRP_VI_R.INKEY, SRP_VI_R.ITEM_LIST, ',')
  RETURNS (OUTKEY VARCHAR(128),
           IDX INT,
           ITEM_ID VARCHAR(50) 
)) AS DT 
) B
ON A.INKEY = B.INKEY
WHERE A.PAGE_TYPE = 'SRP'
) WITH DATA PRIMARY INDEX (QT, GUID, SESSION_START_DT, SESSION_SKEY) ON COMMIT PRESERVE ROWS;

But I got an error saying 'STRTOK: VARCHAR INKEY EXCEEDS THE MAX DEFINED'. There seems to be a constraint on the lengths of inkey and outkey. I found that the maximum inkey length that works is 64, but my inkey is longer than that. Do you have any idea how to solve this problem?

 

Thanks!

Enthusiast

Re: Question on Hashrow and row_number()

Hi Dave,

 

Thanks a lot for your patience! :)

 

I tried your code but it failed at creating table DT1, due to the 'no more spool space' error. I finally concatenated the primary index columns to generate a row_id-like index, avoiding the ROW_NUMBER() function.
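One note of caution on the concatenation approach, for anyone reading later: without a delimiter between columns, different column combinations can collapse to the same key. A minimal sketch in Python (column values are hypothetical):

```python
# Build a row_id by joining key columns with a delimiter. The delimiter
# matters: without it, ('ab', 'c') and ('a', 'bc') both collapse to 'abc'.

def make_row_id(*key_columns, sep: str = "|") -> str:
    """Concatenate key column values into a single row_id string."""
    return sep.join(str(c) for c in key_columns)

assert "".join(["ab", "c"]) == "".join(["a", "bc"])      # collision without sep
assert make_row_id("ab", "c") != make_row_id("a", "bc")  # distinct with sep
print(make_row_id("ab", "c"), make_row_id("a", "bc"))    # ab|c a|bc
```

The delimiter must of course be a character that cannot appear in the key columns themselves.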

 

I learned a lot from your answers. Appreciate it a lot, cheers!

 

Lorraine