Getting a unique value for VARCHAR field.

Database
Enthusiast

Getting a unique value for VARCHAR field.

Hi,

 I need to generate a unique value based on description of the field without using the identity columns the reason being if the same description comes in the next load we will get a different value. I tried using HASH functions but it is giving same values for multiple descriptions.

My TABLE:

FAC_ID - INTEGER

PROC_DESCR - VARCHAR(100)

I need to generate one more column based on PROC_DESCR that will assign a unique code to it. Here is what I tried:

Select FAC_ID,

CAST(CAST(HASHBUCKET(hashrow(PROC_DESCR) (BYTE(4))) AS BIGINT) AS VARCHAR(15)) as PROC_CODE,

PROC_DESCR

FROM MYTABLE ;

But it gives me same PROC_CODE's for different PROC_DESCR which fails my requirement.

10 REPLIES
Senior Apprentice

Re: Getting a unique value for VARCHAR field.

HASHBUCKET returns a value between 0 and 2**16 (or 2**20, depending on your system).

Use only HASHROW which results in a BYTE(4) or use the approach i use in my stats query (http://developer.teradata.com/node/9598):

BYTE(4) to signed integer (INT)

REPLACE FUNCTION Byte4_to_Int (inBytes VARBYTE(4))

RETURNS INT

LANGUAGE SQL

CONTAINS SQL

DETERMINISTIC

RETURNS NULL ON NULL INPUT

SQL SECURITY DEFINER

COLLATION INVOKER

INLINE TYPE 1

RETURN

  (HASHBUCKET(inBytes (BYTE(4))) / ((HASHBUCKET()+1)/65536) * 65536. +

   HASHBUCKET(SUBSTR(inBytes,3,2) (BYTE(4))) / ((HASHBUCKET()+1)/65536) -

   HASHBUCKET(inBytes (BYTE(4)))  / ((HASHBUCKET()+1)/65536) / 32768 * 4294967296)

;

BYTE(4) to unsigned integer (BIGINT)

REPLACE FUNCTION Byte4_to_UnsignedInt (inBytes VARBYTE(4))

RETURNS BIGINT

LANGUAGE SQL

CONTAINS SQL

DETERMINISTIC

RETURNS NULL ON NULL INPUT

SQL SECURITY DEFINER

COLLATION INVOKER

INLINE TYPE 1

RETURN

  HASHBUCKET(inBytes (BYTE(4)))  / ((HASHBUCKET()+1)/65536) * 65536. +

  HASHBUCKET(SUBSTR(inBytes,3,2) (BYTE(4)))  / ((HASHBUCKET()+1)/65536)

;

If you're not on TD13.10 simply cut & paste the source code and replace inBytes with PROC_DESCR.

Dieter

Enthusiast

Re: Getting a unique value for VARCHAR field.

Thanks a lot, Dieter..!!

Re: Getting a unique value for VARCHAR field.

Hi Dieter,

Can you please help me for my problem.

We have TAX_ID field in one of our tables and we are trying to mask that value and so mask value should be unique per given TAX_ID. So we are trying to use HASHROW(TAX_ID) and its returning unique values but since its in BYTE datatype, end users are unable to use that value. Is there any UDF function available which converts that to CHAR? Or any other ideas you have?

This is pretty urgent and any help is appreciated.

Regards

Ajay

Senior Apprentice

Re: Getting a unique value for VARCHAR field.

Hi Ajay,

you could use the calculation i posted and cast it to a char.

But be aware, the HashRow function is not guaranteed to return unique values, you might need to implement your own UDF.

Dieter

Enthusiast

Re: Getting a unique value for VARCHAR field.

Hi Dieter,

You mentioned in your previous post that : "If you're not on TD13.10 simply cut & paste the source code and replace inBytes with PROC_DESCR"

We're currently on 13.10 and we would like to use the function:

BYTE(4) to unsigned integer (BIGINT)

REPLACE FUNCTION Byte4_to_UnsignedInt (inBytes VARBYTE(4))

RETURNS BIGINT

LANGUAGE SQL

CONTAINS SQL

DETERMINISTIC

RETURNS NULL ON NULL INPUT

SQL SECURITY DEFINER

COLLATION INVOKER

INLINE TYPE 1

RETURN

HASHBUCKET(inBytes (BYTE(4))) / ((HASHBUCKET()+1)/65536) * 65536. +

HASHBUCKET(SUBSTR(inBytes,3,2) (BYTE(4))) / ((HASHBUCKET()+1)/65536)

;

Can we still use it? If not, can we know what changes we need to make to make it work on 13.10?

Also, I did not entirely understand the logic:

HASHBUCKET(inBytes (BYTE(4))) / ((HASHBUCKET()+1)/65536) * 65536. +

HASHBUCKET(SUBSTR(inBytes,3,2) (BYTE(4))) / ((HASHBUCKET()+1)/65536)

It would be great if you can explain me this logic. I could understand why you divided by ((HASHBUCKET()+1)/65536) (to handle hashbucket being 20 bit now), but I didn't understand why multiply by 65536 and no substr in first part(of addition) and apply substr and not multiply in the other part(of the addition)

-Suhail

Senior Apprentice

Re: Getting a unique value for VARCHAR field.

Hi Suhail,

of course you can use the function as-is, pre-13.10 just doesn't support SQL functions.

The basic calculation is like getting a year-month out of a year and a month:

year * 100 + month = 2013*100+01 -> 201301

For binary data it's:

aabbccdd -> hashbucket(aabb....) * 2**16 + hashbucket(ccdd....)

Dieter

Enthusiast

Re: Getting a unique value for VARCHAR field.

Hi Dieter,

I tried your solution posted above, and columnA is defined as varchar(50).

The following is the query, but there are few duplicates after converting(below is an example). Can you please take a look to see if any workaround for this?

SELECT  DISTINCT columnA as a, 
HASHROW( columnA ) as b,
(HASHBUCKET(HASHROW (columnA) (BYTE(4))) / ((HASHBUCKET()+1)/65536) * 65536. +
HASHBUCKET(SUBSTR(HASHROW (columnA),3,2) (BYTE(4))) / ((HASHBUCKET()+1)/65536)) as c
FROM TableA

  output c is the same when columnA in ('0034317069', '0063313119882_206148')

Fan

Re: Getting a unique value for VARCHAR field.

Hi Dieter,

Above in this thread you have mentioned that HASHROW function is not guranted to return a unique value. Can you please explain as we want to use HASHROW to generate a hashvalue for a VARCHAR column (ACCOUNT_NUMBER). We anticipate:

1)  each time  same HAHROW would be the out put for a particular account number i.e. 123 will always be hashed into 27-48-9E-A5. 

2) No two account numbers can have the same ROWHASH value.

3) ROWHASH value can not be translated back to input value (ACCOUNT_ NUMBER in this case)

Thanks,

Asif

Teradata Employee

Re: Getting a unique value for VARCHAR field.

Assumption #3 is correct - the hash function is not invertible.

Assumption #2 is incorrect. It is possible (though mildly unusual) for two different input values to yield the same hash (a hash collision).

Assumption #1 is only partially correct. On a given system, the same input value should return the same hash. Changing the hash function implementation would require migrating the data to new hardware or doing a full sysinit & restore, so changes would be very infrequent. But there have been several versions of hash function over time.