Calculation of Table Hash Values to Compare Table Content

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Senior Supporter

At the Partners 2012 I presented the session “How to compare tables between TD systems in a dual active environment” - see attached slides for details if of interest for you.

The main emphasis was and is to draw attention to an issue which becomes more important as more and more customers are using multiple Teradata instances. Either as dual active or by using different appliances for different purposes (e.g. online archive plus EDW instance). If you expect to have the same data on two or more systems, the question can be asked on how to PROOF that the data is really the same – joining is not possible.

Current approaches like calculating column measures have major limitations especially for big tables – see the presentation for details. Unfortunately classical hash algorithm like SHA1 or MD5 – which had been designed especially for that purpose - can not be used within the DB as native functions on table level as they require the same ordering of data which can not be guaranteed at two different systems. An option is to extract the whole data sorted and run the SHA1 on the output file. But this leads to major costs for the sorts and network traffic.

A new way to overcome these limitations is a new table hash function which overcomes the ordering requirement of the SHA1 and MD5 hash functions.

The main idea is based on a two step approach:

1Calculate for each row a hash value where the outcome has the classical hash properties.

2“Aggregate” the row hash values by an aggregate function where the result does not depend on the ordering of the input (commutative and associative).

The resulting output can be used as a table hash to compare data content between different systems.

The main considerations to choose the right component function are:

for 1. avoid of hash collision, which require reasonable long hash values for the row hash. This is for example an issue if you wane use the internal hashrow function of Teradata as hash collisions occur even for small tables.

for 2. use a good function which is also handling multiset tables (where XOR) has an issue.

We implemented a C UDF and compared the results with export of data options and see already with this implementation competitive resource consumption figures.

During the last months we have spent more R&D on this and have found a better hash function for the row hash calculation. This reduces the CPU consumption to about 60% in comparison to the SHA1 hash function we used before. In addition we found a better aggregation function which overcomes limitations of the ADD MOD 2^Hash length function.

As discussed already in the Partner 2012 presentation the best performance for this kind of functions should be achieved in case the function would be implemented as a core DB function similar to the hashrow function. This would also improve usability as the described data preparation is not needed. But it is likely that Teradata will only implement this with a strong demand from the customer side.

In the mean time – and here starts the sales pitch – if you need to proof that data on two different systems are the same or you wane discuss any of this in more detail then contact me.

There are also additional use cases for this function which you should also take into consideration e.g. regression testing of DB versions, systems and your own software.

So in summary - be aware of the issues and consider the alternatives!

4 Comments
Enthusiast

This is awesome. I assume you're not providing the source code for free? :)

Regards,

Suhail

Senior Supporter

Hi Suhail,

sorry no :). 

Ulrich

Enthusiast

Hi Ulrich,

How can I contact you?

Regards,

Suhail

Senior Supporter

ulrich.arndt@data2knowledge.de - see also last page of the slides ;-)