Example Java UDF for Table Hash Calculations

There is demand for functionality similar to industry-standard hash functions such as SHA-256: condensing the content of an RDBMS table into a single value that changes completely if even a single bit changes, even in tables with millions or billions of rows.

The main driver for this demand is that dual systems – dual-active EDWs or mixtures of EDW and appliances – are increasingly in use. With this, the question of cross-system reconciliation becomes more important. Requirements range from monitoring the correctness of IT processes on both systems to regulatory requirements such as SOX compliance.

In 2013 I shared the idea of a table hash calculation that is not order dependent and therefore also works in relational database systems at http://developer.teradata.com/blog/ulrich/2013/05/calculation-of-table-hash-values-to-compare-table-... . That post referred to my Partners presentation of 2012, which is attached to this post.

As the topic sounds rather theoretical, I developed a Java UDF implementation to allow everybody who is interested to test this approach.

The Java UDF implements the algorithm described in the 2012 Partners presentation. Because Java UDFs always run in protected mode, this UDF will not give you good performance, but it should show you how the UDF is called and how it can be used to reconcile tables and queries between two different systems.
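
To give a feel for the order-independence property, below is a minimal Java sketch. It is an illustration only, not the code of the attached UDF: it assumes each row has already been serialized to one string, hashes that string with SHA-1, and combines the per-row digests by addition modulo 2^160. The combine step is an assumption of this sketch (any commutative operation yields the same property), so it will not reproduce the values of the attached UDF, and the class and method names are made up.

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.Arrays;

// Sketch of an order-independent table hash: hash each row string with SHA-1,
// then combine the per-row digests with a commutative operation.
// NOTE: addition modulo 2^160 is an assumption of this sketch; the attached
// UDF implements the algorithm from the 2012 Partners presentation.
public class TableHashSketch {

    private static final BigInteger MOD = BigInteger.ONE.shiftLeft(160);

    public static String tableHash(Iterable<String> rowStrings) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        BigInteger acc = BigInteger.ZERO;
        for (String row : rowStrings) {
            byte[] digest = sha1.digest(row.getBytes("UTF-8"));
            // Commutative combine: row order does not influence the result.
            acc = acc.add(new BigInteger(1, digest)).mod(MOD);
        }
        return String.format("%040x", acc); // 40 hex characters, like SHA-1
    }

    public static void main(String[] args) throws Exception {
        // The same rows in a different order produce the same hash.
        System.out.println(tableHash(Arrays.asList("a|1", "b|2")));
        System.out.println(tableHash(Arrays.asList("b|2", "a|1")));
    }
}

Because the combine step is commutative and associative, it maps naturally onto an aggregate UDF, which is also why the same call works per group (see the GROUP BY example below).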

If you are interested in the fast version (a C UDF with faster hash functions), contact me directly, but don't expect it for free. It is the result of long-running R&D activities, which have their price.

Attached are two zip files:

1. tablehash_sha1.zip

This contains one jar file, three SQL files, and one logon file in case you install the UDF via BTEQ.

    • Unzip the file.
    • In all three scripts, change the database name yourDB to the database where you want to install the UDF.
    • tablehash_sha1Install.sql - loads the jar file onto the Teradata server. The user running this needs execute rights on the stored procedure that installs jars.
    • tablehash_sha1Create.sql - creates the table hash Java UDF.
    • tablehash_sha1Drop.sql - can be used to drop the function and remove the jar from the Teradata system (clean-up).

Now you should be able to calculate table hashes - some SQL examples are given below.

2. tdth.zip

As outlined in the presentation, we have to create a long input string for this function - a Teradata-internal implementation would be able to overcome this heavy burden, as also stated in the presentation.

This can make the usage somewhat inconvenient. To address this, we developed a small Java program.

To set up the program (Java 1.6 or higher is needed on the computer where it will run), a bit of one-time configuration is required:

    • Change to the dwl folder, create a copy of the logon_sample.xml file, and specify your credentials in it. It is straightforward:
<?xml version="1.0" encoding="ISO-8859-1"?>
<params type="logon">
    <host>HostNameOrIPAddress</host>
    <user>td_user</user>
    <pwd>pwd</pwd>
</params>
    • Change to the parameter folder and modify the base.xml file.

      The two main parameters to change are:

      <logon_file>yourLogonFile.xml</logon_file>
      <th_func>yourDB.tablehash_sha1</th_func>

      Also remove the <sql> row.

    • Now you should be able to run the program - from the base directory - via:

      java -jar jar/tdth.jar --param_file parameter/base.xml --table sys_calendar.calendar

This will calculate the table hash for the sys_calendar.calendar view.

java -jar jar/tdth.jar --param_file parameter/base.xml --table yourDB.yourObj
would calculate it for any other table or view you specify as yourDB.yourObj.

The nice thing is that you can also use this call to generate the correct SQL: if you remove the <process/> line from the base.xml parameter file, it will only generate the SQL and not execute it.

After the run you will find tdth_sql.txt in the log folder (the name can be changed in the base.xml parameter file). This file contains the SQL that was generated based on DBC information (NULL handling, data type handling, etc.). This is quite useful for generating the SQL statements and running them directly in SQL Assistant or Teradata Studio.
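
To give an idea of what this generation step looks like, here is a rough Java sketch that assembles such a statement from a column list. The class and method names are hypothetical; the real tool additionally reads the column list from DBC and adjusts the casting and NULL handling per data type.

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the SQL generation: wrap every column in
// TRIM(CAST(... AS VARCHAR(75))) and replace NULLs with a fixed marker value,
// matching the pattern of the generated statements shown below.
public class HashSqlSketch {

    private static final String NULL_MARKER = "109085beaaa80ac89858b283a64f7c75d7e5bb12";

    public static String buildHashSql(String udf, String table, List<String> columns) {
        StringBuilder sql = new StringBuilder("SELECT\n    " + udf + "(\n");
        for (int i = 0; i < columns.size(); i++) {
            sql.append("        coalesce ( TRIM(CAST(").append(columns.get(i))
               .append(" AS VARCHAR(75))), '").append(NULL_MARKER).append("' )")
               .append(i < columns.size() - 1 ? " ||" : "").append("\n");
        }
        sql.append("    ) AS table_hash\nFROM\n    ").append(table).append(";");
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildHashSql("yourDB.tablehash_sha1", "sys_calendar.calendar",
                Arrays.asList("calendar_date", "day_of_week", "day_of_month")));
    }
}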

There are more options available (SQL statements instead of views/tables, etc.), which are described in more detail in the documentation (doc folder).

You will notice that the first run is slower than subsequent ones - I assume the database needs to initialize each Java UDF when it first runs.

But, as mentioned above, the Java UDF will always be slow, as it runs in protected mode. So don't run this version against multi-million-row tables!

The C version in unprotected mode will run 100 times faster.

Finally, some SQL examples (the base UDF call was generated with the call above) and the expected results:

SELECT
    yourDB.tablehash_sha1(  
        coalesce ( TRIM(CAST(calendar_date AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(day_of_week AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(day_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(day_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(day_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(weekday_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(week_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(week_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(week_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(month_of_quarter AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(month_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(month_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(quarter_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(quarter_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(year_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' )
    ) AS table_hash 
FROM
sys_calendar.calendar
;

expected result: a490387fbc9b93b74ea85d0312c23da7c6e8ae0e

As the UDF is a standard aggregate function, you can use the same call to get hash values per group (which can help to locate the groups where differences exist):

SELECT month_of_year,
    yourDB.tablehash_sha1(  
        coalesce ( TRIM(CAST(calendar_date AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(day_of_week AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(day_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(day_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(day_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(weekday_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(week_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(week_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(week_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(month_of_quarter AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(month_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(month_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(quarter_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(quarter_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
        coalesce ( TRIM(CAST(year_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' )
    ) AS table_hash 

FROM
    sys_calendar.calendar
WHERE year_of_calendar = 2014
GROUP BY 1
ORDER BY 1;

expected results:
month_of_year, table_hash
1, f62e7c2d6647907c9f90d9de260a5e3f48a2f121
2, f6c224f2be371f281c759af71a3664de76dbefdf
3, f6584c17e5e0b86db8be2eed3f4b5f8509b02b48
4, 6d17c7bae4f9d2525ee88706d0150a30baddceee
5, 656eda9686371c53de25718b093fb84c199f421d
6, eda1fa483e7f7cd14353ed84fff210c3c0074201
7, adabac9c58e95b6c7f3b2be58724e48f6bfd8745
8, 87cfeab49fee83278f3099d1278fc69e415d2f16
9, f33d69a0604219d9c6292b2508589b2f13180951
10, 3ce090dcd97a585b9c3191a47cf617a8e3ab95b0
11, 5575cf90bb46bed6a21e477cc249582e372a69d0
12, 9ee1449915a7a67b8d8d0d9f7eb44f466943e45b

Given this - enjoy trying it, and let me know if you have any questions.

3 Comments
Teradata Employee

Hi Ulrich, thank you for sharing this. Where are the attached files?

Senior Supporter

At the top of the page, where the blog text starts, there is an attachments section on the right side. I get three files listed there...

Senior Supporter

P.S. Maybe also try a different browser - it seems some have issues with the developer.teradata.com page...