Selecting an ALC compression algorithm

Teradata 13.10 provides the Algorithmic Compression (ALC) feature, which allows users to apply compression/decompression functions to a specific column of character or byte type. The compression/decompression functions may be the Teradata built-in functions provided along with ALC, or user-provided compression/decompression algorithms registered as UDFs.
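
For example, ALC is specified with the COMPRESS USING and DECOMPRESS USING clauses of a column definition. The following is a minimal, hypothetical sketch (the database, table and column names are invented) that pairs the built-in Unicode compression function TransUnicodeToUTF8 with its decompression counterpart, TransUTF8ToUnicode:

CREATE TABLE mydb.customer_notes
( note_id   INTEGER
, note_text VARCHAR(1000) CHARACTER SET UNICODE
    COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
    DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode
) PRIMARY INDEX (note_id);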

When using compression there is usually a tradeoff between space and speed. Each compression algorithm has its pros and cons: it is appropriate for data with certain characteristics and usage patterns, but may not be appropriate for others. So a common question is: which compression algorithm should be used for specific data?

Here we provide some general guidelines for selecting a compression algorithm from the built-in compression algorithms available in TD 13.10, considering the tradeoff between space and speed. We also provide a test package that you can run on your own data, column by column; the results show how much space each available compression algorithm can save on your data.

ALC Compression Algorithm Selection Guidelines

TD 13.10 provides three sets of built-in compression algorithms that can be specified for a column using ALC: UTF8, CAMSET and ZLIB. The set of UTF8 functions is for Unicode data only, and is intended for Unicode character columns that contain mostly US-ASCII characters. The compression function name is TransUnicodeToUTF8; it is not applicable to Latin data. The compression rate is 50% when the data is all US-ASCII even though the column is defined as Unicode, and its typical compression rate for Unicode data that contains mainly US-ASCII characters is 30% to 50%. Expansion is likely for many common Asian languages; when expansion would occur for specific rows, the DBS decides not to apply the compression function to those rows (this applies to all of the built-in compression algorithms).

The set of CAMSET functions handles both Unicode and Latin columns: CAMSET is the compression function name for Unicode and CAMSET_L is for Latin. In TD 13.10, CAMSET compresses well for strings of any length that contain mainly digits and English letters and that do not frequently switch between lowercase and uppercase, between digits and letters, or between Latin and non-Latin characters. It can save up to 75% of space on Unicode data and 50% on Latin data when the data is all digits, which is not uncommon. Its typical compression rate is 30% to 70% for Unicode and 20% to 40% for Latin. CAMSET generally outperforms ZLIB on short or medium length columns, whether Latin or Unicode, and normally outperforms UTF8, which applies to Unicode data only.

The set of ZLIB functions handles both Unicode and Latin strings: the compression function name for Unicode is LZCOMP and for Latin it is LZCOMP_L. It can save up to 99.9% of space in rare cases where the data consists entirely of repeated characters; its typical compression rate is 50% to 80%. It is designed to compress long strings (high hundreds, thousands or more characters) and normally cannot compress short strings. In particular, due to ZLIB's data structure overhead, strings shorter than 15 characters normally get significantly expanded. However, for short or medium length strings (tens to a few hundred characters) with many repeating characters within each string, ZLIB may compress better than the current CAMSET in TD 13.10.

Regarding time performance,

  • UTF8 has low CPU cost for both compression and decompression.
  • ZLIB has high CPU cost for compression, multiple times the compression cost of UTF8. Its decompression cost is low: somewhat higher than UTF8 for short and medium length strings, but somewhat lower than UTF8 for long strings.
  • CAMSET has low CPU cost for compression (higher than UTF8 but lower than ZLIB) on short and medium length data, but high cost (even higher than ZLIB) on long columns. CAMSET's decompression cost grows more quickly than that of UTF8 and ZLIB as the column size increases, but for short and medium length columns the difference is not very significant.

Considering the relative capabilities of the three sets of compression algorithms in terms of space and speed, we propose the following guidelines for selecting a compression algorithm for a specific column; an illustrative DDL sketch follows the list.

  • For short or medium length columns, choose CAMSET in general. 

    However, if the column is Unicode but mainly contains US-ASCII characters and there are frequent switches between digits and letters, between lowercase and uppercase letters, or between Latin and non-Latin characters, UTF8 is a better choice.

    You may choose ZLIB if the data is medium length but relatively long, with a lot of repetition within individual fields, especially for Latin columns.
  • For long columns, choose ZLIB in general.

    However, UTF8 is a better choice if the column is defined as Unicode but contains mainly US-ASCII characters, and the data is very dynamic (frequently updated).
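
As a concrete illustration of these guidelines, here is a hypothetical DDL sketch using the same COMPRESS USING / DECOMPRESS USING syntax shown earlier (the database, table and column names and sizes are invented): a short Latin code column uses CAMSET_L with its decompression counterpart DECAMSET_L, while a long, mostly static Unicode notes column uses ZLIB (LZCOMP / LZDECOMP).

CREATE TABLE mydb.orders
( order_id    INTEGER
, order_code  VARCHAR(40) CHARACTER SET LATIN
    COMPRESS USING TD_SYSFNLIB.CAMSET_L
    DECOMPRESS USING TD_SYSFNLIB.DECAMSET_L
, order_notes VARCHAR(8000) CHARACTER SET UNICODE
    COMPRESS USING TD_SYSFNLIB.LZCOMP
    DECOMPRESS USING TD_SYSFNLIB.LZDECOMP
) PRIMARY INDEX (order_id);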

If a decision cannot be made from the above guidelines, or if you want to see the compression capabilities of all the available compression algorithms on your own data, read on and try the ALC test package.

ALC Compression Rate Test

The goal of this test is to determine the compression rates that the TD 13.10 built-in compression algorithms can achieve on your specific data.

The test package contains user-defined functions that simulate the built-in compression algorithms, directly or indirectly. It also includes another UDF for gathering data demographics that help you understand the data.

The UDFs have been tested; they do not cause any issues on host systems, and no customer data will be compromised.

In the following, we guide you through the test process step by step, including how to install the UDFs and how to create test scripts for the columns of interest. Test scripts are created by copying and modifying the test script templates provided in the package.

The samples folder contains example bteq files that were modified from the templates for testing against a small test table (also used in verify_UDFs.bteq). The output files are included in the samples folder as well; they serve as references that show what the expected output may look like. The sample files include the following, among others.

c2str200f_latin_fixchar.bteq    test script for a fixed-length Latin column named c2str200f
c2str200f_latin_fixchar.out     test output for column c2str200f
c5str300v_unicode_varchar.bteq  test script for a Unicode varchar column named c5str300v
c5str300v_unicode_varchar.out   test output for column c5str300v

Ready to proceed? Download the ALC test package now.

Step 1: Install UDFs

  1. Unzip the package, go to the Linux folder, and copy the udf folder to c:/temp on the client machine from which you will run bteq.

    The tests are for Linux DB servers only. If your DB server platform is not Linux, please let us know; we may be able to create and provide the UDF object files for your platform.
  2. Modify the logon info in install_UDFs.bteq to your system name and your username/password where you will install the UDFs.
  3. Run install_UDFs.bteq from the command line in the directory where the bteq file is located:

    bteq < install_UDFs.bteq > install_UDFs.out 2>&1

    Check and make sure the output file install_UDFs.out has no errors or failures. You can ignore:

    *** Warning: 5607 Check output for possible warnings encountered in compiling and/or linking UDF/XSP/UDM/UDT.

  4. Verify that the UDFs work and see them in operation.

    This step is optional; you can skip it if you are concerned about writing data into your system.

    Modify the logon info in verify_UDFs.bteq, then run the bteq file from the command line in the directory where the bteq file is located.
    bteq < verify_UDFs.bteq > verify_UDFs.out 2>&1

    Check the output file, making sure there are no unexpected errors. Also, you can see how the UDFs have been used and what the test query results look like.

Step 2: Run UDFs and collect results

Run the compression UDFs for each column of interest and collect the results. A sketch of a manual check query using the package UDFs follows the steps below.

  1. Copy the test script template specific for your column type:
    • For a latin fixed length char column, use template_latin_fixchar.bteq.
    • For a latin varchar column, use template_latin_varchar.bteq.
    • For a unicode fixed length char column, use template_unicode_fixchar.bteq.
    • For a unicode varchar column, use template_unicode_varchar.bteq.
  2. Edit the copied file to use your DB name, table name, column name and DB system name in all statements that are commented with 'Please modify ...' inside the file.

    You may want to use a search-and-replace command instead of finding and modifying items manually, apart from the logon info at the very top, which you modify manually just once. Here are the specific items to replace:
    • Replace all 'dbx.tablex' with your_db_name.table_name.
    • Optionally, if you are running a TD 13.10 or later system, you may make the test run faster by replacing the package UDFs with the corresponding built-in functions, as listed below.

      For very long columns (> 4096 characters), please read Appendix A for more details.
      Search                                     Replace
      udf_lzcomp                                 lzcomp
      udf_lzcomp_l                               lzcomp_l
      udf_transunicodetoutf8_bytes(columnx)      bytes(transunicodetoutf8(columnx))
      udf_camset_bytes_l(columnx)                bytes(camset_l(columnx))
      udf_camset_bytes(columnx)                  bytes(camset(columnx))
    • Replace all 'columnx' with your column name.
  3. Run the appropriate bteq file from the command line in the directory where the bteq file is located. That is, run one of the following commands:
    • bteq < columnx_unicode_varchar.bteq > columnx_unicode_varchar.out 2>&1
    • bteq < columnx_unicode_fixchar.bteq > columnx_unicode_fixchar.out 2>&1
    • bteq < columnx_latin_fixchar.bteq > columnx_latin_fixchar.out 2>&1
    • bteq < columnx_latin_varchar.bteq > columnx_latin_varchar.out 2>&1
  4. Check the output file. Make sure there are no unexpected failures or errors.
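
If you just want a quick manual check on one column before (or instead of) running a full script, a query of roughly the following form compares the original column size with the sizes produced by two of the package UDFs. This is only a sketch, not the exact content of the test scripts; it assumes a Unicode varchar column (hence CHARACTER_LENGTH * 2 for the uncompressed byte count), that udf_camset_bytes returns the compressed size in bytes (as the replacement table above suggests), and that udf_lzcomp, like the built-in lzcomp it mirrors, returns the compressed VARBYTE value. dbx.tablex and columnx are the same placeholders used in the templates.

SELECT SUM(CHARACTER_LENGTH(columnx) * 2)  AS original_bytes  /* Unicode stores 2 bytes per character */
     , SUM(udf_camset_bytes(columnx))      AS camset_bytes    /* CAMSET-compressed size */
     , SUM(BYTES(udf_lzcomp(columnx)))     AS zlib_bytes      /* ZLIB-compressed size */
FROM dbx.tablex;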

Step 3: Select algorithm

Select compression algorithms based on results in the output files (hopefully straightforward to interpret) and the relative CPU costs of different compression / decompression algorithms mentioned in the guidelines. If you have questions on how to interpret your results, please post a question in the Extensibility forum.

Step 4: Uninstall UDFs

After you are done testing all columns of interest, you may run uninstall_UDFs.bteq to remove the installed UDFs in the correct order.

bteq < uninstall_UDFs.bteq > uninstall_UDFs.out 2>&1

Appendix A: ZLIB for long strings

Recreate built-in ZLIB functions in order to handle extremely long strings.

The TD built-in ZLIB functions lzcomp, lzdecomp, lzcomp_l and lzdecomp_l are created via DIP scripts to handle strings up to 4096 characters. To test extremely long string columns, recreate them with sizes up to the row limit, as shown below. Note that for the Unicode functions lzcomp and lzdecomp, the input/output VARBYTE size must be twice the output/input VARCHAR size, since each Unicode character occupies two bytes.

DROP SPECIFIC FUNCTION TD_SYSFNLIB.lzcomp;

CREATE FUNCTION TD_SYSFNLIB.lzcomp
( UnicodeStr VARCHAR(32000) CHARACTER SET UNICODE)
RETURNS VARBYTE(64000)
NO SQL
PARAMETER STYLE TD_INTERNAL
SPECIFIC TD_SYSFNLIB.lzcomp
RETURNS NULL ON NULL INPUT
DETERMINISTIC FOR COMPRESS
LANGUAGE C
EXTERNAL LOCAL;

DROP SPECIFIC FUNCTION TD_SYSFNLIB.lzcomp_l;

CREATE FUNCTION TD_SYSFNLIB.lzcomp_L
( LatinStr VARCHAR(64000) CHARACTER SET LATIN)
RETURNS VARBYTE(64000)
NO SQL
PARAMETER STYLE TD_INTERNAL
SPECIFIC TD_SYSFNLIB.lzcomp_l
RETURNS NULL ON NULL INPUT
DETERMINISTIC FOR COMPRESS
LANGUAGE C
EXTERNAL LOCAL;

DROP SPECIFIC FUNCTION TD_SYSFNLIB.lzdecomp;

CREATE FUNCTION TD_SYSFNLIB.lzdecomp
( CompressedStr VARBYTE(64000) )
RETURNS VARCHAR(32000) CHARACTER SET UNICODE
NO SQL
PARAMETER STYLE TD_INTERNAL
SPECIFIC TD_SYSFNLIB.lzdecomp
RETURNS NULL ON NULL INPUT
DETERMINISTIC FOR DECOMPRESS
LANGUAGE C
EXTERNAL LOCAL;

DROP SPECIFIC FUNCTION TD_SYSFNLIB.lzdecomp_l;

CREATE FUNCTION TD_SYSFNLIB.lzdecomp_L
( CompressedStr VARBYTE(64000) )
RETURNS VARCHAR(64000) CHARACTER SET LATIN
NO SQL
PARAMETER STYLE TD_INTERNAL
SPECIFIC TD_SYSFNLIB.lzdecomp_l
RETURNS NULL ON NULL INPUT
DETERMINISTIC FOR DECOMPRESS
LANGUAGE C
EXTERNAL LOCAL;

Finally

If you have feedback on this package, please feel free to leave comments below. Or, for community support, please post a question in the Extensibility forum.

13 REPLIES
Senior Supporter

Re: Selecting an ALC compression algorithm

Minor comment on the "verify_UDFs.bteq" script: it typically won't run as-is because it contains a hard-coded database name in four places ("gw1"). Having removed that, it all ran as described. This is a nice piece of work for getting started with ALC routines.
Ward Analytics Ltd - information in motion
www: http://www.ward-analytics.com
Teradata Employee

Re: Selecting an ALC compression algorithm

The camset_l UDF doesn't seem to be available in the TD 13.10 VM image; I can't find it in the TD_SYSFNLIB database. Can you post the compress and decompress functions please?
Teradata Employee

Re: Selecting an ALC compression algorithm

The camset functions were not available until 13.10.00.06. Please use 13.10.00.06 or later releases if you are interested in trying them. Thanks.

Re: Selecting an ALC compression algorithm

I use the latest 40 GB VMware Player Edition with the following version (checked with SELECT * FROM dbc.dbcinfo):

InfoKey                  InfoData
RELEASE                  13.10.00.10
VERSION                  13.10.00.14
LANGUAGE SUPPORT MODE    Standard

The Compress/Decompress UDF 'CAMSET_L' is still not found in SYSUDTLIB/TD_SYSFNLIB.
The other standard Compress/Decompress functions are there in database TD_SYSFNLIB, but not camset and camset_l.
Kind regards,
Herbert

Re: Selecting an ALC compression algorithm

I tried and tested the latest Teradata VMware Player 1 TB Express Edition. The database TD_SYSFNLIB contains all the compress and decompress functions, including CAMSET_L and DECAMSET_L. RELEASE = 13.10.00.10, VERSION = 13.10.00.14.

Re: Selecting an ALC compression algorithm

Hi Jenny,
Can you explain the difference between ALC and MVC, and when we should choose ALC versus MVC? We currently use MVC more. I checked the 13.10 documentation, and it says "When column values are mostly unique, ALC may provide better compression results than MVC". Is that the only rule for deciding between ALC and MVC, or is there a more detailed comparison of the two?

Gus
Enthusiast

Re: Selecting an ALC compression algorithm

We have a basic (test) ALC UDF (actually it is just a dummy), and it works.

But we are not able to return -1 to indicate that we cannot compress the value, because the INSERT fails with:

INSERT Failed. 7509: Result Exceeded maximum length for UDF/XSP/UDM SYSUDTLIB.alc1_num.

void alc1_num( VARCHAR_LATIN *inStr,
               VARBYTE       *result,
               char           sqlstate[6])
{
    result->length = -1;

    return;
}

ALC UDFs are not really documented; there is no sample, not even for a dummy function.
We tried many forms (setting sqlstate as well), but no luck.

The compression path is working, so we can store and retrieve a compressed value. Only this special case, where we want to say "it is not worth compressing, so let's return a length < 0", is not working.

Can you help with this?
Teradata Employee

Re: Selecting an ALC compression algorithm

Please change your UDF not to return a negative length; leave it to the ALC infrastructure to determine whether it is worth compressing or not.
Enthusiast

Re: Selecting an ALC compression algorithm

Based on the Teradata doc named "SQL External Routine Programming":
"When the input cannot be compressed, the UDF can indicate it is unable to compress the
column data by setting the length field of the output VARBYTE argument to less than 0."
So, what value should I put there in case the compressed bytes are longer than the input?
The actual value? And if it is longer, then it will not compress it?
Is that what you suggest?
Thank you very much for your answer!