How Compression Works

Database
Enthusiast

How Compression Works

Hi All,

I have a table with  billion records. I have quite a few char(01) columns in this table which has repeating values like 'Y' , 'N'. I have created a similar table  but with the compression added on all these char(01) columns. I found out the table with compressing all these columns increases the original space.

So my question is : is not recommnded to compress the char(01) columns. If so can you please explain why ?

Thanks in advance

Jana

8 REPLIES
Enthusiast

Re: How Compression Works

Hi,

When you define compression on some columns in TD, Teradata stores the compressed values in the table header. So if you are compressing a large number of columns, the header size will become large. Thats why you might be getting the the table size larger than actual.

Khurram
Teradata Employee

Re: How Compression Works

Hi,

You can get the size of the table header by creating an empty table.  Depends on the number of AMPs.

Apart from the table header, another important reason is the introduction of additional presence bits - in each row of the table.   The more compressed values you enlist, the more presence bits you need to compress those values.

- For those rows where values are compressed, we have to store only the presence bit(s), not the actual values, so the storage space is greatly reduced.   Bits are packed into bytes.   Works like a charm.

- However, if many rows have values that are not compressed, for those rows we store both presence bits and actual value (if not compressed).   This introduces a possibility of a size increase, but only when the compressed values are defined incorrectly.    So, either no compression takes place for many rows, or we spend too many presense bits for values we don't really need.

So, for each column you should verify whether a compressed value is really encountered in a large number of rows.    If a value is quite rare - encountered only in a small number of rows, it might be cheaper to omit it in the compression list.    

Char(1) are usually good candidates for compression.

Regards,

Vlad.

Enthusiast

Re: How Compression Works

Hi all,

Just want to confirm that as the number of values lead to addtion of presence bits which in turn increases to size of table header. 

Do we need to go for odd number of values or the even number of values per column for the good

space savings?

Regards,

Nishant

Senior Apprentice

Re: How Compression Works

Hi Nishant,

it's odd, (2**n) -1 :-)

BITS     VALUES
1 -> 1
2 -> 2 - 3
3 -> 4 - 7
4 -> 8 - 15
5 -> 16 - 31
6 -> 32 - 63
7 -> 64 - 127
8 -> 128 - 255
Enthusiast

Re: How Compression Works

Ok thanks dieter.

Also just wanted to know ,how much savings we get by compressing NULLS or zero length string (i.e '' ) ?

I believe NULLS or '' also  take   1 byte.

Please correct me if I am wrong?

Regards,

Nishant

Fan

Re: How Compression Works

Dieter, Nishant,

just to add the formula above differs for nullable and not nullable columns. Don't forget you need to save that extra state for "NULL" which is compressed implicitely for nullable columns. To describe further, I will work with the word STATE:

1) NULLABLE - NULL, (not compressed), ValueA, ValueB - 2 items on compression list, 4 states to be coded -> 2 presence bits required

2) NOT NULLABLE - (not compressed), ValueA, ValueB, ValueC - 3 items on compression list, 4 states to be coded -> 2 presence bits required

N presence bits can be used to code up 2^N states. Therefore you need n=[log(2)(k+2)] bits to code (k) nullable columns and n=[log(2)(k+1)] for not nullable...

Long story short - the ideal number of items on compression list is:

1) NULLABLE - (0),2,4,8,16... (0 means only compress)

2) NOT NULLABLE - 1,3,7,15...

If you need to go deeper, feel free to ask :)

Vlcik

Senior Apprentice

Re: How Compression Works

Hi Vlcik,

nope, the NULL is already included :-)

See Number of Presence Bits Required to Represent Compressed Values in the Database Design manual:

http://www.info.teradata.com/HTMLPubs/DB_TTU_15_10/Database_Management/B035_1094_151K/Database_Level...

Fan

Re: How Compression Works

Dieter,

thanks for the link. I see the information in the documentation contradictory though. As stated in one of the tables, it says:

col_1 CHAR(1) NOT NULL COMPRESS (‘A’) - requires 1 presence bit

col_1 CHAR(1) COMPRESS (‘A’) - requires 2 presence bits

Therefore NULL is not included, in fact nullability requires 1 extra state to be "coded" using presence bits and thus requires the extra presence bit in the case above. The only explanation would be that Teradata stores the information about nullability in a extra separate (presence?) bit, which would be very uneffective. The number of states to be represented by presence bits obviously differs - since compressing null values with COMPRESS is implicit and can't be skipped.

If I'm in fact wrong, can't you tell me why there is difference in the case above?