I have a table with billion records. I have quite a few char(01) columns in this table which has repeating values like 'Y' , 'N'. I have created a similar table but with the compression added on all these char(01) columns. I found out the table with compressing all these columns increases the original space.
So my question is : is not recommnded to compress the char(01) columns. If so can you please explain why ?
Thanks in advance
When you define compression on some columns in TD, Teradata stores the compressed values in the table header. So if you are compressing a large number of columns, the header size will become large. Thats why you might be getting the the table size larger than actual.
You can get the size of the table header by creating an empty table. Depends on the number of AMPs.
Apart from the table header, another important reason is the introduction of additional presence bits - in each row of the table. The more compressed values you enlist, the more presence bits you need to compress those values.
- For those rows where values are compressed, we have to store only the presence bit(s), not the actual values, so the storage space is greatly reduced. Bits are packed into bytes. Works like a charm.
- However, if many rows have values that are not compressed, for those rows we store both presence bits and actual value (if not compressed). This introduces a possibility of a size increase, but only when the compressed values are defined incorrectly. So, either no compression takes place for many rows, or we spend too many presense bits for values we don't really need.
So, for each column you should verify whether a compressed value is really encountered in a large number of rows. If a value is quite rare - encountered only in a small number of rows, it might be cheaper to omit it in the compression list.
Char(1) are usually good candidates for compression.
Just want to confirm that as the number of values lead to addtion of presence bits which in turn increases to size of table header.
Do we need to go for odd number of values or the even number of values per column for the good
it's odd, (2**n) -1 :-)
1 -> 1
2 -> 2 - 3
3 -> 4 - 7
4 -> 8 - 15
5 -> 16 - 31
6 -> 32 - 63
7 -> 64 - 127
8 -> 128 - 255
Ok thanks dieter.
Also just wanted to know ,how much savings we get by compressing NULLS or zero length string (i.e '' ) ?
I believe NULLS or '' also take 1 byte.
Please correct me if I am wrong?
nope, the NULL is already included :-)
See Number of Presence Bits Required to Represent Compressed Values in the Database Design manual:
thanks for the link. I see the information in the documentation contradictory though. As stated in one of the tables, it says:
col_1 CHAR(1) NOT NULL COMPRESS (‘A’) - requires 1 presence bit
col_1 CHAR(1) COMPRESS (‘A’) - requires 2 presence bits
Therefore NULL is not included, in fact nullability requires 1 extra state to be "coded" using presence bits and thus requires the extra presence bit in the case above. The only explanation would be that Teradata stores the information about nullability in a extra separate (presence?) bit, which would be very uneffective. The number of states to be represented by presence bits obviously differs - since compressing null values with COMPRESS is implicit and can't be skipped.
If I'm in fact wrong, can't you tell me why there is difference in the case above?