Space saved using Compression is exceptionally different from calculated result..

Database
Enthusiast

Space saved using Compression is exceptionally different from calculated result..

For a test conducted, the amount of space saved on using compression seems to be more than what it should be.

The expected space saving was 7500 bytes while the actual space saved is reported to be 20,480.00 bytes.

Can someone throw light on the unexpected result and tell me what the formula for calculating space saving is?

The details of the test are given below.

/////////////////////////////////////////////////////////////////////////
////////////////////////TEST RESULT//////////////////////////////////////
/////////////////////////////////////////////////////////////////////////

BTEQ 08.02.03.03 Thu Jun 21 16:36:29 2007

+---------+---------+---------+---------+---------+---------+---------+----

.LOGON demotdat/dbc,

*** Logon successfully completed.
*** Teradata Database Release is V2R.06.01.01.17
*** Teradata Database Version is 06.01.01.17
*** Transaction Semantics are BTET.
*** Character Set Name is 'ASCII'.

*** Total elapsed time was 1 second.

+---------+---------+---------+---------+---------+---------+---------+----

SELECT * FROM DBC.DBCINFO;

*** Query completed. 2 rows found. 2 columns returned.
*** Total elapsed time was 1 second.

InfoKey InfoData
------------------------------ --------------------------------------------
RELEASE V2R.06.01.01.17
VERSION 06.01.01.17

+---------+---------+---------+---------+---------+---------+---------+----

SELECT CURRENT_TIMESTAMP;

*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.

Current TimeStamp(6)
--------------------------------
2007-06-21 16:36:29.610000+00:00

+---------+---------+---------+---------+---------+---------+---------+----

SELECT HASHAMP()+1 AS "Number of AMPS";

*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.

Number of AMPS
--------------
2

+---------+---------+---------+---------+---------+---------+---------+----

DATABASE E_COMPANYDB;

*** New default database accepted.
*** Total elapsed time was 1 second.

+---------+---------+---------+---------+---------+---------+---------+----

/*----------------------DDLS----------------------------------------

Observe that the DDL's are similar and the CHAR(1) column is compressed
in one of the tables.

*/

SHOW TABLE TABWITHCOMPRESS;

*** Text of DDL statement returned.
*** Total elapsed time was 1 second.

---------------------------------------------------------------------------
CREATE SET TABLE E_COMPANYDB.TABWITHCOMPRESS ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
c1 INTEGER,
c2 CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('F','T'))
PRIMARY INDEX ( c1 );

+---------+---------+---------+---------+---------+---------+---------+----

SHOW TABLE TABWITHOUTCOMPRESS;

*** Text of DDL statement returned.
*** Total elapsed time was 1 second.

---------------------------------------------------------------------------
CREATE SET TABLE E_COMPANYDB.TABWITHOUTCOMPRESS ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
c1 INTEGER,
c2 CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( c1 );

+---------+---------+---------+---------+---------+---------+---------+----

/*----------------------Tables have identical rows------------------*/

SELECT count(*) FROM TABWITHCOMPRESS;

*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.

Count(*)
-----------
10000

+---------+---------+---------+---------+---------+---------+---------+----

SELECT count(*) FROM TABWITHOUTCOMPRESS;

*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.

Count(*)
-----------
10000

+---------+---------+---------+---------+---------+---------+---------+----

SELECT * FROM TABWITHCOMPRESS
MINUS
SELECT * FROM TABWITHOUTCOMPRESS;

*** Query completed. No rows found.
*** Total elapsed time was 1 second.

+---------+---------+---------+---------+---------+---------+---------+----

SELECT * FROM TABWITHOUTCOMPRESS
MINUS
SELECT * FROM TABWITHCOMPRESS;

*** Query completed. No rows found.
*** Total elapsed time was 1 second.

+---------+---------+---------+---------+---------+---------+---------+----

/*----------------------Data in compressed column--------------------*/
SELECT c2, count(*)
FROM TABWITHCOMPRESS
GROUP BY 1;

*** Query completed. 2 rows found. 2 columns returned.
*** Total elapsed time was 1 second.

c2 Count(*)
-- -----------
T 4999
F 5001

+---------+---------+---------+---------+---------+---------+---------+----

/*----------------------SPACE SAVING----------------------------------*/

/*
1. Observe that the space saved in the table with compression is 20,480.00 bytes.
2. This is bizzare as the compression was applied on a CHAR(1) column holding
values 'T' and F'. The table has 10,000 records. Why is the max space saved
over 10,000 bytes (or 7500 bytes to be precise)?

*/

SELECT tablename, sum(currentperm)
FROM dbc.tablesize
WHERE databasename='e_companydb'
AND tablename IN ('TABWITHCOMPRESS', 'TABWITHOUTCOMPRESS')
GROUP BY 1
ORDER BY 1;

*** Query completed. 2 rows found. 2 columns returned.
*** Total elapsed time was 1 second.

TableName Sum(CurrentPerm)
------------------------------ -------------------
TabWithCompress 181,248
TabWithoutCompress 201,728

+---------+---------+---------+---------+---------+---------+---------+----

.LOGOFF;
*** You are now logged off from the DBC.
+---------+---------+---------+---------+---------+---------+---------+----

11 REPLIES
Enthusiast

Re: Space saved using Compression is exceptionally different from calculated result..

well if you go by books then you are supposed to get (8-2)bits/8bits * 10000 records = 7,500 bytes free.

However, the disk space reported by the DD is not the "actual" space occupied by the rows in the blocks, but the size of the blocks allocated for the table itself. And there are obviously lots of free space in a block which does not belong to any row.

When you compress, you "shrink" your row, which means more rows fit into a given data block, and also means you waste less space with in the data block due to fragmentation. And that's an extra bonus space saving for compression.

There's a simple test you can do to verify this ...

Create an empty table, check the current perm for the table for each vproc.

insert something like 5 records (and take space readings after each insert).

You would see that while some of the inserts resulted in an increase of space, some didn't. The ones that didn't result in an increase of space were the ones which went into the free space that's in the already allocated data blocks for the table.
Enthusiast

Re: Space saved using Compression is exceptionally different from calculated result..

hmmm..interesting. Thanks.
Junior Contributor

Re: Space saved using Compression is exceptionally different from calculated result..

If there are not enough compress bits available then a new compress bytes is added:
So compressing a CHAR(1) will save you 0 or 1 bytes per compressed row, but not 0.75 ;-)

In your example COMPRESS is not increasing the number of compress bytes,
therefore the rowsize decreases by one byte
-> so it will save you exactly 10000 bytes.

Rows are allocated on word-aligned (starting on even byte number) within a datablock
-> Before compression the recordsize was an odd number (so one byte was added to make it even),
after compression it's an even number (no extra byte needed)
-> This saves another 10000 bytes

And the remaining 480 bytes are because of blocks increasing in sectors of 512 bytes...

Dieter
Enthusiast

Re: Space saved using Compression is exceptionally different from calculated result..

Hmm.... lol that explains why my maths teacher was so mad at me ;-)

Hi Dieter, You are right we always have allocation in terms of bytes, so the existing presence bit array byte would suffice for the bits required for compression. and there's a saving on even number of bytes versus odd number. (now I need to find my misplaced row size calculation spreadsheet :-) )

And since the tables would have been freshly loaded (and for table of this small row size), fragmentation must not be a factor to account for.
Enthusiast

Re: Space saved using Compression is exceptionally different from calculated result..

Suppose out of 20 columns in my table ,10 columns have comression.I would like to understand th following.
(1)How many compression bytes are needed per row.
(2)How does Vproc identify which is the compressed value for which column.

Can someone clearly explain the internals of how compression is done, where this information is stored?

Also, let me know the overhead (if any) of using compression on char(1) column? What if we do not mention compression on a particular column, does teradata still allocate the 1 byte for presence bits?

ouch!!! I think I have askd so many question, however, I would like to understand the full concept of compression. If someone could answer these questions it would be great.But,please do not re-direct me to Manuals, As I have read them earlier.
Enthusiast

Re: Space saved using Compression is exceptionally different from calculated result..

here is a pdf for v2R5 multi-value compression
http://www.mycustomer.com/download/4228/teradata-111.pdf
It might help answer some questions.

I am not sure if there is a new doc for the newer version of Teradata.

Enthusiast

Re: Space saved using Compression is exceptionally different from calculated result..

Thanks for the document Feroz.

I would still like to understand the relationship between Presence bits, their count and how Teradata correlates the Presence bits with compressed columns .
Junior Contributor

Re: Space saved using Compression is exceptionally different from calculated result..

Hi Leo,

"Suppose out of 20 columns in my table ,10 columns have comression.I would like to understand th following.
(1)How many compression bytes are needed per row."

At least two, but it depends on the number of compressed values per row. And don't forget the presence bit for each NULLable column.

"(2)How does Vproc identify which is the compressed value for which column."

It simply stores an array of n values within the table header and extracts the n-th value.

CHAR(3) COMPRESS ('a', 'bb', 'ccc')
-> X = array of char(3) in table header -> X[1] = 'a', X[2] = 'bb', X[3] = 'ccc'
-> two bits are needed to store 4 different values: 0,1,2,3
-> 0 means column is not compressed and the actual value is stored within row,
1 to 3 indicate the corresponding values from the array

"Also, let me know the overhead (if any) of using compression on char(1) column?"

In worst case this will not save anything, but increase table size, because there was no compress *bit* available and therefore a new compress *byte* was added to each row.

"But,please do not re-direct me to Manuals, As I have read them earlier."

Maybe you read the wrong chapter ;-)

Database Design
Chapter 15: Database-Level Capacity Planning Considerations
How the Row Header Indicates Column Nullability and Compressibility

Dieter
Enthusiast

Re: Space saved using Compression is exceptionally different from calculated result..

Thanks Dnoeth :-)