Column compress values from statistics

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Enthusiast
Besides collecting statistics on your columns on your Teradata database, the compressing of the data to save disk space is a very important maintaining task. So why not connect these two tasks? The idea is to extract the values for the multi value compression of the columns out of the collected statistics.

 



The idea

 

Starting with Teradata V14 the "SHOW STATISTICS VALUES COLUMN col ON db.tab; " prints out as a text (optionally as XML) the results of the last collection of statistics in detail. The output in text form is exactly the command to insert the results of the collection back into the database. The command prints a lot of lines. The following are interesting for the algorithm:

 

...

 /* NumOfNulls            */ 20,

...

 /* NumOfRows             */ 3180,

...

 /** Biased: Value, Frequency **/

 /*   1 */   'N', 3147,

 /*   2 */   'Y', 13

...

 

Specially the biased values block show the values of the column, which are very often in the data. And these values can be taken for compressing of the column.

 

The column for compression has to have the following requirements:
  • Statistic has to be representative and actual, but could be sampled
  • Column is not allowed to be part of index or partition
  • The statistics values must have the correct length
  • It is not allowed to have statistics on the column during the alter table statement

In Teradata 14 all statistics values are limited to 26 characters. To get the not trimmed values you have to use the "USING MAXVALUELENGTH" clause during the collect statistics command.

The other fact disturb the algorithm more: You cannot change a column when there is an statistic on it.

 

The advantages are:

  • No costs for getting the values for compression
  • Good compression results with easy algorithm

This easy solution for fitting on one page has some disadvantages:

  • Procedure doesn't take care of previous values list
  • Algorithm doesn't take care of multi columns collect statistics

The algorithm

First we execute for each column with statistics of the table to compress the "SHOW STATISTICS VALUES COLUMN". From this output we take the numbers of null and the values of the biased values block. From the number of occurences we decide which values come into the multi value compress list. At the moment each value has to have an estimation of more than 1% in the data. With this limit it could not happen that we have more than 100 compress values. In parallel we create a "DROP STATISTICS" and the "COLLECT STATISTICS COLUMN ... ON ... VALUES (...);" to put the statistics back. With this three files we first drop the statistics, perform the alter table statement and after that put the statistics back.

The process

The algorithm consists of a sql file and an awk script. The sql file gets the "SHOW STATISTICS VALUES COLUMN" for the columns for the tables in an useful ordering:

SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col2 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col3 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab2;
SHOW STATISTICS VALUES COLUMN col2 on dbtest.tab2;

These commands have to be executed by bteq and stored in one file. The awk script takes this file and produces a larger file:

DROP STATISTICS column col1 on dbtest.tab1;
DROP STATISTICS column col2 on dbtest.tab1;
DROP STATISTICS column col3 on dbtest.tab1;
ALTER TABLE dbtest.tab1 add col1 compress ( ...)
, add col2 compress ( ...)
, add col3 compress ( ...)
;
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab1 VALUES (...);
COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab1 VALUES (...);
COLLECT STATISTICS COLUMN ( col3 ) ON dbtest.tab1 VALUES (...);

DROP STATISTICS column col1 on dbtest.tab2;
DROP STATISTICS column col2 on dbtest.tab2;
ALTER TABLE dbtest.tab2 add col1 compress ( ...)
, add col2 compress ( ...)
;
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab2 VALUES (...);
COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab2 VALUES (...);

Executing these statements perform the compression. Finished.

The source code

SQL File

SELECT
         'SHOW STATISTICS VALUES COLUMN '||(trim (both from a.columnname))||' on '||(trim(both from a.databasename))||'.'||(trim(both from a.tablename))||';' as stmt
FROM
        dbc.ColumnStatsV a
INNER JOIN
        dbc.columns b
ON
a.databasename=b.databasename
        AND
a.tablename=b.tablename
        AND
a.columnname=b.columnname
LEFT OUTER JOIN
        dbc.PartitioningConstraintsV c
ON
a.databasename=c.databasename
        AND
a.tablename=c.tablename
        AND
upper(c.constrainttext) LIKE '%'||(upper(a.columnname))||'%'

WHERE
        c.constrainttext is null
        AND
a.indexnumber is null
        AND
a.databasename='${DB}'
        AND
(a.databasename,a.tablename,a.columnname) not in (select databasename,tablename,columnname from dbc.indices)
        AND
(a.databasename,a.tablename) in (select databasename,tablename from dbc.tables where tablekind='T')
order by a.databasename,a.tablename,a.columnname;

AWK File

BEGIN   { CUTPERCENTAGE=1;
          print ".errorlevel (3582) severity 0";
          print ".errorlevel (6956) severity 0";
          print ".errorlevel (5625) severity 0";
          print ".errorlevel (3933) severity 0";
        }
/            COLUMN \(/ { COL=$3; }
/                ON / { DBTAB=$2; }
/^ \/\*\* / { BIASEDON=0; }
/Data Type and Length/ { DATATYPE=substr($6,2,2); }
/NumOfRows/ { CUTROWS=$4*CUTPERCENTAGE/100;
              BIASED=="";
              if (0+CUTROWS<0+NULLROWS) BIASED="NULL,";
            }
/\/\* NumOfNulls/ { NULLROWS=$4; }
/^ \/\*\* Biased:/ { if (DATATYPE!="TS" && DATATYPE!="AT" && DATATYPE!="DS") BIASEDON=1;}
/^ \/\* / { if (BIASEDON==1)
                {
                if (CUTROWS < 0+gensub(".*,","","",gensub(",? ?$","","g")))
                        {
                        BIAS=gensub("^ */[^/]*/","","g",gensub(",[0-9 ]*,? ?$","","g"));
                        if (index (BIASED,BIAS)==0)
                                BIASED=BIASED BIAS ",";
                        }
                }
        }
/^COLLECT STATISTICS/   { COLSTATON=1; }
        {       if (COLSTATON==1) COLSTAT=COLSTAT "\n" $0; }

/^);/   { BIASEDON=0;
        COLSTATON=0;
        if (BIASED=="")
                {
                COLSTAT="";
                next;
                }
        if (DBTAB!=DBTABOLD)
                {
                if (DBTABOLD!="")
                        {
                        print DROPSTATS;
                        print ALTERTABLE ";";
                        print COLSTATALL;
                        COLSTATALL="";
                        }
                ALTERTABLE="ALTER TABLE " DBTAB " ADD " COL " COMPRESS (" gensub(",$","","",BIASED) ")";
                DROPSTATS="DROP STATISTICS COLUMN " COL " ON " DBTAB ";";
                DBTABOLD=DBTAB;
                }
        else
                {
                ALTERTABLE=ALTERTABLE "\n" "        ,ADD " COL " COMPRESS (" gensub(",$","","",BIASED) ")";
                DROPSTATS=DROPSTATS "\n" "DROP STATISTICS COLUMN " COL " ON " DBTAB ";";
                }
        COLSTATALL=COLSTATALL "\n" COLSTAT;
        COLSTAT="";
        BIASED="";
        }

END     {
        print ";";
        }



First Results and Motivation

As a teradata customer we run a Appliance instance with about 10 TB of user data. In a few hours running these scripts we decreased our space by 20%.

Unfortunately this is the only instance I can test the scripts at the moment, so further improvements and remarks are very welcomed.

Last, but not least, thanks to Dieter Nöth (dnoeth) for the tipps.

13 Comments
Enthusiast

I didn't understand : "It is not allowed to have statistics on the column". Can you elaborate?

Enthusiast

Thanks for the comment, on this position blog was not clear. What I want to say is that during the "alter table ... compress" statement it is not allowed to have statistics on the column to compress. otherwise you get "ALTER TABLE Failed. 6956:  Column .. has statistics and cannot be dropped or modified."

BR Roland

Enthusiast

Thank you. That explains it. Nice article btw.

-Suhail

Enthusiast

It is a nice article.

Can I say that the compression process explained above is same as the process that was followed before TD14.10 except that the input required for compression process is now readily available(by using show stats...) and do not have to profile the column separately? Anything else is different?

Enthusiast

Yes, you can reuse your algorithm for compression of columns. Only the typical time consuming "select col,count(*) from ..." at the beginning can be replaced by a millisecond operation when you have statistics.

Enthusiast

Thanks.  Is there a place or dbc tables that holds ouput of "SHOW STATISTICS VALUES COLUMN col ON db.tab;"?  If yes, then it helps me build data profiling reports.

Enthusiast

Hi  ,

I m bit unclear abt dropping the statistics after getting the OP of Show statistics .

Once we get the compression list  through Show statistics , can we directly go with alter statement?

also do we have to use values clause in  COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab2 VALUES (...); ?

VALUES will be compression list ?

Can you pls elaborate?  Aplogies for the confusion.

Thanks in advance!

Cheers!

Nishant

Enthusiast

Hi all,

The first "Unfortunately": Unfortunately the data for the "show statistics values" command is stored in a blob called FieldStatistics in binary form to be easily read by the optimizer.

The second "Unfortunately": Unfortunately you cannot at the moment perform an "alter table" statement with having statistics on the column. You get the already mentioned error code 6956.

But I think both "Unfortunately" are not so bad, because with the "show statistics values" you get the content of the statistics in readable form and even more the text output is already the command to insert the statistics back into the database after you dropped the statistics.

To be more clear the "COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab2 VALUES (...);" command in my example is exactly the output of the "SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab2;" I executed before.

I hope I could clear this up.

Thanks BR Roland

Enthusiast

Yes, got it Roland.. Thanks for the great explanation..

these steps really combine the Compression and the stats tasks together.

cheers!

Nishant

Enthusiast

Hi Roland,

I was following the above approach during finding the candidates for MVC.

Just noticed that the nulls were not coming up as  in the biased values, frequency.

Can you pls  suggest if I am missing some where?

Thanks in advance!

Cheers!

Nishant

Enthusiast

Hi Nishant,

The number of nulls is never in the Biased values. In the output there is always an extra line with "NumOfNulls" (AWK Script Line 15). If the percentage of Nulls is relevant, this is checked in line 13. Null compression is special, because it is automatically added with the first compress value.

BR

Roland

Enthusiast

Thanks roland for the nice info.

Just wanted to confirm one thing that once we compress the first compress value, the nulls are also covered through that

like

statement 1:

ALTER TABLE <TAB_NM> ADD < co_nm>  COMPRESS( 'TEST', NULL);

statement 2:

ALTER TABLE <TAB_NM> ADD < co_nm>  COMPRESS( 'TEST' );

As Nulls are automatically compressed then we can use the statement 2 itself, there is no need to execute the statement 1..?

Please correct me if i am missing some where.

Cheers!

Nishant

Enthusiast

Hi Nishant,

You are completely right. Only for the case that it would be good to compress only nulls and you want to write COMPRESS(NULL) the finding of the percentage of nulls makes sense. At the moment all occurences of more than 1% are compressed. It makes maybe sense to substract the number of nulls from the total number of rows and after that to make the border of 1%. For example in one column you have 90% nulls and 0.5% of another value. Maybe when you don't care of the numbers of nulls. it makes maybe sense to compress also the 0.5%.

Best regards, Roland