Besides collecting statistics on your columns on your Teradata database, the compressing of the data to save disk space is a very important maintaining task. So why not connect these two tasks? The idea is to extract the values for the multi value compression of the columns out of the collected statistics.
Starting with Teradata V14 the "SHOW STATISTICS VALUES COLUMN col ON db.tab; " prints out as a text (optionally as XML) the results of the last collection of statistics in detail. The output in text form is exactly the command to insert the results of the collection back into the database. The command prints a lot of lines. The following are interesting for the algorithm:
/* NumOfNulls */ 20,
/* NumOfRows */ 3180,
/** Biased: Value, Frequency **/
/* 1 */ 'N', 3147,
/* 2 */ 'Y', 13
Specially the biased values block show the values of the column, which are very often in the data. And these values can be taken for compressing of the column.
The column for compression has to have the following requirements:
Statistic has to be representative and actual, but could be sampled
Column is not allowed to be part of index or partition
The statistics values must have the correct length
It is not allowed to have statistics on the column during the alter table statement
In Teradata 14 all statistics values are limited to 26 characters. To get the not trimmed values you have to use the "USING MAXVALUELENGTH" clause during the collect statistics command.
The other fact disturb the algorithm more: You cannot change a column when there is an statistic on it.
The advantages are:
First Results and Motivation
As a teradata customer we run a Appliance instance with about 10 TB of user data. In a few hours running these scripts we decreased our space by 20%.
Unfortunately this is the only instance I can test the scripts at the moment, so further improvements and remarks are very welcomed.
Last, but not least, thanks to Dieter Nöth (dnoeth) for the tipps.