Everyone is aware of Teradata's continued commitment to reducing the footprint and resource usage of data through its compression capabilities, and of the resulting benefits: permanent space savings and better performance from smaller block sizes and reduced I/O and spool usage.
While this new functionality is great, I would like to suggest that you not ignore a basic best-practice step when implementing your data warehouse model: before you compress, "right size" your data first.
Compression works by eliminating repetitive values within a column across rows and replacing those values with indicator bit(s) in the row header. The more frequently a value occurs, the higher the rate of compression. This works especially well for columns that contain only a few distinct values. If, however, your column contains many distinct values, compression will not do as much for you.
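As a rough illustration of why distinct-value counts matter, consider a toy model (this is not Teradata's actual multi-value compression algorithm, and the list size and byte figures are assumptions for the sketch): only the most common values fit on a bounded compress list, and rows holding one of those values store just header bits instead of the full value.

```python
from collections import Counter

def estimated_compression_savings(values, value_bytes, compress_list_size=255):
    """Toy estimate: rows whose value makes the compress list store ~0 data
    bytes for that column (only header bits). Teradata's real mechanism
    differs in detail; this just shows why few distinct values compress well."""
    counts = Counter(values)
    # The most frequent values go on the (bounded) compress list.
    compressible_rows = sum(n for _, n in counts.most_common(compress_list_size))
    return compressible_rows * value_bytes

# Same row count, very different payoff:
few_distinct = ["A"] * 900 + ["B"] * 100        # 1000 rows, 2 distinct values
many_distinct = [str(i) for i in range(1000)]   # 1000 rows, all distinct

print(estimated_compression_savings(few_distinct, 8))   # every row compresses
print(estimated_compression_savings(many_distinct, 8))  # only the list-sized fraction
```

With two distinct values, all 1,000 rows benefit; with 1,000 distinct values, only the 255 that fit on the hypothetical list do.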
After logical modeling, and before physical modeling, there is that all-important data demographics / information-gathering step. You know, the one that helps you identify good primary and secondary index candidates, possible statistics, and of course the appropriate data type for each column based on its domain values. C'mon, you know, the step that is frequently skipped because it's work and a pain?
As an example, let us look at two simple amount columns modeled in a table. Both were defined as DECIMAL(18,2), because that is how they were defined in the source system, and no one did any further demographic analysis to validate this.
A full demographic analysis of the PAST_DUE_AMOUNT columns shows the maximum values they actually contain.
With those maximum values, plus an additional digit of headroom for slightly larger future values, this data could be stored in a DECIMAL(8,2) data type.
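The demographic check itself is just a simple aggregate query. Here is a minimal sketch using Python's built-in sqlite3 as a stand-in for a Teradata session (the table name and values are made up for illustration; on Teradata you would run the same kind of SELECT against the real table):

```python
import sqlite3

# In-memory stand-in for the warehouse table (illustrative data only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (past_due_amount DECIMAL(18,2))")
con.executemany("INSERT INTO account VALUES (?)",
                [(125.50,), (98000.75,), (401.00,)])

# The demographic questions: how big do the values actually get, and how
# many distinct values are there (a hint for compression candidates)?
max_val, distinct = con.execute(
    "SELECT MAX(past_due_amount), COUNT(DISTINCT past_due_amount) FROM account"
).fetchone()
print(max_val, distinct)
```

With a maximum like 98000.75 (seven total digits), the column fits comfortably in DECIMAL(8,2).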
A DECIMAL(8,2) uses only 4 bytes of storage versus 8 bytes for a DECIMAL(18,2), so changing this data type saves us 4 bytes for every row in the table.
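Those sizes come from Teradata's documented storage for DECIMAL by total precision (1-2 digits: 1 byte; 3-4: 2 bytes; 5-9: 4 bytes; 10-18: 8 bytes; 19-38: 16 bytes). A small helper, written here for illustration, makes the lookup explicit:

```python
def teradata_decimal_bytes(precision):
    """Bytes of storage for a Teradata DECIMAL(p, s), by total digits p."""
    if 1 <= precision <= 2:
        return 1
    if precision <= 4:
        return 2
    if precision <= 9:
        return 4
    if precision <= 18:
        return 8
    if precision <= 38:
        return 16
    raise ValueError("precision must be between 1 and 38")

print(teradata_decimal_bytes(18))  # the original DECIMAL(18,2): 8 bytes
print(teradata_decimal_bytes(8))   # the right-sized DECIMAL(8,2): 4 bytes
```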
|Column Name|Data Type|New Data Type|Savings In Bytes|
|---|---|---|---|
|PAST_DUE_AMOUNT|DECIMAL(18,2)|DECIMAL(8,2)|4|
| |DECIMAL(18,2)|DECIMAL(8,2)|4|
Now, eight bytes per row probably does not sound like much, but remember, we need to do the math: eight bytes times how many rows? Enough of these columns across billions of rows starts to add up in savings. In addition, remember that these are savings "before" compression, so they do not become less effective if the column's value demographics change. I like to think of this as "hard" savings versus the "soft," changeable savings from compression.
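To make "do the math" concrete, here is a quick back-of-the-envelope calculation (the billion-row table is hypothetical):

```python
# Two right-sized DECIMAL columns, 4 bytes saved in each, per row.
bytes_saved_per_row = 2 * 4
rows = 1_000_000_000  # hypothetical billion-row table

total_bytes = bytes_saved_per_row * rows
print(f"{total_bytes / 2**30:.2f} GiB saved")  # roughly 7.45 GiB, before compression
```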
I think that if you start insisting on this important step in your modeling process, you will find that, with a little effort, it pays off continuously.
If you would like to get some of these savings by performing this analysis on existing Teradata tables, your local Professional Services consultant has access to a tool that can help with this basic analysis (Reference Knowledge Base # KA62796).
Very Nice Approach...
I must say, David, your practical approach makes for quicker and more efficient understanding than the usual theoretical articles.
Also, applying the same approach to INTEGER and other numeric data types reduces the time taken to identify compressible values in the table.
We must also not forget that by applying the correct DDL, we minimize the spool, CPU, and I/O of queries that contain VARCHAR fields participating in GROUP BY / ORDER BY. Furthermore, in this scenario the gain applies not only to the current number of rows in the table, but also to the total number produced at a step of the query; for example, a dimension attribute that has to be shown together with a transactional fact of a couple of billion rows.
Unfortunately, how accurate the DDL is reflects how current implementations have been developed, how well teams know their data, and how well they understand how Teradata works. The result is often sad.