Right Sizing Data for Space Savings and Performance


Everyone is aware of Teradata's continued commitment to reducing the footprint and resource usage of data by adding compression capabilities, and of the resulting perm space savings and performance gains from smaller data blocks, less IO, and lower spool usage.

While this new functionality is great, I would like to suggest that you not ignore a basic best-practice step in implementing your data warehouse model. In this case, before you reach for compression, look at 'right sizing' your data first.

Compression works by eliminating repetitive values within a column across rows and replacing the values with indicator bit(s) in the header. The more frequently a value occurs, the higher the rate of compression. This works especially well for columns that contain only a few distinct values repeated across many rows. If, however, your column contains many distinct values, compression will not do as much for you.
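As a quick illustration (the database, table, columns, and value list below are made up for the example), Teradata's multi-value compression is declared right in the DDL; a minimal sketch:

-- Sketch of multi-value compression; names and values are hypothetical.
-- The listed values are stored once for the table and replaced by
-- presence bits in each row, so frequent, low-cardinality values compress well.
CREATE TABLE sales_db.order_status
(
  order_id    INTEGER NOT NULL,
  status_cd   CHAR(2) COMPRESS ('OP','SH','DL','CA'),  -- few distinct, frequent values: good candidate
  order_note  VARCHAR(200)                             -- mostly unique values: poor candidate
)
PRIMARY INDEX (order_id);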

After logical modeling, and before physical modeling, there is that all-important data demographics / information gathering step.  You know, the one that helps you to identify good primary and secondary index candidates, possible statistics, and of course, the appropriate data type for each column based on the domain values.  C'mon, you know, the step that is frequently skipped because it's work and a pain?
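If you want a starting point, a minimal demographic query along these lines (the database, table, and column names are just placeholders for your own) does most of the work:

-- Basic column demographics: range, cardinality, and row count.
SELECT  MIN(amount_col)            AS min_val,
        MAX(amount_col)            AS max_val,
        COUNT(DISTINCT amount_col) AS distinct_vals,
        COUNT(*)                   AS total_rows
FROM    my_db.my_table;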

As an example, let us look at two simple amount columns modeled in a table.  They were both defined as DECIMAL(15,2), because that is how they were defined in the source system, and no one did any further demographic analysis to validate this.

Column Name       Data Type
CHARGED_AMOUNT    DECIMAL(15,2)
PAST_DUE_AMOUNT   DECIMAL(15,2)

A full demographic analysis of the CHARGED_AMOUNT and PAST_DUE_AMOUNT columns shows that the maximum values they contain are 17080.12 and 10180.00, respectively.

With these maximum values, and allowing an additional digit of headroom for slightly larger values, this data could be stored in a DECIMAL(8,2) data type.

Since a DECIMAL(8,2) data type uses only 4 bytes of storage versus 8 bytes of storage for a DECIMAL(15,2), changing this data type saves us 4 bytes for every row in the table.

Column Name       Data Type       New Data Type   Savings in Bytes
CHARGED_AMOUNT    DECIMAL(15,2)   DECIMAL(8,2)    4
PAST_DUE_AMOUNT   DECIMAL(15,2)   DECIMAL(8,2)    4
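Because Teradata will not, as far as I know, shrink a column's data type in place, the usual way to realize this saving on an existing table is to create a right-sized copy and reload it. A minimal sketch, with the database name, primary index, and surrounding columns assumed purely for illustration:

-- Right-sized copy of the table; only the two amount columns are shown,
-- the rest of the definition is assumed.
CREATE TABLE sales_db.billing_rs
(
  account_id       INTEGER NOT NULL,
  charged_amount   DECIMAL(8,2),   -- was DECIMAL(15,2): 8 bytes down to 4
  past_due_amount  DECIMAL(8,2)    -- was DECIMAL(15,2): 8 bytes down to 4
)
PRIMARY INDEX (account_id);

INSERT INTO sales_db.billing_rs
SELECT account_id, charged_amount, past_due_amount
FROM   sales_db.billing;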

Now, eight bytes per row probably does not sound like a lot, but remember, we need to do the math: eight bytes times how many rows?  Enough of these columns times billions of rows starts to add up in savings.  In addition, remember that these are savings "before" compression, so they do not become less effective if the column's value demographics change.  I like to think of this as "hard" savings versus the "soft", changing savings from compression.
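To put a rough number on it: on a one-billion-row table, eight bytes per row works out to on the order of 8 GB of perm space before compression. A quick back-of-the-envelope query (table name hypothetical, with the per-row savings plugged in by hand) will size it for any table:

-- Back-of-the-envelope estimate: rows times bytes saved per row, in MB.
SELECT  CAST(COUNT(*) AS BIGINT) * 8 / (1024 * 1024) AS approx_savings_mb
FROM    sales_db.billing;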

I think that if you start insisting on this important step in your modeling process, you will find that, with a little effort, it consistently pays off well.

If you would like to get some of these savings by performing this analysis on existing Teradata tables, your local Professional Services consultant has access to a tool they can use to help with this basic analysis (reference Knowledge Base article KA62796).

Good Luck!

Dave

2 REPLIES

Re: Right Sizing Data for Space Savings and Performance

Very Nice Approach...

I must say, David, your practical approach makes for quicker and easier understanding than a more theoretical article.

Also, the same approach with INTEGER or other numeric data types reduces the time it takes to identify compressible values in a table.
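For what it's worth, a simple frequency query is one way to spot those compressible values (the table and column names here are hypothetical):

-- The most frequent values are the best multi-value compression candidates.
SELECT   past_due_amount, COUNT(*) AS occurrences
FROM     sales_db.billing
GROUP BY past_due_amount
ORDER BY occurrences DESC;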

Re: Right Sizing Data for Space Savings and Performance

We must also not forget that by applying the correct DDL, we minimize the spool, CPU, and IO on queries that contain VARCHAR fields participating in a GROUP BY or ORDER BY. Furthermore, in this scenario the gain is not only on the current number of rows in the table, but also on the total row count produced in a step of the query, for example a dimension attribute that has to be shown together with a transactional fact of a couple of billion rows.

Unfortunately, how accurate the DDL is reflects how current implementations have been developed, how well people know their data, and how well they understand how Teradata works. The result is often really sad.