Teradata Columnar

Teradata Employee

Teradata 14.0 introduces Teradata Columnar – a new option to organize the data of a user-defined table or join index on disk.

Teradata Columnar offers the ability to partition a table or join index by column.  It introduces column-storage as an alternative to row-storage for a column partition, as well as autocompression.  Column partitioning can be used alone in a single-level partitioning definition or with row partitioning in a multilevel partitioning definition.

Teradata Columnar is a new paradigm for partitioning, storing data, and compression that changes the cost-benefit tradeoffs of the available physical database design choices and their combinations.  Teradata Columnar provides a benefit to the user by reducing I/O for certain classes of queries while at the same time decreasing space usage.

A column-partitioned (CP) table or join index has several key characteristics:

  1. It does not have a primary index (a future blog entry will discuss why).
  2. Each of its column partitions can be composed of a single column or multiple columns.
  3. Each column partition usually contains multiple physical rows.  Physical rows are the disk-based structures that the Teradata file system uses to store data in order based on the rowid associated with each physical row.  The first part of a rowid indicates the partition number so that physical rows are ordered by partition number first and within partition by the rest of the rowid.
  4. A new physical row format COLUMN may be utilized for a column partition; such a physical row is called a container.  This is used to implement column-storage, row header compression, and autocompression for a column partition.  This provides a compact way to store a series of column partition values.
  5. Alternatively, a column partition may have physical rows with ROW format that are used to implement row-storage; such a physical row is called a subrow.  Each column partition value is in its own physical row.  Usually a subrow is wide (multicolumn, large character strings, etc.) where the row header overhead for each column partition value is insignificant and having each column partition value in its own physical row provides more direct access to the value.
  6. A CP table is just another type of table that can be accessed by a query.  A single query can access multiple kinds of tables.

Learn more about Teradata Columnar by listening to the presentation from the 2011 Teradata PARTNERS User Group Conference:

Rows vs. Columns? Why not Both?

http://developer.teradata.com/database/training/rows-versus-columns-why-not-both

Also, refer to the Orange Book: Teradata Columnar

PPI stands for partitioned primary index which means the table has a primary index and the rows are partitioned on the AMPs (and within a partition, the rows are ordered by a hash of the primary index columns).

A CP table is not a PPI table since a CP table doesn't have a primary index.  But a CP table can have RANGE_N and CASE_N row partitioning (the kind of partitioning of rows that is used in PPI).  Since there is no primary index, however, the rows within a row partition are not ordered by a hash of some columns of each row -- they are just in insert order.

A CP table could have a join index on it where the join index does have a primary index (but not column partitioning).

A PI or PPI table could have a join index on it where the join index has column partitioning (but not a primary index) plus optionally 1 or more levels of row partitioning. 
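A sketch of that second case (the table, index, and column names here are illustrative, not from the article; check the SQL DDL reference for the exact CREATE JOIN INDEX syntax in your release):

```sql
-- A hypothetical PI table...
CREATE TABLE Sales_PI (
  TxnNo    INTEGER,
  TxnDate  DATE,
  ItemNo   INTEGER,
  Quantity INTEGER )
PRIMARY INDEX (TxnNo);

-- ...with a column-partitioned join index on it.  The join index itself
-- has no primary index; a level of row partitioning could also be added.
CREATE JOIN INDEX Sales_CP_JI AS
  SELECT TxnDate, ItemNo, Quantity
  FROM Sales_PI
PARTITION BY COLUMN;
```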

A NoPI table can't have row partitioning unless it also has column partitioning. 

For example:

CREATE TABLE SALES (
  TxnNo INTEGER,
  TxnDate DATE,
  ItemNo INTEGER,
  Quantity INTEGER )
PARTITION BY COLUMN,
UNIQUE INDEX (TxnNo);

This creates a column-partitioned (CP) table that partitions the data of the table vertically.  Each column is in its own column partition that is stored using column-storage with row header compression and autocompression.  All the data for TxnNo comes first, followed by the data for TxnDate, followed by the data for ItemNo, and then the data for Quantity.  Note that a primary index is not specified so this is a NoPI table.  Moreover, a primary index must not be specified if the table is column partitioned.

The following adds a level of row partitioning (so the table has multilevel partitioning).  All the data for TxnNo for the first day comes first, followed by the next day of data for TxnNo, etc., then all the data for TxnDate for the first day, the second day, etc., ending with the last day of data for Quantity.

CREATE TABLE SALES (
  TxnNo INTEGER,
  TxnDate DATE,
  ItemNo INTEGER,
  Quantity INTEGER )
PARTITION BY (
  COLUMN,
  RANGE_N(TxnDate BETWEEN
    DATE '2011-01-01' AND DATE '2011-12-31' EACH INTERVAL '1' DAY) ),
UNIQUE INDEX (TxnNo);
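Only the column partitions (and row partitions) that a query references need to be read.  For instance, against the multilevel table above, a query like the following sketch would touch only the TxnDate and Quantity column partitions and, within those, only the January row partitions:

```sql
-- Column partition elimination: ItemNo and TxnNo partitions are not read.
-- Row partition elimination: only the 31 January partitions are scanned.
SELECT TxnDate, SUM(Quantity)
FROM SALES
WHERE TxnDate BETWEEN DATE '2011-01-01' AND DATE '2011-01-31'
GROUP BY TxnDate;
```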

48 Comments
Enthusiast
P - Paul's
A - architected
R - release 5
T - Teradata
I - innovation
T - to reduce
I
O
N - now
Enthusiast
Paul, a very nice article and a great presentation at Partners on columnar. We hope to get TD 14 in the near future to be able to use it.
Enthusiast
Hi Paul,

In regards to the columnar feature, I would like to understand clearly the limits of "Container & Subrow". As far as I know, as the container/subrow is similar to the physical row, it is limited by the maximum physical row size of 65KB. But I saw the below sentence in one of your articles: "Note that the table row limit of 65KB would be exceeded before the subrow limit could be exceeded". Can you please elaborate on this?

Also one more question about the subrow. Subrow means it can only hold one column partition value, right? Or can it hold a subset of the columns for a table row? I am confused about the subrow. Can you give me some more info on this please?

Thank you.
Geeta Kudumula.
Teradata Employee
A container can contain one or more column partition values. A subrow contains only one column partition value. A column partition value can consist of one or more column values. Containers and subrows are both physical rows for the file system (one or more physical rows occur in a data block). A "row" of a table (consisting of values for all the columns of the table) is limited to 65KB. For a column-partitioned (CP) table, a table row is broken up into one or more column partition values (one for each column partition). For a subrow to be bigger than 65KB, the table row that it is part of would have to be bigger than 65KB, but that would exceed the limit for a table row.

A subrow is the same as a regular row (in regard to the physical rows in which they are stored) except that a subrow contains values for a subset of one or more columns of the table while a regular row has values for all the columns of the table.

CREATE TABLE t1 (COLUMN(a1 INT, a2 INT), ROW(a3 INT, a4 CHAR(60000))) PARTITION BY COLUMN;

This has 2 column partitions: one with a1 and a2 that is stored in containers and another with a3 and a4 that is stored in subrows.

INSERT INTO t1 (1, 2, 3, 'big string');

The values 1 and 2 are appended into a container for the first column partition as one column partition value (other column partition values can be in the container -- another container is started when it gets full) while the values 3 and 'big string' are appended as a subrow of the second partition (only one column partition value can go in this subrow -- the next column partition value for this column partition would go into a separate subrow).
Enthusiast
Thanks Paul, Great clarification.

I am reiterating what I understood about CONTAINER & SUBROW with the below example. Please correct me if I have understood anything wrong. I am using the same table that you have defined in your explanation, with the following rows inserted.

***Table Data***
a1 a2 a3 a4
1 2 3 big string
4 5 6 big string2
7 8 9 big string3

So the data would be arranged into the container and subrows as shown below right?

CONTAINER#1: 1,2,4,5,7,8,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_....up to 65KB
CONTAINER#2: _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_....up to 65KB

SUBROW#1: 3, big string
SUBROW#2: 6, big string
SUBROW#3: 9, big string
SUBROW#4: _, _

I am thinking I got a clear picture if I am correct on above example.

But I still didn't understand the 65K limit of the subrow. I am thinking a CONTAINER cannot exceed the 65K limit. When it gets full (up to 65K), one more container will be started.

On the subrow you said: "For a subrow to be bigger than 65KB, the table row that it is part of would have to be bigger than 65KB but that would exceed the limit for a table row". That means to exceed the 65K limit of a subrow, first of all the table row limit of 65KB would have to be exceeded?

Teradata Employee
***Table Data***
a1 a2 a3 a4
1 2 3 big string <-- up to 65KB
4 5 6 big string2 <-- up to 65KB
7 8 9 big string3 <-- up to 65KB

So the data would be arranged into the container and subrows as shown below right?

CONTAINER#1: 1,2,4,5,7,8,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_, _,_,_....up to 65KB
CONTAINER#2: _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_, _,_,_....up to 65KB

SUBROW#1: 3, big string <-- subrow can't be bigger than 65KB; if a1, a2, a3, a4 can't be bigger than 65KB then
this will be less than 65KB since it only has values for a3 and a4.
SUBROW#2: 6, big string
SUBROW#3: 9, big string
SUBROW#4: _, _

That means to exceed the 65K limit of a subrow, first of all the table row limit of 65KB would have to be exceeded? Yes.
Enthusiast
Perfect, thank you.
Enthusiast
Hi Paul,
Can you help me with my question regarding RowID of the CP table (NoPI table).
I know the CP table RowID is the combination of "Partition#, HashBucket# & UniquenessValue". The first row going into the NoPI table starts off with a uniqueness of 1. The uniqueness value increases as more rows are inserted into the table. So that means the row partition number will be unique for each CP, the hash bucket will be the same, but the uniqueness value will differ (I mean, be incremented by 1) for each CP value.

Now my question is how the below statement will be true?

The rowids(combination of "Partition#, HashBucket# & Uniquenessvalue ") for the different column partition values of a logical row are the same except for the column partition number which is unique to each column partition.

I am assuming for the different CP values of a row, the row id should be different, isn't it?
Enthusiast
Hi Paul,

Great article. I have a question with regards to Autocompression, I'm just wondering why Autocompression is most effective for single-column partitions with COLUMN format and less so for multicolumn partitions with COLUMN format? Is this only because of significant CPU increase or there are any other reasons?

Thanks in advance.
Teradata Employee
For each of the column partition values of a row, the rowid for each of those column partition values is different but the only portion of the rowid that is different is the column partition number (they have the same row partition numbers, hash bucket, and uniqueness value). The column partition values are stored in the file system in rowid order (that is, in order by the internal partition number, and then for the same internal partition number, in order by the hash bucket value, and within the same internal partition number and hash bucket pair by the uniqueness value).

By the name, one might think that a uniqueness value is unique for a table but it is really only unique within an internal partition number/hash bucket pair (or for a PI table, internal partition number/hash value pair). It is the combination of internal partition/hash bucket/uniqueness (that is, a rowid) that is unique for the column partition values for a table.

Note that the uniqueness value starts at one and is incremented by one for each column partition value inserted for an internal partition number/hash bucket pair. If the table is not also row partitioned, this is the same as starting at one and incrementing by one for each row inserted. But with row partitioning, it starts at one and is incremented by one for each row inserted into a combined row partition. So the uniqueness value is not unique across all rows of the table -- it is only unique within an internal partition number/hash bucket pair.

For more information on rowids for a NoPI table and a CP table, see the Orange Book on Columnar.
Teradata Employee
Autocompression is most effective for single-column partitions with COLUMN format and less so for multicolumn partitions with COLUMN format because of the compression techniques used. Effective here means effective in providing a reduction in disk space usage.

For run length compression and value-list compression, compression is better if column partition values are very nonunique. Adding more column values into a column partition value tends to make the column partition values more unique and makes the compression less effective.

The trimming compression techniques only trim the column partition value so adding more column values into a column partition value usually means only the last column value in the column partition value is being trimmed. If the column values were in separate column partitions, trimming might be applicable to each of the column values.

Similar issues occur for the other techniques if there are multiple columns in a column partition.

Note that with ROW format there are currently no techniques being used for autocompression. So to have a chance that autocompression is effective, currently the column partition must have COLUMN format.

Autocompression and other forms of compression have some CPU overhead so a consideration can be whether the compression ratio achieved, reduction in disk space usage, and the reduced I/O due to compression is worth the extra CPU overhead to compress and uncompress the data. On an I/O bound system or disk space limited system, compression may be worth it but if CPU bound, it may not.
Teradata Employee
Will there be an in-database SQL or Java utility that will allow pivoting a table or subset of a table from row to column in V14?
Teradata Employee
The simple answer to my knowledge is no. A user-defined table function could possibly be used to do this. Also, this might be done with some join logic though that might be complex and not perform well. If readers have other suggestions, please let us know.

Aside: Note that an insert does take a row and properly insert the data into a column-partitioned table, but this isn't changing a row to be a column.
Hi Paul, thanks for your great article and explanation. I have one question here. I am going to partition the table based on a date column, and we will have a daily feed; it is a long-running project. So while creating the table, if I give '01-01-2012 To 2299-12-31' INTERVAL '1' MONTH, the data we have to date will go to the corresponding partitions. My question is: will Teradata create the partitions for all the ranges specified (up to 2299-12-31) at the initial stage, or will it create a partition once the data comes in the feed? If all the partitions are created at the very initial stage, will that create any performance issues? Could you please clarify?

Thanks!
Teradata Employee
A table is just a series of physical rows, each with a rowid, where the physical rows are kept in rowid order. The first portion of the rowid is the partition number (or combined partition number for multilevel partitioning) so rows with the same partition number are grouped together (with no gaps between them if there are empty partitions between them). If there are no rows for a particular partition, it does not take any space (and just isn't there). A partition is not "created" until a physical row is inserted into it (there is really no actual creation process for a partition other than to insert a row for it). For the CREATE TABLE statement, you are just defining the partitioning expression which is later used to compute the partition number (and part of determining the rowid) for a row when it is inserted. So there are not any performance issues related to "creating" partitions. Similarly, if all the rows for a partition are deleted, there is no special action -- you just delete the rows for it and the partition simply no longer has any rows.

When creating a table, a table header row is written to each AMP and the data dictionary is updated. But no other space is actually used until rows start being inserted.

This is unlike some other database systems where a partition has to have a data space allocated for it prior to insertion of rows into the partition.
Enthusiast
Hi Paul

This is a good article about the TD Columnar.
When I read through it, there are some areas where I needed more clarity.

As I understand, though it is column partitioned, the data is still stored as rows only, but it contains the row or column values in subrows or containers.

Can you please explain how the Containers and Subrows are distributed in the system?
Are they distributed among the AMPs equally?
Will this cause any hot AMP conditions?
Are CP tables good candidates for huge volumes of data?

Thanks,
Mani
Enthusiast
Hi Paul,

I have a small clarification..
Consider a column-partitioned table with three column partitions C1, C2 and C3.
The table has, say, a billion rows.
Since it is a big table it will have multiple datablocks.
So will each datablock contain only one kind of container (C1, C2 or C3),
or is it mandatory that a datablock have a combination of all the containers C1, C2 and C3?
If the former is true, then a logical row in the table will be split across different datablocks, right?
Is that allowed? Or else must a logical row exist within a single datablock?
Teradata Employee
Mani, I am working on a response to your comments (delayed because I just came back from sailing around some exotic islands in French Polynesia).

In the meantime, here is a quick response to Cvinodh's comment:

A partition can end in the middle of a data block and then another partition can start in the data block. The physical rows are ordered by their internal partition number in the Teradata file system -- a data block is just one or more physical rows (and a data block is the unit of what can be written or read at a time). For a large table, there will be, on an AMP, data blocks for containers of only C1, then possibly a data block with some containers for C1 and then some containers for C2, then data blocks for only containers of C2, then possibly a data block with some containers for C2 and then some containers of C3, and finally data blocks for only containers of C3. That is, partition boundaries are not necessarily on data block boundaries but all the C1 containers come first, then the containers for C2, and then the containers for C3. So usually a data block only has containers for one column partition but, at the boundaries, it will change from one partition to the next partition. So for your question, the "former" is true except for this minor caveat and the "latter" is false.

All the column values of a logical row do not usually exist in the same data block -- the column values of a logical row are split across column partitions. For a large table, the column partitions would be large and span many data blocks. It would be only in a very small table that the column values of a logical row could exist in the same data block (that is, when all the column partitions fit in the same data block). However, do note that all the column values for a logical row would be on the same AMP.

Note that this is not a "hybrid" columnar approach where the columns are only partitioned within a "data block" or "compression unit" and all the column values of a logical row can be found within that data block or compression unit. Teradata supports a fully columnar approach where each column is separate from the others so to read the data for a particular column only entails reading the data blocks for that column (where possibly the first and the last data block read may contain some values for a preceding column or a following column, respectively). Teradata also supports multicolumn partitions.

For more information, refer to the Orange Book: Teradata Columnar and the Orange Book: Increased Partition Limit and other Partitioning enhancements.
Enthusiast
Thanks for the reply Paul..
but one final question..
If the logical row is now distributed across different datablocks or cylinders in a column-partitioned NoPI table, then the conventional Master Index and Cylinder Index (with starting RowID and highest row hash value) entries are not applicable for these tables, right? So these tables don't have any of these structures?
Teradata Employee
Mani, below I try to answer your questions:

1. As I understand, though it is Column Partition, the data is stored as Row only.

>> All data in Teradata is stored in what I refer to as physical rows by the Teradata file system. A physical row does not always correspond to what is considered to be a row in a table (or to what I refer to as a table row or logical row). The data in a physical row can be a table header, a portion of a table header, a regular row (which does correspond to a table row), an index row, a container (a series of column partition values; that is, a column partition with COLUMN format), a subrow (one column partition value; that is, a column partition with ROW format), and other miscellaneous data structures. A physical row always has a row header which includes the length of the physical row and a rowid. The Teradata file system keeps physical rows in order by rowid (of which the internal partition number is the first component and therefore, physical rows with the same internal partition number are grouped together). Information in the tableid and in the table header is used to determine what kind of data is in a physical row.

2. But it contains the Row or Column values in the Subrow or Containers.

>> If a column partition has ROW format, the physical rows for that column partition will have subrow format (that is, one column partition value). If a column partition has COLUMN format, the physical rows for that column partition will have container format (that is, one or more column partition values – usually a lot of column partition values). For a single-column partition, a column partition value is one value of the column. For a multicolumn partition, a column partition value is the concatenation of the values of those columns from the same table row.

3. Can you please explain how the Containers and Subrows are distributed in the System?

>> Since a CP table is a NoPI table, single-row and array inserts are randomly distributed to the AMPs. On the AMP, the table row is then split into column partition values for each of the column partitions and, if the column partition has COLUMN format, a column partition value is appended to the last existing container if there is one and there is room, or a new container is appended with that column partition value. If the column partition has ROW format, the column partition value for the column partition is appended to the column partition as a subrow. The random distribution should usually provide a fairly even distribution of the rows to the AMPs.

>> For an INSERT-SELECT, rows from the source are not distributed by default but rather are locally copied on each AMP from the source (the result of the SELECT) to the CP table and then each table row is split up as described above. If the source is distributed fairly equally, the data for the CP table should be fairly evenly distributed among the AMPs; but if not, the CP table would have the same skew as the source. However, there is an option to force the INSERT-SELECT to randomly distribute the rows from the source before inserting them into a NoPI or CP table and an option to specify columns on which to hash distribute rows – these options (using the HASH BY clause) should be used in the case the source is or might be skewed to make sure table rows are distributed equally among the AMPs.
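>> As a sketch of the HASH BY option described above (Staging_Sales is a hypothetical source table; check the SQL reference for the exact clause placement in your release):

```sql
-- Randomly redistribute source rows before inserting into the CP table,
-- so a skewed source does not produce a skewed target.
INSERT INTO SALES
SELECT TxnNo, TxnDate, ItemNo, Quantity
FROM Staging_Sales
HASH BY RANDOM;
```

>> HASH BY can alternatively name columns or expressions to distribute on, when locality on particular columns is wanted instead of a purely random spread.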

4. Is it distributed among AMPs equally?

>> This depends on how the table rows are inserted into the CP table as described above. Also, a copy/restore to a different configuration or a reconfig can cause a NoPI or CP table to become skewed.

5. Will this cause any Hot AMP Conditions?

>> This can cause hot AMP conditions for SELECTs from the table if an INSERT-SELECT is used to load the table from a skewed source without using the HASH BY clause. However, as mentioned above the HASH BY option should be used when the source is skewed in order to avoid an unequal distribution of the rows among the AMPs.

6. Is a CP Table a good candidate for Huge Volume of Data?

>> Having a large volume of data is an indication that CP should be considered but there are also a number of other things to consider in regard to whether a table is actually a good candidate to be CP. In particular, the kinds of queries that will occur is an important factor to consider. There is also a cost to loading data into a CP table (that is, to partition the rows of the table into the column partitions) which needs to be considered. There is also usually a trade-off in that I/O may decrease but CPU usage may increase. For other considerations and more information about Teradata Columnar, see the Orange Book: Teradata Columnar.
Teradata Employee
Cvinodh asked:

If the logical row is now distributed across different datablocks or cylinders in a column partitioned NOPI table... Then conventional Master Index and Cylinder Index (with starting RowID and highest Row hash value) entries are not applicable for these tables right? so these tables dont have any of these structures?

>> Actually, they do have these structures. These structures are used by the Teradata file system to maintain physical rows in rowid order and to find physical rows based on a rowid (for instance, rowids coming from an index or rowid spool). These structures have the starting rowid of a cylinder or data block and the internal partition number (this is zero for a nonpartitioned table) and the first 32 bits of the rest of the highest rowid for the cylinder or data block. In a primary index table, these 32 bits would correspond to a row hash; however, for a NoPI table (and a CP table is a NoPI table), part of these 32 bits are a hash bucket value chosen from the NoPI hash map and part of them are the high order bits of the uniqueness (a NoPI table has more uniqueness bits in the rowid than a PI table). The Teradata file system really doesn't care what these bits represent -- it just orders the physical rows by the rowid it is given for each of the physical rows. For more information about the rowid structure for a NoPI and a CP table, see the Orange Book: No Primary Index (NoPI) Table User's Guide and the Orange Book: Teradata Columnar.
Teradata Employee

Hi Paul,

One question :

Why is columnar not allowed on PI tables? What is the reason/logic behind this?

Teradata Employee

Why is columnar not allowed on PI tables? What is the reason/logic behind this?

There is a discussion about this in the blog entry http://developer.teradata.com/blog/paulsinclair/2011/10/why-no-primary-index-nopi-for-a-column-parti....  There are definitely valid use cases.  This is on the roadmap for implementation in a future release.

Paul,

How can the TD14 columnar storage be configured? Is it available intrinsically or do we need to get special hardware from Teradata to make use of this feature?

Teradata Employee

How is TD14 columnar storage configured? Is it something available intrinsically or do we need to get special hardware from Teradata to make use of this feature?

This is a software feature of Teradata TD 14 (enhanced in TD 14.10) and builds on existing Teradata File System support. There is no special hardware or additional software needed for this feature. However, you do need to be licensed to use this feature and have a Teradata representative turn on the internal flag that enables it.

Once enabled, you can create objects with columnar storage using the PARTITION BY clause on the CREATE TABLE or CREATE JOIN INDEX statement. Column partitions will default to either COLUMN or ROW storage depending on their size and other factors (but you can override that default by explicitly specifying COLUMN or ROW format for a column partition -- however, the default in many cases is probably the best choice). For more details, see the Orange Book for Teradata Columnar.
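As a sketch of overriding those defaults (the table and column names are illustrative, and the exact grouping syntax should be confirmed against the Orange Book and the SQL DDL reference):

```sql
CREATE TABLE Sales_Mixed (
  TxnNo    INTEGER,
  TxnDate  DATE,
  Comments VARCHAR(10000) )
PARTITION BY COLUMN (
  TxnNo NO AUTO COMPRESS,     -- narrow single-column partition: COLUMN format,
                              -- with autocompression explicitly turned off
  ROW(TxnDate, Comments) );   -- wide multicolumn partition: ROW format (subrows)
```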

Enthusiast

Hi Paul,

Why do we achieve better compression if we have the column partition in column format vs storing it in row format?

Also when you mentioned: "Usually a subrow is wide (multicolumn, large character strings, etc.) where the row header overhead for each column partition value is insignificant and having each column partition value in its own physical row provides more direct access to the value."

Can you please explain why storing a column partition value in "ROW" format gives better access than storing it in "COLUMN" format?

Also, you mentioned that each physical row has a unique row id. Then if such a physical row is a "container" row, it stores a lot of column partition values in the same row (up to 65KB worth of values). Do all these column partition values have the same row id?

If that is true, then how do we link any X value in a "container" row to a Y value in some other column partition (which may be stored in container or row format, and where both X and Y are part of the same "logical" row)? Perhaps, the uniqueness value?

Teradata Employee

Suhail, thanks for the questions. I'll try to answer them:

1. Why do we achieve better compression if we have the column partition in COLUMN format vs. storing it in ROW format?

Whether better compression is achieved with COLUMN format or ROW format depends on how well the compression techniques used for a column partition actually compress the data for the column partition. With ROW format, no autocompression techniques are currently applied and row header compression is not applicable to ROW format; for ROW format, you may explicitly apply compression at the column level (MVC or ALC) or use block-level compression for the table. Using COLUMN format, the system may be able to apply autocompression (6 different techniques currently and some of these may be used automatically in combination) and row header compression may be significant if the column partition is narrow; in addition, you may add MVC and ALC for the columns in the column partition and block-level compression for the table as can be done with column partitions with ROW format. Row header compression can be a significant savings for COLUMN format over ROW format if the column partition is narrow. For a wide column partition, row header compression may not be significant and opportunities for autocompression might be limited. Since COLUMN format has the additional compression techniques of autocompression and row header compression, it may be easier (or in some cases, only possible) to get good compression with COLUMN format compared to ROW format, especially for narrow column partitions and single-column partitions; however, cases can occur where ROW format (using MVC, ALC, and block-level compression as appropriate) may be a better choice, especially for wide column partitions.

2. Also when you mentioned: "Usually a subrow is wide (multicolumn, large character strings, etc.) where the row header overhead for each column partition value is insignificant and having each column partition value in its own physical row provides more direct access to the value." Can you please explain why storing a column partition value in "ROW" format gives better access than storing it in "COLUMN" format?

To be clear, it provides a more "direct" access to the column partition value, not better access in general since there are definitely advantages to using COLUMN format under certain conditions such as for narrow column partitions being accessed by a selective scan. What I mean by direct access is finding a column partition value for a specific rowid (for instance, a rowid coming from an index, from a rowid spool, or from a selective scan of another column partition). For a column partition with ROW format, if a column partition value is accessed via a rowid, the rowid provides direct access (via the master index and the cylinder indexes) to a specific physical row that just has the column partition value of interest (since in ROW format, there is only one column partition value per physical row). But for a column partition with COLUMN format, many column partition values may be packed into a physical row -- the file system will find the physical row (via the master index and the cylinder indexes) with the greatest rowid that is less than or equal to the desired rowid and return that physical row -- then the database must find the desired column partition value in the physical row corresponding to the desired rowid (this would include handling any autocompression applied to the column partition values in the physical row) -- sometimes finding this column partition value within the physical row can be very efficient but it does require some extra CPU to do this compared to a column partition with ROW format.

3. Also, you mentioned that each physical row has a unique rowid. Then if such a physical row is a "container" row, it stores a lot of column partition values in the same row (up to 65KB worth of values). Do all these column partition values have the same rowid?

No. There is a unique rowid associated with every column partition value. The rowid in the row header for a "container" physical row (that is, for a column partition with COLUMN format) is just the rowid of the first column partition value in the container. Each subsequent column partition value in the container has an associated rowid that is one more than the rowid of the previous column partition value, but this rowid is not actually stored in the physical row since it can easily be determined from the value's position in the physical row (not storing each of these rowids is what is called row header compression). The next physical row has a rowid in its row header that is at least one greater than the rowid associated with the last column partition value of the preceding physical row.
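Row header compression can be sketched as follows (hypothetical Python with integer rowids; `rowid_of` is an invented helper, not a Teradata API):

```python
# Only the first rowid is stored per container; the rowid of every other
# value is implied by its position (row header compression).
def rowid_of(first_rowid, position):
    return first_rowid + position

container_first = 5000
values = ['v0', 'v1', 'v2']

# Reconstructed rowids for the container's values:
rowids = [rowid_of(container_first, i) for i in range(len(values))]
assert rowids == [5000, 5001, 5002]

# The next container's stored rowid must be at least one past the last value:
next_container_first = rowids[-1] + 1
assert next_container_first == 5003
```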

4. If that is true, then how do we link any X value in a "container" row to a Y value in some other column partition (which may be stored in container or row format, and where both X and Y are part of the same "logical" row)? Perhaps, the uniqueness value?

Since each column partition value has a unique rowid associated with it, it can be linked to the corresponding column partition value of another column partition of the same logical row using the same rowid modified to the desired column partition -- that is, the rowid for the Y value will have the same row partitioning, hash bucket, and uniqueness as the X value, but with the appropriate column partition number for the Y value. If Y is in a column partition with ROW format, the file system can be requested to find the physical row with this rowid in its row header. For a column partition with COLUMN format, the file system can be requested to find the physical row with the greatest rowid in its row header that is less than or equal to the rowid for Y; once that physical row is found, Y can be found within the physical row by its position relative to the rowid in the row header of that physical row. In either case, the file system can efficiently find the physical row via the master index and cylinder indexes.
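As a sketch (hypothetical Python; the tuple fields and function names are invented for illustration, and real rowids pack the row and column partition numbers into a single internal partition number):

```python
# Hypothetical rowid model: (row_partition, column_partition, hash_bucket,
# uniqueness), kept as separate fields here for clarity.
def rowid(row_part, col_part, bucket, uniq):
    return (row_part, col_part, bucket, uniq)

# Rowid of value X in column partition 1:
x_rowid = rowid(row_part=2, col_part=1, bucket=77, uniq=42)

# To reach the corresponding value Y of the same logical row in column
# partition 3, keep everything but swap in Y's column partition number:
def sibling_rowid(rid, target_col_part):
    row_part, _, bucket, uniq = rid
    return rowid(row_part, target_col_part, bucket, uniq)

y_rowid = sibling_rowid(x_rowid, target_col_part=3)
assert y_rowid == (2, 3, 77, 42)
```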

You may also want to refer to the Orange Book: Teradata Columnar for more information.

Enthusiast

Thanks Paul. I'm going to save this article on my desktop along with the comments. This is one of the few articles where the comments proved to be more informative than the article itself :)

-Suhail

Teradata Employee

Hi Paul

CREATE TABLE SALES (
    TxnNo     INTEGER,
    TxnDate   DATE,
    ItemNo    INTEGER,
    Quantity  INTEGER )
  PARTITION BY COLUMN,
  UNIQUE INDEX (TxnNo);

 

How come we can create a columnar table since we have a UPI on table SALES?

Teradata Employee

Hi Paul

Sorry, my bad, it's a secondary index.

Enthusiast

Hi Paul,

If we have a CPPA table, then how are the rows on the different AMPs sorted?

Do all rows with the same PA column value go to the same AMP?

Are rows with the same PA column values colocated to each other on any AMP?

-Suhail

Enthusiast

Also, if a CP table has row partition, then is the row partition number stored in the container row header too?

Teradata Employee

Teradata 15.10 introduces column partitioning with a primary index (CPPI) and with a primary AMP index (CPPA). A primary AMP index can be thought of as being halfway between a PI and NoPI -- rows distributed to AMPs based on the hash of the PA value but rows on an AMP are treated like they are for a NoPI table (that is, not sorted by the hash value of the PA).

As with a CPNoPI (column partitioning with no primary index), a CPPI or CPPA can also optionally be row partitioned using multilevel partitioning.

For more information about these new capabilities (especially if the preceding paragraphs sound like gibberish), see the latest version of the Orange Book: Teradata Columnar for Teradata Database 15.10. And, before using a CPPA, CPPI,  or CPNoPI, make sure you understand when it is appropriate to use each one (and when not to use column partitioning). This is especially true for a CPPI because you might suddenly find the storage needs for the table becoming HUGE if you define too many column partitions.

Teradata Employee

Suhail, glad to hear from you again and your interest in CPPA. I'll try to answer your questions (plus a little bit more) in the following:

1. If we have a CPPA table, how are the rows on the different AMPs sorted?

The physical rows of a table are always maintained by the Teradata file system in sorted order by rowid on the storage for an AMP whether it is PI, PA, or NoPI. The rowids for a PA table are generated the same as they are for a NoPI table; that is, an internal partition number, hash bucket, and uniqueness where the hash bucket, identifying the owning AMP, doesn't change unless uniqueness would exceed 44 bits. This is unlike a PI where a rowid consists of an internal partition number, hash of the PI value, and uniqueness. The internal partition number is 0 if the table is not partitioned.

Note that the size of the rowid is the same for a PI, PA, and NoPI. The Teradata file system treats the rowid the same for all three (the file system doesn't care how the rowid was constructed) and uses the master and cylinder indexes to maintain the order and access physical rows.
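A minimal sketch of the two rowid shapes (hypothetical Python tuples; real rowids are packed fixed-size bit fields, which are not modeled here):

```python
# PA/NoPI: the hash bucket identifies the owning AMP and normally never changes.
def pa_or_nopi_rowid(internal_part, hash_bucket, uniqueness):
    return (internal_part, hash_bucket, uniqueness)

# PI: the middle field is the hash of the PI value instead of a fixed bucket.
def pi_rowid(internal_part, pi_hash, uniqueness):
    return (internal_part, pi_hash, uniqueness)

# Either way, the file system just keeps physical rows sorted by rowid:
rows = [pa_or_nopi_rowid(1, 7, 3), pa_or_nopi_rowid(0, 7, 1),
        pa_or_nopi_rowid(1, 7, 1)]
assert sorted(rows) == [(0, 7, 1), (1, 7, 1), (1, 7, 3)]
```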

2. Do all rows with same PA column value go to the same AMP?

Yes. Distribution of table rows to the AMPs for a PA is the same as for a PI. The PA value is hashed and the row is sent to the AMP that owns the hash bucket which is the top 20 (or 16) bits of this hash value. If a query specifies the value for the PA, the plan can access the rows on a single AMP (just like for a PI but how it finds the rows on that AMP for a PA is handled like it is for a NoPI). Note that while the hash of the PA value is used to determine which AMP receives and owns the table row, the hash of the PA value is not used in defining rowids for the PA table (as discussed in the answer to the preceding question).
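A rough sketch of the bucket-to-AMP idea (hypothetical Python; the built-in `hash` and the modulo bucket-to-AMP map are stand-ins, not Teradata's row hash function or hash maps):

```python
# The AMP owning a row is chosen from the top 20 bits of a 32-bit hash
# of the PA value.
def hash_bucket(row_hash_32):
    return row_hash_32 >> 12          # keep the top 20 of 32 bits

def owning_amp(pa_value, num_amps):
    h = hash(pa_value) & 0xFFFFFFFF   # stand-in for the Teradata row hash
    return hash_bucket(h) % num_amps  # stand-in for the bucket-to-AMP map

# All rows with the same PA value land on the same AMP:
assert owning_amp('cust42', 8) == owning_amp('cust42', 8)
```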

3. Are rows with the same PA column values colocated to each other on any AMP?

Yes, table rows with the same PA columns values are colocated on the same AMP. This is the advantage of a PA over a NoPI and allows a PA to have single-AMP access when the value of the PA is specified in a query and more efficient joins and aggregations when they are on the PA columns.

However, on an AMP, those table rows or their column partition values with the same PA value are not necessarily (and are usually not) colocated within the storage for that AMP. With a PI, by contrast, there is colocation by hash within each internal partition, based on the nature of the rowids constructed for a PI table (which allows faster access by PI value on the AMP and joins such as merge joins).

4. Also, if a CP table has row partitioning, is the row partition number stored in the container rowheader too?

Yes, row and column partition numbers are combined into the internal partition number of the rowid stored in the rowheader for a container. This is the rowid of the first column partition value in the container; subsequent column partition values have a rowid (which is not physically stored) that increases by one for each column partition value in the container.

Since only the uniqueness can change for column partition values in a container, a container cannot contain column partition values that have different internal partition numbers, different hash buckets for PA and NoPI (however, note the hash bucket usually never changes), or different hash values for PI. Note that containers are used when a column partition has COLUMN format. Also, a container with more than one column partition value is limited to about 8KB in size; so if column partition values are large, only one or a few column partition values may fit in a container.

For a CPPI, since there are often very few column partition values with the same hash value within an internal partition, a container might have only one or a few values which is not an efficient use of containers. In this case, using ROW format for column partitions (so that subrows are used instead of containers) and having a few column partitions may be more efficient and avoid a huge increase in the size of the table.

Note that a row header for a container is much larger than a row header for a subrow (about 50 bytes compared to about 14 or 20 bytes). This is because a row header for a container contains additional information about autocompression and the layout of the container. So if only one or a few column partition values can be placed in each container of a column partition, using ROW format (to get subrows) for the column partition is more efficient. COLUMN format is more efficient if a lot of values can be put in a container.
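A back-of-envelope comparison using the approximate header sizes above (assumed for this sketch: a ~50-byte container row header, a ~14-byte subrow row header, 4-byte INTEGER values, and no autocompression):

```python
def container_bytes(n_values, value_size=4, header=50):
    return header + n_values * value_size      # rowids are implied, not stored

def subrow_bytes(n_values, value_size=4, header=14):
    return n_values * (header + value_size)    # one row header per value

# With only one value per physical row, the container header dominates:
assert container_bytes(1) > subrow_bytes(1)        # 54 > 18

# With many values packed into one container, COLUMN format wins easily:
assert container_bytes(1000) < subrow_bytes(1000)  # 4050 < 18000
```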

See also the latest version of the Orange Book: Teradata Columnar for Teradata Database 15.10.

If you have further questions, let me know.

Enthusiast

Hi Paul,

Thank you for the detailed responses. You mentioned: "However, on an AMP, those table rows or their column partition values with the same PA value are not necessarily (and are usually not) colocated within the storage for that AMP"

Does this mean that if there are 2 table rows with the same PA value, and the table is column partitioned on the PA column alone with COLUMN format, the 2 values may not necessarily have consecutive uniqueness values?

A larger question is: what is Teradata's logic for assigning uniqueness values to rows in a non-partitioned CPPA table? Is it just random?

Teradata Employee

1. Does this mean that if there are 2 table rows with the same PA value, and the table is column partitioned on the PA column alone with COLUMN format, the 2 values may not necessarily have consecutive uniqueness values?

Yes, that is correct. They would only be consecutive if the second row is inserted immediately following the first of the two rows and, if the table is row partitioned, only if they are in the same combined row partition.

2. A larger question is: what is Teradata's logic for assigning uniqueness values to rows in a nonpartitioned CPPA table? Is it just random?

A CPPA can't be nonpartitioned. By definition, it is column partitioned in order to be CPPA. If you mean that the CPPA table is not row partitioned, whether it is row partitioned or not doesn't affect how uniqueness is determined for a table row.

Uniqueness is not assigned randomly. It starts at one and is incremented by one as rows are inserted, per internal partition and hash bucket, for a PA or NoPI table. For a PI table, it is per internal partition and hash value (of the PI columns). See my first comment of March 7, 2012 above for more on this.
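The rule can be sketched like this (hypothetical Python; `next_uniqueness` is an invented illustration of the counter behavior, not Teradata code):

```python
from collections import defaultdict

# Uniqueness starts at 1 and increments per (internal partition, hash bucket)
# as rows are inserted (the PA/NoPI case).
counters = defaultdict(int)

def next_uniqueness(internal_part, hash_bucket):
    key = (internal_part, hash_bucket)
    counters[key] += 1
    return counters[key]

assert next_uniqueness(0, 7) == 1
assert next_uniqueness(0, 7) == 2   # same partition/bucket: consecutive
assert next_uniqueness(1, 7) == 1   # different internal partition: restarts
```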

Enthusiast

Thanks Paul. So if there is a non-row-partitioned CPPA table where every single column is solely its own column partition with COLUMN format, and there is a select query like:

select * from notrowpartitioned_CPPA where PA_COLUMN='X';

Then Teradata performs a single-AMP operation, but within that AMP, it will search every container of the column partition corresponding to PA_COLUMN for that value 'X', just like searching for a PI value in a row-partitioned regular table, correct?

Teradata Employee

select * from CPPA_table where PA_COLUMN='x';

For a row-partitioned or non-row-partitioned CPPA table (assuming PA_COLUMN is not also a partitioning column), Teradata performs a single-AMP operation, but within that AMP, it searches for 'x' in every container or subrow of the column partition corresponding to PA_COLUMN. This is just like searching for a value in a CPNoPI (except a CPNoPI would be doing this on all AMPs). In general for a PA, the behavior to determine the AMP or AMPs is like PI; the behavior on an AMP is like NoPI.

select * from notrowpartitioned_CPPI where PI_COLUMN='x';

For a non-row-partitioned CPPI table, Teradata performs a single-AMP operation and within that AMP, it probes once (lookup using master/cylinder indexes) in the column partition corresponding to PI_COLUMN based on the hash of 'x', and within that single set of column partition values (either in containers or subrows), searches for 'x'. Note that multiple values could have the same hash.

select * from rowpartitioned_CPPI where PI_COLUMN='x';

For a row-partitioned CPPI table (assuming PI_COLUMN is not also a partitioning column), Teradata performs a single-AMP operation, and within that AMP, it probes (lookups using master/cylinder indexes) based on the hash of 'x' in each combined row partition in the column partition corresponding to PI_COLUMN; within each of those sets of column partition values (either in containers or subrows), it searches for 'x'. Some probing can be skipped if there are empty combined row partitions, since a probe looks for a specific hash value and, if that is not found, the next higher one.

Note that a column partition may contain multiple columns (if you define it that way) and have either containers (COLUMN format) or subrows (ROW format) but not a mix of both in the same column partition.

Enthusiast

Thanks Paul. I am having a hard time understanding what Teradata does when a table is both row and column partitioned. Especially when column is at first level and row is at second level.

I have a row-partitioned (by date) big table right now that has a lot of columns, and it has a lot of rows too. DBAs have done the analysis and found that whenever users access this table, only a subset of all columns is accessed, with a date range criteria on the date column the table is currently row partitioned on. The access is so varied that the DBAs can't find a particular set of columns they should combine and put a CP on. So my first stab at columnar is to have a CPPA table and make every column of the table its own sole column partition. I want to achieve faster execution times by reducing I/O, so I have decided to go with COLUMN format. The PI column will become the PA column.

The place where I'm stuck is partitioning. Trying to decide what partitioning scheme works best in this scenario:

1. Row partitioning at first level and column partition at the next level OR

2. Column partitioning at first level and row partition at the next level

I think I know how #1 works when there is a date range select query that refers to only a small subset of columns. Teradata will first search only those rows that belong to the date range criteria and then refer to the specific column partitions and give back the data. If I choose this partitioning scheme, the table rows are sorted by row partition #, hash bucket, and uniqueness value. Within each container, the values are sorted by row partition #, column partition #, hash bucket, and uniqueness value.

I have no idea how #2 works. How are the table rows sorted when column partitioning is at the first level and row partitioning at the next level? Also, for the specific table I'm referring to, what partitioning scheme would you suggest?

-Suhail

Teradata Employee

Table rows are not sorted by rowid, since when dealing with rowids in a CP table, we are talking about the ordering of column partition values by rowid (or, in your case with single-column partitions, the ordering of column values by rowid). A table row is not stored as a complete row but is broken up into its column partition values corresponding to each of the column partitions of the table. Values are never sorted within a container, since within a container all the column partition values have the same internal partition number and hash bucket, and uniqueness is assigned incrementally (which defines an ordering, but one based only on the order in which table rows arrive to be inserted). Table rows may be sorted (or naturally arrive in some sorted order) prior to inserting the rows into the table to improve load efficiency, but a table row itself is split into column partition values, and each of those column partition values is inserted based on its internal partition number and hash bucket, with the uniqueness being the next higher number for that internal partition number/hash bucket.

1. Row partitioning at first level and column partition at the next level

Column partition values of table rows are ordered first by the row partition number (of the table row of which the column partition value is part), then by the column partition number of the column partition for the column partition value, then by hash bucket (which usually doesn't change), and then by uniqueness.

2. Column partitioning at first level and row partition at the next level

Column partition values of table rows are ordered first by the column partition number of the column partition for the column partition value, then by the row partition number (of the table row of which the column partition value is part), then by hash bucket (which usually doesn't change), and then by uniqueness.

To some degree either choice is okay, though there are some subtle performance differences (some of which are the same as for choosing the order of multiple levels of row partitioning). In your case, if the focus is queries, I would suggest using 2, based on the heuristics to usually put column partitioning first and to order levels in increasing number of partitions (I am guessing the number of column partitions in your case is smaller than the number of row partitions); this also simplifies the decision. But if you want to make sure you are doing the absolutely best thing and making the best trade-offs for your workload, you need to check both 1 and 2 for load performance, various query operations such as COUNT, etc. to see what works best overall (you might have to make some trade-offs, since for some things 1 may be better than 2 while for other things 2 may be better). Others with experience using columnar on actual workloads may have some advice, and I hope they chime in on this.

See also the orange books on partitioning and be aware of the considerations, trade-offs, etc. discussed in these orange books, such as 2-byte vs. 8-byte partitioning, multilevel partitioning, over partitioning, simple vs. complex row partitioning, COLUMN format vs. ROW format, single-column vs. multicolumn partitions, etc. There is a lot to learn (some learning comes with experience and experimentation) but hopefully the effort will be worthwhile in achieving higher performance for your workload.

Enthusiast

I apologize Paul. I still don't understand the difference between the 2 schemes. You mentioned:

"1. Row partitioning at first level and column partition at the next level

 

Column partition values of table rows are ordered first by the row partition number (of the table row of which the column partition value is part), then by the column partition number of the column partition for the column partition value, then by hash bucket (which usually doesn't change), and then by uniqueness.

2. Column partitioning at first level and row partition at the next level

Column partition values of table rows are ordered first by the column partition number of the column partition for the column partition value, then by the row partition number (of the table row of which the column partition value is part), then by hash bucket (which usually doesn't change), and then by uniqueness."

In the case of the table I referred to in my previous comment (where each column has its own CP), since the column partition# is the same for every column value, wouldn't both partitioning schemes end up doing the same thing?

How is there a performance difference between the schemes?

Ultimately, regardless of which of the above 2 schemes we use, every value in any column partition will be sorted by row partition #, hash bucket, and uniqueness value (because the column partition # is the same for the whole CP).

As we put all the column partitions together, the entire table will appear to be sorted by row partition #, hash bucket, and the same uniqueness value regardless of which partitioning scheme I choose, correct?

Also, different column values of a particular table row have the same uniqueness value (in the container) across different column partitions, right?

Teradata Employee

From a user point of view, either way ends up doing the same thing. The differences are subtle differences in performance due to the way the values are ordered in combination with how data is placed on the disks in data blocks.

Say there is a table with two columns cp1 and cp2 and the table is partitioned by a partitioning expression that defines 3 row partitions (designated below by rp1, rp2, and rp3).

1. PARTITION BY (partitioning_expression, COLUMN) is stored in the following order: cp1 values for rp1, cp2 values for rp1, cp1 values for rp2, cp2 values for rp2, cp1 values for rp3, cp2 values for rp3. To scan all the cp1 values looking for particular values, Teradata needs to read the cp1 values for rp1, skip over the cp2 values for rp1, read the cp1 values for rp2, skip over the cp2 values for rp2, and read the cp1 values for rp3.

2. PARTITION BY (COLUMN, partitioning_expression) is stored in the following order: cp1 values for rp1, cp1 values for rp2, cp1 values for rp3, cp2 values for rp1, cp2 values for rp2, cp2 values for rp3. To scan all the cp1 values looking for particular values, Teradata reads all the cp1 values straight through rp1, rp2, and rp3.

Either case works; the first case just has to skip over cp2 values to read through the cp1 values. Skipping over data blocks is fast, but since combined partitions can start in the middle of a data block, in case 1 the last data block of cp1 values in rp1 may end with some containers of cp2 values for rp1, the first data block of cp1 values in rp2 may begin with some containers of cp2 values for rp1, etc. For case 2, to read the cp1 values it can just read straight through, and only the last data block it needs to read might contain containers of cp2 values. On average, case 1 incurs an extra data block read for each row partition boundary; that is not a big deal if combined row partitions are big, but it can be if row partitions are small (though in that case, the table is over-partitioned -- not a good thing). For case 2, on average just 1/2 a data block of cp2 values for rp1 is read. So I generally recommend 2, as I suggested. However, 1 can win sometimes in loading situations, since data may be going into one row partition (say, loading the current month). Also, if you have a query that has row partition elimination to a single partition, case 1 might be slightly better than case 2 since it might minimize head movement on the disks.
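The two storage orders can be reproduced with a tiny Python sketch (labels only; `case1` and `case2` are invented lists of (column partition, row partition) labels in storage order, not actual data blocks):

```python
from itertools import product

col_parts = ['cp1', 'cp2']
row_parts = ['rp1', 'rp2', 'rp3']

# Case 1: PARTITION BY (partitioning_expression, COLUMN) -- row partition major
case1 = [(cp, rp) for rp, cp in product(row_parts, col_parts)]

# Case 2: PARTITION BY (COLUMN, partitioning_expression) -- column partition major
case2 = list(product(col_parts, row_parts))

assert case1 == [('cp1', 'rp1'), ('cp2', 'rp1'), ('cp1', 'rp2'),
                 ('cp2', 'rp2'), ('cp1', 'rp3'), ('cp2', 'rp3')]
assert case2 == [('cp1', 'rp1'), ('cp1', 'rp2'), ('cp1', 'rp3'),
                 ('cp2', 'rp1'), ('cp2', 'rp2'), ('cp2', 'rp3')]

# Scanning all cp1 values in case 2 is one contiguous run; in case 1 the
# run is interrupted after every row partition:
cp1_positions = [i for i, (cp, _) in enumerate(case2) if cp == 'cp1']
assert cp1_positions == [0, 1, 2]
```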

As the column partition values for a row (assuming just a simple select of rows from the table) are put together, the rows on each AMP (not the entire table) will appear to be sorted by row partition #, hash bucket, and uniqueness. The rows may not be returned exactly in that order, since rows come back from all the AMPs intermixed.

Column values for different columns of a particular table row will have the same row partition #, hash bucket, and uniqueness value across the different column partitions. They differ only by their column partition #.

Enthusiast

Great explanation Paul. Thanks a lot for answering my questions.

-Suhail

Enthusiast

Hi Paul,

Is CPPA only supported from 15.10? We're on TD 15.0 and when I try to run:

CREATE TABLE Orders
(
o_orderkey INTEGER NOT NULL,
o_custkey INTEGER,
o_orderstatus CHAR(1) CASESPECIFIC,
o_totalprice DECIMAL(13,2) NOT NULL,
o_orderdate DATE FORMAT 'yyyy-mm-dd' NOT NULL,
o_shippriority INTEGER,
o_comment VARCHAR(79))
PRIMARY AMP INDEX (o_orderkey) PARTITION BY COLUMN
UNIQUE INDEX (o_orderkey);

I get the error: 

CREATE TABLE Failed.  [3706] Syntax error: expected something between the 'PRIMARY' keyword and the 'AMP' keyword.

Teradata Employee

CPPA and CPPI require Teradata 15.10. CPNoPI was available starting in Teradata 14.0. CP may require a license and to be enabled by a support representative in order to use it.

If you want to experiment with CPPA or CPPI, you can try downloading Teradata Express for 15.10 at:

http://downloads.teradata.com/download/database/teradata-express-for-vmware-player

I think CPPA and CPPI are enabled in Teradata Express 15.10 and do not require a specific license to use but I haven't verified that.

Enthusiast

Paul, we're on 15.10 now but creating a table with Primary Amp index fails with:

"Executed as Single statement. Failed [9473 : HY000] Column Partitioning with a Primary AMP Index or a Primary Index has not yet been enabled.
STATEMENT 1: Create Table failed. ".

 

Below is the CREATE TABLE statement I tried. What needs to be done to enable creation of CPPA tables? FYI, if I create the same table as NoPI, it works, so it's not the case that the columnar feature is not enabled on our system. It's just that we're unable to create CPPA tables.

 

 

 CREATE TABLE edw_deploys.Orders
(o_orderkey INTEGER NOT NULL,
o_custkey INTEGER,
o_orderstatus CHAR(1) CASESPECIFIC,
o_totalprice DECIMAL(13,2) NOT NULL,
o_orderdate DATE NOT NULL,
o_shippriority INTEGER,
o_comment VARCHAR(79))
PRIMARY AMP INDEX (o_orderkey) 
PARTITION BY COLUMN
UNIQUE INDEX (o_orderkey);

Teradata Employee

Since the system was upgraded from 15.0 to 15.10 rather than being initialized at 15.10, the NoDot0Backdown flag must be set to TRUE. (In other words, you must give up the option to "back down" to 15.0 before you can use new features that would cause failures after backing down.) Please contact Teradata support for assistance.

Enthusiast

Thank you Fred. This worked. 

-Suhail