Why no primary index (NoPI) for a Column-Partitioned Table?

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Teradata Employee

Let's start by understanding what a primary index does.  A primary index (PI) is used to distribute the rows of a table to the AMPs and on each AMP to order the rows by hash value within the combined partitions defined by a PARTITION BY clause (if this clause is not specified, there is just one partition, that is, the entire table).

One purpose of column-storage (implemented using physical rows called containers) for a column partition is to provide row header compression by having sequential values with increasing uniqueness without having an explicit row header (or rowid) for each value.  This means that, for a column-partitioned table with a PI, a container would only be able to contain values with the same internal partition number and hash value.  This is likely to cause over-partitioning (partitioning to too fine of a granularity).  That is, each container would have one or only a few values.  The row header overhead for these containers would cause a large increase (up to 23 times) in space usage.  The maintenance overhead could be very expensive.

Some of the complications of having a PI for a CP table could be avoided if the PI was only used for distribution of rows to AMPs but not to use it to order rows within combined partitions on the AMPs.  This would allow single-AMP access improving tactical queries against a CP table.  Also, this would allow for local-based joins on the PI columns.  However, merge joins where the rows are ordered by the hash of the PI columns would not be applicable and other local join methods would need to be modified or implemented.

While there are some use cases that might benefit when rows are distributed by a hash value (instead of randomly or locally) and either ordered by the hash value within a combined partition or not (in the case the column partitions are wide), they also require further implementation effort.  As the feature matures, enhancement opportunities will be explored.

Note:  all join methods are supported with a CP table.  However for some cases, selected rows from the CP table may need to be distributed/spooled/sorted prior to a join step (such as a merge join).  Since only the needed columns are scanned, a scan of the CP table can be much more efficient than scanning a non-CP table where entire rows must be read.

If a PI or partitioned PI (PPI) is needed for some queries (tactical queries, primary index joins, etc.), a CP table can have a join index on it where the join index does have a PI, PPI, or MLPPI (but not column partitioning).  Alternatively, a PI, PPI, or MLPPI table could have a join index on it where the join index (possibly sparse) has column partitioning (but not a primary index) plus optionally one or more levels of row partitioning.  Either of these would allow the benefits of both PI/PPI/MLPPI and column partitioning, albeit with the extra space usage and maintenance of the join index.  The optimizer would pick the table or the join index depending on which one was better suited for the query.

5 Comments
Enthusiast

Hi paul..

Is my table going to be Skewed?

I have one table with 10 millions rows for example

create table table 1

As

(column1 Interger

column2 Interger

)Primary Index(column2);

Now I am trying to create a table 2 and

create table table2

As

(column_1,

column_2)As

(SELECT 1,column2 from Table 1)WITH DATA;

So will my table2 have no issues and contain the same rows as in table1 or Do need I need to define the primary Index on coulmn_2 to fix the skewness.

Please do let me know




Teradata Employee

1. Is my table going to be skewed?

Probably (after you fix the syntax errors) but depends on your system's setting for PrimaryIndexDefault.

2. So will my table2 have no issues and contain the same rows as in table1 or do need I need to define the primary index on column_2 to fix the skewness?

You will probably have issues. But table2 will contain the same number of rows as table1 and the same values for column_2 in table2 as for column2 in table2. You need to define a primary index on column_2 or define (or default to) NO PRIMARY INDEX to avoid skew.

In general, depending on the default PI (or NoPI) chosen for a table being created is not a good idea. This is especially problematic if you do not know the rules for how the system chooses the default, know the system settings, etc. plus know that the chosen default is what you actually want for the table, its data, its queries, and its data maintenance. Since it is unlikely you will know all the rules (I have trouble remembering them), it is usually better to specify the primary index (or NoPI) that you want rather than depend on the default. If you do create a table without specifying the primary index (or NoPI), the easiest thing way to find out what was chosen is to do a SHOW TABLE on the table (however, you might want to do this before copying the data).

Let's look at your example after making a few syntax corrections (and assumptions).

CREATE TABLE table1 AS (column1 INT, column2 INT) PRIMARY INDEX (column2);

Now, insert 10 millions rows into table1. I am assuming that these inserted rows are evenly distributed to AMPs based on the hash bucket determined from the hash value for each value of column2. However, if there is skew in the values for column2 (for instance, only a few distinct values), table1 would be skewed.

Then create table2.

CREATE TABLE table2 (column_1, column_2) AS (SELECT 1, column2 FROM table1) WITH DATA;

Since you are creating table2 using a SELECT statement (and not directly from table1), table2 does not get the primary index definition from table1. Instead, it gets a default chosen by the system unrelated to the indexes defined for table1.

If the system chooses to use column_1 as the primary index (based on your system setting for PrimaryIndexDefault), table2 is skewed to one AMP. For some tables (for instance, a small table), maybe that is what you want in order to be able have single-AMP access to the data (if this is what you wanted, there are no issues). But unless you have a huge system where 10 million rows is considered small, you probably would not want that choice.

On my system, the default chosen is NO PRIMARY INDEX because the DBS Control setting for PrimaryIndexDefault is N. In this case (assuming table1 is not skewed), table2 is not skewed (since data is locally copied on each AMP from table1 to table2). However, even though table2 is not skewed, you might have wanted table2 to have a primary index instead of no primary index.

Other factors (but not a complete list) that can affect the default chosen are whether there are PRIMARY KEY or UNIQUE constraints and whether PARTITION BY is specified.

So if you want the primary index to be on column_2, you should do the following.

CREATE TABLE table2 (column_1, column_2) AS (SELECT 1, column2 FROM table1) WITH DATA AND STATISTICS PRIMARY INDEX (column_2);

In this case, whether table2 is skewed or not would depend on whether table1 was skewed or not on the values of column2. I added the AND STATISTICS option because most likely you would want table1's statistics copied over to table2 along with the data (assuming you had collected some statistics on table1).

The only time it is probably safe to not specify the primary index is when you simply want an exact copy of the table (assuming the source table doesn't already have issues).

CREATE TABLE table2 AS table1 WITH DATA AND STATISTICS;

Paul,

I would like to understand below Note as per your article.

Can two CP tables be joined as Left outer join or Inner Join as regular PI table can?

Is it like Join Method is different from Join types?

Note:  all join methods are supported with a CP table.  However for some cases, selected rows from the CP table may need to be distributed/spooled/sorted prior to a join step (such as a merge join).  Since only the needed columns are scanned, a scan of the CP table can be much more efficient than scanning a non-CP table where entire rows must be read.

Teradata Employee

Good question. This can benefit from some clarification.

I was using join method to mean the method used in a join step (for example, merge join, nested join, sliding-window join, hash join, product join, etc.).  These are used to implement the joins specified in a request (such as left outer join, inner join, etc. -- call these join types).

Two CP tables can be joined using the same join types as a PI table, however, the plan and join method (step) may be different. In some cases, the join method may be the same (e.g., merge join) but there may be one or two additional retrieve steps before the join step (and the join step may be for the spools from the retrieves rather than directly on the CP tables). A CP table can also be joined to PI or NoPI table or vice versa in a request (you just may get different plans for different combinations and demographics).

Teradata Employee

Hope to see you at Partners Conference next week!