What you need to know before creating a table in Teradata

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
Enthusiast

What you need to know before creating a table in Teradata

The following information covers a handful of the basics of the Teradata architecture, to help new Teradata users better understand what considerations should go into creating a table.

Teradata Architecture

Teradata uses Massively Parallel Processing (MPP) to provide linear scalability of the system by distributing the data across a number of processing units (AMPs). Each record in a table is placed on an AMP. The more evenly the data is distributed across the AMPs for all tables, the better the system performs, because each AMP does an equal amount of work to satisfy a query. When the data is unevenly distributed, the AMPs with the most records work harder, while the AMPs with the fewest records are underutilized.

Key point: Teradata performs the best when data is evenly distributed

Primary Index

The primary index (PI) distributes the records in a table across the AMPs, by hashing the columns that make up the PI to determine which records go to which AMP. If no PI is specified when a table is created, the first column of the table will be used as the PI.

When creating a table, care needs to be taken to choose a column or set of columns that evenly distribute the data across the AMPs. A PI that distributes data unevenly will at the very least impact the performance of the table, and depending on the size of the table, has the potential to negatively impact the entire system.

Even distribution of the PI isn’t the only criteria to use when choosing a PI. Consideration should also be given to how the data will be queried. If the data can be evenly distributed using different sets of columns, then the determination of which columns to use should be based on how the data will be queried and what other tables it will be joined to. If two tables that are frequently joined have the same PI, then joining them doesn’t require the records to be redistributed to other AMPs to satisfy a query.

A PI doesn’t have to be the same as the primary key (PK) of a table. The purpose of a PI is to evenly distribute the data, while the purpose of a PK is to identify unique records. The PI and PK can be the same, but it isn’t required.

Key point: Picking the right primary index is critical to ensuring good table and system performance

Skew Factor

A table that has perfectly distributed data has a skew factor of 0%. The higher the skew factor is, the more unevenly data in a table is distributed. As a general rule, tables with a skew factor higher than 50% should be evaluated to determine if a different primary index would distribute the data better, and thereby improve performance.

Tables with fewer records than the number of AMPs will have a higher skew factor that 0%, simply because the records cannot be evenly distributed across all of the AMPs. For tables that have fewer records than the number of AMPs the skew factor of the table may not be improved by choosing a different primary index.

Key point: Skew factor indicates how evenly a table’s data is distributed

Table Creation Options

Along with choosing the PI of a table, another choice needs to be made when creating a table. The two options are SET and MULTISET, with SET being the default if neither is specified. A SET table prohibits duplicate records with identical values in every column from existing in the table, while a MULTISET table allows them. When using FastLoad or the TPT Load Operator, if you attempt to insert duplicate records into a SET table, the duplicates are discarded without any notification that an attempt to insert duplicates took place. A SET table with no unique primary index has a performance overhead because every record that is inserted must be evaluated to determine if a duplicate already exists in the table. This overhead can be minimized by defining a unique index on the table (see the Teradata Database Database Design manual for more information on minimizing duplicate row checks for tables without unique primary indexes).

Create table syntax examples:

CREATE SET TABLE … (results in a SET table being created)

CREATE MULTISET TABLE … (results in a MULTISET table being created)

CREATE TABLE … (results in a SET table being created with Teradata semantics mode and results in a MULTISET table being created with ANSI semantics mode)

Key point: SET or MULTISET in a table creation statement determines whether duplicate records can be stored in the table

Suggestions for further reading

Teradata provides extensive documentation at http://www.info.teradata.com/. Download the full documentation set for the release your site uses, and then start with the document titled either  “Introduction to Teradata Warehouse” for V2R6 and 12.0 or “Introduction to Teradata” for 13.0 and 13.10.

Tags (1)
18 REPLIES
bwb
Teradata Employee

Re: What you need to know before creating a table in Teradata

One minor correction: CREATE TABLE will result in a SET table being created if the session is using Teradata semantics; CREATE TABLE will result in a MULTISET table being created if the session is using ANSI semantics.

The semantics mode (Teradata vs. ANSI) can be set globally via DBSControl (SessionMode GDO), or at the session level by the user with CLIv2 applications (DBCAREA Transaction Semantics field) or certain Teradata utilities (e.g., PP2 TRANSACT(ANSI|BTET) or BTEQ .SET SESSION TRANSACTION ANSI|TERADATA). Note: despite the use of "transaction" in the options, more than transaction semantics are affected.
Enthusiast

Re: What you need to know before creating a table in Teradata

Thanks for providing the clarification on the effect of the Teradata and ANSI semantics mode on a CREATE TABLE statement when SET or MULTISET aren't specified. I've edited the article to add this clarification.
Enthusiast

Re: What you need to know before creating a table in Teradata

That was a good explanation....

is there a way that we can know which semantic we are in ?
for example if we connect from Query man .. what is the default semantic and if we connect from bteq what is the default semantic ?
Enthusiast

Re: What you need to know before creating a table in Teradata

You can tell which semantic mode you're with SQL Assistant or BTEQ using the following information:
SQL Assistant (AKA Queryman) - run the HELP SESSION command and look for the column named Transaction Semantics to tell whether you're in Teradata or ANSI mode
BTEQ - log into BTEQ and look for the message that says 'Tranaction semantics are' followed by BTET (Teradata) or ANSI

Re: What you need to know before creating a table in Teradata

Hi,

The explanation was very clear and crisp, thanks for the info.
Enthusiast

Re: What you need to know before creating a table in Teradata

If a SET table is create with a UNIQUE INDEX and/or a PRIMARY KEY INDEX on a subset of the columns in the table, will the database still examine each additional column for duplication?

Table ( colA_SK
, colB_NK
, colC_NK
, colD
, colE
, ... colZ)
UNIQUE PRIMARY INDEX (colA_SK)
UNIQUE INDEX (colB_NK, colC_NK)

With those unique indexes no combination of colB_NK and colC_NK can be duplicated, would the database continue to validate the uniqueness of the record with colD - colZ on a SET table? Or should you set it up as a Multiset table with the Unique Indexes. (ETL process will also be taking into account the NK and PK on the table for INSERTS vs UPDATES on the tabl)
bwb
Teradata Employee

Re: What you need to know before creating a table in Teradata

With a UNIQUE PRIMARY INDEX, there will be no duplicate row checking (there can't be any duplicate rows if the PIs are unique).

With a non-UNIQUE PRIMARY INDEX, but one or more UNIQUE INDEX, duplicate row checking is not necessary (since there can't be any duplicate rows if one or more SIs are unique). However, the ordering of the internal INSERT and UPDATE operations may result in a duplicate row check, and that certainly used to be the case; I have a query in to one of the DBS experts to find out if that is still the case.

Personally, I'm opposed to using MULTISET tables as a way to bypass the duplicate row check. It's all to easy to get into that habit, and if a particular MULTISET table has no UNIQUE indexes, the potential exists to end up with duplicate rows...and if the logic that accesses such a table assumes that there are no duplicates, you're potentially in big trouble.
Enthusiast

Re: What you need to know before creating a table in Teradata

Thank you for the question and the answer provided in the comments on duplicate row checking for SET tables. I've updated the article to provide clarification on when duplicate row checking occurs on a SET table, including a reference to the Teradata Database Database Design manual for further information.
bwb
Teradata Employee

Re: What you need to know before creating a table in Teradata

You're welcome. I did get confirmation that if there is any UNIQUE INDEX, there will be no duplicate row checking.

There is one exception. If the error-logging option is used on an INSERT-SELECT or MERGE INTO targeting a SET table, duplicate row checking will be done if the PRIMARY INDEX is non-UNIQUE, regardless of the presence or absence of UNIQUE INDEX.

Also, I have a couple of other comments on the article:

1. In addition to performance, a significant skew factor results in more space being taken for table data on some AMPs, and less on others. A skew factor of 50% would mean that some AMPs have (roughly) twice as much data for the table than others; that could cause disk space to be exhausted on the high-skewing AMP(s) when there's lots of space free on the other AMPs. Except for tables with very few rows, I would guess that a skew factor of more than 10% (perhaps more than 5%) is abnormal and probably indicates a poor choice of PI. 50% would just be huge (my classic example of such bad skewing is someone making the PI of a personnel table the "sex" column, where all the rows would end up on two AMPs). In all my years with Teradata, I've never heard of a real-world (i.e., customer) skew more than 10% for any table with a significant number of rows.

2. With regard to duplicate rows, the statement "If you attempt to insert duplicate records into a SET table, the duplicates are discarded without any notification that an attempt to insert duplicates took place." isn't strictly true. It happens to be true if FastLoad (or the TPT Load Operator) is used. However, when using MultiLoad (or the corresponding TPT Update Operator) or TPump (or the corresponding TPT Stream Operator), you can choose whether or not duplicate rows are recorded. If you're using SQL directly (e.g., BTEQ), duplicate rows will always be reported (2802 error).