The following information covers a handful of the basics of the Teradata architecture, to help new Teradata users better understand what considerations should go into creating a table.
Teradata uses Massively Parallel Processing (MPP) to provide linear scalability of the system by distributing the data across a number of processing units (AMPs). Each record in a table is placed on an AMP. The more evenly the data is distributed across the AMPs for all tables, the better the system performs, because each AMP does an equal amount of work to satisfy a query. When the data is unevenly distributed, the AMPs with the most records work harder, while the AMPs with the fewest records are underutilized.
Key point: Teradata performs the best when data is evenly distributed
The primary index (PI) distributes the records in a table across the AMPs, by hashing the columns that make up the PI to determine which records go to which AMP. If no PI is specified when a table is created, the first column of the table will be used as the PI.
When creating a table, care needs to be taken to choose a column or set of columns that evenly distribute the data across the AMPs. A PI that distributes data unevenly will at the very least impact the performance of the table, and depending on the size of the table, has the potential to negatively impact the entire system.
Even distribution of the PI isn’t the only criteria to use when choosing a PI. Consideration should also be given to how the data will be queried. If the data can be evenly distributed using different sets of columns, then the determination of which columns to use should be based on how the data will be queried and what other tables it will be joined to. If two tables that are frequently joined have the same PI, then joining them doesn’t require the records to be redistributed to other AMPs to satisfy a query.
A PI doesn’t have to be the same as the primary key (PK) of a table. The purpose of a PI is to evenly distribute the data, while the purpose of a PK is to identify unique records. The PI and PK can be the same, but it isn’t required.
Key point: Picking the right primary index is critical to ensuring good table and system performance
A table that has perfectly distributed data has a skew factor of 0%. The higher the skew factor is, the more unevenly data in a table is distributed. As a general rule, tables with a skew factor higher than 50% should be evaluated to determine if a different primary index would distribute the data better, and thereby improve performance.
Tables with fewer records than the number of AMPs will have a higher skew factor that 0%, simply because the records cannot be evenly distributed across all of the AMPs. For tables that have fewer records than the number of AMPs the skew factor of the table may not be improved by choosing a different primary index.
Key point: Skew factor indicates how evenly a table’s data is distributed
Table Creation Options
Along with choosing the PI of a table, another choice needs to be made when creating a table. The two options are SET and MULTISET, with SET being the default if neither is specified. A SET table prohibits duplicate records with identical values in every column from existing in the table, while a MULTISET table allows them. When using FastLoad or the TPT Load Operator, if you attempt to insert duplicate records into a SET table, the duplicates are discarded without any notification that an attempt to insert duplicates took place. A SET table with no unique primary index has a performance overhead because every record that is inserted must be evaluated to determine if a duplicate already exists in the table. This overhead can be minimized by defining a unique index on the table (see the Teradata Database Database Design manual for more information on minimizing duplicate row checks for tables without unique primary indexes).
Create table syntax examples:
CREATE SET TABLE … (results in a SET table being created)
CREATE MULTISET TABLE … (results in a MULTISET table being created)
CREATE TABLE … (results in a SET table being created with Teradata semantics mode and results in a MULTISET table being created with ANSI semantics mode)
Key point: SET or MULTISET in a table creation statement determines whether duplicate records can be stored in the table
Suggestions for further reading
Teradata provides extensive documentation at http://www.info.teradata.com/. Download the full documentation set for the release your site uses, and then start with the document titled either “Introduction to Teradata Warehouse” for V2R6 and 12.0 or “Introduction to Teradata” for 13.0 and 13.10.