Indexes, Too Much of a Good Thing?

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
Teradata Employee

Indexes, Too Much of a Good Thing?

Some of you may be familiar with relational databases other than Teradata and how those other RDBMs utilize indexes.

In Teradata, an “Index” is a physical mechanism that is used to distribute, store, and access data rows. Indexes provide a physical access path to the data and their use can avoid unnecessary full-table scans to locate rows.

To level set, let’s first consider the difference between a Key and an Index. A “Key” is a relational term. The Primary Key of a table is a column set that uniquely identifies a row in a logical table. A key is an identifier and not a physical mechanism.

In Teradata, there are four main types of indexes: Primary Indexes, Secondary Indexes, Join Indexes, and Hash Indexes. Each of these types of indexes have their own distinctive flavors and uses.

All Teradata Database tables require a Primary Index because the system distributes table rows to the AMPs based on primary index values. (* Teradata 13.0 has a NoPI feature for loading staging tables.) To accommodate Teradata’s massively parallel architecture, indexes use a hashing algorithm based on data row hash values as the most efficient means of distributing and retrieving data.

Primary indexes can be either unique or non-unique and partitioned or non-partitioned:

  • Unique primary index (UPI)
  • Non-unique primary index (NUPI)
  • Non-partitioned primary index (NPPI)
  • Partitioned primary index (PPI)

Partitioned primary indexes may have a single partitioning expression or multiple partitioning expressions:

  • Single-level PPI (SLPPI)
  • Multilevel PPI (MLPPI)

The main purposes for Teradata’s Primary Indexes:

  • Provide access to data rows, obviating the need to do full-table scans. By providing the values for all the primary index columns in your SQL WHERE clause, direct access to the AMP with the row(s) can be made using that primary index value. (Note: Teradata does not do ‘partial’ indexing, you must provide values for all the columns defined in an index.)
  • Determine which AMP a row will distribute to. Rows are distributed across the AMPs based on the hash of their Primary Index. The uniqueness of the data values in the primary index will affect how evenly the data distribution is across the AMPs.
  • Efficiency of join processing. Depending upon the choice of index, rows of the table may or may not have to be redistributed, spooled, and sorted prior to the join.

A trade-off that must often be considered is between the needs of full table scan queries and queries that can be range constrained in some way. For example, between strategic queries that need to scan a multi-year history of sales, and tactical queries that only need transactions from, say, the last 60 days. The compromise that’s often made is to break the large table into a number of range partitioned tables which work better for tactical queries than scanning one big table, and to UNION the partitioned tables together for strategic queries that need all the data. But breaking up what should be one table into multiple tables makes them harder to manage and maintain, and although it improves performance for tactical queries, it is not the optimum solution. Up to 64 columns can be specified for a primary index definition.

Partitioned Primary Index (PPI) is a table organization scheme that very elegantly enhances the existing Teradata structures and helps remove this trade-off dilemma. It’s an optional extension to Teradata’s hashed primary indexing that adapts it to more efficiently handle range queries.

What if, instead of having to fully scan a large table to satisfy a query we only had to scan 50% of it? What if we only had to scan 10%….or perhaps even less? This would have a huge, very positive impact on performance, and this is what PPI has been designed to address, to push the envelope for Teradata users who want to see better and better query performance whether it’s for tactical or strategic requests.

PPI has also been designed to be very easily set up and managed and to put a minimum additional burden on the DBA in keeping with our philosophy of low cost database management.

Secondary Indexes provide alternate access paths to the data and may be Hash-ordered or Value-ordered. Secondary indexes may be added or dropped as needed with the caveat that building them requires some amount of system resources and you’ll want to check with your DBA appropriate.

  • Unique secondary index (USI)
  • Nonunique secondary index (NUSI)

Join Indexes offer a variety of six subtypes that include:

  • Single-table join index (STJI)
  • Single-table aggregate join index (STAJI)
  • Single-table sparse join index
  • Multitable simple join index
  • Multitable aggregate join index
  • Multitable sparse join index

Join Indexes are highly recommended for decision support applications because they often provide superior performance to large table joins and aggregate computations. Any Join Index, whether simple or aggregate, multi-table or single-table, can be sparse. The create statement of the index uses a constant expression in the WHERE clause of its definition to narrowly filter its row population. A caution when choosing Join Indexes is that neither MultiLoad nor FastLoad utilities support tables with join indexes. If Join Indexes are needed, a possible workaround could be to FastLoad data into an empty table and then either MERGE or do a INSERT/SELECT into the table with the Join Index.

Hash Indexes are a similar to Join Indexes and have a narrower usage. Hash Indexes are limited to one table and the table’s Primary Index cannot be partitioned.

Hash indexes are useful for queries where the index contains the columns referenced by a query, which will allow the Optimizer to use the index to satisfy the query rather than the underlying base table. Hash indexes can also be defined on a table in place of traditional secondary indexes. Like the Join Index, FastLoad and MulitLoad do not support Hash Indexes.

Criteria for choosing Join Indexes vs NUSIs

The similarities between STJI and NUSI are that STJIs can be defined with the same columns as NUSI; the concept of index covering (the query can be satisfied by columns in the index without accessing the base table) applies to both STJI and NUSI; value ordering is available on both STJI and NUSI; and joins to either STJIs or NUSIs are possible. Teradata does not do partial index retrievals, however, it can use either STJIs or NUSIs to do Full Table Scans of their subtables instead of a FTS of the base table.

The basic differences between STJI and NUSI are that a STJI is similar to a table with a primary index with additional columns defined; a STJI row can be stored on the same AMP or a different AMP as table data row whereas NUSIs are stored on same AMP as table data row; and NUSIs are supported by MultiLoad, but not STJIs.

Advantages of Indexes

Indexes are a retrieval mechanism and the intent of indexes is to lessen the time it takes to retrieve rows from a database and eliminate full table scans.

Disadvantages of Indexes

This is where too much of a good thing can be a disadvantage. Whenever a base table row is updated, deleted, or inserted, Index subtables must be also be updated. The more secondary, join or hash indexes you have, the more maintenance that will have to be performed. All Teradata secondary, hash, and join indexes are stored in subtables and will require extra storage space. Some of the indexes are “allergic” to some Load Utilities so you’ll want to check on what may be impacted as a result of index choices.

A good practice is to use the EXPLAIN before you execute a query will help you determine which indexes are being used for your query.

One of the most important tasks of a DBA is to choose Indexes, CHOOSE WISELY!

19 REPLIES
Teradata Employee

Re: Indexes, Too Much of a Good Thing?

How is it possible that a table could be created without Primary Index (Teradata 13)? How is data distributed among the AMPs?
Teradata Employee

Re: Indexes, Too Much of a Good Thing?

Celia, That's a great question. In TD 13 there is an option to create tables with no primary index.

On this one, we broke all the rules, and the developers had a lot of fun doing it! As you know, on TD a table always has a PI, always has data distribution driven by the hash vlaue of the PI. Once rows arrive at their AMP, they are stored in order by hash value...and that is true of every table on TD.
In the case of No Primary Index Tables (NoPI), they are not hash distributed, and they are not hash ordered.

The goal of NoPI tables is to have a table that is very efficient to acquire data into. Like – ELT for batch or work tables. These table are not going to be positioned properly for joins or primary index access, they're intended to just be places to temporarily land and store data -- until you do the next processing step that generally would have to scan the table and read every row anyway. This will significantly improve the overall process.

For example, you define a NoPI table and load the data with a FastLoad. Instead of doing all the row redistribution, sorting, etc, we are going to take each block that FastLoads into the table and randomly pick an AMP that is going to get that block of data. Then, we'll takd that data block and insert it at the end of the table, that's it. Getting the data into the work tables is going to be a lot more efficient. 30+% improvement in acquiring data in FDL in phase one.

Insert/select into that table will be a little different because the data is not coming from outside so data it will flow into the table based on the AMP where it was created so final result of the query won’t be redistributed, we’ll store the data locally. At that point, inserting data into the table by row hash goes away and the rows just go to the end of the table (on that AMP). This table is only meant to be scanned and used for the next operation, not a specific select operation.
NoPI tables will reduce skew in the batch load operation and you won't have to fret or think about which column(s) would make the best index.

Secondary indexes are allowed on NoPI tables and secondary index access paths for SQL statements will work the same on a NoPI table as they do for a PI table.

That being the case, we all really know that every row in TD has a unique identifier, the rowid. For the NoPI tables, the system will generate a rowid based on the AMP where the row is being stored. No, you can't choose what you want it to be, and no, you can't use it as though it were a real Primary Index.
Teradata Employee

Re: Indexes, Too Much of a Good Thing?

Hi Alison,

i would like to know when we define indexes in particular for NUSI's is it recommeded to define NUSI's individually on column or in other way when do you think its appropriate to define composite NUSI's?
Teradata Employee

Re: Indexes, Too Much of a Good Thing?

Thanks, Alison, for your answer.
N/A

Re: Indexes, Too Much of a Good Thing?

Alison, did you know that if a single table JI is created to create a different access path to the main table (it has a NUPI) using a different field as its PI can cause the dreaded "Product Join" to occur when you are updating rows on the main table via a simple update statement (two table). We found that the optimizer for 6.2 has been very silly and tries to update the main table via the JI and as the JI has its PI as a NUPI it has many duplicate rows causing the "Product Join" to occur. The only solution we found was to implement the PK via a USI. Then the optimizer choose that path and did a cool join (Merge). Is there any other way to influence the optimizer ? (we tried adding the recommended indexes)
N/A

Re: Indexes, Too Much of a Good Thing?

Alison,

Related to the NoPI tables you said they are mostly useful for ELT operations. When you are doing ELT for the most part you join multiple tables and then load to a target. Typically you land data into work tables and then join the work tables to load a target table. If the work tables are not distributed by PI what happens when you join 2 or more tables? Don't the tables have to be redistributed which takes away any benefit you get by not having a PI in the first place? Also if the target table has a PI and the work table does not the work table will have to be redistributed before inserting into the target table. Is that correct or am I missinbg something?

Thanks
Sunny.
Teradata Employee

Re: Indexes, Too Much of a Good Thing?

Sunny, I think that "Say Yes to No Primary Index (No PI) Tables" article (http://developer.teradata.com/database/articles/say-yes-to-no-primary-index-no-pi-tables) can reply your question.

Re: Indexes, Too Much of a Good Thing?

hi alison,
ca you please tell me the diffence btwn the tables
created in user and a databse

Re: Indexes, Too Much of a Good Thing?

I'm trying to understand if I'm better of using a NUSI or STJI as a covering index. One thing I can't determine is whether or not a NUSI can take advantage of the partitioning in the underlying table or not.
Also, Alison mentions that there is no partial index retrieval with either type of index, but if an STJI is partitioned, won't we get partition elimination if the query has a predicate that filters on the partition column(s)?