I've been looking on the internet for a best practise for creating tables with primary indices and keys but i've only found fragmented non concluse information so i hope the experts here can give me their opinion.
UPI in Teradata works to distribute the data across all AMPS. PK is not works like that.
At a very high level, logical primary keys can be enforced with a unique secondary index (USI) allowing the table's primary index to be defined to provide the best data distribution, data access, and/or joins with related tables in the model.
Just some clarifications on it:
PI (Unique or non-unique): Primary index is used in Teradata for data distribution. The more even the data distiribution is, the better the perfomnce will be
PK (always unique): Primary key is a relation modelling concept which is use to uniquly identify the rows in a table. It has nothing to do with data distirbution. As robpaller mentioned, in Teradata USI (Unique secondary indext) is applied at back end whenever you create a PK in Teradata table, unless that PK column is not set as a UPI
I would take issue with the statement:'The more even the data distiribution is, the better the perfomnce will be'
There are 3 things you should take into account when defining the primary index:
1) Volatility - the primary index value should rarely, if ever, change.
2) Distribution - the data should be distributed fairly evenly across the amps
3) Access - where possible the primary index should be used to access the data or join to other tables
If you just want perfect distribution then you would always choose a unique primary index for all tables. In a previous role, working for a bank, this is what most people did on almost all tables. However performance was terrible!
If you consider all of the above criteria then you would probably decide on using something like sortcode and account number for most of the tables, so that tables join locally on the amps, and queries for individual customers would be single amp. Large transaction tables should be made multiset, which removes any duplicate row checking, to improve efficiency when populating them.
If you want to ensure data integrity then a unique secondary index on the primary key values can do this.
A couple comments on the original post:
- In Teradata, a UPI on a table will result in only the columns comprising the UPI being checked for uniqueness. Making it a set or multiset table does not matter, we will always only look only at the UPI fields to determine uniqueness of the row and enforce no duplicates.
- The relational primary key (RPK) with all of what it implies in relational theory is a great thing to know in your logical modeling. There will be some tables that do not have a natural one - eg logs - but for ones with a natural key it is a good thing to know to understand how the data in your model relates.
- As noted above, whether on not you use that RPK as your PI or whether you express that RPK at all in your physical model in the database depends on many other considerations. These are described in the database management manual or the Physical Database Design Course. They include the common join access paths, data distribution considerations, tactical query accesses and many more.
- In many cases the RPK will be a good choice for a UPI, especially among reference/dimension tables. In fact, detail, log and other tables it is less likely that the RPK will be the right UPI choice because of the other considerations. When the RPK is not the UPI, then it is possible to define the RPK on the table via the UNIQUE clause. This has an additional overhead to create a structure underneath to enforce the uniquess - Teradata uses the Unique Secondary Index structure to do this. Many times this is not done in the physical model because the RPK is enforced in the ETL processing and there is a desire to avoid the overhead.
- When the RPK is not the UPI and is not materialized with a UNIQUE clause, then it is generally very important to collect stats on the RPK as it will generally be used to qualify and join to the table and the fact of its uniqueness is important for the optimizer to know.
- Finally, if you just say UPK on a table and do not define a primary index (UPI or NUPI) then the UPK will be implemented as the UPI on the table and will be indistinguishable from having defined it as a UPI.
Thank you all for you extensive replies. It clarified the importance of the index and keys but i still not clear about the following:
- I understand from your comments its always advisable to either use PK or set the secondary indices ( so this is best practise for data integrity). Where setting the PK will under water set the secondary index. What is your opinion about only using UPI's and no PK or secondary index?
- A key field doesnt allow null. I was under the impression that indexes can be null.? So the secondary index is different in that way?
PK or UNIQUE in Teradata do not allow the column(s) to be nullable. For the UPI definition, you control whether nulls are allowed by specifying the appropriate null option on the column. If you allow Null in the columns, Teradata still enforces the uniqueness - allowing only one row to be null for instance if there is only one column in the UPI definition. It is inadvisable to allow nullable columns in your UPI or NUPI unless there is a compelling data design reason to do so because it can easily result in surprising join/query results to users if they do not understand the way the nulls will operate. But we allow them anyway. When you are using PK definition, we assume you are identifying the RPK and in relational modeling null is not allowed in the RPK so ANSI and Teradata disallow that combination.
Whether or not to use secondary indexes should be determined solely on the need to index for query performance. Many tables in Teradata implementations have only a PI and no secondary indexes of any type. We recommend that you start with a minimum of indexes after determining a good PI using all the PI considerations. Then add an index or two to support specific high frequency queries or short SLA query requirements. You should find that you need many fewer indexes on Teradata tables than on other database technologies and that creating indexes in the same way as on those other technologies will result in a lot of overhead on data change operations with little payback on the queries.
As noted above whether the RPK is identified on the table should be first a function of whether it is the proper choice for the PI and second for whether the ETL process already enforces the uniqueness. For instance if the RPK is not the right choice for the PI and the ETL process does the enforcement already, then it may well be appropriate not to identify the RPK at all in the physical model and avoid the overhead of enforcing it at the database level.