I have defined a PI for an table. Now I need to validate whether the columns selected for PI are the right candidate. What are the ways to go for it?
How can I showcase whether all the AMP's are optimally used?
Which tools should I use to focus on performance aspect?
The thing about the Primary Index is that ISN'T all about data distribution. If you select a PI that perfectly distributes the data across the AMPS, but is never used as a Join column or in a query predicate, you're doing yourself and your database a dis-service.
A key priority of the PI is that it should define the most common access path, some data skew is perfectly acceptable. There are probably as many definitions for accetable skew as there are people who can spell "skew"...so don't look for one "perfect" answer, personally I try to keep data skew below 20-25.
When you define the same PI on two commonly joined tables, you enable your TD system to create AMP local joins, with no redistribution. You can see this in the explain plan.
Remember, data has to be on the same AMP to be joined. If you have two tables with different PI's, the optimizer will pick the smaller (if it can) of the two tables, assuming you have current statistics collected, do a FTS on the smaller table, bring it into spool, re-hash the rows on that table in spool and push it back down to the AMPs according to the re-hashed column ( the PI of the larger table) and THEN join your rows.
If you can define the same PI on two freuqently joined tables and give up some skew, you can eliminate the FTS, re-hash and redistribution.
Consider the Sales Order Header and Sales Order Line Item tables, a fairly common occurence. The natural identifier for the Header Table and by default, initial PI selection, is some thing SalesOrderHeaderID. The Line Item table also has a natural identifier, but in this case it happens to be a composite PI, composed of SalesOrderHeaderID ( to relate each line back to its parent header row) and LineItemID (to differentiate each line item within the order).
If you select SalesOrderHeaderId, LineItemID as a composite PI on the Line Item Table, you've now hashed ALL of your Item rows to different amps than your header rows, perfectly distributed, but NOT co-located. The query that joins headers and line items now has to grab the line item table complete a full table scan, rehash it to the salesorderheaderid and then join it. All you've achieved is data distribution at the more complex processing.
If you accept some skew and define the PI of the LineItem table as SalesOrderHeaderId alone, the child rows will hash to the same amps as their parent rows, run the same query and you've now got amp local operations...
Thanks Blaine and Wasif for wonderful replies. Blaine, your reply was surely very insightful and would help me a lot.
Continuing on the above..
We have the concepts of foreign keys (either Soft RI or hard RI). It is oftne recommended to go for soft RI since it wont hard stop any insertion of records as well used by query optimizer.
How can one benefit by using RI (used by optimizer) compare to PI definition? Is there any substitute of not opting for PI having composite key and instead going for RI?
thanks :) it was helpful and concrete. I would like to see in teradata forums more concrete examples like this. it was hard for me to find useful script like this, since there are many descriptive scripts with many unknown parameters, and useless explanations. this was clear and it works!
Soft RI is basically a notification to the optimizer that two tables are logically connected by a foreign key, and thus it affects complex join plans. But if two tables are frequently joined and they have a PI column in common (as in the case of a parent-child relationship), then defining the same PI on both tables prevents having to redistribute rows for the join. Defining Soft RI would not make up for the loss of this benefit.