I expect to get another Amp distribution if I defined my primary index as (col1, col2) instead of (col2, col1), but will this also result in another Skew? Is the order of column in the primary index definition significant or not?
No the order in the PI definition doesn't change the data distribution scheme.
Data will land on the same AMP whether you select (col1, col2) or (col2, col1) as the PI.
You can use the HASHAMP(HASHBUCKET(HASHROW(<<Column_List>>))) to check the columns which provide better distribution.
Ok. What if the Column are changed in the CREATE TABLE?
CREATE TABLE T1 (Col1, Col2) PRIMARY INDEX (Col1, Col2);
CREATE TABLE T2 (Col2, Col1) PRIMARY INDEX (Col1, Col2);
Will I get the same Skew now?
Is the Column order in the primary Index also not important in Teradata 14.10?
It never makes a difference. As long at it is the same columns, the hashing will be the same. Nothing you do to the order or the position in the table or the PI will make any difference in the way the data is hashed.
This is on purpose BTW. We wanted to always make sure that tables hash the same way for joins, spools,... regardless of the order the columns are presented to the hashing algorithm. The algorithm is very carefully crafted to create the same hash regardless of the order.
Ok. I am glad to hear that the order of the columns in the pi is not significant.
The problem I am sitting with right now is to get a good distribution, this is the only goal. We have a lot of backup tables, that most of the time just sit there collecting data, then once a week or once a month, all data are deleted and the collection of backup data start over again. These tables are the only in case of extreme emergency, and I think we only used then, maybe 1 or 2 times the last 3 years. Because some of these tables are extreme huge, just a little bit of skew can take up a lot of space.
I thought that just selecting all columns would give the best skew, but this seems not to be the case. I found the adding a column could result in worse skew so that Col1, Col2 would give a skew of 60, but adding col3, “Col1, Col2, Col3” would give a skew of 70. Actually giving the table this PI and moving data into it gave nearly the same Skew when I looked in the Administrator. Also I have seen that adding a Column where all values are identical, like a column containing only “0” could end up giving a better Skew.
Adding another column to a PI should not increase skew (unless you test it on avery small table).
How many AMPs are on your system and how many rows do you use for testing?
Usually you get almost no skew when your PI columns are (almost) unique, so best case for distribution is the Primary Key of tablke.