To add more on choosing PI first do the data profiling: Consider both the column as candidate columns for PI selection and identify the following stats for each of them: 1. Number of rows 2. Number of distinct values 3. Typical rows per value 4. Number of NULL values 5. Value Access Frequency - Frequency of usage of this column in Where Clause 6. Join Access Frequency - Frequency of usage of this column in Join
Based on the above stats give priority to the 6 and 7 items, and you can have the data skewness upto 20% if that column is going to be mostly used in Join/Where.
Its a trade off between data distribution and retrival. Surrogate key can be a good PI candidate for data distribution but wont be effective in queries on the table. Thus if the table PI doesnt exist in queries, then the retrival would be slow because of higher number of AMPs involvement,data redistribution or dublication.
I believe you are in physical database design stage. Now in order to choose a good PI here you should have complete ELDM ( extended logical datamodel design ) The ELDM components are "join clause and Data demographics"
Join Clause inculdes : 1)join access frequency 2)Value access frequency 3)Value access rows
Data Demographics include: 1)Distinct rows 2)Max rows per value 3)max rows null 4)typical rows per value 5)change rating
Note: A great PI will have : high value access frequency ,join access frequency, reasonable distribution and a change rating below 2 (out of 10).