I have a transactional dataset which has a date field, my query is whether it is better to bring in the date field into the PI for the table i am using or leave it out. The PI is currently made up of a product unique identifier which is duplicated for each date related to the product. I am performing updates and joins using just the unique product identifier as well as updates using the calendar date as well.
If include the calendar date in the PI will that remove any performance benefit for those joins where calendar date is not a join condition?
To choose a Primary Index you must keep in mind:
1-Access columns (where clause) to maximize the number of one-AMP operation
2-Distribution demographics. If you get a nice distribution then you will optimize the parallel processing
3-Balance great access-distribution because many times index candidates offer poor distribution
4-The PI should not be very volatile (those columns should not be updated frequently)
So, probably is you include the date column in the PI you will performs better updates using calendar date as well and worse update/joins using only the product column. You should balance how much you improve the first one vs how much you get worse on the second and how often you do one thing and few other
Be cautious while choosing the primary index because it affects the data storage and performance.
The following are the important tips while choosing the primary index.
1. Data Distribution.
You need to analyze the number of distinct values in the table. If the primary index of the table contains less number of null values and more distinct values, it will give better the performance.
2. Access frequency.
The column has to be frequently used in the where clause during the row selection.
The column should be that which is frequently used in join process.
The column should not be frequently changed.