Helping a join perform well with significant unknown volumes
I tried searching the forum for this but couldn't find the right search terms to get anything relevant.
We have a large fact table: approximately 10 billion records covering 12 months of history, some of which link to a customer and some of which do not.
Due to legacy data issues, about 50% of the historical records have no Customer Key. Going forward, only about 10-15% will be missing a Customer Key.
We initially set the PI (Primary Index) on the fact table to the customer key to allow fast joins to the customer info. But with half the rows sharing the single "unknown" customer key, the skew was far too high, at least for the historical data. So we had to choose a non-customer PI, which gave relatively terrible join performance.
Now we want to split the fact table physically into a "known customer" version and an "unknown customer" version. The known version would have the customer key as its PI; the unknown version would use a different key. We'd then put a view over the two tables to UNION ALL the data back together for consumption.
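To make the intended design concrete, here is a rough sketch of the split plus the consolidating view. This is Teradata-style DDL; all table and column names are made up for illustration:

```sql
-- Hypothetical sketch: known-customer rows, PI on customer_key for fast joins
CREATE TABLE fact_sales_known (
    customer_key INTEGER NOT NULL,
    txn_key      BIGINT  NOT NULL,
    txn_dt       DATE,
    amount       DECIMAL(18,2)
) PRIMARY INDEX (customer_key);

-- Hypothetical sketch: unknown-customer rows, PI on a different key to avoid skew
CREATE TABLE fact_sales_unknown (
    customer_key INTEGER,            -- always the single "unknown" key value
    txn_key      BIGINT  NOT NULL,
    txn_dt       DATE,
    amount       DECIMAL(18,2)
) PRIMARY INDEX (txn_key);

-- View that unions the two tables back together for consumers
REPLACE VIEW v_fact_sales AS
SELECT customer_key, txn_key, txn_dt, amount FROM fact_sales_known
UNION ALL
SELECT customer_key, txn_key, txn_dt, amount FROM fact_sales_unknown;
```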
When joining the view to the customer dimension we still get bad performance because of the unknown side of the union. Is there any way to tell the optimizer that, for that side of the union, the join will only ever hit the single "unknown" row of the customer dimension, so that performance improves?
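For reference, the problem query looks roughly like this (again with hypothetical names), where the unknown branch of the view contributes billions of rows that all join to one dimension row:

```sql
-- Hedged sketch of the consuming query; names are assumptions
SELECT d.customer_name,
       SUM(f.amount) AS total_amount
FROM   v_fact_sales f
JOIN   dim_customer d
  ON   d.customer_key = f.customer_key
GROUP BY d.customer_name;
```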