I know what the documentation generally states about using surrogate keys vs business keys, but I would like to know what everyone in the real world is doing. I have a couple of questions:
1. Does anyone work in a DW group that utlizes nothing but surrogate keys in their base (3nf) and semantic (dimensional) physical data bases?
2. What are the pros and cons of surrogate vs business keys from a real world perspective?
3. How do most people handle business keys from multiple source systems? source system id?
4. What is the main driver when deciding to use surrogate vs business keys?
I have been to a few Partners conferences and always bring the topic up to the SIG data modeling panels. When I communicated that our DW is 100% surrogate key based, they look at me like I am crazy. Yet when I talk to people out in the real world, it does not come across as such a strange scenario.
How does your shop handle surrogate vs business keys?
I thank you all in advance to your expert input!
In a perfect world or ideal scenario .... which, unfortunately we dont live in .... natural keys or business keys would be preferred as there wont be an overhead of generating and maintaining SKs. But as the number of source-systems increase the less reliable BKs generally get and so implemetations shift to SKs.
If you can use BKs, its considered good as there is no hasle of depending on SKs .... and BKs wont change with time. By this what i mean is .... once SKs are generated and core os populated .... imagine having an issue with SK table .... whole data is screwed. With BKs .... coming-in from source directly .... this is not a concern for DWH.
SKs provide a way to be able to identify single entity coming in from different source systems, which may have different natural keys at he source-side. Hence making it a powerful tool for integration purpose.
Regardin your questions:
1. Yes, surely!
2. Just stated some above.
3. Using SKs for a correct picture.
4. If they serve the purpose and give an actual picture or not.
Thanks Adeel... yes that does help. I know what the books say... just looking for real world scenarios. Many times I am presented with the argument:
"Surrogate keys should always be used because we never know if another source system will come online in the future". Hard to argue against that reasoning even though we know the likelyhood will be slim to none.
True, thats because creating & maintaining SKs is lot cheaper as compared to the efforts of introducing SK [if ever] once the new systems becomes the source of same information which is being brought-in by some source already.
In such scenarios, clients tend to put pressure to keep the same information in multiple tables hence effectively causing information to be un-relatable and creation of mini-data-islands to pop-up in the model/DWH .... which is never a good thing to have.
In addition to all the reasons mentioned above, surrogate keys could be handy when dealing with the certain cases of changes in the dimension data (slowly changing dimensions tye 2 , for example).
I work in an environment where we use natural keys in our Teradata DW and was told that, with Teradata, using surrogate keys is bad because of "how it spreads the data." Does anybody know what the rationale behind this statement would be?
I normally argue in favor of using surrogate keys in a data warehouse for the reasons mentioned above - it makes the DW more loosely-coupled with source systems, and makes handling type 2 SCDs easier. In addition, I think that it is more efficient, e.g. why store 30 bytes of composite key in both a fact table and a dimension table and have to do a multi-column join over that width of data x potentially millions of rows when you could just use a surrogate key - just a 4 byte integer and only one column in both fact and dimension? Wouldn't the latter be much more performant for querying, make both the fact and dimension take less space, and simplify the structure of the DW (simply fewer columns!)?
>>> 2. What are the pros and cons of surrogate vs business keys from a real world perspective?
Regarding the OP's question, Wikipedia has a good explanation of the pros and cons of surrogate keys: http://en.wikipedia.org/wiki/Surrogate_key
>>> with Teradata, using surrogate keys is bad because of "how it spreads the data." Does anybody know what the rationale behind this statement would be?
My initial reaction is that's a false statement. The Teradata Database handles surrogate keys as well as natural keys.
With a surrogate key, you have control over the data type and value range, so with proper choices, you should be able to achieve even distribution of data (and avoid skewed data) more easily with a surrogate key.
I wholeheartedly agree with Tom and would add that when natural keys are made up of multiple columns and used as a composite primary index then the opposite could be true.
I will address the following questions: What are the pros and cons of surrogate vs business keys from a real world perspective? ... with Teradata, using surrogate keys is bad because of "how it spreads the data."
In a parallel database, data is spread over nodes using a partitioning key. This key is hashed to the appropriate node. The same partitioning key always hashes to the same node. The goal is an even distribution of the data so that there are no peaks and valleys, which could cause bottlenecks in performance. The partitioining key does not have to be the primary key. If you use a natural attribute as the partitioning key, then the business has some control over this key in the sense that it understands what it is. A surrogate key can be any number, such as a purely sequentially generated number or a random number. It is more difficult to understand how this number will hash evenly across nodes. Every table has a primary key. This can be a natural key or a surrogate key. Every table will have a partitioning key. The partitioning key does not have to be the primary key, and often is not. Using a surrogate key as the primary key only affects the distribution of data if the surrogate key is used as the partitioning key.
So the above statement, "with Teradata, using surrogate keys is bad because of 'how it spreads the data,'" is not entirely true and is oversimplified. In summary, use of a surrogate key only affect spreading of the data if the surrogate key is used as the partitioning key.