We are designing a dimension of ITEMS (in a Point of sales DWH) as a table with Name-Value Pairs i.e. ITEM attributes stored as ROWS instead of COLUMNS.This ITEM table is joined to a fact table containing SALES $ and QUANTITY.
CREATE TABLE ITEM_TRAIT (ITEM_KEY INTEGER NOT NULL, TRAIT_CODE CHAR(10), TRAIT_VALUE VARCHAR(100)) unique primary index (ITEM_KEY,TRAIT_CODE);
++++SQL to query ITEM/FACT+++ SELECT CAL.WEEK AS WEEK, (case when item.trait_code='BRAND' then trait_value end ) AS BRAND, SUM(FACT.POSSLS) AS POSSALES, SUM(FACT.POSQTY) AS POSQTY FROM FACT, CALENDAR CAL, ITEM_TRAIT ITEM WHERE FACT.PERIOD_KEY = CAL.PERIOD_KEY AND FACT.ITEM_KEY=ITEM.ITEM_KEY AND CAL.PERIOD_KEY between 20100101 and 20100228 AND POSSLS <> 0 GROUP BY 1,2 ++++++++++++++++++++++++++
So for 10000 ITEMS each with 30 attributes,this table contains 90K rows.
Typical queries ask for sales/quantity (from fact) grouped by BRAND for a time period
How does the optimiser know that the ITEM table contains repeated ITEM_KEY?
When ITEM table joins to FACT,the optimiser does not seem to "know" that BRAND is an attribute,unlike a normal dimension and seems to incur a lot more I/O ...This is compared to a similar SQL to a "normal" ITEM table and the response time and I/O are significantly less.