Any New School of Thought to Optimize Multi-TB Data Mart Analytics?

Data Modeling

A typical two-year Sales Data Mart contains both detail and aggregates in order to support time-traversal metrics such as 4-Week Shipments, 13-Week Shipments, and so on...
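For context, a minimal sketch of how such rolling metrics are usually derived from the weekly detail fact with window functions (table and column names here are illustrative, not from an actual mart):

SELECT store_id
     , product_id
     , week_end_date
     -- trailing 4-week and 13-week shipment totals per store/product
     , SUM(shipped_units) OVER (PARTITION BY store_id, product_id
                                ORDER BY week_end_date
                                ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)  AS shipments_4wk
     , SUM(shipped_units) OVER (PARTITION BY store_id, product_id
                                ORDER BY week_end_date
                                ROWS BETWEEN 12 PRECEDING AND CURRENT ROW) AS shipments_13wk
FROM   weekly_shipment_fact;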



The typical dimensional or normalized data structures in such marts create challenges from both a processing and a data-delivery standpoint. In spite of maintaining varied levels of aggregates and pre-calculated metrics, evolving and changing business queries often have to hit the detail fact, and that does not provide the expected performance.

Yes, I understand that we can spend more on technology and procure more powerful database servers such as Teradata, Netezza, etc., but is that the only solution? In fact, we did spend a lot of money buying technologies and additional Teradata nodes, yet the data still explodes exponentially in terms of row counts.



The business's evolving needs always slice and dice data across all dimensional contexts, and it is practically impossible to create enough aggregate layers to support every business query. The aggregate processing layer becomes even more complex when dimensional hierarchies change, whether temporarily or through data anomalies, because that leads to aggregate rebuilds or adjustments. For example, if a company has a distribution channel spanning 500K stores (200+ attributes) within the US, approximately 10K products (100+ attributes) in its catalogue, and partners through nearly 2,000 distributor warehouses (100+ attributes), the data explodes exponentially in a typical dimensional or normalized data model, easily reaching billions of records. From a technology standpoint you can certainly add more nodes to distribute the processing, but isn't this also something that needs to be looked at from a data-structure standpoint?
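Just to make that explosion concrete, a rough back-of-the-envelope count under an assumed fill rate (the 1% density figure is purely an assumption for illustration):

500,000 stores x 10,000 products = 5 billion possible store/product combinations
5 billion combinations x 104 weekly snapshots (two years) = 520 billion potential detail cells
at an assumed 1% fill rate, that is still roughly 5 billion detail rows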



Please do note that normalization exists to ease processing and dimensional modeling exists to bring data analysis closer to the business-process outlook; what about data structures designed to ease access performance?

Is there any simpler approach that eases both processing and data-analysis delivery, to optimize 360-degree business dive-in analysis?

1 REPLY
Enthusiast

Re: Any New School of Thought to Optimize Multi-TB Data Mart Analytics?

Apologies in advance if some of this does not make sense to you; it is just my viewpoint.

There was a client some time back, when I used to work at a consulting company. We simply asked them: what are your priorities in terms of high, medium, low, and useless? How come useless?! Let us segregate them; there may be something behind the scenes that you are unaware of. Let us do a clinic check-up. After it was done, the client realized the benefit.

How about partitioning the tables, putting the older data into partitions with coarser granularity and the newer data into partitions with finer granularity? Maintenance may come into the picture further down the road.
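A minimal Teradata-style sketch of that idea, with the older year at monthly grain and the current year at daily grain (all object names and date ranges are illustrative, and the exact RANGE_N syntax should be checked against your release):

CREATE TABLE sales_mart.shipment_fact
( ship_date      DATE          NOT NULL
, store_id       INTEGER       NOT NULL
, product_id     INTEGER       NOT NULL
, shipped_units  DECIMAL(18,2)
)
PRIMARY INDEX (store_id, product_id)
PARTITION BY RANGE_N (
      ship_date BETWEEN DATE '2013-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' MONTH  -- older year: coarser monthly partitions
              , DATE '2014-01-01' AND DATE '2014-12-31' EACH INTERVAL '1' DAY            -- current year: finer daily partitions
              , NO RANGE, UNKNOWN);

Repartitioning or dropping the oldest ranges later is exactly the maintenance overhead mentioned above.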

I assume you have already done pre-joins and join indexes. Denormalizing with repeating groups can be one pointer. Wherever you feel denormalization is warranted, it can be done with throw-away GTTs, volatile tables, or views. A normalized model can answer unseen questions, whereas a dimensional model answers preset questions, and a dimensional mart may not be able to join to other marts.
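On the pre-join/pre-aggregate point, a small sketch of an aggregate join index (names are hypothetical, and whether the optimizer actually rewrites queries against it depends on your workload and release):

CREATE JOIN INDEX sales_mart.store_day_aji AS
SELECT store_id
     , ship_date
     , SUM(shipped_units) AS shipped_units   -- daily totals by store
FROM   sales_mart.shipment_fact
GROUP  BY store_id, ship_date
PRIMARY INDEX (store_id);

Store-level trend queries can then be satisfied from the join index without touching the product-level detail, and the system keeps it in sync automatically.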

I feel a semantic model on top of 3NF is a good choice. If you are serving many lines of business, then a core, common semantic model can be there to serve the others. If a monster dimension grows exponentially, how about building bridge and demographic (mini) dimensions?
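A rough sketch of the demographic/mini-dimension idea for the monster store dimension (all names are hypothetical): keep the stable attributes in the main store dimension and push the volatile, low-cardinality attributes into a small mini-dimension whose key travels on the fact.

CREATE TABLE sales_mart.store_demo_dim
( store_demo_id   INTEGER      NOT NULL   -- surrogate key carried on the fact row
, store_size_band VARCHAR(20)
, region_tier     VARCHAR(20)
, promo_profile   VARCHAR(20)
)
UNIQUE PRIMARY INDEX (store_demo_id);

-- Queries then join the fact to the slim store dimension plus the
-- mini-dimension instead of one 200-plus-attribute monster:
-- FROM shipment_fact f
-- JOIN store_dim      s ON s.store_id      = f.store_id
-- JOIN store_demo_dim d ON d.store_demo_id = f.store_demo_id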

If someone says that processing the data is challenging, then it is time to revisit the tools, utilities, and technique. Teradata has robust utilities to load data depending on the scenario: TPT, MultiLoad, FastLoad, INSERT...SELECT, TPump, and so on.
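For example, a bare-bones FastLoad script for bulk-loading a pipe-delimited file into an empty staging table might look roughly like this (logon details, file path, and object names are placeholders; check the exact syntax for your environment):

LOGON mytdp/load_user,load_password;
DATABASE sales_mart;
SET RECORD VARTEXT "|";
DEFINE store_id   (VARCHAR(18))
     , product_id (VARCHAR(18))
     , ship_date  (VARCHAR(10))
     , units      (VARCHAR(18))
FILE = /data/stage/daily_shipments.txt;
BEGIN LOADING sales_mart.daily_shipments_stg
      ERRORFILES sales_mart.ship_err1, sales_mart.ship_err2;
INSERT INTO sales_mart.daily_shipments_stg
VALUES (:store_id, :product_id, :ship_date, :units);
END LOADING;
LOGOFF;

FastLoad requires the target table to be empty, so it suits staging loads; MultiLoad, TPump, or TPT fit the other scenarios.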

There are many consultants who just say BIG WORDS but have no practical experience; let them sit down and show the feasibility. It reminds me of yesterday, when I wrote and executed a Linux script for a very complex piece of ETL logic. I have been using a Hadoop and Hive backend, but this complex case could not be done there, so I ported the files from Hadoop to the Linux server and had to write a script. The whole time I was screaming that it was too slow, too slow, with no chance of getting it done. But later on I realized that there were two lines of code slowing down the entire processing. I am not sure whether a system upgrade played a part as well. I blame the poor system, which listens patiently to my curses :)

The Teradata UDA could perhaps solve your problems, since your data is in the multi-TB range; dealing with Hive, Hadoop, Linux, ETL tools, and databases leads me to think so too. You can draw up stringent rules and standards across all the platforms, so that the expected data structures are in one agreed format at any given point in time, to ease and regulate processing, with no deviation from that rule. I think streamlining things at the various checkpoints can help.

Last but not least, I feel that getting the right driver for an ordinary car versus an expensive car is also a point. For an ordinary car we should not expect many luxuries; for an expensive one we should. A techno-business person is the one who can bridge that gap, because technology should serve business needs. Here again there is a problem: technically good and sound (reliable) resources are hard to get. Because of competition, salary attrition, and internal problems within the organization, techies leave, so it seems the average person becomes the option. Sometimes it seems we need to follow the old English proverb "better ride on a donkey that carries me home than a horse that throws me" to solve problems :)