Does XML open up the implementation of a truly ad hoc EDW? I’m not talking about an EDW that supports ad hoc queries. I’m suggesting a fully ad hoc EDW: data model, data structures, ETL/ELT and BI tool access to end-users with no design or predefinition. The EDW points at a data source and builds a queryable subject area based only on the data available, and discovers the relationships of this new subject area to the existing metadata and tables residing in the EDW.
Scary thought? A DBA’s worst nightmare?
A cursory examination of this subject shows that a “default mapping” feature of an XML shredder tool would gives us most of this functionality. After loading the XML schema into a modeling tool, the shredder can generate the relational model, DDL and load the data into the resulting models. This entire process can be automated, including loading the relational structure’s metadata in the BI tool of your choice.
After the data resides in relational structures in the database and the metadata is published to the BI tools, then the relationships to the other EDW subject areas can be derived. Queries then follow, whether they be discovery queries based on human intervention, or a preliminary set of “what did we find” automated analysis based on the business needs.
The application: automated data capture and automated analysis of unstructured data outside, and inside, the enterprise. This can include email, forum postings, and social networking media: blogs and twitters and digs (oh my!). Do an internet search on “percentage of data is unstructured” and you’ll get varying estimates that up to 90% of data is unstructured, leaving many critical business decisions based on only a small percentage of the data available.
Now the most cringe-worthy issue: EDW consistency and performance. How can we possibly just throw unvetted data structures and unpolished data into the EDW and EDW metadata? How can we immediately let the end users submit unplanned for and poorly performing queries against these untuned data structures into our carefully monitored and controlled workload mix? Ugh!
My thought is, “How can we not?” The EDW needs to support the enterprise. The enterprise needs to react immediately to changes in the real world. The EDW support teams will never have the resources to review, categorize, design and implement rigorous, tested systems in a timely manner to address to explosion of data outside the enterprise: data that the enterprise needs to analyze and react to with an immediacy required by today’s Black Swan events.
As for performance, in my opinion that is the database product’s problem. That is what you are paying your database vendor for: performance when you need it with minimal support requirements.
A challenge? Yes: but doable and necessary. “Fasten your seatbelts. It’s going to be a bumpy ride.”
Thanks to Srinivas Pandrangi, lead engineer for Teradata XML Services, for contributing to this article