Big Data: Where Are We Now?

When this blog started, it was based on my 2008 Teradata Partners User Conference presentation on the future data explosion and its impact on the Enterprise Data Warehouse: "'Long Tails,' 'Black Swans,' and Their Impact on EDW and AEI." Even then, one could foresee a significant negative impact on existing EDW platforms as they tried to support all of this new volume and these new types of data (social, sensor, etc.).

I thought I would post the last four design-issue slides from that presentation here:

Data History

  • More online or degraded online history availability
  • Cost/benefit analysis is difficult, so keep the additional cost of degraded online history low
  • Cost/benefit analysis needs to include “what if” analysis

Data Detail

  • Need more complex data structures to support obscure sources and outlier data
  • More complex metadata and physical data modeling for access to low usage data

Varied Access

  • Higher levels of security and access control complexity as supply chain opened to vendors
  • Mixed workload administration to meet service level goals

ETL/ELT Complexity

  • Supporting unstructured text
  • Supporting external and “moving target” data sources
  • Flexibility in data structures and ETL/ELT tools

Data History (continued)

  • Off-line storage may not meet service level goals
  • Need “catch-up” mechanisms for ETL
  • Assume loads will fall behind, and put catch-up systems in place
  • Note:  FastLoad/Merge is as good as or better than MultiLoad (ML) or Stream, and supports all indexes
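The catch-up idea above boils down to checkpointing a high-water mark so a feed that falls behind can reload in ordered batches. Here is a minimal, hypothetical Python sketch of that pattern; the class name, tuple-shaped events, and in-memory "load" step are all illustrative stand-ins, not any Teradata API.

```python
from dataclasses import dataclass, field

@dataclass
class CatchUpLoader:
    """Hypothetical incremental loader: tracks a high-water mark so a
    feed that falls behind can catch up in ordered batches."""
    high_water_mark: int = 0          # last event sequence number loaded
    batch_size: int = 3
    loaded: list = field(default_factory=list)

    def run(self, source_events):
        """Load any events newer than the high-water mark, oldest first."""
        backlog = sorted(e for e in source_events if e[0] > self.high_water_mark)
        for i in range(0, len(backlog), self.batch_size):
            batch = backlog[i:i + self.batch_size]
            self.loaded.extend(batch)             # stand-in for a bulk-load step
            self.high_water_mark = batch[-1][0]   # checkpoint after each batch
        return len(backlog)

loader = CatchUpLoader(high_water_mark=2)
events = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f"), (7, "g")]
n = loader.run(events)   # picks up events 3..7 in two batches
```

Checkpointing after each batch, rather than at the end, is what lets the load resume mid-backlog if it is interrupted again.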

Fail-over Capacity (Dual Active)

  • Both for recovery and for bursts of query activity

Need flexibility to generate queries to react to unknown events

  • Flexible data structures and ad hoc query tools
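One way to get that flexibility is to generate queries from whitelisted building blocks rather than hand-writing SQL for each new event. The sketch below is a hypothetical illustration, not any particular tool's API: the table name, column whitelist, and `?` placeholder style are assumptions.

```python
# Hypothetical sketch: build an ad hoc query from whitelisted columns and
# filters, so analysts can react to unforeseen events without new hand-written SQL.
ALLOWED_COLUMNS = {"event_ts", "source", "severity", "payload"}

def build_query(table, columns, filters):
    """Return a parameterized SELECT plus its bind values.

    Column names and filter keys are checked against a whitelist;
    filter values are bound, never interpolated into the SQL string.
    """
    bad = [c for c in list(columns) + list(filters) if c not in ALLOWED_COLUMNS]
    if bad:
        raise ValueError(f"columns not allowed: {bad}")
    where = " AND ".join(f"{col} = ?" for col in filters)
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql, list(filters.values())

sql, params = build_query("sensor_events",
                          ["event_ts", "payload"],
                          {"source": "vendor_feed", "severity": "high"})
# sql    -> "SELECT event_ts, payload FROM sensor_events WHERE source = ? AND severity = ?"
# params -> ["vendor_feed", "high"]
```

Binding values instead of interpolating them keeps the generated queries injection-safe even when the filters come from end users.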

Event Triggers

  • Normal processing should be fully automated

Mixed workload planning for access capacity crush

  • Set up TDWM Rule Sets in advance, ready to be activated for a “capacity crush” scenario
  • Capacity planning needs to include “what if” scenarios
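A "what if" capacity check can be as simple as modeling burst scenarios against sustained capacity, so the crush rule set is planned before it is needed. This is a hypothetical back-of-the-envelope sketch; the function name, queries-per-hour units, and burst multipliers are all assumptions for illustration.

```python
# Hypothetical "what if" capacity check: given a platform's sustained query
# capacity, estimate each burst scenario's demand and whether it fits.
def headroom(capacity_qph, baseline_qph, scenarios):
    """Return each scenario's projected demand and whether capacity covers it."""
    results = {}
    for name, burst_multiplier in scenarios.items():
        demand = baseline_qph * burst_multiplier
        results[name] = {"demand_qph": demand, "fits": demand <= capacity_qph}
    return results

report = headroom(
    capacity_qph=10_000,
    baseline_qph=4_000,
    scenarios={"normal": 1.0, "month_end": 2.0, "crush": 3.5},
)
# The "crush" scenario demands 14,000 queries/hour against 10,000 of capacity,
# which is exactly the case the pre-built TDWM rule set would have to absorb.
```

Scenarios that do not fit are the ones that justify throttling rules or fail-over capacity ahead of time.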

Since 2008, the technology has kept up, and all of this data is being accommodated today with the varied technologies available to us.  However, the vision of a single view of the data is still paramount, based on my two decades of experience with Integrated Data Warehouses for Teradata.  This overall requirement is addressed by the Teradata Unified Data Architecture (UDA):  still a single view of the data, but on the different platforms that most cost-effectively store and process it, with all the data accessible from a single query across a high-speed interconnect.

So where are we going from here?  As always, the industry will apply the most cost-effective hardware and software to the value of the data being captured.  And as the cost of the platforms steadily decreases, the amount of data being captured and stored for analysis will increase.  Data that was too costly to collect and store for analysis three years ago is now available on the web for free.

Expect the main difficulty in supporting this vision to remain the complexity of connecting all of the disparate data stores into a single view for the end user, whether that user is a data scientist or a CEO.