Dr. Daniel Abadi October 17, 2014
Thisis a guest post by Dr. Abadi also posted at http://blogs.teradata.com/data-points/teradata-que
For those readers who followed my writings for the Hadapt blog before it was acquired by Teradata http://hadapt.com/blog/, one of my common refrains was the architectural flaws inherent in database connectors to Hadoop. My problems with connectors centered on the following issues:
(1) Big data may mean different things to different people, but just about everybody agrees that it includes the property of “bigness” --- i.e. large scales of data. As Hadoop emerges as one of the major players in the Big Data space, enterprises are becoming increasingly comfortable storing very large data sets in Hadoop. Shipping a petabyte of data over a network to other systems for processing is challenging and expensive to do at high performance, and violates the general “Big Data” architectural principle of pushing the processing to the data.
(2) Queries that involved joining data located inside Hadoop with data inside the database system generally required loading the data from one system to the other, before running the query in a single system (either all in Hadoop, or all in the database system).
(3) For connectors that enabled queries to access data stored in a remote system (i.e. access DBMS data from Hadoop, or access Hadoop data from the DBMS), the lack of shared metadata makes optimizing these queries complex, and scan-based plans (over petabytes of data) are generally the only option.
In this post, I will discuss Teradata’s QueryGrid technology, and how, when used to “connect” Teradata and Hadoop, it functions quite differently from the traditional database-Hadoop connectors discussed above.
First, a little bit of history. When DBMS-Hadoop connectors first started being developed, Hadoop’s native query processing abilities were extremely poor. Hive still required every query to use the MapReduce data processing engine, Impala didn’t exist yet, Hadapt was still in its nascent stages, and Presto, Shark, and Parquet were not even conceptualized yet. Connectors to database systems were therefore created to enable Hadoop data to be processed by far more advanced query processing engines.
The world is quite different today, and Hadoop’s native query processing engines have grown by leaps and bounds, performing many more query processing operations at much improved performance. While more work still needs to be done for Hadoop’s engines to match the decades of effort of query optimizer refinement and SQL support that traditional database systems have been able to achieve, Hadoop’s native query processing support is finally rounding into shape.
Teradata QueryGrid over Inifiniband
Therefore, Teradata’s QueryGrid works in conjunction with the native query processing execution engines of Hadoop rather than attempting to replace them with Teradata. For example, let’s say a retail company has an error log file from their web server stored in Hadoop, and it was recently discovered that a bug in the application code sometimes prevented customers from being able to add to their shopping cart a quantity of more than one item of some particular widget. We therefore want to go through our error log and find customers that ran into this bug, and send them an email inviting them to increase their order quantity. While the error log contains information about when this bug was encountered, the email address and other contact information is stored in a Teradata database. In order to perform this query, we therefore need to join data in Hadoop with data inside Teradata.
Teradata’s QueryGrid technology allows users to express such queries over Teradata and Hadoop data in a single SQL query, and enables Hadoop (e.g. via Hive/Tez) to process the part of the query that accesses Hadoop-local data before performing the join. In our example, Hadoop’s native query processing engine performs (1) the scan of the error log for entries matching the particular bug that prevented quantities of more than one from being added to the shopping cart, and (2) the extraction of customer identifier information associated with these log entries --- either directly from the error log or via a join with another table in Hadoop that can be used to map session identifiers to customer identifiers. This list of matching customer identifiers is thus the only data that is sent over the “grid” to Teradata, where processing of the query (extracting the email addresses of these customers) is completed. By pushing down most of the query processing to Hadoop, Teradata thus avoids the “Big Data” network bottleneck of traditional DBMS-Hadoop connectors.
Even for those queries where the processing performed by Hadoop does not reduce the input data size, QueryGrid’s parallel data transfer technology enables a petabyte of data to be transferred from Hadoop to Teradata in a few minutes. (This assumes the transfer is done from a 1000-node Hadoop cluster to a Teradata cluster with 1000 AMPs over an Infiniband connection. Clusters that are 10 times smaller will take approximately 10 times as long). Each HDFS data node is automatically mapped to a Teradata AMP node and data is transferred directly over a socket connection. Thus, QueryGrid uses the network as efficiently as possible. QueryGrid leverages Hadoop’s processing engines for those queries where pushing down parts of query execution to Hadoop will reduce network data transfer, and for cases where this is not possible, can transfer data between Hadoop and Teradata at enormously high rates.
However, perhaps the coolest part of QueryGrid is actually what happens after Hadoop has finished with its part of query processing and the query is completed in Teradata. In general, it is impossible to predict the statistical profile of the data being sent to Teradata after initial processing in Hadoop. For our example above, it is unknown in advance whether 5 customers or 100,000 customers encountered this web server bug, and it is unknown if the same customers experienced the same bug repeatedly, or whether the bug was encountered randomly across the customer base. This problem is known as the predicate selectivity estimation problem in the database systems research community, and while it is notoriously difficult to solve in any database system, the complexity of having data spread over multiple locations without a centralized catalog renders this problem an order of magnitude more difficult.
Teradata QueryGrid therefore actively calculates statistics on data as it is sent from Hadoop to Teradata, and incrementally plans the query once it has a clear perspective of the statistical profile of the received data. This greatly improves Teradata’s ability to optimize the query, choosing the right query processing plan for the particular dataset set it ends up processing. This is an extremely important feature. At the end of the day, the success of any Big Data query processing platform is extremely dependent on the query optimizer. By optimizing and dynamically reoptimizing queries, Teradata QueryGrid overcomes a major shortfall of traditional federated systems.
QueryGrid is thus not your traditional DBMS-Hadoop connector. By pushing down query processing to Hadoop’s increasingly advanced query processing engines, it overcomes the network bottleneck of the early DBMS-Hadoop connectors. By enabling SQL queries to span both DBMS-local and Hadoop-local data (all managed with Teradata’s permissions and access control framework), far more advanced query analysis becomes possible. And with a dynamic optimization framework, query performance for multi-system queries is significantly improved.
It’s important to note that the same technology used to connect Teradata and Hadoop, can be used to connect Teradata with many other systems (e.g. MongoDB, Teradata, Aster, etc.). The basic principles of query pushdown, multi-system SQL, and dynamic optimization provide an excellent foundation upon which a flexible and powerful data processing platform can be created.
Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.