Guest post by Dr. Daniel Abadi, March 6, 2015
During a recent meeting with a post-doc in my lab at Yale, he reminded me that this summer will mark the 10-year anniversary of the publication of C-Store in VLDB 2005. C-Store was by no means the first ever column-store database system (the column-store idea has been around since the 70s --- nearly as long as relational database systems), but it was quite possibly the first proposed architecture of a column-store designed for petabyte-scale data analysis. The C-Store paper has been extremely influential, with close to every major database vendor developing column-oriented extensions to their core database product in the past 10 years, with most of them citing C-Store (along with other influential systems) in their corresponding research white-papers about their column-oriented features.
Given my history with the C-Store project, I surprised a lot of people when some of my subsequent projects such as HadoopDB/Hadapt did not start with a column-oriented storage system from the beginning. For example, industry analyst Curt Monash repeatedly made fun of me on this topic (see, e.g. http://www.dbms2.com/2012/10/16/hadapt-version-2/).
In truth, my love and passion for column-stores has not diminished since 2005. I still believe that every analytical database system should have a column-oriented storage option. However, it is naïve to think that column-oriented storage is always the right solution. For some workloads --- especially those that scan most rows of a table but only a small subset of the columns, column-stores are clearly preferable. On the other hand, there any many workloads that contain very selective predicates and require access to the entire tuple for those rows which pass the predicate. For such workloads, row-stores are clearly preferable.
There is thus general consensus in the database industry that a hybrid approach is needed. A database system should have both column-oriented and row-oriented storage options, and the optimal storage can be utilized depending on the expected workload.
Despite this consensus around the general idea of the need for a hybrid approach, there is a glaring lack of consensus about how to implement the hybrid approach. There have been many different proposals for how to build hybrid row/column-oriented database systems in the research and commercial literature. A sample of such proposals include:
(1) A fractured mirrors approach where the same data is replicated twice --- once in a column-oriented storage layer and once in a row-oriented storage layer. For any particular query, data is extracted from the optimal storage layer for that query, and processed by the execution engine.
(2) A column-oriented simulation within a row-store. Let’s say table X contains n columns. X gets replaced by n new tables, where each new table contains two columns --- (1) a row-identifier column and (2) the column values for one of the n columns in the original table. Queries are processed by joining together on the fly the particular set of these two-column tables that correspond to the columns accessed by that query.
(3) A “PAX” approach where each page/block of data contains data for all columns of a table, but data is stored column-by-column within the page/block.
(4) A column-oriented index approach where the base data is stored in a row-store, but column-oriented storage and execution can be achieved through the use of indexes.
(5) A table-oriented hybrid approach where a database administrator (DBA) is given a choice to store each table row-by-row or column-by-column, and the DBA makes a decision based on how they expect the tables to be used.
In the rest of this post, I will overview Teradata’s elegant hybrid row/column-store design and attempt to explain why I believe it is more flexible than the above mentioned approaches.
The flexibility of Teradata’s approach is characterized by three main contributions.
1: Teradata views the row-store vs. column-store debate as two extremes in a more general storage option space.
The row-store extreme stores each row continuously on storage and the column-store extreme stores each column continuously on storage. In other words, row-stores maintain locality of horizontal access of a table, and column-stores maintain locality of vertical access of table. In general however, the optimal access-locality could be on a rectangular region of a table.
Figure 1: Row and Column Stores (uncompressed)
To understand this idea, take the following example. Many workloads have frequent predicates on date attributes. By partitioning the rows of a table according to date (e.g. one partition per day, week, month, quarter, or year), those queries that contain predicates on date can be accelerated by eliminating all partitions corresponding to dates outside the range of the query, thereby efficiently utilizing I/O to read in data from the table from only those partitions that have data matching the requested data range. However, different queries may analyze different table attributes for a given date range. For example, one query may examine the total revenue brought in per store in the last quarter, while another query may examine the most popular pairs of widgets bought together in each product category in the last quarter. The optimal storage layout for such queries would be to have store and revenue columns stored together in the same partition, and to have product and product category columns stored together in the same partition. Therefore we want both column-partitions (store and revenue in one partition and product and product category in a different partition) and row-partitions (by date).
This arbitrary partitioning of a table by both rows and columns results in a set of rectangular partitions, each partition containing a subset of rows and columns from the original table. This is far more flexible than a “pure” column-store that enforces that each column be stored in a different physical or virtual partition.
Note that allowing arbitrary rectangular partitioning of a table is a more general approach than a pure column-store or a pure row-store. A column-store is simply a special type of rectangular partitioning where each partition is a long, narrow rectangle around a single column of data. Row-oriented storage can also be simulated with special types of rectangles. Therefore, by supporting arbitrary rectangular partitioning, Teradata is able to support “pure” column-oriented storage, “pure” row-oriented storage, and many other types of storage between these two extremes.
2: Teradata can physically store each rectangular partition in “row-format” or “column-format”.
One oft-cited advantage of column-stores is that for columns containing fixed-width values, the entire column can be represented as a single array of values. The row id for any particular element in the array can be determined directly by the index of the element within the array. Accessing a column in an array-format can lead to significant performance benefits, including reducing I/O and leveraging the SIMD instruction set on modern CPUs, since expression or predicate evaluation can occur in parallel on multiple array elements at once.
Another oft-cited advantage of column-stores (especially within my own research --- see e.g. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf ) is that column-stores compress data much better than row-stores because there is more self-similarity (lower entropy) of data within a column than across columns, since each value within a column is drawn from the same attribute domain. Furthermore, it is not uncommon to see the same value repeat multiple times consecutively within a column, in which case the column can be compressed using run-length encoding --- a particularly useful type of compression since it can both result in high compression ratios and also is trivial to operate on directly, without requiring decompression of the data.
Both of these advantages of column-stores are supported in Teradata when the column-format is used for storage within a partition. In particular, multiple values of a column (or a small group of columns) are stored continuously in an array within a Teradata data structure called a “container”. Each container comes with a header indicating the row identifier of the first value within the container, and the row identifiers of every other value in the container can be deduced by adding their relative position within the container to the row identifier of the first value. Each container is automatically compressed using the optimal column-oriented compression format for that data, including run-length encoding the data when possible.
Figure 2: Column-format storage using containers.
However, one disadvantage of not physically storing the row identifier next to each value is that extraction of a value given a row identifier requires more work, since additional calculations must be performed to extract the correct value from the container. In some cases, these additional calculations involve just positional offsetting; however, in some cases, the compressed bits of the container have to be scanned in order to extract the correct value. Therefore Teradata also supports traditional row-format storage within each partition, where the row identifier is explicitly stored alongside any column values associated with that row. When partitions are stored using this “row format”, Teradata’s traditional mechanisms for quickly finding a row given a row identifier can be leveraged.
In general, when the rectangular partitioning scheme results in wide rectangles, row format storage is recommended, since the overhead of storing the row id with each row is amortized across the breadth of the row, and the benefits of array-oriented iteration through the data are minimal. But when the partitioning scheme results in narrow rectangles, column-format storage is recommended, in order to get the most out of column-oriented array iteration and compression. Either way --- having a choice between row format and column format for each partition further improves the flexibility of Teradata’s row/columnar hybrid scheme.
3: Teradata enables traditional primary indexing for quick row-access even when column-oriented partitioning is used.
Many column-stores do not support primary indexes due to the complexity involved in moving around records as a result of new inserts into the index. In contrast, Teradata Database 15.10 supports two types of primary indexing when a table has been partitioned to AMPs (logical servers) by the hash value of the primary index attribute. The first, called CPPI, maintains all row and column partitions on an AMP sorted by the hash value of the primary index attribute. These hash values are stored within the row identifier for the record, which enables each column partition to independently maintain the same sort order without explicitly communicating with each other. Since the data is sorted by the hash of the primary index attribute, finding particular records for a given value of the primary index attribute is extremely fast. The second, called CPPA, does not sort by the hash of the primary index attribute --- therefore the AMP that contains a particular record can be quickly identified given a value of the primary index attribute. However, further searching is necessary within the AMP to find the particular record. This searching is limited to the non-eliminated, nonempty column and row partitions. Finding a particular record given a row id for both CPPI and CPPA is extremely fast since, in either case, the records are in row id order.
Combined, these three features make Teradata’s hybrid solution to the row-store vs. column-store tradeoff extremely general and flexible. In fact, it’s possible to argue that there does not exist a more flexible hybrid solution from a major vendor on the market. Teradata has also developed significant flexibility inside its execution engine --- adapting to column-format vs. row-format input automatically, and using optimal query execution methods depending on the format-type that a particular query reads from.
Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.