Optimizing Disk IO and Memory for Big Data Vector Analysis
By Daniel Abadi, Nov 11, 2014. This is a guest post by Dr. Abadi, originally published at http://blogs.teradata.com/data-points/optimizing-disk-io-and-memory-for-big-data-vector-analysis/
At Yale, every spring I usually teach an advanced database systems implementation class that covers both traditional and more modern database system architectures. I often like to test my students with questions like the following:
Let’s say the following SQL query is issued to a data warehouse for a retail store (the query requests the total revenue generated by a particular store on October 20th, 2014):
SELECT SUM(transaction_amount)
FROM transactions
WHERE store_id = 37
  AND date = '2014-10-20'
Assume that we either do not have indexes on store_id and date, or that the query optimizer chooses not to use them to process this query (e.g. because the predicates are not restrictive enough). The query processor thus executes this query by scanning every record in the transactions table, performing two predicate checks (one to see if store_id is 37, and one to see if date is 2014-10-20), and, if both checks pass, extracting the transaction amount and adding it to a cumulative sum.
What is the performance bottleneck of this query?
I obviously have not given the students enough information about the hardware configuration of the machine (or cluster of machines) on which this query is executed, nor about the software implementation of the database system, for students to be able to answer the above question definitively. But you can tell a lot about how much students understand the principles of database system architecture by how they go about describing what the answer depends on.
Before describing some potential bottlenecks, let’s eliminate one possibility: network bandwidth between query processing servers. It doesn’t matter whether the transactions table is small and fits easily on a single machine or is very large and partitioned across thousands of machines --- this query is almost perfectly partitionable. Each machine can calculate its own cumulative sum for the partition of transaction records located locally, and the only communication across query processing servers happens at the very end of the query, when the subtotals from each partition are added together.
However, several other bottlenecks may exist. To understand where they may come from, let’s examine the path of a single record from the transactions table through the database query processing engine. Let’s assume for now that the transactions table is larger than the size of memory, and the record we will track starts off on stable storage (e.g. on magnetic disk or SSD).
First, this record needs to be read from stable storage into the memory of the server that will process this record. Second, this record needs to be transferred from memory to the local cache/registers of the processing core that will end up processing this record. Third (finally), the processing core needs to perform two predicate evaluations on the record (on store_id and date), extract (if necessary) the transaction amount, and add it to the cumulative sum.
Each of these three steps will have to be performed on each record of the transactions table, and each may become a bottleneck for query processing. The first step is a potential “disk bandwidth bottleneck” --- the rate at which data can be transferred from stable storage to memory is simply too slow to keep up with subsequent parts of query processing. The second step is a potential “memory bandwidth bottleneck”, and the third step is a potential “CPU processing bottleneck”.
For “big data” datasets where the size of data is significantly larger than the size of memory, the most common bottleneck is disk bandwidth. Disk bandwidth of even the highest-end disks remains on the order of hundreds of megabytes per second, while memory bandwidth is usually at least an order of magnitude faster. Furthermore, very little work is required of the CPU per record (just two predicate evaluations and a sum) --- database queries tend to be far less CPU-intensive than workloads in other domains (such as graphics rendering or scientific simulations). Hence, step 1 is often the bottleneck.
There are three techniques commonly used to eliminate the disk-bandwidth bottleneck in database systems. First, you can increase the memory available to store data on each machine, and thus decrease the amount of data that must be read from disk at query time. Second, you can use storage arrays of large numbers of disks and a fat pipe to transfer data from this array of disks to the processing server. In other words, even if each disk can only transfer data at 100MB/sec, 20 disks combined can transfer data at an aggregate rate of 2GB/sec. Third, you can leverage software to increase the efficiency of data transfer. For example, data compression allows for an effectively larger number of records to be packed per bit transferred. Column-store database systems allow only the columns accessed by the query (in our example --- store_id, date, and transaction_amount) to be read off of disk (so disk bandwidth need not be wasted reading in irrelevant data for a query). With some amount of effort, it is usually possible to eliminate the disk-bandwidth bottleneck through some combination of the three techniques described above.
Figure 1. Optimizing Disk IO, Memory, and CPU Usage
The next bottleneck to present itself is usually memory bandwidth. Unlike the disk bandwidth bottleneck, the memory bandwidth bottleneck cannot typically be solved via modifying the hardware configuration. Instead, the database system software needs to be intelligent about how to efficiently use memory bandwidth, so that every drop of memory bandwidth is utilized to the maximum possible extent. The two main techniques used for this are the same two software techniques mentioned above: compression and column-orientation.
In order for compression to alleviate the memory bandwidth bottleneck, data must remain compressed in memory. Many database systems decompress data in the buffer pool after it is brought in from disk --- this simplifies the system code, and makes data modifications easier to handle. Unfortunately this exacerbates the memory bandwidth bottleneck, as the full, uncompressed data is sent to the CPU whenever it needs to be processed. In contrast, well optimized systems (especially systems optimized for high performance data analysis) will keep data compressed in memory, and either operate directly on the compressed data, or only decompress in the CPU immediately prior to processing.
Column-oriented page layout is a particularly important technique for eliminating the memory bandwidth bottleneck. Assume that store_id, date, and transaction_amount each take up 4 bytes per record (12 bytes in total), and that each record in the transactions table is 300 bytes. Furthermore, assume a CPU core cache line is 128 bytes. In the best-case scenario for a traditional row-oriented data layout, store_id, date, and transaction_amount all fall within the same 128-byte subset of the 300-byte record, and the 128-byte cache line containing these three attributes is sent from memory to the cache of the processing core that will process the record. In that scenario, 12/128 = 9.4% of the data transferred from memory to cache is actually accessed by the CPU. In the worst-case scenario, the three attributes fall in different cache lines of the 300-byte record, and as a result, only 12/300 = 4% of the data transferred from memory to cache is accessed by the CPU.
In contrast, if data is laid out on a page column-by-column rather than row-by-row, each cache line consists entirely of data from a single attribute of the table. Only cache lines corresponding to the three relevant attributes are sent from memory to the processing core cache, and hence nearly 100% of the data transferred from memory to cache will be accessed by the CPU. Hence, column-orientation usually improves the memory bandwidth efficiency by an order of magnitude, significantly alleviating the memory-bandwidth bottleneck.
In summary, only with effort and intelligence (especially around compression and column-orientation) can the first two bottlenecks --- disk-bandwidth and memory-bandwidth --- be eliminated, and the bottleneck shifted to the CPU.
To alleviate the CPU bottleneck, techniques must be used to maximize the efficiency of the processor. One clever way to accomplish this is to leverage the SIMD instruction set on modern CPUs, such as AVX2 vector processing on Intel Haswell. To understand how SIMD works, let’s first examine how the predicate of our example query (store_id = 37) would be evaluated normally (without leveraging SIMD instructions). For each record, the store_id would be extracted from the 128-byte cache line and sent to a 128-bit CPU register to do the comparison with the value “37”. This process happens sequentially --- each store_id is extracted and sent to the CPU register for comparison to “37” one after the other.
Figure 2. Single CPU addition versus four with 128-bit vector registers
Note that a 128-bit register can hold 128/8 = 16 bytes of data. Since each store_id takes up 4 bytes, you can fit 4 store_ids in a single register. In such a situation, where you can fit a “vector” of values within a single CPU register, the SIMD instruction set allows a programmer to perform the exact same operation on each element of the vector in parallel. In other words, it is possible to compare 4 different store_ids to 37 in parallel in a single CPU step. This effectively quadruples the CPU efficiency.
We will perhaps go into more detail on how to leverage SIMD features for database operations in future blog posts. The main takeaway I want readers to get from this post is that leveraging SIMD instructions on modern CPUs (also known as “vectorized processing”) is entirely a CPU optimization. For such an optimization to make a difference in actual query performance, the disk-bandwidth and memory-bandwidth bottlenecks have to be removed first. Teradata Database builds cache-line-friendly columnar table structures in memory to exploit Intel AVX2 vector processing.
Figure 3. The Pipeline to L1 Cache and CPU Cores
Teradata recently announced support for vectorized processing. While this is great news by itself, what is more interesting to me is the announcement that vectorized processing improves performance on real-life, actual Teradata queries by 10-20%. Using columnar arrays in memory, individual steps within a query run up to 50% faster. While a straightforward reading of these performance claims would seem to indicate that this improved performance is a testament to the high quality of this particular new feature, I hope that someone who has read this post carefully will understand that it means much more than that: it’s a testament to the high quality of the rest of the Teradata product to remove the disk-bandwidth and memory-bandwidth bottlenecks so that Intel CPU optimizations can actually make a bottom-line difference. It is an indication that Teradata has a legitimate column-store --- a column-store that improves the efficiency of disk transfer by only transferring the relevant columns, improves the efficiency of memory transfer by avoiding cache pollution of irrelevant attributes, and keeps data in columns inside the CPU register for parallel processing at the hardware level.
Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.
Hi Dan - will Teradata be supporting AVX or AVX2 or both? Also, do you know which chipsets will be supported in Teradata 15.10 vectorized instruction sets?
The system senses the CPU capability and provides either AVX2 or SSE3. The implementation looks at the CPU version and chooses the library that is appropriate for the CPU type in the node. In addition to the functionality in 15.10, the compression algorithm also uses CPU-version-specific vector instructions, including AVX2.
To use the AVX2 instruction set, all nodes must be Haswell nodes…there is no mix and match of Intel CPU types in a system. If running coexistence, the system will default to SSE3.
Internal benchmarks I saw just today show fabulous performance gains, especially for row qualification and hash joins. Columnar tables get the biggest boost, but row-based tables aren’t far behind. It works better than we expected.