Super-Scalar Database Compression between RAM and CPU Cache

A disadvantage of dictionary compression is that new value inserts may enlarge the subset of used values to the point that an extra bit is required for the integer codes, triggering re-compression of all previously stored values. Although it has been proved experimentally that such schemes are capable of achieving some performance gains, most of the results are limited to I/O-bound queries on relatively low-end, single-disk systems.

Another opportunity for optimization arises when arithmetic operations are executed on a dictionary-compressed column. In that case, it is sometimes possible to execute the operation only on the dictionary, and leave the column values unchanged [41] (called "enumeration views" in MonetDB [6]). The proper way to make these choices is to make the decision whether and when to compress or decompress an additional task of the query optimizer, as proposed in [7]. The authors assume page-level decompression, but discuss the possibility of keeping the compressed representation of the column values in a page in case a query just copies an input column unchanged into a result table (unnecessary decompression and subsequent compression can then be avoided).

Finally, compression to reduce I/O has received significant attention in the information retrieval community, in particular for compressing inverted lists [47, 37, 3, 4]. Inverted lists contain all positions where a term occurs in a document (collection), and are often stored as an ascending sequence of integers. It is therefore effective to compress the gaps rather than the term positions (delta compression). Such compression is the prime reason why inverted lists are now commonly considered superior to signature files as an IR access structure. Early inverted-list compression work focused on exploiting the specific characteristics of gap distributions to achieve optimal compression ratio (e.g. using Huffman [18] or Golomb [13] coding tuned to the frequency of each particular term with a local Bernoulli model). More recently, attention has been paid to schemes that trade compression ratio for higher decompression speed [43]. In Section 5.2, we show that our new PFOR-DELTA scheme compares quite favorably with the most recent such proposal, the word-aligned compression scheme called "carryover-12" [4].

A less common direction is taken in [32, 35], where the possibility of applying data compression hardware in database systems is discussed. Although this is an interesting research area, as compression hardware could potentially lead to high compression bandwidths and free the CPU from (de-)compression overheads, it is still not widely investigated within the database research field.

Finally, most of the work related to the MonetDB main-memory database system [6, 23] and its cache-conscious approach to query evaluation [24] is relevant in the sense that attention is being paid to the interaction between query evaluation and modern computer hardware. Cache and CPU awareness is taken even further by the vectorized MonetDB/X100 query engine [5, 49]. One research goal of the MonetDB/X100 system is to scale its high query evaluation performance to disk-resident data sets, which served as the prime motivation to investigate high-speed data compression algorithms.

Chapter 3

Towards Super-scalar Compression

3.1 Goal

Our goal is to speed up the execution of I/O-bound database queries by means of compression, while minimizing any negative impact of (de-)compression overheads on CPU-bound queries. CPU-bound queries are those queries for which the limiting factor on execution time is the speed of the CPU, which is fully utilized, meaning that compression overhead interferes with query processing. The main idea is that by transferring compressed data from disk to main memory, we could, ideally, scale the effective I/O bandwidth by the compression ratio of the data being transferred. However, compression can not only increase bandwidth, it can potentially decrease average memory access latencies as well. A more formal analysis of these effects is presented in the following sections.

3.1.1 Increasing Bandwidth

The effects of compression and decompression on the performance of a database system can be modeled by considering the following parameters:

T_io   = Maximum throughput of the I/O subsystem.
T_dc   = Maximum decompressed data production rate of the decompression algorithm.
T_q    = Maximum data consumption rate of the query.
CPU_dc = Fraction of CPU power allocated to decompression.
CPU_q  = Fraction of CPU power allocated to query execution.
T      = Actual data throughput between decompression algorithm and query.
r      = Compression ratio.

The first three parameters are constant, assuming a fixed hardware platform, a fixed decompression algorithm, and a fixed query. Both T_dc and T_q depend on the speed of the CPU, and represent maximum throughput given full CPU utilization. The goal of applying compression is to improve the performance of queries that are I/O-bound, i.e. T_q > T_io, by multiplying the I/O bandwidth with the compression ratio:

    T = \begin{cases}
        T_{io} \times r & \text{if } \frac{T_{io} \times r}{T_{dc}} + \frac{T_{io} \times r}{T_q} \leq 1 & \text{(I/O-bound)} \\
        \frac{T_{dc} \times T_q}{T_{dc} + T_q} & \text{if } \frac{T_{io} \times r}{T_{dc}} + \frac{T_{io} \times r}{T_q} > 1 & \text{(CPU-bound)}
        \end{cases}    (3.1)

In the first case, the query remains I/O-bound even with compression, so that both the query and the decompression algorithm have enough CPU power available to process their input data in real time. In the second case, however, the query and decompressor have to compete for the CPU, as there is not enough CPU power to let both of them process their input data at the rate at which it comes in. In this case, the query is said to be CPU-bound. To prove the correctness of the second alternative in Equation 3.1, we need to look at the distribution of CPU time between the query and the decompressor. Given that we are dealing with a CPU-bound situation, it is safe to assume that CPU_dc + CPU_q = 1.


Now, to find the optimal allocation of CPU time we need to solve

    CPU_q \times T_q = CPU_{dc} \times T_{dc}    (3.2)

subject to the constraint

    CPU_q + CPU_{dc} = 1    (3.3)

resulting in

    CPU_q = \frac{T_{dc}}{T_q + T_{dc}}    (3.4)

This allocation of CPU time results in a net throughput of

    T = CPU_q \times T_q = \frac{T_{dc} \times T_q}{T_q + T_{dc}}    (3.5)

which completes the proof.

It is important to note that, in the CPU-bound case, we might still improve overall performance with respect to uncompressed execution. The query and the decompression algorithm simply have to balance their data consumption and production rates, respectively, according to Equation 3.1. The resulting throughput, T, can still be higher than the bare T_io. If it is lower, decompression will actually induce a performance penalty. In both cases, however, decompression interferes with query execution, so again it should be stressed that compression algorithms should be CPU-efficient.
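As a quick illustration, the throughput model of Equation 3.1 can be evaluated directly. The sketch below is not taken from the thesis; the example numbers are hypothetical and only serve to show how the two cases are selected.

#include <stdio.h>

/* Effective throughput according to Equation 3.1 (all rates in MB/s). */
double effective_throughput(double T_io, double T_dc, double T_q, double r)
{
    double demand = (T_io * r) / T_dc + (T_io * r) / T_q;
    if (demand <= 1.0)
        return T_io * r;                 /* I/O-bound: full compression benefit */
    return (T_dc * T_q) / (T_dc + T_q);  /* CPU-bound: query and decompressor share the CPU */
}

int main(void)
{
    /* hypothetical numbers: 350 MB/s RAID, 2 GB/s decompressor, 1 GB/s query, ratio 3 */
    printf("T = %.0f MB/s\n", effective_throughput(350, 2000, 1000, 3.0));
    return 0;
}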

3.1.2 Reducing Latency

If data is stored in main memory in compressed form, the amount of source data it can hold increases, thereby virtually increasing main memory size. This allows more data to be cached in RAM, resulting in a higher probability of finding some data item in there. This, in turn, results in a potential decrease in the number of disk accesses and thus in a reduced average access latency. Assuming that the buffer manager is fully filled with M fixed-size disk blocks, and ignoring the block replacement policy, the probability of finding a random object in the buffer manager (i.e. a hit) can be estimated in the uncompressed case as

    p_{hit} = \frac{M}{N}    (3.6)

where N is the total size of the database in disk blocks. If we incorporate compression, we need to take care of the compression ratio, which can vary on a per-block basis. If we call the compression ratio of block i r_i, we can estimate the hit probability as

    p_{hit} = \frac{\sum_{i=1}^{M} r_i}{\sum_{j=1}^{N} r_j}    (3.7)

meaning that the hit probability now depends on the compression ratio of the blocks that are cached in the buffer manager.
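For illustration only, Equation 3.7 can be evaluated for a given set of per-block compression ratios; this is a minimal sketch with assumed inputs, not code from the thesis.

/* Estimate the buffer-manager hit probability per Equation 3.7: the M cached
 * blocks with ratios r[0..M-1] cover sum(r[0..M-1]) blocks' worth of
 * uncompressed data, out of a database whose N block ratios sum to total_r. */
double hit_probability(const double *r, int M, double total_r)
{
    double cached = 0.0;
    for (int i = 0; i < M; i++)
        cached += r[i];
    return cached / total_r;
}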

3.2 Requirements

To be effective as a performance enhancement tool, database compression and decompression algorithms should

1. Induce minimal processing overhead.
2. Keep memory traffic to a minimum.
3. Allow for fine-grained access.


Although the main focus of this work is on decompression performance, compression speeds are considered important as well. In previous work, compression has always been considered a one-time operation, and its performance was therefore neglected. However, in situations where large intermediate result sets need to be materialized, compression can potentially increase performance: ideally by increasing the effective disk write bandwidth, which in general is less than the read bandwidth, but at least by keeping close to the disk write bandwidth, so that performance is gained during a subsequent read of the compressed results. Furthermore, updates on a compressed database require periodic re-compression of compressed blocks, in which case fast compression is desired as well.

3.2.1 Minimize Processing Overhead

To meet the first requirement, the compression and decompression algorithms should be light-weight, meaning they are only allowed to spend a couple of clock cycles per byte produced. To maximize the amount of work done in a minimal number of cycles, we should furthermore strive for algorithms capable of achieving high Instructions-per-Cycle (IPC) rates (see Appendix A.5). This can be done by following these guidelines:

1. Compression and decompression should take place in tight loops that operate on small, cache-resident arrays of values.
2. Avoid conditional branches, such as if-then-else constructs or nested loops within the main loop, to prevent control hazards (see Appendix A.4.2).
3. If possible, keep loop iterations independent by avoiding data hazards (see Appendix A.4.1).

Adherence to these guidelines should provide the compiler with enough room for optimizations, and the CPU with enough inherent concurrency to utilize its resources effectively.
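To make the guidelines concrete, the hypothetical loop below (an illustrative sketch, not an algorithm from this thesis) decodes a small, cache-resident vector of byte codes relative to a base value: it contains no branches and no cross-iteration dependencies, so the compiler can unroll it and the CPU can keep many iterations in flight.

/* Branch-free, dependency-free decode of a cache-resident vector. */
void decode_vector(unsigned int *dst, const unsigned char *src,
                   unsigned int base, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = base + src[i];   /* independent iterations, no conditionals */
}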

3.2.2 Minimize Memory Traffic

Memory traffic can be minimized by observing that in a database system a stream of tuples is in general consumed by some operator, which participates in an operator pipeline, and therefore consumes tuples at a small granularity only (see Appendix B.1.4). Therefore, it does not make a lot of sense to write decompressed blocks back to main memory before feeding them into the operator, which requires the decompressed blocks to be moved back into the CPU again. It is more efficient to decompress small batches of values into the cache, and feed those values immediately into the operator pipeline, as was already illustrated in Figure 1.2. On the output side of an operator pipeline, we should compress a cache-resident result vector and append it to a compressed output column, if output needs to be materialized. Such an approach requires pages to be buffered in main memory in compressed form, together with compression and decompression algorithms that allow for incremental operation. As evaluation of our compression algorithms takes place in the MonetDB/X100 system, the natural place to introduce compression is the append method of the ColumnSave operator, and similarly for decompression in the next method of the ColumnScan (see Appendix B.2.3).

Besides being required for into-cache decompression, keeping I/O blocks compressed in the buffer manager has a performance advantage of its own: buffering of compressed pages allows more data to be cached in RAM, resulting in a higher probability of finding some data item in there. This, in turn, results in a potential decrease in the number of disk accesses, and thus in a reduced average access latency.

In conclusion, to satisfy our second requirement of minimizing memory traffic, we argue that compression and decompression should take place at the boundary between the CPU cache and main memory, which requires us to:

1. Keep I/O blocks compressed in the buffer manager.
2. Decompress I/O blocks into the cache, in an incremental, on-demand fashion.

Furthermore, to restrict memory traffic even further, all algorithms should try to minimize both the size of any external data structures and the amount of external buffering they need.


3.2.3 Fine-Grained Access

The fact that I/O blocks stay compressed in main memory can complicate random access to arbitrary values within a compressed block. This implies that, in general, an I/O block needs to be decompressed fully, or at least up to the location of the value in question, before such a value can be used. As MonetDB/X100 I/O blocks are multiple megabytes in size, the overhead of decompressing, on average, half the block to access a single value is a waste of CPU time. On the other hand, this decompression time might still be negligible with respect to the time needed to fetch the compressed block from disk. Still, compression schemes that allow for more fine-grained access to arbitrary values within a block are preferred.

3.3 Experimental Setup

3.3.1 Micro Benchmarks

To evaluate the performance of candidate algorithms, micro benchmarks are conducted, which measure performance in the form of both peak compression and decompression bandwidths, and compression ratios. Furthermore, these benchmarks provide insight into instructions-per-cycle (IPC), branch misprediction rates (BMR) and cache miss rates (CMR), through the use of hardware counters. Most micro benchmark results are presented as a function of a miss rate. This miss rate is an inherent property of most of our compression schemes, as these tend to classify the source values being encoded as either a hit or a miss. Hits are values that can be encoded effectively; misses cannot be, and are therefore handled as exceptional cases. For this reason, in the remainder of this text, hits and misses are also called codes and exceptions, respectively. The miss rate, p_miss, is a simple function of the number of hits, N_hits, and the number of misses, N_miss, and is defined in Equation 3.8:

    p_{miss} = \frac{N_{miss}}{N_{miss} + N_{hits}}    (3.8)

In case compression bandwidths are reported, results refer to the amount of source data the compression algorithm is capable of consuming each second, while in case of decompression bandwidths we refer to the rate at which a decompressor can produce uncompressed data. So bandwidths are always relative to the size of the original, uncompressed data. Furthermore, throughout the micro benchmark results, the following abbreviations are used:

c2c  Cache to cache. Input is read from cache and output is written to cache as well.
c2m  Cache to memory. Input is read from cache but output is materialized in memory.
m2c  Memory to cache. Input is read from memory but output is written to cache only.
m2m  Memory to memory. Input is read from memory and output is materialized in memory as well.

3.3.2 Platforms

Micro benchmarks are conducted on a variety of hardware platforms, to be able to study and identify platform-dependent properties of our compression schemes. Our goal is to implement our algorithms in such a way that they perform well on all platforms, without adding any platform-specific code. This might result in situations where a suboptimal implementation is chosen for a certain platform. These occasions are reported, to reduce efforts in case a platform-dependent implementation is desired. Furthermore, implementations of the algorithms are geared towards 64-bit platforms, as these are likely to become standard in the near future. Details of the hardware platforms being used can be found in Table 3.1. AMD's Opteron was chosen as a typical, well-performing out-of-order CPU. Intel's Xeon was chosen for its long pipeline, which is a challenge to keep filled. Finally, the Itanium 2 was chosen for its potentially large concurrency, provided by its VLIW architecture (see Appendix A.9).


                           Opteron         Xeon             Itanium 2
Manufacturer               AMD             Intel            Intel
Model                      Opteron 246     Xeon EMT64       Itanium 2
Architecture               x86             x86              IA-64
Bits                       64              64               64
Clock Speed (MHz)          2000            3000             1300
Max. IPC                   3               3                6
Pipeline Stages (int/fp)   12/17           31/31            8/8
L1 Cache (Instr/Data)      64KB/64KB       16KB/12Kµ-ops    16KB/16KB
L2 Cache                   1MB             1MB              256KB
L3 Cache                   -               -                3MB
Memory Bandwidth           6.4GB/s         6.4GB/s          6.4GB/s
Main Memory                3GB             4GB              12GB

Table 3.1: Hardware platforms used in experiments

                 Opteron    Xeon       Itanium 2
# Disks          12         4          1
Configuration    RAID-5     RAID-0     -
Capacity         1TB        734GB      24GB
Max. Bandwidth   350MB/s    90MB/s     40MB/s

Table 3.2: Disk configurations of the hardware platforms used in experiments

IPC scores on Itanium 2 provide a good measure of the amount of inherent concurrency in a piece of source code, as all concurrency is provided by the compiler. Memory benchmarks were conducted to measure real-world memory read and write bandwidths. The results can be found in Figure 3.1. These bandwidths are useful as baseline numbers against which we can evaluate compression and decompression performance. For example, a practical limit on the bandwidth we can obtain during decompression can be defined as the product of the compression ratio and the maximum memory read bandwidth. The three experimental platforms are equipped with different disk configurations of varying speeds, as summarized in Table 3.2. This allows the impact of compression on the perceived bandwidth to be evaluated against the native I/O bandwidth of a disk system.

3.4 Generic Compression Algorithms

To get an idea of the performance of some popular, generic compression algorithms, a simple experiment was conducted. The experiment consisted of compressing a 10MB array of 32-bit, in-memory integers. All these integers were randomly generated, but smaller than 256, so that they used at most 8 of their 32 bits. As this leaves roughly 75% of the bytes consisting only of zero bits, one would expect a compression ratio of nearly four. The algorithms under test were Moffat's arithmetic coding (AC) implementation [25], the very generic and relatively fast zlib, used by the popular gzip compression utility, bzip2, which is known to achieve relatively high compression ratios but at a slow rate, and finally lzrw and lzo, which are considered to be among the fastest generic compression libraries at the moment, but not very good at achieving high compression ratios. The results in Table 3.3 confirm these expectations and show that, in general, it is very unlikely that we could achieve any bandwidth gains with these algorithms. LZRW and LZO might increase bandwidth on relatively slow disk systems, with bandwidths up to 100MB/s, but this would induce high processing overheads, which interferes with query execution. On a fast disk system, such as our 350MB/s 12-disk RAID, all the generic algorithms fail to achieve any speedup. Furthermore, none of these algorithms allow for fine-grained access within a compressed block, requiring decompression of the full block to retrieve a single data item.


None of these generic methods, except for the slow bzip2, get close to the compression ratio of four, which is trivially achievable in this case. For compression scenarios like the one presented here, which are not uncommon for a database column, we can clearly do better using faster and more type-aware compression schemes that exploit knowledge about the domains and distributions of the data stored in the database.

Figure 3.1: Memory benchmark results for both memcpy and simple reads and writes of both 32- and 64-bit integers in unrolled loops, on Opteron, Xeon and Itanium 2. The m2m-TC variant uses multiple memcpy calls to copy data from memory to memory, in batches through the cache.

algorithm   comp. ratio   compress (MB/s)   decompress (MB/s)
AC          2.85          3.93              5.03
ZLIB        2.61          2.52              80.72
BZIP2       3.96          4.14              10.19
LZRW-1      1.87          142.13            199.67
LZO1-C      2.01          110.88            302.47

Table 3.3: Compression ratios and compression/decompression bandwidths of some popular compression libraries on our Opteron platform.
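As a hedged illustration of this kind of micro benchmark (this is not the thesis' benchmark harness), the sketch below compresses a 10MB array of small random 32-bit integers with zlib's one-shot compress() call and reports the achieved ratio; timing and the other libraries are left out. Link with -lz.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void)
{
    const size_t n = 10 * 1024 * 1024 / sizeof(unsigned int);
    unsigned int *src = malloc(n * sizeof(unsigned int));
    for (size_t i = 0; i < n; i++)
        src[i] = rand() & 0xFF;          /* values < 256: three of four bytes are zero */

    uLongf dst_len = compressBound(n * sizeof(unsigned int));
    Bytef *dst = malloc(dst_len);
    if (compress(dst, &dst_len, (const Bytef *)src, n * sizeof(unsigned int)) != Z_OK)
        return 1;

    printf("compression ratio: %.2f\n",
           (double)(n * sizeof(unsigned int)) / (double)dst_len);
    free(src);
    free(dst);
    return 0;
}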

3.5 Length Encoding

3.5.1 Original Algorithm

A technique that is based on prefix suppression, or null suppression [35], and was first introduced in [45], is that of length encoding (LE). Length encoding assumes that the larger values in an integer domain are often used less frequently than the smaller values, and tries to exploit this by observing that small values have a lot of leading zeroes in their binary representation, which can be thrown away if we encode the number of bits that were removed or, as we are dealing with fixed-width types, the length of the remainder. In a performance-critical setting it is undesirable to work with sub-byte boundaries, so restricting the cutoff position to byte boundaries is a natural choice. This sacrifices some of the compression ratio in favor of increased performance, as there is no need for bit-level operations. Furthermore, it allows us to encode the length of the remainder in bytes, which in turn improves the compression ratio a bit. To keep those lengths byte-aligned, they can be packed together in a single header of one byte, followed by the corresponding truncated values, as illustrated in Figure 3.2 for 32-bit integers. This way we can encode and decode arbitrary streams of integers in sequential fashion.


if (src[i] & 0xffff0000) {
    length1 = (src[i] & 0xff000000) ? 4 : 3 ;
} else {
    length1 = (src[i] & 0x0000ff00) ? 2 : 1 ;
}

Figure 3.3: Determining the length (in bytes) of a 32-bit integer, stored in src[i], requires two comparisons.

/* copy the non-zero bytes from src into dst */
*(uint*)dst = src[i] ;
dst += length1 ;
*(uint*)dst = src[i+1] ;
dst += length2 ;
*(uint*)dst = src[i+2] ;
dst += length3 ;
*(uint*)dst = src[i+3] ;
dst += length4 ;

Figure 3.4: Writing the non-zero bytes of our source integer data into byte array dst introduces a data dependency, as we need to compute the byte address before each memory write. Note that this code only works on little-endian architectures; on big-endian ones we would require an additional bitwise shift of the source values.

Figure 3.2: Length Encoding of 32-bit integers 0x0000ABCD, 0x00001234, 0x0000001F and 0x00FFFFFF into two, two, one and three bytes respectively. The first byte is the header which encodes those lengths.

Assuming we are dealing with 32-bit integers, the compression ratio of this generic length encoding scheme depends on the relative frequency of integers that fit into one, two, three and four bytes respectively. If we model these frequencies with probabilities p1, p2, p3 and p4, such that p1 + p2 + p3 + p4 = 1, we can compute the average compression ratio by Equation 3.9:

    r_{LE} = \frac{32}{8p_1 + 16p_2 + 24p_3 + 32p_4 + 2}    (3.9)

Equation 3.9 also shows that length encoding could potentially result in an increase in data size, for example by taking p1 = p2 = p3 = 0 and p4 = 1.

Although this original version of length encoding is simple and generic, it has some disadvantages. First of all, the encoding stage requires the computation of the number of leading bytes that can be truncated. Using binary search, this requires two tests per value, as shown in Figure 3.3. These tests, however, introduce severe control hazards. Second, the variable-length representation of encoded values presents a data hazard. It is only possible to store a truncated value once we have computed the lengths of all its predecessors in the current four-tuple and written those values to memory. This presents sequential execution constraints, illustrated in Figure 3.4, which prevent the CPU from effectively applying parallel and out-of-order execution. During decompression, there is a similar data dependency, but now for reading. It is illustrated in Figure 3.5, and again involves keeping track of a cumulative address, dependent on the lengths of the already decoded values. One can work around this by implementing a table of 256 automatically generated decode routines, indexed by the header byte. This would allow for hard-coded, pre-computed offsets in each routine. An example function body of such a routine can be found in Figure 3.6. Although this clearly removes the data dependency, a function call for each four-tuple introduces a significant overhead, which could easily ruin the benefits of the hard-coded byte offsets. A third issue is that most architectures use two's complement representation for negative numbers. This means that all negative numbers have their most significant bit, or sign bit, set to 1, so that they would always get encoded as full 32-bit integers.


while (i < n) {
    uint l1, l2, l3, l4 ;  /* lengths of four values */
    const uint mask[5] = {0, 0xff, 0xffff, 0xffffff, 0xffffffff} ;
    ubyte header = *src++ ;

    /* decode lengths from header */
    l1 = (uint) ((header >> 6) & 0x3) + 1;
    l2 = (uint) ((header >> 4) & 0x3) + 1;
    l3 = (uint) ((header >> 2) & 0x3) + 1;
    l4 = (uint) (header & 0x3) + 1;

    dst[i]   = *((uint*)src) & mask[l1] ;
    src += l1 ;
    dst[i+1] = *((uint*)src) & mask[l2] ;
    src += l2 ;
    dst[i+2] = *((uint*)src) & mask[l3] ;
    src += l3 ;
    dst[i+3] = *((uint*)src) & mask[l4] ;
    src += l4 ;
    i += 4 ;
}

Figure 3.5: Decoding of length encoded values again introduces a data dependency, this time on the src pointer.

do {
    dst[0] = *((uint*)(src+1)) & 0x0000ffff ;
    dst[1] = *((uint*)(src+3)) & 0x0000ffff ;
    dst[2] = (uint) src[5] ;
    dst[3] = *((uint*)(src+6)) & 0x00ffffff ;
    src += 9 ;
    dst += 4 ;
} while(src[0] == 82 && dst < dst_end) ;

Figure 3.6: Function body of a routine used to decode the next four values in case the header equals 01010010 (= 82), meaning that the values have lengths of 2, 2, 1 and 3 bytes respectively, as in Figure 3.2. Note that src is the compressed input source as a byte array, while decoded integers are written to dst. Furthermore, this routine keeps checking the header; as long as it remains the same, it does not return. This was done to reduce function call overhead in case that headers reoccur.


A possible solution to this problem is to encode the signs in a separate header. A fourth disadvantage is that complexity gets worse for 64-bit integers. Measuring the number of leading zero bytes now requires three tests instead of two, and encoding the length of a value in the header now requires three bits instead of two, allowing for only three values per header while wasting one header bit. A natural solution here would be to stick with four length classes, for example encoding whether a value uses 64, 32, 16 or 8 bits. This does however have its cost in the form of reduced compression ratios. Finally, writing parts of multi-byte integers to memory, on which length encoding relies, is a byte-order dependent operation which requires separate implementations for little-endian and big-endian machines. If possible, platform-dependent code is to be avoided.

3.5.2 Binary Length Encoding

To solve some of the issues with the original length encoding scheme of Section 3.5.1, some adjustments were made. First of all, the revised algorithm allows for binary decisions only. This means that we do not really encode the number of bytes in which a value is stored, but only whether it did or did not fit in nbytes bytes, where nbytes is a compression parameter, which will typically have a value of either one, two or three. So in this scheme a value is always either a hit or a miss, which is the reason this scheme was called binary length encoding (BLE). The most important consequences of this adjustment are that during compression there is only one test per value to determine whether it is a hit or a miss. Furthermore, this method needs only one header bit per value, as illustrated in Figure 3.7, meaning that if we use decode routines indexed by the header, it will decompress at least eight values per function call. Finally, this adjustment makes binary length encoding equally well applicable to 64-bit integers.

Figure 3.7: Storage layout of Binary Length Encoded values. The i'th bit in the header indicates whether the i'th value after the header is encoded as a hit or as a miss. Hits are truncated to nbytes = 1 bytes, while misses are stored as full 32-bit integers.

A second adjustment is to allow for an arbitrary base value, relative to which values will be encoded. This base value is then subtracted from every value during encoding, and added back during decoding, making all integers in the range [base, base + 2^(8×nbytes)) hits. This allows for more flexibility in the distribution of values and can encode negative ranges as well, without having to explicitly store the sign bit. Both base and nbytes are stored as compression parameters in the header of a compressed block.

do {
    dst[0] = (uint)src[1]  + base ;
    dst[1] = (uint)src[2]  + base ;
    dst[2] = (uint)src[3]  + base ;
    dst[3] = *(uint*)(src+4) + base ;
    dst[4] = *(uint*)(src+8) + base ;
    dst[5] = (uint)src[12] + base ;
    dst[6] = (uint)src[13] + base ;
    dst[7] = (uint)src[14] + base ;
    src += 15 ; /* consumed 15 compressed bytes */
    dst += 8 ;  /* decoded 8 result integers */
} while(src[0] == 24 && dst < dst_end) ;

Figure 3.8: Function body of the Binary Length Encoding routine used to decode the eight values after a 00011000 (=24) header, as was found in Figure 3.7. Variable base is a constant argument.

These adjustments change the expected compression ratio, which is now defined for integers of s bits as

    r_{BLE} = \frac{s}{8 \times p \times nbytes + (1 - p) \times s + 1}    (3.10)

with p the probability of a hit.


Although binary length encoding improves upon original length encoding for arbitrarily positioned small ranges of integers, it loses some generality for misses that still have some leading zero bytes. Furthermore, some of the disadvantages of the original scheme, such as the issue of byte-order dependence and undesirable data dependencies during compression, still remain. Also, the restriction to byte-aligned cutoff positions potentially wastes a lot of space.
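For illustration only, a minimal sketch of what a BLE compression loop along these lines could look like, assuming nbytes = 1, 32-bit inputs, a little-endian machine, and the bit/value numbering of Figures 3.7 and 3.8; the code itself is not taken from the thesis.

/* Encode 8 input values into one header byte plus a variable number of
 * payload bytes; hits are stored as 1 byte, misses as full 4-byte deltas. */
unsigned char *ble_encode8(unsigned char *dst, const unsigned int *src,
                           unsigned int base)
{
    unsigned char *header = dst++;
    *header = 0;
    for (int i = 0; i < 8; i++) {
        unsigned int delta = src[i] - base;
        if (delta < 256) {                        /* hit: fits in nbytes = 1 */
            *dst++ = (unsigned char) delta;
        } else {                                  /* miss: store full 32-bit delta */
            *header |= (unsigned char)(1u << i);
            *(unsigned int *) dst = delta;
            dst += 4;
        }
    }
    return dst;  /* first free output byte */
}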

3.5.3 Evaluation


To evaluate the performance of both LE and BLE, micro benchmarks were conducted on a synthetic collection of 32-bit integers, trying to compress them into 8-bit code words. In these experiments, performance was analyzed as a function of the miss rate, where a value that did not fit eight bits was considered a miss. It should be noted that the original LE does not have a strict hit/miss separation, so that it might actually compress values that are considered to be a miss. However, as the miss values are distributed uniformly, they are most likely to occupy four bytes, as is always the case for BLE.

Figure 3.9 shows both the compression ratios of LE and BLE, and their compression bandwidth, which is measured as the amount of source data that is consumed per second. Clearly, BLE wins in terms of speed. With respect to the compression ratios, it should be noted that the minor advantage of BLE over LE is mainly attributable to the fact that BLE uses a single header bit per value, against two for LE. If misses were not distributed uniformly but skewed towards the smaller miss values, for example numbers that fit into 9 to 16 bits, LE would outperform BLE on compression ratio.

Figure 3.9: Compression ratios (left) and bandwidths (right) for both LE and BLE. These bandwidths were obtained on the AMD Opteron.

The fact that BLE wins bandwidth-wise can mainly be attributed to the reduction in the number of conditional branches needed to determine the length of a value, which is two for LE against one for BLE. These branches are also the cause of the concave shapes of the bandwidth plots: for miss rates close to either zero or one, the CPU can do a good job predicting the branch outcome, which reduces the number of pipeline flushes due to mispredictions. Decompression bandwidths can be found in Figure 3.10. We can see that the loop approach performs similarly for both LE and BLE, and remains relatively flat for varying miss rates. This flat shape suggests that the algorithm is CPU-bound, as in the ideal case performance should be the product of the maximum memory read bandwidth and the compression ratio, which is far from true here. The suspicion that we can do better is strengthened further by looking at the bandwidths for the function table approach. There, the peak bandwidths, achieved at zero miss rate, lie around 3GB/s. The reason for this is twofold. First of all, the function approach does not suffer from data dependencies in the address computations. And second, as was shown in the code snippets in Figures 3.6 and 3.8, as long as the header bytes stay the same, which is clearly the case at zero miss rate, decoding stays within a single function, amortizing any function call overheads.



Figure 3.10: Decompression bandwidths for LE and BLE, using either a single decode loop (left) or a table of decode routines indexed by the header byte (right).


However, the function approach is suboptimal as well, as performance starts to degrade quickly with only minor increases in miss rate. This is caused both by function call overhead, due to function arguments that need to be pushed onto the stack and instruction execution that needs to be redirected to the function body, and by the fact that it quickly becomes very hard for a branch predictor to correctly predict which of the 256 functions to jump to. This is illustrated in Figure 3.11, which also shows the corresponding rapid decline in IPC.

Figure 3.11: IPC and branch misprediction rates for both LE and BLE on AMD Opteron, for both the loop and the decode-routine implementations.

Results on other platforms are left out, as the AMD Opteron is the most generic of all test platforms and is sufficient to prove the inefficiency of both length encoding methods. It is however worth mentioning that Intel's Itanium 2 really did not like this code, achieving compression and decompression bandwidths of only 10-20 MB/s. It is unlikely that this is entirely attributable to the data dependencies in the code, as performance was similar when using decode routines. Apparently, the Itanium does not like the irregular, byte-level memory addressing which is present in both approaches, although this would require further investigation to confirm.

In conclusion, although the performance of both length encoding methods is not very bad, there is significant room for improvement. Furthermore, we can see that high IPC in itself is not enough, as both LE and BLE achieve relatively high IPCs but disappointingly low bandwidths in their single-loop implementations. This is mainly caused by the decoding of the headers, which does not generate any output data by itself and might therefore be considered overhead. So what we should strive for are algorithms with performance similar to the decode loops within the decode routines, which have no data dependencies, no need to decode a header, have a high IPC, and achieve bandwidths of around 3GB/s.

Chapter 4

PFOR, PFOR-DELTA and PDICT

4.1 Patching

The main source of problems with both length encoding and binary length encoding is that they store compressed and uncompressed values in an intermixed fashion, storing the relevant meta-information in a separate header. Theoretically there is nothing wrong with such an approach; it is, however, the main source of all the practical problems described previously in Section 3.5. If we restrict ourselves to binary decisions of either a hit or a miss, as was already done for binary length encoding, we can avoid those problems. However, to introduce more structure in the encoding scheme we need to take the idea of separation even further, by physically separating the hits from the misses and getting rid of the headers. There are several ways to achieve this. One of those, called patching, is introduced in this section. The compression schemes that use this generic patching approach, PFOR, PFOR-DELTA and PDICT, are introduced later in Sections 4.2.1, 4.2.2 and 4.3, respectively.

4.1.1 Storage Layout

In general, categorization of values into hits and misses, using a single finite buffer, results in an area of codes, which is a sequential concatenation of the compressed hits starting at the beginning of the buffer and an area of exceptions, a concatenation of uncompressed misses, growing from the end of the buffer. This results in a layout similar to Figure 4.1. The buffer is divided into four sections:

• a fixed-size header, that contains compression-method specific information and things like sizes and bit-widths of the other sections.

• an optional entry point section that allows for fine-grained tuple access, if desired. Fine-grained access is explained in Section 4.1.7.

• a code section, which is a forward-growing area of densely packed code words, one per encoded value. The code word size can be arbitrary, but is fixed within a block. Packing and unpacking of arbitrary bit-width codes is discussed in Section 4.1.4.

• an exception section, growing backwards, which stores non-compressed values that could not be encoded into a small integer code.

Figure 4.1: Compressed buffer layout for the patching approach using a patch list. The example shows the first few digits of the number π (pi), with a digit being considered an exception if it is bigger than seven.
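As a purely illustrative sketch (field names and widths are assumptions, not the MonetDB/X100 on-disk format), the four sections of such a block could be described by a descriptor like the following, with the code section growing forward from code_offset and the exception section growing backward from the end of the block.

/* Hypothetical layout descriptor for a patched, compressed I/O block. */
typedef struct {
    unsigned int  method;           /* PFOR, PFOR-DELTA or PDICT                       */
    unsigned int  bit_width;        /* b: size of one packed code word                 */
    unsigned int  n_values;         /* number of encoded values                        */
    unsigned int  first_exception;  /* index of the first exception (patch list head)  */
    unsigned int  entry_offset;     /* start of the optional entry point section       */
    unsigned int  code_offset;      /* start of the forward-growing code section       */
    unsigned int  exception_offset; /* end of the backward-growing exception section   */
} block_header_t;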


/* NAIVE approach to decompression */
int code[n]; /* temporary machine addressable buffer */

/* blow up b-bit input code words into machine addressable representation */
UNPACK[b](code, input, n) ;

for(i=j=0; i < n; i++) {
    if (code[i] == MAXCODE)           /* escape code: this value is an exception  */
        output[i] = exception[--j] ;  /* take it from the backward-growing list   */
    else
        output[i] = DECODE(code[i]) ;
}

int UNPACK_17(unsigned int *codes, unsigned int *input, int n)
{
    int i;
    for(i=0; i<n; i+=32, codes+=32, input+=17) {
        codes[0]  =  (input[0]  >> 15) & 131071 ;
        codes[1]  = ((input[0]  & 32767) << 2)  | (input[1]  >> 30) ;
        codes[2]  =  (input[1]  >> 13) & 131071 ;
        codes[3]  = ((input[1]  & 8191)  << 4)  | (input[2]  >> 28) ;
        ...
        codes[30] = ((input[15] & 3)     << 15) | (input[16] >> 17) ;
        codes[31] =   input[16] & 131071 ;
    }
    return (i>>5) * 68 ;  /* bytes of packed input consumed */
}

Figure 4.8: Unpacking a vector of n 17-bit code words from input into machine addressable 32-bit integers in codes



Figure 4.9: Packing bandwidth and IPC on Opteron as a function of the code word size. Source data is 32 bits initially, with packed result data being written at either 32- or 64-bit granularity.

Once this data is in the cache, we can unpack it into the cache, where it can be decoded and patched. The reasoning can again be reversed for compression. To investigate the performance overhead of the packing and unpacking stages, both at 32- and 64-bit granularity, some micro benchmarks were conducted. The 64-bit variant was included to investigate whether reading and writing memory at 64-bit granularity improved memory bandwidth usage, thereby possibly improving performance. Figure 4.9 shows that both packing and unpacking can be done at high bandwidths, from which we can conclude that they do not induce large overheads. From the difference between cache-only runs (c2c) and their main memory versions (c2m and m2c), both with respect to bandwidth and with respect to IPC, we can further conclude that both packing and unpacking are main-memory bound, which strengthens the conclusion that CPU overheads are low. However, the differences between the 64-bit and the 32-bit variants are negligible, especially in the cache-to-memory and memory-to-cache runs, so it was decided to stick with 32-bit types, as this only requires loops to be unrolled into 32 steps to keep packed code words aligned to 32-bit boundaries after each iteration.

4.1.5 RAM-RAM versus RAM-Cache Decompression


To validate our assumption that into-cache (RAM-Cache) decompression improves performance over into-memory (RAM-RAM) decompression, a simple micro benchmark was conducted. For this purpose, a simple decoder was used which unpacks 8-bit code words into 32-bit integers and then patches the result in case of misses. Decompression bandwidth was analyzed as a function of the miss rate, decompressing into a single cache-resident destination vector in the RAM-Cache case, and, in case of RAM-RAM decompression, appending decompressed vectors to a large memory-resident result array. The results of this experiment can be found in Figure 4.10 and show a clear advantage for the RAM-Cache case. Bandwidths in the RAM-RAM case are almost flat. This implies that writing back the result data, which remains constant in size for varying miss rates, introduces a significant bottleneck. The advantage of RAM-Cache decompression is further illustrated by the difference in L2 cache miss rates, as more cache misses mean more memory traffic. On Opteron both cache miss rates are roughly equal, although this could be caused by an error in the L2 miss rate counter documentation or configuration, as the result is not as expected. These results clearly confirm the validity of our assumption that RAM-Cache decompression should provide for major gains in efficiency.

Figure 4.10: RAM-Cache (thick lines) versus RAM-RAM (thin lines) decompression bandwidth and L2 miss rate on Xeon, Opteron and Itanium 2, as a function of the miss rate, when unpacking and patching 8-bit codewords into 32-bit integers.

4.1.6 Compression Ratio

The compression ratio of schemes that separate between compressed hits and uncompressed misses is directly related to the miss rate, p_miss, as was defined in Equation 3.8. In the case of patching, the introduction of compulsory misses induces a lower bound on this miss rate of 1/2^b, which for a 4-bit code is already more than 0.06, or roughly 6 percent, and quickly becomes worse for smaller codes. To compute the average compression ratio of s-bit values into b-bit codes as a function of the bounded miss rate p'_miss = max(p_miss, 1/2^b), we can use Equation 4.1:

    r_{patch} = \frac{s}{p'_{miss}(b + s) + (1 - p'_{miss})\, b}    (4.1)
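As a quick worked example (the numbers are illustrative, not measurements from this thesis): compressing s = 32-bit values into b = 8-bit codes at an effective miss rate of 10% gives

    r_{patch} = \frac{32}{0.1 \times (8 + 32) + 0.9 \times 8} = \frac{32}{11.2} \approx 2.86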

One should note that for very low miss rates we can improve this a bit. As the index of the first miss is stored in the block header, it is not limited to b bits. By making it wide enough to index any location in the entire compressed block, the process of inserting compulsory misses only needs to be started after the first miss has been found. Until the first miss is found, compulsory misses are avoided by using the compression loop from Figure 4.11.

const unsigned int max = 1 << b ;
int i = 0 ;
unsigned int val ;
/* emit plain codes as long as every value is a hit; compulsory misses
   are only inserted once the first real miss has been encountered */
while (i < n && (val = ENCODE(input[i])) < max)
    code[i++] = val ;

4.1.7 Fine-Grained Access

/* random access to the value at position x, via its entry point */
ANY LOOKUP(int x) {
    int i = (x & ~((1 << M) - 1)) + entry[x >> M].offset_code ;
    int j = entry[x >> M].offset_exception ;
    /* walk partial exception list till position x */
    while(i < x) { i += code[i]; j--; }
    /* check whether we are dealing with an exception */
    if (i == x) return exception[j] ;
    else return DECODE(code[x]) ;
}

Each entry point, stored once every 2^M values, can be represented in a 32-bit integer type, of which M bits represent the relative code offset of the next exception, and 32 − M bits the exception offset. This introduces a storage overhead of 32/2^M bits per value. A value of either M = 7 or M = 8 is suggested, with the first case providing faster decompression speed and the second one providing less header storage overhead, by halving the size of the entry table, and more efficient decoding of the header, as both the code offset and the exception offset are byte-aligned in that case.

p’ (effective exception rate)

4.1. PATCHING

0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

33

b=1 b=2 b>4 (p’ = p)

b=3 b=4 0

0.05

0.1

0.15 0.2 p (data exception rate)

0.25

0.3

Furthermore, we need at least 24 bits for the exception offset, as that still allows for indexing a 16MB section of 8-bit exception values. As far as decompression latency is concerned, it would be nice to stay in the order of a cache miss, which is around 100-300 CPU cycles and is likely to occur anyway in case of a random access. Decompression speed is influenced by M as follows: assuming that we have 30% misses, which is a rough upper bound to keep this compression scheme useful in practice, our decompression loop takes about 0.3 × 2^M iterations, which is around 40 in case of M = 7. Assuming that we meet our target decoding performance of roughly four cycles per value, this would result in an overhead of 100-200 cycles, which is similar to the latency of a cache miss. For M = 8 this latency is roughly doubled, but still acceptable in case of lower miss rates.

The effect of fine-grained access on the compression ratio is twofold. First of all, the requirement to store an additional entry table in each block has a negative impact on the compression ratio. Second, however, the entry points allow us to reset the insertion of compulsory misses at each such point, as for each of them we know the relative index of the first miss within the entry point's range of 2^M values. Thus, the gaps between the start of this range and the first miss, and between the last miss in the range and the end of the range, do not need compulsory misses. This means that, on average, assuming a uniform distribution of misses, the area in the codes section that must be covered by compulsory misses can be estimated as 1/p_miss (where p_miss is the real miss rate in the source data). From this, we can estimate the effective miss rate p'_miss, which takes compulsory misses into account, as:

    p'_{miss} = \max\left(p_{miss},\ \frac{2^M p_{miss} - 1}{2^M p_{miss}} \times 2^{-b}\right)    (4.2)

Figure 4.12: The effect of compulsory misses on the effective miss rate p'_miss, for b ≤ 4.

Figure 4.12 illustrates this effect of compulsory misses on the effective exception rate for M = 128 and small code words of size 1 to 4 bits. It can be seen that for a code word size of 4, the effect of compulsory misses has become negligible.

4.1.8 Multiple Exception Lists

An extension that sacrifices compression speed, and some decompression speed, to improve the compression ratio can be found in maintaining multiple exception lists. In the original scheme there is just one exception list containing s-bit values, with s the uncompressed size of a single value. Sometimes these exceptions, which do not fit the code word size b, might still fit in fewer than s bits. So one could decide to, for example, maintain a total of three lists of exception values that fit into either 16, 24 or s (> 24) bits, thereby further compressing the exceptions. By storing the index of the first exception for each list, and maintaining a separate linked list through the locations that need to be patched for each of the lists, one can simply iterate over all the lists sequentially and apply the patches from each of them. The multiple exception list extension is only mentioned here, but was not implemented, as it is expected to perform significantly worse than a single-list approach, and the potential gains in compression ratio are questionable. Encoding performance suffers from additional checks that are needed to determine in which exception list a value should be stored. Depending on the number of bits used per value for a particular exception list, encoding might even require an additional packing phase for exceptions.


Furthermore, as all these lists need to be stored at the end of an I/O block, after the section of codes, we need to buffer them in main memory before writing them into the block, as we cannot predict how long each one will become. This requires external buffering, which introduces additional memory traffic and cache pollution. Decoding performance will mainly suffer from the fact that multiple exception lists introduce more random memory access patterns, which is a threat to locality and might therefore introduce more cache misses. Furthermore, decoding becomes more complex in case an exception value is stored in, for example, 24 bits, as this is not directly addressable on current machines. Although this extension is geared at improving the compression ratio, a potential danger comes from the fact that we need to keep multiple patch lists connected, which is likely to increase the number of compulsory misses.

4.2 PFOR

4.2.1 Regular PFOR

In PFOR, or Patched Frame-of-Reference, the b-bit code words represent an offset from a base value, where both the base value and b are compression parameters and are fixed for a single I/O block. If a value is not within the range [base, base + 2^b) it is considered a miss. PFOR exploits a single peak in the probability mass function underlying the source data, which occurs if values are clustered together within a single subrange of their original domain, as illustrated in Figure 4.13.

Figure 4.13: PFOR compression for b = 6 and base = 118 captures most of the values of this 256-value domain as codes; values outside [base, base + 64) become exceptions.

PFOR is logically equivalent to Binary Length Encoding, but implemented in a patching fashion, without header bytes. Furthermore, it allows code words of arbitrary bit-width. PFOR is also similar to the original FOR compression [12], but is not sensitive to outliers. FOR is, as it uses the minimum value in a block as the base and then encodes each value using ⌈log2(max − min + 1)⌉ bits. In PFOR, outliers are simply stored as exceptions. PFOR's encoding phase simply subtracts the base during encoding and then checks whether the result is a hit or a miss. During the decode phase, the base is simply added back to each value, thereby restoring the original data.
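For illustration, a minimal sketch of the hit/miss test at the heart of PFOR encoding; the names are assumptions, and the handling of the exception list, the patch list and compulsory misses is deliberately left out.

/* Classify one value: a hit becomes a b-bit code relative to base,
 * a miss must be stored in the exception section instead. */
static inline int pfor_encode_one(unsigned int value, unsigned int base,
                                  unsigned int b, unsigned int *code_out)
{
    unsigned int delta = value - base;
    if (delta < (1u << b)) {   /* hit: fits in b bits */
        *code_out = delta;
        return 1;
    }
    return 0;                  /* miss: caller appends value to the exception list */
}

/* Decoding a hit is a single addition. */
static inline unsigned int pfor_decode_one(unsigned int code, unsigned int base)
{
    return base + code;
}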

4.2.2 PFOR-DELTA

PFOR-DELTA is regular PFOR applied to the difference, or delta, between a value and its predecessor. It exploits sequential orderings within a column of values, under the assumption that differences between subsequent values are similar in magnitude, and that those differences cover a smaller subset of the integer domain than the original values. If there is a peak in the frequency distribution of the differences, we can exploit it using PFOR. As opposed to PFOR, PFOR-DELTA is also usable if a collection of values covers its full domain, as long as there are long runs of subsequent values that are close to each other within that domain. This is for example the case for sorted or clustered data, often including automatically incremented object identifiers, timestamps and dates. PFOR-DELTA's encoding phase first subtracts a value's predecessor and after that the base value. The result is checked against fitting into b bits. During decoding, the operation is inverted, adding both the base and the predecessor to the value that is currently being decoded. There are some problems that make PFOR-DELTA likely to perform worse than PFOR. First of all, during decoding, the summation of subsequent values introduces a data dependency between the current value and its predecessor, as we can only compute the current value if the previous one is available.


A final issue is related to random access through multiple entry points in the block header, which was described in Section 4.1.7. To be able to decode a random value, we must also store the current running total for each entry point, resulting in a storage overhead of (32 + s)/2^M bits per value, where s is the size of the data type being encoded, in bits, and 2^M is the interval at which entry points are inserted. As a large block can potentially hold millions of values, one has to be aware of the fact that such an entry point table can occupy a large part of a block. Furthermore, decoding will now require decompression of 2^(M−1) values on average, as we need to compute the sum up to the value we are interested in. For M = 7 or M = 8 this amounts to a CPU cost in the order of several hundreds of cycles.
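The data dependency can be seen directly in a sketch of the delta-decoding loop (illustrative only; the real decoder works on unpacked, patched codes as described above).

/* Prefix-sum style delta decoding: every iteration needs the previous result,
 * so the CPU cannot overlap iterations the way it can for plain PFOR. */
void pfordelta_decode(unsigned int *dst, const unsigned int *codes,
                      unsigned int base, unsigned int previous, int n)
{
    for (int i = 0; i < n; i++) {
        previous = previous + base + codes[i];  /* depends on the previous value */
        dst[i] = previous;
    }
}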

4.2.3 Parameter Selection

The performance of both PFOR and PFOR-DELTA relies largely on the values of the base and b parameters: base should be chosen somewhere to the left of the largest peak in the value distribution, and b should be chosen in such a way that it optimizes the compression ratio. Of course, these parameters could be set by the database administrator (DBA) for each column at creation time; however, that involves careful analysis of the source data and does not allow the parameters to vary over I/O blocks. The alternative is to have automatic parameter selection during compression of a column. To find the optimal parameters for a collection of N s-bit source values, and if time is not an issue, the following exhaustive search algorithm could be used. For PFOR-DELTA, the same algorithm can be used, but then on the differences.

1. Sort: Sort the N s-bit input elements into v.

2. Build histogram: Count the number of occurrences of each of the H distinct values present in v:

   H ← 1
   value[1] ← v[1]
   count[1] ← 0
   For i = 1 to N
       If not v[i] = value[H]
           H ← H + 1
           value[H] ← v[i]
           count[H] ← 0
           cumsum[H] ← cumsum[H − 1]
       End
       count[H] ← count[H] + 1
       cumsum[H] ← cumsum[H] + 1
   End

3. Find optimal compression ratio: Exhaustively search for the lo and hi parameters that optimize the compression ratio r:

   r_opt ← 0
   For lo = 1 to H
       For hi = lo to H
           If r(lo, hi) > r_opt
               lo_opt ← lo
               hi_opt ← hi
               r_opt ← r(lo, hi)
           End
       End
   End

   where the compression ratio r can be computed using

    r(lo, hi) = \frac{N \times s}{N_c \times \lceil \log_2(value[hi] - value[lo]) \rceil + (N - N_c) \times s}    (4.3)

36

CHAPTER 4. PFOR, PFOR-DELTA AND PDICT

and Nc = cumsum[hi] − cumsum[lo − 1] is the number of values in the range of codes. Special care needs to be taken in case lo = hi, in which case 1 should be taken for the logarithmic term, which represents the code word size b. If lo = 1, Nc should be taken to be cumsum[hi]. Furthermore, in situations where the range [lo − hi] does not cover the full histogram, the compression ratio should be bounded from below to accommodate for compulsory misses.

4. Done: Compute the optimal return parameters: base ← lo, b ← ⌈log2(value[hi] − value[lo])⌉

This algorithm uses O(N) memory and has a time complexity of O(N log N) + O(H²), where the O(N log N) term is needed to sort the input, and the O(H²) term is introduced by the search for the optimum. By taking a fixed set B of bit-widths we are interested in, we can reduce this time complexity in case |B| < H², by running the linear-time algorithm from Figure 4.14 on a sorted set of N input values, for all b ∈ B. This results in an overall complexity of O(N log N) + O(N × |B|), where |B| will often be some fixed constant. If the set of input values is not the full data set but only a sample, one should take care not to allow for small values of b, which could ruin compression due to compulsory misses introduced by exceptions outside the sample.

lo ← 1
length ← 0
base ← 0
range ← 2^b
For hi = 1 to N
    If v[hi] − v[lo] > range
        If hi − lo > length
            base ← lo
            length ← hi − lo
        End
        While v[hi] − v[lo] > range
            lo ← lo + 1
        End
    End
End

Figure 4.14: Finding the optimal base value for PFOR compression at bit-width b for N input values.

This can be done in one pass through the sample, and can be repeated for all the B values of b that are of interest, resulting in a complexity of O(S log(S)) for the sort, plus O(B × S) to find suitable parameters. The set of bit-widths, B, to try will in general be constant.
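A hedged C translation of the sliding-window scan of Figure 4.14, assuming a sorted array v[0..N-1]; it returns the start index of the longest window that fits a given bit-width, and the variable names are mine rather than the thesis'.

/* For a sorted array v of N values, find the start index of the longest run
 * of consecutive values that fits within a 2^b range, i.e. the best base
 * candidate for b-bit PFOR codes. */
int find_base_index(const unsigned int *v, int N, unsigned int b)
{
    unsigned long long range = 1ULL << b;
    int lo = 0, best = 0, length = 0;
    for (int hi = 0; hi < N; hi++) {
        while ((unsigned long long)(v[hi] - v[lo]) >= range)
            lo++;                        /* shrink the window until it fits */
        if (hi - lo + 1 > length) {      /* remember the longest window     */
            length = hi - lo + 1;
            best = lo;
        }
    }
    return best;                         /* base would then be v[best]      */
}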

4.2.4 In-place Decompression

Both PFOR and PFOR-DELTA allow for in-place decompression, meaning that we do not need the temporary code buffer from the original decompression algorithm in Figure 4.3. Code words are simply unpacked into the output buffer, patched and then decoded, as illustrated in Figure 4.15. Note that the patching stage and the decoding stage are interchanged now, as we do not want to apply decoding on the patch-list offsets, which would ruin it.

Figure 4.16: Compression ratio (a) and PFOR decompression bandwidth on Opteron (b) for several code word sizes, as a function of the miss rate in the source data. We can see that for small b, compulsory misses have a significant impact on both the compression ratio and the decompression bandwidth.

    /* initialize cur to index of first exception within codes */
    int cur = first_exception;
    /* blow up b-bit input code words into output buffer */
    UNPACK[b](output, input, n);
    /* LOOP1: patch it up */
    for(int i = 1; cur < n; i++) {
        unsigned int tmp = output[cur];
        output[cur] = exception[-i];
        cur = cur + tmp;
    }
    /* LOOP2: decode all values in-place */
    for(int i = 0; i < n; i++)
        output[i] = DECODE(output[i]);

Figure 4.15: In-place PFOR decompression.

Cuckoo hashing [30] maintains two hash tables; it has been shown to insert all K keys in time proportional to O(K), using O(2M) space. More importantly, lookup is guaranteed to be O(1), with each search requiring at most two probes. In short, values are inserted into one of the two tables, left or right, each with its own hash function. In case of a collision, the already present value is reinserted into the other table, and so on until an empty bucket is found. This way, for any key collection, a collision-free distribution over the two hash tables can be found [30]. Once the tables have been populated, the body of the main PDICT encoding loop looks as follows:

    /* Dictionary Lookup with Cuckoo Hashing */
    inline int PDICT_ENCODE(ANY v) {
        int l = HASH1(v) & MAXCODE;
        int r = HASH2(v) & MAXCODE;
        if (left_hash[l] == v) return l;
        if (right_hash[r] == v) return r;
        return MAXCODE;
    }
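The insertion procedure described above can be sketched as follows. ANY, HASH1, HASH2 and MAXCODE are the same symbols used in the lookup routine; the EMPTY marker and the bound on displacement rounds are illustrative assumptions, not part of the original description.

    /* Sketch: Cuckoo insertion into two tables of MAXCODE+1 buckets each.
     * Returns 1 on success, 0 if the displacement chain becomes too long
     * (in which case the tables must be rebuilt with new hash functions). */
    int cuckoo_insert(ANY v, ANY *left_hash, ANY *right_hash)
    {
        for (int round = 0; round < MAX_DISPLACEMENTS; round++) {
            int l = HASH1(v) & MAXCODE;        /* try the left table first    */
            ANY evicted = left_hash[l];
            left_hash[l] = v;
            if (evicted == EMPTY) return 1;

            int r = HASH2(evicted) & MAXCODE;  /* displaced value moves right */
            v = right_hash[r];
            right_hash[r] = evicted;
            if (v == EMPTY) return 1;          /* empty bucket found: done    */
            /* otherwise the value displaced from the right table starts over */
        }
        return 0;
    }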

                            Xeon 3.0GHz        Itanium2 1.3GHz    Opteron 2.0GHz
    Bucket-Chained Hash
      PDICT cost            48 cycles/value    25 cycles/value    27 cycles/value
      C                     500 MB/s           416 MB/s           592 MB/s
    Cuckoo Hash
      PDICT cost            34 cycles/value    5 cycles/value     25 cycles/value
      C                     796 MB/s           2114 MB/s          640 MB/s
    Memorized Probing
      single-cursor         33 cycles/value    6 cycles/value     20 cycles/value
      PDICT cost            24 cycles/value    7 cycles/value     14 cycles/value
      PDICT IPC             1.35 (max: 3)      5.25 (max: 6)      1.45 (max: 3)
      C                     1000 MB/s          1486 MB/s          1142 MB/s

Table 4.2: Hash Algorithm Performance in PDICT compression (64-bits into 8-bits)
Although Table 4.2 shows that encoding speed using Cuckoo hashing is better than for bucket-chained hashing, there is still room for improvement. Performance-wise, the main problem of the encoding routine is that values are distributed randomly over the two tables, which causes branch mispredictions in the two conditionals within the routine. Another disadvantage is that the storage requirement is O(2M), with M > K, which wastes roughly half the storage space and code word space.

Memorized Probing

As with Cuckoo hashing, our new memorized probing scheme makes use of two hash functions. The first one determines an index into an offset table, containing 4M single-byte offsets. The second one determines an index into a value array, which can store M = 2^b dictionary values.

Figure 4.25: Memorized probing for M = 4. The offset hash, in combination with hash function H2, is used to memorize the offsets that are used to resolve collisions in the value hash, which uses hash function H1.

In the traditional linear probing approach to hash table construction, values are simply hashed into a single value table. If a collision occurs, subsequent positions in the table are probed until a free slot is found. This results in many values being stored at some offset from the position they originally hashed to, meaning that a lookup must loop over the table until either the value is found or an empty bucket is encountered. The idea behind memorized probing is to remember the offsets from the original hash position in the separate offset hash table, as illustrated in Figure 4.25, so that the search loop can be eliminated. Once the dictionary has been populated, we can encode an array of values v using a predicated encoding kernel over these two tables; the hash functions themselves are cheap shift-and-XOR combinations of the key (e.g. (key>>11) ^ (key>>23) ^ key, with the offset hash OFFSET_HASH built analogously from (key>>13) and (key>>21)).
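A possible shape of such a predicated kernel is sketched below. The macro names VALUE_HASH and OFFSET_HASH, the global value[] and offset[] arrays, the masking with the table sizes, and the exception buffers are illustrative assumptions that follow the description above; the listing is not the original kernel.

    /* Sketch: branch-free PDICT encoding with memorized probing.
     * value[]  : M = 2^b dictionary values
     * offset[] : 4*M single-byte probe offsets
     * Values missing from the dictionary are collected as exceptions,
     * using the predication idiom of Appendix A.8.1.                      */
    int pdict_encode(const ANY *v, int n, unsigned char *code,
                     ANY *exc_value, int *exc_pos)
    {
        int j = 0;                                    /* number of exceptions    */
        for (int i = 0; i < n; i++) {
            int idx = (VALUE_HASH(v[i]) +
                       offset[OFFSET_HASH(v[i]) & (4 * M - 1)]) & (M - 1);
            int miss = (value[idx] != v[i]);          /* 1 if not in dictionary  */
            code[i] = (unsigned char) idx;            /* provisional code word   */
            exc_value[j] = v[i];                      /* always written ...      */
            exc_pos[j]   = i;
            j += miss;                                /* ... kept only on a miss */
        }
        return j;
    }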

Figure A.4: Stalls due to a control hazard, represented as no-operations (NOP).

A.6

Cache Memories

Until now the solutions proposed against data and control hazards focused on hiding their latencies. But what about the latencies themselves? The worst latencies are caused by memory references, which typically take in the order of hundreds of clock cycles to fulfill. Such latencies are clearly unacceptable and hard to hide effectively by out-of-order execution. Therefore, caches were introduced. Caches are smaller but faster memories that are positioned closer to the CPU, often on the same die. There are often two levels of caches and sometimes even three. The highest level, closest to the CPU, is smallest and fastest, while lower levels are bigger and slower, but still smaller and faster than main memory. Caches hold copies of regions of main memory, thereby allowing for faster access to those regions by the CPU. They rely on the concepts of spatial and temporal locality. Temporal locality: if a data item or instruction is referenced, it will tend to be referenced again soon. Spatial locality: if a data item or instruction is referenced, items whose addresses are close by will tend to be referenced soon. Caching is transparent to a program and is handled entirely in hardware. Often, there are separate caches for instructions and data. The event of finding an item in a cache is referred to as a cache hit; if an item is not found in a cache, this is called a cache miss.

A.7

Role of Compilers

Compilers can increase instruction throughput by hiding, and sometimes even entirely removing, data and control dependencies. Hiding can be done by reordering instructions, which is known as static scheduling, as opposed to the dynamic, run-time scheduling performed by out-of-order processors. By reordering instructions, a compiler can sometimes fill slots that would otherwise contain a pipeline bubble with independent instructions from other locations in the program, for example by moving memory load and branch instructions to an earlier slot.

A.7.1

Loop Unrolling

To reduce the overhead of loops, such as incrementing loop counters and testing the loop condition, a compiler can unroll loops. Logically, it will rewrite a loop of the form:

    for (i = 0; i < n; i++) {
        sum = sum + a[i];
    }

into:

    for (i = 0; i < n - 4; i = i + 4) {
        sum = sum + a[i];
        sum = sum + a[i+1];
        sum = sum + a[i+2];
        sum = sum + a[i+3];
    }
    while (i < n) {
        sum = sum + a[i++];
    }

A.7.2

Loop Pipelining

Another compile-time optimization on loops is loop pipelining, which transforms a computation consisting of multiple dependent operations, f() and g(), applied to all n independent elements of an array a from: f(a[0]), g(a[0]), f(a[1]), g(a[1]), ..., f(a[n-1]), g(a[n-1])

into: f(a[0]), f(a[1]), f(a[2]), g(a[0]), g(a[1]), g(a[2]), f(a[3]),...

This way, the compiler can make sure that the modified value of a[0], after f(a[0]), is available by the time g(a[0]) needs it, and so on, potentially removing all data hazards entirely.
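To make this concrete, consider the following illustrative loop (not taken from the text), in which the second statement depends on the first; a loop-pipelining compiler may effectively schedule it so that each f() result is computed several iterations before g() consumes it:

    /* Original: g() (the addition) depends on f() (the multiplication). */
    for (int i = 0; i < n; i++) {
        int t = a[i] * 3;      /* f(a[i]) */
        b[i] = t + c[i];       /* g(...)  */
    }

    /* Effect of loop pipelining (assuming n is large; the epilogue for the
     * last three elements is omitted): f() runs three iterations ahead,
     * so its result is ready by the time g() needs it.                    */
    int t0 = a[0] * 3, t1 = a[1] * 3, t2 = a[2] * 3;
    for (int i = 0; i + 3 < n; i++) {
        b[i] = t0 + c[i];      /* consume the oldest f() result  */
        t0 = t1; t1 = t2;      /* shift the software pipeline    */
        t2 = a[i + 3] * 3;     /* start f() three elements ahead */
    }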

A.8

Coding Techniques

A.8.1

Predication

In general, the negative performance impact of control hazards is bigger than that of data hazards. For this reason, one can sometimes benefit from transforming a control hazard into a data hazard by a technique called predication [34]. Predication rewrites selective conditions on a collection of values from a naive implementation, with an if statement in the loop body, into a more optimal form, which increments a loop counter with a boolean, defined to be either 0 (false) or 1 (true). For example, to find the positions of all values greater than 255 in array a and store those positions in array b, we could write:

    for (i = 0, j = 0; i < n; i++) {
        if (a[i] > 255) {
            b[j++] = i;
        }
    }

Using predication we can remove the if statement by rewriting this code into:

    for (i = 0, j = 0; i < n; i++) {
        b[j] = i;              /* always store index i as an exception position */
        j += (a[i] > 255);     /* increment exception index j with a boolean    */
    }

Now, the current loop index, i, is always written to position b[j], but j is incremented by one only if a[i] > 255.

A.8.2

Multiple Cursors

A compiler can only extract instruction-level parallelism if it exists logically at the source code level. This means that in loops with control or data hazards, where loop unrolling is not effective, it might pay off to extend the loop body with more independent code that the compiler or CPU can use to fill up pipeline stalls. A technique to extract this kind of concurrency in case of array iteration is to split the array into two or more distinct regions, and use a single loop to iterate over those regions in parallel by maintaining multiple array indices, or cursors. For example, to hide the remaining data dependency after applying predication in our example from the previous section, we could split the array into two equally sized halves and process them independently, as in:

    int m = n/2;
    for (i = 0, j_0 = 0, j_m = 0; i < m; i++) {
        b_0[j_0] = i;
        b_m[j_m] = i + m;
        j_0 += (a[i] > 255);
        j_m += (a[i+m] > 255);
    }

This type of concurrency is not detectable by the compiler but can often help to fill up all the parallel pipelines in a modern CPU. Theoretically, one can extend this technique to an arbitrary number of cursors, but eventually this will lead to register spills, meaning that a lack of CPU registers causes additional cache traffic, due to the inevitable introduction of new variables.

A.9

VLIW

Out-of-order dynamic scheduling as implemented by most modern, high-performance CPUs requires a lot of hardware resources to detect parallelism and reorder instructions at run-time. A different approach is taken in the very long instruction word (VLIW) designs. Just as with super-scalar processors, a VLIW architecture has multiple functional units. However, the hardware is freed entirely from dynamic dependency detection and reordering of instructions. To fill the parallel functional units, a VLIW architecture packages multiple, independent instructions into one very long instruction, often called a bundle, of which it dispatches a single instance every clock cycle. Once dispatched, a bundle is executed in order. The responsibility of detecting the parallelism needed to create the static instruction bundles rests entirely with the compiler. Because of this dependence on the compiler, it becomes even more important to design inherently parallel algorithms, as the compiler will not be able to detect everything. Because the hardware needed for out-of-order execution does not scale well with issue width, VLIW processors become especially beneficial when more than four parallel execution pipelines are employed. For example, Intel's Itanium 2 VLIW processor uses bundles of six instructions, which is twice the issue width of Intel's out-of-order Pentium 4. One advantage of such an amount of parallelism is that branch misprediction penalties can often be eliminated entirely by executing both the if part and the else part in parallel and discarding one of them as soon as the branch condition is known.

    decimal    binary
     127       01111111
       2       00000010
       1       00000001
       0       00000000
      -1       11111111
      -2       11111110
      -3       11111101
    -128       10000000

Table A.1: Some 8-bit signed integers using two's complement representation.

A.10

Data Representation

Most computers store data in binary representation. Most often, the smallest addressable unit is the byte, which is 8 bits and can therefore be used to represent 2^8 = 256 distinct values. This means that larger objects, which are capable of representing more than 256 distinct values, need to be stored in an integer number of bytes. Besides bytes, most systems provide addressing of half words (16 bits), words (32 bits) and often also double words (64 bits). There is, however, a difference in the way systems represent such larger types. For example, a word is made up of four bytes, but these bytes in turn can be ordered in multiple ways. The two orderings in use nowadays are the so-called little-endian and big-endian byte orders. In a little-endian ordering, the least significant byte, that is, the byte with the lower order bits, is stored at the first byte address and the most significant byte at the last one. In big-endian representation it is exactly the other way around, as illustrated in Figure A.5.

Figure A.5: Big-endian (left) versus little-endian (right) representation of the binary number 00000000 00000001 00000010 00000011, which is 66051 in decimal representation.

Another issue with regard to the representation of integer numbers is how the sign of a number is represented. The most common way to do this is the so-called two's complement representation. In this approach, the highest order bit is used to represent the sign. If it is zero, we are dealing with a positive integer and all leading zeroes can be ignored. If, however, the first bit is a one, then the integer is negative, and all leading ones can be ignored, as now the positions of the zero bits are important. Some example two's complement conversions between decimal and binary representation can be found in Table A.1. Because zero is represented with a positive sign, the maximum positive integer that can be represented using b bits is 2^(b-1) − 1, while the minimum negative integer is −2^(b-1).
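Both conventions can be observed directly with a few lines of C (an illustration, not part of the text): the program below prints the in-memory byte order of the 32-bit value 66051 from Figure A.5, and the two's complement bit pattern of -3 from Table A.1.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t word = 66051;                    /* 0x00010203, as in Figure A.5 */
        const unsigned char *bytes = (const unsigned char *) &word;

        /* little-endian machines print "03 02 01 00", big-endian "00 01 02 03" */
        for (int i = 0; i < 4; i++)
            printf("%02x ", bytes[i]);
        printf("\n");

        int8_t x = -3;                            /* two's complement: 11111101   */
        printf("%02x\n", (unsigned char) x);      /* prints "fd"                  */
        return 0;
    }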

Appendix B

Relational Database Systems

This appendix introduces the basic concepts behind relational database systems and query processing in Section B.1. More recent research results can be found in Section B.2, where we take a look at the interaction between database architecture and the underlying hardware. In Section B.2.3 we also present an overview of the novel MonetDB/X100 database kernel, being developed in the database group at CWI and designed to extract maximum performance out of today's CPUs.

B.1

General Theory

A database is a collection of interrelated data. A program used to manage such a database is a database management system (DBMS). The main goal of a DBMS is to provide its end-users with a convenient and efficient way to retrieve and manipulate the underlying data. Nowadays, most mainstream database systems implement the relational model. What follows is a quick overview of relational database systems (RDBMS), which implement this model. For a more elaborate introduction to this topic the reader is referred to a textbook such as [39].

B.1.1

Relational Model

A logical database design is called a schema, while a snapshot of its contents at a given instant in time is called a database instance. In the relational model both data itself and the relationships between those data are represented by means of tables, as illustrated in Figure B.1. A table is built up of rows and columns. The columns represent the attributes of a relation according to the relation schema which defines those attributes and their domains. The set of rows, also called tuples or records, on the other hand, represents a relation instance, or simply relation. Each table has a unique name. Attribute names have to be unique within a given table.

customer:

    custkey   name      address              nationkey
    1         Johnson   Main 31, New York    1
    2         Smit      Dam 2, Amsterdam     2
    3         Jansen    Plein 23, Den Haag   2

nation:

    nationkey   name          continent
    1           USA           North America
    2           Netherlands   Europe

Figure B.1: Sample relational database consisting of two tables, customer (top) and nation (bottom).

B.1.2

Storage Layouts

NSM

The relational model is a logical model and does not mandate anything about the way data is stored physically on disk. Still, most database systems store data according to the conceptual model, meaning that all attributes of a row are stored together in a record on disk. This approach is often termed the n-ary storage model (NSM), where n refers to the number of attributes in a relation. Each table is mapped to a single file, which stores records from the given table one after another, often distributing them over multiple fixed-size disk blocks. Records can be either of fixed or variable length. In case of variable-length records, the offsets of each record need to be explicitly encoded, as illustrated in Figure B.2, which shows an NSM disk block for our customer relation from Figure B.1. NSM stores full records together to allow for relatively easy, fine-grained locking mechanisms, such as block-level or even tuple-level locking. Locking is needed to keep the database consistent in case of concurrent modifications; the more fine-grained the locking mechanism, the more concurrency a DBMS can allow. In case of data-intensive, analytic applications, which mainly need read-only access to a database, such locking mechanisms are not needed, and NSM actually becomes a potential performance bottleneck. The first performance disadvantage of NSM is that if we only need access to a small number of attributes in a relation, we still need to read all attributes from disk, which might introduce I/O bandwidth bottlenecks. A second disadvantage is that, due to the lack of spatial locality between attribute values, NSM incurs a lot of data cache misses within the CPU [2], which results in accesses to lower layers of the memory hierarchy, together with the associated performance penalties.

Figure B.2: NSM storage layout for customer relation.

DSM

An alternative storage layout is presented by the decomposed storage model (DSM) [8]. DSM partitions a relation with n attributes vertically into n binary relations, as illustrated in Figure B.3 for the customer relation from our example database in Figure B.1. This results in binary relations consisting of a single attribute column taken from the original relation, paired with a unique object-id (oid) to prevent us from losing the original relational schema semantics.

Figure B.3: DSM representation of example customer relation.

As opposed to NSM, a DSM disk block contains only values from a single attribute. As long as we keep attribute columns from the same table vertically aligned, we do not even need to store the object-ids on disk, just the values. For variable-length records we still need to store tuple offsets though.

The DSM storage layout of our decomposed customer relation from Figure B.3 can be found in Figure B.4. An alternative way to store variable-length records would be to split them into two columns: one that contains a sequential concatenation of the values only, and another one containing the offsets of the individual values.

Figure B.4: DSM storage layout for customer relation.

DSM does not suffer from the performance problems present in NSM, because in DSM we can simply read only those attributes we actually need. Furthermore, attribute values are clustered together, providing for more efficient cache usage. However, DSM incurs a performance penalty in case n-ary relations need to be reconstructed, as this needs to be done explicitly by the CPU at run-time.

PAX

A third, and more recent, storage layout is provided by partition attributes across (PAX) [1]. As in NSM, PAX stores the attribute values of a record in the same disk block. However, within a block, PAX applies vertical decomposition into so-called mini pages, as illustrated in Figure B.5.

Figure B.5: PAX storage layout for customer relation.

The idea behind such a storage scheme is that it combines the best of NSM and DSM in a single storage layout. Indeed, PAX does not suffer from the cache miss problems present in NSM, as attributes are clustered within a block, providing spatial locality. Furthermore, PAX does not suffer from significant reconstruction overheads, as only the records within a block need to be reconstructed, which incurs negligible cost [1]. However, PAX still suffers from the fact that it always needs to read all attributes from disk, even if it needs only a few of them, which incurs the same I/O overhead as was the case with NSM.
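In terms of in-memory data structures, the difference between NSM and DSM resembles the classic array-of-records versus record-of-arrays choice, as the following sketch illustrates (attribute types and sizes are simplified and purely illustrative):

    /* NSM-style: all attributes of a customer stored together. */
    struct customer_nsm {
        int  custkey;
        char name[32];
        char address[64];
        int  nationkey;
    };
    struct customer_nsm customers[1000];   /* scanning custkey drags in every attribute */

    /* DSM-style: one array per attribute, kept positionally aligned. */
    struct customer_dsm {
        int  custkey[1000];
        char name[1000][32];
        char address[1000][64];
        int  nationkey[1000];
    };
    struct customer_dsm columns;           /* scanning custkey touches only custkey[]   */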

B.1.3

Relational Algebra

To be able to query a relational schema we need a query language that operates on the relations defined by the database. One such language is the relational algebra. It implements the following relational operators, which operate on either one or two input relations and produce a relation as a result. Since relations are sets of tuples, we can start with the fundamental set operations: union, difference and Cartesian product. For union and difference, both operands have to share the same schema, and the result is of this schema as well.

Union (R1 ∪ R2) results in a relation containing all tuples occurring in R1, in R2, or in both.

Difference (R1 − R2) results in a relation containing all tuples from R1 that do not occur in R2.

Cartesian Product (R1 × R2) results in a relation R with a schema that is the concatenation of the schemata of both R1 and R2. It contains all tuples that can be generated by concatenating each tuple from R1 with every tuple in R2.

Furthermore we have the operators specific to relational algebra:

Select (σ_P(R1)) results in a relation containing all tuples from R1 that satisfy boolean predicate P.

Project (Π_S(R1)) results in a relation containing only a subset of the attributes of R1. The subset is listed in attribute list S. As the result relation is a set as well, potential duplicates are eliminated from the result.

Rename (ρ_x(R1)) results in a relation which is a copy of input R1, but with name x.

An additional operator, which can be defined in terms of the above operators but is useful enough to deserve its own listing, is the natural join.

Join (R1 ⋈ R2) results in a relation which is the Cartesian product of R1 and R2, restricted to those tuples that satisfy equality on all shared attributes, with duplicates removed. Formally, $\sigma_{R_1.a_1 = R_2.a_1 \wedge R_1.a_2 = R_2.a_2 \wedge \cdots \wedge R_1.a_n = R_2.a_n}(R_1 \times R_2)$, with $a_1, a_2, \ldots, a_n$ representing all shared attributes. To avoid attribute name collisions, colliding names are prefixed by the name of their source relation.

As each operator produces a relation as output, operators can be cascaded into more complex queries. For example, we can use

$\Pi_{customer.name}(\sigma_{nation.name='Netherlands'}(nation) \bowtie customer)$    (B.1)

to select the names of all customers that live in the Netherlands from the sample database in Figure B.1.

B.1.4

Query Processing

Intuitively, cascaded queries can be represented by an operator tree, as depicted in Figure B.6 for our example query. A naive way to execute this query would be to start evaluating operators at the lowest level of the tree and evaluate all operators one by one, materializing all intermediate results by writing them to main memory. If the input relations to an operator are all available, the operator is ready to execute and produce its own set of results.







Figure B.6: Operator tree representation of the example query.

Often, full materialization of intermediate results is not needed and is even undesirable. A frequently used solution is to implement the evaluation of an operator tree in a pipelined fashion. In such an approach, several operations are combined into a pipeline in which the results of one operation, the producer, are passed along to its parent in the tree, the consumer, as soon as they become available. The results are then immediately processed by the consumer. This reduces the amount of intermediate results that need to be stored and can thus save significantly in I/O costs.

A pipeline where an operator has to explicitly request the next result from its child nodes is called demand-driven, while a pipeline where the child pushes results into its parent is called producer-driven. In a demand-driven pipeline, which is the most common, data is pulled up through the operator tree, starting at the topmost node. A single operation in such a pipeline is therefore often implemented as a Volcano [14] style iterator, which provides the operations open(), next() and close(). After a call to open(), each call to next() returns the next output tuple of the operation. The operation in turn calls next() on its children to get input tuples whenever required. The iterator maintains the state of its execution in between calls to next(), so that successive calls to next() return successive result tuples. The close() call is used to notify an iterator that no more tuples are needed.
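A bare-bones C rendering of such an iterator interface could look as follows; the type and field names are illustrative, and real systems add parameters for expressions, schemas and resource management.

    /* Sketch of a Volcano-style iterator: every operator exposes
     * open/next/close and pulls its input from child iterators.           */
    typedef struct tuple    Tuple;      /* one result row (opaque here)        */
    typedef struct iterator Iterator;

    struct iterator {
        void   (*open)(Iterator *self);     /* initialize operator state       */
        Tuple *(*next)(Iterator *self);     /* next result tuple, NULL = done  */
        void   (*close)(Iterator *self);    /* release resources               */
        Iterator *child;                    /* producer feeding this operator  */
        void     *state;                    /* operator-private state          */
    };

    /* Driving a query: pull tuples from the root of the operator tree. */
    void run(Iterator *root)
    {
        root->open(root);
        for (Tuple *t = root->next(root); t != NULL; t = root->next(root)) {
            /* deliver t to the client */
        }
        root->close(root);
    }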

B.1.5

Database Modifications

Until now, the discussion has focused on retrieving data from a database system. In reality, however, information in a database system might need to be modified from time to time. Modifications to a database can be of three different kinds: deletion of existing tuples in the database, insertion of new tuples, and updates on (a subset of) the attributes of existing tuples. In formal relational algebra terms, deletion is defined as

$r \leftarrow r - E$

where r is a relation and E an arbitrary relational algebra query. Similarly, insertion can be defined as

$r \leftarrow r \cup E$

Updates involve a selection (σ) over some predicate P, selecting which tuples to update. A generalized projection (Π) over the attributes F_i of relation r is then used to update the desired attributes:

$r \leftarrow \Pi_{F_1, F_2, \cdots, F_n}(\sigma_P(r)) \cup (r - \sigma_P(r))$
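For instance, on the sample database of Figure B.1, removing all customers located in the Netherlands (nationkey 2), or setting customer 3's nationkey to 1, could be written as (illustrative examples, applied independently):

$customer \leftarrow customer - \sigma_{nationkey=2}(customer)$

$customer \leftarrow \Pi_{custkey,\;name,\;address,\;1}(\sigma_{custkey=3}(customer)) \cup (customer - \sigma_{custkey=3}(customer))$

where the generalized projection in the second expression replaces the nationkey attribute of the selected tuple by the constant 1.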

B.1.6

Application Domains

At a high level, database applications can be categorized into two classes: on-line transaction processing (OLTP) systems and decision support systems (DSS). As the name implies, OLTP is highly focused on efficient processing of transactions. Think, for example, of an on-line store which has to handle orders and credit-card transactions on a per-customer basis. This often involves lots of database accesses and updates at single-tuple granularity. For such systems, it is important to support a high rate of update transactions. In decision support on the other hand, systems need to access and analyze vast amounts of data. This requires good query optimization and evaluation strategies, together with a decent memory and I/O management infrastructure. To this end, relational data is often restructured into a more efficient representation through application of on-line analytic processing (OLAP). The domain of DSS is often categorized further into the fields of data analysis and data mining. Data analysis is focused on extracting statistical summary tables from large data sets. This often involves many aggregations over multiple attributes, leading to large amounts of data being scanned. In data mining on the other hand, the goal is to discover knowledge from large volumes of data. Techniques borrowed from the field of machine learning are applied to discover knowledge in the form of rules, which can be thought of as logical implications together with associated measures of their support and confidence. Besides the traditional fields of OLTP and DSS, novel application domains have emerged over the last decade, for example multimedia databases, with a focus on storage and retrieval of image, video and audio data. For such systems it is important to be able to deliver their contents in a continuous, or streaming, fashion.

B.2

Database- versus Hardware Architecture

Given the capabilities of modern processors outlined in Appendix A, one can question how well a database system utilizes the available CPU resources. One would expect this utilization to be quite high, especially

in query-intensive database workloads such as OLAP, data mining, decision support and multimedia retrieval, which often contain many independent calculations that should provide the CPU with enough parallelism to achieve near-optimal IPC scores.

B.2.1

Tuple-at-a-time Processing

Previous work has shown that most traditional database systems do not perform well with respect to IPC, scoring disappointingly low [2, 5]. This can mainly be attributed to the way most systems implement the pipelined Volcano iterator model introduced in Section B.1.4. It is often implemented at tuple granularity, meaning that each call to next() returns a single tuple. Processing tuples one at a time introduces high interpretation overheads, as there are often multiple function calls and operations involved, e.g. to look up the tuple in some disk block and let each operation in the operator pipeline process the tuple as it comes along, in a strictly stepwise fashion. Furthermore, tuple-at-a-time execution limits achievable IPC, as it hides from both the compiler and the CPU the parallelism that would be available if tuples were processed independently. Padmanabhan et al. present a partial solution to this problem by letting operators operate on cache-optimized blocks of tuples instead of single tuples, thereby allowing for more concurrency during execution and reductions in interpretation overhead. However, their experiments are built on top of the traditional NSM storage model, which is known to waste I/O bandwidth if not all of a relation's attributes are needed.

B.2.2

Column-at-a-time Processing: MonetDB

One way to reduce both the interpretation overhead associated with tuple-at-a-time execution and the wasted I/O bandwidth due to retrieval of unneeded attributes is to employ a DSM storage layout and process tuples at a larger granularity. This is the approach taken by MonetDB [6, 23]. MonetDB is a main-memory database system that strictly uses binary relations, which it calls binary association tables (BATs). It was designed as a flexible and efficient DBMS for query-intensive applications. MonetDB provides its own query language over BATs, called the Monet interpreter language (MIL). Because all execution primitives in this language operate on BATs, which represent an entire column, and are therefore only invoked once per column, interpretation overhead is low. Furthermore, implementation of such primitives boils down to loops over entire columns, which are represented in memory as simple arrays, allowing the compiler to apply loop pipelining and to provide the CPU with loop-level parallelism. This allows MonetDB to score high on IPC. However, the column-at-a-time execution model does not incorporate any pipelining in its operator tree. Therefore, operators need to materialize all their results before passing them to the consuming operator. Such materialization can have negative effects on performance in case results do not fit in main memory, meaning that intermediate results need to be swapped to disk by the virtual memory system. But even if results merely exceed the CPU cache, full materialization can be a problem, as main memory access might become a bottleneck as well. This is the case when dealing with simple, high-throughput operators, which can process data at a faster rate than main memory is able to provide it.
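To make the phrase "loops over entire columns" concrete, a MIL-style multiplication primitive has roughly the following shape (an illustration of the pattern only, not actual MonetDB source; the name is invented):

    #include <stddef.h>

    /* Sketch: a column-at-a-time primitive multiplying two aligned columns
     * of n doubles. The single call amortizes interpretation overhead over
     * the whole column, and the independent iterations let the compiler
     * apply loop pipelining.                                               */
    void col_mul_dbl(double *res, const double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            res[i] = a[i] * b[i];
    }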

B.2.3

Vectorized Processing: MonetDB/X100

Boncz, Zukowski and Nes [5] try to strike a balance between the original Volcano pipelined iterator model, which avoids intermediate result materialization, and the low-overhead, column-at-a-time MonetDB execution model. This is achieved by a further run-time, horizontal partitioning step, which splits DSM columns into smaller, cache-resident, vertical vectors, and lets the operator pipeline operate on vertically aligned sets of those attribute vectors. This idea is implemented in a novel query execution engine called MonetDB/X100, which is still work in progress. Its design goals are:

1. Execute high-volume queries at high CPU efficiency.

2. Be extensible to other, non-standard, application domains and achieve the same efficiency on extensibility code.

3. Scale with the size of the lowest level of the storage hierarchy: disk.

Figure B.7: MonetDB/X100 architecture.

To achieve these goals several architectural decisions needed to be made, which are illustrated in Figure B.7. First of all, MonetDB/X100 departs from the column-at-a-time MIL language, falling back to relational algebra. Recall that relational algebra allows its operators to process n-ary relations, which in the case of MonetDB/X100 comes down to n aligned, vertical vectors of attribute values. Such a vectorized execution model allows compilers to generate highly concurrent, loop-pipelined code, resulting in proper CPU utilization. Second, pipelined execution should be implemented in such a way that vectors stay in the cache as long as possible, until all operators that need to process them have done so. The motivation is that the cache is the only location where bandwidth bottlenecks are negligible and we can allow for random access. An implication of the previous point is that main memory access patterns need to be cache-optimized as well. In the ideal case, all tuples from an attribute column at the leaf of an operator tree should be read from main memory only once, when being fed into the pipeline. These reads should fetch full vectors at once so that memory bandwidth can be used to its full potential. Memory writes, on the other hand, should be restricted to occur only at the end of the pipeline whenever possible. The third goal presents a second departure from the original MonetDB. That is, MonetDB/X100 is not intended to be a main-memory-only DBMS, but should allow for disk access and explicit buffer management as well. To this end, MonetDB/X100 employs its own buffer manager called ColumnBM. As was the case with memory accesses, ColumnBM tries to utilize disk bandwidth to its full potential by reading large 1 to 32 MB sequential I/O blocks at once, trying to keep random I/O at a minimum. This requires a second horizontal partitioning step of the vertically decomposed columns, into a sequence of I/O blocks, which are much bigger than vectors, as was already illustrated in Figure B.7. Each I/O block holds only attribute values of a single column. All values in a column share the same data type, and can potentially be modeled by some kind of value or frequency distribution. This opens many possibilities for data compression, as we can try to exploit this data-type and domain-specific information. The use of I/O blocks is entirely transparent to query execution. Logically, a query sees a column as a sequence of values, which can be read or written to. To interface ColumnBM columns to the vectorized MonetDB/X100 query execution engine, ColumnBM provides an iterator-based interface by means of the ColumnSave and ColumnScan operators. ColumnSave stores a stream of attribute values sequentially in a ColumnBM column. It has an iterator with an append() method that appends a single, cache-resident vector of values to the end of the column being saved. ColumnScan, on the other hand, scans a column sequentially. It provides an iterator with a next() method that returns the next vector of values from the column being scanned, thereby loading it into the cache. To illustrate the importance of the vector size in a vectorized query execution engine, Figure B.8 shows the performance of MonetDB/X100 on Query 1 of the TPC-H database benchmark [42].

Figure B.8: MonetDB/X100 performance on TPC-H Query 1 as a function of the vector size (in number of tuples).

As a comparison, the figure also shows results for MonetDB, MySQL [26] and a commercial DBMS "X". It can be seen that at a vector size of one, meaning tuple-at-a-time execution, performance is very similar to that of MySQL and DBMS "X". As the vector size is increased, performance improves significantly due to decreased interpretation overheads. Eventually, however, further increases in vector size worsen performance again. This is caused by the fact that intermediate result vectors become too big to fit the CPU cache and need to be materialized in main memory, introducing more cache misses and thereby increased data access latencies. As the vector size reaches six million, which in this case is the size of a column, performance is similar to that of the column-at-a-time MonetDB. As the results show, the architectural design of the MonetDB/X100 system results in a database kernel which is able to execute query-intensive workloads one to two orders of magnitude faster than existing technology. However, this increase in raw execution power further increases the discrepancy between query throughput rates and the rate at which disk systems are able to deliver the data, as outlined in the introduction. This pushes the necessity for higher I/O bandwidths, and thus for light-weight data compression, even further. At the time of this writing, MonetDB/X100 still lacks functionality to modify a database. There are plans, however, to implement updates as memory-resident diffs on a column. These diffs are then applied on each scan of a column, providing subsequent operators with an up-to-date view of the current database. This approach requires periodic updates to the underlying I/O blocks of an updated column, to make the modifications persistent. If I/O blocks are stored in compressed form, this will in general require blocks to be re-compressed from time to time. To reduce any overheads associated with such re-compression, MonetDB/X100 compression algorithms should be capable of achieving not only high decompression bandwidths, but also high compression speeds.

Appendix C

Data Compression

C.1

Compression Ratio

For data compression to be applicable at all, the source data to be compressed should exhibit a certain level of redundancy. By definition, the aim of data compression then becomes to detect this redundancy and remove it by generating as compact a representation of the source data as possible. Informally, this is achieved by assigning short descriptions to the most frequent outcomes of a data source and longer codes to the less frequent ones. The effectiveness of compression is often related to the so-called compression ratio, defined as:

$r = \frac{\text{data size without compression}}{\text{data size with compression}}$    (C.1)

C.2

Lossless and Lossy Compression

An important categorization of compression methods is the one into lossy, or non-reversible, and lossless, or reversible, data compression. Lossy compression reduces the size of the physical representation of the source data, but tries to preserve only those parts that are considered relevant. Some examples are: sacrificing image detail to reduce an image's size, or removing markup from a text document. Lossy compression is often applied to signal data, such as images, sound and video. In lossless data compression, decompression will always recover the full source data in its original representation. This is the kind of compression that most general-purpose compression algorithms, like Huffman coding [18], arithmetic coding [25] and the Lempel-Ziv [48] family of dictionary-based methods, are geared towards. Lossless data compression is often applied to textual data and binaries, for which it is important to be able to restore the exact original contents.

C.3

Limits on Compressibility

Already back in 1948, Claude Shannon showed with his Noiseless Coding Theorem that there is a limit on the amount of compression we can obtain [38]. To illustrate this finding we have to look at some information theory. Let the set of values to be encoded, or source words, be X, and let D : {0, 1}* → X be a prefix-code with a single code word per source word. A code is called a prefix-code if no code word is a prefix of any other code word, which implies that the code is uniquely decodable, a property required for a code to be of practical value. Now let P(x) be the probability, or frequency, of source word x and l_x the length of its code word. The goal of data compression can then be formulated as minimizing

$L_{D,P} = \sum_{x \in X} P(x) \, l_x$

The minimal average code word length L is then defined as $L = \min\{L_{D,P} : D \text{ is a prefix-code}\}$. A prefix-code that achieves this is called an optimal prefix-code.

Given the above definitions, Shannon's proof relates the minimal average code word length L to the entropy H as follows:

$H(P) \leq L \leq H(P) + 1$    (C.2)

where the entropy represents the amount of uncertainty in a data source that generates source symbols according to distribution P, and is defined as

$H(P) = - \sum_{x \in X} P(x) \log_2 P(x)$    (C.3)

This limit induces an upper bound on the amount of compression we can obtain given a set of source words and their frequency distribution. It is not hard to see that if the source words are distributed uniformly, i.e. $P(x) = \frac{1}{|X|}$, the uncertainty is maximal and equal to $H(P) = \log_2(|X|)$, rendering compression useless.
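As a small worked example (not from the text): a source emitting four symbols with probabilities 1/2, 1/4, 1/8 and 1/8 has entropy

$H(P) = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8 = 0.5 + 0.5 + 0.375 + 0.375 = 1.75 \text{ bits},$

so by (C.2) an optimal prefix-code needs between 1.75 and 2.75 bits per symbol on average; the code {0, 10, 110, 111} achieves exactly 1.75 bits, compared to the 2 bits per symbol of a fixed-width encoding.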

C.4

Database Compression

Database systems often possess a number of characteristics that make them good candidates for compression, especially at column granularity [12]. Roth and Van Horn [35] identify the four most important characteristics as:

Sparseness: Sparse databases tend to have large clusters of zeros, blanks or NULL values. Such clusters can be compressed efficiently.

Value Distribution: Often, values do not cover the entire domain, but are clustered into ranges within this domain.

Joint Attributes: Sometimes relational attributes are duplicated over tables. This is mainly the case for key attributes.

Frequency: Often certain values occur more frequently than others, opening possibilities for more efficient encodings.

In most of these cases, lossless compression is the more natural choice, although lossy compression could have applications as well, for example in databases that store large amounts of sensor data, such as multimedia databases.

Appendix D

Memorized Probing: Hash Table Construction

#define M1 (4