COLLATE - A Conceptual Model for Lightweight Data Compression Algorithms

Juliana Hildebrandt, Dirk Habich, Patrick Damme, and Wolfgang Lehner
Technische Universität Dresden, Database Systems Group, 01062 Dresden, Germany
[email protected]

Abstract. Modern database systems are very often in the position to store their entire data in main memory. Aside from increased main memory capacities, a further driver for in-memory database systems was the shift to a decomposition storage model in combination with lightweight data compression algorithms. Using both of these software concepts, large datasets can be held and processed in main memory with a low memory footprint. In recent years, a large corpus of lightweight data compression algorithms has been developed to efficiently support different data characteristics. In this paper, we present our novel conceptual model COLLATE for lightweight data compression algorithms. COLLATE helps in describing, understanding, communicating, and comparing algorithms on a conceptual level, which has been missing up to now. In detail, we introduce our model including all necessary concepts and highlight its applicability.

1 Introduction

The relational data model has been established as the de-facto standard for database systems [1, 2]. For this model, several storage formats have been proposed. On the one hand, the N-ary storage model (NSM) stores tuples coherently [3]. That means, the tuple unit is preserved in the storage format. On the other hand, the decomposition storage model (DSM), proposed in 1985 [4], partitions an n-attribute relation vertically into n sub-relations. Each sub-relation contains two attributes, a logical record id (surrogate) and an attribute value [4, 5]. In most cases, the surrogate is omitted because it is implicitly given by the order of the tuples. That means, sub-relation storage corresponds to a value-based storage model in the form of a sequence of values. While the NSM storage format is used in disk-based database systems, DSM is the preferred layout in in-memory database systems [5–7]. Aside from accessing each attribute independently, DSM enables efficient lossless data compression opportunities. As shown, e.g., in [7–9], the resulting query performance gain is massive. For DSM compression, a large corpus of lightweight data compression algorithms has been developed to efficiently support different data characteristics [7, 9, 10]. These algorithms achieve good compression rates by utilizing the context knowledge of databases, and they provide fast compression as well as decompression. The corpus further evolves because it is impossible to design an algorithm that always produces optimal results for any data being stored in a database system.

Unfortunately, there is no unified model available for this specific corpus of algorithms which helps describing, understanding, communicating, and comparing the algorithms on a conceptual level. Usually, the algorithms are proposed in papers with pseudo-code and natural-language descriptions. However, this kind of presentation is suboptimal. Sometimes, the descriptions are confusing and hard to understand because the logical compositions of several basic techniques are not obvious. Furthermore, algorithm designers may forget to mention aspects that are important for the understanding. Moreover, the logical approach is mixed with specific implementation aspects regarding efficient execution, which complicates the understanding.

Our Contribution and Outline: Database system architects and developers need to understand, communicate, reason, and make decisions about this specific system-level aspect [7, 9–11]. As described in [11, 12], conceptual modeling is a good tool to support these needs. Therefore, we developed a conceptual model for the problem domain of lightweight data compression algorithms called COLLATE. The aim of COLLATE is to provide a holistic view of the necessary components including all aspects of data (structure), behavior (function), and interaction (component interaction). In this paper, we focus on the compression part; nevertheless, we can use the same model for decompression. Thereby, we follow the approach described by Robinson [13]. In detail, our contributions are:

1. We present a systematic treatment of the lightweight compression aspect and solution area with a derived system description of the important aspects in Section 2. In particular, we introduce basic techniques and give an overview of lightweight data compression algorithms.
2. We propose our novel conceptual model COLLATE in Section 3. As we are going to show, COLLATE consists of five main concepts, and we can specify a basic functional arrangement of these concepts as a blueprint for every lightweight data compression algorithm.
3. Section 4 focuses on the applicability of COLLATE by the transcription of concrete algorithms. Here, we discuss several aspects like expressiveness, unambiguity, and easy variability.

Finally, we conclude the paper with a short description of related work in Section 5 and with a conclusion in Section 6.

2 Problem Statement

Our focus is the evolving corpus of lightweight lossless data compression algorithms which are heavily used in in-memory database systems [5–7]. With DSM as storage model [4], the input of every lightweight compression algorithm is a sequence or an array of values, while the output is a sequence or an array of codewords. In this section, we give a tutorial on lightweight data compression algorithms to build a deep system understanding. We start with a comparison of basic techniques with respect to mapping cardinalities and data dependencies (Section 2.1), followed by specific algorithms (Section 2.2), and conclude with a system description (Section 2.3). Generally, the material results from an exhaustive algorithm analysis.

Fig. 1. Basic Lightweight Compression Techniques: (a) FOR, (b) DELTA, (c) DICT, (d) Bit Vectors, (e) RLE, (f) NS + RLE.


2.1 Analysis of Basic Lightweight Compression Techniques

The basis of lightweight data compression algorithms are six basic techniques as depicted in Fig. 1: frame-of-reference (FOR) [9, 14], delta coding (DELTA) [15, 16], dictionary compression (DICT) [7, 9], Bit Vectors (BV) [17], run-length encoding (RLE) [7, 15], and Null Suppression (NS) [7, 15]. FOR and DELTA represent each value as the difference to a certain given reference value (FOR) respectively to its predecessor value (DELTA). DICT replaces each value by its unique key given by a dictionary. The objective of these three well-known techniques is to represent the original data as a sequence of small integers, which is then suited for actual compression using the NS technique. NS is the most well-studied kind of lightweight compression technique. Its basic idea is the omission of leading zeros in the bit representation of small integers. In contrast to DICT, the technique BV replaces each input value with a bit vector representation in the output. Finally, RLE tackles uninterrupted sequences of occurrences of the same value, so-called runs. In its compressed format, each run is represented by its value and length. Therefore, the compressed data is a sequence of such pairs.

If we analyze these techniques without their application in algorithms, we observe the following characteristics:

1. The techniques address different data levels. While FOR, DELTA, DICT, BV, and RLE consider the logical data level, NS addresses the bit or byte level. Therefore, it is clear why algorithms usually combine techniques from the logical level with NS.
2. We are able to distinguish two approaches to how input values are mapped to output values as depicted in Fig. 1. FOR, DELTA, DICT, and BV map each input value to exactly one integer as output value (1:1 mapping). The goal is to achieve smaller numbers which can be better compressed on bit level. In RLE, not every input value is necessarily mapped to an encoded output value, because a successive subsequence of equal values is encoded in the output as a pair of run value and run length (N:1 mapping). The NS technique is either a 1:1 mapping or an N:1 mapping depending on the concrete expression.
3. The techniques are either parameter-dependent or data-dependent. Most of the 1:1 mapping techniques except DELTA are parameter-dependent. That means, each input value is encoded independently of the other values within the input sequence, but the encoding depends on some parameter, e.g., the reference value in FOR or the bit width for NS. DELTA is a data-dependent technique, because it encodes the values in dependency of their predecessor. The same is valid for RLE, therefore we define RLE as a data-dependent technique.

To summarize, each technique has its own characteristics and objectives, which are applied in the algorithms in different ways. In particular, the algorithms precisely define some open questions regarding the application. For example, one open question is how the parameter values for parameter-dependent techniques are determined. A second open question is whether the whole input sequence is processed with the same parameter value, or if the sequence is partitioned and a separate parameter value is used for each subsequence.
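To make the mapping cardinalities concrete, the following sketch illustrates the logical-level techniques in Python; the function names and the simple list-based representation are our own illustration, not part of any specific algorithm (NS operates on the bit level and is therefore omitted here).

```python
# Illustrative sketch of the basic techniques on the logical level.
# 1:1 mappings: FOR, DELTA, DICT; N:1 mapping: RLE.

def for_encode(values, reference):
    """FOR: represent each value as the difference to a given reference value."""
    return [v - reference for v in values]

def delta_encode(values):
    """DELTA: represent each value as the difference to its predecessor."""
    return [v - (values[i - 1] if i > 0 else 0) for i, v in enumerate(values)]

def dict_encode(values, dictionary):
    """DICT: replace each value by its unique key from a dictionary."""
    return [dictionary[v] for v in values]

def rle_encode(values):
    """RLE: encode each run of equal values as a (value, run length) pair."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

if __name__ == "__main__":
    data = [103, 101, 101, 101, 104, 104]
    print(for_encode(data, reference=100))   # [3, 1, 1, 1, 4, 4]
    print(delta_encode(data))                # [103, -2, 0, 0, 3, 0]
    print(rle_encode(data))                  # [(103, 1), (101, 3), (104, 2)]
```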

2.2 Analysis of Lightweight Compression Algorithms

A lot of lightweight data compression algorithms have been developed to efficiently support different data characteristics. Without claiming completeness, Fig. 2 shows an overview of various algorithms. As illustrated, the algorithms can be organized in families. The algorithms evolve and the development activities have increased over the years. The families can be described as follows:

Fig. 2. Overview of an Extract from our Lightweight Data Compression Survey.

Algorithms in the family of Byte-oriented Encodings rely on the basic technique NS [18]. They map uncompressed values to codewords of a bit length that is a multiple of eight. That means, these algorithms implement NS according to a 1:1 mapping, whereas the corresponding parameter for encoding (the number of essential bytes) is determined for each single input value. In other words, byte-oriented encodings compute the parameter for the parameter-dependent NS technique data-dependently, with one parameter per input data value. For decoding, it is necessary to store the length of a codeword as a parameter. The algorithms have a lot of similarities; sometimes, the only difference between two algorithms is the arrangement of bits. But they also differ in the encoding of the essential length value.

Simple-based Algorithms again apply only the NS technique [19]. Here, the algorithms try to pack as many binary encoded integer values as possible into a 32- or 64-bit codeword by suppressing leading zeros. In contrast to the previous family, the algorithms follow an N:1 mapping for NS. The algorithms subdivide the input sequence into subsequences depending on the size of the input values. In each codeword of fixed length, several descriptor bits serve as parameters to determine the bit width for all values that are encoded in this codeword. The remaining bits are filled with NS-compressed data values. Simple-based algorithms apply the parameter-dependent NS technique in a data-dependent fashion with one parameter per input data subsequence.

PFOR-based Algorithms implement the FOR technique [9] in combination with NS. They subdivide the input into subsequences of a fixed length and calculate two parameters per subsequence: a reference value for the FOR technique and a common bit width for NS. Each subsequence is encoded using its specific parameters, which are derived data-dependently. The values that cannot be encoded with the given bit width are stored separately with a greater bit width.

Adaptive FOR algorithms [20, 21] bundle a lot of NS algorithms, but they focus on the problem of how to optimally subdivide a finite sequence of integer values into subsequences, such that every value in a subsequence is encoded with the same bit width. These algorithms use the parameter-dependent technique NS in a data-dependent fashion with an N:1 mapping approach.

All described families apply one or more basic lightweight compression techniques, but the application differs. For the last three families, data subdivision into subsequences plays an important role. The calculation of parameter values that relate to all values within a subsequence, like a common bit width, is also a core part of the algorithms. In general, this aspect influences the encoding of the values in each subsequence differently. That is not addressed on the level of basic techniques, but on the level of the algorithms. Furthermore, the parameters for each subsequence have to be included in the encoded output sequence for decompression purposes. Generally, the parameter-dependent basic techniques are applied in a data-dependent fashion in algorithms by computing the parameters based on the input data sequence.
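As an illustration of computing one parameter per subsequence, the following sketch chooses a common bit width for fixed-length blocks and bit-packs each block; it is a simplified stand-in for the Simple-/PFOR-style approach under our own assumptions, not a faithful implementation of any specific published algorithm.

```python
# Simplified sketch: data-dependent parameter calculation per subsequence.
# Each fixed-length block gets a common bit width (the parameter), and all
# values of the block are packed with that width (NS applied block-wise).

def pack_blocks(values, block_size=4):
    """Return a list of (bit_width, packed_bits) pairs, one per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Parameter calculation: smallest width that fits every value of the block.
        bit_width = max(max(v.bit_length(), 1) for v in block)
        # Encoding: concatenate the fixed-width binary codes of the block.
        packed = "".join(format(v, f"0{bit_width}b") for v in block)
        blocks.append((bit_width, packed))
    return blocks

if __name__ == "__main__":
    data = [3, 7, 1, 2, 900, 1023, 512, 4]
    for bw, bits in pack_blocks(data):
        print(bw, bits)
    # First block needs only 3 bits per value, second block 10 bits per value.
```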

2.3 Derived System Description and Properties

Based on the decomposition storage model [4], the input for every lightweight data compression algorithm is a finite sequence of (integer) values. The output is a sequence of codewords and parameters representing the compressed data. The parameters, like bit width or run length, are required for decoding/decompression or are part of an access path within the compressed data format. Input and output data have a logical representation (semantic level) and a physical representation (bit or encoding level). While for some compression techniques it is useful to focus on the semantic level (FOR and DELTA), for other techniques the physical level is more important (NS).

For the transformation from input to output, two further characteristics play an important role for every algorithm. First, most of the basic lightweight compression techniques are parameter-dependent, so within the algorithms a number of parameter values have to be calculated. Second, the basic techniques and algorithms differ in their mapping cardinalities as described above. For parameter-dependent N:1 mappings, parameters are calculated for each subsequence of N values. In general, a lot of algorithms subdivide input data hierarchically into subsequences (e.g., PFOR) for which the parameters can be calculated. Moreover, the further data processing of a subsequence depends on the subsequence itself. That means, data subdivision and parameter calculation are the actual tuning knobs, and the application of the basic techniques is then straightforward. Finally, for an exact algorithm description, the combination and arrangement of codewords and parameters have to be defined. Here, the algorithms also differ widely.
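To make this derived description concrete, here is a minimal sketch of a compressed representation as a hierarchy of subsequences, each carrying its own parameters and codewords; the class name and the PFOR-like layout are our own illustration, not a prescribed format.

```python
# Minimal sketch of compressed output as parameters plus codewords,
# possibly nested for hierarchically subdivided input data.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CompressedBlock:
    parameters: dict                                       # e.g. {"bit_width": 3}
    codewords: List[str] = field(default_factory=list)     # bit-level codes of the values
    children: List["CompressedBlock"] = field(default_factory=list)  # nested subsequences

# A PFOR-like layout: one outer block whose children are fixed-length
# subsequences, each with its own reference value and bit width.
output = CompressedBlock(
    parameters={"subsequence_length": 4},
    children=[
        CompressedBlock(parameters={"reference": 100, "bit_width": 3},
                        codewords=["011", "001", "111", "000"]),
        CompressedBlock(parameters={"reference": 500, "bit_width": 5},
                        codewords=["00010", "11111", "00000", "01010"]),
    ],
)
print(output.parameters, len(output.children))
```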

3 COLLATE Model

Based on the previous section, we now introduce our novel conceptual model called COLLATE for the domain of lightweight data compression algorithms. In detail, we describe the data structure and the main concepts of our model in Section 3.1, followed by properties of the main concepts in Section 3.2. Then, we explain the interaction of the concepts in Section 3.3.

3.1 Data Structure and Concepts

Fig. 3. COLLATE Class Diagram.

As described in Section 2.3, we rely on the decomposition storage model (DSM) [4] for input and output data. That means, the input is a sequence of (integer) values and the output is a sequence of codewords and parameters representing the compressed data. To convert input data to its compressed output data, several functional concepts with regard to our system description (see Section 2.3) are necessary. Fundamentally, our model consists of five main concepts (functional components), highlighted as gray boxes in Fig. 3. The concepts are:


Recursion: This concept is responsible for the hierarchical data subdivision and for applying the concepts included in the Recursion to each data subsequence. Each modeled algorithm is a Recursion.

Tokenizer: This concept is responsible for dividing an input sequence into finite subsequences or single values.

Parameter Calculator: This concept determines parameter values for finite subsequences or single values. The specification of the parameter values is done using parameter definitions.

Encoder: This concept determines the encoded form of the values to be compressed at bit level. Again, the concrete encoding is specified using functions representing the basic techniques.

Combiner: The Combiner is essential to arrange the encoded values and the calculated parameters for the output representation.

Additionally, Recursion and Encoder form a composite design pattern Scheme in Fig. 3, which is appropriate for dealing with hierarchically subdivided sequences. Moreover, there are algorithms that encode sequences or values with special characteristics in a different way than common sequences or values. An example is an algorithm that has a special encoding mode for runs of ones [22]. To cover such algorithms, a separate pair of a Parameter Calculator and an Encoder resp. a Recursion can be modeled for each encoding mode. However, most of the algorithms need only one pair of Parameter Calculator and Scheme per Recursion.
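One possible way to read these concepts is as a set of interfaces. The following Python sketch is our own illustration of such an interface view under simplifying assumptions; it is not part of the COLLATE specification itself.

```python
# Illustrative interface view of the five COLLATE concepts (our own sketch).
from abc import ABC, abstractmethod
from typing import Any, Iterable, List, Tuple

class Tokenizer(ABC):
    @abstractmethod
    def tokenize(self, sequence: List[int]) -> Iterable[List[int]]:
        """Divide an input sequence into finite subsequences or single values."""

class ParameterCalculator(ABC):
    @abstractmethod
    def calculate(self, token: List[int]) -> Any:
        """Determine parameter values for a finite subsequence or single value."""

class Encoder(ABC):
    @abstractmethod
    def encode(self, token: List[int], parameters: Any) -> str:
        """Determine the bit-level encoded form of the values to be compressed."""

class Combiner(ABC):
    @abstractmethod
    def combine(self, pieces: List[Tuple[Any, str]]) -> str:
        """Arrange encoded values and calculated parameters into the output."""

class Recursion(ABC):
    @abstractmethod
    def compress(self, sequence: List[int]) -> str:
        """Apply the included concepts to each data subsequence."""
```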

3.2 Properties of Concepts

In this section, we want to concretize the properties of the concepts Tokenizer, Parameter Calculator, and Encoder. The concepts Combiner and Recursion are clear enough.

For the Tokenizer concept, we identified three classifying characteristics. The first one is data dependency. A data-independent Tokenizer outputs a fixed number of values without regarding the values themselves, while a data-dependent Tokenizer is used if the decision how many values to output is guided by the knowledge of the concrete values. A second characteristic is adaptivity. A Tokenizer is adaptive if the calculation rule changes depending on previously read data. The third property is the necessary input for decisions. Most Tokenizers need only a finite prefix of a data sequence to decide how many values to output. The rest of the sequence is used as further input for the Tokenizer and processed in the same manner. Only such Tokenizers are able to process data streams with potentially infinite data sequences. Moreover, there are Tokenizers needing the whole (finite) input sequence to decide how to subdivide it. All eight combinations of these characteristics are possible; some of them occur more frequently than others in existing algorithms. Some analyzed algorithms are very complex concerning sequence subdivision. It is not sufficient to assume that Tokenizers subdivide sequences in a linear way. For PFOR-based algorithms, for example, we also need Tokenizers that rearrange subsequences, mostly with regard to the content of single values of the sequence. With such kinds of Tokenizers (mostly categorizable as non-adaptive, data-dependent, and requiring finite input sequences), we can rearrange the values in a different (data-dependent) order than the one of the input sequence. A second task of the Tokenizer is to decide for each output sequence which pair of Parameter Calculator and Encoder is used for the further data processing. Most algorithms process all data in the same way; some of them distinguish several cases, so that this choice is necessary.

The finite output sequence of the Tokenizer serves as input for the Parameter Calculator. Parameters are often required for the encoding and decoding. Therefore, we introduce the Parameter Calculator concept, which knows special rules (parameter definitions) for the calculation of several parameters. There are different kinds of parameter definitions. We often need single numbers like a common bit width for all values or mapping information for dictionary-based encodings. We call a parameter definition adaptive if the knowledge of a calculated parameter for one token (output of the Tokenizer) is needed for the calculation of parameters for further tokens at the same hierarchical level. For example, an adaptive parameter definition is necessary for DELTA. Calculated parameters have a logical representation for further calculations and the encoding of values as well as a representation at bit level, because on the one hand they are needed to calculate the encoding of values, and on the other hand they have to be stored additionally to allow the decoding. If an algorithm is characterized by hierarchically calculated parameters, it is possible that a parameter definition depends on other calculated parameters that are additional input for a Parameter Calculator.

The Encoder processes an atomic input, where the output of the Parameter Calculator and other parameters are additional inputs. The input is a token that cannot or shall not be subdivided anymore. In practice, the Encoder mostly gets a single integer value to be mapped into a binary code (1:1 mapping techniques). An exception is RLE as an N:1 mapping technique, where the Parameter Calculator maps a sequence of equal values to its run length and the Encoder maps the sequence to its common value. Analogously to the parameter definitions, the Encoder calculates a logical representation of its input value and an encoding at bit level.
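As an illustration of a data-dependent Tokenizer and of the RLE-style split between Parameter Calculator and Encoder, consider the following sketch; it is our own simplified rendering, not taken from a specific algorithm.

```python
# Sketch: a data-dependent Tokenizer that emits runs of equal values, plus the
# RLE-style split: Parameter Calculator -> run length, Encoder -> run value.

def run_tokenizer(sequence):
    """Data-dependent Tokenizer: each token is one run of equal values."""
    token = []
    for value in sequence:
        if token and value != token[0]:
            yield token
            token = []
        token.append(value)
    if token:
        yield token

def parameter_calculator(token):
    """Maps a run of equal values to its run length (the parameter)."""
    return len(token)

def encoder(token):
    """Maps a run of equal values to its common value."""
    return token[0]

if __name__ == "__main__":
    data = [7, 7, 7, 3, 3, 9]
    pairs = [(parameter_calculator(t), encoder(t)) for t in run_tokenizer(data)]
    print(pairs)  # [(3, 7), (2, 3), (1, 9)]
```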

Fig. 4. Interaction and Data Flow of our COLLATE Model.

3.3 Interaction of Concepts

To summarize, Fig. 4 illustrates the interaction of our concepts and the data flow through them for lightweight data compression in a simple case with only one pair of Parameter Calculator and Encoder. In general, this arrangement can be used as a blueprint for lightweight compression algorithms, as we will show in the next section. The dashed lines highlight several properties of the concepts as mentioned above.
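A minimal, self-contained rendering of this data flow as a generic driver loop could look as follows; the function signatures and the trivial instantiation are our own illustration of the blueprint, not a normative API.

```python
# Sketch of the blueprint data flow: Tokenizer -> Parameter Calculator ->
# Encoder -> Combiner, for the simple case with one pair per Recursion.

def compress(sequence, tokenize, calc_params, encode, combine):
    pieces = []
    rest = list(sequence)
    while rest:
        token, rest = tokenize(rest)          # Tokenizer: split off the next token
        params = calc_params(token)           # Parameter Calculator: per-token parameters
        codeword = encode(token, params)      # Encoder: bit-level representation
        pieces.append((params, codeword))
    return combine(pieces)                    # Combiner: arrange the output format

if __name__ == "__main__":
    # Trivial instantiation: single-value tokens, bit width as parameter,
    # binary code as codeword, parameters and codewords simply concatenated.
    result = compress(
        [5, 1, 30],
        tokenize=lambda rest: (rest[:1], rest[1:]),
        calc_params=lambda tok: max(tok[0].bit_length(), 1),
        encode=lambda tok, bw: format(tok[0], f"0{bw}b"),
        combine=lambda pieces: " ".join(f"{bw}:{cw}" for bw, cw in pieces),
    )
    print(result)  # 3:101 1:1 5:11110
```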

4 Model Application

In this section, we want to show and evaluate the applicability of our model. Due to space constraints, we focus on algorithms from the class of Byte-oriented Encodings in this paper. Nevertheless, all algorithms depicted in Fig. 2 can be modeled with COLLATE, as demonstrated on our project website (https://wwwdb.inf.tu-dresden.de/research-projects/projects/collate/).

4.1 Model Instance for Algorithm varint-G8IU

The algorithm varint-G8IU suppresses leading zeros by encoding each 32-bit integer value with the smallest possible number of bytes [18]. The number of used bytes is indicated unarily. For example, 2^20 is compressed to 3 bytes with a length information of "011". This length encoding has to be interpreted as follows: the compressed value has 3 bytes, therefore the length encoding has a size of 3 bits. The value "1" indicates that the corresponding byte position is used and is not the last one that belongs to the compressed value. The value "0" highlights that this byte is the last one belonging to the compressed value. The output of varint-G8IU is always a sequence of 9-byte blocks containing multiple compressed integer values. Each 9-byte block consists of 8 data bytes and 1 byte for the length information. For example, to compress the sequence 2^5 : 2^20 : 2^10 : 2^30, we are able to store the first three values in one 9-byte block, because there are 8 available data bytes: we need 6 data bytes to store the first three values, but 10 data bytes to store all four values. The 2 unused data bytes in the output are filled with padding zeros, and the two unused bits in the length byte are set to 1 to ensure a correct decoding.
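As a quick check of these numbers, the following lines compute the per-value byte counts for the example sequence; this is plain arithmetic illustrating the text, not the algorithm itself.

```python
# Byte counts for the example sequence 2^5 : 2^20 : 2^10 : 2^30.
values = [2**5, 2**20, 2**10, 2**30]
bytes_needed = [max((v.bit_length() + 7) // 8, 1) for v in values]
print(bytes_needed)            # [1, 3, 2, 4]
print(sum(bytes_needed[:3]))   # 6 -> the first three values fit into the 8 data bytes
print(sum(bytes_needed))       # 10 -> all four values do not fit
```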


Fig. 5. Model Instance for Algorithm varint-G8IU.

Fig. 5 depicts a simple model instance for this algorithm. The Tokenizer outputs single values (indicated by k = 1). The Parameter Calculator determines the number of needed bytes (bw = ⌊log_256(max(1, •))⌋ + 1) as a logical value and encodes it unarily, depicted as a tuple of functions with input "•". The Encoder transforms the logical input value, but also determines its binary representation encoded with 8 · bw bits. The output data format is constructed by the Combiner. It calculates k as the greatest number n of compressed values that fit into 8 bytes (= 64 bits) and concatenates the k encoded length descriptors, the k encoded data values, and both kinds of padding bits. In Fig. 5, the bit-level representations of parameters and data values are underlined. The symbol ":" as well as exponentiation denote a concatenation of the bit-level representations of the values. The notation "| • |" represents the bit width of a value.
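Read as code, the first model instance corresponds roughly to the following sketch. The greedy packing rule and the exact placement of the descriptor bits within the length byte are our own simplifying assumptions, so this illustrates the structure of the model instance rather than a bit-exact varint-G8IU implementation.

```python
# Rough sketch of the first model instance of varint-G8IU (structure only;
# the bit order inside the length byte is an assumption, not the exact format).

def param_calculator(value):
    """bw = floor(log_256(max(1, value))) + 1, computed via the bit length."""
    bw = max((value.bit_length() + 7) // 8, 1)
    descriptor = "0" + "1" * (bw - 1)   # one '0' for the last byte, '1' otherwise ("011" for 3 bytes)
    return bw, descriptor

def encoder(value, bw):
    """Binary representation of the value with 8 * bw bits."""
    return format(value, f"0{8 * bw}b")

def combiner(pieces):
    """Greedily fill 9-byte blocks: 8 data bytes plus one length byte."""
    blocks, data_bits, length_bits = [], "", ""
    for descriptor, code in pieces:
        if len(data_bits) + len(code) > 64:            # next value does not fit anymore
            blocks.append((length_bits.ljust(8, "1"), data_bits.ljust(64, "0")))
            data_bits, length_bits = "", ""
        data_bits += code
        length_bits += descriptor
    blocks.append((length_bits.ljust(8, "1"), data_bits.ljust(64, "0")))
    return blocks

def compress(values):
    pieces = []
    for v in values:                                   # Tokenizer with k = 1
        bw, descriptor = param_calculator(v)
        pieces.append((descriptor, encoder(v, bw)))
    return combiner(pieces)

for length_byte, data_bytes in compress([2**5, 2**20, 2**10, 2**30]):
    print(length_byte, data_bytes)
```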

4.2 Different Model Instances for One Algorithm

The creation of such model instances is usually straightforward, since our concepts provide a clear guideline. Nevertheless, some algorithmic aspects can be modeled in different ways. In our first model instance (see Fig. 5), the Combiner performs many tasks. A second possible model instance of the same algorithm is depicted in Fig. 6. Here, the Tokenizer in the first Recursion determines the number of values that fit into one 9-byte block and outputs the corresponding values as a subsequence. This subsequence is further processed in a second Recursion, whose Tokenizer outputs single values. The Parameter Calculator and the Encoder are exactly the same as those in the first model instance. The inner Combiner only concatenates all received data values and length descriptors as well as padding bits to a 9-byte block. The outer Combiner simply concatenates all 9-byte blocks. This second model instance of varint-G8IU is more intuitive to understand, but not minimal in terms of recursion depth. However, in this second instance, there is a clear separation of data subdivision and output arrangement, which is not obvious in the first model instance. So it is a matter of discretion how to model an algorithm. From our point of view, this is not a disadvantage, but shows the flexibility of our approach. For reasons of comparability with other byte-oriented encoding algorithms, it makes sense to use the first model instance. For a clear explanation of the algorithm, the second model instance might be better.
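The outer, data-dependent Tokenizer of this second instance can be sketched as the following greedy rule (our own rendering of the idea in Fig. 6: take the longest prefix whose byte counts still fit into the 8 data bytes).

```python
# Sketch of the outer, data-dependent Tokenizer of the second model instance:
# it emits the longest prefix whose values still fit into 8 data bytes.

def bytes_needed(value):
    # Equivalent to floor(log_256(max(1, value))) + 1.
    return max((value.bit_length() + 7) // 8, 1)

def block_tokenizer(sequence):
    """Yield subsequences whose total byte count does not exceed 8."""
    token, used = [], 0
    for value in sequence:
        need = bytes_needed(value)
        if token and used + need > 8:
            yield token
            token, used = [], 0
        token.append(value)
        used += need
    if token:
        yield token

print(list(block_tokenizer([2**5, 2**20, 2**10, 2**30])))
# [[32, 1048576, 1024], [1073741824]]
```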

Fig. 6. Second Model Instance for varint-G8IU.

4.3 Simple Variability

Next, we want to address the simple variability and extensibility of model instances. That means, we are able to substitute parts of an instance, so that algorithms can be altered and extended easily. We show this aspect using two similar byte-oriented algorithms: varint-SU and varint-PU [18]. Both take 32-bit integer values as input and map them to codewords of variable length. This is shown in Fig. 7 for the binary representation of the value 104125. Both algorithms determine the smallest number of 7-bit units that are needed for the binary representation of the value without losing any information. In our example, we need at least three 7-bit units and we are able to suppress 11 leading zero bits. In order to support the decoding, we have to store the number three as an additional parameter. This is done in a unary way as 011. The algorithm varint-SU stores each single parameter bit at the high end of a 7-bit data unit, whereas varint-PU stores the complete parameter at one end of the 21 data bits.
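As a quick check of this example, the number of 7-bit units for 104125 can be computed as follows (plain arithmetic illustrating the text).

```python
# Number of 7-bit units needed for the example value 104125.
value = 104125
units = max(-(-value.bit_length() // 7), 1)   # ceiling division by 7
print(value.bit_length(), units)              # 17 bits -> 3 seven-bit units
print(32 - 7 * units)                         # 11 leading zero bits suppressed
```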

Fig. 7. Example for varint-SU (green) and varint-PU (blue).

Fig. 8. Model Instances for varint-SU (green) and varint-PU (blue).

Both algorithms are similar, which is also observable in their model instances as depicted in Fig. 8. Both algorithms use a very simple Tokenizer outputting single integer values (indicated by k = 1). For each value, the Parameter Calculator determines the number of necessary 7-bit units using an appropriate function (bw = ⌊log_128(max(1, •))⌋ + 1). The determined number is used in the subsequent Encoder to compute the number of data bits as a multiple of 7. Up to this point, both algorithms behave identically. The only difference between both algorithms can be found in the Combiner, because only the output arrangement is different. As shown in this example, we are able to compare algorithms on the model instance level. If algorithms are similar, they also have similar model instances. Furthermore, we can easily change only one part of a model instance to describe another, similar algorithm.
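To illustrate that only the Combiner differs, the following sketch shares the Tokenizer, Parameter Calculator, and Encoder and swaps only the output arrangement; the exact bit order within the codewords is our own simplifying assumption.

```python
# Sketch: varint-SU and varint-PU share everything except the Combiner.
# Bit ordering inside the codewords is a simplifying assumption of this sketch.

def param_calculator(value):
    """Number of 7-bit units (bw) and its unary encoding ("011" for three units)."""
    bw = max(-(-value.bit_length() // 7), 1)
    return bw, "0" + "1" * (bw - 1)

def encoder(value, bw):
    """Binary representation of the value with 7 * bw bits."""
    return format(value, f"0{7 * bw}b")

def combiner_su(unary, data_bits):
    """varint-SU: one parameter bit in front of each 7-bit data unit."""
    units = [data_bits[i:i + 7] for i in range(0, len(data_bits), 7)]
    return "".join(p + u for p, u in zip(unary, units))

def combiner_pu(unary, data_bits):
    """varint-PU: the complete unary parameter at one end of the data bits."""
    return unary + data_bits

value = 104125
bw, unary = param_calculator(value)
data_bits = encoder(value, bw)
print(combiner_su(unary, data_bits))   # 24 bits: parameter bits interleaved
print(combiner_pu(unary, data_bits))   # 24 bits: parameter bits grouped
```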

4.4 Summary

We developed a powerful conceptual model for the lightweight data compression domain that helps database system architects in understanding, communicating, reasoning, and making decisions within this problem domain. Generally, we focused on the compression of data here; nevertheless, we are able to use our model for the decompression part as well. The advantages of our model are: (1) COLLATE helps to describe a multitude of today's lightweight compression algorithms in a clear way. It decomposes the data processing into data subdivision, parameter calculation for finite subsequences, encoding, and output construction, which are all necessary steps during the data compression process. The two different data views of the semantic level and the bit level lead to a clear distinction between different abstraction layers. (2) The model instances for algorithms can be used as a communication device to support a better understanding of the structure of algorithms, data relations, and the compressed output format. (3) Different algorithms can be compared on an abstract level. With appropriate model instances, it is not hard to see similarities and differences as well as advancements.

5 Related Work

Fundamentally, a wide range of conceptual models for various application domains has been proposed [23–25]. Hitherto, none of them has considered the field of lightweight data compression. Nevertheless, in the context of data transmission, a modularization concept for data compression has already been developed [17]. That scheme subdivides compression methods just into (i) a data model adapting to the data already read and (ii) a coder encoding the incoming data by means of the calculated data model. In that work, the area of conventional data compression was considered. Here, a multitude of approaches like arithmetic coding [26], Huffman [27], or Lempel-Ziv [28] exists. These classical algorithms achieve high compression rates, but the computational effort is high. Therefore, those techniques are usually denoted as heavyweight. The modularization concept of [17] is well-suited for that kind of heavyweight algorithms, but it does not adequately reflect the different properties of the lightweight algorithms. For example, it does not support a sophisticated segmentation or a multi-level data partitioning as required by lightweight data compression algorithms.

6 Conclusion

In this paper, we have introduced our novel conceptual model COLLATE for lightweight data compression algorithms, consisting of five main concepts, which is suitable for designing a variety of algorithms in a systematic manner. By the replacement of individual concepts or only parameters, different algorithms can be designed from a template. Some concepts and related groups of concepts, such as the Recursion, occur in various algorithms again and again. From our point of view, the understandability of algorithms is improved by the subdivision into different, independent, and small concepts. Furthermore, our conceptual model is a well-defined foundation for the abstract consideration of lightweight compression algorithms. Next, we want to develop an approach to transform model instances of algorithms into efficient executable code. Once this is possible, we want to integrate our approach into a database system. This would allow users to specify their own compression algorithms and use them in database systems without much effort.

References

1. B. Thalheim, Entity-Relationship Modeling - Foundations of Database Technology. Springer, 2000.
2. H. Mannila and K. Räihä, Design of Relational Databases. Addison-Wesley, 1992.
3. D. J. Abadi, S. Madden, and N. Hachem, "Column-stores vs. row-stores: how different are they really?" in SIGMOD, 2008, pp. 967–980.
4. G. P. Copeland and S. N. Khoshafian, "A decomposition storage model," SIGMOD Rec., vol. 14, no. 4, pp. 268–279, May 1985.
5. D. Abadi, P. A. Boncz, S. Harizopoulos, S. Idreos, and S. Madden, "The design and implementation of modern column-oriented database systems," Foundations and Trends in Databases, vol. 5, no. 3, pp. 197–280, 2013.
6. H. Plattner, "A common database approach for OLTP and OLAP using an in-memory column database," in SIGMOD, 2009, pp. 1–2.
7. D. Abadi, S. Madden, and M. Ferreira, "Integrating compression and execution in column-oriented database systems," in SIGMOD, 2006, pp. 671–682.
8. M. Zukowski, N. Nes, and P. A. Boncz, "DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing," in DaMoN, 2008, pp. 47–54.
9. M. Zukowski, S. Heman, N. Nes, and P. Boncz, "Super-scalar RAM-CPU cache compression," in ICDE, 2006, pp. 59–.
10. P. Damme, D. Habich, and W. Lehner, "Direct transformation techniques for compressed data: General approach and application scenarios," in ADBIS, 2015, pp. 151–165.
11. G. Muller, "Challenges in teaching conceptual modeling for systems architecting," in ER Workshops, 2015, pp. 317–326.
12. I. Davies, P. Green, M. Rosemann, M. Indulska, and S. Gallo, "How do practitioners use conceptual modeling in practice?" Data Knowl. Eng., vol. 58, no. 3, pp. 358–380, Sep. 2006.
13. S. Robinson, "Conceptual modeling for simulation," in Proceedings of the 2013 Winter Simulation Conference (WSC '13), 2013, pp. 377–388.
14. J. Goldstein, R. Ramakrishnan, and U. Shaft, "Compressing relations and indexes," in ICDE, 1998, pp. 370–379.
15. M. A. Roth and S. J. V. Horn, "Database compression," SIGMOD Record, vol. 22, no. 3, pp. 31–39, 1993.
16. D. Lemire and L. Boytsov, "Decoding billions of integers per second through vectorization," Softw., Pract. Exper., vol. 45, no. 1, pp. 1–29, 2015.
17. R. Williams, Adaptive Data Compression, ser. Kluwer International Series in Engineering and Computer Science: Communications and Information Theory. Springer US, 1991.
18. A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi, "SIMD-based decoding of posting lists," in CIKM, 2011, pp. 317–326.
19. V. N. Anh and A. Moffat, "Inverted index compression using word-aligned binary codes," Inf. Retr., vol. 8, no. 1, pp. 151–166, Jan. 2005.
20. F. Silvestri and R. Venturini, "VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming," in CIKM, 2010, pp. 1219–1228.
21. R. Delbru, S. Campinas, K. Samp, G. Tummarello, and L. Dangan, "Adaptive frame of reference for compressing inverted lists," 2010.
22. D. Arroyuelo, S. González, M. Oyarzún, and V. Sepulveda, "Document identifier reassignment and run-length-compressed inverted indexes for improved search performance," in SIGIR, 2013, pp. 173–182.
23. M. Vergne and A. Susi, "Breaking the recursivity: Towards a model to analyse expert finders," in ER, 2015, pp. 539–547.
24. T. Nösinger, M. Klettke, and A. Heuer, "A conceptual model for the XML schema evolution," in GvD, 2013, pp. 28–33.
25. M. Hahmann, D. Habich, and W. Lehner, "Modular data clustering - algorithm design beyond MapReduce," in EDBT/ICDT Workshops, 2014, pp. 50–59.
26. I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, no. 6, pp. 520–540, Jun. 1987.
27. D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the Institute of Radio Engineers, vol. 40, no. 9, pp. 1098–1101, Sep. 1952.
28. J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theor., vol. 23, no. 3, pp. 337–343, 1977.