How are Data Compression Algorithms implemented in ClickHouse?

Introduction

ClickHouse implements data compression in several ways to reduce storage space and improve query performance. Here are a few examples of how data compression is implemented in ClickHouse:

(1) Dictionary encoding

ClickHouse uses dictionary encoding to compress data. This involves replacing repeated values with a smaller integer, called a dictionary index. The original values and their corresponding dictionary indices are stored separately in a dictionary. This method is particularly useful for compressing data with many repeated values, and it is the idea behind ClickHouse's LowCardinality data type.

Algorithm to implement Dictionary encoding

Dictionary encoding is a technique used to compress data by replacing repeated values with a smaller integer called a dictionary index. The algorithm to implement dictionary encoding can be broken down into the following steps:

  1. Create an empty dictionary: Initialize an empty dictionary that will store the original values and their corresponding dictionary indices.
  2. Traverse the data: Traverse the data and for each unique value, add it to the dictionary and assign it a new dictionary index.
  3. Replace values with dictionary indices: For each value in the data, replace it with its corresponding dictionary index from the dictionary.
  4. Store dictionary and compressed data: Store the dictionary and the compressed data (i.e., the data with values replaced by dictionary indices) separately.
  5. Decompression: For decompression, use the stored dictionary to replace the dictionary indices with the original values.

It’s important to note that dictionary encoding works best for data that has many repeated values. It’s less efficient for data with few or no repeated values and for data that is already highly compressed. Also, dictionary encoding is applied to specific columns of a table, typically string or enumerated columns with low cardinality.
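The steps above can be sketched in Python. This is a minimal illustration of the technique, not ClickHouse's actual implementation:

```python
def dict_encode(values):
    # Map each distinct value to a small integer index (steps 1-3).
    dictionary = {}
    indices = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    # Build the reverse mapping that is stored alongside the indices (step 4).
    reverse = [None] * len(dictionary)
    for v, i in dictionary.items():
        reverse[i] = v
    return reverse, indices

def dict_decode(reverse, indices):
    # Step 5: replace each index with its original value.
    return [reverse[i] for i in indices]
```

For example, `dict_encode(["a", "b", "a", "a", "c"])` produces the dictionary `["a", "b", "c"]` and the compressed indices `[0, 1, 0, 0, 2]`; the savings come from storing each repeated value only once.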

(2) Delta encoding

ClickHouse uses delta encoding to compress data. This involves storing the difference between consecutive values rather than the actual values. This method is particularly useful for compressing data with many consecutive values, such as timestamp data types.

Algorithm to implement Delta encoding

Delta encoding is a technique used to compress data by storing the difference between consecutive values rather than the actual values. The algorithm to implement delta encoding can be broken down into the following steps:

  1. Initialize a variable to store the previous value: Initialize a variable to store the previous value, which will be used to calculate the delta.
  2. Traverse the data: Traverse the data and for each value, calculate the delta by subtracting the previous value from the current value.
  3. Replace values with delta: Replace each value in the data with its corresponding delta.
  4. Store the initial value and the compressed data: Store the initial value and the compressed data (i.e., the data with values replaced by deltas) separately.
  5. Decompression: For decompression, start from the stored initial value and add each delta in turn to reconstruct the original values.

It’s important to note that delta encoding works best for data whose values change slowly or monotonically, such as timestamps or counters, because the deltas are small and compress well. It’s less efficient for data with large, irregular jumps between values and for data that is already highly compressed. Also, it’s recommended to use delta encoding for numeric data types and for data that is ordered by the column on which delta encoding is applied.
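The algorithm above can be sketched in Python as follows. This is a simplified illustration of the technique rather than ClickHouse's own Delta codec:

```python
def delta_encode(values):
    # Store the first value as-is, then the difference between
    # each value and its predecessor (steps 1-4).
    if not values:
        return None, []
    deltas = []
    prev = values[0]
    for v in values[1:]:
        deltas.append(v - prev)
        prev = v
    return values[0], deltas

def delta_decode(first, deltas):
    # Step 5: cumulatively add each delta to rebuild the originals.
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out
```

For example, the timestamps `[1000, 1002, 1005, 1005]` become the initial value `1000` plus the deltas `[2, 3, 0]`, which are much smaller numbers and therefore cheaper to store.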

(3) LZ4 compression

ClickHouse uses LZ4 compression to compress data. This is a lossless compression algorithm that is fast and efficient. It is used to compress data blocks in ClickHouse.

Algorithm to implement LZ4 compression

LZ4 compression is a lossless compression algorithm that is fast and efficient. The algorithm to implement LZ4 compression can be broken down into the following steps:

  1. Divide the data into blocks: Divide the data into blocks of a fixed size, for example, 64KB.
  2. Search for repeated patterns: Within each block, scan for repeated byte sequences, typically using a hash table of recently seen positions.
  3. Replace repeated patterns with references: Replace each repeated sequence with a back-reference, an offset and length pointing to an earlier occurrence of the same bytes.
  4. Emit the compressed block: The resulting stream of literal bytes and back-references is the compressed block. LZ4 performs no separate entropy-coding pass, which is a major reason it is so fast.
  5. Decompression: For decompression, copy literal bytes to the output and resolve each back-reference by copying the referenced bytes from the already-decompressed output.

It’s important to note that LZ4 compression is fast and efficient, but it typically achieves a lower compression ratio than algorithms like Gzip or ZSTD. The LZ4 algorithm works by analyzing data at the byte level, and it doesn’t take into account the structure of the data, which can make it less efficient for some types of data. Also, LZ4 compression is applied to data blocks rather than the entire data set, which allows for faster decompression and random access to the data.
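The literal/back-reference idea behind these steps can be illustrated with a toy LZ77-style sketch in Python. This is not the real LZ4 format, which uses a hash-table match finder and a compact binary token layout; the naive window scan below only demonstrates the concept:

```python
def lz_compress(data: bytes, window: int = 64):
    # Emit tokens: either a literal byte or an (offset, length) back-reference.
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Naive scan of the sliding window for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= 4:  # LZ4 also uses a minimum match length of 4
            tokens.append(("match", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz_decompress(tokens):
    out = bytearray()
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:  # copy length bytes from offset bytes back (handles overlap)
            _, off, length = t
            for _ in range(length):
                out.append(out[-off])
    return bytes(out)
```

Note that a match may overlap the bytes being produced (e.g. `b"abcabcabc"` compresses to three literals plus one long back-reference with offset 3), which is why decompression copies one byte at a time.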

(4) Run-length encoding

ClickHouse uses run-length encoding to compress data. This involves storing the number of consecutive identical values rather than the actual values. This method is particularly useful for compressing data with many consecutive identical values.

Algorithm to implement Run-length encoding

Run-length encoding (RLE) is a technique used to compress data by storing the number of consecutive identical values rather than the actual values. The algorithm to implement run-length encoding can be broken down into the following steps:

  1. Initialize a variable to store the previous value: Initialize a variable to store the previous value, which will be used to detect consecutive identical values.
  2. Initialize a variable to store the run-length: Initialize a variable to store the run-length, which is the number of consecutive identical values.
  3. Traverse the data: Traverse the data and for each value, compare it to the previous value:
    1. If the current value is different from the previous value, store the previous value and the run-length in the compressed data, set the previous value to the current value, and reset the run-length to 1.
    2. If the current value is the same as the previous value, increment the run-length by 1.
  4. Store the compressed data: After the traversal, store the final value and run-length pair, so the compressed data is a sequence of value and run-length pairs.
  5. Decompression: For decompression, iterate through the compressed data and for each value and run-length pair, output the value run-length number of times.

It’s important to note that RLE works best for data that has many consecutive identical values, such as images with large areas of the same color. It’s less efficient for data with few or no consecutive identical values and for data that is already highly compressed. Also, RLE is applied to specific columns of a table, typically columns of numeric or fixed-length types.
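The steps above can be sketched in Python. Note how the final run must be flushed after the traversal, since the loop only emits a pair when the value changes:

```python
def rle_encode(values):
    # Collapse consecutive identical values into (value, run_length) pairs.
    pairs = []
    prev, run = None, 0
    for v in values:
        if run and v == prev:
            run += 1          # same value: extend the current run
        else:
            if run:
                pairs.append((prev, run))  # value changed: emit the run
            prev, run = v, 1
    if run:
        pairs.append((prev, run))  # flush the final run
    return pairs

def rle_decode(pairs):
    # Expand each pair back into run_length copies of the value.
    out = []
    for v, run in pairs:
        out.extend([v] * run)
    return out
```

For example, `[7, 7, 7, 7, 2, 2, 9]` compresses to `[(7, 4), (2, 2), (9, 1)]`; the longer the runs, the greater the savings.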

(5) Huffman encoding

ClickHouse uses Huffman encoding to compress data. This is a lossless compression algorithm that uses a variable-length code to represent each symbol based on its frequency of occurrence.

Algorithm to implement Huffman encoding

Huffman encoding is a lossless compression algorithm that uses a variable-length code to represent each symbol based on its frequency of occurrence. The algorithm to implement Huffman encoding can be broken down into the following steps:

  1. Count the frequency of each symbol in the data: Count the number of occurrences of each symbol in the data.
  2. Create a leaf node for each symbol: Create a leaf node for each symbol and assign the frequency of the symbol as the weight of the node.
  3. Create a Huffman tree: Create a Huffman tree by repeatedly combining the two nodes with the lowest weight into a new node until there is only one root node. The weight of the new node is the sum of the weights of the two child nodes.
  4. Assign codes to each symbol: Assign a unique code to each leaf node in the Huffman tree. The code is represented by the path from the root node to the leaf node, with 0 representing a left branch and 1 representing a right branch.
  5. Replace symbols with codes: Replace each symbol in the data with its corresponding code.
  6. Compress the data: Concatenate the codes of all symbols to produce the compressed bitstream.
  7. Decompression: For decompression, use the Huffman tree to decode the compressed data and recover the original data.

It’s important to note that Huffman encoding works well for data with a non-uniform symbol frequency distribution, such as natural language text. It’s less efficient for data with a uniform symbol frequency distribution, or for data that is already highly compressed. Also, Huffman encoding is applied to specific columns of a table, typically string columns.
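The tree construction and code assignment above can be sketched in Python using a min-heap. This is a compact illustration that tracks each subtree's code table directly rather than building explicit node objects:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    # Steps 1-2: count frequencies; each entry starts as a leaf.
    freq = Counter(data)
    if len(freq) == 1:          # edge case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    # Step 3: repeatedly merge the two lightest subtrees.
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Step 4: 0 prefixes the left branch, 1 the right branch.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

def huffman_encode(data, codes):
    # Steps 5-6: replace symbols with codes and concatenate the bits.
    return "".join(codes[s] for s in data)

def huffman_decode(bits, codes):
    # Step 7: codes are prefix-free, so greedy matching decodes uniquely.
    inv = {c: s for s, c in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return out
```

On input like `"abracadabra"`, the most frequent symbol (`a`) receives one of the shortest codes, which is exactly where the compression comes from.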

Conclusion

These are some of the compression methods that ClickHouse uses to compress data. It combines these methods to compress data effectively based on the data type and values, and it allows the compression method to be configured for specific columns and tables. In ClickHouse, the compression method for a column can be set in the table’s CREATE statement using the CODEC clause, for example CODEC(Delta, LZ4).

To learn more about data compression in ClickHouse, please consider reading the articles below:

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.