Compression Algorithms and Codecs in ClickHouse

Introduction

The amount of data stored in databases grows every day, which raises the cost of data storage and network access. Compression is a common way to save storage space and speed up data access. This article explains the compression algorithms and codecs available in ClickHouse.

Compression Types

The ClickHouse protocol supports the LZ4 and ZSTD compression algorithms. Both are dictionary-based compression algorithms with a checksum. LZ4 is faster, while ZSTD achieves a higher compression ratio. Choose the one that best suits your use case and workload. For details about column store compression algorithms, please click here.

The default compression type for ClickHouse is LZ4. If you are not sure which one to pick, LZ4 is the recommended choice. For MergeTree-engine tables, the data compression settings can be changed in the “compression” section of the “config.xml” file.

You can change the compression method by replacing “Value” with “None”, “lz4”, or “zstd”. If you do not add or uncomment this setting, LZ4 is used as the default compression method.

<compression>
    <case>
        <method>Value</method>
    </case>
</compression>

Let’s look at the compression ratio at the table level first. For that purpose, I am using the Cell Towers dataset. This dataset has more than 43 million records; you can get it by clicking here.

Two identical tables with different compression types (LZ4 and ZSTD) were created and loaded with the Cell Towers dataset. The compression ratios of the tables are shown in Table 1.

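┌─Table Name───────┬─Compressed Table Size (GB)─┬─Uncompressed Table Size (GB)─┬─Compression Ratio─┐
│ cell_towers_LZ4  │                       1.07 │                         2.06 │              1.92 │
│ cell_towers_zstd │                       0.84 │                         2.06 │              2.45 │
└──────────────────┴────────────────────────────┴──────────────────────────────┴───────────────────┘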

Table 1: Compression ratios of the Cell Towers dataset with respect to different compression algorithms.

Now, let’s look at the compression of individual columns. The compression ratio of each column of the Cell Towers dataset with LZ4 compression is shown below. Cardinality and data type are the main factors that affect the compression ratio.

┌─Column Name───┬─Column Type─┬─compressed─┬─uncompressed─┬─Compression Ratio─┬─compression_codec─┐
│ changeable    │ UInt8       │ 188.32 KiB │ 41.27 MiB    │            224.41 │                   │
│ averageSignal │ UInt8       │ 188.32 KiB │ 41.27 MiB    │            224.41 │                   │
│ radio         │ Enum8(''    │ 188.38 KiB │ 41.27 MiB    │            224.35 │                   │
│ mcc           │ UInt16      │ 384.88 KiB │ 82.54 MiB    │            219.61 │                   │
│ net           │ UInt16      │ 410.99 KiB │ 82.54 MiB    │            205.66 │                   │
│ unit          │ Int16       │ 2.12 MiB   │ 82.54 MiB    │             38.86 │                   │
│ range         │ UInt32      │ 48.27 MiB  │ 165.09 MiB   │              3.42 │                   │
│ samples       │ UInt32      │ 77.14 MiB  │ 165.09 MiB   │              2.14 │                   │
│ created       │ DateTime    │ 87.37 MiB  │ 165.09 MiB   │              1.89 │                   │
│ cell          │ UInt64      │ 178.76 MiB │ 330.17 MiB   │              1.85 │                   │
│ area          │ UInt16      │ 48.29 MiB  │ 82.54 MiB    │              1.71 │                   │
│ lat           │ Float64     │ 259.85 MiB │ 330.17 MiB   │              1.27 │                   │
│ lon           │ Float64     │ 261.98 MiB │ 330.17 MiB   │              1.26 │                   │
│ updated       │ DateTime    │ 130.71 MiB │ 165.09 MiB   │              1.26 │                   │
└───────────────┴─────────────┴────────────┴──────────────┴───────────────────┴───────────────────┘
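
The Compression Ratio column is the uncompressed size divided by the compressed size. As a minimal sketch, the Python snippet below recomputes the ratio for the “changeable” column from the human-readable sizes in the table above. The helper names parse_size and compression_ratio are illustrative and not part of ClickHouse, and small differences from the table are expected because the displayed sizes are rounded.

```python
# Binary size units as they appear in the table above.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}


def parse_size(text):
    """Convert a string such as '188.32 KiB' into a number of bytes."""
    number, unit = text.split()
    return float(number) * UNITS[unit]


def compression_ratio(compressed, uncompressed):
    """Ratio = uncompressed size / compressed size (higher is better)."""
    return parse_size(uncompressed) / parse_size(compressed)


# Sizes of the "changeable" column with LZ4, copied from the table.
ratio = compression_ratio("188.32 KiB", "41.27 MiB")
print(round(ratio, 2))
```

For this column the result is roughly 224.41, which agrees with the value reported in the table.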

The compression ratio of the Cell Towers dataset with ZSTD compression is shown below. ZSTD compressed better than LZ4 for this dataset.

┌─Column Name───┬─Column Type─┬─compressed─┬─uncompressed─┬─Compression Ratio─┬─compression_codec─┐
│ changeable    │ UInt8       │ 29.05 KiB  │ 41.27 MiB    │           1454.95 │                   │
│ averageSignal │ UInt8       │ 29.05 KiB  │ 41.27 MiB    │           1454.95 │                   │
│ radio         │ Enum8(''    │ 29.08 KiB  │ 41.27 MiB    │           1453.44 │                   │
│ mcc           │ UInt16      │ 62.84 KiB  │ 82.54 MiB    │           1344.98 │                   │
│ net           │ UInt16      │ 80.79 KiB  │ 82.54 MiB    │           1046.21 │                   │
│ unit          │ Int16       │ 1.19 MiB   │ 82.54 MiB    │             69.18 │                   │
│ samples       │ UInt32      │ 31.46 MiB  │ 165.09 MiB   │              5.25 │                   │
│ range         │ UInt32      │ 31.51 MiB  │ 165.09 MiB   │              5.24 │                   │
│ cell          │ UInt64      │ 113.25 MiB │ 330.17 MiB   │              2.92 │                   │
│ created       │ DateTime    │ 70.06 MiB  │ 165.09 MiB   │              2.36 │                   │
│ area          │ UInt16      │ 38.65 MiB  │ 82.54 MiB    │              2.14 │                   │
│ lat           │ Float64     │ 225.17 MiB │ 330.17 MiB   │              1.47 │                   │
│ lon           │ Float64     │ 229.93 MiB │ 330.17 MiB   │              1.44 │                   │
│ updated       │ DateTime    │ 119.08 MiB │ 165.09 MiB   │              1.39 │                   │
└───────────────┴─────────────┴────────────┴──────────────┴───────────────────┴───────────────────┘

Column Compression Codecs

In ClickHouse, it is also possible to define compression for individual columns in supported table engines. The table engines that support compression are shown in Table 2.

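┌─Table Engine─────┬─Column Compression─┬─Default Compression─────────────────────┐
│ MergeTree Family │ Yes                │ Yes, change with “compression” settings │
│ Log Family       │ Yes                │ Yes, only LZ4 by default                │
│ Set              │ No                 │ Yes, only default compression           │
│ Join             │ No                 │ Yes, only default compression           │
└──────────────────┴────────────────────┴─────────────────────────────────────────┘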

Table 2: Table engines that support compression

Compression methods for individual columns can be defined with the CODEC keyword, either at table creation (CREATE TABLE) or by modifying a column (ALTER TABLE … MODIFY COLUMN …).

CREATE TABLE <database>.<table>
(
    column1 DateTime CODEC(<Codec>),
    .
    .
    .
)
ENGINE = <EngineType>
. . .


--------------------------------

ALTER TABLE <database>.<table> MODIFY COLUMN column1 CODEC(<Codec>);

ClickHouse supports both general purpose codecs and specialized codecs. General purpose codecs are the default codecs (LZ4, ZSTD) and their modified versions. Specialized codecs are designed to make compression more effective by using specific features of the data.

General Purpose Codecs

The general purpose codecs are:

  • NONE : No Compression.
  • LZ4 : Applies LZ4 fast compression.
  • LZ4HC[(level)] : LZ4 HC (high compression) algorithm with configurable level.
  • ZSTD[(level)] : ZSTD compression algorithm with configurable level.

Specialized Compression Codecs

These codecs are designed to make compression more effective by using specific features of the data. Some of these codecs do not compress data themselves. Instead, they prepare the data for a general purpose codec, which then compresses it better than it would without this preparation.

  • Delta : Stores the difference between two neighboring values. It can be combined with LZ4 and ZSTD.
  • DoubleDelta : Stores the difference between two neighboring deltas (delta of deltas). Suitable for time series data.
  • Gorilla : Calculates the XOR between the current and the previous value. Suitable for slowly changing floating-point numbers.
  • T64 : Puts blocks of 64 values into a 64×64 bit matrix, transposes it, and crops the unused high bits. Works for integer data types (including Enum, Date, and DateTime).
  • FPC : Used for floating-point values. Stores the XOR between the actual value and a predicted value.
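
To make the first three transformations in the list above concrete, here is a minimal Python sketch of their core ideas. It is an illustration under simplifying assumptions, not ClickHouse's implementation: it works on plain Python lists, the function names are hypothetical, and the real codecs additionally pack the results into a compact bit stream (the real Gorilla codec, for example, also encodes the XOR result compactly).

```python
import struct


def delta_encode(values):
    """Keep the first value, then store differences between neighbors."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def double_delta_encode(values):
    """Keep the first value and first delta, then store deltas of deltas."""
    deltas = delta_encode(values)
    return deltas[:1] + delta_encode(deltas[1:])


def double_delta_decode(encoded):
    """Undo double_delta_encode by applying the running sum twice."""
    return delta_decode(encoded[:1] + delta_decode(encoded[1:]))


def float_to_bits(x):
    """Reinterpret a 64-bit float as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_float(bits):
    """Reinterpret an unsigned 64-bit integer as a 64-bit float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def xor_encode(values):
    """Gorilla-style step: XOR each float's bits with the previous one."""
    out, prev = [], 0
    for v in values:
        bits = float_to_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out


def xor_decode(xored):
    """Undo xor_encode by XOR-ing the running bit pattern back in."""
    out, prev = [], 0
    for x in xored:
        prev ^= x
        out.append(bits_to_float(prev))
    return out


timestamps = [100, 160, 220, 281]
print(delta_encode(timestamps))         # [100, 60, 60, 61]
print(double_delta_encode(timestamps))  # [100, 60, 0, 1]
print(xor_encode([20.0, 20.0, 20.5])[1])  # 0, because the value did not change
```

In each case the encoded sequence contains smaller and more repetitive numbers than the input, which is what lets a general purpose codec such as LZ4 or ZSTD compress it more effectively afterwards.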

I chose columns of different types from the Cell Towers dataset and compared the compression ratios achieved by different codecs.

For the first comparison, the Enum8 column “radio” is selected. The “radio” column has 5 distinct values.

┌──count()─┬─radio─┐
│      867 │ NR    │
│   556344 │ CDMA  │
│  9931312 │ GSM   │
│ 12101148 │ LTE   │
│ 20686487 │ UMTS  │
└──────────┴───────┘

The compression ratio comparison for the “radio” column is shown in Fig 1. ZSTD on its own and the combinations of specialized codecs with ZSTD performed better than the others.

Fig 1 – Compression ratios for column “radio”
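
A column with only a handful of distinct values, such as “radio”, leaves most of the bits in each stored value unused, which is the situation T64 targets. The following Python sketch illustrates the idea of transposing a block of values into bit planes and keeping only the planes that are actually used. It is a simplified illustration with hypothetical function names, and it assumes the five radio values are stored as the integer codes 0 to 4, which the article does not state.

```python
def t64_like_encode(values):
    """Transpose a block of up to 64 unsigned ints into bit planes.

    Plane k is an integer whose bit i equals bit k of values[i].
    Only planes up to the highest bit in use are kept.
    """
    if len(values) > 64:
        raise ValueError("this sketch handles blocks of at most 64 values")
    if not values:
        return 0, []
    used_bits = max(1, max(values).bit_length())
    planes = []
    for bit in range(used_bits):
        plane = 0
        for i, v in enumerate(values):
            plane |= ((v >> bit) & 1) << i
        planes.append(plane)
    return used_bits, planes


def t64_like_decode(planes, count):
    """Rebuild the original values from the kept bit planes."""
    values = [0] * count
    for bit, plane in enumerate(planes):
        for i in range(count):
            values[i] |= ((plane >> i) & 1) << bit
    return values


# Hypothetical integer codes for the five radio values.
codes = [0, 1, 2, 3, 4, 4, 2, 1]
used_bits, planes = t64_like_encode(codes)
print(used_bits, len(planes))
```

With codes 0 to 4, only 3 of the 8 bits of each value carry information, so the sketch keeps 3 bit planes instead of 8 and still restores the block exactly.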

Next, a column with the UInt16 data type (“area”) is used for the tests. The “area” column has 57512 distinct values. Fig 2 shows that ZSTD again performed better than the others.

Fig 2 – Compression ratios for column “area”

Finally, a column with the DateTime data type (“updated”) is used. This column contains 1.7 million slowly changing time series values. DoubleDelta combined with ZSTD performed best for this column.

Fig 3 – Compression ratios for column “updated”
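
The reason a delta-style codec helps here can be sketched in a few lines of Python. The example below is an assumption-laden illustration rather than a reproduction of the benchmark: the timestamps are synthetic (one value per minute, with every seventh interval one second longer), and zlib from the Python standard library stands in for ZSTD, so the absolute sizes are not comparable to ClickHouse's.

```python
import struct
import zlib


def delta(values):
    """Keep the first value, then store differences between neighbors."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def double_delta(values):
    """Keep the first value and first delta, then store deltas of deltas."""
    d = delta(values)
    return d[:1] + delta(d[1:])


def compressed_size(values):
    """Bytes needed after packing as signed 64-bit ints and running zlib."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return len(zlib.compress(raw, 6))


# Synthetic, slowly changing time series (not the Cell Towers data).
timestamps = []
current = 1_600_000_000
for i in range(10_000):
    timestamps.append(current)
    current += 61 if i % 7 == 0 else 60

raw_size = compressed_size(timestamps)
delta_size = compressed_size(delta(timestamps))
double_delta_size = compressed_size(double_delta(timestamps))
print(raw_size, delta_size, double_delta_size)
```

The raw timestamps are all different, so the compressor finds little to reuse. After the delta or double-delta step the stream consists mostly of a few repeating small numbers, and the same compressor produces a much smaller output.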

Click here to access the scripts for repeating the benchmark tests.

Conclusion

This article explained the compression types and compression codecs available in ClickHouse and examined the efficiency of the related algorithms and codecs with a sample dataset. According to the findings, the compression ratio is affected not only by the compression algorithm and codec but also by the data type, cardinality, and other characteristics of the data.

About Emrah Idman
Emrah Idman has considerable experience in relational and NoSQL databases. He has worked in a large-scale financial company for over 15 years. He has significant experience in team management, procurement and capacity planning, database administration, and product testing for high-volume systems. He works at ChistaDATA Inc. as a senior database administrator.