Introduction
The amount of data stored in databases grows every day, which raises the cost of both data storage and network access. Compression is a commonly used way to save storage space and speed up data access. This article explains the compression algorithms and codecs available in ClickHouse.
Compression Types in ClickHouse
The ClickHouse protocol supports the LZ4 and ZSTD compression algorithms. Both are dictionary-based compression algorithms with a checksum. LZ4 is faster, while ZSTD achieves a higher compression ratio. You can choose whichever suits your use case and workload best. For details about column store compression algorithms, please click here.
The default compression type for ClickHouse is LZ4. It is advised to use LZ4 if you are not sure what mode to pick. For MergeTree-engine tables, data compression settings can be modified with “compression” settings in the “config.xml” file.
You can change the compression method by replacing “Value” in the snippet below with “None”, “lz4”, or “zstd”. If this setting is absent or commented out, LZ4 is used as the default compression method.
<compression>
    <case>
        <method>Value</method>
    </case>
</compression>
Let’s have a look at the compression ratio in the table. For that purpose, I am using the Cell Towers dataset. This dataset has 43+ million records; you can get it by clicking here.
Two identical tables with different compression types (LZ4 and ZSTD) were created and loaded with the Cell Towers dataset. The compression ratios of the tables are shown in Table 1.
Table Name | Compressed Table Size (GB) | Uncompressed Table Size (GB) | Compression Ratio |
cell_towers_LZ4 | 1.07 | 2.06 | 1.92 |
cell_towers_zstd | 0.84 | 2.06 | 2.45 |
Table 1: Compression ratios of the Cell Towers dataset with respect to different compression algorithms.
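The ratio column in Table 1 is the uncompressed size divided by the compressed size. The following is a minimal sketch of that arithmetic, together with the space saving it implies. The function names are illustrative and not part of ClickHouse. Because the sizes in the table are already rounded to two decimals, a recomputed ratio can differ from the table in the last digit.

```python
def compression_ratio(uncompressed_gb: float, compressed_gb: float) -> float:
    """Uncompressed size divided by compressed size."""
    return uncompressed_gb / compressed_gb


def space_saving(uncompressed_gb: float, compressed_gb: float) -> float:
    """Fraction of the uncompressed size that compression removed."""
    return 1.0 - compressed_gb / uncompressed_gb


# Sizes in GB as (uncompressed, compressed), copied from Table 1.
tables = {
    "cell_towers_LZ4": (2.06, 1.07),
    "cell_towers_zstd": (2.06, 0.84),
}

for name, (uncompressed, compressed) in tables.items():
    ratio = compression_ratio(uncompressed, compressed)
    saving = space_saving(uncompressed, compressed)
    print(f"{name}: ratio {ratio:.2f}, space saving {saving:.1%}")
```

With these rounded inputs, LZ4 removes roughly 48% of the uncompressed size and ZSTD roughly 59%.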
Now, let’s have a look at the compression of individual columns. The compression ratio of the Cell Towers dataset with LZ4 compression is shown below. Cardinality and data types are the main factors that determine the compression ratio.
┌─Column Name───┬─Column Type─┬─compressed─┬─uncompressed─┬─Compression Ratio─┬─compression_codec─┐
│ changeable    │ UInt8       │ 188.32 KiB │ 41.27 MiB    │            224.41 │                   │
│ averageSignal │ UInt8       │ 188.32 KiB │ 41.27 MiB    │            224.41 │                   │
│ radio         │ Enum8(''    │ 188.38 KiB │ 41.27 MiB    │            224.35 │                   │
│ mcc           │ UInt16      │ 384.88 KiB │ 82.54 MiB    │            219.61 │                   │
│ net           │ UInt16      │ 410.99 KiB │ 82.54 MiB    │            205.66 │                   │
│ unit          │ Int16       │ 2.12 MiB   │ 82.54 MiB    │             38.86 │                   │
│ range         │ UInt32      │ 48.27 MiB  │ 165.09 MiB   │              3.42 │                   │
│ samples       │ UInt32      │ 77.14 MiB  │ 165.09 MiB   │              2.14 │                   │
│ created       │ DateTime    │ 87.37 MiB  │ 165.09 MiB   │              1.89 │                   │
│ cell          │ UInt64      │ 178.76 MiB │ 330.17 MiB   │              1.85 │                   │
│ area          │ UInt16      │ 48.29 MiB  │ 82.54 MiB    │              1.71 │                   │
│ lat           │ Float64     │ 259.85 MiB │ 330.17 MiB   │              1.27 │                   │
│ lon           │ Float64     │ 261.98 MiB │ 330.17 MiB   │              1.26 │                   │
│ updated       │ DateTime    │ 130.71 MiB │ 165.09 MiB   │              1.26 │                   │
└───────────────┴─────────────┴────────────┴──────────────┴───────────────────┴───────────────────┘
The compression ratio of the Cell Towers dataset with ZSTD compression is shown below. ZSTD compressed better than LZ4 for this dataset.
┌─Column Name───┬─Column Type─┬─compressed─┬─uncompressed─┬─Compression Ratio─┬─compression_codec─┐
│ changeable    │ UInt8       │ 29.05 KiB  │ 41.27 MiB    │           1454.95 │                   │
│ averageSignal │ UInt8       │ 29.05 KiB  │ 41.27 MiB    │           1454.95 │                   │
│ radio         │ Enum8(''    │ 29.08 KiB  │ 41.27 MiB    │           1453.44 │                   │
│ mcc           │ UInt16      │ 62.84 KiB  │ 82.54 MiB    │           1344.98 │                   │
│ net           │ UInt16      │ 80.79 KiB  │ 82.54 MiB    │           1046.21 │                   │
│ unit          │ Int16       │ 1.19 MiB   │ 82.54 MiB    │             69.18 │                   │
│ samples       │ UInt32      │ 31.46 MiB  │ 165.09 MiB   │              5.25 │                   │
│ range         │ UInt32      │ 31.51 MiB  │ 165.09 MiB   │              5.24 │                   │
│ cell          │ UInt64      │ 113.25 MiB │ 330.17 MiB   │              2.92 │                   │
│ created       │ DateTime    │ 70.06 MiB  │ 165.09 MiB   │              2.36 │                   │
│ area          │ UInt16      │ 38.65 MiB  │ 82.54 MiB    │              2.14 │                   │
│ lat           │ Float64     │ 225.17 MiB │ 330.17 MiB   │              1.47 │                   │
│ lon           │ Float64     │ 229.93 MiB │ 330.17 MiB   │              1.44 │                   │
│ updated       │ DateTime    │ 119.08 MiB │ 165.09 MiB   │              1.39 │                   │
└───────────────┴─────────────┴────────────┴──────────────┴───────────────────┴───────────────────┘
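The per-column sizes in the two listings above mix KiB and MiB, so both sizes have to be converted to bytes before a ratio is taken. The sketch below shows one way to do that for a few columns copied from the listings. The helper names `parse_size` and `ratio` are hypothetical, and the unit table assumes binary units (1 KiB = 1024 bytes), as the KiB/MiB suffixes suggest.

```python
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_size(text: str) -> float:
    """Convert a string such as '188.32 KiB' into a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def ratio(compressed: str, uncompressed: str) -> float:
    """Compression ratio computed after converting both sizes to bytes."""
    return parse_size(uncompressed) / parse_size(compressed)


# (compressed with LZ4, compressed with ZSTD, uncompressed), from the listings.
columns = {
    "changeable": ("188.32 KiB", "29.05 KiB", "41.27 MiB"),
    "range": ("48.27 MiB", "31.51 MiB", "165.09 MiB"),
    "updated": ("130.71 MiB", "119.08 MiB", "165.09 MiB"),
}

for name, (lz4, zstd, raw) in columns.items():
    # How much smaller the ZSTD column is relative to the LZ4 column.
    gain = 1.0 - parse_size(zstd) / parse_size(lz4)
    print(f"{name}: LZ4 {ratio(lz4, raw):.2f}, ZSTD {ratio(zstd, raw):.2f}, "
          f"ZSTD smaller by {gain:.0%}")
```

Ratios recomputed this way can differ slightly from the listed ones, because the displayed sizes are already rounded.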
Column Compression Codecs in ClickHouse
In ClickHouse, it is also possible to compress individual columns in supported table engines. The table engines that support compression are shown in Table 2.
Table Engine | Column Compression | Default Compression |
Merge Tree Family | Yes | Yes, Change with “compression” settings |
Log Family | Yes | Yes, only LZ4 by default |
Set | No | Yes, only default compression |
Join | No | Yes, only default compression |
Table 2: Compression supported table engines
Compression methods for individual columns can be defined at table creation (CREATE TABLE) or column modification (ALTER TABLE … MODIFY COLUMN …) with the CODEC keyword.
CREATE TABLE <database>.<table>
(
    column1 DateTime CODEC(<Codec>),
    . . .
)
ENGINE = <EngineType>
. . .
--------------------------------
ALTER TABLE <database>.<table>
    MODIFY COLUMN column1 CODEC(<Codec>);
ClickHouse supports both general purpose codecs and specialized codecs. General purpose codecs are the default codecs (LZ4, ZSTD) and their modified versions. Specialized codecs are designed to make compression more effective by exploiting specific features of the data.
General Purpose Codecs in ClickHouse
The general purpose codecs are:
- NONE : No Compression.
- LZ4 : Applies LZ4 fast compression.
- LZ4HC[(level)] : LZ4 HC (high compression) algorithm with configurable level.
- ZSTD[(level)] : ZSTD compression algorithm with configurable level.
Specialized Compression Codecs in ClickHouse
These codecs are designed to make compression more effective by exploiting specific features of the data. Some of these codecs do not compress data themselves. Instead, they prepare the data for a general purpose codec, which then compresses it better than it would without this preparation.
- Delta : Stores the difference between two neighboring values. It can be combined with LZ4 and ZSTD.
- DoubleDelta : Stores the difference between two neighboring delta values (delta of deltas). Suitable for time series data.
- Gorilla : Calculates the XOR between the current and the previous value. Suitable for slowly changing floating-point numbers.
- T64 : Crops the unused high bits of values in integer data types (including Enum, Date, and DateTime) and puts them into a 64×64 bit matrix.
- FPC : Used for floating-point values. Calculates the XOR between the actual value and a predicted value.
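The transforms behind Delta, DoubleDelta, and Gorilla can be sketched in a few lines. This is only an illustration of the ideas in the list above, with hypothetical function names. It is not ClickHouse's implementation, which works on binary column data and, for some codecs, also bit-packs the results.

```python
import struct


def delta_encode(values: list[int]) -> list[int]:
    """Keep the first value, then store each value minus its predecessor."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: list[int]) -> list[int]:
    """Invert delta_encode with a running sum."""
    out: list[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def double_delta_encode(values: list[int]) -> list[int]:
    """Keep the first value and the first delta, then store deltas of deltas."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    delta_of_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
    return values[:1] + deltas[:1] + delta_of_deltas


def xor_encode(values: list[float]) -> list[int]:
    """XOR the 64-bit pattern of each float with that of the previous one."""
    bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
    return bits[:1] + [a ^ b for a, b in zip(bits, bits[1:])]


timestamps = [1_700_000_000 + 60 * i for i in range(6)]  # one reading per minute
print(delta_encode(timestamps))         # first value, then the constant step
print(double_delta_encode(timestamps))  # zeros after the first value and delta
print(xor_encode([21.5, 21.5, 21.5]))   # repeated values XOR to zero
```

The small, repetitive numbers these transforms produce are what a general purpose codec such as LZ4 or ZSTD can then compress well.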
I chose different types of columns from the Cell Towers dataset and compared the compression ratio of the following codecs.
For the first comparison, the Enum8 column “radio” is selected. The “radio” column has 5 distinct values.
┌──count()─┬─radio─┐
│      867 │ NR    │
│   556344 │ CDMA  │
│  9931312 │ GSM   │
│ 12101148 │ LTE   │
│ 20686487 │ UMTS  │
└──────────┴───────┘
The compression ratio comparison for the “radio” column is shown in Fig 1. ZSTD on its own and the combinations of specialized codecs with ZSTD performed better than the others.
Fig 1 – Compression ratios for column “radio”
Then, a column with a UInt16 data type (“area”) is used for the tests. The “area” column has 57512 distinct values. Fig 2 shows that ZSTD again performed better than the others.
Fig 2 – Compression ratios for column “area”
Finally, a column with a DateTime data type (“updated”) is used. This column contains 1.7 million slowly changing time series values. DoubleDelta combined with ZSTD performed best for this column.
Fig 3 – Compression ratios for column “updated”
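To see why a specialized codec helps on slowly changing timestamps, the sketch below compares a general purpose compressor applied alone and applied after a delta transform. It rests on several assumptions: `zlib` from the Python standard library stands in for ZSTD, a plain Delta transform is used, and the timestamps are synthetic rather than taken from the “updated” column. The byte counts therefore do not reproduce Fig 3 and only illustrate the effect.

```python
import random
import struct
import zlib


def pack_u32(values: list[int]) -> bytes:
    """Serialise integers as little-endian unsigned 32-bit values."""
    return struct.pack(f"<{len(values)}I", *values)


def delta(values: list[int]) -> list[int]:
    """First value, then differences between neighbours (modulo 2**32)."""
    return values[:1] + [(b - a) % 2**32 for a, b in zip(values, values[1:])]


random.seed(0)  # fixed seed so the synthetic data is reproducible

# Synthetic, sorted timestamps with steps of 1 to 5 seconds.
timestamps = [1_600_000_000]
for _ in range(99_999):
    timestamps.append(timestamps[-1] + random.randint(1, 5))

raw = pack_u32(timestamps)
plain = zlib.compress(raw)                             # general purpose codec only
prepared = zlib.compress(pack_u32(delta(timestamps)))  # delta first, same codec

print(f"raw: {len(raw)} bytes")
print(f"zlib only: {len(plain)} bytes")
print(f"delta + zlib: {len(prepared)} bytes")
```

The delta transform turns large, ever-changing timestamps into a stream of small repeated numbers, which is the kind of input a dictionary-based compressor handles best.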
Click here to access the scripts for repeating the benchmark tests.
Conclusion
This article explained the compression types and compression codecs in ClickHouse and examined the efficiency of the related algorithms and codecs with a sample dataset. According to the findings, the compression ratio is affected not only by the compression algorithm and codec but also by data type, cardinality, and the characteristics of the data.
To learn more about compression in ClickHouse, read the following articles:
- ClickHouse Performance: Achieving Maximum Data Compression
- Data Compression in ClickHouse: Algorithms for Top 5 Codecs
- Implementing Data Compression in ClickHouse with COMPRESS Function
- ClickHouse Data Compression Techniques for Time-series Datasets
References