What do sensors in self-driving vehicles, social media activities, and autonomous stock/crypto trading algorithms have in common? These applications are based on data that measures how things change over time. It’s called time-series data, and storing this time-series data is one of the hot topics nowadays. ClickHouse performs pretty well with few interventions in this regard.
ClickHouse is a golden solution for analytics, but it can also be used to store time-series data. Though there are several time-series databases available such as Prometheus and InfluxDB, ClickHouse can handle this duty pretty well and save time managing additional tools. To get the most out of ClickHouse while working with time series, we need to act on specific compression codecs and set up a data retention policy as specified in the documentation. Let’s dive into them!
Use the codec, but which one?
ClickHouse supports several codecs that manage fundamental methods to compress data, but time-series data is typically pairings of time:value, which is both numbers, so gzip compression is not the best option for time-series data. In other words, it is obvious that utilizing specific codecs only is one of the necessary ways for time-series data, and we can utilize either popular algorithms like DoubleDelta and Gorilla or ClickHouse-specific algorithms like T64.
T64, for example, trims superfluous bits for integer values, including date/time types. When time-series data varies little from point to point, the DoubleDelta and Gorilla codecs are efficient, so it operates on delta instead of the value itself.
In this article, we will compare RAW(uncompressed), DoubleDelta/Gorilla, and T64 table codecs and see which is more proper for our purpose. We will compare each codec using the same data and under the same conditions at the end. Let’s create them!
- RAW(Uncompressed)
CREATE TABLE time_series ( `timeseries` DateTime, `data` UInt64 ) ENGINE = MergeTree ORDER BY timeseries;
INSERT INTO time_series SELECT now() - number, number FROM numbers(10000000);
SELECT formatReadableSize(sum(bytes)) FROM system.parts WHERE active AND table = 'time_series';
Result is 76,5 MB
- T64
CREATE TABLE time_series_t64 ( `timeseries` DateTime CODEC(T64), `data` UInt64 CODEC(T64) ) ENGINE = MergeTree ORDER BY timeseries;
INSERT INTO time_series_t64 SELECT now() - number, number FROM numbers(10000000);
SELECT formatReadableSize(sum(bytes)) FROM system.parts WHERE active AND table = 'time_series_t64';
Result is 37 MB
- DoubleDelta/Gorilla
set allow_suspicious_codecs = 1; CREATE TABLE time_series_dd_g ( `timeseries` DateTime CODEC(DoubleDelta), `data` UInt64 CODEC(Gorilla) ) ENGINE = MergeTree ORDER BY timeseries;
INSERT INTO time_series_dd_g SELECT now() - number, number FROM numbers(10000000);
SELECT formatReadableSize(sum(bytes)) FROM system.parts WHERE active AND table = 'time_series_dd_g';
The result is 19 MB
In a larger sense, DoubleDelta/Gorilla compressed the same data takes exactly four times the capacity of the usual storage mode, which is shocking in terms of storage space. Without a doubt, this feature will be useful for storing the time series.
Regarding the root cause of the results, T64 is a function that crops superfluous bits from integer data, including date/time types. When time-series data varies significantly from point to point, the DoubleDelta and Gorilla codecs are more effective because operating on delta instead of the value itself. Delta of Delta is stored in DoubleDelta. Delta is a normally fixed value for a constantly increasing or decreasing value. This value might be small or huge and takes up a lot of storage space. When the Delta of Delta is calculated again, it is almost zero. This form of data is frequently encountered in large data scenarios, particularly in monitoring scenarios. The time column, for example, is normally fixed at a point of n seconds, the data in the incremental column grows by one each time, and the accumulative values, such as request times and traffic values, change stably. Lastly, if we discuss Gorilla, its encoding is very efficient for values that do not change often. It applies to both float and integer data types.
Use a data retention policy.
TTL is an indicator that determines how long data can exist in general; on the column level, it simply means how long the value of that column should be retained.
Moving expired data to other discs is useful for time-series data. The ClickHouse TTL move function allows you to configure keeping new hot data on SSD and gradually moving it to HDD as it expires. Before migrating data, the storage policy must have been modified previously in config.xml.
CREATE TABLE time_series_ttl_hdd ( `timeseries` DateTime CODEC(T64), `data` UInt64 CODEC(T64) ) ENGINE = MergeTree ORDER BY timeseries TTL timeseries + INTERVAL 2 WEEK TO DISK 'hdd' SETTINGS storage_policy = 'policy_hdd';
It may be beneficial to erase data relevant to your scenario according to retention value.
CREATE TABLE time_series_ttl ( `timeseries` DateTime CODEC(T64), `data` UInt64 CODEC(T64) ) ENGINE = MergeTree ORDER BY timeseries TTL timeseries + INTERVAL 2 WEEK DELETE;
Use materialized view with TTL
Are you interested in the last five days’ average data values? Let’s create a table with specified 5-day TTL and wait for the time-series data. As an example, we created a new table; however, as you may expect, we can use the MODIFY command to add the TTL feature to existing tables.
CREATE TABLE time_series ( `timeseries` DateTime CODEC(T64), `data` UInt64 CODEC(T64) ) ENGINE = MergeTree ORDER BY timeseries TTL timeseries + INTERVAL 5 DAY DELETE;
Now we’ll create a materialized view using the TTL-enabled table.
CREATE MATERIALIZED VIEW timeseries_mw ( `date` Date, `data` UInt64 ) ENGINE = AggregatingMergeTree ORDER BY date AS SELECT DATE(timeseries) AS date, sum(data) AS value FROM time_series GROUP BY date;
This will generate the `timeseries_mw` materialized view, which will use the AggregatingMergeTree engine to automatically aggregate all data(last 5 days) from the `time_series` table based on the specified query.
Summary
ClickHouse is a column-based analytic RDBMS. However, after making some dynamic changes to the ClickHouse database to handle time-series data and we’ve experienced how they perform effectively.