Performance Tricks in ClickHouse – Part III

Among DBAs, there is much debate regarding the significance of database performance tuning. As we now understand, various data types really power the business world. It should be operating very effectively and always be accessible to ensure that the data is quickly and easily accessible for all beneficiaries.

ClickHouse supports advanced compression to improve speed and lower storage costs. Technically speaking, optimizing the ClickHouse schema’s storage architecture will boost both memory and network bandwidth performance. But what must be done to efficient compression?

Generating the JSON file with ordered column:

CREATE TABLE default.table_one
(
    `id` UInt64
)
ENGINE = MergeTree
ORDER BY `id`

insert into table_one select number/10 from numbers(5000000)
clickhouse-client -q "select * from table_one" > table_one.json
 
gzip table_one.json

Generating the JSON file with an unordered column:

CREATE TABLE default.table_two
(
    `id` UInt64
)
ENGINE = MergeTree
ORDER BY tuple()

insert into table_two select number/10 from numbers(5000000) order by rand()
clickhouse-client -q "select * from table_two" > table_two.json
 
gzip table_two.json

Comparing the size result:

The tables only contain 5 million rows, but you can see how much the compressed files change in size.

-rw-r--r-- 1 root root   837985 Jan  8 01:47 table_one.json.gz
-rw-r--r-- 1 root root 15615551 Jan  8 01:48 table_two.json.gz

“Why does ClickHouse take so much RAM for a straightforward query, and how can we lower this requirement in production environments?”

It is actually a FAQ.  The most important point is with this operation, the amount of memory used decreased significantly after the re-sorting procedure with the SELECT operations. You can quickly observe a change in the effectiveness of your SELECT queries in sorted columns. A table with order expression correctly will perform better in SELECT queries, without a doubt.

Well, how do existing tables’ columns get sorted?

You can add expressions that use only the newly added columns. So, let’s do that:

CREATE TABLE default.test_exp
(
    `id` UInt64,
    `duration` UInt64
)
ENGINE = MergeTree
ORDER BY id
ALTER TABLE default.test_exp ADD COLUMN version Int32, MODIFY ORDER BY (id, version)

We now know how crucial the ORDER BY phrase is. Making the appropriate choice is crucial while creating a table. It is vital for efficient SELECT operations. Also, before compressing a table, it is advised that the current columns be re-sorted to reduce the size; however, with ClickHouse, you may include expressions that use the newly inserted columns as the sorting key. We just mentioned the version column in the instructions because of this. The sorting procedure was unable to include the duration column.

The primary key in ClickHouse cannot be changed as online. But ChistaDATA Inc. has OSC tool for these kinds of operations. Adding or modifying the primary key is feasible if you wish to update the ORDER BY expression in your existing table. The OSC tool from ChistaDATA Inc. might be used to do this task online without cost. Just stay tuned for this tool.

Summary

As demonstrated, using a proper ordering key can boost table compression by a factor of 20 or more. Repetitive values are better compressed in such files than random values. Also, for performance-enhancing SELECT queries, remember to re-order your columns in addition to this. Therefore, less RAM can be used for your cloud service, and you can pay less.

About Ilkay 10 Articles
Ilkay has been administering databases including NoSQL and RDBMS for over 7 years. He is experienced working high-scale banking databases. He is currently working at ChistaDATA Inc. as Database Administrator.
Contact: Website