Optimising ClickHouse Queries – Turbocharge your queries

Improve the performance of ClickHouse queries


ClickHouse is an open-source, column-oriented analytical (OLAP) database built for real-time analytics and fast queries on large datasets. While ClickHouse is inherently optimized for read speed, it is still possible to end up with sub-optimal queries for a variety of reasons. In this article, let us look at the options available for getting optimal query performance from ClickHouse tables. More specifically, we will focus on the MergeTree family of table engines, since they are the primary choice of engine in ClickHouse.

Prerequisites:

  1. Access to the ClickHouse Play UI (the public ClickHouse playground)
  2. Basic knowledge of ClickHouse SQL

Schema Design

Primary Key / Sorting Key

ClickHouse tables are indexed by their primary key (the ORDER BY / sorting key), so it is essential to choose the primary key columns well. Unlike relational databases, where the primary key is chosen for row uniqueness, a ClickHouse primary key does not need to identify rows uniquely; instead, it should be built from the columns most frequently used to filter the data at query time.

Since it is not practical to include every column ever used for filtering as part of the primary key, the most important ones should be added.

CREATE TABLE wikistat
(
    `time` DateTime,
    `project` LowCardinality(String),
    `subproject` LowCardinality(String),
    `path` String,
    `hits` UInt64
)
ENGINE = MergeTree()
ORDER BY (path, time);

In the above example, the primary key is the combination of the path and time columns, so queries that filter on these columns are inherently fast. Queries that do not filter on these columns fall back to a full-table scan, which is quite slow.
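For instance, with the wikistat table above, a query that filters on path (the leading key column) can use the primary index, while a query that filters only on a non-key column cannot. The filter values below are hypothetical; the exact speed difference depends on the data, but the pattern holds.

-- Can use the primary index: filters on path, the leading key column
SELECT sum(hits) FROM wikistat WHERE path = 'Main_Page';

-- Cannot use the primary index: filters only on a non-key column
SELECT sum(hits) FROM wikistat WHERE project = 'en';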

Ordering of Columns in the Primary Key

While it is important to include the columns that are frequently used to filter the data, it is equally important to choose the order in which those columns appear in the primary key.

ClickHouse sorts the inserted data by the primary key columns before storing it on disk, and it creates/updates the sparse primary index after sorting. Depending on a column's cardinality and its ordinal position in the primary key, ClickHouse uses different algorithms to filter the data. Let us assume the primary key has two columns and column_1 has a lower cardinality than column_2.

  1. Primary key – (column_1, column_2) – filter by column_1: ClickHouse uses a binary search to filter the data, which is quite fast
  2. Primary key – (column_1, column_2) – filter by column_2: ClickHouse uses a generic exclusion search, which is not as fast as a binary search
  3. Primary key – (column_2, column_1) – filter by column_1: ClickHouse uses a generic exclusion search, which can be quite ineffective since the cardinality of column_2 is much greater than that of column_1
  4. Primary key – (column_2, column_1) – filter by column_2: ClickHouse uses a binary search to filter the data, which is quite fast

The third scenario is particularly important: query performance will be poor even though the filter uses one of the primary key columns.

As a rule of thumb, order the columns in the primary key by their cardinality in ascending order. This helps with efficient filtering on the later key columns and also improves data compression.
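As an illustrative sketch, the wikistat table could order its key from the lowest-cardinality column to the highest. The relative cardinalities assumed here (project lowest, then subproject, then path, with time last) are assumptions for the example, not measurements of the dataset.

CREATE TABLE wikistat_ordered
(
    `time` DateTime,
    `project` LowCardinality(String),
    `subproject` LowCardinality(String),
    `path` String,
    `hits` UInt64
)
ENGINE = MergeTree()
ORDER BY (project, subproject, path, time);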

Query Optimization

Instead of indexing every row in the table, the primary index has one entry per group of rows. Such a group of rows is called a granule in ClickHouse, and its default size is 8192 rows. An ideal ClickHouse query narrows the work down to only the granules it actually needs to process.
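To get a rough idea of how many granules a table has, we can query the system.parts table; the marks column roughly corresponds to the number of granules per part. A minimal sketch, reusing the wikistat table name from earlier:

SELECT
    table,
    sum(rows) AS rows,
    sum(marks) AS granules
FROM system.parts
WHERE active AND table = 'wikistat'
GROUP BY table;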

EXPLAIN

ClickHouse SQL has an EXPLAIN statement, which comes in handy for query optimization. Let us look at how to identify whether a query is an efficient one.

Let us make use of the opensky table available in the ClickHouse Play UI. Say that we have the query given below.

SELECT firstseen, lastseen, origin, destination, callsign FROM opensky
WHERE day=toDateTime('2019-12-25');

We can look at the query execution plan, along with index usage, using the statement below.

EXPLAIN PLAN indexes=1 SELECT firstseen, lastseen, origin, destination, callsign FROM opensky
WHERE day=toDateTime('2019-12-25');

This statement will give the below output.
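A sketch of what the output looks like follows; the exact part and granule counts are illustrative and depend on the dataset snapshot. Because day is not part of the table's sorting key here, the primary key condition is true and every granule is selected.

Expression ((Projection + Before ORDER BY))
  ReadFromMergeTree (default.opensky)
  Indexes:
    PrimaryKey
      Condition: true
      Parts: 1/1
      Granules: 8188/8188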

So this query uses all the available granules. Let us look at another query.

EXPLAIN PLAN indexes=1 SELECT firstseen, lastseen, origin, destination, callsign
FROM opensky WHERE origin='CYYZ' AND day>toDateTime('2019-12-25') AND destination='06FA';

The output is given below.
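Again, this is a sketch of the output; the exact condition rendering and counts vary with the ClickHouse version and dataset snapshot.

Expression ((Projection + Before ORDER BY))
  ReadFromMergeTree (default.opensky)
  Indexes:
    PrimaryKey
      Keys:
        origin
        destination
      Condition: and((origin in ['CYYZ', 'CYYZ']), (destination in ['06FA', '06FA']))
      Parts: 1/1
      Granules: 2/8188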

This query processes just two granules, so it is more optimal than the previous one. If the number of granules a query has to process is very high and cannot be reduced with the existing primary key, we have the options listed below (a small data skipping index sketch follows the list).

  1. Data Skipping indexes
  2. Projections with different primary keys
  3. Materialized views with different primary keys
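For example, a data skipping index can be added on a frequently filtered non-key column and then materialized for existing data. The index name, type, and granularity below are illustrative choices, not recommendations tied to this dataset.

ALTER TABLE opensky ADD INDEX day_idx day TYPE minmax GRANULARITY 4;
ALTER TABLE opensky MATERIALIZE INDEX day_idx;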

Increase Query Threads

This option works if there are plenty of CPU resources at our disposal. ClickHouse can process granules in parallel, and by default the maximum number of threads used by a query is equal to the number of CPU cores. If the load on the ClickHouse server is low, we can increase the maximum number of threads a query will use. An example is below.

SELECT firstseen, lastseen, origin, destination, callsign
FROM opensky WHERE day>toDateTime('2019-12-25')
SETTINGS max_threads=16;

The above query will then use up to 16 threads.

Tweak concurrent read settings

Two other settings can help us improve the query performance in some cases.

  1. merge_tree_min_rows_for_concurrent_read – default is 163840 rows
  2. merge_tree_min_bytes_for_concurrent_read – default is 251658240 bytes (240 MiB)

We can reduce these values so that more threads read the data in parallel.
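A sketch reusing the earlier opensky query; the values below are arbitrary illustrations and should be tuned to the actual workload.

SELECT firstseen, lastseen, origin, destination, callsign
FROM opensky WHERE day>toDateTime('2019-12-25')
SETTINGS merge_tree_min_rows_for_concurrent_read = 20480,
    merge_tree_min_bytes_for_concurrent_read = 31457280;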

References

https://clickhouse.com/docs/en/operations/settings/settings#setting-merge-tree-min-rows-for-concurrent-read

https://clickhouse.com/docs/en/sql-reference/statements/explain

Introduction to Materialized Views In ClickHouse

ClickHouse Performance: How to use Projections for Query Optimization

How to Configure Data Skipping Indexes in ClickHouse for Query Performance

ClickHouse Performance: Advanced Optimizations for Ultra-low Latency