ClickHouse is an open-source online analytical (OLAP) database built for real-time analytics and very fast queries on large datasets. While ClickHouse is heavily optimized for read speed, it is still possible to end up with sub-optimal queries for various reasons. In this article, let us look at the options available for getting optimal query performance out of ClickHouse tables. More specifically, we will focus on the MergeTree family of table engines, since they are the primary choice of engine in ClickHouse.
Pre-requisites:
- Access to the ClickHouse Play UI
- Basic ClickHouse SQL knowledge
Schema Design
Primary Key / Sorting Key
ClickHouse tables are indexed based on the primary key (ORDER BY) columns, so a good choice of primary key columns is essential. Unlike relational databases, where the primary key is chosen for row uniqueness, primary keys in ClickHouse need not be unique per row; instead, they should be based on the columns most often used to filter the data while querying.
While it is not possible to add every column frequently used for filtering to the primary key, the most important ones should be included.
```sql
CREATE TABLE wikistat
(
    `time` DateTime,
    `project` LowCardinality(String),
    `subproject` LowCardinality(String),
    `path` String,
    `hits` UInt64
)
ENGINE = MergeTree()
ORDER BY (path, time);
```
In the above example, the primary key is the combination of the path and time columns, and queries that filter on these columns are inherently faster. Queries that don't filter on these columns will run a full-table scan, which is quite slow.
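To make this concrete, here is a sketch of the two cases against the wikistat table above (the filter values are hypothetical):

```sql
-- Uses the primary key: ClickHouse can skip granules whose `path`
-- range does not match, instead of scanning the whole table.
SELECT sum(hits)
FROM wikistat
WHERE path = '/wiki/ClickHouse'
  AND time >= toDateTime('2023-01-01');

-- Cannot use the primary key: `subproject` is not part of it,
-- so this falls back to a full-table scan.
SELECT sum(hits)
FROM wikistat
WHERE subproject = 'news';
```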
Ordering of Columns in Primary key
While it is important to add the columns that are frequently used to filter the data while querying, it is equally important to choose the order of the columns while specifying the primary key.
ClickHouse will sort the inserted data based on the primary key columns before storing it on disk, and it creates/updates the sparse primary index after sorting. Depending on the cardinality of a column and its ordinal position in the primary key, ClickHouse uses different algorithms to filter the data. Let us assume there are two columns in the primary key and column_1 has a lower cardinality than column_2.
- Primary key – (column_1, column_2) – Filter the data by column_1 : ClickHouse uses binary search to filter the data and this is quite fast
- Primary key – (column_1, column_2) – Filter the data by column_2: ClickHouse uses generic exclusion search and this is not as fast as binary search
- Primary key – (column_2, column_1) – Filter the data by column_1: ClickHouse uses generic exclusion search, and this can be quite ineffective since the cardinality of column_2 is much greater than that of column_1
- Primary key – (column_2, column_1) – Filter the data by column_2 : ClickHouse uses binary search to filter the data and this is quite fast
The third scenario is particularly important: query performance won't be good even though the filter uses one of the primary key columns.
As a rule of thumb, order the columns in a primary key by their cardinality in ascending order. This will help in the efficient filtering of data based on the secondary columns and also in data compression.
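Applying this rule to the wikistat example, a cardinality-ordered key could look like the sketch below (a hypothetical variant of the earlier table, assuming project has far fewer distinct values than path):

```sql
-- `project` (low cardinality) comes first, so filters on it use
-- binary search; filters on `path` still benefit from the generic
-- exclusion search because the preceding column has few distinct values.
CREATE TABLE wikistat_ordered
(
    `time` DateTime,
    `project` LowCardinality(String),
    `path` String,
    `hits` UInt64
)
ENGINE = MergeTree()
ORDER BY (project, path, time);
```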
Query Optimization
Instead of indexing every row in the table, the primary index has one entry per group of rows. Such a group of rows is called a granule in ClickHouse, and the default granule size is 8192 rows. An ideal ClickHouse query narrows the scan down to only the granules required to process the query.
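To see how many granules (index marks) a table currently has, we can query the system.parts table — a quick sketch, using the wikistat table from the earlier example:

```sql
-- Count rows and granules (marks) across the active parts of a table.
SELECT
    sum(rows)  AS total_rows,
    sum(marks) AS total_granules
FROM system.parts
WHERE active AND table = 'wikistat';
```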
EXPLAIN
ClickHouse SQL has an EXPLAIN statement, which comes in handy for query optimizations. Let us look at important methods to identify if the query is an efficient one.
Let us make use of the opensky table available in the ClickHouse Play UI. Say we have the query given below.
```sql
SELECT firstseen, lastseen, origin, destination, callsign
FROM opensky
WHERE day = toDateTime('2019-12-25');
```
We can look at the query execution plan along with the usage of indexes using the below statement.
```sql
EXPLAIN PLAN indexes = 1
SELECT firstseen, lastseen, origin, destination, callsign
FROM opensky
WHERE day = toDateTime('2019-12-25');
```
This statement shows which indexes are used and how many parts and granules are selected.
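The shape of the output is roughly as follows (condensed and illustrative — the actual part and granule counts on the Play instance will differ):

```
Expression ((Projection + Before ORDER BY))
  ReadFromMergeTree (default.opensky)
  Indexes:
    PrimaryKey
      Condition: true
      Parts: 8/8
      Granules: 8346/8346
```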
Since day is not part of the primary key, this query reads all the available granules — effectively a full-table scan. Let us look at another query.
```sql
EXPLAIN PLAN indexes = 1
SELECT firstseen, lastseen, origin, destination, callsign
FROM opensky
WHERE origin = 'CYYZ'
  AND day > toDateTime('2019-12-25')
  AND destination = '06FA';
```
This time, the index section of the output shows the primary key columns being used to prune granules.
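Again in condensed, illustrative form (the condition and counts below are a sketch of what the Play instance prints, not the exact output):

```
Expression ((Projection + Before ORDER BY))
  ReadFromMergeTree (default.opensky)
  Indexes:
    PrimaryKey
      Keys:
        origin
        destination
      Condition: and((origin in ['CYYZ', 'CYYZ']), (destination in ['06FA', '06FA']))
      Parts: 1/8
      Granules: 2/8346
```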
This query reads just two granules, so it is far more efficient than the previous one. If the number of granules to be processed is still very high and cannot be reduced with the existing primary key, we have the options listed below.
- Data Skipping indexes
- Projections with different primary keys
- Materialized views with different primary keys
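The first two options can be sketched as follows, again using the wikistat table (the index and projection names and the target columns here are hypothetical choices for illustration):

```sql
-- Data skipping index: a minmax index on `hits` lets ClickHouse skip
-- granules whose min/max range cannot match a filter on that column.
ALTER TABLE wikistat ADD INDEX hits_idx hits TYPE minmax GRANULARITY 4;
ALTER TABLE wikistat MATERIALIZE INDEX hits_idx;

-- Projection: a hidden copy of the data with a different sort order,
-- picked automatically for queries that filter on `project`.
ALTER TABLE wikistat ADD PROJECTION by_project
(
    SELECT * ORDER BY (project, time)
);
ALTER TABLE wikistat MATERIALIZE PROJECTION by_project;
```

Both come at the cost of extra storage and slower inserts, so they are worth adding only for filter patterns that are genuinely common.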
Increase Query Threads
This option works when there are spare CPU resources at our disposal. ClickHouse can process granules in parallel, and by default the maximum number of threads it uses equals the number of CPU cores. If the load on the ClickHouse server is light, we can raise the maximum number of threads used by a query. An example is below.
```sql
SELECT firstseen, lastseen, origin, destination, callsign
FROM opensky
WHERE day > toDateTime('2019-12-25')
SETTINGS max_threads = 16;
```
The above query will use up to 16 threads.
Tweak concurrent read settings
Two other settings can help us improve the query performance in some cases.
- merge_tree_min_rows_for_concurrent_read – default is 163840 rows
- merge_tree_min_bytes_for_concurrent_read – default is 251658240 bytes (240 MiB)
These settings control the minimum amount of data a read must cover before ClickHouse splits it across threads. We can reduce them so that even smaller reads are parallelized across more threads.
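A sketch of how this might look per query (the lowered values here are illustrative, not recommendations — the right numbers depend on the data and hardware):

```sql
SELECT firstseen, lastseen, origin, destination, callsign
FROM opensky
WHERE day > toDateTime('2019-12-25')
SETTINGS
    merge_tree_min_rows_for_concurrent_read = 20480,      -- default: 163840
    merge_tree_min_bytes_for_concurrent_read = 10485760;  -- default: 240 MiB
```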
References
https://clickhouse.com/docs/en/sql-reference/statements/explain
ClickHouse Performance: How to use Projections for Query Optimization
How to Configure Data Skipping Indexes in ClickHouse for Query Performance
ClickHouse Performance: Advanced Optimizations for Ultra-low Latency