Introduction
SQL is the language used to communicate with databases, and it is also a key factor in database performance. If a SQL query is not designed effectively, it can impact performance negatively. In this blog post, I will share some best practices and recommendations that will help improve query performance in ClickHouse.
ClickHouse SQL Engineering Best Practices
Add filters wherever possible
It is recommended to add filters to your SQL queries. Without a filter, a query performs a full table scan and uses more system resources; at peak times, this can impact performance for other workloads as well. So, it is good to filter wherever possible.
For example:
I have the table “cell_towers”. The table has data from the year 1970 to 2017, with at least 100,000 rows per year.
SELECT min(toYear(created)), max(toYear(created))
FROM cell_towers

Query id: b46fdaa9-a3c4-488f-b65e-a55994784417

┌─min(toYear(created))─┬─max(toYear(created))─┐
│                 1970 │                 2017 │
└──────────────────────┴──────────────────────┘
In this case, the following query is used to fetch the top 10 records in descending order. Since the data is ordered by year, I know that all the results will come from 2017.
SELECT cell, area, lon, toYear(created) AS year
FROM cell_towers
ORDER BY year DESC
LIMIT 10
So we can introduce the filter here as shown below,
SELECT cell, area, lon, toYear(created) AS year
FROM cell_towers
WHERE toYear(created) = 2017
ORDER BY year DESC
LIMIT 10
Result without filtering:
┌─database─┬─table───────┬─parts─┬────rows─┬─marks─┐
│ sql      │ cell_towers │     1 │ 6291270 │   768 │
└──────────┴─────────────┴───────┴─────────┴───────┘
10 rows in a set. Elapsed: 0.151 sec. Processed 6.29 million rows, 138.41 MB (41.57 million rows/s., 914.59 MB/s.)
Result with filtering:
┌─database─┬─table───────┬─parts─┬────rows─┬─marks─┐
│ sql      │ cell_towers │     1 │ 2760518 │   337 │
└──────────┴─────────────┴───────┴─────────┴───────┘
10 rows in a set. Elapsed: 0.059 sec. Processed 2.76 million rows, 33.01 MB (40.12 million rows/s., 479.68 MB/s.)
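For reference, the parts / rows / marks tables above look like the output of EXPLAIN ESTIMATE, which reports how many parts, rows, and marks a query would read without actually executing it. A minimal sketch against the same “cell_towers” table:

EXPLAIN ESTIMATE
SELECT cell, area, lon, toYear(created) AS year
FROM cell_towers
WHERE toYear(created) = 2017
ORDER BY year DESC
LIMIT 10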
We can see a notable improvement in the query after introducing the filter. Sometimes it is hard to maintain hardcoded values from the application side, but this is just an example; consider adding a filter wherever possible 🙂
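If hardcoding the literal year is not an option, one alternative is to derive it with a scalar subquery. This is only a sketch; it still scans the “created” column once to find the maximum, but it keeps the main query filtered without a hardcoded value:

SELECT cell, area, lon, toYear(created) AS year
FROM cell_towers
-- assumption: the application wants the latest available year
WHERE toYear(created) = (SELECT max(toYear(created)) FROM cell_towers)
ORDER BY year DESC
LIMIT 10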
Design the JOINs in an optimized way
This is one of the crucial things to consider when you design a query with JOINs. It is always recommended to shrink the data first ( with grouping, sorting, and limiting ) rather than performing the JOIN over the full tables and applying those operations afterwards.
For example,
The following query is designed to perform the JOIN operation with GROUP BY and ORDER BY operations.
SELECT net
FROM cell_users
INNER JOIN cell_users_total ON cell_users.net = cell_users_total.net_total
GROUP BY net
ORDER BY net DESC
LIMIT 10
The query can be rewritten as follows using a subquery,
SELECT DISTINCT net
FROM
(
    SELECT net
    FROM cell_users
    GROUP BY net
    ORDER BY net DESC
    LIMIT 10
) AS al
INNER JOIN cell_users_total ON cell_users_total.net_total = net
With this approach, the subquery is executed first, and only its small result is used to perform the JOIN. This makes the query much faster than the previous approach.
Let’s compare the results,
Note: The following query can be used to merge the results from the system tables “text_log” and “query_log”. Make sure the server parameter “text_log” is enabled with log level “information” so that query information is written to the “system.text_log” table. Replace the query_ids with your own.
SELECT DISTINCT
    system.text_log.query_id,
    system.query_log.query,
    system.text_log.message
FROM system.text_log
INNER JOIN system.query_log ON system.text_log.query_id = system.query_log.query_id
WHERE system.text_log.query_id IN ('a8fa2bb6-6ad8-4d43-a19c-051917c5f4ec', 'a535ac59-db6b-4dfe-bbe0-72da1c1f5ba9')
Query with the JOIN performed before GROUP BY and ORDER BY ( on the full tables ):
query_id: a535ac59-db6b-4dfe-bbe0-72da1c1f5ba9
query:    SELECT net FROM cell_users JOIN cell_users_total ON cell_users.net = cell_users_total.net_total GROUP BY net ORDER BY net desc LIMIT 10
message:  Read 291272 rows, 568.89 KiB in 9.577540082 sec., 30411 rows/sec., 59.40 KiB/sec.
Query with the JOIN performed after GROUP BY and ORDER BY:
query_id: a8fa2bb6-6ad8-4d43-a19c-051917c5f4ec
query:    select distinct(net) from ( select net from cell_users group by net order by net desc limit 10 ) al inner join cell_users_total on cell_users_total.net_total = net;
message:  Read 291272 rows, 568.89 KiB in 0.009675702 sec., 30103448 rows/sec., 57.42 MiB/sec.
From the above example, we can see that the query performing the JOIN after GROUP BY and ORDER BY gives a huge improvement over the previous approach ( roughly 9.6 seconds down to 0.01 seconds ) 🙂
Use the same data types when making JOINs
ClickHouse will not allow you to JOIN columns with different data types by default.
For example,
SELECT database, table, name, type
FROM system.columns
WHERE (database = 'sql') AND (table IN ('T_employees', 'T_users')) AND (name = 'unit')

Query id: 64e34cf4-c06e-44df-9f28-41371482815a

┌─database─┬─table───────┬─name─┬─type───┐
│ sql      │ T_employees │ unit │ Int16  │
│ sql      │ T_users     │ unit │ String │
└──────────┴─────────────┴──────┴────────┘
I have two tables, “T_users” and “T_employees”; both have the column “unit”, but with different data types, as shown above. Let’s make the join and see what happens.
It gives the following error.
SELECT radio, net
FROM T_users
INNER JOIN T_employees ON T_users.unit = T_employees.unit
LIMIT 10

Query id: 297aa08b-1064-4106-9340-dab166a2bfbb

0 rows in set. Elapsed: 0.003 sec.

Received exception from server (version 22.7.2):
Code: 53. DB::Exception: Received from localhost:9000. DB::Exception: Can't infer common type for joined columns: unit: String at left, T_employees.unit: Int16 at right. There is no supertype for types String, Int16 because some of them are String/FixedString and some of them are not. (TYPE_MISMATCH)
It is still possible to do the JOIN by converting the String to “Int16” in the query itself, without modifying the table structure, as shown below.
SELECT radio, net
FROM T_users
INNER JOIN T_employees ON toInt16(T_users.unit) = T_employees.unit
LIMIT 10
But it consumes additional system resources for the conversion on every query. It is always recommended to JOIN columns of the same data type to avoid such issues.
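If you control the schema and every value in “T_users.unit” is a numeric string, another option is to convert the column once instead of converting it on every query. A sketch ( this runs a mutation that rewrites the column, so test it first; it will fail if any stored value cannot be parsed as an Int16 ):

-- one-time conversion; assumes all stored values are valid Int16 numbers
ALTER TABLE T_users MODIFY COLUMN unit Int16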
Use MATERIALIZED columns to store the complicated results
MATERIALIZED columns store the result of a complicated expression, computed when the data is inserted. If we evaluate such expressions directly in the SELECT, it can be an expensive operation during query execution; once the result is stored in a MATERIALIZED column, the precomputed value is read instead.
For example, the following query uses the function “positionCaseInsensitive” to find the position of the string “menu” inside the column “notes”. Doing such an operation at query time is expensive, because the column “notes” holds long strings.
┌─notes─────────────────────────────────────────────────┐
│ wine list in Swedish; Arthur H. Rumpf Menu Collection │
│ wine list in Swedish; Arthur H. Rumpf Menu Collection │
└───────────────────────────────────────────────────────┘
For example, the current query looks like below
SELECT dish_count, status, positionCaseInsensitive(notes, 'menu') AS result
FROM menu
GROUP BY dish_count, status, result
ORDER BY result DESC
LIMIT 2

Query id: f4f409bb-c03a-4ed1-8fad-cc4484733f40

┌─dish_count─┬─status───┬─result─┐
│         32 │ complete │    249 │
│        219 │ complete │    234 │
└────────────┴──────────┴────────┘
In this case, we can consider creating a MATERIALIZED column to store the position of the string “menu” within the column “notes”. MATERIALIZED columns can be created using the ALTER command.
ALTER TABLE menu
    ADD COLUMN `materlized_result` Int16 MATERIALIZED positionCaseInsensitive(notes, 'menu')

Query id: 357a819a-e6b8-409f-829c-9a6c55b1fa60

Ok.

0 rows in set. Elapsed: 0.006 sec.
After adding the column, newly inserted data will populate the MATERIALIZED column automatically. We still need to backfill the existing data; we can do that by executing the OPTIMIZE statement.
OPTIMIZE TABLE menu FINAL

Query id: 11d75f21-3880-4647-97f4-649b0833033f

Ok.

0 rows in set. Elapsed: 2.056 sec.
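Note: OPTIMIZE … FINAL rewrites the whole table. On recent ClickHouse versions ( 21.10 and later ), a lighter option for backfilling just the new column is ALTER … MATERIALIZE COLUMN; a sketch:

-- backfills only the materialized column instead of merging all parts
ALTER TABLE menu MATERIALIZE COLUMN materlized_result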
Now, let’s compare the results,
Without MATERIALIZED column,
SELECT dish_count, status, positionCaseInsensitive(notes, 'menu') AS result
FROM menu
GROUP BY dish_count, status, result
ORDER BY result DESC
LIMIT 2

Query id: a915a7b4-8578-4f12-a1c5-9ce7292835a0

┌─dish_count─┬─status───┬─result─┐
│         32 │ complete │    249 │
│        219 │ complete │    234 │
└────────────┴──────────┴────────┘

2 rows in set. Elapsed: 0.374 sec. Processed 17.97 million rows, 1.15 GB (48.05 million rows/s., 3.07 GB/s.)
With MATERIALIZED column,
SELECT dish_count, status, materlized_result AS result
FROM menu
GROUP BY dish_count, status, result
ORDER BY result DESC
LIMIT 2

Query id: dd34b4a9-4d91-4557-a674-d800354c480f

┌─dish_count─┬─status───┬─result─┐
│         32 │ complete │    249 │
│        219 │ complete │    234 │
└────────────┴──────────┴────────┘

2 rows in set. Elapsed: 0.190 sec. Processed 17.97 million rows, 379.34 MB (81.53 million rows/s., 1.72 GB/s.)
From the result, we can see that the query performs well with the MATERIALIZED column. The query time is about 2x better, and the amount of data processed drops from 1.15 GB to 379.34 MB.
I would recommend a MATERIALIZED column when you are extracting a small result from large columns. You will see good benefits with that 🙂
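To find good candidates, you can check the per-column sizes in “system.columns”. A minimal sketch, assuming the “sql.menu” table from above:

SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed
FROM system.columns
WHERE (database = 'sql') AND (table = 'menu')
ORDER BY data_uncompressed_bytes DESC
LIMIT 5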
Avoid “SELECT *” and query only the data you need
It is recommended not to use “SELECT *” in your queries unless you really need every column from the table. “SELECT *” is an expensive operation.
For example,
SELECT *
FROM sql.menu
ORDER BY dish_count DESC
LIMIT 10

10 rows in set. Elapsed: 0.931 sec. Processed 17.97 million rows, 5.64 GB (19.30 million rows/s., 6.05 GB/s.)
But I just need the columns “name” and “event”, so the following query is enough to satisfy my requirement.
SELECT name, event
FROM sql.menu
ORDER BY dish_count DESC
LIMIT 10

10 rows in set. Elapsed: 0.118 sec. Processed 17.97 million rows, 538.07 MB (151.88 million rows/s., 4.55 GB/s.)
If we compare the two runs, we can see a great difference between them. The query with only the selected columns performs about 8x better than the query with “SELECT *”.
Be aware of case-sensitive and case-insensitive searches
Know your data! Yes, it is necessary to use the right case when querying case-sensitive data; string comparisons in ClickHouse are case sensitive. Otherwise, the query may give the wrong result and possibly cause a performance issue as well.
For example, the “String” data type is compared case-sensitively. ( As a side note, the “case_insensitive” flag in the table “system.data_type_families” refers to whether the type name itself can be written case-insensitively in SQL, not to how the stored values are compared. )
SELECT *
FROM system.data_type_families
WHERE name = 'String'

Query id: 10c03b83-ba80-4f25-984a-f9710b30f5f6

┌─name───┬─case_insensitive─┬─alias_to─┐
│ String │                0 │          │
└────────┴──────────────────┴──────────┘

1 row in set. Elapsed: 0.002 sec.
I have the table “dish” with 27,401,281 rows overall, and the column “name” is of the “String” data type.
SELECT database, table, name, type
FROM system.columns
WHERE (table = 'dish') AND (name = 'name')

Query id: c2def569-63fe-4bc2-80c5-191cb5dbb33b

┌─database─┬─table─┬─name─┬─type───┐
│ sql      │ dish  │ name │ String │
└──────────┴───────┴──────┴────────┘
I am trying to find the name “Chicken soup with rice”. The data is stored with the first letter “C” in upper case and the other letters in lower case.
SELECT id, name, highest_price
FROM dish
WHERE name = 'Chicken soup with rice'
GROUP BY id, name, highest_price
ORDER BY highest_price ASC

Query id: fe9b056a-7876-4872-ab14-41198d1061e7

┌─id─┬─name───────────────────┬─highest_price─┐
│  8 │ Chicken soup with rice │           0.6 │
└────┴────────────────────────┴───────────────┘

1 row in set. Elapsed: 0.156 sec. Processed 27.40 million rows, 1.12 GB (175.72 million rows/s., 7.16 GB/s.)
Now, I am trying to find the same data in the column “name”, but this time I wrote “RICE” in upper case.
SELECT id, name, highest_price
FROM dish
WHERE name = 'Chicken soup with RICE'
GROUP BY id, name, highest_price
ORDER BY highest_price ASC

Query id: 2ed4df52-f656-4112-8c9f-4ef3aea6df24

Ok.

0 rows in set. Elapsed: 0.155 sec. Processed 27.40 million rows, 1.12 GB (176.90 million rows/s., 7.21 GB/s.)
Here the result is empty, but the query still scanned the entire table looking for the data. This is unnecessary load pushed onto the server for no reason. So, it is recommended to use the right case when designing the query, or to use an explicitly case-insensitive function:
- Use the right function to search for a position inside a string ( position – case sensitive :: positionCaseInsensitive – case insensitive )
- Use the right operator to perform a pattern search ( LIKE – case sensitive :: ILIKE – case insensitive ) — both are sketched after this list
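For example, a sketch of a case-insensitive version of the earlier lookup against the “dish” table:

-- case-insensitive pattern match; still scans the column, but finds the row
SELECT id, name, highest_price
FROM dish
WHERE name ILIKE 'chicken soup with RICE'
LIMIT 1

-- case-insensitive substring search with positionCaseInsensitive
SELECT count()
FROM dish
WHERE positionCaseInsensitive(name, 'RICE') > 0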
Conclusion
Designing your queries in the proper way is crucial to improving database performance. I hope this blog post gives you some useful tips about query design. Let me know if you have any feedback or suggestions. I will come up with my next blog soon.