Introduction
SQL is the language used to communicate with databases, and it is also a key factor in database performance. If a SQL query is not designed effectively, it can impact performance negatively. In this blog post, I will share some best practices and recommendations that will help improve query performance in ClickHouse.
ClickHouse SQL Engineering Best Practices
Add filters wherever possible
It is recommended to add filters to your SQL queries. Without a filter, a query performs a full table scan and uses more system resources; at peak times, this can impact performance for other workloads as well. So, it is good to filter wherever possible.
For example:
I have the table “cell_towers”. The table has data from the year 1970 to 2017, with at least 100,000 rows per year.
SELECT min(toYear(created)), max(toYear(created))
FROM cell_towers

Query id: b46fdaa9-a3c4-488f-b65e-a55994784417

┌─min(toYear(created))─┬─max(toYear(created))─┐
│                 1970 │                 2017 │
└──────────────────────┴──────────────────────┘
In this case, the following query is used to fetch the top 10 records in descending order. Since the data is ordered by year, I know that all the results will come from 2017.
SELECT cell, area, lon, toYear(created) AS year
FROM cell_towers
ORDER BY year DESC
LIMIT 10
So we can introduce the filter here as shown below,
SELECT cell, area, lon, toYear(created) AS year
FROM cell_towers
WHERE toYear(created) = 2017
ORDER BY year DESC
LIMIT 10
Result without filtering:
┌─database─┬─table───────┬─parts─┬────rows─┬─marks─┐
│ sql      │ cell_towers │     1 │ 6291270 │   768 │
└──────────┴─────────────┴───────┴─────────┴───────┘
10 rows in a set. Elapsed: 0.151 sec. Processed 6.29 million rows, 138.41 MB (41.57 million rows/s., 914.59 MB/s.)
Result with filtering:
┌─database─┬─table───────┬─parts─┬────rows─┬─marks─┐
│ sql      │ cell_towers │     1 │ 2760518 │   337 │
└──────────┴─────────────┴───────┴─────────┴───────┘
10 rows in a set. Elapsed: 0.059 sec. Processed 2.76 million rows, 33.01 MB (40.12 million rows/s., 479.68 MB/s.)
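For reference, the parts / rows / marks tables above look like the output of EXPLAIN ESTIMATE, which reports how many parts, rows, and marks a query would read without actually executing it. A minimal sketch against the same “cell_towers” table:

EXPLAIN ESTIMATE
SELECT cell, area, lon, toYear(created) AS year
FROM cell_towers
WHERE toYear(created) = 2017
ORDER BY year DESC
LIMIT 10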
We can see a notable improvement in the query after introducing the filter. Sometimes it is hard to maintain hardcoded values from the application side, but this is just an example; consider adding a filter wherever possible 🙂
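If hardcoding the literal year is not an option, one alternative is to derive it with a scalar subquery. This is only a sketch; it still scans the “created” column once to find the maximum, but it keeps the main query filtered without a hardcoded value:

SELECT cell, area, lon, toYear(created) AS year
FROM cell_towers
-- assumption: the application wants the latest available year
WHERE toYear(created) = (SELECT max(toYear(created)) FROM cell_towers)
ORDER BY year DESC
LIMIT 10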
Design the JOINs in an optimized way
This is one of the crucial things to consider when you design a query with JOINs. It is always recommended to shrink the data first ( with grouping, sorting, and limiting ) rather than performing the JOIN over the full tables and applying those operations afterwards.
For example,
The following query is designed to perform the JOIN operation with GROUP BY and ORDER BY operations.
SELECT net
FROM cell_users
INNER JOIN cell_users_total ON cell_users.net = cell_users_total.net_total
GROUP BY net
ORDER BY net DESC
LIMIT 10
The query can be rewritten as follows using a subquery,
SELECT DISTINCT net
FROM
(
    SELECT net
    FROM cell_users
    GROUP BY net
    ORDER BY net DESC
    LIMIT 10
) AS al
INNER JOIN cell_users_total ON cell_users_total.net_total = net
With this approach, the subquery is executed first, and only its small result is used to perform the JOIN. This makes the query much faster than the previous approach.
Let’s compare the results,
Note: The following query can be used to merge the results from the system tables “text_log” and “query_log”. Make sure the server parameter “text_log” is enabled with log level “information” so that query information is written to the “system.text_log” table. Replace the query_ids with your own.
SELECT DISTINCT
    system.text_log.query_id,
    system.query_log.query,
    system.text_log.message
FROM system.text_log
INNER JOIN system.query_log ON system.text_log.query_id = system.query_log.query_id
WHERE system.text_log.query_id IN ('a8fa2bb6-6ad8-4d43-a19c-051917c5f4ec', 'a535ac59-db6b-4dfe-bbe0-72da1c1f5ba9')
Query with the JOIN performed before GROUP BY and ORDER BY ( on the full tables ):
query_id: a535ac59-db6b-4dfe-bbe0-72da1c1f5ba9
query:    SELECT net FROM cell_users JOIN cell_users_total ON cell_users.net = cell_users_total.net_total GROUP BY net ORDER BY net desc LIMIT 10
message:  Read 291272 rows, 568.89 KiB in 9.577540082 sec., 30411 rows/sec., 59.40 KiB/sec.
Query with the JOIN performed after GROUP BY and ORDER BY:
query_id: a8fa2bb6-6ad8-4d43-a19c-051917c5f4ec
query:    select distinct(net) from ( select net from cell_users group by net order by net desc limit 10 ) al inner join cell_users_total on cell_users_total.net_total = net;
message:  Read 291272 rows, 568.89 KiB in 0.009675702 sec., 30103448 rows/sec., 57.42 MiB/sec.
From the above example, we can see that the query performing the JOIN after GROUP BY and ORDER BY gives a huge improvement over the previous approach ( roughly 9.6 seconds down to 0.01 seconds ) 🙂
Use the same data types when making JOINs
ClickHouse will not allow you to JOIN columns with different data types by default.
For example,
SELECT database, table, name, type
FROM system.columns
WHERE (database = 'sql') AND (table IN ('T_employees', 'T_users')) AND (name = 'unit')

Query id: 64e34cf4-c06e-44df-9f28-41371482815a

┌─database─┬─table───────┬─name─┬─type───┐
│ sql      │ T_employees │ unit │ Int16  │
│ sql      │ T_users     │ unit │ String │
└──────────┴─────────────┴──────┴────────┘
I have two tables, “T_users” and “T_employees”; both have the column “unit”, but with different data types, as shown above. Let’s make the join and see what happens.
It gives the following error.
SELECT radio, net
FROM T_users
INNER JOIN T_employees ON T_users.unit = T_employees.unit
LIMIT 10

Query id: 297aa08b-1064-4106-9340-dab166a2bfbb

0 rows in set. Elapsed: 0.003 sec.

Received exception from server (version 22.7.2):
Code: 53. DB::Exception: Received from localhost:9000. DB::Exception: Can't infer common type for joined columns: unit: String at left, T_employees.unit: Int16 at right. There is no supertype for types String, Int16 because some of them are String/FixedString and some of them are not. (TYPE_MISMATCH)
It is still possible to do the JOIN by converting the String to “Int16” in the query itself, without modifying the table structure, as shown below.
SELECT radio, net
FROM T_users
INNER JOIN T_employees ON toInt16(T_users.unit) = T_employees.unit
LIMIT 10
But it consumes additional system resources for the conversion on every query. It is always recommended to JOIN columns of the same data type to avoid such issues.
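If you control the schema and every value in “T_users.unit” is a numeric string, another option is to convert the column once instead of converting it on every query. A sketch ( this runs a mutation that rewrites the column, so test it first; it will fail if any stored value cannot be parsed as an Int16 ):

-- one-time conversion; assumes all stored values are valid Int16 numbers
ALTER TABLE T_users MODIFY COLUMN unit Int16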
Use MATERIALIZED columns to store the complicated results
MATERIALIZED columns store the result of a complicated expression, computed when the data is inserted. If we evaluate such expressions directly in the SELECT, it can be an expensive operation during query execution; once the result is stored in a MATERIALIZED column, the precomputed value is read instead.
For example, the following query uses the function “positionCaseInsensitive” to find the position of the string “menu” inside the column “notes”. Doing such an operation at query time is expensive, because the column “notes” holds long strings.
┌─notes─────────────────────────────────────────────────┐
│ wine list in Swedish; Arthur H. Rumpf Menu Collection │
│ wine list in Swedish; Arthur H. Rumpf Menu Collection │
└───────────────────────────────────────────────────────┘
For example, the current query looks like below
SELECT dish_count, status, positionCaseInsensitive(notes, 'menu') AS result
FROM menu
GROUP BY dish_count, status, result
ORDER BY result DESC
LIMIT 2

Query id: f4f409bb-c03a-4ed1-8fad-cc4484733f40

┌─dish_count─┬─status───┬─result─┐
│         32 │ complete │    249 │
│        219 │ complete │    234 │
└────────────┴──────────┴────────┘
In this case, we can consider creating a MATERIALIZED column to store the position of the string “menu” within the column “notes”. MATERIALIZED columns can be created using the ALTER command.
ALTER TABLE menu
    ADD COLUMN `materlized_result` Int16 MATERIALIZED positionCaseInsensitive(notes, 'menu')

Query id: 357a819a-e6b8-409f-829c-9a6c55b1fa60

Ok.

0 rows in set. Elapsed: 0.006 sec.
After adding the column, newly inserted data will populate the MATERIALIZED column automatically. We still need to backfill the existing data; we can do that by executing the OPTIMIZE statement.
OPTIMIZE TABLE menu FINAL

Query id: 11d75f21-3880-4647-97f4-649b0833033f

Ok.

0 rows in set. Elapsed: 2.056 sec.
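Note: OPTIMIZE … FINAL rewrites the whole table. On recent ClickHouse versions ( 21.10 and later ), a lighter option for backfilling just the new column is ALTER … MATERIALIZE COLUMN; a sketch:

-- backfills only the materialized column instead of merging all parts
ALTER TABLE menu MATERIALIZE COLUMN materlized_result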
Now, let’s compare the results,
Without MATERIALIZED column,
SELECT dish_count, status, positionCaseInsensitive(notes, 'menu') AS result
FROM menu
GROUP BY dish_count, status, result
ORDER BY result DESC
LIMIT 2

Query id: a915a7b4-8578-4f12-a1c5-9ce7292835a0

┌─dish_count─┬─status───┬─result─┐
│         32 │ complete │    249 │
│        219 │ complete │    234 │
└────────────┴──────────┴────────┘

2 rows in set. Elapsed: 0.374 sec. Processed 17.97 million rows, 1.15 GB (48.05 million rows/s., 3.07 GB/s.)
With MATERIALIZED column,
SELECT dish_count, status, materlized_result AS result
FROM menu
GROUP BY dish_count, status, result
ORDER BY result DESC
LIMIT 2

Query id: dd34b4a9-4d91-4557-a674-d800354c480f

┌─dish_count─┬─status───┬─result─┐
│         32 │ complete │    249 │
│        219 │ complete │    234 │
└────────────┴──────────┴────────┘

2 rows in set. Elapsed: 0.190 sec. Processed 17.97 million rows, 379.34 MB (81.53 million rows/s., 1.72 GB/s.)
From the result, we can see that the query performs well with the MATERIALIZED column. The query time is about 2x better, and the amount of data processed drops from 1.15 GB to 379.34 MB.
I would recommend a MATERIALIZED column when you are extracting a small result from large columns. You will see good benefits with that 🙂
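To find good candidates, you can check the per-column sizes in “system.columns”. A minimal sketch, assuming the “sql.menu” table from above:

SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed
FROM system.columns
WHERE (database = 'sql') AND (table = 'menu')
ORDER BY data_uncompressed_bytes DESC
LIMIT 5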
Avoid “SELECT *” and query only the data you need
It is recommended not to use “SELECT *” in your queries unless you really need every column from the table. “SELECT *” is an expensive operation.
For example,
SELECT *
FROM sql.menu
ORDER BY dish_count DESC
LIMIT 10

10 rows in set. Elapsed: 0.931 sec. Processed 17.97 million rows, 5.64 GB (19.30 million rows/s., 6.05 GB/s.)
But I just need the columns “name” and “event”, so the following query is enough to satisfy my requirement.
SELECT name, event
FROM sql.menu
ORDER BY dish_count DESC
LIMIT 10

10 rows in set. Elapsed: 0.118 sec. Processed 17.97 million rows, 538.07 MB (151.88 million rows/s., 4.55 GB/s.)
If we compare the two runs, we can see a great difference between them. The query with only the selected columns performs about 8x better than the query with “SELECT *”.
Be aware of case-sensitive and case-insensitive searches
Know your data! Yes, it is necessary to use the right case when querying case-sensitive data; string comparisons in ClickHouse are case sensitive. Otherwise, the query may give the wrong result and possibly cause a performance issue as well.
For example, the “String” data type is compared case-sensitively. ( As a side note, the “case_insensitive” flag in the table “system.data_type_families” refers to whether the type name itself can be written case-insensitively in SQL, not to how the stored values are compared. )
SELECT *
FROM system.data_type_families
WHERE name = 'String'

Query id: 10c03b83-ba80-4f25-984a-f9710b30f5f6

┌─name───┬─case_insensitive─┬─alias_to─┐
│ String │                0 │          │
└────────┴──────────────────┴──────────┘

1 row in set. Elapsed: 0.002 sec.
I have the table “dish” with 27,401,281 rows overall, and the column “name” is of the “String” data type.
SELECT database, table, name, type
FROM system.columns
WHERE (table = 'dish') AND (name = 'name')

Query id: c2def569-63fe-4bc2-80c5-191cb5dbb33b

┌─database─┬─table─┬─name─┬─type───┐
│ sql      │ dish  │ name │ String │
└──────────┴───────┴──────┴────────┘
I am trying to find the name “Chicken soup with rice”. The data is stored with the first letter “C” in upper case and the other letters in lower case.
SELECT id, name, highest_price
FROM dish
WHERE name = 'Chicken soup with rice'
GROUP BY id, name, highest_price
ORDER BY highest_price ASC

Query id: fe9b056a-7876-4872-ab14-41198d1061e7

┌─id─┬─name───────────────────┬─highest_price─┐
│  8 │ Chicken soup with rice │           0.6 │
└────┴────────────────────────┴───────────────┘

1 row in set. Elapsed: 0.156 sec. Processed 27.40 million rows, 1.12 GB (175.72 million rows/s., 7.16 GB/s.)
Now, I am trying to find the same data in the column “name”, but this time I wrote “RICE” in upper case.
SELECT id, name, highest_price
FROM dish
WHERE name = 'Chicken soup with RICE'
GROUP BY id, name, highest_price
ORDER BY highest_price ASC

Query id: 2ed4df52-f656-4112-8c9f-4ef3aea6df24

Ok.

0 rows in set. Elapsed: 0.155 sec. Processed 27.40 million rows, 1.12 GB (176.90 million rows/s., 7.21 GB/s.)
Here the result is empty, but the query still scanned the entire table looking for the data. This is unnecessary load pushed onto the server for no reason. So, it is recommended to use the right case when designing the query, or to use an explicitly case-insensitive function:
- Use the right function to search for a position inside a string ( position – case sensitive :: positionCaseInsensitive – case insensitive )
- Use the right operator to perform a pattern search ( LIKE – case sensitive :: ILIKE – case insensitive ) — both are sketched after this list
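For example, a sketch of a case-insensitive version of the earlier lookup against the “dish” table:

-- case-insensitive pattern match; still scans the column, but finds the row
SELECT id, name, highest_price
FROM dish
WHERE name ILIKE 'chicken soup with RICE'
LIMIT 1

-- case-insensitive substring search with positionCaseInsensitive
SELECT count()
FROM dish
WHERE positionCaseInsensitive(name, 'RICE') > 0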
Conclusion
Designing your queries in the proper way is crucial to improving database performance. I hope this blog post gives you some useful tips about query design. Let me know if you have any feedback or suggestions. I will come up with my next blog soon.