The query cache in any database significantly improves performance by reducing query execution time, decreasing server load, and optimizing resource utilization. It is a valuable feature for enhancing scalability, particularly in read-intensive environments. However, proper configuration and thoughtful cache invalidation strategies are necessary to maintain data consistency. Let’s explore how the query cache works in ClickHouse step by step.
Note: The query cache feature was introduced as an experimental preview in ClickHouse version 23.1, and it became production-ready starting from version 23.5. If you are using an older version, make sure to enable it with the command SET allow_experimental_query_cache = true.
Example: The following query took 8.934 seconds to execute, and the empty state of the system.query_cache table indicates that no results have been cached yet.
clickhouse-dbnode :) SELECT day_of_week_added, count() AS num, avg(days_present) AS avg_days_present FROM ( SELECT added_code.line, added_code.time AS added_day, dateDiff('day', added_code.time, removed_code.time) AS days_present FROM ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS added_code INNER JOIN ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS removed_code USING (path, line) WHERE removed_code.time > added_code.time ) GROUP BY dayOfWeek(added_day) AS day_of_week_added SELECT day_of_week_added, count() AS num, avg(days_present) AS avg_days_present FROM ( SELECT added_code.line, added_code.time AS added_day, dateDiff('day', added_code.time, removed_code.time) AS days_present FROM ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS added_code INNER JOIN ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS removed_code USING (path, line) WHERE removed_code.time > added_code.time ) GROUP BY dayOfWeek(added_day) AS day_of_week_added Query id: cabb79da-766c-4167-a693-079b47ac5cdd ┌─day_of_week_added─┬────num─┬───avg_days_present─┐ │ 1 │ 173241 │ 193.07947887624755 │ │ 2 │ 142198 │ 152.979387895751 │ │ 3 │ 163436 │ 136.45409212168678 │ │ 4 │ 257044 │ 121.19575247817494 │ │ 5 │ 204567 │ 141.79689294949821 │ │ 6 │ 62476 │ 202.50060823356168 │ │ 7 │ 71084 │ 219.8482499577964 │ └───────────────────┴────────┴────────────────────┘ 7 rows in set. Elapsed: 8.934 sec. Processed 15.48 million rows, 147.11 MB (1.73 million rows/s., 16.47 MB/s.) clickhouse-dbnode :) SELECT * FROM system.query_cache SELECT * FROM system.query_cache Query id: 25987b26-d62e-4aff-a0bb-717f601a2b95 Ok. 0 rows in set. Elapsed: 0.002 sec.
Now, let’s enhance the query by enabling the query cache with the SETTINGS use_query_cache = 1 directive. This will allow ClickHouse to utilize the query cache for optimizing subsequent executions of the same query.
Example: With the same query, we have appended the true cache settings by adding SETTINGS use_query_cache = 1 in the query.
clickhouse-dbnode :) SELECT day_of_week_added, count() AS num, avg(days_present) AS avg_days_present FROM ( SELECT added_code.line, added_code.time AS added_day, dateDiff('day', added_code.time, removed_code.time) AS days_present FROM ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS added_code INNER JOIN ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS removed_code USING (path, line) WHERE removed_code.time > added_code.time ) GROUP BY dayOfWeek(added_day) AS day_of_week_added SETTINGS use_query_cache = 1 SELECT day_of_week_added, count() AS num, avg(days_present) AS avg_days_present FROM ( SELECT added_code.line, added_code.time AS added_day, dateDiff('day', added_code.time, removed_code.time) AS days_present FROM ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS added_code INNER JOIN ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS removed_code USING (path, line) WHERE removed_code.time > added_code.time ) GROUP BY dayOfWeek(added_day) AS day_of_week_added SETTINGS use_query_cache = 1 Query id: 48b3e250-5800-49c1-ac01-e22b24761ad0 ┌─day_of_week_added─┬────num─┬───avg_days_present─┐ │ 1 │ 173241 │ 193.07947887624755 │ │ 2 │ 142198 │ 152.979387895751 │ │ 3 │ 163436 │ 136.45409212168678 │ │ 4 │ 257044 │ 121.19575247817494 │ │ 5 │ 204567 │ 141.79689294949821 │ │ 6 │ 62476 │ 202.50060823356168 │ │ 7 │ 71084 │ 219.8482499577964 │ └───────────────────┴────────┴────────────────────┘ 7 rows in set. Elapsed: 8.738 sec. Processed 15.48 million rows, 147.11 MB (1.77 million rows/s., 16.84 MB/s.) clickhouse-dbnode :)
After executing the modified query with the SETTINGS use_query_cache = 1
directive, we can observe that the system.query_cache table displays the queries that are stored in the cache. This indicates that ClickHouse is successfully utilizing the query cache to store and retrieve query results.
clickhouse-dbnode :) SELECT * FROM system.query_cache\G SELECT * FROM system.query_cache Query id: 55f7c6f5-f9f0-4eff-872e-c955b7380e4a Row 1: ────── query: SELECT day_of_week_added, count() AS num, avg(days_present) AS avg_days_present FROM (SELECT added_code.line, added_code.time AS added_day, dateDiff('day', added_code.time, removed_code.time) AS days_present FROM (SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line) AS added_code INNER JOIN (SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line) AS removed_code USING (path, line) WHERE removed_code.time > added_code.time) GROUP BY dayOfWeek(added_day) AS day_of_week_added SETTINGS result_size: 768 stale: 0 shared: 0 compressed: 1 expires_at: 2023-06-30 05:59:05 key_hash: 1259426933936142896 1 row in set. Elapsed: 0.013 sec.
By default, the time-to-live (TTL) for cached queries in ClickHouse is set to 60 seconds. This timeout can be adjusted using the query_cache_ttl
variable either at the query level or at the session level, providing flexibility in managing the cache duration.
Additionally, ClickHouse sets a default maximum cache entry size of 1 MiB. However, you can control this size by utilizing the max_entry_size
variable available at the server-level setting in ClickHouse’s server configuration file. This allows you to customize the cache capacity based on your specific requirements.
Now, if you execute the same query again, you will observe a remarkable improvement in performance. The query is expected to finish with an Elapsed time of just 0.002 seconds, demonstrating the effectiveness of the query cache in significantly reducing query execution time.
clickhouse-dbnode :) SELECT day_of_week_added, count() AS num, avg(days_present) AS avg_days_present FROM ( SELECT added_code.line, added_code.time AS added_day, dateDiff('day', added_code.time, removed_code.time) AS days_present FROM ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS added_code INNER JOIN ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS removed_code USING (path, line) WHERE removed_code.time > added_code.time ) GROUP BY dayOfWeek(added_day) AS day_of_week_added SETTINGS use_query_cache = 1 SELECT day_of_week_added, count() AS num, avg(days_present) AS avg_days_present FROM ( SELECT added_code.line, added_code.time AS added_day, dateDiff('day', added_code.time, removed_code.time) AS days_present FROM ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS added_code INNER JOIN ( SELECT path, line, max(time) AS time FROM git.line_changes WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty')) GROUP BY path, line ) AS removed_code USING (path, line) WHERE removed_code.time > added_code.time ) GROUP BY dayOfWeek(added_day) AS day_of_week_added SETTINGS use_query_cache = 1 Query id: d4bae98b-a2ad-4a6e-b8fa-2205e1072a3c ┌─day_of_week_added─┬────num─┬───avg_days_present─┐ │ 1 │ 173241 │ 193.07947887624755 │ │ 2 │ 142198 │ 152.979387895751 │ │ 3 │ 163436 │ 136.45409212168678 │ │ 4 │ 257044 │ 121.19575247817494 │ │ 5 │ 204567 │ 141.79689294949821 │ │ 6 │ 62476 │ 202.50060823356168 │ │ 7 │ 71084 │ 219.8482499577964 │ └───────────────────┴────────┴────────────────────┘ 7 rows in set. Elapsed: 0.002 sec.
To clear the cache in ClickHouse, you can utilize the following query:
clickhouse-dbnode :) SYSTEM DROP QUERY CACHE SYSTEM DROP QUERY CACHE Query id: 4fdedf4d-d5fc-4b17-94bd-1d7b0dc4cd4e Ok. 0 rows in set. Elapsed: 0.001 sec.
Executing this query will remove all cached entries from the query cache, allowing you to start with a fresh state.