Monitoring Query Latency due to Wait and Latch Events in ClickHouse

Introduction

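In ClickHouse, you can monitor query latency and lock or wait activity using the system.query_log and system.metrics tables: system.query_log records per-query timings and counters, while system.metrics exposes current server-wide values. Here is an example query that reports query latency together with wait-related counters: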

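SELECT
    query_id,
    query_start_time,
    round(query_duration_ms / 1000, 2) AS query_duration_sec,
    -- ProfileEvents is a Map(String, UInt64); a missing key reads as 0
    round(ProfileEvents['DiskReadElapsedMicroseconds'] / 1e6, 2) AS disk_read_sec,
    round(ProfileEvents['OSCPUVirtualTimeMicroseconds'] / 1e6, 2) AS cpu_time_sec,
    -- greatest(..., 1) avoids dividing by zero for sub-millisecond queries
    round(result_rows / greatest(query_duration_ms, 1) * 1000, 2) AS rows_per_sec,
    round(result_bytes / greatest(query_duration_ms, 1) * 1000, 2) AS bytes_per_sec,
    round(ProfileEvents['OSIOWaitMicroseconds'] / 1e6, 2) AS io_wait_sec,
    ProfileEvents['RWLockReadersWaitMilliseconds']
        + ProfileEvents['RWLockWritersWaitMilliseconds'] AS rwlock_wait_ms
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_start_time DESC
LIMIT 100;

This query selects query-related columns from the system.query_log table: the query ID, start time, total duration, disk read time, CPU time, result rows and bytes per second, and the time spent waiting on I/O and on read/write locks. system.query_log has no dedicated waits or latches columns; wait-related counters are stored in the ProfileEvents column, which is a Map(String, UInt64) in recent ClickHouse releases (older releases expose ProfileEvents.Names and ProfileEvents.Values arrays instead). The available event names vary by version, and counters such as OSIOWaitMicroseconds are summed across all threads of a query, so they can exceed its wall-clock duration.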
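
Per-query wait counters live in the ProfileEvents column of system.query_log. As a minimal sketch of how they can be used to single out slow queries, the following variant keeps only queries above a duration threshold and shows how much wait time each one accumulated. The 1000 ms threshold and the 80-character query preview are arbitrary assumptions, and the event names are assumed to exist in your ClickHouse version (a missing key simply reads as 0).

```sql
-- Sketch: slowest finished queries with their accumulated wait time.
-- Assumptions: the 1000 ms threshold is arbitrary; event names may differ by version.
SELECT
    query_id,
    query_duration_ms,
    ProfileEvents['OSIOWaitMicroseconds']  AS io_wait_us,   -- time blocked on I/O
    ProfileEvents['OSCPUWaitMicroseconds'] AS cpu_wait_us,  -- time runnable but not scheduled
    substring(query, 1, 80) AS query_preview
FROM system.query_log
WHERE type = 'QueryFinish'
    AND query_duration_ms > 1000
ORDER BY query_duration_ms DESC
LIMIT 20;
```

Because the wait counters are summed across threads, they are best compared between queries rather than read as an exact share of the wall-clock duration.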

To get more detailed information about the waits for a particular query, expand the ProfileEvents map into one row per event. The system.metrics table cannot be joined for this purpose: it holds only current server-wide values in its metric, value, and description columns and has no query ID. Here is an example query that lists the lock- and wait-related events recorded for a single query:

SELECT
    query_id,
    event_name,
    event_value,
    round(query_duration_ms / 1000, 2) AS duration_sec,
    round(event_value / greatest(query_duration_ms, 1) * 1000, 2) AS rate_per_sec
FROM system.query_log
ARRAY JOIN
    mapKeys(ProfileEvents) AS event_name,
    mapValues(ProfileEvents) AS event_value
WHERE query_id = 'YOUR_QUERY_ID_HERE'
    AND type = 'QueryFinish'
    -- parentheses keep the OR from bypassing the query_id filter
    AND (event_name LIKE '%Wait%' OR event_name LIKE '%Lock%')
ORDER BY event_value DESC;

This query reads the ProfileEvents map of the matching system.query_log row and uses ARRAY JOIN to produce one row per event. It filters by a particular query ID and by event names containing ‘Wait’ or ‘Lock’. The results include the event name, its value (a count or an elapsed time, depending on the event), the query duration, and the value's rate per second of query time.
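
For a server-wide view rather than a per-query one, system.metrics can be queried on its own. The following is a minimal sketch that lists whatever lock- and wait-related gauges the running server exposes; it filters by name pattern instead of assuming specific metric names, since those vary between ClickHouse versions.

```sql
-- Sketch: current server-wide lock and wait gauges.
-- system.metrics has the columns metric, value, and description.
SELECT
    metric,
    value,
    description
FROM system.metrics
WHERE metric LIKE '%Lock%'
   OR metric LIKE '%Wait%'
ORDER BY metric;
```

These values are instantaneous, so a single reading only shows contention that is happening right now; sampling the query periodically gives a more useful picture.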

Conclusion

These queries can help identify slow queries and the lock and wait events that contribute to their latency. Note that reading system.query_log can itself be resource-intensive, because the table grows quickly on busy systems, so restrict these queries to a limited time range and run them sparingly.
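
One way to keep the monitoring cost down is to bound the scan by time and aggregate by query pattern. The sketch below assumes the default system.query_log layout, where filtering on event_date and event_time limits the data read; the one-hour window, the 0.95 quantile, and the chosen wait counter are illustrative assumptions.

```sql
-- Sketch: slowest query patterns over the last hour, with a bounded scan.
-- Assumptions: default query_log layout; window and quantile are arbitrary.
SELECT
    normalized_query_hash,
    count() AS executions,
    round(quantile(0.95)(query_duration_ms) / 1000, 2) AS p95_duration_sec,
    sum(ProfileEvents['OSIOWaitMicroseconds']) AS total_io_wait_us,
    any(substring(query, 1, 80)) AS sample_query
FROM system.query_log
WHERE event_date >= today() - 1              -- coarse filter, also covers a window crossing midnight
    AND event_time >= now() - INTERVAL 1 HOUR
    AND type = 'QueryFinish'
GROUP BY normalized_query_hash
ORDER BY p95_duration_sec DESC
LIMIT 20;
```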

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv is currently the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker at open source software conferences globally.