Monitoring ClickHouse Server Load

Introduction

What is the Data Load distribution happening to ClickHouse Server? I ask this question often to ClickHouse infrastructure owners. This matrix is used for trending the load and throughput. This will help me identify the load on ClickHouse Server from peak to off-peak hours.

As a Performance Engineer, I spent long hours trending the performance of ClickHouse Infrastructure Operations across latency and throughput for troubleshooting performance more evidence-based than speculative. A detailed understanding of READ – WRITE distribution on ClickHouse Server also helps efficiently compute the usage of available system resources optimally. Both oversized and undersized Infrastructure is a different issue to troubleshoot altogether.

Observing the Load on ClickHouse Server

  • Total Rows Inserted to all the Tables
  • Total number of INSERT queries
  • Total Data Inserted in Bytes
  • Time spent(WRITE Latency) waiting for WRITE SYSCALL
  • Total number of SELECT queries
  • Total Rows Selected
  • Time spent(READ Latency) waiting for READ SYSCALL
  • Total time spent on query operations – Query Operations Latency Matrix
  • Query operations throughput

Script to measure the load on ClickHouse Server

select *
from (select *
from (select toDateTime(system.metric_log.event_time) as "Event Date Time",
sum(system.metric_log.ProfileEvent_InsertedRows) as "Number of rows INSERTed to all tables.",
sum(system.metric_log.ProfileEvent_DiskWriteElapsedMicroseconds) as "Total time spent waiting for write syscall. This include writes to page cache.",
sum(system.metric_log.ProfileEvent_InsertQuery) as "Total number of INSERT queries.",
sum(system.metric_log.ProfileEvent_InsertedBytes) as "Total data INSERTed in Bytes",
sum(system.metric_log.ProfileEvent_QueryTimeMicroseconds) as "Total time of all queries",
sum(system.metric_log.ProfileEvent_SelectQuery) as "Total number of SELECT queries",
sum(system.metric_log.ProfileEvent_SelectQueryTimeMicroseconds) as "Total time of SELECT queries.",
sum(system.metric_log.ProfileEvent_SelectedRows) as "Total number of rows SELECTed",
sum(system.metric_log.ProfileEvent_DiskReadElapsedMicroseconds) as "Total time spent waiting for read syscall. This include reads from page cache.",
sum(system.metric_log.ProfileEvent_QueryTimeMicroseconds) as "Total time of all queries."
from system.metric_log
group by
system.metric_log.event_time,system.metric_log.ProfileEvent_InsertedRows,
system.metric_log.ProfileEvent_DiskWriteElapsedMicroseconds,
system.metric_log.ProfileEvent_InsertQuery,system.metric_log.ProfileEvent_InsertedBytes,
system.metric_log.ProfileEvent_QueryTimeMicroseconds,
system.metric_log.ProfileEvent_SelectQuery,
system.metric_log.ProfileEvent_SelectQueryTimeMicroseconds,
system.metric_log.ProfileEvent_SelectedRows,
system.metric_log.ProfileEvent_DiskReadElapsedMicroseconds,
system.metric_log.ProfileEvent_QueryTimeMicroseconds
order by
system.metric_log.ProfileEvent_QueryTimeMicroseconds desc
));

Load on ClickHouse Server Metrics Explained

ID.Metric NameDescription
1metric_log.event_timeEvent Date Time
2 metric_log.ProfileEvent_InsertedRowsNumber of rows INSERTed to all tables
3metric_log.ProfileEvent_DiskWriteElapsedMicrosecondsTotal time spent waiting for write syscall. This include writes to page cache
4metric_log.ProfileEvent_InsertQueryTotal number of INSERT queries
5metric_log.ProfileEvent_InsertedBytesTotal data INSERTed in Bytes
6metric_log.ProfileEvent_QueryTimeMicrosecondsTotal time of all queries
7metric_log.ProfileEvent_SelectQueryTotal number of SELECT queries
8metric_log.ProfileEvent_SelectQueryTimeMicrosecondsTotal time of SELECT queries
9metric_log.ProfileEvent_SelectedRowsTotal number of rows SELECTed
10metric_log.ProfileEvent_DiskReadElapsedMicrosecondsTotal time spent waiting for read syscall. This include reads from page cache
11metric_log.ProfileEvent_QueryTimeMicrosecondsTotal time of all queries

Conclusion

It is essential to closely monitor and troubleshoot ClickHouse server load to be able to optimize the cluster for maximum performance. This guide will give a high-level view on the critical metrics to keep an eye on as you engage in this critical process to achieve high performance ClickHouse operations.

To know more about monitoring in ClickHouse, do read the following articles:

 
About Shiv Iyer 218 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.