Introduction
What is the Data Load distribution happening to ClickHouse Server? I ask this question often to ClickHouse infrastructure owners. This matrix is used for trending the load and throughput. This will help me identify the load on ClickHouse Server from peak to off-peak hours.
As a Performance Engineer, I spent long hours trending the performance of ClickHouse Infrastructure Operations across latency and throughput for troubleshooting performance more evidence-based than speculative. A detailed understanding of READ – WRITE distribution on ClickHouse Server also helps efficiently compute the usage of available system resources optimally. Both oversized and undersized Infrastructure is a different issue to troubleshoot altogether.
Observing the Load on ClickHouse Server
- Total Rows Inserted to all the Tables
- Total number of INSERT queries
- Total Data Inserted in Bytes
- Time spent(WRITE Latency) waiting for WRITE SYSCALL
- Total number of SELECT queries
- Total Rows Selected
- Time spent(READ Latency) waiting for READ SYSCALL
- Total time spent on query operations – Query Operations Latency Matrix
- Query operations throughput
Script to measure the load on ClickHouse Server
select * from (select * from (select toDateTime(system.metric_log.event_time) as "Event Date Time", sum(system.metric_log.ProfileEvent_InsertedRows) as "Number of rows INSERTed to all tables.", sum(system.metric_log.ProfileEvent_DiskWriteElapsedMicroseconds) as "Total time spent waiting for write syscall. This include writes to page cache.", sum(system.metric_log.ProfileEvent_InsertQuery) as "Total number of INSERT queries.", sum(system.metric_log.ProfileEvent_InsertedBytes) as "Total data INSERTed in Bytes", sum(system.metric_log.ProfileEvent_QueryTimeMicroseconds) as "Total time of all queries", sum(system.metric_log.ProfileEvent_SelectQuery) as "Total number of SELECT queries", sum(system.metric_log.ProfileEvent_SelectQueryTimeMicroseconds) as "Total time of SELECT queries.", sum(system.metric_log.ProfileEvent_SelectedRows) as "Total number of rows SELECTed", sum(system.metric_log.ProfileEvent_DiskReadElapsedMicroseconds) as "Total time spent waiting for read syscall. This include reads from page cache.", sum(system.metric_log.ProfileEvent_QueryTimeMicroseconds) as "Total time of all queries." from system.metric_log group by system.metric_log.event_time,system.metric_log.ProfileEvent_InsertedRows, system.metric_log.ProfileEvent_DiskWriteElapsedMicroseconds, system.metric_log.ProfileEvent_InsertQuery,system.metric_log.ProfileEvent_InsertedBytes, system.metric_log.ProfileEvent_QueryTimeMicroseconds, system.metric_log.ProfileEvent_SelectQuery, system.metric_log.ProfileEvent_SelectQueryTimeMicroseconds, system.metric_log.ProfileEvent_SelectedRows, system.metric_log.ProfileEvent_DiskReadElapsedMicroseconds, system.metric_log.ProfileEvent_QueryTimeMicroseconds order by system.metric_log.ProfileEvent_QueryTimeMicroseconds desc ));
Load on ClickHouse Server Metrics Explained
ID. | Metric Name | Description |
---|---|---|
1 | metric_log.event_time | Event Date Time |
2 | metric_log.ProfileEvent_InsertedRows | Number of rows INSERTed to all tables |
3 | metric_log.ProfileEvent_DiskWriteElapsedMicroseconds | Total time spent waiting for write syscall. This include writes to page cache |
4 | metric_log.ProfileEvent_InsertQuery | Total number of INSERT queries |
5 | metric_log.ProfileEvent_InsertedBytes | Total data INSERTed in Bytes |
6 | metric_log.ProfileEvent_QueryTimeMicroseconds | Total time of all queries |
7 | metric_log.ProfileEvent_SelectQuery | Total number of SELECT queries |
8 | metric_log.ProfileEvent_SelectQueryTimeMicroseconds | Total time of SELECT queries |
9 | metric_log.ProfileEvent_SelectedRows | Total number of rows SELECTed |
10 | metric_log.ProfileEvent_DiskReadElapsedMicroseconds | Total time spent waiting for read syscall. This include reads from page cache |
11 | metric_log.ProfileEvent_QueryTimeMicroseconds | Total time of all queries |
Conclusion
It is essential to closely monitor and troubleshoot ClickHouse server load to be able to optimize the cluster for maximum performance. This guide will give a high-level view on the critical metrics to keep an eye on as you engage in this critical process to achieve high performance ClickHouse operations.
To know more about monitoring in ClickHouse, do read the following articles:
- ClickHouse Monitoring: Python Script to Monitor Cluster Performance in Real-Time
- ClickHouse Monitoring: Server Load and Performance
- Monitoring ClickHouse Server CPU Usage