ClickHouse Monitoring: Disk I/O Metrics


Troubleshooting performance issues in ClickHouse often involves looking at disk I/O metrics such as disk queue length, disk reads per second, and disk writes per second. These metrics show whether disk I/O is a bottleneck in your system. This guide covers how to monitor them and what they indicate about ClickHouse performance.

Key Disk I/O Metrics in ClickHouse

1. Monitoring Disk Queue Length

  • Metric Explained: The disk queue length is the number of read and write requests that have been issued to the disk but have not yet completed. A longer queue can indicate a bottleneck.
  • How to Monitor:
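    • Use tools like iostat on Linux (for example, iostat -x 5 prints extended statistics every 5 seconds).
    • Look at the avgqu-sz (average queue size) column, which newer sysstat releases name aqu-sz.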
  • Interpreting the Data:
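    • A consistently high queue length might indicate that the disk is a bottleneck, especially when it coincides with long request wait times (the await columns) and %util close to 100.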
    • SSDs typically handle higher queue lengths better than HDDs.
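
As a rough illustration of where the queue-size figure comes from on Linux, the sketch below computes it from two snapshots of /proc/diskstats. It assumes the standard field layout, in which the 11th counter after the device name is the weighted time spent doing I/O in milliseconds. The function names and the sample values are made up for illustration.

```python
def parse_diskstats(text):
    """Map device name -> list of integer counters from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue  # skip blank or malformed lines
        # parts[0] and parts[1] are major/minor numbers; parts[2] is the device name
        stats[parts[2]] = [int(x) for x in parts[3:]]
    return stats


def avg_queue_length(before, after, device, interval_s):
    """Average queue length for `device` over the sampling interval.

    Counter index 10 (the 11th counter after the device name) is the weighted
    number of milliseconds spent doing I/O. Its delta divided by the elapsed
    milliseconds gives the average number of requests queued or in flight.
    """
    weighted_ms = after[device][10] - before[device][10]
    return weighted_ms / (interval_s * 1000.0)


# Made-up snapshots taken 10 seconds apart (illustrative values only).
sample_t0 = "   8       0 sda 1000 0 80000 500 2000 0 160000 900 0 700 1400"
sample_t1 = "   8       0 sda 1500 0 120000 900 2600 0 208000 1500 0 5700 41400"

before = parse_diskstats(sample_t0)
after = parse_diskstats(sample_t1)
queue = avg_queue_length(before, after, "sda", interval_s=10)
print(f"average queue length: {queue:.1f}")
```

On a real host, the two snapshots would come from reading /proc/diskstats twice with a pause in between. In practice iostat does this bookkeeping for you, so the sketch is mainly useful for understanding what the column means.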

2. Monitoring Average Disk Reads/Second

  • Metric Explained: This measures the number of read operations from disk per second. High values may indicate heavy read load.
  • How to Monitor:
    • iostat -x provides detailed disk I/O stats, including reads per second (r/s).
  • Interpreting the Data:
    • Spikes in reads/sec could be due to heavy querying, insufficient caching, or inefficient queries.
    • Consistently high reads/sec might suggest a need for query optimization or more RAM, so that more of the frequently read data stays in the OS page cache.
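
If you want to collect these numbers in a script rather than read them by eye, one option is to parse the iostat -x output by column name, since the column order differs between sysstat versions. The following is a minimal sketch under that assumption. The function name is hypothetical and the sample text is a made-up, trimmed-down report, not real output.

```python
def parse_iostat_extended(text):
    """Return {device: {column: value}} from `iostat -x` style output."""
    result = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # a blank line ends the current device table
            continue
        # Older sysstat versions print "Device:" with a trailing colon.
        if fields[0].rstrip(":") == "Device":
            columns = fields[1:]
            continue
        if columns and len(fields) == len(columns) + 1:
            values = [float(v) for v in fields[1:]]
            result[fields[0]] = dict(zip(columns, values))
    return result


# Made-up, abbreviated report: real output has more columns.
sample = """\
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.00    0.00    2.00   12.00    0.00   81.00

Device            r/s     w/s     rkB/s     wkB/s  aqu-sz  %util
sda            850.00  120.00  54400.00   7680.00    3.20  92.00
nvme0n1         40.00   15.00   2560.00    960.00    0.05   1.50
"""

stats = parse_iostat_extended(sample)
for device, row in stats.items():
    print(device, "reads/sec:", row["r/s"])
```

Looking columns up by name (row["r/s"]) rather than by position keeps the script working when a sysstat upgrade adds or reorders columns.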

3. Monitoring Average Disk Writes/Second

  • Metric Explained: Indicates the number of write operations to disk per second. It’s crucial for understanding the write load.
  • How to Monitor:
    • Again, iostat -x is useful; look at the writes per second (w/s) column.
  • Interpreting the Data:
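    • High writes/sec can occur during heavy data ingestion, large insertions, background merges, many small inserts that each create a new data part, or updates/deletes (which ClickHouse runs as mutations that rewrite whole data parts).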
    • Persistent high write rates may suggest a need for better disk performance or tuning of the data ingestion process.
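
The difference between a short burst and a persistently high write rate can be made explicit in a small check. The sketch below flags a series of w/s samples only when it stays above a threshold for several consecutive intervals. The function name, the threshold of 3000 w/s, and the run length of 4 are arbitrary example values, not recommendations.

```python
def sustained_high(samples, threshold, min_consecutive):
    """Return True if `samples` contains a run of at least `min_consecutive`
    values strictly above `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical w/s readings, one per sampling interval.
burst = [200, 250, 4000, 300, 220, 210]
steady = [200, 3800, 4100, 3900, 4200, 4000]

print(sustained_high(burst, threshold=3000, min_consecutive=4))   # False
print(sustained_high(steady, threshold=3000, min_consecutive=4))  # True
```

A suitable threshold depends on what the underlying disks can sustain, so it is best derived from a baseline of your own system under normal load.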

General Troubleshooting Steps

  1. Correlate with ClickHouse Workload:
    • Check if high disk I/O correlates with specific ClickHouse operations (like large inserts, merges, or queries).
  2. Optimize Disk Usage:
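    • Ensure that each table's primary key (its ORDER BY clause) and any data skipping indexes match your most common query filters, so that queries read less data from disk.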
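    • Use the OPTIMIZE TABLE command sparingly: it forces an unscheduled merge, which is itself I/O-intensive, and background merges normally consolidate parts on their own.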
    • Consider partitioning tables so that queries filtering on the partition key can skip whole partitions.
  3. Improve Hardware:
    • Upgrade to faster disks (SSDs, especially NVMe, offer significant improvements over HDDs).
    • Implement RAID configurations for better performance and redundancy.
  4. Review ClickHouse Configuration:
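    • Adjust settings like max_bytes_to_merge_at_max_space_in_pool (a MergeTree setting that caps the total size of parts merged into one part) and max_bytes_to_read (a query setting that limits how much data a single query may read) to balance merge and read operations.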
  5. Query Optimization:
    • Optimize queries to reduce unnecessary disk reads.
    • Use ClickHouse’s EXPLAIN syntax to understand query execution plans.
  6. System-Level Tweaks:
    • Adjust OS-level parameters (like vm.swappiness and disk scheduler settings).
    • Ensure the file system is optimized for large files (if applicable).
  7. Regular Monitoring:
    • Continuously monitor disk I/O metrics.
    • Use monitoring tools like Zabbix, Prometheus, or Grafana for real-time analytics.
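
For step 1 above (correlating disk I/O with the ClickHouse workload), the core of the check is an interval overlap: which operations were running while the disk was busy? The sketch below shows that logic on hand-written records. In practice the records could be built from ClickHouse's system.query_log and, if part logging is enabled, system.part_log. The record layout, function name, and timestamps here are hypothetical.

```python
from datetime import datetime, timedelta


def overlapping_events(window_start, window_end, events):
    """Return the events whose [start, end] interval overlaps the window."""
    return [
        e for e in events
        if e["start"] <= window_end and e["end"] >= window_start
    ]


t0 = datetime(2024, 1, 1, 12, 0, 0)

# Hypothetical operations with their start and end times.
events = [
    {"kind": "merge", "start": t0, "end": t0 + timedelta(seconds=90)},
    {"kind": "insert", "start": t0 + timedelta(minutes=5),
     "end": t0 + timedelta(minutes=5, seconds=3)},
    {"kind": "select", "start": t0 + timedelta(seconds=60),
     "end": t0 + timedelta(seconds=75)},
]

# Hypothetical window during which iostat showed a high queue length.
busy_start = t0 + timedelta(seconds=30)
busy_end = t0 + timedelta(seconds=80)

suspects = overlapping_events(busy_start, busy_end, events)
print([e["kind"] for e in suspects])
```

Overlap alone does not prove that an operation caused the I/O pressure, but it narrows down which merges, inserts, or queries are worth a closer look.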


By monitoring and analyzing these disk I/O metrics, you can see how disk performance affects the overall performance of ClickHouse. Combined with the ClickHouse and system-level optimizations above, this can help alleviate bottlenecks and improve database performance.

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv is currently the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker at open source software conferences globally.