7 Powerful ClickHouse Matrices for Troubleshooting Query and IOPS Performance

When a ClickHouse cluster starts feeling sluggish, the hardest part is rarely fixing the problem — it’s finding it. Slow queries, saturated disks, and unpredictable latency all leave fingerprints, but those fingerprints are scattered across dozens of system tables. This is where ClickHouse Matrices come in. By treating ClickHouse’s system.* tables as a set of diagnostic matrices, you can correlate query behavior with I/O pressure and turn a vague “the database is slow” complaint into a precise, evidence-backed root cause.

In this guide we’ll walk through the practical ClickHouse Matrices that engineers actually reach for when troubleshooting query and IOPS performance. Everything below is based on real production patterns, and every query is ready to paste into clickhouse-client. If you’re new to the underlying tables, the official ClickHouse system tables documentation is a useful companion reference.

Table of Contents

What Are ClickHouse Matrices?

A “matrix” here isn’t a special ClickHouse object — it’s a way of thinking. Each ClickHouse Matrix is a focused view built from one or more system tables that maps a symptom (high latency, disk saturation, memory spikes) against the dimensions that explain it (query, user, table, part, thread). When you line these matrices up side by side, performance regressions that looked random suddenly have a clear story.

The most useful ClickHouse Matrices for performance work fall into two buckets: query matrices, which explain where CPU and memory time goes, and I/O matrices, which explain where disk and IOPS pressure comes from. For a broader tuning context, our guide on ClickHouse performance optimization pairs well with what follows. Let’s start with queries.

Building a Query Performance Matrix from system.query_log

The single richest source for query troubleshooting is system.query_log. It records every executed query along with duration, memory usage, rows read, and bytes read. The following ClickHouse Matrix surfaces your heaviest recent queries by read volume and duration.

SELECT
    query_duration_ms,
    formatReadableSize(read_bytes)      AS read_data,
    formatReadableQuantity(read_rows)   AS rows_read,
    formatReadableSize(memory_usage)    AS mem_used,
    user,
    substring(query, 1, 80)             AS query_preview
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY read_bytes DESC
LIMIT 20;

Read this matrix top-down. A query that reads hundreds of gigabytes to return a handful of rows is almost always missing a good primary-key filter or scanning parts it shouldn’t. High memory_usage next to modest read_rows usually points to a heavy GROUP BY or a join that spills.

Adding a Time Dimension

Raw totals hide patterns. To see whether latency is creeping up over time, pivot the same data into an hourly matrix:

SELECT
    toStartOfHour(event_time)           AS hour,
    count()                             AS queries,
    round(avg(query_duration_ms), 1)    AS avg_ms,
    round(quantile(0.99)(query_duration_ms), 1) AS p99_ms,
    formatReadableSize(sum(read_bytes)) AS total_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;

The p99 column is the one to watch. If average latency stays flat while p99 climbs, you have a tail-latency problem — often caused by I/O contention rather than the queries themselves. That’s our cue to switch to the IOPS matrices.

The IOPS Matrix: Where Disk Pressure Really Comes From

ClickHouse exposes I/O behavior through system.asynchronous_metrics, system.metrics, and the profile events inside system.query_log. Combining them gives you an IOPS matrix that ties disk activity back to the workload driving it.

Start with a live snapshot of read and write throughput and outstanding requests:

SELECT
    metric,
    value
FROM system.asynchronous_metrics
WHERE metric LIKE '%DiskRead%'
   OR metric LIKE '%DiskWrite%'
   OR metric LIKE '%IOPS%'
ORDER BY metric;

These counters tell you how much I/O is happening, but not who caused it. To close that gap, mine the profile events recorded per query. This is arguably the most valuable ClickHouse Matrix for IOPS troubleshooting because it attributes physical reads directly to the queries responsible.

SELECT
    substring(query, 1, 60)                                  AS query_preview,
    ProfileEvents['OSReadBytes']                             AS os_read_bytes,
    ProfileEvents['OSReadChars']                             AS logical_read,
    ProfileEvents['ReadBufferFromFileDescriptorReadBytes']  AS file_read_bytes,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
  AND ProfileEvents['OSReadBytes'] > 0
ORDER BY os_read_bytes DESC
LIMIT 15;

When OSReadBytes (actual physical disk reads) is high relative to OSReadChars (logical reads), your page cache isn’t helping and every query is hitting the disk. That’s a classic sign of an undersized cache, cold data, or a working set that no longer fits in RAM.

Correlating Merges with IOPS Spikes

Background merges are a frequent and easily-missed source of IOPS. A merge storm can starve user queries of disk bandwidth even when the queries themselves are efficient. This ClickHouse Matrix shows active merges and the volume of data they’re moving:

SELECT
    database,
    table,
    round(elapsed, 1)                     AS elapsed_s,
    round(progress * 100, 1)              AS pct_done,
    formatReadableSize(bytes_read_uncompressed)  AS read_size,
    formatReadableSize(memory_usage)      AS mem,
    num_parts
FROM system.merges
ORDER BY bytes_read_uncompressed DESC;

If system.merges is busy at exactly the moments your p99 latency spikes, you’ve found your culprit. The fix is usually workload-shaping: tuning background_pool_size, revisiting your partitioning scheme, or reducing insert frequency so fewer small parts need merging.

Bringing the Matrices Together

The real power of ClickHouse Matrices shows up when you read them as a sequence rather than in isolation. A healthy troubleshooting flow that chains these ClickHouse Matrices together looks like this: the query matrix tells you which queries are slow, the time matrix tells you when latency degrades, the IOPS matrix tells you whether disk is the bottleneck, and the merges matrix tells you what background work is competing for that disk.

Here’s a compact query that overlays parts, disk usage, and compression per table — a “storage matrix” that often explains why a table is I/O-heavy in the first place:

SELECT
    database,
    table,
    count()                                    AS parts,
    formatReadableSize(sum(bytes_on_disk))     AS on_disk,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(bytes_on_disk), 2) AS compression_ratio,
    sum(rows)                                  AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;

A table with thousands of active parts is a red flag: more parts mean more file handles, more seeks, and more merge pressure — all of which inflate IOPS. A poor compression ratio, meanwhile, means you’re reading far more bytes off disk than you need to. The system.parts documentation lists every column you can pull into this matrix.

Practical Tips for Reading ClickHouse Matrices

A few habits make these ClickHouse Matrices far more effective in day-to-day troubleshooting. Always bound your queries with an event_time filter so you’re looking at the incident window, not all of history. Keep query_log and metric_log enabled with a sane retention TTL so the data is there when you need it. And whenever you spot a suspicious query, pull its full text and ProfileEvents map — the answer is almost always hiding in the profile events.

Finally, remember that no single matrix is conclusive. Query duration without I/O context can mislead you into rewriting a query that was actually starved by merges. Reading the matrices together is what turns guesswork into diagnosis.

A Real-World Troubleshooting Walkthrough with ClickHouse Matrices

To see how these ClickHouse Matrices fit together in practice, picture a common Monday-morning incident: a dashboard team reports that a report that used to load in two seconds now takes fifteen. Rather than guessing, you work the matrices in order. You open the query matrix from system.query_log, filter to the last hour, and immediately spot a query reading 40 GB where it used to read 2 GB.

Next you pull the hourly time matrix and confirm the p99 latency started climbing at 08:15, not gradually but as a sharp step. That step change is the tell-tale sign of a background process rather than a slow rollout of bad queries. So you jump to the merges matrix and find a large merge on the same table that kicked off at 08:14. The pieces line up: an oversized insert batch created hundreds of small parts, ClickHouse scheduled a heavy merge, and that merge is now competing with your report for disk IOPS.

The storage matrix from system.parts confirms the root cause — the table has 3,400 active parts against a healthy target of a few dozen. The fix isn’t to rewrite the report at all; it’s to batch inserts more aggressively and let the part count fall. Without these ClickHouse Matrices, that same investigation could easily have burned an afternoon chasing the wrong query.

Automating Your ClickHouse Matrices

Once you trust a set of matrices, the natural next step is to stop running them by hand. Most teams wire the same queries into Grafana panels or a lightweight cron job that writes results into a dedicated observability table. The advantage is history: when an incident happens, you can look back at how the matrices trended in the minutes before the spike instead of only seeing the current snapshot.

A good starting point is to schedule the query, IOPS, and merges matrices at a one-minute resolution and retain a couple of weeks of data. That retention window is usually enough to catch weekly batch jobs and month-end reporting surges, which are two of the most common hidden sources of IOPS pressure that a single live query would never reveal.

Conclusion

ClickHouse Matrices give you a repeatable, data-driven method for troubleshooting query and IOPS performance instead of chasing symptoms. By correlating system.query_log, system.asynchronous_metrics, system.merges, and system.parts, you can pinpoint exactly which queries hurt, when they hurt, and what disk activity is behind it. Turn these ClickHouse Matrices into a small dashboard or saved snippet set, and the next time someone says the cluster is slow, you’ll have an answer in minutes rather than hours.

If you want hands-on help tuning your deployment, our ClickHouse consulting team at ChistaDATA works on exactly these performance problems every day.

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services