Disk and Memory Alerting for ClickHouse: Signals That Catch Outages Early

On a busy ClickHouse cluster, disk and memory alerting rarely gets attention until something starts misbehaving. Disk fill rate, memory pressure, and merge backlog are the three signals that almost always precede disk- and memory-related outages. That single sentence hides a fair amount of detail, and the rest of this piece pulls those details apart so the levers and trade-offs are visible.

The most common version of the problem is straightforward: alerting is set on the symptoms (query failures) instead of the early signals, so incidents fire after the cluster is already in trouble. That kind of issue rarely traces back to a single setting. It is usually a combination of schema, sort key, and a few small misconfigurations stacking on top of each other, and the path to fixing it starts with understanding the mechanics.

For teams running ClickHouse in production, the cost of getting disk and memory alerting wrong is felt in tail latency, in runaway memory grants, and in the hours operators spend chasing intermittent issues. Getting it right takes some up-front investment in measurement and a willingness to revisit defaults when the workload changes.

ClickHouse Disk and Memory Alerting

How it actually works

Before changing any setting, it helps to walk through what ClickHouse is actually doing under the surface. The behaviour described here is not specific to one release; the broad shape has held across recent versions, and the operational implications are the same on self-managed clusters and on managed offerings.

  • Disk usage grows with ingestion and TTL behaviour; fill rate is more informative than instantaneous use.
  • Memory pressure shows in MemoryTracking metrics, in memory-limit errors counted in system.events, and in queries tripping their per-query memory caps.
  • Merge backlog appears in the system.merges queue and in active part counts per partition.
  • Each of these has a slow form (capacity outgrowth) and a fast form (a query that misbehaves).
  • Alerting catches the slow forms; runbooks handle the fast forms.

Each of those steps has its own characteristic cost, and the slow ones tend to be the ones that show up in p95 and p99 latency. That is why the rest of this piece focuses on the levers that actually move those percentiles, rather than on micro-optimisations that look good in synthetic tests but rarely survive contact with production workloads.
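
The merge-backlog signal in particular is cheap to check directly. A minimal query against system.parts, counting only active parts per partition:

-- Active part counts per partition: the merge-backlog signal described above.
SELECT database, table, partition_id, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;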

Settings that actually matter

The configuration surface in ClickHouse is broad, and most of it does not need to be touched in a typical deployment. The settings below are the ones worth understanding because they shape behaviour directly under load. Defaults work for small workloads; the right values for production are usually different.

Setting                          Suggested value       Notes
max_server_memory_usage          80% of host RAM       Hard cap on total server memory use.
max_memory_usage                 (workload-dependent)  Per-query memory cap.
max_partitions_per_insert_block  (workload-dependent)  Limits partition fan-out per insert block.
min_free_disk_space              1073741824 (1 GiB)    Refuse to start merges below this free space.
max_concurrent_queries           100                   Overall concurrency cap.

None of these are universal. The right number on a node with sixty-four cores and NVMe is not the right number on a smaller VM with attached storage, and the right number for an analytics workload differs from a streaming ingestion workload. The values above are starting points, not endpoints.
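
Before changing anything, it is worth confirming what the cluster is actually running with. The per-query settings can be read from system.settings; the server-level caps (max_server_memory_usage, max_concurrent_queries) live in the server configuration and, on recent releases, are also visible in system.server_settings. A quick check of the per-query values:

-- Current values of the per-query settings discussed above, and whether they differ from defaults.
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_memory_usage', 'max_partitions_per_insert_block');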

ClickHouse SQL examples

The SQL below shows the pattern in concrete terms. It is meant to be read alongside the explanation, not copied verbatim into a production script.

-- Disk usage trend over the past day.
-- Disk metrics are recorded in system.asynchronous_metric_log; the exact metric name
-- varies by version and disk layout (e.g. FilesystemMainPathUsedBytes, or DiskUsed_default per disk).
SELECT toStartOfHour(event_time) AS h,
       max(value) AS bytes_used
FROM system.asynchronous_metric_log
WHERE event_time > now() - INTERVAL 24 HOUR
  AND metric = 'FilesystemMainPathUsedBytes'
GROUP BY h ORDER BY h;

-- Memory pressure: current tracked server memory, plus cumulative memory-related
-- event counters (exact event names vary across releases).
SELECT metric, value FROM system.metrics WHERE metric = 'MemoryTracking';
SELECT event, value FROM system.events WHERE event ILIKE '%memory%';


Tuning approach that works in practice

The list below is the order most operators converge on when tuning disk and memory alerting for ClickHouse. It is not a recipe; the right answer depends on the workload. But it is a defensible sequence: each step is cheap to verify, and each one has a measurable effect when the change matters.

  1. Alert on disk fill rate over a window, not instantaneous usage.
  2. Set per-query memory caps so one bad query cannot OOM the server.
  3. Track active parts per partition; alert at a fraction of parts_to_throw_insert.
  4. Page on growing merge queue depth, not just on insert errors.

Each change should be measured against the metrics that matter — usually p95 latency at a target throughput, plus query log statistics and CPU behaviour. Changes that do not move those numbers are not actually changes; they are configuration churn.
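
Steps 3 and 4 translate directly into system-table queries. A sketch of both checks:

-- Step 3: the part-count thresholds to alert against (per-table settings may override these defaults).
SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');

-- Step 4: merge queue depth and bytes still to be merged.
SELECT count() AS running_merges,
       sum(total_size_bytes_compressed) AS bytes_in_flight
FROM system.merges;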

What to look at first

When disk or memory starts trending the wrong way, the first move is usually a handful of system table queries. The objects below are the ones that produce useful output fast, without needing a full monitoring pipeline to interpret.

Object          What it shows
system.metrics  Live counters such as Query, Merge, BackgroundPoolTask, and many others.
system.events   Cumulative event counters: SelectQuery, InsertQuery, FailedQuery, MarkCacheHits, etc.
system.parts    Active and inactive data parts per table, with row counts, bytes on disk, and merge state.
system.merges   Currently running merges with progress, source parts, and total bytes to merge.

Guardrails worth setting up

Tuning without monitoring is guesswork. The signals listed below are the ones that catch problems early enough to act on, and most production clusters end up alerting on a similar shortlist whether they planned to or not.

  • Alert on time-to-full < 24 hours.
  • Alert on per-query OOM events; a query_log check is sketched after this list.
  • Alert on merge queue depth above baseline.
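
A minimal sketch of that per-query OOM check, assuming the standard query_log columns (type and exception) on a reasonably recent release:

-- Queries that failed on a memory limit in the last hour.
SELECT count() AS memory_failures
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'ExceptionWhileProcessing'
  AND exception ILIKE '%memory limit%';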

Pitfalls that show up repeatedly

The same handful of mistakes appears across cluster after cluster. Most of them are easier to avoid than to fix, and the cost of getting them wrong tends to compound — what starts as a small misconfiguration becomes a real incident weeks later when the workload grows.

  • Alerting only on disk full; the server is often unrecoverable by then.
  • Setting max_memory_usage too low and seeing legitimate queries fail.
  • Forgetting that merges hold disk for the duration; transient growth is normal.

None of those are exotic. They show up in code reviews, in postmortems, and occasionally in vendor support tickets, and the operational habit of catching them early is worth more than any single configuration change.

Frequently asked questions

A handful of questions come up every time this topic is discussed. The answers below are the ones that hold up across most production deployments; the exceptions are usually visible in the metrics.

Should I use cgroups for memory?

ClickHouse honours cgroup limits; combined with max_server_memory_usage it gives a stable bound.

Is swap useful?

Almost never. Swap on a database server hides problems and turns slow into very slow.

How do I prevent runaway queries?

Per-query memory caps and per-user concurrency limits.
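
One hedged way to express the memory cap with SQL-managed access control; the profile and role names below are illustrative, and the same limit can be set in users.xml profiles instead:

-- Cap per-query memory for a group of users via a settings profile.
CREATE SETTINGS PROFILE IF NOT EXISTS capped_queries
SETTINGS max_memory_usage = 8000000000  -- 8 GB, illustrative
TO analytics_users;                     -- hypothetical role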

How do I know I am about to run out of disk?

Project the slope of the past day or week; instantaneous percent is too late.
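
A rough projection of that slope can be made straight from the metric history. A sketch, assuming the FilesystemMainPath* asynchronous metrics are available on this version (substitute the per-disk DiskUsed_* metrics otherwise):

-- Days until the main data path fills, extrapolating the last 24 hours of growth.
WITH
    (SELECT max(value) - min(value)
     FROM system.asynchronous_metric_log
     WHERE metric = 'FilesystemMainPathUsedBytes'
       AND event_time > now() - INTERVAL 24 HOUR) AS growth_24h,
    (SELECT value FROM system.asynchronous_metrics
     WHERE metric = 'FilesystemMainPathAvailableBytes') AS free_bytes
SELECT round(free_bytes / nullIf(growth_24h, 0), 1) AS days_until_full;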

Should I turn off TTL when disks are tight?

No. That makes it worse. Loosen schedules but keep TTL on.
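
"Loosen schedules" usually means letting TTL merges run less often rather than disabling them. A hedged example using the merge_with_ttl_timeout MergeTree setting; the table name is illustrative:

-- Run TTL-driven merges on this table at most once a day instead of the default cadence.
ALTER TABLE events_local MODIFY SETTING merge_with_ttl_timeout = 86400;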

Hardware specifications change as nodes are replaced and infrastructure is upgraded. A configuration that fit a previous generation of disks or CPUs may underperform on the next, and revisiting tuning decisions when hardware changes is part of routine operations rather than an exceptional event.

Configuration changes that are documented and reversible are easier to live with than ones that are not. Even small changes are worth recording with the date, the reason, and the before-and-after metric, because the same change is likely to come up again in a future incident or capacity review.

ClickHouse rarely operates in isolation. It sits inside a larger data platform with its own monitoring, deployment, and incident workflows, and the engine’s performance characteristics interact with those workflows in ways that are easy to miss. Treating ClickHouse as part of a system, rather than a standalone service, generally produces better outcomes.

The query log is one of the most useful diagnostic surfaces in ClickHouse, and the retention policy applied to it determines how far back a team can look during a postmortem. A few weeks of retention is the minimum that supports root-cause analysis on slow-developing problems, and many teams hold it for longer.
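
However retention is configured, how much history a node actually holds is a one-line check:

-- How far back does the query log go on this node?
SELECT min(event_time) AS oldest, max(event_time) AS newest
FROM system.query_log;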

Part count is a quiet failure mode: the cluster keeps working as parts accumulate, and then suddenly latency spikes or a merge thread saturates. Watching part count per partition and tying it to ingestion rate is a small habit that catches the problem long before it becomes an incident.

Every new lever pulled on a ClickHouse cluster adds operational surface area. There is real value in keeping the configuration surface small — fewer custom values mean fewer things to remember during incident response, and fewer things that surprise the next operator who inherits the cluster.

Teams that want a deeper look at disk and memory alerting for ClickHouse can review ChistaDATA’s observability articles, or contact ChistaDATA about ClickHouse support for production engagements.

Putting it together

ClickHouse rewards operators who understand the mechanics rather than ones who memorise tuning recipes. The settings that matter for disk and memory alerting are the ones that line up with how the workload actually uses the engine, and that match comes from looking at system tables, query plans, and the schema together rather than at any one of them in isolation.

The work is rarely finished, but it is also not as mysterious as it sometimes feels: a small number of mechanisms drive most of the behaviour, and the levers that matter are mostly the ones described above.


About ChistaDATA Inc.
We are a full-stack ClickHouse infrastructure operations consulting, support, and managed services provider with core expertise in performance, scalability, and data SRE. Based out of California, our consulting and support engineering team operates from San Francisco, Vancouver, London, Germany, Russia, Ukraine, Australia, Singapore, and India to deliver 24*7 enterprise-class consultative support and managed services. We work closely with some of the largest, planet-scale internet properties, including PayPal, Garmin, the Honda Cars IoT project, Viacom, National Geographic, Nike, Morgan Stanley, American Express Travel, VISA, Netflix, PRADA, Blue Dart, Carlsberg, Sony, and Unilever.