On a busy ClickHouse cluster, retention and TTL policies for telemetry data rarely get attention until something starts misbehaving. Telemetry retention on ClickHouse is best expressed with TTL clauses that drop, move, recompress, or roll up data, combined with daily partitions for cheap lifecycle operations. That single sentence hides a fair amount of detail, and the rest of this piece pulls those details apart so the levers and trade-offs are visible.
The most common version of the problem is straightforward: retention policies set in operational policy documents do not map to ClickHouse settings, so storage either grows unbounded or data is lost prematurely. That kind of issue rarely traces back to a single setting. It is usually a combination of schema, sort key, and a few small misconfigurations stacking on top of each other, and the path to fixing it starts with understanding the mechanics.
For teams running ClickHouse in production, the cost of getting retention and TTL policies for telemetry data on ClickHouse wrong is felt in tail latency, in runaway memory grants, and in the hours operators spend chasing intermittent issues. Getting it right takes some up-front investment in measurement and a willingness to revisit defaults when the workload changes.
How it actually works
Before changing any setting, it helps to walk through what ClickHouse is actually doing under the surface. The behaviour described here is not specific to one release; the broad shape has held across recent versions, and the operational implications are the same on self-managed clusters and on managed offerings.
- TTL clauses run during background merges and decide what happens to data past its deadline.
- ttl_only_drop_parts drops a whole part once every row in it has expired; this is the cheapest path.
- TTL … TO VOLUME / TO DISK tiers data to cheaper storage as it ages.
- TTL … RECOMPRESS rewrites ageing parts with denser codecs.
- TTL … GROUP BY rolls raw rows up into summaries; a sketch of this pattern follows the list.
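As a concrete sketch of the roll-up case, the hypothetical table below collapses raw telemetry into hourly summaries once rows pass 30 days. The names are illustrative; the constraint that matters is that the GROUP BY keys form a prefix of the ORDER BY key.

```sql
CREATE TABLE metrics_raw
(
    service  LowCardinality(String),
    hour     DateTime,   -- toStartOfHour(ts), stored explicitly for the sort key
    ts       DateTime,
    requests UInt64,
    errors   UInt64
)
ENGINE = MergeTree
PARTITION BY toDate(ts)
ORDER BY (service, hour, ts)
-- After 30 days, one row per (service, hour) survives; columns not
-- listed in SET (here, ts) keep a value from one of the merged rows.
TTL ts + INTERVAL 30 DAY
    GROUP BY service, hour
    SET requests = sum(requests), errors = sum(errors);
```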
Each of those steps has its own characteristic cost, and the slow ones tend to be the ones that show up in p95 and p99 latency. That is why the rest of this piece focuses on the levers that actually move those percentiles, rather than on micro-optimisations that look good in synthetic tests but rarely survive contact with production workloads.
Settings that actually matter
The configuration surface in ClickHouse is broad, and most of it does not need to be touched in a typical deployment. The settings below are the ones worth understanding because they shape behaviour directly under load. Defaults work for small workloads; the right values for production are usually different.
| Setting | Suggested value | Notes |
|---|---|---|
| TTL | ts + INTERVAL N DAY | A table can carry multiple deadlines, each with its own action. |
| ttl_only_drop_parts | 1 | Drops whole parts once every row has expired, instead of rewriting rows. |
| merge_with_ttl_timeout | 14400 | Minimum seconds between TTL merges on the same table; 14400 is the default. |
| storage_policy | — | Names the disk and volume hierarchy that TO VOLUME / TO DISK target. |
| PARTITION BY | toDate(ts) | Daily partitions keep expiry drops cheap and aligned with retention. |
None of these are universal. The right number on a node with sixty-four cores and NVMe is not the right number on a smaller VM with attached storage, and the right number for an analytics workload differs from a streaming ingestion workload. The values above are starting points, not endpoints.
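For a table that already exists, the MergeTree settings above can be adjusted in place without recreating it. A minimal sketch, assuming the traces table defined in the next section:

```sql
-- Values are starting points, not recommendations.
ALTER TABLE traces
    MODIFY SETTING ttl_only_drop_parts = 1,        -- prefer whole-part drops
                   merge_with_ttl_timeout = 14400; -- min seconds between TTL merges
```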
ClickHouse SQL examples
The SQL below shows the pattern in concrete terms. It is meant to be read alongside the explanation, not copied verbatim into a production script; in particular, the storage policy and volume names are placeholders that must match the server's storage configuration.
```sql
CREATE TABLE traces
(
    ts DateTime,
    trace_id String,
    payload String CODEC(ZSTD(6))
)
ENGINE = MergeTree
PARTITION BY toDate(ts)                      -- daily partitions: expired days drop as whole parts
ORDER BY (trace_id, ts)
TTL ts + INTERVAL 30 DAY TO VOLUME 'cold',   -- tier to cheaper storage at 30 days
    ts + INTERVAL 365 DAY DELETE             -- hard delete at one year
SETTINGS storage_policy = 'tiered',          -- placeholder: must define a 'cold' volume
         ttl_only_drop_parts = 1;
```
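Two details are worth calling out. TO VOLUME 'cold' only works if the named storage policy defines a volume called cold, so the SETTINGS values above must be adapted to the deployment. And because the partition key has the same day granularity as the TTL column, the 365-day DELETE combined with ttl_only_drop_parts removes whole parts rather than rewriting rows, which is the cheap path described earlier.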
Tuning approach that works in practice
The list below is the order most operators converge on when tuning retention and TTL policies for telemetry data on ClickHouse. It is not a recipe; the right answer depends on the workload. But it is a defensible sequence: each step is cheap to verify, and each one has a measurable effect when the change matters.
- Use daily partitions for telemetry; whole-part drops are cheap.
- Tier hot data on SSD, warm on slower disks, cold on object storage.
- Use RECOMPRESS to shave storage on cold data without dropping it; a sketch follows this list.
- Stagger TTL across tables so merges do not synchronise.
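For the recompression item above, a hedged sketch on the traces table from earlier. MODIFY TTL replaces the whole TTL expression, so every clause is restated; the ZSTD level is illustrative.

```sql
-- Recompress warm data at 90 days instead of waiting for deletion.
ALTER TABLE traces
    MODIFY TTL ts + INTERVAL 30 DAY TO VOLUME 'cold',
               ts + INTERVAL 90 DAY RECOMPRESS CODEC(ZSTD(12)),
               ts + INTERVAL 365 DAY DELETE;
```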
Each change should be measured against the metrics that matter — usually p95 latency at a target throughput, plus query log statistics and CPU behaviour. Changes that do not move those numbers are not actually changes; they are configuration churn.
What to look at first
When something goes wrong with retention and TTL policies for telemetry data on ClickHouse, the first move is usually a handful of system table queries. The objects below are the ones that produce useful output fast, without needing a full monitoring pipeline to interpret.
| Object | What it shows |
|---|---|
| system.parts | Active and inactive data parts per table, with row counts, bytes on disk, and merge state. |
| system.merges | Currently running merges with progress, source parts, and total bytes to merge. |
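A starter query, assuming the traces example table in the default database; it shows whether old partitions are actually disappearing.

```sql
-- Active parts, rows, and bytes per partition for one table.
SELECT
    partition,
    count() AS part_count,
    sum(rows) AS row_count,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active AND database = 'default' AND table = 'traces'
GROUP BY partition
ORDER BY partition;
```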
Guardrails worth setting up
Tuning without monitoring is guesswork. The signals listed below are the ones that catch problems early enough to act on, and most production clusters end up alerting on a similar shortlist whether they planned to or not.
- Track per-table size and verify TTL is enforcing the policy.
- Alert on growth that contradicts the retention policy; the query after this list is a starting point.
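One way to drive that alert, sketched over system.parts; partitions from toDate keys are date strings, so the lexicographic minimum is the oldest day.

```sql
-- Oldest active partition and total size per table; an oldest
-- partition beyond the retention deadline means TTL is falling behind.
SELECT
    database,
    table,
    min(partition) AS oldest_partition,
    formatReadableSize(sum(bytes_on_disk)) AS total_size
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
```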
Pitfalls that show up repeatedly
The same handful of mistakes appears across cluster after cluster. Most of them are easier to avoid than to fix, and the cost of getting them wrong tends to compound — what starts as a small misconfiguration becomes a real incident weeks later when the workload grows.
- TTL deadlines that are tighter than the slowest dashboard’s lookback.
- Forgetting ttl_only_drop_parts; row-level cleanup is much slower than part drops.
- Mixing many TTL clauses on the same column without testing the merge plan.
None of those are exotic. They show up in code reviews, in postmortems, and occasionally in vendor support tickets, and the operational habit of catching them early is worth more than any single configuration change.
Frequently asked questions
A handful of questions come up every time this topic is discussed. The answers below are the ones that hold up across most production deployments; the exceptions are usually visible in the metrics.
Can I change TTL?
Yes. Use ALTER TABLE … MODIFY TTL to change the rule, and ALTER TABLE … MATERIALIZE TTL to rewrite existing parts under it.
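A minimal sketch against the traces table from earlier; MATERIALIZE TTL is a mutation that rewrites parts, which can be heavy on large tables.

```sql
-- MODIFY TTL replaces the whole expression; 14 days is illustrative.
ALTER TABLE traces MODIFY TTL ts + INTERVAL 14 DAY DELETE;
ALTER TABLE traces MATERIALIZE TTL;
```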
Is TTL enforcement strict?
It is enforced by merges; expired data lingers until the next eligible merge runs.
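If expired rows need to go sooner, a merge can be forced; this rewrites data and is expensive, so it is a break-glass tool rather than routine practice.

```sql
-- Force a merge so pending TTL actions are applied now (expensive).
OPTIMIZE TABLE traces FINAL;
```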
Can TTL move data to S3?
Yes: define an S3-backed volume in the storage policy and use TTL … TO VOLUME 's3'.
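Sketched against the traces table, assuming the server's storage configuration already defines a policy with an S3-backed volume named s3:

```sql
-- 's3' must exist as a volume in the table's storage policy.
ALTER TABLE traces
    MODIFY TTL ts + INTERVAL 30 DAY TO VOLUME 's3',
               ts + INTERVAL 365 DAY DELETE;
```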
Does TTL affect replication?
Each replica enforces TTL on its merges; one performs the work, others fetch the result.
How do I prove retention is working?
Audit min(ts) per table over time and confirm it aligns with policy.
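The audit can be as simple as the query below, run on a schedule and recorded over time; traces is the example table from earlier.

```sql
-- min(ts) should trail the retention deadline by no more than
-- roughly one merge cycle if TTL is keeping up.
SELECT min(ts) AS oldest_row, max(ts) AS newest_row
FROM traces;
```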
Behind every ClickHouse cluster there is a team that owns it, and the team’s habits matter as much as the configuration. Clear runbooks, clear ownership, and unambiguous SLOs do more for reliability than any single tuning decision, and they are what make tuning sustainable over time.
Workloads do not stand still. New dashboards, new tenants, and changes in usage patterns shift the shape of the traffic, and configuration that was right last quarter may be wrong this one. The cluster’s behaviour is a moving target, and the tuning posture should reflect that.
ClickHouse rarely operates in isolation. It sits inside a larger data platform with its own monitoring, deployment, and incident workflows, and the engine’s performance characteristics interact with those workflows in ways that are easy to miss. Treating ClickHouse as part of a system, rather than a standalone service, generally produces better outcomes.
Every new lever pulled on a ClickHouse cluster adds operational surface area. There is real value in keeping the configuration surface small — fewer custom values mean fewer things to remember during incident response, and fewer things that surprise the next operator who inherits the cluster.
The query log is one of the most useful diagnostic surfaces in ClickHouse, and the retention policy applied to it determines how far back a team can look during a postmortem. A few weeks of retention is the minimum that supports root-cause analysis on slow-developing problems, and many teams hold it for longer.
Hardware specifications change as nodes are replaced and infrastructure is upgraded. A configuration that fit a previous generation of disks or CPUs may underperform on the next, and revisiting tuning decisions when hardware changes is part of routine operations rather than an exceptional event.
Configuration changes that are documented and reversible are easier to live with than ones that are not. Even small changes are worth recording with the date, the reason, and the before-and-after metric, because the same change is likely to come up again in a future incident or capacity review.
Monitoring decisions tend to follow tuning decisions: once a setting is in place, the metrics that prove it is working become the ongoing signal that triggers the next change. Without that loop, a tuned cluster drifts back toward defaults whenever workload changes nudge it that way, and the work has to be redone.
Teams that want a deeper look at retention and TTL policies for telemetry data on ClickHouse can review ChistaDATA’s observability articles, or contact ChistaDATA about ClickHouse support for production engagements.
Putting it together
Retention and TTL policy for telemetry data on ClickHouse sits at the intersection of schema design, hardware choice, and operational habits. Each of those areas can be tuned in isolation, but real performance comes from getting all three roughly right at the same time. The work pays off in the form of latency that holds during peaks and a cluster that scales without surprises.
The work is rarely finished, but it is also not as mysterious as it sometimes feels: a small number of mechanisms drive most of the behaviour, and the levers that matter are mostly the ones described above.
You might also like:
- Implementing Inverted Indexes in ClickHouse for Fast Search (Part 2)
- Mastering Concurrency in ClickHouse by Optimizing ClickHouse Thread Performance
- ClickHouse Troubleshooting: Understanding Estimated I/O and CPU Costs
- ClickHouse Redo Operations for Data Reliability
- Understanding the OpenTelemetry Collector: A Comprehensive Guide to Modern Telemetry Management
