Mining the ClickHouse query log for performance insights is one of those areas where defaults work for small workloads and quietly fail at scale. The ClickHouse query log records every executed query with timing, IO, memory, and ProfileEvents counters — enough data to find regressions, plan changes, and noisy neighbours without external tools. That single sentence hides a fair amount of detail, and the rest of this piece pulls those details apart so the levers and trade-offs are visible.
The most common version of the problem is straightforward: operators forget the query log exists, or use only its top-N queries by duration, and miss the deeper signals it carries. That kind of issue rarely traces back to a single setting. It is usually a combination of schema, sort key, and a few small misconfigurations stacking on top of each other, and the path to fixing it starts with understanding the mechanics.
For teams running ClickHouse in production, the cost of getting this wrong is felt in tail latency, in runaway memory grants, and in the hours operators spend chasing intermittent issues. Getting it right takes some up-front investment in measurement and a willingness to revisit defaults when the workload changes.
How it actually works
Before changing any setting, it helps to walk through what ClickHouse is actually doing under the surface. The behaviour described here is not specific to one release; the broad shape has held across recent versions, and the operational implications are the same on self-managed clusters and on managed offerings.
- system.query_log captures one row per query event (start, finish, exception).
- ProfileEvents column holds dozens of counters: rows read, marks selected, network bytes, cache hits.
- normalized_query_hash groups equivalent queries with different literals, enabling aggregation.
- system.query_thread_log adds per-thread detail when enabled.
- Retention is controlled by query_log table TTL; defaults vary by version.
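The one-row-per-event shape is easy to see by pulling every row for a single query; the query_id below is a placeholder to substitute, not a real value.
-- A query typically produces a QueryStart row plus a QueryFinish (or exception) row
SELECT type, event_time, query_duration_ms, read_rows, memory_usage
FROM system.query_log
WHERE query_id = '<query-id>' -- placeholder: substitute a real query_id
ORDER BY event_time;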
Each of those pieces carries its own collection cost, and the signals worth watching tend to be the ones that show up in p95 and p99 latency. That is why the rest of this piece focuses on the levers that actually move those percentiles, rather than on micro-optimisations that look good in synthetic tests but rarely survive contact with production workloads.
Settings that actually matter
The configuration surface in ClickHouse is broad, and most of it does not need to be touched in a typical deployment. The settings below are the ones worth understanding because they shape behaviour directly under load. Defaults work for small workloads; the right values for production are usually different.
| Setting | Suggested value | Notes |
|---|---|---|
| log_queries | 1 (default) | Query logging is on by default in recent versions. |
| log_queries_min_query_duration_ms | 100–1000 | Skips queries faster than the threshold; the default of 0 logs everything. |
| log_query_threads | 1 | Enables system.query_thread_log; adds measurable overhead. |
| query_log TTL | days to weeks | Trim retention with a TTL on the table itself. |
| log_profile_events | 1 (default) | Records the per-query ProfileEvents counters. |
None of these are universal. The right number on a node with sixty-four cores and NVMe is not the right number on a smaller VM with attached storage, and the right number for an analytics workload differs from a streaming ingestion workload. The values above are starting points, not endpoints.
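A low-risk way to check what a node is actually running is to read the live values from system.settings; the name list below simply mirrors the table above.
-- Current values of the logging-related settings on this node
SELECT name, value, changed
FROM system.settings
WHERE name IN ('log_queries', 'log_queries_min_query_duration_ms', 'log_query_threads', 'log_profile_events');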
ClickHouse SQL examples
The SQL below shows the pattern in concrete terms. It is meant to be read alongside the explanation, not copied verbatim into a production script.
-- Worst queries by p95 duration in the last hour
SELECT normalized_query_hash, any(query) AS sample,
count() AS runs,
quantile(0.95)(query_duration_ms) AS p95,
quantile(0.99)(query_duration_ms) AS p99
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR AND type = 'QueryFinish'
GROUP BY normalized_query_hash
ORDER BY p95 DESC LIMIT 30;
-- Cache effectiveness
SELECT sum(ProfileEvents['MarkCacheHits']) / sum(ProfileEvents['MarkCacheHits'] + ProfileEvents['MarkCacheMisses']) AS hit_rate
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR AND type = 'QueryFinish';
Tuning approach that works in practice
The list below is the order most operators converge on when working through the query log. It is not a recipe; the right answer depends on the workload. But it is a defensible sequence: each step is cheap to verify, and each one has a measurable effect when the change matters.
- Group by normalized_query_hash to see equivalent queries together.
- Watch read_bytes regressions; sudden growth almost always means lost index pruning (a detection query follows this list).
- Use ProfileEvents to confirm whether the engine is actually using the optimisations you expect.
- Trim retention with TTL on the query_log table itself; it can grow large.
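A sketch of the read_bytes regression check from the list above, assuming a day-over-day comparison fits the workload; the 2x growth threshold is an arbitrary starting point, not a standard.
-- Query shapes whose bytes read grew sharply versus the previous day
SELECT
    normalized_query_hash,
    any(query) AS sample,
    sumIf(read_bytes, event_time > now() - INTERVAL 1 DAY) AS bytes_today,
    sumIf(read_bytes, event_time <= now() - INTERVAL 1 DAY) AS bytes_yesterday
FROM system.query_log
WHERE event_time > now() - INTERVAL 2 DAY AND type = 'QueryFinish'
GROUP BY normalized_query_hash
HAVING bytes_yesterday > 0 AND bytes_today > 2 * bytes_yesterday
ORDER BY bytes_today DESC
LIMIT 20;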
Each change should be measured against the metrics that matter — usually p95 latency at a target throughput, plus query log statistics and CPU behaviour. Changes that do not move those numbers are not actually changes; they are configuration churn.
What to look at first
When something goes wrong, the first move is usually a handful of system table queries. The objects below are the ones that produce useful output fast, without needing a full monitoring pipeline to interpret.
| Object | What it shows |
|---|---|
| system.query_log | Completed queries with duration, memory, rows read, and the user who ran them. |
| system.query_thread_log | Per-thread breakdown of query work: CPU, real time, and per-thread profile events. |
| system.events | Cumulative event counters: SelectQuery, InsertQuery, FailedQuery, MarkCacheHits, etc. |
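A quick first pass over the cumulative counters looks like the query below; the LIKE pattern and event names are just an illustrative selection.
-- Snapshot of cache and query counters accumulated since server start
SELECT event, value
FROM system.events
WHERE event LIKE '%Cache%' OR event IN ('SelectQuery', 'InsertQuery', 'FailedQuery')
ORDER BY value DESC;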
Guardrails worth setting up
Tuning without monitoring is guesswork. The signals listed below are the ones that catch problems early enough to act on, and most production clusters end up alerting on a similar shortlist whether they planned to or not.
- Alert on p95 of important normalized queries crossing thresholds.
- Track exception rates from system.query_log where type = 'ExceptionBeforeStart' or 'ExceptionWhileProcessing' (a sketch follows).
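A minimal version of that exception-rate check, assuming hourly buckets are the right granularity for alerting.
-- Exceptions per hour, split by where in the query lifecycle they occurred
SELECT
    toStartOfHour(event_time) AS hour,
    type,
    count() AS exceptions
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 DAY
  AND type IN ('ExceptionBeforeStart', 'ExceptionWhileProcessing')
GROUP BY hour, type
ORDER BY hour DESC, type;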
Pitfalls that show up repeatedly
The same handful of mistakes appears across cluster after cluster. Most of them are easier to avoid than to fix, and the cost of getting them wrong tends to compound — what starts as a small misconfiguration becomes a real incident weeks later when the workload grows.
- Reading raw query text without normalized_query_hash; aggregations become noisy.
- Forgetting that user errors land in query_log too; filter them out for performance work.
- Letting query_log fill the disk; set a TTL on it (see the sketch below).
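One way to set that TTL, shown as a sketch; many deployments configure it in the server config's query_log section instead, since a TTL altered directly on the table can be reset when the table definition in config changes.
-- Keep 30 days of query history; adjust the interval to taste
ALTER TABLE system.query_log MODIFY TTL event_date + INTERVAL 30 DAY;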
None of those are exotic. They show up in code reviews, in postmortems, and occasionally in vendor support tickets, and the operational habit of catching them early is worth more than any single configuration change.
Frequently asked questions
A handful of questions come up every time this topic is discussed. The answers below are the ones that hold up across most production deployments; the exceptions are usually visible in the metrics.
Is query_log enabled by default?
Yes. It can be disabled, but doing so blinds operators to most performance work.
How big does it get?
Big. TTL it aggressively unless you have specific compliance reasons to keep more.
What is normalized_query_hash?
A hash of the query with literals replaced by placeholders, so 'WHERE id = 1' and 'WHERE id = 2' group together.
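The behaviour is easy to verify with the normalizeQuery and normalizedQueryHash functions.
-- Two different literals, one normalized shape
SELECT
    normalizeQuery('SELECT * FROM t WHERE id = 1') AS shape,
    normalizedQueryHash('SELECT * FROM t WHERE id = 1') = normalizedQueryHash('SELECT * FROM t WHERE id = 2') AS same_hash;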
Can I export query_log to dashboards?
Yes, via materialized views into a smaller table, or via Grafana on the source.
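A sketch of the smaller-table approach; the table name and columns are illustrative, and the INSERT would run on a schedule rather than by hand.
-- Hypothetical hourly rollup table for dashboards
CREATE TABLE IF NOT EXISTS query_log_hourly
(
    hour DateTime,
    normalized_query_hash UInt64,
    runs UInt64,
    p95_ms Float64
)
ENGINE = MergeTree
ORDER BY (hour, normalized_query_hash);

INSERT INTO query_log_hourly
SELECT
    toStartOfHour(event_time) AS hour,
    normalized_query_hash,
    count() AS runs,
    quantile(0.95)(query_duration_ms) AS p95_ms
FROM system.query_log
WHERE event_time >= now() - INTERVAL 1 HOUR AND type = 'QueryFinish'
GROUP BY hour, normalized_query_hash;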
Should I enable query_thread_log?
It is heavier than query_log. Enable for short profiling sessions; consider keeping it off in steady state.
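During such a profiling session, one useful check is whether a query's CPU work is skewed onto a single thread; a sketch, with the 50% skew threshold chosen arbitrarily.
-- Queries where one thread did most of the CPU work
SELECT
    query_id,
    count() AS threads,
    max(ProfileEvents['OSCPUVirtualTimeMicroseconds']) AS busiest_thread_cpu_us,
    sum(ProfileEvents['OSCPUVirtualTimeMicroseconds']) AS total_cpu_us
FROM system.query_thread_log
WHERE event_time > now() - INTERVAL 10 MINUTE
GROUP BY query_id
HAVING threads > 1 AND busiest_thread_cpu_us > 0.5 * total_cpu_us
ORDER BY total_cpu_us DESC
LIMIT 20;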
Workloads do not stand still. New dashboards, new tenants, and changes in usage patterns shift the shape of the traffic, and configuration that was right last quarter may be wrong this one. The cluster’s behaviour is a moving target, and the tuning posture should reflect that.
ClickHouse rarely operates in isolation. It sits inside a larger data platform with its own monitoring, deployment, and incident workflows, and the engine’s performance characteristics interact with those workflows in ways that are easy to miss. Treating ClickHouse as part of a system, rather than a standalone service, generally produces better outcomes.
Monitoring decisions tend to follow tuning decisions: once a setting is in place, the metrics that prove it is working become the ongoing signal that triggers the next change. Without that loop, a tuned cluster drifts back toward defaults whenever workload changes nudge it that way, and the work has to be redone.
Behind every ClickHouse cluster there is a team that owns it, and the team’s habits matter as much as the configuration. Clear runbooks, clear ownership, and unambiguous SLOs do more for reliability than any single tuning decision, and they are what make tuning sustainable over time.
Part count is a quiet failure mode: the cluster keeps working as parts accumulate, and then suddenly latency spikes or a merge thread saturates. Watching part count per partition and tying it to ingestion rate is a small habit that catches the problem long before it becomes an incident.
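That habit is cheap to automate with a query against system.parts; the LIMIT is arbitrary.
-- Partitions carrying the most active parts right now
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;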
The query log is one of the most useful diagnostic surfaces in ClickHouse, and the retention policy applied to it determines how far back a team can look during a postmortem. A few weeks of retention is the minimum that supports root-cause analysis on slow-developing problems, and many teams hold it for longer.
Teams that want a deeper look at mining the ClickHouse query log for performance insights can review ChistaDATA’s observability articles, or contact ChistaDATA about ClickHouse support for production engagements.
Putting it together
Mining the ClickHouse query log for performance insights sits at the intersection of schema design, hardware choice, and operational habits. Each of those areas can be tuned in isolation, but real performance comes from getting all three roughly right at the same time. The work pays off in the form of latency that holds during peaks and a cluster that scales without surprises.
The work is rarely finished, but it is also not as mysterious as it sometimes feels: a small number of mechanisms drive most of the behaviour, and the levers that matter are mostly the ones described above.
