Benchmarking ClickHouse: Real-World Workloads and Throughput Analysis
Most ClickHouse benchmarks measure the wrong thing on the wrong data and prove nothing. This is a senior engineer’s guide to benchmarking ClickHouse against the workloads you actually run — with the metrics that matter, the methodology that holds up, and the throughput analysis that turns numbers into decisions.
There is no shortage of ClickHouse benchmarks on the internet, and almost all of them are useless to you — because they measure synthetic data, on hardware you do not run, under a concurrency pattern your application never produces.
ClickHouse is one of the fastest analytical databases available, and that reputation is well earned. But “fast” is not a number you can put in a capacity plan or a contract. The question that matters is not whether ClickHouse is fast in the abstract; it is whether your workload, on your hardware, with your schema and concurrency, meets the latency and throughput targets your business depends on, and how much headroom remains before you must scale. Answering that requires benchmarking as an engineering discipline, not a leaderboard score.
This guide lays out how our engineers benchmark ClickHouse on real engagements: the metrics worth measuring, the workload profiles that appear again and again in production, a methodology that produces repeatable results, and an honest analysis of ingestion throughput, query latency, and the hardware that drives both. Every figure presented here is representative of well-tuned deployments and is meant to illustrate method and order of magnitude — your numbers will differ, and the entire point of benchmarking is to discover precisely how.
It is worth being clear about why this matters commercially. A trustworthy benchmark is the difference between a capacity plan you can defend to finance and a guess that fails under the next traffic spike; between choosing the right instance class and over-provisioning by a factor of three; between signing a latency SLA you can meet and one that generates incident pages. Benchmarking is not an academic exercise — it is the evidence base for every significant decision you will make about a ClickHouse deployment, from initial sizing to a migration to a tuning investment. Treating it with engineering rigor pays for itself many times over.
The metrics that actually matter
A benchmark is only as meaningful as the metrics it captures, and ClickHouse rewards measuring a specific, interrelated set. Throughput and latency are the headline figures, but they must be read alongside the efficiency metrics that determine cost. A cluster that hits its latency target while burning every available core has no headroom; one that meets throughput targets while reading ten times more data than necessary is a tuning opportunity disguised as a capacity problem.
The metrics below form the core of any serious ClickHouse benchmark. Capture all of them together, because optimizing one in isolation almost always moves another.
| Metric | What it tells you | Where to read it |
|---|---|---|
| Ingestion throughput | Rows/sec and MB/sec the cluster sustains without backpressure | system.query_log, insert metrics |
| Query latency (p50/p95/p99) | Response time distribution, not just the average | system.query_log |
| Query throughput (QPS) | Concurrent queries per second before latency degrades | clickhouse-benchmark |
| Data read per query | Rows and bytes scanned — the root cause of most latency | system.query_log |
| Compression ratio | On-disk vs uncompressed size; drives I/O and cost | system.parts |
| Resource efficiency | CPU, memory, and I/O consumed per unit of work | Host-level OS metrics, system.metrics |
Note the emphasis on percentiles rather than averages. An average latency hides the tail, and for customer-facing analytics the tail is the experience users remember. Always benchmark and report p95 and p99, and treat data read per query as the leading indicator it is: in a columnar engine, latency is overwhelmingly a function of how many granules a query touches.
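To make the point about averages concrete, here is a minimal sketch that summarizes a latency sample with the nearest-rank percentile method. The sample values are invented for illustration, and `percentile` and `summarize` are hypothetical helper names, not part of any ClickHouse tooling.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the mean next to the percentiles so the tail is visible."""
    return {
        "mean": mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Invented sample: 94 fast queries and 6 slow ones (milliseconds).
latencies = [40] * 94 + [900] * 6
print(summarize(latencies))
```

In this sample the mean stays under 100 ms while p95 and p99 sit at 900 ms, which is the gap a report based on averages would hide.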
One metric deserves promotion above the rest: data read per query, expressed as a fraction of the table. It is the closest thing ClickHouse offers to a universal explanation of performance, because nearly every latency and throughput outcome traces back to how much data the engine had to touch. When a query is slow, the first question is not “how many cores do we have” but “what fraction of the table did this read, and why” — and a benchmark that captures this number for every query gives you the answer directly rather than forcing you to infer it from wall-clock time alone.
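A small sketch of how that fraction can be computed and used to rank queries. It assumes you have already exported `read_rows` and duration per query from system.query_log and know the table's row count; all figures and query labels below are invented, and the helper names are hypothetical.

```python
def fraction_read(read_rows, table_rows):
    """Share of the table a query scanned (can exceed 1.0 if a query reads other tables too)."""
    if table_rows <= 0:
        raise ValueError("table_rows must be positive")
    return read_rows / table_rows


def rank_by_fraction_read(queries, table_rows):
    """queries: iterable of (label, read_rows, duration_ms). Worst offenders first."""
    ranked = [
        (label, fraction_read(read_rows, table_rows), duration_ms)
        for label, read_rows, duration_ms in queries
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Invented figures for illustration only.
TABLE_ROWS = 2_000_000_000
observed = [
    ("dashboard_by_tenant", 8_000_000, 45),
    ("funnel_last_30d", 600_000_000, 1900),
    ("unfiltered_export", 2_000_000_000, 7400),
]

for label, fraction, duration_ms in rank_by_fraction_read(observed, TABLE_ROWS):
    print(f"{label:22s} read {fraction:7.2%} of table in {duration_ms} ms")
```

Sorting by fraction read rather than by duration puts the tuning candidates at the top even when a fast disk or a warm cache is masking their cost.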
Real-world workload profiles
The single biggest mistake in ClickHouse benchmarking is using data and queries that do not resemble production. ClickHouse behaves very differently across workload types, so a benchmark is only valid if it mirrors the profile you actually run. Four profiles cover the large majority of production deployments, and each stresses the engine in a distinct way.
Observability & logs
High-volume append-only ingestion of logs, metrics, and traces, queried by time range and a few dimensions. Stresses ingestion throughput and time-range pruning.
Benchmark: ingest rows/sec, p99 range-scan
Clickstream & AdTech
Massive event streams with high-cardinality identifiers, queried for funnels, attribution, and aggregates. Stresses cardinality handling and aggregation.
Benchmark: group-by QPS, unique counting
Time-series & IoT
Regular high-frequency sensor and telemetry data with heavy use of downsampling and rollups. Stresses materialized views and codecs.
Benchmark: rollup latency, compression
Customer-facing analytics
Multi-tenant dashboards with strict latency SLAs and high concurrency. Stresses p99 under load and per-tenant isolation.
Benchmark: concurrent QPS at fixed p99
Before writing a single benchmark query, classify your workload against these profiles and build the test data and query mix to match. Generate data with realistic cardinality and distribution rather than uniform random values, which compress and prune unrealistically well and flatter the engine. The closer the test data resembles production in shape and skew, the more the benchmark predicts real behavior.
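One way to avoid uniform test data is to draw identifiers from a skewed distribution. The sketch below uses Zipf-style weights as a stand-in for production skew; the exponent, tenant count, and function names are assumptions chosen for illustration, and real data should be profiled to pick a shape that matches it.

```python
import random
from collections import Counter


def zipf_weights(n, s=1.1):
    """Weight for rank r is 1 / r**s, so a few ids dominate and a long tail follows."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def generate_tenant_ids(n_rows, n_tenants, s=1.1, seed=42):
    """Draw tenant ids with skewed popularity. Seeded so the dataset is reproducible."""
    rng = random.Random(seed)
    tenants = list(range(1, n_tenants + 1))
    return rng.choices(tenants, weights=zipf_weights(n_tenants, s), k=n_rows)


rows = generate_tenant_ids(100_000, 1_000)
counts = Counter(rows)
top_share = sum(count for _, count in counts.most_common(10)) / len(rows)
print(f"top 10 of 1000 tenants hold {top_share:.0%} of rows")
```

With uniform values the top ten tenants would hold about 1% of rows. A skewed generator concentrates a large share in a few tenants, which is closer to how multi-tenant and clickstream data usually behaves.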
A benchmarking methodology that holds up
A defensible benchmark is reproducible, isolates variables, and measures the system under conditions that resemble production. The single most useful tool is clickhouse-benchmark, which runs a stream of queries at a configurable concurrency and reports the full latency distribution along with rows and bytes processed per second. For standardized cross-system comparison, the ClickBench suite provides a well-defined analytical workload, though you should still complement it with your own queries.
Method discipline matters more than tool choice. Separate cold-cache and warm-cache runs and report both, because the first run pays I/O the cache later hides. Run each test multiple times and report the distribution, not a single number. Change one variable at a time — schema, setting, hardware, or concurrency — so that any difference is attributable. And always capture rows and bytes read from system.query_log alongside wall-clock time, because that is the metric that explains why a number changed.
# Drive a realistic concurrency level and read the full distribution
clickhouse-benchmark --concurrency 16 --iterations 1000 \
    --query "SELECT tenant_id, count() FROM events
             WHERE event_date >= today() - 7 GROUP BY tenant_id"

# Flush buffered log entries so the runs above are visible in system.query_log
clickhouse-client --query "SYSTEM FLUSH LOGS"

# Ground truth: what each query actually read
clickhouse-client --query "SELECT query_duration_ms, read_rows,
    formatReadableSize(read_bytes) FROM system.query_log
    WHERE type = 'QueryFinish' ORDER BY event_time DESC LIMIT 20"
The benchmark that wins arguments is the one run on a copy of production data at production concurrency. We routinely see synthetic benchmarks overstate throughput by an order of magnitude because uniform random data prunes and compresses in ways real data never does. Our ClickHouse performance audit always benchmarks against representative data.
Repeatability deserves particular attention because it is what makes a benchmark evidence rather than anecdote. Fix the ClickHouse version, the server settings, and the dataset, and record them alongside every result, so a number can be reproduced months later or by a colleague. Quiesce the environment — no competing workloads on the host, no background merges mid-run — or, where you cannot, measure with them present and say so. Warm the cache deliberately for warm-run figures and drop it deliberately for cold-run figures, rather than letting cache state vary uncontrolled between runs. A result you cannot reproduce is a result you cannot defend in a capacity plan or a vendor negotiation.
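A minimal sketch of recording run conditions alongside results, so two runs can be checked for comparability before their numbers are compared. The `RunContext` class and its fields are hypothetical, and the version string, dataset name, and setting values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class RunContext:
    """Everything that must be held fixed for two results to be comparable."""
    clickhouse_version: str
    settings: dict
    dataset: str
    cache_state: str  # "cold" or "warm"
    concurrency: int

    def fingerprint(self):
        """Stable short hash of the context; key order in settings does not matter."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def differing_fields(a, b):
    """Names of the fields that changed between two runs."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


# Placeholder values for illustration only.
warm = RunContext("24.8", {"max_threads": 16}, "events_copy", "warm", 16)
cold = RunContext("24.8", {"max_threads": 16}, "events_copy", "cold", 16)

print(warm.fingerprint(), cold.fingerprint())
print("changed between runs:", differing_fields(warm, cold))
```

Storing the fingerprint next to every result makes it easy to refuse a comparison in which more than one field differs.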
Ingestion throughput analysis
Ingestion is where ClickHouse newcomers most often leave performance on the table, and the cause is almost always insert pattern rather than raw capacity. ClickHouse is built to ingest data in large batches: each insert creates a part, and a flood of tiny inserts produces a flood of tiny parts that the engine must then merge, consuming I/O and CPU and degrading both ingestion and queries. The same hardware can differ by orders of magnitude in sustained throughput depending purely on how rows are batched.
The chart below illustrates the representative shape of this relationship — the absolute numbers depend entirely on row width, hardware, and schema, but the pattern is universal.
The lesson is to batch aggressively. Where the application cannot batch — many independent producers writing small payloads — asynchronous inserts let the server accumulate rows and write them in efficient batches, recovering most of the throughput without changing the client. Tune insert block size, parallelism, and the number of concurrent insert streams, and watch part counts in system.parts to confirm you are not creating merge pressure. ClickHouse’s own guidance on bulk inserts is the right starting point.
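Where the client can batch, the pattern is simple: buffer rows and hand them to the insert call only when the buffer is full. The sketch below shows a size-based flush only. `InsertBatcher` is a hypothetical class, and `flush_fn` stands in for whatever bulk-insert call your driver provides. A production version would usually also flush on a timer so rows do not wait indefinitely.

```python
class InsertBatcher:
    """Accumulates rows and calls flush_fn once per full batch instead of once per row."""

    def __init__(self, flush_fn, max_rows=100_000):
        if max_rows <= 0:
            raise ValueError("max_rows must be positive")
        self._flush_fn = flush_fn  # assumed: a callable that performs one bulk insert
        self._max_rows = max_rows
        self._buffer = []

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._max_rows:
            self.flush()

    def flush(self):
        """Send whatever is buffered; call this at shutdown to avoid losing the remainder."""
        if self._buffer:
            self._flush_fn(self._buffer)
            self._buffer = []


# Illustration: capture batches in a list instead of talking to a server.
sent_batches = []
batcher = InsertBatcher(sent_batches.append, max_rows=1_000)
for i in range(2_500):
    batcher.add((i, "event"))
batcher.flush()

print("rows:", 2_500, "inserts issued:", len(sent_batches))
```

Here 2,500 rows become three inserts rather than 2,500, which is the reduction in part creation that batching is meant to deliver.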
When you benchmark ingestion, measure sustained throughput rather than a momentary peak. A cluster can absorb a brief burst that it cannot maintain, because background merges eventually compete with incoming writes for I/O and CPU; the number that matters for capacity planning is the rate the cluster holds for hours without the merge queue growing without bound. Watch the merge backlog and part count throughout the run, and treat a steadily rising part count as the signal that you have exceeded sustainable ingestion regardless of what the instantaneous rows-per-second figure claims. Sustainable throughput, not peak throughput, is what your pipeline must be sized against.
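The rising-part-count signal can be checked numerically by fitting a line to part counts sampled during the run. This is a sketch under assumptions: the samples are invented, the function names are hypothetical, and the growth threshold is arbitrary and should be set from your own baseline.

```python
def slope_per_minute(samples):
    """Least-squares slope of active part count over time. samples: [(minute, part_count)]."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


def is_sustainable(samples, max_growth_per_minute=0.5):
    """True when merges keep pace, judged by part count growth staying under the threshold."""
    return slope_per_minute(samples) <= max_growth_per_minute


# Invented samples: (minutes into the run, active parts).
steady = [(0, 120), (10, 125), (20, 118), (30, 122)]
rising = [(0, 120), (10, 180), (20, 240), (30, 300)]

print("steady:", slope_per_minute(steady), is_sustainable(steady))
print("rising:", slope_per_minute(rising), is_sustainable(rising))
```

A flat or oscillating series indicates merges are keeping up. A steady positive slope means the insert rate under test is above the sustainable rate, whatever the instantaneous rows-per-second figure shows.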
Query latency and concurrency
Query benchmarking has two dimensions that are easy to conflate: single-query latency and behavior under concurrency. A query that returns in 80 milliseconds in isolation may degrade sharply when fifty of them run at once, because they contend for CPU, memory, and I/O bandwidth. For customer-facing analytics, the figure that matters is the concurrency level the cluster sustains while holding p99 under the SLA — not the best-case latency of a single query on an idle system.
Benchmark latency across the percentiles and across realistic concurrency levels, increasing concurrency until p99 breaches your target to find the cluster’s true capacity ceiling. The chart below shows the representative effect of proper indexing and pre-aggregation on tail latency — the kind of improvement a structured tuning engagement targets.
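The ramp procedure reduces to a small calculation once p99 has been measured at each concurrency level. In this sketch the measurements and SLA are invented, and `capacity_ceiling` is a hypothetical helper.

```python
def capacity_ceiling(measurements, sla_ms):
    """Highest concurrency whose p99 stays within the SLA, stopping at the first breach.

    measurements: iterable of (concurrency, p99_ms) pairs from separate benchmark runs.
    Returns None if even the lowest concurrency level breaches the SLA.
    """
    ceiling = None
    for concurrency, p99_ms in sorted(measurements):
        if p99_ms > sla_ms:
            break
        ceiling = concurrency
    return ceiling


# Invented results from a concurrency ramp.
ramp = [(1, 80), (8, 95), (16, 140), (32, 310), (64, 900)]

print("ceiling at 200 ms SLA:", capacity_ceiling(ramp, sla_ms=200))
```

Stopping at the first breach is deliberate: a later level that happens to pass is more likely to be noise than recovered capacity. The true ceiling lies somewhere between the last passing level and the first failing one, so finer steps in that interval tighten the estimate.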
The takeaway is that latency is rarely a hardware problem first. The largest gains come from reducing the data each query reads — through the sorting key, data-skipping indexes, projections, and pre-aggregation — after which concurrency scales because each query consumes less of the shared resource pool. Benchmark before and after each change so the improvement is attributable and defensible.
Hardware and cost efficiency
Throughput numbers are meaningless without the hardware context that produced them, and cost efficiency is increasingly the metric executives actually care about. ClickHouse scales well with CPU cores because query execution is heavily parallelized, benefits enormously from fast local NVMe storage for hot data, and uses memory as the cache that keeps the working set off disk. A benchmark should always record the instance type, core count, memory, and storage class, because a throughput figure without them is uninterpretable.
The modern decision is also architectural: local-disk clusters versus the separation of storage and compute, including object storage tiers and managed offerings such as ClickHouse Cloud. Each has a different cost-and-latency profile, and the right choice depends on whether your workload is latency-critical or cost-critical and how bursty it is. ClickHouse publishes sizing and hardware recommendations that make a sound baseline. The metric to optimize is cost per query or cost per ingested terabyte at the required latency — a figure that turns a benchmark into a budget.
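Both cost metrics are straightforward arithmetic once sustained throughput is known. The sketch below assumes a single hourly infrastructure cost and uses decimal units (1 TB = 1,000,000 MB); the prices and rates are invented, and the function names are hypothetical.

```python
def cost_per_million_queries(hourly_cost_usd, sustained_qps):
    """Cost of serving one million queries at the QPS the cluster sustains within its SLA."""
    if sustained_qps <= 0:
        raise ValueError("sustained_qps must be positive")
    queries_per_hour = sustained_qps * 3600
    return hourly_cost_usd / queries_per_hour * 1_000_000


def cost_per_tb_ingested(hourly_cost_usd, sustained_mb_per_sec):
    """Cost of ingesting one decimal terabyte at the sustained ingest rate."""
    if sustained_mb_per_sec <= 0:
        raise ValueError("sustained_mb_per_sec must be positive")
    tb_per_hour = sustained_mb_per_sec * 3600 / 1_000_000
    return hourly_cost_usd / tb_per_hour


# Invented inputs: a cluster costing 4 USD/hour, 200 QPS at target p99, 250 MB/s ingest.
print(f"{cost_per_million_queries(4.0, 200):.2f} USD per million queries")
print(f"{cost_per_tb_ingested(4.0, 250):.2f} USD per TB ingested")
```

Feed these functions the sustained figures measured at the required latency, not peak figures, so the result reflects what the cluster can deliver continuously.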
Benchmark scaling explicitly rather than assuming it. ClickHouse parallelizes a single query across cores, so adding cores often reduces latency for large scans, but the benefit tapers once a query is bound by memory bandwidth or I/O rather than CPU. Adding nodes to a cluster increases aggregate throughput and concurrency but introduces network and coordination costs that a single-node test will not reveal. Measure how your specific workload responds to more cores, more memory, and more nodes, because the shape of that scaling curve — not a single data point — is what tells you whether to scale up, scale out, or tune first. Frequently the most cost-effective move is to make each query read less before adding any hardware at all.
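One way to read the shape of a scaling curve is to compare measured speedup with ideal linear speedup at each hardware size. The latencies below are invented, and `scaling_efficiency` is a hypothetical helper.

```python
def scaling_efficiency(latency_ms_by_cores):
    """Map cores -> (speedup, efficiency) relative to the smallest configuration tested.

    Efficiency is measured speedup divided by ideal linear speedup, so 1.0 means
    perfect scaling and lower values mean diminishing returns.
    """
    if not latency_ms_by_cores:
        raise ValueError("no measurements")
    cores_sorted = sorted(latency_ms_by_cores)
    base_cores = cores_sorted[0]
    base_latency = latency_ms_by_cores[base_cores]
    report = {}
    for cores in cores_sorted:
        speedup = base_latency / latency_ms_by_cores[cores]
        ideal = cores / base_cores
        report[cores] = (speedup, speedup / ideal)
    return report


# Invented latencies for the same large scan at three core counts.
measured = {8: 1200, 16: 650, 32: 420}

for cores, (speedup, efficiency) in scaling_efficiency(measured).items():
    print(f"{cores:3d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Falling efficiency as cores are added suggests the query is becoming bound by memory bandwidth or I/O, which is the point at which reducing data read is likely to pay back more than further hardware.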
Common benchmarking mistakes
Most ClickHouse benchmarks fail in predictable ways, and recognizing the patterns saves weeks of misdirected effort. The most common is unrealistic data: uniformly random values that compress and prune far better than the skewed, correlated data of production, producing throughput numbers that evaporate on real workloads. Close behind is benchmarking at concurrency one, which measures a scenario that never occurs and ignores the contention that defines real capacity.
Other recurring mistakes include comparing a cold-cache run against a warm one, benchmarking a single query in isolation rather than a representative mix, ignoring the merge activity that small inserts generate, and reporting averages that conceal a punishing tail. Perhaps the most consequential is failing to record rows and bytes read, which leaves you unable to explain why a number changed and reduces the exercise to guesswork. A benchmark that cannot explain its own results cannot guide a decision.
If a vendor benchmark does not state the hardware, the data distribution, the concurrency, and the cache state, treat it as marketing rather than evidence. The same rigor applies to your own internal tests — and it is exactly the rigor a ClickHouse consulting review brings.
Turning benchmarks into decisions
A benchmark is a means, not an end. Its purpose is to answer concrete questions: will this cluster meet our latency SLA at projected growth, how much headroom remains before we must scale, which hardware delivers the lowest cost per query at our target latency, and what is the return on a tuning engagement. Frame every benchmark around the decision it must inform, and the methodology follows naturally from the question.
The discipline that connects measurement to decision is the same throughout this guide: benchmark against representative data and concurrency, capture the full set of metrics including data read, change one variable at a time, and report distributions rather than averages. Build a baseline you re-run as the workload evolves, and a regression becomes visible the moment it appears rather than after users complain. This is precisely the telemetry our engineers establish at the start of an engagement and re-measure at the end, so that every recommendation is backed by a number rather than an opinion.
Done well, benchmarking also reframes the cost conversation. Once you can express performance as cost per query and cost per ingested terabyte at a required latency, infrastructure spend becomes an engineering variable you can optimize rather than a fixed tax you absorb. A tuning engagement that halves the data each query reads does not merely make dashboards faster; it defers the next hardware purchase, lowers the cloud bill, and raises the concurrency ceiling at the same time. That is why we treat benchmarking as the opening move of every performance engagement — it converts a vague sense that the cluster is slow or expensive into a precise, prioritized list of the changes that will pay back fastest.
The benchmarking method in one view
Six principles, one discipline: measure your workload, not a leaderboard.
- Benchmark your workload. Real data shape, real query mix, real concurrency — synthetic uniform data lies.
- Measure the full set. Throughput, p95/p99 latency, QPS, and — critically — data read per query.
- Batch ingestion aggressively. Insert strategy moves throughput by orders of magnitude; use async inserts when you cannot batch.
- Test under concurrency. The capacity that matters is QPS at a fixed p99, not single-query best case.
- Record the hardware and cache state. A number without context is uninterpretable; optimize cost per query.
- Change one variable, report distributions, re-baseline. A benchmark that cannot explain its results cannot guide a decision.
Frequently asked questions
What tool should I use to benchmark ClickHouse?
Start with clickhouse-benchmark, which drives a configurable concurrency level and reports the full latency distribution plus rows and bytes processed per second. For standardized cross-system comparison, ClickBench provides a well-defined analytical workload. Whichever you use, always complement it with your own representative queries and capture system.query_log, because the standard suites cannot reflect your specific schema and access patterns.
Why are my benchmark numbers so different from published ClickHouse benchmarks?
Almost always because of data and conditions. Published benchmarks frequently use synthetic, uniformly distributed data that compresses and prunes far better than production data, run on hardware you do not have, and measure at a concurrency your application never produces. Benchmark on a copy of your real data, at your real concurrency, and record the hardware — the gap usually explains itself.
How do I maximize ClickHouse ingestion throughput?
Batch inserts aggressively — large batches rather than row-by-row — because each insert creates a part and many small parts create merge pressure that degrades everything. Where the application cannot batch, enable asynchronous inserts so the server batches on your behalf. Then tune block size and insert parallelism, and monitor part counts in system.parts to confirm you are not overwhelming the merge scheduler.
Should I benchmark with averages or percentiles?
Percentiles, always. An average latency conceals the tail, and for customer-facing workloads the tail — p95 and p99 — is the experience users actually remember. Benchmark and report the distribution, and find the concurrency level at which p99 breaches your SLA to determine the cluster’s true capacity ceiling rather than its best case.
Does more hardware fix slow ClickHouse queries?
Rarely as the first move. In a columnar engine, latency is overwhelmingly a function of how much data each query reads, so the largest gains come from reducing data read — through the sorting key, data-skipping indexes, projections, and pre-aggregation. Tune first and benchmark the result; scale hardware once the schema and queries are efficient and you have measured a genuine capacity ceiling.
When should we engage a specialist to benchmark and tune ClickHouse?
When latency or cost targets are at risk, when planning a major scale-up or migration, or when internal benchmarks are producing numbers you cannot explain or trust. ChistaDATA provides ClickHouse performance audits, ongoing ClickHouse support, and managed services for teams that want a senior bench without carrying the headcount.
ChistaDATA Engineering Team
ChistaDATA Inc. is a specialist ClickHouse infrastructure operations firm delivering consulting, performance engineering, migration, and 24×7 support on 100% open-source ClickHouse with zero vendor lock-in. Read more on the ChistaDATA blog or explore ClickHouse consulting.
ClickHouse is a registered trademark of ClickHouse, Inc. ChistaDATA is not affiliated with, endorsed by, or sponsored by ClickHouse, Inc. Benchmark figures shown are representative and illustrative; actual results depend on hardware, schema, data, and workload. All other trademarks are the property of their respective owners. Copyright © 2010–2026. All Rights Reserved by ChistaDATA®.