Benchmarking ClickHouse: Real-World Workloads and Throughput Analysis

Most ClickHouse benchmarks measure the wrong thing on the wrong data and prove nothing. This is a senior engineer’s guide to benchmarking ClickHouse against the workloads you actually run — with the metrics that matter, the methodology that holds up, and the throughput analysis that turns numbers into decisions.

ChistaDATA Engineering Team

  • Billions of rows/sec: aggregate scan on a tuned cluster
  • 100M+ rows/sec: sustained ingestion, well batched
  • Sub-second p99: dashboard queries, properly indexed
  • 10–100× compression: on real columnar data

There is no shortage of ClickHouse benchmarks on the internet, and almost all of them are useless to you — because they measure synthetic data, on hardware you do not run, under a concurrency pattern your application never produces.

ClickHouse is one of the fastest analytical databases available, and that reputation is well earned. But “fast” is not a number you can put in a capacity plan or a contract. The question that matters is not whether ClickHouse is fast in the abstract; it is whether your workload, on your hardware, with your schema and concurrency, meets the latency and throughput targets your business depends on, and how much headroom remains before you must scale. Answering that requires benchmarking as an engineering discipline, not a leaderboard score.

This guide lays out how our engineers benchmark ClickHouse on real engagements: the metrics worth measuring, the workload profiles that appear again and again in production, a methodology that produces repeatable results, and an honest analysis of ingestion throughput, query latency, and the hardware that drives both. Every figure presented here is representative of well-tuned deployments and is meant to illustrate method and order of magnitude — your numbers will differ, and the entire point of benchmarking is to discover precisely how.

It is worth being clear about why this matters commercially. A trustworthy benchmark is the difference between a capacity plan you can defend to finance and a guess that fails under the next traffic spike; between choosing the right instance class and over-provisioning by a factor of three; between signing a latency SLA you can meet and one that generates incident pages. Benchmarking is not an academic exercise — it is the evidence base for every significant decision you will make about a ClickHouse deployment, from initial sizing to a migration to a tuning investment. Treating it with engineering rigor pays for itself many times over.

Section 01

The metrics that actually matter

A benchmark is only as meaningful as the metrics it captures, and ClickHouse rewards measuring a specific, interrelated set. Throughput and latency are the headline figures, but they must be read alongside the efficiency metrics that determine cost. A cluster that hits its latency target while burning every available core has no headroom; one that meets throughput targets while reading ten times more data than necessary is a tuning opportunity disguised as a capacity problem.

The metrics below form the core of any serious ClickHouse benchmark. Capture all of them together, because optimizing one in isolation almost always moves another.

| Metric | What it tells you | Where to read it |
| --- | --- | --- |
| Ingestion throughput | Rows/sec and MB/sec the cluster sustains without backpressure | system.query_log, insert metrics |
| Query latency (p50/p95/p99) | Response time distribution, not just the average | system.query_log |
| Query throughput (QPS) | Concurrent queries per second before latency degrades | clickhouse-benchmark |
| Data read per query | Rows and bytes scanned; the root cause of most latency | system.query_log |
| Compression ratio | On-disk vs uncompressed size; drives I/O and cost | system.parts |
| Resource efficiency | CPU, memory, and I/O consumed per unit of work | System metrics, system.metrics |

Note the emphasis on percentiles rather than averages. An average latency hides the tail, and for customer-facing analytics the tail is the experience users remember. Always benchmark and report p95 and p99, and treat data read per query as the leading indicator it is: in a columnar engine, latency is overwhelmingly a function of how many granules a query touches.

One metric deserves promotion above the rest: data read per query, expressed as a fraction of the table. It is the closest thing ClickHouse offers to a universal explanation of performance, because nearly every latency and throughput outcome traces back to how much data the engine had to touch. When a query is slow, the first question is not “how many cores do we have” but “what fraction of the table did this read, and why” — and a benchmark that captures this number for every query gives you the answer directly rather than forcing you to infer it from wall-clock time alone.
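As a minimal sketch of capturing latency percentiles and data read together, the two helper functions below query system.query_log. The function names, the `CH_CLIENT` override variable, the 15-minute default window, and the output formats are assumptions made for illustration; the `query_kind` and `tables` columns exist in recent ClickHouse releases and may need to be dropped from the filters on older ones.

```shell
# Assumption: CH_CLIENT is a hypothetical override so the same functions work
# through a wrapper command; it defaults to clickhouse-client.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# Latency percentiles plus average data read for SELECTs finished in the last N minutes.
latency_report() {
  minutes="${1:-15}"
  $CH_CLIENT --query "
    SELECT
        count()                                       AS queries,
        quantiles(0.5, 0.95, 0.99)(query_duration_ms) AS p50_p95_p99_ms,
        round(avg(read_rows))                         AS avg_read_rows,
        formatReadableSize(avg(read_bytes))           AS avg_read_bytes
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND query_kind = 'Select'
      AND event_time >= now() - INTERVAL ${minutes} MINUTE
    FORMAT Vertical"
}

# Rows read by each recent query on one table, as a fraction of that table's rows.
read_fraction_report() {
  db="$1"
  table="$2"
  $CH_CLIENT --query "
    SELECT
        query_duration_ms,
        read_rows,
        round(read_rows / (SELECT total_rows FROM system.tables
                           WHERE database = '${db}' AND name = '${table}'), 4) AS fraction_of_table,
        substring(query, 1, 80) AS query_head
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND has(tables, '${db}.${table}')
    ORDER BY event_time DESC
    LIMIT 20
    FORMAT PrettyCompact"
}
```

For example, `latency_report 30` summarizes the last half hour and `read_fraction_report default events` lists recent queries on a table named events. Note that `quantiles` is an approximate aggregate; `quantilesExact` is the exact alternative at higher memory cost. On a cluster, system.query_log is local to each node, so run the report on every node that served queries.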

Section 02

Real-world workload profiles

The single biggest mistake in ClickHouse benchmarking is using data and queries that do not resemble production. ClickHouse behaves very differently across workload types, so a benchmark is only valid if it mirrors the profile you actually run. Four profiles cover the large majority of production deployments, and each stresses the engine in a distinct way.

Observability & logs

High-volume append-only ingestion of logs, metrics, and traces, queried by time range and a few dimensions. Stresses ingestion throughput and time-range pruning.

Benchmark: ingest rows/sec, p99 range-scan

Clickstream & AdTech

Massive event streams with high-cardinality identifiers, queried for funnels, attribution, and aggregates. Stresses cardinality handling and aggregation.

Benchmark: group-by QPS, unique counting

Time-series & IoT

Regular high-frequency sensor and telemetry data with heavy use of downsampling and rollups. Stresses materialized views and codecs.

Benchmark: rollup latency, compression

Customer-facing analytics

Multi-tenant dashboards with strict latency SLAs and high concurrency. Stresses p99 under load and per-tenant isolation.

Benchmark: concurrent QPS at fixed p99

Before writing a single benchmark query, classify your workload against these profiles and build the test data and query mix to match. Generate data with realistic cardinality and distribution rather than uniform random values, which compress and prune unrealistically well and flatter the engine. The closer the test data resembles production in shape and skew, the more the benchmark predicts real behavior.
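Where a copy of production data is not available, skew can at least be approximated. The sketch below emits synthetic rows whose tenant_id is concentrated on a small number of tenants instead of being uniform. The function name, the `CH_CLIENT` override, the two column names (borrowed from the example query in Section 03), the exponent, and the tenant range are all illustrative assumptions, not a recommended distribution.

```shell
# Assumption: CH_CLIENT is a hypothetical override; it defaults to clickhouse-client.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# Emit N synthetic events whose tenant_id is heavily skewed rather than uniform.
# Cubing a uniform value in [0, 1] concentrates rows on low tenant ids; the
# exponent (3) and the 10000-tenant range are arbitrary illustrative choices.
# rand(1) and rand(2) take different arguments so the two calls are not
# collapsed into a single value by common subexpression elimination.
generate_skewed_events() {
  rows="${1:-1000000}"
  $CH_CLIENT --query "
    SELECT
        today() - (rand(1) % 30)                       AS event_date,
        toUInt32(pow(rand(2) / 4294967295, 3) * 10000) AS tenant_id
    FROM numbers(${rows})
    FORMAT TSV"
}
```

The output can be piped into an insert, for example `generate_skewed_events 50000000 | clickhouse-client --query "INSERT INTO events (event_date, tenant_id) FORMAT TSV"`, assuming a table with those two columns and defaults for any others. Measuring the real distribution in production and matching it is always preferable to a guessed exponent.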

Section 03

A benchmarking methodology that holds up

A defensible benchmark is reproducible, isolates variables, and measures the system under conditions that resemble production. The single most useful tool is clickhouse-benchmark, which runs a stream of queries at a configurable concurrency and reports the full latency distribution along with rows and bytes processed per second. For standardized cross-system comparison, the ClickBench suite provides a well-defined analytical workload, though you should still complement it with your own queries.

Method discipline matters more than tool choice. Separate cold-cache and warm-cache runs and report both, because the first run pays I/O the cache later hides. Run each test multiple times and report the distribution, not a single number. Change one variable at a time — schema, setting, hardware, or concurrency — so that any difference is attributable. And always capture rows and bytes read from system.query_log alongside wall-clock time, because that is the metric that explains why a number changed.

Shell
# Drive a realistic concurrency level and read the full distribution
clickhouse-benchmark --concurrency 16 --iterations 1000 \
  --query "SELECT tenant_id, count() FROM events
           WHERE event_date >= today() - 7 GROUP BY tenant_id"

# Flush buffered log entries so the most recent queries are visible
clickhouse-client --query "SYSTEM FLUSH LOGS"

# Ground truth: what each query actually read
clickhouse-client --query "SELECT query_duration_ms, read_rows,
  formatReadableSize(read_bytes) FROM system.query_log
  WHERE type='QueryFinish' ORDER BY event_time DESC LIMIT 20"

ChistaDATA Insight

The benchmark that wins arguments is the one run on a copy of production data at production concurrency. We routinely see synthetic benchmarks overstate throughput by an order of magnitude because uniform random data prunes and compresses in ways real data never does. Our ClickHouse performance audit always benchmarks against representative data.

Repeatability deserves particular attention because it is what makes a benchmark evidence rather than anecdote. Fix the ClickHouse version, the server settings, and the dataset, and record them alongside every result, so a number can be reproduced months later or by a colleague. Quiesce the environment — no competing workloads on the host, no background merges mid-run — or, where you cannot, measure with them present and say so. Warm the cache deliberately for warm-run figures and drop it deliberately for cold-run figures, rather than letting cache state vary uncontrolled between runs. A result you cannot reproduce is a result you cannot defend in a capacity plan or a vendor negotiation.
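The repeatability steps above can be sketched as a handful of small helpers run before each benchmark. The function names and the `CH_CLIENT` override are assumptions for illustration; the SYSTEM statements are written as they appear in the ClickHouse documentation at the time of writing and should be checked against the version under test.

```shell
# Assumption: CH_CLIENT is a hypothetical override; it defaults to clickhouse-client.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# Record what the result depends on: server version plus every setting changed from its default.
record_environment() {
  $CH_CLIENT --query "SELECT version() AS version, now() AS recorded_at FORMAT TSVWithNames"
  $CH_CLIENT --query "SELECT name, value FROM system.settings WHERE changed FORMAT TSVWithNames"
}

# Cold-run preparation, part 1: clear ClickHouse's own caches.
drop_clickhouse_caches() {
  $CH_CLIENT --query "SYSTEM DROP MARK CACHE"
  $CH_CLIENT --query "SYSTEM DROP UNCOMPRESSED CACHE"
}

# Cold-run preparation, part 2 (Linux only, needs root): clear the OS page cache.
drop_os_page_cache() {
  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
}

# Optional: pause background merges for the run, then resume them afterwards.
pause_merges()  { $CH_CLIENT --query "SYSTEM STOP MERGES"; }
resume_merges() { $CH_CLIENT --query "SYSTEM START MERGES"; }
```

A cold run would call `record_environment`, `drop_clickhouse_caches`, and `drop_os_page_cache` before the benchmark; a warm run would instead execute the query mix once and discard that first result. If merges are paused, remember to resume them, and note in the results that they were paused.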

Section 04

Ingestion throughput analysis

Ingestion is where ClickHouse newcomers most often leave performance on the table, and the cause is almost always insert pattern rather than raw capacity. ClickHouse is built to ingest data in large batches: each insert creates at least one new part (one for every partition it touches), and a flood of tiny inserts produces a flood of tiny parts that the engine must then merge, consuming I/O and CPU and degrading both ingestion and queries. The same hardware can differ by orders of magnitude in sustained throughput depending purely on how rows are batched.

The chart below illustrates the representative shape of this relationship — the absolute numbers depend entirely on row width, hardware, and schema, but the pattern is universal.

Representative ingestion throughput by insert strategy
Relative sustained rows/sec on identical hardware and schema; illustrative of the pattern, not a hardware claim.

  • Single-row inserts: ~1×
  • Small batches (1K rows): ~12×
  • Large batches (100K rows): ~60×
  • Async inserts (server-batched): ~70×

Representative pattern from tuned deployments; validate against your own workload.

The lesson is to batch aggressively. Where the application cannot batch — many independent producers writing small payloads — asynchronous inserts let the server accumulate rows and write them in efficient batches, recovering most of the throughput without changing the client. Tune insert block size, parallelism, and the number of concurrent insert streams, and watch part counts in system.parts to confirm you are not creating merge pressure. ClickHouse’s own guidance on bulk inserts is the right starting point.

When you benchmark ingestion, measure sustained throughput rather than a momentary peak. A cluster can absorb a brief burst that it cannot maintain, because background merges eventually compete with incoming writes for I/O and CPU; the number that matters for capacity planning is the rate the cluster holds for hours without the merge queue growing without bound. Watch the merge backlog and part count throughout the run, and treat a steadily rising part count as the signal that you have exceeded sustainable ingestion regardless of what the instantaneous rows-per-second figure claims. Sustainable throughput, not peak throughput, is what your pipeline must be sized against.
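One way to watch part count and merge activity during an ingestion run is to sample system.parts and system.merges at a fixed interval. This is a minimal sketch: the function names, the `CH_CLIENT` override, and the default interval and sample count are assumptions, and the output is plain TSV so it can be appended to a file and plotted afterwards.

```shell
# Assumption: CH_CLIENT is a hypothetical override; it defaults to clickhouse-client.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# One snapshot of merge pressure for a table: active part count, running merges,
# total rows, and the on-disk compression ratio.
part_snapshot() {
  db="$1"
  table="$2"
  $CH_CLIENT --query "
    SELECT
        now()                                                               AS ts,
        count()                                                             AS active_parts,
        (SELECT count() FROM system.merges
         WHERE database = '${db}' AND table = '${table}')                   AS running_merges,
        sum(rows)                                                           AS total_rows,
        round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_ratio
    FROM system.parts
    WHERE active AND database = '${db}' AND table = '${table}'
    FORMAT TSV"
}

# Take `samples` snapshots, sleeping `interval_s` seconds between them.
watch_parts() {
  db="$1"
  table="$2"
  interval_s="${3:-60}"
  samples="${4:-60}"
  i=0
  while [ "$i" -lt "$samples" ]; do
    part_snapshot "$db" "$table"
    i=$((i + 1))
    if [ "$i" -lt "$samples" ]; then
      sleep "$interval_s"
    fi
  done
}
```

Running `watch_parts default events 60 120 >> parts.tsv` alongside a two-hour ingestion test gives a time series in which an active_parts column that keeps climbing, rather than levelling off, indicates the insert rate is above what merges can sustain.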

Section 05

Query latency and concurrency

Query benchmarking has two dimensions that are easy to conflate: single-query latency and behavior under concurrency. A query that returns in 80 milliseconds in isolation may degrade sharply when fifty of them run at once, because they contend for CPU, memory, and I/O bandwidth. For customer-facing analytics, the figure that matters is the concurrency level the cluster sustains while holding p99 under the SLA — not the best-case latency of a single query on an idle system.

Benchmark latency across the percentiles and across realistic concurrency levels, increasing concurrency until p99 breaches your target to find the cluster’s true capacity ceiling. The chart below shows the representative effect of proper indexing and pre-aggregation on tail latency — the kind of improvement a structured tuning engagement targets.
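The concurrency sweep described above can be sketched with two small functions. The first repeats the article's clickhouse-benchmark invocation at several concurrency levels; the second takes "concurrency p99_ms" pairs, transcribed from those reports, and finds the last level before the SLA is breached. The function names, the `BENCH` override, and the sample figures in the usage note are hypothetical, and no attempt is made to parse the report because its layout can differ between versions.

```shell
# Assumption: BENCH is a hypothetical override; it defaults to clickhouse-benchmark.
BENCH="${BENCH:-clickhouse-benchmark}"

# Run the same query at each concurrency level given after the query argument.
sweep_concurrency() {
  query="$1"
  shift
  for c in "$@"; do
    echo "== concurrency ${c} =="
    $BENCH --concurrency "$c" --iterations 1000 --query "$query" 2>&1
  done
}

# Read "concurrency p99_ms" pairs (ascending concurrency) on stdin and print the
# highest concurrency reached before p99 first exceeds the SLA (0 if none passes).
capacity_ceiling() {
  sla_ms="$1"
  awk -v sla="$sla_ms" '
    $2 > sla { exit }      # first breach: stop, keeping the previous level
    { best = $1 }
    END { print best + 0 }'
}
```

For instance, with made-up results of 40, 95, 180, 410, and 1200 ms at concurrency 1, 8, 16, 32, and 64, `printf '1 40\n8 95\n16 180\n32 410\n64 1200\n' | capacity_ceiling 250` prints 16. Stopping at the first breach, rather than taking the highest passing level overall, avoids reporting a ceiling above a point where the SLA has already failed once.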

Representative p99 latency before and after tuning
Relative p99 on the same queries and hardware, before vs after sorting-key, skip-index, and projection work.

  • Dashboard aggregate: ~2,400 ms before, ~210 ms after
  • Funnel query: ~1,800 ms before, ~320 ms after

Representative outcomes from tuning engagements; results depend on schema, data, and hardware.

The takeaway is that latency is rarely a hardware problem first. The largest gains come from reducing the data each query reads — through the sorting key, data-skipping indexes, projections, and pre-aggregation — after which concurrency scales because each query consumes less of the shared resource pool. Benchmark before and after each change so the improvement is attributable and defensible.
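To check whether a sorting-key or skip-index change actually reduces the data a query reads, the query plan can be inspected before and after the change. The sketch below wraps `EXPLAIN indexes = 1`, which for MergeTree tables reports how many parts and granules each index selects. The function name and the `CH_CLIENT` override are assumptions for illustration.

```shell
# Assumption: CH_CLIENT is a hypothetical override; it defaults to clickhouse-client.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

# Show which indexes a query uses and how many parts and granules survive pruning.
explain_pruning() {
  query="$1"
  $CH_CLIENT --query "EXPLAIN indexes = 1 ${query}"
}
```

For example, `explain_pruning "SELECT count() FROM events WHERE event_date >= today() - 7"` uses the table and column from the earlier example. Comparing the selected-granule counts before and after a schema change gives a quick read on pruning without waiting for a full benchmark run, which should still follow to confirm the latency effect.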

Section 06

Hardware and cost efficiency

Throughput numbers are meaningless without the hardware context that produced them, and cost efficiency is increasingly the metric executives actually care about. ClickHouse scales well with CPU cores because query execution is heavily parallelized, benefits enormously from fast local NVMe storage for hot data, and uses memory as the cache that keeps the working set off disk. A benchmark should always record the instance type, core count, memory, and storage class, because a throughput figure without them is uninterpretable.

The modern decision is also architectural: local-disk clusters versus the separation of storage and compute, including object storage tiers and managed offerings such as ClickHouse Cloud. Each has a different cost-and-latency profile, and the right choice depends on whether your workload is latency-critical or cost-critical and how bursty it is. ClickHouse publishes sizing and hardware recommendations that make a sound baseline. The metric to optimize is cost per query or cost per ingested terabyte at the required latency — a figure that turns a benchmark into a budget.
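The cost metrics named above are simple arithmetic once a benchmark has produced a sustained rate. As a sketch, the two functions below compute cost per million queries and cost per ingested terabyte from an hourly infrastructure cost; the function names are hypothetical, the figures in the usage note are invented for illustration, and terabytes are decimal (1 TB = 1,000,000 MB).

```shell
# Cost per million queries = hourly cost / (sustained QPS * 3600) * 1,000,000
cost_per_million_queries() {
  hourly_cost="$1"
  sustained_qps="$2"
  awk -v cost="$hourly_cost" -v qps="$sustained_qps" \
    'BEGIN { printf "%.2f\n", cost / (qps * 3600) * 1000000 }'
}

# Cost per ingested terabyte = hourly cost / terabytes ingested per hour
cost_per_ingested_tb() {
  hourly_cost="$1"
  mb_per_sec="$2"
  awk -v cost="$hourly_cost" -v mbps="$mb_per_sec" \
    'BEGIN { printf "%.2f\n", cost / (mbps * 3600 / 1000000) }'
}
```

With an assumed cluster cost of 4.00 per hour, 120 QPS sustained at the target p99 gives `cost_per_million_queries 4.00 120` = 9.26, and 200 MB/sec of sustained ingestion gives `cost_per_ingested_tb 4.00 200` = 5.56. The sustained QPS used here should be the rate measured at the required latency, not the peak rate, or the resulting cost figure will be optimistic.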

Benchmark scaling explicitly rather than assuming it. ClickHouse parallelizes a single query across cores, so adding cores often reduces latency for large scans, but the benefit tapers once a query is bound by memory bandwidth or I/O rather than CPU. Adding nodes to a cluster increases aggregate throughput and concurrency but introduces network and coordination costs that a single-node test will not reveal. Measure how your specific workload responds to more cores, more memory, and more nodes, because the shape of that scaling curve — not a single data point — is what tells you whether to scale up, scale out, or tune first. Frequently the most cost-effective move is to make each query read less before adding any hardware at all.

Section 07

Common benchmarking mistakes

Most ClickHouse benchmarks fail in predictable ways, and recognizing the patterns saves weeks of misdirected effort. The most common is unrealistic data: uniformly random values that compress and prune far better than the skewed, correlated data of production, producing throughput numbers that evaporate on real workloads. Close behind is benchmarking at concurrency one, which measures a scenario that never occurs and ignores the contention that defines real capacity.

Other recurring mistakes include comparing a cold-cache run against a warm one, benchmarking a single query in isolation rather than a representative mix, ignoring the merge activity that small inserts generate, and reporting averages that conceal a punishing tail. Perhaps the most consequential is failing to record rows and bytes read, which leaves you unable to explain why a number changed and reduces the exercise to guesswork. A benchmark that cannot explain its own results cannot guide a decision.

ChistaDATA Insight

If a vendor benchmark does not state the hardware, the data distribution, the concurrency, and the cache state, treat it as marketing rather than evidence. The same rigor applies to your own internal tests — and it is exactly the rigor a ClickHouse consulting review brings.

Section 08

Turning benchmarks into decisions

A benchmark is a means, not an end. Its purpose is to answer concrete questions: will this cluster meet our latency SLA at projected growth, how much headroom remains before we must scale, which hardware delivers the lowest cost per query at our target latency, and what is the return on a tuning engagement. Frame every benchmark around the decision it must inform, and the methodology follows naturally from the question.

The discipline that connects measurement to decision is the same throughout this guide: benchmark against representative data and concurrency, capture the full set of metrics including data read, change one variable at a time, and report distributions rather than averages. Build a baseline you re-run as the workload evolves, and a regression becomes visible the moment it appears rather than after users complain. This is precisely the telemetry our engineers establish at the start of an engagement and re-measure at the end, so that every recommendation is backed by a number rather than an opinion.

Done well, benchmarking also reframes the cost conversation. Once you can express performance as cost per query and cost per ingested terabyte at a required latency, infrastructure spend becomes an engineering variable you can optimize rather than a fixed tax you absorb. A tuning engagement that halves the data each query reads does not merely make dashboards faster; it defers the next hardware purchase, lowers the cloud bill, and raises the concurrency ceiling at the same time. That is why we treat benchmarking as the opening move of every performance engagement — it converts a vague sense that the cluster is slow or expensive into a precise, prioritized list of the changes that will pay back fastest.

The benchmarking method in one view

Six principles, one discipline: measure your workload, not a leaderboard.

  • Benchmark your workload. Real data shape, real query mix, real concurrency — synthetic uniform data lies.
  • Measure the full set. Throughput, p95/p99 latency, QPS, and — critically — data read per query.
  • Batch ingestion aggressively. Insert strategy moves throughput by orders of magnitude; use async inserts when you cannot batch.
  • Test under concurrency. The capacity that matters is QPS at a fixed p99, not single-query best case.
  • Record the hardware and cache state. A number without context is uninterpretable; optimize cost per query.
  • Change one variable, report distributions, re-baseline. A benchmark that cannot explain its results cannot guide a decision.

Frequently asked questions

What tool should I use to benchmark ClickHouse?

Start with clickhouse-benchmark, which drives a configurable concurrency level and reports the full latency distribution plus rows and bytes processed per second. For standardized cross-system comparison, ClickBench provides a well-defined analytical workload. Whichever you use, always complement it with your own representative queries and capture system.query_log, because the standard suites cannot reflect your specific schema and access patterns.

Why are my benchmark numbers so different from published ClickHouse benchmarks?

Almost always because of data and conditions. Published benchmarks frequently use synthetic, uniformly distributed data that compresses and prunes far better than production data, run on hardware you do not have, and measure at a concurrency your application never produces. Benchmark on a copy of your real data, at your real concurrency, and record the hardware — the gap usually explains itself.

How do I maximize ClickHouse ingestion throughput?

Batch inserts aggressively, using large batches rather than row-by-row writes, because each insert creates at least one part and many small parts create merge pressure that degrades everything. Where the application cannot batch, enable asynchronous inserts so the server batches on your behalf. Then tune block size and insert parallelism, and monitor part counts in system.parts to confirm you are not overwhelming the merge scheduler.

Should I benchmark with averages or percentiles?

Percentiles, always. An average latency conceals the tail, and for customer-facing workloads the tail — p95 and p99 — is the experience users actually remember. Benchmark and report the distribution, and find the concurrency level at which p99 breaches your SLA to determine the cluster’s true capacity ceiling rather than its best case.

Does more hardware fix slow ClickHouse queries?

Rarely as the first move. In a columnar engine, latency is overwhelmingly a function of how much data each query reads, so the largest gains come from reducing data read — through the sorting key, data-skipping indexes, projections, and pre-aggregation. Tune first and benchmark the result; scale hardware once the schema and queries are efficient and you have measured a genuine capacity ceiling.

When should we engage a specialist to benchmark and tune ClickHouse?

When latency or cost targets are at risk, when planning a major scale-up or migration, or when internal benchmarks are producing numbers you cannot explain or trust. ChistaDATA provides ClickHouse performance audits, ongoing ClickHouse support, and managed services for teams that want a senior bench without carrying the headcount.

ChistaDATA Engineering Team

ChistaDATA Inc. is a specialist ClickHouse infrastructure operations firm delivering consulting, performance engineering, migration, and 24×7 support on 100% open-source ClickHouse with zero vendor lock-in. Read more on the ChistaDATA blog or explore ClickHouse consulting.

