Unlocking Success: Scaling ClickHouse from Gigabytes to Petabytes

Most teams don’t run into ClickHouse scaling challenges when they’re managing a few hundred gigabytes. The problems surface once data volumes cross the threshold where naive configurations begin to crack: queries slow down, ingestion pipelines stall, and what once felt like a high-performance analytics engine suddenly feels like it’s working against you. The good news? ClickHouse was designed from the ground up to scale. The not-so-good news? Getting it to actually scale to petabyte levels requires deliberate architectural decisions, careful tuning, and an honest understanding of the trade-offs involved.

This playbook draws from real-world production deployments and hard-won operational experience. Whether you’re running a managed ClickHouse cluster on ChistaDATA Cloud or operating self-managed nodes, the principles here apply consistently. Let’s walk through what actually matters.

Understanding ClickHouse’s Scaling Model

Before diving into configuration knobs and architectural patterns, it’s worth building a clear mental model of how ClickHouse scales. ClickHouse is fundamentally a columnar, append-oriented OLAP database. Its MergeTree storage engine writes data in parts, which are continuously merged in the background. This design choice has profound implications for how you scale: writes are fast and relatively cheap, but uncontrolled part proliferation can suffocate a cluster just as surely as underpowered hardware.

ClickHouse scales along two axes: vertical (more CPU cores, RAM, and faster NVMe storage per node) and horizontal (more shards and replicas). In practice, most production environments benefit from a hybrid approach — fat nodes with local NVMe storage for the hot tier, paired with a distributed sharding strategy across a cluster. Neither pure vertical nor pure horizontal scaling alone is sufficient at petabyte scale.

Each axis has its own costs, so it is worth being precise about what shards and replicas actually buy you.

The concept of shards and replicas is central to ClickHouse’s distributed architecture. Shards partition your data across nodes, enabling parallel query execution. Replicas provide redundancy and additional read capacity. A typical production cluster running tens of terabytes per day might use three shards with two replicas each — a 3×2 topology. At petabyte scale, you might operate 10–20 shards, each replica carrying hundreds of terabytes of compressed data.
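To make the topology arithmetic concrete, here is a minimal Python sketch. It assumes data is spread evenly across shards (real clusters rarely are perfectly balanced), and the `Topology` class and the example volumes are illustrative, not taken from any particular deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    shards: int
    replicas_per_shard: int

    @property
    def nodes(self) -> int:
        return self.shards * self.replicas_per_shard

    def compressed_tb_per_replica(self, total_compressed_tb: float) -> float:
        # Every replica of a shard holds a full copy of that shard's data,
        # so replicas add redundancy and read capacity, not storage capacity.
        return total_compressed_tb / self.shards

    def total_disk_tb(self, total_compressed_tb: float) -> float:
        # Disk consumed across the whole cluster, counting every copy.
        return total_compressed_tb * self.replicas_per_shard


small = Topology(shards=3, replicas_per_shard=2)    # the 3x2 layout
large = Topology(shards=16, replicas_per_shard=2)   # hypothetical petabyte-scale layout

print(small.nodes, small.compressed_tb_per_replica(90))
print(large.nodes, large.compressed_tb_per_replica(2000), large.total_disk_tb(2000))
```

The point of the sketch is that adding replicas never reduces the data each node must hold; only adding shards does.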

Storage Architecture: Choosing the Right Tier

Storage is where most petabyte-scale deployments either succeed or quietly accumulate technical debt. ClickHouse’s native support for tiered storage, configured in the storage_configuration section of the server configuration (config.xml), lets you define multiple disks, group them into volumes and policies, and move data between them automatically based on age (TTL rules) or on how full a volume is.

A practical tiered storage setup typically looks like this: a fast NVMe SSD tier for the most recent 7–30 days of data, followed by a larger spinning disk or network-attached storage (NAS) tier for data up to 180 days, and finally object storage (S3, GCS, or Azure Blob) for cold historical data. ClickHouse’s S3-backed MergeTree (available as a disk type via s3 in the disk configuration) makes cold storage genuinely practical — you’re no longer forced to archive data outside of ClickHouse, which used to be the only real option for cost management at scale.

One mistake teams repeatedly make is underestimating the importance of local disk throughput for merge operations. Merges are disk-intensive by nature — they read existing parts, sort and merge them, then write new parts. On a heavily loaded cluster, merges can consume 30–50% of disk I/O. If your local disk is a bottleneck, merge backlogs build up, query performance degrades (more parts to scan), and you enter a self-reinforcing spiral. The fix is typically faster NVMe drives or, for cloud deployments, provisioned IOPS volumes.

Partitioning and Primary Keys: Getting the Fundamentals Right

It sounds obvious, but bad partition and primary key design is the single most common root cause of poor performance at scale. ClickHouse partition pruning eliminates entire partitions from scans at query time — but only if your queries filter on the partition key. A table partitioned by toYYYYMM(event_date) will efficiently serve queries filtered by month. A table partitioned by an arbitrary UUID will not prune anything useful.

Keep partition granularity coarse enough to avoid having thousands of partitions. A rule of thumb: aim for each partition to contain between 1 GB and 100 GB of compressed data, and keep the total partition count under a few thousand. Overly granular partitioning (e.g., partitioning by day when you have years of data and hundreds of tables) creates operational overhead with no query benefit for most access patterns.
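A quick back-of-the-envelope check helps when choosing between daily and monthly partitions. The sketch below is only arithmetic under simplifying assumptions: data arrives at a constant rate, and a month is approximated as 30 days. The function names and the example volume are hypothetical.

```python
def partition_estimate(total_compressed_gb, retention_days, days_per_partition):
    """Return (partition_count, average_compressed_gb_per_partition)."""
    count = -(-retention_days // days_per_partition)  # ceiling division
    return count, total_compressed_gb / count


def within_rule_of_thumb(avg_gb, count, max_partitions=1000):
    # 1 GB to 100 GB per partition, and a partition count that stays modest.
    return 1 <= avg_gb <= 100 and count <= max_partitions


# Two years of data, 2 TB compressed in total.
for label, days in (("daily", 1), ("monthly", 30)):
    count, avg_gb = partition_estimate(2000, 730, days)
    print(label, count, round(avg_gb, 2), within_rule_of_thumb(avg_gb, count))
```

Both options fit the size window here, but the monthly layout needs far fewer partitions for the same data, which matters once the calculation is repeated across hundreds of tables.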

The primary key (by default the same as the ORDER BY key in MergeTree) determines how data is sorted on disk within each part. ClickHouse uses a sparse index, with one index entry per granule (8192 rows by default), rather than a per-row index. This means the primary key is most effective for equality and range filters on the leading columns, and progressively less effective for filters that touch only trailing columns, particularly when the columns ahead of them are high-cardinality. A common high-performance pattern is to lead with a low-cardinality time dimension (e.g., toDate(timestamp)), followed by a high-cardinality identifier (e.g., user_id or tenant_id). This supports both time-range queries and per-entity lookups efficiently.
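The sparse index idea can be illustrated with a toy model. This is a simplified Python sketch for a single-column key, not ClickHouse's actual implementation: it keeps the first key of every granule as a "mark" and uses binary search to decide which granules a range filter has to read. The function names are made up for the example.

```python
from bisect import bisect_left, bisect_right

GRANULE = 8192  # rows per granule, matching the default mentioned above


def build_marks(sorted_keys, granule=GRANULE):
    # One index entry per granule: the key of the granule's first row.
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]


def granules_to_read(marks, lo, hi):
    """Granule numbers that may contain keys in the closed range [lo, hi]."""
    # The granule just before the first mark >= lo can still hold matches.
    first = max(bisect_left(marks, lo) - 1, 0)
    # Granules whose first key is already > hi cannot match.
    last = bisect_right(marks, hi)
    return range(first, last)


keys = list(range(100_000))          # already sorted, like rows in a part
marks = build_marks(keys)
print(len(marks))                                   # 13 marks for 100,000 rows
print(list(granules_to_read(marks, 20_000, 20_010)))  # [2]
```

Only one granule out of thirteen is read for the narrow range, yet the index itself holds just thirteen entries. A filter on a column that is not part of the sort order gets no such shortcut, because its values are scattered across every granule.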

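For multi-tenant architectures, common in SaaS analytics platforms, consider making tenant_id the first column in ORDER BY (and, on a sharded cluster, part of the sharding key). This gives tenant-scoped queries strong data locality and lets the sparse index skip other tenants’ data, which is critical when a single cluster serves dozens or hundreds of customers with very different data volumes. Be cautious about also adding tenant_id to the partition key: every tenant multiplies the partition count, which works against the partition-count guidance above, so it is usually only reasonable when the number of tenants is small and stable.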

Ingestion at Scale: Balancing Throughput and Part Overhead

ClickHouse is extremely good at high-throughput ingestion when fed data in appropriately sized batches. Each INSERT statement creates at least one new part on disk. If you’re inserting rows one at a time or in tiny batches, you can create hundreds or thousands of parts per second, and the background merge process will never keep up. ClickHouse first slows inserts down (parts_to_delay_insert) and then rejects them with a “too many parts” error once the number of active parts in a partition exceeds parts_to_throw_insert. That limit defaulted to 300 for a long time and was raised to 3000 in more recent releases, so check the value for your version.

The solution is batching at the ingestion layer. Aim for inserts that are at least 100,000 rows per batch, and ideally 500,000 to 1,000,000 rows. For streaming workloads using Apache Kafka or Apache Pulsar, this means tuning your consumer batch sizes and flush intervals. The ClickHouse Kafka Engine provides native consumption from Kafka topics; configuring kafka_max_block_size and using a materialized view to write to a final MergeTree table is a proven production pattern.
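A client-side batcher is one way to enforce these batch sizes. The following is a minimal Python sketch, assuming a hypothetical `sink` callable that wraps whatever insert call your ClickHouse client provides; it flushes on a row-count threshold or on the age of the oldest buffered row. Note that the age check only runs when a new row arrives, so a production version would also need a timer or an explicit flush on shutdown.

```python
import time
from typing import Any, Callable, List


class InsertBatcher:
    """Accumulate rows and hand them to `sink` in large batches."""

    def __init__(self, sink: Callable[[List[Any]], None],
                 max_rows: int = 500_000, max_wait_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sink = sink            # hypothetical: performs one INSERT per call
        self._max_rows = max_rows
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._rows: List[Any] = []
        self._first_row_at = 0.0

    def add(self, row: Any) -> None:
        if not self._rows:
            self._first_row_at = self._clock()
        self._rows.append(row)
        too_big = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_row_at >= self._max_wait_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        # Call this on shutdown as well, so the tail of the stream is not lost.
        if self._rows:
            batch, self._rows = self._rows, []
            self._sink(batch)


batches: List[List[int]] = []
batcher = InsertBatcher(batches.append, max_rows=3, max_wait_s=3600)
for i in range(7):
    batcher.add(i)
batcher.flush()
print([len(b) for b in batches])  # [3, 3, 1]
```

The tiny `max_rows=3` is only there to keep the demonstration readable; the defaults reflect the batch sizes suggested above.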

For extremely high-throughput ingestion (billions of events per day), many teams adopt a Buffer table pattern. Data is written to an in-memory Buffer table, which accumulates rows and flushes to the underlying MergeTree table once configurable time, row, or byte thresholds are hit. This decouples the ingestion rate from the write rate to disk, absorbing bursts gracefully. The trade-off is durability: buffered rows live only in the memory of the node that received them, so a crash or other abnormal restart before a flush loses the buffered portion. That is acceptable for some use cases, not for others.
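The flush rule for Buffer tables is worth spelling out, because it is easy to misread. As documented for the Buffer engine, a flush happens when all of the minimum thresholds are met or when any one of the maximum thresholds is met. The Python sketch below models just that decision; the class and function names are hypothetical, the threshold values are illustrative, and whether the real engine compares with "greater than" or "greater than or equal" is an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferThresholds:
    min_time: float   # seconds
    max_time: float
    min_rows: int
    max_rows: int
    min_bytes: int
    max_bytes: int


def should_flush(t: BufferThresholds, age_s: float, rows: int, size_bytes: int) -> bool:
    all_min = (age_s >= t.min_time and rows >= t.min_rows
               and size_bytes >= t.min_bytes)
    any_max = (age_s >= t.max_time or rows >= t.max_rows
               or size_bytes >= t.max_bytes)
    return all_min or any_max


limits = BufferThresholds(min_time=10, max_time=100,
                          min_rows=10_000, max_rows=1_000_000,
                          min_bytes=10_000_000, max_bytes=100_000_000)

print(should_flush(limits, age_s=12, rows=500, size_bytes=40_000))         # False
print(should_flush(limits, age_s=12, rows=20_000, size_bytes=15_000_000))  # True
print(should_flush(limits, age_s=101, rows=500, size_bytes=40_000))        # True
```

The third case is the one that bounds data loss: even a nearly empty buffer is written out once max_time passes, so max_time is effectively the size of the window you are willing to lose on a crash.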

Distributed Query Execution and Cluster Topology

ClickHouse’s Distributed table engine is the mechanism that ties a sharded cluster together. A Distributed table is a virtual layer that sits atop local MergeTree tables on each shard — queries sent to the Distributed table fan out to all shards, execute locally, and results are merged at the initiating node. Understanding how this fan-out works is essential for building efficient queries at scale.
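The fan-out and merge steps can be sketched in a few lines of Python. This is a conceptual model of scatter-gather aggregation for an average per key, not ClickHouse's internal code: each shard reduces its own rows to a small mergeable state (sum and count), and the initiating node combines those states. The function names and sample rows are invented for the illustration.

```python
def local_partial(rows):
    # Runs on each shard: reduce local rows to a mergeable (sum, count) state.
    state = {}
    for key, value in rows:
        total, count = state.get(key, (0.0, 0))
        state[key] = (total + value, count + 1)
    return state


def merge_partials(partials):
    # Runs on the initiating node: combine the states, then finalize the average.
    merged = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            t, c = merged.get(key, (0.0, 0))
            merged[key] = (t + total, c + count)
    return {key: total / count for key, (total, count) in merged.items()}


shard_rows = [
    [("a", 1.0), ("b", 4.0)],   # rows stored on shard 1
    [("a", 3.0)],               # rows stored on shard 2
]
print(merge_partials(local_partial(rows) for rows in shard_rows))
# {'a': 2.0, 'b': 4.0}
```

What travels over the network is one small state per key per shard rather than the raw rows, which is why queries that aggregate early fan out well and queries that return large unaggregated result sets do not.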

One of the most impactful optimizations for distributed query performance is colocated JOINs. When two large tables are sharded by the same key (e.g., both distributed by a hash of user_id), all rows for a given key live on the same shard, so a JOIN between them can be executed entirely locally on each shard without cross-shard data shuffling. This is dramatically more efficient than a general distributed JOIN, which may require transferring large intermediate result sets across the network. The pattern requires careful schema design from the outset, but the query performance gains can be substantial, often 10–50x for complex analytical queries.
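The reason a shared sharding key makes local JOINs safe can be shown with a small simulation. In this Python sketch, `zlib.crc32` is only a stand-in for whatever hash your sharding expression uses (something like cityHash64(user_id) in a real cluster), and the tables, column names, and shard count are hypothetical.

```python
import zlib


def shard_for(user_id: int, num_shards: int) -> int:
    # Stand-in for a sharding expression; any deterministic hash works here.
    return zlib.crc32(str(user_id).encode()) % num_shards


def place(rows, num_shards):
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row["user_id"], num_shards)].append(row)
    return shards


def local_join(events, users):
    # A plain hash join, executed entirely within one shard.
    by_id = {u["user_id"]: u for u in users}
    return [{**e, **by_id[e["user_id"]]} for e in events if e["user_id"] in by_id]


events = [{"user_id": u, "event": f"e{i}"} for i, u in enumerate([1, 2, 3, 1, 4, 2])]
users = [{"user_id": u, "country": c}
         for u, c in [(1, "DE"), (2, "US"), (3, "IN"), (4, "BR")]]

N = 3
event_shards, user_shards = place(events, N), place(users, N)
colocated = [row for s in range(N)
             for row in local_join(event_shards[s], user_shards[s])]
global_join = local_join(events, users)

key = lambda r: (r["user_id"], r["event"])
print(sorted(colocated, key=key) == sorted(global_join, key=key))  # True
```

Because both tables route a given user_id to the same shard, the union of the per-shard joins equals the global join. If the two tables used different sharding keys, the per-shard joins would silently miss matches, which is why this has to be decided at schema design time.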

For clusters running diverse workloads — some requiring low-latency interactive queries, others running long batch analytics — consider implementing workload isolation through dedicated replica groups. ClickHouse’s load_balancing setting and preference configurations allow you to route different query types to different replicas, preventing heavy analytical jobs from impacting user-facing dashboards. At ChistaDATA, we’ve seen this pattern dramatically improve SLA compliance for mixed-workload clusters.

Memory Management and Query Optimization

Memory is often the silent bottleneck at petabyte scale. ClickHouse aggressively uses RAM for sorting, hashing (in GROUP BY and JOIN), and caching. Without proper limits, a single runaway analytical query can exhaust all available memory and get the server killed by the operating system. The max_memory_usage setting controls the per-query memory limit (historically 10 GB in the stock default profile; newer releases may ship without a per-query cap, so verify what your version uses), while max_server_memory_usage caps total server memory consumption.

For heavy aggregation workloads, enabling external aggregation via max_bytes_before_external_group_by allows GROUP BY operations to spill intermediate state to disk when memory limits are approached, rather than failing with an out-of-memory error. Similarly, max_bytes_before_external_sort enables disk-spill for ORDER BY operations. These settings come with a performance penalty (disk I/O), but they provide a safety net for queries that occasionally exceed in-memory capacity.
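A small helper can keep these settings consistent with each other. The Python sketch below derives the spill thresholds from the per-query memory limit and renders a SETTINGS clause. Setting the GROUP BY threshold to roughly half of max_memory_usage follows the commonly cited guidance; applying the same ratio to external sort is an assumption made for this example, and the helper names are hypothetical.

```python
def spill_settings(max_memory_usage_bytes: int) -> dict:
    half = max_memory_usage_bytes // 2
    return {
        "max_memory_usage": max_memory_usage_bytes,
        # Spill GROUP BY state to disk well before the hard limit is reached.
        "max_bytes_before_external_group_by": half,
        # Assumption: reuse the same ratio for ORDER BY spilling.
        "max_bytes_before_external_sort": half,
    }


def settings_clause(settings: dict) -> str:
    return "SETTINGS " + ", ".join(f"{name} = {value}" for name, value in settings.items())


print(settings_clause(spill_settings(10_000_000_000)))
```

The rendered clause can be appended to an individual heavy query, which keeps the slower disk-spilling behavior limited to the workloads that need the safety net.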

The ClickHouse query cache (available in recent versions) can significantly reduce repeated query latency for common analytical patterns. For dashboard workloads where many users run similar queries over the same time ranges, a well-configured query cache can cut database load by 30–70%. At a lower level, sizing mark_cache_size and uncompressed_cache_size appropriately for your available RAM ensures that frequently accessed index marks and data blocks stay in memory rather than being re-read from disk.

Replication and High Availability

ClickHouse replication is built on ZooKeeper (or its replacement, ClickHouse Keeper, which is now the preferred option). ReplicatedMergeTree tables use ZooKeeper to coordinate part merges and ensure replicas stay in sync. At scale, ZooKeeper becomes a critical dependency — a slow or overloaded ZooKeeper ensemble will directly impact insert latency and merge coordination.

For production clusters handling significant write throughput, run a dedicated ZooKeeper or ClickHouse Keeper ensemble (typically 3 or 5 nodes) that is not co-located with ClickHouse data nodes. Monitor coordination latency aggressively; sustained p99 latencies above 10–20ms are a warning sign. Choosing ClickHouse Keeper over ZooKeeper is strongly recommended for new deployments, and migrating existing clusters is worth planning: ClickHouse Keeper is purpose-built for ClickHouse’s replication patterns and has demonstrated better performance and operational simplicity in production.

A common replication anti-pattern at scale is requiring every insert to be written to all replicas before it is acknowledged (setting insert_quorum to the full replica count). While this maximizes durability, it adds significant tail latency to inserts, and inserts stall or fail when a replica is slow or unreachable. Many production deployments instead acknowledge the insert once one replica has written it and let the other replicas fetch the new parts in the background through the replication queue, accepting a small window of potential data loss in exchange for much better write latency.

Observability and Operational Tooling

At petabyte scale, you cannot operate a ClickHouse cluster without comprehensive observability. The system tables in ClickHouse are extraordinarily rich — system.query_log, system.part_log, system.replication_queue, and system.merges provide a detailed picture of cluster health. Building dashboards on top of these tables (ideally using ClickHouse itself as the analytics backend for its own metrics) gives you visibility into slow queries, merge backlogs, replication lag, and storage consumption patterns.

Key metrics to monitor continuously at scale include: parts per partition (alert above 200), active merges count and merge speed, replica lag (queue size and delay, as reported in system.replicas), query latency percentiles (p50, p95, p99), and disk write amplification ratio. Integrating ClickHouse metrics with Prometheus and Grafana via the built-in Prometheus endpoint (/metrics, once enabled in the server configuration) is straightforward and should be a day-one requirement for any production deployment.
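As one example of turning a system table into an alert, the sketch below pairs a query against system.parts with a small Python filter. The query text uses columns that exist in system.parts; how you execute it is left open, so the `rows` argument is simply assumed to be the query result as tuples, and the sample rows are made up.

```python
PARTS_QUERY = """
SELECT database, table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
"""


def partitions_over_threshold(rows, threshold=200):
    """rows: iterable of (database, table, partition, parts) tuples."""
    offenders = [row for row in rows if row[3] > threshold]
    # Worst offenders first, so the alert message leads with them.
    return sorted(offenders, key=lambda row: row[3], reverse=True)


sample = [
    ("analytics", "events", "202405", 42),
    ("analytics", "events", "202406", 260),
    ("analytics", "clicks", "202406", 310),
]
for database, table, partition, parts in partitions_over_threshold(sample):
    print(f"{database}.{table} partition {partition}: {parts} active parts")
```

Alerting on the per-partition count rather than the table total matters because the insert limits are also evaluated per partition.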

For managed ClickHouse deployments, ChistaDATA’s managed platform provides built-in monitoring, automated backups, and expert support — offloading much of the operational complexity described above while retaining the full performance and flexibility of ClickHouse. This is particularly valuable for teams that need petabyte-scale analytics capabilities without the overhead of a dedicated database operations team.

Backup, Recovery, and Data Lifecycle Management

Backups at petabyte scale are non-trivial. A full backup of a multi-petabyte cluster is impractical in terms of both time and storage cost. The practical solution is a combination of incremental backups and object storage tiering. ClickHouse’s native BACKUP command (available since version 22.x) supports incremental backups to S3-compatible object storage, backing up only new or changed parts since the last backup. For most production deployments, a daily incremental backup with a weekly full backup provides a reasonable balance of recovery speed and storage cost.
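The weekly-full, daily-incremental schedule can be expressed as a small statement generator. This Python sketch makes several assumptions that are not in the text above: the full backup runs on Sundays, each incremental is based on the most recent full backup rather than on the previous day, and backups go to a hypothetical disk named `backups` (an S3 destination could be used in its place). It only builds the SQL text; running it is left to your client.

```python
from datetime import date, timedelta


def backup_statement(table: str, day: date, disk: str = "backups") -> str:
    target = f"Disk('{disk}', '{table}-{day.isoformat()}.zip')"
    if day.weekday() == 6:  # Sunday: weekly full backup
        return f"BACKUP TABLE {table} TO {target}"
    # Monday is weekday 0, so the previous Sunday is weekday() + 1 days back.
    last_sunday = day - timedelta(days=day.weekday() + 1)
    base = f"Disk('{disk}', '{table}-{last_sunday.isoformat()}.zip')"
    return f"BACKUP TABLE {table} TO {target} SETTINGS base_backup = {base}"


print(backup_statement("db.events", date(2024, 1, 7)))   # a Sunday
print(backup_statement("db.events", date(2024, 1, 10)))  # the following Wednesday
```

Basing every incremental on the weekly full keeps restores to two steps at the cost of larger daily backups late in the week; chaining each incremental to the previous day is the opposite trade-off.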

Data lifecycle management — automatically expiring and deleting old data — is handled through TTL expressions in MergeTree. A well-designed TTL policy, combined with tiered storage, gives you a clean data pipeline: hot data on local NVMe, warm data on slower local disk or NAS, cold data in S3, and expired data deleted automatically. This architecture keeps your total storage cost predictable and manageable as data volumes grow.
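To show how the hot, warm, cold, and expired stages map onto a single TTL definition, here is a Python sketch that renders the clause. The volume names `warm` and `cold` are hypothetical and would have to match volumes defined in your storage policy; the day counts are examples only.

```python
def ttl_clause(column: str, warm_after_days: int, cold_after_days: int,
               delete_after_days: int,
               warm_volume: str = "warm", cold_volume: str = "cold") -> str:
    if not warm_after_days < cold_after_days < delete_after_days:
        raise ValueError("expected warm < cold < delete")
    return (
        f"TTL {column} + INTERVAL {warm_after_days} DAY TO VOLUME '{warm_volume}', "
        f"{column} + INTERVAL {cold_after_days} DAY TO VOLUME '{cold_volume}', "
        f"{column} + INTERVAL {delete_after_days} DAY DELETE"
    )


print(ttl_clause("event_date", 30, 180, 730))
```

Data younger than the first interval stays on the default (hot) volume of the policy, so the clause only needs to describe the transitions away from it.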

Common Scaling Pitfalls to Avoid

Teams scaling ClickHouse for the first time tend to make a consistent set of mistakes. Over-sharding is one of the most common: adding more shards than your data volume or query parallelism requires adds coordination overhead without improving performance. A cluster with 20 shards holding 10 TB total is almost certainly worse than a cluster with 4 fat shards holding the same data — more parts to coordinate, more network round trips per distributed query, and more ZooKeeper load.

Another frequent mistake is using ALTER TABLE ... DELETE for large-scale deletes. Unlike traditional databases, ClickHouse handles mutations (updates and deletes) as rewrite operations — a mutation touching millions of rows triggers a full rewrite of all affected parts. For high-frequency delete patterns, consider partitioning your data such that entire partitions can be dropped via DROP PARTITION, which is a metadata-only operation and essentially instantaneous.
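For time-based retention, the partition-drop approach reduces to listing the partitions that fall entirely before the cutoff. The Python sketch below assumes a table partitioned by toYYYYMM(event_date), so partition values are integers such as 202401; the table name and the list of existing partitions are hypothetical inputs you would read from system.parts.

```python
def expired_partition_drops(table: str, partition_ids, keep_from_yyyymm: int):
    """Build one DROP PARTITION statement per partition older than the cutoff."""
    return [
        f"ALTER TABLE {table} DROP PARTITION {pid}"
        for pid in sorted(partition_ids)
        if pid < keep_from_yyyymm
    ]


existing = [202311, 202312, 202401, 202402]
for statement in expired_partition_drops("analytics.events", existing, 202401):
    print(statement)
```

Each statement removes a whole partition without rewriting any surviving data, in contrast to a mutation that must rewrite every part containing a matching row.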

Finally, avoid the temptation to normalize your schema for ClickHouse. ClickHouse is optimized for wide, denormalized tables. JOINs in ClickHouse, while supported, are more expensive than in row-oriented databases and become significantly more costly at scale. The ClickHouse idiom is to denormalize at ingestion time — flatten your data model, duplicate dimension data into fact tables, and use dictionaries for lookup enrichment rather than runtime JOINs.
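Denormalizing at ingestion time can be as simple as an in-memory lookup applied before rows are batched. The Python sketch below is illustrative only: the `plan` dimension, the field names, and the fallback value are all hypothetical.

```python
def denormalize(events, plans_by_tenant, default_plan="unknown"):
    # Copy the dimension attribute onto every fact row before it is inserted,
    # so queries can filter or group by it without a runtime JOIN.
    return [
        {**event, "plan": plans_by_tenant.get(event["tenant_id"], default_plan)}
        for event in events
    ]


plans = {1: "enterprise", 2: "free"}
events = [
    {"tenant_id": 1, "event": "login"},
    {"tenant_id": 3, "event": "export"},
]
print(denormalize(events, plans))
```

The cost of this approach is that a later change to a tenant's plan does not rewrite rows that were already inserted, which is the usual trade-off of duplicating dimension data into fact tables.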

The Path Forward

Scaling ClickHouse from gigabytes to petabytes is not a single event — it’s an ongoing process of measurement, tuning, and architectural evolution. The teams that succeed are those that invest early in understanding ClickHouse’s internals, build observability from day one, and treat schema design with the same rigor they’d apply to any critical systems architecture decision.

The good news is that ClickHouse has a remarkable track record at petabyte scale. Yandex, Cloudflare, ByteDance, and many others run some of the world’s largest ClickHouse deployments, and the engineering investments the ClickHouse team has made in scalability over the past several years have been substantial. With the right foundation — solid schema design, appropriate hardware tiering, disciplined ingestion, and comprehensive observability — ClickHouse can handle virtually any analytical workload you throw at it.

If you’re evaluating your current ClickHouse architecture or planning a migration to a larger scale, the team at ChistaDATA has helped numerous organizations navigate exactly these challenges. Our ClickHouse consulting and managed services combine deep technical expertise with production-proven tooling to accelerate your journey — whether you’re at 100 GB today or planning for 100 PB tomorrow.

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services
