The first time a production ClickHouse table runs into trouble, the symptoms are almost always the same — queries that scanned 200 milliseconds last week now take 8 seconds, the merge queue is climbing, and a SELECT that touched a single day suddenly reads three months of compressed columns off disk. In every one of these situations our engineering team at ChistaDATA has investigated, the root cause traces back to four design decisions made on the day the table was created: how data is split into parts, how partitions are keyed, how granules are sized, and how the ClickHouse primary index design is structured. Get these four right and ClickHouse stays fast through orders-of-magnitude growth. Get any of them wrong and no amount of hardware will rescue the workload.
This article is a practitioner’s deep dive into all four layers. It is the same material we walk through with customers during schema reviews, only without the whiteboard. By the end you will have a working mental model of how MergeTree stores data on disk and a set of design heuristics that hold up at petabyte scale.
Why MergeTree Internals Matter
ClickHouse owes most of its performance reputation to one engine family — MergeTree and its specialized variants (ReplicatedMergeTree, ReplacingMergeTree, AggregatingMergeTree, CollapsingMergeTree, SummingMergeTree). The variants differ in how rows are reconciled at merge time, but the storage substrate is identical. Understand MergeTree and you understand 90% of what makes the engine fast.
The engine is built around four ideas that interact constantly:
- Immutable parts — every INSERT writes a new directory of column files; nothing is updated in place.
- Background merges — small parts are continuously combined into larger sorted parts in the background.
- Sparse primary index — one index entry per granule of rows, not per row, keeping the index small enough to live in RAM.
- Partition pruning — entire directories are skipped at query planning time based on the partition key.
When a query arrives, the engine walks down this hierarchy: prune partitions, scan the primary index of each surviving part, identify the granules whose key range matches the filter, and read only those granules from compressed column files. The faster each step eliminates data, the faster the query returns. That is the entire story — and every design decision in this article exists to make one of those steps eliminate more data. For broader context on how our team approaches column-store engineering across the customer base, the engineering notes published at ChistaDATA complement what follows here.
Parts: The Atomic Unit of MergeTree Storage
A part is the smallest unit ClickHouse reads, writes, replicates, or freezes. Physically, a part is a directory on disk containing one binary file per column (in Wide format) or a single file holding all columns (in Compact format), plus mark files, the primary index file, checksums, and a small metadata file. The directory name encodes the partition ID, the minimum block number, the maximum block number, the mutation version, and the merge level — for example, 202605_1_47_3 means partition 202605, blocks 1 through 47, third merge level.
Understanding the ClickHouse primary index design is crucial for optimizing query performance and ensuring efficient data retrieval.
Every INSERT statement produces at least one new part. The data is sorted in memory by the table’s ORDER BY clause, written to disk, and registered with the metadata store. Once written, a part is immutable. Updates and deletes do not edit parts — they create new parts that supersede ranges of older parts, and old parts are dropped only after the background merge process retires them.
This design has consequences that surprise teams arriving from row-store databases:
- Inserts must be batched. A thousand single-row INSERT statements per second creates a thousand parts per second, which the merge scheduler cannot keep up with. The well-known “Too many parts” error is the system telling you the merge queue is drowning.
- Deletes are never cheap. A DELETE statement is implemented as a mutation that rewrites every affected part. On a 2 TB table, a single DELETE can rewrite hundreds of gigabytes of data.
- Replication operates on parts. ReplicatedMergeTree synchronizes whole parts across replicas, which is why network and disk bandwidth between replicas often becomes the bottleneck before CPU does.
Inspecting parts in production is straightforward — every running cluster exposes the truth through system.parts:
SELECT
database,
table,
partition,
name,
rows,
formatReadableSize(bytes_on_disk) AS size,
formatReadableSize(primary_key_bytes_in_memory) AS pk_in_mem,
active,
level
FROM system.parts
WHERE database = 'analytics' AND table = 'events'
ORDER BY modification_time DESC
LIMIT 25;
If the count of active parts per partition is climbing past a few hundred and never coming back down, the merge engine cannot keep up with the insert rate. That is a signal to either increase batch sizes upstream, raise merge concurrency, or revisit the partition key — usually all three.
Partitions: Logical Grouping, Not Physical Sharding
Partitions in ClickHouse are widely misunderstood, mostly because the word “partition” means three different things in three different systems. Here, a partition is a logical grouping of parts that share the same value of the PARTITION BY expression. Partitions are not shards. They do not distribute data across nodes — that is the job of Distributed tables and sharding keys.
What partitions actually buy you is twofold. First, partition pruning at query planning time: if a query filters on the partition expression, the engine drops every part outside the matching partitions before the primary index is even consulted. Second, partition-level operations: dropping, detaching, freezing, and moving data between storage tiers all work at partition granularity and execute in roughly constant time regardless of partition size.
The most common partition keys we see in production:
-- Monthly — sensible default for event data with multi-year retention PARTITION BY toYYYYMM(event_time) -- Daily — appropriate when daily retention or daily TTL applies PARTITION BY toDate(event_time) -- Weekly — uncommon but useful for medium-volume tables PARTITION BY toMonday(event_time) -- Composite — partition by date AND a low-cardinality dimension PARTITION BY (toYYYYMM(event_time), region)
The single biggest mistake we see is over-partitioning. Partitioning by toStartOfHour on a high-cardinality timestamp produces a part directory for every hour, which means thousands of partitions and a merge tree that cannot consolidate work efficiently. The MergeTree engine is designed to keep the total partition count modest — most well-designed tables we operate carry between 30 and 200 partitions, occasionally up to 1000 for very long-retention archives. Beyond that, both metadata operations and the merge scheduler start to feel the strain.
The complementary mistake is under-partitioning. Keeping a single partition for a 5 TB table means every DROP, every ALTER, and every TTL action has to traverse the entire dataset. We recommend a partition cadence that aligns with the natural retention boundary of the data — if rows expire monthly, partition monthly; if hourly retention applies, partition daily and use TTL DELETE to manage the rest.
Choosing the Partition Key in Practice
Three questions usually settle the partition design:
- What is the dominant filter in 80% of queries? If most queries filter by event_time over a day or month window, partition by toDate or toYYYYMM of event_time.
- What is the natural retention or lifecycle boundary? If data is dropped or moved to cold storage on a monthly cadence, monthly partitioning aligns retention with the operational primitive.
- What does the insert pattern look like? If inserts always carry data from a narrow time window, the partition key should match — otherwise every insert touches multiple partitions and amplifies merge load.
Migrating partitioning later is possible but painful — it requires creating a new table and copying data, since the partition key is fixed at CREATE TABLE time. This is one of the cases where an hour of design review at the start of a project saves a weekend of operations work six months later. The database engineering disciplines we apply across our consulting engagements at MinervaDB for PostgreSQL and MySQL apply equally to column-store schemas — schema design is the highest-leverage decision in the system lifecycle.
Granules and index_granularity: The Sparse Index Building Block
Inside each part, rows are not indexed individually. Instead, consecutive rows are grouped into granules, and the primary index stores one entry per granule. This is what makes the primary index sparse, and it is the design choice that lets the engine keep the index in memory even for tables with hundreds of billions of rows.
The granule size is governed by the index_granularity setting, which defaults to 8192 rows. A part containing 80 million rows therefore has roughly 10,000 granules and 10,000 entries in its primary index. At 16 bytes per index entry for a typical key, the primary index for the entire part fits in 160 KB.
CREATE TABLE events
(
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
SETTINGS index_granularity = 8192;
When a query arrives, the primary index identifies the granules whose key range overlaps the WHERE clause, then reads only those granules from the column files. The default of 8192 is a reasonable compromise across most workloads, but it is not always correct. The right value depends on the access pattern:
- Analytical scans over wide ranges: the default 8192 or even larger values (16384, 32768) reduce index size and improve sequential read throughput.
- Point lookups and narrow filters: smaller granules (1024 or 2048) reduce the amount of data read per match, at the cost of a larger in-memory index.
- Tables with very wide rows: use index_granularity_bytes (adaptive granularity, default 10 MB) to size granules by bytes rather than row count. Adaptive granularity has been the default since version 19.11 and is almost always the right choice for modern deployments.
One thing that surprises teams: index_granularity is set per-part and cannot be changed retroactively. Existing parts retain whatever granularity they were written with. Changing the setting only affects new parts going forward, and the old granules remain until merges replace them. If a granule size needs to change at scale, the practical path is to create a new table with the desired setting and copy the data over.
Inspect granule effectiveness with EXPLAIN indexes = 1:
EXPLAIN indexes = 1 SELECT count() FROM events WHERE event_time BETWEEN '2026-05-01' AND '2026-05-02' AND user_id = 4711;
The output reports how many granules each index pruned. If a query reads 50,000 granules to return 10 rows, the primary index is doing almost nothing useful and the design needs revisiting.
Primary Key vs. Sorting Key: A Distinction That Trips Up Most Teams
This is the single most consistently misunderstood corner of MergeTree. In standard SQL, “primary key” means a uniqueness constraint and usually implies a unique index. Here it means neither.
- The sorting key, defined by ORDER BY, controls the physical row order on disk. Within each part, rows are stored sorted by the sorting key. This drives compression efficiency and read locality.
- The primary key, defined by PRIMARY KEY, defines which prefix of the sorting key is materialized in the sparse primary index. If PRIMARY KEY is omitted, it defaults to the entire ORDER BY.
The primary key does not enforce uniqueness. It is purely a query-skipping mechanism. Two rows with identical primary key values are perfectly legal and coexist happily in the same part.
The reason the engine separates the two concepts is that the sorting key often needs to be long for compression and data layout reasons, while the primary index needs to be short to stay memory-resident. By specifying a primary key shorter than the sorting key, you keep the in-memory index small without giving up the locality benefits of a longer sort order:
CREATE TABLE clickstream
(
event_time DateTime,
user_id UInt32,
session_id UInt64,
page_url String,
duration UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id, session_id)
PRIMARY KEY (event_time, user_id);
In this design the rows are physically sorted by (event_time, user_id, session_id), so all events for a given user within a second land in the same granule. The primary index, however, only carries (event_time, user_id) — significantly smaller in memory, and still effective for the common filter patterns. The session_id remains useful for compression and ordering but does not consume index memory.
Designing the Primary Key for Real Workloads
Primary key design is more art than algorithm. The decisions that drive it are the actual query patterns, the cardinality of the candidate columns, and the order in which filters typically appear. Some heuristics that have served us well across hundreds of production schemas:
1. Put Low-Cardinality Columns First
If queries frequently filter by a coarse dimension like tenant_id or region before drilling down, that low-cardinality column should lead the primary key. The reason is mechanical — the sparse index is most effective when consecutive granules share key prefixes. A primary key of (tenant_id, event_time) on a multi-tenant table lets the engine skip every granule outside the tenant before evaluating the time filter.
2. Avoid Putting High-Cardinality Columns First
Leading the primary key with a UUID or a user_id makes the granule key ranges effectively random. The index degenerates into a sparse hash, and skipping becomes ineffective. We have seen 100× query latency improvements simply by demoting a high-cardinality column from the leading position of the primary key.
3. Match the Common Filter Order
If 80% of queries filter on event_time first and event_type second, the primary key should be (event_time, event_type, …). The primary index only helps when the filter aligns with the leading columns — a filter on event_type alone cannot use a primary key that starts with event_time.
4. Use LowCardinality and Codecs for Compression
The sorting key drives compression. Columns adjacent in sort order compress together. Wrapping a sorted string column in LowCardinality(String) can cut storage by 5–10× when the cardinality is below about 10,000. Codecs like Delta, DoubleDelta, and Gorilla applied to timestamp and numeric columns in the sort key further reduce on-disk size.
CREATE TABLE metrics
(
ts DateTime CODEC(DoubleDelta, ZSTD(1)),
host LowCardinality(String),
metric_name LowCardinality(String),
value Float64 CODEC(Gorilla, ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (host, metric_name, ts)
PRIMARY KEY (host, metric_name);
5. Add Skip Indexes for Secondary Filters
For columns that appear in filters but cannot fit in the primary key, data-skipping indexes (minmax, set, bloom_filter, tokenbf_v1, ngrambf_v1) provide a second pruning layer. Skip indexes are themselves granular — they are evaluated on groups of granules — and they complement rather than replace the primary index. Layered pruning is a pattern that appears across modern analytical engines, and academic treatments of multi-level skipping are well documented in papers indexed on arXiv.
Compact vs. Wide Parts: When Each Format Wins
MergeTree writes parts in one of two physical formats. The choice is controlled by two thresholds:
- min_bytes_for_wide_part (default 10485760, i.e. 10 MB)
- min_rows_for_wide_part (default 0, disabled)
If a new part is smaller than the threshold, it is written in compact format — all columns concatenated into a single file with one mark file. Larger parts are written in wide format, with one binary file per column and per-column mark files.
Compact parts are dramatically more efficient for small inserts because they avoid the per-column file overhead. On filesystems that handle large numbers of small files poorly (any cloud object storage, in particular), compact format eliminates a serious performance trap. Wide parts, on the other hand, allow columnar I/O — a query that reads only two of forty columns reads only those two files. Once parts grow beyond the threshold, that selectivity dominates and wide format wins.
The defaults are sensible for most workloads. The places we tune them are large multi-column tables where the wide-format threshold needs to drop (to enable columnar I/O sooner) and high-frequency ingestion tables on object storage where compact parts need to persist longer to avoid the small-file penalty.
Merges, Mutations, and the “Too Many Parts” Failure Mode
Background merges are what keep MergeTree from collapsing under the weight of millions of small parts. The merge scheduler picks small parts within the same partition, reads them in sorted order, writes a new combined part, and retires the originals. The process runs continuously in the background, prioritized by part size, level, and age.
The settings that govern the behavior most often:
- background_pool_size — number of threads available for merges (default scales with CPU count).
- parts_to_throw_insert — when the active part count per partition exceeds this number, the engine rejects new inserts. Default is 3000.
- parts_to_delay_insert — the soft threshold (default 1000) at which inserts are deliberately slowed to let merges catch up.
- max_bytes_to_merge_at_max_space_in_pool — caps the size of merges so a single huge merge does not monopolize the pool.
The “Too many parts” error is not a bug. It is the system protecting itself from a workload pattern it cannot sustain. The right response is rarely to raise the threshold — that just delays the symptom. The right response is to find the cause: usually small batches, an over-partitioned table, or inserts hitting too many partitions per statement.
Mutations (ALTER UPDATE, ALTER DELETE) deserve their own caution. A mutation runs asynchronously and rewrites every part that contains a matching row. On a wide table with months of data, a single mutation can churn terabytes of storage and saturate disk I/O for hours. The system.mutations table shows pending and in-flight mutations; we monitor it closely on customer clusters and warn teams against using mutations for routine workflows. For replacing-by-key semantics, ReplacingMergeTree or a CollapsingMergeTree variant is almost always the right tool.
Operational Checklist for Production MergeTree
The schema review checklist we walk through on every ChistaDATA engagement compresses to roughly twelve items. Most production incidents we get called into trace back to a missing item from this list:
- Partition count stays between 30 and 200 under steady state; alerts fire at 1000.
- Active parts per partition stays under 300; alerts fire at 1000.
- Primary key length is short and leads with low-cardinality columns.
- Sorting key extends beyond the primary key when it improves compression.
- index_granularity is left at default unless a clear access pattern justifies a change.
- LowCardinality and codecs applied to repetitive and sequential columns.
- Skip indexes added for columns that appear in filters but cannot lead the primary key.
- TTL clauses set explicit lifecycle for retention and tiering.
- Insert batch size is at least 1000 rows and ideally 100,000+ per INSERT.
- ReplicatedMergeTree used in any cluster larger than a single node.
- Mutations restricted to operational maintenance, not routine workflows.
- system.parts, system.merges, and system.mutations exported to monitoring.
For real-world tuning we lean on conference talks and engineering write-ups from operators who run column-store analytics at scale, including material curated by the CNCF ecosystem and peer-reviewed analytical-database research from venues like VLDB. The fundamentals do not change quickly; the operational lessons compound.
Key Takeaways
- Parts are immutable directories of column files, and every INSERT creates one — batch your inserts.
- Partitions are a logical grouping for pruning and lifecycle operations, not a sharding mechanism; keep the count modest.
- Granules are the sparse-index unit. The default of 8192 is reasonable; adaptive granularity by bytes is the modern default.
- Sorting key and primary key are distinct — the sorting key drives physical order and compression, the primary key drives the in-memory sparse index.
- Lead the primary key with low-cardinality columns that match the dominant filter pattern.
- Layer skip indexes on top of the primary index for secondary filter columns that cannot lead the sort.
- “Too many parts” is a workload signal, not a configuration ceiling — fix the cause rather than the limit.
- Mutations are expensive and asynchronous; design schemas so routine workflows do not depend on them.
How ChistaDATA Can Help
At ChistaDATA we design, deploy, tune, and operate column-store infrastructure for customers running everything from sub-second observability platforms to petabyte-scale analytical warehouses. Our engineering team handles schema design reviews, capacity planning, migration from row stores, replication and sharding topology, monitoring, and 24×7 production operations. Whether the workload is on bare metal, AWS, GCP, or Azure, we bring the same operational discipline we apply across the wider ChistaDATA practice. If a deployment is heading into production or already missing performance and reliability targets, schedule a consultation with our engineering team and we will help map the right design to the workload. For broader full-stack database infrastructure engagements spanning PostgreSQL, MySQL, and analytical engines, our sister practice at MinervaDB works alongside us on cross-stack architectures.
Frequently Asked Questions
What is the difference between a part and a partition?
A partition is a logical grouping defined by the PARTITION BY expression — all parts whose partition key evaluates to the same value belong to the same partition. A part is the physical unit of storage, a directory of column files written by an INSERT or produced by a merge. A single partition typically contains many parts, which background merges gradually consolidate into fewer, larger parts.
What index_granularity value should I use?
Leave it at the default of 8192 unless there is a specific reason to change it. Reduce it (to 1024 or 2048) only for point-lookup-heavy workloads where reducing rows-per-granule meaningfully cuts unnecessary reads. For wide-row tables, prefer adaptive granularity via index_granularity_bytes, which has been the default since version 19.11.
Can I change the primary key or partition key after the table is created?
Partitioning cannot be changed on an existing table; you must create a new table with the desired partition key and migrate the data. The primary key can be extended (with ALTER TABLE … MODIFY ORDER BY adding columns) but cannot be reordered or have leading columns removed without a full rebuild. Plan both choices carefully at schema design time.
Why does my table report “Too many parts”?
The active part count per partition has crossed the configured threshold (default 3000). This almost always means inserts are arriving in batches too small for the merge scheduler to consolidate, the table is over-partitioned, or each insert touches too many partitions. The fix is to increase batch size, reduce partition cardinality, or align inserts to a narrow time window — not simply raising the threshold.
When should I use ReplacingMergeTree instead of mutations?
Always prefer ReplacingMergeTree when the workload requires replacing rows by key — ingesting CDC streams, deduplicating event data, or maintaining a latest-state table. Mutations are designed for rare operational maintenance, not routine deduplication. ReplacingMergeTree handles the same semantics through the merge process at a fraction of the I/O cost.
How do partitions and shards interact in a distributed cluster?
Partitions are local to each node — they govern how data is organized within a table on a single shard. Sharding distributes rows across nodes using a separate sharding key declared on the Distributed table. The two work independently: a query is routed across shards by the sharding key, and within each shard, partition pruning eliminates irrelevant local data. Designing both well is essential for distributed performance.