ClickHouse Query Optimization for Petabyte-Scale Analytics
Query optimization is where ClickHouse either sings or suffers. A well-designed query on a well-designed schema prunes a multi-terabyte table down to a few megabytes of reads and returns in milliseconds; a poorly designed query scans the same data in minutes. This ChistaDATA guide details the optimization techniques that turn petabyte-scale ClickHouse deployments into interactive analytics platforms.

The ClickHouse Query Optimization Mental Model
Every ClickHouse query optimization reduces to one of three goals:
- Read fewer rows from disk. The single highest-leverage dimension. Primary index pruning, data skipping indexes, and projections all attack this.
- Do less work per row. Vectorized execution, smart column types, and avoiding materializing unnecessary columns reduce CPU.
- Parallelize effectively. More threads, more shards, and smarter JOIN strategies reduce wall-clock time.
The ClickHouse official guide to faster queries articulates the philosophy succinctly: tables are designed to receive millions of row inserts per second and store hundreds of petabytes, so fast query speed depends on properly utilizing a table’s sparse primary index to drastically limit the data ClickHouse needs to read from disk.
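You can see how sparse the index is for your own tables via system.parts: each mark covers one granule (index_granularity rows, 8192 by default), so the rows-to-marks ratio shows how many rows every pruned granule saves. A quick sketch, using the illustrative events table from this guide:

```sql
-- Rows and index marks per active part; rows / marks ≈ index_granularity
SELECT
    table,
    sum(rows)  AS total_rows,
    sum(marks) AS total_marks,
    round(sum(rows) / sum(marks)) AS rows_per_granule
FROM system.parts
WHERE active AND table = 'events'
GROUP BY table;
```

Eliminating a single granule therefore skips thousands of rows, which is why the rest of this guide keeps returning to granule pruning.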
Reading the EXPLAIN Output Like a Performance Engineer
Every optimization starts with EXPLAIN. ClickHouse offers four EXPLAIN modes, each useful at a different stage:
-- Query after syntax-level rewrites
EXPLAIN SYNTAX
SELECT count() FROM events WHERE country = 'IN';

-- Physical plan with pipeline stages
EXPLAIN PIPELINE
SELECT count() FROM events WHERE country = 'IN';

-- Index usage and row estimates
EXPLAIN indexes = 1
SELECT count() FROM events
WHERE country = 'IN' AND event_time > today() - 7;

-- Query pipeline graph (for visualization)
EXPLAIN PIPELINE graph = 1
SELECT country, count() FROM events GROUP BY country;
EXPLAIN indexes = 1 is the single most useful variant for optimization. It tells you which indexes were considered, which were used, how many granules were eliminated, and how many rows will actually be read. A good query eliminates 95%+ of granules before any row is materialized.
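After the fact, granule elimination shows up as fewer rows read in system.query_log. A sketch for confirming that a rewrite actually reduced I/O (the filter on the query text is illustrative):

```sql
-- Compare rows and bytes read across recent runs of a query
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%FROM events%'
ORDER BY event_time DESC
LIMIT 5;
```

If read_rows barely moves after an optimization, the index or projection you added is not being used.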
Primary Key and Sort Key Optimization
The ORDER BY clause of a MergeTree table determines the sort order on disk, which in turn determines the primary index. Five principles we apply at ChistaDATA:
- Lead with the columns most commonly used in equality filters. Low-cardinality, high-selectivity columns come first.
- Place time columns last. This lets range filters on time prune granules efficiently while allowing equality filters on higher-level dimensions to eliminate entire regions of the index.
- Avoid high-cardinality columns early in the sort key. They bloat the index and reduce granule compression.
- Use PRIMARY KEY separately when it differs from ORDER BY. You can have a shorter primary key than sort key to reduce memory footprint.
- Do not use hash columns in the sort key. They defeat compression and index effectiveness.
-- Good: low-cardinality columns lead, time last
ORDER BY (tenant_id, event_type, country, event_time)

-- Bad: high-cardinality column first
ORDER BY (user_id, event_time, tenant_id)

-- Good: primary key shorter than the sort key
ORDER BY (tenant_id, event_type, country, event_time)
PRIMARY KEY (tenant_id, event_type)
Data Skipping Indexes: Bloom Filters, MinMax, and Set
Data skipping indexes accelerate filters on columns that are not part of the primary key. Each index type suits a different filter pattern:
| Index Type | Best For | Typical GRANULARITY |
|---|---|---|
| minmax | Range filters on numeric or date columns | 1 |
| set(N) | Equality filters on low-cardinality columns (up to N distinct values) | 2-4 |
| bloom_filter | Equality filters on high-cardinality columns | 2-4 |
| ngrambf_v1 | LIKE / substring search on text | 3-4 |
| tokenbf_v1 | Full-token search on text | 3-4 |
ALTER TABLE events
    ADD INDEX idx_user_id user_id TYPE bloom_filter(0.01) GRANULARITY 4,
    ADD INDEX idx_country country TYPE set(256) GRANULARITY 4,
    ADD INDEX idx_revenue revenue TYPE minmax GRANULARITY 2,
    ADD INDEX idx_url_search url TYPE ngrambf_v1(4, 16384, 3, 0) GRANULARITY 4;

-- Materialize the indexes for existing parts
ALTER TABLE events MATERIALIZE INDEX idx_user_id;
ALTER TABLE events MATERIALIZE INDEX idx_country;
ALTER TABLE events MATERIALIZE INDEX idx_revenue;
ALTER TABLE events MATERIALIZE INDEX idx_url_search;
A skipping index is only useful if the filter is selective. Adding a bloom filter on a column where 80% of rows match a filter value wastes space without accelerating queries. Profile representative queries with EXPLAIN indexes = 1 before committing to an index set.
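Selectivity is cheap to measure before committing to an index. A sketch for an equality filter on country (the threshold interpretation is a rule of thumb, not a ClickHouse limit):

```sql
-- How selective is an equality filter on country?
SELECT
    count()            AS distinct_values,
    sum(cnt)           AS total_rows,
    max(cnt) / sum(cnt) AS worst_case_match_ratio  -- near 1.0 => index won't help
FROM (SELECT country, count() AS cnt FROM events GROUP BY country);
```

Existing skipping indexes and their on-disk footprint can be audited in system.data_skipping_indices, which makes it easy to drop the ones that never prune anything.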
Projections: Alternative Sort Orders and Pre-Aggregations
A projection is an alternative physical layout of a table — a different sort order, a pre-aggregation, or both — maintained automatically by ClickHouse. When a query matches a projection, the optimizer routes reads through the projection instead of the base table.
ALTER TABLE events
ADD PROJECTION proj_by_user (
SELECT *
ORDER BY (user_id, event_time)
);
ALTER TABLE events
ADD PROJECTION proj_country_daily (
SELECT
toDate(event_time) AS day,
country,
count() AS events,
sum(revenue) AS revenue,
uniqExact(user_id) AS users
GROUP BY day, country
);
-- Materialize projections for existing parts
ALTER TABLE events MATERIALIZE PROJECTION proj_by_user;
ALTER TABLE events MATERIALIZE PROJECTION proj_country_daily;
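Queries stay written against the base table; when projection optimization is enabled (optimize_use_projections in recent releases), the optimizer routes matching reads through the projection automatically. A sketch of a query that proj_country_daily can serve, with a plan check:

```sql
-- Can be answered entirely from proj_country_daily
SELECT
    toDate(event_time) AS day,
    country,
    count()      AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY day, country
ORDER BY day, country;

-- Verify routing: the read step of the plan should name the projection
EXPLAIN indexes = 1
SELECT toDate(event_time) AS day, country, count()
FROM events GROUP BY day, country;
```

If the plan still reads the base table, the query shape (expressions, filters) does not match the projection definition closely enough.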
JOIN Strategies at Scale
- Dictionaries. For small dimension tables (up to a few million rows), a dictionary is a memory-mapped lookup that replaces the JOIN with a function call.
- GLOBAL JOIN for distributed queries. Without GLOBAL, each shard independently evaluates the right-hand table, multiplying memory usage by the shard count; GLOBAL materializes it once on the initiator and broadcasts it.
- partial_merge and grace_hash JOIN algorithms. These spill to disk rather than requiring the right table to fit in memory.
- Rewrite to IN or semi-JOIN. WHERE id IN (SELECT …) is often faster than a JOIN for filtering.
-- Dictionary for small dimension tables
CREATE DICTIONARY countries_dict (
code String,
name String,
region String
)
PRIMARY KEY code
SOURCE(POSTGRESQL(host 'dim-db' db 'dims' table 'countries' port 5432 user 'reader' password '***'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);
-- Use dictGet instead of JOIN
SELECT
dictGet('countries_dict', 'name', country_code) AS country,
dictGet('countries_dict', 'region', country_code) AS region,
count(),
sum(revenue)
FROM events
GROUP BY country, region;
-- Disk-spilling JOIN for large dimensions
SELECT e.user_id, e.revenue, u.segment
FROM events AS e
LEFT JOIN users AS u ON e.user_id = u.user_id
SETTINGS join_algorithm = 'grace_hash',
max_bytes_in_join = 10737418240; -- 10 GB
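The IN rewrite from the list above looks like this (users.segment and the 'enterprise' value are illustrative):

```sql
-- Filtering JOIN rewritten as IN: the right side builds only a set of keys,
-- not a hash table carrying payload columns
SELECT count(), sum(revenue)
FROM events
WHERE user_id IN
(
    SELECT user_id FROM users WHERE segment = 'enterprise'
);
```

Because no right-side columns are projected into the result, this is often faster and lighter on memory than an equivalent INNER JOIN used purely as a filter.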
Execution-Time Tuning
Beyond schema and query shape, ClickHouse exposes dozens of execution settings. The handful we tune most often at ChistaDATA:
-- Parallelism
SET max_threads = 32;                                 -- match physical cores
SET max_streams_for_merge_tree_reading = 64;          -- parallel part reading

-- Memory
SET max_memory_usage = 20000000000;                   -- 20 GB per query
SET max_bytes_before_external_group_by = 10000000000; -- spill GROUP BY to disk

-- Query optimizer
SET allow_experimental_analyzer = 1;                  -- new query analyzer (default in 24.x)
SET optimize_read_in_order = 1;
SET optimize_aggregation_in_order = 1;
SET enable_optimize_predicate_expression = 1;

-- Query cache for repeated dashboard queries
SET use_query_cache = 1;
SET query_cache_ttl = 60;
Three high-impact settings that are frequently undertuned:
- optimize_read_in_order = 1 skips the sort step when the query’s ORDER BY matches the table’s sort key. Huge win for time-series ORDER BY event_time DESC LIMIT N patterns.
- optimize_aggregation_in_order = 1 streams the GROUP BY when the grouping columns match a prefix of the sort key, emitting each group as soon as it is complete, so memory stays roughly constant instead of growing with the number of groups.
- max_bytes_before_external_group_by protects against OOM on huge GROUP BYs by spilling to disk.
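As a concrete sketch against the sort key used earlier, (tenant_id, event_type, country, event_time): once the leading key columns are pinned by equality filters, an ORDER BY on event_time matches the on-disk order and the sort step disappears:

```sql
-- Sort-key prefix pinned by equality filters; event_time is read in order
SELECT event_time, user_id, revenue
FROM events
WHERE tenant_id = 42
  AND event_type = 'purchase'
  AND country = 'IN'
ORDER BY event_time DESC
LIMIT 100;
```

With optimize_read_in_order enabled (the default), ClickHouse reads the matching granules in reverse and stops after 100 rows instead of sorting the full result set.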
A Repeatable Optimization Workflow
When ChistaDATA engineers optimize a slow ClickHouse query, we follow the same repeatable seven-step workflow. It works at terabyte scale and at petabyte scale.
- Capture the query and baseline — clear caches, run it repeatedly, record p50/p99 latency, record rows read from system.query_log.
- Run EXPLAIN indexes = 1 — confirm which granules are eliminated and which columns are actually scanned.
- Check sort key alignment — are the WHERE filters using the leading sort key columns? If not, consider a projection or schema redesign.
- Add targeted skipping indexes — bloom or set on filter columns not covered by the sort key.
- Tune execution settings — max_threads, optimize_read_in_order, optimize_aggregation_in_order.
- Consider a projection or MV — if the same query pattern is run repeatedly, precompute it.
- Re-baseline and document — measure improvement, capture the final query plan, and add the query to a regression test suite.
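Step 1 of the workflow, sketched as SQL (the SYSTEM DROP statements require sufficient privileges):

```sql
-- Clear caches so the baseline measures cold reads
SYSTEM DROP MARK CACHE;
SYSTEM DROP UNCOMPRESSED CACHE;

-- Run the candidate query several times, then capture its stats
SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS read,
    formatReadableSize(memory_usage) AS mem
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 3;
```

Recording these four numbers before and after each change is what makes the workflow repeatable rather than anecdotal.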
Key Takeaways
- Most ClickHouse query optimization reduces to reading fewer rows from disk — sort key, skipping indexes, and projections are the three main levers.
- Always run EXPLAIN indexes = 1 before and after optimization to quantify granule pruning.
- Order sort keys with low-cardinality, high-selectivity columns first and time last.
- Use dictionaries instead of JOINs for small dimension tables; use grace_hash or partial_merge for large JOINs.
- Projections give you alternative sort orders and pre-aggregations without denormalizing the source table.
- Tune max_threads, optimize_read_in_order, and optimize_aggregation_in_order for most dashboard workloads.
- Adopt a repeatable seven-step optimization workflow and add solved queries to a regression test suite.
At ChistaDATA Inc., we are 100% focused on ClickHouse. Our engineering team delivers ClickHouse consulting, 24×7 managed services, performance engineering, and mission-critical support for analytics workloads ranging from gigabytes to multi-petabyte clusters. Whether you are building a real-time analytics platform from scratch, migrating from Snowflake, BigQuery, or Redshift, or tuning an existing ClickHouse deployment, we can help accelerate outcomes. Schedule a consultation with our ClickHouse engineering team to discuss architecture, capacity planning, and performance SLAs.