ClickHouse Query Optimization for Petabyte-Scale Analytics

Query optimization is where ClickHouse either sings or suffers. A well-designed query on a well-designed schema scans terabytes in milliseconds; a poorly designed query scans the same data in minutes. This ChistaDATA guide details the optimization techniques that turn petabyte-scale ClickHouse deployments into interactive analytics platforms.


The ClickHouse Query Optimization Mental Model

Every ClickHouse query optimization reduces to one of three goals:

  1. Read fewer rows from disk. The single highest-leverage dimension. Primary index pruning, data skipping indexes, and projections all attack this.
  2. Do less work per row. Vectorized execution, smart column types, and avoiding materializing unnecessary columns reduce CPU.
  3. Parallelize effectively. More threads, more shards, and smarter JOIN strategies reduce wall-clock time.

The ClickHouse official guide to faster queries articulates the philosophy succinctly: tables are designed to receive millions of row inserts per second and store hundreds of petabytes, so fast query speed depends on properly utilizing a table’s sparse primary index to drastically limit the data ClickHouse needs to read from disk.
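
A quick way to see how much data a query will touch before running it is EXPLAIN ESTIMATE, which reports the parts, rows, and marks ClickHouse expects to read (the events table here is illustrative):

-- Estimate I/O before executing the query
EXPLAIN ESTIMATE
SELECT count() FROM events WHERE event_time > today() - 7;

The result contains one row per table with estimated parts, rows, and marks to read; comparing the estimate before and after a schema change is a cheap way to quantify progress on goal #1.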

Reading the EXPLAIN Output Like a Performance Engineer

Every optimization starts with EXPLAIN. ClickHouse offers four EXPLAIN modes, each useful at a different stage:

-- Rewritten query after syntax-level optimizations
EXPLAIN SYNTAX SELECT count() FROM events WHERE country = 'IN';

-- Physical plan with pipeline stages
EXPLAIN PIPELINE SELECT count() FROM events WHERE country = 'IN';

-- Index usage and row estimates
EXPLAIN indexes = 1
SELECT count() FROM events WHERE country = 'IN' AND event_time > today() - 7;

-- Query pipeline graph (for visualization)
EXPLAIN PIPELINE graph = 1
SELECT country, count() FROM events GROUP BY country;

EXPLAIN indexes = 1 is the single most useful variant for optimization. It tells you which indexes were considered, which were used, how many granules were eliminated, and how many rows will actually be read. A good query eliminates 95%+ of granules before any row is materialized.
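
For orientation, the indexes section of the plan looks roughly like the following (table name, keys, and counts are illustrative, and the exact layout varies by ClickHouse version):

ReadFromMergeTree (default.events)
Indexes:
  PrimaryKey
    Keys:
      country
      event_time
    Condition: ...
    Parts: 8/120
    Granules: 410/81920

Read Granules as selected/total: here 410 of 81,920 granules survive pruning, i.e. more than 99% of the table is never touched.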

Primary Key and Sort Key Optimization

The ORDER BY clause of a MergeTree table determines the sort order on disk, which in turn determines the primary index. Five principles we apply at ChistaDATA:

  • Lead with the columns most commonly used in equality filters. Low-cardinality, high-selectivity columns first.
  • Place time columns last. This lets range filters on time prune granules efficiently while allowing equality filters on higher-level dimensions to eliminate entire regions of the index.
  • Avoid high-cardinality columns early in the sort key. They bloat the index and reduce granule compression.
  • Use PRIMARY KEY separately when it differs from ORDER BY. You can have a shorter primary key than sort key to reduce memory footprint.
  • Do not use hash columns in the sort key. They defeat compression and index effectiveness.

-- Good: low cardinality leads, time last
ORDER BY (tenant_id, event_type, country, event_time)

-- Bad: high cardinality first
ORDER BY (user_id, event_time, tenant_id)

-- Good: primary key separate from sort key
ORDER BY (tenant_id, event_type, country, event_time)
PRIMARY KEY (tenant_id, event_type)
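
To verify that a sort key change actually reduces I/O, compare rows read against rows returned for the live workload in system.query_log (the one-hour window is an arbitrary choice):

-- Read amplification per query pattern: high ratios suggest poor pruning
SELECT
    normalized_query_hash,
    any(query)                                             AS sample_query,
    sum(read_rows)                                         AS total_read_rows,
    sum(result_rows)                                       AS total_result_rows,
    round(sum(read_rows) / greatest(sum(result_rows), 1))  AS read_amplification
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY normalized_query_hash
ORDER BY total_read_rows DESC
LIMIT 10;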

Data Skipping Indexes: Bloom Filters, MinMax, and Set

Data skipping indexes accelerate filters on columns that are not part of the primary key. Each index type suits a different filter pattern:

Index Type      Best For                                                      Typical GRANULARITY
minmax          Range filters on numeric or date columns                      1
set(N)          Equality filters on low-cardinality columns (up to N values)  2-4
bloom_filter    Equality filters on high-cardinality columns                  2-4
ngrambf_v1      LIKE / substring search on text                               3-4
tokenbf_v1      Full-token search on text                                     3-4

ALTER TABLE events
  ADD INDEX idx_user_id    user_id    TYPE bloom_filter(0.01) GRANULARITY 4,
  ADD INDEX idx_country    country    TYPE set(256)           GRANULARITY 4,
  ADD INDEX idx_revenue    revenue    TYPE minmax             GRANULARITY 2,
  ADD INDEX idx_url_search url        TYPE ngrambf_v1(4, 16384, 3, 0) GRANULARITY 4;

-- Materialize the indexes for existing parts
ALTER TABLE events MATERIALIZE INDEX idx_user_id;
ALTER TABLE events MATERIALIZE INDEX idx_country;
ALTER TABLE events MATERIALIZE INDEX idx_revenue;
ALTER TABLE events MATERIALIZE INDEX idx_url_search;

A skipping index is only useful if the filter is selective. Adding a bloom filter on a column where 80% of rows match a filter value wastes space without accelerating queries. Profile representative queries with EXPLAIN indexes = 1 before committing to an index set.
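
A quick selectivity check before committing to an index: if values repeat only a handful of times, a bloom filter can skip aggressively; if a few values dominate, it rarely skips anything (user_id is the illustrative candidate column):

-- How selective is the candidate column?
SELECT
    uniqExact(user_id)                    AS distinct_values,
    count()                               AS total_rows,
    round(total_rows / distinct_values)   AS avg_rows_per_value
FROM events;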

Projections: Alternative Sort Orders and Pre-Aggregations

A projection is an alternative physical layout of a table — a different sort order, a pre-aggregation, or both — maintained automatically by ClickHouse. When a query matches a projection, the optimizer routes reads through the projection instead of the base table.

ALTER TABLE events
  ADD PROJECTION proj_by_user (
    SELECT *
    ORDER BY (user_id, event_time)
  );

ALTER TABLE events
  ADD PROJECTION proj_country_daily (
    SELECT
      toDate(event_time) AS day,
      country,
      count() AS events,
      sum(revenue) AS revenue,
      uniqExact(user_id) AS users
    GROUP BY day, country
  );

-- Materialize projections for existing parts
ALTER TABLE events MATERIALIZE PROJECTION proj_by_user;
ALTER TABLE events MATERIALIZE PROJECTION proj_country_daily;

Projections shine when you have two or three fundamentally different access patterns against the same table — for example, by-user and by-country. Rather than denormalizing into multiple tables with separate ingestion pipelines, a projection keeps the source of truth in one place while offering multiple physical layouts.

The tradeoff: projections increase storage (each projection is a full copy or aggregate of the base data) and insert cost (each insert writes to every projection). For most tables, 1–2 projections is the sweet spot.
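
To confirm the optimizer actually routes through a projection, check the plan and the projections column of system.query_log (both queries assume the proj_country_daily projection defined above):

-- The plan should read from the projection, not the base table
EXPLAIN indexes = 1
SELECT toDate(event_time) AS day, country, count()
FROM events
GROUP BY day, country;

-- Recent queries that were served by any projection
SELECT query, projections
FROM system.query_log
WHERE type = 'QueryFinish' AND notEmpty(projections)
ORDER BY event_time DESC
LIMIT 5;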

JOIN Strategies at Scale

JOINs are the hardest operation to scale in ClickHouse because the default behavior is a hash join that loads the right-hand table entirely into memory. At petabyte scale this is fatal. Four strategies to avoid memory disasters:

  1. Dictionaries. For small dimension tables (up to a few million rows), a dictionary is a memory-mapped lookup that replaces the JOIN with a function call.
  2. JOIN with GLOBAL for distributed queries. Otherwise each shard reloads the right table, multiplying memory usage by shard count.
  3. partial_merge and grace_hash JOIN algorithms. These spill to disk rather than requiring the right table to fit in memory.
  4. Rewrite to IN or semi-JOIN. WHERE id IN (SELECT …) is often faster than a JOIN for filtering.

-- Dictionary for small dimension tables
CREATE DICTIONARY countries_dict (
    code String,
    name String,
    region String
)
PRIMARY KEY code
SOURCE(POSTGRESQL(host 'dim-db' db 'dims' table 'countries' port 5432 user 'reader' password '***'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

-- Use dictGet instead of JOIN
SELECT
    dictGet('countries_dict', 'name',   country_code) AS country,
    dictGet('countries_dict', 'region', country_code) AS region,
    count(),
    sum(revenue)
FROM events
GROUP BY country, region;

-- Disk-spilling JOIN for large dimensions
SELECT e.user_id, e.revenue, u.segment
FROM events AS e
LEFT JOIN users AS u ON e.user_id = u.user_id
SETTINGS join_algorithm = 'grace_hash',
         max_bytes_in_join = 10737418240;  -- 10 GB

Execution-Time Tuning

Beyond schema and query shape, ClickHouse exposes dozens of execution settings. The handful we tune most often at ChistaDATA:

-- Parallelism
SET max_threads = 32;                          -- match physical cores
SET max_streams_for_merge_tree_reading = 64;   -- parallel part reading

-- Memory
SET max_memory_usage = 20000000000;            -- 20 GB per query
SET max_bytes_before_external_group_by = 10000000000;  -- spill GROUP BY

-- Query optimizer
SET allow_experimental_analyzer = 1;           -- new query analyzer (default in 24.x)
SET optimize_read_in_order = 1;
SET optimize_aggregation_in_order = 1;
SET enable_optimize_predicate_expression = 1;

-- Query cache for repeated dashboard queries
SET use_query_cache = 1;
SET query_cache_ttl = 60;

Three high-impact settings that are frequently undertuned:

  • optimize_read_in_order = 1 skips the sort step when the query’s ORDER BY matches the table’s sort key. Huge win for time-series ORDER BY event_time DESC LIMIT N patterns.
  • optimize_aggregation_in_order = 1 streams GROUP BY when the grouping columns match a prefix of the sort key. Because groups arrive contiguously, finished groups are flushed as their key changes, so memory stays bounded by the groups in flight rather than a hash table of every distinct group.
  • max_bytes_before_external_group_by protects against OOM on huge GROUP BYs by spilling to disk.
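
A minimal illustration of the read-in-order win, against the events table with the sort key (tenant_id, event_type, country, event_time) assumed earlier; with the prefix fixed by equality filters, ClickHouse can read parts in event_time order and stop at the LIMIT instead of sorting:

-- Equality on the leading sort key columns, range/order on the time column
SELECT event_time, user_id, revenue
FROM events
WHERE tenant_id = 42 AND event_type = 'click' AND country = 'IN'
ORDER BY event_time DESC
LIMIT 100
SETTINGS optimize_read_in_order = 1;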

A Repeatable Optimization Workflow

When ChistaDATA engineers optimize a slow ClickHouse query, we follow the same repeatable seven-step workflow. It works at terabyte scale and at petabyte scale.

  1. Capture the query and baseline — clear caches, run three times, record p50/p99, record rows read from system.query_log.
  2. Run EXPLAIN indexes = 1 — confirm which granules are eliminated and which columns are actually scanned.
  3. Check sort key alignment — are the WHERE filters using the leading sort key columns? If not, consider a projection or schema redesign.
  4. Add targeted skipping indexes — bloom or set on filter columns not covered by the sort key.
  5. Tune execution settings — max_threads, optimize_read_in_order, optimize_aggregation_in_order.
  6. Consider a projection or MV — if the same query pattern is run repeatedly, precompute it.
  7. Re-baseline and document — measure improvement, capture the final query plan, and add the query to a regression test suite.
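
Step 1 can be scripted; a sketch that pulls latency percentiles and rows read for one query pattern from system.query_log (the hash literal is a placeholder for the target query's normalized hash):

SELECT
    count()                                  AS runs,
    quantile(0.5)(query_duration_ms)         AS p50_ms,
    quantile(0.99)(query_duration_ms)        AS p99_ms,
    avg(read_rows)                           AS avg_rows_read,
    formatReadableSize(avg(read_bytes))      AS avg_bytes_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND normalized_query_hash = 1234567890;  -- placeholder hash of the target query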

Key Takeaways

  • Most ClickHouse query optimization reduces to reading fewer rows from disk — sort key, skipping indexes, and projections are the three main levers.
  • Always run EXPLAIN indexes = 1 before and after optimization to quantify granule pruning.
  • Order sort keys with low-cardinality, high-selectivity columns first and time columns last.
  • Use dictionaries instead of JOINs for small dimension tables; use grace_hash or partial_merge for large JOINs.
  • Projections give you alternative sort orders and pre-aggregations without denormalizing the source table.
  • Tune max_threads, optimize_read_in_order, and optimize_aggregation_in_order for most dashboard workloads.
  • Adopt a repeatable seven-step optimization workflow and add solved queries to a regression test suite.

At ChistaDATA Inc., we are 100% focused on ClickHouse. Our engineering team delivers ClickHouse consulting, 24×7 managed services, performance engineering, and mission-critical support for analytics workloads ranging from gigabytes to multi-petabyte clusters. Whether you are building a real-time analytics platform from scratch, migrating from Snowflake, BigQuery, or Redshift, or tuning an existing ClickHouse deployment, we can help accelerate outcomes. Schedule a consultation with our ClickHouse engineering team to discuss architecture, capacity planning, and performance SLAs.
