ClickHouse Workload Isolation: 7 Techniques for Noisy-Neighbor Problems

Seven practical ClickHouse workload isolation layers — from resource groups and query limits to physical replica separation — to eliminate noisy-neighbor interference in production ClickHouse clusters.

When a single analytics team runs an aggressive backfill job at 3 a.m. — or a product dashboard spins up hundreds of concurrent sub-second queries at 9 a.m. — every other workload on the same ClickHouse cluster pays the price. This is the noisy-neighbor problem: resource contention among unrelated workloads sharing a common database tier. In a column store built for sub-second analytical queries, even a brief I/O or CPU spike from one query class can push latency into seconds for everyone else.

This guide covers the practical ClickHouse workload isolation techniques that production engineers apply to eliminate noisy-neighbor interference — from query-level resource limits to cluster-level architectural separation. Every technique includes working SQL and the judgment required to apply it correctly. For a broader view of ClickHouse performance engineering, see our guide to ClickHouse query optimization and indexing strategies.

Figure: the 7 layers of ClickHouse workload isolation, from cluster separation to per-query complexity limits.

What Is the Noisy-Neighbor Problem in ClickHouse?

ClickHouse is a shared-nothing columnar engine. A single query can saturate every CPU core, read hundreds of gigabytes from disk, and consume most of the network bandwidth between replicas. When multiple such queries run simultaneously from different teams — each with different latency requirements, different data volumes, and different scheduling priorities — they compete for the same finite pool of CPU threads, I/O slots, and memory.

The result is unpredictable: an interactive dashboard query that normally completes in 80 ms takes 4 seconds because a data-science aggregation is monopolizing thread slots. A nightly ETL job misses its window because ad-hoc exploration queries are filling the memory budget. The problem is not any one query being slow — it is the absence of ClickHouse workload isolation between query classes.

ClickHouse provides a layered toolkit for workload isolation. The layers stack from coarsest to finest:

  1. Cluster-level separation — dedicated replica sets per workload class
  2. Resource groups (workload hierarchies via CREATE WORKLOAD)
  3. User-level and profile-level settings — per-user resource caps
  4. Query complexity limits — hard stops on scan depth and row counts
  5. Priority and scheduling — weighted fair queuing across concurrent requests
  6. Merge and background operation throttling — protecting foreground I/O
  7. Replica-level read isolation — routing workloads to dedicated replicas

Understanding where each layer operates — and where it falls short — is the foundation of effective ClickHouse workload isolation engineering.

Layer 1: Cluster-Level Workload Isolation

The strongest isolation guarantee is physical: route different workload classes to entirely separate replica sets. When an interactive application tier and a batch analytics tier talk to different nodes, they cannot contend on CPU, I/O, or memory — period. This is the approach enterprises use for their most latency-sensitive ClickHouse workload isolation requirements.

ClickHouse implements this through its distributed query routing. A <remote_servers> cluster definition in config.xml names each shard and replica set. By creating two named clusters — one for OLAP batch and one for OLAP interactive — and directing application connection strings accordingly, you achieve hard multi-tenancy with zero scheduler complexity. The ClickHouse replication documentation covers the full replica configuration options.

<!-- config.xml: two logically separate clusters on the same ZooKeeper -->
<remote_servers>
    <cluster_interactive>
        <shard>
            <replica>
                <host>ch-node-01</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>ch-node-02</host>
                <port>9000</port>
            </replica>
        </shard>
    </cluster_interactive>

    <cluster_batch>
        <shard>
            <replica>
                <host>ch-node-03</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>ch-node-04</host>
                <port>9000</port>
            </replica>
        </shard>
    </cluster_batch>
</remote_servers>

The trade-off is hardware cost. Physical isolation roughly doubles the node count when two workload classes each get their own replica set. For organisations where the cost is justified (fintech real-time fraud scoring running alongside data-science model training, for instance), it is the only ClickHouse workload isolation technique with a hard guarantee. For everyone else, the in-cluster mechanisms below provide the practical operating point.
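As a quick check that both cluster definitions were loaded, the sketch below lists them from system.clusters. It assumes the cluster names used in the config.xml example above and can be run on any node that carries that configuration.

```sql
-- List every shard and replica of the two workload-specific clusters
SELECT
    cluster,
    shard_num,
    replica_num,
    host_name,
    port
FROM system.clusters
WHERE cluster IN ('cluster_interactive', 'cluster_batch')
ORDER BY cluster, shard_num, replica_num;
```

If the same host_name shows up under both clusters, the two workload classes still share hardware and the isolation is no longer physical.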

Layer 2: ClickHouse Workload Resource Groups (Scheduler Hierarchies)

The ClickHouse 24.x series introduced a first-class workload and resource scheduling system managed through SQL. The CREATE WORKLOAD and CREATE RESOURCE DDL commands let you build a named hierarchy of workload classes with weights, static priorities, and throttling limits. (The SQL syntax appears to have shipped in 24.11 rather than 24.1, so confirm against the changelog for your build.) This is the recommended in-cluster ClickHouse workload isolation mechanism on releases that support it. See the official ClickHouse workload scheduling documentation for the full DDL reference.

The design mirrors Linux cgroups in spirit: a tree of workloads, each with a weight and optional caps, competing for a shared resource pool. Interactive queries run in a high-priority leaf; background ETL runs in a low-priority leaf; the scheduler ensures the interactive branch always gets its share first.

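-- Define the disk I/O resource pool
CREATE RESOURCE io_disk (WRITE DISK default, READ DISK default);

-- Root workload (required parent). An aggregate bandwidth or in-flight cap
-- can be set here, but the setting names have changed between releases
-- (for example max_speed vs. max_bytes_per_second), so check the docs for your build.
CREATE WORKLOAD all;

-- Interactive workload: largest share of I/O under contention
CREATE WORKLOAD interactive IN all
SETTINGS weight = 10;

-- Batch workload: smallest share under contention, full bandwidth when others are idle
CREATE WORKLOAD batch IN all
SETTINGS weight = 1;

-- ETL workload: twice the batch share
CREATE WORKLOAD etl IN all
SETTINGS weight = 2;

-- Notes:
-- * Workloads do not take a max_concurrent_queries setting; cap query concurrency
--   per user in settings profiles (max_concurrent_queries_for_user, Layer 3).
-- * priority is an integer and a LOWER value is served first. Siblings with
--   different priorities are not weight-shared, so use it sparingly:
-- CREATE OR REPLACE WORKLOAD batch IN all SETTINGS priority = 1;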

Once workloads are defined, route users into them via their settings profile or per-query override:

-- Assign a user to a named workload at creation time
CREATE USER dashboard_user
SETTINGS workload = 'interactive';

-- Override per-session if needed
SET workload = 'batch';

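-- Verify the workload of each running query.
-- getSetting() would only report the current session's value; the Settings map
-- holds each query's changed settings, so '' means the default workload.
SELECT user, client_hostname, Settings['workload'] AS workload
FROM system.processes;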

The scheduler enforces weights using a weighted fair queue: when both interactive and batch workloads have pending queries, the interactive pool receives 10× the I/O share of the batch pool. If the interactive workload is idle, the batch workload can consume all available I/O — there is no wasted capacity from hard reservation.
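To make the weighted split concrete, here is a small worked calculation. It assumes the three example weights (10, 1, 2), that all three workloads sit at the same priority, and that all of them have pending I/O at the same moment.

```sql
-- Share of contended I/O per workload = weight / sum of sibling weights
SELECT
    workload,
    weight,
    round(weight / sum(weight) OVER (), 3) AS io_share
FROM VALUES(
    'workload String, weight UInt32',
    ('interactive', 10), ('batch', 1), ('etl', 2)
)
ORDER BY weight DESC;
```

Under those assumptions the shares are 10/13, 2/13, and 1/13, or roughly 77%, 15%, and 8%. The 10× ratio between interactive and batch holds however many siblings are active; only the absolute percentages move.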

When to Use Workload Resource Groups

Workload resource groups are the right tool when multiple query classes coexist on the same physical nodes and you need soft priority guarantees without dedicating hardware. They are not a substitute for physical separation in situations where a batch job must be completely isolated from interactive SLAs. The key operational discipline is to assign every user or service account to a workload at provisioning time — ad-hoc users left in the default workload will quietly interfere with everyone else. This is the ClickHouse workload isolation pattern our engineers enforce at the start of every engagement.
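One way to audit that discipline is to look for users whose queries ran without an explicit workload. This is a sketch that assumes the Settings map in system.query_log records only settings changed from their defaults, so a missing key reads as an empty string.

```sql
-- Users whose queries ran in the default workload over the last day
SELECT
    user,
    count() AS queries_without_workload
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 DAY
  AND Settings['workload'] = ''   -- workload was never set for this query
GROUP BY user
ORDER BY queries_without_workload DESC;
```

Any service account that appears here is a candidate for an explicit workload assignment in its profile.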

Layer 3: User Profiles and Per-User Resource Limits

Before workload resource groups existed, the primary ClickHouse workload isolation mechanism was the settings profile: a named bundle of configuration that caps how much any single query from a given user can consume. Profiles remain essential — they operate at a finer granularity than workload weights and enforce hard limits that the scheduler cannot relax.

The most impactful profile settings for ClickHouse workload isolation are listed below. Each controls a different resource axis.

<!-- users.xml: define a profile for dashboard users -->
<profiles>
    <dashboard>
        <!-- CPU: max concurrent threads per query -->
        <max_threads>8</max_threads>

        <!-- Memory: hard per-query limit, then per-user cap -->
        <max_memory_usage>4294967296</max_memory_usage>         <!-- 4 GiB per query -->
        <max_memory_usage_for_user>12884901888</max_memory_usage_for_user>  <!-- 12 GiB per user -->

        <!-- Concurrency: throttle simultaneous queries from this user -->
        <max_concurrent_queries_for_user>10</max_concurrent_queries_for_user>

        <!-- Network: cap per-query data exchange speed in bytes per second -->
        <max_network_bandwidth>104857600</max_network_bandwidth>  <!-- 100 MiB/s -->

        <!-- Timeout: kill runaway queries -->
        <max_execution_time>30</max_execution_time>
    </dashboard>

    <etl_worker>
        <max_threads>32</max_threads>
        <max_memory_usage>17179869184</max_memory_usage>   <!-- 16 GiB -->
        <max_concurrent_queries_for_user>4</max_concurrent_queries_for_user>
        <max_execution_time>3600</max_execution_time>  <!-- 1 hour for batch -->
        <priority>10</priority>  <!-- ClickHouse query priority: 1 is highest, larger numbers are lower, 0 disables it -->
    </etl_worker>
</profiles>
<!-- Assign profiles to users -->
<users>
    <dashboard_svc>
        <profile>dashboard</profile>
        <networks><ip>::</ip></networks>
    </dashboard_svc>
    <etl_svc>
        <profile>etl_worker</profile>
        <networks><ip>::</ip></networks>
    </etl_svc>
</users>
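On servers with SQL-driven access management enabled, a comparable profile can be created with DDL instead of users.xml. The sketch below is illustrative only: the names dashboard_sql and dashboard_sql_svc and the placeholder password are hypothetical, and the values simply mirror the XML profile above.

```sql
-- Hypothetical names: dashboard_sql (profile), dashboard_sql_svc (user)
CREATE SETTINGS PROFILE IF NOT EXISTS dashboard_sql SETTINGS
    max_threads = 8,
    max_memory_usage = 4294967296,            -- 4 GiB per query
    max_memory_usage_for_user = 12884901888,  -- 12 GiB per user
    max_concurrent_queries_for_user = 10,
    max_execution_time = 30;                  -- seconds

CREATE USER IF NOT EXISTS dashboard_sql_svc
    IDENTIFIED WITH sha256_password BY 'change-me'
    SETTINGS PROFILE 'dashboard_sql';

-- Inspect what the profile contains
SELECT setting_name, value
FROM system.settings_profile_elements
WHERE profile_name = 'dashboard_sql';
```

Keeping profiles in SQL makes them easier to review and change without editing configuration files, at the cost of managing them outside the server's config directory.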

A critical subtlety: max_threads limits parallelism within a single query, not per user. Reducing it from the default (typically the number of CPU cores) is the single most reliable lever for ClickHouse workload isolation at the query level. A dashboard query with max_threads = 8 on a 64-core node uses at most 8 query-processing threads, or 12.5% of the cores, even for a full table scan. The per-user ceiling is max_threads multiplied by max_concurrent_queries_for_user, so the two settings must be sized together.
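The arithmetic behind that sizing can be checked directly. The figures below are the ones from the dashboard profile and the 64-core node assumed in the text; the column names are just labels for this example.

```sql
SELECT
    8  AS threads_per_query,     -- max_threads in the dashboard profile
    10 AS concurrent_queries,    -- max_concurrent_queries_for_user
    64 AS cores,                 -- node size assumed in the text
    threads_per_query / cores                      AS per_query_share,
    threads_per_query * concurrent_queries / cores AS worst_case_user_share;
```

A single query is capped at 0.125 of the cores, but ten concurrent queries could ask for 80 threads, or 1.25 times the node. If the dashboard user must stay under a fixed fraction of the machine, lower one of the two settings until the product fits.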

Layer 4: Query Complexity Limits

Resource limits cap how much CPU and memory a query can use while it executes. Complexity limits stop dangerously large queries early, in some cases before any data is read: for MergeTree tables, ClickHouse can compare max_rows_to_read against the row estimate produced by index analysis. Both layers are necessary for ClickHouse workload isolation: a query that would exhaust memory in 20 seconds should be rejected as early as possible, not allowed to run and then killed partway through, which wastes I/O and forces partial work to be retried.

ClickHouse’s query complexity settings cover every axis of scan cost. The full reference is in the ClickHouse query complexity settings documentation:

-- Session-level complexity limits (can also be set in profiles)
SET max_rows_to_read           = 5000000000;   -- 5 billion row cap
SET max_bytes_to_read          = 107374182400; -- 100 GiB data scan cap
SET max_result_rows            = 1000000;      -- result set size
SET max_result_bytes           = 104857600;    -- 100 MiB result cap
SET max_rows_in_join           = 100000000;    -- join build-side cap
SET max_bytes_in_join          = 10737418240;  -- 10 GiB join memory
SET max_rows_to_group_by       = 50000000;     -- GROUP BY cardinality cap
SET group_by_overflow_mode     = 'throw';      -- reject instead of spill
SET timeout_overflow_mode      = 'throw';      -- reject on timeout
SET read_overflow_mode         = 'throw';      -- reject on read limit

The *_overflow_mode settings deserve explicit attention. Each limit has its own companion setting (read_overflow_mode, group_by_overflow_mode, timeout_overflow_mode, and so on), and the default is 'throw': reject the query with an error. This is almost always the right choice for interactive workloads: a dashboard that receives a clear error message can surface it to the user, whereas a query silently truncated at the limit returns a wrong answer. Set the relevant companion to 'break' (for example read_overflow_mode = 'break') only for exploratory workloads where approximate answers are acceptable.
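The difference between the two modes is easiest to see side by side. This sketch reuses the events table name that appears elsewhere in this guide; the user_id column is a hypothetical placeholder, and the limit of one million rows is arbitrary.

```sql
-- throw: the query fails with an exception once the read limit is exceeded
SELECT uniqExact(user_id)
FROM events
SETTINGS max_rows_to_read = 1000000, read_overflow_mode = 'throw';

-- break: reading stops at roughly the limit and a partial result is returned
SELECT uniqExact(user_id)
FROM events
SETTINGS max_rows_to_read = 1000000, read_overflow_mode = 'break';
```

The second query returns a number that looks valid but only covers the rows read before the limit, which is why 'break' belongs in exploratory profiles rather than in anything that feeds a dashboard.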

Enforcing Limits Across All Users

Complexity limits set in a user profile apply only to users assigned that profile. To enforce a cluster-wide floor, set them in the default profile, which supplies the baseline settings for every user unless a more specific profile overrides them. Because a user can otherwise raise a limit with SET, add constraints (a max bound or readonly) on the settings that matter so the caps cannot be relaxed from a session. Pairing per-user limits with constrained defaults creates defence in depth: a service account whose profile is accidentally left at defaults still falls under the system-wide caps.
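A constrained baseline might look like the sketch below. The profile name cluster_floor and every value in it are assumptions chosen for illustration; applying a profile TO ALL also affects administrative users, so consider excluding them.

```sql
-- Hypothetical baseline profile applied to every user
CREATE SETTINGS PROFILE IF NOT EXISTS cluster_floor SETTINGS
    max_memory_usage   = 8589934592 MAX 17179869184,  -- default 8 GiB, ceiling 16 GiB
    max_execution_time = 60 MAX 3600,                 -- default 60 s, ceiling 1 hour
    max_bytes_to_read  = 107374182400 READONLY        -- 100 GiB, cannot be changed by SET
TO ALL;
```

With the MAX bounds in place, a session that tries to SET a value above the ceiling is rejected, and the READONLY setting cannot be changed at all.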

Layer 5: Query Priority and Concurrent Query Caps

When more queries arrive than the server can execute simultaneously at target latency, something must decide which ones run first. ClickHouse exposes complementary mechanisms: the priority setting (a ClickHouse-internal query priority), os_thread_priority (the Linux nice value for query threads), and max_concurrent_queries with its per-user variants.

-- Server-level concurrent query cap: <max_concurrent_queries> in config.xml
-- e.g. <max_concurrent_queries>100</max_concurrent_queries>

-- Per-user concurrency: dashboard users get more slots than batch
-- In dashboard profile:
SET max_concurrent_queries_for_user = 20;

-- In batch profile:
SET max_concurrent_queries_for_user = 5;

-- ClickHouse query priority: 1 is the highest, larger numbers are lower,
-- 0 (the default) means priorities are not used for the query
-- Interactive queries: priority 1
-- Batch queries: priority 10
SET priority = 10;

-- OS-level scheduling: nice value for query threads (higher = lower CPU priority);
-- the server needs the CAP_SYS_NICE capability for this to take effect
SET os_thread_priority = 10;

The priority setting is enforced inside ClickHouse, not by the OS: while a query with a higher priority (a lower number) is running, lower-priority queries are paused and resume once it finishes. Give interactive queries an explicit priority of 1, because 0 means priorities are not used for that query. os_thread_priority is the separate setting that maps to the thread nice value on Linux, so the kernel favours interactive threads when CPU is contended. Both are lightweight ClickHouse workload isolation complements that are effective even on ClickHouse versions predating the workload DDL.

Queue Behaviour Under Load

When max_concurrent_queries is reached, ClickHouse by default rejects the new query immediately with TOO_MANY_SIMULTANEOUS_QUERIES rather than queuing it (the queue_max_wait_ms setting allows a bounded wait). This is intentional: a queue that grows without bound converts a latency problem into a starvation problem, and interactive workloads will wait behind a backed-up batch queue. Design your application tier to handle TOO_MANY_SIMULTANEOUS_QUERIES errors with client-side exponential backoff rather than expecting the server to queue on your behalf. The workload scheduler described in Layer 2 queues resource requests on the server side, which is another reason to adopt it for new deployments.
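To see whether clients are actually hitting the cap, the server's own counters are a reasonable first stop. A minimal sketch using system.errors and system.processes:

```sql
-- How often the server has rejected queries for concurrency reasons since startup
SELECT name, code, value AS occurrences, last_error_time
FROM system.errors
WHERE name = 'TOO_MANY_SIMULTANEOUS_QUERIES';

-- Who holds the query slots right now
SELECT user, count() AS running_queries
FROM system.processes
GROUP BY user
ORDER BY running_queries DESC;
```

A steadily rising occurrence count alongside one user dominating the second result usually points at a missing or oversized per-user limit rather than an undersized global one.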

Layer 6: Merge and Background Operation Throttling

The noisiest background process in a ClickHouse cluster is the MergeTree merge. Part merges read large amounts of data from disk, rewrite it compressed, and write it back — all while competing with foreground queries for I/O bandwidth. On insert-heavy clusters, unthrottled merges are a primary cause of query latency spikes that look like noisy-neighbor problems but are actually foreground-background contention. This is a ClickHouse workload isolation challenge distinct from user-to-user contention.

-- Limit background merge pressure on foreground I/O
-- Pool sizes are server settings in config.xml; merge sizing can be set
-- under <merge_tree> or per table via ALTER TABLE ... MODIFY SETTING

-- Server-wide background pool
-- In config.xml:
-- <background_pool_size>16</background_pool_size>
-- <background_merges_mutations_concurrency_ratio>2</background_merges_mutations_concurrency_ratio>

-- Per-table merge sizing: caps how large a single merge may grow
-- (a size limit, not a bytes-per-second throttle)
ALTER TABLE events
MODIFY SETTING
    max_bytes_to_merge_at_max_space_in_pool = 53687091200,  -- 50 GiB
    number_of_free_entries_in_pool_to_lower_max_size_of_merge = 8;

-- Inspect current merge activity
SELECT
    database,
    table,
    elapsed,
    progress,
    formatReadableSize(total_size_bytes_compressed) AS compressed_size,
    formatReadableSize(memory_usage) AS mem_usage
FROM system.merges
ORDER BY elapsed DESC;

Two settings govern merge concurrency at the server level. background_pool_size sets the number of threads dedicated to background merges and mutations. background_merges_mutations_concurrency_ratio multiplies this to get the maximum number of concurrent merge and mutation tasks. Reducing background_pool_size from the default of 16 to 8 on query-heavy clusters frees I/O for foreground work (lowering it typically requires a server restart; only increases apply at runtime), but at the cost of slower part consolidation, which can cause part count to grow if the insert rate is high. Monitor system.parts alongside query latency when tuning this value.
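Two quick checks help when tuning the pool: how busy it is, and whether parts are piling up in any partition. This is a sketch; the exact metric names in system.metrics vary somewhat between releases, hence the LIKE pattern.

```sql
-- Background merge/mutation pool: configured size and tasks currently running
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'BackgroundMergesAndMutationsPool%';

-- Partitions with the most active parts (merges falling behind show up here first)
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 20;
```

If the pool is constantly full and per-partition part counts keep climbing after a reduction, the pool has been cut too far for the current insert rate.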

Layer 7: Replica-Level Read Isolation via load_balancing

In a replicated ClickHouse setup, each shard has multiple replicas that hold identical data. By default (load_balancing = 'random'), ClickHouse sends each distributed subquery to the replica with the fewest recorded connection errors and picks at random among ties; it does not track replica load. This is useful for single-workload clusters but works against ClickHouse workload isolation: a batch query can land on the same node as a burst of interactive traffic, because nothing in the selection accounts for which workload a replica is already serving.

A practical isolation technique is to bind specific workloads to specific replicas using the load_balancing setting. The load_balancing setting documentation covers all available policies:

-- Route a workload to a specific replica set by policy
-- 'in_order' tries replicas in config order — use to pin to first replica
SET load_balancing = 'in_order';

-- 'nearest_hostname' prefers replicas with similar hostnames (same rack/AZ)
SET load_balancing = 'nearest_hostname';

-- 'random' (default) — balanced but unpredictable
SET load_balancing = 'random';

-- For pinning ETL to specific nodes, combine with session-level host specification
-- in the TCP connection string: clickhouse://ch-batch-01:9000

When combined with distinct named users — interactive users connecting to one load balancer VIP, batch users connecting to another — this technique achieves soft replica pinning without changes to cluster topology. It does not provide hard isolation (a replica can still be reached by the other workload class if the preferred replicas are unavailable), but it significantly reduces cross-workload interference in practice.
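Whether the pinning works as intended can be verified from the query log of every replica. The sketch below assumes the cluster_interactive name from the Layer 1 example and that query logging is enabled on all nodes.

```sql
-- Which users' queries were initiated on which replica in the last hour
SELECT
    hostName() AS replica,
    user,
    count() AS queries
FROM clusterAllReplicas('cluster_interactive', system.query_log)
WHERE type = 'QueryFinish'
  AND is_initial_query
  AND event_time >= now() - INTERVAL 1 HOUR
GROUP BY replica, user
ORDER BY replica, queries DESC;
```

Batch users appearing on replicas meant for interactive traffic indicate that failover or a misrouted connection string has bypassed the intended pinning.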

Observing Workload Interference: Diagnostic Queries

None of the ClickHouse workload isolation techniques above are set-and-forget. Workload patterns change, data volumes grow, and new query classes appear. The only way to know whether isolation is working is to measure it continuously. ClickHouse’s system tables provide everything needed to detect noisy-neighbor interference in real time.

-- 1. Identify active queries consuming the most resources
SELECT
    query_id,
    user,
    Settings['workload'] AS workload,  -- per-query value; '' means the default workload
    elapsed,
    formatReadableSize(memory_usage) AS memory,
    read_rows,
    formatReadableSize(read_bytes) AS read_bytes,
    query
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 20;

-- 2. Measure per-user resource consumption over the last hour
SELECT
    user,
    count()                             AS queries,
    avg(query_duration_ms)              AS avg_ms,
    quantile(0.99)(query_duration_ms)   AS p99_ms,
    sum(read_rows)                      AS total_rows_read,
    formatReadableSize(sum(read_bytes)) AS total_bytes_read
FROM system.query_log
WHERE
    type = 'QueryFinish'
    AND event_time >= now() - INTERVAL 1 HOUR
GROUP BY user
ORDER BY total_bytes_read DESC;

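-- 3. Detect queries rejected or killed by limits (noisy-neighbor symptoms)
-- Codes: 158 TOO_MANY_ROWS, 159 TIMEOUT_EXCEEDED, 241 MEMORY_LIMIT_EXCEEDED,
--        307 TOO_MANY_BYTES, 396 TOO_MANY_ROWS_OR_BYTES
SELECT
    user,
    errorCodeToName(exception_code) AS error,
    exception,
    query
FROM system.query_log
WHERE
    type IN ('ExceptionBeforeStart', 'ExceptionWhileProcessing')
    AND exception_code IN (158, 159, 241, 307, 396)
    AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 50;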

-- 4. Monitor part counts — high part counts indicate merge contention
SELECT
    database,
    table,
    count()        AS part_count,
    sum(rows)      AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS disk_size
FROM system.parts
WHERE active = 1
GROUP BY database, table
HAVING part_count > 200
ORDER BY part_count DESC;

The p99 latency split by user (query 2) is the canonical noisy-neighbor detector for ClickHouse workload isolation: when interactive-user p99 latency correlates in time with batch-user query volume, you have a resource contention problem. Capture this query as a scheduled view or export it to a metrics system, and configure an alert on interactive p99 > your SLA threshold. For guidance on setting up ClickHouse monitoring, see our post on ClickHouse performance tuning and indexing.
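The correlation described above can be approximated directly in SQL. This is a rough sketch that assumes the dashboard_svc and etl_svc user names from the profile example and a six-hour window; a metrics system will usually do this better over longer periods.

```sql
WITH per_minute AS
(
    SELECT
        toStartOfMinute(event_time) AS minute,
        quantileIf(0.99)(query_duration_ms, user = 'dashboard_svc') AS interactive_p99_ms,
        sumIf(read_bytes, user = 'etl_svc') AS batch_read_bytes
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND event_time >= now() - INTERVAL 6 HOUR
    GROUP BY minute
)
SELECT
    count() AS minutes,
    corr(interactive_p99_ms, toFloat64(batch_read_bytes)) AS p99_vs_batch_reads
FROM per_minute
WHERE isFinite(interactive_p99_ms);  -- skip minutes with no interactive queries
```

A coefficient close to 1 suggests that interactive latency rises with batch read volume; a value near 0 suggests the latency spikes have another cause, such as merges.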

Putting the Layers Together: A Reference Architecture

No single ClickHouse workload isolation technique is sufficient on its own. The production pattern that eliminates noisy-neighbor problems in ClickHouse clusters combines all 7 layers, applied at the right scope:

Layer | Technique                                                             | Scope              | Isolation Strength
1     | Separate replica sets per workload class                              | Cluster topology   | Hard (physical)
2     | CREATE WORKLOAD resource hierarchies                                  | Server scheduler   | Strong (weighted fair)
3     | Settings profiles: max_threads, max_memory_usage                      | Per user/query     | Hard caps per query
4     | Query complexity limits: max_rows_to_read, etc.                       | Per user/query     | Admission control
5     | Query priority (priority, os_thread_priority) + concurrent query caps | Per user/query     | Soft (query and OS scheduling)
6     | Merge throttling: background_pool_size                                | Server / per table | Foreground/background
7     | Replica pinning via load_balancing                                    | Per session        | Soft (routing preference)

A practical deployment sequence for ClickHouse workload isolation: start with settings profiles and complexity limits (layers 3 and 4). They are available in every ClickHouse version, require no architectural change, and immediately cap the blast radius of any single query. Add workload resource groups (layer 2) once you are on a release that supports CREATE WORKLOAD (late 24.x; verify in the changelog for your build). Introduce merge throttling (layer 6) if insert-heavy workloads coexist with interactive queries. Reserve physical replica separation (layer 1) for workloads with contractual SLAs that cannot tolerate any cross-workload interference.
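Before starting that sequence, it helps to record the server version and the limits a given account actually runs with. A minimal sketch, meant to be run while connected as the service account under review:

```sql
SELECT version() AS server_version;

-- Effective values for the current user; changed = 0 means still at the built-in default
SELECT name, value, changed
FROM system.settings
WHERE name IN (
    'max_threads', 'max_memory_usage', 'max_execution_time',
    'max_rows_to_read', 'max_bytes_to_read', 'workload'
)
ORDER BY name;
```

Rows with changed = 0 are the gaps: those limits have not been set by any profile that applies to the account.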

Frequently Asked Questions

What is the fastest way to stop a noisy-neighbor query in ClickHouse?

Use KILL QUERY WHERE query_id = '<id>' from a superuser session. To find the offending query, run SELECT query_id, user, elapsed, memory_usage, read_bytes FROM system.processes ORDER BY read_bytes DESC LIMIT 10. For ongoing ClickHouse workload isolation, add max_execution_time and max_bytes_to_read limits to the responsible user’s profile so that future runaway queries self-terminate.
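KILL QUERY also accepts arbitrary filters over system.processes, which helps when one account has many runaway queries. In this sketch the etl_svc user name and the 600-second threshold are assumptions.

```sql
-- Dry run: list what would be killed without stopping anything
KILL QUERY WHERE user = 'etl_svc' AND elapsed > 600 TEST;

-- Real run: SYNC waits until the queries have actually stopped
KILL QUERY WHERE user = 'etl_svc' AND elapsed > 600 SYNC;
```

Running the TEST form first avoids terminating a legitimate long-running job that happens to match the filter.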

Does ClickHouse support Kubernetes resource limits for workload isolation?

Kubernetes CPU and memory limits apply to the ClickHouse pod as a whole; they do not distinguish between query classes within the process. For pod-level isolation, deploy separate ClickHouse pods (or Deployments) per workload class and route clients to the appropriate Service endpoint. ClickHouse's in-process isolation (layers 2–7) remains necessary even with pod separation, to prevent a single query from starving others in the same pod.

How do max_concurrent_queries and max_concurrent_queries_for_user interact?

The server-level max_concurrent_queries is a hard ceiling for the entire server. max_concurrent_queries_for_user is a per-user subset of that ceiling. Both must be satisfied: a user with a per-user limit of 20 will still be blocked if the server is at its global limit of 100 and other users are consuming the remainder. Size the global limit at 1.5–2× the sum of per-user limits to leave headroom for administrative queries and internal operations.

Can materialized views cause noisy-neighbor problems?

Yes. In ClickHouse, materialized views execute synchronously on the inserting thread as part of every INSERT operation. A complex aggregation in a materialized view increases insert latency and can block the insertion pipeline for other tables. If a materialized view is causing insert contention, consider pre-aggregating with a dedicated background process instead, or restructure the view to minimise its per-block work. Monitor system.query_log for INSERT queries with unexpectedly high query_duration_ms as the diagnostic signal.
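The diagnostic signal mentioned above can be pulled with a query along these lines. It is a sketch that relies on the views and query_kind columns of system.query_log, which are present in recent releases.

```sql
-- INSERT latency grouped by the set of materialized views each insert triggered
SELECT
    views,
    count() AS inserts,
    quantile(0.5)(query_duration_ms)  AS p50_ms,
    quantile(0.99)(query_duration_ms) AS p99_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND notEmpty(views)
  AND event_time >= now() - INTERVAL 1 HOUR
GROUP BY views
ORDER BY p99_ms DESC
LIMIT 20;
```

View sets with a p99 far above their p50 are the first candidates for restructuring or for moving the aggregation out of the insert path.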

What ClickHouse version is required for workload resource groups?

The CREATE WORKLOAD and CREATE RESOURCE syntax arrived during the ClickHouse 24.x series; our understanding is that it shipped in 24.11 (November 2024) rather than 24.1, so check the changelog for your exact build. Earlier versions support ClickHouse workload isolation through settings profiles, query complexity limits, query priority, and physical cluster separation. If you are on an earlier version, layers 3–7 remain fully functional and provide meaningful isolation; the SQL-managed scheduler guarantees of layer 2 simply require an upgrade.

Conclusion

ClickHouse workload isolation is not a single feature — it is a discipline applied at 7 distinct layers, from cluster topology down to per-query complexity limits. The noisy-neighbor problem in ClickHouse is ultimately a resource contention problem: when interactive and batch workloads share CPU, I/O, and memory without guardrails, the least predictable query class sets the latency floor for everyone.

The engineering response is layered defence: start with settings profiles and complexity limits to cap individual query blast radius, add workload resource groups for scheduler-level fairness, throttle background merges to protect foreground I/O, and graduate to physical replica separation only where hard SLA guarantees are required. Measure with system.query_log and system.processes throughout — ClickHouse workload isolation is proved by telemetry, not configuration alone.

If your ClickHouse cluster is exhibiting unpredictable latency spikes or users are reporting inconsistent query performance, the root cause is almost always one of the 7 contention patterns described above. Need expert help implementing ClickHouse workload isolation for your production cluster? Contact the ChistaDATA engineering team for a performance audit.
