How to Run a Complete ClickHouse Performance Audit in Under 60 Minutes

ClickHouse performance audit is the fastest way to uncover hidden bottlenecks, misconfigured settings, and query inefficiencies draining your analytical database. Whether you are running ClickHouse on bare metal, Kubernetes, or as a managed cloud service, a structured performance audit in under 60 minutes gives your team the actionable intelligence needed to optimize ClickHouse throughput, reduce query latency, and stabilize resource utilization before a production incident forces your hand.

This guide walks you through a complete ClickHouse performance audit framework covering storage diagnostics, query profiling, merge tree health, replication lag, and system-level resource analysis all executable within a single engineering session.

How to Run a ClickHouse Performance Audit in Under 60 Minutes

Table of Contents

Why a ClickHouse Performance Audit Matters

ClickHouse is purpose-built for high-speed analytical workloads but even the fastest columnar database degrades over time when left unexamined. Common causes of ClickHouse performance degradation include suboptimal table engines, excessive part fragmentation, untuned merge settings, missing or poorly designed primary keys, and runaway background operations competing with query execution.

A periodic ClickHouse performance audit is not optional infrastructure hygiene. It is the operational discipline that separates teams running ClickHouse reliably at scale from those constantly firefighting slow queries and unexpected memory spikes. Teams that skip regular audits typically discover performance problems when end-users complain, not when the window to fix them quietly is still open.

The 60-minute audit framework in this post is designed to be repeatable, scriptable, and actionable. Each section maps to a specific layer of ClickHouse performance: storage, query execution, background processing, and system resources.

Prerequisites Before Starting Your ClickHouse Performance Audit

Before running any diagnostic queries, verify you have the following access and baseline information:

  • ClickHouse client access: You need clickhouse-client or HTTP access to the cluster with at least read access to system tables.
  • Permissions: Query profiling requires access to system.query_log, system.processes, and system.merges.
  • Baseline metrics: Collect current CPU, memory, and disk usage from your monitoring platform before starting.
  • ClickHouse version: Note your current version as some diagnostic system tables and settings differ between versions. Check the ClickHouse changelog for version-specific changes.

With these in place, start your timer. The entire audit should complete in under 60 minutes for most single-node and small multi-node ClickHouse deployments.

Step 1 (0-10 Minutes): Assess Storage Health and Part Fragmentation

The first phase of any ClickHouse performance audit starts with storage. ClickHouse’s MergeTree engine writes data as immutable parts that are periodically merged in the background. When merges fall behind ingestion rates, part fragmentation builds up causing slower reads, higher memory usage during queries, and increased SELECT latency.

Check Active Parts Per Table

Run the following query to identify tables with abnormally high active part counts. A healthy table typically has fewer than 150 active parts. Tables exceeding 300 active parts are candidates for immediate investigation.

SELECT
    database,
    table,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_ratio
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 20;

Identify Tables With Too Many Parts

Tables with more than 300 active parts indicate that background merges are not keeping up with inserts. This is one of the most common findings in a ClickHouse performance audit and directly impacts query performance.

SELECT
    database,
    table,
    count() AS parts_count,
    sum(rows) AS total_rows,
    max(modification_time) AS last_modified
FROM system.parts
WHERE active = 1
GROUP BY database, table
HAVING parts_count > 300
ORDER BY parts_count DESC;

Check Storage Disk Utilization

SELECT
    name,
    path,
    formatReadableSize(free_space) AS free_space,
    formatReadableSize(total_space) AS total_space,
    round((1 - free_space / total_space) * 100, 1) AS used_pct
FROM system.disks
ORDER BY used_pct DESC;

Disk utilization above 80% is a critical finding. ClickHouse requires headroom for part merges. Without it, merges stall, fragmentation accelerates, and query performance degrades rapidly.

Step 2 (10-25 Minutes): Profile Slow and Expensive Queries

Query profiling is the most analytically rich phase of a ClickHouse performance audit. The system.query_log table records every query executed including execution time, memory usage, rows read, and bytes processed. You can learn more about all available system tables in the ClickHouse official system tables documentation.

Find the Top 20 Slowest Queries in the Last 24 Hours

SELECT
    query_id,
    user,
    query_duration_ms,
    read_rows,
    read_bytes,
    formatReadableSize(memory_usage) AS memory_usage,
    left(query, 120) AS query_snippet
FROM system.query_log
WHERE
    type = 'QueryFinish'
    AND event_time >= now() - INTERVAL 24 HOUR
    AND query_duration_ms > 1000
ORDER BY query_duration_ms DESC
LIMIT 20;

Identify High-Memory Queries

Memory spikes during query execution are a leading cause of ClickHouse OOM kills and query failures. Use this query to find the most memory-hungry workloads from the last 24 hours:

SELECT
    user,
    query_id,
    formatReadableSize(memory_usage) AS memory_used,
    query_duration_ms,
    read_rows,
    left(query, 120) AS query_snippet
FROM system.query_log
WHERE
    type = 'QueryFinish'
    AND event_time >= now() - INTERVAL 24 HOUR
    AND memory_usage > 1073741824
ORDER BY memory_usage DESC
LIMIT 20;

Check for Currently Running Long Queries

SELECT
    query_id,
    user,
    elapsed,
    read_rows,
    read_bytes,
    formatReadableSize(memory_usage) AS memory_usage,
    left(query, 120) AS query_snippet
FROM system.processes
ORDER BY elapsed DESC;

Any query running for more than 60 seconds in an OLAP environment deserves investigation. Long-running queries compete with background merges and can trigger memory pressure across the entire ClickHouse node.

Analyze Query Error Rates

SELECT
    type,
    exception_code,
    count() AS error_count,
    left(exception, 100) AS exception_snippet
FROM system.query_log
WHERE
    (type = 'ExceptionBeforeStart' OR type = 'ExceptionWhileProcessing')
    AND event_time >= now() - INTERVAL 24 HOUR
GROUP BY type, exception_code, exception_snippet
ORDER BY error_count DESC
LIMIT 20;

Step 3 (25-35 Minutes): Audit MergeTree Engine Health and Background Operations

ClickHouse’s background merge process is its self-healing mechanism. When merges fall behind or fail, the database accumulates technical debt in the form of fragmentation, replication lag, and eventually query performance degradation. This phase of the ClickHouse performance audit examines the health of background operations.

Check Active Merges

SELECT
    database,
    table,
    elapsed,
    progress,
    num_parts,
    result_part_name,
    total_size_bytes_compressed,
    rows_read,
    rows_written
FROM system.merges
ORDER BY elapsed DESC;

If the progress column shows values near 0 for merges that have been running for several minutes, the merge is stalled. This is a critical finding that requires immediate investigation of disk I/O or background thread configuration.

Check Mutation Status

Mutations (ALTER TABLE UPDATE and DELETE operations) in ClickHouse are heavy background operations. Stalled or accumulating mutations are a frequent but overlooked cause of performance degradation:

SELECT
    database,
    table,
    mutation_id,
    command,
    create_time,
    parts_to_do,
    is_done,
    latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY parts_to_do DESC;

Inspect Background Pool Saturation

SELECT
    metric,
    value
FROM system.metrics
WHERE metric IN (
    'BackgroundMergesAndMutationsPoolTask',
    'BackgroundPoolTask',
    'BackgroundFetchesPoolTask',
    'BackgroundSchedulePoolTask',
    'BackgroundBufferFlushSchedulePoolTask'
)
ORDER BY metric;

A BackgroundMergesAndMutationsPoolTask value consistently at or near the pool maximum (default 16 threads) indicates background pool saturation, which is a key tuning target identified by this section of the ClickHouse performance audit.

Step 4 (35-45 Minutes): Evaluate Replication Health

For ClickHouse clusters using ReplicatedMergeTree, replication lag and replica divergence are silent performance killers. Even a single lagging replica can cause SELECT queries routed to that replica to return stale data or time out.

Check Replication Queue Depth

SELECT
    database,
    table,
    replica_name,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    log_max_index,
    log_pointer,
    last_queue_update,
    last_queue_update_exception
FROM system.replicas
ORDER BY queue_size DESC
LIMIT 20;

A queue_size above 100 indicates a lagging replica. A queue_size above 1000 is a critical alert requiring immediate investigation.

Check for Replication Errors

SELECT
    database,
    table,
    replica_name,
    last_queue_update_exception,
    zookeeper_exception
FROM system.replicas
WHERE
    last_queue_update_exception != ''
    OR zookeeper_exception != '';

Check ZooKeeper and ClickHouse Keeper Health

SELECT
    metric,
    value
FROM system.metrics
WHERE metric LIKE '%Zookeeper%' OR metric LIKE '%Keeper%'
ORDER BY metric;

Step 5 (45-55 Minutes): System Resource and Configuration Audit

The final diagnostic phase of a ClickHouse performance audit examines system-level resource utilization and validates that key ClickHouse configuration parameters are optimally set for your workload.

Check CPU and Memory Metrics

SELECT
    metric,
    value,
    description
FROM system.metrics
WHERE metric IN (
    'Query',
    'Merge',
    'MemoryTracking',
    'MemoryTrackingInBackgroundProcessingPool',
    'MemoryTrackingInBackgroundSchedulePool',
    'OSUserTimeMicroseconds',
    'OSSystemTimeMicroseconds'
)
ORDER BY metric;

Check Asynchronous Metrics for System-Wide Health

SELECT
    metric,
    value
FROM system.asynchronous_metrics
WHERE metric IN (
    'NumberOfTables',
    'NumberOfDatabases',
    'ReplicasMaxAbsoluteDelay',
    'ReplicasMaxRelativeDelay',
    'ReplicasMaxQueueSize',
    'jemalloc.active',
    'jemalloc.resident',
    'OSMemoryTotal',
    'OSMemoryAvailable'
)
ORDER BY metric;

Validate Key ClickHouse Configuration Settings

SELECT
    name,
    value,
    description
FROM system.settings
WHERE name IN (
    'max_memory_usage',
    'max_bytes_before_external_group_by',
    'max_bytes_before_external_sort',
    'merge_max_block_size',
    'background_pool_size',
    'background_merges_mutations_concurrency_ratio',
    'max_threads',
    'max_insert_threads',
    'compile_expressions',
    'use_uncompressed_cache'
)
ORDER BY name;

Check Table-Level Settings for MergeTree Optimization

SELECT
    database,
    name AS table_name,
    engine,
    engine_full,
    sorting_key,
    primary_key,
    partition_key
FROM system.tables
WHERE engine LIKE '%MergeTree%'
  AND database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA')
ORDER BY database, name;

Step 6 (55-60 Minutes): Compile Findings and Create an Action Plan

The final five minutes of the ClickHouse performance audit should be spent synthesizing findings into a prioritized action plan. Use the following framework to triage what you found:

Performance Audit Severity Classification

  • Critical (fix within 24 hours): Disk utilization above 85%, replication queue above 1000, active stalled mutations, background pool fully saturated, queries with memory usage above available RAM headroom.
  • High (fix within 1 week): Tables with more than 300 active parts, slow queries above 30 seconds running regularly, missing primary key optimization, high query error rates.
  • Medium (fix within 30 days): Compression ratios below 3x on analytical tables, suboptimal partition keys, untuned merge settings, queries doing full table scans on large tables.
  • Low (optimize during next maintenance window): Minor configuration tuning opportunities, unused tables consuming storage, stale materialized views.

Generating a Quick Audit Summary Report

After collecting all findings, generate a summary of the most important metrics in a single query:

SELECT 'Total Active MergeTree Tables' AS metric, toString(count()) AS value
FROM system.tables
WHERE engine LIKE '%MergeTree%' AND database NOT IN ('system')

UNION ALL

SELECT 'Tables With More Than 300 Parts' AS metric, toString(count()) AS value
FROM (
    SELECT table, count() AS parts
    FROM system.parts
    WHERE active = 1
    GROUP BY database, table
    HAVING parts > 300
)

UNION ALL

SELECT 'Active Merges' AS metric, toString(count()) AS value
FROM system.merges

UNION ALL

SELECT 'Pending Mutations' AS metric, toString(count()) AS value
FROM system.mutations
WHERE is_done = 0

UNION ALL

SELECT 'Replicas With Queue Above 100' AS metric, toString(count()) AS value
FROM system.replicas
WHERE queue_size > 100

UNION ALL

SELECT 'Slow Queries Last 24h Above 10s' AS metric, toString(count()) AS value
FROM system.query_log
WHERE type = 'QueryFinish'
    AND query_duration_ms > 10000
    AND event_time >= now() - INTERVAL 24 HOUR;

Common ClickHouse Performance Audit Findings and How to Fix Them

Finding 1: Excessive Part Fragmentation

Root cause: Insert frequency exceeds the background merge rate, or background merge threads are misconfigured.

Fix: Increase the background merge pool size in your ClickHouse configuration:

<clickhouse>
    <merge_tree>
        <background_pool_size>32</background_pool_size>
        <background_merges_mutations_concurrency_ratio>2</background_merges_mutations_concurrency_ratio>
    </merge_tree>
</clickhouse>

Also consider batching inserts to reduce the number of parts created per unit time. The recommended minimum batch size is 1,000 rows per insert with an ideal target of 10,000 to 100,000 rows per batch.

Finding 2: Slow Queries Due to Missing Primary Key Optimization

Root cause: Queries filtering on non-primary-key columns require full table scans.

Fix: Use EXPLAIN to verify index utilization:

EXPLAIN indexes = 1
SELECT count()
FROM your_database.your_table
WHERE event_date >= today() - 7
  AND user_id = 12345;

If the output shows that the index condition is not being used for frequently queried columns, consider redesigning the primary key or adding a ClickHouse skip index.

Finding 3: High Memory Usage During GROUP BY Operations

Root cause: Large aggregations exhausting the per-query memory limit.

Fix: Enable external aggregation to spill to disk:

SET max_bytes_before_external_group_by = 10000000000;
SET max_memory_usage = 20000000000;

SELECT
    user_id,
    count() AS event_count,
    uniq(session_id) AS unique_sessions
FROM your_database.events
WHERE event_date >= today() - 30
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 100;

Finding 4: Replication Lag

Root cause: Network bottlenecks, slow disks on replica nodes, or ZooKeeper and Keeper overload.

Fix: Check what is causing the lag:

SELECT
    database,
    table,
    replica_name,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    last_queue_update_exception
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;

If inserts_in_queue is high, the replica is catching up on data. If merges_in_queue is high, background merges on the replica are lagging. Each requires a different tuning response.

ClickHouse Performance Audit Checklist

Use this checklist to ensure complete coverage every time you run a ClickHouse performance audit:

  • Checked active part counts per table and confirmed no table exceeds 300 parts
  • Verified disk utilization is below 80% on all storage disks
  • Profiled top 20 slowest queries from the last 24 hours
  • Identified and reviewed high-memory queries using more than 1 GB RAM
  • Verified no long-running queries over 60 seconds exist in system.processes
  • Confirmed query error rates are within acceptable thresholds
  • Reviewed active merges and mutation status
  • Verified background pool is not saturated
  • Checked replication queue depth across all replicas
  • Validated ZooKeeper and Keeper connectivity and health
  • Reviewed key ClickHouse configuration settings
  • Verified primary key and index effectiveness on top tables
  • Produced a prioritized action plan with severity classifications

Automating Your ClickHouse Performance Audit

Running a manual ClickHouse performance audit every quarter is a good start. But high-volume deployments benefit from automated, scheduled audits that flag regressions before they become incidents. The queries in this guide can be wrapped in a shell script using clickhouse-client:

#!/bin/bash
# ClickHouse Performance Audit Automation Script
# Schedule via cron: 0 6 * * 1 /opt/scripts/clickhouse-audit.sh

CH_HOST="${CH_HOST:-localhost}"
CH_PORT="${CH_PORT:-9000}"
CH_USER="${CH_USER:-default}"
REPORT_DATE=$(date +%Y-%m-%d)
REPORT_FILE="/var/log/clickhouse/audit-${REPORT_DATE}.txt"

echo "=== ClickHouse Performance Audit: ${REPORT_DATE} ===" > "${REPORT_FILE}"

# Section 1: Part fragmentation
echo -e "\n--- Tables with > 300 Active Parts ---" >> "${REPORT_FILE}"
clickhouse-client \
  --host="${CH_HOST}" \
  --port="${CH_PORT}" \
  --user="${CH_USER}" \
  --query="SELECT database, table, count() AS parts FROM system.parts WHERE active = 1 GROUP BY database, table HAVING parts > 300 ORDER BY parts DESC FORMAT PrettyCompact" \
  >> "${REPORT_FILE}" 2>&1

# Section 2: Slow queries
echo -e "\n--- Top 10 Slowest Queries (Last 24h) ---" >> "${REPORT_FILE}"
clickhouse-client \
  --host="${CH_HOST}" \
  --port="${CH_PORT}" \
  --user="${CH_USER}" \
  --query="SELECT user, query_duration_ms, left(query, 80) AS q FROM system.query_log WHERE type='QueryFinish' AND event_time >= now() - INTERVAL 24 HOUR ORDER BY query_duration_ms DESC LIMIT 10 FORMAT PrettyCompact" \
  >> "${REPORT_FILE}" 2>&1

# Section 3: Replication health
echo -e "\n--- Replication Queue Status ---" >> "${REPORT_FILE}"
clickhouse-client \
  --host="${CH_HOST}" \
  --port="${CH_PORT}" \
  --user="${CH_USER}" \
  --query="SELECT database, table, replica_name, queue_size FROM system.replicas WHERE queue_size > 0 ORDER BY queue_size DESC FORMAT PrettyCompact" \
  >> "${REPORT_FILE}" 2>&1

echo -e "\nAudit complete. Report: ${REPORT_FILE}"

Schedule this script weekly via cron to maintain a continuous performance baseline and detect regressions automatically.

Integrating ClickHouse Performance Audit Results With Monitoring

A ClickHouse performance audit is most powerful when its findings inform your monitoring configuration. After completing an audit, translate key thresholds into Prometheus alerting rules or ClickHouse-native monitoring queries.

For teams running ClickHouse with OpenTelemetry-based observability pipelines, the system tables queried in this audit can be continuously exported to a long-term storage backend for trend analysis and anomaly detection.

For teams running managed ClickHouse through ChistaDATA ClickHouse DBaaS, performance audits are integrated into the platform’s automated health-check engine eliminating the need for manual audits while providing continuous visibility into query performance, storage health, and replication status across your entire cluster fleet.

Conclusion

A structured ClickHouse performance audit does not require specialized tooling or a week-long engagement. The six-step framework in this guide covering storage fragmentation, query profiling, merge tree health, replication status, system resources, and configuration validation gives any ClickHouse operator a complete picture of cluster health in under 60 minutes.

Run this audit after every major schema change, following a significant increase in data volume, and on a scheduled basis at least once per quarter. The earlier you catch performance regressions in ClickHouse, the cheaper they are to fix and the more reliably your analytical workloads will perform at scale.

To learn more about ClickHouse optimization strategies, explore our related guides on ClickHouse indexing strategies, ClickHouse workload isolation, and ClickHouse reliability engineering.

About ChistaDATA Inc. 226 Articles
We are an full-stack ClickHouse infrastructure operations Consulting, Support and Managed Services provider with core expertise in performance, scalability and data SRE. Based out of California, Our consulting and support engineering team operates out of San Francisco, Vancouver, London, Germany, Russia, Ukraine, Australia, Singapore and India to deliver 24*7 enterprise-class consultative support and managed services. We operate very closely with some of the largest and planet-scale internet properties like PayPal, Garmin, Honda cars IoT project, Viacom, National Geographic, Nike, Morgan Stanley, American Express Travel, VISA, Netflix, PRADA, Blue Dart, Carlsberg, Sony, Unilever etc