
Advanced eBPF-Based Performance Analysis for ClickHouse: Kernel-Level Observability Techniques


eBPF (extended Berkeley Packet Filter) delivers comprehensive, low-overhead kernel-space observability for ClickHouse performance analysis. Through dynamic instrumentation of kernel tracepoints, kprobes, uprobes, and hardware performance monitoring units (PMUs), eBPF eliminates the need for application-level code changes or recompilation with profiling flags (debug symbols remain useful for readable stack traces).

CPU Microarchitecture Profiling with Hardware Performance Counters

The perf subsystem attaches to hardware performance monitoring units (PMUs) through the perf_event_open() syscall, the same mechanism eBPF programs use for PMU-based probes, enabling precise CPU cycle attribution and microarchitectural event sampling:

# Sample CPU cycles with call-graph reconstruction using frame pointers
sudo perf record -e cycles:u -g --call-graph=fp -p $(pidof clickhouse-server) --freq=997
# Generate detailed performance report with symbol resolution
sudo perf report --stdio --sort=dso,symbol --percent-limit=1

This technique captures statistical sampling data at configurable frequencies (typically 997 Hz to avoid timer aliasing), generating flame-graph compatible stack traces that reveal CPU time distribution across ClickHouse’s execution contexts. This includes:

  • Query parsing (AST construction)
  • Vectorized expression evaluation
  • Compression algorithms (LZ4/ZSTD)
  • Background merge operations
  • Kernel functions such as shrink_lruvec() in the memory management subsystem
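Because each retained sample at 997 Hz represents roughly 1/997 s of on-CPU time, the per-symbol sample counts that perf report emits can be converted into estimated CPU seconds. A minimal post-processing sketch (the symbol names and counts below are illustrative, not measured data):

```python
# Convert perf sample counts into estimated CPU time per symbol.
# At sampling frequency F, each sample represents ~1/F seconds on-CPU.
SAMPLE_FREQ_HZ = 997  # matches the --freq=997 used with perf record

def attribute_cpu_time(sample_counts, freq_hz=SAMPLE_FREQ_HZ):
    """Map {symbol: sample_count} to {symbol: (est_seconds, percent)}."""
    total = sum(sample_counts.values())
    return {
        sym: (n / freq_hz, 100.0 * n / total)
        for sym, n in sample_counts.items()
    }

# Illustrative sample counts (hypothetical, not real telemetry)
samples = {
    "shrink_lruvec": 6400,                    # kernel memory reclaim
    "LZ4_decompress_safe": 2100,              # block decompression
    "DB::ExpressionActions::execute": 1500,   # vectorized expression evaluation
}
for sym, (secs, pct) in attribute_cpu_time(samples).items():
    print(f"{sym}: ~{secs:.2f} s on-CPU ({pct:.1f}%)")
```

The percentage view is how a kernel-dominated profile (such as the reclaim-heavy one described below) becomes obvious at a glance.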

Real-World Impact: ChistaDATA’s production analysis revealed that 64% of CPU cycles were consumed within kernel memory reclamation paths rather than user-space query processing—a bottleneck invisible to application performance monitoring (APM) tools that lack kernel visibility.

Thread Pool Synchronization and Lock Contention Analysis

ClickHouse’s global thread pool implementation utilizes POSIX threading primitives that can exhibit severe contention under high concurrency workloads. eBPF uprobes enable precise instrumentation of pthread_create() latency and associated kernel synchronization overhead:

#!/usr/bin/env bpftrace
// Note: on glibc >= 2.34, pthread_create lives in libc.so.6; adjust the
// library path below on systems that no longer ship libpthread.so.0.
uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_create /pid == $1/ {
    @thread_create_start[tid] = nsecs;
    @thread_create_count++;
}

uretprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_create /pid == $1/ {
    if (@thread_create_start[tid] != 0) {
        $latency_ns = nsecs - @thread_create_start[tid];
        $latency_us = $latency_ns / 1000;

        printf("pthread_create() latency: %d μs, VMA count: %d, RSS: %d KB\n",
               $latency_us, curtask->mm->map_count, curtask->mm->total_vm * 4);

        @pthread_latency_hist = hist($latency_us);
        delete(@thread_create_start[tid]);
    }
}

Critical Finding: ChistaDATA’s forensic analysis identified thread creation latencies exceeding 450,000 μs (450ms) under concurrent load, attributed to mmap_lock (formerly mmap_sem) contention in the kernel’s virtual memory area (VMA) management subsystem. This cascading synchronization bottleneck caused even trivial SELECT 1 queries to exhibit 5-30 second response times due to thread pool starvation.
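The @pthread_latency_hist map above uses bpftrace's hist(), which buckets values into power-of-two intervals; the same bucketing can be reproduced offline when analyzing exported latency samples. A minimal Python equivalent (the sample latencies are made up, including a 450 ms outlier of the kind described above):

```python
# Power-of-two latency bucketing, mirroring bpftrace's hist() output shape.
def log2_hist(values):
    """Bucket values into [2^(k-1), 2^k) intervals like bpftrace hist()."""
    buckets = {}
    for v in values:
        k = max(0, v).bit_length()        # 0 -> [0,1), 18 -> [16,32), ...
        lo = 0 if k == 0 else 1 << (k - 1)
        buckets[(lo, 1 << k)] = buckets.get((lo, 1 << k), 0) + 1
    return dict(sorted(buckets.items()))

# Illustrative pthread_create() latencies in microseconds: a healthy cluster
# around tens of microseconds plus one pathological ~450 ms outlier.
latencies_us = [18, 22, 25, 31, 40, 450_000]
for (lo, hi), n in log2_hist(latencies_us).items():
    print(f"[{lo}, {hi}): {'@' * n} ({n})")
```

A bimodal histogram like this, with a tight low band and a far-right outlier bucket, is the visual signature of intermittent lock contention rather than uniformly slow thread creation.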

Memory Management Subsystem Instrumentation

eBPF tracepoints provide deep visibility into kernel memory management operations, particularly the mmap_lock read-write semaphore that serializes virtual memory area modifications:

#!/usr/bin/env bpftrace
tracepoint:mmap_lock:mmap_lock_acquire_returned /pid == $1/ {
    @mmap_lock_acquire[tid] = nsecs;
    @lock_acquisitions++;
}

tracepoint:mmap_lock:mmap_lock_released /pid == $1 && @mmap_lock_acquire[tid] > 0/ {
    $hold_duration_ns = nsecs - @mmap_lock_acquire[tid];
    $hold_duration_us = $hold_duration_ns / 1000;

    if ($hold_duration_us > 100000) { // Alert on >100ms holds
        printf("CRITICAL: mmap_lock hold duration PID %d TID %d: %d μs\n", 
               pid, tid, $hold_duration_us);
        printf("  Current VMA count: %d, RSS pages: %d\n",
               curtask->mm->map_count,
               curtask->mm->rss_stat.count[0]); // field layout valid on kernels < 6.2; newer kernels use per-CPU rss counters
    }

    @mmap_lock_hold_histogram = hist($hold_duration_us);
    delete(@mmap_lock_acquire[tid]);
}

Production Telemetry: ChistaDATA’s ClickHouse deployments revealed mmap_lock hold durations exceeding 230 seconds, indicating kernel-level livelock conditions within the memory reclamation subsystem (shrink_lruvec() and related vmscan functions).

System Call Latency Distribution Analysis

eBPF syscall tracepoints enable precise measurement of kernel-to-userspace transition overhead and I/O subsystem performance characteristics:

#!/usr/bin/env bpftrace
tracepoint:syscalls:sys_enter_mincore /pid == $1/ {
    @syscall_enter[tid] = nsecs;
    @mincore_calls++;
}

tracepoint:syscalls:sys_exit_mincore /pid == $1 && @syscall_enter[tid] > 0/ {
    $latency_ns = nsecs - @syscall_enter[tid];
    $latency_us = $latency_ns / 1000;

    @mincore_latency_distribution = hist($latency_us);
    @mincore_total_time += $latency_ns;

    if (args->ret < 0) {
        @mincore_errors++;
        printf("mincore() error: %d\n", args->ret);
    }

    delete(@syscall_enter[tid]);
}

END {
    printf("mincore() statistics:\n");
    printf("  Total calls: %d\n", @mincore_calls);
    printf("  Total time: %d ms\n", @mincore_total_time / 1000000);
    if (@mincore_calls > 0) { // guard against division by zero when no calls were traced
        printf("  Average latency: %d μs\n",
               (@mincore_total_time / @mincore_calls) / 1000);
    }
}

This methodology enabled ChistaDATA engineers to definitively eliminate mincore() syscall overhead as a performance bottleneck, redirecting investigation toward the actual root cause in kernel memory management.

Virtual Memory Scanner (vmscan) Subsystem Profiling

The Linux kernel’s vmscan subsystem handles page reclamation under memory pressure. eBPF tracepoints provide visibility into reclaim efficiency and cgroup memory controller behavior:

#!/usr/bin/env bpftrace
tracepoint:vmscan:mm_vmscan_memcg_reclaim_begin /pid == $1/ {
    @reclaim_start[tid] = nsecs;
    @reclaim_attempts++;
}

tracepoint:vmscan:mm_vmscan_memcg_reclaim_end /pid == $1 && @reclaim_start[tid] > 0/ {
    $elapsed_ns = nsecs - @reclaim_start[tid];
    $elapsed_ms = $elapsed_ns / 1000000;

    // The end tracepoint only exposes nr_reclaimed; scan counts (and thus
    // efficiency) are tracked below via mm_vmscan_lru_shrink_inactive.
    printf("memcg reclaim: %ld pages reclaimed in %ld ms\n",
           args->nr_reclaimed, $elapsed_ms);

    @reclaim_duration_hist = hist($elapsed_ms);
    @pages_reclaimed_total += args->nr_reclaimed;

    delete(@reclaim_start[tid]);
}

// Per-LRU-shrink reclaim efficiency; bpftrace has no floating point,
// so report an integer percentage of scanned pages actually reclaimed.
tracepoint:vmscan:mm_vmscan_lru_shrink_inactive /pid == $1 && args->nr_scanned > 0/ {
    @pages_scanned_total += args->nr_scanned;
    @reclaim_efficiency_pct = hist((args->nr_reclaimed * 100) / args->nr_scanned);
}

tracepoint:vmscan:mm_vmscan_direct_reclaim_begin /pid == $1/ {
    printf("ALERT: Direct reclaim triggered (memory pressure)\n");
    @direct_reclaim_events++;
}
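Once a capture completes, the @pages_reclaimed_total and @pages_scanned_total counters give an overall reclaim-efficiency figure; persistently low efficiency under sustained scanning is the signature of the vmscan livelock described earlier. A sketch of the arithmetic (the counter values are illustrative):

```python
def reclaim_efficiency_pct(pages_reclaimed, pages_scanned):
    """Percentage of scanned pages actually reclaimed. Low values under
    sustained scanning indicate the kernel is spinning in vmscan."""
    if pages_scanned == 0:
        return None  # no reclaim activity observed in the capture window
    return 100.0 * pages_reclaimed / pages_scanned

# Illustrative totals from a capture window (hypothetical numbers):
# 2.5% efficiency means 97.5% of scanning work produced no free pages.
print(reclaim_efficiency_pct(12_000, 480_000))
```

Healthy reclaim typically frees a large fraction of what it scans; single-digit percentages combined with long hold durations on mmap_lock point at memory pressure, not query cost, as the root cause.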

Hardware Performance Counter Integration

eBPF programs can access hardware performance monitoring units (PMUs) for microarchitectural analysis:

#!/usr/bin/env bpftrace
hardware:cache-misses:1000000 /pid == $1/ {
    @cache_miss_samples++;
    @cache_miss_stacks[ustack] = count();
}

hardware:branch-misses:1000000 /pid == $1/ {
    @branch_miss_samples++;
    @branch_miss_stacks[ustack] = count(); // aggregate stacks rather than printing per sample
}

hardware:stalled-cycles-frontend:1000000 /pid == $1/ {
    @frontend_stalls++;
}

hardware:stalled-cycles-backend:1000000 /pid == $1/ {
    @backend_stalls++;
}
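Interpreting these counters usually means turning raw totals into ratios: IPC and stall fractions, for which cycle and instruction totals (e.g., from perf stat) are also needed. A sketch of the arithmetic with illustrative values:

```python
def pmu_summary(cycles, instructions, frontend_stalls, backend_stalls):
    """Derive IPC and stall fractions from raw PMU counter totals."""
    return {
        "ipc": instructions / cycles,
        "frontend_stall_pct": 100.0 * frontend_stalls / cycles,
        "backend_stall_pct": 100.0 * backend_stalls / cycles,
    }

# Illustrative counters (hypothetical): low IPC with a large backend-stall
# share usually means the CPU is waiting on memory, consistent with heavy
# cache-miss sampling in the probes above.
summary = pmu_summary(cycles=1_000_000_000, instructions=600_000_000,
                      frontend_stalls=150_000_000, backend_stalls=450_000_000)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

As a rule of thumb, backend-dominated stalls steer the investigation toward data layout, compression, and memory pressure; frontend-dominated stalls toward code footprint and branchy query plans.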

Production-Grade eBPF Monitoring Architecture

For continuous observability, Coroot implements a comprehensive eBPF-based monitoring agent that automatically instruments ClickHouse service-level indicators (SLIs) without application-side modifications:

Key Monitoring Capabilities

  • Query Performance Metrics: Latency percentiles (P50, P95, P99), throughput (QPS), error rates
  • ZooKeeper/ClickHouse Keeper Integration: Connection health, session timeouts, ensemble synchronization
  • Resource Utilization: CPU cycles per query, memory allocation patterns, I/O bandwidth consumption
  • Kernel-Level Bottlenecks: Lock contention, memory reclaim stalls, interrupt processing overhead
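Latency percentiles like the P50/P95/P99 figures above are typically computed over a window of per-query latencies; a minimal nearest-rank implementation (the sample window is illustrative):

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    the observations at or below it."""
    rank = math.ceil(p / 100.0 * len(sorted_vals))
    return sorted_vals[max(rank, 1) - 1]

# Illustrative per-query latencies in milliseconds: mostly fast queries
# with two slow outliers of the kind thread-pool starvation produces.
window = sorted([12, 15, 11, 230, 14, 13, 16, 18, 12, 950])
for p in (50, 95, 99):
    print(f"P{p}: {percentile(window, p)} ms")
```

Note how P50 stays healthy while P95/P99 capture the outliers; this is why tail percentiles, not averages, are the right SLI for contention problems.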

Technical Specifications and Remediation Matrix

| Performance Bottleneck | eBPF Instrumentation Method | Kernel Subsystem | Diagnostic Metrics | Remediation Strategy |
| --- | --- | --- | --- | --- |
| CPU hotspots | perf_event_open() + PMU sampling | Scheduler, CPU microarchitecture | Cycles per function, IPC, cache miss ratio | Query optimization, vectorization tuning |
| Thread pool contention | uprobe on pthread_create() | Process/thread management | Thread creation latency distribution | Upgrade to ClickHouse ≥24.10, increase max_thread_pool_free_size |
| mmap_lock contention | mmap_lock tracepoints | Virtual memory management | Lock hold duration, VMA count | Kernel tuning, memory layout optimization |
| I/O subsystem latency | syscall tracepoints (read, write, fsync) | Block layer, filesystem | Per-syscall latency histograms | Storage optimization, async I/O tuning |
| Memory reclaim stalls | vmscan tracepoints | Memory management | Pages scanned/reclaimed ratio, reclaim duration | cgroup memory limits, swap configuration |
| Network stack overhead | net tracepoints, socket uprobes | TCP/IP stack | Packet processing latency, socket buffer utilization | Network buffer tuning, TCP optimization |

Version-Specific Considerations

ClickHouse Version Requirements

ClickHouse < 24.10: Thread pool implementation exhibits severe mmap_lock contention under concurrent workloads. The 24.10 release introduced optimized thread pool management that reduced lock wait times by 860× (from ~450ms to ~0.5ms median latency).

Kernel Requirements

eBPF tracepoint availability varies by kernel version:

  • mmap_lock tracepoints: Linux ≥5.12
  • Enhanced vmscan tracepoints: Linux ≥5.8
  • Hardware PMU access via eBPF: Linux ≥4.17

Performance Impact

eBPF instrumentation overhead is typically <1% CPU utilization for production workloads, with memory overhead of ~10-50MB per monitored process depending on active probe count and sampling frequency.


The techniques above provide a foundation for implementing kernel-level observability in ClickHouse environments, surfacing performance bottlenecks that traditional monitoring tools cannot detect. ChistaDATA's expertise in eBPF-based performance analysis helps organizations optimize their ClickHouse deployments for efficiency and reliability.




About ChistaDATA Inc.
We are a full-stack ClickHouse infrastructure operations consulting, support, and managed services provider with core expertise in performance, scalability, and data SRE. Based in California, our consulting and support engineering teams operate out of San Francisco, Vancouver, London, Germany, Russia, Ukraine, Australia, Singapore, and India to deliver 24x7 enterprise-class consultative support and managed services. We work closely with some of the largest, planet-scale internet properties, including PayPal, Garmin, the Honda cars IoT project, Viacom, National Geographic, Nike, Morgan Stanley, American Express Travel, VISA, Netflix, PRADA, Blue Dart, Carlsberg, Sony, and Unilever.