Advanced eBPF-Based Performance Analysis for ClickHouse: Kernel-Level Observability Techniques
eBPF (extended Berkeley Packet Filter) delivers comprehensive, low-overhead kernel-level observability for ClickHouse performance analysis. By dynamically instrumenting kernel tracepoints, kprobes, uprobes, and hardware performance monitoring units (PMUs), eBPF avoids application-level modifications, recompilation with profiling flags, and in many cases the need for debug symbols.
CPU Microarchitecture Profiling with Hardware Performance Counters
The perf subsystem, together with eBPF programs attached via the perf_event_open() syscall, exposes hardware performance monitoring units (PMUs) for precise CPU-cycle attribution and microarchitectural event sampling:
# Sample CPU cycles with call-graph reconstruction using frame pointers
sudo perf record -e cycles:u -g --call-graph=fp -p $(pidof clickhouse-server) --freq=997

# Generate a detailed performance report with symbol resolution
sudo perf report --stdio --sort=dso,symbol --percent-limit=1
This technique captures statistical sampling data at configurable frequencies (typically 997 Hz to avoid timer aliasing), generating flame-graph compatible stack traces that reveal CPU time distribution across ClickHouse’s execution contexts. This includes:
- Query parsing (AST construction)
- Vectorized expression evaluation
- Compression algorithms (LZ4/ZSTD)
- Background merge operations
- Kernel functions such as shrink_lruvec() in the memory management subsystem
Real-World Impact: ChistaDATA’s production analysis revealed that 64% of CPU cycles were consumed within kernel memory reclamation paths rather than user-space query processing—a bottleneck invisible to application performance monitoring (APM) tools that lack kernel visibility.
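The raw samples behind such a finding can be dumped as text with `perf script` and post-processed into the "folded" stack format consumed by flame-graph tooling. A minimal Python sketch, assuming the default `perf script` layout (one non-indented header line per sample, followed by indented frame lines, deepest frame first) — the parsing rules here are an illustration, not a complete `perf script` grammar:

```python
from collections import Counter

def fold_stacks(perf_script_text: str) -> Counter:
    """Collapse `perf script` output into flamegraph 'folded' format.

    Assumes the default layout: a header line per sample, then indented
    frame lines of the form "<addr> <symbol> (<dso>)", deepest first.
    Returns {"root;...;leaf": sample_count}.
    """
    folded = Counter()
    frames = []
    for line in perf_script_text.splitlines():
        if line[:1] in ("\t", " ") and line.strip():
            parts = line.strip().split(" ", 1)
            # Drop the leading address and the trailing " (<dso>)".
            symbol = parts[1].rsplit(" (", 1)[0] if len(parts) > 1 else parts[0]
            frames.append(symbol)
        elif frames:
            # A blank line or the next header ends the sample; perf
            # prints leaf-first, folded format wants root-first.
            folded[";".join(reversed(frames))] += 1
            frames = []
    if frames:
        folded[";".join(reversed(frames))] += 1
    return folded
```

The folded output feeds directly into flamegraph.pl or speedscope, which is how kernel-dominated profiles (like the 64% reclaim case above) become visually obvious.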
Thread Pool Synchronization and Lock Contention Analysis
ClickHouse’s global thread pool implementation utilizes POSIX threading primitives that can exhibit severe contention under high concurrency workloads. eBPF uprobes enable precise instrumentation of pthread_create() latency and associated kernel synchronization overhead:
#!/usr/bin/env bpftrace
// Usage: bpftrace thread_create.bt <clickhouse-server PID>
// Note: on glibc >= 2.34, pthread_create lives in libc.so.6 rather than libpthread.

uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_create /pid == $1/ {
    @thread_create_start[tid] = nsecs;
    @thread_create_count++;
}

uretprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_create /pid == $1/ {
    if (@thread_create_start[tid] != 0) {
        $latency_ns = nsecs - @thread_create_start[tid];
        $latency_us = $latency_ns / 1000;
        // total_vm is virtual size, not RSS; * 4 assumes 4 KB pages
        printf("pthread_create() latency: %d μs, VMA count: %d, VmSize: %d KB\n",
               $latency_us, curtask->mm->map_count, curtask->mm->total_vm * 4);
        @pthread_latency_hist = hist($latency_us);
        delete(@thread_create_start[tid]);
    }
}
Critical Finding: ChistaDATA’s forensic analysis identified thread creation latencies exceeding 450,000 μs (450ms) under concurrent load, attributed to mmap_lock (formerly mmap_sem) contention in the kernel’s virtual memory area (VMA) management subsystem. This cascading synchronization bottleneck caused even trivial SELECT 1 queries to exhibit 5-30 second response times due to thread pool starvation.
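bpftrace's hist() aggregates values into power-of-two buckets, so a 450,000 μs outlier lands in the [256K, 512K) bin rather than appearing as an exact value. When post-processing exported histogram data, the same binning can be reproduced in user space; a minimal Python sketch of that bucketing:

```python
def hist_bucket(value: int) -> tuple[int, int]:
    """Return the [low, high) power-of-two bucket a positive value
    falls into, mirroring bpftrace's hist() binning."""
    if value < 1:
        return (0, 1)
    low = 1
    while low * 2 <= value:
        low *= 2          # climb powers of two until the next would overshoot
    return (low, low * 2)
```

For example, a 450,000 μs thread-creation latency maps to the (262144, 524288) bucket, i.e. the [256K, 512K) row in bpftrace's output.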
Memory Management Subsystem Instrumentation
eBPF tracepoints provide deep visibility into kernel memory management operations, particularly the mmap_lock read-write semaphore that serializes virtual memory area modifications:
#!/usr/bin/env bpftrace
// Usage: bpftrace mmap_lock.bt <clickhouse-server PID>
// Requires the mmap_lock tracepoints (Linux >= 5.12).

tracepoint:mmap_lock:mmap_lock_acquire_returned /pid == $1/ {
    @mmap_lock_acquire[tid] = nsecs;
    @lock_acquisitions++;
}

tracepoint:mmap_lock:mmap_lock_released /pid == $1 && @mmap_lock_acquire[tid] > 0/ {
    $hold_duration_ns = nsecs - @mmap_lock_acquire[tid];
    $hold_duration_us = $hold_duration_ns / 1000;
    if ($hold_duration_us > 100000) {    // alert on holds longer than 100 ms
        printf("CRITICAL: mmap_lock hold duration PID %d TID %d: %d μs\n",
               pid, tid, $hold_duration_us);
        // rss_stat layout changed in Linux 6.2; count[0] matches 5.12-6.1
        printf("  Current VMA count: %d, RSS pages: %ld\n",
               curtask->mm->map_count, curtask->mm->rss_stat.count[0]);
    }
    @mmap_lock_hold_histogram = hist($hold_duration_us);
    delete(@mmap_lock_acquire[tid]);
}
Production Telemetry: ChistaDATA’s ClickHouse deployments revealed mmap_lock hold durations exceeding 230 seconds, indicating kernel-level livelock conditions within the memory reclamation subsystem (shrink_lruvec() and related vmscan functions).
System Call Latency Distribution Analysis
eBPF syscall tracepoints enable precise measurement of kernel-to-userspace transition overhead and I/O subsystem performance characteristics:
#!/usr/bin/env bpftrace
// Usage: bpftrace mincore.bt <clickhouse-server PID>

tracepoint:syscalls:sys_enter_mincore /pid == $1/ {
    @syscall_enter[tid] = nsecs;
    @mincore_calls++;
}

tracepoint:syscalls:sys_exit_mincore /pid == $1 && @syscall_enter[tid] > 0/ {
    $latency_ns = nsecs - @syscall_enter[tid];
    $latency_us = $latency_ns / 1000;
    @mincore_latency_distribution = hist($latency_us);
    @mincore_total_time += $latency_ns;
    if (args->ret < 0) {
        @mincore_errors++;
        printf("mincore() error: %d\n", args->ret);
    }
    delete(@syscall_enter[tid]);
}

END {
    printf("mincore() statistics:\n");
    printf("  Total calls: %d\n", @mincore_calls);
    printf("  Total time: %d ms\n", @mincore_total_time / 1000000);
    if (@mincore_calls > 0) {    // guard against division by zero
        printf("  Average latency: %d μs\n",
               (@mincore_total_time / @mincore_calls) / 1000);
    }
}
This methodology enabled ChistaDATA engineers to definitively eliminate mincore() syscall overhead as a performance bottleneck, redirecting investigation toward the actual root cause in kernel memory management.
Virtual Memory Scanner (vmscan) Subsystem Profiling
The Linux kernel’s vmscan subsystem handles page reclamation under memory pressure. eBPF tracepoints provide visibility into reclaim efficiency and cgroup memory controller behavior:
#!/usr/bin/env bpftrace
// Usage: bpftrace vmscan.bt <clickhouse-server PID>
// Note: the memcg reclaim_end tracepoint reports only nr_reclaimed, and
// bpftrace has no floating-point arithmetic, so efficiency percentages
// are best computed in user space from the exported counters.

tracepoint:vmscan:mm_vmscan_memcg_reclaim_begin /pid == $1/ {
    @reclaim_start[tid] = nsecs;
    @reclaim_attempts++;
}

tracepoint:vmscan:mm_vmscan_memcg_reclaim_end /pid == $1 && @reclaim_start[tid] > 0/ {
    $elapsed_ns = nsecs - @reclaim_start[tid];
    $elapsed_ms = $elapsed_ns / 1000000;
    printf("memcg reclaim: %lu pages reclaimed in %d ms\n",
           args->nr_reclaimed, $elapsed_ms);
    @reclaim_duration_hist = hist($elapsed_ms);
    @pages_reclaimed_total += args->nr_reclaimed;
    delete(@reclaim_start[tid]);
}

tracepoint:vmscan:mm_vmscan_direct_reclaim_begin /pid == $1/ {
    printf("ALERT: direct reclaim triggered (memory pressure)\n");
    @direct_reclaim_events++;
}
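Because bpftrace lacks floating-point arithmetic, ratio metrics such as reclaim efficiency (pages reclaimed per page scanned) are easiest to compute in user space from exported counters. A minimal Python sketch under that assumption — the function name and the 100%-on-zero convention are illustrative choices:

```python
def reclaim_efficiency_pct(pages_reclaimed: int, pages_scanned: int) -> float:
    """Reclaim efficiency: the share of scanned pages actually freed.

    Values near 100% mean reclaim is cheap; values near 0% mean the
    kernel is burning CPU scanning pages it cannot free (the
    shrink_lruvec() livelock pattern described above).
    """
    if pages_scanned <= 0:
        return 100.0    # nothing scanned: no measurable pressure
    return pages_reclaimed * 100.0 / pages_scanned
```

Sustained single-digit efficiency alongside direct-reclaim alerts is a strong signal that cgroup memory limits or swap configuration need attention.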
Hardware Performance Counter Integration
eBPF programs can access hardware performance monitoring units (PMUs) for microarchitectural analysis:
#!/usr/bin/env bpftrace
// Usage: bpftrace pmu.bt <clickhouse-server PID>
// Each probe samples once per 1,000,000 hardware events.

hardware:cache-misses:1000000 /pid == $1/ {
    @cache_miss_samples++;
    @cache_miss_stacks[ustack] = count();
}

hardware:branch-misses:1000000 /pid == $1/ {
    @branch_miss_samples++;
    printf("Branch misprediction at %s\n", usym(reg("ip")));
}

hardware:stalled-cycles-frontend:1000000 /pid == $1/ {
    @frontend_stalls++;
}

hardware:stalled-cycles-backend:1000000 /pid == $1/ {
    @backend_stalls++;
}
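Raw stall counts are easier to reason about as fractions of total cycles, in the spirit of top-down microarchitecture analysis. A minimal Python sketch for post-processing exported counter values — the parameter names are illustrative, not a fixed API:

```python
def stall_fractions(cycles: int, frontend_stalled: int,
                    backend_stalled: int) -> dict[str, float]:
    """Express stalled cycles as fractions of total cycles.

    A high front-end fraction suggests fetch/decode pressure (e.g.
    i-cache or iTLB misses); a high back-end fraction suggests
    data-cache misses or execution-port saturation.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return {
        "frontend_bound": frontend_stalled / cycles,
        "backend_bound": backend_stalled / cycles,
    }
```

For vectorized ClickHouse workloads a dominant back-end fraction typically points at memory bandwidth or cache-miss pressure rather than instruction-supply problems.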
Production-Grade eBPF Monitoring Architecture
For continuous observability, Coroot implements a comprehensive eBPF-based monitoring agent that automatically instruments ClickHouse service-level indicators (SLIs) without application-side modifications:
Key Monitoring Capabilities
- Query Performance Metrics: Latency percentiles (P50, P95, P99), throughput (QPS), error rates
- ZooKeeper/ClickHouse Keeper Integration: Connection health, session timeouts, ensemble synchronization
- Resource Utilization: CPU cycles per query, memory allocation patterns, I/O bandwidth consumption
- Kernel-Level Bottlenecks: Lock contention, memory reclaim stalls, interrupt processing overhead
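Latency percentiles like P50/P95/P99 can be computed from raw per-query samples with a nearest-rank calculation; a minimal Python sketch (an illustration of the statistic, not Coroot's implementation, which aggregates histograms rather than raw samples):

```python
import math

def nearest_rank_percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    q% of all samples are <= it. q must be in (0, 100]."""
    if not samples or not 0 < q <= 100:
        raise ValueError("need samples and 0 < q <= 100")
    ordered = sorted(samples)
    rank = math.ceil(q / 100 * len(ordered))    # 1-based rank
    return ordered[rank - 1]
```

Nearest-rank always returns an observed value, which keeps tail percentiles honest when latency distributions are heavily skewed.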
Technical Specifications and Remediation Matrix
| Performance Bottleneck | eBPF Instrumentation Method | Kernel Subsystem | Diagnostic Metrics | Remediation Strategy |
|---|---|---|---|---|
| CPU hotspots | perf_event_open() + PMU sampling | Scheduler, CPU microarchitecture | Cycles per function, IPC, cache miss ratio | Query optimization, vectorization tuning |
| Thread pool contention | uprobe on pthread_create() | Process/thread management | Thread creation latency distribution | Upgrade to ClickHouse ≥24.10, increase max_thread_pool_free_size |
| mmap_lock contention | mmap_lock tracepoints | Virtual memory management | Lock hold duration, VMA count | Kernel tuning, memory layout optimization |
| I/O subsystem latency | syscall tracepoints (read, write, fsync) | Block layer, filesystem | Per-syscall latency histograms | Storage optimization, async I/O tuning |
| Memory reclaim stalls | vmscan tracepoints | Memory management | Pages scanned/reclaimed ratio, reclaim duration | cgroup memory limits, swap configuration |
| Network stack overhead | net tracepoints, socket uprobes | TCP/IP stack | Packet processing latency, socket buffer utilization | Network buffer tuning, TCP optimization |
Version-Specific Considerations
ClickHouse Version Requirements
ClickHouse < 24.10: Thread pool implementation exhibits severe mmap_lock contention under concurrent workloads. The 24.10 release introduced optimized thread pool management that reduced lock wait times by 860× (from ~450ms to ~0.5ms median latency).
Kernel Requirements
eBPF tracepoint availability varies by kernel version:
- mmap_lock tracepoints: Linux ≥5.12
- Enhanced vmscan tracepoints: Linux ≥5.8
- Hardware PMU access via eBPF: Linux ≥4.17
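Before deploying the scripts above, it is worth verifying that the required tracepoints exist on the target kernel; a quick shell check (tracefs may be mounted at /sys/kernel/debug/tracing on older distributions):

```shell
# List the tracepoint subsystems the scripts depend on; a missing
# directory means the kernel is too old or tracefs is not mounted.
for subsys in mmap_lock vmscan; do
  if ls /sys/kernel/tracing/events/"$subsys" >/dev/null 2>&1; then
    echo "$subsys tracepoints: available"
  else
    echo "$subsys tracepoints: NOT available"
  fi
done
```

bpftrace itself offers the same information via `bpftrace -l 'tracepoint:mmap_lock:*'` when run with sufficient privileges.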
Performance Impact
eBPF instrumentation overhead is typically <1% CPU utilization for production workloads, with memory overhead of ~10-50MB per monitored process depending on active probe count and sampling frequency.
This comprehensive blog post provides the foundation for implementing kernel-level observability in ClickHouse environments, enabling unprecedented visibility into performance bottlenecks that traditional monitoring tools cannot detect. ChistaDATA’s expertise in eBPF-based performance analysis helps organizations optimize their ClickHouse deployments for maximum efficiency and reliability.
