ClickHouse and Vector: High-Throughput Log Pipelines for Production Observability

Collecting hundreds of millions of log events per day without data loss while querying that corpus in under a second during an incident requires a purpose-built pipeline. Elastic stacks buckle under index overhead that drives storage costs far higher than necessary, and legacy shippers like Logstash consume JVM resources before data reaches storage. The ClickHouse Vector log pipeline solves both problems. Vector, the high-performance Rust-based observability data pipeline maintained by Datadog, ships logs from every host directly into ClickHouse—a columnar OLAP engine compressing log data 10x or more and answering full-text queries across tens of billions of rows in sub-second time. At ChistaDATA, we deploy this architecture for production workloads ranging from Kubernetes-native microservices to bare-metal data platforms, and this guide covers everything needed to implement it correctly.

Vector: One Binary to Replace the Entire Shipper Stack

Vector is a single statically linked Rust binary that replaces Logstash, Fluentd, Filebeat, and Telegraf in one deployment. The architecture is built around three primitives: sources (where data enters), transforms (where data is shaped), and sinks (where data is delivered). Every component runs in the same process with zero inter-process serialization overhead, allowing Vector to sustain over one million events per second on a single aggregator node at a fraction of the CPU cost of a JVM-based pipeline. The official ClickHouse sink is a first-class component supporting batching, compression, TLS, and HTTP authentication. Vector is open-source under MPL-2.0 and backs the Datadog commercial platform at scale.

Relevant source types include kubernetes_logs (reads pod log files from /var/log/containers/), file (tails files with glob patterns), journald (streams from systemd journal), and syslog (listens on UDP/TCP 514 for RFC 3164 and RFC 5424). The Vector Remap Language (VRL) is the primary transform engine: a statically typed scripting language built for log mutation with zero runtime panics.
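
For hosts outside Kubernetes, the same agent can tail plain files and listen for syslog. The sketch below is illustrative only; the paths and listen port are assumptions rather than part of the reference configuration later in this guide.

# vector-agent.yaml — illustrative file and syslog sources for a non-Kubernetes host
sources:
  nginx_access:
    type: file
    include:
      - /var/log/nginx/*.log      # glob patterns are expanded at runtime

  edge_syslog:
    type: syslog
    address: "0.0.0.0:514"
    mode: udp                     # accepts RFC 3164 and RFC 5424 messages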

Why ClickHouse Is the Right Long-Term Log Store

Elasticsearch indexes every token in every field by default, producing index sizes that often exceed raw data. ClickHouse applies columnar storage with sparse primary indexes and optional data-skipping indexes added only where needed. For log workloads this produces 10–15x compression with ZSTD and sub-second query latency across tens of billions of rows—performance Elasticsearch cannot match at the same hardware cost.
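
A quick way to verify the compression ratio on your own workload is to compare compressed and uncompressed bytes in system.parts. The query below is a sketch against the logs.app_logs table created later in this guide:

-- Compression ratio of the log table (active parts only)
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.parts
WHERE active AND database = 'logs' AND table = 'app_logs'
GROUP BY table;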

ClickHouse’s columnar layout means a query on service = 'payment-api' AND level = 'ERROR' reads only two narrow columns rather than entire row blocks. The ORDER BY (service, timestamp) primary key makes service-scoped time-range scans particularly efficient. TTL expressions drop old partitions automatically, eliminating the index lifecycle management overhead common in Elastic deployments. Our ClickHouse consulting practice routinely sees teams cut log storage costs 80–90% when migrating from Elasticsearch to ClickHouse with Vector as the ingest layer.

Pipeline Architecture: Agent to Aggregator to ClickHouse

The recommended topology runs Vector as a DaemonSet on every Kubernetes node—one agent pod per node—collecting logs from all containers via the kubernetes_logs source. Agents perform lightweight parsing close to the source, then forward events to a small pool of Vector aggregator deployments (typically two to four pods) that handle batching, buffering, and delivery to ClickHouse over HTTPS on port 8443. The aggregator tier absorbs backpressure when ClickHouse is slow or unreachable, preventing in-memory queue buildup on agents. Events are written using the HTTP interface with JSONEachRow format. ClickHouse 24.8 LTS async_insert semantics allow the server to buffer concurrent small writes into optimal parts, preventing the small parts explosion that degrades query performance when many agents write without batching.
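
A minimal agent-side sketch is shown below; the Service address is an assumption for illustration, and the port matches the aggregator's vector source in the configuration that follows.

# vector-agent.yaml — DaemonSet pod configuration (sketch)
sources:
  k8s_pods:
    type: kubernetes_logs          # reads pod log files from /var/log/containers/

sinks:
  to_aggregator:
    type: vector                   # Vector-to-Vector native protocol
    inputs:
      - k8s_pods
    address: "vector-aggregator.observability.svc.cluster.local:9000"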

Configuring Vector: Sources, Transforms, and the ClickHouse Sink

The configuration below demonstrates a complete aggregator pipeline. Agents forward events via the vector source protocol; the aggregator parses JSON logs, extracts structured fields, and flushes to ClickHouse in compressed batches.

# vector.yaml — Vector aggregator configuration
# Tested with Vector 0.38+

sources:
  kubernetes_agent_input:
    type: vector
    address: "0.0.0.0:9000"

  journald_local:
    type: journald
    include_units:
      - kubelet.service
      - containerd.service

transforms:
  parse_app_logs:
    type: remap
    inputs:
      - kubernetes_agent_input
      - journald_local
    source: |
      # Parse JSON application logs; fall back to raw string
      parsed, err = parse_json(.message)
      if err == null {
        .message   = string(parsed.msg  ?? parsed.message ?? .message)
        .level     = upcase(string(parsed.level ?? parsed.severity ?? "INFO"))
        .trace_id  = string(parsed.trace_id ?? "")
        .attributes = {
          "error.type": string(parsed.error ?? ""),
          "http.status": to_string(int(parsed.status ?? 0))
        }
      } else {
        .level = "INFO"
        .attributes = {}
      }

      # Normalise service name from k8s labels
      .service = string(
        .kubernetes.pod_labels."app.kubernetes.io/name" ??
        .kubernetes.pod_labels.app ??
        "unknown"
      )

      # Sample DEBUG logs: drop roughly 90% to reduce volume
      if .level == "DEBUG" && random_float(0.0, 1.0) < 0.9 {
        abort
      }

      # PII scrubbing: mask email addresses
      .message = replace(.message, r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b', "[email]")

      if !exists(.timestamp) { .timestamp = now() }

sinks:
  clickhouse_logs:
    type: clickhouse
    inputs:
      - parse_app_logs
    endpoint: "https://clickhouse.internal:8443"
    database: "logs"
    table: "app_logs"
    auth:
      strategy: basic
      user: "vector_writer"
      password: "${CLICKHOUSE_PASSWORD}"
    tls:
      verify_certificate: true
      ca_file: "/etc/ssl/certs/clickhouse-ca.crt"
    compression: zstd
    encoding:
      timestamp_format: unix
    batch:
      max_events: 50000
      timeout_secs: 5
      max_bytes: 10485760   # 10 MiB
    buffer:
      type: disk
      max_size: 10737418240  # 10 GiB
      when_full: block
    request:
      retry_attempts: 10
      retry_initial_backoff_secs: 1
      retry_max_duration_secs: 60
    acknowledgements:
      enabled: true

ClickHouse Schema Design and Tokenized Full-Text Index

LowCardinality(String) applies dictionary encoding to bounded-vocabulary fields—service names, hostnames, log levels—dramatically reducing column sizes. A Map(LowCardinality(String), String) catch-all column absorbs arbitrary JSON fields, preventing schema drift from breaking the pipeline. The tokenized bloom-filter index on message enables full-text search without scanning the entire column.

-- ClickHouse 24.8 LTS
CREATE TABLE logs.app_logs ON CLUSTER '{cluster}'
(
    timestamp   DateTime64(3, 'UTC')               CODEC(Delta, ZSTD(3)),
    host        LowCardinality(String)             CODEC(ZSTD(3)),
    service     LowCardinality(String)             CODEC(ZSTD(3)),
    level       LowCardinality(String)             CODEC(ZSTD(3)),
    message     String                             CODEC(ZSTD(3)),
    trace_id    String                             CODEC(ZSTD(3)),
    attributes  Map(LowCardinality(String), String) CODEC(ZSTD(3)),

    INDEX message_idx message
          TYPE tokenbf_v1(32768, 3, 0)
          GRANULARITY 4
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/logs/app_logs',
    '{replica}'
)
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (service, level, timestamp)
TTL timestamp + INTERVAL 30 DAY DELETE
SETTINGS
    index_granularity      = 8192,
    merge_with_ttl_timeout = 86400;
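
The async_insert behaviour described in the architecture section is controlled by session- and user-level settings, not MergeTree table settings, so it is configured on the ingest user rather than in the DDL. A minimal sketch, assuming the vector_writer account used by the sink:

-- async_insert is applied per user or per query, not in CREATE TABLE
ALTER USER vector_writer SETTINGS
    async_insert                 = 1,
    wait_for_async_insert        = 1,   -- respond only after buffered data is flushed
    async_insert_max_data_size   = 10485760,
    async_insert_busy_timeout_ms = 5000;
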
-- Full-text search using the tokenbf_v1 skip index
SELECT
    timestamp,
    host,
    service,
    message,
    attributes['error.type'] AS error_type
FROM logs.app_logs
WHERE
    service   = 'payment-api'
    AND level = 'ERROR'
    AND timestamp >= now() - INTERVAL 1 HOUR
    AND hasToken(message, 'NullPointerException')
ORDER BY timestamp DESC
LIMIT 100
SETTINGS max_threads = 8, use_skip_indexes = 1;

Disk buffer saturation occurs when ClickHouse stays unreachable for longer than the buffer can cover. The when_full: block policy applies back-pressure rather than dropping events; Vector drains the buffer in order once ClickHouse recovers. Schema drift occurs when application teams add new JSON fields; the attributes map column absorbs unknown keys without pipeline restarts or redeployment.

TLS and authentication are non-negotiable in production. Apply VRL’s replace() with regex patterns to scrub PII from message bodies before data lands in ClickHouse, satisfying data protection requirements at the pipeline layer. The Grafana ClickHouse data source plugin connects over HTTP and supports SQL queries against logs.app_logs, enabling live error-rate panels that give on-call engineers immediate situational awareness.
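
As an illustration, a live error-rate panel can be driven by a plain SQL query against the schema above; the time bucket and window here are arbitrary:

-- Errors per service per minute over the last hour (Grafana panel sketch)
SELECT
    toStartOfMinute(timestamp) AS t,
    service,
    count() AS errors
FROM logs.app_logs
WHERE level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY t, service
ORDER BY t;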

# Deploy Vector DaemonSet (agent) and aggregator
kubectl apply -f vector-agent-daemonset.yaml
kubectl apply -f vector-aggregator-deployment.yaml

# Confirm all agent pods are running (one per node)
kubectl rollout status daemonset/vector-agent -n observability

# Confirm aggregator replicas are ready
kubectl rollout status deployment/vector-aggregator -n observability

# On a systemd host (non-k8s), verify Vector is active
systemctl status vector
journalctl -u vector -f --since "5 minutes ago"

# Check recently flushed async inserts (requires the asynchronous_insert_log system table)
clickhouse-client --query "
  SELECT database, table,
         count()   AS flushed_inserts,
         sum(rows) AS flushed_rows
  FROM system.asynchronous_insert_log
  WHERE status = 'Ok'
    AND event_time >= now() - INTERVAL 5 MINUTE
  GROUP BY database, table
"

Frequently Asked Questions

Does Vector support sending logs to ClickHouse Cloud?

Yes. The Vector ClickHouse sink works with ClickHouse Cloud by setting the endpoint to the Cloud HTTPS endpoint on port 8443, providing credentials via auth.user and auth.password, and enabling TLS certificate verification. ClickHouse Cloud enforces TLS on all connections. Async inserts are fully supported and recommended to handle concurrent agent writes without creating excess parts on the Cloud instance.
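
A sketch of the sink changes relative to the on-premises configuration above (the hostname is a placeholder for your Cloud service endpoint):

sinks:
  clickhouse_cloud_logs:
    type: clickhouse
    inputs:
      - parse_app_logs
    endpoint: "https://<your-service>.clickhouse.cloud:8443"   # placeholder
    auth:
      strategy: basic
      user: "default"                                          # or a dedicated ingest user
      password: "${CLICKHOUSE_CLOUD_PASSWORD}"
    tls:
      verify_certificate: true
    # database, table, batch, buffer, and request settings as in the sink shown earlier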

How does Vector handle ClickHouse downtime without losing log data?

Vector’s disk buffer persists events on the aggregator’s local disk when the ClickHouse sink cannot deliver data. The when_full: block policy applies back-pressure through the pipeline rather than dropping events. Once ClickHouse becomes reachable again, Vector drains the buffer in insertion order. Size the buffer to cover the longest expected ClickHouse maintenance window multiplied by peak ingest rate—typically 5–10 GiB for production deployments.
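
As a rough worked example with illustrative numbers: an aggregator sustaining 3 MB/s of ingest fills a 10 GiB buffer in roughly 10 × 1024 / 3 ≈ 3,400 seconds, a little under an hour of ClickHouse unavailability.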

What is the difference between tokenbf_v1 and ngrambf_v1 for log search?

tokenbf_v1 splits text on non-alphanumeric boundaries and stores bloom filter hashes of whole tokens. ngrambf_v1 stores hashes of fixed-length character n-grams. For log search targeting whole words or identifiers—exception class names, HTTP methods, trace IDs—tokenbf_v1 is more selective and produces smaller index sizes. Use ngrambf_v1 only when substring matching within a single token is genuinely required.
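
For comparison, an equivalent n-gram index on the same column could be declared as below; this is a sketch only, using 4-character n-grams, and is rarely needed when the token index already exists:

-- Hypothetical alternative: bloom filter over 4-character n-grams for substring search
ALTER TABLE logs.app_logs ON CLUSTER '{cluster}'
    ADD INDEX message_ngram_idx message
    TYPE ngrambf_v1(4, 32768, 3, 0)
    GRANULARITY 4;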

Can Vector enrich logs with Kubernetes pod metadata automatically?

Yes. The kubernetes_logs source attaches Kubernetes metadata to each event under the .kubernetes field: namespace, pod name, pod labels, container name, and node name. VRL transforms can then promote any field—such as a pod label value—to a top-level schema column, enabling per-service partitioning and efficient filtering in ClickHouse without additional enrichment infrastructure or sidecar processes.

How should log retention be managed in ClickHouse?

ClickHouse TTL expressions handle retention automatically. The TTL timestamp + INTERVAL 30 DAY DELETE clause drops rows older than 30 days during background merges. Partition-level TTL can tier older data to cheaper object storage before deletion. Aligning partition granularity with the TTL interval—daily partitions with a 30-day TTL—allows expired partitions to be dropped as whole units rather than through row-level merge operations, which is significantly more efficient.
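
A sketch of tiering plus deletion, assuming the table uses a storage policy that defines a 'cold' volume backed by object storage:

-- Move data older than 14 days to the cold volume, drop it entirely after 30 days
ALTER TABLE logs.app_logs ON CLUSTER '{cluster}'
    MODIFY TTL
        timestamp + INTERVAL 14 DAY TO VOLUME 'cold',
        timestamp + INTERVAL 30 DAY DELETE;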

Is it possible to run Vector without a separate aggregator tier?

For small Kubernetes clusters under 20 nodes or single-host deployments, agents can write directly to ClickHouse without a dedicated aggregator. At scale, however, each agent must maintain its own disk buffer and connection pool, and coordinating back-pressure across dozens of agents becomes operationally complex. The aggregator tier centralises buffering, reduces open connections to ClickHouse, and allows batching policies to be tuned in a single configuration file.
