ClickStack & HyperDX
Running a distributed application without unified observability is an operational liability. Teams end up stitching together separate log aggregators, APM tools, and session recorders — each with its own retention model, query language, and cost structure. ClickStack & HyperDX, launched by ClickHouse Inc. in May 2025 and built on the acquired HyperDX platform, changes that calculus. It bundles an opinionated OpenTelemetry Collector, ClickHouse, and the HyperDX UI into a coherent stack — a production-grade, open-source Datadog alternative correlating logs, traces, metrics, and session replays by trace_id and session_id in one interface. At ChistaDATA, we help engineering teams deploy, tune, and operate ClickStack & HyperDX on self-managed and fully managed ClickHouse clusters. This article covers architecture, canonical schemas, deployment patterns, ClickHouse tuning, and the failure modes we encounter most often in production.

Architecture: How ClickStack Fits Together
ClickStack is a three-component stack. Applications instrument with an OpenTelemetry SDK and emit OTLP data to a ClickStack-distributed OTel Collector. The collector runs receivers, processors (batching, enrichment, k8s attribute injection), and a ClickHouse exporter that creates canonical tables on first start. The HyperDX UI connects directly to ClickHouse’s HTTP interface and executes parameterized SQL, rendering log search, service maps, p50/p95/p99 latency charts, error tracking, and session replay timelines. All telemetry signals correlate through two join keys: TraceId (linking logs to the span that emitted them) and rum.sessionId (linking browser session events to server-side traces). This join-key design is baked into the HyperDX UI — click a log line and the correlated trace waterfall opens without a secondary query. The data flow is:
App (OTel SDK)
  → OTLP/HTTP or gRPC
  → OTel Collector (ClickStack distribution)
        processors: [memory_limiter, k8sattributes, batch]
        exporter: clickhouse (creates otel_logs, otel_traces,
                              otel_metrics_*, hyperdx_sessions)
  → ClickHouse (MergeTree tables, TTL 30 days, ZSTD compression)
  → HyperDX UI / SQL editor / alert engine
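To make the join-key design concrete, below is a minimal sketch of the pivot HyperDX performs when a log line is clicked; the trace ID literal is a placeholder, and the query assumes only the canonical otel_traces columns referenced later in this article:
-- Fetch every span for the trace that emitted the clicked log line
-- (the TraceId value is a placeholder)
SELECT
    Timestamp,
    ServiceName,
    Duration / 1e6 AS duration_ms,
    StatusCode
FROM default.otel_traces
WHERE TraceId = '4bf92f3577b34da6a3ce929d0e0e4736'
ORDER BY Timestamp;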
Deployment Options: Dev to Production
ClickStack ships four deployment modes. The all-in-one Docker image is the fastest path for local development: one command brings up ClickHouse, the OTel Collector, and HyperDX on port 8080 with OTLP on 4317 (gRPC) and 4318 (HTTP). For production Kubernetes, the official Helm chart separates components into independently scalable pods and supports external ClickHouse clusters, including a ChistaDATA-managed ClickHouse deployment. Below is a Docker Compose layout for staging:
version: "3.8"
services:
  clickhouse:
    image: clickhouse/clickhouse-server:25.3
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      CLICKHOUSE_DB: default
      CLICKHOUSE_USER: default
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
    volumes:
      - ch_data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
  otel-collector:
    image: ghcr.io/hyperdxio/hyperdx-otel-collector:latest
    ports:
      - "4317:4317" # gRPC OTLP receiver
      - "4318:4318" # HTTP OTLP receiver
    environment:
      CLICKHOUSE_ENDPOINT: "tcp://clickhouse:9000"
      CLICKHOUSE_DATABASE: default
      CLICKHOUSE_USERNAME: default
      CLICKHOUSE_PASSWORD: ""
    depends_on:
      - clickhouse
  hyperdx:
    image: docker.hyperdx.io/hyperdx/hyperdx-app:latest
    ports:
      - "8080:8080"
    environment:
      CLICKHOUSE_HOST: http://clickhouse:8123
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: ""
    depends_on:
      - clickhouse
      - otel-collector
volumes:
  ch_data:
For production Kubernetes, the same components are deployed through the official Helm chart; a trimmed values-prod.yaml pointing at an external ClickHouse cluster looks like this:
# helm upgrade --install clickstack clickstack/clickstack \
#   -f values-prod.yaml
clickhouse:
  enabled: false # use external ClickHouse
  external:
    host: "ch.prod.internal"
    port: 9000
    database: otel
    username: clickstack
    password: "changeme"
otelCollector:
  replicaCount: 3
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"
  config:
    processors:
      batch:
        send_batch_size: 50000
        timeout: 5s
      memory_limiter:
        limit_mib: 3500
        spike_limit_mib: 500
hyperdx:
  replicaCount: 2
  ingress:
    enabled: true
    hostname: observability.example.com
Canonical Schemas: What the Exporter Creates
The ClickHouse exporter provisions the canonical tables on first start, so there is no day-one schema design. The otel_logs definition below captures the choices that matter for query performance: LowCardinality dimensions, Delta plus ZSTD codecs, bloom filter indexes over the attribute maps, materialized Kubernetes columns, and a 30-day TTL that drops whole parts.
CREATE TABLE IF NOT EXISTS default.otel_logs
(
`Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
`TimestampTime` DateTime DEFAULT toDateTime(Timestamp),
`TraceId` String CODEC(ZSTD(1)),
`SpanId` String CODEC(ZSTD(1)),
`TraceFlags` UInt8,
`SeverityText` LowCardinality(String) CODEC(ZSTD(1)),
`SeverityNumber` UInt8,
`ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
`Body` String CODEC(ZSTD(1)),
`ResourceSchemaUrl` LowCardinality(String) CODEC(ZSTD(1)),
`ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`ScopeSchemaUrl` LowCardinality(String) CODEC(ZSTD(1)),
`ScopeName` String CODEC(ZSTD(1)),
`ScopeVersion` LowCardinality(String) CODEC(ZSTD(1)),
`ScopeAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`LogAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
-- Materialized columns for fast k8s attribute lookup
`__hdx_materialized_k8s.pod.name`
LowCardinality(String)
MATERIALIZED ResourceAttributes['k8s.pod.name'] CODEC(ZSTD(1)),
`__hdx_materialized_k8s.namespace.name`
LowCardinality(String)
MATERIALIZED ResourceAttributes['k8s.namespace.name'] CODEC(ZSTD(1)),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_lower_body lower(Body) TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 8
)
ENGINE = MergeTree
PARTITION BY toDate(TimestampTime)
PRIMARY KEY (ServiceName, TimestampTime)
ORDER BY (ServiceName, TimestampTime, Timestamp)
TTL TimestampTime + toIntervalDay(30)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1;
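The payoff of these design choices shows up in everyday filters. The following is an illustrative query, not part of the shipped schema; the service name, namespace, and search token are assumed values:
-- Partition and primary key prune by time and ServiceName, the materialized
-- k8s column avoids a per-row Map lookup, and the tokenbf_v1 index on
-- lower(Body) skips granules that cannot contain the token
SELECT Timestamp, SeverityText, Body
FROM default.otel_logs
WHERE ServiceName = 'payments-api'                      -- assumed service name
  AND `__hdx_materialized_k8s.namespace.name` = 'prod'  -- assumed namespace
  AND TimestampTime >= now() - INTERVAL 1 HOUR
  AND hasToken(lower(Body), 'timeout')                  -- assumed search token
ORDER BY Timestamp DESC
LIMIT 100;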
Querying and Alerting with HyperDX
HyperDX exposes a Lucene-style search bar that translates property filters (e.g., level:error service:payments-api) into native ClickHouse SQL with partition pruning. A built-in SQL editor gives direct access to otel_logs and otel_traces. Alert rules are SQL queries saved with a threshold and evaluation interval; when the result exceeds the threshold, HyperDX routes the alert to a webhook, PagerDuty, or Slack. A p99 latency alert and a cross-signal correlation query look like this:
-- HyperDX alert query: p99 latency > 500ms for checkout-service
-- Evaluated every 60 seconds over a 5-minute window
SELECT
toStartOfMinute(toDateTime(Timestamp)) AS window,
quantile(0.99)(Duration / 1e6) AS p99_ms,
count() AS span_count
FROM default.otel_traces
WHERE
ServiceName = 'checkout-service'
AND Timestamp >= now() - INTERVAL 5 MINUTE
GROUP BY window
HAVING p99_ms > 500
ORDER BY window DESC
LIMIT 1;
-- Cross-signal correlation: find logs for a slow trace
SELECT
l.Timestamp,
l.SeverityText,
l.Body,
l.LogAttributes['http.url'] AS url
FROM default.otel_logs AS l
INNER JOIN (
SELECT DISTINCT TraceId
FROM default.otel_traces
WHERE
ServiceName = 'checkout-service'
AND Duration / 1e6 > 500 -- spans slower than 500 ms
AND Timestamp >= now() - INTERVAL 1 HOUR
LIMIT 50
) AS slow_traces USING (TraceId)
ORDER BY l.Timestamp DESC
LIMIT 200;
ClickHouse Tuning for Telemetry Ingest
The collector emits many small insert batches, so the ingest path benefits from async inserts that coalesce them into fewer, larger parts on the server. Apply the settings below to the dedicated ingest user and keep the active part count per table in view:
-- Apply to the ClickStack ingest user on ClickHouse 24.8 / 25.3
ALTER USER clickstack_ingest
SETTINGS
async_insert = 1,
async_insert_max_data_size = 104857600, -- 100 MB buffer
async_insert_busy_timeout_ms = 5000,
insert_deduplicate = 0; -- OTel data is not idempotent
-- Verify current part count per table (watch for buildup > 300 active parts)
SELECT
table,
count() AS active_parts,
sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS disk_size
FROM system.parts
WHERE active AND database = 'default'
AND table IN ('otel_logs', 'otel_traces', 'otel_metrics_sum')
GROUP BY table
ORDER BY active_parts DESC;
Operational Concerns: Migrations, Multi-Tenancy, and Backups
Schema migrations across ClickStack version upgrades are handled via ALTER TABLE ... ADD COLUMN, which ClickHouse applies to table metadata instantly without rewriting existing data. New materialized columns (__hdx_materialized_*) require a MATERIALIZE COLUMN mutation to backfill existing parts, monitored via system.mutations.
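A minimal sketch of that upgrade flow, using a hypothetical new materialized column (the attribute name is illustrative, not taken from a specific ClickStack release):
-- Add the new materialized column; metadata-only, no data rewrite
ALTER TABLE default.otel_logs
    ADD COLUMN IF NOT EXISTS `__hdx_materialized_k8s.deployment.name`
        LowCardinality(String)
        MATERIALIZED ResourceAttributes['k8s.deployment.name'] CODEC(ZSTD(1));

-- Backfill existing parts; runs as an asynchronous mutation
ALTER TABLE default.otel_logs
    MATERIALIZE COLUMN `__hdx_materialized_k8s.deployment.name`;

-- Watch the backfill progress
SELECT command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'otel_logs' AND is_done = 0;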
Multi-tenant isolation in a shared cluster uses row policies rather than separate databases:
-- Multi-tenant row policy: tenant 'acme' sees only its own service data
CREATE ROW POLICY acme_isolation ON default.otel_logs
FOR SELECT
USING ServiceName IN (
SELECT service_name
FROM tenant_registry
WHERE tenant_id = 'acme'
)
TO acme_reader;
-- Grant read access to the tenant's ClickHouse user
GRANT SELECT ON default.otel_logs TO acme_reader;
-- Backup: ClickHouse native BACKUP to S3 (ClickHouse 24.8+); the S3() destination
-- takes a full HTTPS endpoint URL for the bucket path, not an s3:// URI
BACKUP TABLE default.otel_logs
TO S3('https://ch-backups.s3.amazonaws.com/otel_logs/2025-07-01/', 'ACCESS_KEY', 'SECRET_KEY')
SETTINGS compression_method='lz4', async=true;
-- Check backup status
SELECT * FROM system.backups ORDER BY start_time DESC LIMIT 5;
Failure Modes and Runbook
Three failure modes dominate ClickStack production incidents.
- Collector OOM under high-cardinality attribute bursts: a misconfigured k8s label selector that emits thousands of distinct attribute keys per span causes the batch processor's buffer to grow faster than it drains. Mitigation: place a memory_limiter processor before the batch processor and alert on collector heap metrics.
- Parts buildup: async inserts with a low fill rate trigger async_insert_busy_timeout_ms repeatedly, creating many small parts. When active parts exceed 300 per table, SELECT queries slow due to per-part metadata overhead; monitor system.parts and tune the batch timeout upward if average batch size stays below 10,000 rows.
- TTL lag: when merge throughput is saturated by heavy inserts, the 30-day TTL eviction falls behind. Force manual eviction with ALTER TABLE otel_logs MATERIALIZE TTL during a low-traffic window and lower merge_with_ttl_timeout to reprioritize TTL merges, as sketched after this list.
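A hedged sketch of the TTL-lag mitigation; the one-hour merge_with_ttl_timeout value is an assumption to adjust against available merge capacity, not a ClickStack default:
-- Re-evaluate TTL and evict expired data on existing parts (low-traffic window)
ALTER TABLE default.otel_logs MATERIALIZE TTL;

-- MergeTree setting: minimum delay in seconds before repeating a merge with
-- delete TTL (default 14400); lowering it lets TTL merges run more frequently
ALTER TABLE default.otel_logs MODIFY SETTING merge_with_ttl_timeout = 3600;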
ClickStack vs SigNoz vs Datadog
ClickStack is the most native ClickHouse observability experience available. SigNoz is also ClickHouse-backed and community-driven, but schema evolutions can lag behind ClickHouse engine improvements since the UI and schema are maintained independently of ClickHouse Inc. Datadog is proprietary SaaS with no query-level storage access — it is impossible to write cross-signal correlation SQL of the kind HyperDX enables natively, and ingestion costs at 1–5 TB/day are typically 10–50x higher than a self-managed ClickStack deployment. ClickStack’s trade-off is operational responsibility for ClickHouse cluster health, part merges, and TTL tuning — precisely where ChistaDATA’s managed ClickHouse services remove friction.
Key Takeaways
- ClickStack bundles HyperDX, an OTel Collector, and ClickHouse into a single open-source observability stack launched in May 2025 — the most integrated ClickHouse-native alternative to proprietary APM.
- All telemetry signals correlate through TraceId and rum.sessionId, enabling one-click pivots from a log line to the trace waterfall or browser session replay without secondary queries.
- Auto-created schemas use LowCardinality(String), Delta(8) + ZSTD(1) codecs, bloom filter secondary indexes, and a 30-day TTL with ttl_only_drop_parts = 1: zero schema design required on day one.
- Enable async_insert = 1 with insert_deduplicate = 0 on the ingest user; set async_insert_max_data_size to 100 MB to prevent excessive part creation.
- A 3-node ClickHouse cluster handles 1–5 TB raw logs/day; S3 cold-tier storage extends practical retention to 90+ days without NVMe cost.
- Multi-tenant isolation via row policies keeps schemas centralized while enforcing per-tenant data boundaries at query execution time.
- The three most common production failures are collector OOM under cardinality bursts, parts buildup from under-batched inserts, and TTL lag when merge throughput is saturated, all preventable with monitoring of system.parts and collector metrics.
How ChistaDATA Can Help
At ChistaDATA, we specialize in deploying and operating ClickHouse at production scale — including as the backend for ClickStack observability pipelines. Our engineers have hands-on experience sizing clusters for high-volume telemetry, tuning async insert settings, designing multi-tenant row policies, and building pre-aggregation Materialized Views that keep p99 latency under one second on hundreds of terabytes. Whether the goal is migrating from Datadog to ClickStack, hardening an existing self-managed deployment, or running ClickHouse on fully managed infrastructure with 24×7 support, we cover the full lifecycle: schema migrations during ClickStack upgrades, S3 cold-tier configuration, and proactive alerting on merge health and part counts. Schedule a consultation with our ClickHouse engineering team to discuss your production ClickStack requirements.
Frequently Asked Questions
What is ClickStack and how does it differ from a custom ClickHouse + Grafana observability stack?
ClickStack is an opinionated bundle of ClickHouse, an OTel Collector, and the HyperDX UI released by ClickHouse Inc. in May 2025. Unlike a custom Grafana setup, ClickStack auto-creates schemas with correct codecs, TTL, and secondary indexes, and HyperDX is purpose-built for ClickHouse query patterns. Teams skip weeks of schema design and collector configuration and get a working observability stack on day one.
Can HyperDX work with an existing ClickHouse cluster and custom schema?
Yes. HyperDX is schema-agnostic and can point at any ClickHouse table via source configuration. Engineers map timestamp, service name, body, and attribute expression fields in the HyperDX source editor. This makes HyperDX a viable UI upgrade for teams running custom OTel-to-ClickHouse pipelines without a schema migration to the ClickStack defaults.
What ClickHouse version is required to run ClickStack?
ClickStack supports ClickHouse 24.8 LTS and later, including 25.3. The schemas use standard MergeTree features — LowCardinality, Map types, bloom filter secondary indexes, and TTL with ttl_only_drop_parts — all stable since ClickHouse 23.x. The Helm chart and Docker images pin to a tested version, but the ClickHouse exporter is compatible with any 24.8+ instance.
How does ClickStack handle session replay data storage and correlation?
Browser session replay events are written to hyperdx_sessions, which mirrors the otel_logs schema without a default TTL. The rum.sessionId attribute is materialized as a column in otel_traces and indexed with a bloom filter. HyperDX uses this index to join session events to server-side traces — clicking a replay timestamp jumps to correlated backend spans without a full table scan.
What are the main cost drivers when running ClickStack at scale?
At 1–5 TB raw ingestion per day, the dominant costs are ClickHouse compute, NVMe for the hot tier, and S3 for cold-tier retention. ZSTD at 8–12x means 3 TB/day raw becomes roughly 250–375 GB stored. Transitioning partitions older than 30 days to an S3-backed cold tier typically cuts storage costs by 70–80% versus keeping all data on NVMe.
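As a rough sketch of that transition, assuming a storage policy named hot_cold with an S3-backed volume named cold has already been defined in the server configuration:
-- Switch to the tiered policy (the new policy must include the existing disks)
ALTER TABLE default.otel_logs MODIFY SETTING storage_policy = 'hot_cold';

-- Move parts older than 30 days to the S3 volume, delete after 90 days
ALTER TABLE default.otel_logs
    MODIFY TTL
        TimestampTime + toIntervalDay(30) TO VOLUME 'cold',
        TimestampTime + toIntervalDay(90) DELETE;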
Is ClickStack suitable for regulated environments requiring data residency or multi-tenant isolation?
Yes, with proper configuration. Row policies filter rows at query execution time based on the authenticated user’s tenant mapping. Data residency is enforced by deploying ClickHouse within the required geographic boundary — on-premises or single-region cloud. ChistaDATA has deployed ClickStack for financial services and healthcare clients where data residency and access control are audit requirements, pairing row policies with user-level quota limits.