ClickStack & HyperDX
Running a distributed application without unified observability is an operational liability. Teams end up stitching together separate log aggregators, APM tools, and session recorders — each with its own retention model, query language, and cost structure. ClickStack & HyperDX, launched by ClickHouse Inc. in May 2025 and built on the acquired HyperDX platform, changes that calculus. It bundles an opinionated OpenTelemetry Collector, ClickHouse, and the HyperDX UI into a coherent stack — a production-grade, open-source Datadog alternative correlating logs, traces, metrics, and session replays by trace_id and session_id in one interface. At ChistaDATA, we help engineering teams deploy, tune, and operate ClickStack & HyperDX on self-managed and fully managed ClickHouse clusters. This article covers architecture, canonical schemas, deployment patterns, ClickHouse tuning, and the failure modes we encounter most often in production.

Architecture: How ClickStack Fits Together
ClickStack is a three-component stack. Applications instrument with an OpenTelemetry SDK and emit OTLP data to a ClickStack-distributed OTel Collector. The collector runs receivers, processors (batching, enrichment, k8s attribute injection), and a ClickHouse exporter that creates canonical tables on first start. The HyperDX UI connects directly to ClickHouse’s HTTP interface and executes parameterized SQL, rendering log search, service maps, p50/p95/p99 latency charts, error tracking, and session replay timelines. All telemetry signals correlate through two join keys: TraceId (linking logs to the span that emitted them) and rum.sessionId (linking browser session events to server-side traces). This join-key design is baked into the HyperDX UI — click a log line and the correlated trace waterfall opens without a secondary query. The data flow is:
App (OTel SDK)
  → OTLP/HTTP or gRPC
  → OTel Collector (ClickStack distribution)
        processors: [memory_limiter, k8sattributes, batch]
        exporter: clickhouse (creates otel_logs, otel_traces,
                              otel_metrics_*, hyperdx_sessions)
  → ClickHouse (MergeTree tables, TTL 30 days, ZSTD compression)
  → HyperDX UI / SQL editor / alert engine
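To make the join-key design concrete, below is a minimal sketch of the pivot HyperDX performs when a log line is clicked; the trace ID literal is a placeholder, and the query assumes only the canonical otel_traces columns referenced later in this article:
-- Fetch every span for the trace that emitted the clicked log line
-- (the TraceId value is a placeholder)
SELECT
    Timestamp,
    ServiceName,
    Duration / 1e6 AS duration_ms,
    StatusCode
FROM default.otel_traces
WHERE TraceId = '4bf92f3577b34da6a3ce929d0e0e4736'
ORDER BY Timestamp;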
Deployment Options: Dev to Production
ClickStack ships four deployment modes. The all-in-one Docker image is the fastest path for local development: one command brings up ClickHouse, the OTel Collector, and HyperDX on port 8080 with OTLP on 4317 (gRPC) and 4318 (HTTP). For production Kubernetes, the official Helm chart separates components into independently scalable pods and supports external ClickHouse clusters, including a ChistaDATA-managed ClickHouse deployment. Below is a Docker Compose layout for staging:
version: "3.8"
services:
  clickhouse:
    image: clickhouse/clickhouse-server:25.3
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      CLICKHOUSE_DB: default
      CLICKHOUSE_USER: default
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
    volumes:
      - ch_data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
  otel-collector:
    image: ghcr.io/hyperdxio/hyperdx-otel-collector:latest
    ports:
      - "4317:4317" # gRPC OTLP receiver
      - "4318:4318" # HTTP OTLP receiver
    environment:
      CLICKHOUSE_ENDPOINT: "tcp://clickhouse:9000"
      CLICKHOUSE_DATABASE: default
      CLICKHOUSE_USERNAME: default
      CLICKHOUSE_PASSWORD: ""
    depends_on:
      - clickhouse
  hyperdx:
    image: docker.hyperdx.io/hyperdx/hyperdx-app:latest
    ports:
      - "8080:8080"
    environment:
      CLICKHOUSE_HOST: http://clickhouse:8123
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: ""
    depends_on:
      - clickhouse
      - otel-collector
volumes:
  ch_data:
For production Kubernetes, the same components are deployed through the official Helm chart; a trimmed values-prod.yaml pointing at an external ClickHouse cluster looks like this:
# helm upgrade --install clickstack clickstack/clickstack \
#   -f values-prod.yaml
clickhouse:
  enabled: false # use external ClickHouse
  external:
    host: "ch.prod.internal"
    port: 9000
    database: otel
    username: clickstack
    password: "changeme"
otelCollector:
  replicaCount: 3
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"
  config:
    processors:
      batch:
        send_batch_size: 50000
        timeout: 5s
      memory_limiter:
        limit_mib: 3500
        spike_limit_mib: 500
hyperdx:
  replicaCount: 2
  ingress:
    enabled: true
    hostname: observability.example.com
Canonical Schemas: What the Exporter Creates
The ClickHouse exporter provisions the canonical tables on first start, so there is no day-one schema design. The otel_logs definition below captures the choices that matter for query performance: LowCardinality dimensions, Delta plus ZSTD codecs, bloom filter indexes over the attribute maps, materialized Kubernetes columns, and a 30-day TTL that drops whole parts.
CREATE TABLE IF NOT EXISTS default.otel_logs
(
`Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
`TimestampTime` DateTime DEFAULT toDateTime(Timestamp),
`TraceId` String CODEC(ZSTD(1)),
`SpanId` String CODEC(ZSTD(1)),
`TraceFlags` UInt8,
`SeverityText` LowCardinality(String) CODEC(ZSTD(1)),
`SeverityNumber` UInt8,
`ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
`Body` String CODEC(ZSTD(1)),
`ResourceSchemaUrl` LowCardinality(String) CODEC(ZSTD(1)),
`ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`ScopeSchemaUrl` LowCardinality(String) CODEC(ZSTD(1)),
`ScopeName` String CODEC(ZSTD(1)),
`ScopeVersion` LowCardinality(String) CODEC(ZSTD(1)),
`ScopeAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`LogAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
-- Materialized columns for fast k8s attribute lookup
`__hdx_materialized_k8s.pod.name`
LowCardinality(String)
MATERIALIZED ResourceAttributes['k8s.pod.name'] CODEC(ZSTD(1)),
`__hdx_materialized_k8s.namespace.name`
LowCardinality(String)
MATERIALIZED ResourceAttributes['k8s.namespace.name'] CODEC(ZSTD(1)),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_lower_body lower(Body) TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 8
)
ENGINE = MergeTree
PARTITION BY toDate(TimestampTime)
PRIMARY KEY (ServiceName, TimestampTime)
ORDER BY (ServiceName, TimestampTime, Timestamp)
TTL TimestampTime + toIntervalDay(30)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1;
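The payoff of these design choices shows up in everyday filters. The following is an illustrative query, not part of the shipped schema; the service name, namespace, and search token are assumed values:
-- Partition and primary key prune by time and ServiceName, the materialized
-- k8s column avoids a per-row Map lookup, and the tokenbf_v1 index on
-- lower(Body) skips granules that cannot contain the token
SELECT Timestamp, SeverityText, Body
FROM default.otel_logs
WHERE ServiceName = 'payments-api'                      -- assumed service name
  AND `__hdx_materialized_k8s.namespace.name` = 'prod'  -- assumed namespace
  AND TimestampTime >= now() - INTERVAL 1 HOUR
  AND hasToken(lower(Body), 'timeout')                  -- assumed search token
ORDER BY Timestamp DESC
LIMIT 100;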
Querying and Alerting with HyperDX
HyperDX exposes a Lucene-style search bar that translates property filters (e.g., level:error service:payments-api) into native ClickHouse SQL with partition pruning. A built-in SQL editor gives direct access to otel_logs and otel_traces. Alert rules are SQL queries saved with a threshold and evaluation interval; when the result exceeds the threshold, HyperDX routes the alert to a webhook, PagerDuty, or Slack. A p99 latency alert and a cross-signal correlation query look like this:
-- HyperDX alert query: p99 latency > 500ms for checkout-service
-- Evaluated every 60 seconds over a 5-minute window
SELECT
toStartOfMinute(toDateTime(Timestamp)) AS window,
quantile(0.99)(Duration / 1e6) AS p99_ms,
count() AS span_count
FROM default.otel_traces
WHERE
ServiceName = 'checkout-service'
AND Timestamp >= now() - INTERVAL 5 MINUTE
GROUP BY window
HAVING p99_ms > 500
ORDER BY window DESC
LIMIT 1;
-- Cross-signal correlation: find logs for a slow trace
SELECT
l.Timestamp,
l.SeverityText,
l.Body,
l.LogAttributes['http.url'] AS url
FROM default.otel_logs AS l
INNER JOIN (
SELECT DISTINCT TraceId
FROM default.otel_traces
WHERE
ServiceName = 'checkout-service'
AND Duration / 1e6 > 500 -- spans slower than 500 ms
AND Timestamp >= now() - INTERVAL 1 HOUR
LIMIT 50
) AS slow_traces USING (TraceId)
ORDER BY l.Timestamp DESC
LIMIT 200;
ClickHouse Tuning for Telemetry Ingest
The collector emits many small insert batches, so the ingest path benefits from async inserts that coalesce them into fewer, larger parts on the server. Apply the settings below to the dedicated ingest user and keep the active part count per table in view:
-- Apply to the ClickStack ingest user on ClickHouse 24.8 / 25.3
ALTER USER clickstack_ingest
SETTINGS
async_insert = 1,
async_insert_max_data_size = 104857600, -- 100 MB buffer
async_insert_busy_timeout_ms = 5000,
insert_deduplicate = 0; -- OTel data is not idempotent
-- Verify current part count per table (watch for buildup > 300 active parts)
SELECT
table,
count() AS active_parts,
sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS disk_size
FROM system.parts
WHERE active AND database = 'default'
AND table IN ('otel_logs', 'otel_traces', 'otel_metrics_sum')
GROUP BY table
ORDER BY active_parts DESC;
Operational Concerns: Migrations, Multi-Tenancy, and Backups
Schema migrations across ClickStack version upgrades are handled via ALTER TABLE ... ADD COLUMN, which ClickHouse applies to table metadata instantly without rewriting existing data. New materialized columns (__hdx_materialized_*) require a MATERIALIZE COLUMN mutation to backfill existing parts, monitored via system.mutations.
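A minimal sketch of that upgrade flow, using a hypothetical new materialized column (the attribute name is illustrative, not taken from a specific ClickStack release):
-- Add the new materialized column; metadata-only, no data rewrite
ALTER TABLE default.otel_logs
    ADD COLUMN IF NOT EXISTS `__hdx_materialized_k8s.deployment.name`
        LowCardinality(String)
        MATERIALIZED ResourceAttributes['k8s.deployment.name'] CODEC(ZSTD(1));

-- Backfill existing parts; runs as an asynchronous mutation
ALTER TABLE default.otel_logs
    MATERIALIZE COLUMN `__hdx_materialized_k8s.deployment.name`;

-- Watch the backfill progress
SELECT command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'otel_logs' AND is_done = 0;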
Multi-tenant isolation in a shared cluster uses row policies rather than separate databases:
-- Multi-tenant row policy: tenant 'acme' sees only its own service data
CREATE ROW POLICY acme_isolation ON default.otel_logs
FOR SELECT
USING ServiceName IN (
SELECT service_name
FROM tenant_registry
WHERE tenant_id = 'acme'
)
TO acme_reader;
-- Grant read access to the tenant's ClickHouse user
GRANT SELECT ON default.otel_logs TO acme_reader;
-- Backup: ClickHouse native BACKUP to S3 (ClickHouse 24.8+); the S3() destination
-- takes a full HTTPS endpoint URL for the bucket path, not an s3:// URI
BACKUP TABLE default.otel_logs
TO S3('https://ch-backups.s3.amazonaws.com/otel_logs/2025-07-01/', 'ACCESS_KEY', 'SECRET_KEY')
SETTINGS compression_method='lz4', async=true;
-- Check backup status
SELECT * FROM system.backups ORDER BY start_time DESC LIMIT 5;
Failure Modes and Runbook
Three failure modes dominate ClickStack production incidents.
- Collector OOM under high-cardinality attribute bursts: a misconfigured k8s label selector that emits thousands of distinct attribute keys per span causes the batch processor's buffer to grow faster than it drains. Mitigation: place a memory_limiter processor before the batch processor and alert on collector heap metrics.
- Parts buildup: async inserts with a low fill rate trigger async_insert_busy_timeout_ms repeatedly, creating many small parts. When active parts exceed 300 per table, SELECT queries slow due to per-part metadata overhead; monitor system.parts and tune the batch timeout upward if average batch size stays below 10,000 rows.
- TTL lag: when merge throughput is saturated by heavy inserts, the 30-day TTL eviction falls behind. Force manual eviction with ALTER TABLE otel_logs MATERIALIZE TTL during a low-traffic window and lower merge_with_ttl_timeout to reprioritize TTL merges, as sketched after this list.
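A hedged sketch of the TTL-lag mitigation; the one-hour merge_with_ttl_timeout value is an assumption to adjust against available merge capacity, not a ClickStack default:
-- Re-evaluate TTL and evict expired data on existing parts (low-traffic window)
ALTER TABLE default.otel_logs MATERIALIZE TTL;

-- MergeTree setting: minimum delay in seconds before repeating a merge with
-- delete TTL (default 14400); lowering it lets TTL merges run more frequently
ALTER TABLE default.otel_logs MODIFY SETTING merge_with_ttl_timeout = 3600;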
ClickStack vs SigNoz vs Datadog
ClickStack is the most native ClickHouse observability experience available. SigNoz is also ClickHouse-backed and community-driven, but schema evolutions can lag behind ClickHouse engine improvements since the UI and schema are maintained independently of ClickHouse Inc. Datadog is proprietary SaaS with no query-level storage access — it is impossible to write cross-signal correlation SQL of the kind HyperDX enables natively, and ingestion costs at 1–5 TB/day are typically 10–50x higher than a self-managed ClickStack deployment. ClickStack’s trade-off is operational responsibility for ClickHouse cluster health, part merges, and TTL tuning — precisely where ChistaDATA’s managed ClickHouse services remove friction.
Key Takeaways
- ClickStack bundles HyperDX, an OTel Collector, and ClickHouse into a single open-source observability stack launched in May 2025 — the most integrated ClickHouse-native alternative to proprietary APM.
- All telemetry signals correlate through TraceId and rum.sessionId, enabling one-click pivots from a log line to the trace waterfall or browser session replay without secondary queries.
- Auto-created schemas use LowCardinality(String), Delta(8) + ZSTD(1) codecs, bloom filter secondary indexes, and a 30-day TTL with ttl_only_drop_parts = 1: zero schema design required on day one.
- Enable async_insert = 1 with insert_deduplicate = 0 on the ingest user; set async_insert_max_data_size to 100 MB to prevent excessive part creation.
- A 3-node ClickHouse cluster handles 1–5 TB raw logs/day; S3 cold-tier storage extends practical retention to 90+ days without NVMe cost.
- Multi-tenant isolation via row policies keeps schemas centralized while enforcing per-tenant data boundaries at query execution time.
- The three most common production failures are collector OOM under cardinality bursts, parts buildup from under-batched inserts, and TTL lag when merge throughput is saturated, all preventable with monitoring of system.parts and collector metrics.
How ChistaDATA Can Help
At ChistaDATA, we specialize in deploying and operating ClickHouse at production scale — including as the backend for ClickStack observability pipelines. Our engineers have hands-on experience sizing clusters for high-volume telemetry, tuning async insert settings, designing multi-tenant row policies, and building pre-aggregation Materialized Views that keep p99 latency under one second on hundreds of terabytes. Whether the goal is migrating from Datadog to ClickStack, hardening an existing self-managed deployment, or running ClickHouse on fully managed infrastructure with 24×7 support, we cover the full lifecycle: schema migrations during ClickStack upgrades, S3 cold-tier configuration, and proactive alerting on merge health and part counts. Schedule a consultation with our ClickHouse engineering team to discuss your production ClickStack requirements.
Frequently Asked Questions
What is ClickStack and how does it differ from a custom ClickHouse + Grafana observability stack?
ClickStack is an opinionated bundle of ClickHouse, an OTel Collector, and the HyperDX UI released by ClickHouse Inc. in May 2025. Unlike a custom Grafana setup, ClickStack auto-creates schemas with correct codecs, TTL, and secondary indexes, and HyperDX is purpose-built for ClickHouse query patterns. Teams skip weeks of schema design and collector configuration and get a working observability stack on day one.
Can HyperDX work with an existing ClickHouse cluster and custom schema?
Yes. HyperDX is schema-agnostic and can point at any ClickHouse table via source configuration. Engineers map timestamp, service name, body, and attribute expression fields in the HyperDX source editor. This makes HyperDX a viable UI upgrade for teams running custom OTel-to-ClickHouse pipelines without a schema migration to the ClickStack defaults.
What ClickHouse version is required to run ClickStack?
ClickStack supports ClickHouse 24.8 LTS and later, including 25.3. The schemas use standard MergeTree features — LowCardinality, Map types, bloom filter secondary indexes, and TTL with ttl_only_drop_parts — all stable since ClickHouse 23.x. The Helm chart and Docker images pin to a tested version, but the ClickHouse exporter is compatible with any 24.8+ instance.
How does ClickStack handle session replay data storage and correlation?
Browser session replay events are written to hyperdx_sessions, which mirrors the otel_logs schema without a default TTL. The rum.sessionId attribute is materialized as a column in otel_traces and indexed with a bloom filter. HyperDX uses this index to join session events to server-side traces — clicking a replay timestamp jumps to correlated backend spans without a full table scan.
What are the main cost drivers when running ClickStack at scale?
At 1–5 TB raw ingestion per day, the dominant costs are ClickHouse compute, NVMe for the hot tier, and S3 for cold-tier retention. ZSTD at 8–12x means 3 TB/day raw becomes roughly 250–375 GB stored. Transitioning partitions older than 30 days to an S3-backed cold tier typically cuts storage costs by 70–80% versus keeping all data on NVMe.
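As a rough sketch of that transition, assuming a storage policy named hot_cold with an S3-backed volume named cold has already been defined in the server configuration:
-- Switch to the tiered policy (the new policy must include the existing disks)
ALTER TABLE default.otel_logs MODIFY SETTING storage_policy = 'hot_cold';

-- Move parts older than 30 days to the S3 volume, delete after 90 days
ALTER TABLE default.otel_logs
    MODIFY TTL
        TimestampTime + toIntervalDay(30) TO VOLUME 'cold',
        TimestampTime + toIntervalDay(90) DELETE;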
Is ClickStack suitable for regulated environments requiring data residency or multi-tenant isolation?
Yes, with proper configuration. Row policies filter rows at query execution time based on the authenticated user’s tenant mapping. Data residency is enforced by deploying ClickHouse within the required geographic boundary — on-premises or single-region cloud. ChistaDATA has deployed ClickStack for financial services and healthcare clients where data residency and access control are audit requirements, pairing row policies with user-level quota limits.