ClickHouse and Prometheus: Long-Term Metrics Storage with Remote Write at Scale

Prometheus is the de facto standard for infrastructure metrics collection, yet its local TSDB is architected for short-term, high-fidelity storage — not for the years of retention that capacity planning, SLO trend analysis, and compliance audits demand. The default 15-day retention window evaporates long before engineering teams can answer questions like “how did CPU utilization trend across the past three quarters?” At ChistaDATA, we work with teams running thousands of Prometheus targets who cannot afford the memory explosion that comes with high-cardinality label sets, nor the operational overhead of Thanos or Cortex. This article explains three integration patterns for ClickHouse Prometheus long-term metrics storage, deep-dives into the ClickHouse-native TimeSeries engine introduced in 24.x, and shows how to configure remote_write, build a downsampling pipeline, and operate the stack in production.


The Prometheus Retention Problem

Prometheus stores samples in a local TSDB bounded by the --storage.tsdb.retention.time flag (default: 15 days) or a byte limit. Each active time series is indexed in memory: a single server scraping 5,000 targets at 60-second intervals, at a typical 100 series per target, accumulates 500,000 active series and consumes 4–8 GB of RAM for the head chunk and inverted index. Add a Kubernetes cluster with ephemeral pod labels and cardinality climbs into the millions, crashing the server or forcing series throttling.

Beyond memory, the business case for long-term retention is clear. Capacity planning requires 12–24 months of utilization history. SLO dashboards comparing quarter-over-quarter trends need 365+ days. Security audits may require two years of login attempt counters. None of these is achievable with the local TSDB alone. The standard solutions — Thanos, Cortex, and Mimir — solve durability at the cost of significant Kubernetes operational complexity. VictoriaMetrics is lighter but uses its own custom storage format rather than an open SQL engine. ClickHouse, as a columnar OLAP database with a mature ecosystem of codecs and TTL policies, is an alternative that many teams have already deployed for logs and traces.

Three Integration Patterns

At ChistaDATA, we see three practical integration patterns in production environments.

Pattern 1 — Prometheus remote_write via adapter. A bridge process (prom2click, prometheus-clickhouse-bridge, or a custom Go service) receives Prometheus remote_write payloads encoded as Snappy-compressed Protobuf, deserializes WriteRequest messages, and bulk-inserts sample batches into ClickHouse via the native TCP protocol. This path works with any ClickHouse version and is fully independent of the TimeSeries engine.
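
To make the data path concrete, the sketch below shows the INSERT shape such an adapter might emit after decoding one WriteRequest. This is a minimal illustration assuming the metrics.samples schema defined later in this article; real adapters batch tens of thousands of rows per statement over the native protocol.

-- Illustrative adapter output: one decoded WriteRequest becomes one batched INSERT
INSERT INTO metrics.samples (metric, tags, value, timestamp) VALUES
    ('node_cpu_seconds_total', map('job', 'node', 'instance', 'db-01:9100', 'mode', 'idle'), 48123.75, '2024-11-05 10:15:04.000'),
    ('node_cpu_seconds_total', map('job', 'node', 'instance', 'db-01:9100', 'mode', 'user'), 9870.25,  '2024-11-05 10:15:04.000');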

Pattern 2 — ClickHouse-native Prometheus protocol support. Starting with ClickHouse 24.8 LTS, the server can expose /write and /read endpoints that speak the Prometheus remote_write and remote_read protocols natively (the capability is tied to the experimental TimeSeries engine described below). No adapter process is required: Prometheus points remote_write.url directly at the ClickHouse Prometheus port. This is the recommended path for new deployments.

Pattern 3 — Intermediate store with long-tail export. Organizations already running VictoriaMetrics or Mimir can keep the short-term store in place and periodically export data older than 30–90 days into ClickHouse. ClickHouse handles SQL analytics over the long-tail data while the intermediate store continues serving PromQL for recent windows.
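
As an illustration of the export step, the sketch below pulls a VictoriaMetrics /api/v1/export stream (JSON lines) into the samples table via the url() table function. The host, metric filter, and time range are placeholder assumptions, and a production pipeline would chunk the export by time window.

-- Sketch: backfill one metric from a VictoriaMetrics export into ClickHouse.
-- Host, match[] filter, and time range below are illustrative placeholders.
INSERT INTO metrics.samples (metric, tags, value, timestamp)
SELECT
    JSONExtractString(line, 'metric', '__name__')                 AS metric,
    mapFilter((k, v) -> k != '__name__',
              JSONExtract(line, 'metric', 'Map(String, String)')) AS tags,
    tpl.2                                                         AS value,
    fromUnixTimestamp64Milli(tpl.1)                               AS timestamp
FROM
(
    SELECT
        json AS line,
        arrayJoin(arrayZip(
            JSONExtract(json, 'timestamps', 'Array(Int64)'),
            JSONExtract(json, 'values', 'Array(Float64)')))       AS tpl
    FROM url('http://victoria:8428/api/v1/export?match[]=node_cpu_seconds_total&start=2024-08-01T00:00:00Z&end=2024-08-31T00:00:00Z', JSONAsString, 'json String')
);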

ClickHouse-Native TimeSeries Engine Deep Dive

The ClickHouse TimeSeries table engine, enabled with allow_experimental_time_series_table = 1, is a composite engine that auto-provisions three internal sub-tables: a data table (MergeTree, storing id UUID, timestamp DateTime64(3), value Float64), a tags table (AggregatingMergeTree, storing metric_name LowCardinality(String), tags Map, and min/max time as SimpleAggregateFunctions), and a metrics table (ReplacingMergeTree, storing metric family metadata). The id is derived via sipHash128(metric_name, all_tags), giving each unique label set a stable UUID for fast joins.
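
Each sub-table can be inspected through the companion table functions that ship alongside the engine (timeSeriesData, timeSeriesTags, timeSeriesMetrics). The queries below assume the metrics.prometheus_ts table created later in this article:

-- Latest raw samples from the internal data sub-table
SELECT id, timestamp, value
FROM timeSeriesData('metrics', 'prometheus_ts')
ORDER BY timestamp DESC
LIMIT 5;

-- Resolve a label set back to its series id
SELECT metric_name, tags, id
FROM timeSeriesTags('metrics', 'prometheus_ts')
WHERE metric_name = 'node_cpu_seconds_total'
LIMIT 5;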

To configure ClickHouse 24.8 LTS as a drop-in Prometheus remote_write endpoint, add a handler block to the server configuration:

<!-- /etc/clickhouse-server/config.d/prometheus.xml -->
<clickhouse>
  <prometheus>
    <port>9363</port>
    <handlers>
      <write_handler>
        <url>/write</url>
        <handler>
          <type>remote_write</type>
          <database>metrics</database>
          <table>prometheus_ts</table>
        </handler>
      </write_handler>
      <read_handler>
        <url>/read</url>
        <handler>
          <type>remote_read</type>
          <database>metrics</database>
          <table>prometheus_ts</table>
        </handler>
      </read_handler>
    </handlers>
  </prometheus>
</clickhouse>

On the Prometheus side, point remote_write (and optionally remote_read) at that endpoint:

# prometheus.yml: remote_write and remote_read against the ClickHouse native endpoint
remote_write:
  - url: "http://clickhouse-01.internal:9363/write"
    remote_timeout: 30s
    queue_config:
      capacity: 100000
      max_shards: 10
      max_samples_per_send: 10000
      batch_send_deadline: 5s
    # Optional: tag each HA replica for query-time deduplication
    # (setting global.external_labels.replica achieves the same result)
    write_relabel_configs:
      - target_label: replica
        replacement: "prometheus-01"

remote_read:
  - url: "http://clickhouse-01.internal:9363/read"
    read_recent: true

Schema Design and Downsampling Pipeline

For teams preferring a custom schema over the composite TimeSeries engine, a hand-crafted table gives full control over codec selection and the downsampling pipeline. The schema below applies DoubleDelta encoding on the timestamp column and Gorilla encoding on the value column (the same algorithmic family as Facebook's Gorilla TSDB), each followed by ZSTD, reducing typical uncompressed sample sizes of 15–30 bytes to 1–3 bytes on disk — a 10–30x improvement over Prometheus TSDB block files.

-- TimeSeries table using the native engine (ClickHouse 24.8+)
SET allow_experimental_time_series_table = 1;
CREATE DATABASE IF NOT EXISTS metrics;

CREATE TABLE metrics.prometheus_ts
ENGINE = TimeSeries
DATA ENGINE = MergeTree
    ORDER BY (id, timestamp)
    PARTITION BY toYYYYMMDD(timestamp)
    TTL
        timestamp + INTERVAL 7 DAY TO DISK 'hot_ssd',
        timestamp + INTERVAL 90 DAY TO DISK 'warm_ssd',
        timestamp + INTERVAL 730 DAY TO VOLUME 's3_cold'
TAGS ENGINE = AggregatingMergeTree
    PRIMARY KEY metric_name
    ORDER BY (metric_name, id)
METRICS ENGINE = ReplacingMergeTree
    ORDER BY metric_family_name;

-- Alternative: explicit samples table for adapter-based ingestion
CREATE TABLE metrics.samples
(
    metric      LowCardinality(String),
    tags        Map(LowCardinality(String), String),
    value       Float64 CODEC(Gorilla, ZSTD(1)),
    timestamp   DateTime64(3) CODEC(DoubleDelta, ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (metric, timestamp)
TTL
    toDate(timestamp) + INTERVAL 7 DAY TO DISK 'hot_ssd',
    toDate(timestamp) + INTERVAL 90 DAY TO DISK 'warm_ssd',
    toDate(timestamp) + INTERVAL 730 DAY TO VOLUME 's3_cold'
SETTINGS index_granularity = 8192;
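
Once samples are flowing, the codec claims above can be checked per column from system.columns; the ratio column should land in the 10–30x range for typical metric workloads:

-- Per-column compression ratio for the raw samples table
SELECT
    name,
    formatReadableSize(data_compressed_bytes)                 AS on_disk,
    formatReadableSize(data_uncompressed_bytes)               AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'metrics' AND table = 'samples'
ORDER BY data_compressed_bytes DESC;

With raw ingestion in place, the downsampling pipeline rolls samples up into one-minute aggregate states:
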
-- 1-minute rollup target table
CREATE TABLE metrics.samples_1m
(
    metric      LowCardinality(String),
    tags        Map(LowCardinality(String), String),
    minute      DateTime CODEC(DoubleDelta, ZSTD(1)),
    value_sum   AggregateFunction(sum, Float64),
    value_max   AggregateFunction(max, Float64),
    value_min   AggregateFunction(min, Float64),
    value_count AggregateFunction(count, Float64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(minute)
ORDER BY (metric, tags, minute)
TTL minute + INTERVAL 365 DAY DELETE;

-- Materialized View: raw samples → 1-minute rollup
CREATE MATERIALIZED VIEW metrics.mv_samples_to_1m
TO metrics.samples_1m
AS
SELECT
    metric,
    tags,
    toStartOfMinute(timestamp)    AS minute,
    sumState(value)               AS value_sum,
    maxState(value)               AS value_max,
    minState(value)               AS value_min,
    countState(value)             AS value_count
FROM metrics.samples
GROUP BY metric, tags, minute;

-- Query the rollup: avg CPU utilization per 5-minute bucket over the last 30 days
SELECT
    toStartOfFiveMinutes(minute)                       AS bucket,
    metric,
    tags['job']                                        AS job,
    sumMerge(value_sum) / countMerge(value_count)     AS avg_value,
    maxMerge(value_max)                                AS peak_value
FROM metrics.samples_1m
WHERE
    metric = 'node_cpu_seconds_total'
    AND minute >= now() - INTERVAL 30 DAY
GROUP BY bucket, metric, job
ORDER BY bucket ASC;
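
The same pattern cascades to coarser resolutions: a second Materialized View can fold the one-minute states into hourly states using the -MergeState combinator. A minimal sketch under the schema above (the samples_1h table and view names are illustrative):

-- Hourly rollup fed from the 1-minute table (cascading Materialized View)
CREATE TABLE metrics.samples_1h
(
    metric      LowCardinality(String),
    tags        Map(LowCardinality(String), String),
    hour        DateTime CODEC(DoubleDelta, ZSTD(1)),
    value_sum   AggregateFunction(sum, Float64),
    value_max   AggregateFunction(max, Float64),
    value_min   AggregateFunction(min, Float64),
    value_count AggregateFunction(count, Float64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (metric, tags, hour)
TTL hour + INTERVAL 730 DAY DELETE;

CREATE MATERIALIZED VIEW metrics.mv_1m_to_1h
TO metrics.samples_1h
AS
SELECT
    metric,
    tags,
    toStartOfHour(minute)          AS hour,
    sumMergeState(value_sum)       AS value_sum,
    maxMergeState(value_max)       AS value_max,
    minMergeState(value_min)       AS value_min,
    countMergeState(value_count)   AS value_count
FROM metrics.samples_1m
GROUP BY metric, tags, hour;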

TTL, Tiered Storage, and Querying

ClickHouse TTL expressions move data across storage volumes automatically. The recommended tiered layout: 7 days on NVMe SSD (hot, sub-100ms Grafana reads), 90 days on SATA SSD (warm, query latency under 500ms on billions of rows), and 2 years on S3 (cold, capacity planning queries). Configure volumes in storage_configuration within config.xml, then reference volume names in each TTL clause as shown in the schema above.
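
Volumes and disks themselves are declared in storage_configuration in config.xml; once defined, the tiers can be verified from SQL, including which tier each partition currently occupies:

-- Registered disks and tiered policies
SELECT policy_name, volume_name, disks
FROM system.storage_policies;

-- Which tier each partition of the samples table sits on today
SELECT partition, disk_name, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'metrics' AND table = 'samples' AND active
GROUP BY partition, disk_name
ORDER BY partition;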

Thanos and Cortex solve the same retention problem but require object-store compaction jobs and separate query-frontend and store-gateway processes. Mimir improves on Cortex operationally but still runs as a microservices cluster. VictoriaMetrics handles cardinality well but does not serve SQL, limiting cross-signal correlation. When an organization already stores application logs and distributed traces in ClickHouse, adding Prometheus metrics to the same cluster delivers one storage engine, one query language, one retention policy, and one backup strategy.

Direct SQL queries expose analytical power that PromQL cannot match:

-- P95 request latency per service over the past 7 days, bucketed hourly
SELECT
    toStartOfHour(timestamp)                                AS hour,
    tags['job']                                             AS service,
    quantileExactWeighted(0.95)(value, 1)                   AS p95_latency_s,
    max(value)                                              AS peak_latency_s,
    count()                                                 AS sample_count
FROM metrics.samples
WHERE
    metric = 'http_request_duration_seconds'
    AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY hour, service
ORDER BY hour ASC, service ASC;

-- Cross-metric correlation: join CPU and memory metrics
SELECT
    cpu.hour,
    cpu.avg_cpu,
    mem.avg_mem_bytes / 1e9          AS avg_mem_gb
FROM (
    SELECT toStartOfHour(timestamp) AS hour,
           avg(value)               AS avg_cpu
    FROM metrics.samples
    WHERE metric = 'node_cpu_seconds_total'
      AND timestamp >= now() - INTERVAL 30 DAY
    GROUP BY hour
) AS cpu
INNER JOIN (
    SELECT toStartOfHour(timestamp) AS hour,
           avg(value)               AS avg_mem_bytes
    FROM metrics.samples
    WHERE metric = 'node_memory_MemAvailable_bytes'
      AND timestamp >= now() - INTERVAL 30 DAY
    GROUP BY hour
) AS mem ON cpu.hour = mem.hour
ORDER BY cpu.hour ASC;

A chproxy instance in front of the ClickHouse cluster provides connection pooling, query timeout enforcement, and per-user concurrency limits. Route remote_write requests to an INSERT-only ingest user and Grafana SQL queries to a SELECT-only read user, isolating heavy analytical queries from the ingest path. The Prometheus remote_write specification requires receivers to return HTTP 5xx for transient failures and HTTP 4xx for invalid samples; chproxy translates ClickHouse error codes to the correct response status.
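
A minimal sketch of that user split (user names, passwords, and limits below are illustrative; chproxy then maps its own frontend users onto these backend accounts):

-- INSERT-only account for the remote_write ingest path
CREATE USER metrics_ingest IDENTIFIED WITH sha256_password BY 'change-me';
GRANT INSERT ON metrics.* TO metrics_ingest;

-- SELECT-only account for Grafana, with defensive limits
CREATE USER metrics_read IDENTIFIED WITH sha256_password BY 'change-me'
    SETTINGS max_execution_time = 30, max_memory_usage = 10000000000;
GRANT SELECT ON metrics.* TO metrics_read;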

Monitor the ingest pipeline via system.part_log for merge pressure and system.query_log for slow inserts. Set async_insert = 1 with wait_for_async_insert = 0 to absorb ingest bursts without blocking Prometheus. If remote_write queues back up — watch prometheus_remote_storage_samples_pending — increase queue_config.capacity or add ClickHouse shards.
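
Two starting-point queries for that monitoring (the one-hour window and one-second threshold are illustrative):

-- Merge pressure: merges per table over the last hour
SELECT table, count() AS merges, sum(duration_ms) AS total_ms
FROM system.part_log
WHERE event_type = 'MergeParts'
  AND database = 'metrics'
  AND event_time >= now() - INTERVAL 1 HOUR
GROUP BY table;

-- Slow inserts: anything above one second
SELECT event_time, query_duration_ms, written_rows
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND query_duration_ms > 1000
  AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 20;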

Key Takeaways

  • Prometheus local TSDB is capped at days of retention; ClickHouse extends that to years at 1–3 bytes per sample after DoubleDelta + ZSTD versus 15–30 bytes uncompressed.
  • ClickHouse 24.8 LTS supports native remote_write and remote_read via the TimeSeries engine — no adapter is required for new deployments.
  • The TimeSeries engine auto-provisions three sub-tables (MergeTree for data, AggregatingMergeTree for tags, ReplacingMergeTree for metric metadata) with a stable UUID per label set.
  • AggregatingMergeTree + Materialized Views implement cascading rollups (raw → 1m → 1h, as shown above), enabling years of retention at manageable storage cost.
  • Tiered TTL (7d hot SSD → 90d warm SSD → 2y S3) automates the full data lifecycle without manual intervention.
  • Direct SQL unlocks cross-metric joins, quantile functions, and correlation with logs and traces — capabilities PromQL cannot match.
  • ClickHouse wins over Thanos, Cortex, and Mimir when metrics, logs, and traces already share one cluster.

How ChistaDATA Can Help

At ChistaDATA, our ClickHouse engineering team has deployed and tuned Prometheus long-term storage architectures for cloud-native SaaS, financial services, and e-commerce organizations. We design the TimeSeries engine configuration, chproxy topology, downsampling pipeline, and tiered S3 storage policies — then hand the runbook to the operations team with full documentation. For teams that prefer a fully managed experience, our ClickHouse managed services include 24×7 monitoring, patch management, and performance tuning SLAs. If the goal is to retire Thanos or Cortex and consolidate observability onto one ClickHouse cluster, we have a proven migration playbook. Schedule a consultation to discuss retention requirements, cardinality projections, and cluster sizing.

Frequently Asked Questions

Can ClickHouse fully replace Prometheus, or does it complement it?

ClickHouse complements Prometheus rather than replacing it. Prometheus scrapes targets, evaluates alerting rules, and maintains the short-term head block for sub-minute alerting. ClickHouse takes over as the durable long-term store via remote_write. Running both is the standard production pattern.

How does ClickHouse Prometheus long-term metrics storage compare to Thanos?

Thanos requires a sidecar, compactor, store gateway, and query frontend — significant operator overhead. ClickHouse requires fewer moving parts when metrics, logs, and traces share one cluster, offers richer SQL analytics, and compresses data 10–30x better than Thanos block format. For organizations already running ClickHouse for other workloads, consolidation on ClickHouse is the simpler path.

What is the TimeSeries table engine and when should it be used?

The TimeSeries engine is a composite ClickHouse engine introduced in 24.x that manages three internal sub-tables (data, tags, metrics) and exposes native remote_write and remote_read endpoints. Use it for new deployments on ClickHouse 24.8 LTS or later. For older versions or when custom schema control is needed, use a hand-crafted MergeTree samples table with a community adapter.

Does ClickHouse support PromQL queries natively?

ClickHouse does not execute PromQL directly. The remote_read endpoint lets a Prometheus server pull raw series back from ClickHouse, so PromQL issued against that Prometheus (for example, from Grafana's Prometheus datasource) transparently covers the long-term data. For analytical workloads, use the ClickHouse SQL datasource with direct SQL queries.

How should HA Prometheus instances be handled to avoid duplicate data?

Configure each Prometheus instance with a distinct replica label (via global.external_labels or write_relabel_configs). Both write to ClickHouse independently. At query time, strip the replica key from the tags map and use LIMIT 1 BY (metric, tags, timestamp) to deduplicate, as sketched below, or rely on aggregation functions (avg, max, sum) that absorb near-duplicate values. For gauge metrics, millisecond-offset duplicates have negligible impact on aggregated dashboards.
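
A query-time dedup sketch, assuming the replica label was written into the tags map under the key 'replica':

-- Collapse HA duplicates: ignore the replica key, keep one row per series and timestamp
SELECT
    metric,
    mapFilter((k, v) -> k != 'replica', tags) AS series_tags,
    timestamp,
    value
FROM metrics.samples
WHERE metric = 'node_cpu_seconds_total'
  AND timestamp >= now() - INTERVAL 1 HOUR
LIMIT 1 BY metric, series_tags, timestamp;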

What storage savings are realistic with ClickHouse versus Prometheus TSDB?

Prometheus TSDB block overhead, index structures, and WAL bring effective on-disk size to 5–15 bytes per sample. ClickHouse with DoubleDelta + ZSTD achieves 1–3 bytes per sample, and AggregatingMergeTree hourly rollups reduce historical data to under 0.5 bytes per logical sample — a 10–30x overall improvement in storage efficiency.

