Connect Prometheus to Your ClickHouse® Cluster: Complete Monitoring and Observability Guide
Introduction
Monitoring your database infrastructure is essential for maintaining performance, ensuring reliability, and preventing costly downtime. ClickHouse®, known for its analytical performance, needs thorough monitoring to run well in production. Prometheus, a widely used open-source monitoring and alerting toolkit, is well suited to tracking cluster health and performance metrics.
This comprehensive guide will walk you through the process of connecting Prometheus to your ClickHouse cluster, from initial setup to advanced monitoring strategies. Whether you’re running a single-node deployment or a complex distributed cluster, this integration will provide the observability foundation necessary for production-grade operations.
Understanding the Monitoring Architecture
Why Use Prometheus for Monitoring?
Prometheus offers several compelling advantages for ClickHouse monitoring:
- Time-series data model perfectly suited for database metrics
- Powerful query language (PromQL) for complex metric analysis
- Scalable architecture supporting large-scale deployments
- Rich ecosystem of exporters and integrations
- Built-in alerting capabilities for proactive issue detection
- Grafana integration for comprehensive visualization
ClickHouse Metrics Overview
ClickHouse exposes extensive metrics through multiple interfaces:
System Tables for Metrics
- system.metrics – Current metric values
- system.events – Cumulative event counters
- system.asynchronous_metrics – Background process metrics
- system.processes – Active query information
HTTP Metrics Endpoints
- /metrics – Prometheus-compatible metrics endpoint
- /ping – Health check endpoint
- /replicas_status – Replication status information
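To make the shape of the /metrics output concrete, here is a minimal sketch of reading it in Python. The sample text and its values are illustrative, not captured from a real server, and the helper name parse_metrics is hypothetical. It only handles unlabelled samples, which is an assumption that keeps the parsing simple.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, "# HELP" / "# TYPE" comments, and labelled samples.
        if not line or line.startswith("#") or "{" in line:
            continue
        name, _, rest = line.partition(" ")
        # A sample line is "name value [timestamp]"; keep only the value.
        samples[name] = float(rest.split()[0])
    return samples


# Illustrative payload (values are made up for the example).
SAMPLE = """\
# TYPE ClickHouseMetrics_TCPConnection gauge
ClickHouseMetrics_TCPConnection 4
# TYPE ClickHouseProfileEvents_Query counter
ClickHouseProfileEvents_Query 1280
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```

The prefixes in the metric names indicate the source table: ClickHouseMetrics for system.metrics, ClickHouseProfileEvents for system.events, and ClickHouseAsyncMetrics for system.asynchronous_metrics.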
Setting Up Monitoring for ClickHouse
Prerequisites
Before beginning the integration:
- ClickHouse cluster running version 20.3 or later
- Prometheus server installed and configured
- Network connectivity between Prometheus and ClickHouse nodes
- Appropriate permissions for metrics collection
- Basic understanding of Prometheus configuration
Enabling ClickHouse Metrics Endpoint
Configuration Steps
- Enable the metrics endpoint in ClickHouse configuration:
<!-- /etc/clickhouse-server/config.xml -->
<clickhouse>
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
    </prometheus>
</clickhouse>
- Configure HTTP interface for metrics access:
<http_port>8123</http_port>
<prometheus>
    <endpoint>/metrics</endpoint>
    <port>9363</port>
</prometheus>
- Restart ClickHouse to apply configuration changes:
sudo systemctl restart clickhouse-server
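After the restart, it is worth confirming that each node actually serves the endpoint before pointing Prometheus at it. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the host name, port 9363, and path /metrics are taken from the configuration above.

```python
from urllib.request import urlopen


def metrics_url(host: str, port: int = 9363, path: str = "/metrics") -> str:
    """Build the URL of a node's metrics endpoint."""
    return f"http://{host}:{port}{path}"


def endpoint_is_up(host: str, port: int = 9363, timeout: float = 3.0) -> bool:
    """Return True if the metrics endpoint answers with HTTP 200."""
    try:
        with urlopen(metrics_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, DNS failures, timeouts, and HTTP errors.
        return False


if __name__ == "__main__":
    for node in ["clickhouse-node1", "clickhouse-node2", "clickhouse-node3"]:
        print(node, "up" if endpoint_is_up(node) else "unreachable")
```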
Configuration for Monitoring
Basic Scrape Configuration
Add ClickHouse targets to your Prometheus configuration:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'clickhouse'
    static_configs:
      - targets:
          - 'clickhouse-node1:9363'
          - 'clickhouse-node2:9363'
          - 'clickhouse-node3:9363'
    scrape_interval: 30s
    metrics_path: /metrics
    scheme: http
Advanced Configuration Techniques
For dynamic environments, use service discovery:
scrape_configs:
  - job_name: 'clickhouse-cluster'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['clickhouse']
    relabel_configs:
      # Rewrite the scrape address to the node's address on the metrics port
      - source_labels: [__meta_consul_address]
        target_label: __address__
        replacement: '${1}:9363'
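The relabeling step can be hard to picture, so here is a rough Python sketch of what a replace rule does to a single label value. It is a simplified model and not Prometheus code: relabel_replace is a hypothetical helper, and it assumes the default regex (.*) and a source value that is a host address such as 10.0.1.5.

```python
import re


def relabel_replace(source_value: str, regex: str = "(.*)", replacement: str = "$1"):
    """Model a Prometheus 'replace' relabel rule on one source value.

    Returns the new target label value, or None when the regex does not
    match (in which case Prometheus leaves the target label unchanged).
    """
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    match = re.fullmatch(regex, source_value)
    if match is None:
        return None
    # Translate Prometheus-style $1 / ${1} references into Python's \g<1>.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    return match.expand(template)


if __name__ == "__main__":
    # A host address becomes a scrape target on the metrics port.
    print(relabel_replace("10.0.1.5", "(.*)", "${1}:9363"))
```

Because the replacement appends :9363 to the captured value, the source label should hold a host address rather than a port.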
Essential Metrics to Monitor in ClickHouse
Performance Metrics
Query Performance
Monitor query execution characteristics:
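# Queries per second (all query types)
rate(ClickHouseProfileEvents_Query[5m])
# Average query execution time in microseconds
rate(ClickHouseProfileEvents_QueryTimeMicroseconds[5m]) / rate(ClickHouseProfileEvents_Query[5m])
# SELECT queries per second
rate(ClickHouseProfileEvents_SelectQuery[5m])
# Failed queries per second
rate(ClickHouseProfileEvents_FailedQuery[5m])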
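Since these expressions all rely on rate(), a small worked example may help. The sketch below is a simplified model of how a per-second rate is derived from counter samples, including a counter reset after a server restart. It deliberately omits the extrapolation to the window boundaries that Prometheus itself performs, and simple_rate is a hypothetical name.

```python
def simple_rate(samples):
    """Approximate a per-second rate from counter samples.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. server restart): it restarted from zero,
            # so the whole new value counts as increase.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    # 120 queries over 60 seconds
    print(simple_rate([(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]))
```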
Resource Utilization
Track system resource consumption:
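# CPU usage (CPU-seconds consumed per second)
rate(ClickHouseProfileEvents_OSCPUVirtualTimeMicroseconds[5m]) / 1e6
# Memory usage (bytes tracked by the server, and resident memory reported by jemalloc)
ClickHouseMetrics_MemoryTracking
ClickHouseAsyncMetrics_jemalloc_resident
# Time spent on disk reads (microseconds per second)
rate(ClickHouseProfileEvents_DiskReadElapsedMicroseconds[5m])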
Health Metrics for Cluster
Replication Status
Monitor replication lag and health:
# Replication lag
ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay
# Largest replication queue size across replicated tables
ClickHouseAsyncMetrics_ReplicasMaxQueueSize
# Replication errors
rate(ClickHouseProfileEvents_ReplicatedPartFailedFetches[5m])
Connection Metrics
Track connection pool status:
# Active connections
ClickHouseMetrics_TCPConnection
# HTTP connections
ClickHouseMetrics_HTTPConnection
# Failed connections to remote servers during distributed queries
rate(ClickHouseProfileEvents_DistributedConnectionFailAtAll[5m])
Storage Metrics
Disk Usage
Monitor storage consumption and performance:
# Disk space used on the default disk (bytes)
ClickHouseAsyncMetrics_DiskUsed_default
# Time spent on disk reads and writes (microseconds per second)
rate(ClickHouseProfileEvents_DiskReadElapsedMicroseconds[5m])
rate(ClickHouseProfileEvents_DiskWriteElapsedMicroseconds[5m])
# Merge operations
rate(ClickHouseProfileEvents_MergedRows[5m])
Strategies for Advanced Monitoring
Custom Metrics Collection
Application-Specific Metrics
Create custom metrics for your specific use cases:
-- Custom query to track table sizes
SELECT
    database,
    table,
    sum(bytes_on_disk) AS size_bytes,
    sum(rows) AS total_rows
FROM system.parts
WHERE active = 1
GROUP BY database, table
Business Logic Monitoring
Monitor business-critical queries and operations:
-- Track specific query patterns (finished queries only, so durations are meaningful)
SELECT
    query_kind,
    count() AS query_count,
    avg(query_duration_ms) AS avg_duration
FROM system.query_log
WHERE event_time >= now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
GROUP BY query_kind
Multi-Cluster Monitoring
Federated Prometheus Setup
For large-scale deployments, implement federation:
# Global Prometheus configuration
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"clickhouse.*"}'
    static_configs:
      - targets:
          - 'prometheus-shard1:9090'
          - 'prometheus-shard2:9090'
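When debugging federation it helps to request the /federate endpoint by hand, which requires URL-encoding the match[] selector. Here is a minimal sketch of building that URL with the Python standard library. The function name federate_url is hypothetical, and the shard host name is the one from the configuration above.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, matchers: list[str]) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url}/federate?{query}"


if __name__ == "__main__":
    url = federate_url("http://prometheus-shard1:9090", ['{job=~"clickhouse.*"}'])
    print(url)  # can be fetched with curl or urllib for a manual check
```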
Cross-Cluster Correlation
Monitor relationships between clusters:
# Compare query rates across clusters
sum by (cluster) (rate(ClickHouseProfileEvents_Query[5m]))
# Cross-cluster replication lag
max by (cluster) (ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay)
Setting Up Alerting Rules
Critical System Alerts
High Query Failure Rate
Alert on excessive query failures:
groups:
  - name: clickhouse.rules
    rules:
      - alert: ClickHouseHighQueryFailureRate
        expr: rate(ClickHouseProfileEvents_FailedQuery[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High query failure rate detected"
          description: "ClickHouse query failure rate is {{ $value }} per second"
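The for clause is what keeps a brief spike from paging anyone: the expression must stay true at every evaluation for the whole duration before the alert fires. The sketch below models that state machine in Python. It is a simplified illustration, not Prometheus code, and alert_states is a hypothetical name.

```python
def alert_states(samples, threshold: float, for_seconds: float):
    """Model the inactive -> pending -> firing transitions of an alert rule.

    samples: list of (timestamp_seconds, value), one per rule evaluation.
    Returns the alert state after each evaluation.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    evaluations = [(0, 0.05), (30, 0.2), (60, 0.2), (90, 0.2),
                   (120, 0.2), (150, 0.2), (180, 0.05)]
    print(alert_states(evaluations, threshold=0.1, for_seconds=120))
```

With a threshold of 0.1 and a two-minute hold, the failure rate has to stay above the threshold from the first breach until two minutes later before the state changes from pending to firing.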
Memory Usage Alert
Monitor memory consumption:
      - alert: ClickHouseHighMemoryUsage
        # Assumes OSMemoryTotal is exposed by your ClickHouse version; adjust the denominator if it is not
        expr: ClickHouseMetrics_MemoryTracking / ClickHouseAsyncMetrics_OSMemoryTotal > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on ClickHouse"
          description: "Memory usage is {{ $value | humanizePercentage }}"
Replication Lag Alert
Monitor replication health:
      - alert: ClickHouseReplicationLag
        expr: ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay > 300
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "ClickHouse replication lag detected"
          description: "Replication lag is {{ $value }} seconds"
Performance Degradation Alerts
Slow Query Detection
Alert on performance degradation:
      - alert: ClickHouseSlowQueries
        # QueryTimeMicroseconds is exported as a counter, not a histogram, so this uses the average query time in seconds
        expr: (rate(ClickHouseProfileEvents_QueryTimeMicroseconds[5m]) / rate(ClickHouseProfileEvents_Query[5m])) / 1e6 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected in ClickHouse"
          description: "Average query time is {{ $value | humanizeDuration }}"
Grafana Dashboard Integration
Essential Dashboard Panels
Cluster Overview Dashboard
Create comprehensive cluster monitoring:
{
  "dashboard": {
    "title": "ClickHouse Cluster Overview",
    "panels": [
      {
        "title": "Query Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(ClickHouseProfileEvents_Query[5m]))",
            "legendFormat": "Queries/sec"
          }
        ]
      },
      {
        "title": "Active Connections",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(ClickHouseMetrics_TCPConnection)"
          }
        ]
      }
    ]
  }
}
Performance Monitoring Dashboard
Track query performance metrics:
{
  "panels": [
    {
      "title": "Average Query Duration (seconds)",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(ClickHouseProfileEvents_QueryTimeMicroseconds[5m])) / sum(rate(ClickHouseProfileEvents_Query[5m])) / 1e6",
          "legendFormat": "Avg duration"
        }
      ]
    }
  ]
}
Custom Visualization Strategies
Resource Utilization Heatmaps
Visualize resource usage patterns:
# CPU utilization across nodes (CPU-seconds consumed per second)
avg by (instance) (rate(ClickHouseProfileEvents_OSCPUVirtualTimeMicroseconds[5m])) / 1e6
# Memory usage as a fraction of host memory (assumes OSMemoryTotal is exposed by your version)
ClickHouseMetrics_MemoryTracking / ClickHouseAsyncMetrics_OSMemoryTotal
Troubleshooting Common Monitoring Issues
Connection Problems
Metrics Endpoint Not Accessible
Common solutions:
- Verify configuration in ClickHouse settings
- Check firewall rules and network connectivity
- Validate port configuration and binding
- Review ClickHouse logs for error messages
# Test metrics endpoint
curl http://clickhouse-server:9363/metrics
# Check the prometheus section of the ClickHouse server configuration
# (it is a server config section, so it does not appear in system.settings)
grep -r -A 6 '<prometheus>' /etc/clickhouse-server/
Authentication Issues
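If you need authentication on the metrics endpoint, a configuration along these lines is sometimes suggested. Verify that your ClickHouse version supports a credentials block here; otherwise, restrict access at the network level:
<prometheus>
    <endpoint>/metrics</endpoint>
    <port>9363</port>
    <credentials>
        <user>monitoring</user>
        <password>secure_password</password>
    </credentials>
</prometheus>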
Performance Considerations
High Cardinality Metrics
Manage metric cardinality to prevent performance issues:
# Keep only ClickHouse metrics to limit the number of stored series
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'ClickHouse.*'
    action: keep
Scrape Interval Optimization
Balance monitoring granularity with performance:
scrape_configs:
  - job_name: 'clickhouse-detailed'
    scrape_interval: 15s # High-frequency for critical metrics
    static_configs:
      - targets: ['clickhouse-primary:9363']
  - job_name: 'clickhouse-secondary'
    scrape_interval: 60s # Lower frequency for secondary metrics
    static_configs:
      - targets: ['clickhouse-replica:9363']
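The trade-off between the two intervals is easy to quantify, because the number of stored samples scales inversely with the scrape interval. The sketch below does that arithmetic. The figure of 1,000 series per node is an assumption chosen for illustration, so substitute the count your own endpoint returns.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(series_count: int, scrape_interval_seconds: int) -> float:
    """Samples ingested per day for one target at a given scrape interval."""
    return series_count * SECONDS_PER_DAY / scrape_interval_seconds


if __name__ == "__main__":
    series = 1_000  # assumed number of series exposed by one node
    for interval in (15, 60):
        print(f"{interval}s interval: {samples_per_day(series, interval):,.0f} samples/day")
```

Moving a target from a 15-second to a 60-second interval cuts its ingestion volume to a quarter, at the cost of coarser resolution for short-lived spikes.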
Security Best Practices
Access Control
Network Security
Implement proper network isolation:
# Restrict metrics access to monitoring network
scrape_configs:
  - job_name: 'clickhouse'
    static_configs:
      - targets: ['10.0.1.100:9363'] # Internal network only
Authentication and Authorization
Configure secure access:
<users>
    <monitoring>
        <password>secure_monitoring_password</password>
        <networks>
            <ip>10.0.0.0/8</ip> <!-- Restrict to monitoring network -->
        </networks>
        <profile>readonly</profile>
    </monitoring>
</users>
Data Privacy
Sensitive Metric Filtering
Filter out sensitive information:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: '.*password.*|.*secret.*'
    action: drop
Scaling and Optimization
High-Availability Setup
Prometheus HA Configuration
Implement redundant monitoring to ensure availability:
# Primary Prometheus instance (prometheus.yml)
global:
  external_labels:
    replica: 'prometheus-1'
    cluster: 'production'
# Secondary Prometheus instance (prometheus.yml)
global:
  external_labels:
    replica: 'prometheus-2'
    cluster: 'production'
Load Balancing
Distribute monitoring load:
scrape_configs:
  - job_name: 'clickhouse-shard1'
    static_configs:
      - targets: ['ch-shard1-node1:9363', 'ch-shard1-node2:9363']
  - job_name: 'clickhouse-shard2'
    static_configs:
      - targets: ['ch-shard2-node1:9363', 'ch-shard2-node2:9363']
Resource Optimization
Storage Efficiency
Optimize metric retention and storage:
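# Retention is set with command-line flags, not in prometheus.yml
prometheus --storage.tsdb.retention.time=30d --storage.tsdb.retention.size=100GB
# Recording rules that pre-aggregate metrics (prometheus.yml)
rule_files:
  - "downsampling.yml"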
Query Performance
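Optimize PromQL queries for better performance, for example by narrowing label selectors and precomputing expensive expressions with recording rules.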