Alert Name | Shell or SQL command | Severity |
ClickHouse status | $ curl 'http://localhost:8123/' (expected output: Ok.) | Critical |
Too many simultaneous queries. Maximum: 100 (by default) | select value from system.metrics where metric='Query' | Critical |
Replication status | $ curl 'http://localhost:8123/replicas_status' (expected output: Ok.) | High |
Read-only replicas (reflected by replicas_status as well) | select value from system.metrics where metric='ReadonlyReplica' | High |
Some replication tasks are stuck | select count() from system.replication_queue where num_tries > 100 or num_postponed > 1000 | High |
ZooKeeper is available | select count() from system.zookeeper where path='/' | Critical for writes |
ZooKeeper exceptions | select value from system.events where event='ZooKeeperHardwareExceptions' | Medium |
Other CH nodes are available | $ for node in `echo "select distinct host_address from system.clusters where host_name != 'localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done | sort -u (expected output: Ok.) | High |
All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries) | $ for cluster in `echo "select distinct cluster from system.clusters where host_name != 'localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done | Critical |
There are files in 'detached' folders | $ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l; on 19.8+: select count() from system.detached_parts | Medium |
Too many parts: number of parts is growing; inserts are being delayed; inserts are being rejected | select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'; select value from system.events where event='DelayedInserts'; select value from system.metrics where metric='DelayedInserts'; select value from system.events where event='RejectedInserts' | Critical |
Dictionaries: exception | select concat(name, ': ', last_exception) from system.dictionaries where last_exception != '' | Medium |
ClickHouse has been restarted | select uptime(); select value from system.asynchronous_metrics where metric='Uptime' | |
DistributedFilesToInsert should not be always increasing | select value from system.metrics where metric='DistributedFilesToInsert' | Medium |
A data part was lost | select value from system.events where event='ReplicatedDataLoss' | High |
Data parts are not the same on different replicas | select value from system.events where event='DataAfterMergeDiffersFromReplica'; select value from system.events where event='DataAfterMutationDiffersFromReplica' | Medium |
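One way to wire a row of the table into a cron job or a monitoring agent is a small wrapper that runs the query and compares the result with a threshold. The sketch below is illustrative only: `CH_URL`, `ch_query`, and `evaluate` are hypothetical names that do not come from the table, and the thresholds in the usage comment are examples rather than recommendations.

```shell
#!/bin/sh
# Sketch: turn one table row into a pass/fail check.
# CH_URL, ch_query and evaluate are hypothetical names chosen for this example.

CH_URL="${CH_URL:-http://localhost:8123/}"

# Send one SQL statement to the ClickHouse HTTP interface and print the result.
ch_query() {
    printf '%s' "$1" | curl --silent --show-error --fail --data-binary @- "$CH_URL"
}

# evaluate NAME VALUE THRESHOLD SEVERITY
# Prints "OK <name>" and returns 0 when VALUE <= THRESHOLD.
# Prints "<SEVERITY> <name> value=.. threshold=.." and returns 1 otherwise.
# Prints "UNKNOWN ..." and returns 2 when VALUE is empty or not a
# non-negative integer (for example, when the query itself failed).
evaluate() {
    name=$1 value=$2 threshold=$3 severity=$4
    case $value in
        ''|*[!0-9]*)
            printf 'UNKNOWN %s value=%s\n' "$name" "$value"
            return 2
            ;;
    esac
    if [ "$value" -le "$threshold" ]; then
        printf 'OK %s\n' "$name"
        return 0
    fi
    printf '%s %s value=%s threshold=%s\n' "$severity" "$name" "$value" "$threshold"
    return 1
}

# Example usage (needs a running server, so it is left commented out):
#   v=$(ch_query "select value from system.metrics where metric='ReadonlyReplica'")
#   evaluate readonly_replicas "$v" 0 High
#   v=$(ch_query "select value from system.metrics where metric='Query'")
#   evaluate simultaneous_queries "$v" 100 Critical
```

Keeping the comparison in `evaluate` separate from the network call in `ch_query` means the alert logic can be exercised without a server. Note that values in system.events are cumulative counters, so for those rows a check would more likely compare against the previous reading than against a fixed threshold.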
The following queries are recommended to be included in monitoring:
- SELECT * FROM system.replicas – For more information, see the ClickHouse guide on System Tables.
- SELECT * FROM system.merges – Checks on the speed and progress of currently executed merges.
- SELECT * FROM system.mutations ORDER BY create_time DESC – Shows the status and progress of mutations, most recent first.
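In practice `SELECT *` on these tables returns more than a dashboard needs. The following is a minimal sketch of narrower variants, assuming the standard columns of system.mutations, system.merges, and system.replicas. The function names and the default queue threshold of 20 are assumptions made for this example, not values from the list above.

```shell
#!/bin/sh
# Hypothetical helpers that print narrower versions of the three queries.
# Each function only prints SQL; pipe the output to a client to run it.

# Mutations that have not finished yet, most recent first.
unfinished_mutations_sql() {
    cat <<'SQL'
SELECT database, table, mutation_id, create_time, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time DESC
SQL
}

# Currently running merges, longest running first.
running_merges_sql() {
    cat <<'SQL'
SELECT database, table, elapsed, round(progress, 3) AS progress, num_parts
FROM system.merges
ORDER BY elapsed DESC
SQL
}

# Replicas that are read-only, have an expired session, or a long queue.
# $1: maximum tolerated queue_size (defaults to 20, an example value).
unhealthy_replicas_sql() {
    max_queue=${1:-20}
    cat <<SQL
SELECT database, table, is_readonly, is_session_expired, queue_size, absolute_delay
FROM system.replicas
WHERE is_readonly OR is_session_expired OR queue_size > ${max_queue}
SQL
}

# Example usage (needs a running server, so it is left commented out):
#   unhealthy_replicas_sql 50 | curl 'http://localhost:8123/' --silent --data-binary @-
#   unfinished_mutations_sql | clickhouse-client
```

An empty result from `unhealthy_replicas_sql` or `unfinished_mutations_sql` is the healthy case, which makes both queries easy to alert on by row count.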