Backup & Data Restore (DR) in ClickHouse

Table of Contents

Introduction

We at ChistaDATA never recommend Replication Solution as an alternative for Backup or DR. Replication focuses on business continuity, i.e. delivering uninterrupted data infrastructure availability to mission-critical ClickHouse Operations and Backup involves creating a copy or copies of an existing set of data so that in case of a data loss, the original could be restored and recovered. ClickHouse from an Architecture/Operations perspective is built for reliability. For example, You can’t just drop tables in MergeTree Tables with more than 50 GB of data. Replication solutions provide protection from hardware failures but not a human error (the most common reason for Database Infrastructure outage). A robust ClickHouse Backup is the most scalable solution to intuitively mitigate all the possible human errors

Filesystem Snapshots

Filesystem Snapshots are not an ideal solution on the ClickHouse instance addressing live queries. So what we recommended is to create another replica with the same filesystem (like ZFS) and exclude it from the Distributed tables used for SELECT queries. snapshots on such replicas will be out of reach of any queries that modify data. Because the solution is recommended purely for DR operations more disks will be added to the server, So from an investment perspective these servers are not too expensive

Note: Snapshots cannot be used to recover individual files; So Cost-To-Data-Recovery on ClickHouse will be very high

clickhouse-backup

clickhouse-backup is developed by Alexander Akulov is capable of full, partial and incremental backup scenarios. This cloud-ready backup solution is also compatible with AWS, GCS and Azure.

Installation

Download the latest binary from the releases page and decompress with:

tar -zxvf clickhouse-backup.tar.gz

Use the official tiny Docker image and run it on the host where installed clickhouse-server:

docker run -u $(id -u clickhouse) --rm -it --network host -v "/var/lib/clickhouse:/var/lib/clickhouse" \
   -e CLICKHOUSE_PASSWORD="password" \
   -e S3_BUCKET="clickhouse-backup" \
   -e S3_ACCESS_KEY="access_key" \
   -e S3_SECRET_KEY="secret" \
   alexakulov/clickhouse-backup --help

Build from the sources:

GO111MODULE=on go get github.com/AlexAkulov/clickhouse-backup/cmd/clickhouse-backup

Usage

NAME:
   clickhouse-backup - Tool for easy backup of ClickHouse with cloud support

USAGE:
   clickhouse-backup <command> [-t, --tables=<db>.<table>] <backup_name>

VERSION:
   1.0.0

DESCRIPTION:
   Run as 'root' or 'clickhouse' user

COMMANDS:
   tables          Print list of tables
   create          Create new backup
   create_remote   Create and upload
   upload          Upload backup to remote storage
   list            Print list of backups
   download        Download backup from remote storage
   restore         Create schema and restore data from backup
   restore_remote  Download and restore
   delete          Delete specific backup
   default-config  Print default config
   print-config    Print current config
   clean           Remove data in 'shadow' folder from all `path` folders available from `system.disks`
   server          Run API server
   help, h         Shows a list of commands or help for one command
GLOBAL OPTIONS:
   --config FILE, -c FILE  Config FILE name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --help, -h              show help
   --version, -v           print the version

Default Config

The config file location can be defined by $CLICKHOUSE_BACKUP_CONFIG

All options can be overwritten via environment variables

general:
  remote_storage: none           # REMOTE_STORAGE, if `none` then `upload` and  `download` command will fail
  max_file_size: 1073741824      # MAX_FILE_SIZE, 1G by default, useless when upload_by_part is true, use for split data parts files by archives  
  disable_progress_bar: true     # DISABLE_PROGRESS_BAR, show progress bar during upload and download, have sense only when `upload_concurrency` and `download_concurrency` equal 1
  backups_to_keep_local: 0       # BACKUPS_TO_KEEP_LOCAL, how much newest local backup should keep, 0 mean all created backups will keep on local disk
                                 # you shall to run `clickhouse-backup delete local <backup_name>` command to avoid useless disk space allocations
  backups_to_keep_remote: 0      # BACKUPS_TO_KEEP_REMOTE, how much newest backup should keep on remote storage, 0 mean all uploaded backups will keep on remote storage. 
                                 # if old backup is required for newer incremental backup, then it will don't delete. Be careful with long incremental backup sequences.
  log_level: info                # LOG_LEVEL
  allow_empty_backups: false     # ALLOW_EMPTY_BACKUPS
  download_concurrency: 1        # DOWNLOAD_CONCURRENCY, max 255
  upload_concurrency: 1          # UPLOAD_CONCURRENCY, max 255
  restore_schema_on_cluster: ""  # RESTORE_SCHEMA_ON_CLUSTER, execute all schema related SQL queryes with `ON CLUSTER` clause as Distributed DDL, look to `system.clusters` table for proper cluster name
  upload_by_part: true           # UPLOAD_BY_PART
  download_by_part: true         # DOWNLOAD_BY_PART
clickhouse:
  username: default                # CLICKHOUSE_USERNAME
  password: ""                     # CLICKHOUSE_PASSWORD
  host: localhost                  # CLICKHOUSE_HOST
  port: 9000                       # CLICKHOUSE_PORT, don't use 8123, clickhouse-backup doesn't support HTTP protocol
  disk_mapping: {}                 # CLICKHOUSE_DISK_MAPPING, use it if your system.disks on restored servers not the same with system.disks on server where backup was created
  skip_tables:                     # CLICKHOUSE_SKIP_TABLES
    - system.*
    - INFORMATION_SCHEMA.*
    - information_schema.*
  timeout: 5m                  # CLICKHOUSE_TIMEOUT
  freeze_by_part: false        # CLICKHOUSE_FREEZE_BY_PART
  secure: false                # CLICKHOUSE_SECURE, use SSL encryption for connect
  skip_verify: false           # CLICKHOUSE_SKIP_VERIFY
  sync_replicated_tables: true # CLICKHOUSE_SYNC_REPLICATED_TABLES
  tls_key: ""                  # CLICKHOUSE_TLS_KEY, filename with TLS key file 
  tls_cert: ""                 # CLICKHOUSE_TLS_CERT, filename with TLS certificate file 
  tls_ca: ""                   # CLICKHOUSE_TLS_CA, filename with TLS custom authority file 
  log_sql_queries: true        # CLICKHOUSE_LOG_SQL_QUERIES, enable log clickhouse-backup SQL queries on `system.query_log` table inside clickhouse-server
  debug: false                 # CLICKHOUSE_DEBUG
  config_dir:      "/etc/clickhouse-server"              # CLICKHOUSE_CONFIG_DIR
  restart_command: "systemctl restart clickhouse-server" # CLICKHOUSE_RESTART_COMMAND, this command use when you try to restore with --rbac or --config options
  ignore_not_exists_error_during_freeze: true # CLICKHOUSE_IGNORE_NOT_EXISTS_ERROR_DURING_FREEZE, allow avoiding backup failures when you often CREATE / DROP tables and databases during backup creation, clickhouse-backup will ignore `code: 60` and `code: 81` errors during execute `ALTER TABLE ... FREEZE` 
azblob:
  endpoint_suffix: "core.windows.net" # AZBLOB_ENDPOINT_SUFFIX
  account_name: ""             # AZBLOB_ACCOUNT_NAME
  account_key: ""              # AZBLOB_ACCOUNT_KEY
  sas: ""                      # AZBLOB_SAS
  use_managed_identity: false  # AZBLOB_USE_MANAGED_IDENTITY
  container: ""                # AZBLOB_CONTAINER
  path: ""                     # AZBLOB_PATH
  compression_level: 1         # AZBLOB_COMPRESSION_LEVEL
  compression_format: tar      # AZBLOB_COMPRESSION_FORMAT
  sse_key: ""                  # AZBLOB_SSE_KEY
  buffer_size: 0               # AZBLOB_BUFFER_SIZE, if less or eq 0 then calculated as max_file_size / max_parts_count, between 2Mb and 4Mb
  max_parts_count: 10000       # AZBLOB_MAX_PARTS_COUNT, number of parts for AZBLOB uploads, for properly calculate buffer size
  max_buffers: 3               # AZBLOB_MAX_BUFFERS
s3:
  access_key: ""                   # S3_ACCESS_KEY
  secret_key: ""                   # S3_SECRET_KEY
  bucket: ""                       # S3_BUCKET
  endpoint: ""                     # S3_ENDPOINT
  region: us-east-1                # S3_REGION
  acl: private                     # S3_ACL
  assume_role_arn: ""              # S3_ASSUME_ROLE_ARN
  force_path_style: false          # S3_FORCE_PATH_STYLE
  path: ""                         # S3_PATH
  disable_ssl: false               # S3_DISABLE_SSL
  compression_level: 1             # S3_COMPRESSION_LEVEL
  compression_format: tar          # S3_COMPRESSION_FORMAT
  sse: ""                          # S3_SSE, empty (default), AES256, or aws:kms
  disable_cert_verification: false # S3_DISABLE_CERT_VERIFICATION
  storage_class: STANDARD          # S3_STORAGE_CLASS
  concurrency: 1                   # S3_CONCURRENCY
  part_size: 0                     # S3_PART_SIZE, if less or eq 0 then calculated as max_file_size / max_parts_count, between 5MB and 5Gb
  max_parts_count: 10000           # S3_MAX_PARTS_COUNT, number of parts for S3 multipart uploads
  debug: false                     # S3_DEBUG
gcs:
  credentials_file: ""         # GCS_CREDENTIALS_FILE
  credentials_json: ""         # GCS_CREDENTIALS_JSON
  bucket: ""                   # GCS_BUCKET
  path: ""                     # GCS_PATH
  compression_level: 1         # GCS_COMPRESSION_LEVEL
  compression_format: tar      # GCS_COMPRESSION_FORMAT
  debug: false                 # GCS_DEBUG
cos:
  url: ""                      # COS_URL
  timeout: 2m                  # COS_TIMEOUT
  secret_id: ""                # COS_SECRET_ID
  secret_key: ""               # COS_SECRET_KEY
  path: ""                     # COS_PATH
  compression_format: tar      # COS_COMPRESSION_FORMAT
  compression_level: 1         # COS_COMPRESSION_LEVEL
ftp:
  address: ""                  # FTP_ADDRESS
  timeout: 2m                  # FTP_TIMEOUT
  username: ""                 # FTP_USERNAME
  password: ""                 # FTP_PASSWORD
  tls: false                   # FTP_TLS
  path: ""                     # FTP_PATH
  compression_format: tar      # FTP_COMPRESSION_FORMAT
  compression_level: 1         # FTP_COMPRESSION_LEVEL
  debug: false                 # FTP_DEBUG
sftp:
  address: ""                  # SFTP_ADDRESS
  username: ""                 # SFTP_USERNAME
  password: ""                 # SFTP_PASSWORD
  key: ""                      # SFTP_KEY
  path: ""                     # SFTP_PATH
  concurrency: 1               # SFTP_CONCURRENCY     
  compression_format: tar      # SFTP_COMPRESSION_FORMAT
  compression_level: 1         # SFTP_COMPRESSION_LEVEL
  debug: false                 # SFTP_DEBUG
api:
  listen: "localhost:7171"     # API_LISTEN
  enable_metrics: true         # API_ENABLE_METRICS
  enable_pprof: false          # API_ENABLE_PPROF
  username: ""                 # API_USERNAME, basic authorization for API endpoint
  password: ""                 # API_PASSWORD
  secure: false                # API_SECURE, use TLS for listen API socket
  certificate_file: ""         # API_CERTIFICATE_FILE
  private_key_file: ""         # API_PRIVATE_KEY_FILE
  create_integration_tables: false # API_CREATE_INTEGRATION_TABLES
  allow_parallel: false        # API_ALLOW_PARALLEL, could allocate much memory and spawn go-routines, don't enable it if you not sure

clickhouse-copier

clickhouse-copier – A scalable and reliable DR tool for ClickHouse which is capable of data movement between ClickHouse Tables and clusters. You can run multiple clickhouse-copier instances on different servers to perform the same job. ZooKeeper is used for syncing the processes.

After starting, clickhouse-copier:

Connects to ZooKeeper and receives:
- Copying jobs.
- The state of the copying jobs.
It performs the jobs.

Each running process chooses the “closest” shard of the source cluster and copies the data into the destination cluster, resharding the data if necessary.

clickhouse-copier tracks the changes in ZooKeeper and applies them on the fly.

To reduce network traffic, we recommend running clickhouse-copier on the same server where the source data is located.

Parameters:

daemon — Starts clickhouse-copier in daemon mode.
config — The path to the zookeeper.xml file with the parameters for the connection to ZooKeeper.
task-path — The path to the ZooKeeper node. This node is used for syncing clickhouse-copier processes and storing tasks. Tasks are stored in $task-path/description.
task-file — Optional path to file with task configuration for initial upload to ZooKeeper.
task-upload-force — Force upload task-file even if node already exists.
base-dir — The path to logs and auxiliary files. When it starts, clickhouse-copier creates clickhouse-copier_YYYYMMHHSS_<PID>subdirectories in $base-dir. If this parameter is omitted, the directories are created in the directory where clickhouse-copier was launched.

Conclusion

To read more about Clickhouse data backup and restore, do consider reading the below articles

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services

Backup & Data Restore (DR) in ClickHouse: Part 1

Introduction

Filesystem Snapshots

clickhouse-backup

Installation

Usage

Default Config

clickhouse-copier

Conclusion

Introduction

Filesystem Snapshots

clickhouse-backup

Installation

Usage

Default Config

clickhouse-copier

Conclusion

Related Articles

Connecting to ClickHouse with C++ client libraries

Implementing an Oracle RMAN-Like Backup and Recovery Toolkit for ClickHouse

Implementing Data Level Security on ClickHouse: Complete Technical Guide