Data Backup and Restore in ClickHouse


Introduction

Backup is a way to protect and restore databases. It is a safeguard against unexpected data loss and application errors. Replication protects the databases from hardware or connection failures, but user errors (like accidentally dropping a table or a database) and software bugs (like incorrect data processing and data corruption) remain a problem, because these mistakes propagate to the whole cluster, including every replica.

It is better for a company to prepare a backup and restore strategy in advance. On the other hand, there is no magic solution that fits every situation. In ClickHouse, we can use a variety of backup and restore methods. Some of them are:

  • Backup / Restore
  • Export / Import (see the related KB article)
  • Duplicating Source Data
  • Filesystem Snapshots
  • Using a tool (clickhouse-copier)
  • Manipulations with Parts

Every method has its own pros and cons. It is important to choose one or more suitable methods for your company.

This article aims to explain the ClickHouse BACKUP/RESTORE method.

 

Environments

For the demos in this article, ClickHouse v22.9.3 on Ubuntu Linux 22.04 was used.

 

Backup & Restore

When starting to configure the backup solution, the first thing to do is to specify the backup location. As a backup destination, you can use a local directory, an NFS-mounted filesystem, or even cloud storage such as Amazon S3. For that purpose, we need to create an XML file under “/etc/clickhouse-server/config.d”.

The local directory configuration file is shown below. In this example, “/clickHouse/backups/” is the folder where backups are stored; you can use any other folder as the path. We also named the backup disk “backups”, and we will reference that name in the backup commands.

<clickhouse>
    <storage_configuration>
        <disks>
            <backups>
                <type>local</type>
                <path>/clickHouse/backups/</path>
            </backups>
        </disks>
    </storage_configuration>
    <backups>
        <allowed_disk>backups</allowed_disk>
        <allowed_path>/clickHouse/backups/</allowed_path>
    </backups>
</clickhouse>

 

Once this XML file is added, you need to restart the ClickHouse server. After that, log in to the database and check whether the backup disk is configured. As you can see, the “backups” disk with the given path is now available.

clickhouse01 :) 
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    formatReadableSize(keep_free_space) AS reserved
FROM system.disks

Query id: c45d1937-d10a-494f-b532-08d4ed9f06d7

┌─name────┬─path─────────────────┬─free──────┬─total─────┬─reserved─┐
│ backups │ /clickHouse/backups/ │ 17.84 GiB │ 28.89 GiB │ 0.00 B   │
│ default │ /var/lib/clickhouse/ │ 17.84 GiB │ 28.89 GiB │ 0.00 B   │
└─────────┴──────────────────────┴───────────┴───────────┴──────────┘

 

Also, it is possible to configure S3 as a backup disk. Of course, we need an active AWS S3 bucket and keys to access it.

<clickhouse>
  <storage_configuration>
    <disks>
      <s3>
        <type>s3</type>
        <endpoint>https://your_buckets_URL_address</endpoint>
        <access_key_id>A.........5</access_key_id>
        <secret_access_key>N...........D</secret_access_key>
      </s3>
    </disks>
  </storage_configuration>
  <backups>
    <allowed_disk>s3</allowed_disk>
    <allowed_path>/backups/</allowed_path>
  </backups>
</clickhouse>

 

After the backup location is set, it is time to take a backup. The backup and restore commands are basically defined as shown below.

BACKUP TABLE|DATABASE table_name|database_name TO Disk('<backup_location>', 'backupName.zip');

RESTORE TABLE|DATABASE table_name|database_name FROM Disk('<backup_location>', 'backupName.zip');

 

Backup Parameters

There are some useful parameters that you can use. These are:

  • ASYNC: back up or restore asynchronously (default is SYNC)
  • PARTITIONS: a list of partitions to back up or restore; allows operating on specific partitions only
  • SETTINGS:
    • compression_method and compression_level
    • password for the file on disk
    • base_backup: the destination of a previous backup; makes it possible to take an incremental backup, reducing storage usage, time, and network bandwidth
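A few sketches of how these parameters can be combined, using the “backups” disk configured above (backup names, the password, and partition IDs are illustrative, not from the original tests):

```sql
-- Asynchronous backup: the command returns immediately,
-- and progress can be followed in system.backups.
BACKUP TABLE opensky TO Disk('backups', 'opensky_async.zip') ASYNC;

-- Compressed, password-protected backup (values are illustrative).
BACKUP TABLE opensky TO Disk('backups', 'opensky_full.zip')
    SETTINGS compression_method = 'lzma', compression_level = 3, password = 'Str0ngPass';

-- Incremental backup on top of the previous full backup.
BACKUP TABLE opensky TO Disk('backups', 'opensky_incr.zip')
    SETTINGS base_backup = Disk('backups', 'opensky_full.zip');

-- Restore only specific partitions (partition IDs are illustrative).
RESTORE TABLE opensky PARTITIONS '202201', '202202'
    FROM Disk('backups', 'opensky_async.zip');
```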

 

In the scope of this blog, we did some tests on both single instance and sharded cluster databases. Let’s examine the backup/restore scenarios.

Scenario I – Single Instance Backup&Restore

First of all, we need a sample dataset to back up. Therefore, we can use air traffic data from the OpenSky network. You can download and install it as shown here. Actually, this is a huge dataset, so I used a subset with 11 million rows.

clickhouse01 :)
SELECT count()
FROM opensky

┌──count()─┐
│ 11348850 │
└──────────┘

 

Let’s start with a single-table backup. While the backup was running, I truncated the table from another console.

#Console 1
clickhouse01 :) BACKUP TABLE opensky TO Disk('backups', 'opensky_20221007');

BACKUP TABLE opensky TO Disk('backups', 'opensky_20221007')

Query id: e490789b-59ae-4d08-861e-3514a74988f1

┌─id───────────────────────────────────┬─status─────────┐
│ 1bb1733d-5b70-4c94-a113-a2b96d6e54ab │ BACKUP_CREATED │
└──────────────────────────────────────┴────────────────┘


#Console 2
clickhouse01 :) truncate table opensky;

TRUNCATE TABLE opensky

Now, the table exists but has no records in it.

clickhouse01 :) select count() from opensky;

SELECT count()
FROM opensky

Query id: 2316aa82-2f2a-4174-85b3-c5038c7f6f90

┌─count()─┐
│       0 │
└─────────┘

 

You can access backup and restore information from a system table called system.backups. Unfortunately, it is an in-memory table and is flushed after a restart, so it is a good idea to persist its records into another table that we create. As you can see, because we did not use the “.zip” suffix in the backup statement, the backup was not compressed. To compress backups and reduce disk usage, use the “backupName.zip” format.

SELECT *
FROM system.backups
WHERE id = '1bb1733d-5b70-4c94-a113-a2b96d6e54ab'
FORMAT Vertical

Query id: c6997a5e-7d7d-4672-a4f8-c9885a5e1008

Row 1:
──────
id:                1bb1733d-5b70-4c94-a113-a2b96d6e54ab
name:              Disk('backups', 'opensky_20221007')
status:            BACKUP_CREATED
num_files:         117
uncompressed_size: 508789970
compressed_size:   508789970
error:             
start_time:        2022-10-07 09:32:20
end_time:          2022-10-07 09:32:23
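Since system.backups does not survive a restart, one way to keep a history is to copy its rows into a regular table. A minimal sketch, assuming the “backupInfo” database exists; the table name is hypothetical:

```sql
-- Create a persistent copy of the current backup records.
CREATE TABLE backupInfo.backups_log ENGINE = MergeTree ORDER BY start_time AS
SELECT * FROM system.backups;

-- After later backups, append only the rows that are not logged yet.
INSERT INTO backupInfo.backups_log
SELECT * FROM system.backups
WHERE id NOT IN (SELECT id FROM backupInfo.backups_log);
```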

 

Let’s have a look at the backup disk and examine the backup file. As you can see, it backed up data, metadata, and other necessary information.

root@clickhouse01:/clickHouse/backups/opensky_20221007# ls -al
total 40
drwxr-x--- 4 clickhouse clickhouse  4096 Oct  7 09:32 .
drwxr-x--- 3 clickhouse clickhouse  4096 Oct  7 09:32 ..
-rw-r----- 1 clickhouse clickhouse 20731 Oct  7 09:32 .backup
drwxr-x--- 3 clickhouse clickhouse  4096 Oct  7 09:32 data
drwxr-x--- 3 clickhouse clickhouse  4096 Oct  7 09:32 metadata

 

Now, it is time to restore the table. As you remember, we truncated the table in the middle of the backup process; ClickHouse did not lock the table, and the backup operation did not fail. Restore is quite similar to backup. After the restore is completed, we can access the table and all of its rows.

clickhouse01 :) RESTORE TABLE opensky FROM Disk('backups', 'opensky_20221007');

┌─id───────────────────────────────────┬─status───┐
│ ea94e314-4d8d-4adf-ab78-07ff498977be │ RESTORED │
└──────────────────────────────────────┴──────────┘

clickhouse01 :) select count() from opensky;

┌──count()─┐
│ 11348850 │
└──────────┘

 

It is possible to back up not only tables but also databases. Let’s back up the database and, in the middle of the process, drop the entire database with its tables.

#console 1
clickhouse01 :) show tables;

┌─name─────────────────┐
│ opensky              │
│ openskyDeletedRecord │
│ opensky_1000000      │
│ opensky_temp         │
└──────────────────────┘

clickhouse01 :) BACKUP database sampleDatasets TO Disk('backups', 'sampleDatasets_2022107.zip');

┌─id───────────────────────────────────┬─status─────────┐
│ 4cbd6759-2e98-40fa-bebd-dcc9c16fc4ac │ BACKUP_CREATED │
└──────────────────────────────────────┴────────────────┘

#console2
clickhouse01 :) drop database sampleDatasets;
clickhouse01 :) show databases;

┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ backupInfo         │
│ default            │
│ information_schema │
│ system             │
└────────────────────┘


 

It is time to restore the database. As with restoring a table, restoring a database is quite simple, and dropping the database in the middle of the backup process did not cause any problems.

clickhouse01 :) RESTORE database sampleDatasets FROM Disk('backups', 'sampleDatasets_2022107.zip');
┌─id───────────────────────────────────┬─status───┐
│ ab332153-6346-441e-a593-a489160099ac │ RESTORED │
└──────────────────────────────────────┴──────────┘

clickhouse01 :) show tables;
┌─name─────────────────┐
│ opensky              │
│ openskyDeletedRecord │
│ opensky_1000000      │
│ opensky_temp         │
└──────────────────────┘

 

You can use AWS S3 storage as a backup location, too. To do that, you need to create the S3 XML file (as shown earlier in this document) under the config.d folder and restart the ClickHouse server. After that, the S3 bucket is ready for backup and restore operations.
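Assuming the “s3” disk from the earlier configuration file, the commands look the same as with the local disk, only with a different disk name (a sketch; the backup name is illustrative):

```sql
BACKUP TABLE opensky TO Disk('s3', 'opensky_20221007.zip');

RESTORE TABLE opensky FROM Disk('s3', 'opensky_20221007.zip');
```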

Scenario II – Sharded Cluster Backup&Restore

For this test, you will need to create a cluster with 2 shards. You can use this document to create it with ClickHouse Keeper. Using a shared disk for backup and restore operations is a better idea; for that reason, we created an NFS filesystem and mounted it on all servers. After that, we placed the backup_disk.xml file (as described earlier in this document) on all nodes and restarted the servers.

Because we are using a cluster infrastructure, we need to create the database and local tables on the cluster. In this example, our cluster name is “cls_chista_2S_1R”. In the scope of this blog, I will not explain how to create clusters and shards.

It is enough to run these commands on any of the cluster nodes.

CREATE DATABASE flightlist ON CLUSTER cls_chista_2S_1R;

CREATE TABLE flightlist.opensky_local ON CLUSTER cls_chista_2S_1R
(
    `callsign` String,
    `number` String,
    `icao24` String,
    `registration` String,
    `typecode` String,
    `origin` String,
    `destination` String,
    `firstseen` DateTime,
    `lastseen` DateTime,
    `day` DateTime,
    `latitude_1` Float64,
    `longitude_1` Float64,
    `altitude_1` Float64,
    `latitude_2` Float64,
    `longitude_2` Float64,
    `altitude_2` Float64
)
ENGINE = MergeTree
ORDER BY (origin, destination, callsign)

 

It is time to create the main table. From now on, we have to use the distributed table for all queries and DML operations.

CREATE TABLE flightlist.opensky ON CLUSTER cls_chista_2S_1R
(
    `callsign` String,
    `number` String,
    `icao24` String,
    `registration` String,
    `typecode` String,
    `origin` String,
    `destination` String,
    `firstseen` DateTime,
    `lastseen` DateTime,
    `day` DateTime,
    `latitude_1` Float64,
    `longitude_1` Float64,
    `altitude_1` Float64,
    `latitude_2` Float64,
    `longitude_2` Float64,
    `altitude_2` Float64
)
ENGINE = Distributed(cls_chista_2S_1R, flightlist, opensky_local, cityHash64(origin, destination, callsign));

 

After inserting the data into the distributed “opensky” table, ClickHouse distributed the rows over nodes with the given shard key definition.

clickhouse01 :) select count() from opensky_local;
┌─count()─┐
│ 4387412 │
└─────────┘

clickhouse02 :) select count() from opensky_local;
┌─count()─┐
│ 4422271 │
└─────────┘

clickhouse01 :)SELECT count() FROM opensky;
┌─count()─┐
│ 8809683 │
└─────────┘

 

You can back up and restore the table on every single node with its local data. In the following example, we backed up only the data in the local table.

clickhouse01 :) BACKUP TABLE flightlist.opensky_local TO Disk('backups', 'opensky_20221007_S1.zip');

BACKUP TABLE flightlist.opensky_local TO Disk('backups', 'opensky_20221007_S1.zip')

 

Backup and restore operations are similar to single-node operations, but there is one important point to take care of: each shard holds only its own part of the data in the local table (“opensky_local”), and the distributed table (“opensky”) itself has no data. Therefore, it is important to back up the database on all nodes concurrently.
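ClickHouse also supports an ON CLUSTER clause for BACKUP/RESTORE, which runs the statement on every node at once. Over the shared NFS-backed “backups” disk described above, a sketch could look like this (the backup name is illustrative):

```sql
-- Each node writes its own shard's data to the shared "backups" disk.
BACKUP DATABASE flightlist ON CLUSTER cls_chista_2S_1R
    TO Disk('backups', 'flightlist_20221007.zip');

RESTORE DATABASE flightlist ON CLUSTER cls_chista_2S_1R
    FROM Disk('backups', 'flightlist_20221007.zip');
```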

 

Conclusion

Backup and restore are crucial for database operations. Because there is no magical method that covers every problem, it is important to choose the right strategies to protect against data loss. By their nature, backup and restore processes use system resources such as network bandwidth, CPU, and memory, so it is important to schedule backup operations at the least busy time. It is also important to test the backups you take to verify that they are restorable on another server. Another significant point is backup automation and monitoring: it is important to take daily backups automatically and monitor them in case of failure.

In ClickHouse, you can back up a database, a table, or even a materialized view, and restore it at the database, table, or partition level. Since it is a column-store analytic database, the point-in-time recovery concept is not applicable to restore operations, but it is possible to take incremental backups instead.

 

About Emrah Idman 9 Articles
Emrah Idman has considerable experience in relational and NoSQL databases. He has worked in a large-scale financial company for over 15 years. He has significant experience in team management, procurement and capacity planning, database administration and product testing for high-volume systems. He is working at ChistaDATA Inc. as senior database administrator.