Data Replication in ClickHouse (Docker-Based Setup)

Data replication is the process of storing multiple copies of data to ensure system reliability and improve data availability. ClickHouse supports multi-primary, asynchronous (eventually consistent) replication. Every MergeTree family table engine has a replicated counterpart; engines outside the MergeTree family do not support replication.

Replicated tables require Apache Zookeeper or ClickHouse Keeper. Creating or dropping a replicated table must be done on every node of the cluster individually, while data can be inserted into and read from any node. Let us look at the steps to create a replicated table in a ClickHouse cluster. I am using Docker and Apache Zookeeper to demonstrate the steps involved; ensure the Docker engine and Docker Compose are installed and working.

Docker-based ClickHouse cluster

Let us set up a 3-node ClickHouse cluster and a single-node Zookeeper. I have used the docker-compose file below for this setup.

version: '3'

services:
  zookeeper:
    image: 'bitnami/zookeeper:latest'
    networks:
      - ch_replicated
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
      - ZOOKEEPER_CLIENT_PORT=2181
    ports:
      - "2182:2181"
      - "2888:2888"
      - "3888:3888"
    
  clickhouse1:
    image: clickhouse/clickhouse-server
    ports:
      - "8002:9000"
      - "9123:8123"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    networks:
      - ch_replicated
    depends_on:
      - zookeeper
      
  clickhouse2:
    image: clickhouse/clickhouse-server
    ports:
      - "8003:9000"
      - "9124:8123"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    networks:
      - ch_replicated
    depends_on:
      - zookeeper
      
  clickhouse3:
    image: clickhouse/clickhouse-server
    ports:
      - "8004:9000"
      - "9125:8123"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    networks:
      - ch_replicated
    depends_on:
      - zookeeper

networks:
  ch_replicated:
    driver: bridge

We create a bridge network (ch_replicated) so the containers can reach each other by service name. Fire up the containers using the docker-compose up command and verify that all of them are up and running.
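
For reference, starting the stack looks like this (run from the directory containing the docker-compose file):

# Start all four containers in the background
sudo docker-compose up -d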

vijayanand@chistadata:/media/vijayanand/Data/ClickHouse/ClickHouse_Replication$ sudo docker container ls
[sudo] password for vijayanand: 
CONTAINER ID   IMAGE                          COMMAND                  CREATED      STATUS          PORTS                                                                                                                                       NAMES
f60b856f1fb5   clickhouse/clickhouse-server   "/entrypoint.sh"         5 days ago   Up 58 seconds   9009/tcp, 0.0.0.0:9124->8123/tcp, :::9124->8123/tcp, 0.0.0.0:8003->9000/tcp, :::8003->9000/tcp                                              clickhouse_replication_clickhouse2_1
a04d155475e4   clickhouse/clickhouse-server   "/entrypoint.sh"         5 days ago   Up 58 seconds   9009/tcp, 0.0.0.0:9123->8123/tcp, :::9123->8123/tcp, 0.0.0.0:8002->9000/tcp, :::8002->9000/tcp                                              clickhouse_replication_clickhouse1_1
b222f8b44db8   clickhouse/clickhouse-server   "/entrypoint.sh"         5 days ago   Up 58 seconds   9009/tcp, 0.0.0.0:9125->8123/tcp, :::9125->8123/tcp, 0.0.0.0:8004->9000/tcp, :::8004->9000/tcp                                              clickhouse_replication_clickhouse3_1
69527fd60266   bitnami/zookeeper:latest       "/opt/bitnami/script…"   5 days ago   Up 58 seconds   0.0.0.0:2888->2888/tcp, :::2888->2888/tcp, 0.0.0.0:3888->3888/tcp, :::3888->3888/tcp, 8080/tcp, 0.0.0.0:2182->2181/tcp, :::2182->2181/tcp   clickhouse_replication_zookeeper_1
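
The SQL statements in the rest of this post can be executed by attaching a clickhouse-client session to one of the containers, for example:

# Open an interactive client session on the first node
# (container name as listed by docker container ls above)
sudo docker exec -it clickhouse_replication_clickhouse1_1 clickhouse-client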

Configuring the ClickHouse cluster for replicated tables

Configure the following in the config.xml file on every node (one way to apply the changes in this Docker setup is sketched right after this list):

  1. Cluster config
  2. Zookeeper config
  3. Macros (optional)
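
The compose file above does not mount any configuration volumes, so one way to apply these settings (a sketch, not the only approach) is to put the snippets below into a custom XML file, wrapped in a <clickhouse> root element, and copy it into each container's config.d directory. The file name replication.xml here is hypothetical:

# Copy the custom config into a container and restart it;
# repeat for clickhouse2 and clickhouse3
sudo docker cp replication.xml clickhouse_replication_clickhouse1_1:/etc/clickhouse-server/config.d/
sudo docker restart clickhouse_replication_clickhouse1_1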

Cluster Config

Add the following to the config.xml file on every node to define the multi-node ClickHouse cluster.

<remote_servers>
  <replicated_cluster>
    <shard>
      <internal_replication>true</internal_replication>
      <replica>
        <host>clickhouse1</host>
        <port>9000</port>
      </replica>
      <replica>
        <host>clickhouse2</host>
        <port>9000</port>
      </replica>
      <replica>
        <host>clickhouse3</host>
        <port>9000</port>
      </replica>
    </shard>
  </replicated_cluster>
</remote_servers>

This config makes each ClickHouse server aware of the other nodes in the cluster. The hostnames and ports match the service names and native-protocol port from the docker-compose file; Docker's bridge network resolves the service names to container IPs.
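
Once the cluster config is loaded, it can be verified from any node by querying the system.clusters table:

-- Every replica should appear as one row
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'replicated_cluster';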

Zookeeper Config

Add the details of the Zookeeper cluster to every node's config file.

<zookeeper>
  <node index="1">
    <host>zookeeper</host>
    <port>2181</port>
  </node>  
</zookeeper>
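
A quick way to confirm that ClickHouse can actually reach Zookeeper is to query the system.zookeeper table; the query below lists the top-level znodes and errors out if the connection is not working:

-- Requires a path filter in the WHERE clause
SELECT name FROM system.zookeeper WHERE path = '/';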

Macros

Macro substitutions are commonly used when configuring the Zookeeper path in replicated tables.

<macros>
  <shard>01</shard>
  <replica>ch1</replica>
</macros>

The macro values must differ on every node, as shown below.
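
For instance, with a single shard the second node can keep the same shard value and use a different replica value (the names ch1/ch2/ch3 are just a convention; any unique strings work):

<macros>
  <shard>01</shard>
  <replica>ch2</replica>
</macros>

The values in effect on a node can be checked with SELECT * FROM system.macros.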

Creating and testing a replicated table

Let us create a replicated table with just two columns. Repeat the creation on every configured node.

CREATE TABLE replicated_example
(
    ID UInt32,
    Name String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/replicated_example_1', '{replica}')
ORDER BY (ID);

The table creation syntax is similar to its non-replicated MergeTree counterpart. Along with the regular parameters, we have to provide the Zookeeper path (the znode path where the replication metadata is stored) and a replica name. The {replica} placeholder is substituted from the macro we configured earlier, so the same statement works on every node.
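
After the table has been created on all three nodes, the replication state can be inspected from any node through the system.replicas table; with the setup above, total_replicas and active_replicas should both be 3:

-- One row per replicated table on the local node
SELECT database, table, zookeeper_path, replica_name,
       total_replicas, active_replicas
FROM system.replicas
WHERE table = 'replicated_example';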

Once the table creation is successful, try inserting some data into the table from one node. The data will automatically become available on the rest of the configured nodes.
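
A minimal round trip (the sample values are arbitrary) looks like this:

-- On clickhouse1
INSERT INTO replicated_example (ID, Name) VALUES (1, 'Alice'), (2, 'Bob');

-- On clickhouse2 or clickhouse3; replication is asynchronous,
-- so the rows may take a moment to appear
SELECT * FROM replicated_example ORDER BY ID;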

Conclusion

  • Replicated versions of all the MergeTree family engines are available in ClickHouse.
  • Replication requires Zookeeper or ClickHouse Keeper.
  • DDL commands (CREATE, DROP) need to be run on every node.
  • DML and DQL commands (ALTER, INSERT, and SELECT) can be run on any one of the nodes.
