How to Setup a 6-Node Cluster in ClickHouse with Replication & Sharding

Introduction

ClickHouse implements replication and sharding through a combination of distributed tables and replication settings.

Replication in ClickHouse is implemented by creating a cluster of servers, each of which contains a copy of the same data. When a change is made to the data on one server, it is automatically propagated to the other servers in the cluster. This ensures that all servers in the cluster have the same data and can handle read requests even if one of the servers goes down.

Sharding in ClickHouse is implemented by distributing data across multiple servers. ClickHouse supports several types of sharding, including:

  • Range-based sharding: Data is divided into ranges based on a specified key and stored on different servers.
  • Hash-based sharding: Data is divided into buckets based on a specified key and stored on different servers.
  • Replicated-based sharding: Data is replicated on multiple servers, and read requests are sent to all servers.
  • Distributed tables: Data is spread across multiple servers and can be accessed through a single table.

When a query is executed, ClickHouse automatically routes the query to the correct shard(s) based on the sharding configuration. This ensures that the data can be retrieved quickly, even for large datasets.

To configure sharding, you need to define the shards in the config file and then creating a table with the ENGINE=Distributed option and specify the sharding key and the shard to use.

It’s worth noting that you can use a combination of sharding and replication to improve performance and high availability, by sharding data across multiple servers and replicating each shard on multiple servers.

How is Horizontal Partitioning/Sharding implemented in Database Systems?

# Step 1: Define the data to be shared

# Step 2: Divide the data into multiple shards

# Step 3: Assign each shard to a server or group of servers

# Step 4: Configure a routing policy to direct requests to the correct shard

# Step 5: Monitor and adjust the sharding strategy as needed to ensure optimal performance

# Step 6: Rebalance the shards if data is added or deleted

Runbook to set up a 6-node ClickHouse Replicated and Sharded cluster

Step 1: Install ClickHouse on each of the six nodes. Ensure all nodes are running the same version of ClickHouse and have the same configuration.

Step 2: Configure replication on each node by adding the following settings to the config.xml file:

<replication>
    <replica>
        <host>hostname1</host>
        <port>9000</port>
    </replica>
    <replica>
        <host>hostname2</host>
        <port>9000</port>
    </replica>
    <replica>
        <host>hostname3</host>
        <port>9000</port>
    </replica>
    <replica>
        <host>hostname4</host>
        <port>9000</port>
    </replica>
    <replica>
        <host>hostname5</host>
        <port>9000</port>
    </replica>
    <replica>
        <host>hostname6</host>
        <port>9000</port>
    </replica>
</replication>

Step 3: Configure sharding on each node by adding the following settings to the config.xml file:

<remote_servers>
    <shard>
        <weight>1</weight>
        <internal_replication>true</internal_replication>
        <replica>
            <host>hostname1</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <weight>1</weight>
        <internal_replication>true</internal_replication>
        <replica>
            <host>hostname2</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <weight>1</weight>
        <internal_replication>true</internal_replication>
        <replica>
            <host>hostname3</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <weight>1</weight>
        <internal_replication>true</internal_replication>
        <replica>
            <host>hostname4</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <weight>1</weight>
        <internal_replication>true</internal_replication>
        <replica>
            <host>hostname5</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <weight>1</weight>
        <internal_replication>true</internal_replication>
        <replica>
            <host>hostname6</host>
            <port>9000</port>
        </replica>
    </shard>
</remote_servers>

Step 4: Start the ClickHouse server on each node

Conclusion

This is a guide to the relatively simple process of setting up a 6-node ClickHouse cluster with replication & sharding. However, the very same principles can be applied to achieve more complex and production-grade horizontally scaled ClickHouse infrastructure.

To learn more about Sharding in ClickHouse, do consider reading the below articles

About ChistaDATA Inc. 11 Articles
We are an full-stack ClickHouse infrastructure operations Consulting, Support and Managed Services provider with core expertise in performance, scalability and data SRE. Based out of California, Our consulting and support engineering team operates out of San Francisco, Vancouver, London, Germany, Russia, Ukraine, Australia, Singapore and India to deliver 24*7 enterprise-class consultative support and managed services. We operate very closely with some of the largest and planet-scale internet properties like PayPal, Garmin, Honda cars IoT project, Viacom, National Geographic, Nike, Morgan Stanley, American Express Travel, VISA, Netflix, PRADA, Blue Dart, Carlsberg, Sony, Unilever etc