Implementing Tiered Storage in ClickHouse: Leveraging S3 for Efficient Data Archival and Compliance

Using tiered storage like S3 for archiving data in ClickHouse is a common strategy for handling large volumes of data efficiently, particularly for compliance purposes where data must be retained but is queried infrequently. Here are some detailed suggestions and considerations for implementing this strategy:

Recommendations for Using S3 as Tiered Storage

1. Tiered Storage Strategy:

Move Old Data to S3: Configure ClickHouse to move older, less frequently accessed data to S3. This helps reduce the load on local storage and keeps costs down.

Retention Policy: Set up a retention policy that determines when data should be moved to S3 and how long it should be kept there. In ClickHouse this maps naturally onto table TTL rules, as sketched below.
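A minimal sketch of such a rule, using a hypothetical events table together with the tiered policy and cold volume defined in the configuration shown in the next section:

CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
-- Parts older than 6 months are moved to the S3-backed 'cold' volume automatically
TTL event_date + INTERVAL 6 MONTH TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';

With a rule like this, background merges and moves handle the archival; no manual ALTER statements are needed for routine aging.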

2. Configuring ClickHouse for S3:

S3 Disk Configuration: Define an S3 disk in the ClickHouse server configuration:

<storage_configuration>
    <disks>
        <default>
            <path>/var/lib/clickhouse/</path>
        </default>
        <s3>
            <type>s3</type>
            <!-- The endpoint contains the bucket name and an optional key prefix -->
            <endpoint>https://your-bucket-name.s3.amazonaws.com/clickhouse/</endpoint>
            <access_key_id>YOUR_ACCESS_KEY</access_key_id>
            <secret_access_key>YOUR_SECRET_KEY</secret_access_key>
        </s3>
    </disks>
    <policies>
        <tiered>
            <volumes>
                <hot>
                    <disk>default</disk>
                </hot>
                <cold>
                    <disk>s3</disk>
                </cold>
            </volumes>
        </tiered>
    </policies>
</storage_configuration>
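Once the server has reloaded its configuration, a quick sanity check against the system tables confirms that the disk and policy are visible (the names here match the configuration above):

SELECT name, path FROM system.disks;

SELECT policy_name, volume_name, disks
FROM system.storage_policies
WHERE policy_name = 'tiered';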

Move Parts to S3: For tables that use the tiered policy, you can also move older partitions to S3 on demand with ALTER TABLE.

ALTER TABLE your_table MOVE PARTITION 'partition_id' TO VOLUME 'cold';

3. Querying Archived Data:

Data Availability: Once data is moved to S3, it remains accessible for querying, albeit with potentially higher latency due to network access.

Cross-Cluster Access: Be careful with the assumption that multiple ClickHouse instances (e.g., clickhouse-1 and clickhouse-2) can simply share one S3 bucket: the metadata for parts written to an S3 disk is kept on the local filesystem of the server that wrote them, so another server cannot read those parts just by pointing at the same bucket. To make archived data readable across servers, use ReplicatedMergeTree (optionally with zero-copy replication over S3) or export the data to an open format such as Parquet.
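For servers that only need read access to archived copies, one pragmatic option is to query files in the bucket directly with the s3 table function. This is a sketch that assumes data has been exported to Parquet as described later in this post, and reuses the same placeholder bucket and credentials:

SELECT count()
FROM s3('https://your-bucket-name.s3.amazonaws.com/exports/your_table.parquet',
        'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY', 'Parquet');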

4. Data Format:

ClickHouse Native Format: The S3 disk stores data parts in ClickHouse's own MergeTree format, which is optimized for ClickHouse's performance and storage efficiency but is not directly readable by other tools.

Parquet Format: The tiered-storage disk itself does not write Parquet, but ClickHouse can export query results to Parquet natively (for example with the s3 table function or FORMAT Parquet), as shown in the export section later in this post.

Steps for Using S3 as Tiered Storage

1. Set Up S3 Disk:

• Ensure your S3 bucket is configured and accessible.

• Update the ClickHouse configuration to include the S3 disk.

2. Data Movement Policy:

• Define policies for when and how data should be moved to S3. This can be based on time (e.g., data older than 6 months) or size thresholds; a TTL-based sketch for an existing table follows below.
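For a table that already exists, a time-based rule can be attached after the fact. This sketch assumes the hypothetical your_table has a Date column named event_date and already uses the tiered storage policy:

ALTER TABLE your_table
    MODIFY TTL event_date + INTERVAL 6 MONTH TO VOLUME 'cold';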

3. Automate Data Archival:

• Prefer ClickHouse's built-in TTL ... TO VOLUME rules, which move data to S3 automatically in the background.

• Alternatively, schedule periodic ALTER TABLE ... MOVE PARTITION commands with an external scheduler such as cron.
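Moves triggered by TTL rules run in the background; they can be observed while in flight via the system.moves table (a sketch, assuming a reasonably recent ClickHouse release where this table is available):

SELECT database, table, part_name, target_disk_name, elapsed
FROM system.moves;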

4. Querying Archived Data:

• Ensure every ClickHouse instance that needs to query archived data is configured with the same S3 disk and storage policy (and, where data must be shared across servers, replicated or exported as discussed above).

• Test query performance to understand the impact of accessing data stored in S3.
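A simple way to see which tier each partition currently lives on, and therefore what a query would have to read from S3, is to group system.parts by disk (your_table is the placeholder table name used throughout this post):

SELECT
    partition,
    disk_name,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active AND table = 'your_table'
GROUP BY partition, disk_name
ORDER BY partition;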

Example: Setting Up Tiered Storage and Querying

1. Configure ClickHouse to Use S3:

Update config.xml to include the S3 disk and policy:

<yandex>
  <storage_configuration>
      <disks>
          <default>
              <path>/var/lib/clickhouse/</path>
          </default>
          <s3>
              <type>s3</type>
              <!-- The endpoint contains the bucket name and an optional key prefix -->
              <endpoint>https://your-bucket-name.s3.amazonaws.com/clickhouse/</endpoint>
              <access_key_id>YOUR_ACCESS_KEY</access_key_id>
              <secret_access_key>YOUR_SECRET_KEY</secret_access_key>
          </s3>
      </disks>
      <policies>
          <tiered>
              <volumes>
                  <hot>
                      <disk>default</disk>
                  </hot>
                  <cold>
                      <disk>s3</disk>
                  </cold>
              </volumes>
          </tiered>
      </policies>
  </storage_configuration>
</yandex>

2. Move Data to S3 (the table must have been created with the tiered storage policy, as shown earlier):

ALTER TABLE your_table MOVE PARTITION 'partition_id' TO VOLUME 'cold';

3. Query Archived Data (the query itself is unchanged; ClickHouse reads the cold parts from S3 transparently):

SELECT * FROM your_table WHERE partition_column = 'old_partition_value';

Exporting Data to Parquet

If you need to export ClickHouse data to Parquet format for interoperability or compliance, you can use one of the following approaches:

1. Export Using SQL:

INSERT INTO FUNCTION
    s3('https://your-bucket-name.s3.amazonaws.com/exports/your_table.parquet',
       'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY', 'Parquet')
SELECT * FROM your_table;

2. Command-Line Export and External Tools:

You can stream a query result straight into a Parquet file with clickhouse-client; for heavier conversion pipelines, external tools such as Apache Spark can also read from ClickHouse and write Parquet.

clickhouse-client --query="SELECT * FROM your_table FORMAT Parquet" > your_table.parquet

Conclusion

Using S3 as tiered storage in ClickHouse for archiving data is a viable strategy for managing large datasets, especially for compliance purposes. By configuring ClickHouse to move older data to S3, you can optimize local storage usage while still keeping the data accessible for infrequent queries. If other clusters or systems need to read the archived data, plan for replication or for exports to an open format such as Parquet.
