Implementing Tiered Storage in ClickHouse: Leveraging S3 for Efficient Data Archival and Compliance

Using tiered storage like S3 for archiving data in ClickHouse is a common strategy for handling large volumes of data efficiently, particularly for compliance purposes where data must be retained but is queried infrequently. Here are some detailed suggestions and considerations for implementing this strategy:

Recommendations for Using S3 as Tiered Storage

1. Tiered Storage Strategy:

Move Old Data to S3: Configure ClickHouse to move older, less frequently accessed data to S3. This helps reduce the load on local storage and keeps costs down.

Retention Policy: Set up a retention policy that determines when data should be moved to S3 and how long it should be kept there. In ClickHouse this maps naturally onto table TTL rules, as sketched below.
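A minimal sketch of such a rule, using a hypothetical events table together with the tiered policy and cold volume defined in the configuration shown in the next section:

CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
-- Parts older than 6 months are moved to the S3-backed 'cold' volume automatically
TTL event_date + INTERVAL 6 MONTH TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';

With a rule like this, background merges and moves handle the archival; no manual ALTER statements are needed for routine aging.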

2. Configuring ClickHouse for S3:

S3 Disk Configuration: Define an S3 disk in the ClickHouse server configuration:

<storage_configuration>
    <disks>
        <default>
            <path>/var/lib/clickhouse/</path>
        </default>
        <s3>
            <type>s3</type>
            <!-- The endpoint contains the bucket name and an optional key prefix -->
            <endpoint>https://your-bucket-name.s3.amazonaws.com/clickhouse/</endpoint>
            <access_key_id>YOUR_ACCESS_KEY</access_key_id>
            <secret_access_key>YOUR_SECRET_KEY</secret_access_key>
        </s3>
    </disks>
    <policies>
        <tiered>
            <volumes>
                <hot>
                    <disk>default</disk>
                </hot>
                <cold>
                    <disk>s3</disk>
                </cold>
            </volumes>
        </tiered>
    </policies>
</storage_configuration>
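Once the server has reloaded its configuration, a quick sanity check against the system tables confirms that the disk and policy are visible (the names here match the configuration above):

SELECT name, path FROM system.disks;

SELECT policy_name, volume_name, disks
FROM system.storage_policies
WHERE policy_name = 'tiered';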

Move Parts to S3: For tables that use the tiered policy, you can also move older partitions to S3 on demand with ALTER TABLE.

ALTER TABLE your_table MOVE PARTITION 'partition_id' TO VOLUME 'cold';

3. Querying Archived Data:

Data Availability: Once data is moved to S3, it remains accessible for querying, albeit with potentially higher latency due to network access.

Cross-Cluster Access: Be careful with the assumption that multiple ClickHouse instances (e.g., clickhouse-1 and clickhouse-2) can simply share one S3 bucket: the metadata for parts written to an S3 disk is kept on the local filesystem of the server that wrote them, so another server cannot read those parts just by pointing at the same bucket. To make archived data readable across servers, use ReplicatedMergeTree (optionally with zero-copy replication over S3) or export the data to an open format such as Parquet.
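For servers that only need read access to archived copies, one pragmatic option is to query files in the bucket directly with the s3 table function. This is a sketch that assumes data has been exported to Parquet as described later in this post, and reuses the same placeholder bucket and credentials:

SELECT count()
FROM s3('https://your-bucket-name.s3.amazonaws.com/exports/your_table.parquet',
        'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY', 'Parquet');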

4. Data Format:

ClickHouse Native Format: The S3 disk stores data parts in ClickHouse's own MergeTree format, which is optimized for ClickHouse's performance and storage efficiency but is not directly readable by other tools.

Parquet Format: The tiered-storage disk itself does not write Parquet, but ClickHouse can export query results to Parquet natively (for example with the s3 table function or FORMAT Parquet), as shown in the export section later in this post.

Steps for Using S3 as Tiered Storage

1. Set Up S3 Disk:

• Ensure your S3 bucket is configured and accessible.

• Update the ClickHouse configuration to include the S3 disk.

2. Data Movement Policy:

• Define policies for when and how data should be moved to S3. This can be based on time (e.g., data older than 6 months) or size thresholds; a TTL-based sketch for an existing table follows below.
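For a table that already exists, a time-based rule can be attached after the fact. This sketch assumes the hypothetical your_table has a Date column named event_date and already uses the tiered storage policy:

ALTER TABLE your_table
    MODIFY TTL event_date + INTERVAL 6 MONTH TO VOLUME 'cold';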

3. Automate Data Archival:

• Prefer ClickHouse's built-in TTL ... TO VOLUME rules, which move data to S3 automatically in the background.

• Alternatively, schedule periodic ALTER TABLE ... MOVE PARTITION commands with an external scheduler such as cron.
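Moves triggered by TTL rules run in the background; they can be observed while in flight via the system.moves table (a sketch, assuming a reasonably recent ClickHouse release where this table is available):

SELECT database, table, part_name, target_disk_name, elapsed
FROM system.moves;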

4. Querying Archived Data:

• Ensure every ClickHouse instance that needs to query archived data is configured with the same S3 disk and storage policy (and, where data must be shared across servers, replicated or exported as discussed above).

• Test query performance to understand the impact of accessing data stored in S3.
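A simple way to see which tier each partition currently lives on, and therefore what a query would have to read from S3, is to group system.parts by disk (your_table is the placeholder table name used throughout this post):

SELECT
    partition,
    disk_name,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active AND table = 'your_table'
GROUP BY partition, disk_name
ORDER BY partition;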

Example: Setting Up Tiered Storage and Querying

1. Configure ClickHouse to Use S3:

Update config.xml to include the S3 disk and policy:

<yandex>
  <storage_configuration>
      <disks>
          <default>
              <path>/var/lib/clickhouse/</path>
          </default>
          <s3>
              <type>s3</type>
              <!-- The endpoint contains the bucket name and an optional key prefix -->
              <endpoint>https://your-bucket-name.s3.amazonaws.com/clickhouse/</endpoint>
              <access_key_id>YOUR_ACCESS_KEY</access_key_id>
              <secret_access_key>YOUR_SECRET_KEY</secret_access_key>
          </s3>
      </disks>
      <policies>
          <tiered>
              <volumes>
                  <hot>
                      <disk>default</disk>
                  </hot>
                  <cold>
                      <disk>s3</disk>
                  </cold>
              </volumes>
          </tiered>
      </policies>
  </storage_configuration>
</yandex>

2. Move Data to S3 (the table must have been created with the tiered storage policy, as shown earlier):

ALTER TABLE your_table MOVE PARTITION 'partition_id' TO VOLUME 'cold';

3. Query Archived Data (the query itself is unchanged; ClickHouse reads the cold parts from S3 transparently):

SELECT * FROM your_table WHERE partition_column = 'old_partition_value';

Exporting Data to Parquet

If you need to export ClickHouse data to Parquet format for interoperability or compliance, you can use one of the following approaches:

1. Export Using SQL:

INSERT INTO FUNCTION
    s3('https://your-bucket-name.s3.amazonaws.com/exports/your_table.parquet',
       'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY', 'Parquet')
SELECT * FROM your_table;

2. Command-Line Export and External Tools:

You can stream a query result straight into a Parquet file with clickhouse-client; for heavier conversion pipelines, external tools such as Apache Spark can also read from ClickHouse and write Parquet.

clickhouse-client --query="SELECT * FROM your_table FORMAT Parquet" > your_table.parquet

Conclusion

Using S3 as tiered storage in ClickHouse for archiving data is a viable strategy for managing large datasets, especially for compliance purposes. By configuring ClickHouse to move older data to S3, you can optimize local storage usage while still keeping the data accessible for infrequent queries. If other clusters or systems need to read the archived data, plan for replication or for exports to an open format such as Parquet.
