ClickHouse Server Configuration for High-Volume Data Ingestion

Introduction

Optimizing ClickHouse for high-velocity, high-volume data ingestion involves several server configuration and tuning techniques. The right values depend on your hardware capacity (CPU, RAM, disk, network). The ten techniques below come with general starting values that assume a robust hardware setup (e.g., servers with many CPU cores, large RAM, and fast SSD storage).

Top 10 Techniques to Configure ClickHouse for Ingestion Performance

1. max_insert_block_size

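  • Purpose: Sets the number of rows per block when the server forms blocks from inserted data (for example, for an INSERT over the HTTP interface).
  • Recommended Value: The default is already roughly one million rows, so raise it above 1048576 only if RAM allows. Larger blocks improve insert speed but consume more memory.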
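The memory trade-off can be estimated before changing the setting. The following is a minimal sketch, assuming an average uncompressed row width that you measure yourself; the 200-byte figure and both function names are illustrative, and real memory use also depends on column types and buffers.

```python
def estimate_block_bytes(rows_per_block: int, avg_row_bytes: int) -> int:
    """Rough in-memory size of one insert block: rows times average row width."""
    return rows_per_block * avg_row_bytes


def rows_for_budget(budget_bytes: int, avg_row_bytes: int) -> int:
    """Largest row count whose estimated block size fits within budget_bytes."""
    return budget_bytes // avg_row_bytes


if __name__ == "__main__":
    # Assumed figures: 1048576 rows per block, 200 bytes per row on average.
    block = estimate_block_bytes(1_048_576, 200)
    print(f"{block / 2**20:.0f} MiB per block")  # 1048576 * 200 bytes = 200 MiB
```

Multiply the result by the number of concurrent inserts to get a rough upper bound on block memory.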

2. max_bytes_before_external_group_by

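  • Purpose: Sets the memory threshold at which GROUP BY operations spill to disk.
  • Recommended Value: Keep it below max_memory_usage, because a threshold above the per-query limit means the query fails before it can spill. The ClickHouse documentation suggests about half of max_memory_usage. For instance, with 128 GB RAM and max_memory_usage near 80 GB, set around 40G.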

3. max_memory_usage

  • Purpose: Limits memory usage per query.
  • Recommended Value: Configure to around 60-70% of available RAM per query, depending on other workloads.
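The two memory settings above are easier to keep consistent when derived from one number. This is a hedged sketch, assuming the spill threshold is kept at roughly half of max_memory_usage, as the ClickHouse documentation suggests; the function names and the 0.6 fraction are illustrative choices, not fixed rules.

```python
GIB = 1024 ** 3


def memory_settings(total_ram_gib: int, query_fraction: float = 0.6) -> dict:
    """Derive per-query memory settings (in bytes) from total server RAM."""
    max_memory_usage = int(total_ram_gib * GIB * query_fraction)
    return {
        "max_memory_usage": max_memory_usage,
        # Assumption: spill GROUP BY to disk at about half the per-query limit.
        "max_bytes_before_external_group_by": max_memory_usage // 2,
    }


def as_set_statements(settings: dict) -> list:
    """Render the settings as session-level SET statements."""
    return [f"SET {name} = {value};" for name, value in settings.items()]


if __name__ == "__main__":
    for statement in as_set_statements(memory_settings(128)):
        print(statement)
```

The same values can instead be placed in a settings profile; the SET form is shown only because it is the quickest way to experiment in a session.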

4. max_partitions_per_insert_block

  • Purpose: Limits the number of partitions a single inserted block may touch; an INSERT that exceeds the limit fails.
  • Recommended Value: Set based on partition strategy. The default is 100; raise it only when high-volume inserts legitimately span more partitions than that.
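Before raising the limit, it helps to check how many partitions a batch really touches. A minimal sketch, assuming a hypothetical table partitioned by month (the equivalent of toYYYYMM on a date column); the function names are illustrative.

```python
from datetime import date


def partition_key(d: date) -> int:
    """Mimic a monthly partition key, e.g. 2024-03-15 -> 202403."""
    return d.year * 100 + d.month


def count_partitions(dates) -> int:
    """Number of distinct monthly partitions touched by a batch of dates."""
    return len({partition_key(d) for d in dates})


def fits_partition_limit(dates, limit: int = 100) -> bool:
    """True if the batch stays within the per-insert partition limit."""
    return count_partitions(dates) <= limit


if __name__ == "__main__":
    batch = [date(2024, 1, 5), date(2024, 1, 20), date(2024, 2, 1)]
    print(count_partitions(batch), fits_partition_limit(batch))
```

If a batch regularly exceeds the limit, grouping rows by partition on the client before inserting is often a better fix than raising the setting.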

5. Use Bulk Insert Mechanism

  • Recommended Technique: Employ batch processing or ClickHouse’s native interfaces for bulk inserts.
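Batching can be done on the client regardless of which driver is used. The sketch below only groups an arbitrary row stream into fixed-size batches; the actual insert call is left out because it depends on your client library, and the batch size shown is an assumption to tune.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Each batch would be sent as one INSERT by the client library of your choice.
    for batch in batched(range(250_000), 100_000):
        print(len(batch))
```

Fewer, larger inserts create fewer data parts for ClickHouse to merge than many small ones.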

6. Disk I/O Optimization

  • Recommended Setup: Use SSDs (preferably NVMe). RAID 10 can be ideal for a balance of speed and redundancy.

7. insert_quorum

  • Purpose: Ensures data consistency in replicated setups.
  • Recommended Value: For faster inserts where immediate consistency isn't critical, reduce this value or set it to 0, which disables quorum writes.

8. Compression Settings

  • Purpose: Optimizes storage and I/O.
  • Recommended Codec: LZ4 for faster compression/decompression. Use ZSTD for higher compression ratios at the cost of CPU.

9. Distributed Data Loading

  • Recommended Technique: In a cluster, shard data to distribute inserts across multiple nodes.
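One way to spread inserts is to route rows to shards on the client by hashing a key. This is a sketch of that idea only, not of how ClickHouse's own Distributed tables evaluate a sharding key; the `user_id` field and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (CRC32 modulo shard count)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def group_by_shard(rows, key_field: str, num_shards: int) -> dict:
    """Group row dicts by target shard so each group can be inserted separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[shard_for(str(row[key_field]), num_shards)].append(row)
    return dict(groups)


if __name__ == "__main__":
    rows = [{"user_id": i, "event": "click"} for i in range(10)]
    for shard, shard_rows in sorted(group_by_shard(rows, "user_id", 3).items()):
        print(shard, len(shard_rows))
```

A stable hash keeps all rows for one key on the same shard, which matters if later queries aggregate by that key.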

10. Network Configuration

  • Purpose: Enhances data transfer speeds.
  • Recommended Settings: On Linux, increase the kernel socket buffer limits net.core.wmem_max and net.core.rmem_max to 262144 bytes or more. Adjust TCP settings such as net.ipv4.tcp_max_syn_backlog and net.ipv4.tcp_fin_timeout.
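The buffer values above can be kept in a small sysctl fragment. The sketch below only renders such a fragment as text; the helper name is illustrative, and the two TCP parameters are omitted because the article gives no values for them.

```python
def render_sysctl(settings: dict) -> str:
    """Render kernel parameters as 'key = value' lines, sorted by key."""
    lines = [f"{key} = {value}" for key, value in sorted(settings.items())]
    return "\n".join(lines) + "\n"


# Socket buffer limits taken from the recommendation above (in bytes).
BUFFER_SETTINGS = {
    "net.core.rmem_max": 262144,
    "net.core.wmem_max": 262144,
}

if __name__ == "__main__":
    print(render_sysctl(BUFFER_SETTINGS), end="")
```

On most Linux distributions such a fragment would typically be saved under /etc/sysctl.d/ and loaded with `sysctl --system`; check your distribution's conventions before applying it.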

Additional Considerations

  • Monitor Performance: Regularly review performance metrics and adjust configurations as needed.
  • Hardware Utilization: Ensure that your configuration aligns with your hardware specs—avoid settings that may overutilize or underutilize your resources.
  • ClickHouse Version: Keep ClickHouse updated for the latest performance improvements.

Conclusion

These recommendations provide a starting point, but optimal settings can vary greatly based on your specific hardware and workload. It’s essential to test and monitor the impact of these settings in your environment and adjust accordingly.

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv is currently the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker at open source software conferences globally.