Introduction
Optimizing ClickHouse for high-velocity, high-volume data ingestion involves several server configuration and tuning techniques. The right values depend on your hardware capacity (CPU, RAM, disk, network). Below are the top 10 techniques with general recommended values, assuming a robust hardware setup (e.g., servers with many CPU cores, large RAM, and fast SSD storage).
Top 10 Techniques to Configure ClickHouse for Ingestion Performance
1. max_insert_block_size
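- Purpose: Sets the maximum number of rows per block that the server forms when it parses INSERT data (for example, over the HTTP interface).
- Recommended Value: The default is already about one million rows (1048449). Increase to `1048576` or higher only if RAM allows. Larger blocks improve insert speed but consume more memory.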
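As a minimal sketch, the setting can be inspected and overridden for a single session from any SQL client. The value shown is simply the figure mentioned above, not a tuned recommendation for your hardware.

```sql
-- Show the current value and whether it has been changed from the default.
SELECT name, value, changed
FROM system.settings
WHERE name = 'max_insert_block_size';

-- Override for the current session only; larger blocks need more RAM per insert.
SET max_insert_block_size = 1048576;
```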
2. max_bytes_before_external_group_by
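- Purpose: Sets the memory threshold at which `GROUP BY` operations spill to disk. It affects ingestion when INSERT ... SELECT statements or materialized views aggregate data.
- Recommended Value: Keep it below `max_memory_usage`, otherwise a query reaches its memory limit before it can spill. The ClickHouse documentation suggests about half of `max_memory_usage`. For instance, with 128 GB RAM and a per-query limit of around 80 GB, set it to around 40 GB, expressed in bytes.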
3. max_memory_usage
- Purpose: Limits memory usage per query.
- Recommended Value: Configure to around 60-70% of available RAM per query, depending on other workloads.
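The two memory settings above work together, so it can help to set them side by side. The sketch below assumes a 128 GB host and follows the ClickHouse documentation's guidance of keeping the spill threshold at roughly half of `max_memory_usage`. The byte values are illustrative only.

```sql
-- Assumption: a 128 GB host where one heavy query may use about 62% of RAM.
SET max_memory_usage = 80000000000;                    -- ~80 GB per query

-- Roughly half of max_memory_usage, so GROUP BY spills to disk
-- before the query reaches its memory limit.
SET max_bytes_before_external_group_by = 40000000000;  -- ~40 GB
```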
4. max_partitions_per_insert_block
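- Purpose: Limits the number of partitions a single INSERT block may touch. An insert that exceeds the limit fails with an exception.
- Recommended Value: The default is `100`. Raise it only if your partition strategy genuinely requires it, since inserts that spread across many partitions create many small parts and slow ingestion.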
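One way to stay within the limit is a coarse partition key. The sketch below uses a hypothetical table, `events_by_month`, partitioned by month, and shows how the limit could be raised for a session if a batch legitimately spans more partitions. The value 200 is an arbitrary example.

```sql
-- Hypothetical table: monthly partitions keep the partition count per INSERT low.
CREATE TABLE events_by_month
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);

-- Raise the limit for this session only if a batch really spans more partitions.
SET max_partitions_per_insert_block = 200;
```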
5. Use Bulk Insert Mechanism
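- Recommended Technique: Insert rows in large batches rather than one at a time, since each INSERT creates at least one new data part that must later be merged. ClickHouse's native interfaces are well suited to bulk inserts.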
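To illustrate batching, the sketch below sends several rows in one statement instead of issuing one INSERT per row. The table `events_demo` and its columns are hypothetical and exist only for this example.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE events_demo
(
    event_time DateTime,
    user_id    UInt64,
    action     String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- One INSERT carrying many rows creates far fewer parts than one INSERT per row.
INSERT INTO events_demo (event_time, user_id, action) VALUES
    ('2024-01-01 00:00:00', 1, 'login'),
    ('2024-01-01 00:00:01', 2, 'view'),
    ('2024-01-01 00:00:02', 1, 'logout');
```

Real batches would typically hold thousands of rows or more; three rows are shown here only to keep the example short.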
6. Disk I/O Optimization
- Recommended Setup: Use SSDs (preferably NVMe). RAID 10 can be ideal for a balance of speed and redundancy.
7. insert_quorum
- Purpose: Ensures data consistency in replicated setups.
- Recommended Value: For faster inserts where immediate consistency isn't critical, reduce this value or set it to `0` (the default), which disables quorum writes.
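The trade-off can be sketched as two session-level choices. The setting only has an effect on inserts into replicated tables, and the quorum of 2 is an example value.

```sql
-- Fastest option: quorum writes disabled, the insert does not wait for other replicas.
SET insert_quorum = 0;

-- Stricter alternative: the insert succeeds only after two replicas have written the data.
SET insert_quorum = 2;
```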
8. Compression Settings
- Purpose: Optimizes storage and I/O.
- Recommended Codec: LZ4 for faster compression/decompression. Use ZSTD for higher compression ratios at the cost of CPU.
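Codecs can be chosen per column, so a table can mix both approaches. The sketch below uses a hypothetical table, `logs_demo`. The ZSTD level of 3 is an illustrative choice, not a recommendation from the text above.

```sql
-- Hypothetical table mixing a fast codec with a high-ratio codec.
CREATE TABLE logs_demo
(
    ts      DateTime CODEC(LZ4),      -- frequently read: favour decompression speed
    level   String   CODEC(LZ4),
    message String   CODEC(ZSTD(3))   -- bulky text: favour compression ratio, costs more CPU
)
ENGINE = MergeTree
ORDER BY ts;
```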
9. Distributed Data Loading
- Recommended Technique: In a cluster, shard data to distribute inserts across multiple nodes.
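A common way to do this is a Distributed table that routes rows to a local table on each shard. The sketch below assumes a cluster named `my_cluster` is already defined in the server configuration. That name and the table names are hypothetical.

```sql
-- Local table that must exist on every shard (hypothetical schema).
CREATE TABLE events_local
(
    event_time DateTime,
    user_id    UInt64,
    action     String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Distributed table: inserts are spread across shards using rand() as the sharding key.
-- Assumption: 'my_cluster' is defined in the server's remote_servers configuration.
CREATE TABLE events_all AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());

-- Writes go through the Distributed table and land on the shards' local tables.
INSERT INTO events_all (event_time, user_id, action) VALUES
    ('2024-01-01 00:00:00', 1, 'login');
```

A sharding key based on a column such as `user_id` instead of `rand()` keeps related rows on the same shard, at the risk of uneven distribution.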
10. Network Configuration
- Purpose: Enhances data transfer speeds.
- Recommended Settings: Increase `net.core.wmem_max` and `net.core.rmem_max` to `262144` or more. Adjust TCP settings like `net.ipv4.tcp_max_syn_backlog` and `net.ipv4.tcp_fin_timeout`.
Additional Considerations
- Monitor Performance: Regularly review performance metrics and adjust configurations as needed.
- Hardware Utilization: Ensure that your configuration aligns with your hardware specs—avoid settings that may overutilize or underutilize your resources.
- ClickHouse Version: Keep ClickHouse updated for the latest performance improvements.
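For the monitoring point above, one signal worth watching during heavy ingestion is the number of active parts per partition, since many small parts usually mean inserts are too small or too frequent. A minimal sketch against the `system.parts` system table:

```sql
-- Partitions with the most active parts; high counts suggest batches are too small.
SELECT
    database,
    table,
    partition,
    count()   AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;
```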
Conclusion
These recommendations provide a starting point, but optimal settings can vary greatly based on your specific hardware and workload. It’s essential to test and monitor the impact of these settings in your environment and adjust accordingly.
To learn more about data ingestion in ClickHouse, do read the following articles: