Implementing Data Compression in ClickHouse: A Complete Guide to Optimal Performance and Scalability


Introduction

Data compression in ClickHouse is a critical optimization technique that can dramatically improve query performance, reduce storage costs, and enhance overall system scalability. With the exponential growth of data volumes in modern analytics workloads, implementing effective compression strategies has become essential for maintaining optimal database performance.

ClickHouse, designed for OLAP workloads, offers sophisticated compression capabilities that can achieve compression ratios of 10:1 or higher while maintaining sub-second query response times. This comprehensive guide explores advanced compression techniques, implementation strategies, and performance optimization methods to maximize your ClickHouse deployment’s efficiency.

Understanding ClickHouse Compression Architecture

Column-Oriented Storage Benefits

ClickHouse’s columnar storage architecture provides inherent advantages for compression:

  • Data Locality: Similar data types are stored together, improving compression ratios
  • Vectorized Processing: Compressed data can be processed in batches
  • Selective Decompression: Only required columns are decompressed during queries
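
These per-column effects are easy to observe. As a quick check (a minimal sketch; 'your_table' is a placeholder for one of your own tables), system.columns reports the codec and the compressed versus uncompressed size of every column:

-- Inspect per-column codecs and on-disk sizes
SELECT 
    name,
    type,
    compression_codec,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'your_table'
ORDER BY data_uncompressed_bytes DESC;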

Compression Layers

ClickHouse implements compression at multiple levels:

-- Column-level compression codecs specified in the table definition
CREATE TABLE analytics_data (
    timestamp DateTime CODEC(DoubleDelta, LZ4),
    user_id UInt64 CODEC(Delta, ZSTD(3)),
    event_type LowCardinality(String),
    value Float64 CODEC(Gorilla, ZSTD(1))
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
SETTINGS index_granularity = 8192;
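
Column codecs are only one of those levels. ClickHouse also compresses data on the wire between servers and clients; the session settings below illustrate that layer (the values shown are examples to tune, not recommendations):

-- Network-level compression for distributed and client traffic
SET network_compression_method = 'zstd';   -- 'lz4' is the default
SET network_zstd_compression_level = 3;    -- only applies when the method is 'zstd'
SET enable_http_compression = 1;           -- compress HTTP responses when the client requests it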

Compression Algorithms in ClickHouse

1. LZ4 – Speed Optimized

LZ4 provides excellent compression speed with moderate compression ratios:

-- Optimal for frequently accessed data
ALTER TABLE events MODIFY COLUMN message String CODEC(LZ4);

-- Typical performance characteristics (approximate; hardware- and data-dependent)
-- Compression ratio: 2-3x
-- Compression speed: ~400 MB/s
-- Decompression speed: ~1200 MB/s
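
If LZ4's decompression speed fits your workload but the ratio does not, the LZ4HC variant is worth benchmarking: it compresses more slowly in exchange for a better ratio while keeping LZ4-class decompression speed (the level shown is just a common starting point):

-- LZ4HC trades compression speed for a better ratio
ALTER TABLE events MODIFY COLUMN message String CODEC(LZ4HC(9));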

2. ZSTD – Balanced Performance

ZSTD offers configurable compression levels balancing ratio and speed:

-- Different ZSTD levels for various use cases
ALTER TABLE logs MODIFY COLUMN content String CODEC(ZSTD(1));  -- Fast
ALTER TABLE archive MODIFY COLUMN data String CODEC(ZSTD(9));  -- High ratio
ALTER TABLE realtime MODIFY COLUMN payload String CODEC(ZSTD(3)); -- Balanced

-- Approximate compression level comparison (actual figures depend on data and hardware)
-- ZSTD(1): 3-4x ratio, ~200 MB/s compression
-- ZSTD(3): 4-5x ratio, ~100 MB/s compression  
-- ZSTD(9): 6-8x ratio, ~20 MB/s compression

3. Specialized Codecs

Delta Encoding

Delta stores the difference between consecutive values, which suits sorted or slowly changing data; DoubleDelta additionally encodes the change in those differences, making it ideal for timestamps with a near-constant step:

-- Timestamp compression
CREATE TABLE time_series (
    ts DateTime CODEC(DoubleDelta, ZSTD(3)),
    sensor_id UInt32 CODEC(Delta, LZ4),
    value Float64 CODEC(Gorilla, ZSTD(1))
) ENGINE = MergeTree()
ORDER BY ts;
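
A quick way to see the effect of Delta is a throwaway benchmark on a synthetic monotonic sequence (a sketch using hypothetical tables delta_test and plain_test; exact sizes will vary by version and hardware):

-- Compare Delta+LZ4 against plain LZ4 on a strictly increasing sequence
CREATE TABLE delta_test (n UInt64 CODEC(Delta, LZ4)) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE plain_test (n UInt64 CODEC(LZ4)) ENGINE = MergeTree ORDER BY tuple();

INSERT INTO delta_test SELECT number FROM numbers(10000000);
INSERT INTO plain_test SELECT number FROM numbers(10000000);

SELECT 
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed_size
FROM system.parts
WHERE active AND table IN ('delta_test', 'plain_test')
GROUP BY table;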

Gorilla Compression

Gorilla XOR-encodes consecutive values, making it well suited to slowly changing floating-point time series:

-- Financial data compression
CREATE TABLE stock_prices (
    timestamp DateTime,
    symbol String,
    price Float64 CODEC(Gorilla, ZSTD(1)),
    volume UInt64 CODEC(Delta, LZ4)
) ENGINE = MergeTree()
ORDER BY (symbol, timestamp);

Implementation Strategies

1. Data Type Optimization

Choose appropriate data types before applying compression:

-- Inefficient approach
CREATE TABLE user_events (
    user_id String,  -- Should be UInt64
    timestamp String, -- Should be DateTime
    category String   -- Should be LowCardinality(String)
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Optimized approach
CREATE TABLE user_events_optimized (
    user_id UInt64 CODEC(Delta, LZ4),
    timestamp DateTime CODEC(DoubleDelta, ZSTD(3)),
    category LowCardinality(String)
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

2. Compression Strategy Selection

Implement data-driven compression selection:

-- Hot data (frequently accessed)
CREATE TABLE recent_events (
    id UInt64 CODEC(Delta, LZ4),
    data String CODEC(LZ4),
    created_at DateTime CODEC(DoubleDelta, LZ4)
) ENGINE = MergeTree()
ORDER BY created_at
TTL created_at + INTERVAL 7 DAY;

-- Cold data (archival)
CREATE TABLE archived_events (
    id UInt64 CODEC(Delta, ZSTD(9)),
    data String CODEC(ZSTD(9)),
    created_at DateTime CODEC(DoubleDelta, ZSTD(9))
) ENGINE = MergeTree()
ORDER BY created_at;

3. Tiered Compression Architecture

Implement automatic data lifecycle management:

-- Multi-tier table with TTL-based compression
CREATE TABLE metrics_tiered (
    timestamp DateTime,
    metric_name String,
    value Float64,
    tags Map(String, String)
) ENGINE = MergeTree()
ORDER BY (metric_name, timestamp)
TTL 
    timestamp + INTERVAL 1 DAY TO DISK 'hot',
    timestamp + INTERVAL 7 DAY TO DISK 'warm',
    timestamp + INTERVAL 7 DAY RECOMPRESS CODEC(ZSTD(3)),
    timestamp + INTERVAL 30 DAY TO DISK 'cold',
    timestamp + INTERVAL 30 DAY RECOMPRESS CODEC(ZSTD(9)),
    timestamp + INTERVAL 365 DAY DELETE
SETTINGS storage_policy = 'tiered_policy'; -- placeholder name; the policy must define the 'hot', 'warm' and 'cold' disks
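
TO DISK moves only take effect if the table's storage policy actually contains the target disks; the disk and policy names above are placeholders that must match your storage configuration. Before relying on tiered TTLs, confirm what the server knows about:

-- Disks and storage policies visible to the server
SELECT name, path, formatReadableSize(free_space) AS free_space FROM system.disks;

SELECT policy_name, volume_name, disks FROM system.storage_policies;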

Performance Optimization Techniques

1. Compression Benchmarking

Create systematic benchmarking procedures:

-- Benchmark different compression strategies on a copy of the data
CREATE TABLE compression_test
ENGINE = MergeTree() ORDER BY tuple()
AS SELECT * FROM source_table;

-- Test LZ4 (a codec change only affects newly written parts)
ALTER TABLE compression_test MODIFY COLUMN data String CODEC(LZ4);

-- Rewrite existing parts so they are recompressed with the new codec
OPTIMIZE TABLE compression_test FINAL;

-- Measure the resulting compression ratio
SELECT 
    formatReadableSize(sum(data_compressed_bytes)) as compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) as ratio
FROM system.parts 
WHERE active AND table = 'compression_test';

2. Query-Aware Compression

Optimize compression based on query patterns:

-- For analytical queries (GROUP BY heavy)
CREATE TABLE analytics_optimized (
    date Date CODEC(Delta, LZ4),
    dimension1 LowCardinality(String),
    dimension2 LowCardinality(String),
    metric1 Float64 CODEC(Gorilla, ZSTD(1)),
    metric2 UInt64 CODEC(Delta, LZ4)
) ENGINE = MergeTree()
ORDER BY (date, dimension1, dimension2)
SETTINGS index_granularity = 8192;

-- For point queries (WHERE heavy)
CREATE TABLE transactional_optimized (
    id UInt64 CODEC(Delta, LZ4),
    user_id UInt64 CODEC(Delta, LZ4),
    details String CODEC(ZSTD(3)),
    timestamp DateTime CODEC(DoubleDelta, LZ4)
) ENGINE = MergeTree()
ORDER BY id
SETTINGS index_granularity = 1024; -- Smaller granularity for point queries

3. Memory-Conscious Compression

Balance memory usage with compression efficiency:

-- Configure compression settings for memory optimization
SET max_compress_block_size = 65536;  -- Smaller blocks for memory efficiency
SET min_compress_block_size = 4096;   -- Minimum compression block size

-- Compression and recompression happen during inserts and background merges;
-- monitor merge memory consumption via system.merges
SELECT 
    database,
    table,
    formatReadableSize(memory_usage) as merge_memory,
    round(progress * 100, 1) as progress_pct
FROM system.merges;

Monitoring and Troubleshooting

1. Compression Metrics Monitoring

Implement comprehensive monitoring:

-- Monitor compression ratios across tables
CREATE VIEW compression_monitoring AS
SELECT 
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes)) as compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) as compression_ratio,
    round(sum(data_compressed_bytes) * 100.0 / sum(data_uncompressed_bytes), 2) as compression_percentage
FROM system.parts 
WHERE active = 1
GROUP BY database, table
ORDER BY compression_ratio DESC;
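
With the view in place, a scheduled query can flag tables whose compression falls below an expected threshold (the 3:1 cut-off is an arbitrary example; choose one that matches your data):

-- Tables compressing worse than 3:1 deserve a codec review
SELECT database, table, compression_ratio
FROM compression_monitoring
WHERE compression_ratio < 3
ORDER BY compression_ratio ASC;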

2. Performance Impact Analysis

Track compression impact on query performance:

-- Query performance monitoring
SELECT 
    query_kind,
    avg(query_duration_ms) as avg_duration,
    avg(memory_usage) as avg_memory,
    count() as query_count
FROM system.query_log 
WHERE event_date >= today() - 1
  AND type = 'QueryFinish'            -- QueryStart rows report zero duration and would skew the averages
  AND query LIKE '%compressed_table%'
GROUP BY query_kind
ORDER BY avg_duration DESC;

3. Troubleshooting Common Issues

Address typical compression problems:

-- Identify poorly compressed columns
SELECT 
    database,
    table,
    column,
    type,
    compression_codec,
    formatReadableSize(data_compressed_bytes) as compressed,
    formatReadableSize(data_uncompressed_bytes) as uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) as ratio
FROM system.parts_columns 
WHERE active = 1 
  AND data_uncompressed_bytes > 0
  AND (data_uncompressed_bytes / data_compressed_bytes) < 2
ORDER BY ratio ASC
LIMIT 20;
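
Once a poorly compressed column has been identified, the usual remediation is the same ALTER plus OPTIMIZE pattern used in the benchmarking section (sketched here against a hypothetical events.payload column; pick the codec from your own benchmarks):

-- Apply a stronger codec; only newly written parts use it immediately
ALTER TABLE events MODIFY COLUMN payload String CODEC(ZSTD(3));

-- Rewrite existing parts so they are recompressed with the new codec
-- (expensive on large tables; schedule it off-peak)
OPTIMIZE TABLE events FINAL;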

Best Practices and Real-World Examples

1. E-commerce Analytics Platform

Optimize compression for high-volume transactional data:

-- Order events table
CREATE TABLE order_events (
    order_id UInt64 CODEC(Delta, LZ4),
    user_id UInt64 CODEC(Delta, LZ4),
    event_time DateTime CODEC(DoubleDelta, ZSTD(3)),
    event_type LowCardinality(String),
    product_id UInt64 CODEC(Delta, LZ4),
    quantity UInt32 CODEC(Delta, LZ4),
    price Decimal(10,2) CODEC(ZSTD(1)),  -- Gorilla is restricted to Float columns in recent ClickHouse versions
    metadata String CODEC(ZSTD(5))
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
SETTINGS index_granularity = 8192;

-- Achieved results:
-- - 12:1 compression ratio
-- - 40% query performance improvement
-- - 85% storage cost reduction

2. IoT Time Series Data

Handle high-frequency sensor data efficiently:

-- Sensor readings table
CREATE TABLE sensor_data (
    sensor_id UInt32 CODEC(Delta, LZ4),
    timestamp DateTime64(3) CODEC(DoubleDelta, ZSTD(3)),
    temperature Float32 CODEC(Gorilla, LZ4),
    humidity Float32 CODEC(Gorilla, LZ4),
    pressure Float32 CODEC(Gorilla, LZ4),
    location_data String CODEC(ZSTD(7))
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (sensor_id, timestamp)
TTL timestamp + INTERVAL 90 DAY TO DISK 'cold',
    timestamp + INTERVAL 90 DAY RECOMPRESS CODEC(ZSTD(9))
SETTINGS index_granularity = 16384, storage_policy = 'tiered_policy'; -- placeholder policy; it must define the 'cold' disk

3. Log Analytics System

Optimize text-heavy log data compression:

-- Application logs table
CREATE TABLE application_logs (
    timestamp DateTime CODEC(DoubleDelta, LZ4),
    level LowCardinality(String),
    service LowCardinality(String),
    message String CODEC(ZSTD(6)),
    trace_id String CODEC(LZ4),
    span_id String CODEC(LZ4),
    metadata Map(String, String) CODEC(ZSTD(5))
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (service, timestamp)
SETTINGS index_granularity = 4096;

4. Advanced Configuration Tuning

Fine-tune compression parameters:

-- Session-level compression settings (set server-wide defaults in a settings profile)
SET max_compress_block_size = 1048576;      -- 1MB blocks for better compression
SET min_compress_block_size = 65536;        -- 64KB minimum
SET max_block_size = 65536;                 -- rows per processing block

-- Table-specific optimizations
ALTER TABLE high_volume_data 
MODIFY SETTING 
    compress_primary_key = 1,
    compress_marks = 1,
    min_merge_bytes_to_use_direct_io = 10737418240; -- 10GB
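
After changing these values, it is worth reading back what the server is actually using; session settings and MergeTree defaults are exposed in system tables (per-table overrides appear in SHOW CREATE TABLE):

-- Current session values for the block-size settings
SELECT name, value, changed
FROM system.settings
WHERE name IN ('min_compress_block_size', 'max_compress_block_size', 'max_block_size');

-- Server-wide MergeTree defaults
SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('compress_marks', 'compress_primary_key', 'min_merge_bytes_to_use_direct_io');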

Conclusion

Implementing effective data compression in ClickHouse requires a strategic approach that considers data characteristics, query patterns, and performance requirements. Key takeaways include:

  1. Algorithm Selection: Choose compression algorithms based on data access patterns and performance requirements
  2. Layered Approach: Implement tiered compression strategies for different data lifecycle stages
  3. Continuous Monitoring: Regularly assess compression ratios and performance impact
  4. Data-Driven Decisions: Use benchmarking and monitoring to optimize compression strategies

By following these guidelines and implementing the techniques outlined in this guide, organizations can achieve significant improvements in storage efficiency, query performance, and overall system scalability. The combination of ClickHouse’s advanced compression capabilities with proper implementation strategies can result in compression ratios exceeding 10:1 while maintaining sub-second query response times.

Remember that compression optimization is an iterative process. Regular monitoring, testing, and adjustment of compression strategies will ensure continued optimal performance as data volumes and access patterns evolve.


For more advanced ClickHouse optimization techniques and enterprise deployment strategies, consider consulting with database performance specialists or exploring ClickHouse’s extensive documentation and community resources.

 
