Efficient Strategies for Purging Data in ClickHouse: Real-Life Use Cases and Detailed Implementation

Efficiently purging data from ClickHouse is crucial for maintaining performance and managing storage costs, especially when dealing with large production datasets. Here are several strategies in detail, each with an example dataset and use case:

1. Using TTL (Time to Live) for Automatic Data Expiration

Use Case: An e-commerce platform logs user activity data, which needs to be retained for 30 days for analytical purposes.

Dataset Example: User activity logs with event_time and user_action fields.

Implementation:

CREATE TABLE user_activity (
    user_id UInt32,
    event_time DateTime,
    user_action String
)
ENGINE = MergeTree()
ORDER BY event_time
TTL event_time + INTERVAL 30 DAY DELETE;

With this setup, rows older than 30 days are removed automatically. Note that TTL expiration is applied in the background during part merges, so deletion is not instantaneous.
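
TTL rules can also be added to or changed on an existing table; a minimal sketch (the 90-day interval is illustrative):

ALTER TABLE user_activity MODIFY TTL event_time + INTERVAL 90 DAY DELETE;

-- Optionally force TTL re-evaluation on existing parts (triggers merges):
ALTER TABLE user_activity MATERIALIZE TTL;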

2. Partitioning and Dropping Partitions

Use Case: A financial services company stores transaction logs partitioned by month and wants to drop data older than one year.

Dataset Example: Transaction logs with transaction_date and amount fields.

Implementation:

1. Create a Partitioned Table:

CREATE TABLE transactions (
    transaction_id UInt64,
    transaction_date Date,
    amount Float64,
    customer_id UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(transaction_date)
ORDER BY transaction_date;

2. Drop Partitions:

ALTER TABLE transactions DROP PARTITION 202301;  -- Drops January 2023 data
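
Before dropping anything, it helps to see which partitions exist and how much space each holds; a sketch using the standard system.parts metadata table:

SELECT
    partition,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'transactions' AND active
GROUP BY partition
ORDER BY partition;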

3. Using ALTER TABLE DELETE for Specific Conditions

Use Case: A social media platform needs to purge user posts older than a year to comply with data retention policies.

Dataset Example: User posts with post_date and content fields.

Implementation:

ALTER TABLE user_posts DELETE WHERE post_date < now() - INTERVAL 1 YEAR;
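
Keep in mind that ALTER TABLE ... DELETE runs as an asynchronous mutation that rewrites every affected data part, which can be expensive on large tables. A sketch for checking whether the mutation has completed:

SELECT mutation_id, command, is_done
FROM system.mutations
WHERE table = 'user_posts' AND is_done = 0;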

4. Truncating the Table

Use Case: A sensor data collection system that periodically needs to purge all collected data to reset the environment.

Dataset Example: Sensor readings with reading_time and sensor_value fields.

Implementation:

TRUNCATE TABLE sensor_readings;

This command quickly deletes all data from the table.
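
On a clustered deployment, the statement can be run on every replica at once; a sketch assuming a cluster named my_cluster is defined in the server configuration (the name is illustrative):

TRUNCATE TABLE sensor_readings ON CLUSTER my_cluster;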

5. Using Temporary In-Memory Tables

Use Case: Temporary storage of processed data for an ETL pipeline.

Dataset Example: Processed data with process_time and result fields.

Implementation:

CREATE TABLE temp_processed_data (
    process_time DateTime,
    result String
) ENGINE = Memory;

-- To purge:
DROP TABLE temp_processed_data;
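
An alternative sketch: a session-scoped temporary table. ClickHouse backs temporary tables with the Memory engine and drops them automatically when the client session ends, so no explicit purge step is required:

CREATE TEMPORARY TABLE temp_processed_data (
    process_time DateTime,
    result String
);
-- Dropped automatically when the session closes; no DROP TABLE needed.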

6. Optimize Table After Deletion

Use Case: After deleting old logs from a web server log table with a mutation, the table can be optimized to merge the remaining parts and reclaim space.

Dataset Example: Web server logs with log_time and log_message fields.

Implementation:

  1. Delete Old Logs:

    ALTER TABLE web_logs DELETE WHERE log_time < now() - INTERVAL 6 MONTH;
    
  2. Optimize the Table:

    OPTIMIZE TABLE web_logs FINAL;
    

    This forces a merge that rewrites the remaining data parts, reclaiming disk space and improving read performance. Note that OPTIMIZE ... FINAL is resource-intensive on large tables, so schedule it during low-traffic windows. You can verify the reclaimed space with the query shown below.
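
To confirm the effect, a quick check of the table's active on-disk size via the system.parts metadata table:

SELECT formatReadableSize(sum(bytes_on_disk)) AS active_size
FROM system.parts
WHERE table = 'web_logs' AND active;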

7. Managing Retention with Views

Use Case: A healthcare system needs to retain only recent patient visit records for quick access and analysis.

Dataset Example: Patient visits with visit_date and diagnosis fields.

Implementation:

  1. Create Materialized View:

    CREATE MATERIALIZED VIEW recent_patient_visits
    ENGINE = MergeTree()
    ORDER BY visit_date
    POPULATE AS
    SELECT * FROM patient_visits WHERE visit_date >= now() - INTERVAL 1 YEAR;
    
  2. Purge Old Data: the view's WHERE filter is evaluated only for rows as they are inserted (and once at POPULATE time), so old rows are never expired automatically. If needed, drop the view and recreate it (or use a TTL-backed target table, as sketched below):

    DROP VIEW recent_patient_visits;
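
Instead of dropping and recreating the view on a schedule, an alternative sketch is to route the view into an explicit target table that carries its own TTL, so expired rows are removed during merges. Table and column names here are illustrative, and it assumes patient_visits exposes visit_date as a Date and diagnosis as a String:

-- Target table owns the data and the retention rule:
CREATE TABLE recent_patient_visits_data (
    visit_date Date,
    diagnosis String
)
ENGINE = MergeTree()
ORDER BY visit_date
TTL visit_date + INTERVAL 1 YEAR DELETE;

-- The materialized view streams new rows into the target:
CREATE MATERIALIZED VIEW recent_patient_visits_mv
TO recent_patient_visits_data
AS SELECT visit_date, diagnosis FROM patient_visits;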
    

8. Automating Data Purge

Use Case: A telecommunications company needs to automate the purge of call detail records (CDRs) older than 18 months.

Dataset Example: CDRs with call_time and duration fields.

Implementation:

  1. Monitor Partitions: Use a script to monitor and automate partition drops based on conditions:
    SELECT partition, name FROM system.parts WHERE table = 'cdrs' AND active = 1;
    
  2. Schedule Purge Tasks: Use cron jobs to schedule and run purge scripts periodically.

    Example Cron Job:

    0 3 * * * clickhouse-client --query="ALTER TABLE cdrs DROP PARTITION toYYYYMM(now() - INTERVAL 18 MONTH)"
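
Note that the cron job above drops only the single partition dated exactly 18 months back on each run. To enumerate every partition older than the cutoff (a sketch, assuming cdrs is partitioned by toYYYYMM(call_time) so partition IDs compare correctly as fixed-width strings), a query like this can drive the purge script:

SELECT DISTINCT partition
FROM system.parts
WHERE table = 'cdrs'
  AND active
  AND partition < toString(toYYYYMM(now() - INTERVAL 18 MONTH));

-- Feed each returned partition ID into:
-- ALTER TABLE cdrs DROP PARTITION <partition_id>;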
    

Summary

Efficiently purging data from ClickHouse involves several strategies:

  • Using TTL: Automatically deletes data older than a specified duration.
  • Partitioning and Dropping Partitions: Efficiently removes large sets of data.
  • Using ALTER TABLE DELETE: Granular deletions based on specific conditions.
  • Truncating the Table: Quickly purges all data in a table.
  • Using Temporary In-Memory Tables: For transient data that can be dropped quickly.
  • Optimizing the Table: Reclaims space and enhances performance post-deletion.
  • Managing Retention with Views: Retains only necessary data.
  • Automating Data Purge: Ensures regular purging through scheduled tasks.

By implementing these strategies, organizations can maintain optimal performance and manage storage efficiently in ClickHouse, even with large, real-life datasets.
