ClickHouse Storage Engines Explained

Comprehensive Guide to ClickHouse Storage Engines for Optimized Application Performance


ClickHouse Storage Engines Overview

ClickHouse uses storage engines to manage how data is stored, organized, and queried. Each storage engine is optimized for specific use cases, enabling efficient data ingestion, querying, and storage for diverse applications.

1. MergeTree Engine

The MergeTree family of engines is the foundation of ClickHouse. It is designed for high-performance analytical queries on large datasets.

Key Features:

  • Partitioning: Splits data into partitions (e.g., by month) so queries can skip irrelevant data.
  • Primary Key Indexing: Creates sparse indices to accelerate lookups.
  • Merge Process: Periodically merges smaller data parts into larger ones for optimized reads and writes.
  • Columnar Compression: Stores data in a compressed, columnar format.

Variants of MergeTree:

  • MergeTree: Basic storage engine.
  • ReplicatedMergeTree: Adds replication for high availability.
  • AggregatingMergeTree: Pre-aggregates data to optimize analytical queries.
  • ReplacingMergeTree: Keeps only the latest version of rows that share the same sorting key (see the sketch after the basic example below).
  • SummingMergeTree: Sums numeric columns of rows with the same sorting key during merges.
  • CollapsingMergeTree: Collapses pairs of rows that differ only in a sign column (+1/-1), enabling lightweight deletes and updates.
  • VersionedCollapsingMergeTree: Extends CollapsingMergeTree with a version column so rows collapse correctly even when written out of order.
  • GraphiteMergeTree: Rolls up (thins and aggregates) Graphite metrics data.

Use Cases:

  • Large-scale analytical queries on partitioned data.
  • High-throughput systems needing efficient reads/writes.
  • Real-time OLAP analytics.

Example Configuration:

CREATE TABLE logs (
    timestamp DateTime,
    user_id UInt32,
    event_type String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);
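
As a brief illustration of one variant, the sketch below uses ReplacingMergeTree to keep only the latest row per key; the table and the updated_at version column are hypothetical:

CREATE TABLE user_profiles (
    user_id UInt32,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

Deduplication happens during background merges, so queries that need fully deduplicated results typically add the FINAL modifier or aggregate by the key.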

2. Log Engine

A lightweight storage engine optimized for append-only workloads with no indexes.

Key Features:

  • No indices: Simplifies writes but slows down complex queries.
  • Suitable for fast, small-scale inserts.

Use Cases:

  • Temporary or staging tables.
  • Simple logging systems with low querying demands.

Example Configuration:

CREATE TABLE raw_logs (
    event_time DateTime,
    log_message String
) ENGINE = Log;

3. StripeLog Engine

A log-family engine for append-only scenarios that stores all columns in a single file, which keeps file-handle overhead low when writing many small tables.

Use Cases:

  • Intermediate storage.
  • Batch processing pipelines.

Example Configuration:

CREATE TABLE stripe_logs (
    event_time DateTime,
    log_message String
) ENGINE = StripeLog;

4. Memory Engine

Stores data in memory for extremely fast reads and writes. Data is volatile and not persisted.

Key Features:

  • In-memory storage.
  • Useful for caching or temporary results.

Use Cases:

  • Real-time dashboards requiring high-speed data access.
  • Intermediate calculations or transient data storage.

Example Configuration:

CREATE TABLE temp_cache (
    session_id String,
    user_id UInt32,
    data String
) ENGINE = Memory;

5. Merge Engine

Provides a virtual table by merging data from multiple underlying tables.

Key Features:

  • Allows querying across several tables as if they were one.
  • Useful for data consolidation.

Use Cases:

  • Querying many similarly structured tables (for example, per-month or per-tenant tables) as one logical table.

Example Configuration:

CREATE TABLE consolidated_logs ENGINE = Merge('logs_database', '^log_table_[0-9]+$');
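
Queries against the merged table can use the _table virtual column to see which underlying table each row came from; a minimal usage sketch (table names are hypothetical):

SELECT _table, count() AS rows
FROM consolidated_logs
GROUP BY _table;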

6. Distributed Engine

Distributes queries across multiple servers in a ClickHouse cluster.

Key Features:

  • Enables horizontal scalability.
  • Supports sharding; combined with replicated tables, it provides fault tolerance.

Use Cases:

  • Distributed querying across shards in a cluster.
  • Large-scale, parallel data processing.

Example Configuration:

CREATE TABLE distributed_logs AS logs ENGINE = Distributed('cluster_name', 'logs_database', 'logs', user_id);
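
Assuming cluster_name is defined under remote_servers in the server configuration, reads and writes can then be routed through the distributed table; a brief usage sketch:

-- rows are routed to shards by the user_id sharding key
INSERT INTO distributed_logs VALUES (now(), 42, 'login');

-- reads fan out across all shards in parallel
SELECT event_type, count() AS events
FROM distributed_logs
GROUP BY event_type;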

7. File Engines

Enable reading data from, and writing data to, external files and remote endpoints.

Types:

  • File Engine: Reads and writes data from text files (CSV, JSON, etc.).
  • URL Engine: Reads data directly from HTTP(S) endpoints.

Use Cases:

  • Importing/exporting data for ETL pipelines.
  • Querying data from external file systems.

Example Configuration (CSV):

CREATE TABLE csv_logs (
    event_time DateTime,
    log_message String
) ENGINE = File(CSV);

Example Configuration (HTTP):

CREATE TABLE http_logs (
    event_time DateTime,
    log_message String
) ENGINE = URL('http://example.com/logs.json', JSONEachRow);

8. MaterializedView Engine

Allows the creation of materialized views that pre-aggregate or transform data as it is inserted into the source table; the results are stored in a target table with its own engine (for example, AggregatingMergeTree).

Key Features:

  • Stores precomputed results of queries.
  • Improves query performance by avoiding real-time computations.

Use Cases:

  • Pre-aggregated dashboards.
  • Transforming raw data into analytical formats.

Example Configuration:

CREATE MATERIALIZED VIEW mv_logs
ENGINE = AggregatingMergeTree()
ORDER BY user_id
AS
SELECT
    user_id,
    countState() AS event_count,
    minState(timestamp) AS first_event_time,
    maxState(timestamp) AS last_event_time
FROM logs
GROUP BY user_id;
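
Because the view stores aggregate-function states, reads combine them with the corresponding -Merge combinators, for example:

SELECT
    user_id,
    countMerge(event_count) AS event_count,
    minMerge(first_event_time) AS first_event_time,
    maxMerge(last_event_time) AS last_event_time
FROM mv_logs
GROUP BY user_id;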

9. View Engine

Creates virtual tables for reusable query logic. Unlike materialized views, views do not store data.

Key Features:

  • Queries are executed in real-time.

Use Cases:

  • Simplifying complex queries with reusable logic.

Example Configuration:

CREATE VIEW recent_logs AS
SELECT * FROM logs WHERE timestamp >= now() - INTERVAL 1 DAY;

How to Use Storage Engines for Application Performance

1. Understand Workload Characteristics:

  • High-Insert, Low-Query: Use Log or StripeLog.
  • Analytical Queries: Use MergeTree variants with proper partitioning and primary key indexing.
  • In-Memory Processing: Use Memory.

2. Optimize Partitioning and Indexing:

  • Use time-based partitioning for time-series data.
  • Index frequently queried fields via the table's ORDER BY sorting key and, where helpful, data-skipping indices (see the sketch below).
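
As a sketch of both points, the table below combines monthly partitioning, a sorting key on the most frequently filtered columns, and an optional data-skipping index; the table, column, and index names are illustrative:

CREATE TABLE events (
    event_time DateTime,
    user_id UInt32,
    country LowCardinality(String),
    -- optional secondary (data-skipping) index on a frequently filtered column
    INDEX idx_country country TYPE set(100) GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);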

3. Use Pre-Aggregation:

  • Employ AggregatingMergeTree or Materialized Views for dashboards or repeated analytical queries.

4. Scale Out Horizontally:

  • Use Distributed for querying across a cluster of nodes.

5. Integrate with External Data Sources:

  • Use File or URL engines to integrate external data sources.

6. Minimize Overhead:

  • Use Log for staging data before transformation.
  • Avoid complex engines like MergeTree for transient or small datasets.

By selecting the appropriate storage engine and configuring it to match application requirements, ClickHouse delivers exceptional performance, scalability, and efficiency for diverse workloads.

 
