MergeTree Engine – Storage Infrastructure and Indexing

Understanding Storage Infrastructure and Indexing in MergeTree


The heart of the ClickHouse storage infrastructure is the MergeTree family of table engines. These are fully columnar data stores built for performance and resilience: they support custom partitioning, a sparse primary key index, and secondary data-skipping indexes, and they are optimized for inserting very large volumes of data into a table. We often get this question: how does the MergeTree storage engine support high-performance INSERT operations? MergeTree writes incoming data to the table part by part, and the rules for merging those parts are applied asynchronously in the background.
Note: ClickHouse is not an OLTP store, so full ACID compliance is not a design goal. Because data parts are continuously rewritten by background merges after INSERTs, some delay before merges complete is expected.
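The part-by-part behavior is easy to observe. In the sketch below (table and column names are illustrative), each INSERT creates its own data part, which can be inspected in system.parts; OPTIMIZE forces the background merge to run immediately instead of waiting for it.

```sql
-- Hypothetical table: two separate INSERTs create two separate data parts
CREATE TABLE events (event_date Date, user_id UInt64, value Float64)
ENGINE = MergeTree()
ORDER BY (user_id, event_date);

INSERT INTO events VALUES ('2023-01-01', 1, 0.5);
INSERT INTO events VALUES ('2023-01-02', 2, 1.5);

-- Inspect the active parts of the table
SELECT name, rows, active FROM system.parts WHERE table = 'events';

-- Force an immediate merge instead of waiting for the background process
OPTIMIZE TABLE events FINAL;
```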
More reasons for using the MergeTree storage engine:
  • Data is stored sorted by primary key. This makes sort and search queries faster with the help of a small sparse index.
  • ClickHouse supports partitioning: partitions that do not match the query's partition-pruning conditions are automatically skipped.
  • You can build redundant (replicated) and distributed database infrastructure using ReplicatedMergeTree tables.
CREATE TABLE [IF NOT EXISTS] [chdb.]tab_nam [ON CLUSTER chcluster]
(
    colnam1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    colnam2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_colnam1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_colnam2 expr2 TYPE type2(...) GRANULARITY value2,
    ...
    PROJECTION projection_name_1 (SELECT <COLUMN LIST EXPR> [GROUP BY] [ORDER BY]),
    PROJECTION projection_name_2 (SELECT <COLUMN LIST EXPR> [GROUP BY] [ORDER BY])
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr
    [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx' [, ...] ]
    [WHERE conditions]
    [GROUP BY key_expr [SET v1 = aggr_func(v1) [, v2 = aggr_func(v2) ...]] ] ]
[SETTINGS name=value, ...]
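As a concrete instance of the syntax above (the table, column, and index names are illustrative):

```sql
CREATE TABLE IF NOT EXISTS orders
(
    OrderID    UInt64,
    OrderDate  Date,
    CustomerID UInt32,
    Amount     Decimal(18, 2),
    -- data-skipping index on CustomerID, one granule per index block
    INDEX idx_customer CustomerID TYPE minmax GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(OrderDate)
ORDER BY (OrderID, OrderDate)
TTL OrderDate + INTERVAL 2 YEAR DELETE
SETTINGS index_granularity = 8192;
```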
How data is stored in ClickHouse MergeTree Storage Engine
When you INSERT data into a MergeTree table, separate data parts are created, each sorted lexicographically by primary key. Let us explain this with an example: when the primary key is (OrderID, OrderDate), the physical data is sorted by OrderID, and within each OrderID, it is sorted by OrderDate.
In the MergeTree storage engine, data belonging to different partitions is separated into different parts, and ClickHouse merges parts in the background for storage efficiency. ClickHouse never merges parts belonging to different partitions.
Note: The ClickHouse merge mechanism does not guarantee that all rows with the same primary key will end up in the same data part.
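The per-partition layout of parts can be verified from the system.parts table: each part carries its partition ID, and merged parts always stay within a single partition. The table name below is illustrative.

```sql
-- Rows from different months land in different partitions (and hence different parts)
SELECT partition, name, rows
FROM system.parts
WHERE table = 'orders' AND active
ORDER BY partition, name;
```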
ClickHouse stores data parts in Wide or Compact format. We have explained both below:
    • In “Wide” format, each column is stored in a separate file in the filesystem.
    • In “Compact” format, all columns are stored in a single file. We recommend “Compact” format for optimal performance of small and very frequent INSERT operations.
The data storage format is controlled by the min_bytes_for_wide_part and min_rows_for_wide_part settings of the table engine. If the number of bytes or rows in a data part is lower than the corresponding setting's value, the part is stored in “Compact” format; otherwise it is stored in “Wide” format.

Note: if neither of these settings is configured, data parts are stored in Wide format.
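A sketch of keeping small parts in Compact format for a workload of frequent small inserts; the thresholds here are illustrative, not recommendations:

```sql
CREATE TABLE metrics (ts DateTime, metric String, value Float64)
ENGINE = MergeTree()
ORDER BY (metric, ts)
SETTINGS min_bytes_for_wide_part = 10485760,  -- parts under 10 MB stay Compact
         min_rows_for_wide_part  = 100000;    -- parts under 100k rows stay Compact

-- The format of each part can be checked afterwards
SELECT name, part_type FROM system.parts WHERE table = 'metrics' AND active;
```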

In ClickHouse, each data part is logically divided into granules. A granule is the smallest indivisible set of data that ClickHouse reads when querying. The first row of each granule is marked with the primary key value of that row. For each data part, ClickHouse creates an index file that stores these marks; the marks direct your queries to the right locations in the column files. The size of a granule is restricted by the index_granularity and index_granularity_bytes settings of the table engine. A granule can exceed index_granularity_bytes if the size of a single row is greater than the value of the setting; in this case, the size of the granule equals the size of the row. We have explained below the MergeTree table engine settings relevant to performance:
  • index_granularity — Maximum number of data rows between the marks of an index. Default value: 8192.
  • index_granularity_bytes — Maximum size of data granules in bytes. Default value: 10 MB. To restrict the granule size only by the number of rows, set it to 0 (not recommended).
  • min_index_granularity_bytes — Minimum allowed size of data granules in bytes. Default value: 1024 bytes. This provides a safeguard against accidentally creating tables with a very low index_granularity_bytes.
  • enable_mixed_granularity_parts — Enables or disables transitioning to control the granule size with the index_granularity_bytes setting. Before version 19.11, there was only the index_granularity setting for restricting granule size. The index_granularity_bytes setting improves ClickHouse performance when selecting data from tables with big rows (tens and hundreds of megabytes). If you have tables with big rows, you can enable this setting for the tables to improve the efficiency of SELECT queries.
  • use_minimalistic_part_header_in_zookeeper — Storage method of the data parts headers in ZooKeeper. If use_minimalistic_part_header_in_zookeeper=1, then ZooKeeper stores less data. For more information, see the setting description in “Server configuration parameters”.
  • min_merge_bytes_to_use_direct_io — The minimum data volume for merge operation that is required for using direct I/O access to the storage disk. When merging data parts, ClickHouse calculates the total storage volume of all the data to be merged. If the volume exceeds min_merge_bytes_to_use_direct_io bytes, ClickHouse reads and writes the data to the storage disk using the direct I/O interface (O_DIRECT option). If min_merge_bytes_to_use_direct_io = 0, then direct I/O is disabled. Default value: 10 * 1024 * 1024 * 1024 bytes.
  • merge_with_ttl_timeout — Minimum delay in seconds before repeating a merge with delete TTL. Default value: 14400 seconds (4 hours).
  • merge_with_recompression_ttl_timeout — Minimum delay in seconds before repeating a merge with recompression TTL. Default value: 14400 seconds (4 hours).
  • try_fetch_recompressed_part_timeout — Timeout (in seconds) before starting a merge with recompression. During this time, ClickHouse tries to fetch the recompressed part from the replica that was assigned this merge with recompression. Default value: 7200 seconds (2 hours).
  • write_final_mark — Enables or disables writing the final index mark at the end of the data part (after the last byte). Default value: 1. Don’t turn it off.
  • merge_max_block_size — Maximum number of rows in the block for merge operations. Default value: 8192.
  • storage_policy — Storage policy. See Using Multiple Block Devices for Data Storage.
  • min_bytes_for_wide_part, min_rows_for_wide_part — Minimum number of bytes/rows in a data part that can be stored in Wide format. You can set one, both or none of these settings.
  • max_parts_in_total — Maximum number of parts in all partitions.
  • max_compress_block_size — Maximum size of blocks of uncompressed data before compressing for writing to a table. You can also specify this setting in the global settings (see max_compress_block_size setting). The value specified when table is created overrides the global value for this setting.
  • min_compress_block_size — Minimum size of blocks of uncompressed data required for compression when writing the next mark. You can also specify this setting in the global settings. The value specified when the table is created overrides the global value for this setting.
  • max_partitions_to_read — Limits the maximum number of partitions that can be accessed in one query. You can also specify setting max_partitions_to_read in the global setting.
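Several of the settings above can be combined in the SETTINGS clause at table-creation time; the table name and values below are illustrative, not recommendations:

```sql
CREATE TABLE big_rows (id UInt64, payload String)
ENGINE = MergeTree()
ORDER BY id
SETTINGS index_granularity       = 8192,      -- at most 8192 rows per granule
         index_granularity_bytes = 10485760,  -- cap granule size at 10 MB
         max_parts_in_total      = 10000,     -- safety limit on total part count
         merge_max_block_size    = 8192;
```

Where a setting is modifiable, an existing table can be adjusted with ALTER TABLE big_rows MODIFY SETTING max_parts_in_total = 20000.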

How does ClickHouse use indexes and partitions in queries?

In ClickHouse, indexes are used when the WHERE/PREWHERE clause contains an expression (as one of the conjunction elements, or entirely) representing an equality or inequality comparison, or an IN or LIKE with a fixed prefix, on columns or expressions that are in the primary key or partitioning key, on certain partially monotonic functions of these columns, or on logical combinations of these expressions. ClickHouse falls back to a full scan when the primary key values in the query's parameter range do not represent a monotonic sequence.
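Whether the primary key and partition key are actually used for a given query can be checked with EXPLAIN; the example assumes a table partitioned by toYYYYMM(OrderDate) with ORDER BY (OrderID, OrderDate), as in the earlier examples.

```sql
-- Shows which partitions and granules are selected versus scanned in full
EXPLAIN indexes = 1
SELECT count()
FROM orders
WHERE OrderID = 42 AND OrderDate >= '2023-01-01';
```

A condition such as WHERE OrderDate = today() alone cannot use the primary index efficiently, because OrderDate is the second column of the sorting key; it can, however, still benefit from partition pruning.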
About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.