ClickHouse MergeTree: Introduction to ReplacingMergeTree

Introduction

Multiple heavy-weight table engines and functionalities of ClickHouse are built on top of the MergeTree engine. The MergeTree engine supports PRIMARY KEY expression, but it is not the same as the primary keys in relational databases. The data is sorted and stored on the disk based on the primary keys, but the duplicates are not removed and stored as it is, in the case of the MergeTree engine. The Duplicate removal is again a tedious and manual process in this case (by using OPTIMIZE .. FINAL … DEDUPLICATE statement). ReplacingMergeTree table engine comes in handy in such scenarios. ReplacingMergeTree engine removes the duplicates based on the ORDER BY expression used while creating the table. There is an optional setting to manually specify the column with the version information of the rows. If the version column is specified, the row with the highest version is retained.

Table Creation

Syntax

CREATE TABLE [database_name].table_name
(
 column_name1 [data_type1],
...
)
ENGINE = ReplacingMergeTree([optional_version_column])
[ORDER BY expr]
[PRIMARY KEY expr]
[SETTINGS name=value, ]

Create a database and a ReplacingMergeTree table in it.

CREATE DATABASE mergetree_testing;

CREATE TABLE mergetree_testing.replacing_mergetree
(
    `ID` UInt64
)
ENGINE = ReplacingMergeTree
PRIMARY KEY ID;

Insert some data in the newly created table.

INSERT INTO mergetree_testing.replacing_mergetree VALUES (1), (1), (2), (2);

Run the OPTIMIZE FINAL statement to ensure the background merges are completed on this table.

OPTIMIZE TABLE mergetree_testing.replacing_mergetree FINAL;

Run a SELECT statement to verify if the duplicates are removed.

SELECT *
FROM mergetree_testing.replacing_mergetree

Query id: cb574ca3-d342-4ff8-a6c3-0425c996d7d1

┌─ID─┐
│  1 │
│  2 │
└────┘

2 rows in set. Elapsed: 0.003 sec. 

Version Column

Let’s explore the usage of the version column in the VersionedCollapsingMergeTree table engine. We’ll create a new table with a version column.

CREATE TABLE mergetree_testing.replacing_mergetree_versioned
( `ID` UInt64,
 `version` Int64) 
ENGINE = ReplacingMergeTree(version)
PRIMARY KEY ID;

Insert some data into it with version info.

INSERT INTO mergetree_testing.replacing_mergetree_versioned VALUES (1,2), (2,2);

Read the inserted data.

SELECT *
FROM mergetree_testing.replacing_mergetree_versioned

Query id: 73ceea55-f10d-4f5c-afb4-a8be0747c99c

Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 22.7.1 revision 54457.

┌─ID─┬─version─┐
│  1 │       2 │
│  2 │       2 │
└────┴─────────┘

2 rows in set. Elapsed: 0.003 sec. 

Insert some more data with a different version.

INSERT INTO mergetree_testing.replacing_mergetree_versioned VALUES (1,1), (2,1);

Run a SELECT query after inserting the data.

SELECT *
FROM mergetree_testing.replacing_mergetree_versioned

Query id: 9d0e47bc-2ea3-4d9e-b796-94ef81d5b9c3

┌─ID─┬─version─┐
│  1 │       2 │
│  2 │       2 │
└────┴─────────┘
┌─ID─┬─version─┐
│  1 │       1 │
│  2 │       1 │
└────┴─────────┘

4 rows in set. Elapsed: 0.003 sec. 

Read the final result after running an OPTIMIZE FINAL statement on the table.

OPTIMIZE TABLE mergetree_testing.replacing_mergetree_versioned FINAL;

SELECT *
FROM mergetree_testing.replacing_mergetree_versioned

Query id: 25962eca-0bb2-4144-843d-508a012e662a

┌─ID─┬─version─┐
│  1 │       2 │
│  2 │       2 │
└────┴─────────┘

2 rows in set. Elapsed: 0.003 sec. 

Conclusion

In this article, we have seen the applications of ReplacingMergeTree table engine. Specifically, we have covered

  • Table creation
  • Specifying the optional version column based on which the deduplication happens

To read more about MergeTree in ClickHouse, do give the following articles a read

References