Introduction
Multiple heavy-weight table engines and functionalities of ClickHouse are built on top of the MergeTree engine. The MergeTree engine supports PRIMARY KEY expression, but it is not the same as the primary keys in relational databases. The data is sorted and stored on the disk based on the primary keys, but the duplicates are not removed and stored as it is, in the case of the MergeTree engine. The Duplicate removal is again a tedious and manual process in this case (by using OPTIMIZE .. FINAL … DEDUPLICATE statement). ReplacingMergeTree table engine comes in handy in such scenarios. ReplacingMergeTree engine removes the duplicates based on the ORDER BY expression used while creating the table. There is an optional setting to manually specify the column with the version information of the rows. If the version column is specified, the row with the highest version is retained.
Table Creation
Syntax
CREATE TABLE [database_name].table_name ( column_name1 [data_type1], ... ) ENGINE = ReplacingMergeTree([optional_version_column]) [ORDER BY expr] [PRIMARY KEY expr] [SETTINGS name=value, ]
Create a database and a ReplacingMergeTree table in it.
CREATE DATABASE mergetree_testing; CREATE TABLE mergetree_testing.replacing_mergetree ( `ID` UInt64 ) ENGINE = ReplacingMergeTree PRIMARY KEY ID;
Insert some data in the newly created table.
INSERT INTO mergetree_testing.replacing_mergetree VALUES (1), (1), (2), (2);
Run the OPTIMIZE FINAL statement to ensure the background merges are completed on this table.
OPTIMIZE TABLE mergetree_testing.replacing_mergetree FINAL;
Run a SELECT statement to verify if the duplicates are removed.
SELECT * FROM mergetree_testing.replacing_mergetree Query id: cb574ca3-d342-4ff8-a6c3-0425c996d7d1 ┌─ID─┐ │ 1 │ │ 2 │ └────┘ 2 rows in set. Elapsed: 0.003 sec.
Version Column
Let’s explore the usage of the version column in the VersionedCollapsingMergeTree table engine. We’ll create a new table with a version column.
CREATE TABLE mergetree_testing.replacing_mergetree_versioned ( `ID` UInt64, `version` Int64) ENGINE = ReplacingMergeTree(version) PRIMARY KEY ID;
Insert some data into it with version info.
INSERT INTO mergetree_testing.replacing_mergetree_versioned VALUES (1,2), (2,2);
Read the inserted data.
SELECT * FROM mergetree_testing.replacing_mergetree_versioned Query id: 73ceea55-f10d-4f5c-afb4-a8be0747c99c Connecting to localhost:9000 as user default. Connected to ClickHouse server version 22.7.1 revision 54457. ┌─ID─┬─version─┐ │ 1 │ 2 │ │ 2 │ 2 │ └────┴─────────┘ 2 rows in set. Elapsed: 0.003 sec.
Insert some more data with a different version.
INSERT INTO mergetree_testing.replacing_mergetree_versioned VALUES (1,1), (2,1);
Run a SELECT query after inserting the data.
SELECT * FROM mergetree_testing.replacing_mergetree_versioned Query id: 9d0e47bc-2ea3-4d9e-b796-94ef81d5b9c3 ┌─ID─┬─version─┐ │ 1 │ 2 │ │ 2 │ 2 │ └────┴─────────┘ ┌─ID─┬─version─┐ │ 1 │ 1 │ │ 2 │ 1 │ └────┴─────────┘ 4 rows in set. Elapsed: 0.003 sec.
Read the final result after running an OPTIMIZE FINAL statement on the table.
OPTIMIZE TABLE mergetree_testing.replacing_mergetree_versioned FINAL;
SELECT * FROM mergetree_testing.replacing_mergetree_versioned Query id: 25962eca-0bb2-4144-843d-508a012e662a ┌─ID─┬─version─┐ │ 1 │ 2 │ │ 2 │ 2 │ └────┴─────────┘ 2 rows in set. Elapsed: 0.003 sec.
Conclusion
In this article, we have seen the applications of ReplacingMergeTree table engine. Specifically, we have covered
- Table creation
- Specifying the optional version column based on which the deduplication happens
To read more about MergeTree in ClickHouse, do give the following articles a read
- ClickHouse MergeTree: Deletes and Updates with CollapsingMergeTree
- ClickHouse MergeTree: Introduction to VersionedCollapsingMergeTree
- ClickHouse MergeTree: Overview of ClickHouse Storage Engines
- ClickHouse MergeTree – Configuring Storage Infrastructure & Indexes for Performance
References