Image Courtesy – Pexels – Vlad Chetan
The data is stored in multiple parts in the MergeTree family of engines, and the data parts are merged asynchronously in the background. Refer to our article on MergeTree storage and infrastructure for more details. The merging logic for the data parts is governed by the table engine. For example, multiple rows with the same sorting key are merged into a single row in ReplacingMergeTree.
ClickHouse uses a complex algorithm for merging the data parts, and the merges are largely non-deterministic: we cannot predict when (or whether) a given set of parts will be merged. In certain cases, we may have to trigger a merge manually. The OPTIMIZE statement in ClickHouse can force a merge to happen immediately.
Syntax
OPTIMIZE TABLE [database.]table_name [PARTITION partition | PARTITION ID 'partition_id'] [FINAL] [DEDUPLICATE [BY deduplication_expression]]
Example with ReplacingMergeTree
Let’s create a table based on the ReplacingMergeTree engine and manually stop the background merges (using SYSTEM STOP MERGES). Then, we will insert 100 million rows and examine the unmerged parts.
CREATE DATABASE testing;

CREATE TABLE testing.replacing_mergetree
(
    ID UInt16,
    PART UInt16
)
ENGINE = ReplacingMergeTree
PRIMARY KEY ID
PARTITION BY PART;
Execute the following SQL statement. We are inserting 100 million randomly generated rows into the table, with random integers between 0 and 49 in the ID and PART columns.
INSERT INTO testing.replacing_mergetree
SELECT ID % 50, PART % 50
FROM generateRandom('ID UInt16, PART UInt16')
LIMIT 100000000

Query id: 5a2f92c8-7e29-49ff-ab11-5f84f331c5ba

Ok.

0 rows in set. Elapsed: 5.386 sec. Processed 100.66 million rows, 402.64 MB (18.69 million rows/s., 74.76 MB/s.)
We can find the active data parts in the table using the parts table in the system database.
SELECT active, count(active) AS part_count
FROM system.parts
WHERE (table LIKE '%replacing%') AND (database LIKE '%testing%') AND (active = 1)
GROUP BY active

Query id: d68643d8-193a-447d-ac12-5ef6db71d919

┌─active─┬─part_count─┐
│      1 │        200 │
└────────┴────────────┘

1 row in set. Elapsed: 0.005 sec.
Let’s force the merge operation using the OPTIMIZE statement.
OPTIMIZE TABLE testing.replacing_mergetree

Query id: 065ea0ba-5cb0-4732-89e2-5d9b2505dfe5

Ok.

0 rows in set. Elapsed: 0.002 sec.
Query the number of data parts in the table after executing the OPTIMIZE statement.
SELECT active, count(active) AS part_count
FROM system.parts
WHERE (table LIKE '%replacing%') AND (database LIKE '%testing%') AND (active = 1)
GROUP BY active

Query id: 95678a31-9b84-4d7b-82d8-874119ec6046

┌─active─┬─part_count─┐
│      1 │         50 │
└────────┴────────────┘

1 row in set. Elapsed: 0.006 sec.
We have 50 unique values in the PART column (random numbers in the range 0-49), and since the table is partitioned by PART, the merge leaves 50 parts, one per partition.
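To double-check, we can group the active parts by partition; each partition should now contain a single part (a sketch, assuming the table and database names from above; the output will vary):

SELECT partition, count() AS part_count
FROM system.parts
WHERE (table = 'replacing_mergetree') AND (database = 'testing') AND (active = 1)
GROUP BY partition
ORDER BY partition ASC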
Other options
Optimize based on a specific partition
Delete all the records from the table and re-insert the data.
ALTER TABLE testing.replacing_mergetree DELETE WHERE ID > -1;

INSERT INTO testing.replacing_mergetree
SELECT ID % 50, PART % 50
FROM generateRandom('ID UInt16, PART UInt16')
LIMIT 100000000;
Check the data parts.
SELECT active, count(active) AS part_count
FROM system.parts
WHERE (table LIKE '%replacing%') AND (database LIKE '%testing%') AND (active = 1)
GROUP BY active

Query id: 20c3d226-fba2-4119-bde9-be2e89898c58

┌─active─┬─part_count─┐
│      1 │        200 │
└────────┴────────────┘

1 row in set. Elapsed: 0.005 sec.
Run the OPTIMIZE statement on a specific partition and verify the data parts again.
OPTIMIZE TABLE testing.replacing_mergetree PARTITION 1

Query id: 02207704-fbcb-4ff0-bfe7-a637c5b59277

Ok.

0 rows in set. Elapsed: 0.004 sec.
SELECT active, count(active) AS part_count
FROM system.parts
WHERE (table LIKE '%replacing%') AND (database LIKE '%testing%') AND (active = 1)
GROUP BY active

Query id: 6c1adcef-1f71-495c-9f8f-5f3799d7ddd6

┌─active─┬─part_count─┐
│      1 │        197 │
└────────┴────────────┘

1 row in set. Elapsed: 0.005 sec.
This forces the merge only for the partition specified in the statement (the partition with value 1 in this case), which is why the part count drops only slightly, from 200 to 197.
OPTIMIZE .. FINAL
The FINAL modifier forces the merge even if all the data is already merged into a single part, irrespective of any merges already in progress or the number of data parts in the table.
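For instance, re-running the merge on the table above with FINAL forces a full rewrite even though the parts are already merged (a sketch; output elided):

OPTIMIZE TABLE testing.replacing_mergetree FINAL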
OPTIMIZE .. FINAL .. DEDUPLICATE
The OPTIMIZE statement with the DEDUPLICATE clause removes duplicate rows from the table. If columns are specified with BY, duplicates are eliminated based on those columns; otherwise, only completely identical rows are deduplicated.
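Both forms look like this (t is a placeholder table name):

OPTIMIZE TABLE t DEDUPLICATE;        -- removes completely identical rows only
OPTIMIZE TABLE t DEDUPLICATE BY ID;  -- keeps one row per distinct ID value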
Let us create a vanilla MergeTree table to illustrate this.
CREATE TABLE testing.vanilla_mergetree
(
    `ID` UInt16,
    `PART` UInt16
)
ENGINE = MergeTree
PRIMARY KEY ID

Query id: da369e84-9740-401d-9e86-73f846891b45

Ok.

0 rows in set. Elapsed: 0.067 sec.
Insert randomly generated data, with values ranging from 0 to 9 in both columns.
INSERT INTO testing.vanilla_mergetree
SELECT ID % 10, PART % 10
FROM generateRandom('ID UInt16, PART UInt16')
LIMIT 100000000

Query id: 4403a800-abf1-4b31-8916-ca09be249b8f

Ok.

0 rows in set. Elapsed: 1.657 sec. Processed 100.66 million rows, 402.64 MB (60.75 million rows/s., 243.00 MB/s.)
Count the number of rows in the table. Since it is a vanilla MergeTree, the rows will not be deduplicated based on the primary key as they would be in ReplacingMergeTree.
SELECT COUNT()
FROM testing.vanilla_mergetree

Query id: 2b3048b3-1437-4ea4-b3d5-71db55fcacca

┌───count()─┐
│ 100000000 │
└───────────┘

1 row in set. Elapsed: 0.003 sec.
Run the OPTIMIZE statement with the DEDUPLICATE clause.
OPTIMIZE TABLE testing.vanilla_mergetree DEDUPLICATE BY ID

Query id: 099882ad-0731-48e3-91b1-b2a47fa2d211

Ok.

0 rows in set. Elapsed: 0.297 sec.
Wait for a couple of minutes (or more, depending on the hardware) for the merge to complete, then verify the number of rows in the table after removing duplicates.
SELECT COUNT()
FROM testing.vanilla_mergetree

Query id: 7f3b2093-735e-405a-aa7e-c815cc8de02f

┌─count()─┐
│      10 │
└─────────┘

1 row in set. Elapsed: 0.003 sec.
SELECT *
FROM testing.vanilla_mergetree

Query id: 59e8a783-d1c8-43b9-b1f5-7880f1474185

┌─ID─┬─PART─┐
│  0 │    8 │
│  1 │    7 │
│  2 │    4 │
│  3 │    1 │
│  4 │    8 │
│  5 │    5 │
│  6 │    6 │
│  7 │    0 │
│  8 │    8 │
│  9 │    5 │
└────┴──────┘

10 rows in set. Elapsed: 0.003 sec.
Summary
We have seen how the OPTIMIZE statement forces merges on demand in ClickHouse, along with its PARTITION, FINAL, and DEDUPLICATE variants.