OPTIMIZE statement

Image Courtesy – Pexels – Vlad Chetan

 The data is stored in multiple parts in MergeTree family of engines and data parts are merged asynchronously in the background. Refer to our article on Mergetree storage and infrastructure for more details. The merging logic for the data parts is governed by the table engine. For example, the multiple rows with the same sorting key are merged into a single row in ReplacingMergeTree.

ClickHouse uses a complex algorithm for merging the data parts and the merges are almost indeterministic. In certain cases, we may have to perform a manual merge. The OPTIMIZE statement in ClickHouse can force a merge to happen immediately.

Syntax

OPTIMIZE TABLE [database.]table_name [PARTITION partition | PARTITION ID 'partition_id'] [FINAL] [DEDUPLICATE [BY deduplication_expression]]

 

Example with ReplacingMergeTree

Let’s create a table based on ReplacingMergeTree engine and manually stop the background merges. Then, we will insert a few million rows sequentially and examine the unmerged parts.

CREATE DATABASE testing;

CREATE TABLE testing.replacing_mergetree
(
    ID UInt16,
    PART UInt16
)
ENGINE = ReplacingMergeTree
PRIMARY KEY ID
PARTITION BY PART;

 

Execute the following SQL statement.  We are inserting 100 million rows of data (randomly generated) into the table. We are generating random integers between 0 and 50 for the ID and PART columns.

 

INSERT INTO testing.replacing_mergetree SELECT
    ID % 50,
    PART % 50
FROM generateRandom('ID UInt16, PART UInt16')
LIMIT 100000000

Query id: 5a2f92c8-7e29-49ff-ab11-5f84f331c5ba

Ok.

0 rows in set. Elapsed: 5.386 sec. Processed 100.66 million rows, 402.64 MB (18.69 million rows/s., 74.76 MB/s.)

We can find the active data parts in the table using the parts table in the system database.

SELECT
    active,
    count(active) AS part_count
FROM system.parts
WHERE (table LIKE '%replacing%') AND (database LIKE '%testing%') AND (active = 1)
GROUP BY active

Query id: d68643d8-193a-447d-ac12-5ef6db71d919

┌─active─┬─part_count─┐
│      1 │        200 │
└────────┴────────────┘

1 row in set. Elapsed: 0.005 sec. 

 

Let’s force the merge operation using the OPTIMIZE statement.

 

OPTIMIZE TABLE testing.replacing_mergetree

Query id: 065ea0ba-5cb0-4732-89e2-5d9b2505dfe5

Ok.

0 rows in set. Elapsed: 0.002 sec. 

Query the number of data parts in the table after executing the  OPTIMIZE statement.

SELECT
    active,
    count(active) AS part_count
FROM system.parts
WHERE (table LIKE '%replacing%') AND (database LIKE '%testing%') AND (active = 1)
GROUP BY active

Query id: 95678a31-9b84-4d7b-82d8-874119ec6046

┌─active─┬─part_count─┐
│      1 │         50 │
└────────┴────────────┘

1 row in set. Elapsed: 0.006 sec. 

We have 50 unique values in the partition (random numbers generated from 0-49) and hence we have 50 parts.

Other options

Optimize based on a specific partition

Delete all the records from the table and re-insert the data.

ALTER TABLE testing.replacing_mergetree DELETE WHERE ID >-1;

INSERT INTO testing.replacing_mergetree  
SELECT ID % 50, PART % 50 FROM generateRandom('ID UInt16, PART UInt16') LIMIT 100000000;

Check the data parts.

SELECT
    active,
    count(active) AS part_count
FROM system.parts
WHERE (table LIKE '%replacing%') AND (database LIKE '%testing%') AND (active = 1)
GROUP BY active

Query id: 20c3d226-fba2-4119-bde9-be2e89898c58

┌─active─┬─part_count─┐
│      1 │        200 │
└────────┴────────────┘

1 row in set. Elapsed: 0.005 sec. 

Run the optimize statement based on a partition and verify the data parts again.

OPTIMIZE TABLE testing.replacing_mergetree PARTITION 1

Query id: 02207704-fbcb-4ff0-bfe7-a637c5b59277

Ok.

0 rows in set. Elapsed: 0.004 sec. 

 

SELECT
    active,
    count(active) AS part_count
FROM system.parts
WHERE (table LIKE '%replacing%') AND (database LIKE '%testing%') AND (active = 1)
GROUP BY active

Query id: 6c1adcef-1f71-495c-9f8f-5f3799d7ddd6

┌─active─┬─part_count─┐
│      1 │        197 │
└────────┴────────────┘

1 row in set. Elapsed: 0.005 sec.

This ensures that we force the merge specific to a partition specified in the SQL statement (Partition with value 1 in this case).

 

OPTIMIZE .. FINAL

This ensures the merges are forced irrespective of the existing data part merges or the number of data parts in the table (or even if the data is already merged).

 

OPTIMIZE .. FINAL .. DEDUPLICATE

Optimize statement with DEDUPLICATE clause ensures duplicates are removed from the table. If the columns are specified, the duplicates are eliminated based on these columns, or else the completely identical rows in the table are deduplicated.

Let us create a vanilla MergeTree table to illustrate this.

CREATE TABLE testing.vanilla_mergetree
(
    `ID` UInt16,
    `PART` UInt16
)
ENGINE = MergeTree
PRIMARY KEY ID

Query id: da369e84-9740-401d-9e86-73f846891b45

Ok.

0 rows in set. Elapsed: 0.067 sec. 

Insert randomly generated data ranging from 0-9 in the columns.

INSERT INTO testing.vanilla_mergetree SELECT
    ID % 10,
    PART % 10
FROM generateRandom('ID UInt16, PART UInt16')
LIMIT 100000000

Query id: 4403a800-abf1-4b31-8916-ca09be249b8f

Ok.

0 rows in set. Elapsed: 1.657 sec. Processed 100.66 million rows, 402.64 MB (60.75 million rows/s., 243.00 MB/s.)

Count the number of rows in the table. Since it is a vanilla MergeTree, the rows will not be deduplicated based on the primary key like in ReplacingMergetree.

SELECT COUNT()
FROM testing.vanilla_mergetree

Query id: 2b3048b3-1437-4ea4-b3d5-71db55fcacca

┌───count()─┐
│ 100000000 │
└───────────┘

1 row in set. Elapsed: 0.003 sec. 

Run the optimize statement with DEDUPLICATE clause.

OPTIMIZE TABLE testing.vanilla_mergetree DEDUPLICATE BY ID

Query id: 099882ad-0731-48e3-91b1-b2a47fa2d211

Ok.

0 rows in set. Elapsed: 0.297 sec. 

Verify the number of rows in the table after removing duplicates. Wait for a couple of minutes (or more depending on the hardware).

SELECT COUNT()
FROM testing.vanilla_mergetree

Query id: 7f3b2093-735e-405a-aa7e-c815cc8de02f

┌─count()─┐
│      10 │
└─────────┘

1 row in set. Elapsed: 0.003 sec. 
SELECT *
FROM testing.vanilla_mergetree

Query id: 59e8a783-d1c8-43b9-b1f5-7880f1474185

┌─ID─┬─PART─┐
│  0 │    8 │
│  1 │    7 │
│  2 │    4 │
│  3 │    1 │
│  4 │    8 │
│  5 │    5 │
│  6 │    6 │
│  7 │    0 │
│  8 │    8 │
│  9 │    5 │
└────┴──────┘

10 rows in set. Elapsed: 0.003 sec. 

 

Summary

We have seen the usage of OPTIMIZE statement in ClickHouse. We have also seen the variants and options in the OPTIMIZE statement.

Was this article helpful?

Related Articles

CHISTADATA IS COMMITTED TO OPEN SOURCE SOFTWARE AND BUILDING HIGH PERFORMANCE COLUMNSTORES

In the spirit of freedom, independence and innovation. ChistaDATA Corporation is not affiliated with ClickHouse Corporation 

Need Support?

Can't find the answer you're looking for?
Contact Support

ChistaDATA Inc. Knowledge base is licensed under the Apache License, Version 2.0 (the “License”)

Copyright 2022 ChistaDATA Inc

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.