ClickHouse Deduplication Operation

With the help of ClickHouse functions, you can rapidly determine how many duplicate rows are present in a table and then eliminate them, leaving just the unique rows.

Let’s create a table:

CREATE TABLE default.exp_test
(
    `id` UInt32,
    `address` UInt32,
    `value` UInt32
)
ENGINE = MergeTree
ORDER BY tuple()

Let’s insert 10 million random data:

INSERT INTO exp_test SELECT rand32() / 1000, rand32() / 100000, rand32() / 10000 FROM numbers(10000000)

Let’s quickly review the data:

SELECT * FROM exp_test LIMIT 5

┌───id─┬─address─┬──value─┐
│ 1260 │   12608 │ 126083 │
│ 1438 │   14389 │ 143897 │
│ 3386 │   33866 │ 338666 │
│ 1435 │   14359 │ 143593 │
│ 2327 │   23272 │ 232727 │
└──────┴─────────┴────────┘

Figure out the duplicate data with uniq() function approximately. Duplicate data is easily estimated in this manner.

SELECT uniq(*) FROM exp_test

┌─uniq(id, address, value)─┐
│                  3866551 │
└──────────────────────────┘

However, we can use the DISTINCT function to get a more precise result:

SELECT countDistinct(id, address, value) FROM exp_test

┌─uniqExact(id, address, value)─┐
│                       3875983 │
└───────────────────────────────┘

The total duplicate rate can be compared to all the data:

SELECT countDistinct(id, address, value) / count(*) FROM exp_test

┌─divide(uniq(id, address, value), count())─┐
│                                 0.3875983 │
└───────────────────────────────────────────┘

So %0,39 data is duplicated. How can we deduplicate these duplicates?

OPTIMIZE TABLE exp_test FINAL DEDUPLICATE;
or
OPTIMIZE TABLE exp_test FINAL DEDUPLICATE BY id, address, value;

Only distinct rows will remain in the table as a result:

SELECT countDistinct(id, address, value) / count(*) FROM exp_test

┌─divide(uniqExact(id, address, value), count())─┐
│                                              1 │
└────────────────────────────────────────────────┘

If you choose to utilize the uniq function over DISTINCT, uniq can be any number of parameters passed to the function. Tuple, Array, Date, DateTime, String, and numeric type parameters are all acceptable.

Was this article helpful?

Yes No

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services

ClickHouse Deduplication Operation

CHISTADATA IS COMMITTED TO OPEN SOURCE SOFTWARE AND BUILDING HIGH PERFORMANCE COLUMNSTORES

Need Support?

ChistaDATA Inc. Knowledge base is licensed under the Apache License, Version 2.0 (the “License”)

ClickHouse Deduplication Operation

Related Articles

CHISTADATA IS COMMITTED TO OPEN SOURCE SOFTWARE AND BUILDING HIGH PERFORMANCE COLUMNSTORES

Need Support?

ChistaDATA Inc. Knowledge base is licensed under the Apache License, Version 2.0 (the “License”)