1. Home
  2. Knowledge Base
  3. ClickHouse
  4. ClickHouse Deduplication Operation
  1. Home
  2. Knowledge Base
  3. ClickHouse Performance
  4. ClickHouse Deduplication Operation

ClickHouse Deduplication Operation

With the help of ClickHouse functions, you can rapidly determine how many duplicate rows are present in a table and then eliminate them, leaving just the unique rows.

Let’s create a table:

CREATE TABLE default.exp_test
(
    `id` UInt32,
    `address` UInt32,
    `value` UInt32
)
ENGINE = MergeTree
ORDER BY tuple()

Let’s insert 10 million random data:

INSERT INTO exp_test SELECT rand32() / 1000, rand32() / 100000, rand32() / 10000 FROM numbers(10000000)

Let’s quickly review the data:

SELECT * FROM exp_test LIMIT 5

┌───id─┬─address─┬──value─┐
│ 1260 │   12608 │ 126083 │
│ 1438 │   14389 │ 143897 │
│ 3386 │   33866 │ 338666 │
│ 1435 │   14359 │ 143593 │
│ 2327 │   23272 │ 232727 │
└──────┴─────────┴────────┘

Figure out the duplicate data with uniq() function approximately. Duplicate data is easily estimated in this manner.

SELECT uniq(*) FROM exp_test

┌─uniq(id, address, value)─┐
│                  3866551 │
└──────────────────────────┘

However, we  can use the DISTINCT function to get a more precise result:

SELECT countDistinct(id, address, value) FROM exp_test

┌─uniqExact(id, address, value)─┐
│                       3875983 │
└───────────────────────────────┘

The total duplicate rate can be compared to all the data:

SELECT countDistinct(id, address, value) / count(*) FROM exp_test

┌─divide(uniq(id, address, value), count())─┐
│                                 0.3875983 │
└───────────────────────────────────────────┘

So %0,39 data is duplicated. How can we deduplicate these duplicates?

OPTIMIZE TABLE exp_test FINAL DEDUPLICATE;
or
OPTIMIZE TABLE exp_test FINAL DEDUPLICATE BY id, address, value;

Only distinct rows will remain in the table as a result:

SELECT countDistinct(id, address, value) / count(*) FROM exp_test

┌─divide(uniqExact(id, address, value), count())─┐
│                                              1 │
└────────────────────────────────────────────────┘

If you choose to utilize the uniq function over DISTINCT, uniq can be any number of parameters passed to the function. Tuple, Array, Date, DateTime, String, and numeric type parameters are all acceptable.

Was this article helpful?

Related Articles

CHISTADATA IS COMMITTED TO OPEN SOURCE SOFTWARE AND BUILDING HIGH PERFORMANCE COLUMNSTORES

In the spirit of freedom, independence and innovation. ChistaDATA Corporation is not affiliated with ClickHouse Corporation 

Need Support?

Can't find the answer you're looking for?
Contact Support

ChistaDATA Inc. Knowledge base is licensed under the Apache License, Version 2.0 (the “License”)

Copyright 2022 ChistaDATA Inc

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.