With the help of ClickHouse functions, you can rapidly determine how many duplicate rows are present in a table and then eliminate them, leaving just the unique rows.
Let’s create a table:
CREATE TABLE default.exp_test ( `id` UInt32, `address` UInt32, `value` UInt32 ) ENGINE = MergeTree ORDER BY tuple()
Let’s insert 10 million random data:
INSERT INTO exp_test SELECT rand32() / 1000, rand32() / 100000, rand32() / 10000 FROM numbers(10000000)
Let’s quickly review the data:
SELECT * FROM exp_test LIMIT 5 ┌───id─┬─address─┬──value─┐ │ 1260 │ 12608 │ 126083 │ │ 1438 │ 14389 │ 143897 │ │ 3386 │ 33866 │ 338666 │ │ 1435 │ 14359 │ 143593 │ │ 2327 │ 23272 │ 232727 │ └──────┴─────────┴────────┘
Figure out the duplicate data with uniq() function approximately. Duplicate data is easily estimated in this manner.
SELECT uniq(*) FROM exp_test ┌─uniq(id, address, value)─┐ │ 3866551 │ └──────────────────────────┘
However, we can use the DISTINCT function to get a more precise result:
SELECT countDistinct(id, address, value) FROM exp_test ┌─uniqExact(id, address, value)─┐ │ 3875983 │ └───────────────────────────────┘
The total duplicate rate can be compared to all the data:
SELECT countDistinct(id, address, value) / count(*) FROM exp_test ┌─divide(uniq(id, address, value), count())─┐ │ 0.3875983 │ └───────────────────────────────────────────┘
So %0,39 data is duplicated. How can we deduplicate these duplicates?
OPTIMIZE TABLE exp_test FINAL DEDUPLICATE; or OPTIMIZE TABLE exp_test FINAL DEDUPLICATE BY id, address, value;
Only distinct rows will remain in the table as a result:
SELECT countDistinct(id, address, value) / count(*) FROM exp_test ┌─divide(uniqExact(id, address, value), count())─┐ │ 1 │ └────────────────────────────────────────────────┘
If you choose to utilize the uniq function over DISTINCT, uniq can be any number of parameters passed to the function. Tuple, Array, Date, DateTime, String, and numeric type parameters are all acceptable.