Assume we have two tables with identical amounts of data, each with roughly n million rows for each. If we want to compare the data row by row, we can utilize with sipHash64(*) hash function.
What is sipHash64?
This hashing operation uses cryptography. Compared to the MD5 function, it operates at least three times faster. The function converts each input parameter to a string and calculates its hash value. Then employs the following procedure to merge hashes:
The function obtains the array of hashes after hashing each supplied argument.
The function creates a hash for the array of the first and second components.
The function then produces a hash for the array comprising the hash value computed in the preceding step and the third member of the initial hash array.
The previous phase is repeated for each of the initial hash array’s remaining components.
Let’s learn more about the cars table and one of the tables and compare the two tables!
SELECT * FROM cars ┌─year─┬─brand───┬─gearbox─┬─km_info─┐ │ 2008 │ Audi │ Auto │ 130000 │ │ 2016 │ VW │ Manual │ 85000 │ │ 2019 │ BMW │ Auto │ 60000 │ │ 2020 │ Peugeot │ Auto │ 30000 │ └──────┴─────────┴─────────┴─────────┘
Let’s learn more about the vehicle table, which can be compared to the cars table.
SELECT * FROM vehicle ┌─year─┬─type──┬─brand_name─┬─route─┬─duration─┬─fuel_type─┬─gear─┬─breakdown─┐ │ 2008 │ car │ Audi │ West │ 1 Week │ Gas │ Auto │ Yes │ │ 2010 │ car │ Mercedes │ South │ 30 Day │ Gas │ Auto │ No │ │ 2019 │ car │ BMW │ South │ 1 Week │ Diesel │ Auto │ No │ │ 2019 │ plane │ Airbus │ North │ 1 Day │ Avgas │ Auto │ No │ │ 2021 │ plane │ Airbus │ East │ 2 Day │ Avgas │ Auto │ No │ └──────┴───────┴────────────┴───────┴──────────┴───────────┴──────┴───────────┘
Well, two tables have some mutual columns, and we want to compare the data in these mutual columns.
This query gives us the difference between cars and vehicle tables according to the data hashes.
SELECT * FROM cars WHERE sipHash64(year, brand, gearbox) NOT IN ( SELECT sipHash64(year, brand_name, gear) FROM vehicle) ┌─year─┬─brand───┬─gearbox─┬─km_info─┐ │ 2016 │ VW │ Manual │ 85000 │ │ 2020 │ Peugeot │ Auto │ 30000 │ └──────┴─────────┴─────────┴─────────┘
Also, this query gives us the difference between vehicle and cars tables according to the data hashes.
SELECT * FROM vehicle WHERE sipHash64(year, brand_name, gear) NOT IN ( SELECT sipHash64(year, brand, gearbox) FROM cars) ┌─year─┬─type──┬─brand_name─┬─route─┬─duration─┬─fuel_type─┬─gear─┬─breakdown─┐ │ 2010 │ car │ Mercedes │ South │ 30 Day │ Gas │ Auto │ No │ │ 2019 │ plane │ Airbus │ North │ 1 Day │ Avgas │ Auto │ No │ │ 2021 │ plane │ Airbus │ East │ 2 Day │ Avgas │ Auto │ No │ └──────┴───────┴────────────┴───────┴──────────┴───────────┴──────┴───────────┘