Starting with version 23.5, ClickHouse introduces another interesting feature called the urlCluster function, which builds upon the existing url function. It works similarly to s3Cluster, leveraging the resources of all nodes in the cluster to significantly accelerate data processing. By using the urlCluster function, you can observe noticeable performance improvements compared to the url function.
So in ClickHouse, there are two main approaches to loading data from external sources: s3Cluster and s3, as well as urlCluster and url.
Let’s explore each of these options.
urlCluster:
Allows processing files from a URL in parallel across many nodes in a specified cluster. On the initiator, it creates a connection to all nodes in the cluster, expands the asterisks (glob patterns) in the URL file path, and dispatches each file dynamically. Each worker node asks the initiator for the next task to process and processes it; this repeats until all tasks are finished.
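The dynamic-dispatch model described above is essentially a shared work queue: the initiator expands the path into concrete file tasks, and idle workers pull the next task until none remain. A minimal Python sketch of that idea (the names run_cluster_job and the worker labels are illustrative, not ClickHouse APIs):

```python
from queue import Queue, Empty
from threading import Thread, Lock

def run_cluster_job(tasks, num_workers=4):
    """Initiator side: fill a queue with file tasks; workers drain it dynamically."""
    queue = Queue()
    for task in tasks:
        queue.put(task)

    results, lock = [], Lock()

    def worker(worker_id):
        # Worker side: keep asking for the next task until none remain.
        while True:
            try:
                task = queue.get_nowait()
            except Empty:
                return
            processed = f"worker{worker_id} processed {task}"
            with lock:
                results.append(processed)

    threads = [Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# The initiator would first expand a glob into concrete tasks, e.g.:
files = [f"file_{i}.csv" for i in range(1, 5)]
print(len(run_cluster_job(files)))  # 4 — every file is processed exactly once
```

Because workers pull tasks instead of receiving a fixed assignment up front, a slow node processes fewer files rather than stalling the whole job.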
For instance, in the queries below, the url function takes approximately 12.404 seconds, while the urlCluster function completes the same task in just 9.517 seconds. This difference becomes even more pronounced when dealing with larger datasets.
clickhouse01 :) CREATE TABLE covid19
                (
                    date Date,
                    location_key LowCardinality(String),
                    new_confirmed Int32,
                    new_deceased Int32,
                    new_recovered Int32,
                    new_tested Int32,
                    cumulative_confirmed Int32,
                    cumulative_deceased Int32,
                    cumulative_recovered Int32,
                    cumulative_tested Int32
                )
                ENGINE = MergeTree
                ORDER BY (location_key, date);

Query id: 5b4be598-737e-4c43-b9d7-82228871c064

clickhouse01 :) INSERT INTO covid19
                SELECT *
                FROM url(
                    'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
                    CSVWithNames,
                    'date Date, location_key LowCardinality(String), new_confirmed Int32, new_deceased Int32, new_recovered Int32, new_tested Int32, cumulative_confirmed Int32, cumulative_deceased Int32, cumulative_recovered Int32, cumulative_tested Int32'
                );

Query id: feb967c0-410d-4572-b336-4e741f72b01a

Ok.

0 rows in set. Elapsed: 12.404 sec. Processed 12.53 million rows, 450.53 MB (1.01 million rows/s., 36.32 MB/s.)

clickhouse01 :) INSERT INTO covid19
                SELECT *
                FROM urlCluster(
                    'cluster_demo_ash',
                    'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
                    CSVWithNames,
                    'date Date, location_key LowCardinality(String), new_confirmed Int32, new_deceased Int32, new_recovered Int32, new_tested Int32, cumulative_confirmed Int32, cumulative_deceased Int32, cumulative_recovered Int32, cumulative_tested Int32'
                );

Query id: e7d6ab57-0158-47bd-9fd5-83daf64f15ac

Ok.

0 rows in set. Elapsed: 9.517 sec. Processed 12.53 million rows, 439.12 MB (1.32 million rows/s., 46.14 MB/s.)
clickhouse01 :) SELECT cluster, shard_num, replica_num, host_name, host_address, is_active
                FROM system.clusters
                WHERE cluster = 'cluster_demo_ash';

Query id: 8a324fd1-e3cd-4ed3-9449-269d3372b2ed

┌─cluster──────────┬─shard_num─┬─replica_num─┬─host_name────┬─host_address─┬─is_active─┐
│ cluster_demo_ash │         1 │           1 │ clickhouse01 │ 172.30.0.2   │      ᴺᵁᴸᴸ │
│ cluster_demo_ash │         1 │           2 │ clickhouse02 │ 172.30.0.3   │      ᴺᵁᴸᴸ │
│ cluster_demo_ash │         2 │           1 │ clickhouse03 │ 172.30.0.4   │      ᴺᵁᴸᴸ │
│ cluster_demo_ash │         2 │           2 │ clickhouse04 │ 172.30.0.5   │      ᴺᵁᴸᴸ │
└──────────────────┴───────────┴─────────────┴──────────────┴──────────────┴───────────┘

4 rows in set. Elapsed: 0.006 sec.
s3Cluster:
Allows processing files from Amazon S3 in parallel across many nodes in a specified cluster. On the initiator, it creates a connection to all nodes in the cluster, expands the asterisks (glob patterns) in the S3 file path, and dispatches each file dynamically. Each worker node asks the initiator for the next task to process and processes it; this repeats until all tasks are finished.
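The glob expansion step matters here: a path like trips_{0..15}.gz is expanded by the initiator into 16 concrete file tasks, which is what gives the cluster something to parallelize over. A simplified Python sketch of expanding a single numeric brace range (real ClickHouse paths also support {a,b,c} alternatives and '*'/'?' wildcards, which this hypothetical helper ignores):

```python
import re

def expand_braces(path):
    """Expand one ClickHouse-style numeric range like {0..15} in a path.
    Simplified sketch: only a single {lo..hi} range is handled."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", path)
    if not match:
        return [path]  # no range: the path is a single task
    lo, hi = int(match.group(1)), int(match.group(2))
    return [path[:match.start()] + str(n) + path[match.end():]
            for n in range(lo, hi + 1)]

urls = expand_braces("trips_{0..15}.gz")
print(len(urls))            # 16 files, i.e. 16 independently dispatchable tasks
print(urls[0], urls[-1])    # trips_0.gz trips_15.gz
```

With 16 files and 4 nodes, each node ends up fetching and parsing roughly 4 files, which is where the speedup over the single-node s3 function comes from.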
For instance, in the queries below, the s3 function takes approximately 77.369 seconds, while the s3Cluster function completes the same task in just 49.010 seconds. This difference becomes even more pronounced when dealing with larger datasets.
clickhouse01 :) CREATE DATABASE trips_db ON CLUSTER `{cluster}`;

Query id: f83abe63-ea38-435e-8062-b15675583f9c

┌─host─────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ clickhouse01 │ 9000 │      0 │       │                   3 │                0 │
│ clickhouse02 │ 9000 │      0 │       │                   2 │                0 │
│ clickhouse04 │ 9000 │      0 │       │                   1 │                0 │
│ clickhouse03 │ 9000 │      0 │       │                   0 │                0 │
└──────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘

4 rows in set. Elapsed: 0.113 sec.

clickhouse01 :) SELECT cluster, shard_num, replica_num, host_name, host_address
                FROM system.clusters
                WHERE cluster = 'cluster_demo_ash';

Query id: e8f9ce4c-84a6-4cea-b034-e9f64697dc57

┌─cluster──────────┬─shard_num─┬─replica_num─┬─host_name────┬─host_address─┐
│ cluster_demo_ash │         1 │           1 │ clickhouse01 │ 172.20.0.2   │
│ cluster_demo_ash │         1 │           2 │ clickhouse02 │ 172.20.0.3   │
│ cluster_demo_ash │         2 │           1 │ clickhouse03 │ 172.20.0.4   │
│ cluster_demo_ash │         2 │           2 │ clickhouse04 │ 172.20.0.5   │
└──────────────────┴───────────┴─────────────┴──────────────┴──────────────┘

4 rows in set. Elapsed: 0.002 sec.
CREATE TABLE trips_db.trips_s3 ON CLUSTER `{cluster}`
(
    trip_id UInt32,
    pickup_datetime DateTime,
    dropoff_datetime DateTime,
    pickup_longitude Nullable(Float64),
    pickup_latitude Nullable(Float64),
    dropoff_longitude Nullable(Float64),
    dropoff_latitude Nullable(Float64),
    passenger_count UInt8,
    trip_distance Float32,
    fare_amount Float32,
    extra Float32,
    tip_amount Float32,
    tolls_amount Float32,
    total_amount Float32,
    payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    pickup_ntaname LowCardinality(String),
    dropoff_ntaname LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trips_db/trips_s3', '{replica}')
PRIMARY KEY (pickup_datetime, dropoff_datetime);

CREATE TABLE trips_db.trips_s3cluster ON CLUSTER `{cluster}`
(
    trip_id UInt32,
    pickup_datetime DateTime,
    dropoff_datetime DateTime,
    pickup_longitude Nullable(Float64),
    pickup_latitude Nullable(Float64),
    dropoff_longitude Nullable(Float64),
    dropoff_latitude Nullable(Float64),
    passenger_count UInt8,
    trip_distance Float32,
    fare_amount Float32,
    extra Float32,
    tip_amount Float32,
    tolls_amount Float32,
    total_amount Float32,
    payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    pickup_ntaname LowCardinality(String),
    dropoff_ntaname LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trips_db/trips_s3cluster', '{replica}')
PRIMARY KEY (pickup_datetime, dropoff_datetime);

clickhouse01 :) INSERT INTO trips_db.trips_s3
                SELECT trip_id, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude,
                       dropoff_longitude, dropoff_latitude, passenger_count, trip_distance, fare_amount,
                       extra, tip_amount, tolls_amount, total_amount, payment_type, pickup_ntaname,
                       dropoff_ntaname
                FROM s3(
                    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..15}.gz',
                    'TabSeparatedWithNames'
                );

Query id: a4b15e0c-0ef9-49cd-b0d7-df0310fcbdfe

Ok.

0 rows in set. Elapsed: 77.369 sec. Processed 16.00 million rows, 1.22 GB (206.82 thousand rows/s., 15.72 MB/s.)
clickhouse01 :) INSERT INTO trips_db.trips_s3cluster
                SELECT trip_id, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude,
                       dropoff_longitude, dropoff_latitude, passenger_count, trip_distance, fare_amount,
                       extra, tip_amount, tolls_amount, total_amount, payment_type, pickup_ntaname,
                       dropoff_ntaname
                FROM s3Cluster(
                    'cluster_demo_ash',
                    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..15}.gz',
                    'TabSeparatedWithNames'
                );

Query id: b5558357-c1d3-4564-94c3-549df3024482

Ok.

0 rows in set. Elapsed: 49.010 sec. Processed 16.00 million rows, 1.22 GB (326.49 thousand rows/s., 24.86 MB/s.)
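Putting the two runs side by side, the elapsed times above work out to roughly a 1.3x speedup for urlCluster over url and a 1.6x speedup for s3Cluster over s3 on this four-node cluster. These are single-run numbers, so treat them as indicative rather than rigorous benchmarks:

```python
# Speedup factors computed from the elapsed times reported above.
url_speedup = 12.404 / 9.517    # url vs urlCluster
s3_speedup = 77.369 / 49.010    # s3 vs s3Cluster

print(f"urlCluster speedup: {url_speedup:.2f}x")  # urlCluster speedup: 1.30x
print(f"s3Cluster speedup: {s3_speedup:.2f}x")    # s3Cluster speedup: 1.58x
```

The s3 case benefits more because its path expands to 16 separate files that can be fetched concurrently, whereas the covid19 dataset is a single CSV file with less opportunity for parallel fetching.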
References:
https://clickhouse.com/docs/en/sql-reference/table-functions/urlCluster
https://clickhouse.com/docs/en/sql-reference/table-functions/s3Cluster