1. Home
  2. Knowledge Base
  3. ChistaDATA
  4. Boosting Data Processing in ClickHouse: Exploring urlCluster and s3Cluster Functions
  1. Home
  2. Knowledge Base
  3. ClickHouse
  4. Boosting Data Processing in ClickHouse: Exploring urlCluster and s3Cluster Functions
  1. Home
  2. Knowledge Base
  3. ClickHouse Performance
  4. Boosting Data Processing in ClickHouse: Exploring urlCluster and s3Cluster Functions

Boosting Data Processing in ClickHouse: Exploring urlCluster and s3Cluster Functions

From version 23.5 version, ClickHouse introduces another interesting feature called  urlCluster function, which builds upon the existing url function. It functions similarly  s3Cluster by leveraging the resources from all nodes within the cluster, thereby significantly accelerating data processing. By using the urlCluster function, you can observe noticeable performance improvements compared to the url function.

So In ClickHouse, there are two main approaches for restoring data from external sources: s3Cluster and s3, as well as urlCluster and url.

Let’s explore each of these options.

 

urlCluster :

Allows processing files from URL in parallel from many nodes in a specified cluster. On initiator it creates a connection to all nodes in the cluster, discloses asterisk in URL file path, and dispatches each file dynamically. On the worker node it asks the initiator about the next task to process and processes it. This is repeated until all tasks are finished.

For instance, in the query below, the url function takes approximately 12.404 seconds, while the urlCluster function completes the same task in just 9.517 seconds. This time difference becomes even more pronounced when dealing with larger datasets.

clickhouse01 :) CREATE TABLE covid19 (
    date Date,
    location_key LowCardinality(String),
    new_confirmed Int32,
    new_deceased Int32,
    new_recovered Int32,
    new_tested Int32,
    cumulative_confirmed Int32,
    cumulative_deceased Int32,
    cumulative_recovered Int32,
    cumulative_tested Int32
)
ENGINE = MergeTree
ORDER BY (location_key, date);

Query id: 5b4be598-737e-4c43-b9d7-82228871c064


clickhouse01 :) INSERT INTO covid19
   SELECT *
   FROM
      url(
        'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
        CSVWithNames,
        'date Date,
        location_key LowCardinality(String),
        new_confirmed Int32,
        new_deceased Int32,
        new_recovered Int32,
        new_tested Int32,
        cumulative_confirmed Int32,
        cumulative_deceased Int32,
        cumulative_recovered Int32,
        cumulative_tested Int32'
    );

Query id: feb967c0-410d-4572-b336-4e741f72b01a

Ok.

0 rows in set. Elapsed: 12.404 sec. Processed 12.53 million rows, 450.53 MB (1.01 million rows/s., 36.32 MB/s.)

clickhouse01 :) INSERT INTO covid19
   SELECT *
   FROM
      urlCluster('cluster_demo_ash',
        'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
        CSVWithNames,
        'date Date,
        location_key LowCardinality(String),
        new_confirmed Int32,
        new_deceased Int32,
        new_recovered Int32,
        new_tested Int32,
        cumulative_confirmed Int32,
        cumulative_deceased Int32,
        cumulative_recovered Int32,
        cumulative_tested Int32'
    );

Query id: e7d6ab57-0158-47bd-9fd5-83daf64f15ac

→ Progress: 2.49 million rows, 86.99 MB (1.11 million rows/s., 38.77 MB/s.) █████▎                           16%
Ok.

0 rows in set. Elapsed: 9.517 sec. Processed 12.53 million rows, 439.12 MB (1.32 million rows/s., 46.14 MB/s.)

clickhouse01 :) select cluster,shard_num,replica_num,host_name,host_address,is_active from system.clusters WHERE cluster = 'cluster_demo_ash';

SELECT
    cluster,
    shard_num,
    replica_num,
    host_name,
    host_address,
    is_active
FROM system.clusters
WHERE cluster = 'cluster_demo_ash'

Query id: 8a324fd1-e3cd-4ed3-9449-269d3372b2ed

┌─cluster──────────┬─shard_num─┬─replica_num─┬─host_name────┬─host_address─┬─is_active─┐
│ cluster_demo_ash │         1 │           1 │ clickhouse01 │ 172.30.0.2   │      ᴺᵁᴸᴸ │
│ cluster_demo_ash │         1 │           2 │ clickhouse02 │ 172.30.0.3   │      ᴺᵁᴸᴸ │
│ cluster_demo_ash │         2 │           1 │ clickhouse03 │ 172.30.0.4   │      ᴺᵁᴸᴸ │
│ cluster_demo_ash │         2 │           2 │ clickhouse04 │ 172.30.0.5   │      ᴺᵁᴸᴸ │
└──────────────────┴───────────┴─────────────┴──────────────┴──────────────┴───────────┘

4 rows in set. Elapsed: 0.006 sec.

 

s3Cluster  : 

Allows processing files from Amazon S3 in parallel from many nodes in a specified cluster. On initiator it creates a connection to all nodes in the cluster, discloses asterisks in S3 file path, and dispatches each file dynamically. On the worker node it asks the initiator about the next task to process and processes it. This is repeated until all tasks are finished.

For instance, in the query below, the s3 function takes approximately 77.369 seconds, while the s3Cluster function completes the same task in just 49.010 seconds. This time difference becomes even more pronounced when dealing with larger datasets.

clickhouse01 :) CREATE DATABASE trips_db ON CLUSTER `{cluster}`;

CREATE DATABASE trips_db ON CLUSTER `{cluster}`

Query id: f83abe63-ea38-435e-8062-b15675583f9c

┌─host─────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ clickhouse01 │ 9000 │      0 │       │                   3 │                0 │
│ clickhouse02 │ 9000 │      0 │       │                   2 │                0 │
│ clickhouse04 │ 9000 │      0 │       │                   1 │                0 │
│ clickhouse03 │ 9000 │      0 │       │                   0 │                0 │
└──────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘

4 rows in set. Elapsed: 0.113 sec.

clickhouse01 :) SELECT
                    cluster,
                    shard_num,
                    replica_num,
                    host_name,
                    host_address
                FROM system.clusters
                WHERE cluster = 'cluster_demo_ash'

SELECT
    cluster,
    shard_num,
    replica_num,
    host_name,
    host_address
FROM system.clusters
WHERE cluster = 'cluster_demo_ash'

Query id: e8f9ce4c-84a6-4cea-b034-e9f64697dc57

┌─cluster──────────┬─shard_num─┬─replica_num─┬─host_name────┬─host_address─┐
│ cluster_demo_ash │         1 │           1 │ clickhouse01 │ 172.20.0.2   │
│ cluster_demo_ash │         1 │           2 │ clickhouse02 │ 172.20.0.3   │
│ cluster_demo_ash │         2 │           1 │ clickhouse03 │ 172.20.0.4   │
│ cluster_demo_ash │         2 │           2 │ clickhouse04 │ 172.20.0.5   │
└──────────────────┴───────────┴─────────────┴──────────────┴──────────────┘

4 rows in set. Elapsed: 0.002 sec.



CREATE TABLE trips_db.trips_s3 ON CLUSTER `{cluster}` (
    trip_id             UInt32,
    pickup_datetime     DateTime,
    dropoff_datetime    DateTime,
    pickup_longitude    Nullable(Float64),
    pickup_latitude     Nullable(Float64),
    dropoff_longitude   Nullable(Float64),
    dropoff_latitude    Nullable(Float64),
    passenger_count     UInt8,
    trip_distance       Float32,
    fare_amount         Float32,
    extra               Float32,
    tip_amount          Float32,
    tolls_amount        Float32,
    total_amount        Float32,
    payment_type        Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    pickup_ntaname      LowCardinality(String),
    dropoff_ntaname     LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trips_db/trips_s3', '{replica}')
PRIMARY KEY (pickup_datetime, dropoff_datetime);


INSERT INTO trips_db.trips_s3
SELECT
    trip_id,
    pickup_datetime,
    dropoff_datetime,
    pickup_longitude,
    pickup_latitude,
    dropoff_longitude,
    dropoff_latitude,
    passenger_count,
    trip_distance,
    fare_amount,
    extra,
    tip_amount,
    tolls_amount,
    total_amount,
    payment_type,
    pickup_ntaname,
    dropoff_ntaname
FROM s3(
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..15}.gz',
    'TabSeparatedWithNames'
);



CREATE TABLE trips_db.trips_s3cluster ON CLUSTER `{cluster}` (
    trip_id             UInt32,
    pickup_datetime     DateTime,
    dropoff_datetime    DateTime,
    pickup_longitude    Nullable(Float64),
    pickup_latitude     Nullable(Float64),
    dropoff_longitude   Nullable(Float64),
    dropoff_latitude    Nullable(Float64),
    passenger_count     UInt8,
    trip_distance       Float32,
    fare_amount         Float32,
    extra               Float32,
    tip_amount          Float32,
    tolls_amount        Float32,
    total_amount        Float32,
    payment_type        Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    pickup_ntaname      LowCardinality(String),
    dropoff_ntaname     LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trips_db/trips_s3cluster', '{replica}')
PRIMARY KEY (pickup_datetime, dropoff_datetime);


INSERT INTO trips_db.trips_s3cluster
SELECT
    trip_id,
    pickup_datetime,
    dropoff_datetime,
    pickup_longitude,
    pickup_latitude,
    dropoff_longitude,
    dropoff_latitude,
    passenger_count,
    trip_distance,
    fare_amount,
    extra,
    tip_amount,
    tolls_amount,
    total_amount,
    payment_type,
    pickup_ntaname,
    dropoff_ntaname
FROM s3Cluster('cluster_demo_ash',
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..15}.gz',
    'TabSeparatedWithNames'
);




clickhouse01 :) INSERT INTO trips_db.trips_s3
                SELECT
                    trip_id,
                    pickup_datetime,
                    dropoff_datetime,
                    pickup_longitude,
                    pickup_latitude,
                    dropoff_longitude,
                    dropoff_latitude,
                    passenger_count,
                    trip_distance,
                    fare_amount,
                    extra,
                    tip_amount,
                    tolls_amount,
                    total_amount,
                    payment_type,
                    pickup_ntaname,
                    dropoff_ntaname
                FROM s3(
                    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..15}.gz',
                    'TabSeparatedWithNames'
                );

INSERT INTO trips_db.trips_s3 SELECT
    trip_id,
    pickup_datetime,
    dropoff_datetime,
    pickup_longitude,
    pickup_latitude,
    dropoff_longitude,
    dropoff_latitude,
    passenger_count,
    trip_distance,
    fare_amount,
    extra,
    tip_amount,
    tolls_amount,
    total_amount,
    payment_type,
    pickup_ntaname,
    dropoff_ntaname
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..15}.gz', 'TabSeparatedWithNames')

Query id: a4b15e0c-0ef9-49cd-b0d7-df0310fcbdfe

Ok.

0 rows in set. Elapsed: 77.369 sec. Processed 16.00 million rows, 1.22 GB (206.82 thousand rows/s., 15.72 MB/s.)

clickhouse01 :)




clickhouse01 :) INSERT INTO trips_db.trips_s3cluster
                SELECT
                    trip_id,
                    pickup_datetime,
                    dropoff_datetime,
                    pickup_longitude,
                    pickup_latitude,
                    dropoff_longitude,
                    dropoff_latitude,
                    passenger_count,
                    trip_distance,
                    fare_amount,
                    extra,
                    tip_amount,
                    tolls_amount,
                    total_amount,
                    payment_type,
                    pickup_ntaname,
                    dropoff_ntaname
                FROM s3Cluster('cluster_demo_ash',
                    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..15}.gz',
                    'TabSeparatedWithNames'
                );


INSERT INTO trips_db.trips_s3cluster SELECT
    trip_id,
    pickup_datetime,
    dropoff_datetime,
    pickup_longitude,
    pickup_latitude,
    dropoff_longitude,
    dropoff_latitude,
    passenger_count,
    trip_distance,
    fare_amount,
    extra,
    tip_amount,
    tolls_amount,
    total_amount,
    payment_type,
    pickup_ntaname,
    dropoff_ntaname
FROM s3Cluster('cluster_demo_ash', 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..15}.gz', 'TabSeparatedWithNames')

Query id: b5558357-c1d3-4564-94c3-549df3024482

Ok.

0 rows in set. Elapsed: 49.010 sec. Processed 16.00 million rows, 1.22 GB (326.49 thousand rows/s., 24.86 MB/s.)


 

References : 

https://clickhouse.com/docs/en/sql-reference/table-functions/urlCluster

https://clickhouse.com/docs/en/sql-reference/table-functions/s3Cluster

 

Was this article helpful?

Related Articles

CHISTADATA IS COMMITTED TO OPEN SOURCE SOFTWARE AND BUILDING HIGH PERFORMANCE COLUMNSTORES

In the spirit of freedom, independence and innovation. ChistaDATA Corporation is not affiliated with ClickHouse Corporation 

Need Support?

Can't find the answer you're looking for?
Contact Support

ChistaDATA Inc. Knowledge base is licensed under the Apache License, Version 2.0 (the “License”)

Copyright 2022 ChistaDATA Inc

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.