ChistaDATA Cloud for ClickHouse v/s Google BigQuery: Comparative Benchmarking

Introduction

BigQuery is Google’s fully managed data warehouse which is offered with its own SQL dialect support. BigQuery was launched in the year 2011 and has been a mature cloud offering ever since. BigQuery is offered in two different pricing models. Users can opt for on-demand pricing where the users are billed based on the amount of data processed by the queries. There is another option of choosing capacity pricing where the users are billed based on the compute capacity chosen for running the queries. BigQuery uses a columnar data structure which makes it suitable for OLAP scenarios.

ChistaDATA cloud is a fully managed and autonomous ClickHouse infrastructure meant for real-time data analytics and OLAP scenarios. ClickHouse is an open-source and columnar database meant for analytical workloads. ChistaDATA cloud was released in the year 2023 and is fairly new. ChistaDATA cloud offers a transparent pricing model where you pay for the ClickHouse cluster based on the cluster size. (You know what you are going to pay and no cost explosion is possible)

ClickHouse has been benchmarked against many popular and commonly used databases and the results are available here. ClickHouse is generally seen outshining the other databases for analytical tasks. In this article, I will share our comparative benchmarking study on ClickHouse and BigQuery.

This benchmarking was done as part of our engagement with Difinative Technologies to compare the ChistaDATA cloud and Google BigQuery. The input data infrastructure was set up and provided by Difinative where they were able to set up a working pipeline to ingest data into BigQuery and ChistaDATA cloud.

Before we discuss the results, let us look into the environment and the data used in this benchmark.

Hardware

BigQuery used in this study is based on on-demand pricing, so the hardware resources are not fixed and depend on the query. ClickHouse cluster in the ChistaDATA cloud is a single node server with 16GB RAM and 8 vCPUs.

Data

We used Open Telemetry as the data source and stored the same set of data in BigQuery and ClickHouse with the same table schema (albeit different engines of course).  The total number of rows is ~ 83 million and we used MergeTree table engine in ClickHouse.

Benchmarking Methodology

The performance of BigQuery and ChistaDATA cloud is measured based on the query execution duration for the following scenarios

  • Data Ingestion
  • Updates and Deletes
  • Data Reads
  • Aggregations
  • Data type conversion

While the SQL dialect for BigQuery and ClickHouse is slightly different, the final outcome was designed to be the same. So we came up with query scenarios and then wrote the SQL for BigQuery and ClickHouse. The SQL was executed using an automation script (Python Client) for both BigQuery and ClickHouse. The scripts are run from EC2/compute engine on the same/closest available zone in AWS/GCP to reduce the latency and the query duration mentioned here is measured from the automation script.

ClickHouse vs BigQuery: Comparative Performance

Data Ingestion Performance

Let us begin our comparison considering the data ingestion speeds. In general, the row-by-row insert was consistently slow in ClickHouse whereas the TSV-based insert was observed to be slower in BigQuery for this scenario. ClickHouse was faster for batch/bulk inserts.

ScenarioClickHouse (Seconds)Big Query (Seconds)
Insert data from a csv file – Row by row31.0810911.86
Insert data from a csv file – Batch – 1000 Rows0.1020311.78
Insert data from a TSV file – Row by row26.710541064.53
Insert data from a TSV file – Batch – 10000.0516755.36
Ingestion in an unpartitioned table – 10000 rows0.019915.87

Updates and Deletes

ScenarioClickHouse (Seconds)Big Query (Seconds)
Update the data based on WHERE clause0.0297939.03
Delete the data based on WHERE clause0.0320297.92

Updates and deletes are hardware-intensive operations in ClickHouse and are supported in the MergeTree family of engines. We tried measuring the time taken by the client for point updates and deletes in both databases and found that ClickHouse performs better in both scenarios.

Read Performance

ScenarioClickHouse (Seconds)Big Query (Seconds)
Random read0.039064178.72
Simple SELECT + Limit0.0278021.26
Simple SELECT + Sort + LIMIT11.7848972.02
String Search + LIMIT2.368581.99
Int Search + LIMIT0.0582881.29
Int Search + LIMIT + Sort9.0569731.9
Count0.128465.59
Count + data filter0.1591971.26
   

We tested the scenarios based on data read operations in both ClickHouse and BigQuery. ClickHouse was able to outshine BigQuery in most of the scenarios.

Aggregations

ScenarioClickHouse (Seconds)Big Query (Seconds)
Aggregate function – 1 + Sorting0.0424961.75
Aggregate function – 2 + Sorting0.0292961.67
Aggregate function – 3 + Sorting0.0247021.57
Aggregate function – 4 + Sorting0.4485661.46
Aggregate function – 5 + Sorting0.2338491.46

When it comes to data aggregation, ClickHouse is the undisputed winner over Google BigQuery.

Data Type Conversion

ScenarioClickHouse (Seconds)Big Query (Seconds)
Data type conversion (To float)0.379411.85
Type Conversion to JSON0.0395841.15
Type Conversion to JSON + JSON search0.0264171.26

We tried three different scenarios involving data type conversion and ClickHouse had the upper hand in these scenarios too.

Data Compression

Data Compression – ClickHouse – 5.38 %

  • Compressed Size – 2.3 GB 
  • Uncompressed Size – 42.54 GiB

The compression stats for BigQuery are unknown. ClickHouse was able to compress the data to ~6 % of its original size and store. The main advantage is we can store more data in the given ClickHouse server while the performance doesn’t take a hit for reading the compressed data files.

Conclusion

ChistaDATA Cloud and BigQuery offer strong capabilities as analytical data stores and excel in different aspects of data management and processing. ClickHouse demonstrated superior performance in complex query processing, scalability, and storage efficiency due to its high data compression ratio. ChistaDATA cloud is highly suitable for those who seek more granular control on their database servers and offers a lot of configuration options. Google’s BigQuery, on the other hand, offers great data ingestion performance and multiple integrations. It has user-friendly interface which results in ease of use and is fully managed, and serverless.

Special thanks to the team from ChistaDATA and Difinative Technologies who worked on this endeavor.

Shiv Iyer – ChistaDATA

Emrah Idman – ChistaDATA

Alkin Tezuysal – ChistaDATA

Saji Thoppil – Difinative technologies

Prasanth Sekharan – Difinative technologies

Asnah Farah – Difinative technologies

Vaibhava Lakshmi – Difinative technologies

To learn more about ClickHouse Benchmarking, do read the following articles: