ChistaDATA's ClickHouse DBaaS Benchmarking on OpenTelemetry

Table of Contents

Introduction

ChistaDATA, an up-and-coming player in the world of database-as-a-solution services, is introducing an innovative approach to serverless ClickHouse databases, enriched with a variety of appealing features. In this detailed benchmark analysis, we explore the dynamic range of performance within our cloud offerings, highlighting the diverse set of instance classes integrated into our recently launched DBaaS service.

Before we dive into all the details about benchmarking, let’s talk about three remarkable things we have added to help you to test and check your applications:

Metrics Dashboard on DBaaS Portal

We’ve introduced a metrics dashboard on the ChistaDATA DBaaS Portal, providing an instant overview of database performance. This dashboard displays important metrics such as Queries Per Second (QPS), Storage Utilization, Total Inserts, Total Selects, Updates, Failed Queries, and the Average Query Memory and Query Duration. We’re actively working to incorporate additional metrics in the future, enhancing the insights available to users.

Benchmarking Suite – As benchmarking toolkit

Our benchmarking efforts were executed using an in-house Benchmarking Suite configured with an Apache Superset dashboard for visual representation. We’ve taken the initiative to develop a comprehensive end-to-end benchmarking toolkit, which we plan to open source soon. This toolkit promises to be a valuable resource for the wider community, facilitating robust benchmarking procedures and insights.

ChistaDATA Anansi – Query Profiler for ClickHouse

A log analysis tool called ChistaDATA Anansi has been developed for ClickHouse and PostgreSQL, designed as a query profiler by our ChistaDATA Team, and the good news is its open-source github link . The reports generated by the profiler provide information on various query characteristics, including execution time, memory usage, bytes read, and other elementary but essential details which will help you to find bottlenecks and increase the performance of your application.

Now let’s take deep dive into our benchmarking report

Note : This document is exclusively centered around measuring performance within our ChistaDATA DBaaS ecosystem. If you’re interested in exploring a comparative study between ClickHouse and other databases, I believe you’ll find valuable insights in the following resource: Benchmark ClickHouse vs Other Databases. This link will take you to a document that delves into a thorough evaluation of ClickHouse’s capabilities in comparison to alternative database solutions.

Hardware at a Glance

Take a glance at what Instance class we can use while creating ClickHouse Cluster on our manage service platform as of today.

Instance Class	vCPU Cores	RAM ( GB )
ChistaDATA Cloud – Small	2	4
ChistaDATA Cloud – Medium	2	8
ChistaDATA Cloud – Large	4	16
ChistaDATA Cloud – X Large	8	32

Adding to the above list, we also offer support for the Nano Instance class, which comes with 0.5 vCPU and 1 GB RAM. However, it’s primarily intended for testing purposes. The queries we employ for benchmarking are quite extensive, so we aren’t utilizing them to compare with other higher instance classes.

Furthermore, our current support for the Database as a Service (DBaaS) is exclusive to AWS. Nevertheless, we have exciting plans as we’re actively developing the capability to extend our service to GCP and Azure Cloud. Rest assured, these enhancements are in progress, and we’re dedicated to bringing them to you.

Dataset Specifications:

Below are the details about the data we are using for benchmarking purposes.

Data Source: Standard ClickHouse benchmarks (Anonymized Web Analytics Data)
Data Download: CSV File Link
Uncompressed Data Size: 56.28 GB
Table Size on Disk: 13.55 GB
Compression Ratio: 0.2407
Data Rows: Approximately 100 million

Benchmarking

The benchmarking process employs a meticulously curated set of standardized queries thoughtfully executed sequentially. To ensure a comprehensive evaluation, the process is iterated not once but thrice – comprising a dynamic trio of runs: one “Hot” run and two “Cold” runs. This repetition allows us to gather insights into the time taken for query execution under various conditions, furnishing us with a holistic perspective on performance dynamics.

Data Loading Speed

Look at the performance for data loading activity; we are loading ~ 10 Million records into our cluster, which are different in size.

The time required for loading data is approximately the same, irrespective of instance size. It’s important to note that the loaded dataset is substantial, containing around 100 million records, totaling 56 GB. This considerable dataset contributes to the varying load times experienced across these instance classes.

Storage optimization

We are using below given SQL query to calculate the actual data size vs. compressed data size in the MergeTee table in our DBaaS portal.

SELECT
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate,
    sum(rows) AS rows,
    count() AS part_count
FROM system.parts
WHERE (active = 1) AND (database LIKE '%') AND (table LIKE '%')
GROUP BY
    database,
    table
ORDER BY size DESC

Query id: 34531c2e-27b3-4bf1-8853-8bae1d59a67b

┌─database──┬─table───────────────────┬─compressed─┬─uncompressed─┬─compr_rate─┬──────rows─┬─part_count─┐
│ benchmark │ hits                    │ 13.57 GiB  │ 56.35 GiB    │       4.15 │  99997497 │          6 │
│ system    │ trace_log               │ 361.95 MiB │ 7.28 GiB     │      20.58 │  23508487 │         10 │
│ system    │ text_log                │ 305.83 MiB │ 1.56 GiB     │       5.22 │  10290430 │         13 │
│ system    │ asynchronous_metric_log │ 298.88 MiB │ 11.00 GiB    │      37.69 │ 757107710 │         12 │
│ system    │ metric_log              │ 163.30 MiB │ 4.84 GiB     │      30.33 │   1361627 │         12 │
│ system    │ query_log               │ 132.85 MiB │ 1.29 GiB     │       9.91 │   1084442 │         13 │
│ system    │ part_log                │ 11.22 KiB  │ 44.78 KiB    │       3.99 │       224 │          1 │
└───────────┴─────────────────────────┴────────────┴──────────────┴────────────┴───────────┴────────────┘

7 rows in set. Elapsed: 0.004 sec.

As we can see, the compressed rate is around 4.15 for table hits in benchmark db.

Query Performance

Now, shifting our focus to the actual performance of query execution, let’s delve into the details. Within our benchmark suite, we have 39 queries encompassing analytical tasks involving intricate processes such as extensive grouping and sorting. This collection of questions is consistent across all nodes, ensuring a fair evaluation of each instance’s behavior. As we proceed, we’ll gain insights into the distinct behavior of each instance type during the execution of these queries.

Query Duration (Second)	Total Queries	Total Query Processing Duration (Seconds)	Failed Queries
Small	39 Queries	563.4	10
Medium	39 Queries	624	2
Large	39 Queries	373.8	0
Extra Large	39 Queries	204	0

Indeed, it’s evident that the small instance class (vCPU 2 and 4 GB RAM ) experiences more failed queries from the benchmark suite. These queries, characterized by their complexity and analytical nature, are ideally tailored for execution on larger instance classes. Nonetheless, optimization can potentially ensure successful execution even on smaller instance classes.

It’s worth noting that, for the sake of achieving accurate and genuine comparisons, we’ve intentionally employed default configurations across all nodes during benchmarking. While these configurations might not be optimized for the specific characteristics of smaller instance classes, this approach enables us to draw meaningful insights by placing all instances on a level playing field. In the future, we can explore ways to fine-tune these analytical queries for successful execution on smaller instance classes while still preserving the integrity of our benchmarking efforts.

From the above graph, we can notice Large and X Large have finished query execution in ~ 373 and 204 seconds, and for small and medium, it is slightly taking more time than 500 sec.

Median (2nd quartile) Query Execution TIme:

Note: Since the queries are of different types, the presence of slow queries could potentially skew the average duration taken, and hence we use the median to quantify the performance.

The median, also called the 2nd quartile, is a fundamental statistical measure that divides a dataset into two halves and provides a balanced view of the data distribution. It represents the middle value when the data points are arranged in ascending order. In other words, the value separates the lower 50% of the data from the upper 50%.

In this table, the “Instance Class” column lists different instances: Small, Medium, Large, and Extra Large. The “Query Duration (Median)” column displays the median query duration for each instance class in terms of time, and The query durations are presented in milliseconds.

By examining the table, you can discern how the median query durations vary across instance classes. Notably, as the instance class increases in capacity, the median query duration decreases, indicating that instances with more resources tend to execute queries more quickly. The information in this table helps us understand the performance characteristics of different instance classes and make informed decisions regarding resource allocation and query optimization.

Additionally, the median is often used in conjunction with the 1st and 3rd quartiles to provide a comprehensive overview of the data spread in a dataset.

1st Quartile

The 1st quartile, the lower quartile, or the 25th percentile is a statistical measure that divides a dataset into four equal parts. It represents the value below which 25% of the data points in the dataset fall and helps understand data distribution and identify the lower range of values.

In this table, the “Instance Class” column lists different instances: Small, Medium, Large, and Extra Large. The “Query Duration (1st Quartile)” column presents the 1st quartile query duration for each instance class, measured in milliseconds.

By examining the values in this table, you can gain insights into the spread and variability of query durations across different instance classes. You’ll notice that the 1st quartile query duration tends to decrease as the instance class size increases. This indicates that more significant instances generally exhibit faster query execution times, especially in the lower durations. This information can be valuable for understanding each instance class’s performance distribution and characteristics in query execution.

3rd Quartile

The 3rd quartile indicates the point below which 75% of the query durations in each instance class fall. It signifies the upper range of query durations for a particular instance class.

In this table, the “Instance Class” column lists different instances: Small, Medium, Large, and Extra Large. The “Query Duration (3rd Quartile)” column represents the 3rd quartile query duration for each instance class, measured in seconds.

By examining the values in this table, you can gain insights into the spread and variability of query durations across different instance classes, mainly focusing on the longer durations. You’ll observe that the 3rd quartile query duration tends to decrease as the instance class size increases. This implies that more significant instances generally exhibit faster query execution times, especially in the upper duration range. This information helps you understand each instance class’s performance distribution and characteristics in terms of query execution times.

Histogram

Now, let’s look at the actual numbers of the internal metrics that shape the system’s performance. We’ll explore critical indicators like Query throughput, average query duration, and memory usage. This detailed exploration will equip us with a holistic view of how the system is functioning and utilizing resources

Query Duration (Second)	Average Query Duration ( Second )	Median Query Duration ( Second)	Average Query Throughput ( Million Rows per Second)	Average Read Throughput (Mega Bytes per Second)	Median memory usage (Mega Bytes)
Small	6.6	1.7	14	576.601	41.62
Medium	5.4	1.7	16	673.165	177.31
Large	3.2	1.2	26.8	1094.29	242.3
Extra Large	1.8	0.839	46.5	1899.38	272.25

To sum it all up, when you look at how fast our system handles QPS (queries per second) and the different types of questions it can manage, it’s clear that our DBaaS platform is an excellent option, especially when dealing with data analysis tasks.

Superset Monitoring Dashboard

Added here screenshots of the superset dashboard for reference.

Conclusion

ClickHouse undeniably boasts a distinct advantage over other databases, particularly in analytical workloads. If your priority is optimizing time and concentrating on your core business objectives, our managed service provides an excellent solution. We proudly offer ClickHouse as a service, wherein we shoulder the responsibilities of overseeing, deploying, monitoring, and safeguarding your infrastructure. With a seamless single-click interface, you can experience the convenience of streamlined management, leaving you with the freedom to focus on what truly matters – your business.

If you’re curious to learn more about this service, feel free to explore further by visiting: ChistaData DBaaS Portal

To know more about ClickHouse benchmarking, do read the following articles:

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services

ChistaDATA Cloud for ClickHouse Benchmarking on OpenTelemetry Data

ChistaDATA DBaaS Benchmarking Statistics