Automated ClickHouse Benchmarking w/ Google Cloud & Superset

Table of Contents

Introduction

Benchmarking is a critical aspect of ensuring the performance and scalability of database systems. By analyzing the performance of a database under different scenarios, developers and administrators can make informed decisions about optimization and resource allocation. ClickHouse, being a popular open-source analytical database management system, often requires thorough benchmarking to ensure its optimal performance.

In this blog post, we’ll delve into the advantages of benchmarking, followed by a detailed explanation of a Proof of Concept (POC) automation tool designed specifically for ClickHouse benchmarking. This tool leverages the power of Google Cloud for infrastructure provisioning, GitHub Actions for automation workflows, and Apache Superset for visualization of benchmark results.

Advantages of Benchmarking

Benchmarking offers several advantages:

Performance Optimization: Identifying bottlenecks and performance limitations allows for targeted optimizations, improving overall system performance.
Scalability Testing: Benchmarking helps determine how well a system scales with increasing workload or data volume, enabling efficient resource allocation.
Quality Assurance: By comparing performance metrics against predefined standards, benchmarking ensures the reliability and stability of the database system.
Cost-Efficiency: Optimizing performance and resource allocation through benchmarking can lead to cost savings by utilizing resources more efficiently.

Automating ClickHouse Benchmarking

The ClickHouse benchmarking automation tool simplifies the process of setting up benchmark tests, executing them, and visualizing the results. Here’s how it works:

Customer Request:
- Customers send a request specifying the example dataset they want to benchmark (e.g., analytics, logs) via GitHub.
Infrastructure Provisioning:
- The automation tool creates an instance on Google Cloud Platform (GCP) with the specified dataset.
Benchmark Execution:
- Once the instance is set up, benchmark tests are executed using predefined SQL queries. These tests simulate real-world scenarios to assess the performance of ClickHouse.
Result Reporting:
- Upon completion of benchmark tests, a comprehensive report is generated. This report includes detailed performance metrics, such as query execution time, throughput, and resource utilization.
Email Notification:
- The generated report is automatically sent to the customers via email.
Visualization with Apache Superset:
- Benchmark results are stored in a dedicated reporting database. Apache Superset is used to visualize these results, providing interactive dashboards for easy analysis and interpretation.

Example of Report:

Query 10:
hostname: dataset-github-new-20240218161837018422932.us-central1-.........
query_start_time_microseconds: 2024-02-18 17:06:00.388898
query_duration_ms: 5959
read_rows: 325895567
read_bytes: 2280837740
result_rows: 5
result_bytes: 10768
memory_usage: 8678041105
query: SELECT actor_login, COUNT(DISTINCT number) AS total_activities FROM GitHub.github_events WHERE event_type IN ('IssuesEvent', 'PullRequestEvent') GROUP BY actor_login ORDER BY total_activities DESC LIMIT 5
query_id: 829b3918-324a-4d1a-b50c-8ce87381f144
initial_user: chistadata2024qxzzpd
type: QueryFinish

Query 11:
hostname: dataset-github-new-20240218161837018422932.us-central1-.........
query_start_time_microseconds: 2024-02-18 17:06:16.506420
query_duration_ms: 1136
read_rows: 111271799
read_bytes: 1095578738
result_rows: 1
result_bytes: 1032
memory_usage: 670012228
query: SELECT labels, COUNT(*) AS label_count FROM GitHub.github_events WHERE event_type = 'IssuesEvent' GROUP BY labels ORDER BY label_count DESC LIMIT 1
query_id: b01beb81-d818-4758-b50a-ab8e3599772f
initial_user: chistadata2024qxzzpd
type: QueryFinish

Query 12:
hostname: dataset-github-new-20240218161837018422932.us-central1-.........
query_start_time_microseconds: 2024-02-18 17:06:22.720544
query_duration_ms: 1154
read_rows: 214623768
read_bytes: 831990033
result_rows: 5
result_bytes: 11024
memory_usage: 2470392055
query: SELECT repo_name, COUNT(*) AS mergeable_count FROM GitHub.github_events WHERE event_type = 'PullRequestEvent' AND mergeable = 1 GROUP BY repo_name ORDER BY mergeable_count DESC LIMIT 5
query_id: 0031fdd4-b7b9-4b15-a4f9-c003da395a27
initial_user: chistadata2024qxzzpd
type: QueryFinish

Query 13:
hostname: dataset-github-new-20240218161837018422932.us-central1-.........
query_start_time_microseconds: 2024-02-18 17:06:28.912512
query_duration_ms: 60714
read_rows: 3119798001
read_bytes: 14604929356
result_rows: 100
result_bytes: 17472
memory_usage: 72941047629
query: SELECT actor_login, COUNT(DISTINCT CASE WHEN action = 'opened' THEN repo_name END) AS opened_repos, COUNT(DISTINCT CASE WHEN action = 'merged' THEN repo_name END) AS merged_repos FROM GitHub.github_events GROUP BY actor_login HAVING opened_repos > 5 ORDER BY opened_repos DESC LIMIT 100
query_id: 69fccf31-3380-4dbe-9768-f52c3267ce07
initial_user: chistadata2024qxzzpd
type: QueryFinish

Query 14:
hostname: dataset-github-new-20240218161837018422932.us-central1-.........
query_start_time_microseconds: 2024-02-18 17:07:40.037804
query_duration_ms: 96367
read_rows: 214623768
read_bytes: 1501902701
result_rows: 100
result_bytes: 6672
memory_usage: 80466074159
query: SELECT repo_name, created_at, events_count, AVG(events_count) OVER (PARTITION BY repo_name ORDER BY created_at ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg FROM (SELECT repo_name, created_at, COUNT(*) AS events_count FROM GitHub.github_events WHERE event_type = 'PullRequestEvent' GROUP BY repo_name, created_at) subquery ORDER BY repo_name, created_at LIMIT 100
query_id: e0287f44-a86b-4674-a0b9-02f1ed470ee0
initial_user: chistadata2024qxzzpd
type: QueryFinish

Query 15:
hostname: dataset-github-new-20240218161837018422932.us-central1-.........
query_start_time_microseconds: 2024-02-18 17:09:21.562277
query_duration_ms: 1586
read_rows: 3119798001
read_bytes: 5508995441
result_rows: 21
result_bytes: 552
memory_usage: 2105418
query: SELECT event_type, AVG(comments) AS avg_comments FROM GitHub.github_events GROUP BY event_type;
query_id: eeba95bd-65b1-4971-b2ff-4951bd222151
initial_user: chistadata2024qxzzpd
type: QueryFinish

Query 16:
hostname: dataset-github-new-20240218161837018422932.us-central1-.........
query_start_time_microseconds: 2024-02-18 17:09:28.187139
query_duration_ms: 2166
read_rows: 3119798001
read_bytes: 9736719569
result_rows: 21
result_bytes: 848
memory_usage: 2762889
query: SELECT event_type, AVG(additions) AS avg_additions, AVG(deletions) AS avg_deletions FROM GitHub.github_events GROUP BY event_type;
query_id: ec48f1b7-fa8b-4a2f-abc2-3a43f9755f64
initial_user: chistadata2024qxzzpd
type: QueryFinish

Benefits of the Automation Tool

Efficiency: Automating the benchmarking process saves time and effort, allowing for faster iterations and more frequent testing.
Consistency: By following predefined workflows, the tool ensures consistency in benchmarking procedures, reducing the risk of human error.
Scalability: The tool can be easily scaled to accommodate multiple benchmarking requests simultaneously, making it suitable for both small-scale tests and large-scale deployments.
Transparency: Customers have full visibility into the benchmarking process, from request submission to result visualization, enhancing trust and collaboration.

Analyzing ClickHouse Performance Metrics in Apache Superset

Apache Superset offers a powerful platform for visualizing and analyzing performance metrics of ClickHouse clusters. By leveraging Superset’s capabilities, customers gain valuable insights into the health and efficiency of their ClickHouse deployment. Here’s a breakdown of the key performance metrics available in Superset dashboards:

Average Query Throughput (Rows per Second): This metric indicates the average rate at which queries are processed by the ClickHouse cluster, measured in rows per second. A higher throughput suggests better query performance and efficient resource utilization.
Total Queries: Total Queries provides a count of the queries executed on the ClickHouse cluster within a specified time period. Monitoring this metric helps track workload trends and identify periods of high activity.
Total Query Processing Duration (Seconds): This metric sums up the total duration taken to process all queries within the defined timeframe. It provides an overview of the overall workload and helps in assessing the efficiency of query processing.
Query Duration Quartiles: Query Duration Quartiles break down the distribution of query execution times into quartiles (25th, 50th, 75th percentiles), allowing for a detailed analysis of query performance variability.
Average Query Duration: The Average Query Duration metric calculates the mean time taken to execute queries on the ClickHouse cluster. It serves as a benchmark for evaluating query performance and identifying outliers.
Query Run Time: Query Run Time represents the duration of individual query executions, providing insights into the time taken for specific queries to complete. Deviations from the norm may indicate performance issues or optimization opportunities.
Query Duration – Distribution: This metric visualizes the distribution of query execution durations across the ClickHouse cluster. Understanding the distribution helps in identifying patterns and outliers that may impact overall performance.
Median Query Duration: The Median Query Duration metric provides the central value of query execution times, offering a robust measure of typical query performance unaffected by extreme values.
Memory Usage (Mega Bytes): Memory Usage reflects the amount of memory consumed by the ClickHouse cluster, measured in megabytes. Monitoring memory usage is crucial for ensuring optimal performance and avoiding resource contention.
Read Throughput (Mega Bytes / Second): Read Throughput measures the rate at which data is read from the ClickHouse cluster, expressed in megabytes per second. It helps assess the efficiency of data retrieval operations and storage subsystem performance.
Average Read Throughput (Mega Bytes per Second): Similar to Average Query Throughput, this metric calculates the mean rate of data read operations performed by the ClickHouse cluster per second.
Median Memory Usage (Mega Bytes): The Median Memory Usage metric provides the central value of memory consumption across the ClickHouse cluster, offering insights into typical memory utilization patterns.

By utilizing these performance metrics in Apache Superset dashboards, customers can effectively monitor, analyze, and optimize the performance of their ClickHouse clusters, ensuring optimal efficiency and reliability for their analytical workloads.

As the ClickHouse benchmarking process evolves, it’s expected that additional benchmark metrics and datasets will be incorporated to further enhance performance evaluation and optimization. This ongoing refinement enables organizations to adapt to changing requirements and leverage the latest advancements in benchmarking methodologies. With Apache Superset’s flexibility and scalability, customers can seamlessly integrate new benchmark metrics and datasets into their analysis workflows, ensuring continuous improvement and optimization of ClickHouse clusters for diverse use cases and workloads.

Conclusion

Benchmarking is essential for assessing the performance and scalability of database systems like ClickHouse. By automating the benchmarking process with tools like Google Cloud, GitHub Actions, and Apache Superset, organizations can streamline testing, optimize performance, and make informed decisions about resource allocation. The ClickHouse benchmarking automation tool presented in this blog post exemplifies how automation can simplify complex tasks and empower teams to achieve better results efficiently.

To learn more about ClickHouse Benchmarking, do read the following articles:

☛ ChistaDATA Inc. Contacts

Business Function	Contact
*☎ CONTACT GLOBAL SALES (247)**	(844)395-5717
🚩 ChistaDATA Inc. FAX	+1 (209) 314-2364
📨 ChistaDATA Inc. Email – General / Sales / Consulting / Support	info@chistadata.com
🏠 CORPORATE ADDRESS: CALIFORNIA	ChistaDATA Inc. 440 N BARRANCA AVE #9718 COVINA, CA 91723
🏠 CORPORATE ADDRESS: DELAWARE	ChistaDATA Inc., PO Box 2093 PHILADELPHIA PIKE #3339 CLAYMONT, DE 19703
🏠 CORPORATE ADDRESS: DELAWARE	ChistaDATA Inc., 256 Chapman Road STE 105-4, Newark, New Castle 19702, Delaware
🏠 CORPORATE ADDRESS: HOUSTON	ChistaDATA Inc., 1321 Upland Dr. PMB 19322, Houston, TX, 77043, US

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services

Streamlining ClickHouse Performance Evaluation: Automated ClickHouse Benchmarking with Google Cloud and Apache Superset