Introduction
Benchmarking is critical to ensuring the performance and scalability of database systems. By analyzing how a database behaves under different scenarios, developers and administrators can make informed decisions about optimization and resource allocation. ClickHouse, a popular open-source analytical database management system, is no exception: thorough benchmarking is essential to keeping it performing at its best.
In this blog post, we’ll delve into the advantages of benchmarking, followed by a detailed explanation of a Proof of Concept (POC) automation tool designed specifically for ClickHouse benchmarking. This tool leverages the power of Google Cloud for infrastructure provisioning, GitHub Actions for automation workflows, and Apache Superset for visualization of benchmark results.
Advantages of Benchmarking
Benchmarking offers several advantages:
- Performance Optimization: Identifying bottlenecks and performance limitations allows for targeted optimizations, improving overall system performance.
- Scalability Testing: Benchmarking helps determine how well a system scales with increasing workload or data volume, enabling efficient resource allocation.
- Quality Assurance: By comparing performance metrics against predefined standards, benchmarking ensures the reliability and stability of the database system.
- Cost-Efficiency: Optimizing performance and resource allocation through benchmarking can lead to cost savings by utilizing resources more efficiently.
Automating ClickHouse Benchmarking
The ClickHouse benchmarking automation tool simplifies the process of setting up benchmark tests, executing them, and visualizing the results. Here’s how it works:
- Customer Request: Customers submit a request via GitHub specifying the example dataset they want to benchmark (e.g., analytics, logs).
- Infrastructure Provisioning: The automation tool creates an instance on Google Cloud Platform (GCP) and loads the specified dataset onto it.
- Benchmark Execution: Once the instance is up, benchmark tests are executed using predefined SQL queries that simulate real-world scenarios to assess ClickHouse's performance (a minimal sketch of such a runner follows this list).
- Result Reporting: Upon completion of the benchmark tests, a comprehensive report is generated, including detailed performance metrics such as query execution time, throughput, and resource utilization.
- Email Notification: The generated report is automatically sent to the customer via email.
- Visualization with Apache Superset: Benchmark results are stored in a dedicated reporting database, and Apache Superset visualizes them through interactive dashboards for easy analysis and interpretation.
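For illustration, here is a minimal sketch of what the benchmark-execution step could look like in Python, assuming the clickhouse-driver client library. The host, credentials, and query list are placeholders for this post, not the tool's actual implementation:

```python
# Minimal benchmark-runner sketch. Host, credentials, and the query set
# below are illustrative placeholders; the tool's real query set is larger.
import time

from clickhouse_driver import Client  # pip install clickhouse-driver

# Predefined benchmark queries for the GitHub events dataset.
BENCHMARK_QUERIES = [
    "SELECT repo_name, COUNT(*) AS events FROM GitHub.github_events "
    "GROUP BY repo_name ORDER BY events DESC LIMIT 5",
    "SELECT event_type, COUNT(*) AS cnt FROM GitHub.github_events GROUP BY event_type",
]


def run_benchmark(host: str, user: str, password: str) -> list:
    """Run each predefined query once and record client-side timings."""
    client = Client(host=host, user=user, password=password)
    results = []
    for number, sql in enumerate(BENCHMARK_QUERIES, start=1):
        started = time.perf_counter()
        rows = client.execute(sql)
        elapsed_ms = round((time.perf_counter() - started) * 1000)
        results.append(
            {"query_no": number, "duration_ms": elapsed_ms, "result_rows": len(rows)}
        )
    return results


if __name__ == "__main__":
    # Hypothetical connection details for a freshly provisioned GCP instance.
    for entry in run_benchmark("benchmark-host.example.com", "bench_user", "secret"):
        print(entry)
```

Client-side timings like these complement the server-side statistics that ClickHouse itself records for every query, which is where the report below gets its numbers.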
Example Report
Each entry below is one finished query from a benchmark run on the GitHub events dataset. Every entry shares the same hostname (dataset-github-new-20240218161837018422932.us-central1-.........), initial_user (chistadata2024qxzzpd), and type (QueryFinish).

| Query | query_start_time_microseconds | query_duration_ms | read_rows | read_bytes | result_rows | result_bytes | memory_usage |
|---|---|---|---|---|---|---|---|
| 10 | 2024-02-18 17:06:00.388898 | 5959 | 325895567 | 2280837740 | 5 | 10768 | 8678041105 |
| 11 | 2024-02-18 17:06:16.506420 | 1136 | 111271799 | 1095578738 | 1 | 1032 | 670012228 |
| 12 | 2024-02-18 17:06:22.720544 | 1154 | 214623768 | 831990033 | 5 | 11024 | 2470392055 |
| 13 | 2024-02-18 17:06:28.912512 | 60714 | 3119798001 | 14604929356 | 100 | 17472 | 72941047629 |
| 14 | 2024-02-18 17:07:40.037804 | 96367 | 214623768 | 1501902701 | 100 | 6672 | 80466074159 |
| 15 | 2024-02-18 17:09:21.562277 | 1586 | 3119798001 | 5508995441 | 21 | 552 | 2105418 |
| 16 | 2024-02-18 17:09:28.187139 | 2166 | 3119798001 | 9736719569 | 21 | 848 | 2762889 |

The SQL statements behind these entries:

- Query 10 (query_id 829b3918-324a-4d1a-b50c-8ce87381f144): SELECT actor_login, COUNT(DISTINCT number) AS total_activities FROM GitHub.github_events WHERE event_type IN ('IssuesEvent', 'PullRequestEvent') GROUP BY actor_login ORDER BY total_activities DESC LIMIT 5
- Query 11 (query_id b01beb81-d818-4758-b50a-ab8e3599772f): SELECT labels, COUNT(*) AS label_count FROM GitHub.github_events WHERE event_type = 'IssuesEvent' GROUP BY labels ORDER BY label_count DESC LIMIT 1
- Query 12 (query_id 0031fdd4-b7b9-4b15-a4f9-c003da395a27): SELECT repo_name, COUNT(*) AS mergeable_count FROM GitHub.github_events WHERE event_type = 'PullRequestEvent' AND mergeable = 1 GROUP BY repo_name ORDER BY mergeable_count DESC LIMIT 5
- Query 13 (query_id 69fccf31-3380-4dbe-9768-f52c3267ce07): SELECT actor_login, COUNT(DISTINCT CASE WHEN action = 'opened' THEN repo_name END) AS opened_repos, COUNT(DISTINCT CASE WHEN action = 'merged' THEN repo_name END) AS merged_repos FROM GitHub.github_events GROUP BY actor_login HAVING opened_repos > 5 ORDER BY opened_repos DESC LIMIT 100
- Query 14 (query_id e0287f44-a86b-4674-a0b9-02f1ed470ee0): SELECT repo_name, created_at, events_count, AVG(events_count) OVER (PARTITION BY repo_name ORDER BY created_at ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg FROM (SELECT repo_name, created_at, COUNT(*) AS events_count FROM GitHub.github_events WHERE event_type = 'PullRequestEvent' GROUP BY repo_name, created_at) subquery ORDER BY repo_name, created_at LIMIT 100
- Query 15 (query_id eeba95bd-65b1-4971-b2ff-4951bd222151): SELECT event_type, AVG(comments) AS avg_comments FROM GitHub.github_events GROUP BY event_type;
- Query 16 (query_id ec48f1b7-fa8b-4a2f-abc2-3a43f9755f64): SELECT event_type, AVG(additions) AS avg_additions, AVG(deletions) AS avg_deletions FROM GitHub.github_events GROUP BY event_type;
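The fields in this report map one-to-one onto columns of ClickHouse's system.query_log system table. As a minimal sketch (the connection details are hypothetical, and the tool's actual report generator may work differently), the per-query details for a given benchmark user could be pulled like this:

```python
# Sketch: fetch report fields for finished benchmark queries from
# system.query_log. Host and user below are illustrative placeholders.
from clickhouse_driver import Client  # pip install clickhouse-driver

REPORT_SQL = """
SELECT
    hostname,
    query_start_time_microseconds,
    query_duration_ms,
    read_rows,
    read_bytes,
    result_rows,
    result_bytes,
    memory_usage,
    query,
    query_id,
    initial_user
FROM system.query_log
WHERE type = 'QueryFinish'          -- only successfully completed queries
  AND initial_user = %(user)s       -- restrict to the benchmark user
ORDER BY query_start_time_microseconds
"""


def fetch_report_rows(host: str, bench_user: str) -> list:
    """Return one tuple per finished query, ordered by start time."""
    client = Client(host=host)
    return client.execute(REPORT_SQL, {"user": bench_user})


if __name__ == "__main__":
    for row in fetch_report_rows("benchmark-host.example.com", "bench_user"):
        print(row)
```

Filtering on type = 'QueryFinish' matters: system.query_log also records QueryStart and exception events, which would otherwise double-count or skew the metrics.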
Benefits of the Automation Tool
- Efficiency: Automating the benchmarking process saves time and effort, allowing for faster iterations and more frequent testing.
- Consistency: By following predefined workflows, the tool ensures consistency in benchmarking procedures, reducing the risk of human error.
- Scalability: The tool can be easily scaled to accommodate multiple benchmarking requests simultaneously, making it suitable for both small-scale tests and large-scale deployments.
- Transparency: Customers have full visibility into the benchmarking process, from request submission to result visualization, enhancing trust and collaboration.

Analyzing ClickHouse Performance Metrics in Apache Superset
Apache Superset offers a powerful platform for visualizing and analyzing performance metrics of ClickHouse clusters. By leveraging Superset's capabilities, customers gain valuable insights into the health and efficiency of their ClickHouse deployment. Here's a breakdown of the key performance metrics available in Superset dashboards (a sketch showing how these aggregates can be derived from the query log follows the list):
- Average Query Throughput (Rows per Second): This metric indicates the average rate at which queries are processed by the ClickHouse cluster, measured in rows per second. A higher throughput suggests better query performance and efficient resource utilization.
- Total Queries: Total Queries provides a count of the queries executed on the ClickHouse cluster within a specified time period. Monitoring this metric helps track workload trends and identify periods of high activity.
- Total Query Processing Duration (Seconds): This metric sums up the total duration taken to process all queries within the defined timeframe. It provides an overview of the overall workload and helps in assessing the efficiency of query processing.
- Query Duration Quartiles: Query Duration Quartiles break down the distribution of query execution times into quartiles (25th, 50th, 75th percentiles), allowing for a detailed analysis of query performance variability.
- Average Query Duration: The Average Query Duration metric calculates the mean time taken to execute queries on the ClickHouse cluster. It serves as a benchmark for evaluating query performance and identifying outliers.
- Query Run Time: Query Run Time represents the duration of individual query executions, providing insights into the time taken for specific queries to complete. Deviations from the norm may indicate performance issues or optimization opportunities.
- Query Duration – Distribution: This metric visualizes the distribution of query execution durations across the ClickHouse cluster. Understanding the distribution helps in identifying patterns and outliers that may impact overall performance.
- Median Query Duration: The Median Query Duration metric provides the central value of query execution times, offering a robust measure of typical query performance unaffected by extreme values.
- Memory Usage (Megabytes): Memory Usage reflects the amount of memory consumed by the ClickHouse cluster, measured in megabytes. Monitoring memory usage is crucial for ensuring optimal performance and avoiding resource contention.
- Read Throughput (Megabytes per Second): Read Throughput measures the rate at which data is read from the ClickHouse cluster, expressed in megabytes per second. It helps assess the efficiency of data retrieval operations and storage subsystem performance.
- Average Read Throughput (Megabytes per Second): Similar to Average Query Throughput, this metric calculates the mean rate of data read operations performed by the ClickHouse cluster per second.
- Median Memory Usage (Megabytes): The Median Memory Usage metric provides the central value of memory consumption across the ClickHouse cluster, offering insights into typical memory utilization patterns.
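Most of these dashboard tiles are straightforward aggregations over system.query_log. The following is a minimal sketch of how such aggregates could be computed in a single query; the reporting host name is a placeholder, and the tool's actual dashboard queries may be built differently:

```python
# Sketch: compute dashboard-style aggregates from system.query_log.
# greatest(query_duration_ms, 1) guards against division by zero for
# sub-millisecond queries; 1048576 converts bytes to megabytes.
from clickhouse_driver import Client  # pip install clickhouse-driver

METRICS_SQL = """
SELECT
    count()                                                            AS total_queries,
    sum(query_duration_ms) / 1000                                      AS total_duration_s,
    avg(query_duration_ms)                                             AS avg_duration_ms,
    median(query_duration_ms)                                          AS median_duration_ms,
    quantiles(0.25, 0.5, 0.75)(query_duration_ms)                      AS duration_quartiles_ms,
    avg(read_rows * 1000 / greatest(query_duration_ms, 1))             AS avg_rows_per_second,
    avg(read_bytes * 1000 / greatest(query_duration_ms, 1)) / 1048576  AS avg_read_mb_per_second,
    median(memory_usage) / 1048576                                     AS median_memory_mb
FROM system.query_log
WHERE type = 'QueryFinish' AND query_kind = 'Select'
"""

if __name__ == "__main__":
    client = Client(host="reporting-db.example.com")  # hypothetical reporting host
    print(client.execute(METRICS_SQL))
```

In Superset, each of these expressions can back an individual chart or big-number tile pointed at the reporting database.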
By utilizing these performance metrics in Apache Superset dashboards, customers can effectively monitor, analyze, and optimize the performance of their ClickHouse clusters, ensuring optimal efficiency and reliability for their analytical workloads.
As the ClickHouse benchmarking process evolves, it’s expected that additional benchmark metrics and datasets will be incorporated to further enhance performance evaluation and optimization. This ongoing refinement enables organizations to adapt to changing requirements and leverage the latest advancements in benchmarking methodologies. With Apache Superset’s flexibility and scalability, customers can seamlessly integrate new benchmark metrics and datasets into their analysis workflows, ensuring continuous improvement and optimization of ClickHouse clusters for diverse use cases and workloads.
Conclusion
Benchmarking is essential for assessing the performance and scalability of database systems like ClickHouse. By automating the benchmarking process with tools like Google Cloud, GitHub Actions, and Apache Superset, organizations can streamline testing, optimize performance, and make informed decisions about resource allocation. The ClickHouse benchmarking automation tool presented in this blog post exemplifies how automation can simplify complex tasks and empower teams to achieve better results efficiently.
To learn more about ClickHouse Benchmarking, do read the following articles:
- ChistaDATA Cloud for ClickHouse: Benchmarking on OpenTelemetry Dataset
- Benchmarking ClickHouse Performance with Brown University’s Benchmark
- Benchmarking ClickHouse Using the clickhouse-benchmark Tool
- How to Prepare ClickHouse for TPC-DS Benchmarking: Part 1