ClickHouse Performance: How to assess Accuracy of Cardinality Estimates in Execution Plans

Introduction

In ClickHouse, evaluating the accuracy of cardinality estimates in a query plan can be challenging since ClickHouse relies on different heuristics and sampling techniques to estimate cardinalities.

Accuracy of Cardinality Estimates

However, you can take the following steps to gain insights into the accuracy of cardinality estimates:

  1. Analyze the Query Plan: Obtain the query plan using the EXPLAIN or PROFILE command in ClickHouse. Review the plan to understand the operations involved, including joins, filters, and aggregations. Pay attention to the estimated number of rows and cardinalities at each step.
  2. Compare Estimates with Actual Data: Execute the query and observe the actual number of rows returned by the query. Compare these actual values with the estimated cardinalities in the query plan. If the estimated and actual cardinalities align closely, it indicates that the estimates are reasonably accurate. However, significant discrepancies might indicate inaccurate estimates.
  3. Use the SAMPLE Clause: ClickHouse provides the SAMPLE clause to perform sampling during query execution. By specifying a sample size, you can obtain a subset of the data to estimate cardinalities more accurately. Running the query with different sample sizes and comparing the results can help you gauge the accuracy of cardinality estimates.
  4. Analyze Query Performance: Cardinality estimates directly impact query execution plans and performance. If you observe query performance issues or unexpected execution times, it might indicate inaccurate estimates. Long execution times, excessive resource usage, or suboptimal query plans can be signs of inaccurate cardinality estimates.
  5. Adjust ClickHouse Settings: ClickHouse provides configuration parameters that influence cardinality estimation, such as statistics_sampling, optimize_min_equality_disjunction_chain_length, and enable_internal_optimizer. Experimenting with these settings can help improve cardinality estimates based on your specific dataset and query patterns.
  6. Gather Accurate Statistics: To enhance cardinality estimation accuracy, ensure that ClickHouse has up-to-date and representative statistics about your data. ClickHouse supports manual statistics gathering using the OPTIMIZE TABLE statement or automatic gathering via the STATS_AUTO_RECALCULATION setting. Keeping statistics current helps improve cardinality estimates.

Conclusion

It’s important to note that ClickHouse’s cardinality estimation is not always exact due to its distributed and columnar nature. The accuracy depends on various factors such as data distribution, table structure, query complexity, and available statistics. Monitoring query performance, experimenting with settings, and validating estimates against actual data can aid in assessing the accuracy of cardinality estimates in ClickHouse.

To read more about ClickHouse, do give the following a read

About Shiv Iyer 211 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.