Why don’t we recommend INDEX RANGE SCAN on large data sets in ClickHouse?

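INDEX RANGE SCAN is an index access method that retrieves rows whose key falls within a range of values. In ClickHouse, the closest equivalent is a range lookup on the sparse primary index of a MergeTree table, which selects the granules (blocks of rows) that may contain matching keys. While INDEX RANGE SCAN can be very efficient for small data sets or narrow ranges, it can become inefficient on large data sets for the following reasons: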

  1. Index size: As the index grows, a range scan has more index entries to examine and, for wide ranges, more of the table to read. This results in longer query response times and slower overall performance.
  2. Disk I/O: As the index size increases, so does the disk I/O required to read the index and the data it points to. Disk contention and slow disk access times then translate directly into slower queries.
  3. Index fragmentation: In a MergeTree table, every insert creates a new data part with its own index, and updates and deletes typically run as mutations that rewrite parts. Until background merges consolidate those parts, a range scan has to probe each of them, which reduces its efficiency. Inserting data in non-sequential key order makes this worse, because the key ranges of different parts overlap.
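The first two points can be illustrated with a toy model of a sparse index. The sketch below is not ClickHouse code: the function names are hypothetical, keys are assumed to be unique and sorted, and the granule size of 8192 rows is taken from ClickHouse's default `index_granularity`. It only shows how the number of granules a range scan must read grows with the width of the requested range.

```python
from bisect import bisect_right

GRANULE_ROWS = 8192  # assumed granule size (ClickHouse's default index_granularity)


def build_sparse_index(sorted_keys, granule_rows=GRANULE_ROWS):
    """Keep only the first key of every granule, as a sparse index would."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_rows)]


def granules_for_range(marks, lo, hi):
    """Return (first, last) granule numbers, last exclusive, that may hold
    keys in [lo, hi]. Assumes unique, sorted keys."""
    # Last granule whose first key is <= lo (clamped to the first granule).
    first = max(bisect_right(marks, lo) - 1, 0)
    # First granule whose first key is > hi.
    last = bisect_right(marks, hi)
    return first, max(last, first)


def rows_read(marks, lo, hi, granule_rows=GRANULE_ROWS):
    """Upper bound on rows read: whole granules are read, not single rows."""
    first, last = granules_for_range(marks, lo, hi)
    return (last - first) * granule_rows


# Hypothetical table with one million rows keyed 0..999_999.
keys = range(1_000_000)
marks = build_sparse_index(keys)

print("narrow range:", rows_read(marks, 500_000, 500_100), "rows read")
print("wide range:  ", rows_read(marks, 100_000, 900_000), "rows read")
```

In this model the lookup in the index itself is a cheap binary search. The cost that grows with the data set is the number of granules selected, which is why wide ranges on large tables read so much more data than narrow ones.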

To mitigate the inefficiencies of INDEX RANGE SCAN on large data sets, ClickHouse provides several optimization techniques, including:

  1. Using a covering index: A covering index includes all the columns required for a query, so the query can be answered from the index alone without accessing the underlying table. ClickHouse has no B-tree-style covering index; the closest equivalents are projections and materialized views, which store the required columns in a sort order that suits the query. This can significantly improve query performance, especially on large data sets.
  2. Using a compressed index: ClickHouse supports several compression codecs, such as LZ4 and ZSTD, which reduce the disk space required for column data and index files. Less data on disk means less disk I/O per query, which improves performance on large data sets.
  3. Using a different index access method: ClickHouse supports data skipping indexes such as minmax, set, and bloom_filter. These let a query skip blocks of rows that cannot match instead of scanning a key range, which can improve query performance on large data sets, especially when the indexed column has low cardinality.
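To make the third technique concrete, here is a minimal sketch of how a set-style skipping index can avoid reading blocks on a low-cardinality column. It is a simplified model loosely inspired by ClickHouse's `set` data skipping index, not its implementation; the function names, the block size, and the sample data are all assumptions made for illustration.

```python
def build_set_index(values, block_rows):
    """For each block of rows, record the set of distinct values it contains."""
    return [
        set(values[i:i + block_rows])
        for i in range(0, len(values), block_rows)
    ]


def blocks_to_read(set_index, wanted):
    """Return the numbers of blocks that may contain `wanted`.

    Blocks whose value set does not include `wanted` are skipped entirely.
    """
    return [n for n, seen in enumerate(set_index) if wanted in seen]


# Hypothetical low-cardinality column (a region code), 4 rows per block.
regions = ["eu"] * 4 + ["us"] * 4 + ["eu", "ap"] * 2
index = build_set_index(regions, block_rows=4)

print("blocks read for 'us':", blocks_to_read(index, "us"))
print("blocks read for 'eu':", blocks_to_read(index, "eu"))
```

The fewer distinct values each block holds, the smaller the per-block sets and the more blocks a query can rule out, which is why this kind of index suits low-cardinality data.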

Overall, INDEX RANGE SCAN can be efficient on small data sets but becomes inefficient on large ones because of index size, disk I/O, and index fragmentation. Optimization techniques such as covering indexes, compressed indexes, and different index access methods help ClickHouse maintain query performance on large data sets.

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv is currently the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker at open source software conferences globally.