Practical Guide to Using Indexes in ClickHouse

 

Introduction

In ClickHouse, indexes are implemented as a separate data structure that is stored on disk alongside the table data. The index data structure is used to quickly locate the specific data rows that match a given query condition, without having to scan the entire table. ClickHouse supports several types of indexes, including:
 
  • Primary index: This is the default index created for a table when it is created. It is based on the primary key of the table, if one is defined.
  • Secondary index: This index can be created on any column of a table, and is used to speed up queries that filter on that column.
  • Trie index: This is a specialized index that can be used for text search and prefix matching.
  • Bitmap index: This is a specialized index that can be used for filtering on boolean columns.
 
To use indexes in ClickHouse, you need to ensure that the query conditions match the columns that the indexes are built on. Additionally, you can use the system table ‘system.indices’ to check the indexes and their usage statistics.
 
To create an index on a table, you can use the CREATE INDEX statement, specifying the table name, index name, and the column(s) to index.
 

Example:

CREATE INDEX idx_table_name ON table_name (column1, column2)

It is important to note that, like any other database indexes, indexes in ClickHouse require maintenance, such as rebuilding or optimizing them, to ensure that they continue to provide good performance.

How Indexes work internally in ClickHouse?

In ClickHouse, indexes are implemented using data structures such as B+ Trees, Radix Trees, and Hash Tables. These data structures allow for efficient lookups and range scans of data based on the indexed columns. ClickHouse supports several types of indexes, including primary key, secondary, and full-text indexes.

To use indexes for performance, it is important to understand the types of queries that will be executed against the data and to create indexes that are tailored to support these queries. For example, if you have a table with a large number of rows and you frequently need to retrieve a small subset of these rows based on a specific value in a column, you would want to create an index on that column. This would allow ClickHouse to use the index to quickly locate the relevant rows, rather than having to scan the entire table.

It’s also important to note that indexing doesn’t come without a cost, indexes take up disk space and can increase the time it takes to insert or update data. So it is important to monitor the performance and size of indexes and to update or remove them as needed.

When to use Indexes in ClickHouse?

Indexes are used to speed up query performance in ClickHouse by reducing the amount of data that needs to be scanned. They are particularly useful when querying large datasets and when filtering or sorting by specific columns. It’s important to note that indexes come with a cost in terms of disk space and query execution time, so they should only be used when they will provide significant performance benefits. In general, you should use indexes when:

  • Querying large datasets
  • Filtering by specific columns
  • Sorting by specific columns
  • Joining tables on specific columns
  • Grouping by specific columns

It’s also a good idea to test the performance of queries with and without indexes to determine if they are beneficial for your specific use case.

Monitoring Index usage in ClickHouse

In ClickHouse, you can use the system table system.indices to monitor the usage of indexes. This table provides information on the number of rows, bytes, and query count for each index. Additionally, you can use the system.parts table to monitor the number of parts in an index and the system.merges table to monitor merge operations. You can also use the Python script copied below to monitor ClickHouse Index Usage in the real-time

import time 
from clickhouse_driver import Client

# Connect to the ClickHouse server
client = Client('hostname', user='username', password='password')
while True:
# Execute a query to retrieve index usage statistics
result = client.execute('SELECT index, bytes_read, query_time_ms FROM system.indices')

# Print the results
for row in result:
print(f"Index: {row[0]}, Bytes Read: {row[1]}, Query Time (ms): {row[2]}")

# Sleep for 2 seconds
time.sleep(2)

Conclusion

Note: You can download the Python script for monitoring ClickHouse Index Usage in real-time from our GitHub: https://github.com/shiviyer/Troubleshooting-ClickHouse/blob/main/click-index-usg.py

To read more about Clickhouse Indexes, do consider reading the following articles,

About Shiv Iyer 206 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.