How to Use Indexes in ClickHouse

Table of Contents

Introduction

Indexes in ClickHouse are implemented as a separate data structure that is stored on disk alongside the table data. The index data structure is used to quickly locate the specific data rows that match a given query condition, without having to scan the entire table.

Types of Indexes in ClickHouse

ClickHouse supports several types of indexes, including:

Primary index: This is the default index created for a table when it is created. It is based on the primary key of the table, if one is defined.
Secondary index: This index can be created on any column of a table, and is used to speed up queries that filter on that column.
Trie index: This is a specialized index that can be used for text search and prefix matching.
Bitmap index: This is a specialized index that can be used for filtering on boolean columns.

To use indexes in ClickHouse, you need to ensure that the query conditions match the columns that the indexes are built on. Additionally, you can use the system table ‘system.indices’ to check the indexes and their usage statistics.

To create an index on a table, you can use the CREATE INDEX statement, specifying the table name, index name, and the column(s) to index.

Example:

CREATE INDEX idx_table_name ON table_name (column1, column2)

It is important to note that, like any other database indexes, indexes in ClickHouse require maintenance, such as rebuilding or optimizing them, to ensure that they continue to provide good performance.

How Indexes work internally in ClickHouse?

In ClickHouse, indexes are implemented using data structures such as B+ Trees, Radix Trees, and Hash Tables. These data structures allow for efficient lookups and range scans of data based on the indexed columns. ClickHouse supports several types of indexes, including primary key, secondary, and full-text indexes.

To use indexes for performance, it is important to understand the types of queries that will be executed against the data and to create indexes that are tailored to support these queries. For example, if you have a table with a large number of rows and you frequently need to retrieve a small subset of these rows based on a specific value in a column, you would want to create an index on that column. This would allow ClickHouse to use the index to quickly locate the relevant rows, rather than having to scan the entire table.

It’s also important to note that indexing doesn’t come without a cost, indexes take up disk space and can increase the time it takes to insert or update data. So it is important to monitor the performance and size of indexes and to update or remove them as needed.

When to use Indexes in ClickHouse?

Indexes are used to speed up query performance in ClickHouse by reducing the amount of data that needs to be scanned. They are particularly useful when querying large datasets and when filtering or sorting by specific columns. It’s important to note that indexes come with a cost in terms of disk space and query execution time, so they should only be used when they will provide significant performance benefits. In general, you should use indexes when:

Querying large datasets
Filtering by specific columns
Sorting by specific columns
Joining tables on specific columns
Grouping by specific columns

It’s also a good idea to test the performance of queries with and without indexes to determine if they are beneficial for your specific use case.

Monitoring Index usage in ClickHouse

In ClickHouse, you can use the system table system.indices to monitor the usage of indexes. This table provides information on the number of rows, bytes, and query count for each index. Additionally, you can use the system.parts table to monitor the number of parts in an index and the system.merges table to monitor merge operations. You can also use the Python script copied below:

import time 
from clickhouse_driver import Client 

# Connect to the ClickHouse server 
client = Client('hostname', user='username', password='password') 
while True: 
        # Execute a query to retrieve index usage statistics 
        result = client.execute('SELECT index, bytes_read, query_time_ms FROM system.indices') 

        # Print the results 
         for row in result: 
             print(f"Index: {row[0]}, Bytes Read: {row[1]}, Query Time (ms): {row[2]}") 
        
        # Sleep for 2 seconds 
        time.sleep(2)

Conclusion

Note: You can download the Python script for monitoring ClickHouse Index Usage in real-time from our GitHub: https://github.com/shiviyer/Troubleshooting-ClickHouse/blob/main/click-index-usg.py

To read more about Clickhouse Indexes, do consider reading the following articles,

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services

Practical Guide to Using Indexes in ClickHouse