How Data Skipping Indexes are implemented in ClickHouse?

Data skipping indexes in ClickHouse help improve query performance by allowing the system to skip over irrelevant data parts while reading from disk. They are a type of secondary index that store summary information (such as min/max values) about primary key ranges. When a query is executed, ClickHouse uses this summary information to determine which data parts need to be read, effectively reducing the amount of data that needs to be processed.

Here’s a detailed explanation of data skipping indexes and their benefits:

  1. Implementation: Data skipping indexes are implemented as part of the MergeTree family of table engines in ClickHouse. They can be created using the CREATE INDEX statement while defining the table schema. Data skipping indexes are stored alongside the data, and their summary information is updated as new data is added or existing data is modified.

There are several types of data skipping indexes in ClickHouse, including:

  • MinMax index: Stores the minimum and maximum values for a given column in each primary key range.
  • Set index: Stores a set of distinct values in a given column for each primary key range.
  • Bloom filter index: Stores a probabilistic data structure that allows for testing whether a given value is present in a column for each primary key range.
  • n-gram index: Stores information about n-grams (substrings of length n) for text columns in each primary key range.
  1. Benefits: Data skipping indexes provide several benefits for ClickHouse:
  • Faster query execution: By skipping irrelevant data parts during query execution, ClickHouse can significantly reduce the amount of data that needs to be read and processed. This results in faster query execution times, especially for large datasets and analytical queries.
  • Improved resource utilization: Data skipping indexes help reduce the amount of I/O and CPU resources needed to execute queries. This allows ClickHouse to handle more concurrent queries and improves overall system performance.
  • Flexible indexing options: ClickHouse offers several types of data skipping indexes, each suited to different types of data and query patterns. This flexibility enables users to choose the most appropriate type of index for their specific use case.
  1. Example: Let’s assume you have a table with IoT sensor data, and you want to query the average temperature for a specific time range:

CREATE TABLE sensor_data
(
timestamp DateTime,
device_id UInt32,
temperature Float64,
humidity Float64
) ENGINE = MergeTree()
ORDER BY (timestamp, device_id);
-- Add data to the table

Now, you want to query the average temperature for a specific time range. Without a data skipping index, ClickHouse would have to scan the entire dataset, which can be slow for large datasets.

To speed up this query, you can create a MinMax data skipping index on the timestamp column:

CREATE TABLE sensor_data
(
timestamp DateTime,
device_id UInt32,
temperature Float64,
humidity Float64,
INDEX min_max_timestamp timestamp TYPE minmax GRANULARITY 4
) ENGINE = MergeTree()
ORDER BY (timestamp, device_id);

Now, when you execute a query like this:

SELECT AVG(temperature)
FROM sensor_data
WHERE timestamp BETWEEN '2023-01-01 00:00:00' AND '2023-01-01 23:59:59';

ClickHouse can use the MinMax index to skip data parts that are outside the specified time range, significantly reducing the amount of data that needs to be read and processed, resulting in faster query execution times.

By using data skipping indexes, you can optimize query performance in ClickHouse and reduce the resources required to process analytical queries.

About Shiv Iyer 211 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.