Introduction to Data Skipping Indexes
When working on ClickHouse, many different factors can affect its performance of Clickhouse. One of the most critical elements is that ClickHouse determines whether it will use a primary key when evaluating the sentence condition. In accordance with this, the most effective table design is the selection of a primary key that applies to the most commonly used queries.
Considering this situation, even if the primary key is selected in a way, it will inevitably use the use of unpaid queries. Users usually rely on ClickHouse for time series type data, but they will want to analyze the same data according to other business dimensions such as customer identity, and website URL. In this case, the query performance may be adversely affected, because when you apply the WHERE requirement, the screening will be performed for each column.
While ClickHouse is still relatively fast in those circumstances, evaluating millions or billions of individual values will cause “non-indexed” queries to execute much more slowly than those based on the primary key.
In an RDBMS, one approach to this problem is to attach one or more “secondary” indexes to a table. This is a b-tree structure that permits the database to find all matching rows on disk in O(log(n)) time instead of O(n) time (a table scan), where n is the number of rows. However, this type of secondary index will not work for ClickHouse (or other column-oriented databases) because there are no individual rows on the disk to add to the index.
Instead, ClickHouse provides a different directory option that can greatly increase the query rate in certain cases. This structure is called ‘skip’ indexes. This means skipping reading to read this data when it is guaranteed that there is no matching value in ClickHouse.
Choosing The Data Skipping Index
Clickhouse MergeTree table engine provides a few data skipping indexes which makes queries faster by skipping granules of data (A granule is the smallest indivisible data set that ClickHouse reads when selecting data) and therefore reducing the amount of data to read from disk. ngrambf_v1
and tokenbf_v1
are two interesting indexes using bloom filters for optimizing filtering of Strings. A bloom filter is a space-efficient probabilistic data structure allowing to test whether an element is a member of a set.
ngrambf_v1
A string is split into substrings of n characters. For example, n=3
ngram (trigram) of 'hello world'
is ['hel', 'ell', 'llo', lo ', 'o w' ...]
. The ngrams of each column value will be stored in the bloom filter.
When searching with a filter column LIKE 'hello'
the string in the filter will also be split into ngrams ['hel', 'ell', 'llo']
and a lookup is done for each value in the bloom filter. If all the ngram values are present in the bloom filter we can consider that the searched string is present in the bloom filter.
Functions with a constant argument that is less than ngram size can’t be used by ngrambf_v1
for query optimization. For example, searching for ‘hi’
will not trigger a ngrambf_v1
index with n=3
. Small n allows supporting more searched strings. But small n leads to more ngram values which mean more hashing and eventually more false positives. False positive means reading data which do not contain any rows that match the searched string.
Since false positive matches are possible in bloom filters, the index cannot be used when filtering with negative operators such as column_name != 'value’
or column_name NOT LIKE ‘%hello%’
.
tokenbf_v1
tokenbf_v1
splits the string into tokens separated by non-alphanumeric characters and stores tokens in the bloom filter. ‘Hello world’
is split into 2 tokens [‘hello’, ‘world’]
.
In addition to the limitation of not supporting negative operators, the searched string must contain at least a complete token. In the above example, searching for `hel`
will not trigger the index.
Once we understand how each index behaves, tokenbf_v1
turns out to be a better fit for indexing HTTP URLs, because HTTP URLs are typically path segments separated by /. Each path segment will be stored as a token. Splitting the URls into ngrams would lead to much more sub-strings to store. The index size needs to be larger and lookup will be less efficient.
Configuring the Data Skipping Index
Tokenbf_v1 Index needs to be configured with several parameters. Firstly, it is stated how many granular data will be added to a single block index by using Bloom Filters. The whole block will be sciented or not, depending on whether the sought-after value is visible in the block, or at all. The number of lines in each granule is defined by the index_granularity setting of the table.
In addition, we need to predict the number of coins in each data granule. The number of coins corresponds to the number of different road segments. After fixing the N, the number of the coin values, the P, which is the number of incorrect positive ratio and K hash functions, gives us the Bloom Filters size.
The index can be created on a column or on an expression if we apply some functions to the column in the query. In our case searching for HTTP URLs is not case sensitive so we have created the index on lowerUTF8(http_url)
.
Index create syntax is like;
ADD INDEX tokenbf_http_url_index lowerUTF8(http_url) TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
Result: Index size
The size of the tokenbf_v1
index before compression can be calculated as following:
Bloom_filter_size x number_of_blocks
Number_of_blocks = number_of_rows / (table_index_granularity * tokenbf_index_granularity)
You can check the size of the index file in the directory of the partition in the file system. The file is named as skp_idx_{index_name}.idx
.
In our case, the size of the index on the HTTP URL column is only 0.1% of the disk size of all data in that partition.
Result: Query speed
The questioning rate depends on two factors; Thanks to Index Lookup and Index, how many blocks can be skipped.
According to our tests, the index-looking time cannot be neglected. For example, if the directory of the directory is set to 1, it can take up to a few seconds in our data set. We decided to set the directory of the directory to 4 to get the directory search time in our data set in a second.
The number of blocks that can be skipped depends on how often the data is required and how it is distributed in the table. Our table is sorted by the time stamp, so if the sought call is regularly in almost every block, we will see a performance improvement because no data is missed. On the contrary, if the call that matches the query appears in only a few blocks, a small amount of data should be read that makes the query much faster.
Example
If some portion of the WHERE clause filtering condition matches the skip index expression when executing a query and reading the relevant column files, ClickHouse will use the index file data to determine whether each relevant block of data must be processed or can be bypassed (assuming that the block has not already been excluded by applying the primary key). To use a very simplified example, consider the following table loaded with predictable data.
CREATE TABLE skip_table ( my_key UInt64, my_value UInt64 ) ENGINE MergeTree primary key my_key SETTINGS index_granularity=8192; INSERT INTO skip_table SELECT number, intDiv(number,4096) FROM numbers(100000000);
When executing a simple query that does not use the primary key, all 100 million entries in the my_value
column are scanned:
SELECT * FROM skip_table WHERE my_value IN (125, 700) ┌─my_key─┬─my_value─┐ │ 512000 │ 125 │ │ 512001 │ 125 │ │ ... | ... | └────────┴──────────┘ 8192 rows in set. Elapsed: 0.079 sec. Processed 100.00 million rows, 800.10 MB (1.26 billion rows/s., 10.10 GB/s.
Now add a very basic skip index:
ALTER TABLE skip_table ADD INDEX vix my_value TYPE set(100) GRANULARITY 2;
Normally skip indexes are only applied on newly inserted data, so just adding the index won’t affect the above query.
To index already existing data, use this statement:
ALTER TABLE skip_table MATERIALIZE INDEX vix;
Rerun the query with the newly created index:
SELECT * FROM skip_table WHERE my_value IN (125, 700) ┌─my_key─┬─my_value─┐ │ 512000 │ 125 │ │ 512001 │ 125 │ │ ... | ... | └────────┴──────────┘ 8192 rows in set. Elapsed: 0.051 sec. Processed 32.77 thousand rows, 360.45 KB (643.75 thousand rows/s., 7.08 MB/s.)
Instead of processing 100 million rows of 800 megabytes, ClickHouse has only read and analyzed 32768 rows of 360 kilobytes — four granules of 8192 rows each.
Conclusion
The bloom_filter index and its 2 variants ngrambf_v1 and tokenbf_v1 all have some limitations. They do not support filtering with all operators. The performance improvement depends on how frequently the searched data occurred and how it is spread across the whole dataset so it’s not guaranteed for all queries.
We recommend you, try the data skipping index to improve the performance of your Clickhouse queries, especially since it’s relatively cheap to put in place. It only takes a bit more disk space depending on the configuration and it could speed up the query by 4-5 times depending on the amount of data that can be skipped.
To read more about Indexes in ClickHouse, do consider reading the following articles: