Introduction
We saw how to store vectors and perform vector similarity searches in this blog post. In part 2, we look at the various vector search and storage algorithms available in ClickHouse.
Annoy
ClickHouse’s Annoy (Approximate Nearest Neighbors Oh Yeah) index is based on the Annoy library, an open-source C++ library from Spotify that is used in its music recommendations. It was developed by Erik Bernhardsson while working at Spotify in 2015. The algorithm is based on random projections and binary trees. More details can be found here.
Creating Annoy Index
Let us create a table with an Annoy index.
CREATE TABLE ann_index_example ( Name String, embedding Array(Float32), INDEX ann_index_1 embedding TYPE annoy('L2Distance', 100) ) ENGINE = MergeTree ORDER BY Name;
The annoy index accepts two parameters.
- Distance – L2Distance or cosineDistance
- NumTrees
The distance metric, which is used both while building the index and while performing the search, can be Euclidean distance (L2Distance) or cosine distance (cosineDistance). The second parameter is the number of trees in the index. Fewer trees result in faster searches at the cost of accuracy, while more trees result in more accurate searches that are slower.
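To show how the index is meant to be used, here is a minimal sketch of loading and querying the ann_index_example table. The three-dimensional embeddings and the reference vector are made-up illustration values. The sketch also assumes that the experimental setting allow_experimental_annoy_index has to be enabled in the session before the CREATE TABLE statement above is run.

```sql
-- Assumption: Annoy indexes are experimental in 23.8, so the flag is
-- enabled first (before the CREATE TABLE statement shown earlier).
SET allow_experimental_annoy_index = 1;

-- Made-up 3-dimensional embeddings for illustration only.
-- All rows are given the same dimension.
INSERT INTO ann_index_example (Name, embedding) VALUES
    ('a', [0.10, 0.20, 0.30]),
    ('b', [0.90, 0.80, 0.70]),
    ('c', [0.15, 0.25, 0.35]);

-- Approximate nearest neighbours of a reference vector, ordered by the
-- same distance function the index was created with (L2Distance).
SELECT Name, L2Distance(embedding, [0.1, 0.2, 0.3]) AS distance
FROM ann_index_example
ORDER BY L2Distance(embedding, [0.1, 0.2, 0.3])
LIMIT 2;
```

The query orders by the distance function named in the index definition and includes a LIMIT, which is the query shape the approximate index is designed to accelerate. A query that uses a different distance function would most likely fall back to an exact scan.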
USearch
This index type is based on the hierarchical navigable small world (HNSW) graph algorithm, which was developed by Yu. A. Malkov and D. A. Yashunin in 2016. In ClickHouse, this index is implemented via USearch, an open-source library from Unum Cloud. An HNSW index consists of multiple graphs arranged in a hierarchy. The vectors form the vertices of the graphs, and the edges are based on the similarity of the vertices they connect.
Creating HNSW Index
CREATE TABLE ann_index_example_hnsw ( Name String, embedding Array(Float32), INDEX ann_index_hnsw embedding TYPE usearch('cosineDistance', 'f32') ) ENGINE = MergeTree ORDER BY Name;
This index type accepts two parameters. The first one is the distance function, and ClickHouse supports L2Distance or cosineDistance for the HNSW index. The second parameter is the scalar format used to store the vectors in the HNSW graphs, which makes it possible to store them with reduced precision. The allowed values are i8, f16, f32 and f64.
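A query against the HNSW index follows the same pattern as for Annoy. The sketch below uses the ann_index_example_hnsw table with made-up three-dimensional embeddings, and assumes that the experimental setting allow_experimental_usearch_index has to be enabled in the session before the table is created.

```sql
-- Assumption: USearch indexes are experimental in 23.8, so the flag is
-- enabled first (before the CREATE TABLE statement shown earlier).
SET allow_experimental_usearch_index = 1;

-- Made-up 3-dimensional embeddings for illustration only.
INSERT INTO ann_index_example_hnsw (Name, embedding) VALUES
    ('a', [0.10, 0.20, 0.30]),
    ('b', [0.90, 0.10, 0.05]),
    ('c', [0.20, 0.40, 0.60]);

-- Approximate nearest neighbours by cosine distance, matching the
-- distance function the index was created with.
SELECT Name, cosineDistance(embedding, [0.1, 0.2, 0.3]) AS distance
FROM ann_index_example_hnsw
ORDER BY cosineDistance(embedding, [0.1, 0.2, 0.3])
LIMIT 2;
```

Because the index stores vectors in the chosen scalar format, a narrower format such as f16 or i8 trades some precision in the computed distances for a smaller index.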
Conclusion
ClickHouse’s vector search and storage is still an experimental feature (as of 23.8 LTS) but is being actively developed and maintained. We will probably see more support and more index types in upcoming releases.
To learn more about search in ClickHouse, consider reading the following articles: