ClickHouse is a well-known, lightning-fast columnar data store. ClickHouse is primarily used for storing analytical data in tabular format. With the recent advancements in AI and Machine learning, there is a rapid increase in demand for storing vector data and performing fast searches to find similar vectors. ClickHouse also has support to store and search the vectors. Before we get into any of this, let us look at some of the essential terms that will be used in this article.
Structured and Unstructured Data
Structured data consists of clearly defined data types and can be stored in database tables consisting of rows and columns. Unstructured data is neither organized in a predefined manner nor with a pre-defined data model. Ex- Audio, Images, text corpus, etc.
In linear algebra, Vectors are finite and ordered lists of one or more scalars with direction. In computer programming, Vectors are represented as a 1-D array of real numbers. Vectors can be multi-dimensional, and a N-dimensional vector is represented as a 1-D array with N number of elements.
In machine learning and AI, embedding vectors are generated by applying a mathematical transformation, a trained machine learning model, or an embedding function on raw data. Embedding vectors can hold the information representing the data. These vectors can be used to find similar images, text, audio or video, etc.
Need to store vectors
Storing, indexing, and searching across large sets of unstructured data is a very big challenge, and vectorizing the data using machine learning/deep learning models is one of the proven solutions to reduce the storage requirement and faster searching.
Storing Vectors in ClickHouse
Vectors can be stored in ClickHouse in columns with either array or tuple data type. While the vector embeddings generated from an embedding model are of fixed dimension (length), we can still store vectors of varying length in ClickHouse array or tuples.
Indexing the Vector fields
While searching for similar vectors, we can either use brute force search using distance functions, which is slow but accurate, or index the vector fields to perform faster but less accurate searches. ClickHouse has experimental support for Approximate Nearest Neighbor Search Indexes in MergeTree table engines. The following indexing algorithms are currently supported (23.8 LTS). The following indexing algorithms are supported.
- HNSW (USEARCH library)
Now that we have covered some of the basics let us look at a practical example of storing and searching vectors in ClickHouse.
Storing and Searching for Similar Vectors in ClickHouse
Since the ANN indexes are experimental in CLickHouse, you will have to enable this feature until it is marked stable. Connect to the ClickHouse server via the clickhouse-client tool and execute the following SQL to enable the experimental feature on ANN indexes
SET allow_experimental_annoy_index = 1
To store and index vectors, let us create a simple table in ClickHouse using the MergeTree table engine.
CREATE TABLE ann_index_example ( Name String, embedding Array(Float32), INDEX ann_index_1 embedding TYPE annoy('L2Distance') ) ENGINE = MergeTree ORDER BY Name;
This table has just two columns called Name and Embedding. The embedding column is of Array data type. We are creating an index on the embedding field called ann_index_1. This index will be based on ANNOY and use L2 (Euclidean) distance to find the similarity between the vectors.
Use the below python script to generate random data and insert it into the table.
from clickhouse_driver import Client import random import string # function to generate data for Name field def gen_rand_string(str_len, n_str): letters = string.ascii_lowercase random_string = [''.join(random.choice(letters) for i in range(str_len)) for x in range(n_str)] return random_string # function to generate data for embedding field def gen_rand_embeddings(emb_len, n_emb): emb = [[random.uniform(0, 1.0) for y in range(emb_len)] for x in range(n_emb)] return emb # Prepare data n_records = 1000 # number of records that will be inserted to the table name_len = 5 vector_len = 5 strs = gen_rand_string(name_len, n_records) embds = gen_rand_embeddings(vector_len, n_records) data = [ (strs[x], embds[x]) for x in range(n_records)] # insert to ClickHouse server client = Client(host='localhost') client.execute( 'INSERT INTO ann_index_example VALUES', data); client.disconnect()
After the data is inserted, let us perform a nearest neighbor search on this data.
SELECT Name, embedding, L2Distance(embedding, [0.1, 0.2, 0.3, 0.4, 0.5]) AS score FROM ann_index_example ORDER BY score ASC LIMIT 5 Query id: 9de95e7f-79e1-4d02-81ff-43f7ace14072 ┌─Name──┬─embedding────────────────────────────────────────────────┬───────────────score─┐ │ anabd │ [0.109578826,0.17093645,0.1863009,0.4694911,0.56862974] │ 0.1529803320145931 │ │ ewzos │ [0.24622843,0.28853804,0.42360082,0.5199647,0.5086616] │ 0.24282803026289437 │ │ avajo │ [0.24545254,0.36559477,0.26052764,0.31315225,0.53613853] │ 0.24286758135453868 │ │ wyayw │ [0.032240078,0.18597832,0.31064722,0.21862713,0.6575932] │ 0.2502660809473027 │ │ adxir │ [0.1420014,0.11847292,0.41707036,0.41038868,0.71311146] │ 0.2600782018429016 │ └───────┴──────────────────────────────────────────────────────────┴─────────────────────┘ 5 rows in set. Elapsed: 0.006 sec. Processed 1.01 thousand rows, 42.42 KB (178.20 thousand rows/s., 7.48 MB/s.) Peak memory usage: 121.19 KiB.
In this example, we are searching for similar vectors in the table to the vector [0.1, 0.2, 0.3, 0.4, 0.5], and we have the results. The results may vary since we are inserting randomly generated data.
So, we have seen about storing and searching similar vectors in ClickHouse. In the next part of this series, we will get into the index types, indexing tuple fields, and various distance metrics.