Full text search is one of the most requested features in the ClickHouse community. Recently, ClickHouse gained the ability to perform full text search through Quickwit, a distributed search engine. In this article, I am going to explain the steps involved in this configuration and show how the full text search works.
Prerequisites
- ClickHouse server
- Quickwit
- Sample data (I used the GitHub archive dataset)
You can refer to the link to install the ClickHouse server.
Quickwit Installation:
Quickwit needs the following dependency packages to work properly, so it is recommended to install them first.
apt-get -y install libpq-dev libssl-dev
Once the dependency packages are installed, execute the following commands to install Quickwit.
curl -L https://install.quickwit.io | sh
cd quickwit-v*/
export QW_CONFIG=./config/quickwit.yaml
The next step is to create an index in Quickwit. This index will be used to perform the full text search.
Creating Index in Quickwit
First, we need to create the configuration file. It can be downloaded from GitHub using the following command.
curl -o gh-archive-index-config.yaml https://raw.githubusercontent.com/quickwit-oss/quickwit/main/config/tutorials/gh-archive/index-config-for-clickhouse.yaml
The configuration file contains the index name and the definitions of all the columns we need to index. For example, my index name is “clickhouse-fulltextsearch”, and I would like to index the columns “id, created_at, event_type, body, title”. So my configuration file looks like this.
version: 0
index_id: clickhouse-fulltextsearch
doc_mapping:
  store_source: false
  field_mappings:
    - name: id
      type: u64
      fast: true
    - name: created_at
      type: i64
      fast: true
    - name: event_type
      type: text
      tokenizer: raw
    - name: title
      type: text
      tokenizer: default
      record: position
    - name: body
      type: text
      tokenizer: default
      record: position
indexing_settings:
  timestamp_field: created_at
search_settings:
  default_search_fields: [title, body]
Now that the config file is prepared, the next step is to create the index. The following command is used to do that.
./quickwit index create --index-config gh-archive-index-config.yaml
You will see the following output at the end once the index is created.
Index `clickhouse-fulltextsearch` successfully created.
Use the following command to see the available indexes.
root@ip-172-31-11-13:~/quickwit-v0.3.1# ./quickwit index list
+---------------------------+-----------------------------------------------------------------------+
| Index ID                  | Index URI                                                             |
+---------------------------+-----------------------------------------------------------------------+
| clickhouse-fulltextsearch | file:///root/quickwit-v0.3.1/qwdata/indexes/clickhouse-fulltextsearch |
+---------------------------+-----------------------------------------------------------------------+
The next step is to download the sample data and ingest it into the index we created.
Data preparation and Indexing
As mentioned earlier, I used the GitHub archive data. You can download it using the following URL.
wget https://quickwit-datasets-public.s3.amazonaws.com/gh-archive/gh-archive-2021-12-text-only.json.gz
Once the data is downloaded, the next step is to ingest it into the index we created. The following command can be used to do that.
gunzip -c gh-archive-2021-12-text-only.json.gz | ./quickwit index ingest --index clickhouse-fulltextsearch
The command will take some time to complete. You will see the following output at the end.
Indexed 8,135,624 documents in 13m 19s
Now, you can query the index with the following command:
quickwit index search --index clickhouse-fulltextsearch --config ./config/quickwit.yaml --query "my query"
Clearing local cache directory...
Once the process is completed, the index data is split into pieces and stored under the “qwdata” folder. You can refer to the following output for the details.
root@ip-172-31-11-13:~/quickwit-v0.3.1# ls -lrth
total 2.8G
-rw-r--r-- 1 root root 2.8G Dec 15  2021 gh-archive-2021-12-text-only.json.gz
drwxr-xr-x 2 1001  121 4.0K Jun 22 18:28 config
-rw-r--r-- 1 1001  121  34K Jun 22 18:28 LICENSE_AGPLv3.0.txt
-rwxr--r-- 1 1001  121  85M Jun 22 18:28 quickwit
drwxr-xr-x 4 1001  121 4.0K Aug 11 09:26 qwdata
-rw-r--r-- 1 root root  587 Aug 11 18:00 gh-archive-index-config.yaml

root@ip-172-31-11-13:~/quickwit-v0.3.1# cd qwdata/
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata# ls -lrth
total 8.0K
drwxr-xr-x 3 root root 4.0K Aug 11 18:00 indexes
drwxr-xr-x 3 root root 4.0K Aug 11 18:09 indexing

root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata# du -sh *
10G indexes
20K indexing

root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata# cd indexes/
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata/indexes# ls -lrth
total 8.0K
-rw-r--r-- 1 root root   42 Aug 11 18:00 indexes_states.json
drwxr-xr-x 2 root root 4.0K Aug 11 18:22 clickhouse-fulltextsearch

root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata/indexes# cd clickhouse-fulltextsearch/
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata/indexes/clickhouse-fulltextsearch# ls -lrth
total 10G
-rw-r--r-- 1 root root 504M Aug 11 18:11 01GA71JJBBRZS9YQS4R1XB5X21.split
-rw-r--r-- 1 root root 500M Aug 11 18:12 01GA71MCZ08HK0J92RPM73H438.split
-rw-r--r-- 1 root root 502M Aug 11 18:13 01GA71P7JD6A4PFSDHB4ZZ885W.split
-rw-r--r-- 1 root root 500M Aug 11 18:14 01GA71R2716PZ0CHGD4NT5WWBR.split
-rw-r--r-- 1 root root 501M Aug 11 18:15 01GA71SWVD6HGGB5PVS1V26BED.split
-rw-r--r-- 1 root root 501M Aug 11 18:16 01GA71VQEYXRZ6QKE3XRN4CPNA.split
-rw-r--r-- 1 root root 490M Aug 11 18:17 01GA71XJ2A9YFHVAXN9Y5150VE.split
-rw-r--r-- 1 root root 502M Aug 11 18:18 01GA71ZCPK08GS1VB3BGSSMXRY.split
-rw-r--r-- 1 root root 501M Aug 11 18:19 01GA7217AHDJX46EGN0J8VZEN4.split
-rw-r--r-- 1 root root 500M Aug 11 18:20 01GA7231YVV9T7HKN16R49H92N.split
-rw-r--r-- 1 root root 240M Aug 11 18:20 01GA724WJYAPFNRXHH38FS51T9.split
-rw-r--r-- 1 root root 4.9G Aug 11 18:23 01GA725CQADYJN2797FK2H7MX1.split
-rw-r--r-- 1 root root 7.8K Aug 11 18:23 metastore.json
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata/indexes/clickhouse-fulltextsearch#
And you can also check the index description using the following command.
root@ip-172-31-11-13:~/quickwit-v0.3.1# ./quickwit index describe --index clickhouse-fulltextsearch
1. General information
===============================================================================
Index ID:                      clickhouse-fulltextsearch
Index URI:                     file:///root/quickwit-v0.3.1/qwdata/indexes/clickhouse-fulltextsearch
Number of published splits:    2
Number of published documents: 8135624
Size of published splits:      5442 MB
Timestamp field:               created_at
Timestamp range:               Some(1638316800000) -> Some(1639483214000)

2. Split statistics
===============================================================================
Document count stats:
Mean ± σ in [min … max]:            4067812 ± 3651190 in [416622 … 7719002]
Quantiles [1%, 25%, 50%, 75%, 99%]: [489645.8, 4067812, 4067812, 5893407, 5893407]

Size in MB stats:
Mean ± σ in [min … max]:            2721 ± 2470 in [251 … 5191]
Quantiles [1%, 25%, 50%, 75%, 99%]: [300.4, 2721, 2721, 3956, 3956]
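As a side note, the timestamp range above is reported in epoch milliseconds. A quick Python sketch (my own sanity check, not part of the Quickwit tooling) shows it covers the first two weeks of December 2021:

```python
from datetime import datetime, timezone

# Timestamp range from `quickwit index describe`, in epoch milliseconds.
start_ms, end_ms = 1638316800000, 1639483214000

def to_utc(ms: int) -> str:
    """Convert epoch milliseconds to a human-readable UTC timestamp."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(to_utc(start_ms), "->", to_utc(end_ms))
# 2021-12-01 00:00:00 -> 2021-12-14 12:00:14
```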
Once the data is indexed, we need to start the “searcher” service using the following command.
./quickwit run --service searcher
Once the searcher is started, Quickwit listens on port 7280, which serves both the UI and the REST API. We can query the data through that port.
root@ip-172-31-11-13:~/quickwit-v0.3.1# netstat -tulnp | grep -i 7280
tcp   0   0 127.0.0.1:7280   0.0.0.0:*   LISTEN   21348/./quickwit
udp   0   0 127.0.0.1:7280   0.0.0.0:*            21348/./quickwit
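Besides the CLI, the REST API on this port can be queried from any HTTP client. Here is a minimal Python sketch using only the standard library; it assumes the searcher is running locally on port 7280 and uses the same index name we created above (the `/api/v1/<index>/search` endpoint is the REST counterpart of the `search/stream` endpoint used later in this article):

```python
import json
import urllib.parse
import urllib.request

QUICKWIT = "http://127.0.0.1:7280"
INDEX = "clickhouse-fulltextsearch"

def search_url(query: str, max_hits: int = 10) -> str:
    """Build the search endpoint URL; the query string is URL-encoded."""
    params = urllib.parse.urlencode({"query": query, "max_hits": max_hits})
    return f"{QUICKWIT}/api/v1/{INDEX}/search?{params}"

def search(query: str) -> dict:
    """Send the request to the running searcher and decode the JSON response."""
    with urllib.request.urlopen(search_url(query)) as resp:
        return json.load(resp)
```

Calling `search("tantivy")` against a running searcher returns the same JSON structure (`num_hits`, `hits`, `elapsed_time_micros`, `errors`) that the CLI prints below.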
Search within Quickwit:
We have now created the index, ingested the data, and started the “searcher”. Let's search some data within Quickwit and understand the output.
root@ip-172-31-11-13:~/quickwit-v0.3.1# ./quickwit index search --index clickhouse-fulltextsearch --query "Updates tantivy (#838)"
{
  "num_hits": 2,
  "hits": [
    {
      "body": "Updates tantivy.\r\nAdds a read_only_directory macro to keep the code dry\r\n\r\nCloses #838.\r\n",
      "created_at": 1639466966000,
      "event_type": "PullRequestEvent",
      "id": 19322237191,
      "title": "Updates tantivy (#838)"
    },
    {
      "body": "Updates tantivy.\r\nAdds a read_only_directory macro to keep the code dry\r\n\r\nCloses #838.\r\n",
      "created_at": 1639471598000,
      "event_type": "PullRequestEvent",
      "id": 19323338657,
      "title": "Updates tantivy (#838)"
    }
  ],
  "elapsed_time_micros": 85997,
  "errors": []
}
From the above command,
- --index – defines the index name
- --query – defines what we need to search
From the above output,
- num_hits – the number of records that matched the search string
- hits – the matched documents
- elapsed_time_micros – the time taken, in microseconds
- errors – any errors encountered during the search
Hope this helps to understand the output.
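To consume this JSON programmatically, the fields described above can be extracted with a few lines of Python. The sample response below is abbreviated from the search output shown earlier (the `body` fields are omitted for brevity):

```python
import json

# Abbreviated sample of the Quickwit search response shown above.
response = json.loads("""
{
  "num_hits": 2,
  "hits": [
    {"id": 19322237191, "created_at": 1639466966000,
     "event_type": "PullRequestEvent", "title": "Updates tantivy (#838)"},
    {"id": 19323338657, "created_at": 1639471598000,
     "event_type": "PullRequestEvent", "title": "Updates tantivy (#838)"}
  ],
  "elapsed_time_micros": 85997,
  "errors": []
}
""")

# num_hits counts the matches; hits holds the matched documents themselves.
ids = [hit["id"] for hit in response["hits"]]
elapsed_ms = response["elapsed_time_micros"] / 1000  # microseconds -> ms

print(response["num_hits"], ids, f"{elapsed_ms:.1f} ms")
```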
Integration with Clickhouse:
Now, let's see how this can be used inside ClickHouse. As a first step, log in to the ClickHouse shell and create the database as well as the table. Creating the table inside ClickHouse is not strictly needed, because the query will fetch its results from Quickwit, not from the ClickHouse table.
I am creating it here to compare the results later.
clickhouse-client --password

CREATE DATABASE github;

CREATE TABLE github.github_events
(
    `id` UInt64,
    `event_type` Enum8('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
    `actor_login` LowCardinality(String),
    `repo_name` LowCardinality(String),
    `created_at` Int64,
    `action` Enum8('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20),
    `comment_id` UInt64,
    `body` String,
    `ref` LowCardinality(String),
    `number` UInt32,
    `title` String,
    `labels` Array(LowCardinality(String)),
    `additions` UInt32,
    `deletions` UInt32,
    `commit_id` String
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, created_at)
SETTINGS index_granularity = 8192
Import the same data we downloaded into ClickHouse using the following command.
gunzip -c gh-archive-2021-12-text-only.json.gz | clickhouse-client -d github --query="INSERT INTO github_events FORMAT JSONEachRow"
Data is imported.
Now, let's see how to query the Quickwit data from ClickHouse.
ip-172-31-11-13.ap-southeast-1.compute.internal :) SELECT id FROM url('http://127.0.0.1:7280/api/v1/clickhouse-fulltextsearch/search/stream?query=YEAAAHH&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64');

Query id: 00a1b3db-4b0d-4894-af2f-aea4b7246cab

┌──────────id─┐
│ 19258102090 │
└─────────────┘

1 row in set. Elapsed: 0.009 sec.
As you can see from the above query, we are fetching the data directly from the Quickwit URL, not from the ClickHouse table. So it is not strictly necessary to have the table in ClickHouse. 🙂
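That said, the local table is still useful when you want the full rows back. One practical pattern (a sketch of my own, not from the Quickwit tutorial) is to let Quickwit do the text matching via `url()` and then have ClickHouse fetch the complete rows for the matched ids:

```sql
SELECT id, event_type, title
FROM github.github_events
WHERE id IN (
    SELECT id
    FROM url('http://127.0.0.1:7280/api/v1/clickhouse-fulltextsearch/search/stream?query=YEAAAHH&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64')
);
```

This way the expensive full text scan stays in Quickwit, and ClickHouse only has to look up a handful of ids.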
Let's compare the results:
In this section, I am going to execute the query against both the ClickHouse table and Quickwit and compare the results.
From the ClickHouse table:
ip-172-31-11-13.ap-southeast-1.compute.internal :) SELECT id FROM github_events WHERE body LIKE '%YEAAAHH%';

Query id: 0cd0ba27-ab8c-43ed-9983-d85b9fdf3b98

┌──────────id─┐
│ 19258102090 │
└─────────────┘

1 row in set. Elapsed: 4.324 sec. Processed 8.14 million rows, 12.88 GB (1.88 million rows/s., 2.98 GB/s.)
From Quickwit:
ip-172-31-11-13.ap-southeast-1.compute.internal :) SELECT * FROM url('http://127.0.0.1:7280/api/v1/clickhouse-fulltextsearch/search/stream?query=YEAAAHH&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64');

Query id: 647bc9af-e0c3-42b1-befc-b35d1a2ba64e

┌──────────id─┐
│ 19258102090 │
└─────────────┘

1 row in set. Elapsed: 0.012 sec.
From the above results, the full text search through Quickwit performs very well: 0.012 seconds, versus 4.324 seconds for the LIKE scan over 8.14 million rows in ClickHouse.
As we have seen, it is great that we can run queries against Quickwit directly from the ClickHouse shell. Looking forward to seeing built-in full text search support in ClickHouse.