1. Home
  2. Knowledge Base
  3. ClickHouse
  4. Overview of Clickhouse Full Text Search using the Quickwit
  1. Home
  2. Knowledge Base
  3. ClickHouse DBA
  4. Overview of Clickhouse Full Text Search using the Quickwit

Overview of Clickhouse Full Text Search using the Quickwit

Full Text Search is the most expected feature at Clickhouse community. Recently, Clickhouse has the support to perform the Full text Search using Quickwit. Quickwit is a distributed search engine. In this article, I am going to explain the steps involved on this configuration and will show how the Full Text Search is working.

Pre-requaistics 

You can refer the link to install the Clickhouse server.

Quickwit Installation:

Quickwit needs the following dependency packages to work properly. So, it is recommended to install them.

apt-get -y install libpq-dev libssl-dev

Once the dependency packages are installed, execute the following commands to install the Quickwit.

curl -L https://install.quickwit.io | sh
cd quickwit-v*/
export QW_CONFIG=./config/quickwit.yaml

The next step is we need to create the index in Quickwit. This index will be used to perform the Full Text Search.

Creating Index in Quickwit

Firstly, we need to create the configuration file. It can be downloaded from GitHub using the following command.

curl -o gh-archive-index-config.yaml https://raw.githubusercontent.com/quickwit-oss/quickwit/main/config/tutorials/gh-archive/index-config-for-clickhouse.yaml

The configuration file contains the index name, and description of all the columns we need to index. For example, My index name is “clickhouse-fulltextsearch” and I would like to add index for the columns “id, created_at,event_type,body,title”. So, my configuration file looks like below.

version: 0

index_id: clickhouse-fulltextsearch

doc_mapping:
  store_source: false
  field_mappings:
    - name: id
      type: u64
      fast: true
    - name: created_at
      type: i64
      fast: true
    - name: event_type
      type: text
      tokenizer: raw
    - name: title
      type: text
      tokenizer: default
      record: position
    - name: body
      type: text
      tokenizer: default
      record: position

indexing_settings:
  timestamp_field: created_at

search_settings:
  default_search_fields: [title, body]

Now we prepared the config file, the next step is, need to create the index. The following command is used to create the index.

./quickwit index create --index-config gh-archive-index-config.yaml

You will see the following output at the end once the index is created.

Index `clickhouse-fulltextsearch` successfully created.

Use the following command to see the available indexes.

root@ip-172-31-11-13:~/quickwit-v0.3.1# ./quickwit index list
+---------------------------+-----------------------------------------------------------------------+
|         Index ID          |                               Index URI                               |
+---------------------------+-----------------------------------------------------------------------+
| clickhouse-fulltextsearch | file:///root/quickwit-v0.3.1/qwdata/indexes/clickhouse-fulltextsearch |
+---------------------------+-----------------------------------------------------------------------+

So, The next step is we need download the sample data and apply the created index in it.

Data preparation and Indexing

As I told earlier, I have downloaded the GitHub archive data. You can also download that using the following url.

wget https://quickwit-datasets-public.s3.amazonaws.com/gh-archive/gh-archive-2021-12-text-only.json.gz

The data has been downloaded now. The next step is, need to apply the index we created already. The following command can be used to do that.

gunzip -c gh-archive-2021-12-text-only.json.gz | ./quickwit index ingest --index clickhouse-fulltextsearch

The command will take some time to complete. You will see the following output at the end.

Indexed 8,135,624 documents in 13m 19s
Now, you can query the index with the following command:
quickwit index search --index clickhouse-fulltextsearch --config ./config/quickwit.yaml --query "my query"
Clearing local cache directory...

Once the process completed, the indexes will be spliced and stored under the “qwdata” folder. You can refer the following output for the details.

root@ip-172-31-11-13:~/quickwit-v0.3.1# ls -lrth
total 2.8G
-rw-r--r-- 1 root root 2.8G Dec 15  2021 gh-archive-2021-12-text-only.json.gz
drwxr-xr-x 2 1001  121 4.0K Jun 22 18:28 config
-rw-r--r-- 1 1001  121  34K Jun 22 18:28 LICENSE_AGPLv3.0.txt
-rwxr--r-- 1 1001  121  85M Jun 22 18:28 quickwit
drwxr-xr-x 4 1001  121 4.0K Aug 11 09:26 qwdata
-rw-r--r-- 1 root root  587 Aug 11 18:00 gh-archive-index-config.yaml
root@ip-172-31-11-13:~/quickwit-v0.3.1# 
root@ip-172-31-11-13:~/quickwit-v0.3.1# cd qwdata/
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata# ls -lrth
total 8.0K
drwxr-xr-x 3 root root 4.0K Aug 11 18:00 indexes
drwxr-xr-x 3 root root 4.0K Aug 11 18:09 indexing
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata# du -sh *
10G	indexes
20K	indexing
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata# cd indexes/
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata/indexes# ls -lrth
total 8.0K
-rw-r--r-- 1 root root   42 Aug 11 18:00 indexes_states.json
drwxr-xr-x 2 root root 4.0K Aug 11 18:22 clickhouse-fulltextsearch
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata/indexes# cd clickhouse-fulltextsearch/
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata/indexes/clickhouse-fulltextsearch# ls -lrth
total 10G
-rw-r--r-- 1 root root 504M Aug 11 18:11 01GA71JJBBRZS9YQS4R1XB5X21.split
-rw-r--r-- 1 root root 500M Aug 11 18:12 01GA71MCZ08HK0J92RPM73H438.split
-rw-r--r-- 1 root root 502M Aug 11 18:13 01GA71P7JD6A4PFSDHB4ZZ885W.split
-rw-r--r-- 1 root root 500M Aug 11 18:14 01GA71R2716PZ0CHGD4NT5WWBR.split
-rw-r--r-- 1 root root 501M Aug 11 18:15 01GA71SWVD6HGGB5PVS1V26BED.split
-rw-r--r-- 1 root root 501M Aug 11 18:16 01GA71VQEYXRZ6QKE3XRN4CPNA.split
-rw-r--r-- 1 root root 490M Aug 11 18:17 01GA71XJ2A9YFHVAXN9Y5150VE.split
-rw-r--r-- 1 root root 502M Aug 11 18:18 01GA71ZCPK08GS1VB3BGSSMXRY.split
-rw-r--r-- 1 root root 501M Aug 11 18:19 01GA7217AHDJX46EGN0J8VZEN4.split
-rw-r--r-- 1 root root 500M Aug 11 18:20 01GA7231YVV9T7HKN16R49H92N.split
-rw-r--r-- 1 root root 240M Aug 11 18:20 01GA724WJYAPFNRXHH38FS51T9.split
-rw-r--r-- 1 root root 4.9G Aug 11 18:23 01GA725CQADYJN2797FK2H7MX1.split
-rw-r--r-- 1 root root 7.8K Aug 11 18:23 metastore.json
root@ip-172-31-11-13:~/quickwit-v0.3.1/qwdata/indexes/clickhouse-fulltextsearch#

And you can also check the index description using the following command.

root@ip-172-31-11-13:~/quickwit-v0.3.1# ./quickwit index  describe --index clickhouse-fulltextsearch 
1. General information
===============================================================================
Index ID:                           clickhouse-fulltextsearch
Index URI:                          file:///root/quickwit-v0.3.1/qwdata/indexes/clickhouse-fulltextsearch
Number of published splits:         2
Number of published documents:      8135624
Size of published splits:           5442 MB
Timestamp field:                    created_at
Timestamp range:                    Some(1638316800000) -> Some(1639483214000)

2. Split statistics
===============================================================================
Document count stats:
Mean ± σ in [min … max]:            4067812 ± 3651190 in [416622 … 7719002]
Quantiles [1%, 25%, 50%, 75%, 99%]: [489645.8, 4067812, 4067812, 5893407, 5893407]

Size in MB stats:
Mean ± σ in [min … max]:            2721 ± 2470 in [251 … 5191]
Quantiles [1%, 25%, 50%, 75%, 99%]: [300.4, 2721, 2721, 3956, 3956]

Once the index is applied, we need to start the “searcher” using the following command.

./quickwit run --service searcher

Once the searcher is started, Quickwit will listen the port 7280, which is used to UI and REST API. We can query the data using that port.

root@ip-172-31-11-13:~/quickwit-v0.3.1# netstat -tulnp | grep -i 7280
tcp        0      0 127.0.0.1:7280          0.0.0.0:*               LISTEN      21348/./quickwit    
udp        0      0 127.0.0.1:7280          0.0.0.0:*                           21348/./quickwit    

Search within Quickwit:

Now, we created the index, applied the index and started the “searcher” as well. Lets, search some data within Quickwit and understand the output.

root@ip-172-31-11-13:~/quickwit-v0.3.1# ./quickwit index search --index clickhouse-fulltextsearch --query "Updates tantivy (#838)"
{
  "num_hits": 2,
  "hits": [
    {
      "body": "Updates tantivy.\r\nAdds a read_only_directory macro to keep the code dry\r\n\r\nCloses #838.\r\n",
      "created_at": 1639466966000,
      "event_type": "PullRequestEvent",
      "id": 19322237191,
      "title": "Updates tantivy (#838)"
    },
    {
      "body": "Updates tantivy.\r\nAdds a read_only_directory macro to keep the code dry\r\n\r\nCloses #838.\r\n",
      "created_at": 1639471598000,
      "event_type": "PullRequestEvent",
      "id": 19323338657,
      "title": "Updates tantivy (#838)"
    }
  ],
  "elapsed_time_micros": 85997,
  "errors": []
}

From the above command,

  • –index – Defines the index name
  • –query – Defines what we need to search

From the above output,

  • num_hits – How many records matched with the search string
  • hits – Print the matched data
  • elapsed_time_micros – How much Micro seconds taken
  • error – print if any error

Hope this helps to understand the output.

Integration with Clickhouse:

Now, we can see how this can be used inside the Cliskhouse. As a first step, login the clickhouse shell and create the database as well was table. Creating the table inside the Clikhouse is not really needed. Because, when I execute the query, the query will search the result from the Quickwit not from Clickhouse table.

I am doing this for compare the results.

click house-client --password

create database GitHub;

CREATE TABLE github.github_events
(
`id` UInt64,
`event_type` Enum8('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
`actor_login` LowCardinality(String),
`repo_name` LowCardinality(String),
`created_at` Int64,
`action` Enum8('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20),
`comment_id` UInt64,
`body` String,
`ref` LowCardinality(String),
`number` UInt32,
`title` String,
`labels` Array(LowCardinality(String)),
`additions` UInt32,
`deletions` UInt32,
`commit_id` String
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, created_at)
SETTINGS index_granularity = 8192

Import the same data we downloaded into ClickHouse using the following command.

gunzip -c gh-archive-2021-12-text-only.json.gz | clickhouse-client -d gh-archive --query="INSERT INTO github_events FORMAT JSONEachRow"

Data is imported.

Now, let see how to query the Quickwit data from Clickhouse.

ip-172-31-11-13.ap-southeast-1.compute.internal :) SELECT id FROM url('http://127.0.0.1:7280/api/v1/clickhouse-fulltextsearch/search/stream?query=YEAAAHH&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64');

SELECT *
FROM url('http://127.0.0.1:7280/api/v1/clickhouse-fulltextsearch/search/stream?query=YEAAAHH&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64')

Query id: 00a1b3db-4b0d-4894-af2f-aea4b7246cab

┌──────────id─┐
│ 19258102090 │
└─────────────┘

1 row in set. Elapsed: 0.009 sec.

As you see the above query, we are querying the data directly from the Quickwit URL not from the Clickhouse table. So, it is not necessary to have the table in Clickhouse. 🙂

Lets compare the result:

In this session, I am going to execute the Query from both Clickhouse table and Quickwit and compare the result.

From Clickhouse table:

p-172-31-11-13.ap-southeast-1.compute.internal :) select id from github_events where body like '%YEAAAHH%';

SELECT id
FROM github_events
WHERE body LIKE '%YEAAAHH%'

Query id: 0cd0ba27-ab8c-43ed-9983-d85b9fdf3b98

┌──────────id─┐
│ 19258102090 │
└─────────────┘

1 row in set. Elapsed: 4.324 sec. Processed 8.14 million rows, 12.88 GB (1.88 million rows/s., 2.98 GB/s.)

From Quickwit,

ip-172-31-11-13.ap-southeast-1.compute.internal :) SELECT * FROM url('http://127.0.0.1:7280/api/v1/clickhouse-fulltextsearch/search/stream?query=YEAAAHH&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64');

SELECT *
FROM url('http://127.0.0.1:7280/api/v1/clickhouse-fulltextsearch/search/stream?query=YEAAAHH&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64')

Query id: 647bc9af-e0c3-42b1-befc-b35d1a2ba64e

┌──────────id─┐
│ 19258102090 │
└─────────────┘

1 row in set. Elapsed: 0.012 sec.

From the above results, Full text Search from Quickwit is performing very well.

As we see, this is one of the great thing, that we are able to do queries on Quickwit using the Clickhouse shell. Looking forward to see the Clickhouse built-in support from Clickhouse.

Was this article helpful?

Related Articles

CHISTADATA IS COMMITTED TO OPEN SOURCE SOFTWARE AND BUILDING HIGH PERFORMANCE COLUMNSTORES

In the spirit of freedom, independence and innovation. ChistaDATA Corporation is not affiliated with ClickHouse Corporation 

Need Support?

Can't find the answer you're looking for?
Contact Support

ChistaDATA Inc. Knowledge base is licensed under the Apache License, Version 2.0 (the “License”)

Copyright 2022 ChistaDATA Inc

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.