Introduction
Due to the exponential growth of textual data and the need for users to access relevant information quickly and effectively, full-text search remains vital in today’s digital ecosystem, and as ChistaDATA Inc., we continue to provide open-source solutions to meet your needs using the ClickHouse database.
ClickHouse’s strength lies in analytics, as it provides elementary full-text search capabilities for substring matching within columns of text, and as referred by ClickHouse, the full-text search feature is still experimental. However, integrating ClickHouse with specialized full-text search engines such as Manticore Search or Apache Solr is often a superior way to achieve more advanced and comprehensive full-text search capabilities.
For a more detailed explanation of Manticore Search, it is an open source full text search engine designed to provide high performance search capabilities for various applications. It is based on the Sphinx search engine and includes a number of advanced features and enhancements. Manticore Search has two types of indexes. One is a real-time index and the other is a plain index.
- Important note: As ChistaDATA Inc. we always recommend using real time analytics infrastructures with ChistaDATA Cloud for ClickHouse and we already have a blog about it. You can learn about the real time option as we recommended. This blog is about Manticore plain index structure and it is only written to give an idea.
In this article, we’ll continue with plain indexes, which are immutable. Such an index requires an indexing process using the indexer tool and uses a source (which can be a MySQL select query, CSV, TSV or XML files). If you INSERT/UPDATE/DELETE/REPLACE the data on the source(ClickHouse) table, you need to drop and re-create for the latest index on Manticore Search.
Example of Manticore Full Text Search with Plain Index
Suppose that we have a trips
table in ClickHouse with four columns, one of which is of data type String
. Especially in LIKE conditions, a full-text search with Manticore Search integration can save time.
First, we need to clone this github and set up all the containers. On the ClickHouse container, we can create the table mentioned above.
CREATE TABLE trips ( `trip_id` UInt32, `trip_distance` Float32, `explanation` String, `extra` Float32 ) ENGINE = MergeTree ORDER BY trip_id
Then, generate 50 million random data and insert it. As you can see, ClickHouse is always ready for bulk insertion.
INSERT INTO trips SELECT * FROM generateRandom('trip_id UInt32, trip_distance Float32, explanation String, extra Float32') LIMIT 50000000
We quickly completed our work on the ClickHouse side. We now need to send the data from ClickHouse to Manticore as an index. During this operation, each row in Manticore will have its unique ID to identify each row for the explanation
column. It’s nice to know that Manticore document IDs must be unique 64-bit unsigned integers. That’s why we used the one we specified in /etc/manticoresearch/manticore.conf when fetching data from the ClickHouse database.
Let’s continue; we can install the clickhouse-client
on the Manticore container
chmod +x /var/lib/clickhouse-client.sh /var/lib/clickhouse-client.sh
Then, we will index the explanation
column as explanation_idx
on Manticore container. The values mentioned in /etc/manticoresearch/manticore.conf
indexer --all --rotate --print-queries
We should see some outputs like that if all goes well.
using config file '/etc/manticoresearch/manticore.conf'... indexing table 'explanation_idx'... collected 50000000 docs, 255.2 MB creating secondary index creating lookup: 50000.0 Kdocs, 100.0% done sorted 76.0 Mhits, 100.0% done total 50000000 docs, 255276428 bytes total 93.674 sec, 2725142 bytes/sec, 533763.04 docs/sec total 1019 reads, 0.980 sec, 1363.2 kb/call avg, 0.9 msec/call avg total 6422 writes, 5.023 sec, 473.6 kb/call avg, 0.7 msec/call avg rotating tables: successfully sent SIGHUP to searchd (pid=1).
It’s time to take advantage of Manticore Search!
Manticore Search has an SQL interface based on the MySQL protocol. That’s why we used port 9306 by default. You can check /etc/manticoresearch/manticore.conf for other listening protocols. So, we will connect the Manticore using with MySQL protocol and check the index created.
mysql -h 127.0.0.1 -P 9306 mysql> show tables; +-----------------+-------+ | Index | Type | +-----------------+-------+ | explanation_idx | local | +-----------------+-------+
Let’s fetch the same data using with explanation_idx
on Manticore Search.
mysql> select count(*) from explanation_idx where match('mSWyG'); +---------------------+-------------+ | id | explanation | +---------------------+-------------+ | 1153360392352323823 | mSWyG | +---------------------+-------------+ 1 row in set (0.03 sec)
Have you noticed how Manticore indexes the explanation
column? A unique 64-bit document ID has been prepared for each row.
Let’s fetch the same data using the ClickHouse database.
SELECT * FROM trips WHERE explanation LIKE '%mSWyG%' ┌────trip_id─┬─trip_distance─┬─explanation─┬────────extra─┐ │ 2496022668 │ 3.4318767e-19 │ mSWyG │ -0.031302456 │ └────────────┴───────────────┴─────────────┴──────────────┘ 1 row in set. Elapsed: 1.059 sec. Processed 50.00 million rows, 700.00 MB
We have brought this specific data, and it returns almost 1 second. This is why we say that LIKE operations are not for ClickHouse, and the choice of index is essential.
Conclusion
While ClickHouse is successful in various contexts, it is crucial to emphasise that its primary capabilities lie in analytical workloads. If your business explicitly requires full text search operations, we can use some open source tools to retrieve the data more efficiently until ClickHouse officially supports full text search. However, this plain indexing technology is now old fashioned and this blog is only Manticore’s structure related. We don’t recommend using this technology even in a test environment. It would be great if you could use the real time indexing option instead of this structure, which is specified below. It’s also important to note that container resources can change query processing times.