Introduction
Due to the exponential growth of textual data and the need for users to access relevant information quickly and effectively, full text search remains vital in today’s digital ecosystem and as ChistaDATA Inc. we continue to provide real-time analytics with open source solutions to meet your needs using the ClickHouse database.
ClickHouse’s strength lies in analytics, as it provides elementary full-text search capabilities for substring matching within columns of text, and as referred by ClickHouse, the full-text search feature is still experimental. However, integrating ClickHouse with specialized full-text search engines such as Manticore Search or Apache Solr is often a superior way to achieve more advanced and comprehensive full-text search capabilities.
For a more detailed explanation of Manticore Search, it is an open source full text search engine designed to provide high-performance search capabilities for various applications. It is based on the Sphinx search engine and includes a number of advanced features and enhancements. Manticore Search has two types of indexes. One is a real-time index
and the other is a plain index
and real-time indexing is always our preference. That’s why we preferred Manticore Search especially.
In this article, we’ll continue with the demonstration, which contains the most recent data on the Manticore Search side. Real-time indexes are similar to database tables in that they start empty and can be filled with data in the same way as ClickHouse tables by performing INSERT/UPDATE/DELETE/REPLACE operations, and they contain the most recent data, unlike plain indexes on Manticore. A Python script can also help us to collect the latest data from the ChistaDATA Cloud. This allows us to do real-time analysis clearly.
Example of Manticore Full Text Search with Realtime Index
Suppose that we have a trips
table in on our DBaas environment with two columns, one of which is of data type String
. Especially in LIKE conditions, a full-text search with Manticore Search integration will save our time.
We can inspect our ChistaDATA DBaaS environment and the trips
table has around 49 million rows.
Then, clone this github and set up the Manticore Search container. Let’s install the necessary packages on Manticore container.
apt-get update apt-get install python3 -y apt-get install python3-pip -y pip3 install clickhouse-driver apt-get install python3-pymysql -y
Then, we will index the explanation
column as explanation_idx
on Manticore container. Our data will move the /var/lib/data
folder which specified in /var/lib/manticore.conf
. It’s also important to edit the real_time.py
script before running it. Simply replace your ChistaDATA DBaaS information with this file to fetch data to Manticore Search.
python3 /etc/manticoresearch/real_time.py
Manticore Search has an SQL interface based on the MySQL protocol. That’s why we used port 9306 by default. You can check /etc/manticoresearch/manticore.conf for other listening protocols. So, we will connect the Manticore using with MySQL protocol and check the index created. It’s time to take advantage of Manticore Search!
mysql -h 127.0.0.1 -P 9306 mysql> show tables; +-----------------+------+ | Index | Type | +-----------------+------+ | explanation_idx | rt | +-----------------+------+
Let’s fetch the same data using with explanation_idx
on Manticore Search.
mysql> select * from explanation_idx where match('lift'); +----------------------+-------------+ | id | explanation | +----------------------+-------------+ | 7441205159493159023 | lift | +----------------------+-------------+ 1 row in set (0.02 sec)
Have you noticed how Manticore indexes the explanation
column? A unique 64-bit document ID has been prepared for each row.
Let’s fetch the same data using the DBaaS environment.
SELECT * FROM trips WHERE explanation LIKE '%lift%' id explanation 10 lift Took: 503.875345ms
This is the specific data that we have collected, and it returns half-second instead of 20 ms. This is why we say that LIKE operations are not for ClickHouse, and the choice of index is essential.
Conclusion
While ClickHouse is successful in various contexts, it’s important to emphasise that its primary capabilities lie in analytical workloads and we have used the robust ChistaDATA DBaaS environment. However, if your business explicitly requires full-text search operations, we can use some open source tools to retrieve the data more efficiently until ClickHouse officially supports real-time full-text indexing. It’s also important to note that container resources can change query processing times on Manticore.