How to use Asynchronous Inserts in ClickHouse for High Performance Data Loading

Introduction

Asynchronous inserts in ClickHouse are useful when you need to send many frequent inserts into a table and don’t want each client to wait for its data to be written to disk before continuing with other work. Instead of creating a new data part for every statement, the server buffers incoming rows in memory and flushes them to disk in larger batches, which reduces write overhead and improves the throughput of high-frequency insert workloads.

Here are some examples of situations where asynchronous inserts can be useful:

  • When you need to ingest a continuous stream of small, frequent inserts in real time, such as events from IoT devices or log data from a server.
  • When you need to insert data into a table as part of a data pipeline, and the pipeline should continue processing other data while the inserts are in progress.
  • When you need to insert data into a table as part of a data warehousing or data lake solution, and you want to improve the performance of bulk data loads.

Configuring ClickHouse for Asynchronous Inserts

ClickHouse supports asynchronous inserts, which allow you to hand data to the server without waiting for it to be flushed into a data part on disk. The server accumulates the data from many inserts in an in-memory buffer and writes it out in batches, which improves the performance of workloads made up of many small insert statements.
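The examples in this article use a small sample table. The schema below is only an illustrative assumption that matches the column names used in the snippets that follow; adapt it to your own data:

CREATE TABLE mytable
(
    column1 String,
    column2 String,
    column3 String
)
ENGINE = MergeTree
ORDER BY column1;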

To perform an asynchronous insert in ClickHouse, enable the async_insert setting for the INSERT statement. The setting can be applied per query with a SETTINGS clause, per session with SET, or in a user profile. For example:

INSERT INTO mytable (column1, column2, column3) SETTINGS async_insert = 1, wait_for_async_insert = 0 VALUES ('value1', 'value2', 'value3');

This inserts the specified data into the mytable table and returns as soon as the data has been placed in the server-side buffer, without waiting for it to be written to disk. The buffered data is flushed to a data part in the background.
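If you prefer not to repeat these settings on every statement, they can be enabled for the whole session instead. A minimal sketch, using the same assumed example table:

-- Enable asynchronous inserts for the current session.
SET async_insert = 1;

-- Wait for the buffer flush before acknowledging each insert (this is the default).
SET wait_for_async_insert = 1;

INSERT INTO mytable (column1, column2, column3) VALUES ('value1', 'value2', 'value3');

With wait_for_async_insert = 1 the client still benefits from server-side batching, but it is only acknowledged once its data has been flushed to disk; setting it to 0 trades that guarantee for lower insert latency.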

Because async_insert is an ordinary setting rather than special syntax, it is not limited to hand-written VALUES statements: it applies to any insert that carries the setting, whether the data arrives through clickhouse-client, the HTTP interface, client libraries, or external loading tools such as clickhouse-copier or clickhouse-bulk that ultimately issue INSERT statements against the server.
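For example, with the session settings from the previous sketch still in effect, a log-style pipeline could push batches of rows using an input format such as JSONEachRow (the rows below are purely illustrative):

INSERT INTO mytable FORMAT JSONEachRow
{"column1": "host-01", "column2": "cpu", "column3": "load high"}
{"column1": "host-02", "column2": "disk", "column3": "usage 81%"}

Each such statement is appended to the server-side buffer, so many small batches from many clients end up producing far fewer data parts when the buffer is flushed.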

If the target table is replicated, you may also want to consider the insert_quorum and insert_quorum_timeout settings, which control the minimum number of replicas that must acknowledge an insert for it to be considered successful and how long to wait for that acknowledgement. These guarantees concern replication rather than the asynchronous buffering described above, so check how the two interact in your ClickHouse version before combining them.
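As a sketch, assuming mytable is backed by a ReplicatedMergeTree engine with at least two replicas (a hypothetical setup), a quorum-acknowledged insert could look like this:

-- Require acknowledgement from 2 replicas, waiting at most 60 seconds (60000 ms).
INSERT INTO mytable (column1, column2, column3)
SETTINGS insert_quorum = 2, insert_quorum_timeout = 60000
VALUES ('value1', 'value2', 'value3');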

Implementation of Asynchronous Inserts in ClickHouse

Asynchronous inserts are implemented inside the ClickHouse server with an in-memory buffer and a background flush. When an insert arrives with async_insert enabled, its data is appended to a buffer instead of being written straight to a new data part, and the statement can be acknowledged before any disk write happens. A background flush later writes the accumulated buffer to disk as a single part once it has grown large enough or has waited long enough, so many small inserts collapse into far fewer parts.

Here’s a high-level overview of the process:

  1. A client sends an INSERT statement to the ClickHouse server with the async_insert setting enabled.
  2. The server receives the statement and appends its data to an in-memory buffer, together with other inserts into the same table that use the same settings.
  3. If wait_for_async_insert = 0, the statement returns as soon as the data is in the buffer, without waiting for it to be written to disk; with wait_for_async_insert = 1 (the default), the client waits until the buffer has been flushed.
  4. The buffer is flushed in the background once it reaches a size threshold, a timeout expires, or enough insert queries have accumulated (the settings that control this are shown in the sketch after this list).
  5. The flushed data is written to disk as a single data part, so many small inserts produce far fewer parts than they would synchronously.
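The thresholds mentioned in step 4 are ordinary settings, and recent ClickHouse versions expose a system.asynchronous_inserts table for inspecting data that is still sitting in the buffer. The values below are only illustrative assumptions, not recommendations:

-- Flush the buffer once it holds roughly 1 MB of data or after 200 ms,
-- whichever comes first (illustrative values).
SET async_insert_max_data_size = 1000000;
SET async_insert_busy_timeout_ms = 200;

-- Inspect inserts that are currently waiting in the in-memory buffer.
SELECT * FROM system.asynchronous_inserts;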

It’s important to note that the durability trade-off depends on wait_for_async_insert. With wait_for_async_insert = 0, the insert is acknowledged while the data is still only in the server’s in-memory buffer, so a crash before the flush can lose that data. With wait_for_async_insert = 1, the client is acknowledged only after the buffer has been written to disk, at the cost of higher insert latency. Use the fire-and-forget mode with caution, and only when occasional data loss can be tolerated.

Conclusion

Asynchronous inserts let ClickHouse absorb many small, frequent inserts by buffering them in memory and writing them to disk in larger batches, which improves ingestion throughput and keeps the number of data parts under control. The main trade-off is durability: keep wait_for_async_insert = 1 when every row matters, and switch it to 0 only when you are sure occasional data loss can be tolerated.

To learn more about high-velocity data ingestion in ClickHouse, consider giving the following articles a read:
