ClickHouse developers recommend batch data ingestion for ClickHouse. While transactional databases can handle many small insert requests efficiently, ClickHouse works a little differently. For MergeTree-family engines, each insert is first written as a new, unmerged data part, which is later sorted and merged in the background. Inserting records row by row is costly because every insert creates its own part, each requiring separate data writes, sorts, and updates to the primary key index and mark files. Let us verify this using Python code.
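As a rough analogy in plain Python (this is not ClickHouse internals, just an illustration of the cost difference), re-establishing sorted order after every single insert does far more work than collecting everything first and sorting once, which is essentially what a batch insert lets MergeTree do:

```python
import bisect
import random

rows = [random.randint(0, 10_000_000) for _ in range(10_000)]

# Row-by-row: keep the collection sorted after every insert
one_by_one = []
for r in rows:
    bisect.insort(one_by_one, r)  # O(n) element shifting per insert

# Batch: collect everything, then sort once
batched = sorted(rows)  # O(n log n) overall

assert one_by_one == batched  # same result, very different amount of work
```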
Pre-Requisites
- Python 3.x
- clickhouse-driver (the driver used in the examples below)
- ClickHouse Server – You can sign up for a fully managed ClickHouse server here
Example table
```sql
CREATE TABLE insert_benchmark
(
    ID Int64,
    Value Float64,
    Count Int64
)
ENGINE = MergeTree
ORDER BY ID;
```
Row by Row Insert
The Python snippet below generates random data and inserts it row by row into the ClickHouse table.
```python
from clickhouse_driver import Client
import random

client = Client(
    host='localhost',
    user='default',
    password='123456',
    port='9000',
    send_receive_timeout=600,
    secure=False,
)

insert_query = "INSERT INTO insert_benchmark VALUES"

# Generate 100,000 random rows matching the table schema
data = [
    [random.randint(0, 10000000), random.uniform(0.0, 100000.0), random.randint(0, 10000000)]
    for _ in range(100000)
]

# Insert one row per call -- a separate round trip and data part per row
for row in data:
    client.execute(insert_query, [row])

client.disconnect()
```
Batch Insert
The Python snippet below generates the same random data and inserts it into the ClickHouse table in a single batch.
```python
from clickhouse_driver import Client
import random

client = Client(
    host='localhost',
    user='default',
    password='123456',
    port='9000',
    send_receive_timeout=600,
    secure=False,
)

insert_query = "INSERT INTO insert_benchmark VALUES"

# Generate 100,000 random rows matching the table schema
data = [
    [random.randint(0, 10000000), random.uniform(0.0, 100000.0), random.randint(0, 10000000)]
    for _ in range(100000)
]

# Insert all rows in a single round trip
client.execute(insert_query, data)

client.disconnect()
Time both scripts on your own ClickHouse setup to verify the speed difference; the batch insert typically finishes far faster than the row-by-row version.
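When the dataset is too large to buffer in a single list, a common middle ground is to insert in fixed-size chunks. A minimal sketch, where the `chunked` helper is our own (it is not part of clickhouse-driver):

```python
def chunked(rows, size):
    """Yield successive fixed-size batches from a list of rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Example: 10 rows split into batches of 3 -> batch sizes 3, 3, 3, 1
batches = list(chunked(list(range(10)), 3))

# Each batch would then go to one client.execute(insert_query, batch) call,
# giving one data part per chunk instead of one per row.
```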
References
https://clickhouse-driver.readthedocs.io/en/latest/quickstart.html#inserting-data