Stream PostgreSQL to ClickHouse with Kafka: Part 1

Table of Contents

Introduction

A quick calculation of analytical business data using metrics for modeling, planning, or forecasting is possible with OLAP only. Also a lot of business applications for reporting, simulation models, information-to-knowledge transfers, and trend and performance management are supported by OLAP which is also the cornerstone of analytics. Regarding OLAP requirements, migration to a ClickHouse database should be taken into consideration and this is possible with Change Data Capture (CDC) method continuously. In this article, we will dive into PostgreSQL to ClickHouse migration project with a tutorial in two parts.

How does PostgreSQL > ClickHouse streaming work?

The use of Change Data Capture (CDC) is a great method for migrating from an OLTP database to an OLAP database and keeping track of row-level changes made to database tables as a result of bulk loads, insert, update, and delete operations. Many databases, including PostgreSQL, MySQL, SQL Server, MongoDB, and Oracle via Debezium, use a variety of techniques to reveal these change data events. The logical replication feature of PostgreSQL allows this CDC approach to be used specifically with PostgreSQL. The captured changes are then written into a Kafka topic. Finally, the topic’s content will be sent to ClickHouse via the ChistaDATA Python Connector.

What about PostgreSQL replication methods?

We mentioned the logical replication above but what is this? How does PostgreSQL replication work?

Physical replication and logical replication are the two forms of replication that PostgreSQL offers. Unlike logical replication, which merely copies data changes like inserts, updates, and deletes of rows in tables, physical replication produces an exact copy of the data, including the underlying storage structure. As it simply duplicates changes to the data and not the full dataset, logical replication offers a more adaptable and effective method of data replication. This may aid in lowering replication overhead and the amount of disk space needed to hold duplicated data.

Logical replication in PostgreSQL is accomplished by copying write-ahead log WAL entries, which represent data changes. These records include the type of change (insert, update, or delete), the time the change was performed, and the actual data that was altered. This information is all required to reproduce the changes that were made to the data.

As of PostgreSQL 10+, there is a logical replication stream mode, called pgoutput that is natively supported by PostgreSQL. This means that a Debezium PostgreSQL connector can consume that replication stream without the need for additional plug-ins. This is particularly valuable for environments where installation of plug-ins is not supported or not allowed.

Once we have confirmed that we are running PostgreSQL version 10 or later, we can create the debezium config file to match this configuration. Since we will be using pgoutput method. The other information about the PostgreSQL server and the tables for which the CDC will be used should also be included in Debezium configuration file. Here is our general design:

According to the design, we should see the message for every row under our Kafka topic as JSON structure. Observe how the create operation ("op": "c") is displayed in the message. This operation type indicates that the row was recently inserted; the source indicates the PostgreSQL instance metadata from which this event was retrieved; and after providing values for each of the row’s columns. If ("op": "d")indicates that data is deleted and ("op": "u")means there are some update operations therefore, part of the before won’t be null anymore. The key component of the ChistaDATA Python Connector is ("op": "r"). All the Kafka messages (PostgreSQL data) may migrate in bulk if the existing table was specified in Debezium.

Here is a sample message under the Kafka topic:

{
        "before": null,
        "after": {
            "id": 1,
            "department": "full time",
            "employee_number": "776",
            "date_of_recruitment": "1616071869"
        },
        "source": {
            "version": "2.1.1.Final",
            "connector": "postgresql",
            "name": "topic_under",
            "ts_ms": 1593018069944,
            "snapshot": "false",
            "db": "salary",
            "schema": "public",
            "table": "employees",
            "txId": 602,
            "lsn": 184579736,
            "xmin": null
        },
        "op": "c",
        "ts_ms": 1593018069947,
        "transaction": null
    }

The ChistaDATA Python Connector, which enables the migration from Kafka to ClickHouse, is the most robust tool in the market in terms of restart issues and supported data types. The most current data that was sent to ClickHouse is known to the ChistaDATA Python Connector. As a consequence, you may be certain that migration will continue where it left off in the event that your machine needs to be restarted.

Also Date and Timestamp data types can specify epoch time on the Kafka side and these values will be converted to actual values in ClickHouse. Boolean support is available as well! Here is another supporting data type list:

Conclusion

Debezium is a fantastic tool for capturing row-level changes that occur on a database and streaming those changes to a broker of our choosing. Moreover, the ChistaDATA Python Connector is a powerful tool for applying messages to ClickHouse. We are capable of bulk loading two million data transfers per minute. Moreover, once the ("op": "r") loads, the ("op": "c"), ("op": "u"), and ("op": "d") processes can be migrated at astounding rates. In the second part, we will implement a tutorial on how to migrate data. Hope you found it useful!

To read more about streaming data to ClickHouse, do consider reading the below articles:

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services

Streaming Data from PostgreSQL to ClickHouse using Kafka and Debezium: Part 1

Introduction

How does PostgreSQL > ClickHouse streaming work?

What about PostgreSQL replication methods?

Conclusion