Archiving Data From PostgreSQL to ClickHouse

Introduction

Databases store more data every year, and regulatory requirements increasingly force businesses to monitor their data and exercise control over its management and retention. Together, these trends call for efficient data management strategies.

Data archiving is the movement of inactive data from production systems to long-term storage. Archived data is kept in a database or on a storage system so it can always be reactivated when needed.

This article provides general information about data archiving and walks through an example scenario that transfers data from PostgreSQL to ClickHouse.

Importance of Data Archiving

A data archiving strategy is a crucial component of a data lifecycle management policy because it preserves data while keeping the storage budget reasonable. It also helps improve performance in production systems. The main advantages of data archiving solutions are:

  • Increase capacity
  • Reduce storage load and costs while increasing productivity
  • Meet compliance requirements
  • Centralize discovery
  • Make backup easy
  • Manage and tier data

Data Tiering

Data lifecycle management processes manage the entire lifecycle of data, from the moment it is created until it is deleted. The term “data tiering” arose from moving data between tiers or classes of storage within a storage system, but it has since expanded to mean archiving data from any source to any destination. Within the scope of data tiering, we can classify data into three main groups (see the sketch after the list):

  • Hot Data: Accessed frequently and generally stored on fast storage in transactional databases.
  • Warm Data: Accessed less frequently, typically for analytics and reporting. Stored in capacity-optimized databases or storage.
  • Cold Data: Rarely accessed, often retained only for compliance reasons. Stored on the cheapest storage systems, such as tape.
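
ClickHouse itself can express this kind of tiering with TTL rules that move older parts to slower volumes. Below is a minimal sketch; the table name, the “tiered” storage policy, and its “hot” and “cold” volumes are illustrative and assume the policy is already defined in the server configuration.

-- Illustrative sketch: assumes a storage policy named 'tiered' with
-- 'hot' and 'cold' volumes is defined in the ClickHouse configuration.
CREATE TABLE tiering_example
(
id Int32,
log_date Date
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(log_date)
ORDER BY log_date
TTL log_date + INTERVAL 1 MONTH TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';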

Runbook to Archive Data From PostgreSQL to ClickHouse

In light of the information above, moving the warm or cold data of transactional databases such as PostgreSQL to other systems benefits both database performance and manageability.

To create the ClickHouse and PostgreSQL containers in Docker, create the following “docker-compose.yaml” file.

version: '3.7'
services:
    clickhouse-server:
        image: clickhouse/clickhouse-server
        container_name: clickhouse_server
        hostname: clickhouse01
        ports:
            - "8000:9000"   # native TCP interface (host:container)
            - "9123:8123"   # HTTP interface
            - "9009:9009"   # interserver communication
        restart: always
        tty: true
        ulimits:
            memlock:
                soft: -1
                hard: -1
            nofile:
                soft: 262144
                hard: 262144
        deploy:
            resources:
                limits:
                    memory: 2g
        cap_add:
            - IPC_LOCK

    postgres:
        image: postgres
        container_name: postgres
        hostname: postgres
        ports:
            - "5432:5432"
        restart: always
        environment:
            POSTGRES_PASSWORD: password
        command: [ "postgres", "-c", "wal_level=logical", "-c", "max_replication_slots=4"]

Run the following command in the directory where the docker-compose.yaml file was created.

docker-compose up -d

PostgreSQL Configuration

First, open a shell in the “postgres” container as the postgres user.

docker exec -it -u postgres postgres /bin/bash

Then, modify the following parameters in the postgresql.conf file as shown. In our case, this file is located under “/var/lib/postgresql/data”.

listen_addresses = '*' 
max_replication_slots = 10
wal_level = logical
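
Note that changing wal_level requires a server restart, while the other parameters can be applied with a reload. Once the server is back up, you can verify the active values from psql:

SHOW wal_level;
SHOW max_replication_slots;
SHOW listen_addresses;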

After connecting to the database with “psql”, create a user for data replication and archiving.

postgres@postgres:/$ psql
psql (14.5 (Debian 14.5-1.pgdg110+1))
Type "help" for help.

postgres=# CREATE ROLE clickhouse_user SUPERUSER LOGIN PASSWORD '<your_password>';

Create a database for replication and archiving, and connect to it.

postgres=# CREATE DATABASE testdb;
CREATE DATABASE
postgres=# \connect testdb
You are now connected to database "testdb" as user "postgres".

In our scenario, let’s assume that only the current month’s data is hot and updatable. For that reason, create a partitioned test table.

Also, create an “ARCHIVE_STATUS” table to log the archive operations.

CREATE TABLE TESTTABLE (
ID INT NOT NULL,
LOG_DATE DATE)
PARTITION BY RANGE (LOG_DATE);


-- Upper bounds of range partitions are exclusive in PostgreSQL,
-- so each partition ends at the first day of the next month.
CREATE TABLE TESTTABLE_P202207 
PARTITION OF TESTTABLE
FOR VALUES FROM ('01-JUL-2022') TO ('01-AUG-2022');

CREATE TABLE TESTTABLE_P202208 
PARTITION OF TESTTABLE
FOR VALUES FROM ('01-AUG-2022') TO ('01-SEP-2022');

CREATE TABLE TESTTABLE_P202209 
PARTITION OF TESTTABLE
FOR VALUES FROM ('01-SEP-2022') TO ('01-OCT-2022');

CREATE TABLE ARCHIVE_STATUS (
TABLE_NAME VARCHAR(32),
PARTITION_NAME VARCHAR(32),
IS_ARCHIVED INT DEFAULT 0,
ARCH_DATE DATE);
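
You can confirm that all three partitions are attached to “TESTTABLE” by querying the catalog:

SELECT inhrelid::regclass AS partition_name
FROM pg_inherits
WHERE inhparent = 'testtable'::regclass;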

Insert data into “TESTTABLE”, two rows for each partition.

INSERT INTO TESTTABLE
(id, log_date)
VALUES
(1, '02-JUL-2022'),
(2, '03-JUL-2022');

INSERT INTO TESTTABLE
(id, log_date)
VALUES
(3, '20-AUG-2022'),
(4, '24-AUG-2022');

INSERT INTO TESTTABLE
(id, log_date)
VALUES
(5, '5-SEP-2022'),
(6, '6-SEP-2022');
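
To check which partition each row landed in, query the hidden tableoid column:

SELECT tableoid::regclass AS partition_name, id, log_date
FROM testtable
ORDER BY id;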

Add the following line to the “pg_hba.conf” file to allow the new user to connect to the new database for replication. In our case, this file is located under “/var/lib/postgresql/data”.

# TYPE  DATABASE        USER            ADDRESS                 METHOD

.
.
.


host    testdb          clickhouse_user 192.168.1.0/24          password
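
Changes to pg_hba.conf take effect after a configuration reload, which can be triggered from psql:

SELECT pg_reload_conf();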

The initial configuration for PostgreSQL is completed.

ClickHouse Configuration

Connect to the ClickHouse server via Docker and open a client session (you can also connect to ClickHouse with a GUI such as DBeaver).

ubuntu@dockerSrv:~/clickHouseArchive$ docker exec -it clickhouse_server /bin/bash
root@clickhouse01:/# 
root@clickhouse01:/# clickhouse-client 
ClickHouse client version 22.8.5.29 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 22.8.5 revision 54460.

Warnings:
 * Linux is not using a fast TSC clock source. Performance can be degraded. Check /sys/devices/system/clocksource/clocksource0/current_clocksource

clickhouse01 :) 

Create tables with the PostgreSQL table engine so you can read and write the remote PostgreSQL tables from ClickHouse.

CREATE TABLE archive_status
(
table_name String,
partition_name String,
is_archived int,
arch_date date
)
ENGINE = PostgreSQL('<HOST>:<PORT>', 'testdb', 'archive_status', 'clickhouse_user', '<your_password>');



CREATE TABLE testtable_postgre
(
id int,
log_date date
)
ENGINE = PostgreSQL('<HOST>:<PORT>', 'testdb', 'testtable', 'clickhouse_user', '<your_password>');
clickhouse01 :) select * from testtable_postgre;

SELECT *
FROM testtable_postgre

Query id: 0a819daf-5738-4231-b5d8-752d201448f1

┌─id─┬───log_date─┐
│  1 │ 2022-07-02 │
│  2 │ 2022-07-03 │
│  3 │ 2022-08-20 │
│  4 │ 2022-08-24 │
│  5 │ 2022-09-05 │
│  6 │ 2022-09-06 │
└────┴────────────┘

Create a local partitioned table with the MergeTree engine.

create table testtable_postgre_archive 
ENGINE = MergeTree()
PARTITION BY toYYYYMM(log_date) 
ORDER BY log_date
as select * from testtable_postgre where 1=2;
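
Partitioning by toYYYYMM(log_date) mirrors the monthly range partitions on the PostgreSQL side, so each archived PostgreSQL partition maps to exactly one ClickHouse partition. The “where 1=2” predicate copies only the table structure, which you can confirm:

SELECT count() FROM testtable_postgre_archive;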

Let’s assume that the 202207 partition of “TESTTABLE” in PostgreSQL is no longer accessed and we would like to archive that partition to ClickHouse.

Because every table partition in PostgreSQL is itself a table, you can directly access individual partitions from ClickHouse as tables.

CREATE TABLE testtable_postgre_p202207
(
id int,
log_date date
)
ENGINE = PostgreSQL('<HOST>:<PORT>', 'testdb', 'testtable_p202207', 'clickhouse_user', '<your_password>');

Now you can archive just the related partition to the local ClickHouse table. We also add a record for the archived partition to the “archive_status” table so that an admin or an automated job can later delete the cold data from the source database.

INSERT INTO testtable_postgre_archive
SELECT * FROM testtable_postgre_p202207;

INSERT INTO archive_status
VALUES ('testtable', 'testtable_p202207', 1, now());
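
You can verify that the archived rows landed in the expected local partition through the system.parts table:

SELECT partition, rows
FROM system.parts
WHERE table = 'testtable_postgre_archive' AND active;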

The local archive table now includes only the data between 2022-07-01 and 2022-07-31, that is, the cold data.

clickhouse01 :) select * from testtable_postgre_archive;

SELECT *
FROM testtable_postgre_archive

Query id: 7f6a1d0f-eb32-426d-ab57-8eedf8bdb822

┌─id─┬───log_date─┐
│  1 │ 2022-07-02 │
│  2 │ 2022-07-03 │
└────┴────────────┘

2 rows in set. Elapsed: 0.013 sec. 
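
Once the “archive_status” record confirms the copy and the row counts have been validated, the cold partition can be removed on the PostgreSQL side, for example:

-- On PostgreSQL: detach the archived partition from the parent, then drop it.
ALTER TABLE testtable DETACH PARTITION testtable_p202207;
DROP TABLE testtable_p202207;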

Conclusion

Data archiving strategies and data tiering are important components of data lifecycle management. Data archiving provides many advantages, such as performance improvement, storage cost control, and tiered configuration. Thanks to its native database integration engines, ClickHouse is a strong option for archiving data from transactional databases such as MySQL and PostgreSQL.
