Every year, databases store more data than they did the year before. At the same time, regulatory requirements force businesses to monitor all of their data and exercise control over its management and retention. This calls for efficient data management strategies.
Data archiving is the movement of inactive data from production systems to long-term storage or archive databases. Archived data is kept in a database or storage system so it can be reactivated whenever it is needed.
This article provides general information about data archiving and creates an example scenario showing data transfer from PostgreSQL to ClickHouse.
Importance of Data Archiving
A data archiving strategy is a crucial component of a data lifecycle management policy because it lets you preserve data while keeping to a reasonable storage budget. It also helps improve the performance of production systems. The main advantages of data archiving solutions are:
- Increase capacity
- Reduce storage load and costs while increasing productivity
- Meet compliance requirements
- Centralize discovery
- Make backup easy
- Manage and tier data
Data lifecycle management processes manage the entire lifecycle of data, from creation to deletion. The term “data tiering” arose from moving data between different tiers or classes of storage within a storage system, but has since expanded to mean archiving data from any source to any destination. Within the scope of data tiering, we can classify data into three main groups:
- Hot Data: Accessed frequently; generally stored on fast storage in transactional databases.
- Warm Data: Accessed less frequently; used for analytics and reporting. Stored in capacity-optimized databases or storage.
- Cold Data: Rarely accessed, often retained only for compliance reasons. Stored on the cheapest storage systems, such as object storage or tape.
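As a rough illustration, the tiering decision often reduces to the age of a record. The sketch below classifies rows by age; the 30-day and 365-day thresholds are arbitrary assumptions for the example, not values from any standard:

```python
from datetime import date

# Hypothetical age thresholds -- tune these to your own access patterns.
HOT_DAYS = 30     # accessed frequently: keep in the transactional database
WARM_DAYS = 365   # used for analytics/reports: capacity-optimized storage

def tier(log_date: date, today: date) -> str:
    """Classify a record as hot, warm, or cold based on its age in days."""
    age = (today - log_date).days
    if age <= HOT_DAYS:
        return "hot"
    if age <= WARM_DAYS:
        return "warm"
    return "cold"

today = date(2022, 9, 15)
print(tier(date(2022, 9, 5), today))   # recent row -> hot
print(tier(date(2022, 7, 2), today))   # a few months old -> warm
print(tier(date(2020, 1, 1), today))   # years old -> cold
```

A real policy would usually also consider access statistics, not just row age, but the age-based cut is the common starting point.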
Archive Data From PostgreSQL to ClickHouse
In light of the above, moving the warm or cold data of transactional databases such as PostgreSQL to another system benefits both database performance and manageability.
To create the ClickHouse and PostgreSQL containers in Docker, first create a “docker-compose.yaml” file:
```yaml
version: '3.7'
services:
  clickhouse-server:
    image: clickhouse/clickhouse-server
    container_name: clickhouse_server
    hostname: clickhouse01
    ports:
      - "8000:9000"
      - "9123:8123"
      - "9009:9009"
    restart: always
    tty: true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 262144
        hard: 262144
    deploy:
      resources:
        limits:
          memory: 2g
    cap_add:
      - IPC_LOCK
  postgres:
    image: postgres
    container_name: postgres
    hostname: postgres
    ports:
      - "5432:5432"
    restart: always
    environment:
      POSTGRES_PASSWORD: password
    command: [ "postgres", "-c", "wal_level=logical", "-c", "max_replication_slots=4" ]
```
Run the following command in the directory where the docker-compose file was created.
```shell
docker-compose up -d
```
First of all, open a shell in the “postgres” container as the postgres user.
```shell
docker exec -it -u postgres postgres /bin/bash
```
Then, modify the following parameters in the postgresql.conf file as shown. In our case, this file is located under “/var/lib/postgresql/data“. Note that changing listen_addresses and wal_level requires a server restart to take effect.
```
listen_addresses = '*'
max_replication_slots = 10
wal_level = logical
```
After connecting to the database with “psql“, create a user for data replication and archiving.
```
postgres@postgres:/$ psql
psql (14.5 (Debian 14.5-1.pgdg110+1))
Type "help" for help.

postgres=# CREATE ROLE clickhouse_user SUPERUSER LOGIN PASSWORD '<your_password>';
```
For replication and archiving, create a database and connect to it.
```
postgres=# CREATE DATABASE testdb;
CREATE DATABASE
postgres=# \connect testdb
You are now connected to database "testdb" as user "postgres".
```
In our scenario, let’s assume that only the current month’s data is hot and updatable. For that reason, create a partitioned test table.
Also, create an “ARCHIVE_STATUS” table to log the archive operations.
```sql
CREATE TABLE TESTTABLE (
    ID INT NOT NULL,
    LOG_DATE DATE
) PARTITION BY RANGE (LOG_DATE);

-- Range partition bounds are inclusive at FROM and exclusive at TO,
-- so each upper bound is the first day of the following month.
CREATE TABLE TESTTABLE_P202207 PARTITION OF TESTTABLE
    FOR VALUES FROM ('01-JUL-2022') TO ('01-AUG-2022');
CREATE TABLE TESTTABLE_P202208 PARTITION OF TESTTABLE
    FOR VALUES FROM ('01-AUG-2022') TO ('01-SEP-2022');
CREATE TABLE TESTTABLE_P202209 PARTITION OF TESTTABLE
    FOR VALUES FROM ('01-SEP-2022') TO ('01-OCT-2022');

CREATE TABLE ARCHIVE_STATUS (
    TABLE_NAME VARCHAR(32),
    PARTITION_NAME VARCHAR(32),
    IS_ARCHIVED INT DEFAULT 0,
    ARCH_DATE DATE
);
```
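Creating a new monthly partition by hand each month is easy to get wrong, because PostgreSQL range partitions are inclusive at FROM and exclusive at TO. A small, hypothetical helper that emits the DDL for a given month might look like this; the table and naming convention follow the example above, but the function itself is not part of PostgreSQL:

```python
from datetime import date

def month_partition_ddl(parent: str, year: int, month: int) -> str:
    """Emit CREATE TABLE ... PARTITION OF DDL for one calendar month.

    PostgreSQL range partitions are inclusive at FROM and exclusive at TO,
    so the upper bound is the first day of the *next* month.
    """
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_p{year}{month:02d}"
    return (
        f"CREATE TABLE {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )

print(month_partition_ddl("testtable", 2022, 10))
# CREATE TABLE testtable_p202210 PARTITION OF testtable
#   FOR VALUES FROM ('2022-10-01') TO ('2022-11-01');
```

A scheduled job could run such a generator shortly before each month begins so the new partition always exists before rows arrive.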
Insert data into “TESTTABLE”, two rows per partition.
```sql
INSERT INTO TESTTABLE (id, log_date) VALUES (1, '02-JUL-2022'), (2, '03-JUL-2022');
INSERT INTO TESTTABLE (id, log_date) VALUES (3, '20-AUG-2022'), (4, '24-AUG-2022');
INSERT INTO TESTTABLE (id, log_date) VALUES (5, '05-SEP-2022'), (6, '06-SEP-2022');
```
Add the following rows to the “pg_hba.conf” file to allow connections to the new database with the new user for replication, then reload the configuration (for example with SELECT pg_reload_conf();). In our case, this file is located under “/var/lib/postgresql/data“.
```
# TYPE  DATABASE  USER             ADDRESS         METHOD
. . .
host    testdb    clickhouse_user  192.168.1.0/24  password
```
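The ADDRESS column in pg_hba.conf is matched as a CIDR network, so it is worth checking that the subnet you list actually covers your ClickHouse host. Python's standard ipaddress module can sanity-check this; the subnet below is the one from this example, and the client addresses are hypothetical:

```python
import ipaddress

# The subnet from the pg_hba.conf entry above.
allowed = ipaddress.ip_network("192.168.1.0/24")

# Hypothetical client addresses -- replace with your ClickHouse host's IP.
print(ipaddress.ip_address("192.168.1.42") in allowed)   # True: connection matches this rule
print(ipaddress.ip_address("192.168.2.42") in allowed)   # False: rejected by this rule
```

If the ClickHouse container reaches PostgreSQL over a Docker bridge network, remember that the relevant address is the one PostgreSQL sees, not the host's LAN address.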
The initial configuration for PostgreSQL is completed.
Connect to the ClickHouse server container via Docker and start the client. (You can also connect to ClickHouse with a GUI such as DBeaver.)
```
ubuntu@dockerSrv:~/clickHouseArchive$ docker exec -it clickhouse_server /bin/bash
root@clickhouse01:/# clickhouse-client
ClickHouse client version 22.8.5 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 22.8.5 revision 54460.

Warnings:
 * Linux is not using a fast TSC clock source. Performance can be degraded. Check /sys/devices/system/clocksource/clocksource0/current_clocksource

clickhouse01 :)
```
Create tables with the PostgreSQL table engine so you can read and write the remote PostgreSQL tables.
```sql
CREATE TABLE archive_status (
    table_name String,
    partition_name String,
    is_archived int,
    arch_date date
) ENGINE = PostgreSQL('<IP>:<PORT>', 'testdb', 'archive_status', 'clickhouse_user', '<your_password>');

CREATE TABLE testtable_postgre (
    id int,
    log_date date
) ENGINE = PostgreSQL('<IP>:<PORT>', 'testdb', 'testtable', 'clickhouse_user', '<your_password>');
```
```
clickhouse01 :) select * from testtable_postgre;

SELECT * FROM testtable_postgre

Query id: 0a819daf-5738-4231-b5d8-752d201448f1

┌─id─┬───log_date─┐
│  1 │ 2022-07-02 │
│  2 │ 2022-07-03 │
│  3 │ 2022-08-20 │
│  4 │ 2022-08-24 │
│  5 │ 2022-09-05 │
│  6 │ 2022-09-06 │
└────┴────────────┘
```
Create a local partitioned table with the MergeTree engine.
```sql
CREATE TABLE testtable_postgre_archive
ENGINE = MergeTree()
PARTITION BY toYYYYMM(log_date)
ORDER BY log_date
AS SELECT * FROM testtable_postgre WHERE 1 = 2;
```
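ClickHouse's toYYYYMM simply encodes a date's year and month as a number like 202207, which becomes the partition ID of the MergeTree table. A one-line Python equivalent makes the partitioning scheme easy to reason about (this is a sketch of the semantics, not ClickHouse code):

```python
from datetime import date

def to_yyyymm(d: date) -> int:
    """Mimic ClickHouse's toYYYYMM(): 2022-07-02 -> 202207."""
    return d.year * 100 + d.month

# The six sample rows from the article land in three monthly partitions.
rows = [date(2022, 7, 2), date(2022, 7, 3), date(2022, 8, 20),
        date(2022, 8, 24), date(2022, 9, 5), date(2022, 9, 6)]
print(sorted({to_yyyymm(d) for d in rows}))   # [202207, 202208, 202209]
```

Matching the ClickHouse partition key to the PostgreSQL monthly partitions keeps the archive one-to-one with the source partitions, which makes later housekeeping (for example, dropping an archived month) straightforward.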
Let’s assume that the 202207 partition of “TESTTABLE” in PostgreSQL is no longer accessed, and we would like to archive that partition to ClickHouse.
Because every partition in PostgreSQL is itself a table, you can access individual partitions from ClickHouse directly as tables.
```sql
CREATE TABLE testtable_postgre_p202207 (
    id int,
    log_date date
) ENGINE = PostgreSQL('<HOST>:<PORT>', 'testdb', 'testtable_p202207', 'clickhouse_user', '<your_password>');
```
Now you can archive just the relevant partition to the local ClickHouse table. We also record each archived partition in the “archive_status” table so that an admin or an automated job can delete the cold data from the source DB.
```sql
INSERT INTO testtable_postgre_archive SELECT * FROM testtable_postgre_p202207;

INSERT INTO archive_status VALUES ('testtable', 'testtable_p202207', 1, now());
```
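The two statements above generalize into a small routine: given a partition, copy its rows into the local archive table and record the fact in archive_status. The sketch below only builds the SQL strings; actually executing them through a ClickHouse client driver is left out, and the naming convention it assumes is the one used in this example:

```python
def archive_partition_sql(table: str, part: str) -> list[str]:
    """Build the ClickHouse statements that archive one PostgreSQL partition.

    `table` is the logical table name ('testtable') and `part` is the
    partition table name ('testtable_p202207'), as in the example above.
    """
    # The ClickHouse-side proxy table for the partition, per this article's
    # naming scheme: testtable_p202207 -> testtable_postgre_p202207.
    proxy = part.replace(table, f"{table}_postgre")
    return [
        f"INSERT INTO {table}_postgre_archive SELECT * FROM {proxy};",
        f"INSERT INTO archive_status VALUES ('{table}', '{part}', 1, now());",
    ]

for stmt in archive_partition_sql("testtable", "testtable_p202207"):
    print(stmt)
```

Run monthly, a routine like this (plus a proxy-table CREATE for each new partition) turns the manual steps above into a repeatable archiving job.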
The local archive table now contains only the data between 2022-07-01 and 2022-07-31, that is, the cold data.
```
clickhouse01 :) select * from testtable_postgre_archive;

SELECT * FROM testtable_postgre_archive

Query id: 7f6a1d0f-eb32-426d-ab57-8eedf8bdb822

┌─id─┬───log_date─┐
│  1 │ 2022-07-02 │
│  2 │ 2022-07-03 │
└────┴────────────┘

2 rows in set. Elapsed: 0.013 sec.
```
Data archiving strategies and data tiering are important components of data lifecycle management. Data archiving provides many advantages, such as performance improvement, storage cost management, and tiered configuration. Thanks to its native database connection engines, ClickHouse can be a good archive target for transactional databases such as MySQL and PostgreSQL.