Runbook for Migration from Amazon Redshift to ChistaDATA Cloud for ClickHouse: Part 1


Introduction

Amazon Redshift is a proprietary cloud data warehouse offered by AWS. It is based on PostgreSQL, but with significant modifications: Amazon Redshift is an OLAP data store, whereas PostgreSQL is an OLTP database system. Amazon Redshift was first made available to the public in October 2012.

ClickHouse is an open-source, columnar analytical database focused on speed, developed by ClickHouse, Inc. It was first made available to the public in 2016 and has grown by leaps and bounds since its arrival.

Fully managed ClickHouse clusters can be set up and running in a few minutes using ChistaDATA Cloud. Along with the impressive features and benefits provided by ClickHouse, ChistaDATA Cloud offers fully autonomous cloud database infrastructure on ClickHouse, focused on security and reliability. Unlike other fully managed ClickHouse providers, ChistaDATA offers on-premises deployment and a fully customizable cluster architecture for data replication and sharding.

In this article, let us look at the steps to migrate data from Amazon Redshift to ChistaDATA Cloud using the following approach.

Amazon Redshift is commonly used as a data warehouse, storing analytical data from multiple sources. Since it is an AWS-hosted solution, we can easily dump the data from Amazon Redshift to AWS S3. And since ChistaDATA Cloud is powered by ClickHouse, we can then use the S3 table engine to read the data dumped in S3. Before we start, we have to ensure the following prerequisites are met.

  1. A running cluster on ChistaDATA Cloud. You can sign up for a free trial here and create one.
  2. An AWS IAM user (or root user) with admin privileges
  3. A Redshift cluster
  4. An S3 bucket

Step 1 – Create an IAM role to unload the data from Redshift

Go to the Namespace configuration for the Redshift instance and select the Security and encryption tab. In the Permissions section, choose Manage IAM roles. On the new page, select Create IAM role.


On the next screen, select the option “Specific S3 buckets” and choose the S3 bucket into which the data will be unloaded.

Once the role is created, note down the ARN of the role.

Step 2 – Copy the data from the Redshift table to S3

Assuming the table that needs to be migrated is available in Redshift, we can unload its data to S3. To do this, open the query editor in the Redshift dashboard and choose the appropriate database in which to execute the SQL statements.

Then modify the SQL below to suit your needs and execute it in the query editor.

unload ('<SELECT query>')
to '<S3_URL/filename.csv>' 
iam_role '<ARN string>'
parallel off
CSV;

The data returned by the SELECT query will be dumped as a CSV file into the S3 bucket under the name <filename.csv>. We have to provide the ARN of the IAM role we just created so that Redshift is allowed to unload the data. The parallel setting lets us unload the data in parallel to multiple files with the mentioned filename prefix; when turned off, the data is dumped into a single file. Note that Redshift appends a slice number to the object name (for example, irisdump1.csv000), so the file in S3 will not match the given name exactly.

unload ('select * from iris_data')
to 's3://redshift.poc/irisdump1.csv' 
iam_role 'arn:aws:iam::1234123412341234:role/service-role/AmazonRedshift-CommandsAccessRole-20230713T161943'
parallel off
CSV;

The above example will dump all the data from the iris_data table in Redshift to the mentioned S3 bucket.
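To confirm the unload succeeded, we can query Redshift's STL_UNLOAD_LOG system table — a minimal sketch, assuming a provisioned cluster (Redshift Serverless exposes similar information through the SYS_UNLOAD_HISTORY view):

-- List recent UNLOAD operations and the S3 objects they wrote
SELECT query, path, line_count, transfer_size
FROM stl_unload_log
ORDER BY start_time DESC
LIMIT 10;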


Step 3 – Create ClickHouse tables to read the data in AWS S3

The next step is to create a table on the ChistaDATA database server using the S3 table engine. We need to know the columns and their datatypes beforehand.

CREATE TABLE <databasename>.<tablename>
(column1 <datatype>,
column2 <datatype>,
.
.
.
columnX <datatype>)
ENGINE=S3('<S3 Url for the file>', '<aws_access_key_id>', '<aws_secret_access_key>', 'CSV')
SETTINGS input_format_with_names_use_header = 1;
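
If the schema is not known up front, ClickHouse's schema inference can suggest one through the s3 table function — a minimal sketch, assuming the file dumped in Step 2 and placeholder credentials:

-- Let ClickHouse infer column names and types from the CSV file
DESCRIBE TABLE s3('https://s3.amazonaws.com/redshift.poc/irisdump1.csv000',
                  '<aws_access_key_id>', '<aws_secret_access_key>', 'CSV');

With the 'CSV' format, inferred columns are named c1, c2, and so on; if the file has a header row, use 'CSVWithNames' instead to pick up the real column names.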

Let us create a table based on the iris data set.

CREATE TABLE redshift.poc
(sepal_length Float32,
sepal_width Float32,
petal_length Float32,
petal_width Float32,
variety String)
ENGINE=S3('https://s3.amazonaws.com/redshift.poc/irisdump1.csv000', 'xxxxxxxxxxxxxxxxx', 'xxxxxxxxxxxxxxxxxxxxx', 'CSV')
SETTINGS input_format_with_names_use_header = 1;

Let us test if it works.
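
A quick sanity check is to read a few rows through the S3-backed table:

-- Read a handful of rows to verify connectivity and CSV parsing
SELECT * FROM redshift.poc LIMIT 5;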

We can persist the data on the database server by creating a MergeTree-based table and doing an INSERT … SELECT from the S3 table into the MergeTree table, as sketched below.
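
A minimal sketch of that step, assuming the redshift.poc table created above (the table name redshift.iris_local and the ORDER BY key are illustrative choices, not fixed requirements):

-- MergeTree table with the same columns, to hold the data permanently
CREATE TABLE redshift.iris_local
(sepal_length Float32,
sepal_width Float32,
petal_length Float32,
petal_width Float32,
variety String)
ENGINE = MergeTree
ORDER BY variety;

-- Copy the rows from the S3-backed table into local MergeTree storage
INSERT INTO redshift.iris_local SELECT * FROM redshift.poc;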


Conclusion

Users can easily migrate data to ChistaDATA Cloud using S3 or streaming solutions such as Kafka, or pull the data directly from the source via connectors. We have seen one of the many ways to migrate data from Redshift to ChistaDATA Cloud. In the next part of this series, we will look at other ways to migrate the data.


References

  • https://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html
  • https://clickhouse.com/docs/en/engines/table-engines/integrations/s3
  • https://docs.cloud.chistadata.io/getting-started/index.html