Runbook for Migration from Snowflake to ChistaDATA Cloud for ClickHouse: Part 1

Introduction

Snowflake is a fully managed SaaS (Software as a Service) platform used for data warehousing, data engineering, data lakes, and data science applications. It is a proven cloud-based data platform and one of the top-ranked database systems according to db-engines.com.

ChistaDATA Cloud is the world’s most advanced, fully autonomous Cloud Database Infrastructure on ClickHouse® for Real-Time Data Analytics and OLAP. You can sign up for a free trial here.

While Snowflake has seen wide adoption, there are a few areas of concern around its pay-as-you-go billing, which can lead to large, unanticipated bills depending on usage. ChistaDATA offers a fixed pricing model based on the cluster size, so the user cannot exceed the fixed monthly cost. In addition, ChistaDATA is powered by ClickHouse, which is highly efficient and fast; you can compare the performance of Snowflake with ClickHouse here. It is evident that we can get Snowflake-like query speeds from ClickHouse using smaller servers, which translates to much lower costs.

Many enterprises using Snowflake are moving to cheaper and more efficient alternatives. In this article, I will demonstrate how to migrate the data from a Snowflake table to the ChistaDATA platform using AWS S3.

Pre-Requisites

  • Snowflake account with admin privileges (hosted on AWS)
  • AWS account with the privileges needed to create S3 buckets, IAM roles, and IAM users
  • ChistaDATA account with a ClickHouse server up and running
  • Python – to load sample data into the Snowflake table

Steps

  • Create a table in Snowflake and load the data
  • Create an S3 bucket in the same region as Snowflake
  • Allow Snowflake’s AWS VPC to access S3
  • Allow the Snowflake AWS user ID to access S3
  • Unload the data to S3
  • Create a table in ClickHouse using the S3 table engine
  • Load the data into a ClickHouse table

Step 1 – Snowflake Tables

Let us begin by creating the necessary database and table in Snowflake. This tutorial requires a Snowflake account hosted on AWS. Log in to Snowflake and execute the SQL below.

create or replace database CLICKHOUSE_SNOWFLAKE_DB;

create or replace schema CLICKHOUSE_SNOWFLAKE_DB.TEST_DB;

create or replace TABLE CLICKHOUSE_SNOWFLAKE_DB.TEST_DB.SNOWFLAKE_CH_TABLE (
    COUNT NUMBER(38,0),
    VAL FLOAT,
    NAME VARCHAR(3)
);

This table has just three columns: COUNT, VAL, and NAME. Let us load randomly generated data into it using a Python script. The script requires the following Python packages, which can be installed using the command below.

pip3 install snowflake-connector-python snowflake-sqlalchemy sqlalchemy pandas

After installing the necessary packages, modify and run the Python script below. The n_rows variable determines the number of rows to insert.

import pandas as pd
import random
import string

from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine

# Generate a random lowercase string of the given length
def random_string(str_len):
    chars = string.ascii_lowercase
    return ''.join(random.choice(chars) for _ in range(str_len))

# Number of rows to generate and insert
n_rows = 1000000

# Random data matching the table columns: COUNT (integer), VAL (float), NAME (3-character string)
data = [[random.randint(0, 99999), random.uniform(0, 99999.9), random_string(3)] for _ in range(n_rows)]
df = pd.DataFrame(data, columns=['count', 'val', 'name'])

# Snowflake connection via SQLAlchemy
engine = create_engine(URL(
    account = '<account_name>', # Example - pd33123.us-east-2.aws
    user = '<Snowflake login username>',
    password = '<Snowflake login password>',
    database = 'CLICKHOUSE_SNOWFLAKE_DB',
    schema = 'TEST_DB',
    role = '<role name>', # Example - ACCOUNTADMIN
    # warehouse = '<warehouse name>', # uncomment if your user has no default warehouse
))
connection = engine.connect()

# Append the DataFrame to the existing Snowflake table
df.to_sql(name='SNOWFLAKE_CH_TABLE', con=engine, index=False, if_exists='append')

connection.close()
engine.dispose()
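
Before moving on, it is worth confirming that the load succeeded. The row count returned by the query below (run from the Snowflake console) should match the n_rows value used in the script:

SELECT COUNT(*) FROM CLICKHOUSE_SNOWFLAKE_DB.TEST_DB.SNOWFLAKE_CH_TABLE;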

Let us also note down the Snowflake VPC ID, which we will need for the S3 bucket policy in Step 2. Run the following SQL statement to get it.

select system$get_snowflake_platform_info();


Sample Output:

{"snowflake-vpc-id":["vpc-xxxxxxxxxxxxxxxxx"]}

Step 2 – AWS S3 Bucket

Next, we should create an AWS S3 bucket in the same AWS region as the Snowflake account.

Ensure that public access is disabled for the bucket.

Next, add a bucket policy to this S3 bucket that allows access from the Snowflake VPC. An example is given below.

{
    "Version": "2012-10-17",
    "Id": "Policy1234567890123",
    "Statement": [
        {
            "Sid": "Stmt1234567890123",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::snowflake.ch.migration/*",
                "arn:aws:s3:::snowflake.ch.migration"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:sourceVpc": "vpc-xxxxxxxxxxxxxxxxx"
                }
            }
        }
    ]
}

Step 3 – Allow Snowflake User ID to access S3

Go to the IAM management console and create a new policy called SnowflakeCHPoC. In the policy document editor, use a policy JSON similar to the one below and create the policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::snowflake.ch.migration/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::snowflake.ch.migration",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "/*"
                    ]
                }
            }
        }
    ]
}

Go to the Snowflake SQL console and run the following SQL to create a storage integration in Snowflake for AWS S3. The STORAGE_AWS_ROLE_ARN refers to the IAM role Snowflake_POC, which we will create in a later step.

CREATE STORAGE INTEGRATION snowflake_ch_poc
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<your 12-digit AWS account ID>:role/Snowflake_POC'
  STORAGE_ALLOWED_LOCATIONS = ('s3://snowflake.ch.migration/SnowflakePOC/');

After this is successful, run the SQL below and note the following values from the results.

DESC INTEGRATION snowflake_ch_poc;

  • STORAGE_AWS_IAM_USER_ARN
  • STORAGE_AWS_EXTERNAL_ID

Next, create a new role called Snowflake_POC in AWS.

Attach the SnowflakeCHPoC policy created earlier to this role. Unless this is done, Snowflake cannot access the S3 bucket.

Edit the trust relationship for the role and add a trust policy similar to the one below.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<12 digit User ID from STORAGE_AWS_ROLE_ARN>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
                }
            }
        }
    ]
}

Step 4 – Unload the data from the Snowflake table to S3

Let us now unload the data from the Snowflake table directly into the S3 bucket. With all the configurations in place, we can copy the data into AWS S3 by running the following SQL in Snowflake.

COPY INTO 's3://snowflake.ch.migration/SnowflakePOC/' FROM CLICKHOUSE_SNOWFLAKE_DB.TEST_DB.SNOWFLAKE_CH_TABLE STORAGE_INTEGRATION = snowflake_ch_poc;
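
By default, Snowflake unloads the data as gzip-compressed CSV files, which is what the ClickHouse table in Step 5 expects. If you prefer to state the format explicitly (and overwrite the files on a re-run), a variant along the following lines should also work:

COPY INTO 's3://snowflake.ch.migration/SnowflakePOC/'
  FROM CLICKHOUSE_SNOWFLAKE_DB.TEST_DB.SNOWFLAKE_CH_TABLE
  STORAGE_INTEGRATION = snowflake_ch_poc
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
  OVERWRITE = TRUE;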

If you run the COPY statement with an unqualified table name, ensure that the correct database and schema are selected in Snowflake first.
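
For example, assuming the database and schema names from Step 1, the session context can be set with:

USE DATABASE CLICKHOUSE_SNOWFLAKE_DB;
USE SCHEMA TEST_DB;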

Once the SQL is executed successfully, wait for some time (depending on the data size) and verify that the data files are available in S3.

Step 5 – Load the data in ClickHouse

As the first step, create a separate IAM user named ch_snowflake_user in AWS and attach the SnowflakeCHPoC policy to it. This will allow the user to access the S3 bucket.

Create an access key for this IAM user; ClickHouse will use it to access S3.

Now log in to your ChistaDATA account, open the query editor on the ClickHouse server to which the data has to be migrated, and run the following SQL to create an S3-backed table.

CREATE TABLE snowflake_ch
(
    count Int64,   -- COUNT column from the Snowflake table
    val Float64,   -- VAL column
    name String    -- NAME column
)
ENGINE = S3('https://s3.us-east-2.amazonaws.com/snowflake.ch.migration/SnowflakePOC/data_0_0_0.csv.gz', '<access key ID>', '<secret access key for the ID>', 'CSV', 'gzip');

If the unload produced more than one file, a wildcard path such as s3://snowflake.ch.migration/SnowflakePOC/data_*.csv.gz can be used in place of the single file name.

Once the execution is successful, read a few rows of data from the table.
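
For example, a quick sanity check directly against the S3-backed table:

SELECT * FROM snowflake_ch LIMIT 10;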

We can also create a MergeTree table and copy the data into it for further processing and usage.

CREATE TABLE snowflake_ch_storage
(
    count Int64,
    val Float64,
    name String
)
ENGINE = MergeTree
ORDER BY count;

INSERT INTO snowflake_ch_storage SELECT * FROM snowflake_ch;
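
To verify the migration, compare the row count in ClickHouse against the number of rows loaded into Snowflake in Step 1 (the n_rows value):

SELECT count() FROM snowflake_ch_storage;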

Conclusion

So we can easily migrate from Snowflake to ChistaDATA, which offers faster data retrieval at a lower price. Migrating using S3 is just one of the many possible options, and we will cover the other alternatives in the upcoming blog articles.

References

https://community.snowflake.com/s/question/0D50Z00009C6023SAB/how-can-i-insert-data-into-snowflake-table-from-a-panda-data-frame

https://clickhouse.com/docs/en/engines/table-engines/integrations/s3

https://docs.snowflake.com/en/user-guide/data-unload-s3

ChistaDATA: Your Trusted ClickHouse Consultative Support and Managed Services Provider. Unlock the Power of Real-Time Analytics with ChistaDATA Cloud(https://chistadata.io) – the World’s Most Advanced ClickHouse DBaaS Infrastructure. Contact us at info@chistadata.com or (844)395-5717 for tailored solutions and optimal performance.