Introduction
Databricks is an enterprise software company founded by the creators of Apache Spark. Its defining feature is that it combines data warehouses and data lakes into a lakehouse architecture. Databricks SQL is a data warehouse on the Databricks Lakehouse Platform that allows us to run SQL queries and use the data for analytics and business intelligence.
ChistaDATA Cloud is the world’s most advanced fully autonomous Cloud Database Infrastructure on ClickHouse® for Real-Time Data Analytics and OLAP. Since ClickHouse is the world’s fastest OLAP database system, there will be instances where we need to migrate our data from a Databricks SQL warehouse to ChistaDATA Cloud.
Migration Runbook from Databricks to ClickHouse
In this article, we will look at a simple example of migrating data from a Databricks SQL warehouse to ChistaDATA Cloud using a simple Python script. We will use the MetroPT-3 dataset, available on the UCI Machine Learning Repository. This dataset contains sensor readings from the Air Production Unit of a metro train’s compressor, including its air intake valves. Some characteristics of the dataset are:
- 1.5 million rows
- 15 features + 1 timestamp column
- 218 MB of uncompressed data
The prerequisites are:
- Admin access or a user with sufficient privileges to create and read from a Databricks SQL warehouse
- A ChistaDATA Cloud account – Sign up here
- Python 3.10+
- Databricks SQL Connector for Python
- clickhouse-driver
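Both Python libraries can be installed with pip (the Databricks SQL Connector for Python is published on PyPI as databricks-sql-connector):
pip install databricks-sql-connector clickhouse-driver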
Let us look at the steps.
Load data into Databricks
Download the data from the UCI ML repository. Here is the download URL. Once the dataset is downloaded, extract the archive to a folder on your local computer. Go to the Databricks web UI and select New -> File Upload. For small datasets, we can directly upload the file to Databricks, let it auto-infer the table schema, and create the table.
Next, browse for the file on the local PC and upload it.
The upload progress will be displayed until the process is complete.
After the upload completes, Databricks will auto-infer the columns and suggest a table schema. We can accept the default options if no changes are needed.
Once the table is created, run the following queries to verify the data. You can try out more queries if needed.
SELECT COUNT(*) FROM chistadata_poc.default.metro_pt_3_air_compressor;
SELECT * FROM chistadata_poc.default.metro_pt_3_air_compressor LIMIT 5;
Loading the data into ChistaDATA cloud
Let us begin by creating a database and a table in ChistaDATA cloud. Go to the query editor of a cluster and run the following SQL statements.
CREATE DATABASE databricks_poc;

CREATE TABLE databricks_poc.metro_pt_3_air_compressor
(
    `timestamp` DateTime,
    TP2 Float64,
    TP3 Float64,
    H1 Float64,
    DV_pressure Float64,
    Reservoirs Float64,
    Oil_temperature Float64,
    Motor_current Float64,
    COMP Float64,
    DV_eletric Float64,
    Towers Float64,
    MPG Float64,
    LPS Float64,
    Pressure_switch Float64,
    Oil_level Float64,
    Caudal_impulses Float64
)
ENGINE = MergeTree
ORDER BY (`timestamp`);
Once the table is created, we can read the data from Databricks and load it into ChistaDATA Cloud. We can use the Python script below to perform the migration.
from databricks import sql
import os
from clickhouse_driver import Client

# Databricks connection object
databricks_connection = sql.connect(
    server_hostname='<databricks warehouse server hostname>',
    http_path='<http path>',
    access_token='<secret access token>'
)

# ClickHouse connection object
ch_client = Client(
    '<chistadata cluster hostname>',
    password='<password for default user>',
    port='9440',
    secure=True
)

# Configs
db_source_table_name = 'metro_pt_3_air_compressor'
db_source_database = 'chistadata_poc.default'
ch_dest_table_name = 'metro_pt_3_air_compressor'
ch_dest_database = 'databricks_poc'
batch_size = 10000

# Function to insert a batch of rows into ClickHouse
def insert_ch(data):
    ch_client.execute(f'INSERT INTO {ch_dest_database}.{ch_dest_table_name} VALUES', data)

# Read the data from Databricks and insert it into ClickHouse in large batches
db_cursor = databricks_connection.cursor()
db_cursor.execute(f"SELECT * FROM {db_source_database}.{db_source_table_name}")
result = db_cursor.fetchall()

tmp = []
for index, res in enumerate(result):
    tmp.append(tuple(res))
    if index > 0 and index % batch_size == 0:
        insert_ch(tmp)
        tmp = []
        print(f'Inserted batch {int(index / batch_size)} with {batch_size} rows')

# Insert any remaining rows
if tmp:
    insert_ch(tmp)

db_cursor.close()
databricks_connection.close()
ch_client.disconnect()
The above script reads the data from the Databricks SQL warehouse and inserts it into ClickHouse in batches of 10,000 rows (batch_size). The database and table names are configurable, and the script can be customized further to suit our requirements.
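One such customization, useful when the result set is too large to hold in memory with fetchall(), is to stream rows from Databricks with fetchmany() and insert each batch as it arrives. Below is a minimal sketch of that variant, assuming the same connection objects, config variables, and insert_ch function defined in the script above.

# Stream rows from Databricks in batches instead of loading everything with fetchall()
db_cursor = databricks_connection.cursor()
db_cursor.execute(f"SELECT * FROM {db_source_database}.{db_source_table_name}")

batch_number = 0
while True:
    rows = db_cursor.fetchmany(batch_size)
    if not rows:
        break
    insert_ch([tuple(row) for row in rows])
    batch_number += 1
    print(f'Inserted batch {batch_number} with {len(rows)} rows')

db_cursor.close()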
Once the script has run successfully, we can verify the data in ChistaDATA Cloud.
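For example, we can mirror the verification queries from the Databricks side in the cluster’s query editor; the row count should match what we saw earlier.
SELECT COUNT(*) FROM databricks_poc.metro_pt_3_air_compressor;
SELECT * FROM databricks_poc.metro_pt_3_air_compressor LIMIT 5;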
Conclusion
We have seen a simple way to migrate our data from Databricks to ChistaDATA Cloud. This method works well for smaller datasets but may not be the optimal choice for larger ones. For larger datasets, we can dump the data to S3 and ingest it into ClickHouse; we will cover that approach in upcoming parts.
To learn more about migrating to ClickHouse, read the following articles:
- Runbook for Migration from Amazon Redshift to ChistaDATA Cloud for ClickHouse: Part 1
- Runbook for Migration from Oracle to ChistaDATA’s ClickHouse
- Runbook for Migration from MSSQL to ChistaDATA’s ClickHouse
- Runbook for Migration from Snowflake to ChistaDATA Cloud for ClickHouse: Part 1
- Runbook for Migration from Hadoop to ChistaDATA’s ClickHouse
References
https://www.databricks.com/product/databricks-sql
https://clickhouse-driver.readthedocs.io/en/latest/
https://docs.cloud.chistadata.io/getting-started/index.html
https://docs.databricks.com/en/dev-tools/python-sql-connector.html