Migrating From Google BigQuery to ChistaDATA Cloud for ClickHouse: Part 1

Introduction

BigQuery is Google's fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. – Wikipedia

Google BigQuery has been around for over a decade and offers a serverless, fully managed solution in the data warehouse / analytical data store space. BigQuery can easily handle petabytes of data and also offers ML/AI and BI capabilities for insights at scale.

ChistaDATA Cloud is a relatively new player offering a fully managed, serverless analytical data store based on ClickHouse. ClickHouse was open-sourced in 2016 and is one of the fastest-growing analytical data stores, consistently ranking among the fastest analytical database systems for query performance. ClickHouse is hardware-efficient and incorporates many design choices that make queries run faster.

Google BigQuery is already in use by many large companies as their analytical data store and data warehouse. Many recent inquiries for ChistaDATA Cloud have been about migrating from Google BigQuery and Snowflake, and the recurring theme is their high costs, which increase every single year. While Snowflake and BigQuery have been around for a long time, ChistaDATA Cloud offers several benefits over these proprietary software giants.

  • Powered by open-source ClickHouse – Save costs on license fee
  • Hardware efficient – requires fewer resources than BigQuery and Snowflake to run the same queries
  • Fast Query Execution – Benchmarks available here
  • Transparent Billing – ChistaDATA offers fixed-price billing, so no surprises with the bills

Runbook for migration from Google BigQuery to ChistaDATA Cloud

In this blog article, I will demonstrate a simple example of migrating your data from BigQuery. I will use the COVID-19 open dataset; the details are available here. This dataset is around 500 MB, so it should be usable in the free tier of BigQuery. We will demonstrate the migration in the steps below. Before you proceed, make sure the following prerequisites are met.

  1. BigQuery access with permissions to create tables and insert data
  2. A Google Cloud Storage bucket with access to create, upload, and delete files
  3. Access Key ID and Secret access key (HMAC credentials) for the GCS bucket – see the note after this list
  4. ChistaDATA cloud account – Sign Up for a free trial here
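
If you do not already have HMAC credentials for the bucket, they can be generated for a service account with the gsutil CLI. A minimal sketch, assuming a hypothetical service account (replace the email with your own):

# Generate an Access Key ID / Secret pair (HMAC key) for a service account
gsutil hmac create migration-sa@your-project.iam.gserviceaccount.com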

Step 1 – Download the data

You can download the CSV file available at this URL. If you are on a Linux-based operating system, you can run the following command to download the file.

wget https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv

Next, upload the file to the Cloud Storage bucket. We can directly upload CSV files up to 100 MB while creating a table, but since this file is ~500 MB, we have to upload it to Cloud Storage first and then load the data into the BigQuery table.
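
For example, the upload can be done with the gsutil CLI (the bucket name below is the one used later in this walkthrough; substitute your own):

# Copy the ~500 MB CSV file into the Cloud Storage bucket
gsutil cp epidemiology.csv gs://bigquery_chistadata_migration_poc/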

Step 2 – Load the data into Google BigQuery

Go to the Google Cloud web console and navigate to BigQuery.


I have already created a dataset called migration_poc. Let us create a table under this dataset called covid19.


Enter the necessary information as given below.

Remember to choose the option to auto-detect the schema. This will save us a lot of time compared to defining the schema manually.


In the advanced options, select Jagged rows and proceed to create the table. Once the load job is completed, you can preview the table.


The data is ready to be queried from the BigQuery table. Let us assume this to be the data to be migrated from BigQuery to ClickHouse.
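
As a quick sanity check before migrating, note the row count on the BigQuery side; we will compare it against ClickHouse after ingestion. The table path is the one used in this walkthrough:

-- Row count on the BigQuery side, for comparison after the migration
SELECT COUNT(*) AS total_rows
FROM `dbaas-dev.migration_poc.covid19`;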

Step 3 – Dump the data into Cloud Storage

Ensure we have a directory ready in Google Cloud Storage where the table data will be dumped. Find the gsutil URI from the configuration section.

Next, go to the query editor of the covid19 table and execute the below query.

EXPORT DATA
  OPTIONS (
    uri = 'gs://bigquery_chistadata_migration_poc/dump/*.gz',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ',',
    compression = 'GZIP')
AS (
SELECT * FROM `dbaas-dev.migration_poc.covid19` 
);

This SQL statement will export the table's data as multiple CSV files to the specified Cloud Storage bucket path. The files will be compressed using GZIP, which comes in handy when we have a huge dataset to export, saving on storage space and network transfer while loading the data into ClickHouse. After the export succeeds, we can visually verify the export path in the Google Cloud Storage bucket.
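
The exported shards can also be listed from the command line, using the same bucket path as the export statement:

# List the exported, GZIP-compressed CSV shards and their sizes
gsutil ls -l gs://bigquery_chistadata_migration_poc/dump/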


Step 4 – Create a table in ChistaDATA Cloud

Now that we have the data in Cloud Storage, we can go ahead and create a table in ChistaDATA Cloud where we will store the data. Assuming we have a cluster up and running, go to the query editor and run the below query. This table will be based on the MergeTree table engine.

CREATE TABLE covid19 (
    `date` Date,
    location_key LowCardinality(String),
    new_confirmed Int32,
    new_deceased Int32,
    new_recovered Int32,
    new_tested Int32,
    cumulative_confirmed Int32,
    cumulative_deceased Int32,
    cumulative_recovered Int32,
    cumulative_tested Int32
)
ENGINE = MergeTree
ORDER BY (location_key, `date`);

This table is sorted on the location_key and `date` columns, which together form its primary index.
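
Queries that filter on the leading columns of the sorting key benefit from this index. A hypothetical example (the location key 'US' and date range are illustrative only):

-- Filtering on the sorting key lets MergeTree skip irrelevant granules
SELECT sum(new_confirmed) AS confirmed_in_2021
FROM covid19
WHERE location_key = 'US'
  AND `date` BETWEEN '2021-01-01' AND '2021-12-31';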

Step 5 – Ingest the data from Google Cloud Storage

The next step is to ingest the data that was dumped from the BigQuery table into Google Cloud Storage. ClickHouse has a built-in gcs table function that can read data directly from a Google Cloud Storage bucket. We need the following information to read the data.

  • GCS bucket URL – we can match the files in the bucket using wildcards
  • Access Key ID for the GCS bucket
  • Secret Access key for the above Access Key ID
  • File format

Run the below SQL statement to start the ingestion. Note that since the export was written with header = true, we read the files using the CSVWithNames format so that the header row is not parsed as data.

INSERT INTO covid19
SELECT *
FROM gcs('https://storage.googleapis.com/bigquery_chistadata_migration_poc/dump/*.gz',
         '<Access Key ID>', '<Secret Access Key>',
         'CSVWithNames');

This may take some time depending on the size of the data. Once the query has executed successfully (the ingestion is over), let us run a simple query on the ClickHouse table.
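
For instance, a simple row count, which should match the count we noted on the BigQuery side:

-- Quick sanity check on the ClickHouse side
SELECT count() AS total_rows FROM covid19;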

If needed, additional validation queries can be run to assert the success of the migration.
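
A minimal sketch of such a validation is to compute a few aggregates on both sides and compare. The ClickHouse version is shown below; the equivalent BigQuery statement would use COUNT(*) and COUNT(DISTINCT location_key) against `dbaas-dev.migration_poc.covid19`:

-- Aggregates that should match between BigQuery and ClickHouse
SELECT
    count() AS total_rows,
    uniqExact(location_key) AS distinct_locations,
    sum(new_confirmed) AS total_new_confirmed
FROM covid19;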

Conclusion

In this article, we have successfully migrated data from Google BigQuery to ChistaDATA Cloud, which is powered by the open-source ClickHouse database. We performed the following steps:

  1. Create a GCS bucket
  2. Dump the data from the BigQuery table to GCS bucket
  3. Create a storage table in ChistaDATA cloud database
  4. Ingest the data using the built-in ClickHouse gcs function
  5. Verify and validate the migrated data

To learn more about Google BigQuery vs ChistaDATA Cloud for ClickHouse, please read our benchmark comparing their performance on OpenTelemetry data.

References

https://clickhouse.com/docs/en/migrations/bigquery

https://cloud.google.com/bigquery/docs/exporting-data

https://clickhouse.com/docs/en/sql-reference/table-functions/gcs