Benchmarking ClickHouse (TPC-DS 10TB) on EC2

This run-book walks through benchmarking ClickHouse against a 10TB TPC-DS data set on EC2, with scripts and comments for each step.

Step 1: Setting Up the EC2 Environment

Select and Launch EC2 Instance:
- Choose an EC2 instance (e.g., m5.24xlarge for a balance between compute and memory).
- Configure the instance with a Linux distribution such as Ubuntu or CentOS.
- Ensure the instance has enough attached EBS storage for well over 10TB: the raw generated files and the loaded ClickHouse tables both need space.

Install Required Tools:
- SSH into your EC2 instance.
- Install the necessary tools (git, a C compiler, and the flex, bison, and byacc parser tools that tpcds-kit needs to build).

sudo apt update && sudo apt install -y git build-essential flex bison byacc
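
Before generating any data, it can help to confirm that the target volume really has room for the raw files plus the loaded tables. The following is a minimal sketch: the function name, the "/" placeholder path, and the 25% headroom are illustrative assumptions, not sizing guidance.

```python
import shutil

TB = 1000 ** 4  # decimal terabyte, in bytes


def has_enough_space(path, required_tb, headroom=0.25):
    """Return True if `path` has at least required_tb * (1 + headroom) TB free."""
    free_bytes = shutil.disk_usage(path).free
    needed_bytes = required_tb * TB * (1 + headroom)
    return free_bytes >= needed_bytes


if __name__ == "__main__":
    # "/" is a placeholder; point this at the mount that will hold the data.
    print(has_enough_space("/", required_tb=10))
```

Running the check against the EBS mount point before Step 3 avoids discovering a full disk hours into data generation.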

Step 2: Install and Configure ClickHouse

Install ClickHouse:
- Follow the official documentation to add the ClickHouse package repository, then install the server and client packages.

sudo apt-get install -y clickhouse-server clickhouse-client

Configure ClickHouse:
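- Server-level settings live in /etc/clickhouse-server/config.xml.
- max_threads and max_memory_usage are query-level settings, so set them in a settings profile in /etc/clickhouse-server/users.xml (or an override file under users.d/), sized for XXX CPUs and XXXGB RAM.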
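
One way to derive these two settings from the host itself is sketched below. It assumes they are applied to the default profile through an override file under /etc/clickhouse-server/users.d/, that the root element is <clickhouse> (older releases may expect a different root element), and that 80% of RAM is a reasonable per-query ceiling. The function name and the 0.8 fraction are illustrative choices, not recommendations from the ClickHouse documentation.

```python
import os


def render_profile(cpus, ram_bytes, memory_fraction=0.8):
    """Build a users.d override that sets max_threads and max_memory_usage."""
    max_memory = int(ram_bytes * memory_fraction)
    return (
        "<clickhouse>\n"
        "  <profiles>\n"
        "    <default>\n"
        f"      <max_threads>{cpus}</max_threads>\n"
        f"      <max_memory_usage>{max_memory}</max_memory_usage>\n"
        "    </default>\n"
        "  </profiles>\n"
        "</clickhouse>\n"
    )


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    # Total physical RAM on Linux: page count times page size.
    ram = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    print(render_profile(cpus, ram))
```

The printed XML can be reviewed and saved as an override file. Keeping it separate from the packaged users.xml makes the benchmark configuration easy to record and revert.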

Step 3: Data Generation and Preparation

Generate TPC-DS Data:
- Install tpcds-kit for data generation.

git clone https://github.com/databricks/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX

Generate 10TB Data:
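- Run the data generator. -SCALE is expressed in gigabytes, so 10TB corresponds to a scale factor of 10000. A single dsdgen process is slow at this size, so the generator is normally executed multiple times, once per chunk (see its -PARALLEL and -CHILD options).
- -TERMINATE N omits the trailing delimiter at the end of each row, which makes the files easier to load.

./dsdgen -SCALE 10000 -DIR /path/to/output -TERMINATE N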
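
Running the generator multiple times is usually done by splitting the work into chunks. The sketch below only builds the command lines. It assumes dsdgen's -PARALLEL (total number of chunks), -CHILD (which chunk this process writes), and -TERMINATE options behave as described in the tool's own help output, which is worth checking with ./dsdgen -help before relying on it. The helper name and the chunk count of 4 are hypothetical.

```python
def dsdgen_commands(scale_gb, out_dir, parallel):
    """Return one dsdgen argument list per child process."""
    commands = []
    for child in range(1, parallel + 1):
        commands.append([
            "./dsdgen",
            "-SCALE", str(scale_gb),   # scale factor in GB (10000 = 10TB)
            "-DIR", out_dir,
            "-TERMINATE", "N",         # no trailing delimiter on each row
            "-PARALLEL", str(parallel),
            "-CHILD", str(child),
        ])
    return commands


if __name__ == "__main__":
    for argv in dsdgen_commands(10000, "/path/to/output", parallel=4):
        print(" ".join(argv))
```

Each printed line can be started as its own background process from the tpcds-kit/tools directory, so that the chunks are generated concurrently.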

Step 4: Data Loading into ClickHouse

Prepare ClickHouse Tables:
- Create tables in ClickHouse matching the TPC-DS schema.
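
As an illustration of table creation, the sketch below renders a CREATE TABLE statement for income_band, one of the smallest TPC-DS tables. The three column names follow the TPC-DS schema, while the ClickHouse types, the MergeTree engine, and the sort key are assumptions that should be reviewed for each table. The helper name is hypothetical.

```python
def create_table_sql(table, columns, order_by):
    """Render a MergeTree CREATE TABLE statement from (name, type) pairs."""
    column_lines = ",\n".join(f"    {name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n"
        f"(\n{column_lines}\n)\n"
        f"ENGINE = MergeTree\n"
        f"ORDER BY {order_by}"
    )


# Column names follow the TPC-DS schema; the ClickHouse types are an assumption.
INCOME_BAND = [
    ("ib_income_band_sk", "Int32"),
    ("ib_lower_bound", "Nullable(Int32)"),
    ("ib_upper_bound", "Nullable(Int32)"),
]

if __name__ == "__main__":
    print(create_table_sql("income_band", INCOME_BAND, "ib_income_band_sk"))
```

The generated statement can be passed to clickhouse-client with --query. For the large fact tables, the choice of ORDER BY columns has a direct effect on query performance and deserves more thought than this sketch gives it.
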
Load Data into ClickHouse:
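- Write a script to load the generated files into ClickHouse. dsdgen writes pipe-delimited .dat files named after their tables, so each file is loaded into the table of the same name.

import os
import re
import subprocess

data_dir = '/path/to/output'
# Assumes dsdgen ran with -TERMINATE N, so rows carry no trailing '|'.
for file in sorted(os.listdir(data_dir)):
    if not file.endswith(".dat"):
        continue
    # Strip the extension and any _<child>_<parallel> chunk suffix to get the table name.
    table = re.sub(r"(_\d+_\d+)?\.dat$", "", file)
    cmd = ["clickhouse-client", "--format_csv_delimiter=|", f"--query=INSERT INTO {table} FORMAT CSV"]
    # Stream the file on stdin and stop at the first failed load.
    with open(os.path.join(data_dir, file), "rb") as f:
        subprocess.run(cmd, stdin=f, check=True)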

Step 5: Executing the Benchmark

Translate TPC-DS Queries:
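- Generate the 99 benchmark queries from the templates shipped with tpcds-kit (dsqgen), then convert them to ClickHouse SQL where the syntax differs.

Run Benchmark Queries:
- Execute the queries against ClickHouse and record the elapsed time of each. The --time flag makes clickhouse-client print the execution time to stderr.

clickhouse-client --time --query="SELECT ..." # Example query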
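
To measure performance consistently, each query is typically run several times and the timings compared. The sketch below times an arbitrary command by wall clock, which includes client start-up and network overhead on top of server-side execution. It assumes one translated query per file and that clickhouse-client's --queries-file option is available in the installed version. The file layout and the function name are hypothetical.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Hypothetical usage: pass one query file per command-line argument.
    for path in sys.argv[1:]:
        argv = ["clickhouse-client", f"--queries-file={path}"]
        timings = time_command(argv)
        print(f"{path}\tmedian={statistics.median(timings):.3f}s\tmin={min(timings):.3f}s")
```

Reporting the median alongside the minimum makes it easier to separate cold-cache first runs from steady-state behaviour.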

Step 6: Performance Monitoring and Adjustment

Monitor Performance:
- Use monitoring tools such as htop (CPU and memory) and iotop (disk I/O) to observe system resources while the queries run.

htop
iotop
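
Interactive tools are hard to correlate with query timings afterwards, so it can be useful to log a few numbers alongside the benchmark. The sketch below is a Linux-only illustration that samples the 1-minute load average and MemAvailable from /proc/meminfo. The function names, the sampled metrics, and the 5-second default interval are assumptions, not part of any ClickHouse tooling.

```python
import os
import time


def read_mem_available_kb(meminfo_path="/proc/meminfo"):
    """Return MemAvailable from /proc/meminfo in kB, or None if unavailable."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def sample():
    """Take one snapshot of the 1-minute load average and available memory."""
    load1, _, _ = os.getloadavg()
    return {"time": time.time(), "load1": load1,
            "mem_available_kb": read_mem_available_kb()}


def monitor(samples, interval=5.0, out=print):
    """Emit `samples` tab-separated snapshots, `interval` seconds apart."""
    for _ in range(samples):
        s = sample()
        out(f"{s['time']:.0f}\t{s['load1']:.2f}\t{s['mem_available_kb']}")
        time.sleep(interval)
```

Calling monitor(samples=720) in a separate terminal would cover roughly one hour of benchmark time. Redirecting the output to a file keeps it available for Step 7.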

Tweak Configurations as Needed:
- Based on observations, adjust ClickHouse configurations for better performance.

Step 7: Analyzing Results

Analyze Output:
- Examine query execution times, resource usage, etc.
Document Findings:
- Record all observations, configurations, and outcomes.
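
Recording outcomes is easier when the timings are reduced to one row per query. The sketch below assumes timings have been collected as a mapping from query name to a list of elapsed seconds. That input shape, the function names, and the column names are hypothetical.

```python
import csv
import io
import statistics

FIELDS = ["query", "runs", "min_s", "median_s", "max_s"]


def summarize(timings_by_query):
    """Turn {query: [seconds, ...]} into rows of min/median/max per query."""
    rows = []
    for query, timings in sorted(timings_by_query.items()):
        rows.append({
            "query": query,
            "runs": len(timings),
            "min_s": round(min(timings), 3),
            "median_s": round(statistics.median(timings), 3),
            "max_s": round(max(timings), 3),
        })
    return rows


def to_csv(rows):
    """Serialize summary rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Storing the resulting CSV next to the ClickHouse settings used for the run makes later comparisons between configurations straightforward.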

Step 8: Cleanup

Decommission Resources:
- Terminate the EC2 instance and delete its EBS volumes to avoid extra charges.

Conclusion

This run-book provides a structured approach to setting up, executing, and analyzing a TPC-DS 10TB benchmark on EC2 for ClickHouse. Each step involves careful planning and execution, from environment setup and data generation to query execution and performance analysis. Due to the resource-intensive nature of this task, continuous monitoring and adjustment are key to obtaining accurate and useful performance metrics.

ChistaDATA Inc.