Benchmarking ClickHouse (TPC-DS 10TB) on EC2

This run-book walks through benchmarking ClickHouse against a 10TB TPC-DS data set on EC2, with scripts and comments for each step.

Step 1: Setting Up the EC2 Environment

  1. Select and Launch EC2 Instance:
    • Choose an EC2 instance (e.g., m5.24xlarge, with 96 vCPUs and 384 GiB of RAM, for a balance between compute and memory).
    • Configure the instance with a Linux distribution like Ubuntu or CentOS.
    • Ensure the instance has attached EBS storage for well over 10TB: the raw generated files and the loaded ClickHouse tables both need space.
  2. Install Required Tools:
    • SSH into your EC2 instance.
    • Install necessary tools (git, compilers, etc.).
sudo apt update && sudo apt install -y git build-essential

Step 2: Install and Configure ClickHouse

  1. Install ClickHouse:
    • Follow the official documentation to add the ClickHouse package repository, then install the server and client packages.
sudo apt-get install -y clickhouse-server clickhouse-client
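  2. Configure ClickHouse:
    • Server-level options live in /etc/clickhouse-server/config.xml. Query-level settings such as max_threads and max_memory_usage belong in a settings profile in /etc/clickhouse-server/users.xml.
    • Set max_threads and max_memory_usage to suit XXX CPUs and XXXGB RAM.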
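
One way to pick starting values is to derive them from the hardware. The sketch below is only an illustration, not a ClickHouse tuning rule: the function names and the 80% memory fraction are assumptions made for this example, and the results should be treated as a first guess to refine in Step 6.

```python
import os


def suggest_settings(cpu_count, total_ram_bytes, memory_fraction=0.8):
    """Return starting values for max_threads and max_memory_usage.

    memory_fraction is an assumed safety margin (hypothetical default)
    that leaves the rest of RAM for the OS page cache and other queries.
    """
    return {
        "max_threads": cpu_count,
        "max_memory_usage": int(total_ram_bytes * memory_fraction),
    }


def detect_hardware():
    """Read the CPU count and physical RAM size on Linux."""
    cpus = os.cpu_count() or 1
    ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cpus, ram


if __name__ == "__main__":
    cpus, ram = detect_hardware()
    for name, value in suggest_settings(cpus, ram).items():
        print(f"{name} = {value}")
```

Run on the benchmark instance itself, this prints two values that can be copied into the settings profile.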

Step 3: Data Generation and Preparation

  1. Generate TPC-DS Data:
    • Install tpcds-kit for data generation.
git clone https://github.com/databricks/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX
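  2. Generate 10TB Data:
    • Run the data generator. The scale factor is expressed in gigabytes, so 10TB is -SCALE 10000. At this size the run is usually split into chunks, one dsdgen invocation per chunk, using the -PARALLEL and -CHILD options.
./dsdgen -SCALE 10000 -DIR /path/to/output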
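
Generating the data in chunks can be scripted. The following is a minimal sketch, assuming dsdgen is run from the tpcds-kit/tools directory as above; the helper names `build_dsdgen_commands` and `run_all` and the chunk count of 8 are hypothetical choices for this example. The scale factor is in gigabytes, so 10000 is used for 10TB.

```python
import subprocess


def build_dsdgen_commands(scale_gb, out_dir, parallel):
    """Build one dsdgen command per chunk.

    -PARALLEL is the total number of chunks and -CHILD selects which
    chunk (1-based) a given process generates.
    """
    return [
        [
            "./dsdgen",
            "-SCALE", str(scale_gb),
            "-DIR", out_dir,
            "-PARALLEL", str(parallel),
            "-CHILD", str(child),
        ]
        for child in range(1, parallel + 1)
    ]


def run_all(commands, tools_dir):
    """Start every chunk at once from the tools directory, then wait."""
    procs = [subprocess.Popen(cmd, cwd=tools_dir) for cmd in commands]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Hypothetical values: 8 chunks at the 10TB scale factor.
    for cmd in build_dsdgen_commands(10000, "/path/to/output", 8):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check them before calling `run_all`, which launches all chunks concurrently; the chunk count would normally be matched to the number of available cores.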

Step 4: Data Loading into ClickHouse

  1. Prepare ClickHouse Tables:
    • Create tables in ClickHouse matching the TPC-DS schema.
  2. Load Data into ClickHouse:
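    • Write a script to load the generated files into ClickHouse. dsdgen writes pipe-delimited .dat files rather than comma-separated .csv files, and each file belongs in the table it is named after.
import os
import re
import subprocess

data_dir = '/path/to/output'
for file in sorted(os.listdir(data_dir)):
    # dsdgen output looks like store_sales.dat, or store_sales_1_8.dat for chunked runs
    if file.endswith(".dat"):
        # Drop the extension and any _<child>_<parallel> suffix to get the table name
        table = re.sub(r"(_\d+_\d+)?\.dat$", "", file)
        cmd = ["clickhouse-client", "--format_csv_delimiter=|",
               f"--query=INSERT INTO {table} FORMAT CSV"]
        # Stream the file on stdin; check=True stops the load on the first failure
        with open(os.path.join(data_dir, file), "rb") as f:
            subprocess.run(cmd, stdin=f, check=True)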
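
The tables from item 1 have to exist before any load runs. As an illustration, the sketch below renders a CREATE TABLE statement for income_band, one of the small TPC-DS dimension tables. The helper name `create_table_sql`, the ClickHouse type mapping, and the choice of ORDER BY key are assumptions for this example, not a recommended schema.

```python
def create_table_sql(table, columns, order_by):
    """Render a CREATE TABLE statement for a MergeTree table.

    columns is a list of (name, clickhouse_type) pairs.
    """
    column_lines = ",\n".join(f"    {name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n(\n{column_lines}\n)\n"
        f"ENGINE = MergeTree\nORDER BY {order_by}"
    )


# The type mapping (Int64 / Nullable(Int32)) is an assumption for this sketch.
INCOME_BAND = [
    ("ib_income_band_sk", "Int64"),
    ("ib_lower_bound", "Nullable(Int32)"),
    ("ib_upper_bound", "Nullable(Int32)"),
]

if __name__ == "__main__":
    print(create_table_sql("income_band", INCOME_BAND, "ib_income_band_sk"))
```

The printed statement can be passed to clickhouse-client with --query. For the large fact tables, the ORDER BY key deserves more thought, since it determines how the data is sorted on disk.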

Step 5: Executing the Benchmark

  1. Translate TPC-DS Queries:
    • Convert TPC-DS queries to ClickHouse SQL format.
  2. Run Benchmark Queries:
    • Execute queries against ClickHouse and measure performance.
clickhouse-client --time --query="SELECT ..." # Example query; --time prints the elapsed time to stderr
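
Running each query several times gives steadier numbers than a single run. The sketch below is one possible harness, not a standard tool: `run_query` and `time_queries` are hypothetical helper names, and three runs per query is an arbitrary default. It measures wall-clock time around clickhouse-client, so client start-up is included in each figure.

```python
import subprocess
import time


def run_query(sql):
    """Run one query through clickhouse-client and discard the result rows."""
    subprocess.run(
        ["clickhouse-client", "--query", sql, "--format", "Null"],
        check=True,
    )


def time_queries(queries, runs=3, runner=run_query, clock=time.perf_counter):
    """Return {query_name: [elapsed_seconds, ...]} for each named query."""
    timings = {}
    for name, sql in queries.items():
        timings[name] = []
        for _ in range(runs):
            start = clock()
            runner(sql)
            timings[name].append(clock() - start)
    return timings
```

A call such as `time_queries({"q1": "SELECT ..."})` returns the raw timings per query. The `runner` parameter exists so the harness can be exercised without a server.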

Step 6: Performance Monitoring and Adjustment

  1. Monitor Performance:
    • Watch CPU, memory, and disk I/O while the queries run.
htop
iotop
  2. Tweak Configurations as Needed:
    • Based on observations, adjust ClickHouse configurations for better performance.
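
Alongside htop and iotop, ClickHouse records per-query statistics in its system.query_log table. The sketch below builds a query that lists the slowest finished queries, assuming query logging is enabled on the server; the helper name `slowest_queries_sql` and its defaults are hypothetical.

```python
def slowest_queries_sql(limit=10, since_minutes=60):
    """Build a query over system.query_log listing the slowest finished queries."""
    return (
        "SELECT query_duration_ms, memory_usage, read_rows, read_bytes, query\n"
        "FROM system.query_log\n"
        "WHERE type = 'QueryFinish'\n"
        f"  AND event_time > now() - INTERVAL {int(since_minutes)} MINUTE\n"
        "ORDER BY query_duration_ms DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(slowest_queries_sql())
```

The output can be passed to clickhouse-client. Queries with high memory_usage or read_bytes are natural candidates when deciding which settings to adjust.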

Step 7: Analyzing Results

  1. Analyze Output:
    • Examine query execution times, resource usage, etc.
  2. Document Findings:
    • Record all observations, configurations, and outcomes.
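
Repeated timings are easier to compare once reduced to a few numbers. The sketch below is one way to do that, assuming timings are stored as a dictionary mapping query names to lists of seconds; the helper name `summarize` and the choice of best, median, and geometric mean are illustrative, not a TPC-DS reporting rule.

```python
import statistics


def summarize(timings):
    """Reduce {query: [seconds, ...]} to per-query stats and an overall figure.

    Returns (per_query, overall), where overall is the geometric mean
    of the per-query medians. All timings must be positive.
    """
    per_query = {
        name: {"best": min(runs), "median": statistics.median(runs)}
        for name, runs in timings.items()
    }
    overall = statistics.geometric_mean(
        stats["median"] for stats in per_query.values()
    )
    return per_query, overall
```

The geometric mean is used here so that a few very long queries do not dominate the overall figure; an arithmetic mean or total runtime may suit other comparisons better.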

Step 8: Cleanup

  1. Decommission Resources:
    • Terminate EC2 instance and release EBS volumes to avoid extra charges.

Conclusion

This run-book gives a structured approach to setting up, running, and analyzing a TPC-DS 10TB benchmark of ClickHouse on EC2, from environment setup and data generation through query execution and performance analysis. Because the workload is resource-intensive, monitor the system throughout and adjust the configuration as needed to obtain accurate, useful performance metrics.