This run-book describes how to benchmark ClickHouse on EC2 against a 10 TB TPC-DS dataset, with scripts and comments for each step.
Step 1: Setting Up the EC2 Environment
- Select and Launch EC2 Instance:
- Choose an EC2 instance (e.g., m5.24xlarge for a balance between compute and memory).
- Configure the instance with a Linux distribution such as Ubuntu or CentOS (the apt commands in this run-book assume Ubuntu).
- Ensure the instance has enough attached EBS storage for well over 10 TB: the generated flat files and the copy loaded into ClickHouse are both on disk during loading.
- Install Required Tools:
- SSH into your EC2 instance.
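- Install the necessary tools: git, a C toolchain, and the flex, bison, and byacc parser generators that tpcds-kit needs in order to build.
sudo apt update && sudo apt install -y git build-essential flex bison byacc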
Step 2: Install and Configure ClickHouse
- Install ClickHouse:
- Follow the official documentation to add the ClickHouse package repository, then install the server and client.
sudo apt-get install -y clickhouse-server clickhouse-client
- Configure ClickHouse:
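- Edit the server config file /etc/clickhouse-server/config.xml for server-level options.
- Set max_threads and max_memory_usage to suit XXX CPUs and XXX GB of RAM. These are query-level settings, so they belong in a settings profile in /etc/clickhouse-server/users.xml (or an override file under users.d/) rather than in config.xml.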
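As a minimal sketch of the tuning step above, the following Python derives candidate values from the machine itself and prints an override file. It assumes the two settings go into the `default` settings profile through a file placed under /etc/clickhouse-server/users.d/. The function name `render_profile` and the 0.8 RAM fraction are illustrative choices, not recommendations from ClickHouse or from this run-book.

```python
import os


def render_profile(cpus: int, ram_bytes: int, ram_fraction: float = 0.8) -> str:
    """Render a users.d override that sets max_threads and max_memory_usage.

    ram_fraction is an assumed safety margin that leaves headroom for the OS
    and page cache; tune it for your workload.
    """
    max_memory = int(ram_bytes * ram_fraction)
    # Recent ClickHouse releases use <clickhouse> as the root element;
    # older releases used <yandex>.
    return (
        "<clickhouse>\n"
        "  <profiles>\n"
        "    <default>\n"
        f"      <max_threads>{cpus}</max_threads>\n"
        f"      <max_memory_usage>{max_memory}</max_memory_usage>\n"
        "    </default>\n"
        "  </profiles>\n"
        "</clickhouse>\n"
    )


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    # Total physical RAM on Linux: page size times number of physical pages.
    ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    print(render_profile(cpus, ram))
```

Redirect the output to a file in users.d/ and restart or reload the server, then confirm the values with `SELECT name, value FROM system.settings WHERE name IN ('max_threads', 'max_memory_usage')`.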
Step 3: Data Generation and Preparation
- Generate TPC-DS Data:
- Install tpcds-kit for data generation.
git clone https://github.com/databricks/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX
- Generate 10TB Data:
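- Run the data generator. The scale factor is expressed in GB, so 10 TB is -SCALE 10000. At this size the run is usually split into chunks with the -PARALLEL and -CHILD options, one invocation per chunk.
./dsdgen -SCALE 10000 -DIR /path/to/output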
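Running the generator several times is easier to manage if the chunk commands are built programmatically. The sketch below assumes dsdgen's `-PARALLEL` (total number of chunks) and `-CHILD` (which chunk to produce, starting at 1) options. TPC-DS scale factors are in GB, so 10 TB corresponds to scale 10000. The helper name `dsdgen_commands` and the chunk count of 8 are hypothetical.

```python
import shlex


def dsdgen_commands(scale_gb: int, out_dir: str, chunks: int) -> list[list[str]]:
    """Build one dsdgen argument list per chunk (child numbers start at 1)."""
    return [
        ["./dsdgen", "-SCALE", str(scale_gb), "-DIR", out_dir,
         "-PARALLEL", str(chunks), "-CHILD", str(child)]
        for child in range(1, chunks + 1)
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed
    # or handed to a job runner. 8 chunks is an assumed value.
    for cmd in dsdgen_commands(10000, "/path/to/output", chunks=8):
        print(shlex.join(cmd))
```

Each printed command is independent, so the chunks can run concurrently up to whatever the instance's CPUs and disk throughput allow.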
Step 4: Data Loading into ClickHouse
- Prepare ClickHouse Tables:
- Create tables in ClickHouse matching the TPC-DS schema.
- Load Data into ClickHouse:
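- Write a script to load the generated files into ClickHouse. dsdgen writes pipe-delimited .dat files named after their tables, not comma-separated .csv files, so the script sets the delimiter and derives the table name from the file name. If the rows end with a trailing '|', remove it before loading.
import os
import subprocess

data_dir = '/path/to/output'
for file in sorted(os.listdir(data_dir)):
    if file.endswith('.dat'):
        # store_sales.dat -> store_sales (chunked files such as store_sales_1_8.dat need the suffix stripped)
        table = file[:-len('.dat')]
        cmd = ['clickhouse-client', '--format_csv_delimiter=|', f'--query=INSERT INTO {table} FORMAT CSV']
        # Stream the file on stdin; check=True stops on the first failed load
        with open(os.path.join(data_dir, file), 'rb') as f:
            subprocess.run(cmd, stdin=f, check=True)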
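The loader assumes the target tables already exist. As a sketch of the table-preparation step, the following Python renders ClickHouse DDL from a column list. It uses the small TPC-DS `reason` table as the example. The type mapping (surrogate key to `Int64`, `char(n)` to `String`, nullable non-key columns), the `MergeTree` engine, and the `ORDER BY` key are assumptions made for illustration, not a schema taken from this run-book.

```python
def create_table_sql(table: str,
                     columns: list[tuple[str, str, bool]],
                     order_by: list[str]) -> str:
    """Render ClickHouse DDL; each column is (name, clickhouse_type, nullable)."""
    defs = []
    for name, ch_type, nullable in columns:
        col_type = f"Nullable({ch_type})" if nullable else ch_type
        defs.append(f"    {name} {col_type}")
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n(\n"
        + ",\n".join(defs)
        + f"\n)\nENGINE = MergeTree\nORDER BY ({', '.join(order_by)})"
    )


# Columns of the TPC-DS `reason` table with an assumed ClickHouse type mapping.
REASON_COLUMNS = [
    ("r_reason_sk", "Int64", False),
    ("r_reason_id", "String", False),
    ("r_reason_desc", "String", True),
]

if __name__ == "__main__":
    print(create_table_sql("reason", REASON_COLUMNS, ["r_reason_sk"]))
```

The printed statement can be piped into `clickhouse-client`. For the large fact tables the sort key matters much more than it does here, and it is worth recording the chosen key alongside the benchmark results.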
Step 5: Executing the Benchmark
- Translate TPC-DS Queries:
- Convert TPC-DS queries to ClickHouse SQL format.
- Run Benchmark Queries:
- Execute queries against ClickHouse and measure performance.
clickhouse-client --query="SELECT ..." # Example query
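To measure more than a single run, the queries can be timed from a small harness. This sketch assumes each translated query is stored in its own .sql file (a hypothetical layout) and is run through `clickhouse-client --queries-file`. The measured wall-clock time includes client start-up and connection overhead, so it slightly overstates server-side execution time.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd: list[str], runs: int = 3) -> list[float]:
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the result set; check=True raises if the query fails.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def time_query_file(path: str, runs: int = 3) -> list[float]:
    """Time one query file with clickhouse-client."""
    return time_command(["clickhouse-client", "--queries-file", path], runs)


if __name__ == "__main__":
    # Usage: python time_queries.py query1.sql query2.sql ...
    for path in sys.argv[1:]:
        runs = time_query_file(path)
        print(f"{path}: median {statistics.median(runs):.3f}s over {len(runs)} runs")
```

Reporting the median of several runs reduces the effect of a cold page cache on the first execution. Whether to report cold or warm numbers is a choice to state explicitly in the results.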
Step 6: Performance Monitoring and Adjustment
- Monitor Performance:
- Use monitoring tools to observe system resources.
htop
iotop
- Tweak Configurations as Needed:
- Based on observations, adjust ClickHouse configurations for better performance.
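htop and iotop are interactive, so their readings are lost once the session ends. To keep a record next to the benchmark timings, a small sampler can log system state at intervals. The sketch below is Linux-only and reads /proc/meminfo and the load average. The function names and the 5-second default interval are illustrative.

```python
import os
import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into {field: number} (most fields are in kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def sample() -> dict[str, float]:
    """Take one snapshot of the 1-minute load average and available memory."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    load1, _, _ = os.getloadavg()
    return {
        "time": time.time(),
        "load1": load1,
        "mem_available_gib": mem["MemAvailable"] / 1024**2,  # kB -> GiB
    }


def log_samples(count: int, interval_s: float = 5.0) -> None:
    """Print `count` snapshots, sleeping interval_s seconds between them."""
    for i in range(count):
        print(sample(), flush=True)
        if i + 1 < count:
            time.sleep(interval_s)
```

Calling `log_samples(720)` in a second terminal would cover roughly an hour of benchmark time. Redirect the output to a file to compare it with query timings afterwards.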
Step 7: Analyzing Results
- Analyze Output:
- Examine query execution times, resource usage, etc.
- Document Findings:
- Record all observations, configurations, and outcomes.
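One way to make the recorded outcomes comparable across configurations is to reduce each query's runs to a median and summarize the whole suite with a geometric mean, which keeps a few very slow queries from dominating the total. The sketch below assumes timings are collected as a mapping from query name to a list of run times in seconds. The sample values in the usage block are made up for illustration and are not benchmark results.

```python
import statistics


def summarize(timings: dict[str, list[float]]) -> dict[str, float]:
    """Reduce per-query run times (seconds) to one median per query."""
    return {query: statistics.median(runs) for query, runs in timings.items()}


def report(timings: dict[str, list[float]]) -> str:
    """Render 'query,median_seconds' lines plus a geometric-mean summary line."""
    medians = summarize(timings)
    lines = [f"{query},{median:.3f}" for query, median in sorted(medians.items())]
    geomean = statistics.geometric_mean(list(medians.values()))
    lines.append(f"geomean,{geomean:.3f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative numbers only.
    example = {"q1": [1.2, 1.0, 1.1], "q2": [4.1, 3.9, 4.0]}
    print(report(example))
```

Storing this report together with the ClickHouse settings and instance type used for the run makes later comparisons reproducible.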
Step 8: Cleanup
- Decommission Resources:
- Terminate the EC2 instance and delete its EBS volumes to avoid further charges.
Conclusion
This run-book provides a structured approach to setting up, executing, and analyzing a TPC-DS 10TB benchmark on EC2 for ClickHouse. Each step involves careful planning and execution, from environment setup and data generation to query execution and performance analysis. Due to the resource-intensive nature of this task, continuous monitoring and adjustment are key to obtaining accurate and useful performance metrics.