Benchmarking ClickHouse: TPC-DS 10TB on Amazon EC2
This comprehensive run-book walks you through every stage of running a TPC-DS 10TB benchmark against ClickHouse on Amazon EC2 — from provisioning your instance and installing ClickHouse, to generating data, loading it, executing queries, and analyzing performance results. Each step includes production-ready scripts and inline commentary.
Step 1: Setting Up the EC2 Environment
Begin by selecting a compute-optimized EC2 instance. For a 10TB TPC-DS workload, an m5.24xlarge (96 vCPUs, 384 GB RAM) or r6i.32xlarge provides the right balance of compute and memory. Attach sufficient EBS (gp3) storage — at least 15TB to accommodate raw data, generated datasets, and ClickHouse table storage.
1.1 — Update System & Install Build Tools
#!/bin/bash
# Update package index and upgrade existing packages
sudo apt update && sudo apt upgrade -y
# Install essential build and development tools
sudo apt install -y \
git \
build-essential \
cmake \
python3 \
python3-pip \
unzip \
curl \
wget \
htop \
sysstat \
nvme-cli
echo "✔ Build tools installed successfully."
1.2 — Mount and Format EBS Volumes
#!/bin/bash # Identify attached NVMe/EBS devices lsblk # Format the data volume (replace /dev/nvme1n1 with your device) sudo mkfs -t xfs /dev/nvme1n1 # Create mount point and mount sudo mkdir -p /data sudo mount /dev/nvme1n1 /data # Make mount persistent across reboots echo '/dev/nvme1n1 /data xfs defaults,nofail 0 2' | sudo tee -a /etc/fstab # Set permissions sudo chown -R ubuntu:ubuntu /data echo "✔ EBS volume mounted at /data"
Step 2: Install and Configure ClickHouse
Install the latest stable ClickHouse server and client from the official Altinity/ClickHouse APT repository. After installation, tune the configuration for your instance’s memory and CPU profile.
2.1 — Install ClickHouse via APT
#!/bin/bash
# Add the official ClickHouse repository
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
GNUPGHOME=$(mktemp -d)
sudo GNUPGHOME="$GNUPGHOME" gpg --no-default-keyring \
--keyring /usr/share/keyrings/clickhouse-keyring.gpg \
--keyserver hkp://keyserver.ubuntu.com:80 \
--recv-keys 8919F6BD2B48D754
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" \
| sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
# Install ClickHouse server and client
sudo apt-get install -y clickhouse-server clickhouse-client
# Start and enable the service
sudo systemctl enable clickhouse-server
sudo systemctl start clickhouse-server
sudo systemctl status clickhouse-server
2.2 — Tune ClickHouse Configuration
<?xml version="1.0"?>
<yandex>
<!-- Use all available CPU threads -->
<max_threads>96</max_threads>
<!-- Memory limits: set to ~90% of total RAM (e.g., 345 GB for 384 GB instance) -->
<max_memory_usage>370161893376</max_memory_usage>
<max_memory_usage_for_all_queries>370161893376</max_memory_usage_for_all_queries>
<!-- Data and temp storage paths -->
<path>/data/clickhouse/</path>
<tmp_path>/data/clickhouse/tmp/</tmp_path>
<!-- Enable parallel query execution -->
<max_parallel_replicas>1</max_parallel_replicas>
<distributed_aggregation_memory_efficient>1</distributed_aggregation_memory_efficient>
<!-- Increase connection and query limits -->
<max_connections>4096</max_connections>
<keep_alive_timeout>3</keep_alive_timeout>
</yandex>
#!/bin/bash # Create data directories sudo mkdir -p /data/clickhouse /data/clickhouse/tmp sudo chown -R clickhouse:clickhouse /data/clickhouse # Restart to apply new config sudo systemctl restart clickhouse-server # Verify ClickHouse is running and accessible clickhouse-client --query "SELECT version(), uptime()"
Step 3: TPC-DS Data Generation & Preparation
Use the official TPC-DS data generator (dsdgen) from the Databricks fork to produce 10TB of synthetic retail analytics data across 24 TPC-DS tables.
3.1 — Clone and Build TPC-DS Kit
#!/bin/bash # Clone the TPC-DS toolkit git clone https://github.com/databricks/tpcds-kit.git cd tpcds-kit/tools # Compile the data generator make OS=LINUX # Verify the binary was built successfully ls -lh dsdgen dsqgen echo "✔ TPC-DS tools compiled"
3.2 — Generate 10TB Dataset (Parallelized)
#!/bin/bash
# Configuration
SCALE=10000 # 10TB scale factor
PARALLEL=32 # Number of parallel chunks
OUTPUT_DIR="/data/tpcds-raw"
TOOLS_DIR="$HOME/tpcds-kit/tools"
mkdir -p "$OUTPUT_DIR"
# Launch dsdgen in parallel chunks
for i in $(seq 1 $PARALLEL); do
echo "Starting chunk $i of $PARALLEL ..."
$TOOLS_DIR/dsdgen \
-SCALE $SCALE \
-DIR "$OUTPUT_DIR" \
-PARALLEL $PARALLEL \
-CHILD $i \
-FORCE Y &
done
# Wait for all background processes
wait
echo "✔ 10TB data generation complete. Output in $OUTPUT_DIR"
du -sh "$OUTPUT_DIR"
Step 4: Schema Creation & Data Loading into ClickHouse
Create TPC-DS-compatible tables in ClickHouse using MergeTree engine families, then load the generated CSV files efficiently using ClickHouse’s native clickhouse-client ingestion.
4.1 — Create TPC-DS Schema
-- Create TPC-DS database
CREATE DATABASE IF NOT EXISTS tpcds;
-- Fact table: store_sales (largest table in TPC-DS)
CREATE TABLE tpcds.store_sales (
ss_sold_date_sk Nullable(Int64),
ss_sold_time_sk Nullable(Int64),
ss_item_sk Int64,
ss_customer_sk Nullable(Int64),
ss_cdemo_sk Nullable(Int64),
ss_hdemo_sk Nullable(Int64),
ss_addr_sk Nullable(Int64),
ss_store_sk Nullable(Int64),
ss_promo_sk Nullable(Int64),
ss_ticket_number Int64,
ss_quantity Nullable(Int32),
ss_wholesale_cost Nullable(Decimal(7,2)),
ss_list_price Nullable(Decimal(7,2)),
ss_sales_price Nullable(Decimal(7,2)),
ss_ext_discount_amt Nullable(Decimal(7,2)),
ss_ext_sales_price Nullable(Decimal(7,2)),
ss_ext_wholesale_cost Nullable(Decimal(7,2)),
ss_ext_list_price Nullable(Decimal(7,2)),
ss_ext_tax Nullable(Decimal(7,2)),
ss_coupon_amt Nullable(Decimal(7,2)),
ss_net_paid Nullable(Decimal(7,2)),
ss_net_paid_inc_tax Nullable(Decimal(7,2)),
ss_net_profit Nullable(Decimal(7,2))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(toDate(assumeNotNull(ss_sold_date_sk)))
ORDER BY (ss_item_sk, ss_ticket_number)
SETTINGS index_granularity = 8192;
-- Dimension table: date_dim
CREATE TABLE tpcds.date_dim (
d_date_sk Int64,
d_date_id String,
d_date Date,
d_month_seq Nullable(Int32),
d_week_seq Nullable(Int32),
d_quarter_seq Nullable(Int32),
d_year Nullable(Int32),
d_dow Nullable(Int32),
d_moy Nullable(Int32),
d_dom Nullable(Int32),
d_qoy Nullable(Int32),
d_fy_year Nullable(Int32),
d_fy_quarter_seq Nullable(Int32),
d_fy_week_seq Nullable(Int32),
d_day_name Nullable(String),
d_quarter_name Nullable(String),
d_holiday Nullable(String),
d_weekend Nullable(String),
d_following_holiday Nullable(String),
d_first_dom Nullable(Int32),
d_last_dom Nullable(Int32),
d_same_day_ly Nullable(Int32),
d_same_day_lq Nullable(Int32),
d_current_day Nullable(String),
d_current_week Nullable(String),
d_current_month Nullable(String),
d_current_quarter Nullable(String),
d_current_year Nullable(String)
)
ENGINE = MergeTree()
ORDER BY d_date_sk;
4.2 — Parallel Data Load Script (Python)
#!/usr/bin/env python3
"""
Parallel TPC-DS CSV loader for ClickHouse.
Loads all .dat files from the TPC-DS output directory into their
corresponding ClickHouse tables using concurrent subprocess workers.
"""
import os
import subprocess
import concurrent.futures
from pathlib import Path
# ── Configuration ────────────────────────────────────────────────────────────
DATA_DIR = "/data/tpcds-raw"
CLICKHOUSE_HOST = "localhost"
CLICKHOUSE_PORT = 9000
DATABASE = "tpcds"
MAX_WORKERS = 8 # Tune based on I/O bandwidth
FIELD_DELIMITER = "|"
# ─────────────────────────────────────────────────────────────────────────────
def load_table(dat_file: Path) -> dict:
"""Load a single TPC-DS .dat file into ClickHouse."""
table_name = dat_file.stem.split("_")[0] # e.g. store_sales_1_32 → store_sales
cmd = [
"clickhouse-client",
f"--host={CLICKHOUSE_HOST}",
f"--port={CLICKHOUSE_PORT}",
f"--database={DATABASE}",
f"--query=INSERT INTO {table_name} FORMAT CSV",
f"--format_csv_delimiter={FIELD_DELIMITER}",
"--input_format_csv_empty_as_default=1",
"--max_insert_block_size=1048576",
]
try:
with open(dat_file, "rb") as f:
result = subprocess.run(
cmd, stdin=f, capture_output=True, timeout=3600
)
if result.returncode == 0:
return {"file": dat_file.name, "status": "OK"}
else:
return {"file": dat_file.name, "status": "FAIL", "error": result.stderr.decode()}
except subprocess.TimeoutExpired:
return {"file": dat_file.name, "status": "TIMEOUT"}
def main():
dat_files = sorted(Path(DATA_DIR).glob("*.dat"))
print(f"Found {len(dat_files)} .dat files to load.")
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
futures = {executor.submit(load_table, f): f for f in dat_files}
for future in concurrent.futures.as_completed(futures):
result = future.result()
status_icon = "✔" if result["status"] == "OK" else "✘"
print(f" {status_icon} {result['file']} → {result['status']}")
if "error" in result:
print(f" ERROR: {result['error'][:200]}")
print("\nLoad complete.")
if __name__ == "__main__":
main()
4.3 — Verify Row Counts After Load
-- Verify key table sizes after data load
SELECT
table,
formatReadableQuantity(sum(rows)) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS disk_size,
formatReadableSize(sum(data_compressed_bytes)) AS compressed,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_ratio
FROM system.parts
WHERE
database = 'tpcds'
AND active = 1
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
Step 5: Executing the TPC-DS Benchmark
Translate the 99 TPC-DS queries to ClickHouse-compatible SQL, then execute them via a benchmarking script that measures latency, records output, and logs any failures.
5.1 — Generate TPC-DS Query Files
#!/bin/bash
TOOLS_DIR="$HOME/tpcds-kit/tools"
QUERY_DIR="/data/tpcds-queries"
TEMPLATE_DIR="$TOOLS_DIR/query_templates"
mkdir -p "$QUERY_DIR"
# Generate all 99 queries at the 10TB scale factor
for qnum in $(seq 1 99); do
$TOOLS_DIR/dsqgen \
-DIRECTORY "$TEMPLATE_DIR" \
-TEMPLATE "query${qnum}.tpl" \
-SCALE 10000 \
-DIALECT ansi \
-OUTPUT_DIR "$QUERY_DIR" \
-QUIET Y
done
echo "✔ Generated $(ls $QUERY_DIR/*.sql 2>/dev/null | wc -l) query files in $QUERY_DIR"
5.2 — Benchmark Execution Script (Shell)
#!/bin/bash
# ── TPC-DS Benchmark Runner ──────────────────────────────────────────────────
QUERY_DIR="/data/tpcds-queries"
RESULTS_DIR="/data/tpcds-results/$(date +%Y%m%d_%H%M%S)"
DATABASE="tpcds"
ITERATIONS=3 # Run each query N times and record the best time
# ─────────────────────────────────────────────────────────────────────────────
mkdir -p "$RESULTS_DIR"
SUMMARY_FILE="$RESULTS_DIR/summary.tsv"
echo -e "query\titeration\twall_time_ms\tstatus" > "$SUMMARY_FILE"
total_queries=$(ls "$QUERY_DIR"/*.sql | wc -l)
echo "Running $total_queries queries × $ITERATIONS iterations..."
echo "Results: $RESULTS_DIR"
for sql_file in $(ls "$QUERY_DIR"/*.sql | sort -V); do
qname=$(basename "$sql_file" .sql)
for iter in $(seq 1 $ITERATIONS); do
start_ns=$(date +%s%N)
clickhouse-client \
--database="$DATABASE" \
--time \
--max_execution_time=600 \
--max_memory_usage=350000000000 \
--queries-file="$sql_file" \
> "$RESULTS_DIR/${qname}_iter${iter}.out" \
2> "$RESULTS_DIR/${qname}_iter${iter}.err"
exit_code=$?
end_ns=$(date +%s%N)
wall_ms=$(( (end_ns - start_ns) / 1000000 ))
if [ $exit_code -eq 0 ]; then
status="OK"
else
status="FAIL"
fi
echo -e "${qname}\t${iter}\t${wall_ms}\t${status}" >> "$SUMMARY_FILE"
printf " %-12s iter=%d %7d ms [%s]\n" "$qname" "$iter" "$wall_ms" "$status"
done
done
echo ""
echo "✔ Benchmark complete. Summary: $SUMMARY_FILE"
5.3 — Example: TPC-DS Query 1 (ClickHouse SQL)
-- TPC-DS Query 1: Customer return analysis
-- Find customers whose total returns in a given store exceed 1.2× the average
-- for that store's state, filtered to a specific year.
WITH
customer_total_return AS (
SELECT
sr_customer_sk AS ctr_customer_sk,
sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM tpcds.store_returns
JOIN tpcds.date_dim ON sr_returned_date_sk = d_date_sk
WHERE d_year = 2000
GROUP BY sr_customer_sk, sr_store_sk
)
SELECT c_customer_id
FROM customer_total_return ctr1
JOIN tpcds.store ON s_store_sk = ctr1.ctr_store_sk
JOIN tpcds.customer ON c_customer_sk = ctr1.ctr_customer_sk
WHERE s_state = 'TN'
AND ctr1.ctr_total_return > (
SELECT avg(ctr_total_return) * 1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk
)
ORDER BY c_customer_id
LIMIT 100;
Step 6: Performance Monitoring & Real-Time Diagnostics
Monitor ClickHouse internals during benchmark execution using the built-in system tables. These queries should be run in a separate clickhouse-client session concurrently with the benchmark.
6.1 — Monitor Active Queries
-- Live view of all currently executing queries
SELECT
query_id,
user,
elapsed,
formatReadableSize(memory_usage) AS memory,
formatReadableSize(read_bytes) AS bytes_read,
formatReadableQuantity(read_rows) AS rows_read,
round(read_bytes / elapsed / 1e9, 2) AS GB_per_sec,
substring(query, 1, 80) AS query_snippet
FROM system.processes
WHERE query != ''
ORDER BY elapsed DESC;
6.2 — Track Merge Activity
-- Watch MergeTree background merge operations
SELECT
database,
table,
round(progress * 100, 1) AS progress_pct,
formatReadableSize(total_size_bytes_compressed) AS total_size,
formatReadableSize(bytes_read_uncompressed) AS bytes_read,
num_parts,
elapsed
FROM system.merges
ORDER BY elapsed DESC;
6.3 — Adaptive Configuration Tuning
-- Apply performance tuning settings per-session before running heavy queries SET max_threads = 96; SET max_memory_usage = 350000000000; SET use_uncompressed_cache = 1; SET max_bytes_before_external_group_by = 50000000000; SET max_bytes_before_external_sort = 50000000000; SET distributed_aggregation_memory_efficient = 1; SET compile_expressions = 1; SET optimize_move_to_prewhere = 1; SET allow_experimental_parallel_reading_from_replicas = 1;
Step 7: Analyzing Benchmark Results
After the benchmark run, parse the summary TSV, compute per-query statistics, and identify the slowest and fastest queries for further investigation.
7.1 — Parse Results & Compute Statistics (Python)
#!/usr/bin/env python3
"""
Analyze TPC-DS benchmark results from the summary TSV.
Computes best, median, mean, and p95 latency per query.
"""
import csv
import statistics
from pathlib import Path
from collections import defaultdict
SUMMARY_FILE = "/data/tpcds-results/summary.tsv"
# Read results
results = defaultdict(list)
failed = defaultdict(int)
with open(SUMMARY_FILE, newline="") as f:
reader = csv.DictReader(f, delimiter="\t")
for row in reader:
q = row["query"]
status = row["status"]
if status == "OK":
results[q].append(int(row["wall_time_ms"]))
else:
failed[q] += 1
# Print analysis table
header = f"{'Query':<14} {'Best(ms)':>10} {'Median(ms)':>12} {'Mean(ms)':>10} {'P95(ms)':>10} {'Fails':>6}"
print(header)
print("-" * len(header))
total_time = 0
for q in sorted(results.keys()):
times = sorted(results[q])
best = times[0]
median = int(statistics.median(times))
mean = int(statistics.mean(times))
p95 = int(statistics.quantiles(times, n=20)[18]) if len(times) >= 5 else times[-1]
fails = failed.get(q, 0)
total_time += best
print(f"{q:<14} {best:>10,} {median:>12,} {mean:>10,} {p95:>10,} {fails:>6}")
print("-" * len(header))
print(f"\nTotal benchmark time (sum of best runs): {total_time / 1000:.1f} seconds ({total_time / 60000:.1f} minutes)")
7.2 — Query the ClickHouse Query Log
-- Summarize the 10 slowest queries from system.query_log
SELECT
query_id,
formatReadableTimeDelta(query_duration_ms / 1000) AS duration,
formatReadableSize(memory_usage) AS peak_memory,
formatReadableSize(read_bytes) AS bytes_read,
formatReadableQuantity(read_rows) AS rows_read,
round(read_bytes / query_duration_ms / 1e6, 2) AS GB_per_sec,
exception_code,
substring(query, 1, 100) AS query_snippet
FROM system.query_log
WHERE
type = 'QueryFinish'
AND event_date = today()
AND database = 'tpcds'
ORDER BY query_duration_ms DESC
LIMIT 10;
Step 8: Cleanup
After completing your benchmarking run, clean up all resources to avoid unnecessary AWS charges.
8.1 — Drop TPC-DS Database
-- Remove all TPC-DS tables and the database DROP DATABASE IF EXISTS tpcds SYNC; -- Verify removal SHOW DATABASES;
8.2 — Decommission EC2 Resources
#!/bin/bash
# Retrieve instance metadata
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
echo "Terminating instance: $INSTANCE_ID in $REGION"
# Stop ClickHouse gracefully before termination
sudo systemctl stop clickhouse-server
# Unmount EBS volume
sudo umount /data
# Terminate the EC2 instance
aws ec2 terminate-instances \
--instance-ids "$INSTANCE_ID" \
--region "$REGION"
echo "✔ Instance terminated. Remember to delete unattached EBS volumes from the AWS Console."
Conclusion
This run-book provides a fully scripted, end-to-end approach for running a TPC-DS 10TB benchmark on ClickHouse hosted on Amazon EC2. The workflow covers every phase — instance provisioning and ClickHouse installation, memory and thread tuning, parallel data generation via dsdgen, schema design with MergeTree, concurrent CSV ingestion, query execution with timing, real-time diagnostics from system.processes and system.merges, and statistical result analysis.
Key takeaways for production-grade benchmarking:
- Parallelism — both data generation (
-PARALLEL) and loading (ThreadPoolExecutor) are critical to keeping total elapsed time manageable at the 10TB scale. - Memory tuning — setting
max_memory_usageto ~90% of available RAM prevents OOM kills while keeping swap-free execution. - MergeTree ordering — choose
ORDER BYkeys aligned with your most common query join and filter predicates to maximize data skipping via sparse indexes. - Query log analysis —
system.query_logis your best post-mortem tool; pair it withEXPLAIN PIPELINEto diagnose bottlenecks in slow-running queries.