Benchmarking ClickHouse: TPC-DS 10TB on Amazon EC2

This comprehensive run-book walks you through every stage of running a TPC-DS 10TB benchmark against ClickHouse on Amazon EC2 — from provisioning your instance and installing ClickHouse, to generating data, loading it, executing queries, and analyzing performance results. Each step includes production-ready scripts and inline commentary.


Step 1: Setting Up the EC2 Environment

Begin by selecting a compute-optimized EC2 instance. For a 10TB TPC-DS workload, an m5.24xlarge (96 vCPUs, 384 GB RAM) or r6i.32xlarge provides the right balance of compute and memory. Attach sufficient EBS (gp3) storage — at least 15TB to accommodate raw data, generated datasets, and ClickHouse table storage.

1.1 — Update System & Install Build Tools

#!/bin/bash
# Update package index and upgrade existing packages
sudo apt update && sudo apt upgrade -y

# Install essential build and development tools
sudo apt install -y \
    git \
    build-essential \
    cmake \
    python3 \
    python3-pip \
    unzip \
    curl \
    wget \
    htop \
    sysstat \
    nvme-cli

echo "✔ Build tools installed successfully."

1.2 — Mount and Format EBS Volumes

#!/bin/bash
# Identify attached NVMe/EBS devices
lsblk

# Format the data volume (replace /dev/nvme1n1 with your device)
sudo mkfs -t xfs /dev/nvme1n1

# Create mount point and mount
sudo mkdir -p /data
sudo mount /dev/nvme1n1 /data

# Make mount persistent across reboots
echo '/dev/nvme1n1 /data xfs defaults,nofail 0 2' | sudo tee -a /etc/fstab

# Set permissions
sudo chown -R ubuntu:ubuntu /data
echo "✔ EBS volume mounted at /data"

Step 2: Install and Configure ClickHouse

Install the latest stable ClickHouse server and client from the official Altinity/ClickHouse APT repository. After installation, tune the configuration for your instance’s memory and CPU profile.

2.1 — Install ClickHouse via APT

#!/bin/bash
# Add the official ClickHouse repository
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
GNUPGHOME=$(mktemp -d)
sudo GNUPGHOME="$GNUPGHOME" gpg --no-default-keyring \
    --keyring /usr/share/keyrings/clickhouse-keyring.gpg \
    --keyserver hkp://keyserver.ubuntu.com:80 \
    --recv-keys 8919F6BD2B48D754

echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" \
    | sudo tee /etc/apt/sources.list.d/clickhouse.list

sudo apt-get update

# Install ClickHouse server and client
sudo apt-get install -y clickhouse-server clickhouse-client

# Start and enable the service
sudo systemctl enable clickhouse-server
sudo systemctl start clickhouse-server
sudo systemctl status clickhouse-server

2.2 — Tune ClickHouse Configuration

<?xml version="1.0"?>
<yandex>
    <!-- Use all available CPU threads -->
    <max_threads>96</max_threads>

    <!-- Memory limits: set to ~90% of total RAM (e.g., 345 GB for 384 GB instance) -->
    <max_memory_usage>370161893376</max_memory_usage>
    <max_memory_usage_for_all_queries>370161893376</max_memory_usage_for_all_queries>

    <!-- Data and temp storage paths -->
    <path>/data/clickhouse/</path>
    <tmp_path>/data/clickhouse/tmp/</tmp_path>

    <!-- Enable parallel query execution -->
    <max_parallel_replicas>1</max_parallel_replicas>
    <distributed_aggregation_memory_efficient>1</distributed_aggregation_memory_efficient>

    <!-- Increase connection and query limits -->
    <max_connections>4096</max_connections>
    <keep_alive_timeout>3</keep_alive_timeout>
</yandex>
#!/bin/bash
# Create data directories
sudo mkdir -p /data/clickhouse /data/clickhouse/tmp
sudo chown -R clickhouse:clickhouse /data/clickhouse

# Restart to apply new config
sudo systemctl restart clickhouse-server

# Verify ClickHouse is running and accessible
clickhouse-client --query "SELECT version(), uptime()"

Step 3: TPC-DS Data Generation & Preparation

Use the official TPC-DS data generator (dsdgen) from the Databricks fork to produce 10TB of synthetic retail analytics data across 24 TPC-DS tables.

3.1 — Clone and Build TPC-DS Kit

#!/bin/bash
# Clone the TPC-DS toolkit
git clone https://github.com/databricks/tpcds-kit.git
cd tpcds-kit/tools

# Compile the data generator
make OS=LINUX

# Verify the binary was built successfully
ls -lh dsdgen dsqgen
echo "✔ TPC-DS tools compiled"

3.2 — Generate 10TB Dataset (Parallelized)

#!/bin/bash
# Configuration
SCALE=10000          # 10TB scale factor
PARALLEL=32          # Number of parallel chunks
OUTPUT_DIR="/data/tpcds-raw"
TOOLS_DIR="$HOME/tpcds-kit/tools"

mkdir -p "$OUTPUT_DIR"

# Launch dsdgen in parallel chunks
for i in $(seq 1 $PARALLEL); do
    echo "Starting chunk $i of $PARALLEL ..."
    $TOOLS_DIR/dsdgen \
        -SCALE $SCALE \
        -DIR "$OUTPUT_DIR" \
        -PARALLEL $PARALLEL \
        -CHILD $i \
        -FORCE Y &
done

# Wait for all background processes
wait
echo "✔ 10TB data generation complete. Output in $OUTPUT_DIR"
du -sh "$OUTPUT_DIR"

Step 4: Schema Creation & Data Loading into ClickHouse

Create TPC-DS-compatible tables in ClickHouse using MergeTree engine families, then load the generated CSV files efficiently using ClickHouse’s native clickhouse-client ingestion.

4.1 — Create TPC-DS Schema

-- Create TPC-DS database
CREATE DATABASE IF NOT EXISTS tpcds;

-- Fact table: store_sales (largest table in TPC-DS)
CREATE TABLE tpcds.store_sales (
    ss_sold_date_sk          Nullable(Int64),
    ss_sold_time_sk          Nullable(Int64),
    ss_item_sk               Int64,
    ss_customer_sk           Nullable(Int64),
    ss_cdemo_sk              Nullable(Int64),
    ss_hdemo_sk              Nullable(Int64),
    ss_addr_sk               Nullable(Int64),
    ss_store_sk              Nullable(Int64),
    ss_promo_sk              Nullable(Int64),
    ss_ticket_number         Int64,
    ss_quantity              Nullable(Int32),
    ss_wholesale_cost        Nullable(Decimal(7,2)),
    ss_list_price            Nullable(Decimal(7,2)),
    ss_sales_price           Nullable(Decimal(7,2)),
    ss_ext_discount_amt      Nullable(Decimal(7,2)),
    ss_ext_sales_price       Nullable(Decimal(7,2)),
    ss_ext_wholesale_cost    Nullable(Decimal(7,2)),
    ss_ext_list_price        Nullable(Decimal(7,2)),
    ss_ext_tax               Nullable(Decimal(7,2)),
    ss_coupon_amt            Nullable(Decimal(7,2)),
    ss_net_paid              Nullable(Decimal(7,2)),
    ss_net_paid_inc_tax      Nullable(Decimal(7,2)),
    ss_net_profit            Nullable(Decimal(7,2))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(toDate(assumeNotNull(ss_sold_date_sk)))
ORDER BY (ss_item_sk, ss_ticket_number)
SETTINGS index_granularity = 8192;

-- Dimension table: date_dim
CREATE TABLE tpcds.date_dim (
    d_date_sk                Int64,
    d_date_id                String,
    d_date                   Date,
    d_month_seq              Nullable(Int32),
    d_week_seq               Nullable(Int32),
    d_quarter_seq            Nullable(Int32),
    d_year                   Nullable(Int32),
    d_dow                    Nullable(Int32),
    d_moy                    Nullable(Int32),
    d_dom                    Nullable(Int32),
    d_qoy                    Nullable(Int32),
    d_fy_year                Nullable(Int32),
    d_fy_quarter_seq         Nullable(Int32),
    d_fy_week_seq            Nullable(Int32),
    d_day_name               Nullable(String),
    d_quarter_name           Nullable(String),
    d_holiday                Nullable(String),
    d_weekend                Nullable(String),
    d_following_holiday      Nullable(String),
    d_first_dom              Nullable(Int32),
    d_last_dom               Nullable(Int32),
    d_same_day_ly            Nullable(Int32),
    d_same_day_lq            Nullable(Int32),
    d_current_day            Nullable(String),
    d_current_week           Nullable(String),
    d_current_month          Nullable(String),
    d_current_quarter        Nullable(String),
    d_current_year           Nullable(String)
)
ENGINE = MergeTree()
ORDER BY d_date_sk;

4.2 — Parallel Data Load Script (Python)

#!/usr/bin/env python3
"""
Parallel TPC-DS CSV loader for ClickHouse.
Loads all .dat files from the TPC-DS output directory into their
corresponding ClickHouse tables using concurrent subprocess workers.
"""

import os
import subprocess
import concurrent.futures
from pathlib import Path

# ── Configuration ────────────────────────────────────────────────────────────
DATA_DIR        = "/data/tpcds-raw"
CLICKHOUSE_HOST = "localhost"
CLICKHOUSE_PORT = 9000
DATABASE        = "tpcds"
MAX_WORKERS     = 8          # Tune based on I/O bandwidth
FIELD_DELIMITER = "|"
# ─────────────────────────────────────────────────────────────────────────────

def load_table(dat_file: Path) -> dict:
    """Load a single TPC-DS .dat file into ClickHouse."""
    table_name = dat_file.stem.split("_")[0]  # e.g. store_sales_1_32 → store_sales
    cmd = [
        "clickhouse-client",
        f"--host={CLICKHOUSE_HOST}",
        f"--port={CLICKHOUSE_PORT}",
        f"--database={DATABASE}",
        f"--query=INSERT INTO {table_name} FORMAT CSV",
        f"--format_csv_delimiter={FIELD_DELIMITER}",
        "--input_format_csv_empty_as_default=1",
        "--max_insert_block_size=1048576",
    ]

    try:
        with open(dat_file, "rb") as f:
            result = subprocess.run(
                cmd, stdin=f, capture_output=True, timeout=3600
            )
        if result.returncode == 0:
            return {"file": dat_file.name, "status": "OK"}
        else:
            return {"file": dat_file.name, "status": "FAIL", "error": result.stderr.decode()}
    except subprocess.TimeoutExpired:
        return {"file": dat_file.name, "status": "TIMEOUT"}


def main():
    dat_files = sorted(Path(DATA_DIR).glob("*.dat"))
    print(f"Found {len(dat_files)} .dat files to load.")

    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {executor.submit(load_table, f): f for f in dat_files}
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            status_icon = "✔" if result["status"] == "OK" else "✘"
            print(f"  {status_icon}  {result['file']} → {result['status']}")
            if "error" in result:
                print(f"      ERROR: {result['error'][:200]}")

    print("\nLoad complete.")


if __name__ == "__main__":
    main()

4.3 — Verify Row Counts After Load

-- Verify key table sizes after data load
SELECT
    table,
    formatReadableQuantity(sum(rows))          AS total_rows,
    formatReadableSize(sum(bytes_on_disk))     AS disk_size,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_ratio
FROM system.parts
WHERE
    database = 'tpcds'
    AND active = 1
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;

Step 5: Executing the TPC-DS Benchmark

Translate the 99 TPC-DS queries to ClickHouse-compatible SQL, then execute them via a benchmarking script that measures latency, records output, and logs any failures.

5.1 — Generate TPC-DS Query Files

#!/bin/bash
TOOLS_DIR="$HOME/tpcds-kit/tools"
QUERY_DIR="/data/tpcds-queries"
TEMPLATE_DIR="$TOOLS_DIR/query_templates"

mkdir -p "$QUERY_DIR"

# Generate all 99 queries at the 10TB scale factor
for qnum in $(seq 1 99); do
    $TOOLS_DIR/dsqgen \
        -DIRECTORY "$TEMPLATE_DIR" \
        -TEMPLATE "query${qnum}.tpl" \
        -SCALE 10000 \
        -DIALECT ansi \
        -OUTPUT_DIR "$QUERY_DIR" \
        -QUIET Y
done

echo "✔ Generated $(ls $QUERY_DIR/*.sql 2>/dev/null | wc -l) query files in $QUERY_DIR"

5.2 — Benchmark Execution Script (Shell)

#!/bin/bash
# ── TPC-DS Benchmark Runner ──────────────────────────────────────────────────
QUERY_DIR="/data/tpcds-queries"
RESULTS_DIR="/data/tpcds-results/$(date +%Y%m%d_%H%M%S)"
DATABASE="tpcds"
ITERATIONS=3    # Run each query N times and record the best time
# ─────────────────────────────────────────────────────────────────────────────

mkdir -p "$RESULTS_DIR"
SUMMARY_FILE="$RESULTS_DIR/summary.tsv"
echo -e "query\titeration\twall_time_ms\tstatus" > "$SUMMARY_FILE"

total_queries=$(ls "$QUERY_DIR"/*.sql | wc -l)
echo "Running $total_queries queries × $ITERATIONS iterations..."
echo "Results: $RESULTS_DIR"

for sql_file in $(ls "$QUERY_DIR"/*.sql | sort -V); do
    qname=$(basename "$sql_file" .sql)

    for iter in $(seq 1 $ITERATIONS); do
        start_ns=$(date +%s%N)

        clickhouse-client \
            --database="$DATABASE" \
            --time \
            --max_execution_time=600 \
            --max_memory_usage=350000000000 \
            --queries-file="$sql_file" \
            > "$RESULTS_DIR/${qname}_iter${iter}.out" \
            2> "$RESULTS_DIR/${qname}_iter${iter}.err"

        exit_code=$?
        end_ns=$(date +%s%N)
        wall_ms=$(( (end_ns - start_ns) / 1000000 ))

        if [ $exit_code -eq 0 ]; then
            status="OK"
        else
            status="FAIL"
        fi

        echo -e "${qname}\t${iter}\t${wall_ms}\t${status}" >> "$SUMMARY_FILE"
        printf "  %-12s  iter=%d  %7d ms  [%s]\n" "$qname" "$iter" "$wall_ms" "$status"
    done
done

echo ""
echo "✔ Benchmark complete. Summary: $SUMMARY_FILE"

5.3 — Example: TPC-DS Query 1 (ClickHouse SQL)

-- TPC-DS Query 1: Customer return analysis
-- Find customers whose total returns in a given store exceed 1.2× the average
-- for that store's state, filtered to a specific year.

WITH
customer_total_return AS (
    SELECT
        sr_customer_sk     AS ctr_customer_sk,
        sr_store_sk        AS ctr_store_sk,
        sum(sr_return_amt) AS ctr_total_return
    FROM tpcds.store_returns
    JOIN tpcds.date_dim ON sr_returned_date_sk = d_date_sk
    WHERE d_year = 2000
    GROUP BY sr_customer_sk, sr_store_sk
)
SELECT c_customer_id
FROM customer_total_return ctr1
JOIN tpcds.store         ON s_store_sk = ctr1.ctr_store_sk
JOIN tpcds.customer      ON c_customer_sk = ctr1.ctr_customer_sk
WHERE s_state = 'TN'
  AND ctr1.ctr_total_return > (
      SELECT avg(ctr_total_return) * 1.2
      FROM customer_total_return ctr2
      WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk
  )
ORDER BY c_customer_id
LIMIT 100;

Step 6: Performance Monitoring & Real-Time Diagnostics

Monitor ClickHouse internals during benchmark execution using the built-in system tables. These queries should be run in a separate clickhouse-client session concurrently with the benchmark.

6.1 — Monitor Active Queries

-- Live view of all currently executing queries
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage)           AS memory,
    formatReadableSize(read_bytes)             AS bytes_read,
    formatReadableQuantity(read_rows)          AS rows_read,
    round(read_bytes / elapsed / 1e9, 2)       AS GB_per_sec,
    substring(query, 1, 80)                    AS query_snippet
FROM system.processes
WHERE query != ''
ORDER BY elapsed DESC;

6.2 — Track Merge Activity

-- Watch MergeTree background merge operations
SELECT
    database,
    table,
    round(progress * 100, 1)                          AS progress_pct,
    formatReadableSize(total_size_bytes_compressed)   AS total_size,
    formatReadableSize(bytes_read_uncompressed)       AS bytes_read,
    num_parts,
    elapsed
FROM system.merges
ORDER BY elapsed DESC;

6.3 — Adaptive Configuration Tuning

-- Apply performance tuning settings per-session before running heavy queries
SET max_threads                          = 96;
SET max_memory_usage                     = 350000000000;
SET use_uncompressed_cache               = 1;
SET max_bytes_before_external_group_by  = 50000000000;
SET max_bytes_before_external_sort      = 50000000000;
SET distributed_aggregation_memory_efficient = 1;
SET compile_expressions                  = 1;
SET optimize_move_to_prewhere            = 1;
SET allow_experimental_parallel_reading_from_replicas = 1;

Step 7: Analyzing Benchmark Results

After the benchmark run, parse the summary TSV, compute per-query statistics, and identify the slowest and fastest queries for further investigation.

7.1 — Parse Results & Compute Statistics (Python)

#!/usr/bin/env python3
"""
Analyze TPC-DS benchmark results from the summary TSV.
Computes best, median, mean, and p95 latency per query.
"""

import csv
import statistics
from pathlib import Path
from collections import defaultdict

SUMMARY_FILE = "/data/tpcds-results/summary.tsv"

# Read results
results = defaultdict(list)
failed  = defaultdict(int)

with open(SUMMARY_FILE, newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        q      = row["query"]
        status = row["status"]
        if status == "OK":
            results[q].append(int(row["wall_time_ms"]))
        else:
            failed[q] += 1

# Print analysis table
header = f"{'Query':<14} {'Best(ms)':>10} {'Median(ms)':>12} {'Mean(ms)':>10} {'P95(ms)':>10} {'Fails':>6}"
print(header)
print("-" * len(header))

total_time = 0
for q in sorted(results.keys()):
    times = sorted(results[q])
    best   = times[0]
    median = int(statistics.median(times))
    mean   = int(statistics.mean(times))
    p95    = int(statistics.quantiles(times, n=20)[18]) if len(times) >= 5 else times[-1]
    fails  = failed.get(q, 0)
    total_time += best
    print(f"{q:<14} {best:>10,} {median:>12,} {mean:>10,} {p95:>10,} {fails:>6}")

print("-" * len(header))
print(f"\nTotal benchmark time (sum of best runs): {total_time / 1000:.1f} seconds ({total_time / 60000:.1f} minutes)")

7.2 — Query the ClickHouse Query Log

-- Summarize the 10 slowest queries from system.query_log
SELECT
    query_id,
    formatReadableTimeDelta(query_duration_ms / 1000)  AS duration,
    formatReadableSize(memory_usage)                   AS peak_memory,
    formatReadableSize(read_bytes)                     AS bytes_read,
    formatReadableQuantity(read_rows)                  AS rows_read,
    round(read_bytes / query_duration_ms / 1e6, 2)    AS GB_per_sec,
    exception_code,
    substring(query, 1, 100)                           AS query_snippet
FROM system.query_log
WHERE
    type = 'QueryFinish'
    AND event_date = today()
    AND database = 'tpcds'
ORDER BY query_duration_ms DESC
LIMIT 10;

Step 8: Cleanup

After completing your benchmarking run, clean up all resources to avoid unnecessary AWS charges.

8.1 — Drop TPC-DS Database

-- Remove all TPC-DS tables and the database
DROP DATABASE IF EXISTS tpcds SYNC;

-- Verify removal
SHOW DATABASES;

8.2 — Decommission EC2 Resources

#!/bin/bash
# Retrieve instance metadata
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)

echo "Terminating instance: $INSTANCE_ID in $REGION"

# Stop ClickHouse gracefully before termination
sudo systemctl stop clickhouse-server

# Unmount EBS volume
sudo umount /data

# Terminate the EC2 instance
aws ec2 terminate-instances \
    --instance-ids "$INSTANCE_ID" \
    --region "$REGION"

echo "✔ Instance terminated. Remember to delete unattached EBS volumes from the AWS Console."

Conclusion

This run-book provides a fully scripted, end-to-end approach for running a TPC-DS 10TB benchmark on ClickHouse hosted on Amazon EC2. The workflow covers every phase — instance provisioning and ClickHouse installation, memory and thread tuning, parallel data generation via dsdgen, schema design with MergeTree, concurrent CSV ingestion, query execution with timing, real-time diagnostics from system.processes and system.merges, and statistical result analysis.

Key takeaways for production-grade benchmarking:

  • Parallelism — both data generation (-PARALLEL) and loading (ThreadPoolExecutor) are critical to keeping total elapsed time manageable at the 10TB scale.
  • Memory tuning — setting max_memory_usage to ~90% of available RAM prevents OOM kills while keeping swap-free execution.
  • MergeTree ordering — choose ORDER BY keys aligned with your most common query join and filter predicates to maximize data skipping via sparse indexes.
  • Query log analysissystem.query_log is your best post-mortem tool; pair it with EXPLAIN PIPELINE to diagnose bottlenecks in slow-running queries.