Untangling the Spaghetti: Writing Clean and Efficient ClickHouse SQL
In the world of big data analytics, ClickHouse has emerged as a powerful columnar database system capable of processing billions of rows and petabytes of data with blazing speed. However, with great power comes great responsibility – the responsibility to write clean, maintainable, and efficient SQL queries. Let’s explore how to transform complex “spaghetti queries” into well-structured, high-performance SQL code specifically optimized for ClickHouse.
The Problem with Spaghetti Queries
Spaghetti queries are characterized by:
- Overly complex SQL that’s difficult to understand at a glance
- Hard-to-maintain code where small changes can lead to unexpected results
- Inefficient execution that doesn’t leverage ClickHouse’s columnar architecture
- Collaboration barriers that make teamwork challenging
These issues become even more critical in ClickHouse environments due to:
- The massive data scale ClickHouse typically handles
- ClickHouse’s columnar architecture requiring different optimization strategies
- The inherently complex nature of analytical queries
Common Signs of Problematic Queries
Excessive Nesting
SELECT
    user_id,
    SUM(revenue) AS total_revenue
FROM
(
    SELECT
        user_id,
        (
            SELECT SUM(price)
            FROM
            (
                SELECT order_items.price, order_items.order_id
                FROM order_items
                WHERE order_items.order_id IN
                (
                    SELECT id FROM orders WHERE user_id = users.id
                )
            ) AS user_items
        ) AS revenue
    FROM users
    WHERE (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) > 0
) AS user_revenue
GROUP BY user_id
HAVING total_revenue > 100
ORDER BY total_revenue DESC
LIMIT 100;
This query is difficult to follow: multiple layers of nested subqueries create a maze-like structure.
JOIN Overload
SELECT
    u.name,
    COUNT(DISTINCT o.id) AS order_count,
    SUM(oi.price * oi.quantity) AS total_spent,
    MAX(o.created_at) AS last_order_date,
    (SELECT AVG(r.rating) FROM reviews r WHERE r.user_id = u.id) AS avg_rating,
    CASE
        WHEN (SELECT COUNT(*)
              FROM cart_items ci
              JOIN carts c ON ci.cart_id = c.id
              WHERE c.user_id = u.id AND c.status = 'active') > 0
        THEN true
        ELSE false
    END AS has_active_cart
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
LEFT JOIN order_items oi ON o.id = oi.order_id
LEFT JOIN products p ON oi.product_id = p.id
LEFT JOIN categories c ON p.category_id = c.id
LEFT JOIN payment_methods pm ON o.payment_method_id = pm.id
WHERE u.created_at BETWEEN '2020-01-01' AND '2023-12-31'
  AND u.status = 'active'
  AND (o.status = 'completed' OR o.status IS NULL)
  AND (c.name = 'Electronics' OR c.name IS NULL)
GROUP BY u.id, u.name
HAVING order_count > 0
ORDER BY total_spent DESC
LIMIT 100;
This query demonstrates excessive table joins with unclear relationships, making it memory-intensive and difficult to optimize.
The Path to Cleaner Queries
1. Common Table Expressions (CTEs) to the Rescue
CTEs create named subqueries that can be referenced multiple times, improving readability and maintainability. Keep in mind that ClickHouse generally inlines a CTE at each place it is referenced rather than materializing its result, so a CTE used several times may be evaluated several times; its main benefit is structure and readability.
Here’s how we can refactor our nested subquery example:
-- First get eligible orders
WITH user_orders AS
(
    SELECT orders.id, orders.user_id
    FROM orders
    JOIN users ON users.id = orders.user_id
    WHERE users.status = 'active'
),
-- Then calculate revenue per user
order_revenue AS
(
    SELECT
        user_orders.user_id,
        SUM(order_items.price) AS revenue
    FROM order_items
    JOIN user_orders ON order_items.order_id = user_orders.id
    GROUP BY user_orders.user_id
)
-- Finally, get the aggregated results
SELECT
    user_id,
    SUM(revenue) AS total_revenue
FROM order_revenue
GROUP BY user_id
HAVING total_revenue > 100
ORDER BY total_revenue DESC
LIMIT 100;
This approach breaks down the complex query into logical building blocks, making it more intuitive and easier to understand.
2. ClickHouse-Specific Optimizations
PREWHERE Clause
The PREWHERE clause performs preliminary filtering to reduce data scanned, making it most effective when filtering eliminates most rows and involves a small subset of columns.
-- Standard WHERE
SELECT user_id, event_time, event_type, page_url, device_type
FROM user_events
WHERE toDate(event_time) BETWEEN '2023-01-01' AND '2023-01-31'
  AND event_type = 'click';

-- Optimized PREWHERE
SELECT user_id, event_time, event_type, page_url, device_type
FROM user_events
PREWHERE toDate(event_time) BETWEEN '2023-01-01' AND '2023-01-31'
WHERE event_type = 'click';
PREWHERE applies the filtering condition before reading other columns, significantly reducing I/O when the filter eliminates many rows.
FINAL Modifier
For tables that use ReplacingMergeTree, CollapsingMergeTree, or other *MergeTree engines that transform rows during background merges, the FINAL modifier ensures only the final, fully merged version of each row is retrieved:
-- Without FINAL (may return duplicates)
SELECT user_id, session_id, last_activity
FROM user_sessions
WHERE user_id = 12345;

-- With FINAL (returns exactly one row per unique key)
SELECT user_id, session_id, last_activity
FROM user_sessions FINAL
WHERE user_id = 12345;
The FINAL modifier adds overhead but ensures data consistency – use it selectively when needed.
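When the overhead of FINAL is too high for a hot query path, one common workaround is to deduplicate with aggregation instead. Here is a minimal sketch against the same hypothetical user_sessions table, assuming last_activity increases with each new version of a row:

-- Keep only the newest version of each (user_id, session_id) without FINAL
-- Assumption: last_activity grows monotonically with row versions
SELECT
    user_id,
    session_id,
    max(last_activity) AS last_activity
FROM user_sessions
WHERE user_id = 12345
GROUP BY user_id, session_id;

Whether this beats FINAL depends on how many unmerged parts the query has to touch, so it is worth benchmarking both forms on real data.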
Materialized Views
Materialized views pre-aggregate data, dramatically speeding up common query patterns:
-- Create a materialized view for faster aggregations
CREATE MATERIALIZED VIEW daily_user_events
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day, event_type, user_id)
AS
SELECT
    toDate(event_time) AS day,
    event_type,
    user_id,
    count() AS events_count
FROM events
GROUP BY day, event_type, user_id;

-- Query the materialized view (much faster than calculating from raw events)
SELECT
    day,
    sum(events_count) AS total_events
FROM daily_user_events
WHERE day BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY day
ORDER BY day;
This approach pre-computes and incrementally updates derived results, making queries much faster.
3. Proper Formatting and Commenting
Compare these two versions of the same query:
-- Before: Poorly Formatted Query
SELECT u.user_id,u.username,u.email,count(o.id) as order_count,sum(o.total_amount) as total_spent,max(o.created_at) as last_order, (select count(*) from cart_items ci join carts c on ci.cart_id=c.id where c.user_id=u.user_id and c.status='active') as cart_items_count,avg(r.rating) as avg_rating, case when count(o.id)>10 then 'loyal' when count(o.id)>5 then 'regular' else 'new' end as user_segment from users u left join orders o on u.user_id=o.user_id left join reviews r on r.user_id=u.user_id where u.created_at>'2020-01-01' and (o.status='completed' or o.status is null) and u.status='active' group by u.user_id,u.username,u.email having total_spent>1000 order by total_spent desc limit 100;

-- After: Well-Formatted Query
-- Find high-value users with their order metrics and segmentation
-- Note: Filters to active users who have spent over $1000
SELECT
    u.user_id,
    u.username,
    u.email,
    COUNT(o.id) AS order_count,
    SUM(o.total_amount) AS total_spent,
    MAX(o.created_at) AS last_order,
    -- Get current items in user's cart
    (SELECT COUNT(*)
     FROM cart_items ci
     JOIN carts c ON ci.cart_id = c.id
     WHERE c.user_id = u.user_id
       AND c.status = 'active') AS cart_items_count,
    AVG(r.rating) AS avg_rating,
    -- Segment users based on order frequency
    CASE
        WHEN COUNT(o.id) > 10 THEN 'loyal'
        WHEN COUNT(o.id) > 5 THEN 'regular'
        ELSE 'new'
    END AS user_segment
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
LEFT JOIN reviews r ON r.user_id = u.user_id
WHERE u.created_at > '2020-01-01'
  AND (o.status = 'completed' OR o.status IS NULL)
  AND u.status = 'active'
GROUP BY u.user_id, u.username, u.email
HAVING total_spent > 1000
ORDER BY total_spent DESC
LIMIT 100;
The well-formatted version uses consistent indentation, logical line breaks, and meaningful comments to make the query much more readable.
4. Denormalization for Performance
Unlike traditional OLTP databases, ClickHouse is optimized for analytical processing over wide, denormalized tables. Consider denormalizing your data model:
-- Normalized Approach (Avoid)
-- Separate dimension tables
CREATE TABLE users (...);
CREATE TABLE products (...);
CREATE TABLE categories (...);

-- Fact table with foreign keys
CREATE TABLE events
(
    event_id UInt64,
    event_time DateTime,
    user_id UInt64,
    product_id UInt64
);

-- Denormalized Approach (Preferred)
-- Denormalized events table
CREATE TABLE events
(
    event_id UInt64,
    event_time DateTime,
    -- Embedded user dimensions
    user_id UInt64,
    user_name String,
    user_country String,
    -- Embedded product dimensions
    product_id UInt64,
    product_name String,
    product_price Decimal(10,2),
    category_name String
);
Denormalization trades increased storage space for dramatically faster query performance – a worthwhile tradeoff in many analytical scenarios.
5. Dictionary Tables for Dimension Data
For dimension data that changes infrequently, ClickHouse dictionaries provide extremely fast lookups. Dictionaries are loaded into RAM, so lookups with the dictGet family of functions can dramatically outperform equivalent JOIN operations.
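As a rough sketch (the country_dim source table, the column names, and the layout and lifetime settings below are illustrative assumptions rather than a prescribed setup), defining a dictionary over a small dimension table and querying it with dictGet might look like this:

-- Define a dictionary backed by a small dimension table
CREATE DICTIONARY country_dict
(
    country_id UInt64,
    country_name String
)
PRIMARY KEY country_id
SOURCE(CLICKHOUSE(TABLE 'country_dim'))
LAYOUT(FLAT())
LIFETIME(MIN 300 MAX 600);

-- Use dictGet() instead of joining the dimension table
SELECT
    dictGet('country_dict', 'country_name', toUInt64(user_country_id)) AS country,
    count() AS events
FROM events
GROUP BY country
ORDER BY events DESC;

Because the lookup is an in-memory hash or array access per row, it avoids the memory and shuffling costs of a hash JOIN against the dimension table.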
6. Window Functions for Clean Analytics
Window functions can replace complex self-joins or subqueries for analytical operations:
-- Messy approach with self-join
SELECT
    current.date,
    current.revenue,
    current.revenue - previous.revenue AS daily_change
FROM daily_sales current
LEFT JOIN daily_sales previous ON current.date = previous.date + 1
ORDER BY current.date;

-- Clean approach with window functions
SELECT
    date,
    revenue,
    revenue - lag(revenue) OVER (ORDER BY date) AS daily_change,
    sum(revenue) OVER (ORDER BY date) AS running_total,
    avg(revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS weekly_moving_avg
FROM daily_sales
ORDER BY date;
Window functions provide a cleaner, more efficient way to perform running aggregates, ranking, and lag/lead analysis.
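The example above covers lag and running aggregates; the ranking case works the same way. Here is a brief sketch, assuming a hypothetical region column on the daily_sales table:

-- Rank days by revenue within each region (the region column is assumed for illustration)
SELECT
    region,
    date,
    revenue,
    row_number() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
FROM daily_sales
ORDER BY region, revenue_rank;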
Query Testing and Validation
Before deploying your refactored queries to production, it’s essential to:
- Verify with sample data – Test queries with small, known datasets where you can manually verify results
- Compare with previous solutions – Run refactored queries in parallel with original versions to ensure consistent results
- Profile query performance – Use EXPLAIN and system.query_log to analyze execution plans and resource usage (see the sketch after this list)
- Build incrementally – Develop complex queries gradually, validating each component before combining
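As a rough illustration of the profiling step (the materialized view name is reused from the earlier example, and filtering on the query text is just one way to find the relevant log entries), EXPLAIN and system.query_log can be combined like this:

-- Inspect the query plan before running the query
EXPLAIN
SELECT
    day,
    sum(events_count) AS total_events
FROM daily_user_events
WHERE day BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY day;

-- Afterwards, check resource usage for recent runs of the query
SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%daily_user_events%'
ORDER BY event_time DESC
LIMIT 10;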
Conclusion
Writing clean and efficient ClickHouse SQL is both an art and a science. By leveraging CTEs, ClickHouse-specific optimizations, proper formatting, and denormalization strategies, you can transform complex spaghetti queries into maintainable, high-performance code.
Remember that in ClickHouse, query structure is even more important due to the massive data scale, columnar architecture, and complex analytics typically performed. Taking the time to refactor and optimize your queries will pay dividends in improved performance, reduced costs, and enhanced developer productivity.
Note: You can download the presentation of this blog post here