Untangling the Spaghetti: Writing Clean and Efficient ClickHouse SQL

In the world of big data analytics, ClickHouse has emerged as a powerful columnar database system capable of processing billions of rows and petabytes of data with blazing speed. However, with great power comes great responsibility – the responsibility to write clean, maintainable, and efficient SQL queries. Let’s explore how to transform complex “spaghetti queries” into well-structured, high-performance SQL code specifically optimized for ClickHouse.

The Problem with Spaghetti Queries

Spaghetti queries are characterized by:

  • Overly complex SQL that’s difficult to understand at a glance
  • Hard-to-maintain code where small changes can lead to unexpected results
  • Inefficient execution that doesn’t leverage ClickHouse’s columnar architecture
  • Collaboration barriers that make teamwork challenging

These issues become even more critical in ClickHouse environments due to:

  1. The massive data scale ClickHouse typically handles
  2. ClickHouse’s columnar architecture requiring different optimization strategies
  3. The inherently complex nature of analytical queries

Common Signs of Problematic Queries

Excessive Nesting

SELECT user_id, SUM(revenue) AS total_revenue 
FROM (
  SELECT user_id, 
    (SELECT SUM(price) 
     FROM (
       SELECT order_items.price, order_items.order_id 
       FROM order_items 
       WHERE order_items.order_id IN (
         SELECT id FROM orders 
         WHERE user_id = users.id
       )
     ) AS user_items
    ) AS revenue 
  FROM users 
  WHERE (
    SELECT COUNT(*) FROM orders 
    WHERE orders.user_id = users.id
  ) > 0
) AS user_revenue 
GROUP BY user_id 
HAVING total_revenue > 100 
ORDER BY total_revenue DESC LIMIT 100;

This query is difficult to follow with multiple layers of nested subqueries creating a maze-like structure.

JOIN Overload

SELECT u.name, COUNT(DISTINCT o.id) AS order_count, 
  SUM(oi.price * oi.quantity) AS total_spent, 
  MAX(o.created_at) AS last_order_date,
  (SELECT AVG(r.rating) FROM reviews r WHERE r.user_id = u.id) AS avg_rating,
  CASE WHEN 
    (SELECT COUNT(*) FROM cart_items ci 
     JOIN carts c ON ci.cart_id = c.id 
     WHERE c.user_id = u.id AND c.status = 'active') > 0 
  THEN true ELSE false END AS has_active_cart
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
LEFT JOIN order_items oi ON o.id = oi.order_id
LEFT JOIN products p ON oi.product_id = p.id
LEFT JOIN categories c ON p.category_id = c.id
LEFT JOIN payment_methods pm ON o.payment_method_id = pm.id
WHERE u.created_at BETWEEN '2020-01-01' AND '2023-12-31'
  AND u.status = 'active'
  AND (o.status = 'completed' OR o.status IS NULL)
  AND (c.name = 'Electronics' OR c.name IS NULL)
GROUP BY u.id, u.name
HAVING order_count > 0
ORDER BY total_spent DESC LIMIT 100;

This query demonstrates excessive table joins with unclear relationships, making it memory-intensive and difficult to optimize.

The Path to Cleaner Queries

1. Common Table Expressions (CTEs) to the Rescue

CTEs create named subqueries that can be referenced multiple times, improving readability and maintainability. Keep in mind that ClickHouse generally inlines a CTE wherever it is referenced rather than materializing it once, so a CTE used in several places may be evaluated several times; the main benefit is clearer query structure.

Here’s how we can refactor our nested subquery example:

-- First get eligible orders
WITH user_orders AS (
    SELECT orders.id, orders.user_id
    FROM orders 
    JOIN users ON users.id = orders.user_id
    WHERE users.status = 'active'
),
-- Then calculate revenue per order
order_revenue AS (
    SELECT user_orders.user_id, 
           SUM(order_items.price) AS revenue
    FROM order_items 
    JOIN user_orders ON order_items.order_id = user_orders.id
    GROUP BY user_orders.user_id
)
-- Finally, get the aggregated results
SELECT user_id, 
       SUM(revenue) AS total_revenue
FROM order_revenue
GROUP BY user_id
HAVING total_revenue > 100
ORDER BY total_revenue DESC LIMIT 100;

This approach breaks down the complex query into logical building blocks, making it more intuitive and easier to understand.

2. ClickHouse-Specific Optimizations

PREWHERE Clause

The PREWHERE clause performs preliminary filtering to reduce data scanned, making it most effective when filtering eliminates most rows and involves a small subset of columns.

-- Standard WHERE
SELECT 
  user_id, 
  event_time, 
  event_type, 
  page_url, 
  device_type
FROM user_events
WHERE toDate(event_time) 
  BETWEEN '2023-01-01' 
  AND '2023-01-31'
  AND event_type = 'click'

-- Optimized PREWHERE
SELECT 
  user_id, 
  event_time, 
  event_type, 
  page_url, 
  device_type
FROM user_events
PREWHERE toDate(event_time) 
  BETWEEN '2023-01-01' 
  AND '2023-01-31'
WHERE event_type = 'click'

PREWHERE applies the filtering condition before reading other columns, significantly reducing I/O when the filter eliminates many rows.
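
Note that ClickHouse can also move suitable WHERE conditions into PREWHERE on its own; the behavior is controlled by a session setting that is enabled by default, so writing PREWHERE explicitly matters most when you want to control the choice yourself. A minimal illustration:

-- Automatic movement of WHERE conditions to PREWHERE is controlled by this setting (on by default)
SET optimize_move_to_prewhere = 1;

-- Disable it to rely solely on explicit PREWHERE clauses
SET optimize_move_to_prewhere = 0;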

FINAL Modifier

For tables using ReplacingMergeTree, CollapsingMergeTree, or other *MergeTree engines that deduplicate or collapse rows during background merges, the FINAL modifier ensures only the final version of each row is retrieved:

-- Without FINAL (may return duplicates)
SELECT 
  user_id, 
  session_id, 
  last_activity
FROM user_sessions
WHERE user_id = 12345

-- With FINAL (returns exactly one row per unique key)
SELECT 
  user_id, 
  session_id, 
  last_activity
FROM user_sessions FINAL
WHERE user_id = 12345

The FINAL modifier adds overhead but ensures data consistency – use it selectively when needed.
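
When FINAL becomes a bottleneck on large tables, a common alternative is to deduplicate at query time with GROUP BY and an aggregate such as max() or argMax(). A sketch against the same hypothetical user_sessions table:

-- Deduplicate explicitly instead of using FINAL:
-- keep the most recent last_activity per (user_id, session_id)
SELECT
    user_id,
    session_id,
    max(last_activity) AS last_activity
FROM user_sessions
WHERE user_id = 12345
GROUP BY user_id, session_id;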

Materialized Views

Materialized views pre-aggregate data, dramatically speeding up common query patterns:

-- Create a materialized view for faster aggregations
CREATE MATERIALIZED VIEW daily_user_events
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day, event_type, user_id)
AS SELECT
    toDate(event_time) AS day,
    event_type,
    user_id,
    count() AS events_count
FROM events
GROUP BY day, event_type, user_id;

-- Query the materialized view (much faster than calculating from raw events)
SELECT
    day,
    sum(events_count) AS total_events
FROM daily_user_events
WHERE day BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY day
ORDER BY day;

This approach pre-computes and incrementally updates derived results, making queries much faster.
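
One caveat: a materialized view only processes rows inserted after it is created, so existing data must be backfilled separately. A common variant, sketched below with hypothetical names, writes into an explicit target table via the TO clause, which keeps the storage table visible and makes backfilling straightforward:

-- Explicit target table (hypothetical name)
CREATE TABLE daily_user_events_store
(
    day Date,
    event_type String,
    user_id UInt64,
    events_count UInt64
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day, event_type, user_id);

-- The view acts as an insert trigger that writes into the target table
CREATE MATERIALIZED VIEW daily_user_events_mv
TO daily_user_events_store
AS SELECT
    toDate(event_time) AS day,
    event_type,
    user_id,
    count() AS events_count
FROM events
GROUP BY day, event_type, user_id;

-- One-off backfill of historical rows, written directly into the target table
INSERT INTO daily_user_events_store
SELECT
    toDate(event_time) AS day,
    event_type,
    user_id,
    count() AS events_count
FROM events
GROUP BY day, event_type, user_id;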

3. Proper Formatting and Commenting

Compare these two versions of the same query:

-- Before: Poorly Formatted Query
SELECT u.user_id,u.username,u.email,count(o.id) as order_count,sum(o.total_amount) as total_spent,max(o.created_at) as last_order,
(select count(*) from cart_items ci join carts c on ci.cart_id=c.id where c.user_id=u.user_id and c.status='active') as cart_items_count,avg(r.rating) as avg_rating,
case when count(o.id)>10 then 'loyal' when count(o.id)>5 then 'regular' else 'new' end as user_segment from users u left join orders o on u.user_id=o.user_id left join reviews r on r.user_id=u.user_id
where u.created_at>'2020-01-01' and (o.status='completed' or o.status is null) and u.status='active'
group by u.user_id,u.username,u.email having total_spent>1000 order by total_spent desc limit 100;

-- After: Well-Formatted Query
-- Find high-value users with their order metrics and segmentation
-- Note: Filters to active users who have spent over $1000
SELECT
    u.user_id,
    u.username,
    u.email,
    COUNT(o.id) AS order_count,
    SUM(o.total_amount) AS total_spent,
    MAX(o.created_at) AS last_order,
    -- Get current items in user's cart
    (SELECT COUNT(*) 
     FROM cart_items ci 
     JOIN carts c ON ci.cart_id = c.id 
     WHERE c.user_id = u.user_id AND c.status = 'active') AS cart_items_count,
    AVG(r.rating) AS avg_rating,
    -- Segment users based on order frequency
    CASE 
        WHEN COUNT(o.id) > 10 THEN 'loyal'
        WHEN COUNT(o.id) > 5 THEN 'regular' 
        ELSE 'new' 
    END AS user_segment
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
LEFT JOIN reviews r ON r.user_id = u.user_id
WHERE 
    u.created_at > '2020-01-01' 
    AND (o.status = 'completed' OR o.status IS NULL)
    AND u.status = 'active'
GROUP BY u.user_id, u.username, u.email
HAVING total_spent > 1000
ORDER BY total_spent DESC
LIMIT 100;

The well-formatted version uses consistent indentation, logical line breaks, and meaningful comments to make the query much more readable.

4. Denormalization for Performance

Unlike traditional OLTP databases, ClickHouse is optimized for analytical processing of denormalized data. Consider denormalizing your data model:

-- Normalized Approach (Avoid)
-- Separate dimension tables
CREATE TABLE users(...)
CREATE TABLE products(...)
CREATE TABLE categories(...)
-- Facts table with foreign keys
CREATE TABLE events(
  event_id UInt64,
  event_time DateTime,
  user_id UInt64,
  product_id UInt64
)

-- Denormalized Approach (Preferred)
-- Denormalized events table
CREATE TABLE events(
  event_id UInt64,
  event_time DateTime,

  -- Embedded user dimensions
  user_id UInt64,
  user_name String,
  user_country String,

  -- Embedded product dimensions
  product_id UInt64,
  product_name String,
  product_price Decimal(10,2),
  category_name String
)
-- Note: a real table also needs an engine and sort key; this ordering is one reasonable choice
ENGINE = MergeTree()
ORDER BY (event_time, event_id);

Denormalization trades increased storage space for dramatically faster query performance – a worthwhile tradeoff in many analytical scenarios.
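
With the dimensions embedded, a query that would previously have needed several joins becomes a plain scan with column filters. A sketch against the denormalized events table above:

-- No joins required: dimension attributes are ordinary columns
SELECT
    user_country,
    count() AS events,
    uniq(user_id) AS unique_users
FROM events
WHERE category_name = 'Electronics'
  AND event_time >= '2023-01-01'
GROUP BY user_country
ORDER BY events DESC
LIMIT 20;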

5. Dictionary Tables for Dimension Data

For dimension data that changes infrequently, ClickHouse’s dictionary tables provide extremely fast lookups:

-- Query using the dictionary instead of JOIN
SELECT 
    e.event_id,
    e.product_id,
    dictGet('products', 'name', toUInt64(e.product_id)) AS product_name,
    dictGet('products', 'category', toUInt64(e.product_id)) AS category
FROM events e
WHERE e.event_date = today();

Dictionaries are loaded into RAM for extremely fast access, often dramatically outperforming equivalent JOINs for key-based lookups.
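
The query above assumes a dictionary named products has already been defined. A minimal sketch of such a definition, assuming a local dimension table products_dim as the source:

-- Hypothetical dictionary backed by a local dimension table
CREATE DICTIONARY products
(
    product_id UInt64,
    name String,
    category String
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'products_dim'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);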

6. Window Functions for Clean Analytics

Window functions can replace complex self-joins or subqueries for analytical operations:

-- Messy approach with self-join
SELECT 
  current.date,
  current.revenue,
  current.revenue - previous.revenue AS daily_change
FROM daily_sales current
LEFT JOIN daily_sales previous 
  ON current.date = previous.date + 1
ORDER BY current.date;

-- Clean approach with window functions
SELECT
  date,
  revenue,
  revenue - lag(revenue) OVER (ORDER BY date) AS daily_change,
  sum(revenue) OVER (ORDER BY date) AS running_total,
  avg(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS weekly_moving_avg
FROM daily_sales
ORDER BY date;

Window functions provide a cleaner, more efficient way to perform running aggregates, ranking, and lag/lead analysis. Depending on your ClickHouse version, lag() may need to be written as lagInFrame() with an explicit window frame.
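
Ranking follows the same pattern; here is a sketch that returns the top three revenue days per month from the same hypothetical daily_sales table:

-- Top 3 revenue days per month using a ranking window function
SELECT
    month,
    date,
    revenue
FROM
(
    SELECT
        toStartOfMonth(date) AS month,
        date,
        revenue,
        row_number() OVER (PARTITION BY toStartOfMonth(date) ORDER BY revenue DESC) AS rn
    FROM daily_sales
)
WHERE rn <= 3
ORDER BY month, rn;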

Query Testing and Validation

Before deploying your refactored queries to production, it’s essential to:

  1. Verify with sample data – Test queries with small, known datasets where you can manually verify results
  2. Compare with previous solutions – Run refactored queries in parallel with original versions to ensure consistent results
  3. Profile query performance – Use EXPLAIN and system.query_log to analyze execution plans and resource usage (see the sketch after this list)
  4. Build incrementally – Develop complex queries gradually, validating each component before combining
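
For profiling (step 3), ClickHouse provides the tooling in the server itself; here is a sketch reusing the hypothetical user_events table from earlier:

-- Inspect the execution plan, including which indexes and parts will be read
EXPLAIN indexes = 1
SELECT count()
FROM user_events
PREWHERE toDate(event_time) BETWEEN '2023-01-01' AND '2023-01-31'
WHERE event_type = 'click';

-- Compare resource usage of recent runs from the query log
SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%user_events%'
ORDER BY event_time DESC
LIMIT 10;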

Conclusion

Writing clean and efficient ClickHouse SQL is both an art and a science. By leveraging CTEs, ClickHouse-specific optimizations, proper formatting, and denormalization strategies, you can transform complex spaghetti queries into maintainable, high-performance code.

Remember that in ClickHouse, query structure is even more important due to the massive data scale, columnar architecture, and complex analytics typically performed. Taking the time to refactor and optimize your queries will pay dividends in improved performance, reduced costs, and enhanced developer productivity.

 

Note: You can download the presentation of this blog post here



 
