Untangling the Spaghetti: Writing Clean and Efficient ClickHouse SQL
In the world of big data analytics, ClickHouse has emerged as a powerful columnar database system capable of processing billions of rows and petabytes of data with blazing speed. However, with great power comes great responsibility – the responsibility to write clean, maintainable, and efficient SQL queries. Let’s explore how to transform complex “spaghetti queries” into well-structured, high-performance SQL code specifically optimized for ClickHouse.
The Problem with Spaghetti Queries
Spaghetti queries are characterized by:
- Overly complex SQL that’s difficult to understand at a glance
- Hard-to-maintain code where small changes can lead to unexpected results
- Inefficient execution that doesn’t leverage ClickHouse’s columnar architecture
- Collaboration barriers that make teamwork challenging
These issues become even more critical in ClickHouse environments due to:
- The massive data scale ClickHouse typically handles
- ClickHouse’s columnar architecture requiring different optimization strategies
- The inherently complex nature of analytical queries
Common Signs of Problematic Queries
Excessive Nesting
SELECT
    user_id,
    SUM(revenue) AS total_revenue
FROM
(
    SELECT
        user_id,
        (
            SELECT SUM(price)
            FROM
            (
                SELECT order_items.price, order_items.order_id
                FROM order_items
                WHERE order_items.order_id IN
                (
                    SELECT id FROM orders WHERE user_id = users.id
                )
            ) AS user_items
        ) AS revenue
    FROM users
    WHERE (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) > 0
) AS user_revenue
GROUP BY user_id
HAVING total_revenue > 100
ORDER BY total_revenue DESC
LIMIT 100;
This query is difficult to follow: multiple layers of nested subqueries create a maze-like structure.
JOIN Overload
SELECT
    u.name,
    COUNT(DISTINCT o.id) AS order_count,
    SUM(oi.price * oi.quantity) AS total_spent,
    MAX(o.created_at) AS last_order_date,
    (SELECT AVG(r.rating) FROM reviews r WHERE r.user_id = u.id) AS avg_rating,
    CASE
        WHEN (SELECT COUNT(*)
              FROM cart_items ci
              JOIN carts c ON ci.cart_id = c.id
              WHERE c.user_id = u.id AND c.status = 'active') > 0
        THEN true
        ELSE false
    END AS has_active_cart
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
LEFT JOIN order_items oi ON o.id = oi.order_id
LEFT JOIN products p ON oi.product_id = p.id
LEFT JOIN categories c ON p.category_id = c.id
LEFT JOIN payment_methods pm ON o.payment_method_id = pm.id
WHERE u.created_at BETWEEN '2020-01-01' AND '2023-12-31'
  AND u.status = 'active'
  AND (o.status = 'completed' OR o.status IS NULL)
  AND (c.name = 'Electronics' OR c.name IS NULL)
GROUP BY u.id, u.name
HAVING order_count > 0
ORDER BY total_spent DESC
LIMIT 100;
This query demonstrates excessive table joins with unclear relationships, making it memory-intensive and difficult to optimize.
The Path to Cleaner Queries
1. Common Table Expressions (CTEs) to the Rescue
CTEs create named subqueries that can be referenced multiple times, improving readability and maintainability. Keep in mind that ClickHouse generally inlines a CTE at each place it is referenced rather than materializing its result, so a CTE used several times may be evaluated several times; its main benefit is structure and readability.
Here’s how we can refactor our nested subquery example:
-- First get eligible orders
WITH user_orders AS
(
    SELECT orders.id, orders.user_id
    FROM orders
    JOIN users ON users.id = orders.user_id
    WHERE users.status = 'active'
),
-- Then calculate revenue per user
order_revenue AS
(
    SELECT
        user_orders.user_id,
        SUM(order_items.price) AS revenue
    FROM order_items
    JOIN user_orders ON order_items.order_id = user_orders.id
    GROUP BY user_orders.user_id
)
-- Finally, get the aggregated results
SELECT
    user_id,
    SUM(revenue) AS total_revenue
FROM order_revenue
GROUP BY user_id
HAVING total_revenue > 100
ORDER BY total_revenue DESC
LIMIT 100;
This approach breaks down the complex query into logical building blocks, making it more intuitive and easier to understand.
2. ClickHouse-Specific Optimizations
PREWHERE Clause
The PREWHERE clause performs preliminary filtering to reduce data scanned, making it most effective when filtering eliminates most rows and involves a small subset of columns.
-- Standard WHERE
SELECT user_id, event_time, event_type, page_url, device_type
FROM user_events
WHERE toDate(event_time) BETWEEN '2023-01-01' AND '2023-01-31'
  AND event_type = 'click';

-- Optimized PREWHERE
SELECT user_id, event_time, event_type, page_url, device_type
FROM user_events
PREWHERE toDate(event_time) BETWEEN '2023-01-01' AND '2023-01-31'
WHERE event_type = 'click';
PREWHERE applies the filtering condition before reading other columns, significantly reducing I/O when the filter eliminates many rows.
FINAL Modifier
For tables that use ReplacingMergeTree, CollapsingMergeTree, or other *MergeTree engines that transform rows during background merges, the FINAL modifier ensures only the final, fully merged version of each row is retrieved:
-- Without FINAL (may return duplicates)
SELECT user_id, session_id, last_activity
FROM user_sessions
WHERE user_id = 12345;

-- With FINAL (returns exactly one row per unique key)
SELECT user_id, session_id, last_activity
FROM user_sessions FINAL
WHERE user_id = 12345;
The FINAL modifier adds overhead but ensures data consistency – use it selectively when needed.
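When the overhead of FINAL is too high for a hot query path, one common workaround is to deduplicate with aggregation instead. Here is a minimal sketch against the same hypothetical user_sessions table, assuming last_activity increases with each new version of a row:

-- Keep only the newest version of each (user_id, session_id) without FINAL
-- Assumption: last_activity grows monotonically with row versions
SELECT
    user_id,
    session_id,
    max(last_activity) AS last_activity
FROM user_sessions
WHERE user_id = 12345
GROUP BY user_id, session_id;

Whether this beats FINAL depends on how many unmerged parts the query has to touch, so it is worth benchmarking both forms on real data.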
Materialized Views
Materialized views pre-aggregate data, dramatically speeding up common query patterns:
-- Create a materialized view for faster aggregations
CREATE MATERIALIZED VIEW daily_user_events
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day, event_type, user_id)
AS
SELECT
    toDate(event_time) AS day,
    event_type,
    user_id,
    count() AS events_count
FROM events
GROUP BY day, event_type, user_id;

-- Query the materialized view (much faster than calculating from raw events)
SELECT
    day,
    sum(events_count) AS total_events
FROM daily_user_events
WHERE day BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY day
ORDER BY day;
This approach pre-computes and incrementally updates derived results, making queries much faster.
3. Proper Formatting and Commenting
Compare these two versions of the same query:
-- Before: Poorly Formatted Query
SELECT u.user_id,u.username,u.email,count(o.id) as order_count,sum(o.total_amount) as total_spent,max(o.created_at) as last_order, (select count(*) from cart_items ci join carts c on ci.cart_id=c.id where c.user_id=u.user_id and c.status='active') as cart_items_count,avg(r.rating) as avg_rating, case when count(o.id)>10 then 'loyal' when count(o.id)>5 then 'regular' else 'new' end as user_segment from users u left join orders o on u.user_id=o.user_id left join reviews r on r.user_id=u.user_id where u.created_at>'2020-01-01' and (o.status='completed' or o.status is null) and u.status='active' group by u.user_id,u.username,u.email having total_spent>1000 order by total_spent desc limit 100;

-- After: Well-Formatted Query
-- Find high-value users with their order metrics and segmentation
-- Note: Filters to active users who have spent over $1000
SELECT
    u.user_id,
    u.username,
    u.email,
    COUNT(o.id) AS order_count,
    SUM(o.total_amount) AS total_spent,
    MAX(o.created_at) AS last_order,
    -- Get current items in user's cart
    (SELECT COUNT(*)
     FROM cart_items ci
     JOIN carts c ON ci.cart_id = c.id
     WHERE c.user_id = u.user_id
       AND c.status = 'active') AS cart_items_count,
    AVG(r.rating) AS avg_rating,
    -- Segment users based on order frequency
    CASE
        WHEN COUNT(o.id) > 10 THEN 'loyal'
        WHEN COUNT(o.id) > 5 THEN 'regular'
        ELSE 'new'
    END AS user_segment
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
LEFT JOIN reviews r ON r.user_id = u.user_id
WHERE u.created_at > '2020-01-01'
  AND (o.status = 'completed' OR o.status IS NULL)
  AND u.status = 'active'
GROUP BY u.user_id, u.username, u.email
HAVING total_spent > 1000
ORDER BY total_spent DESC
LIMIT 100;
The well-formatted version uses consistent indentation, logical line breaks, and meaningful comments to make the query much more readable.
4. Denormalization for Performance
Unlike traditional OLTP databases, ClickHouse is optimized for analytical processing over wide, denormalized tables. Consider denormalizing your data model:
-- Normalized Approach (Avoid)
-- Separate dimension tables
CREATE TABLE users (...);
CREATE TABLE products (...);
CREATE TABLE categories (...);

-- Fact table with foreign keys
CREATE TABLE events
(
    event_id UInt64,
    event_time DateTime,
    user_id UInt64,
    product_id UInt64
);

-- Denormalized Approach (Preferred)
-- Denormalized events table
CREATE TABLE events
(
    event_id UInt64,
    event_time DateTime,
    -- Embedded user dimensions
    user_id UInt64,
    user_name String,
    user_country String,
    -- Embedded product dimensions
    product_id UInt64,
    product_name String,
    product_price Decimal(10,2),
    category_name String
);
Denormalization trades increased storage space for dramatically faster query performance – a worthwhile tradeoff in many analytical scenarios.
5. Dictionary Tables for Dimension Data
For dimension data that changes infrequently, ClickHouse dictionaries provide extremely fast lookups. Dictionaries are loaded into RAM, so lookups with the dictGet family of functions can dramatically outperform equivalent JOIN operations.
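As a rough sketch (the country_dim source table, the column names, and the layout and lifetime settings below are illustrative assumptions rather than a prescribed setup), defining a dictionary over a small dimension table and querying it with dictGet might look like this:

-- Define a dictionary backed by a small dimension table
CREATE DICTIONARY country_dict
(
    country_id UInt64,
    country_name String
)
PRIMARY KEY country_id
SOURCE(CLICKHOUSE(TABLE 'country_dim'))
LAYOUT(FLAT())
LIFETIME(MIN 300 MAX 600);

-- Use dictGet() instead of joining the dimension table
SELECT
    dictGet('country_dict', 'country_name', toUInt64(user_country_id)) AS country,
    count() AS events
FROM events
GROUP BY country
ORDER BY events DESC;

Because the lookup is an in-memory hash or array access per row, it avoids the memory and shuffling costs of a hash JOIN against the dimension table.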
6. Window Functions for Clean Analytics
Window functions can replace complex self-joins or subqueries for analytical operations:
-- Messy approach with self-join
SELECT
    current.date,
    current.revenue,
    current.revenue - previous.revenue AS daily_change
FROM daily_sales current
LEFT JOIN daily_sales previous ON current.date = previous.date + 1
ORDER BY current.date;

-- Clean approach with window functions
SELECT
    date,
    revenue,
    revenue - lag(revenue) OVER (ORDER BY date) AS daily_change,
    sum(revenue) OVER (ORDER BY date) AS running_total,
    avg(revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS weekly_moving_avg
FROM daily_sales
ORDER BY date;
Window functions provide a cleaner, more efficient way to perform running aggregates, ranking, and lag/lead analysis.
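The example above covers lag and running aggregates; the ranking case works the same way. Here is a brief sketch, assuming a hypothetical region column on the daily_sales table:

-- Rank days by revenue within each region (the region column is assumed for illustration)
SELECT
    region,
    date,
    revenue,
    row_number() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
FROM daily_sales
ORDER BY region, revenue_rank;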
Query Testing and Validation
Before deploying your refactored queries to production, it’s essential to:
- Verify with sample data – Test queries with small, known datasets where you can manually verify results
- Compare with previous solutions – Run refactored queries in parallel with original versions to ensure consistent results
- Profile query performance – Use EXPLAIN and system.query_log to analyze execution plans and resource usage (see the sketch after this list)
- Build incrementally – Develop complex queries gradually, validating each component before combining
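As a rough illustration of the profiling step (the materialized view name is reused from the earlier example, and filtering on the query text is just one way to find the relevant log entries), EXPLAIN and system.query_log can be combined like this:

-- Inspect the query plan before running the query
EXPLAIN
SELECT
    day,
    sum(events_count) AS total_events
FROM daily_user_events
WHERE day BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY day;

-- Afterwards, check resource usage for recent runs of the query
SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%daily_user_events%'
ORDER BY event_time DESC
LIMIT 10;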
Conclusion
Writing clean and efficient ClickHouse SQL is both an art and a science. By leveraging CTEs, ClickHouse-specific optimizations, proper formatting, and denormalization strategies, you can transform complex spaghetti queries into maintainable, high-performance code.
Remember that in ClickHouse, query structure is even more important due to the massive data scale, columnar architecture, and complex analytics typically performed. Taking the time to refactor and optimize your queries will pay dividends in improved performance, reduced costs, and enhanced developer productivity.
Note: You can download the presentation of this blog post here