ClickHouse offers powerful SQL capabilities that extend beyond basic queries. This article explores advanced SQL features in ClickHouse, focusing on window functions, array operations, and JSON processing.
Window Functions
ClickHouse introduced window functions in version 20.5, providing powerful analytical capabilities directly in SQL queries.
Basic Window Function Syntax
SELECT event_time, user_id, value, avg(value) OVER (PARTITION BY user_id ORDER BY event_time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS moving_avg FROM user_events
Practical Window Function Examples
Calculating Moving Averages
SELECT timestamp, metric_value, avg(metric_value) OVER ( ORDER BY timestamp ROWS BETWEEN 5 PRECEDING AND CURRENT ROW ) AS moving_avg_6points FROM sensor_data WHERE sensor_id = 'temp001' ORDER BY timestamp
Ranking Results
SELECT user_id, purchase_amount, rank() OVER (ORDER BY purchase_amount DESC) AS purchase_rank, dense_rank() OVER (ORDER BY purchase_amount DESC) AS dense_purchase_rank FROM purchases WHERE purchase_date >= now() - INTERVAL 30 DAY
Calculating Differences Between Rows
SELECT timestamp, metric_value, metric_value - lag(metric_value) OVER (ORDER BY timestamp) AS value_change FROM sensor_data ORDER BY timestamp
Array Operations
ClickHouse has first-class support for array data types, enabling powerful operations directly in SQL.
Creating and Manipulating Arrays
-- Creating arrays SELECT [1, 2, 3] AS simple_array; -- Array functions SELECT arrayJoin([1, 2, 3]) AS joined_array, arrayMap(x -> x * x, [1, 2, 3]) AS squared_array, arrayFilter(x -> x > 1, [1, 2, 3]) AS filtered_array
Practical Array Examples
Sessionization with Arrays
SELECT user_id, groupArray(event_time) AS event_times, groupArray(event_type) AS event_sequence, length(groupArray(event_type)) AS session_events_count FROM user_events GROUP BY user_id, session_id HAVING session_events_count > 1
Array Aggregations
SELECT date, arraySort(groupUniqArray(user_id)) AS unique_users, length(arraySort(groupUniqArray(user_id))) AS unique_users_count FROM page_views GROUP BY date ORDER BY date
Working with Nested Data
SELECT product_id, arrayJoin(categories) AS category, count() AS category_count FROM products GROUP BY product_id, category ORDER BY category_count DESC
JSON Processing
ClickHouse provides extensive functions for working with JSON data, whether stored as strings or using specialized JSON types.
Basic JSON Extraction
SELECT request_id, JSONExtractString(payload, 'user', 'name') AS user_name, JSONExtractInt(payload, 'metrics', 'duration') AS duration FROM api_logs WHERE JSONHas(payload, 'error') = 1
Advanced JSON Techniques
Working with JSON Arrays
SELECT request_id, JSONLength(payload, 'items') AS items_count, arrayMap(x -> JSONExtractString(x, 'name'), JSONExtractArrayRaw(payload, 'items')) AS item_names FROM orders WHERE items_count > 0
JSON Path Expressions
SELECT event_id, JSONExtractString(data, 'user.profile.preferences.theme') AS theme, JSONExtractRaw(data, 'user.devices[0]') AS primary_device FROM events WHERE JSONType(data, 'user.profile') = 'object'
Converting Between JSON and Arrays
SELECT product_id, JSONExtractArrayRaw(attributes, 'options') AS options_json, arrayMap(x -> JSONExtractString(x, 'name'), options_json) AS option_names FROM products WHERE JSONHas(attributes, 'options')
Combining Techniques
The real power comes from combining these features:
SELECT user_id, -- Window function to get session rank by duration rank() OVER (PARTITION BY user_id ORDER BY session_duration DESC) AS session_rank, -- Array of events in the session groupArray(event_type) AS events, -- JSON extraction from each event arrayMap(x -> JSONExtractString(x, 'details.page'), groupArray(event_payload)) AS pages_visited FROM user_sessions WHERE session_start >= now() - INTERVAL 7 DAY GROUP BY user_id, session_id, session_duration HAVING session_rank <= 3 ORDER BY user_id, session_rank
Performance Considerations
- Window functions can be memory-intensive; use appropriate LIMIT clauses
- Large arrays may impact performance; consider using arrayJoin earlier in the query pipeline
- JSON extraction is slower than native columns; extract frequently used fields to dedicated columns
Conclusion
ClickHouse’s advanced SQL capabilities provide powerful tools for complex analytical workloads. Window functions enable sophisticated time-series analysis, array operations simplify working with collections of data, and JSON processing functions make it easy to work with semi-structured data.
By mastering these techniques, you can solve complex analytical problems directly in SQL without needing to export data to external processing systems, taking full advantage of ClickHouse’s exceptional performance.
Real-Time Bid Tracking and Optimization with ClickHouse in High-Performance Data Pipelines
Leave a Reply
You must be logged in to post a comment.