Advanced ClickHouse SQL: Window Functions, Array, and JSON Processing

ClickHouse offers powerful SQL capabilities that extend beyond basic queries. This article explores advanced SQL features in ClickHouse, focusing on window functions, array operations, and JSON processing.

Window Functions

ClickHouse introduced window functions in version 20.5, providing powerful analytical capabilities directly in SQL queries.

Basic Window Function Syntax

SELECT
    event_time,
    user_id,
    value,
    avg(value) OVER (PARTITION BY user_id ORDER BY event_time
                     ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS moving_avg
FROM user_events

Practical Window Function Examples

Calculating Moving Averages

SELECT
    timestamp,
    metric_value,
    avg(metric_value) OVER (
        ORDER BY timestamp
        ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) AS moving_avg_6points
FROM sensor_data
WHERE sensor_id = 'temp001'
ORDER BY timestamp

Ranking Results

SELECT
    user_id,
    purchase_amount,
    rank() OVER (ORDER BY purchase_amount DESC) AS purchase_rank,
    dense_rank() OVER (ORDER BY purchase_amount DESC) AS dense_purchase_rank
FROM purchases
WHERE purchase_date >= now() - INTERVAL 30 DAY

Calculating Differences Between Rows

SELECT
    timestamp,
    metric_value,
    metric_value - lag(metric_value) OVER (ORDER BY timestamp) AS value_change
FROM sensor_data
ORDER BY timestamp

Array Operations

ClickHouse has first-class support for array data types, enabling powerful operations directly in SQL.

Creating and Manipulating Arrays

-- Creating arrays
SELECT [1, 2, 3] AS simple_array;

-- Array functions
SELECT
    arrayJoin([1, 2, 3]) AS joined_array,
    arrayMap(x -> x * x, [1, 2, 3]) AS squared_array,
    arrayFilter(x -> x > 1, [1, 2, 3]) AS filtered_array

Practical Array Examples

Sessionization with Arrays

SELECT
    user_id,
    groupArray(event_time) AS event_times,
    groupArray(event_type) AS event_sequence,
    length(groupArray(event_type)) AS session_events_count
FROM user_events
GROUP BY user_id, session_id
HAVING session_events_count > 1

Array Aggregations

SELECT
    date,
    arraySort(groupUniqArray(user_id)) AS unique_users,
    length(arraySort(groupUniqArray(user_id))) AS unique_users_count
FROM page_views
GROUP BY date
ORDER BY date

Working with Nested Data

SELECT
    product_id,
    arrayJoin(categories) AS category,
    count() AS category_count
FROM products
GROUP BY product_id, category
ORDER BY category_count DESC

JSON Processing

ClickHouse provides extensive functions for working with JSON data, whether stored as strings or using specialized JSON types.

Basic JSON Extraction

SELECT
    request_id,
    JSONExtractString(payload, 'user', 'name') AS user_name,
    JSONExtractInt(payload, 'metrics', 'duration') AS duration
FROM api_logs
WHERE JSONHas(payload, 'error') = 1

Advanced JSON Techniques

Working with JSON Arrays

SELECT
    request_id,
    JSONLength(payload, 'items') AS items_count,
    arrayMap(x -> JSONExtractString(x, 'name'),
             JSONExtractArrayRaw(payload, 'items')) AS item_names
FROM orders
WHERE items_count > 0

JSON Path Expressions

SELECT
    event_id,
    JSONExtractString(data, 'user.profile.preferences.theme') AS theme,
    JSONExtractRaw(data, 'user.devices[0]') AS primary_device
FROM events
WHERE JSONType(data, 'user.profile') = 'object'

Converting Between JSON and Arrays

SELECT
    product_id,
    JSONExtractArrayRaw(attributes, 'options') AS options_json,
    arrayMap(x -> JSONExtractString(x, 'name'), options_json) AS option_names
FROM products
WHERE JSONHas(attributes, 'options')

Combining Techniques

The real power comes from combining these features:

SELECT
    user_id,
    -- Window function to get session rank by duration
    rank() OVER (PARTITION BY user_id ORDER BY session_duration DESC) AS session_rank,
    -- Array of events in the session
    groupArray(event_type) AS events,
    -- JSON extraction from each event
    arrayMap(x -> JSONExtractString(x, 'details.page'),
             groupArray(event_payload)) AS pages_visited
FROM user_sessions
WHERE session_start >= now() - INTERVAL 7 DAY
GROUP BY user_id, session_id, session_duration
HAVING session_rank <= 3
ORDER BY user_id, session_rank

Performance Considerations

  • Window functions can be memory-intensive; use appropriate LIMIT clauses
  • Large arrays may impact performance; consider using arrayJoin earlier in the query pipeline
  • JSON extraction is slower than native columns; extract frequently used fields to dedicated columns

Conclusion

ClickHouse’s advanced SQL capabilities provide powerful tools for complex analytical workloads. Window functions enable sophisticated time-series analysis, array operations simplify working with collections of data, and JSON processing functions make it easy to work with semi-structured data.

By mastering these techniques, you can solve complex analytical problems directly in SQL without needing to export data to external processing systems, taking full advantage of ClickHouse’s exceptional performance.

 

Untangling the Spaghetti: Writing Efficient ClickHouse SQL

 

Most Common ClickHouse Analytical Models

 

Real-Time Bid Tracking and Optimization with ClickHouse in High-Performance Data Pipelines

About Shiv Iyer 254 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.

Be the first to comment

Leave a Reply