Using GROUP BY for Groupings, Rollups and Cubes in ClickHouse

Introduction

Grouping, rollup, and cube are SQL operations that aggregate data across one or more dimensions. In ClickHouse, all three are expressed through the GROUP BY clause: a plain GROUP BY aggregates by the listed columns, while the ROLLUP and CUBE modifiers add subtotal and grand-total rows to the result. The following examples, based on typical datasets, show how to implement groupings, rollups, and cubes in ClickHouse:

Example 1: Sales Data

Suppose we have a sales table with the following columns: order_id, customer_id, order_date, product_id, and quantity. We want to calculate the total quantity sold for each product and each month. Here’s how we can do this using grouping:

SELECT product_id, toMonth(order_date) AS month, sum(quantity) AS total_quantity
FROM sales
GROUP BY product_id, month
ORDER BY product_id, month

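This query groups the sales data by product_id and month and calculates the total quantity sold for each combination of product and month. The toMonth() function extracts the month from the order_date column. Note that toMonth() returns only the month number (1 to 12), so if the table spans several years, rows from the same calendar month in different years fall into the same group. Grouping by toStartOfMonth(order_date) or toYYYYMM(order_date) instead keeps each year's months separate.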
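The grouping logic can be tried out without a ClickHouse server. The following is a minimal Python sketch, not ClickHouse code: the sample rows and the helper total_by are hypothetical and exist only for illustration. It sums quantity per (product, month) key twice, once with the month number alone, as toMonth() produces it, and once with year and month combined, similar in spirit to toYYYYMM().

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (product_id, order_date, quantity)
sales = [
    (1, date(2023, 1, 5), 10),
    (1, date(2023, 1, 20), 5),
    (1, date(2024, 1, 9), 7),
    (2, date(2023, 2, 14), 3),
]

def total_by(rows, key):
    """Sum quantity per grouping key, like GROUP BY with sum(quantity)."""
    totals = defaultdict(int)
    for product_id, order_date, quantity in rows:
        totals[key(product_id, order_date)] += quantity
    return dict(sorted(totals.items()))

# Month number only (1-12): January 2023 and January 2024 are merged.
by_month = total_by(sales, lambda p, d: (p, d.month))

# Year and month combined into one number: the two Januaries stay apart.
by_year_month = total_by(sales, lambda p, d: (p, d.year * 100 + d.month))

print(by_month)
print(by_year_month)
```

With this sample data, the month-only key yields two groups while the year-and-month key yields three, which is the difference to watch for when a table covers more than one year.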

Example 2: Web Traffic Data

Suppose we have a web traffic table with the following columns: timestamp, ip_address, page_url, user_agent. We want to calculate the number of page views by browser type and operating system. Here’s how we can do this using rollup:

SELECT
    CASE
        WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
        WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
        ELSE 'Other'
    END AS browser,
    CASE
        WHEN user_agent LIKE '%Windows%' THEN 'Windows'
        WHEN user_agent LIKE '%Mac OS%' THEN 'Mac OS'
        ELSE 'Other'
    END AS os,
    count(*) AS page_views
FROM web_traffic
GROUP BY ROLLUP(browser, os)
ORDER BY browser, os

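This query groups the web traffic data by browser and operating system and calculates the number of page views for each combination. The ROLLUP modifier adds a hierarchy of subtotals that follows the order of the listed keys: in addition to the (browser, os) rows, the query returns one subtotal row per browser (across all operating systems) and one grand-total row. It does not return subtotals per operating system; that would require CUBE or listing os first. By default, ClickHouse fills the rolled-up key columns in subtotal rows with default values (an empty string for these String columns) rather than NULL; in recent versions the group_by_use_nulls setting changes this behavior.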
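To make the set of rows produced by ROLLUP concrete, here is a small Python sketch that emulates it on a handful of hypothetical page views. The helpers rollup_sets and aggregate are illustrative names, not part of ClickHouse, and the empty-string placeholder is an assumption chosen to mirror how ClickHouse fills String keys in subtotal rows by default.

```python
from collections import Counter

# Hypothetical (browser, os) pairs, one per page view, as the CASE
# expressions in the query would produce them.
views = [
    ("Chrome", "Windows"),
    ("Chrome", "Windows"),
    ("Chrome", "Mac OS"),
    ("Firefox", "Windows"),
]

def rollup_sets(n):
    """Key positions kept at each ROLLUP level: all keys, then shorter prefixes."""
    return [tuple(range(k)) for k in range(n, -1, -1)]

def aggregate(rows, grouping_sets, placeholder=""):
    """Count rows per grouping set; keys that are rolled up become the placeholder."""
    counts = Counter()
    for kept in grouping_sets:
        for row in rows:
            key = tuple(v if i in kept else placeholder for i, v in enumerate(row))
            counts[key] += 1
    return counts

result = aggregate(views, rollup_sets(2))
for key in sorted(result):
    print(key, result[key])
```

For two keys, rollup_sets(2) yields the levels (browser, os), (browser) and the empty set, so the output contains per-browser subtotals and a grand total but no row such as ('', 'Windows').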

Example 3: Employee Data

Suppose we have an employee table with the following columns: employee_id, department, job_title, salary. We want to calculate the average salary by department and job title, along with subtotals by department, subtotals by job title, and an overall figure for all employees. Here’s how we can do this using cube:

SELECT department, job_title, avg(salary) AS avg_salary
FROM employees
GROUP BY CUBE(department, job_title)
ORDER BY department, job_title

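This query groups the employee data by department and job title and calculates the average salary for each combination. The CUBE modifier adds a row for every possible subset of the grouping keys, so the result also contains the average salary per department (across all job titles), per job title (across all departments), and for all employees. Each of these averages is computed from the underlying salary rows, not by averaging the finer-grained averages.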
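The difference from ROLLUP is easiest to see by listing the grouping sets. The Python sketch below emulates CUBE on three hypothetical employee rows; cube_sets and cube_avg are illustrative helpers, not ClickHouse functions, and the empty-string placeholder for cubed-out keys is an assumption that mirrors ClickHouse's default for String columns.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (department, job_title, salary)
employees = [
    ("Eng", "Developer", 100),
    ("Eng", "Manager", 140),
    ("Sales", "Manager", 120),
]

def cube_sets(n):
    """Every subset of key positions, from all n keys down to none."""
    return [c for k in range(n, -1, -1) for c in combinations(range(n), k)]

def cube_avg(rows, n_keys=2, placeholder=""):
    """Average the last column per grouping set, computed from the raw rows."""
    sums = defaultdict(lambda: [0, 0])  # key -> [salary total, row count]
    for kept in cube_sets(n_keys):
        for row in rows:
            key = tuple(v if i in kept else placeholder
                        for i, v in enumerate(row[:n_keys]))
            sums[key][0] += row[n_keys]
            sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

averages = cube_avg(employees)
for key in sorted(averages):
    print(key, averages[key])
```

With two keys, cube_sets(2) produces four grouping sets, one more than ROLLUP would: the extra set keeps only job_title, which is where rows such as ('', 'Manager') come from.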

Conclusion

In summary, grouping, rollup, and cube are SQL operations for aggregating data across multiple dimensions. In ClickHouse, they are implemented using the GROUP BY clause together with the ROLLUP and CUBE modifiers: a plain GROUP BY returns one row per key combination, ROLLUP adds hierarchical subtotals and a grand total, and CUBE adds subtotals for every combination of keys. Used together, they let you produce detail rows and summary rows in a single query over large data sets.

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv is currently the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker at open source software conferences globally.