What are working data sets in ClickHouse?



In ClickHouse, a working dataset is the data a query holds in memory while it performs an operation such as sorting, merging, aggregation, or a join. ClickHouse uses several kinds of working datasets, each suited to a particular operation.

  1. MergeSortedBlock: A working dataset used for sorting and merging. It holds rows in memory in sorted order, so several sorted inputs can be merged into one sorted result, and it can also feed aggregation over sorted data.
  2. GroupBySortedBlock: A working dataset used for GROUP BY operations. It holds rows in memory in sorted order, grouped by the specified columns, so that rows with the same key sit next to each other.
  3. Columns: A working dataset that holds the values of a single column in memory. Filtering, aggregation, and sorting operate on these per-column arrays rather than on whole rows.
  4. AggregatingBlockInputStream: A working dataset used for aggregate operations (the name comes from the older block-input-stream pipeline). It keeps the intermediate aggregation state in memory and returns the result once the input has been processed.
  5. Join: A working dataset used for join operations. It holds the data of one side of the join in memory and matches rows from the other side against it before returning the result.
  6. Distinct: A working dataset used for DISTINCT operations. It keeps the set of values already seen in memory and returns each distinct value only once.
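
To make three of the structures above more concrete, here is a minimal Python sketch of merging sorted blocks, a hash-based join, and a distinct filter. It illustrates the general techniques only and is not ClickHouse's C++ implementation; the function names (`merge_sorted_blocks`, `hash_join`, `distinct_rows`) and the rows-as-dicts layout are assumptions made for this example.

```python
import heapq
from collections import defaultdict


def merge_sorted_blocks(blocks, key):
    """K-way merge of blocks that are each already sorted by `key`."""
    return list(heapq.merge(*blocks, key=key))


def hash_join(left_rows, right_rows, left_key, right_key):
    """Inner join: build an in-memory hash table on the right side,
    then probe it with each row from the left side."""
    table = defaultdict(list)
    for row in right_rows:
        table[row[right_key]].append(row)

    joined = []
    for row in left_rows:
        for match in table.get(row[left_key], []):
            joined.append({**row, **match})
    return joined


def distinct_rows(rows, columns):
    """Keep the first row seen for each combination of values in `columns`."""
    seen = set()
    result = []
    for row in rows:
        k = tuple(row[c] for c in columns)
        if k not in seen:
            seen.add(k)
            result.append(row)
    return result
```

In each case the cost is memory: the merge keeps one cursor per input block, the join keeps the whole right-hand side in a hash table, and the distinct filter keeps every key it has seen so far.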

Each of these working datasets serves a specific operation, and ClickHouse automatically chooses the appropriate one based on the query being executed. Whether the operation is filtering, sorting, or grouping, the goal is the same: keep intermediate data in memory so that the query needs fewer disk I/O operations and finishes faster.
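
The effect on I/O can be sketched with a streaming aggregation: every block of rows is read once, and only the per-key state stays in memory between blocks. This is a simplified illustration in Python, not ClickHouse code; `aggregate_blocks`, the column names, and the block layout are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_blocks(blocks, key_col, value_col):
    """Streaming GROUP BY computing sum and count per key.

    `blocks` can be any iterable (for example a generator reading from disk);
    each block is consumed exactly once and then discarded.
    """
    state = defaultdict(lambda: [0, 0])  # key -> [running sum, running count]
    for block in blocks:
        for row in block:
            acc = state[row[key_col]]
            acc[0] += row[value_col]
            acc[1] += 1
    return {k: {"sum": s, "count": c} for k, (s, c) in state.items()}


def read_blocks():
    """Stand-in for a block-by-block read from storage."""
    yield [{"city": "Oslo", "amount": 10}, {"city": "Rome", "amount": 5}]
    yield [{"city": "Oslo", "amount": 7}]


totals = aggregate_blocks(read_blocks(), "city", "amount")
```

Because the state is keyed by group rather than by row, its size depends on the number of distinct keys, not on the amount of data read.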

In summary, working datasets in ClickHouse are the in-memory data a query uses to perform operations such as sorting, merging, aggregation, joins, and deduplication. The types covered here are MergeSortedBlock, GroupBySortedBlock, Columns, AggregatingBlockInputStream, Join, and Distinct. ClickHouse picks the appropriate one for each query, and by keeping intermediate data in memory it reduces the number of disk I/O operations.

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv is currently the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker at open source software conferences globally.