What are Working Datasets in ClickHouse?

Introduction

In ClickHouse, a working dataset is the data held in memory while a query performs operations such as sorting and aggregation.

Working Dataset Types in ClickHouse

ClickHouse has several types of working dataset, each with its own characteristics and uses.

  1. MergeSortedBlock: Used for sorting and merging data. It holds data in memory in sorted order, so sorted blocks can be merged and then passed on to later steps such as aggregation.
  2. GroupBySortedBlock: Used for GROUP BY operations. It holds data in memory in sorted order, grouped by the specified columns.
  3. Columns: Each one stores the data of a single column. Filtering, aggregation, and sorting all operate on these per-column structures.
  4. AggregatingBlockInputStream: Used for aggregate operations. It keeps the aggregation state in memory and applies the aggregate functions before returning the result.
  5. Join: Used for join operations. It holds the data being joined in memory and performs the join before returning the result.
  6. Distinct: Used for DISTINCT operations. It keeps track in memory of the values already seen and removes duplicates before returning the result.
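To make the list above more concrete, here is a minimal Python sketch of the kind of in-memory state that aggregation, DISTINCT, and join operations rely on. It is a conceptual model only, not ClickHouse's implementation: the function names, the (key, value) row shape, and the sample blocks are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, int]  # hypothetical (key, value) row shape


def group_by_sum(blocks: Iterable[List[Row]]) -> Dict[str, int]:
    """Aggregate block by block; only the per-key running sums stay in memory."""
    state: Dict[str, int] = defaultdict(int)
    for block in blocks:
        for key, value in block:
            state[key] += value
    return dict(state)


def distinct_keys(blocks: Iterable[List[Row]]) -> List[str]:
    """Return each key once, in first-seen order, using a set of keys already seen."""
    seen = set()
    out: List[str] = []
    for block in blocks:
        for key, _ in block:
            if key not in seen:
                seen.add(key)
                out.append(key)
    return out


def hash_join(
    left_blocks: Iterable[List[Row]],
    right_rows: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, int, str]]:
    """Inner join: build a hash table from the right side, then probe it with the left."""
    table: Dict[str, List[str]] = defaultdict(list)
    for key, payload in right_rows:
        table[key].append(payload)
    for block in left_blocks:
        for key, value in block:
            # .get() avoids creating empty entries for keys with no match
            for payload in table.get(key, ()):
                yield key, value, payload


# Hypothetical sample data: two blocks of (key, value) rows.
blocks = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
sums = group_by_sum(blocks)
uniques = distinct_keys(blocks)
joined = list(hash_join(blocks, [("a", "x"), ("c", "y")]))
```

In each case the input is read once, block by block, and what stays in memory is a compact structure (running sums, a set of seen keys, or a hash table built from one side of the join) rather than the full input.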

ClickHouse chooses the appropriate working dataset automatically, based on the query being executed. Although these structures support filtering, sorting, and grouping, their main purpose is to speed up processing: keeping intermediate data in memory reduces the number of disk I/O operations.
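For the sorting case, a rough picture of why in-memory working data helps is that each block can be sorted once while it is in memory, and the sorted blocks can then be merged in a single pass. The Python sketch below illustrates that idea with the standard library's heapq.merge. It is an illustration under simplifying assumptions (plain integers, hypothetical function names), not a description of ClickHouse internals.

```python
import heapq
from typing import Iterable, Iterator, List


def sort_blocks(blocks: Iterable[List[int]]) -> List[List[int]]:
    """Sort each block independently while it is held in memory."""
    return [sorted(block) for block in blocks]


def merge_sorted_blocks(sorted_blocks: List[List[int]]) -> Iterator[int]:
    """Merge already-sorted blocks in one pass, yielding values in order."""
    return heapq.merge(*sorted_blocks)


# Hypothetical input: three unsorted blocks of values.
blocks = [[5, 1, 9], [4, 8], [7, 2, 3]]
result = list(merge_sorted_blocks(sort_blocks(blocks)))
```

The merge step only ever compares the current head of each sorted block, so no block has to be read again once it has been sorted.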

Conclusion

In summary, a working dataset in ClickHouse is the data held in memory to perform a specific operation such as sorting, merging, or aggregation. The types covered here are MergeSortedBlock, GroupBySortedBlock, Columns, AggregatingBlockInputStream, Join, and Distinct, each with its own characteristics and uses. ClickHouse chooses among them automatically based on the query, which speeds up processing by reducing the number of disk I/O operations.

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv is currently the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker at open source software conferences globally.