Why is ClickHouse so fast?

Unveiling the design choices behind the world's fastest real-time analytics DBMS

Introduction

Online Analytical Processing

Since Michael Stonebraker’s seminal 1986 paper that introduced POSTGRES (and consequently PostgreSQL) to the world, the relational database management system (RDBMS) has been synonymous with the word “database” and has indeed been the bedrock of innovation in this space. RDBMS, however, represent only the OLTP (online transactional processing) side of the database world: they are designed for transactional consistency and integrity, i.e., to ensure that data is never lost and is reliably modified by INSERT/UPDATE transactions. The corollary is that they aren’t optimized for the fast reads essential to data analytics and reporting of any kind: aggregations, groupings, sorts, rollups, and so on. Such online analytical processing (OLAP) workloads require their own category of database, custom-built for blazing fast reads. That is the category to which ClickHouse belongs, and the focus of this article.

Understanding Analytical Queries

Here’s the structure of a simple analytical query.

SELECT Column_a, SUM(Column_b) AS Column_b_sum, MEDIAN(Column_c) AS Column_c_median
FROM Table_1
WHERE Column_x >= XX AND Column_x <= YY
GROUP BY Column_a
ORDER BY Column_b_sum
LIMIT 1000

If you observe this analytical query carefully, you will notice the following properties:

  • Analytical queries first select column(s) from database table(s)
  • Analytical queries perform aggregations and statistical operations on the selected column(s)
  • Analytical queries use filter predicates like WHERE to select the requisite rows within those columns
  • Analytical queries perform groupings, aggregations, rollups, and sorts on the result
  • Additionally (not shown above), more complex analytical queries may also JOIN multiple tables on a common key

It is now perhaps evident why online analytical processing (OLAP) workloads, on datasets that tend to be in the petabytes, need a dramatically different database design than online transactional processing (OLTP) workloads, which solve for transactional integrity on terabyte-scale datasets.

Segmenting Query Latency

As evident from the aforementioned query, when an analytical query is executing, the duration of the execution can be logically broken down into two discrete steps:

  • Step 1: Load data from disk into RAM
  • Step 2: Process data in CPU(s) as per query execution plan

ClickHouse has been optimized for extreme performance on both these fronts, with an astute attention to low-level detail that makes it the fastest real-time analytics database on the planet. Through the rest of this blog, we will cover the specific features of ClickHouse that allow it to achieve this astounding speed.

Step 1: Load data from disk into RAM

Loading data from disk into RAM can be logically broken down into two discrete sub-steps.

  • Identify the precise data subset to be loaded into RAM
  • Load the identified data subset from disk to RAM

Identifying the requisite data subset and loading it from disk is often the 80/20 of query latency, and it is where traditional OLAP systems typically spend the majority of their time while executing analytical queries.

Consider a table with 1,000 columns and 100 billion rows, quite common in the real-time analytics world and a good example of a large dataset that is perfect for analytics in ClickHouse. That’s 100 trillion discrete data values in a single table, of which perhaps only a small fraction (typically <1%) is relevant to the query in question. Given how time-intensive loading data into RAM can be (not to mention wasteful, if significant chunks of the loaded data aren’t even used by the query), the “sniper rifle” ability of a database to identify the precise data subset to be loaded into RAM in the shortest possible time is mission-critical to speed. If the DBMS can quickly and accurately identify this data subset, query execution will be blazing fast. If it is slow to answer this critical question, or loads excessive data, precious milliseconds are wasted loading data into RAM.

This is where ClickHouse excels, with three features that set it apart from most incumbents.

Columnar Storage

Let’s consider the same table as above, with 1,000 columns and 100 billion rows, and suppose that only 10 columns are relevant to the query in question. In a row-based database system, each query would load all 1,000 columns from all the requisite rows, despite the fact that the values stored in 990 of those columns wouldn’t even be used in query processing. This is incredibly wasteful of precious disk I/O resources and the single biggest reason why transactional DBMS such as PostgreSQL, MySQL, and MS SQL Server perform poorly in the context of analytical workloads.

Contrast this to ClickHouse. A purely columnar DBMS (in fact, one of the world’s few truly columnar databases, storing values contiguously with no extra data alongside them), it reads only the subset of 10 columns openly declared in the SQL query, thus eliminating 99% of columns (990 of 1,000) from the disk-read consideration set. This is also known as “column pruning”. Analytical queries fundamentally access data in columnar patterns, and with a purely columnar storage paradigm, ClickHouse aligns itself closely to this access pattern. Read more here.
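
As a minimal sketch (table and column names are hypothetical), ClickHouse stores each column in its own file(s), so only the columns named in a query are ever read from disk:

CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    revenue    Float64
    -- ...imagine hundreds of further columns here
)
ENGINE = MergeTree
ORDER BY event_date;

-- Only the files for event_date and revenue are touched; every other
-- column is pruned before any disk I/O happens.
SELECT event_date, SUM(revenue) AS daily_revenue
FROM events
GROUP BY event_date;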

ClickHouse is a Columnar Store

Multiple Layers of Indexes

After narrowing the 1,000 columns down to the 10 query-relevant ones via columnar storage, ClickHouse next identifies the rows within these columns that are relevant to the query. It does this with the help of multiple layers of indexes.

Primary Key Index

An index in a database is, very simply, a pointer to data in a table, designed to enable efficient search. In a traditional relational database, the primary key is a column that uniquely identifies each record. In ClickHouse, the primary key instead defines the physical sort order of the table: data is sorted on disk by the primary key (which need not be unique), and a sparse primary index is built over that sorted data for easy access.

For instance, if the primary key is “date”, all of the data will be chronologically sorted by date as it is written. ClickHouse generates a sparse primary index (.idx) file over granules (blocks of typically 8,192 rows) in its granule-based storage paradigm, and the .bin files that store the actual column data are accessed via this sparse primary index file. The choice of primary key is critical: queries that filter on the primary key derive the most benefit on the latency front, so it should be carefully chosen based on a keen understanding of query patterns.

If a query filters by date, say the years 2005–2006, in a dataset spanning 2001–2020, and the primary key is date, ClickHouse can rapidly navigate to the granules whose primary-key (date) range covers 2005–2006 and eliminate disk reads for all the other years in the dataset. For this query, ClickHouse has eliminated disk reads for roughly 90% of all rows within the selected columns.
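
A minimal sketch of what this looks like in practice (table and column names are hypothetical); EXPLAIN with indexes = 1 shows how many parts and granules survive primary-index pruning:

CREATE TABLE readings
(
    date  Date,
    value Float64
)
ENGINE = MergeTree
ORDER BY date;  -- "date" becomes the primary key and on-disk sort order

-- Inspect how the primary index narrows the scan for a date range.
EXPLAIN indexes = 1
SELECT SUM(value)
FROM readings
WHERE date >= '2005-01-01' AND date <= '2006-12-31';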

Viewing this example in toto, ClickHouse has expertly navigated to the 0.1% of data values needed for query execution. Columnar storage enabled it to choose the 1% of columns that contain the requisite data, and the primary index enabled it to choose the 10% of rows in those 10 columns that are query-relevant. 99.9% of the table scan has consequently been avoided, with massively positive implications for query latency. More about indexes in ClickHouse here.

Secondary Key Index

When a query does not filter on the primary key, ClickHouse also enables the creation of secondary, or data-skipping, indexes on frequently accessed columns. For instance, if query patterns frequently involve the userID column, it can be given a secondary (data-skipping) index, and the set of granules that could contain matching rows can be rapidly identified while the rest are skipped.
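
A hedged sketch (index name, type, and granularity are illustrative; ClickHouse offers several skipping-index types, such as minmax, set, and bloom_filter):

-- Declare a data-skipping index on the frequently filtered column.
ALTER TABLE events
    ADD INDEX userid_idx user_id TYPE bloom_filter GRANULARITY 4;

-- Build the index for data already in the table.
ALTER TABLE events MATERIALIZE INDEX userid_idx;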

It’s fascinating that some of the world’s leading OLAP DBMS technologies, from Snowflake to BigQuery, do not natively support indexing. While they offer clustering, a technique based on adjacent principles, it tends to be slower, more cumbersome, and notably more expensive than sparse indexing. Such were perhaps the needs of the time: those technologies were built for a cloud-native, batch-processed era, while ClickHouse inhabits the world of hybrid-native, real-time analytics processing.

Note: Like most OLAP databases, ClickHouse also supports partitioning. Partitions improve performance for queries that filter on the partitioning key, because ClickHouse prunes to the matching partitions before selecting the parts and granules within them. Partitioning works in conjunction with indexes to find query-relevant data and load it into RAM with speed and precision.
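
A minimal sketch of partition pruning (table and column names are hypothetical):

CREATE TABLE logs
(
    event_time DateTime,
    message    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)  -- one partition per calendar month
ORDER BY event_time;

-- Only the partitions for January and February 2020 are even considered;
-- indexes then prune parts and granules within them.
SELECT count()
FROM logs
WHERE event_time >= '2020-01-01' AND event_time < '2020-03-01';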

Data Compression

A great advantage bestowed by the columnar storage paradigm is that homogeneous (similar) values are stored together in columns, versus the row-based storage paradigm, where heterogeneous values are stored together in rows. Think of a table with columns such as “date”, “name”, and “userID”: the data in each column will be of type date, string, and integer respectively, whereas a single row will contain all three data types together.

Homogeneity of data types creates the possibility for the application of various compression algorithms to reduce the size of the dataset in storage. This is precisely what ClickHouse does. To its credit, and as part of what sets it apart:

  • It supports dozens of compression algorithms and codecs, from LZ4 and ZSTD to delta and double-delta encoding to run-length encoding (RLE).
  • It enables each individual column to be stored with the compression codec that corresponds most closely to its data type (see the sketch below).

This yields an order of magnitude better compression than most incumbent OLAP DBMS which, though often columnar with data compression capabilities, haven’t treated compression as a first-class citizen with the low-level attention to detail that ClickHouse has.
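
A minimal sketch of per-column codecs (column names and codec choices are illustrative):

CREATE TABLE metrics
(
    ts    DateTime CODEC(Delta, ZSTD),       -- timestamps: delta-encode, then ZSTD
    value Float64  CODEC(Gorilla),           -- slowly changing floats: Gorilla
    label LowCardinality(String) CODEC(LZ4)  -- repetitive strings: dictionary + LZ4
)
ENGINE = MergeTree
ORDER BY ts;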

Data Compression in ClickHouse

The impact on storage costs is evident: data compressed on disk naturally costs less to store. The impact on query performance is less well understood. When data is loaded from disk into RAM, speed is inversely proportional to size: the smaller the data is on disk after ClickHouse’s advanced compression, the less time is spent reading it. Since ClickHouse decompresses data on the fly once it is in RAM, aggressive compression directly translates into lower disk-read latency.

Summary

To summarize this section: ClickHouse uses columnar storage to identify precisely the query-relevant columns, multiple layers of indexes to identify precisely the query-relevant rows, and advanced compression techniques on the columnar storage format to minimize the size of the data read from disk. Together, these three design choices make it blazing fast at minimizing the query latency stemming from disk reads, which is often the biggest thief of time, especially for large, I/O-intensive queries.

Step 2: Process data in CPU(s) as per query execution plan

When the disk read is complete and decompressed data is loaded into RAM, ready for processing, there are three key levers that drive performance:

  • How much data are you processing in one cycle?
  • How much can you parallelize query execution?
  • How much of past query execution can you leverage?

ClickHouse makes specific design choices that leave it highly optimized across all three of these levers. They are discussed in detail below.

Vectorized Query Execution

ClickHouse doesn’t just store data in columns; it also processes data as columns. Instead of processing data value by value, as is the wont of most incumbent OLAP DBMS, ClickHouse processes data in large arrays, potentially the entire length of the column loaded into RAM. Vectorized query execution reduces CPU cache-miss rates, utilizes the SIMD (single instruction, multiple data) capabilities of modern CPUs, reduces branching overheads, and minimizes data movement, saving precious milliseconds of query execution time.

Vectorized Query Execution in ClickHouse
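
There is no switch to turn this on; it is simply how the engine executes. One way to peek at the block-at-a-time model is EXPLAIN PIPELINE, which prints the processors a query flows through (the table is the hypothetical one from earlier):

EXPLAIN PIPELINE
SELECT event_date, SUM(revenue)
FROM events
GROUP BY event_date;

-- The output lists processors such as ExpressionTransform and
-- AggregatingTransform, each of which consumes whole blocks of column
-- values rather than individual rows.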

Massively Parallel & Scalable Processing 

MPP (Massively Parallel Processing) is a common feature of most OLAP systems today, but the implementation in ClickHouse is particularly interesting. Not only can a single large query be split across the CPU cores of a single node, but it can also leverage the CPU cores and disks of all the other shards in the cluster. This means that on a large cluster, even datasets with hundreds of billions of rows can be processed in seconds, since the achievable level of parallelism is often an order of magnitude higher than in alternatives.

Massively parallel processing in ClickHouse
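
Both levels of parallelism are visible in everyday SQL. A hedged sketch (the cluster name and tables are hypothetical):

-- Intra-node parallelism: cap the number of CPU cores one query may use.
SELECT count()
FROM events
SETTINGS max_threads = 16;

-- Inter-node parallelism: a Distributed table fans the query out to every
-- shard of the hypothetical 'analytics' cluster and merges the results.
CREATE TABLE events_all AS events
ENGINE = Distributed('analytics', currentDatabase(), 'events');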

Materialized Views & Caching

Materialized views are special insert-time triggers that store the result of a SELECT query, written as-is into a target table. For frequently accessed data subsets, materialized views are incredibly useful since they cut query execution time to a fraction by precomputing and storing only the data subsets relevant to frequently run queries. While there is a small storage overhead, disk reads become that much faster for frequent query patterns, resulting in blazing fast performance for the most common query sets.
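
A minimal sketch (names are illustrative): the view pre-aggregates revenue per day as rows are inserted, so dashboards can query the small target table instead of the raw events.

-- Populated automatically on every INSERT into events.
CREATE MATERIALIZED VIEW daily_revenue
ENGINE = SummingMergeTree
ORDER BY event_date
AS
SELECT event_date, SUM(revenue) AS revenue
FROM events
GROUP BY event_date;

-- Reads hit the tiny pre-aggregated table, not the raw one; re-aggregating
-- here is the standard pattern, since merges happen asynchronously.
SELECT event_date, SUM(revenue) AS revenue
FROM daily_revenue
GROUP BY event_date;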

In addition to this, ClickHouse supports multiple types of caches: from data cache types (read cache, write cache, compression cache, dictionary cache) to query cache, to filesystem cache, all of which leverage the power of recent query executions to reduce future query runtimes.
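
For instance, the query result cache (available in recent ClickHouse releases) can be enabled per query; this sketch assumes a recent version:

SELECT event_date, SUM(revenue)
FROM events
GROUP BY event_date
SETTINGS use_query_cache = 1;  -- repeat executions are served from the cache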

Attention to Low-level Details

While the aforementioned features summarize the key aspects of ClickHouse, what is particularly impressive is the long list of low-level optimizations that deliver speed in distinct and often obscure scenarios. A few examples:

  • ClickHouse has 10+ distinct storage engines for distinct use cases. While the vanilla MergeTree engine is great for a vast variety of workloads, there are engines such as ReplacingMergeTree, CollapsingMergeTree, SummingMergeTree, AggregatingMergeTree, VersionedCollapsingMergeTree, etc., each with its utility in specific scenarios (see the sketch after this list). At ChistaDATA Inc., for instance, we have pioneered the use of the log-structured merge-tree (LSM tree) with RocksDB as the persistent key-value store for enhanced WRITE performance and extreme data compaction in ingestion-heavy use cases, for which we see 100+ billion WRITEs per day at our largest customers.
  • GROUP BY is ubiquitous in analytical SQL, and ClickHouse’s implementation is particularly interesting: it chooses among 30+ hash-table data-structure implementations for every GROUP BY execution, based on the use case.
  • ClickHouse natively supports an adaptive JOIN algorithm, which chooses how to JOIN multiple tables at runtime: it prefers the hash-join algorithm and falls back to the merge-join algorithm when memory limits are exceeded.
  • Support for massive concurrency: each node, with the right configuration, can individually support hundreds of users. For comparison, Snowflake caps concurrency at 8 concurrent queries per virtual warehouse (the only way out is multi-cluster warehouses, which cost a fair bit).
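
An illustrative sketch of two of these knobs (table and column names are hypothetical): ReplacingMergeTree deduplicates rows that share a sort key at merge time, and the join_algorithm setting steers the JOIN strategy per query.

-- Keep only the latest version of each row, deduplicated on merges.
CREATE TABLE user_profiles
(
    user_id UInt64,
    name    String,
    version UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY user_id;

-- Let ClickHouse pick the join strategy (hash first, merge as fallback).
SELECT e.user_id, p.name
FROM events AS e
JOIN user_profiles AS p ON e.user_id = p.user_id
SETTINGS join_algorithm = 'auto';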

Conclusion

Viewed in toto, it is this combination of design for speed at every level of query execution that makes ClickHouse the fastest OLAP DBMS on the planet. We believe it represents the current state of the art in OLAP database management systems, one that will be significantly challenging, if not impossible, to better by an order of magnitude. This is the simple reason we at ChistaDATA have devoted the last decade of our lives to ClickHouse, and intend to continue this journey with ClickHouse for many decades to come.

To view ClickHouse performance benchmarks, please refer to the links below: