ClickHouse vs RocksDB for Real-time Analytics

Introduction

In this article, we will provide information about ClickHouse and RocksDB. Afterward, we will talk about the comparison, advantages, and disadvantages of these two databases, which are the RDBMS and the Key-Value database system.

What are ClickHouse and RocksDB?

Clickhouse is a fast open-source column-oriented OLAP database management system developed by Yandex for its Yandex. Metrica web analytics service, similar to Google Analytics. It’s built to handle trillions of rows and petabytes of data in mind and quickly perform analytical queries.

RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database, but our current focus is on embedded workloads.

RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory, and write-once workloads, and to be flexible to allow for innovation.

So how fast are these databases?

ClickHouse

ClickHouse was designed to be fast. Query execution performance has always been a top priority during the development process. Still, other important characteristics like user-friendliness, scalability, and security were also considered so that ClickHouse could become a real production system.

ClickHouse was initially built as a prototype to do just a single task well: to filter and aggregate data as fast as possible. That’s what needs to be done to build a typical analytical report, and that’s what a typical GROUP BY query does. ClickHouse team has made several high-level decisions that combined made achieving this task possible:

Column-oriented storage: Source data often contain hundreds or even thousands of columns, while a report can use just a few. The system must avoid reading unnecessary columns, or the most expensive disk read operations would be wasted.

Indexes: ClickHouse keeps data structures in memory that allow reading used columns and only necessary row ranges of those columns.

Data compression: Storing different values of the same column together often leads to better compression ratios (compared to row-oriented systems) because real data columns often have the same or not so many different values for neighboring rows. In addition to general-purpose compression, ClickHouse supports specialized codecs that can make data even more compact.

Vectorized query execution: ClickHouse stores data in columns and processes data in columns. It leads to better CPU cache utilization and allows for SIMD CPU instructions usage.

Scalability: ClickHouse can leverage all available CPU cores and disks to execute even a single query. Not only on a single server but on all CPU cores and disks of a cluster.

RocksDB

RocksDB benchmarked LevelDB and found that it was unsuitable for our server workloads. The benchmark results look awesome at first sight, but we quickly realized that those results were for a database whose size was smaller than the size of RAM on the test machine – where the entire database could fit in the OS page cache. When RocksDB performed the same benchmarks on a database at least 5 times larger than the main memory, the performance results were dismal.

By contrast, RocksDB published the RocksDB benchmark results for server-side workloads on Flash. RocksDB also measured the performance of LevelDB on these server-workload benchmarks and found that RocksDB solidly outperforms LevelDB for these IO-bound workloads. RocksDB found that LevelDB’s single-threaded compaction process was insufficient to drive server workloads. RocksDB frequent write-stalls with LevelDB that caused 99-percentile latency to be tremendously large. RocksDB found that n-m a file into the OS cache introduced performance bottlenecks for reads. We could not make LevelDB consume all the IOs offered by the underlying Flash storage.

What Is A Column-Oriented Database?

Lets say you have a data like this;

----------------------------------------------
| timestamp        | domain          | visit |
----------------------------------------------
| 2021-09-05 12:00 | chistadata.com  | 2220  |
| 2021-09-05 12:00 | medium.com      | 1300  |
| 2021-09-05 12:01 | cansayin.com    | 47    |
----------------------------------------------

When you’re storing the data in a row-oriented OLTP database like PostgreSQL & MySQL, the data will be logically stored like below:

rowX -> column1, column2, column3;example:

row1 -> 2022-08-04 12:00, chistadata.com, 2220;
row2 -> 2022-08-04 12:00, medium.com, 1300;
row3 -> 2022-08-04 12:01, cansayin.com, 47;

The data for each column in a row is written next to each other. This makes the data lookup for individual rows fast. The data update and deletion operations are also fast as you can update or delete rows quickly by theoretically remove that 1 line. But when you’re summing up a group of rows for example the number of visits for chistadata.com, the database needs to read each row one by one, get the relevant column (and discard the irrelevant columns), then only sum up the total. This is a waste of IO operations, and it is costly, thus reflecting the longer processing time for this query.

However, in column-oriented, the data will be stored like below:

columnX -> row1:id, row2:id, row3:idexample:

timestamp column -> 2022-08-04 12:00:001,2022-08-04 12:00:002,2022-08-04;
domain column -> chistadata.com,medium.com,cansayin.com;
visit column -> 2220:001,1300:002,47:003;

Notice that the data for each row in a column are stored side by side. If you are summing up the number of visits for chistadata.com site, the database first need to look for id for chistadata.com from domain column, fetch the visit data column for relevant id retrieved and finally sum them up. The database doesn’t need to run many expensive I/O operations to retrieve the whole row, as it only gets the relevant columns in the first place. This is the crucial reason why the column-oriented database is so robust for this query.

Advantages of Relational Database

Speed

Even though a relational database is poor in terms of performance, still its speed is considerably higher because of its ease and simplicity. And also various optimizations that is included in a relational database further increase its speed. So all the applications will run with appropriate speed when used in a relational database.

Security

Since there are several tables in a relational database, certain tables can be made to be confidential. These tables are protected with username and password so that only authorized users can access them. The users are only allowed to work on that specific table.

Simplicity

Compared to other types of network models, a relational database model is much simpler. It is free from query processing and complex structuring. As a result, it does not require any complex queries. A simple SQL query is sufficient enough for handling.

Accessibility

Unlike other types of databases, a relational database does not require any specific path for accessing the data. Even modifying data in the relevant column is made easy. So whatever the outcome is shown is appropriate to the user.

Accuracy

As mentioned earlier, a relational database uses primary and foreign keys to make the tables interrelated. Thus, all the data which is stored is non-repetitive. This means that the data does not duplicate. Therefore, the data stored can be guaranteed to be accurate.

Multi-User

Multiple users will be able to access a relational database at the same time. Even if the data is updated, the users can access them conveniently. Hence, the crashes happening from multi-access are possibly prevented.

Disadvantages of Relational Database

Cost

The underlying cost involved in a relational database is quite expensive. For setting up a relational database, there must be separate software that needs to be purchased. And professional technicians should be hired to maintain the system. All these can be costly, especially for businesses with a small budgets.

Performance

Always the performance of the relational database depends on the number of tables. If there are more tables, the response given to the queries will be slower. Additionally, more data presence not only slows down the machine, it eventually makes it complex to find information. Thus, a relational database is known to be a slower database.

Physical Storage

A relational database also requires tremendous amount of physical memory since it is with rows and columns. Each of the operations depend on separate physical storage. Only through proper optimization, the targeted applications can be made to have maximum physical memory.

Complexity

Although a relational database is free from complex structuring, occasionally it may become complex too. When the amount of data in a relational database increases, it eventually makes the system more complicated. Each and every data is been complex since the data is arranged using common characteristics.

Information Loss

Large organizations tends to use more number of number of database systems with more tables. These information can be used to be transferred from one system to another. This could pose a risk of data loss.

Structure Limitations

The fields that is present on a relational database is with limitations. Limitations in essence means that it cannot accommodate more information. Despite if more information are provided, it may lead to data loss. Therefore, it is necessary to describe the exact amount of data volume which the field will be given.

What Is A Key-Value Store Database ?

A key-value database, AKA key-value store, associates a value (which can be anything from a number or simple string to a complex object) with a key, which is used to keep track of the object. In its simplest form, a key-value store is like a dictionary/array/map object as it exists in most programming paradigms, but which is stored in a persistent way and managed by a Database Management System (DBMS).

Key-value databases use compact, efficient index structures to be able to quickly and reliably locate a value by its key, making them ideal for systems that need to be able to find and retrieve data in constant time. Redis, for instance, is a key-value database that is optimized for tracking relatively simple data structures (primitive types, lists, heaps, and maps) in a persistent database. By only supporting a limited number of value types, Redis is able to expose an extremely simple interface to querying and manipulating them, and when configured optimally is capable of high throughput.

Key-Value Store Advantages

It is worth pointing out that different database types exist to serve different purposes. This sometimes makes the choice of the right type of database to use obvious. While key-value databases may be limited in what they can do, they are often the right choice for the following reasons:

Simplicity. As mentioned above, key value databases are quite simple to use. The straightforward commands and the absence of data types make work easier for programmers. With this feature data can assume any type, or even multiple types, when needed.

Speed. This simplicity makes key value databases quick to respond, provided that the rest of the environment around it is well-built and optimized.

Scalability. This is a beloved advantage of NoSQL databases over relational databases in general, and key-value stores in particular. Unlike relational databases, which are only scalable vertically, key-value stores are also infinitely scalable horizontally.

Easy to move. The absence of a query language means that the database can be easily moved between different systems without having to change the architecture.

Reliability. Built-in redundancy comes in handy to cover for a lost storage node where duplicated data comes in place of what’s been lost.

Key-Value Store Disadvantages

Simplicity. The list of advantages and disadvantages demonstrates that everything is relative, and that what generally comes as an advantage can also be a disadvantage. This further proves that you have to consider your needs and options carefully before choosing a database to use. The fact that key-value stores are not complex also means that they are not refined. There is no language nor straightforward means that would allow you to query the database with anything else other than the key.

No query language. Without a unified query language to use, queries from one database may not be transportable into a different key-value database.

Values can’t be filtered. The database sees values as blobs so it cannot make much sense of what they contain. When there is a request placed, whole values are returned — rather than a specific piece of information — and when they get updated, the whole value needs to be updated.

Key-Value Database Use Cases

The advantages listed above naturally lend themselves to several popular use cases for key-value databases.

Web applications may store user session details and preference in a key-value store. All the information is accessible via user key, and key-value stores lend themselves to fast reads and writes.
Real-time recommendations and advertising are often powered by key-value stores because the stores can quickly access and present new recommendations or ads as a web visitor moves throughout a site.
On the technical side, key-value stores are commonly used for in-memory data caching to speed up applications by minimizing reads and writes to slower disk-based systems. Hazelcast is an example of a technology that provides an in-memory key-value store for fast data retrieval.

Name	ClickHouse	RocksDB
Description	Column-oriented Relational DBMS powering Yandex	Embeddable persistent key-value store optimized for fast storage (flash and RAM)
Primary database model	Relational DBMS	Key-value store
Secondary database models	Time Series DBMS
Website	clickhouse.tech	rocksdb.org
Technical documentation	https://clickhouse.com/docs/en/intro/	github.com/facebook/rocksdb/wiki
Developer	Clickhouse Inc.	Facebook, Inc.
Initial release	2016	2013
License	Open Source	Open Source
Implementation language	C++	C++
Server operating systems	C++, FreeBSD, macOS	Linux
Data scheme	yes	schema-free
XML support	no	no
Secondary indexes	no	no
APIs and other access methods	HTTP REST JDBC ODBC Proprietary protocol	C++ API Java API
Supported programming languages	C# C++ ElixirGoJavaJavaScript (Node.js)KotlinNimPerlPHPPythonRRubyScala	C C++ Go Java Perl Python Ruby

Conclusion

RBMS and Key-Value database systems have their own operating principles and architectures. Before working with a database, one of the first questions to ask is which database is the most suitable for your project? You can choose the most suitable database for you by asking different questions such as the budget of your project, the data to be used, how much the data can grow over time, your data processing principles, relationships with different databases or big data architectures, and by analyzing the projects you want to use, where the database has been used before.

To compare ClickHouse v/s other databases, read the following articles:

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services

Comparing ClickHouse vs RocksDB for Real-time Analytics