Introduction
In this article, we will provide information about ClickHouse and RocksDB. Afterward, we will talk about the comparison, advantages, and disadvantages of these two databases, which are the RDBMS and the Key-Value database system.
What are ClickHouse and RocksDB?
Clickhouse is a fast open-source column-oriented OLAP database management system developed by Yandex for its Yandex. Metrica web analytics service, similar to Google Analytics. It’s built to handle trillions of rows and petabytes of data in mind and quickly perform analytical queries.
RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database, but our current focus is on embedded workloads.
RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory, and write-once workloads, and to be flexible to allow for innovation.
So how fast are these databases?
ClickHouse
ClickHouse was designed to be fast. Query execution performance has always been a top priority during the development process. Still, other important characteristics like user-friendliness, scalability, and security were also considered so that ClickHouse could become a real production system.
ClickHouse was initially built as a prototype to do just a single task well: to filter and aggregate data as fast as possible. That’s what needs to be done to build a typical analytical report, and that’s what a typical GROUP BY query does. ClickHouse team has made several high-level decisions that combined made achieving this task possible:
Column-oriented storage: Source data often contain hundreds or even thousands of columns, while a report can use just a few. The system must avoid reading unnecessary columns, or the most expensive disk read operations would be wasted.
Indexes: ClickHouse keeps data structures in memory that allow reading used columns and only necessary row ranges of those columns.
Data compression: Storing different values of the same column together often leads to better compression ratios (compared to row-oriented systems) because real data columns often have the same or not so many different values for neighboring rows. In addition to general-purpose compression, ClickHouse supports specialized codecs that can make data even more compact.
Vectorized query execution: ClickHouse stores data in columns and processes data in columns. It leads to better CPU cache utilization and allows for SIMD CPU instructions usage.
Scalability: ClickHouse can leverage all available CPU cores and disks to execute even a single query. Not only on a single server but on all CPU cores and disks of a cluster.
RocksDB
RocksDB benchmarked LevelDB and found that it was unsuitable for our server workloads. The benchmark results look awesome at first sight, but we quickly realized that those results were for a database whose size was smaller than the size of RAM on the test machine – where the entire database could fit in the OS page cache. When RocksDB performed the same benchmarks on a database at least 5 times larger than the main memory, the performance results were dismal.
By contrast, RocksDB published the RocksDB benchmark results for server-side workloads on Flash. RocksDB also measured the performance of LevelDB on these server-workload benchmarks and found that RocksDB solidly outperforms LevelDB for these IO-bound workloads. RocksDB found that LevelDB’s single-threaded compaction process was insufficient to drive server workloads. RocksDB frequent write-stalls with LevelDB that caused 99-percentile latency to be tremendously large. RocksDB found that n-m a file into the OS cache introduced performance bottlenecks for reads. We could not make LevelDB consume all the IOs offered by the underlying Flash storage.
What Is A Column-Oriented Database?
Lets say you have a data like this;
---------------------------------------------- | timestamp | domain | visit | ---------------------------------------------- | 2021-09-05 12:00 | chistadata.com | 2220 | | 2021-09-05 12:00 | medium.com | 1300 | | 2021-09-05 12:01 | cansayin.com | 47 | ----------------------------------------------
When you’re storing the data in a row-oriented OLTP database like PostgreSQL & MySQL, the data will be logically stored like below:
rowX -> column1, column2, column3;example: row1 -> 2022-08-04 12:00, chistadata.com, 2220; row2 -> 2022-08-04 12:00, medium.com, 1300; row3 -> 2022-08-04 12:01, cansayin.com, 47;
The data for each column in a row is written next to each other. This makes the data lookup for individual rows fast. The data update and deletion operations are also fast as you can update or delete rows quickly by theoretically remove that 1 line. But when you’re summing up a group of rows for example the number of visits for chistadata.com
, the database needs to read each row one by one, get the relevant column (and discard the irrelevant columns), then only sum up the total. This is a waste of IO operations, and it is costly, thus reflecting the longer processing time for this query.
However, in column-oriented, the data will be stored like below:
columnX -> row1:id, row2:id, row3:idexample: timestamp column -> 2022-08-04 12:00:001,2022-08-04 12:00:002,2022-08-04; domain column -> chistadata.com,medium.com,cansayin.com; visit column -> 2220:001,1300:002,47:003;
Notice that the data for each row in a column are stored side by side. If you are summing up the number of visits for chistadata.com
site, the database first need to look for id
for chistadata.com
from domain
column, fetch the visit
data column for relevant id retrieved and finally sum them up. The database doesn’t need to run many expensive I/O operations to retrieve the whole row, as it only gets the relevant columns in the first place. This is the crucial reason why the column-oriented database is so robust for this query.
Advantages of Relational Database
Speed
Security
Simplicity
Accessibility
Accuracy
Multi-User
Disadvantages of Relational Database
Cost
Performance
Physical Storage
Complexity
Information Loss
Structure Limitations
What Is A Key-Value Store Database ?
A key-value database, AKA key-value store, associates a value (which can be anything from a number or simple string to a complex object) with a key, which is used to keep track of the object. In its simplest form, a key-value store is like a dictionary/array/map object as it exists in most programming paradigms, but which is stored in a persistent way and managed by a Database Management System (DBMS).
Key-value databases use compact, efficient index structures to be able to quickly and reliably locate a value by its key, making them ideal for systems that need to be able to find and retrieve data in constant time. Redis, for instance, is a key-value database that is optimized for tracking relatively simple data structures (primitive types, lists, heaps, and maps) in a persistent database. By only supporting a limited number of value types, Redis is able to expose an extremely simple interface to querying and manipulating them, and when configured optimally is capable of high throughput.
Key-Value Store Advantages
It is worth pointing out that different database types exist to serve different purposes. This sometimes makes the choice of the right type of database to use obvious. While key-value databases may be limited in what they can do, they are often the right choice for the following reasons:
Simplicity. As mentioned above, key value databases are quite simple to use. The straightforward commands and the absence of data types make work easier for programmers. With this feature data can assume any type, or even multiple types, when needed.
Speed. This simplicity makes key value databases quick to respond, provided that the rest of the environment around it is well-built and optimized.
Scalability. This is a beloved advantage of NoSQL databases over relational databases in general, and key-value stores in particular. Unlike relational databases, which are only scalable vertically, key-value stores are also infinitely scalable horizontally.
Easy to move. The absence of a query language means that the database can be easily moved between different systems without having to change the architecture.
Reliability. Built-in redundancy comes in handy to cover for a lost storage node where duplicated data comes in place of what’s been lost.
Key-Value Store Disadvantages
Simplicity. The list of advantages and disadvantages demonstrates that everything is relative, and that what generally comes as an advantage can also be a disadvantage. This further proves that you have to consider your needs and options carefully before choosing a database to use. The fact that key-value stores are not complex also means that they are not refined. There is no language nor straightforward means that would allow you to query the database with anything else other than the key.
No query language. Without a unified query language to use, queries from one database may not be transportable into a different key-value database.
Values can’t be filtered. The database sees values as blobs so it cannot make much sense of what they contain. When there is a request placed, whole values are returned — rather than a specific piece of information — and when they get updated, the whole value needs to be updated.
Key-Value Database Use Cases
The advantages listed above naturally lend themselves to several popular use cases for key-value databases.
- Web applications may store user session details and preference in a key-value store. All the information is accessible via user key, and key-value stores lend themselves to fast reads and writes.
- Real-time recommendations and advertising are often powered by key-value stores because the stores can quickly access and present new recommendations or ads as a web visitor moves throughout a site.
- On the technical side, key-value stores are commonly used for in-memory data caching to speed up applications by minimizing reads and writes to slower disk-based systems. Hazelcast is an example of a technology that provides an in-memory key-value store for fast data retrieval.
Name | ClickHouse | RocksDB |
Description | Column-oriented Relational DBMS powering Yandex | Embeddable persistent key-value store optimized for fast storage (flash and RAM) |
Primary database model | Relational DBMS | Key-value store |
Secondary database models | Time Series DBMS | |
Website | clickhouse.tech | rocksdb.org |
Technical documentation | https://clickhouse.com/docs/en/intro/ | github.com/facebook/rocksdb/wiki |
Developer | Clickhouse Inc. | Facebook, Inc. |
Initial release | 2016 | 2013 |
License | Open Source | Open Source |
Implementation language | C++ | C++ |
Server operating systems | C++, FreeBSD, macOS | Linux |
Data scheme | yes | schema-free |
XML support | no | no |
Secondary indexes | no | no |
APIs and other access methods | HTTP REST JDBC ODBC Proprietary protocol |
C++ API Java API |
Supported programming languages | C# C++ ElixirGoJavaJavaScript (Node.js)KotlinNimPerlPHPPythonRRubyScala |
C C++ Go Java Perl Python Ruby |
Conclusion
RBMS and Key-Value database systems have their own operating principles and architectures. Before working with a database, one of the first questions to ask is which database is the most suitable for your project? You can choose the most suitable database for you by asking different questions such as the budget of your project, the data to be used, how much the data can grow over time, your data processing principles, relationships with different databases or big data architectures, and by analyzing the projects you want to use, where the database has been used before.
To compare ClickHouse v/s other databases, read the following articles:
- 6 Reasons why ClickHouse is superior to Hadoop for Real-time Analytics
- Hadoop and Teradata vs ClickHouse for Real-time Analytics in Modern Banking
- ClickHouse v/s Hadoop for Real-time Analytics