Why we used ClickHouse for Real-Time Analytics and not something like MapReduce
Systems like MapReduce are distributed computing systems in which the reduce operation is based on distributed sorting. Distributed sorting is not an ideal way to reduce if the result of the operation and all intermediate results fit in the RAM of a single server, which is usually the case for online queries. In that case a hash table is a better way to perform the reduce step and avoids the performance bottleneck of sorting across the cluster. Most MapReduce implementations allow you to execute arbitrary code on a cluster, but OLAP systems are optimised to run a declarative query language, which is the most compelling reason for using ClickHouse in real-time analytics. The following are strong reasons for using ClickHouse over MapReduce:
- ClickHouse stores data in columns and processes it in chunks of columns (also known as vectorized query execution). This makes CPU cache utilization more efficient and allows the use of SIMD CPU instructions.
- The ClickHouse architecture is built for scale: it can use all available CPU cores and disks to execute a single query.
- ClickHouse maintains data structures in memory that let it read only the columns a query uses, and only the necessary row ranges of those columns, which makes good use of available system resources.
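
To make the earlier point about sorting versus hash tables concrete, here is a minimal Python sketch of the two ways to perform a reduce (a per-key sum) when everything fits in the memory of one machine. It is only an illustration of the idea, not ClickHouse's or any MapReduce framework's actual implementation; the `EVENTS` data and both function names are made up for this example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical sample of (page, hits) pairs; not taken from any real dataset.
EVENTS = [("home", 3), ("docs", 1), ("home", 2), ("blog", 5), ("docs", 4)]


def reduce_by_sorting(events):
    """Sort-based reduce: order rows by key, then sum each run of equal keys."""
    ordered = sorted(events, key=itemgetter(0))  # O(n log n) comparisons
    return {
        key: sum(hits for _, hits in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def reduce_by_hashing(events):
    """Hash-based reduce: one pass, accumulating into an in-memory table."""
    totals = defaultdict(int)
    for key, hits in events:  # O(n) expected time, no sorting step
        totals[key] += hits
    return dict(totals)


if __name__ == "__main__":
    print(reduce_by_sorting(EVENTS))
    print(reduce_by_hashing(EVENTS))
```

Both functions produce the same totals. The hash-based version skips the sort entirely, which is why it tends to be the better fit when the intermediate state fits in the RAM of a single server.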