Introducing ChistaDATA Fabric

Introduction

Specialization is a natural outcome, and indeed a telltale sign, of system evolution. It enables complex use cases to be solved deeply by standalone pieces of technology, achieving cost-efficiency and performance gains that would otherwise be hard to realize. However, a downside of specialization is an explosion in diversity, creating a highly heterogeneous ecosystem of specialized, “point” solutions that become cumbersome to manage at scale.

Fifty years on from their inception, database systems are well into their evolution towards deep specialization. The database technology stack at most enterprises today is highly complex and heterogeneous, with dozens of unique technologies powering specialized use cases. Below is a bird’s-eye view of the complexity and heterogeneity of modern database systems architectures.

Fig 1: The heterogeneous enterprise data analytics stack

The problem with heterogeneous data stacks

Complexity creates problems: most notably an inflated total cost of operations (TCO), cumbersome and resource-intensive systems management, and a lack of visibility and observability across a technology stack that takes on a life & growth of its own. Of this Pandora’s box of potential issues, we have observed six emergent problems in heterogeneous database systems in production.

01: Runaway expensive OLAP bills

Few companies today use a single OLAP store: it’s usually a combination of a data lake (Delta Lake), a warehouse (Snowflake, Redshift, BigQuery), query engines (Presto, Trino, Spark), legacy batch systems (Hadoop, Vertica), and all of the associated point tooling (BI, ETL, observability). While Databricks and Snowflake are pushing towards market consolidation, the situation on the ground is far from this vision.

Add to that the usage-based billing model especially in vogue among the closed-source, cloud-managed services of these incumbents, combined with increasingly democratized provisioning of, and access to, SQL query engines. The consequence is that a growing number of people in an organization (typically SQL newbies) can write growing quantities of “bad” (i.e., resource-inefficient) SQL against data query engines, creating a rising incidence of the “thousand dollar query” (“SELECT *”, no LIMIT, etc.) that chugs away for hours on end, producing no meaningful result other than stalling the system and swelling large, fast-growing, and hard-to-control analytics bills.

02: Lack of system observability, visibility & control

Heterogeneous database systems exist in effective silos. Each may provide its own control plane, cost-control mechanisms, and observability dashboards, but there is no single plane of control & visibility for the DB layer as a whole. This is a huge problem: no one in the company really has an answer to the questions “What is the performance of my data stack as a whole?” and “What is the TCO of my data stack?”. Individual stakeholders may have piecemeal answers, but rarely does anyone have the whole picture.

Additionally, the cost-efficiency & performance bestowed by specialization begin to diminish under the management complexity of heterogeneous systems. For instance, real-time analytics use cases that are too slow & expensive in batch-oriented systems, and are better served by real-time analytics engines, end up executing in whichever incumbent system the data happens to reside. Ease & inertia always triumph over cost & performance.

03: Vulnerable & fault-prone systems

Dozens of components imply hundreds of interconnections and, effectively, hundreds of potential points of failure. Two key consequences:

  • System resilience and fault-tolerance (and consequently high availability) are that much harder to achieve & maintain. How can one maintain the uptime of a system when one can barely observe it holistically?
  • The cyber-attack surface area of the data stack has increased massively, creating many new high-severity and hitherto unsolved attack vectors. How can one truly secure such a complex & dynamic system?

04: Rate-limited ingestion

As data volumes grow exponentially with the rise of machine data (time-series machine/sensor data, logs of various kinds, clickstream & web events), high-velocity data ingestion and real-time data streaming are the need of the hour. As our friends at Last9 write, handling the streaming & ingestion of millions of events per second is now commonplace in the largest enterprises. This is an ascendant trend: it will inevitably percolate down to every data property owner in the next decade with the advent of AI.

Incumbent tools, built for batch ingestion with unconstrained latency in an era of terabyte-scale datasets, are thoroughly unable to solve this ingestion challenge.

05: Bloated & slow RDBMS

Postgres, MySQL and their brethren, born decades ago, simply weren’t built to manage web scale as we know it today. Horizontally scaling a database system will only ever take you so far: at some point this resource-intensive approach gives way. Add to that the notorious use/abuse of traditional RDBMS for analytical reads, which continues to this day and is naturally terribly expensive & slow, given that row-based database systems simply aren’t built for analytical workloads.

To solve this, you either need expert consultative support from folks who live & breathe RDBMS (for instance, MinervaDB), or you need to create lean RDBMS infrastructure by archiving old data into a cost-efficient, read-optimized data store. More on this soon.

06: Rate-limited innovation

Perhaps the most concerning consequence. In a system with as many moving parts and interdependencies as the modern data stack, innovation, particularly at the highly sensitive & foundational database layer, is throttled by the fear of tampering with a complex system that is, frankly, held together with glue.

You just never know when a misplaced data movement might bring down something mission-critical. Not to mention how involved and painful full-fledged data migrations are, consuming many quarters of engineering resources and mental bandwidth; they are very much a last-ditch measure, taken only when there is no other recourse from cost or performance pain. Enterprise teams are rightly averse to “trying too much of the new stuff”. Net net: it is easier to negotiate higher discounts with your vendor every year than to risk downtime or the operational nightmare of a data migration.

Fig 2: 6 key problems in the heterogeneous enterprise data stack

The solution: ChistaDATA Fabric

In our minds, a number of these concerning problems could be reduced in count & intensity with a greater degree of communication & orchestration between the presently siloed elements of the data stack. An intelligent orchestration layer, serving as a plane of control & visibility above the disparate data stack, could have a striking impact on the performance, efficiency & security of the system as a whole. In other words, it would make your existing database investments more successful. This is what we set out to build with ChistaDATA Fabric.

ChistaDATA Fabric is a reverse proxy server that sits on top of heterogeneous database systems, acting as an effective “database gateway”. All queries to the database layer, and all results flowing into BI applications, ML models, and other destinations, pass through the Fabric. Its specific capabilities (discussed in detail below) enable it to route individual queries to appropriate destinations; intercept, inspect & modify queries; and aggregate query logs to provide end-to-end visibility and observability of system activity. By creating a plane of control, visibility, and orchestration above the database layer, it weaves the siloed data stack into a contiguous “fabric”, uniting hitherto disparate elements into a single coherent and inter-communicative entity.
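To make the gateway idea concrete, here is a minimal sketch of the core routing loop: classify each incoming statement as a read or a write, then forward it to the best-suited backend. This is an illustrative Python sketch only; the backend names and the classification rule are hypothetical, not Fabric’s actual implementation.

```python
import re

# Hypothetical backend labels; in a real deployment these would be
# connection pools, e.g. an RDBMS primary for writes and a
# ClickHouse cluster for analytical reads.
WRITE_BACKEND = "rdbms-primary"
READ_BACKEND = "clickhouse-analytics"

# Statements that mutate state go to the write-optimized system.
WRITE_VERBS = re.compile(r"^\s*(INSERT|UPDATE|DELETE|CREATE|ALTER|DROP)\b", re.IGNORECASE)

def route(sql: str) -> str:
    """Classify a statement and pick its destination backend."""
    if WRITE_VERBS.match(sql):
        return WRITE_BACKEND   # ACID writes stay on the RDBMS
    return READ_BACKEND        # analytical reads go to the OLAP store

print(route("INSERT INTO orders VALUES (1, 'widget')"))  # -> rdbms-primary
print(route("SELECT count(*) FROM orders"))              # -> clickhouse-analytics
```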

Fig 3: ChistaDATA Fabric technical architecture

Here are some of the capabilities of the ChistaDATA Fabric:

  • Read-write splitting: Identifies the nature of incoming queries and routes them to the appropriate read-optimized or write-optimized destinations across database systems.
  • Load balancing [1]: Balances incoming query traffic across the system using a variety of strategies such as round-robin, random, etc.
  • Secure communications: All communications are encrypted as needed via HTTPS/TCP-TLS, with SSL forwarding support.
  • Query logging & aggregation: Asynchronous logging of file system logs, database logs, network logs, etc., as well as remote aggregation of query logs from external systems, using an open source technology stack.
  • Multi-database protocol awareness: Can connect with any DBMS on the planet via its wire protocol. Currently supports MySQL, PostgreSQL, ClickHouse, and Snowflake, and the list is growing.
  • SQL blacklisting [2]: Configures organization-specific rules to stop the execution of “bad” or malicious SQL at the proxy itself, before it is forwarded to the database layer.
  • Open source: 100% GPL licensed, with a highly extensible, modular, plugin-based architecture for feature additions by open source communities.
  • Lightweight deployment: Adds just two servers (one active, one failover) for data stack orchestration, super light on compute & storage.

Here is how the ChistaDATA Fabric solves the six problems we identified with increasingly heterogeneous database systems.

Fig 4: ChistaDATA Fabric enmeshed in the enterprise data stack

01: OLAP TCO optimization

ChistaDATA Fabric uses ClickHouse, the world’s fastest OLAP DBMS, as a veritable “cache” for Snowflake, BigQuery, Redshift, and other CDWs. For analytical reads that take too long in Snowflake et al (i.e., poor performance), or for defined & frequent query patterns with latency constraints and high concurrency that constitute the majority of the data analytics bill (i.e., high cost per query), ClickHouse is often a performant alternative for such real-time analytics use cases. A small subset of frequently accessed data is replicated to ClickHouse (storage costs being minimal, thanks to ClickHouse’s trademark extreme data compaction). The Fabric then diverts towards ClickHouse the reads that were hitherto going to Snowflake; ClickHouse executes them in seconds at a fraction of the cost, reducing query traffic towards incumbent CDWs and optimizing spend on read compute nodes. Over time, customers can shift more workloads towards ClickHouse, wherever performance is provably superior, to optimize read compute spend further.

Fabric can also blacklist poorly authored SQL queries and prevent their execution. For instance, syntactically correct but resource-intensive queries such as ‘SELECT *’ and queries without LIMIT can be stopped at the proxy server itself, curtailing the incidence of the “thousand dollar query”. Enterprises can configure many such org-specific rules in the proxy themselves, with our assistance. We are also working on on-the-fly SQL optimization that rewrites poor SQL into more resource-efficient SQL for key database systems, which will further augment OLAP TCO optimization.
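For a flavor of the kind of rule an organization might configure, here is a minimal sketch of proxy-side blacklisting. The rule format and names are hypothetical, not Fabric’s configuration syntax; the two patterns correspond to the ‘SELECT *’ and missing-LIMIT examples above.

```python
import re

# Hypothetical org-specific rules: (name, compiled pattern) pairs.
BLACKLIST = [
    ("select-star", re.compile(r"SELECT\s+\*", re.IGNORECASE)),
    ("no-limit-scan", re.compile(r"^(?!.*\bLIMIT\b).*\bSELECT\b", re.IGNORECASE | re.DOTALL)),
]

def check(sql: str) -> str | None:
    """Return the name of the first violated rule, or None if the query passes."""
    for name, pattern in BLACKLIST:
        if pattern.search(sql):
            return name
    return None

assert check("SELECT * FROM events") == "select-star"
assert check("SELECT id FROM events") == "no-limit-scan"   # SELECT without LIMIT
assert check("SELECT id FROM events LIMIT 100") is None    # passes both rules
```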

Further, the Fabric can be configured to split reads & writes across the entire data stack, utilizing each element for the appropriate purpose. For instance, it can intercept queries and eliminate analytical reads going into the RDBMS, preserving it only for ACID writes and diverting the reads towards an appropriate OLAP DBMS. This enables optimal utilization of each database for its strengths and achieves high system efficiency.

02: Unified control plane for observability & visibility

As a “database gateway”, the proxy server intercepts every single query going into the database layer. Consequently, it can remotely aggregate query, network, file system, and database logs from the system itself as well as from cloud vendors (AWS CloudWatch, Azure Log Analytics, etc.), and create a comprehensive view of holistic system performance in Grafana dashboards, using ClickHouse / Prometheus as the logs & metrics store. This means we can have both granular and holistic views of the system in a single dashboard:

Granular views: For each individual analytical or transactional query, we can analyze cost & performance within a specific database system and determine in which system the query executes most price-performantly. This informs optimization and proxy rule configuration, so that analytical reads of a given pattern are split towards a particular database system.

Holistic views: Fabric provides a single pane of glass with a bird’s-eye view of cost & performance for all constituent database systems, enabling key performance metrics such as query failure rate, query latency, throughput, etc. to be monitored for each DBMS, as well as in aggregate, in a single dashboard. Spend on the database stack, for each individual DBMS and in aggregate, can also be centrally monitored: something hitherto impossible in a heterogeneous database stack.
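To illustrate the holistic view, here is a sketch of the kind of aggregation one might run over centralized query logs, assuming logs land in a hypothetical fabric_query_logs table in ClickHouse (table and column names are illustrative) and using the open source clickhouse-driver Python client.

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="fabric-logs.internal")  # hypothetical metrics store host

# Per-backend failure rate, p95 latency, and spend over the last 24 hours,
# assuming one row per proxied query in a hypothetical log table.
rows = client.execute("""
    SELECT
        backend,
        countIf(status = 'error') / count()  AS failure_rate,
        quantile(0.95)(latency_ms)           AS p95_latency_ms,
        sum(estimated_cost_usd)              AS spend_usd
    FROM fabric_query_logs
    WHERE event_time >= now() - INTERVAL 1 DAY
    GROUP BY backend
""")

for backend, failure_rate, p95, spend in rows:
    print(f"{backend}: {failure_rate:.2%} failed, p95 {p95:.0f} ms, ${spend:,.2f}/day")
```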

03: Secure & resilient systems

The proxy server sitting atop the database layer constitutes a secure HTTP endpoint server, with various HTTP handlers available as plugins. All communications between the proxy server & database layer are encrypted with HTTP(S)/TCP-TLS passthrough, termination, and SSL exchange, as per need. Since internal and external applications now invoke the Fabric HTTP endpoint, there is no longer any need to expose database clusters or share their credentials for access: a superb way to secure DB systems. A single proxy server gates and secures all communication between the DB layer and any external application or user.
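As an illustration of the endpoint pattern, and not of Fabric’s actual implementation, here is a minimal TLS-terminating HTTP gateway sketch using only Python’s standard library; the certificate paths and handler logic are placeholders.

```python
import ssl
from http.server import BaseHTTPRequestHandler, HTTPServer

class GatewayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # In a real gateway, the body (a SQL statement) would be authenticated,
        # inspected, and routed to a backend using credentials that only the
        # proxy holds; clients never see database credentials.
        length = int(self.headers.get("Content-Length", 0))
        sql = self.rfile.read(length).decode()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(f"received: {sql}".encode())

# TLS termination at the gateway; certificate paths are placeholders.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="gateway.crt", keyfile="gateway.key")

server = HTTPServer(("0.0.0.0", 8443), GatewayHandler)
server.socket = context.wrap_socket(server.socket, server_side=True)
server.serve_forever()
```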

Additionally, the SQL blacklisting capability of the proxy server intercepts, identifies, and stops the execution of potentially malicious queries and SQL injection attacks, with rules that can easily be configured in the proxy by enterprise security teams.

In terms of resilience, the Fabric load-balances and splits reads & writes across multiple database systems based on incoming traffic & utilization. Its integration with ZooKeeper for service discovery, together with the inbuilt system observability layer, helps it identify overloaded or failing nodes and divert traffic away from them towards performing nodes. Note also that the proxy server always has a failover server, so there is enough redundancy to ensure near-zero downtime for the Fabric itself. All this helps avert system downtime, boosting High Availability (HA) and database fault-tolerance.
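A minimal sketch of health-aware round-robin balancing, under the assumption that node health flags are fed in by service discovery and the observability layer (node names are hypothetical):

```python
import itertools

# Hypothetical node pool; in practice the health flags would come from
# service discovery (e.g. ZooKeeper) and the observability layer.
NODES = ["ch-node-1", "ch-node-2", "ch-node-3"]
healthy = {"ch-node-1": True, "ch-node-2": False, "ch-node-3": True}

_rr = itertools.cycle(NODES)

def next_node() -> str:
    """Round-robin over the pool, skipping nodes marked unhealthy."""
    for _ in range(len(NODES)):
        node = next(_rr)
        if healthy.get(node, False):
            return node
    raise RuntimeError("no healthy nodes available")

print([next_node() for _ in range(4)])
# -> ['ch-node-1', 'ch-node-3', 'ch-node-1', 'ch-node-3']  (ch-node-2 is skipped)
```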

04: High-velocity data ingestion

Production systems at some of the world’s largest data properties often need data stores capable of ingesting hundreds of billions of records a day. Using RocksDB, a persistent key-value store open sourced by Meta, as the LSM (log-structured merge) store, we have been able to achieve substantially higher WRITE throughput in ClickHouse. In fact, ChistaDATA’s largest customer in the security space achieves >150 billion WRITEs every 24 hours to their ClickHouse instance with this innovation. More on this here.
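To put that number in perspective, 150 billion WRITEs per 24 hours averages out to over 1.7 million WRITEs per second, sustained. A rate like that is only feasible with large batched inserts; below is a hedged sketch of the batching pattern using the clickhouse-driver Python client, with a toy event source and a hypothetical sensor_events table standing in for the real pipeline.

```python
import itertools
import random
import time

from clickhouse_driver import Client  # pip install clickhouse-driver

# 150e9 writes / 86,400 s/day ≈ 1.74 million writes per second, sustained.
# ClickHouse sustains such rates only with large batches, never row-at-a-time.
BATCH_SIZE = 100_000

client = Client(host="clickhouse.internal")  # hypothetical host

def event_stream():
    """Toy stand-in for a real upstream source (e.g. a Kafka consumer)."""
    for i in itertools.count():
        yield (int(time.time()), i % 1000, random.random())

batch = []
for event in event_stream():
    batch.append(event)
    if len(batch) >= BATCH_SIZE:
        # One INSERT per 100k rows; table and columns are hypothetical.
        client.execute(
            "INSERT INTO sensor_events (ts, device_id, value) VALUES", batch
        )
        batch.clear()
```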

05: Lean & performant RDBMS

A solution to a bloated & under-performing RDBMS is to use a high-compaction-factor OLAP DBMS as an archival store. ClickHouse performs quite well in this use case. Here’s how the archival process works.

Older transactional data in an RDBMS is often useful only for analytical reads. Historical data can therefore be migrated out of the RDBMS into an OLAP store like ClickHouse, which compresses it to a fraction of its original size by virtue of being truly columnar, with a bevy of compression algorithms and codecs at its disposal. This leaves a much leaner RDBMS without the weight of historical data, with lower TCO and higher WRITE performance.

All analytical reads on historical data can be split by the proxy server towards ClickHouse, reserving the RDBMS only for transactional WRITEs and simple reads. As indicated above, the Fabric enables the appropriate optimization of each database system based on its areas of strength.
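Here is a hedged sketch of the archival copy itself, assuming psycopg2 and clickhouse-driver clients, a hypothetical orders table with a created_at column, and a matching orders_archive table already created in ClickHouse; the runbook linked below covers the production-grade process.

```python
from datetime import datetime, timedelta

import psycopg2                        # pip install psycopg2-binary
from clickhouse_driver import Client   # pip install clickhouse-driver

CUTOFF = datetime.utcnow() - timedelta(days=365)  # archive anything older than a year

pg = psycopg2.connect("dbname=app user=app")      # hypothetical DSN
ch = Client(host="clickhouse.internal")           # hypothetical host

# A named (server-side) cursor streams rows instead of loading them all at once.
with pg, pg.cursor(name="archive_cursor") as cur:
    cur.execute(
        "SELECT id, customer_id, total, created_at FROM orders WHERE created_at < %s",
        (CUTOFF,),
    )
    while rows := cur.fetchmany(50_000):
        ch.execute(
            "INSERT INTO orders_archive (id, customer_id, total, created_at) VALUES",
            rows,
        )

# Only after the copy has been verified would the historical rows be deleted:
with pg, pg.cursor() as cur:
    cur.execute("DELETE FROM orders WHERE created_at < %s", (CUTOFF,))
```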

Here’s our runbook to archive data from PostgreSQL and MySQL to ClickHouse, leveraging the capabilities of Fabric.

06: Zero-disruption innovation & migration

The Fabric eliminates the need for complex database migrations involving drawn-out POCs, data sharing, and decision-making processes that typically consume enterprise bandwidth and mindshare for months, throttling innovation.

With Fabric, you can connect any new database into a highly secure & resilient production environment via its wire protocol, move a small subset of production data into the new DBMS, and run queries to measure live performance.

The Fabric observability plane helps you compare the performance of the same set of queries across multiple database systems side-by-side, enabling an informed choice of which DBMS is better in production for which use cases. With its orchestration capabilities, you can then leverage each system for its unique strengths to achieve optimal query performance at the system level.

Migrations now become simple and piecemeal: (1) test robustly for query cost & performance, (2) move larger datasets/tables only once performance in the new DBMS is satisfactory in production, (3) configure the Fabric to point specific analytical workloads towards the databases with optimal price-performance per use case.
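Step (1) can be as lightweight as timing the same statement against each candidate system and comparing medians. A minimal sketch, where the per-backend runners are placeholders for thin wrappers over the Fabric endpoint or direct client connections:

```python
import time

def benchmark(run_query, sql: str, repeats: int = 5) -> float:
    """Median wall-clock latency of `sql` via the given query runner."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        timings.append(time.perf_counter() - start)
    return sorted(timings)[len(timings) // 2]

# Hypothetical runners: each takes a SQL string and executes it against one
# backend. In practice these would call the Fabric endpoint with a routing
# hint, or use each DBMS's own client library.
runners = {
    "snowflake": lambda sql: None,   # placeholder for a Snowflake execution call
    "clickhouse": lambda sql: None,  # placeholder for a ClickHouse execution call
}

QUERY = "SELECT count(*) FROM events WHERE event_date >= '2024-01-01'"
for name, run in runners.items():
    print(f"{name}: {benchmark(run, QUERY) * 1000:.2f} ms median")
```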

Fabric therefore removes the rate limit on database-layer innovation by enabling real-time testing of DB performance in production with minimal effort, while making migrations piecemeal and entirely grounded in proof of system performance.

Conclusion

The heterogeneous database stack needs an intelligent orchestration & control plane to simplify highly complex and siloed data operations. ChistaDATA Fabric is a brave attempt at this problem: weaving the disparate elements of your data stack into a contiguous, interconnected system to optimize TCO across OLAP & OLTP DBs, provide unified observability & cost visibility, achieve higher security & resilience, and, most critically, improve the rate of innovation in the enterprise database layer. Its fundamental promise is to make your database investment more successful, sliding in seamlessly with minimal disruption to unlock higher performance and lower TCO.

We look forward to serving you with ChistaDATA Fabric. In upcoming posts, we will cover the key use cases of this solution, and how it can help you improve, observe, and optimize database price-performance in your organization. Stay tuned.

  1. In beta
  2. In beta