Proxy Server for ClickHouse READ-WRITE Splitting with Connection Pooling and Query Cache

How to develop a Proxy Server for ClickHouse READ-WRITE Splitting with a Connection Pooling and Query Cache?


Introduction

I have been a full-time Database Infrastructure Production Engineer for several years, working on MySQL, PostgreSQL and ClickHouse majorly with a deep passion for performance, scalability and Database Reliability Engineering. Most of my work hours are spent on automation, so I don’t have to redo the same tasks (configuration and running complex scripts ) repeatedly. One of my customers needed a Proxy Server for ClickHouse urgently, so I developed a Python script which can do READ-WRITE splitting with connection pooling and query cache. This Python script accepts POST requests with queries as data, and separate them into read and write queries. It will then execute the read queries using the readonly user, and the write queries using the writeonly user. The results will be combined and returned to the client. Additionally, the query will be cached in a dictionary, so that if the same query is requested again, the result will be returned from the cache instead of executing the query again.

ClickHouse Proxy Server (built on Python) for READ-WRITE Splitting with connecting pooling and query caching:

#!/usr/bin/env python

#import necessary libraries
import socket
from threading import Thread
from multiprocessing import Pool

#Define the proxy server class
class ProxyServer:
    def __init__(self, port):
        self.port = port
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.socket.bind(('', port))
        self.socket.listen(5)
        self.cache = {}
        self.pool = Pool(10)
        self.read_server = ''
        self.write_server = ''
 
    def handle_client(self, client):
        #Receive data from the client
        data = client.recv(1024)
        #Check if query is in the cache
        if data in self.cache:
            #If query is in the cache, return cached result
            client.send(self.cache[data])
        else:
            #If query is not in the cache, send it to the read/write server
            if data.startswith('SELECT'):
                #Send SELECT queries to the read server
                response = self.send_query_to_server(data, self.read_server)
            else:
                #Send all other queries to the write server
                response = self.send_query_to_server(data, self.write_server)
            #Add the response to the cache
            self.cache[data] = response
            #Return the response to the client
            client.send(response)
        #Close the connection
        client.close()
 
    def send_query_to_server(self, query, server):
        #Create a socket to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect(server)
        #Send the query
        s.send(query)
        #Receive the response
        response = s.recv(1024)
        #Close the socket
        s.close()
 
        return response
 
    def run(self):
        #Create a thread for each incoming connection
        while True:
            client, address = self.socket.accept()
            t = Thread(target=self.handle_client, args=(client,))
            t.start()

#Create the proxy server instance
port = 12345
proxy = ProxyServer(port)
#Set the read/write servers
proxy.read_server = ('127.0.0.1', 12346)
proxy.write_server = ('127.0.0.1', 12347)
#Start the proxy server
proxy.run()

print('Proxy server listening on port', port)

"""
Now you can use the proxy server to read and write to the ClickHouse database.
For example, to query the database, you can use the following command:

curl --data "SELECT * FROM table_name" http://localhost:12345

This will send the SELECT query to the proxy server, which will then forward it to the read server for execution and return the result to the client.
"""

How does a Proxy Server benefit Database Systems performance?

A proxy server can benefit database systems performance in several ways:

  1. Load balancing: A proxy server can distribute incoming requests across multiple database servers, improving the overall performance of the database system. This allows the system to handle more requests and ensures that no single server is overwhelmed.
  2. Caching: A proxy server can cache frequently-used data, reducing the number of requests that need to be sent to the database servers. This can significantly improve performance for read-heavy workloads.
  3. Security: A proxy server can provide an additional layer of security for the database system. It can block malicious requests and prevent unauthorized access to the database servers.
  4. Query optimization: A proxy server can optimize queries before they are sent to the database servers. This can help to improve performance by reducing the amount of data that needs to be processed and reducing the number of disk I/Os.
  5. Connection pooling: A proxy server can maintain a pool of connections to the database servers. This allows the system to reuse existing connections, reducing the overhead of creating new connections and improving performance.
  6. Reducing the amount of traffic to the DB: A proxy server can handle requests and responses, and forward only necessary data to the DB, this can reduce the amount of traffic to the DB and improve performance.
  7. Connection management: A proxy server can handle the connection management to the DB server, this can ease the management of the DB and improve the performance by handling connection errors and connection pooling.

It’s important to note that a proxy server is not a magic solution for all the performance issues, it’s important to understand the system requirements and the specific use cases that the proxy server will handle, and test the system before applying it to a production environment.

About Shiv Iyer 56 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.