How to Avoid the Linux OOM Killer in ClickHouse

Introduction

When a ClickHouse server stops unexpectedly, the cause is often an out-of-memory condition: ClickHouse started using too much memory, and Linux terminated it to avert system instability. The Linux mechanism responsible for this is the Out-of-Memory Killer, also known as the OOM killer. It is one of the most frequently encountered issues, and ClickHouse lists it among its ‘Deadly Sins’.

What is the OOM Killer?

The Out Of Memory (OOM) killer is a Linux kernel feature aimed at preventing system crashes caused by memory depletion. If the system exhausts its available memory and cannot allocate more, the kernel activates the OOM killer to choose and terminate one or more processes, freeing up memory so that the system can remain operational. The feature can be disabled, but doing so is not recommended. If ClickHouse consumes a significant share of system memory, it may be targeted by the OOM killer, and the ClickHouse process is then terminated.

Top Reasons for OOM Killer in ClickHouse

From ClickHouse’s point of view, an OOM kill shows up as a sudden halt, and the corresponding messages can be found in the system logs. ClickHouse may need a significant amount of RAM to perform regular operations, in particular for queries (sorting, aggregation and joins in RAM) and for storing caches, dictionaries, buffers, etc.

The most common reasons for the OOM killer to intervene are:

  1. Memory Consumption: ClickHouse can be memory intensive at times, especially when managing large datasets or handling complex queries. Insufficient memory resources in the system to handle the workload could result in the OOM killer terminating processes to reclaim memory.

  2. Misconfiguration: Improperly configured memory settings in ClickHouse or at the system level may cause excessive memory usage, leading to intervention by the OOM killer.

  3. Query Optimization: Inefficient queries or suboptimal data access patterns can result in high memory usage, especially when working with large datasets or conducting complex analytics. Improving query performance and optimizing indexing strategies can help address this issue. Examining EXPLAIN plans is helpful in these cases.

  4. Concurrent Queries: Running too many queries or large batch jobs at the same time can place a significant strain on the system’s memory resources, potentially leading to the activation of the OOM killer.

  5. Resource Competition: ClickHouse may contend with other processes or services on the same system for memory resources. If ClickHouse consumes an excessive amount of memory, it may prompt the OOM killer to ensure system stability.

How can we avoid the OOM Killer?

To avoid the OOM (Out of Memory) killer on ClickHouse and prevent system instability, you can take the following steps:

  • Adjust the ClickHouse configuration parameters:

Limit the memory available to the ClickHouse server by configuring the max_server_memory_usage (default: 0, meaning no explicit limit) and max_server_memory_usage_to_ram_ratio (default: 0.9) parameters in the ClickHouse server settings. With these defaults, ClickHouse may use up to 90% of the physical RAM of your server. Set them to lower values based on your system resources and workload requirements.
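How the two parameters combine can be sketched with a little arithmetic. The Python snippet below is a minimal illustration, not ClickHouse code. It assumes that an unset (0) max_server_memory_usage leaves the cap to the ratio alone, and that an explicit value is still capped by the ratio-derived limit. The function name and the 64 GiB host are hypothetical.

```python
GIB = 1024 ** 3


def effective_server_memory_limit(total_ram_bytes,
                                  max_server_memory_usage=0,
                                  ram_ratio=0.9):
    """Return the assumed effective server-wide memory cap in bytes.

    Assumption: 0 means "no explicit limit", so the cap is RAM * ratio;
    otherwise the smaller of the explicit limit and RAM * ratio applies.
    """
    if total_ram_bytes <= 0:
        raise ValueError("total_ram_bytes must be positive")
    if ram_ratio <= 0:
        raise ValueError("ram_ratio must be positive")

    ratio_limit = int(total_ram_bytes * ram_ratio)
    if max_server_memory_usage == 0:
        return ratio_limit
    return min(max_server_memory_usage, ratio_limit)


# Hypothetical 64 GiB host: compare the default ratio with a lower one
# that leaves more headroom for the OS and other services.
for ratio in (0.9, 0.75):
    limit = effective_server_memory_limit(64 * GIB, ram_ratio=ratio)
    print(f"ratio={ratio}: about {limit / GIB:.1f} GiB for ClickHouse")
```

Running the numbers like this before editing the server configuration makes it easier to see how much RAM is left for the operating system and for any other services on the host.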

  • Optimise queries:

Optimise queries to reduce memory usage. Use appropriate data types, limit result sets and avoid unnecessary joins or subqueries that can lead to excessive memory usage. Using fewer GROUP BY keys also reduces memory consumption.

  • Monitor system resources:

Regularly monitor system resources such as memory, CPU and disk usage to identify abnormal spikes. Implement monitoring tools and alerts to keep track of memory usage patterns.
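As one possible starting point for such monitoring, the sketch below parses text in the format of /proc/meminfo and flags high memory usage. It assumes a Linux host that exposes the MemTotal and MemAvailable fields. The function names, the sample figures and the 0.85 threshold are hypothetical choices for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_fraction(meminfo):
    """Fraction of RAM in use, based on MemTotal and MemAvailable."""
    return 1.0 - meminfo["MemAvailable"] / meminfo["MemTotal"]


def should_alert(meminfo, threshold=0.85):
    """True when the used fraction reaches the (hypothetical) threshold."""
    return memory_used_fraction(meminfo) >= threshold


# Illustrative sample; on a real host, read the text from /proc/meminfo.
SAMPLE = """MemTotal:       65536000 kB
MemFree:         1024000 kB
MemAvailable:    4096000 kB
"""

info = parse_meminfo(SAMPLE)
print(f"used fraction: {memory_used_fraction(info):.2f}")
print("alert" if should_alert(info) else "ok")
```

In practice a check like this would run periodically and feed whatever alerting system is already in place, so that memory pressure is noticed before the kernel has to act.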

  • Implement memory protection mechanisms:

Configure resource management tools such as oom_score_adj, cgroups or ulimit to control memory usage and prioritise critical processes.
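To illustrate the oom_score_adj mechanism, the sketch below reads and writes /proc/<pid>/oom_score_adj from Python. The kernel accepts values from -1000 to 1000; lower values make a process a less likely OOM target, and -1000 exempts it entirely. The helper names and the proc_root parameter are hypothetical and exist only to keep the example testable. Lowering the value normally requires elevated privileges.

```python
from pathlib import Path

OOM_SCORE_ADJ_MIN = -1000  # process is never selected by the OOM killer
OOM_SCORE_ADJ_MAX = 1000   # process is the preferred OOM victim


def _adj_path(pid, proc_root):
    return Path(proc_root) / str(pid) / "oom_score_adj"


def read_oom_score_adj(pid, proc_root="/proc"):
    """Return the current oom_score_adj of the given process."""
    return int(_adj_path(pid, proc_root).read_text().strip())


def write_oom_score_adj(pid, value, proc_root="/proc"):
    """Set oom_score_adj after checking it is within the kernel's range."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(
            f"oom_score_adj must be within "
            f"[{OOM_SCORE_ADJ_MIN}, {OOM_SCORE_ADJ_MAX}], got {value}"
        )
    _adj_path(pid, proc_root).write_text(f"{value}\n")


# Example (run with sufficient privileges; the PID is a placeholder):
#   write_oom_score_adj(12345, -500)
#   print(read_oom_score_adj(12345))
```

Making ClickHouse a less likely victim does not remove the memory shortage itself; the kernel will then pick another process instead, so this is best combined with the memory limits described above.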

  • Use swap space (as a last resort):

If necessary, consider configuring swap space on your system as a temporary solution to deal with memory overrun situations. However, the use of swap space can impact performance, so it should only be considered as a last resort.

By implementing these strategies, you can minimise the risk of the OOM killer killing ClickHouse processes and maintain system stability.

How can we investigate the OOM Killer Error?

You can use the ChistaDATA Inception System, which filters your ClickHouse errors and helps to identify the cause. The tool is described in detail in a dedicated blog post.
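Independently of any tooling, the kernel log (for example the output of dmesg or journalctl -k) records each OOM kill. The sketch below scans such log text for kills of ClickHouse processes. It assumes the "Out of memory: Killed process <pid> (<name>)" wording used by recent kernels, which varies between kernel versions. The sample lines, function name and field names are illustrative only.

```python
import re

# Matches kernel messages such as:
#   Out of memory: Killed process 4242 (clickhouse-serv) ... anon-rss:123kB ...
# The anon-rss part is optional because not every kernel prints it.
OOM_KILL_RE = re.compile(
    r"[Oo]ut of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r"(?:.*?anon-rss:(?P<anon_rss_kb>\d+)kB)?"
)


def find_oom_kills(log_text, name_prefix="clickhouse"):
    """Return OOM-kill records for processes whose name starts with name_prefix."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match and match.group("comm").startswith(name_prefix):
            rss = match.group("anon_rss_kb")
            kills.append({
                "pid": int(match.group("pid")),
                "comm": match.group("comm"),
                "anon_rss_kb": int(rss) if rss else None,
            })
    return kills


# Illustrative log excerpt, not taken from a real system.
SAMPLE_LOG = (
    "[1001.10] Out of memory: Killed process 4242 (clickhouse-serv) "
    "total-vm:50331648kB, anon-rss:31457280kB, file-rss:0kB\n"
    "[1002.20] Out of memory: Killed process 777 (mysqld) "
    "total-vm:1048576kB, anon-rss:524288kB, file-rss:0kB\n"
)

for kill in find_oom_kills(SAMPLE_LOG):
    print(kill)
```

The resident memory reported at kill time gives a first hint of how far ClickHouse had grown, which can then be compared against the configured server memory limit.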

Conclusion

By understanding ClickHouse’s architecture, leveraging system resources effectively, and following best practices, you can optimize ClickHouse’s performance and achieve efficient data processing and storage. Deploying ClickHouse on Linux environments provides additional flexibility and integration options, enabling seamless integration with Linux-based analytics platforms. By managing ClickHouse effectively on a Linux server and mitigating potential issues such as the Linux OOM killer, you can ensure optimal performance and reliability for your ClickHouse deployments.

About Ilkay
Ilkay has been administering databases, including NoSQL and RDBMS, for over 7 years. He is experienced in working with high-scale banking databases. He currently works at ChistaDATA Inc. as a Database Administrator.