Introduction
We commend the ClickHouse team for their ongoing dedication to providing monthly releases with fresh features, enhancements, and important bug fixes. At ChistaData, we conscientiously perform Quick Peek analyses on every release and create dedicated blog posts to ensure our readers stay well-informed.
Today, we are presenting our coverage of the September 2023 release of ClickHouse (version 23.9). Below, you will find the necessary links related to this release.
- Source Code – GitHub Link
- Installation Instructions – Official Documentation
- v23.9 Release Webinar – Recording
- v23.9 Release Webinar – Slides
Key features & improvements
The release happened on September 28th, 2023, and includes:
- 20 new features
- 19 performance optimizations
- 55 bug fixes
Let’s look at the main features and improvements one by one.
1. Backup Log
Track the status of backups via the new system.backup_log table:
clickhouse1 :) SELECT * FROM system.backup_log ORDER BY 1 DESC FORMAT Vertical

Query id: b5a1422d-d83c-4c74-b728-d0982af39ab9

Row 1:
──────
event_date:              2023-10-05
event_time_microseconds: 2023-10-05 05:56:18.055698
id:                      3472cd65-371b-476f-90a5-8e6d870fdb51
name:                    Disk('backups_disk', '1.zip')
status:                  BACKUP_FAILED
error:                   Code: 318. DB::Exception: The 'backups.allowed_disk' configuration parameter is not set, cannot use 'Disk' backup engine. (INVALID_CONFIG_PARAMETER) (version 23.9.1.1854 (official build))
start_time:              2023-10-05 05:56:18
end_time:                2023-10-05 05:56:18
num_files:               0
total_size:              0
num_entries:             0
uncompressed_size:       0
compressed_size:         0
files_read:              0
bytes_read:              0
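To populate this log, run a backup and then query the table. A minimal sketch, assuming a disk named 'backups_disk' has been allowed via the 'backups.allowed_disk' server setting (the missing setting is exactly what the error above complains about):

BACKUP TABLE orders TO Disk('backups_disk', 'orders_backup.zip');

-- Check the outcome of the most recent backup
SELECT name, status, error
FROM system.backup_log
ORDER BY event_time_microseconds DESC
LIMIT 1;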
2. DROP TABLE IF Empty
Worried about mistakenly dropping necessary tables? Check if they are empty before you drop them.
Best practice says you should TRUNCATE and then DROP to avoid accidental data loss. This works well with the new IF EMPTY clause.
clickhouse1 :) DROP TABLE IF EMPTY orders;

Query id: 21dc5fb7-7a13-45d2-bb18-777da021de0c

0 rows in set. Elapsed: 0.007 sec.

Received exception from server (version 23.9.1):
Code: 705. DB::Exception: Received from localhost:9000. DB::Exception: Table orders is not empty. (TABLE_NOT_EMPTY)
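Following the truncate-then-drop practice above, a minimal sketch using the same orders table:

-- Explicitly empty the table first; the drop then succeeds because the table is empty
TRUNCATE TABLE orders;
DROP TABLE IF EMPTY orders;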
3. SHOW FUNCTIONS
A new SHOW FUNCTIONS command lists all the functions available in the system.functions table.
clickhouse1 :) SHOW FUNCTIONS FORMAT Vertical

Query id: 3a44ce0f-e3c5-4949-b3f5-3b5f01ac4b44

Row 1:
──────
name:             JSON_VALUE
is_aggregate:     0
is_deterministic: 1
case_insensitive: 0
alias_to:
create_query:
origin:           System
description:
syntax:
arguments:
returned_value:
examples:
categories:

(output truncated)
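Per the ClickHouse documentation, SHOW FUNCTIONS also accepts a LIKE or ILIKE pattern, which is handy for narrowing down the long list; for example:

SHOW FUNCTIONS LIKE 'toDate%';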
4. INFILE Improvement
Reading files via INFILE is now faster: when a glob pattern matches multiple files, they are all processed in parallel.
clickhouse-client --query="INSERT INTO processed_order FROM INFILE 'load_*.csv' FORMAT CSV;"
5. New codec called GCD for compression
Added GCD, a.k.a. “greatest common divisor”, as a new data compression codec. The codec computes the GCD of all column values and divides each value by it. The GCD codec is a data-preparation codec (similar to Delta and DoubleDelta) and cannot be used stand-alone. It works with integer, decimal, and date/time types. A viable use case for the GCD codec is column values that change (increase/decrease) in multiples of the GCD, e.g., 24 – 28 – 16 – 24 – 8 – 24 (GCD = 4).
Suppose you use the Decimal type, but the scale is larger than needed; the stored values then share a large common divisor. More details here
Start by creating a table called "my_data" with a decimal column, using the MergeTree storage engine:

CREATE TABLE my_data (x Decimal(20, 10)) ENGINE = MergeTree ORDER BY ();

Insert one million random decimal values into the "my_data" table:

INSERT INTO my_data SELECT rand() FROM numbers(1000000);

Check the total storage size used by the "my_data" table before compression:

SELECT total_bytes FROM system.tables WHERE name = 'my_data';

Compression before: 9,516,649 bytes

Now, create a new table with the same data but apply the LZ4 compression codec to reduce storage size:

CREATE TABLE my_data (x Decimal(20, 10) CODEC(LZ4)) ENGINE = MergeTree ORDER BY ();

Insert the same one million random decimal values into the new "my_data" table:

INSERT INTO my_data SELECT rand() FROM numbers(1000000);

Check the total storage size used by the "my_data" table after applying LZ4 compression:

SELECT total_bytes FROM system.tables WHERE name = 'my_data';

Compression after (LZ4): 6,212,819 bytes
By using LZ4 compression, you’ve reduced the storage size from 9.5 MB to 6.2 MB, making your data storage more efficient. Keep in mind that other compression codecs like ZSTD can also be applied, each with its own impact on storage size (e.g., GCD + ZSTD: 4.1 MB).
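Because GCD cannot be used stand-alone, it is chained with a general-purpose codec. A minimal sketch of the GCD + ZSTD combination quoted above (the table name is illustrative):

CREATE TABLE my_data_gcd (x Decimal(20, 10) CODEC(GCD, ZSTD)) ENGINE = MergeTree ORDER BY ();

INSERT INTO my_data_gcd SELECT rand() FROM numbers(1000000);

-- Compare against the sizes measured above
SELECT total_bytes FROM system.tables WHERE name = 'my_data_gcd';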
6. Number to date and dateTime
Transforming numeric values into Date and DateTime values is a breeze in ClickHouse: version 23.9 adds functions that invert the existing toYYYYMMDD-style conversions.
Here are a few functions at your disposal:
- ‘YYYYMMDDToDate’
- ‘YYYYMMDDToDate32’
- ‘YYYYMMDDhhmmssToDateTime’
- ‘YYYYMMDDhhmmssToDateTime64’
With these functions, you can effortlessly convert numeric values to their corresponding date and dateTime representations. Here’s a delightful example:
SELECT YYYYMMDDToDate(20230102), YYYYMMDDhhmmssToDateTime(20230102030405)
The result is as follows:
2023-01-02 2023-01-02 03:04:05
It’s worth noting that we also have the inverse functions at our disposal, such as ‘toYYYYMM,’ ‘toYYYYMMDD,’ and ‘toYYYYMMDDhhmmss,’ further enhancing the flexibility of your data manipulation.
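To illustrate that these functions really are inverses, a quick round trip:

SELECT
    toYYYYMMDD(toDate('2023-01-02')) AS as_number,   -- 20230102
    YYYYMMDDToDate(20230102) AS back_to_date;        -- 2023-01-02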
7. Workload scheduling
In workload management, ClickHouse excels at handling many queries simultaneously. These queries often compete for shared resources, such as disk and network I/O. To keep everything fair and organized, ClickHouse lets you define scheduling rules that decide when each query may access a resource, with a separate scheduling hierarchy per resource.
Let’s imagine a system with two types of work: “production” and “development.”
Here’s how it works: you tell ClickHouse how to prioritize these two types of work. For example, you can control how the system reads data from an external source like S3, making sure one workload doesn’t monopolize the available bandwidth.
Now, let’s look at some example queries in our system:
When you run a query like SELECT count() FROM data_table WHERE value = 42 SETTINGS workload = 'production', ClickHouse knows it’s for the “production” work.
And when you run a query like SELECT count() FROM data_table WHERE value = 13 SETTINGS workload = 'development', ClickHouse knows it’s for the “development” work.
By doing this, ClickHouse can make sure both “production” and “development” work are balanced and don’t get in each other’s way, even when they use the same resources. This helps manage and share the computer’s resources effectively.
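The scheduling hierarchy itself is declared in the server configuration. The sketch below is adapted from the ClickHouse workload-scheduling documentation, so treat it as illustrative: it caps in-flight read requests, then shares the remaining capacity fairly, weighting 'production' three times higher than 'development' (the disk also needs a matching <read_resource> entry in its storage configuration):

<clickhouse>
    <resources>
        <network_read>
            <!-- Cap concurrent read requests, then split fairly between workloads -->
            <node path="/"><type>inflight_limit</type><max_requests>100</max_requests></node>
            <node path="/fair"><type>fair</type></node>
            <node path="/fair/prod"><type>fifo</type><weight>3</weight></node>
            <node path="/fair/dev"><type>fifo</type></node>
        </network_read>
    </resources>
    <workload_classifiers>
        <!-- Map the SETTINGS workload = '...' values to leaves of the hierarchy -->
        <production><network_read>/fair/prod</network_read></production>
        <development><network_read>/fair/dev</network_read></development>
    </workload_classifiers>
</clickhouse>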
For more details, take a look here
8. Autodetect JSON/JSONEachRow
The integration between JSON and JSONEachRow is seamless and powerful. When you parse a file such as ‘data.json’, ClickHouse automatically detects the format.
What’s even more impressive is that *.json files are automatically disambiguated, distinguishing between the full JSON format with both data and metadata fields and the newline-delimited JSON format, also known as JSONLines or JSONEachRow. This intelligent handling makes working with JSON data a breeze.
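For example, with clickhouse-local you can query a JSON file without naming a format at all; the file() table function infers it (assuming a local data.json file):

clickhouse-local --query "SELECT * FROM file('data.json')"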
9. Long column Name
When working with autogenerated columns, for example from BI tools and automation, column names can become very long; these are now supported. To make the experience even smoother, enable the ‘replace_long_file_name_to_hash’ setting by setting it to 1. This is expected to become the default in future versions, so you only need to enable it explicitly in the current one.
These autogenerated columns are particularly useful in projections, especially when creating them using complex expressions. If you’re making alterations to a table with a projection, the process becomes even more streamlined and efficient.
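A minimal sketch of enabling the setting at table creation time (the table and column names are illustrative):

CREATE TABLE bi_export
(
    very_long_autogenerated_column_name_produced_by_a_bi_tool UInt64
)
ENGINE = MergeTree
ORDER BY ()
SETTINGS replace_long_file_name_to_hash = 1;
-- On-disk files for overly long column names are stored under a hash instead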
10. New decodeHTMLComponent function
Processing HTML with ClickHouse has become even more exciting with the introduction of the new ‘decodeHTMLComponent()’ function.
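The function decodes HTML entities in a string; for example:

SELECT decodeHTMLComponent('&lt;p&gt;Hello &amp; welcome&lt;/p&gt;') AS decoded;
-- decoded: <p>Hello & welcome</p>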
11. Non-Constant Time zones
Time zone conversion functions can now take a non-constant time zone argument, so you can convert each record using a time zone stored in your data. The ‘system.time_zones’ table offers a comprehensive list of the supported time zones.
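A minimal sketch of a per-row conversion, formatting the current time in several zones taken straight from system.time_zones (the release webinar showed a similar example with toString):

SELECT time_zone, toString(now(), time_zone) AS local_time
FROM system.time_zones
LIMIT 5;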
Other improvements
1. Security Features
- Temporary user credentials can be created with the VALID UNTIL clause (link); a sketch follows this list
- SSH keys can be used for authentication by registering a public key in ClickHouse, here
- SSO for AWS S3 – clickhouse-local reads temporary SSO credentials according to AWS_PROFILE
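A minimal sketch of the VALID UNTIL clause (the user name, password, and date are illustrative):

-- Credentials stop working after the given timestamp
CREATE USER temp_analyst IDENTIFIED BY 'secret' VALID UNTIL '2023-12-31 23:59:59';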
2. Performance improvements
- Parallel reading FROM INFILE – all matching files are processed in parallel
- Optimization for SELECT FINAL – if the data is already merged into a single part, reading the ORDER BY columns is skipped
3. Integrations
- Kafka Connect is GA – the ClickHouse sink connector for Kafka
- Power BI – proper type mapping between ClickHouse and Power BI
- Tableau – Tableau Online via the MySQL protocol #54115 (Serge Klochkov)
- Updates for the Python and Node.js clients
4. Bug Fixes
Some interesting bug fixes include:
- Resolved an issue with REPLACE/MOVE PARTITION under the experimental zero-copy replication feature #54193 (Alexander Tokmakov).
- Fixed throttling for BACKUP operations to and from S3 when the native copy function is not used, along with improvements in various other areas #53336 (Azat Khuzhin).
- Improved summary reporting when working with parallel replicas while applying a LIMIT #53050 (Raúl Marín).
- Resolved issues related to input-output (IO) throttling when copying entire directories #53338 (Azat Khuzhin).
- Fixed a problem with nullable primary keys when using the FINAL modifier #54164 (Amos Bird).
- Corrected unexpected errors in the “system.errors” after a join operation #54306 (vdimir).
- Fixed the filtering of parts with an indexHint when the new analyzer is not used #54449 (Azat Khuzhin).
Conclusion
ClickHouse 23.9 brings exciting new features that enhance its high-performance data analytics capabilities. Open-source and constantly innovating, ClickHouse remains a preferred choice for organizations seeking rapid, scalable, and feature-rich analytics. Upgrade to unlock its full potential for your workflows. For further details, please see the official ClickHouse changelog here.
Check out our coverage of previous ClickHouse releases.
References
- https://clickhouse.com/docs/en/whats-new/changelog#clickhouse-release-239-2023-09-28
- https://presentations.clickhouse.com/release_23.9/