Introduction
We commend the ClickHouse team for their ongoing dedication to providing monthly releases with fresh features, enhancements, and important bug fixes. At ChistaData, we conscientiously perform Quick Peek analyses on every release and create dedicated blog posts to ensure our readers stay well-informed.
Today, we are presenting our coverage of the September 2023 release of ClickHouse (version 23.9). Below, you will find the necessary links related to this release.
- Source Code – GitHub Link
- Installation Instructions – Official Documentation
- v23.9 Release Webinar – Recording
- v23.9 Release Webinar – Slides
Key features & improvements
The release happened on September 28th, 2023, and includes:
- 20 new features
- 19 performance optimizations
- 55 bug fixes
Let’s look at the main features and improvements one by one.
1. Backup Log
Track the status of backups via the new system.backup_log table:
clickhouse1 :) SELECT * FROM system.backup_log ORDER BY 1 DESC FORMAT Vertical

Query id: b5a1422d-d83c-4c74-b728-d0982af39ab9

Row 1:
──────
event_date:              2023-10-05
event_time_microseconds: 2023-10-05 05:56:18.055698
id:                      3472cd65-371b-476f-90a5-8e6d870fdb51
name:                    Disk('backups_disk', '1.zip')
status:                  BACKUP_FAILED
error:                   Code: 318. DB::Exception: The 'backups.allowed_disk' configuration parameter is not set, cannot use 'Disk' backup engine. (INVALID_CONFIG_PARAMETER) (version 23.9.1.1854 (official build))
start_time:              2023-10-05 05:56:18
end_time:                2023-10-05 05:56:18
num_files:               0
total_size:              0
num_entries:             0
uncompressed_size:       0
compressed_size:         0
files_read:              0
bytes_read:              0
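To populate this log, run a backup and then query the table. A minimal sketch, assuming a disk named 'backups_disk' has been allowed via the 'backups.allowed_disk' server setting (the missing setting is exactly what the error above complains about):

BACKUP TABLE orders TO Disk('backups_disk', 'orders_backup.zip');

-- Check the outcome of the most recent backup
SELECT name, status, error
FROM system.backup_log
ORDER BY event_time_microseconds DESC
LIMIT 1;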
2. DROP TABLE IF Empty
Worried about mistakenly dropping necessary tables? Check if they are empty before you drop them.
Best practice says you should TRUNCATE and then DROP to avoid accidental data loss. This works well with the new IF EMPTY clause.
clickhouse1 :) DROP TABLE IF EMPTY orders;

Query id: 21dc5fb7-7a13-45d2-bb18-777da021de0c

0 rows in set. Elapsed: 0.007 sec.

Received exception from server (version 23.9.1):
Code: 705. DB::Exception: Received from localhost:9000. DB::Exception: Table orders is not empty. (TABLE_NOT_EMPTY)
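Following the truncate-then-drop practice above, a minimal sketch using the same orders table:

-- Explicitly empty the table first; the drop then succeeds because the table is empty
TRUNCATE TABLE orders;
DROP TABLE IF EMPTY orders;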
3. SHOW FUNCTIONS
A new SHOW FUNCTIONS command lists all the functions available in the system.functions table.
clickhouse1 :) SHOW FUNCTIONS FORMAT Vertical

Query id: 3a44ce0f-e3c5-4949-b3f5-3b5f01ac4b44

Row 1:
──────
name:             JSON_VALUE
is_aggregate:     0
is_deterministic: 1
case_insensitive: 0
alias_to:
create_query:
origin:           System
description:
syntax:
arguments:
returned_value:
examples:
categories:

(output truncated)
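Per the ClickHouse documentation, SHOW FUNCTIONS also accepts a LIKE or ILIKE pattern, which is handy for narrowing down the long list; for example:

SHOW FUNCTIONS LIKE 'toDate%';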
4. INFILE Improvement
Reading files via INFILE is now faster: when a glob pattern matches multiple files, they are all processed in parallel.
clickhouse-client --query="INSERT INTO processed_order FROM INFILE 'load_*.csv' FORMAT CSV;"
5. New codec called GCD for compression
Added GCD, a.k.a. “greatest common divisor”, as a new data compression codec. The codec computes the GCD of all column values and divides each value by it. The GCD codec is a data-preparation codec (similar to Delta and DoubleDelta) and cannot be used stand-alone. It works with integer, decimal, and date/time types. A viable use case for the GCD codec is column values that change (increase/decrease) in multiples of the GCD, e.g., 24 – 28 – 16 – 24 – 8 – 24 (GCD = 4).
Suppose you use the Decimal type, but the scale is larger than needed; the stored values then share a large common divisor. More details here
Start by creating a table called "my_data" with a decimal column, using the MergeTree storage engine:

CREATE TABLE my_data (x Decimal(20, 10)) ENGINE = MergeTree ORDER BY ();

Insert one million random decimal values into the "my_data" table:

INSERT INTO my_data SELECT rand() FROM numbers(1000000);

Check the total storage size used by the "my_data" table before compression:

SELECT total_bytes FROM system.tables WHERE name = 'my_data';

Compression before: 9,516,649 bytes

Now, create a new table with the same data but apply the LZ4 compression codec to reduce storage size:

CREATE TABLE my_data (x Decimal(20, 10) CODEC(LZ4)) ENGINE = MergeTree ORDER BY ();

Insert the same one million random decimal values into the new "my_data" table:

INSERT INTO my_data SELECT rand() FROM numbers(1000000);

Check the total storage size used by the "my_data" table after applying LZ4 compression:

SELECT total_bytes FROM system.tables WHERE name = 'my_data';

Compression after (LZ4): 6,212,819 bytes
By using LZ4 compression, you’ve reduced the storage size from 9.5 MB to 6.2 MB, making your data storage more efficient. Keep in mind that other compression codecs like ZSTD can also be applied, each with its own impact on storage size (e.g., GCD + ZSTD: 4.1 MB).
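Because GCD cannot be used stand-alone, it is chained with a general-purpose codec. A minimal sketch of the GCD + ZSTD combination quoted above (the table name is illustrative):

CREATE TABLE my_data_gcd (x Decimal(20, 10) CODEC(GCD, ZSTD)) ENGINE = MergeTree ORDER BY ();

INSERT INTO my_data_gcd SELECT rand() FROM numbers(1000000);

-- Compare against the sizes measured above
SELECT total_bytes FROM system.tables WHERE name = 'my_data_gcd';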
6. Number to date and dateTime
Transforming numeric values into Date and DateTime values is a breeze in ClickHouse: version 23.9 adds functions that invert the existing toYYYYMMDD-style conversions.
Here are a few functions at your disposal:
- ‘YYYYMMDDToDate’
- ‘YYYYMMDDToDate32’
- ‘YYYYMMDDhhmmssToDateTime’
- ‘YYYYMMDDhhmmssToDateTime64’
With these functions, you can effortlessly convert numeric values to their corresponding date and dateTime representations. Here’s a delightful example:
SELECT YYYYMMDDToDate(20230102), YYYYMMDDhhmmssToDateTime(20230102030405)
The result is as follows:
2023-01-02 2023-01-02 03:04:05
It’s worth noting that we also have the inverse functions at our disposal, such as ‘toYYYYMM,’ ‘toYYYYMMDD,’ and ‘toYYYYMMDDhhmmss,’ further enhancing the flexibility of your data manipulation.
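To illustrate that these functions really are inverses, a quick round trip:

SELECT
    toYYYYMMDD(toDate('2023-01-02')) AS as_number,   -- 20230102
    YYYYMMDDToDate(20230102) AS back_to_date;        -- 2023-01-02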
7. Workload scheduling
In workload management, ClickHouse excels at handling many queries simultaneously. These queries often compete for shared resources, such as disk and network I/O. To keep everything fair and organized, ClickHouse lets you define scheduling rules that decide when each query may access a resource, with a separate scheduling hierarchy per resource.
Let’s imagine a system with two types of work: “production” and “development.”
Here’s how it works: you tell ClickHouse how to prioritize these two types of work. For example, you can control how the system reads data from an external source like S3, making sure one workload doesn’t monopolize the available bandwidth.
Now, let’s look at some example queries in our system:
When you run a query like SELECT count() FROM data_table WHERE value = 42 SETTINGS workload = 'production', ClickHouse knows it’s for the “production” work.
And when you run a query like SELECT count() FROM data_table WHERE value = 13 SETTINGS workload = 'development', ClickHouse knows it’s for the “development” work.
By doing this, ClickHouse can make sure both “production” and “development” work are balanced and don’t get in each other’s way, even when they use the same resources. This helps manage and share the computer’s resources effectively.
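The scheduling hierarchy itself is declared in the server configuration. The sketch below is adapted from the ClickHouse workload-scheduling documentation, so treat it as illustrative: it caps in-flight read requests, then shares the remaining capacity fairly, weighting 'production' three times higher than 'development' (the disk also needs a matching <read_resource> entry in its storage configuration):

<clickhouse>
    <resources>
        <network_read>
            <!-- Cap concurrent read requests, then split fairly between workloads -->
            <node path="/"><type>inflight_limit</type><max_requests>100</max_requests></node>
            <node path="/fair"><type>fair</type></node>
            <node path="/fair/prod"><type>fifo</type><weight>3</weight></node>
            <node path="/fair/dev"><type>fifo</type></node>
        </network_read>
    </resources>
    <workload_classifiers>
        <!-- Map the SETTINGS workload = '...' values to leaves of the hierarchy -->
        <production><network_read>/fair/prod</network_read></production>
        <development><network_read>/fair/dev</network_read></development>
    </workload_classifiers>
</clickhouse>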
For more details, take a look here
8. Autodetect JSON/JSONEachRow
The integration between JSON and JSONEachRow is seamless and powerful. When you parse a file such as ‘data.json’, ClickHouse automatically detects the format.
What’s even more impressive is that *.json files are automatically disambiguated, distinguishing between the full JSON format with both data and metadata fields and the newline-delimited JSON format, also known as JSONLines or JSONEachRow. This intelligent handling makes working with JSON data a breeze.
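For example, with clickhouse-local you can query a JSON file without naming a format at all; the file() table function infers it (assuming a local data.json file):

clickhouse-local --query "SELECT * FROM file('data.json')"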
9. Long column Name
When working with autogenerated columns, for example from BI tools and automation, column names can become very long; these are now supported. To make the experience even smoother, enable the ‘replace_long_file_name_to_hash’ setting by setting it to 1. This is expected to become the default in future versions, so you only need to enable it explicitly in the current one.
These autogenerated columns are particularly useful in projections, especially when creating them using complex expressions. If you’re making alterations to a table with a projection, the process becomes even more streamlined and efficient.
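A minimal sketch of enabling the setting at table creation time (the table and column names are illustrative):

CREATE TABLE bi_export
(
    very_long_autogenerated_column_name_produced_by_a_bi_tool UInt64
)
ENGINE = MergeTree
ORDER BY ()
SETTINGS replace_long_file_name_to_hash = 1;
-- On-disk files for overly long column names are stored under a hash instead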
10. New decodeHTMLComponent function
Processing HTML with ClickHouse has become even more exciting with the introduction of the new ‘decodeHTMLComponent()’ function.
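The function decodes HTML entities in a string; for example:

SELECT decodeHTMLComponent('&lt;p&gt;Hello &amp; welcome&lt;/p&gt;') AS decoded;
-- decoded: <p>Hello & welcome</p>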
11. Non-Constant Time zones
Time zone conversion functions can now take a non-constant time zone argument, so you can convert each record using a time zone stored in your data. The ‘system.time_zones’ table offers a comprehensive list of the supported time zones.
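A minimal sketch of a per-row conversion, formatting the current time in several zones taken straight from system.time_zones (the release webinar showed a similar example with toString):

SELECT time_zone, toString(now(), time_zone) AS local_time
FROM system.time_zones
LIMIT 5;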
Other improvements
1. Security Features
- Temporary user credentials can be created with the VALID UNTIL clause (link); a sketch follows this list
- SSH keys can be used for authentication by registering a public key in ClickHouse, here
- SSO for AWS S3 – clickhouse-local reads temporary SSO credentials according to AWS_PROFILE
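A minimal sketch of the VALID UNTIL clause (the user name, password, and date are illustrative):

-- Credentials stop working after the given timestamp
CREATE USER temp_analyst IDENTIFIED BY 'secret' VALID UNTIL '2023-12-31 23:59:59';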
2. Performance improvements
- Parallel reading FROM INFILE – all matching files are processed in parallel
- Optimization for SELECT FINAL – if the data is already merged into a single part, reading the ORDER BY columns is skipped
3. Integrations
- Kafka Connect is GA – the ClickHouse sink connector for Kafka
- Power BI – proper type mapping between ClickHouse and Power BI
- Tableau – Tableau Online via the MySQL protocol #54115 (Serge Klochkov)
- Updates for the Python and Node.js clients
4. Bug Fixes
Some interesting bug fixes include:
- Resolved an issue with REPLACE/MOVE PARTITION under the experimental zero-copy replication feature #54193 (Alexander Tokmakov).
- Fixed throttling for BACKUP operations to and from S3 when the native copy function is not used, along with improvements in various other areas #53336 (Azat Khuzhin).
- Improved summary reporting when working with parallel replicas while applying a LIMIT #53050 (Raúl Marín).
- Resolved issues related to input-output (IO) throttling when copying entire directories #53338 (Azat Khuzhin).
- Fixed a problem with nullable primary keys when using the FINAL modifier #54164 (Amos Bird).
- Corrected unexpected errors in the “system.errors” after a join operation #54306 (vdimir).
- Fixed the filtering of parts with an indexHint when the new analyzer is not used #54449 (Azat Khuzhin).
Conclusion
ClickHouse 23.9 brings exciting new features that enhance its high-performance data analytics capabilities. Open-source and constantly innovating, ClickHouse remains a preferred choice for organizations seeking rapid, scalable, and feature-rich analytics. Upgrade to unlock its full potential for your workflows. For further details, please see the official ClickHouse changelog here.
Check out our coverage of previous ClickHouse releases.
References
- https://clickhouse.com/docs/en/whats-new/changelog#clickhouse-release-239-2023-09-28
- https://presentations.clickhouse.com/release_23.9/