ClickHouse Summer 2023 Release – Version 23.7

Introduction

Each new release includes new features, improvements, and numerous bug fixes, and the ChistaDATA team is always on top of the latest releases. ClickHouse version 23.7 was released on 27 July 2023, and it contains 31 new features, 16 performance optimizations, and 47 bug fixes.

 

For further details, please see the official ClickHouse changelog here.

Let’s look at some important new features:

initcap – initcapUTF8

The new initcap / initcapUTF8 functions convert the first letter of each word to uppercase and the rest to lowercase. The only language that is not supported is Turkish (its dotted/dotless ‘i’ rules are not applied). The initcapUTF8 variant also works with non-Latin scripts such as Russian and Chinese.

:) SELECT initcap('ChistaDATA Cloud Infrastructure for ClickHouse is a Columnar Database Service built for performance, scalability and reliability which is operationally simple and cost-efficient.')

┌─initcap('ChistaDATA Cloud Infrastructure for ClickHouse is a Columnar Database Service built for performance, scalability and reliability which is operationally simple and cost-efficient.')─┐
│ Chistadata Cloud Infrastructure For Clickhouse Is A Columnar Database Service Built For Performance, Scalability And Reliability Which Is Operationally Simple And Cost-Efficient.            │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
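
The UTF-8 variant handles non-Latin alphabets, which the plain ASCII version leaves untouched. A minimal sketch with a Cyrillic phrase (illustrative only, output not reproduced here):

:) SELECT initcap('привет мир') AS ascii_version, initcapUTF8('привет мир') AS utf8_version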

hasSubsequence

New hasSubsequence functions (plus CaseInsensitive and UTF8 variants) have been added to match subsequences in strings. Since the needle ‘ClickHouse’ is found in the example string, the function returns 1, as shown below.

:) select hasSubsequence('ChistaDATA Cloud Infrastructure for ClickHouse is a Columnar Database Service built for performance, scalability and reliability which is operationally simple and cost-efficient.', 'ClickHouse');

┌─hasSubsequence('ChistaDATA Cloud Infrastructure for ClickHouse is a Columnar Database Service built for performance, scalability and reliability which is operationally simple and cost-efficient.', 'ClickHouse')─┐
│                                                                                                                                                                                                                  1 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
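
Note that hasSubsequence matches a subsequence rather than a substring: the characters of the needle only have to appear in the haystack in the same order, not contiguously, and matching is case-sensitive unless you use the CaseInsensitive variant. A quick illustration (the first call should return 1, the second 0):

:) SELECT hasSubsequence('ClickHouse', 'CkHs'), hasSubsequence('ClickHouse', 'HC')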

arrayJaccardIndex

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used to measure the similarity and diversity of sample sets: it is the size of the intersection of two sets divided by the size of their union. This release adds the arrayJaccardIndex function, which computes the Jaccard similarity of two arrays. In the example below, the two token sets share two elements (‘is’ and ‘a’) out of six distinct tokens, so the result is 2 / 6 ≈ 0.33.

SELECT arrayJaccardIndex(tokens('ChistaDATA is a company'), tokens('ClickHouse is a database'))

┌─arrayJaccardIndex(tokens('ChistaDATA is a company'), tokens('ClickHouse is a database'))─┐
│                                                                       0.3333333333333333 │
└──────────────────────────────────────────────────────────────────────────────────────────┘
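
The function also works on plain arrays. For example, two arrays that share two of six distinct elements should give 2 / 6 ≈ 0.33:

SELECT arrayJaccardIndex([1, 2, 3, 4], [3, 4, 5, 6])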

Listener Feature

We can call this one of the most powerful features of the release. In the spirit of the existing SYSTEM STOP MERGES and SYSTEM STOP REPLICATED SENDS commands, the new SYSTEM STOP LISTEN query tells the server to stop accepting new connections on a specific protocol, or on all ports except internal ones, while existing connections and queries keep running.

SYSTEM STOP LISTEN HTTPS
SYSTEM START LISTEN QUERIES ALL ON CLUSTER default

These commands make the server stop (or start again) accepting new connections on a given interface, HTTPS in this example. The ClickHouse server will keep replicating and serving already-open connections, but it won’t accept new ones.

You can also target a specific interface: TCP, TCP WITH PROXY, TCP SECURE, HTTP, HTTPS, MYSQL, GRPC, POSTGRESQL, PROMETHEUS, CUSTOM, INTERSERVER HTTP, INTERSERVER HTTPS, QUERIES ALL, QUERIES DEFAULT, or QUERIES CUSTOM.
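
For example, to temporarily refuse new MySQL-protocol connections and then re-enable them, a sketch following the syntax above would be:

SYSTEM STOP LISTEN MYSQL
SYSTEM START LISTEN MYSQL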

Custom Disks in Keeper

Keeper now supports external disks, so snapshots and logs can be stored on custom disks such as S3. Only the most recent log needs to be kept on a local disk.

keeper_server:
    log_storage_disk: log_s3_plain
    latest_log_storage_disk: log_local
    snapshot_storage_disk: snapshot_s3_plain
    latest_snapshot_storage_disk: snapshot_local

Dynamic Reconfiguration of Keeper

ClickHouse Keeper now supports ZooKeeper’s reconfig command, with incremental reconfiguration enabled through the keeper_server.enable_reconfiguration parameter. Servers can be added, removed, or have their priority changed by sending the reconfig command.

keeper_server:
    enable_reconfiguration: 1

CSV with Variable Number of Columns

The CSV format has a new setting, input_format_csv_use_default_on_bad_values, which inserts the column’s default value when parsing of a single field fails. Even this irregular CSV, where the last value is missing from one of the rows, can now be parsed successfully:

"Name","Columnar","Version"
ClickHouse,Yes,23.7
Oracle,No

Without this setting, inserting the CSV file into a table fails with a parsing error:

Exception on client:
Code: 27. DB::ParsingException: Cannot parse input: expected '"' before: 'Version"\nClickHouse, Yes, 23.7,\nOracle, No,\n': 
Row 1:
Column 0,   name: path,  type: String, parsed text: "<DOUBLE QUOTE>Name<DOUBLE QUOTE>"
Column 1,   name: month, type: String, parsed text: "<DOUBLE QUOTE>Columnar<DOUBLE QUOTE>"
Column 2,   name: hits,  type: UInt32, parsed text: "<DOUBLE QUOTE>"ERROR
Code: 27. DB::ParsingException: Cannot parse input: expected '"' before: 'Version"\nClickHouse, Yes, 23.7,\nOracle, No,\n'. (CANNOT_PARSE_INPUT_ASSERTION_FAILED) (version 23.7.1.2470 (official build))

: While executing ParallelParsingBlockInputFormat: While executing File: data for INSERT was parsed from file: (in file/uri /root/abc.csv): (at row 1)
. (CANNOT_PARSE_INPUT_ASSERTION_FAILED)

But when we set the parameter, it works fine.

:) set input_format_csv_use_default_on_bad_values=1

Ok.

:) INSERT INTO sometable FROM INFILE 'abc.csv' FORMAT CSV

Ok.

Parquet Writing

ClickHouse never stops optimizing. Exporting data from ClickHouse in Parquet format is now multi-threaded and about 6x faster. Even if the table has 100 columns, writing is almost as fast as reading, delivering gigabytes per second.

But what is Parquet?

Apache Parquet is a popular open-source, cross-platform data format, and of course it is supported in ClickHouse. Its columnar layout allows for high compression and lets data warehouses run SQL queries directly on Parquet data. It is also widely supported: you can read and write Parquet with simple libraries in languages such as Python, and it is used by Google BigQuery, Spark, Amazon Redshift, and many other systems. With this release, writing Parquet from ClickHouse got much faster.

SELECT * FROM hits INTO OUTFILE 'hits.parquet'
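
To sanity-check the export, the file can be read back with the file table function (assuming hits.parquet is accessible to the server’s user_files path or to clickhouse-local):

SELECT count() FROM file('hits.parquet', 'Parquet')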

Filesystem As A Database

ClickHouse automatically detects the headers and column types in a CSV file, and you can reference the file name as if it were an existing table.

:) SELECT * FROM `test.csv`

But how does it work in the background? ClickHouse has a new database engine called Filesystem. It also supports S3 and HDFS.

Overlay: Provides support for the Database interface and allows combining different databases, such as Filesystem and Memory. Internally, it holds a vector of references to other databases and proxies each request to them in turn until one handles it successfully.
Filesystem: Provides read-only access to files on the local filesystem. When a user queries a table, it implicitly loads the file using TableFunctionFile, and the result of the call is cached for quick access.
S3: Provides read-only access to S3 storage. It implicitly loads the table from S3 using TableFunctionS3.
HDFS: Provides an interface to HDFS storage. It implicitly loads the table from HDFS using TableFunctionHDFS.

Overlay and Filesystem have also been added to clickhouse-local: to simplify the handling of local files, clickhouse-local combines Overlay with Filesystem and Memory. Previously it was necessary to use the file table function, but that is no longer needed. Here is an example usage:

SELECT * FROM "sample.csv",
SELECT * FROM "https://storage.chistadata.com/my-test-bucket-878/data.csv"
SELECT * FROM "hdfs://sample_hdfs1:9000/test"

PRQL Support (experimental)

PRQL is a modern data transformation language that is simple, powerful, and pipelined. Like SQL, it is readable, explicit, and declarative. Unlike SQL, it supports abstractions such as variables and functions and forms a logical pipeline of transformations. Because it compiles to SQL, it can be used with any database that speaks SQL. Now you can write PRQL queries in ClickHouse, but first:

SET dialect = 'prql'

Then

from tracks
filter artist == "Bob Marley"                 # Each line transforms the previous result
aggregate {                                   # `aggregate` reduces each column to a value
  plays    = sum plays,
  longest  = max length,
  shortest = min length,                      # Trailing commas are allowed
}
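
For reference, the PRQL pipeline above compiles to SQL roughly along these lines (the tracks table and its columns are just part of the example):

SELECT
    sum(plays) AS plays,
    max(length) AS longest,
    min(length) AS shortest
FROM tracks
WHERE artist = 'Bob Marley'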

You can experiment with PRQL and see the SQL it compiles to in the playground at https://prql-lang.org/playground/.

Remember to switch back to ClickHouse SQL after you have had fun with PRQL.

SET dialect = 'clickhouse'

Note: If you want to test these scenarios yourself, you can use the Docker containers and packages we have prepared. Here is the link to the installation guide.

Conclusion

ClickHouse 23.7 brings many exciting new features that enhance its capabilities as a high-performance analytical database management system. ClickHouse continues to evolve and provide users with powerful tools for processing, analyzing, and visualizing data. The performance enhancements and expanded integrations further solidify ClickHouse’s position as the first choice for high-speed data processing, real-time analytics, and advanced analytics use cases. Upgrade to ClickHouse 23.7 with ChistaDATA support to take advantage of these new features and unlock the full potential of your data analysis workflows.


References:

  • https://en.wikipedia.org/wiki/Jaccard_index
  • https://prql-lang.org/
  • https://parquet.apache.org/
  • https://github.com/ClickHouse/ClickHouse/pull/48821