Introduction
Every new release includes new features, enhancements, and numerous bug fixes, and the ChistaDATA team always stays on top of the latest releases. ClickHouse version 24.1 was released on January 30, 2024, and it contains the following:
- 26 new features,
- 22 performance optimizations,
- 47 bug fixes.
For further details, please see the official ClickHouse docs.
- v24.1 Source Code : GitHub Link
- v24.1 Release Webinar : Slides
- Installation: ClickHouse Docs
This article will look at the critical features of the ClickHouse 24.1 release.
Key features & improvements
(1) Improvements For Replicated Databases
Two new modes, null_status_on_timeout_only_active and throw_only_active, have been added to the distributed_ddl_output_mode setting. With either mode, a distributed DDL query no longer waits for inactive replicas.
SET distributed_ddl_output_mode = 'throw_only_active';
SET distributed_ddl_output_mode = 'null_status_on_timeout_only_active';
(2) arrayShingles
The new arrayShingles function generates an array of shingles, that is, all contiguous subarrays of a given length taken from the input array. For example, arrayShingles([1, 2, 3, 4, 5], 3) yields [[1,2,3],[2,3,4],[3,4,5]].
SELECT
    'ClickHouse is a good database' AS phrase,
    tokens(phrase) AS tok,
    arrayShingles(tok, 3) AS shingles

Row 1:
──────
phrase:   ClickHouse is a good database
tok:      ['ClickHouse','is','a','good','database']
shingles: [['ClickHouse','is','a'],['is','a','good'],['a','good','database']]
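To make the sliding-window behaviour concrete, here is a minimal Python sketch of what arrayShingles computes. It is an illustration of the semantics described above, not ClickHouse's implementation; the function name array_shingles and the choice to raise ValueError for an out-of-range length are assumptions of this sketch.

```python
def array_shingles(items, length):
    """Return every contiguous run of `length` elements from `items`.

    Illustrative only: mirrors the behaviour described for arrayShingles.
    Rejecting an out-of-range length is a choice made for this sketch.
    """
    if length < 1 or length > len(items):
        raise ValueError("length must be between 1 and len(items)")
    # One window starts at each index that still leaves `length` elements.
    return [items[i:i + length] for i in range(len(items) - length + 1)]


if __name__ == "__main__":
    print(array_shingles([1, 2, 3, 4, 5], 3))
    print(array_shingles(["ClickHouse", "is", "a", "good", "database"], 3))
```

An input of n elements produces n - length + 1 shingles, which is why the five-token phrase above yields three shingles of length 3.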
(3) quantileDD
quantileDD, quantilesDD, medianDD
The new quantileDD aggregate function, along with its counterparts quantilesDD and medianDD, is based on the DDSketch algorithm described in https://www.vldb.org/pvldb/vol12/p2195-masson.pdf. DDSketch computes approximate quantiles with a relative-error guarantee.
SELECT
    quantileExact(c),
    quantileDD(0.0001)(c),
    quantile(c),
    quantileBFloat16(c),
    quantileTiming(c),
    quantileTDigest(c)
FROM
(
    SELECT created_at::Date, count() AS c
    FROM github_events
    WHERE repo_name = 'ClickHouse/ClickHouse'
      AND event_type = 'PullRequestEvent'
      AND action = 'opened'
    GROUP BY ALL
)

──────
quantileExact(c):      19
quantileDD(0.0001)(c): 19.001159522718307
quantile(c):           19
quantileBFloat16(c):   19
quantileTiming(c):     19
quantileTDigest(c):    18.804445
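The idea behind DDSketch can be sketched in a few lines of Python: values are grouped into buckets whose boundaries grow geometrically, so any value in a bucket is within a fixed relative distance of the bucket's representative. The class below is a toy version for positive values only, written from the paper's description; the name TinyDDSketch is hypothetical, and it says nothing about how ClickHouse stores or merges its sketches. It assumes that the first parameter of quantileDD (0.0001 in the query above) plays the role of relative_accuracy here.

```python
import math
from collections import Counter


class TinyDDSketch:
    """Toy DDSketch for positive values (illustrative, not ClickHouse's code)."""

    def __init__(self, relative_accuracy):
        if not 0 < relative_accuracy < 1:
            raise ValueError("relative_accuracy must be in (0, 1)")
        # Bucket i covers (gamma**(i-1), gamma**i].
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()
        self.count = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this sketch only handles positive values")
        index = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[index] += 1
        self.count += 1

    def quantile(self, q):
        if self.count == 0:
            raise ValueError("no values added")
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Representative value with relative error <= relative_accuracy
                # for everything that fell into this bucket.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise AssertionError("unreachable: rank is below the total count")


if __name__ == "__main__":
    sketch = TinyDDSketch(relative_accuracy=0.01)
    for v in range(1, 1001):
        sketch.add(v)
    print(sketch.quantile(0.5))
```

The estimate differs from the exact quantile by at most the configured relative accuracy, which is consistent with quantileDD(0.0001) returning 19.0011... against an exact value of 19 in the output above.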
(4) Functions For Punycode
punycodeEncode, punycodeDecode, idnaEncode, idnaDecode
Four new functions have been added: punycodeEncode, punycodeDecode, idnaEncode, and idnaDecode. They convert internationalized domain names to and from their ASCII form in line with the IDNA standard.
:) SELECT punycodeEncode('ClickHouse是一个很好的数据库')
ClickHouse-zf2pypw92j24o7ldjpvw6hdrd236i

:) SELECT idnaEncode('ClickHouse.是一个不错的.数据库')
clickhouse.xn--4gq0a0fy48indsd45b.xn--dxty1ibyb

:) SELECT idnaDecode('clickhouse.xn--4gq0a0fy48indsd45b.xn--dxty1ibyb')
clickhouse.是一个不错的.数据库
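For readers who want to experiment with the same encodings outside ClickHouse, Python's standard library ships "punycode" and "idna" codecs. The wrappers below are hypothetical helper names chosen to mirror the ClickHouse functions. Note the assumption: Python's built-in idna codec follows the older IDNA 2003 rules, so its output may differ from ClickHouse's for some domains.

```python
def punycode_encode(text):
    """Encode a Unicode string as Punycode (RFC 3492) using the stdlib codec."""
    return text.encode("punycode").decode("ascii")


def punycode_decode(text):
    """Decode a Punycode string back to Unicode."""
    return text.encode("ascii").decode("punycode")


def idna_encode(domain):
    """Encode each label of a domain; non-ASCII labels get the xn-- prefix."""
    return domain.encode("idna").decode("ascii")


def idna_decode(domain):
    """Decode an ASCII (xn--) domain back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(punycode_encode("bücher"))         # bcher-kva
    print(idna_encode("bücher.example"))     # xn--bcher-kva.example
    print(idna_decode("xn--bcher-kva.example"))
```

The difference between the two pairs is visible here: Punycode encodes a single string, while the IDNA functions work label by label on a domain name and add the xn-- prefix to labels that needed encoding.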
(5) New String Similarity Functions
levenshteinDistance, damerauLevenshteinDistance, jaroSimilarity, jaroWinklerSimilarity
New string similarity functions have been added: damerauLevenshteinDistance, jaroSimilarity, and jaroWinklerSimilarity.
SELECT
    word,
    levenshteinDistance(word, 'clickhouse') AS d1,
    damerauLevenshteinDistance(word, 'clickhouse') AS d2,
    jaroSimilarity(word, 'clickhouse') AS d3,
    jaroWinklerSimilarity(word, 'clickhouse') AS d4
FROM
(
    SELECT DISTINCT arrayJoin(tokens(lower(title))) AS word
    FROM hackernews
)
ORDER BY d1 ASC
LIMIT 50
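The practical difference between the first two distances is how they treat swapped neighbouring characters. The Python sketch below implements plain Levenshtein distance and the restricted "optimal string alignment" variant of Damerau-Levenshtein, which counts an adjacent transposition as a single edit. Both functions are textbook dynamic-programming versions written for illustration; whether ClickHouse implements the restricted or the unrestricted Damerau-Levenshtein variant is not stated in the article, so treat the second function as an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def osa_distance(a, b):
    """Levenshtein plus adjacent transpositions (restricted Damerau variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # Swapping two neighbouring characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(levenshtein("clikchouse", "clickhouse"))   # 2
    print(osa_distance("clikchouse", "clickhouse"))  # 1
```

A typo such as "clikchouse" is two substitutions away from "clickhouse" under Levenshtein but only one transposition away under the Damerau-style distance, which is why the latter is often preferred for matching misspellings.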
(6) Control For Compression Level
Two new settings have been introduced: output_format_compression_level adjusts the compression level of the output, and output_format_compression_zstd_window_log explicitly sets the compression window size and enables long-range mode when the output compression method is zstd. Both settings apply to INTO OUTFILE and to writes through the table functions file, url, hdfs, s3, and azureBlobStorage.
:) SELECT text FROM hackernews INTO OUTFILE 'text.tsv.zst'
   SETTINGS output_format_compression_level = 6;

:) SELECT text FROM hackernews INTO OUTFILE 'text.tsv.zst'
   SETTINGS output_format_compression_level = 6,
            output_format_compression_zstd_window_log = 26;
(7) Speed Up For Parallel Replicas
The coordination mechanism for parallel replicas has been revamped to enhance parallelism and optimize cache locality. Extensive testing has confirmed its linear scalability across hundreds of replicas. Additionally, it now supports reading in sequential order.
SET allow_experimental_parallel_reading_from_replicas = 1, max_parallel_replicas = 123;
Better cache locality: the same ranges are read from the same replicas whenever possible.
Better tail latency: faster replicas steal tasks from slower ones.
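The two ideas above, sticky range-to-replica assignment and task stealing, can be illustrated with a toy scheduler. The Python sketch below is entirely hypothetical: the function names, the CRC32-based assignment, and the tick-based simulation are assumptions made for illustration and do not describe ClickHouse's actual coordinator.

```python
import zlib
from collections import deque


def preferred_replica(range_id, replica_count):
    """Stable mapping of a range to a replica, so repeated queries hit the same cache."""
    return zlib.crc32(range_id.encode()) % replica_count


def schedule(range_ids, speeds):
    """Simulate reading ranges; speeds[i] is ranges per tick for replica i (each >= 1).

    Returns a dict mapping replica index to the list of ranges it read.
    """
    n = len(speeds)
    queues = [deque() for _ in range(n)]
    for r in range_ids:
        queues[preferred_replica(r, n)].append(r)

    done = {i: [] for i in range(n)}
    while any(queues):
        for i in range(n):
            for _ in range(speeds[i]):
                if queues[i]:
                    # Prefer our own ranges: best cache locality.
                    done[i].append(queues[i].popleft())
                else:
                    # Idle replica steals from the most loaded one.
                    victim = max(range(n), key=lambda k: len(queues[k]))
                    if not queues[victim]:
                        break
                    done[i].append(queues[victim].pop())
    return done


if __name__ == "__main__":
    ranges = [f"part_{i}" for i in range(20)]
    result = schedule(ranges, speeds=[3, 1])
    for replica, read in result.items():
        print(replica, len(read))
```

In this simulation the faster replica ends up reading more ranges than its initial share because it takes over work queued for the slower one, which is the effect on tail latency described above.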
Conclusion
In summary, the updates in ClickHouse 24.1 are a substantial step forward in performance, scalability, and resource efficiency. The reworked coordination for parallel replicas improves parallelism and cache locality, the new distributed DDL output modes give users more flexibility and control over replicated databases, and the new functions extend what can be done directly in SQL. Together, these updates reinforce ClickHouse’s position as a leading solution for high-performance analytical workloads.
These are the ClickHouse 24.1 features. To find out more details, please visit the official ClickHouse Docs.
Learn about the latest v24.2 release in our release notes.