ClickHouse Data Types

Mastering ClickHouse Data Types: A Complete Guide to Storage Optimization



When working with ClickHouse, choosing the right data types isn’t just about functionality—it’s about unlocking the full potential of this columnar database. Poor type selection can lead to bloated storage, slower queries, and unnecessary costs. In this comprehensive guide, we’ll explore how to leverage ClickHouse’s rich type system for maximum performance and efficiency.

The Foundation: Understanding ClickHouse’s Type Philosophy

Unlike traditional row-based databases, ClickHouse stores data in columns, making type selection critical for both storage efficiency and query performance. Every byte matters when you’re processing billions of rows, and the right data type can mean the difference between a query that runs in seconds versus minutes.

Precision Matters: Integer Types Done Right

ClickHouse offers a complete range of integer types, each optimized for specific use cases. The key principle is simple: use the smallest type that accommodates your data range.

-- Signed integers
Int8    -- -128 to 127 (1 byte)
Int16   -- -32,768 to 32,767 (2 bytes)
Int32   -- -2,147,483,648 to 2,147,483,647 (4 bytes)
Int64   -- -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (8 bytes)

-- Unsigned integers
UInt8   -- 0 to 255 (1 byte)
UInt16  -- 0 to 65,535 (2 bytes)
UInt32  -- 0 to 4,294,967,295 (4 bytes)
UInt64  -- 0 to 18,446,744,073,709,551,615 (8 bytes)

Consider a user ID field: if your application will never exceed 4 billion users, UInt32 uses half the storage of UInt64. Multiply this by billions of rows, and the savings become substantial.

String Handling: When Size and Structure Matter

ClickHouse provides two primary string types, each with distinct advantages:

-- Variable-length strings
String          -- UTF-8 encoded, unlimited length
FixedString(N)  -- Fixed length N bytes, padded with null bytes

-- Practical example
CREATE TABLE users (
    id UInt32,
    username String,
    country_code FixedString(2)  -- Always exactly 2 bytes: 'US', 'UK', etc.
);

FixedString shines for standardized data like country codes, currency symbols, or status flags. It offers better compression and faster processing since ClickHouse knows exactly how much space each value occupies.

Temporal Data: Choosing the Right Time Precision

Time-based data is crucial in analytics, and ClickHouse offers several options:

-- Date types
Date        -- 1900-01-01 to 2299-12-31 (2 bytes)
Date32      -- Extended range with 4 bytes

-- DateTime types
DateTime    -- Unix timestamp with second precision
DateTime64(precision, [timezone])  -- Subsecond precision

-- Real-world example
CREATE TABLE events (
    event_time DateTime('UTC'),
    precise_time DateTime64(3, 'UTC'),  -- millisecond precision
    event_date Date
);

Pro tip: Don’t default to DateTime64 with maximum precision. If you only need second-level accuracy, DateTime uses less storage and processes faster.

The Nullable Dilemma: Performance vs Flexibility

One of ClickHouse’s most important design decisions is making all types non-nullable by default. This isn’t arbitrary—it’s a performance optimization.

Non-Nullable: The Default Choice

-- Standard types reject NULL values
CREATE TABLE products (
    id UInt32,           -- Cannot be NULL
    name String,         -- Cannot be NULL
    price Float64        -- Cannot be NULL
);

-- This will fail
INSERT INTO products VALUES (1, NULL, 99.99);  -- ERROR

Benefits:

  • 8-15% faster queries due to eliminated null checking
  • Smaller storage footprint without null masks
  • Better compression ratios from predictable data patterns

When Nullable Makes Sense

-- Wrap types with Nullable() when needed
CREATE TABLE users (
    id UInt32,
    email String,
    phone Nullable(String),      -- Optional field
    age Nullable(UInt8),         -- May be unknown
    last_login Nullable(DateTime)
);

Use Nullable types only when:

  • Data naturally contains missing values
  • Integrating with external systems that include NULLs
  • Business logic requires distinguishing between “unknown” and “zero/empty”

Complex Data Structures: Arrays and Beyond

ClickHouse excels at handling complex, nested data structures that would be cumbersome in traditional databases.

Arrays: Storing Multiple Values Efficiently

CREATE TABLE user_events (
    user_id UInt32,
    event_types Array(String),
    timestamps Array(DateTime),
    scores Array(Float32)
);

-- Insert multiple events per user
INSERT INTO user_events VALUES 
    (1, ['login', 'purchase', 'logout'], 
     ['2025-06-01 09:00:00', '2025-06-01 09:15:00', '2025-06-01 09:30:00'],
     [1.0, 5.0, 1.0]);

Arrays enable powerful analytics with built-in functions:

-- Analyze user behavior patterns
SELECT 
    user_id,
    length(event_types) as event_count,
    arraySum(scores) as total_score,
    arrayJoin(event_types) as individual_events  -- Flatten for analysis
FROM user_events;

Nested Structures: Related Data Together

For related arrays that should stay synchronized, use Nested columns:

CREATE TABLE orders (
    order_id UInt32,
    items Nested(
        product_id UInt32,
        quantity UInt16,
        price Float32
    )
);

-- Calculate order totals efficiently
SELECT 
    order_id,
    arraySum(arrayMap((q, p) -> q * p, items.quantity, items.price)) as total
FROM orders;

The LowCardinality Game-Changer

Perhaps ClickHouse’s most innovative optimization is LowCardinality, designed for columns with limited unique values (typically under 10,000 distinct values).

How LowCardinality Works

Instead of storing repeated string values, ClickHouse creates a dictionary:

Standard String Storage:
┌─────────────┬─────────────┬─────────────┐
│   Chrome    │   Firefox   │   Chrome    │
└─────────────┴─────────────┴─────────────┘

LowCardinality Storage:
Dictionary: [Chrome=1, Firefox=2, Safari=3]
Values:     [1, 2, 1, 1, 3, 2, 1, ...]

Real-World Performance Impact

CREATE TABLE web_logs (
    timestamp DateTime,
    url String,
    country String,                      -- High cardinality
    browser LowCardinality(String),      -- Low cardinality
    status_code LowCardinality(UInt16)
);

The performance benefits are dramatic:

-- GROUP BY operations are 3-5x faster
SELECT 
    browser,
    count(*) as requests
FROM web_logs 
GROUP BY browser;

-- Filtering uses dictionary lookups
SELECT count(*) 
FROM web_logs 
WHERE browser = 'Chrome';  -- Optimized dictionary lookup

When to Apply LowCardinality

Ideal candidates:

  • Browser names, operating systems
  • Country codes, currency codes
  • User roles, product categories
  • Status codes, error types
  • Any enum-like values

Avoid for:

  • High cardinality data (>10,000 unique values)
  • Frequently changing value sets
  • Very short strings (<4 characters) where overhead exceeds benefits

Measuring Cardinality

Before applying LowCardinality, measure your data:

SELECT 
    uniq(browser) as unique_browsers,
    count() as total_rows,
    uniq(browser) / count() as cardinality_ratio
FROM web_logs;

-- If cardinality_ratio < 0.01 (1%), LowCardinality will help

Storage Optimization in Practice

Let’s see how proper type selection impacts real-world scenarios:

Before: Inefficient Schema

CREATE TABLE inefficient_logs (
    id Int64,                    -- Oversized for typical IDs
    status Nullable(String),     -- Could be LowCardinality
    timestamp DateTime64(9),     -- Excessive nanosecond precision
    user_agent String            -- High cardinality, no optimization
);

After: Optimized Schema

CREATE TABLE optimized_logs (
    id UInt32,                           -- 50% storage reduction
    status LowCardinality(String),       -- 80% storage reduction
    timestamp DateTime,                  -- Sufficient precision
    user_agent String                    -- Keep as-is for high cardinality
);

Result: 60-70% storage reduction with improved query performance.

Best Practices for Type Selection

  1. Start small: Use the smallest integer type that fits your range
  2. Avoid nullable by default: Only use Nullable when business logic requires it
  3. Measure cardinality: Apply LowCardinality to columns with <1% unique values
  4. Match precision to needs: Don’t use DateTime64(9) when DateTime suffices
  5. Consider access patterns: Use FixedString for consistent-length data
  6. Monitor and adjust: Use system tables to track storage efficiency

Monitoring Your Optimizations

ClickHouse provides excellent introspection capabilities:

-- Check compression ratios by column
SELECT 
    column,
    formatReadableSize(data_compressed_bytes) as compressed_size,
    formatReadableSize(data_uncompressed_bytes) as uncompressed_size,
    data_compressed_bytes / data_uncompressed_bytes as compression_ratio
FROM system.parts_columns 
WHERE table = 'your_table_name'
ORDER BY compression_ratio;

Conclusion

Mastering ClickHouse data types is fundamental to building high-performance analytical systems. The choices you make during schema design have cascading effects on storage costs, query performance, and system scalability. By understanding the nuances of each type and applying optimization techniques like LowCardinality, you can unlock ClickHouse’s full potential.

Remember: in the world of big data analytics, every byte counts, and every optimization compounds. Start with these fundamentals, measure your results, and iterate toward the most efficient schema for your specific use case.

The investment in proper type selection pays dividends in faster queries, lower storage costs, and more responsive analytics—exactly what you need when processing data at scale.




 

Advantages of using data skipping indexes in ClickHouse

 

Advanced ClickHouse SQL: Window Functions, Array, and JSON Processing

 

ClickHouse JOIN: Understanding the Internal Mechanics of JOIN operations

 

Implementing Custom Access Policies in ClickHouse: A Comprehensive Guide

About ChistaDATA Inc. 157 Articles
We are an full-stack ClickHouse infrastructure operations Consulting, Support and Managed Services provider with core expertise in performance, scalability and data SRE. Based out of California, Our consulting and support engineering team operates out of San Francisco, Vancouver, London, Germany, Russia, Ukraine, Australia, Singapore and India to deliver 24*7 enterprise-class consultative support and managed services. We operate very closely with some of the largest and planet-scale internet properties like PayPal, Garmin, Honda cars IoT project, Viacom, National Geographic, Nike, Morgan Stanley, American Express Travel, VISA, Netflix, PRADA, Blue Dart, Carlsberg, Sony, Unilever etc

Be the first to comment

Leave a Reply