Building a Medallion Architecture with ClickHouse: A Complete Guide
The medallion architecture has emerged as a leading data lakehouse pattern, providing a structured approach to data processing that balances flexibility with governance. At ChistaDATA Inc., we’ve helped numerous organizations implement this architecture using ClickHouse as the analytical engine. This guide explores how to build an effective medallion architecture leveraging ClickHouse’s exceptional performance capabilities.
What is Medallion Architecture?
Medallion architecture organizes data into three distinct layers:
- Bronze Layer: Raw, unprocessed data ingested from various sources
- Silver Layer: Cleaned, validated, and enriched data ready for analysis
- Gold Layer: Business-ready datasets optimized for specific use cases
This layered approach ensures data quality improves progressively while maintaining full data lineage and enabling both batch and real-time processing workflows.
Why ClickHouse for Medallion Architecture?
ClickHouse offers several advantages that make it ideal for implementing medallion architecture:
Performance Excellence
- Sub-second query latency on billions of rows for well-modeled tables
- Columnar storage optimized for analytical workloads
- Advanced compression, often reducing storage footprint by around 10x compared to row-oriented stores
Scalability
- Horizontal scaling across commodity hardware
- Data distribution and replication via Distributed tables and ReplicatedMergeTree
- Support for petabyte-scale deployments
Real-time Capabilities
- Streaming ingestion via table engines such as Kafka
- Materialized views for real-time aggregations
- Low-latency data processing pipelines
Implementing Bronze Layer with ClickHouse
The bronze layer serves as your data landing zone, capturing raw data from multiple sources.
-- Create the bronze database and a table for raw events
CREATE DATABASE IF NOT EXISTS bronze;

CREATE TABLE bronze.raw_events
(
event_id String,
timestamp DateTime64(3),
user_id String,
event_type String,
properties String,
source_system String,
ingestion_time DateTime DEFAULT now()
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (source_system, timestamp, event_id)
SETTINGS index_granularity = 8192;
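Streaming is not the only path into bronze; batch loads land in the same table. A minimal sketch using the s3 table function, where the bucket path, credentials, and Parquet file schema are placeholders, not part of the setup above:
-- Batch-load historical events from object storage
-- (path, credentials, and file schema are hypothetical)
INSERT INTO bronze.raw_events (event_id, timestamp, user_id, event_type, properties, source_system)
SELECT
event_id,
timestamp,
user_id,
event_type,
properties,
'batch_s3' AS source_system
FROM s3(
'https://my-bucket.s3.amazonaws.com/events/*.parquet',
'AWS_ACCESS_KEY', 'AWS_SECRET_KEY',
'Parquet'
);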
-- Create Kafka engine for real-time ingestion
CREATE TABLE bronze.events_queue
(
event_id String,
timestamp DateTime64(3),
user_id String,
event_type String,
properties String,
source_system String
)
ENGINE = Kafka
SETTINGS
kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'events',
kafka_group_name = 'clickhouse_bronze',
kafka_format = 'JSONEachRow';
-- Materialized view for continuous ingestion
CREATE MATERIALIZED VIEW bronze.events_mv TO bronze.raw_events AS
SELECT
event_id,
timestamp,
user_id,
event_type,
properties,
source_system
FROM bronze.events_queue;
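With the view attached, every message consumed from the events topic is written to bronze.raw_events continuously. A quick sanity check confirms data is flowing and shows per-source freshness:
-- Verify ingestion volume and freshness per source
SELECT
source_system,
count() AS rows_ingested,
max(ingestion_time) AS last_ingested_at
FROM bronze.raw_events
GROUP BY source_system;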
Building the Silver Layer
The silver layer transforms bronze data into a cleaner, more structured format with data quality checks and enrichments.
-- Create the silver database and a table with a stricter, typed schema
CREATE DATABASE IF NOT EXISTS silver;

CREATE TABLE silver.processed_events
(
event_id String,
timestamp DateTime64(3),
user_id UInt64,
event_type LowCardinality(String),
session_id String,
device_type LowCardinality(String),
country_code FixedString(2),
revenue Decimal(10,2),
properties Map(String, String),
processing_time DateTime DEFAULT now(),
data_quality_score UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (event_type, timestamp, user_id)
SETTINGS index_granularity = 8192;
-- Transformation logic with data quality checks.
-- The raw string columns are renamed in a subquery so the cleaned
-- aliases (user_id, properties) do not shadow the source columns.
CREATE MATERIALIZED VIEW silver.events_transform_mv TO silver.processed_events AS
SELECT
event_id,
timestamp,
toUInt64OrZero(raw_user_id) AS user_id,
event_type,
JSONExtractString(raw_properties, 'session_id') AS session_id,
JSONExtractString(raw_properties, 'device_type') AS device_type,
JSONExtractString(raw_properties, 'country') AS country_code,
toDecimal64OrZero(JSONExtractString(raw_properties, 'revenue'), 2) AS revenue,
JSONExtract(raw_properties, 'Map(String, String)') AS properties,
now() AS processing_time,
data_quality_score
FROM
(
SELECT
event_id,
timestamp,
user_id AS raw_user_id,
event_type,
properties AS raw_properties,
-- Data quality scoring on the raw columns
CASE
WHEN user_id != '' AND event_type != '' AND timestamp > '2020-01-01' THEN 100
WHEN user_id != '' AND event_type != '' THEN 80
WHEN user_id != '' OR event_type != '' THEN 60
ELSE 20
END AS data_quality_score
FROM bronze.raw_events
)
WHERE data_quality_score >= 60;
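Note that a materialized view only transforms rows inserted after it is created. Bronze data that predates the view needs a one-off backfill with the same SELECT; a sketch, using an illustrative ingestion-time cutoff to avoid double-processing:
-- One-off backfill of pre-existing bronze rows
-- (same SELECT as the view above; the cutoff timestamp is illustrative)
INSERT INTO silver.processed_events
SELECT
event_id,
timestamp,
toUInt64OrZero(raw_user_id) AS user_id,
event_type,
JSONExtractString(raw_properties, 'session_id') AS session_id,
JSONExtractString(raw_properties, 'device_type') AS device_type,
JSONExtractString(raw_properties, 'country') AS country_code,
toDecimal64OrZero(JSONExtractString(raw_properties, 'revenue'), 2) AS revenue,
JSONExtract(raw_properties, 'Map(String, String)') AS properties,
now() AS processing_time,
data_quality_score
FROM
(
SELECT
event_id,
timestamp,
user_id AS raw_user_id,
event_type,
properties AS raw_properties,
CASE
WHEN user_id != '' AND event_type != '' AND timestamp > '2020-01-01' THEN 100
WHEN user_id != '' AND event_type != '' THEN 80
WHEN user_id != '' OR event_type != '' THEN 60
ELSE 20
END AS data_quality_score
FROM bronze.raw_events
WHERE ingestion_time < '2024-01-01 00:00:00'
)
WHERE data_quality_score >= 60;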
Creating the Gold Layer
The gold layer contains business-ready datasets optimized for specific analytical use cases.
-- Create the gold database and a user behavior aggregation table
CREATE DATABASE IF NOT EXISTS gold;

CREATE TABLE gold.user_behavior_daily
(
date Date,
user_id UInt64,
total_events UInt32,
unique_sessions UInt16,
total_revenue Decimal(12,2),
avg_session_duration Float32,
device_types Array(String),
top_event_type String
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, user_id)
SETTINGS index_granularity = 8192;
-- Real-time aggregation materialized view
CREATE MATERIALIZED VIEW gold.user_behavior_mv TO gold.user_behavior_daily AS
SELECT
toDate(timestamp) as date,
user_id,
count() as total_events,
uniq(session_id) as unique_sessions,
sum(revenue) as total_revenue,
avg(toFloat32OrZero(properties['session_duration'])) as avg_session_duration,
groupUniqArray(device_type) as device_types,
topK(1)(event_type)[1] as top_event_type
FROM silver.processed_events
WHERE data_quality_score >= 80
GROUP BY date, user_id;
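One caveat with this design: SummingMergeTree only combines numeric columns additively when parts merge, so total_events and total_revenue stay correct across inserts, but per-block results from uniq, avg, groupUniqArray, and topK are not mergeable and will drift as data arrives over multiple inserts. Where those figures must be exact, the usual pattern is AggregatingMergeTree with aggregate-function states; a minimal sketch (the table and view names here are new, not part of the schema above):
-- Exact cross-insert aggregation using AggregateFunction states
CREATE TABLE gold.user_behavior_daily_agg
(
date Date,
user_id UInt64,
total_events AggregateFunction(count),
unique_sessions AggregateFunction(uniq, String),
avg_session_duration AggregateFunction(avg, Float32)
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, user_id);

CREATE MATERIALIZED VIEW gold.user_behavior_agg_mv TO gold.user_behavior_daily_agg AS
SELECT
toDate(timestamp) AS date,
user_id,
countState() AS total_events,
uniqState(session_id) AS unique_sessions,
avgState(toFloat32OrZero(properties['session_duration'])) AS avg_session_duration
FROM silver.processed_events
WHERE data_quality_score >= 80
GROUP BY date, user_id;

-- Read back with the matching -Merge combinators
SELECT
date,
user_id,
countMerge(total_events) AS total_events,
uniqMerge(unique_sessions) AS unique_sessions,
avgMerge(avg_session_duration) AS avg_session_duration
FROM gold.user_behavior_daily_agg
GROUP BY date, user_id;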
-- Business metrics table for dashboards
CREATE TABLE gold.business_metrics_hourly
(
hour DateTime,
total_users UInt32,
total_revenue Decimal(15,2),
conversion_rate Float32,
avg_order_value Decimal(10,2),
top_countries Array(String)
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY hour
SETTINGS index_granularity = 8192;
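Nothing populates this table automatically; ReplacingMergeTree is chosen so that a rollup for the same hour can simply be re-inserted, with the newer row winning after merges. A minimal sketch of an hourly job (scheduled externally, e.g. via cron, or a refreshable materialized view on recent ClickHouse versions), assuming conversion is defined as the share of users with revenue in the hour:
-- Hourly rollup for the previous complete hour (scheduling is external;
-- the conversion definition below is an assumption for illustration)
INSERT INTO gold.business_metrics_hourly
SELECT
toStartOfHour(timestamp) AS hour,
uniq(user_id) AS total_users,
sum(revenue) AS total_revenue,
uniqIf(user_id, revenue > 0) / uniq(user_id) AS conversion_rate,
if(countIf(revenue > 0) > 0, avgIf(revenue, revenue > 0), 0) AS avg_order_value,
arrayMap(c -> toString(c), topK(5)(country_code)) AS top_countries
FROM silver.processed_events
WHERE timestamp >= toStartOfHour(now() - INTERVAL 1 HOUR)
AND timestamp < toStartOfHour(now())
GROUP BY hour;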
Data Quality and Monitoring
Implement monitoring across all layers so that volume, storage, and quality regressions surface quickly. The queries below are a minimal starting point: the first tracks active rows and on-disk size per layer from system.parts, the second watches how quality scores are distributed in recent silver data.
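-- Active rows and on-disk size per layer
SELECT
database,
table,
sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS disk_size
FROM system.parts
WHERE active AND database IN ('bronze', 'silver', 'gold')
GROUP BY database, table
ORDER BY database, table;

-- Quality score distribution over the last 24 hours of silver data
SELECT
data_quality_score,
count() AS events,
round(100.0 * count() / sum(count()) OVER (), 2) AS pct_of_events
FROM silver.processed_events
WHERE timestamp >= now() - INTERVAL 1 DAY
GROUP BY data_quality_score
ORDER BY data_quality_score DESC;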
Further Reading:
- Mastering Custom Partitioning Keys in ClickHouse: A Complete Guide
- Why is ClickHouse So Fast? The Architecture Behind Lightning-Speed Analytics
- An Introduction to Time-Series Databases: Powering Modern Data-Driven Applications
- ClickHouse® ReplacingMergeTree Explained: The Good, The Bad, and The Ugly
- Pro Tricks to Build Cost-Efficient Analytics: Snowflake vs BigQuery vs ClickHouse® for Any Business