Apache Pulsar Segmented Storage and BookKeeper: An Operator’s Field Guide

If you’ve spent a weekend rebalancing a Kafka cluster after a hot partition pinned a single broker — moving terabytes around at 3 a.m. while half the engineering team watched the lag graph and the other half argued about num.replica.fetchers — Apache Pulsar’s storage model lands differently the first time you really see it. The headline idea is almost boring on paper: a topic is a sequence of segments, not a single growing file glued to one broker. But the consequences of that one decision ripple through everything: how Pulsar handles failure, how it scales, how it offloads to object storage, how it survives broker restarts, and frankly, how much sleep your on-call rotation gets.

This post is a working engineer’s walkthrough of how Apache Pulsar segmented storage and Apache BookKeeper actually fit together — not a marketing comparison, not a one-line architecture diagram. We’ve operated Pulsar clusters of various sizes and shapes at ChistaDATA, and the patterns that follow are the ones we wish someone had spelled out before we hit them in production.

Understanding Apache Pulsar Segmented Storage

The Two-Layer Idea, and Why It Matters

Pulsar splits the system into two clearly separated tiers. Brokers handle the protocol, dispatch, subscriptions, and routing. They don’t keep long-term data on local disk. Bookies — the storage nodes of Apache BookKeeper — handle durability and replication. The metadata store (ZooKeeper, or increasingly etcd) keeps the catalog: which topic owns which ledgers, where the cursors are, who is the owner broker for each bundle. That’s it. Three things.

This separation reads as a small thing until you live with it. In Kafka, a broker is also the storage; killing a broker means leadership reelection, ISR shrinking, replication catch-up. In Pulsar, killing a broker means a bundle is reassigned to another broker, which immediately starts serving from the same bookies the old broker was using. There’s no data to move. We’ve seen the practical effect during a rolling upgrade on a 12-broker cluster: producers paused for under a second per broker swap, and consumers didn’t see a single missed message. The bookies never noticed.

The flip side, of course, is that BookKeeper is now in the critical path of every write. So getting bookies right matters more than getting brokers right. Most Pulsar performance problems we get called for are bookie problems wearing a Pulsar costume.

Ledgers: BookKeeper’s Unit of Storage

The fundamental thing in BookKeeper is a ledger. A ledger is an append-only log with exactly one writer at a time. It has a 64-bit ID, a defined replication contract, and an unambiguous lifecycle: OPEN, IN_RECOVERY, CLOSED. Once it’s closed, it’s immutable. No appends, no edits, no compaction. The bytes that went in are the bytes that come out, forever, until you explicitly delete the ledger.

A Pulsar topic isn’t backed by one giant file. It’s backed by a managed ledger — an ordered list of BookKeeper ledgers maintained by the broker. As the topic accumulates messages, the current ledger grows. When it hits a roll trigger (configurable: max size, max entries, max time), the broker closes that ledger and opens a fresh one. The metadata store records the new ledger ID in the topic’s ledger list. Consumers don’t know or care; the broker reads from whichever ledger the next message lives in.

The defaults are sensible but worth knowing. managedLedgerMaxEntriesPerLedger defaults to 50,000. managedLedgerMaxLedgerRolloverTimeMinutes defaults to 4 hours. managedLedgerMinLedgerRolloverTimeMinutes defaults to 10. The minimum exists for a reason: rolling too often hammers the metadata store with ledger-creation traffic and tends to surface as elevated ZooKeeper write latency long before it shows up in Pulsar itself.

The Quorum Contract: E, Qw, Qa

When a Pulsar broker opens a new ledger, it picks three numbers that together define the durability promise of every entry written to that ledger:

  • E (ensemble size): how many bookies are in the rotation for this ledger
  • Qw (write quorum): how many bookies each entry is striped across
  • Qa (ack quorum): how many of those bookies must acknowledge the fsync before the producer hears “ok”

The invariant is E ≥ Qw ≥ Qa, and a typical production setting is (3, 3, 2) or (5, 3, 2). Each entry’s write quorum is a deterministic subsequence of the ensemble, starting at entry_id % E. This is what gives Pulsar its even striping: entry 0 writes to bookies [0,1,2], entry 1 writes to [1,2,3], entry 2 to [2,3,4], and so on around the ring. A ledger isn’t bolted to a fixed trio of bookies; it’s smeared across the ensemble.

The failure math is honest. With Qa = 2, you tolerate Qa − 1 = 1 bookie failure for any given entry without data loss. Push Qa to 3 and you tolerate 2 failures, at the cost of higher write latency because the producer now waits on a third fsync. We’ve seen teams set (5, 5, 3) for financial workloads and (3, 3, 2) for telemetry, on the same cluster, on different namespaces, and that’s the right way to use the contract — durability isn’t one number, it’s a per-namespace policy decision.

A subtle thing worth internalizing: the ack quorum is about fsync, not about delivery. A bookie acknowledges only after the entry is durable on disk. That’s why bookie disk choice matters so much — every ack is bottlenecked by your journal write latency, full stop.

The Bookie Write Path, Honestly

Inside a bookie, an incoming entry takes a more interesting trip than people assume:

  1. The entry arrives over the network and is appended to the journal — a sequential, append-only write-ahead log on a dedicated disk.
  2. The journal write is fsynced. This is when the bookie acks the broker.
  3. In parallel (and asynchronously), the entry is added to a memtable.
  4. The memtable is periodically flushed to the entry log, a much larger sequential file on the ledger disk.
  5. An index (a RocksDB-backed mapping from (ledgerId, entryId) to file offset) is updated so reads can find the entry later.

The reason this matters operationally is that the journal is fsync-heavy and latency-sensitive, while the entry log is throughput-heavy and sequential. If you put both on the same disk, you’ve created a fight: every random journal fsync stalls behind the entry-log flush, and your tail latency grows two of its own legs. Every bookie we’ve ever tuned in production has the journal on its own NVMe device — separate physical hardware, not a partition, not a logical volume on the same SSD. That single change has fixed more “Pulsar is slow” incidents than any broker config has.

The relevant settings live in bookkeeper.conf:

# Journal — small, sequential, latency-critical. Put on a dedicated NVMe.
journalDirectories=/data/journal
journalSyncData=true
journalMaxSizeMB=2048
journalPreAllocSizeMB=16
journalWriteBufferSizeKB=64

# Entry log + index — throughput-bound. Different physical disks.
ledgerDirectories=/data/ledgers/d1,/data/ledgers/d2
indexDirectories=/data/index

# Memtable / write cache
dbStorage_writeCacheMaxSizeMb=2048
dbStorage_readAheadCacheMaxSizeMb=1024

# Compaction — keep entry logs from drifting toward fragmentation
minorCompactionThreshold=0.2
minorCompactionInterval=3600
majorCompactionThreshold=0.8
majorCompactionInterval=86400

The compaction settings are the kind of thing nobody looks at until they bite. Entry logs accumulate deleted-but-not-yet-reclaimed entries as ledgers are deleted; compaction reclaims that space. Set the major threshold too low and you’ll burn IO compacting files that don’t need it; set it too high and you’ll watch your bookie disks fill up on a calm Tuesday afternoon. The defaults are reasonable; we tune from there.

Reads: Three Different Paths

Reads in Pulsar take three distinct paths depending on how close the consumer is to the tail:

Tailing reads — consumers caught up to the producer — are served from the broker’s managed-ledger cache. On a healthy system, the vast majority of reads should hit this cache and never touch a bookie. The relevant tunable is managedLedgerCacheSizeMB on the broker, and we size it to comfortably hold the working set of unconsumed entries.

Slightly-behind reads miss the broker cache and go to the bookie. If the entry is still in the bookie’s memtable or read-ahead cache, it’s served from memory. If it’s not, the bookie reads from the entry log on disk. This is where the index disk earns its keep — every miss is a RocksDB lookup followed by a disk seek.

Backlog reads — a consumer rewinding hours or days back — are the ones that surprise people. These walk through closed ledgers, sequentially, often against entry logs that haven’t been touched in a while. They’re throughput-bound and can blow out the bookie’s page cache if you let them. We’ve had production incidents where a poorly-bounded backfill job pinned a bookie’s IO for forty minutes and degraded latency for every other tenant on the cluster.

The mitigation is a combination of dispatchThrottlingRatePerSubscriptionInMsg on suspect subscriptions, separate bookie pools for cold tiers if your scale justifies it, and — most often — tiered storage, which is where this gets actually elegant.

Tiered Storage: The Real Payoff of Sealed Ledgers

This is where segmented storage stops being an architectural curiosity and starts paying its rent. Because closed ledgers are immutable, they’re trivially safe to copy to object storage. There’s no consistency dance, no read-during-write hazard, no version skew to manage. Pulsar’s offloader copies an entire sealed ledger to S3, GCS, Azure Blob, or any S3-compatible store, updates the managed-ledger metadata to point reads for that ledger at the cloud copy, and then deletes the BookKeeper copy when the namespace policy allows.

The economics are hard to argue with. A bookie cluster with NVMe SSDs is fast and expensive. S3 Standard is slow and cheap. S3 Glacier Instant Retrieval is slower and cheaper still. Most data we see in production has a very short hot window — usually 24 to 48 hours where real-time consumers and analytics jobs touch it — and then a very long tail of “we might need to replay this once a quarter for an audit.” Keeping the long tail on NVMe is a waste; pushing it to object storage typically cuts storage cost by 80% to 90% with no consumer-visible change in behavior.

The configuration lives in broker.conf (or per-namespace via pulsar-admin), and the namespace policy controls when offload triggers:

# Configure the AWS S3 offloader on the brokers
managedLedgerOffloadDriver=aws-s3
offloadersDirectory=offloaders
s3ManagedLedgerOffloadBucket=our-pulsar-offload
s3ManagedLedgerOffloadRegion=us-east-1
s3ManagedLedgerOffloadMaxBlockSizeInBytes=67108864      # 64 MB parts
s3ManagedLedgerOffloadReadBufferSizeInBytes=1048576     # 1 MB read blocks

# Per-namespace: offload when topic exceeds 100 GB on bookies, keep 24h hot
pulsar-admin namespaces set-offload-threshold \
  --size 100G public/orders
pulsar-admin namespaces set-offload-deletion-lag \
  --lag 24h public/orders

# Trigger an offload manually on a single topic
pulsar-admin topics offload \
  --size-threshold 10G \
  persistent://public/orders/order-events

Two gotchas that have bitten teams we’ve worked with. First, set a lifecycle rule on the offload bucket to abort incomplete multipart uploads after a day or two. Broker crashes during offload leave orphan multipart uploads, and cloud providers happily bill you for them indefinitely. Second, the offload-deletion-lag is the safety window — the broker only deletes the BookKeeper copy after this lag elapses post-offload. Setting it to zero is brave; setting it to 4 to 24 hours is sensible.

What Happens When a Bookie Dies

Hardware fails. A bookie dies, an EBS volume detaches, an AZ has a bad afternoon. BookKeeper’s response to bookie failure is one of the better-engineered parts of the system, and understanding it is the difference between staying calm and panicking when the alert fires.

When a bookie goes down, the writes it was acknowledging stop. Pulsar brokers detect the failure (via the BookKeeper client’s failure detection, generally within seconds) and trigger an ensemble change on the affected open ledgers: a new fragment is started, a different bookie is added to the ensemble for that ledger, and writes resume. Closed ledgers that the dead bookie was part of don’t need an ensemble change — they’re already replicated to Qw bookies, and as long as Qa − 1 failures haven’t happened in that subset, reads still work.

Background, the BookKeeper autorecovery daemon (sometimes called the “replication worker”) notices that some ledger fragments are now under-replicated and re-replicates them onto healthy bookies. This is automatic. You don’t need to do anything. You should, however, watch the bookkeeper_server_under_replicated_ledgers metric like a hawk — if it stops trending toward zero, something is wrong with autorecovery, and lingering under-replication is the kind of risk that compounds quietly until a second failure makes it loud.

One operational hazard worth flagging: don’t reuse a failed bookie’s data directories blindly. If a bookie died because of a disk corruption, swapping in a new instance with the same disks isn’t recovery, it’s procrastination. Provision fresh storage, let autorecovery do its job, and let the dead bookie’s data go.

The Metadata Store: ZooKeeper, etcd, and the Quiet Bottleneck

Pulsar’s metadata store catalogs every ledger, every cursor position, every namespace policy, every broker assignment. Historically that’s been ZooKeeper, and ZooKeeper is fine until it isn’t. Cluster-wide, the rate of metadata writes scales with how often ledgers roll. A cluster with 10,000 topics, each rolling a ledger every 4 hours, generates roughly 700 metadata writes per hour from rollovers alone, plus cursor updates, plus broker heartbeats. Healthy. Push that to ledgers rolling every 10 minutes, and now you’re talking 60,000 writes per hour from rollovers, and ZooKeeper starts noticing.

This is the practical reason managedLedgerMinLedgerRolloverTimeMinutes = 10 exists. It’s not arbitrary; it’s a guardrail against accidentally turning your metadata store into the bottleneck. We’ve seen well-meaning teams aggressively shrink ledger sizes to try to “improve” something and end up with a ZooKeeper that’s pinned at 80% CPU. The fix was always counterintuitive: roll ledgers less often, not more.

Newer Pulsar releases support etcd and other pluggable metadata stores, and for greenfield deployments we typically recommend etcd — it’s operationally lighter, has cleaner failure semantics, and is what most cloud-native teams already run for other things. For existing ZooKeeper-backed clusters, the migration is doable but isn’t free; we plan it like a major version upgrade.

A Sizing Heuristic That Actually Holds Up

Bookie sizing is a question we get asked constantly, and the honest answer is that it’s workload-dependent. But there’s a heuristic that’s served us well as a starting point:

  • Plan for at least 3 bookies at the minimum, 5 for any real production system, and enough to spread your ensemble comfortably across failure domains (AZs, racks, hosts)
  • Journal disk: small (a few hundred GB is plenty), but fast — NVMe SSD with low write latency, dedicated, no other workload
  • Ledger disks: large, throughput-oriented; multiple disks per bookie distribute IO across the entry-log writers; spinning disks are viable for cold tiers but we default to SSD
  • Index disk: small to medium SSD, separate from ledger disks if possible
  • RAM: enough for the read-ahead and write caches plus the OS page cache for hot entry logs — 64 GB on a busy bookie is unremarkable
  • Network: 10 GbE minimum on modern hardware; bookies talk to each other for re-replication, and the BookKeeper client (running in the broker) also talks to all of them

The thing we don’t do is size purely for steady-state throughput. Bookies need headroom for the surge when autorecovery is re-replicating after a bookie failure — that surge can be substantial, and a cluster running at 80% IO utilization on a calm day will become a cluster running at 100% during recovery, which is precisely when you want it to have headroom.

The Things We Wish We’d Known Sooner

A scattered list, in no particular order, of the lessons that took us longer than they should have:

  • Bookie journal latency is the single most diagnostic metric in the entire stack. If it’s clean, the rest usually is too. If it’s noisy, nothing else matters until you fix it.
  • Don’t share journal disks between bookies. We have seen “virtualized” deployments do this; they all suffer together when one bookie has a busy minute.
  • Pulsar’s per-broker bundle assignment is dynamic. A topic can move between brokers in seconds. This is a feature, not a bug, and it means broker-level CPU graphs are nearly useless for capacity planning — go by cluster aggregates.
  • Use namespace-level offload policies, not topic-level, unless you have a specific reason. Per-topic policies multiply quickly into ops debt.
  • Test bookie restart behavior before you need to. A bookie coming back up has to read its journal, rebuild its index entries, and rejoin the cluster — for a large journal this can take minutes, and you want to know what minutes look like in your environment.
  • Monitor the under-replicated ledger count and the unclosed ledger count separately. They mean different things, and conflating them costs investigation time.
  • If you’re using S3 offload, monitor the offload queue and the multipart upload abort rate, not just S3 PUT errors. Failed offloads pile up silently.

Closing Thought

The segmented storage model isn’t magic. It’s a deliberate decision to push storage one layer down and let a different system (BookKeeper) own it. That decision buys Pulsar three things that are genuinely hard to retrofit into a system that wasn’t designed with the split: stateless brokers, even distribution of any single topic across the entire storage tier, and a clean path to long-term object storage. The price is that you’re now operating two systems instead of one, and BookKeeper has its own learning curve.

For most teams, the price is worth it. We run Pulsar clusters at ChistaDATA that have outlived the Kafka clusters they replaced, with smaller operational footprints and fewer 3 a.m. pages. But that outcome isn’t free — it comes from understanding what the bookies are actually doing, sizing the journal disks correctly, watching the right metrics, and respecting the metadata store. Get those right and the rest of Pulsar tends to fade into the background, which is exactly what good infrastructure should do.

If you’re standing up your first Pulsar cluster, or you’ve inherited one and the dashboards aren’t telling you a coherent story yet, the playbook is the same: start at the bookie journal and work outward. Almost every Pulsar performance question we’ve ever been called in to answer has had its root cause within two layers of that journal disk. Everything else — broker tuning, client backpressure, consumer fan-out — is downstream of whether the storage tier is healthy. Once you internalize that, the rest of the architecture starts to feel less like a black box and more like a system designed by people who’d been burned by single-tier storage one too many times.



Further Reading 

Real-Time Analytics on ClickHouse with ChistaDATA 

Data Migration Services for ClickHouse 

ChistaDATA DBA Services 

ChistaDATA University 

About ChistaDATA Inc. 214 Articles
We are an full-stack ClickHouse infrastructure operations Consulting, Support and Managed Services provider with core expertise in performance, scalability and data SRE. Based out of California, Our consulting and support engineering team operates out of San Francisco, Vancouver, London, Germany, Russia, Ukraine, Australia, Singapore and India to deliver 24*7 enterprise-class consultative support and managed services. We operate very closely with some of the largest and planet-scale internet properties like PayPal, Garmin, Honda cars IoT project, Viacom, National Geographic, Nike, Morgan Stanley, American Express Travel, VISA, Netflix, PRADA, Blue Dart, Carlsberg, Sony, Unilever etc