
How to Size Your Kafka Cluster: A Step-by-Step Capacity Planning Guide

The Problem: Guessing Kafka Cluster Size

Most Kafka clusters start with a guess. Someone picks 3 brokers because "that's what the tutorial used," and six months later the team is firefighting disk-full alerts at 2 AM.

Proper capacity planning requires calculating across four independent dimensions — throughput, storage, partitions, and fault tolerance — and sizing to the largest. Get any one wrong and you'll hit a bottleneck.

We built the Kafka Capacity Planner to make this math instant.

How the Calculator Works

The planner implements the same 5-step sizing methodology used by Kafka administrators at enterprises running thousands of brokers. Enter your workload parameters, and it calculates the minimum broker count, along with the dimension that drives it, in real time.

Screenshot: Kafka Capacity Planner with the Stock Exchange preset, showing 7 brokers needed, a storage bottleneck, utilization bars, and the per-broker hardware spec.

The Four Dimensions

Every Kafka cluster is constrained by four independent dimensions. You must calculate all four and size to the maximum:

Dimension | What Drives It | Bottleneck Sign
--- | --- | ---
Throughput | Message rate × size × replication | Producer latency > 100 ms
Storage | Throughput × retention period | Broker disk usage > 70%
Partitions | Consumer parallelism + ordering | Slow rebalances, high latency
Fault tolerance | Replication factor + failure headroom | Cannot survive broker loss

The calculator evaluates all four and tells you which one is driving your broker count.
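
As a minimal sketch of "size to the maximum": given a minimum broker count for each dimension, the cluster needs the largest of them, and that dimension is the bottleneck. The function name and the sample numbers are hypothetical, chosen only for illustration; they are not the planner's internals.

```python
def brokers_needed(minimums: dict[str, int]) -> tuple[str, int]:
    """Return the driving dimension and the broker count it requires."""
    driver = max(minimums, key=lambda name: minimums[name])
    return driver, minimums[driver]


# Hypothetical per-dimension minimums, for illustration only.
minimums = {"throughput": 5, "storage": 7, "partitions": 3, "fault_tolerance": 4}
driver, brokers = brokers_needed(minimums)
print(f"{brokers} brokers, driven by {driver}")
```

Sizing to anything below the maximum leaves at least one dimension under-provisioned, which is why the other three counts are ignored once the largest is known.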

Step 1: Choose a Preset or Enter Custom Values

Start with one of the five built-in presets that match common workloads:

  • Stock Exchange — 800K msgs/sec peak, 400-byte messages, 8 consumer groups
  • IoT / Telemetry — 2M msgs/sec peak, small 128-byte messages, 14-day retention
  • Log Aggregation — 200K msgs/sec peak, larger 2KB messages, 30-day retention
  • E-commerce Events — 500K msgs/sec peak, 512-byte messages, 7-day retention
  • Custom — blank slate for your exact numbers

The key inputs are:

  • Peak message rate — not your average, but the spike your cluster must handle during volatile periods. This is the most important number.
  • Average message size — sample from your actual topics or estimate from your schema.
  • Replication factor — always 3 for production. Never less for critical data.
  • Retention period — driven by business or regulatory requirements (e.g., SEBI requires 7 days for financial data).
  • Consumer groups — count every application that reads from Kafka (surveillance, risk, settlement, analytics, etc.).
  • Max consumer parallelism — your Spark Streaming executor count × cores, or equivalent.

Step 2: Customize Your Hardware (Optional)

Click Broker Hardware Spec to expand the hardware configuration. The defaults represent a typical production setup:

  • 128 GB RAM
  • 24 vCores
  • 12 × 12 TB NVMe SSDs (JBOD)
  • 10 GbE NIC

But if your environment is different — maybe you have 6 × 4 TB HDDs, or 25 GbE networking — change these values and watch the broker count adjust. The calculator derives three critical capacities from your hardware:

  • Usable storage per broker = disk count × disk size × 85% (formatted capacity)
  • NIC throughput = NIC speed in GbE × 120 MB/s (just under the 125 MB/s theoretical maximum of a 1 GbE link)
  • Safe throughput per broker = 67% of NIC capacity (headroom for spikes)
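
The three derived capacities can be sketched in a few lines. The class and constant names below are assumptions made for this example; only the factors (85%, 120 MB/s per GbE, 67%) and the default hardware values come from the text above.

```python
from dataclasses import dataclass

FORMATTED_FRACTION = 0.85   # usable share of raw disk after formatting
MB_PER_SEC_PER_GBE = 120.0  # practical MB/s per GbE of NIC speed
SAFE_NIC_FRACTION = 0.67    # share of NIC capacity treated as safe


@dataclass
class BrokerHardware:
    """Hypothetical container for the per-broker hardware inputs."""
    disk_count: int = 12
    disk_size_tb: float = 12.0
    nic_gbe: float = 10.0

    @property
    def usable_storage_tb(self) -> float:
        return self.disk_count * self.disk_size_tb * FORMATTED_FRACTION

    @property
    def nic_mb_per_sec(self) -> float:
        return self.nic_gbe * MB_PER_SEC_PER_GBE

    @property
    def safe_mb_per_sec(self) -> float:
        return self.nic_mb_per_sec * SAFE_NIC_FRACTION


default = BrokerHardware()
hdd_box = BrokerHardware(disk_count=6, disk_size_tb=4.0, nic_gbe=25.0)
for name, hw in (("default", default), ("6 x 4 TB HDD, 25 GbE", hdd_box)):
    print(f"{name}: {hw.usable_storage_tb:.1f} TB usable, "
          f"{hw.nic_mb_per_sec:.0f} MB/s NIC, {hw.safe_mb_per_sec:.0f} MB/s safe")
```

With the defaults this works out to about 122.4 TB of usable storage and about 804 MB/s of safe throughput per broker.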

Step 3: Read the Results

The results panel shows four metric cards:

  • Brokers — the minimum number needed, with the bottleneck dimension highlighted
  • Total Storage — cluster-wide storage requirement including 30% headroom
  • Network Throughput — total cluster network load (inbound + outbound + replication)
  • Total Partitions — based on the higher of consumer parallelism or throughput per partition
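
The Total Partitions rule can be sketched as the larger of two requirements. The per-partition throughput target of 10 MB/s is an assumption picked for this example, not a value taken from the planner; substitute a figure measured on your own cluster.

```python
import math


def total_partitions(inbound_mb_per_sec: float,
                     max_consumer_parallelism: int,
                     per_partition_mb_per_sec: float = 10.0) -> int:
    """Partitions needed: the higher of parallelism and throughput demand.

    per_partition_mb_per_sec is a hypothetical target, not a Kafka constant.
    """
    by_throughput = math.ceil(inbound_mb_per_sec / per_partition_mb_per_sec)
    return max(max_consumer_parallelism, by_throughput)


# 320 MB/s inbound with 24 parallel consumers: throughput drives the count.
print(total_partitions(320, 24))
# Same traffic with 64 parallel consumers: parallelism drives the count.
print(total_partitions(320, 64))
```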

Understanding the Bottleneck

The colored insight box tells you what's driving the broker count:

  • Fault tolerance (blue) — your workload fits comfortably. Brokers are driven by the minimum needed to survive failures, not resource pressure. This is common for lighter workloads.
  • Storage (amber) — retention × replication × throughput exceeds what your disks can hold. Solutions: shorten retention, enable tiered storage, or use larger disks.
  • Throughput (amber) — consumer groups multiply read traffic. With 8 consumer groups, outbound traffic is 8× your inbound. Solutions: faster NICs, compression, or fewer consumer groups.
  • Partitions (amber) — high consumer parallelism requires many partitions. Review whether all consumers truly need that level of parallelism.

Step 4: Tune the Broker Count

Use the +/- buttons to increase the broker count above the calculated minimum. The utilization bars update in real time so you can see the impact:

  • Green (< 60%) — comfortable headroom for growth
  • Orange (60–80%) — running tight, plan for expansion
  • Red (> 80%) — at risk, add brokers or upgrade hardware
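
The color bands map directly to a small function. This is a sketch of the thresholds listed above; treating exactly 60% and exactly 80% as orange is an assumption, since the list gives that range as 60–80.

```python
def utilization_band(percent: float) -> str:
    """Classify a utilization percentage into the planner's color bands."""
    if percent < 60:
        return "green"   # comfortable headroom for growth
    if percent <= 80:
        return "orange"  # running tight, plan for expansion
    return "red"         # at risk, add brokers or upgrade hardware


for value in (45, 60, 80, 92):
    print(value, utilization_band(value))
```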

This is useful for planning ahead. If you expect 2× traffic growth in the next year, bump the broker count until utilization drops to a comfortable level.

The per-broker stats line at the bottom shows exactly how much storage, network, and partition load each broker carries at your chosen count.
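
The growth exercise can also be approximated offline. The sketch below finds the smallest broker count that keeps one dimension at or below a target utilization after a growth multiplier. The function names, the 60% default target, and the sample figures are assumptions for illustration only.

```python
import math


def brokers_for_growth(total_load: float,
                       per_broker_capacity: float,
                       growth: float = 1.0,
                       target_utilization: float = 0.60) -> int:
    """Smallest broker count keeping utilization at or below the target."""
    future_load = total_load * growth
    return math.ceil(future_load / (per_broker_capacity * target_utilization))


def utilization(total_load: float, per_broker_capacity: float, brokers: int) -> float:
    """Utilization of one dimension as a fraction of cluster capacity."""
    return total_load / (per_broker_capacity * brokers)


# Hypothetical cluster: 500 TB stored today, 100 TB usable per broker, 2x growth.
brokers = brokers_for_growth(500, 100, growth=2.0)
print(brokers, f"{utilization(500 * 2.0, 100, brokers):.0%}")
```

Run the same calculation for storage and for network load, then take the larger broker count, in line with the size-to-the-maximum rule.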

Real Example: Stock Exchange Cluster

Using the Stock Exchange preset (800K msgs/sec peak, 400 bytes, RF=3, 7-day retention, 8 consumer groups):

  1. Inbound throughput: 800K × 400 bytes = 320 MB/sec
  2. Replication traffic: 320 × (RF − 1) = 320 × 2 = 640 MB/sec
  3. Outbound traffic: 8 groups × 320 = 2,560 MB/sec
  4. Total network: 3,520 MB/sec (~3.5 GB/sec)
  5. Storage: ~737 TB with 30% headroom
  6. Result: 8 brokers — storage is the bottleneck at 64% utilization per broker
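
The network steps of this example can be reproduced with a short script. It assumes decimal megabytes (1 MB = 1,000,000 bytes), which matches the 320 MB/sec figure, and it assumes the preset runs on the default 10 GbE hardware when estimating how many brokers throughput alone would require. The function name is hypothetical.

```python
import math


def network_load_mb_per_sec(peak_msgs_per_sec: int, avg_msg_bytes: int,
                            replication_factor: int, consumer_groups: int) -> dict[str, float]:
    """Break cluster network load into inbound, replication, and outbound."""
    inbound = peak_msgs_per_sec * avg_msg_bytes / 1_000_000
    replication = inbound * (replication_factor - 1)  # copies sent to followers
    outbound = inbound * consumer_groups              # every group reads everything
    return {
        "inbound": inbound,
        "replication": replication,
        "outbound": outbound,
        "total": inbound + replication + outbound,
    }


load = network_load_mb_per_sec(800_000, 400, replication_factor=3, consumer_groups=8)
print(load)

# Assumed default hardware: 10 GbE x 120 MB/s x 67% safe share per broker.
safe_per_broker = 10 * 120 * 0.67
throughput_brokers = math.ceil(load["total"] / safe_per_broker)
print(throughput_brokers)
```

Under those assumptions throughput alone needs 5 brokers, fewer than the storage-driven result, which is consistent with storage being reported as the bottleneck.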

Pro Tips

  • Always size for peak, not average. Kafka clusters that run fine at normal load can fall over during traffic spikes.
  • The 30% storage headroom is already included in the calculator — don't add it again manually.
  • Keep per-broker NIC utilization under 70%. The calculator builds this in by capping safe throughput at 67% of NIC capacity.
  • Rack awareness matters. If you spread brokers across 3 availability zones, consider a broker count that's a multiple of 3 so each zone holds the same number of brokers.
  • Run the calculation quarterly. Traffic patterns change. What was comfortable 6 months ago may be tight today.
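
For the rack-awareness tip, rounding a calculated broker count up to the next multiple of the zone count is simple arithmetic. The helper below is a hypothetical illustration, not a planner feature.

```python
def round_up_to_zones(brokers: int, zones: int = 3) -> int:
    """Round a broker count up to the next multiple of the zone count."""
    return -(-brokers // zones) * zones  # ceiling division, then scale back


for count in (7, 8, 9, 10):
    print(count, "->", round_up_to_zones(count))
```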

Try It Now

Open the Kafka Capacity Planner and size your cluster in under a minute. No signup, no data collection — it runs entirely in your browser.
