Understanding Kafka with Factorio

Thanks to Tom de Ruijter, Steven Reitsma and Laurens Koppenol for proofreading this post.

While playing Factorio the other day, I was struck by the many similarities with Apache Kafka. If you aren’t familiar with them: Factorio is an open-world RTS where you build and optimize supply chains in order to launch a satellite and restore communications with your home planet, and Kafka is a distributed streaming platform, which handles asynchronous communication in a durable way.

I wonder how far we can take the analogy between Factorio and Kafka before it starts to break down. Let’s start from scratch, explore the core Kafka concepts through Factorio visualizations, and have some fun along the way.

If you don’t have a lot of time to spare, don’t download Factorio.

Why bother with async messaging?

Let’s say we have three microservices. One for mining iron ore, one for smelting iron ore into iron plates, and one for producing iron gear wheels from these plates. We can chain these services with synchronous HTTP calls. Whenever our mining drill has new iron ore, it makes a POST call to the smelting furnace, which in turn POSTs to the factory.

From left to right: mining, smelting and producing — tightly coupled via synchronous communication
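To make the coupling concrete, here is a minimal sketch of what the mining drill’s side of such a chain could look like. The service name, endpoint and payload are made up for illustration; the point is that the drill only succeeds if the furnace answers right now.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MiningDrill {

    private static final HttpClient client = HttpClient.newHttpClient();

    // Called every time the drill produces a unit of iron ore.
    // If the furnace is down, this call fails and the ore is lost,
    // unless we add retries, buffering or circuit breakers ourselves.
    static void shipIronOre(int amount) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://furnace:8080/iron-ore"))  // hypothetical endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"amount\": " + amount + "}"))
                .build();

        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() != 200) {
            throw new RuntimeException("Furnace rejected the ore: " + response.statusCode());
        }
    }
}
```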

This setup served us well, until there was a power outage in the factory. The furnace’s HTTP calls failed, causing the mining drill’s calls to fail as well. We can implement circuit breakers and retries to prevent cascading failures and message loss, but at some point we’ll have to stop trying, or we’ll run out of memory.

Power outage at the factory

If only there was a way to decouple these microservices… This is, of course, where Kafka comes in. With Kafka, you can store streams of records in a fault-tolerant and durable way. In Kafka terminology, these streams are called topics.

Microservices decoupled by asynchronous messaging
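As a rough sketch of the decoupled setup (the topic name, broker addresses and serializers are assumptions, not from the original setup), the mining drill now just appends records to an iron-ore topic and moves on; whether the furnace is up or down is no longer its problem:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MiningDrill {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092");  // hypothetical brokers
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Fire and forget: the record is stored durably on the topic,
            // whether or not the furnace is currently able to consume it.
            producer.send(new ProducerRecord<>("iron-ore", "drill-1", "1 iron ore"));
        }
    }
}
```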

With asynchronous topics between services, messages (or records, in Kafka terminology) are buffered during peak loads and during outages. These buffers obviously have limited capacity, so let’s talk about scalability.

We can increase storage capacity and throughput by adding Kafka servers to the cluster. Another way is to increase disk size (for storage) or CPU and network speed (for throughput). Which of these options gives you the best value for money is use-case specific, but buying bigger servers, unlike buying more servers, is subject to the law of diminishing returns. Kafka’s capacity scales linearly with each node added, so adding servers is usually the way to go.

Vertical scaling — a bigger, exponentially more expensive server

Horizontal scaling — distribute the load over more servers

To divide a topic between multiple servers, we need a way to split a topic into smaller substreams. These substreams are called partitions. Whenever a service produces a new record, this service gets to decide which partition the record should land on.

A wagon producing records, a partitioner that puts messages on the right partition, and a topic with four partitions.
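The partition count is set per topic when it is created. As an illustration, a topic with four partitions like the one in the figure could be created with Kafka’s admin client; the topic name, broker address and replication factor here are assumptions:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateIronPlateTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092");  // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic with four partitions; replication factor 1 keeps the example small.
            NewTopic topic = new NewTopic("iron-plate", 4, (short) 1);
            admin.createTopics(List.of(topic)).all().get();  // block until the brokers confirm
        }
    }
}
```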

The default partitioner hashes the message key and takes that hash modulo the number of partitions:
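Roughly, and simplified (Kafka’s DefaultPartitioner actually murmur2-hashes the serialized key bytes and treats records without a key differently), the keyed path boils down to this:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultPartitionerSketch {

    // Simplified sketch: hash the key, then take it modulo the partition count.
    // Kafka uses a murmur2 hash of the key bytes rather than Arrays.hashCode,
    // but the idea is the same.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        return (hash & 0x7fffffff) % numPartitions;  // mask keeps the result non-negative
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition.
        System.out.println(partitionFor("furnace-7", 4));
        System.out.println(partitionFor("furnace-7", 4));
    }
}
```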

That way messages with the same key always end up on the same partition.

Note that messages are only guaranteed to be ordered within the context of a producer and partition. Records from multiple producers, or from a single producer on multiple partitions, can interleave.
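In practice, this means that records you want to keep in order should share a key and come from the same producer. A small sketch, with an illustrative topic and keys:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedUpdates {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092");  // hypothetical broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key, same producer: these land on the same partition, in send order.
            producer.send(new ProducerRecord<>("furnace-status", "furnace-7", "heating up"));
            producer.send(new ProducerRecord<>("furnace-status", "furnace-7", "smelting"));
            producer.send(new ProducerRecord<>("furnace-status", "furnace-7", "out of coal"));

            // A different key may map to a different partition, so there is no
            // ordering guarantee between furnace-7's records and this one.
            producer.send(new ProducerRecord<>("furnace-status", "furnace-8", "smelting"));
        }
    }
}
```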
