With Akka, there are multiple ways to build fault-tolerant systems that recover from failure. You can build Akka clusters that run in any environment and have built-in resilience to cluster disruptions. You can also deploy Akka services to Akka Automated Operations, a managed platform that runs within your cloud’s VPC. If your services are deployed across multiple regions (or across multiple clouds, including GCP, AWS, and Azure), Akka automates the replication of your data across regions, with additional controls that enable failover when a region goes down and recovery when that region is brought back up.
In distributed systems, we constantly balance the tension between high availability (deploying across multiple regions) and strong consistency. A common pattern to achieve this is the Single Writer Principle: for any given entity instance (like a specific flight booking or a customer cart), one region holds the "System of Record" while others act as read-only replicas.
This primary writer role is not static. Under normal circumstances, an entity's primary location is automatically moved from one region to the region where a new write request occurs. To ensure safety, Akka uses a consensus protocol to fully flush and replicate all events from the old region before handing over authority to the new one. This is a quick exchange of events between the regions. However, this fully consistent approach has a trade-off: it requires all regions to be available to perform the handshake.
If a region goes offline or a network partition occurs, the consensus protocol cannot complete. You cannot gracefully switch the writer. To unlock the surviving regions, an operator must make a manual decision to "down" the unavailable region.
Once that decision is made, the entities in the surviving regions can resume writing. But because event replication is asynchronous and not part of a global distributed transaction, events written in the failed region just moments before the crash may not have left the building.
You have now entered the territory of Split Brain. You have two divergent timelines of history for the same entity.
In this post, we’ll explore how Akka’s event sourced entities handle this scenario without data loss, using a "scalpel, not sledgehammer" approach to recovery.
Let's visualize the lifecycle of a specific entity instance—Seat 1B on a flight—across two regions: Region A (original primary) and Region B (survivor).
Our story involves three passengers: Alice (who books Seat 1A normally), Bob (who books Seat 1B during the crash), and Charlie (who books Seat 1B after the failover but before the regions reconnect).
Under normal conditions, Akka ensures strong consistency. When Alice books Seat 1A, the event is persisted in Region A’s journal and immediately replicated to Region B. Both regions are in sync.
Disaster strikes. A network partition isolates Region A. Bob books Seat 1B in Region A, and the event is persisted in Region A’s journal, but it never replicates to Region B. An operator downs Region A, and Charlie books the same seat in Region B.

We now have divergent history: Region A thinks Bob has the seat; Region B thinks Charlie has it.
When Region A restarts and the network heals, Akka automatically resumes ordinary event replication.
How does the system know there is a conflict? It relies on Version Vectors.
A version vector is a mechanism for tracking the causal history of events in a distributed system. It lets the system determine whether one event happened before another, or whether the two happened concurrently. For a specific entity (Seat 1B in this case), the vector holds one counter, or sequence number, per region, incremented each time that region writes a new event.
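The comparison rule can be sketched in a few lines. This is a conceptual illustration of version-vector semantics, not Akka's internal implementation; the region names and counters mirror our Seat 1B story.

```python
def compare(v1: dict, v2: dict) -> str:
    """Compare two version vectors (region -> counter).

    Returns "before" if v1 is an ancestor of v2, "after" if v2 is an
    ancestor of v1, "same" if they are equal, or "concurrent" if
    neither dominates the other (a conflict).
    """
    regions = set(v1) | set(v2)
    v1_le = all(v1.get(r, 0) <= v2.get(r, 0) for r in regions)
    v2_le = all(v2.get(r, 0) <= v1.get(r, 0) for r in regions)
    if v1_le and v2_le:
        return "same"
    if v1_le:
        return "before"
    if v2_le:
        return "after"
    return "concurrent"

# Both regions saw Alice's booking, so they share the same causal history:
common = {"region-a": 1, "region-b": 0}

# Bob's write increments Region A's counter; Charlie's increments Region B's:
bobs_event = {"region-a": 2, "region-b": 0}
charlies_event = {"region-a": 1, "region-b": 1}

print(compare(common, bobs_event))          # → before (normal replication)
print(compare(bobs_event, charlies_event))  # → concurrent (conflict!)
```

Because Bob's vector and Charlie's vector each have one counter the other lacks, neither is an ancestor of the other, which is exactly the "concurrent" case that signals a split-brain conflict.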
When replication resumes, the entity instance compares these vectors. Since neither vector is a direct ancestor of the other, Akka flags a concurrent conflict. The resolution function then executes, deciding to discard the event from the downed region to preserve the integrity of the survivor's timeline.
This conflict resolution strategy is often a good fit, but Akka also supports plugging in other resolution strategies, such as business-domain decisions or CRDT-style merge functions.
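To make the idea of a pluggable resolution function concrete, here is a minimal sketch. The function names and `Event` shape are illustrative assumptions, not Akka's API; they show the kind of decision a resolution strategy makes when handed two concurrent events.

```python
from dataclasses import dataclass

@dataclass
class Event:
    region: str
    payload: str
    timestamp: int  # wall-clock millis, for illustration only

# The strategy described above: keep the survivor's event, discard
# the event that originated in the downed region.
def prefer_survivor(local: Event, remote: Event, downed: str) -> Event:
    return remote if local.region == downed else local

# A business-domain alternative: last-writer-wins by timestamp.
def last_writer_wins(local: Event, remote: Event, downed: str) -> Event:
    return local if local.timestamp >= remote.timestamp else remote

bob = Event("region-a", "SeatReserved by Bob", 1000)
charlie = Event("region-b", "SeatReserved by Charlie", 2000)

winner = prefer_survivor(charlie, bob, downed="region-a")
print(winner.payload)  # → SeatReserved by Charlie
```

Either way, the resolution is deterministic: every region that replays the same pair of concurrent events arrives at the same winner.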
You might ask: "We just discarded Bob's booking. Is that really a win?"
Compared to traditional database backups or active-passive replication, Akka’s approach offers three critical advantages:
First, discarded does not mean deleted. In a traditional database failover, "last write wins" often means the loser is overwritten and vanishes. Not so in Akka.
Bob’s event (SeatReserved) remains immutable in Region A’s local journal, and is even replicated to and stored in Region B’s journal. It is simply ignored when constructing the entity's current state. This gives you a perfect forensic audit trail: you can run a projection to find all "discarded" events and trigger compensating actions, like automatically emailing Bob a voucher or refund.
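The compensation idea can be sketched as a simple scan over journal entries. Everything here is hypothetical, including the `discarded` flag and the journal shape; it just shows how an audit trail of set-aside events can drive compensating actions.

```python
# Hypothetical journal entries after conflict resolution; a real
# projection would consume these from the event journal, not a list.
journal = [
    {"event": "SeatReserved", "passenger": "Alice", "discarded": False},
    {"event": "SeatReserved", "passenger": "Bob", "discarded": True},
    {"event": "SeatReserved", "passenger": "Charlie", "discarded": False},
]

def compensate(entry: dict) -> str:
    # In a real system this might email a voucher or issue a refund.
    return f"voucher issued to {entry['passenger']}"

actions = [compensate(e) for e in journal if e["discarded"]]
print(actions)  # → ['voucher issued to Bob']
```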
Second, traditional disaster recovery often relies on point-in-time recovery. If you restore a database backup from 5 minutes ago, you lose everyone's data from that 5-minute window, even for users who had nothing to do with the problem.
Akka's recovery, by contrast, is fine-grained at the entity instance level: only the conflicting entity needs resolution, and the rest of your application keeps running without rolling back unrelated data.
Third, we never roll back time. We always merge forward. By combining the replicated history from before the crash with the survivor's history after the crash, Akka ensures the system state always progresses. We maximize data retention rather than settling for the "lowest common denominator" of an old backup tape.
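Merging forward can be pictured as assembling the resolved timeline from the surviving pieces. The event labels below are illustrative shorthand for our story, not a real journal format.

```python
# History replicated to both regions before the crash:
pre_crash = ["SeatReserved(1A, Alice)"]

# Written in Region A during the partition, never replicated in time:
region_a_only = ["SeatReserved(1B, Bob)"]

# Written in Region B after the failover:
survivor = ["SeatReserved(1B, Charlie)"]

# The resolved state moves forward: pre-crash history plus the
# survivor's events. Nothing is rolled back to an older snapshot.
resolved = pre_crash + survivor

# The discarded event is excluded from state, but kept for auditing.
audit_trail = pre_crash + region_a_only + survivor

print(resolved)     # → ['SeatReserved(1A, Alice)', 'SeatReserved(1B, Charlie)']
print(audit_trail)
```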
With Akka's event sourced entities, we move beyond the idea that a "Split Brain" is a catastrophic, unrecoverable event. Instead, it becomes a managed state with a deterministic resolution. It turns disaster recovery from a frantic fire drill into a mathematical merge operation—keeping your business online and your data auditable, no matter what the infrastructure throws at you.