With Akka, there are multiple ways to build fault-tolerant systems that recover from failure. You can build Akka clusters that run in any environment and have built-in resilience to cluster disruptions. You can also deploy Akka services to Akka Automated Operations, a managed platform that runs within your cloud’s VPC. If your services are deployed across multiple regions (or across multiple clouds, including GCP, AWS, and Azure), Akka automates the replication of your data across regions, with additional controls that enable failover when a region goes down and recovery when that region is brought back up.
In distributed systems, we constantly balance the tension between high availability (deploying across multiple regions) and strong consistency. A common pattern to achieve this is the Single Writer Principle: for any given entity instance (like a specific flight booking or a customer cart), one region holds the "System of Record" while others act as read-only replicas.
This primary writer role is not static. Under normal circumstances, an entity's primary location is automatically moved from one region to the region where a new write request occurs. To ensure safety, Akka uses a consensus protocol to fully flush and replicate all events from the old region before handing over authority to the new one. This is a quick exchange of events between the regions. However, this fully consistent approach has a trade-off: it requires all regions to be available to perform the handshake.
If a region goes offline or a network partition occurs, the consensus protocol cannot complete. You cannot gracefully switch the writer. To unlock the surviving regions, an operator must make a manual decision to "down" the unavailable region.
Once that decision is made, the entities in the surviving regions can resume writing. But because event replication is asynchronous and not part of a global distributed transaction, events written in the failed region just moments before the crash may not have left the building.
You have now entered the territory of Split Brain. You have two divergent timelines of history for the same entity.
In this post, we’ll explore how Akka’s event sourced entities handle this scenario without data loss, using a "scalpel, not sledgehammer" approach to recovery.
Let's visualize the lifecycle of a specific entity instance—Seat 1B on a flight—across two regions: Region A (original primary) and Region B (survivor).
Our story involves three passengers: Alice (who books Seat 1A normally), Bob (who books Seat 1B during the crash), and Charlie (who books Seat 1B after the failover but before the regions reconnect).
Under normal conditions, Akka ensures strong consistency. When Alice books Seat 1A, the event is persisted in Region A’s journal and immediately replicated to Region B. Both regions are in sync.
Disaster strikes. A network partition isolates Region A. Bob books Seat 1B in Region A, and the event is persisted in Region A’s journal, but it never replicates to Region B. An operator downs Region A, and Charlie books the same seat in Region B.

We now have divergent history: Region A thinks Bob has the seat; Region B thinks Charlie has it.
When Region A restarts and the network heals, Akka automatically resumes ordinary event replication.
How does the system know there is a conflict? It relies on Version Vectors.
A version vector is a mechanism for tracking the causal history of events in a distributed system. It lets the system determine whether one event happened before another, or whether the two happened concurrently. For a specific entity (Seat 1B in this case), the vector holds one counter, or sequence number, per region, incremented each time that region writes a new event.
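The comparison rule can be sketched in a few lines. This is a conceptual illustration of version-vector semantics, not Akka's internal implementation; the region names and counters mirror our Seat 1B story.

```python
def compare(v1: dict, v2: dict) -> str:
    """Compare two version vectors (region -> counter).

    Returns "before" if v1 is an ancestor of v2, "after" if v2 is an
    ancestor of v1, "same" if they are equal, or "concurrent" if
    neither dominates the other (a conflict).
    """
    regions = set(v1) | set(v2)
    v1_le = all(v1.get(r, 0) <= v2.get(r, 0) for r in regions)
    v2_le = all(v2.get(r, 0) <= v1.get(r, 0) for r in regions)
    if v1_le and v2_le:
        return "same"
    if v1_le:
        return "before"
    if v2_le:
        return "after"
    return "concurrent"

# Both regions saw Alice's booking, so they share the same causal history:
common = {"region-a": 1, "region-b": 0}

# Bob's write increments Region A's counter; Charlie's increments Region B's:
bobs_event = {"region-a": 2, "region-b": 0}
charlies_event = {"region-a": 1, "region-b": 1}

print(compare(common, bobs_event))          # → before (normal replication)
print(compare(bobs_event, charlies_event))  # → concurrent (conflict!)
```

Because Bob's vector and Charlie's vector each have one counter the other lacks, neither is an ancestor of the other, which is exactly the "concurrent" case that signals a split-brain conflict.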
When replication resumes, the entity instance compares these vectors. Since neither vector is a direct ancestor of the other, Akka flags a concurrent conflict. The resolution function then executes, deciding to discard the event from the downed region to preserve the integrity of the survivor's timeline.
This conflict resolution strategy is often a good fit, but Akka also supports plugging in other resolution strategies, such as business-domain decisions or CRDT-style merge functions.
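To make the idea of a pluggable resolution function concrete, here is a minimal sketch. The function names and `Event` shape are illustrative assumptions, not Akka's API; they show the kind of decision a resolution strategy makes when handed two concurrent events.

```python
from dataclasses import dataclass

@dataclass
class Event:
    region: str
    payload: str
    timestamp: int  # wall-clock millis, for illustration only

# The strategy described above: keep the survivor's event, discard
# the event that originated in the downed region.
def prefer_survivor(local: Event, remote: Event, downed: str) -> Event:
    return remote if local.region == downed else local

# A business-domain alternative: last-writer-wins by timestamp.
def last_writer_wins(local: Event, remote: Event, downed: str) -> Event:
    return local if local.timestamp >= remote.timestamp else remote

bob = Event("region-a", "SeatReserved by Bob", 1000)
charlie = Event("region-b", "SeatReserved by Charlie", 2000)

winner = prefer_survivor(charlie, bob, downed="region-a")
print(winner.payload)  # → SeatReserved by Charlie
```

Either way, the resolution is deterministic: every region that replays the same pair of concurrent events arrives at the same winner.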
You might ask: "We just discarded Bob's booking. Is that really a win?"
Compared to traditional database backups or active-passive replication, Akka’s approach offers three critical advantages:
First, discarded does not mean deleted. In a traditional database failover, "last write wins" often means the loser is overwritten and vanishes. Not so in Akka.
Bob’s event (SeatReserved) remains immutable in Region A’s local journal, and is even replicated to and stored in Region B’s journal. It is simply ignored when constructing the entity's current state. This gives you a perfect forensic audit trail: you can run a projection to find all "discarded" events and trigger compensating actions, like automatically emailing Bob a voucher or refund.
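The compensation idea can be sketched as a simple scan over journal entries. Everything here is hypothetical, including the `discarded` flag and the journal shape; it just shows how an audit trail of set-aside events can drive compensating actions.

```python
# Hypothetical journal entries after conflict resolution; a real
# projection would consume these from the event journal, not a list.
journal = [
    {"event": "SeatReserved", "passenger": "Alice", "discarded": False},
    {"event": "SeatReserved", "passenger": "Bob", "discarded": True},
    {"event": "SeatReserved", "passenger": "Charlie", "discarded": False},
]

def compensate(entry: dict) -> str:
    # In a real system this might email a voucher or issue a refund.
    return f"voucher issued to {entry['passenger']}"

actions = [compensate(e) for e in journal if e["discarded"]]
print(actions)  # → ['voucher issued to Bob']
```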
Second, traditional disaster recovery often relies on point-in-time recovery. If you restore a database backup from 5 minutes ago, you lose everyone's data from that 5-minute window, even for users who had nothing to do with the problem.
Akka's recovery, by contrast, is fine-grained at the entity instance level: only the conflicting entity needs resolution, and the rest of your application keeps running without rolling back unrelated data.
Third, we never roll back time. We always merge forward. By combining the replicated history from before the crash with the survivor's history after the crash, Akka ensures the system state always progresses. We maximize data retention rather than settling for the "lowest common denominator" of an old backup tape.
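Merging forward can be pictured as assembling the resolved timeline from the surviving pieces. The event labels below are illustrative shorthand for our story, not a real journal format.

```python
# History replicated to both regions before the crash:
pre_crash = ["SeatReserved(1A, Alice)"]

# Written in Region A during the partition, never replicated in time:
region_a_only = ["SeatReserved(1B, Bob)"]

# Written in Region B after the failover:
survivor = ["SeatReserved(1B, Charlie)"]

# The resolved state moves forward: pre-crash history plus the
# survivor's events. Nothing is rolled back to an older snapshot.
resolved = pre_crash + survivor

# The discarded event is excluded from state, but kept for auditing.
audit_trail = pre_crash + region_a_only + survivor

print(resolved)     # → ['SeatReserved(1A, Alice)', 'SeatReserved(1B, Charlie)']
print(audit_trail)
```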
With Akka's event sourced entities, we move beyond the idea that a "Split Brain" is a catastrophic, unrecoverable event. Instead, it becomes a managed state with a deterministic resolution. It turns disaster recovery from a frantic fire drill into a mathematical merge operation—keeping your business online and your data auditable, no matter what the infrastructure throws at you.