Design techniques for building stateful cloud-native applications: Part 2 – distributed state
An introduction to distributed state
In Part 1 of our series on design techniques for building stateful, cloud-native applications, we looked at the topics of resiliency, recoverability, and scalability for cloud-native applications like Reactive Microservices and real-time streaming systems. In this part, we take a look at application state, and why we should distribute it. Let’s start with defining the word state.
Merriam-Webster defines it as “a condition or stage in the physical being of something”. Using the insect kingdom as an example, there would be an egg, larva, pupa, and adult. Virtually all things have state, and that is also true of computing. State is useful in order to make decisions or take actions. An insect in the egg state would have very different behavior than one in the adult state; in the egg state, the insect can either die or metamorphosize into a larva, while the adult can only die. This is just one example of what I’ll call contextual-state, where application state, as with beauty, is also in the eye of the beholder. What is application state?
Just as in the insect example, objects in your application also have current state and behavior over time. For example, the accumulated movements of an electric car might result in a geographic position of 40.712776 latitude and -74.005974 longitude at 12 PM, on Monday, March 11, 2019. That snapshot of the object’s state at this point is very important when it is desired to move the car in a certain direction; yet once that decision is made, it is then reflected in a new state of the vehicle.
What is subjective state?
State needs to exist within a certain context, for example within a bounded-context if we are to consider domain-driven-design (DDD).
Let’s take for example an airport model and look at two bounded-contexts: ground control and departures. Both of these contexts have very different interests, ground control moves planes and other vehicles around the tarmac, while departures are interested in getting a plane flown safely out of the local airspace.
The SOA architect might see the aircraft as the common item and attempt to model it centrally to be shared among these two contexts; however, as you can imagine, this just leads to a functional bottleneck for no good reason. In fact, with these two example contexts, most likely the only shared attribute would be the aircraft’s identity or callsign. All the other attributes would be of unique interest to the given context. For example, ground control would not care about altitude, destination or passenger manifest, while departures probably would.
We can introduce the term subjective-state for cases when you have a view on something that is specific to your own needs, or even at a certain time granularity or summarization. Any attempt to share state across bounded contexts flies in the face of DDD. State should be kept private and both highly available and highly accurate in order for the domain to make quick and accurate decisions. We’ll explain how to accomplish the lofty goals of availability and accuracy a bit later.
What about events?
Any discussion of application state will likely include the topic of events and event-driven architecture (EDA). Now that we’ve established that–in most cases–the proper use of state makes it meaningless to share with the outside world, what do we do?
Events just happen to already exist in your business, whether you are capitalizing on them or not. The world is made up of events and so are our systems. If you know enough about your business you can lock yourself in a room and capture all of your business events on sticky notes, this is called event storming.
Once you capture all of the events it is a matter of grouping them according to the bounded context in which they belong and directing your development teams to have at it, roughly translating a bound context to an individual set of microservices to be developed.
This is where event-sourcing, an alternative to the CRUD (create/read/update/delete) model of storing current state, comes into play. Event sourcing makes use of immutable facts that are stored, in order, in a durable event log, representing the history of the state of the domain over time. State is then relegated to any observer of “order” in the system, each having its own unique pivot on the data. For example, OrderCreated, OrderShipped or OrderBilled are different event states previously visible with CRUD only as a snapshot; with event sourcing, this is now derived by taking into account all of the events in the order they occurred that lead to that state.
CRUD vs event sourcing
Event sourcing is a mature concept and has been around for decades for the likes of banking and telemetry-based applications, such as within the energy or manufacturing industries. Martin Fowler was writing about it as early as 2005 here. Event sourcing just so happens to map perfectly into our “sharing stuff” challenge. The world is moving to event sourcing more now than ever, due to the growth of streaming analytics and systems of microservices.
The following example shows event sourcing in use in an orders system and what it looks like compared to the CRUD (create/read/update/delete) current state model we are so used to building.
You’ll notice in the example above that the event-sourced domain storage is made up of things that have occurred upon that piece of the domain, the order, rather than just the current state of the order. In one fell swoop, we have gained insight into behavior over time and utilize concrete events, understood by the business to be our system of record.
This is rather an arbitrary use of a current state model. The domain storage is also vastly simplified and uniform across domains since all events look the same, some id, event-type, JSON or other payload and a timestamp. No compound keys, or guesswork, and there is another great benefit too: these arbitrary event IDs can be used to uniformly shard the data across clustered storage, or uniformly shard your domain across your application cluster! Much more on this later.
So behavior determines state, but state also affects behavior. An injured soccer player is not someone who kicks a ball or throws it, but an injured soccer (football for our non-US readers) player is really a patient. When the soccer player is no longer injured he can become available for play and join a game and so it goes. Going back to the aircraft example, behavior determining state would be an airplane with its doors open and still available for booking. When the doors close, no further bookings are available and the airliners take on the states of leaving the gate, entering the air, etc. While in those different states, the airliner has different characteristics, such as in the example of the injured soccer player.
On distributed state, or better yet, distributed domain
The term distributed state really addresses what I’d consider the old way of dealing with scale on an N-tier stateless architecture. Why did I just type that? Didn’t I already mention that pretty much all applications have state?
What happens in a stateless architecture is that the design is bound by the lack of a properly clustered application (or business) tier. In the absence of a clustered application tier, we’re forced to distributed our state, using the only thing available to us and capable of such things–our database. Now we’ve removed the ability of the domain to make real-time decisions. Since the state is not private, nor held by the domain model the application nodes are forced to query the state from the database before making a decision, and finally writing the new state back to the database.
While this is happening you squint your eyes real tight and look away, hoping that another “stateless” node isn’t making the same sort of decision at the same point in time, duplicating the data writes and polluting your data. We see this happen all the time, and even to people who understand the problem well enough to fix it will soon learn that the only answer is to utilize multiple-keyed collections in your database to make it impossible to write the same thing twice.
This however now leaks domain behavior back into your data store, just like the old days of triggers and stored procedures.
So what do we do? Luckily we no longer live in the old days and distributed, clustered, cloud-native application software does exist in the form of Akka. Akka harnesses your entire cloud and presents it as a single, virtual supercomputer–and a very resilient one at that.
For the first time, our entire application can fit in memory as needed. We no longer need to be “request bound”. A single request can trigger any number of asynchronous behaviors across your microservices, which lets developers avoid thinking about doing a chunk of work and then offloading to a message bus so another node can help do the work.
Akka uses the tried and true actor model for addressable application components across your cloud. Akka also provides an illusion of now, due to the actor interface being a mailbox. While the actor is processing a message from its mailbox, it will not be interrupted until that message has been fully processed, and its internal state has been changed as a result.
This illusion of now also exists in domain-driven design in the form of an aggregate root. An aggregate root represents an encapsulation of behavior with strongly consistent data within. This maps neatly into Akka actors, so much so that Akka has the concept of the persistent entity. The persistent entity maintains its state privately and makes decisions upon that state. These decisions take the form of events, which are written to the event store database of your choice. Since the events fully tell the story of that piece of the domain, such as an order, it is a simple matter of instantiating the persistent entity actor on a healthy node of the cluster and replaying the events and derive current state and then being ready to do business.
Akka Actors are resilient in that they may be kept in memory indefinitely or set to time-out after periods of nonuse. They are uniformly sharded across your application cluster and moved to different nodes as needed. These persistent, sharded actors are singletons, meaning there will ever only be one instantiated somewhere. This even goes for a failed network where a network partition (aka split brain scenario) occurs. Akka will use a chosen strategy to decide which side of the partitions cluster is the “good side” and which nodes to shut down, guarding against data contamination that would result in multiple singletons existing at the same time.
This singleton safety also applies to clustering and persistence across multiple data centers as the events happen to be a convenient thing to share for the purposes of failover, redundancy or geo-servicing of your users.
The same guarantee exists to enforce the single writer principle, where, though hot replicas exist across the data centers, only a single persistent actor is ever effectively doing business at any given time, achieved by applying conflicting update resolution. Read more about that here.
Availability and accuracy
Now that we have an idea of how distributed, persistent entities work, it’s not a stretch to figure out how we accomplish availability and accuracy. Entities in a well-tuned cluster will have the utmost availability since they are already in memory (this is tunable), and are self-sufficient in that they do not need to rely on real-time database reads to make decisions.
Typical latency for communications between actors in the cluster is in microseconds (µs). In addition to this awesome availability, we also have unparalleled accuracy, since the entities are the keepers of their own state there is only one single place where decisions are being made and events being issued for any particular piece of the domain, such as an order.
As a byproduct of all this, we end up with a cloud-native architecture with significantly less noise; and by that I really mean data flying around between all tiers, all the time, and usually in real time. The following images show what I mean by noise, the first one is a typical “stateless” architecture, the second is a Reactive architecture, utilizing distributed state.
In a typical 3-tier architecture (above), we are forced to traverse all tiers at least once for every request. In most cases, there are also lateral callouts to other services as well to further add “noise”.
In a Reactive architecture (above), state is the only thing you need to adequately do your job. Here it is properly distributed across your application tier, meaning that the load and behavior are also naturally distributed across your application cluster.
To sum up
Hopefully, we’ve explained the value of having an event-driven, distributed domain with internal state. We saw how state could mean very different things to different observers and how events are the things we should be sharing. We have shown how it is not possible to design your systems using actors to model the world without regard to limited resources, state contention or any sacrifices. We’ve also seen how noisy a traditional 3-tier architecture is compared to a reactive architecture, and that the traditional way is no longer a good fit. With everything described in this article already available to you in Akka, you can leave the heavy lifting to us!