Design techniques for building stateful, cloud-native applications: Part 1 - resiliency, recoverability, and scalability
How to build stateful cloud-native applications
For a long time, stateless services have been the primary choice for developers. One of the reasons for using a stateless protocol is that it provides for resiliency from failures, recovery strategies in the event of failures, and the option to scale processing capacity up and down to handle variances in traffic.
A common pattern is to stand up stateless servers behind a load balancer, as shown above. The load balancer provides a single point of contact for clients. Clients connect to the load balancer, send requests to it, and the load balancer routes these requests to one of the available servers that are connected to the load balancer.
When the client traffic increases more servers can be started and connected to the load balancer. The load balancer then routes traffic to each of the servers using a request distribution strategy, such as round-robin distribution strategy.
When a server goes offline, as shown above, the load balancer detects this and routes traffic to the remaining online servers.
While on the surface building a stateless web app or microservice may be the right way to go, in some cases it is not necessarily the best approach for cloud-native services.
When does it make sense to go stateless?
Wikipedia defines stateless as:
“In computing, a stateless protocol is a communications protocol in which no session information is retained by the receiver, usually a server. Relevant session data is sent to the receiver by the client in such a way that every packet of information transferred can be understood in isolation, without context information from previous packets in the session. This property of stateless protocols makes them ideal in high volume applications, increasing performance by removing server load caused by retention of session information."
There are two key advantages when building stateless systems. First, the programming complexity is reduced. Incoming requests are received, processed, and forgotten. Second, there is no need to maintain state, and often the complexity revolves around maintaining session state, which typically involves replicating the session state across the cluster. This replication approach is used to maintain the session state when one of the servers goes offline.
In the last sentence of the Wikipedia definition of a stateless protocol, it states - “This property of stateless protocols makes them ideal in high volume applications, increasing performance by removing server load caused by retention of session information.” While it is true that the stateless approach does not have the overhead of maintaining session state it does introduce processing patterns that have their own overhead and performance costs.
The term stateless is somewhat misleading. Applications by their very nature deal with the state of things that is what they do, they create, read, update, and delete stateful items. The typical processing flow of a stateless process is to receive a request, retrieve the state from a persistence store, such as a relational database, make the requested state changes, store the changed state back into the persistence stores, and then forget that anything happened.
While there may be reductions in overhead related to not maintaining session state on the servers there may be costs associated with delegating state management outside of the application, such as delegating the sole responsibility for state management to the persistence layer.
This cost often is seen when the persistence layer slows down due to high contention. It may be true that it is possible to scale processing capacity at the application layer with the ability to increase or decrease the number of stateless servers, it is also true that the persistence layer does not have unlimited processing capacity. Once the persistence processing capacity is exceeded the application often cannot go any faster.
It is also important to understand that the decision to use the stateless approach has contributed to the persistence capacity limits by delegating state management from the application layer to the persistence layer.
Why use a stateful approach?
For many developers and architects working with cloud-native applications, our intuition tells us that on the surface a stateful approach has advantages. The most obvious benefit is the potential for a reduction in the overhead associated with retrieving state on every request. However, our intuition also tells us that maintaining state has an associated cost with the potential for increased complexity. Often, however, this perception of increased complexity is because we are looking at the problem from the perspective of our current way of doing things. That is our current approaches for maintaining state across a cluster and our current relational CRUD based ways for handling persistence.
There are stateful alternatives. Here we will look at two of the stateful alternatives:
- Clustered state management
- Event based state persistence
These two stateful alternatives share an events first way of processing and persisting state changes.
First, we will look at the stateful persistence approach. As mentioned this approach is focused on persisting events. Using the classic shopping cart scenario, each change to the state of a shopping cart is persisted as a sequence of events.
Shown in the table example above is a series of shopping cart state change events. This is an example of an event log. The events are persisted to an event log stored in a database as each event occurs.
Events are statements of fact, a log of things that happened at some point in the past, a historical record. In the above event log, the events aggregate to show that your shopping cart contains one item, item 1567 and a shipping and billing address. My shopping cart contains two items along with shipping and billing addresses. Finally, there is the other shopping cart that contains two items.
Also, note that the event log has recorded various shopping cart changes. For example, you removed an item from your cart. The other user changed one of the cart items. This is an example of types of historical data that is typically lost when using the traditional CRUD based persistence approach.
At any point in time, it is possible to determine the state of a shopping cart by replaying the events up until that time. Of course, this makes it possible to recover the current state of any shopping cart at the current time. It also makes it possible to view the state for a cart at a time in the past. For example, your cart contained two items at 08:20.
One of the advantages of persisting data using events is that it is now possible to record all of the interesting events that happened over time that resulted in the current state of each shopping cart. This event data, such as removing or changing items, can be extremely interesting for downstream data analytics.
Another event log advantage is that the persistence data structure is a simple key and value pair. In this example case, the key is user Id, item Id, and time and the value is the event data.
The event log is also idempotent; events are insert only; there are no updates and no deletes. The insert only approach reduces the load and contention of the persistence layer.
Next, we will look at clustered state management for cloud-native applications.
Managing state in clusters
In the case of a clustered environment, the state of each shopping cart is managed by an actor that is based on the actor model. In the shopping cart example, there is a unique actor instance that is responsible for handling cart state changes. Going with the above example there are three actors in use, one to handle your shopping cart, another actor to handle my shopping cart, and a third actor to handle the other shopping cart. The example actor system diagram also shows many other shopping cart actors.
Shown in the above diagram is an actor-based, clustered shopping cart system. This is also an example of a shopping cart microservice. On the left are three clients, you, me, and the other. Each of us is building our shopping carts vis some web based device. Access to the system is handled via a load balancer. The load balancer routes incoming requests to service endpoints.
The “crop circle” on the right is a visual depiction of the more interesting actors in the shopping cart system. The blue leaf circles on the perimeter of the diagram represent stateful shopping cart actors. The green circles that connect to the outer leaf circles represent shards. The shards are used to distribute the shopping cart actors across a cluster. Note that in this example there are 15 shards.
The shards connect up to what is called a shard region actor. There is one shard region actor per server node in the cluster. So in the above diagram, this cluster is currently composed of three serves and each server hosts a shard region actor.
The lines connecting the web clients to the actors provide an example of how incoming client requests are routed to each shopping cart actor instance. As an example, your client request was routed by the load balancer to the HTTP endpoint hosted on the bottom server circle.
From there the HTTP request is used to create a message that contains your request, such as add an item to the shopping cart actor. Initially, the message is sent to the local shard region actor. It is the responsibility of the shard region actor and other sharding related actor (see Akka Cluster Sharding for more details) to determine how the message should be routed. In the above example, your shopping cart actor is located on another server. The shard region actor will forward the message across the network to the shard region actor on the other server. This second shard region actor will then forward the message to the shard actor that maps to your shopping cart Id. The shard actor then forwards the message to your shopping cart actor.
When a shopping cart actor receives a message, it first validates that the message is ok to process. Validation is use case specific, for example, when a request is received to add an item to a shopping cart, the validation might be a verification that the item is a valid catalog item and it is currently available. If the request is valid the next step is an event is created, and it is then persisted to the event log.
Once the event has been inserted into the event log the shopping cart actor then updates the state of the shopping cart. Again, each shopping cart actor’s state is the current state of the specific shopping cart.
As a second example, shown in the above diagram here a fourth server was added to the cluster. Note that there are still 15 green shard actors and that some of the shard actors have moved to the new fourth server. This is an example of how the Akka Cluster sharding is capable of recognizing that the cluster has changed and how it is designed to distribute work across the cluster as it scales up and down.
Following along with the example with your shopping cart. Say you are adding the next item to the cart. In this case, the client HTTP request lands on another server, not the same server that received the first request. Also, your shopping cart actor has moved to the new server that just joined the cluster.
When a shard is moved to another server, the old shard actor and its associated shopping cart actors are stopped on the old server. Then on the new server new instances of the actors are started. Shopping cart actors are started as needed. That is they are only started when a message is sent to the actor in this case of your shopping cart actor. The old instance was stopped.
When you submitted the request to add the second item the message was routed to the shard actor that maps to your shopping cart Id. The shard actor checks and sees that your shopping cart actor is not running. The shard actor starts your actor. When the shopping cart actor is started it first recovers its state by reading and re-playing all of the persisted events. In this scenario, one item was previously added to the cart. Once the shopping cart state is recovered, then the incoming message is forwarded to the actor.
In this scenario, the actor system is doing most of the heavy lifting for us. In contrast, the application code is relatively simple. At the HTTP endpoint, custom code is used to handle incoming requests, transform them to message objects, and then send the messages to the local shard region actor. On the other end of the message flow custom code is created to handle incoming messages, validate, persist, and then update the state of the shopping cart. Custom code is also required to recover existing events when each instance of a shopping cart actor is restarted. In between, the actor system handles all of the routing and messaging both within each JVM and across the network, from private cloud to hybrid to on-premise.
Stateless for “good enough”; stateful for real-time streaming data
As usual, when it comes to cloud-native software systems, determining the best approach depends on the specific circumstances. This certainly applies when considering stateless versus stateful systems. In many cases, the stateless approach is an acceptable solution; however, there is a growing number of scenarios where using a stateful approach will be a better alternative. This is undoubtedly true for the ever-increasing demand for high-performance near real-time and stream based systems.
Related Reading