AWS Architecture

Saga Pattern: Misconceptions and Serverless Solution on AWS

Adi Simon
AWS in Plain English
10 min read · Sep 27, 2021

In this article, we will look at a few of the common misconceptions about the Saga pattern and see how the Saga pattern can be implemented using AWS serverless services.

"The Saga pattern is a failure management pattern that helps establish consistency in distributed applications, and coordinates transactions between multiple microservices to maintain data consistency" (sourced from AWS). I personally think that is a succinct definition.

To appreciate the importance of the Saga pattern, we need to understand the problem it tries to resolve and where the problem came from. This problem isn’t exactly new, but it has been made more prevalent and prominent by the uptake of microservices architecture.

One way to make sense of this is to use an example that I believe most people would be familiar with: purchasing a physical item online.

When you purchase an item online, it doesn’t just end there. Several things need to happen before your order is mailed out to you. This is obviously not a complete list, but it should give you a rough idea:

  1. Order is created in the online ordering system
  2. Payment is processed through the payments system
  3. The quantity of items you ordered is subtracted from the inventory system
  4. Fulfillment order is sent to logistics
  5. Billing system generates a receipt and this is sent to your email

See how this one event (i.e., your online purchase through a website or app) triggers a chain of steps that goes through multiple systems? If any one of the steps above fails, your order would be considered incomplete, and every step already completed needs to be rolled back. If all of these systems shared one database and all the steps above were part of a single atomic transaction, rolling back would be very easy, but I would argue that monolithic times are over. These days, applications are not designed that way anymore. As monolithic applications grow, it becomes difficult to modify or add functionality to them, and to track which parts of the codebase are involved in a specific change. As a result, small changes can require lengthy regression testing, and the development of new features can slow down.

Many modern applications are implemented using a microservices architecture, which sees their functionality broken up into smaller components with bounded contexts. Microservices architecture also often motivates another architecture pattern: database per service. The main driver for this is decoupling: changes to one service’s database do not impact any other service. It also means each microservice can use the type of database best suited to its needs. However, the database-per-service pattern brings another set of challenges: it is now very hard to implement business transactions that span multiple microservices. It is also impossible for distributed databases to meet all three guarantees of the CAP theorem at once: Consistency, Availability and Partition tolerance. In simple terms, it is fair to say that rolling back a database transaction is not in the cards anymore. Instead, what we need to do here is issue what are called compensating transactions, which undo the effects of previously committed transactions.

Let’s try to understand what this means by contextualizing it to the example above. Let’s say that the item you ordered is very popular and the order is rejected by the inventory system because the item happens to be out of stock. However, at this point the ordering system has already captured the order and we have already submitted a charge to the payment system, so these transactions have been committed by the order and payments systems. What needs to happen from here is not a rollback but a reversal in the form of another set of transactions, hence the name compensating transactions. This depends on the business logic, but the compensating transactions can be (a code sketch follows the illustration below):

  1. Submit reverse charge to the payments system
  2. Mark the order as failed in the order system and cancel the order
  3. Send an apology email to the customer
Illustration of compensating transactions to complete a saga
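
To make this concrete, here is a minimal, runnable Python sketch of what those compensating transactions could look like if each one were a call to a separate microservice. The three service clients are hypothetical in-memory stand-ins invented for this example, not real APIs:

```python
# Minimal sketch: compensating transactions expressed as plain functions.
# The three "services" below are hypothetical in-memory stand-ins for real
# microservices (payments, orders, email) so the example is runnable as-is.

class PaymentsStub:
    def refund(self, charge_id: str, amount: float) -> None:
        print(f"Reverse charge {charge_id} for {amount}")

class OrdersStub:
    def cancel(self, order_id: str) -> None:
        print(f"Order {order_id} marked as failed and cancelled")

class EmailStub:
    def send_apology(self, address: str, order_id: str) -> None:
        print(f"Apology email for order {order_id} sent to {address}")

payments, orders, email = PaymentsStub(), OrdersStub(), EmailStub()

def compensate_failed_order(order: dict) -> None:
    """Undo the effects of a partially completed order (the compensating leg of the saga)."""
    payments.refund(order["charge_id"], order["amount"])   # 1. reverse the charge
    orders.cancel(order["order_id"])                        # 2. cancel the order
    email.send_apology(order["email"], order["order_id"])   # 3. notify the customer

compensate_failed_order({"charge_id": "ch_1", "amount": 49.95,
                         "order_id": "ord_42", "email": "customer@example.com"})
```

The key point is that these are new, forward-moving transactions issued against each system, not a rollback of anything already committed.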

Now, before we look at the solution, there are a few misconceptions I would like to highlight at this point. Although these may seem very obvious, in my personal experience they are far more common than I expected:

  • The Saga oblivion
    This is interesting because sometimes people can be completely unaware that they are dealing with a Saga. Here is my tip: as soon as you are dealing with more than one data store, it is safer to assume you have a Saga than to assume you don’t.
  • Thinking API is the solution
    An API alone is not the solution to the problem that Saga tries to solve, although there are times and cases where it might seem that way, especially for small Sagas. An API can be built as a coordinator or orchestrator, but on its own it cannot reliably solve the problem; it falls short when it comes to retries and guaranteed execution of compensating transactions.
  • Thinking Kafka is the solution
    Kafka does have its place in Saga pattern implementations but, much like an API, it alone does not solve the Saga problem. Even where Kafka is used, in many cases it can be overkill and may not be a cost-effective solution.
  • Thinking event sourcing is the solution
    This is another misconception. Event sourcing is a different design pattern that deals with a different set of problems, mainly around the limitations of traditional CRUD operations against a data store. While the event sourcing pattern can be used to an extent to provide some recoverability from microservice failures, that is not its primary purpose.

To further clarify the misconceptions above, let’s illustrate with an example. Say we implement the online ordering process using an orchestrator API that talks to various microservices, like so:

Orchestrator API example

If there is no error and each microservice is perfect, then all is well and the API does the job, but not designing for failure is a failure in design. This is what makes solution architecture different from work that is primarily business analysis or software development. Astute developers or engineers tend to pick up where the problem is quite quickly, but to long-serving yet technically uninitiated business analysts, the problem might not be obvious until it is observed more closely. See below.

Saga fails after pivot transaction

It is paramount to realise that all solution components can fail. In this example, the orchestrator API fails after sending the fulfillment request to the fulfillment microservice, and there is no retry. The result may not be catastrophic because the purchased item may end up at the customer’s door anyway, but no invoice is generated and the customer never gets an email. This is not uncommon, especially in projects where these kinds of scenarios are treated as edge cases and simply added to the backlog: people assume the chance of this happening is tiny, that when it does happen it becomes the IT department’s problem, and that doing anything about it now would be over-engineering. In this particular example, we see two of the most prominent misconceptions I mentioned above at play:

  1. The Saga oblivion
  2. Thinking API is the solution

If you have followed along so far, I hope you now have an understanding of the problem that the Saga pattern is dealing with and trying to solve. Now, onto the solution. There are two ways to solve the distributed transactions problem.

Method 1: Orchestration based Saga
In an orchestration-based saga, there is a central controller that coordinates all saga participants. This orchestrator is responsible for instructing (requesting) saga participants to perform their local transactions, interpreting their results (responses), keeping track of the saga’s state, and, in the event of failures, handling them by implementing retry logic before issuing compensating transactions to the appropriate saga participants. This orchestrator is sometimes also referred to as a workflow or a state machine.

In AWS, we can implement a serverless orchestrator using AWS Step Functions in combination with AWS Lambda and API Gateway. Using the example scenario that we have been talking about in this article, the implementation can look something like this.

Orchestration-based saga on AWS using AWS Step Functions

Here we assume that all the saga participants are microservices with endpoints implemented using AWS Lambda, but they don’t have to be. Step Functions can integrate with many other services as well, e.g. ECS, EKS, AWS Batch and API Gateway, to name a few. In this implementation, we also have an API on API Gateway which proxies to a Lambda function that triggers the step function. The pink boxes within the step function illustrate compensating transactions; the step function needs to implement a retry mechanism around them because a compensating transaction must complete within a saga. The yellow box is what’s called a pivot transaction. Essentially, this is the point of no return, and it is followed only by retryable transactions, indicated in green, which we need to ensure are executed.
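
To give a feel for what this looks like in practice, below is a trimmed-down, hypothetical sketch of such a state machine defined in Amazon States Language (built as a Python dict) and created with boto3. The Lambda ARNs, IAM role and state names are placeholders, and only two forward steps plus their compensations are shown rather than the full saga from the diagram:

```python
import json
import boto3

# Sketch of an orchestration-based saga as a Step Functions state machine.
# ARNs are placeholders; the real saga would include order, inventory,
# fulfillment and billing steps rather than just the two shown here.
definition = {
    "StartAt": "ProcessPayment",
    "States": {
        "ProcessPayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:process-payment",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "CancelOrder"}],
            "Next": "ReserveInventory",
        },
        "ReserveInventory": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:reserve-inventory",
            # On failure, compensate everything committed so far.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RefundPayment"}],
            "End": True,
        },
        # Compensating transactions: retried aggressively because they must complete.
        "RefundPayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:refund-payment",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 5, "BackoffRate": 2.0}],
            "Next": "CancelOrder",
        },
        "CancelOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:cancel-order",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 5, "BackoffRate": 2.0}],
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="order-saga",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::ACCOUNT:role/step-functions-saga-role",  # placeholder role
)
```

Note how the Catch blocks route failures into the compensating path, while the Retry blocks on the compensating states express the rule that those transactions must eventually complete.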

The Saga pattern is often difficult to test and debug as it is inherently complex, and its complexity only increases with the number of saga participants (i.e. microservices). The primary advantage of this method is that the complexity is somewhat centralised into one component: the orchestrator. With this method, the saga participants don’t even need to be aware of the saga or of the other microservices, which gives us a clear separation of concerns. Furthermore, in AWS Step Functions we can visually see the steps, which makes testing and debugging easier. However, with this method, the orchestrator itself is obviously one additional component in the end-to-end solution that needs to be designed, built, tested, and hence budgeted for.

Method 2: Choreography based Saga

With a choreography-based saga, there is no central coordinator or orchestrator. Here, each saga participant does its own thing but harmoniously and collectively achieves what the saga intends to achieve, and is able to handle failures as well. The key ingredient is a messaging technology that each participant publishes events to and consumes events from. In AWS, one way we can implement a serverless choreography-based saga is with a combination of Amazon SNS and Amazon SQS.

Choreography based saga on AWS using Amazon SNS and Amazon SQS

With this setup, the request comes in through API Gateway, which uses an AWS service integration to publish a message directly into SNS without needing a Lambda function in between. From there, we use the SNS fan-out pattern to pipe the messages to the relevant SQS queues. The green arrows in the diagram above illustrate the happy path. When something goes wrong, compensating transactions are choreographed to follow the pink arrows. We put SQS here as a way to achieve a level of fault tolerance, especially in the event that compensating transactions need to be executed to complete the saga. In reality, you would also need DLQs (dead letter queues) and maybe even dedicated queues to handle compensating transactions. As you can see, this choreography works, but it turns very complex very quickly, hence it is suited to short and simple sagas with a small number of participants. Frankly, I would think that even 4 participants is too many, and I would prefer an orchestration-based saga.
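
As a rough illustration of one participant in this choreography, here is a hypothetical Python sketch: a helper that publishes saga events to the shared SNS topic, and a Lambda handler for the inventory service’s SQS queue that either continues the happy path or emits an event that triggers the compensating transactions. The topic ARN, event names and stock-reservation logic are all made up for the example, and raw message delivery is assumed on the SNS-to-SQS subscriptions:

```python
import json
import boto3

sns = boto3.client("sns")

# Hypothetical topic ARN used for illustration only.
SAGA_TOPIC_ARN = "arn:aws:sns:REGION:ACCOUNT:order-saga-events"

class OutOfStockError(Exception):
    pass

def publish_event(event_type: str, payload: dict) -> None:
    """Publish a saga event to the shared SNS topic (fanned out to participant queues)."""
    sns.publish(
        TopicArn=SAGA_TOPIC_ARN,
        Message=json.dumps({"type": event_type, "payload": payload}),
        MessageAttributes={  # participants can filter their subscriptions on this attribute
            "eventType": {"DataType": "String", "StringValue": event_type}
        },
    )

def reserve_stock(order: dict) -> None:
    # Placeholder for the inventory service's local transaction.
    raise OutOfStockError(f"Item {order.get('item_id')} is out of stock")

def inventory_handler(event, context):
    """Lambda consumer for the inventory service's SQS queue.

    Assumes raw message delivery is enabled on the SNS-to-SQS subscription,
    so each SQS record body is exactly the JSON published above.
    """
    for record in event["Records"]:
        message = json.loads(record["body"])
        order = message["payload"]
        try:
            reserve_stock(order)                       # local transaction
            publish_event("InventoryReserved", order)  # happy path (green arrows)
        except OutOfStockError:
            publish_event("InventoryRejected", order)  # triggers compensating transactions
```

Each participant follows the same shape: consume an event, run a local transaction, and publish either a success event or a failure event that the other participants react to with their own compensating transactions.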

Also note that with this method, you typically need to ensure that the saga participants are idempotent and can handle commutative (out-of-order) requests. It is also worth noting that a compensating transaction must complete and cannot be aborted, so make sure there is a retry mechanism. There are two types of Amazon SNS topics: Standard and FIFO. Standard topics have what AWS calls best-effort deduplication, which means that a message is delivered at least once, but occasionally more than one copy of a message is delivered. In addition, messages might occasionally be delivered in an order different from the one in which they were published, which is referred to as best-effort ordering. Similar concepts also apply to Amazon SQS.
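
One common way to achieve idempotency on the consumer side is to record each processed message id in a DynamoDB table with a conditional write, so redelivered messages are dropped instead of re-applied. The table name and key below are hypothetical; this is only a sketch of the idea, not a hardened implementation:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

# Hypothetical DynamoDB table with "message_id" as the partition key,
# used purely as a de-duplication ledger for this sketch.
IDEMPOTENCY_TABLE = "saga-processed-messages"

def process_once(message_id: str, handler, payload: dict) -> None:
    """Run handler(payload) at most once per message_id.

    The conditional put fails if the id was already recorded, so redelivered
    (at-least-once) messages are silently dropped instead of re-applied.
    """
    try:
        dynamodb.put_item(
            TableName=IDEMPOTENCY_TABLE,
            Item={"message_id": {"S": message_id}},
            ConditionExpression="attribute_not_exists(message_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery: already handled, do nothing
        raise
    handler(payload)
```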

Luckily, with AWS we have SNS FIFO topics and SQS FIFO queues that help with this to some extent. However, SNS FIFO and SQS FIFO come with limitations as well, most notably lower throughput, and the fact that SNS FIFO’s strict deduplication only works within a 5-minute window. FIFO is also more expensive than Standard. Therefore, I would generally suggest not relying entirely on the FIFO features.
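
For completeness, publishing to a FIFO topic looks something like the sketch below; the topic ARN is a placeholder, and the choice of group and deduplication ids is just one reasonable convention (all events for the same order share a group so they stay ordered relative to each other):

```python
import json
import boto3

sns = boto3.client("sns")

# Hypothetical FIFO topic ARN; FIFO topic names must end in ".fifo".
FIFO_TOPIC_ARN = "arn:aws:sns:REGION:ACCOUNT:order-saga-events.fifo"

def publish_fifo(order_id: str, event_type: str, payload: dict) -> None:
    sns.publish(
        TopicArn=FIFO_TOPIC_ARN,
        Message=json.dumps({"type": event_type, "payload": payload}),
        # All events for the same order share a group, so they are delivered in order.
        MessageGroupId=order_id,
        # Explicit deduplication id; SNS drops duplicates it sees within the
        # 5-minute deduplication window mentioned above.
        MessageDeduplicationId=f"{order_id}:{event_type}",
    )
```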

The alternative is to use Amazon Kinesis Data Streams, which guarantees ordering per shard, but scaling is manual and you would have to scale by adding more shards. The other option is Amazon Managed Streaming for Apache Kafka (MSK), a fully managed Kafka cluster on AWS. However, if you are going down this path, it is imperative to do a cost-benefit analysis, make sure you understand the MSK pricing model, and see whether it is cost-effective for your scenario.

All in all, the primary advantage of this method is that there is no additional orchestrator component to build, as long as each microservice publishes the right events, subscribes to the relevant events, and acts on those events appropriately. However, testing and debugging this can be notoriously hard. It is also hard to grasp where the workflow is at, as there is no visual cue.

In closing: as always, this article provides only a baseline or starting point, but I do hope it helps you understand the Saga pattern, the problem it tries to solve, common misconceptions about the solution, and two ways of solving it using AWS technologies. Please read through the official AWS documentation and blogs to assess and validate your use case and approach more closely.
