Amazon EventBridge: Archive & Replay Events In Tandem With A Circuit-Breaker

Sheen Brisals
8 min read · Nov 18, 2021

In my previous article, “How To Build Better Orchestrations With AWS Step Functions, Task Tokens, And Amazon EventBridge”, I explained the Loyalty Service platform. The focus there was mainly on the Order Processing and Vendor Mediator microservices. Though decoupled, these two services communicate via events and work in harmony!

Figure: Event-driven interaction of serverless microservices. Source: Author

The Vendor Mediator service is responsible for handling updates to a third-party SaaS application. Among the design goals of the Vendor Mediator service were failure isolation, managing platform downtime, and error retries.

In this article, we will see:

  • How we monitor the health of the SaaS platform
  • How we handle requests when the platform is down
  • How we make sure every request gets submitted

Below is an oversimplified view of the solution. When the SaaS status is down, the service holds onto the requests. When the status is up again, it resubmits them.

How we achieve this forms part of the discussion in this article.

Figure: High-level, abstract view of the archive and replay solution. Source: Author

SaaS Status Monitoring

The following diagram shows a minimalist approach to monitoring the status of a third-party application and propagating it.

Figure: Architecture of the SaaS application status monitoring. Source: Author

  • A Lambda function gets invoked at scheduled intervals.
  • The Lambda function invokes the status endpoint of the SaaS application.
  • It updates the status in a DynamoDB table.
  • When there is a change of state, it publishes an event on the bus.

Though I showed a single flow, the Vendor Mediator service checks three separate parts of the SaaS platform and maintains three status attributes with a different threshold for each.

Here is a sample SaaS status event. The type attribute names the part of the SaaS platform that the status applies to.

{
  "detail": {
    "metadata": {
      "domain": "LEGO-LOYALTY",
      "service": "service-loyalty-vendor-mediator",
      "category": "task-status",
      "type": "SAAS_STATUS",
      "status": "down"
    },
    "data": {
      "changed_at": "2021-11-10T13:15:30Z",
      "prev_status": "up",
      "prev_change_at": "2021-11-01T18:35:10Z"
    }
  }
}
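The monitoring steps above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `check_status`, `probe`, `store`, and `publish` are hypothetical names, the store is a plain dictionary standing in for the DynamoDB table, and the threshold is assumed to mean "consecutive failed checks before declaring the platform down" (the article does not define it).

```python
from datetime import datetime, timezone
from typing import Callable, MutableMapping, Optional


def iso(ts: datetime) -> str:
    """Format a UTC timestamp the way the sample event does."""
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")


def check_status(
    part: str,
    probe: Callable[[], bool],          # hypothetical: True when the status endpoint is healthy
    store: MutableMapping[str, dict],   # stand-in for the DynamoDB status table
    publish: Callable[[dict], None],    # stand-in for putting an event on the bus
    failure_threshold: int = 3,         # assumption: consecutive failures before "down"
    now: Optional[datetime] = None,
) -> Optional[dict]:
    """Run one scheduled check; publish and return an event only on a state change."""
    now = now or datetime.now(timezone.utc)
    default = {"status": "up", "changed_at": iso(now), "failures": 0}
    record = dict(store.get(part) or default)

    healthy = probe()
    record["failures"] = 0 if healthy else record["failures"] + 1

    if healthy:
        new_status = "up"
    elif record["failures"] >= failure_threshold:
        new_status = "down"
    else:
        new_status = record["status"]   # below the threshold: keep the current state

    event = None
    if new_status != record["status"]:
        # Build the event before overwriting the previous state.
        event = {
            "metadata": {
                "domain": "LEGO-LOYALTY",
                "service": "service-loyalty-vendor-mediator",
                "category": "task-status",
                "type": part,
                "status": new_status,
            },
            "data": {
                "changed_at": iso(now),
                "prev_status": record["status"],
                "prev_change_at": record["changed_at"],
            },
        }
        record["status"], record["changed_at"] = new_status, iso(now)
        publish(event)

    store[part] = record
    return event
```

Passing the probe, store, and publisher in as arguments keeps the state-change logic testable without AWS access; in a Lambda function they would wrap the HTTP call, the table, and the event bus.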

Alternate design approach

Here is one of many different ways of doing it! This one is more event-driven, single-purpose, and also uses SSM to keep the status value.

Figure: Extended architecture of the SaaS application status monitoring. Source: Author

Circuit Breaking

The circuit-breaker pattern is prominent in software engineering. There are several flavors of its implementation in serverless. Below is one from Jeremy Daly’s blog on serverless patterns.

In this one, a function that invokes the external endpoint takes care of the status check, threshold, etc. This one is useful when we deal with client-facing scenarios.

Source: Serverless Patterns blog by Jeremy Daly

The Vendor Mediator is a data submission service to a third party. As it invokes three different parts of the SaaS platform, it maintains the status for each. That’s one reason why the monitoring of status is carried out separately by a dedicated service.

A Lambda function that submits a request to the SaaS application checks the respective status in the table before proceeding. The next section covers this in more detail.

Data Submission Flow

The previous article gave an overview of the event flow between the Order Processing and the Vendor Mediator services. The diagram below expands on how Vendor Mediator handles an incoming request.

Figure: Expanded view of the event data processing flow. Source: Author

Items marked as 1 and 2 are the same as what we saw in Part 1. Let’s briefly go through the other steps.

  1. Order processing dispatches an event with the data payload.
  2. Event filter rule triggers processing target in Vendor Mediator service.
  3. The event, along with its payload, is stored in a DynamoDB table.
  4. Vendor status check. It checks whether the circuit is closed by querying the SaaS status table.
  5. If the circuit is open (i.e., status = Down), it sends an event with a status value of retry. This is an important event that indicates the need for archiving, as we will see shortly.
  6. If the circuit is closed (i.e., status = Up), it invokes the SaaS endpoint. On success, it sends a submitted event. A client error counts as a failure, and an error status event goes out. On a server error or a connection timeout, it sends a retry event.
  7. One of the subscribers of these status events is the Order Processing service.
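Steps 3 to 6 can be condensed into a short sketch. It makes several assumptions: `submit_request`, `call_saas`, `cache`, and `publish` are hypothetical names, the cache is a dictionary standing in for the DynamoDB table, "pending" is an invented initial status, and HTTP status codes are mapped to outcomes in the most obvious way (2xx submitted, 4xx error, anything else retry).

```python
from typing import Callable, MutableMapping


def classify_response(status_code: int) -> str:
    """Map an HTTP status code to a submission status (assumed mapping)."""
    if 200 <= status_code < 300:
        return "submitted"
    if 400 <= status_code < 500:
        return "error"      # client error: retrying will not help
    return "retry"          # anything else, e.g. a 5xx server error


def submit_request(
    keys: dict,                          # reference keys, incl. loyalty_request_id
    payload: dict,                       # original data payload from Order Processing
    saas_status: str,                    # "up" or "down", read from the status table
    call_saas: Callable[[dict], int],    # hypothetical: returns the HTTP status code
    cache: MutableMapping[str, dict],    # stand-in for the DynamoDB cache table
    publish: Callable[[dict], None],     # stand-in for putting an event on the bus
) -> str:
    request_id = keys["loyalty_request_id"]
    # Step 3: store the event payload before doing anything else.
    cache[request_id] = {"payload": payload, "status": "pending"}

    if saas_status == "down":
        # Steps 4-5: circuit open, do not call the SaaS endpoint at all.
        status = "retry"
    else:
        # Step 6: circuit closed, attempt the submission.
        try:
            status = classify_response(call_saas(payload))
        except (TimeoutError, ConnectionError):
            status = "retry"

    cache[request_id]["status"] = status
    # The status event carries only the keys, never the payload.
    publish({
        "metadata": {
            "domain": "LEGO-LOYALTY",
            "service": "service-loyalty-vendor-mediator",
            "category": "task-status",
            "type": "voucher",
            "status": status,
        },
        "data": dict(keys),
    })
    return status
```

Both the open-circuit branch and the server-error branch end in the same retry status, which is what lets a single archive filter capture every request that still needs to be submitted.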

The logic flow shown above in the Vendor Mediator service would fit perfectly as a state machine. With the Step Functions AWS SDK integration, we can reduce the need for Lambda functions!

Vendor Mediator status event

Here is the core of the data submission status event I mentioned above.

{
  "detail": {
    "metadata": {
      "domain": "LEGO-LOYALTY",
      "service": "service-loyalty-vendor-mediator",
      "category": "task-status",
      "type": "voucher",
      "status": "retry"
    },
    "data": {
      "loyalty_request_id": "AbLhmB7wnOsiBFAq6Cicj2acx8iQ",
      "loyalty_reference": "P6IF7YcwQd",
      "merchant_reference": "xz5CzHM1wZOm",
      "loyalty_order_reference": "M101-S76-OP10-T65"
    }
  }
}

The two important elements of this event are:

  1. status: Possible values are submitted, error, and retry.
  2. data: It contains the keys to identify the original event data in the Vendor Mediator’s cache table. The status event does not contain the original data payload.

The loyalty_request_id attribute contains the task token. When the Order Processing service receives an event with the status retry, it will extend the callback task token timeout for that particular execution flow.
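The article does not say how the timeout is extended. One plausible mechanism, shown here purely as an assumption, is the Step Functions SendTaskHeartbeat API, which resets the heartbeat timer of a task that is waiting on a callback token. The handler name is hypothetical, and `sfn_client` is expected to behave like `boto3.client("stepfunctions")`.

```python
def keep_task_alive(event: dict, sfn_client) -> bool:
    """Send a heartbeat for retry events; return True if one was sent.

    Assumption: the waiting task state defines HeartbeatSeconds, so each
    heartbeat gives the execution more time to wait for the submission.
    """
    detail = event["detail"]
    if detail["metadata"]["status"] != "retry":
        return False  # submitted/error events are handled elsewhere
    # loyalty_request_id carries the task token, as described above.
    task_token = detail["data"]["loyalty_request_id"]
    sfn_client.send_task_heartbeat(taskToken=task_token)
    return True
```

A heartbeat only helps if the task state was configured with a heartbeat timeout in the first place; an overall task timeout is not affected by it.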

Archiving The Retry Events

This part is simple, as shown below. In addition to the general consumers of the status events, there is an event filter rule that identifies the retry events and sends them to LoyaltyVoucherArchive.

Figure: EventBridge event archive flow. Source: Author

Event archive creation

Setting up an event archive is easy. Supply a name and the filter pattern, and it’s ready.

Resources:
  VoucherArchive:
    Type: AWS::Events::Archive
    Properties:
      ArchiveName: LoyaltyVoucherArchive
      Description: Archive for vouchers to be resubmitted
      EventPattern:
        <YourEventFilter>
      RetentionDays: 10
      SourceArn: 'arn:aws:...:event-bus/loyalty-bus'

The event filter in this case is:

{
  "detail": {
    "metadata": {
      "domain": [
        "LEGO-LOYALTY"
      ],
      "service": [
        "service-loyalty-vendor-mediator"
      ],
      "category": [
        "task-status"
      ],
      "type": [
        "voucher"
      ],
      "status": [
        "retry"
      ]
    }
  }
}
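To see which events this filter lets into the archive, here is a deliberately simplified local matcher. It is not EventBridge's implementation: it only understands nested objects and lists of exact values, which is all the filter above uses, and ignores operators such as prefix or exists. The function name `matches` is made up for this sketch.

```python
ARCHIVE_FILTER = {
    "detail": {
        "metadata": {
            "domain": ["LEGO-LOYALTY"],
            "service": ["service-loyalty-vendor-mediator"],
            "category": ["task-status"],
            "type": ["voucher"],
            "status": ["retry"],
        }
    }
}


def matches(pattern: dict, event: dict) -> bool:
    """Return True if every field named in the pattern matches the event.

    Simplified rules: a dict in the pattern descends into the event, a list
    in the pattern means "the event value must equal one of these".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Fields that the pattern does not mention, such as the data block of the status event, are ignored, so a retry event matches regardless of which request it refers to.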

Here it is from the AWS console.

Figure: AWS event archive console. Source: Author

AWS managed rule

While creating an archive with a filter pattern, AWS configures an event routing rule behind the scenes. It’s a managed rule, and we can’t directly change it.

Figure: AWS console showing the managed filter rule. Source: Author

When events are replayed from an archive, AWS adds the attribute replay-name to differentiate a replay event from the original. As highlighted in the picture, the managed rule adds an extra condition to prevent an endless event routing cycle.

Though I showed one archive, in reality, the Vendor Mediator service maintains three archives to align with the three parts of the SaaS application that I mentioned in the SaaS Status Monitoring section above.

One of the strengths of serverless is ‘granularity’. When it comes to archiving events, where possible, think of the subsets of events that need archiving, rather than going for a ‘catch-all’ and ‘archive-everything’ pattern.

Having separate archives allows us to vary the retention days. For example, there could be temporary cache events that are short-lived, and critical business events that stay longer.

Separate archives also let us vary the replay time frame per archive and attach specific target rules to each, which keeps replays efficient and easy to manage.

Replaying The Archived Events

The replay of archived events happens when the status of the SaaS application becomes ‘up’ from being ‘down’. In circuit-breaker terminology, the circuit has become ‘closed’.

Here is the event replay path depicted as a stretched view.

Figure: Architecture of event replay from the archive. Source: Author

  1. SaaS status events published by the SaaS Status Monitoring service arrive at the bus.
  2. When the rule identifies that the status has changed from ‘down’ to ‘up’, it triggers the replay.
  3. The replayed events, identified by the replay-name, are buffered into an SQS queue. SQS, with its different characteristics, allows us to control the flow of these events.
  4. The queue handler Lambda function retrieves the initial data event from the table and sends it back to the bus.

Replay logic

  • As shown earlier, the SaaS status event contains the time interval of the service being down. The replay trigger function uses it to set the EventStartTime and the EventEndTime of the StartReplay API.
  • It uses a ReplayName that aligns with the archive name. For example, replay names of LoyaltyVoucherArchive will be of the format LoyaltyVoucherReplay_YYYYMMDDHHMMSS.
  • The above format helps to be specific while setting up the event filter pattern for replay events (item 3 in the diagram). Instead of using the "replay-name": [{ "exists": true }] pattern, it can be more specific as shown below.

{
  "detail": {
    "metadata": {
      "domain": [
        "LEGO-LOYALTY"
      ],
      "service": [
        "service-loyalty-vendor-mediator"
      ],
      "type": [
        "voucher"
      ],
      "status": [
        "retry"
      ]
    }
  },
  "replay-name": [
    {
      "prefix": "LoyaltyVoucherReplay"
    }
  ]
}
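The replay logic described above might look roughly like this. It is a sketch under assumptions: the function name and ARN arguments are hypothetical, prev_change_at is taken as the moment the platform went down and changed_at as the moment it came back, and the timestamp in the replay name is assumed to be the recovery time (the article only gives the name format). The returned dictionary uses the parameter names of the EventBridge StartReplay API.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse timestamps such as 2021-11-10T13:15:30Z into aware datetimes."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def build_replay_request(
    status_event: dict,
    archive_arn: str,   # hypothetical: ARN of LoyaltyVoucherArchive
    bus_arn: str,       # hypothetical: ARN of the bus the archive was created on
    archive_name: str = "LoyaltyVoucherArchive",
) -> Optional[dict]:
    """Return StartReplay parameters for a down -> up change, else None."""
    detail = status_event["detail"]
    meta, data = detail["metadata"], detail["data"]
    if not (meta["status"] == "up" and data["prev_status"] == "down"):
        return None  # only a recovery triggers a replay

    went_down = parse_ts(data["prev_change_at"])
    came_up = parse_ts(data["changed_at"])
    # LoyaltyVoucherArchive -> LoyaltyVoucherReplay_YYYYMMDDHHMMSS
    replay_prefix = archive_name.replace("Archive", "Replay")
    return {
        "ReplayName": f"{replay_prefix}_{came_up:%Y%m%d%H%M%S}",
        "EventSourceArn": archive_arn,
        "EventStartTime": went_down,
        "EventEndTime": came_up,
        "Destination": {"Arn": bus_arn},
    }
```

In the trigger function the result would be passed on with something like `boto3.client("events").start_replay(**request)`. Because the replay name starts with LoyaltyVoucherReplay, the prefix filter shown above picks up the replayed events.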

Trade-offs & Facts

  • Ordering of events is not a requirement in this use case.
  • The replay event handler (item 4 in the above diagram) is idempotent. It checks the submission status in the DataCache table before invoking the SaaS application.
  • The request submission flow depicted earlier updates the submission status (as submitted, error, or retry) in the DataCache table.
  • Archive and replay with EventBridge eliminates the usual approach of periodically scanning or querying the DataCache table for status and resubmitting.
  • The original data payload event is not archived. Instead, a separate event with the keys to identify the data in the table is used.
  • The replay trigger function stores the latest replay interval in a table. This is not shown in the diagram.
  • DLQs (Dead Letter Queues) are omitted from the diagrams for clarity.
  • During the up-time of the SaaS platform, a sweeper Lambda function replays any stranded events from the archive at specific intervals.
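The idempotency check in the replay handler can be illustrated with a small sketch. The names are hypothetical, the cache is a dictionary standing in for the DataCache table, and each cache item is assumed to hold the submission status together with the original data event. The records follow the shape Lambda receives from SQS, where `body` holds the replayed EventBridge event as a JSON string.

```python
import json
from typing import Callable, Iterable, Mapping


def handle_replayed_records(
    records: Iterable[dict],            # SQS records delivered to the Lambda function
    cache: Mapping[str, dict],          # stand-in for the DataCache table
    publish: Callable[[dict], None],    # stand-in for putting an event on the bus
) -> int:
    """Re-publish the original data event for requests still marked retry."""
    seen = set()
    resubmitted = 0
    for record in records:
        replayed = json.loads(record["body"])
        request_id = replayed["detail"]["data"]["loyalty_request_id"]
        if request_id in seen:
            continue  # duplicate within this batch
        seen.add(request_id)

        item = cache.get(request_id)
        # Skip unknown requests and anything already submitted or failed.
        if item is None or item["status"] != "retry":
            continue

        publish(item["event"])  # the original data event, payload included
        resubmitted += 1
    return resubmitted
```

Since the status check happens on every record, replaying the same archive window twice, or letting the sweeper overlap with a triggered replay, does not resubmit requests that have already gone through.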

Conclusion

There are several ways we can make a serverless application resilient and fault-tolerant. Understanding the requirements is key to making the right architectural decision and design choices.

Not every pattern or solution fits every use case. Having the understanding and knowledge to identify the optimal approach is essential to succeed in serverless.

Sheen Brisals

Co-author of Serverless Development on AWS (O'Reilly, 2024) | Engineer. Architect. Leader. Writer. Speaker. AWS Serverless Hero.