API Series Part 7 - Inter-Service Communication Overview

See the series introduction for the list of posts in this series.

In this post we'll explore our options for how different services can communicate with each other. This post is more about architecture than implementation: we'll explore some different architectures and the trade-offs that come with each one, with a few small illustrative sketches along the way. In the end you can make up your own mind for your own specific domain. I have my preference for Govrnanza, and in the next part we'll implement that strategy and get into some code.

Synchronous vs Asynchronous Communication

First we need to talk about what synchronous and asynchronous communication means and disambiguate it from synchronous/asynchronous functions and IO.

When we talk about synchronous vs asynchronous operations in code or IO we are talking about blocking vs non-blocking operations. It is more nuanced than that, but essentially the principal difference is that a synchronous operation blocks (the thread waits for a response) and an asynchronous operation is non-blocking (the thread does not wait).

But synchronous/asynchronous communication refers to a higher level than that. For example, an HTTP call to a RESTful service is considered synchronous communication. The application making the call might make use of async/await so that the thread that made the call can be used to serve other requests while the application waits for a response. So it uses asynchrony at a local thread level, but from a bird's eye view, the application makes a call and waits for a response. Local state is maintained and the thread of execution continues once a response is received. This is synchronous communication.
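
To make the distinction concrete, here is a minimal sketch of such a call (the ProfileClient class, ProfileDto shape and route are hypothetical): async/await means the thread is released while waiting, but the communication is still synchronous because the caller holds its state and only continues once the response arrives.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

public record ProfileDto(Guid Id, string Name, string JobTitle);

public class ProfileClient
{
    private readonly HttpClient _httpClient;

    public ProfileClient(HttpClient httpClient) => _httpClient = httpClient;

    public async Task<ProfileDto?> GetProfileAsync(Guid profileId)
    {
        // The thread is released here, but the logical flow of this request
        // waits for the Profile service to respond before it can continue.
        var response = await _httpClient.GetAsync($"/api/profiles/{profileId}");
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadFromJsonAsync<ProfileDto>();
    }
}
```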

Asynchronous communications usually involve a message queue technology. An application sends a message to a queue and either expects no response, or can easily live with a response coming in later, even through a different channel. With asynchronous communication, we either don't require a direct response, or don't put tight constraints on the nature of that response. For example, one application might send a PlaceOrder command on a queue. It also consumes a queue where OrderPlaced events are sent. So it gets a response of sorts but that response will be consumed in a completely different context (different execution context, different host etc).
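
A sketch of the asynchronous style might look like the following. The IMessageBus abstraction, message types and queue names are assumptions; the point is that the command is fired without waiting for a direct response, and the "response" event is handled later in a completely different execution context.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical messaging abstraction - the broker behind it (RabbitMQ, Kafka, etc.) is assumed.
public interface IMessageBus
{
    Task PublishAsync<T>(string queueOrTopic, T message);
    void Subscribe<T>(string queueOrTopic, Func<T, Task> handler);
}

public record PlaceOrder(Guid OrderId, Guid CustomerId, decimal Amount);
public record OrderPlaced(Guid OrderId, DateTime PlacedAtUtc);

public class OrderingClient
{
    private readonly IMessageBus _bus;

    public OrderingClient(IMessageBus bus) => _bus = bus;

    // Fire-and-forget: send the command and carry on without waiting for a reply.
    public Task RequestOrderAsync(PlaceOrder command) =>
        _bus.PublishAsync("place-order", command);

    // The "response" arrives later, on a different channel and in a different execution context.
    public void Start() =>
        _bus.Subscribe<OrderPlaced>("order-placed", evt =>
        {
            Console.WriteLine($"Order {evt.OrderId} was placed at {evt.PlacedAtUtc:u}");
            return Task.CompletedTask;
        });
}
```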

Note that you can do RPC style communications over a messaging platform where an application sends a message with a reply-to queue name included in the message, so that the receiver can send the response to that queue. If the application waits for a response as if it were an HTTP call then that is still a synchronous communication.

Types of Communication

We can split communications into the following types:

  • A request for data

  • A request to carry out a task (Command)

  • A notification that something happened (Event)

We have different communication mechanisms for each of those types of communications and they each have trade-offs. 

New Services

In order to demonstrate those options and trade-offs we are going to add new services and functionality to Govrnanza. So far we just have a single service responsible for the management of APIs, domains and sub domains. Now we are going to add three new services:

  • Profile service. Responsible for managing user profiles.

  • Account service. Responsible for management of accounts and organisations, including subscription plans.

  • Billing service. Responsible for calculating monthly charges, taking credit card information and making charges. Freezing out organisations that haven't paid.

Each service might receive HTTP requests from our SPA app or mobile clients from the internet that pass through our API gateway. But then they need to communicate with each other on their own private network to serve requests.

Fig 1. Services must communicate with each other as well as respond to external calls.

Let's list all of the requirements related to the services working together.

Registry service

  • Needs to know what organisations exist and which user profiles belong to which organisation. All APIs, domains and sub domains belong to specific organisations.

  • Needs to augment API data with profile data. For example, each API has a technical and business owner. The Registry service is not responsible for any profile data. It must store user ids at the very least and must include at a minimum the technical/business owner's name and job title in responses. Names and job titles don't change that often but they do change.

  • Needs to respond with a "payment required" response to any request related to an organisation that has been frozen out of their account due to lack of payment.

  • Needs to know API, domain and sub domain count limits set by the current billing plan so that it can reject create requests that would push the count over the limit.

Profile service

  • Needs to know what organisations exist so that user profiles can be linked to specific organisations.

Account service

  • Needs to know when an organisation has failed to pay.

Billing Service

  • Needs to know what organisations exist and their subscription plans.

  • Needs to know how many APIs, domains and sub domains exist for each organisation that has a pay-as-you-go subscription plan.

  • Needs to know how many users exist for each organisation that has a pay-as-you-go subscription plan.

There are a few communication dependencies there and we can visualise them as follows:

Fig 2. Communication needs between services

We'll now look at some example communication architectures and analyse their pros and cons. First though, let's review some good traits for our services to have, to frame each architecture against.

Good (micro) service traits

Effective microservices have some common traits:

  • Encapsulate a single business domain (bounded context). It is essential to get the boundaries right and keep them well-defined. Each business domain should be owned by a single service. For example, if we find logic related to billing in the Registry service then we know that we haven't correctly modelled our services.

  • Independent units. Each microservice should be maintained by a single team (one team could have multiple services). It should have its own source control repo and build, and could even have its own technology stack. Teams can choose to consolidate their tech and use the same source control, build/deploy, database and application framework technology, but that choice should not be binding. The choice of one team must not impact another.

  • Loosely coupled. Services should not have hard dependencies on each other or have shared dependencies where one service could force a change or deployment in another. Interaction is governed by contracts with zero leakage of technology choices. For example, if I used NodaTime for dates in the Billing service then I should not serialize dates in a NodaTime-specific format, as that couples the contract to a specific technology.

  • Decentralised data. Each service should be responsible for its own data, be able to store data in the format it needs, with the technology it needs, without concern for impact on other services. A central data store with a single data structure and format creates tight coupling to all other services, which complicates and slows down development cycles, among other downsides.

  • Controlled failures. Services will fail and the architecture should support ephemeral services that come and go. Additionally, we should design with failure in other parts of the system in mind. If the Registry service cannot verify the status of an organisation or its subscription plan then we can choose to fail open and allow the operation, or we can choose to deny the operation. But we need to identify the failure modes of the system and how each part reacts to each failure. Failure cascades must be avoided as far as possible.

  • Stateless. Stateless doesn't mean no state. It means that state is stored outside of the application, in databases, caches and messaging systems. When applications are stateless they can fail and respawn easily. We can distribute load between them, and scale them up and down elastically.

The common theme seems to be that each service should be independent as far as possible. When a service has loose coupling to all other systems, each team can move fast as long as it respects the contracts by which it interacts with the rest of the system.

Three Example Communication Architectures

We'll look at three options:

  • Shared Database.

  • API Centric using synchronous HTTP calls.

  • Self-Contained Services using asynchronous messaging.

Fig 3. Three different ways of communicating

 

Shared Database Architecture

This is a tempting architecture to use because it is so easy to get started with. In our case, each service needs to know about organisations, for example. Wouldn't it be easy to just put it all in one database and access the data directly from each service?

Fig 4. One database to rule them all

After all, doing joins in SQL is much easier and more powerful than calling an API. Mashing up the data between your own database and some API calls is more awkward in comparison. We can even clearly separate responsibility for who performs writes and grant other services read-only access.

The issue with the shared database architecture is that it violates most of the positive traits we identified earlier.

Shared Database Drawbacks

The shared database pattern can cause a few headaches:

  • Applications can affect each other's performance.

  • Changes to a single table might require coordination between multiple teams and coordinated deployments.

  • Some schema changes that would be beneficial for one service are not made because of a negative impact on other services.

  • The data is not really optimised for any application. It is probably the full data set and fully normalised. Some applications have to write mega queries to get the data they need.

  • Complexity grows and grows as we try to create the one true data model that represents the entire system.

API Centric Architecture

If Govrnanza started out as a nicely modularised monolith and we wanted to break it down into microservices then an obvious choice would be to decompose it into one microservice per module. In-process method calls would be replaced by calls over the network to REST APIs. APIs are everywhere and changing the world. APIs built for external consumers are a clear winner, but does it make sense to use APIs as the primary means of inter-service communication?

Fig 5. API Centric Services Architecture

Fig 5 depicts the Govrnanza architecture where services communicate via HTTP calls to RESTful APIs.

Let's say we have a SPA app that calls our Registry API; the sequence of calls might look like the following:

Fig 6. API Centric Sequence

Note that you would probably make use of caching to avoid making the same calls again and again for data that probably hasn't changed.
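
Here is a minimal sketch of what one of those downstream calls could look like in the Registry service, with a short-lived in-memory cache in front of it. The named HttpClient, route and OrganisationStatus DTO are assumptions.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public record OrganisationStatus(bool IsFrozen, string SubscriptionPlan);

public class OrganisationStatusLookup
{
    private static readonly JsonSerializerOptions JsonOpts = new(JsonSerializerDefaults.Web);
    private readonly HttpClient _accounts;
    private readonly IMemoryCache _cache;

    public OrganisationStatusLookup(IHttpClientFactory factory, IMemoryCache cache)
    {
        _accounts = factory.CreateClient("accounts");
        _cache = cache;
    }

    public async Task<OrganisationStatus?> GetStatusAsync(Guid organisationId)
    {
        // Cache for a few minutes - organisation status and subscription plans rarely change,
        // so we avoid calling the Account service on every single Registry request.
        return await _cache.GetOrCreateAsync($"org-status-{organisationId}", async entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5);
            var json = await _accounts.GetStringAsync($"/api/organisations/{organisationId}/status");
            return JsonSerializer.Deserialize<OrganisationStatus>(json, JsonOpts);
        });
    }
}
```

The same pattern would apply to the technical/business owner lookups against the Profile service.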

Modelling Requests To Carry Out Tasks

So far we have only needed to request data and set data. But what about tasks?

The business says that we need to add a notifications feature. Users can watch specific objects such as APIs, domains and sub-domains and get notified when changes occur. They can also be warned when APIs get added to or removed from an existing domain. Additionally there are notifications for changes to account settings, reminders of password expiration, payment required and so on. Basically every service needs to be able to send notifications.

Users can also set when they want to receive these notifications. They can choose Instant, Daily (with a time) or Weekly (with a day and time). So when a user chooses Daily, the Notification service needs to roll up all of that day's notifications into one, except that certain notifications, such as billing-related ones, are exempt from the roll-up.

We decide that we need a new service, the Notification service that is responsible for sending out email notifications.

How do we model that with an API Centric model?

  • Option 1. Each service is responsible for making API calls to the Notifications service. The Notifications service is simply a wrapper around an emailing system and centralises control of email styling.

  • Option 2. The Notification service contains all the logic of notifications; the other services know nothing of notifications except the Profile service, which stores the notification settings of each user. The Notification service polls each service's REST API for the latest changes (additions, deletions, modifications). It also polls the Profile service for notification settings. Then on a schedule it sends out notifications according to the user's settings. This means Instant now means within the polling interval.

Option 2 seems best to me as we don't mix notification logic into all the services. If we make each service responsible for calling the Notification service then things get complex quickly. How do we manage the Daily and Weekly roll-ups for example? When we want to change notification logic we have to visit every service. We have lost the loose coupling and well-defined service boundaries.

Polling does introduce extra latency, though in the case of Govrnanza I think this is an acceptable delay. Another downside is that each API must implement paging and ordering by creation/modification date for each resource, though that would probably be required by the SPA app already.

This is just one example of how we can model requests to do work. We can make it a push model where an explicit request is made by each application or we can make it a pull model where the service that must do work pulls the necessary data from the other services on a schedule. Pull makes for more decoupling and cleaner boundaries but introduces extra latency.
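
A minimal sketch of the pull model, assuming a hypothetical changes endpoint on the Registry service that supports ordering and paging by modification date (the route, query parameters and ApiChange DTO are all assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public record ApiChange(Guid ApiId, string ChangeType, DateTime ModifiedUtc);

public class RegistryChangePoller
{
    private static readonly JsonSerializerOptions JsonOpts = new(JsonSerializerDefaults.Web);
    private readonly HttpClient _registry;
    private DateTime _lastPollUtc = DateTime.UtcNow;

    public RegistryChangePoller(HttpClient registryClient) => _registry = registryClient;

    public async Task PollForeverAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var pollStartedUtc = DateTime.UtcNow;
            var since = Uri.EscapeDataString(_lastPollUtc.ToString("O"));

            var json = await _registry.GetStringAsync($"/api/apis/changes?since={since}&page=1");
            var changes = JsonSerializer.Deserialize<List<ApiChange>>(json, JsonOpts) ?? new();

            foreach (var change in changes)
            {
                // Either notify immediately or store the change for the daily/weekly roll-up.
                Console.WriteLine($"API {change.ApiId} was {change.ChangeType} at {change.ModifiedUtc:u}");
            }

            _lastPollUtc = pollStartedUtc;
            await Task.Delay(TimeSpan.FromMinutes(1), ct); // "Instant" now means within this interval
        }
    }
}
```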

API Centric Benefits

Mature Technologies. API managers, retry/circuit breaker libraries, Swagger, multiple mature versioning patterns. Higher-performance alternatives to REST+JSON such as gRPC. The quintessential microservices architecture is API based, and so the tooling and techniques are mature and improving over time.

Distribution of computationally expensive tasks. If serving a request involves a lot of different computationally expensive tasks then we can gain performance and scalability by distributing that work across multiple services hosted on multiple machines.

But if the work could be done performantly on a single machine then you'll get the opposite effect. If you take an already well performing piece of code and distribute it across the network then you'll see significantly slower response times. The problem is that a network call is on the order of a thousand times slower than an in-memory call, and that needs to be remembered.

API Centric Drawbacks

Resilience. If the Account service goes down then it can cause service disruption across the whole system. To mitigate this we can fail gracefully and make design decisions that favour availability over consistency. For example, if the Account service is down then we could assume that the client's organisation is active (not frozen out due to lack of payment). If the Profile service were down we could simply omit the technical and business owner details from the response.

Failure cascades can cause outages of an entire system due to a chain reaction of dependencies. One unimportant service could end up bringing down a system if not designed correctly. Patterns such as Circuit Breaker and Bulkhead are critically important to mitigate such failure scenarios. But they are not magic, and if a critical system is down then your whole system might become unavailable and there's nothing that can be done until it comes back online.
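
As a sketch of both ideas together, assuming Polly as the resilience library (the route, DTO and thresholds are illustrative), the Registry service could wrap its Account service call in a circuit breaker and fail open when the circuit is broken:

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public record OrganisationStatus(bool IsFrozen, string SubscriptionPlan);

public class OrganisationStatusClient
{
    private static readonly JsonSerializerOptions JsonOpts = new(JsonSerializerDefaults.Web);
    private readonly HttpClient _accounts;

    // Open the circuit after 5 consecutive failures and keep it open for 30 seconds.
    private readonly AsyncCircuitBreakerPolicy _breaker =
        Policy.Handle<HttpRequestException>()
              .Or<TaskCanceledException>() // timeouts
              .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 5,
                                   durationOfBreak: TimeSpan.FromSeconds(30));

    public OrganisationStatusClient(HttpClient accountsClient) => _accounts = accountsClient;

    public async Task<bool> IsOrganisationActiveAsync(Guid organisationId)
    {
        try
        {
            var json = await _breaker.ExecuteAsync(() =>
                _accounts.GetStringAsync($"/api/organisations/{organisationId}/status"));
            var status = JsonSerializer.Deserialize<OrganisationStatus>(json, JsonOpts);
            return status is { IsFrozen: false };
        }
        catch (BrokenCircuitException)
        {
            return true; // Account service looks down: fail open and assume the organisation is active
        }
        catch (HttpRequestException)
        {
            return true; // a single failed call before the circuit opens - still fail open
        }
    }
}
```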

High Availability of services is very important. Our data technologies such as our RDBMS, caches, messaging systems should all have mature HA capabilities. Where traditionally it has been harder to achieve HA is in our own applications because HA is a hard problem to solve. This is where container orchestrators like Docker Swarm and Kubernetes can help. 

Uptime. If each service by itself could achieve 99.5% uptime (43 hours and 48 minutes of downtime per year), then a dependency chain of Registry -> Profiles -> Accounts reduces the Registry service's effective uptime to 0.995^3, roughly 98.5% (about 131 hours of downtime per year). A chain of six services would reduce that to 0.995^6, roughly 97% (about 260 hours). The more synchronous dependencies we have, the lower our uptime.

Latency. Let's say that every service on average responds in 100ms. The 95th percentile is 250ms, 99th percentile is 500 ms and the 99.9th percentile is 5 seconds.

If the Registry service makes calls to three other services then the likelihood of at least one of those calls taking 500 ms or more is 1 - 0.99^3, roughly 1/33 (instead of the 1/100 each service has independently). With six calls that rises to 1 - 0.99^6, roughly 1/17.

Likewise the chance of hitting 5 seconds rises as follows:

  • 1 service = 1/1000

  • 3 services = 1/333

  • 10 services = 1/100

What if the typical customer experience on the SPA app involves 50 requests to the Registry service within a few minutes? If each of those fans out to three downstream services, that is 150 downstream calls, and the chance of hitting at least one 5 second response is 1 - 0.999^150, or roughly 14%. The chances are they will see unacceptable response times. So tail latency issues, where latency rises much higher in the upper percentiles, are greatly exacerbated by a greater number of synchronous dependencies.

In short, latency variability will increase as you add more calls to other services, with increased potential for latency spikes and cascades.

Development. With synchronous call graphs we must start up all these services locally in order to make one service run. This makes local development more painful and more difficult to debug.

If I am in the Registry service team then I must start up the Profile service and Account service, which belong to other teams. We could make things a bit easier by containerising everything, even the databases and caches, so we can bring the whole system up for running our Registry service. But now we also need to make sure that those other services have the right data in their databases. So while not insurmountable, there is an added cost of maintaining container orchestrator and database scripts to make everything work in the development environment.

Debugging/Incident Analysis. When an incident occurs we now need to review the logs of multiple machines. If you don't have a modern log analytics platform then this can be a painfully slow and laborious task. You need a centralised log analytics platform like the ELK stack or Splunk so that you can view the logs of multiple services and hosts together in a single screen. Correlation ids are very important so that we can correlate the distributed work of one end user request across the system.
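
A minimal sketch of correlation id propagation in ASP.NET Core (the header name and logging scope shape are assumptions): incoming requests keep the caller's correlation id if one is present, otherwise a new one is generated, and every log entry written while handling the request carries it. Outgoing calls to other services would forward the same header.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

public class CorrelationIdMiddleware
{
    private const string HeaderName = "X-Correlation-Id";
    private readonly RequestDelegate _next;
    private readonly ILogger<CorrelationIdMiddleware> _logger;

    public CorrelationIdMiddleware(RequestDelegate next, ILogger<CorrelationIdMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        // Reuse the caller's correlation id if present, otherwise start a new one.
        var correlationId = context.Request.Headers.TryGetValue(HeaderName, out var value)
            ? value.ToString()
            : Guid.NewGuid().ToString();

        context.Response.Headers[HeaderName] = correlationId;

        // Everything logged while handling this request is tagged with the correlation id,
        // so a log analytics platform can stitch one end user request back together.
        using (_logger.BeginScope(new Dictionary<string, object> { ["CorrelationId"] = correlationId }))
        {
            await _next(context);
        }
    }
}
```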

Self-Contained Services

The idea behind self-contained services is that when an HTTP request comes in, the service has all the data it needs locally in order to serve the request. Because the self-contained service has no external dependencies at request serving time, we avoid the issues with the synchronous call graphs (API Centric). There are no failure cascades, latency spikes or hard to calculate uptime guarantees. Also, when running our service locally for debugging, we don't need to stand-up the rest of the services - just our own. The service is a self-reliant unit.

The idea is based on Self Contained Systems (SCS) but without the requirement for the UI included in the self-contained boundary. The SCS website states:

The Self-contained System (SCS) approach is an architecture that focuses on a separation of the functionality into many independent systems, making the complete logical system a collaboration of many smaller software systems. This avoids the problem of large monoliths that grow constantly and eventually become unmaintainable.
— http://scs-architecture.org/

It sounds like microservices, and it can be considered a subset of, or specialisation within, the microservices concept. I am not entirely convinced that UIs can always be integrated, though I am far from being any kind of UI expert. But in this post I'll be concentrating on self-containing the backend application and calling it a Self-Contained Service.

Self-containment is achieved by making each service emit events for every change to its domain objects. These events are the current consistent snapshot of the object rather than a delta. Other services that need that data subscribe to those events and can either react immediately to that event or just store the data locally for when it needs it. This is basically both a reactive event-driven architecture and a data replication architecture and therefore has the trade-offs of eventual consistency.

Fig 7. Decoupled communication via Kafka topics

The individual services do not need to use event sourcing for their own data. They can use the technology and data architecture of their choice within the boundaries of the self-contained service. So using SQL Server and Entity Framework just like they always have done is fine. But whenever an object is changed, an event is published to an immutable commit log.
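
As a sketch of that publishing step, using the Confluent.Kafka client (the topic name, event shape and JSON serialisation are assumptions), the Profile service could publish the full current state of a profile whenever it changes:

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;
using Confluent.Kafka;

// The full snapshot of the profile, not a delta.
public record ProfileChanged(Guid ProfileId, string Name, string JobTitle, string Email, DateTime ModifiedUtc);

public class ProfileEventPublisher : IDisposable
{
    private readonly IProducer<string, string> _producer;

    public ProfileEventPublisher(string bootstrapServers)
    {
        var config = new ProducerConfig { BootstrapServers = bootstrapServers };
        _producer = new ProducerBuilder<string, string>(config).Build();
    }

    public async Task PublishAsync(ProfileChanged evt)
    {
        // Key by profile id so that all events for one profile land on the same partition
        // (preserving per-profile ordering) and so that log compaction keeps only the latest state.
        await _producer.ProduceAsync("profiles", new Message<string, string>
        {
            Key = evt.ProfileId.ToString(),
            Value = JsonSerializer.Serialize(evt)
        });
    }

    public void Dispose() => _producer.Dispose();
}
```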

Commit Log vs Queue

The Registry service needs the name and job title of each profile so that it can serve that data in its responses. So it needs to consume every profile change event and store those two fields plus the id in its own database.

Events published to a queue are transient. Once consumed they are gone. Events published to a commit log are persistent and can remain in the log forever or get cleaned according to a data retention policy. With Kafka, each consumer maintains an offset (a pointer to their position in a given log partition) which allows multiple applications to consume the same log concurrently and allows each application to rewind the clock by simply moving its offset back.
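
For example, rewinding the Registry service's profile ingester by 48 hours could look roughly like this with the Confluent.Kafka client (topic name, group id and bootstrap servers are assumptions): when partitions are assigned, we look up the offsets that correspond to a point in time and start consuming from there instead of the committed offsets.

```csharp
using System;
using System.Linq;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "registry-profile-ingester",
    AutoOffsetReset = AutoOffsetReset.Earliest
};

var rewindTo = DateTime.UtcNow.AddHours(-48);

using var consumer = new ConsumerBuilder<string, string>(config)
    // When partitions are assigned, translate "48 hours ago" into an offset per partition
    // and start consuming from those offsets.
    .SetPartitionsAssignedHandler((c, partitions) =>
        c.OffsetsForTimes(
            partitions.Select(tp => new TopicPartitionTimestamp(tp, new Timestamp(rewindTo))),
            TimeSpan.FromSeconds(10)))
    .Build();

consumer.Subscribe("profiles");

while (true) // sketch only - a real ingester would handle shutdown and commit offsets
{
    var result = consumer.Consume(TimeSpan.FromSeconds(1));
    if (result == null) continue;

    Console.WriteLine($"Profile {result.Message.Key} at offset {result.Offset}: {result.Message.Value}");
}
```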

Fig 8. A single log (topic) partition with two applications reading from different offsets.

 

Kafka splits each log (topic) into multiple partitions and messages are distributed between these partitions. We can make that round robin or use a hashing function. In our case we really want messages related to a specific profile to always go to the same partition, so we use the primary key of the profile as the message key.

Fig 9. Two consumer groups consuming three partitions of a given topic.

Each partition has only a single consumer per application (consumer group), which means that if we scale out our partitions we can get message ordering guarantees with multiple concurrent scaled-out consumers. This is a feat that normal message queues cannot achieve because they lose message order when consumed concurrently.

Additionally Kafka has the concept of Log Compaction. When a partition is compacted, only the most recent message per message key remains. If we always store the complete state of an object in the message, rather than deltas, we can still get the most recent state and keep the size of the partitions under control.
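
Compaction is a topic-level setting. A sketch of creating a compacted profiles topic with the Confluent.Kafka admin client (partition count, replication factor and topic name are assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Admin;

using var admin = new AdminClientBuilder(
    new AdminClientConfig { BootstrapServers = "localhost:9092" }).Build();

await admin.CreateTopicsAsync(new[]
{
    new TopicSpecification
    {
        Name = "profiles",
        NumPartitions = 10,
        ReplicationFactor = 3,
        Configs = new Dictionary<string, string>
        {
            // Keep only the most recent message per key (per profile id).
            ["cleanup.policy"] = "compact"
        }
    }
});
```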

The problems with a queue as a data replication technology are:

  • Now we want the email address too. We have to run a custom ETL process to get the data into the Registry service database.

  • Oops, there was a bug and we need to reconsume all the profile events of the last 48 hours. But those events are gone. We'll need to either republish them somehow or run a custom one-off process to fix the data.

  • We have a new service that needs profile data. We have to run a custom ETL process to prepopulate the data and then hook up the application to a queue. Care needs to be taken to make the transition perfectly so as to have consistent data.

Compare that to Kafka:

  • Now we want the email address too. We already use log compaction to ensure each profile only has a single message in the compacted log, which keeps the message volume under control. We simply move our Registry service consumer offset back to the start and reconsume all messages again, this time also taking the email address. This might impact the liveness of the data, so we can instead spawn a one-off copy of the existing event ingester that consumes from the beginning of the log.

  • Oops, there was a bug and we need to reconsume all the profile events of the last 48 hours. We just move the offset back 48 hours.

  • We have a new service that needs profile data. The new application simply starts consuming from the beginning of the log. It can be dark deployed at first so that it can consume the whole profiles log, then it can be integrated into the system once it has caught up.

Request Call Sequence

Let's look at the sequence of events that occurs when a request from our SPA app arrives at our Registry service:

Fig 10. Sequence of steps to respond to a request in the self-contained Registry service.

We don't rely on another service and have the freedom to use the power of SQL to combine data into the structure that we want. 
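
A minimal sketch of that, assuming Entity Framework Core and hypothetical entity and DTO shapes: the Registry service joins its own API data with the locally replicated profile snapshots that the event ingester keeps up to date.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public record ApiDetailsDto(Guid ApiId, string ApiName, string TechnicalOwnerName, string TechnicalOwnerJobTitle);

public class ApiDetailsQuery
{
    private readonly RegistryDbContext _db; // hypothetical DbContext owned by the Registry service

    public ApiDetailsQuery(RegistryDbContext db) => _db = db;

    public async Task<List<ApiDetailsDto>> GetApisForOrganisationAsync(Guid organisationId)
    {
        // Everything here is local to the Registry database - no calls to other services.
        return await (
            from api in _db.Apis
            join owner in _db.ProfileSnapshots on api.TechnicalOwnerId equals owner.ProfileId
            where api.OrganisationId == organisationId
            select new ApiDetailsDto(api.Id, api.Name, owner.Name, owner.JobTitle)
        ).ToListAsync();
    }
}
```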

Modelling Requests To Carry Out Tasks

Let's look at how we would model the new Notification service with our self-contained service architecture.

Fig 11. The Notification service consumes events emitted by each service.

I have omitted the consumption of these topics by the other services to keep the diagram simple. What we see is that no service has any idea about the existence of notifications or the Notification service, except that the user's notification settings are stored in the Profile service.

The Notification service consumes all the events generated in the system, for example:

  • WatchCreated

  • WatchDeleted

  • ProfileCreated

  • ProfileModified

  • ProfileDeleted

  • ApiCreated

  • ApiModified

  • ApiDeleted

  • ...

  • UpcomingPayment

  • PaymentOverdue

  • OrganisationFrozen

  • SubscriptionPlanCreated

  • SubscriptionPlanModified

  • ...

The Notification service can consume these events and depending on the user settings, what is being watched, and the type of notification it can either send out an immediate notification or store the event for a roll-up notification. We can add new notifications without changing the events emitted.
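
A sketch of that routing decision, with hypothetical settings, email and roll-up abstractions (the event envelope and type names are assumptions too):

```csharp
using System;
using System.Threading.Tasks;

public enum NotificationFrequency { Instant, Daily, Weekly }

public record NotificationEvent(string Type, Guid UserId, Guid OrganisationId, string Payload);

public interface INotificationSettingsStore { Task<NotificationFrequency> GetFrequencyAsync(Guid userId); }
public interface IEmailSender { Task SendAsync(Guid userId, string subject, string body); }
public interface IRollupStore { Task AppendAsync(Guid userId, NotificationEvent evt); }

public class NotificationRouter
{
    private readonly INotificationSettingsStore _settings; // replicated from the Profile service
    private readonly IEmailSender _email;
    private readonly IRollupStore _rollups;

    public NotificationRouter(INotificationSettingsStore settings, IEmailSender email, IRollupStore rollups)
        => (_settings, _email, _rollups) = (settings, email, rollups);

    public async Task HandleAsync(NotificationEvent evt)
    {
        // Billing-related notifications bypass the user's roll-up preference.
        var isBilling = evt.Type is "UpcomingPayment" or "PaymentOverdue" or "OrganisationFrozen";
        var frequency = await _settings.GetFrequencyAsync(evt.UserId);

        if (isBilling || frequency == NotificationFrequency.Instant)
        {
            await _email.SendAsync(evt.UserId, subject: evt.Type, body: evt.Payload);
        }
        else
        {
            // Stored now, sent later as part of the daily or weekly digest.
            await _rollups.AppendAsync(evt.UserId, evt);
        }
    }
}
```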

So we see that we can use events to replicate data but also trigger tasks in the system, in a very decoupled way.

Self-Contained Services Drawbacks

Eventual Consistency. In the best case, the Registry service might lag a few milliseconds behind the Profile service regarding profile data. In the worst case it could be minutes or even hours behind if there is a serious outage. But where a synchronous architecture might suffer a full blown outage, the self-contained architecture might just experience a delay in seeing some data.

Inconsistency. An error or bug might happen causing inconsistency between the data stored in the Registry service and the Profile service.

Related data, such as domains and sub-domains, are consumed from different topics, and so the event ingesters might consume these events in an inconsistent order. For example, we may be further ahead reading sub-domains than domains and insert a new sub-domain before its parent domain. If you have referential integrity between the two this would cause failures. But seeing as we are constructing a read-only model, perhaps 3rd normal form with foreign keys is overkill. We can often relax our table design a bit and go for a denormalised structure without referential integrity.

We have a few options:

  • Avoid inconsistency by pausing consumption of sub-domains while we wait for the domains ingestor to catch up. Somewhat complex if there are multiple dependencies between objects.

  • Be inconsistency tolerant. No referential integrity in the database. Inner joins will filter out objects with missing related objects.

  • Avoid inconsistency by storing related objects (an aggregate in DDD) in the same message in a unified topic. For example, have a domains topic that includes the sub-domains in the message as well.

Kafka has a potential consistency headache related to its partitions. If we start with 10 partitions, then the events of profile 1001 will always go to the same partition. But if we need to scale up later on, say to 100 partitions, the existing messages are not moved; only the distribution of new messages changes. Now events of profile 1001 are routed to a different partition, so the state of profile 1001 exists in two partitions. This means that we lose the guarantee that the latest state will always be written to the secondary stores of profile data. To mitigate this issue we can:

  • Start with a partition count that assumes future growth in traffic and data.

  • Identify older versions in other partitions and delete them (or mark them to be ignored).

  • Consumers check the object change timestamp against existing data, as sketched below. This mitigates the risk of overwriting newer versions with older versions.
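
A sketch of that last mitigation, assuming an Entity Framework Core ingester with hypothetical entity and event shapes: the local snapshot is only overwritten when the incoming event is newer than what is already stored.

```csharp
using System;
using System.Threading.Tasks;

public record ProfileChanged(Guid ProfileId, string Name, string JobTitle, DateTime ModifiedUtc);

public class ProfileSnapshot
{
    public Guid ProfileId { get; set; }
    public string Name { get; set; } = "";
    public string JobTitle { get; set; } = "";
    public DateTime ModifiedUtc { get; set; }
}

public class ProfileSnapshotWriter
{
    private readonly RegistryDbContext _db; // hypothetical DbContext with a ProfileSnapshots DbSet

    public ProfileSnapshotWriter(RegistryDbContext db) => _db = db;

    public async Task UpsertAsync(ProfileChanged evt)
    {
        var existing = await _db.ProfileSnapshots.FindAsync(evt.ProfileId);

        if (existing == null)
        {
            _db.ProfileSnapshots.Add(new ProfileSnapshot
            {
                ProfileId = evt.ProfileId,
                Name = evt.Name,
                JobTitle = evt.JobTitle,
                ModifiedUtc = evt.ModifiedUtc
            });
        }
        else if (evt.ModifiedUtc > existing.ModifiedUtc)
        {
            existing.Name = evt.Name;
            existing.JobTitle = evt.JobTitle;
            existing.ModifiedUtc = evt.ModifiedUtc;
        }
        // Otherwise the event is older than the stored state (for example one read from a
        // pre-rescale partition) and is ignored, so an old version never overwrites a newer one.

        await _db.SaveChangesAsync();
    }
}
```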

Conclusions

There is no silver bullet, so I don't want to be too opinionated because it usually depends on your own unique domain and personal/team experience. Additionally, there are probably other approaches to inter-service communication that I have not covered. But personally I prefer the self-contained, event-driven model over a synchronous one. We could even say that the choice is a false dichotomy because we don't have to choose one or the other; we can mix them up according to our needs.

One blocker to starting with the API Centric model is that you need a lot more maturity as an organisation regarding the technologies, patterns and practices. It is a massive jump to suddenly be thinking about correlation ids, circuit breakers, rate limiting and understanding the failure modes of a highly distributed system. High availability of services becomes very important, and HA doesn't come easily. If you already have engineers with experience of all the technologies and patterns required to make an API Centric architecture work reliably, then it might be a good choice.

With the Self-Contained model, I like that each service is more independent, with fewer request time dependencies. Latency is more predictable, with less variance. I don't have to spend so much time mapping out the dependency graph of HTTP calls and implementing circuit breakers everywhere. Debugging is easier as I have to start up fewer services for my own service to work.

But HTTP calls don't go away completely. There are always authentication/authorisation services and third party systems that only offer a REST endpoint. There are also potentially some services with too much data to replicate.

This model is also easier to adopt for organisations without the necessary experience and technologies to go for the API Centric model. While still distributed, the self-contained model has islands of computation in a loosely connected mesh. Developers on each island are more isolated and usually only have to look around their own codebase when things go wrong.

The major downsides of the self-contained approach are the eventual consistency of the data and even some rare cases of inconsistent data. But consider that, regardless of our choice, we'll probably need to make use of caching, so eventual consistency creeps into almost all architectures these days anyway.

I have worked with both architectures and experienced the upsides and downsides of each, but ultimately I prefer the self-contained approach where possible.

In the next part we'll see the new services in the self-contained model with Kafka in the middle.

See the series introduction for the list of posts in this series.