Scaling a biotech research platform requires combining a collection of different software systems and components into a single coherent whole, aka a Chimera Data Platform.
To keep this system maintainable as it grows, it needs enough consistency to ensure that a small group of developers can quickly understand each part.
So how do you do that?
Wait! Don’t close the tab yet – let me explain.
Ambiguity and design decisions
When you’re designing a system, it’s fairly natural to organize the logic into small packets – functions, classes, modules, etc.
For a single, monolithic system, there aren’t many organizational decisions beyond that.
But once you have two systems, you have to decide what logic goes where.
For example, let’s say you have a process where a user uploads a CSV file, some quality checks are run, the file gets converted to an Excel file, then it gets saved in a cloud bucket.
For various reasons, let’s say we want to have system A do the quality checks and system B save the Excel file.
Should system A convert it to Excel before sending it, or should system B convert it after it gets a CSV file from system A?
Depending on the context, there may be significant technical reasons why one decision is better than the other.
But a lot of the time, it won’t actually matter.
And the more complex the system, the more decisions you’ll find like this that don’t actually matter.
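To make the CSV-to-Excel example concrete, here is a minimal sketch of both options. Everything in it is hypothetical: the function names are made up, the "bucket" is just a dictionary, and convert_to_excel is a stand-in that does not produce a real Excel file (that would need a spreadsheet library). The point is only that the stored result is the same wherever the conversion lives.

```python
def quality_check(csv_text):
    """Hypothetical check: every row must have the same number of columns."""
    rows = [line.split(",") for line in csv_text.strip().splitlines()]
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("ragged CSV")
    return csv_text


def convert_to_excel(csv_text):
    """Stand-in only: returns a dict, not a real Excel file."""
    rows = [line.split(",") for line in csv_text.strip().splitlines()]
    return {"format": "xlsx", "rows": rows}


def save(bucket, name, document):
    """The 'bucket' is a plain dict standing in for cloud storage."""
    bucket[name] = document


# Option 1: system A checks and converts, system B only saves.
def system_a_v1(csv_text):
    return convert_to_excel(quality_check(csv_text))


def system_b_v1(bucket, name, excel_doc):
    save(bucket, name, excel_doc)


# Option 2: system A only checks, system B converts and saves.
def system_a_v2(csv_text):
    return quality_check(csv_text)


def system_b_v2(bucket, name, csv_text):
    save(bucket, name, convert_to_excel(csv_text))


upload = "sample,od600\nA1,0.12\nA2,0.15\n"
bucket_1, bucket_2 = {}, {}
system_b_v1(bucket_1, "plate.xlsx", system_a_v1(upload))
system_b_v2(bucket_2, "plate.xlsx", system_a_v2(upload))
same_result = bucket_1 == bucket_2
```

Both buckets end up holding the same document, so nothing in the output tells you which system did the conversion.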
Consistency in Arbitrary Decisions
These arbitrary decisions don’t matter until the new team member shows up and wants to make some changes.
Or, more likely, the original programmer wants to make some changes but has no memory of writing the code in the first place.
So to find the code that converts from CSV to Excel, do they look in system A or system B?
To make this problem sensible, you need a pattern that tells you which arbitrary decision was most likely made, and thus where you should start looking.
But there are so many types of arbitrary decisions like this, you can’t just come up with a few abstract rules to cover them all.
You need metaphors.
Metaphors for distributed systems
A metaphor is a figure of speech in which you talk about one thing as if it were something else.
In our situation, we’re going to talk about the computer programs that make up these different systems as if they were something more than just programs.
And while there are a number of different metaphors you can use, there are three common ones I’ll write about in this post: Services, Events and State.
The Services Metaphor
A service is a component that takes requests from different clients, performs an action for each request, then returns a response.
This could be something simple like writing information into a database, or retrieving something else.
Or it could be more complex like making a prediction/calculation or kicking off a long-running workflow.
While performing the action, the service that the client called might make requests to other services, and so on.
Interactions with a service are stateless: the client and the service don’t establish an ongoing session, so the request has to include the full context for each request/action/response loop.
The service is also expected to respond fairly quickly.
So if the client requests that a long-running job be started, the service will respond as soon as the job is queued, and include an ID for the job.
If the client wants to know the status later, it has to send another request.
And the request better include the ID: While the service will remember the job, it immediately forgets the client.
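Here is a minimal sketch of that request/action/response loop, assuming a made-up JobService class and a made-up request format (plain dictionaries with an "action" key). It is not any particular framework’s API, just an illustration of a service that remembers jobs but not clients.

```python
import uuid


class JobService:
    """Hypothetical service: remembers jobs, never remembers clients."""

    def __init__(self):
        self._jobs = {}  # job_id -> job state

    def handle(self, request):
        # Each request must carry its full context; there is no session.
        action = request.get("action")
        if action == "start_job":
            job_id = str(uuid.uuid4())
            self._jobs[job_id] = "queued"
            # Respond as soon as the job is queued, not when it finishes.
            return {"status": "accepted", "job_id": job_id}
        if action == "job_status":
            job_id = request.get("job_id")
            if job_id not in self._jobs:
                return {"status": "error", "reason": "unknown job_id"}
            return {"status": "ok", "job_id": job_id,
                    "job_state": self._jobs[job_id]}
        return {"status": "error", "reason": "unknown action"}


service = JobService()
started = service.handle({"action": "start_job"})
# Later: a separate request, which has to include the ID.
later = service.handle({"action": "job_status", "job_id": started["job_id"]})
# A status request without the ID gets an error back.
forgotten = service.handle({"action": "job_status"})
```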
Microservices and Service-Oriented Architecture
The two most common frameworks based on the services metaphor differ mostly in technical detail and to a lesser extent in how they apply the metaphor.
Microservices is the one you’re most likely to hear about, but service-oriented architecture (SOA) still has plenty of adherents.
On the technical level, microservices tends to use JSON-based REST APIs while SOA tends to use XML-based SOAP APIs.
(But both frameworks explicitly let you use any API you want.)
The microservices approach assumes you have a lot of control over all the services and you’re splitting them up in order to manage scale in terms of data/users and distribute development across a larger team.
So the framework is optimized for having lots of narrowly-scoped services (hence “micro”) that are relatively tightly coupled.
The software that supports it, such as Kubernetes, focuses on coordinating the deployment and communications of lots of services with multiple replicas and different versions.
Service-oriented architecture, on the other hand, assumes your services are relatively large applications that you have limited control over.
So it’s optimized for managing communication between a small number of large, loosely coupled services, some of which may not even be designed as services.
The software associated with SOA includes things like service buses that translate between different API standards, or even software without a formal API.
This type of approach is particularly useful for organizations with large amounts of legacy software systems.
A small biotech organization with more modern software may want to leverage elements of both frameworks.
The Events Metaphor
An event is something that has changed or happened, and that the system needs to respond to.
In the events metaphor, the client doesn’t make a request to a service: It notifies another component of an event.
In the same way that a service can make a request to other services, a component can pass on the notification about an event to other components.
Each action is in response to the event rather than to a request from the client.
That may seem like a nominal distinction because it is – the only difference is the metaphor.
From a technical perspective, there’s no difference between notification/action/acknowledgement vs request/action/response.
In fact, Event-driven Architecture (EDA) – the event-based equivalent of Microservices or SOA – explicitly builds off of SOA.
But the metaphor may change how you make some of the otherwise arbitrary decisions about which components manage which logic.
And this has proven particularly useful, for example, in developing data streaming frameworks such as Kafka.
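As a rough sketch of the notification/action/acknowledgement loop, here is a tiny in-process event bus. The EventBus class, the event names and the handlers are all invented for illustration; this is not how Kafka or any other streaming framework is implemented.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process bus: components subscribe to event types."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def notify(self, event_type, payload):
        # Call every subscriber and return how many handled the event.
        handled = 0
        for handler in list(self._handlers[event_type]):
            handler(payload)
            handled += 1
        return handled


bus = EventBus()
log = []


def run_quality_checks(payload):
    log.append(("checked", payload["file"]))
    # Pass a notification on to other components.
    bus.notify("file_checked", payload)


def save_to_bucket(payload):
    log.append(("saved", payload["file"]))


bus.subscribe("file_uploaded", run_quality_checks)
bus.subscribe("file_checked", save_to_bucket)
handled = bus.notify("file_uploaded", {"file": "plate_42.csv"})
```

Notice that the uploader never asks anyone to check or save the file. It only announces that a file was uploaded, and each component decides what to do in response.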
The State Metaphor
State is the status at a given moment in time of all the aspects of a system that are not fixed.
The state of a tennis ball includes its position and velocity.
The state of a database is the data it contains and the status of any processes it’s running at a given moment.
A software system might track the state of all planes currently in flight, the state of all patients in a hospital, the state of all experiments in a lab, …
Such a system has essentially two operations: Update its record of some part of the external system’s state or query its record of some part of the state.
Updating the state is a particular type of event.
Querying the state is a particular type of request.
So again, this isn’t fundamentally different from the services or events metaphors, but it helps you make more consistent design decisions.
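A minimal sketch of those two operations, assuming a made-up ExperimentState class that keeps its record in memory:

```python
class ExperimentState:
    """Hypothetical record of the state of experiments in a lab."""

    def __init__(self):
        self._state = {}  # experiment_id -> dict of fields

    def update(self, experiment_id, **fields):
        # An update is a kind of event: something about the lab changed.
        self._state.setdefault(experiment_id, {}).update(fields)

    def query(self, experiment_id):
        # A query is a kind of request: return a copy of the current record.
        return dict(self._state.get(experiment_id, {}))


lab = ExperimentState()
lab.update("exp-1", status="running", operator="Sam")
lab.update("exp-1", status="done")
current = lab.query("exp-1")
unknown = lab.query("exp-2")
```

The second update only touches the status field, so the operator recorded by the first update is still part of the state.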
I don’t know of a state-based framework along the lines of microservices, SOA or EDA, but there is an API.
Unlike REST or SOAP, the GraphQL API is built around requesting and updating state.
In REST and SOAP, the service defines the structure of the requests that clients must send and the response that they’ll get back.
In GraphQL, the client tells the service what information it’s sending for a state update, and exactly which fields it wants to get back for a state query.
The service gets to decide how it should respond to state updates and how it should figure out the state in response to a query, which could involve sending messages to other components.
But the metaphor pushes these and similar, otherwise arbitrary, design decisions in a consistent direction.
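To illustrate the idea of the client choosing which fields it gets back, here is a sketch in plain Python. To be clear, this is not GraphQL syntax and it doesn’t use a GraphQL library: select_fields and the nested-dictionary "selection" format are assumptions made up for this example.

```python
def select_fields(record, selection):
    """Return only the fields named in `selection`.

    `selection` maps a field name to None (take the whole value)
    or to a nested selection dict (take only part of a nested record).
    """
    result = {}
    for name, sub_selection in selection.items():
        value = record.get(name)
        if sub_selection is not None and isinstance(value, dict):
            result[name] = select_fields(value, sub_selection)
        else:
            result[name] = value
    return result


experiment = {
    "id": "exp-1",
    "status": "running",
    "operator": {"name": "Sam", "email": "sam@example.org"},
    "readings": [0.12, 0.15],
}

# The client, not the service, decides the shape of the response.
selection = {"status": None, "operator": {"name": None}}
picked = select_fields(experiment, selection)
```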
Architecture or metaphor?
Not everything in distributed systems is a metaphor.
Or at least, not everything is *just* a metaphor.
There are also system architectures that are technically specific and legitimately distinct, but happen to use a metaphor to help people understand them.
A task queue is a good example: Workers and managers are metaphors, but the model of interaction between them is very specific.
There are very specific technical reasons why this architecture might be better or worse than another approach in a particular context.
But there are also some arbitrary decisions within the bounds of this architecture, such as the boundary between worker and manager, that the metaphor helps to keep consistent.
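Here is a small sketch of a task queue using Python’s standard queue and threading modules. The split of responsibilities (the manager hands out tasks and collects results, workers only compute) is one arbitrary choice among several, and squaring numbers is just a placeholder for real work.

```python
import queue
import threading


def worker(tasks, results):
    """Worker: take tasks until the manager sends a stop signal."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel from the manager: no more work
            tasks.task_done()
            break
        results.put((task, task * task))  # placeholder for real work
        tasks.task_done()


def manager(items, num_workers=2):
    """Manager: start workers, hand out tasks, collect the results."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    tasks.join()
    for thread in threads:
        thread.join()
    collected = {}
    while not results.empty():
        task, value = results.get()
        collected[task] = value
    return collected


squares = manager([1, 2, 3, 4])
```

Here the manager decides when workers stop, but you could just as easily have workers decide for themselves when the queue is empty. The metaphor suggests the first option: managers tell workers when the shift is over.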
When building a distributed system, adopting a metaphor for the interactions between components can help create a more consistent design.
Some metaphors may work better than others depending on the type of work the system is trying to accomplish.
And in some cases it may even make sense to use different metaphors for different parts of the system.
But in any case, it’s worth thinking about which metaphors you’re relying on – consciously or subconsciously – when you think and communicate about your designs.