System Architecture

What is a system? A system fulfils a business need. A system is made up of:

Processes a.k.a services
Databases e.g. Apache Cassandra
Queueing systems e.g. Apache Kafka

Users needn’t know about these components, only the system’s ingress. Users interact with your system via its ingress. Typical ingress mechanisms are:

HTTP and HTTP/2
gRPC over HTTP/2
A topic/queue such as Apache Kafka

Users can be humans via a web browser or other systems.

What is system architecture? Simply put, architecture consists of everything that is hard to change after the fact. Your system architecture impacts:

Reliability: does your system give the user the correct answer
Availability: can your system respond to user requests
Performance: how quickly does your system respond
Capacity: how many concurrent requests can your system handle

Our recommended approach is to define Service Level Objectives (SLOs) for these aspects of your system even if you don’t a Service Level Agreement (SLA) with your user.

Some important parts of system architecture that constrain the above are:

System ingress
Interprocess communication
Persistence technologies e.g. databases & queues
Database access patterns

System ingress

A user interacts with a system via its ingress. There are different tradeoffs for ingress vs communication within a system. For ingress typical concerns are:

Forward and backward compatibility: users are out of your control, within a system all the components are in your control
Authentication and authorisation
TLS
Global access: is the system exposed to the web, should their be edge locations around the world or a single region?

For externally accessible systems the most common ingress technology is HTTP. There are many benefits to sticking with HTTP:

Global HTTP ingress support cloud providers with support for TLS termination
Contend Delivery Networks and other caching technology
Well defined authentication mechanisms and single sign on (SSO)

For internal systems other ingress technologies can be used that are typical for inter process communication with in a system:

Queues e.g. Kafka: Asynchronous ingress allows the system to be fully shut down for maintenance without users knowing as long as performance SLOs aren’t breached
HTTP/2 can take advantage of TCP multiplexing to reduce the overhead of connections
gRPC uses, and has the advantages of, HTTP/2 as well as defining interfaces and serializion formats

Inter process communication

The first major decision for interprocess communication is whether it is asynchronous or synchronous. Queues such as Kafka allow asynchronous communication allowing components to be fully shutdown at the expense of two additional complexities:

Communication complexity for request/reply
Infrastructure complexity of running queueing technology

Synchronous communication means that the downstream component needs to be up at the time of request. It allows easier request/reply communication. Synchronous communication is susceptible to cascading failures, where a downstream failure causes failures in all upstream components.

Database access patterns

Database access patterns come in three main categories:

Create, Update, Read, Delete (CRUD) where the database stores the current state of entities such as users, transactions or stock
Arbitrary access. Less principled than CRUD, using any/every feature of a database e.g. arbitrary SQL.
Event sourcing and Command Query Read Segregation (CQRS)

CRUD often degrades into arbitrary database access. Event sourcing opens up possibilities such as rebuilding different views on the data as all historical events are stored but introduces complexity such as:

Large build up of events, slowing the system down
Supporting old schemas for events

Each process in a system can use different database access patterns. Some organisations will give full autonomy to development teams to decide on internal data access patterns whereas others will want to standardise.

Persistence technology

Nothing affects a system more than the selected persistence technology. It puts upper bounds on:

Capacity: A traditional relational database may hit its limit in the 10s of TBs whereas a distributed database such as Apache Cassandra can go up to Petabytes
Availability: Can it be deployed across DCs or is the max availability that of a single cloud region or data center?
Performance

In addition, it will dictate the costs of a system. Using Google Spanner may provide your system with excellent latency and multi region support but it comes at a huge cost.