An Overview of Resiliency
This is the 8th post in a series on System Design.
Any failure that can happen will eventually happen as you scale out your applications. Failures of hardware, software crashes, memory leaks, etc. You will experience more failures if you have more components.
When a system is resilient, it can withstand a major disruption within acceptable degradation parameters and recover quickly.
Common Failure Causes
To protect your systems against failures, you must first understand what can go wrong. Single points of failure, unreliable networks, slow processes, and unexpected loads are the most common causes of failures.
Single point of failure
The most obvious cause of failure in a distributed system is a single point of failure. This one component could bring down the entire system if it failed. When a system is designed, it should be identified if there is a single point of failure.
Detecting them requires examining every component of the system and asking what would happen if it failed.
A client makes a remote network call by sending a request to a server and expecting a response from it later. After sending a request, the client should receive a response shortly. What if the client waits and waits and still doesn’t receive a response? The client does not know whether a response will eventually arrive in that case. If it reaches that point, it only has two options: to continue waiting or to fail the request with an exception.
From an observer’s perspective, a very slow process is not very different from one that doesn’t run at all — both cannot perform useful tasks. Slow processes are often caused by resource leaks. Any time you use resources, especially those leased from a pool, there is a possibility of leaks. The most common source of leaks is memory. As a result of constant paging and the garbage collector eating up CPU cycles, the process is slower.
If you use a thread pool, you can lose a thread if it blocks on a synchronous call that never returns. A thread won’t be returned to the pool if it makes a synchronous blocking HTTP call without setting a timeout. It will eventually run out of threads since the pool has a fixed size and keeps losing threads.
There is a limit to how much load a system can handle without scaling. As the load increases, you’re bound to hit a brick wall sooner or later. An organic increase in load that gives you time to scale out your service accordingly is one thing, but an unexpected spike is another.
The cascading failure occurs when a portion of an overall system fails, increasing the probability of other parts failing as well. Once a cascading failure has started, it is very difficult to stop it. Mitigating one is best achieved by preventing it in the first place.
Patterns that protect service against downstream failures will be explored.
If there is no response within a certain amount of time, you can configure a timeout to fail the request. Failures are isolated and limited by timeouts, preventing cascading failures. Don’t rely on third-party libraries that make network calls without setting timeouts.
In order to make a network request, a client should configure a timeout. What should it do when the request fails or the timeout occurs? There are two options at that point: the client can either fail fast or retry the request later. Retrying after some backoff time may be successful if the failure or timeout was caused by a short-lived connectivity issue.
Assume your service uses timeouts to detect communication failures with downstream dependencies, and retries to mitigate transient failures. What should it do if the failures aren’t transient and the downstream dependency keeps being unresponsive? Clients will experience slower response times if the service keeps retrying failed requests. Cascading failures can result from this slowness propagating to the rest of the system.
For non-transient failures, we need a mechanism that detects long-term degradations of downstream dependencies and prevents new requests from being sent downstream. It is also known as a circuit breaker.
The purpose of a circuit breaker is to prevent a sub-system from bringing down the entire system. In order to protect the system, calls to the failing sub-system are temporarily blocked. After the sub-system has recovered and failures have stopped, the circuit breaker allows calls to go through again. In contrast to retries, circuit breakers prevent network calls entirely, making them particularly useful for long-term degradations. Retries are helpful when the next call is expected to succeed, while circuit breakers are helpful when the next call is expected to fail.
In order to protect ourselves from the upstream load, let’s look at what we can do.
At any given time, a server has very little control over how many requests it receives, which can negatively affect its performance. When a server is at capacity, accepting new requests will only degrade it, so there is no reason to keep accepting new requests. As a result, excess requests should be rejected by the process. This will allow it to focus on the ones it is already processing. If the server detects that it’s overloaded, it can reject incoming requests by failing fast and returning a 503 status code. It is also known as load shedding.
When clients don’t expect a response within a short time frame, load leveling can be used as an alternative to load shedding. Clients and services will communicate through a messaging channel. Channels decouple the load directed to the service from its capacity, allowing it to process requests at its own pace rather than being pushed to it by clients. By smoothing out short-lived spikes, this pattern is referred to as load leveling.
Throttling, or rate-limiting, is a mechanism that rejects requests if a specific quota is exceeded. There are multiple quotas for a service, such as the number of requests seen or the number of bytes received within a given period. Users, API keys, or IP addresses are typically subject to quotas.
If a service rate limits a request, it must return a specific error code to let the sender know the quota was exceeded. When HTTP APIs are used, the most common way to do this is to return a response with status code 429.
It is also used to enforce pricing tiers; if a user wants to use more resources, they will also have to pay more.
Bulkhead patterns are designed to isolate faults in one part of a service from affecting the whole service. It is named after the partitions of a ship’s hull. An isolated leak on one partition does not spread to the rest of the ship if it is damaged and filled with water.
By design, the bulkhead pattern ensures fault isolation. Partitioning a shared resource, such as a pool of service instances behind a load balancer, allows each user to only use resources belonging to the partition to which they are assigned.
In another example, we should use different connection pools for downstream connections, so that if one connection pool is exhausted, the remaining connections will not be affected. Other services will continue to operate normally even if downstream services run slowly in the future or the connection pool is exhausted.
A health check detects a degradation caused by a remote dependency like a database, which must be accessed to handle incoming requests. Response time, timeouts, and errors of remote calls directed to the dependency are measured by the process. To reduce the load on downstream dependencies, the process reports itself as unhealthy if any measure breaks a predefined threshold.
The main reason to build distributed services is to be able to withstand single-process failures. Your service’s health should not be affected by a process crash if:
- Incoming requests can be handled by other processes that are identical to the one that crashed;
- Any process can serve requests since they are stateless;
- All non-volatile states are stored on a separate and dedicated data store so that they are not lost when the process crashes;
- All shared resources are leased so that when the process crashes, the leases expire and other processes can access them;
- To handle the occasional failure of individual processes, the service is always slightly over-scaled.
Whenever any monitored metric breaches a configured threshold, the watchdog considers the process degraded and deliberately restarts it.
Book: Designing Data-Intensive Applications
Book: Understanding Distributed Systems