At some point, you’ll get to work on an online system that needs so much scale that you start running up against hard limits. Adding nodes to your cluster will only do so much if your database has reached maximum capacity. These constraints are an opportunity to employ clever solutions, like load shedding.
The tragedy of the commons is a situation where individual users, acting in their own interest, behave in a way that hurts the group as a whole. One such case is when individuals, optimizing for personal gain, deplete a shared resource. A classic example is farmers who let their cattle overgraze, destroying the carrying capacity of the common grazing land for everyone.
In software, the grass is precious CPU cycles, and the hungry cows are clients trying to get their requests processed.
For many reasons, load on a system can spike. Maybe an automated test went awry, or a few nodes died. Unfortunately, a server can only handle so many requests at once. As more requests come in and overload the system, latency begins to suffer, and more and more client requests time out. The problem compounds until an inflection point is reached where the availability of the service is effectively zero: the server is trying to service all requests equally, but none of them complete in time. And if the clients are configured to retry, the incoming load multiplies and exacerbates the issue.
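To see how retries multiply load, here is a toy model (my own sketch, not from any particular system): assume every timed-out request is retried, and retries time out at the same rate while the server stays overloaded.

```python
def offered_load(base_rps: float, timeout_fraction: float, max_retries: int) -> float:
    """Toy model of retry amplification: each timed-out attempt is
    retried, up to max_retries times, and retries keep timing out at
    the same rate while the overload persists."""
    load = 0.0
    attempt_rps = base_rps
    for _ in range(max_retries + 1):  # original attempt + retries
        load += attempt_rps
        attempt_rps *= timeout_fraction  # this share will be retried
    return load
```

With 80% of requests timing out and clients retrying up to 3 times, `offered_load(1000, 0.8, 3)` turns 1000 rps of client demand into roughly 2950 rps of offered load, nearly tripling the pressure on an already struggling server.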
No grass. Hungry clients.
All of the work done for those timed-out requests has essentially gone to waste. This points to the difference between goodput and throughput. Throughput is all the requests the server processes; goodput is the subset that complete in time for the client to actually accept the response. The gap between the two represents wasted work. Maybe that wasted work would be better spent servicing the requests that are most likely to complete?
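The accounting can be made concrete. A small sketch (names are my own, for illustration): given the server latency and client timeout for every request the server finished, count how many responses the client was still around to use.

```python
def goodput_and_waste(completions):
    """completions: (server_latency_ms, client_timeout_ms) pairs for
    every request the server finished processing.

    Throughput counts all of them; goodput counts only the responses
    that arrived before the client gave up. The rest is wasted work."""
    throughput = len(completions)
    goodput = sum(1 for latency, timeout in completions if latency <= timeout)
    return goodput, throughput - goodput
```

For example, `goodput_and_waste([(50, 100), (150, 100), (90, 100)])` reports 2 useful responses and 1 unit of wasted work: the server burned cycles on all three, but the client only benefited from two.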
If you live in an area prone to natural disasters, you’ve probably experienced load shedding on your electrical grid. Load shedding, in electrical grids, is an intentional power shutdown of certain areas to prioritize others. You can imagine that this prioritization is intelligent: keeping power on for critical infrastructure is more important than keeping it on for residential televisions.
We can apply the same philosophy to distributed computer systems. We can rank requests against each other by importance, and throw away the ones that would cost us the least if left unserviced. In other words, we ration processing capacity according to the cost we would incur by refusing each request. One way to determine this is to classify requests into buckets. You may prioritize your enterprise customers over your self-service customers. You might deprioritize requests from users who have gone past their rate limits.
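A minimal sketch of such a bucketing scheme, assuming the buckets and utilization thresholds described above (the specific thresholds are illustrative, not tuned values): as load climbs, lower-priority buckets get shed first.

```python
from enum import IntEnum

class Bucket(IntEnum):
    ENTERPRISE = 0       # highest priority: shed last
    SELF_SERVICE = 1
    OVER_RATE_LIMIT = 2  # lowest priority: shed first

# Illustrative utilization thresholds (0.0 to 1.0) at which each
# bucket starts getting shed.
SHED_AT = {
    Bucket.OVER_RATE_LIMIT: 0.70,
    Bucket.SELF_SERVICE: 0.85,
    Bucket.ENTERPRISE: 0.95,
}

def should_shed(bucket: Bucket, utilization: float) -> bool:
    """Drop the request if current utilization has passed this
    bucket's threshold."""
    return utilization >= SHED_AT[bucket]
```

At 75% utilization, rate-limited traffic is already being dropped while enterprise and self-service requests still flow; at 90%, only enterprise traffic survives.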
You may employ techniques more complex than a simple bucketing mechanism, such as closely observing the degree to which clients are exceeding their baseline. However, these techniques aren’t free. Even the cost of simply throwing away a request can overwhelm your server, and the more steps you add before the shedding decision, the lower the maximum load you can tolerate before availability drops to zero. It’s important to understand where that point lies when designing a system that takes advantage of this technique.
One interesting strategy that I’ve found in my research is the idea of giving each request a TTL (time to live). Given the client’s timeout, there is only so much time a request can bounce around while being processed before it becomes “doomed”: at that point, any further work on it is wasted, because the client has stopped waiting for the response. If you work on a system with thorough tracing, you can attach this TTL to the request and allow the process handling it to give up at the appropriate moment.
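One way to sketch this, assuming the client attaches an absolute deadline to each request (for example, propagated through a tracing header; the function names here are mine):

```python
import time

def remaining_budget_ms(deadline_ms: float) -> float:
    """deadline_ms: absolute deadline (epoch millis) the client
    attached to the request."""
    return deadline_ms - time.time() * 1000

def handle(request, deadline_ms, process):
    """Shed a doomed request instead of processing it."""
    if remaining_budget_ms(deadline_ms) <= 0:
        # The client has stopped waiting; any work now is pure waste.
        raise TimeoutError("deadline exceeded before processing; shedding")
    return process(request)
```

Each hop in the call chain can run the same check, so a request that has already spent its budget in queues upstream gets dropped cheaply rather than processed into the void.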
When deciding what mechanism to employ, keep in mind the layer at which you are load shedding. Modern distributed systems are composed of many layers: you can shed at the load balancer, at the OS level, or in the application logic. This is a trade-off. The closer you get to the core application logic, the more information you have available to make the decision; on the other hand, the more work you have already performed, and the more it costs to throw the request away.
For example, dropping a request at the OS level is a lot cheaper than leaving it to the server process. I’d advise keeping these heavy-hitting, blunt-force techniques out of your automation, and instead reserving them as manual interventions for emergencies.
If you choose to shed in your application logic, think carefully about how much work is done for a request before it gets thrown away. For example, if your server validates client access tokens, are you going to verify the token’s signature before making your shedding decision? Verifying a signature isn’t free.
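A sketch of ordering the checks cheapest-first, under assumptions of my own (an HMAC-signed token and an unverified priority hint in the request; names are illustrative):

```python
import hashlib
import hmac

SECRET_KEY = b"demo-secret"  # illustrative; a real key comes from config

def verify_signature(payload: str, signature: str) -> bool:
    """The expensive step: cryptographic verification of the token."""
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def admit(request: dict, utilization: float) -> bool:
    # Cheap check first: under overload, shed based on an unverified
    # priority hint. The hint can be spoofed, so it only guides
    # shedding, never authorization.
    if utilization >= 0.9 and request.get("tier") != "enterprise":
        return False
    # Only requests that survive shedding pay for verification.
    return verify_signature(request["payload"], request["signature"])
```

The design choice here is that a shed request costs a dictionary lookup and a comparison, while an admitted one pays the full cryptographic price; reversing the order would make every doomed request as expensive as a serviced one.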