Want to reduce downtime? Are you sure?

Guillermo Ojeda · Jul 30, 2022 · 9 min read

Software systems can either work or not work. When they work, users are happy, we make money, and everything is great. When they're not working, though, we miss a chance to make money and users get angry (rightfully so), which leaves us feeling down. That's why we call the time when the system is not working downtime (it's a joke).

System works = Good. System not works = Bad.

So, we just make the system work all the time and we make more money, right? More good and less bad! Well, yeah, possibly, kind of, but you should know how expensive that is.

We're going to talk about making a system work more often (and be down less often), how to achieve that, and how much it can cost (spoiler: it's not linear, it becomes A LOT). In more technical terms, we're going to talk about availability, downtime, how to architect a system for higher availability and the associated costs.

What does availability mean?

Let's go with a few technical terms first. We call availability the percentage of time that a system is working. By the way, working means it can serve user requests. We measure availability in nines (9s), where two nines (2 9s) means the system is working 99% of the time. If the system is working 99% of the time, it's not working 1% of the time.

Actually there are more states than working and not working. For example, the system could be presenting old data when it can't fetch the latest, or it could be working for only some users, or with an increased latency. These are not necessarily bad things, they're actually strategies to deal with parts of a system being down. However, to keep this discussion simple, we're just going to consider the two states of up (everything is working) and down (nothing is working).

This feels like a good spot to warn you that this discussion requires a bit of math. For now we'll go with some numbers so you get a feeling of how much actual downtime we're talking about, but we'll be multiplying some probabilities later.

  • 95% availability = 5% downtime = 1 hour 12 minutes of downtime per day.
  • 99% availability = 1% downtime = 14 minutes of downtime per day or 3.65 days per year.
  • 99.5% availability = 0.5% downtime = 3.5 hours of downtime per month or 1.8 days per year.
  • 99.9% availability = 0.1% downtime = 8 hours 45 minutes of downtime per year.
  • 99.99% availability = 0.01% downtime = 52 minutes of downtime per year.
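
If you want to check these numbers yourself, here's a tiny Python sketch of the conversion (it assumes a flat 365-day year, so treat the outputs as approximations):

```python
# Convert an availability percentage into downtime per day and per year.

def downtime(availability_percent: float) -> None:
    unavailable = 1 - availability_percent / 100
    per_day_min = unavailable * 24 * 60
    per_year_hours = unavailable * 365 * 24
    print(f"{availability_percent}% -> {per_day_min:.0f} min/day, {per_year_hours:.1f} h/year")

for nines in (95, 99, 99.5, 99.9, 99.99):
    downtime(nines)
```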

We measure availability in 9s and operate in probabilities, but I wanted you to understand how much downtime our values for availability really mean. Going from 95% to 99% availability doesn't seem too crazy, but can you imagine what you'd need to do to achieve 99.99% availability? Probably rearchitect the whole system and even the whole operations area of the company, at a minimum!

Ok, but how much availability do I have?

One way to know a system's availability is simply to measure it: availability = 1 - % of failed requests. It's very straightforward for an already implemented system. But what if we don't have a lot of historical data points (or any at all)? Then it becomes a game of What Ifs, probabilities and multiplication.
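
As a sketch, with made-up request counts standing in for whatever your logs or metrics give you, the measurement is just:

```python
# Measured availability: the share of requests that did not fail.
# These numbers are hypothetical; pull the real ones from your logs/metrics.

total_requests = 1_000_000
failed_requests = 6_000

availability = 1 - failed_requests / total_requests
print(f"Measured availability: {availability:.2%}")  # 99.40%
```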

Let me show you what I mean. Let's design a system together and see what its availability would be.

Some design decisions:

  • We'll make it serverless, because we're cool (and it makes the math easier)
  • We'll use AWS, because we're AWSome (just kidding)
  • We'll host static content in S3, but since that has a boring 4 9s of availability, we'll leave it out of this exercise.
  • There are GET (non-static content), POST and PUT requests that will always be handled by API Gateway -> Lambda -> RDS with a single instance.
  • If any part of the system fails, it will return an error response. There are no caches and no automatic retries (these are areas where we can improve; see the sketch right after this list).
  • I'm trying to keep things simple and easy to understand. In the real world there are a lot more moving parts, such as a VPC for RDS and Secrets Manager, and a lot of What Ifs that can happen like a whole AWS region failing. But let's enjoy a moment of blissful ignorance, at least for this post.
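
To make the baseline concrete, here's a minimal sketch of that plain request path. It's not a real implementation: `get_db_connection` is a hypothetical helper standing in for your database driver, and the only behavior that matters here is that any failure turns straight into an error response.

```python
# Minimal sketch of the baseline path: API Gateway -> Lambda -> single RDS instance.
# No cache, no queue, no retries: if anything fails, the user gets an error.

import json

def handler(event, context):
    try:
        conn = get_db_connection()      # hypothetical helper (e.g. psycopg2/pymysql under the hood)
        with conn.cursor() as cur:
            cur.execute("SELECT 1")     # placeholder for the real query
            result = cur.fetchone()
        return {"statusCode": 200, "body": json.dumps({"data": result})}
    except Exception:
        # RDS (or anything else) is down: the request simply fails.
        return {"statusCode": 500, "body": json.dumps({"error": "service unavailable"})}
```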

If our code works all the time, we could assume our system works all the time, right? Well, we'd be wrong. It turns out infrastructure can fail as well! And AWS even tells us how often it fails, through their Service Level Agreements (SLAs). There we see that both API Gateway and Lambda have 99.95% availability, and RDS single instance has 99.5% availability.

We're not actually guaranteed that services will have the availability stated in the SLA; we're only guaranteed that if they don't, AWS will give us some money back (which will probably not cover our own losses). An SLA is really a contract that says "if the service works less than X% of the time, we'll pay you Y": no guarantee that it will work, just a promise to pay you if it doesn't. I'm trying to keep things simple, so we're just going to take those numbers at face value. This is actually enough for most systems, but a handful of industries and use cases require a much deeper analysis.

So, math time! Our system is available if everything is working, and the combined probability of everything working is the product of the probabilities of every single part working (assuming our failure events are statistically independent). So, the availability of our system is 0.9995 * 0.9995 * 0.995 = 0.994 = 99.4%. Wait a second, that's more than 2 days of downtime a year! Can we live with that?
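
Here's the same calculation as a throwaway snippet, with the same independence assumption:

```python
# Components in series: everything has to work, so availabilities multiply
# (assuming independent failures).

from math import prod

def serial_availability(*components: float) -> float:
    return prod(components)

baseline = serial_availability(0.9995, 0.9995, 0.995)      # API Gateway, Lambda, single RDS
print(f"{baseline:.4f}")                                    # ~0.9940
print(f"{(1 - baseline) * 365:.1f} days of downtime/year")  # ~2.2 days
```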

Can we increase availability?

Honestly, most systems can actually live with that. So at this point I want you to stop panicking. You're fine! Please do keep reading, but out of curiosity, not out of a sense of need or urgency.

Let's assume we're not happy with our system's availability. Can we improve it? Absolutely! Take a couple of seconds to come up with an idea to improve our system's availability, then check out mine.

If we take a look at the numbers we'll notice our availability is dominated by the lowest number: the database. Remember our old friend Theory of Constraints? It shows up here as well! So we'll see how we can improve our constraint: the availability of our database.

If you opened the RDS SLA page earlier, you might have found that Multi-AZ DB Instances have an SLA of 99.95%. If we use that instead of a single RDS instance, then our availability goes up significantly: 0.9995 * 0.9995 * 0.9995 = 0.9985 = 99.85%. That's about 13 hours of downtime per year, much better than our previous number. Good job!

Can we go further up? Yes we can! We can add a cache for reads using ElastiCache (99.9% availability with Redis multi-AZ), so if the database fails, reads can be served from there. For writes, we can push write operations to an SQS queue (99.9% availability) while the database is failing. In that case our availability would be 0.9995 * 0.9995 * (1 - ((1 - 0.9995) * (1 - 0.999))) ≈ 0.9990 = 99.90%. It went up again!
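
The new piece here is the parallel composition: the data layer only fails if the database and its fallback fail at the same time. A quick sketch, again assuming independent failures:

```python
# Serial composition: everything must work. Parallel composition: at least one must work.
# Same independence assumption as before.

from math import prod

def serial(*parts: float) -> float:
    return prod(parts)

def parallel(*parts: float) -> float:
    return 1 - prod(1 - p for p in parts)

data_layer = parallel(0.9995, 0.999)           # Multi-AZ RDS with its cache/queue fallback
total = serial(0.9995, 0.9995, data_layer)     # API Gateway, Lambda, data layer
print(f"{total:.6f}")                          # ~0.999000, i.e. ~99.90%
```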

Can we go even further up? Maybe...

But at what cost?

For the database improvement, the cost is pretty simple: twice as much per month for RDS. A Multi-AZ deployment with two instances costs twice as much as a single instance (which makes sense, because you get two instances), and the change is pretty simple to make (at least while we're in the design phase).

Adding ElastiCache and SQS is not so simple, and it's definitely not cheap. Both ElastiCache and SQS have their own monthly costs (not much), but you'll also need to make a few changes to your app (there's a rough sketch of them after this list):

  • You'll have to modify it to write to ElastiCache (which increases Lambda execution time and response time) and read from ElastiCache when the DB is not working.
  • You'll need to modify it to write to SQS when the database is not working.
  • You'll need an app to consume from the SQS queue and write to the DB when it's back up.
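
To give you an idea of the shape of those changes, here's a rough sketch of the read and write paths. It's not the actual implementation: `read_from_db`, `write_to_db`, `cache` (a Redis client), `DatabaseDown` and the queue URL are all hypothetical placeholders.

```python
# Rough sketch of the fallback logic: serve reads from the cache and park writes
# in SQS whenever the database is down. All DB/cache helpers are hypothetical.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pending-writes"  # hypothetical

def handle_read(key: str):
    try:
        value = read_from_db(key)             # hypothetical DB helper
        cache.set(key, json.dumps(value))     # keep the cache warm for fallback reads
        return value
    except DatabaseDown:                      # hypothetical exception type
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)         # possibly stale, but better than an error
        raise

def handle_write(key: str, value: dict):
    try:
        write_to_db(key, value)               # hypothetical DB helper
    except DatabaseDown:
        # Park the write in SQS; a separate consumer replays it once the DB is back up.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"key": key, "value": value}),
        )
```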

Sure, none of this is infeasible, but notice how it's significantly harder and more expensive than switching to Multi-AZ RDS, and how the increase in availability is smaller? That's because the more available your system already is, the harder and more expensive it becomes to increase its availability further.

When is it enough?

As you asymptotically approach 100% availability, you need to know when to stop. More is not always better: at some point your users won't even notice the change, but I assure you they will notice the price increase that you'll need to apply to stay in business because you went overboard and quadrupled your operational costs. Some rules of thumb:

  • Try to avoid a single instance (e.g. one EC2 or one RDS). Going with multiple instances is usually cheap enough. And do so in multiple AZs when you can (which is 99% of the time).
  • Don't go multi-cloud unless you have a big reason to do so. It's going to cost you a lot more. Avoiding cloud vendor lock-in is not a big reason if you're already locked in to a platform (e.g. Kubernetes) or to a programming language, framework or library.
  • Don't even go multi-region unless you really have to. If you must do it, use different disaster recovery strategies for different parts of your service, ideally going with pilot light for most of them if you can.
  • Don't overengineer for an idealized number. Netflix is down pretty often and we're still using it. If you're not sure whether you're fine or you need more availability, chances are you're perfectly fine.
  • Reality is not black and white, where HTTP 200 = user happy and HTTP 404 = user sad. Implement retries and/or degraded responses (there's a minimal retry sketch after this list), and don't lose focus on other quality aspects like latency.
  • Users only get really mad when they lose money. The rest of the time they rarely even care.
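
On the retries point, here's a minimal sketch of retrying with exponential backoff and jitter; the attempt count and delays are placeholder values to tune for your own system.

```python
# Retry a flaky call a few times with exponential backoff and jitter before giving up.

import random
import time

def with_retries(call, attempts: int = 3, base_delay: float = 0.2):
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: fail (or return a degraded response instead)
            time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))
```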

Wrapping up

Availability is the percentage of time your system is working and making money. You can get a number by multiplying the availabilities of all the parts your system depends on. More availability is better, but also more expensive. Find a number that works for your users (look at the downtime, think of the impact), and chase that. Don't overpay, don't overengineer. Implement mitigation strategies (retries, degraded responses) and don't forget about other quality aspects like latency. It's all about the users.

System works = Good. System not works = Bad.

Paying a bit for system to work much more often = Good. Paying too much for system to work only slightly more often = Bad.

Don't overpay, don't overengineer.

Self promotion

I create scalable, secure and cost-efficient cloud solutions for startups. This process includes finding a balance between availability and cost (of implementation, operations and maintenance). For 99% of startups just making something scalable provides enough availability, and this never becomes an infrastructure problem. I also design much more complex solutions for the other 1%, but we only go there if you really need it.

If you need something scalable, whether you're just starting or you need to fix something, give me a call. If you need something more complex, give me a call as well, I'll help you figure out how far you need to go and build the simplest solution that takes you there.

 