Reliability Basics
Learn how cloud systems stay reliable through redundancy, failure domains, and regional design.
This lesson is purely conceptual — no AWS usage required.
What reliability means in cloud
Systems fail. Hardware fails. Networks fail. People make mistakes. Updates break things.
Reliability does not mean “nothing ever fails.” It means your system keeps working even when something fails.
1) What breaks systems
Here are common causes of downtime:
- A server crashes
- A database becomes unavailable
- A network link fails
- A power issue affects a location
- A bad deploy or misconfiguration breaks the app
- Traffic spikes overload the system
You cannot prevent every failure. But you can design so failures do not take everything down.
2) Redundancy
Redundancy means you run more than one copy of each critical component, so another copy can take over when one fails.
Example:
- One server running your app is a risk.
- Two servers is better.
- Two servers in two different “places” is even better.
Note
Redundancy is not just “more.” Redundancy is independence. If both copies can fail from the same issue, the redundancy is weak.
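To see why independence matters, here is a rough sketch in plain Python (no cloud account needed). It assumes each copy fails independently of the others and uses a made-up figure of 99% uptime per server. The function name `combined_availability` is just an illustration, not part of any library.

```python
def combined_availability(single: float, copies: int) -> float:
    """Chance that at least one of `copies` replicas is up.

    Assumption: replicas fail independently of each other.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies must be down at the same time for the system to be down.
    return 1.0 - (1.0 - single) ** copies


# Hypothetical figure: each server is up 99% of the time.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

The numbers only improve this way when the copies really are independent. If both servers sit on the same machine or share the same power supply, one issue can take both down, and the real availability stays close to that of a single copy.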
3) Failure domains
A failure domain is a group of components that share something (a machine, a rack, a power supply), so one failure inside that boundary can affect all of them.
Examples of failure domains:
- One physical machine
- One rack in a data center
- One data center building
- One power system
- One network provider
- One geographic area
If all your critical components are inside the same failure domain, a single issue can take everything down.
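One way to check this is to write down where each copy lives and look for anything they all have in common. The sketch below is a minimal illustration. The level names (`machine`, `rack`, `building`) and the values are hypothetical, and `shared_failure_domains` is an invented helper, not a real tool.

```python
from typing import Dict, List

# Each replica maps a failure-domain level to the specific domain it sits in.
Placement = Dict[str, str]


def shared_failure_domains(replicas: List[Placement]) -> Dict[str, str]:
    """Return every level where all replicas sit in the same domain."""
    if not replicas:
        return {}
    shared = {}
    first, rest = replicas[0], replicas[1:]
    for level, value in first.items():
        # A level is shared only if every other replica has the same value.
        if all(r.get(level) == value for r in rest):
            shared[level] = value
    return shared


# Hypothetical placement: two servers on different machines, same rack.
replicas = [
    {"machine": "m1", "rack": "r1", "building": "b1"},
    {"machine": "m2", "rack": "r1", "building": "b1"},
]
print(shared_failure_domains(replicas))
```

Here the two servers survive a single machine crash, but a rack or building failure still takes both down. Anything the function returns is a boundary worth spreading across.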
4) Regions vs availability zones
Cloud providers organize infrastructure into geographic groupings to improve reliability.
Region
A region is a geographic area. A major regional issue is rare, but if one occurs, all systems in that region may be affected.
Zone
A zone is a separate location inside a region designed to reduce shared failures. The goal is that a failure in one zone does not automatically take down the others.
Tip
Key takeaway:
- Zones help with data-center-level failures.
- Regions help with larger geographic failures.
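The difference can be sketched with a small model. This is a simplified illustration, not how any provider reports outages. The region and zone names are made up, and the helper functions are hypothetical.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    name: str
    region: str
    zone: str


def survivors_after_zone_outage(
    instances: List[Instance], region: str, zone: str
) -> List[Instance]:
    """Instances still running if one zone in one region goes down."""
    return [i for i in instances if not (i.region == region and i.zone == zone)]


def survivors_after_region_outage(
    instances: List[Instance], region: str
) -> List[Instance]:
    """Instances still running if an entire region goes down."""
    return [i for i in instances if i.region != region]


# Hypothetical deployment: two zones, one region.
multi_zone = [
    Instance("app-1", "region-1", "zone-a"),
    Instance("app-2", "region-1", "zone-b"),
]

print(survivors_after_zone_outage(multi_zone, "region-1", "zone-a"))
print(survivors_after_region_outage(multi_zone, "region-1"))
```

With this layout, losing `zone-a` leaves `app-2` running, but losing all of `region-1` leaves nothing. Adding an instance in a second region would be the only way for this model to survive a regional outage.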
5) Single point of failure
A single point of failure is any part of your system whose failure brings down the entire system.
Common examples:
- One server hosting everything
- One database with no backup/replica plan
- One critical secret stored in one place with no recovery
Your job in reliability design is to identify these and reduce them.
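A simple first pass is to list each role in your system and count how many copies back it. The sketch below assumes a hypothetical inventory. The role and instance names are invented for illustration.

```python
from typing import Dict, List


def single_points_of_failure(system: Dict[str, List[str]]) -> List[str]:
    """Return the roles backed by fewer than two instances."""
    return sorted(role for role, members in system.items() if len(members) < 2)


# Hypothetical inventory: role -> instances that can serve it.
system = {
    "web": ["web-1", "web-2"],
    "database": ["db-1"],
    "secrets": ["secret-store-1"],
}
print(single_points_of_failure(system))
```

Counting copies is only a starting point. A role with two instances can still be a single point of failure if both instances share the same failure domain.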
Micro-activity
Activity: Think about "single vs redundant"
Common confusion
Note
“If I use cloud, is my app automatically reliable?” No. Cloud gives you the building blocks (zones/regions, load balancing, managed services), but you still must design for redundancy and recovery.
Note
“Does redundancy mean zero downtime?” Not guaranteed. Redundancy reduces risk, but reliability also depends on good monitoring, safe deployments, and tested recovery plans.
Summary
- Failures will happen. Reliability means you keep running anyway.
- Redundancy is the main tool, but independence matters.
- Failure domains explain why “two copies in the same place” can still fail together.
- Regions and zones are ways providers separate infrastructure to reduce shared failures.
- Identify and remove single points of failure.