
Reliability Basics

Learn how cloud systems stay reliable through redundancy, failure domains, and regional design.

10 min
Introductory
No AWS Account Needed · Free

This lesson is purely conceptual — no AWS usage required.

What reliability means in cloud

Systems fail. Hardware fails. Networks fail. People make mistakes. Updates break things.

Reliability does not mean “nothing ever fails.” It means your system continues working even when something fails.


1) What breaks systems

Here are common causes of downtime:

  • A server crashes
  • A database becomes unavailable
  • A network link fails
  • A power issue affects a location
  • A bad deploy or misconfiguration breaks the app
  • Traffic spikes overload the system

You cannot prevent every failure. But you can design so failures do not take everything down.


2) Redundancy

Redundancy means running more than one copy of each critical component, so that another copy can take over when one fails.

Example:

  • One server running your app is a risk.
  • Two servers is better.
  • Two servers in two different “places” is even better.

Note

Redundancy is not just “more.” Redundancy is independence. If both copies can fail from the same issue, the redundancy is weak.
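To see why independence matters, here is a rough sketch in Python. It compares two servers that fail independently with two servers that both rely on one shared dependency, such as the same power feed. The availability figures (0.99 per server, 0.995 for the shared dependency) are made-up numbers for illustration, and the function names are hypothetical.

```python
def availability_independent(per_copy: float, copies: int) -> float:
    """Chance that at least one copy is up, assuming copies fail independently."""
    return 1 - (1 - per_copy) ** copies


def availability_shared(per_copy: float, copies: int, shared: float) -> float:
    """Same as above, but every copy also needs one shared dependency to be up."""
    return shared * availability_independent(per_copy, copies)


# Assumed numbers, chosen only for illustration.
single = availability_independent(0.99, 1)
independent_pair = availability_independent(0.99, 2)
shared_pair = availability_shared(0.99, 2, shared=0.995)

print(f"one server:                    {single:.4f}")
print(f"two independent servers:       {independent_pair:.4f}")
print(f"two servers, shared dependency: {shared_pair:.4f}")
```

Under these assumptions, the pair that shares a dependency can never be more available than that dependency, no matter how many servers you add.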


3) Failure domains

A failure domain is a group of components that share something (a machine, a power supply, a network link), so one failure can take all of them down together.

Examples of failure domains:

  • One physical machine
  • One rack in a data center
  • One data center building
  • One power system
  • One network provider
  • One geographic area

If all your critical components are inside the same failure domain, a single issue can take everything down.
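One way to reason about this is to write down where each copy lives and look for overlaps. The sketch below is a minimal illustration, not a real tool: the replica names, domain types, and identifiers such as `r-07` are all hypothetical.

```python
# Hypothetical placement data: each replica lists the failure domains it sits in.
replicas = {
    "app-1": {"machine": "m-01", "rack": "r-07", "building": "dc-a"},
    "app-2": {"machine": "m-02", "rack": "r-07", "building": "dc-a"},
}


def shared_failure_domains(placements):
    """Return the (domain type, value) pairs that every replica has in common."""
    common = None
    for domains in placements.values():
        pairs = set(domains.items())
        common = pairs if common is None else common & pairs
    return common or set()


shared = shared_failure_domains(replicas)
print(sorted(shared))  # [('building', 'dc-a'), ('rack', 'r-07')]
```

Here the two replicas run on different machines, but they still share a rack and a building, so a failure of either one takes both replicas down.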


4) Regions vs availability zones

Cloud providers organize infrastructure into geographic groupings to improve reliability.

Region

A region is a geographic area. A major regional issue is rare, but if one occurs, all systems in that region may be affected.

Zone

A zone is a separate location inside a region designed to reduce shared failures. The goal is that a failure in one zone does not automatically take down the others.

Tip

Key takeaway:

  • Spreading across zones protects against data center-level failures.
  • Spreading across regions protects against larger geographic failures.
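The takeaway can be expressed as two simple checks. This is a simplified sketch: the region and zone names are placeholders, and it only counts where copies are placed, not whether traffic can actually fail over to them.

```python
# Each copy of the app is described by a hypothetical (region, zone) pair.
placements = [("region-1", "zone-a"), ("region-1", "zone-b")]


def survives_zone_failure(placements):
    """True if the copies sit in at least two different zones."""
    return len(set(placements)) >= 2


def survives_region_failure(placements):
    """True if the copies sit in at least two different regions."""
    return len({region for region, _zone in placements}) >= 2


print(survives_zone_failure(placements))    # True
print(survives_region_failure(placements))  # False
```

This deployment would keep a copy running if one zone went down, but not if the whole region did.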

5) Single point of failure

A single point of failure is any part of your system whose failure, on its own, brings down the entire system.

Common examples:

  • One server hosting everything
  • One database with no backup/replica plan
  • One critical secret stored in one place with no recovery

Your job in reliability design is to identify these and reduce them.
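A first pass at finding single points of failure can be as simple as listing each tier of the system and counting its copies. The sketch below assumes a hypothetical three-tier setup, and the tier names and counts are invented for illustration.

```python
# Hypothetical system: tier name -> number of copies currently running.
tiers = {"load balancer": 1, "app server": 2, "database": 1}


def single_points_of_failure(tiers):
    """Return the tiers that have fewer than two copies."""
    return [name for name, copies in tiers.items() if copies < 2]


print(single_points_of_failure(tiers))  # ['load balancer', 'database']
```

Counting copies is only a starting point. A tier with two copies can still behave like a single point of failure if both copies sit in the same failure domain.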


Micro-activity

Activity: Think about "single vs redundant"

Setup A — Single point of failure: every component is a single link in the chain
Setup B — More reliable: redundant servers across zones, database with standby replica

In Setup A, which components are a single point of failure?


Common confusion

Note

“If I use cloud, is my app automatically reliable?” No. Cloud gives you the building blocks (zones/regions, load balancing, managed services), but you still must design for redundancy and recovery.

Note

“Does redundancy mean zero downtime?” Not guaranteed. Redundancy reduces risk, but reliability also depends on good monitoring, safe deployments, and tested recovery plans.


Summary

  • Failures will happen. Reliability means you keep running anyway.
  • Redundancy is the main tool, but it only helps when the copies are independent.
  • Failure domains explain why “two copies in the same place” can still fail together.
  • Regions and zones are ways providers separate infrastructure to reduce shared failures.
  • Identify and remove single points of failure.

Quiz

Knowledge Check

What is reliability in cloud systems?