Reliability Basics

Learn how cloud systems stay reliable through redundancy, failure domains, and regional design.

10 min

Introductory

No AWS account needed (free)

This lesson is purely conceptual — no AWS usage required.

What reliability means in cloud

Systems fail. Hardware fails. Networks fail. People make mistakes. Updates break things.

Reliability does not mean that nothing ever fails. It means your system keeps working even when something fails.

1) What breaks systems

Here are common causes of downtime:

A server crashes
A database becomes unavailable
A network link fails
A power issue affects a location
A bad deploy or misconfiguration breaks the app
Traffic spikes overload the system

You cannot prevent every failure. But you can design so failures do not take everything down.

2) Redundancy

Redundancy means running more than one copy of each critical component, so that another copy can take over when one fails.

Example:

One server running your app is a risk.
Two servers is better.
Two servers in two different “places” is even better.

Note

Redundancy is not just “more.” Redundancy is independence. If both copies can fail from the same issue, the redundancy is weak.
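To make the independence point concrete, here is a minimal sketch of the arithmetic. It assumes each copy fails independently of the others, and the availability figures (0.99) are illustrative numbers, not measurements. The function names are hypothetical.

```python
def group_availability(replica_availability: float, replicas: int) -> float:
    """Chance that at least one of `replicas` independent copies is up."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The group is down only when every copy is down at the same time.
    return 1 - (1 - replica_availability) ** replicas


def group_availability_with_shared_dependency(
    replica_availability: float, replicas: int, shared_availability: float
) -> float:
    """Same group, but every copy also needs one shared component
    (for example, a single power feed)."""
    return shared_availability * group_availability(replica_availability, replicas)


# Illustrative numbers only (assumed, not measured).
independent = group_availability(0.99, 2)
shared = group_availability_with_shared_dependency(0.99, 2, 0.99)
print(f"two independent servers:     {independent:.4%}")
print(f"two servers, one power feed: {shared:.4%}")
```

Under these assumptions, the pair that shares a dependency can never be more available than the dependency itself, which is why adding copies in the same "place" buys so little.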

3) Failure domains

A failure domain is a group of components that share something (a machine, a rack, a power supply, a location), so one failure can affect all of them at once.

Examples of failure domains:

One physical machine
One rack in a data center
One data center building
One power system
One network provider
One geographic area

If all your critical components are inside the same failure domain, a single issue can take everything down.
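One way to reason about this is to label each copy of a component with the failure domains it lives in, then look for labels that every copy has in common. The sketch below does that. The domain names and values ("rack", "building", "r1", and so on) are made up for illustration.

```python
def shared_failure_domains(replicas: list[dict[str, str]]) -> dict[str, str]:
    """Return the failure domains (kind -> value) that every replica has in common."""
    if not replicas:
        return {}
    shared = dict(replicas[0])
    for replica in replicas[1:]:
        # Keep only the domains where this replica matches all earlier ones.
        shared = {
            kind: value
            for kind, value in shared.items()
            if replica.get(kind) == value
        }
    return shared


# Hypothetical placement: two app servers on different racks in one building.
app_servers = [
    {"rack": "r1", "building": "b1", "area": "north"},
    {"rack": "r2", "building": "b1", "area": "north"},
]
print(shared_failure_domains(app_servers))
```

Any domain the function returns is a place where a single issue could still take down every copy together.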

4) Regions vs availability zones

Cloud providers organize infrastructure into geographic groupings to improve reliability.

Region

A region is a geographic area. A major regional issue is rare, but if one occurs, all systems in that region may be affected.

Zone

A zone is a separate location inside a region designed to reduce shared failures. The goal is that a failure in one zone does not automatically take down the others.

Tip

Key takeaway:

Spreading across zones protects against data-center-level failures.
Spreading across regions protects against larger geographic failures.
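As a rough illustration of spreading copies across zones, the sketch below assigns replicas to zones in round-robin order and then checks which replicas would survive the loss of one zone. The zone and replica names are hypothetical, and real providers offer their own placement mechanisms that this does not model.

```python
from itertools import cycle


def spread_across_zones(replica_names: list[str], zones: list[str]) -> dict[str, str]:
    """Assign each replica to a zone, cycling through the zones in order."""
    if not zones:
        raise ValueError("need at least one zone")
    return dict(zip(replica_names, cycle(zones)))


def survivors(placement: dict[str, str], failed_zone: str) -> list[str]:
    """Replicas that are still running if `failed_zone` goes down."""
    return [name for name, zone in placement.items() if zone != failed_zone]


# Hypothetical names, for illustration only.
placement = spread_across_zones(["app-1", "app-2", "app-3"], ["zone-a", "zone-b"])
print(placement)
print(survivors(placement, "zone-a"))
```

With every replica in a single zone, `survivors` would return an empty list for that zone, which is the situation zones are meant to help you avoid.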

5) Single point of failure

A single point of failure is any part of your system whose failure brings down the entire system.

Common examples:

One server hosting everything
One database with no backup/replica plan
One critical secret stored in one place with no recovery

Your job in reliability design is to identify these and reduce them.
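A simple first pass at identifying single points of failure is to list each critical role in the system and count how many copies exist. The sketch below flags every role with fewer than two copies. The inventory contents are invented for illustration, and a count alone does not prove the copies are independent.

```python
def single_points_of_failure(inventory: dict[str, int]) -> list[str]:
    """Return the roles that have fewer than two copies, in alphabetical order."""
    return sorted(role for role, copies in inventory.items() if copies < 2)


# Hypothetical inventories: role -> number of copies.
everything_single = {"app server": 1, "database": 1, "secret store": 1}
more_redundant = {"app server": 2, "database": 2, "secret store": 1}

print(single_points_of_failure(everything_single))
print(single_points_of_failure(more_redundant))
```

A role that passes this check can still be a weak spot if all of its copies sit in the same failure domain, so treat the count as a starting point.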

Micro-activity

Activity: Think about "single vs redundant"

Setup A — Single point of failure: every component is a single link in the chain

Setup B — More reliable: redundant servers across zones, database with standby replica

Practice

In Setup A, which components are a single point of failure?

Common confusion

Note

“If I use cloud, is my app automatically reliable?” No. Cloud gives you the building blocks (zones/regions, load balancing, managed services), but you still must design for redundancy and recovery.

Note

“Does redundancy mean zero downtime?” Not guaranteed. Redundancy reduces risk, but reliability also depends on good monitoring, safe deployments, and tested recovery plans.

Summary

Failures will happen. Reliability means you keep running anyway.
Redundancy is the main tool, but it only helps when the copies are independent.
Failure domains explain why “two copies in the same place” can still fail together.
Regions and zones are ways providers separate infrastructure to reduce shared failures.
Identify and remove single points of failure.

Quiz

Knowledge Check

What is reliability in cloud systems?