Designing a Distributed Uptime Monitoring Pipeline: Architecture Patterns for Reliability
A deep technical look at the architecture behind a production uptime monitoring system — event-driven pipelines, multi-location consensus, state machines for health transitions, and the outbox pattern for guaranteed alert delivery.
Building an uptime monitoring system sounds simple: send a request, check the response, fire an alert if something is wrong. In practice, the challenge is not making the check — it is making the check reliably, from multiple locations, without sending false alerts, without missing real outages, and without losing notifications along the way.
This post walks through the core architectural patterns behind a production-grade monitoring pipeline. These are the same patterns we use at MonitorHound, presented at the conceptual level so they are useful to anyone building distributed systems that need strong delivery guarantees.
The pipeline at a glance
A monitoring system is fundamentally a data pipeline with four stages:
Scheduler → Checker → Evaluator → Notifier
- Scheduler — Decides which monitors are due for a check and dispatches work orders
- Checker — Executes the actual HTTP, DNS, or SSL check from a specific geographic location
- Evaluator — Aggregates results across locations and decides whether the monitor’s health state has changed
- Notifier — Delivers alerts to the appropriate channels (email, SMS, webhook)
Each stage is a separate service, connected by an asynchronous message bus. This decoupling is the foundation of everything else. If the notifier is slow, it does not delay the checker. If one location’s checker is down, the others continue independently.
Why event-driven, not request-response
The first architectural decision is how the stages communicate. A synchronous request-response design — where the scheduler calls the checker, waits for the result, passes it to the evaluator, and then calls the notifier — is simple to understand but fragile in practice.
Consider what happens when the notifier takes 10 seconds to deliver an SMS. In a synchronous design, the entire pipeline stalls. That single slow notification blocks every other monitor from being checked. Under load, the system grinds to a halt.
An event-driven architecture using a message bus (Pub/Sub, SQS, Kafka, or similar) solves this by decoupling the stages:
- Backpressure is handled naturally. If the notifier falls behind, messages queue up. The checker keeps checking. The evaluator keeps evaluating.
- Failures are isolated. A notifier crash does not affect checking. A checker crash in one location does not affect other locations.
- Scaling is independent. You can run 10 checker instances and 2 notifier instances. Each stage scales to its own workload.
- Retries are built in. Most message buses offer automatic retry with backoff. A transient failure in one stage does not lose the message.
The trade-off is complexity. You now have to reason about message ordering, at-least-once delivery semantics, and eventual consistency. But for a monitoring pipeline, these trade-offs are worth it.
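The decoupling can be sketched with in-process queues standing in for the message bus. This is a toy model (names and stages are illustrative, and a real deployment would use Pub/Sub, SQS, or Kafka), but it shows how a slow notifier never blocks the checker:

```python
import queue
import threading
import time

# Toy stand-ins for message-bus topics.
check_results = queue.Queue()
alerts = queue.Queue()

def checker(n_checks):
    # Publishes results without waiting for downstream stages.
    for i in range(n_checks):
        check_results.put({"monitor_id": 1, "run_id": i, "ok": i % 2 == 0})

def evaluator():
    while True:
        result = check_results.get()
        if result is None:          # sentinel: shut down
            alerts.put(None)
            return
        if not result["ok"]:
            alerts.put({"monitor_id": result["monitor_id"],
                        "run_id": result["run_id"]})

def notifier(delivered):
    while True:
        alert = alerts.get()
        if alert is None:
            return
        time.sleep(0.01)            # slow delivery; checker is unaffected
        delivered.append(alert)

delivered = []
threads = [
    threading.Thread(target=checker, args=(6,)),
    threading.Thread(target=evaluator),
    threading.Thread(target=notifier, args=(delivered,)),
]
for t in threads:
    t.start()
threads[0].join()
check_results.put(None)             # propagate shutdown through the pipeline
for t in threads[1:]:
    t.join()
print(len(delivered))               # 3 failing runs -> 3 alerts
```

The checker finishes publishing all six results immediately; the notifier drains the alert queue at its own pace.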
Scheduling: the claim-and-dispatch pattern
The scheduler’s job is to find monitors that are due for a check and dispatch work orders. This sounds straightforward, but it has a subtle concurrency problem: if the scheduler runs on a timer (say, every 60 seconds) and a previous run has not finished, you can dispatch duplicate work orders for the same monitor.
The solution is a claim-based approach. Instead of simply querying for due monitors, the scheduler atomically claims them:
UPDATE monitors
SET next_run_at = next_run_at + make_interval(secs => interval_seconds)
WHERE next_run_at <= NOW()
RETURNING id, url, check_type, interval_seconds;
This single query does two things atomically: it selects monitors that are due and immediately advances their next_run_at timestamp. If two scheduler instances run concurrently, they cannot claim the same monitor: the second transaction blocks on the row lock, and when the first commits, it re-evaluates the WHERE clause, sees the advanced timestamp, and skips the row.
The claimed monitors are then published as individual work orders to the message bus, one per monitor per check location. A monitor configured to check from three locations produces three messages.
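The claim semantics can be sketched with SQLite standing in for the production database (table layout and URLs are illustrative). SQLite lacks UPDATE ... RETURNING on older versions, so the sketch selects and advances in one transaction, which gives the same claim-once behavior:

```python
import sqlite3

# In-memory stand-in for the monitors table.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE monitors (
    id INTEGER PRIMARY KEY,
    url TEXT,
    interval_seconds INTEGER,
    next_run_at REAL)""")
db.executemany(
    "INSERT INTO monitors VALUES (?, ?, ?, ?)",
    [(1, "https://a.example", 60, 100.0),
     (2, "https://b.example", 60, 200.0)])

def claim_due(now):
    """Atomically claim due monitors: select them and advance
    next_run_at in one transaction, so a second scheduler run
    sees nothing left to claim."""
    with db:  # one transaction; commits on exit
        due = db.execute(
            "SELECT id, url FROM monitors WHERE next_run_at <= ?",
            (now,)).fetchall()
        db.execute(
            "UPDATE monitors SET next_run_at = next_run_at + interval_seconds "
            "WHERE next_run_at <= ?", (now,))
    return due

first = claim_due(now=150.0)   # monitor 1 is due
second = claim_due(now=150.0)  # monitor 1 was advanced to 160: nothing to claim
print(first, second)
```

The second call returns nothing because the first call already moved monitor 1's next_run_at past the current time.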
Multi-location checking and the consensus problem
Checking from a single location is unreliable. Network partitions, regional ISP issues, and routing anomalies can make a perfectly healthy service appear down from one vantage point. Checking from multiple locations and requiring agreement before declaring an outage eliminates the vast majority of false positives.
But multi-location checking introduces a coordination problem: how do you aggregate results from independent checkers that complete at different times?
The aggregation window
Each check cycle is identified by a unique run ID. All checkers for the same monitor and run ID report their results independently. The evaluator collects these results and waits until it has heard from enough locations to make a decision.
The key question is: how long do you wait? Wait too long and your alerts are delayed. Do not wait long enough and you make decisions on incomplete data.
In practice, the evaluator does not need to wait at all. Instead, it evaluates on every incoming result using a simple rule:
Re-evaluate the monitor’s health each time a new result arrives for the current run. If enough locations have reported, make a decision. If not, wait for the next result.
This gives you the fastest possible alerting while still requiring multi-location agreement. The “enough locations” threshold is configurable — for most setups, requiring agreement from a majority of locations (for example, 2 out of 3) provides a good balance between speed and accuracy.
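The evaluate-on-arrival rule can be sketched as a small function (names and the quorum parameter are illustrative):

```python
def evaluate(results, total_locations, quorum):
    """Decide health on each incoming result for the current run.

    results: mapping of location -> bool (True = check passed)
    Returns "up", "down", or None if no quorum view exists yet.
    """
    failures = sum(1 for ok in results.values() if not ok)
    successes = len(results) - failures
    if failures >= quorum:
        return "down"    # enough locations agree it is down
    if successes >= quorum:
        return "up"      # enough locations agree it is up
    return None          # keep waiting for the next result

# 3 locations, quorum of 2: one failing location alone decides nothing.
run = {}
run["us-east"] = False
assert evaluate(run, 3, 2) is None
run["eu-west"] = False   # second failing location: quorum reached
assert evaluate(run, 3, 2) == "down"
```

Because the function is re-run on every arrival, the decision lands as soon as the quorum is met, with no fixed waiting window.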
Composite keys for idempotency
In a distributed system built on at-least-once delivery, you will receive duplicate messages. A checker might publish the same result twice because the message bus redelivered it after a timeout. The evaluator must handle this gracefully.
The solution is to use composite keys that make each result uniquely identifiable:
(monitor_id, run_id, location)
When a result arrives, the evaluator performs an upsert. If a result with the same composite key already exists, the duplicate is silently ignored. The evaluator acknowledges the message (returns a success response to the message bus) so it is not redelivered, but takes no further action.
This idempotency guarantee is critical. Without it, a duplicated failure result could trigger duplicate alerts, and a duplicated success result could prematurely clear a legitimate alert.
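A minimal sketch of the dedup-on-insert behavior, using SQLite as a stand-in (in Postgres the equivalent is INSERT ... ON CONFLICT DO NOTHING; table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The composite key (monitor_id, run_id, location) makes each
# result uniquely identifiable, so redelivered messages are no-ops.
db.execute("""CREATE TABLE check_results (
    monitor_id INTEGER,
    run_id     INTEGER,
    location   TEXT,
    ok         INTEGER,
    PRIMARY KEY (monitor_id, run_id, location))""")

def record_result(monitor_id, run_id, location, ok):
    # INSERT OR IGNORE: a duplicate composite key is silently dropped.
    with db:
        db.execute(
            "INSERT OR IGNORE INTO check_results VALUES (?, ?, ?, ?)",
            (monitor_id, run_id, location, ok))

record_result(1, 42, "us-east", 0)
record_result(1, 42, "us-east", 0)   # redelivered duplicate: ignored
record_result(1, 42, "eu-west", 1)

count = db.execute("SELECT COUNT(*) FROM check_results").fetchone()[0]
print(count)  # 2 distinct results despite 3 deliveries
```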
State machines for health transitions
A naive monitoring system has two states: up and down. When a check succeeds, the monitor is up. When it fails, the monitor is down. Alert on every transition.
This works poorly in practice. A single failed check followed by an immediate recovery produces an alert and a recovery notification within seconds of each other — noise that erodes trust in the system. Intermittent failures produce a stream of alerts that train operators to ignore them.
A better approach models monitor health as a state machine with explicit transition rules:
     ┌──────────────────────────────────────────────────┐
     │                                                  │
┌────▼────┐     consecutive     ┌───────────┐           │
│ Healthy │────── failures ────►│ Unhealthy │           │
└────┬────┘     (threshold)     └─────┬─────┘           │
     │                                │   consecutive   │
     │                                └─── successes ───┘
     │                                    (threshold)
     │  no data / timeout
     └─────────────────────────►┌─────────┐
                                │ Unknown │
                                └─────────┘
The state machine enforces several important behaviors:
- Transition thresholds — A monitor does not become unhealthy after a single failure. It requires multiple consecutive failures (configurable, typically 2-3). This filters out transient blips.
- Hysteresis — The threshold for recovering (going from unhealthy back to healthy) can differ from the threshold for failing. This prevents rapid oscillation between states when a service is flaky.
- Unknown state — When no data has been received for a monitor (for example, after creation or after a long outage in the monitoring system itself), the state is explicitly unknown rather than assumed healthy. This prevents missing real outages during data gaps.
- Alerts fire on state transitions, not on individual check results. The evaluator sends a notification only when the state changes (healthy to unhealthy, or unhealthy to healthy). Individual check failures within an already-unhealthy state do not generate additional alerts.
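The behaviors above can be sketched as a small class. Thresholds, state names, and the observe method are illustrative, not the production implementation:

```python
class MonitorHealth:
    """Health state machine with transition thresholds and hysteresis."""

    def __init__(self, fail_threshold=2, recover_threshold=2):
        self.state = "unknown"    # no data yet: explicitly unknown
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold  # hysteresis knob
        self.consecutive_failures = 0
        self.consecutive_successes = 0

    def observe(self, ok):
        """Feed one aggregated check result; return the new state if a
        transition occurred (the only time to alert), else None."""
        if ok:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0

        if (self.state != "unhealthy"
                and self.consecutive_failures >= self.fail_threshold):
            self.state = "unhealthy"
            return "unhealthy"
        if (self.state != "healthy"
                and self.consecutive_successes >= self.recover_threshold):
            self.state = "healthy"
            return "healthy"
        return None               # no transition, no alert

m = MonitorHealth()
transitions = [m.observe(ok) for ok in [True, True, False, False, True, True]]
print([t for t in transitions if t])  # ['healthy', 'unhealthy', 'healthy']
```

Six check results produce only three notifications: one when the monitor first proves healthy, one on the outage, one on recovery. The single failures and successes in between are absorbed by the thresholds.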
Handling stale and out-of-order results
In an asynchronous pipeline, messages can arrive out of order. A result from run 5 might arrive after the result from run 6, especially if run 5’s checker experienced a timeout and retried. If the evaluator blindly processes results in arrival order, it can make incorrect state transitions based on stale data.
The solution is anti-rewind logic: the evaluator tracks the most recent run ID it has processed for each monitor and rejects results from older runs.
current_run_id = 6
incoming result: run_id = 5 → SKIP (stale)
incoming result: run_id = 6 → PROCESS
incoming result: run_id = 7 → PROCESS, update current_run_id to 7
This ensures the evaluator’s state machine only moves forward in time, never backwards. A delayed result from a previous run cannot corrupt the current state.
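A minimal sketch of the anti-rewind check (names are illustrative; equal run IDs still process, since multiple locations report under the same run):

```python
# Track the newest run id per monitor and drop anything older.
latest_run = {}  # monitor_id -> highest run_id processed

def should_process(monitor_id, run_id):
    current = latest_run.get(monitor_id, -1)
    if run_id < current:
        return False                   # stale result: skip it
    latest_run[monitor_id] = run_id    # move forward (or stay put)
    return True

decisions = [should_process(1, r) for r in [6, 5, 6, 7]]
print(decisions)  # [True, False, True, True]
```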
Concurrent evaluation and database-level locking
When results from multiple locations arrive simultaneously for the same monitor, you can end up with multiple evaluator instances trying to update the same monitor’s state at the same time. Without coordination, this leads to race conditions: two evaluators read the same state, both decide to transition, and two duplicate alerts are sent.
Database-level advisory locks solve this cleanly. Before evaluating a monitor, the evaluator acquires a lock keyed on the monitor ID:
SELECT pg_try_advisory_xact_lock(monitor_id);
If the lock is held by another transaction, the evaluator skips this result and lets the message bus redeliver it later. The lock is automatically released when the transaction commits, keeping the critical section short.
This approach is preferable to application-level locking (mutexes, Redis locks) because it ties the lock lifetime to the database transaction. If the evaluator crashes mid-evaluation, the lock is released automatically when the connection drops. There is no risk of orphaned locks.
Guaranteed alert delivery: the outbox pattern
The final and arguably most critical piece is alert delivery. When a monitor transitions from healthy to unhealthy, the system must send a notification. But “must” is a strong word in distributed systems. What happens if the notifier crashes after reading the alert but before delivering it? What if the email provider is temporarily down?
The transactional outbox pattern provides at-least-once delivery guarantees without requiring a distributed transaction:
- When the evaluator detects a state transition, it writes the new state and an outbox record in the same database transaction. If either write fails, both are rolled back. The alert is never lost.
- The notifier reads unprocessed records from the outbox, delivers the notification, and marks the record as processed. If delivery fails, the record stays in the outbox and is retried on the next cycle.
Evaluator Transaction:
├── UPDATE monitor_state SET status = 'unhealthy'
└── INSERT INTO alert_outbox (monitor_id, alert_type, payload, ...)
Notifier (separate process):
├── SELECT * FROM alert_outbox WHERE processed_at IS NULL
├── Deliver notification (email / SMS / webhook)
└── UPDATE alert_outbox SET processed_at = NOW()
The key insight is that writing to the outbox is a local database operation, not a distributed call. The evaluator does not need to know whether the email provider is up. It writes the intent to send an alert, and the notifier handles the actual delivery asynchronously.
If the notifier processes the same outbox record twice (because it crashed after delivery but before marking it processed), the worst case is a duplicate notification. Users receive one extra alert, which is far preferable to receiving none. For channels that support it, you can add deduplication at the delivery layer using a unique message ID.
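Both halves of the pattern can be sketched with SQLite standing in for the production database (schema and column names are illustrative):

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE monitor_state (monitor_id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE alert_outbox (
    id INTEGER PRIMARY KEY,
    monitor_id INTEGER,
    alert_type TEXT,
    processed_at REAL);
INSERT INTO monitor_state VALUES (1, 'healthy');
""")

def record_transition(monitor_id, new_status):
    # Evaluator side: state update and outbox insert commit together,
    # so the intent to alert can never be lost.
    with db:
        db.execute("UPDATE monitor_state SET status = ? WHERE monitor_id = ?",
                   (new_status, monitor_id))
        db.execute("INSERT INTO alert_outbox (monitor_id, alert_type, "
                   "processed_at) VALUES (?, ?, NULL)",
                   (monitor_id, new_status))

def drain_outbox(deliver):
    # Notifier side: deliver first, then mark processed. A crash in
    # between means redelivery (at-least-once), never a lost alert.
    rows = db.execute(
        "SELECT id, monitor_id, alert_type FROM alert_outbox "
        "WHERE processed_at IS NULL").fetchall()
    sent = 0
    for row_id, monitor_id, alert_type in rows:
        deliver(monitor_id, alert_type)
        with db:
            db.execute("UPDATE alert_outbox SET processed_at = ? WHERE id = ?",
                       (time.time(), row_id))
        sent += 1
    return sent

record_transition(1, "unhealthy")
sent = drain_outbox(lambda m, a: None)  # stand-in for email/SMS/webhook
print(sent)  # 1
```

A second drain finds nothing to send, and a crash between delivery and the processed_at update would simply cause one redelivery on the next cycle.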
Putting it all together
Here is the complete flow for a single monitoring cycle:
- The scheduler claims monitors where next_run_at <= NOW(), advances their next run time, and publishes work orders to the message bus — one per monitor per check location.
- Each checker receives a work order, executes the check (HTTP request, DNS lookup, or SSL handshake), and publishes the result with a composite key of (monitor_id, run_id, location).
- The evaluator receives each result, acquires an advisory lock on the monitor ID, upserts the result (handling duplicates via the composite key), checks for stale results (anti-rewind), aggregates across locations, runs the state machine, and — if a transition occurred — writes to the outbox. All within a single transaction.
- The notifier polls the outbox, delivers notifications through configured channels, and marks records as processed.
At every stage, the system is designed to handle the realities of distributed computing: duplicate messages, out-of-order delivery, concurrent processing, partial failures, and crash recovery. No single component failure causes data loss or missed alerts.
Designing for failure
The recurring theme across all of these patterns is the same: assume everything will fail, and design so that failure is handled gracefully rather than catastrophically.
- Messages will be duplicated → use idempotent operations
- Messages will arrive out of order → use anti-rewind logic
- Services will crash mid-operation → use transactional writes and automatic lock release
- External delivery will fail → use the outbox pattern with retries
- Multiple instances will race → use database-level locking
None of these patterns are novel. They are well-established solutions to well-understood problems. The value is in combining them correctly into a coherent pipeline where each pattern reinforces the others.
The monitoring system that pages you at 3 AM needs to be more reliable than the service it is monitoring. That reliability does not come from writing bug-free code — it comes from assuming bugs exist and designing a system that behaves correctly despite them.
See these patterns in action — start monitoring with MonitorHound.