Notification Service at Scale (Email/SMS/Push/In-App)

1) Use Case & Problem Context

We need to send notifications to millions of users across multiple channels:

Email, SMS, Push, In-app

The system must support:

Templates + personalization (e.g., {name}, {orderId})
Localization with fallback (e.g., fr-CA falls back to fr, then to a default locale)
User preferences (opt-in/out)
Quiet hours + frequency caps
Retries with exponential backoff
Provider failover (if one provider is down)
Delivery status via provider webhooks
Observability (success rate, bounces, latency)
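
To make the first two requirements concrete, here is a minimal sketch of template rendering with a locale fallback chain. The `TEMPLATES` catalog, the `order_shipped` template id, and the choice of `en` as the default locale are all assumptions for illustration; a real service would load templates from a store.

```python
# Hypothetical in-memory template catalog keyed by (template_id, locale).
TEMPLATES = {
    ("order_shipped", "en"): "Hi {name}, your order {orderId} is on its way.",
    ("order_shipped", "fr"): "Bonjour {name}, votre commande {orderId} est en route.",
}

DEFAULT_LOCALE = "en"  # assumed default; pick whatever your product uses


def locale_chain(locale):
    """Return locales from most to least specific, e.g. fr-CA -> fr -> en."""
    chain = [locale]
    if "-" in locale:
        chain.append(locale.split("-", 1)[0])
    if DEFAULT_LOCALE not in chain:
        chain.append(DEFAULT_LOCALE)
    return chain


def render(template_id, locale, params):
    """Render the first template found along the fallback chain."""
    for candidate in locale_chain(locale):
        text = TEMPLATES.get((template_id, candidate))
        if text is not None:
            # str.format raises KeyError if a placeholder has no value,
            # which is safer than sending "Hi {name}" to a user.
            return text.format(**params)
    raise LookupError(f"no template {template_id!r} for locale {locale!r}")
```

With this sketch, a user with locale `fr-CA` gets the `fr` template, and a user with `de-DE` falls through to the default.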

2) High-Level Architecture (Text Diagram)


          +------------------+
          |   Client Apps    |
          +--------+---------+
                   |
                   v
            +------+------+
            | Notify API  |   POST /notify
            +------+------+
                   |
              create job
                   |
                   v
            +------+------+
            | Job Queue   |  (Kafka/SQS/RabbitMQ)
            +------+------+
                   |
                Fan-out
                   v
   +---------------+------------------+
   | Fanout Workers (per recipient)   |
   +---------------+------------------+
        |           |            |
        v           v            v
   Prefs Store   Template      Send Queue
   (opt-out,     Render        (per channel)
   quiet hours)                  |
                                 v
                      +----------+----------+
                      | Channel Providers    |
                      | Email / SMS / Push   |
                      +----------+----------+
                                 |
                               Webhooks
                                 v
                           Delivery Events
                         (status + metrics)

Key principle:
✅ The API should be fast and async: it validates the request, enqueues a job, and returns. Heavy work (fan-out, rendering, provider calls) happens in workers.
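
A rough sketch of what "fast and async" can look like for POST /notify. The request fields (`recipients`, `template_id`, `channels`, `params`), the 202 response, and the use of `queue.Queue` as a stand-in for Kafka/SQS/RabbitMQ are assumptions for illustration, not a prescribed API.

```python
import queue
import uuid

# In-process stand-in for the job queue; a real deployment would publish
# to Kafka/SQS/RabbitMQ instead.
job_queue = queue.Queue()


def handle_notify(request):
    """Validate, enqueue, return. No rendering or provider calls happen here."""
    recipients = request.get("recipients")
    if not recipients or "template_id" not in request:
        return 400, {"error": "recipients and template_id are required"}

    job_id = str(uuid.uuid4())
    job_queue.put({
        "job_id": job_id,
        "template_id": request["template_id"],
        "recipients": list(recipients),
        "channels": request.get("channels", ["email"]),
        "params": request.get("params", {}),
    })
    # 202 Accepted: the work is queued, not done.
    return 202, {"job_id": job_id}
```

The handler does a constant amount of work regardless of how many recipients the job has, which is what keeps the API latency flat.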

3) Core Data Flow

A) Send flow

Client calls POST /notify
Service creates a Send Job
Enqueue tasks per recipient per channel
Fanout worker checks:
- preferences
- quiet hours
- dedupe/idempotency
Render template
Route to provider (failover if needed)
Persist delivery status
Retry with backoff if transient error
Move to DLQ if permanently failing
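
The worker-side checks and provider routing from the send flow above can be sketched as follows. Everything here is hypothetical scaffolding: the task and preference shapes, the status strings, the `TransientError` class, and providers modeled as plain callables. The dedupe set stands in for a shared store such as a database or cache.

```python
from datetime import time


class TransientError(Exception):
    """Provider failure worth a failover or retry (e.g. timeout, 5xx)."""


def in_quiet_hours(now, start, end):
    """True if now is in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def process_task(task, prefs, sent_keys, providers, now):
    """Handle one (recipient, channel) task and return a status string."""
    user, channel = task["user_id"], task["channel"]
    pref = prefs.get(user, {})

    # Cheap checks first, so skipped messages never reach rendering or providers.
    if channel in pref.get("opted_out", set()):
        return "skipped:opt_out"
    quiet = pref.get("quiet_hours")
    if quiet and in_quiet_hours(now, *quiet):
        return "deferred:quiet_hours"

    key = (task["job_id"], user, channel)  # assumed dedupe key
    if key in sent_keys:
        return "skipped:duplicate"

    # Try providers in order; move to the next one on a transient failure.
    for provider in providers:
        try:
            provider(task)
        except TransientError:
            continue
        # Record the key only after a successful send, so retries still work.
        sent_keys.add(key)
        return "sent"
    return "retry"
```

Note that the dedupe key is recorded only after a successful send. Recording it earlier would cause a legitimate retry to be dropped as a duplicate.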

B) Status flow

Provider calls webhook: delivered/bounced/failed
Store delivery event and update metrics
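
A minimal sketch of the webhook side, assuming a normalized payload with `event_id`, `message_id`, `channel`, and `status` fields. Real providers each have their own payload format, so these field names are placeholders. The list, counter, and set stand in for a durable event store, a metrics client, and a dedupe store.

```python
from collections import Counter

delivery_events = []       # stand-in for a durable event store
metrics = Counter()        # stand-in for a metrics client
_seen_event_ids = set()    # stand-in for a shared dedupe store

VALID_STATUSES = {"delivered", "bounced", "failed"}


def handle_webhook(payload):
    """Record one provider status callback and return an HTTP status code."""
    status = payload.get("status")
    event_id = payload.get("event_id")
    if status not in VALID_STATUSES or not event_id:
        return 400

    # Webhooks are often delivered more than once, so make handling idempotent.
    if event_id in _seen_event_ids:
        return 200
    _seen_event_ids.add(event_id)

    channel = payload.get("channel", "unknown")
    delivery_events.append({
        "event_id": event_id,
        "message_id": payload.get("message_id"),
        "channel": channel,
        "status": status,
    })
    metrics[(channel, status)] += 1
    return 200
```

Counting by (channel, status) is enough to derive the success and bounce rates mentioned under observability.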

Notes (What to say in interviews)

Separate hot path vs cold path: API returns quickly; workers do heavy work.
Fan-out: convert one job into many tasks (per user/channel).
Preferences first: opt-out + quiet hours should skip early.
Provider failover: try next provider on transient failures.
Retries + DLQ: exponential backoff + poison message handling.
Idempotency: dedupe key prevents duplicates.
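
For the "Retries + DLQ" point, a small sketch of the decision a worker makes after a failed send. The "full jitter" variant of exponential backoff is one common choice rather than something the design above mandates, and the defaults (`base`, `cap`, `max_attempts`) are arbitrary illustrative values.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def next_step(task, error_is_transient, max_attempts=5):
    """Decide what to do with a failed task.

    Returns ("retry", delay_seconds) or ("dlq", None).
    """
    attempt = task.get("attempt", 0) + 1
    # Permanent errors (e.g. invalid address) and exhausted retries go to the
    # dead-letter queue so one poison message cannot block the rest.
    if not error_is_transient or attempt >= max_attempts:
        return "dlq", None
    task["attempt"] = attempt
    return "retry", backoff_delay(attempt)
```

The jitter spreads retries out so that many tasks failing at the same moment do not all hit a recovering provider at once.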

The Backend Engineer’s Journal
