Notification Service at Scale (Email/SMS/Push/In-App)

 

1) Use Case & Problem Context

We need to send notifications to millions of users across multiple channels:

  • Email, SMS, Push, In-app

The system must support:

  • Templates + personalization (e.g., {name}, {orderId})

  • Localization with fallback to a default locale

  • User preferences (opt-in/out)

  • Quiet hours + frequency caps

  • Retries with exponential backoff

  • Provider failover (if one provider is down)

  • Delivery status via provider webhooks

  • Observability (success rate, bounces, latency)
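
To make the first two requirements concrete, here is a minimal sketch of template rendering with personalization and locale fallback. The template store, the template id `order_shipped`, the default locale `en`, and the fallback order (for example fr-CA, then fr, then en) are assumptions for illustration, not part of the original design.

```python
from string import Formatter

# Hypothetical in-memory template store keyed by (template_id, locale).
TEMPLATES = {
    ("order_shipped", "en"): "Hi {name}, your order {orderId} has shipped.",
    ("order_shipped", "fr"): "Bonjour {name}, votre commande {orderId} est en route.",
}

DEFAULT_LOCALE = "en"  # assumed default


def locale_chain(locale: str) -> list[str]:
    """Return locales to try, most specific first, e.g. fr-CA -> fr -> en."""
    chain = [locale]
    if "-" in locale:
        chain.append(locale.split("-", 1)[0])
    if DEFAULT_LOCALE not in chain:
        chain.append(DEFAULT_LOCALE)
    return chain


def render(template_id: str, locale: str, params: dict[str, str]) -> str:
    """Render the first template found along the locale fallback chain."""
    for candidate in locale_chain(locale):
        template = TEMPLATES.get((template_id, candidate))
        if template is None:
            continue
        # Fail loudly on a missing placeholder instead of sending a broken message.
        needed = {name for _, name, _, _ in Formatter().parse(template) if name}
        missing = needed - set(params)
        if missing:
            raise KeyError(f"missing template params: {sorted(missing)}")
        return template.format(**params)
    raise LookupError(f"no template {template_id!r} for locale {locale!r}")
```

Calling `render("order_shipped", "fr-CA", {"name": "Ana", "orderId": "A1"})` falls back from fr-CA to the fr template, while an unsupported locale such as "de" ends up on the en template.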


2) High-Level Architecture (Text Diagram)

        +------------------+
        |   Client Apps    |
        +--------+---------+
                 |
                 v
          +------+------+
          | Notify API  |  POST /notify
          +------+------+
                 | create job
                 v
          +------+------+
          |  Job Queue  |  (Kafka/SQS/Rabbit)
          +------+------+
                 | Fan-out
                 v
 +---------------+------------------+
 | Fanout Workers (per recipient)   |
 +---------------+------------------+
      |              |              |
      v              v              v
 Prefs Store      Template      Send Queue
 (opt-out,        Render        (per channel)
 quiet hours)                       |
                                    v
                         +----------+----------+
                         |  Channel Providers  |
                         | Email / SMS / Push  |
                         +----------+----------+
                                    | Webhooks
                                    v
                         Delivery Events (status + metrics)

Key principle:
For notifications, the API should be fast and asynchronous. It only creates and enqueues the job; the heavy work (fan-out, rendering, provider calls) happens in workers.
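
A minimal sketch of that split, assuming an in-memory `queue.Queue` as a stand-in for Kafka/SQS/RabbitMQ. The handler name `handle_notify`, the request fields, and the 202 response shape are hypothetical; the post only specifies the `POST /notify` endpoint.

```python
import queue
import uuid
from dataclasses import dataclass, field

# In-memory stand-in for the job queue in the diagram.
job_queue = queue.Queue()


@dataclass
class SendJob:
    template_id: str
    recipient_ids: list[str]
    channels: list[str]
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def handle_notify(body: dict) -> tuple[int, dict]:
    """Hypothetical handler for POST /notify: validate, enqueue, return."""
    required = ("template_id", "recipient_ids", "channels")
    if any(not body.get(name) for name in required):
        return 400, {"error": "template_id, recipient_ids and channels are required"}
    job = SendJob(body["template_id"], body["recipient_ids"], body["channels"])
    # No rendering or provider calls happen on the request path.
    job_queue.put(job)
    return 202, {"job_id": job.job_id, "status": "queued"}
```

The caller gets a job id back immediately and can look up delivery status later; everything slow is left to whatever consumes `job_queue`.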


3) Core Data Flow

A) Send flow

  1. Client calls POST /notify

  2. Service creates a Send Job

  3. Enqueue tasks per recipient per channel

  4. Fanout worker checks:

    • preferences

    • quiet hours

    • dedupe/idempotency

  5. Render template

  6. Route to provider (failover if needed)

  7. Persist delivery status

  8. Retry with backoff if transient error

  9. Move to DLQ if permanently failing
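
The checks in step 4 can be sketched as a single decision function that runs before any rendering or provider call. This is an illustration under assumptions: the `Prefs` shape, the dedupe key format `job_id:user_id:channel`, and the in-process set (a real deployment would need a shared store with expiry) are not from the original post.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Prefs:
    opted_out_channels: set[str]
    quiet_start: time  # user's local time, e.g. 22:00
    quiet_end: time    # user's local time, e.g. 07:00


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the quiet window, including overnight windows."""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight (e.g. 22:00 -> 07:00).
    return now >= start or now < end


# Stand-in for a shared dedupe store; assumed, not specified in the post.
_seen_keys: set[str] = set()


def decide(job_id: str, user_id: str, channel: str, prefs: Prefs, local_now: time) -> str:
    """Return 'send' or a skip reason, checking the cheapest rules first."""
    if channel in prefs.opted_out_channels:
        return "skip:opted_out"
    if in_quiet_hours(local_now, prefs.quiet_start, prefs.quiet_end):
        return "skip:quiet_hours"
    key = f"{job_id}:{user_id}:{channel}"
    if key in _seen_keys:
        return "skip:duplicate"
    _seen_keys.add(key)
    return "send"
```

Whether a quiet-hours hit should drop the task or reschedule it for later is a product decision; the sketch only reports the reason and leaves that choice to the caller.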

B) Status flow

  1. Provider calls webhook: delivered/bounced/failed

  2. Store delivery event and update metrics
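
The status flow can be sketched as a small webhook handler. The payload field names (`message_id`, `status`), the provider name used in the usage note, and the in-memory event table and counters are assumptions; real providers each define their own callback format. The duplicate check assumes callbacks can arrive more than once.

```python
from collections import Counter
from dataclasses import dataclass

VALID_STATUSES = {"delivered", "bounced", "failed"}


@dataclass(frozen=True)
class DeliveryEvent:
    provider: str
    message_id: str
    status: str


# Stand-ins for an event table and a metrics backend.
events: dict[tuple[str, str, str], DeliveryEvent] = {}
metrics: Counter = Counter()


def handle_webhook(provider: str, payload: dict) -> bool:
    """Record one provider callback; return False if it was ignored."""
    status = payload.get("status")
    message_id = payload.get("message_id")
    if status not in VALID_STATUSES or not message_id:
        return False
    key = (provider, message_id, status)
    if key in events:
        # Same callback seen before: do not count it twice.
        return False
    events[key] = DeliveryEvent(provider, message_id, status)
    metrics[f"{provider}.{status}"] += 1
    return True
```

For example, `handle_webhook("email_provider_a", {"message_id": "m1", "status": "bounced"})` stores one event and increments the `email_provider_a.bounced` counter, which is the kind of number the bounce-rate metric would be built from.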



Notes (What to say in interviews)

  • Separate hot path vs cold path: API returns quickly; workers do heavy work.

  • Fan-out: convert one job into many tasks (per user/channel).

  • Preferences first: opt-out + quiet hours should skip early.

  • Provider failover: try next provider on transient failures.

  • Retries + DLQ: exponential backoff + poison message handling.

  • Idempotency: dedupe key prevents duplicates.
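
The failover and retry bullets above can be combined in one loop. This is a sketch under assumptions: providers are modeled as plain callables, the two exception classes are hypothetical, the jittered delay is one common choice of backoff, and the DLQ is a plain list. A real worker would more likely re-enqueue the task with a delay than sleep in place.

```python
import random
import time


class TransientError(Exception):
    """Retryable failure, e.g. a timeout or rate limit."""


class PermanentError(Exception):
    """Non-retryable failure, e.g. an invalid address."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with random jitter; attempt starts at 0."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def send_with_failover(message, providers, dlq, max_attempts=4, sleep=time.sleep):
    """Try each provider in order per attempt; back off between attempts."""
    for attempt in range(max_attempts):
        for provider in providers:
            try:
                return provider(message)
            except TransientError:
                continue  # fail over to the next provider
            except PermanentError:
                dlq.append(message)  # retrying will not help
                return None
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    dlq.append(message)  # retries exhausted
    return None
```

Passing `sleep` as a parameter keeps the function easy to test: a test can supply a no-op instead of actually waiting.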
