Message Queues and Pub-Sub
Queues and pub-sub systems decouple producers from consumers, absorb traffic spikes, and enable asynchronous workflows. They are foundational in notifications, analytics, billing, and event-driven systems.
Why Messaging Helps
Synchronous chains can be brittle. If one downstream service is slow, every caller waits. Queues let work be processed asynchronously and smooth out bursts. They also decouple producers from consumers so each can scale, deploy, and fail independently.
Queue vs Pub-Sub
- Queue: one worker processes each message, used for task distribution and work offloading
- Pub-sub: multiple subscribers each receive a copy, used for event broadcasting and fan-out
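The distinction can be made concrete with a minimal in-memory sketch (illustrative only; these class names are hypothetical, not any real broker's API):

```python
from collections import deque

class WorkQueue:
    """Point-to-point: each message is processed by exactly one consumer."""
    def __init__(self):
        self.messages = deque()
    def publish(self, msg):
        self.messages.append(msg)
    def consume(self):
        return self.messages.popleft() if self.messages else None

class PubSub:
    """Fan-out: every subscriber receives its own copy of each message."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self):
        inbox = deque()
        self.subscribers.append(inbox)
        return inbox
    def publish(self, msg):
        for inbox in self.subscribers:
            inbox.append(msg)

q = WorkQueue()
q.publish("task-1")
assert q.consume() == "task-1"   # first consumer takes the task
assert q.consume() is None       # gone: no other consumer sees it

ps = PubSub()
a, b = ps.subscribe(), ps.subscribe()
ps.publish("event-1")
assert list(a) == ["event-1"] and list(b) == ["event-1"]  # both got a copy
```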
Common Messaging Systems
- Kafka: high-throughput, durable log with replay support, ideal for event streaming at scale
- RabbitMQ: flexible routing with exchanges and bindings, good for task queues and RPC patterns
- AWS SQS: fully managed queue with at-least-once delivery and visibility timeouts
- AWS SNS: managed pub-sub for fan-out to multiple endpoints or queues
- Google Pub/Sub: managed pub-sub with optional ordering keys and replay via seek and snapshots
Key Concerns
- Ordering guarantees: most queues offer best-effort ordering; Kafka guarantees order within a partition
- Retries and dead-letter queues: failed messages are retried up to a limit, then moved to a DLQ for inspection
- Backpressure: consumers signal they are overwhelmed so producers slow down or buffer
- Retention and replay: Kafka retains messages for a configurable window, allowing consumers to reprocess past events
- Idempotent consumers: processing the same message twice must produce the same result as processing it once
Delivery Guarantees
- At-most-once: messages may be lost but are never redelivered, lowest latency
- At-least-once: messages are never lost but may be delivered more than once, most common default
- Exactly-once: no loss and no duplicates; in practice usually achieved as exactly-once processing via transactions or idempotency rather than true exactly-once delivery, hardest to implement and highest overhead
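The duplicates behind at-least-once come from acknowledging only after processing: if the consumer crashes between the side effect and the ack, the broker redelivers. A minimal sketch (a deque standing in for a queue with a visibility timeout):

```python
from collections import deque

queue = deque(["msg-1"])
processed = []

def consume(crash_before_ack: bool):
    msg = queue[0]             # receive without removing (like a visibility timeout)
    processed.append(msg)      # side effect happens first
    if crash_before_ack:
        return                 # crash: ack never sent, message stays visible
    queue.popleft()            # ack: remove the message for good

consume(crash_before_ack=True)   # first attempt dies after processing
consume(crash_before_ack=False)  # redelivery succeeds and acks
assert processed == ["msg-1", "msg-1"]  # the side effect ran twice
```

Acking *before* processing flips this into at-most-once: no duplicates, but a crash loses the message.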
Dead-Letter Queues
When a message fails processing repeatedly it is moved to a dead-letter queue instead of blocking the main queue. DLQs allow engineers to inspect, debug, and replay failed messages without losing them. Every production queue should have a DLQ configured.
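The retry-then-park flow can be sketched as a loop with an attempt counter (the message shape and `MAX_ATTEMPTS` are illustrative assumptions; real brokers track delivery counts for you):

```python
from collections import deque

MAX_ATTEMPTS = 3
main_queue = deque([{"id": "m1", "attempts": 0}])
dlq = []

def process(msg):
    # Always fails, to illustrate the path to the DLQ.
    raise RuntimeError("downstream unavailable")

while main_queue:
    msg = main_queue.popleft()
    try:
        process(msg)
    except RuntimeError:
        msg["attempts"] += 1
        if msg["attempts"] >= MAX_ATTEMPTS:
            dlq.append(msg)         # park for inspection and later replay
        else:
            main_queue.append(msg)  # requeue for another attempt

assert dlq == [{"id": "m1", "attempts": 3}]  # parked, not lost
```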
Idempotency
Because at-least-once delivery is the norm, consumers must handle duplicate messages safely. Common approaches include tracking processed message IDs in a database, using upsert operations instead of inserts, and designing state transitions that are safe to apply multiple times.
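The processed-IDs approach looks like this in miniature (an in-memory set stands in for what would be a unique-keyed database table in production):

```python
balances = {"alice": 0}
processed_ids = set()  # in production: a table with a unique constraint on message ID

def handle(msg):
    if msg["id"] in processed_ids:
        return                          # duplicate: skip the side effect
    balances[msg["user"]] += msg["amount"]
    processed_ids.add(msg["id"])        # record only after the effect succeeds

event = {"id": "evt-42", "user": "alice", "amount": 100}
handle(event)
handle(event)                           # redelivery of the same message
assert balances["alice"] == 100         # applied exactly once
```

In a real system the "check ID, apply effect, record ID" steps must share one transaction, or the crash window between them reintroduces the duplicate.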
Backpressure
When consumers fall behind, unbounded queues accumulate messages and exhaust memory. Backpressure mechanisms signal producers to slow down or pause. Kafka consumers pull at their own pace, so they apply backpressure naturally; operators monitor consumer lag and scale consumers horizontally when it grows. In SQS, queue depth metrics trigger autoscaling policies.
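A bounded buffer is the simplest backpressure primitive: a full buffer blocks the producer instead of growing without limit. A single-process sketch using the standard library:

```python
import queue

buf = queue.Queue(maxsize=2)  # bounded: put() blocks when full

buf.put("a")
buf.put("b")

blocked = False
try:
    buf.put("c", timeout=0.1)  # producer stalls until a consumer drains
except queue.Full:
    blocked = True             # backpressure: the producer felt the limit

assert blocked
assert buf.get() == "a"        # consumer drains, freeing capacity
buf.put("c", timeout=0.1)      # now the producer can proceed
```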
Partitioning and Ordering
Kafka partitions topics across brokers and guarantees order only within a single partition. Choosing a partition key such as user ID or entity ID ensures all events for the same entity are processed in order by the same consumer, avoiding race conditions in downstream state machines.
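Partition routing is just a stable hash of the key modulo the partition count, so the same key always lands on the same partition. A sketch (MD5 here is a stand-in; Kafka's default partitioner actually uses murmur2):

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    # Stable hash: the same key always maps to the same partition.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All events for one user hit one partition, so one consumer
# sees them in order.
p = partition_for("user-123")
assert all(partition_for("user-123") == p for _ in range(100))
```

Note the trade-off: changing `NUM_PARTITIONS` remaps keys, so per-key ordering is only guaranteed while the partition count is stable.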
Consumer Groups
Multiple consumers can form a group to share the work of processing a queue or topic. Each message is delivered to exactly one member of the group, allowing horizontal scaling of consumers. Adding consumers increases throughput roughly linearly up to the number of partitions; beyond that, extra group members sit idle.
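Why throughput caps out at the partition count falls out of the assignment rule: each partition has exactly one owner within a group. A round-robin assignment sketch (real brokers use rebalancing protocols, but the ownership invariant is the same):

```python
NUM_PARTITIONS = 6
consumers = ["c0", "c1", "c2"]

# Round-robin: each partition is owned by exactly one group member.
assignment = {p: consumers[p % len(consumers)] for p in range(NUM_PARTITIONS)}

assert assignment == {0: "c0", 1: "c1", 2: "c2", 3: "c0", 4: "c1", 5: "c2"}
# Every consumer owns some partitions, and no partition has two owners,
# so each message is processed by exactly one member of the group.
assert set(assignment.values()) == set(consumers)
```

With a fourth consumer and still six partitions, two members would own one partition each and any seventh consumer would own none.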
Message Schema and Versioning
Producers and consumers must agree on message format. Schema registries like Confluent Schema Registry enforce compatibility rules so producers cannot publish breaking changes that crash consumers. Versioning strategies include backward compatibility, forward compatibility, and full compatibility depending on deployment requirements.
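Backward compatibility in practice means a reader on the new schema can still parse messages written with the old one, typically by giving new fields defaults. A sketch with hypothetical field names:

```python
def parse_order_v2(raw: dict) -> dict:
    """Reader for schema v2. Tolerates v1 messages that lack the new
    'currency' field by supplying a default (backward compatibility)."""
    return {
        "order_id": raw["order_id"],
        "amount": raw["amount"],
        "currency": raw.get("currency", "USD"),  # new optional field, defaulted
    }

v1_msg = {"order_id": "o-1", "amount": 50}                      # old producer
v2_msg = {"order_id": "o-2", "amount": 75, "currency": "EUR"}   # new producer
assert parse_order_v2(v1_msg)["currency"] == "USD"
assert parse_order_v2(v2_msg)["currency"] == "EUR"
```

A registry enforces the same idea at publish time: it rejects a new schema whose changes (such as removing a field without a default) would break existing readers.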
Interview Tip
Say how you handle duplicates. Retries happen all the time in real systems. Also mention dead-letter queues proactively; it signals that you think about failure modes and operational visibility, not just the happy path.