ARCHITECTURE | 2023-09-22 | BY PRIYA PATEL

Q&A: choosing between queues and streams

Answers to common questions about when we use queues, when we use streams, and what we watch out for operationally.

Tags: architecture, queues, streams, events

Q&A

Why do we use both queues and streams at all?

Because they solve related but different problems.

  • Queues are about work: move tasks from producers to workers, with each task usually handled once by a single worker.
  • Streams are about events: record what happened so multiple consumers can react or rebuild state.

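Trying to make one system behave like the other usually leads to surprising failure modes, such as a queue that is asked to replay history it no longer holds, or a stream consumer that waits for a reply that never comes.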
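To make the distinction concrete, here is a minimal in-memory sketch of the two delivery models. The class names `WorkQueue` and `EventLog` are hypothetical and exist only for illustration; real brokers add durability, acknowledgements, and partitioning on top of these ideas.

```python
from collections import deque


class WorkQueue:
    """Each task is handed to exactly one caller of take()."""

    def __init__(self):
        self._tasks = deque()

    def put(self, task):
        self._tasks.append(task)

    def take(self):
        # Removing the task is what makes delivery single-consumer.
        return self._tasks.popleft() if self._tasks else None


class EventLog:
    """Append-only log; each consumer tracks its own read position."""

    def __init__(self):
        self._events = []
        self._offsets = {}

    def append(self, event):
        self._events.append(event)

    def read(self, consumer):
        # Reading never removes events; it only advances this consumer's offset.
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch


work = WorkQueue()
work.put("resize:img-1")
first_take = work.take()    # one worker gets the task
second_take = work.take()   # nothing is left for a second worker

log = EventLog()
log.append("user-created:42")
billing_view = log.read("billing")       # both consumers see the same event
analytics_view = log.read("analytics")
```

The queue forgets a task once it is taken, while the log keeps every event and only moves a per-consumer offset.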

When do we reach for a queue first?

We default to queues when:

  • there is clear, bounded work to be done (send email, process a payment, resize an image)
  • we care about controlling concurrency and retries
  • only one consumer needs to do the work

Operationally, queues make it easier to:

  • control per-job retries
  • scale workers up and down
  • reason about "in flight" work
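As a rough illustration of controlling concurrency and tracking in-flight work, the sketch below drains a queue with a fixed number of worker threads. The function name `run_workers` and the `stats` dictionary are assumptions made for this example, not part of any particular job system.

```python
import queue
import threading


def run_workers(jobs, handler, concurrency):
    """Process jobs with at most `concurrency` handlers running at once."""
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    results = []
    stats = {"in_flight": 0, "peak_in_flight": 0}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            with lock:
                stats["in_flight"] += 1
                stats["peak_in_flight"] = max(
                    stats["peak_in_flight"], stats["in_flight"]
                )
            try:
                outcome = handler(job)
            finally:
                with lock:
                    stats["in_flight"] -= 1
            with lock:
                results.append((job, outcome))

    # The number of threads is the concurrency limit.
    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, stats


results, stats = run_workers(["a", "b", "c"], str.upper, concurrency=2)
```

Scaling workers up or down is a matter of changing `concurrency`; the in-flight counter is the kind of number we would expose as a metric.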

When do we reach for a stream first?

We default to streams when:

  • multiple systems need to react to the same events
  • we might need to replay history (for backfills, debugging, or new consumers)
  • ordering within a key (like per user or per entity) matters

Operationally, streams make it easier to:

  • add new consumers later
  • rebuild projections or derived data
  • debug by replaying specific event sequences
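The sketch below shows, under simplifying assumptions, how per-key ordering and replay fit together. Events are routed to a partition by a stable hash of their key, so all events for one key stay in order, and a new consumer can rebuild a projection by reading from the start. `PartitionedLog` and `rebuild_latest_status` are hypothetical names used only here.

```python
import zlib


def partition_for(key, num_partitions):
    # A stable hash, so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, event):
        index = partition_for(key, len(self.partitions))
        self.partitions[index].append((key, event))

    def replay(self):
        # Order is guaranteed within a partition, not across partitions.
        for partition in self.partitions:
            yield from partition


def rebuild_latest_status(log):
    """Rebuild a 'latest status per user' projection from full history."""
    latest = {}
    for key, event in log.replay():
        latest[key] = event  # later events for the same key overwrite earlier ones
    return latest


log = PartitionedLog(num_partitions=4)
log.append("user-1", "signed_up")
log.append("user-2", "signed_up")
log.append("user-1", "verified")
log.append("user-1", "suspended")

projection = rebuild_latest_status(log)
```

Because every event for `user-1` lands in the same partition, the projection ends with that user's most recent status regardless of how other keys are distributed.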

What are the common mistakes with queues?

A few patterns show up repeatedly:

  • putting unbounded work in a single message (jobs that do "everything" for a whole account)
  • unbounded retries that hide underlying failures
  • sharing one hot queue across unrelated workloads, which makes it hard to set SLOs for any of them

We guard against these with:

  • per-job-type concurrency and retry limits
  • metrics for queue depth and job age
  • separate queues for latency-sensitive vs bulk work
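A minimal sketch of the first two guards, assuming an in-memory queue: retries are capped, exhausted jobs move to a dead-letter list instead of looping forever, and depth and oldest-job age are available as metrics. The names `Job`, `BoundedRetryQueue`, and `max_attempts` are illustrative, not taken from a specific library.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    kind: str
    payload: dict
    attempts: int = 0
    enqueued_at: float = field(default_factory=time.monotonic)


class BoundedRetryQueue:
    def __init__(self, max_attempts):
        self.max_attempts = max_attempts
        self.pending = deque()
        self.dead_letter = []

    def put(self, job):
        self.pending.append(job)

    def depth(self):
        return len(self.pending)

    def oldest_job_age(self, now=None):
        """Seconds the oldest pending job has waited; 0.0 when empty."""
        if not self.pending:
            return 0.0
        now = time.monotonic() if now is None else now
        return now - min(job.enqueued_at for job in self.pending)

    def process_one(self, handler):
        """Run one job; return False when there is nothing to do."""
        if not self.pending:
            return False
        job = self.pending.popleft()
        job.attempts += 1
        try:
            handler(job)
        except Exception:
            if job.attempts >= self.max_attempts:
                # Surface the failure instead of retrying forever.
                self.dead_letter.append(job)
            else:
                self.pending.append(job)
        return True


def always_fails(job):
    raise RuntimeError("downstream unavailable")


emails = BoundedRetryQueue(max_attempts=3)
emails.put(Job(kind="send_email", payload={"to": "user@example.com"}))
while emails.process_one(always_fails):
    pass
```

Giving each job type its own `BoundedRetryQueue` instance is one way to get per-type limits and to keep latency-sensitive work away from bulk work.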

What are the common mistakes with streams?

For streams, we often see:

  • using them as if they were RPC (expecting immediate responses)
  • not planning for consumer lag or reprocessing
  • assuming consumers are independent when they actually need coordination

We guard against these by:

  • modeling streams as append-only logs, not request channels
  • designing consumers to be idempotent and tolerant of duplicates
  • documenting which ordering guarantees we rely on
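One common way to make a consumer tolerant of duplicates is to remember which event IDs it has already applied. The sketch below assumes each event carries a unique `id` field, which is an assumption of this example; it also keeps the seen set in memory, whereas a real consumer would need to persist it together with the state it protects.

```python
class IdempotentConsumer:
    def __init__(self):
        self.seen_event_ids = set()
        self.totals = {}

    def handle(self, event):
        """Apply the event once; return False if it was a duplicate."""
        if event["id"] in self.seen_event_ids:
            return False
        account = event["account"]
        self.totals[account] = self.totals.get(account, 0) + event["amount"]
        self.seen_event_ids.add(event["id"])
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "evt-1", "account": "acct-1", "amount": 10},
    {"id": "evt-2", "account": "acct-1", "amount": 5},
    {"id": "evt-1", "account": "acct-1", "amount": 10},  # redelivered
]
applied = [consumer.handle(event) for event in deliveries]
```

The redelivered `evt-1` is ignored, so the total stays correct even when the stream delivers an event more than once or a replay overlaps with earlier processing.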

How do we decide in ambiguous cases?

We ask a few questions:

  • Is this primarily about work or history? If work, queue. If history, stream.
  • Do multiple consumers need to react independently? If yes, streams are often a better fit.
  • Do we need to replay or audit later? Streams help here; queues usually don’t keep history.

Sometimes the answer is both:

  • a stream records events
  • workers consume from the stream and push specific tasks onto queues
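The combined pattern can be sketched as a small bridge that reads events from a log starting at a saved offset and turns the relevant ones into tasks. The event type `order_placed` and the task names are made up for illustration.

```python
from collections import deque


def bridge(events, offset, task_queue):
    """Read events from `offset` onward, enqueue tasks, return the new offset."""
    for event in events[offset:]:
        if event["type"] == "order_placed":
            # One event can fan out into several bounded tasks.
            task_queue.append({"task": "send_receipt", "order_id": event["order_id"]})
            task_queue.append({"task": "reserve_stock", "order_id": event["order_id"]})
    return len(events)


events = [
    {"type": "order_placed", "order_id": 1},
    {"type": "cart_viewed", "order_id": None},
]
tasks = deque()
offset = bridge(events, 0, tasks)

# A later run picks up only the events appended since the saved offset.
events.append({"type": "order_placed", "order_id": 2})
offset = bridge(events, offset, tasks)
```

The stream stays the record of what happened, while the queue carries the retryable work derived from it.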

How does this show up in incident response?

The choice affects what we look at during incidents:

  • With queues, we watch depth, job age, and worker health.
  • With streams, we watch consumer lag, partition hot spots, and reprocessing behavior.
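Consumer lag is typically computed per partition as the distance between the end of the log and the consumer's committed position. The helper names and the sample offsets below are hypothetical; a skewed lag profile is only one possible signal of a hot partition.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return the number of unprocessed events per partition."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def lagging_partitions(lag_by_partition, threshold):
    """Partitions whose lag exceeds the threshold, in sorted order."""
    return sorted(
        partition for partition, lag in lag_by_partition.items() if lag > threshold
    )


log_end_offsets = {0: 100, 1: 250, 2: 90}
committed_offsets = {0: 100, 1: 40, 2: 85}

lag = consumer_lag(log_end_offsets, committed_offsets)
behind = lagging_partitions(lag, threshold=50)
```

Here only partition 1 is far behind, which points at a skewed key or a slow handler for that partition rather than a consumer that is down entirely.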

We also think about blast radius:

  • a stuck queue can block a specific kind of work
  • a misbehaving stream consumer can keep writing bad derived state until it is fixed, and that state may then need to be rebuilt

Takeaways

  • Queues and streams solve related problems but behave differently in operation.
  • Queues are best for bounded, single-consumer work; streams are best for shared, replayable history.
  • Most trouble comes from trying to use one as if it were the other.
  • Asking whether you care more about work, history, or both usually points to the right tool.

Further reading