
Async Agent Architectures: Queue-Based vs Streaming, and When Each Wins

Synchronous agents are easy to reason about and impossible to scale. The two patterns that replace them — message queues and streaming pipelines — solve different problems, and choosing wrong is expensive.

[Figure: Architecture comparison of a queue-based agent system and a streaming agent pipeline, with their respective tradeoffs]

The wall every agent system hits

The first version of an agent system is almost always synchronous. The user sends a request, the agent loops through its steps, and when it's done, the response comes back. Simple to build, easy to debug, and it works fine — until traffic grows past a certain point. Then you start hitting the wall.

The wall has several names. Long agent runs hold connections open and exhaust your web server's worker pool. A single slow tool call backs up everything behind it. Retries during a partial outage cascade into worse outages. Users abandon requests that take more than a few seconds to start producing output. Every team building serious agent systems crosses this line eventually, and the answer is the same: go async.

What's less obvious is that "going async" doesn't mean one thing. There are two distinct patterns — queue-based and streaming — with very different tradeoffs. Choosing between them isn't a stylistic preference. It changes what kinds of problems your system handles well and which ones it handles badly.

The two patterns, briefly

Queue-based async

The user submits a request. The system writes a job to a queue and immediately returns an acknowledgment with a job ID. A pool of worker processes picks jobs off the queue, runs them to completion, and writes the results to a store. The user polls (or receives a webhook) when the job is done.
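The flow above can be sketched with stdlib primitives. The `submit`, `worker`, and `run_agent` names are illustrative, and a real system would use Celery, Sidekiq, or SQS rather than an in-process queue:

```python
# Minimal sketch of the queue-based pattern. The in-process queue and
# result dict stand in for a real broker and result store.
import queue
import threading
import uuid

jobs = queue.Queue()          # stands in for the message queue
results = {}                  # stands in for the result store

def submit(request):
    """API handler: enqueue the job and return a job ID immediately."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, request))
    return job_id

def run_agent(request):
    """Placeholder for the actual agent loop inside the worker."""
    return f"processed: {request}"

def worker():
    while True:
        job_id, request = jobs.get()
        results[job_id] = run_agent(request)   # write result to the store
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = submit("summarize these docs")
jobs.join()                   # a real client would poll or get a webhook
print(results[job_id])        # → "processed: summarize these docs"
```

Note that the HTTP request returns as soon as `submit` does; the worker pool's throughput, not the web server's worker pool, bounds how many agent runs are in flight.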

This is the pattern of background processing systems going back decades — Sidekiq, Celery, SQS-based architectures. The "AI agent" part is just what happens inside the worker. From the infrastructure's perspective, an agent run is the same shape as any other long-running job.

Streaming-based async

The user submits a request and immediately gets a streaming connection back — Server-Sent Events, WebSockets, or HTTP streaming. As the agent makes progress, it pushes events down the stream: tool calls, partial outputs, status updates, intermediate reasoning. The user sees the work happening in real time, even though the underlying request is still long-running.
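A minimal sketch of the streaming side, assuming the agent is a generator that yields events and the handler frames them as Server-Sent Events. The event types shown are illustrative, not a real protocol:

```python
# Sketch of the streaming pattern: the agent emits progress events as it
# works, and the connection handler frames each one as an SSE message.
import json

def agent_run(request):
    """Placeholder agent that yields events instead of one final answer."""
    yield {"type": "status", "data": "starting"}
    yield {"type": "tool_call", "data": "search(docs)"}
    yield {"type": "partial_output", "data": "Here is a summary..."}
    yield {"type": "done", "data": "final answer"}

def sse_stream(request):
    """Frame each agent event as an SSE message the client consumes live."""
    for event in agent_run(request):
        yield f"data: {json.dumps(event)}\n\n"

for frame in sse_stream("summarize these docs"):
    print(frame, end="")
```

The user sees the first frame as soon as the agent produces it, which is what makes the perceived latency so much better than waiting for the full run.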

This is the pattern most modern chat-style AI products use, and the one that frameworks like LangGraph and OpenAI's Realtime API are built around.

What each one is actually good at

Queue-based wins for: throughput, reliability, and asymmetric workloads

When the workload is uneven — most jobs short, a few very long; most straightforward, some requiring expensive backtracking — queues are the right answer. The queue smooths out the variance: workers pick up the next job whenever they finish, and a slow job doesn't block the fast ones behind it.

Queue-based systems are also dramatically easier to make reliable. When a worker crashes mid-job, the message goes back on the queue and another worker picks it up. When you need to deploy a new version of the worker, you drain the old workers gracefully while new ones start. When traffic spikes, you scale workers horizontally without touching anything else. Everything you've already learned about operating background job systems transfers directly.
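The requeue-on-crash behavior can be sketched like this. Real brokers implement it with acks and visibility timeouts; `MAX_ATTEMPTS`, `process_one`, and the deliberately flaky agent are illustrative:

```python
# Sketch of requeue-on-failure: a job whose worker crashes goes back on
# the queue (up to a retry limit) instead of being lost.
import queue

jobs = queue.Queue()
MAX_ATTEMPTS = 3

def process_one(run_agent):
    """One worker iteration: run a job, requeue it on failure."""
    job_id, payload, attempts = jobs.get()
    try:
        return job_id, run_agent(payload)
    except Exception:
        if attempts + 1 < MAX_ATTEMPTS:
            jobs.put((job_id, payload, attempts + 1))  # back on the queue
        return job_id, None                            # dead-letter after the limit

calls = {"n": 0}
def flaky_agent(payload):
    """Fails twice, then succeeds, simulating worker crashes mid-job."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("worker crashed mid-job")
    return f"done: {payload}"

jobs.put(("job-1", "long analysis", 0))
result = None
while result is None:
    _, result = process_one(flaky_agent)
print(result)   # → "done: long analysis" on the third attempt
```

The retry limit matters: without it, a poisoned job cycles forever, which is exactly the retry-cascade failure mode described earlier.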

Streaming wins for: perceived latency and conversational UX

When the user is sitting there waiting, perceived latency matters more than actual latency. A streaming response that starts producing output in 500ms feels dramatically faster than a queue-based response that completes in 3 seconds, even though the queue version is technically faster end-to-end.

Streaming also unlocks UX patterns that aren't possible with queues: showing the agent's intermediate steps as they happen, letting the user interrupt mid-execution to redirect, pausing for human input partway through a workflow. Anything that requires the agent and the user to interact during execution needs streaming.

The decision shortcut

If the user is going to stare at the screen until the agent finishes, you want streaming. If the user will do something else and come back later, you want queues. Most teams default to streaming because their first product was a chat interface, then keep using it for jobs where queues would be a better fit.

The hidden costs of using the wrong one

Streaming for batch-shaped workloads

A common antipattern: building a data-processing or document-generation system on a streaming architecture because that's what the team learned first. The result is an infrastructure that can't survive a worker restart, can't easily retry failed jobs, can't be load-balanced cleanly, and breaks every time a connection times out. Half the engineering time goes into reinventing the reliability primitives that queue systems give you for free.

Queues for interactive workloads

The reverse mistake hurts user experience. A polling-based interface for a feature where users want immediate feedback feels broken — even if it works correctly. We've seen products where the underlying agent was fine, but the polling-based UX caused users to give up before the response arrived. The fix was changing the transport layer, not the agent.

Hybrid done badly

Some teams try to have it both ways: a streaming connection that pushes status updates, but with the actual work happening in a queue worker. This can work, but it's the most complicated of the three options and adds a coordination layer between the stream and the queue that's easy to get wrong. Don't reach for it unless you've tried both pure approaches and have a specific reason neither works.

A workable hybrid pattern

When you do need both — long-running work plus interactive feedback — the pattern that holds up best is queue-backed streaming:

  1. The user opens a streaming connection
  2. The connection handler enqueues the job and subscribes to a channel for that job's events
  3. A worker picks up the job, executes it, and publishes events to the channel as it goes
  4. The connection handler forwards each event down the stream to the user
  5. If the connection drops, the worker keeps going; the user can reconnect to a separate "resume" endpoint and pick up from where they left off
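The five steps above can be sketched in-process, with a `queue.Queue` per job standing in for the pub/sub channel. A real deployment would use Redis pub/sub or similar, and all names here are illustrative:

```python
# Sketch of queue-backed streaming: the work runs in a queue worker that
# publishes events to a per-job channel; the connection handler forwards
# them down the stream to the user.
import queue
import threading
import uuid

jobs = queue.Queue()
channels = {}   # job_id -> per-job event channel (stands in for pub/sub)

def worker():
    while True:
        job_id, request = jobs.get()
        chan = channels[job_id]
        chan.put({"type": "status", "data": "started"})       # publish progress
        chan.put({"type": "result", "data": f"done: {request}"})
        chan.put(None)                                         # end-of-stream marker

threading.Thread(target=worker, daemon=True).start()

def stream(request):
    """Connection handler: enqueue the job, subscribe, forward events."""
    job_id = str(uuid.uuid4())
    channels[job_id] = queue.Queue()   # subscribe before enqueueing
    jobs.put((job_id, request))
    while (event := channels[job_id].get()) is not None:
        yield event

for event in stream("summarize these docs"):
    print(event)
```

Because the worker publishes to the channel rather than writing to the connection, a dropped stream doesn't kill the job; a resume endpoint only needs to re-subscribe to the same channel.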

This gives you the reliability of queues with the perceived latency of streaming. The cost is more moving parts and a pub/sub layer in the middle. For systems where reliability and UX both matter — and where you can afford the operational complexity — it's the right answer.

The async architecture you choose isn't a transport detail. It shapes which failure modes are easy to handle and which are catastrophic. Pick based on what your workload actually looks like, not on what your first prototype happened to use.

How to decide for your system

A short rubric that gets the answer right most of the time:

  1. Will the user watch the run in real time? Streaming.
  2. Will the user submit the job and come back later? Queues.
  3. Is the workload uneven, with occasional very long jobs? Queues.
  4. Does the user need to interrupt or redirect mid-run? Streaming.
  5. Do reliability and live feedback both matter? Queue-backed streaming.

The teams that make this choice deliberately end up with systems that scale gracefully. The teams that pick whichever framework example they copied first end up rewriting their infrastructure six months later, usually under pressure, usually while users are complaining. Make the call early — it's much cheaper than making it late.
