Persistent WebSocket Connections for LLMs Speed Up Agentic AI Workflows by Up to 40%
Impact score: 8
What is the Viqus Verdict?
We evaluate each news story based on its real impact versus its media hype to offer a clear and objective perspective.
AI Analysis:
The technical complexity and infrastructural significance of this change warrant a high impact score: it resolves a fundamental architectural limitation of current multi-turn AI systems, despite moderate media coverage so far.
Article Summary
This deep-dive technical article details how the Codex team tackled latency bottlenecks in complex, multi-step AI agentic workflows. Previously, every step required a full synchronous API call, forcing the system to re-process the entire conversation history; this significantly slowed down tasks involving dozens of back-and-forth tool calls. The solution was to implement persistent WebSocket connections and in-memory state caching for the Responses API. By passing cached state (such as previous response objects and rendered tokens) rather than repeatedly resending the full conversation context, the overhead was drastically reduced. This optimization preserved API stability while enabling extremely fast models like GPT-5.3-Codex-Spark to reach a throughput of 1,000+ tokens per second, a major leap in real-world agent capability.
Key Points
- The primary bottleneck for AI agents was not model inference speed, but the cumulative API overhead generated by numerous synchronous calls and re-processing full conversation history.
- The team transitioned the Responses API to support persistent WebSocket connections, allowing them to cache state in memory and only process new or changed information.
- This structural change achieved a latency improvement of up to 40% in agentic workflows, enabling models to sustain 1,000+ tokens per second in production environments.
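
The cost difference described above can be illustrated with a rough back-of-the-envelope model. The sketch below is not the Codex team's code; the function names and turn sizes are assumptions chosen purely to show why re-processing the full history on every synchronous call scales quadratically with conversation length, while a persistent connection with server-side cached state processes each token roughly once:

```python
def tokens_processed_stateless(turn_sizes):
    """Stateless synchronous calls: every step re-processes the whole history."""
    total, history = 0, 0
    for size in turn_sizes:
        history += size   # conversation history grows each turn
        total += history  # the full prefix is re-processed on every call
    return total

def tokens_processed_cached(turn_sizes):
    """Persistent connection with cached state: only new tokens are processed."""
    return sum(turn_sizes)

# Hypothetical agentic workflow: 30 tool-call turns of ~200 tokens each
turns = [200] * 30
print(tokens_processed_stateless(turns))  # 93000 tokens re-processed
print(tokens_processed_cached(turns))     # 6000 tokens processed once
```

Under these assumed numbers, the stateless loop processes roughly fifteen times more tokens than the cached one over 30 turns, which is why the article identifies cumulative API overhead, not model inference speed, as the dominant bottleneck.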

