Slow Agents
We recently upgraded from a RAG workflow to a fully agentic system. We saw significant gains in quality, but unfortunately there is no such thing as a free lunch: our median response time jumped from 3.5s to 14s. This degradation came from tool usage and a model swap from Gemini 2.0 Flash to Claude Sonnet 4 (slower, but with significantly stronger agentic capabilities).
The response time was deemed unsatisfactory for our use case (an interactive chatbot), so we kicked off an investigation into how we could reduce it to 7-9s. If you cannot measure it, you cannot improve it: we set up the highest-fidelity environment possible and repeated experiments to control for LLM non-determinism and load variability on the model provider's end. We quickly came to the conclusion that output tokens were the bottleneck. As per OpenAI, a 50% reduction in input tokens results in only a 1-5% reduction in latency, whilst the same reduction in output tokens results in a ~50% reduction in latency. This aligns with what we had previously seen when we enabled prompt caching, where only the Time to First Token (TTFT) improved.
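For illustration, here is a minimal sketch of the kind of measurement harness we mean; the call_llm_stream helper is a hypothetical stand-in for your provider's streaming SDK, and the simulated timings are placeholders.

```python
import statistics
import time

def call_llm_stream(prompt: str):
    # Stand-in for your provider's streaming API; replace with the real SDK call.
    # Here we just simulate ~1s to first token and 50 output tokens at ~50 tok/s.
    time.sleep(1.0)
    for _ in range(50):
        time.sleep(0.02)
        yield "tok"

def measure(prompt: str, trials: int = 10) -> None:
    ttfts, totals = [], []
    for _ in range(trials):
        start = time.perf_counter()
        first = None
        for _chunk in call_llm_stream(prompt):
            if first is None:
                first = time.perf_counter()
        totals.append(time.perf_counter() - start)
        ttfts.append(first - start)
    # Repeated trials and percentile reporting smooth out non-determinism
    # and provider-side load spikes.
    print(f"TTFT p50: {statistics.median(ttfts):.2f}s, "
          f"end-to-end p50: {statistics.median(totals):.2f}s")

measure("What is our refund policy?")
```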
To reduce output tokens, there are a few vectors of attack...
Faster Model
Claude Sonnet 4 provided great answers but has one of the lowest output-token throughputs. Therefore, an obvious solution is to switch to a model with a higher output-token throughput. This turned out not to be so obvious.
Some models are faster but have weaker agentic abilities, resulting in worse quality or more tool calls, neither of which was an attractive option. Other models generate more output tokens per second and provide similar quality, but still see worse end-to-end latency. This turned out to be due to reasoning tokens, which are still output tokens (visible or not) and therefore hurt latency. For some models, like Gemini 2.5 Pro, you cannot even disable this thinking.
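A back-of-the-envelope illustration of this effect, with made-up numbers rather than our actual measurements:

```python
# Illustrative numbers only (not our measurements): why higher raw output-token
# throughput does not guarantee lower end-to-end latency once hidden reasoning
# tokens are counted as output.
answer_tokens = 300

direct_tps = 50                 # tokens/sec, no reasoning tokens
thinking_tps = 120              # tokens/sec, but emits reasoning tokens first
reasoning_tokens = 900          # not shown to the user, yet still generated

direct_latency = answer_tokens / direct_tps                            # 6.0s
thinking_latency = (answer_tokens + reasoning_tokens) / thinking_tps   # 10.0s

print(f"slower-but-direct model:   {direct_latency:.1f}s")
print(f"faster-but-thinking model: {thinking_latency:.1f}s")
```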
We did find a viable option in GPT-4.1, with a p50 of 10s and only a small drop in quality. However, we ultimately didn't take this further because we didn't feel confident in our ability to deliver a full model change and deal with all the nuances in the limited time we had. A model change is something we plan to revisit in the future, particularly with recent releases such as GPT-5 and Claude Haiku 4.5.
Answer Conciseness
Rather than changing the model, we could try to change the model's instructions: by telling it to be more concise and updating our various style guidelines, we could shorten answers and reduce the number of output tokens. However, this change only affects the final answer generation, not the intermediate output generated along the way, so we estimated that it would only shave 1-2s off the p50. This, combined with the rigmarole of updating style guidelines with our PMs, meant that we decided not to explore a prompt change in earnest.
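The 1-2s estimate comes from arithmetic along these lines; the throughput and token counts below are assumptions for illustration, not measured values:

```python
# Back-of-the-envelope estimate; all numbers below are assumptions, not measurements.
output_tps = 60                # assumed output-token throughput
answer_tokens = 250            # assumed typical final-answer length
conciseness_reduction = 0.4    # hoped-for reduction from stricter style guidance

seconds_saved = answer_tokens * conciseness_reduction / output_tps
print(f"~{seconds_saved:.1f}s saved")  # ~1.7s: real, but small against a 14s p50
```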
Fewer Rounds
Another way to reduce response time is to do less work, specifically by making fewer requests to the LLM. However, at the time, our mean number of LLM calls was only 2.16, so optimising this would only improve the worst-case requests and not move the p50 needle.
We still believe this is an important avenue, particularly as we scale to more tools. A couple of ideas we had were to (1) reduce tool-call redundancy by improving definitions and/or merging similar tools, and (2) encourage tool parallelisation and/or speculation, e.g. via retrieval hedging with different query reformulations, as in the sketch below.
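As a rough sketch of idea (2), assuming a hypothetical async search tool (the names and signatures here are ours, not any particular framework's):

```python
import asyncio

async def search(query: str) -> list[str]:
    # Hypothetical retrieval tool; replace with your actual search/tool call.
    await asyncio.sleep(0.5)
    return [f"result for {query!r}"]

async def hedged_retrieval(question: str, reformulations: list[str]) -> list[str]:
    # Fire the original query and its reformulations concurrently instead of
    # letting the agent issue them as sequential tool-call rounds.
    queries = [question, *reformulations]
    results = await asyncio.gather(*(search(q) for q in queries))
    # Merge and de-duplicate; a re-ranker could slot in here.
    seen, merged = set(), []
    for batch in results:
        for doc in batch:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

docs = asyncio.run(hedged_retrieval(
    "How do I reset my password?",
    ["password reset steps", "account recovery options"],
))
print(docs)
```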
User Perception
The final lever we looked at was reducing perceived latency. This provided a concrete way to improve the user experience without affecting quality or taking too much time to implement. As is often the case, the simplest solution turned out to be the best.
We started sending progress updates, i.e. tools would send messages like "searching for entities...", and streaming the final answer. There was one hiccup with answer streaming where Claude would sometimes generate intermediate thinking sections that we did not want to show to the end user, yet we could not distinguish them from the final answer. To address this, we prompted the model to wrap its final answer: <answer>2 + 2 = 4</answer>. Then, during output streaming, we only sent the partial results if they began with the <answer> tag. This was surprisingly robust and we rarely saw violations in our evaluation set.
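A minimal sketch of that tag-gated streaming filter (the chunking behaviour and helper names are assumptions, not our production code):

```python
ANSWER_OPEN, ANSWER_CLOSE = "<answer>", "</answer>"

def stream_answer(chunks):
    # `chunks` is the raw token stream from the model (an iterator of strings).
    # Buffer until we know whether the output starts with <answer>; only then
    # forward text to the user, with the tags stripped out.
    buffer, started = "", False
    for chunk in chunks:
        buffer += chunk
        if not started:
            if len(buffer) < len(ANSWER_OPEN):
                continue                       # not enough text yet to decide
            if not buffer.startswith(ANSWER_OPEN):
                return                         # thinking/preamble: suppress it
            buffer, started = buffer[len(ANSWER_OPEN):], True
        if ANSWER_CLOSE in buffer:
            yield buffer.split(ANSWER_CLOSE, 1)[0]
            return
        # Hold back a short tail so a closing tag split across chunks never leaks.
        safe = len(buffer) - (len(ANSWER_CLOSE) - 1)
        if safe > 0:
            yield buffer[:safe]
            buffer = buffer[safe:]

# Example with a fake token stream:
for text in stream_answer(["<ans", "wer>2 + 2", " = 4</an", "swer>"]):
    print(text, end="")
```

Holding back a few characters before forwarding trades a negligible amount of perceived latency for never leaking a partially streamed closing tag to the user.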
All in all, we shipped streaming in less than a week and reduced median time to first answer token from 14s to 8.5s, unblocking our release and making everyone happy 🚀