Optimizing Latency and Cost in Multi-Agent Systems

Minimizing API calls is often treated as a default optimization target in LLM-powered systems. We initially approached the design of our multi-agent pipeline with that assumption: fewer calls, fewer problems.
The data contradicted us.
After operating production multi-agent systems across marketing and sales tasks for over a year, we found a consistent pattern: splitting tasks into smaller, simpler, and more narrowly scoped LLM calls reliably improved latency, cost, and reliability. This post outlines what we observed, why it likely happens, and which design choices made the most impact.
Why generalist agents underperform
We began with what seemed like an efficient design: delegate entire workflows to single agents. A lead-ranking agent, for instance, might take raw CRM input, sales notes, and product context, and generate a priority score and rationale in one call.
The problems compounded quickly:
- Latency exceeded 30 seconds for a single lead
- Costs grew, driven by reliance on higher-tier models. I woke up one morning to find half of our monthly LLM budget gone overnight. Not fun.
- Failures were opaque—debugging which part of the prompt failed was non-trivial.
The core issue: generalized prompts tend to encode logic for many possible inputs, even when that logic isn’t relevant for a given task. That generality creates computational overhead and ambiguity.
What worked better: Narrow and stateless specialist agents
We began replacing these general-purpose agents with a distributed model: each step of the workflow was handled by a separate, task-specific agent.
Performance improved across every metric. Take the lead-ranking agent I mentioned earlier:
- Latency dropped by 72%
- Cost per lead decreased by 54%
- Hallucination-related errors declined by 19%
Beyond the raw metrics, this shift allowed us to clearly observe and predict how the model would behave for any given input.
Tactics we found useful for developing specialist agents
1. Atomic Task Design
Atomic task design involves decomposing a complex task into simpler, non-overlapping units that can be executed independently. The goal is to reduce cognitive and computational load on each agent by focusing it on a single responsibility.
The key to designing this system: tasks must not depend on each other in any way. If two tasks depend on each other's outputs, they cannot run in parallel and will add to latency. We therefore had to find creative ways to decouple agents from each other.
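As an illustration, here is a minimal sketch of an atomic decomposition of the lead-ranking task, assuming an async call_llm wrapper around your provider's SDK; the particular subtasks and helper names are hypothetical, not our production code.
```python
# Minimal sketch of atomic task design: each agent owns one narrow,
# non-overlapping responsibility, and independent calls run in parallel.
import asyncio

async def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real model call."""
    await asyncio.sleep(0)  # stands in for network I/O
    return f"<output for: {prompt[:40]}...>"

async def summarize_notes(notes: str) -> str:
    return await call_llm(f"Summarize these sales notes in 3 bullets:\n{notes}")

async def extract_firmographics(crm_record: dict) -> str:
    return await call_llm(f"Extract industry, size, and region from:\n{crm_record}")

async def score_product_fit(product_context: str, crm_record: dict) -> str:
    return await call_llm(f"Rate product fit from 1-5 given:\n{product_context}\n{crm_record}")

async def rank_lead(crm_record: dict, notes: str, product_context: str) -> str:
    # The subtasks are decoupled: none consumes another's output, so they
    # run concurrently and total latency tracks the slowest single call.
    notes_summary, firmographics, fit = await asyncio.gather(
        summarize_notes(notes),
        extract_firmographics(crm_record),
        score_product_fit(product_context, crm_record),
    )
    # One final narrow call (or plain code) combines the pieces.
    return await call_llm(
        "Produce a priority score and one-line rationale from:\n"
        f"{notes_summary}\n{firmographics}\n{fit}"
    )

if __name__ == "__main__":
    print(asyncio.run(rank_lead({"name": "Acme"}, "met at a booth", "analytics product")))
```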
2. The "Think" and "Remember" tool calls
For tasks where we had been using reasoning models, switching to smaller models proved very difficult, because even though the task becomes smaller, the need for context and thinking does not go away.
We found that giving a small model the option to Think and to Remember specific context through tool calls is not only very fast, but also improves accuracy by an order of magnitude on complex tasks.
Most complex tasks have to start with a Think tool call. In some of the more complex cases, we required the LLM to Think between each step.
The Remember tool call is a basic key:value store: the LLM requests a key and gets back the value. Each key comes with a hint describing when it should be invoked. For example, we have an agent that writes Salesforce SOQL queries, and there is a list of common mistakes it repeatedly makes. These mistakes are included in the system prompt, but as the context grows, the model starts making the same mistakes again because it no longer attends to the beginning of the conversation. In this case, we instructed the LLM to Remember "Common SOQL Mistakes", which it consistently does, and that keeps it from failing.
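A minimal sketch of what a Remember store can look like, assuming a generic tool-calling loop; the keys, hint text, and function names are illustrative rather than our exact implementation.
```python
# Minimal sketch of the Think and Remember tools for a small model.

# Each key carries a hint telling the model when it should be invoked.
MEMORY = {
    "Common SOQL Mistakes": {
        "when": "Before writing or revising any Salesforce SOQL query.",
        "value": "- Do not use SELECT *\n- Quote date literals correctly\n- ...",
    },
}

def think(thought: str) -> str:
    """Tool: a free-form scratchpad; returning it keeps the reasoning in context."""
    return thought

def remember(key: str) -> str:
    """Tool: re-inject stored context late in the conversation, by key."""
    entry = MEMORY.get(key)
    return entry["value"] if entry else f"No memory stored under '{key}'."

# Tool manifest handed to the agent loop; the 'when' hints live in the
# description so the model knows which key to request at which step.
TOOLS = [
    {"name": "think", "description": "Reason step by step before acting."},
    {
        "name": "remember",
        "description": "Fetch stored context by key. Available keys: "
        + "; ".join(f"'{k}' ({v['when']})" for k, v in MEMORY.items()),
    },
]
```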
3. The "Judge"
Evaluating the outputs of a task—especially in a multi-step pipeline—is non-trivial. We introduced a pattern where a separate "Judge" model would review the outputs of previous steps.
Having an LLM evaluate the output at the end of a complex chain creates the biggest step change in accuracy. In some long-running agents, we use small models for most of the intermediary steps and a larger model only for the "Judge" step, instead of using the larger model for every step.
When we introduced Judge as a tool call and asked the LLM to always use Judge before finalizing an answer, we saw the LLM go to Judge, get rejected, refine its output, and then start evading Judge so it wouldn't get rejected again. It is therefore important to keep Judge as a hardcoded, mandatory final step rather than a tool call within the agent chain.
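A minimal sketch of the hardcoded Judge step under those constraints: the judge always runs after the chain and the agent cannot route around it. run_chain and call_large_model are hypothetical stubs, not a specific provider API.
```python
# Minimal sketch: small models do the intermediary work, a larger model
# judges the result as a mandatory final step rather than a skippable tool.

def run_chain(task: str) -> str:
    """Intermediary steps on small models (details omitted)."""
    return f"<draft answer for: {task}>"

def call_large_model(prompt: str) -> str:
    """Hypothetical placeholder for the expensive judge model."""
    return "APPROVE: internally consistent and grounded."

def judge(task: str, draft: str) -> tuple[bool, str]:
    verdict = call_large_model(
        f"Task: {task}\nDraft: {draft}\n"
        "Reply with APPROVE or REJECT plus one line of feedback."
    )
    return verdict.strip().upper().startswith("APPROVE"), verdict

def run_with_judge(task: str, max_revisions: int = 2) -> str:
    draft = run_chain(task)
    for _ in range(max_revisions):
        approved, feedback = judge(task, draft)  # always runs; not optional for the agent
        if approved:
            return draft
        draft = run_chain(f"{task}\nRevise the previous draft using this feedback: {feedback}")
    return draft  # best effort once the revision budget is spent
```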
4. Context engineering
Too much context slows down inference and increases model confusion. Too little context results in incomplete reasoning. Context engineering aims to strike the right balance: delivering only the necessary information to the agent at each stage.
The tendency to dump all available context into every LLM call, especially in chained agents, is very real. However, small models are especially bad at dealing with long context and consistently fail or hallucinate in those cases.
A good solution is a synthesis step before passing data between two agents. The synthesis step can be a small summarizer instructed to keep only the most important points of the conversation. After each step, the previous summary and the new context are passed in to generate a new synthesis. This keeps both the summarizer and each downstream agent lean.
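A minimal sketch of that rolling synthesis step, assuming a generic call_llm wrapper and agents modeled as plain functions; both are hypothetical placeholders.
```python
# Minimal sketch of context engineering via a rolling synthesis step.
from typing import Callable

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a small, cheap model call."""
    return f"<output for: {prompt[:40]}...>"

def synthesize(prev_summary: str, new_output: str) -> str:
    # A small summarizer keeps only what the next agent actually needs.
    return call_llm(
        "Merge the running summary with the new output. Keep decisions, "
        "entities, and open questions; drop everything else.\n"
        f"Running summary:\n{prev_summary}\n\nNew output:\n{new_output}"
    )

def run_pipeline(task: str, agents: list[Callable[[str], str]]) -> str:
    summary = f"Task: {task}"
    for agent in agents:
        output = agent(summary)                # each agent sees only the lean summary,
        summary = synthesize(summary, output)  # never the full raw history
    return summary
```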
5. Cheap Model Pre-Screening
Many tasks only require high-capacity models in a small fraction of cases. We observed substantial savings by introducing a lightweight screening layer that identifies whether to escalate a task to a more expensive model (or even a more expensive agent).
For example, we have a deep research agent and a shallow research agent. Our general researcher agent first assesses which of the two a question needs and routes accordingly. We are able to deflect 82% of inquiries to the shallow research agent, which is 20x less costly than the deep research agent!
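A minimal sketch of that screening layer: one cheap classification call decides the route before any expensive work starts. call_small_model, shallow_research, and deep_research are hypothetical stand-ins.
```python
# Minimal sketch of cheap-model pre-screening in front of two research agents.

def call_small_model(prompt: str) -> str:
    """Hypothetical placeholder for a cheap classification call."""
    return "SHALLOW"

def shallow_research(question: str) -> str:
    return f"<shallow answer to: {question}>"

def deep_research(question: str) -> str:
    return f"<deep answer to: {question}>"

def route_research(question: str) -> str:
    verdict = call_small_model(
        "Answer with DEEP or SHALLOW only. DEEP means the question needs "
        "multi-source, multi-step research; SHALLOW means a quick lookup is enough.\n"
        f"Question: {question}"
    )
    if verdict.strip().upper().startswith("DEEP"):
        return deep_research(question)   # expensive path, taken only when needed
    return shallow_research(question)    # handles the bulk of inquiries
```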
6. MapReduce
Some workloads—especially those involving batch inputs—benefit from a MapReduce-style pattern. This involves splitting the task into parallel subtasks (Map), then aggregating the results into a single output (Reduce).
We use this pattern for tasks like journey summarization across large datasets or summarizing multi-user conversations. Each agent processes a subset of the data independently, and a final aggregator (sometimes another LLM, sometimes just a function) consolidates the results.
Since all of these subtasks are parallelized, latency tracks the slowest chunk rather than the total size of the task, which is a big win.
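A minimal sketch of the pattern applied to batch summarization; the chunking and the plain-function reduce step are illustrative assumptions.
```python
# Minimal sketch of the MapReduce pattern over batch input.
import asyncio

async def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a model call."""
    await asyncio.sleep(0)
    return f"<summary of: {prompt[:40]}...>"

async def map_step(chunk: str) -> str:
    return await call_llm(f"Summarize this slice of the data:\n{chunk}")

async def summarize_batch(chunks: list[str]) -> str:
    # Map: every chunk is processed in parallel, so latency tracks the
    # slowest chunk rather than the total number of chunks.
    partials = await asyncio.gather(*(map_step(c) for c in chunks))
    # Reduce: here a plain function; it could also be one more LLM call.
    return "\n".join(partials)

if __name__ == "__main__":
    print(asyncio.run(summarize_batch(["chunk one ...", "chunk two ..."])))
```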
7. Opportunistic Caching
Back to basics: if a database query is slow, we instinctively reach for caching. So why not cache LLM queries? We realized that with specialist agents, we have many more opportunities to cache.
For example, for our research agent, we realized that we were making the same types of searches and analyses for different research questions for the same company.
We cached the existing searches and exposed them to our research agent as accessible memory. Roughly 38% of incoming requests hit at least one previously cached search, and most of those had more than one cached search available. 5% of those requests did not have to run any new search at all after reading the cached results.
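A minimal sketch of that cache, using an in-memory dict keyed by company and normalized query; the key scheme and the run_search helper are assumptions for illustration (a production store would be persistent).
```python
# Minimal sketch of opportunistic caching for a research agent's searches.

SEARCH_CACHE: dict[tuple[str, str], str] = {}

def run_search(company: str, query: str) -> str:
    """Hypothetical placeholder for the slow, costly search-and-analyze call."""
    return f"<results for {company}: {query}>"

def cached_search(company: str, query: str) -> str:
    # Normalize the query so near-identical requests hit the same entry.
    key = (company, " ".join(query.lower().split()))
    if key not in SEARCH_CACHE:
        SEARCH_CACHE[key] = run_search(company, query)
    return SEARCH_CACHE[key]

def cached_searches_for(company: str) -> list[str]:
    """Expose previously cached searches to the agent as accessible memory."""
    return [q for (c, q) in SEARCH_CACHE if c == company]
```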
8. Not everything needs to be an agent
We had a persona classification task that relied on a user-entered prompt describing the personas they would most like to reach out to.
To select this persona, we used a two-step process: first reranking all employees within each company using Cohere with a very high max K, then passing that huge list of employees to a smart, long-context LLM to figure out the top N persona matches.
This usually took more than a minute to complete PER COMPANY, and is NOT a good idea.
So instead, we added 2 basic UI inputs.
- Dropdown for department
- Dropdown for seniority level
We first filter the employee list based on these inputs, which yields a much smaller set of employees, and we skip the rerank step entirely. We can then afford to run many cheap LLM calls over each employee in parallel to answer "Yes, this matches the persona" or "No, this doesn't match the persona" (sketched below).
This immediately reduced the task to a maximum of 2-3 seconds, with 91% lower cost on average.
The point is, some of the steps in an agent pipeline don't really have to be LLM calls. You can get around some of your complexity by building plain programs and UX.
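A minimal sketch of that filter-then-classify flow, with hypothetical field names, a call_cheap_llm placeholder, and plain dicts standing in for employee records.
```python
# Minimal sketch: deterministic filtering from two dropdowns, then cheap
# parallel yes/no LLM calls over the much smaller shortlist.
import asyncio

async def call_cheap_llm(prompt: str) -> str:
    """Hypothetical placeholder for a small, cheap model call."""
    await asyncio.sleep(0)
    return "Yes"

async def matches_persona(employee: dict, persona: str) -> bool:
    answer = await call_cheap_llm(
        f"Persona: {persona}\nEmployee title: {employee['title']}\n"
        "Answer Yes or No: does this employee match the persona?"
    )
    return answer.strip().lower().startswith("yes")

async def select_personas(employees: list[dict], persona: str,
                          department: str, seniority: str) -> list[dict]:
    # The two dropdown inputs replace the rerank step entirely.
    shortlist = [e for e in employees
                 if e["department"] == department and e["seniority"] == seniority]
    verdicts = await asyncio.gather(*(matches_persona(e, persona) for e in shortlist))
    return [e for e, ok in zip(shortlist, verdicts) if ok]
```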
Ending Note
We're currently applying these patterns successfully to data analysis, complex enterprise sales tasks, deep research, and other adjacent tasks in sales and marketing.
Horizontal agent companies, such as Writer or Glean, aren't able to make use of these efficiency gains as easily, because a lot of our gains came from being able to predict the use cases of our agents very well and tune the prompts accordingly. A horizontal agent company would have to deploy engineers to tune prompts for each of its customers, whereas for us this tuning is easily productized.
This space evolves quickly: models change and new patterns get discovered. We're continuing to refine our benchmarks and share architectural learnings publicly.
If you're working in this domain and have any insights that might help us, or you just want to learn more, please reach out at bugra@hockeystack.com