
What We Got Wrong About Efficiency in LLM Systems

When we set out to build a multi-agent system at HockeyStack to power our marketing and sales workflows, we thought we knew how to make it efficient. Everyone talks about minimizing LLM API calls, so fewer requests should mean faster performance and lower cost. Right?

That assumption held for our first few prototypes. But once we tried to scale into real production use cases, it started to break. The surprises came fast—and the biggest one was this:

To reduce latency and cut costs, we actually had to make more LLM calls, not fewer.

At first glance, that sounds completely backward. But when you build systems that have to operate reliably, quickly, and with clear reasoning, the tradeoffs shift in ways you don’t expect.

The Mega Agent Problem

We began with what I now think of as the “mega agent” model. One large, smart agent with a giant prompt that tries to do everything at once: read a lead’s full profile, interpret CRM notes, extract intent from emails, and make a recommendation, all in a single API call.
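To make the shape of the problem concrete, here is roughly what that pattern looks like in code. This is a simplified illustration, not our production prompt; call_llm is a stand-in for whatever LLM client you use, and the model name is invented.

```python
# Simplified illustration of the "mega agent" shape: one huge prompt, one call.
# call_llm is a placeholder for a real LLM client; the prompt is invented.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real LLM API call (swap in your provider's SDK)."""
    raise NotImplementedError

def mega_agent(profile: str, crm_notes: str, emails: str) -> str:
    prompt = (
        "You are a sales assistant. Given the lead profile, CRM notes, and "
        "recent emails below, rank the lead, summarize intent, and recommend "
        "the next action. Explain your reasoning.\n\n"
        f"PROFILE:\n{profile}\n\nCRM NOTES:\n{crm_notes}\n\nEMAILS:\n{emails}"
    )
    # Everything rides on one large-model call with a big context window.
    return call_llm(model="large-expensive-model", prompt=prompt)
```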

It sounded efficient in theory. In practice, it was anything but.

These prompts were expensive to run. They were slow, because they needed powerful models and large context windows. They were also fragile: hard to debug, and harder to trust. One small change could tank performance, and the whole thing became a black box.

It started to feel like we were building a monolith inside a prompt.

The Shift to Micro-Agents

So we tried something different. Instead of one agent doing everything, we split the work into many smaller, more specialized agents. One agent ranked leads. Another summarized emails. Another scored intent signals. Each had a very narrow job, and each ran on a smaller, cheaper model with a much shorter prompt.
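In code, the shift looks roughly like this. The agent names, prompts, and model are illustrative rather than our actual implementation, and call_llm is the same stand-in used in the sketch above.

```python
# Each micro-agent has one narrow job, a short prompt, and a small model.
# Reuses the call_llm stand-in from the earlier sketch.

def rank_lead(profile: str) -> str:
    return call_llm(
        model="small-cheap-model",
        prompt=f"Rate this lead's fit from 1 to 5. Reply with the number only.\n\n{profile}",
    )

def summarize_email(email: str) -> str:
    return call_llm(
        model="small-cheap-model",
        prompt=f"Summarize this email in two sentences.\n\n{email}",
    )

def score_intent(signals: str) -> str:
    return call_llm(
        model="small-cheap-model",
        prompt=f"Score buying intent from 0 to 100 based on these signals. Number only.\n\n{signals}",
    )
```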

At first, it felt clunky, like we were taking a step backward. But the results told a different story.

Suddenly, our system was faster. Not just a little faster, but an order of magnitude faster. We started finishing inference in milliseconds. Our costs dropped. Debugging became simpler. And because each agent did one thing, we could test, improve, and deploy them independently.

We also gained something we didn’t expect: parallelism. These agents could run at the same time. What used to take half a second now happened in under 100ms, without cutting corners on quality.
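Because the agents don't depend on each other, fanning them out concurrently is straightforward. Here is a minimal sketch using asyncio, assuming an async LLM client; call_llm_async is a placeholder, not a real SDK method.

```python
import asyncio

async def call_llm_async(model: str, prompt: str) -> str:
    """Placeholder for a real async LLM API call."""
    raise NotImplementedError

async def process_lead(profile: str, email: str, signals: str) -> dict:
    # The three agents are independent, so they run concurrently;
    # end-to-end latency is roughly the slowest call, not the sum of all three.
    rank, summary, intent = await asyncio.gather(
        call_llm_async("small-cheap-model", f"Rate this lead's fit from 1 to 5. Number only.\n\n{profile}"),
        call_llm_async("small-cheap-model", f"Summarize this email in two sentences.\n\n{email}"),
        call_llm_async("small-cheap-model", f"Score buying intent from 0 to 100. Number only.\n\n{signals}"),
    )
    return {"rank": rank, "summary": summary, "intent": intent}
```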

Why It Works

Here’s what I’ve come to believe: LLMs are incredible pattern matchers. They excel when the task is simple, the prompt is focused, and the context is tight. When you ask them to do too much, things get fuzzy and expensive.

The more we reduced each agent’s scope, the more reliable and scalable our system became. That’s counterintuitive if you’re coming from a world where “smart” means “do more in one step.” But in AI systems, intelligence comes from composition, not complexity.

What This Means for the Future

At HockeyStack, we’ve reoriented our approach entirely. We no longer ask: Can this be done in one call? Instead, we ask: What’s the smallest useful step here? And then: Can we split thinking from doing?

That mindset has helped us build systems that are fast, modular, and most importantly, under our control. We can fine-tune agents without blowing up everything else. We can run cheap models most of the time, and reserve expensive ones for rare edge cases. And we can keep our latency low without sacrificing reliability.
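The "cheap by default, escalate rarely" part can be as simple as letting the small model flag its own uncertainty. A rough sketch of that pattern follows; the confidence convention and model names are assumptions, and call_llm is the same stand-in as before.

```python
def classify_account(text: str) -> str:
    # Ask the cheap model first and make it say when it can't decide.
    answer = call_llm(
        model="small-cheap-model",
        prompt=(
            "Classify this account as 'hot', 'warm', or 'cold'. "
            "Reply 'unsure' if you cannot tell.\n\n" + text
        ),
    )
    if answer.strip().lower() != "unsure":
        return answer
    # Only the rare ambiguous cases pay for the larger model.
    return call_llm(
        model="large-expensive-model",
        prompt="Classify this account as 'hot', 'warm', or 'cold'.\n\n" + text,
    )
```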

If you're building with LLMs in production, especially for revenue teams, I’d encourage you to try the same shift. Resist the temptation to make your agents smarter. Instead, make them simpler, faster, and more focused. Give them less to do—and let your system do more.

We’re still learning things every day, but this change was a turning point. If you’re in the trenches and interested in working on similar problems, please reach out!