Efficiency Agent: Case Study

We audited every agent we run. Then we fixed them.

AI agent deployments accumulate cost debt quietly. Not through any single mistake, but through dozens of small mismatches between what a job actually needs and what you gave it. We found all of ours. Here is what we changed and how you can do the same.

21 agents audited
~40% fewer premium model calls
1 redundant job eliminated
0 output degradation
The Problem

AI agent deployments accumulate cost debt.

Not because the agents are expensive by default. Because teams stop auditing them after launch.

When you deploy an agent, you make a model choice based on what you know at the time. You pick a capable model, a reasonable schedule, and a scope that makes sense. Then you ship it and move on to the next build.

Six weeks later, that agent is still running on the same model, the same schedule, and the same scope. The model choice that felt conservative at launch now looks like premium compute for a job that runs a script and sends a message. The schedule that made sense for fresh data now overlaps with three other jobs that pull from the same source. A standalone job that was scoped narrowly in week one has since been superseded by another job that does the same thing plus more.

None of these are obvious failures. The agents run. Outputs look reasonable. No alerts fire. The cost just compounds quietly in the background while nobody revisits the original decisions.

The fix is not technical. It is an audit. One structured pass over every scheduled job, asking three questions for each one.

The Audit Framework

Three questions for every scheduled agent.

These questions cut through the technical details and get to the actual cost drivers.

Question 1

Does this job require reasoning, or just execution?

An agent that reads a file, runs a script, and reports the result does not need the same model as an agent that writes original prose, interprets ambiguous signals, or makes multi-step decisions. Execution and reasoning are different tasks. They should not cost the same. A capable smaller model handles execution well. Premium models are worth their cost when the output quality genuinely reflects the model choice. When it does not, you are paying for reasoning you are not using.

Question 2

Does this schedule still make sense given what I know now?

Schedules decay faster than code. A job that runs three times a day made sense when you wanted fresh data throughout the day. It may not make sense when you discover the source data updates once every 24 hours, or when another job runs 30 minutes later and would catch the same signal. Redundant runs are invisible unless you look for them. When you find them, the fix is a one-line schedule change.
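The check above can be sketched as one pass over job metadata. The job names, run counts, and refresh intervals below are illustrative placeholders, not figures from the actual deployment:

```python
# Hypothetical job records: how often each job runs versus how often
# its upstream data source actually refreshes. All values are made up.
jobs = [
    {"name": "signal-monitor", "runs_per_day": 3, "source_refresh_hours": 24},
    {"name": "inbox-triage",   "runs_per_day": 2, "source_refresh_hours": 12},
]

def redundant_runs(job):
    """Runs per day beyond what the source's refresh rate can justify."""
    useful = max(1, 24 // job["source_refresh_hours"])
    return max(0, job["runs_per_day"] - useful)

for job in jobs:
    extra = redundant_runs(job)
    if extra:
        print(f"{job['name']}: {extra} redundant run(s) per day")
```

A real audit would also weigh downstream dependencies, but even this crude ratio surfaces the obvious mismatches.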

Question 3

Is this job still doing its own thing, or has it been superseded?

Agent deployments accumulate jobs over time. A narrowly scoped job from month one often gets partially replaced by a broader job in month two without the original being retired. Both run. Both do part of the same thing. Neither is obviously redundant because they were built for different reasons. Asking whether each job still has a unique purpose it owns, or whether that purpose has migrated to something else, surfaces the overlap that code review never will.

What We Found

Three categories of waste. All fixable in an afternoon.

The audit was not a line-by-line code review. It was a functional pass: what does each job actually do, and is that job matched to the right resources?

Most of what we found fell into one of three patterns. They are not unique to our deployment. They appear in every multi-agent system that has been running for more than a few weeks without a structured review.

None of them required rebuilding anything. The fixes were model swaps, schedule changes, and a single job consolidation. Total implementation time: a few hours. The savings compound every week going forward.

Pattern 1
Wrapper agents on premium models

Several jobs had a single responsibility: run a script and report back. No writing, no analysis, no judgment calls. They were running on models built for complex reasoning because that was the default when they were created. Switching to a lighter model for these jobs produced identical outputs. The only thing that changed was the cost per run.

Pattern 2
Overlapping schedules with shared data sources

A monitoring job ran three times daily. The deduplication window it used was seven days, which meant the second and third daily runs were mostly processing results already seen in the morning. Cutting to two runs per day reduced model calls without reducing signal coverage, because the data source was not refreshing fast enough to justify the third run.
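A minimal sketch of why a long deduplication window makes later same-day runs redundant. The window logic and item IDs here are hypothetical, not the actual monitoring job's code:

```python
from datetime import datetime, timedelta

# Seven-day deduplication window: an item only counts as new if it
# was last seen more than WINDOW ago. Item IDs are illustrative.
SEEN: dict = {}
WINDOW = timedelta(days=7)

def is_new(item_id, now):
    """True only if this item has not been seen inside the window."""
    last = SEEN.get(item_id)
    SEEN[item_id] = now
    return last is None or now - last > WINDOW

now = datetime(2025, 1, 6, 8, 0)
morning = [is_new(i, now) for i in ("a", "b", "c")]
# The midday run sees the same items again; every one dedupes away.
midday = [is_new(i, now + timedelta(hours=6)) for i in ("a", "b", "c")]
```

With a source that refreshes once a day, every run after the first mostly pays for model calls that return nothing new.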

Pattern 3
Standalone jobs with superseded scope

A weekly audit job had been running independently for weeks. Over time, the nightly maintenance job had expanded its scope to cover everything the weekly job did, plus more. Both jobs were running. The weekly job was not adding anything the nightly job did not already do on its Friday run. Removing the weekly job simplified the schedule without removing any functionality.

Model Selection

Match the model to the task, not to your comfort level.

The instinct to use a premium model everywhere is understandable. You want reliable outputs. But a more capable model does not improve reliability when the task is not model-dependent. A job that runs a command and formats the result will not improve with a more powerful model. It will just cost more.

Lightweight

Command runners and report formatters

Jobs that execute scripts, call APIs, and return structured output. The work is in the underlying system, not the model. Use the cheapest model available. Output quality is identical because there is no model-dependent output to optimize.

Mid-tier

Health monitoring and anomaly detection

Jobs that read system outputs and decide whether something is actually wrong. This requires genuine judgment: distinguishing a transient glitch from a persistent failure, reading ambiguous signals, and deciding whether to alert. A lightweight model gets this wrong often enough to matter; a premium model is more than the task requires. Mid-tier is the right match.

Premium

Writing, analysis, and external data interpretation

Jobs that produce original prose, synthesize multiple sources, or reason about content from the open web where injection resistance matters. The quality of the output directly reflects the quality of the model. This is where premium compute earns its cost.

Most multi-agent deployments have more jobs in the first tier than the third, but pay third-tier prices across the board.
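The tier matching above can be sketched as a simple lookup. The task types, tier labels, and model names below are placeholders for illustration, not real model identifiers:

```python
# Map each job's task type to a tier, then the tier to a model.
# All names here are hypothetical.
TIER_FOR_TASK = {
    "execute": "lightweight",  # run a script, format a report
    "monitor": "mid",          # judge ambiguous signals
    "write":   "premium",      # original prose, synthesis
}

MODEL_FOR_TIER = {
    "lightweight": "small-model",
    "mid":         "mid-model",
    "premium":     "large-model",
}

def pick_model(task_type):
    """Resolve a job's task type to a model. Unknown task types fall
    back to mid-tier rather than defaulting to premium."""
    tier = TIER_FOR_TASK.get(task_type, "mid")
    return MODEL_FOR_TIER[tier]
```

The useful property of making the mapping explicit is the default: when nobody has classified a job, it lands on mid-tier instead of silently inheriting premium.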

The Result

Same outputs. Lower cost. Cleaner architecture.

The audit took a few hours. The changes took another hour to apply. Everything has been running since with no output quality issues and no new alerts.

The model savings are not one-time. Every week that the deployment runs, the audit pays for itself again. Jobs that were running on premium compute three times a day, seven days a week, now run on the right model at the right frequency. The math is straightforward.
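To see how that compounds, here is the weekly arithmetic with made-up per-run prices. The dollar figures are assumptions for illustration, not actual model pricing:

```python
# Assumed per-run costs: not real pricing, chosen only to show the shape.
premium_cost = 0.50  # $/run on a premium model
light_cost = 0.05    # $/run on a lighter model

runs_before = 3 * 7  # three premium runs a day, seven days a week
runs_after = 2 * 7   # two runs a day on the right-sized model

before = runs_before * premium_cost
after = runs_after * light_cost
print(f"${before:.2f}/week -> ${after:.2f}/week")
```

Whatever the real prices are, the structure is the same: the savings are per run, so they recur every week the schedule fires.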

The less visible benefit is architectural. Fewer jobs. Cleaner separation of concerns. A system that is easier to reason about when something eventually does go wrong.

~40% reduction in premium model calls per week
14 high-cost runs eliminated per week
1 redundant job removed from the schedule
0 outputs changed or degraded

What It Means for You

Every multi-agent deployment benefits from this audit.

The three patterns we found are not specific to our stack, our tools, or our industry. They appear in any deployment that has been running for more than a few weeks without a structured review. The longer it runs without one, the more they compound.

We include an efficiency audit in every build we deliver and recommend running one again after 90 days. Model costs, data freshness characteristics, and job interdependencies all change as a deployment matures. What was right at launch is often not right at month three.

Deployment Audit

A structured review of every scheduled agent: model selection, schedule overlap, scope boundaries, and job dependencies. Delivered as a prioritized list of changes with expected impact for each.

Model Tier Design

Mapping every job in your deployment to the right model tier based on what it actually does. Built into every new system we design so the first audit is a tune-up, not a full review.

Schedule Optimization

Reviewing run frequency against actual data freshness and job dependencies. Most deployments have two or three schedule changes that reduce compute by 20 to 30 percent without touching a single output.