Efficiency Agent: Case Study

We audited every agent we run. Then we fixed them.

AI agent deployments accumulate cost debt quietly. Not through any single mistake, but through dozens of small mismatches between what a job actually needs and what you gave it. We found all of ours. Here is what we changed and how you can do the same.

Key shift: two-stage pipelines now run discovery on lightweight tiers, write to staging JSON, then route synthesis to higher tiers only where judgment quality matters. This removed premium usage from low-value discovery steps.

Current tier map: Haiku for tool orchestration and wrappers, Sonnet for synthesis and operational analysis, Opus for weekly deep audits, and Codex for Instagram generation workflows.

27 cron jobs audited, renamed, and coordinated
43% fewer premium model calls
3 consolidations across the audit stack and wrappers
0 output degradation
The Problem

AI agent deployments accumulate cost debt.

Not because the agents are expensive by default. Because teams stop auditing them after launch.

When you deploy an agent, you make a model choice based on what you know at the time. You pick a capable model, a reasonable schedule, and a scope that makes sense. Then you ship it and move on to the next build.

Six weeks later, that agent is still running on the same model, the same schedule, and the same scope. The model choice that felt conservative at launch now looks like premium compute for a job that runs a script and sends a message. The schedule that made sense for fresh data now overlaps with three other jobs that pull from the same source. A standalone job that was scoped narrowly in week one has since been superseded by another job that does the same thing plus more.

None of these are obvious failures. The agents run. Outputs look reasonable. No alerts fire. The cost just compounds quietly in the background while nobody revisits the original decisions.

The fix starts with an audit. We ran one structured pass across all 27 cron jobs, then applied naming, model tier, and schedule corrections in one coordinated change set.

The Audit Framework

Three questions for every scheduled job.

These questions cut through the technical details and get to the actual cost drivers.

Question 1

Does this job require reasoning, or just execution?

An agent that reads a file, runs a script, and reports the result does not need the same model as an agent that writes original prose, interprets ambiguous signals, or makes multi-step decisions. Execution and reasoning are different tasks. They should not cost the same. A capable smaller model handles execution well. Premium models are worth their cost when the output quality genuinely reflects the model choice. When it does not, you are paying for reasoning you are not using.

Question 2

Does this schedule still make sense given what I know now?

Schedules decay faster than code. A job that runs three times a day made sense when you wanted fresh data throughout the day. It may not make sense when you discover the source data updates once every 24 hours, or when another job runs 30 minutes later and would catch the same signal. Redundant runs are invisible unless you look for them. When you find them, the fix is a one-line schedule change.
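The redundant-run check can be automated. Below is a minimal sketch, assuming a hypothetical job table where each job lists its run hours and the refresh interval of the data source it reads; a run that fires before the source can possibly have new data is flagged as redundant.

```python
# Hypothetical job definitions: run times (hours, 24h clock) and the
# refresh interval of the data source each job reads.
jobs = {
    "signal-monitor": {"runs_at": [6, 12, 18], "source_refresh_hours": 24},
    "daily-digest":   {"runs_at": [7],         "source_refresh_hours": 24},
}

def redundant_runs(runs_at, source_refresh_hours):
    """Return run times that fire before the source can have new data."""
    runs = sorted(runs_at)
    redundant = []
    for prev, curr in zip(runs, runs[1:]):
        if curr - prev < source_refresh_hours:
            redundant.append(curr)
    return redundant

for name, job in jobs.items():
    waste = redundant_runs(job["runs_at"], job["source_refresh_hours"])
    if waste:
        print(f"{name}: runs at {waste} add no fresh data")
```

For the monitoring job above, the noon and evening runs are flagged because the source only refreshes once a day; the fix is deleting two entries from the schedule.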

Question 3

Is this job still doing its own thing, or has it been superseded?

Agent deployments accumulate jobs over time. A narrowly scoped job from month one often gets partially replaced by a broader job in month two without the original being retired. Both run. Both do part of the same thing. Neither is obviously redundant because they were built for different reasons. Asking whether each job still has a unique purpose it owns, or whether that purpose has migrated to something else, surfaces the overlap that code review never will.

What We Found

Five categories of waste. All fixable without rebuilding the stack.

The audit was not a line-by-line code review. It was a functional pass: what does each job actually do, and is that job matched to the right resources?

Most of what we found fell into repeatable patterns: wrapper jobs on expensive tiers, two-stage work forced into one premium run, schedule collisions, stale job names, and standalone jobs with superseded scope.

None of them required rebuilding anything. The fixes were model remaps, cron rename and timing edits, wrapper cleanup, and scope consolidation. Implementation took a few hours and reduced waste on every weekly cycle after deployment.

Pattern 1
Wrapper agents on premium models

Several jobs had a single responsibility: run a script, route status, and report back. No writing, no analysis, no judgment calls. They were running on models built for complex reasoning because that was the default when they were created. Switching to a lighter model for these jobs produced identical outputs. The only thing that changed was the cost per run.

Pattern 2
Overlapping schedules with shared data sources

A monitoring job ran three times daily. The deduplication window it used was seven days, which meant the second and third daily runs were mostly processing results already seen in the morning. Right-sizing cadence by source freshness reduced model calls without reducing signal coverage, because the data source was not refreshing fast enough to justify the third run.

Pattern 3
Standalone jobs with superseded scope

Several overlapping audit workflows had accumulated, each covering part of the same ground. We consolidated them into a coordinated Sunday stack: host security at 5:30 AM, architecture review at 8:45 AM, and zero-trust audit at 9:00 AM. This removed overlap and improved sequencing between baseline checks and deep adversarial analysis.
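Sequencing like this is easy to get wrong when schedules are edited by hand, so it is worth asserting in code. A minimal sketch, using the Sunday stack times from this case study and an assumed 15-minute minimum stagger between job starts:

```python
from datetime import datetime, timedelta

# Sunday audit stack, ordered so baseline checks start before deeper analysis.
SUNDAY_STACK = [
    ("host-security",       "05:30"),
    ("architecture-review", "08:45"),
    ("zero-trust-audit",    "09:00"),
]

MIN_GAP = timedelta(minutes=15)  # assumed minimum stagger between job starts

def validate_stack(stack, min_gap=MIN_GAP):
    """True if jobs start in listed order with at least min_gap between starts."""
    starts = [datetime.strptime(t, "%H:%M") for _, t in stack]
    return all(b - a >= min_gap for a, b in zip(starts, starts[1:]))

print(validate_stack(SUNDAY_STACK))  # True
```

A check like this can run in CI whenever the schedule table changes, catching collisions before they reach the scheduler.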

Model Selection

Match the model tier to the task, not to your comfort level.

We documented a strict model tiering policy and applied it to each cron job. Cost-quality tradeoff rule: pay premium rates only when output quality or risk reduction is measurably better. Reliability comes from workflow design and validation checks, not from forcing every stage onto the most expensive model.

Haiku tier

Orchestration, wrappers, and deterministic transforms

Jobs that execute scripts, call APIs, and return structured output. The work is in the underlying system, not the model. Use a lightweight tier such as Haiku for orchestration and deterministic wrappers. Output quality is identical because there is no model-dependent output to optimize.

Sonnet and Opus tier

Heavy analysis, cross-source synthesis, and exception handling

Jobs that read system outputs and decide whether something is actually wrong. This requires genuine judgment: distinguishing a transient glitch from a persistent failure, reading ambiguous signals, and deciding whether to alert. A lightweight model gets this wrong often enough to matter, while a top-tier model is more than the task requires. Mid-tier Sonnet is the right match for judgment-heavy operational analysis, with Opus reserved for the weekly deep audits.

Codex generation tier

Instagram concept and asset generation

Mapped to Codex for Instagram concept generation, caption draft variants, and prompt-ready visual direction where generation quality directly affects downstream engagement. This tier is isolated to creative generation so analysis and orchestration tasks do not inherit creative-model costs.

The practical rule: reserve premium tiers for output-sensitive work, and keep orchestration on Haiku by default.
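A tiering policy only holds if drift is visible. The sketch below shows one way to encode the tier map described above as a lookup table and flag jobs whose assigned model no longer matches the policy for their task kind; the job records and field names are hypothetical.

```python
# Tier policy mirroring the mapping described above (task kind -> model tier).
TIER_POLICY = {
    "orchestration": "haiku",
    "wrapper":       "haiku",
    "synthesis":     "sonnet",
    "ops-analysis":  "sonnet",
    "deep-audit":    "opus",
    "ig-generation": "codex",
}

def check_tier_drift(jobs, policy=TIER_POLICY):
    """Return (name, assigned, expected) for jobs that violate the policy."""
    drift = []
    for job in jobs:
        expected = policy.get(job["kind"])
        if expected and job["model"] != expected:
            drift.append((job["name"], job["model"], expected))
    return drift

# Example: a wrapper job left on a premium model after launch.
jobs = [
    {"name": "status-reporter", "kind": "wrapper",    "model": "opus"},
    {"name": "weekly-audit",    "kind": "deep-audit", "model": "opus"},
]
print(check_tier_drift(jobs))  # [('status-reporter', 'opus', 'haiku')]
```

Running this against scheduler metadata on every deploy turns tier drift from a quiet cost leak into a visible diff.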

The Result

Measured cost reduction with higher schedule reliability.

The audit and rollout were completed in one cycle. We redesigned the internal dashboard and schedule table, then applied all cron, tiering, and dependency updates with staged validation.

We also moved two premium workflows to a two-stage pipeline: Stage 1 on Haiku for extraction, normalization, and candidate filtering, then Stage 2 on Sonnet or Opus only for shortlisted items requiring deep reasoning. This reduced premium calls while preserving decision quality.
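The two-stage shape can be sketched in a few lines. This is an illustrative skeleton, not the production pipeline: `call_model` is a stand-in stub, and the shortlist threshold and field names are assumptions.

```python
import json
from pathlib import Path

def call_model(tier, prompt):
    """Stand-in stub; a real deployment calls its model API here."""
    return f"[{tier}] {prompt}"

def stage1_discover(items, staging_path):
    """Stage 1: cheap tier summarizes and filters candidates into staging JSON."""
    candidates = []
    for item in items:
        summary = call_model("haiku", f"summarize: {item['text']}")
        if item.get("score", 0) >= 0.7:  # assumed shortlist threshold
            candidates.append({"id": item["id"], "summary": summary})
    Path(staging_path).write_text(json.dumps(candidates))
    return candidates

def stage2_synthesize(staging_path):
    """Stage 2: premium tier runs only on the shortlisted items."""
    candidates = json.loads(Path(staging_path).read_text())
    return [call_model("sonnet", f"analyze: {c['summary']}") for c in candidates]

items = [
    {"id": 1, "text": "routine status ping",      "score": 0.2},
    {"id": 2, "text": "ambiguous failure signal", "score": 0.9},
]
shortlist = stage1_discover(items, "staging.json")
results = stage2_synthesize("staging.json")
print(len(items), "items in, premium calls:", len(results))
```

The staging file is the seam between the two tiers: every item gets a cheap pass, but only shortlisted items ever generate a premium call.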

Operationally, wrapper mismatches were removed, Sunday audit tasks were stacked with staggered timing, and job names were normalized so on-call operators can map logs to schedules without lookup friction.

43% reduction in unnecessary premium calls per week
31% fewer high-cost overlap runs per week
27 cron jobs moved to coordinated windows
2-stage pipeline adopted to reduce premium waste

What It Means for You

Every multi-agent deployment benefits from ongoing efficiency audits.

These findings are not specific to one stack. Any deployment with dozens of scheduled jobs will drift unless model tiers, names, and timing are reviewed together as one system.

We include an efficiency audit in every build and rerun it as schedules, models, and dependencies evolve. Model pricing, data freshness, and cron dependencies change as deployments mature. What was right at launch is often not right at month three.

Deployment Audit

A structured review of each scheduled job: model tier, wrapper depth, name-to-scope match, overlap windows, and dependency timing. Delivered with exact cron and routing changes.

Model Tier Design

Concrete policy mapping: Haiku for orchestration, Sonnet and Opus for heavy analysis, Codex for IG generation. Enforced through scheduler metadata so tier drift is visible.

Schedule Optimization

Reviewing run frequency against actual data freshness and job dependencies. Most deployments have schedule and model changes that reduce premium usage significantly without output loss.