Cost Per Task, Not Per Token: Budgeting for AI Coding Agents

The Number on Your Bill Is the Wrong Number

Most AI coding tools report cost in tokens. Say you used 2.3 million input tokens and 480,000 output tokens this month. Multiply each count by its per-model rate, add the two, and there's your bill. It's precise, it's auditable, and it's almost useless for making decisions.
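
As a worked example of that arithmetic, here is a minimal sketch in Python. The two rates are made-up placeholders, not any provider's actual pricing, and the function name is ours. The point is that the calculation is trivial and that nothing in it knows what a task is.

```python
from decimal import Decimal

# Hypothetical rates in USD per million tokens. Real rates vary by
# provider and model; these numbers are assumptions for illustration.
INPUT_RATE_PER_MTOK = Decimal("3.00")
OUTPUT_RATE_PER_MTOK = Decimal("15.00")

MILLION = Decimal(1_000_000)


def token_bill(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the bill in USD for the given token counts, rounded to cents."""
    input_cost = Decimal(input_tokens) / MILLION * INPUT_RATE_PER_MTOK
    output_cost = Decimal(output_tokens) / MILLION * OUTPUT_RATE_PER_MTOK
    return (input_cost + output_cost).quantize(Decimal("0.01"))


# The monthly totals from the paragraph above.
monthly_bill = token_bill(2_300_000, 480_000)
print(f"Monthly bill: ${monthly_bill}")
```

The inputs are two integers. There is no field for which task the tokens belonged to or whether that task succeeded.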

The reason is simple: nobody thinks in tokens. You think in tasks. "Refactor the auth module." "Write tests for the payments service." "Debug the flaky CI job." Those are the units of work you care about, the units you'd assign to a human engineer, and the units you'd put a dollar figure against if someone asked whether the agent was worth it. But the token meter has no idea those tasks exist. It just counts tokens flowing through an API, with no concept of which task they belonged to or whether that task succeeded.

So you end up with a bill that tells you exactly how much you spent and nothing about what you got. That's a fine arrangement when you're running a handful of experiments. It becomes a real problem the moment AI agents become a routine part of how your team ships code.

Why Per-Token Accounting Falls Apart at Scale

When you're running dozens of agent tasks a day across multiple projects, per-token billing fails you in three specific ways.

You can't attribute cost to work

A monthly token total is an undifferentiated blob. It mixes the cheap task that renamed a variable with the expensive task that spent forty minutes spelunking through an unfamiliar codebase before giving up. Both contributed tokens to the same meter. When the bill comes in higher than expected, you have no way to answer the only question that matters: which tasks drove this?

Without attribution, your only cost lever is a blunt one — use the agent less, or turn it off. You can't optimize what you can't see.

You can't tell expensive-and-worth-it from expensive-and-wasted

A task that costs three dollars and ships a working feature is a bargain. A task that costs three dollars, runs in circles, and produces a patch you throw away is pure waste. On a token meter, they look identical. Both are "tokens consumed."

This is the distinction that actually matters for budgeting, and it's exactly the one tokens erase. The cost of an agent task is only meaningful in relation to its outcome. Accounting that ignores outcomes can't help you decide where the spend is justified.

You can't compare strategies

Suppose you suspect that routing a certain class of task — say, mechanical refactors — to a cheaper, faster model would save money without hurting quality. To test that, you need per-task cost data before and after the change. A token total gives you a single aggregate number that moves for a dozen reasons at once. You can't isolate the effect of one routing decision, so you can't learn from it.
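
To make the comparison concrete, here is a rough sketch of what per-task data lets you compute for one class of task on two models. The record fields, the model names "large-model" and "small-model", and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class TaskRecord:
    """One finished task. All fields are illustrative assumptions."""
    task_type: str
    model: str
    cost_usd: float
    accepted: bool


def summarize(records: list[TaskRecord], task_type: str, model: str) -> Optional[dict]:
    """Summarize cost and acceptance for one task type on one model."""
    rows = [r for r in records if r.task_type == task_type and r.model == model]
    if not rows:
        return None
    return {
        "tasks": len(rows),
        "mean_cost_usd": round(mean(r.cost_usd for r in rows), 2),
        "acceptance_rate": sum(r.accepted for r in rows) / len(rows),
    }


# Made-up sample data: refactors before and after a routing change.
records = [
    TaskRecord("refactor", "large-model", 0.80, True),
    TaskRecord("refactor", "large-model", 1.20, True),
    TaskRecord("refactor", "small-model", 0.20, True),
    TaskRecord("refactor", "small-model", 0.30, True),
    TaskRecord("refactor", "small-model", 0.40, False),
    TaskRecord("debug", "large-model", 3.00, True),
]

before = summarize(records, "refactor", "large-model")
after = summarize(records, "refactor", "small-model")
print(before)
print(after)
```

With records like these, the effect of the routing change is isolated to the task class it touched, and the acceptance rate shows whether the cheaper model cost you quality. An aggregate token total can show neither.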

What Cost-Per-Task Accounting Actually Requires

"Cost per task" sounds obvious, but most tooling can't produce it, because tokens flow through a stateless API that doesn't know what a task is. Getting real per-task economics takes a few structural things.

A task has to be a first-class object

Before you can attribute cost to a task, the system has to know tasks exist. That means the unit of work — the mission, the assignment, the sub-task in a plan — has to be a real object the system tracks from start to finish, not just an ephemeral prompt. Every model call, every retry, every tool invocation gets tagged to the task that triggered it.
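
One way to picture this is a small ledger where a model call cannot be recorded without naming the task it belongs to. This is an illustrative sketch only: the class and field names are invented here and do not describe any particular tool's internals.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCall:
    """A single model invocation. Field names are hypothetical."""
    model: str
    input_tokens: int
    output_tokens: int
    cost_cents: int
    is_retry: bool = False


@dataclass
class Task:
    """A unit of work that is tracked from start to finish."""
    task_id: str
    description: str
    calls: list[ModelCall] = field(default_factory=list)

    @property
    def total_cost_cents(self) -> int:
        return sum(call.cost_cents for call in self.calls)


class TaskLedger:
    """Keeps every call attached to the task that triggered it."""

    def __init__(self) -> None:
        self._tasks: dict[str, Task] = {}

    def open_task(self, task_id: str, description: str) -> Task:
        task = Task(task_id, description)
        self._tasks[task_id] = task
        return task

    def record_call(self, task_id: str, call: ModelCall) -> None:
        # Refuse spend that is not tagged to a known task.
        if task_id not in self._tasks:
            raise KeyError(f"unknown task: {task_id}")
        self._tasks[task_id].calls.append(call)

    def get(self, task_id: str) -> Task:
        return self._tasks[task_id]


ledger = TaskLedger()
ledger.open_task("t-1", "Refactor the auth module")
ledger.record_call("t-1", ModelCall("large-model", 40_000, 6_000, 120))
ledger.record_call("t-1", ModelCall("large-model", 12_000, 2_000, 45, is_retry=True))
print(ledger.get("t-1").total_cost_cents)
```

The design choice that matters is the refusal in record_call: if untagged calls are allowed, the unattributed blob comes straight back.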

Every task produces a receipt

When a task finishes, it should hand you a receipt: which model handled it, how many tokens it used, what it cost, and what it produced. Not a line item buried in a monthly export — a receipt attached to the task itself, visible the moment the work is done.
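
A receipt with those four parts could look something like the following sketch. The field names, the layout of the rendered text, and the sample values are assumptions made for illustration, not a documented format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    """What a finished task hands back. All fields are hypothetical."""
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    produced: str  # e.g. a patch summary or branch name

    def render(self) -> str:
        """Format the receipt as plain text for display next to the task."""
        return "\n".join([
            f"Task:     {self.task_id}",
            f"Model:    {self.model}",
            f"Tokens:   {self.input_tokens:,} in / {self.output_tokens:,} out",
            f"Cost:     ${self.cost_usd:.2f}",
            f"Produced: {self.produced}",
        ])


receipt = Receipt(
    task_id="t-42",
    model="large-model",
    input_tokens=410_000,
    output_tokens=38_000,
    cost_usd=3.0,
    produced="patch touching 4 files in payments/",
)
print(receipt.render())
```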

This is the artifact that turns cost from a mystery into a decision. When you can see that one recurring task eats 40% of your daily agent budget, you can do something concrete about it: route it to a smaller model, tighten its prompt, or split it into cheaper sub-steps.

Cost has to sit next to outcome

A receipt that shows cost but not result is only half the story. The receipt has to live alongside the task's outcome — did it succeed, did you accept the output, did it need a retry — so you can see cost and value in the same place. That's what lets you separate the three-dollar task that shipped from the three-dollar task that spun out.
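
Here is a small sketch of why the pairing matters, using the two three-dollar tasks from earlier. The outcome categories and field names are assumptions, and a real system would likely track more states than accepted and rejected.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass(frozen=True)
class TaskResult:
    """Cost and outcome recorded together. Fields are hypothetical."""
    task_id: str
    cost_cents: int
    outcome: Outcome
    retries: int = 0


def spend_summary(results: list[TaskResult]) -> dict:
    """Split total spend into useful and wasted, based on outcome."""
    total = sum(r.cost_cents for r in results)
    accepted = [r for r in results if r.outcome is Outcome.ACCEPTED]
    useful = sum(r.cost_cents for r in accepted)
    return {
        "total_cents": total,
        "wasted_cents": total - useful,
        # Total spend divided by the work you actually kept.
        "cents_per_accepted_task": total / len(accepted) if accepted else None,
    }


results = [
    TaskResult("shipped-feature", 300, Outcome.ACCEPTED),
    TaskResult("ran-in-circles", 300, Outcome.REJECTED, retries=4),
]
summary = spend_summary(results)
print(summary)
```

On a token meter these two tasks are indistinguishable. With the outcome attached, half the spend shows up as waste and the effective cost of the shipped feature is six dollars, not three.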

Attribution has to roll up

Per-task receipts are the foundation, but you also need them to aggregate the way your work is organized: by project, by type of task, by model, by day. A single task receipt answers "what did this cost?" Roll-ups answer "where is my budget actually going?" — which is the question that drives real optimization.
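
A roll-up is just a group-and-sum over receipts. The sketch below assumes each receipt carries the four dimensions named above. The field names, model names, and sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Receipt:
    """A per-task receipt with the dimensions to aggregate on (assumed)."""
    project: str
    task_type: str
    model: str
    day: date
    cost_cents: int


def roll_up(receipts: list[Receipt], dimension: str) -> dict:
    """Sum cost by one receipt field, largest spender first."""
    totals: dict = defaultdict(int)
    for receipt in receipts:
        totals[getattr(receipt, dimension)] += receipt.cost_cents
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


def budget_share(receipts: list[Receipt], dimension: str) -> dict:
    """Express each group's spend as a fraction of total spend."""
    totals = roll_up(receipts, dimension)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: cents / grand_total for key, cents in totals.items()}


receipts = [
    Receipt("api", "refactor", "model-a", date(2025, 1, 6), 400),
    Receipt("web", "tests", "model-b", date(2025, 1, 6), 350),
    Receipt("web", "refactor", "model-a", date(2025, 1, 7), 250),
]
print(roll_up(receipts, "project"))
print(roll_up(receipts, "task_type"))
print(budget_share(receipts, "project"))
```

The same three receipts answer different questions depending on the dimension you group by, which is why the per-task record has to come first.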

How Medley Does It

Medley is built so that the task, not the token, is the unit of accounting.

When you hand Medley a mission, it breaks it into a visible plan and runs each piece as a tracked task across whichever AI runtime fits — Claude Code, Codex, Gemini, Cursor, or Kimi. Because every task is a real object, every model call made on its behalf is attributed to it automatically.

When a task completes, you get a cost receipt for it: what it ran, what it cost, and what it produced. Cost sits next to outcome, so you're never guessing whether a given spend was worth it. And because Medley runs locally and routes across multiple runtimes, you can see how the same class of work costs out on different models — which is exactly the data you need to make routing decisions that lower your bill without lowering your quality.

The result is that "what are these agents costing me?" stops being a question you answer with a shrug and a monthly total. It becomes a question you answer task by task, with receipts.

Tokens Are an Implementation Detail. Tasks Are the Work.

There's nothing wrong with tokens as a unit of billing for an API provider: they're a clean, precise proxy for compute consumed. The mistake is letting that be the unit you reason about. Tokens are an implementation detail of how model providers charge for inference. Tasks are the actual work you're trying to get done.

As agent workloads grow from a few experiments to a core part of how your team ships, the gap between those two units stops being academic. You can't budget for work you can't attribute, you can't justify spend you can't tie to outcomes, and you can't optimize routing you can't measure. Cost-per-task accounting closes that gap.

If you want to see what your AI agents actually cost — per task, with receipts, next to what they produced — that's how Medley is built. It's free, local-first, and routes your work across every major coding agent from one place. Start at medley.sh.