The question that keeps getting harder
Eighteen months ago, most companies were still in the exploration phase with AI. Budgets were experimental. Nobody expected precise ROI numbers on a proof of concept. The conversation was about potential, not returns.
That grace period is over. In the past few months, I've watched the tone in boardrooms shift noticeably. CFOs want to know what the AI spend is actually buying. They're seeing the cloud bills, the model contracts, the headcount, and they want a defensible story about value. And most engineering teams don't have one.
The problem isn't that AI projects don't create value. Many of them create a lot. The problem is that the value is often diffuse, delayed, or tangled up with other changes — and without a deliberate measurement framework, it becomes impossible to point at and defend.
Why AI ROI is hard to measure
Traditional ROI math is simple: cost in, revenue out, compute the ratio. AI projects break this in a few specific ways:
- Costs are spread across many budgets: infrastructure, model APIs, data labeling, evaluation, engineering time, platform licenses. The real cost of an AI feature is rarely captured in a single line item.
- Value often looks like avoided cost: a support ticket that didn't need a human, a mistake that didn't happen. These don't show up as revenue, but they're real.
- Quality improvements compound slowly: a 5% improvement in output quality rarely produces a 5% jump in metrics. The benefit shows up over months as user retention, referrals, and reduced friction — all of which are hard to attribute.
- Counterfactuals are murky: knowing what would have happened without the AI system requires careful experimental design that most teams skip.
A framework that works has to handle all four of these honestly.
The four-box framework
Here's the structure I use with teams that need to justify their AI spend to people with financial responsibility. It has four categories, each measured differently.
1. Direct cost reduction
The cleanest category: a task that used to require a human is now handled by an AI system at lower unit cost. Examples: tier-1 support automation, document processing, basic code generation.
Measurement is simple in principle: unit volume times the cost differential. The catch is getting the true unit cost of both the old and new process, including all the wrap-around costs — QA, exception handling, retraining, infrastructure. These often dwarf the obvious model API cost.
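To make that concrete, here's a rough sketch of the arithmetic in Python. Every figure and cost category is a placeholder; the point is that the wrap-around costs belong inside the new unit cost, not in a footnote.

```python
# Hypothetical sketch: direct cost reduction for an automated support tier.
# All figures are placeholders; the structure is what matters.

monthly_volume = 40_000          # tasks handled per month

# Old process: fully loaded human cost per task
old_unit_cost = 4.10             # salary, tooling, management overhead

# New process: model cost plus everything wrapped around it
model_api_cost = 0.18            # per-task inference spend
qa_review_cost = 0.35            # sampled human review, amortized per task
exception_cost = 0.60            # escalations and rework, amortized per task
infra_cost = 0.12                # hosting, logging, evaluation infrastructure

new_unit_cost = model_api_cost + qa_review_cost + exception_cost + infra_cost

monthly_saving = monthly_volume * (old_unit_cost - new_unit_cost)
print(f"New unit cost: ${new_unit_cost:.2f}")
print(f"Monthly direct cost reduction: ${monthly_saving:,.0f}")
```

Run with real numbers, the new unit cost is often two or three times the raw API spend, which is exactly why this calculation is worth writing down.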
2. Revenue enablement
Features that drive revenue that wouldn't exist otherwise. Faster onboarding that lifts conversion. Personalization that increases average order value. New capabilities that unlock new customer segments.
Measurement here requires experimental discipline: A/B tests, holdout groups, clear attribution. Without them, you're guessing — and your guesses will be questioned. With them, you can produce numbers a CFO will trust.
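Here's a minimal sketch of what that attribution might look like: a treatment group against a holdout, with a basic two-proportion z-test to check that the lift isn't noise. All counts and the revenue-per-conversion figure are hypothetical.

```python
import math

# Hypothetical sketch: attributing revenue lift to an AI onboarding feature
# using a holdout group. Counts and dollar values are placeholders.

treatment_users, treatment_conversions = 18_000, 2_070   # saw the AI feature
holdout_users, holdout_conversions = 18_000, 1_850       # did not

p_t = treatment_conversions / treatment_users
p_h = holdout_conversions / holdout_users

# Pooled two-proportion z-test: is the lift distinguishable from noise?
p_pool = (treatment_conversions + holdout_conversions) / (treatment_users + holdout_users)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / treatment_users + 1 / holdout_users))
z = (p_t - p_h) / se

revenue_per_conversion = 90.0    # assumed average first-order value
incremental_conversions = (p_t - p_h) * treatment_users
attributable_revenue = incremental_conversions * revenue_per_conversion

print(f"Conversion: {p_t:.1%} vs {p_h:.1%} (z = {z:.2f})")
print(f"Attributable revenue over the test period: ${attributable_revenue:,.0f}")
```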
3. Quality and risk reduction
AI-driven improvements in quality, compliance, or risk management. Fewer errors in financial reports. Better detection of fraud. More consistent policy enforcement.
These are real and often valuable, but they're the hardest category to monetize. The trick is to translate them into terms finance already understands: expected loss reduction, incident rate, time-to-resolution. Connect the quality improvement to a dollar figure the business already tracks.
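A rough sketch of that translation, using expected loss on error-prone reports as the finance-legible figure. The error rates and cost per error below are placeholders; in practice they should come from numbers finance already tracks.

```python
# Hypothetical sketch: translating a quality improvement into expected loss.
# All rates and costs are placeholders, not benchmarks.

reports_per_year = 1_200

baseline_error_rate = 0.030      # errors per report before the AI check
improved_error_rate = 0.012      # observed rate with the AI check in place

avg_cost_per_error = 5_500.0     # rework, restatement, and remediation cost

expected_loss_before = reports_per_year * baseline_error_rate * avg_cost_per_error
expected_loss_after = reports_per_year * improved_error_rate * avg_cost_per_error

print(f"Expected annual loss reduction: ${expected_loss_before - expected_loss_after:,.0f}")
```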
4. Team velocity
Faster development, shorter research cycles, more experiments run per quarter. This is where internal AI tools (copilots, code review, knowledge retrieval) create their value.
Velocity gains are easy to claim and hard to prove. The honest version: measure specific cycle times before and after adoption, watch them for several months, and avoid single-point comparisons. Velocity is also the category most subject to confounding — people feel more productive when they have new tools, even when the data doesn't fully agree.
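One way to keep the comparison honest is to compare cycle-time distributions rather than single points. A small sketch, with made-up samples standing in for a few months of data from your tracker:

```python
import statistics

# Hypothetical sketch: comparing PR cycle times (hours from open to merge)
# before and after adopting an internal copilot. Lists are placeholder data.

before_hours = [30, 42, 27, 55, 38, 61, 33, 47, 29, 40]   # pre-adoption sample
after_hours = [24, 35, 22, 48, 30, 51, 26, 37, 21, 33]    # post-adoption sample

# Medians are less sensitive to a few outliers than means or single points.
median_before = statistics.median(before_hours)
median_after = statistics.median(after_hours)

improvement = (median_before - median_after) / median_before
print(f"Median cycle time: {median_before:.0f}h -> {median_after:.0f}h "
      f"({improvement:.0%} faster)")
```

Even then, treat the result as suggestive rather than conclusive until you've ruled out the other changes that landed in the same window.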
Leading indicators that buy you time
Most AI projects can't produce clean ROI numbers for six to twelve months. Executives don't want to wait that long to know if something is working. Leading indicators bridge the gap:
- Usage and engagement: are users actually using the feature? Repeatedly?
- Task success rate: when they use it, does it work? This is where your eval suite earns its keep.
- Deflection and escalation rates: for assistance features, how often does the AI handle the task without handoff?
- Quality metrics from production logs: not just whether the model ran, but whether users acted on its output.
A project that has strong leading indicators at month three has a much better chance of showing real ROI at month nine. A project with weak leading indicators probably won't get there, and the sooner you know that, the better.
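None of these indicators require heavy tooling to compute. Here's a rough sketch of pulling them from production logs; the event schema and field names are assumed, not taken from any particular system.

```python
# Hypothetical sketch: computing leading indicators from production logs.
# The event schema (field names, values) is assumed for illustration.

events = [
    {"user": "u1", "outcome": "resolved", "escalated": False, "acted_on": True},
    {"user": "u2", "outcome": "resolved", "escalated": False, "acted_on": False},
    {"user": "u1", "outcome": "failed",   "escalated": True,  "acted_on": False},
    {"user": "u3", "outcome": "resolved", "escalated": False, "acted_on": True},
]

total = len(events)

# Engagement: how many users came back more than once?
repeat_users = sum(1 for u in {e["user"] for e in events}
                   if sum(e2["user"] == u for e2 in events) > 1)

task_success_rate = sum(e["outcome"] == "resolved" for e in events) / total
deflection_rate = sum(not e["escalated"] for e in events) / total
acted_on_rate = sum(e["acted_on"] for e in events) / total

print(f"Repeat users: {repeat_users}")
print(f"Task success rate: {task_success_rate:.0%}")
print(f"Deflection rate: {deflection_rate:.0%}")
print(f"Output acted-on rate: {acted_on_rate:.0%}")
```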
The communication layer
Measurement is only half the job. Communicating the results to non-technical stakeholders is the other half, and it's where most teams stumble.
A few principles that work:
- Translate model metrics into business terms. Nobody outside the team cares about pass@1 or BLEU scores. They care about completed tasks, reduced handle time, and customer outcomes.
- Show both cost and value, always. A cost story without value feels like a bill. A value story without cost feels like hand-waving. The two together make a budget conversation.
- Acknowledge uncertainty honestly. Numbers with false precision erode credibility faster than ranges that reflect reality.
- Compare to a baseline the audience understands. "Our agent resolved 42% of tickets" means nothing without "the previous approach resolved 12%."
The strongest ROI stories are the ones where the engineering team and the finance team agree on the numbers before either presents them. If your CFO is surprised by your ROI math, you haven't socialized it enough.
What this looks like in practice
Teams that measure ROI well treat it as an ongoing discipline, not a pre-launch slide deck. They instrument projects from day one. They pick metrics carefully. They review the numbers monthly. They kill projects that aren't working, and they scale the ones that are — with data to back both decisions.
That's the habit that makes AI spend defensible. Not better benchmarks, not better demos, not more sophisticated models. Just disciplined measurement, honestly communicated, consistently over time.