
AI Tooling Has a Trust Problem, Not a Capability Problem

Dan Maby · 6 min read

The interesting question isn't 'does it work?' any more

Claude Code, Codex, Copilot and Gemini have all crossed a threshold most engineering teams haven't fully reckoned with. Docker reports that over a quarter of all production code is now AI-authored, and that developers who use agents merge roughly 60% more pull requests. The capability argument is largely settled. What hasn't settled is the harder question: how much should we trust the output, in which contexts, and what does the engineering practice around that trust actually look like?

Most of the conversation we hear from prospective clients still treats AI tooling as a binary. Either you've 'adopted AI' or you haven't. Either your developers are using Cursor or they're being left behind. We think that framing is wrong, and it's producing a new category of debt that's harder to see than the kind teams are used to.

Our position is straightforward: trust calibration is an engineering discipline. It needs sandboxing, coding standards written for both humans and agents, testing strategies that account for non-deterministic output, and deliberate decisions about where AI judgment ends and human judgment begins. Skip that infrastructure and you're not adopting AI - you're just accumulating maintenance liability faster.

The trust gap is widening, not closing

The usual technology adoption curve says familiarity breeds confidence. With AI coding tools, the opposite is happening. Stack Overflow's data is striking: in 2023, roughly 70% of developers reported using or planning to use AI tools, with trust around 40%. By 2025, usage rose to 84% even as trust dropped to 29%. The more developers use these tools, the less they trust them.

Sonar's State of Code survey points at why. They found that 96% of developers do not fully trust AI-generated code, but only 48% always verify it before committing, and 61% agree AI often produces code that "looks correct but isn't reliable". That's the dangerous zone: plausible output, hidden defects, and time pressure that erodes the verification step exactly when it matters most.

This isn't a tooling problem you fix by switching to a better model. It's a process problem. The Stack Overflow team puts it well in their piece on the developer AI trust gap:

Think of an AI coding tool like a junior developer: promising, pretty fast, but prone to sometimes-basic errors and in need of supervision and redirection.

That's the right mental model. We wouldn't merge a junior's PR into a payments service without a senior reviewing the approach, the tests, and the edge cases. The same standard has to apply to agent-generated code, regardless of how confident the output sounds.

Debt that doesn't look like debt

The debt AI tooling generates doesn't look like the debt teams know how to triage. Traditional tech debt comes from conscious shortcuts under deadline pressure. AI-generated debt is different. GitClear's analysis of 211 million changed lines from 2020 to 2024 found that copy-pasted code rose from 8.3% to 12.3% of all changes, refactored code dropped from 25% to under 10%, and code duplication blocks increased eightfold. The codebase grows. The architecture rots.

The Ox Security analysis of 300 repositories called it cleanly: AI-generated code is "highly functional but systematically lacking in architectural judgment." That's the tell. The code passes tests. It reads fine in review. It just doesn't know about the consolidation work the team did six months ago, or the fact that this same logic already exists in three other places.

For consultancy work, this matters even more than it does for product teams with a single codebase. We move between client systems. The patterns on Nectar Group's logistics platform aren't the same as those on PodcasterPlus or All Counseling's directory at scale. An AI agent that helpfully produces a 'standard' authentication flow can contradict the conventions of the system it's been dropped into. That's not a hallucination - it's worse, because it looks right.

What the discipline actually involves

We think trust calibration breaks down into four concrete engineering practices. None of them are exotic. Most teams just haven't sequenced them in the right order.

Sandboxing as a default, not an afterthought. If you let an agent run unattended on your host, you're trusting both the model's judgment and its lack of malice. Docker's own framing is honest about this: an autonomous agent can access files you didn't intend it to touch, read sensitive data, or execute destructive commands while trying to help. Guardrails matter, but only when they're enforced outside the agent, not by it. An agent needs a true bounding box: constraints defined before execution and clear limits on what it can access and execute. We treat 'where does this agent run' as a first-class architectural decision, not a developer preference.
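To make 'enforced outside the agent' concrete, here's a minimal sketch of one way to do it: wrapping the agent process in a locked-down container so every limit lives in the container runtime rather than in a prompt. The image and CLI names (`agent-image`, `agent-cli`) are placeholders for whatever tooling you use, not real artifacts.

```python
import subprocess

def run_agent_sandboxed(repo_path: str, task: str) -> subprocess.CompletedProcess:
    """Run an agent CLI inside a constrained Docker container.

    The Docker flags are the enforcement; the agent never sees them
    and can't negotiate its way past them.
    """
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",       # no network; swap for an egress proxy if the agent needs a model API
        "--read-only",             # root filesystem is immutable
        "--tmpfs", "/tmp",         # scratch space that never touches the host
        "--memory", "1g",          # hard resource ceilings
        "--cpus", "2",
        "--pids-limit", "256",     # bound process creation
        "--cap-drop", "ALL",       # drop every Linux capability
        "-v", f"{repo_path}:/workspace",  # the one path it may write to
        "--workdir", "/workspace",
        "agent-image",                    # placeholder image with the agent installed
        "agent-cli", "--task", task,      # placeholder agent entry point
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=600)
```

The specific flags matter less than the principle: the bounding box exists before the agent starts, and nothing the model generates can widen it.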

Coding standards written for agents and humans. The standards documents most teams have were written for humans who can read between the lines. Agents can't. If your repo doesn't make the architectural rules explicit - which database stores which data, which patterns are deprecated, which directories are off-limits - the model will optimise the local task and ignore the bigger picture. That's not a model failure. That's a documentation failure that AI exposes.
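What does 'explicit for agents' look like? Claude Code reads a CLAUDE.md file and Codex reads AGENTS.md, so that's a natural place to put the rules. A hypothetical excerpt - the rules themselves are invented examples, not a standard:

```markdown
# AGENTS.md (excerpt)

## Architecture rules, not suggestions
- User records live in Postgres. Redis is cache-only; never treat it as a source of truth.
- `src/legacy/` is frozen. Do not modify it or import from it.
- All authentication flows go through the existing session module. Do not roll a new one.

## Deprecated patterns
- Class-based API handlers are deprecated; use the route helpers instead.
- Before writing a helper, search for an existing one. Duplication is a review blocker.
```

Notice these are the same rules a senior engineer carries in their head. Writing them down helps the humans too.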

Testing strategies for non-deterministic output. Traditional QA assumes the same input produces the same output. With agent-driven systems, that assumption breaks. Testing has to shift toward property-based checks, verification of invariants, and explicit evaluation harnesses for the categories of task you let AI handle. Verification becomes a first-class part of the workflow, not an afterthought you bolt on when something breaks in production.
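As a sketch of what a property-based check looks like in practice, here's a minimal example using the hypothesis library. `normalise_slug` is a toy stand-in for whatever AI-generated function you're verifying; the point is that the invariants hold for every input, not just the examples the model happened to satisfy:

```python
import re
from hypothesis import given, strategies as st

def normalise_slug(raw: str) -> str:
    """Toy stand-in for an AI-generated function under test."""
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower())
    return slug.strip("-")

@given(st.text())
def test_slug_is_idempotent(raw):
    # Running the function twice must change nothing.
    assert normalise_slug(normalise_slug(raw)) == normalise_slug(raw)

@given(st.text())
def test_slug_uses_allowed_charset(raw):
    # Output is restricted to lowercase alphanumerics and hyphens.
    assert re.fullmatch(r"[a-z0-9-]*", normalise_slug(raw))
```

An example-based test proves the code works once. A property proves something about its shape, which is exactly the assurance you lack when you didn't write the code.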

Explicit boundaries between cooperative and delegative use. Fly's piece on trust calibration draws a useful distinction:

Cooperative systems generally call for lower levels of trust because users can choose whether to accept or reject AI suggestions.

Delegative systems are different. When you let an agent run end-to-end on a task, you're effectively trusting it to deliver. The discipline is deciding which categories of work get which mode, and being honest that 'the agent did it' is never a defence when something fails in production.
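One way to keep that decision deliberate rather than ad hoc is to encode it. A sketch with invented task categories - the assignments are ours, not a standard - where anything unclassified defaults to the higher-scrutiny mode:

```python
from enum import Enum

class Mode(Enum):
    COOPERATIVE = "cooperative"  # human accepts or rejects each suggestion
    DELEGATIVE = "delegative"    # agent runs end-to-end; a human reviews the result

# Illustrative policy only; tune the categories to your own risk profile.
TASK_POLICY: dict[str, Mode] = {
    "test-generation": Mode.DELEGATIVE,    # cheap to verify, low blast radius
    "scaffolding": Mode.DELEGATIVE,
    "refactor": Mode.COOPERATIVE,          # needs architectural judgment
    "auth-and-payments": Mode.COOPERATIVE, # never delegated, full stop
}

def mode_for(task_category: str) -> Mode:
    # Unclassified work gets the higher-scrutiny mode by default.
    return TASK_POLICY.get(task_category, Mode.COOPERATIVE)
```

The table will be wrong on day one, and that's fine. The value is that changing it becomes a reviewed decision rather than a per-developer habit.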

Our take

We're not anti-AI tooling. We use these tools, and they meaningfully change what a small, senior team can deliver. PodcasterPlus, All Counseling and the work we do for Nectar Group all benefit from agents that handle scaffolding, refactors and test generation faster than we could by hand. That's real, and we're not interested in pretending otherwise.

But we're sceptical of the framing that says any team not using AI is simply falling behind. The teams actually falling behind are the ones treating AI as a productivity multiplier without building the engineering practice around it. They're shipping faster in the short term and accumulating a maintenance bill they haven't priced. The ones that will look strong in two years are the ones that invested in sandboxes, evaluation harnesses, explicit standards and human review processes that scale - not the ones with the highest agent-authored line counts.

For consultancies in particular, this is a craft question. Our clients hire us because we own the outcome. 'The model wrote it' is not a position we can take when something breaks in production. That means the standard for AI-assisted work has to be the same standard we'd apply to code we wrote ourselves: does it fit the system, does it have tests we trust, and does someone on our team understand it well enough to maintain it? If the answer to any of those is no, the velocity gain is illusory.

Where this leaves you

If you're evaluating how to introduce AI tooling into a serious codebase, the question to start with isn't 'which model'. It's 'what does our verification, sandboxing and standards infrastructure look like, and is it ready for code we didn't write?' If the honest answer is no, that's the work to do first.

We help clients build software that has to keep running and keep evolving long after the initial sprint. If that's the kind of problem you're working on and you'd like another set of opinionated eyes on it, get in touch.