
why AI coding agents agree with everything (and how to make them push back)

2026-06-17 6 min read

Someone building a startup with an "AI co-founder" setup asked the obvious question out loud: does this thing actually challenge me, or is it just a yes-man? It's the right instinct. You ask a coding agent whether your design holds up, and it tells you the design is solid. You ask it to review the code it just wrote, and it reports that the code looks good. The agreement is fast, fluent, and total. And that is exactly the problem.

An agent that agrees with everything is not giving you a second opinion. It's giving you your own opinion back, with more confidence. When that happens at the review step — the one place that's supposed to catch what you missed — you've built a process that feels rigorous and isn't.

Why the default answer is "looks good"

This isn't a quirk of one model. It's baked into how these systems are trained. Instruction-tuned models are optimized, via human feedback, toward the responses people rate highly, and people tend to rate responses that affirm them above responses that contradict them. The model learns that agreement is what gets rewarded.

The known failure mode of that training is sycophancy: the tendency to tell the user what the user seems to want to hear rather than what's true. It's well documented in the research, and you can feel it in daily use. State a belief confidently and the model leans toward confirming it. Ask "is this approach good?" and the question itself signals the answer you're hoping for. The model obliges.

For ordinary chat, this is mildly annoying. For code review, it's a structural flaw, because review is the one task where the value comes entirely from disagreement. A reviewer who can't push back isn't a weak reviewer. It isn't a reviewer at all.

Reviewing your own work is not review

The deeper issue shows up when the same agent both writes the code and reviews it. This is the default in most single-agent setups: the model implements a feature, then you ask it to check its work, and it does — using the same context that produced the code in the first place.

There are two reasons that's hollow. First, there's no independent vantage point. A reviewer's job is to come at the work from outside the assumptions that built it, and an author can't do that about its own output. Second, the agent is anchored. The context that wrote the code already contains every choice as a settled fact: the chosen approach, the edge cases it decided didn't matter, the interpretation of the spec it locked in early. Asked to evaluate that, it evaluates from inside it. It defends the choices instead of questioning them.

An author reviewing its own code has no independent vantage point and is anchored to every decision that produced the code. The result reads like a review and functions like a signature.

So you get a green light that means nothing, and it's dangerous precisely because it looks confident and complete. A bare "LGTM" from a junior engineer, you'd discount. The same verdict from a fluent agent, formatted with checkmarks, you tend to trust.

How to make an agent actually push back

You can get real disagreement out of these models. It takes structure, because the default pull is toward agreement and you have to engineer against it. Five things that work, roughly in order of leverage.

1. Separate the reviewer from the author

Run the review in a fresh context — a different agent, or at minimum a new session — that did not write the code and has no stake in it. A clean context isn't anchored to the implementation choices, so it reads the diff as evidence rather than as something to defend. This is the single biggest change, and it's the one most setups skip because it's easier to just ask the author to check itself.
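In code, the difference between self-review and separate review comes down to what goes into the reviewer's context. Here is a minimal Python sketch, assuming a hypothetical `Message` type for chat turns. The model call itself is left out, since the only point is which history the reviewer receives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str  # "system", "user", or "assistant"
    content: str


def author_self_review(author_history: list[Message], diff: str) -> list[Message]:
    """The anti-pattern: ask the author to check its own work.

    The review request is appended to the same history that produced the code,
    so every earlier decision is still sitting in the reviewer's context.
    """
    return author_history + [Message("user", f"Review your change:\n{diff}")]


def fresh_reviewer_context(diff: str, spec: str) -> list[Message]:
    """A separate reviewer: a brand-new context holding only the spec and the diff.

    Nothing from the author's conversation is copied in, so the reviewer reads
    the diff as evidence rather than as a set of choices it already made.
    """
    return [
        Message("system", "You are reviewing a change you did not write."),
        Message("user", f"Spec:\n{spec}\n\nDiff:\n{diff}"),
    ]
```

The author's reasoning ("I decided timeouts don't matter") travels with the self-review request and never reaches the fresh reviewer, which is exactly the anchoring the separation is meant to break.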

2. Give the reviewer an adversarial brief

The prompt decides the outcome. "Does this look good?" invites a yes. "Find what's wrong with this. Assume there's a bug and locate it. Argue the case for rejecting this change." forces the model into a different posture. You're not asking it to lie about quality — you're removing the social cue that agreement is the wanted answer, and giving it permission to disagree.
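Concretely, the difference is just the prompt. Below is a minimal Python sketch. The exact wording of `ADVERSARIAL_PROMPT` is a hypothetical adaptation of the brief above. The closing instruction, which tells the reviewer to list what it checked if it finds nothing, is an addition meant to keep "assume there's a bug" from turning into invented bugs.

```python
AGREEABLE_PROMPT = "Does this look good?\n\n{diff}"

# Hypothetical wording adapted from the adversarial brief; tune it for your team.
ADVERSARIAL_PROMPT = """\
Find what's wrong with this change. Assume there is at least one bug and locate it.
Argue the case for rejecting it. Report each finding as: file:line, what fails, and why.
If after a real search you still find nothing, say so and list what you checked.

{diff}"""


def build_review_prompt(diff: str, adversarial: bool = True) -> str:
    """Fill the chosen template with the diff under review."""
    template = ADVERSARIAL_PROMPT if adversarial else AGREEABLE_PROMPT
    return template.format(diff=diff)
```

Sending the prompt to a model is left out on purpose. Any chat API works, as long as the reviewer receives it in a fresh context.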

3. Make review a gate, not a suggestion

If review is advisory — a comment you can scroll past — it gets scrolled past under deadline. Make it a hard gate: the change does not advance to the next stage until the checks pass. A suggestion the author can overrule is not a control. A gate is.
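A gate is easiest to see as a state transition that has no path around a failing check. The sketch below is a minimal Python illustration that assumes a hypothetical three-stage lifecycle (`IMPLEMENT`, `REVIEW`, `RELEASE`) and check results already collected as a name-to-pass/fail mapping. It is not defract's implementation.

```python
from enum import Enum


class Stage(Enum):
    IMPLEMENT = 1
    REVIEW = 2
    RELEASE = 3


class GateBlocked(Exception):
    """Raised when a change tries to leave review while checks are failing."""


def advance(stage: Stage, check_results: dict[str, bool]) -> Stage:
    """Move a change to the next stage, but only through the review gate.

    There is deliberately no `override` parameter: if the author could skip a
    failing check, the gate would just be another suggestion.
    """
    if stage is Stage.RELEASE:
        raise ValueError("change is already released")
    if stage is Stage.REVIEW:
        failed = sorted(name for name, ok in check_results.items() if not ok)
        # An empty result set also blocks, so the gate can't pass vacuously.
        if failed or not check_results:
            raise GateBlocked(f"blocked by: {', '.join(failed) or 'no checks ran'}")
    return Stage(stage.value + 1)
```

Treating "no checks ran" as a failure is a design choice: a misconfigured pipeline should stop the line rather than wave everything through.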

4. Anchor on checks that can't be talked around

Sycophancy thrives where judgment is subjective, and no amount of agreeable phrasing turns a failing test green. Lean on objective, evidence-linked signals (type checks, linters, the test suite, security scans) that produce a pass/fail result tied to a specific line or test. The agent can rationalize a vague "is this clean?" It cannot rationalize a red type error. Use the model for the judgment that needs language, and let deterministic checks carry the verdicts that don't.
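One way to keep the verdict out of the model's hands is to let process exit codes decide it. Here is a minimal Python sketch using the standard library's `subprocess.run`. The `CheckResult` type and the `PROJECT_CHECKS` wiring are illustrative assumptions, and the example commands assume mypy, ruff, and pytest are installed.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    passed: bool
    output: str  # raw tool output, e.g. a "file.py:12: error: ..." line a reviewer can cite


def run_checks(commands: dict[str, list[str]]) -> dict[str, CheckResult]:
    """Run each check as a subprocess; exit code 0 is a pass, anything else a fail.

    The verdict comes only from the exit code, so a fluent explanation from the
    agent cannot turn a failing check into a passing one.
    """
    results: dict[str, CheckResult] = {}
    for name, argv in commands.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
        except FileNotFoundError:
            # A missing tool counts as a failure, never as a silent skip.
            results[name] = CheckResult(False, f"tool not found: {argv[0]}")
            continue
        results[name] = CheckResult(proc.returncode == 0, proc.stdout + proc.stderr)
    return results


# Example wiring; assumes mypy, ruff and pytest are installed. Swap in your own tools.
PROJECT_CHECKS = {
    "types": ["mypy", "."],
    "lint": ["ruff", "check", "."],
    "tests": [sys.executable, "-m", "pytest", "-q"],
}
```

The captured output is kept next to each verdict so that a language-model reviewer can point at the exact failing line instead of summarizing it away.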

5. Diversify the lenses

One generic review pass produces one generic opinion. Split it. Run correctness as its own question, security as another, performance as a third. Each lens has a narrow brief and a narrow definition of failure, which is harder to wave through than a single "looks fine overall." Different angles catch different problems, and a finding in one lens can't be averaged away by a clean result in another.
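Splitting a review into lenses can be as simple as running one narrow brief per concern and refusing to merge the results into a single verdict. Below is a minimal Python sketch. The three `Lens` briefs are hypothetical examples, and `reviewer` stands in for whatever calls your model. It is a plain function here so the logic can be tested without one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Lens:
    name: str
    brief: str    # the narrow question this pass answers
    failure: str  # what counts as a failure for this lens


# Hypothetical briefs; tune the wording for your codebase.
LENSES = (
    Lens("correctness", "Does the change do what the spec says, including edge cases?",
         "any input where the behavior differs from the spec"),
    Lens("security", "Can untrusted input reach anything dangerous?",
         "an unvalidated path from input to a query, shell, file, or auth decision"),
    Lens("performance", "Does the change add avoidable cost on a hot path?",
         "new work that grows with input size where the old code's did not"),
)

# A reviewer takes (lens, diff) and returns findings; an empty list means pass.
# In practice each call would be a separate, fresh-context agent run (not shown).
Reviewer = Callable[[Lens, str], list[str]]


def review_through_lenses(diff: str, reviewer: Reviewer,
                          lenses: tuple[Lens, ...] = LENSES) -> dict[str, list[str]]:
    """Run one narrow pass per lens and keep each lens's findings separate."""
    return {lens.name: reviewer(lens, diff) for lens in lenses}


def passes(findings: dict[str, list[str]]) -> bool:
    """The change passes only if every lens came back with no findings."""
    return all(not items for items in findings.values())
```

Because findings stay keyed by lens, there is no single "looks fine overall" verdict for a security finding to hide behind.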

The thread through all five: independence and an adversarial frame. Disagreement doesn't emerge on its own from a model trained to agree. It has to be built into the process: a reviewer that didn't write the code, told to look for failure, blocking progress until objective checks pass. Take any of those away and you drift back toward the rubber stamp.

A structural take

This is the problem defract was built around, so treat this as one approach rather than the answer. Review is a gated stage, not a suggestion: a change doesn't reach release until it clears it. The agents running that stage are separate from the one that wrote the code, each with a narrow lens — types, lint, tests, architecture, security, UX — so the author and the reviewer are never the same context. Because the gate is enforced rather than advisory, a sign-off actually means something instead of being the agreeable answer to "does this look good?"

That separation is also where it differs from a single agent or a plain parallel runner. See "defract vs Claude Code Agent Teams" for that comparison, and "running parallel Claude Code agents" for the wider context on coordinating more than one. Whatever you reach for, the test is the same: when your agent approves a change, is that a review, or a signature on its own work?

defract is in open beta

a gated lifecycle with review agents that didn't write the code. free, no caps, no signup.