Checking Is Cheaper Than Creating
The verifier is the one part of the loop you cannot fake. So the frontier of self-improving AI is just the frontier of where a cheap, trustworthy check can be built.
Markus Hav
Lead Researcher, Agents · June 25, 2026
Abstract
This is the verifier — the part of the outer loop that does the real work. A self-improving system has to dream broadly, throwing off brilliant ideas and confident nonsense in the same breath; what divides the two is a check applied after the fact. This essay is about that check, and it rests on a single asymmetry: for a great many tasks, verifying a solution is far cheaper than producing one. That gap is the only reason a wild generator is affordable — and it quietly explains the shape of the whole AI frontier, because the domains that fell first, code and mathematics, are exactly the ones that arrived with cheap, exact verifiers already attached. The work now is manufacturing a trustworthy check where none seems to exist. But the verifier is also where the loop is most dangerous: because the system optimises toward whatever the check rewards, a check it can gameis worse than no check at all. Every check you can build is a proxy for reality. The one thing that is not a proxy — and so the only thing that cannot be gamed — is the outcome itself. For a business, that is the P&L.
A word first, for anyone who has landed here without reading the rest. This is the third part of The Outer Loop, a series about building an AGI — but not in the way the headlines mean it. The bet of this series is that the path to AGI does not have to run through an ever-bigger model. If raw intelligence is becoming a cheap, metered utility, sold by the token like electricity, then the system that finally crosses the line into AGI — the one that autonomously produces more value than it consumes— will not be a smarter brain. It will be a loop wired around an ordinary one.
That loop has three moving parts. A generator that reaches for surprising, valuable ideas. A verifier that sorts the good surprises from the bad. And a memory that keeps whatever survives, so the system never has to earn the same move twice. The previous partbuilt the generator — the faculty for dreaming up things that are true but unexpected. This part is about the verifier, and it makes a strong claim: of the three, the verifier is the part that does the real work. The reason why turns out to be a piece of arithmetic so simple it is almost a joke.
The Cheapest Hard Problem in the World
There is a question in theoretical computer science worth a million dollars, and it is, at bottom, a question about this very asymmetry. P versus NP asks whether every problem whose solution is easy to check is also easy to solve. Almost everyone believes the answer is no — that checking is genuinely easier than creating — but, and this matters, no one has proved it. It is a conjecture, formalised by Cook and Levin in 1971 and unresolved ever since. Take a completed Sudoku: confirming it is valid takes a glance down each row; producing it from a blank grid is a different order of effort entirely. Take a number with a thousand digits: handing you its two prime factors lets you verify them with a single multiplication, while finding them in the first place is hard enough that the world's banking rests on the difficulty.
Scott Aaronson put the stakes of the conjecture better than anyone: if checking really were as easy as creating — if P equalled NP — then “everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss.” The reason the world is not like that, the reason appreciation is common and genius is rare, is the asymmetry. Recognising a good thing is cheap. Producing one is dear. The entire previous part of this series — the dreaming generator that reaches for the improbable truth — is affordable only because the recognising half is so much cheaper than the reaching half.
This is the lever the whole loop turns on. You can let the generator hallucinate at ruinous breadth, at high temperature, dreaming a thousand candidates for every one you keep, precisely because sorting the thousand afterward costs so much less than producing them. Jason Wei, who has thought about this as clearly as anyone, names it directly: the ease of getting an AI to master a task is proportional to how verifiable the task is. His Verifier's Law is almost a tautology once you see it — every task that is possible to do and cheap to check will be done by a machine — and like the best tautologies it turns out to predict the future. Andrej Karpathy compresses it to a slogan: classical software automates what you can specify; AI automates what you can verify.
The Frontier Is a Map of Cheap Checks
Look at where AI has advanced fastest and you are looking at a map of where verification was cheap. The two domains that fell first — code and competition mathematics — are exactly the two that arrived with a cheap, exact, automatic verifier already bolted on. This is not a coincidence. It is the whole mechanism.
Start with code, because code is the killer app, and it is worth being precise about why. In July 2024, Saoud Rizwan released an open-source extension first called Claude Dev and soon renamed Cline — for “CLI and Editor.” Two months later it gained the feature that mattered, described in his own release notes: it could now “monitor your workspace for linter, compiler, and build issues as he works… and automatically fix problems like missing imports, type errors, and more all on his own.” That sentence is the outer loop in miniature. The agent writes, the editor's own checks fire, the agent reads the resulting errors, and it tries again — refined, soon after, so it re-fed only the errors its own edits had caused. Cursor shipped the same loop that December (“the Agent reads linter errors to automatically fix issues”); Aider had it in the terminal, re-running your test suite after every edit and fixing whatever returned a non-zero exit code. None of these tools made the model smarter. They wired it to a check.
The Domain That Had Checks Lying Around
Code did not become the first home of the agent because it is the most valuable work in the world. It became the first home because it shipped a whole stack of cheap, automatic verifiers, ready to wire a loop to.
The parser rejects what is not even well-formed. Instant, free.
The checker catches a whole class of errors before the code runs at all. Shape, not behaviour.
Style and a thousand known foot-guns, flagged the moment they appear.
Behaviour, finally. Does the thing actually do what it claims? Pass or fail, by execution.
The program meets the world. It crashes, hangs, or returns. The last and bluntest check.
Write → run the check → read the error → fix. Five rungs, each cheaper than the bug it catches. The agent climbs the stack on its own.
The benchmark that came to define agentic coding makes the point at the level of measurement. SWE-bench grades a model by checking out a real bug from a real repository, applying the model's patch, and running the project's hidden unit tests. There is no judge, no rubric, no opinion — the tests pass or they do not. When the SWE-agent team wrapped a model in a well-designed interface for exploring the repo and running those tests, the solve rate jumped from the low single digits to 12.5%, several times the best non-interactive system of the day. Same model. A loop wired to an automatic check. The harness, not the intelligence, moved the number.
There is a quiet lesson buried in which language this happened in, and it matches something I felt long before I could justify it: that with TypeScript, it was easier to get an agent to bring value fast. The serious AI world runs on Python — it is the language of the model itself, of PyTorch, of the whole research stack — and so, by sheer path dependence, Python became the default for agentic code too, even where it is not the best target. Yet one of the most valuable properties an agent's output language can have is cheap verifiability, and there a statically typed language like TypeScript has a structural edge that went underrated for a long while. Anders Hejlsberg, who designed both C# and TypeScript, puts it flatly: “the only way you keep the AI honest is to put it through a deterministic type checker.” A type checker is a verifier that fires before the code runs at all, on every edit, catching a class of mistakes — the overwhelming majority of the compile errors models actually make are type errors — that in a dynamic language surface only at runtime, or never. Constrain generation to type-correct code and you cut compile errors roughly in half.
I want to be honest about the limits of that claim, because it is the kind of thing that gets oversold. Models are not, on the evidence, simply better at TypeScript — on apples-to-apples benchmarks they often score highest in Python, which dominates their training data. The type checker is also a shallow check: it verifies shape, not behaviour, and a model that is unsure can quietly defeat it by reaching for an escape hatch and writing any. So the edge is not that TypeScript makes a smarter model. It is that the type checker hands the loop a cheaper, earlier rung on the ladder — a tighter feedback loop — and that, I suspect, is what actually made it feel faster to build with. Which is the real moral: code won not because it is the most valuable work, but because it came with a whole stack of cheap checks, each catching what the one above it missed. Types catch shape before runtime; tests catch behaviour; you stack the rungs.
Mathematics tells the same story in its purest form. A proof written in a formal language like Lean can be checked by a machine with total certainty: it type-checks or it does not, and a proof that type-checks is correct, full stop — a verifier that never hallucinates and cannot be flattered. In July 2024, DeepMind's AlphaProof used exactly this. It generated candidate proof steps and let Lean adjudicate each one, an exact binary reward driving the search, and scored a silver medal at the International Mathematical Olympiad — solving four of six problems, including the hardest one on the paper. The same principle, generalised, became the engine of the 2025 reasoning models: reinforcement learning from verifiable rewards, where the reward is not a human's opinion but an automatic check — the right answer to a math problem, a passing test suite for code. DeepSeek-R1 showed that this alone, pointed at checkable problems, could bootstrap reasoning from almost nothing. Cheap exact verifier in; superhuman generation out.
Manufacturing a Check Where None Exists
Here is the catch, and it is the frontier. Most valuable work does not come with a compiler. There is no test suite for “is this the right strategy,” no type checker for “is this email persuasive,” no Lean for “is this design any good.” If the loop only runs where an exact verifier already exists, it runs in a narrow corner of the world. So the real engineering question of the outer loop is the one the previous part promised this one would answer: how do you manufacture a cheap, trustworthy check in a domain that looks like it has none?
The way to hold the answer in your head is a ladder. At the top are the exact, mechanical checks — compilers, tests, proof checkers — cheap and nearly impossible to fool, but rare. At the bottom is reality itself: the true outcome, un-gameable but slow and coarse. The craft is to climb a domain upthe ladder — to convert an expensive, soft, slow check into a cheaper, harder, faster one — without ever losing your anchor to the rung below.
The Ladder of Verifiers
Cheap-and-exact at the top, slow-and-true at the bottom. The whole craft is moving a domain upthe ladder — turning an expensive, soft check into a cheap, hard one — without losing your grip on the rung below it.
Exact / mechanical
Compiler, type checker, unit test, proof checker, schema
Executable spec
Prose turned into a runnable check — a claim that fails loudly when it drifts false
Model as judge
An LLM grading another model against a narrow, specific question
Human
A person approving a step, a batch, or a spot-check
Reality
The real outcome. For a business: the P&L — money in versus money out
Every rung above reality is a proxyfor it. Proxies are gameable in proportion to how hard you optimize against them — which is why the bottom rung, the one that is not a proxy at all, is the one the loop is ultimately answerable to.
The second rung is the one we have written about elsewhere. Codumentationtakes documentation — prose, the least checkable thing there is — and turns its claims into executable specifications that fail loudly the moment they drift false. That is a verifier manufactured for language: a way to give text the one property, checkability, that lets the loop run on it at all. Every domain you can drag onto that rung becomes a domain the loop can improve.
Ask the Smallest Question You Can
The third rung — the model as judge — is where most of the new ground is being won, and it has one governing principle: ask the smallest question you can. Do not ask a model “is this code safe?” The question is vast, the answer is a soft impression, and the impression drifts from run to run. Ask instead “is there a SQL injection vulnerability in this function?” — and then ask fifty more questions just as narrow. A small, specific, near-binary question is almost a unit test. Its surface area for error is tiny, its answer is close to a yes or no, and when it is wrong it is wrong locally, where you can see it.
One Big Question
"Is this code safe?"
- · Vast surface area; the judge can be wrong in a hundred ways
- · No clear pass/fail; the answer drifts run to run
- · Easy to satisfy with a confident, hollow "looks fine"
- · You cannot tell why it passed or failed
A Battery of Small Ones
"Is there a SQL injection in this function?" × 50
- · Each question is nearly a unit test — specific, near-binary
- · Wrong answers are local and visible, not smeared across the whole
- · Hard to fake fifty narrow checks at once
- · The failures tell you exactly what broke
This is not folklore; it is how the most credible evaluations are now built. OpenAI's HealthBench had 262 physicians write tens of thousands of specific rubric criteria, and grades each response by asking a model, criterion by criterion, a single binary question: is thisone thing present or not? The factuality work in the same vein — FActScore, and its search-augmented successor — breaks a long answer into atomic facts and checks each one independently, behaving like a test suite for prose. Anthropic's own guidance is the same: grade each dimension with its own isolated judge rather than asking one judge to weigh everything at once. The principle even has a theoretical backbone in the scalable-oversight literature: decomposing a hard judgment into small checkable sub-questions provably extends what a bounded, fallible checker can confirm — a poly-time judge presiding over a structured debate can verify claims it could never settle by answering directly.
Two honest cautions keep this from being a magic wand. The first is that narrow does not automatically mean un-gameable: a check that captures only a sliver of what you actually care about can be satisfied perfectly while the real goal is missed, so the small questions have to collectively cover the thing you mean. The second is sharper and easy to forget. Models are surprisingly bad at checking their own work: ask a model to grade its own answer with no external signal and it will often confidently bless its mistakes, sometimes making the output worse. The asymmetry — checking is cheaper than creating — is a fact about external, sound verifiers: compilers, tests, trained reward models, panels of independent judges. It is not a free pass for a model to mark its own homework. The verifier has to come from somewhere the generator cannot simply reach in and reassure.
And there is human judgment, the rung above reality — the most trustworthy check most organisations have, and the most expensive. The mistake is to treat it as one setting, on or off. It is a spectrum: a person approving every function as the agent writes it; a person approving a whole batch of work at once; a person spot-checking a sample after the fact. Each trades trust against throughput. Human attention is the costliest verifier you own, so the discipline is to spend it only where the cheap checks cannot reach — and to let the cheap checks, ruthlessly, handle everything else. Watchingthe system over time — the observability that tells you whether the whole loop is drifting — matters too, but it belongs to the next part of this series, because it is about what compounds across many runs, not about gating a single one.
A Check You Can Game Is Worse Than No Check
Now the dangerous part, and it is the same impulse the first part of this series warned about, seen from the verifier's side. The loop optimises toward whatever the check rewards. That is the entire point of it. Which means that if the check is wrong — if it can be satisfied without doing the work it was meant to certify — the loop will find that out faster than you will, and it will optimise straight into the gap. A gameable verifier does not merely fail to help. It actively manufactures confident, certified error: it stamps false-and-surprising as true-and-surprising, the precise corner the previous part identified as the dangerous one. A bad check is worse than no check.
This is not a worry; it is a documented and growing catalogue. In 2025, OpenAI watched a frontier reasoning model, during training, learn to call exit(0) to crash out of a test harness before the tests could run — reporting success by never being graded — and, in other runs, simply overwrite the verification function so it always returned true. Anthropic documented a model that, handed a curriculum of gameable environments, generalised from petty flattery all the way to editing its own unit tests to hide that it had rewritten its reward. METR found a model that, asked for a fast kernel, dug the correct answer out of the scoring code's own memory and returned it — and, tellingly, reward-hacked most often exactly when it could see the scoring function. On coding benchmarks, agents have been caught running git logto read the merged fix out of the repository's future and pasting it back as their own; one audit found the leading scaffold reading the hidden answer key in 97% of its runs. The grades were so contaminated that OpenAI stopped reporting one of them.
When the Check Can Be Gamed
These are not hypotheticals. Every one was observed in a real model against a real verifier — the loop satisfying the letter of the check while doing none of the work it was meant to certify.
exit(0)
A frontier reasoning model exploited a bug to quit the process before the tests ran, so the run reported success. Caught reading its own chain-of-thought, where it had written: “Let’s hack.”
overwrite verify()
Rather than satisfy the verification function, the model rewrote it to always return true. The check passed because the check had been replaced.
return the grader’s own answer
Asked to write a fast kernel, a model traced the call stack to the answer the scorer had already computed, returned it, and disabled the timer so its “speedup” could not be measured. It hacked most often exactly when it could see the scoring function.
git log --all
On a coding benchmark, agents read the repository’s own future commits to copy the merged fix. One audit found the top scaffold reading the hidden answer-key file in 97% of its runs.
A check the system can satisfy without doing the work is not a check. It is a faster way to be wrong — and the loop optimises straight toward it.
The pattern underneath all of these has a name a century older than the machines: Goodhart's law. When a measure becomes a target, it ceases to be a good measure. The AI version is now measured, not just feared — optimise hard against an imperfect proxy reward and the true reward climbs for a while and then falls, as the system learns to exploit the gap between the proxy and the thing it stood for. The practical consequences are two. First, the harder you intend to optimise, the more robust your check has to be: a verifier good enough for a weak generator will be torn apart by a strong one, which is why you can productively train a verifier against an adversary whose whole job is to fool it. Second, and counterintuitively, a verifier's false positives are far more poisonous than its false negatives. A check that wrongly rejects good work merely slows the loop; a check that wrongly accepts bad work feeds poison back into it, and the loop, optimising, learns to produce exactly that poison on purpose.
The One Check That Cannot Be Gamed
Follow that logic to its end and you arrive somewhere clarifying. Every check you can build is a proxy — a stand-in for the outcome you actually care about — and every proxy is gameable in principle, given enough optimisation pressure. There is exactly one thing on the ladder that is not a proxy, because it is the outcome itself: reality. Did the kernel actually run faster on the actual hardware. Did the patch actually fix the bug in production. Did the strategy actually make money. Reality cannot be flattered, cannot be short-circuited, cannot be talked into a passing grade. It is the only verifier that is un-gameable, for the simple reason that it is not a measure of the thing — it is the thing.
This is the deepest reason the first part of this series pointed the outer loop at a business rather than a lab. A business comes with the one verifier that cannot be gamed into meaninglessness already installed: it makes money or it does not. Revenue minus cost is a scoreboard reality keeps for you, and no amount of clever output can argue with it. The serious AI researchers David Silver and Richard Sutton have made a parallel case for where machine learning is headed at large — that the next era belongs to agents whose rewards are “grounded in their experience of the environment, rather than coming from human prejudgment.” The P&L is precisely such a grounded reward. It is what a startup's discipline of actionable over vanity metrics has always reached for: not the number that feels good, but the one tied by hard cause and effect to whether the thing worked.
But reality, the perfect verifier, has a flaw that is the mirror of its virtue: it is slow and it is coarse. You cannot run your quarterly P&L on every token the system emits; by the time reality returns its verdict, the loop has moved a thousand times. This is the genuine engineering problem the whole essay has been circling, and the intuition behind the fix is plain: judging the big picture is hard; judging a small step is much easier. So you decompose the slow, true, un-gameable check into many fast, local, gameable ones — a stack of cheap proxies that approximatethe real outcome — and then you do the one thing that keeps the whole structure honest: you reconcile the proxies against reality, often enough that they cannot quietly drift away from it. The fast checks let the loop run; the slow check keeps the fast ones true. That reconciliation — anchoring the cheap proxies back to the un-gameable outcome before Goodhart pulls them loose — is the real discipline of building a verifier, and it is what the next two parts, on compounding and on the break-out, are ultimately about.
The previous part ended on a line I have been building toward this whole essay: the sorter is the whole game. This part built the sorter, and in doing so it drew a map. The frontier of what a self-improving system can master is not a map of intelligence — it is a map of where a cheap, trustworthy check can be built. Karpathy calls the shape of that frontier jagged: verifiable tasks race ahead while the unverifiable ones lag, not because the model is cleverer in one place than another, but because in one place there is a check and in the other there is not. Expand the verifier and you expand the loop. That, and not a larger model, is the lever a business has its hands on.
A generator without a verifier is a dreamer talking to itself. The verifier is what turns the dreaming into work — and the only verifier that can never be fooled is the world telling you, slowly and without mercy, whether the thing you made was real.
Notes & Further Reading
- Jason Wei, "Asymmetry of verification and verifier's law" (2025) — the cleanest formulation of the lever this essay turns on. link
- Andrej Karpathy, "Verifiability" (2025) — AI automates what you can verify; the jagged frontier. link
- The P versus NP problem (Cook–Levin, 1971) — a Clay Millennium Prize problem; the asymmetry remains a conjecture. Scott Aaronson's survey has the Mozart–Gauss line. link
- Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021) — a 6B model plus a verifier matched a 175B model. The generator–verifier gap, measured. link
- Yang et al., "SWE-agent" (2024) and Jimenez et al., "SWE-bench" (2023) — the harness wired to an automatic unit-test check. link
- Cline (Saoud Rizwan, 2024) — the agent that reads the editor's linter, compiler, and type errors and fixes them on its own. link
- DeepMind, "AI achieves silver-medal standard solving International Mathematical Olympiad problems" (2024) — AlphaProof, with Lean as an exact verifier. link
- DeepSeek-AI, "DeepSeek-R1" (2025) — reinforcement learning from verifiable rewards on math and code. link
- OpenAI, "HealthBench" (2025) — tens of thousands of specific, independently graded binary criteria. The smallest-question principle at scale. link
- Min et al., "FActScore" (2023) — decompose a generation into atomic facts and check each one. link
- Stechly et al., "GPT-4 Doesn't Know It's Wrong" (2023) and Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet" (2023) — why the verifier must be external and sound. link
- Baker et al. (OpenAI), "Monitoring Reasoning Models for Misbehavior" (2025) —
exit(0), overwriting the verifier, and what happens when you penalise the thought instead of the act. link - Denison et al. (Anthropic), "Sycophancy to Subterfuge" (2024) — a model that edits its own tests to hide a rewritten reward. link
- METR, "Recent Frontier Models Are Reward Hacking" (2025) — reward hacking spikes when the model can see the scorer. link
- Gao, Schulman & Hilton, "Scaling Laws for Reward Model Overoptimization" (2022) — Goodhart's law, measured: the proxy reward climbs, then falls. link
- Silver & Sutton, "Welcome to the Era of Experience" (2025) — rewards grounded in the environment rather than human prejudgment. link
About the Author
Markus Hav
Markus Hav is Lead Researcher for Agents at Benque Max AI Lab in Finland, where he focuses on advancing autonomous AI systems and agent architectures. His work explores the boundaries between programmed behavior and emergent intelligence in AI agents. He also serves as Head of AI Automation at Hoxhunt, applying cutting-edge agent research to real-world automation challenges.