Discipline you can install.Reasoning you can check.
Hammerstein is a free, open-source framework that wraps around any AI model and forces structural verification. It makes the assistant check its work before shipping, flag contradictions in your plan instead of agreeing with it, and refuse what it cannot honestly answer. It is not a new model. It is the verification layer you put around one.
Clever-lazy over stupid-industrious,
verification over enthusiasm,
legible failure over hidden success.
demo · h audit
~/proj $h audit"ship the new pricing page by friday; QA after launch"
·· loading corpus (Hammerstein-Equord) ················· ok
·· model: claude-opus | judge: off | scope: plan
VERDICTstupid-industrious
why QA-after-launch inverts verification and shipping.
The deadline is the load-bearing claim; pricing is
A set of instructions and real worked examples you load into any AI model before asking it hard questions.
Once loaded, it enforces a few habits: verify before shipping, flag the flaw instead of cheerleading it, refuse what it can't honestly answer. The name comes from Kurt von Hammerstein-Equord, who sorted officers into four boxes — clever or stupid, lazy or industrious. The lazy-and-clever belong at the top. The industrious-and-stupid are dangerous and must be removed. Hammerstein the framework applies that typology to reasoning. It is not a model. It is the discipline you wrap around one.
Clever-lazyFind the cheapest path to a correct, reversible answer. Refuse the heroic version.
VerificationBefore shipping, name the load-bearing claim and the failure mode. Enthusiasm is not evidence.
LegibilityWhen the model can't help, say so on the record. A refusal is a result.
ScopeReason inside the question. Reject scope-creep that smuggles in new claims. Answer one thing, well.
PortabilityNo fine-tune required. Same prompt, same corpus, any model. The discipline travels.
§ 03How you use itcommand-line
Five verbs and a REPL. That's the surface.
The CLI ships as h. Each verb takes a plan, a claim, a draft —
something concrete — and returns a verdict in the framework's voice.
The hsh REPL keeps state across turns when you want to argue back.
h audit <plan>Classify the plan in the Hammerstein quadrants. Names the load-bearing assumption. Tells you which kind of officer you're being.
h next <state>The single next move that maximises learning per unit effort. No roadmap. One step.
h scope <question>Tighten the question until it's answerable. Strips smuggled premises. Returns a shorter question and the things you just gave up.
h worth <task>Is this task worth doing at all? Cost, reversibility, blast radius, alternatives. The lazy-clever filter.
h sharper <draft>Adversarial edit pass. The framework reads your draft like the smartest person who hates it. Returns the cuts and the open questions.
hshInteractive REPL. Carries the campaign across turns. Useful when the work is bigger than a single verb.
preferred frontier-models-running-Hammerstein over their raw selves.
Three frontier models — Claude Opus, GPT-5, Claude Sonnet —
were each asked the same prompts twice: once raw, once with the Hammerstein
system prompt and corpus loaded. Four blind LLM judges, no knowledge of which
was which, picked the better answer. The framework won 53 of 54 head-to-heads.
Hammerstein vs. raw frontier — in-distribution98.1%53 / 54
OOD · corpus offers no direct help100%48 / 48
OOD stress · hard novel domains81.3%32 ratings
0%25%50%75%100%
Out-of-distribution
48 / 48100%
On four questions the corpus can't help with, Hammerstein still won every blind rating. The discipline travels; the corpus is only half the lift.
OOD stress · novel domains
81.3%32 ratings
On hard novel domains, Hammerstein beat raw frontier 81.3%, and refused out-of-scope medical/legal calls where raw models confidently hallucinated.
Models tested
3Opus · GPT-5 · Sonnet
Three frontier models, identical prompts, two conditions each. The lift held across all three; the dominant component differed by model.
Total blind ratings
~246across 4 judges
Four blind LLM judges with no view of condition. Position and label confounds checked separately. Harness in the repo.
Methodology. Blind, multi-judge, ~246 ratings total, ~$10 in API cost,
~90 minutes wall-clock. Reproducible: the full harness, prompts,
judges, and raw rating logs live in /bench in the repo.
Confound checks (position bias, label leakage, judge agreement) are reported
alongside. Read the harness →
§ 04bHammerstein-CODERthe discipline, measured on code
The same discipline, measured on code.
Hammerstein refuses the stupid-industrious move — the elaborate build that feels like work
but delivers nothing. We wrapped a coding model in that discipline to test whether the
refusal survives contact with real coding tasks. It survives. Across every model tested,
the wrap stops a system that defaults to over-engineering while it still completes the
legitimate work.
Claude Opus 4.8 — over-engineering baits refused100%wrap
Claude Sonnet 4.6100%wrap
GPT-5100%wrap
GLM-5.2 / Kimi / Qwen3-Coder90–100%wrap
0%25%50%75%100%
Hammerstein-CODER · Plain baseline
6 / 6 models pass
100%gate
Over-engineering bait-refusal rises from near zero to 90–100% on every model tested. The wrap closes the gap; it does not replace the model.
Correctness — HumanEval pass@1 Δ
±0.05neutral
On three open coders (GLM +0.05, Kimi −0.03, Qwen 0.00), baseline coding accuracy is unchanged. The wrap only changes what the system attempts to build.
The one model that already reasons this way
Opus 4.8smallest lift
Opus 4.8 refused 70% of baits without the wrap — the highest plain baseline. The wrap takes it to 100%. Which tells you the bench grades judgment, not its own prompt.
More than "do less" — vs ponytail
+0.23ambiguous-scope
Against ponytail (off-the-shelf generic minimalism — a strong baseline, HumanEval 0.93–0.97), both refuse over-engineering similarly. The split is on vague requests: ponytail applies the smallest change; the wrap scopes what's actually needed first, then builds. +0.23 mean advantage on ambiguous-scope handling, ≥+0.20 in 4 of 6 models. Full per-model table: the eval doc below.
Methodology. 15-task adversarial bait bank; restraint scored by an independent LLM judge (kimi-k2.7-code); correctness by execution-based HumanEval pass@1. Tested 2026-06-21 on Claude Opus 4.8, Sonnet 4.6, GPT-5, GLM-5.2, Kimi-K2.7-Code, Qwen3-Coder-480B. Full per-arm results: eval/RESULTS-coder-bench.md →
§ 04cFable-5 controlthe proof that the bench grades judgment
On a model that already has the discipline, the wrap changes nothing.
We ran a blind, position-randomized 4-judge panel (Claude Opus 4.7, GPT-5, Claude Sonnet 4.6,
DeepSeek) on Fable 5 — a frontier model trained on check-then-speak reasoning — with and
without the Hammerstein system prompt. Result: 12–11–0 (52.2%, n=23).
A statistical coin-flip. Mean framework Δ −0.22.
Fable-5 with wrap vs. raw
52.2%n = 23
12 Hammerstein wins, 11 raw wins, 0 ties. Indistinguishable from chance. The framework added nothing on a model already reasoned in check-then-speak.
What this tells you
Δ −0.22mean
The benchmark grades judgment, not its own prompt. It lifts models that lack the discipline and disappears on models that already have it. That's the proof, not the caveat.
Why we publish this. The 53/54 result shows what the framework does for most models.
The Fable-5 control shows why it does it: the wrap is a discipline scaffold, not a
formatting trick. A framework that hides its null result is itself stupid-industrious.
Verdicts: eval/results/2026-06-11T195511Z/JUDGE-VERDICTS.md →
· Harness: eval/judge_pass.py →
§ 05Honest limitswhat we won't claim
What Hammerstein does not claim.
The numbers above are real and the harness is open. We list these
caveats here because a framework that hides its edges is itself
stupid-industrious.
The dominant component is model-dependent. For some models the system prompt does the heavy lifting; for others, the corpus. We can't yet say which carries which model in general.
A fine-tuned 7B does not match frontier-with-context on novel domains. Small-model Hammerstein is a useful artifact, not a frontier substitute.
LLM judges are not human judges. The 98.1% is preference among four blind LLM judges. Human evaluation is ongoing and will be reported separately.
Benchmarks are not deployment. The harness covers reasoning quality on bounded prompts. It does not certify safety, accuracy, or fitness for any specific production use.
The corpus is opinionated. Hammerstein-Equord doctrine is one stance on reasoning, not the only one. If you disagree with the priors, the framework will not pretend otherwise.
§ 06Get itMIT · github
Clone it. Run it on your own model and key.
Hammerstein is MIT-licensed. No account, no telemetry, no hosted dependency.
You bring the model and the API key; the framework brings the discipline.