Hammerstein
Strategic-reasoning advisor tuned to the Hammerstein-Equord officer typology.
Pause. Audit. Then ship.
Repo: github.com/lerugray/hammerstein. Hosted: hammerstein.ai. Framework essay: Von Hammerstein's Ghost, In Daily Use. The companion model: Hammerstein-7B.
Latest: "100%, then I tried to break it" covers the day I ran the benchmark, got 100%, and then ran a Diplomacy game to falsify it.
Hammerstein is a command-line advisor I run before firing any plan with a non-trivial blast radius. It takes the plan as input and returns an adversarial review: failure modes, verification gates, structural-fix candidates, a recommendation, and a counter-observation. The framework underneath is ninety years old. The execution layer is small.
What I learned on 2026-05-10
A 2,400-token system prompt, applied to any frontier model, makes that model's strategic-reasoning output preferred by blind LLM judges 100% of the time over the same model without it. I expanded the test in three directions: controlling for length bias, controlling for rubric tautology, and testing out-of-domain generalization. The headline held. I then ran a Diplomacy matched pair to try to falsify it on an adversarial multi-agent task; the wrap shaped reasoning style observably but did not change the game outcome. Both Austrias ended at 2 supply centers. The claim is bounded: the wrap is a force multiplier on reasoning output, not on every downstream task involving strategic reasoning. Full writeup at "100%, then I tried to break it".
What it is
A retrieval-augmented prompt around an operator-curated corpus of operating principles. Ask it about a plan; it pressure-tests the shape of the work against the corpus. The framework's central distinction — clever-lazy versus stupid-industrious — is encoded in the templates rather than asked for. The output reads as if the framework were watching the work.
This is not a chatbot, not a coach, not a productivity tool. It is a single structured query: here is a plan, audit it. The query returns one page of text. You read it, decide whether the audit shifts the plan, and either ship or revise.
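As a sketch of what that one query does under the hood, assuming a naive retrieval layer: the template wording, function names, and keyword-overlap scoring below are illustrative, not the repo's actual internals.

```python
# Illustrative only: retrieve relevant principles, fill the audit template.
# The template text and the keyword-overlap retrieval are assumptions.
AUDIT_TEMPLATE = """Operating principles:
{principles}

Plan under audit:
{plan}

Return one page: failure modes, verification gates, structural-fix
candidates, a recommendation, and a counter-observation."""

def retrieve(corpus: list[str], query: str, top_k: int = 8) -> list[str]:
    """Rank corpus principles by naive keyword overlap with the plan."""
    terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(terms & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_audit_prompt(plan: str, corpus: list[str]) -> str:
    """Assemble the single structured query: plan in, audit prompt out."""
    principles = retrieve(corpus, query=plan)
    return AUDIT_TEMPLATE.format(principles="\n".join(principles), plan=plan)
```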
The templates
Five templates, each shaped against a different decision moment (a thin wrapper over them is sketched after the list):
- `audit-this-plan`: the most-used. A multi-file plan goes in; failure modes, verification gates, structural-fix candidates, and a recommendation come out. This is the one wired to the `/audit` slash command in my interactive sessions.
- `scope-this-idea`: pre-plan stage. An idea goes in; the framework returns a bounded scope, the things that must stay out, and the smallest verifiable next step.
- `is-this-worth-doing`: opportunity-cost check. Returns the cheapest way to falsify the premise before committing time.
- `what-should-we-do-next`: for moments between landing one piece of work and starting the next, when the obvious next item is not actually the best one.
- `review-from-different-angle`: invokes the framework against a draft or decision through a deliberately rotated lens.
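The wrapper, for concreteness: the only interface assumed here is the `hammerstein "<plan>" --template <name>` invocation shown under "How I use it" below; the wrapper itself is a sketch, not part of the repo.

```python
# Minimal sketch: shell out to the CLI with one of the five templates.
# Assumes only the invocation form `hammerstein "<plan>" --template <name>`.
import subprocess

TEMPLATES = {
    "audit-this-plan",
    "scope-this-idea",
    "is-this-worth-doing",
    "what-should-we-do-next",
    "review-from-different-angle",
}

def run_template(text: str, template: str = "audit-this-plan") -> str:
    """Run one template against a plan/idea/draft; returns one page of text."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    result = subprocess.run(
        ["hammerstein", text, "--template", template],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```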
How I use it
The pre-flight audit rule is the load-bearing one. Before I fire any plan with multi-file scope, external dependencies, or cross-repo blast radius, I run `hammerstein "<plan>" --template audit-this-plan`. Cost is rounding error. The catch rate is not.
Recent landings the audit changed materially:
- A launch-post sequencing plan flipped from unified-Tuesday fire to phased Sunday-r/LL smoke-test → Tuesday HN, after the audit flagged single-point-failure risk on the unified path.
- A "framework adherence is in the weights" claim got the marketing word moat stripped out across every draft, after the audit pointed out that calling honest disclosure a competitive moat invites adversarial testing rather than respect.
- A 30-project portfolio scoring schema added a required `next_review_trigger` field after the audit caught that `last_reviewed` alone records history without forcing action.
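A minimal sketch of that last catch; the two field names come from the audit, everything else is an illustrative assumption.

```python
# `last_reviewed` records history; the required `next_review_trigger`
# forces a future action. The class and other names are assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class ProjectScore:
    name: str
    last_reviewed: date       # what happened; forces nothing by itself
    next_review_trigger: str  # required: the condition forcing the next look
```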
None of those catches required an LLM smarter than the framework. They required the framework, applied at the moment of plan-firing, against the specific shape of the work.
The corpus
The framework lives in a curated corpus of operating principles, drawn from a few years of working with AI tools and keeping append-only logs of where the framework fired and where it did not. The corpus is what gives the audit its specificity: not generic advice, but the particular structural moves that have repeatedly mattered. The corpus is the IP. The retrieval and templating layer is small.
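For illustration, one entry in such an append-only log might record something like the following; the field names are assumptions, not the actual corpus schema.

```python
# Illustrative record shape for the append-only log; not the repo's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class PrincipleFiring:
    date: str        # when the plan was audited
    plan_shape: str  # e.g. "launch sequencing", "schema design"
    principle: str   # the operating principle in play
    fired: bool      # did the audit actually change the plan?
    outcome: str     # what shipped, and whether the catch held up
```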
Backends
Default routing is OpenRouter (Qwen 3.6 Plus class), with DeepSeek as a fallback and Ollama as the local-only path. Per-call latency is on the order of a minute; per-call cost is on the order of a cent. The framework is the constraint, not the model.
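The routing order is simple enough to sketch; the backend callables below are stand-ins, and only the priority order (OpenRouter, then DeepSeek, then Ollama) comes from this section.

```python
# Sketch of fallback routing only; each callable would wrap one provider's
# real API. Endpoints, auth, and model ids are out of scope here.
from typing import Callable, Sequence

def route(prompt: str, backends: Sequence[tuple[str, Callable[[str], str]]]) -> str:
    """Try each backend in priority order; return the first success."""
    failures = []
    for name, call in backends:
        try:
            return call(prompt)
        except Exception as exc:  # network, auth, rate limit: fall through
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(failures))

# Priority order from this section:
# route(prompt, [("openrouter", call_openrouter),
#                ("deepseek", call_deepseek),
#                ("ollama", call_ollama)])
```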
For people who want to run the framework offline or self-host, the Hammerstein-7B model is a QLoRA on Qwen2.5-7B-Instruct that bakes the framework's voice into the weights. The CLI is the primary surface; the model is the offline companion.
Hosted: hammerstein.ai
For the wargame-specific application, there's a hosted version at hammerstein.ai. It takes a board photo and a rulebook PDF, reads what's on the board (with a confirm-or-flag step to catch misreads), and returns kriegspiel-style Auftragstaktik orders structured by section: situation, intent, main effort, supporting effort, end-state. Same framework as the CLI, applied to a different surface.
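The returned order structure maps naturally onto a small record type; this is an illustrative sketch of the sections named above, not the hosted app's actual schema.

```python
# Illustrative shape of the hosted app's output; section names come from
# the text above, the dataclasses and field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class BoardRead:
    units: dict[str, str]  # position -> unit, as read from the photo
    flagged: list[str] = field(default_factory=list)  # suspected misreads to confirm

@dataclass
class Orders:
    situation: str          # what the board shows
    intent: str             # what the force is trying to achieve
    main_effort: str        # where the weight goes
    supporting_effort: str  # what enables the main effort
    end_state: str          # what "done" looks like
```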
Where to get it
The repository is public at github.com/lerugray/hammerstein. Install via `pipx install -e .` from a clone (it is a small Python package). For the framework itself, the essay is here: Von Hammerstein's Ghost, In Daily Use.