Framework-correctness score on the held-out strategic set. The older v3a artifact scored 0.956 on the same rubric.
The framework,distilled into weights.
Hammerstein-7B is a QLoRA fine-tune of Qwen2.5-7B-Instruct that bakes the Hammerstein framework's output behavior into the model's weights. No system prompt at runtime. No API key. No per-call cost. Runs on any 8 GB GPU or 8 GB Mac.
Framework discipline in the weights, not in the scaffolding.
Refuse stupid-industrious plans.
Hold ground on correct assessments.
Return the path, not just the verdict.
A portable artifact of the Hammerstein framework.
The Hammerstein framework is a clever-lazy / stupid-industrious diagnostic for catching misdirected effort. The framework lives in a system prompt and curated corpus; this model is a distilled snapshot of that behavior, baked into 7B weights via QLoRA so the discipline travels without the runtime scaffolding.
The model is behavior cloning, not reasoning training. The strategic reasoning competence still lives in the corpus. This artifact is proof the framework can be distilled into a local model and remain competitive on framework-shaped strategic Q&A against frontier models running without the framework.
One command. Runs on your Mac or any 8 GB GPU.
The model is published as a Q4_K_M GGUF (4.68 GB) with a Modelfile. Ollama pulls and runs it directly from HuggingFace.
VRAM / RAM requirement: 8 GB. Runs on any Apple Silicon Mac (8 GB+) or NVIDIA GPU (8 GB+ VRAM)
› Python / PEFT alternative — for GPU users
For direct Python inference via PEFT and the adapter weights:
Requires an NVIDIA GPU with 6 GB+ VRAM (4-bit) or 16 GB+ (fp16).
What the eval measured.
Evaluated on a 70-prompt held-out set: 40 strategic-reasoning prompts scored against a framework-fidelity rubric, plus 30 out-of-domain prompts to check for catastrophic forgetting.
The model does not inject framework vocabulary into haikus, recipes, or factual questions. Forgetting fully suppressed.
Delta over a base Qwen2.5-7B-Instruct + Hammerstein system prompt ablation (same base model, no adapter). The framework discipline is in the weights, not just recoverable by prompting.
1,708 scrubbed-strategic + 72 unique-behavior pairs (the three target behaviors, explicitly trained) + 214 off-domain forgetting suppressors. All synthetic. No private or scraped data.
Rubric scores presence of 11 framework markers on held-out prompts. OOD leakage measures framework-vocabulary bleed on prompts that should not trigger framework mode (creative, factual, conversational, math). Full eval harness and scoring rubric at hammerstein/eval/RESULTS-v0.4.md.
What this model does not claim.
A framework that hides its edges is itself stupid-industrious.
- Tuned for framework-shaped tasks. This model is trained and evaluated on strategic-reasoning tasks that match the Hammerstein framework's vocabulary and rubric. Performance on general-purpose tasks, math, code, and long-context reasoning is untested and not claimed.
- Not a frontier replacement. The 7B is far smaller than frontier models. Where it does well, it is measuring framework-discipline specifically (the 0.975 / 0.000 scores above), not general capability. On tasks outside that domain, run the wrapper on a frontier model.
- Behavior cloning, not reasoning training. The model learned to mimic the teacher's output structure. The strategic reasoning competence still lives in the Hammerstein corpus. This is a portability artifact.
- Generalization to neutral benchmarks is untested. The eval does not cover MMLU, HumanEval, or comparable neutral benchmarks. Do not infer from the framework-benchmark result to general capability.
- The rubric has built-in framework-fidelity bias. Hammerstein-shaped outputs score higher on a Hammerstein rubric by design. The bias-resistant signals are the usefulness and voice deltas from the blind judge comparisons; those are positive but smaller.
Everything needed to retrain is public.
The 1,994-pair training set, the teacher prompt, the train and GGUF scripts, the eval harness, and the framework corpus (AGPL) are all public. The training arc (why v3a, what the v2 experiments found, how the off-domain mixin suppressed forgetting) is documented in HAMMERSTEIN-7B.md.
Framework: AGPL-3.0 (github.com/lerugray/hammerstein)