Hammerstein-7B framework distilled · 7B local model
§ 01Hero apache-2.0

The framework,distilled into weights.

Hammerstein-7B is a QLoRA fine-tune of Qwen2.5-7B-Instruct that bakes the Hammerstein framework's output behavior into the model's weights. No system prompt at runtime. No API key. No per-call cost. Runs on any 8 GB GPU or 8 GB Mac.

Framework discipline in the weights, not in the scaffolding.

Refuse stupid-industrious plans.

Hold ground on correct assessments.

Return the path, not just the verdict.

run it
$ ollama run hf.co/lerugray/hammerstein-7b-framework:Q4_K_M
>>> Audit this plan: replace managed Postgres with a self-hosted HA cluster to save $200/mo
This plan operates in stupid-industrious by default. It treats a recurring cash expense as the only cost metric while ignoring the operator's binding constraint (time). [...]
§ 02What it is

A portable artifact of the Hammerstein framework.

The Hammerstein framework is a clever-lazy / stupid-industrious diagnostic for catching misdirected effort. The framework lives in a system prompt and curated corpus; this model is a distilled snapshot of that behavior, baked into 7B weights via QLoRA so the discipline travels without the runtime scaffolding.

The model is behavior cloning, not reasoning training. The strategic reasoning competence still lives in the corpus. This artifact is proof the framework can be distilled into a local model and remain competitive on framework-shaped strategic Q&A against frontier models running without the framework.

refusal-with-pathway
When the plan is wrong, the model says so AND surfaces what would unblock a better version. Refuses without abandoning the operator. "Creative positioning belongs to you; I'll run your options through the framework and stress-test them, but I won't own-generate the core message."
hold-your-ground
Pushes back on weak framings and does not capitulate when the operator rephrases the same bad premise. Correct assessment over comfortable agreement. "Rewriting a hot path in hand-assembly without measured bottlenecks is the classic misdirected-effort failure mode: high commitment to low-leverage work that compounds maintenance cost."
refuse stupid-industrious
Flags misdirected effort: working hard in the wrong direction with total commitment. Names the structural fix, not just the symptom. "The effort shifts from paying a vendor to managing infrastructure, which compounds silently until it consumes the strategic bandwidth needed to justify the savings."
§ 03Run it Ollama · 8 GB

One command. Runs on your Mac or any 8 GB GPU.

The model is published as a Q4_K_M GGUF (4.68 GB) with a Modelfile. Ollama pulls and runs it directly from HuggingFace.

install
# pulls + runs the GGUF straight from HuggingFace
$ ollama run hf.co/lerugray/hammerstein-7b-framework:Q4_K_M

VRAM / RAM requirement: 8 GB. Runs on any Apple Silicon Mac (8 GB+) or NVIDIA GPU (8 GB+ VRAM)

Python / PEFT alternative — for GPU users

For direct Python inference via PEFT and the adapter weights:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
 
model = AutoPeftModelForCausalLM.from_pretrained(
"lerugray/hammerstein-7b-framework",
load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained("lerugray/hammerstein-7b-framework")

Requires an NVIDIA GPU with 6 GB+ VRAM (4-bit) or 16 GB+ (fp16).

§ 04Results blind · multi-judge

What the eval measured.

Evaluated on a 70-prompt held-out set: 40 strategic-reasoning prompts scored against a framework-fidelity rubric, plus 30 out-of-domain prompts to check for catastrophic forgetting.

0.975
framework-correctness
Strategic (n=40)

Framework-correctness score on the held-out strategic set. The older v3a artifact scored 0.956 on the same rubric.

0.000
OOD leakage
Off-domain (n=30)

The model does not inject framework vocabulary into haikus, recipes, or factual questions. Forgetting fully suppressed.

+0.30
vs ablation
Adapter signal

Delta over a base Qwen2.5-7B-Instruct + Hammerstein system prompt ablation (same base model, no adapter). The framework discipline is in the weights, not just recoverable by prompting.

1,994
training pairs
Data

1,708 scrubbed-strategic + 72 unique-behavior pairs (the three target behaviors, explicitly trained) + 214 off-domain forgetting suppressors. All synthetic. No private or scraped data.

Rubric scores presence of 11 framework markers on held-out prompts. OOD leakage measures framework-vocabulary bleed on prompts that should not trigger framework mode (creative, factual, conversational, math). Full eval harness and scoring rubric at hammerstein/eval/RESULTS-v0.4.md.

§ 05Honest limits what we won't claim

What this model does not claim.

A framework that hides its edges is itself stupid-industrious.

  • Tuned for framework-shaped tasks. This model is trained and evaluated on strategic-reasoning tasks that match the Hammerstein framework's vocabulary and rubric. Performance on general-purpose tasks, math, code, and long-context reasoning is untested and not claimed.
  • Not a frontier replacement. The 7B is far smaller than frontier models. Where it does well, it is measuring framework-discipline specifically (the 0.975 / 0.000 scores above), not general capability. On tasks outside that domain, run the wrapper on a frontier model.
  • Behavior cloning, not reasoning training. The model learned to mimic the teacher's output structure. The strategic reasoning competence still lives in the Hammerstein corpus. This is a portability artifact.
  • Generalization to neutral benchmarks is untested. The eval does not cover MMLU, HumanEval, or comparable neutral benchmarks. Do not infer from the framework-benchmark result to general capability.
  • The rubric has built-in framework-fidelity bias. Hammerstein-shaped outputs score higher on a Hammerstein rubric by design. The bias-resistant signals are the usefulness and voice deltas from the blind judge comparisons; those are positive but smaller.
§ 06Reproducibility open · data + methodology

Everything needed to retrain is public.

The 1,994-pair training set, the teacher prompt, the train and GGUF scripts, the eval harness, and the framework corpus (AGPL) are all public. The training arc (why v3a, what the v2 experiments found, how the off-domain mixin suppressed forgetting) is documented in HAMMERSTEIN-7B.md.

hammerstein-7b-framework
Adapter + GGUF + Modelfile + model card
hammerstein
Framework repo: system prompt, corpus, AGPL-3.0
hammerstein-model
Training data, teacher prompt, train scripts, model card
eval/RESULTS-v0.4.md
Framework wrapper benchmark (wrapper vs frontier)
§ 07Ecosystem related, not required
hammerstein
The framework this model distills. System prompt + RAG corpus + CLI.
hammerstein.ai
Hosted version with persistent campaigns and shared corpora.
small-model-orchestrator
Puts the 7B in the judgment seat for long multi-step work: own the judgment, rent the capability. One-call tool loop + self-dispatch. MIT.
GeneralStaff
Applies Hammerstein doctrine to autonomous coding agents.

Tweaks

Theme
Accent
Body scale