Crockpot Cook — the full how-to

This is how a dev team crockpot-cooks a diabetic model on OpenDiabetic compute, end to end. It's a slow cooker on purpose: every stage is verified before the next begins. The output is a 5-cap model — flagship-grade or it doesn't get plated. → why 5-cap-only

Worked example throughout: DiabeticAnchor-27B — the inaugural dish on the Diabetic Crockpot Cooks menu (Qwen3.6-27B base · 417,176 deeded clinical rows · 2× RTX PRO 6000 Blackwell).

The pipeline at a glance

clean rig → pull deeded dataset (verify SHA) → recipe → CANARY → flightsheet
   → fire the cook (best-in-class, thermal-managed) → beat-base A/B → ship to the menu

No stage is skipped. A failure at any stage means re-cook (v+1) or kill — never demote.

Step 1 · Clean the rig

We never cook on a dirty rig. Full pre-flight evaluation first — inventory the GPUs, evict any slop, confirm the silicon is healthy and idle. The rigs are senior managing directors; we preserve them.

# inventory — what's loaded, what's running
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
nvidia-smi --query-gpu=index,name,temperature.gpu,ecc.errors.uncorrected.volatile.total,clocks_throttle_reasons.active --format=csv

# evict idle/competing models, confirm VRAM free + 0 ECC errors, THEN cook

Rule: never start a cook before a full rig evaluation. A contended GPU can spill work to the CPU, which silently slows the run and compromises the cook.
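
The pre-flight gate can be scripted so the rule is enforced rather than remembered. What follows is a minimal sketch, not the team's actual tooling: `parse_gpu_csv`, `rig_problems`, `check_rig`, and the 60 °C idle threshold are all hypothetical names and values chosen for illustration. It uses only `nvidia-smi` query fields that already appear in the commands above.

```python
import csv
import io
import subprocess

GPU_FIELDS = ["index", "name", "temperature.gpu", "ecc.errors.uncorrected.volatile.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(GPU_FIELDS, (cell.strip() for cell in row))) for row in reader if row]


def rig_problems(gpus, running_apps, max_idle_temp_c=60):
    """Return reasons the rig is not clean; an empty list means clear to cook."""
    problems = [f"process still on GPU: {app}" for app in running_apps]
    for gpu in gpus:
        ecc = gpu["ecc.errors.uncorrected.volatile.total"]
        temp = gpu["temperature.gpu"]
        # Non-numeric values such as "[N/A]" (ECC unsupported or disabled) are skipped.
        if ecc.isdigit() and int(ecc) > 0:
            problems.append(f"GPU {gpu['index']}: {ecc} uncorrected ECC errors")
        # The idle-temperature threshold is an illustrative assumption.
        if temp.isdigit() and int(temp) > max_idle_temp_c:
            problems.append(f"GPU {gpu['index']}: {temp} C at idle")
    return problems


def check_rig():
    """Query the local driver via nvidia-smi and return the list of problems."""
    def smi(*args):
        result = subprocess.run(["nvidia-smi", *args, "--format=csv,noheader"],
                                check=True, capture_output=True, text=True)
        return result.stdout

    gpus = parse_gpu_csv(smi("--query-gpu=" + ",".join(GPU_FIELDS)))
    apps = [line
            for line in smi("--query-compute-apps=pid,process_name,used_memory").splitlines()
            if line.strip()]
    return rig_problems(gpus, apps)
```

Calling `check_rig()` on the rig and refusing to launch unless it returns an empty list turns the rule into a hard gate. The parsing and the policy are kept in separate functions so the policy can be tested without a GPU.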

Step 2 · Pull a deeded dataset (and verify it yourself)

Cooks are only as honest as their corpus. Pull from DiabeticDatasets and verify the SHA-256 against the catalog before you trust a single row. Full walkthrough on the Datasets page. The short version:

# every dataset in catalog.json carries: rows, size, sha256, deed
# download, then verify — crystal-clear means YOU check the file
python3 download_corpus.py          # boto3 → Tigris, SHA-verify each file
# → "sha256 OK  rows=417176 (expected 417176)"  ✓ deeded + verified
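
For readers who want to see what "verify it yourself" involves, here is a hedged sketch of the check. It is not the contents of `download_corpus.py`. It assumes each dataset is a JSONL file (one row per non-empty line) and that a catalog entry is a dict with `sha256` and `rows` keys, matching the fields named in the comment above. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large corpora never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, entry):
    """Compare a downloaded file with its catalog entry; raise on any mismatch."""
    path = Path(path)
    actual_sha = sha256_of(path)
    if actual_sha != entry["sha256"]:
        raise ValueError(f"{path.name}: sha256 mismatch (got {actual_sha})")
    # Assumption: JSONL layout, so the row count is the number of non-empty lines.
    with open(path, "rb") as handle:
        rows = sum(1 for line in handle if line.strip())
    if rows != entry["rows"]:
        raise ValueError(f"{path.name}: rows={rows} (expected {entry['rows']})")
    return rows
```

The hash is checked before the row count on purpose: if the bytes are wrong, nothing else about the file is worth reading.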

Then build your splits and dedup honestly. On DiabeticAnchor-27B, a content-hash dedup revealed that 5 "specialty" sets were subsets of the 417K superset: the claimed 554K rows were really 417,176 unique rows. We publish the honest number, never the inflated one. → crystal-clear
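
A content-hash dedup of the kind described here can be sketched in a few lines. This is an illustration under assumptions, not the project's script: rows are taken to be JSON-serializable dicts, and `content_hash` and `dedup` are hypothetical names. Reporting how many new rows each source contributes is what exposes a "specialty" set that is wholly contained in the superset.

```python
import hashlib
import json


def content_hash(row):
    """Hash a row's content so that key order does not affect the result."""
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedup(sources):
    """Merge named row lists; return unique rows and the new-row count per source."""
    seen = set()
    unique_rows = []
    new_per_source = {}
    for name, rows in sources.items():
        added = 0
        for row in rows:
            key = content_hash(row)
            if key not in seen:
                seen.add(key)
                unique_rows.append(row)
                added += 1
        new_per_source[name] = added
    return unique_rows, new_per_source
```

A source that reports zero new rows adds nothing to the corpus, however many rows it claims on paper. This sketch matches exact content only; near-duplicates (whitespace or casing differences) would need normalization first.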

Step 3 · The recipe (gold standard)

One canonical recipe, tier-adjusted. LoRA on attention + MLP only (q/k/v/o + gate/up/down), not the linear-attention/GDN state-mixers (that would be over-engineering: the MLP adapters already reach every layer). Disciplined, proven, not fancy.

Tier	LR	LoRA r / α	Epoch cap	Notes
4B edge	2e-5	32 / 16	0.7	mobile specialist
9B ops	~1.5e-5	32 / 16	0.6	analysis tier
27B anchor	1e-5	32 / 16	0.5	clinician foundation

Bigger models on bigger corpora overfit faster per example — that's why larger tiers run smaller epoch fractions. The cap is a quality choice, never a time-saver. Early-stopping + load-best-checkpoint governs the real endpoint. Framework: Unsloth (not vanilla — vanilla falls back to slow torch on Qwen's GDN layers).
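
The table can be written down as data, and the epoch cap translated into a step budget. This is a sketch under assumptions: the `*_proj` module names follow the common Hugging Face naming for q/k/v/o and gate/up/down and may differ in the actual checkpoint, the 9B learning rate is approximate in the table, and `Recipe`, `RECIPES`, and `max_steps` are hypothetical names.

```python
import math
from dataclasses import dataclass

# Assumed module names for attention (q/k/v/o) and MLP (gate/up/down) projections.
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]


@dataclass(frozen=True)
class Recipe:
    learning_rate: float
    lora_r: int
    lora_alpha: int
    epoch_cap: float


RECIPES = {
    "4B edge": Recipe(2e-5, 32, 16, 0.7),
    "9B ops": Recipe(1.5e-5, 32, 16, 0.6),  # the table lists this LR as approximate
    "27B anchor": Recipe(1e-5, 32, 16, 0.5),
}


def max_steps(train_rows, recipe, effective_batch_size):
    """Optimizer steps needed to reach the epoch cap: an upper bound, not the endpoint."""
    return math.ceil(train_rows * recipe.epoch_cap / effective_batch_size)
```

For example, 1,000 training rows at the 27B cap of 0.5 with a hypothetical effective batch size of 8 gives ceil(62.5) = 63 steps. Early stopping can still end the cook before that bound is reached.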

Step 4 · Canary — then cook

Before any full burn, run a smoke test that exercises the entire pipeline on a tiny slice: model load → LoRA → a few steps → eval → save → merge. The canary's whole job is to fail cheap.

CUDA_VISIBLE_DEVICES=0 python3 cook.py --smoke-test   # 500 rows, ~7 steps

On DiabeticAnchor-27B the canary earned its keep: it caught two real Blackwell (sm_120) blockers before we burned a single hour of the full cook:

Caught	Fix
`flash_attn`/`xformers` have no sm_120 kernels → `cudaErrorNoKernelImageForDevice`	Load with `attn_implementation="sdpa"` (torch-native, works on Blackwell); clear `unsloth_compiled_cache`
transformers-v5 meta-tensor lm_head (248K vocab) → `Cannot copy out of meta tensor` in `fix_untrained_tokens`	Skip that check via `UNSLOTH_IGNORED_TOKENIZER_NAMES` (safe — corpus adds no new tokens)

Lesson logged in the docs: a small-dataset canary misses scale-dependent bugs (the meta-tensor crash only fired on the full 417K token scan). For a flagship, also smoke-test trainer-init on the full data.
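
One way to structure a canary so that it fails cheap and says where it failed is a staged runner. This is a sketch only; how `cook.py --smoke-test` is actually built is not shown in this guide, and `run_canary` is a hypothetical name. Each stage is a plain callable, so the same runner can drive the 500-row slice or a trainer-init pass over the full data.

```python
import time
import traceback


def run_canary(stages):
    """Run (name, callable) stages in order; stop and report at the first failure."""
    report = []
    for name, stage in stages:
        started = time.perf_counter()
        try:
            stage()
        except Exception:
            # Keep a short traceback so the failing stage is obvious in the flightsheet.
            report.append({"stage": name, "ok": False,
                           "error": traceback.format_exc(limit=1)})
            return False, report
        report.append({"stage": name, "ok": True,
                       "seconds": time.perf_counter() - started})
    return True, report
```

The stage list would mirror the pipeline named above (model load, LoRA, a few steps, eval, save, merge). Later stages are never run after a failure, so a broken kernel or tokenizer check costs minutes rather than hours.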

Step 5 · The flightsheet

Every cook gets a flightsheet — the formal record an auditor, customer, or future engineer can read to know exactly what happened. Five sections: pre-flight (intent, config, corpus SHA, hardware, projections) · canary (smoke results, fixes) · in-flight (loss/eval/thermal receipts) · post-flight (final eval, beat-base A/B) · SR-hack sign-off (3 stages). The flightsheet hash becomes the cook's defendable receipt. No flightsheet, no cook.
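
A flightsheet hash is only a defendable receipt if the same content always produces the same hash. The sketch below shows one way to get that property, assuming the flightsheet is stored as a JSON-serializable dict. The section keys and the function name are hypothetical; the real flightsheet schema is not given here.

```python
import hashlib
import json

# Hypothetical keys for the five sections listed above.
REQUIRED_SECTIONS = ("pre_flight", "canary", "in_flight", "post_flight", "sign_off")


def flightsheet_hash(sheet):
    """SHA-256 over a canonical JSON encoding, so key order cannot change the hash."""
    missing = [name for name in REQUIRED_SECTIONS if name not in sheet]
    if missing:
        raise ValueError(f"flightsheet incomplete, missing: {missing}")
    canonical = json.dumps(sheet, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and fixing the separators removes the two most common sources of spurious hash differences. Refusing to hash an incomplete sheet enforces "no flightsheet, no cook" at the point where the receipt is issued.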

Step 6 · Fire the cook — best-in-class, thermal-managed

CUDA_VISIBLE_DEVICES=0 nohup python3 cook.py > cook.log 2>&1 &   # detached; survives the session

We do not trim a cook for time, energy, or cost — we own the rigs and the electrons; quality is the only axis. We do manage thermals to preserve the hardware. On DiabeticAnchor-27B the Blackwell pinned 90 °C at its 600 W ceiling for a ~24 h burn, so we capped power for longevity — not speed:

sudo nvidia-smi -i 0 -pl 500     # 600W→500W : 90°C → 86°C, cook uninterrupted

Watch the loss trajectory in blocks; eval every ~100 steps; let early-stopping take the best checkpoint.
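
The "let early-stopping take the best checkpoint" logic reduces to a small piece of bookkeeping. The class below is an illustrative sketch, not the code in `cook.py`; the class name, `patience`, and `min_delta` are assumptions.

```python
class BestCheckpointTracker:
    """Track the lowest eval loss; signal a stop after `patience` evals without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_step = None
        self.stale_evals = 0

    def update(self, step, eval_loss):
        """Record one evaluation; return True when training should stop."""
        if eval_loss < self.best_loss - self.min_delta:
            # New best: this is the checkpoint to reload at the end.
            self.best_loss = eval_loss
            self.best_step = step
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        return self.stale_evals >= self.patience
```

If the cook runs on the Hugging Face `Trainer` (an assumption about `cook.py`), the same behaviour is available through `EarlyStoppingCallback` together with `load_best_model_at_end=True`, so a hand-rolled tracker like this is mainly useful for reading logs or for custom loops.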

Step 7 · Beat-base or kill

A cook is not "done" because the loss fell. It's done when it beats the base model on a held-out A/B with deterministic gates (structure, concept coverage, format): thinking mode off, deterministic decoding, scoring by rule, no LLM judge. If it doesn't beat base, it's killed or re-cooked. The A/B result is a receipt in the flightsheet.
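
Rule-based scoring of this kind can be sketched as follows. The gates here are deliberately simple stand-ins: the real gate definitions are not published in this guide, so the substring checks, the `<think>` tag test (assumed from Qwen-style thinking output), and the function names are all illustrative assumptions.

```python
def score_answer(text, required_sections, required_concepts):
    """Score one answer by rule: the fraction of structure, concept, and format gates passed."""
    lowered = text.lower()
    gates = [section.lower() in lowered for section in required_sections]  # structure
    gates += [concept.lower() in lowered for concept in required_concepts]  # coverage
    # Format gate (assumed): non-empty output with no leaked thinking tag.
    gates.append(bool(text.strip()) and "<think>" not in text)
    return sum(gates) / len(gates)


def beats_base(base_answers, tuned_answers, prompts, min_margin=0.0):
    """Compare mean rule scores for base and tuned answers on the same held-out prompts."""
    def mean_score(answers):
        scores = [score_answer(answer, prompt["sections"], prompt["concepts"])
                  for answer, prompt in zip(answers, prompts)]
        return sum(scores) / len(scores)

    base, tuned = mean_score(base_answers), mean_score(tuned_answers)
    return tuned > base + min_margin, base, tuned
```

Because every gate is a deterministic function of the text, rerunning the A/B on the same decoded outputs always gives the same verdict, which is what lets the result serve as a receipt.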

Step 8 · Ship to the menu

Only a beat-base-proven, flightsheet-signed cook earns a plate on the Diabetic Crockpot Cooks menu and a published home on DiabeticModels / the model org — weights, flightsheet, and A/B receipt together. Show + host. No black box. The model then flows down to the LocalDiabetic edge brain on a diabetic's own box — and PHI never flows back. → The Firewall