This is how a dev team crockpot-cooks a diabetic model on OpenDiabetic compute, end to end. It's a slow cooker on purpose: every stage is verified before the next begins. The output is a 5-cap model — flagship-grade or it doesn't get plated. → why 5-cap-only
Worked example throughout: DiabeticAnchor-27B — the inaugural dish on the Diabetic Crockpot Cooks menu (Qwen3.6-27B base · 417,176 deeded clinical rows · 2× RTX PRO 6000 Blackwell).
clean rig → pull deeded dataset (verify SHA) → recipe → CANARY → flightsheet
→ fire the cook (best-in-class, thermal-managed) → beat-base A/B → ship to the menu
No stage is skipped. A failure at any stage means re-cook (v+1) or kill — never demote.
We never cook on a dirty rig. Full pre-flight evaluation first — inventory the GPUs, evict any slop, confirm the silicon is healthy and idle. The rigs are senior managing directors; we preserve them.
# inventory — what's loaded, what's running
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
nvidia-smi --query-gpu=index,name,temperature.gpu,ecc.errors.uncorrected.volatile.total,\
clocks_throttle_reasons.active --format=csv
# evict idle/competing models, confirm VRAM free + 0 ECC errors, THEN cook
Rule: never run before a full rig evaluation. A contended GPU spills to CPU and silently corrupts your pace and your cook.
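A minimal sketch of that pre-flight gate, assuming the two nvidia-smi reports above have been captured as CSV text. The function name `preflight_problems` and the sample rows are hypothetical; only the query fields come from the commands above. It checks resident processes and uncorrected ECC counts, and leaves temperature and throttle flags to a human read.

```python
import csv
import io

# Hypothetical captured output, for illustration only (not real rig data).
SAMPLE_APPS = "pid, process_name, used_gpu_memory [MiB]\n4242, python3, 30512 MiB\n"
SAMPLE_GPUS = (
    "index, name, temperature.gpu, ecc.errors.uncorrected.volatile.total, "
    "clocks_throttle_reasons.active\n"
    "0, GPU-A, 41, 0, 0x0000000000000001\n"
)


def preflight_problems(apps_csv: str, gpus_csv: str) -> list[str]:
    """Return a list of blockers; an empty list means the rig may cook."""
    problems = []

    # Every row under the header of the compute-apps report is a resident process.
    app_rows = list(csv.reader(io.StringIO(apps_csv), skipinitialspace=True))
    for row in app_rows[1:]:
        if row:
            problems.append(f"GPU still held by pid {row[0]} ({row[1]}, {row[2]})")

    # Columns follow the --query-gpu order: index, name, temperature, ECC, throttle.
    gpu_rows = list(csv.reader(io.StringIO(gpus_csv), skipinitialspace=True))
    for row in gpu_rows[1:]:
        if not row:
            continue
        index, name, ecc = row[0], row[1], row[3]
        # Anything other than a literal 0 (including "[N/A]") is not a confirmed zero.
        if ecc != "0":
            problems.append(f"GPU {index} ({name}): uncorrected ECC reads {ecc!r}")

    return problems


if __name__ == "__main__":
    for problem in preflight_problems(SAMPLE_APPS, SAMPLE_GPUS):
        print("BLOCKED:", problem)
```

Returning a list rather than a boolean keeps the reason for a blocked cook visible, which suits a record that gets copied into the flightsheet.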
Cooks are only as honest as their corpus. Pull from DiabeticDatasets and verify the SHA-256 against the catalog before you trust a single row. Full walkthrough on the Datasets page. The short version:
# every dataset in catalog.json carries: rows, size, sha256, deed
# download, then verify — crystal-clear means YOU check the file
python3 download_corpus.py # boto3 → Tigris, SHA-verify each file
# → "sha256 OK rows=417176 (expected 417176)" ✓ deeded + verified
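The verify step can be sketched as follows, assuming a catalog entry is a dict carrying the `sha256` field named in the comment above. This is not the actual `download_corpus.py`: `sha256_of` and `verify_file` are hypothetical helpers, and the download itself is left out.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large corpus never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, catalog_entry: dict) -> str:
    """Raise if the local file does not match the catalog; return the digest if it does."""
    actual = sha256_of(path)
    expected = catalog_entry["sha256"].lower()
    if actual != expected:
        raise ValueError(f"sha256 mismatch for {path}: got {actual}, expected {expected}")
    return actual
```

Raising on mismatch, rather than returning `False`, means a caller cannot forget to check the result before building splits.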
Then build your splits — and dedup honestly. On DiabeticAnchor-27B, a content-hash dedup caught that 5 "specialty" sets were a subset of the 417K superset: the claimed 554K was really 417,176 unique. We publish the honest number, never the inflated one. → crystal-clear
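A content-hash dedup of the kind described above might look like this. The exact hashing scheme used on DiabeticAnchor-27B is not stated here, so this sketch assumes rows are JSON-serializable dicts and treats two rows as duplicates when their canonical JSON is identical; `content_hash` and `dedup` are hypothetical names.

```python
import hashlib
import json


def content_hash(row: dict) -> str:
    """Hash the canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedup(rows: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = content_hash(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    superset = [{"q": "a", "a": "1"}, {"q": "b", "a": "2"}, {"q": "c", "a": "3"}]
    specialty = [{"a": "2", "q": "b"}, {"q": "c", "a": "3"}]  # already in the superset
    combined = superset + specialty
    print(len(combined), "claimed,", len(dedup(combined)), "unique")
```

The gap between the claimed and unique counts is the number worth publishing alongside the corpus.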
One canonical recipe, tier-adjusted. LoRA on attention + MLP only (q/k/v/o + gate/up/down), not the linear-attention/GDN state-mixers (over-engineering; the MLP already reaches every layer). Disciplined, proven, not fancy.
| Tier | LR | LoRA r / α | Epoch cap | Notes |
|---|---|---|---|---|
| 4B edge | 2e-5 | 32 / 16 | 0.7 | mobile specialist |
| 9B ops | ~1.5e-5 | 32 / 16 | 0.6 | analysis tier |
| 27B anchor | 1e-5 | 32 / 16 | 0.5 | clinician foundation |
Bigger models on bigger corpora overfit faster per example — that's why larger tiers run smaller epoch fractions. The cap is a quality choice, never a time-saver. Early-stopping + load-best-checkpoint governs the real endpoint. Framework: Unsloth (not vanilla — vanilla falls back to slow torch on Qwen's GDN layers).
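The recipe table can be carried as data, with the epoch cap turned into a step ceiling. This is a sketch under stated assumptions: the `*_proj` module names are the usual Hugging Face naming for q/k/v/o and gate/up/down and are assumed here, the effective batch size of 32 is hypothetical, and `max_steps` is not a function from the actual `cook.py`.

```python
import math
from dataclasses import dataclass

# Assumed module names for "q/k/v/o + gate/up/down"; confirm against the loaded model.
TARGET_MODULES = (
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
)


@dataclass(frozen=True)
class TierRecipe:
    learning_rate: float
    lora_r: int
    lora_alpha: int
    epoch_cap: float  # fraction of one pass over the corpus


RECIPES = {
    "4B edge": TierRecipe(2e-5, 32, 16, 0.7),
    "9B ops": TierRecipe(1.5e-5, 32, 16, 0.6),  # the table gives ~1.5e-5
    "27B anchor": TierRecipe(1e-5, 32, 16, 0.5),
}


def max_steps(rows: int, recipe: TierRecipe, effective_batch: int) -> int:
    """Upper bound on optimizer steps; early stopping may end the cook sooner."""
    return math.ceil(rows * recipe.epoch_cap / effective_batch)


if __name__ == "__main__":
    # effective_batch=32 is an illustrative assumption, not the real cook's setting.
    print(max_steps(417_176, RECIPES["27B anchor"], effective_batch=32))
```

With those assumed numbers, 417,176 rows at a 0.5 epoch cap is 208,588 examples, which rounds up to 6,519 steps at a batch of 32.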
Before any full burn, run a smoke test that exercises the entire pipeline on a tiny slice: model load → LoRA → a few steps → eval → save → merge. The canary's whole job is to fail cheap.
CUDA_VISIBLE_DEVICES=0 python3 cook.py --smoke-test # 500 rows, ~7 steps
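The fail-cheap idea can be sketched as a runner that walks the stages in order and stops at the first one that breaks. The stage callables below are placeholders, and `run_canary` is a hypothetical name; the real `--smoke-test` path inside `cook.py` is not shown in this document.

```python
from typing import Callable

Stage = tuple[str, Callable[[], None]]


def run_canary(stages: list[Stage]) -> list[str]:
    """Run stages in order; raise at the first failure, naming what already passed."""
    passed = []
    for name, stage in stages:
        try:
            stage()
        except Exception as exc:
            raise RuntimeError(
                f"canary failed at '{name}' (passed so far: {passed}): {exc}"
            ) from exc
        passed.append(name)
    return passed


if __name__ == "__main__":
    # Placeholder stages mirroring the pipeline order; real ones would touch the GPU.
    noop = lambda: None
    order = ["model load", "LoRA", "train steps", "eval", "save", "merge"]
    print(run_canary([(name, noop) for name in order]))
```

Naming the failed stage and the stages that passed gives the flightsheet's canary section something concrete to record.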
On DiabeticAnchor-27B the canary earned its keep. It caught two real Blackwell (sm_120) blockers before we burned a single hour of the full cook:
| Caught | Fix |
|---|---|
| flash_attn/xformers have no sm_120 kernels → cudaErrorNoKernelImageForDevice | Load with attn_implementation="sdpa" (torch-native, works on Blackwell); clear unsloth_compiled_cache |
| transformers-v5 meta-tensor lm_head (248K vocab) → Cannot copy out of meta tensor in fix_untrained_tokens | Skip that check via UNSLOTH_IGNORED_TOKENIZER_NAMES (safe: corpus adds no new tokens) |
Lesson logged in the docs: a small-dataset canary misses scale-dependent bugs (the meta-tensor crash only fired on the full 417K token scan). For a flagship, also smoke-test trainer-init on the full data.
Every cook gets a flightsheet — the formal record an auditor, customer, or future engineer can read to know exactly what happened. Five sections: pre-flight (intent, config, corpus SHA, hardware, projections) · canary (smoke results, fixes) · in-flight (loss/eval/thermal receipts) · post-flight (final eval, beat-base A/B) · SR-hack sign-off (3 stages). The flightsheet hash becomes the cook's defendable receipt. No flightsheet, no cook.
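One way to derive a flightsheet hash, sketched under assumptions: the flightsheet is held as a JSON-serializable dict with one key per section, and the key names in `SECTIONS` are hypothetical labels for the five sections listed above. The real flightsheet format is not specified in this document.

```python
import hashlib
import json

# Hypothetical key names for the five sections described in the text.
SECTIONS = ("pre_flight", "canary", "in_flight", "post_flight", "sr_hack_signoff")


def flightsheet_hash(sheet: dict) -> str:
    """Refuse incomplete sheets, then hash a canonical JSON serialization."""
    missing = [name for name in SECTIONS if name not in sheet]
    if missing:
        raise ValueError(f"flightsheet incomplete, missing sections: {missing}")
    # Sorted keys and fixed separators make the digest independent of dict order.
    canonical = json.dumps(sheet, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonical serialization matters here: without sorted keys, two byte-different dumps of the same flightsheet would produce different receipts.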
CUDA_VISIBLE_DEVICES=0 nohup python3 cook.py > cook.log 2>&1 & # detached; survives the session
We do not trim a cook for time, energy, or cost: we own the rigs and the electrons, and quality is the only axis. We do manage thermals to preserve the hardware. On DiabeticAnchor-27B the Blackwell sat pinned at 90 °C at its 600 W ceiling during a ~24 h burn, so we capped power for longevity, not speed:
sudo nvidia-smi -i 0 -pl 500 # 600W→500W : 90°C → 86°C, cook uninterrupted
Watch the loss trajectory in blocks; eval every ~100 steps; let early-stopping take the best checkpoint.
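The early-stopping rule can be sketched as patience-based selection over the periodic eval losses. The patience of 3 evals and the loss values are illustrative assumptions, not numbers from the DiabeticAnchor-27B cook, and `pick_best_checkpoint` is a hypothetical helper rather than the trainer's own callback.

```python
from typing import Optional


def pick_best_checkpoint(
    evals: list[tuple[int, float]], patience: int = 3, min_delta: float = 0.0
) -> tuple[Optional[int], Optional[int]]:
    """Return (best_step, stop_step); stop_step is None if patience never ran out."""
    best_step, best_loss = None, float("inf")
    stale = 0
    for step, loss in evals:
        if loss < best_loss - min_delta:
            # New best eval loss: remember the checkpoint and reset the counter.
            best_step, best_loss, stale = step, loss, 0
        else:
            stale += 1
            if stale >= patience:
                return best_step, step
    return best_step, None


if __name__ == "__main__":
    # Illustrative eval losses at 100-step intervals (made up for this sketch).
    history = [(100, 1.20), (200, 1.05), (300, 0.98), (400, 0.99), (500, 1.01), (600, 1.00)]
    print(pick_best_checkpoint(history))
```

In this made-up history the best checkpoint is step 300 and the cook stops at step 600, after three evals without improvement.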
A cook is not "done" because the loss fell. It's done when it beats the base model on a held-out A/B with deterministic gates (structure, concept coverage, format): thinking mode off, deterministic decoding, rule-based scoring, no LLM judge. If it doesn't beat base, it's killed or re-cooked. The A/B result is a receipt in the flightsheet.
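A rule-scored A/B of this shape could be sketched as below. The three gates mirror the ones named above, but their concrete definitions are assumptions made for illustration: required section labels for structure, required terms for concept coverage, and "non-empty with no leftover `<think>` tag" for format. The sample case and outputs are invented.

```python
def gate_score(output: str, sections: list[str], concepts: list[str]) -> int:
    """Score one output from 0 to 3: one point per deterministic gate passed."""
    text = output.lower()
    structure = all(label.lower() in text for label in sections)
    coverage = all(term.lower() in text for term in concepts)
    well_formed = bool(output.strip()) and "<think>" not in text
    return int(structure) + int(coverage) + int(well_formed)


def beats_base(
    cases: list[dict], base_outputs: list[str], cook_outputs: list[str]
) -> tuple[bool, int, int]:
    """Compare summed gate scores on the same held-out cases; ties do not pass."""
    base = sum(
        gate_score(out, case["sections"], case["concepts"])
        for case, out in zip(cases, base_outputs)
    )
    cook = sum(
        gate_score(out, case["sections"], case["concepts"])
        for case, out in zip(cases, cook_outputs)
    )
    return cook > base, base, cook


if __name__ == "__main__":
    # Invented held-out case and outputs, for illustration only.
    cases = [{"sections": ["Assessment:", "Plan:"], "concepts": ["hypoglycemia"]}]
    base_out = ["<think>hmm</think> Plan: rest"]
    cook_out = ["Assessment: hypoglycemia risk. Plan: recheck glucose in 15 minutes."]
    print(beats_base(cases, base_out, cook_out))
```

Because every gate is a plain string rule, the same inputs always give the same score, which is what lets the result stand as a receipt.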
Only a beat-base-proven, flightsheet-signed cook earns a plate on the Diabetic Crockpot Cooks menu and a published home on DiabeticModels / the model org — weights, flightsheet, and A/B receipt together. Show + host. No black box. The model then flows down to the LocalDiabetic edge brain on a diabetic's own box — and PHI never flows back. → The Firewall