Datasets — find, verify, pull

DiabeticDatasets is the cookbook: open diabetic & medical datasets, donated free to builders — no charge, no strings. Every dataset is crystal-clear and deeded. You never have to take our word for what's in a file.

No PHI, ever. These are open education + synthetic instruction sets and cited facts — never patient records. → The Firewall

Crystal-clear: verify before you trust

"Open" doesn't mean "take our word." Every catalog entry shows its schema, real sample rows, exact size, and SHA-256 so you can verify the file yourself before a single row enters a cook. Metadata-only "open" is just trust-me-bro with extra steps. We reject it.

// catalog.json — one entry (shape)
{
  "slug": "medical-internal-medicine",
  "rows": 98986,
  "size": "252.3 MB",
  "sha256": "38fb516b1bf799ff0a22994614e3f69df8c7544e993698e297815e671d5c7cc3",
  "full_download": { "url": "https://data.diabeticdatasets.com/medical/...-full.jsonl", "rows": 98986 },
  "deed": "Swarm & Bee Title Deed v1.0 · two-judge tribunal · Hedera-anchored"
}
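
A catalog entry can be sanity-checked locally before anything is downloaded. The sketch below assumes the entry shape shown above; `check_entry` is a hypothetical helper (not part of any published tooling), and the three checks are illustrative rather than exhaustive.

```python
import re

def check_entry(entry: dict) -> list:
    """Return a list of problems found in one catalog entry (empty list = looks sane)."""
    problems = []
    # A SHA-256 digest is 32 bytes, i.e. exactly 64 hex characters.
    if not re.fullmatch(r"[0-9a-f]{64}", str(entry.get("sha256", ""))):
        problems.append("sha256 is not 64 lowercase hex characters")
    # The top-level row count should agree with the full download's row count.
    full = entry.get("full_download") or {}
    if entry.get("rows") != full.get("rows"):
        problems.append("rows does not match full_download.rows")
    if not str(full.get("url", "")).startswith("https://"):
        problems.append("full_download.url is not an https URL")
    return problems

# Hypothetical entry with placeholder values, for illustration only.
entry = {
    "slug": "example",
    "rows": 10,
    "sha256": "0" * 64,
    "full_download": {"url": "https://example.invalid/example-full.jsonl", "rows": 10},
}
print(check_entry(entry))
```

This only checks that the entry is well-formed. It says nothing about the file itself, which still has to be hashed after download.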

The format

Training rows are chat-format JSONL — system / user / assistant — ready for SFT:

{"messages":[
  {"role":"system","content":"You are a board-certified neuroradiologist. ..."},
  {"role":"user","content":"..."},
  {"role":"assistant","content":"..."}
]}
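
A quick structural check on each line catches malformed rows before they reach a cook. This is a minimal sketch, assuming one JSON object per line and the three roles named above; `is_chat_row` and `ROLES` are hypothetical names introduced here.

```python
import json

ROLES = {"system", "user", "assistant"}

def is_chat_row(line: str) -> bool:
    """True if `line` is a JSON object with a non-empty `messages` list of role/content dicts."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return False
    msgs = row.get("messages") if isinstance(row, dict) else None
    if not isinstance(msgs, list) or not msgs:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in ROLES
        and isinstance(m.get("content"), str)
        for m in msgs
    )

sample = '{"messages":[{"role":"user","content":"..."},{"role":"assistant","content":"..."}]}'
print(is_chat_row(sample))
```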

Cited patient-facing sets may ship as {instruction, response} — normalize to messages at build time (see Crockpot Cook → Step 2).
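
The normalization can be sketched as follows. This is one plausible mapping, not necessarily the one Crockpot Cook uses: it assumes `instruction` becomes the user turn and `response` the assistant turn, with an optional system prompt supplied by the caller. `to_messages` is a hypothetical helper.

```python
import json

def to_messages(row: dict, system: str = "") -> dict:
    """Convert an {instruction, response} row to chat format; chat rows pass through unchanged."""
    if "messages" in row:
        return row
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": row["instruction"]})
    messages.append({"role": "assistant", "content": row["response"]})
    return {"messages": messages}

line = '{"instruction": "...", "response": "..."}'   # placeholder row
print(json.dumps(to_messages(json.loads(line))))
```

A row missing either field raises `KeyError`, which is deliberate here: a silent skip would change the row count without telling you.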

Pull it — and SHA-verify every file

Full datasets live in object storage (Tigris / S3-compatible). Pull with credentials, then verify each file's SHA-256 against the catalog. A mismatch is a hard stop.

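import hashlib, os
import boto3
# ENDPOINT, KEY, SECRET, BUCKET: load these from your .env (never commit them)
s3 = boto3.client("s3", endpoint_url=ENDPOINT, aws_access_key_id=KEY, aws_secret_access_key=SECRET)
for m in manifest:                       # list of {key, sha256, rows} built from catalog.json
    dst = os.path.basename(m["key"])     # local path for the download
    s3.download_file(BUCKET, m["key"], dst)
    with open(dst, "rb") as f:
        got = hashlib.file_digest(f, "sha256").hexdigest()   # Python 3.11+
    if got != m["sha256"]:               # hard stop; unlike assert, this survives python -O
        raise SystemExit(f"SHA MISMATCH, refuse: {m['key']}")
    with open(dst, "rb") as f:
        rows = sum(1 for line in f if line.strip())          # one JSON object per line
    print(f"sha256 OK  rows={rows} (expected {m['rows']})")   # ✓ deeded + verified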

Tip: the same logical dataset can appear in more than one bucket/branded domain — list the bucket and confirm the object exists before trusting a catalog URL. (On DiabeticAnchor-27B the 417K superset lived in the branded-domain bucket, not the default one — the catalog URL alone would have 404'd.)
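
The existence check in the tip can be sketched like this. It assumes a boto3-style S3 client and uses `list_objects_v2` with the full key as the prefix; `find_bucket` and the bucket names in the usage note are hypothetical.

```python
def find_bucket(s3, buckets, key):
    """Return the first bucket in `buckets` holding an object named exactly `key`, else None."""
    for bucket in buckets:
        # list_objects_v2 omits "Contents" entirely when nothing matches the prefix.
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=key)
        if any(obj["Key"] == key for obj in resp.get("Contents", [])):
            return bucket
    return None
```

Usage would look like `find_bucket(s3, [DEFAULT_BUCKET, BRANDED_BUCKET], m["key"])`, downloading only when the result is not `None`. Listing by prefix can also return longer keys that merely start with `key`, so the comparison is an exact match.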

Deeded provenance

Every dataset carries a deed — a Swarm & Bee Title Deed: graded by a two-judge tribunal, Merkle-rooted, and anchored on a public ledger. The deed is how an outside auditor verifies what the data is and where it came from, independent of us. Verifiability is the moat.

Dedup honestly

When you combine sets, dedup by content hash and publish the true unique count — never the sum of the parts. Superset/subset overlaps are common (a broad "domain-expert" set often contains the specialty sets). Claiming the inflated total is the opposite of crystal-clear. → Doctrine
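
A minimal sketch of content-hash dedup, assuming rows are JSON objects and that "same content" means the same canonical JSON (sorted keys, no insignificant whitespace). The canonicalization rule is an assumption; a stricter or looser definition of a duplicate would change the count. `content_hash` and `dedup` are hypothetical names, and the rows are toy placeholders.

```python
import hashlib
import json

def content_hash(row: dict) -> str:
    """Hash a row's canonical JSON so key order and spacing don't create false uniques."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedup(*datasets):
    """Merge datasets, keeping the first copy of each row. Returns (unique_rows, total_seen)."""
    seen, unique, total = set(), [], 0
    for rows in datasets:
        for row in rows:
            total += 1
            h = content_hash(row)
            if h not in seen:
                seen.add(h)
                unique.append(row)
    return unique, total

broad = [{"messages": [{"role": "user", "content": "a"}]},
         {"messages": [{"role": "user", "content": "b"}]}]
specialty = [{"messages": [{"content": "b", "role": "user"}]}]   # same row, different key order
unique, total = dedup(broad, specialty)
print(f"sum of parts = {total}, true unique = {len(unique)}")   # sum of parts = 3, true unique = 2
```

The number to publish is `len(unique)`, not `total`.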
