
Commit 42c8f4a

docs: update design spec — Qwen 3.5-4B vs Gemma 4 E4B evaluation
Replace single-family model comparison (Qwen 4B vs 9B) with cross-family comparison (Qwen 3.5-4B vs Gemma 4 E4B). Gemma 4 E4B has native structured JSON output and 128K context; Qwen 3.5-4B is the proven D4BL baseline.

Key changes:
- Add model selection section with head-to-head comparison table
- Update notebook config from MODEL_SIZES to MODELS list
- Switch from QLoRA to bf16 LoRA (Unsloth discourages QLoRA on Qwen 3.5)
- Update Ollama naming, env vars, cost estimates, success criteria
1 parent 88f5bba commit 42c8f4a

File tree

1 file changed (+55, −20 lines)


docs/superpowers/specs/2026-04-02-fine-tuning-student-explainability-design.md

Lines changed: 55 additions & 20 deletions
@@ -1,6 +1,6 @@
 # Design Spec: Fine-Tuning for Student Explainability
 
-**Date:** 2026-04-02
+**Date:** 2026-04-02 (updated 2026-04-03)
 **Epic label:** `fine-tuning: student-explainability`
 **Epic branch:** `fine-tuning/student-explainability`
 **Status:** Draft
@@ -9,7 +9,7 @@
 
 ## 1. Goal
 
-Fine-tune a small language model (Qwen 3.5) on Bishop State domain data to replace GPT-4o-mini for three inference tasks in the dashboard. The primary value is improved explainability: advisors get SHAP-grounded, institution-aware narratives instead of templated rule-engine output. Secondary benefits include FERPA compliance (all inference on-premises), offline deployment, and institutional scalability.
+Fine-tune a small language model on Bishop State domain data to replace GPT-4o-mini for three inference tasks in the dashboard. Two candidate models will be evaluated head-to-head: **Qwen 3.5-4B** (proven by D4BL) and **Gemma 4 E4B** (native structured JSON output). The primary value is improved explainability: advisors get SHAP-grounded, institution-aware narratives instead of templated rule-engine output. Secondary benefits include FERPA compliance (all inference on-premises), offline deployment, and institutional scalability.
 
 ### Tasks to Fine-Tune
 
@@ -24,6 +24,34 @@ Fine-tune a small language model (Qwen 3.5) on Bishop State domain data to repla
 - Query Analyzer (NL → SQL) — high risk, deferred to future epic
 - Model serving infrastructure (RunPod, dedicated GPU hosting) — use local Ollama for now
 
+### Model Selection: Qwen 3.5-4B vs Gemma 4 E4B
+
+Two candidate models will be trained and evaluated. The winner is selected based on ship criteria metrics.
+
+| | **Qwen 3.5-4B** | **Gemma 4 E4B** |
+|---|---|---|
+| Effective params | 4B | 4.5B (8B with embeddings) |
+| GGUF size (q4_k_m) | ~2.7 GB | ~5 GB |
+| Context window | 32K | 128K |
+| Native JSON output | No | Yes — built-in structured function calling |
+| Ollama support | Text-only (vision mmproj broken) | Full support |
+| Unsloth LoRA | Yes, but QLoRA discouraged — use bf16 LoRA | Yes, bf16 LoRA recommended |
+| VRAM for training (bf16) | 10 GB | ~10 GB |
+| License | Apache 2.0 | Apache 2.0 |
+| D4BL proven | Yes (5 experiments, 98.77% schema validity) | No (new model) |
+
+**Why these two:**
+- Qwen 3.5-4B is the known quantity — D4BL proved the full pipeline (distill → train → GGUF → Ollama) across 5 experiments and achieved 98.77% schema validity on structured output tasks.
+- Gemma 4 E4B offers native structured JSON output before fine-tuning, 128K context (headroom for SHAP-heavy narrator prompts), and full Ollama GGUF support without the mmproj workaround.
+
+**Why not others:**
+- Qwen 3.5-9B: 22 GB VRAM for training, larger GGUF (~5.5 GB), marginal quality gain over 4B for our task complexity.
+- Qwen 3-4B: #1 on fine-tuning benchmarks, but lacks 3.5's architecture improvements.
+- Llama 3.2-3B: best tunability gain, but Meta's community license (700M MAU limit) is restrictive for educational institutions.
+- Phi-4-mini: strong on math, but less proven for structured output and limited Unsloth support.
+
+**Training note:** Unsloth explicitly discourages QLoRA (4-bit) on Qwen 3.5 due to quantization artifacts. Both models will use bf16 LoRA on A100 (40 GB VRAM is sufficient for either).
+
 ## 2. Prerequisites
 
 Before the epic branch is created:
@@ -87,7 +115,7 @@ Before the epic branch is created:
 | 3 | Build Colab training notebook (Unsloth + LoRA) | Single "Run All" notebook, parameterized config, 3-phase training, GGUF export. Replace `training/finetune.py` (MLX) with Unsloth wrapper. | #1 | `type:feature`, `area:ai` |
 | 4 | Distill training pairs for summarizer and explainer | Run distillation for both existing tasks (~1,500 pairs each via Claude API). Prepare datasets. | #1 | `type:feature`, `area:ai` |
 | 5 | Distill training pairs for SHAP narrator | Generate ~1,500 SHAP narrator pairs from student data + SHAP values. Requires SHAP data in DB. | #2 | `type:feature`, `area:ai` |
-| 6 | Train and evaluate 4B + 9B models | Run Colab notebook for both model sizes. Evaluate via ship criteria. Compare metrics, pick winner. | #3, #4, #5 | `type:spike`, `area:ai` |
+| 6 | Train and evaluate Qwen 3.5-4B + Gemma 4 E4B | Run Colab notebook for both models. Evaluate via ship criteria. Compare metrics, pick winner. | #3, #4, #5 | `type:spike`, `area:ai` |
 | 7 | Export models and wire into dashboard | GGUF export, Ollama registration, wire `model-client.ts` into consumer routes, update `enrich_with_llm` model string. | #6 | `type:feature`, `area:ai`, `area:frontend` |
 | 8 | Update documentation and feasibility report | Update feasibility report with actual results, update README and CLAUDE.md. | #6 | `type:documentation` |
 
@@ -110,7 +138,10 @@ Issues #2, #3, and #4 can proceed concurrently after #1. Issue #5 waits only on
 Cell 1: Configuration (ONLY cell the user edits)
 -------------------------------------------------
 SCHOOL = "bishop-state"
-MODEL_SIZES = ["4b", "9b"]
+MODELS = [
+    {"name": "qwen3.5-4b", "hf_id": "Qwen/Qwen3.5-4B"},
+    {"name": "gemma4-e4b", "hf_id": "google/gemma-4-e4b-it"},
+]
 REPO_URL = "https://github.com/codebenders/datathon.git"
 REPO_BRANCH = "fine-tuning/student-explainability"
 HF_TOKEN = ""  # or userdata.get('HF_TOKEN')
@@ -123,9 +154,9 @@ Cell 2+: Fully autonomous
 - GPU detection + validation (assert A100/T4/L4)
 - pip install unsloth, trl, peft
 - Clone repo, load schools/{SCHOOL}/config.yaml
-- For each model size:
+- For each model in MODELS:
   - Phase 1: Domain adaptation
-    - Load base Qwen model via Unsloth (4-bit NF4)
+    - Load base model via Unsloth (bf16 LoRA — no QLoRA for Qwen 3.5)
     - Train on training_data/{school}/domain.jsonl
     - LoRA rank 16, all modules, 1 epoch, lr 2e-4, effective batch 32
     - Save merged checkpoint
@@ -139,13 +170,13 @@ Cell 2+: Fully autonomous
   - Phase 3: GGUF export
     - Quantize each task adapter to q4_k_m
     - Upload to Google Drive (or HF Hub if HF_TOKEN provided)
-- Print comparison table: 4B vs 9B metrics across all tasks
+- Print comparison table: Qwen 3.5-4B vs Gemma 4 E4B metrics across all tasks
 - Recommend winner based on ship criteria
 ```
 
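The config-driven iteration described above can be sketched in a few lines. This is a minimal illustration, not code from the notebook: the `artifact_paths` helper and the GGUF filename pattern are hypothetical, assuming one q4_k_m export per (model, task) pair.

```python
# Hypothetical sketch of how Cell 2+ might iterate the Cell 1 MODELS config.
# Only the control flow is shown; training itself is out of scope here.

MODELS = [
    {"name": "qwen3.5-4b", "hf_id": "Qwen/Qwen3.5-4B"},
    {"name": "gemma4-e4b", "hf_id": "google/gemma-4-e4b-it"},
]
TASKS = ["narrator", "summarizer", "explainer"]

def artifact_paths(school: str, models=MODELS, tasks=TASKS):
    """Derive one GGUF artifact name per (model, task) pair."""
    paths = []
    for model in models:
        for task in tasks:
            # Illustrative naming only, e.g. bishop-state-narrator-qwen3.5-4b.q4_k_m.gguf
            paths.append(f"{school}-{task}-{model['name']}.q4_k_m.gguf")
    return paths

print(artifact_paths("bishop-state"))
```

Two models times three task adapters yields six GGUF artifacts per full training run.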
 ### Training Hyperparameters
 
-Based on D4BL's proven configurations:
+Based on D4BL's proven configurations. Both models use bf16 LoRA (not QLoRA): Unsloth discourages 4-bit QLoRA on Qwen 3.5 due to quantization artifacts, and bf16 LoRA is also the recommended path for Gemma 4.
 
 | Parameter | Phase 1 (Domain) | Phase 2 (Tasks) |
 |-----------|------------------|-----------------|
@@ -158,6 +189,7 @@ Based on D4BL's proven configurations:
 | Max sequence length | 4096 | 4096-8192 |
 | Optimizer | AdamW 8-bit | AdamW 8-bit |
 | Precision | bf16 (A100) | bf16 (A100) |
+| Quantization during training | None (bf16 LoRA) | None (bf16 LoRA) |
 
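One way to keep the table and the notebook in sync is to express the rows as plain config dictionaries. This is a sketch: the key names mirror common TRL/PEFT argument names but are illustrative, and the Phase 1 rank/epochs/lr/batch values are taken from the notebook outline rather than this table excerpt.

```python
# Sketch: hyperparameters from the table above as config dicts.
# Key names are illustrative, not exact Unsloth/TRL signatures.

COMMON = {
    "optimizer": "adamw_8bit",
    "precision": "bf16",       # bf16 LoRA throughout
    "load_in_4bit": False,     # no QLoRA — avoided for Qwen 3.5
}

# Phase 1 (domain adaptation) values from the notebook outline.
PHASE1 = {**COMMON, "lora_rank": 16, "epochs": 1,
          "learning_rate": 2e-4, "effective_batch_size": 32,
          "max_seq_length": 4096}

# Phase 2 (tasks) allows longer sequences (4096-8192); 8192 is the upper bound.
PHASE2 = {**COMMON, "max_seq_length": 8192}

print(PHASE1["learning_rate"], PHASE2["max_seq_length"])
```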
 ### What the Notebook Does NOT Do
 
@@ -253,12 +285,12 @@ This is the highest-value task — it transforms per-student SHAP attribution da
 ### Ollama Model Naming
 
 ```
-bishop-state-narrator:{size}     # SHAP narrator
-bishop-state-summarizer:{size}   # Query summary
-bishop-state-explainer:{size}    # Course pairing
+bishop-state-narrator:{model}    # SHAP narrator
+bishop-state-summarizer:{model}  # Query summary
+bishop-state-explainer:{model}   # Course pairing
 ```
 
-Where `{size}` is `4b` or `9b` based on evaluation results.
+Where `{model}` is the winning model identifier (e.g., `qwen3.5-4b` or `gemma4-e4b`) based on evaluation results.
 
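A tiny helper makes the naming scheme concrete. The function below is hypothetical (not part of the repo) and simply composes the `{school}-{task}:{model}` pattern described above.

```python
def ollama_model_name(task: str, model_tag: str,
                      school: str = "bishop-state") -> str:
    """Build an Ollama model name like bishop-state-narrator:qwen3.5-4b."""
    # Guard against typos: only the three fine-tuned tasks are valid.
    assert task in {"narrator", "summarizer", "explainer"}
    return f"{school}-{task}:{model_tag}"

print(ollama_model_name("narrator", "qwen3.5-4b"))
# bishop-state-narrator:qwen3.5-4b
```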
 ### SHAP Narrator Integration Point
 
@@ -268,16 +300,16 @@ Where `{size}` is `4b` or `9b` based on evaluation results.
 # Before (OpenAI)
 python ai_model/generate_readiness_scores.py --enrich-with-llm --llm-model gpt-4o-mini
 
-# After (fine-tuned)
-python ai_model/generate_readiness_scores.py --enrich-with-llm --llm-model ollama/bishop-state-narrator:4b
+# After (fine-tuned, winner TBD after evaluation)
+python ai_model/generate_readiness_scores.py --enrich-with-llm --llm-model ollama/bishop-state-narrator
 ```
 
 ### Environment Variables
 
 ```env
 MODEL_BACKEND=ollama     # or "openai" (fallback)
 OLLAMA_BASE_URL=http://localhost:11434
-MODEL_SIZE=4b            # set after evaluation picks winner
+MODEL_TAG=qwen3.5-4b     # or gemma4-e4b, set after evaluation picks winner
 SCHOOL_CODE=bishop-state
 ```
 
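A client could resolve these variables into the `--llm-model` string as follows. This is a sketch under the spec's assumption that the operator chooses the backend explicitly (no automatic fallback); the `resolve_model` function and its defaults are hypothetical, not from `model-client.ts`.

```python
import os

def resolve_model(task: str) -> str:
    """Hypothetical sketch: turn env config into an --llm-model string."""
    backend = os.environ.get("MODEL_BACKEND", "ollama")
    if backend == "openai":
        # Explicit operator choice — the spec has no automatic fallback.
        return "gpt-4o-mini"
    school = os.environ.get("SCHOOL_CODE", "bishop-state")
    tag = os.environ.get("MODEL_TAG", "qwen3.5-4b")
    return f"ollama/{school}-{task}:{tag}"

os.environ["MODEL_BACKEND"] = "ollama"
os.environ["MODEL_TAG"] = "qwen3.5-4b"
print(resolve_model("narrator"))
# ollama/bishop-state-narrator:qwen3.5-4b
```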
@@ -290,16 +322,19 @@ The operator sets `MODEL_BACKEND` to either `ollama` or `openai`. There is no au
 | Item | Cost |
 |------|------|
 | Claude API distillation (~4,500 pairs across 3 tasks) | $5-10 |
-| Colab A100 compute (~4 hours for 2 model sizes) | $8-16 |
-| **Total per training run** | **$13-26** |
-| Iteration runs (subsequent) | $8-16 each |
+| Colab A100 compute (~4-5 hours for 2 models, bf16 LoRA) | $8-20 |
+| **Total per training run** | **$13-30** |
+| Iteration runs (subsequent) | $8-20 each |
+
+Note: bf16 LoRA (required for Qwen 3.5, recommended for Gemma 4) uses more VRAM than QLoRA but fits comfortably on A100 40GB. Training time may be slightly longer than D4BL's QLoRA runs.
 
 ## 8. Success Criteria
 
 The epic is complete when:
 
-1. All three tasks pass ship criteria on the winning model size
+1. All three tasks pass ship criteria on the winning model (Qwen 3.5-4B or Gemma 4 E4B)
 2. `MODEL_BACKEND=ollama` serves all three tasks in the dashboard without OpenAI
 3. SHAP narrator produces grounded narratives that cite specific feature attributions
-4. Feasibility report is updated with actual metrics and model selection rationale
+4. Feasibility report is updated with actual metrics, model comparison, and selection rationale
 5. Colab notebook is documented and reproducible (clone + Run All)
+6. Model selection decision is documented with head-to-head metrics for both candidates
