Replace single-family model comparison (Qwen 4B vs 9B) with
cross-family comparison (Qwen 3.5-4B vs Gemma 4 E4B). Gemma 4 E4B
has native structured JSON output and 128K context; Qwen 3.5-4B is
the proven D4BL baseline.
Key changes:
- Add model selection section with head-to-head comparison table
- Update notebook config from MODEL_SIZES to MODELS list
- Switch from QLoRA to bf16 LoRA (Unsloth discourages QLoRA on Qwen 3.5)
- Update Ollama naming, env vars, cost estimates, success criteria
Fine-tune a small language model on Bishop State domain data to replace GPT-4o-mini for three inference tasks in the dashboard. Two candidate models will be evaluated head-to-head: **Qwen 3.5-4B** (proven by D4BL) and **Gemma 4 E4B** (native structured JSON output). The primary value is improved explainability: advisors get SHAP-grounded, institution-aware narratives instead of templated rule-engine output. Secondary benefits include FERPA compliance (all inference on-premises), offline deployment, and institutional scalability.
### Tasks to Fine-Tune
- Query Analyzer (NL → SQL) — high risk, deferred to future epic
- Model serving infrastructure (RunPod, dedicated GPU hosting) — use local Ollama for now
### Model Selection: Qwen 3.5-4B vs Gemma 4 E4B
Two candidate models will be trained and evaluated. The winner is selected based on ship criteria metrics.

|  | Qwen 3.5-4B | Gemma 4 E4B |
| --- | --- | --- |
| Native JSON output | No | Yes — built-in structured function calling |
| Ollama support | Text-only (vision mmproj broken) | Full support |
| Unsloth LoRA | Yes, but QLoRA discouraged — use bf16 LoRA | Yes, bf16 LoRA recommended |
| VRAM for training (bf16) | 10 GB | ~10 GB |
| License | Apache 2.0 | Apache 2.0 |
| D4BL proven | Yes (5 experiments, 98.77% schema validity) | No (new model) |
**Why these two:**
- Qwen 3.5-4B is the known quantity — D4BL proved the full pipeline (distill → train → GGUF → Ollama) with 5 experiments and achieved 98.77% schema validity on structured output tasks.
- Gemma 4 E4B has native structured JSON output before fine-tuning, 128K context (headroom for SHAP-heavy narrator prompts), and full Ollama GGUF support without the mmproj workaround.
**Why not others:**
- Qwen 3.5-9B: 22 GB VRAM for training, larger GGUF (~5.5 GB), marginal quality gain over 4B for our task complexity.
- Llama 3.2-3B: Best tunability gain, but Meta's community license (700M MAU limit) is restrictive for educational institutions.
- Phi-4-mini: Strong on math, but less proven for structured output and limited Unsloth support.
**Training note:** Unsloth explicitly discourages QLoRA (4-bit) on Qwen 3.5 due to quantization artifacts. Both models will use bf16 LoRA on A100 (40 GB VRAM is sufficient for either).
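The training note above can be sketched in notebook form. This is a minimal sketch assuming Unsloth's `FastLanguageModel` API; the `max_seq_length` value is an assumption, and the actual Hugging Face model IDs for the two candidates are not specified here:

```python
# Hedged sketch: bf16 LoRA loading (no 4-bit QLoRA), per the training note above.
BF16_LOAD_KWARGS = dict(
    max_seq_length=8192,   # assumption: headroom for SHAP-heavy narrator prompts
    dtype=None,            # let Unsloth auto-detect (bf16 on A100)
    load_in_4bit=False,    # the key switch: bf16 LoRA, not QLoRA
)

def load_for_training(model_name: str):
    """Load one candidate model for LoRA training (requires a GPU runtime)."""
    from unsloth import FastLanguageModel  # lazy import: GPU-only dependency
    return FastLanguageModel.from_pretrained(model_name=model_name, **BF16_LOAD_KWARGS)
```

Either candidate's model ID would be passed as `model_name`; the same kwargs apply to both, since neither should be loaded in 4-bit.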
## 2. Prerequisites
Before the epic branch is created:
| 3 | Build Colab training notebook (Unsloth + LoRA) | Single "Run All" notebook, parameterized config, 3-phase training, GGUF export. Replace `training/finetune.py` (MLX) with Unsloth wrapper. | #1 | `type:feature`, `area:ai` |
| 4 | Distill training pairs for summarizer and explainer | Run distillation for both existing tasks (~1,500 pairs each via Claude API). Prepare datasets. | #1 | `type:feature`, `area:ai` |
| 5 | Distill training pairs for SHAP narrator | Generate ~1,500 SHAP narrator pairs from student data + SHAP values. Requires SHAP data in DB. | #2 | `type:feature`, `area:ai` |
| 6 | Train and evaluate Qwen 3.5-4B + Gemma 4 E4B | Run Colab notebook for both models. Evaluate via ship criteria. Compare metrics, pick winner. | #3, #4, #5 | `type:spike`, `area:ai` |
| 7 | Export models and wire into dashboard | GGUF export, Ollama registration, wire `model-client.ts` into consumer routes, update `enrich_with_llm` model string. | #6 | `type:feature`, `area:ai`, `area:frontend` |
| 8 | Update documentation and feasibility report | Update feasibility report with actual results, update README and CLAUDE.md. | #6 | `type:documentation` |
- Load base model via Unsloth (bf16 LoRA — no QLoRA for Qwen 3.5)
- Train on training_data/{school}/domain.jsonl
- LoRA rank 16, all modules, 1 epoch, lr 2e-4, effective batch 32
- Save merged checkpoint
- Phase 3: GGUF export
- Quantize each task adapter to q4_k_m
- Upload to Google Drive (or HF Hub if HF_TOKEN provided)
- Print comparison table: Qwen 3.5-4B vs Gemma 4 E4B metrics across all tasks
- Recommend winner based on ship criteria
```
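Phase 3's per-adapter quantization step could look like the following sketch, assuming Unsloth's `save_pretrained_gguf` helper; the output directory names are illustrative, not fixed paths from the plan:

```python
# Hedged sketch of the Phase 3 GGUF export; q4_k_m matches the plan above.
QUANT_METHOD = "q4_k_m"

def export_task_adapter(model, tokenizer, task: str, out_root: str = "gguf_out"):
    """Quantize one task adapter to GGUF under an illustrative output path."""
    out_dir = f"{out_root}/{task}"
    # Unsloth helper that merges, converts to GGUF, and quantizes in one call.
    model.save_pretrained_gguf(out_dir, tokenizer, quantization_method=QUANT_METHOD)
    return out_dir
```

The resulting files would then be uploaded to Google Drive (or HF Hub) as the outline describes.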
### Training Hyperparameters
Hyperparameters follow D4BL's proven configurations. Both models use bf16 LoRA (not QLoRA): Unsloth discourages 4-bit QLoRA on Qwen 3.5 due to quantization artifacts, and bf16 LoRA is also recommended for Gemma 4.

Note: bf16 LoRA uses more VRAM than QLoRA but fits comfortably on an A100 40 GB. Training time may be slightly longer than D4BL's QLoRA runs.
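As a concrete sketch, the hyperparameters from the notebook outline (rank 16, all modules, 1 epoch, lr 2e-4, effective batch 32) could be expressed as a config cell. The 4 x 8 batch split and `lora_alpha=16` are assumptions; the plan fixes only the rank, epoch count, learning rate, and effective batch size:

```python
# Hedged sketch of the training configuration (bf16 LoRA for both models).
LORA_CONFIG = dict(
    r=16,                   # rank 16, per the notebook outline
    lora_alpha=16,          # assumption: alpha = rank, a common default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # "all modules"
)
TRAINING_CONFIG = dict(
    num_train_epochs=1,
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # 4 * 8 = effective batch 32 (split assumed)
    bf16=True,                       # bf16 LoRA, no 4-bit quantization
)
```

These dicts would feed Unsloth's PEFT wrapper and the trainer arguments, respectively.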
## 8. Success Criteria
The epic is complete when:
1. All three tasks pass ship criteria on the winning model (Qwen 3.5-4B or Gemma 4 E4B)
2. `MODEL_BACKEND=ollama` serves all three tasks in the dashboard without OpenAI
3. SHAP narrator produces grounded narratives that cite specific feature attributions
4. Feasibility report is updated with actual metrics, model comparison, and selection rationale
5. Colab notebook is documented and reproducible (clone + Run All)
6. Model selection decision is documented with head-to-head metrics for both candidates
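Criterion 2 depends on each exported GGUF being registered with Ollama. A minimal sketch, assuming placeholder model and file names (the `temperature` value is illustrative, not a tuned setting):

```shell
# Hedged sketch: wire an exported GGUF into Ollama for MODEL_BACKEND=ollama.
# "bishop-summarizer" and the GGUF filename are placeholders.
cat > Modelfile <<'EOF'
FROM ./summarizer.q4_k_m.gguf
PARAMETER temperature 0.2
EOF

# Registration and smoke test (requires the Ollama daemon and the GGUF file):
#   ollama create bishop-summarizer -f Modelfile
#   ollama run bishop-summarizer "Summarize this student's risk profile: ..."
```

The same pattern would repeat per task adapter, with the dashboard's `model-client.ts` pointed at the registered model names.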