# Design: Self-Service Data Upload (Issue #86)

**Date:** 2026-02-24

**Author:** Claude Code

---

## Overview

Allow admin and IR users to upload institutional data files directly from the dashboard without needing direct database or server access. Two upload paths: course enrollment CSVs (end-to-end to Postgres) and PDP cohort/AR files (to Supabase Storage + GitHub Actions ML pipeline trigger).

---

## Scope

**In scope:**
- Course enrollment CSV → `course_enrollments` Postgres table (upsert)
- PDP Cohort CSV / PDP AR (.xlsx) → Supabase Storage + GitHub Actions `repository_dispatch`
- Preview step (first 10 rows + column validation) before commit
- Role guard: admin and ir only

**Out of scope:**
- Upload history log (future issue)
- Column remapping UI (columns must match the known schema)
- ML experiment tracking / MLflow (future issue)
- Auto-triggering the ML pipeline without a server (GitHub Actions is the trigger mechanism)

---

## Pages & Routing

**New page:** `codebenders-dashboard/app/admin/upload/page.tsx`

**Role guard:** Add to `lib/roles.ts` `ROUTE_PERMISSIONS`:

```ts
{ prefix: "/admin", roles: ["admin", "ir"] },
{ prefix: "/api/admin", roles: ["admin", "ir"] },
```

Middleware already enforces this pattern via `x-user-role` header — no other auth code needed.
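
The prefix check the middleware applies can be sketched as a small pure helper. This is a hedged sketch only: the `RoutePermission` shape mirrors the snippet above, while `isAllowed` is an illustrative name, not existing dashboard code.

```typescript
// Illustrative prefix-based permission check; mirrors ROUTE_PERMISSIONS above.
type RoutePermission = { prefix: string; roles: string[] };

const ROUTE_PERMISSIONS: RoutePermission[] = [
  { prefix: "/admin", roles: ["admin", "ir"] },
  { prefix: "/api/admin", roles: ["admin", "ir"] },
];

// Returns true when no rule matches (unguarded route) or the role is listed.
function isAllowed(pathname: string, role: string): boolean {
  const rule = ROUTE_PERMISSIONS.find((p) => pathname.startsWith(p.prefix));
  return rule ? rule.roles.includes(role) : true;
}
```
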

**Nav link:** Add "Upload Data" to `nav-header.tsx`, visible only to admin/ir roles.

**New API routes:**
- `POST /api/admin/upload/preview` — parse first 10 rows, return sample + validation summary
- `POST /api/admin/upload/commit` — full ingest (course → Postgres; PDP/AR → Storage + Actions)

---

## UI Flow (3 States)

### State 1 — Select & Drop
- Dropdown: file type (`Course Enrollment CSV` | `PDP Cohort CSV` | `PDP AR File (.xlsx)`)
- Drag-and-drop zone (click to pick; `.csv` for course/cohort, `.csv` + `.xlsx` for AR)
- "Preview" button → calls `/api/admin/upload/preview`

### State 2 — Preview
- Shows: detected file type, estimated row count, first 10 rows in a table
- Validation banner: lists missing required columns or warnings
- "Confirm & Upload" → calls `/api/admin/upload/commit`
- "Back" link to return to State 1

### State 3 — Result
- Course enrollments: `{ inserted, skipped, errors[] }` summary card
- PDP/AR: "File accepted — ML pipeline queued in GitHub Actions" + link to the Actions run
- "Upload another file" resets to State 1
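
The three states above form a small state machine. A minimal sketch as a reducer follows; every type, event, and function name here is hypothetical, not existing dashboard code.

```typescript
// Illustrative reducer for the 3-state upload flow (select -> preview -> result).
type UploadState =
  | { step: "select" }
  | { step: "preview"; fileName: string }
  | { step: "result"; summary: string };

type UploadEvent =
  | { type: "PREVIEW"; fileName: string } // State 1 -> 2
  | { type: "BACK" }                      // State 2 -> 1
  | { type: "CONFIRM"; summary: string }  // State 2 -> 3
  | { type: "RESET" };                    // State 3 -> 1

// Invalid transitions (e.g. CONFIRM from State 1) leave the state unchanged.
function reduce(state: UploadState, ev: UploadEvent): UploadState {
  switch (ev.type) {
    case "PREVIEW":
      return state.step === "select" ? { step: "preview", fileName: ev.fileName } : state;
    case "BACK":
      return state.step === "preview" ? { step: "select" } : state;
    case "CONFIRM":
      return state.step === "preview" ? { step: "result", summary: ev.summary } : state;
    case "RESET":
      return { step: "select" };
  }
}
```
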

---

## API Routes

### `POST /api/admin/upload/preview`

**Input:** `multipart/form-data` with `file` and `fileType` fields

**Logic:**
1. Parse first 50 rows with `csv-parse` (CSV) or `xlsx` (Excel)
2. Validate required columns exist for the given `fileType`
3. Return `{ columns, sampleRows (first 10), rowCount (estimated), warnings[] }`

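
The column check in step 2 could look roughly like this. The required columns come from the "Required Column Schemas" section below; the `fileType` keys and the `missingColumns` helper are illustrative assumptions.

```typescript
// Hypothetical validation helper for the preview route's step 2.
// Required columns per file type, per the "Required Column Schemas" section.
const REQUIRED_COLUMNS: Record<string, string[]> = {
  course_enrollment: [
    "student_guid", "course_prefix", "course_number", "academic_year", "academic_term",
  ],
  pdp_cohort: ["Institution_ID", "Cohort", "Student_GUID", "Cohort_Term"],
  pdp_ar: ["Institution_ID", "Cohort", "Student_GUID"],
};

// Returns the missing required columns; an empty array means the file passes.
function missingColumns(fileType: string, columns: string[]): string[] {
  const required = REQUIRED_COLUMNS[fileType] ?? [];
  return required.filter((c) => !columns.includes(c));
}
```

The route would surface a non-empty result as the validation banner's warning list.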
### `POST /api/admin/upload/commit`

**Input:** Same multipart form

**Course enrollment path:**
1. Stream-parse full CSV with `csv-parse` async iterator
2. Batch-upsert 500 rows at a time into `course_enrollments` via `pg`
3. Conflict target: `(student_guid, course_prefix, course_number, academic_term)`
4. Return `{ inserted, skipped, errors[] }`

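
Steps 2-3 can be sketched as below. The `chunk` helper is an illustrative name, and the SQL is a sketch of the stated conflict target, here using `DO NOTHING` so conflicting rows count toward `skipped` (a `DO UPDATE` variant would overwrite instead).

```typescript
// Illustrative batching helper for step 2: split parsed rows into groups of 500.
function chunk<T>(rows: T[], size = 500): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}

// Sketch of the step-3 upsert executed per row via `pg`. The conflict target
// matches the design; DO NOTHING makes conflicting rows count as "skipped".
const UPSERT_SQL = `
  INSERT INTO course_enrollments
    (student_guid, course_prefix, course_number, academic_year, academic_term)
  VALUES ($1, $2, $3, $4, $5)
  ON CONFLICT (student_guid, course_prefix, course_number, academic_term)
  DO NOTHING`;
```
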
**PDP/AR path:**
1. Upload file to Supabase Storage bucket `pdp-uploads` via `@supabase/supabase-js`
2. Call GitHub API `POST /repos/{owner}/{repo}/dispatches` with:
   ```json
   { "event_type": "ml-pipeline", "client_payload": { "file_path": "<storage-path>" } }
   ```
3. Return `{ status: "processing", actionsUrl: "https://github.com/{owner}/{repo}/actions" }`
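
The dispatch request in step 2 could be assembled like this. `buildDispatch` is an illustrative helper; the endpoint and JSON shape follow the GitHub REST API for `repository_dispatch`, and owner, repo, and token would come from configuration.

```typescript
// Illustrative request builder for the repository_dispatch call in step 2.
function buildDispatch(owner: string, repo: string, token: string, filePath: string) {
  return {
    url: `https://api.github.com/repos/${owner}/${repo}/dispatches`,
    method: "POST" as const,
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({
      event_type: "ml-pipeline",
      client_payload: { file_path: filePath },
    }),
  };
}
```

GitHub responds `204 No Content` when the dispatch is accepted, so the route only needs to check the status code before returning its own `{ status: "processing" }`.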
**Role enforcement:** Read `x-user-role` header (set by middleware); return 403 if not admin/ir.

---

## GitHub Actions Workflow

**File:** `.github/workflows/ml-pipeline.yml`

**Trigger:** `repository_dispatch` with `event_type: ml-pipeline`

**Steps:**
1. Checkout repo
2. Set up Python with `venv`
3. Install dependencies (`pip install -r requirements.txt`)
4. Download uploaded file from Supabase Storage using `SUPABASE_SERVICE_KEY` secret
5. Run `venv/bin/python ai_model/complete_ml_pipeline.py --input <downloaded-file-path>`
6. Upload `ML_PIPELINE_REPORT.txt` as a GitHub Actions artifact (retained 90 days)

**Required secrets:** `SUPABASE_URL`, `SUPABASE_SERVICE_KEY`, `GITHUB_TOKEN` (auto-provided)
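
Put together, the workflow could be sketched as below. The download step's helper script (`scripts/download_upload.py`) and its flags are hypothetical, a repo would need something equivalent; the checkout, setup-python, and upload-artifact steps use the standard published actions.

```yaml
name: ML Pipeline
on:
  repository_dispatch:
    types: [ml-pipeline]

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          python -m venv venv
          venv/bin/pip install -r requirements.txt
      - name: Download uploaded file   # helper script is hypothetical
        env:
          SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
          SUPABASE_SERVICE_KEY: ${{ secrets.SUPABASE_SERVICE_KEY }}
          FILE_PATH: ${{ github.event.client_payload.file_path }}
        run: venv/bin/python scripts/download_upload.py --out input.csv
      - name: Run pipeline
        run: venv/bin/python ai_model/complete_ml_pipeline.py --input input.csv
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: ml-pipeline-report
          path: ML_PIPELINE_REPORT.txt
          retention-days: 90
```
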

---

## Required Column Schemas

### Course Enrollment CSV
Must include: `student_guid`, `course_prefix`, `course_number`, `academic_year`, `academic_term`
Optional: all other `course_enrollments` columns; filled as NULL if absent

### PDP Cohort CSV
Must include: `Institution_ID`, `Cohort`, `Student_GUID`, `Cohort_Term`

### PDP AR File (.xlsx)
Must include: `Institution_ID`, `Cohort`, `Student_GUID` (only the first sheet is parsed)

---

## New Packages

| Package | Purpose |
|---------|---------|
| `csv-parse` | Streaming CSV parsing (async iterator mode) |
| `xlsx` | Excel (.xlsx) parsing |

---

## New Files

| File | Purpose |
|------|---------|
| `codebenders-dashboard/app/admin/upload/page.tsx` | Upload UI page |
| `codebenders-dashboard/app/api/admin/upload/preview/route.ts` | Preview API route |
| `codebenders-dashboard/app/api/admin/upload/commit/route.ts` | Commit API route |
| `.github/workflows/ml-pipeline.yml` | GitHub Actions ML pipeline trigger |

---

## Supabase Changes

**Storage bucket:** Create `pdp-uploads` bucket (private, authenticated access only).
No new database migrations required — `course_enrollments` table already exists.

**Bucket policy:** Only the service role key can read/write. Signed URLs are used for the pipeline download.
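
Object naming inside the bucket could follow a simple convention; `makeStoragePath` below is an illustrative helper, not existing code, and the actual upload would go through `supabase.storage.from("pdp-uploads").upload(path, file)` from `@supabase/supabase-js`.

```typescript
// Illustrative naming convention for objects in the pdp-uploads bucket:
// <fileType>/<timestamp>/<original filename>, with ":" and "." replaced
// so the timestamp stays path-safe.
function makeStoragePath(fileType: string, fileName: string, when: Date): string {
  const stamp = when.toISOString().replace(/[:.]/g, "-");
  return `${fileType}/${stamp}/${fileName}`;
}
```

Timestamped prefixes keep repeated uploads of the same file from overwriting each other, which also leaves room for the deferred upload-history feature.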

---

## Constraints & Known Limitations

- ML pipeline trigger via GitHub Actions means a ~30-60s delay before the pipeline starts
- Vercel free tier has a 4.5 MB request body limit — large files should use Supabase Storage direct upload in a future iteration
- No upload history log in this version (deferred)
- Column remapping is out of scope — files must match the known schema
