From 3690ac143ce40c708f8952e0dc3e7e4d2a1dec96 Mon Sep 17 00:00:00 2001
From: Eryk Kullikowski
Date: Tue, 10 Mar 2026 13:52:44 +0100
Subject: [PATCH 01/25] feat(redcap): update REDCap2 plugin design document to reflect current implementation status and export mode details

---
 redcap.md | 466 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 336 insertions(+), 130 deletions(-)

diff --git a/redcap.md b/redcap.md
index 285dd34..6c9ecf9 100644
--- a/redcap.md
+++ b/redcap.md
@@ -1,14 +1,16 @@
-# REDCap2 Plugin Design And Implementation Plan
+# REDCap2 Plugin Design, Status, And Implementation Plan

 **Navigation:** [← Back to README](README.md#available-plugins)

 ## Table of Contents

 - [Summary](#summary)
-- [Why Report-First Is The Right Default](#why-report-first-is-the-right-default)
+- [Current Implementation Status (2026-03-10)](#current-implementation-status-2026-03-10)
+- [Export Mode Design](#export-mode-design)
 - [Target User Flow](#target-user-flow)
 - [Syncable File Model](#syncable-file-model)
 - [Export Controls](#export-controls)
+- [REDCap Built-In De-Identification Parameters](#redcap-built-in-de-identification-parameters)
 - [De-Identification And Encryption](#de-identification-and-encryption)
 - [Metadata Outputs](#metadata-outputs)
 - [Architecture In rdm-integration](#architecture-in-rdm-integration)
@@ -17,99 +19,188 @@
 - [Open Questions](#open-questions)
 - [References](#references)

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan)

 ---

 ## Summary

-This document proposes a new plugin, `redcap2`, to coexist with the current `redcap` plugin.
+This document describes the `redcap2` plugin, which coexists with the current `redcap` plugin.

 - Keep current `redcap` unchanged (File Repository mode).
 - Add `redcap2` for direct API exports (without manual "export then save to File Repository").
-- Start with a **report-first** workflow, then expand to full record export controls.
+- Start with a **report-first** workflow, then expand to more advanced export/de-identification/metadata features.

-Key point: current behavior requiring manual export/save is expected because the existing plugin only uses REDCap `fileRepository` list/export actions.
+Key point: manual export/save was required in the old `redcap` plugin because it uses REDCap `fileRepository` list/export actions (`folder_id` / `doc_id` flow).

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Why Report-First Is The Right Default](#why-report-first-is-the-right-default)
+**PoC branch:** The proof-of-concept is being developed on the `redcap_v2` branch (same branch name in both the backend and frontend repositories).
+
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Current Implementation Status](#current-implementation-status-2026-03-10)
+
+---
+
+## Current Implementation Status (2026-03-10)
+
+### Implemented
+
+1. New backend plugin `redcap2` was added and registered.
+2. `redcap2` supports two export modes selectable in the UI:
+   - **Report mode** (`exportMode: "report"`): exports a saved REDCap report by ID via `content=report`.
+   - **Records mode** (`exportMode: "records"`): exports all project records via `content=record` with optional filters.
+3. `redcap2` supports a variable-list mode for the intermediate settings screen (`pluginOptions.request = "variables"`):
+   - In report mode: fetches only the CSV header row of the report (header-only request, avoids full download).
+   - In records mode: fetches the full field list from `content=metadata`.
+4. `redcap2` `Query()` and `Streams()` generate syncable virtual files directly from REDCap API exports (no file repository dependency).
+5. Frontend intermediate page (`/redcap2-export/:id`) with:
+   - Report / All records toggle.
+   - Report ID field (report mode only).
+   - Common export controls: format, record type, CSV delimiter, raw/label, header labels.
+   - Record-only filters: fields, forms, events, records, filter logic, date range (records mode only).
+   - "Include survey fields" and "Include Data Access Groups" toggles (records mode only, default off).
+   - Variable anonymization table with auto-detection of REDCap identifier-tagged fields.
+6. End-to-end `pluginOptions` payload propagated through options, compare, and stream/store requests.
+7. Export parameter routing is correct per mode:
+   - `applySharedExportParams`: `type`, `csvDelimiter`, `rawOrLabel`, `rawOrLabelHeaders` — sent for both modes.
+   - `applyRecordOnlyFilters`: `fields`, `forms`, `events`, `records`, `filterLogic`, `dateRangeBegin`, `dateRangeEnd`, `exportSurveyFields`, `exportDataAccessGroups` — sent for records mode only (these are not supported by `content=report`).
+8. Bundle cache keyed by `exportMode` + all stable options (including `exportSurveyFields`, `exportDataAccessGroups`); `generatedAt` excluded.
+9. REDCap built-in de-identification support:
+   - `exportSurveyFields` and `exportDataAccessGroups` exposed as records-mode toggles (server-side suppression).
+   - Identifier-tagged fields auto-detected from `content=metadata` (`identifier` column) and pre-selected as `blank` in the variable anonymization table; users can override to `none`.
+10. Existing `redcap` plugin remains available and unchanged for fallback.
+
+### Generated File Layout (Implemented)
+
+**Report mode** (`exportMode: "report"`):
+
+1. `redcap/report-/data.csv` or `data.json`
+2. `redcap/report-/metadata.csv` (filtered to exported fields)
+3. `redcap/report-/project_info.json`
+4. `redcap/report-/events.csv` (longitudinal projects)
+5. `redcap/report-/form_event_mapping.csv` (longitudinal projects)
+6. `redcap/report-/manifest.json` (export config + timestamp + REDCap version + warnings)
+
+**Records mode** (`exportMode: "records"`):
+
+1. `redcap/records/data.csv` or `data.json`
+2. `redcap/records/metadata.csv` (filtered to exported fields)
+3. `redcap/records/project_info.json`
+4. `redcap/records/events.csv` (longitudinal projects)
+5. `redcap/records/form_event_mapping.csv` (longitudinal projects)
+6. `redcap/records/manifest.json`
+
+### Not Implemented Yet
+
+1. XML data export.
+2. Advanced de-identification modes beyond `blank` (drop/mask/pseudonymize/encrypt).
+3. DDI-CDI/Croissant/RO-Crate metadata exporters.
+4. Attachment/file-field download modes.
+
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Export Mode Design](#export-mode-design)

 ---

-## Why Report-First Is The Right Default
+## Export Mode Design
+
+Both modes are now implemented as peer citizens in the same UI and backend.
+
+### Report Mode (`exportMode: "report"`)

-Report export is the best first target for `redcap2`:
+- Exports a saved REDCap report by ID via `content=report`.
+- The report definition in the REDCap UI controls which fields, records, and filters are included — no extra filter parameters are sent by the plugin.
+- User enters the report ID manually (the standard REDCap API has **no endpoint to list reports**; IDs are visible in "My Reports & Exports" in the REDCap web UI).
+- Variable list for anonymization is fetched by a CSV header-only request against the report endpoint (avoids downloading full data just to get field names). Falls back to `content=metadata` if that fails.
+- Best choice when: the user has already curated a report in REDCap and wants to export exactly that snapshot.

-1. It matches how many REDCap users already work ("My Reports & Exports").
-2. Report definitions provide field selection and filter logic in REDCap UI.
-3. API supports report export by `report_id`.
-4. It minimizes frontend complexity for MVP.
+### Records Mode (`exportMode: "records"`)

-Recommendation:
+- Exports directly via `content=record` with optional server-side filters.
+- No report ID needed — works on any project without prior report setup.
+- Supports all REDCap record-export filter parameters: `fields`, `forms`, `events`, `records`, `filterLogic`, `dateRangeBegin`, `dateRangeEnd`.
+- Variable list for anonymization is fetched from `content=metadata` (all project fields).
+- Best choice when: the user wants an ad-hoc export with dynamic filters, or no report has been configured.

-1. `redcap2` MVP should support `report_id` export first.
-2. Add advanced record export mode as second phase.
+### API Parameter Routing

-Note:
+The `content=report` endpoint does **not** accept record-filter parameters.
+The split into `applySharedExportParams` and `applyRecordOnlyFilters` enforces this:

-- We should verify whether the REDCap API on our server exposes a "list reports" endpoint.
-- If not available, the first MVP can accept manual `report_id` entry (visible in REDCap report list UI).
+| Parameter | Report mode | Records mode |
+|---|---|---|
+| `type` (flat/eav) | ✓ | ✓ |
+| `csvDelimiter` | ✓ | ✓ |
+| `rawOrLabel` | ✓ | ✓ |
+| `rawOrLabelHeaders` | ✓ | ✓ |
+| `fields` | — | ✓ |
+| `forms` | — | ✓ |
+| `events` | — | ✓ |
+| `records` | — | ✓ |
+| `filterLogic` | — | ✓ |
+| `dateRangeBegin` / `dateRangeEnd` | — | ✓ |
+| `exportSurveyFields` | — | ✓ |
+| `exportDataAccessGroups` | — | ✓ |
+| `report_id` | required | — |

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Target User Flow](#target-user-flow)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Target User Flow](#target-user-flow)

 ---

 ## Target User Flow

-### MVP Flow (Report-First)
-
-1. User selects `REDCap2` source plugin.
-2. User enters:
-   - REDCap URL
-   - REDCap API token
-   - Report ID (manual input or dropdown if API listing exists)
-3. User configures export options in an intermediate "Export Settings" panel:
-   - format (`csv`/`json`/`xml`)
-   - delimiter for CSV (`,` or tab)
-   - raw/label options
-4. Compare step shows generated virtual files.
+### Report Mode Flow
+
+1. User selects `REDCap Reports (beta)` source plugin.
+2. User enters REDCap URL and API token.
+3. On the intermediate export settings page:
+   - Select **Report** mode (default).
+   - Enter **Report ID** (find it in REDCap under "My Reports & Exports").
+   - Choose format (`csv`/`json`), record type, delimiter, raw/label options.
+   - Optionally configure per-variable anonymization (`none`/`blank`).
+4. Compare step shows generated virtual files under `redcap/report-/`.
 5. User selects files and syncs to Dataverse.

-### Advanced Flow (Record Mode)
+### Records Mode Flow

-1. User chooses "Record export mode" instead of report mode.
-2. User sets optional filters:
-   - fields
-   - forms
-   - events
-   - records
-   - filter logic
-   - date range
-   - record type (`flat`/`eav`)
-3. Plugin generates data + metadata files according to config.
-4. Compare and sync as usual.
+1. User selects `REDCap Reports (beta)` source plugin.
+2. User enters REDCap URL and API token.
+3. On the intermediate export settings page:
+   - Select **All records** mode.
+   - Choose format, record type, delimiter, raw/label options.
+   - Optionally set fields, forms, events, records, filter logic, date range.
+   - Optionally configure per-variable anonymization.
+4. Compare step shows generated virtual files under `redcap/records/`.
+5. User selects files and syncs to Dataverse.

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Syncable File Model](#syncable-file-model)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Syncable File Model](#syncable-file-model)

 ---

 ## Syncable File Model

-`redcap2` should expose **generated virtual files** through `Query()` and `Streams()`.
+`redcap2` exposes **generated virtual files** through `Query()` and `Streams()`.
+
+File paths per mode:

-Suggested naming (report mode):
+**Report mode:**

-1. `redcap2/report-/data.csv`
-2. `redcap2/report-/schema/redcap_metadata.csv`
-3. `redcap2/report-/schema/instruments.csv`
-4. `redcap2/report-/schema/events.csv` (longitudinal only)
-5. `redcap2/report-/schema/form_event_mapping.csv` (longitudinal only)
-6. `redcap2/report-/manifest/export-config.json`
-7. `redcap2/report-/manifest/provenance.json`
+1. `redcap/report-/data.csv` or `data.json`
+2. `redcap/report-/metadata.csv`
+3. `redcap/report-/project_info.json`
+4. `redcap/report-/events.csv` (longitudinal only)
+5. `redcap/report-/form_event_mapping.csv` (longitudinal only)
+6. `redcap/report-/manifest.json`

-Suggested naming (record mode):
+**Records mode:**

-1. `redcap2/records//records.flat.csv` or `records.eav.csv`
-2. same schema + manifest sidecars as above
+1. `redcap/records/data.csv` or `data.json`
+2. `redcap/records/metadata.csv`
+3. `redcap/records/project_info.json`
+4. `redcap/records/events.csv` (longitudinal only)
+5. `redcap/records/form_event_mapping.csv` (longitudinal only)
+6. `redcap/records/manifest.json`
+
+Planned naming extensions (later):
+
+1. Additional metadata sidecars for standards exporters (DDI-CDI, Croissant, RO-Crate).

 Design requirements:

@@ -117,32 +208,41 @@ Design requirements:
 2. Stable hashing for change detection.
 3. Each generated file can be independently selected in the tree.

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Export Controls](#export-controls)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Export Controls](#export-controls)

 ---

 ## Export Controls

-### Core Controls (MVP)
+### Implemented Controls (both modes)
+
+1. `exportMode`: `report` or `records`
+2. `dataFormat`: `csv` or `json`
+3. `recordType`: `flat` or `eav`
+4. `csvDelimiter`: comma or tab
+5. `rawOrLabel`: `raw`, `label`, or `both`
+6. `rawOrLabelHeaders`: `raw` or `label`
+7. `variables[]` with anonymization mode: `none` or `blank`

-1. `mode`: `report` or `records`
-2. `report_id` (required for report mode)
-3. `format_type`: `csv`/`json`/`xml`
-4. `csv_delimiter`: comma or tab
-5. `raw_or_label`
-6. `raw_or_label_headers`
-7. `export_checkbox_labels`
+### Report Mode Only

-### Advanced Record Controls
+8. `reportId` (required — entered manually; REDCap API has no report-listing endpoint)

-1. `fields` (variable subset)
-2. `forms`
-3. `events`
-4. `records` (record IDs subset)
-5. `filter_logic`
-6. `dateRangeBegin`
-7. `dateRangeEnd`
-8. `record_type`: `flat`/`eav`
+### Records Mode Only
+
+9. `fields`
+10. `forms`
+11. `events`
+12. `records`
+13. `filterLogic`
+14. `dateRangeBegin`
+15. `dateRangeEnd`
+16. `exportSurveyFields`: include survey identifier and timestamp fields (default `false`)
+17. `exportDataAccessGroups`: include Data Access Group field (default `false`)
+
+### Planned Controls
+
+1. XML output support

 ### Attachment Controls

@@ -155,7 +255,87 @@ Rationale:
 1. For many projects, upload/file fields should remain references in MVP.
 2. Full attachment download can be expensive and should be explicit.

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ De-Identification And Encryption](#de-identification-and-encryption)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ REDCap Built-In De-Identification Parameters](#redcap-built-in-de-identification-parameters)
+
+---
+
+## REDCap Built-In De-Identification Parameters
+
+The REDCap record-export API (`content=record`) natively supports several de-identification parameters that can be applied **server-side** before data leaves REDCap. This section analyzes these parameters and how they relate to the manual per-variable anonymization we currently implement client-side.
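Editor's illustration (not part of the patch): a minimal Go sketch of how the per-mode parameter routing and the two opt-in server-side flags described in this document could be assembled into a request form. The `exportOptions` type and `buildExportForm` helper are hypothetical names chosen for this sketch and are not the plugin's actual `applySharedExportParams` / `applyRecordOnlyFilters` code; only the REDCap parameter names (`content`, `report_id`, `rawOrLabel`, `filterLogic`, `exportSurveyFields`, `exportDataAccessGroups`) are taken from the text.

```go
package main

import "net/url"

// exportOptions is a hypothetical, trimmed-down options type used only for
// this sketch; the plugin's real struct and field names may differ.
type exportOptions struct {
	ExportMode             string // "report" or "records"
	ReportID               string // report mode only
	RawOrLabel             string // shared by both modes
	FilterLogic            string // records mode only
	ExportSurveyFields     bool   // records mode only, default off
	ExportDataAccessGroups bool   // records mode only, default off
}

// buildExportForm returns the POST form fields for one export request.
// Record-only parameters are added only in records mode, and the two
// opt-in flags are added only when the user enabled them.
func buildExportForm(token string, o exportOptions) url.Values {
	form := url.Values{}
	form.Set("token", token)
	form.Set("format", "csv")
	if o.RawOrLabel != "" {
		form.Set("rawOrLabel", o.RawOrLabel)
	}

	if o.ExportMode == "report" {
		// The report definition in REDCap decides fields and filters.
		form.Set("content", "report")
		form.Set("report_id", o.ReportID)
		return form
	}

	form.Set("content", "record")
	if o.FilterLogic != "" {
		form.Set("filterLogic", o.FilterLogic)
	}
	if o.ExportSurveyFields {
		form.Set("exportSurveyFields", "true")
	}
	if o.ExportDataAccessGroups {
		form.Set("exportDataAccessGroups", "true")
	}
	return form
}
```

Under these assumptions the resulting `url.Values` would be sent as the body of an HTTP POST to the REDCap API endpoint. Keeping the record-only block behind the mode check mirrors the routing table: a report request never carries filter or de-identification flags, even if they are set in the options.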
+
+### Available API Parameters
+
+The `content=record` endpoint accepts these de-identification-related parameters:
+
+| Parameter | Type | Default | Description |
+|---|---|---|---|
+| `exportSurveyFields` | boolean | `false` | Include survey-specific fields (`redcap_survey_identifier`, `[instrument]_timestamp`). Set to `false` to strip them. |
+| `exportDataAccessGroups` | boolean | `false` | Include the `redcap_data_access_group` field. Set to `false` to strip it. |
+| `exportCheckboxLabel` | boolean | `false` | Export checkbox labels instead of raw values (relevant for label-based anonymization). |
+| `filterLogic` | string | — | Server-side record filtering. Already implemented. Can exclude records containing sensitive values. |
+
+Additionally, REDCap Data Dictionaries allow project admins to tag fields with **Identifier** status (`identifier = y`). While this designation is visible in the `content=metadata` export (the `identifier` column), there is **no** API parameter to automatically strip all identifier-tagged fields from a record export. That logic must be implemented client-side by the exporting tool.
+
+### What The REDCap Report-Export API Does NOT Offer
+
+The `content=report` endpoint does **not** accept any of the de-identification parameters above. Reports are exported as configured in the REDCap UI. However, when creating or editing a report in the REDCap web interface, the user can choose:
+
+1. **"Remove all tagged Identifier fields"** — the report definition itself excludes fields marked as identifiers.
+2. **"Hash the Record ID"** — the report replaces the record ID with a hashed value.
+3. **"Remove all free-text fields"** — strips notes/text fields.
+4. **"Remove dates and shift to date"** — date-shifts or removes date fields.
+
+These options are set **in the REDCap UI when creating the report** and take effect before the API returns data. They are not settable via the API at export time.
+
+### Comparison: Built-In vs. Our Current Client-Side Approach
+
+| Capability | REDCap built-in (server-side) | Our current approach (client-side) |
+|---|---|---|
+| Suppress survey identifier fields | `exportSurveyFields=false` (records mode) | **Implemented** — toggle on settings page (records mode) |
+| Suppress Data Access Groups | `exportDataAccessGroups=false` (records mode) | **Implemented** — toggle on settings page (records mode) |
+| Strip identifier-tagged fields | Not available as API parameter; only in report definitions | **Implemented** — auto-detected from metadata, pre-selected as `blank` |
+| Hash record ID | Report-level setting in REDCap UI only | Not yet implemented |
+| Blank/drop arbitrary fields | Not available | `variables[].anonymization = "blank"` per field |
+| Remove free-text fields | Report-level setting in REDCap UI only | Not available (would need new mode) |
+| Date-shift dates | Report-level setting in REDCap UI only | Not available |
+| Exclude specific fields | `fields` param (records mode — positive filter) | `variables[].anonymization = "blank"` per field |
+| Server-side record filter | `filterLogic` (records mode) | Already implemented |
+
+### Recommendations
+
+1. ~~**Expose `exportSurveyFields` and `exportDataAccessGroups` as toggles in records mode.**~~ **Done.**
+   Implemented as "Include survey fields" and "Include Data Access Groups" checkboxes on the records-mode settings page. Default is `false` (off). Backend sends the parameters in `applyRecordOnlyFilters` only when the user opts in.
+
+2. ~~**Auto-detect identifier-tagged fields from metadata.**~~ **Done.**
+   Backend parses the `identifier` column from `content=metadata` CSV and returns `Selected: true` on those fields in the variable-list response. Frontend pre-selects those variables as `blank` in the anonymization table. Users can override to `none`.
+
+3. **For report mode, document that de-identification is best done in the REDCap report definition itself.**
+   Since the report API has no de-identification parameters, users should be advised to enable "Remove all tagged Identifier fields", "Hash the Record ID", etc. when creating the report in REDCap. The manifest should record whether the report was configured for de-identification (this info is not available from the API, so it should be a user attestation or checkbox in the UI).
+
+4. **Do not try to replicate date-shifting or record ID hashing client-side in the near term.**
+   REDCap's date-shifting uses project-level offsets that are not exposed via the API. Reimplementing this would be complex and fragile. If date-shifting is needed, users should use a report with date-shifting enabled, or use records mode and apply a post-processing step.
+
+5. **Keep the manual per-variable `blank` mode as the primary client-side tool for both modes.**
+   It is more flexible than anything REDCap offers at the API level and complements the built-in parameters well. The planned `drop`/`mask`/`pseudonymize` extensions remain valuable for cases that built-in parameters cannot cover.
+
+### Implementation Details
+
+`exportSurveyFields` and `exportDataAccessGroups` backend wiring:
+
+- Two fields added to `pluginOptions`: `ExportSurveyFields bool` and `ExportDataAccessGroups bool`.
+- Sent in `applyRecordOnlyFilters` (records-mode only) when the user opts in.
+- Included in `bundleCacheKey` for correct cache separation.
+- Two checkboxes on the frontend settings page (records mode only, defaults off).
+
+Identifier auto-detection wiring:
+
+- `identifierFieldsFromMetadata()` parses the `identifier` column from the `content=metadata` CSV.
+- `listVariablesFromMetadata()` and `listVariablesFromReport()` return `SelectItem` entries with `Selected: true` for identifier-tagged fields.
+- Frontend reads the `selected` flag and pre-sets those variables' anonymization to `blank` (user can override to `none`).
+
+Both changes are backward-compatible with the existing payload structure.
+
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ De-Identification And Encryption](#de-identification-and-encryption)

 ---

@@ -163,7 +343,7 @@ Rationale:

 ### Policy Model

-De-identification should be policy-driven, not ad-hoc.
+De-identification should be policy-driven, not ad-hoc. The built-in REDCap parameters described [above](#redcap-built-in-de-identification-parameters) should be used as the first layer (server-side stripping), with our policy model applied as a second layer (client-side transforms).

 Suggested policy file (`redcap2-policy.json`):

@@ -175,14 +355,19 @@ Suggested policy file (`redcap2-policy.json`):

 ### Methods

-1. **Drop**
-   - safest for direct identifiers
-2. **Blank**
+1. **Server-side suppression (NEW — via built-in REDCap parameters)**
+   - `exportSurveyFields=false`: suppress survey identifier and timestamp fields
+   - `exportDataAccessGroups=false`: suppress data access group field
+   - safest option — data never leaves REDCap
+2. **Drop**
+   - safest client-side option for direct identifiers
+3. **Blank**
    - preserves schema, no values
-3. **Deterministic pseudonymization (non-reversible)**
+   - can be auto-applied to REDCap identifier-tagged fields
+4. **Deterministic pseudonymization (non-reversible)**
    - e.g. HMAC-based token with secret key
    - consistent per value, not reversible
-4. **Reversible encryption**
+5. **Reversible encryption**
    - only if strictly required
    - requires key management, key rotation, audit policy, and strict access controls

@@ -193,11 +378,13 @@ Important:

 ### Recommended Defaults

-1. Default to `blank` or `drop` for known identifiers.
-2. Make reversible encryption opt-in and disabled by default.
-3. Store no raw keys in job payloads or logs.
+1. Use server-side suppression (`exportSurveyFields=false`, `exportDataAccessGroups=false`) as the baseline.
+2. Auto-blank REDCap identifier-tagged fields (detected from metadata) by default; allow user override.
+3. Default to `blank` or `drop` for any remaining known identifiers.
+4. Make reversible encryption opt-in and disabled by default.
+5. Store no raw keys in job payloads or logs.

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Metadata Outputs](#metadata-outputs)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Metadata Outputs](#metadata-outputs)

 ---

@@ -241,7 +428,7 @@ Option B:

 MVP recommendation: Option A.

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Architecture In rdm-integration](#architecture-in-rdm-integration)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Architecture In rdm-integration](#architecture-in-rdm-integration)

 ---

@@ -249,53 +436,66 @@ MVP recommendation: Option A.

 ### Backend (this repo)

-Add:
-
 1. `image/app/plugin/impl/redcap2/common.go`
 2. `image/app/plugin/impl/redcap2/options.go`
 3. `image/app/plugin/impl/redcap2/query.go`
 4. `image/app/plugin/impl/redcap2/streams.go`
-5. `image/app/plugin/impl/redcap2/metadata.go` (optional initially)
-6. `image/app/plugin/impl/redcap2/deidentify.go`
-7. `image/app/plugin/impl/redcap2/exporters/` (for DDI-CDI/Croissant/RO-Crate)
+5. `image/app/plugin/registry.go` with `redcap2`
+6. `image/app/frontend/default_frontend_config.json` add `redcap2` entry
+7. `conf/frontend_config.json` add `redcap2` entry
+8. plugin request structs now include `pluginOptions`:
+   - `OptionsRequest`
+   - `CompareRequest`
+   - `StreamParams`

-Update:
+Planned backend extensions (not yet implemented):

-1. `image/app/plugin/registry.go` with `redcap2`
-2. `image/app/frontend/default_frontend_config.json` add `redcap2` entry
-3. request/option handling if extra params are needed beyond existing fields
+1. `image/app/plugin/impl/redcap2/metadata.go`
+2. `image/app/plugin/impl/redcap2/deidentify.go`
+3. `image/app/plugin/impl/redcap2/exporters/` (DDI-CDI/Croissant/RO-Crate)

 ### Frontend (separate repo)

-Add `redcap2` plugin UX:
+Implemented frontend `redcap2` UX:
+
+1. Intermediate export settings page (`/redcap2-export/:id`).
+2. Report / All records mode toggle (report mode is default).
+3. Report ID field (visible in report mode only).
+4. Common export controls: format, record type, delimiter, raw/label, header labels.
+5. Record-only filter fields: fields, forms, events, records, filter logic, date range (visible in records mode only).
+6. "Include survey fields" and "Include Data Access Groups" toggles (records mode only, default off).
+7. Variable anonymization table (`none`/`blank`) with auto-detection of REDCap identifier-tagged fields (pre-selected as `blank`).
+8. Generated files preview updates to show mode-appropriate paths.
+9. `pluginOptions` payload propagated through options, compare, and stream/store requests.
+
+Planned frontend extensions:

-1. report selection input/dropdown
-2. export settings panel
-3. de-identification config panel (later phase)
-4. metadata format toggles
+1. Richer de-identification config panel.
+2. Metadata format toggles and generators.

 Constraint:

-Current generic request model is string-heavy (`option`, `repoName`, etc.). For advanced controls, we should add a structured `pluginOptions` payload rather than overloading one string field.
+Current generic request model is string-heavy (`option`, `repoName`, etc.). `pluginOptions` is now used for structured `redcap2` settings and should remain the extension mechanism.
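Editor's illustration (not part of the patch): a hedged Go sketch of what a structured `pluginOptions` payload with per-variable anonymization might look like and how the `blank` mode could be applied to exported rows. The struct shapes, the `name` JSON key, and the helpers `parsePluginOptions` and `blankColumns` are assumptions made for this example; the document itself only names `exportMode`, `reportId`, and `variables[].anonymization`.

```go
package main

import "encoding/json"

// variableOption and pluginOptions are hypothetical shapes for this sketch;
// the real payload may carry more fields and use different key names.
type variableOption struct {
	Name          string `json:"name"`          // assumed key for the field name
	Anonymization string `json:"anonymization"` // "none" or "blank"
}

type pluginOptions struct {
	ExportMode string           `json:"exportMode"`
	ReportID   string           `json:"reportId"`
	Variables  []variableOption `json:"variables"`
}

// parsePluginOptions decodes the structured payload from raw JSON.
func parsePluginOptions(raw []byte) (pluginOptions, error) {
	var opts pluginOptions
	if err := json.Unmarshal(raw, &opts); err != nil {
		return pluginOptions{}, err
	}
	return opts, nil
}

// blankColumns returns a copy of rows (header first) in which every value
// of a column marked "blank" is emptied. The header row and the column
// order are preserved, so the schema stays intact.
func blankColumns(rows [][]string, opts pluginOptions) [][]string {
	if len(rows) == 0 {
		return rows
	}
	blank := map[string]bool{}
	for _, v := range opts.Variables {
		if v.Anonymization == "blank" {
			blank[v.Name] = true
		}
	}
	header := rows[0]
	out := make([][]string, len(rows))
	out[0] = append([]string(nil), header...)
	for i := 1; i < len(rows); i++ {
		row := append([]string(nil), rows[i]...)
		for j := range row {
			if j < len(header) && blank[header[j]] {
				row[j] = ""
			}
		}
		out[i] = row
	}
	return out
}
```

Copying each row before blanking keeps the input untouched, which would make it straightforward to hash or compare the original and transformed exports separately.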

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Step-By-Step Implementation Plan](#step-by-step-implementation-plan)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Step-By-Step Implementation Plan](#step-by-step-implementation-plan)

 ---

 ## Step-By-Step Implementation Plan

-### Phase 0: Design Lock
+### Phase 0: Design Lock [Completed]

-1. Confirm whether report listing endpoint exists on target REDCap instance.
+1. ~~Confirm whether report listing endpoint exists on target REDCap instance.~~
+   Confirmed: standard REDCap API has **no** report-listing endpoint; report ID is entered manually.
 2. Confirm minimum REDCap version and API rights assumptions.
 3. Lock MVP scope:
-   - report export only
-   - csv/json/xml
-   - schema sidecars
+   - report mode + records mode (both implemented)
+   - csv/json initial scope
+   - report-sidecar generation
    - no attachment download
    - no reversible encryption in MVP

-### Phase 1: Backend `redcap2` MVP
+### Phase 1: Backend `redcap2` MVP [Completed]

 1. Scaffold `redcap2` plugin package.
 2. Implement API client helpers for report export + metadata export.
@@ -305,25 +505,30 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). For
 6. Add logging, error handling, and timeout strategy for long exports.
 7. Register plugin in `registry.go`.

-### Phase 2: Frontend MVP Wiring
+### Phase 2: Frontend MVP Wiring [Completed]

 1. Add `redcap2` entry to frontend config.
-2. Add required fields:
+2. Add required fields and intermediate settings page:
    - URL
    - token
-   - report ID
-   - export format/delimiter
+   - report ID (text input on export page)
+   - export controls (including rawOrLabel, rawOrLabelHeaders)
+   - variable anonymization
 3. Pass settings into compare/stream requests.
 4. Verify compare tree and sync workflow end-to-end.

-### Phase 3: Record Mode Controls
+### Phase 3: Record Mode Controls [Completed]

-1. Add record-mode API path.
-2. Add fields/forms/events/records/filter/date-range options.
-3. Add flat/eav export mode.
-4. Add unit tests for each parameter combination.
+1. ~~Add record-mode API path (`content=record`).~~
+2. ~~Add fields/forms/events/records/filter/date-range options.~~
+3. ~~Add flat/eav export mode.~~
+4. ~~Separate `applySharedExportParams` from `applyRecordOnlyFilters`.~~
+5. ~~Add report/records mode toggle to frontend.~~
+6. ~~Expose `exportSurveyFields` and `exportDataAccessGroups` as records-mode toggles.~~
+7. ~~Auto-detect identifier-tagged fields from metadata and pre-blank them.~~
+8. Add unit tests for each parameter combination.

-### Phase 4: De-Identification Engine
+### Phase 4: De-Identification Engine [Next]

 1. Add policy schema and validation.
 2. Implement field-level transforms (drop/blank/mask/pseudonymize).
@@ -334,7 +539,7 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). For
    - no raw-value logging
    - secure defaults

-### Phase 5: Metadata Exporters
+### Phase 5: Metadata Exporters [Next]

 1. Define normalized metadata model.
 2. Implement exporter adapters:
@@ -344,7 +549,7 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). For
 3. Expose format toggles in UI.
 4. Add schema validation tests for each output type.

-### Phase 6: Hardening And Rollout
+### Phase 6: Hardening And Rollout [Next]

 1. Performance test with large REDCap projects.
 2. Security review (keys, logs, PII handling, transport).
@@ -352,7 +557,7 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). For
 4. Run pilot with limited users.
 5. Keep `redcap` plugin as stable fallback until `redcap2` is proven.

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Testing Plan](#testing-plan)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Testing Plan](#testing-plan)

 ---

@@ -380,22 +585,23 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). For
 2. Verify reversible encryption requires explicit opt-in.
 3. Validate redaction of error messages containing sensitive values.

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ Open Questions](#open-questions)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Open Questions](#open-questions)

 ---

 ## Open Questions

-1. Can we list reports over API on the target REDCap instance, or must users provide `report_id` manually?
-2. Which de-identification policy should be default at KU Leuven:
+1. ~~Do all target REDCap instances expose report listing?~~ **Resolved:** The standard REDCap API does not expose a report-listing endpoint. Report IDs are entered manually.
+2. ~~Should record mode be a separate flow or a toggle?~~ **Resolved:** Implemented as a toggle on the same settings page.
+3. Which de-identification policy should be default at KU Leuven:
    - drop identifiers
    - blank identifiers
    - deterministic pseudonymization
-3. Are reversible transformations acceptable under institutional policy?
-4. Should metadata outputs be generated during sync, after sync, or both?
-5. Should attachments be supported in MVP or deferred?
+4. Are reversible transformations acceptable under institutional policy?
+5. Should metadata outputs be generated during sync, after sync, or both?
+6. Should attachments be supported in MVP or deferred?

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan) | [→ References](#references)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ References](#references)

 ---

@@ -416,4 +622,4 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). For
 7. REDCap report export workflow reference:
    - https://docs.datalad.org/projects/redcap/en/latest/generated/man/datalad-export-redcap-report.html

-[↑ Back to Top](#redcap2-plugin-design-and-implementation-plan)
+[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan)

From 1338cc08edc4453b56a92fb8d2526176faa74340 Mon Sep 17 00:00:00 2001
From: Eryk Kullikowski
Date: Tue, 10 Mar 2026 13:54:08 +0100
Subject: [PATCH 02/25] redcap v2

---
 Makefile                                      |    2 +-
 conf/frontend_config.json                     |   12 +-
 .../app/frontend/default_frontend_config.json |   12 +-
 image/app/plugin/impl/redcap2/common.go       | 1014 +++++++++++++++++
 image/app/plugin/impl/redcap2/options.go      |   39 +
 image/app/plugin/impl/redcap2/query.go        |   76 ++
 image/app/plugin/impl/redcap2/streams.go      |   69 ++
 image/app/plugin/registry.go                  |    7 +
 image/app/plugin/types/compare_request.go     |   23 +-
 image/app/plugin/types/options_request.go     |   17 +-
 image/app/plugin/types/stream_params.go       |   21 +-
 11 files changed, 1260 insertions(+), 32 deletions(-)
 create mode 100644 image/app/plugin/impl/redcap2/common.go
 create mode 100644 image/app/plugin/impl/redcap2/options.go
 create mode 100644 image/app/plugin/impl/redcap2/query.go
 create mode 100644 image/app/plugin/impl/redcap2/streams.go

diff --git a/Makefile b/Makefile
index ddf565c..95b6921 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 # Author: Eryk Kulikowski @ KU Leuven (2023). Apache 2.0 License

-STAGE ?= prod
+STAGE ?= dev
 BUILD_BASE_HREF ?= /integration/

 include env.$(STAGE)
diff --git a/conf/frontend_config.json b/conf/frontend_config.json
index 26622e1..8e646b1 100644
--- a/conf/frontend_config.json
+++ b/conf/frontend_config.json
@@ -67,6 +67,16 @@
             "sourceUrlFieldName": "Source URL",
             "sourceUrlFieldPlaceholder": "https://your.redcap.server"
         },
+        {
+            "id": "redcap2",
+            "name": "Other REDCap (reports beta)",
+            "plugin": "redcap2",
+            "pluginName": "REDCap",
+            "tokenFieldName": "Project token",
+            "tokenFieldPlaceholder": "project token",
+            "sourceUrlFieldName": "Source URL",
+            "sourceUrlFieldPlaceholder": "https://preview.redcap.gbiomed.kuleuven.be"
+        },
         {
             "id": "osf",
             "name": "OSF",
@@ -106,4 +116,4 @@
             "tokenFieldPlaceholder": "password"
         }
     ]
-}
\ No newline at end of file
+}
diff --git a/image/app/frontend/default_frontend_config.json b/image/app/frontend/default_frontend_config.json
index b89d5b0..d732d66 100644
--- a/image/app/frontend/default_frontend_config.json
+++ b/image/app/frontend/default_frontend_config.json
@@ -68,6 +68,16 @@
             "sourceUrlFieldName": "Source URL",
             "sourceUrlFieldPlaceholder": "https://your.redcap.server"
         },
+        {
+            "id": "redcap2",
+            "name": "REDCap Reports (beta)",
+            "plugin": "redcap2",
+            "pluginName": "REDCap",
+            "tokenFieldName": "Project token",
+            "tokenFieldPlaceholder": "project token",
+            "sourceUrlFieldName": "Source URL",
+            "sourceUrlFieldPlaceholder": "https://your.redcap.server"
+        },
         {
             "id": "osf",
             "name": "OSF",
@@ -102,4 +112,4 @@
             "repoNameFieldHasSearch": true
         }
     ]
-}
\ No newline at end of file
+}
diff --git a/image/app/plugin/impl/redcap2/common.go b/image/app/plugin/impl/redcap2/common.go
new file mode 100644
index 0000000..02571e8
--- /dev/null
+++ b/image/app/plugin/impl/redcap2/common.go
@@ -0,0 +1,1014 @@
+// Author: Eryk Kulikowski @ KU Leuven (2026).
Apache 2.0 License + +package redcap2 + +import ( + "bufio" + "bytes" + "context" + "crypto/md5" + "encoding/csv" + "encoding/json" + "fmt" + "integration/app/logging" + "integration/app/plugin/types" + "io" + "net/http" + "net/url" + "sort" + "strings" + "sync" + "time" +) + +type variableOption struct { + Name string `json:"name"` + Anonymization string `json:"anonymization"` +} + +type pluginOptions struct { + ExportMode string `json:"exportMode"` // "report" or "records" + Request string `json:"request"` + ReportID string `json:"reportId"` + DataFormat string `json:"dataFormat"` + Fields []string `json:"fields"` + Forms []string `json:"forms"` + Events []string `json:"events"` + Records []string `json:"records"` + FilterLogic string `json:"filterLogic"` + DateRangeBegin string `json:"dateRangeBegin"` + DateRangeEnd string `json:"dateRangeEnd"` + RecordType string `json:"recordType"` + CsvDelimiter string `json:"csvDelimiter"` + RawOrLabel string `json:"rawOrLabel"` + RawOrLabelHeaders string `json:"rawOrLabelHeaders"` + ExportSurveyFields bool `json:"exportSurveyFields"` + ExportDataAccessGroups bool `json:"exportDataAccessGroups"` + Variables []variableOption `json:"variables"` + GeneratedAt string `json:"generatedAt"` +} + +type generatedBundle struct { + ReportID string + Files map[string][]byte +} + +var ( + httpClient *http.Client + clientOnce sync.Once +) + +// bundleCacheEntry holds a cached export bundle and its expiry time. +type bundleCacheEntry struct { + bundle generatedBundle + expiresAt time.Time +} + +// bundleStore is a simple TTL cache for generated bundles. 
+type bundleStore struct { + mu sync.Mutex + entries map[string]bundleCacheEntry +} + +const bundleCacheTTL = 5 * time.Minute + +var globalBundleCache = &bundleStore{entries: make(map[string]bundleCacheEntry)} + +func (s *bundleStore) get(key string) (generatedBundle, bool) { + s.mu.Lock() + defer s.mu.Unlock() + entry, ok := s.entries[key] + if !ok || time.Now().After(entry.expiresAt) { + delete(s.entries, key) + return generatedBundle{}, false + } + return entry.bundle, true +} + +func (s *bundleStore) set(key string, b generatedBundle) { + s.mu.Lock() + defer s.mu.Unlock() + // Lazy eviction: sweep expired entries on every set to prevent unbounded growth. + now := time.Now() + for k, entry := range s.entries { + if now.After(entry.expiresAt) { + delete(s.entries, k) + } + } + s.entries[key] = bundleCacheEntry{bundle: b, expiresAt: now.Add(bundleCacheTTL)} +} + +func getHTTPClient() *http.Client { + clientOnce.Do(func() { + httpClient = &http.Client{ + Timeout: 5 * time.Minute, + Transport: &http.Transport{ + MaxIdleConns: 100, + MaxIdleConnsPerHost: 10, + IdleConnTimeout: 90 * time.Second, + DisableKeepAlives: false, + }, + } + }) + return httpClient +} + +func parsePluginOptions(raw string) (pluginOptions, error) { + opts := pluginOptions{ + ExportMode: "report", + DataFormat: "csv", + RecordType: "flat", + CsvDelimiter: ",", + RawOrLabel: "raw", + RawOrLabelHeaders: "raw", + GeneratedAt: "missing-generated-at", + } + if strings.TrimSpace(raw) == "" { + return opts, nil + } + if err := json.Unmarshal([]byte(raw), &opts); err != nil { + return pluginOptions{}, fmt.Errorf("invalid pluginOptions JSON: %w", err) + } + normalizePluginOptions(&opts) + return opts, nil +} + +func normalizePluginOptions(opts *pluginOptions) { + switch strings.ToLower(strings.TrimSpace(opts.ExportMode)) { + case "records": + opts.ExportMode = "records" + default: + opts.ExportMode = "report" + } + + opts.Request = strings.TrimSpace(opts.Request) + opts.ReportID = 
strings.TrimSpace(opts.ReportID) + opts.FilterLogic = strings.TrimSpace(opts.FilterLogic) + opts.DateRangeBegin = strings.TrimSpace(opts.DateRangeBegin) + opts.DateRangeEnd = strings.TrimSpace(opts.DateRangeEnd) + if strings.TrimSpace(opts.GeneratedAt) == "" { + opts.GeneratedAt = "missing-generated-at" + } + + switch strings.ToLower(strings.TrimSpace(opts.DataFormat)) { + case "json": + opts.DataFormat = "json" + default: + opts.DataFormat = "csv" + } + + switch strings.ToLower(strings.TrimSpace(opts.RecordType)) { + case "eav": + opts.RecordType = "eav" + default: + opts.RecordType = "flat" + } + + switch strings.ToLower(strings.TrimSpace(opts.CsvDelimiter)) { + case "tab", "\\t", "tsv": + opts.CsvDelimiter = "\t" + default: + opts.CsvDelimiter = "," + } + + switch strings.ToLower(strings.TrimSpace(opts.RawOrLabel)) { + case "label": + opts.RawOrLabel = "label" + case "both": + opts.RawOrLabel = "both" + default: + opts.RawOrLabel = "raw" + } + + switch strings.ToLower(strings.TrimSpace(opts.RawOrLabelHeaders)) { + case "label": + opts.RawOrLabelHeaders = "label" + default: + opts.RawOrLabelHeaders = "raw" + } + + opts.Fields = normalizeStringSlice(opts.Fields) + opts.Forms = normalizeStringSlice(opts.Forms) + opts.Events = normalizeStringSlice(opts.Events) + opts.Records = normalizeStringSlice(opts.Records) + for i := range opts.Variables { + opts.Variables[i].Name = strings.TrimSpace(opts.Variables[i].Name) + switch strings.ToLower(strings.TrimSpace(opts.Variables[i].Anonymization)) { + case "blank": + opts.Variables[i].Anonymization = "blank" + default: + opts.Variables[i].Anonymization = "none" + } + } +} + +func normalizeStringSlice(in []string) []string { + if len(in) == 0 { + return nil + } + out := make([]string, 0, len(in)) + seen := make(map[string]bool, len(in)) + for _, raw := range in { + v := strings.TrimSpace(raw) + if v == "" || seen[v] { + continue + } + seen[v] = true + out = append(out, v) + } + if len(out) == 0 { + return nil + } + return out 
+} + +func getAPIURL(baseURL string) string { + base := strings.TrimSpace(baseURL) + if strings.HasSuffix(base, "/api") { + return base + "/" + } + if strings.HasSuffix(base, "/api/") { + return base + } + return strings.TrimSuffix(base, "/") + "/api/" +} + +func redcapRequest(ctx context.Context, baseURL string, form url.Values) ([]byte, error) { + apiURL := getAPIURL(baseURL) + req, err := http.NewRequestWithContext( + ctx, + http.MethodPost, + apiURL, + bytes.NewBufferString(form.Encode()), + ) + if err != nil { + return nil, err + } + req.Header.Add("Content-Type", "application/x-www-form-urlencoded") + req.Header.Add("Accept", "*/*") + + resp, err := getHTTPClient().Do(req) + if err != nil { + return nil, fmt.Errorf("redcap request failed: %w", err) + } + defer resp.Body.Close() + body, err := io.ReadAll(resp.Body) + if err != nil { + return nil, fmt.Errorf("failed to read redcap response: %w", err) + } + + if resp.StatusCode != http.StatusOK { + return nil, fmt.Errorf("redcap request failed with status %d: %s", resp.StatusCode, strings.TrimSpace(string(body))) + } + + trimmed := strings.TrimSpace(string(body)) + if strings.HasPrefix(strings.ToUpper(trimmed), "ERROR") { + return nil, fmt.Errorf("redcap error: %s", trimmed) + } + return body, nil +} + +func baseForm(token, content, format string) url.Values { + form := url.Values{} + form.Set("token", token) + form.Set("content", content) + form.Set("format", format) + form.Set("returnFormat", "json") + return form +} + +// applySharedExportParams sets parameters valid for both content=report and content=record. 
+func applySharedExportParams(form url.Values, opts pluginOptions) { + if opts.RecordType != "" { + form.Set("type", opts.RecordType) + } + if opts.CsvDelimiter == "\t" { + form.Set("csvDelimiter", "tab") + } + if opts.RawOrLabel != "" && opts.RawOrLabel != "raw" { + form.Set("rawOrLabel", opts.RawOrLabel) + } + if opts.RawOrLabelHeaders != "" && opts.RawOrLabelHeaders != "raw" { + form.Set("rawOrLabelHeaders", opts.RawOrLabelHeaders) + } +} + +// applyRecordOnlyFilters sets parameters only valid for content=record exports. +// These parameters are not supported by the content=report endpoint. +func applyRecordOnlyFilters(form url.Values, opts pluginOptions) { + if len(opts.Fields) > 0 { + form.Set("fields", strings.Join(opts.Fields, ",")) + } + if len(opts.Forms) > 0 { + form.Set("forms", strings.Join(opts.Forms, ",")) + } + if len(opts.Events) > 0 { + form.Set("events", strings.Join(opts.Events, ",")) + } + if len(opts.Records) > 0 { + form.Set("records", strings.Join(opts.Records, ",")) + } + if opts.ExportSurveyFields { + form.Set("exportSurveyFields", "true") + } + if opts.ExportDataAccessGroups { + form.Set("exportDataAccessGroups", "true") + } + if opts.FilterLogic != "" { + form.Set("filterLogic", opts.FilterLogic) + } + if opts.DateRangeBegin != "" { + v := opts.DateRangeBegin + if len(v) == 10 { // YYYY-MM-DD without time component + v += " 00:00:00" + } + form.Set("dateRangeBegin", v) + } + if opts.DateRangeEnd != "" { + v := opts.DateRangeEnd + if len(v) == 10 { + v += " 23:59:59" + } + form.Set("dateRangeEnd", v) + } +} + +func reportDelimiter(opts pluginOptions) rune { + if opts.CsvDelimiter == "\t" { + return '\t' + } + return ',' +} + +func blankFields(opts pluginOptions) map[string]bool { + res := map[string]bool{} + for _, v := range opts.Variables { + if v.Name == "" { + continue + } + if v.Anonymization == "blank" { + res[v.Name] = true + } + } + return res +} + +func parseCSV(data []byte, delimiter rune) ([][]string, error) { + reader := 
csv.NewReader(bytes.NewReader(data)) + reader.Comma = delimiter + reader.FieldsPerRecord = -1 + return reader.ReadAll() +} + +func writeCSV(rows [][]string, delimiter rune) ([]byte, error) { + var b bytes.Buffer + writer := csv.NewWriter(&b) + writer.Comma = delimiter + if err := writer.WriteAll(rows); err != nil { + return nil, err + } + writer.Flush() + if err := writer.Error(); err != nil { + return nil, err + } + return b.Bytes(), nil +} + +func applyBlankCSV(data []byte, delimiter rune, blanks map[string]bool) ([]byte, []string, error) { + rows, err := parseCSV(data, delimiter) + if err != nil { + return nil, nil, err + } + if len(rows) == 0 { + return data, nil, nil + } + header := append([]string(nil), rows[0]...) + if len(blanks) == 0 { + return data, header, nil + } + indices := make([]int, 0, len(header)) + for i, field := range header { + if blanks[field] { + indices = append(indices, i) + } + } + if len(indices) == 0 { + return data, header, nil + } + for rowIdx := 1; rowIdx < len(rows); rowIdx++ { + for _, colIdx := range indices { + if colIdx < len(rows[rowIdx]) { + rows[rowIdx][colIdx] = "" + } + } + } + out, err := writeCSV(rows, delimiter) + if err != nil { + return nil, nil, err + } + return out, header, nil +} + +func applyBlankJSON(data []byte, blanks map[string]bool) ([]byte, []string, error) { + rows := make([]map[string]interface{}, 0) + if err := json.Unmarshal(data, &rows); err != nil { + return nil, nil, err + } + keys := map[string]bool{} + for _, row := range rows { + for k := range row { + keys[k] = true + } + } + fields := make([]string, 0, len(keys)) + for k := range keys { + fields = append(fields, k) + } + sort.Strings(fields) + + if len(blanks) == 0 { + return data, fields, nil + } + + for _, row := range rows { + for field := range blanks { + if _, ok := row[field]; ok { + row[field] = "" + } + } + } + + out, err := json.Marshal(rows) + if err != nil { + return nil, nil, err + } + return out, fields, nil +} + +func 
exportReportData(ctx context.Context, baseURL, token string, opts pluginOptions) ([]byte, []string, error) { + form := baseForm(token, "report", opts.DataFormat) + form.Set("report_id", opts.ReportID) + applySharedExportParams(form, opts) + + body, err := redcapRequest(ctx, baseURL, form) + if err != nil { + return nil, nil, err + } + + blanks := blankFields(opts) + if opts.DataFormat == "json" { + return applyBlankJSON(body, blanks) + } + return applyBlankCSV(body, reportDelimiter(opts), blanks) +} + +func exportRecordData(ctx context.Context, baseURL, token string, opts pluginOptions) ([]byte, []string, error) { + form := baseForm(token, "record", opts.DataFormat) + applySharedExportParams(form, opts) + applyRecordOnlyFilters(form, opts) + + body, err := redcapRequest(ctx, baseURL, form) + if err != nil { + return nil, nil, err + } + + blanks := blankFields(opts) + if opts.DataFormat == "json" { + return applyBlankJSON(body, blanks) + } + return applyBlankCSV(body, reportDelimiter(opts), blanks) +} + +// redcapRequestHeaderOnly fetches only the first CSV line of a REDCap response, +// avoiding the cost of downloading the full dataset when only the column names are needed. 
+func redcapRequestHeaderOnly(ctx context.Context, baseURL string, form url.Values, delimiter rune) ([]string, error) { + apiURL := getAPIURL(baseURL) + req, err := http.NewRequestWithContext( + ctx, + http.MethodPost, + apiURL, + bytes.NewBufferString(form.Encode()), + ) + if err != nil { + return nil, err + } + req.Header.Add("Content-Type", "application/x-www-form-urlencoded") + req.Header.Add("Accept", "*/*") + + resp, err := getHTTPClient().Do(req) + if err != nil { + return nil, fmt.Errorf("redcap request failed: %w", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + body, _ := io.ReadAll(resp.Body) + return nil, fmt.Errorf("redcap request failed with status %d: %s", resp.StatusCode, strings.TrimSpace(string(body))) + } + + scanner := bufio.NewScanner(resp.Body) + scanner.Buffer(make([]byte, 64*1024), 1*1024*1024) // header rows can be wide + if !scanner.Scan() { + return nil, fmt.Errorf("empty response from REDCap report header request") + } + line := scanner.Text() + trimmed := strings.TrimSpace(line) + if strings.HasPrefix(strings.ToUpper(trimmed), "ERROR") { + return nil, fmt.Errorf("redcap error: %s", trimmed) + } + reader := csv.NewReader(strings.NewReader(line)) + reader.Comma = delimiter + reader.FieldsPerRecord = -1 + record, err := reader.Read() + if err != nil { + return nil, fmt.Errorf("failed to parse report header: %w", err) + } + return record, nil +} + +func fallbackFieldsFromMetadata(ctx context.Context, baseURL, token string) ([]string, error) { + form := baseForm(token, "metadata", "csv") + body, err := redcapRequest(ctx, baseURL, form) + if err != nil { + return nil, err + } + rows, err := parseCSV(body, ',') + if err != nil || len(rows) == 0 { + return nil, err + } + fieldIdx := -1 + for i, col := range rows[0] { + if strings.EqualFold(strings.TrimSpace(col), "field_name") { + fieldIdx = i + break + } + } + if fieldIdx < 0 { + return nil, nil + } + res := make([]string, 0, len(rows)-1) + seen := 
map[string]bool{} + for _, row := range rows[1:] { + if fieldIdx >= len(row) { + continue + } + field := strings.TrimSpace(row[fieldIdx]) + if field == "" || seen[field] { + continue + } + seen[field] = true + res = append(res, field) + } + sort.Strings(res) + return res, nil +} + +func deduplicatedSelectItems(fields []string) []types.SelectItem { + seen := make(map[string]bool, len(fields)) + unique := make([]string, 0, len(fields)) + for _, f := range fields { + f = strings.TrimSpace(f) + if f != "" && !seen[f] { + seen[f] = true + unique = append(unique, f) + } + } + sort.Strings(unique) + out := make([]types.SelectItem, 0, len(unique)) + for _, field := range unique { + out = append(out, types.SelectItem{Label: field, Value: field}) + } + return out +} + +// deduplicatedSelectItemsWithIdentifiers builds a sorted, deduplicated list of SelectItem values. +// Fields present in the identifiers set are returned with Selected=true, signalling the frontend +// to auto-blank them (they are REDCap identifier-tagged fields). +func deduplicatedSelectItemsWithIdentifiers(fields []string, identifiers map[string]bool) []types.SelectItem { + seen := make(map[string]bool, len(fields)) + unique := make([]string, 0, len(fields)) + for _, f := range fields { + f = strings.TrimSpace(f) + if f != "" && !seen[f] { + seen[f] = true + unique = append(unique, f) + } + } + sort.Strings(unique) + out := make([]types.SelectItem, 0, len(unique)) + for _, field := range unique { + out = append(out, types.SelectItem{Label: field, Value: field, Selected: identifiers[field]}) + } + return out +} + +// identifierFieldsFromMetadata fetches the project metadata and returns a set of field names +// that REDCap has tagged as identifiers (identifier column = "y" in the data dictionary). 
+func identifierFieldsFromMetadata(ctx context.Context, baseURL, token string) (map[string]bool, error) { + form := baseForm(token, "metadata", "csv") + body, err := redcapRequest(ctx, baseURL, form) + if err != nil { + return nil, err + } + rows, err := parseCSV(body, ',') + if err != nil || len(rows) == 0 { + return nil, err + } + fieldIdx := -1 + identifierIdx := -1 + for i, col := range rows[0] { + switch strings.ToLower(strings.TrimSpace(col)) { + case "field_name": + fieldIdx = i + case "identifier": + identifierIdx = i + } + } + if fieldIdx < 0 || identifierIdx < 0 { + return nil, nil + } + res := make(map[string]bool) + for _, row := range rows[1:] { + if fieldIdx >= len(row) || identifierIdx >= len(row) { + continue + } + field := strings.TrimSpace(row[fieldIdx]) + ident := strings.ToLower(strings.TrimSpace(row[identifierIdx])) + if field != "" && (ident == "y" || ident == "yes" || ident == "1") { + res[field] = true + } + } + return res, nil +} + +// listVariablesFromReport fetches column headers from a report export (CSV header-only request). +// Falls back to the full metadata field list if the report header fetch fails. +// Fields tagged as identifiers in REDCap are returned with Selected=true. +func listVariablesFromReport(ctx context.Context, baseURL, token, reportID string, opts pluginOptions) ([]types.SelectItem, error) { + identifiers, _ := identifierFieldsFromMetadata(ctx, baseURL, token) + + form := baseForm(token, "report", "csv") + form.Set("report_id", reportID) + applySharedExportParams(form, opts) + + fields, err := redcapRequestHeaderOnly(ctx, baseURL, form, reportDelimiter(opts)) + if err != nil { + // Fallback: derive field list from project metadata. + fields, err = fallbackFieldsFromMetadata(ctx, baseURL, token) + if err != nil { + return nil, err + } + } + return deduplicatedSelectItemsWithIdentifiers(fields, identifiers), nil +} + +// listVariablesFromMetadata returns all project fields from the metadata endpoint. 
+// Used for record export mode where there is no report to derive headers from. +// Fields tagged as identifiers in REDCap (identifier column = "y") are returned +// with Selected=true so the frontend can auto-blank them. +func listVariablesFromMetadata(ctx context.Context, baseURL, token string) ([]types.SelectItem, error) { + identifiers, err := identifierFieldsFromMetadata(ctx, baseURL, token) + if err != nil { + return nil, err + } + fields, err := fallbackFieldsFromMetadata(ctx, baseURL, token) + if err != nil { + return nil, err + } + return deduplicatedSelectItemsWithIdentifiers(fields, identifiers), nil +} + +func exportMetadataCSV(ctx context.Context, baseURL, token string, fields []string) ([]byte, error) { + form := baseForm(token, "metadata", "csv") + if len(fields) > 0 { + form.Set("fields", strings.Join(fields, ",")) + } + return redcapRequest(ctx, baseURL, form) +} + +func filterMetadataCSV(data []byte, fields []string) ([]byte, error) { + if len(fields) == 0 { + return data, nil + } + allowed := map[string]bool{} + for _, field := range fields { + if strings.TrimSpace(field) != "" { + allowed[strings.TrimSpace(field)] = true + } + } + if len(allowed) == 0 { + return data, nil + } + + rows, err := parseCSV(data, ',') + if err != nil || len(rows) == 0 { + return data, err + } + fieldIdx := -1 + for i, col := range rows[0] { + if strings.EqualFold(strings.TrimSpace(col), "field_name") { + fieldIdx = i + break + } + } + if fieldIdx < 0 { + return data, nil + } + filtered := make([][]string, 0, len(rows)) + filtered = append(filtered, rows[0]) + for _, row := range rows[1:] { + if fieldIdx >= len(row) { + continue + } + if allowed[strings.TrimSpace(row[fieldIdx])] { + filtered = append(filtered, row) + } + } + return writeCSV(filtered, ',') +} + +func exportProjectInfo(ctx context.Context, baseURL, token string) ([]byte, bool, error) { + form := baseForm(token, "project", "json") + body, err := redcapRequest(ctx, baseURL, form) + if err != nil { + 
return nil, false, err + } + return body, detectLongitudinal(body), nil +} + +func detectLongitudinal(payload []byte) bool { + check := func(v interface{}) bool { + switch s := v.(type) { + case bool: + return s + case string: + switch strings.ToLower(strings.TrimSpace(s)) { + case "1", "true", "yes", "y": + return true + } + case float64: + return s != 0 + } + return false + } + + var obj map[string]interface{} + if err := json.Unmarshal(payload, &obj); err == nil { + for _, key := range []string{"is_longitudinal", "is_longitudinal_project"} { + if v, ok := obj[key]; ok && check(v) { + return true + } + } + } + + var arr []map[string]interface{} + if err := json.Unmarshal(payload, &arr); err == nil { + for _, row := range arr { + for _, key := range []string{"is_longitudinal", "is_longitudinal_project"} { + if v, ok := row[key]; ok && check(v) { + return true + } + } + } + } + return false +} + +func exportVersion(ctx context.Context, baseURL, token string) string { + form := baseForm(token, "version", "json") + body, err := redcapRequest(ctx, baseURL, form) + if err != nil { + return "" + } + trimmed := strings.TrimSpace(string(body)) + if trimmed == "" { + return "" + } + var asString string + if err := json.Unmarshal(body, &asString); err == nil { + return strings.TrimSpace(asString) + } + return strings.Trim(trimmed, "\"") +} + +func exportCSVContent(ctx context.Context, baseURL, token, content string) ([]byte, error) { + form := baseForm(token, content, "csv") + return redcapRequest(ctx, baseURL, form) +} + +func sanitizeReportID(reportID string) string { + if reportID == "" { + return "unknown" + } + var b strings.Builder + for _, r := range reportID { + switch { + case r >= 'a' && r <= 'z': + b.WriteRune(r) + case r >= 'A' && r <= 'Z': + b.WriteRune(r) + case r >= '0' && r <= '9': + b.WriteRune(r) + case r == '_' || r == '-' || r == '.': + b.WriteRune(r) + default: + b.WriteRune('_') + } + } + safe := b.String() + if safe == "" { + return "unknown" + } + 
return safe +} + +func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectInfoPath, eventsPath, mappingPath, redcapVersion string, warnings []string) ([]byte, error) { + manifest := map[string]interface{}{ + "plugin": "redcap2", + "export_mode": opts.ExportMode, + "generated_at": opts.GeneratedAt, + "redcap_version": redcapVersion, + "export": map[string]interface{}{ + "data_format": opts.DataFormat, + "record_type": opts.RecordType, + "csv_delimiter": opts.CsvDelimiter, + "raw_or_label": opts.RawOrLabel, + "raw_or_label_headers": opts.RawOrLabelHeaders, + "fields": opts.Fields, + "forms": opts.Forms, + "events": opts.Events, + "records": opts.Records, + "filter_logic": opts.FilterLogic, + "date_range_begin": opts.DateRangeBegin, + "date_range_end": opts.DateRangeEnd, + }, + "files": map[string]string{ + "data": dataPath, + "metadata": metadataPath, + "project_info": projectInfoPath, + }, + } + if opts.ExportMode == "report" { + manifest["report_id"] = reportID + } + if eventsPath != "" { + manifest["files"].(map[string]string)["events"] = eventsPath + } + if mappingPath != "" { + manifest["files"].(map[string]string)["form_event_mapping"] = mappingPath + } + + if len(opts.Variables) > 0 { + manifest["variables"] = opts.Variables + } + if len(warnings) > 0 { + manifest["warnings"] = warnings + } + + return json.MarshalIndent(manifest, "", " ") +} + +func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOptions, reportID string) (generatedBundle, error) { + var dataBytes []byte + var dataFields []string + var err error + var basePath string + + if opts.ExportMode == "records" { + dataBytes, dataFields, err = exportRecordData(ctx, baseURL, token, opts) + if err != nil { + return generatedBundle{}, fmt.Errorf("record export failed: %w", err) + } + basePath = "redcap/records" + } else { + dataBytes, dataFields, err = exportReportData(ctx, baseURL, token, opts) + if err != nil { + return generatedBundle{}, 
fmt.Errorf("report export failed: %w", err) + } + safeID := sanitizeReportID(reportID) + basePath = fmt.Sprintf("redcap/report-%s", safeID) + } + + metadataRaw, err := exportMetadataCSV(ctx, baseURL, token, nil) + if err != nil { + return generatedBundle{}, fmt.Errorf("metadata export failed: %w", err) + } + metadataBytes, err := filterMetadataCSV(metadataRaw, dataFields) + if err != nil { + return generatedBundle{}, fmt.Errorf("metadata filtering failed: %w", err) + } + + projectInfoBytes, isLongitudinal, err := exportProjectInfo(ctx, baseURL, token) + if err != nil { + return generatedBundle{}, fmt.Errorf("project info export failed: %w", err) + } + redcapVersion := exportVersion(ctx, baseURL, token) + dataFileName := "data.csv" + if opts.DataFormat == "json" { + dataFileName = "data.json" + } + dataPath := basePath + "/" + dataFileName + metadataPath := basePath + "/metadata.csv" + projectInfoPath := basePath + "/project_info.json" + eventsPath := "" + mappingPath := "" + warnings := []string{} + + files := map[string][]byte{ + dataPath: dataBytes, + metadataPath: metadataBytes, + projectInfoPath: projectInfoBytes, + } + + if isLongitudinal { + eventsBytes, eventsErr := exportCSVContent(ctx, baseURL, token, "event") + if eventsErr != nil { + warnings = append(warnings, fmt.Sprintf("events export failed: %v", eventsErr)) + } else { + eventsPath = basePath + "/events.csv" + files[eventsPath] = eventsBytes + } + + mappingBytes, mappingErr := exportCSVContent(ctx, baseURL, token, "formEventMapping") + if mappingErr != nil { + warnings = append(warnings, fmt.Sprintf("form-event mapping export failed: %v", mappingErr)) + } else { + mappingPath = basePath + "/form_event_mapping.csv" + files[mappingPath] = mappingBytes + } + } + + manifestBytes, err := makeManifest( + opts, + reportID, + dataPath, + metadataPath, + projectInfoPath, + eventsPath, + mappingPath, + redcapVersion, + warnings, + ) + if err != nil { + return generatedBundle{}, fmt.Errorf("manifest generation 
failed: %w", err) + } + files[basePath+"/manifest.json"] = manifestBytes + + logging.Logger.Printf("redcap2: generated %d virtual files (mode: %s, report: %s)", len(files), opts.ExportMode, reportID) + return generatedBundle{ + ReportID: reportID, + Files: files, + }, nil +} + +func md5Hex(data []byte) string { + sum := md5.Sum(data) + return fmt.Sprintf("%x", sum) +} + +// bundleCacheKey returns a stable key for the export bundle cache. +// GeneratedAt is intentionally excluded so the same underlying data always +// produces the same key regardless of when the user pressed "Continue to compare". +func bundleCacheKey(baseURL, token string, opts pluginOptions) string { + stable := pluginOptions{ + ExportMode: opts.ExportMode, + ReportID: opts.ReportID, + DataFormat: opts.DataFormat, + Fields: opts.Fields, + Forms: opts.Forms, + Events: opts.Events, + Records: opts.Records, + FilterLogic: opts.FilterLogic, + DateRangeBegin: opts.DateRangeBegin, + DateRangeEnd: opts.DateRangeEnd, + RecordType: opts.RecordType, + CsvDelimiter: opts.CsvDelimiter, + RawOrLabel: opts.RawOrLabel, + RawOrLabelHeaders: opts.RawOrLabelHeaders, + ExportSurveyFields: opts.ExportSurveyFields, + ExportDataAccessGroups: opts.ExportDataAccessGroups, + Variables: opts.Variables, + // GeneratedAt intentionally excluded + } + data, _ := json.Marshal(stable) + h := md5.Sum(append([]byte(baseURL+"\x00"+token+"\x00"), data...)) + return fmt.Sprintf("%x", h) +} + +// cachedBuildExportBundle returns a cached bundle when available, otherwise +// calls buildExportBundle and caches the result for bundleCacheTTL. +// This halves API calls (Query + Streams each previously made ~5 requests) +// and guarantees the hashes from Query match the bytes served by Streams. 
+func cachedBuildExportBundle(ctx context.Context, baseURL, token string, opts pluginOptions, reportID string) (generatedBundle, error) { + key := bundleCacheKey(baseURL, token, opts) + if bundle, ok := globalBundleCache.get(key); ok { + logging.Logger.Printf("redcap2: bundle cache hit (mode: %s, report: %s)", opts.ExportMode, reportID) + return bundle, nil + } + bundle, err := buildExportBundle(ctx, baseURL, token, opts, reportID) + if err != nil { + return generatedBundle{}, err + } + globalBundleCache.set(key, bundle) + return bundle, nil +} diff --git a/image/app/plugin/impl/redcap2/options.go b/image/app/plugin/impl/redcap2/options.go new file mode 100644 index 0000000..5f21751 --- /dev/null +++ b/image/app/plugin/impl/redcap2/options.go @@ -0,0 +1,39 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "context" + "fmt" + "integration/app/plugin/types" + "strings" +) + +func Options(ctx context.Context, params types.OptionsRequest) ([]types.SelectItem, error) { + if params.Url == "" || params.Token == "" { + return nil, fmt.Errorf("options: missing parameters: expected url, token") + } + + opts, err := parsePluginOptions(params.PluginOptions) + if err != nil { + return nil, err + } + + if strings.EqualFold(opts.Request, "variables") { + if opts.ExportMode == "records" { + return listVariablesFromMetadata(ctx, params.Url, params.Token) + } + reportID := opts.ReportID + if reportID == "" { + reportID = strings.TrimSpace(params.Option) + } + if reportID == "" { + return nil, fmt.Errorf("options: missing report id for variable lookup") + } + return listVariablesFromReport(ctx, params.Url, params.Token, reportID, opts) + } + + // The REDCap API does not provide a standard endpoint to list reports. + // Report IDs are entered manually by the user on the export settings page. 
+ return []types.SelectItem{}, nil +} diff --git a/image/app/plugin/impl/redcap2/query.go b/image/app/plugin/impl/redcap2/query.go new file mode 100644 index 0000000..20fd847 --- /dev/null +++ b/image/app/plugin/impl/redcap2/query.go @@ -0,0 +1,76 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "context" + "fmt" + "integration/app/plugin/types" + "integration/app/tree" + "sort" + "strings" +) + +func Query(ctx context.Context, req types.CompareRequest, _ map[string]tree.Node) (map[string]tree.Node, error) { + if req.Url == "" || req.Token == "" { + return nil, fmt.Errorf("query: missing parameters: expected url, token") + } + + opts, err := parsePluginOptions(req.PluginOptions) + if err != nil { + return nil, err + } + + reportID := opts.ReportID + if reportID == "" { + reportID = strings.TrimSpace(req.Option) + } + if reportID == "" && opts.ExportMode != "records" { + return nil, fmt.Errorf("query: missing report id") + } + opts.ReportID = reportID + + bundle, err := cachedBuildExportBundle(ctx, req.Url, req.Token, opts, reportID) + if err != nil { + return nil, err + } + + paths := make([]string, 0, len(bundle.Files)) + for path := range bundle.Files { + paths = append(paths, path) + } + sort.Strings(paths) + + nodes := make(map[string]tree.Node, len(paths)) + for _, fullPath := range paths { + data := bundle.Files[fullPath] + parentPath, fileName := splitPath(fullPath) + nodes[fullPath] = tree.Node{ + Id: fullPath, + Name: fileName, + Path: parentPath, + Attributes: tree.Attributes{ + URL: fullPath, + IsFile: true, + RemoteHash: md5Hex(data), + RemoteHashType: types.Md5, + RemoteFileSize: int64(len(data)), + }, + } + } + return nodes, nil +} + +func splitPath(path string) (parent string, name string) { + clean := strings.TrimSpace(path) + if clean == "" { + return "", "" + } + idx := strings.LastIndex(clean, "/") + if idx < 0 { + return "", clean + } + parent = clean[:idx] + name = clean[idx+1:] + return 
parent, name +} diff --git a/image/app/plugin/impl/redcap2/streams.go b/image/app/plugin/impl/redcap2/streams.go new file mode 100644 index 0000000..5d64c98 --- /dev/null +++ b/image/app/plugin/impl/redcap2/streams.go @@ -0,0 +1,69 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "bytes" + "context" + "fmt" + "integration/app/plugin/types" + "integration/app/tree" + "io" + "strings" +) + +func Streams(ctx context.Context, in map[string]tree.Node, streamParams types.StreamParams) (types.StreamsType, error) { + if streamParams.Url == "" || streamParams.Token == "" { + return types.StreamsType{}, fmt.Errorf("streams: missing parameters: expected url, token") + } + + opts, err := parsePluginOptions(streamParams.PluginOptions) + if err != nil { + return types.StreamsType{}, err + } + + reportID := opts.ReportID + if reportID == "" { + reportID = strings.TrimSpace(streamParams.Option) + } + if reportID == "" && opts.ExportMode != "records" { + return types.StreamsType{}, fmt.Errorf("streams: missing report id") + } + opts.ReportID = reportID + + bundle, err := cachedBuildExportBundle(ctx, streamParams.Url, streamParams.Token, opts, reportID) + if err != nil { + return types.StreamsType{}, err + } + + res := make(map[string]types.Stream, len(in)) + for key, node := range in { + path := strings.TrimSpace(node.Attributes.URL) + if path == "" { + path = strings.TrimSpace(node.Id) + } + payload, ok := bundle.Files[path] + if !ok { + return types.StreamsType{}, fmt.Errorf("streams: generated file not found: %s", path) + } + // Copy payload to ensure stream readers are isolated from map aliasing. + data := append([]byte(nil), payload...) 
+ res[key] = byteStream(data) + } + + return types.StreamsType{ + Streams: res, + Cleanup: nil, + }, nil +} + +func byteStream(data []byte) types.Stream { + return types.Stream{ + Open: func() (io.Reader, error) { + return bytes.NewReader(data), nil + }, + Close: func() error { + return nil + }, + } +} diff --git a/image/app/plugin/registry.go b/image/app/plugin/registry.go index eed98d0..f6951b1 100644 --- a/image/app/plugin/registry.go +++ b/image/app/plugin/registry.go @@ -13,6 +13,7 @@ import ( "integration/app/plugin/impl/onedrive" "integration/app/plugin/impl/osf" "integration/app/plugin/impl/redcap" + "integration/app/plugin/impl/redcap2" "integration/app/plugin/impl/sftp_plugin" "integration/app/plugin/types" "integration/app/tree" @@ -53,6 +54,12 @@ var pluginMap map[string]Plugin = map[string]Plugin{ Search: nil, Streams: redcap.Streams, }, + "redcap2": { + Query: redcap2.Query, + Options: redcap2.Options, + Search: nil, + Streams: redcap2.Streams, + }, "osf": { Query: osf.Query, Options: nil, diff --git a/image/app/plugin/types/compare_request.go b/image/app/plugin/types/compare_request.go index 6c19a29..8fe269d 100644 --- a/image/app/plugin/types/compare_request.go +++ b/image/app/plugin/types/compare_request.go @@ -3,15 +3,16 @@ package types type CompareRequest struct { - PluginId string `json:"pluginId"` - Plugin string `json:"plugin"` - RepoName string `json:"repoName"` - Url string `json:"url"` - Option string `json:"option"` - User string `json:"user"` - Token string `json:"token"` - PersistentId string `json:"persistentId"` - NewlyCreated bool `json:"newlyCreated"` - DataverseKey string `json:"dataverseKey"` - SessionId string `json:"sessionId"` + PluginId string `json:"pluginId"` + Plugin string `json:"plugin"` + RepoName string `json:"repoName"` + Url string `json:"url"` + Option string `json:"option"` + PluginOptions string `json:"pluginOptions,omitempty"` + User string `json:"user"` + Token string `json:"token"` + PersistentId string 
`json:"persistentId"` + NewlyCreated bool `json:"newlyCreated"` + DataverseKey string `json:"dataverseKey"` + SessionId string `json:"sessionId"` } diff --git a/image/app/plugin/types/options_request.go b/image/app/plugin/types/options_request.go index 0a4b914..0464b95 100644 --- a/image/app/plugin/types/options_request.go +++ b/image/app/plugin/types/options_request.go @@ -3,12 +3,13 @@ package types type OptionsRequest struct { - PluginId string `json:"pluginId"` - Plugin string `json:"plugin"` - RepoName string `json:"repoName"` - Option string `json:"option"` - Url string `json:"url"` - User string `json:"user"` - Token string `json:"token"` - SessionId string `json:"sessionId"` + PluginId string `json:"pluginId"` + Plugin string `json:"plugin"` + RepoName string `json:"repoName"` + Option string `json:"option"` + PluginOptions string `json:"pluginOptions,omitempty"` + Url string `json:"url"` + User string `json:"user"` + Token string `json:"token"` + SessionId string `json:"sessionId"` } diff --git a/image/app/plugin/types/stream_params.go b/image/app/plugin/types/stream_params.go index 870bab1..2a3633d 100644 --- a/image/app/plugin/types/stream_params.go +++ b/image/app/plugin/types/stream_params.go @@ -3,14 +3,15 @@ package types type StreamParams struct { - PluginId string `json:"pluginId"` - RepoName string `json:"repoName"` - Url string `json:"url"` - Option string `json:"option"` - User string `json:"user"` - Token string `json:"token"` - DVToken string `json:"dvToken"` - PersistentId string `json:"persistentId"` - SessionId string `json:"sessionId"` - DownloadId string `json:"downloadId"` + PluginId string `json:"pluginId"` + RepoName string `json:"repoName"` + Url string `json:"url"` + Option string `json:"option"` + PluginOptions string `json:"pluginOptions,omitempty"` + User string `json:"user"` + Token string `json:"token"` + DVToken string `json:"dvToken"` + PersistentId string `json:"persistentId"` + SessionId string `json:"sessionId"` + 
DownloadId string `json:"downloadId"` } From 91a4c716964056712f3eaaeea981a91ea8ac6eb3 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Thu, 11 Jun 2026 10:35:30 +0200 Subject: [PATCH 03/25] test(redcap2): add unit tests for parameter routing, blanking, and bundle generation Covers option parsing/normalization, report-vs-records API parameter routing, blank anonymization for CSV and JSON exports, virtual node generation, hash determinism, bundle caching, and the variables/Options flow against a fake in-memory REDCap server (~91% statement coverage). Marks the last Phase 3 item done in redcap.md. --- image/app/plugin/impl/redcap2/common_test.go | 521 ++++++++++++++++++ image/app/plugin/impl/redcap2/helper_test.go | 119 ++++ image/app/plugin/impl/redcap2/options_test.go | 137 +++++ image/app/plugin/impl/redcap2/query_test.go | 283 ++++++++++ image/app/plugin/impl/redcap2/streams_test.go | 185 +++++++ redcap.md | 3 +- 6 files changed, 1247 insertions(+), 1 deletion(-) create mode 100644 image/app/plugin/impl/redcap2/common_test.go create mode 100644 image/app/plugin/impl/redcap2/helper_test.go create mode 100644 image/app/plugin/impl/redcap2/options_test.go create mode 100644 image/app/plugin/impl/redcap2/query_test.go create mode 100644 image/app/plugin/impl/redcap2/streams_test.go diff --git a/image/app/plugin/impl/redcap2/common_test.go b/image/app/plugin/impl/redcap2/common_test.go new file mode 100644 index 0000000..4d289ca --- /dev/null +++ b/image/app/plugin/impl/redcap2/common_test.go @@ -0,0 +1,521 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). 
Apache 2.0 License + +package redcap2 + +import ( + "context" + "encoding/json" + "integration/app/plugin/types" + "net/http" + "net/http/httptest" + "net/url" + "reflect" + "strings" + "testing" +) + +func TestParsePluginOptionsDefaults(t *testing.T) { + for _, raw := range []string{"", " "} { + opts, err := parsePluginOptions(raw) + if err != nil { + t.Fatalf("parsePluginOptions(%q) returned error: %v", raw, err) + } + want := pluginOptions{ + ExportMode: "report", + DataFormat: "csv", + RecordType: "flat", + CsvDelimiter: ",", + RawOrLabel: "raw", + RawOrLabelHeaders: "raw", + GeneratedAt: "missing-generated-at", + } + if !reflect.DeepEqual(opts, want) { + t.Fatalf("parsePluginOptions(%q) = %+v, want %+v", raw, opts, want) + } + } +} + +func TestParsePluginOptionsInvalidJSON(t *testing.T) { + if _, err := parsePluginOptions("{not json"); err == nil { + t.Fatal("expected error for invalid pluginOptions JSON") + } +} + +func TestParsePluginOptionsNormalization(t *testing.T) { + opts, err := parsePluginOptions(`{ + "exportMode": " Records ", + "dataFormat": "JSON", + "recordType": "EAV", + "csvDelimiter": " TSV ", + "rawOrLabel": "Label", + "rawOrLabelHeaders": "LABEL", + "reportId": " 7 ", + "fields": [" age", "age", "", "name "], + "variables": [ + {"name": " email ", "anonymization": "BLANK"}, + {"name": "age", "anonymization": "whatever"} + ] + }`) + if err != nil { + t.Fatalf("parsePluginOptions returned error: %v", err) + } + if opts.ExportMode != "records" { + t.Errorf("ExportMode = %q, want records", opts.ExportMode) + } + if opts.DataFormat != "json" { + t.Errorf("DataFormat = %q, want json", opts.DataFormat) + } + if opts.RecordType != "eav" { + t.Errorf("RecordType = %q, want eav", opts.RecordType) + } + if opts.CsvDelimiter != "\t" { + t.Errorf("CsvDelimiter = %q, want tab", opts.CsvDelimiter) + } + if opts.RawOrLabel != "label" || opts.RawOrLabelHeaders != "label" { + t.Errorf("RawOrLabel = %q, RawOrLabelHeaders = %q, want label/label", 
opts.RawOrLabel, opts.RawOrLabelHeaders) + } + if opts.ReportID != "7" { + t.Errorf("ReportID = %q, want 7", opts.ReportID) + } + if !reflect.DeepEqual(opts.Fields, []string{"age", "name"}) { + t.Errorf("Fields = %v, want [age name]", opts.Fields) + } + wantVars := []variableOption{ + {Name: "email", Anonymization: "blank"}, + {Name: "age", Anonymization: "none"}, + } + if !reflect.DeepEqual(opts.Variables, wantVars) { + t.Errorf("Variables = %v, want %v", opts.Variables, wantVars) + } + if opts.GeneratedAt != "missing-generated-at" { + t.Errorf("GeneratedAt = %q, want missing-generated-at", opts.GeneratedAt) + } +} + +func TestParsePluginOptionsUnknownValuesFallBackToDefaults(t *testing.T) { + opts, err := parsePluginOptions(`{ + "exportMode": "weird", + "dataFormat": "xml", + "recordType": "wide", + "csvDelimiter": ";", + "rawOrLabel": "other", + "rawOrLabelHeaders": "other" + }`) + if err != nil { + t.Fatalf("parsePluginOptions returned error: %v", err) + } + if opts.ExportMode != "report" || opts.DataFormat != "csv" || opts.RecordType != "flat" || + opts.CsvDelimiter != "," || opts.RawOrLabel != "raw" || opts.RawOrLabelHeaders != "raw" { + t.Fatalf("unknown values not normalized to defaults: %+v", opts) + } +} + +func TestNormalizeStringSlice(t *testing.T) { + tests := []struct { + name string + in []string + want []string + }{ + {name: "nil", in: nil, want: nil}, + {name: "all_empty", in: []string{"", " "}, want: nil}, + {name: "trim_and_dedup", in: []string{" a", "a", "b ", "", "b"}, want: []string{"a", "b"}}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if got := normalizeStringSlice(tt.in); !reflect.DeepEqual(got, tt.want) { + t.Fatalf("normalizeStringSlice(%v) = %v, want %v", tt.in, got, tt.want) + } + }) + } +} + +func TestGetAPIURL(t *testing.T) { + tests := []struct { + in string + want string + }{ + {in: "https://redcap.example.org", want: "https://redcap.example.org/api/"}, + {in: "https://redcap.example.org/", want: 
"https://redcap.example.org/api/"}, + {in: "https://redcap.example.org/api", want: "https://redcap.example.org/api/"}, + {in: "https://redcap.example.org/api/", want: "https://redcap.example.org/api/"}, + {in: " https://redcap.example.org ", want: "https://redcap.example.org/api/"}, + } + for _, tt := range tests { + if got := getAPIURL(tt.in); got != tt.want { + t.Errorf("getAPIURL(%q) = %q, want %q", tt.in, got, tt.want) + } + } +} + +func TestSanitizeReportID(t *testing.T) { + tests := []struct { + in string + want string + }{ + {in: "", want: "unknown"}, + {in: "42", want: "42"}, + {in: "My Report/7", want: "My_Report_7"}, + {in: "a.b-c_D9", want: "a.b-c_D9"}, + } + for _, tt := range tests { + if got := sanitizeReportID(tt.in); got != tt.want { + t.Errorf("sanitizeReportID(%q) = %q, want %q", tt.in, got, tt.want) + } + } +} + +func TestBlankFields(t *testing.T) { + opts := pluginOptions{Variables: []variableOption{ + {Name: "email", Anonymization: "blank"}, + {Name: "age", Anonymization: "none"}, + {Name: "", Anonymization: "blank"}, + }} + got := blankFields(opts) + if !reflect.DeepEqual(got, map[string]bool{"email": true}) { + t.Fatalf("blankFields = %v, want only email", got) + } +} + +func TestApplySharedExportParamsDefaults(t *testing.T) { + opts, _ := parsePluginOptions("") + form := url.Values{} + applySharedExportParams(form, opts) + if got := form.Get("type"); got != "flat" { + t.Errorf("type = %q, want flat", got) + } + for _, key := range []string{"csvDelimiter", "rawOrLabel", "rawOrLabelHeaders"} { + if _, ok := form[key]; ok { + t.Errorf("default options should not send %q", key) + } + } +} + +func TestApplySharedExportParamsNonDefaults(t *testing.T) { + opts, _ := parsePluginOptions(`{"recordType":"eav","csvDelimiter":"tab","rawOrLabel":"label","rawOrLabelHeaders":"label"}`) + form := url.Values{} + applySharedExportParams(form, opts) + want := map[string]string{ + "type": "eav", + "csvDelimiter": "tab", + "rawOrLabel": "label", + 
"rawOrLabelHeaders": "label", + } + for key, value := range want { + if got := form.Get(key); got != value { + t.Errorf("%s = %q, want %q", key, got, value) + } + } +} + +func TestApplyRecordOnlyFiltersEmpty(t *testing.T) { + opts, _ := parsePluginOptions("") + form := url.Values{} + applyRecordOnlyFilters(form, opts) + if len(form) != 0 { + t.Fatalf("expected no record-only params for default options, got %v", form) + } +} + +func TestApplyRecordOnlyFiltersFull(t *testing.T) { + opts, _ := parsePluginOptions(`{ + "fields": ["age", "name", "age"], + "forms": ["demographics"], + "events": ["baseline_arm_1"], + "records": ["1", "2"], + "filterLogic": "[age] > 30", + "dateRangeBegin": "2026-01-02", + "dateRangeEnd": "2026-01-31", + "exportSurveyFields": true, + "exportDataAccessGroups": true + }`) + form := url.Values{} + applyRecordOnlyFilters(form, opts) + want := map[string]string{ + "fields": "age,name", + "forms": "demographics", + "events": "baseline_arm_1", + "records": "1,2", + "filterLogic": "[age] > 30", + "dateRangeBegin": "2026-01-02 00:00:00", + "dateRangeEnd": "2026-01-31 23:59:59", + "exportSurveyFields": "true", + "exportDataAccessGroups": "true", + } + for key, value := range want { + if got := form.Get(key); got != value { + t.Errorf("%s = %q, want %q", key, got, value) + } + } +} + +func TestApplyRecordOnlyFiltersKeepsExplicitTimes(t *testing.T) { + opts, _ := parsePluginOptions(`{"dateRangeBegin":"2026-01-02 10:30:00","dateRangeEnd":"2026-01-31 12:00:00"}`) + form := url.Values{} + applyRecordOnlyFilters(form, opts) + if got := form.Get("dateRangeBegin"); got != "2026-01-02 10:30:00" { + t.Errorf("dateRangeBegin = %q, want explicit time preserved", got) + } + if got := form.Get("dateRangeEnd"); got != "2026-01-31 12:00:00" { + t.Errorf("dateRangeEnd = %q, want explicit time preserved", got) + } +} + +func TestApplyBlankCSV(t *testing.T) { + out, header, err := applyBlankCSV([]byte(testDataCSV), ',', map[string]bool{"name": true, "email": true}) + 
if err != nil { + t.Fatalf("applyBlankCSV returned error: %v", err) + } + wantHeader := []string{"record_id", "name", "email", "age"} + if !reflect.DeepEqual(header, wantHeader) { + t.Errorf("header = %v, want %v", header, wantHeader) + } + want := "record_id,name,email,age\n1,,,34\n2,,,29\n" + if string(out) != want { + t.Errorf("blanked CSV = %q, want %q", string(out), want) + } +} + +func TestApplyBlankCSVNoMatchingColumns(t *testing.T) { + out, header, err := applyBlankCSV([]byte(testDataCSV), ',', map[string]bool{"missing": true}) + if err != nil { + t.Fatalf("applyBlankCSV returned error: %v", err) + } + if string(out) != testDataCSV { + t.Errorf("data changed despite no matching blank columns") + } + if len(header) != 4 { + t.Errorf("header = %v, want 4 columns", header) + } +} + +func TestApplyBlankCSVEmptyInput(t *testing.T) { + out, header, err := applyBlankCSV(nil, ',', map[string]bool{"name": true}) + if err != nil { + t.Fatalf("applyBlankCSV returned error: %v", err) + } + if len(out) != 0 || header != nil { + t.Errorf("expected empty passthrough, got out=%q header=%v", out, header) + } +} + +func TestApplyBlankJSON(t *testing.T) { + out, fields, err := applyBlankJSON([]byte(testDataJSON), map[string]bool{"name": true, "email": true}) + if err != nil { + t.Fatalf("applyBlankJSON returned error: %v", err) + } + wantFields := []string{"age", "email", "name", "record_id"} + if !reflect.DeepEqual(fields, wantFields) { + t.Errorf("fields = %v, want %v", fields, wantFields) + } + rows := []map[string]string{} + if err := json.Unmarshal(out, &rows); err != nil { + t.Fatalf("blanked JSON is invalid: %v", err) + } + for i, row := range rows { + if row["name"] != "" || row["email"] != "" { + t.Errorf("row %d not blanked: %v", i, row) + } + if row["record_id"] == "" || row["age"] == "" { + t.Errorf("row %d lost non-blanked values: %v", i, row) + } + } +} + +func TestApplyBlankJSONInvalid(t *testing.T) { + if _, _, err := applyBlankJSON([]byte("not json"), nil); 
err == nil { + t.Fatal("expected error for invalid JSON input") + } +} + +func TestFilterMetadataCSV(t *testing.T) { + out, err := filterMetadataCSV([]byte(testMetadataCSV), []string{"age", "record_id"}) + if err != nil { + t.Fatalf("filterMetadataCSV returned error: %v", err) + } + want := "field_name,form_name,field_type,identifier\n" + + "record_id,demographics,text,\n" + + "age,demographics,text,\n" + if string(out) != want { + t.Errorf("filtered metadata = %q, want %q", string(out), want) + } +} + +func TestFilterMetadataCSVPassthrough(t *testing.T) { + out, err := filterMetadataCSV([]byte(testMetadataCSV), nil) + if err != nil || string(out) != testMetadataCSV { + t.Errorf("expected passthrough without fields, got %q (err %v)", string(out), err) + } + noFieldName := "a,b\n1,2\n" + out, err = filterMetadataCSV([]byte(noFieldName), []string{"x"}) + if err != nil || string(out) != noFieldName { + t.Errorf("expected passthrough without field_name column, got %q (err %v)", string(out), err) + } +} + +func TestDetectLongitudinal(t *testing.T) { + tests := []struct { + name string + payload string + want bool + }{ + {name: "object_string_one", payload: `{"is_longitudinal":"1"}`, want: true}, + {name: "object_bool", payload: `{"is_longitudinal":true}`, want: true}, + {name: "object_yes", payload: `{"is_longitudinal":"yes"}`, want: true}, + {name: "object_number", payload: `{"is_longitudinal":1}`, want: true}, + {name: "object_zero", payload: `{"is_longitudinal":"0"}`, want: false}, + {name: "object_missing", payload: `{"project_id":1}`, want: false}, + {name: "array_form", payload: `[{"is_longitudinal":"y"}]`, want: true}, + {name: "alternate_key", payload: `{"is_longitudinal_project":"true"}`, want: true}, + {name: "invalid", payload: `not json`, want: false}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if got := detectLongitudinal([]byte(tt.payload)); got != tt.want { + t.Fatalf("detectLongitudinal(%q) = %v, want %v", tt.payload, got, 
tt.want) + } + }) + } +} + +func TestDeduplicatedSelectItems(t *testing.T) { + got := deduplicatedSelectItems([]string{" b", "a", "b", ""}) + want := []types.SelectItem{ + {Label: "a", Value: "a"}, + {Label: "b", Value: "b"}, + } + if !reflect.DeepEqual(got, want) { + t.Fatalf("deduplicatedSelectItems = %v, want %v", got, want) + } +} + +func TestDeduplicatedSelectItemsWithIdentifiers(t *testing.T) { + got := deduplicatedSelectItemsWithIdentifiers( + []string{"email", "age", "email", " name "}, + map[string]bool{"email": true, "name": true}, + ) + want := []types.SelectItem{ + {Label: "age", Value: "age"}, + {Label: "email", Value: "email", Selected: true}, + {Label: "name", Value: "name", Selected: true}, + } + if !reflect.DeepEqual(got, want) { + t.Fatalf("deduplicatedSelectItemsWithIdentifiers = %v, want %v", got, want) + } +} + +func TestMakeManifestReportMode(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"report","reportId":"7","variables":[{"name":"email","anonymization":"blank"}]}`) + data, err := makeManifest(opts, "7", "redcap/report-7/data.csv", "redcap/report-7/metadata.csv", + "redcap/report-7/project_info.json", "redcap/report-7/events.csv", "redcap/report-7/form_event_mapping.csv", + "14.5.5", []string{"something failed"}) + if err != nil { + t.Fatalf("makeManifest returned error: %v", err) + } + manifest := map[string]interface{}{} + if err := json.Unmarshal(data, &manifest); err != nil { + t.Fatalf("manifest is invalid JSON: %v", err) + } + if manifest["plugin"] != "redcap2" || manifest["export_mode"] != "report" { + t.Errorf("unexpected plugin/export_mode: %v / %v", manifest["plugin"], manifest["export_mode"]) + } + if manifest["report_id"] != "7" { + t.Errorf("report_id = %v, want 7", manifest["report_id"]) + } + if manifest["redcap_version"] != "14.5.5" { + t.Errorf("redcap_version = %v, want 14.5.5", manifest["redcap_version"]) + } + files := manifest["files"].(map[string]interface{}) + if files["events"] != 
"redcap/report-7/events.csv" || files["form_event_mapping"] != "redcap/report-7/form_event_mapping.csv" { + t.Errorf("longitudinal files missing from manifest: %v", files) + } + if _, ok := manifest["variables"]; !ok { + t.Error("variables missing from manifest") + } + if _, ok := manifest["warnings"]; !ok { + t.Error("warnings missing from manifest") + } +} + +func TestMakeManifestRecordsMode(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + data, err := makeManifest(opts, "", "redcap/records/data.csv", "redcap/records/metadata.csv", + "redcap/records/project_info.json", "", "", "", nil) + if err != nil { + t.Fatalf("makeManifest returned error: %v", err) + } + manifest := map[string]interface{}{} + if err := json.Unmarshal(data, &manifest); err != nil { + t.Fatalf("manifest is invalid JSON: %v", err) + } + if _, ok := manifest["report_id"]; ok { + t.Error("records-mode manifest should not contain report_id") + } + for _, key := range []string{"variables", "warnings"} { + if _, ok := manifest[key]; ok { + t.Errorf("empty %s should be omitted from manifest", key) + } + } + files := manifest["files"].(map[string]interface{}) + for _, key := range []string{"events", "form_event_mapping"} { + if _, ok := files[key]; ok { + t.Errorf("non-longitudinal manifest should not list %s", key) + } + } +} + +func TestBundleCacheKeyStability(t *testing.T) { + base, _ := parsePluginOptions(`{"exportMode":"report","reportId":"7","generatedAt":"2026-01-01T00:00:00Z"}`) + sameButLater := base + sameButLater.GeneratedAt = "2026-06-11T00:00:00Z" + if bundleCacheKey("https://r", "tok", base) != bundleCacheKey("https://r", "tok", sameButLater) { + t.Error("generatedAt should not change the cache key") + } + + otherReport := base + otherReport.ReportID = "8" + if bundleCacheKey("https://r", "tok", base) == bundleCacheKey("https://r", "tok", otherReport) { + t.Error("different report id should change the cache key") + } + + otherMode := base + 
otherMode.ExportMode = "records" + if bundleCacheKey("https://r", "tok", base) == bundleCacheKey("https://r", "tok", otherMode) { + t.Error("different export mode should change the cache key") + } + + surveyFields := base + surveyFields.ExportSurveyFields = true + if bundleCacheKey("https://r", "tok", base) == bundleCacheKey("https://r", "tok", surveyFields) { + t.Error("exportSurveyFields should change the cache key") + } + + if bundleCacheKey("https://r", "tok", base) == bundleCacheKey("https://r", "other", base) { + t.Error("different token should change the cache key") + } +} + +func TestRedcapRequestErrorBody(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + _, _ = w.Write([]byte("ERROR: You do not have permissions to use the API")) + })) + defer server.Close() + + form := baseForm("tok", "record", "csv") + _, err := redcapRequest(context.Background(), server.URL, form) + if err == nil || !strings.Contains(err.Error(), "redcap error") { + t.Fatalf("expected redcap error for ERROR body, got %v", err) + } +} + +func TestRedcapRequestHTTPError(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + http.Error(w, "forbidden", http.StatusForbidden) + })) + defer server.Close() + + form := baseForm("tok", "record", "csv") + _, err := redcapRequest(context.Background(), server.URL, form) + if err == nil || !strings.Contains(err.Error(), "403") { + t.Fatalf("expected status error, got %v", err) + } +} diff --git a/image/app/plugin/impl/redcap2/helper_test.go b/image/app/plugin/impl/redcap2/helper_test.go new file mode 100644 index 0000000..286e7a1 --- /dev/null +++ b/image/app/plugin/impl/redcap2/helper_test.go @@ -0,0 +1,119 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). 
Apache 2.0 License + +package redcap2 + +import ( + "net/http" + "net/http/httptest" + "net/url" + "strings" + "sync" +) + +const ( + testDataCSV = "record_id,name,email,age\n1,John,john@example.org,34\n2,Jane,jane@example.org,29\n" + + testDataJSON = `[{"record_id":"1","name":"John","email":"john@example.org","age":"34"},{"record_id":"2","name":"Jane","email":"jane@example.org","age":"29"}]` + + testMetadataCSV = "field_name,form_name,field_type,identifier\n" + + "record_id,demographics,text,\n" + + "name,demographics,text,y\n" + + "email,demographics,text,y\n" + + "age,demographics,text,\n" + + testEventsCSV = "event_name,arm_num,unique_event_name\nBaseline,1,baseline_arm_1\n" + testMappingCSV = "arm_num,unique_event_name,form\n1,baseline_arm_1,demographics\n" + testVersion = "14.5.5" +) + +// fakeRedcap is a minimal in-memory REDCap API stub. It records every form +// submitted per content type so tests can assert on the exact parameters sent. +type fakeRedcap struct { + mu sync.Mutex + forms map[string][]url.Values + longitudinal bool + failReport bool + server *httptest.Server +} + +func newFakeRedcap() *fakeRedcap { + f := &fakeRedcap{forms: map[string][]url.Values{}} + f.server = httptest.NewServer(http.HandlerFunc(f.handle)) + return f +} + +func (f *fakeRedcap) close() { f.server.Close() } + +func (f *fakeRedcap) url() string { return f.server.URL } + +func (f *fakeRedcap) calls(content string) int { + f.mu.Lock() + defer f.mu.Unlock() + return len(f.forms[content]) +} + +func (f *fakeRedcap) lastForm(content string) url.Values { + f.mu.Lock() + defer f.mu.Unlock() + forms := f.forms[content] + if len(forms) == 0 { + return nil + } + return forms[len(forms)-1] +} + +func (f *fakeRedcap) handle(w http.ResponseWriter, r *http.Request) { + if err := r.ParseForm(); err != nil { + http.Error(w, err.Error(), http.StatusBadRequest) + return + } + form := url.Values{} + for k, v := range r.PostForm { + form[k] = append([]string(nil), v...) 
+ } + content := form.Get("content") + f.mu.Lock() + f.forms[content] = append(f.forms[content], form) + longitudinal := f.longitudinal + failReport := f.failReport + f.mu.Unlock() + + switch content { + case "report": + if failReport { + http.Error(w, "report unavailable", http.StatusInternalServerError) + return + } + writeTestData(w, form) + case "record": + writeTestData(w, form) + case "metadata": + _, _ = w.Write([]byte(testMetadataCSV)) + case "project": + longitudinalFlag := "0" + if longitudinal { + longitudinalFlag = "1" + } + _, _ = w.Write([]byte(`{"project_id":1,"project_title":"Demo","is_longitudinal":"` + longitudinalFlag + `"}`)) + case "version": + _, _ = w.Write([]byte(testVersion)) + case "event": + _, _ = w.Write([]byte(testEventsCSV)) + case "formEventMapping": + _, _ = w.Write([]byte(testMappingCSV)) + default: + http.Error(w, "unsupported content: "+content, http.StatusBadRequest) + } +} + +func writeTestData(w http.ResponseWriter, form url.Values) { + if form.Get("format") == "json" { + _, _ = w.Write([]byte(testDataJSON)) + return + } + data := testDataCSV + if form.Get("csvDelimiter") == "tab" { + data = strings.ReplaceAll(data, ",", "\t") + } + _, _ = w.Write([]byte(data)) +} diff --git a/image/app/plugin/impl/redcap2/options_test.go b/image/app/plugin/impl/redcap2/options_test.go new file mode 100644 index 0000000..15cd9bc --- /dev/null +++ b/image/app/plugin/impl/redcap2/options_test.go @@ -0,0 +1,137 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). 
Apache 2.0 License + +package redcap2 + +import ( + "context" + "integration/app/plugin/types" + "strings" + "testing" +) + +func TestOptionsRequiresUrlAndToken(t *testing.T) { + if _, err := Options(context.Background(), types.OptionsRequest{}); err == nil { + t.Fatal("expected error for missing url and token") + } +} + +func TestOptionsNonVariablesRequestReturnsEmpty(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + items, err := Options(context.Background(), types.OptionsRequest{Url: f.url(), Token: "tok"}) + if err != nil { + t.Fatalf("Options returned error: %v", err) + } + if len(items) != 0 { + t.Errorf("expected empty option list, got %v", items) + } + if f.calls("metadata") != 0 || f.calls("report") != 0 { + t.Error("no API calls expected for non-variables request") + } +} + +// checkVariableItems asserts the standard field list from the fake server: +// sorted alphabetically with the identifier-tagged fields pre-selected. +func checkVariableItems(t *testing.T, items []types.SelectItem) { + t.Helper() + wantFields := []string{"age", "email", "name", "record_id"} + if len(items) != len(wantFields) { + t.Fatalf("got %d items (%v), want %d", len(items), items, len(wantFields)) + } + for i, item := range items { + if item.Label != wantFields[i] || item.Value != wantFields[i] { + t.Errorf("item %d = %v, want %s", i, item, wantFields[i]) + } + wantSelected := wantFields[i] == "email" || wantFields[i] == "name" + if item.Selected != wantSelected { + t.Errorf("item %s Selected = %v, want %v (identifier auto-detection)", item.Label, item.Selected, wantSelected) + } + } +} + +func TestOptionsVariablesRecordsMode(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + items, err := Options(context.Background(), types.OptionsRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{"exportMode":"records","request":"variables"}`, + }) + if err != nil { + t.Fatalf("Options returned error: %v", err) + } + checkVariableItems(t, items) + if 
f.calls("report") != 0 { + t.Error("records mode variable lookup must not call the report endpoint") + } +} + +func TestOptionsVariablesReportMode(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + items, err := Options(context.Background(), types.OptionsRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{"request":"variables","reportId":"7"}`, + }) + if err != nil { + t.Fatalf("Options returned error: %v", err) + } + checkVariableItems(t, items) + form := f.lastForm("report") + if form.Get("report_id") != "7" { + t.Errorf("report header request report_id = %q, want 7", form.Get("report_id")) + } +} + +func TestOptionsVariablesReportModeFallsBackToMetadata(t *testing.T) { + f := newFakeRedcap() + f.failReport = true + defer f.close() + + items, err := Options(context.Background(), types.OptionsRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{"request":"variables","reportId":"7"}`, + }) + if err != nil { + t.Fatalf("Options returned error: %v", err) + } + checkVariableItems(t, items) +} + +func TestOptionsVariablesReportModeMissingReportID(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + _, err := Options(context.Background(), types.OptionsRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{"request":"variables"}`, + }) + if err == nil || !strings.Contains(err.Error(), "missing report id") { + t.Fatalf("expected missing report id error, got %v", err) + } +} + +func TestOptionsVariablesReportModeUsesOptionFallback(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + items, err := Options(context.Background(), types.OptionsRequest{ + Url: f.url(), + Token: "tok", + Option: "9", + PluginOptions: `{"request":"variables"}`, + }) + if err != nil { + t.Fatalf("Options returned error: %v", err) + } + checkVariableItems(t, items) + if form := f.lastForm("report"); form.Get("report_id") != "9" { + t.Errorf("report header request report_id = %q, want 9 (from Option)", form.Get("report_id")) + } +} diff --git 
a/image/app/plugin/impl/redcap2/query_test.go b/image/app/plugin/impl/redcap2/query_test.go new file mode 100644 index 0000000..bf720f1 --- /dev/null +++ b/image/app/plugin/impl/redcap2/query_test.go @@ -0,0 +1,283 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "context" + "integration/app/plugin/types" + "sort" + "strings" + "testing" +) + +func TestQueryRequiresUrlAndToken(t *testing.T) { + if _, err := Query(context.Background(), types.CompareRequest{}, nil); err == nil { + t.Fatal("expected error for missing url and token") + } +} + +func TestQueryRequiresReportIDInReportMode(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + _, err := Query(context.Background(), types.CompareRequest{Url: f.url(), Token: "tok"}, nil) + if err == nil || !strings.Contains(err.Error(), "missing report id") { + t.Fatalf("expected missing report id error, got %v", err) + } + if f.calls("report") != 0 { + t.Error("no API calls expected when report id is missing") + } +} + +func TestQueryReportModeGeneratesBundle(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + nodes, err := Query(context.Background(), types.CompareRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{"exportMode":"report","reportId":"7"}`, + }, nil) + if err != nil { + t.Fatalf("Query returned error: %v", err) + } + + wantPaths := []string{ + "redcap/report-7/data.csv", + "redcap/report-7/manifest.json", + "redcap/report-7/metadata.csv", + "redcap/report-7/project_info.json", + } + gotPaths := make([]string, 0, len(nodes)) + for path := range nodes { + gotPaths = append(gotPaths, path) + } + sort.Strings(gotPaths) + if strings.Join(gotPaths, "|") != strings.Join(wantPaths, "|") { + t.Fatalf("paths = %v, want %v", gotPaths, wantPaths) + } + + node := nodes["redcap/report-7/data.csv"] + if node.Name != "data.csv" || node.Path != "redcap/report-7" || !node.Attributes.IsFile { + t.Errorf("unexpected node shape: %+v", node) + } + if 
node.Attributes.RemoteHashType != types.Md5 { + t.Errorf("RemoteHashType = %q, want %q", node.Attributes.RemoteHashType, types.Md5) + } + if node.Attributes.RemoteHash != md5Hex([]byte(testDataCSV)) { + t.Errorf("RemoteHash = %q, want md5 of report data", node.Attributes.RemoteHash) + } + if node.Attributes.RemoteFileSize != int64(len(testDataCSV)) { + t.Errorf("RemoteFileSize = %d, want %d", node.Attributes.RemoteFileSize, len(testDataCSV)) + } + + form := f.lastForm("report") + if form.Get("report_id") != "7" || form.Get("type") != "flat" { + t.Errorf("unexpected report form: %v", form) + } + if f.calls("record") != 0 { + t.Error("report mode must not call the record endpoint") + } +} + +func TestQueryReportModeOmitsRecordOnlyFilters(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + _, err := Query(context.Background(), types.CompareRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{ + "exportMode": "report", + "reportId": "7", + "fields": ["name"], + "forms": ["demographics"], + "events": ["baseline_arm_1"], + "records": ["1"], + "filterLogic": "[age] > 30", + "dateRangeBegin": "2026-01-01", + "dateRangeEnd": "2026-01-31", + "exportSurveyFields": true, + "exportDataAccessGroups": true + }`, + }, nil) + if err != nil { + t.Fatalf("Query returned error: %v", err) + } + + form := f.lastForm("report") + for _, key := range []string{ + "fields", "forms", "events", "records", "filterLogic", + "dateRangeBegin", "dateRangeEnd", "exportSurveyFields", "exportDataAccessGroups", + } { + if _, ok := form[key]; ok { + t.Errorf("record-only parameter %q must not be sent to content=report", key) + } + } +} + +func TestQueryRecordsModeSendsFilters(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + nodes, err := Query(context.Background(), types.CompareRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{ + "exportMode": "records", + "recordType": "eav", + "csvDelimiter": "tab", + "rawOrLabel": "label", + "rawOrLabelHeaders": "label", + 
"fields": ["age", "name"], + "forms": ["demographics"], + "events": ["baseline_arm_1"], + "records": ["1", "2"], + "filterLogic": "[age] > 20", + "dateRangeBegin": "2026-01-02", + "dateRangeEnd": "2026-01-31", + "exportSurveyFields": true, + "exportDataAccessGroups": true + }`, + }, nil) + if err != nil { + t.Fatalf("Query returned error: %v", err) + } + + if _, ok := nodes["redcap/records/data.csv"]; !ok { + t.Errorf("records mode should generate redcap/records paths, got %v", nodes) + } + + form := f.lastForm("record") + want := map[string]string{ + "type": "eav", + "csvDelimiter": "tab", + "rawOrLabel": "label", + "rawOrLabelHeaders": "label", + "fields": "age,name", + "forms": "demographics", + "events": "baseline_arm_1", + "records": "1,2", + "filterLogic": "[age] > 20", + "dateRangeBegin": "2026-01-02 00:00:00", + "dateRangeEnd": "2026-01-31 23:59:59", + "exportSurveyFields": "true", + "exportDataAccessGroups": "true", + } + for key, value := range want { + if got := form.Get(key); got != value { + t.Errorf("%s = %q, want %q", key, got, value) + } + } + if f.calls("report") != 0 { + t.Error("records mode must not call the report endpoint") + } +} + +func TestQueryLongitudinalAddsEventFiles(t *testing.T) { + f := newFakeRedcap() + f.longitudinal = true + defer f.close() + + nodes, err := Query(context.Background(), types.CompareRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{"exportMode":"report","reportId":"7"}`, + }, nil) + if err != nil { + t.Fatalf("Query returned error: %v", err) + } + if len(nodes) != 6 { + t.Fatalf("expected 6 files for longitudinal project, got %d", len(nodes)) + } + events, ok := nodes["redcap/report-7/events.csv"] + if !ok || events.Attributes.RemoteHash != md5Hex([]byte(testEventsCSV)) { + t.Errorf("events.csv missing or wrong hash: %+v", events) + } + mapping, ok := nodes["redcap/report-7/form_event_mapping.csv"] + if !ok || mapping.Attributes.RemoteHash != md5Hex([]byte(testMappingCSV)) { + 
t.Errorf("form_event_mapping.csv missing or wrong hash: %+v", mapping) + } +} + +func TestQueryUsesOptionAsReportIDFallback(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + nodes, err := Query(context.Background(), types.CompareRequest{ + Url: f.url(), + Token: "tok", + Option: "9", + }, nil) + if err != nil { + t.Fatalf("Query returned error: %v", err) + } + if _, ok := nodes["redcap/report-9/data.csv"]; !ok { + t.Errorf("expected report id from Option to drive paths, got %v", nodes) + } +} + +func TestQueryHashesAreDeterministicAcrossServers(t *testing.T) { + f1 := newFakeRedcap() + defer f1.close() + f2 := newFakeRedcap() + defer f2.close() + + pluginOpts := `{"exportMode":"report","reportId":"7","variables":[{"name":"email","anonymization":"blank"}]}` + nodes1, err := Query(context.Background(), types.CompareRequest{Url: f1.url(), Token: "tok", PluginOptions: pluginOpts}, nil) + if err != nil { + t.Fatalf("first Query returned error: %v", err) + } + nodes2, err := Query(context.Background(), types.CompareRequest{Url: f2.url(), Token: "tok", PluginOptions: pluginOpts}, nil) + if err != nil { + t.Fatalf("second Query returned error: %v", err) + } + + if len(nodes1) != len(nodes2) { + t.Fatalf("node counts differ: %d vs %d", len(nodes1), len(nodes2)) + } + for path, node1 := range nodes1 { + node2, ok := nodes2[path] + if !ok { + t.Errorf("path %s missing in second result", path) + continue + } + if node1.Attributes.RemoteHash != node2.Attributes.RemoteHash { + t.Errorf("hash for %s differs between identical exports", path) + } + } +} + +func TestQueryReportExportFailure(t *testing.T) { + f := newFakeRedcap() + f.failReport = true + defer f.close() + + _, err := Query(context.Background(), types.CompareRequest{ + Url: f.url(), + Token: "tok", + PluginOptions: `{"exportMode":"report","reportId":"7"}`, + }, nil) + if err == nil || !strings.Contains(err.Error(), "report export failed") { + t.Fatalf("expected report export failure, got %v", err) + } +} + 
+func TestSplitPath(t *testing.T) { + tests := []struct { + in string + wantParent string + wantName string + }{ + {in: "redcap/report-7/data.csv", wantParent: "redcap/report-7", wantName: "data.csv"}, + {in: "file.csv", wantParent: "", wantName: "file.csv"}, + {in: "", wantParent: "", wantName: ""}, + {in: " redcap/x ", wantParent: "redcap", wantName: "x"}, + } + for _, tt := range tests { + parent, name := splitPath(tt.in) + if parent != tt.wantParent || name != tt.wantName { + t.Errorf("splitPath(%q) = (%q, %q), want (%q, %q)", tt.in, parent, name, tt.wantParent, tt.wantName) + } + } +} diff --git a/image/app/plugin/impl/redcap2/streams_test.go b/image/app/plugin/impl/redcap2/streams_test.go new file mode 100644 index 0000000..92a8a5f --- /dev/null +++ b/image/app/plugin/impl/redcap2/streams_test.go @@ -0,0 +1,185 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "context" + "encoding/json" + "integration/app/plugin/types" + "integration/app/tree" + "io" + "strings" + "testing" +) + +func TestStreamsRequiresUrlAndToken(t *testing.T) { + if _, err := Streams(context.Background(), nil, types.StreamParams{}); err == nil { + t.Fatal("expected error for missing url and token") + } +} + +func TestStreamsRequiresReportIDInReportMode(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + _, err := Streams(context.Background(), nil, types.StreamParams{Url: f.url(), Token: "tok"}) + if err == nil || !strings.Contains(err.Error(), "missing report id") { + t.Fatalf("expected missing report id error, got %v", err) + } +} + +func readStream(t *testing.T, stream types.Stream) []byte { + t.Helper() + reader, err := stream.Open() + if err != nil { + t.Fatalf("stream open failed: %v", err) + } + data, err := io.ReadAll(reader) + if err != nil { + t.Fatalf("stream read failed: %v", err) + } + if err := stream.Close(); err != nil { + t.Fatalf("stream close failed: %v", err) + } + return data +} + +func 
TestStreamsServesQueryBundleFromCache(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + pluginOpts := `{ + "exportMode": "report", + "reportId": "7", + "generatedAt": "2026-06-11T00:00:00Z", + "variables": [ + {"name": "name", "anonymization": "blank"}, + {"name": "email", "anonymization": "blank"} + ] + }` + nodes, err := Query(context.Background(), types.CompareRequest{Url: f.url(), Token: "tok", PluginOptions: pluginOpts}, nil) + if err != nil { + t.Fatalf("Query returned error: %v", err) + } + + streams, err := Streams(context.Background(), nodes, types.StreamParams{Url: f.url(), Token: "tok", PluginOptions: pluginOpts}) + if err != nil { + t.Fatalf("Streams returned error: %v", err) + } + + for path, node := range nodes { + stream, ok := streams.Streams[path] + if !ok { + t.Errorf("no stream for %s", path) + continue + } + data := readStream(t, stream) + if md5Hex(data) != node.Attributes.RemoteHash { + t.Errorf("stream bytes for %s do not match Query hash", path) + } + if int64(len(data)) != node.Attributes.RemoteFileSize { + t.Errorf("stream size for %s = %d, want %d", path, len(data), node.Attributes.RemoteFileSize) + } + } + + wantData := "record_id,name,email,age\n1,,,34\n2,,,29\n" + gotData := readStream(t, streams.Streams["redcap/report-7/data.csv"]) + if string(gotData) != wantData { + t.Errorf("blanked data.csv = %q, want %q", string(gotData), wantData) + } + + manifestBytes := readStream(t, streams.Streams["redcap/report-7/manifest.json"]) + manifest := map[string]interface{}{} + if err := json.Unmarshal(manifestBytes, &manifest); err != nil { + t.Fatalf("manifest is invalid JSON: %v", err) + } + if manifest["plugin"] != "redcap2" || manifest["report_id"] != "7" { + t.Errorf("unexpected manifest identity: %v", manifest) + } + if manifest["redcap_version"] != testVersion { + t.Errorf("redcap_version = %v, want %s", manifest["redcap_version"], testVersion) + } + if manifest["generated_at"] != "2026-06-11T00:00:00Z" { + t.Errorf("generated_at 
= %v, want propagated value", manifest["generated_at"]) + } + + // The bundle built during Query must be reused by Streams (single build). + for _, content := range []string{"report", "metadata", "project", "version"} { + if got := f.calls(content); got != 1 { + t.Errorf("content=%s called %d times, want 1 (bundle cache miss?)", content, got) + } + } +} + +func TestStreamsRecordsModeJSONBlanking(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + pluginOpts := `{ + "exportMode": "records", + "dataFormat": "json", + "variables": [ + {"name": "name", "anonymization": "blank"}, + {"name": "email", "anonymization": "blank"} + ] + }` + nodes, err := Query(context.Background(), types.CompareRequest{Url: f.url(), Token: "tok", PluginOptions: pluginOpts}, nil) + if err != nil { + t.Fatalf("Query returned error: %v", err) + } + if _, ok := nodes["redcap/records/data.json"]; !ok { + t.Fatalf("expected redcap/records/data.json, got %v", nodes) + } + + streams, err := Streams(context.Background(), nodes, types.StreamParams{Url: f.url(), Token: "tok", PluginOptions: pluginOpts}) + if err != nil { + t.Fatalf("Streams returned error: %v", err) + } + + rows := []map[string]string{} + if err := json.Unmarshal(readStream(t, streams.Streams["redcap/records/data.json"]), &rows); err != nil { + t.Fatalf("data.json is invalid JSON: %v", err) + } + if len(rows) != 2 { + t.Fatalf("expected 2 records, got %d", len(rows)) + } + for i, row := range rows { + if row["name"] != "" || row["email"] != "" { + t.Errorf("record %d not blanked: %v", i, row) + } + } + + manifest := map[string]interface{}{} + if err := json.Unmarshal(readStream(t, streams.Streams["redcap/records/manifest.json"]), &manifest); err != nil { + t.Fatalf("manifest is invalid JSON: %v", err) + } + if _, ok := manifest["report_id"]; ok { + t.Error("records-mode manifest should not contain report_id") + } + if form := f.lastForm("record"); form.Get("format") != "json" { + t.Errorf("record export format = %q, want 
json", form.Get("format")) + } +} + +func TestStreamsUnknownGeneratedFile(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + in := map[string]tree.Node{ + "redcap/report-7/nope.csv": { + Id: "redcap/report-7/nope.csv", + Attributes: tree.Attributes{ + URL: "redcap/report-7/nope.csv", + IsFile: true, + }, + }, + } + _, err := Streams(context.Background(), in, types.StreamParams{ + Url: f.url(), + Token: "tok", + PluginOptions: `{"exportMode":"report","reportId":"7"}`, + }) + if err == nil || !strings.Contains(err.Error(), "generated file not found") { + t.Fatalf("expected generated file not found error, got %v", err) + } +} diff --git a/redcap.md b/redcap.md index 6c9ecf9..78d5741 100644 --- a/redcap.md +++ b/redcap.md @@ -67,6 +67,7 @@ Key point: manual export/save was required in the old `redcap` plugin because it - `exportSurveyFields` and `exportDataAccessGroups` exposed as records-mode toggles (server-side suppression). - Identifier-tagged fields auto-detected from `content=metadata` (`identifier` column) and pre-selected as `blank` in the variable anonymization table; users can override to `none`. 10. Existing `redcap` plugin remains available and unchanged for fallback. +11. Unit test suite (2026-06-11) covering option parsing/normalization, report-vs-records parameter routing, blank anonymization (CSV and JSON), virtual node generation, hash determinism, bundle caching, and the variables/Options flow (~91% statement coverage, `image/app/plugin/impl/redcap2/*_test.go`). ### Generated File Layout (Implemented) @@ -526,7 +527,7 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). `plu 5. ~~Add report/records mode toggle to frontend.~~ 6. ~~Expose `exportSurveyFields` and `exportDataAccessGroups` as records-mode toggles.~~ 7. ~~Auto-detect identifier-tagged fields from metadata and pre-blank them.~~ -8. Add unit tests for each parameter combination. +8. 
~~Add unit tests for each parameter combination.~~ ### Phase 4: De-Identification Engine [Next] From 7835b526ad6a399cdab3cae7db1a79a5c20c32e9 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Thu, 11 Jun 2026 11:30:15 +0200 Subject: [PATCH 04/25] fix(redcap2): de-id correctness, API fidelity, manifest provenance (Phase 3.9) De-identification correctness (silent no-ops eliminated): - EAV-aware blanking: blank the value cell of rows whose field_name matches (CSV and JSON EAV) instead of matching headers that don't exist. - Checkbox-aware matching: a blank rule for a field also matches its field___code expansion columns (CSV and JSON). - Label-header support: headers are translated back to field names via the data dictionary (incl. 'Label (choice=...)' checkbox headers). - Anonymization audit in the manifest: per-rule match counts, with warnings for rules that matched no exported data. - metadata.csv filtering now derives the exported field set correctly per mode (EAV field_name values, label translation, checkbox bases, record-ID field seeded for EAV). REDCap API fidelity (verified against PHPCap/REDCap.jl/PyCap/REDCapR): - type is no longer sent to content=report (reports are always flat; type moved to record-only parameters). - rawOrLabel 'both' normalized to raw (not a real API value). - csvDelimiter/rawOrLabelHeaders only sent when applicable (CSV; flat). Manifest provenance (decisions 2026-06-11): - project id/title from content=project. - file-upload fields documented as not-exported attachments. - dictionary_fields_not_exported diff for unfiltered flat records exports (reveals token export-rights stripping). Other: - Bundle cache size cap: oversized bundles are rebuilt, not cached. - redcap.md: review findings, research summary (API facts, landscape, metadata standards), resolved open questions, revised Phase 3.9-6 plan. 
--- image/app/plugin/impl/redcap2/common.go | 546 +++++++++++++++--- image/app/plugin/impl/redcap2/common_test.go | 319 ++++++++-- image/app/plugin/impl/redcap2/deid_test.go | 189 ++++++ image/app/plugin/impl/redcap2/helper_test.go | 51 +- image/app/plugin/impl/redcap2/query_test.go | 9 +- image/app/plugin/impl/redcap2/streams_test.go | 8 + redcap.md | 256 ++++---- 7 files changed, 1120 insertions(+), 258 deletions(-) create mode 100644 image/app/plugin/impl/redcap2/deid_test.go diff --git a/image/app/plugin/impl/redcap2/common.go b/image/app/plugin/impl/redcap2/common.go index 02571e8..6dbf6d2 100644 --- a/image/app/plugin/impl/redcap2/common.go +++ b/image/app/plugin/impl/redcap2/common.go @@ -72,8 +72,21 @@ type bundleStore struct { const bundleCacheTTL = 5 * time.Minute +// maxCacheableBundleBytes bounds how much exported (potentially sensitive) +// data a single bundle may pin in process memory; larger bundles are rebuilt +// on demand instead of cached. Variable so tests can lower it. +var maxCacheableBundleBytes = 64 << 20 + var globalBundleCache = &bundleStore{entries: make(map[string]bundleCacheEntry)} +func (b generatedBundle) size() int { + total := 0 + for _, data := range b.Files { + total += len(data) + } + return total +} + func (s *bundleStore) get(key string) (generatedBundle, bool) { s.mu.Lock() defer s.mu.Unlock() @@ -171,11 +184,10 @@ func normalizePluginOptions(opts *pluginOptions) { opts.CsvDelimiter = "," } + // "both" is not a real REDCap API value (PyCap docstring fossil) — normalize to raw. switch strings.ToLower(strings.TrimSpace(opts.RawOrLabel)) { case "label": opts.RawOrLabel = "label" - case "both": - opts.RawOrLabel = "both" default: opts.RawOrLabel = "raw" } @@ -277,25 +289,39 @@ func baseForm(token, content, format string) url.Values { return form } -// applySharedExportParams sets parameters valid for both content=report and content=record. +// isEAV reports whether the export produces EAV-shaped output. 
+// Only records-mode exports have a type parameter; reports are always flat. +func isEAV(opts pluginOptions) bool { + return opts.ExportMode == "records" && opts.RecordType == "eav" +} + +// headersAreLabels reports whether the export's column headers will be field +// labels instead of field names. rawOrLabelHeaders only applies to flat CSV. +func headersAreLabels(opts pluginOptions) bool { + return opts.DataFormat == "csv" && opts.RawOrLabelHeaders == "label" && !isEAV(opts) +} + +// applySharedExportParams sets parameters valid for both content=report and +// content=record. Note: content=report has no "type" parameter (reports are +// always flat) — type is set in applyRecordOnlyFilters. func applySharedExportParams(form url.Values, opts pluginOptions) { - if opts.RecordType != "" { - form.Set("type", opts.RecordType) - } - if opts.CsvDelimiter == "\t" { + if opts.DataFormat == "csv" && opts.CsvDelimiter == "\t" { form.Set("csvDelimiter", "tab") } - if opts.RawOrLabel != "" && opts.RawOrLabel != "raw" { - form.Set("rawOrLabel", opts.RawOrLabel) + if opts.RawOrLabel == "label" { + form.Set("rawOrLabel", "label") } - if opts.RawOrLabelHeaders != "" && opts.RawOrLabelHeaders != "raw" { - form.Set("rawOrLabelHeaders", opts.RawOrLabelHeaders) + if headersAreLabels(opts) { + form.Set("rawOrLabelHeaders", "label") } } // applyRecordOnlyFilters sets parameters only valid for content=record exports. // These parameters are not supported by the content=report endpoint. 
func applyRecordOnlyFilters(form url.Values, opts pluginOptions) { + if opts.RecordType != "" { + form.Set("type", opts.RecordType) + } if len(opts.Fields) > 0 { form.Set("fields", strings.Join(opts.Fields, ",")) } @@ -374,29 +400,168 @@ func writeCSV(rows [][]string, delimiter rune) ([]byte, error) { return b.Bytes(), nil } -func applyBlankCSV(data []byte, delimiter rune, blanks map[string]bool) ([]byte, []string, error) { +// dictionary holds the parsed data-dictionary information needed for blanking, +// label-header translation, metadata filtering, and manifest documentation. +type dictionary struct { + fieldOrder []string // field names in dictionary order + fieldType map[string]string // field_name -> field_type + labelFields map[string][]string // field_label -> field names (labels can collide) +} + +func parseDictionary(metadataCSV []byte) dictionary { + dict := dictionary{ + fieldType: map[string]string{}, + labelFields: map[string][]string{}, + } + rows, err := parseCSV(metadataCSV, ',') + if err != nil || len(rows) == 0 { + return dict + } + nameIdx, typeIdx, labelIdx := -1, -1, -1 + for i, col := range rows[0] { + switch strings.ToLower(strings.TrimSpace(col)) { + case "field_name": + nameIdx = i + case "field_type": + typeIdx = i + case "field_label": + labelIdx = i + } + } + if nameIdx < 0 { + return dict + } + for _, row := range rows[1:] { + if nameIdx >= len(row) { + continue + } + name := strings.TrimSpace(row[nameIdx]) + if name == "" { + continue + } + dict.fieldOrder = append(dict.fieldOrder, name) + if typeIdx >= 0 && typeIdx < len(row) { + dict.fieldType[name] = strings.ToLower(strings.TrimSpace(row[typeIdx])) + } + if labelIdx >= 0 && labelIdx < len(row) { + label := strings.TrimSpace(row[labelIdx]) + if label != "" { + dict.labelFields[label] = append(dict.labelFields[label], name) + } + } + } + return dict +} + +// fileUploadFields returns the dictionary fields of type "file" — per-record +// attachments that are documented in the manifest 
but never downloaded. +func (d dictionary) fileUploadFields() []string { + res := []string{} + for _, name := range d.fieldOrder { + if d.fieldType[name] == "file" { + res = append(res, name) + } + } + return res +} + +// baseFieldName strips a checkbox expansion suffix: "phones___2" -> "phones". +func baseFieldName(col string) string { + if i := strings.Index(col, "___"); i > 0 { + return col[:i] + } + return col +} + +// resolveHeaderFields maps a data column header to candidate dictionary field +// names. Raw headers resolve via the checkbox base name; label headers are +// translated through the dictionary, including "Label (choice=...)" checkbox +// headers. Unknown headers resolve to themselves so that pseudo-columns +// (record, redcap_event_name, redcap_survey_identifier, ...) stay stable. +func resolveHeaderFields(header string, labelHeaders bool, dict dictionary) []string { + header = strings.TrimSpace(header) + if !labelHeaders { + return []string{baseFieldName(header)} + } + label := header + if i := strings.Index(header, " (choice="); i > 0 { + label = strings.TrimSpace(header[:i]) + } + if fields, ok := dict.labelFields[label]; ok && len(fields) > 0 { + return fields + } + return []string{baseFieldName(header)} +} + +// anonymizationAudit records the outcome of one blank rule so that silent +// no-ops are impossible: every requested transform reports how much data it +// actually touched. 
+type anonymizationAudit struct { + Field string `json:"field"` + Mode string `json:"mode"` + Matched int `json:"matched"` + Note string `json:"note,omitempty"` +} + +func buildAudit(blanks map[string]bool, matched map[string]int, unit string) []anonymizationAudit { + fields := make([]string, 0, len(blanks)) + for field := range blanks { + fields = append(fields, field) + } + sort.Strings(fields) + audit := make([]anonymizationAudit, 0, len(fields)) + for _, field := range fields { + entry := anonymizationAudit{Field: field, Mode: "blank", Matched: matched[field]} + if entry.Matched == 0 { + entry.Note = "field not present in export" + } else { + entry.Note = fmt.Sprintf("blanked %d %s", entry.Matched, unit) + } + audit = append(audit, entry) + } + return audit +} + +// blankFlatCSV blanks matching columns of a flat CSV export. A blank rule for +// field f matches columns named f, checkbox expansions f___code, and — when +// headers are labels — columns whose label translates back to f. +// Returns the (possibly rewritten) data, the exported dictionary field names, +// and the per-rule audit. +func blankFlatCSV(data []byte, delimiter rune, blanks map[string]bool, labelHeaders bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { rows, err := parseCSV(data, delimiter) if err != nil { - return nil, nil, err + return nil, nil, nil, err } if len(rows) == 0 { - return data, nil, nil + return data, nil, buildAudit(blanks, nil, "columns"), nil } - header := append([]string(nil), rows[0]...) 
- if len(blanks) == 0 { - return data, header, nil - } - indices := make([]int, 0, len(header)) - for i, field := range header { - if blanks[field] { - indices = append(indices, i) + header := rows[0] + exported := make([]string, 0, len(header)) + seen := map[string]bool{} + matched := map[string]int{} + blankCols := []int{} + for i, col := range header { + candidates := resolveHeaderFields(col, labelHeaders, dict) + for _, field := range candidates { + if !seen[field] { + seen[field] = true + exported = append(exported, field) + } + } + for _, field := range candidates { + if blanks[field] { + blankCols = append(blankCols, i) + matched[field]++ + break + } } } - if len(indices) == 0 { - return data, header, nil + audit := buildAudit(blanks, matched, "columns") + if len(blankCols) == 0 { + return data, exported, audit, nil } for rowIdx := 1; rowIdx < len(rows); rowIdx++ { - for _, colIdx := range indices { + for _, colIdx := range blankCols { if colIdx < len(rows[rowIdx]) { rows[rowIdx][colIdx] = "" } @@ -404,79 +569,215 @@ func applyBlankCSV(data []byte, delimiter rune, blanks map[string]bool) ([]byte, } out, err := writeCSV(rows, delimiter) if err != nil { - return nil, nil, err + return nil, nil, nil, err + } + return out, exported, audit, nil +} + +// blankEAVCSV blanks the value cells of EAV-shaped CSV exports +// (record, [redcap_event_name,] field_name, value): rows whose field_name +// matches a blanked field get an empty value. Falls back to flat handling if +// the EAV columns cannot be located. 
+func blankEAVCSV(data []byte, delimiter rune, blanks map[string]bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { + rows, err := parseCSV(data, delimiter) + if err != nil { + return nil, nil, nil, err + } + if len(rows) == 0 { + return data, nil, buildAudit(blanks, nil, "rows"), nil + } + fieldIdx, valueIdx := -1, -1 + for i, col := range rows[0] { + switch strings.ToLower(strings.TrimSpace(col)) { + case "field_name": + fieldIdx = i + case "value": + valueIdx = i + } + } + if fieldIdx < 0 || valueIdx < 0 { + return blankFlatCSV(data, delimiter, blanks, false, dict) + } + exported := eavExportedFields(dict) + seen := map[string]bool{} + for _, field := range exported { + seen[field] = true + } + matched := map[string]int{} + changed := false + for rowIdx := 1; rowIdx < len(rows); rowIdx++ { + row := rows[rowIdx] + if fieldIdx >= len(row) { + continue + } + field := baseFieldName(strings.TrimSpace(row[fieldIdx])) + if field == "" { + continue + } + if !seen[field] { + seen[field] = true + exported = append(exported, field) + } + if blanks[field] && valueIdx < len(row) { + if row[valueIdx] != "" { + row[valueIdx] = "" + changed = true + } + matched[field]++ + } + } + audit := buildAudit(blanks, matched, "rows") + if !changed { + return data, exported, audit, nil + } + out, err := writeCSV(rows, delimiter) + if err != nil { + return nil, nil, nil, err + } + return out, exported, audit, nil +} + +// eavExportedFields seeds the exported-field set for EAV outputs with the +// record-ID field (the first dictionary field): EAV rows reference it in the +// "record" column rather than as a field_name row, and it must stay in the +// filtered metadata. 
+func eavExportedFields(dict dictionary) []string { + if len(dict.fieldOrder) > 0 { + return []string{dict.fieldOrder[0]} } - return out, header, nil + return []string{} } -func applyBlankJSON(data []byte, blanks map[string]bool) ([]byte, []string, error) { +// blankFlatJSON blanks matching keys of flat JSON exports. JSON exports always +// use raw field names as keys, so only checkbox base-name matching applies. +func blankFlatJSON(data []byte, blanks map[string]bool) ([]byte, []string, []anonymizationAudit, error) { rows := make([]map[string]interface{}, 0) if err := json.Unmarshal(data, &rows); err != nil { - return nil, nil, err + return nil, nil, nil, err } keys := map[string]bool{} + matchedKeys := map[string]string{} // key -> blanked field for _, row := range rows { for k := range row { keys[k] = true } } - fields := make([]string, 0, len(keys)) + exportedSet := map[string]bool{} + exported := []string{} for k := range keys { - fields = append(fields, k) + field := baseFieldName(k) + if !exportedSet[field] { + exportedSet[field] = true + exported = append(exported, field) + } + if blanks[field] { + matchedKeys[k] = field + } } - sort.Strings(fields) - - if len(blanks) == 0 { - return data, fields, nil + sort.Strings(exported) + matched := map[string]int{} + for _, field := range matchedKeys { + matched[field]++ + } + audit := buildAudit(blanks, matched, "columns") + if len(matchedKeys) == 0 { + return data, exported, audit, nil } - for _, row := range rows { - for field := range blanks { - if _, ok := row[field]; ok { - row[field] = "" + for k := range matchedKeys { + if _, ok := row[k]; ok { + row[k] = "" } } } - out, err := json.Marshal(rows) if err != nil { - return nil, nil, err + return nil, nil, nil, err } - return out, fields, nil + return out, exported, audit, nil } -func exportReportData(ctx context.Context, baseURL, token string, opts pluginOptions) ([]byte, []string, error) { - form := baseForm(token, "report", opts.DataFormat) - 
form.Set("report_id", opts.ReportID) - applySharedExportParams(form, opts) - - body, err := redcapRequest(ctx, baseURL, form) +// blankEAVJSON blanks the "value" of EAV JSON rows whose "field_name" matches +// a blanked field. Falls back to flat handling when rows are not EAV-shaped. +func blankEAVJSON(data []byte, blanks map[string]bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { + rows := make([]map[string]interface{}, 0) + if err := json.Unmarshal(data, &rows); err != nil { + return nil, nil, nil, err + } + isEAVShaped := len(rows) > 0 + for _, row := range rows { + if _, ok := row["field_name"]; !ok { + isEAVShaped = false + break + } + } + if !isEAVShaped { + return blankFlatJSON(data, blanks) + } + exported := eavExportedFields(dict) + seen := map[string]bool{} + for _, field := range exported { + seen[field] = true + } + matched := map[string]int{} + changed := false + for _, row := range rows { + name, _ := row["field_name"].(string) + field := baseFieldName(strings.TrimSpace(name)) + if field == "" { + continue + } + if !seen[field] { + seen[field] = true + exported = append(exported, field) + } + if blanks[field] { + if _, ok := row["value"]; ok { + row["value"] = "" + changed = true + } + matched[field]++ + } + } + audit := buildAudit(blanks, matched, "rows") + if !changed { + return data, exported, audit, nil + } + out, err := json.Marshal(rows) if err != nil { - return nil, nil, err + return nil, nil, nil, err } + return out, exported, audit, nil +} - blanks := blankFields(opts) - if opts.DataFormat == "json" { - return applyBlankJSON(body, blanks) +// processExportData routes the raw API payload through the mode-appropriate +// blanking implementation and reports the exported dictionary fields plus the +// anonymization audit. 
+func processExportData(data []byte, opts pluginOptions, blanks map[string]bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { + switch { + case opts.DataFormat == "json" && isEAV(opts): + return blankEAVJSON(data, blanks, dict) + case opts.DataFormat == "json": + return blankFlatJSON(data, blanks) + case isEAV(opts): + return blankEAVCSV(data, reportDelimiter(opts), blanks, dict) + default: + return blankFlatCSV(data, reportDelimiter(opts), blanks, headersAreLabels(opts), dict) } - return applyBlankCSV(body, reportDelimiter(opts), blanks) } -func exportRecordData(ctx context.Context, baseURL, token string, opts pluginOptions) ([]byte, []string, error) { +func fetchReportData(ctx context.Context, baseURL, token string, opts pluginOptions) ([]byte, error) { + form := baseForm(token, "report", opts.DataFormat) + form.Set("report_id", opts.ReportID) + applySharedExportParams(form, opts) + return redcapRequest(ctx, baseURL, form) +} + +func fetchRecordData(ctx context.Context, baseURL, token string, opts pluginOptions) ([]byte, error) { form := baseForm(token, "record", opts.DataFormat) applySharedExportParams(form, opts) applyRecordOnlyFilters(form, opts) - - body, err := redcapRequest(ctx, baseURL, form) - if err != nil { - return nil, nil, err - } - - blanks := blankFields(opts) - if opts.DataFormat == "json" { - return applyBlankJSON(body, blanks) - } - return applyBlankCSV(body, reportDelimiter(opts), blanks) + return redcapRequest(ctx, baseURL, form) } // redcapRequestHeaderOnly fetches only the first CSV line of a REDCap response, @@ -822,7 +1123,22 @@ func sanitizeReportID(reportID string) string { return safe } -func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectInfoPath, eventsPath, mappingPath, redcapVersion string, warnings []string) ([]byte, error) { +// manifestExtras carries the provenance and audit context recorded alongside +// the export parameters (decisions 2026-06-11: attachments documented, never 
+// downloaded; every anonymization rule reports its outcome). +type manifestExtras struct { + Audit []anonymizationAudit + FileUploadFields []string + ProjectID interface{} + ProjectTitle string + DictionaryFieldsNotExported []string +} + +func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectInfoPath, eventsPath, mappingPath, redcapVersion string, warnings []string, extras manifestExtras) ([]byte, error) { + recordType := opts.RecordType + if opts.ExportMode != "records" { + recordType = "flat" // content=report has no type parameter + } manifest := map[string]interface{}{ "plugin": "redcap2", "export_mode": opts.ExportMode, @@ -830,7 +1146,7 @@ func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectI "redcap_version": redcapVersion, "export": map[string]interface{}{ "data_format": opts.DataFormat, - "record_type": opts.RecordType, + "record_type": recordType, "csv_delimiter": opts.CsvDelimiter, "raw_or_label": opts.RawOrLabel, "raw_or_label_headers": opts.RawOrLabelHeaders, @@ -858,6 +1174,31 @@ func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectI manifest["files"].(map[string]string)["form_event_mapping"] = mappingPath } + if extras.ProjectID != nil || extras.ProjectTitle != "" { + manifest["project"] = map[string]interface{}{ + "id": extras.ProjectID, + "title": extras.ProjectTitle, + } + } + if len(extras.FileUploadFields) > 0 { + manifest["attachments"] = map[string]interface{}{ + "file_upload_fields": extras.FileUploadFields, + "exported": false, + "note": "per-record file attachments are not exported by this plugin", + } + } + if len(extras.Audit) > 0 { + manifest["anonymization_audit"] = extras.Audit + for _, entry := range extras.Audit { + if entry.Matched == 0 { + warnings = append(warnings, fmt.Sprintf("anonymization: field %q matched no exported data", entry.Field)) + } + } + } + if len(extras.DictionaryFieldsNotExported) > 0 { + manifest["dictionary_fields_not_exported"] = 
extras.DictionaryFieldsNotExported + } + if len(opts.Variables) > 0 { manifest["variables"] = opts.Variables } @@ -868,20 +1209,43 @@ func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectI return json.MarshalIndent(manifest, "", " ") } +// projectIdentity extracts project_id and project_title from a +// content=project JSON payload (object or single-element array form). +func projectIdentity(payload []byte) (interface{}, string) { + read := func(obj map[string]interface{}) (interface{}, string) { + title, _ := obj["project_title"].(string) + return obj["project_id"], title + } + var obj map[string]interface{} + if err := json.Unmarshal(payload, &obj); err == nil { + return read(obj) + } + var arr []map[string]interface{} + if err := json.Unmarshal(payload, &arr); err == nil && len(arr) > 0 { + return read(arr[0]) + } + return nil, "" +} + func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOptions, reportID string) (generatedBundle, error) { - var dataBytes []byte - var dataFields []string - var err error - var basePath string + // The dictionary drives blanking, header translation, and metadata + // filtering, so it is fetched before the data export. 
+ metadataRaw, err := exportMetadataCSV(ctx, baseURL, token, nil) + if err != nil { + return generatedBundle{}, fmt.Errorf("metadata export failed: %w", err) + } + dict := parseDictionary(metadataRaw) + var rawData []byte + var basePath string if opts.ExportMode == "records" { - dataBytes, dataFields, err = exportRecordData(ctx, baseURL, token, opts) + rawData, err = fetchRecordData(ctx, baseURL, token, opts) if err != nil { return generatedBundle{}, fmt.Errorf("record export failed: %w", err) } basePath = "redcap/records" } else { - dataBytes, dataFields, err = exportReportData(ctx, baseURL, token, opts) + rawData, err = fetchReportData(ctx, baseURL, token, opts) if err != nil { return generatedBundle{}, fmt.Errorf("report export failed: %w", err) } @@ -889,10 +1253,12 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp basePath = fmt.Sprintf("redcap/report-%s", safeID) } - metadataRaw, err := exportMetadataCSV(ctx, baseURL, token, nil) + blanks := blankFields(opts) + dataBytes, dataFields, audit, err := processExportData(rawData, opts, blanks, dict) if err != nil { - return generatedBundle{}, fmt.Errorf("metadata export failed: %w", err) + return generatedBundle{}, fmt.Errorf("export processing failed: %w", err) } + metadataBytes, err := filterMetadataCSV(metadataRaw, dataFields) if err != nil { return generatedBundle{}, fmt.Errorf("metadata filtering failed: %w", err) @@ -938,6 +1304,29 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp } } + projectID, projectTitle := projectIdentity(projectInfoBytes) + extras := manifestExtras{ + Audit: audit, + FileUploadFields: dict.fileUploadFields(), + ProjectID: projectID, + ProjectTitle: projectTitle, + } + // In an unfiltered flat records export, dictionary fields missing from the + // output reveal server-side stripping (token export rights). With filters + // or report definitions the diff is expected, so it is not recorded. 
+ if opts.ExportMode == "records" && !isEAV(opts) && + len(opts.Fields) == 0 && len(opts.Forms) == 0 && len(opts.Events) == 0 { + exported := make(map[string]bool, len(dataFields)) + for _, field := range dataFields { + exported[field] = true + } + for _, field := range dict.fieldOrder { + if !exported[field] { + extras.DictionaryFieldsNotExported = append(extras.DictionaryFieldsNotExported, field) + } + } + } + manifestBytes, err := makeManifest( opts, reportID, @@ -948,6 +1337,7 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp mappingPath, redcapVersion, warnings, + extras, ) if err != nil { return generatedBundle{}, fmt.Errorf("manifest generation failed: %w", err) @@ -1009,6 +1399,10 @@ func cachedBuildExportBundle(ctx context.Context, baseURL, token string, opts pl if err != nil { return generatedBundle{}, err } - globalBundleCache.set(key, bundle) + if bundle.size() <= maxCacheableBundleBytes { + globalBundleCache.set(key, bundle) + } else { + logging.Logger.Printf("redcap2: bundle too large to cache (%d bytes), will rebuild on demand", bundle.size()) + } return bundle, nil } diff --git a/image/app/plugin/impl/redcap2/common_test.go b/image/app/plugin/impl/redcap2/common_test.go index 4d289ca..550a767 100644 --- a/image/app/plugin/impl/redcap2/common_test.go +++ b/image/app/plugin/impl/redcap2/common_test.go @@ -98,7 +98,7 @@ func TestParsePluginOptionsUnknownValuesFallBackToDefaults(t *testing.T) { "dataFormat": "xml", "recordType": "wide", "csvDelimiter": ";", - "rawOrLabel": "other", + "rawOrLabel": "both", "rawOrLabelHeaders": "other" }`) if err != nil { @@ -180,22 +180,16 @@ func TestApplySharedExportParamsDefaults(t *testing.T) { opts, _ := parsePluginOptions("") form := url.Values{} applySharedExportParams(form, opts) - if got := form.Get("type"); got != "flat" { - t.Errorf("type = %q, want flat", got) - } - for _, key := range []string{"csvDelimiter", "rawOrLabel", "rawOrLabelHeaders"} { - if _, ok := form[key]; ok { 
- t.Errorf("default options should not send %q", key) - } + if len(form) != 0 { + t.Fatalf("default options should send no shared params (type is records-only), got %v", form) } } func TestApplySharedExportParamsNonDefaults(t *testing.T) { - opts, _ := parsePluginOptions(`{"recordType":"eav","csvDelimiter":"tab","rawOrLabel":"label","rawOrLabelHeaders":"label"}`) + opts, _ := parsePluginOptions(`{"csvDelimiter":"tab","rawOrLabel":"label","rawOrLabelHeaders":"label"}`) form := url.Values{} applySharedExportParams(form, opts) want := map[string]string{ - "type": "eav", "csvDelimiter": "tab", "rawOrLabel": "label", "rawOrLabelHeaders": "label", @@ -205,19 +199,46 @@ func TestApplySharedExportParamsNonDefaults(t *testing.T) { t.Errorf("%s = %q, want %q", key, got, value) } } + if _, ok := form["type"]; ok { + t.Error("type must not be a shared parameter (content=report has no type)") + } } -func TestApplyRecordOnlyFiltersEmpty(t *testing.T) { +func TestApplySharedExportParamsJSONSuppressesCSVParams(t *testing.T) { + opts, _ := parsePluginOptions(`{"dataFormat":"json","csvDelimiter":"tab","rawOrLabelHeaders":"label"}`) + form := url.Values{} + applySharedExportParams(form, opts) + for _, key := range []string{"csvDelimiter", "rawOrLabelHeaders"} { + if _, ok := form[key]; ok { + t.Errorf("%s must not be sent for JSON exports", key) + } + } +} + +func TestApplySharedExportParamsEAVSuppressesHeaderLabels(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records","recordType":"eav","rawOrLabelHeaders":"label"}`) + form := url.Values{} + applySharedExportParams(form, opts) + if _, ok := form["rawOrLabelHeaders"]; ok { + t.Error("rawOrLabelHeaders must not be sent for EAV exports (flat CSV only)") + } +} + +func TestApplyRecordOnlyFiltersDefaults(t *testing.T) { opts, _ := parsePluginOptions("") form := url.Values{} applyRecordOnlyFilters(form, opts) - if len(form) != 0 { - t.Fatalf("expected no record-only params for default options, got %v", form) + if got := 
form.Get("type"); got != "flat" { + t.Errorf("type = %q, want flat", got) + } + if len(form) != 1 { + t.Fatalf("expected only type for default options, got %v", form) } } func TestApplyRecordOnlyFiltersFull(t *testing.T) { opts, _ := parsePluginOptions(`{ + "recordType": "eav", "fields": ["age", "name", "age"], "forms": ["demographics"], "events": ["baseline_arm_1"], @@ -231,6 +252,7 @@ func TestApplyRecordOnlyFiltersFull(t *testing.T) { form := url.Values{} applyRecordOnlyFilters(form, opts) want := map[string]string{ + "type": "eav", "fields": "age,name", "forms": "demographics", "events": "baseline_arm_1", @@ -260,69 +282,204 @@ func TestApplyRecordOnlyFiltersKeepsExplicitTimes(t *testing.T) { } } -func TestApplyBlankCSV(t *testing.T) { - out, header, err := applyBlankCSV([]byte(testDataCSV), ',', map[string]bool{"name": true, "email": true}) - if err != nil { - t.Fatalf("applyBlankCSV returned error: %v", err) +func TestParseDictionary(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + if !reflect.DeepEqual(dict.fieldOrder, []string{"record_id", "name", "email", "age"}) { + t.Errorf("fieldOrder = %v", dict.fieldOrder) + } + if dict.fieldType["name"] != "text" { + t.Errorf("fieldType[name] = %q, want text", dict.fieldType["name"]) } - wantHeader := []string{"record_id", "name", "email", "age"} - if !reflect.DeepEqual(header, wantHeader) { - t.Errorf("header = %v, want %v", header, wantHeader) + if !reflect.DeepEqual(dict.labelFields["Email Address"], []string{"email"}) { + t.Errorf("labelFields[Email Address] = %v, want [email]", dict.labelFields["Email Address"]) + } +} + +func TestDictionaryFileUploadFields(t *testing.T) { + metadata := "field_name,field_type,field_label\n" + + "record_id,text,Record ID\n" + + "consent_scan,file,Consent Scan\n" + + "mri_image,file,MRI Image\n" + dict := parseDictionary([]byte(metadata)) + if got := dict.fileUploadFields(); !reflect.DeepEqual(got, []string{"consent_scan", "mri_image"}) { + 
t.Fatalf("fileUploadFields = %v", got) + } +} + +func TestBaseFieldName(t *testing.T) { + tests := map[string]string{ + "phones___1": "phones", + "phones": "phones", + "a___b___c": "a", + "___x": "___x", // no base before the separator + } + for in, want := range tests { + if got := baseFieldName(in); got != want { + t.Errorf("baseFieldName(%q) = %q, want %q", in, got, want) + } + } +} + +func TestResolveHeaderFields(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + if got := resolveHeaderFields("phones___2", false, dict); !reflect.DeepEqual(got, []string{"phones"}) { + t.Errorf("raw checkbox header = %v, want [phones]", got) + } + if got := resolveHeaderFields("Email Address", true, dict); !reflect.DeepEqual(got, []string{"email"}) { + t.Errorf("label header = %v, want [email]", got) + } + if got := resolveHeaderFields("Full Name (choice=Other)", true, dict); !reflect.DeepEqual(got, []string{"name"}) { + t.Errorf("checkbox label header = %v, want [name]", got) + } + if got := resolveHeaderFields("redcap_event_name", true, dict); !reflect.DeepEqual(got, []string{"redcap_event_name"}) { + t.Errorf("unknown header should resolve to itself, got %v", got) + } +} + +func TestBlankFlatCSV(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + out, exported, audit, err := blankFlatCSV([]byte(testDataCSV), ',', map[string]bool{"name": true, "email": true}, false, dict) + if err != nil { + t.Fatalf("blankFlatCSV returned error: %v", err) } want := "record_id,name,email,age\n1,,,34\n2,,,29\n" if string(out) != want { t.Errorf("blanked CSV = %q, want %q", string(out), want) } + if !reflect.DeepEqual(exported, []string{"record_id", "name", "email", "age"}) { + t.Errorf("exported = %v", exported) + } + for _, entry := range audit { + if entry.Matched != 1 { + t.Errorf("audit %s matched = %d, want 1", entry.Field, entry.Matched) + } + } +} + +func TestBlankFlatCSVCheckboxExpansion(t *testing.T) { + dict := 
parseDictionary([]byte("field_name,field_type,field_label\nrecord_id,text,Record ID\nphones,checkbox,Phone Types\n")) + data := "record_id,phones___1,phones___2\n1,555-1234,555-5678\n" + out, exported, audit, err := blankFlatCSV([]byte(data), ',', map[string]bool{"phones": true}, false, dict) + if err != nil { + t.Fatalf("blankFlatCSV returned error: %v", err) + } + want := "record_id,phones___1,phones___2\n1,,\n" + if string(out) != want { + t.Errorf("blanked CSV = %q, want %q", string(out), want) + } + if !reflect.DeepEqual(exported, []string{"record_id", "phones"}) { + t.Errorf("exported = %v, want base names", exported) + } + if len(audit) != 1 || audit[0].Matched != 2 { + t.Errorf("audit = %+v, want phones matched=2", audit) + } } -func TestApplyBlankCSVNoMatchingColumns(t *testing.T) { - out, header, err := applyBlankCSV([]byte(testDataCSV), ',', map[string]bool{"missing": true}) +func TestBlankFlatCSVLabelHeaders(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + data := "Record ID,Full Name,Email Address,Age\n1,John,john@example.org,34\n" + out, exported, audit, err := blankFlatCSV([]byte(data), ',', map[string]bool{"email": true}, true, dict) if err != nil { - t.Fatalf("applyBlankCSV returned error: %v", err) + t.Fatalf("blankFlatCSV returned error: %v", err) + } + want := "Record ID,Full Name,Email Address,Age\n1,John,,34\n" + if string(out) != want { + t.Errorf("blanked CSV = %q, want %q", string(out), want) + } + if !reflect.DeepEqual(exported, []string{"record_id", "name", "email", "age"}) { + t.Errorf("exported = %v, want translated field names", exported) + } + if len(audit) != 1 || audit[0].Matched != 1 { + t.Errorf("audit = %+v", audit) + } +} + +func TestBlankFlatCSVZeroMatchAudit(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + out, _, audit, err := blankFlatCSV([]byte(testDataCSV), ',', map[string]bool{"missing": true}, false, dict) + if err != nil { + t.Fatalf("blankFlatCSV returned error: %v", err) } if 
string(out) != testDataCSV { - t.Errorf("data changed despite no matching blank columns") + t.Error("data changed despite no matching blank columns") } - if len(header) != 4 { - t.Errorf("header = %v, want 4 columns", header) + if len(audit) != 1 || audit[0].Matched != 0 || audit[0].Note == "" { + t.Errorf("zero-match audit missing note: %+v", audit) } } -func TestApplyBlankCSVEmptyInput(t *testing.T) { - out, header, err := applyBlankCSV(nil, ',', map[string]bool{"name": true}) +func TestBlankEAVCSV(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + data := "record,redcap_event_name,field_name,value\n" + + "1,baseline_arm_1,name,John\n" + + "1,baseline_arm_1,email,john@example.org\n" + + "2,baseline_arm_1,email,jane@example.org\n" + + "2,baseline_arm_1,age,29\n" + out, exported, audit, err := blankEAVCSV([]byte(data), ',', map[string]bool{"email": true}, dict) if err != nil { - t.Fatalf("applyBlankCSV returned error: %v", err) + t.Fatalf("blankEAVCSV returned error: %v", err) + } + want := "record,redcap_event_name,field_name,value\n" + + "1,baseline_arm_1,name,John\n" + + "1,baseline_arm_1,email,\n" + + "2,baseline_arm_1,email,\n" + + "2,baseline_arm_1,age,29\n" + if string(out) != want { + t.Errorf("blanked EAV CSV = %q, want %q", string(out), want) + } + if !reflect.DeepEqual(exported, []string{"record_id", "name", "email", "age"}) { + t.Errorf("exported = %v (record_id must be seeded)", exported) } - if len(out) != 0 || header != nil { - t.Errorf("expected empty passthrough, got out=%q header=%v", out, header) + if len(audit) != 1 || audit[0].Matched != 2 { + t.Errorf("audit = %+v, want email matched=2 rows", audit) } } -func TestApplyBlankJSON(t *testing.T) { - out, fields, err := applyBlankJSON([]byte(testDataJSON), map[string]bool{"name": true, "email": true}) +func TestBlankEAVJSON(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + data := 
`[{"record":"1","field_name":"name","value":"John"},{"record":"1","field_name":"email","value":"john@example.org"}]` + out, exported, audit, err := blankEAVJSON([]byte(data), map[string]bool{"email": true}, dict) if err != nil { - t.Fatalf("applyBlankJSON returned error: %v", err) + t.Fatalf("blankEAVJSON returned error: %v", err) + } + rows := []map[string]string{} + if err := json.Unmarshal(out, &rows); err != nil { + t.Fatalf("blanked EAV JSON invalid: %v", err) + } + if rows[1]["value"] != "" || rows[0]["value"] != "John" { + t.Errorf("unexpected EAV JSON blanking: %v", rows) } - wantFields := []string{"age", "email", "name", "record_id"} - if !reflect.DeepEqual(fields, wantFields) { - t.Errorf("fields = %v, want %v", fields, wantFields) + if !reflect.DeepEqual(exported, []string{"record_id", "name", "email"}) { + t.Errorf("exported = %v", exported) + } + if len(audit) != 1 || audit[0].Matched != 1 { + t.Errorf("audit = %+v", audit) + } +} + +func TestBlankFlatJSONCheckboxExpansion(t *testing.T) { + data := `[{"record_id":"1","phones___1":"555-1234","phones___2":"555-5678","age":"34"}]` + out, exported, audit, err := blankFlatJSON([]byte(data), map[string]bool{"phones": true}) + if err != nil { + t.Fatalf("blankFlatJSON returned error: %v", err) } rows := []map[string]string{} if err := json.Unmarshal(out, &rows); err != nil { - t.Fatalf("blanked JSON is invalid: %v", err) + t.Fatalf("blanked JSON invalid: %v", err) } - for i, row := range rows { - if row["name"] != "" || row["email"] != "" { - t.Errorf("row %d not blanked: %v", i, row) - } - if row["record_id"] == "" || row["age"] == "" { - t.Errorf("row %d lost non-blanked values: %v", i, row) - } + if rows[0]["phones___1"] != "" || rows[0]["phones___2"] != "" || rows[0]["age"] != "34" { + t.Errorf("unexpected JSON blanking: %v", rows) + } + if !reflect.DeepEqual(exported, []string{"age", "phones", "record_id"}) { + t.Errorf("exported = %v, want sorted base names", exported) + } + if len(audit) != 1 || 
audit[0].Matched != 2 { + t.Errorf("audit = %+v, want phones matched=2", audit) } } -func TestApplyBlankJSONInvalid(t *testing.T) { - if _, _, err := applyBlankJSON([]byte("not json"), nil); err == nil { +func TestBlankFlatJSONInvalid(t *testing.T) { + if _, _, _, err := blankFlatJSON([]byte("not json"), nil); err == nil { t.Fatal("expected error for invalid JSON input") } } @@ -332,9 +489,9 @@ func TestFilterMetadataCSV(t *testing.T) { if err != nil { t.Fatalf("filterMetadataCSV returned error: %v", err) } - want := "field_name,form_name,field_type,identifier\n" + - "record_id,demographics,text,\n" + - "age,demographics,text,\n" + want := "field_name,form_name,field_type,field_label,identifier\n" + + "record_id,demographics,text,Record ID,\n" + + "age,demographics,text,Age,\n" if string(out) != want { t.Errorf("filtered metadata = %q, want %q", string(out), want) } @@ -377,6 +534,23 @@ func TestDetectLongitudinal(t *testing.T) { } } +func TestProjectIdentity(t *testing.T) { + id, title := projectIdentity([]byte(`{"project_id":42,"project_title":"Demo Study"}`)) + if title != "Demo Study" { + t.Errorf("title = %q", title) + } + if num, ok := id.(float64); !ok || num != 42 { + t.Errorf("id = %v (%T), want 42", id, id) + } + id, title = projectIdentity([]byte(`[{"project_id":"7","project_title":"Array Form"}]`)) + if id != "7" || title != "Array Form" { + t.Errorf("array form: id=%v title=%q", id, title) + } + if id, title = projectIdentity([]byte(`not json`)); id != nil || title != "" { + t.Errorf("invalid payload should yield empty identity, got %v %q", id, title) + } +} + func TestDeduplicatedSelectItems(t *testing.T) { got := deduplicatedSelectItems([]string{" b", "a", "b", ""}) want := []types.SelectItem{ @@ -404,10 +578,16 @@ func TestDeduplicatedSelectItemsWithIdentifiers(t *testing.T) { } func TestMakeManifestReportMode(t *testing.T) { - opts, _ := 
parsePluginOptions(`{"exportMode":"report","reportId":"7","variables":[{"name":"email","anonymization":"blank"}]}`) + opts, _ := parsePluginOptions(`{"exportMode":"report","reportId":"7","recordType":"eav","variables":[{"name":"email","anonymization":"blank"}]}`) + extras := manifestExtras{ + Audit: []anonymizationAudit{{Field: "email", Mode: "blank", Matched: 1}}, + FileUploadFields: []string{"consent_scan"}, + ProjectID: float64(1), + ProjectTitle: "Demo", + } data, err := makeManifest(opts, "7", "redcap/report-7/data.csv", "redcap/report-7/metadata.csv", "redcap/report-7/project_info.json", "redcap/report-7/events.csv", "redcap/report-7/form_event_mapping.csv", - "14.5.5", []string{"something failed"}) + "14.5.5", []string{"something failed"}, extras) if err != nil { t.Fatalf("makeManifest returned error: %v", err) } @@ -421,13 +601,25 @@ func TestMakeManifestReportMode(t *testing.T) { if manifest["report_id"] != "7" { t.Errorf("report_id = %v, want 7", manifest["report_id"]) } - if manifest["redcap_version"] != "14.5.5" { - t.Errorf("redcap_version = %v, want 14.5.5", manifest["redcap_version"]) + export := manifest["export"].(map[string]interface{}) + if export["record_type"] != "flat" { + t.Errorf("report-mode record_type = %v, want forced flat (no type param)", export["record_type"]) } files := manifest["files"].(map[string]interface{}) if files["events"] != "redcap/report-7/events.csv" || files["form_event_mapping"] != "redcap/report-7/form_event_mapping.csv" { t.Errorf("longitudinal files missing from manifest: %v", files) } + project := manifest["project"].(map[string]interface{}) + if project["title"] != "Demo" { + t.Errorf("project = %v", project) + } + attachments := manifest["attachments"].(map[string]interface{}) + if attachments["exported"] != false { + t.Errorf("attachments.exported = %v, want false", attachments["exported"]) + } + if _, ok := manifest["anonymization_audit"]; !ok { + t.Error("anonymization_audit missing from manifest") + } if _, ok 
:= manifest["variables"]; !ok { t.Error("variables missing from manifest") } @@ -439,7 +631,7 @@ func TestMakeManifestReportMode(t *testing.T) { func TestMakeManifestRecordsMode(t *testing.T) { opts, _ := parsePluginOptions(`{"exportMode":"records"}`) data, err := makeManifest(opts, "", "redcap/records/data.csv", "redcap/records/metadata.csv", - "redcap/records/project_info.json", "", "", "", nil) + "redcap/records/project_info.json", "", "", "", nil, manifestExtras{}) if err != nil { t.Fatalf("makeManifest returned error: %v", err) } @@ -450,7 +642,7 @@ func TestMakeManifestRecordsMode(t *testing.T) { if _, ok := manifest["report_id"]; ok { t.Error("records-mode manifest should not contain report_id") } - for _, key := range []string{"variables", "warnings"} { + for _, key := range []string{"variables", "warnings", "attachments", "anonymization_audit", "project", "dictionary_fields_not_exported"} { if _, ok := manifest[key]; ok { t.Errorf("empty %s should be omitted from manifest", key) } @@ -463,6 +655,21 @@ func TestMakeManifestRecordsMode(t *testing.T) { } } +func TestMakeManifestZeroMatchAuditAddsWarning(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + extras := manifestExtras{Audit: []anonymizationAudit{{Field: "ghost", Mode: "blank", Matched: 0, Note: "field not present in export"}}} + data, err := makeManifest(opts, "", "d", "m", "p", "", "", "", nil, extras) + if err != nil { + t.Fatalf("makeManifest returned error: %v", err) + } + manifest := map[string]interface{}{} + _ = json.Unmarshal(data, &manifest) + warnings, ok := manifest["warnings"].([]interface{}) + if !ok || len(warnings) == 0 || !strings.Contains(warnings[0].(string), "ghost") { + t.Fatalf("expected zero-match warning, got %v", manifest["warnings"]) + } +} + func TestBundleCacheKeyStability(t *testing.T) { base, _ := parsePluginOptions(`{"exportMode":"report","reportId":"7","generatedAt":"2026-01-01T00:00:00Z"}`) sameButLater := base diff --git 
a/image/app/plugin/impl/redcap2/deid_test.go b/image/app/plugin/impl/redcap2/deid_test.go new file mode 100644 index 0000000..124a539 --- /dev/null +++ b/image/app/plugin/impl/redcap2/deid_test.go @@ -0,0 +1,189 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "context" + "encoding/json" + "integration/app/plugin/types" + "io" + "testing" +) + +func queryAndRead(t *testing.T, f *fakeRedcap, pluginOpts, path string) ([]byte, map[string]interface{}) { + t.Helper() + nodes, err := Query(context.Background(), types.CompareRequest{Url: f.url(), Token: "tok", PluginOptions: pluginOpts}, nil) + if err != nil { + t.Fatalf("Query returned error: %v", err) + } + streams, err := Streams(context.Background(), nodes, types.StreamParams{Url: f.url(), Token: "tok", PluginOptions: pluginOpts}) + if err != nil { + t.Fatalf("Streams returned error: %v", err) + } + read := func(p string) []byte { + stream, ok := streams.Streams[p] + if !ok { + t.Fatalf("no stream for %s (have %v)", p, nodes) + } + reader, err := stream.Open() + if err != nil { + t.Fatalf("open %s: %v", p, err) + } + data, err := io.ReadAll(reader) + if err != nil { + t.Fatalf("read %s: %v", p, err) + } + return data + } + var manifest map[string]interface{} + base, _ := splitPath(path) + if err := json.Unmarshal(read(base+"/manifest.json"), &manifest); err != nil { + t.Fatalf("manifest invalid: %v", err) + } + return read(path), manifest +} + +// EAV exports must blank by field_name row, not by header, and metadata.csv +// must keep the fields seen in the EAV rows plus the record-ID field. 
+func TestEndToEndEAVBlanking(t *testing.T) { + f := newFakeRedcap() + f.eavCSV = "record,redcap_event_name,field_name,value\n" + + "1,baseline_arm_1,name,John\n" + + "1,baseline_arm_1,email,john@example.org\n" + + "2,baseline_arm_1,email,jane@example.org\n" + + "2,baseline_arm_1,age,29\n" + defer f.close() + + pluginOpts := `{ + "exportMode": "records", + "recordType": "eav", + "variables": [{"name": "email", "anonymization": "blank"}] + }` + data, manifest := queryAndRead(t, f, pluginOpts, "redcap/records/data.csv") + + want := "record,redcap_event_name,field_name,value\n" + + "1,baseline_arm_1,name,John\n" + + "1,baseline_arm_1,email,\n" + + "2,baseline_arm_1,email,\n" + + "2,baseline_arm_1,age,29\n" + if string(data) != want { + t.Errorf("EAV data.csv = %q, want %q", string(data), want) + } + + audit := manifest["anonymization_audit"].([]interface{}) + entry := audit[0].(map[string]interface{}) + if entry["field"] != "email" || entry["matched"] != float64(2) { + t.Errorf("audit = %v, want email matched=2", audit) + } + if form := f.lastForm("record"); form.Get("type") != "eav" { + t.Errorf("record export type = %q, want eav", form.Get("type")) + } +} + +// Label-header exports must translate headers through the dictionary so that +// blanking by field name still applies, and metadata.csv keeps all fields. 
+func TestEndToEndLabelHeaderBlanking(t *testing.T) { + f := newFakeRedcap() + f.labelCSV = "Record ID,Full Name,Email Address,Age\n1,John,john@example.org,34\n2,Jane,jane@example.org,29\n" + defer f.close() + + pluginOpts := `{ + "exportMode": "records", + "rawOrLabelHeaders": "label", + "variables": [ + {"name": "name", "anonymization": "blank"}, + {"name": "email", "anonymization": "blank"} + ] + }` + data, manifest := queryAndRead(t, f, pluginOpts, "redcap/records/data.csv") + + want := "Record ID,Full Name,Email Address,Age\n1,,,34\n2,,,29\n" + if string(data) != want { + t.Errorf("label-header data.csv = %q, want %q", string(data), want) + } + if _, ok := manifest["warnings"]; ok { + t.Errorf("no warnings expected for fully matched blanking, got %v", manifest["warnings"]) + } + if form := f.lastForm("record"); form.Get("rawOrLabelHeaders") != "label" { + t.Errorf("rawOrLabelHeaders = %q, want label", form.Get("rawOrLabelHeaders")) + } +} + +// Checkbox fields expand to field___code columns; a blank rule for the base +// field must blank every expansion, and the manifest must document attachments +// (file-upload fields) and dictionary fields missing from the export. 
+func TestEndToEndCheckboxAndAttachmentManifest(t *testing.T) {
+	f := newFakeRedcap()
+	f.metadataCSV = "field_name,form_name,field_type,field_label,identifier\n" +
+		"record_id,demographics,text,Record ID,\n" +
+		"phones,demographics,checkbox,Phone Types,y\n" +
+		"consent_scan,demographics,file,Consent Scan,\n"
+	f.dataCSV = "record_id,phones___1,phones___2\n1,555-1234,555-5678\n"
+	defer f.close()
+
+	pluginOpts := `{
+		"exportMode": "records",
+		"variables": [{"name": "phones", "anonymization": "blank"}]
+	}`
+	data, manifest := queryAndRead(t, f, pluginOpts, "redcap/records/data.csv")
+
+	want := "record_id,phones___1,phones___2\n1,,\n"
+	if string(data) != want {
+		t.Errorf("checkbox data.csv = %q, want %q", string(data), want)
+	}
+
+	audit := manifest["anonymization_audit"].([]interface{})
+	entry := audit[0].(map[string]interface{})
+	if entry["field"] != "phones" || entry["matched"] != float64(2) {
+		t.Errorf("audit = %v, want phones matched=2", audit)
+	}
+
+	attachments := manifest["attachments"].(map[string]interface{})
+	fields := attachments["file_upload_fields"].([]interface{})
+	if len(fields) != 1 || fields[0] != "consent_scan" || attachments["exported"] != false {
+		t.Errorf("attachments = %v, want consent_scan not exported", attachments)
+	}
+
+	notExported := manifest["dictionary_fields_not_exported"].([]interface{})
+	if len(notExported) != 1 || notExported[0] != "consent_scan" {
+		t.Errorf("dictionary_fields_not_exported = %v, want [consent_scan]", notExported)
+	}
+}
+
+// Zero-match blank rules must surface as manifest warnings, never silently.
+func TestEndToEndZeroMatchBlankWarning(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + pluginOpts := `{ + "exportMode": "report", + "reportId": "7", + "variables": [{"name": "not_in_report", "anonymization": "blank"}] + }` + _, manifest := queryAndRead(t, f, pluginOpts, "redcap/report-7/data.csv") + + warnings, ok := manifest["warnings"].([]interface{}) + if !ok || len(warnings) == 0 { + t.Fatalf("expected zero-match warning in manifest, got %v", manifest["warnings"]) + } +} + +// Bundles above the cache cap must be rebuilt instead of cached. +func TestOversizedBundleIsNotCached(t *testing.T) { + originalCap := maxCacheableBundleBytes + maxCacheableBundleBytes = 1 // everything is oversized + defer func() { maxCacheableBundleBytes = originalCap }() + + f := newFakeRedcap() + defer f.close() + + pluginOpts := `{"exportMode":"report","reportId":"7"}` + for i := 0; i < 2; i++ { + if _, err := Query(context.Background(), types.CompareRequest{Url: f.url(), Token: "tok", PluginOptions: pluginOpts}, nil); err != nil { + t.Fatalf("Query %d returned error: %v", i, err) + } + } + if got := f.calls("report"); got != 2 { + t.Fatalf("report called %d times, want 2 (oversized bundle must not be cached)", got) + } +} diff --git a/image/app/plugin/impl/redcap2/helper_test.go b/image/app/plugin/impl/redcap2/helper_test.go index 286e7a1..54462c6 100644 --- a/image/app/plugin/impl/redcap2/helper_test.go +++ b/image/app/plugin/impl/redcap2/helper_test.go @@ -15,11 +15,11 @@ const ( testDataJSON = `[{"record_id":"1","name":"John","email":"john@example.org","age":"34"},{"record_id":"2","name":"Jane","email":"jane@example.org","age":"29"}]` - testMetadataCSV = "field_name,form_name,field_type,identifier\n" + - "record_id,demographics,text,\n" + - "name,demographics,text,y\n" + - "email,demographics,text,y\n" + - "age,demographics,text,\n" + testMetadataCSV = "field_name,form_name,field_type,field_label,identifier\n" + + "record_id,demographics,text,Record ID,\n" + + 
"name,demographics,text,Full Name,y\n" + + "email,demographics,text,Email Address,y\n" + + "age,demographics,text,Age,\n" testEventsCSV = "event_name,arm_num,unique_event_name\nBaseline,1,baseline_arm_1\n" testMappingCSV = "arm_num,unique_event_name,form\n1,baseline_arm_1,demographics\n" @@ -28,11 +28,19 @@ const ( // fakeRedcap is a minimal in-memory REDCap API stub. It records every form // submitted per content type so tests can assert on the exact parameters sent. +// Fixture overrides allow individual tests to serve EAV, label-header, +// checkbox, or custom-dictionary payloads. type fakeRedcap struct { mu sync.Mutex forms map[string][]url.Values longitudinal bool failReport bool + metadataCSV string // overrides testMetadataCSV when set + dataCSV string // overrides testDataCSV when set + dataJSON string // overrides testDataJSON when set + eavCSV string // served for type=eav csv requests when set + eavJSON string // served for type=eav json requests when set + labelCSV string // served for rawOrLabelHeaders=label csv requests when set server *httptest.Server } @@ -84,11 +92,15 @@ func (f *fakeRedcap) handle(w http.ResponseWriter, r *http.Request) { http.Error(w, "report unavailable", http.StatusInternalServerError) return } - writeTestData(w, form) + f.writeData(w, form) case "record": - writeTestData(w, form) + f.writeData(w, form) case "metadata": - _, _ = w.Write([]byte(testMetadataCSV)) + metadata := testMetadataCSV + if f.metadataCSV != "" { + metadata = f.metadataCSV + } + _, _ = w.Write([]byte(metadata)) case "project": longitudinalFlag := "0" if longitudinal { @@ -106,12 +118,31 @@ func (f *fakeRedcap) handle(w http.ResponseWriter, r *http.Request) { } } -func writeTestData(w http.ResponseWriter, form url.Values) { +func (f *fakeRedcap) writeData(w http.ResponseWriter, form url.Values) { if form.Get("format") == "json" { - _, _ = w.Write([]byte(testDataJSON)) + if form.Get("type") == "eav" && f.eavJSON != "" { + _, _ = w.Write([]byte(f.eavJSON)) + 
return + } + data := testDataJSON + if f.dataJSON != "" { + data = f.dataJSON + } + _, _ = w.Write([]byte(data)) + return + } + if form.Get("type") == "eav" && f.eavCSV != "" { + _, _ = w.Write([]byte(f.eavCSV)) + return + } + if form.Get("rawOrLabelHeaders") == "label" && f.labelCSV != "" { + _, _ = w.Write([]byte(f.labelCSV)) return } data := testDataCSV + if f.dataCSV != "" { + data = f.dataCSV + } if form.Get("csvDelimiter") == "tab" { data = strings.ReplaceAll(data, ",", "\t") } diff --git a/image/app/plugin/impl/redcap2/query_test.go b/image/app/plugin/impl/redcap2/query_test.go index bf720f1..e77e5b6 100644 --- a/image/app/plugin/impl/redcap2/query_test.go +++ b/image/app/plugin/impl/redcap2/query_test.go @@ -72,9 +72,12 @@ func TestQueryReportModeGeneratesBundle(t *testing.T) { } form := f.lastForm("report") - if form.Get("report_id") != "7" || form.Get("type") != "flat" { + if form.Get("report_id") != "7" { t.Errorf("unexpected report form: %v", form) } + if _, ok := form["type"]; ok { + t.Error("content=report has no type parameter and must not receive one") + } if f.calls("record") != 0 { t.Error("report mode must not call the record endpoint") } @@ -153,7 +156,6 @@ func TestQueryRecordsModeSendsFilters(t *testing.T) { "type": "eav", "csvDelimiter": "tab", "rawOrLabel": "label", - "rawOrLabelHeaders": "label", "fields": "age,name", "forms": "demographics", "events": "baseline_arm_1", @@ -169,6 +171,9 @@ func TestQueryRecordsModeSendsFilters(t *testing.T) { t.Errorf("%s = %q, want %q", key, got, value) } } + if _, ok := form["rawOrLabelHeaders"]; ok { + t.Error("rawOrLabelHeaders must be suppressed for EAV exports (flat CSV only)") + } if f.calls("report") != 0 { t.Error("records mode must not call the report endpoint") } diff --git a/image/app/plugin/impl/redcap2/streams_test.go b/image/app/plugin/impl/redcap2/streams_test.go index 92a8a5f..3ee42da 100644 --- a/image/app/plugin/impl/redcap2/streams_test.go +++ 
b/image/app/plugin/impl/redcap2/streams_test.go @@ -102,6 +102,14 @@ func TestStreamsServesQueryBundleFromCache(t *testing.T) { if manifest["generated_at"] != "2026-06-11T00:00:00Z" { t.Errorf("generated_at = %v, want propagated value", manifest["generated_at"]) } + project, ok := manifest["project"].(map[string]interface{}) + if !ok || project["title"] != "Demo" { + t.Errorf("manifest project identity = %v, want title Demo", manifest["project"]) + } + audit, ok := manifest["anonymization_audit"].([]interface{}) + if !ok || len(audit) != 2 { + t.Errorf("anonymization_audit = %v, want 2 entries (name, email)", manifest["anonymization_audit"]) + } // The bundle built during Query must be reused by Streams (single build). for _, content := range []string{"report", "metadata", "project", "version"} { diff --git a/redcap.md b/redcap.md index 78d5741..bf3b3a7 100644 --- a/redcap.md +++ b/redcap.md @@ -6,6 +6,7 @@ - [Summary](#summary) - [Current Implementation Status (2026-03-10)](#current-implementation-status-2026-03-10) +- [Review, Research, And Decisions (2026-06-11)](#review-research-and-decisions-2026-06-11) - [Export Mode Design](#export-mode-design) - [Target User Flow](#target-user-flow) - [Syncable File Model](#syncable-file-model) @@ -91,10 +92,62 @@ Key point: manual export/save was required in the old `redcap` plugin because it ### Not Implemented Yet -1. XML data export. -2. Advanced de-identification modes beyond `blank` (drop/mask/pseudonymize/encrypt). -3. DDI-CDI/Croissant/RO-Crate metadata exporters. -4. Attachment/file-field download modes. +1. Advanced de-identification modes beyond `blank` (`drop`, HMAC `pseudonymize`) — Phase 4. Reversible encryption is **out of scope** (decision 2026-06-11). +2. DDI-CDI/Croissant/RO-Crate metadata exporters — Phase 5 (all three in one phase, from one normalized model; decision 2026-06-11). +3. 
Attachment/file-field download — **deferred**; file-upload fields are documented in the manifest instead (decision 2026-06-11). +4. XML data export (note: `content=report` also accepts `format=odm`; a metadata-only `content=project_xml` sidecar is planned in Phase 5). + +[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Review, Research, And Decisions](#review-research-and-decisions-2026-06-11) + +--- + +## Review, Research, And Decisions (2026-06-11) + +A full review of the branch against main plus web research on the REDCap API (triangulated from PHPCap, REDCap.jl, PyCap, REDCapR sources and university changelog mirrors), the integration landscape, and metadata standards produced these findings and decisions. + +### Verified REDCap API facts + +1. **De-identification is enforced server-side by the token user's Data Export Rights** (No Access `0` / Full `1` / De-Identified `2` / Remove All Identifier Fields `3`, per instrument). "De-Identified" strips tagged identifiers + unvalidated text + notes fields, hashes the record ID, and **removes** date fields on API exports (date *shifting* is an interactive-export option only, not available via API). Token rights are therefore the institutional de-id baseline; the plugin's per-variable transforms are a second layer. +2. `content=report` accepts `report_id`, `format` (`csv`/`json`/`xml`/`odm`), `rawOrLabel`, `rawOrLabelHeaders`, `exportCheckboxLabel`, `csvDelimiter`, `decimalCharacter` — and **no `type` parameter** (reports are always flat) and no record filters. Our parameter routing was already correct; the UI offered a Record type toggle in report mode that had no effect (fixed in Phase 3.9). +3. **`rawOrLabel=both` is not a real parameter** of REDCap 13.x–15.x (a PyCap docstring fossil; PHPCap/REDCap.jl/REDCapR/.NET all validate `raw|label`). Removed from the UI. +4. `rawOrLabelHeaders` applies to CSV flat exports only. 
`csvDelimiter` applies to CSV only (also accepts `;`, `|`, `^` besides comma/tab). +5. EAV exports have columns `record, [redcap_event_name,] field_name, value` — field names appear as *row values*, not headers. Checkbox fields in flat exports expand to `field___code` columns. Both break naive header-name blanking (fixed in Phase 3.9). +6. `content=project_xml&returnMetadataOnly=true` returns complete project metadata (CDISC ODM 1.3.1, incl. value labels) in one call — the recognized archival gold standard. Never export it *with* data (would bypass blanking). +7. Attachments: `content=file` exports exactly one file per call (record × field × event × instance); no batch endpoint. Confirms deferral. +8. Useful newer parameters to consider later: `exportBlankForGrayFormStatus` (13.x), `combineCheckboxOptions` (15.6.0+, collapses checkbox expansion), `decimalCharacter`, `exportCheckboxLabel`. Project info exports `project_pi_email` since 15.5.20. + +### Landscape + +1. rdm-integration is the **only REDCap→Dataverse integration referenced in the official Dataverse guides**; no competing maintained tool exists (Fiocruz effort unreleased; datalad-redcap is a stalled prototype that validates our idempotent-hash model). +2. Closest design relative: **Yale YES3 Exporter** (export-specific dictionaries with per-field distributions, de-id conditioning incl. REDCap-compatible date shifting + record-ID hashing, versioned export specs, audit trail) — the model for our manifest/audit evolution. +3. Nobody has published a REDCap→DDI-CDI or REDCap→Croissant mapping — genuine novelty space. +4. Manifest best practice (union of YES3/REDCapExporter/datalad-redcap): REDCap version, project id/title, exporting user + export-rights level, exact API parameters, de-id transforms applied, per-file checksums. + +### Metadata standards + +1. **Croissant**: 1.0 (2024-03), 1.1 (2026-01). 
**Dataverse 6.10 (2026-03) ships a built-in Croissant exporter**, but it cannot recover variable labels/value lists from CSVs — a deposited `croissant.json` built from the REDCap dictionary complements it exactly. Validate with `mlcroissant`. Consumed today by Google Dataset Search, NeurIPS (required), HF/Kaggle/OpenML. +2. **RO-Crate**: 1.2 (2025-06) formalizes detached crates; **Process Run Crate** profile fits export-run provenance; Dataverse has a beta previewer that renders a deposited `ro-crate-metadata.json`; KU Leuven already maintains gdcc/exporter-ro-crate. +3. **DDI-CDI**: 1.0 final (2025-01), JSON-LD encoding, no production consumers yet — but LIBIS maintains cdi-viewer (SHACL validation) and this repo already has a DDI-CDI pipeline; strategic in-house value remains high. + +### Decisions (2026-06-11) + +1. **Default de-id policy**: blank identifier-tagged fields by default (current auto-detection), layered on token export-rights as baseline. `drop`/HMAC `pseudonymize` become opt-in per field in Phase 4. +2. **Reversible encryption**: out of scope (irreversible transforms only). +3. **Metadata exporters**: all three (Croissant + RO-Crate + DDI-CDI) in a single phase from one normalized metadata model, generated during export as bundle virtual files (selectable in the compare tree). +4. **Attachments**: deferred; the manifest documents the project's file-upload fields as not-exported references. + +### Review findings driving Phase 3.9 + +Both repos build and pass all tests; architecture and `pluginOptions` job-lifecycle wiring (compare → Redis job → worker → Streams) verified sound. The gaps are all in the de-identification path, where silent failure is unacceptable: + +1. **P1 — Blanking silently no-ops in EAV mode** (field names are row values in the `field_name` column, not headers). Applies to CSV and JSON EAV. +2. **P1 — Blanking silently no-ops with label headers** (`rawOrLabelHeaders=label` makes headers labels; rules carry field names). +3. 
**P1 — Checkbox fields leak**: flat exports expand to `field___code` columns that name-equality matching misses; identifier auto-detection misses them too. +4. **P1 — Frontend: variables never load on the first-time report flow** (no trigger on report-ID entry, no reload button) — identifier auto-blanking silently skipped on exactly the first-use path. +5. **P2 — `metadata.csv` near-empty in EAV/label-header modes** (filtered by data headers, which are not field names in those modes). +6. **P2 — UI offered nonexistent API options** (`rawOrLabel=both`; Record type in report mode). +7. **P3 — Bundle cache** holds full exports in RAM with TTL-only eviction (size cap added); compare→store TTL gap can rebuild against changed data (next compare detects it — documented behavior). +8. **P3 — Manifest** lacked export-rights context, project identity, attachment documentation, and per-rule anonymization audit. [↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Export Mode Design](#export-mode-design) @@ -219,18 +272,20 @@ Design requirements: 1. `exportMode`: `report` or `records` 2. `dataFormat`: `csv` or `json` -3. `recordType`: `flat` or `eav` -4. `csvDelimiter`: comma or tab -5. `rawOrLabel`: `raw`, `label`, or `both` -6. `rawOrLabelHeaders`: `raw` or `label` -7. `variables[]` with anonymization mode: `none` or `blank` +3. `csvDelimiter`: comma or tab (CSV format only; REDCap also accepts `;`, `|`, `^` — not yet exposed) +4. `rawOrLabel`: `raw` or `label` (`both` removed — not a real REDCap API parameter) +5. `rawOrLabelHeaders`: `raw` or `label` (CSV flat exports only) +6. `variables[]` with anonymization mode: `none` or `blank` ### Report Mode Only -8. `reportId` (required — entered manually; REDCap API has no report-listing endpoint) +7. `reportId` (required — entered manually; REDCap API has no report-listing endpoint) + +Note: report exports are always flat — `content=report` has no `type` parameter. ### Records Mode Only +8. 
`recordType`: `flat` or `eav` (records mode only; blanking is EAV-aware) 9. `fields` 10. `forms` 11. `events` @@ -239,22 +294,16 @@ Design requirements: 14. `dateRangeBegin` 15. `dateRangeEnd` 16. `exportSurveyFields`: include survey identifier and timestamp fields (default `false`) -17. `exportDataAccessGroups`: include Data Access Group field (default `false`) - -### Planned Controls +17. `exportDataAccessGroups`: include Data Access Group field (default `false`; REDCap only honors it when the project has DAGs and the API user is not in a DAG) -1. XML output support +### Possible Future Controls -### Attachment Controls +1. `exportCheckboxLabel`, `decimalCharacter`, `exportBlankForGrayFormStatus`, `combineCheckboxOptions` (15.6.0+) +2. Additional CSV delimiters (`;`, `|`, `^`) -1. `include_attachments`: default `false` -2. `attachments_mode`: `reference-only` or `download` -3. `attachments_max_size_mb` +### Attachments (Decision 2026-06-11: Deferred) -Rationale: - -1. For many projects, upload/file fields should remain references in MVP. -2. Full attachment download can be expensive and should be explicit. +File-upload fields are **not downloaded**. The manifest lists the project's file-upload fields (detected from `field_type=file` in the dictionary) so deposits document that binaries exist in REDCap but were deliberately not exported. Rationale: the API allows only one file per call (expensive at scale), and attachment content (consent scans, images) is the most identifying material in a project — none of the tabular de-identification machinery can inspect it. Any future download mode must be opt-in, size-capped, and flagged as not de-identified. [↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ REDCap Built-In De-Identification Parameters](#redcap-built-in-de-identification-parameters) @@ -342,48 +391,32 @@ Both changes are backward-compatible with the existing payload structure. 
## De-Identification And Encryption -### Policy Model - -De-identification should be policy-driven, not ad-hoc. The built-in REDCap parameters described [above](#redcap-built-in-de-identification-parameters) should be used as the first layer (server-side stripping), with our policy model applied as a second layer (client-side transforms). +### Layered Model (Decisions 2026-06-11) -Suggested policy file (`redcap2-policy.json`): +De-identification is policy-driven and layered: -1. `drop_fields`: remove columns entirely -2. `blank_fields`: keep column but replace all values with empty values -3. `mask_rules`: regex or function-based transforms -4. `pseudonymize_fields`: deterministic irreversible tokenization -5. `encrypt_fields`: reversible encryption +1. **Layer 0 — Token export rights (server-side, strongest).** REDCap enforces the token user's Data Export Rights on every API export: "De-Identified" strips tagged identifiers + free-text + notes, hashes the record ID, and removes dates; "Remove All Identifier Fields" strips tagged fields. Institutions should issue de-identified-rights tokens where possible. The plugin records the effective stripping in the manifest (dictionary-vs-export column diff). +2. **Layer 1 — Built-in suppression parameters.** `exportSurveyFields=false`, `exportDataAccessGroups=false` (implemented, default off). +3. **Layer 2 — Per-variable client-side transforms.** `blank` (implemented, EAV/checkbox/label-header aware as of Phase 3.9), `drop` and deterministic HMAC `pseudonymize` (Phase 4). ### Methods -1. **Server-side suppression (NEW — via built-in REDCap parameters)** - - `exportSurveyFields=false`: suppress survey identifier and timestamp fields - - `exportDataAccessGroups=false`: suppress data access group field - - safest option — data never leaves REDCap -2. **Drop** - - safest client-side option for direct identifiers -3. **Blank** - - preserves schema, no values - - can be auto-applied to REDCap identifier-tagged fields -4. 
**Deterministic pseudonymization (non-reversible)** - - e.g. HMAC-based token with secret key - - consistent per value, not reversible -5. **Reversible encryption** - - only if strictly required - - requires key management, key rotation, audit policy, and strict access controls - -Important: - -1. "Anonymized and reversible" is not anonymous in strict privacy sense. -2. If reversibility is needed, call it pseudonymization/encryption and treat it as sensitive. - -### Recommended Defaults - -1. Use server-side suppression (`exportSurveyFields=false`, `exportDataAccessGroups=false`) as the baseline. -2. Auto-blank REDCap identifier-tagged fields (detected from metadata) by default; allow user override. -3. Default to `blank` or `drop` for any remaining known identifiers. -4. Make reversible encryption opt-in and disabled by default. -5. Store no raw keys in job payloads or logs. +1. **Blank** (default for identifier-tagged fields) + - preserves schema, no values; auto-applied to REDCap identifier-tagged fields, user can override +2. **Drop** (Phase 4) + - removes the column entirely; safest for direct identifiers when schema preservation is not needed +3. **Deterministic pseudonymization (Phase 4, non-reversible)** + - HMAC-SHA256 token with a secret key; consistent per value, not reversible +4. **Reversible encryption — OUT OF SCOPE** (decision 2026-06-11) + - "anonymized and reversible" is not anonymous; if linkability is needed, use deterministic pseudonymization and treat the key as sensitive + +### Defaults + +1. Token export-rights as institutional baseline (documented, recorded in manifest). +2. Server-side suppression toggles default off (survey fields, DAGs excluded by default). +3. Auto-blank REDCap identifier-tagged fields (detected from metadata) by default; user can override. +4. Flag unvalidated text/notes fields as PHI-risk in the variables UI (Phase 4) — free text is the recognized weak point. +5. 
Store no raw keys or values in job payloads or logs; every transform is recorded in the manifest's anonymization audit. [↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Metadata Outputs](#metadata-outputs) @@ -391,43 +424,26 @@ Important: ## Metadata Outputs -Requested targets: - -1. DDI-CDI -2. Croissant (including CDIF profile compatibility target) -3. RO-Crate - -### Recommended Strategy +Decision (2026-06-11): **all three exporters in a single phase**, generated from one normalized metadata model **during export** as virtual files in the same bundle (deterministic, cacheable, individually selectable in the compare tree). -Use one internal normalized metadata model, then fan out to exporters. +Targets and their research-validated value: -Normalized model should include: +1. **Croissant 1.0** (`croissant.json`) — highest external value, lowest effort. Complements Dataverse 6.10's built-in Croissant exporter, which cannot recover variable labels/value lists from CSVs. RecordSet/Field model maps near-1:1 from the REDCap dictionary (field→Field, label→description, choices→`sc:Enumeration` RecordSets, validation→dataType); FileObjects carry the bundle's MD5 hashes. Validate with `mlcroissant` in CI. Target 1.0 for Dataverse-ecosystem consistency; 1.1/RAI later. +2. **RO-Crate 1.2** (`ro-crate-metadata.json`) — packaging + provenance. Use the **Process Run Crate** profile to describe the export run (REDCap instance/project/version, export parameters, tool version, timestamp). Rendered by the Dataverse beta previewer; aligns with KU Leuven's gdcc/exporter-ro-crate work. Plain schema.org JSON-LD — writable from Go without a library. +3. 
**DDI-CDI 1.0** (`ddi-cdi.jsonld`) — highest fidelity (variable cascade, substantive/sentinel value domains for missing codes, wide-table structure), strategic for KU Leuven (existing in-repo DDI-CDI pipeline + LIBIS cdi-viewer for SHACL validation), and genuine novelty (no published REDCap→DDI-CDI mapping exists). Reuse existing ddi-cdi helpers where practical, enriched with dictionary labels/value domains the generic CSV profiler cannot infer. -1. project-level metadata -2. table/file-level metadata -3. variable-level metadata -4. code lists/value labels -5. provenance (source report/mode, options, timestamp) +Normalized model (one struct, three emitters): -Then: +1. project-level metadata (`project_info` + dictionary) +2. table/file-level metadata (bundle files, hashes, sizes, delimiters) +3. variable-level metadata (names, labels, types, validation, checkbox expansions) +4. code lists/value labels (`select_choices_or_calculations`) +5. provenance (source mode/report, options, timestamps, anonymization audit, REDCap version) -1. emit `*.jsonld` for DDI-CDI -2. emit `croissant.json` (or JSON-LD form as needed by tooling) -3. emit `ro-crate-metadata.json` +Additional Phase 5 items: -### Integration with Existing DDI-CDI Stack - -Option A: - -1. Generate CSV + metadata sidecars in `redcap2` -2. Use existing DDI-CDI generation pipeline on resulting tabular files - -Option B: - -1. Add a direct REDCap->DDI-CDI generator path -2. Reuse helper code from existing `ddi-cdi` components where practical - -MVP recommendation: Option A. +1. Implement the plugin `Metadata()` hook (registry already supports it; github/gitlab set the precedent) to prefill Dataverse citation metadata from project info (title, PI name, `project_pi_email` on 15.5.20+, notes). +2. Optional metadata-only CDISC ODM sidecar via `content=project_xml&returnMetadataOnly=true` (`project_metadata.xml`) — one API call, archival gold standard. Never with data (would bypass blanking). 
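
The code-list part of the normalized model can be sketched in Go. This is an illustrative sketch only, not the plugin's actual model: `Variable`, `CodeListEntry`, and `parseChoices` are hypothetical names, and the sketch assumes the dictionary's `select_choices_or_calculations` column uses the `code, label | code, label` layout for categorical fields.

```go
package main

import (
	"fmt"
	"strings"
)

// CodeListEntry is one raw-code/label pair of a categorical field.
// Hypothetical type, introduced for illustration.
type CodeListEntry struct {
	Code  string
	Label string
}

// Variable is a hypothetical slice of the normalized metadata model:
// just enough for an emitter to describe one dictionary field.
type Variable struct {
	Name     string
	Label    string
	Type     string
	CodeList []CodeListEntry
}

// parseChoices splits a choices string such as "1, Yes | 2, No" into
// code/label pairs. Only the first comma separates code from label, so
// labels may contain commas. Entries without a comma or with an empty
// code are skipped (an assumption of this sketch).
func parseChoices(raw string) []CodeListEntry {
	var entries []CodeListEntry
	for _, part := range strings.Split(raw, "|") {
		code, label, found := strings.Cut(part, ",")
		if !found {
			continue
		}
		code = strings.TrimSpace(code)
		if code == "" {
			continue
		}
		entries = append(entries, CodeListEntry{Code: code, Label: strings.TrimSpace(label)})
	}
	return entries
}

func main() {
	v := Variable{
		Name:     "phones",
		Label:    "Phone Types",
		Type:     "checkbox",
		CodeList: parseChoices("1, Mobile | 2, Home, landline"),
	}
	// For checkbox fields each code corresponds to a field___code column.
	for _, e := range v.CodeList {
		fmt.Printf("%s___%s => %q\n", v.Name, e.Code, e.Label)
	}
}
```

A parser like this would only be applied to categorical field types (dropdown, radio, checkbox), since the same dictionary column holds a formula for calculated fields.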
[↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Architecture In rdm-integration](#architecture-in-rdm-integration) @@ -529,34 +545,49 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). `plu 7. ~~Auto-detect identifier-tagged fields from metadata and pre-blank them.~~ 8. ~~Add unit tests for each parameter combination.~~ +### Phase 3.9: De-Id Correctness And API Fidelity [In Progress — 2026-06-11] + +Fixes the review findings before new features (see [Review, Research, And Decisions](#review-research-and-decisions-2026-06-11)): + +1. EAV-aware blanking: blank the `value` cell of rows whose `field_name` matches a blanked field (CSV and JSON EAV). +2. Checkbox-aware matching: a blank rule for `field` also matches expanded `field___code` columns. +3. Label-header support: when `rawOrLabelHeaders=label`, translate headers back to field names via the dictionary (incl. `Label (choice=...)` checkbox headers) before blanking and metadata filtering. +4. Anonymization audit in the manifest: per-rule match counts; warnings for rules that matched nothing. +5. Correct `metadata.csv` filtering per mode (EAV field_name values; label-header translation; checkbox base names). +6. Frontend: load variables on report-ID entry (blur) + explicit Reload button; concurrent-load guard. +7. Remove `rawOrLabel=both` (not a real API parameter); record type control restricted to records mode; stop sending `type` to `content=report`; send `csvDelimiter`/`rawOrLabelHeaders` only when applicable. +8. Manifest enrichment: project id/title, file-upload-field documentation (attachments decision), dictionary-vs-export column diff (reveals token-rights stripping). +9. Bundle cache size cap (bound PII residency in RAM). + ### Phase 4: De-Identification Engine [Next] -1. Add policy schema and validation. -2. Implement field-level transforms (drop/blank/mask/pseudonymize). -3. Add optional reversible encryption with key-provider abstraction. -4. 
Add audit/provenance output listing transformed fields and method. -5. Add strict safeguards: - - no key logging - - no raw-value logging - - secure defaults +1. Add `drop` and deterministic HMAC-SHA256 `pseudonymize` per-variable modes (policy schema + validation). +2. Flag unvalidated text/notes fields as PHI-risk in the variables table (field types from the dictionary). +3. Surface token export-rights context (dictionary-vs-export diff) in the UI, not just the manifest. +4. Extend the anonymization audit (method, key id — never key material). +5. Strict safeguards: no key logging, no raw-value logging, secure defaults. +6. ~~Reversible encryption~~ — out of scope (decision 2026-06-11). ### Phase 5: Metadata Exporters [Next] -1. Define normalized metadata model. -2. Implement exporter adapters: - - DDI-CDI - - Croissant - - RO-Crate -3. Expose format toggles in UI. -4. Add schema validation tests for each output type. +1. Define normalized metadata model (project, files, variables, code lists, provenance). +2. Implement all three exporter adapters in one phase (decision 2026-06-11): + - `croissant.json` (Croissant 1.0; validate with mlcroissant in CI) + - `ro-crate-metadata.json` (RO-Crate 1.2, Process Run Crate provenance) + - `ddi-cdi.jsonld` (DDI-CDI 1.0; reuse in-repo ddi-cdi helpers; validate with cdi-viewer SHACL shapes) +3. Generate during export as bundle virtual files; expose toggles in UI. +4. Implement plugin `Metadata()` hook for Dataverse citation prefill from project info. +5. Optional `project_metadata.xml` (CDISC ODM, metadata-only). +6. Schema validation tests for each output type. ### Phase 6: Hardening And Rollout [Next] -1. Performance test with large REDCap projects. -2. Security review (keys, logs, PII handling, transport). +1. Performance test with large REDCap projects; configurable HTTP timeout (current client timeout: 5 minutes). +2. Security review (keys, logs, PII handling, transport, cache residency). 3. 
Add operator documentation and troubleshooting. 4. Run pilot with limited users. 5. Keep `redcap` plugin as stable fallback until `redcap2` is proven. +6. Revisit attachments (opt-in, size-capped, flagged as not de-identified) based on pilot feedback. [↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Testing Plan](#testing-plan) @@ -594,13 +625,10 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). `plu 1. ~~Do all target REDCap instances expose report listing?~~ **Resolved:** The standard REDCap API does not expose a report-listing endpoint. Report IDs are entered manually. 2. ~~Should record mode be a separate flow or a toggle?~~ **Resolved:** Implemented as a toggle on the same settings page. -3. Which de-identification policy should be default at KU Leuven: - - drop identifiers - - blank identifiers - - deterministic pseudonymization -4. Are reversible transformations acceptable under institutional policy? -5. Should metadata outputs be generated during sync, after sync, or both? -6. Should attachments be supported in MVP or deferred? +3. ~~Which de-identification policy should be default at KU Leuven?~~ **Resolved (2026-06-11):** Blank identifier-tagged fields by default, layered on token export-rights as the institutional baseline; drop/pseudonymize opt-in per field (Phase 4). +4. ~~Are reversible transformations acceptable under institutional policy?~~ **Resolved (2026-06-11):** Out of scope. Irreversible transforms only (blank/drop/HMAC pseudonymize). +5. ~~Should metadata outputs be generated during sync, after sync, or both?~~ **Resolved (2026-06-11):** During export, as virtual files in the bundle (deterministic, cacheable, selectable in the compare tree). All three exporters in one phase. +6. ~~Should attachments be supported in MVP or deferred?~~ **Resolved (2026-06-11):** Deferred; manifest documents file-upload fields as not-exported references. 
Future download support must be opt-in, size-capped, and flagged as not de-identified. [↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ References](#references) From a2516314470271d13fa78a56f5882ee95671d8c9 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Thu, 11 Jun 2026 11:48:46 +0200 Subject: [PATCH 05/25] fix(make): apply CUSTOMIZATIONS fallback in dev_build like in build dev_build passed $(CUSTOMIZATIONS) to docker build unvalidated; with STAGE=dev the env.dev path (./docker-volumes/integration/conf/customizations) only exists after 'make init', so the frontend-builder COPY failed on a fresh checkout. dev_build now falls back to ./conf/kul_customizations or ./conf/customizations exactly like the build target. --- Makefile | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index e175c70..5d93f4d 100644 --- a/Makefile +++ b/Makefile @@ -292,12 +292,22 @@ dev_build: fmt ## Build Docker image using local frontend (like dev_up but only --prefix=rdm-integration-frontend-$(FRONTEND_VERSION)/ \ $$(if [[ $$(git stash create) ]]; then git stash create; else git rev-parse HEAD; fi) @echo -n "Building Docker image (STAGE=$(STAGE)) using local frontend... " + customizations_path="$(CUSTOMIZATIONS)"; \ + if [ ! 
-d "$$customizations_path" ]; then \ + if [ -d "./conf/kul_customizations" ]; then \ + echo "CUSTOMIZATIONS path '$$customizations_path' not found; falling back to './conf/kul_customizations'"; \ + customizations_path="./conf/kul_customizations"; \ + else \ + echo "CUSTOMIZATIONS path '$$customizations_path' not found; falling back to './conf/customizations'"; \ + customizations_path="./conf/customizations"; \ + fi; \ + fi; \ docker build \ --build-arg USER_ID=$(USER_ID) --build-arg GROUP_ID=$(GROUP_ID) \ --build-arg OAUTH2_POXY_VERSION=$(OAUTH2_POXY_VERSION) --build-arg NODE_VERSION=$(NODE_VERSION) \ --build-arg FRONTEND_VERSION=$(FRONTEND_VERSION) --build-arg FRONTEND_TAR_GZ=$(FRONTEND_VERSION).tar.gz \ --build-arg NODE_ENV=$(NODE_ENV) \ - --build-arg BASE_HREF=$(BUILD_BASE_HREF) --build-arg CUSTOMIZATIONS=$(CUSTOMIZATIONS) \ + --build-arg BASE_HREF=$(BUILD_BASE_HREF) --build-arg CUSTOMIZATIONS=$$customizations_path \ --tag "$(IMAGE_TAG)" --file image/Dockerfile . @echo -n "Cleaning up local frontend archive... " @rm -f $(FRONTEND_VERSION).tar.gz From 09c71710ace8cf6bd02e06ce38880a6fe0cf4e27 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Thu, 11 Jun 2026 23:27:45 +0200 Subject: [PATCH 06/25] redcap2 Phase 4 backend: drop + HMAC pseudonymize transforms, key handling, manifest redaction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Per-variable transforms generalized: blank, drop (columns/rows/keys removed, also from metadata.csv), pseudonymize (hex HMAC-SHA256, researcher-managed base64 key, min 16 bytes, validated with actionable errors; empty cells stay empty). - EAV exports: transforms on the record-ID field now also cover the record linking column (raw IDs no longer survive blanking/pseudonymization); dropping the record-ID field in EAV is rejected with guidance. 
- Manifest: records filter redacted when the record-ID field is transformed, filterLogic redacted when it references transformed fields (both leaked the values the transforms removed); anonymization section reports hmac-sha256 + key fingerprint (SHA-256 of the key, first 16 hex) — never the key; key never logged; client-side drops excluded from dictionary_fields_not_exported. - Cache key includes the pseudonymization key (hashed): different keys yield different bundles. - Variables list: PHI-risk notes for notes/unvalidated-text fields (SelectItem.Note), identifier preselect resolved through checkbox base names; header fetch for the variables list is now always raw/comma (label-header and tab-delimiter exports previously produced rule names that never matched); transform rules now also match checkbox expansion columns by their own name. --- image/app/plugin/impl/redcap2/common.go | 676 +++++++++++++------ image/app/plugin/impl/redcap2/common_test.go | 459 +++++++++++-- image/app/plugin/impl/redcap2/deid_test.go | 68 ++ image/app/plugin/types/select_item.go | 1 + 4 files changed, 932 insertions(+), 272 deletions(-) diff --git a/image/app/plugin/impl/redcap2/common.go b/image/app/plugin/impl/redcap2/common.go index 6dbf6d2..b4c3066 100644 --- a/image/app/plugin/impl/redcap2/common.go +++ b/image/app/plugin/impl/redcap2/common.go @@ -6,8 +6,12 @@ import ( "bufio" "bytes" "context" + "crypto/hmac" "crypto/md5" + "crypto/sha256" + "encoding/base64" "encoding/csv" + "encoding/hex" "encoding/json" "fmt" "integration/app/logging" @@ -15,6 +19,7 @@ import ( "io" "net/http" "net/url" + "regexp" "sort" "strings" "sync" @@ -45,6 +50,7 @@ type pluginOptions struct { ExportSurveyFields bool `json:"exportSurveyFields"` ExportDataAccessGroups bool `json:"exportDataAccessGroups"` Variables []variableOption `json:"variables"` + PseudonymizationKey string `json:"pseudonymizationKey,omitempty"` GeneratedAt string `json:"generatedAt"` } @@ -203,11 +209,16 @@ func 
normalizePluginOptions(opts *pluginOptions) { opts.Forms = normalizeStringSlice(opts.Forms) opts.Events = normalizeStringSlice(opts.Events) opts.Records = normalizeStringSlice(opts.Records) + opts.PseudonymizationKey = strings.TrimSpace(opts.PseudonymizationKey) for i := range opts.Variables { opts.Variables[i].Name = strings.TrimSpace(opts.Variables[i].Name) switch strings.ToLower(strings.TrimSpace(opts.Variables[i].Anonymization)) { case "blank": opts.Variables[i].Anonymization = "blank" + case "drop": + opts.Variables[i].Anonymization = "drop" + case "pseudonymize": + opts.Variables[i].Anonymization = "pseudonymize" default: opts.Variables[i].Anonymization = "none" } @@ -366,17 +377,74 @@ func reportDelimiter(opts pluginOptions) rune { return ',' } -func blankFields(opts pluginOptions) map[string]bool { - res := map[string]bool{} +// minPseudonymizationKeyBytes is the minimum decoded key length accepted for +// HMAC-SHA256 pseudonymization. 16 bytes (128 bit) is the floor; the UI and +// docs recommend 32 bytes (openssl rand -base64 32). +const minPseudonymizationKeyBytes = 16 + +// transformPlan is the validated per-export anonymization policy: which field +// gets which irreversible transform, plus the decoded HMAC key when any field +// is pseudonymized. The raw key never leaves this struct: only the SHA-256 +// fingerprint is reported (manifest/audit), and nothing key-related is logged. +type transformPlan struct { + modes map[string]string // field -> "blank" | "drop" | "pseudonymize" + key []byte + keyFingerprint string // first 16 hex chars of SHA-256(key) +} + +func (p transformPlan) isEmpty() bool { return len(p.modes) == 0 } + +func (p transformPlan) mode(field string) string { return p.modes[field] } + +// transformValue applies a cell-level transform (blank or pseudonymize). +// Empty values stay empty: hashing the empty string would replace genuine +// missingness with a constant that looks like data. 
+func (p transformPlan) transformValue(field, value string) string { + switch p.modes[field] { + case "blank": + return "" + case "pseudonymize": + if value == "" { + return "" + } + mac := hmac.New(sha256.New, p.key) + mac.Write([]byte(value)) + return hex.EncodeToString(mac.Sum(nil)) + } + return value +} + +// buildTransformPlan validates the per-variable anonymization choices and the +// researcher-provided base64 HMAC key (required iff any field is pseudonymized). +func buildTransformPlan(opts pluginOptions) (transformPlan, error) { + plan := transformPlan{modes: map[string]string{}} + usesPseudonymization := false for _, v := range opts.Variables { - if v.Name == "" { + if v.Name == "" || v.Anonymization == "none" || v.Anonymization == "" { continue } - if v.Anonymization == "blank" { - res[v.Name] = true + plan.modes[v.Name] = v.Anonymization + if v.Anonymization == "pseudonymize" { + usesPseudonymization = true } } - return res + if !usesPseudonymization { + return plan, nil + } + if opts.PseudonymizationKey == "" { + return transformPlan{}, fmt.Errorf("pseudonymization requires a base64 key (generate one with: openssl rand -base64 32)") + } + key, err := base64.StdEncoding.DecodeString(opts.PseudonymizationKey) + if err != nil { + return transformPlan{}, fmt.Errorf("pseudonymization key is not valid base64") + } + if len(key) < minPseudonymizationKeyBytes { + return transformPlan{}, fmt.Errorf("pseudonymization key too short: %d bytes decoded, need at least %d (use: openssl rand -base64 32)", len(key), minPseudonymizationKeyBytes) + } + plan.key = key + fingerprint := sha256.Sum256(key) + plan.keyFingerprint = hex.EncodeToString(fingerprint[:])[:16] + return plan, nil } func parseCSV(data []byte, delimiter rune) ([][]string, error) { @@ -400,24 +468,30 @@ func writeCSV(rows [][]string, delimiter rune) ([]byte, error) { return b.Bytes(), nil } -// dictionary holds the parsed data-dictionary information needed for blanking, -// label-header translation, 
metadata filtering, and manifest documentation. +// dictionary holds the parsed data-dictionary information needed for the +// transforms, label-header translation, metadata filtering, identifier +// preselection, PHI-risk flagging, and manifest documentation. type dictionary struct { - fieldOrder []string // field names in dictionary order - fieldType map[string]string // field_name -> field_type - labelFields map[string][]string // field_label -> field names (labels can collide) + fieldOrder []string // field names in dictionary order + fieldType map[string]string // field_name -> field_type + labelFields map[string][]string // field_label -> field names (labels can collide) + identifier map[string]bool // field_name -> tagged as identifier in REDCap + validation map[string]string // field_name -> text validation type ("" = unvalidated) + hasValidation bool // the validation column was present in the dictionary } func parseDictionary(metadataCSV []byte) dictionary { dict := dictionary{ fieldType: map[string]string{}, labelFields: map[string][]string{}, + identifier: map[string]bool{}, + validation: map[string]string{}, } rows, err := parseCSV(metadataCSV, ',') if err != nil || len(rows) == 0 { return dict } - nameIdx, typeIdx, labelIdx := -1, -1, -1 + nameIdx, typeIdx, labelIdx, identifierIdx, validationIdx := -1, -1, -1, -1, -1 for i, col := range rows[0] { switch strings.ToLower(strings.TrimSpace(col)) { case "field_name": @@ -426,11 +500,16 @@ func parseDictionary(metadataCSV []byte) dictionary { typeIdx = i case "field_label": labelIdx = i + case "identifier": + identifierIdx = i + case "text_validation_type_or_show_slider_number": + validationIdx = i } } if nameIdx < 0 { return dict } + dict.hasValidation = validationIdx >= 0 for _, row := range rows[1:] { if nameIdx >= len(row) { continue @@ -449,10 +528,47 @@ func parseDictionary(metadataCSV []byte) dictionary { dict.labelFields[label] = append(dict.labelFields[label], name) } } + if identifierIdx >= 0 && 
identifierIdx < len(row) { + switch strings.ToLower(strings.TrimSpace(row[identifierIdx])) { + case "y", "yes", "1": + dict.identifier[name] = true + } + } + if validationIdx >= 0 && validationIdx < len(row) { + dict.validation[name] = strings.ToLower(strings.TrimSpace(row[validationIdx])) + } } return dict } +// fetchDictionary downloads and parses the project data dictionary. +func fetchDictionary(ctx context.Context, baseURL, token string) (dictionary, error) { + body, err := redcapRequest(ctx, baseURL, baseForm(token, "metadata", "csv")) + if err != nil { + return dictionary{}, err + } + return parseDictionary(body), nil +} + +// phiRiskNote returns a warning for fields whose values can carry free-text +// identifying information even though the field is not identifier-tagged: +// notes fields and unvalidated text fields. REDCap's own de-identified export +// rights strip these field types for the same reason. +func phiRiskNote(dict dictionary, field string) string { + base := baseFieldName(field) + switch dict.fieldType[base] { + case "notes": + return "free-text notes field: may contain identifying information" + case "text": + // Only flag unvalidated text when the dictionary actually carried the + // validation column; otherwise every text field would be flagged. + if dict.hasValidation && dict.validation[base] == "" { + return "unvalidated text field: may contain identifying information" + } + } + return "" +} + // fileUploadFields returns the dictionary fields of type "file" — per-record // attachments that are documented in the manifest but never downloaded. func (d dictionary) fileUploadFields() []string { @@ -474,14 +590,20 @@ func baseFieldName(col string) string { } // resolveHeaderFields maps a data column header to candidate dictionary field -// names. Raw headers resolve via the checkbox base name; label headers are -// translated through the dictionary, including "Label (choice=...)" checkbox -// headers. 
Unknown headers resolve to themselves so that pseudo-columns -// (record, redcap_event_name, redcap_survey_identifier, ...) stay stable. +// names. Raw headers resolve to the column name itself plus the checkbox base +// name, so a transform rule keyed on either "phones___2" or "phones" matches +// the expansion column. Label headers are translated through the dictionary, +// including "Label (choice=...)" checkbox headers. Unknown headers resolve to +// themselves so that pseudo-columns (record, redcap_event_name, +// redcap_survey_identifier, ...) stay stable. func resolveHeaderFields(header string, labelHeaders bool, dict dictionary) []string { header = strings.TrimSpace(header) if !labelHeaders { - return []string{baseFieldName(header)} + base := baseFieldName(header) + if base != header { + return []string{header, base} + } + return []string{header} } label := header if i := strings.Index(header, " (choice="); i > 0 { @@ -493,7 +615,7 @@ func resolveHeaderFields(header string, labelHeaders bool, dict dictionary) []st return []string{baseFieldName(header)} } -// anonymizationAudit records the outcome of one blank rule so that silent +// anonymizationAudit records the outcome of one transform rule so that silent // no-ops are impossible: every requested transform reports how much data it // actually touched. 
type anonymizationAudit struct { @@ -503,100 +625,145 @@ type anonymizationAudit struct { Note string `json:"note,omitempty"` } -func buildAudit(blanks map[string]bool, matched map[string]int, unit string) []anonymizationAudit { - fields := make([]string, 0, len(blanks)) - for field := range blanks { +func transformVerb(mode string) string { + switch mode { + case "drop": + return "dropped" + case "pseudonymize": + return "pseudonymized" + default: + return "blanked" + } +} + +func buildAudit(plan transformPlan, matched map[string]int, unit string, notes map[string]string) []anonymizationAudit { + fields := make([]string, 0, len(plan.modes)) + for field := range plan.modes { fields = append(fields, field) } sort.Strings(fields) audit := make([]anonymizationAudit, 0, len(fields)) for _, field := range fields { - entry := anonymizationAudit{Field: field, Mode: "blank", Matched: matched[field]} + entry := anonymizationAudit{Field: field, Mode: plan.modes[field], Matched: matched[field]} if entry.Matched == 0 { entry.Note = "field not present in export" } else { - entry.Note = fmt.Sprintf("blanked %d %s", entry.Matched, unit) + entry.Note = fmt.Sprintf("%s %d %s", transformVerb(entry.Mode), entry.Matched, unit) + } + if extra := notes[field]; extra != "" { + entry.Note += "; " + extra } audit = append(audit, entry) } return audit } -// blankFlatCSV blanks matching columns of a flat CSV export. A blank rule for -// field f matches columns named f, checkbox expansions f___code, and — when -// headers are labels — columns whose label translates back to f. +// transformFlatCSV applies the anonymization plan to a flat CSV export. A rule +// for field f matches columns named f, checkbox expansions f___code, and — +// when headers are labels — columns whose label translates back to f. 
+// Dropped columns are removed entirely (and excluded from the exported field +// list so their dictionary rows disappear from metadata.csv); blank and +// pseudonymize rewrite cell values in place. // Returns the (possibly rewritten) data, the exported dictionary field names, // and the per-rule audit. -func blankFlatCSV(data []byte, delimiter rune, blanks map[string]bool, labelHeaders bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { +func transformFlatCSV(data []byte, delimiter rune, plan transformPlan, labelHeaders bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { rows, err := parseCSV(data, delimiter) if err != nil { return nil, nil, nil, err } if len(rows) == 0 { - return data, nil, buildAudit(blanks, nil, "columns"), nil + return data, nil, buildAudit(plan, nil, "columns", nil), nil } header := rows[0] exported := make([]string, 0, len(header)) seen := map[string]bool{} matched := map[string]int{} - blankCols := []int{} + cellField := map[int]string{} // column -> field with a cell-level transform + dropCols := map[int]bool{} for i, col := range header { candidates := resolveHeaderFields(col, labelHeaders, dict) + var ruleField string for _, field := range candidates { - if !seen[field] { - seen[field] = true - exported = append(exported, field) + if plan.modes[field] != "" { + ruleField = field + break + } + } + if ruleField != "" { + matched[ruleField]++ + if plan.modes[ruleField] == "drop" { + dropCols[i] = true + continue // dropped fields are no longer part of the export } + cellField[i] = ruleField } for _, field := range candidates { - if blanks[field] { - blankCols = append(blankCols, i) - matched[field]++ - break + if !seen[field] { + seen[field] = true + exported = append(exported, field) } } } - audit := buildAudit(blanks, matched, "columns") - if len(blankCols) == 0 { + audit := buildAudit(plan, matched, "columns", nil) + if len(cellField) == 0 && len(dropCols) == 0 { return data, exported, audit, 
nil } - for rowIdx := 1; rowIdx < len(rows); rowIdx++ { - for _, colIdx := range blankCols { - if colIdx < len(rows[rowIdx]) { - rows[rowIdx][colIdx] = "" + out := make([][]string, 0, len(rows)) + for rowIdx, row := range rows { + newRow := make([]string, 0, len(row)) + for colIdx, cell := range row { + if dropCols[colIdx] { + continue + } + if rowIdx > 0 { + if field, ok := cellField[colIdx]; ok { + cell = plan.transformValue(field, cell) + } } + newRow = append(newRow, cell) } + out = append(out, newRow) } - out, err := writeCSV(rows, delimiter) + encoded, err := writeCSV(out, delimiter) if err != nil { return nil, nil, nil, err } - return out, exported, audit, nil + return encoded, exported, audit, nil } -// blankEAVCSV blanks the value cells of EAV-shaped CSV exports +// transformEAVCSV applies the anonymization plan to EAV-shaped CSV exports // (record, [redcap_event_name,] field_name, value): rows whose field_name -// matches a blanked field get an empty value. Falls back to flat handling if -// the EAV columns cannot be located. -func blankEAVCSV(data []byte, delimiter rune, blanks map[string]bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { +// matches a blank/pseudonymize rule get their value cell rewritten, rows +// matching a drop rule are removed. A transform on the record-ID field (the +// first dictionary field) is additionally applied to the "record" column of +// every row — otherwise raw record identifiers would survive in the linking +// column. Falls back to flat handling if the EAV columns cannot be located. 
+func transformEAVCSV(data []byte, delimiter rune, plan transformPlan, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { rows, err := parseCSV(data, delimiter) if err != nil { return nil, nil, nil, err } if len(rows) == 0 { - return data, nil, buildAudit(blanks, nil, "rows"), nil + return data, nil, buildAudit(plan, nil, "rows", nil), nil } - fieldIdx, valueIdx := -1, -1 + fieldIdx, valueIdx, recordIdx := -1, -1, -1 for i, col := range rows[0] { switch strings.ToLower(strings.TrimSpace(col)) { case "field_name": fieldIdx = i case "value": valueIdx = i + case "record": + recordIdx = i } } if fieldIdx < 0 || valueIdx < 0 { - return blankFlatCSV(data, delimiter, blanks, false, dict) + return transformFlatCSV(data, delimiter, plan, false, dict) + } + recordField := recordIDField(dict) + recordMode := "" + if recordField != "" && recordIdx >= 0 { + recordMode = plan.modes[recordField] } exported := eavExportedFields(dict) seen := map[string]bool{} @@ -604,37 +771,49 @@ func blankEAVCSV(data []byte, delimiter rune, blanks map[string]bool, dict dicti seen[field] = true } matched := map[string]int{} + notes := map[string]string{} changed := false + out := make([][]string, 0, len(rows)) + out = append(out, rows[0]) for rowIdx := 1; rowIdx < len(rows); rowIdx++ { row := rows[rowIdx] - if fieldIdx >= len(row) { - continue - } - field := baseFieldName(strings.TrimSpace(row[fieldIdx])) - if field == "" { - continue + field := "" + if fieldIdx < len(row) { + field = baseFieldName(strings.TrimSpace(row[fieldIdx])) } - if !seen[field] { + if field != "" && !seen[field] { seen[field] = true exported = append(exported, field) } - if blanks[field] && valueIdx < len(row) { - if row[valueIdx] != "" { - row[valueIdx] = "" - changed = true - } + if field != "" && plan.modes[field] == "drop" { + matched[field]++ + changed = true + continue + } + if field != "" && plan.modes[field] != "" && valueIdx < len(row) { + row[valueIdx] = plan.transformValue(field, row[valueIdx]) 
matched[field]++ + changed = true + } + if recordMode != "" && recordMode != "drop" && recordIdx < len(row) && row[recordIdx] != "" { + row[recordIdx] = plan.transformValue(recordField, row[recordIdx]) + changed = true + notes[recordField] = "also applied to the EAV record column" } + out = append(out, row) } - audit := buildAudit(blanks, matched, "rows") + // Dropped fields are removed from the export, so their dictionary rows + // must not survive in metadata.csv. + exported = withoutDroppedFields(exported, plan) + audit := buildAudit(plan, matched, "rows", notes) if !changed { return data, exported, audit, nil } - out, err := writeCSV(rows, delimiter) + encoded, err := writeCSV(out, delimiter) if err != nil { return nil, nil, nil, err } - return out, exported, audit, nil + return encoded, exported, audit, nil } // eavExportedFields seeds the exported-field set for EAV outputs with the @@ -648,15 +827,51 @@ func eavExportedFields(dict dictionary) []string { return []string{} } -// blankFlatJSON blanks matching keys of flat JSON exports. JSON exports always -// use raw field names as keys, so only checkbox base-name matching applies. -func blankFlatJSON(data []byte, blanks map[string]bool) ([]byte, []string, []anonymizationAudit, error) { +// recordIDField returns the project's record identifier field: REDCap defines +// it as the first field of the data dictionary. +func recordIDField(dict dictionary) string { + if len(dict.fieldOrder) > 0 { + return dict.fieldOrder[0] + } + return "" +} + +// withoutDroppedFields removes fields with a drop rule from an exported-field +// list, so that dropped fields also disappear from the filtered metadata.csv. +func withoutDroppedFields(fields []string, plan transformPlan) []string { + out := make([]string, 0, len(fields)) + for _, field := range fields { + if plan.modes[field] == "drop" { + continue + } + out = append(out, field) + } + return out +} + +// jsonValueString renders a JSON cell value for transformation. 
REDCap exports +// values as strings, but numbers are normalized defensively. +func jsonValueString(v interface{}) string { + switch s := v.(type) { + case nil: + return "" + case string: + return s + default: + return fmt.Sprintf("%v", s) + } +} + +// transformFlatJSON applies the anonymization plan to flat JSON exports. JSON +// exports always use raw field names as keys, so only checkbox base-name +// matching applies. Dropped keys are removed from every row. +func transformFlatJSON(data []byte, plan transformPlan) ([]byte, []string, []anonymizationAudit, error) { rows := make([]map[string]interface{}, 0) if err := json.Unmarshal(data, &rows); err != nil { return nil, nil, nil, err } keys := map[string]bool{} - matchedKeys := map[string]string{} // key -> blanked field + matchedKeys := map[string]string{} // key -> field with a transform rule for _, row := range rows { for k := range row { keys[k] = true @@ -666,28 +881,43 @@ func blankFlatJSON(data []byte, blanks map[string]bool) ([]byte, []string, []ano exported := []string{} for k := range keys { field := baseFieldName(k) + // A rule keyed on the expansion column itself wins over the base field. 
+ ruleField := "" + if plan.modes[k] != "" { + ruleField = k + } else if plan.modes[field] != "" { + ruleField = field + } + if ruleField != "" { + matchedKeys[k] = ruleField + } + if plan.modes[ruleField] == "drop" { + continue + } if !exportedSet[field] { exportedSet[field] = true exported = append(exported, field) } - if blanks[field] { - matchedKeys[k] = field - } } sort.Strings(exported) matched := map[string]int{} for _, field := range matchedKeys { matched[field]++ } - audit := buildAudit(blanks, matched, "columns") + audit := buildAudit(plan, matched, "columns", nil) if len(matchedKeys) == 0 { return data, exported, audit, nil } for _, row := range rows { - for k := range matchedKeys { - if _, ok := row[k]; ok { - row[k] = "" + for k, field := range matchedKeys { + if _, ok := row[k]; !ok { + continue + } + if plan.modes[field] == "drop" { + delete(row, k) + continue } + row[k] = plan.transformValue(field, jsonValueString(row[k])) } } out, err := json.Marshal(rows) @@ -697,9 +927,11 @@ func blankFlatJSON(data []byte, blanks map[string]bool) ([]byte, []string, []ano return out, exported, audit, nil } -// blankEAVJSON blanks the "value" of EAV JSON rows whose "field_name" matches -// a blanked field. Falls back to flat handling when rows are not EAV-shaped. -func blankEAVJSON(data []byte, blanks map[string]bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { +// transformEAVJSON applies the anonymization plan to EAV JSON rows. Like the +// CSV variant, a transform on the record-ID field is also applied to the +// "record" key of every row. Falls back to flat handling when rows are not +// EAV-shaped. 
+func transformEAVJSON(data []byte, plan transformPlan, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { rows := make([]map[string]interface{}, 0) if err := json.Unmarshal(data, &rows); err != nil { return nil, nil, nil, err @@ -712,7 +944,12 @@ func blankEAVJSON(data []byte, blanks map[string]bool, dict dictionary) ([]byte, } } if !isEAVShaped { - return blankFlatJSON(data, blanks) + return transformFlatJSON(data, plan) + } + recordField := recordIDField(dict) + recordMode := "" + if recordField != "" { + recordMode = plan.modes[recordField] } exported := eavExportedFields(dict) seen := map[string]bool{} @@ -720,49 +957,71 @@ func blankEAVJSON(data []byte, blanks map[string]bool, dict dictionary) ([]byte, seen[field] = true } matched := map[string]int{} + notes := map[string]string{} changed := false + out := make([]map[string]interface{}, 0, len(rows)) for _, row := range rows { name, _ := row["field_name"].(string) field := baseFieldName(strings.TrimSpace(name)) - if field == "" { - continue - } - if !seen[field] { + if field != "" && !seen[field] { seen[field] = true exported = append(exported, field) } - if blanks[field] { + if field != "" && plan.modes[field] == "drop" { + matched[field]++ + changed = true + continue + } + if field != "" && plan.modes[field] != "" { if _, ok := row["value"]; ok { - row["value"] = "" + row["value"] = plan.transformValue(field, jsonValueString(row["value"])) changed = true } matched[field]++ } + if recordMode != "" && recordMode != "drop" { + if rec, ok := row["record"]; ok { + if recStr := jsonValueString(rec); recStr != "" { + row["record"] = plan.transformValue(recordField, recStr) + changed = true + notes[recordField] = "also applied to the EAV record column" + } + } + } + out = append(out, row) } - audit := buildAudit(blanks, matched, "rows") + exported = withoutDroppedFields(exported, plan) + audit := buildAudit(plan, matched, "rows", notes) if !changed { return data, exported, audit, nil } - out, err := 
json.Marshal(rows) + encoded, err := json.Marshal(out) if err != nil { return nil, nil, nil, err } - return out, exported, audit, nil + return encoded, exported, audit, nil } // processExportData routes the raw API payload through the mode-appropriate -// blanking implementation and reports the exported dictionary fields plus the -// anonymization audit. -func processExportData(data []byte, opts pluginOptions, blanks map[string]bool, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { +// transform implementation and reports the exported dictionary fields plus the +// anonymization audit. Dropping the record-ID field is rejected for EAV +// exports: the record column cannot be removed without destroying the EAV +// structure, and leaving it would silently keep the identifiers. +func processExportData(data []byte, opts pluginOptions, plan transformPlan, dict dictionary) ([]byte, []string, []anonymizationAudit, error) { + if isEAV(opts) { + if field := recordIDField(dict); field != "" && plan.modes[field] == "drop" { + return nil, nil, nil, fmt.Errorf("dropping the record id field (%s) is not supported for EAV exports: use pseudonymize or blank instead", field) + } + } switch { case opts.DataFormat == "json" && isEAV(opts): - return blankEAVJSON(data, blanks, dict) + return transformEAVJSON(data, plan, dict) case opts.DataFormat == "json": - return blankFlatJSON(data, blanks) + return transformFlatJSON(data, plan) case isEAV(opts): - return blankEAVCSV(data, reportDelimiter(opts), blanks, dict) + return transformEAVCSV(data, reportDelimiter(opts), plan, dict) default: - return blankFlatCSV(data, reportDelimiter(opts), blanks, headersAreLabels(opts), dict) + return transformFlatCSV(data, reportDelimiter(opts), plan, headersAreLabels(opts), dict) } } @@ -827,44 +1086,11 @@ func redcapRequestHeaderOnly(ctx context.Context, baseURL string, form url.Value return record, nil } -func fallbackFieldsFromMetadata(ctx context.Context, baseURL, token string) 
([]string, error) { - form := baseForm(token, "metadata", "csv") - body, err := redcapRequest(ctx, baseURL, form) - if err != nil { - return nil, err - } - rows, err := parseCSV(body, ',') - if err != nil || len(rows) == 0 { - return nil, err - } - fieldIdx := -1 - for i, col := range rows[0] { - if strings.EqualFold(strings.TrimSpace(col), "field_name") { - fieldIdx = i - break - } - } - if fieldIdx < 0 { - return nil, nil - } - res := make([]string, 0, len(rows)-1) - seen := map[string]bool{} - for _, row := range rows[1:] { - if fieldIdx >= len(row) { - continue - } - field := strings.TrimSpace(row[fieldIdx]) - if field == "" || seen[field] { - continue - } - seen[field] = true - res = append(res, field) - } - sort.Strings(res) - return res, nil -} - -func deduplicatedSelectItems(fields []string) []types.SelectItem { +// variableSelectItems builds a sorted, deduplicated list of SelectItem values +// for the anonymization table. Identifier-tagged fields (resolved through the +// checkbox base name) are returned with Selected=true, signalling the frontend +// to auto-blank them; free-text fields carry a PHI-risk note. +func variableSelectItems(fields []string, dict dictionary) []types.SelectItem { seen := make(map[string]bool, len(fields)) unique := make([]string, 0, len(fields)) for _, f := range fields { @@ -877,106 +1103,48 @@ func deduplicatedSelectItems(fields []string) []types.SelectItem { sort.Strings(unique) out := make([]types.SelectItem, 0, len(unique)) for _, field := range unique { - out = append(out, types.SelectItem{Label: field, Value: field}) + out = append(out, types.SelectItem{ + Label: field, + Value: field, + Selected: dict.identifier[baseFieldName(field)], + Note: phiRiskNote(dict, field), + }) } return out } -// deduplicatedSelectItemsWithIdentifiers builds a sorted, deduplicated list of SelectItem values. 
-// Fields present in the identifiers set are returned with Selected=true, signalling the frontend
-// to auto-blank them (they are REDCap identifier-tagged fields).
-func deduplicatedSelectItemsWithIdentifiers(fields []string, identifiers map[string]bool) []types.SelectItem {
-	seen := make(map[string]bool, len(fields))
-	unique := make([]string, 0, len(fields))
-	for _, f := range fields {
-		f = strings.TrimSpace(f)
-		if f != "" && !seen[f] {
-			seen[f] = true
-			unique = append(unique, f)
-		}
-	}
-	sort.Strings(unique)
-	out := make([]types.SelectItem, 0, len(unique))
-	for _, field := range unique {
-		out = append(out, types.SelectItem{Label: field, Value: field, Selected: identifiers[field]})
-	}
-	return out
-}
-
-// identifierFieldsFromMetadata fetches the project metadata and returns a set of field names
-// that REDCap has tagged as identifiers (identifier column = "y" in the data dictionary).
-func identifierFieldsFromMetadata(ctx context.Context, baseURL, token string) (map[string]bool, error) {
-	form := baseForm(token, "metadata", "csv")
-	body, err := redcapRequest(ctx, baseURL, form)
-	if err != nil {
-		return nil, err
-	}
-	rows, err := parseCSV(body, ',')
-	if err != nil || len(rows) == 0 {
-		return nil, err
-	}
-	fieldIdx := -1
-	identifierIdx := -1
-	for i, col := range rows[0] {
-		switch strings.ToLower(strings.TrimSpace(col)) {
-		case "field_name":
-			fieldIdx = i
-		case "identifier":
-			identifierIdx = i
-		}
-	}
-	if fieldIdx < 0 || identifierIdx < 0 {
-		return nil, nil
-	}
-	res := make(map[string]bool)
-	for _, row := range rows[1:] {
-		if fieldIdx >= len(row) || identifierIdx >= len(row) {
-			continue
-		}
-		field := strings.TrimSpace(row[fieldIdx])
-		ident := strings.ToLower(strings.TrimSpace(row[identifierIdx]))
-		if field != "" && (ident == "y" || ident == "yes" || ident == "1") {
-			res[field] = true
-		}
-	}
-	return res, nil
-}
-
-// listVariablesFromReport fetches column headers from a report export (CSV header-only request).
-// Falls back to the full metadata field list if the report header fetch fails.
-// Fields tagged as identifiers in REDCap are returned with Selected=true.
-func listVariablesFromReport(ctx context.Context, baseURL, token, reportID string, opts pluginOptions) ([]types.SelectItem, error) {
-	identifiers, _ := identifierFieldsFromMetadata(ctx, baseURL, token)
+// listVariablesFromReport fetches column headers from a report export (CSV
+// header-only request). The header request is always raw, comma-delimited:
+// transform rules are keyed by field name, so the variable list must contain
+// field names even when the actual export uses label headers or another
+// delimiter. Falls back to the full dictionary field list if the report
+// header fetch fails.
+func listVariablesFromReport(ctx context.Context, baseURL, token, reportID string, _ pluginOptions) ([]types.SelectItem, error) {
+	dict, dictErr := fetchDictionary(ctx, baseURL, token)
 	form := baseForm(token, "report", "csv")
 	form.Set("report_id", reportID)
-	applySharedExportParams(form, opts)
 	fields, err := redcapRequestHeaderOnly(ctx, baseURL, form, ',')
 	if err != nil {
-		// Fallback: derive field list from project metadata.
-		fields, err = fallbackFieldsFromMetadata(ctx, baseURL, token)
-		if err != nil {
-			return nil, err
+		// Fallback: derive the field list from the data dictionary.
+		if dictErr != nil {
+			return nil, dictErr
 		}
+		fields = dict.fieldOrder
 	}
-	return deduplicatedSelectItemsWithIdentifiers(fields, identifiers), nil
+	return variableSelectItems(fields, dict), nil
 }
 
-// listVariablesFromMetadata returns all project fields from the metadata endpoint.
-// Used for record export mode where there is no report to derive headers from.
-// Fields tagged as identifiers in REDCap (identifier column = "y") are returned
-// with Selected=true so the frontend can auto-blank them.
+// listVariablesFromMetadata returns all project fields from the data
+// dictionary.
Used for record export mode where there is no report to derive
+// headers from.
 func listVariablesFromMetadata(ctx context.Context, baseURL, token string) ([]types.SelectItem, error) {
-	identifiers, err := identifierFieldsFromMetadata(ctx, baseURL, token)
-	if err != nil {
-		return nil, err
-	}
-	fields, err := fallbackFieldsFromMetadata(ctx, baseURL, token)
+	dict, err := fetchDictionary(ctx, baseURL, token)
 	if err != nil {
 		return nil, err
 	}
-	return deduplicatedSelectItemsWithIdentifiers(fields, identifiers), nil
+	return variableSelectItems(dict.fieldOrder, dict), nil
 }
 
 func exportMetadataCSV(ctx context.Context, baseURL, token string, fields []string) ([]byte, error) {
@@ -1132,6 +1300,51 @@ type manifestExtras struct {
 	ProjectID interface{}
 	ProjectTitle string
 	DictionaryFieldsNotExported []string
+	TransformModes map[string]string // field -> transform mode, for echo redaction
+	RecordIDField string
+	KeyFingerprint string // SHA-256 fingerprint of the HMAC key (never the key itself)
+}
+
+// filterLogicFieldRe extracts the field names referenced by a REDCap filter
+// logic expression: [field], [field(code)], [event][field], ...
+var filterLogicFieldRe = regexp.MustCompile(`\[([a-zA-Z0-9_]+)`)
+
+// redactedEcho replaces a manifest parameter echo that would leak values of an
+// anonymized field. The manifest documents that redaction happened instead of
+// silently omitting the parameter.
+func redactedEcho(note string) map[string]interface{} {
+	return map[string]interface{}{
+		"redacted": true,
+		"note":     note,
+	}
+}
+
+// recordsEcho returns the manifest echo for the records filter. When the
+// record-ID field is anonymized, echoing the requested record IDs verbatim
+// would leak the very identifiers the transform removed from the data.
+func recordsEcho(opts pluginOptions, extras manifestExtras) interface{} {
+	if len(opts.Records) == 0 {
+		return opts.Records
+	}
+	if extras.RecordIDField != "" && extras.TransformModes[extras.RecordIDField] != "" {
+		return redactedEcho(fmt.Sprintf("%d record ids hidden: the record id field (%s) is anonymized", len(opts.Records), extras.RecordIDField))
+	}
+	return opts.Records
+}
+
+// filterLogicEcho returns the manifest echo for filterLogic. Filter logic can
+// embed literal values ([name] = "John"), so it is redacted whenever it
+// references an anonymized field.
+func filterLogicEcho(opts pluginOptions, extras manifestExtras) interface{} {
+	if opts.FilterLogic == "" || len(extras.TransformModes) == 0 {
+		return opts.FilterLogic
+	}
+	for _, m := range filterLogicFieldRe.FindAllStringSubmatch(opts.FilterLogic, -1) {
+		if extras.TransformModes[baseFieldName(m[1])] != "" {
+			return redactedEcho("filter logic hidden: it references anonymized fields")
+		}
+	}
+	return opts.FilterLogic
 }
 
 func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectInfoPath, eventsPath, mappingPath, redcapVersion string, warnings []string, extras manifestExtras) ([]byte, error) {
@@ -1153,8 +1366,8 @@ func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectI
 			"fields": opts.Fields,
 			"forms": opts.Forms,
 			"events": opts.Events,
-			"records": opts.Records,
-			"filter_logic": opts.FilterLogic,
+			"records": recordsEcho(opts, extras),
+			"filter_logic": filterLogicEcho(opts, extras),
 			"date_range_begin": opts.DateRangeBegin,
 			"date_range_end": opts.DateRangeEnd,
 		},
@@ -1195,6 +1408,13 @@ func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectI
 			}
 		}
 	}
+	if extras.KeyFingerprint != "" {
+		manifest["anonymization"] = map[string]interface{}{
+			"method":          "hmac-sha256",
+			"key_fingerprint": extras.KeyFingerprint,
+			"note":            "pseudonyms are hex-encoded HMAC-SHA256 values; the same key reproduces the same pseudonyms across exports",
+		}
+	}
 	if len(extras.DictionaryFieldsNotExported) > 0 {
 		manifest["dictionary_fields_not_exported"] = extras.DictionaryFieldsNotExported
 	}
@@ -1253,8 +1473,11 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp
 		basePath = fmt.Sprintf("redcap/report-%s", safeID)
 	}
 
-	blanks := blankFields(opts)
-	dataBytes, dataFields, audit, err := processExportData(rawData, opts, blanks, dict)
+	plan, err := buildTransformPlan(opts)
+	if err != nil {
+		return generatedBundle{}, err
+	}
+	dataBytes, dataFields, audit, err := processExportData(rawData, opts, plan, dict)
 	if err != nil {
 		return generatedBundle{}, fmt.Errorf("export processing failed: %w", err)
 	}
@@ -1310,10 +1533,15 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp
 		FileUploadFields: dict.fileUploadFields(),
 		ProjectID: projectID,
 		ProjectTitle: projectTitle,
+		TransformModes: plan.modes,
+		RecordIDField: recordIDField(dict),
+		KeyFingerprint: plan.keyFingerprint,
 	}
 	// In an unfiltered flat records export, dictionary fields missing from the
 	// output reveal server-side stripping (token export rights). With filters
 	// or report definitions the diff is expected, so it is not recorded.
+	// Client-side dropped fields are excluded: their absence is deliberate and
+	// already documented by the anonymization audit.
 	if opts.ExportMode == "records" && !isEAV(opts) && len(opts.Fields) == 0 && len(opts.Forms) == 0 && len(opts.Events) == 0 {
 		exported := make(map[string]bool, len(dataFields))
@@ -1321,7 +1549,7 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp
 			exported[field] = true
 		}
 		for _, field := range dict.fieldOrder {
-			if !exported[field] {
+			if !exported[field] && plan.modes[field] != "drop" {
 				extras.DictionaryFieldsNotExported = append(extras.DictionaryFieldsNotExported, field)
 			}
 		}
@@ -1378,6 +1606,10 @@ func bundleCacheKey(baseURL, token string, opts pluginOptions) string {
 		ExportSurveyFields: opts.ExportSurveyFields,
 		ExportDataAccessGroups: opts.ExportDataAccessGroups,
 		Variables: opts.Variables,
+		// The key participates in the cache key (different keys produce
+		// different pseudonyms) but is only ever used as MD5 input here —
+		// it is never stored or logged in recoverable form.
+		PseudonymizationKey: opts.PseudonymizationKey,
 		// GeneratedAt intentionally excluded
 	}
 	data, _ := json.Marshal(stable)
diff --git a/image/app/plugin/impl/redcap2/common_test.go b/image/app/plugin/impl/redcap2/common_test.go
index 550a767..aa7a4a8 100644
--- a/image/app/plugin/impl/redcap2/common_test.go
+++ b/image/app/plugin/impl/redcap2/common_test.go
@@ -4,6 +4,10 @@ package redcap2
 
 import (
 	"context"
+	"crypto/hmac"
+	"crypto/sha256"
+	"encoding/base64"
+	"encoding/hex"
 	"encoding/json"
 	"integration/app/plugin/types"
 	"net/http"
@@ -14,6 +18,21 @@ import (
 	"testing"
 )
 
+// testKey is a 32-byte HMAC key used by pseudonymization tests.
+var testKey = []byte("0123456789abcdef0123456789abcdef") + +func testKeyBase64() string { return base64.StdEncoding.EncodeToString(testKey) } + +func testPlan(modes map[string]string) transformPlan { + return transformPlan{modes: modes, key: testKey, keyFingerprint: "test-fingerprint"} +} + +func hmacHex(value string) string { + mac := hmac.New(sha256.New, testKey) + mac.Write([]byte(value)) + return hex.EncodeToString(mac.Sum(nil)) +} + func TestParsePluginOptionsDefaults(t *testing.T) { for _, raw := range []string{"", " "} { opts, err := parsePluginOptions(raw) @@ -53,8 +72,11 @@ func TestParsePluginOptionsNormalization(t *testing.T) { "fields": [" age", "age", "", "name "], "variables": [ {"name": " email ", "anonymization": "BLANK"}, - {"name": "age", "anonymization": "whatever"} - ] + {"name": "age", "anonymization": "whatever"}, + {"name": "ssn", "anonymization": "Drop"}, + {"name": "record_id", "anonymization": "PSEUDONYMIZE"} + ], + "pseudonymizationKey": " c2VjcmV0 " }`) if err != nil { t.Fatalf("parsePluginOptions returned error: %v", err) @@ -83,10 +105,15 @@ func TestParsePluginOptionsNormalization(t *testing.T) { wantVars := []variableOption{ {Name: "email", Anonymization: "blank"}, {Name: "age", Anonymization: "none"}, + {Name: "ssn", Anonymization: "drop"}, + {Name: "record_id", Anonymization: "pseudonymize"}, } if !reflect.DeepEqual(opts.Variables, wantVars) { t.Errorf("Variables = %v, want %v", opts.Variables, wantVars) } + if opts.PseudonymizationKey != "c2VjcmV0" { + t.Errorf("PseudonymizationKey = %q, want trimmed c2VjcmV0", opts.PseudonymizationKey) + } if opts.GeneratedAt != "missing-generated-at" { t.Errorf("GeneratedAt = %q, want missing-generated-at", opts.GeneratedAt) } @@ -164,15 +191,86 @@ func TestSanitizeReportID(t *testing.T) { } } -func TestBlankFields(t *testing.T) { +func TestBuildTransformPlanModes(t *testing.T) { opts := pluginOptions{Variables: []variableOption{ {Name: "email", Anonymization: "blank"}, + {Name: "ssn", 
Anonymization: "drop"}, {Name: "age", Anonymization: "none"}, {Name: "", Anonymization: "blank"}, }} - got := blankFields(opts) - if !reflect.DeepEqual(got, map[string]bool{"email": true}) { - t.Fatalf("blankFields = %v, want only email", got) + plan, err := buildTransformPlan(opts) + if err != nil { + t.Fatalf("buildTransformPlan returned error: %v", err) + } + want := map[string]string{"email": "blank", "ssn": "drop"} + if !reflect.DeepEqual(plan.modes, want) { + t.Fatalf("modes = %v, want %v", plan.modes, want) + } + if plan.keyFingerprint != "" || plan.key != nil { + t.Error("no pseudonymization requested: plan must not carry key material") + } +} + +func TestBuildTransformPlanKeyValidation(t *testing.T) { + base := pluginOptions{Variables: []variableOption{{Name: "record_id", Anonymization: "pseudonymize"}}} + + if _, err := buildTransformPlan(base); err == nil || !strings.Contains(err.Error(), "openssl rand -base64 32") { + t.Fatalf("missing key must error with generation hint, got %v", err) + } + + bad := base + bad.PseudonymizationKey = "!!!not-base64!!!" 
+ if _, err := buildTransformPlan(bad); err == nil || !strings.Contains(err.Error(), "base64") { + t.Fatalf("invalid base64 must error, got %v", err) + } + + short := base + short.PseudonymizationKey = base64.StdEncoding.EncodeToString([]byte("tooshort")) + if _, err := buildTransformPlan(short); err == nil || !strings.Contains(err.Error(), "too short") { + t.Fatalf("short key must error, got %v", err) + } + + good := base + good.PseudonymizationKey = testKeyBase64() + plan, err := buildTransformPlan(good) + if err != nil { + t.Fatalf("valid key rejected: %v", err) + } + if !reflect.DeepEqual(plan.key, testKey) { + t.Error("decoded key mismatch") + } + wantFingerprint := func() string { + sum := sha256.Sum256(testKey) + return hex.EncodeToString(sum[:])[:16] + }() + if plan.keyFingerprint != wantFingerprint { + t.Errorf("keyFingerprint = %q, want %q", plan.keyFingerprint, wantFingerprint) + } + if plan.keyFingerprint == good.PseudonymizationKey { + t.Error("fingerprint must not equal the key") + } +} + +func TestTransformValue(t *testing.T) { + plan := testPlan(map[string]string{"email": "blank", "record_id": "pseudonymize"}) + if got := plan.transformValue("email", "john@example.org"); got != "" { + t.Errorf("blank = %q, want empty", got) + } + got := plan.transformValue("record_id", "1") + if got != hmacHex("1") { + t.Errorf("pseudonymize = %q, want deterministic HMAC", got) + } + if len(got) != 64 { + t.Errorf("pseudonym length = %d, want 64 hex chars", len(got)) + } + if plan.transformValue("record_id", "1") != got { + t.Error("pseudonymization must be deterministic") + } + if plan.transformValue("record_id", "") != "" { + t.Error("empty values must stay empty (missingness is not data)") + } + if plan.transformValue("age", "34") != "34" { + t.Error("fields without a rule must pass through") } } @@ -322,8 +420,8 @@ func TestBaseFieldName(t *testing.T) { func TestResolveHeaderFields(t *testing.T) { dict := parseDictionary([]byte(testMetadataCSV)) - if got := 
resolveHeaderFields("phones___2", false, dict); !reflect.DeepEqual(got, []string{"phones"}) { - t.Errorf("raw checkbox header = %v, want [phones]", got) + if got := resolveHeaderFields("phones___2", false, dict); !reflect.DeepEqual(got, []string{"phones___2", "phones"}) { + t.Errorf("raw checkbox header = %v, want expansion plus base", got) } if got := resolveHeaderFields("Email Address", true, dict); !reflect.DeepEqual(got, []string{"email"}) { t.Errorf("label header = %v, want [email]", got) @@ -336,11 +434,11 @@ func TestResolveHeaderFields(t *testing.T) { } } -func TestBlankFlatCSV(t *testing.T) { +func TestTransformFlatCSVBlank(t *testing.T) { dict := parseDictionary([]byte(testMetadataCSV)) - out, exported, audit, err := blankFlatCSV([]byte(testDataCSV), ',', map[string]bool{"name": true, "email": true}, false, dict) + out, exported, audit, err := transformFlatCSV([]byte(testDataCSV), ',', testPlan(map[string]string{"name": "blank", "email": "blank"}), false, dict) if err != nil { - t.Fatalf("blankFlatCSV returned error: %v", err) + t.Fatalf("transformFlatCSV returned error: %v", err) } want := "record_id,name,email,age\n1,,,34\n2,,,29\n" if string(out) != want { @@ -356,31 +454,82 @@ func TestBlankFlatCSV(t *testing.T) { } } -func TestBlankFlatCSVCheckboxExpansion(t *testing.T) { +func TestTransformFlatCSVDrop(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + out, exported, audit, err := transformFlatCSV([]byte(testDataCSV), ',', testPlan(map[string]string{"email": "drop"}), false, dict) + if err != nil { + t.Fatalf("transformFlatCSV returned error: %v", err) + } + want := "record_id,name,age\n1,John,34\n2,Jane,29\n" + if string(out) != want { + t.Errorf("dropped CSV = %q, want %q", string(out), want) + } + if !reflect.DeepEqual(exported, []string{"record_id", "name", "age"}) { + t.Errorf("exported = %v, dropped field must be excluded", exported) + } + if len(audit) != 1 || audit[0].Mode != "drop" || audit[0].Matched != 1 { + 
t.Errorf("audit = %+v, want drop matched=1", audit) + } +} + +func TestTransformFlatCSVPseudonymize(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + out, _, audit, err := transformFlatCSV([]byte(testDataCSV), ',', testPlan(map[string]string{"record_id": "pseudonymize"}), false, dict) + if err != nil { + t.Fatalf("transformFlatCSV returned error: %v", err) + } + want := "record_id,name,email,age\n" + + hmacHex("1") + ",John,john@example.org,34\n" + + hmacHex("2") + ",Jane,jane@example.org,29\n" + if string(out) != want { + t.Errorf("pseudonymized CSV = %q, want %q", string(out), want) + } + if len(audit) != 1 || audit[0].Mode != "pseudonymize" || audit[0].Matched != 1 { + t.Errorf("audit = %+v", audit) + } +} + +func TestTransformFlatCSVCheckboxExpansion(t *testing.T) { dict := parseDictionary([]byte("field_name,field_type,field_label\nrecord_id,text,Record ID\nphones,checkbox,Phone Types\n")) data := "record_id,phones___1,phones___2\n1,555-1234,555-5678\n" - out, exported, audit, err := blankFlatCSV([]byte(data), ',', map[string]bool{"phones": true}, false, dict) + out, exported, audit, err := transformFlatCSV([]byte(data), ',', testPlan(map[string]string{"phones": "blank"}), false, dict) if err != nil { - t.Fatalf("blankFlatCSV returned error: %v", err) + t.Fatalf("transformFlatCSV returned error: %v", err) } want := "record_id,phones___1,phones___2\n1,,\n" if string(out) != want { t.Errorf("blanked CSV = %q, want %q", string(out), want) } - if !reflect.DeepEqual(exported, []string{"record_id", "phones"}) { - t.Errorf("exported = %v, want base names", exported) + if !reflect.DeepEqual(exported, []string{"record_id", "phones___1", "phones", "phones___2"}) { + t.Errorf("exported = %v, want expansions plus base", exported) } if len(audit) != 1 || audit[0].Matched != 2 { t.Errorf("audit = %+v, want phones matched=2", audit) } } -func TestBlankFlatCSVLabelHeaders(t *testing.T) { +func TestTransformFlatCSVExpansionRule(t *testing.T) { + dict := 
parseDictionary([]byte("field_name,field_type,field_label\nrecord_id,text,Record ID\nphones,checkbox,Phone Types\n")) + data := "record_id,phones___1,phones___2\n1,555-1234,555-5678\n" + out, _, audit, err := transformFlatCSV([]byte(data), ',', testPlan(map[string]string{"phones___2": "blank"}), false, dict) + if err != nil { + t.Fatalf("transformFlatCSV returned error: %v", err) + } + want := "record_id,phones___1,phones___2\n1,555-1234,\n" + if string(out) != want { + t.Errorf("expansion-rule CSV = %q, want only phones___2 blanked", string(out)) + } + if len(audit) != 1 || audit[0].Field != "phones___2" || audit[0].Matched != 1 { + t.Errorf("audit = %+v, want rule on the expansion column to match", audit) + } +} + +func TestTransformFlatCSVLabelHeaders(t *testing.T) { dict := parseDictionary([]byte(testMetadataCSV)) data := "Record ID,Full Name,Email Address,Age\n1,John,john@example.org,34\n" - out, exported, audit, err := blankFlatCSV([]byte(data), ',', map[string]bool{"email": true}, true, dict) + out, exported, audit, err := transformFlatCSV([]byte(data), ',', testPlan(map[string]string{"email": "blank"}), true, dict) if err != nil { - t.Fatalf("blankFlatCSV returned error: %v", err) + t.Fatalf("transformFlatCSV returned error: %v", err) } want := "Record ID,Full Name,Email Address,Age\n1,John,,34\n" if string(out) != want { @@ -394,30 +543,30 @@ func TestBlankFlatCSVLabelHeaders(t *testing.T) { } } -func TestBlankFlatCSVZeroMatchAudit(t *testing.T) { +func TestTransformFlatCSVZeroMatchAudit(t *testing.T) { dict := parseDictionary([]byte(testMetadataCSV)) - out, _, audit, err := blankFlatCSV([]byte(testDataCSV), ',', map[string]bool{"missing": true}, false, dict) + out, _, audit, err := transformFlatCSV([]byte(testDataCSV), ',', testPlan(map[string]string{"missing": "blank"}), false, dict) if err != nil { - t.Fatalf("blankFlatCSV returned error: %v", err) + t.Fatalf("transformFlatCSV returned error: %v", err) } if string(out) != testDataCSV { - t.Error("data 
changed despite no matching blank columns") + t.Error("data changed despite no matching columns") } if len(audit) != 1 || audit[0].Matched != 0 || audit[0].Note == "" { t.Errorf("zero-match audit missing note: %+v", audit) } } -func TestBlankEAVCSV(t *testing.T) { +func TestTransformEAVCSV(t *testing.T) { dict := parseDictionary([]byte(testMetadataCSV)) data := "record,redcap_event_name,field_name,value\n" + "1,baseline_arm_1,name,John\n" + "1,baseline_arm_1,email,john@example.org\n" + "2,baseline_arm_1,email,jane@example.org\n" + "2,baseline_arm_1,age,29\n" - out, exported, audit, err := blankEAVCSV([]byte(data), ',', map[string]bool{"email": true}, dict) + out, exported, audit, err := transformEAVCSV([]byte(data), ',', testPlan(map[string]string{"email": "blank"}), dict) if err != nil { - t.Fatalf("blankEAVCSV returned error: %v", err) + t.Fatalf("transformEAVCSV returned error: %v", err) } want := "record,redcap_event_name,field_name,value\n" + "1,baseline_arm_1,name,John\n" + @@ -435,12 +584,56 @@ func TestBlankEAVCSV(t *testing.T) { } } -func TestBlankEAVJSON(t *testing.T) { +func TestTransformEAVCSVDropRemovesRows(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + data := "record,field_name,value\n" + + "1,name,John\n" + + "1,email,john@example.org\n" + + "2,email,jane@example.org\n" + out, exported, audit, err := transformEAVCSV([]byte(data), ',', testPlan(map[string]string{"email": "drop"}), dict) + if err != nil { + t.Fatalf("transformEAVCSV returned error: %v", err) + } + want := "record,field_name,value\n1,name,John\n" + if string(out) != want { + t.Errorf("dropped EAV CSV = %q, want %q", string(out), want) + } + for _, field := range exported { + if field == "email" { + t.Error("dropped field must not be in exported list") + } + } + if len(audit) != 1 || audit[0].Mode != "drop" || audit[0].Matched != 2 { + t.Errorf("audit = %+v, want drop matched=2 rows", audit) + } +} + +func TestTransformEAVCSVRecordColumnPseudonymized(t *testing.T) 
{ + dict := parseDictionary([]byte(testMetadataCSV)) + data := "record,field_name,value\n" + + "1,record_id,1\n" + + "1,age,34\n" + out, _, audit, err := transformEAVCSV([]byte(data), ',', testPlan(map[string]string{"record_id": "pseudonymize"}), dict) + if err != nil { + t.Fatalf("transformEAVCSV returned error: %v", err) + } + want := "record,field_name,value\n" + + hmacHex("1") + ",record_id," + hmacHex("1") + "\n" + + hmacHex("1") + ",age,34\n" + if string(out) != want { + t.Errorf("EAV record column = %q, want pseudonymized record column %q", string(out), want) + } + if len(audit) != 1 || !strings.Contains(audit[0].Note, "record column") { + t.Errorf("audit = %+v, want record-column note", audit) + } +} + +func TestTransformEAVJSON(t *testing.T) { dict := parseDictionary([]byte(testMetadataCSV)) data := `[{"record":"1","field_name":"name","value":"John"},{"record":"1","field_name":"email","value":"john@example.org"}]` - out, exported, audit, err := blankEAVJSON([]byte(data), map[string]bool{"email": true}, dict) + out, exported, audit, err := transformEAVJSON([]byte(data), testPlan(map[string]string{"email": "blank"}), dict) if err != nil { - t.Fatalf("blankEAVJSON returned error: %v", err) + t.Fatalf("transformEAVJSON returned error: %v", err) } rows := []map[string]string{} if err := json.Unmarshal(out, &rows); err != nil { @@ -457,11 +650,44 @@ func TestBlankEAVJSON(t *testing.T) { } } -func TestBlankFlatJSONCheckboxExpansion(t *testing.T) { +func TestTransformEAVJSONRecordColumnAndDrop(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + data := `[{"record":"1","field_name":"record_id","value":"1"},{"record":"1","field_name":"email","value":"john@example.org"},{"record":"1","field_name":"age","value":"34"}]` + out, _, audit, err := transformEAVJSON([]byte(data), testPlan(map[string]string{"record_id": "pseudonymize", "email": "drop"}), dict) + if err != nil { + t.Fatalf("transformEAVJSON returned error: %v", err) + } + rows := 
[]map[string]string{} + if err := json.Unmarshal(out, &rows); err != nil { + t.Fatalf("transformed EAV JSON invalid: %v", err) + } + if len(rows) != 2 { + t.Fatalf("rows = %d, want email row dropped", len(rows)) + } + for _, row := range rows { + if row["record"] != hmacHex("1") { + t.Errorf("record = %q, want pseudonymized", row["record"]) + } + } + if rows[0]["value"] != hmacHex("1") { + t.Errorf("record_id value = %q, want pseudonymized", rows[0]["value"]) + } + foundNote := false + for _, entry := range audit { + if entry.Field == "record_id" && strings.Contains(entry.Note, "record column") { + foundNote = true + } + } + if !foundNote { + t.Errorf("audit = %+v, want record-column note", audit) + } +} + +func TestTransformFlatJSONCheckboxExpansion(t *testing.T) { data := `[{"record_id":"1","phones___1":"555-1234","phones___2":"555-5678","age":"34"}]` - out, exported, audit, err := blankFlatJSON([]byte(data), map[string]bool{"phones": true}) + out, exported, audit, err := transformFlatJSON([]byte(data), testPlan(map[string]string{"phones": "blank"})) if err != nil { - t.Fatalf("blankFlatJSON returned error: %v", err) + t.Fatalf("transformFlatJSON returned error: %v", err) } rows := []map[string]string{} if err := json.Unmarshal(out, &rows); err != nil { @@ -478,12 +704,39 @@ func TestBlankFlatJSONCheckboxExpansion(t *testing.T) { } } -func TestBlankFlatJSONInvalid(t *testing.T) { - if _, _, _, err := blankFlatJSON([]byte("not json"), nil); err == nil { +func TestTransformFlatJSONDrop(t *testing.T) { + data := `[{"record_id":"1","email":"john@example.org","age":"34"}]` + out, exported, _, err := transformFlatJSON([]byte(data), testPlan(map[string]string{"email": "drop"})) + if err != nil { + t.Fatalf("transformFlatJSON returned error: %v", err) + } + rows := []map[string]string{} + if err := json.Unmarshal(out, &rows); err != nil { + t.Fatalf("transformed JSON invalid: %v", err) + } + if _, ok := rows[0]["email"]; ok { + t.Error("dropped key must be removed from 
rows") + } + if !reflect.DeepEqual(exported, []string{"age", "record_id"}) { + t.Errorf("exported = %v, dropped field must be excluded", exported) + } +} + +func TestTransformFlatJSONInvalid(t *testing.T) { + if _, _, _, err := transformFlatJSON([]byte("not json"), transformPlan{}); err == nil { t.Fatal("expected error for invalid JSON input") } } +func TestProcessExportDataRejectsEAVRecordIDDrop(t *testing.T) { + dict := parseDictionary([]byte(testMetadataCSV)) + opts, _ := parsePluginOptions(`{"exportMode":"records","recordType":"eav"}`) + _, _, _, err := processExportData([]byte("record,field_name,value\n"), opts, testPlan(map[string]string{"record_id": "drop"}), dict) + if err == nil || !strings.Contains(err.Error(), "record id") { + t.Fatalf("EAV drop of record id must error, got %v", err) + } +} + func TestFilterMetadataCSV(t *testing.T) { out, err := filterMetadataCSV([]byte(testMetadataCSV), []string{"age", "record_id"}) if err != nil { @@ -551,29 +804,45 @@ func TestProjectIdentity(t *testing.T) { } } -func TestDeduplicatedSelectItems(t *testing.T) { - got := deduplicatedSelectItems([]string{" b", "a", "b", ""}) +func TestVariableSelectItems(t *testing.T) { + metadata := "field_name,field_type,field_label,identifier,text_validation_type_or_show_slider_number\n" + + "record_id,text,Record ID,,integer\n" + + "name,text,Full Name,y,\n" + + "comments,notes,Comments,,\n" + + "phones,checkbox,Phone Types,y,\n" + + "age,text,Age,,integer\n" + dict := parseDictionary([]byte(metadata)) + got := variableSelectItems([]string{"age", " name ", "name", "comments", "phones___2", "record_id"}, dict) want := []types.SelectItem{ - {Label: "a", Value: "a"}, - {Label: "b", Value: "b"}, + {Label: "age", Value: "age"}, + {Label: "comments", Value: "comments", Note: "free-text notes field: may contain identifying information"}, + {Label: "name", Value: "name", Selected: true, Note: "unvalidated text field: may contain identifying information"}, + {Label: "phones___2", Value: 
"phones___2", Selected: true}, + {Label: "record_id", Value: "record_id"}, } if !reflect.DeepEqual(got, want) { - t.Fatalf("deduplicatedSelectItems = %v, want %v", got, want) + t.Fatalf("variableSelectItems = %+v, want %+v", got, want) } } -func TestDeduplicatedSelectItemsWithIdentifiers(t *testing.T) { - got := deduplicatedSelectItemsWithIdentifiers( - []string{"email", "age", "email", " name "}, - map[string]bool{"email": true, "name": true}, - ) - want := []types.SelectItem{ - {Label: "age", Value: "age"}, - {Label: "email", Value: "email", Selected: true}, - {Label: "name", Value: "name", Selected: true}, +func TestPhiRiskNote(t *testing.T) { + metadata := "field_name,field_type,field_label,text_validation_type_or_show_slider_number\n" + + "comments,notes,Comments,\n" + + "nickname,text,Nickname,\n" + + "age,text,Age,integer\n" + + "consent,yesno,Consent,\n" + dict := parseDictionary([]byte(metadata)) + if note := phiRiskNote(dict, "comments"); !strings.Contains(note, "notes") { + t.Errorf("notes field note = %q", note) } - if !reflect.DeepEqual(got, want) { - t.Fatalf("deduplicatedSelectItemsWithIdentifiers = %v, want %v", got, want) + if note := phiRiskNote(dict, "nickname"); !strings.Contains(note, "unvalidated") { + t.Errorf("unvalidated text note = %q", note) + } + if note := phiRiskNote(dict, "age"); note != "" { + t.Errorf("validated text field should have no note, got %q", note) + } + if note := phiRiskNote(dict, "consent"); note != "" { + t.Errorf("yesno field should have no note, got %q", note) } } @@ -670,6 +939,90 @@ func TestMakeManifestZeroMatchAuditAddsWarning(t *testing.T) { } } +func TestMakeManifestRedactsRecordsFilterWhenRecordIDTransformed(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records","records":["101","102"],"variables":[{"name":"record_id","anonymization":"pseudonymize"}],"pseudonymizationKey":"` + testKeyBase64() + `"}`) + extras := manifestExtras{ + TransformModes: map[string]string{"record_id": "pseudonymize"}, 
+ RecordIDField: "record_id", + KeyFingerprint: "abcdef0123456789", + } + data, err := makeManifest(opts, "", "d", "m", "p", "", "", "", nil, extras) + if err != nil { + t.Fatalf("makeManifest returned error: %v", err) + } + if strings.Contains(string(data), "101") || strings.Contains(string(data), "102") { + t.Fatal("manifest leaks record ids despite record-id transform") + } + manifest := map[string]interface{}{} + _ = json.Unmarshal(data, &manifest) + export := manifest["export"].(map[string]interface{}) + records, ok := export["records"].(map[string]interface{}) + if !ok || records["redacted"] != true { + t.Fatalf("records echo = %v, want redaction marker", export["records"]) + } + anonymization, ok := manifest["anonymization"].(map[string]interface{}) + if !ok || anonymization["key_fingerprint"] != "abcdef0123456789" || anonymization["method"] != "hmac-sha256" { + t.Fatalf("anonymization section = %v", manifest["anonymization"]) + } + if strings.Contains(string(data), testKeyBase64()) { + t.Fatal("manifest must never contain the pseudonymization key") + } +} + +func TestMakeManifestKeepsRecordsFilterWithoutRecordIDTransform(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records","records":["101"],"variables":[{"name":"email","anonymization":"blank"}]}`) + extras := manifestExtras{ + TransformModes: map[string]string{"email": "blank"}, + RecordIDField: "record_id", + } + data, err := makeManifest(opts, "", "d", "m", "p", "", "", "", nil, extras) + if err != nil { + t.Fatalf("makeManifest returned error: %v", err) + } + manifest := map[string]interface{}{} + _ = json.Unmarshal(data, &manifest) + export := manifest["export"].(map[string]interface{}) + records, ok := export["records"].([]interface{}) + if !ok || len(records) != 1 || records[0] != "101" { + t.Fatalf("records echo = %v, want verbatim filter", export["records"]) + } +} + +func TestMakeManifestRedactsFilterLogicReferencingTransformedField(t *testing.T) { + opts, _ := 
parsePluginOptions(`{"exportMode":"records","filterLogic":"[email] = \"john@example.org\"","variables":[{"name":"email","anonymization":"blank"}]}`) + extras := manifestExtras{ + TransformModes: map[string]string{"email": "blank"}, + RecordIDField: "record_id", + } + data, err := makeManifest(opts, "", "d", "m", "p", "", "", "", nil, extras) + if err != nil { + t.Fatalf("makeManifest returned error: %v", err) + } + if strings.Contains(string(data), "john@example.org") { + t.Fatal("manifest leaks filter logic referencing an anonymized field") + } + + unrelated, _ := parsePluginOptions(`{"exportMode":"records","filterLogic":"[age] > 30","variables":[{"name":"email","anonymization":"blank"}]}`) + data, err = makeManifest(unrelated, "", "d", "m", "p", "", "", "", nil, extras) + if err != nil { + t.Fatalf("makeManifest returned error: %v", err) + } + manifest := map[string]interface{}{} + _ = json.Unmarshal(data, &manifest) + export := manifest["export"].(map[string]interface{}) + if export["filter_logic"] != "[age] > 30" { + t.Fatalf("filter_logic = %v, want verbatim echo for untransformed fields", export["filter_logic"]) + } +} + +func TestFilterLogicEchoCheckboxReference(t *testing.T) { + opts := pluginOptions{FilterLogic: `[phones(2)] = "1"`} + extras := manifestExtras{TransformModes: map[string]string{"phones": "drop"}} + if _, ok := filterLogicEcho(opts, extras).(map[string]interface{}); !ok { + t.Fatal("checkbox reference to a transformed field must redact filter logic") + } +} + func TestBundleCacheKeyStability(t *testing.T) { base, _ := parsePluginOptions(`{"exportMode":"report","reportId":"7","generatedAt":"2026-01-01T00:00:00Z"}`) sameButLater := base @@ -696,6 +1049,12 @@ func TestBundleCacheKeyStability(t *testing.T) { t.Error("exportSurveyFields should change the cache key") } + otherKey := base + otherKey.PseudonymizationKey = testKeyBase64() + if bundleCacheKey("https://r", "tok", base) == bundleCacheKey("https://r", "tok", otherKey) { + 
t.Error("pseudonymization key should change the cache key (different keys, different pseudonyms)") + } + if bundleCacheKey("https://r", "tok", base) == bundleCacheKey("https://r", "other", base) { t.Error("different token should change the cache key") } diff --git a/image/app/plugin/impl/redcap2/deid_test.go b/image/app/plugin/impl/redcap2/deid_test.go index 124a539..84b5c14 100644 --- a/image/app/plugin/impl/redcap2/deid_test.go +++ b/image/app/plugin/impl/redcap2/deid_test.go @@ -7,6 +7,7 @@ import ( "encoding/json" "integration/app/plugin/types" "io" + "strings" "testing" ) @@ -168,6 +169,73 @@ func TestEndToEndZeroMatchBlankWarning(t *testing.T) { } } +// Drop must remove the column (data and metadata), pseudonymize must rewrite +// values with deterministic HMACs, and the manifest must redact the records +// filter (it contains the very identifiers being pseudonymized) while +// reporting the key fingerprint — never the key. +func TestEndToEndDropAndPseudonymize(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + pluginOpts := `{ + "exportMode": "records", + "records": ["1", "2"], + "variables": [ + {"name": "record_id", "anonymization": "pseudonymize"}, + {"name": "email", "anonymization": "drop"} + ], + "pseudonymizationKey": "` + testKeyBase64() + `" + }` + data, manifest := queryAndRead(t, f, pluginOpts, "redcap/records/data.csv") + + want := "record_id,name,age\n" + + hmacHex("1") + ",John,34\n" + + hmacHex("2") + ",Jane,29\n" + if string(data) != want { + t.Errorf("data.csv = %q, want %q", string(data), want) + } + + export := manifest["export"].(map[string]interface{}) + records, ok := export["records"].(map[string]interface{}) + if !ok || records["redacted"] != true { + t.Errorf("records echo = %v, want redaction (record id field is pseudonymized)", export["records"]) + } + anonymization, ok := manifest["anonymization"].(map[string]interface{}) + if !ok || anonymization["method"] != "hmac-sha256" || anonymization["key_fingerprint"] == "" { + 
t.Errorf("anonymization = %v, want hmac-sha256 with fingerprint", manifest["anonymization"]) + } + raw, _ := json.Marshal(manifest) + if string(raw) == "" || strings.Contains(string(raw), testKeyBase64()) { + t.Fatal("manifest must never contain the pseudonymization key") + } + if _, ok := manifest["dictionary_fields_not_exported"]; ok { + t.Errorf("client-side dropped fields must not be reported as token-rights stripping: %v", manifest["dictionary_fields_not_exported"]) + } + if form := f.lastForm("metadata"); form != nil { + // metadata.csv must not keep the dropped field's dictionary row + metadataData, _ := queryAndRead(t, f, pluginOpts, "redcap/records/metadata.csv") + if strings.Contains(string(metadataData), "email") { + t.Error("dropped field must be filtered from metadata.csv") + } + } +} + +// Pseudonymization without a key (or with a bad key) must fail the export +// loudly instead of silently exporting identifiable data. +func TestEndToEndPseudonymizeRequiresKey(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + pluginOpts := `{ + "exportMode": "records", + "variables": [{"name": "record_id", "anonymization": "pseudonymize"}] + }` + _, err := Query(context.Background(), types.CompareRequest{Url: f.url(), Token: "tok", PluginOptions: pluginOpts}, nil) + if err == nil || !strings.Contains(err.Error(), "key") { + t.Fatalf("expected missing-key error, got %v", err) + } +} + // Bundles above the cache cap must be rebuilt instead of cached. 
func TestOversizedBundleIsNotCached(t *testing.T) { originalCap := maxCacheableBundleBytes diff --git a/image/app/plugin/types/select_item.go b/image/app/plugin/types/select_item.go index c5579f8..2e4a7d0 100644 --- a/image/app/plugin/types/select_item.go +++ b/image/app/plugin/types/select_item.go @@ -7,5 +7,6 @@ type SelectItem struct { Value interface{} `json:"value"` Selected bool `json:"selected,omitempty"` Expanded bool `json:"expanded,omitempty"` + Note string `json:"note,omitempty"` Children []SelectItem `json:"children,omitempty"` } From c5065c2ca641c84298808b2d3e5e74b629ac6f5f Mon Sep 17 00:00:00 2001 From: ErykKul Date: Thu, 11 Jun 2026 23:45:51 +0200 Subject: [PATCH 07/25] redcap2 Phase 5: metadata sidecars (Croissant, RO-Crate, DDI-CDI) + per-file mime plumbing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - tree.Node gains an optional mimeType attribute; threaded through both upload paths: native multipart add/replace (explicit part Content-Type instead of CreateFormFile's octet-stream) and direct-upload /addFiles jsonData. Files without an explicit mime keep today's destination-side detection. 
- Every redcap2 export bundle now includes, generated from one normalized model over the final bundle (no toggles; deselectable per file in compare): - croissant.json (Croissant 1.0, canonical context, FileObject distribution with md5, CSV RecordSet fields with schema.org data types; recordSet omitted for JSON exports) as application/ld+json — previewable via the new generic JSON-LD external tool; - ro-crate-metadata.json (RO-Crate 1.2, detached crate, Process Run Crate provenance: CreateAction + plugin/REDCap SoftwareApplication entities) with the profile mime Dataverse 6.3+ detects and the RO-Crate previewer registers for; - ddi-cdi.jsonld (DDI-CDI 1.0 JSON-LD mirroring the in-repo cdi_generator_jsonld.py structure: WideDataSet/WideDataStructure/ LogicalRecord, InstanceVariables with substantive value domains, CodeLists from REDCap choices, PrimaryKey on the record-ID, PhysicalSegmentLayout for CSV) with the profile mime the deployed cdi previewer registers for; - project_metadata.xml (CDISC ODM, returnMetadataOnly) — failure-tolerant. - Variables reflect the post-transform data file: dropped columns are absent, transforms are noted in descriptions, key fingerprint in dataset description. - Manifest lists the sidecars under files; conf gains 12-jsonld-previewer.json (cdi-viewer registered for bare application/ld+json). 
--- .../external-tools/12-jsonld-previewer.json | 23 + image/app/core/destination_plugin.go | 2 +- image/app/core/io.go | 8 +- image/app/core/io_types.go | 17 +- image/app/core/persisting.go | 2 +- image/app/dataverse/dataverse_write.go | 10 +- image/app/plugin/impl/redcap2/common.go | 62 +- image/app/plugin/impl/redcap2/helper_test.go | 3 + image/app/plugin/impl/redcap2/query.go | 3 + image/app/plugin/impl/redcap2/query_test.go | 23 +- image/app/plugin/impl/redcap2/sidecars.go | 768 ++++++++++++++++++ .../app/plugin/impl/redcap2/sidecars_test.go | 315 +++++++ image/app/tree/node.go | 15 +- 13 files changed, 1230 insertions(+), 21 deletions(-) create mode 100644 conf/dataverse/external-tools/12-jsonld-previewer.json create mode 100644 image/app/plugin/impl/redcap2/sidecars.go create mode 100644 image/app/plugin/impl/redcap2/sidecars_test.go diff --git a/conf/dataverse/external-tools/12-jsonld-previewer.json b/conf/dataverse/external-tools/12-jsonld-previewer.json new file mode 100644 index 0000000..f9f4cdd --- /dev/null +++ b/conf/dataverse/external-tools/12-jsonld-previewer.json @@ -0,0 +1,23 @@ +{ + "displayName": "View JSON-LD", + "description": "View JSON-LD metadata files (e.g. 
Croissant) with optional SHACL validation.", + "toolName": "jsonldPreviewer", + "scope": "file", + "types": ["preview", "explore"], + "toolUrl": "https://libis.github.io/cdi-viewer/index.html", + "toolParameters": { + "queryParameters": [ + {"fileid": "{fileId}"}, + {"siteUrl": "{siteUrl}"}, + {"datasetid": "{datasetId}"}, + {"datasetversion": "{datasetVersion}"}, + {"locale": "{localeCode}"} + ] + }, + "contentType": "application/ld+json", + "allowedApiCalls": [ + {"name": "retrieveFileContents", "httpMethod": "GET", "urlTemplate": "/api/v1/access/datafile/{fileId}?gbrecs=true", "timeOut": 3600}, + {"name": "downloadFile", "httpMethod": "GET", "urlTemplate": "/api/v1/access/datafile/{fileId}?gbrecs=false", "timeOut": 3600}, + {"name": "getDatasetVersionMetadata", "httpMethod": "GET", "urlTemplate": "/api/v1/datasets/{datasetId}/versions/{datasetVersion}", "timeOut": 3600} + ] +} diff --git a/image/app/core/destination_plugin.go b/image/app/core/destination_plugin.go index 426234b..704a9ee 100644 --- a/image/app/core/destination_plugin.go +++ b/image/app/core/destination_plugin.go @@ -18,7 +18,7 @@ type DestinationPlugin struct { CreateNewRepo func(ctx context.Context, collection, token, userName string, metadata types.Metadata) (string, error) GetDatasetVersion func(ctx context.Context, datasetDbId, token, userName string) (string, error) GetRepoUrl func(pid string, draft bool) string - WriteOverWire func(ctx context.Context, dbId int64, nodeMapId, token, user, persistentId string, wg *sync.WaitGroup, async_err *ErrorHolder) (io.WriteCloser, error) + WriteOverWire func(ctx context.Context, dbId int64, nodeMapId, mimeType, token, user, persistentId string, wg *sync.WaitGroup, async_err *ErrorHolder) (io.WriteCloser, error) SaveAfterDirectUpload func(ctx context.Context, replace bool, token, user, persistentId string, storageIdentifiers []string, nodes []tree.Node) error CleanupLeftOverFiles func(ctx context.Context, persistentId, token, user string) error 
DeleteFile func(ctx context.Context, token, user string, id int64) error diff --git a/image/app/core/io.go b/image/app/core/io.go index f3966c8..b3a26fd 100644 --- a/image/app/core/io.go +++ b/image/app/core/io.go @@ -129,7 +129,7 @@ func newS3Client(ctx context.Context) (*s3.Client, error) { }), nil } -func write(ctx context.Context, dbId int64, dataverseKey, user string, fileStream types.Stream, storageIdentifier, persistentId, hashType, remoteHashType, id string, fileSize int64) (hash []byte, remoteHash []byte, size int64, retErr error) { +func write(ctx context.Context, dbId int64, dataverseKey, user string, fileStream types.Stream, storageIdentifier, persistentId, hashType, remoteHashType, id, mimeType string, fileSize int64) (hash []byte, remoteHash []byte, size int64, retErr error) { pid, err := trimProtocol(persistentId) if err != nil { return nil, nil, 0, err @@ -156,7 +156,7 @@ func write(ctx context.Context, dbId int64, dataverseKey, user string, fileStrea if s.driver == "file" || !Destination.IsDirectUpload() { wg := &sync.WaitGroup{} async_err := &ErrorHolder{} - f, err := getFile(ctx, dbId, wg, dataverseKey, user, persistentId, pid, s, id, async_err) + f, err := getFile(ctx, dbId, wg, dataverseKey, user, persistentId, pid, s, id, mimeType, async_err) if err != nil { return nil, nil, 0, err } @@ -203,9 +203,9 @@ func write(ctx context.Context, dbId int64, dataverseKey, user string, fileStrea return hasher.Sum(nil), remoteHasher.Sum(nil), sizeHasher.FileSize, nil } -func getFile(ctx context.Context, dbId int64, wg *sync.WaitGroup, dataverseKey, user, persistentId, pid string, s storage, id string, async_err *ErrorHolder) (io.WriteCloser, error) { +func getFile(ctx context.Context, dbId int64, wg *sync.WaitGroup, dataverseKey, user, persistentId, pid string, s storage, id, mimeType string, async_err *ErrorHolder) (io.WriteCloser, error) { if !Destination.IsDirectUpload() { - return Destination.WriteOverWire(ctx, dbId, id, dataverseKey, user, 
persistentId, wg, async_err) + return Destination.WriteOverWire(ctx, dbId, id, mimeType, dataverseKey, user, persistentId, wg, async_err) } path := config.GetConfig().Options.PathToFilesDir + pid + "/" if _, err := os.Stat(path); errors.Is(err, os.ErrNotExist) { diff --git a/image/app/core/io_types.go b/image/app/core/io_types.go index cea3275..26bccaf 100644 --- a/image/app/core/io_types.go +++ b/image/app/core/io_types.go @@ -3,6 +3,7 @@ package core import ( + "fmt" "hash" "io" "mime/multipart" @@ -48,10 +49,11 @@ type FileWriter struct { part2 io.Writer writer *multipart.Writer filename string + mimeType string } -func NewFileWriter(filename string, part1bytes []byte, writer *multipart.Writer) *FileWriter { - return &FileWriter{false, part1bytes, nil, writer, filename} +func NewFileWriter(filename, mimeType string, part1bytes []byte, writer *multipart.Writer) *FileWriter { + return &FileWriter{false, part1bytes, nil, writer, filename, mimeType} } func (f *FileWriter) Write(p []byte) (int, error) { @@ -59,7 +61,16 @@ func (f *FileWriter) Write(p []byte) (int, error) { part1, _ := f.writer.CreateFormField("jsonData") part1.Write(f.part1bytes) f.part1written = true - f.part2, _ = f.writer.CreateFormFile("file", f.filename) + if f.mimeType != "" { + // An explicit mime type overrides destination-side type detection + // (CreateFormFile would hardcode application/octet-stream). 
+ header := make(map[string][]string) + header["Content-Disposition"] = []string{fmt.Sprintf(`form-data; name="file"; filename="%s"`, f.filename)} + header["Content-Type"] = []string{f.mimeType} + f.part2, _ = f.writer.CreatePart(header) + } else { + f.part2, _ = f.writer.CreateFormFile("file", f.filename) + } } n, err := f.part2.Write(p) return n, err diff --git a/image/app/core/persisting.go b/image/app/core/persisting.go index 4b3da90..6099ade 100644 --- a/image/app/core/persisting.go +++ b/image/app/core/persisting.go @@ -248,7 +248,7 @@ func doPersistNodeMap(ctx context.Context, streams map[string]types.Stream, in J var h []byte var remoteH []byte var size int64 - h, remoteH, size, err = write(ctx, v.Attributes.DestinationFile.Id, dataverseKey, user, fileStream, storageIdentifier, persistentId, hashType, remoteHashType, k, v.Attributes.RemoteFileSize) + h, remoteH, size, err = write(ctx, v.Attributes.DestinationFile.Id, dataverseKey, user, fileStream, storageIdentifier, persistentId, hashType, remoteHashType, k, v.Attributes.MimeType, v.Attributes.RemoteFileSize) if err != nil { return } diff --git a/image/app/dataverse/dataverse_write.go b/image/app/dataverse/dataverse_write.go index 49a0184..856e770 100644 --- a/image/app/dataverse/dataverse_write.go +++ b/image/app/dataverse/dataverse_write.go @@ -180,13 +180,17 @@ func getDefaultLicense(ctx context.Context, user, token string) (map[string]inte func SaveAfterDirectUpload(ctx context.Context, replace bool, token, user, persistentId string, storageIdentifiers []string, nodes []tree.Node) error { jsonData := []api.JsonData{} for i, v := range nodes { + mimeType := v.Attributes.MimeType + if mimeType == "" { + mimeType = "application/octet-stream" // default that will be replaced by Dataverse while adding/replacing the file + } jsonData = append(jsonData, api.JsonData{ FileToReplaceId: v.Attributes.DestinationFile.Id, ForceReplace: v.Attributes.DestinationFile.Id != 0, StorageIdentifier: storageIdentifiers[i], 
FileName: v.Name, DirectoryLabel: v.Path, - MimeType: "application/octet-stream", // default that will be replaced by Dataverse while adding/replacing the file + MimeType: mimeType, TabIngest: false, Checksum: &api.Checksum{ Type: v.Attributes.DestinationFile.HashType, @@ -228,7 +232,7 @@ func requestBody(data []byte) (io.Reader, string) { return body, writer.FormDataContentType() } -func ApiAddReplaceFile(ctx context.Context, dbId int64, id, token, user, persistentId string, wg *sync.WaitGroup, async_err *core.ErrorHolder) (io.WriteCloser, error) { +func ApiAddReplaceFile(ctx context.Context, dbId int64, id, mimeType, token, user, persistentId string, wg *sync.WaitGroup, async_err *core.ErrorHolder) (io.WriteCloser, error) { if strings.HasSuffix(id, ".zip") { // workaround: upload via SWORD api if dbId != 0 { @@ -253,7 +257,7 @@ func ApiAddReplaceFile(ctx context.Context, dbId int64, id, token, user, persist jsonDataBytes, _ := json.Marshal(jsonData) pr, pw := io.Pipe() writer := multipart.NewWriter(pw) - fw := core.NewFileWriter(filename, jsonDataBytes, writer) + fw := core.NewFileWriter(filename, mimeType, jsonDataBytes, writer) requestHeader := http.Header{} requestHeader.Add("Content-Type", writer.FormDataContentType()) diff --git a/image/app/plugin/impl/redcap2/common.go b/image/app/plugin/impl/redcap2/common.go index b4c3066..51ebe56 100644 --- a/image/app/plugin/impl/redcap2/common.go +++ b/image/app/plugin/impl/redcap2/common.go @@ -57,6 +57,9 @@ type pluginOptions struct { type generatedBundle struct { ReportID string Files map[string][]byte + // Mime holds explicit mime types for generated files (keyed by path). + // Files without an entry rely on destination-side type detection. 
+ Mime map[string]string } var ( @@ -474,24 +477,28 @@ func writeCSV(rows [][]string, delimiter rune) ([]byte, error) { type dictionary struct { fieldOrder []string // field names in dictionary order fieldType map[string]string // field_name -> field_type + fieldLabel map[string]string // field_name -> field_label labelFields map[string][]string // field_label -> field names (labels can collide) identifier map[string]bool // field_name -> tagged as identifier in REDCap validation map[string]string // field_name -> text validation type ("" = unvalidated) + choices map[string]string // field_name -> raw select_choices_or_calculations hasValidation bool // the validation column was present in the dictionary } func parseDictionary(metadataCSV []byte) dictionary { dict := dictionary{ fieldType: map[string]string{}, + fieldLabel: map[string]string{}, labelFields: map[string][]string{}, identifier: map[string]bool{}, validation: map[string]string{}, + choices: map[string]string{}, } rows, err := parseCSV(metadataCSV, ',') if err != nil || len(rows) == 0 { return dict } - nameIdx, typeIdx, labelIdx, identifierIdx, validationIdx := -1, -1, -1, -1, -1 + nameIdx, typeIdx, labelIdx, identifierIdx, validationIdx, choicesIdx := -1, -1, -1, -1, -1, -1 for i, col := range rows[0] { switch strings.ToLower(strings.TrimSpace(col)) { case "field_name": @@ -504,6 +511,8 @@ func parseDictionary(metadataCSV []byte) dictionary { identifierIdx = i case "text_validation_type_or_show_slider_number": validationIdx = i + case "select_choices_or_calculations": + choicesIdx = i } } if nameIdx < 0 { @@ -525,9 +534,16 @@ func parseDictionary(metadataCSV []byte) dictionary { if labelIdx >= 0 && labelIdx < len(row) { label := strings.TrimSpace(row[labelIdx]) if label != "" { + dict.fieldLabel[name] = label dict.labelFields[label] = append(dict.labelFields[label], name) } } + if choicesIdx >= 0 && choicesIdx < len(row) { + choices := strings.TrimSpace(row[choicesIdx]) + if choices != "" { + 
dict.choices[name] = choices + } + } if identifierIdx >= 0 && identifierIdx < len(row) { switch strings.ToLower(strings.TrimSpace(row[identifierIdx])) { case "y", "yes", "1": @@ -1265,6 +1281,17 @@ func exportCSVContent(ctx context.Context, baseURL, token, content string) ([]by return redcapRequest(ctx, baseURL, form) } +// exportProjectXML fetches the CDISC ODM project metadata (metadata only — no +// record data) for the optional project_metadata.xml sidecar. +func exportProjectXML(ctx context.Context, baseURL, token string) ([]byte, error) { + form := url.Values{} + form.Set("token", token) + form.Set("content", "project_xml") + form.Set("returnMetadataOnly", "true") + form.Set("returnFormat", "json") + return redcapRequest(ctx, baseURL, form) +} + func sanitizeReportID(reportID string) string { if reportID == "" { return "unknown" @@ -1302,7 +1329,8 @@ type manifestExtras struct { DictionaryFieldsNotExported []string TransformModes map[string]string // field -> transform mode, for echo redaction RecordIDField string - KeyFingerprint string // SHA-256 fingerprint of the HMAC key (never the key itself) + KeyFingerprint string // SHA-256 fingerprint of the HMAC key (never the key itself) + ExtraFiles map[string]string // additional manifest file entries (sidecars, ODM) } // filterLogicFieldRe extracts the field names referenced by a REDCap filter @@ -1386,6 +1414,11 @@ func makeManifest(opts pluginOptions, reportID, dataPath, metadataPath, projectI if mappingPath != "" { manifest["files"].(map[string]string)["form_event_mapping"] = mappingPath } + for key, path := range extras.ExtraFiles { + if path != "" { + manifest["files"].(map[string]string)[key] = path + } + } if extras.ProjectID != nil || extras.ProjectTitle != "" { manifest["project"] = map[string]interface{}{ @@ -1527,6 +1560,15 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp } } + // Optional CDISC ODM sidecar (metadata only); failure must not block the export. 
+ odmPath := "" + if odmBytes, odmErr := exportProjectXML(ctx, baseURL, token); odmErr != nil { + warnings = append(warnings, fmt.Sprintf("project metadata (ODM) export failed: %v", odmErr)) + } else { + odmPath = basePath + "/project_metadata.xml" + files[odmPath] = odmBytes + } + projectID, projectTitle := projectIdentity(projectInfoBytes) extras := manifestExtras{ Audit: audit, @@ -1536,6 +1578,12 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp TransformModes: plan.modes, RecordIDField: recordIDField(dict), KeyFingerprint: plan.keyFingerprint, + ExtraFiles: map[string]string{ + "project_metadata": odmPath, + "croissant": basePath + "/croissant.json", + "ro_crate": basePath + "/ro-crate-metadata.json", + "ddi_cdi": basePath + "/ddi-cdi.jsonld", + }, } // In an unfiltered flat records export, dictionary fields missing from the // output reveal server-side stripping (token export rights). With filters @@ -1572,10 +1620,20 @@ func buildExportBundle(ctx context.Context, baseURL, token string, opts pluginOp } files[basePath+"/manifest.json"] = manifestBytes + // Metadata sidecars (Croissant, RO-Crate, DDI-CDI) are rendered from one + // normalized model over the final bundle contents (incl. manifest.json). + // They never block the export; failures are logged. 
+ model := buildSidecarModel(opts, plan, dict, basePath, files, dataPath, redcapVersion, projectID, projectTitle) + mime := map[string]string{} + for _, warning := range addSidecars(model, basePath, files, mime) { + logging.Logger.Printf("redcap2: %s", warning) + } + logging.Logger.Printf("redcap2: generated %d virtual files (mode: %s, report: %s)", len(files), opts.ExportMode, reportID) return generatedBundle{ ReportID: reportID, Files: files, + Mime: mime, }, nil } diff --git a/image/app/plugin/impl/redcap2/helper_test.go b/image/app/plugin/impl/redcap2/helper_test.go index 54462c6..24b8459 100644 --- a/image/app/plugin/impl/redcap2/helper_test.go +++ b/image/app/plugin/impl/redcap2/helper_test.go @@ -24,6 +24,7 @@ const ( testEventsCSV = "event_name,arm_num,unique_event_name\nBaseline,1,baseline_arm_1\n" testMappingCSV = "arm_num,unique_event_name,form\n1,baseline_arm_1,demographics\n" testVersion = "14.5.5" + testProjectXML = `` ) // fakeRedcap is a minimal in-memory REDCap API stub. It records every form @@ -109,6 +110,8 @@ func (f *fakeRedcap) handle(w http.ResponseWriter, r *http.Request) { _, _ = w.Write([]byte(`{"project_id":1,"project_title":"Demo","is_longitudinal":"` + longitudinalFlag + `"}`)) case "version": _, _ = w.Write([]byte(testVersion)) + case "project_xml": + _, _ = w.Write([]byte(testProjectXML)) case "event": _, _ = w.Write([]byte(testEventsCSV)) case "formEventMapping": diff --git a/image/app/plugin/impl/redcap2/query.go b/image/app/plugin/impl/redcap2/query.go index 20fd847..6ce9d3b 100644 --- a/image/app/plugin/impl/redcap2/query.go +++ b/image/app/plugin/impl/redcap2/query.go @@ -55,6 +55,9 @@ func Query(ctx context.Context, req types.CompareRequest, _ map[string]tree.Node RemoteHash: md5Hex(data), RemoteHashType: types.Md5, RemoteFileSize: int64(len(data)), + // Explicit mime for generated metadata sidecars so the right + // previewers fire in Dataverse; empty for other files. 
+ MimeType: bundle.Mime[fullPath], }, } } diff --git a/image/app/plugin/impl/redcap2/query_test.go b/image/app/plugin/impl/redcap2/query_test.go index e77e5b6..6b24621 100644 --- a/image/app/plugin/impl/redcap2/query_test.go +++ b/image/app/plugin/impl/redcap2/query_test.go @@ -43,10 +43,14 @@ func TestQueryReportModeGeneratesBundle(t *testing.T) { } wantPaths := []string{ + "redcap/report-7/croissant.json", "redcap/report-7/data.csv", + "redcap/report-7/ddi-cdi.jsonld", "redcap/report-7/manifest.json", "redcap/report-7/metadata.csv", "redcap/report-7/project_info.json", + "redcap/report-7/project_metadata.xml", + "redcap/report-7/ro-crate-metadata.json", } gotPaths := make([]string, 0, len(nodes)) for path := range nodes { @@ -57,6 +61,21 @@ func TestQueryReportModeGeneratesBundle(t *testing.T) { t.Fatalf("paths = %v, want %v", gotPaths, wantPaths) } + // Sidecars carry explicit mime types so the right previewers fire; + // other files rely on destination-side detection. + wantMimes := map[string]string{ + "redcap/report-7/croissant.json": croissantMimeType, + "redcap/report-7/ddi-cdi.jsonld": ddiCdiMimeType, + "redcap/report-7/ro-crate-metadata.json": roCrateMimeType, + "redcap/report-7/data.csv": "", + "redcap/report-7/manifest.json": "", + } + for path, wantMime := range wantMimes { + if got := nodes[path].Attributes.MimeType; got != wantMime { + t.Errorf("MimeType(%s) = %q, want %q", path, got, wantMime) + } + } + node := nodes["redcap/report-7/data.csv"] if node.Name != "data.csv" || node.Path != "redcap/report-7" || !node.Attributes.IsFile { t.Errorf("unexpected node shape: %+v", node) @@ -192,8 +211,8 @@ func TestQueryLongitudinalAddsEventFiles(t *testing.T) { if err != nil { t.Fatalf("Query returned error: %v", err) } - if len(nodes) != 6 { - t.Fatalf("expected 6 files for longitudinal project, got %d", len(nodes)) + if len(nodes) != 10 { + t.Fatalf("expected 10 files for longitudinal project (incl. 
events, mapping, ODM, sidecars), got %d", len(nodes)) } events, ok := nodes["redcap/report-7/events.csv"] if !ok || events.Attributes.RemoteHash != md5Hex([]byte(testEventsCSV)) { diff --git a/image/app/plugin/impl/redcap2/sidecars.go b/image/app/plugin/impl/redcap2/sidecars.go new file mode 100644 index 0000000..cb60812 --- /dev/null +++ b/image/app/plugin/impl/redcap2/sidecars.go @@ -0,0 +1,768 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "encoding/json" + "fmt" + "sort" + "strings" +) + +// Mime types of the generated metadata sidecars. +// +// - The RO-Crate mime matches Dataverse's own filename-based detection for +// ro-crate-metadata.json (Dataverse 6.3+) and the contentType the RO-Crate +// previewer registers for. +// - The DDI-CDI mime must stay in sync with common.DdiCdiMimeType and the +// contentType in conf/dataverse/external-tools/04-cdi-previewer.json. +// - Croissant is JSON-LD; the bare application/ld+json type lets a generic +// JSON-LD previewer (conf/dataverse/external-tools/12-jsonld-previewer.json) +// pick it up. There is no Croissant-specific previewer or mime convention +// (the Croissant 1.0 spec defines no media type). +const ( + roCrateMimeType = `application/ld+json; profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"` + ddiCdiMimeType = `application/ld+json;profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://ddialliance.org/specification/ddi-cdi/1.0"` + croissantMimeType = "application/ld+json" +) + +// ddiDataTypeCV is the DDI controlled vocabulary used for variable data types, +// matching the in-repo DDI-CDI generator (cdi_generator_jsonld.py). +const ddiDataTypeCV = "http://rdf-vocabulary.ddialliance.org/cv/DataType/1.1.2/#" + +// choiceCode is one parsed entry of a REDCap select_choices definition. 
+type choiceCode struct { + Code string + Label string +} + +// sidecarVariable describes one physical column of the exported data file, +// enriched with data-dictionary information where the column maps to a field. +type sidecarVariable struct { + Column string // physical column name (or JSON key) + Field string // dictionary field name ("" for pseudo-columns) + Label string + FieldType string // REDCap field type ("" for pseudo-columns) + Validation string + Identifier bool + IsRecordID bool + Transform string // applied anonymization mode ("" if none) + Choices []choiceCode +} + +// sidecarFile describes one generated file of the export bundle. +type sidecarFile struct { + Name string // file name within the bundle folder + Description string + EncodingFormat string + MD5 string + Size int64 +} + +// sidecarModel is the normalized metadata model all three exporters render. +type sidecarModel struct { + ProjectID interface{} + ProjectTitle string + RedcapVersion string + GeneratedAt string + ExportMode string + ReportID string + DataFileName string + DataFormat string // csv | json + Delimiter string // "," or "\t" (csv only) + IsEAV bool + Files []sidecarFile + Variables []sidecarVariable + KeyFingerprint string +} + +// parseChoiceCodes parses a REDCap select_choices definition +// ("1, Male | 2, Female") into code/label pairs. +func parseChoiceCodes(raw string) []choiceCode { + res := []choiceCode{} + for _, part := range strings.Split(raw, "|") { + part = strings.TrimSpace(part) + if part == "" { + continue + } + code := part + label := "" + if i := strings.Index(part, ","); i >= 0 { + code = strings.TrimSpace(part[:i]) + label = strings.TrimSpace(part[i+1:]) + } + if code == "" { + continue + } + res = append(res, choiceCode{Code: code, Label: label}) + } + return res +} + +// variableChoices returns the parsed code list for choice-type fields only +// (the same dictionary column holds calculations for calc fields). 
+func variableChoices(dict dictionary, field string) []choiceCode { + switch dict.fieldType[field] { + case "radio", "dropdown", "checkbox": + return parseChoiceCodes(dict.choices[field]) + } + return nil +} + +// dataFileColumns extracts the physical column names (or JSON keys) of the +// processed data file. +func dataFileColumns(data []byte, opts pluginOptions) []string { + if opts.DataFormat == "csv" { + rows, err := parseCSV(data, reportDelimiter(opts)) + if err != nil || len(rows) == 0 { + return nil + } + return rows[0] + } + rows := make([]map[string]interface{}, 0) + if err := json.Unmarshal(data, &rows); err != nil { + return nil + } + seen := map[string]bool{} + keys := []string{} + for _, row := range rows { + for k := range row { + if !seen[k] { + seen[k] = true + keys = append(keys, k) + } + } + } + sort.Strings(keys) + return keys +} + +// buildSidecarVariables maps physical columns to dictionary-enriched variable +// descriptions. Columns that do not resolve to a dictionary field (record, +// redcap_event_name, ...) are kept as pseudo-columns. 
+func buildSidecarVariables(columns []string, opts pluginOptions, plan transformPlan, dict dictionary) []sidecarVariable { + labelHeaders := headersAreLabels(opts) + recordField := recordIDField(dict) + vars := make([]sidecarVariable, 0, len(columns)) + for _, col := range columns { + v := sidecarVariable{Column: col} + for _, candidate := range resolveHeaderFields(col, labelHeaders, dict) { + base := baseFieldName(candidate) + if _, ok := dict.fieldType[base]; ok { + v.Field = base + break + } + } + if v.Field != "" { + v.Label = dict.fieldLabel[v.Field] + v.FieldType = dict.fieldType[v.Field] + v.Validation = dict.validation[v.Field] + v.Identifier = dict.identifier[v.Field] + v.IsRecordID = v.Field == recordField + v.Choices = variableChoices(dict, v.Field) + v.Transform = plan.modes[v.Field] + if t := plan.modes[baseFieldName(col)]; t != "" { + v.Transform = t + } + } else if strings.EqualFold(strings.TrimSpace(col), "record") { + // The EAV linking column carries record-ID values. + v.IsRecordID = true + v.Transform = plan.modes[recordField] + } + vars = append(vars, v) + } + return vars +} + +// buildSidecarModel assembles the normalized model from the generated bundle +// context. files maps bundle paths to contents; only files under basePath are +// described. 
+func buildSidecarModel(opts pluginOptions, plan transformPlan, dict dictionary, basePath string, files map[string][]byte, dataPath, redcapVersion string, projectID interface{}, projectTitle string) sidecarModel { + model := sidecarModel{ + ProjectID: projectID, + ProjectTitle: projectTitle, + RedcapVersion: redcapVersion, + GeneratedAt: opts.GeneratedAt, + ExportMode: opts.ExportMode, + ReportID: opts.ReportID, + DataFormat: opts.DataFormat, + Delimiter: opts.CsvDelimiter, + IsEAV: isEAV(opts), + KeyFingerprint: plan.keyFingerprint, + } + + paths := make([]string, 0, len(files)) + for path := range files { + paths = append(paths, path) + } + sort.Strings(paths) + for _, path := range paths { + name := strings.TrimPrefix(path, basePath+"/") + model.Files = append(model.Files, sidecarFile{ + Name: name, + Description: bundleFileDescription(name), + EncodingFormat: bundleFileEncodingFormat(name, opts), + MD5: md5Hex(files[path]), + Size: int64(len(files[path])), + }) + } + + model.DataFileName = strings.TrimPrefix(dataPath, basePath+"/") + columns := dataFileColumns(files[dataPath], opts) + model.Variables = buildSidecarVariables(columns, opts, plan, dict) + return model +} + +func bundleFileDescription(name string) string { + switch name { + case "data.csv", "data.json": + return "Exported REDCap records" + case "metadata.csv": + return "REDCap data dictionary (exported fields)" + case "project_info.json": + return "REDCap project information" + case "events.csv": + return "REDCap events (longitudinal projects)" + case "form_event_mapping.csv": + return "REDCap form-event mapping (longitudinal projects)" + case "manifest.json": + return "Export manifest (parameters, anonymization audit, provenance)" + case "project_metadata.xml": + return "CDISC ODM project metadata (metadata only, no data)" + } + return "" +} + +func bundleFileEncodingFormat(name string, opts pluginOptions) string { + switch { + case name == "data.csv" && opts.CsvDelimiter == "\t": + return 
"text/tab-separated-values" + case strings.HasSuffix(name, ".csv"): + return "text/csv" + case strings.HasSuffix(name, ".json"): + return "application/json" + case strings.HasSuffix(name, ".xml"): + return "text/xml" + } + return "application/octet-stream" +} + +// datasetName returns a human-readable dataset name for the sidecars. +func (m sidecarModel) datasetName() string { + if m.ProjectTitle != "" { + return m.ProjectTitle + } + if m.ExportMode == "report" { + return "REDCap report " + m.ReportID + } + return "REDCap records export" +} + +func (m sidecarModel) datasetDescription() string { + scope := "all records" + if m.ExportMode == "report" { + scope = fmt.Sprintf("report %s", m.ReportID) + } + desc := fmt.Sprintf("Export of %s from REDCap project %q", scope, m.datasetName()) + if m.RedcapVersion != "" { + desc += fmt.Sprintf(" (REDCap %s)", m.RedcapVersion) + } + desc += ", generated by the rdm-integration redcap2 plugin." + if m.KeyFingerprint != "" { + desc += " Some variables are pseudonymized with HMAC-SHA256 (key fingerprint " + m.KeyFingerprint + ")." + } + return desc +} + +// variableDescription combines the dictionary label and transform note. +func variableDescription(v sidecarVariable) string { + parts := []string{} + if v.Label != "" && v.Label != v.Column { + parts = append(parts, v.Label) + } + if v.Transform != "" { + parts = append(parts, fmt.Sprintf("anonymization applied: %s", v.Transform)) + } + return strings.Join(parts, " — ") +} + +// --- Croissant 1.0 --- + +// croissantContext is the canonical Croissant 1.0 @context. 
+var croissantContext = map[string]interface{}{ + "@language": "en", + "@vocab": "https://schema.org/", + "citeAs": "cr:citeAs", + "column": "cr:column", + "conformsTo": "dct:conformsTo", + "cr": "http://mlcommons.org/croissant/", + "rai": "http://mlcommons.org/croissant/RAI/", + "data": map[string]interface{}{"@id": "cr:data", "@type": "@json"}, + "dataType": map[string]interface{}{"@id": "cr:dataType", "@type": "@vocab"}, + "dct": "http://purl.org/dc/terms/", + "examples": map[string]interface{}{"@id": "cr:examples", "@type": "@json"}, + "extract": "cr:extract", + "field": "cr:field", + "fileProperty": "cr:fileProperty", + "fileObject": "cr:fileObject", + "fileSet": "cr:fileSet", + "format": "cr:format", + "includes": "cr:includes", + "isLiveDataset": "cr:isLiveDataset", + "jsonPath": "cr:jsonPath", + "key": "cr:key", + "md5": "cr:md5", + "parentField": "cr:parentField", + "path": "cr:path", + "recordSet": "cr:recordSet", + "references": "cr:references", + "regex": "cr:regex", + "repeated": "cr:repeated", + "replace": "cr:replace", + "sc": "https://schema.org/", + "separator": "cr:separator", + "source": "cr:source", + "subField": "cr:subField", + "transform": "cr:transform", +} + +func croissantDataType(v sidecarVariable) string { + switch v.FieldType { + case "yesno", "truefalse": + return "sc:Boolean" + } + switch { + case v.Validation == "integer": + return "sc:Integer" + case v.Validation == "number" || strings.HasPrefix(v.Validation, "number_"): + return "sc:Float" + case strings.HasPrefix(v.Validation, "date_"): + return "sc:Date" + case strings.HasPrefix(v.Validation, "datetime_"): + return "sc:Date" + } + return "sc:Text" +} + +// buildCroissant renders the Croissant 1.0 metadata file. The record set is +// only emitted for CSV exports: Croissant column extraction is defined for +// delimited files. 
+func buildCroissant(m sidecarModel) ([]byte, error) { + distribution := make([]interface{}, 0, len(m.Files)) + for _, f := range m.Files { + distribution = append(distribution, map[string]interface{}{ + "@type": "cr:FileObject", + "@id": f.Name, + "name": f.Name, + "description": f.Description, + "contentUrl": f.Name, + "encodingFormat": f.EncodingFormat, + "md5": f.MD5, + }) + } + + doc := map[string]interface{}{ + "@context": croissantContext, + "@type": "sc:Dataset", + "conformsTo": "http://mlcommons.org/croissant/1.0", + "name": m.datasetName(), + "description": m.datasetDescription(), + "version": "1.0.0", + "datePublished": m.GeneratedAt, + "distribution": distribution, + } + + if m.DataFormat == "csv" && len(m.Variables) > 0 { + fields := make([]interface{}, 0, len(m.Variables)) + for _, v := range m.Variables { + field := map[string]interface{}{ + "@type": "cr:Field", + "@id": "records/" + v.Column, + "name": v.Column, + "dataType": croissantDataType(v), + "source": map[string]interface{}{ + "fileObject": map[string]interface{}{"@id": m.DataFileName}, + "extract": map[string]interface{}{"column": v.Column}, + }, + } + if desc := variableDescription(v); desc != "" { + field["description"] = desc + } + fields = append(fields, field) + } + doc["recordSet"] = []interface{}{ + map[string]interface{}{ + "@type": "cr:RecordSet", + "@id": "records", + "name": "records", + "field": fields, + }, + } + } + + return json.MarshalIndent(doc, "", " ") +} + +// --- RO-Crate 1.2 --- + +// buildROCrate renders a detached RO-Crate 1.2 metadata file describing the +// bundle folder, with Process Run Crate style provenance (a CreateAction with +// the plugin as instrument). 
+func buildROCrate(m sidecarModel) ([]byte, error) { + hasPart := make([]interface{}, 0, len(m.Files)) + results := make([]interface{}, 0, len(m.Files)) + graph := []interface{}{} + + graph = append(graph, map[string]interface{}{ + "@id": "ro-crate-metadata.json", + "@type": "CreativeWork", + "conformsTo": map[string]interface{}{"@id": "https://w3id.org/ro/crate/1.2"}, + "about": map[string]interface{}{"@id": "./"}, + "description": "RO-Crate metadata for a REDCap export generated by the rdm-integration redcap2 plugin", + }) + + for _, f := range m.Files { + hasPart = append(hasPart, map[string]interface{}{"@id": f.Name}) + results = append(results, map[string]interface{}{"@id": f.Name}) + } + + rootDataset := map[string]interface{}{ + "@id": "./", + "@type": "Dataset", + "name": m.datasetName(), + "description": m.datasetDescription(), + "datePublished": m.GeneratedAt, + "hasPart": hasPart, + "mentions": map[string]interface{}{"@id": "#export-action"}, + } + if m.ProjectID != nil { + rootDataset["identifier"] = fmt.Sprintf("redcap-project-%v", m.ProjectID) + } + graph = append(graph, rootDataset) + + for _, f := range m.Files { + fileNode := map[string]interface{}{ + "@id": f.Name, + "@type": "File", + "name": f.Name, + "encodingFormat": f.EncodingFormat, + "contentSize": fmt.Sprint(f.Size), + "md5": f.MD5, + } + if f.Description != "" { + fileNode["description"] = f.Description + } + graph = append(graph, fileNode) + } + + // Process Run Crate provenance: the export run and its instruments. 
+ action := map[string]interface{}{ + "@id": "#export-action", + "@type": "CreateAction", + "name": "REDCap export", + "instrument": map[string]interface{}{"@id": "#rdm-integration-redcap2"}, + "result": results, + "endTime": m.GeneratedAt, + "description": fmt.Sprintf( + "Files generated from the REDCap API (export mode: %s) with client-side anonymization applied as documented in manifest.json", + m.ExportMode), + } + graph = append(graph, action) + graph = append(graph, map[string]interface{}{ + "@id": "#rdm-integration-redcap2", + "@type": "SoftwareApplication", + "name": "rdm-integration redcap2 plugin", + "url": "https://github.com/libis/rdm-integration", + }) + if m.RedcapVersion != "" { + action["object"] = map[string]interface{}{"@id": "#redcap"} + graph = append(graph, map[string]interface{}{ + "@id": "#redcap", + "@type": "SoftwareApplication", + "name": "REDCap", + "softwareVersion": m.RedcapVersion, + }) + } + + doc := map[string]interface{}{ + "@context": "https://w3id.org/ro/crate/1.2/context", + "@graph": graph, + } + return json.MarshalIndent(doc, "", " ") +} + +// --- DDI-CDI 1.0 --- + +// ddiCdiContext matches the in-repo DDI-CDI generator (cdi_generator_jsonld.py) +// whose output validates against the official DDI-CDI 1.0 SHACL shapes used by +// the cdi-viewer previewer. 
+const ddiCdiContext = "https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld" + +func ddiCdiDataType(v sidecarVariable) string { + switch v.FieldType { + case "yesno", "truefalse": + return ddiDataTypeCV + "Boolean" + } + switch { + case v.Validation == "integer": + return ddiDataTypeCV + "Integer" + case v.Validation == "number" || strings.HasPrefix(v.Validation, "number_"): + return ddiDataTypeCV + "Double" + case strings.HasPrefix(v.Validation, "datetime_"): + return ddiDataTypeCV + "DateTime" + case strings.HasPrefix(v.Validation, "date_"): + return ddiDataTypeCV + "Date" + } + return ddiDataTypeCV + "String" +} + +func ddiCdiComponentType(v sidecarVariable) string { + switch { + case v.IsRecordID: + return "IdentifierComponent" + case len(v.Choices) > 0 || v.FieldType == "yesno" || v.FieldType == "truefalse": + return "DimensionComponent" + case v.Validation == "integer" || v.Validation == "number" || strings.HasPrefix(v.Validation, "number_"): + return "MeasureComponent" + } + return "AttributeComponent" +} + +// safeFragment converts a column name into a JSON-LD fragment identifier. +func safeFragment(name string) string { + var b strings.Builder + for _, r := range name { + switch { + case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_', r == '-': + b.WriteRune(r) + default: + b.WriteRune('_') + } + } + if b.Len() == 0 { + return "_" + } + return b.String() +} + +// buildDDICDI renders a DDI-CDI 1.0 JSON-LD description of the data file, +// mirroring the structure of the in-repo generator: WideDataSet, +// WideDataStructure, LogicalRecord, InstanceVariables with value domains and +// code lists, and (for CSV) a PhysicalSegmentLayout with value mappings. 
+func buildDDICDI(m sidecarModel) ([]byte, error) { + graph := []interface{}{} + componentIDs := []interface{}{} + variableIDs := []interface{}{} + valueMappings := []interface{}{} + valueMappingPositions := []interface{}{} + primaryKeyComponent := "" + + used := map[string]int{} + position := 1 + for _, v := range m.Variables { + frag := safeFragment(v.Column) + if n, ok := used[frag]; ok { + used[frag] = n + 1 + frag = fmt.Sprintf("%s_%d", frag, n+1) + } else { + used[frag] = 0 + } + varID := "#" + frag + domainID := "#" + frag + "_Substantive_Value_Domain" + componentID := "#" + frag + "_Component" + mappingID := "#valueMapping_" + frag + mappingPosID := "#ValueMappingPosition_" + frag + + variableIDs = append(variableIDs, varID) + componentIDs = append(componentIDs, componentID) + + // Code list from the REDCap choices definition. + codeListID := "" + if len(v.Choices) > 0 { + codeListID = "#" + frag + "_CodeList" + codeIDs := []interface{}{} + for _, c := range v.Choices { + codeID := "#" + frag + "_Code_" + safeFragment(c.Code) + codeIDs = append(codeIDs, codeID) + codeNode := map[string]interface{}{ + "@id": codeID, + "@type": "Code", + "identifier": c.Code, + } + if c.Label != "" { + codeNode["name"] = c.Label + } + graph = append(graph, codeNode) + } + label := v.Label + if label == "" { + label = v.Column + } + graph = append(graph, map[string]interface{}{ + "@id": codeListID, + "@type": "CodeList", + "name": label + " codes", + "has_Code": codeIDs, + }) + } + + dataType := ddiCdiDataType(v) + domainNode := map[string]interface{}{ + "@id": domainID, + "@type": "SubstantiveValueDomain", + "recommendedDataType": map[string]interface{}{ + "@type": "ControlledVocabularyEntry", + "entryValue": strings.TrimPrefix(dataType, ddiDataTypeCV), + "vocabulary": map[string]interface{}{ + "@type": "Reference", + "uri": ddiDataTypeCV, + }, + }, + } + if codeListID != "" { + domainNode["takesValuesFrom"] = codeListID + } + graph = append(graph, domainNode) + + name := 
v.Label + if name == "" { + name = v.Column + } + varNode := map[string]interface{}{ + "@id": varID, + "@type": "InstanceVariable", + "name": map[string]interface{}{ + "@type": "ObjectName", + "name": name, + }, + "takesSubstantiveValuesFrom_SubstantiveValueDomain": domainID, + } + definition := "Column: " + v.Column + if v.Transform != "" { + definition += " (anonymization applied: " + v.Transform + ")" + } + varNode["definition"] = map[string]interface{}{ + "@type": "InternationalString", + "languageSpecificString": map[string]interface{}{ + "@type": "LanguageString", + "content": definition, + }, + } + if m.DataFormat == "csv" { + varNode["has_ValueMapping"] = mappingID + valueMappings = append(valueMappings, mappingID) + valueMappingPositions = append(valueMappingPositions, mappingPosID) + graph = append(graph, map[string]interface{}{ + "@id": mappingID, + "@type": "ValueMapping", + "defaultValue": "", + }) + graph = append(graph, map[string]interface{}{ + "@id": mappingPosID, + "@type": "ValueMappingPosition", + "indexes": mappingID, + "value": position, + }) + position++ + } + graph = append(graph, varNode) + + componentType := ddiCdiComponentType(v) + if componentType == "IdentifierComponent" && primaryKeyComponent == "" { + primaryKeyComponent = componentID + } + graph = append(graph, map[string]interface{}{ + "@id": componentID, + "@type": componentType, + "isDefinedBy_RepresentedVariable": varID, + }) + } + + datasetID := "#" + safeFragment(m.datasetName()) + graph = append(graph, map[string]interface{}{ + "@id": datasetID, + "@type": "WideDataSet", + "isStructuredBy": "#datastructure", + }) + graph = append(graph, map[string]interface{}{ + "@id": "#datastructure", + "@type": "WideDataStructure", + "has_DataStructureComponent": componentIDs, + }) + graph = append(graph, map[string]interface{}{ + "@id": "#logicalRecord", + "@type": "LogicalRecord", + "organizes": datasetID, + "has_InstanceVariable": variableIDs, + }) + if primaryKeyComponent != "" { + 
graph = append(graph, map[string]interface{}{ + "@id": "#primaryKey", + "@type": "PrimaryKey", + "isComposedOf": "#primaryKeyComponent", + }) + graph = append(graph, map[string]interface{}{ + "@id": "#primaryKeyComponent", + "@type": "PrimaryKeyComponent", + "correspondsTo": primaryKeyComponent, + }) + } + if m.DataFormat == "csv" { + delimiter := "," + if m.Delimiter == "\t" { + delimiter = "\\t" + } + graph = append(graph, map[string]interface{}{ + "@id": "#physicalSegmentLayout", + "@type": "PhysicalSegmentLayout", + "formats": "#logicalRecord", + "allowsDuplicates": true, + "isDelimited": true, + "isFixedWidth": false, + "hasHeader": true, + "headerRowCount": 1, + "delimiter": delimiter, + "has_ValueMapping": valueMappings, + "has_ValueMappingPosition": valueMappingPositions, + }) + } + + doc := map[string]interface{}{ + "@context": ddiCdiContext, + "@graph": graph, + } + return json.MarshalIndent(doc, "", " ") +} + +// addSidecars generates the three metadata sidecars from the normalized model +// and registers them (with explicit mime types) in the bundle file map. +// Sidecar generation must never fail the export: errors come back as warnings. 
+func addSidecars(m sidecarModel, basePath string, files map[string][]byte, mime map[string]string) []string { + warnings := []string{} + + if data, err := buildCroissant(m); err != nil { + warnings = append(warnings, fmt.Sprintf("croissant generation failed: %v", err)) + } else { + path := basePath + "/croissant.json" + files[path] = data + mime[path] = croissantMimeType + } + + if data, err := buildROCrate(m); err != nil { + warnings = append(warnings, fmt.Sprintf("ro-crate generation failed: %v", err)) + } else { + path := basePath + "/ro-crate-metadata.json" + files[path] = data + mime[path] = roCrateMimeType + } + + if data, err := buildDDICDI(m); err != nil { + warnings = append(warnings, fmt.Sprintf("ddi-cdi generation failed: %v", err)) + } else { + path := basePath + "/ddi-cdi.jsonld" + files[path] = data + mime[path] = ddiCdiMimeType + } + + return warnings +} diff --git a/image/app/plugin/impl/redcap2/sidecars_test.go b/image/app/plugin/impl/redcap2/sidecars_test.go new file mode 100644 index 0000000..dafacdb --- /dev/null +++ b/image/app/plugin/impl/redcap2/sidecars_test.go @@ -0,0 +1,315 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). 
Apache 2.0 License + +package redcap2 + +import ( + "encoding/json" + "reflect" + "strings" + "testing" +) + +const sidecarTestMetadataCSV = "field_name,form_name,field_type,field_label,select_choices_or_calculations,identifier,text_validation_type_or_show_slider_number\n" + + "record_id,demographics,text,Record ID,,,\n" + + "name,demographics,text,Full Name,,y,\n" + + "age,demographics,text,Age,,,integer\n" + + "weight,demographics,text,Weight,,,number\n" + + "sex,demographics,radio,Sex,\"1, Male | 2, Female\",,\n" + + "consent,demographics,yesno,Consent given,,,\n" + + "visit_date,demographics,text,Visit Date,,,date_ymd\n" + +func sidecarTestModel(t *testing.T, opts pluginOptions, plan transformPlan, dataCSV string) sidecarModel { + t.Helper() + dict := parseDictionary([]byte(sidecarTestMetadataCSV)) + files := map[string][]byte{ + "redcap/records/data.csv": []byte(dataCSV), + "redcap/records/metadata.csv": []byte(sidecarTestMetadataCSV), + "redcap/records/project_info.json": []byte(`{"project_id":1,"project_title":"Demo"}`), + "redcap/records/manifest.json": []byte(`{"plugin":"redcap2"}`), + } + return buildSidecarModel(opts, plan, dict, "redcap/records", files, "redcap/records/data.csv", "14.5.5", float64(1), "Demo") +} + +func TestParseChoiceCodes(t *testing.T) { + got := parseChoiceCodes("1, Male | 2, Female | 3, Other, with comma ") + want := []choiceCode{ + {Code: "1", Label: "Male"}, + {Code: "2", Label: "Female"}, + {Code: "3", Label: "Other, with comma"}, + } + if !reflect.DeepEqual(got, want) { + t.Fatalf("parseChoiceCodes = %+v, want %+v", got, want) + } + if got := parseChoiceCodes(""); len(got) != 0 { + t.Errorf("empty choices should parse to nothing, got %+v", got) + } +} + +func TestBuildSidecarVariables(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + plan := testPlan(map[string]string{"name": "blank"}) + model := sidecarTestModel(t, opts, plan, + "record_id,name,age,sex,consent,visit_date\n1,John,34,1,1,2026-01-01\n") 
+ + byColumn := map[string]sidecarVariable{} + for _, v := range model.Variables { + byColumn[v.Column] = v + } + if !byColumn["record_id"].IsRecordID { + t.Error("record_id must be flagged as record ID") + } + if byColumn["name"].Transform != "blank" || !byColumn["name"].Identifier { + t.Errorf("name = %+v, want identifier with blank transform", byColumn["name"]) + } + if len(byColumn["sex"].Choices) != 2 || byColumn["sex"].Choices[0].Label != "Male" { + t.Errorf("sex choices = %+v", byColumn["sex"].Choices) + } + if byColumn["age"].Validation != "integer" { + t.Errorf("age validation = %q", byColumn["age"].Validation) + } +} + +func TestBuildCroissant(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + plan := testPlan(map[string]string{"name": "pseudonymize"}) + plan.keyFingerprint = "abcdef0123456789" + model := sidecarTestModel(t, opts, plan, + "record_id,name,age,weight,sex,consent,visit_date\n1,x,34,70.5,1,1,2026-01-01\n") + + data, err := buildCroissant(model) + if err != nil { + t.Fatalf("buildCroissant returned error: %v", err) + } + doc := map[string]interface{}{} + if err := json.Unmarshal(data, &doc); err != nil { + t.Fatalf("croissant.json is invalid JSON: %v", err) + } + if doc["conformsTo"] != "http://mlcommons.org/croissant/1.0" { + t.Errorf("conformsTo = %v", doc["conformsTo"]) + } + if doc["@type"] != "sc:Dataset" || doc["name"] != "Demo" { + t.Errorf("type/name = %v/%v", doc["@type"], doc["name"]) + } + if !strings.Contains(doc["description"].(string), "abcdef0123456789") { + t.Error("description should mention the pseudonymization key fingerprint") + } + + distribution := doc["distribution"].([]interface{}) + if len(distribution) != len(model.Files) { + t.Errorf("distribution has %d entries, want %d", len(distribution), len(model.Files)) + } + first := distribution[0].(map[string]interface{}) + if first["@type"] != "cr:FileObject" || first["md5"] == "" { + t.Errorf("distribution entry = %v", first) + } + + recordSets 
:= doc["recordSet"].([]interface{}) + fields := recordSets[0].(map[string]interface{})["field"].([]interface{}) + dataTypes := map[string]string{} + for _, f := range fields { + field := f.(map[string]interface{}) + dataTypes[field["name"].(string)] = field["dataType"].(string) + } + want := map[string]string{ + "age": "sc:Integer", + "weight": "sc:Float", + "consent": "sc:Boolean", + "visit_date": "sc:Date", + "sex": "sc:Text", + "name": "sc:Text", + } + for name, wantType := range want { + if dataTypes[name] != wantType { + t.Errorf("dataType(%s) = %q, want %q", name, dataTypes[name], wantType) + } + } +} + +func TestBuildCroissantJSONExportHasNoRecordSet(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records","dataFormat":"json"}`) + model := sidecarTestModel(t, opts, transformPlan{}, `[{"record_id":"1","age":"34"}]`) + model.DataFormat = "json" + + data, err := buildCroissant(model) + if err != nil { + t.Fatalf("buildCroissant returned error: %v", err) + } + doc := map[string]interface{}{} + _ = json.Unmarshal(data, &doc) + if _, ok := doc["recordSet"]; ok { + t.Error("JSON exports must not declare a CSV-column record set") + } +} + +func TestBuildROCrate(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + model := sidecarTestModel(t, opts, transformPlan{}, + "record_id,age\n1,34\n") + + data, err := buildROCrate(model) + if err != nil { + t.Fatalf("buildROCrate returned error: %v", err) + } + doc := map[string]interface{}{} + if err := json.Unmarshal(data, &doc); err != nil { + t.Fatalf("ro-crate-metadata.json is invalid JSON: %v", err) + } + if doc["@context"] != "https://w3id.org/ro/crate/1.2/context" { + t.Errorf("@context = %v", doc["@context"]) + } + + byID := map[string]map[string]interface{}{} + for _, entry := range doc["@graph"].([]interface{}) { + node := entry.(map[string]interface{}) + byID[node["@id"].(string)] = node + } + + descriptor := byID["ro-crate-metadata.json"] + if descriptor == nil || 
descriptor["conformsTo"].(map[string]interface{})["@id"] != "https://w3id.org/ro/crate/1.2" { + t.Errorf("metadata descriptor = %v", descriptor) + } + root := byID["./"] + if root == nil || root["name"] != "Demo" || root["datePublished"] == "" { + t.Errorf("root dataset = %v", root) + } + hasPart := root["hasPart"].([]interface{}) + if len(hasPart) != len(model.Files) { + t.Errorf("hasPart has %d entries, want %d", len(hasPart), len(model.Files)) + } + if byID["data.csv"] == nil || byID["data.csv"]["encodingFormat"] != "text/csv" { + t.Errorf("data.csv file entity = %v", byID["data.csv"]) + } + action := byID["#export-action"] + if action == nil || action["@type"] != "CreateAction" || + action["instrument"].(map[string]interface{})["@id"] != "#rdm-integration-redcap2" { + t.Errorf("provenance action = %v", action) + } + if byID["#rdm-integration-redcap2"] == nil || byID["#redcap"] == nil { + t.Error("software application entities missing") + } +} + +func TestBuildDDICDI(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + model := sidecarTestModel(t, opts, transformPlan{}, + "record_id,age,sex\n1,34,1\n") + + data, err := buildDDICDI(model) + if err != nil { + t.Fatalf("buildDDICDI returned error: %v", err) + } + doc := map[string]interface{}{} + if err := json.Unmarshal(data, &doc); err != nil { + t.Fatalf("ddi-cdi.jsonld is invalid JSON: %v", err) + } + if doc["@context"] != ddiCdiContext { + t.Errorf("@context = %v", doc["@context"]) + } + + byType := map[string][]map[string]interface{}{} + byID := map[string]map[string]interface{}{} + for _, entry := range doc["@graph"].([]interface{}) { + node := entry.(map[string]interface{}) + if typeName, ok := node["@type"].(string); ok { + byType[typeName] = append(byType[typeName], node) + } + if id, ok := node["@id"].(string); ok { + byID[id] = node + } + } + + if len(byType["WideDataSet"]) != 1 || len(byType["WideDataStructure"]) != 1 || len(byType["LogicalRecord"]) != 1 { + t.Fatalf("missing 
structural nodes: %v", byType) + } + if len(byType["InstanceVariable"]) != 3 { + t.Errorf("InstanceVariables = %d, want 3", len(byType["InstanceVariable"])) + } + if len(byType["IdentifierComponent"]) != 1 { + t.Errorf("IdentifierComponents = %d, want 1 (record_id)", len(byType["IdentifierComponent"])) + } + if len(byType["PrimaryKey"]) != 1 || len(byType["PrimaryKeyComponent"]) != 1 { + t.Error("primary key nodes missing") + } + // sex has a code list with two codes + if len(byType["CodeList"]) != 1 || len(byType["Code"]) != 2 { + t.Errorf("CodeList/Code = %d/%d, want 1/2", len(byType["CodeList"]), len(byType["Code"])) + } + // CSV exports describe the physical layout + layout := byID["#physicalSegmentLayout"] + if layout == nil || layout["isDelimited"] != true || layout["delimiter"] != "," { + t.Errorf("physical layout = %v", layout) + } + if len(byType["ValueMapping"]) != 3 || len(byType["ValueMappingPosition"]) != 3 { + t.Errorf("value mappings = %d/%d, want 3/3", len(byType["ValueMapping"]), len(byType["ValueMappingPosition"])) + } + // age is numeric -> Integer datatype in the DDI CV + domain := byID["#age_Substantive_Value_Domain"] + entry := domain["recommendedDataType"].(map[string]interface{}) + if entry["entryValue"] != "Integer" { + t.Errorf("age datatype = %v", entry["entryValue"]) + } +} + +func TestBuildDDICDIJSONExportSkipsPhysicalLayout(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records","dataFormat":"json"}`) + model := sidecarTestModel(t, opts, transformPlan{}, `[{"record_id":"1","age":"34"}]`) + model.DataFormat = "json" + + data, err := buildDDICDI(model) + if err != nil { + t.Fatalf("buildDDICDI returned error: %v", err) + } + if strings.Contains(string(data), "PhysicalSegmentLayout") || strings.Contains(string(data), "ValueMapping") { + t.Error("JSON exports must not describe a delimited physical layout") + } +} + +// End to end: the bundle contains the three sidecars and the ODM file, they +// are valid JSON/XML, dropped 
variables are absent, and generation is +// deterministic (same input, same bytes). +func TestEndToEndSidecars(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + pluginOpts := `{ + "exportMode": "records", + "variables": [{"name": "email", "anonymization": "drop"}] + }` + croissant, manifest := queryAndRead(t, f, pluginOpts, "redcap/records/croissant.json") + + doc := map[string]interface{}{} + if err := json.Unmarshal(croissant, &doc); err != nil { + t.Fatalf("croissant.json invalid: %v", err) + } + if strings.Contains(string(croissant), `"email"`) { + t.Error("dropped variable must not appear in croissant.json") + } + + files := manifest["files"].(map[string]interface{}) + for _, key := range []string{"croissant", "ro_crate", "ddi_cdi", "project_metadata"} { + if files[key] == nil { + t.Errorf("manifest files missing %s: %v", key, files) + } + } + + roCrate, _ := queryAndRead(t, f, pluginOpts, "redcap/records/ro-crate-metadata.json") + if err := json.Unmarshal(roCrate, &map[string]interface{}{}); err != nil { + t.Fatalf("ro-crate-metadata.json invalid: %v", err) + } + ddiCdi, _ := queryAndRead(t, f, pluginOpts, "redcap/records/ddi-cdi.jsonld") + if err := json.Unmarshal(ddiCdi, &map[string]interface{}{}); err != nil { + t.Fatalf("ddi-cdi.jsonld invalid: %v", err) + } + odm, _ := queryAndRead(t, f, pluginOpts, "redcap/records/project_metadata.xml") + if !strings.Contains(string(odm), " Date: Thu, 11 Jun 2026 23:48:07 +0200 Subject: [PATCH 08/25] redcap2 Phase 5: Metadata() hook for Dataverse citation prefill Maps REDCap project info onto the citation block used when creating a new dataset: title <- project_title, description <- project_notes + purpose_other, author <- principal investigator, grant number, IRB number as OtherId, and a urn:redcap project reference. The generic metadata-selector frontend flow picks this up without changes. 
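For review purposes, the mapping described in this commit message can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the plugin code: the `citation` and `otherID` types and the `mapProjectInfo` function are hypothetical stand-ins for `types.MetadataStruct` and the `Metadata()` hook, and the sample payload and URL are made up. Only the REDCap project-info keys, the "Family, Given" author convention, and the `urn:redcap` reference format are taken from this patch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// otherID and citation are hypothetical, simplified stand-ins for the
// Dataverse citation fields that the real hook fills in types.MetadataStruct.
type otherID struct {
	Agency string
	Value  string
}

type citation struct {
	Title        string
	Descriptions []string
	Author       string
	GrantNumber  string
	OtherIDs     []otherID
}

// field returns the trimmed string stored under key, or "" when the key is
// absent or holds a non-string value.
func field(info map[string]interface{}, key string) string {
	if s, ok := info[key].(string); ok {
		return strings.TrimSpace(s)
	}
	return ""
}

// citationName renders "Family, Given" with either part optional.
func citationName(first, last string) string {
	switch {
	case first == "":
		return last
	case last == "":
		return first
	}
	return last + ", " + first
}

// mapProjectInfo (hypothetical name) applies the mapping listed in the commit
// message to a content=project JSON object.
func mapProjectInfo(payload []byte, baseURL string) (citation, error) {
	var info map[string]interface{}
	if err := json.Unmarshal(payload, &info); err != nil {
		return citation{}, fmt.Errorf("parse project info: %w", err)
	}
	c := citation{Title: field(info, "project_title")}
	if notes := field(info, "project_notes"); notes != "" {
		c.Descriptions = append(c.Descriptions, notes)
	}
	if purpose := field(info, "purpose_other"); purpose != "" {
		c.Descriptions = append(c.Descriptions, "Purpose: "+purpose)
	}
	c.Author = citationName(field(info, "project_pi_firstname"), field(info, "project_pi_lastname"))
	c.GrantNumber = field(info, "project_grant_number")
	if irb := field(info, "project_irb_number"); irb != "" {
		c.OtherIDs = append(c.OtherIDs, otherID{Agency: "IRB", Value: irb})
	}
	if id, ok := info["project_id"]; ok && id != nil {
		c.OtherIDs = append(c.OtherIDs, otherID{
			Agency: "REDCap",
			Value:  fmt.Sprintf("urn:redcap:%s:project:%v", baseURL, id),
		})
	}
	return c, nil
}

func main() {
	// Made-up sample payload; real REDCap responses carry more keys.
	payload := []byte(`{
		"project_id": 42,
		"project_title": "Demo",
		"project_notes": "Pilot cohort",
		"project_pi_firstname": "Ada",
		"project_pi_lastname": "Lovelace",
		"project_grant_number": "G-001",
		"project_irb_number": "IRB-7"
	}`)
	c, err := mapProjectInfo(payload, "https://redcap.example.org/api/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", c)
}
```

One thing worth checking in review: a numeric `project_id` decoded into `interface{}` becomes a `float64`, and `%v` switches to exponent notation for values of one million and above. If such IDs can occur, decoding with `json.Decoder.UseNumber` would keep the digits intact.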
--- image/app/plugin/impl/redcap2/helper_test.go | 5 + image/app/plugin/impl/redcap2/metadata.go | 91 ++++++++++++++++++ .../app/plugin/impl/redcap2/metadata_test.go | 95 +++++++++++++++++++ image/app/plugin/registry.go | 9 +- 4 files changed, 196 insertions(+), 4 deletions(-) create mode 100644 image/app/plugin/impl/redcap2/metadata.go create mode 100644 image/app/plugin/impl/redcap2/metadata_test.go diff --git a/image/app/plugin/impl/redcap2/helper_test.go b/image/app/plugin/impl/redcap2/helper_test.go index 24b8459..879498a 100644 --- a/image/app/plugin/impl/redcap2/helper_test.go +++ b/image/app/plugin/impl/redcap2/helper_test.go @@ -37,6 +37,7 @@ type fakeRedcap struct { longitudinal bool failReport bool metadataCSV string // overrides testMetadataCSV when set + projectJSON string // overrides the default project info payload when set dataCSV string // overrides testDataCSV when set dataJSON string // overrides testDataJSON when set eavCSV string // served for type=eav csv requests when set @@ -103,6 +104,10 @@ func (f *fakeRedcap) handle(w http.ResponseWriter, r *http.Request) { } _, _ = w.Write([]byte(metadata)) case "project": + if f.projectJSON != "" { + _, _ = w.Write([]byte(f.projectJSON)) + return + } longitudinalFlag := "0" if longitudinal { longitudinalFlag = "1" diff --git a/image/app/plugin/impl/redcap2/metadata.go b/image/app/plugin/impl/redcap2/metadata.go new file mode 100644 index 0000000..a25f5a0 --- /dev/null +++ b/image/app/plugin/impl/redcap2/metadata.go @@ -0,0 +1,91 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "context" + "encoding/json" + "fmt" + "integration/app/plugin/types" + "strings" +) + +// Metadata maps REDCap project information onto the Dataverse citation +// metadata used to prefill a new dataset: title, description (project notes), +// principal investigator, grant number, IRB number and the REDCap project id. 
+func Metadata(ctx context.Context, streamParams types.StreamParams) (types.MetadataStruct, error) { + if streamParams.Url == "" || streamParams.Token == "" { + return types.MetadataStruct{}, fmt.Errorf("metadata: missing parameters: expected url, token") + } + + payload, _, err := exportProjectInfo(ctx, streamParams.Url, streamParams.Token) + if err != nil { + return types.MetadataStruct{}, err + } + info := projectInfoMap(payload) + + res := types.MetadataStruct{ + Title: stringField(info, "project_title"), + } + + if notes := stringField(info, "project_notes"); notes != "" { + res.DsDescription = append(res.DsDescription, notes) + } + if purpose := stringField(info, "purpose_other"); purpose != "" { + res.DsDescription = append(res.DsDescription, "Purpose: "+purpose) + } + + firstName := stringField(info, "project_pi_firstname") + lastName := stringField(info, "project_pi_lastname") + if piName := citationName(firstName, lastName); piName != "" { + res.Author = []types.Author{{AuthorName: piName}} + } + + if grant := stringField(info, "project_grant_number"); grant != "" { + res.GrantNumber = []types.GrantNumber{{GrantNumberValue: grant}} + } + + if irb := stringField(info, "project_irb_number"); irb != "" { + res.OtherId = append(res.OtherId, types.OtherId{OtherIdAgency: "IRB", OtherIdValue: irb}) + } + if id, ok := info["project_id"]; ok && id != nil { + res.OtherId = append(res.OtherId, types.OtherId{ + OtherIdAgency: "REDCap", + OtherIdValue: fmt.Sprintf("urn:redcap:%s:project:%v", streamParams.Url, id), + }) + } + + return res, nil +} + +// projectInfoMap parses a content=project JSON payload (object or +// single-element array form) into a generic map. 
+func projectInfoMap(payload []byte) map[string]interface{} { + var obj map[string]interface{} + if err := json.Unmarshal(payload, &obj); err == nil { + return obj + } + var arr []map[string]interface{} + if err := json.Unmarshal(payload, &arr); err == nil && len(arr) > 0 { + return arr[0] + } + return map[string]interface{}{} +} + +func stringField(info map[string]interface{}, key string) string { + if s, ok := info[key].(string); ok { + return strings.TrimSpace(s) + } + return "" +} + +// citationName renders "Family, Given" with either part optional. +func citationName(firstName, lastName string) string { + switch { + case firstName == "": + return lastName + case lastName == "": + return firstName + } + return lastName + ", " + firstName +} diff --git a/image/app/plugin/impl/redcap2/metadata_test.go b/image/app/plugin/impl/redcap2/metadata_test.go new file mode 100644 index 0000000..b1fbc63 --- /dev/null +++ b/image/app/plugin/impl/redcap2/metadata_test.go @@ -0,0 +1,95 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). 
Apache 2.0 License + +package redcap2 + +import ( + "context" + "integration/app/plugin/types" + "reflect" + "strings" + "testing" +) + +func TestMetadataRequiresUrlAndToken(t *testing.T) { + if _, err := Metadata(context.Background(), types.StreamParams{}); err == nil { + t.Fatal("expected error for missing url and token") + } +} + +func TestMetadataMapsProjectInfo(t *testing.T) { + f := newFakeRedcap() + f.projectJSON = `{ + "project_id": 42, + "project_title": "Hypertension Cohort", + "project_notes": "Longitudinal cohort of hypertension patients.", + "purpose": "2", + "purpose_other": "Research on blood pressure", + "project_pi_firstname": "Ada", + "project_pi_lastname": "Lovelace", + "project_irb_number": "IRB-2026-007", + "project_grant_number": "G0A1234N", + "is_longitudinal": "0" + }` + defer f.close() + + meta, err := Metadata(context.Background(), types.StreamParams{Url: f.url(), Token: "tok"}) + if err != nil { + t.Fatalf("Metadata returned error: %v", err) + } + + if meta.Title != "Hypertension Cohort" { + t.Errorf("Title = %q", meta.Title) + } + wantDescription := []string{ + "Longitudinal cohort of hypertension patients.", + "Purpose: Research on blood pressure", + } + if !reflect.DeepEqual(meta.DsDescription, wantDescription) { + t.Errorf("DsDescription = %v, want %v", meta.DsDescription, wantDescription) + } + if len(meta.Author) != 1 || meta.Author[0].AuthorName != "Lovelace, Ada" { + t.Errorf("Author = %v, want PI Lovelace, Ada", meta.Author) + } + if len(meta.GrantNumber) != 1 || meta.GrantNumber[0].GrantNumberValue != "G0A1234N" { + t.Errorf("GrantNumber = %v", meta.GrantNumber) + } + if len(meta.OtherId) != 2 || meta.OtherId[0].OtherIdAgency != "IRB" || meta.OtherId[0].OtherIdValue != "IRB-2026-007" { + t.Errorf("OtherId = %v", meta.OtherId) + } + if !strings.Contains(meta.OtherId[1].OtherIdValue, "project:42") || meta.OtherId[1].OtherIdAgency != "REDCap" { + t.Errorf("REDCap project id reference = %v", meta.OtherId[1]) + } +} + +func 
TestMetadataMinimalProjectInfo(t *testing.T) { + f := newFakeRedcap() + defer f.close() + + meta, err := Metadata(context.Background(), types.StreamParams{Url: f.url(), Token: "tok"}) + if err != nil { + t.Fatalf("Metadata returned error: %v", err) + } + if meta.Title != "Demo" { + t.Errorf("Title = %q, want Demo", meta.Title) + } + if len(meta.Author) != 0 || len(meta.GrantNumber) != 0 || len(meta.DsDescription) != 0 { + t.Errorf("minimal project should map only title and project id, got %+v", meta) + } + if len(meta.OtherId) != 1 || meta.OtherId[0].OtherIdAgency != "REDCap" { + t.Errorf("OtherId = %v, want only the REDCap project reference", meta.OtherId) + } +} + +func TestMetadataLastNameOnlyPI(t *testing.T) { + f := newFakeRedcap() + f.projectJSON = `{"project_id":1,"project_title":"T","project_pi_lastname":"Curie"}` + defer f.close() + + meta, err := Metadata(context.Background(), types.StreamParams{Url: f.url(), Token: "tok"}) + if err != nil { + t.Fatalf("Metadata returned error: %v", err) + } + if len(meta.Author) != 1 || meta.Author[0].AuthorName != "Curie" { + t.Errorf("Author = %v, want Curie", meta.Author) + } +} diff --git a/image/app/plugin/registry.go b/image/app/plugin/registry.go index f6951b1..54f8271 100644 --- a/image/app/plugin/registry.go +++ b/image/app/plugin/registry.go @@ -55,10 +55,11 @@ var pluginMap map[string]Plugin = map[string]Plugin{ Streams: redcap.Streams, }, "redcap2": { - Query: redcap2.Query, - Options: redcap2.Options, - Search: nil, - Streams: redcap2.Streams, + Query: redcap2.Query, + Options: redcap2.Options, + Search: nil, + Streams: redcap2.Streams, + Metadata: redcap2.Metadata, }, "osf": { Query: osf.Query, From 533fdb60e2b4be33ca1e0cd5e95fa7a763213200 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Thu, 11 Jun 2026 23:52:25 +0200 Subject: [PATCH 09/25] docs: REDCap integration user guide + plan status update - REDCAP_INTEGRATION.md: end-user guide covering export modes, anonymization modes, pseudonymization key generation 
(openssl rand -base64 32) and management caveats, generated files, sidecar previewers, citation prefill, manifest reference, and limitations (free-text PHI disclaimer). - redcap.md: Phases 3.9/4/5 marked completed with implementation notes; Phase 6 in progress; decision revisions recorded (no sidecar toggles, ODM always generated, researcher-managed keys); file layout updated. - README.md: REDCap feature section + doc table link. --- README.md | 6 ++ REDCAP_INTEGRATION.md | 231 ++++++++++++++++++++++++++++++++++++++++++ redcap.md | 70 ++++++------- 3 files changed, 273 insertions(+), 34 deletions(-) create mode 100644 REDCAP_INTEGRATION.md diff --git a/README.md b/README.md index 25fb038..ea6fbc2 100644 --- a/README.md +++ b/README.md @@ -57,6 +57,11 @@ Move data reliably and at scale using Globus. Built-in Globus plugin supports bo **Learn more:** [GLOBUS_INTEGRATION.md](GLOBUS_INTEGRATION.md) +### 🏥 REDCap Export With De-Identification +Export REDCap reports or records into Dataverse with per-variable anonymization (blank, drop, HMAC pseudonymization with researcher-managed keys), an auditable export manifest, and automatically generated Croissant, RO-Crate, and DDI-CDI metadata sidecars. 
+ +**Learn more:** [REDCAP_INTEGRATION.md](REDCAP_INTEGRATION.md) + [↑ Back to Top](#rdm-integration) | [→ Quick Start](#quick-start) --- @@ -324,6 +329,7 @@ Comprehensive guides are available for specific features: |----------|-------------| | [ddi-cdi.md](ddi-cdi.md) | Complete guide to DDI-CDI metadata generation | | [GLOBUS_INTEGRATION.md](GLOBUS_INTEGRATION.md) | Globus transfer features, configuration, and comparison | +| [REDCAP_INTEGRATION.md](REDCAP_INTEGRATION.md) | REDCap export, de-identification, and metadata sidecars user guide | | [preview_urls.md](preview_urls.md) | Preview URL support for Globus downloads | | [DOWNLOAD_FILTERING.md](DOWNLOAD_FILTERING.md) | How the download UI filters datasets by user permissions | diff --git a/REDCAP_INTEGRATION.md b/REDCAP_INTEGRATION.md new file mode 100644 index 0000000..c471302 --- /dev/null +++ b/REDCAP_INTEGRATION.md @@ -0,0 +1,231 @@ +# REDCap Integration (redcap2 plugin) — User Guide + +This guide explains how to export data from REDCap into a Dataverse dataset +with the `redcap2` plugin, including de-identification (anonymization), +pseudonymization key management, and the metadata files that are generated +with every export. + +Design and implementation details live in [redcap.md](redcap.md). This +document is for end users. + +## Table of Contents + +1. [Overview](#overview) +2. [Prerequisites](#prerequisites) +3. [Connecting to REDCap](#connecting-to-redcap) +4. [Export settings](#export-settings) +5. [Variable anonymization](#variable-anonymization) +6. [Pseudonymization keys](#pseudonymization-keys) +7. [Generated files](#generated-files) +8. [Metadata sidecars and previewers](#metadata-sidecars-and-previewers) +9. [New dataset metadata prefill](#new-dataset-metadata-prefill) +10. [The export manifest](#the-export-manifest) +11. 
[Limitations and good practice](#limitations-and-good-practice) + +## Overview + +The plugin exports REDCap data through the REDCap API and uploads the +resulting files to a Dataverse dataset using the regular compare → sync +workflow: + +1. Select **REDCap** as the repository type on the connect page and enter your + REDCap URL and API token. +2. Configure the export on the **REDCap export settings** page (report or + records mode, format, filters, anonymization). +3. Press **Continue to compare**: the export is generated server-side and shown + as a list of files compared against the dataset content. +4. Deselect any files you do not want, then synchronize. + +Nothing is uploaded until you confirm in the compare step. + +## Prerequisites + +- A REDCap API token for the project (REDCap → API page). The token's **Data + Export Rights** are enforced by REDCap itself and are your first line of + de-identification: + - *Full Data Set*: everything your role can see is exported. + - *De-Identified*: REDCap removes identifier-tagged fields, free-text fields + and dates, and hashes the record ID — before the data reaches this tool. + - *Remove All Identifier Fields*: identifier-tagged fields are removed. +- Permission to publish the data in the destination dataset. Anonymization in + this tool is a convenience layer, not a substitute for your institution's + data protection assessment. + +## Connecting to REDCap + +On the connect page choose the REDCap repository type, fill in the REDCap +server URL and your API token, choose the destination dataset (or "New +Dataset"), and continue. You will be redirected to the REDCap export settings +page. + +## Export settings + +Two export modes: + +- **Report**: export a saved report by its ID (find it in REDCap under *My + Reports & Exports*). The report definition controls fields, records, and + filters. Reports are always exported flat. 
+- **All records**: export project records directly, with optional filters: + fields, forms, events, record IDs, filter logic (e.g. `[age] > 30`), and a + date range. Records mode also offers: + - **Record type**: *Flat* (one row per record) or *EAV* (one row per value: + `record, [event,] field_name, value`). + - **Include survey fields**: adds `redcap_survey_identifier` and timestamp + columns. The survey identifier can directly identify respondents — leave + this off unless you need it. + - **Include Data Access Groups**: adds the DAG column (only honored by + REDCap if the project has DAGs and your API user is not in one). + +Shared options: + +- **Data format**: CSV or JSON. +- **CSV delimiter**: comma or tab. +- **Raw / Label**: export stored values (`raw`) or their human-readable labels + (`label`). +- **Header labels** (flat CSV only): column headers as variable names or + field labels. + +## Variable anonymization + +The **Variable anonymization** panel lists the variables of the selected +report (after *Load variables*) or of the whole project (records mode). +Variables that REDCap tags as identifiers are pre-set to *Blank*. Free-text +fields (notes, unvalidated text) carry a warning icon: they can contain +identifying information even when not tagged. + +Per variable you can choose: + +| Mode | Effect | +|------|--------| +| None | Exported unchanged. | +| Blank | Values are emptied; the column/rows remain. | +| Drop | The variable is removed entirely — from the data and from the exported data dictionary. | +| Pseudonymize | Values are replaced by irreversible HMAC-SHA256 codes (hex). The same value with the same key always yields the same code, so linkage across exports is preserved. | + +Details worth knowing: + +- A rule on a checkbox field (e.g. `phones`) covers all its expansion columns + (`phones___1`, `phones___2`, ...); a rule on a single expansion column + covers only that column. 
+- In EAV exports, a transform on the record-ID field is also applied to the + `record` linking column. Dropping the record-ID field is not possible in + EAV (it would either break the structure or silently keep the identifiers) — + use pseudonymize or blank. +- Every rule is audited in the manifest with the number of columns/rows it + touched; a rule that matched nothing produces an explicit warning. + +## Pseudonymization keys + +When at least one variable is set to *Pseudonymize*, a key field appears. + +- **You manage the key, not the server.** Generate one with: + + ```bash + openssl rand -base64 32 + ``` + + and paste the base64 string into the key field. (32 random bytes is the + recommended size; the minimum accepted is 16 bytes.) +- **Store the key safely** (e.g. in your institution's password manager). The + same key reproduces the same pseudonyms in future exports — that is what + makes longitudinal updates linkable. Without the key, new exports cannot be + linked to old ones. With the key, anyone holding the original values can + re-compute the mapping, so treat the key as confidential. +- The key itself never appears in the generated files or logs. The manifest + records only a *fingerprint* (a hash of the key), so you can verify later + which key was used. +- Pseudonymization is irreversible: there is no decryption. This is by + design (institutional decision; reversible encryption is out of scope). + +## Generated files + +Every export produces one folder (`redcap/report-/` or `redcap/records/`) +containing: + +| File | Content | +|------|---------| +| `data.csv` / `data.json` | The records, with anonymization applied. | +| `metadata.csv` | The REDCap data dictionary, filtered to exported fields (dropped variables are excluded). | +| `project_info.json` | REDCap project information. | +| `events.csv`, `form_event_mapping.csv` | Longitudinal projects only. | +| `project_metadata.xml` | CDISC ODM project metadata (metadata only, no data). 
| +| `croissant.json` | Croissant 1.0 dataset description (ML-ready metadata). | +| `ro-crate-metadata.json` | RO-Crate 1.2 crate with provenance of the export. | +| `ddi-cdi.jsonld` | DDI-CDI 1.0 variable-level description. | +| `manifest.json` | Export parameters, anonymization audit, provenance. | + +All files are generated; none are mandatory uploads. Deselect what you do not +need in the compare step. + +Per-record file attachments (REDCap *file upload* fields) are **not** +exported; the manifest documents which fields hold attachments. + +## Metadata sidecars and previewers + +The three metadata sidecars are generated from the same normalized model, so +they always agree with each other and with the anonymized data (e.g. dropped +variables are absent everywhere, pseudonymized variables are marked): + +- `ro-crate-metadata.json` is uploaded with the RO-Crate mime type that + Dataverse (6.3+) also detects by filename; the standard **RO-Crate + previewer** picks it up. +- `ddi-cdi.jsonld` is uploaded with the DDI-CDI profile mime type registered + by the **CDI previewer** (`conf/dataverse/external-tools/04-cdi-previewer.json`), + which validates against the official DDI-CDI 1.0 SHACL shapes. +- `croissant.json` is uploaded as `application/ld+json`; the generic + **JSON-LD previewer** (`conf/dataverse/external-tools/12-jsonld-previewer.json`, + same viewer as the CDI previewer) displays it. There is no + Croissant-specific previewer in the Dataverse ecosystem yet. The Croissant + CDIF profile ("Semantic Croissant") is still draft-stage; the file targets + plain Croissant 1.0 and can be validated with + `pip install mlcroissant && mlcroissant validate --jsonld croissant.json`. 
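
As a quick local check of the agreement described above, a downloaded sidecar can be parsed and scanned for variables that were set to *Drop*. This is a minimal sketch, not plugin code: `checkSidecar` and the sample documents are hypothetical, and the quoted-name substring match is a deliberately naive assumption. It does not replace the SHACL or `mlcroissant` validation mentioned above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// checkSidecar is a hypothetical helper (not part of the redcap2 plugin).
// It verifies that doc is a JSON object and that none of the dropped
// variable names appears in it as a quoted string.
func checkSidecar(doc []byte, dropped []string) error {
	var parsed map[string]interface{}
	if err := json.Unmarshal(doc, &parsed); err != nil {
		return fmt.Errorf("sidecar is not a valid JSON object: %w", err)
	}
	for _, name := range dropped {
		// Naive check: a quoted occurrence anywhere counts as a leak.
		if strings.Contains(string(doc), `"`+name+`"`) {
			return fmt.Errorf("dropped variable %q still appears in sidecar", name)
		}
	}
	return nil
}

func main() {
	// Illustrative documents only; real sidecars are much larger.
	clean := []byte(`{"name": "demo", "variables": ["record_id", "age"]}`)
	leaky := []byte(`{"name": "demo", "variables": ["record_id", "email"]}`)

	fmt.Println("clean:", checkSidecar(clean, []string{"email"}))
	fmt.Println("leaky:", checkSidecar(leaky, []string{"email"}))
}
```

The same check can be run against each of the three sidecars, since a dropped variable should be absent from all of them.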
+ +## New dataset metadata prefill + +When you create a **new** dataset as the destination, the metadata-copy step +offers values mapped from the REDCap project: + +- Title ← project title +- Description ← project notes (+ "purpose, specify" text) +- Author ← principal investigator +- Grant number ← project grant number +- Other ID ← IRB number and a `urn:redcap:...:project:` reference + +You select which of these to copy; nothing is applied automatically. + +## The export manifest + +`manifest.json` is the audit record of the export. It contains: + +- the export mode and all export parameters (with one privacy exception + below), REDCap version, project id/title, and generation timestamp; +- the **anonymization audit**: per rule, the mode and how many columns/rows it + actually touched, with explicit warnings for rules that matched nothing; +- the pseudonymization method and key fingerprint (never the key); +- attachments documentation (file-upload fields that were not exported); +- `dictionary_fields_not_exported`: dictionary fields missing from an + unfiltered records export — this reveals server-side stripping by your + token's export rights. + +Privacy exception: if you filtered by specific record IDs and the record-ID +field is anonymized, the manifest redacts the record-ID filter (and likewise +filter logic that references anonymized fields) — otherwise the manifest would +leak the very values the transforms removed. + +## Limitations and good practice + +- **Free text can contain anything.** Blanking identifier-tagged fields does + not clean names or phone numbers typed into notes fields. The variables + table flags such fields; review them before exporting. +- **Labels can leak too.** When exporting labels (`Raw / Label = Label`), + choice labels are data from the dictionary, not record values — but check + custom labels for embedded identifying text. 
+- **Token rights are the foundation.** Prefer a token with *De-Identified* + export rights when you do not need identifiers at all; the client-side + transforms then act as a second layer. +- **Same settings, same bytes.** Exports are deterministic: unchanged data + with unchanged settings produces identical files, so re-running a sync only + uploads what actually changed. +- Attachments, reversible encryption, and a Croissant-CDIF profile are + intentionally out of scope for now; see [redcap.md](redcap.md) for the + decision log. diff --git a/redcap.md b/redcap.md index bf3b3a7..8237e26 100644 --- a/redcap.md +++ b/redcap.md @@ -75,27 +75,25 @@ Key point: manual export/save was required in the old `redcap` plugin because it **Report mode** (`exportMode: "report"`): 1. `redcap/report-/data.csv` or `data.json` -2. `redcap/report-/metadata.csv` (filtered to exported fields) +2. `redcap/report-/metadata.csv` (filtered to exported fields; dropped variables excluded) 3. `redcap/report-/project_info.json` 4. `redcap/report-/events.csv` (longitudinal projects) 5. `redcap/report-/form_event_mapping.csv` (longitudinal projects) -6. `redcap/report-/manifest.json` (export config + timestamp + REDCap version + warnings) +6. `redcap/report-/project_metadata.xml` (CDISC ODM, metadata only; failure-tolerant) +7. `redcap/report-/croissant.json` (Croissant 1.0, mime `application/ld+json`) +8. `redcap/report-/ro-crate-metadata.json` (RO-Crate 1.2, profile mime, exact filename for Dataverse detection) +9. `redcap/report-/ddi-cdi.jsonld` (DDI-CDI 1.0, DDI-CDI profile mime) +10. `redcap/report-/manifest.json` (export config + timestamp + REDCap version + audit + warnings) -**Records mode** (`exportMode: "records"`): - -1. `redcap/records/data.csv` or `data.json` -2. `redcap/records/metadata.csv` (filtered to exported fields) -3. `redcap/records/project_info.json` -4. `redcap/records/events.csv` (longitudinal projects) -5. 
`redcap/records/form_event_mapping.csv` (longitudinal projects) -6. `redcap/records/manifest.json` +**Records mode** (`exportMode: "records"`): same layout under `redcap/records/`. ### Not Implemented Yet -1. Advanced de-identification modes beyond `blank` (`drop`, HMAC `pseudonymize`) — Phase 4. Reversible encryption is **out of scope** (decision 2026-06-11). -2. DDI-CDI/Croissant/RO-Crate metadata exporters — Phase 5 (all three in one phase, from one normalized model; decision 2026-06-11). +1. ~~Advanced de-identification modes beyond `blank`~~ — **done** (Phase 4, 2026-06-11): `drop` + HMAC-SHA256 `pseudonymize` with researcher-managed base64 key. Reversible encryption remains **out of scope** (decision 2026-06-11). +2. ~~DDI-CDI/Croissant/RO-Crate metadata exporters~~ — **done** (Phase 5, 2026-06-11): all three generated on every export from one normalized model (no toggles; deselectable per file in compare — decision revision 2026-06-11), plus `project_metadata.xml` (ODM, metadata-only). 3. Attachment/file-field download — **deferred**; file-upload fields are documented in the manifest instead (decision 2026-06-11). -4. XML data export (note: `content=report` also accepts `format=odm`; a metadata-only `content=project_xml` sidecar is planned in Phase 5). +4. XML **data** export (the metadata-only `content=project_xml` sidecar shipped in Phase 5). +5. Remaining hardening (Phase 6): configurable HTTP timeout, performance test with large projects, security review, pilot re-test. [↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Review, Research, And Decisions](#review-research-and-decisions-2026-06-11) @@ -135,6 +133,8 @@ A full review of the branch against main plus web research on the REDCap API (tr 2. **Reversible encryption**: out of scope (irreversible transforms only). 3. 
**Metadata exporters**: all three (Croissant + RO-Crate + DDI-CDI) in a single phase from one normalized metadata model, generated during export as bundle virtual files (selectable in the compare tree). 4. **Attachments**: deferred; the manifest documents the project's file-upload fields as not-exported references. +5. **No sidecar toggles** (revision, 2026-06-11): the sidecars are always generated; users deselect unwanted files in the compare step instead of pre-toggling generation. The ODM sidecar is likewise always generated (failure-tolerant). +6. **Pseudonymization keys are researcher-managed** (2026-06-11): base64 key pasted in the UI (min 16 bytes decoded, recommended `openssl rand -base64 32`); server stores nothing, manifest records a SHA-256 fingerprint only. ### Review findings driving Phase 3.9 @@ -443,7 +443,7 @@ Normalized model (one struct, three emitters): Additional Phase 5 items: 1. Implement the plugin `Metadata()` hook (registry already supports it; github/gitlab set the precedent) to prefill Dataverse citation metadata from project info (title, PI name, `project_pi_email` on 15.5.20+, notes). -2. Optional metadata-only CDISC ODM sidecar via `content=project_xml&returnMetadataOnly=true` (`project_metadata.xml`) — one API call, archival gold standard. Never with data (would bypass blanking). +2. Metadata-only CDISC ODM sidecar via `content=project_xml&returnMetadataOnly=true` (`project_metadata.xml`) — one API call, archival gold standard, always generated (failure-tolerant). Never with data (would bypass the transforms). [↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Architecture In rdm-integration](#architecture-in-rdm-integration) @@ -545,7 +545,7 @@ Current generic request model is string-heavy (`option`, `repoName`, etc.). `plu 7. ~~Auto-detect identifier-tagged fields from metadata and pre-blank them.~~ 8. 
~~Add unit tests for each parameter combination.~~ -### Phase 3.9: De-Id Correctness And API Fidelity [In Progress — 2026-06-11] +### Phase 3.9: De-Id Correctness And API Fidelity [Completed — 2026-06-11] Fixes the review findings before new features (see [Review, Research, And Decisions](#review-research-and-decisions-2026-06-11)): @@ -559,33 +559,35 @@ Fixes the review findings before new features (see [Review, Research, And Decisi 8. Manifest enrichment: project id/title, file-upload-field documentation (attachments decision), dictionary-vs-export column diff (reveals token-rights stripping). 9. Bundle cache size cap (bound PII residency in RAM). -### Phase 4: De-Identification Engine [Next] +### Phase 4: De-Identification Engine [Completed — 2026-06-11] -1. Add `drop` and deterministic HMAC-SHA256 `pseudonymize` per-variable modes (policy schema + validation). -2. Flag unvalidated text/notes fields as PHI-risk in the variables table (field types from the dictionary). -3. Surface token export-rights context (dictionary-vs-export diff) in the UI, not just the manifest. -4. Extend the anonymization audit (method, key id — never key material). -5. Strict safeguards: no key logging, no raw-value logging, secure defaults. +1. ~~Add `drop` and deterministic HMAC-SHA256 `pseudonymize` per-variable modes~~ — done. Key: researcher-managed base64 (min 16 bytes, validated client- and server-side with the `openssl rand -base64 32` hint); pseudonyms are full lowercase-hex HMAC-SHA256; empty cells stay empty. EAV: record-ID transforms also rewrite the `record` linking column; dropping the record-ID field in EAV is rejected. Rules match checkbox expansion columns by their own name as well as the base name. +2. ~~Flag unvalidated text/notes fields as PHI-risk~~ — done via `SelectItem.Note` + warning icon in the variables table. +3. 
Token export-rights context: documented in the manifest (`dictionary_fields_not_exported`, excluding client-side drops) and in the user guide; no extra UI surface (kept light). +4. ~~Extend the anonymization audit~~ — done (mode + matched counts + record-column notes; `anonymization` manifest section with method + key fingerprint). +5. ~~Safeguards~~ — done: key never logged or echoed; manifest redacts the `records` filter when the record-ID field is transformed and `filterLogic` when it references transformed fields; cache key covers the key (hashed). 6. ~~Reversible encryption~~ — out of scope (decision 2026-06-11). +7. Fixed along the way: the variables list is now always fetched with raw, comma-delimited headers (label-header and tab-delimiter exports previously yielded rule names that never matched). -### Phase 5: Metadata Exporters [Next] +### Phase 5: Metadata Exporters [Completed — 2026-06-11] -1. Define normalized metadata model (project, files, variables, code lists, provenance). -2. Implement all three exporter adapters in one phase (decision 2026-06-11): - - `croissant.json` (Croissant 1.0; validate with mlcroissant in CI) - - `ro-crate-metadata.json` (RO-Crate 1.2, Process Run Crate provenance) - - `ddi-cdi.jsonld` (DDI-CDI 1.0; reuse in-repo ddi-cdi helpers; validate with cdi-viewer SHACL shapes) -3. Generate during export as bundle virtual files; expose toggles in UI. -4. Implement plugin `Metadata()` hook for Dataverse citation prefill from project info. -5. Optional `project_metadata.xml` (CDISC ODM, metadata-only). -6. Schema validation tests for each output type. +1. ~~Normalized metadata model~~ — done (`sidecars.go`: files with md5/size/encoding, variables joined from the post-transform data columns and the dictionary incl. code lists, provenance, key fingerprint). +2. 
~~All three exporters~~ — done: + - `croissant.json` (Croissant 1.0, canonical context, FileObject distribution, CSV RecordSet with schema.org dataTypes; no RecordSet for JSON exports). Mime `application/ld+json`; validate manually with `mlcroissant validate --jsonld`. + - `ro-crate-metadata.json` (RO-Crate 1.2, detached crate, Process Run Crate provenance: CreateAction + plugin/REDCap SoftwareApplication). Mime = Dataverse 6.3+ filename-detection string = RO-Crate previewer contentType. + - `ddi-cdi.jsonld` (DDI-CDI 1.0 JSON-LD mirroring `cdi_generator_jsonld.py` structure — WideDataSet/WideDataStructure/LogicalRecord/InstanceVariables/CodeLists/PrimaryKey/PhysicalSegmentLayout — so the cdi-viewer SHACL shapes apply). Mime = deployed CDI previewer contentType (`DdiCdiMimeType`). +3. ~~Generate during export~~ — done; **no UI toggles** (decision revision): always generated, deselectable per file in compare. Per-file mime plumbing added through `tree.Node.Attributes.MimeType` → both upload paths (native multipart Content-Type; direct-upload jsonData mime). +4. ~~Plugin `Metadata()` hook~~ — done (title, notes+purpose, PI, grant number, IRB number, urn:redcap project id). +5. ~~`project_metadata.xml`~~ — done (always generated, failure-tolerant warning). +6. ~~Schema validation tests~~ — structural Go tests for all three outputs + determinism e2e; external validation documented in the user guide. +7. New external tool conf: `conf/dataverse/external-tools/12-jsonld-previewer.json` (cdi-viewer registered for bare `application/ld+json`, fires on croissant.json). -### Phase 6: Hardening And Rollout [Next] +### Phase 6: Hardening And Rollout [In Progress — 2026-06-11] 1. Performance test with large REDCap projects; configurable HTTP timeout (current client timeout: 5 minutes). 2. Security review (keys, logs, PII handling, transport, cache residency). -3. Add operator documentation and troubleshooting. -4. Run pilot with limited users. +3. 
~~User documentation~~ — done: [REDCAP_INTEGRATION.md](REDCAP_INTEGRATION.md) (features, key generation/management, PHI disclaimer, sidecars/previewers, manifest reference). +4. Re-test on pilot (first pilot deploy of Phases 0–3.9 done 2026-06-11 via `make dev_build`). 5. Keep `redcap` plugin as stable fallback until `redcap2` is proven. 6. Revisit attachments (opt-in, size-capped, flagged as not de-identified) based on pilot feedback. From bf220803c0b3993f0ec2e6ea6e859727a7efde6e Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 00:13:24 +0200 Subject: [PATCH 10/25] redcap2 Phase 6: configurable timeout, performance pass, security review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - options.redcapHttpTimeout (Go duration string, default 5m) in the backend config bounds REDCap API requests; invalid values fall back with a warning. - Performance: benchmarks for flat/EAV transforms and sidecar generation (flat ~150 MB/s with pseudonymize+blank+drop; EAV ~79 MB/s after memoizing record-column HMACs, was ~34 MB/s; sidecars ~7.5 ms per 500-variable dictionary); removed the per-file payload copy in Streams (bundle contents are immutable), halving peak memory while streaming. - Security review recorded in redcap.md: key path verified end to end (in-memory frontend state, queued-job-only Redis residency, never logged or echoed, fingerprint-only in manifest, MD5-input-only in cache key); redcap2 client verifies TLS (does not inherit the global DefaultTransport skip); key-validation errors tested to never echo key material; accepted risks documented (job payloads in Redis like all plugin tokens, app-wide DefaultTransport skip flagged for separate review). - Docs: operator configuration section in REDCAP_INTEGRATION.md; Phase 6 statuses updated — only the pilot re-test remains open. 
--- REDCAP_INTEGRATION.md | 14 ++- image/app/config/backend_config.go | 3 +- image/app/plugin/impl/redcap2/bench_test.go | 114 +++++++++++++++++++ image/app/plugin/impl/redcap2/common.go | 45 +++++++- image/app/plugin/impl/redcap2/common_test.go | 25 ++++ image/app/plugin/impl/redcap2/streams.go | 6 +- redcap.md | 35 +++++- 7 files changed, 229 insertions(+), 13 deletions(-) create mode 100644 image/app/plugin/impl/redcap2/bench_test.go diff --git a/REDCAP_INTEGRATION.md b/REDCAP_INTEGRATION.md index c471302..a83eeeb 100644 --- a/REDCAP_INTEGRATION.md +++ b/REDCAP_INTEGRATION.md @@ -20,7 +20,8 @@ document is for end users. 8. [Metadata sidecars and previewers](#metadata-sidecars-and-previewers) 9. [New dataset metadata prefill](#new-dataset-metadata-prefill) 10. [The export manifest](#the-export-manifest) -11. [Limitations and good practice](#limitations-and-good-practice) +11. [Configuration (operators)](#configuration-operators) +12. [Limitations and good practice](#limitations-and-good-practice) ## Overview @@ -212,6 +213,17 @@ field is anonymized, the manifest redacts the record-ID filter (and likewise filter logic that references anonymized fields) — otherwise the manifest would leak the very values the transforms removed. +## Configuration (operators) + +- `options.redcapHttpTimeout` in the backend config (`BACKEND_CONFIG_FILE`): + a Go duration string (e.g. `"15m"`) bounding each REDCap API request. + Default: `5m`. Raise it if exports of very large projects time out — the + export itself is fast (hundreds of MB/s for processing); the timeout covers + the REDCap server generating and sending the data. +- The JSON-LD previewer for `croissant.json` requires registering + `conf/dataverse/external-tools/12-jsonld-previewer.json` in Dataverse (the + DDI-CDI and RO-Crate previewers use the existing registrations). 
+ ## Limitations and good practice - **Free text can contain anything.** Blanking identifier-tagged fields does diff --git a/image/app/config/backend_config.go b/image/app/config/backend_config.go index 631440f..d3c786b 100644 --- a/image/app/config/backend_config.go +++ b/image/app/config/backend_config.go @@ -52,7 +52,8 @@ type OptionalConfig struct { ComputationAccessConfig []QueueAccess `json:"computationAccessConfig"` DisableDdiCdi bool `json:"disableDdiCdi,omitempty"` // set to true to disable DDI-CDI generation feature GlobusWebAppUrl string `json:"globusWebAppUrl,omitempty"` - WorkspaceRoot string `json:"workspaceRoot,omitempty"` // base directory for job workspaces (default: /dsdata) + WorkspaceRoot string `json:"workspaceRoot,omitempty"` // base directory for job workspaces (default: /dsdata) + RedcapHttpTimeout string `json:"redcapHttpTimeout,omitempty"` // Go duration string (e.g. "10m") for REDCap API requests in the redcap2 plugin; default 5m. Large projects may need more. } type QueueAccess struct { diff --git a/image/app/plugin/impl/redcap2/bench_test.go b/image/app/plugin/impl/redcap2/bench_test.go new file mode 100644 index 0000000..a5915b0 --- /dev/null +++ b/image/app/plugin/impl/redcap2/bench_test.go @@ -0,0 +1,114 @@ +// Author: Eryk Kulikowski @ KU Leuven (2026). Apache 2.0 License + +package redcap2 + +import ( + "bytes" + "fmt" + "testing" +) + +// syntheticFlatCSV builds a wide flat export: rows x cols data cells plus a +// record_id column. 
+func syntheticFlatCSV(rows, cols int) ([]byte, dictionary) { + var meta bytes.Buffer + meta.WriteString("field_name,form_name,field_type,field_label,identifier,text_validation_type_or_show_slider_number\n") + meta.WriteString("record_id,f,text,Record ID,,\n") + for c := 0; c < cols; c++ { + fmt.Fprintf(&meta, "var_%d,f,text,Variable %d,,integer\n", c, c) + } + + var data bytes.Buffer + data.WriteString("record_id") + for c := 0; c < cols; c++ { + fmt.Fprintf(&data, ",var_%d", c) + } + data.WriteByte('\n') + for r := 0; r < rows; r++ { + fmt.Fprintf(&data, "%d", r) + for c := 0; c < cols; c++ { + fmt.Fprintf(&data, ",%d", r*cols+c) + } + data.WriteByte('\n') + } + return data.Bytes(), parseDictionary(meta.Bytes()) +} + +// syntheticEAVCSV builds a long export with rows*cols value rows. +func syntheticEAVCSV(rows, cols int) ([]byte, dictionary) { + _, dict := syntheticFlatCSV(0, cols) + var data bytes.Buffer + data.WriteString("record,field_name,value\n") + for r := 0; r < rows; r++ { + for c := 0; c < cols; c++ { + fmt.Fprintf(&data, "%d,var_%d,%d\n", r, c, r*cols+c) + } + } + return data.Bytes(), dict +} + +func benchPlan() transformPlan { + return testPlan(map[string]string{ + "record_id": "pseudonymize", + "var_0": "blank", + "var_1": "drop", + }) +} + +// BenchmarkTransformFlatCSV measures the full per-export processing cost of a +// 50k-row, 50-column flat CSV (~25 MB class) with pseudonymize+blank+drop. +func BenchmarkTransformFlatCSV(b *testing.B) { + data, dict := syntheticFlatCSV(50_000, 50) + plan := benchPlan() + b.SetBytes(int64(len(data))) + b.ResetTimer() + for i := 0; i < b.N; i++ { + if _, _, _, err := transformFlatCSV(data, ',', plan, false, dict); err != nil { + b.Fatal(err) + } + } +} + +// BenchmarkTransformFlatCSVNoRules measures the pass-through cost (parse + +// audit only) on the same input. 
+func BenchmarkTransformFlatCSVNoRules(b *testing.B) { + data, dict := syntheticFlatCSV(50_000, 50) + b.SetBytes(int64(len(data))) + b.ResetTimer() + for i := 0; i < b.N; i++ { + if _, _, _, err := transformFlatCSV(data, ',', transformPlan{}, false, dict); err != nil { + b.Fatal(err) + } + } +} + +// BenchmarkTransformEAVCSV measures EAV processing of 2.5M value rows +// (50k records x 50 fields) with record-column pseudonymization. +func BenchmarkTransformEAVCSV(b *testing.B) { + data, dict := syntheticEAVCSV(50_000, 50) + plan := benchPlan() + b.SetBytes(int64(len(data))) + b.ResetTimer() + for i := 0; i < b.N; i++ { + if _, _, _, err := transformEAVCSV(data, ',', plan, dict); err != nil { + b.Fatal(err) + } + } +} + +// BenchmarkSidecars measures generating all three sidecars for a 500-variable +// dictionary (large-project class). +func BenchmarkSidecars(b *testing.B) { + data, dict := syntheticFlatCSV(100, 500) + opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + files := map[string][]byte{"redcap/records/data.csv": data} + model := buildSidecarModel(opts, transformPlan{}, dict, "redcap/records", files, "redcap/records/data.csv", "14.5.5", float64(1), "Bench") + b.ResetTimer() + for i := 0; i < b.N; i++ { + mime := map[string]string{} + bundle := map[string][]byte{"redcap/records/data.csv": data} + if warnings := addSidecars(model, "redcap/records", bundle, mime); len(warnings) != 0 { + b.Fatalf("sidecar warnings: %v", warnings) + } + } +} diff --git a/image/app/plugin/impl/redcap2/common.go b/image/app/plugin/impl/redcap2/common.go index 51ebe56..ef4af65 100644 --- a/image/app/plugin/impl/redcap2/common.go +++ b/image/app/plugin/impl/redcap2/common.go @@ -14,6 +14,7 @@ import ( "encoding/hex" "encoding/json" "fmt" + "integration/app/config" "integration/app/logging" "integration/app/plugin/types" "io" @@ -120,10 +121,30 @@ func (s *bundleStore) set(key string, b generatedBundle) { s.entries[key] = bundleCacheEntry{bundle: b, expiresAt: 
now.Add(bundleCacheTTL)} } +// defaultHTTPTimeout bounds a single REDCap API request. Large projects can +// need more: configure options.redcapHttpTimeout (a Go duration string, e.g. +// "15m") in the backend config. +const defaultHTTPTimeout = 5 * time.Minute + +// parseHTTPTimeout parses a configured timeout, falling back to the default +// for empty, invalid, or non-positive values. +func parseHTTPTimeout(raw string) time.Duration { + raw = strings.TrimSpace(raw) + if raw == "" { + return defaultHTTPTimeout + } + d, err := time.ParseDuration(raw) + if err != nil || d <= 0 { + logging.Logger.Printf("redcap2: invalid redcapHttpTimeout %q, using default %v", raw, defaultHTTPTimeout) + return defaultHTTPTimeout + } + return d +} + func getHTTPClient() *http.Client { clientOnce.Do(func() { httpClient = &http.Client{ - Timeout: 5 * time.Minute, + Timeout: parseHTTPTimeout(config.GetConfig().Options.RedcapHttpTimeout), Transport: &http.Transport{ MaxIdleConns: 100, MaxIdleConnsPerHost: 10, @@ -417,6 +438,22 @@ func (p transformPlan) transformValue(field, value string) string { return value } +// memoizedTransform returns a transformValue wrapper that caches results per +// input value. Used for the EAV record column, where the same record ID +// recurs once per exported field and recomputing the HMAC dominates the +// processing cost of large EAV exports. +func (p transformPlan) memoizedTransform(field string) func(string) string { + cache := map[string]string{} + return func(value string) string { + if cached, ok := cache[value]; ok { + return cached + } + transformed := p.transformValue(field, value) + cache[value] = transformed + return transformed + } +} + // buildTransformPlan validates the per-variable anonymization choices and the // researcher-provided base64 HMAC key (required iff any field is pseudonymized). 
func buildTransformPlan(opts pluginOptions) (transformPlan, error) { @@ -781,6 +818,7 @@ func transformEAVCSV(data []byte, delimiter rune, plan transformPlan, dict dicti if recordField != "" && recordIdx >= 0 { recordMode = plan.modes[recordField] } + transformRecord := plan.memoizedTransform(recordField) exported := eavExportedFields(dict) seen := map[string]bool{} for _, field := range exported { @@ -812,7 +850,7 @@ func transformEAVCSV(data []byte, delimiter rune, plan transformPlan, dict dicti changed = true } if recordMode != "" && recordMode != "drop" && recordIdx < len(row) && row[recordIdx] != "" { - row[recordIdx] = plan.transformValue(recordField, row[recordIdx]) + row[recordIdx] = transformRecord(row[recordIdx]) changed = true notes[recordField] = "also applied to the EAV record column" } @@ -967,6 +1005,7 @@ func transformEAVJSON(data []byte, plan transformPlan, dict dictionary) ([]byte, if recordField != "" { recordMode = plan.modes[recordField] } + transformRecord := plan.memoizedTransform(recordField) exported := eavExportedFields(dict) seen := map[string]bool{} for _, field := range exported { @@ -998,7 +1037,7 @@ func transformEAVJSON(data []byte, plan transformPlan, dict dictionary) ([]byte, if recordMode != "" && recordMode != "drop" { if rec, ok := row["record"]; ok { if recStr := jsonValueString(rec); recStr != "" { - row["record"] = plan.transformValue(recordField, recStr) + row["record"] = transformRecord(recStr) changed = true notes[recordField] = "also applied to the EAV record column" } diff --git a/image/app/plugin/impl/redcap2/common_test.go b/image/app/plugin/impl/redcap2/common_test.go index aa7a4a8..ebde5aa 100644 --- a/image/app/plugin/impl/redcap2/common_test.go +++ b/image/app/plugin/impl/redcap2/common_test.go @@ -16,6 +16,7 @@ import ( "reflect" "strings" "testing" + "time" ) // testKey is a 32-byte HMAC key used by pseudonymization tests. 
@@ -156,6 +157,26 @@ func TestNormalizeStringSlice(t *testing.T) { } } +func TestParseHTTPTimeout(t *testing.T) { + tests := []struct { + in string + want time.Duration + }{ + {in: "", want: defaultHTTPTimeout}, + {in: " ", want: defaultHTTPTimeout}, + {in: "15m", want: 15 * time.Minute}, + {in: "90s", want: 90 * time.Second}, + {in: "not-a-duration", want: defaultHTTPTimeout}, + {in: "-5m", want: defaultHTTPTimeout}, + {in: "0", want: defaultHTTPTimeout}, + } + for _, tt := range tests { + if got := parseHTTPTimeout(tt.in); got != tt.want { + t.Errorf("parseHTTPTimeout(%q) = %v, want %v", tt.in, got, tt.want) + } + } +} + func TestGetAPIURL(t *testing.T) { tests := []struct { in string @@ -222,12 +243,16 @@ func TestBuildTransformPlanKeyValidation(t *testing.T) { bad.PseudonymizationKey = "!!!not-base64!!!" if _, err := buildTransformPlan(bad); err == nil || !strings.Contains(err.Error(), "base64") { t.Fatalf("invalid base64 must error, got %v", err) + } else if strings.Contains(err.Error(), bad.PseudonymizationKey) { + t.Fatal("key validation errors must never echo the key material") } short := base short.PseudonymizationKey = base64.StdEncoding.EncodeToString([]byte("tooshort")) if _, err := buildTransformPlan(short); err == nil || !strings.Contains(err.Error(), "too short") { t.Fatalf("short key must error, got %v", err) + } else if strings.Contains(err.Error(), short.PseudonymizationKey) { + t.Fatal("key validation errors must never echo the key material") } good := base diff --git a/image/app/plugin/impl/redcap2/streams.go b/image/app/plugin/impl/redcap2/streams.go index 5d64c98..ddcc708 100644 --- a/image/app/plugin/impl/redcap2/streams.go +++ b/image/app/plugin/impl/redcap2/streams.go @@ -46,9 +46,9 @@ func Streams(ctx context.Context, in map[string]tree.Node, streamParams types.St if !ok { return types.StreamsType{}, fmt.Errorf("streams: generated file not found: %s", path) } - // Copy payload to ensure stream readers are isolated from map aliasing. 
- data := append([]byte(nil), payload...) - res[key] = byteStream(data) + // bytes.Reader never mutates the slice and bundle contents are + // immutable once built, so the payload is streamed without copying. + res[key] = byteStream(payload) } return types.StreamsType{ diff --git a/redcap.md b/redcap.md index 8237e26..6a3877f 100644 --- a/redcap.md +++ b/redcap.md @@ -582,15 +582,40 @@ Fixes the review findings before new features (see [Review, Research, And Decisi 6. ~~Schema validation tests~~ — structural Go tests for all three outputs + determinism e2e; external validation documented in the user guide. 7. New external tool conf: `conf/dataverse/external-tools/12-jsonld-previewer.json` (cdi-viewer registered for bare `application/ld+json`, fires on croissant.json). -### Phase 6: Hardening And Rollout [In Progress — 2026-06-11] - -1. Performance test with large REDCap projects; configurable HTTP timeout (current client timeout: 5 minutes). -2. Security review (keys, logs, PII handling, transport, cache residency). +### Phase 6: Hardening And Rollout [In Progress — 2026-06-12] + +1. ~~Performance + configurable timeout~~ — done (2026-06-12). `options.redcapHttpTimeout` (Go duration string, default `5m`) in the backend config bounds REDCap API requests. Benchmarks added (`bench_test.go`); measured on dev hardware: + - flat CSV, 50k rows × 50 cols (~19 MB), pseudonymize+blank+drop: ~150 MB/s (~0.13 s) + - same input, no rules (parse + audit only): ~460 MB/s + - EAV CSV, 2.5M value rows (~50 MB) with record-column pseudonymization: ~79 MB/s (~0.64 s) after memoizing record-column HMACs (was ~34 MB/s — the same record ID recurs once per field) + - all three sidecars for a 500-variable dictionary: ~7.5 ms + Also removed the per-file payload copy in `Streams` (bundle contents are immutable; halves peak memory while streaming). +2. ~~Security review~~ — done (2026-06-12), see [Security Review](#security-review-2026-06-12). 3. 
~~User documentation~~ — done: [REDCAP_INTEGRATION.md](REDCAP_INTEGRATION.md) (features, key generation/management, PHI disclaimer, sidecars/previewers, manifest reference). -4. Re-test on pilot (first pilot deploy of Phases 0–3.9 done 2026-06-11 via `make dev_build`). +4. Re-test on pilot (first pilot deploy of Phases 0–3.9 done 2026-06-11 via `make dev_build`). **Remaining: user re-tests the Phase 4–6 build.** 5. Keep `redcap` plugin as stable fallback until `redcap2` is proven. 6. Revisit attachments (opt-in, size-capped, flagged as not de-identified) based on pilot feedback. +### Security Review (2026-06-12) + +Scope: redcap2 plugin, key handling end to end, logging, PII residency, transport. + +**Verified safe:** + +1. **Pseudonymization key path**: frontend holds the key in in-memory state only (`credentials.service.ts` uses signals, no localStorage/sessionStorage), so a page refresh discards it; it transits to the backend inside `pluginOptions` over HTTPS exactly like repository API tokens; in Redis it exists only inside the queued job payload (`LPush`/`RPop` — removed when the worker pops it, re-added only on retry), the same residency as every plugin's token; it is never logged (audited all `Logger` calls in the plugin and the job pipeline), never echoed in validation errors (tested), never written to any generated file (manifest carries only the SHA-256 fingerprint — tested), and enters the bundle cache key only as MD5 input (one-way). +2. **Logging**: the plugin logs only file counts, export mode, report ID, cache decisions, and sidecar warnings — no record data, no tokens, no key material. +3. **Transport**: the redcap2 HTTP client builds its own `http.Transport`, so it does **not** inherit the `InsecureSkipVerify: true` that `config.init()` sets on `http.DefaultTransport` — REDCap TLS certificates are verified by this plugin. +4. 
**PII residency in memory**: bundle cache is process-local with a 5-minute TTL, 64 MB per-bundle cap (oversized bundles are rebuilt on demand, never cached), and lazy eviction on every set. +5. **Manifest hygiene**: records filter redacted when the record-ID field is transformed; filter logic redacted when it references transformed fields; client-side drops excluded from the token-rights diff; REDCap API token absent everywhere (POST form body, never URLs or generated files). +6. **Defaults**: `exportSurveyFields` and `exportDataAccessGroups` are off by default (`redcap_survey_identifier` is often directly identifying). + +**Accepted/documented (no change):** + +1. Queued job payloads in Redis contain the repo token and, when used, the pseudonymization key — pre-existing posture shared by all plugins; mitigate by restricting Redis access (password support exists: `pathToRedisPassword`). +2. REDCap error bodies are surfaced to the user verbatim; REDCap error messages do not echo submitted record data. +3. `common/get_metadata.go` logs the citation-metadata response (project title, PI, ...) — pre-existing app-wide behavior, not record data. +4. The global `http.DefaultTransport` certificate-verification skip in `config.init()` is app-wide and predates this work; flagged for a future app-level review, out of redcap2 scope. 
+ [↑ Back to Top](#redcap2-plugin-design-status-and-implementation-plan) | [→ Testing Plan](#testing-plan) --- From 3ba94042479cb058e52353e0a46e1e5225b93630 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 15:07:47 +0200 Subject: [PATCH 11/25] redcap2: variable-level metadata in Croissant/RO-Crate + DDI-CDI SHACL conformance - croissant.json and ro-crate-metadata.json gain schema.org variableMeasured following the CDIF 1.1 Discovery-profile shape: PropertyValue per data column with name, description (label + anonymization note), alternateName, numeric minValue/maxValue (new text_validation_min/max dictionary parsing), and code lists as valueReference DefinedTerms (termCode = the value in the data). Inline entries in Croissant (its @vocab is schema.org); flattened contextual entities in RO-Crate as its spec requires. Verified with mlcroissant 1.1.0: exits clean (the embedded Croissant context also gained the missing official equivalentProperty/samplingRate keys). - ddi-cdi.jsonld code lists restructured per the official DDI-CDI 1.0 SHACL shapes (bundled with the cdi-viewer previewer), fixing the 'Less than 1 values' violations: each Code now uses a Notation whose TypedString content is the value as it appears in the data and denotes a Category holding the label; CodeList gets allowsDuplicates + ObjectName name; the PrimaryKey is reachable from the WideDataStructure (has_PrimaryKey) and its component uses the full correspondsTo_DataStructureComponent term. Verified with pyshacl against libis/cdi-viewer shapes/ddi-cdi-official.ttl: Conforms=True (was 13 violations). - datePublished/endTime omitted when only the missing-timestamp sentinel is available (must be ISO 8601). - New env-gated TestDumpSidecarsForValidation writes sample sidecars for external validation runs (pyshacl, mlcroissant); docs updated. Full CDIF 1.1 Data Description (double-typed cdi:InstanceVariable + skos code lists) deferred until the profile leaves review. 
--- REDCAP_INTEGRATION.md | 12 +- image/app/plugin/impl/redcap2/common.go | 31 +- image/app/plugin/impl/redcap2/sidecars.go | 289 +++++++++++++----- .../app/plugin/impl/redcap2/sidecars_test.go | 182 ++++++++++- redcap.md | 5 + 5 files changed, 428 insertions(+), 91 deletions(-) diff --git a/REDCAP_INTEGRATION.md b/REDCAP_INTEGRATION.md index a83eeeb..0e0e091 100644 --- a/REDCAP_INTEGRATION.md +++ b/REDCAP_INTEGRATION.md @@ -165,7 +165,17 @@ exported; the manifest documents which fields hold attachments. The three metadata sidecars are generated from the same normalized model, so they always agree with each other and with the anonymized data (e.g. dropped -variables are absent everywhere, pseudonymized variables are marked): +variables are absent everywhere, pseudonymized variables are marked). + +All three include **variable-level metadata** derived from the data +dictionary: labels, data types, numeric validation ranges, and code lists +("1 = Male | 2 = Female"). In `croissant.json` and `ro-crate-metadata.json` +this appears as schema.org `variableMeasured` entries following the CDIF 1.1 +Discovery-profile shape (code lists as DefinedTerms whose `termCode` is the +value as it appears in the data); in `ddi-cdi.jsonld` as InstanceVariables +with value domains and CodeLists (each Code carries the data value in its +Notation and the label in its Category). The DDI-CDI output validates against +the official DDI-CDI 1.0 SHACL shapes used by the CDI previewer. 
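As a rough illustration of the `variableMeasured` shape described above, the sketch below builds one schema.org `PropertyValue` with its code list as `DefinedTerm` entries and prints it as JSON. The type `choice`, the function `variableMeasuredEntry`, and the sample variable `sex` are assumptions made for this example; the actual sidecar output may carry further properties.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// choice is an illustrative code-list entry: the value in the data plus its label.
type choice struct{ Code, Label string }

// variableMeasuredEntry is a hypothetical helper that renders one variable as
// a PropertyValue whose valueReference lists the codes as DefinedTerms.
func variableMeasuredEntry(name, label string, choices []choice) map[string]interface{} {
	pv := map[string]interface{}{"@type": "PropertyValue", "name": name}
	if label != "" && label != name {
		pv["alternateName"] = label
	}
	terms := make([]interface{}, 0, len(choices))
	for _, c := range choices {
		term := map[string]interface{}{"@type": "DefinedTerm", "termCode": c.Code}
		if c.Label != "" {
			term["name"] = c.Label // human-readable label for the code
		}
		terms = append(terms, term)
	}
	if len(terms) > 0 {
		pv["valueReference"] = terms
	}
	return pv
}

func main() {
	entry := variableMeasuredEntry("sex", "Sex", []choice{{"1", "Male"}, {"2", "Female"}})
	out, err := json.MarshalIndent(entry, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Here `termCode` holds the value as it appears in the exported data and `name` holds the label, matching the convention stated in the paragraph above.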
- `ro-crate-metadata.json` is uploaded with the RO-Crate mime type that Dataverse (6.3+) also detects by filename; the standard **RO-Crate diff --git a/image/app/plugin/impl/redcap2/common.go b/image/app/plugin/impl/redcap2/common.go index ef4af65..0a2c615 100644 --- a/image/app/plugin/impl/redcap2/common.go +++ b/image/app/plugin/impl/redcap2/common.go @@ -518,24 +518,29 @@ type dictionary struct { labelFields map[string][]string // field_label -> field names (labels can collide) identifier map[string]bool // field_name -> tagged as identifier in REDCap validation map[string]string // field_name -> text validation type ("" = unvalidated) + validationMin map[string]string // field_name -> text_validation_min + validationMax map[string]string // field_name -> text_validation_max choices map[string]string // field_name -> raw select_choices_or_calculations hasValidation bool // the validation column was present in the dictionary } func parseDictionary(metadataCSV []byte) dictionary { dict := dictionary{ - fieldType: map[string]string{}, - fieldLabel: map[string]string{}, - labelFields: map[string][]string{}, - identifier: map[string]bool{}, - validation: map[string]string{}, - choices: map[string]string{}, + fieldType: map[string]string{}, + fieldLabel: map[string]string{}, + labelFields: map[string][]string{}, + identifier: map[string]bool{}, + validation: map[string]string{}, + validationMin: map[string]string{}, + validationMax: map[string]string{}, + choices: map[string]string{}, } rows, err := parseCSV(metadataCSV, ',') if err != nil || len(rows) == 0 { return dict } nameIdx, typeIdx, labelIdx, identifierIdx, validationIdx, choicesIdx := -1, -1, -1, -1, -1, -1 + minIdx, maxIdx := -1, -1 for i, col := range rows[0] { switch strings.ToLower(strings.TrimSpace(col)) { case "field_name": @@ -548,6 +553,10 @@ func parseDictionary(metadataCSV []byte) dictionary { identifierIdx = i case "text_validation_type_or_show_slider_number": validationIdx = i + case 
"text_validation_min": + minIdx = i + case "text_validation_max": + maxIdx = i case "select_choices_or_calculations": choicesIdx = i } @@ -590,6 +599,16 @@ func parseDictionary(metadataCSV []byte) dictionary { if validationIdx >= 0 && validationIdx < len(row) { dict.validation[name] = strings.ToLower(strings.TrimSpace(row[validationIdx])) } + if minIdx >= 0 && minIdx < len(row) { + if v := strings.TrimSpace(row[minIdx]); v != "" { + dict.validationMin[name] = v + } + } + if maxIdx >= 0 && maxIdx < len(row) { + if v := strings.TrimSpace(row[maxIdx]); v != "" { + dict.validationMax[name] = v + } + } } return dict } diff --git a/image/app/plugin/impl/redcap2/sidecars.go b/image/app/plugin/impl/redcap2/sidecars.go index cb60812..5ac5022 100644 --- a/image/app/plugin/impl/redcap2/sidecars.go +++ b/image/app/plugin/impl/redcap2/sidecars.go @@ -6,6 +6,7 @@ import ( "encoding/json" "fmt" "sort" + "strconv" "strings" ) @@ -44,6 +45,8 @@ type sidecarVariable struct { Label string FieldType string // REDCap field type ("" for pseudo-columns) Validation string + MinValue string // text_validation_min from the dictionary + MaxValue string // text_validation_max from the dictionary Identifier bool IsRecordID bool Transform string // applied anonymization mode ("" if none) @@ -157,6 +160,8 @@ func buildSidecarVariables(columns []string, opts pluginOptions, plan transformP v.Label = dict.fieldLabel[v.Field] v.FieldType = dict.fieldType[v.Field] v.Validation = dict.validation[v.Field] + v.MinValue = dict.validationMin[v.Field] + v.MaxValue = dict.validationMax[v.Field] v.Identifier = dict.identifier[v.Field] v.IsRecordID = v.Field == recordField v.Choices = variableChoices(dict, v.Field) @@ -247,6 +252,15 @@ func bundleFileEncodingFormat(name string, opts pluginOptions) string { return "application/octet-stream" } +// publishedDate returns the export timestamp, or false when only the +// missing-timestamp sentinel is available (datePublished must be ISO 8601). 
+func (m sidecarModel) publishedDate() (string, bool) { + if m.GeneratedAt == "" || m.GeneratedAt == "missing-generated-at" { + return "", false + } + return m.GeneratedAt, true +} + // datasetName returns a human-readable dataset name for the sidecars. func (m sidecarModel) datasetName() string { if m.ProjectTitle != "" { @@ -286,44 +300,92 @@ func variableDescription(v sidecarVariable) string { return strings.Join(parts, " — ") } +// numericBound parses a REDCap validation min/max into a JSON number. +// Non-numeric bounds (e.g. date limits) are skipped: schema.org +// minValue/maxValue expect numbers. +func numericBound(raw string) (float64, bool) { + f, err := strconv.ParseFloat(strings.TrimSpace(raw), 64) + return f, err == nil +} + +// propertyValueBase renders the shared part of a schema.org PropertyValue +// for one variable, following the CDIF 1.1 Discovery profile shape for +// variableMeasured (name + description required, alternateName, min/max). +// Code lists are attached by the caller (inline DefinedTerms for Croissant, +// flattened references for RO-Crate). +func propertyValueBase(v sidecarVariable) map[string]interface{} { + pv := map[string]interface{}{ + "@type": "PropertyValue", + "name": v.Column, + } + if desc := variableDescription(v); desc != "" { + pv["description"] = desc + } + if v.Label != "" && v.Label != v.Column { + pv["alternateName"] = v.Label + } + if f, ok := numericBound(v.MinValue); ok { + pv["minValue"] = f + } + if f, ok := numericBound(v.MaxValue); ok { + pv["maxValue"] = f + } + return pv +} + +// definedTermFor renders one code-list entry as a schema.org DefinedTerm +// (termCode = the value in the data, name = the human label). 
+func definedTermFor(c choiceCode) map[string]interface{} { + term := map[string]interface{}{ + "@type": "DefinedTerm", + "termCode": c.Code, + } + if c.Label != "" { + term["name"] = c.Label + } + return term +} + // --- Croissant 1.0 --- // croissantContext is the canonical Croissant 1.0 @context. var croissantContext = map[string]interface{}{ - "@language": "en", - "@vocab": "https://schema.org/", - "citeAs": "cr:citeAs", - "column": "cr:column", - "conformsTo": "dct:conformsTo", - "cr": "http://mlcommons.org/croissant/", - "rai": "http://mlcommons.org/croissant/RAI/", - "data": map[string]interface{}{"@id": "cr:data", "@type": "@json"}, - "dataType": map[string]interface{}{"@id": "cr:dataType", "@type": "@vocab"}, - "dct": "http://purl.org/dc/terms/", - "examples": map[string]interface{}{"@id": "cr:examples", "@type": "@json"}, - "extract": "cr:extract", - "field": "cr:field", - "fileProperty": "cr:fileProperty", - "fileObject": "cr:fileObject", - "fileSet": "cr:fileSet", - "format": "cr:format", - "includes": "cr:includes", - "isLiveDataset": "cr:isLiveDataset", - "jsonPath": "cr:jsonPath", - "key": "cr:key", - "md5": "cr:md5", - "parentField": "cr:parentField", - "path": "cr:path", - "recordSet": "cr:recordSet", - "references": "cr:references", - "regex": "cr:regex", - "repeated": "cr:repeated", - "replace": "cr:replace", - "sc": "https://schema.org/", - "separator": "cr:separator", - "source": "cr:source", - "subField": "cr:subField", - "transform": "cr:transform", + "@language": "en", + "@vocab": "https://schema.org/", + "citeAs": "cr:citeAs", + "column": "cr:column", + "conformsTo": "dct:conformsTo", + "cr": "http://mlcommons.org/croissant/", + "rai": "http://mlcommons.org/croissant/RAI/", + "data": map[string]interface{}{"@id": "cr:data", "@type": "@json"}, + "dataType": map[string]interface{}{"@id": "cr:dataType", "@type": "@vocab"}, + "dct": "http://purl.org/dc/terms/", + "equivalentProperty": "cr:equivalentProperty", + "examples": 
map[string]interface{}{"@id": "cr:examples", "@type": "@json"}, + "extract": "cr:extract", + "field": "cr:field", + "fileProperty": "cr:fileProperty", + "fileObject": "cr:fileObject", + "fileSet": "cr:fileSet", + "format": "cr:format", + "includes": "cr:includes", + "isLiveDataset": "cr:isLiveDataset", + "jsonPath": "cr:jsonPath", + "key": "cr:key", + "md5": "cr:md5", + "parentField": "cr:parentField", + "path": "cr:path", + "recordSet": "cr:recordSet", + "references": "cr:references", + "regex": "cr:regex", + "repeated": "cr:repeated", + "replace": "cr:replace", + "samplingRate": "cr:samplingRate", + "sc": "https://schema.org/", + "separator": "cr:separator", + "source": "cr:source", + "subField": "cr:subField", + "transform": "cr:transform", } func croissantDataType(v sidecarVariable) string { @@ -362,14 +424,37 @@ func buildCroissant(m sidecarModel) ([]byte, error) { } doc := map[string]interface{}{ - "@context": croissantContext, - "@type": "sc:Dataset", - "conformsTo": "http://mlcommons.org/croissant/1.0", - "name": m.datasetName(), - "description": m.datasetDescription(), - "version": "1.0.0", - "datePublished": m.GeneratedAt, - "distribution": distribution, + "@context": croissantContext, + "@type": "sc:Dataset", + "conformsTo": "http://mlcommons.org/croissant/1.0", + "name": m.datasetName(), + "description": m.datasetDescription(), + "version": "1.0.0", + "distribution": distribution, + } + if date, ok := m.publishedDate(); ok { + doc["datePublished"] = date + } + + // Variable-level metadata as schema.org variableMeasured, following the + // CDIF 1.1 Discovery profile shape (Croissant's @vocab is schema.org, so + // the terms expand to the right IRIs; mlcroissant accepts them). Code + // lists are inline DefinedTerms via valueReference. 
+ if len(m.Variables) > 0 { + variableMeasured := make([]interface{}, 0, len(m.Variables)) + for _, v := range m.Variables { + pv := propertyValueBase(v) + pv["@id"] = "variable/" + v.Column + if len(v.Choices) > 0 { + terms := make([]interface{}, 0, len(v.Choices)) + for _, c := range v.Choices { + terms = append(terms, definedTermFor(c)) + } + pv["valueReference"] = terms + } + variableMeasured = append(variableMeasured, pv) + } + doc["variableMeasured"] = variableMeasured } if m.DataFormat == "csv" && len(m.Variables) > 0 { @@ -427,17 +512,45 @@ func buildROCrate(m sidecarModel) ([]byte, error) { } rootDataset := map[string]interface{}{ - "@id": "./", - "@type": "Dataset", - "name": m.datasetName(), - "description": m.datasetDescription(), - "datePublished": m.GeneratedAt, - "hasPart": hasPart, - "mentions": map[string]interface{}{"@id": "#export-action"}, + "@id": "./", + "@type": "Dataset", + "name": m.datasetName(), + "description": m.datasetDescription(), + "hasPart": hasPart, + "mentions": map[string]interface{}{"@id": "#export-action"}, + } + if date, ok := m.publishedDate(); ok { + rootDataset["datePublished"] = date } if m.ProjectID != nil { rootDataset["identifier"] = fmt.Sprintf("redcap-project-%v", m.ProjectID) } + + // Variable-level metadata as schema.org variableMeasured contextual + // entities (CDIF 1.1 Discovery profile shape). RO-Crate JSON-LD is + // flattened: every PropertyValue and DefinedTerm is its own graph entity. 
+ if len(m.Variables) > 0 { + variableRefs := make([]interface{}, 0, len(m.Variables)) + for _, v := range m.Variables { + variableID := "#variable/" + v.Column + variableRefs = append(variableRefs, map[string]interface{}{"@id": variableID}) + pv := propertyValueBase(v) + pv["@id"] = variableID + if len(v.Choices) > 0 { + termRefs := make([]interface{}, 0, len(v.Choices)) + for _, c := range v.Choices { + termID := variableID + "/code/" + safeFragment(c.Code) + termRefs = append(termRefs, map[string]interface{}{"@id": termID}) + term := definedTermFor(c) + term["@id"] = termID + graph = append(graph, term) + } + pv["valueReference"] = termRefs + } + graph = append(graph, pv) + } + rootDataset["variableMeasured"] = variableRefs + } graph = append(graph, rootDataset) for _, f := range m.Files { @@ -462,11 +575,13 @@ func buildROCrate(m sidecarModel) ([]byte, error) { "name": "REDCap export", "instrument": map[string]interface{}{"@id": "#rdm-integration-redcap2"}, "result": results, - "endTime": m.GeneratedAt, "description": fmt.Sprintf( "Files generated from the REDCap API (export mode: %s) with client-side anonymization applied as documented in manifest.json", m.ExportMode), } + if date, ok := m.publishedDate(); ok { + action["endTime"] = date + } graph = append(graph, action) graph = append(graph, map[string]interface{}{ "@id": "#rdm-integration-redcap2", @@ -576,33 +691,63 @@ func buildDDICDI(m sidecarModel) ([]byte, error) { variableIDs = append(variableIDs, varID) componentIDs = append(componentIDs, componentID) - // Code list from the REDCap choices definition. + // Code list from the REDCap choices definition. Per the DDI-CDI 1.0 + // model (and its official SHACL shapes), a Code carries no literal + // value itself: it uses a Notation whose TypedString content is the + // value as it appears in the data, and denotes a Category that holds + // the human-readable label. 
codeListID := "" if len(v.Choices) > 0 { codeListID = "#" + frag + "_CodeList" codeIDs := []interface{}{} for _, c := range v.Choices { - codeID := "#" + frag + "_Code_" + safeFragment(c.Code) + codeFrag := frag + "_" + safeFragment(c.Code) + codeID := "#" + codeFrag + "_Code" + categoryID := "#" + codeFrag + "_Category" + notationID := "#" + codeFrag + "_Notation" codeIDs = append(codeIDs, codeID) - codeNode := map[string]interface{}{ - "@id": codeID, - "@type": "Code", - "identifier": c.Code, - } - if c.Label != "" { - codeNode["name"] = c.Label + + categoryName := c.Label + if categoryName == "" { + categoryName = c.Code } - graph = append(graph, codeNode) + graph = append(graph, map[string]interface{}{ + "@id": categoryID, + "@type": "Category", + "name": map[string]interface{}{ + "@type": "ObjectName", + "name": categoryName, + }, + }) + graph = append(graph, map[string]interface{}{ + "@id": notationID, + "@type": "Notation", + "content": map[string]interface{}{ + "@type": "TypedString", + "content": c.Code, + }, + "represents": categoryID, + }) + graph = append(graph, map[string]interface{}{ + "@id": codeID, + "@type": "Code", + "denotes": categoryID, + "uses_Notation": notationID, + }) } label := v.Label if label == "" { label = v.Column } graph = append(graph, map[string]interface{}{ - "@id": codeListID, - "@type": "CodeList", - "name": label + " codes", - "has_Code": codeIDs, + "@id": codeListID, + "@type": "CodeList", + "name": map[string]interface{}{ + "@type": "ObjectName", + "name": label + " codes", + }, + "allowsDuplicates": false, + "has_Code": codeIDs, }) } @@ -684,11 +829,17 @@ func buildDDICDI(m sidecarModel) ([]byte, error) { "@type": "WideDataSet", "isStructuredBy": "#datastructure", }) - graph = append(graph, map[string]interface{}{ + structure := map[string]interface{}{ "@id": "#datastructure", "@type": "WideDataStructure", "has_DataStructureComponent": componentIDs, - }) + } + if primaryKeyComponent != "" { + // The SHACL shapes require the 
primary key to be reachable from the + // data structure (DataStructure_has_PrimaryKey). + structure["has_PrimaryKey"] = "#primaryKey" + } + graph = append(graph, structure) graph = append(graph, map[string]interface{}{ "@id": "#logicalRecord", "@type": "LogicalRecord", @@ -702,9 +853,9 @@ func buildDDICDI(m sidecarModel) ([]byte, error) { "isComposedOf": "#primaryKeyComponent", }) graph = append(graph, map[string]interface{}{ - "@id": "#primaryKeyComponent", - "@type": "PrimaryKeyComponent", - "correspondsTo": primaryKeyComponent, + "@id": "#primaryKeyComponent", + "@type": "PrimaryKeyComponent", + "correspondsTo_DataStructureComponent": primaryKeyComponent, }) } if m.DataFormat == "csv" { diff --git a/image/app/plugin/impl/redcap2/sidecars_test.go b/image/app/plugin/impl/redcap2/sidecars_test.go index dafacdb..d515357 100644 --- a/image/app/plugin/impl/redcap2/sidecars_test.go +++ b/image/app/plugin/impl/redcap2/sidecars_test.go @@ -4,19 +4,21 @@ package redcap2 import ( "encoding/json" + "os" + "path/filepath" "reflect" "strings" "testing" ) -const sidecarTestMetadataCSV = "field_name,form_name,field_type,field_label,select_choices_or_calculations,identifier,text_validation_type_or_show_slider_number\n" + - "record_id,demographics,text,Record ID,,,\n" + - "name,demographics,text,Full Name,,y,\n" + - "age,demographics,text,Age,,,integer\n" + - "weight,demographics,text,Weight,,,number\n" + - "sex,demographics,radio,Sex,\"1, Male | 2, Female\",,\n" + - "consent,demographics,yesno,Consent given,,,\n" + - "visit_date,demographics,text,Visit Date,,,date_ymd\n" +const sidecarTestMetadataCSV = "field_name,form_name,field_type,field_label,select_choices_or_calculations,identifier,text_validation_type_or_show_slider_number,text_validation_min,text_validation_max\n" + + "record_id,demographics,text,Record ID,,,,,\n" + + "name,demographics,text,Full Name,,y,,,\n" + + "age,demographics,text,Age,,,integer,0,120\n" + + "weight,demographics,text,Weight,,,number,,\n" + + 
"sex,demographics,radio,Sex,\"1, Male | 2, Female\",,,,\n" + + "consent,demographics,yesno,Consent given,,,,,\n" + + "visit_date,demographics,text,Visit Date,,,date_ymd,2020-01-01,\n" func sidecarTestModel(t *testing.T, opts pluginOptions, plan transformPlan, dataCSV string) sidecarModel { t.Helper() @@ -46,7 +48,7 @@ func TestParseChoiceCodes(t *testing.T) { } func TestBuildSidecarVariables(t *testing.T) { - opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + opts, _ := parsePluginOptions(`{"exportMode":"records","generatedAt":"2026-06-12T00:00:00Z"}`) plan := testPlan(map[string]string{"name": "blank"}) model := sidecarTestModel(t, opts, plan, "record_id,name,age,sex,consent,visit_date\n1,John,34,1,1,2026-01-01\n") @@ -67,10 +69,13 @@ func TestBuildSidecarVariables(t *testing.T) { if byColumn["age"].Validation != "integer" { t.Errorf("age validation = %q", byColumn["age"].Validation) } + if byColumn["age"].MinValue != "0" || byColumn["age"].MaxValue != "120" { + t.Errorf("age min/max = %q/%q, want 0/120", byColumn["age"].MinValue, byColumn["age"].MaxValue) + } } func TestBuildCroissant(t *testing.T) { - opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + opts, _ := parsePluginOptions(`{"exportMode":"records","generatedAt":"2026-06-12T00:00:00Z"}`) plan := testPlan(map[string]string{"name": "pseudonymize"}) plan.keyFingerprint = "abcdef0123456789" model := sidecarTestModel(t, opts, plan, @@ -123,6 +128,40 @@ func TestBuildCroissant(t *testing.T) { t.Errorf("dataType(%s) = %q, want %q", name, dataTypes[name], wantType) } } + + // Variable-level metadata (CDIF-style variableMeasured). 
+ variableMeasured := doc["variableMeasured"].([]interface{}) + byName := map[string]map[string]interface{}{} + for _, entry := range variableMeasured { + pv := entry.(map[string]interface{}) + byName[pv["name"].(string)] = pv + } + if len(byName) != len(model.Variables) { + t.Errorf("variableMeasured has %d entries, want %d", len(byName), len(model.Variables)) + } + age := byName["age"] + if age["@type"] != "PropertyValue" || age["minValue"] != float64(0) || age["maxValue"] != float64(120) { + t.Errorf("age variableMeasured = %v", age) + } + sex := byName["sex"] + if sex["alternateName"] != "Sex" { + t.Errorf("sex alternateName = %v", sex["alternateName"]) + } + terms := sex["valueReference"].([]interface{}) + if len(terms) != 2 { + t.Fatalf("sex valueReference = %v, want 2 DefinedTerms", terms) + } + firstTerm := terms[0].(map[string]interface{}) + if firstTerm["@type"] != "DefinedTerm" || firstTerm["termCode"] != "1" || firstTerm["name"] != "Male" { + t.Errorf("sex code term = %v", firstTerm) + } + if _, ok := byName["visit_date"]["minValue"]; ok { + t.Error("non-numeric validation bounds must not become minValue") + } + pseudonymized := byName["name"] + if !strings.Contains(pseudonymized["description"].(string), "pseudonymize") { + t.Errorf("transform note missing from description: %v", pseudonymized["description"]) + } } func TestBuildCroissantJSONExportHasNoRecordSet(t *testing.T) { @@ -142,7 +181,7 @@ func TestBuildCroissantJSONExportHasNoRecordSet(t *testing.T) { } func TestBuildROCrate(t *testing.T) { - opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + opts, _ := parsePluginOptions(`{"exportMode":"records","generatedAt":"2026-06-12T00:00:00Z"}`) model := sidecarTestModel(t, opts, transformPlan{}, "record_id,age\n1,34\n") @@ -187,10 +226,56 @@ func TestBuildROCrate(t *testing.T) { if byID["#rdm-integration-redcap2"] == nil || byID["#redcap"] == nil { t.Error("software application entities missing") } + + // Variables are flattened PropertyValue 
entities referenced from the root. + variableRefs := root["variableMeasured"].([]interface{}) + if len(variableRefs) != len(model.Variables) { + t.Errorf("variableMeasured refs = %d, want %d", len(variableRefs), len(model.Variables)) + } + recordID := byID["#variable/record_id"] + if recordID == nil || recordID["@type"] != "PropertyValue" || recordID["name"] != "record_id" { + t.Errorf("record_id variable entity = %v", recordID) + } +} + +func TestBuildROCrateFlattensCodeLists(t *testing.T) { + opts, _ := parsePluginOptions(`{"exportMode":"records","generatedAt":"2026-06-12T00:00:00Z"}`) + model := sidecarTestModel(t, opts, transformPlan{}, + "record_id,sex\n1,1\n") + + data, err := buildROCrate(model) + if err != nil { + t.Fatalf("buildROCrate returned error: %v", err) + } + doc := map[string]interface{}{} + _ = json.Unmarshal(data, &doc) + byID := map[string]map[string]interface{}{} + for _, entry := range doc["@graph"].([]interface{}) { + node := entry.(map[string]interface{}) + byID[node["@id"].(string)] = node + } + + sex := byID["#variable/sex"] + if sex == nil { + t.Fatal("missing #variable/sex entity") + } + termRefs := sex["valueReference"].([]interface{}) + if len(termRefs) != 2 { + t.Fatalf("sex valueReference = %v", termRefs) + } + // RO-Crate JSON-LD must stay flattened: code terms are their own entities. 
+ firstRef := termRefs[0].(map[string]interface{}) + if len(firstRef) != 1 || firstRef["@id"] != "#variable/sex/code/1" { + t.Errorf("term reference must be an @id ref, got %v", firstRef) + } + term := byID["#variable/sex/code/1"] + if term == nil || term["@type"] != "DefinedTerm" || term["termCode"] != "1" || term["name"] != "Male" { + t.Errorf("code term entity = %v", term) + } } func TestBuildDDICDI(t *testing.T) { - opts, _ := parsePluginOptions(`{"exportMode":"records"}`) + opts, _ := parsePluginOptions(`{"exportMode":"records","generatedAt":"2026-06-12T00:00:00Z"}`) model := sidecarTestModel(t, opts, transformPlan{}, "record_id,age,sex\n1,34,1\n") @@ -230,9 +315,46 @@ func TestBuildDDICDI(t *testing.T) { if len(byType["PrimaryKey"]) != 1 || len(byType["PrimaryKeyComponent"]) != 1 { t.Error("primary key nodes missing") } - // sex has a code list with two codes - if len(byType["CodeList"]) != 1 || len(byType["Code"]) != 2 { - t.Errorf("CodeList/Code = %d/%d, want 1/2", len(byType["CodeList"]), len(byType["Code"])) + // The SHACL shapes require the primary key to be reachable from the + // structure and the component to use the full association term. + if byID["#datastructure"]["has_PrimaryKey"] != "#primaryKey" { + t.Errorf("datastructure has_PrimaryKey = %v", byID["#datastructure"]["has_PrimaryKey"]) + } + if byID["#primaryKeyComponent"]["correspondsTo_DataStructureComponent"] == nil { + t.Errorf("primaryKeyComponent = %v", byID["#primaryKeyComponent"]) + } + // sex has a code list with two codes; per DDI-CDI each Code uses a + // Notation (the value as it appears in the data) and denotes a Category + // (the label). 
+ if len(byType["CodeList"]) != 1 || len(byType["Code"]) != 2 || + len(byType["Notation"]) != 2 || len(byType["Category"]) != 2 { + t.Errorf("CodeList/Code/Notation/Category = %d/%d/%d/%d, want 1/2/2/2", + len(byType["CodeList"]), len(byType["Code"]), len(byType["Notation"]), len(byType["Category"])) + } + code := byID["#sex_1_Code"] + if code == nil || code["denotes"] != "#sex_1_Category" || code["uses_Notation"] != "#sex_1_Notation" { + t.Fatalf("code node = %v", code) + } + if _, ok := code["identifier"]; ok { + t.Error("Code must not carry a literal identifier (the value lives in the Notation)") + } + notation := byID["#sex_1_Notation"] + content := notation["content"].(map[string]interface{}) + if content["@type"] != "TypedString" || content["content"] != "1" { + t.Errorf("notation content = %v, want TypedString with the data value", content) + } + category := byID["#sex_1_Category"] + categoryName := category["name"].(map[string]interface{}) + if categoryName["@type"] != "ObjectName" || categoryName["name"] != "Male" { + t.Errorf("category name = %v", categoryName) + } + codeList := byID["#sex_CodeList"] + if codeList["allowsDuplicates"] != false { + t.Errorf("codeList allowsDuplicates = %v, want false (required by SHACL)", codeList["allowsDuplicates"]) + } + codeListName := codeList["name"].(map[string]interface{}) + if codeListName["@type"] != "ObjectName" { + t.Errorf("codeList name = %v, want ObjectName object", codeList["name"]) } // CSV exports describe the physical layout layout := byID["#physicalSegmentLayout"] @@ -264,6 +386,36 @@ func TestBuildDDICDIJSONExportSkipsPhysicalLayout(t *testing.T) { } } +// TestDumpSidecarsForValidation writes generated sidecars to SIDECAR_DUMP_DIR +// for external validation: pyshacl against the official DDI-CDI 1.0 SHACL +// shapes (the ones bundled with the cdi-viewer previewer) and +// `mlcroissant validate --jsonld`. 
Skipped unless the env var is set: +// +// SIDECAR_DUMP_DIR=/tmp/dump go test ./app/plugin/impl/redcap2/ -run TestDumpSidecarsForValidation +func TestDumpSidecarsForValidation(t *testing.T) { + dir := os.Getenv("SIDECAR_DUMP_DIR") + if dir == "" { + t.Skip("SIDECAR_DUMP_DIR not set") + } + opts, _ := parsePluginOptions(`{"exportMode":"records","generatedAt":"2026-06-12T00:00:00Z"}`) + plan := testPlan(map[string]string{"name": "pseudonymize", "email": "drop"}) + plan.keyFingerprint = "abcdef0123456789" + model := sidecarTestModel(t, opts, plan, + "record_id,name,age,weight,sex,consent,visit_date\n1,x,34,70.5,1,1,2026-01-01\n") + files := map[string][]byte{} + mime := map[string]string{} + if warnings := addSidecars(model, "redcap/records", files, mime); len(warnings) != 0 { + t.Fatalf("sidecar warnings: %v", warnings) + } + for path, data := range files { + out := filepath.Join(dir, filepath.Base(path)) + if err := os.WriteFile(out, data, 0o644); err != nil { + t.Fatal(err) + } + t.Logf("wrote %s", out) + } +} + // End to end: the bundle contains the three sidecars and the ODM file, they // are valid JSON/XML, dropped variables are absent, and generation is // deterministic (same input, same bytes). diff --git a/redcap.md b/redcap.md index 6a3877f..97e8489 100644 --- a/redcap.md +++ b/redcap.md @@ -581,6 +581,11 @@ Fixes the review findings before new features (see [Review, Research, And Decisi 5. ~~`project_metadata.xml`~~ — done (always generated, failure-tolerant warning). 6. ~~Schema validation tests~~ — structural Go tests for all three outputs + determinism e2e; external validation documented in the user guide. 7. New external tool conf: `conf/dataverse/external-tools/12-jsonld-previewer.json` (cdi-viewer registered for bare `application/ld+json`, fires on croissant.json). +8. 
**Variable-level metadata + validation fixes (2026-06-12):** + - `croissant.json` and `ro-crate-metadata.json` now carry `schema:variableMeasured` following the CDIF 1.1 Discovery-profile shape (PropertyValue with name, description, alternateName, numeric minValue/maxValue from `text_validation_min/max`, code lists as `valueReference` DefinedTerms with termCode = the value in the data). Inline in Croissant; flattened contextual entities in RO-Crate (spec requires flattened JSON-LD). Verified: `mlcroissant validate --jsonld` exits clean (only citeAs/license "recommended" warnings). + - `ddi-cdi.jsonld` code lists restructured per the official DDI-CDI 1.0 SHACL shapes (the ones bundled with the cdi-viewer): each `Code` now `uses_Notation` (TypedString content = the value as it appears in the data) and `denotes` a `Category` (ObjectName = the label); `CodeList` carries `allowsDuplicates` and an ObjectName name; the `PrimaryKey` is reachable via `DataStructure_has_PrimaryKey`; `PrimaryKeyComponent` uses the full `correspondsTo_DataStructureComponent` term. Verified with pyshacl against `libis/cdi-viewer` `shapes/ddi-cdi-official.ttl`: **Conforms = True** (previously 13 violations, the "Less than 1 values" errors seen in the previewer). + - Validation workflow: `SIDECAR_DUMP_DIR=/tmp/dump go test ./app/plugin/impl/redcap2/ -run TestDumpSidecarsForValidation` writes sample sidecars for pyshacl/mlcroissant runs. Note: the hosted DDI-CDI context (ddi-cdi.github.io) currently contains stray git conflict markers; strip them before local RDF parsing. + - Future work: full CDIF 1.1 Data Description profile (double-typing variableMeasured as `cdi:InstanceVariable` + skos code lists) once the profile leaves review (currently `reviewRevision`, prefixes at 0.1; "Semantic Croissant" has no published pattern yet). 
### Phase 6: Hardening And Rollout [In Progress — 2026-06-12] From c9e80eebe0089a9cd4ad035465229af365433a58 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 15:25:25 +0200 Subject: [PATCH 12/25] redcap2: self-contained inline JSON-LD context for ddi-cdi.jsonld The generated file referenced the hosted DDI-CDI context by URL. That copy is currently invalid JSON upstream (stray git conflict markers) and the cdi-viewer's local fallback 404s on the deployed site, so the viewer falls back to an EMPTY context: every compact property key (name, denotes, content, ...) is silently dropped during JSON-LD expansion and SHACL validation reports mass 'less than 1 values' violations on documents whose content is actually correct. ddi-cdi.jsonld now embeds a minimal inline context covering exactly the terms the generator emits (class-scoped IRIs copied from the official context, JSON-LD 1.1 type-scoped). Nothing remote to fetch, nothing to break. Verified: parsing with zero network access yields the full triple set and pyshacl reports Conforms=True against the exact shapes the deployed viewer loads. --- image/app/plugin/impl/redcap2/sidecars.go | 127 +++++++++++++++++- .../app/plugin/impl/redcap2/sidecars_test.go | 28 +++- redcap.md | 3 +- 3 files changed, 148 insertions(+), 10 deletions(-) diff --git a/image/app/plugin/impl/redcap2/sidecars.go b/image/app/plugin/impl/redcap2/sidecars.go index 5ac5022..266bbf6 100644 --- a/image/app/plugin/impl/redcap2/sidecars.go +++ b/image/app/plugin/impl/redcap2/sidecars.go @@ -608,10 +608,127 @@ func buildROCrate(m sidecarModel) ([]byte, error) { // --- DDI-CDI 1.0 --- -// ddiCdiContext matches the in-repo DDI-CDI generator (cdi_generator_jsonld.py) -// whose output validates against the official DDI-CDI 1.0 SHACL shapes used by -// the cdi-viewer previewer. -const ddiCdiContext = "https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld" +// cdiRef defines a JSON-LD term whose values are IRI references. 
+func cdiRef(iri string) map[string]interface{} { + return map[string]interface{}{"@id": iri, "@type": "@id"} +} + +// cdiLit defines a JSON-LD term whose values are typed literals. +func cdiLit(iri, dataType string) map[string]interface{} { + return map[string]interface{}{"@id": iri, "@type": dataType} +} + +// cdiClass defines a class term with a type-scoped context (JSON-LD 1.1), +// mirroring the structure of the official DDI-CDI context. +func cdiClass(name string, terms map[string]interface{}) map[string]interface{} { + return map[string]interface{}{"@id": "cdi:" + name, "@context": terms} +} + +// ddiCdiInlineContext is a minimal, self-contained JSON-LD context covering +// exactly the terms this generator emits, with the class-scoped term IRIs +// copied from the official DDI-CDI 1.0 context +// (https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld). +// It is embedded inline instead of referencing that URL: when the remote +// context cannot be fetched or parsed (it currently contains stray git +// conflict markers upstream), consumers like the cdi-viewer previewer fall +// back to an empty context — every compact property key is then silently +// dropped during expansion and SHACL validation reports mass "less than 1 +// values" violations. An inline context cannot fail to load. 
+var ddiCdiInlineContext = buildDdiCdiInlineContext() + +func buildDdiCdiInlineContext() map[string]interface{} { + component := map[string]interface{}{ + "isDefinedBy_RepresentedVariable": cdiRef("cdi:DataStructureComponent_isDefinedBy_RepresentedVariable"), + } + return map[string]interface{}{ + "cdi": "http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/", + "xsd": "http://www.w3.org/2001/XMLSchema#", + "Code": cdiClass("Code", map[string]interface{}{ + "denotes": cdiRef("cdi:Code_denotes_Category"), + "uses_Notation": cdiRef("cdi:Code_uses_Notation"), + }), + "Category": cdiClass("Category", map[string]interface{}{ + "name": cdiRef("cdi:Concept-name"), + }), + "Notation": cdiClass("Notation", map[string]interface{}{ + "content": cdiRef("cdi:Notation-content"), + "represents": cdiRef("cdi:Notation_represents_Category"), + }), + "CodeList": cdiClass("CodeList", map[string]interface{}{ + "name": cdiRef("cdi:EnumerationDomain-name"), + "allowsDuplicates": cdiLit("cdi:CodeList-allowsDuplicates", "xsd:boolean"), + "has_Code": cdiRef("cdi:CodeList_has_Code"), + }), + "TypedString": cdiClass("TypedString", map[string]interface{}{ + "content": cdiLit("cdi:TypedString-content", "xsd:string"), + }), + "ObjectName": cdiClass("ObjectName", map[string]interface{}{ + "name": cdiLit("cdi:ObjectName-name", "xsd:string"), + }), + "SubstantiveValueDomain": cdiClass("SubstantiveValueDomain", map[string]interface{}{ + "recommendedDataType": cdiRef("cdi:ValueDomain-recommendedDataType"), + "takesValuesFrom": cdiRef("cdi:SubstantiveValueDomain_takesValuesFrom_EnumerationDomain"), + }), + "ControlledVocabularyEntry": cdiClass("ControlledVocabularyEntry", map[string]interface{}{ + "entryValue": cdiLit("cdi:ControlledVocabularyEntry-entryValue", "xsd:string"), + "vocabulary": cdiRef("cdi:ControlledVocabularyEntry-vocabulary"), + }), + "Reference": cdiClass("Reference", map[string]interface{}{ + "uri": cdiLit("cdi:Reference-uri", "xsd:anyURI"), + }), + "InstanceVariable": 
cdiClass("InstanceVariable", map[string]interface{}{ + "name": cdiRef("cdi:Concept-name"), + "definition": cdiRef("cdi:Concept-definition"), + "has_ValueMapping": cdiRef("cdi:InstanceVariable_has_ValueMapping"), + "takesSubstantiveValuesFrom_SubstantiveValueDomain": cdiRef("cdi:RepresentedVariable_takesSubstantiveValuesFrom_SubstantiveValueDomain"), + }), + "InternationalString": cdiClass("InternationalString", map[string]interface{}{ + "languageSpecificString": cdiRef("cdi:InternationalString-languageSpecificString"), + }), + "LanguageString": cdiClass("LanguageString", map[string]interface{}{ + "content": cdiLit("cdi:LanguageString-content", "xsd:string"), + }), + "WideDataSet": cdiClass("WideDataSet", map[string]interface{}{ + "isStructuredBy": cdiRef("cdi:DataSet_isStructuredBy_DataStructure"), + }), + "WideDataStructure": cdiClass("WideDataStructure", map[string]interface{}{ + "has_DataStructureComponent": cdiRef("cdi:DataStructure_has_DataStructureComponent"), + "has_PrimaryKey": cdiRef("cdi:DataStructure_has_PrimaryKey"), + }), + "LogicalRecord": cdiClass("LogicalRecord", map[string]interface{}{ + "organizes": cdiRef("cdi:LogicalRecord_organizes_DataSet"), + "has_InstanceVariable": cdiRef("cdi:LogicalRecord_has_InstanceVariable"), + }), + "PrimaryKey": cdiClass("PrimaryKey", map[string]interface{}{ + "isComposedOf": cdiRef("cdi:PrimaryKey_isComposedOf_PrimaryKeyComponent"), + }), + "PrimaryKeyComponent": cdiClass("PrimaryKeyComponent", map[string]interface{}{ + "correspondsTo_DataStructureComponent": cdiRef("cdi:PrimaryKeyComponent_correspondsTo_DataStructureComponent"), + }), + "IdentifierComponent": cdiClass("IdentifierComponent", component), + "MeasureComponent": cdiClass("MeasureComponent", component), + "DimensionComponent": cdiClass("DimensionComponent", component), + "AttributeComponent": cdiClass("AttributeComponent", component), + "ValueMapping": cdiClass("ValueMapping", map[string]interface{}{ + "defaultValue": 
cdiLit("cdi:ValueMapping-defaultValue", "xsd:string"), + }), + "ValueMappingPosition": cdiClass("ValueMappingPosition", map[string]interface{}{ + "indexes": cdiRef("cdi:ValueMappingPosition_indexes_ValueMapping"), + "value": cdiLit("cdi:ValueMappingPosition-value", "xsd:integer"), + }), + "PhysicalSegmentLayout": cdiClass("PhysicalSegmentLayout", map[string]interface{}{ + "allowsDuplicates": cdiLit("cdi:PhysicalSegmentLayout-allowsDuplicates", "xsd:boolean"), + "isDelimited": cdiLit("cdi:PhysicalSegmentLayout-isDelimited", "xsd:boolean"), + "isFixedWidth": cdiLit("cdi:PhysicalSegmentLayout-isFixedWidth", "xsd:boolean"), + "hasHeader": cdiLit("cdi:PhysicalSegmentLayout-hasHeader", "xsd:boolean"), + "headerRowCount": cdiLit("cdi:PhysicalSegmentLayout-headerRowCount", "xsd:integer"), + "delimiter": cdiLit("cdi:PhysicalSegmentLayout-delimiter", "xsd:string"), + "formats": cdiRef("cdi:PhysicalSegmentLayout_formats_LogicalRecord"), + "has_ValueMapping": cdiRef("cdi:PhysicalSegmentLayout_has_ValueMapping"), + "has_ValueMappingPosition": cdiRef("cdi:PhysicalSegmentLayout_has_ValueMappingPosition"), + }), + } +} func ddiCdiDataType(v sidecarVariable) string { switch v.FieldType { @@ -879,7 +996,7 @@ func buildDDICDI(m sidecarModel) ([]byte, error) { } doc := map[string]interface{}{ - "@context": ddiCdiContext, + "@context": ddiCdiInlineContext, "@graph": graph, } return json.MarshalIndent(doc, "", " ") diff --git a/image/app/plugin/impl/redcap2/sidecars_test.go b/image/app/plugin/impl/redcap2/sidecars_test.go index d515357..c66a239 100644 --- a/image/app/plugin/impl/redcap2/sidecars_test.go +++ b/image/app/plugin/impl/redcap2/sidecars_test.go @@ -287,8 +287,20 @@ func TestBuildDDICDI(t *testing.T) { if err := json.Unmarshal(data, &doc); err != nil { t.Fatalf("ddi-cdi.jsonld is invalid JSON: %v", err) } - if doc["@context"] != ddiCdiContext { - t.Errorf("@context = %v", doc["@context"]) + // The context must be self-contained (inline): referencing the remote + // DDI-CDI 
context URL makes consumers fail silently when it is + // unreachable or unparseable (as the hosted copy currently is). + context, ok := doc["@context"].(map[string]interface{}) + if !ok { + t.Fatalf("@context = %v, want inline context object", doc["@context"]) + } + if context["cdi"] != "http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/" { + t.Errorf("cdi prefix = %v", context["cdi"]) + } + for _, class := range []string{"Code", "Category", "Notation", "CodeList", "InstanceVariable", "PhysicalSegmentLayout"} { + if _, ok := context[class]; !ok { + t.Errorf("inline context missing class term %s", class) + } } byType := map[string][]map[string]interface{}{} @@ -381,8 +393,16 @@ func TestBuildDDICDIJSONExportSkipsPhysicalLayout(t *testing.T) { if err != nil { t.Fatalf("buildDDICDI returned error: %v", err) } - if strings.Contains(string(data), "PhysicalSegmentLayout") || strings.Contains(string(data), "ValueMapping") { - t.Error("JSON exports must not describe a delimited physical layout") + doc := map[string]interface{}{} + if err := json.Unmarshal(data, &doc); err != nil { + t.Fatalf("ddi-cdi.jsonld invalid: %v", err) + } + for _, entry := range doc["@graph"].([]interface{}) { + node := entry.(map[string]interface{}) + switch node["@type"] { + case "PhysicalSegmentLayout", "ValueMapping", "ValueMappingPosition": + t.Errorf("JSON exports must not describe a delimited physical layout, got %v", node["@type"]) + } } } diff --git a/redcap.md b/redcap.md index 97e8489..d383fb1 100644 --- a/redcap.md +++ b/redcap.md @@ -584,7 +584,8 @@ Fixes the review findings before new features (see [Review, Research, And Decisi 8. 
**Variable-level metadata + validation fixes (2026-06-12):** - `croissant.json` and `ro-crate-metadata.json` now carry `schema:variableMeasured` following the CDIF 1.1 Discovery-profile shape (PropertyValue with name, description, alternateName, numeric minValue/maxValue from `text_validation_min/max`, code lists as `valueReference` DefinedTerms with termCode = the value in the data). Inline in Croissant; flattened contextual entities in RO-Crate (spec requires flattened JSON-LD). Verified: `mlcroissant validate --jsonld` exits clean (only citeAs/license "recommended" warnings). - `ddi-cdi.jsonld` code lists restructured per the official DDI-CDI 1.0 SHACL shapes (the ones bundled with the cdi-viewer): each `Code` now `uses_Notation` (TypedString content = the value as it appears in the data) and `denotes` a `Category` (ObjectName = the label); `CodeList` carries `allowsDuplicates` and an ObjectName name; the `PrimaryKey` is reachable via `DataStructure_has_PrimaryKey`; `PrimaryKeyComponent` uses the full `correspondsTo_DataStructureComponent` term. Verified with pyshacl against `libis/cdi-viewer` `shapes/ddi-cdi-official.ttl`: **Conforms = True** (previously 13 violations, the "Less than 1 values" errors seen in the previewer). - - Validation workflow: `SIDECAR_DUMP_DIR=/tmp/dump go test ./app/plugin/impl/redcap2/ -run TestDumpSidecarsForValidation` writes sample sidecars for pyshacl/mlcroissant runs. Note: the hosted DDI-CDI context (ddi-cdi.github.io) currently contains stray git conflict markers; strip them before local RDF parsing. + - Validation workflow: `SIDECAR_DUMP_DIR=/tmp/dump go test ./app/plugin/impl/redcap2/ -run TestDumpSidecarsForValidation` writes sample sidecars for pyshacl/mlcroissant runs. + - **Self-contained context (2026-06-12):** `ddi-cdi.jsonld` embeds a minimal inline `@context` (exactly the ~30 terms emitted, IRIs copied from the official DDI-CDI context) instead of referencing the hosted context URL. 
Reason: the hosted copy currently contains stray git conflict markers (invalid JSON) and the cdi-viewer's local fallback (`shapes/ddi-cdi.jsonld`) 404s on the deployed site — the viewer then expands documents with an **empty context**, silently dropping every property and reporting mass "less than 1 values" SHACL violations even for correct documents. With the inline context there is nothing remote to fail. Verified: zero-network parse yields all triples; pyshacl Conforms=True against the deployed viewer's shapes. The same remote-context fragility affects the dataset-level `cdi_generator_jsonld.py` pipeline output — fix it the same way when touched next. - Future work: full CDIF 1.1 Data Description profile (double-typing variableMeasured as `cdi:InstanceVariable` + skos code lists) once the profile leaves review (currently `reviewRevision`, prefixes at 0.1; "Semantic Croissant" has no published pattern yet). ### Phase 6: Hardening And Rollout [In Progress — 2026-06-12] From 373f23a241004fc21a9257e36f8c4766c86f5e49 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 15:36:32 +0200 Subject: [PATCH 13/25] Revert "redcap2: self-contained inline JSON-LD context for ddi-cdi.jsonld" MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The generated file references the canonical published DDI-CDI context URL again, like the rest of the ecosystem. The validation failures were caused by the viewer's broken context-fallback path (fixed in cdi-viewer 04547d7) combined with the upstream hosted context being temporarily invalid JSON — not by the generated documents. Generators should follow the standard's regular practice rather than work around consumer bugs. 
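
To illustrate the consumer-side failure mode described above, here is a minimal Go sketch of a loader that rejects an unparseable or empty remote context instead of silently continuing with an empty one. It is an illustration under stated assumptions, not the cdi-viewer's actual code: `parseRemoteContext` is a hypothetical helper, and it assumes the context document is a JSON object with an object-valued `@context` key.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// parseRemoteContext is a hypothetical helper (not part of redcap2 or the
// cdi-viewer). It accepts the raw bytes of a fetched JSON-LD context document
// and returns the term map, or an error when the document cannot be used.
// Assumption: the document has the form {"@context": {...}}; array-valued or
// string-valued contexts are deliberately not handled in this sketch.
func parseRemoteContext(body []byte) (map[string]interface{}, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(body, &doc); err != nil {
		// Stray git conflict markers end up here: the body is not JSON.
		return nil, fmt.Errorf("remote context is not valid JSON: %w", err)
	}
	ctx, ok := doc["@context"].(map[string]interface{})
	if !ok || len(ctx) == 0 {
		// An empty context would drop every compact property key during
		// expansion, so treat it as an error rather than a usable fallback.
		return nil, errors.New("remote context has no usable @context object")
	}
	return ctx, nil
}

func main() {
	valid := []byte(`{"@context": {"cdi": "http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/"}}`)
	broken := []byte("<<<<<<< HEAD\n{\"@context\": {}}\n=======\n")

	if ctx, err := parseRemoteContext(valid); err == nil {
		fmt.Println("valid context terms:", len(ctx))
	}
	if _, err := parseRemoteContext(broken); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Failing loudly at this point keeps the problem visible on the consumer side, where it belongs, so the generated documents can keep referencing the canonical context URL.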
--- image/app/plugin/impl/redcap2/sidecars.go | 127 +----------------- .../app/plugin/impl/redcap2/sidecars_test.go | 28 +--- redcap.md | 2 +- 3 files changed, 10 insertions(+), 147 deletions(-) diff --git a/image/app/plugin/impl/redcap2/sidecars.go b/image/app/plugin/impl/redcap2/sidecars.go index 266bbf6..5ac5022 100644 --- a/image/app/plugin/impl/redcap2/sidecars.go +++ b/image/app/plugin/impl/redcap2/sidecars.go @@ -608,127 +608,10 @@ func buildROCrate(m sidecarModel) ([]byte, error) { // --- DDI-CDI 1.0 --- -// cdiRef defines a JSON-LD term whose values are IRI references. -func cdiRef(iri string) map[string]interface{} { - return map[string]interface{}{"@id": iri, "@type": "@id"} -} - -// cdiLit defines a JSON-LD term whose values are typed literals. -func cdiLit(iri, dataType string) map[string]interface{} { - return map[string]interface{}{"@id": iri, "@type": dataType} -} - -// cdiClass defines a class term with a type-scoped context (JSON-LD 1.1), -// mirroring the structure of the official DDI-CDI context. -func cdiClass(name string, terms map[string]interface{}) map[string]interface{} { - return map[string]interface{}{"@id": "cdi:" + name, "@context": terms} -} - -// ddiCdiInlineContext is a minimal, self-contained JSON-LD context covering -// exactly the terms this generator emits, with the class-scoped term IRIs -// copied from the official DDI-CDI 1.0 context -// (https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld). -// It is embedded inline instead of referencing that URL: when the remote -// context cannot be fetched or parsed (it currently contains stray git -// conflict markers upstream), consumers like the cdi-viewer previewer fall -// back to an empty context — every compact property key is then silently -// dropped during expansion and SHACL validation reports mass "less than 1 -// values" violations. An inline context cannot fail to load. 
-var ddiCdiInlineContext = buildDdiCdiInlineContext() - -func buildDdiCdiInlineContext() map[string]interface{} { - component := map[string]interface{}{ - "isDefinedBy_RepresentedVariable": cdiRef("cdi:DataStructureComponent_isDefinedBy_RepresentedVariable"), - } - return map[string]interface{}{ - "cdi": "http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/", - "xsd": "http://www.w3.org/2001/XMLSchema#", - "Code": cdiClass("Code", map[string]interface{}{ - "denotes": cdiRef("cdi:Code_denotes_Category"), - "uses_Notation": cdiRef("cdi:Code_uses_Notation"), - }), - "Category": cdiClass("Category", map[string]interface{}{ - "name": cdiRef("cdi:Concept-name"), - }), - "Notation": cdiClass("Notation", map[string]interface{}{ - "content": cdiRef("cdi:Notation-content"), - "represents": cdiRef("cdi:Notation_represents_Category"), - }), - "CodeList": cdiClass("CodeList", map[string]interface{}{ - "name": cdiRef("cdi:EnumerationDomain-name"), - "allowsDuplicates": cdiLit("cdi:CodeList-allowsDuplicates", "xsd:boolean"), - "has_Code": cdiRef("cdi:CodeList_has_Code"), - }), - "TypedString": cdiClass("TypedString", map[string]interface{}{ - "content": cdiLit("cdi:TypedString-content", "xsd:string"), - }), - "ObjectName": cdiClass("ObjectName", map[string]interface{}{ - "name": cdiLit("cdi:ObjectName-name", "xsd:string"), - }), - "SubstantiveValueDomain": cdiClass("SubstantiveValueDomain", map[string]interface{}{ - "recommendedDataType": cdiRef("cdi:ValueDomain-recommendedDataType"), - "takesValuesFrom": cdiRef("cdi:SubstantiveValueDomain_takesValuesFrom_EnumerationDomain"), - }), - "ControlledVocabularyEntry": cdiClass("ControlledVocabularyEntry", map[string]interface{}{ - "entryValue": cdiLit("cdi:ControlledVocabularyEntry-entryValue", "xsd:string"), - "vocabulary": cdiRef("cdi:ControlledVocabularyEntry-vocabulary"), - }), - "Reference": cdiClass("Reference", map[string]interface{}{ - "uri": cdiLit("cdi:Reference-uri", "xsd:anyURI"), - }), - "InstanceVariable": 
cdiClass("InstanceVariable", map[string]interface{}{ - "name": cdiRef("cdi:Concept-name"), - "definition": cdiRef("cdi:Concept-definition"), - "has_ValueMapping": cdiRef("cdi:InstanceVariable_has_ValueMapping"), - "takesSubstantiveValuesFrom_SubstantiveValueDomain": cdiRef("cdi:RepresentedVariable_takesSubstantiveValuesFrom_SubstantiveValueDomain"), - }), - "InternationalString": cdiClass("InternationalString", map[string]interface{}{ - "languageSpecificString": cdiRef("cdi:InternationalString-languageSpecificString"), - }), - "LanguageString": cdiClass("LanguageString", map[string]interface{}{ - "content": cdiLit("cdi:LanguageString-content", "xsd:string"), - }), - "WideDataSet": cdiClass("WideDataSet", map[string]interface{}{ - "isStructuredBy": cdiRef("cdi:DataSet_isStructuredBy_DataStructure"), - }), - "WideDataStructure": cdiClass("WideDataStructure", map[string]interface{}{ - "has_DataStructureComponent": cdiRef("cdi:DataStructure_has_DataStructureComponent"), - "has_PrimaryKey": cdiRef("cdi:DataStructure_has_PrimaryKey"), - }), - "LogicalRecord": cdiClass("LogicalRecord", map[string]interface{}{ - "organizes": cdiRef("cdi:LogicalRecord_organizes_DataSet"), - "has_InstanceVariable": cdiRef("cdi:LogicalRecord_has_InstanceVariable"), - }), - "PrimaryKey": cdiClass("PrimaryKey", map[string]interface{}{ - "isComposedOf": cdiRef("cdi:PrimaryKey_isComposedOf_PrimaryKeyComponent"), - }), - "PrimaryKeyComponent": cdiClass("PrimaryKeyComponent", map[string]interface{}{ - "correspondsTo_DataStructureComponent": cdiRef("cdi:PrimaryKeyComponent_correspondsTo_DataStructureComponent"), - }), - "IdentifierComponent": cdiClass("IdentifierComponent", component), - "MeasureComponent": cdiClass("MeasureComponent", component), - "DimensionComponent": cdiClass("DimensionComponent", component), - "AttributeComponent": cdiClass("AttributeComponent", component), - "ValueMapping": cdiClass("ValueMapping", map[string]interface{}{ - "defaultValue": 
cdiLit("cdi:ValueMapping-defaultValue", "xsd:string"), - }), - "ValueMappingPosition": cdiClass("ValueMappingPosition", map[string]interface{}{ - "indexes": cdiRef("cdi:ValueMappingPosition_indexes_ValueMapping"), - "value": cdiLit("cdi:ValueMappingPosition-value", "xsd:integer"), - }), - "PhysicalSegmentLayout": cdiClass("PhysicalSegmentLayout", map[string]interface{}{ - "allowsDuplicates": cdiLit("cdi:PhysicalSegmentLayout-allowsDuplicates", "xsd:boolean"), - "isDelimited": cdiLit("cdi:PhysicalSegmentLayout-isDelimited", "xsd:boolean"), - "isFixedWidth": cdiLit("cdi:PhysicalSegmentLayout-isFixedWidth", "xsd:boolean"), - "hasHeader": cdiLit("cdi:PhysicalSegmentLayout-hasHeader", "xsd:boolean"), - "headerRowCount": cdiLit("cdi:PhysicalSegmentLayout-headerRowCount", "xsd:integer"), - "delimiter": cdiLit("cdi:PhysicalSegmentLayout-delimiter", "xsd:string"), - "formats": cdiRef("cdi:PhysicalSegmentLayout_formats_LogicalRecord"), - "has_ValueMapping": cdiRef("cdi:PhysicalSegmentLayout_has_ValueMapping"), - "has_ValueMappingPosition": cdiRef("cdi:PhysicalSegmentLayout_has_ValueMappingPosition"), - }), - } -} +// ddiCdiContext matches the in-repo DDI-CDI generator (cdi_generator_jsonld.py) +// whose output validates against the official DDI-CDI 1.0 SHACL shapes used by +// the cdi-viewer previewer. 
+const ddiCdiContext = "https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld"
 
 func ddiCdiDataType(v sidecarVariable) string {
 	switch v.FieldType {
@@ -996,7 +879,7 @@ func buildDDICDI(m sidecarModel) ([]byte, error) {
 	}
 
 	doc := map[string]interface{}{
-		"@context": ddiCdiInlineContext,
+		"@context": ddiCdiContext,
 		"@graph": graph,
 	}
 	return json.MarshalIndent(doc, "", " ")
diff --git a/image/app/plugin/impl/redcap2/sidecars_test.go b/image/app/plugin/impl/redcap2/sidecars_test.go
index c66a239..d515357 100644
--- a/image/app/plugin/impl/redcap2/sidecars_test.go
+++ b/image/app/plugin/impl/redcap2/sidecars_test.go
@@ -287,20 +287,8 @@ func TestBuildDDICDI(t *testing.T) {
 	if err := json.Unmarshal(data, &doc); err != nil {
 		t.Fatalf("ddi-cdi.jsonld is invalid JSON: %v", err)
 	}
-	// The context must be self-contained (inline): referencing the remote
-	// DDI-CDI context URL makes consumers fail silently when it is
-	// unreachable or unparseable (as the hosted copy currently is).
- context, ok := doc["@context"].(map[string]interface{}) - if !ok { - t.Fatalf("@context = %v, want inline context object", doc["@context"]) - } - if context["cdi"] != "http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/" { - t.Errorf("cdi prefix = %v", context["cdi"]) - } - for _, class := range []string{"Code", "Category", "Notation", "CodeList", "InstanceVariable", "PhysicalSegmentLayout"} { - if _, ok := context[class]; !ok { - t.Errorf("inline context missing class term %s", class) - } + if doc["@context"] != ddiCdiContext { + t.Errorf("@context = %v", doc["@context"]) } byType := map[string][]map[string]interface{}{} @@ -393,16 +381,8 @@ func TestBuildDDICDIJSONExportSkipsPhysicalLayout(t *testing.T) { if err != nil { t.Fatalf("buildDDICDI returned error: %v", err) } - doc := map[string]interface{}{} - if err := json.Unmarshal(data, &doc); err != nil { - t.Fatalf("ddi-cdi.jsonld invalid: %v", err) - } - for _, entry := range doc["@graph"].([]interface{}) { - node := entry.(map[string]interface{}) - switch node["@type"] { - case "PhysicalSegmentLayout", "ValueMapping", "ValueMappingPosition": - t.Errorf("JSON exports must not describe a delimited physical layout, got %v", node["@type"]) - } + if strings.Contains(string(data), "PhysicalSegmentLayout") || strings.Contains(string(data), "ValueMapping") { + t.Error("JSON exports must not describe a delimited physical layout") } } diff --git a/redcap.md b/redcap.md index d383fb1..344f5ab 100644 --- a/redcap.md +++ b/redcap.md @@ -585,7 +585,7 @@ Fixes the review findings before new features (see [Review, Research, And Decisi - `croissant.json` and `ro-crate-metadata.json` now carry `schema:variableMeasured` following the CDIF 1.1 Discovery-profile shape (PropertyValue with name, description, alternateName, numeric minValue/maxValue from `text_validation_min/max`, code lists as `valueReference` DefinedTerms with termCode = the value in the data). 
Inline in Croissant; flattened contextual entities in RO-Crate (spec requires flattened JSON-LD). Verified: `mlcroissant validate --jsonld` exits clean (only citeAs/license "recommended" warnings). - `ddi-cdi.jsonld` code lists restructured per the official DDI-CDI 1.0 SHACL shapes (the ones bundled with the cdi-viewer): each `Code` now `uses_Notation` (TypedString content = the value as it appears in the data) and `denotes` a `Category` (ObjectName = the label); `CodeList` carries `allowsDuplicates` and an ObjectName name; the `PrimaryKey` is reachable via `DataStructure_has_PrimaryKey`; `PrimaryKeyComponent` uses the full `correspondsTo_DataStructureComponent` term. Verified with pyshacl against `libis/cdi-viewer` `shapes/ddi-cdi-official.ttl`: **Conforms = True** (previously 13 violations, the "Less than 1 values" errors seen in the previewer). - Validation workflow: `SIDECAR_DUMP_DIR=/tmp/dump go test ./app/plugin/impl/redcap2/ -run TestDumpSidecarsForValidation` writes sample sidecars for pyshacl/mlcroissant runs. - - **Self-contained context (2026-06-12):** `ddi-cdi.jsonld` embeds a minimal inline `@context` (exactly the ~30 terms emitted, IRIs copied from the official DDI-CDI context) instead of referencing the hosted context URL. Reason: the hosted copy currently contains stray git conflict markers (invalid JSON) and the cdi-viewer's local fallback (`shapes/ddi-cdi.jsonld`) 404s on the deployed site — the viewer then expands documents with an **empty context**, silently dropping every property and reporting mass "less than 1 values" SHACL violations even for correct documents. With the inline context there is nothing remote to fail. Verified: zero-network parse yields all triples; pyshacl Conforms=True against the deployed viewer's shapes. The same remote-context fragility affects the dataset-level `cdi_generator_jsonld.py` pipeline output — fix it the same way when touched next. 
+ - **Context note (2026-06-12):** `ddi-cdi.jsonld` references the canonical published DDI-CDI context URL, like the rest of the ecosystem (incl. `cdi_generator_jsonld.py`). The hosted copy (ddi-cdi.github.io/m2t-ng) currently contains stray git conflict markers (invalid JSON — report upstream); strict consumers cannot resolve it until that is fixed. The cdi-viewer previewer is immune: its vendored context fallback was repaired (the fallback path pointed to `shapes/` while the file lives in `public/shapes/`, so it 404'd and the viewer silently expanded documents with an empty context, reporting mass "less than 1 values" violations on correct documents). Viewer fix in the cdi-viewer repo, commit 04547d7. - Future work: full CDIF 1.1 Data Description profile (double-typing variableMeasured as `cdi:InstanceVariable` + skos code lists) once the profile leaves review (currently `reviewRevision`, prefixes at 0.1; "Semantic Croissant" has no published pattern yet). ### Phase 6: Hardening And Rollout [In Progress — 2026-06-12] From 7e1ccbf6cbd68a0b51052c60f56775c3b6b226b2 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 15:45:45 +0200 Subject: [PATCH 14/25] redcap2 + cdi pipeline: reference the released DDI-CDI context (docs.ddialliance.org) The previously referenced ddi-cdi.github.io/m2t-ng URL is a build-tooling Pages artifact, not a release, and currently serves invalid JSON with unresolved merge-conflict markers. The DDI Alliance documentation site hosts the valid released encoding; generated ddi-cdi.jsonld validated end-to-end against the official SHACL shapes with this context fetched live (Conforms=True). 
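A minimal sketch, assuming the context body has already been downloaded, of how a consumer could tell a broken hosted context apart from a broken generated document. `checkContextBody` is a hypothetical helper and not part of this patch; it relies only on `json.Valid` and `bytes` from the Go standard library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// checkContextBody reports why a fetched JSON-LD context cannot be used.
// It returns nil for syntactically valid JSON and a more specific error when
// the body is invalid because of unresolved merge-conflict markers.
func checkContextBody(body []byte) error {
	if json.Valid(body) {
		return nil
	}
	// Conflict markers start at the beginning of a line.
	for _, line := range bytes.Split(body, []byte("\n")) {
		if bytes.HasPrefix(line, []byte("<<<<<<<")) || bytes.HasPrefix(line, []byte(">>>>>>>")) {
			return errors.New("context is not valid JSON: unresolved merge-conflict markers")
		}
	}
	return errors.New("context is not valid JSON")
}

func main() {
	good := []byte(`{"@context": {"cdi": "http://example.org/cdi/"}}`)
	bad := []byte("{\n<<<<<<< HEAD\n\"a\": 1\n=======\n\"a\": 2\n>>>>>>> branch\n}")
	fmt.Println(checkContextBody(good))
	fmt.Println(checkContextBody(bad))
}
```

Running a check like this before expansion makes the failure visible at the context instead of surfacing later as SHACL "less than 1 values" violations on an otherwise correct document.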
--- image/app/plugin/impl/redcap2/sidecars.go | 11 +++++++---- image/cdi_generator_jsonld.py | 5 ++++- 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/image/app/plugin/impl/redcap2/sidecars.go b/image/app/plugin/impl/redcap2/sidecars.go index 5ac5022..b6d6a9f 100644 --- a/image/app/plugin/impl/redcap2/sidecars.go +++ b/image/app/plugin/impl/redcap2/sidecars.go @@ -608,10 +608,13 @@ func buildROCrate(m sidecarModel) ([]byte, error) { // --- DDI-CDI 1.0 --- -// ddiCdiContext matches the in-repo DDI-CDI generator (cdi_generator_jsonld.py) -// whose output validates against the official DDI-CDI 1.0 SHACL shapes used by -// the cdi-viewer previewer. -const ddiCdiContext = "https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld" +// ddiCdiContext is the DDI-CDI 1.0 JSON-LD context published on the DDI +// Alliance documentation site — the released encoding. (The previously used +// ddi-cdi.github.io/m2t-ng URL is a build-tooling Pages artifact and currently +// serves invalid JSON with unresolved merge-conflict markers.) The output +// validates against the official DDI-CDI 1.0 SHACL shapes used by the +// cdi-viewer previewer. +const ddiCdiContext = "https://docs.ddialliance.org/DDI-CDI/1.0/model/encoding/json-ld/ddi-cdi.jsonld" func ddiCdiDataType(v sidecarVariable) string { switch v.FieldType { diff --git a/image/cdi_generator_jsonld.py b/image/cdi_generator_jsonld.py index 725d080..1eac6a1 100644 --- a/image/cdi_generator_jsonld.py +++ b/image/cdi_generator_jsonld.py @@ -71,7 +71,10 @@ # ---- Official DDI-CDI 1.0 JSON-LD Context URL ---- -DDI_CDI_CONTEXT = "https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld" +# The released encoding on the DDI Alliance documentation site. (The previously +# used ddi-cdi.github.io/m2t-ng URL is a build-tooling Pages artifact and +# currently serves invalid JSON with unresolved merge-conflict markers.) 
+DDI_CDI_CONTEXT = "https://docs.ddialliance.org/DDI-CDI/1.0/model/encoding/json-ld/ddi-cdi.jsonld" # ---- Generator Version (increment when making changes to track generated files) ---- GENERATOR_VERSION = "0.8" From ff117cb2ff03ae41714e5f9e4cffb0cb34c70a0d Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 16:29:02 +0200 Subject: [PATCH 15/25] tag and config for the new redcap plugin --- conf/frontend_config.json | 13 ------------- env.prod | 2 +- 2 files changed, 1 insertion(+), 14 deletions(-) diff --git a/conf/frontend_config.json b/conf/frontend_config.json index 8e646b1..15bd3d9 100644 --- a/conf/frontend_config.json +++ b/conf/frontend_config.json @@ -54,19 +54,6 @@ "tokenFieldName": "Token", "tokenFieldPlaceholder": "Repository API token" }, - { - "id": "redcap", - "name": "Other REDCap", - "plugin": "redcap", - "pluginName": "REDCap", - "optionFieldName": "Folder", - "optionFieldPlaceholder": "Select folder", - "optionFieldInteractive": true, - "tokenFieldName": "Project token", - "tokenFieldPlaceholder": "project token", - "sourceUrlFieldName": "Source URL", - "sourceUrlFieldPlaceholder": "https://your.redcap.server" - }, { "id": "redcap2", "name": "Other REDCap (reports beta)", diff --git a/env.prod b/env.prod index e9d1be4..743d6e9 100644 --- a/env.prod +++ b/env.prod @@ -1,4 +1,4 @@ -IMAGE_TAG=registry.docker.libis.be/rdm/integration:2.0 +IMAGE_TAG=registry.docker.libis.be/rdm/integration:2.0-redcap BASE_HREF=/integration CUSTOMIZATIONS=./conf/kul_customizations From 55e10c2f22a8e60f7b749c4f4a650a2227c35eae Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 16:43:48 +0200 Subject: [PATCH 16/25] redcap2: withdraw the generic JSON-LD previewer registration for croissant.json The cdi-viewer renders flattened @context+@graph documents (the DDI-CDI shape); croissant.json is a single nested node with no @graph, so the registered previewer showed an empty view or a missing-@graph error. 
Remove conf/dataverse/external-tools/12-jsonld-previewer.json and document that croissant.json has no preview until a Croissant-capable previewer exists. The croissant mime stays application/ld+json (accurate, and no tool registration conflicts with it). --- REDCAP_INTEGRATION.md | 18 +++++++-------- .../external-tools/12-jsonld-previewer.json | 23 ------------------- image/app/plugin/impl/redcap2/sidecars.go | 9 ++++---- redcap.md | 2 +- 4 files changed, 15 insertions(+), 37 deletions(-) delete mode 100644 conf/dataverse/external-tools/12-jsonld-previewer.json diff --git a/REDCAP_INTEGRATION.md b/REDCAP_INTEGRATION.md index 0e0e091..f7de76c 100644 --- a/REDCAP_INTEGRATION.md +++ b/REDCAP_INTEGRATION.md @@ -183,12 +183,12 @@ the official DDI-CDI 1.0 SHACL shapes used by the CDI previewer. - `ddi-cdi.jsonld` is uploaded with the DDI-CDI profile mime type registered by the **CDI previewer** (`conf/dataverse/external-tools/04-cdi-previewer.json`), which validates against the official DDI-CDI 1.0 SHACL shapes. -- `croissant.json` is uploaded as `application/ld+json`; the generic - **JSON-LD previewer** (`conf/dataverse/external-tools/12-jsonld-previewer.json`, - same viewer as the CDI previewer) displays it. There is no - Croissant-specific previewer in the Dataverse ecosystem yet. The Croissant - CDIF profile ("Semantic Croissant") is still draft-stage; the file targets - plain Croissant 1.0 and can be validated with +- `croissant.json` is uploaded as `application/ld+json` (it is JSON-LD, but + framed as a single nested node rather than a flattened `@graph`, so the + CDI viewer cannot display it). There is no Croissant-specific previewer in + the Dataverse ecosystem yet — the file currently has no preview. The + Croissant CDIF profile ("Semantic Croissant") is still draft-stage; the + file targets plain Croissant 1.0 and can be validated with `pip install mlcroissant && mlcroissant validate --jsonld croissant.json`. 
## New dataset metadata prefill @@ -230,9 +230,9 @@ leak the very values the transforms removed. Default: `5m`. Raise it if exports of very large projects time out — the export itself is fast (hundreds of MB/s for processing); the timeout covers the REDCap server generating and sending the data. -- The JSON-LD previewer for `croissant.json` requires registering - `conf/dataverse/external-tools/12-jsonld-previewer.json` in Dataverse (the - DDI-CDI and RO-Crate previewers use the existing registrations). +- The DDI-CDI and RO-Crate previewers use the existing external-tool + registrations; `croissant.json` has no previewer (none exists for + Croissant yet). ## Limitations and good practice diff --git a/conf/dataverse/external-tools/12-jsonld-previewer.json b/conf/dataverse/external-tools/12-jsonld-previewer.json deleted file mode 100644 index f9f4cdd..0000000 --- a/conf/dataverse/external-tools/12-jsonld-previewer.json +++ /dev/null @@ -1,23 +0,0 @@ -{ - "displayName": "View JSON-LD", - "description": "View JSON-LD metadata files (e.g. 
Croissant) with optional SHACL validation.", - "toolName": "jsonldPreviewer", - "scope": "file", - "types": ["preview", "explore"], - "toolUrl": "https://libis.github.io/cdi-viewer/index.html", - "toolParameters": { - "queryParameters": [ - {"fileid": "{fileId}"}, - {"siteUrl": "{siteUrl}"}, - {"datasetid": "{datasetId}"}, - {"datasetversion": "{datasetVersion}"}, - {"locale": "{localeCode}"} - ] - }, - "contentType": "application/ld+json", - "allowedApiCalls": [ - {"name": "retrieveFileContents", "httpMethod": "GET", "urlTemplate": "/api/v1/access/datafile/{fileId}?gbrecs=true", "timeOut": 3600}, - {"name": "downloadFile", "httpMethod": "GET", "urlTemplate": "/api/v1/access/datafile/{fileId}?gbrecs=false", "timeOut": 3600}, - {"name": "getDatasetVersionMetadata", "httpMethod": "GET", "urlTemplate": "/api/v1/datasets/{datasetId}/versions/{datasetVersion}", "timeOut": 3600} - ] -} diff --git a/image/app/plugin/impl/redcap2/sidecars.go b/image/app/plugin/impl/redcap2/sidecars.go index b6d6a9f..3af6a55 100644 --- a/image/app/plugin/impl/redcap2/sidecars.go +++ b/image/app/plugin/impl/redcap2/sidecars.go @@ -17,10 +17,11 @@ import ( // previewer registers for. // - The DDI-CDI mime must stay in sync with common.DdiCdiMimeType and the // contentType in conf/dataverse/external-tools/04-cdi-previewer.json. -// - Croissant is JSON-LD; the bare application/ld+json type lets a generic -// JSON-LD previewer (conf/dataverse/external-tools/06-jsonld-previewer.json) -// pick it up. There is no Croissant-specific previewer or mime convention -// (the Croissant 1.0 spec defines no media type). +// - Croissant is typed as bare application/ld+json (accurate: it is +// JSON-LD; the Croissant 1.0 spec defines no media type). No previewer +// fires on it: none exists for Croissant, and the cdi-viewer cannot +// render it (croissant.json is a single nested node, not a flattened +// @graph document). 
const ( roCrateMimeType = `application/ld+json; profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"` ddiCdiMimeType = `application/ld+json;profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://ddialliance.org/specification/ddi-cdi/1.0"` diff --git a/redcap.md b/redcap.md index 344f5ab..27ff041 100644 --- a/redcap.md +++ b/redcap.md @@ -580,7 +580,7 @@ Fixes the review findings before new features (see [Review, Research, And Decisi 4. ~~Plugin `Metadata()` hook~~ — done (title, notes+purpose, PI, grant number, IRB number, urn:redcap project id). 5. ~~`project_metadata.xml`~~ — done (always generated, failure-tolerant warning). 6. ~~Schema validation tests~~ — structural Go tests for all three outputs + determinism e2e; external validation documented in the user guide. -7. New external tool conf: `conf/dataverse/external-tools/12-jsonld-previewer.json` (cdi-viewer registered for bare `application/ld+json`, fires on croissant.json). +7. ~~Generic JSON-LD previewer registration for croissant.json~~ — **withdrawn (2026-06-12)**: the cdi-viewer cannot render croissant.json (single nested node, no flattened `@graph`), so the registration showed an empty view/`@graph` error. Croissant has no previewer until a Croissant-specific one (or single-node JSON-LD support in the cdi-viewer) exists. 8. **Variable-level metadata + validation fixes (2026-06-12):** - `croissant.json` and `ro-crate-metadata.json` now carry `schema:variableMeasured` following the CDIF 1.1 Discovery-profile shape (PropertyValue with name, description, alternateName, numeric minValue/maxValue from `text_validation_min/max`, code lists as `valueReference` DefinedTerms with termCode = the value in the data). Inline in Croissant; flattened contextual entities in RO-Crate (spec requires flattened JSON-LD). Verified: `mlcroissant validate --jsonld` exits clean (only citeAs/license "recommended" warnings). 
- `ddi-cdi.jsonld` code lists restructured per the official DDI-CDI 1.0 SHACL shapes (the ones bundled with the cdi-viewer): each `Code` now `uses_Notation` (TypedString content = the value as it appears in the data) and `denotes` a `Category` (ObjectName = the label); `CodeList` carries `allowsDuplicates` and an ObjectName name; the `PrimaryKey` is reachable via `DataStructure_has_PrimaryKey`; `PrimaryKeyComponent` uses the full `correspondsTo_DataStructureComponent` term. Verified with pyshacl against `libis/cdi-viewer` `shapes/ddi-cdi-official.ttl`: **Conforms = True** (previously 13 violations, the "Less than 1 values" errors seen in the previewer). From d66f22a66c21c4906580abe91bb303a51fc6b06b Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 17:06:17 +0200 Subject: [PATCH 17/25] redcap2: Croissant-profiled mime + dedicated previewer + CDIF mandatory fields MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - croissant.json mime: application/ld+json; profile="http://mlcommons.org/croissant/1.0" (RFC 6906 profile, mirroring the RO-Crate/DDI-CDI conventions) so a Croissant-specific previewer registration can match it exactly. - New conf/dataverse/external-tools/12-croissant-previewer.json: opens the cdi-viewer with ?shacl=croissant to preload the Croissant SHACL shapes (bundled with the viewer; the flatten fix there makes single-node documents renderable). - croissant.json and ro-crate-metadata.json gain identifier and dateModified — mandatory in the CDIF 1.1 Discovery profile (gaps found by validating against the CDIF core shapes; mlcroissant remains clean). 
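Because the mime constant and the previewer registration have to agree character for character, a small guard can catch drift between the two. The following is a sketch under stated assumptions: `registrationContentType` and `matchesExactly` are hypothetical helpers, and the registration JSON is passed in as bytes rather than read from the real `12-croissant-previewer.json` path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// croissantMimeType mirrors the value introduced by this patch.
const croissantMimeType = `application/ld+json; profile="http://mlcommons.org/croissant/1.0"`

// registrationContentType extracts the contentType field from an
// external-tool registration document.
func registrationContentType(conf []byte) (string, error) {
	var tool struct {
		ContentType string `json:"contentType"`
	}
	if err := json.Unmarshal(conf, &tool); err != nil {
		return "", err
	}
	return tool.ContentType, nil
}

// matchesExactly compares the registration with the mime string byte for
// byte: no normalization of whitespace or parameter order is attempted,
// since the registration is matched as an exact string.
func matchesExactly(conf []byte, mime string) (bool, error) {
	got, err := registrationContentType(conf)
	if err != nil {
		return false, err
	}
	return got == mime, nil
}

func main() {
	conf := []byte(`{"contentType": "application/ld+json; profile=\"http://mlcommons.org/croissant/1.0\""}`)
	ok, err := matchesExactly(conf, croissantMimeType)
	fmt.Println(ok, err)
}
```

Even a missing space after the semicolon (the form used by `ddiCdiMimeType`) would make the comparison fail, which is why the comment in `sidecars.go` asks for the strings to be kept in sync.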
--- REDCAP_INTEGRATION.md | 20 ++++++++-------- .../12-croissant-previewer.json | 23 +++++++++++++++++++ image/app/plugin/impl/redcap2/sidecars.go | 21 ++++++++++++----- .../app/plugin/impl/redcap2/sidecars_test.go | 11 +++++++++ redcap.md | 2 +- 5 files changed, 61 insertions(+), 16 deletions(-) create mode 100644 conf/dataverse/external-tools/12-croissant-previewer.json diff --git a/REDCAP_INTEGRATION.md b/REDCAP_INTEGRATION.md index f7de76c..2ba1953 100644 --- a/REDCAP_INTEGRATION.md +++ b/REDCAP_INTEGRATION.md @@ -183,12 +183,14 @@ the official DDI-CDI 1.0 SHACL shapes used by the CDI previewer. - `ddi-cdi.jsonld` is uploaded with the DDI-CDI profile mime type registered by the **CDI previewer** (`conf/dataverse/external-tools/04-cdi-previewer.json`), which validates against the official DDI-CDI 1.0 SHACL shapes. -- `croissant.json` is uploaded as `application/ld+json` (it is JSON-LD, but - framed as a single nested node rather than a flattened `@graph`, so the - CDI viewer cannot display it). There is no Croissant-specific previewer in - the Dataverse ecosystem yet — the file currently has no preview. The - Croissant CDIF profile ("Semantic Croissant") is still draft-stage; the - file targets plain Croissant 1.0 and can be validated with +- `croissant.json` is uploaded with a Croissant-profiled JSON-LD mime type + (`application/ld+json; profile="http://mlcommons.org/croissant/1.0"`), + mirroring the RO-Crate/DDI-CDI conventions. The **Croissant previewer** + (`conf/dataverse/external-tools/12-croissant-previewer.json`, the CDI + viewer opened with `?shacl=croissant`) displays it and validates against + Croissant SHACL shapes. The Croissant CDIF profile ("Semantic Croissant") + is still draft-stage; the file targets plain Croissant 1.0 plus the CDIF + 1.1 Discovery shape for variables, and can also be validated with `pip install mlcroissant && mlcroissant validate --jsonld croissant.json`. 
## New dataset metadata prefill @@ -230,9 +232,9 @@ leak the very values the transforms removed. Default: `5m`. Raise it if exports of very large projects time out — the export itself is fast (hundreds of MB/s for processing); the timeout covers the REDCap server generating and sending the data. -- The DDI-CDI and RO-Crate previewers use the existing external-tool - registrations; `croissant.json` has no previewer (none exists for - Croissant yet). +- The Croissant previewer for `croissant.json` requires registering + `conf/dataverse/external-tools/12-croissant-previewer.json` in Dataverse + (the DDI-CDI and RO-Crate previewers use the existing registrations). ## Limitations and good practice diff --git a/conf/dataverse/external-tools/12-croissant-previewer.json b/conf/dataverse/external-tools/12-croissant-previewer.json new file mode 100644 index 0000000..8c3c1a5 --- /dev/null +++ b/conf/dataverse/external-tools/12-croissant-previewer.json @@ -0,0 +1,23 @@ +{ + "displayName": "View Croissant metadata", + "description": "View Croissant (ML-ready dataset metadata) files with SHACL validation.", + "toolName": "croissantPreviewer", + "scope": "file", + "types": ["preview", "explore"], + "toolUrl": "https://libis.github.io/cdi-viewer/index.html?shacl=croissant", + "toolParameters": { + "queryParameters": [ + {"fileid": "{fileId}"}, + {"siteUrl": "{siteUrl}"}, + {"datasetid": "{datasetId}"}, + {"datasetversion": "{datasetVersion}"}, + {"locale": "{localeCode}"} + ] + }, + "contentType": "application/ld+json; profile=\"http://mlcommons.org/croissant/1.0\"", + "allowedApiCalls": [ + {"name": "retrieveFileContents", "httpMethod": "GET", "urlTemplate": "/api/v1/access/datafile/{fileId}?gbrecs=true", "timeOut": 3600}, + {"name": "downloadFile", "httpMethod": "GET", "urlTemplate": "/api/v1/access/datafile/{fileId}?gbrecs=false", "timeOut": 3600}, + {"name": "getDatasetVersionMetadata", "httpMethod": "GET", "urlTemplate": 
"/api/v1/datasets/{datasetId}/versions/{datasetVersion}", "timeOut": 3600} + ] +} diff --git a/image/app/plugin/impl/redcap2/sidecars.go b/image/app/plugin/impl/redcap2/sidecars.go index 3af6a55..df311c5 100644 --- a/image/app/plugin/impl/redcap2/sidecars.go +++ b/image/app/plugin/impl/redcap2/sidecars.go @@ -17,15 +17,16 @@ import ( // previewer registers for. // - The DDI-CDI mime must stay in sync with common.DdiCdiMimeType and the // contentType in conf/dataverse/external-tools/04-cdi-previewer.json. -// - Croissant is typed as bare application/ld+json (accurate: it is -// JSON-LD; the Croissant 1.0 spec defines no media type). No previewer -// fires on it: none exists for Croissant, and the cdi-viewer cannot -// render it (croissant.json is a single nested node, not a flattened -// @graph document). +// - Croissant gets a profiled JSON-LD mime (the Croissant 1.0 spec defines +// no media type; the RFC 6906 profile parameter carries the conformsTo +// URI, mirroring the RO-Crate and DDI-CDI conventions). It must stay in +// sync with the croissant previewer registrations +// (conf/dataverse/external-tools/12-croissant-previewer.json and the +// deployment copies): Dataverse matches contentType as an exact string. 
const ( roCrateMimeType = `application/ld+json; profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"` ddiCdiMimeType = `application/ld+json;profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://ddialliance.org/specification/ddi-cdi/1.0"` - croissantMimeType = "application/ld+json" + croissantMimeType = `application/ld+json; profile="http://mlcommons.org/croissant/1.0"` ) // ddiDataTypeCV is the DDI controlled vocabulary used for variable data types, @@ -433,8 +434,15 @@ func buildCroissant(m sidecarModel) ([]byte, error) { "version": "1.0.0", "distribution": distribution, } + // identifier and dateModified are mandatory in the CDIF 1.1 Discovery + // profile (dateModified: the export timestamp is when this snapshot of + // the data last changed). + if m.ProjectID != nil { + doc["identifier"] = fmt.Sprintf("redcap-project-%v", m.ProjectID) + } if date, ok := m.publishedDate(); ok { doc["datePublished"] = date + doc["dateModified"] = date } // Variable-level metadata as schema.org variableMeasured, following the @@ -522,6 +530,7 @@ func buildROCrate(m sidecarModel) ([]byte, error) { } if date, ok := m.publishedDate(); ok { rootDataset["datePublished"] = date + rootDataset["dateModified"] = date } if m.ProjectID != nil { rootDataset["identifier"] = fmt.Sprintf("redcap-project-%v", m.ProjectID) diff --git a/image/app/plugin/impl/redcap2/sidecars_test.go b/image/app/plugin/impl/redcap2/sidecars_test.go index d515357..7a95e83 100644 --- a/image/app/plugin/impl/redcap2/sidecars_test.go +++ b/image/app/plugin/impl/redcap2/sidecars_test.go @@ -139,6 +139,14 @@ func TestBuildCroissant(t *testing.T) { if len(byName) != len(model.Variables) { t.Errorf("variableMeasured has %d entries, want %d", len(byName), len(model.Variables)) } + // CDIF 1.1 Discovery mandatory dataset-level properties. 
+ if doc["identifier"] != "redcap-project-1" { + t.Errorf("identifier = %v, want redcap-project-1", doc["identifier"]) + } + if doc["dateModified"] != "2026-06-12T00:00:00Z" || doc["datePublished"] != "2026-06-12T00:00:00Z" { + t.Errorf("dateModified/datePublished = %v/%v", doc["dateModified"], doc["datePublished"]) + } + age := byName["age"] if age["@type"] != "PropertyValue" || age["minValue"] != float64(0) || age["maxValue"] != float64(120) { t.Errorf("age variableMeasured = %v", age) @@ -211,6 +219,9 @@ func TestBuildROCrate(t *testing.T) { if root == nil || root["name"] != "Demo" || root["datePublished"] == "" { t.Errorf("root dataset = %v", root) } + if root["dateModified"] != "2026-06-12T00:00:00Z" || root["identifier"] != "redcap-project-1" { + t.Errorf("root dateModified/identifier = %v/%v", root["dateModified"], root["identifier"]) + } hasPart := root["hasPart"].([]interface{}) if len(hasPart) != len(model.Files) { t.Errorf("hasPart has %d entries, want %d", len(hasPart), len(model.Files)) diff --git a/redcap.md b/redcap.md index 27ff041..c3deb79 100644 --- a/redcap.md +++ b/redcap.md @@ -580,7 +580,7 @@ Fixes the review findings before new features (see [Review, Research, And Decisi 4. ~~Plugin `Metadata()` hook~~ — done (title, notes+purpose, PI, grant number, IRB number, urn:redcap project id). 5. ~~`project_metadata.xml`~~ — done (always generated, failure-tolerant warning). 6. ~~Schema validation tests~~ — structural Go tests for all three outputs + determinism e2e; external validation documented in the user guide. -7. ~~Generic JSON-LD previewer registration for croissant.json~~ — **withdrawn (2026-06-12)**: the cdi-viewer cannot render croissant.json (single nested node, no flattened `@graph`), so the registration showed an empty view/`@graph` error. Croissant has no previewer until a Croissant-specific one (or single-node JSON-LD support in the cdi-viewer) exists. +7. 
Croissant previewer (2026-06-12, second iteration): croissant.json now carries a Croissant-profiled mime (`application/ld+json; profile="http://mlcommons.org/croissant/1.0"`, RFC 6906 profile mirroring the RO-Crate/DDI-CDI conventions) and a dedicated registration (`conf/dataverse/external-tools/12-croissant-previewer.json`) opens the cdi-viewer with `?shacl=croissant`. The earlier generic bare-`application/ld+json` registration was withdrawn: the viewer could not render single-node documents (one-arg `jsonld.flatten` returned the expanded array — fixed in cdi-viewer) and bare-type matching was unspecific. The viewer now bundles `shapes/croissant-core.ttl` (Croissant 1.0 structure + CDIF 1.1 Discovery dataset checks, authored — no official Croissant shapes exist). croissant.json also gained `identifier` and `dateModified` (CDIF Discovery mandatory; found by validating against the CDIF core shapes — note those use http://schema.org/ while Croissant uses https://, so they target nothing without IRI alignment). 8. **Variable-level metadata + validation fixes (2026-06-12):** - `croissant.json` and `ro-crate-metadata.json` now carry `schema:variableMeasured` following the CDIF 1.1 Discovery-profile shape (PropertyValue with name, description, alternateName, numeric minValue/maxValue from `text_validation_min/max`, code lists as `valueReference` DefinedTerms with termCode = the value in the data). Inline in Croissant; flattened contextual entities in RO-Crate (spec requires flattened JSON-LD). Verified: `mlcroissant validate --jsonld` exits clean (only citeAs/license "recommended" warnings). 
- `ddi-cdi.jsonld` code lists restructured per the official DDI-CDI 1.0 SHACL shapes (the ones bundled with the cdi-viewer): each `Code` now `uses_Notation` (TypedString content = the value as it appears in the data) and `denotes` a `Category` (ObjectName = the label); `CodeList` carries `allowsDuplicates` and an ObjectName name; the `PrimaryKey` is reachable via `DataStructure_has_PrimaryKey`; `PrimaryKeyComponent` uses the full `correspondsTo_DataStructureComponent` term. Verified with pyshacl against `libis/cdi-viewer` `shapes/ddi-cdi-official.ttl`: **Conforms = True** (previously 13 violations, the "Less than 1 values" errors seen in the previewer). From b639a8997b10a7c3b24a69662a12a0bbd761a2c1 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 17:35:16 +0200 Subject: [PATCH 18/25] redcap2: croissant previewer registration uses a plain toolUrl The cdi-viewer now auto-selects shapes from document content, so the ?shacl=croissant query string (which Dataverse's naive toolUrl+'?'+params concatenation mangled into a double question mark) is no longer needed. ?shacl= remains available as an explicit override. --- REDCAP_INTEGRATION.md | 4 ++-- conf/dataverse/external-tools/12-croissant-previewer.json | 2 +- redcap.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/REDCAP_INTEGRATION.md b/REDCAP_INTEGRATION.md index 2ba1953..1fc287d 100644 --- a/REDCAP_INTEGRATION.md +++ b/REDCAP_INTEGRATION.md @@ -187,8 +187,8 @@ the official DDI-CDI 1.0 SHACL shapes used by the CDI previewer. (`application/ld+json; profile="http://mlcommons.org/croissant/1.0"`), mirroring the RO-Crate/DDI-CDI conventions. The **Croissant previewer** (`conf/dataverse/external-tools/12-croissant-previewer.json`, the CDI - viewer opened with `?shacl=croissant`) displays it and validates against - Croissant SHACL shapes. 
The Croissant CDIF profile ("Semantic Croissant") + viewer — it auto-selects its bundled Croissant SHACL shapes from the + document's conformsTo) displays and validates it. The Croissant CDIF profile ("Semantic Croissant") is still draft-stage; the file targets plain Croissant 1.0 plus the CDIF 1.1 Discovery shape for variables, and can also be validated with `pip install mlcroissant && mlcroissant validate --jsonld croissant.json`. diff --git a/conf/dataverse/external-tools/12-croissant-previewer.json b/conf/dataverse/external-tools/12-croissant-previewer.json index 8c3c1a5..3b3aa2b 100644 --- a/conf/dataverse/external-tools/12-croissant-previewer.json +++ b/conf/dataverse/external-tools/12-croissant-previewer.json @@ -4,7 +4,7 @@ "toolName": "croissantPreviewer", "scope": "file", "types": ["preview", "explore"], - "toolUrl": "https://libis.github.io/cdi-viewer/index.html?shacl=croissant", + "toolUrl": "https://libis.github.io/cdi-viewer/index.html", "toolParameters": { "queryParameters": [ {"fileid": "{fileId}"}, diff --git a/redcap.md b/redcap.md index c3deb79..fc1da5d 100644 --- a/redcap.md +++ b/redcap.md @@ -580,7 +580,7 @@ Fixes the review findings before new features (see [Review, Research, And Decisi 4. ~~Plugin `Metadata()` hook~~ — done (title, notes+purpose, PI, grant number, IRB number, urn:redcap project id). 5. ~~`project_metadata.xml`~~ — done (always generated, failure-tolerant warning). 6. ~~Schema validation tests~~ — structural Go tests for all three outputs + determinism e2e; external validation documented in the user guide. -7. Croissant previewer (2026-06-12, second iteration): croissant.json now carries a Croissant-profiled mime (`application/ld+json; profile="http://mlcommons.org/croissant/1.0"`, RFC 6906 profile mirroring the RO-Crate/DDI-CDI conventions) and a dedicated registration (`conf/dataverse/external-tools/12-croissant-previewer.json`) opens the cdi-viewer with `?shacl=croissant`. 
The earlier generic bare-`application/ld+json` registration was withdrawn: the viewer could not render single-node documents (one-arg `jsonld.flatten` returned the expanded array — fixed in cdi-viewer) and bare-type matching was unspecific. The viewer now bundles `shapes/croissant-core.ttl` (Croissant 1.0 structure + CDIF 1.1 Discovery dataset checks, authored — no official Croissant shapes exist). croissant.json also gained `identifier` and `dateModified` (CDIF Discovery mandatory; found by validating against the CDIF core shapes — note those use http://schema.org/ while Croissant uses https://, so they target nothing without IRI alignment). +7. Croissant previewer (2026-06-12, second iteration): croissant.json now carries a Croissant-profiled mime (`application/ld+json; profile="http://mlcommons.org/croissant/1.0"`, RFC 6906 profile mirroring the RO-Crate/DDI-CDI conventions) and a dedicated registration (`conf/dataverse/external-tools/12-croissant-previewer.json`) opens the cdi-viewer, which auto-selects its bundled Croissant shapes from the document's `conformsTo` (the `?shacl=` parameter remains available as an explicit override). The earlier generic bare-`application/ld+json` registration was withdrawn: the viewer could not render single-node documents (one-arg `jsonld.flatten` returned the expanded array — fixed in cdi-viewer) and bare-type matching was unspecific. The viewer now bundles `shapes/croissant-core.ttl` (Croissant 1.0 structure + CDIF 1.1 Discovery dataset checks, authored — no official Croissant shapes exist). croissant.json also gained `identifier` and `dateModified` (CDIF Discovery mandatory; found by validating against the CDIF core shapes — note those use http://schema.org/ while Croissant uses https://, so they target nothing without IRI alignment). 8. 
**Variable-level metadata + validation fixes (2026-06-12):** - `croissant.json` and `ro-crate-metadata.json` now carry `schema:variableMeasured` following the CDIF 1.1 Discovery-profile shape (PropertyValue with name, description, alternateName, numeric minValue/maxValue from `text_validation_min/max`, code lists as `valueReference` DefinedTerms with termCode = the value in the data). Inline in Croissant; flattened contextual entities in RO-Crate (spec requires flattened JSON-LD). Verified: `mlcroissant validate --jsonld` exits clean (only citeAs/license "recommended" warnings). - `ddi-cdi.jsonld` code lists restructured per the official DDI-CDI 1.0 SHACL shapes (the ones bundled with the cdi-viewer): each `Code` now `uses_Notation` (TypedString content = the value as it appears in the data) and `denotes` a `Category` (ObjectName = the label); `CodeList` carries `allowsDuplicates` and an ObjectName name; the `PrimaryKey` is reachable via `DataStructure_has_PrimaryKey`; `PrimaryKeyComponent` uses the full `correspondsTo_DataStructureComponent` term. Verified with pyshacl against `libis/cdi-viewer` `shapes/ddi-cdi-official.ttl`: **Conforms = True** (previously 13 violations, the "Less than 1 values" errors seen in the previewer). From a1290b5ed301c55f083b88b689af6f5851b9eab6 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 17:58:30 +0200 Subject: [PATCH 19/25] redcap2: addressable croissant root (@id #dataset) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The croissant root Dataset had no @id and was therefore a blank node in RDF. Blank-node labels are relabeled on every serialization, so SHACL results on the root (e.g. the license recommendation) could never link back to the rendered node in RDF-based viewers — and CDIF Discovery wants an identifiable metadata subject anyway. mlcroissant and the croissant SHACL shapes remain clean. 
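
As a rough illustration of the change described in this commit message (not the plugin's actual `buildCroissant`), the sketch below marshals a minimal root node with and without an `@id`. The helper name `rootNode` and the sample dataset name are hypothetical; only the `@id`, `@type`, and `conformsTo` values are taken from the patch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rootNode is a hypothetical helper for illustration only. It builds a
// minimal Croissant-style root node; an empty id leaves the node anonymous,
// which becomes a blank node once the document is read as RDF.
func rootNode(name, id string) map[string]interface{} {
	doc := map[string]interface{}{
		"@type":      "sc:Dataset",
		"conformsTo": "http://mlcommons.org/croissant/1.0",
		"name":       name,
	}
	if id != "" {
		// A stable fragment identifier gives validators and viewers
		// something to link their results back to.
		doc["@id"] = id
	}
	return doc
}

func main() {
	for _, id := range []string{"", "#dataset"} {
		out, err := json.Marshal(rootNode("Demo", id))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

With the `@id` present, a SHACL result that reports the root as its focus node carries the same identifier the viewer rendered, instead of a relabeled blank-node label.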
--- image/app/plugin/impl/redcap2/sidecars.go | 5 ++++- image/app/plugin/impl/redcap2/sidecars_test.go | 3 +++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/image/app/plugin/impl/redcap2/sidecars.go b/image/app/plugin/impl/redcap2/sidecars.go index df311c5..9c78b2f 100644 --- a/image/app/plugin/impl/redcap2/sidecars.go +++ b/image/app/plugin/impl/redcap2/sidecars.go @@ -426,7 +426,10 @@ func buildCroissant(m sidecarModel) ([]byte, error) { } doc := map[string]interface{}{ - "@context": croissantContext, + "@context": croissantContext, + // An addressable root (CDIF Discovery wants an identifiable subject; + // an anonymous root is also unlinkable in RDF-based viewers). + "@id": "#dataset", "@type": "sc:Dataset", "conformsTo": "http://mlcommons.org/croissant/1.0", "name": m.datasetName(), diff --git a/image/app/plugin/impl/redcap2/sidecars_test.go b/image/app/plugin/impl/redcap2/sidecars_test.go index 7a95e83..fe89d22 100644 --- a/image/app/plugin/impl/redcap2/sidecars_test.go +++ b/image/app/plugin/impl/redcap2/sidecars_test.go @@ -95,6 +95,9 @@ func TestBuildCroissant(t *testing.T) { if doc["@type"] != "sc:Dataset" || doc["name"] != "Demo" { t.Errorf("type/name = %v/%v", doc["@type"], doc["name"]) } + if doc["@id"] != "#dataset" { + t.Errorf("@id = %v, want #dataset (anonymous roots are unlinkable)", doc["@id"]) + } if !strings.Contains(doc["description"].(string), "abcdef0123456789") { t.Error("description should mention the pseudonymization key fingerprint") } From b5e310af1154a6972a01f6e4f753394e5ca5408a Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 18:36:56 +0200 Subject: [PATCH 20/25] docs: settings persistence note + Phase 6 pilot re-test completed --- REDCAP_INTEGRATION.md | 3 +++ redcap.md | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/REDCAP_INTEGRATION.md b/REDCAP_INTEGRATION.md index 1fc287d..f26cfcb 100644 --- a/REDCAP_INTEGRATION.md +++ b/REDCAP_INTEGRATION.md @@ -135,6 +135,9 @@ When at least one 
variable is set to *Pseudonymize*, a key field appears. - The key itself never appears in the generated files or logs. The manifest records only a *fingerprint* (a hash of the key), so you can verify later which key was used. +- After a page reload or reconnect, your export settings (report ID, mode, + filters, anonymization choices) are restored automatically — but the key + is not: it lives only in page memory and must be pasted again. - Pseudonymization is irreversible: there is no decryption. This is by design (institutional decision; reversible encryption is out of scope). diff --git a/redcap.md b/redcap.md index fc1da5d..dadaae5 100644 --- a/redcap.md +++ b/redcap.md @@ -598,7 +598,7 @@ Fixes the review findings before new features (see [Review, Research, And Decisi Also removed the per-file payload copy in `Streams` (bundle contents are immutable; halves peak memory while streaming). 2. ~~Security review~~ — done (2026-06-12), see [Security Review](#security-review-2026-06-12). 3. ~~User documentation~~ — done: [REDCAP_INTEGRATION.md](REDCAP_INTEGRATION.md) (features, key generation/management, PHI disclaimer, sidecars/previewers, manifest reference). -4. Re-test on pilot (first pilot deploy of Phases 0–3.9 done 2026-06-11 via `make dev_build`). **Remaining: user re-tests the Phase 4–6 build.** +4. ~~Re-test on pilot~~ — done (2026-06-12): full Phase 4–6 build deployed and verified end to end (de-id flows, sidecars, all three previewers incl. the new Croissant previewer, clean SHACL validation). Side quests fixed along the way: broken Shibboleth proxy rebuild (rdm-build), cdi-viewer context fallback + rendering fixes, reconnect settings-loss in the frontend. 5. Keep `redcap` plugin as stable fallback until `redcap2` is proven. 6. Revisit attachments (opt-in, size-capped, flagged as not de-identified) based on pilot feedback. 
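
The user-guide text touched by this patch says the manifest records only a fingerprint (a hash) of the pseudonymization key. A minimal sketch of that idea follows. The hash algorithm (SHA-256), the truncation to 16 hex characters, and the function name `keyFingerprint` are assumptions made for illustration; the patch does not state how the plugin derives its fingerprint.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// keyFingerprint is a hypothetical helper. It returns a short, non-reversible
// identifier for a key so that two exports can be checked for "same key"
// without storing the key itself. SHA-256 and the 8-byte truncation are
// assumptions, not the plugin's documented behavior.
func keyFingerprint(key []byte) string {
	sum := sha256.Sum256(key)
	return hex.EncodeToString(sum[:8])
}

func main() {
	a := keyFingerprint([]byte("example-key-one"))
	b := keyFingerprint([]byte("example-key-two"))
	fmt.Println("same key, same fingerprint:", a == keyFingerprint([]byte("example-key-one")))
	fmt.Println("different keys differ:", a != b)
}
```

Because only the fingerprint is written out, losing the key after a page reload does not weaken the manifest: pasting the key again and recomputing the fingerprint is enough to confirm it is the one used earlier.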
From 2ac88440762ea49833529c4fc90f4d5052701e20 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 18:37:39 +0200 Subject: [PATCH 21/25] docs: ddi-cdi.md references the released DDI-CDI context URL cdi_generator_jsonld.py switched from the m2t-ng build artifact (currently invalid JSON upstream) to the released encoding on docs.ddialliance.org; update the three doc references to match. --- ddi-cdi.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/ddi-cdi.md b/ddi-cdi.md index 916e23b..ebf21f7 100644 --- a/ddi-cdi.md +++ b/ddi-cdi.md @@ -172,7 +172,7 @@ The Go backend now assembles a manifest (JSON) that captures dataset context alo Within this manifest-driven run, the generator: - Constructs a DDI-CDI 1.0 compliant JSON-LD document -- Uses the official DDI-CDI context: `https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld` +- Uses the official DDI-CDI context: `https://docs.ddialliance.org/DDI-CDI/1.0/model/encoding/json-ld/ddi-cdi.jsonld` (the released encoding on the DDI Alliance documentation site; the former ddi-cdi.github.io/m2t-ng copy is a build artifact and currently serves invalid JSON) - Describes the dataset structure using WideDataSet, WideDataStructure, and related types - Documents each variable with InstanceVariable, RepresentedVariable, and component types - Records provenance information (processing timestamp, tools used) @@ -638,7 +638,7 @@ The core metadata generation is performed by [`cdi_generator_jsonld.py`](image/c - **Clean, documented code** with clear function boundaries - **Standard Python libraries** (rdflib, chardet, datasketch, python-dateutil) - **JSON-LD output** following DDI-CDI 1.0 specification -- **Official DDI-CDI context** from `https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld` +- **Official DDI-CDI context** from `https://docs.ddialliance.org/DDI-CDI/1.0/model/encoding/json-ld/ddi-cdi.jsonld` (the released encoding on the DDI Alliance 
documentation site; the former ddi-cdi.github.io/m2t-ng copy is a build artifact and currently serves invalid JSON) - **Streaming architecture** for memory efficiency - **Modular design** making it easy to add features or fix issues @@ -782,7 +782,7 @@ Future versions may include: This feature implements the DDI-CDI 1.0 specification: - **Namespace**: `http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/` -- **JSON-LD Context**: `https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/json-ld/ddi-cdi.jsonld` +- **JSON-LD Context**: `https://docs.ddialliance.org/DDI-CDI/1.0/model/encoding/json-ld/ddi-cdi.jsonld` (the released encoding on the DDI Alliance documentation site; the former ddi-cdi.github.io/m2t-ng copy is a build artifact and currently serves invalid JSON) - **SHACL Shapes**: `https://ddi-cdi.github.io/m2t-ng/DDI-CDI_1-0/encoding/shacl/ddi-cdi.shacl.ttl` - **Documentation**: [https://ddialliance.org/Specification/DDI-CDI/](https://ddialliance.org/Specification/DDI-CDI/) From 85d4854b7bc32d2d9b2697e57d7deca3bd71b5db Mon Sep 17 00:00:00 2001 From: ErykKul Date: Fri, 12 Jun 2026 18:51:16 +0200 Subject: [PATCH 22/25] redcap2: register the RO-Crate previewer in the integration conf MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ro-crate-metadata.json files were uploaded with the RO-Crate profile mime but the integration's external-tools conf (used by local/dev setups via dataverse/setup.sh) had no matching previewer registration — only the deployment repo did. Same gdcc v1.5 ROCrate previewer and the exact contentType the redcap2 plugin emits (and Dataverse 6.3+ detects by filename). 
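
Since the registration only works when the `contentType` matches the emitted mime exactly, a small sketch may help show how fragile that is. The example compares two spellings of the RO-Crate mime that differ only in the space after the semicolon. The function names are hypothetical, and the parsed comparison is shown only as a contrast; nothing here claims that Dataverse normalizes media types.

```go
package main

import (
	"fmt"
	"mime"
)

const (
	// Spelling used by the plugin and the registration in this patch.
	roCrateSpaced = `application/ld+json; profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"`
	// Same media type, without the space after the semicolon.
	roCrateTight = `application/ld+json;profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"`
)

// exactMatch mirrors the exact-string comparison described in the patch.
func exactMatch(registered, uploaded string) bool {
	return registered == uploaded
}

// sameMediaType compares the parsed type and profile parameter instead.
func sameMediaType(a, b string) (bool, error) {
	typeA, paramsA, err := mime.ParseMediaType(a)
	if err != nil {
		return false, err
	}
	typeB, paramsB, err := mime.ParseMediaType(b)
	if err != nil {
		return false, err
	}
	return typeA == typeB && paramsA["profile"] == paramsB["profile"], nil
}

func main() {
	fmt.Println("exact string match:", exactMatch(roCrateSpaced, roCrateTight))
	same, err := sameMediaType(roCrateSpaced, roCrateTight)
	if err != nil {
		panic(err)
	}
	fmt.Println("semantically equal:", same)
}
```

This is why the constants in `sidecars.go` and every previewer registration have to be kept byte-for-byte identical, including whitespace.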
---
 .../external-tools/13-rocrate-previewer.json | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
 create mode 100644 conf/dataverse/external-tools/13-rocrate-previewer.json

diff --git a/conf/dataverse/external-tools/13-rocrate-previewer.json b/conf/dataverse/external-tools/13-rocrate-previewer.json
new file mode 100644
index 0000000..88ee3d1
--- /dev/null
+++ b/conf/dataverse/external-tools/13-rocrate-previewer.json
@@ -0,0 +1,19 @@
+{
+  "displayName": "Show RO-Crate content",
+  "description": "View the RO-Crate metadata file.",
+  "toolName": "rocratePreviewer",
+  "scope": "file",
+  "types": ["preview", "explore"],
+  "toolUrl": "https://gdcc.github.io/dataverse-previewers/previewers/v1.5/ROCrate.html",
+  "toolParameters": {
+    "queryParameters": [
+      {"fileid": "{fileId}"},
+      {"siteUrl": "{siteUrl}"},
+      {"key": "{apiToken}"},
+      {"datasetid": "{datasetId}"},
+      {"datasetversion": "{datasetVersion}"},
+      {"locale": "{localeCode}"}
+    ]
+  },
+  "contentType": "application/ld+json; profile=\"http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate\""
+}

From 409fe03461d906600726a7f7723d0b49adeb623b Mon Sep 17 00:00:00 2001
From: ErykKul
Date: Wed, 17 Jun 2026 15:18:00 +0200
Subject: [PATCH 23/25] ci: remove flaky @ai-generated AI-governance workflows

These Copilot-generated workflows (PR governance/code-review/unit-test
runners, /gov ChatOps, and the reusable governance workflow) triggered
flaky runs on pull requests. Remove the entire suite so PRs run clean.
--- .github/workflows/ai-agent.yml | 137 ------ .github/workflows/ai-governance.yml | 542 ------------------------ .github/workflows/code-review-agent.yml | 153 ------- .github/workflows/copilot-pr-review.yml | 124 ------ .github/workflows/gov-review.yml | 206 --------- .github/workflows/governance-smoke.yml | 33 -- .github/workflows/pr-autolinks.yml | 136 ------ .github/workflows/pr-governance.yml | 38 -- .github/workflows/run-unit-tests.yml | 68 --- .github/workflows/workflow-lint.yml | 28 -- 10 files changed, 1465 deletions(-) delete mode 100644 .github/workflows/ai-agent.yml delete mode 100644 .github/workflows/ai-governance.yml delete mode 100644 .github/workflows/code-review-agent.yml delete mode 100644 .github/workflows/copilot-pr-review.yml delete mode 100644 .github/workflows/gov-review.yml delete mode 100644 .github/workflows/governance-smoke.yml delete mode 100644 .github/workflows/pr-autolinks.yml delete mode 100644 .github/workflows/pr-governance.yml delete mode 100644 .github/workflows/run-unit-tests.yml delete mode 100644 .github/workflows/workflow-lint.yml diff --git a/.github/workflows/ai-agent.yml b/.github/workflows/ai-agent.yml deleted file mode 100644 index 55f13e5..0000000 --- a/.github/workflows/ai-agent.yml +++ /dev/null @@ -1,137 +0,0 @@ -# @ai-generated: true -# @ai-tool: Copilot -name: AI Governance Agent (ChatOps) - -on: - issue_comment: - types: [created] - -permissions: - contents: read - issues: write - pull-requests: write - -jobs: - respond: - name: Respond to /gov commands on PRs - if: ${{ github.event.issue.pull_request && contains(github.event.comment.body, '/gov') }} - runs-on: ubuntu-latest - steps: - - name: Handle /gov command - uses: actions/github-script@v7 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const body = context.payload.comment.body.trim(); - const owner = context.repo.owner; - const repo = context.repo.repo; - const issue_number = context.payload.issue.number; - - const helpText = - 'AI 
Governance Agent commands:\\n\\n\n' + - `- /gov help — show this help\n` + - `- /gov check — scan this PR for missing governance checklist items and summarize changes\n` + - `- /gov copilot — ask GitHub Copilot to review this PR\n` + - `- /gov links — preview suggested links (governance/test runs) for the PR template\n` + - `- /gov autofill apply — auto-fill safe N/A defaults and add run links into the PR body\n` + - `- /gov — run default check, trigger Copilot review, preview links, and auto-apply autofill\n`; - - const isHelp = body.match(/^\/gov\s+help\b/i); - const isCheck = body.match(/^\/gov\s+check\b/i); - const isBare = body.match(/^\/gov\s*$/i); - - if (!isHelp && !isCheck && !isBare) { - return; // Ignore other /gov variants for now - } - - if (isHelp) { - await github.rest.issues.createComment({ owner, repo, issue_number, body: helpText }); - return; - } - - // Fetch PR details and changed files (treat bare /gov as /gov check) - const doCheck = isCheck || isBare; - const { data: pr } = await github.rest.pulls.get({ owner, repo, pull_number: issue_number }); - const prBody = (pr.body || '').toString(); - const files = await github.paginate(github.rest.pulls.listFiles, { owner, repo, pull_number: issue_number, per_page: 100 }); - - // Heuristics for change types - const changedPaths = files.map(f => f.filename); - const rx = { - userUI: /(^|\/)(ui|web|frontend|public|templates)(\/|$)|(^|\/)src\/.*\.(html|tsx?|vue)$/i, - sensitive: /(^|\/)(auth|authn|authz|login|acl|permissions?|access[_-]?control|secrets?|tokens?|jwt|oauth)(\/|$)|\.(policy|rego)$/i, - infra: /(^|\/)(k8s|kubernetes|helm|charts|deploy|ops|infra|infrastructure|manifests|terraform|ansible)(\/|$)|(^|\/)dockerfile$|docker-compose\.ya?ml$|Chart\.ya?ml$/i, - backend: /(^|\/)(src|api|server|backend|app)(\/|\/.*)([^\/]+)\.(js|ts|py|rb|go|java|cs)$/i, - media: /\.(png|jpe?g|gif|webp|svg|mp4|mp3|wav|pdf)$/i, - data: /(^|\/)(data|datasets|training|notebooks|scripts)(\/|$)/i - }; - const has = (re) => 
changedPaths.some(p => re.test(p)); - const flags = { - userUI: has(rx.userUI), - sensitive: has(rx.sensitive), - infra: has(rx.infra), - backend: has(rx.backend), - media: has(rx.media), - data: has(rx.data) - }; - - // Simple PR body checks mirroring the reusable workflow - const missing = []; - const need = (label, ok) => { if (!ok) missing.push(label); }; - - need('Prompt', /Prompt/i.test(prBody)); - need('Model', /Model/i.test(prBody)); - need('Date', /Date/i.test(prBody)); - need('Author', /Author/i.test(prBody)); - need('[x] No secrets/PII', /\[x\].*no\s+secrets\/?pii|no\s+pii\/?secrets/i.test(prBody)); - need('Risk classification: limited|high', /Risk\s*classification:\s*(limited|high)/i.test(prBody)); - need('Personal data: yes|no', /Personal\s*data:\s*(yes|no)/i.test(prBody)); - need('Automated decision-making: yes|no', /Automated\s*decision-?making:\s*(yes|no)/i.test(prBody)); - need('Agent mode used: yes|no', /Agent\s*mode\s*used:\s*(yes|no)/i.test(prBody)); - need('Role: provider|deployer', /Role:\s*(provider|deployer)/i.test(prBody)); - - if (flags.userUI) { - need('[x] Transparency notice updated', /\[x\].*transparency\s+notice/i.test(prBody)); - need('Accessibility statement: ', /Accessibility\s*statement:\s*(https?:\/\/|N\/?A)/i.test(prBody)); - } - if (flags.media) { - need('[x] AI content labeled', /\[x\].*ai\s*content\s*labeled/i.test(prBody)); - need('C2PA: ', /C2PA:\s*(https?:\/\/|N\/?A)/i.test(prBody)); - } - if (flags.infra) { - need('Privacy notice: ', /Privacy\s*notice:\s*(https?:\/\/)/i.test(prBody)); - need('Lawful basis: ', /Lawful\s*basis:\s*([A-Za-z]+|N\/?A)/i.test(prBody)); - need('Retention schedule: ', /Retention\s*schedule:\s*(https?:\/\/|N\/?A)/i.test(prBody)); - } - if (flags.backend) { - need('[x] OWASP ASVS review or ASVS: ', /\[x\].*owasp\s*asvs|ASVS:\s*(https?:\/\/)/i.test(prBody)); - } - - // Build a concise response - const bullet = (b) => `- ${b}`; - const filesList = changedPaths.slice(0, 50).map(bullet).join('\n'); - 
const missingList = missing.length ? missing.map(bullet).join('\n') : '- None (looks good)'; - const flagsList = Object.entries(flags).filter(([,v]) => v).map(([k]) => ` - - ${k}`).join('') || '\n - none detected'; - - const reply = - '### Governance Agent Report\\n\\n\n' + - `PR: #${issue_number} by @${pr.user.login}\n\n` + - `Changed files (${changedPaths.length}):\n${filesList}\n\n` + - `Detected change types:${flagsList}\n\n` + - `Missing or incomplete items:\n${missingList}\n\n` + - `Tip: Use the PR template fields to satisfy these checks.\n\n` + - `Run /gov help for commands. Also try: /gov links and /gov autofill apply.`; - - if (doCheck) { - await github.rest.issues.createComment({ owner, repo, issue_number, body: reply }); - } - - // If the command was bare /gov, also trigger Copilot review, links preview, and auto-apply autofill - if (isBare) { - await github.rest.issues.createComment({ owner, repo, issue_number, body: '/gov copilot' }); - // And trigger auto-links preview so contributors can quickly fill PR fields - await github.rest.issues.createComment({ owner, repo, issue_number, body: '/gov links' }); - // Finally, auto-apply link autofill (safe defaults + run links) - await github.rest.issues.createComment({ owner, repo, issue_number, body: '/gov autofill apply' }); - } diff --git a/.github/workflows/ai-governance.yml b/.github/workflows/ai-governance.yml deleted file mode 100644 index cc46689..0000000 --- a/.github/workflows/ai-governance.yml +++ /dev/null @@ -1,542 +0,0 @@ -# @ai-generated: true -# @ai-tool: Copilot -name: AI Governance Checks - -on: - workflow_call: - inputs: - run_markdownlint: - required: false - type: boolean - default: true - run_gitleaks: - required: false - type: boolean - default: true - run_dependency_review: - required: false - type: boolean - default: true - run_scancode: - required: false - type: boolean - default: true - run_sbom: - required: false - type: boolean - default: true - run_codeql: - required: false - 
type: boolean - default: false - lint_command: - required: false - type: string - default: '' - test_command: - required: false - type: string - default: '' - require_ui_transparency: - required: false - type: boolean - default: true - require_dpia_for_user_facing: - required: false - type: boolean - default: true - require_eval_for_high_risk: - required: false - type: boolean - default: false - enable_post_merge_reminders: - required: false - type: boolean - default: true - -permissions: - contents: read - pull-requests: write - issues: write - security-events: write - -jobs: - policy_checks: - name: Policy checks (provenance, risk notes) - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Compute changed files - shell: bash - run: | - base=$(jq -r '.pull_request.base.sha' "$GITHUB_EVENT_PATH") - head=$(jq -r '.pull_request.head.sha' "$GITHUB_EVENT_PATH") - git fetch --no-tags --depth=1 origin "$base" || true - git diff --name-only "$base" "$head" > changed_files.txt || true - echo "Changed files:"; cat changed_files.txt || true - # Heuristic: user-facing change if paths include common web/ui dirs or templates - if grep -Eiq '(^|/)(ui|web|frontend|public|templates|src/.+\.(html|tsx?|vue))$' changed_files.txt; then - echo "user_facing_change=true" >> "$GITHUB_ENV" - else - echo "user_facing_change=false" >> "$GITHUB_ENV" - fi - # Sensitive modules (authz/authn/permissions/secrets) - if grep -Eiq '(^|/)(auth|authn|authz|login|acl|permissions?|access[_-]?control|secrets?|tokens?|jwt|oauth)(/|$)|\.(policy|rego)$' changed_files.txt; then - echo "sensitive_modules=true" >> "$GITHUB_ENV" - else - echo "sensitive_modules=false" >> "$GITHUB_ENV" - fi - # Media assets changed (content provenance / labeling) - if grep -Eiq '\.(png|jpe?g|gif|webp|svg|mp4|mp3|wav|pdf)$' changed_files.txt; then - echo "media_change=true" >> "$GITHUB_ENV" - else - echo "media_change=false" >> "$GITHUB_ENV" - fi - # Infrastructure / deploy manifests changed - if grep -Eiq 
'(^|/)(k8s|kubernetes|helm|charts|deploy|ops|infra|infrastructure|manifests|terraform|ansible)(/|$)|(^|/)dockerfile$|docker-compose\.ya?ml$|Chart\.ya?ml$' changed_files.txt; then - echo "infra_change=true" >> "$GITHUB_ENV" - else - echo "infra_change=false" >> "$GITHUB_ENV" - fi - # Backend/API code changed - if grep -Eiq '(^|/)(src|api|server|backend|app)(/|/.*)([^/]+)\.(js|ts|py|rb|go|java|cs)$' changed_files.txt; then - echo "backend_change=true" >> "$GITHUB_ENV" - else - echo "backend_change=false" >> "$GITHUB_ENV" - fi - # Data/TDM related paths - if grep -Eiq '(^|/)(data|datasets|training|notebooks|scripts)(/|$)' changed_files.txt; then - echo "data_change=true" >> "$GITHUB_ENV" - else - echo "data_change=false" >> "$GITHUB_ENV" - fi - - name: Check PR provenance fields - shell: bash - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - # Prefer live PR body via API to avoid stale event payloads - pr_number=$(jq -r '.pull_request.number // empty' "$GITHUB_EVENT_PATH") - api="${GITHUB_API_URL:-https://api.github.com}" - repo="$GITHUB_REPOSITORY" - body=$(curl -sSf -H "Authorization: Bearer $GITHUB_TOKEN" -H "Accept: application/vnd.github+json" \ - "$api/repos/$repo/pulls/$pr_number" | jq -r '.body // empty' || true) - # Fallback to event payload if API is unavailable - if [ -z "$body" ]; then - body=$(jq -r '.pull_request.body // ""' "$GITHUB_EVENT_PATH") - fi - # Normalize line endings (strip CR) - body=$(printf "%s" "$body" | sed 's/\r$//') - missing=0 - for key in "Prompt" "Model" "Date" "Author"; do - echo "$body" | grep -qi "$key" || { echo "::error::Missing $key in PR body"; missing=1; } - done - # Date must be strict ISO-8601 UTC Z; accept '-' or '*' bullets - echo "$body" | grep -Eiq '^[[:space:]]*[-*]\s*Date:\s*20[0-9]{2}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z(\s*()\s*)?$' || { - echo "::error::Date must be a real UTC ISO-8601 timestamp (e.g., 2025-09-12T10:21:36Z)."; missing=1; } - # Reject placeholders and templating/backticks in key 
fields - echo "$body" | grep -Eiq '\\$\{|\$\(|`|, \${...}, \$(...), or backticks with concrete values or N/A where allowed."; missing=1; } - if [ $missing -ne 0 ]; then - echo "::error::Provenance fields missing in PR body (Prompt/Model/Date/Author)."; exit 1; - fi - - name: Require explicit No PII/Secrets checkbox - shell: bash - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - pr_number=$(jq -r '.pull_request.number // empty' "$GITHUB_EVENT_PATH") - api="${GITHUB_API_URL:-https://api.github.com}" - repo="$GITHUB_REPOSITORY" - body=$(curl -sSf -H "Authorization: Bearer $GITHUB_TOKEN" -H "Accept: application/vnd.github+json" \ - "$api/repos/$repo/pulls/$pr_number" | jq -r '.body // empty' || true) - [ -n "$body" ] || body=$(jq -r '.pull_request.body // ""' "$GITHUB_EVENT_PATH") - body=$(printf "%s" "$body" | sed 's/\r$//') - # Accept either '- [x] No secrets/PII' or '[x] No secrets/PII' (case-insensitive) - echo "$body" | grep -Eqi "\[x\].*no\s+secrets/?pii|no\s+pii/?secrets" || { - echo "::error::Please confirm '[x] No secrets/PII' in the PR checklist."; exit 1; - } - - name: Additional compliance (transparency, DPIA, logging, kill-switch, risk classification) - shell: bash - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - pr_number=$(jq -r '.pull_request.number // empty' "$GITHUB_EVENT_PATH") - api="${GITHUB_API_URL:-https://api.github.com}" - repo="$GITHUB_REPOSITORY" - body=$(curl -sSf -H "Authorization: Bearer $GITHUB_TOKEN" -H "Accept: application/vnd.github+json" \ - "$api/repos/$repo/pulls/$pr_number" | jq -r '.body // empty' || true) - [ -n "$body" ] || body=$(jq -r '.pull_request.body // ""' "$GITHUB_EVENT_PATH") - body=$(printf "%s" "$body" | sed 's/\r$//') - # If user-facing changes and transparency required, enforce checkbox - if [ "${{ inputs.require_ui_transparency }}" = "true" ] && [ "$user_facing_change" = "true" ]; then - echo "$body" | grep -Eqi "\[x\].*transparency\s+notice" || { - echo "::error::For user-facing changes, 
check '[x] Transparency notice updated'"; exit 1; } - fi - # DPIA acknowledgement (link or N/A) for user-facing or personal-data - if [ "${{ inputs.require_dpia_for_user_facing }}" = "true" ] && [ "$user_facing_change" = "true" ]; then - echo "$body" | grep -Eqi "DPIA:\s*(https?://|N/?A)" || { - echo "::error::Add 'DPIA: ' line to PR body for user-facing changes"; exit 1; } - fi - # Logging & kill-switch acknowledgements - echo "$body" | grep -Eqi "\[x\].*agent\s+logging" || { echo "::error::Check '[x] Agent logging enabled]'"; exit 1; } - echo "$body" | grep -Eqi "\[x\].*(kill\-switch|feature\s+flag)" || { echo "::error::Check '[x] Kill-switch / feature flag present]'"; exit 1; } - # Risk classification (limited/high) - echo "$body" | grep -Eqi "Risk\s+classification:\s*(limited|high)" || { echo "::error::Add 'Risk classification: limited|high' to PR body"; exit 1; } - risk=$(echo "$body" | sed -n 's/.*Risk[[:space:]]\{1,\}classification:[[:space:]]*\(limited\|high\).*/\1/ip' | head -n1 | tr '[:upper:]' '[:lower:]') - # Personal data and ADM (automated decision-making) - echo "$body" | grep -Eqi "Personal\s*data:\s*(yes|no)" || { echo "::error::Add 'Personal data: yes|no'"; exit 1; } - personal=$(echo "$body" | sed -n 's/.*Personal[[:space:]]*data:[[:space:]]*\(yes\|no\).*/\1/ip' | head -n1 | tr '[:upper:]' '[:lower:]') - echo "$body" | grep -Eqi "Automated\s*decision\-?making:\s*(yes|no)" || { echo "::error::Add 'Automated decision-making: yes|no'"; exit 1; } - adm=$(echo "$body" | sed -n 's/.*Automated[[:space:]]*decision-\{0,1\}making:[[:space:]]*\(yes\|no\).*/\1/ip' | head -n1 | tr '[:upper:]' '[:lower:]') - # Agent mode used - echo "$body" | grep -Eqi "Agent\s*mode\s*used:\s*(yes|no)" || { echo "::error::Add 'Agent mode used: yes|no'"; exit 1; } - agentmode=$(echo "$body" | sed -n 's/.*Agent[[:space:]]*mode[[:space:]]*used:[[:space:]]*\(yes\|no\).*/\1/ip' | head -n1 | tr '[:upper:]' '[:lower:]') - # Provider vs deployer - echo "$body" | grep -Eqi 
"Role:\s*(provider|deployer)" || { echo "::error::Add 'Role: provider|deployer'"; exit 1; } - role=$(echo "$body" | sed -n 's/.*Role:[[:space:]]*\(provider\|deployer\).*/\1/ip' | head -n1 | tr '[:upper:]' '[:lower:]') - if [ "$role" = "provider" ]; then - echo "$body" | grep -Eqi "GPAI\s*obligations:\s*(https?://|N/?A)" || { echo "::error::Add 'GPAI obligations: '"; exit 1; } - fi - if [ "$role" = "deployer" ]; then - echo "$body" | grep -Eqi "Vendor\s*GPAI\s*compliance\s*reviewed:\s*(https?://|N/?A)" || { echo "::error::Add 'Vendor GPAI compliance reviewed: '"; exit 1; } - fi - # Prohibited practices attestation - echo "$body" | grep -Eqi "\[x\].*no\s+prohibited\s+practices" || { echo "::error::Confirm '[x] No prohibited practices under EU AI Act'"; exit 1; } - # Human oversight if agent mode used or high risk - if [ "$agentmode" = "yes" ] || [ "$risk" = "high" ]; then - echo "$body" | grep -Eqi "\[x\].*human\s+oversight" || { echo "::error::Check '[x] Human oversight retained'"; exit 1; } - fi - # If automated decision-making is yes, require high risk classification and oversight plan - if [ "$adm" = "yes" ]; then - [ "$risk" = "high" ] || { echo "::error::Automated decision-making implies 'Risk classification: high'"; exit 1; } - echo "$body" | grep -Eqi "Oversight\s*plan:\s*(https?://)" || { echo "::error::Add 'Oversight plan: ' for high-risk/ADM"; exit 1; } - fi - # If personal data yes, require DPIA link (even if not user-facing) - if [ "$personal" = "yes" ]; then - echo "$body" | grep -Eqi "DPIA:\s*(https?://)" || { echo "::error::Provide 'DPIA: ' when personal data is processed"; exit 1; } - fi - # High risk: require rollback plan and smoke test - if [ "$risk" = "high" ]; then - echo "$body" | grep -Eqi "Rollback\s*plan:\s*.+" || { echo "::error::Add 'Rollback plan: ' for high-risk changes"; exit 1; } - echo "$body" | grep -Eqi "Smoke\s*test:\s*(https?://)" || { echo "::error::Add 'Smoke test: ' for high-risk changes"; exit 1; } - fi - # Evaluation results: 
optionally enforce for high-risk - if [ "${{ inputs.require_eval_for_high_risk }}" = "true" ] && [ "$risk" = "high" ]; then - echo "$body" | grep -Eqi "Eval\s*set:\s*(https?://)" || { echo "::error::Add 'Eval set: ' for high-risk"; exit 1; } - # Expect 'Error rate: 1.5%' style; must be <= 2 - er=$(echo "$body" | sed -n 's/.*Error[[:space:]]*rate:[[:space:]]*\([0-9]*\.?[0-9]*\)%.*/\1/p' | head -n1) - if [ -z "$er" ]; then echo "::error::Add 'Error rate: ' for high-risk"; exit 1; fi - awk -v er="$er" 'BEGIN { if (er+0 > 2.0) { exit 1 } }' || { echo "::error::Error rate must be <= 2% for high-risk"; exit 1; } - else - # Non-blocking warning if error rate declared > 2% - er=$(echo "$body" | sed -n 's/.*Error[[:space:]]*rate:[[:space:]]*\([0-9]*\.?[0-9]*\)%.*/\1/p' | head -n1) - if [ -n "$er" ]; then awk -v er="$er" 'BEGIN { if (er+0 > 2.0) { print "::warning::Declared error rate > 2%"; } }'; fi - fi - # Prompt injection mitigation for agented backend/data changes - if [ "$agentmode" = "yes" ] && { [ "$backend_change" = "true" ] || [ "$data_change" = "true" ]; }; then - echo "$body" | grep -Eqi "\[x\].*untrusted\s*input\s*sanitized" || { echo "::error::Confirm '[x] Untrusted input sanitized' for agent mode with backend/data changes"; exit 1; } - fi - # License/IP attestation & attribution - echo "$body" | grep -Eqi "\[x\].*license/?ip\s*attestation" || { echo "::error::Confirm '[x] License/IP attestation'"; exit 1; } - echo "$body" | grep -Eqi "Attribution:\s*(https?://|N/?A)" || { echo "::error::Add 'Attribution: ' if applicable"; exit 1; } - # Sensitive modules require security review - if [ "$sensitive_modules" = "true" ]; then - echo "$body" | grep -Eqi "\[x\].*security\s+review|Security\s*review:\s*(https?://)" || { echo "::error::Sensitive modules changed; add '[x] Security review requested' or 'Security review: '"; exit 1; } - fi - # Media assets: require AI content labeling + C2PA link or N/A - if [ "$media_change" = "true" ]; then - echo "$body" | grep -Eqi 
"\[x\].*ai\s*content\s*labeled" || { echo "::error::Media changed; confirm '[x] AI content labeled'"; exit 1; } - echo "$body" | grep -Eqi "C2PA:\s*(https?://|N/?A)" || { echo "::error::Add 'C2PA: ' for media provenance"; exit 1; } - fi - # UI/Accessibility: require accessibility review + statement link when UI changed - if [ "$user_facing_change" = "true" ]; then - echo "$body" | grep -Eqi "\[x\].*accessibility\s+(review|check)" || { echo "::error::UI changed; confirm '[x] Accessibility review (EN 301 549/WCAG)'"; exit 1; } - echo "$body" | grep -Eqi "Accessibility\s*statement:\s*(https?://|N/?A)" || { echo "::error::Add 'Accessibility statement: '"; exit 1; } - fi - # Infra/deploy changes: require privacy notice, lawful basis, retention schedule, NIS2 applicability and incident plan if yes - if [ "$infra_change" = "true" ]; then - echo "$body" | grep -Eqi "Privacy\s*notice:\s*(https?://)" || { echo "::error::Add 'Privacy notice: ' for deploying changes"; exit 1; } - echo "$body" | grep -Eqi "Lawful\s*basis:\s*[A-Za-z]+|N/?A" || { echo "::error::Add 'Lawful basis: '"; exit 1; } - echo "$body" | grep -Eqi "Retention\s*schedule:\s*(https?://|N/?A)" || { echo "::error::Add 'Retention schedule: '"; exit 1; } - echo "$body" | grep -Eqi "NIS2\s*applicability:\s*(yes|no|N/?A)" || { echo "::error::Add 'NIS2 applicability: yes|no|N/A'"; exit 1; } - nis=$(echo "$body" | sed -n 's/.*NIS2[[:space:]]*applicability:[[:space:]]*\(yes\|no\|N\/?A\).*/\1/ip' | head -n1 | tr '[:upper:]' '[:lower:]') - if [ "$nis" = "yes" ]; then - echo "$body" | grep -Eqi "Incident\s*response\s*plan:\s*(https?://)" || { echo "::error::Provide 'Incident response plan: ' for NIS2"; exit 1; } - fi - fi - # Backend/API changes: require OWASP ASVS review link or checkbox - if [ "$backend_change" = "true" ]; then - echo "$body" | grep -Eqi "\[x\].*owasp\s*asvs|ASVS:\s*(https?://)" || { echo "::error::Backend/API changed; confirm '[x] OWASP ASVS review' or add 'ASVS: '"; exit 1; } - fi - # Log retention: 
if personal data yes, high risk, or infra_change, require a log retention policy link or N/A - if [ "$personal" = "yes" ] || [ "$risk" = "high" ] || [ "$infra_change" = "true" ]; then - echo "$body" | grep -Eqi "Log\s*retention\s*policy:\s*(https?://|N/?A)" || { echo "::error::Add 'Log retention policy: '"; exit 1; } - fi - # TDM compliance if data paths changed - if [ "$data_change" = "true" ]; then - echo "$body" | grep -Eqi "TDM:\s*(yes|no|N/?A)" || { echo "::error::Add 'TDM: yes|no|N/A'"; exit 1; } - tdm=$(echo "$body" | sed -n 's/.*TDM:[[:space:]]*\(yes\|no\|N\/?A\).*/\1/ip' | head -n1 | tr '[:upper:]' '[:lower:]') - if [ "$tdm" = "yes" ]; then - echo "$body" | grep -Eqi "TDM\s*compliance:\s*(https?://)" || { echo "::error::Provide 'TDM compliance: ' (dataset/source register)"; exit 1; } - fi - fi - - name: Auto-label PR as ai-assisted - uses: actions/github-script@v7 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const pr = context.payload.pull_request; - if (!pr) return; - const labels = (pr.labels || []).map(l => l.name); - if (!labels.includes('ai-assisted')) { - await github.rest.issues.addLabels({ - owner: context.repo.owner, - repo: context.repo.repo, - issue_number: pr.number, - labels: ['ai-assisted'] - }); - } - const body = pr.body || ''; - const toAdd = []; - if (/Risk\s*classification:\s*high/i.test(body)) toAdd.push('high-risk'); - if (/Personal\s*data:\s*yes/i.test(body)) toAdd.push('personal-data'); - if (/Agent\s*mode\s*used:\s*yes/i.test(body)) toAdd.push('agent-mode'); - const roleMatch = body.match(/Role:\s*(provider|deployer)/i); - if (roleMatch) toAdd.push(roleMatch[1].toLowerCase()); - if (/Security\s*review:/i.test(body) || /\[x\].*security\s+review/i.test(body)) toAdd.push('security-review'); - if (/\[x\].*owasp\s*asvs|ASVS:/i.test(body)) toAdd.push('asvs'); - if (/NIS2\s*applicability:\s*yes/i.test(body)) toAdd.push('nis2'); - // Optionally infer change-type labels here (non-blocking) - if (toAdd.length) { - await 
github.rest.issues.addLabels({ - owner: context.repo.owner, - repo: context.repo.repo, - issue_number: pr.number, - labels: toAdd - }); - } - - name: Require two approvals for high-risk changes - if: ${{ github.event_name == 'pull_request' }} - uses: actions/github-script@v7 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const pr = context.payload.pull_request; - if (!pr) return; - const highRisk = /Risk\s*classification:\s*high/i.test(pr.body || ''); - if (!highRisk) return; - const { data: reviews } = await github.rest.pulls.listReviews({ - owner: context.repo.owner, - repo: context.repo.repo, - pull_number: pr.number, - per_page: 100 - }); - const approvals = new Set(reviews.filter(r => r.state === 'APPROVED').map(r => r.user.login)); - if (approvals.size < 2) { - core.setFailed(`High-risk changes require >= 2 approvals. Current unique approvals: ${approvals.size}`); - } - - name: Comment with guidance (on failure) - if: failure() - uses: actions/github-script@v7 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const pr = context.payload.pull_request; - if (!pr) { core.info('No PR payload; skipping comment'); return; } - const body = - '### AI Governance checks failed\n' + - '\n' + - 'Please fix the following before re-running checks:\n' + - '\n' + - '- Ensure PR body includes provenance fields:\n' + - ' - Prompt\n' + - ' - Model\n' + - ' - Date\n' + - ' - Author\n' + - ' - [x] No secrets/PII (checkbox)\n' + - '- Complete compliance checklist items required for your change type (transparency notice, DPIA, logging, kill-switch, risk classification, human oversight, security review, vendor GPAI review).\n' + - '- Add a rollback note if the change is risky (authz, data export, evaluation logic, etc.).\n' + - '\n' + - 'Helpful links:\n' + - '- PR template (provenance): https://github.com/libis/ai-transition/blob/main/.github/pull_request_template.md\n' + - '- Risk mitigation matrix: 
https://github.com/libis/ai-transition/blob/main/governance/risk_mitigation_matrix.md\n' + - '- Reusable governance workflow: https://github.com/libis/ai-transition/blob/main/.github/workflows/ai-governance.yml\n' + - '\n' + - 'After edits, push updates or re-run the workflow to validate.\n'; - await github.rest.issues.createComment({ - owner: context.repo.owner, - repo: context.repo.repo, - issue_number: pr.number, - body - }); - - name: Risk/rollback note present (non-blocking advisory) - shell: bash - continue-on-error: true - run: | - pr_number=$(jq -r '.pull_request.number // empty' "$GITHUB_EVENT_PATH") - api="${GITHUB_API_URL:-https://api.github.com}" - repo="$GITHUB_REPOSITORY" - body=$(curl -sSf -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" -H "Accept: application/vnd.github+json" \ - "$api/repos/$repo/pulls/$pr_number" | jq -r '.body // empty' || true) - [ -n "$body" ] || body=$(jq -r '.pull_request.body // ""' "$GITHUB_EVENT_PATH") - body=$(printf "%s" "$body" | sed 's/\r$//') - echo "$body" | grep -Eqi "rollback|risk|incident" || echo "::warning::Consider adding a rollback note and risk summary for risky changes." 
- - post_merge_reminders: - name: Post-merge compliance reminders - if: ${{ github.event_name == 'push' && inputs.enable_post_merge_reminders }} - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - id: compute - name: Compute changed files (push) - shell: bash - run: | - before=$(jq -r '.before // empty' "$GITHUB_EVENT_PATH") - after=$(jq -r '.after // env.GITHUB_SHA' "$GITHUB_EVENT_PATH") - if [ -n "$before" ]; then - git fetch --no-tags --depth=1 origin "$before" || true - git diff --name-only "$before" "$after" > changed_files.txt || true - else - git diff --name-only HEAD~1 HEAD > changed_files.txt || true - fi - if grep -Eiq '(^|/)(ui|web|frontend|public|templates|src/.+\.(html|tsx?|vue))$' changed_files.txt; then - echo "user_facing_change=true" >> "$GITHUB_OUTPUT" - else - echo "user_facing_change=false" >> "$GITHUB_OUTPUT" - fi - - name: Create follow-up issue for UI transparency/privacy updates - if: ${{ steps.compute.outputs.user_facing_change == 'true' }} - uses: actions/github-script@v7 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const title = - 'Post-deploy AI compliance checklist (${process.env.GITHUB_SHA.slice(0,7)})\n'; - const body = - 'This is an automated reminder for recent user-facing changes.\\n\\n\n' + - `Checklist:\n` + - `- [ ] Update transparency notice in UI (AI disclosure)\n` + - `- [ ] Update privacy notice and accessibility statements if applicable\n` + - `- [ ] Verify kill-switch / feature flag works in production\n` + - `- [ ] Monitor error rates and agent logs for 7 days\n` + - `- [ ] Archive SBOM and ScanCode artifacts in release or internal registry\n`; - await github.rest.issues.create({ - owner: context.repo.owner, - repo: context.repo.repo, - title, - body, - labels: ['post-deploy-compliance'] - }); - - - markdownlint: - if: ${{ inputs.run_markdownlint }} - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: actions/setup-node@v4 - with: - node-version: '18' - - name: 
Install markdownlint-cli - run: npm install -g markdownlint-cli@0.39.0 - - name: Lint Markdown - run: | - markdownlint "**/*.md" --ignore node_modules || (echo "::error::Markdown lint errors found"; exit 1) - - gitleaks: - if: ${{ inputs.run_gitleaks }} - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - with: - fetch-depth: 0 - - name: gitleaks scan - uses: gitleaks/gitleaks-action@v2 - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - with: - args: --redact - - dependency_review: - if: ${{ inputs.run_dependency_review && github.event_name == 'pull_request' }} - runs-on: ubuntu-latest - steps: - - name: Review dependencies for vulnerabilities & licenses - uses: actions/dependency-review-action@v4 - with: - allow-licenses: 'MIT, BSD-2-Clause, BSD-3-Clause, Apache-2.0, ISC, MPL-2.0' - fail-on-severity: critical - - scancode: - if: ${{ inputs.run_scancode }} - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Install ScanCode - shell: bash - run: | - python -m pip install --upgrade pip - pip install scancode-toolkit - - name: Run ScanCode (JSON) - shell: bash - run: | - scancode --json-pp scancode.json --license --copyright --info . || true - test -s scancode.json || { echo '{}' > scancode.json; } - - name: Upload ScanCode report - uses: actions/upload-artifact@v4 - with: - name: scancode-report - path: scancode.json - - sbom: - if: ${{ inputs.run_sbom }} - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Generate SBOM (SPDX) - uses: anchore/sbom-action@v0 - with: - path: . 
- format: spdx-json - output-file: sbom.spdx.json - - name: Upload SBOM artifact - uses: actions/upload-artifact@v4 - with: - name: sbom-spdx - path: sbom.spdx.json - - codeql: - if: ${{ inputs.run_codeql }} - permissions: - actions: read - contents: read - security-events: write - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Initialize CodeQL - uses: github/codeql-action/init@v3 - with: - languages: javascript, typescript, python, ruby, go, java, cpp - - name: Autobuild - uses: github/codeql-action/autobuild@v3 - - name: Perform CodeQL Analysis - uses: github/codeql-action/analyze@v3 - - lint: - if: ${{ inputs.lint_command != '' }} - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Run linter - run: ${{ inputs.lint_command }} - - tests: - if: ${{ inputs.test_command != '' }} - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Run tests - run: ${{ inputs.test_command }} diff --git a/.github/workflows/code-review-agent.yml b/.github/workflows/code-review-agent.yml deleted file mode 100644 index 2ab4b05..0000000 --- a/.github/workflows/code-review-agent.yml +++ /dev/null @@ -1,153 +0,0 @@ - # @ai-generated: true - # @ai-tool: Copilot -name: AI Code Review Agent (Python) - -on: - pull_request: - types: [opened, synchronize, reopened] - -permissions: - contents: read - pull-requests: write - -jobs: - python_review: - if: ${{ github.event_name == 'pull_request' }} - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - with: - fetch-depth: 0 - - - name: Determine changed Python files - id: diff - shell: bash - run: | - if ! 
command -v jq >/dev/null 2>&1; then sudo apt-get update && sudo apt-get install -y jq; fi - base=$(jq -r '.pull_request.base.sha' "$GITHUB_EVENT_PATH") - head=$(jq -r '.pull_request.head.sha' "$GITHUB_EVENT_PATH") - git fetch --no-tags --depth=1 origin "$base" || true - git diff --name-only "$base" "$head" | grep -E '\.py$' > py_changed.txt || true - count=$(wc -l < py_changed.txt | tr -d ' ') - echo "Changed Python files ($count):" - cat py_changed.txt || true - if [ "$count" -eq 0 ]; then - echo "has_py=false" >> "$GITHUB_OUTPUT" - else - echo "has_py=true" >> "$GITHUB_OUTPUT" - fi - - - name: Set up Python - if: steps.diff.outputs.has_py == 'true' - uses: actions/setup-python@v5 - with: - python-version: '3.11' - - - name: Install linters (ruff, bandit) - if: steps.diff.outputs.has_py == 'true' - run: | - python -m pip install --upgrade pip - pip install ruff==0.5.7 bandit==1.7.9 - - - name: Run Ruff (style/quality) - if: steps.diff.outputs.has_py == 'true' - shell: bash - run: | - mapfile -t files < py_changed.txt || true - if [ ${#files[@]} -gt 0 ]; then - ruff check "${files[@]}" --output-format=json > ruff.json || true - else - echo '[]' > ruff.json - fi - - - name: Run Bandit (security) - if: steps.diff.outputs.has_py == 'true' - shell: bash - run: | - mapfile -t files < py_changed.txt || true - if [ ${#files[@]} -gt 0 ]; then - bandit -q -f json -o bandit.json "${files[@]}" || true - else - echo '{"results":[]}' > bandit.json - fi - - - name: Comment review summary - uses: actions/github-script@v7 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const fs = require('fs'); - const owner = context.repo.owner; - const repo = context.repo.repo; - const issue_number = context.payload.pull_request.number; - - function readJsonSafe(path, fallback) { - try { return JSON.parse(fs.readFileSync(path, 'utf8')); } catch { return fallback; } - } - - const hasPy = fs.existsSync('py_changed.txt') && fs.readFileSync('py_changed.txt','utf8').trim().length > 
0; - const pyFiles = hasPy ? fs.readFileSync('py_changed.txt','utf8').trim().split('\n') : []; - const ruff = readJsonSafe('ruff.json', []); - const bandit = readJsonSafe('bandit.json', { results: [] }); - - // Normalize Ruff findings - const ruffByFile = new Map(); - for (const f of ruff) { - // Ruff json entries may include filename and diagnostics array or directly as list; handle both - if (f && f.filename && Array.isArray(f.diagnostics)) { - for (const d of f.diagnostics) { - const k = f.filename; - const arr = ruffByFile.get(k) || []; - arr.push({ - line: d.range?.start?.line ?? d.location?.row ?? 0, - col: d.range?.start?.column ?? d.location?.column ?? 0, - code: d.code || d.rule || 'RUFF', - msg: d.message || '' - }); - ruffByFile.set(k, arr); - } - } else if (f && f.filename && f.rule && f.message) { - const arr = ruffByFile.get(f.filename) || []; - arr.push({ line: f.location?.row ?? 0, col: f.location?.column ?? 0, code: f.rule, msg: f.message }); - ruffByFile.set(f.filename, arr); - } - } - - // Normalize Bandit findings - const banditByFile = new Map(); - for (const r of bandit.results || []) { - const fn = r.filename || 'unknown'; - const arr = banditByFile.get(fn) || []; - arr.push({ line: r.line_number || 0, sev: r.issue_severity || 'MEDIUM', conf: r.issue_confidence || 'MEDIUM', msg: r.issue_text || '' }); - banditByFile.set(fn, arr); - } - - const mk = (arr) => arr.map(x => `- L${x.line}${x.col?':C'+x.col:''} ${x.code? '['+x.code+'] ':''}${x.msg}`).join('\n'); - const mkb = (arr) => arr.map(x => `- L${x.line} [${x.sev}/${x.conf}] ${x.msg}`).join('\n'); - - const ruffCount = Array.from(ruffByFile.values()).reduce((a,b)=>a+b.length,0); - const banditCount = Array.from(banditByFile.values()).reduce((a,b)=>a+b.length,0); - - let body = - '### Code Review Agent (Python)\\n\n'; - if (!hasPy) { - body += `No Python files changed. 
Skipping analysis.`; - } else { - body += `Analyzed ${pyFiles.length} Python file(s).\n\n`; - body += `Ruff findings: ${ruffCount}\n`; - for (const [file, items] of ruffByFile.entries()) { - body += `\n${file}\n${mk(items)}\n`; - } - body += `\nBandit findings: ${banditCount}\n`; - for (const [file, items] of banditByFile.entries()) { - body += `\n${file}\n${mkb(items)}\n`; - } - if (ruffCount === 0 && banditCount === 0) { - body += `\nNo issues found. ✅`; - } else { - body += `\nNote: This is advisory and does not block the PR. Consider addressing issues above.`; - } - } - - await github.rest.issues.createComment({ owner, repo, issue_number, body }); diff --git a/.github/workflows/copilot-pr-review.yml b/.github/workflows/copilot-pr-review.yml deleted file mode 100644 index bb4caad..0000000 --- a/.github/workflows/copilot-pr-review.yml +++ /dev/null @@ -1,124 +0,0 @@ -# @ai-generated: true -# @ai-tool: GitHub Copilot -name: Copilot PR Review (on-demand) - -on: - issue_comment: - types: [created] - -permissions: - contents: read - pull-requests: write - -jobs: - copilot-review: - name: Generate Copilot review - # Run only on PRs and only when explicitly asked - if: >- - ${{ github.event.issue.pull_request && - (startsWith(github.event.comment.body, '/gov copilot') || - startsWith(github.event.comment.body, '/copilot review')) }} - runs-on: ubuntu-latest - steps: - - name: Build PR context for Copilot - id: ctx - uses: actions/github-script@v7 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const owner = context.repo.owner; - const repo = context.repo.repo; - const issue_number = context.payload.issue.number; - - const { data: pr } = await github.rest.pulls.get({ owner, repo, pull_number: issue_number }); - const files = await github.paginate(github.rest.pulls.listFiles, { owner, repo, pull_number: issue_number, per_page: 100 }); - - const truncate = (s, n) => (s ? (s.length > n ? 
s.slice(0, n) + '…' : s) : ''); - const safeLines = (s, maxLines = 200) => (s || '').split('\n').slice(0, maxLines).join('\n'); - - const fileSummaries = files.map(f => `- ${f.filename} (+${f.additions}/-${f.deletions}, ${f.status}${f.changes ? ", ~"+f.changes+" lines" : ''})`).join('\n'); - const diffs = files.slice(0, 10).map(f => { - const patch = safeLines(truncate(f.patch || '', 6000), 250); - return `--- ${f.filename} (${f.status}, +${f.additions}/-${f.deletions})\n${patch}`; - }).join('\n\n'); - - const prBody = truncate(pr.body || '', 3000); - const prompt = - 'You are GitHub Copilot. Review the following Pull Request and provide a concise, helpful code review.\\n\\n\n'+ - `Please include:\n`+ - `- A 2-5 sentence summary of the changes.\n`+ - `- Potential bugs, security issues, or edge cases (with reasons).\n`+ - `- Testing gaps or scenarios to add.\n`+ - `- Style or clarity suggestions.\n`+ - `- Any breaking changes or migration notes.\n\n`+ - `Use short bullet points, reference files and line ranges when relevant, and be factual. 
If nothing notable, say "No significant issues found."\n\n`+ - `PR Title: ${pr.title}\n`+ - `Author: @${pr.user.login}\n`+ - `Base: ${pr.base.ref} -> Head: ${pr.head.ref}\n`+ - `PR Description (truncated):\n${prBody}\n\n`+ - `Changed files (${files.length}):\n${fileSummaries}\n\n`+ - `Sample diffs (first up to 10 files, truncated for length):\n\n${diffs}`; - - core.setOutput('prompt', prompt); - // Also write prompt to a workspace file to avoid shell quoting issues - const fs = require('fs'); - const path = require('path'); - const promptPath = path.join(process.cwd(), 'copilot_prompt.txt'); - fs.writeFileSync(promptPath, prompt, { encoding: 'utf8' }); - core.setOutput('prompt_path', promptPath); - - - name: Install gh Copilot extension - env: - GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - gh --version - gh extension install github/gh-copilot || gh extension upgrade github/gh-copilot || true - - - name: Ask Copilot for review - id: ask - env: - GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} - shell: bash - run: | - REVIEW_FILE="review.md" - FINAL_FILE="final.md" - - PROMPT_FILE="copilot_prompt.txt" - # Ensure prompt file exists - if [ ! -f "$PROMPT_FILE" ]; then - echo "::error::Prompt file not found at $PROMPT_FILE"; exit 1; - fi - - # Try Copilot; if unavailable, produce a friendly fallback - if gh extension list | grep -q "github/gh-copilot"; then - if gh copilot --help > /dev/null 2>&1; then - if ! gh copilot --plain -p "$(cat "$PROMPT_FILE")" > "$REVIEW_FILE" 2>/tmp/copilot.err; then - printf "%s\n" "GitHub Copilot couldn't generate a review (perhaps not enabled for this org/repo)." > "$REVIEW_FILE" - printf "\n%s\n" "Tip: Ensure GitHub Copilot is enabled for Pull Requests in your organization or repository." >> "$REVIEW_FILE" - fi - else - echo "GitHub Copilot CLI not available on runner." > "$REVIEW_FILE" - fi - else - echo "GitHub Copilot extension not installed or unavailable." 
> "$REVIEW_FILE" - fi - - { - echo "### GitHub Copilot PR Review" - echo - cat "$REVIEW_FILE" - echo - echo "_Generated by GitHub Copilot. Please verify important suggestions before applying._" - } > "$FINAL_FILE" - - echo "final_path=$FINAL_FILE" >> "$GITHUB_OUTPUT" - - - name: Post review as PR comment - if: always() - env: - GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} - PR_NUMBER: ${{ github.event.issue.number }} - run: | - FILE="${{ steps.ask.outputs.final_path }}" - if [ -z "$FILE" ] || [ ! -f "$FILE" ]; then FILE="final.md"; fi - gh pr comment "$PR_NUMBER" -F "$FILE" diff --git a/.github/workflows/gov-review.yml b/.github/workflows/gov-review.yml deleted file mode 100644 index 8852374..0000000 --- a/.github/workflows/gov-review.yml +++ /dev/null @@ -1,206 +0,0 @@ -# @ai-generated: true -# @ai-tool: Copilot -name: Governance Reports Review (/gov) - -on: - issue_comment: - types: [created] - -permissions: - contents: read - actions: read - issues: write - pull-requests: write - -jobs: - review: - if: ${{ github.event.issue.pull_request && startsWith(github.event.comment.body || '', '/gov') }} - runs-on: ubuntu-latest - steps: - - name: Parse command - id: parse - uses: actions/github-script@v7 - with: - script: | - const body = (context.payload.comment?.body || '').trim(); - const parts = body.split(/\s+/); - let cmd = (parts[1] || 'check').toLowerCase(); - const allowed = new Set(['help','check','licenses','sbom']); - if (!allowed.has(cmd)) cmd = 'help'; - core.setOutput('cmd', cmd); - core.setOutput('do_licenses', String(cmd === 'check' || cmd === 'licenses')); - core.setOutput('do_sbom', String(cmd === 'check' || cmd === 'sbom')); - - - name: Show help - if: ${{ steps.parse.outputs.cmd == 'help' }} - uses: actions/github-script@v7 - with: - script: | - const {owner, repo} = context.repo; - const help = [ - '### /gov help', - '', - 'Usage:', - '- `/gov` or `/gov check` — summarize both ScanCode and SBOM artifacts', - '- `/gov licenses` — summarize ScanCode 
(licenses) only', - '- `/gov sbom` — summarize SBOM (packages) only', - ].join('\n'); - await github.rest.issues.createComment({owner, repo, issue_number: context.issue.number, body: help}); - - - name: Extract PR info - if: ${{ steps.parse.outputs.cmd != 'help' }} - id: pr - uses: actions/github-script@v7 - with: - script: | - const {owner, repo} = context.repo; - const prNumber = context.issue.number; - const {data: pr} = await github.rest.pulls.get({owner, repo, pull_number: prNumber}); - core.setOutput('head_ref', pr.head.ref); - core.setOutput('head_sha', pr.head.sha); - core.setOutput('base_ref', pr.base.ref); - - - name: Find latest PR Governance run for this PR - if: ${{ steps.parse.outputs.cmd != 'help' }} - id: findrun - uses: actions/github-script@v7 - env: - HEAD_REF: ${{ steps.pr.outputs.head_ref }} - with: - script: | - const {owner, repo} = context.repo; - const headRef = process.env.HEAD_REF; - const {data} = await github.rest.actions.listWorkflowRunsForRepo({owner, repo, per_page: 50}); - const target = data.workflow_runs - .filter(r => r.head_branch === headRef && r.name === 'PR Governance (licenses & secrets)') - .sort((a,b) => new Date(b.created_at) - new Date(a.created_at))[0]; - if (!target) { - core.setFailed('No matching PR Governance run found for this branch.'); - return; - } - core.info(`Using run id: ${target.id}`); - core.setOutput('run_id', String(target.id)); - - - name: Download ScanCode artifact - if: ${{ steps.parse.outputs.cmd != 'help' }} - uses: dawidd6/action-download-artifact@v11 - with: - github_token: ${{ secrets.GITHUB_TOKEN }} - run_id: ${{ steps.findrun.outputs.run_id }} - name: scancode-report - path: gov-artifacts - if_no_artifact_found: warn - - - name: Download SBOM artifact - if: ${{ steps.parse.outputs.cmd != 'help' }} - uses: dawidd6/action-download-artifact@v11 - with: - github_token: ${{ secrets.GITHUB_TOKEN }} - run_id: ${{ steps.findrun.outputs.run_id }} - name: sbom-spdx - path: gov-artifacts - 
if_no_artifact_found: warn - - - name: List downloaded files (debug) - if: ${{ steps.parse.outputs.cmd != 'help' }} - run: | - echo "Downloaded artifacts:" - find gov-artifacts -maxdepth 3 -type f -print || true - - - name: Summarize reports - if: ${{ steps.parse.outputs.cmd != 'help' }} - id: summarize - shell: bash - run: | - set -euo pipefail - if ! command -v jq >/dev/null 2>&1; then - sudo apt-get update && sudo apt-get install -y jq - fi - mkdir -p gov-artifacts - summary=gov-artifacts/GOV_SUMMARY.md - BUF=$'' - append_line() { BUF+="$1"$'\n'; } - append_blank() { BUF+=$'\n'; } - - append_line "## Governance reports summary" - append_line "Run ID: ${{ steps.findrun.outputs.run_id }}" - append_blank - - COP=0 - UNK=0 - # Resolve file paths when artifact names or folders differ - SCAN_FILE="gov-artifacts/scancode.json" - if [ ! -f "$SCAN_FILE" ]; then - alt=$(find gov-artifacts -type f -name 'scancode*.json' | head -n1 || true) - if [ -n "${alt:-}" ]; then SCAN_FILE="$alt"; fi - fi - SBOM_FILE="gov-artifacts/sbom.spdx.json" - if [ ! -f "$SBOM_FILE" ]; then - alt=$(find gov-artifacts -type f \( -name '*.spdx.json' -o -name 'sbom*.json' \) | head -n1 || true) - if [ -n "${alt:-}" ]; then SBOM_FILE="$alt"; fi - fi - - if [ "${{ steps.parse.outputs.do_licenses }}" = "true" ]; then - if [ -f "$SCAN_FILE" ]; then - COP=$(jq -r '[.files[]? | .licenses[]? | (.spdx_license_key // "") | ascii_downcase | select(test("(^|[^a-z])(agpl|gpl|lgpl)([^a-z]|$)"))] | length' "$SCAN_FILE") - UNK=$(jq -r '[.files[]? | .licenses[]? | (.spdx_license_key // "") | ascii_downcase | select(. == "unknown" or . == "noassertion" or . == "")] | length' "$SCAN_FILE") - append_line "### ScanCode (licenses)" - append_line "- Copyleft findings (AGPL/GPL/LGPL): ${COP}" - append_line "- Unknown/NoAssertion licenses: ${UNK}" - append_line "- Top files with copyleft/unknown:" - TOP=$(jq -r '( - [.files[]? | select(.licenses) | {path: .path, keys: ([.licenses[]? 
| (.spdx_license_key // "")|ascii_downcase])}] - | map(select((.keys|join(" ")) | test("agpl|gpl|lgpl|unknown|noassertion"))) - | .[:5] - | map(" - " + .path) - | .[]) - ' "$SCAN_FILE" || true) - if [ -n "${TOP:-}" ]; then BUF+="$TOP"$'\n'; fi - append_blank - else - append_line "### ScanCode (licenses)" - append_line "- Artifact not found." - append_blank - fi - fi - - if [ "${{ steps.parse.outputs.do_sbom }}" = "true" ]; then - if [ -f "$SBOM_FILE" ]; then - TOTAL=$(jq -r '[.packages[]?] | length' "$SBOM_FILE") - GPL=$(jq -r '[.packages[]? | (.licenseConcluded // .licenseDeclared // "") | ascii_downcase | select(test("(^|[^a-z])(agpl|gpl|lgpl)([^a-z]|$)"))] | length' "$SBOM_FILE") - NOA=$(jq -r '[.packages[]? | (.licenseConcluded // .licenseDeclared // "") | ascii_downcase | select(. == "noassertion" or . == "unknown" or . == "")] | length' "$SBOM_FILE") - append_line "### SBOM (SPDX)" - append_line "- Packages: ${TOTAL}" - append_line "- Copyleft package licenses (AGPL/GPL/LGPL): ${GPL}" - append_line "- Unknown/NoAssertion package licenses: ${NOA}" - append_blank - else - append_line "### SBOM (SPDX)" - append_line "- Artifact not found." 
- append_blank - fi - fi - - printf "%s" "$BUF" > "$summary" - printf "%s\n" "copyleft_files=${COP}" >> "$GITHUB_OUTPUT" - printf "%s\n" "unknown_files=${UNK}" >> "$GITHUB_OUTPUT" - printf "%s" "$BUF" >> "$GITHUB_STEP_SUMMARY" - - - name: Comment summary on PR - if: ${{ steps.parse.outputs.cmd != 'help' }} - uses: actions/github-script@v7 - with: - script: | - const fs = require('fs'); - const {owner, repo} = context.repo; - const body = fs.readFileSync('gov-artifacts/GOV_SUMMARY.md','utf8'); - await github.rest.issues.createComment({owner, repo, issue_number: context.issue.number, body}); - - - name: Add label if review needed - if: ${{ steps.parse.outputs.cmd != 'help' && (steps.summarize.outputs.copyleft_files != '0' || steps.summarize.outputs.unknown_files != '0') }} - uses: actions/github-script@v7 - with: - script: | - const {owner, repo} = context.repo; - const labels = ['license-review-needed']; - await github.rest.issues.addLabels({owner, repo, issue_number: context.issue.number, labels}); diff --git a/.github/workflows/governance-smoke.yml b/.github/workflows/governance-smoke.yml deleted file mode 100644 index f149a76..0000000 --- a/.github/workflows/governance-smoke.yml +++ /dev/null @@ -1,33 +0,0 @@ -# @ai-generated: true -# @ai-tool: Copilot -name: Governance Smoke Tests - -on: - pull_request: - paths: - - '.github/workflows/**' - - 'tests/governance/**' - push: - paths: - - '.github/workflows/**' - - 'tests/governance/**' - -permissions: - contents: read - -jobs: - node-smoke: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: actions/setup-node@v4 - with: - node-version: '20' - - name: Run governance smoke tests (if present) - shell: bash - run: | - if [ -f tests/governance/smoke.test.js ]; then - node tests/governance/smoke.test.js - else - echo "No governance smoke test found; skipping" - fi diff --git a/.github/workflows/pr-autolinks.yml b/.github/workflows/pr-autolinks.yml deleted file mode 100644 index 86680a4..0000000 --- 
a/.github/workflows/pr-autolinks.yml +++ /dev/null @@ -1,136 +0,0 @@ -# @ai-generated: true -# @ai-tool: Copilot -name: PR Auto Links (/gov autofill) - -on: - issue_comment: - types: [created] - -permissions: - contents: read - pull-requests: write - actions: read - -jobs: - autolinks: - if: >- - ${{ github.event.issue.pull_request && - (startsWith(github.event.comment.body || '', '/gov autofill') || - startsWith(github.event.comment.body || '', '/gov links')) }} - runs-on: ubuntu-latest - steps: - - name: Preview or apply autofill - uses: actions/github-script@v7 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const cmd = (context.payload.comment?.body || '').trim(); - const apply = /^\/gov\s+autofill\s+apply!?/i.test(cmd); - const previewOnly = /^\/gov\s+links/i.test(cmd) || /^\/gov\s+autofill\s*$/i.test(cmd); - - const {owner, repo} = context.repo; - const prNumber = context.issue.number; - const {data: pr} = await github.rest.pulls.get({owner, repo, pull_number: prNumber}); - - // Compute change flags and latest run links - const files = await github.paginate(github.rest.pulls.listFiles, { owner, repo, pull_number: prNumber, per_page: 100 }); - const changed = files.map(f => f.filename); - const rx = { - userUI: /(^|\/)(ui|web|frontend|public|templates)(\/|$)|(^|\/)src\/.*\.(html|tsx?|vue)$/i, - sensitive: /(^|\/)(auth|authn|authz|login|acl|permissions?|access[_-]?control|secrets?|tokens?|jwt|oauth)(\/|$)|\.(policy|rego)$/i, - infra: /(^|\/)(k8s|kubernetes|helm|charts|deploy|ops|infra|infrastructure|manifests|terraform|ansible)(\/|$)|(^|\/)dockerfile$|docker-compose\.ya?ml$|Chart\.ya?ml$/i, - backend: /(^|\/)(src|api|server|backend|app)(\/|\/.*)([^\/]+)\.(js|ts|py|rb|go|java|cs)$/i, - media: /\.(png|jpe?g|gif|webp|svg|mp4|mp3|wav|pdf)$/i, - data: /(^|\/)(data|datasets|training|notebooks|scripts)(\/|$)/i - }; - const has = (re) => changed.some(p => re.test(p)); - const flags = { - userUI: has(rx.userUI), - sensitive: has(rx.sensitive), - infra: 
has(rx.infra), - backend: has(rx.backend), - media: has(rx.media), - data: has(rx.data) - }; - - const prBody = (pr.body || '').toString(); - const {data} = await github.rest.actions.listWorkflowRunsForRepo({owner, repo, per_page: 100}); - const pick = (name) => data.workflow_runs - .filter(r => r.head_branch === pr.head.ref && r.name === name) - .sort((a,b) => new Date(b.created_at) - new Date(a.created_at))[0]; - const gov = pick('PR Governance (licenses & secrets)'); - const tests = pick('Run Unit Tests'); - const mkRunUrl = (r) => r ? `https://github.com/${owner}/${repo}/actions/runs/${r.id}` : ''; - const links = { - governance_run: mkRunUrl(gov), - scancode_report: mkRunUrl(gov), - sbom_report: mkRunUrl(gov), - unit_tests: mkRunUrl(tests) - }; - - if (previewOnly && !apply) { - const rows = []; - if (links.governance_run) { - rows.push(`- ScanCode report: ${links.governance_run} (artifact: scancode-report)`); - rows.push(`- SBOM (SPDX): ${links.sbom_report} (artifact: sbom-spdx)`); - } else { - rows.push('- ScanCode/SBOM: not found yet (run PR Governance workflow)'); - } - if (links.unit_tests) rows.push(`- Unit tests: ${links.unit_tests}`); - rows.push('- C2PA: N/A'); - rows.push('- Accessibility statement: N/A'); - rows.push('- Retention schedule: N/A'); - rows.push('- Log retention policy: N/A'); - rows.push('- Smoke test: use latest Unit tests run link above'); - - const body = [ - '### Auto-links suggestions', - '', - ...rows, - '', - 'Tip: Use \'/gov autofill apply\' to apply safe defaults (N/A where allowed) and add run links into the PR body.' 
- ].join('\n'); - - await github.rest.issues.createComment({ owner, repo, issue_number: prNumber, body }); - return; - } - - if (apply) { - let body = prBody; - const updateField = (text, label, value, opts={}) => { - const re = new RegExp(`(^|\n)([\t ]*)${label}[\t ]*:[\t ]*(.*)$`, 'i'); - const m = text.match(re); - if (!m) return text; // line not present - const current = m[3].trim(); - const isPlaceholder = current === '' || /^<.*>$/.test(current) || /^N\/?A$/i.test(current); - if (!isPlaceholder && !opts.force) return text; - const prefix = m[1] + (m[2] || ''); - return text.replace(re, `${prefix}${label}: ${value}`); - }; - - body = updateField(body, 'C2PA', 'N/A'); - body = updateField(body, 'Accessibility statement', 'N/A'); - body = updateField(body, 'Retention schedule', 'N/A'); - body = updateField(body, 'Log retention policy', 'N/A'); - if (links.unit_tests) body = updateField(body, 'Smoke test', links.unit_tests); - - if (!/##\s*Governance artifacts/i.test(body)) { - const extras = []; - extras.push('', '## Governance artifacts', ''); - if (links.governance_run) { - extras.push(`- ScanCode: ${links.governance_run} (artifact: scancode-report)`); - extras.push(`- SBOM: ${links.sbom_report} (artifact: sbom-spdx)`); - } - if (links.unit_tests) extras.push(`- Unit tests: ${links.unit_tests}`); - body += '\n' + extras.join('\n') + '\n'; - } - - await github.rest.pulls.update({ owner, repo, pull_number: prNumber, body }); - const msg = [ - 'Applied auto-fill updates to the PR body:', - '- Set N/A for C2PA, Accessibility statement, Retention schedule, Log retention policy (placeholders only).', - '- Filled Smoke test with latest Unit Tests run link (if available).', - '- Appended Governance artifacts section with run links.' 
- ].join('\n'); - await github.rest.issues.createComment({ owner, repo, issue_number: prNumber, body: msg }); - } diff --git a/.github/workflows/pr-governance.yml b/.github/workflows/pr-governance.yml deleted file mode 100644 index 36ea2f5..0000000 --- a/.github/workflows/pr-governance.yml +++ /dev/null @@ -1,38 +0,0 @@ -# @ai-generated: true -# @ai-tool: Copilot -name: PR Governance (licenses & secrets) - -on: - pull_request: - types: [opened, synchronize, reopened] - -permissions: - actions: read - contents: read - pull-requests: write - issues: write - security-events: write - -jobs: - governance: - name: Reusable AI governance checks - # NOTE for downstream projects: - # This reference works only when calling the reusable workflow from THIS repository. - # After you copy a local version of `.github/workflows/ai-governance.yml` into your project, - # update the 'uses:' line to your repo path, e.g.: - # uses: //.github/workflows/ai-governance.yml@main - # Otherwise the workflow_call will fail in the consumer repository. 
- uses: ./.github/workflows/ai-governance.yml - with: - run_markdownlint: true - run_scancode: true - run_sbom: true - run_gitleaks: false - run_dependency_review: false - run_codeql: false - lint_command: 'make fmt' - test_command: 'make test' - require_ui_transparency: true - require_dpia_for_user_facing: true - require_eval_for_high_risk: false - enable_post_merge_reminders: true diff --git a/.github/workflows/run-unit-tests.yml b/.github/workflows/run-unit-tests.yml deleted file mode 100644 index b017323..0000000 --- a/.github/workflows/run-unit-tests.yml +++ /dev/null @@ -1,68 +0,0 @@ -# @ai-generated: true -# @ai-tool: Copilot -name: Run Unit Tests - -on: - pull_request: - types: [opened, synchronize, reopened] - -permissions: - contents: read - -jobs: - node_tests: - name: Node.js tests - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: actions/setup-node@v4 - with: - node-version: '20' - - name: Install dependencies (if package.json exists) - shell: bash - run: | - if [ -f package.json ]; then - if [ -f package-lock.json ]; then npm ci; else npm install; fi - else - echo "No package.json; skipping Node setup"; exit 0 - fi - - name: Run npm test (if defined) - shell: bash - run: | - if [ -f package.json ]; then - if node -e "const p=require('./package.json');process.exit(p.scripts&&p.scripts.test?0:1)"; then - npm test --silent || (echo "::error::npm test failed"; exit 1) - else - echo "No test script defined; skipping Node tests"; - fi - else - echo "No package.json; skipping"; - fi - - python_tests: - name: Python tests - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: actions/setup-python@v5 - with: - python-version: '3.11' - - name: Install dependencies - shell: bash - run: | - python -m pip install --upgrade pip - if [ -f requirements.txt ]; then - pip install -r requirements.txt || true - fi - # Ensure pytest available - pip install pytest || true - - name: Run pytest (if tests exist) - shell: bash - env: 
- PYTHONPATH: ${{ github.workspace }} - run: | - if find tests -type f -name "*.py" 2>/dev/null | grep -q .; then - pytest -q || (echo "::error::pytest failed"; exit 1) - else - echo "No Python tests found; skipping" - fi diff --git a/.github/workflows/workflow-lint.yml b/.github/workflows/workflow-lint.yml deleted file mode 100644 index 8c7155e..0000000 --- a/.github/workflows/workflow-lint.yml +++ /dev/null @@ -1,28 +0,0 @@ -# @ai-generated: true -# @ai-tool: Copilot -name: Workflow Lint (actionlint) - -on: - pull_request: - paths: - - '.github/workflows/**' - push: - paths: - - '.github/workflows/**' - -permissions: - contents: read -jobs: - actionlint: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Install shellcheck - run: sudo apt-get update && sudo apt-get install -y shellcheck - - name: Install actionlint (pinned) - shell: bash - run: | - curl -sSfL https://raw.githubusercontent.com/rhysd/actionlint/main/scripts/download-actionlint.bash | bash -s 1.7.1 - - name: Run actionlint - shell: bash - run: ./actionlint -shellcheck=shellcheck From 23ee915a3b513b046029570ebbe275508407c140 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Wed, 17 Jun 2026 15:28:19 +0200 Subject: [PATCH 24/25] ci: drop dangling /.github/workflows/* CODEOWNERS entry The governance workflows it referenced were removed in the prior commit. --- .github/CODEOWNERS | 1 - 1 file changed, 1 deletion(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 090690d..c4026e9 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -4,4 +4,3 @@ /governance/* @ErykKul /governance/** @ErykKul /policies/* @ErykKul -/.github/workflows/* @ErykKul From 1dce5e7efe11f40dabad7c5da220840910d97f06 Mon Sep 17 00:00:00 2001 From: ErykKul Date: Wed, 17 Jun 2026 15:29:26 +0200 Subject: [PATCH 25/25] ci: remove CODEOWNERS (all entries pointed at non-existent paths) governance/, policies/ and .github/workflows/ no longer exist on this branch. 
---
 .github/CODEOWNERS | 6 ------
 1 file changed, 6 deletions(-)
 delete mode 100644 .github/CODEOWNERS

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
deleted file mode 100644
index c4026e9..0000000
--- a/.github/CODEOWNERS
+++ /dev/null
@@ -1,6 +0,0 @@
-# CODEOWNERS — interim owners for governance and workflows
-# Replace with your org/team handles when available (e.g., @libis/security @libis/privacy).
-
-/governance/* @ErykKul
-/governance/** @ErykKul
-/policies/* @ErykKul