REDCap v2 integration plugin (de-identification + metadata sidecars) #8

Open
ErykKul wants to merge 26 commits into main from redcap_v2


ErykKul commented Jun 17, 2026

Contributor

@AI-Tool: Claude (Anthropic)

Summary

  • Add the REDCap v2 integration plugin: connect to a REDCap project and export records + metadata into Dataverse.
  • De-identification engine: per-field drop and HMAC-pseudonymize transforms, researcher-managed key handling, and manifest redaction with provenance (Phases 3.9 & 4).
  • Metadata sidecars: generate Croissant, RO-Crate and DDI-CDI with variable-level metadata and DDI-CDI SHACL conformance; per-file MIME plumbing plus dedicated previewer registration (RO-Crate + Croissant), referencing the released DDI-CDI context URL.
  • Metadata() hook to prefill Dataverse citation metadata (Phase 5); configurable export timeout, performance pass and security review (Phase 6).
  • Unit tests (parameter routing, blanking, bundle generation) and docs (REDCap integration user guide, ddi-cdi.md).
  • CI: removes the flaky @ai-generated AI-governance GitHub Actions workflows (PR governance / code-review / unit-test runners, /gov ChatOps, and the reusable governance workflow) so PRs run clean — see the ci: commit.

AI Provenance (required for AI-assisted changes)

  • Prompt: iterative development (no single prompt); see redcap.md design doc
  • Model: Claude (Anthropic), via Claude Code
  • Date: 2026-06-17
  • Author: @ErykKul
  • Role: deployer

Compliance checklist

  • No secrets/PII
  • Transparency notice updated (if user-facing)
  • Agent logging enabled (actions/decisions logged)
  • Kill-switch / feature flag present for AI features
  • No prohibited practices under EU AI Act
  • Human oversight retained (required if high-risk or agent mode)
  • Risk classification: limited
  • Personal data: yes (REDCap research data; de-identified on export by this feature)
  • DPIA: N/A
  • Automated decision-making: no
  • Agent mode used: no
  • GPAI obligations: N/A
  • Vendor GPAI compliance reviewed: N/A
  • License/IP attestation
  • Attribution: N/A
  • Oversight plan: N/A

Change-type specifics

  • Security review: completed as part of Phase 6 (see the Phase 6 commit)
  • Backend/API changed:
    • ASVS: N/A
  • Data paths changed:
    • TDM: N/A

Tests & Risk

  • Unit/integration tests added/updated
  • Rollback plan: feature is additive (new plugin + export modes); revert the PR to remove.
  • Docs updated (if needed)

ErykKul added 26 commits March 10, 2026 13:52
…t implementation status and export mode details
…ndle generation

Covers option parsing/normalization, report-vs-records API parameter
routing, blank anonymization for CSV and JSON exports, virtual node
generation, hash determinism, bundle caching, and the variables/Options
flow against a fake in-memory REDCap server (~91% statement coverage).
Marks the last Phase 3 item done in redcap.md.
…hase 3.9)

De-identification correctness (silent no-ops eliminated):
- EAV-aware blanking: blank the value cell of rows whose field_name
  matches (CSV and JSON EAV) instead of matching headers that don't exist.
- Checkbox-aware matching: a blank rule for a field also matches its
  field___code expansion columns (CSV and JSON).
- Label-header support: headers are translated back to field names via
  the data dictionary (incl. 'Label (choice=...)' checkbox headers).
- Anonymization audit in the manifest: per-rule match counts, with
  warnings for rules that matched no exported data.
- metadata.csv filtering now derives the exported field set correctly
  per mode (EAV field_name values, label translation, checkbox bases,
  record-ID field seeded for EAV).
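
The matching rules above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the function names `matchesRule`, `blankFlat` and `blankEAV` are hypothetical, and counting matched columns (flat) or matched rows (EAV) per rule is an assumption about what the manifest audit records.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesRule reports whether a flat-export column is covered by a rule for
// ruleField: the column itself, or a checkbox expansion column field___code.
func matchesRule(ruleField, column string) bool {
	return column == ruleField || strings.HasPrefix(column, ruleField+"___")
}

// blankFlat blanks matching columns in a flat table (rows[0] is the header)
// and returns, per rule, how many columns it matched.
func blankFlat(rows [][]string, rules []string) map[string]int {
	counts := make(map[string]int, len(rules))
	for _, r := range rules {
		counts[r] = 0
	}
	if len(rows) == 0 {
		return counts
	}
	for col, name := range rows[0] {
		for _, r := range rules {
			if !matchesRule(r, name) {
				continue
			}
			counts[r]++
			for _, row := range rows[1:] {
				if col < len(row) {
					row[col] = ""
				}
			}
		}
	}
	return counts
}

// blankEAV blanks the value cell of rows whose field_name cell equals a rule.
// fieldCol and valueCol are the column indexes of field_name and value.
func blankEAV(rows [][]string, fieldCol, valueCol int, rules []string) map[string]int {
	counts := make(map[string]int, len(rules))
	for _, r := range rules {
		counts[r] = 0
	}
	for _, row := range rows {
		if fieldCol >= len(row) || valueCol >= len(row) {
			continue
		}
		if _, ok := counts[row[fieldCol]]; ok {
			counts[row[fieldCol]]++
			row[valueCol] = ""
		}
	}
	return counts
}

func main() {
	flat := [][]string{
		{"record_id", "email", "symptoms___1", "symptoms___2"},
		{"1", "a@example.org", "1", "0"},
	}
	for rule, n := range blankFlat(flat, []string{"email", "symptoms", "phone"}) {
		if n == 0 {
			// A rule that matched nothing is surfaced instead of silently ignored.
			fmt.Printf("warning: rule %q matched no exported data\n", rule)
		}
	}
	fmt.Println(flat[1])
}
```

A rule with a zero count is reported rather than ignored, which is the point of the "silent no-ops eliminated" change.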

REDCap API fidelity (verified against PHPCap/REDCap.jl/PyCap/REDCapR):
- type is no longer sent to content=report (reports are always flat;
  type moved to record-only parameters).
- rawOrLabel 'both' normalized to raw (not a real API value).
- csvDelimiter/rawOrLabelHeaders only sent when applicable (CSV; flat).

Manifest provenance (decisions 2026-06-11):
- project id/title from content=project.
- file-upload fields documented as not-exported attachments.
- dictionary_fields_not_exported diff for unfiltered flat records
  exports (reveals token export-rights stripping).

Other:
- Bundle cache size cap: oversized bundles are rebuilt, not cached.
- redcap.md: review findings, research summary (API facts, landscape,
  metadata standards), resolved open questions, revised Phase 3.9-6 plan.

dev_build passed $(CUSTOMIZATIONS) to docker build unvalidated; with
STAGE=dev the env.dev path (./docker-volumes/integration/conf/customizations)
only exists after 'make init', so the frontend-builder COPY failed on a
fresh checkout. dev_build now falls back to ./conf/kul_customizations or
./conf/customizations exactly like the build target.
…dling, manifest redaction

- Per-variable transforms generalized: blank, drop (columns/rows/keys removed,
  also from metadata.csv), pseudonymize (hex HMAC-SHA256, researcher-managed
  base64 key, min 16 bytes, validated with actionable errors; empty cells stay
  empty).
- EAV exports: transforms on the record-ID field now also cover the record
  linking column (raw IDs no longer survive blanking/pseudonymization);
  dropping the record-ID field in EAV is rejected with guidance.
- Manifest: records filter redacted when the record-ID field is transformed,
  filterLogic redacted when it references transformed fields (both leaked the
  values the transforms removed); anonymization section reports hmac-sha256 +
  key fingerprint (SHA-256 of the key, first 16 hex) — never the key; key never
  logged; client-side drops excluded from dictionary_fields_not_exported.
- Cache key includes the pseudonymization key (hashed): different keys yield
  different bundles.
- Variables list: PHI-risk notes for notes/unvalidated-text fields
  (SelectItem.Note), identifier preselect resolved through checkbox base names;
  header fetch for the variables list is now always raw/comma (label-header and
  tab-delimiter exports previously produced rule names that never matched);
  transform rules now also match checkbox expansion columns by their own name.
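
As an illustration of the pseudonymize transform and the manifest fingerprint described above, here is a minimal sketch using only the Go standard library. The function names (`decodeKey`, `pseudonymize`, `fingerprint`) and the error wording are assumptions, not the plugin's actual identifiers; the behaviour follows the commit text: base64 key of at least 16 bytes, hex HMAC-SHA256 output, empty cells left empty, and a fingerprint made of the first 16 hex characters of the key's SHA-256.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"errors"
	"fmt"
)

const minKeyBytes = 16

// decodeKey validates a researcher-supplied base64 key. Error messages
// describe the problem without echoing any key material.
func decodeKey(b64 string) ([]byte, error) {
	key, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return nil, errors.New("pseudonymization key is not valid base64; generate one with: openssl rand -base64 32")
	}
	if len(key) < minKeyBytes {
		return nil, fmt.Errorf("pseudonymization key decodes to %d bytes; at least %d are required", len(key), minKeyBytes)
	}
	return key, nil
}

// pseudonymize returns the hex HMAC-SHA256 of value; empty cells stay empty.
func pseudonymize(key []byte, value string) string {
	if value == "" {
		return ""
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(value))
	return hex.EncodeToString(mac.Sum(nil))
}

// fingerprint identifies a key without revealing it: the first 16 hex
// characters of SHA-256(key).
func fingerprint(key []byte) string {
	sum := sha256.Sum256(key)
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	// 16 zero bytes, for demonstration only; never use a predictable key.
	key, err := decodeKey("AAAAAAAAAAAAAAAAAAAAAA==")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pseudonym:  ", pseudonymize(key, "record-1"))
	fmt.Println("fingerprint:", fingerprint(key))
}
```

Because the HMAC is deterministic for a given key, the same record ID maps to the same pseudonym across exports, while a different key yields a different bundle, consistent with the cache-key change above.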
…er-file mime plumbing

- tree.Node gains an optional mimeType attribute; threaded through both upload
  paths: native multipart add/replace (explicit part Content-Type instead of
  CreateFormFile's octet-stream) and direct-upload /addFiles jsonData. Files
  without an explicit mime keep today's destination-side detection.
- Every redcap2 export bundle now includes, generated from one normalized
  model over the final bundle (no toggles; deselectable per file in compare):
  - croissant.json (Croissant 1.0, canonical context, FileObject distribution
    with md5, CSV RecordSet fields with schema.org data types; recordSet
    omitted for JSON exports) as application/ld+json — previewable via the new
    generic JSON-LD external tool;
  - ro-crate-metadata.json (RO-Crate 1.2, detached crate, Process Run Crate
    provenance: CreateAction + plugin/REDCap SoftwareApplication entities) with
    the profile mime Dataverse 6.3+ detects and the RO-Crate previewer
    registers for;
  - ddi-cdi.jsonld (DDI-CDI 1.0 JSON-LD mirroring the in-repo
    cdi_generator_jsonld.py structure: WideDataSet/WideDataStructure/
    LogicalRecord, InstanceVariables with substantive value domains, CodeLists
    from REDCap choices, PrimaryKey on the record-ID, PhysicalSegmentLayout
    for CSV) with the profile mime the deployed cdi previewer registers for;
  - project_metadata.xml (CDISC ODM, returnMetadataOnly) — failure-tolerant.
- Variables reflect the post-transform data file: dropped columns are absent,
  transforms are noted in descriptions, key fingerprint in dataset description.
- Manifest lists the sidecars under files; conf gains
  12-jsonld-previewer.json (cdi-viewer registered for bare application/ld+json).
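
The difference between `CreateFormFile` and an explicit part Content-Type can be sketched as below. This is an illustrative example, not the plugin's upload code: `buildUpload`, `firstPartContentType` and the form field name `file` are hypothetical, and only standard-library calls are used.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/textproto"
)

// buildUpload writes one file part. With an empty mimeType it uses
// CreateFormFile, which always labels the part application/octet-stream;
// otherwise it builds the part header by hand to carry the explicit type.
func buildUpload(filename, mimeType string, content []byte) ([]byte, string, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	var (
		part io.Writer
		err  error
	)
	if mimeType == "" {
		part, err = w.CreateFormFile("file", filename)
	} else {
		h := make(textproto.MIMEHeader)
		h.Set("Content-Disposition", fmt.Sprintf(`form-data; name="file"; filename=%q`, filename))
		h.Set("Content-Type", mimeType)
		part, err = w.CreatePart(h)
	}
	if err != nil {
		return nil, "", err
	}
	if _, err = part.Write(content); err != nil {
		return nil, "", err
	}
	if err = w.Close(); err != nil {
		return nil, "", err
	}
	return buf.Bytes(), w.Boundary(), nil
}

// firstPartContentType parses the body back and returns the Content-Type
// header of its first part.
func firstPartContentType(body []byte, boundary string) (string, error) {
	r := multipart.NewReader(bytes.NewReader(body), boundary)
	p, err := r.NextPart()
	if err != nil {
		return "", err
	}
	return p.Header.Get("Content-Type"), nil
}

func main() {
	for _, mimeType := range []string{"", "application/ld+json"} {
		body, boundary, err := buildUpload("croissant.json", mimeType, []byte("{}"))
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		ct, err := firstPartContentType(body, boundary)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("requested %q, part carries %q\n", mimeType, ct)
	}
}
```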
Maps REDCap project info onto the citation block used when creating a new
dataset: title <- project_title, description <- project_notes + purpose_other,
author <- principal investigator, grant number, IRB number as OtherId, and a
urn:redcap project reference. The generic metadata-selector frontend flow
picks this up without changes.
- REDCAP_INTEGRATION.md: end-user guide covering export modes, anonymization
  modes, pseudonymization key generation (openssl rand -base64 32) and
  management caveats, generated files, sidecar previewers, citation prefill,
  manifest reference, and limitations (free-text PHI disclaimer).
- redcap.md: Phases 3.9/4/5 marked completed with implementation notes;
  Phase 6 in progress; decision revisions recorded (no sidecar toggles, ODM
  always generated, researcher-managed keys); file layout updated.
- README.md: REDCap feature section + doc table link.

- options.redcapHttpTimeout (Go duration string, default 5m) in the backend
  config bounds REDCap API requests; invalid values fall back with a warning.
- Performance: benchmarks for flat/EAV transforms and sidecar generation
  (flat ~150 MB/s with pseudonymize+blank+drop; EAV ~79 MB/s after memoizing
  record-column HMACs, was ~34 MB/s; sidecars ~7.5 ms per 500-variable
  dictionary); removed the per-file payload copy in Streams (bundle contents
  are immutable), halving peak memory while streaming.
- Security review recorded in redcap.md: key path verified end to end
  (in-memory frontend state, queued-job-only Redis residency, never logged or
  echoed, fingerprint-only in manifest, MD5-input-only in cache key); redcap2
  client verifies TLS (does not inherit the global DefaultTransport skip);
  key-validation errors tested to never echo key material; accepted risks
  documented (job payloads in Redis like all plugin tokens, app-wide
  DefaultTransport skip flagged for separate review).
- Docs: operator configuration section in REDCAP_INTEGRATION.md; Phase 6
  statuses updated — only the pilot re-test remains open.
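
The timeout option above could be handled along these lines. This is a sketch under stated assumptions: the helper name `redcapTimeout` is hypothetical, and treating zero or negative durations as invalid is an assumption not stated in the commit message.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

const defaultREDCapTimeout = 5 * time.Minute

// redcapTimeout parses a Go duration string such as "90s" or "10m".
// Empty input yields the default; invalid input yields the default and
// logs a warning. Rejecting non-positive values is an assumption here.
func redcapTimeout(raw string) time.Duration {
	if raw == "" {
		return defaultREDCapTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		log.Printf("invalid redcapHttpTimeout %q, falling back to %s", raw, defaultREDCapTimeout)
		return defaultREDCapTimeout
	}
	return d
}

func main() {
	// The parsed value bounds every request made through this client.
	client := &http.Client{Timeout: redcapTimeout("90s")}
	fmt.Println(client.Timeout)
	fmt.Println(redcapTimeout("five minutes"))
}
```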
…L conformance

- croissant.json and ro-crate-metadata.json gain schema.org variableMeasured
  following the CDIF 1.1 Discovery-profile shape: PropertyValue per data
  column with name, description (label + anonymization note), alternateName,
  numeric minValue/maxValue (new text_validation_min/max dictionary parsing),
  and code lists as valueReference DefinedTerms (termCode = the value in the
  data). Inline entries in Croissant (its @vocab is schema.org); flattened
  contextual entities in RO-Crate as its spec requires. Verified with
  mlcroissant 1.1.0: exits clean (the embedded Croissant context also gained
  the missing official equivalentProperty/samplingRate keys).
- ddi-cdi.jsonld code lists restructured per the official DDI-CDI 1.0 SHACL
  shapes (bundled with the cdi-viewer previewer), fixing the 'Less than 1
  values' violations: each Code now uses a Notation whose TypedString content
  is the value as it appears in the data and denotes a Category holding the
  label; CodeList gets allowsDuplicates + ObjectName name; the PrimaryKey is
  reachable from the WideDataStructure (has_PrimaryKey) and its component
  uses the full correspondsTo_DataStructureComponent term. Verified with
  pyshacl against libis/cdi-viewer shapes/ddi-cdi-official.ttl: Conforms=True
  (was 13 violations).
- datePublished/endTime omitted when only the missing-timestamp sentinel is
  available (must be ISO 8601).
- New env-gated TestDumpSidecarsForValidation writes sample sidecars for
  external validation runs (pyshacl, mlcroissant); docs updated.

Full CDIF 1.1 Data Description (double-typed cdi:InstanceVariable + skos
code lists) deferred until the profile leaves review.

The generated file referenced the hosted DDI-CDI context by URL. That copy
is currently invalid JSON upstream (stray git conflict markers) and the
cdi-viewer's local fallback 404s on the deployed site, so the viewer falls
back to an EMPTY context: every compact property key (name, denotes,
content, ...) is silently dropped during JSON-LD expansion and SHACL
validation reports mass 'less than 1 values' violations on documents whose
content is actually correct.

ddi-cdi.jsonld now embeds a minimal inline context covering exactly the
terms the generator emits (class-scoped IRIs copied from the official
context, JSON-LD 1.1 type-scoped). Nothing remote to fetch, nothing to
break. Verified: parsing with zero network access yields the full triple
set and pyshacl reports Conforms=True against the exact shapes the deployed
viewer loads.
…onld"

The generated file references the canonical published DDI-CDI context URL
again, like the rest of the ecosystem. The validation failures were caused
by the viewer's broken context-fallback path (fixed in cdi-viewer 04547d7)
combined with the upstream hosted context being temporarily invalid JSON —
not by the generated documents. Generators should follow the standard's
regular practice rather than work around consumer bugs.
…ddialliance.org)

The previously referenced ddi-cdi.github.io/m2t-ng URL is a build-tooling
Pages artifact, not a release, and currently serves invalid JSON with
unresolved merge-conflict markers. The DDI Alliance documentation site
hosts the valid released encoding; generated ddi-cdi.jsonld validated
end-to-end against the official SHACL shapes with this context fetched
live (Conforms=True).
…ssant.json

The cdi-viewer renders flattened @context+@graph documents (the DDI-CDI
shape); croissant.json is a single nested node with no @graph, so the
registered previewer showed an empty view or a missing-@graph error.
Remove conf/dataverse/external-tools/12-jsonld-previewer.json and document
that croissant.json has no preview until a Croissant-capable previewer
exists. The croissant mime stays application/ld+json (accurate, and no
tool registration conflicts with it).
…ry fields

- croissant.json mime: application/ld+json; profile="http://mlcommons.org/croissant/1.0"
  (RFC 6906 profile, mirroring the RO-Crate/DDI-CDI conventions) so a
  Croissant-specific previewer registration can match it exactly.
- New conf/dataverse/external-tools/12-croissant-previewer.json: opens the
  cdi-viewer with ?shacl=croissant to preload the Croissant SHACL shapes
  (bundled with the viewer; the flatten fix there makes single-node
  documents renderable).
- croissant.json and ro-crate-metadata.json gain identifier and
  dateModified — mandatory in the CDIF 1.1 Discovery profile (gaps found by
  validating against the CDIF core shapes; mlcroissant remains clean).
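
To illustrate why the profile parameter matters for previewer matching, the sketch below splits a content type into its base type and `profile` parameter with the standard `mime` package. The helper name `splitProfile` is hypothetical; the content type string is the one quoted in the commit above.

```go
package main

import (
	"fmt"
	"mime"
)

// splitProfile returns the base media type and the value of the profile
// parameter (empty when the parameter is absent).
func splitProfile(contentType string) (string, string, error) {
	base, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", "", err
	}
	return base, params["profile"], nil
}

func main() {
	for _, ct := range []string{
		`application/ld+json; profile="http://mlcommons.org/croissant/1.0"`,
		`application/ld+json`,
	} {
		base, profile, err := splitProfile(ct)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("base=%q profile=%q\n", base, profile)
	}
}
```

Both strings share the base type `application/ld+json`; only the profile distinguishes a Croissant file from any other JSON-LD document, which is what lets a registration target it exactly.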
The cdi-viewer now auto-selects shapes from document content, so the
?shacl=croissant query string (which Dataverse's naive toolUrl+'?'+params
concatenation mangled into a double question mark) is no longer needed.
?shacl= remains available as an explicit override.

The croissant root Dataset had no @id and was therefore a blank node in
RDF. Blank-node labels are relabeled on every serialization, so SHACL
results on the root (e.g. the license recommendation) could never link
back to the rendered node in RDF-based viewers — and CDIF Discovery wants
an identifiable metadata subject anyway. mlcroissant and the croissant
SHACL shapes remain clean.

cdi_generator_jsonld.py switched from the m2t-ng build artifact (currently
invalid JSON upstream) to the released encoding on docs.ddialliance.org;
update the three doc references to match.

ro-crate-metadata.json files were uploaded with the RO-Crate profile mime
but the integration's external-tools conf (used by local/dev setups via
dataverse/setup.sh) had no matching previewer registration — only the
deployment repo did. Same gdcc v1.5 ROCrate previewer and the exact
contentType the redcap2 plugin emits (and Dataverse 6.3+ detects by
filename).

These Copilot-generated workflows (PR governance/code-review/unit-test
runners, /gov ChatOps, and the reusable governance workflow) triggered
flaky runs on pull requests. Remove the entire suite so PRs run clean.

The governance workflows it referenced were removed in the prior commit.
governance/, policies/ and .github/workflows/ no longer exist on this branch.