…t implementation status and export mode details
…ndle generation Covers option parsing/normalization, report-vs-records API parameter routing, blank anonymization for CSV and JSON exports, virtual node generation, hash determinism, bundle caching, and the variables/Options flow against a fake in-memory REDCap server (~91% statement coverage). Marks the last Phase 3 item done in redcap.md.
…hase 3.9)
De-identification correctness (silent no-ops eliminated):
- EAV-aware blanking: blank the value cell of rows whose field_name matches (CSV and JSON EAV) instead of matching headers that don't exist.
- Checkbox-aware matching: a blank rule for a field also matches its field___code expansion columns (CSV and JSON).
- Label-header support: headers are translated back to field names via the data dictionary (incl. 'Label (choice=...)' checkbox headers).
- Anonymization audit in the manifest: per-rule match counts, with warnings for rules that matched no exported data.
- metadata.csv filtering now derives the exported field set correctly per mode (EAV field_name values, label translation, checkbox bases, record-ID field seeded for EAV).
REDCap API fidelity (verified against PHPCap/REDCap.jl/PyCap/REDCapR):
- type is no longer sent to content=report (reports are always flat; type moved to record-only parameters).
- rawOrLabel 'both' normalized to raw (not a real API value).
- csvDelimiter/rawOrLabelHeaders only sent when applicable (CSV; flat).
Manifest provenance (decisions 2026-06-11):
- project id/title from content=project.
- file-upload fields documented as not-exported attachments.
- dictionary_fields_not_exported diff for unfiltered flat records exports (reveals token export-rights stripping).
Other:
- Bundle cache size cap: oversized bundles are rebuilt, not cached.
- redcap.md: review findings, research summary (API facts, landscape, metadata standards), resolved open questions, revised Phase 3.9-6 plan.
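The checkbox-aware matching and the per-rule audit counts described above can be sketched as follows. This is a minimal illustration, not the plugin's code: the function names (`baseField`, `ruleMatches`, `blankColumns`) are hypothetical, and it assumes a flat CSV-like table held in memory.

```go
package main

import (
	"fmt"
	"strings"
)

// checkboxSep separates a checkbox field name from its choice code in
// flat exports, e.g. "symptoms___1".
const checkboxSep = "___"

// baseField returns the part before "___" for checkbox expansion columns,
// or the column name itself otherwise. (Hypothetical helper.)
func baseField(column string) string {
	if i := strings.Index(column, checkboxSep); i > 0 {
		return column[:i]
	}
	return column
}

// ruleMatches reports whether a rule written for ruleField applies to column:
// either the names are equal, or column is an expansion of ruleField.
func ruleMatches(ruleField, column string) bool {
	return column == ruleField || baseField(column) == ruleField
}

// blankColumns empties every cell in columns matched by a rule and returns
// how many columns each rule matched, so zero-match rules can be reported.
func blankColumns(header []string, rows [][]string, rules []string) map[string]int {
	counts := make(map[string]int, len(rules))
	for _, r := range rules {
		counts[r] = 0
	}
	for ci, col := range header {
		for _, r := range rules {
			if !ruleMatches(r, col) {
				continue
			}
			counts[r]++
			for _, row := range rows {
				if ci < len(row) {
					row[ci] = ""
				}
			}
		}
	}
	return counts
}

func main() {
	header := []string{"record_id", "symptoms___1", "symptoms___2", "age"}
	rows := [][]string{{"1", "1", "0", "34"}}
	counts := blankColumns(header, rows, []string{"symptoms", "ssn"})
	fmt.Println(rows)
	for rule, n := range counts {
		if n == 0 {
			fmt.Printf("warning: rule %q matched no exported data\n", rule)
		}
	}
}
```

Returning the counts rather than logging inside the loop keeps the audit data available for a manifest section; how the real manifest stores it is not shown here.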
dev_build passed $(CUSTOMIZATIONS) to docker build unvalidated; with STAGE=dev the env.dev path (./docker-volumes/integration/conf/customizations) only exists after 'make init', so the frontend-builder COPY failed on a fresh checkout. dev_build now falls back to ./conf/kul_customizations or ./conf/customizations exactly like the build target.
…dling, manifest redaction
- Per-variable transforms generalized: blank, drop (columns/rows/keys removed, also from metadata.csv), pseudonymize (hex HMAC-SHA256, researcher-managed base64 key, min 16 bytes, validated with actionable errors; empty cells stay empty).
- EAV exports: transforms on the record-ID field now also cover the record linking column (raw IDs no longer survive blanking/pseudonymization); dropping the record-ID field in EAV is rejected with guidance.
- Manifest: records filter redacted when the record-ID field is transformed, filterLogic redacted when it references transformed fields (both leaked the values the transforms removed); anonymization section reports hmac-sha256 + key fingerprint (SHA-256 of the key, first 16 hex), never the key; key never logged; client-side drops excluded from dictionary_fields_not_exported.
- Cache key includes the pseudonymization key (hashed): different keys yield different bundles.
- Variables list: PHI-risk notes for notes/unvalidated-text fields (SelectItem.Note), identifier preselect resolved through checkbox base names; header fetch for the variables list is now always raw/comma (label-header and tab-delimiter exports previously produced rule names that never matched); transform rules now also match checkbox expansion columns by their own name.
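A minimal sketch of the pseudonymize transform and the key fingerprint as described above (hex HMAC-SHA256, base64 key of at least 16 bytes, empty cells left empty, fingerprint = first 16 hex characters of the key's SHA-256). The function names and error wording are assumptions for illustration; only the standard library is used.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"errors"
	"fmt"
)

// minKeyBytes is the minimum decoded key length stated in the commit message.
const minKeyBytes = 16

// decodeKey validates a base64 key. Error messages deliberately never
// include the key material itself. (Hypothetical helper.)
func decodeKey(b64 string) ([]byte, error) {
	key, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return nil, errors.New("pseudonymization key is not valid base64; generate one with: openssl rand -base64 32")
	}
	if len(key) < minKeyBytes {
		return nil, fmt.Errorf("pseudonymization key too short: %d bytes decoded, need at least %d", len(key), minKeyBytes)
	}
	return key, nil
}

// pseudonymize returns the hex HMAC-SHA256 of value; empty cells stay empty.
func pseudonymize(key []byte, value string) string {
	if value == "" {
		return ""
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(value))
	return hex.EncodeToString(mac.Sum(nil))
}

// fingerprint returns the first 16 hex characters of SHA-256(key), which is
// safe to record in a manifest in place of the key.
func fingerprint(key []byte) string {
	sum := sha256.Sum256(key)
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	// Example key only: base64 of the 16 ASCII bytes "0123456789abcdef".
	key, err := decodeKey("MDEyMzQ1Njc4OWFiY2RlZg==")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pseudonym:  ", pseudonymize(key, "1001"))
	fmt.Println("fingerprint:", fingerprint(key))
}
```

Because the HMAC is deterministic for a given key, the same record ID maps to the same pseudonym across files in a bundle, while a different key yields unrelated pseudonyms.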
…er-file mime plumbing
- tree.Node gains an optional mimeType attribute; threaded through both upload
  paths: native multipart add/replace (explicit part Content-Type instead of
  CreateFormFile's octet-stream) and direct-upload /addFiles jsonData. Files
  without an explicit mime keep today's destination-side detection.
- Every redcap2 export bundle now includes, generated from one normalized
  model over the final bundle (no toggles; deselectable per file in compare):
  - croissant.json (Croissant 1.0, canonical context, FileObject distribution
    with md5, CSV RecordSet fields with schema.org data types; recordSet
    omitted for JSON exports) as application/ld+json, previewable via the new
    generic JSON-LD external tool;
  - ro-crate-metadata.json (RO-Crate 1.2, detached crate, Process Run Crate
    provenance: CreateAction + plugin/REDCap SoftwareApplication entities) with
    the profile mime Dataverse 6.3+ detects and the RO-Crate previewer
    registers for;
  - ddi-cdi.jsonld (DDI-CDI 1.0 JSON-LD mirroring the in-repo
    cdi_generator_jsonld.py structure: WideDataSet/WideDataStructure/
    LogicalRecord, InstanceVariables with substantive value domains, CodeLists
    from REDCap choices, PrimaryKey on the record-ID, PhysicalSegmentLayout
    for CSV) with the profile mime the deployed cdi previewer registers for;
  - project_metadata.xml (CDISC ODM, returnMetadataOnly), failure-tolerant.
- Variables reflect the post-transform data file: dropped columns are absent,
  transforms are noted in descriptions, key fingerprint in dataset description.
- Manifest lists the sidecars under files; conf gains
  12-jsonld-previewer.json (cdi-viewer registered for bare application/ld+json).
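The first bullet's multipart change (an explicit part Content-Type instead of the `application/octet-stream` that `CreateFormFile` hardcodes) can be sketched like this. `createFilePart` and `roundTripContentType` are hypothetical names, and the form field name `file` is an assumption; the real upload code is not shown in this PR text.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime"
	"mime/multipart"
	"net/textproto"
)

// createFilePart opens a file part. With an empty mimeType it falls back to
// CreateFormFile (application/octet-stream), leaving detection to the
// destination; otherwise it sets the part's Content-Type explicitly.
func createFilePart(w *multipart.Writer, fieldName, fileName, mimeType string) (io.Writer, error) {
	if mimeType == "" {
		return w.CreateFormFile(fieldName, fileName)
	}
	h := make(textproto.MIMEHeader)
	h.Set("Content-Disposition", mime.FormatMediaType("form-data", map[string]string{
		"name":     fieldName,
		"filename": fileName,
	}))
	h.Set("Content-Type", mimeType)
	return w.CreatePart(h)
}

// roundTripContentType writes one part and reads it back, returning the
// Content-Type a receiver would see for that part.
func roundTripContentType(fileName, mimeType string) (string, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, err := createFilePart(w, "file", fileName, mimeType)
	if err != nil {
		return "", err
	}
	if _, err := part.Write([]byte("{}")); err != nil {
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}
	p, err := multipart.NewReader(&buf, w.Boundary()).NextPart()
	if err != nil {
		return "", err
	}
	return p.Header.Get("Content-Type"), nil
}

func main() {
	for _, mt := range []string{"application/ld+json", ""} {
		ct, err := roundTripContentType("croissant.json", mt)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("requested %q, part carries %q\n", mt, ct)
	}
}
```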
Maps REDCap project info onto the citation block used when creating a new dataset: title <- project_title, description <- project_notes + purpose_other, author <- principal investigator, grant number, IRB number as OtherId, and a urn:redcap project reference. The generic metadata-selector frontend flow picks this up without changes.
- REDCAP_INTEGRATION.md: end-user guide covering export modes, anonymization modes, pseudonymization key generation (openssl rand -base64 32) and management caveats, generated files, sidecar previewers, citation prefill, manifest reference, and limitations (free-text PHI disclaimer).
- redcap.md: Phases 3.9/4/5 marked completed with implementation notes; Phase 6 in progress; decision revisions recorded (no sidecar toggles, ODM always generated, researcher-managed keys); file layout updated.
- README.md: REDCap feature section + doc table link.
- options.redcapHttpTimeout (Go duration string, default 5m) in the backend config bounds REDCap API requests; invalid values fall back with a warning.
- Performance: benchmarks for flat/EAV transforms and sidecar generation (flat ~150 MB/s with pseudonymize+blank+drop; EAV ~79 MB/s after memoizing record-column HMACs, was ~34 MB/s; sidecars ~7.5 ms per 500-variable dictionary); removed the per-file payload copy in Streams (bundle contents are immutable), halving peak memory while streaming.
- Security review recorded in redcap.md: key path verified end to end (in-memory frontend state, queued-job-only Redis residency, never logged or echoed, fingerprint-only in manifest, MD5-input-only in cache key); redcap2 client verifies TLS (does not inherit the global DefaultTransport skip); key-validation errors tested to never echo key material; accepted risks documented (job payloads in Redis like all plugin tokens, app-wide DefaultTransport skip flagged for separate review).
- Docs: operator configuration section in REDCAP_INTEGRATION.md; Phase 6 statuses updated; only the pilot re-test remains open.
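The timeout handling in the first bullet (Go duration string, default 5m, invalid values fall back with a warning) might look roughly like the sketch below. `parseTimeout` and `newClient` are hypothetical names, and treating non-positive durations as invalid is an assumption not stated in the commit message.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// defaultTimeout mirrors the documented default of 5m.
const defaultTimeout = 5 * time.Minute

// parseTimeout parses a Go duration string such as "90s" or "10m".
// Empty input yields the default silently; invalid or non-positive input
// (the latter is an assumption) yields the default with a warning.
func parseTimeout(raw string) time.Duration {
	if raw == "" {
		return defaultTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		log.Printf("warning: invalid redcapHttpTimeout %q, using default %s", raw, defaultTimeout)
		return defaultTimeout
	}
	return d
}

// newClient returns an HTTP client whose total request time is bounded by
// the configured timeout.
func newClient(raw string) *http.Client {
	return &http.Client{Timeout: parseTimeout(raw)}
}

func main() {
	for _, raw := range []string{"", "90s", "ten minutes"} {
		fmt.Printf("%q -> %s\n", raw, newClient(raw).Timeout)
	}
}
```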
…L conformance
- croissant.json and ro-crate-metadata.json gain schema.org variableMeasured following the CDIF 1.1 Discovery-profile shape: PropertyValue per data column with name, description (label + anonymization note), alternateName, numeric minValue/maxValue (new text_validation_min/max dictionary parsing), and code lists as valueReference DefinedTerms (termCode = the value in the data). Inline entries in Croissant (its @vocab is schema.org); flattened contextual entities in RO-Crate as its spec requires. Verified with mlcroissant 1.1.0: exits clean (the embedded Croissant context also gained the missing official equivalentProperty/samplingRate keys).
- ddi-cdi.jsonld code lists restructured per the official DDI-CDI 1.0 SHACL shapes (bundled with the cdi-viewer previewer), fixing the 'Less than 1 values' violations: each Code now uses a Notation whose TypedString content is the value as it appears in the data and denotes a Category holding the label; CodeList gets allowsDuplicates + ObjectName name; the PrimaryKey is reachable from the WideDataStructure (has_PrimaryKey) and its component uses the full correspondsTo_DataStructureComponent term. Verified with pyshacl against libis/cdi-viewer shapes/ddi-cdi-official.ttl: Conforms=True (was 13 violations).
- datePublished/endTime omitted when only the missing-timestamp sentinel is available (must be ISO 8601).
- New env-gated TestDumpSidecarsForValidation writes sample sidecars for external validation runs (pyshacl, mlcroissant); docs updated.
Full CDIF 1.1 Data Description (double-typed cdi:InstanceVariable + skos code lists) deferred until the profile leaves review.
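To make the variableMeasured shape in the first bullet concrete, here is a rough sketch of one inline entry as it could appear in croissant.json. The Go types and the `newVariable` helper are hypothetical; the property names come from the commit message, but what the generator puts in alternateName is not stated there, so that field is left unset.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DefinedTerm is one code-list entry; termCode is the value as it appears
// in the data, name is its label.
type DefinedTerm struct {
	Type     string `json:"@type"`
	Name     string `json:"name"`
	TermCode string `json:"termCode"`
}

// PropertyValue describes one data column.
type PropertyValue struct {
	Type           string        `json:"@type"`
	Name           string        `json:"name"`
	Description    string        `json:"description,omitempty"`
	AlternateName  string        `json:"alternateName,omitempty"` // content not specified in the source
	MinValue       *float64      `json:"minValue,omitempty"`
	MaxValue       *float64      `json:"maxValue,omitempty"`
	ValueReference []DefinedTerm `json:"valueReference,omitempty"`
}

// newVariable builds an entry from a field name, its label, and ordered
// (code, label) choice pairs. (Hypothetical helper.)
func newVariable(name, label string, choices [][2]string) PropertyValue {
	pv := PropertyValue{Type: "PropertyValue", Name: name, Description: label}
	for _, c := range choices {
		pv.ValueReference = append(pv.ValueReference, DefinedTerm{
			Type: "DefinedTerm", TermCode: c[0], Name: c[1],
		})
	}
	return pv
}

// encode serializes an entry to compact JSON.
func encode(pv PropertyValue) (string, error) {
	b, err := json.Marshal(pv)
	return string(b), err
}

func main() {
	sex := newVariable("sex", "Sex", [][2]string{{"1", "Female"}, {"2", "Male"}})
	lo, hi := 0.0, 120.0
	age := newVariable("age", "Age in years", nil)
	age.MinValue, age.MaxValue = &lo, &hi
	for _, pv := range []PropertyValue{sex, age} {
		s, err := encode(pv)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(s)
	}
}
```

Pointer fields for minValue/maxValue let a legitimate bound of 0 be emitted while an absent bound is omitted.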
The generated file referenced the hosted DDI-CDI context by URL. That copy is currently invalid JSON upstream (stray git conflict markers) and the cdi-viewer's local fallback 404s on the deployed site, so the viewer falls back to an EMPTY context: every compact property key (name, denotes, content, ...) is silently dropped during JSON-LD expansion and SHACL validation reports mass 'less than 1 values' violations on documents whose content is actually correct. ddi-cdi.jsonld now embeds a minimal inline context covering exactly the terms the generator emits (class-scoped IRIs copied from the official context, JSON-LD 1.1 type-scoped). Nothing remote to fetch, nothing to break. Verified: parsing with zero network access yields the full triple set and pyshacl reports Conforms=True against the exact shapes the deployed viewer loads.
…onld" The generated file references the canonical published DDI-CDI context URL again, like the rest of the ecosystem. The validation failures were caused by the viewer's broken context-fallback path (fixed in cdi-viewer 04547d7) combined with the upstream hosted context being temporarily invalid JSON — not by the generated documents. Generators should follow the standard's regular practice rather than work around consumer bugs.
…ddialliance.org) The previously referenced ddi-cdi.github.io/m2t-ng URL is a build-tooling Pages artifact, not a release, and currently serves invalid JSON with unresolved merge-conflict markers. The DDI Alliance documentation site hosts the valid released encoding; generated ddi-cdi.jsonld validated end-to-end against the official SHACL shapes with this context fetched live (Conforms=True).
…ssant.json The cdi-viewer renders flattened @context+@graph documents (the DDI-CDI shape); croissant.json is a single nested node with no @graph, so the registered previewer showed an empty view or a missing-@graph error. Remove conf/dataverse/external-tools/12-jsonld-previewer.json and document that croissant.json has no preview until a Croissant-capable previewer exists. The croissant mime stays application/ld+json (accurate, and no tool registration conflicts with it).
…ry fields
- croissant.json mime: application/ld+json; profile="http://mlcommons.org/croissant/1.0" (RFC 6906 profile, mirroring the RO-Crate/DDI-CDI conventions) so a Croissant-specific previewer registration can match it exactly.
- New conf/dataverse/external-tools/12-croissant-previewer.json: opens the cdi-viewer with ?shacl=croissant to preload the Croissant SHACL shapes (bundled with the viewer; the flatten fix there makes single-node documents renderable).
- croissant.json and ro-crate-metadata.json gain identifier and dateModified, both mandatory in the CDIF 1.1 Discovery profile (gaps found by validating against the CDIF core shapes; mlcroissant remains clean).
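As an illustration of the profiled mime type in the first bullet, the sketch below builds and parses such a value with Go's standard `mime` package. `profileMime` and `profileOf` are hypothetical helpers; whether the plugin builds the string this way or uses a constant is not stated.

```go
package main

import (
	"fmt"
	"mime"
)

// croissantProfile is the profile URI given in the commit message.
const croissantProfile = "http://mlcommons.org/croissant/1.0"

// profileMime formats application/ld+json with a profile parameter.
// The URI contains ':' and '/', so the value is emitted as a quoted string.
func profileMime(profile string) string {
	return mime.FormatMediaType("application/ld+json", map[string]string{"profile": profile})
}

// profileOf splits a content type into its base type and profile parameter
// (empty when the type carries no profile).
func profileOf(contentType string) (base, profile string, err error) {
	base, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", "", err
	}
	return base, params["profile"], nil
}

func main() {
	ct := profileMime(croissantProfile)
	fmt.Println(ct)
	for _, in := range []string{ct, "application/ld+json"} {
		base, profile, err := profileOf(in)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("base=%s profile=%q\n", base, profile)
	}
}
```

A registration keyed on the full string matches only the profiled type, while the bare `application/ld+json` stays distinguishable from it.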
The cdi-viewer now auto-selects shapes from document content, so the ?shacl=croissant query string (which Dataverse's naive toolUrl+'?'+params concatenation mangled into a double question mark) is no longer needed. ?shacl= remains available as an explicit override.
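The double question mark is easy to reproduce. The sketch below contrasts plain string concatenation with merging query parameters through `net/url`; it only illustrates the failure mode in Go and is not Dataverse's code. The viewer URL and the `fileid` parameter are made-up examples.

```go
package main

import (
	"fmt"
	"net/url"
)

// naiveJoin mimics toolUrl + "?" + params: a tool URL that already has a
// query string ends up with two question marks.
func naiveJoin(toolURL, params string) string {
	return toolURL + "?" + params
}

// mergeQuery parses the tool URL and merges the extra parameters into its
// existing query instead of appending a second "?".
func mergeQuery(toolURL, params string) (string, error) {
	u, err := url.Parse(toolURL)
	if err != nil {
		return "", err
	}
	extra, err := url.ParseQuery(params)
	if err != nil {
		return "", err
	}
	q := u.Query()
	for k, vs := range extra {
		for _, v := range vs {
			q.Add(k, v)
		}
	}
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	tool := "https://viewer.example.org/?shacl=croissant" // hypothetical URL
	fmt.Println(naiveJoin(tool, "fileid=42"))
	merged, err := mergeQuery(tool, "fileid=42")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(merged)
}
```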
The croissant root Dataset had no @id and was therefore a blank node in RDF. Blank-node labels are relabeled on every serialization, so SHACL results on the root (e.g. the license recommendation) could never link back to the rendered node in RDF-based viewers — and CDIF Discovery wants an identifiable metadata subject anyway. mlcroissant and the croissant SHACL shapes remain clean.
cdi_generator_jsonld.py switched from the m2t-ng build artifact (currently invalid JSON upstream) to the released encoding on docs.ddialliance.org; update the three doc references to match.
ro-crate-metadata.json files were uploaded with the RO-Crate profile mime, but the integration's external-tools conf (used by local/dev setups via dataverse/setup.sh) had no matching previewer registration; only the deployment repo did. The registration uses the same gdcc v1.5 ROCrate previewer and the exact contentType the redcap2 plugin emits (and Dataverse 6.3+ detects by filename).
These Copilot-generated workflows (PR governance/code-review/unit-test runners, /gov ChatOps, and the reusable governance workflow) triggered flaky runs on pull requests. Remove the entire suite so PRs run clean.
The governance workflows it referenced were removed in the prior commit.
governance/, policies/ and .github/workflows/ no longer exist on this branch.
@AI-Tool: Claude (Anthropic)
Summary
Metadata() hook to prefill Dataverse citation metadata (Phase 5); configurable export timeout, performance pass and security review (Phase 6).
ddi-cdi.md).
@ai-generated AI-governance GitHub Actions workflows (PR governance / code-review / unit-test runners, /gov ChatOps, and the reusable governance workflow) so PRs run clean; see the ci: commit.
AI Provenance (required for AI-assisted changes)
redcap.md design doc
Compliance checklist
Change-type specifics
Tests & Risk