Shared helpers, templates, and operating-model documentation for the UNICEF Chief Statistician Office (CSO), within the Office of Strategy and Evidence (OSE). One IO + API contract, one mode contract, three implementations (R · Python · Stata), one vendoring model.
- One way to read, write, compare, and merge data — auto-dispatched
by file extension; writes emit a
.provenance.jsonsidecar by default (sha256, schema, user, timestamp, metadata; opt out withprovenance = FALSE;.RData/.rdawrites skip the sidecar). - One way to hit external APIs — UIS, SDMX, World Bank, ILO, UNSD-SDG, GitHub-raw — with a deposit cache and a reviewer-mode lockout that physically prevents network calls.
- Producer / reviewer is a session-level mode, not a per-call argument; the contract is enforced at every wrapped call site, not by convention.
- Same behaviour across R, Python, and Stata; vendored into consumer repos via a 1-line manifest pin, not installed over the network.
- Motivation
- Architecture at a glance
- Three roles, one contract
- What's inside — by capability
- Quick start
- Install / vendor
- Versioning and roadmap
- Testing and CI
- License and citation
cso-toolkit exists to facilitate the reproducibility and scalability
of analytics developed by the UNICEF Chief Statistician Office (CSO),
within the Office of Strategy and Evidence (OSE). Concretely, it does
three things:
-
Encodes a single IO + API contract. One way to read, write, compare, and merge data; one way to hit external APIs. Every call routes through wrappers that enforce provenance sidecars, uniqueness checks, and the producer / reviewer mode contract — so any analytics product can be rerun by someone other than its original author and yield the same numbers.
-
Separates producer and reviewer mode at the session level. The Database Manager (producer) pulls live APIs and deposits canonical artefacts; the reviewer reruns from those frozen artefacts and is physically prevented from touching the network. The contract is enforced by the toolkit at every wrapped call site, not by convention.
-
Scales across sectors and projects. The same helpers, the same templates, and the same audit functions are vendored into every sector codebase under the CSO, which means new sectors and new projects inherit the reproducibility floor for free instead of re-inventing it.
Why this matters in practice: a 2026 publication that cites
dw_<sector>.csv must, two years later, produce identical numbers
when re-run by a reviewer who has only the canonical deposit and the
vendored helpers — no network access, no upstream package drift, no
ambient dplyr version skew.
flowchart LR
subgraph EXT["External data sources"]
UIS[UIS]
SDMX[SDMX]
WB[World Bank]
ILO[ILO]
SDG[UNSD-SDG]
GHR[GitHub-raw]
end
subgraph TK["cso-toolkit (this repo)"]
direction TB
IO["IO contract<br/>dw_save · dw_use<br/>dw_compare · dw_merge<br/>dw_isid · dw_verify_z"]
API["API contract<br/>dw_api_fetch<br/>dw_api_cached<br/>dw_api_inventory"]
SYNC["Sync contract<br/>cso_toolkit_check<br/>cso_toolkit_pull"]
AGG["Aggregation<br/>aggregate_data_v2<br/>apply_time_window<br/>dw_nestweight"]
SCAF["Scaffolding<br/>create_profile<br/>create_sector_script<br/>review_profile · test_scripts"]
end
subgraph LANG["Language siblings (same contract)"]
direction LR
R["R<br/>r/R/"]
PY["Python<br/>python/src/"]
ST["Stata<br/>stata/src/"]
end
subgraph CONS["Consumer repos (vendored)"]
direction TB
DWP["DW-Production<br/>00_functions/"]
SEC["Sector pipelines<br/>(ed, ws, nt, hva, pv, im, ...)"]
CCRI["CCRI / geospatial<br/>(Python-side)"]
end
subgraph CANON["Canonical deposit"]
TEAMS["Teams folder<br/>060.DW-MASTER/"]
ZDRIVE["Z: drive<br/>(carbon-copy mirror)"]
HELIX["data.unicef.org<br/>+ SDMX downstream"]
end
EXT -->|"producer mode<br/>(cached)"| API
API --> TK
IO --> TK
SYNC --> TK
TK -.->|"language-parallel<br/>implementations"| LANG
LANG -->|"vendored<br/>.toolkit_manifest.yml"| CONS
CONS -->|"producer<br/>deposits"| CANON
CANON -.->|"reviewer<br/>reads (frozen)"| CONS
Three layered ideas:
- The toolkit (this repo) encodes the contract once. It groups its helpers into five capability families: IO, external API, vintage sync, aggregation, and scaffolding / audit.
- Three language siblings (R, Python, Stata) implement the same
contract in idiomatic form for each language. Same function names,
same
.provenance.jsonschema, same mode-aware path routing, same error-envelope shape ([cso_toolkit.<func>] WHAT / Why / Fix). - Consumer repos (DW-Production, sector pipelines, CCRI / geospatial)
vendor the helpers into their own
00_functions/. The producer deposits artefacts into the canonical Teams / Z: store; reviewers re-run against the frozen deposit without ever calling out.
The toolkit's mode contract distinguishes three roles that touch the data warehouse, each with a strict capability boundary the toolkit enforces at every wrapped call site:
flowchart LR
A["UNICEF data sources<br/>(UIS · SDMX · WB · ILO · ...)"]
B[("Teams folder<br/>+ Z: drive mirror<br/>(canonical deposit)")]
C[("data.unicef.org<br/>+ SDMX feed<br/>+ Helix")]
P["PRODUCER<br/>(Database Manager)<br/><br/>· runs sector pipeline<br/>· pulls live APIs (cached)<br/>· deposits dw_<sector>.<ext><br/>· writes submission template"]
R["REVIEWER<br/>(audit / repro)<br/><br/>· re-runs from canonical<br/>· compares to canonical<br/>· files issues<br/>· NEVER calls APIs"]
I["PUBLISHER (DBP)<br/>(downstream products)<br/><br/>· pulls signed-off deposits<br/>· publishes to data.unicef.org<br/>· feeds SDMX downstream"]
A -->|fetch + cache| P
P -->|deposit| B
B -->|read-only| R
R -.->|files issue| P
B -->|publish| I
I --> C
style P fill:#1cabe2,color:#fff
style R fill:#374ea2,color:#fff
style I fill:#8a8d8f,color:#fff
| Role | What they do | Where they write | Network access |
|---|---|---|---|
| PRODUCER (DBM) | Runs the sector pipeline, pulls upstream APIs, deposits final dw_<sector>.<ext> into the warehouse, writes a submission template. |
The canonical deposit (060.DW-MASTER). |
Yes — dw_apis_allowed = TRUE. |
| REVIEWER (DBR) | Re-runs the sector pipeline from pre-deposited inputs, compares against the canonical deposit, files issues. | A sandbox (sandboxRoot). Never touches canonical. |
No — dw_apis_allowed = FALSE. Every API call site raises with a mode-lock message. |
| PUBLISHER (DBP) | Pulls signed-off deposits into data.unicef.org, SDMX, and downstream products. |
Internal infrastructure outside this repo. | N/A. |
The mode is a session property read once at profile load time
(dw_mode in ~/.config/user_config.yml). It is not a per-call
argument. See docs/mode_contract_integration.md
for the wiring.
Each capability family is implemented in all three language siblings unless flagged. Per-function reference docs are linked in the right column.
Uniform file IO with auto-dispatch by extension
(CSV / TSV / XLSX / RDS-or-PKL / DTA / Parquet / JSON / YAML), isid
uniqueness check, automatic .provenance.json sidecar, Z: drive mirror
on canonical writes, integrity check on canonical reads.
| Function | R | Python | Stata | Reference |
|---|---|---|---|---|
dw_save |
✅ | ✅ | ✅ | R · Python |
dw_use |
✅ | ✅ | ⏳ (#5) | same as above |
dw_compare |
✅ | ✅ | ✅ | same |
dw_merge |
✅ | ✅ | — | R + Python only |
dw_isid |
✅ | ✅ | (embedded in dw_save) |
same |
dw_verify_z |
✅ | ✅ | — | R + Python only |
dw_use accepts HTTPS URLs since v0.4.0 — producer downloads
once and freezes the response under _frozen/<host>/<path> with a
.provenance.json sidecar; reviewer reads only from the frozen
copy. Allowlist is configured per-consumer (empty by default).
Mode-aware wrapper around 10 external data sources. Producer hits the
live API and caches under _apis/<api>/<cache_key>.<ext>; reviewer
reads only from the cache. Secrets in caller kwargs are redacted
before they reach the provenance sidecar.
api = value |
Source | R | Python |
|---|---|---|---|
"uis" |
UNESCO UIS REST | ✅ | ✅ |
"sdmx" |
Generic SDMX | ✅ | ✅ |
"sdmx_codelist" |
UNICEF SDMX codelist | ✅ | ✅ |
"wb", "wb_indicators" |
World Bank WDI | ✅ | ✅ |
"ilo" |
ILO SDMX | ✅ | ✅ |
"unsd_sdg" |
UNSD SDG API | ✅ | ✅ |
"github_raw" |
Pinned-commit raw.githubusercontent.com | ✅ | ✅ |
"http", "json_get" |
Generic HTTP / JSON | ✅ | ✅ |
Aggregation and survey weights — aggregate_data_v2, apply_time_window, generate_agg_footnote, dw_nestweight
aggregate_data_v2() covers weighted_mean / mean / sum /
proportion with population + country coverage and a coverage
threshold. apply_time_window() filters to the latest observation
per country within an inclusive year window. dw_nestweight() is an
R port of World Bank EduAnalyticsToolkit's
edukit_nestweight
(Diana Goldemberg) — redistributes survey weights from missing nested
observations so per-stratum totals are preserved (R + Python).
create_profile() scaffolds a profile_<repo>.R (or .py) with the
standard CSO building blocks: cross-platform user identification, YAML
config load, producer / reviewer dw_mode resolution, optional Z:
drive advisory, packages block, sentinel object. review_profile()
audits an existing profile for the same building blocks and reports
pass / warn / fail per check. create_sector_script() (and the
DW-Production convenience wrapper create_dw_sector_script())
scaffolds a sector run-script template with profile verification,
logging, runtime tracking, and try/catch.
Recursively scans a directory of .R (or .py) scripts and flags any
direct call to a raw file-IO or external-API command that the toolkit
wraps (e.g. read_csv, pd.read_csv, httr::GET, requests.get,
rsdmx::readSDMX, sdmx.Client). Per-line escape hatch via
# cso-allow: <rule-id>; CI-mode via error_on_violation = TRUE
fails the build on any violation.
Consumer-side helpers that read the local .toolkit_manifest.yml,
query upstream for the latest tag, and warn / refresh when the
consumer is behind. Reviewer mode forbids the network call.
Every stop() / raise follows a three-part shape so library errors
are grep-friendly and actionable:
[cso_toolkit.dw_save] Reviewer mode forbids writes under canonical: /path
Reviewer sessions must keep canonical deposits read-only to preserve
vintage permanence; writes go to the sandbox.
Fix:
1. Resolve a sandbox path instead, OR
2. If this is a deliberate DBM bootstrap, pass `allow_canonical_write = TRUE`.
The leading [cso_toolkit.<func>] prefix lets you grep a consumer
project for sites that hit a given error class.
Same code shape across the three siblings — pick yours.
# 1. Source the vendored helpers (normally done by profile_<repo>.R)
source("00_functions/dw_io.R")
source("00_functions/dw_api.R")
# 2. Set session-level mode + paths
dw_mode <- "producer"
dw_apis_allowed <- TRUE
teamsWrkData <- "/path/to/wrk"
# 3. Use the contract
library(dplyr)
df <- tibble::tibble(REF_AREA = c("AGO", "BFA"), OBS_VALUE = c(0.5, 0.7))
dw_save(df, name = "dw_ed_edu.csv", sector = "ed", kind = "wrk",
isid = "REF_AREA",
metadata = list(title = "Education indicators", vintage = "2026-05"))from cso_toolkit import _state, dw_save, dw_use
import pandas as pd
_state.configure(
teamsWrkData="/path/to/wrk",
dw_mode="producer",
dw_apis_allowed=True,
)
df = pd.DataFrame({"REF_AREA": ["AGO", "BFA"], "OBS_VALUE": [0.5, 0.7]})
dw_save(df, name="dw_ed_edu.csv", sector="ed", kind="wrk",
isid=["REF_AREA"],
metadata={"title": "Education indicators", "vintage": "2026-05"})* Wire the mode contract in your profile
global dw_mode "producer"
global teamsWrkDataCanonical "C:/.../013_wrkdata"
* Use the contract
use "input.dta", clear
dw_save using "dw_ed_edu.dta", ///
idvars(REF_AREA INDICATOR) ///
title("Education indicators") ///
vintage("2026-05")Per-language full-flavour quick starts live at
r/README.md,
python/README.md,
stata/README.md.
Production model is vendoring — consumers (DW-Production sector
codebases) copy the helpers into their own 00_functions/ and pin a
version in .toolkit_manifest.yml. Why not source() /
pip install / net install over the network:
- Vintage permanence. A 2026-05 release re-run must use the helper code as it stood in 2026-05. Network sourcing breaks that.
- AppLocker reality. UNICEF laptops block many script-installable
paths; copy-into-
00_functions/always works. - Offline reproducibility. Reviewers on planes / customs / corporate networks need the helpers locally.
For local development, each language also supports a native install
path (devtools::install_local("r/"), pip install -e python/,
adopath ++ "stata/src"). Vendoring stays the production model.
See docs/toolkit_strategy.md for the full
rationale + upgrade flow, and
templates/.toolkit_manifest.yml
for the manifest schema.
Semantic versioning (MAJOR.MINOR.PATCH).
| Tag | Released | Highlights |
|---|---|---|
v0.1.0-rc1 |
2026-05-24 | R helpers feature-complete; Stata / Python scaffolded. |
v0.2.0 |
2026-05-24 | Stata helpers shipped (dw_save, dw_compare, dw_mkdir); dw_nestweight ported from EduAnalyticsToolkit; workflow diagrams. |
v0.3.0 |
2026-05-25 | Full Python port (10 modules, 26 public entries); Roxygen-complete R reference (26 Rd files + pkgdown); graceful three-part error envelopes across R + Python; secrets-redaction in .provenance.json. |
v0.4.0 |
2026-05-26 | DW-Production backports (B1 remote-URL freeze, B2 gzip auto-detect, B3 sidecar tryCatch, B4 URLencode + cache-ext fix); testthat regression suite (237 PASS); GitHub Actions CI; tightened producer / reviewer mode contract for dw_save + dw_use — redundant Teams + Z: producer writes, network-first reviewer reads, overwrite default flipped TRUE → FALSE (issue #14, BREAKING); remaining Stata helpers — dw_use, dw_require_no_api, dw_load_config (issue #5); R demographics family — dw_pop() + dw_regions() (issues #17 + #18); dw_publish() STUB (issue #15; dry-run only, live submission deferred to v0.5.0). |
v0.4.1 |
2026-05-27 | Patch. Restores dialect = "base" byte-parity dispatch on dw_save() (regression from v0.4.0); fixes Copilot-flagged dw_use() regression. Validated against DW-Production NT branch (tests/test_v041_nt.R, tests 1–4 PASS). |
v0.4.2 |
2026-05-27 | Patch. Fixes a silent-.tsv bug in v0.4.1's dialect = "base" dispatch — utils::write.csv() hardcoded the comma separator, so a .tsv write produced CSV content with a .tsv extension. dialect = "base" now respects the file extension. |
v0.4.3 |
2026-05-28 | Integrity release. Two dw_use() fixes (issues #30 + #31) ported from the 2026-05-27 NT reviewer-mode reproducibility audit (DW-Production PR #133). col_select = NULL conditional dispatch for parquet / dta; new cols_lenient flag for any_of()-style schema intersect. |
v0.4.4 |
2026-05-29 | Quality release. Three milestone issues land in one cycle (PRs #39, #41, #43). All three surfaced during the v0.4.3.1 fanout audit (IM / WS / HVA install + reviewer-mode runs on 2026-05-28). No public API breaks. |
v0.4.5 |
2026-05-29 | Closes v0.4.5 milestone with standalone-source %>% binding (issue #46 / PR #47) and 8 new dw_-prefixed aliases (issue #42 / PR #48). magrittr declared as a first-class Import with importFrom in NAMESPACE. No public API breaks; both old and new names remain exported throughout v0.4.x. |
v0.4.6 |
2026-05-30 | Quality release. Four issues land in one cycle — HIGH-severity dw_is_canonical recognises OneDrive-mounted Teams Documents path (issue #54; pre-fix, reviewer-mode dw_save() could silently overwrite canonical Teams artefacts on UNICEF laptops where Documents is OneDrive-mounted); re-exported dw_root() public wrapper around .dw_root_for() (issue #53); .cso_require("magrittr") envelope on standalone-source %>% gate (issue #51); r/.gitattributes pin so the R subtree checks out with LF endings on Windows (issue #52). Also lands the cso-toolkit-hosted DW Operations Hub dashboard infrastructure (single-page SPA at docs/dashboard/, nightly cron via .github/workflows/dashboard.yml) and renames the third role INGESTOR → PUBLISHER (DBM/DBR/DBP). No public API breaks. |
v0.5.0 |
planned | Live dw_publish() submission branch (Helix endpoint + auth + idempotency); Python + Stata siblings of dw_pop() and dw_regions(). |
v1.0.0 |
committed API | After the ed sector pilot lands and a second sector vendors the helpers without modification. |
Changelog. Per-release notes — including breaking-change migration
notes and per-PR provenance — live in NEWS.md (the
toolkit follows the R-ecosystem NEWS.md convention; there is no
separate CHANGELOG.md).
The toolkit ships a layered test surface. Every layer runs both
locally and on CI (.github/workflows/r-check.yml).
| Layer | What it catches |
|---|---|
R testthat suite (r/tests/testthat/, 125+ assertions) |
Unit + regression coverage for every helper. expect_envelope() asserts the [cso_toolkit.<func>] WHAT / Why / Fix shape on every raise. |
R R CMD check |
Rd syntax, NAMESPACE drift, undefined globals, \examples{} blocks. Target: 0 errors / 0 warnings / 0 notes. |
Python smoke test (python/tests/manual/smoke_test.py, 20 invariants) |
Round-trip behaviour, provenance sidecar, mode contract, B1–B4 regressions. Manual-only — run via python python/tests/manual/smoke_test.py. |
Python error-envelope test (python/tests/manual/error_envelope_test.py, 30 paths) |
Every public raise carries the standard envelope. Manual-only. |
R manual smoke (r/tests/manual/check_consumer_side.R) |
End-to-end vendoring scenario against api.github.com. Manual-only. |
GitHub Actions (.github/workflows/r-check.yml) |
Runs R CMD check (which invokes the R testthat suite) across ubuntu-latest (R release + devel), macos-latest (release), and windows-latest (release) on every push / PR. The Python and R-manual layers above are not yet wired into CI — tracked as a follow-up. |
Code under MIT; documentation under CC BY 4.0.
UNICEF Chief Statistician Office, cso-toolkit: Shared helpers and operating model for child-indicator data warehousing, v0.3.0 (2026), https://github.com/unicef-drp/cso-toolkit
r/README.md— R package overview (install, layout, quick start, error envelope, testing).python/README.md— Python package overview.stata/README.md— Stata package overview.docs/roles_and_workflow.md— full PRODUCER / REVIEWER / PUBLISHER role definitions + per-role workflow.docs/toolkit_strategy.md— why the vendoring model exists.docs/mode_contract_integration.md— how to wiredw_modeinto a sector profile.docs/dw_io_reference.md+docs/dw_io_python_reference.md— per-function IO references.docs/dw_api_reference.md+docs/dw_api_python_reference.md— per-function API references.templates/dbm_submission_template.md— eight-section pre-deposit checklist.