Skip to content

unicef-drp/cso-toolkit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

236 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cso-toolkit — UNICEF Chief Statistician Office toolkit

Shared helpers, templates, and operating-model documentation for the UNICEF Chief Statistician Office (CSO), within the Office of Strategy and Evidence (OSE). One IO + API contract, one mode contract, three implementations (R · Python · Stata), one vendoring model.

R CMD check License: MIT (code) + CC BY 4.0 (docs) Tag


TL;DR

  • One way to read, write, compare, and merge data — auto-dispatched by file extension; writes emit a .provenance.json sidecar by default (sha256, schema, user, timestamp, metadata; opt out with provenance = FALSE; .RData / .rda writes skip the sidecar).
  • One way to hit external APIs — UIS, SDMX, World Bank, ILO, UNSD-SDG, GitHub-raw — with a deposit cache and a reviewer-mode lockout that physically prevents network calls.
  • Producer / reviewer is a session-level mode, not a per-call argument; the contract is enforced at every wrapped call site, not by convention.
  • Same behaviour across R, Python, and Stata; vendored into consumer repos via a 1-line manifest pin, not installed over the network.

Table of contents


Motivation

cso-toolkit exists to facilitate the reproducibility and scalability of analytics developed by the UNICEF Chief Statistician Office (CSO), within the Office of Strategy and Evidence (OSE). Concretely, it does three things:

  1. Encodes a single IO + API contract. One way to read, write, compare, and merge data; one way to hit external APIs. Every call routes through wrappers that enforce provenance sidecars, uniqueness checks, and the producer / reviewer mode contract — so any analytics product can be rerun by someone other than its original author and yield the same numbers.

  2. Separates producer and reviewer mode at the session level. The Database Manager (producer) pulls live APIs and deposits canonical artefacts; the reviewer reruns from those frozen artefacts and is physically prevented from touching the network. The contract is enforced by the toolkit at every wrapped call site, not by convention.

  3. Scales across sectors and projects. The same helpers, the same templates, and the same audit functions are vendored into every sector codebase under the CSO, which means new sectors and new projects inherit the reproducibility floor for free instead of re-inventing it.

Why this matters in practice: a 2026 publication that cites dw_<sector>.csv must, two years later, produce identical numbers when re-run by a reviewer who has only the canonical deposit and the vendored helpers — no network access, no upstream package drift, no ambient dplyr version skew.


Architecture at a glance

flowchart LR
    subgraph EXT["External data sources"]
        UIS[UIS]
        SDMX[SDMX]
        WB[World Bank]
        ILO[ILO]
        SDG[UNSD-SDG]
        GHR[GitHub-raw]
    end

    subgraph TK["cso-toolkit (this repo)"]
        direction TB
        IO["IO contract<br/>dw_save · dw_use<br/>dw_compare · dw_merge<br/>dw_isid · dw_verify_z"]
        API["API contract<br/>dw_api_fetch<br/>dw_api_cached<br/>dw_api_inventory"]
        SYNC["Sync contract<br/>cso_toolkit_check<br/>cso_toolkit_pull"]
        AGG["Aggregation<br/>aggregate_data_v2<br/>apply_time_window<br/>dw_nestweight"]
        SCAF["Scaffolding<br/>create_profile<br/>create_sector_script<br/>review_profile · test_scripts"]
    end

    subgraph LANG["Language siblings (same contract)"]
        direction LR
        R["R<br/>r/R/"]
        PY["Python<br/>python/src/"]
        ST["Stata<br/>stata/src/"]
    end

    subgraph CONS["Consumer repos (vendored)"]
        direction TB
        DWP["DW-Production<br/>00_functions/"]
        SEC["Sector pipelines<br/>(ed, ws, nt, hva, pv, im, ...)"]
        CCRI["CCRI / geospatial<br/>(Python-side)"]
    end

    subgraph CANON["Canonical deposit"]
        TEAMS["Teams folder<br/>060.DW-MASTER/"]
        ZDRIVE["Z: drive<br/>(carbon-copy mirror)"]
        HELIX["data.unicef.org<br/>+ SDMX downstream"]
    end

    EXT -->|"producer mode<br/>(cached)"| API
    API --> TK
    IO --> TK
    SYNC --> TK
    TK -.->|"language-parallel<br/>implementations"| LANG
    LANG -->|"vendored<br/>.toolkit_manifest.yml"| CONS
    CONS -->|"producer<br/>deposits"| CANON
    CANON -.->|"reviewer<br/>reads (frozen)"| CONS
Loading

Three layered ideas:

  • The toolkit (this repo) encodes the contract once. It groups its helpers into five capability families: IO, external API, vintage sync, aggregation, and scaffolding / audit.
  • Three language siblings (R, Python, Stata) implement the same contract in idiomatic form for each language. Same function names, same .provenance.json schema, same mode-aware path routing, same error-envelope shape ([cso_toolkit.<func>] WHAT / Why / Fix).
  • Consumer repos (DW-Production, sector pipelines, CCRI / geospatial) vendor the helpers into their own 00_functions/. The producer deposits artefacts into the canonical Teams / Z: store; reviewers re-run against the frozen deposit without ever calling out.

Three roles, one contract

The toolkit's mode contract distinguishes three roles that touch the data warehouse, each with a strict capability boundary the toolkit enforces at every wrapped call site:

flowchart LR
    A["UNICEF data sources<br/>(UIS · SDMX · WB · ILO · ...)"]
    B[("Teams folder<br/>+ Z: drive mirror<br/>(canonical deposit)")]
    C[("data.unicef.org<br/>+ SDMX feed<br/>+ Helix")]

    P["PRODUCER<br/>(Database Manager)<br/><br/>· runs sector pipeline<br/>· pulls live APIs (cached)<br/>· deposits dw_&lt;sector&gt;.&lt;ext&gt;<br/>· writes submission template"]
    R["REVIEWER<br/>(audit / repro)<br/><br/>· re-runs from canonical<br/>· compares to canonical<br/>· files issues<br/>· NEVER calls APIs"]
    I["PUBLISHER (DBP)<br/>(downstream products)<br/><br/>· pulls signed-off deposits<br/>· publishes to data.unicef.org<br/>· feeds SDMX downstream"]

    A -->|fetch + cache| P
    P -->|deposit| B
    B -->|read-only| R
    R -.->|files issue| P
    B -->|publish| I
    I --> C

    style P fill:#1cabe2,color:#fff
    style R fill:#374ea2,color:#fff
    style I fill:#8a8d8f,color:#fff
Loading
Role What they do Where they write Network access
PRODUCER (DBM) Runs the sector pipeline, pulls upstream APIs, deposits final dw_<sector>.<ext> into the warehouse, writes a submission template. The canonical deposit (060.DW-MASTER). Yesdw_apis_allowed = TRUE.
REVIEWER (DBR) Re-runs the sector pipeline from pre-deposited inputs, compares against the canonical deposit, files issues. A sandbox (sandboxRoot). Never touches canonical. Nodw_apis_allowed = FALSE. Every API call site raises with a mode-lock message.
PUBLISHER (DBP) Pulls signed-off deposits into data.unicef.org, SDMX, and downstream products. Internal infrastructure outside this repo. N/A.

The mode is a session property read once at profile load time (dw_mode in ~/.config/user_config.yml). It is not a per-call argument. See docs/mode_contract_integration.md for the wiring.


What's inside — by capability

Each capability family is implemented in all three language siblings unless flagged. Per-function reference docs are linked in the right column.

IO contract — dw_save, dw_use, dw_compare, dw_merge, dw_isid, dw_verify_z

Uniform file IO with auto-dispatch by extension (CSV / TSV / XLSX / RDS-or-PKL / DTA / Parquet / JSON / YAML), isid uniqueness check, automatic .provenance.json sidecar, Z: drive mirror on canonical writes, integrity check on canonical reads.

Function R Python Stata Reference
dw_save R · Python
dw_use ⏳ (#5) same as above
dw_compare same
dw_merge R + Python only
dw_isid (embedded in dw_save) same
dw_verify_z R + Python only

dw_use accepts HTTPS URLs since v0.4.0 — producer downloads once and freezes the response under _frozen/<host>/<path> with a .provenance.json sidecar; reviewer reads only from the frozen copy. Allowlist is configured per-consumer (empty by default).

Cached external APIs — dw_api_fetch, dw_api_cached, dw_api_inventory

Mode-aware wrapper around 10 external data sources. Producer hits the live API and caches under _apis/<api>/<cache_key>.<ext>; reviewer reads only from the cache. Secrets in caller kwargs are redacted before they reach the provenance sidecar.

api = value Source R Python
"uis" UNESCO UIS REST
"sdmx" Generic SDMX
"sdmx_codelist" UNICEF SDMX codelist
"wb", "wb_indicators" World Bank WDI
"ilo" ILO SDMX
"unsd_sdg" UNSD SDG API
"github_raw" Pinned-commit raw.githubusercontent.com
"http", "json_get" Generic HTTP / JSON

References: R · Python.

Aggregation and survey weights — aggregate_data_v2, apply_time_window, generate_agg_footnote, dw_nestweight

aggregate_data_v2() covers weighted_mean / mean / sum / proportion with population + country coverage and a coverage threshold. apply_time_window() filters to the latest observation per country within an inclusive year window. dw_nestweight() is an R port of World Bank EduAnalyticsToolkit's edukit_nestweight (Diana Goldemberg) — redistributes survey weights from missing nested observations so per-stratum totals are preserved (R + Python).

Project scaffolding — create_profile, create_sector_script, review_profile

create_profile() scaffolds a profile_<repo>.R (or .py) with the standard CSO building blocks: cross-platform user identification, YAML config load, producer / reviewer dw_mode resolution, optional Z: drive advisory, packages block, sentinel object. review_profile() audits an existing profile for the same building blocks and reports pass / warn / fail per check. create_sector_script() (and the DW-Production convenience wrapper create_dw_sector_script()) scaffolds a sector run-script template with profile verification, logging, runtime tracking, and try/catch.

Contract auditing — test_scripts

Recursively scans a directory of .R (or .py) scripts and flags any direct call to a raw file-IO or external-API command that the toolkit wraps (e.g. read_csv, pd.read_csv, httr::GET, requests.get, rsdmx::readSDMX, sdmx.Client). Per-line escape hatch via # cso-allow: <rule-id>; CI-mode via error_on_violation = TRUE fails the build on any violation.

Vintage sync — cso_toolkit_check, cso_toolkit_diff, cso_toolkit_pull

Consumer-side helpers that read the local .toolkit_manifest.yml, query upstream for the latest tag, and warn / refresh when the consumer is behind. Reviewer mode forbids the network call.

Graceful error envelopes — across every helper

Every stop() / raise follows a three-part shape so library errors are grep-friendly and actionable:

[cso_toolkit.dw_save] Reviewer mode forbids writes under canonical: /path
  Reviewer sessions must keep canonical deposits read-only to preserve
  vintage permanence; writes go to the sandbox.
  Fix:
    1. Resolve a sandbox path instead, OR
    2. If this is a deliberate DBM bootstrap, pass `allow_canonical_write = TRUE`.

The leading [cso_toolkit.<func>] prefix lets you grep a consumer project for sites that hit a given error class.


Quick start

Same code shape across the three siblings — pick yours.

R

# 1. Source the vendored helpers (normally done by profile_<repo>.R)
source("00_functions/dw_io.R")
source("00_functions/dw_api.R")

# 2. Set session-level mode + paths
dw_mode <- "producer"
dw_apis_allowed <- TRUE
teamsWrkData <- "/path/to/wrk"

# 3. Use the contract
library(dplyr)
df <- tibble::tibble(REF_AREA = c("AGO", "BFA"), OBS_VALUE = c(0.5, 0.7))
dw_save(df, name = "dw_ed_edu.csv", sector = "ed", kind = "wrk",
        isid = "REF_AREA",
        metadata = list(title = "Education indicators", vintage = "2026-05"))

Python

from cso_toolkit import _state, dw_save, dw_use
import pandas as pd

_state.configure(
    teamsWrkData="/path/to/wrk",
    dw_mode="producer",
    dw_apis_allowed=True,
)

df = pd.DataFrame({"REF_AREA": ["AGO", "BFA"], "OBS_VALUE": [0.5, 0.7]})
dw_save(df, name="dw_ed_edu.csv", sector="ed", kind="wrk",
        isid=["REF_AREA"],
        metadata={"title": "Education indicators", "vintage": "2026-05"})

Stata

* Wire the mode contract in your profile
global dw_mode "producer"
global teamsWrkDataCanonical "C:/.../013_wrkdata"

* Use the contract
use "input.dta", clear
dw_save using "dw_ed_edu.dta",      ///
    idvars(REF_AREA INDICATOR)      ///
    title("Education indicators")   ///
    vintage("2026-05")

Per-language full-flavour quick starts live at r/README.md, python/README.md, stata/README.md.


Install / vendor

Production model is vendoring — consumers (DW-Production sector codebases) copy the helpers into their own 00_functions/ and pin a version in .toolkit_manifest.yml. Why not source() / pip install / net install over the network:

  • Vintage permanence. A 2026-05 release re-run must use the helper code as it stood in 2026-05. Network sourcing breaks that.
  • AppLocker reality. UNICEF laptops block many script-installable paths; copy-into-00_functions/ always works.
  • Offline reproducibility. Reviewers on planes / customs / corporate networks need the helpers locally.

For local development, each language also supports a native install path (devtools::install_local("r/"), pip install -e python/, adopath ++ "stata/src"). Vendoring stays the production model.

See docs/toolkit_strategy.md for the full rationale + upgrade flow, and templates/.toolkit_manifest.yml for the manifest schema.


Versioning and roadmap

Semantic versioning (MAJOR.MINOR.PATCH).

Tag Released Highlights
v0.1.0-rc1 2026-05-24 R helpers feature-complete; Stata / Python scaffolded.
v0.2.0 2026-05-24 Stata helpers shipped (dw_save, dw_compare, dw_mkdir); dw_nestweight ported from EduAnalyticsToolkit; workflow diagrams.
v0.3.0 2026-05-25 Full Python port (10 modules, 26 public entries); Roxygen-complete R reference (26 Rd files + pkgdown); graceful three-part error envelopes across R + Python; secrets-redaction in .provenance.json.
v0.4.0 2026-05-26 DW-Production backports (B1 remote-URL freeze, B2 gzip auto-detect, B3 sidecar tryCatch, B4 URLencode + cache-ext fix); testthat regression suite (237 PASS); GitHub Actions CI; tightened producer / reviewer mode contract for dw_save + dw_use — redundant Teams + Z: producer writes, network-first reviewer reads, overwrite default flipped TRUE → FALSE (issue #14, BREAKING); remaining Stata helpers — dw_use, dw_require_no_api, dw_load_config (issue #5); R demographics family — dw_pop() + dw_regions() (issues #17 + #18); dw_publish() STUB (issue #15; dry-run only, live submission deferred to v0.5.0).
v0.4.1 2026-05-27 Patch. Restores dialect = "base" byte-parity dispatch on dw_save() (regression from v0.4.0); fixes Copilot-flagged dw_use() regression. Validated against DW-Production NT branch (tests/test_v041_nt.R, tests 1–4 PASS).
v0.4.2 2026-05-27 Patch. Fixes a silent-.tsv bug in v0.4.1's dialect = "base" dispatch — utils::write.csv() hardcoded the comma separator, so a .tsv write produced CSV content with a .tsv extension. dialect = "base" now respects the file extension.
v0.4.3 2026-05-28 Integrity release. Two dw_use() fixes (issues #30 + #31) ported from the 2026-05-27 NT reviewer-mode reproducibility audit (DW-Production PR #133). col_select = NULL conditional dispatch for parquet / dta; new cols_lenient flag for any_of()-style schema intersect.
v0.4.4 2026-05-29 Quality release. Three milestone issues land in one cycle (PRs #39, #41, #43). All three surfaced during the v0.4.3.1 fanout audit (IM / WS / HVA install + reviewer-mode runs on 2026-05-28). No public API breaks.
v0.4.5 2026-05-29 Closes v0.4.5 milestone with standalone-source %>% binding (issue #46 / PR #47) and 8 new dw_-prefixed aliases (issue #42 / PR #48). magrittr declared as a first-class Import with importFrom in NAMESPACE. No public API breaks; both old and new names remain exported throughout v0.4.x.
v0.4.6 2026-05-30 Quality release. Four issues land in one cycle — HIGH-severity dw_is_canonical recognises OneDrive-mounted Teams Documents path (issue #54; pre-fix, reviewer-mode dw_save() could silently overwrite canonical Teams artefacts on UNICEF laptops where Documents is OneDrive-mounted); re-exported dw_root() public wrapper around .dw_root_for() (issue #53); .cso_require("magrittr") envelope on standalone-source %>% gate (issue #51); r/.gitattributes pin so the R subtree checks out with LF endings on Windows (issue #52). Also lands the cso-toolkit-hosted DW Operations Hub dashboard infrastructure (single-page SPA at docs/dashboard/, nightly cron via .github/workflows/dashboard.yml) and renames the third role INGESTOR → PUBLISHER (DBM/DBR/DBP). No public API breaks.
v0.5.0 planned Live dw_publish() submission branch (Helix endpoint + auth + idempotency); Python + Stata siblings of dw_pop() and dw_regions().
v1.0.0 committed API After the ed sector pilot lands and a second sector vendors the helpers without modification.

Changelog. Per-release notes — including breaking-change migration notes and per-PR provenance — live in NEWS.md (the toolkit follows the R-ecosystem NEWS.md convention; there is no separate CHANGELOG.md).


Testing and CI

The toolkit ships a layered test surface. Every layer runs both locally and on CI (.github/workflows/r-check.yml).

Layer What it catches
R testthat suite (r/tests/testthat/, 125+ assertions) Unit + regression coverage for every helper. expect_envelope() asserts the [cso_toolkit.<func>] WHAT / Why / Fix shape on every raise.
R R CMD check Rd syntax, NAMESPACE drift, undefined globals, \examples{} blocks. Target: 0 errors / 0 warnings / 0 notes.
Python smoke test (python/tests/manual/smoke_test.py, 20 invariants) Round-trip behaviour, provenance sidecar, mode contract, B1–B4 regressions. Manual-only — run via python python/tests/manual/smoke_test.py.
Python error-envelope test (python/tests/manual/error_envelope_test.py, 30 paths) Every public raise carries the standard envelope. Manual-only.
R manual smoke (r/tests/manual/check_consumer_side.R) End-to-end vendoring scenario against api.github.com. Manual-only.
GitHub Actions (.github/workflows/r-check.yml) Runs R CMD check (which invokes the R testthat suite) across ubuntu-latest (R release + devel), macos-latest (release), and windows-latest (release) on every push / PR. The Python and R-manual layers above are not yet wired into CI — tracked as a follow-up.

License and citation

Code under MIT; documentation under CC BY 4.0.

UNICEF Chief Statistician Office, cso-toolkit: Shared helpers and operating model for child-indicator data warehousing, v0.3.0 (2026), https://github.com/unicef-drp/cso-toolkit


See also

About

UNICEF Chief Statistician Office toolkit - shared R helpers, contracts (dw_io / dw_api), profile + sector scaffolders, and a contract auditor for reproducible, scalable analytics from the D&A Section (OSE).

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors