docs(devnotes): add Nemotron-Personas dev note #611

Merged
3mei merged 14 commits into main from yev/nemotron_personas_dev_note
Jun 1, 2026
Conversation

@3mei
Contributor

@3mei 3mei commented May 7, 2026

📋 Summary

Adds the Designing Nemotron-Personas dev note covering how the multi-locale Nemotron-Personas HF collection is built (4-stage compound-AI pipeline) and how it's used as a seeding primitive across Nemotron training (long-context, tool-use, formal logic, safety refusals, instruction-following). Ships alongside a runnable Tutorial 7 demonstrating reproduction and customization, plus a Colab variant.

🔗 Related Issue

N/A

🔄 Changes

✨ Added

  • docs/devnotes/posts/nemotron-personas.md — new dev note
  • docs/devnotes/posts/assets/nemotron-personas/ — four images: three pipeline-stage diagrams from the partner repo plus a black-background Nemotron-Personas world-map hero
  • docs/notebook_source/7-nemotron-personas.py — jupytext source for the Reproducing & Customizing Nemotron-Personas tutorial
  • docs/colab_notebooks/7-nemotron-personas.ipynb — committed Colab variant

🔧 Changed

  • docs/scripts/generate_colab_notebooks.py — adds an ADDITIONAL_SETUP_CELLS map paralleling ADDITIONAL_DEPENDENCIES; injects NGC CLI install + NGC_API_KEY cells. Future devnote-paired tutorials needing extra Colab bootstrap can register one-line entries in the same map.
  • mkdocs.yml — adds Reproducing & Customizing Nemotron-Personas under the Tutorials nav
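The "one-line entries in the same map" idea can be sketched as a registry keyed by notebook filename. This is a minimal illustration only: the cell contents and the `setup_cells_for` helper are hypothetical, and the actual layout of `generate_colab_notebooks.py` is not shown in this conversation.

```python
# Hypothetical stand-in for the map described above; real cell contents
# in generate_colab_notebooks.py may differ.
ADDITIONAL_SETUP_CELLS: dict[str, list[str]] = {
    "7-nemotron-personas.py": [
        "# (assumed) install the NGC CLI here",
        "# (assumed) prompt for NGC_API_KEY here",
    ],
}


def setup_cells_for(notebook_name: str) -> list[str]:
    """Return extra setup cell sources for a notebook, or an empty list."""
    # Notebooks without an entry get no extra cells, so existing tutorials
    # are unaffected by new registrations.
    return list(ADDITIONAL_SETUP_CELLS.get(notebook_name, []))
```

A notebook with no entry simply receives an empty list, which is what keeps the change additive.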

🧪 Testing

  • make test passes
  • Notebook runs end-to-end via jupytext --to ipynb --execute
  • make generate-colab-notebooks regenerates the Colab .ipynb cleanly with the NGC setup cells in the expected position
  • Unit tests added/updated (N/A — this PR is docs + tutorial assets; no engine code changed)
  • E2E tests added/updated (N/A — Tutorial 7 is opt-in via make convert-execute-notebooks and gated on NVIDIA_API_KEY + on-disk NGC dataset, matching how Tutorials 5/6 are gated on OPENROUTER_API_KEY)

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (N/A — no architectural changes)

3mei added 2 commits May 7, 2026 02:04
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

MkDocs preview: https://219428f5.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-611.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@3mei 3mei changed the title Nemotron-Personas Dev Note docs(devnotes): add Nemotron-Personas dev note May 7, 2026
Contributor

@danecor danecor left a comment

Looks good! Some possible issues / suggestions attached.

Comment thread docs/scripts/generate_colab_notebooks.py
Comment thread docs/notebook_source/7-nemotron-personas.py Outdated
Comment thread docs/notebook_source/7-nemotron-personas.py Outdated
Comment thread docs/notebook_source/7-nemotron-personas.py Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/notebook_source/7-nemotron-personas.py Outdated
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
@3mei 3mei requested review from danecor and johnnygreco May 27, 2026 23:02
…_dev_note

# Conflicts:
#	docs/scripts/generate_colab_notebooks.py
@3mei 3mei marked this pull request as ready for review May 28, 2026 01:29
@3mei 3mei requested a review from a team as a code owner May 28, 2026 01:29
@github-actions
Contributor

Code Review: PR #611, docs(devnotes): add Nemotron-Personas dev note

Summary

A docs-and-tutorial PR that ships:

  • docs/devnotes/posts/nemotron-personas.md — long-form dev note covering how the multi-locale Nemotron-Personas HF collection is built (a 4-stage compound-AI pipeline) and how those personas seed Nemotron training (long-context, tool-use, formal-logic, safety refusals, instruction-following).
  • docs/notebook_source/7-nemotron-personas.py (732 lines) — jupytext source for Reproducing & Customizing Nemotron-Personas. Reproduces the released schema from the NGC-hosted Nemotron-Personas-USA artifact via PersonSampler + two LLMStructuredColumnConfig stages, then layers a small tech_persona customization example.
  • docs/colab_notebooks/7-nemotron-personas.ipynb — committed Colab variant.
  • docs/scripts/generate_colab_notebooks.py — extends the Colab-cell generator with ADDITIONAL_API_KEY_BLOCKS (and a parallel-but-currently-unused ADDITIONAL_SETUP_CELLS) so this notebook can request an NGC_API_KEY in addition to the standard NVIDIA_API_KEY.
  • mkdocs.yml — adds the Tutorial 7 nav entry.
  • Four pipeline-diagram / hero PNGs under docs/devnotes/posts/assets/nemotron-personas/.
  • Tiny doc fix: quick-start/ → latest/quick-start/ in docs/notebook_source/README.md.

The diff is +2228/−4. No engine code is touched; structural invariants (import direction, lazy heavy imports, etc.) are not at risk.

Findings

Accuracy & content (dev note + tutorial)

  • Frontmatter matches existing dev-note convention (design-principles.md, text-to-sql.md): date + authors only. ✅
  • Quotes from the Nemotron 3 Super Technical Report are attributed and linked. Each block-quote includes a citation and the same URL is repeated rather than relying on an unstable shorthand. ✅
  • Pipeline narrative is consistent with the tutorial code. Stage 1 (OCEAN), Stage 2 (PGM/PersonSampler), Stage 3 (PersonaAttributes), Stage 4 (Personas) line up across both files. The dev note correctly says "nine cohesive persona descriptions" and the Personas Pydantic model in the tutorial defines exactly nine fields (professional, finance, healthcare, sports, arts, travel, culinary, concise, detailed). ✅
  • Locale list is consistent across the dev note, tutorial markdown, and the Try it yourself / Next Steps sections (en_US, en_IN, en_SG, fr_FR, hi_Deva_IN, hi_Latn_IN, ja_JP, ko_KR, pt_BR). ✅
  • Cross-link to other dev notes (design-principles.md, text-to-sql.md, push-datasets-to-hugging-face-hub.md) uses relative paths, which is correct for mkdocs. ✅
  • Image references inside the dev note use relative paths (assets/nemotron-personas/...) — correct. The two embedded diagrams in the tutorial source use raw GitHub URLs pointing at main, which means they won't render in a freshly-rendered notebook until this PR is merged. That's the same approach Tutorials 5/6 take, so consistent — but worth being aware of when reviewing the rendered output before merge.
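The nine-field consistency check noted above can be made mechanical. The sketch below uses a stdlib dataclass as a stand-in for the tutorial's Pydantic model; the field names are taken from the list in this review and may not match the real attribute names exactly.

```python
from dataclasses import dataclass, fields


@dataclass
class Personas:
    """Dataclass stand-in for the tutorial's Pydantic model (names assumed)."""

    professional: str
    finance: str
    healthcare: str
    sports: str
    arts: str
    travel: str
    culinary: str
    concise: str
    detailed: str


# Collect the declared field names so prose claims ("nine persona
# descriptions") can be compared against the schema.
PERSONA_FIELD_NAMES = [f.name for f in fields(Personas)]
```

Comparing `len(PERSONA_FIELD_NAMES)` against the number quoted in the dev note is one cheap way to catch drift between the prose and the schema.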

Minor accuracy issue

  • Typo "experince" appears twice in 7-nemotron-personas.py (lines 515 and 604) inside PERSONA_SYSTEM_PROMPT and the inline prompt string ("A neonatal nurse with decades of experince…"). The same typo appears verbatim in the dev note's prompt copy. It's user-facing prompt text fed to the LLM, so the impact is small, but worth fixing — easy win and the typo presumably came from an upstream copy.

Code quality — 7-nemotron-personas.py

  • from __future__ import annotations ✅, modern type syntax (dict[str, dict[str, str]], int | None) ✅, absolute imports ✅, type-annotated helpers ✅. Consistent with project style.
  • The "verify dataset is on disk" cell raises SystemExit with a clear pointer back to the setup cell — good UX for a notebook that depends on an out-of-band download.
  • The SAMPLE_FROM_SDG_PGM = True branch is gated behind raise NotImplementedError with an informative message. The dead code above the raise is a sketch of the eventual integration. This is a reasonable pattern for a tutorial that documents a "future path", but consider one of:
    • Move the sketch into a markdown cell (it's documentation, not code that runs), or
    • At minimum, add # pragma: no cover / a comment that it's intentionally unreachable.
    • Today, lints / static analysis on docs/notebook_source/ may flag the unreachable lines as dead. (Not a blocker; the notebook isn't part of the import path.)
  • Validator workaround: several ExpressionColumnConfigs use {{ field if field else ' ' }} (single space) and the comment notes "DD's validator rejects expression columns that render to ''". Reasonable workaround for a tutorial; if this is a frequent pattern, a follow-up engine change to allow nullable expression columns would be cleaner — flag for the engine team but not in scope here.
  • temperature=UniformDistribution(low=0.9, high=1.1) is unusual (>1.0 can be aggressive on some endpoints). The markdown above the cell explicitly tells the user to consult the model card — that's the right escape hatch for a tutorial.
  • NUM_RECORDS = 50 for the scale-up cell is appropriately small for a runnable tutorial; the surrounding text correctly notes the released artifact scales to millions.
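For readers unfamiliar with the validator workaround, the Jinja conditional `{{ field if field else ' ' }}` behaves like the plain-Python function below. This is only an illustration of the fallback logic; `render_or_space` is a hypothetical name and is not part of Data Designer.

```python
from typing import Optional


def render_or_space(value: Optional[str]) -> str:
    """Mirror the Jinja conditional `{{ field if field else ' ' }}`."""
    # Empty strings and None are both falsy, so both fall back to a single
    # space, which keeps the rendered expression non-empty.
    return value if value else " "
```

The single space is a placeholder rather than real data, so downstream consumers may want to strip it.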

generate_colab_notebooks.py extension

  • The change is backward-compatible: new parameters on create_colab_setup_cells have defaults, and existing callers (notebooks 1–6) pick up empty maps. ✅
  • Joining the NGC API-key block into the existing os.environ/getpass cell (rather than emitting a duplicate cell with its own imports) is the right call. The comment explains the rationale clearly.
  • Minor concern: ADDITIONAL_SETUP_CELLS is added but currently empty. The comment ("Currently unused; left in place so future tutorials can register…") explicitly flags it as speculative. AGENTS.md style guidance is "Don't design for hypothetical future requirements". This is a docs script, not engine code, and the cost is small (one empty dict + a .get call), so I'd flag it as a nit rather than a blocker — but if a future tutorial needs setup cells, the dict could be added at that time with no extra ceremony. Consider removing the unused map and the corresponding parameter, then re-adding when the first real consumer lands.
  • One small naming nit: the public hash key is "7-nemotron-personas.py" (with .py), matching the existing ADDITIONAL_DEPENDENCIES convention. Consistent. ✅
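The "join into the existing cell" approach, combined with a defaulted parameter for backward compatibility, might look roughly like this. The cell text and the `build_api_key_cell` helper are assumptions made for illustration; the real script's function names and cell contents are not reproduced here.

```python
from typing import Optional

# Hypothetical base cell: imports appear once, at the top.
BASE_API_KEY_CELL = (
    "import getpass\n"
    "import os\n"
    "\n"
    "# (assumed) existing NVIDIA_API_KEY prompt goes here\n"
)


def build_api_key_cell(base: str, extra_blocks: Optional[list[str]] = None) -> str:
    """Append optional key blocks to one cell so its imports are shared."""
    parts = [base.rstrip("\n")]
    # With the default (None), the base cell comes back unchanged, so
    # callers that never pass extra blocks keep their current output.
    parts.extend(block.strip("\n") for block in (extra_blocks or []))
    return "\n\n".join(parts) + "\n"
```

Because the extra block lands in the same cell, it can rely on `os` and `getpass` already being imported instead of repeating them.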

Notebook execution & gating

  • The PR description states the notebook is "opt-in via make convert-execute-notebooks and gated on NVIDIA_API_KEY + on-disk NGC dataset, matching how Tutorials 5/6 are gated on OPENROUTER_API_KEY". I did not verify the gating mechanism in this review, but the verify-on-disk cell (raise SystemExit if the parquet isn't found) provides a clean fail-fast path locally, and the tutorial author confirms make test passes. Worth a sanity check on CI that this notebook is excluded from auto-execution unless the NGC asset is present.
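A fail-fast check of the kind described here can be sketched as follows. The function name, message wording, and path are hypothetical; only the use of `SystemExit` with a pointer back to the setup cell comes from the review.

```python
from pathlib import Path


def require_dataset(path: Path) -> Path:
    """Exit with a pointer to the setup cell when the dataset file is missing."""
    if not path.is_file():
        # SystemExit stops notebook execution with a readable message
        # instead of a long traceback from a later cell.
        raise SystemExit(
            f"Dataset not found at {path}. Re-run the setup cell to download it."
        )
    return path
```

In CI, the same check doubles as the gate: if the asset is absent, execution stops at this cell rather than failing somewhere less obvious.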

Performance & security

  • No new dependencies. ✅
  • The Colab cell stores the NGC API key via os.environ from userdata.get(...) with a getpass fallback — same pattern as the existing NVIDIA-API-key cell. No secrets logged in the notebook source. ✅
  • Output is written via data_designer.create(...) to a named dataset; nothing exotic. ✅
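The lookup-then-prompt pattern for API keys can be expressed with injectable callables, which makes it testable outside Colab. This is a sketch under assumptions: in the notebook, `lookup` would presumably be Colab's `userdata.get` and `prompt` would be `getpass.getpass`, while `resolve_api_key` itself is a hypothetical helper.

```python
import os
from typing import Callable, Optional


def resolve_api_key(
    name: str,
    lookup: Callable[[str], Optional[str]],
    prompt: Callable[[str], str],
) -> str:
    """Set os.environ[name] from `lookup`, falling back to `prompt`."""
    try:
        value = lookup(name)
    except Exception:
        # A missing or inaccessible secret should not abort the cell.
        value = None
    if not value:
        value = prompt(f"Enter {name}: ")
    os.environ[name] = value
    return value
```

The key value is never printed or returned to the notebook output unless the caller chooses to display it.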

mkdocs.yml

  • The added entry "Reproducing & Customizing Nemotron-Personas": notebooks/7-nemotron-personas.ipynb slots in after Tutorials 1–6 in the Tutorials section. Naming style (long descriptive title in quotes) is consistent with the rest of the nav. ✅

Suggestions

  1. Fix the experince typo in PERSONA_SYSTEM_PROMPT (line 515) and in the inline prompt at line 604 of docs/notebook_source/7-nemotron-personas.py, plus the matching block in docs/devnotes/posts/nemotron-personas.md. Re-run make generate-colab-notebooks after.
  2. Consider removing ADDITIONAL_SETUP_CELLS from generate_colab_notebooks.py until a real consumer needs it. The repo style is to avoid speculative abstractions. If kept, the comment is honest about its status, so this is a soft nit.
  3. Consider moving the if SAMPLE_FROM_SDG_PGM: integration sketch into a markdown cell. The cell currently contains dead-but-illustrative code followed by a raise NotImplementedError. As a markdown-only "future-shape" snippet it would be more obviously instructional and avoid confusing readers (or linters) into treating it as live code.
  4. (Optional) Cross-link Tutorial 7 from the dev note's "Try it yourself" section using the notebooks/7-... mkdocs route, in addition to the Colab link. Mirrors how some other dev notes link to both surfaces.
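If suggestion 3 is taken, the code cell could shrink to just the guard, with the integration sketch living in markdown. A minimal version is below; `choose_sampling_path`, the return value, and the default of `False` are assumptions, and only the flag name and the `NotImplementedError` come from the tutorial as described here.

```python
SAMPLE_FROM_SDG_PGM = False  # flag name from the tutorial; default assumed


def choose_sampling_path(use_sdg_pgm: bool) -> str:
    """Return the active sampling path, failing loudly for the unbuilt one."""
    if use_sdg_pgm:
        # No unreachable code follows the raise, so linters have nothing
        # to flag and readers see at once that this branch is a stub.
        raise NotImplementedError(
            "SDG-PGM sampling is not wired into this tutorial yet."
        )
    return "person_sampler"
```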

Test coverage

Per the PR checklist, no unit-test changes are expected (docs + tutorial PR). The author confirms make test passes and that the notebook executes end-to-end via jupytext --to ipynb --execute, which is the right validation surface for this kind of change. The make generate-colab-notebooks regeneration claim is consistent with the diff (the committed .ipynb includes the NGC API-key block and badge cell).

Verdict

Looks good — approve with minor follow-ups. This is a well-structured, well-cited dev note plus a runnable, self-contained tutorial that exercises a real ingestion path (PersonSampler against the NGC artifact). The Colab generator change is small, backward-compatible, and well-commented. The only items worth fixing before merge are the experince typo (cheap) and a judgment call on whether to keep the unused ADDITIONAL_SETUP_CELLS slot. Neither is a blocker; both are easy follow-ups. No risk to engine packages or structural invariants.

@greptile-apps
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR adds the Designing Nemotron-Personas dev note and an accompanying Tutorial 7 notebook, documenting and reproducing the four-stage compound-AI pipeline behind the Nemotron-Personas HF collection (OCEAN sampling → PGM demographics → persona attributes → persona descriptions).

  • New content: devnote in both MkDocs (nemotron-personas.md) and Fern (nemotron-personas.mdx) formats, new Figure.tsx React component for centered images with captions, and the 7-nemotron-personas.py jupytext source plus its committed Colab variant.
  • Infrastructure changes: generate_colab_notebooks.py gains ADDITIONAL_API_KEY_BLOCKS so the NGC API key is appended to the existing NVIDIA API key Colab cell (safe — os, getpass, and userdata are already imported in COLAB_API_KEY_CELL); fern/docs.yml adds two explicit redirects for legacy mkdocs URL shapes that differ from the auto-generated Fern slugs.
  • Side housekeeping: retriever-sdg-toolkit.mdx and have-it-your-way.mdx have their H1 headings moved from the markdown body into the frontmatter title: field for consistency with the new page pattern.

Confidence Score: 5/5

Documentation and tutorial additions with no engine code changes; safe to merge.

All changes are docs, tutorial content, and minor supporting infrastructure. The NGC key block correctly reuses already-imported names from the existing Colab cell. Two pre-existing observations in the tutorial notebook were noted in prior review rounds and are not defects introduced by this PR.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/scripts/generate_colab_notebooks.py | Adds ADDITIONAL_SETUP_CELLS and ADDITIONAL_API_KEY_BLOCKS maps; NGC_API_KEY block appended to the existing COLAB_API_KEY_CELL, which already imports os, getpass, and userdata, so there is no import gap. |
| docs/notebook_source/7-nemotron-personas.py | New tutorial notebook reproducing the Nemotron-Personas pipeline; two pre-existing flagged issues: SAMPLE_FROM_SDG_PGM=True raises NotImplementedError while Next Steps prose suggests flipping it, and age conditionals (>= 6, >= 16) are always true given age_range=[18, 114]. |
| fern/components/Figure.tsx | New React component using dangerouslySetInnerHTML for a fully static CSS string literal; no user input is involved, so it is safe. |
| fern/versions/latest/pages/devnotes/posts/nemotron-personas.mdx | New Fern devnote for Nemotron-Personas; uses the new Figure component, Authors component, and matches the redirect in docs.yml pointing to /dev-notes/designing-nemotron-personas. |
| fern/docs.yml | Adds two redirects: nemotron-personas → designing-nemotron-personas and retrieval-sdg-toolkit → retriever-sdg-toolkit, correctly handling legacy mkdocs URL shapes that differ from the Fern nav-entry slugs. |
| fern/versions/latest/pages/devnotes/posts/retriever-sdg-toolkit.mdx | Removes explicit slug frontmatter and moves title from body H1 to frontmatter for consistency with other devnotes; slug still auto-generates to retriever-sdg-toolkit from the nav entry, matching the new redirect. |
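The "always true" observation about the age conditionals is easy to confirm by enumeration. The snippet assumes the quoted `age_range=[18, 114]` bounds are inclusive integers; variable names are illustrative.

```python
AGE_RANGE = (18, 114)  # bounds quoted in the review; inclusivity assumed

# Every age the sampler could produce under the assumed bounds.
ages = range(AGE_RANGE[0], AGE_RANGE[1] + 1)

# Both guards hold for every possible age, so neither branch can be skipped.
always_at_least_6 = all(age >= 6 for age in ages)
always_at_least_16 = all(age >= 16 for age in ages)
```

Since the minimum sampled age already exceeds both thresholds, the conditionals only matter if the age range is widened later.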

Reviews (9): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..."

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
danecor previously approved these changes May 28, 2026
Contributor

@danecor danecor left a comment

LGTM!

…navs

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
@johnnygreco
Contributor

Hey Yev, leaving a few flags from Codex review here so they are visible before the human review comes through. A human review is still coming.

  • The new-locale / SDG-PGMs path looks overstated in the Dev Note. The post says users can declare a PGMGenerator / PGMGeneratorPluginConfig path, but the tutorial currently marks SAMPLE_FROM_SDG_PGM=True as TODO and raises NotImplementedError. Either implementing that path or framing it as future / advanced work would avoid sending readers toward a non-working branch.

  • The PersonSampler field access example appears inaccurate. The post says {{ person.county }} is available directly, but the notebook maps person.district into a county expression column. A reader copying the post as-is may get a broken Jinja reference.

  • A few “inside Nemotron training” claims probably need tighter sourcing or narrower wording. The Super report supports personas in long-context samples, general tool use, and formal logic, but I could not verify the SSCR / general-chat / instruction-following claims from that report as written. The Japanese model card supports Japanese tool-calling data seeded by Nemotron-Personas-Japan, but not broad instruction-following + general-chat data in the current wording.

  • The PR adds duplicate Dev Note prose under legacy docs/ and wires mkdocs.yml. Current docs guidance says Dev Notes prose should live under fern/, so keeping both copies may create drift unless there is still an intentional legacy publish path here.
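The `county` versus `district` flag can be illustrated with a plain dictionary standing in for a sampled person record. The record contents and the `county_of` helper are hypothetical; the only detail taken from the comment above is that the notebook derives `county` from `person.district`.

```python
# Hypothetical sampled record: it exposes `district`, not `county`.
person = {"district": "Example District"}


def county_of(record: dict[str, str]) -> str:
    """Derive the county value from the record's district field."""
    # Mirrors the notebook's mapping of person.district into a county column.
    return record["district"]


county = county_of(person)
```

A direct `person["county"]` lookup on this record would raise `KeyError`, which is the plain-Python analogue of the broken template reference a reader might hit.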

Narratively, the post reads well: the flow from why personas matter, to how they are used, to how Data Designer builds and customizes them is strong. These are mostly accuracy / maintenance flags rather than a request for a structural rewrite.

Comment thread mkdocs.yml
@3mei
Contributor Author

3mei commented May 29, 2026

@johnnygreco

Re: review from Codex:

"The new-locale / SDG-PGMs path looks overstated in the Dev Note."
Codex was confused, as this landed in SDG-PGMs: https://github.com/NVIDIA-NeMo/SDG-PGMs/tree/main/examples/us_person , https://github.com/NVIDIA-NeMo/SDG-PGMs/tree/main/src/data_designer_plugins

I updated the language in the note to make this a bit more clear.

Rebased to bring the prompt sensitivity and updated mkdocs/fern. Should be good to go.

@3mei 3mei requested a review from johnnygreco May 29, 2026 22:20
johnnygreco previously approved these changes May 29, 2026
Contributor

@johnnygreco johnnygreco left a comment

this is an awesome post @3mei!!! thanks!

Note that I think the blog card is missing. Up to you if you want to add now or in a follow up

…and Fern routing

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
johnnygreco previously approved these changes May 31, 2026
@3mei 3mei merged commit 8bd2313 into main Jun 1, 2026
63 checks passed
4 participants