112 changes: 112 additions & 0 deletions docs/adr/0005-okf-rdf-mapping.md
@@ -0,0 +1,112 @@
# ADR 0005 - OKF → RDF mapping and extractor reuse-vs-fork

- **Status**: Accepted
- **Date**: 2026-06-25
- **Version**: v10
- **Package**: `packages/okf` (`@origintrail-official/dkg-okf`)
- **Related**: ADR 0002 (importer chunking contract), ADR 0003 (code-graph
ontology convergence), `packages/cli/src/extraction/markdown-extractor.ts`,
OKF SPEC v0.1 (`GoogleCloudPlatform/knowledge-catalog`, commit
`d44368c15e38e7c92481c5992e4f9b5b421a801d`)

## Context

Google's Open Knowledge Format (OKF) standardises *how* knowledge is written and
exchanged — portable UTF-8 Markdown with YAML frontmatter and cross-links that
form a graph — but deliberately ships **no** verification, provenance or
ownership layer (OKF SPEC §1, §10). The DKG supplies exactly that. We want to
ingest an OKF bundle as deterministic, owned, verifiable Knowledge Assets,
reconstructing the bundle's cross-concept link graph, and surface it as a
first-class integration.

Two questions had to be settled before writing code:

1. **Reuse or fork the existing Markdown extractor?** The node already ingests
Markdown deterministically via `markdown-extractor.ts`. Convergence — an OKF
import and a natively-ingested Markdown corpus yielding joinable graphs — is a
goal.
2. **What is the exact OKF → RDF predicate mapping?** It must be deterministic
(no LLM), reconcile with the extractor's predicates, and honour OKF's
permissive consumer rules (§9).

## Decision

### Reuse the vocabulary, not the code; add a real Markdown AST

The existing extractor is **regex-based** and resolves only `[[wikilinks]]` →
`schema:mentions`. OKF cross-links are standard Markdown links `[text](path.md)`,
which that extractor does not handle. Importing `extractFromMarkdown` from
`packages/cli` into `packages/okf` would also create a `cli → okf → cli`
dependency cycle (the CLI `okf` command depends on this package).

Therefore:

- **Converge on the extractor's predicate vocabulary** (`rdf:type`,
`schema:name`, `schema:description`, `schema:keywords`, `schema:mentions`,
`http://dkg.io/ontology/hasSection`), re-declaring the same literal IRIs in
`constants.ts` and pinning the convergence with a test. The object encoding
(raw IRIs without angle brackets; literals as `"…"` / `"…"^^<dt>`; blanks as
`_:…`) matches the node's `Quad` term form so output drops straight into
`/api/knowledge-assets/.../wm/write`.
- **Use a real Markdown AST** (`mdast-util-from-markdown`) for link, section and
citation extraction, as OKF §2 requires — this is what lets the importer honour
CommonMark semantics for links inside inline code spans.
- Keep `packages/okf` **runtime-standalone** (local `isSafeIri`, local `Quad`
type, no `dkg-core` runtime import) so the pure mapper is unit-testable in
isolation.
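
The term encoding described above can be sketched as follows. This is a minimal illustration, not the package's code: the `Quad` shape and the helper names `escapeLiteral`, `literal` and `blank` are assumptions, and the sample values are made up. Only the conventions themselves (bare IRIs, `"…"` / `"…"^^<dt>` literals, `_:…` blanks) come from this ADR.

```typescript
// Hypothetical local Quad shape; the real type in packages/okf may differ.
type Quad = { subject: string; predicate: string; object: string; graph?: string };

const XSD_DATE_TIME = "http://www.w3.org/2001/XMLSchema#dateTime";

// Escape the characters that must be escaped inside an N-Quads string literal.
function escapeLiteral(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

// Literals are encoded as "…" or "…"^^<datatype>.
function literal(value: string, datatype?: string): string {
  const quoted = `"${escapeLiteral(value)}"`;
  return datatype ? `${quoted}^^<${datatype}>` : quoted;
}

// Blank nodes are encoded as _:label.
function blank(label: string): string {
  return `_:${label}`;
}

// Illustrative quads only; IRIs appear raw, without angle brackets.
const exampleQuads: Quad[] = [
  { subject: "urn:okf:tables/blocks", predicate: "http://schema.org/name", object: literal("Blocks") },
  {
    subject: "urn:okf:tables/blocks",
    predicate: "http://schema.org/dateModified",
    object: literal("2024-01-01T00:00:00Z", XSD_DATE_TIME),
  },
  { subject: blank("s1"), predicate: "http://schema.org/name", object: literal("Schema") },
];
```

Keeping the helpers free of any `dkg-core` import mirrors the runtime-standalone goal: the encoding can be unit-tested without a node.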

### Locked OKF → RDF mapping

| OKF frontmatter / construct | RDF predicate | Object | Notes |
|---|---|---|---|
| `type` (required) | `rdf:type` | full IRI as-is, else `http://schema.org/<PascalCase>` | `BigQuery Dataset` → `schema:BigQueryDataset` |
| `title` | `schema:name` | literal | |
| `description` | `schema:description` | literal | |
| `tags` (list) | `schema:keywords` | one literal per tag | converges with the extractor's hashtag handling |
| `timestamp` (ISO 8601) | `schema:dateModified` | literal `^^xsd:dateTime` | OKF defines it as last-modified |
| `resource` (canonical URI) | `schema:url` | IRI | chosen over `dcterms:source`; absent for abstract concepts |
| producer-defined keys | `http://schema.org/<camelCase>` | typed literal / IRI | **preserved, never dropped** (§4.1, §9) |
| concept link (§5) | `schema:mentions` | target concept IRI | **one untyped directed edge** — no FK/join inference (§5.3) |
| `# Citations` URL (§8) | `schema:citation` | URL IRI | numbered + bare-bullet styles; distinct from edges |
| body headings (`# …`) | `dkg:hasSection` | section blank node + `schema:name` | OKF titles live in frontmatter, so body H1s are genuine sections |
| folder hierarchy | `schema:isPartOf` | parent IRI | **off by default** (directories are not concepts) |
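
The `type` and producer-defined-key rows can be sketched as two small pure functions. This is a sketch under stated assumptions: the function names, the word-splitting rule (whitespace, `_`, `-`) and the full-IRI test are hypothetical; the ADR itself only fixes the `BigQuery Dataset` → `schema:BigQueryDataset` example and the PascalCase / camelCase targets.

```typescript
const SCHEMA = "http://schema.org/";

// Assumption: words are separated by whitespace, "_" or "-".
function words(input: string): string[] {
  return input.split(/[\s_-]+/).filter((w) => w.length > 0);
}

function capitalise(word: string): string {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

// Assumption: "<scheme>:<rest>" with no whitespace counts as a full IRI.
function isFullIri(value: string): boolean {
  return /^[A-Za-z][A-Za-z0-9+.-]*:\S+$/.test(value);
}

// `type` frontmatter value -> rdf:type object IRI.
function typeToIri(type: string): string {
  const trimmed = type.trim();
  if (isFullIri(trimmed)) return trimmed; // full IRI is passed through as-is
  return SCHEMA + words(trimmed).map(capitalise).join("");
}

// Producer-defined frontmatter key -> predicate IRI in camelCase.
function keyToPredicate(key: string): string {
  const parts = words(key);
  if (parts.length === 0) throw new Error("empty frontmatter key");
  const [first, ...rest] = parts as [string, ...string[]];
  return SCHEMA + first.charAt(0).toLowerCase() + first.slice(1) + rest.map(capitalise).join("");
}

console.log(typeToIri("BigQuery Dataset")); // http://schema.org/BigQueryDataset
console.log(keyToPredicate("row_count"));   // http://schema.org/rowCount (hypothetical key)
```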

### IRI derivation

Concept subject IRI = `urn:okf:<conceptId>` (configurable base), a pure function
of the concept ID. Same bundle ⇒ same IRIs. This is the RDF subject; the on-chain
UAL is assigned by the node at VM publish (RFC-43 pre-knowable UALs are draft) and
is a distinct identifier.
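
As a sketch of what "pure function of the concept ID" means in practice, the derivation can be as small as the following. The function name and the `https://example.org/okf/` alternative base are hypothetical; only the `urn:okf:` default comes from this ADR.

```typescript
// Subject IRI is derived from the concept ID alone: no clock, no randomness, no I/O.
function conceptIri(conceptId: string, base: string = "urn:okf:"): string {
  return base + conceptId;
}

console.log(conceptIri("tables/blocks")); // urn:okf:tables/blocks
console.log(conceptIri("tables/blocks", "https://example.org/okf/")); // https://example.org/okf/tables/blocks
```

Because nothing but the ID and the configured base enters the function, two importers that agree on the base will agree on every subject IRI.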

### Edge cases (the judgement calls)

- **Links inside inline code spans** are **not** edges by default (CommonMark:
code-span content is literal text). Recorded as `codeSpanLinks` + warned.
`--include-code-span-links` opts in. This is the only place the prompt's
illustrative acceptance list (`outputs → transactions, inputs`) and the
implementation diverge — by design, and documented here and in `CONTEXT.md`.
- **Broken links** (resolved target not in the bundle) → warning, never fatal
(§5.3, §9); the target may be not-yet-written knowledge.
- **Reserved files** (`index.md`, `log.md`) are never minted as KAs; `okf_version`
is read only from a root `index.md` (§11).
- **Conformance** is permissive: only unparseable frontmatter or a missing
non-empty `type` make a bundle non-conformant (§9); all other irregularities
are tolerated.
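
Two of these rules, reserved files and permissive conformance, are simple enough to sketch directly. The names below are hypothetical, reserved files are matched by basename at any depth (an assumption), and the check is written per concept file, whereas the ADR states conformance at bundle level.

```typescript
// null models frontmatter that could not be parsed at all.
type Frontmatter = Record<string, unknown> | null;

const RESERVED_FILES = new Set(["index.md", "log.md"]);

// Reserved files are never minted as Knowledge Assets.
function isReserved(relativePath: string): boolean {
  const name = relativePath.split("/").pop() ?? "";
  return RESERVED_FILES.has(name);
}

// Only unparseable frontmatter or a missing / empty `type` breaks conformance.
function isConformant(frontmatter: Frontmatter): boolean {
  if (frontmatter === null) return false;
  const type = frontmatter["type"];
  return typeof type === "string" && type.trim().length > 0;
}

console.log(isReserved("tables/index.md"));                // true
console.log(isConformant({ type: "BigQuery Table" }));     // true
console.log(isConformant({ title: "no type here" }));      // false
```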

### Determinism

The mapper is pure and LLM-free. Quads are serialised to canonical N-Quads
(deduped + lexically sorted), so the same bundle yields byte-identical output and
independent importers converge on the same graph (consistent with ADR 0002).
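
The dedupe-and-sort step can be sketched as below. This assumes the input is already a list of serialised N-Quads lines with deterministic blank-node labels, and that "lexically sorted" means plain code-unit order; it is not a general RDF dataset canonicalisation algorithm. The function name is hypothetical.

```typescript
// Dedupe, sort and join N-Quads lines so that input order never affects the output bytes.
function canonicalNQuads(lines: string[]): string {
  const unique = Array.from(
    new Set(lines.map((line) => line.trim()).filter((line) => line.length > 0)),
  );
  unique.sort(); // default sort compares UTF-16 code units, independent of locale
  return unique.map((line) => line + "\n").join("");
}

const a = canonicalNQuads([
  '<urn:okf:tables/inputs> <http://schema.org/mentions> <urn:okf:tables/transactions> .',
  '<urn:okf:tables/blocks> <http://schema.org/name> "blocks" .',
]);
const b = canonicalNQuads([
  '<urn:okf:tables/blocks> <http://schema.org/name> "blocks" .',
  '<urn:okf:tables/inputs> <http://schema.org/mentions> <urn:okf:tables/transactions> .',
  '<urn:okf:tables/blocks> <http://schema.org/name> "blocks" .',
]);
console.log(a === b); // true
```

Avoiding `localeCompare` matters here: locale-aware collation can differ between machines, which would break byte-identical output.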

## Consequences

- An OKF import and a native Markdown import share predicates and therefore join
naturally in SPARQL.
- The mapper is a clean, isolated, unit-tested unit; the CLI command and the node
are thin wrappers.
- `export.ts` is the clean inverse (graph-faithful, not byte-faithful): bodies are
regenerated structurally, so round-trip equivalence is asserted over the
semantic (non-`hasSection`) quad set.
- We do **not** infer typed FK/join semantics from untyped OKF links; surfacing
that would require a separate, clearly-labelled Layer-2 enrichment pass.
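
The round-trip assertion mentioned above could look roughly like this. It is a sketch, not the test in the package: the `Quad` shape and function names are hypothetical, and dropping quads whose subject is a section blank node (in addition to the `hasSection` quads themselves) is an assumption about what "semantic quad set" covers.

```typescript
type Quad = { subject: string; predicate: string; object: string };

const HAS_SECTION = "http://dkg.io/ontology/hasSection";

// Reduce a graph to the keys that must survive import -> export -> import.
function semanticKeys(quads: Quad[]): Set<string> {
  const keys = new Set<string>();
  for (const q of quads) {
    if (q.predicate === HAS_SECTION) continue;  // structural, regenerated on export
    if (q.subject.startsWith("_:")) continue;   // assumption: section blank-node quads are structural too
    keys.add(JSON.stringify([q.subject, q.predicate, q.object]));
  }
  return keys;
}

// Set equality over the semantic keys.
function roundTripEquivalent(before: Quad[], after: Quad[]): boolean {
  const x = semanticKeys(before);
  const y = semanticKeys(after);
  return x.size === y.size && Array.from(x).every((key) => y.has(key));
}
```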
227 changes: 227 additions & 0 deletions docs/integrations/okf.md
@@ -0,0 +1,227 @@
---
status: current
version: v10
title: "OKF on the DKG: a trust-and-permanence backend for Google's Open Knowledge Format"
---

# OKF on the DKG: a trust-and-permanence backend for Google's Open Knowledge Format

*The same portable OKF Markdown — now owned, verifiable, and shareable across agents.*

## 1. The thesis: complementary, not competing

Google's **Open Knowledge Format (OKF)** is a beautifully simple way to write and
exchange knowledge: a bundle is just a directory tree of UTF-8 Markdown files,
each with a little YAML frontmatter and ordinary Markdown links between them. The
links form a graph; the format is human- and agent-readable; it travels as a git
repo or a tarball. Crucially, OKF **deliberately ships no verification, provenance
or ownership layer** (OKF SPEC §1, §10). That is a feature — it keeps the format
portable — but it leaves a gap: nothing about an OKF bundle tells you *who* wrote
it, whether it has been tampered with, or who owns it.

The **OriginTrail Decentralized Knowledge Graph (DKG)** supplies exactly that
missing half: cryptographically provenanced, owned, shareable Knowledge Assets,
with a three-layer memory model (private → shared → on-chain). OKF and the DKG
solve adjacent halves of one problem.

This integration makes the DKG the **trust-and-permanence backend for OKF**. The
one-line claim, made literally true in the code: *the same portable OKF Markdown,
now owned, verifiable, and shareable across agents.*

## 2. What the integration does, mechanically

`dkg okf import <bundleDir>` ingests an OKF bundle into a DKG Context Graph as
Knowledge Assets, reconstructing the bundle's cross-concept link graph. It is
**pure and deterministic — no LLM** — so the same bundle always yields identical
triples and IRIs, and independent importers converge on the same graph.

It works in **two passes** (`packages/okf`):

- **Pass 1 — index the bundle.** Walk the tree; separate concept files from
reserved `index.md` / `log.md` (which are *never* minted as Knowledge Assets,
OKF §3.1/§6/§7); parse each concept's frontmatter; derive a stable subject IRI
per concept **from its concept ID** (`tables/blocks` → `urn:okf:tables/blocks`);
build a `conceptId → IRI` map. Read `okf_version` from the root `index.md` if
present (§11).
- **Pass 2 — extract + link.** For each concept, map frontmatter + body to RDF,
then resolve its Markdown links against the Pass-1 map into untyped directed
edges.
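
The two passes can be sketched without touching the filesystem. Everything named here is hypothetical (function names, the sample paths, treating the concept ID as the bundle-relative path minus `.md`); it only illustrates how a Pass-1 map lets Pass 2 turn relative links into IRIs and leave unresolved ones as non-fatal misses.

```typescript
// Collapse "." and ".." segments in a bundle-relative path.
function normalise(path: string): string {
  const out: string[] = [];
  for (const part of path.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") { out.pop(); continue; }
    out.push(part);
  }
  return out.join("/");
}

// Pass 1: conceptId -> IRI for every Markdown file that is not reserved.
function buildIndex(paths: string[], base: string = "urn:okf:"): Map<string, string> {
  const index = new Map<string, string>();
  for (const p of paths) {
    const name = p.split("/").pop() ?? "";
    if (!p.endsWith(".md") || name === "index.md" || name === "log.md") continue;
    const id = p.slice(0, -3); // assumption: concept ID = path without ".md"
    index.set(id, base + id);
  }
  return index;
}

// Pass 2: resolve one link found in `sourcePath`; undefined means "not a concept edge".
function resolveLink(sourcePath: string, href: string, index: Map<string, string>): string | undefined {
  if (/^[A-Za-z][A-Za-z0-9+.-]*:/.test(href)) return undefined; // external URL
  const target = href.split("#")[0] ?? "";
  if (!target.endsWith(".md")) return undefined;
  const slash = sourcePath.lastIndexOf("/");
  const dir = slash === -1 ? "" : sourcePath.slice(0, slash);
  const id = normalise(dir + "/" + target).slice(0, -3);
  return index.get(id); // undefined here is a broken link: warn, never fail
}

const index = buildIndex(["index.md", "crypto_bitcoin.md", "tables/blocks.md", "tables/transactions.md"]);
console.log(resolveLink("tables/transactions.md", "blocks.md", index));            // urn:okf:tables/blocks
console.log(resolveLink("tables/transactions.md", "../crypto_bitcoin.md", index)); // urn:okf:crypto_bitcoin
```

Each resolved IRI would then become the object of one `schema:mentions` quad whose subject is the source concept's IRI.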

A key design choice: rather than invent a parallel vocabulary, we **converge on
the predicate vocabulary** the node's existing deterministic Markdown extractor
already uses, so an OKF import and a natively-ingested Markdown corpus produce
joinable graphs. The extractor's code is not reused: it is regex-based and only
handles `[[wikilinks]]`, whereas OKF cross-links are standard Markdown links
`[text](path.md)`. Link resolution is therefore new work, and it uses a **real
Markdown AST**, not a regex, which is what lets it honour CommonMark semantics
(see the `outputs.md` edge case below).

### The locked OKF → RDF mapping

| OKF frontmatter / construct | RDF predicate | Notes |
|---|---|---|
| `type` (required) | `rdf:type` | full IRI as-is, else `http://schema.org/<PascalCase>` (`BigQuery Dataset` → `schema:BigQueryDataset`) |
| `title` | `schema:name` | |
| `description` | `schema:description` | |
| `tags` (list) | `schema:keywords` | one triple per tag |
| `timestamp` (ISO 8601) | `schema:dateModified` | typed `xsd:dateTime`; OKF defines it as last-modified |
| `resource` (canonical URI) | `schema:url` | the underlying asset's URI; absent for abstract concepts |
| producer-defined keys | `http://schema.org/<camelCase>` | **preserved, never dropped** (§4.1, §9) |
| concept link (§5) | `schema:mentions` | **one untyped directed edge** |
| `# Citations` URL (§8) | `schema:citation` | distinct from concept edges |
| body headings | `dkg:hasSection` | |

**Concept IDs become Knowledge Asset IRIs** deterministically: `urn:okf:<conceptId>`.
This is the RDF subject. The on-chain UAL (`did:dkg:<chainId>/<kasAddress>/<number>`) is a
*different* identifier, assigned by the node only when an asset is published to
Verifiable Memory.

**OKF's untyped cross-links become the relationship graph.** A link from concept A
to concept B asserts *a* relationship; OKF §5.3 says the *kind* (foreign key,
joins-with, depends-on) lives in the surrounding prose, not the link, and that
graph consumers "treat all links as directed edges of an untyped relationship." So
every resolved concept link maps to one `schema:mentions` edge. **We do not invent
typed FK/join semantics** — that would be fabrication. What you get is exactly what
OKF asserts: a directed, untyped relationship graph.

To be concrete about what is *not* inferred: the integration reconstructs
the link graph, not its semantics. If the prose says "join `inputs` to
`transactions` on `transaction_hash`", the importer records that `inputs` mentions
`transactions` — it does not synthesise a typed `joinsOn` predicate. Lifting
untyped edges into typed relationships is a separate, clearly-labelled Layer-2
concern, kept out of the deterministic importer.

## 3. The lifecycle, end to end, with the Bitcoin bundle

The proof artifact is Google's `crypto_bitcoin` OKF bundle — the public
`bigquery-public-data.crypto_bitcoin` dataset (blocks, transactions, inputs,
outputs) produced by the open-source `bitcoin-etl` pipeline. Its value is precisely
the **cross-table foreign-key relationships expressed in prose** — the inter-concept
link graph the importer must reconstruct.

The three memory layers are the backbone of the story:

- **Working Memory (WM)** — private, free, reversible.
- **Shared Working Memory (SWM)** — shared, free, gossip-replicated, TTL-bounded.
- **Verifiable Memory (VM)** — on-chain, permanent, costs real TRAC.

### Launch a mainnet node and attach a Hermes agent

A node operator runs `dkg init` (targeting a mainnet chain), `dkg start`,
`dkg status` / `dkg doctor`, then stands up a **Hermes agent** bound to the node and
records its `agentAddress` (`GET /api/agent/identity`). This is the agent that will
later reason over the shared knowledge.

### Import into a Context Graph (Working Memory)

```bash
dkg okf import packages/okf/test/fixtures/crypto_bitcoin \
--context-graph-id okf-crypto-bitcoin --create-context-graph
```

The reconstructed graph, all in **WM (private, free)**:

- **5 Knowledge Assets** — the dataset (`type: BigQuery Dataset`) and four tables
(`type: BigQuery Table`). Zero assets for the three reserved `index.md` files.
- **The dataset → four tables**, and the **cross-table references that were only
prose** now first-class edges:
- `transactions` → `crypto_bitcoin`, `blocks`, `inputs`, `outputs`
- `inputs` → `crypto_bitcoin`, `transactions`, `outputs`
- `blocks` → *(no concept edges; external citations only)*
- **Citations** captured as `schema:citation` (the `bitcoin-etl` repo, the GCP blog
post, the BigQuery API URIs), parsed from **both** the numbered `[1] [text](url)`
style and the bare-bullet `- https://…` style present in the bundle.
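
The two citation styles can be illustrated with a simplified, line-based sketch. The package extracts citations from a Markdown AST, so this is a stand-in to show the two shapes rather than the actual implementation; the function name and the sample URLs are hypothetical, and URLs containing parentheses are not handled.

```typescript
// Extract citation URLs from the lines that follow a "# Citations" heading.
function citationUrls(sectionLines: string[]): string[] {
  const urls: string[] = [];
  for (const raw of sectionLines) {
    const line = raw.trim();
    // Numbered style: [1] [text](https://example.org/a)
    const numbered = /^\[\d+\]\s+\[[^\]]*\]\((https?:\/\/[^\s)]+)\)/.exec(line);
    // Bare-bullet style: - https://example.org/b
    const bullet = /^[-*]\s+(https?:\/\/\S+)$/.exec(line);
    const url = numbered?.[1] ?? bullet?.[1];
    if (url !== undefined && !urls.includes(url)) urls.push(url);
  }
  return urls;
}

console.log(
  citationUrls([
    "[1] [ETL pipeline](https://example.org/etl)",
    "- https://example.org/api",
    "Plain prose is ignored.",
  ]),
); // [ 'https://example.org/etl', 'https://example.org/api' ]
```

Each returned URL would map to one `schema:citation` quad on the citing concept, separate from its `schema:mentions` edges.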

A nice, honest detail: `outputs.md`'s only two concept links are written *inside
backticks* — `` `[transactions](transactions.md)` ``. CommonMark treats code-span
content as literal text, so **by default these are not edges** (the importer warns
and records them as `codeSpanLinks`; `--include-code-span-links` opts in). This is
the mechanism-first reading, and it is the one place where the implementation
deliberately diverges from a naive link count.
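
To make the behaviour tangible, here is a dependency-free sketch that separates ordinary links from links written inside inline code spans. It is deliberately a stand-in: the importer relies on a Markdown AST, while this hand-rolled scanner only matches backtick runs of equal length and ignores fenced code blocks and backslash escapes. All names are hypothetical.

```typescript
// Replace every inline code span (opening and closing backtick runs of equal length) with spaces.
function maskCodeSpans(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    if (text.charAt(i) !== "`") { out += text.charAt(i); i++; continue; }
    let n = 0;
    while (text.charAt(i + n) === "`") n++;
    // Look for a closing run of exactly n backticks.
    let j = i + n;
    let close = -1;
    while (j < text.length) {
      if (text.charAt(j) !== "`") { j++; continue; }
      let m = 0;
      while (text.charAt(j + m) === "`") m++;
      if (m === n) { close = j; break; }
      j += m;
    }
    if (close === -1) { out += text.slice(i, i + n); i += n; continue; } // unmatched: literal backticks
    out += " ".repeat(close + n - i);
    i = close + n;
  }
  return out;
}

// Links that survive masking are edges; links that vanish were inside a code span.
function classifyLinks(text: string): { edges: string[]; codeSpanLinks: string[] } {
  const masked = maskCodeSpans(text);
  const link = /\[[^\]]*\]\(([^)\s]+)\)/g;
  const edges: string[] = [];
  const codeSpanLinks: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = link.exec(text)) !== null) {
    const visible = masked.slice(m.index, m.index + m[0].length) === m[0];
    (visible ? edges : codeSpanLinks).push(m[1] ?? "");
  }
  return { edges, codeSpanLinks };
}

const result = classifyLinks("See `[transactions](transactions.md)` and [inputs](inputs.md).");
console.log(result); // edges: ["inputs.md"], codeSpanLinks: ["transactions.md"]
```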

Confirm it via SPARQL over `view: working-memory`:

```sparql
SELECT ?s ?o WHERE { ?s <http://schema.org/mentions> ?o }
```
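
For an offline check, for instance against quads parsed from a dry run, the same selection can be made in memory. This is a minimal sketch assuming a hypothetical `Quad` shape with raw IRI strings; it does not call any node API.

```typescript
type Quad = { subject: string; predicate: string; object: string };

const MENTIONS = "http://schema.org/mentions";

// Group schema:mentions targets by source concept, preserving input order.
function mentionsBySubject(quads: Quad[]): Map<string, string[]> {
  const out = new Map<string, string[]>();
  for (const q of quads) {
    if (q.predicate !== MENTIONS) continue;
    const targets = out.get(q.subject) ?? [];
    targets.push(q.object);
    out.set(q.subject, targets);
  }
  return out;
}
```

Comparing the resulting map with the edge list above is a quick way to confirm that a concept such as `blocks` contributes no `schema:mentions` edges.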

### Seal and share to Shared Working Memory

```bash
dkg okf import … --share # wm/finalize, then swm/share entities:"all"
```

The bundle becomes a **shared Context Graph** other agents can reach — still
**free**, still carrying **no on-chain verification**, but now sealed and
*publish-ready*.

### Issue a join invitation; a second peer verifies

The curator issues a join invitation (`dkg context-graph invite`), allows the
joining agent (`dkg context-graph add-agent`), and a **second node** subscribes
(`dkg subscribe`) and runs the same SPARQL over `shared-working-memory` —
independently reading the Bitcoin knowledge and traversing the cross-table
relationships. This is the "shared memory for multi-agent AI" claim made concrete,
entirely in free SWM.

### The deferred Verifiable Memory promotion

Because the assets are sealed and publish-ready, promotion to VM is one step away —
but it **waits for explicit operator go-ahead**. It costs real TRAC + native gas
and is irreversible; on mainnet there is no faucet. The operator confirms wallet
balances first (`dkg wallet` / `/api/wallets/balances`), publishes a **single**
asset first to observe real cost and validate the on-chain path, records the
returned UAL (`did:dkg:<chainId>/<kasAddress>/<number>`) and `txHash`, then
publishes the rest. UALs and txHashes are recorded **only if this step is actually
run.** Until then, the deliverable is the shared, peer-verified, agent-queried
Context Graph in SWM.

### The Hermes agent reasons over the verifiable knowledge

The Hermes agent answers a natural-language question over the shared graph — *"what
does the `transactions` table reference?"* — and returns the four expected targets,
consuming OKF-derived, provenance-bearing knowledge. And `dkg okf export
okf-crypto-bitcoin ./out` regenerates an OKF bundle from the shared Context Graph
whose `schema:mentions` structure matches Google's own `viz.html` — the literal
"recreated with the integration" artifact, both ways.

## 4. Why this matters

Verifiable, owned, **shared** memory for multi-agent AI:

- Any agent on the network can subscribe to the shared Context Graph and reason
over provenance-bearing knowledge — not a private copy, but a shared one.
- Published assets are permanent and ownable; provenance and authorship are
cryptographic, not conventional.
- An OKF bundle authored *anywhere* — by a human, by Google's reference agent, by
another LLM — gains a trust-and-permanence backend for free, without changing the
format. The same portable Markdown is now also a verifiable, owned graph.

## 5. Honest limitations

- **OKF v0.1 is structural-only.** Links are untyped; the integration reconstructs
the relationship graph but **does not invent semantic FK types**. Typed
relationship lifting would be a separate Layer-2 enrichment pass.
- **VM publishing costs real TRAC** and is irreversible; it is deferred and gated,
never automatic.
- **SWM is TTL-bounded.** It is shared and free but not permanent; peers that join
late may not see old SWM content. Only VM is permanent.
- **Export is graph-faithful, not byte-faithful.** Free-form prose is not
recoverable from triples, so `export` regenerates bodies structurally; round-trip
equivalence is asserted over the semantic graph.
- **Deeper schema-level extraction is out of scope.** The importer captures
`# Schema` sections as sections, not as typed column models.

## 6. Reproduce it

The full offline gate:

```bash
pnpm --filter @origintrail-official/dkg-okf test # 60+ deterministic golden/edge/round-trip tests
dkg okf import <bundleDir> --dry-run --print-nquads # the mapping, offline, byte-stable
```

The live lifecycle (mainnet node → Hermes agent → WM → SWM → join invitation →
second-peer verification → rendered graph, with VM promotion held as the deferred
capstone) is in **`packages/okf/DEMO.md`**. The mapping and the reuse-vs-fork
decision are in **`docs/adr/0005-okf-rdf-mapping.md`**; the package vocabulary and
every judgement call are in **`packages/okf/CONTEXT.md`**.
2 changes: 1 addition & 1 deletion package.json
@@ -9,7 +9,7 @@
"scripts": {
"build": "node scripts/build.mjs",
"build:packages": "turbo build",
- "build:runtime:packages": "pnpm -r --filter @origintrail-official/dkg-core --filter @origintrail-official/dkg-storage --filter @origintrail-official/dkg-query --filter @origintrail-official/dkg-publisher --filter @origintrail-official/dkg-chain --filter @origintrail-official/dkg-epcis --filter @origintrail-official/dkg-random-sampling --filter @origintrail-official/dkg-agent --filter @origintrail-official/dkg-graph-viz --filter @origintrail-official/dkg-node-ui --filter @origintrail-official/dkg-adapter-openclaw --filter @origintrail-official/dkg-adapter-hermes --filter @origintrail-official/kafka-plugin --filter @origintrail-official/dkg run build",
+ "build:runtime:packages": "pnpm -r --filter @origintrail-official/dkg-core --filter @origintrail-official/dkg-storage --filter @origintrail-official/dkg-query --filter @origintrail-official/dkg-publisher --filter @origintrail-official/dkg-chain --filter @origintrail-official/dkg-epcis --filter @origintrail-official/dkg-okf --filter @origintrail-official/dkg-ip-oracle --filter @origintrail-official/dkg-random-sampling --filter @origintrail-official/dkg-agent --filter @origintrail-official/dkg-graph-viz --filter @origintrail-official/dkg-node-ui --filter @origintrail-official/dkg-adapter-openclaw --filter @origintrail-official/dkg-adapter-hermes --filter @origintrail-official/kafka-plugin --filter @origintrail-official/dkg run build",
"build:runtime": "pnpm run build:runtime:packages && pnpm --filter @origintrail-official/dkg-node-ui run build:ui",
"test": "turbo test",
"test:watch": "vitest --config vitest.config.ts",