feat(okf): Google OKF → DKG integration — import OKF bundles as deterministic, provenance-bearing Knowledge Assets #1331
Zigoljube wants to merge 2 commits into OriginTrail:main from Zigoljube:feat/okf-integration
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,107 @@ | ||
| import { Command } from 'commander'; | ||
| import { toErrorMessage } from '@origintrail-official/dkg-core'; | ||
| import { | ||
| writePatentBundle, | ||
| ingestPatentExport, | ||
| type PatentGenOptions, | ||
| } from '@origintrail-official/dkg-ip-oracle'; | ||
| | ||
| /** | ||
| * `dkg ip-oracle` — engineering harness for the IP / Patent Context Oracle. | ||
| * | ||
| * `generate` emits a deterministic, **synthetic** Google-Patents-shaped OKF | ||
| * bundle to disk (no BigQuery / GCP dependency), which is then ingested into a | ||
| * PRIVATE Context Graph via `dkg okf import --private`. The data is SIMULATED — | ||
| * every concept stamps `source: … [SIMULATED]` and a CC BY 4.0 licence so the | ||
| * downstream redaction guard and the public article stay honest about what is | ||
| * real vs. generated. | ||
| * | ||
| * This command writes files only; it never touches the node and spends nothing. | ||
| */ | ||
| export function registerIpOracleCommand(program: Command): void { | ||
| const cmd = program | ||
| .command('ip-oracle') | ||
| .description('IP / Patent Context Oracle tooling (synthetic OKF patent corpora)'); | ||
| | ||
| cmd | ||
| .command('generate <outDir>') | ||
| .description('Generate a deterministic synthetic patent OKF bundle (no BigQuery needed)') | ||
| .requiredOption('--count <n>', 'Number of patent concepts to emit', (v: string) => parseInt(v, 10)) | ||
| .option('--cpc-class <class>', 'CPC subclass tag, e.g. H04L', 'H04L') | ||
| .option('--seed <n>', 'PRNG seed (same seed ⇒ identical corpus)', (v: string) => parseInt(v, 10), 42) | ||
| .option('--citations-per-patent <n>', 'Max backward citations per patent', (v: string) => parseInt(v, 10)) | ||
| .option('--family-size <n>', 'Patents per simulated family', (v: string) => parseInt(v, 10)) | ||
| .option('--retrieval-date <iso>', 'Stamped retrieval / modified date (YYYY-MM-DD)') | ||
| .action((outDir: string, opts: Record<string, unknown>) => { | ||
| try { | ||
| const count = Number(opts.count); | ||
| if (!Number.isInteger(count) || count <= 0) { | ||
| console.error('--count must be a positive integer.'); | ||
| process.exit(2); | ||
| } | ||
| const genOpts: PatentGenOptions = { | ||
| cpcClass: String(opts.cpcClass ?? 'H04L'), | ||
| count, | ||
| seed: Number(opts.seed ?? 42), | ||
| ...(opts.citationsPerPatent != null | ||
| ? { citationsPerPatent: Number(opts.citationsPerPatent) } | ||
| : {}), | ||
| ...(opts.familySize != null ? { familySize: Number(opts.familySize) } : {}), | ||
| ...(opts.retrievalDate ? { retrievalDate: String(opts.retrievalDate) } : {}), | ||
| }; | ||
| const summary = writePatentBundle(genOpts, outDir); | ||
| console.log( | ||
| JSON.stringify( | ||
| { | ||
| mode: 'generate', | ||
| ...summary, | ||
| files: summary.conceptCount + 3, // patents + patents/index + index + log | ||
| synthetic: true, | ||
| note: | ||
| 'Synthetic SIMULATED corpus. Next: dkg okf import <outDir> ' + | ||
| '--context-graph-id <cg> --private --create-context-graph', | ||
| }, | ||
| null, | ||
| 2, | ||
| ), | ||
| ); | ||
| } catch (err) { | ||
| console.error(toErrorMessage(err)); | ||
| process.exit(1); | ||
| } | ||
| }); | ||
| | ||
| cmd | ||
| .command('ingest <exportFile> <outDir>') | ||
| .description( | ||
| 'Map a Google Patents Public Data NDJSON export (real data, run the BigQuery ' + | ||
| 'query yourself) into OKF bundle(s). Offline, deterministic, no GCP SDK.', | ||
| ) | ||
| .option('--shard-by-cpc', 'Write one self-contained OKF bundle per CPC subclass (recommended at scale)') | ||
| .option('--retrieval-date <iso>', 'Stamped retrieval / modified date (YYYY-MM-DD)') | ||
| .action(async (exportFile: string, outDir: string, opts: Record<string, unknown>) => { | ||
| try { | ||
| const summary = await ingestPatentExport(exportFile, outDir, { | ||
| shardByCpc: Boolean(opts.shardByCpc), | ||
| ...(opts.retrievalDate ? { retrievalDate: String(opts.retrievalDate) } : {}), | ||
| }); | ||
| console.log( | ||
| JSON.stringify( | ||
| { | ||
| mode: 'ingest', | ||
| ...summary, | ||
| synthetic: false, | ||
| note: | ||
| 'Real Google Patents Public Data (CC BY 4.0). Next, per shard: dkg okf ' + | ||
| 'import <shardDir> --context-graph-id <cg> --private --create-context-graph', | ||
| }, | ||
| null, | ||
| 2, | ||
| ), | ||
| ); | ||
| } catch (err) { | ||
| console.error(toErrorMessage(err)); | ||
| process.exit(1); | ||
| } | ||
| }); | ||
| } | ||
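
A note on the numeric option parsers in this file: `parseInt(v, 10)` truncates fractions and ignores trailing junk, so `--count 2.5` or `--count 3abc` would reach the `Number.isInteger(count)` check as `2` or `3` and pass it. A non-numeric `--seed` becomes `NaN`, which this command does not validate; how the generator handles that is not visible in this diff. The sketch below shows one possible stricter parser. `parseStrictInt` is a hypothetical name and is not part of this PR.

```typescript
// Hypothetical helper, not part of the PR: a stricter stand-in for the
// `(v) => parseInt(v, 10)` option parsers used in the command above.
function parseStrictInt(raw: string): number {
  const trimmed = raw.trim();
  // Accept only an optional sign followed by decimal digits.
  if (!/^[+-]?\d+$/.test(trimmed)) {
    return Number.NaN;
  }
  return Number(trimmed);
}

// parseInt silently truncates or drops trailing characters:
console.log(parseInt('2.5', 10)); // 2
console.log(parseInt('3abc', 10)); // 3

// The strict variant returns NaN for both, so a later
// Number.isInteger(count) check would reject them:
console.log(parseStrictInt('2.5')); // NaN
console.log(parseStrictInt('3abc')); // NaN
console.log(parseStrictInt('7')); // 7
```

Returning `NaN` rather than throwing keeps the existing `--count must be a positive integer.` branch and its exit code 2 as the single place where bad input is reported.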
🟡 Issue: `ip-oracle generate` has only library tests, not command tests

What's wrong
The generator library is tested, but the new user-facing command that parses options, validates `--count`, calls the generator, and emits the JSON summary is not. A broken CLI registration, bad commander parser, or wrong exit code could ship while the package unit tests still pass.

Example
`dkg ip-oracle generate /tmp/out --count 2 --seed 7` should exit 0, print JSON with `mode: "generate"`, and create the OKF files. `dkg ip-oracle generate /tmp/out --count 0` should exit 2. Those user-facing behaviors are not currently asserted.

Suggested direction
Cover the registered CLI path, not just the underlying generator library, so option parsing and exit behavior are locked down.

For Agents
Add a CLI-level test in `packages/cli/test` that runs the compiled CLI with a temp output directory. Assert success output and generated files for a small count, and assert the invalid-count exit path. This can run without a daemon because the command only writes files.
🟡 Issue: The new `ip-oracle` CLI surface has no end-to-end CLI coverage

What's wrong
The package-level tests validate the generator and ingest mapper, but they do not verify the behavior users will run from the `dkg` binary. That leaves the new command wrapper and option plumbing unverified.

Example
A CLI regression such as a broken command registration, wrong option name, bad `--count` handling, or malformed JSON summary would not be caught by `patent-generator.test.ts` or `patent-ingest.test.ts` because those bypass Commander and call the library directly.

Suggested direction
Mirror the OKF CLI tests with small temp-dir fixtures so the parser, registration, option plumbing, output summary, and disk writes are covered together.

For Agents
Add `packages/cli` subcommand tests for `ip-oracle generate` and `ip-oracle ingest`: run the compiled CLI, assert exit codes and JSON summaries, verify expected files/shards are written, and confirm these commands do not contact the node.
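
Running the compiled CLI and asserting exit codes can be done with a small child-process wrapper. The sketch below uses Node's `spawnSync`. `runNode` is a hypothetical helper, and the path to the compiled `dkg` entry point is deliberately left out because it does not appear in this diff. The demo calls stand in for the CLI by running inline scripts that reproduce the exit code 2 and the `mode: "ingest"` summary.

```typescript
import { spawnSync } from 'node:child_process';

interface CliResult {
  status: number | null;
  stdout: string;
  stderr: string;
}

// Hypothetical helper: runs the current Node binary with the given arguments
// and captures exit code and output. In a real test the first argument would
// be the compiled CLI entry point, followed by e.g.
// 'ip-oracle', 'generate', outDir, '--count', '0'.
function runNode(args: string[]): CliResult {
  const res = spawnSync(process.execPath, args, { encoding: 'utf8' });
  return { status: res.status, stdout: res.stdout, stderr: res.stderr };
}

// Stand-in for the invalid-count path: message on stderr, exit code 2.
const invalid = runNode([
  '-e',
  'console.error("--count must be a positive integer."); process.exit(2);',
]);
console.log(invalid.status); // 2

// Stand-in for a successful ingest: JSON summary on stdout, exit code 0.
const ok = runNode(['-e', 'console.log(JSON.stringify({ mode: "ingest", synthetic: false }));']);
console.log(ok.status, JSON.parse(ok.stdout).mode); // 0 ingest
```

Because both subcommands only write files, a test built on this wrapper can run with no daemon started, which also gives indirect evidence that the commands do not contact the node.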