
Chunk oversized schema text literals for Blazegraph compatibility#1334

Open
Jurij89 wants to merge 12 commits into main from codex/large-rdf-literal-chunking

Jurij89 (Contributor) commented Jun 25, 2026

Summary

Enhances the quick reject-only fix from PR #1323 by allowing supported large public schema.org/text literals to publish safely into Blazegraph-backed nodes.

  • Adds core RDF literal normalization that rewrites oversized public http://schema.org/text and https://schema.org/text literals into deterministic DKG text-body/chunk resources.
  • Keeps private oversized literals and unsupported public literals on the existing OVERSIZED_RDF_LITERAL fail-fast path.
  • Wires normalization through publisher, agent, CLI publish/write/import/enrichment routes, while keeping storage adapters as hard defense-in-depth guards.
  • Exports reconstruction helpers that fail closed on missing count/hash metadata and avoid using schema:text for chunk internals.
  • Adds focused coverage for core chunking/reconstruction, publisher entry points, agent publish staging, CLI parsing, and Blazegraph/SPARQL storage guards.
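
The routing described in the bullets above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `classifyLiteral`, `MAX_LITERAL_BYTES`, the `Visibility` type, and the error class are hypothetical names, and the byte limit is an assumed value. Only the two schema.org/text IRIs and the `OVERSIZED_RDF_LITERAL` code come from the description.

```typescript
// Predicates eligible for chunking, as listed in the PR summary.
const CHUNKABLE_PREDICATES = new Set<string>([
  "http://schema.org/text",
  "https://schema.org/text",
]);

// Assumed limit for illustration; the real threshold is not stated here.
const MAX_LITERAL_BYTES = 1024;

type Visibility = "public" | "private";
type Decision = "keep" | "chunk";

// Hypothetical error type carrying the fail-fast code named in the summary.
class OversizedRdfLiteralError extends Error {
  readonly code = "OVERSIZED_RDF_LITERAL";
}

function utf8ByteLength(value: string): number {
  return new TextEncoder().encode(value).length;
}

// Decide what happens to a literal before it reaches storage:
// small literals pass through, oversized public schema.org/text
// literals are chunked, everything else oversized fails fast.
function classifyLiteral(
  predicate: string,
  value: string,
  visibility: Visibility,
): Decision {
  if (utf8ByteLength(value) <= MAX_LITERAL_BYTES) return "keep";
  if (visibility === "public" && CHUNKABLE_PREDICATES.has(predicate)) {
    return "chunk";
  }
  throw new OversizedRdfLiteralError(
    `literal on ${predicate} exceeds ${MAX_LITERAL_BYTES} bytes`,
  );
}
```

Under these assumptions, storage adapters would still reject any oversized literal that reaches them, so a missed call site degrades to the existing error rather than to a failed Blazegraph write.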

Data Model Notes

Oversized root triples are not stored as <subject> schema:text huge.... Instead, the source subject links to a text body via dkg:hasTextBody. The body records the source predicate, hashes, byte counts, and ordered dkg:chunkValue chunks. Body identity includes the source predicate plus the canonical literal term, so http://schema.org/text and https://schema.org/text remain distinct.
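
A minimal sketch of this data model, for readers who want to see the shape. It is not the implementation in this PR: the `TextBody` interface, `buildTextBody`, `reconstructText`, the `urn:example:` identifier scheme, and the choice of SHA-256 are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory shape of a text body and its ordered chunks.
interface TextBody {
  readonly id: string;
  readonly sourcePredicate: string;
  readonly contentHash: string; // SHA-256 is an assumed algorithm
  readonly byteCount: number;
  readonly chunkCount: number;
  readonly chunks: readonly string[];
}

const sha256 = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");

// Split by code point so no chunk ends inside a surrogate pair,
// keeping each chunk at or under maxChunkBytes of UTF-8.
function chunkText(value: string, maxChunkBytes: number): string[] {
  const encoder = new TextEncoder();
  const chunks: string[] = [];
  let current = "";
  let currentBytes = 0;
  for (const ch of value) {
    const size = encoder.encode(ch).length;
    if (currentBytes + size > maxChunkBytes && current !== "") {
      chunks.push(current);
      current = "";
      currentBytes = 0;
    }
    current += ch;
    currentBytes += size;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Identity is derived from predicate plus canonical term, so the http and
// https schema.org/text variants produce different bodies.
function buildTextBody(
  sourcePredicate: string,
  canonicalTerm: string,
  value: string,
  maxChunkBytes: number,
): TextBody {
  const chunks = chunkText(value, maxChunkBytes);
  return {
    id: `urn:example:text-body:${sha256(`${sourcePredicate}\n${canonicalTerm}`)}`,
    sourcePredicate,
    contentHash: sha256(value),
    byteCount: new TextEncoder().encode(value).length,
    chunkCount: chunks.length,
    chunks,
  };
}

// Fail closed: refuse to return text when count or hash metadata is
// missing or does not match the chunks that were found.
function reconstructText(body: Partial<TextBody>): string {
  if (
    body.chunkCount === undefined ||
    body.contentHash === undefined ||
    body.chunks === undefined
  ) {
    throw new Error("text body is missing count/hash metadata");
  }
  if (body.chunks.length !== body.chunkCount) {
    throw new Error("chunk count mismatch");
  }
  const value = body.chunks.join("");
  if (sha256(value) !== body.contentHash) {
    throw new Error("content hash mismatch");
  }
  return value;
}
```

Checking both the count and the hash means a partially replicated body is reported as an error instead of being returned as truncated text.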

Test Plan

  • pnpm --filter @origintrail-official/dkg-core exec vitest run test/rdf-literal-size.test.ts
  • pnpm --filter @origintrail-official/dkg-core run build
  • pnpm --dir packages/publisher exec vitest run --config vitest.unit.config.ts test/dkg-publisher-compat.test.ts
  • pnpm --dir packages/agent exec vitest run --config vitest.unit.config.ts test/publish-literal-size.test.ts
  • pnpm --dir packages/cli exec vitest run --config vitest.unit.config.ts test/http-literal-size-validation.test.ts
  • pnpm --dir packages/storage exec vitest run test/blazegraph.unit.test.ts test/sparql-http.test.ts
  • pnpm --filter @origintrail-official/dkg-publisher run build
  • pnpm --filter @origintrail-official/dkg-agent run build
  • pnpm --filter @origintrail-official/dkg run build
  • git diff --check

Local Caveat

packages/cli/test/issue-306-787-write-quad-validation.test.ts was updated to expect chunking on live SWM/WM route writes, but I could not prove that file locally: it is outside the CLI unit config, and the default Hardhat daemon harness failed during local startup before executing assertions. CI should prove this live route path or we should follow up by moving this coverage into a stable harness.

Comment thread packages/core/src/rdf-literal-size.ts Outdated
Comment thread packages/core/src/rdf-literal-size.ts Outdated
Comment thread packages/cli/src/daemon/http-utils.ts Outdated
Comment thread packages/cli/src/daemon/routes/knowledge-assets-import.ts
Comment thread packages/publisher/src/dkg-publisher.ts
Comment thread packages/agent/src/dkg-agent-publish.ts
Comment thread packages/core/src/rdf-literal-size.ts Outdated
Comment thread packages/agent/src/dkg-agent-publish.ts
Comment thread packages/core/src/rdf-literal-size.ts Outdated
Comment thread packages/core/src/rdf-literal-size.ts Outdated
Comment thread packages/publisher/src/dkg-publisher.ts
Comment thread packages/cli/src/daemon/http-utils.ts Outdated
Comment thread packages/cli/test/import-file-orchestration.shared.ts
Comment thread packages/publisher/src/public-write-normalization.ts Outdated
Comment thread packages/cli/src/daemon/routes/knowledge-assets.ts Outdated
Comment thread packages/cli/src/daemon/http-utils.ts Outdated
Comment thread packages/cli/src/daemon/http-utils.ts
Comment thread packages/cli/src/daemon/routes/knowledge-assets.ts
@Jurij89 Jurij89 force-pushed the codex/large-rdf-literal-chunking branch from 0ca73b2 to 5e740d2 Compare June 25, 2026 21:43
Comment thread packages/publisher/src/public-write-normalization.ts Outdated
Comment thread packages/core/src/rdf-text-literal-normalization.ts Outdated
Comment thread packages/publisher/src/public-write-normalization.ts Outdated
Comment thread packages/core/src/rdf-text-literal-normalization.ts Outdated
Comment thread packages/cli/src/daemon/routes/memory.ts Outdated
Comment thread packages/cli/src/daemon/routes/memory.ts Outdated
Comment thread packages/cli/src/daemon/http-utils.ts Outdated
Comment thread packages/publisher/src/auto-partition.ts Outdated
Comment thread packages/cli/test/import-artifact-routes.test.ts Outdated
readonly datatype?: string;
}

export function parseRdfLiteralTerm(term: string): ParsedRdfLiteralTerm | null {

🟡 Issue: Avoid growing a second RDF literal grammar

What's wrong
The PR adds a custom RDF literal parser/serializer for chunking without retiring the existing regex-based RDF term validation. That creates parallel grammars for the same boundary, which is brittle and makes future literal-format fixes harder to reason about.

Example
parseRdfLiteralTerm now hand-parses escapes/language/datatype, while assertSafeRdfTerm still has its own SAFE_RDF_LITERAL grammar and storage adapters have their own literal parsing. A future change to accepted literal syntax has to be made in multiple places.

Suggested direction
Make this codec the single source of truth by routing existing RDF-term validation through it, or use an existing RDF parser/serializer at the boundary. At minimum, collapse the duplicated regex/parser rules so chunking, HTTP validation, and SPARQL safety cannot drift independently.

For Agents
Focus on packages/core/src/rdf-literal-codec.ts, packages/core/src/sparql-safe.ts, and callers in http-utils.ts / rdf-text-literal-normalization.ts. Preserve the currently accepted short-literal grammar and chunk reconstruction behavior, but make validation and parsing share one canonical codec or parser-backed utility. Tests should prove validator and chunker agree on the same valid/invalid literal terms.
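
One possible shape for the suggested direction, as a sketch only. The names `parseLiteralSketch`, `serializeLiteralSketch`, and `isSafeLiteralTermSketch` are hypothetical, and the accepted grammar below is a simplified assumption, not the grammar currently enforced by `SAFE_RDF_LITERAL` or `parseRdfLiteralTerm`. The point is that the validator is defined in terms of the parser, so the two cannot disagree.

```typescript
interface ParsedLiteralSketch {
  readonly value: string;
  readonly language?: string;
  readonly datatype?: string;
}

// Escape sequences accepted inside the quoted lexical form (assumed subset).
const UNESCAPE = new Map<string, string>([
  ["t", "\t"],
  ["n", "\n"],
  ["r", "\r"],
  ['"', '"'],
  ["\\", "\\"],
]);

// Single parser: returns null for anything outside the accepted grammar.
function parseLiteralSketch(term: string): ParsedLiteralSketch | null {
  if (!term.startsWith('"')) return null;
  let value = "";
  let i = 1;
  for (; i < term.length; i++) {
    const ch = term.charAt(i);
    if (ch === '"') break; // closing quote
    if (ch === "\n" || ch === "\r") return null; // raw line breaks must be escaped
    if (ch === "\\") {
      const decoded = UNESCAPE.get(term.charAt(i + 1));
      if (decoded === undefined) return null; // unknown or dangling escape
      value += decoded;
      i++;
    } else {
      value += ch;
    }
  }
  if (i >= term.length) return null; // no closing quote found
  const suffix = term.slice(i + 1);
  if (suffix === "") return { value };
  const lang = /^@([a-zA-Z]+(?:-[a-zA-Z0-9]+)*)$/.exec(suffix);
  if (lang) return { value, language: lang[1] };
  const dt = /^\^\^<([^<>"\s]+)>$/.exec(suffix);
  if (dt) return { value, datatype: dt[1] };
  return null;
}

// Serializer paired with the parser above.
function serializeLiteralSketch(lit: ParsedLiteralSketch): string {
  const escaped = lit.value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
  if (lit.language !== undefined) return `"${escaped}"@${lit.language}`;
  if (lit.datatype !== undefined) return `"${escaped}"^^<${lit.datatype}>`;
  return `"${escaped}"`;
}

// Validation is defined as "the parser accepts it", so validator and
// chunker share one grammar by construction.
function isSafeLiteralTermSketch(term: string): boolean {
  return parseLiteralSketch(term) !== null;
}
```

An agreement test would then iterate a table of valid and invalid terms and assert that the validator, the parser, and a parse/serialize round trip all give consistent answers.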

2 participants