Chunk oversized schema text literals for Blazegraph compatibility #1334
Jurij89 wants to merge 12 commits into
0ca73b2 to 5e740d2
  readonly datatype?: string;
}

export function parseRdfLiteralTerm(term: string): ParsedRdfLiteralTerm | null {
🟡 Issue: Avoid growing a second RDF literal grammar
What's wrong
The PR adds a custom RDF literal parser/serializer for chunking without retiring the existing regex-based RDF term validation. That creates parallel grammars for the same boundary, which is brittle and makes future literal-format fixes harder to reason about.
Example
parseRdfLiteralTerm now hand-parses escapes/language/datatype, while assertSafeRdfTerm still has its own SAFE_RDF_LITERAL grammar and storage adapters have their own literal parsing. A future change to accepted literal syntax has to be made in multiple places.
Suggested direction
Make this codec the single source of truth by routing existing RDF-term validation through it, or use an existing RDF parser/serializer at the boundary. At minimum, collapse the duplicated regex/parser rules so chunking, HTTP validation, and SPARQL safety cannot drift independently.
For Agents
Focus on packages/core/src/rdf-literal-codec.ts, packages/core/src/sparql-safe.ts, and callers in http-utils.ts / rdf-text-literal-normalization.ts. Preserve the currently accepted short-literal grammar and chunk reconstruction behavior, but make validation and parsing share one canonical codec or parser-backed utility. Tests should prove validator and chunker agree on the same valid/invalid literal terms.
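One way to read the suggested direction is sketched below: a single parser that the validator simply delegates to, so the two cannot disagree. This is a minimal illustration under stated assumptions, not the PR's code. `parseLiteral`, `isSafeLiteral`, and `ParsedLiteral` are hypothetical names, and the grammar is deliberately simplified (only a handful of escapes, no `\u` sequences), so it is not the accepted short-literal grammar of `SAFE_RDF_LITERAL`.

```typescript
// Hypothetical single source of truth for RDF literal terms.
// The grammar is simplified for illustration and is NOT the PR's grammar.
interface ParsedLiteral {
  readonly value: string; // unescaped lexical form
  readonly language?: string;
  readonly datatype?: string;
}

// Escape sequences this sketch accepts (assumption: a small subset only).
const ESCAPES: Record<string, string> = {
  t: "\t",
  n: "\n",
  r: "\r",
  '"': '"',
  "\\": "\\",
};

function parseLiteral(term: string): ParsedLiteral | null {
  if (!term.startsWith('"')) return null;
  let value = "";
  let i = 1;
  for (; i < term.length; i++) {
    const ch = term[i];
    if (ch === '"') break; // closing quote
    if (ch === "\\") {
      const next = term[i + 1];
      const mapped = next === undefined ? undefined : ESCAPES[next];
      if (mapped === undefined) return null; // unknown or dangling escape
      value += mapped;
      i++; // skip the escaped character
    } else if (ch === "\n" || ch === "\r") {
      return null; // raw line breaks must be escaped
    } else {
      value += ch;
    }
  }
  if (i >= term.length) return null; // no closing quote found
  const suffix = term.slice(i + 1);
  if (suffix === "") return { value };
  const lang = /^@([A-Za-z]+(?:-[A-Za-z0-9]+)*)$/.exec(suffix);
  if (lang) return { value, language: lang[1] };
  const dt = /^\^\^<([^<>"{}|^`\\\s]+)>$/.exec(suffix);
  if (dt) return { value, datatype: dt[1] };
  return null;
}

// The validator delegates to the parser instead of keeping its own regex.
function isSafeLiteral(term: string): boolean {
  return parseLiteral(term) !== null;
}
```

With this shape, a test that feeds the same list of valid and invalid terms to both functions checks agreement by construction, which is the property the comment asks the tests to prove.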
Summary
Enhances the quick reject-only fix from PR #1323 by allowing supported large public schema.org/text literals to publish safely into Blazegraph-backed nodes.
- http://schema.org/text and https://schema.org/text literals into deterministic DKG text-body/chunk resources.
- OVERSIZED_RDF_LITERAL fail-fast path.
- schema:text for chunk internals.
Data Model Notes
Oversized root triples are not stored as <subject> schema:text huge.... Instead, the source subject links to dkg:hasTextBody; the body records the source predicate, hashes, byte counts, and ordered dkg:chunkValue chunks. Body identity includes the source predicate plus the canonical literal term, so http://schema.org/text and https://schema.org/text remain distinct.
Test Plan
- pnpm --filter @origintrail-official/dkg-core exec vitest run test/rdf-literal-size.test.ts
- pnpm --filter @origintrail-official/dkg-core run build
- pnpm --dir packages/publisher exec vitest run --config vitest.unit.config.ts test/dkg-publisher-compat.test.ts
- pnpm --dir packages/agent exec vitest run --config vitest.unit.config.ts test/publish-literal-size.test.ts
- pnpm --dir packages/cli exec vitest run --config vitest.unit.config.ts test/http-literal-size-validation.test.ts
- pnpm --dir packages/storage exec vitest run test/blazegraph.unit.test.ts test/sparql-http.test.ts
- pnpm --filter @origintrail-official/dkg-publisher run build
- pnpm --filter @origintrail-official/dkg-agent run build
- pnpm --filter @origintrail-official/dkg run build
- git diff --check
Local Caveat
packages/cli/test/issue-306-787-write-quad-validation.test.ts was updated to expect chunking on live SWM/WM route writes, but I could not verify that file locally: it is outside the CLI unit config, and the default Hardhat daemon harness failed during local startup before executing any assertions. CI should prove this live route path, or we should follow up by moving this coverage into a stable harness.
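For readers who want a concrete picture of the layout described under Data Model Notes, the sketch below builds a text body with a hash, a byte count, and ordered chunks, and derives the body identity from the source predicate plus the text. It is an illustration under assumptions, not the PR's implementation: `TextBody`, `buildTextBody`, `splitByBytes`, `reassemble`, and the `urn:example:text-body:` prefix are hypothetical, and the raw text stands in for the canonical literal term.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory shape of a text body; the PR stores this as RDF.
interface TextBody {
  readonly id: string; // deterministic identity
  readonly sourcePredicate: string;
  readonly sha256: string; // hash of the full text
  readonly byteLength: number; // UTF-8 size of the full text
  readonly chunks: readonly string[]; // ordered chunk values
}

const sha256Hex = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");

// Split on code-point boundaries so no chunk ends inside a multi-byte
// character. A single code point larger than maxBytes gets its own chunk.
function splitByBytes(text: string, maxBytes: number): string[] {
  const chunks: string[] = [];
  let current = "";
  let size = 0;
  for (const cp of text) {
    const n = Buffer.byteLength(cp, "utf8");
    if (size + n > maxBytes && current !== "") {
      chunks.push(current);
      current = "";
      size = 0;
    }
    current += cp;
    size += n;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

function buildTextBody(
  sourcePredicate: string,
  text: string,
  maxChunkBytes: number,
): TextBody {
  // Identity covers predicate and text, so the http:// and https://
  // predicates yield different bodies for the same text.
  const id = `urn:example:text-body:${sha256Hex(`${sourcePredicate}\n${text}`)}`;
  return {
    id,
    sourcePredicate,
    sha256: sha256Hex(text),
    byteLength: Buffer.byteLength(text, "utf8"),
    chunks: splitByBytes(text, maxChunkBytes),
  };
}

// Rebuild the original text and check it against the recorded hash.
function reassemble(body: TextBody): string {
  const text = body.chunks.join("");
  if (sha256Hex(text) !== body.sha256) throw new Error("chunk hash mismatch");
  return text;
}
```

Because the identity and the chunk boundaries depend only on the inputs, publishing the same literal twice produces the same body, which is what makes the chunk resources deterministic.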