Skip to content

Codex/blazegraph safe rdf literals#1322

Open
lupuszr wants to merge 2 commits into
mainfrom
codex/blazegraph-safe-rdf-literals
Open

Codex/blazegraph safe rdf literals#1322
lupuszr wants to merge 2 commits into
mainfrom
codex/blazegraph-safe-rdf-literals

Conversation

@lupuszr

@lupuszr lupuszr commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes Blazegraph ingestion/sync failures caused by oversized RDF literals, specifically large schema.org/text values that exceed Java/Blazegraph modified UTF-8 limits.

Changes:

  • Adds shared RDF literal size validation using Java modified UTF-8 byte counting.
  • Rewrites oversized schema.org/text literals into ordered chunk resources below a conservative 60,000-byte ceiling.
  • Adds hash, byte-size, chunk-count, and source-predicate metadata for rewritten text bodies.
  • Rejects oversized non-text RDF literals before publishing with RDF_LITERAL_TOO_LARGE.
  • Wires validation/chunking into producer paths: direct publish, shared memory writes, conditional writes, assertion writes, and agent publish attestation generation.
  • Keeps Blazegraph adapter-side oversized-literal rejection as defense in depth.
  • Maps producer literal-size failures to HTTP 400 in daemon routes.

Testing

  • pnpm --filter @origintrail-official/dkg-core exec vitest run test/rdf-literal-limits.test.ts
  • pnpm --filter @origintrail-official/dkg-storage exec vitest run test/blazegraph.unit.test.ts
  • pnpm --filter @origintrail-official/dkg-publisher exec vitest run --config vitest.unit.config.ts test/dkg-publisher-compat.test.ts
  • git diff --check

Live Blazegraph integration was attempted, but this local environment has no BLAZEGRAPH_TEST_URL and Docker daemon access was unavailable.

@otReviewAgent otReviewAgent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational Notice: Review Agent could not complete this review.

Business logic reviewer failed: retry_exhausted

}

const sha256 = createHash('sha256').update(parsed.lexical, 'utf8').digest('hex');
const bodySubject = `${quad.subject}/.well-known/genid/text-${sha256.slice(0, 16)}`;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This derives the replacement resource from quad.subject even when the source subject is a blank node. For an oversized schema:text on _:b0, bodySubject becomes _:b0/.well-known/...; callers such as assertionWrite persist the normalized quads without skolemizing, and the storage serializers treat that string as a blank node label, producing invalid N-Quads and rejecting a write that previously worked on non-Blazegraph backends. Skolemize before rewriting, or generate a valid blank node/IRI for blank-node subjects before emitting the chunk quads.

const shaped = err as { name?: unknown; code?: unknown };
return shaped.name === 'RdfLiteralSizeError' || shaped.code === 'RDF_LITERAL_TOO_LARGE';
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: String(err ?? ...) turns any non-Error object that matches isRdfLiteralSizeError into an unhelpful "[object Object]" response. Since the predicate explicitly accepts shaped objects by name or code, this helper should also read a string message field before falling back, so daemon responses stay actionable for cross-package/plain-object errors.

// or subject-level validation error downstream.
rejectUserAuthoredProtocolMetadata(quads);
const subjects = [...new Set(quads.map(q => q.subject))];
// Normalize oversized text literals and reject reserved/protocol

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: The added test only exercises DKGPublisher.publish(), but this new normalization also changes the SWM share() write path and recurs in conditional/append paths. Add a regression test that calls share() or conditionalShare() with an oversized schema:text literal and asserts the stored quads are chunked; otherwise this path could regress to sending the original >65KB literal while tests stay green.

privateQuads?: Quad[];
},
): Promise<PublishOptions['precomputedAttestation']> {
quads = normalizeLargeRdfLiteralsForBlazegraph(quads).quads as Quad[];

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: This normalization feeds _buildPrecomputedAttestationForSelection, but the added coverage does not exercise an agent/on-chain publish where the precomputed attestation is built from oversized public or private quads. Add a regression test through the agent selection/precomputed-attestation path so a raw-vs-normalized seal mismatch is caught.

if (isPayloadTooLargeError(e)) {
return jsonResponse(res, 413, payloadTooLargeResponseBody(e));
}
if (isRdfLiteralSizeError(e)) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: The daemon now maps RdfLiteralSizeError to a public 400 response, but there is no route-level test that makes the underlying publish/share call throw this error and asserts the HTTP status/body. Please add coverage for this route and the shared-memory handlers that recur below, so a regression to the generic 500 path or missing RDF_LITERAL_TOO_LARGE code is caught.

@@ -98,6 +98,7 @@ import {
sharedMemoryReadBothFilter,
partitionCatalogQuads,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Nit: normalizeLargeRdfLiteralsForBlazegraph is imported from @origintrail-official/dkg-core in its own statement immediately after an existing import from the same package. Consolidating it into the existing import matches the local pattern and avoids needless import drift as more core symbols are added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants