Skip to content

fix: submodule support — reinstate clone, fix pull, add hasSubmodules flag#1550

Open
rpapani wants to merge 9 commits intomainfrom
cm-client-bugs
Open

fix: submodule support — reinstate clone, fix pull, add hasSubmodules flag#1550
rpapani wants to merge 9 commits intomainfrom
cm-client-bugs

Conversation

@rpapani
Copy link
Copy Markdown
Contributor

@rpapani rpapani commented Apr 21, 2026

Summary

Full git-submodule support for both standard and BYOG Cloud Manager repos, plus the data-access schema fields downstream consumers need to drive the runtime.

The branch sits on top of main and contains five logical commits, each independently reviewable:

  1. fix: clone functionality to include submodules (#1413) — cherry-pick of the original submodule-clone fix that was reverted in Revert "fix: clone functionality to include submodules (#1413)" #1549 because the Lambda git-layer was missing the submodule helper binaries. With spacecat-git-layer:4 now carrying git-submodule and git-submodule--helper (see adobe/spacecat-infrastructure#470), this can safely go back in. The commit also carries the per-operation git timeout bump that landed alongside the original fix: GIT_OPERATION_TIMEOUT_MS defaults to 600_000 ms (10 min, up from 120s) and is now overridable via env var — recursive clones traverse multiple submodule repos and the old 2-minute budget was too tight.

  2. fix: propagate --recurse-submodules to incremental pull — the original PR patched clone but not pull. The incremental-update path uses pull and was still skipping submodules.

  3. feat: add hasSubmodules field to site.code — initial schema work. Superseded in shape by commits 4 and 5 below; included for review continuity.

  4. refactor: replace code.hasSubmodules with code.metadata.submodules — moves submodule metadata into a nested metadata.submodules: { external, urls } object so additional fields can be added without further schema churn.

  5. refactor: precomputed submoduleMap for BYOG submodule clones — the substantive change. Replaces the previous runtime name-lookup logic with a flat, onboarding-populated submoduleMap array of { path, url } entries. The cm-client iterates it mechanically and writes each entry into .git/config before running submodule update. All name disambiguation, URL classification, and collision handling is done at onboarding; the runtime is purely mechanical.

What's new in commit 5 (the meat of the change)

Submodule pipeline split by parent repo type

  • Standard parent: git clone --recurse-submodules and pull --recurse-submodules work natively. .gitmodules URLs resolve to the same host as the parent (git.cloudmanager.adobe.com/{orgName}/...); the existing org-scoped Basic-auth extraheader covers every submodule transparently. No additional configuration needed.
  • BYOG parent: --no-recurse-submodules at clone time, followed by a second pass driven by submoduleMap. The CM repo service proxy can't serve relative or SSH .gitmodules URLs (they resolve to name-based paths the proxy rejects), so we rewrite each submodule's URL in .git/config to a CM-reachable form before running submodule update. .gitmodules itself is never modified — the working tree stays clean.

submoduleMap data contract

"site.code.metadata.submodules.submoduleMap": [
  { "path": "sub-byog",     "url": "https://cm-repo.adobe.io/api/program/{pid}/repository/{numericId}.git" },
  { "path": "sub-standard", "url": "https://git.cloudmanager.adobe.com/{orgName}/{repoName}/" }
]

Each entry says: "for the submodule at <path> in .gitmodules, write <url> into .git/config submodule.<path>.url." The runtime decides which auth scope to apply by parsing each url's host — proxy URLs get the BYOG triple (Bearer + x-api-key + x-gw-ims-org-id), git.cloudmanager.adobe.com URLs get a Basic header from CM_STANDARD_REPO_CREDENTIALS[programId] scoped to {origin}/{orgName}/. Both scopes coexist on a single git submodule update invocation.

Why this is one map driven by URLs (not branched by repo type at runtime)

By baking the URL form into the map at onboarding, the runtime has no need to know what type the underlying repo is — the URL is the source of truth for both routing AND auth-scope selection. This handles every BYOG configuration we've encountered:

  • Pure-BYOG (parent and all submodules tunnel through the proxy): single Bearer scope.
  • Mixed BYOG + standard (BYOG parent with standard-typed submodules): both scopes attached to the same submodule update invocation; git applies the right headers per-URL by prefix matching.

Other additions

  • New exported constant GIT_CLOUD_MANAGER_HOST (= git.cloudmanager.adobe.com) so the standard-host literal is centralised.
  • New helper #buildSubmoduleAuthArgs that walks the map once, collects unique URL hosts, and emits one -c http.<scope>.extraheader=... per scope (deduplicated).
  • git submodule update --force --recursive is used unconditionally on the BYOG path. --force resets the submodule's working tree to whatever HEAD points at after fetch — required when the parent's pinned commit SHAs aren't reachable in the submodule's current commit graph (a real configuration we've seen, where submodule update would otherwise leave the working tree empty).
  • cmProgramRepos (the previous, runtime-fuzzy-matched name→id map field) is removed throughout. It was unreleased; no production data exists in that shape.

Files changed

File Change
packages/spacecat-shared-cloud-manager-client/src/index.js New GIT_CLOUD_MANAGER_HOST constant; clone() / pull() accept submoduleMap; new #buildSubmoduleAuthArgs; #resolveByogSubmodules rewritten to iterate submoduleMap and run submodule update --force --recursive
packages/spacecat-shared-cloud-manager-client/test/cloud-manager-client.test.js Test rewrite for submoduleMap; new dual-scope auth test; defensive tests for invalid entries and unparseable URLs
packages/spacecat-shared-cloud-manager-client/README.md New "Submodule handling" and "Known limitations" sections
packages/spacecat-shared-data-access/src/models/site/index.d.ts SubmodulesMetadata.submoduleMap?: Array<{path, url}> with full JSDoc on URL forms and runtime auth-scope derivation
packages/spacecat-shared-data-access/src/models/site/site.schema.js JSDoc on the code attribute documents the new field
packages/spacecat-shared-data-access/test/unit/models/site/site.model.test.js Tests for metadata.submodules.submoduleMap storage + cross-import preservation

Known limitations (tracked in cm-client README.md)

The submodule pipeline doesn't cover every BYOG configuration. The cases below are documented as out-of-scope for this PR and would each land as a separate follow-up:

  • submoduleMap not yet populated → parent clone preserved, warning logged, submodule fetches fail (expected until onboarding populates the map).
  • Out-of-program submodules (.gitmodules references repos not onboarded to CM at all).
  • Broken BYOG connections (repo status: ready in API but proxy 404s).
  • Stale parent gitlinks (handled today via --force; documented trade-off is dirty-pointer state in git status).
  • Recursive submodule depth > 1 (each level needs its own map).
  • Cross-org standard submodules (would need a per-{programId, orgName} cred shape).
  • Submodules on third-party hosts not behind the CM proxy.
  • submoduleMap drift between imports.

Test plan

  • cm-client unit tests pass — 80 tests, 100% coverage
  • data-access unit tests pass — 1920 tests
  • Manual end-to-end clone with --recurse-submodules for a standard parent — verified
  • Manual end-to-end clone with --no-recurse-submodules + submoduleMap rewrite for a BYOG parent (single-scope and dual-scope cases) using the updated git Lambda layer :4 — verified end-to-end
  • Downstream consumer verification via adobe/spacecat-import-worker#724

Related

  • Supersedes revert Revert "fix: clone functionality to include submodules (#1413)" #1549
  • Unblocks SITES-43089 autofix work
  • Downstream consumer PR: adobe/spacecat-import-worker#724 — picks up this branch via gist tarball for pre-release testing, then consumes metadata.submodules.submoduleMap
  • Lambda layer PR: adobe/spacecat-infrastructure#470 — bundles git-submodule + git-submodule--helper into spacecat-git-layer:4

🤖 Generated with Claude Code

dmaurya929 and others added 3 commits April 20, 2026 15:55
Please ensure your pull request adheres to the following guidelines:
- [ ] make sure to link the related issues in this description
- [ ] when merging / squashing, make sure the fixed issue references are
visible in the commits, for easy compilation of release notes

## Related Issues


Thanks for contributing!
PR #1413 added --recurse-submodules to clone but missed pull, so
incremental updates of repos with submodules (e.g. Okta's
okta-digital-aem-sites) did not fetch submodule changes. Now pull
uses the same org-scoped extraheader auth as clone — submodules on
the same customer org inherit credentials and are updated to the
SHA pinned in the parent's tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional boolean `hasSubmodules` on `site.code` so the
import worker can record whether the cloned repo contains a
.gitmodules file. Downstream consumers (autofix, suggestion
generation) use this flag to decide whether to parse .gitmodules
and route patches/PRs across parent + submodule repos.

Additive, unset by default — no behaviour change for existing sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rpapani rpapani requested a review from dmaurya929 April 21, 2026 06:33
Replaces the flat boolean with a nested metadata object that carries
the full submodule URL list and an external-vs-internal classification.
Consumers that only need presence check `metadata?.submodules?.urls?.length`;
those that need the URL list or external flag read the same object.

Shape:
  code.metadata = {
    submodules: {
      external: boolean,     // true if any URL points off the parent's host
      urls: string[]         // verbatim from .gitmodules, creds stripped
    }
  }
  // `submodules` is omitted when the repo has no .gitmodules

The importer always overwrites `metadata` on each import so a re-import
that finds no submodules clears stale entries from an earlier import.

Adds CodeMetadata + SubmodulesMetadata TypeScript interfaces and
matching unit tests. Adds a JSDoc doc-comment on the `code` attribute
summarising the contract.

No migration needed — hasSubmodules was unreleased (shipped on this
branch but not yet merged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

This PR will trigger a minor release when merged.

rpapani and others added 5 commits April 22, 2026 20:33
Background: CM BYOG repos are proxied by the CM repo service at
`{cmRepoUrl}/api/program/{pid}/repository/{numericId}.git`. When a
customer's .gitmodules uses relative URLs (`../foo`) — which is the
common case — git resolves them against the parent's clone URL and
produces `.../repository/foo`. The proxy rejects those because it only
serves numeric repository IDs, breaking submodule clone end-to-end.

Fix: a new `cmProgramRepos` map on `site.code.metadata.submodules`
that maps each CM repo's real name (last path segment of its
`url`/`proxyUrl`, NOT the `repo` display field) to its numeric id.
The cm-client reads this at clone/pull time to rewrite name-based
submodule URLs in .git/config to the numeric-id form before running
`git submodule update`. Populated at onboarding via a single call to
`GET /api/program/{pid}/repositories`; the importer doesn't have CM
Management API access so it can't derive the map itself.

cm-client changes:
  - Export `isBYOG(repoType)` helper (anything !== STANDARD).
  - Split clone() by type:
    * STANDARD: `clone --recurse-submodules` + (on ref switch) sync +
      update --init --recursive. Unchanged behaviour.
    * BYOG: `clone --no-recurse-submodules` + checkout ref + new
      `#resolveByogSubmodules` helper that runs submodule init,
      rewrites URLs via cmProgramRepos, then submodule update
      --recursive with the CM repo service extraheaders re-applied.
  - Same split for pull(): STANDARD keeps `pull --recurse-submodules`,
    BYOG pulls the parent only and then runs the same rewrite pass to
    pick up new submodules introduced by the pulled commits.
  - `#resolveByogSubmodules` is idempotent and explicitly never calls
    `git submodule sync` — sync would copy .gitmodules URLs back into
    .git/config and undo the rewrite. Heavily commented.
  - Graceful fallbacks: no .gitmodules (skip), no map (warn, let the
    subsequent update fail naturally — parent clone is still usable),
    submodule name not in map (leave URL alone).

Schema / types:
  - `SubmodulesMetadata.cmProgramRepos?: Record<string, string>` with
    a thorough TS doc comment explaining the lifecycle, who writes it,
    and why standard repos don't need it.
  - site.schema.js JSDoc distinguishes importer-written fields from
    onboarding-written fields.

Testing: validated end-to-end against eni-plenitude (program 123722,
parent repo 233725, 12 relative submodules on Azure DevOps BYOG) —
all submodules cloned at their pinned SHAs via the CM proxy. Unit
tests: 79 passing, 100% coverage (up from 70).

TODO follow-ups tracked elsewhere:
  - Ask CM to support name-based routes on the repo proxy (would
    remove the need for this map entirely).
  - Consider exposing customer-native URLs + stored PAT so we could
    bypass the proxy for BYOG altogether.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the runtime name-lookup logic with a flat, onboarding-populated
`submoduleMap` array of `{path, url}` entries. The cm-client iterates
the list mechanically and writes each entry into `.git/config` before
running `submodule update`. All name disambiguation, URL classification,
and collision handling is done at onboarding; the runtime is purely
mechanical and free of fuzzy matching.

This unblocks two previously unsupported cases:

  - Mixed BYOG + standard submodules under a single BYOG parent. The
    new `#buildSubmoduleAuthArgs` helper attaches multiple
    `http.<scope>.extraheader` args on a single `git submodule update`
    invocation: a Bearer-based scope on the CM proxy host for BYOG
    submodules, and a Basic-auth scope on `git.cloudmanager.adobe.com/
    {orgName}/` for standard submodules. Git applies each scope's
    headers only when the outgoing URL prefix matches.

  - Stale parent gitlinks (where the parent's pinned commit SHAs do
    not exist in the submodule's current commit graph, e.g. after a
    submodule is migrated between BYOG and standard or its history
    is rewritten). `git submodule update --force --recursive` resets
    the working tree to whatever HEAD points at after fetch instead
    of failing silently with an empty working tree.

Other notable changes:
  - New exported constant `GIT_CLOUD_MANAGER_HOST` (= `git.cloudmanager.
    adobe.com`) so the standard-host literal is centralised.
  - `clone()` and `pull()` accept `submoduleMap` instead of the previous
    `cmProgramRepos` map. The legacy field is unused and removed.
  - `data-access` schema: `SubmodulesMetadata.submoduleMap` documented
    in `index.d.ts` with the runtime contract, scope-derivation rules,
    and the URL-form decision (proxy vs standard).
  - cm-client `README.md` gains a "Submodule handling" section and a
    "Known limitations" section calling out cases that aren't supported
    by this PR (out-of-program submodules, broken BYOG connections,
    cross-org standard submodules, third-party hosts, recursive
    depth > 1, map drift). Each is documented as out-of-scope and a
    candidate for a follow-up PR.

Tests: 80 passing, 100% coverage on the cm-client. 1920 passing on
data-access.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workspace's eslint config enforces the `curly` rule
unconditionally. The local checkout's eslint had been more permissive
on a single-statement `if` body, so the issue slipped through. Fix
brings the inner `if (orgName) orgs.add(orgName)` into block form so
CI lint passes.

No behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants