POST /scrape { url } → scrapes the chart with headless Chrome → copies a Google Docs template →
inserts the chart with bold section-titles/chords and unbold lyrics → returns { docUrl, title, artist }.
Designed to run unattended on Cloud Run: headless, stateless, scales to zero, and authenticates to Google with a stored OAuth refresh token (zero human interaction per run). Triggerable from a phone.
This is the v2 service rewrite of the original local, GUI-only CLI tool (Replit, 2024). The document-formatting logic is preserved exactly; see the regression guard below.
A single POST /scrape drives the whole pipeline: a real browser clears Cloudflare, the chart is
located heuristically, the Crown-Jewels formatter builds the Docs payload, and a Google Doc link comes
back. The two deployment shapes differ only in where the real browser runs: a self-hosted Pi on a
residential IP (free), or a managed remote browser from Cloud Run (paid).
flowchart LR
phone["📱 iPhone<br/>Shortcut / Share Sheet"]
subgraph pi["🥧 Raspberry Pi · home · residential IP"]
direction TB
ex["server.js<br/>x-api-key guard<br/>+ UG-URL validation"]
fe["fetcher.js<br/>headed Chrome under Xvfb"]
det["detect.js<br/>heuristic chord-block scoring"]
fmt["formatter.js<br/>Crown Jewels batchUpdate"]
ex --> fe --> det --> fmt
end
ug["🎸 Ultimate Guitar<br/>behind Cloudflare"]
g["Google Docs + Drive API"]
phone -->|"POST /scrape { url }<br/>over Tailscale (WireGuard)"| ex
fe <-->|"real browser +<br/>residential IP clears the wall"| ug
fmt -->|"copy template → replace placeholders →<br/>re-read → bold-by-kind style pass"| g
g -.->|"docUrl"| ex
ex -.->|"{ docUrl, title, artist }"| phone
On Cloud Run, the fetcher.js box instead puppeteer.connects to a managed remote browser (FETCH_STRATEGY=remote): datacenter IPs are blocked, so the headed browser + residential egress are rented. Everything downstream (detect → format → Docs) is identical.
src/
server.js Express app + routes + API-key guard + URL validation
scraper.js scrapeSong(url) -> { title, artist, rawText } + extractChordText (strategy)
detect.js heuristic chord-block detection (content-fingerprint scoring), primary strategy
formatter.js Crown Jewels: builds replaceAllText requests + bold-by-kind style pass
layout.js pure section-aware layout: parse, compact, dedupe, wrap-aware 2-column pack
google/
auth.js OAuth2 client, /auth + /oauth2callback, refresh-token load
docs.js copy template, replace placeholders, re-read + style pass (all awaited, timed out)
config.js env-driven: templateId, folderId, scopes, selectors, strategy, port
constants.js sectionTitles[], titles regex, chords regex
test/
detect.test.js heuristic scoring unit tests
scraper-strategy.test.js extractChordText strategy wiring (fake page, no browser)
formatter.test.js regression: refactored payload === legacy payload
formatter.fixture.json captured legacy batchUpdate payload (golden)
e2e.test.js real UG fixtures through the full pipeline (headless setContent)
fixtures/e2e/ trimmed snapshots of 10 live UG pages + manifest.json
The E2E suite loads each saved page into headless Chrome via setContent (no network, so it is immune to
Cloudflare and fully deterministic) and runs the exact extraction pipeline, asserting the right chart,
title, and artist for 10 real songs. It self-skips if no Chromium binary is present.
The chord chart is located heuristically by default: src/detect.js scores candidate text
blocks by their content fingerprint (chord density, section headers, chord-alignment whitespace) and
picks the best one. This makes the scrape resilient to Ultimate Guitar's recurring DOM/class-name
changes, the failure mode that breaks selector-based scraping. The exact CSS selector
(selectors.chordBlock in src/config.js) is kept as a fallback (still fast and unambiguous when
it works, and worth re-pinning when convenient).
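To make the idea of content-fingerprint scoring concrete, here is a minimal sketch. It is not the code in src/detect.js: the regexes, the weights, and the names `scoreBlock` and `pickBestBlock` are assumptions chosen for illustration.

```javascript
// Hypothetical sketch of content-fingerprint scoring; not the actual src/detect.js.
// The regexes and weights below are illustrative assumptions.
const CHORD_TOKEN = /^[A-G][#b]?(?:maj|min|m|sus|dim|aug|add)?\d*(?:\/[A-G][#b]?)?$/;
const SECTION_HEADER = /^\[(?:intro|verse|chorus|bridge|outro|solo|pre-chorus)[^\]]*\]$/i;

// A line counts as a chord line when every whitespace-separated token looks like a chord.
function isChordLine(line) {
  const tokens = line.trim().split(/\s+/).filter(Boolean);
  return tokens.length > 0 && tokens.every((t) => CHORD_TOKEN.test(t));
}

// Score one candidate text block by the three signals named above.
function scoreBlock(text) {
  let score = 0;
  for (const line of text.split('\n')) {
    if (SECTION_HEADER.test(line.trim())) {
      score += 2; // section headers
    } else if (isChordLine(line)) {
      score += 1; // chord density
      if (/\S\s{2,}\S/.test(line)) score += 0.5; // chord-alignment whitespace
    }
  }
  return score;
}

// Return the best-scoring candidate, or null if nothing clears the threshold.
function pickBestBlock(candidates, minScore = 5) {
  let best = null;
  for (const text of candidates) {
    const score = scoreBlock(text);
    if (!best || score > best.score) best = { text, score };
  }
  return best && best.score >= minScore ? best : null;
}
```

Because the score depends only on what the text looks like, a renamed CSS class does not affect it; only a change in how charts themselves are written would.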
The strategy is env-switchable via SCRAPE_STRATEGY, the escape hatch if the heuristic ever
misbehaves in production (flip the env var, no code change):
| SCRAPE_STRATEGY | Behavior |
|---|---|
| heuristic (default) | heuristic first; fall back to the selector if nothing clears DETECT_MIN_SCORE |
| auto | selector first; fall back to the heuristic if the selector is empty |
| selector | selector only (exact legacy behavior) |
DETECT_MIN_SCORE (default 5) is the minimum chord-chart score the heuristic must clear to be
trusted: a weak best candidate is rejected rather than scraping the wrong element. Title/artist also
fall back to parsing document.title when their selectors fail.
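The fallback order for each SCRAPE_STRATEGY value can be sketched as follows. This is an illustrative sketch, not the actual extractChordText: `extractWithStrategy` and `readScrapeConfig` are hypothetical names, and `heuristic` / `selector` stand in for functions that return chart text or a falsy value. Treating an unknown strategy as `heuristic` is also an assumption.

```javascript
// Hypothetical wiring of SCRAPE_STRATEGY; `heuristic` and `selector` are stand-in
// functions that return chart text, or null/'' when they find nothing.
function extractWithStrategy(strategy, { heuristic, selector }) {
  switch (strategy) {
    case 'selector':
      return selector(); // selector only (legacy behavior)
    case 'auto':
      return selector() || heuristic(); // selector first, heuristic as fallback
    case 'heuristic':
    default:
      return heuristic() || selector(); // heuristic first, selector as fallback
  }
}

// Read the two tunables from the environment, applying the documented defaults.
function readScrapeConfig(env = process.env) {
  const raw = env.DETECT_MIN_SCORE;
  const parsed = raw === undefined || raw === '' ? NaN : Number(raw);
  return {
    strategy: env.SCRAPE_STRATEGY || 'heuristic',
    minScore: Number.isFinite(parsed) ? parsed : 5,
  };
}
```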
SCRAPE_STRATEGY decides how the chart text is located in a page; FETCH_STRATEGY decides how
the page is fetched. They are orthogonal. Ultimate Guitar sits behind Cloudflare bot protection
that serves a "Just a moment…" challenge to headless Chrome from any IP, datacenter or
residential (confirmed against a clean residential connection). A real (headed) browser on a clean
residential IP passes, but Cloud Run runs headless on datacenter IPs, so the deployed service needs a
real-user egress. Note both signals matter: a headed browser fixes the fingerprint, but you still
need a residential IP; the remote providers below bundle both.
| FETCH_STRATEGY | Behavior | Use |
|---|---|---|
| direct (default) | Puppeteer navigates UG directly, then waits out any interstitial | local dev; also the free self-hosted path: works on a residential host with a real (headed) browser (PUPPETEER_HEADLESS=false); blocked on Cloud Run's datacenter IP |
| proxy | Puppeteer navigates through a residential/mobile proxy, then waits out the interstitial | cheaper, more tuning; set PROXY_SERVER/PROXY_USERNAME/PROXY_PASSWORD |
| unlocker | a web-unlocker API returns rendered HTML (solves Cloudflare + TLS fingerprint + proxies) loaded via setContent | set UNLOCKER_API_URL/UNLOCKER_API_KEY |
| remote | puppeteer.connects to a real browser on a managed provider (Browserless / Browserbase) with stealth + residential IPs built in | recommended for Cloud Run: closest to "a real browser window", highest Cloudflare pass rate |
The fetch layer lives in src/fetcher.js and funnels every strategy down to a single Puppeteer
page, so extraction/formatting is reused unchanged. Browser acquisition (launch locally vs.
connect to a remote browser) is createBrowserSession; the unlocker request shape varies by
provider (Bright Data, Scrapfly, Zyte, ScrapingBee, …), so fetchViaUnlocker is the one function to
adapt, as is createBrowserbaseSession for a session-based remote provider. On the
direct/proxy/remote paths a clear, actionable error is thrown if a challenge page is detected.
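A challenge check of the kind described above might look like the following sketch. It assumes the interstitial can be recognized by its page title alone; `looksLikeChallenge` and `assertNotChallenged` are hypothetical names, and the real check in src/fetcher.js may use different signals.

```javascript
// Hypothetical challenge check. Matching on the page title is an assumption based
// on the "Just a moment…" interstitial; it is not the check in src/fetcher.js.
const CHALLENGE_TITLE = /^just a moment/i;

function looksLikeChallenge(title = '') {
  return CHALLENGE_TITLE.test(title.trim());
}

// Fail loudly with a message that says what to change, not just what went wrong.
function assertNotChallenged(title, fetchStrategy) {
  if (looksLikeChallenge(title)) {
    throw new Error(
      `Cloudflare challenge page received (FETCH_STRATEGY=${fetchStrategy}). ` +
        'Use a headed browser on a residential IP, or switch to FETCH_STRATEGY=remote.'
    );
  }
}
```

With Puppeteer, the title passed in would typically come from `await page.title()` after navigation settles.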
Two ways to point at a managed browser (resolved by resolveRemoteEndpoint):
- Browserless (or any provider exposing a static WebSocket endpoint): set REMOTE_BROWSER_WS_ENDPOINT to the full wss://…?token=…&proxy=residential URL. We connect to it directly. Treat the whole URL as a secret (it carries the token).
- Browserbase (a fresh session is minted per scrape): set BROWSERBASE_API_KEY + BROWSERBASE_PROJECT_ID (optional BROWSERBASE_REGION, BROWSERBASE_PROXIES). We POST to the Browserbase API, then connect to the returned connectUrl. The session is closed when the scrape finishes (browser.close()), which also stops proxy billing.
REMOTE_BROWSER_WS_ENDPOINT wins if both are configured.
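The precedence rule can be sketched as a pure function over the environment. This is a sketch under assumptions: the return shape (`kind`, `wsEndpoint`, and so on) and the `redactEndpoint` helper are hypothetical, and the actual resolveRemoteEndpoint may differ. No provider API is called here.

```javascript
// Hypothetical sketch of the precedence rule; the return shape is an assumption.
function resolveRemoteEndpoint(env = process.env) {
  if (env.REMOTE_BROWSER_WS_ENDPOINT) {
    // Static endpoint (Browserless-style): connect directly. Wins if both are set.
    return { kind: 'static', wsEndpoint: env.REMOTE_BROWSER_WS_ENDPOINT };
  }
  if (env.BROWSERBASE_API_KEY && env.BROWSERBASE_PROJECT_ID) {
    // Session-based provider: a session must be minted before connecting.
    return {
      kind: 'browserbase',
      apiKey: env.BROWSERBASE_API_KEY,
      projectId: env.BROWSERBASE_PROJECT_ID,
      region: env.BROWSERBASE_REGION || null,
      proxies: env.BROWSERBASE_PROXIES || null,
    };
  }
  throw new Error(
    'FETCH_STRATEGY=remote needs REMOTE_BROWSER_WS_ENDPOINT or ' +
      'BROWSERBASE_API_KEY + BROWSERBASE_PROJECT_ID'
  );
}

// The endpoint carries the token, so strip the query string before logging it.
function redactEndpoint(wsEndpoint) {
  const url = new URL(wsEndpoint);
  return `${url.protocol}//${url.host}${url.pathname}?<redacted>`;
}
```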
The zero-cost option. A Pi on your home network already has the two things Cloudflare wants (a
residential IP and somewhere to run a real browser), so FETCH_STRATEGY=direct works with no
paid provider. Set PUPPETEER_HEADLESS=false and run the headed Chrome under a virtual display
(Xvfb), since the Pi is monitor-less. The default stays headless, so Cloud Run is untouched. Full
walkthrough (install, systemd, phone trigger): docs/RASPBERRY_PI.md.
| Method | Route | Auth | Purpose |
|---|---|---|---|
| POST | /scrape | x-api-key header | scrape a UG URL → create a formatted Google Doc |
| GET | /healthz | none | liveness probe |
| GET | /auth | none | one-time OAuth consent (bootstrap) |
| GET | /oauth2callback | none | capture the refresh token (bootstrap) |
/scrape accepts only well-formed ultimate-guitar.com URLs and requires the x-api-key shared
secret. It is never an open scraping endpoint.
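The two guards on /scrape could be sketched like this. The helper names are hypothetical and the exact rules in src/server.js may differ; in particular, requiring https and accepting any ultimate-guitar.com subdomain are assumptions.

```javascript
// Hypothetical validation helpers; the exact rules in src/server.js may differ.
function isUltimateGuitarUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return false; // not a well-formed absolute URL
  }
  if (url.protocol !== 'https:') return false; // assumption: https only
  const host = url.hostname.toLowerCase();
  // Match the apex domain or a true subdomain, never a look-alike suffix.
  return host === 'ultimate-guitar.com' || host.endsWith('.ultimate-guitar.com');
}

// Express-style middleware factory for the shared-secret header.
function requireApiKey(expectedKey) {
  return (req, res, next) => {
    if (!expectedKey || req.headers['x-api-key'] !== expectedKey) {
      return res.status(401).json({ error: 'invalid or missing x-api-key' });
    }
    return next();
  };
}
```

Parsing with `URL` and comparing the hostname avoids the classic substring mistake, where `https://ultimate-guitar.com.evil.test/` would pass a naive `includes` check.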
cp .env.example .env # fill in OAuth client + API_KEY (see DEPLOY.md for the bootstrap)
npm install
npm test # formatter regression guard
npm run lint
node src/server.js # GET http://localhost:8080/healthz -> {"status":"ok"}

The formatting logic in src/formatter.js + src/layout.js is the fragile heart of the tool. It
builds the Google Docs batchUpdate payload in two passes:
- Pass 1, buildReplaceRequests: three replaceAllText requests that fill the template's title placeholder and the two table-cell placeholders ("Song Title - Artist Name", "col1", "col2") with the rendered column text. Pure string assembly, no index math.
- Pass 2, buildStyleRequests: after the doc is re-read, one updateTextStyle per paragraph, bold by rendered kind (chord/section bold, lyric not), aligned by content position with indices taken straight from the document. No "guess then unbold", and no global-regex state between lines.
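The shape of the two passes can be sketched as below. The request objects follow the Google Docs batchUpdate request types (replaceAllText, updateTextStyle), but the function bodies and input shapes are assumptions for illustration, not the contents of src/formatter.js. In particular, `paragraphs` stands in for whatever the re-read of the document yields.

```javascript
// Hypothetical sketch of the two passes; not the actual src/formatter.js.
// Pass 1: pure string assembly, no document indices involved.
function buildReplaceRequests({ title, artist, col1, col2 }) {
  const replace = (placeholder, replaceText) => ({
    replaceAllText: {
      containsText: { text: placeholder, matchCase: true },
      replaceText,
    },
  });
  return [
    replace('Song Title - Artist Name', `${title} - ${artist}`),
    replace('col1', col1),
    replace('col2', col2),
  ];
}

// Pass 2: `paragraphs` is an assumed shape for the re-read document:
// [{ startIndex, endIndex, kind }] with kind in 'chord' | 'section' | 'lyric'.
function buildStyleRequests(paragraphs) {
  return paragraphs.map(({ startIndex, endIndex, kind }) => ({
    updateTextStyle: {
      range: { startIndex, endIndex },
      textStyle: { bold: kind === 'chord' || kind === 'section' },
      fields: 'bold', // touch only the bold property
    },
  }));
}
```

Emitting an explicit bold value for every paragraph, lyric lines included, is what removes the need for a separate unbold step.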
Layout (src/layout.js) is a pure pipeline: parse into sections (dropping preamble/footer chrome) →
collapse repeated chord lines → detect repeats → render full or as a one-liner → pack whole sections
into two columns (never split, never duplicate) with a wrap-aware line budget so column 1 doesn't
spill to a second page. Line classification reuses src/detect.js (classifyLine).
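The last step, packing whole sections against a wrap-aware line budget, can be sketched as follows. The default `lineBudget` and `columnWidth` values and all function names are illustrative assumptions, not the tuned numbers or code in src/layout.js.

```javascript
// Hypothetical sketch of wrap-aware packing; budget and width are assumed values.
function renderedLineCount(line, columnWidth) {
  // A long line wraps, so it costs more than one rendered line.
  return Math.max(1, Math.ceil(line.length / columnWidth));
}

function sectionCost(section, columnWidth) {
  return section.lines.reduce(
    (sum, line) => sum + renderedLineCount(line, columnWidth),
    0
  );
}

// Sections are placed whole: never split across columns, never duplicated.
function packTwoColumns(sections, { lineBudget = 50, columnWidth = 40 } = {}) {
  const col1 = [];
  const col2 = [];
  let used = 0;
  let spilled = false;
  for (const section of sections) {
    const cost = sectionCost(section, columnWidth);
    // Once one section spills, all later sections follow it to keep song order.
    if (!spilled && used + cost <= lineBudget) {
      col1.push(section);
      used += cost;
    } else {
      spilled = true;
      col2.push(section);
    }
  }
  return { col1, col2 };
}
```

Counting wrapped lines rather than raw lines is the point of the budget: a column with few but long lines can still overflow the page.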
npm test includes a golden regression (test/formatter.test.js vs test/formatter.fixture.json,
the pass-1 replaceAllText payload for the sample chart) plus behavior tests in test/layout.test.js
and 10 real-chart fixtures in test/charts.test.js. Regenerate the fixture only with npm run fixture
and explicit sign-off.
See DEPLOY.md for the full gcloud run deploy flow (image build, Secret Manager
wiring, and the one-time /auth bootstrap).
Two human action items that cannot be automated:
- Set the OAuth consent screen to Production (in Testing status, Google expires refresh tokens after 7 days).
- Add the deployed /oauth2callback URL to the OAuth client's Authorized redirect URIs.
- (Maintenance) Re-pin the UG selectors in src/config.js if a scrape returns empty fields.
CI/CD: PRs are gated by .github/workflows/ci.yml and merges to main auto-deploy to Cloud Run via
.github/workflows/deploy.yml (keyless Workload Identity Federation); see DEPLOY.md §7.
POST /scrape is all a phone needs. See docs/MOBILE.md for an iOS Shortcut
that takes a shared Ultimate Guitar link from the Share Sheet, calls the service with the x-api-key,
and opens the finished Google Doc.
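For clients other than the iOS Shortcut, the same call can be sketched in Node (18 or later, for the built-in fetch). The helper names and the example host are hypothetical, and sending the body as JSON with a content-type header is an assumption based on the `{ url }` request shape.

```javascript
// Hypothetical client helper; the service URL and API key are caller-supplied.
function buildScrapeRequest(serviceUrl, apiKey, songUrl) {
  return {
    url: new URL('/scrape', serviceUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'content-type': 'application/json', 'x-api-key': apiKey },
      body: JSON.stringify({ url: songUrl }),
    },
  };
}

// `fetchImpl` is injectable so the call can be exercised without a network.
async function scrape(serviceUrl, apiKey, songUrl, fetchImpl = fetch) {
  const { url, init } = buildScrapeRequest(serviceUrl, apiKey, songUrl);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`scrape failed: HTTP ${res.status}`);
  return res.json(); // expected shape: { docUrl, title, artist }
}
```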
How the phone reaches the service depends on the deployment:
| Deployment | Endpoint the Shortcut calls | Reachable from |
|---|---|---|
| Pi + Tailscale (free, recommended) | http://<pi-name>.<tailnet>.ts.net:8080/scrape | anywhere (encrypted, nothing public) |
| Pi on home Wi-Fi | http://<pi-lan-ip>:8080/scrape | home network only |
| Cloud Run | https://<cloud-run-url>/scrape | anywhere (public HTTPS) |
For the self-hosted path, Tailscale is the sweet spot: a private WireGuard link to the Pi from any network, no port-forwarding, nothing exposed to the public internet. Full walkthrough in docs/MOBILE.md and docs/RASPBERRY_PI.md.
PDF auto-export, repeating-chord-pattern dedup, and other mobile surfaces (web PWA / Telegram bot). Clean extension points are left in place.
This repo follows the Konjo Quality Framework (CLAUDE.md, KONJO_QUALITY_FRAMEWORK.md,
.claude/, .konjo/). Run /konjo to boot a session.
MIT © Konjo Tech - Wesley Scholl