feat: WebAssembly binding (full stack)#531

Open
gergelyvagujhelyi wants to merge 196 commits into main from wasm-pr4-release

gergelyvagujhelyi (Contributor) commented May 13, 2026

Description

JavaScript binding for NobodyWho — runs local LLMs in a browser tab (or any wasm host) via llama.cpp compiled to wasm32 with Emscripten. The same core engine that powers the Python, Godot, Flutter, and Uniffi bindings, exposed to JS through wasm-bindgen.

This PR consolidates the entire stack that was previously staged across #526, #529, #530 — those have been closed in favour of this single PR per maintainer request. All commits preserved.

What's delivered

Core surface

import createNobodyWhoModule from '@nobodywho/js';
const m = await createNobodyWhoModule();

// Chat — inference runs on a background pthread (Emscripten pthreads),
// same std::thread::spawn code path as native. Pass modelUrl (browser)
// or modelPath (Node).
const chat = await m.Chat.create({
  modelUrl: 'https://huggingface.co/.../model.gguf',
  systemPrompt: 'You are a helpful assistant.',
  templateVariables: { enable_thinking: false },
  sampler: new m.SamplerBuilder().temperature(0.7).topK(40).topP(0.95, 1).dist(),
  tools: [getWeather, lookupTime],
});

// Streaming: per-token async iteration
for await (const tok of chat.ask('What is the capital of Denmark?')) {
  process.stdout.write(tok);  // each token arrives as the model samples it
}

// Streaming: collect the full response
const reply = await chat.ask('Hi').completed();

// Both return a TokenStream with:
//   .next()       → Promise<{value: string, done: boolean}>  (async iterator)
//   .completed()  → Promise<string>  (buffered full response)
//   [Symbol.asyncIterator] support for `for await`

// Multimodal: pass an array of mixed string | Image | Audio parts
const img = m.Image.fromBytes(jpegBytes);
const desc = await chat.ask(['Describe this:', img]).completed();

// Tools — sync or Promise-returning callbacks both work
const tool = m.Tool.fromFn(
  'get_weather',
  'Get current weather for a city',
  { type: 'object', properties: { city: { type: 'string' } }, required: ['city'] },
  async (args) => fetch(`https://api.example/weather?city=${args.city}`).then(r => r.text()),
);

// Structured output — constrain generation to a JSON schema
const jsonChat = await m.Chat.create({
  modelUrl: 'https://...',
  sampler: m.SamplerPresets.constrainWithJsonSchema({ type: 'object', properties: { city: { type: 'string' } } }),
});

// Embeddings + reranking — share the same Model instance
const model = await m.Model.load({ modelUrl: 'https://...' });
const encoder = new m.Encoder(model, 2048);
const vec = await encoder.encode('the quick brown fox');
const xenc = new m.CrossEncoder(model, 2048);
const ranked = await xenc.rankAndSort('query', ['doc1', 'doc2']);

API status

Status       Surface
✅ verified  Model.load({modelUrl | modelPath, mmprojUrl | mmprojPath})
✅ verified  Encoder.encode(text) → Float32Array
✅ verified  CrossEncoder.rank(query, docs) / rankAndSort(...)
✅ verified  Chat.create({modelUrl | modelPath, ...}) → Chat
✅ verified  Chat.ask(prompt) → TokenStream
✅ verified  for await (const tok of stream) (async iteration)
✅ verified  TokenStream.next() / .completed()
✅ verified  Chat.terminate() → Promise<void> (no-op, kept for API compat)
✅ verified  SamplerConfig / SamplerBuilder / SamplerPresets
✅ verified  Structured output via constrainWithJsonSchema() / constrainWithRegex() / constrainWithGrammar()
✅ verified  Tool calling (Tool.fromFn(...), sync + async callbacks)
✅ verified  Multimodal vision/audio (Image.fromBytes / Audio.fromBytes)
✅ verified  Chat.create({modelUrl}): streaming download + Cache API
✅ verified  Chat.create({modelPath}): NODEFS disk access (Node-only)
✅ verified  mmap-backed tensor loading (CPU_Mapped)

Threading model: Emscripten pthreads

The wasm build uses Emscripten pthreads (-pthread, +atomics,+bulk-memory,+mutable-globals Rust target features). This enables std::thread::spawn on wasm — the same code path as native. ChatHandleAsync::new() spawns a real pthread for inference, and llama.cpp's ggml threadpool uses available_parallelism() to pick n_threads (maps to navigator.hardwareConcurrency in browser, os.cpus().length in Node).

Browser requirement: serving origin must set Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers for SharedArrayBuffer.
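
For anyone serving the demo from a hand-rolled dev server, the two headers can be sketched as follows. This is only an illustration of the header lines a response needs; the function name `isolated_response` is hypothetical and not part of this PR, and a real server would normally set these through its own configuration.

```rust
/// Build a minimal HTTP/1.1 response that opts the page into cross-origin
/// isolation, which browsers require before exposing SharedArrayBuffer.
/// `isolated_response` is a hypothetical helper used only for illustration.
fn isolated_response(content_type: &str, body: &str) -> String {
    format!(
        "HTTP/1.1 200 OK\r\n\
         Content-Type: {}\r\n\
         Content-Length: {}\r\n\
         Cross-Origin-Opener-Policy: same-origin\r\n\
         Cross-Origin-Embedder-Policy: require-corp\r\n\
         \r\n\
         {}",
        content_type,
        body.len(),
        body
    )
}

fn main() {
    // Print the raw response so the header lines can be inspected.
    println!("{}", isolated_response("text/html", "<p>hi</p>"));
}
```

Every response for the page and its subresources (including the .wasm and worker scripts) needs to satisfy the embedder policy, not only the top-level document.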

Build requirement: Rust nightly with -Zbuild-std=std,panic_abort (pre-compiled std for wasm32-unknown-emscripten lacks atomics).

The Chat class directly wraps the core ChatHandleAsync — no Web Worker wrapper, no message protocol, no RPC bridge. Token streaming works through tokio channels across pthreads, same as native.

Build target: wasm32-unknown-emscripten

Build pipeline:

   nobodywho/core (Rust)  +  llama-cpp-2 fork (wasm-emscripten branch)
        |
        | cargo +nightly -Zbuild-std=std,panic_abort
        | emcc for C/C++ side, rustc for Rust side
        v
   wasm32-unknown-emscripten .wasm  (with -pthread, SharedArrayBuffer)
        |
        | patched wasm-bindgen-cli (pthreads-compatible)
        | + post-link emcc with -pthread + --js-library
        v
   pkg-bundler/
     ├── nobodywho_js.js          (Emscripten loader with pthread runtime)
     ├── nobodywho_js_bg.wasm     (linked wasm, shared memory, ~10 MB)
     ├── library_bindgen.js       (kept for debugging)
     └── pre.js                   (HEAP_DATA_VIEW shim, inlined by emcc)

Three temporary forks (all will go away when upstream merges):

  • nobodywho-ooo/wasm-bindgen: descriptor-interpreter fixes + Emscripten output mode + pthreads compatibility (skip thread/multivalue transforms, synthesize stack pointer shim from Emscripten exports)
  • nobodywho-ooo/llama-cpp-rs: CMAKE_SYSTEM_PROCESSOR=wasm32, -matomics -mbulk-memory for shared-memory link compat, MA_NO_* defines for miniaudio, -fexceptions for mtmd
  • walkingeyerobot/emscripten: the -sWASM_BINDGEN flag (PR "Add wasm-bindgen support", emscripten-core/emscripten#23493)

Multimodal (Path A — MEMFS-virtualized)

Vision and audio input work end-to-end through bytes. Image.fromBytes(uint8) / Audio.fromBytes(uint8) write to content-hashed MEMFS paths; llama.cpp reads them via strong syscall overrides in js/src/syscall_imports.rs.
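
The idea of a content-hashed path can be sketched as below. This is an assumption-laden illustration, not the binding's code: the directory `/tmp/media`, the function name `content_hashed_path`, and the use of `DefaultHasher` are all hypothetical, and the PR does not say which hash function or path layout is actually used.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a file path from the media bytes, so identical content always maps
/// to the same virtual file. Hypothetical layout; DefaultHasher output is not
/// guaranteed to be stable across Rust releases, so a real implementation
/// would likely pick a fixed algorithm.
fn content_hashed_path(bytes: &[u8], extension: &str) -> String {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    format!("/tmp/media/{:016x}.{}", hasher.finish(), extension)
}

fn main() {
    let jpeg = [0xFF_u8, 0xD8, 0xFF, 0xE0];
    println!("{}", content_hashed_path(&jpeg, "jpg"));
}
```

A property of this scheme is that passing the same image twice reuses one path instead of writing a second copy.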

Verified: Qwen3.5-0.8B vision, Qwen2-VL, Gemma 3 vision, Qwen3-ASR audio (WAV/MP3/FLAC).

Tool calling

Tool.fromFn(name, description, jsonSchema, callback) — the callback runs directly in the inference context (no RPC bridge needed since pthreads share the same wasm instance). Both sync and async (Promise-returning) callbacks work.

Structured output

Constraints via SamplerPresets.constrainWithJsonSchema() / constrainWithRegex() / constrainWithGrammar(). llguidance works on Emscripten (its clock_gettime requirement is satisfied by Emscripten's libc).

Core changes that affect other bindings

  • tokio features split (rt-multi-thread native vs rt wasm); native-only deps gated behind cfg(not(target_family = "wasm")).
  • Worker channel: std::sync::mpsc → tokio::sync::mpsc::unbounded_channel. Public API unchanged.
  • std::thread::spawn used on all targets (including wasm, via Emscripten pthreads). No more spawn_local wasm path.
  • WorkerGuard unified — single struct with JoinHandle on all targets.
  • n_threads uses available_parallelism() on wasm (was hardcoded to 1).
  • New mtmd cargo feature (default-on), get_model_from_path, get_model_from_bytes, Tool::new_async, mtmd_marker_string().

Native consumers (Python, Godot, Flutter, uniffi) see no API changes. cargo check --workspace passes cleanly.

v1 limitations

  • 4 GiB ceiling — wasm32 hard limit. Models + KV cache + compute must fit.
  • Browser COOP/COEP — pthreads require cross-origin isolation headers.
  • Audio: WAV/MP3/FLAC only — Ogg not supported under Emscripten.
  • OpenAI-typed-content chat templates (SmolVLM, some Phi-3-Vision) — not yet supported.
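
As a rough way to reason about the 4 GiB ceiling, the budget can be estimated up front. The sketch below is a back-of-the-envelope check, not something this PR ships: the function names are hypothetical, the KV-cache formula is the usual 2 (K and V) × layers × context × KV heads × head dimension × bytes per element, and the compute-buffer figure has to be supplied by the caller. Passing the check is necessary but not sufficient, since the wasm heap also holds code, stacks, and allocator overhead.

```rust
/// Upper bound of a wasm32 linear memory.
const WASM32_LIMIT_BYTES: u64 = 4 * 1024 * 1024 * 1024;

/// Standard KV-cache size estimate: one K and one V tensor per layer.
fn kv_cache_bytes(n_layers: u64, n_ctx: u64, n_kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
}

/// True when model weights + KV cache + compute buffers stay within 4 GiB.
/// Overflowing sums are treated as "does not fit".
fn fits_in_wasm32(model_bytes: u64, kv_bytes: u64, compute_bytes: u64) -> bool {
    model_bytes
        .checked_add(kv_bytes)
        .and_then(|sum| sum.checked_add(compute_bytes))
        .map_or(false, |total| total <= WASM32_LIMIT_BYTES)
}

fn main() {
    // Illustrative shape only: 24 layers, 2048-token context, 2 KV heads,
    // head dimension 64, f16 cache (2 bytes per element).
    let kv = kv_cache_bytes(24, 2048, 2, 64, 2);
    println!("kv cache: {} bytes", kv);
    println!("fits: {}", fits_in_wasm32(379_000_000, kv, 100_000_000));
}
```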

Tested

  • cargo check --workspace: clean on native.
  • cargo test -p nobodywho-js: lint tests pass.
  • bash js/scripts/build-pkg-emscripten.sh: ~60s on M-series macOS.
  • All 17 smoke tests pass under Node (forawait, sampler, tool, stop, history, setters, terminate, context-shift, constraint, audio, vision, modelpath, parity-extras, etc.)
  • Browser demos verified with COOP/COEP headers.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change
  • Documentation update

Adds a new workspace member nobodywho-wasm that mirrors the Python binding's
async API surface (Model, Chat, TokenStream, Encoder) via wasm-bindgen.

Status: scaffold only. Compiles cleanly on native (rlib) so cargo check at
the workspace root keeps working, but wasm-pack build --target web needs
two prerequisites that aren't done yet:

1. The marek-hradil/llama-cpp-rs fork needs a wasm32 build path (Emscripten
   cmake). We'll carry our own fork as a patch carrier.
2. nobodywho/core needs cfg(target_arch = "wasm32") gating for std::thread,
   ureq downloads, and tokio rt-multi-thread, plus a get_model_from_bytes
   API for browser use.

Both blockers are tracked in nobodywho/wasm/README.md and as TODO comments
in the binding source. The binding's shape is independent of both.

Step 2a of the WASM binding rollout. Splits a few core dependencies so that
`cargo check --target wasm32-unknown-unknown -p nobodywho` no longer fails
immediately on dep resolution. Native builds are unchanged.

- tokio: `rt-multi-thread` only on native; wasm gets plain `rt` (no OS
  threads on wasm32-unknown-unknown).
- ureq: native only (raw TCP/TLS not available in a browser sandbox).
- indicatif: native only (no terminal in a browser).
- dirs: native only AND non-android (Android already reads /proc/self/cmdline
  directly; wasm has no filesystem).
- `default_progress_callback` in llm.rs and the `indicatif::*` import: gated
  to non-wasm. Native callers unchanged.

Still blocking wasm-pack build (tracked in wasm/README.md):
- Step 1: llama-cpp-2 fork needs wasm32 build path
- Step 2b: Worker refactor (std::thread::spawn -> spawn_local)
- Step 2c: Model::load_bytes constructor

cargo check --workspace passes; only pre-existing deprecation warnings in
uniffi/godot/flutter remain.

The wasm branch (https://github.com/nobodywho-ooo/llama-cpp-rs/tree/wasm)
is forked from marek-hradil/main at 8550f04e, so it inherits the llguidance
/ EOS-fix / lark / with_logits_ith_mut patches that core depends on today.
It adds one commit (a35e66a) that scaffolds wasm32 detection in the build
script — a no-op on native targets.

Why this swap matters:
- Native builds are functionally unchanged (the new branch's only added
  commit is wasm32-gated, native targets never see it).
- The wasm32 build path now lives in a repo nobodywho controls. Step 1 of
  the WASM rollout (Emscripten cmake support, feature gating, load_from_buffer
  wrapper) lands on this branch.

cargo check --workspace passes; only pre-existing deprecation warnings in
uniffi/godot/flutter remain.

When marek-hradil/main moves forward with more patches, we'll rebase
the wasm branch on top.

cargo check --target wasm32-unknown-unknown -p nobodywho-wasm now succeeds
(with dead-code warnings only). Linking will still fail because the
llama-cpp-rs wasm branch doesn't produce a wasm artifact yet, but every
Rust crate in the workspace now type-checks under the wasm32 target.

Changes:

* grammar/gbnf/Cargo.toml — disable jsonschema default features. The only
  use is jsonschema::meta::is_valid (a pure in-memory metaschema check);
  defaults pull in resolve-http/resolve-file -> reqwest -> hyper -> tokio
  with the 'net' feature -> mio, which doesn't build on wasm32. Drops
  ~12 transitive crates from the dependency graph.

* core/Cargo.toml — move monty (Python interpreter) and bashkit (virtual
  bash) into the not(target_arch = wasm32) deps block. Both have native-
  only requirements: monty has no wasm32 target; bashkit forces
  tokio/full which pulls mio + raw sockets.

* core/src/tool_calling/mod.rs — cfg-gate Tool::python and Tool::bash
  (along with their monty/bashkit imports) to non-wasm. Wasm consumers
  can still build custom tools via Tool::new.

* core/src/llm.rs — cfg-gate the model-loader infrastructure (get_model,
  get_model_async, parse_model_path, resolve_fancy_path_to_fs, the cache
  dir helpers, download_file, download_model_from_hf,
  download_model_from_url, ParsedModelPath enum) to non-wasm32. A
  future Model::load_from_bytes API will handle in-memory loading for
  wasm (see nobodywho/wasm/README.md Step 2c).

* wasm/src/lib.rs — fix init() to use tracing_wasm::try_set_as_global_default
  instead of constructing a Layer and passing it to set_global_default
  (which needs a Subscriber).

Update Cargo.lock to point at the rebased nobodywho-ooo/llama-cpp-rs wasm
branch (6c35d9a). That branch is now marek-hradil/main + Asbjorn's three
cherry-picked Emscripten commits + a fix for two non-exhaustive matches he
missed + WASM.md doc.

Update wasm/README.md to reflect the new world:
- Target is now wasm32-unknown-emscripten (not -unknown-unknown).
- Document the emsdk install path so contributors can actually link.
- Note that cargo check now panics with a clear sysroot message when emcc
  isn't installed, which is the correct behavior.
- Drop the obsolete Step 1 block ("llama-cpp-2 needs a wasm32 build path");
  that's now done.
- Expand Step 2a recap with the additional cfg-gates that landed in
  0e2f913 (gbnf, monty/bashkit, model-loader infra).

- core/src/llm.rs: add get_model_from_bytes(bytes, gpu_layers) which calls
  the new LlamaModel::load_from_buffer in the llama-cpp-2 fork (commit
  606c4759 on the wasm branch). Bypasses every filesystem and HTTP code
  path — pure 'bytes in, Model out'. Gated to non-windows-MSVC to match
  the upstream wrapper.

- wasm/src/lib.rs: Model.loadBytes(uint8Array) now delegates to
  get_model_from_bytes with gpu_layers=0 (wasm32 has no GPU). The
  placeholder JsError is gone — this is a real path now.

- Cargo.lock: pull llama-cpp-2 fork at 606c4759 (load_from_buffer).

cargo check --workspace passes on native. cargo check --target
wasm32-unknown-emscripten -p nobodywho-wasm goes through the full
toolchain and halts at the expected 'Could not detect Emscripten
sysroot' message when emcc isn't installed.
…+ cfg

Step 2b: makes the background worker pattern (ChatHandleAsync, EncoderAsync,
CrossEncoderAsync, plus the sync ChatHandle) compile and behave correctly on
wasm32-unknown-emscripten while preserving native behavior bit-for-bit.

Changes:

* WorkerGuard: channel field changed from std::sync::mpsc::Sender to
  tokio::sync::mpsc::UnboundedSender. join_handle field cfg-gated to
  non-wasm. Two constructors: 3-arg on native, 2-arg on wasm.

* Worker loops: `std::sync::mpsc::channel()` -> `tokio::sync::mpsc::unbounded_channel()`.
  Native loops use `blocking_recv()` inside std::thread::spawn — semantically
  identical to the previous blocking recv. Wasm uses
  `wasm_bindgen_futures::spawn_local` + `recv().await` — a single-threaded
  cooperative pump on the JS event loop. Decode work blocks the event loop
  during each message; Web-Worker parallelism is a follow-up.

* core/Cargo.toml: add wasm-bindgen-futures to the wasm32 deps block.

* WorkerGuard Drop: native joins the thread (unchanged ordering wrt
  closing the sender so LLAMA_BACKEND stays alive during teardown).
  Wasm: dropping the sender causes recv().await to return None and the
  spawn_local future completes on its next poll — no explicit join.
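
The native teardown ordering described above (close the sender first, then join the thread) can be sketched with the standard library alone. This is a simplified stand-in, not the code in core/src/llm.rs: it uses `std::sync::mpsc` instead of the tokio unbounded channel, `String` messages instead of the real worker messages, and the `shutdown` method exists only so the example has something observable to return.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Simplified guard: owns the sending half of the channel and the worker thread.
struct WorkerGuard {
    sender: Option<Sender<String>>,
    join_handle: Option<JoinHandle<usize>>,
}

impl WorkerGuard {
    fn spawn() -> Self {
        let (tx, rx) = channel::<String>();
        let handle = thread::spawn(move || {
            let mut handled = 0;
            // recv() returns Err once every sender is dropped, ending the loop.
            while let Ok(_msg) = rx.recv() {
                handled += 1;
            }
            handled
        });
        WorkerGuard { sender: Some(tx), join_handle: Some(handle) }
    }

    fn send(&self, msg: &str) {
        if let Some(tx) = &self.sender {
            let _ = tx.send(msg.to_string());
        }
    }

    /// Explicit teardown that reports how many messages the worker handled.
    fn shutdown(mut self) -> usize {
        drop(self.sender.take());
        self.join_handle.take().and_then(|h| h.join().ok()).unwrap_or(0)
    }
}

impl Drop for WorkerGuard {
    fn drop(&mut self) {
        // Close the channel first so the worker loop can exit...
        drop(self.sender.take());
        // ...then wait for the thread, so it never outlives the guard.
        if let Some(handle) = self.join_handle.take() {
            let _ = handle.join();
        }
    }
}

fn main() {
    let guard = WorkerGuard::spawn();
    guard.send("hello");
    println!("worker handled {} message(s)", guard.shutdown());
}
```

Joining before closing the sender would deadlock, because the worker would still be blocked in `recv()`.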

Files touched:
- core/Cargo.toml
- core/src/llm.rs (WorkerGuard struct, constructors, Drop)
- core/src/chat.rs (ChatHandle::new, ChatHandleAsync::new)
- core/src/encoder.rs (EncoderAsync::new)
- core/src/crossencoder.rs (CrossEncoderAsync::new)

cargo check --workspace (native) passes. cargo check --target
wasm32-unknown-emscripten -p nobodywho-wasm passes through everything
Rust and halts at the expected 'install emcc' Emscripten sysroot check.

cargo build --target wasm32-unknown-emscripten -p nobodywho-wasm produces
a 113MB debug nobodywho_wasm.wasm artifact (~10-20MB after release strip).
End-to-end: bindgen + cc + cmake + emscripten + wasm-bindgen + wasm-ld all
in one pipeline.

Three changes here:

* wasm/src/lib.rs — rewrite the wasm-bindgen surface to return js_sys::Promise
  manually rather than `pub async fn`. The macro-generated futures captured
  non-UnwindSafe types (tokio::sync::Mutex, tokio mpsc receivers,
  TokenStreamAsync's interior) which made future_to_promise reject them at
  compile time. The new `promisify` helper wraps each future body in
  AssertUnwindSafe + catch_unwind, asserting the unwind-safety we get for
  free in a single-threaded JS environment and turning any panic into a
  rejected promise rather than tearing down the wasm instance.

* core/src/chat.rs — gate `ChatBuilder::build()` and the sync `ChatHandle`
  type/impl to non-wasm32. The sync variant uses `.completed()` which
  blocks; there's nothing to block on in a browser tab. Use `build_async`
  and await on the returned future instead.

* Cargo.lock — pull the llama-cpp-rs fork at 85987657, which adds the
  -fPIC fixes (configure_emscripten_cc for the wrapper shims, and
  CMAKE_C_FLAGS=-fPIC + CMAKE_POSITION_INDEPENDENT_CODE=ON for llama.cpp's
  own cmake build). Without those, wasm-ld errors with 'relocation
  R_WASM_MEMORY_ADDR_LEB cannot be used against symbol …; recompile with
  -fPIC' for every static symbol in the C/C++ code.
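
The core of the `promisify` idea (wrap the body in AssertUnwindSafe + catch_unwind and turn a panic into an error value) can be sketched without wasm-bindgen. This is a simplified illustration under stated assumptions: the real helper wraps a future and produces a js_sys::Promise, whereas this sketch wraps a synchronous closure and uses `Result<T, String>` as a stand-in for resolve/reject. The name `run_guarded` is hypothetical.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Run `body`, converting a panic into Err(message) instead of unwinding
/// into the caller. AssertUnwindSafe is the caller's promise that observing
/// state after a panic is acceptable.
fn run_guarded<T>(body: impl FnOnce() -> T) -> Result<T, String> {
    panic::catch_unwind(AssertUnwindSafe(body)).map_err(|payload| {
        // Panic payloads are usually &str (literal) or String (formatted).
        if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}

fn main() {
    println!("{:?}", run_guarded(|| 2 + 2));
    println!("{:?}", run_guarded(|| -> i32 { panic!("boom") }));
}
```

This only works when panics unwind; under a panic=abort profile there is nothing for catch_unwind to intercept.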

Native cargo check --workspace still passes.
…main

Three things, one commit:

- cargo fmt --all on the two sites the linting CI flagged
  (core/src/llm.rs:185 get_model_from_bytes signature joined onto one
  line; wasm/src/lib.rs:57 promisify .map() closure collapsed).

- cfg-gate the items that pr1's wasm-target build was warning about,
  dropping warning count from 9 to 0:
  * core/src/llm.rs — Read/Write, Duration, info_span imports.
  * core/src/chat.rs — sync TokenStream + impl (only consumers are the
    sync ChatHandle, already gated, and the Python binding which is
    native-only). Doc-comment notes blocking_recv would deadlock the
    JS event loop, so this is also a footgun guard.
  * core/src/memory.rs — GgufModelInfo, read_gguf_model_info,
    estimate_per_layer_bytes, plan_model_loading, plus the now-unused
    std::path::Path import. plan_context / ContextPlan stay public:
    they are called from chat-context setup which runs on wasm too.

- regenerate Cargo.nix and crate-hashes.json. The pre-existing files
  pointed at marek-hradil/llama-cpp-rs and didn't know about the
  nobodywho-wasm workspace member, so nix flake check failed.

Plus a small native error-path test for get_model_from_bytes in
core/src/llm.rs (gated off Windows MSVC, mirroring the function's own
gate). Tests the InvalidModel error path; the success path on Linux
hits an upstream llama.cpp issue with load_from_buffer returning NULL,
tracked separately.

Includes the rebase onto main (PR #527, watchOS Metal-skip): adopts
main's visionos/watchos exclusions on the Vulkan target cfg + the
updated comment, while keeping pr1's switch to the
nobodywho-ooo/llama-cpp-rs#wasm fork. The iOS Metal block stays
explicit (with wasm-fork URL) until the wasm fork rebases on top of
marek-hradil's main to pick up the upstream watchOS Metal-skip patch.
Cargo.lock incorporates the windows-sys 0.52 -> 0.61 transitive bump
from main.

Two nix bugs surfaced and worked around (worth knowing for downstream
PRs):
  1. crate2nix's '-h crate-hashes.json' caches by name@version, not by
     full URL+commit — URL changes silently reuse the old tree's hash.
  2. pkgs.fetchgit defaults fetchSubmodules = true. nix-prefetch-git
     defaults to --no-fetch-submodules. The llama-cpp-rs repo vendors
     llama.cpp as a submodule, so the with-submodules tree hashes
     differently. Manually putting the correct (with-submodules) hash
     into crate-hashes.json before running crate2nix is the workaround.

Combined documentation pass spanning the Emscripten →
wasm32-unknown-unknown + wasi-sdk exploration:
  - first README rewrite once the Emscripten build started working
  - documented the wasm-bindgen + Emscripten incompatibility that
    eventually drove the switch to wasm32-unknown-unknown
  - rewrote the outstanding-work section after the Path B trial gave
    a clearer picture of remaining tasks
  - reflected Path B's C++ side completion with Rust mtmd gating in
    core as the next step

Path B prep: introduce a `mtmd` feature on `nobodywho` (default on) that
gates multimodal support. The wasm binding opts out via
`default-features = false` so the underlying llama-cpp-2 doesn't enable
its own `mtmd` feature for wasm builds.

Native unchanged (feature is on by default, so every existing code path
stays compiled). Wasm build no longer pulls in mtmd's miniaudio
dependency (which uses pthread sched APIs wasi-libc lacks).

This commit gates the easy parts:
- core/Cargo.toml: mtmd feature definition, propagation to llama-cpp-2
- core/src/errors.rs: FailedReadingMediaEmbeddings variant
- core/src/llm.rs: MtmdInputChunks import + read_media_embeddings method
- core/src/template.rs: MtmdBitmap import + RenderedChat.bitmaps field
- wasm/Cargo.toml: default-features = false on nobodywho dep

Still pending for the full wasm build:
- core/src/chat.rs: ChatContext::bitmaps field + ~5 methods that use it
- core/src/tokenizer.rs: TokenizerChunk::Image/Audio variants (~30 sites)

The ChatContext refactor in particular is non-mechanical — bitmaps are
the central state for multimodal chat, and gating them out cleanly means
deciding whether to factor out a non-multimodal ChatContext or to live
with cfg-attributes on every method. That's a design call I want to
leave for review.

Reverts the premature `default-features = false` on the nobodywho dep in
the wasm crate. The mtmd cargo feature is added on core, but until
core/src/chat.rs and core/src/tokenizer.rs are fully gated for mtmd-less
builds (the ChatContext bitmaps refactor + TokenizerChunk variant gating),
opting out here breaks the Emscripten path that previously worked.

Once those two files are gated, switch this back to
default-features = false to enable wasm32-unknown-unknown builds.

Also: pulled the fork's wasm branch to 2b592da, which tightens the mtmd
gate in llama-cpp-2/src/lib.rs to allow Emscripten (was: any wasm32).

cargo build --target wasm32-unknown-unknown -p nobodywho-wasm +
wasm-bindgen --target web now produces a complete `pkg/` directory:

  pkg/
  ├── nobodywho_wasm.d.ts        9.1K  — TS typings for the public API
  ├── nobodywho_wasm.js          37K   — JS loader / wasm-bindgen glue
  ├── nobodywho_wasm_bg.wasm     21M   — debug; ~5-7M release-stripped
  └── nobodywho_wasm_bg.wasm.d.ts

JS consumers can:

  import init, { Model, Chat } from './pkg/nobodywho_wasm.js';
  await init();
  const model = await Model.loadBytes(uint8Array);
  const chat = new Chat(model, { contextSize: 2048 });
  const stream = await chat.ask('Hello');
  // ... await stream.nextToken() in a loop

The whole pipeline that now runs:

  bindgen + cc::Build (wasi-sdk clang)
    +
  cmake (llama.cpp -> wasm32-wasip1 + wasi-libc, with the fork's
         source-level patches applied at build time)
            v
    wasm-bindgen post-processor
            v
    pkg/ ready for npm publish

What made it click on this iteration:

* core/src/chat.rs: gate the bitmap-construction block in ChatWorker::ask()
  behind the mtmd feature, and the bitmap-extraction in
  sync_context_with_render. ChatContext's bitmaps field + the four
  methods that touch it (add_bitmaps, garbage_collect_bitmaps,
  create_bitmap_id, remove_bitmaps) are also feature-gated.

* Cargo.lock: pulls fork branch at 360e169 — the design pivot that
  unblocked everything. Earlier the fork gated mtmd off entirely on
  wasm32-unknown-unknown, which forced ~30 cfg sites in nobodywho/core/tokenizer.rs.
  Cleaner approach: keep bindgen running for mtmd headers (FFI types
  exist in bindings.rs) but skip compiling the C++. The Rust wrapper
  module compiles normally; mtmd_* symbols become undefined imports in
  the .wasm — the JS host can polyfill them, but in practice the wasm
  binding doesn't expose multimodal so they're never called.

* wasm/README.md: rewritten to reflect the working pipeline. Emscripten
  path documented as alternate/fallback (wasm-bindgen-cli doesn't accept
  Emscripten output, so it's not useful for npm distribution).

Adds three smoke tests under wasm/examples/:
- smoke.html: in-browser test page (open via local HTTP server, requires
  a bundler or manual env-stub for the unresolved imports).
- smoke.mjs: ESM test trying wasm-bindgen's web-target glue from Node.
  Fails because wasm-bindgen-cli emits `import * as ... from 'env'`
  for non-__wbindgen import groups, and 'env' isn't a real module.
  Documented as a known wasm-bindgen-cli limitation; npm distribution
  needs --target bundler + a bundler that aliases 'env'.
- smoke-manual.mjs: bypasses wasm-bindgen's auto-init and instantiates
  the wasm directly with Proxy-based stub imports. THIS WORKS.

Output of smoke-manual.mjs after the -fexceptions fix:

  Wasm size: 20.6 MB
  Compiling…
  Imports: 89
    ./nobodywho_wasm_bg.js: 53 entries
    env: 22 entries
    wasi_snapshot_preview1: 14 entries
  Instantiating…
  Exports: 74
    Class-like: chat_ask, chat_new, chat_reset, chat_resetHistory,
                encoder_encode, encoder_new, init, model_loadBytes,
                tokenstream_completed, tokenstream_nextToken

Proves the wasm is well-formed: compiles under V8 (so passes the
wasm-validation that previously failed on exception-model mismatch),
instantiates with stubs, and exposes all 10 expected class methods.
Going further (actually calling chat_ask) requires real imports for
the wasi_snapshot_preview1 syscalls (fd_write etc.) and for the
__wbg_* glue from nobodywho_wasm_bg.js — which is what an npm package
would ship via --target bundler + a WASI polyfill like
@bjorn3/browser_wasi_shim.

Also: pkg-node/ added to .gitignore (build output).

   ✓ wasi.initialize ran _initialize
   ✓ wasm wired up
   ✓ init() ok
   Loading model from /tmp/bge-small.gguf… (35 MB GGUF)
   ✓ model loaded
   ✓ encoder created
   ✓ embedding generated: 384 dimensions
   first 8: [-0.6244, -0.5940, 0.5545, -0.6085, -0.1348, 0.1800, 0.6621, 0.3490]

This is actual llama.cpp embedding inference running inside V8 from a
real GGUF model loaded as Uint8Array, with the full pipeline end-to-end:

  Uint8Array → Model.loadBytes → fmemopen → llama_model_load_from_file_ptr
    → wasi-libc syscalls (via node:wasi) → llama.cpp eval → Float32Array

Three fixes to get here:

1. wasm/src/lib.rs — override __cxa_atexit to a no-op.

   rust-lld 22.1's wasm driver doesn't understand --mexec-model=reactor
   (the flag is in upstream lld but not in rust-lld's option table). So
   the wasm stays in 'command' exec model, where every export is wrapped
   with __wasm_call_ctors + __wasm_call_dtors. The dtor walk iterates
   atexit-registered handlers and trips on a signature mismatch:

     RuntimeError: function signature mismatch
       at __funcs_on_exit
       at __wasm_call_dtors
       at <any export>.command_export

   Suppressing the atexit registrations entirely makes the dtor walk a
   no-op. Global destructors don't run at module shutdown, which is
   harmless for a wasm instance that lives for the lifetime of the
   process.

2. core/src/llm.rs — hardcode n_threads=1 on wasm32.

   std::thread::available_parallelism() returns Err on
   wasm32-unknown-unknown ('the number of hardware threads is not known
   for the target platform'). The Worker init unwrapped the error,
   failing immediately after model load. cfg the line so wasm32 uses 1.

3. core/src/tokenizer.rs — inline the mtmd_default_marker literal on wasm.

   tokenize_text calls llama_cpp_2::mtmd::mtmd_default_marker() to learn
   the marker string to split text on. On wasm we don't compile mtmd's
   C++, so that resolves to an unresolved env import. Inline the same
   '<__media__>' literal llama.cpp returns — wasm consumers never have
   media in the text anyway, so the split produces one chunk covering
   the whole input.
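
Fix 2 can be sketched as below. This is an illustration rather than the exact line in core/src/llm.rs: the function name `pick_n_threads` is hypothetical, and the sketch adds a fallback to 1 on native as well, which goes slightly beyond the cfg-only change described above.

```rust
use std::thread;

/// Choose the thread count for the worker.
/// On wasm32 the answer is fixed at 1; elsewhere the platform is asked and
/// an Err (unknown parallelism) falls back to 1 instead of being unwrapped.
fn pick_n_threads() -> usize {
    if cfg!(target_arch = "wasm32") {
        1
    } else {
        thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1)
    }
}

fn main() {
    println!("n_threads = {}", pick_n_threads());
}
```

`cfg!` keeps both branches type-checked on every target, which is fine here because `available_parallelism` exists on wasm32 and simply returns an error there.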

Also adds wasm/examples/run.mjs — Node runner that:
- Reads the bundler-target .wasm bytes and bg.js glue
- Wires up node:wasi for wasi_snapshot_preview1 imports
- Provides Proxy-based stubs for the remaining env imports (mtmd_*, etc.)
- Calls wasi.initialize, __wbg_set_wasm, __wbindgen_start
- Loads a GGUF and runs Encoder.encode (--encode flag) or Chat.ask

Usage:
  node wasm/examples/run.mjs                              # smoke
  node wasm/examples/run.mjs --encode ./model.gguf 'text' # embedding
  node wasm/examples/run.mjs ./model.gguf 'prompt'        # chat

Chat path with a chat-style GGUF should also work once we have such a
model handy (the bge-small only supports the Encoder API).

- wasm/README.md: rewritten to reflect the working state. Documents the
  verified embedding-inference result (with first 8 dims of an actual
  BGE-small embedding), the full build pipeline, the runtime workarounds
  (and why each), and outstanding follow-ups.

- wasm/examples/browser.html: single-file browser demo. Loads
  @bjorn3/browser_wasi_shim from esm.sh, manually instantiates the
  wasm with WASI + bg.js glue + env stubs, lets the user upload a GGUF
  via <input type=file>, runs Encoder.encode, and prints the embedding.
  Mirrors the working run.mjs but in a browser context.

  Usage: cd nobodywho/wasm && python3 -m http.server 8000, then open
  http://localhost:8000/examples/browser.html.

- wasm/package.json.tpl: template for the npm package.json that would
  ship pkg-bundler/. Includes the right `files`, `main`, `types`,
  and `@bjorn3/browser_wasi_shim` as a peer dep. Becomes
  pkg-bundler/package.json in the publish step (separate commit will
  add the script).
Bumps llama-cpp-2 to 10d12300 (wasm fork merged with marek-hradil/main —
brings in PRs #7/8/9 watchOS Metal-skip). Drops the explicit iOS Metal
block now that the wasm fork has marek's auto-Metal-detect logic.
Hash: 18pwz1r43dj6918dajlg61ak9zlhwazsblqj6hv9aj0qaks7rz4n
(nix-prefetch-git --fetch-submodules).
gergelyvagujhelyi force-pushed the wasm-pr3-features branch 3 times, most recently from 8d59f88 to 699d318 (May 13, 2026 14:07)
gergelyvagujhelyi force-pushed the wasm-pr4-release branch 2 times, most recently from 125ee55 to 38df4d2 (May 13, 2026 14:17)
gergelyvagujhelyi force-pushed the wasm-pr4-release branch 2 times, most recently from 28e76e6 to 86327c1 (May 15, 2026 09:16)
Chat.ask path on wasm was failing at `<Prompt as Display>::fmt` because
that function called `llama_cpp_2::mtmd::mtmd_default_marker()` to get
the media-marker string for joining text + media parts — the same
unresolved env import that broke tokenize_text earlier. Two other call
sites (in ProjectionModel::from_path and ProjectionModel::tokenize) had
the same problem but are in unreachable mtmd-only code paths.

Replace all four call sites with a new `mtmd_marker_string()` helper
that's cfg-gated on `target_arch = "wasm32"`:
- native: calls `llama_cpp_2::mtmd::mtmd_default_marker` (unchanged).
- any wasm32 target (unknown-unknown, emscripten, wasip1, ...): returns
  the literal "<__media__>". Necessary for wasm32-unknown-unknown which
  can't resolve the mtmd C++ symbol; harmless for the other wasm32 OSes
  because the literal is the same string llama.cpp's mtmd_default_marker
  returns.
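
The shape of the helper can be sketched as follows. It is a self-contained approximation: `native_default_marker` is a hypothetical stand-in for `llama_cpp_2::mtmd::mtmd_default_marker` so the example builds without the fork, and the `&'static str` return type and the `split_on_marker` function are assumptions for illustration.

```rust
/// Hypothetical stand-in for `llama_cpp_2::mtmd::mtmd_default_marker()`.
#[cfg(not(target_arch = "wasm32"))]
fn native_default_marker() -> &'static str {
    "<__media__>"
}

/// Native: ask the (stand-in) library for the marker.
#[cfg(not(target_arch = "wasm32"))]
fn mtmd_marker_string() -> &'static str {
    native_default_marker()
}

/// Any wasm32 target: return the literal, avoiding the C++ symbol entirely.
#[cfg(target_arch = "wasm32")]
fn mtmd_marker_string() -> &'static str {
    "<__media__>"
}

/// Split prompt text into the pieces between media markers.
fn split_on_marker(text: &str) -> Vec<&str> {
    text.split(mtmd_marker_string()).collect()
}

fn main() {
    println!("{:?}", split_on_marker("Describe this: <__media__> please"));
}
```

Text without any marker comes back as a single chunk, which matches the wasm32-unknown-unknown case where no media is ever present.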

Real chat output now works end-to-end:

  $ node wasm/examples/run.mjs /tmp/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf 'Hello'
    ✓ wasi.initialize ran _initialize
    ✓ wasm wired up
    ✓ model loaded
    ✓ chat created
  Asking: "Hello"
  Response: Hello! How can I assist you today? If you have any questions
            or need help with something, feel free to ask.
    ✓ produced 25 tokens
  Done.

That's a real Qwen 2.5 0.5B Instruct response from a 379 MB GGUF loaded
as Uint8Array, running entirely inside V8.
…ction

The existing `//` block above `__cxa_atexit` explained the rust-lld
22.1 / command-exec-model / dtor-walk causal chain well, but it lived
as a code comment rather than rustdoc. Two consequences:

1. `cargo clippy --no-deps --target wasm32-unknown-unknown -p
   nobodywho-wasm -- -D warnings` failed on the function because
   `clippy::missing_safety_doc` requires a `# Safety` section in
   rustdoc on `pub unsafe fn`.
2. The rationale didn't surface in generated docs or IDE hover.

Convert the block to `///` rustdoc verbatim (no rewording of the
existing prose) and add a `# Safety` section explaining why this
implementation is trivially safe to call: it ignores all three
arguments and returns success, so there's no UB path regardless of
what handlers libc++ tries to register. The cost (silently dropping
every atexit registration) is acceptable for the wasm-instance-
lifetime reason the body already documents.

Native target unaffected — function is cfg-gated to wasm32. The PR-
stated check (`cargo clippy --no-deps -- -D warnings` without
`--target`) still passes either way; this fixes the wasm-target
clippy gate so a future CI addition won't fail.
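
For illustration, the shape of a `pub unsafe fn` that satisfies `clippy::missing_safety_doc` might look like the sketch below. The name `cxa_atexit_noop` is hypothetical and the doc text is paraphrased from this commit message, not copied from the source. The function in the PR is exported as `__cxa_atexit` and cfg-gated to wasm32; both are left out here so the snippet builds on any target.

```rust
use std::ffi::{c_int, c_void};

/// No-op replacement for the C++ runtime's `__cxa_atexit` hook (sketch).
///
/// Every registration is dropped on purpose; the commit message ties this to
/// the lifetime of the wasm instance.
///
/// # Safety
///
/// Trivially safe to call: all three arguments are ignored and the function
/// only returns `0` (success), so no pointer is ever dereferenced.
pub unsafe extern "C" fn cxa_atexit_noop(
    _dtor: Option<unsafe extern "C" fn(*mut c_void)>,
    _arg: *mut c_void,
    _dso_handle: *mut c_void,
) -> c_int {
    0
}
```

The `# Safety` heading inside the `///` block is what the lint looks for; a plain `//` comment with the same text would not count.
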
…delUrl

Replace modelBytes with modelPath in all Node examples and smoke tests.
Replace Model.loadBytes with Model.load({modelPath}) for Encoder and
CrossEncoder demos. Update browser HTML examples to use
Model.load({modelUrl}) instead of fetchModelBytes + Model.loadBytes.

Fix compilation error: map String errors to JsError in Model.load,
remove stale model_bytes/mmproj_bytes from ChatCreateParsed constructor.

All 14 smoke tests pass.
Update all code examples, status table, smoke test table, and
explanatory sections to reflect:
- Model.load({modelUrl | modelPath}) replacing Model.loadBytes
- Chat.create({modelUrl | modelPath}) replacing modelBytes
- NODEFS for Node disk access
- mmap syscall overrides (CPU_Mapped)
- Cache API tee'd streaming for modelUrl
- Removed Model caching helpers section (now internal)
- Removed >2 GiB readFileSync limitation (no longer applies)
Take main's Docusaurus button layout, add JS (wasm) link.
Replace plain-object sampler specs with a typed SamplerConfig
wasm-bindgen class. Sampler must now come from SamplerBuilder or
SamplerPresets — no more raw { temperature: 0.7 } objects.

- SamplerConfig wraps core's SamplerConfig, has toJSON/fromJSON
- SamplerBuilder terminal methods return SamplerConfig (not Object)
- SamplerPresets methods return SamplerConfig (not Object)
- Delete SamplerSpec, ConstraintSpec, build_sampler
- ChatOptions.sampler deserializes core SamplerConfig directly
- Main thread serializes SamplerConfig before postMessage to worker
- getSamplerConfig/setSamplerConfig use SamplerConfig class
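
The typed-config idea can be sketched in plain Rust, without wasm-bindgen or serde. Everything below is an assumption made for illustration: the `Step` enum, the `describe` helper and the snake_case method names are invented, and the real `SamplerConfig` wraps core's type instead of a step list. The point is only that a config with private fields can be obtained solely through builder terminal methods or presets.

```rust
/// One sampler stage. Private, so other modules cannot name or build it.
#[derive(Debug, Clone, PartialEq)]
enum Step {
    Temperature(f32),
    TopK(u32),
    TopP { p: f32, min_keep: u32 },
    Dist,
    Greedy,
}

/// Opaque configuration: the field is private, so code outside this module
/// cannot write a raw struct literal and must use the builder or a preset.
#[derive(Debug, Clone, PartialEq)]
pub struct SamplerConfig {
    steps: Vec<Step>,
}

impl SamplerConfig {
    /// Human-readable view of the chain, used here only for inspection.
    pub fn describe(&self) -> Vec<String> {
        self.steps
            .iter()
            .map(|step| match step {
                Step::Temperature(t) => format!("temperature({t})"),
                Step::TopK(k) => format!("top_k({k})"),
                Step::TopP { p, min_keep } => format!("top_p({p}, {min_keep})"),
                Step::Dist => "dist".to_string(),
                Step::Greedy => "greedy".to_string(),
            })
            .collect()
    }
}

#[derive(Default)]
pub struct SamplerBuilder {
    steps: Vec<Step>,
}

impl SamplerBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn temperature(mut self, t: f32) -> Self {
        self.steps.push(Step::Temperature(t));
        self
    }

    pub fn top_k(mut self, k: u32) -> Self {
        self.steps.push(Step::TopK(k));
        self
    }

    pub fn top_p(mut self, p: f32, min_keep: u32) -> Self {
        self.steps.push(Step::TopP { p, min_keep });
        self
    }

    /// Terminal method: consumes the builder and yields the typed config.
    pub fn dist(mut self) -> SamplerConfig {
        self.steps.push(Step::Dist);
        SamplerConfig { steps: self.steps }
    }
}

pub struct SamplerPresets;

impl SamplerPresets {
    /// Preset that skips the builder entirely.
    pub fn greedy() -> SamplerConfig {
        SamplerConfig { steps: vec![Step::Greedy] }
    }
}
```

A chain such as `SamplerBuilder::new().temperature(0.7).top_k(40).top_p(0.95, 1).dist()` mirrors the JS call in the PR description.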

Smoke tests need updating to use the typed API.
- constraint-smoke: use SamplerPresets.constrainWithRegex/JsonSchema
- sampler-smoke: use SamplerPresets.greedy(), SamplerBuilder chain
- sampler-ergo-smoke: assert.ok on opaque SamplerConfig instances
- sampler-extra-smoke: assert.ok on opaque instances, replace
  deprecated SamplerPresets.json() with constrainWithJsonSchema({})
- setters-smoke: setSamplerConfig takes SamplerConfig instance
- build-pkg-emscripten.sh: add SamplerConfig.__wrap sed patch
- lib.rs: fix JsCast issue — use toJSON() for sampler serialization
  across postMessage boundary

13/14 smoke tests pass. sampler-extra [6] needs a wasm rebuild to
verify the constrainWithJsonSchema({}) replacement.
Split into two jobs:
- `lint`: always runs — cargo test -p nobodywho-js (fast, no wasm)
- `build-and-test`: only on full_ci — wasm build + model smoke tests

Pass full_ci from main.yml (same pattern as swift_ci, python_ci, etc).
Update to actions/checkout@v5 and actions/cache@v5.
- get_model_from_bytes + its test in core/llm.rs (nothing calls it
  after removing Model.loadBytes)
- ENOENT constant in syscall_imports.rs (was used by deleted newfstatat)
- lstat parameter on stat_into_buf (always false after removing lstat64)
- Cargo.toml: remove unused web-sys features (Request, RequestInit,
  Headers), update stale comment
- build.rs: remove redundant -lnodefs.js (already in post-link script)
- lib.rs: remove dead on_progress field from ChatCreateParsed and
  onDownloadProgress key filter (no caller reads the field)
- README: remove get_model_from_bytes reference
- core/llm.rs: update doc comment (modelBytes → modelUrl)
- tests/lint.rs: update comments to remove ConstraintSpec/modelBytes
  references, simplify test descriptions
Replace the single-threaded wasm worker wrapper (separate wasm instance
per Web Worker) with Emscripten pthreads (SharedArrayBuffer, real
threads via pthread_create). Inference now runs on a background pthread
using the same std::thread::spawn code path as native.

Build changes:
- .cargo/config.toml: +atomics,+bulk-memory,+mutable-globals target features
- build.rs: -pthread and -sDEFAULT_PTHREAD_STACK_SIZE=2MB linker flags
- build-pkg-emscripten.sh: cargo +nightly -Zbuild-std=std,panic_abort,
  -pthread in post-link, sed patches for deferred _initialize and
  wasm-bindgen binding in pthread workers

Core unification (-190 lines):
- Remove spawn_local wasm path from chat/encoder/crossencoder
- Remove wasm-only WorkerGuard variant (unified with JoinHandle)
- Use available_parallelism() instead of hardcoded n_threads=1

JS binding rewrite (-1770 lines):
- Delete worker dispatcher (runInWorker, message protocol, ChatState,
  WorkerStreamState, worker-backed Chat/TokenStream)
- New Chat wraps ChatHandleAsync directly
- New TokenStream wraps TokenStreamAsync
- Delete __nbw_spawn_worker, __nbw_wrap_node_worker from pre.js

All 8 smoke tests pass. No API changes — Chat.create, chat.ask,
for-await streaming, tool calling, getters/setters all unchanged.
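
A rough sketch of the unified code path: one `std::thread::spawn` worker that streams tokens back over a channel. It is not taken from the crate. `fake_generate` is a hypothetical stand-in for the inference loop, and the channel-based `ask` only approximates what `ChatHandleAsync` and `TokenStreamAsync` might do internally.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for inference: yields one "token" per
/// whitespace-separated word of a canned reply.
fn fake_generate(prompt: &str) -> Vec<String> {
    format!("echo: {prompt}")
        .split_whitespace()
        .map(str::to_string)
        .collect()
}

/// Spawns the worker thread and returns the token receiver together with the
/// handle that owns the thread.
pub fn ask(prompt: String) -> (Receiver<String>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        for token in fake_generate(&prompt) {
            // A send error means the receiver was dropped: stop generating.
            if tx.send(token).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the channel and ends iteration.
    });
    (rx, handle)
}
```

The same source would compile for native targets and, given the pthreads build flags listed above, for Emscripten, which is why the separate `spawn_local` path could be deleted.
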
Create a persistent ggml threadpool during Worker init and attach it
to the llama context. This avoids ggml's disposable threadpool pattern
(pthread_create mid-compute), which deadlocks on Emscripten:
pthread_create is asynchronous there, so a new thread only starts once
the caller yields to the event loop, which never happens mid-compute.

With the persistent pool, ggml worker threads are pre-created during
init when the event loop is free, and reused across graph computes.
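
The difference between a persistent and a disposable pool can be shown with a generic std-only sketch. This is not ggml's threadpool API: `PersistentPool` and its methods are hypothetical names, and jobs are plain closures, not graph computes.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Threads are created once in `new` and reused by every `run` call.
pub struct PersistentPool {
    sender: Option<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl PersistentPool {
    /// All thread creation happens here, at init time.
    pub fn new(n_threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..n_threads.max(1))
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock is held while waiting and released before
                    // the job itself runs.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // channel closed: shut down
                    }
                })
            })
            .collect();
        Self { sender: Some(sender), workers }
    }

    /// Queues work on an already-running thread; nothing is spawned here.
    pub fn run(&self, job: impl FnOnce() + Send + 'static) {
        self.sender
            .as_ref()
            .expect("pool already shut down")
            .send(Box::new(job))
            .expect("no live worker threads");
    }

    pub fn thread_count(&self) -> usize {
        self.workers.len()
    }
}

impl Drop for PersistentPool {
    fn drop(&mut self) {
        // Closing the channel makes every worker's `recv` fail and exit.
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}
```

Because `run` never creates a thread, nothing in the compute path has to wait for a thread to start.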

Benchmark (Qwen3-0.6B-Q4_K_M, Node.js, Apple Silicon 10 cores):
- Single-threaded (before): 18.5 tok/s, 1 active core
- Multi-threaded (now):     57.0 tok/s, 10 active cores (3.1x)
Match the pre-created pthread pool to the actual CPU core count
instead of a hardcoded 16. Avoids wasting memory on machines with
fewer cores and under-provisioning on machines with more.
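
One way to derive the pool size from the machine, sketched with the standard library only. The function name and the fallback of 1 when the core count cannot be queried are assumptions of this sketch, not behaviour confirmed by the PR.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Number of worker threads to pre-create, based on what the runtime reports.
/// Falls back to 1 when the query fails (assumed fallback).
pub fn pool_size() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}
```

`available_parallelism` returns an `io::Result<NonZeroUsize>`, so a successful result is never zero and only the error case needs a fallback.
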
Remove js_to_serializable_parts, strip_keys, ChatOptions struct —
all were only used by the old Web Worker message protocol. Fix stale
comments referencing postMessage, worker dispatcher, RefCell<ChatState>.
ChatOptions struct was removed — Chat.create now parses options via
js_sys::Reflect instead of serde deserialization. The deny_unknown_fields
lint test is no longer applicable.
The pthreads build requires `cargo +nightly -Zbuild-std=std,panic_abort`
to recompile std with atomics. Add nightly toolchain and rust-src
component to the js_ci build-and-test job.
The action applies `components` to all toolchains in the list, so
`rust-src --toolchain nightly` was mis-parsed as three components.
Use plain `components: rust-src` — added to both stable and nightly.
CI's setup-rust-toolchain sets RUSTFLAGS=-D warnings, which overrides
the [target.wasm32-unknown-emscripten] rustflags in .cargo/config.toml
entirely — silently dropping the +atomics,+bulk-memory,+mutable-globals
target features and breaking the shared-memory link (wasm-ld: error:
--shared-memory is disallowed ... not compiled with 'atomics').

Set RUSTFLAGS directly in the build script so the build is self-contained
and doesn't depend on config.toml being the active rustflags source.
Verified by reproducing locally with RUSTFLAGS=-D warnings set.
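
The fix itself lives in a shell build script; the logic is sketched here in Rust to keep one example language, so treat it as an illustration and not as the script's contents. `effective_rustflags` is a hypothetical helper, and appending to inherited flags such as `-D warnings` instead of replacing them is an assumption.

```rust
/// Target features the shared-memory link needs, as listed in the commit.
const REQUIRED_FEATURES: &str = "+atomics,+bulk-memory,+mutable-globals";

/// Builds the RUSTFLAGS value to export before invoking cargo, so the build
/// no longer depends on `.cargo/config.toml` being the active source.
pub fn effective_rustflags(inherited: Option<&str>) -> String {
    let required = format!("-C target-feature={}", REQUIRED_FEATURES);
    match inherited.map(str::trim).filter(|flags| !flags.is_empty()) {
        // Already present: leave the inherited value untouched.
        Some(flags) if flags.contains(REQUIRED_FEATURES) => flags.to_string(),
        // CI case: keep what was set and append the target features.
        Some(flags) => format!("{flags} {required}"),
        // Nothing inherited: the features are the whole value.
        None => required,
    }
}
```

A caller could pass `std::env::var("RUSTFLAGS").ok().as_deref()` and hand the result to `std::process::Command::env("RUSTFLAGS", ...)` when spawning cargo.
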
@gergelyvagujhelyi
Contributor Author

/full-ci
