Local LLM server for the full Intel stack. NPU, ARC iGPU, ARC discrete, CPU. OpenAI + Ollama APIs. One server, every Intel device.
No NVIDIA required. No Ollama install. No llama.cpp. No problem.
Runs on Intel Core Ultra laptops (NPU + ARC iGPU), desktops with ARC discrete GPUs (A770, B580), or any Intel CPU. Automatically detects your hardware, picks the best device, and exposes both OpenAI- and Ollama-compatible APIs — so any client that speaks either just works.
.\install.ps1
.\start.ps1

That's it. install.ps1 detects your hardware, lets you pick a model, downloads it, and generates start.ps1. The launcher waits for the model to load (with a progress indicator), then opens the built-in chat UI in your browser at http://localhost:8000.
- OpenAI API (/v1/chat/completions) — works with any OpenAI client, OpenWebUI, etc.
- Ollama API (/api/chat, /api/generate) — works with Ollama clients, OpenWebUI Ollama mode, etc.
- Auto-detects NPU, ARC iGPU, ARC discrete, CPU — picks the best available
- VLM support — send images via base64 or file:// URIs for vision models
- Streaming — token-by-token for text chat, with collapsible thinking blocks
- Dual device — NPU for chat + GPU for vision, simultaneously
- Built-in web UI — chat, image drop zone, model selector, dark theme
- Model menu — curated list of verified models, no conversion nightmares
The server includes a built-in chat interface at http://localhost:8000. No separate install, no Docker, no Node.js.
A native Windows GUI is planned to replace the browser-based UI.
Features:
- Streaming chat with tokens appearing in real-time
- Collapsible "Thinking..." blocks (Qwen3 reasoning models)
- Drag-and-drop / paste images for VLM queries
- Model selector showing loaded models and their devices
- Device badge on each response ([NPU 1.2s], [GPU 2.8s])
- Dark theme
- Keyboard shortcuts: Enter to send, Shift+Enter for newline, Ctrl+V to paste images, Ctrl+N for new chat, Escape to cancel
| Device | Examples | What it does | Streaming? |
|---|---|---|---|
| NPU (Intel AI Boost) | Core Ultra 7 258V | Text chat via LLMPipeline | Yes |
| ARC iGPU | ARC 140V (Core Ultra) | Vision + text, or bigger LLM | VLM: no, LLM: yes |
| ARC discrete | A770, B580 | Same as iGPU, more VRAM for larger models | VLM: no, LLM: yes |
| CPU | Any Intel CPU | Fallback for everything | Yes (slowly) |
Tested with benchmark.py — 1 warmup + 5 runs, outliers discarded.
# Text-only (no images required)
python benchmark.py --llm-only
# With VLM tests — provide 4 images: two "same vehicle" + two "different"
python benchmark.py --images-dir C:\path\to\images
python benchmark.py --same-1 a.jpg --same-2 b.jpg --diff-1 c.jpg --diff-2 d.jpg

LLM text (Qwen3 8B INT4-CW, same model on NPU and CPU):
| Test | NPU | CPU |
|---|---|---|
| "Say hello" (thinking) | 11.7s, 5.2 tok/s | 8.1s, 7.4 tok/s |
| "Say hello" (no-think) | 10.6s, 4.6 tok/s | 8.6s, 7.3 tok/s |
| "What is 2+2?" (thinking) | 11.7s, 5.3 tok/s | 9.0s, 7.0 tok/s |
| "What is 2+2?" (no-think) | 5.5s, 0.7 tok/s | 2.7s, 1.5 tok/s |
GPU (Qwen2.5-VL 3B on ARC 140V, non-streaming):
| Test | Time |
|---|---|
| "Say hello" (thinking) | 2.6s |
| "Say hello" (no-think) | 2.6s |
| "What is 2+2?" (thinking) | 2.6s |
| "What is 2+2?" (no-think) | 2.4s |
| Same vehicle? (2 images) | 3.8s |
| Different vehicles? (2 images) | 3.8s |
VLMPipeline doesn't stream, so tok/s can't be measured directly. Subtracting prompt overhead, ARC iGPU generation is roughly 3x faster than NPU for this hardware.
CPU beats NPU on throughput (~7.4 vs ~5.2 tok/s) for this model. GPU text is fast but runs a smaller 3B model (not directly comparable). VLM image responses take ~3-4s regardless of answer length.
Same Qwen3 8B INT4-CW model on every Intel device, plus the same model
served via Ollama (GGUF Q4_K_M) on the RTX 5090 for context. 1 warmup +
3 runs. The "count 1-100" test (max_tokens=4096, no-think) is the
cleanest cross-stack number — long output, steady-state, no thinking confound.
# Each NoLlama device — restart the server with --device <name> first
python benchmark.py --label npu --runs 3 --llm-only
python benchmark.py --label igpu --runs 3 --llm-only
python benchmark.py --label cpu --runs 3 --llm-only
# Ollama (any backend it's running on — CUDA, ROCm, CPU)
python benchmark.py --backend ollama --model qwen3:8b --label rtx5090 --runs 3 --llm-only

Decode throughput, count-1-100 test:
| Backend | Device | TTFT | Decode tok/s | Speed vs CPU |
|---|---|---|---|---|
| Ollama (GGUF/CUDA) | RTX 5090 | 0.19s | 197 | 11.1× |
| NoLlama (OpenVINO) | CPU (8P + 16E @ DDR5) | 3.84s | 17.8 | 1.0× |
| NoLlama (OpenVINO) | iGPU (Xe-LPG, 4 cores) | 4.01s | 15.4 | 0.87× |
| NoLlama (OpenVINO) | NPU 3 (Intel AI Boost) | 10.6s | 10.0 | 0.56× |
Surprises on this hardware:
- CPU beats iGPU. Arrow Lake's 285K (8P + 16E at high clocks) plus OpenVINO's tuned INT4 CPU kernels add up to more decode throughput than the small Xe-LPG iGPU (only 4 Xe cores on the desktop part — the laptop's ARC 140V has 8). Both share the same DDR5 pool, so the iGPU has no bandwidth advantage, only a compute disadvantage.
- NPU is the slowest Intel device on desktop, opposite of the laptop story. NPU's value is power efficiency (laptop on battery), not throughput on mains.
- Prefill scales differently than decode. RTX 5090's TTFT advantage over NPU is ~55× (0.19s vs 10.6s); its decode advantage is ~20×. Long prompts amplify the gap.
- The dGPU dominates — if you have one, use it. NoLlama's CPU fallback is good for "Intel-only laptop on battery", not for competing with a discrete card.
Why the desktop iGPU/NPU are slower than the laptop's: LPDDR5X-8533 (laptop, ~136 GB/s) vs DDR5-6400 dual-channel (desktop, ~100 GB/s). Decode throughput on INT4 LLMs is memory-bandwidth-bound, so the laptop's faster system memory closes some of the gap that silicon size alone would suggest. (The Core Ultra 7 258V Lunar Lake NPU also has more compute units than the 285K Arrow Lake NPU.)
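A back-of-envelope check of that claim: a memory-bandwidth-bound decode can't exceed bandwidth divided by the weight footprint streamed per token. Using the ~5 GB INT4 footprint and the bandwidth figures above (a rough ceiling, not a prediction):

```python
# Back-of-envelope decode ceiling: every generated token streams (roughly)
# the whole INT4 weight footprint from system memory once.
weights_gb = 5.0  # Qwen3 8B INT4-CW, ~5 GB

for name, bandwidth_gbps in [
    ("DDR5-6400 dual-channel (desktop)", 100.0),
    ("LPDDR5X-8533 (laptop)", 136.0),
]:
    print(f"{name}: ~{bandwidth_gbps / weights_gb:.0f} tok/s ceiling")

# Desktop ceiling ~20 tok/s vs 17.8 tok/s measured on the 285K's CPU,
# consistent with decode being bandwidth-bound rather than compute-bound.
```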
Practical guidance:
| Hardware | Best NoLlama device |
|---|---|
| Intel Core Ultra laptop (Lunar Lake) | NPU (efficiency) or ARC 140V iGPU |
| Intel Arrow Lake desktop, no dGPU | CPU — surprisingly best |
| Intel + ARC discrete (A770, B580) | ARC discrete |
| Intel + NVIDIA discrete | Use Ollama for the dGPU; NoLlama on CPU/NPU/iGPU as fallback |
When you have both, text requests go to the NPU (streaming) and image requests go to the GPU (VLM). Or put a bigger LLM on the GPU for smarter chat. The routing is automatic — send a request and the right device handles it.
POST /v1/chat/completions
"What is the capital of Norway?" --> NPU (streaming)
[image + "What vehicle is this?"] --> GPU (VLM)
Intel already ships OVMS — a production-grade OpenVINO inference server. If you're deploying LLMs in a datacenter or on Kubernetes, use OVMS. NoLlama is a different target: your laptop.
| | OVMS | NoLlama |
|---|---|---|
| Target | Production, datacenter, K8s | Laptop, desktop, local |
| Runtime | C++ | Python (Flask) |
| OpenAI API | Yes (recent versions) | Yes |
| Ollama API | No | Yes |
| Built-in web UI | No (add OpenWebUI) | Yes |
| Auto device detection | No | Yes |
| Dual-device routing | One model per instance | NPU chat + GPU vision, simultaneously |
| Config | JSON, manual | Zero — install.ps1 and go |
OVMS is a proper inference server. NoLlama is the thing that makes your Core Ultra feel like Ollama already ran on it.
# Auto-detect (picks best device)
python nollama.py
# Force a specific device
python nollama.py --device NPU
python nollama.py --device GPU
python nollama.py --device CPU
# Dual mode: NPU chat + GPU vision
python nollama.py --model-dir model --gpu-model-dir gpu-model
# Different port
python nollama.py --port 9000
# Change the default idle-unload timeout (default is 1800 = 30 min)
python nollama.py --idle-timeout 600 # unload after 10 min idle
python nollama.py --idle-timeout 0 # never unload — keep models loaded forever

NoLlama frees model memory after 30 minutes of inactivity by default (an 8B INT4 model holds ~5 GB of RAM; a VLM another ~3 GB). The next request automatically reloads the model — the client just sees a slow first response (~30-60s for an 8B model on NPU). The web UI shows "Reloading model..." while it waits.
Change with --idle-timeout <seconds>. Use 0 to keep models loaded
forever (the old behavior).
/health reports idle_unloaded slots; the overall status stays
ready because requests can still be served (with a reload).
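A small sketch of waiting on /health from a script, roughly what start.ps1 does while showing its progress indicator (only the status field is relied on here; print the payload to see what else your build reports):

```python
import time
import requests

# Wait until the server reports ready. After an idle unload the status
# stays "ready"; the payload also flags idle_unloaded slots.
while True:
    try:
        health = requests.get("http://localhost:8000/health", timeout=5).json()
        if health.get("status") == "ready":
            print(health)
            break
    except requests.RequestException:
        pass  # server still starting
    time.sleep(2)
```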
Standard OpenAI /v1/chat/completions. Works with any OpenAI client.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello!"}]}'curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages":[{"role":"user","content":[
{"type":"text","text":"What is in this image?"},
{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
]}]
}'

When client and server are on the same machine, skip base64:
{"type": "image_url", "image_url": {"url": "file:///C:/path/to/image.jpg"}}Note: file:// URIs only work locally. Remote clients must use base64.
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Tell me a story"}],"stream":true}'GET /health— device status, model names, readinessGET /v1/models— list loaded models (OpenAI format)
Every response includes X-Device and X-Model headers so you can
see which device handled it:
X-Device: NPU
X-Model: qwen3-8b
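Reading those headers from Python with requests (the OpenAI SDK also exposes raw response headers if you prefer it):

```python
import requests

r = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hello!"}]},
)
print(r.headers.get("X-Device"), r.headers.get("X-Model"))  # e.g. NPU qwen3-8b
print(r.json()["choices"][0]["message"]["content"])
```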
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
model="qwen3-8b",
messages=[{"role": "user", "content": "Hello!"}],
stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

NoLlama also serves a full Ollama-compatible API on port 11434 (the Ollama default). Any tool or client that talks to Ollama works without modification — it thinks it's talking to a real Ollama instance.
Supported endpoints:
- POST /api/chat — chat with streaming (newline-delimited JSON)
- POST /api/generate — single-turn completion
- GET /api/tags — list models
- POST /api/show — model info
curl http://localhost:11434/api/chat \
-d '{"model":"qwen3-8b-int4-cw","messages":[{"role":"user","content":"Hello!"}]}'Disable with --ollama-port 0 if you don't need it or port 11434 is taken.
OpenWebUI can connect via either API:
OpenAI mode (recommended):
| Field | Value |
|---|---|
| Base URL | http://host.docker.internal:8000/v1 |
| API Key | not-needed |
Ollama mode (no config needed if NoLlama runs on default port):
| Field | Value |
|---|---|
| Ollama Base URL | http://host.docker.internal:11434 |
install.ps1 shows a curated menu of models known to work on Intel
hardware. All pre-exported models are download-only (no conversion).
The menu is defined in models.json — add entries when new models
are verified.
Use download-model.ps1 to grab any HuggingFace model:
# Pre-exported OpenVINO model (just download)
.\download-model.ps1 OpenVINO/Qwen3-8B-int4-cw-ov
# Convert a HuggingFace model to OpenVINO
.\download-model.ps1 Qwen/Qwen2.5-VL-3B-Instruct --convert --weight int8
# With trust-remote-code (some models require this)
.\download-model.ps1 Qwen/Qwen2.5-VL-3B-Instruct --convert --weight int4 --trust

Models download to ~/models/<name>/. Point NoLlama at them:
python nollama.py --model-dir ~/models/my-model --device GPU
python nollama.py --gpu-model-dir ~/models/my-vlm

Text models:

| Model | Size | Notes |
|---|---|---|
| Qwen3 8B (INT4-CW) | ~5 GB | Recommended. Best quality. |
| Phi 3.5 Mini (INT4-CW) | ~2 GB | Smaller, faster. |
| DeepSeek R1 Distill 7B (INT4-CW) | ~4 GB | Reasoning. |
| DeepSeek R1 Distill 1.5B (INT4-CW) | ~1 GB | Testing only. |
| Mistral 7B v0.3 (INT4-CW) | ~4 GB | General purpose. |
Vision models:

| Model | Size | Notes |
|---|---|---|
| Gemma 3 4B Vision (INT4) | ~3 GB | Fast, good quality. |
| Gemma 3 12B Vision (INT4) | ~7 GB | Excellent quality. |
| Qwen2.5-VL 7B (INT4) | ~5 GB | Proven architecture. |
| InternVL2 4B (INT4) | ~3 GB | Good small VLM. |
Larger models:

| Model | Size | Notes |
|---|---|---|
| Qwen3 14B (INT4) | ~8 GB | Great reasoning. |
| Qwen3 30B-A3B MoE (INT4) | ~17 GB | 30B brain, 3B speed. |
| Phi 4 (INT4) | ~8 GB | Strong reasoning. |
| Phi 4 Reasoning (INT4) | ~8 GB | Chain-of-thought. |
The server auto-detects your model type (VLM or LLM) from
config.json and loads the right OpenVINO GenAI pipeline:
- VLMPipeline for vision models — handles images + text
- LLMPipeline for text models — handles chat with streaming
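A simplified sketch of that selection; the config.json heuristic shown here is illustrative, and the actual detection in nollama.py may check different fields:

```python
import json
from pathlib import Path

import openvino_genai as ov_genai

def load_pipeline(model_dir: str, device: str):
    """Pick VLMPipeline for vision models, LLMPipeline for text models."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    # Heuristic: vision models usually carry a vision sub-config or a
    # *ForConditionalGeneration architecture. Not exhaustive.
    is_vlm = "vision_config" in config or any(
        "ForConditionalGeneration" in arch
        for arch in config.get("architectures", [])
    )
    if is_vlm:
        return ov_genai.VLMPipeline(model_dir, device)  # images + text
    return ov_genai.LLMPipeline(model_dir, device)      # text, streaming
```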
In dual mode, both pipelines run on separate devices with separate locks. They don't interfere with each other.
Future simplification: OpenVINO GenAI may unify VLMPipeline and LLMPipeline into a single pipeline that handles both text and images. When that lands, the dual-pipeline detection and routing logic in NoLlama can be collapsed into one code path.
nollama.py The server
install.ps1 Setup wizard
download-model.ps1 Download/convert any HuggingFace model
benchmark.py Device performance benchmark
start.ps1 Auto-generated launcher (after install)
models.json Curated model registry
model/ Primary model (NPU or GPU)
gpu-model/ Secondary GPU model (dual mode)
venv/ Python virtual environment
model/, gpu-model/, venv/, and start.ps1 are gitignored.
The repo is pure code.
- Python 3.10+
- OpenVINO 2026.1+ with openvino-genai
- At least one of:
- Intel Core Ultra (NPU + ARC iGPU)
- Intel ARC discrete GPU (A770, B580, etc.)
- Any Intel CPU (slower, but works)
- ~1-17 GB disk per model
install.ps1 handles the venv, dependencies, and model download.
These are known and intentionally not fixed — either because the cause is upstream, the fix would hurt simplicity, or it doesn't matter for a local single-user tool.
- Cancel may not interrupt mid-generation. The cancel endpoint signals OpenVINO's streamer callback to stop. If OpenVINO is blocked inside a native call and not invoking the callback, there's no way to interrupt it from Python. Generation completes; lock releases when it does.
- NPU prompt limit is 4096 tokens. Long chat histories will eventually exceed this. The UI doesn't trim history — use Ctrl+N to start fresh if you hit the limit.
- VLM doesn't stream. OpenVINO's VLMPipeline has no streaming API, so vision responses arrive all at once. Waiting 3-5s for a VLM answer is normal.
- Ollama management endpoints are stubs. /api/pull, /api/delete, /api/copy return success but don't do anything. Model management is via install.ps1 or download-model.ps1, not the API.
- No graceful shutdown. Ctrl+C is abrupt. If you hit it mid-load, NPU/GPU resources may not free cleanly — usually resolves on next launch, occasionally needs a reboot.
- Flask dev server, not production. Single-user local tool. Don't put it on the internet without a reverse proxy.
During initial NPU testing with DeepSeek R1 1.5B, we asked: "What is the capital of Norway?"
The model's response:
"I need to figure out the capital of Norway. I know it's a country in Norway. I remember that Norway is a small island..."
Norway is, in fact, not a small island.
Or is it? To paraphrase the greatest detective of all time, Ford Fairlane: "...an island in an ocean of diarrhea."
The point: 1.5B parameter models are for testing the plumbing, not for geography. Use Qwen3-8B or larger for actual chat. The small models will catch up — they're getting smarter every month.
MIT
Tommy Leonhardsen

