
NoLlama

Local LLM server for the full Intel stack. NPU, ARC iGPU, ARC discrete, CPU. OpenAI + Ollama APIs. One server, every Intel device.

No NVIDIA required. No Ollama install. No llama.cpp. No problem.

Runs on Intel Core Ultra laptops (NPU + ARC iGPU), desktops with ARC discrete GPUs (A770, B580), or any Intel CPU. Automatically detects your hardware, picks the best device, and exposes both OpenAI and Ollama compatible APIs — so any client that speaks to either just works.

[Screenshot: NoLlama in action]

Quick start

.\install.ps1
.\start.ps1

That's it. install.ps1 detects your hardware, lets you pick a model, downloads it, and generates start.ps1. The launcher waits for the model to load (with a progress indicator), then opens the built-in chat UI in your browser at http://localhost:8000.

What it does

  • OpenAI API (/v1/chat/completions) — works with any OpenAI client, OpenWebUI, etc.
  • Ollama API (/api/chat, /api/generate) — works with Ollama clients, OpenWebUI Ollama mode, etc.
  • Auto-detects NPU, ARC iGPU, ARC discrete, CPU — picks the best available
  • VLM support — send images via base64 or file:// URIs for vision models
  • Streaming — token-by-token for text chat, with collapsible thinking blocks
  • Dual device — NPU for chat + GPU for vision, simultaneously
  • Built-in web UI — chat, image drop zone, model selector, dark theme
  • Model menu — curated list of verified models, no conversion nightmares

Web UI

The server includes a built-in chat interface at http://localhost:8000. No separate install, no Docker, no Node.js.

[Screenshot: the NoLlama chat UI]

A native Windows GUI is planned to replace the browser-based UI.

Features:

  • Streaming chat with tokens appearing in real-time
  • Collapsible "Thinking..." blocks (Qwen3 reasoning models)
  • Drag-and-drop / paste images for VLM queries
  • Model selector showing loaded models and their devices
  • Device badge on each response ([NPU 1.2s], [GPU 2.8s])
  • Dark theme
  • Keyboard shortcuts: Enter to send, Shift+Enter for newline, Ctrl+V to paste images, Ctrl+N for new chat, Escape to cancel

Device support

| Device | Examples | What it does | Streaming? |
|--------|----------|--------------|------------|
| NPU (Intel AI Boost) | Core Ultra 7 258V | Text chat via LLMPipeline | Yes |
| ARC iGPU | ARC 140V (Core Ultra) | Vision + text, or bigger LLM | VLM: no, LLM: yes |
| ARC discrete | A770, B580 | Same as iGPU, more VRAM for larger models | VLM: no, LLM: yes |
| CPU | Any Intel CPU | Fallback for everything | Yes (slowly) |

Benchmark (Core Ultra 7 258V, ARC 140V 16 GB) — laptop, LPDDR5X

Tested with benchmark.py — 1 warmup + 5 runs, outliers discarded.

# Text-only (no images required)
python benchmark.py --llm-only

# With VLM tests — provide 4 images: two "same vehicle" + two "different"
python benchmark.py --images-dir C:\path\to\images
python benchmark.py --same-1 a.jpg --same-2 b.jpg --diff-1 c.jpg --diff-2 d.jpg

LLM text (Qwen3 8B INT4-CW, same model on NPU and CPU):

| Test | NPU | CPU |
|------|-----|-----|
| "Say hello" (thinking) | 11.7s, 5.2 tok/s | 8.1s, 7.4 tok/s |
| "Say hello" (no-think) | 10.6s, 4.6 tok/s | 8.6s, 7.3 tok/s |
| "What is 2+2?" (thinking) | 11.7s, 5.3 tok/s | 9.0s, 7.0 tok/s |
| "What is 2+2?" (no-think) | 5.5s, 0.7 tok/s | 2.7s, 1.5 tok/s |

GPU (Qwen2.5-VL 3B on ARC 140V, non-streaming):

| Test | Time |
|------|------|
| "Say hello" (thinking) | 2.6s |
| "Say hello" (no-think) | 2.6s |
| "What is 2+2?" (thinking) | 2.6s |
| "What is 2+2?" (no-think) | 2.4s |
| Same vehicle? (2 images) | 3.8s |
| Different vehicles? (2 images) | 3.8s |

VLMPipeline doesn't stream, so tok/s can't be measured directly. Subtracting prompt overhead, ARC iGPU generation is roughly 3x faster than NPU for this hardware.

CPU beats NPU on throughput (~7.4 vs ~5.2 tok/s) for this model. GPU text is fast but runs a smaller 3B model (not directly comparable). VLM image responses take ~3-4s regardless of answer length.

Benchmark (Core Ultra 9 285K, RTX 5090) — desktop, DDR5

Same Qwen3 8B INT4-CW model on every Intel device, plus the same model served via Ollama (GGUF Q4_K_M) on the RTX 5090 for context. 1 warmup + 3 runs. The "count 1-100" test (max_tokens=4096, no-think) is the cleanest cross-stack number — long output, steady-state, no thinking confound.

# Each NoLlama device — restart the server with --device <name> first
python benchmark.py --label npu --runs 3 --llm-only
python benchmark.py --label igpu --runs 3 --llm-only
python benchmark.py --label cpu --runs 3 --llm-only

# Ollama (any backend it's running on — CUDA, ROCm, CPU)
python benchmark.py --backend ollama --model qwen3:8b --label rtx5090 --runs 3 --llm-only

Decode throughput, count-1-100 test:

| Backend | Device | TTFT | Decode tok/s | Speed vs CPU |
|---------|--------|------|--------------|--------------|
| Ollama (GGUF/CUDA) | RTX 5090 | 0.19s | 197 | 11.1× |
| NoLlama (OpenVINO) | CPU (8P + 16E @ DDR5) | 3.84s | 17.8 | 1.0× |
| NoLlama (OpenVINO) | iGPU (Xe-LPG, 4 cores) | 4.01s | 15.4 | 0.87× |
| NoLlama (OpenVINO) | NPU 3 (Intel AI Boost) | 10.6s | 10.0 | 0.56× |

Surprises on this hardware:

  • CPU beats iGPU. Arrow Lake's 285K (8P + 16E at high clocks) plus OpenVINO's tuned INT4 CPU kernels add up to more decode throughput than the small Xe-LPG iGPU (only 4 Xe cores on the desktop part — the laptop's ARC 140V has 8). Both share the same DDR5 pool, so the iGPU has no bandwidth advantage, only a compute disadvantage.
  • NPU is the slowest Intel device on desktop, opposite of the laptop story. NPU's value is power efficiency (laptop on battery), not throughput on mains.
  • Prefill scales differently than decode. RTX 5090's TTFT advantage over NPU is ~55× (0.19s vs 10.6s); its decode advantage is ~20×. Long prompts amplify the gap.
  • The dGPU dominates — if you have one, use it. NoLlama's CPU fallback is good for "Intel-only laptop on battery", not for competing with a discrete card.

Why the desktop iGPU/NPU are slower than the laptop's: LPDDR5X-8533 (laptop, ~136 GB/s) vs DDR5-6400 dual-channel (desktop, ~100 GB/s). Decode throughput on INT4 LLMs is memory-bandwidth-bound, so the laptop's faster system memory closes some of the gap that silicon size alone would suggest. (The Core Ultra 7 258V Lunar Lake NPU also has more compute units than the 285K Arrow Lake NPU.)
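
A back-of-the-envelope check (a sketch with assumed numbers, not measurements): treat decode as purely bandwidth-bound, where generating each token streams all weights through memory once.

```python
# Bandwidth-bound decode ceiling: tok/s ~= memory bandwidth / bytes per token.
# Assumes a dense 8B INT4 model reads ~4.5 GB of weights per generated token
# (KV cache and activations ignored -- an upper bound, not a prediction).
weights_gb = 4.5

for name, bandwidth_gbps in [
    ("desktop DDR5-6400 dual-channel", 100.0),
    ("laptop LPDDR5X-8533", 136.0),
]:
    print(f"{name}: ~{bandwidth_gbps / weights_gb:.0f} tok/s ceiling")

# Desktop: ~22 tok/s ceiling -- the measured 17.8 tok/s CPU result sits at
# ~80% of it, which is what "memory-bandwidth-bound" looks like in practice.
```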

Practical guidance:

| Hardware | Best NoLlama device |
|----------|---------------------|
| Intel Core Ultra laptop (Lunar Lake) | NPU (efficiency) or ARC 140V iGPU |
| Intel Arrow Lake desktop, no dGPU | CPU — surprisingly best |
| Intel + ARC discrete (A770, B580) | ARC discrete |
| Intel + NVIDIA discrete | Use Ollama for the dGPU; NoLlama on CPU/NPU/iGPU as fallback |

Dual mode (NPU + GPU)

When you have both, text requests go to the NPU (streaming) and image requests go to the GPU (VLM). Or put a bigger LLM on the GPU for smarter chat. The routing is automatic — send a request and the right device handles it.

POST /v1/chat/completions
  "What is the capital of Norway?"  --> NPU (streaming)
  [image + "What vehicle is this?"] --> GPU (VLM)
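
To watch the routing happen, send one text request and one image request to the same endpoint and read the X-Device header on each response (a minimal sketch using requests; car.jpg is a stand-in for any local image):

```python
import base64
import requests

URL = "http://localhost:8000/v1/chat/completions"

# Text-only -> should land on the NPU.
r = requests.post(URL, json={
    "messages": [{"role": "user", "content": "What is the capital of Norway?"}]})
print(r.headers.get("X-Device"), r.json()["choices"][0]["message"]["content"][:60])

# Image -> should land on the GPU (VLM).
with open("car.jpg", "rb") as f:  # any local test image
    b64 = base64.b64encode(f.read()).decode()
r = requests.post(URL, json={
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "What vehicle is this?"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ]}]})
print(r.headers.get("X-Device"), r.json()["choices"][0]["message"]["content"][:60])
```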

Why not OpenVINO Model Server (OVMS)?

Intel already ships OVMS — a production-grade OpenVINO inference server. If you're deploying LLMs in a datacenter or on Kubernetes, use OVMS. NoLlama is a different target: your laptop.

| | OVMS | NoLlama |
|---|------|---------|
| Target | Production, datacenter, K8s | Laptop, desktop, local |
| Runtime | C++ | Python (Flask) |
| OpenAI API | Yes (recent versions) | Yes |
| Ollama API | No | Yes |
| Built-in web UI | No (add OpenWebUI) | Yes |
| Auto device detection | No | Yes |
| Dual-device routing | One model per instance | NPU chat + GPU vision, simultaneously |
| Config | JSON, manual | Zero — install.ps1 and go |

OVMS is a proper inference server. NoLlama is the thing that makes your Core Ultra feel like Ollama already ran on it.

Usage

# Auto-detect (picks best device)
python nollama.py

# Force a specific device
python nollama.py --device NPU
python nollama.py --device GPU
python nollama.py --device CPU

# Dual mode: NPU chat + GPU vision
python nollama.py --model-dir model --gpu-model-dir gpu-model

# Different port
python nollama.py --port 9000

# Change the default idle-unload timeout (default is 1800 = 30 min)
python nollama.py --idle-timeout 600     # unload after 10 min idle
python nollama.py --idle-timeout 0       # never unload — keep models loaded forever

Idle unload

NoLlama frees model memory after 30 minutes of inactivity by default (an 8B INT4 model holds ~5 GB of RAM; a VLM another ~3 GB). The next request automatically reloads the model — the client just sees a slow first response (~30-60s for an 8B model on NPU). The web UI shows "Reloading model..." while it waits.

Change with --idle-timeout <seconds>. Use 0 to keep models loaded forever (the old behavior).

/health reports idle_unloaded slots; the overall status stays ready because requests can still be served (with a reload).
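
A quick way to inspect this from a script (a sketch; the exact /health payload shape beyond the status field and idle_unloaded slot flag described above is an assumption):

```python
import requests

health = requests.get("http://localhost:8000/health").json()
print(health)  # dump the raw payload; field names may vary by version

# Status stays "ready" even when slots report idle_unloaded -- the next
# request just pays a one-time reload penalty (~30-60s for an 8B on NPU).
if health.get("status") == "ready":
    print("server is up; first request after an unload will be slow")
```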

API

Standard OpenAI /v1/chat/completions. Works with any OpenAI client.

Text chat

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'

Image (VLM, requires GPU with vision model)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages":[{"role":"user","content":[
      {"type":"text","text":"What is in this image?"},
      {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
    ]}]
  }'

Local file shortcut

When client and server are on the same machine, skip base64:

{"type": "image_url", "image_url": {"url": "file:///C:/path/to/image.jpg"}}

Note: file:// URIs only work locally. Remote clients must use base64.
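
The same shortcut works from the openai package, since the content list passes through as-is (a sketch; the model name and image path are placeholders — use whatever your server actually loaded):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="qwen2.5-vl-3b",  # placeholder: whichever vision model is loaded
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        # file:// works because client and server share the filesystem
        {"type": "image_url", "image_url": {"url": "file:///C:/path/to/image.jpg"}},
    ]}],
)
print(resp.choices[0].message.content)
```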

Streaming

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Tell me a story"}],"stream":true}'

Other endpoints

  • GET /health — device status, model names, readiness
  • GET /v1/models — list loaded models (OpenAI format)

Response headers

Every response includes X-Device and X-Model headers so you can see which device handled it:

X-Device: NPU
X-Model: qwen3-8b

Using with the openai Python package

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

Ollama API

NoLlama also serves a full Ollama-compatible API on port 11434 (the Ollama default). Any tool or client that talks to Ollama works without modification — it thinks it's talking to a real Ollama instance.

Supported endpoints:

  • POST /api/chat — chat with streaming (newline-delimited JSON)
  • POST /api/generate — single-turn completion
  • GET /api/tags — list models
  • POST /api/show — model info
curl http://localhost:11434/api/chat \
  -d '{"model":"qwen3-8b-int4-cw","messages":[{"role":"user","content":"Hello!"}]}'

Disable with --ollama-port 0 if you don't need it or port 11434 is taken.
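
That includes the official ollama Python package — point its Client at NoLlama and stream as if it were a real Ollama server (a sketch; the model name must match one listed by GET /api/tags):

```python
from ollama import Client  # pip install ollama

client = Client(host="http://localhost:11434")
stream = client.chat(
    model="qwen3-8b-int4-cw",  # must match a name from GET /api/tags
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```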

Using with OpenWebUI

OpenWebUI can connect via either API:

OpenAI mode (recommended):

| Field | Value |
|-------|-------|
| Base URL | http://host.docker.internal:8000/v1 |
| API Key | not-needed |

Ollama mode (no config needed if NoLlama runs on default port):

| Field | Value |
|-------|-------|
| Ollama Base URL | http://host.docker.internal:11434 |

Models

install.ps1 shows a curated menu of models known to work on Intel hardware. All pre-exported models are download-only (no conversion). The menu is defined in models.json — add entries when new models are verified.

Adding models outside the menu

Use download-model.ps1 to grab any HuggingFace model:

# Pre-exported OpenVINO model (just download)
.\download-model.ps1 OpenVINO/Qwen3-8B-int4-cw-ov

# Convert a HuggingFace model to OpenVINO
.\download-model.ps1 Qwen/Qwen2.5-VL-3B-Instruct --convert --weight int8

# With trust-remote-code (some models require this)
.\download-model.ps1 Qwen/Qwen2.5-VL-3B-Instruct --convert --weight int4 --trust

Models download to ~/models/<name>/. Point NoLlama at them:

python nollama.py --model-dir ~/models/my-model --device GPU
python nollama.py --gpu-model-dir ~/models/my-vlm

NPU models (chat)

| Model | Size | Notes |
|-------|------|-------|
| Qwen3 8B (INT4-CW) | ~5 GB | Recommended. Best quality. |
| Phi 3.5 Mini (INT4-CW) | ~2 GB | Smaller, faster. |
| DeepSeek R1 Distill 7B (INT4-CW) | ~4 GB | Reasoning. |
| DeepSeek R1 Distill 1.5B (INT4-CW) | ~1 GB | Testing only. |
| Mistral 7B v0.3 (INT4-CW) | ~4 GB | General purpose. |

GPU vision models

| Model | Size | Notes |
|-------|------|-------|
| Gemma 3 4B Vision (INT4) | ~3 GB | Fast, good quality. |
| Gemma 3 12B Vision (INT4) | ~7 GB | Excellent quality. |
| Qwen2.5-VL 7B (INT4) | ~5 GB | Proven architecture. |
| InternVL2 4B (INT4) | ~3 GB | Good small VLM. |

GPU large LLMs (smarter than NPU)

| Model | Size | Notes |
|-------|------|-------|
| Qwen3 14B (INT4) | ~8 GB | Great reasoning. |
| Qwen3 30B-A3B MoE (INT4) | ~17 GB | 30B brain, 3B speed. |
| Phi 4 (INT4) | ~8 GB | Strong reasoning. |
| Phi 4 Reasoning (INT4) | ~8 GB | Chain-of-thought. |

How it works

The server auto-detects your model type (VLM or LLM) from config.json and loads the right OpenVINO GenAI pipeline:

  • VLMPipeline for vision models — handles images + text
  • LLMPipeline for text models — handles chat with streaming

In dual mode, both pipelines run on separate devices with separate locks. They don't interfere with each other.
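
The selection logic is small. A minimal sketch of the idea (not NoLlama's actual code — the config.json keys used here to spot a vision model are an assumption):

```python
import json
from pathlib import Path

import openvino_genai as ov_genai

def load_pipeline(model_dir: str, device: str):
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Vision models carry an image/vision sub-config in config.json;
    # plain text models don't. (Assumed heuristic, not NoLlama's exact check.)
    is_vlm = any(k in config for k in ("vision_config", "image_token_id"))
    if is_vlm:
        return ov_genai.VLMPipeline(model_dir, device)   # images + text
    return ov_genai.LLMPipeline(model_dir, device)       # text, with streaming

pipe = load_pipeline("model", "NPU")
```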

Future simplification: OpenVINO GenAI may unify VLMPipeline and LLMPipeline into a single pipeline that handles both text and images. When that lands, the dual-pipeline detection and routing logic in NoLlama can be collapsed into one code path.

Files

nollama.py              The server
install.ps1             Setup wizard
download-model.ps1      Download/convert any HuggingFace model
benchmark.py            Device performance benchmark
start.ps1               Auto-generated launcher (after install)
models.json             Curated model registry
model/                  Primary model (NPU or GPU)
gpu-model/              Secondary GPU model (dual mode)
venv/                   Python virtual environment

model/, gpu-model/, venv/, and start.ps1 are gitignored. The repo is pure code.

Requirements

  • Python 3.10+
  • OpenVINO 2026.1+ with openvino-genai
  • At least one of:
    • Intel Core Ultra (NPU + ARC iGPU)
    • Intel ARC discrete GPU (A770, B580, etc.)
    • Any Intel CPU (slower, but works)
  • ~1-17 GB disk per model

install.ps1 handles the venv, dependencies, and model download.

Known limitations

These are known and intentionally not fixed — either because the cause is upstream, the fix would hurt simplicity, or it doesn't matter for a local single-user tool.

  • Cancel may not interrupt mid-generation. The cancel endpoint signals OpenVINO's streamer callback to stop. If OpenVINO is blocked inside a native call and not invoking the callback, there's no way to interrupt it from Python. Generation completes; the lock releases when it does (see the sketch after this list).
  • NPU prompt limit is 4096 tokens. Long chat histories will eventually exceed this. The UI doesn't trim history — use Ctrl+N to start fresh if you hit the limit.
  • VLM doesn't stream. OpenVINO's VLMPipeline has no streaming API, so vision responses arrive all at once. Waiting 3-5s for a VLM answer is normal.
  • Ollama management endpoints are stubs. /api/pull, /api/delete, /api/copy return success but don't do anything. Model management is via install.ps1 or download-model.ps1, not the API.
  • No graceful shutdown. Ctrl+C is abrupt. If you hit it mid-load, NPU/GPU resources may not free cleanly — usually resolves on next launch, occasionally needs a reboot.
  • Flask dev server, not production. Single-user local tool. Don't put it on the internet without a reverse proxy.
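
For the cancel limitation above, a minimal sketch of the streamer-callback pattern (assumed shape, not NoLlama's actual code). The callback can only stop generation at the moments OpenVINO invokes it between tokens:

```python
import threading

import openvino_genai as ov_genai

cancel = threading.Event()  # set by the cancel endpoint

def streamer(token_text: str) -> bool:
    print(token_text, end="", flush=True)
    # Returning True asks OpenVINO to stop -- but this only runs when the
    # native loop calls back between tokens. If generation is stuck inside
    # a native call, Python can't interrupt it until the call returns.
    return cancel.is_set()

pipe = ov_genai.LLMPipeline("model", "NPU")
pipe.generate("Tell me a story", streamer=streamer, max_new_tokens=512)
```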

A note about small models

During initial NPU testing with DeepSeek R1 1.5B, we asked: "What is the capital of Norway?"

The model's response:

"I need to figure out the capital of Norway. I know it's a country in Norway. I remember that Norway is a small island..."

Norway is, in fact, not a small island.

Or is it? To paraphrase the greatest detective of all time, Ford Fairlane: "...an island in an ocean of diarrhea."

The point: 1.5B parameter models are for testing the plumbing, not for geography. Use Qwen3-8B or larger for actual chat. The small models will catch up — they're getting smarter every month.

License

MIT

Author

Tommy Leonhardsen
