Add local Gemma-4-E4B (oMLX) evaluation harness for memory systems by johnz1019 · Pull Request #1 · lay2dev/HaluMem

johnz1019 · 2026-06-16T07:58:06Z

Summary

Adds an evaluation harness to run the HaluMem memory-system benchmark fully locally against a Gemma-4-E4B model served via oMLX, with no external memory-system API keys required. The answer model, judge model, and embeddings are decoupled so each can be pointed at a local or self-hosted endpoint.

What changed

New local infrastructure

local_mem0_components.py / local_embedding_server.py — deterministic blake2b hash embeddings + an OpenAI-compatible local embedding server, replacing OpenAI text-embedding-3-small.
llms.py — separate answer vs judge model config; judge can read GPT-5.5 from Codex config (JUDGE_USE_CODEX_CONFIG) or JUDGE_OPENAI_* overrides; supports both Responses (streaming) and Chat wire APIs.
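
For reviewers unfamiliar with hash embeddings, the idea behind the blake2b-based replacement can be sketched as below. This is a minimal illustration, not the code in local_mem0_components.py: the function name `hash_embed`, the 256-dimension default, the word-level tokenization, and the signed-bucket scheme are all assumptions made for the example.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Map text to a fixed-size vector using only blake2b hashes (hypothetical helper)."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # blake2b is deterministic, so the same token always lands in the same bucket.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        index = (value >> 1) % dim                # bucket chosen from the upper bits
        sign = 1.0 if value & 1 else -1.0         # lowest bit picks the sign
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # L2-normalize so a dot product behaves like cosine similarity.
    return [x / norm for x in vec] if norm else vec


if __name__ == "__main__":
    a = hash_embed("Alice moved to Berlin")
    b = hash_embed("Alice moved to Berlin")
    print(a == b)  # identical input gives an identical vector, with no model or API call
```

Vectors like these capture token overlap rather than semantics, so retrieval quality may differ from text-embedding-3-small; the benefit is reproducibility and zero external dependencies.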

Local adapters for the memory systems

eval_memzero_graph.py — run Mem0-Graph locally (local Neo4j + Qdrant + hash embeddings + local LLM).
New: eval_gemma4_e4b_local.py (no-memory / full-session baseline), eval_memzero_local.py, eval_supermemory_local.py.
Adapted: eval_memos.py, eval_memobase.py, eval_memzero.py, eval_supermemory.py, eval_zep.py for local endpoints.

Scoring & orchestration

evaluation.py — route scoring through the new judge; add memory-extraction F1.
run_gemma4_e4b_omlx_eval.py — single-system pipeline: start oMLX → run adapter → stop oMLX → judge scoring → write report.
run_gemma4_e4b_memory_matrix.py — matrix runner across frames.
write_gemma4_e4b_comparison_report.py — local vs README-baseline comparison report.
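
The memory-extraction F1 added in evaluation.py is not spelled out in this description. As a rough sketch, assuming the judge yields a count of extracted memories that match a gold memory one-to-one, F1 is the harmonic mean of precision and recall. The function name `extraction_f1` and its parameters are hypothetical and do not come from the PR.

```python
def extraction_f1(num_correct: int, num_extracted: int, num_gold: int) -> dict[str, float]:
    """Precision, recall, and F1 for memory extraction (hypothetical helper).

    num_correct:   extracted memories judged to match a gold memory
    num_extracted: all memories the system extracted
    num_gold:      all gold (reference) memories
    """
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall; define it as 0 when both are 0.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # 6 of 8 extracted memories are correct, out of 12 gold memories.
    print(extraction_f1(num_correct=6, num_extracted=8, num_gold=12))
```

With 6 correct out of 8 extracted and 12 gold memories, precision is 0.75, recall is 0.5, and F1 is 0.6.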

Notes

Run outputs / data (eval/runs/, eval/reports/, .supermemory/) are intentionally not included.
No secrets are committed (.env is gitignored; verified the diff contains no keys).
Implementation authored by codex on the lay2-studio host; packaged and submitted here.

Run the HaluMem memory-system benchmark fully locally against a Gemma-4-E4B model served via oMLX, decoupling the answer model, the judge model, and the embeddings so that no external memory-system API keys are required.

Highlights:
- llms.py: separate "answer" vs "judge" model config; judge can read GPT-5.5 from Codex config (JUDGE_USE_CODEX_CONFIG) or JUDGE_OPENAI_* overrides; support both Responses (streaming) and Chat wire APIs.
- local_mem0_components.py / local_embedding_server.py: deterministic blake2b hash embeddings + an OpenAI-compatible local embedding server, replacing OpenAI text-embedding-3-small.
- eval_memzero_graph.py: run Mem0-Graph locally (local Neo4j + Qdrant + hash embeddings + local LLM).
- New local adapters: eval_gemma4_e4b_local (no-memory baseline), eval_memzero_local, eval_supermemory_local.
- Adapt eval_memos / eval_memobase / eval_memzero / eval_supermemory / eval_zep for local endpoints.
- evaluation.py: route scoring through the new judge; add memory-extraction F1.
- Orchestration: run_gemma4_e4b_omlx_eval (start oMLX -> adapter -> stop -> judge scoring -> report), run_gemma4_e4b_memory_matrix (matrix runner), write_gemma4_e4b_comparison_report (local vs README-baseline report).

Implementation authored by codex on the lay2-studio host; packaged and submitted here.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

johnz1019 wants to merge 1 commit into main from codex-local-gemma-e4b-eval
