Project QA benchmarks onto LLM pre-training corpora.
NanoKnow identifies which benchmark questions have answers in a model's training data, enabling controlled studies of parametric knowledge vs. retrieval-augmented generation (RAG).
NanoKnow was accepted to SIGIR '26!
arXiv: https://arxiv.org/abs/2602.20122
Given a QA benchmark and a pre-training corpus, NanoKnow produces relevance judgments (qrels) that partition questions into:
- Supported: The answer exists in the training data (the model could have memorized it).
- Unsupported: The answer does not appear in the training data.
The pipeline has three stages:
- BM25 Retrieval: Search the corpus for candidate documents using the question as a query.
- Answer String Matching: Filter to documents that contain the gold answer as a substring.
- LLM Verification: Use an LLM judge to filter out coincidental matches (e.g., "Paris" in a passage about Paris, Texas).
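The answer string matching stage can be illustrated with a minimal sketch. This is not the code in nanoknow/retriever.py: the function names, the lowercase-and-collapse-whitespace normalization, and the toy candidate documents are all assumptions made for illustration. In the real pipeline the candidates would come from BM25 retrieval.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def match_answers(candidates, gold_answers):
    """Return docids of candidates whose text contains any gold answer as a substring."""
    needles = [normalize(a) for a in gold_answers if a.strip()]
    kept = []
    for docid, text in candidates:
        haystack = normalize(text)
        if any(needle in haystack for needle in needles):
            kept.append(docid)
    return kept


# Toy candidates standing in for BM25 hits (hypothetical docids and text).
candidates = [
    ("doc_a", "Paris is the capital of France."),
    ("doc_b", "Paris, Texas is a city in Lamar County."),
    ("doc_c", "Berlin is the capital of Germany."),
]

matched = match_answers(candidates, ["Paris"])
print(matched)  # ['doc_a', 'doc_b']
```

Note that doc_b survives the substring filter even though it is about Paris, Texas. Coincidental matches of this kind are what the LLM verification stage is meant to remove.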
We provide pre-built qrels for nanochat models trained on karpathy/fineweb-edu-100b-shuffle:
| Dataset | Questions | Supported | Unsupported |
|---|---|---|---|
| SQuAD | 10,570 | 7,490 (71%) | 3,080 (29%) |
| NQ-Open | 3,610 | 2,389 (66%) | 1,221 (34%) |
The pre-built files are organized by dataset under questions-and-qrels/:
questions-and-qrels/
├── nq/
│   ├── answers.nanoknow-nq.jsonl
│   ├── qrels.nanoknow-nq.supported.txt
│   ├── topics.nanoknow-nq.supported.tsv
│   └── topics.nanoknow-nq.unsupported.tsv
└── squad/
    ├── answers.nanoknow-squad.jsonl
    ├── qrels.nanoknow-squad.supported.txt
    ├── topics.nanoknow-squad.supported.tsv
    └── topics.nanoknow-squad.unsupported.tsv
Each dataset directory contains:
- topics.nanoknow-<dataset>.supported.tsv: supported questions as qid<TAB>question.
- topics.nanoknow-<dataset>.unsupported.tsv: unsupported questions as qid<TAB>question.
- answers.nanoknow-<dataset>.jsonl: gold answers as one JSON object per line, e.g. {"qid": "0", "answer": ["14 December 1972 UTC", "December 1972"]}.
- qrels.nanoknow-<dataset>.supported.txt: TREC-format qrels for supported questions, e.g. 0 Q0 shard_01177_50695 1.
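The three file formats can be read with a few lines of standard-library Python. The sketch below is not part of the NanoKnow library; the function names are hypothetical, and the question text in the sample is a placeholder. The answer and qrels samples are the ones quoted above.

```python
import json
from collections import defaultdict


def parse_topics(lines):
    """Parse qid<TAB>question lines into {qid: question}."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        qid, question = line.split("\t", 1)
        topics[qid] = question
    return topics


def parse_answers(lines):
    """Parse JSONL records into {qid: [gold answers]}."""
    answers = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            answers[record["qid"]] = record["answer"]
    return answers


def parse_qrels(lines):
    """Parse TREC qrels (qid, iteration, docid, relevance) into {qid: {docid: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docid, relevance = parts
        qrels[qid][docid] = int(relevance)
    return dict(qrels)


# In-memory samples; the question text is a placeholder, not real benchmark data.
topics = parse_topics(["0\tplaceholder question text\n"])
answers = parse_answers(['{"qid": "0", "answer": ["14 December 1972 UTC", "December 1972"]}\n'])
qrels = parse_qrels(["0 Q0 shard_01177_50695 1\n"])

print(topics["0"], answers["0"], qrels["0"])
```

Each parser accepts any iterable of lines, so an open file handle can be passed directly, for example `parse_qrels(open(path, encoding="utf-8"))`.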
pip install -r requirements.txt
For BM25 retrieval, you also need Java 11+:
# Ubuntu/Debian
sudo apt install openjdk-11-jdk
We release a pre-built Lucene index over karpathy/fineweb-edu-100b-shuffle (326 GB):
Download: LingweiGu/NanoKnow-Fineweb-Edu-Index
huggingface-cli download LingweiGu/NanoKnow-Fineweb-Edu-Index --repo-type dataset --local-dir ./fineweb-edu-index
To build the index yourself using Anserini:
bin/run.sh io.anserini.index.IndexCollection \
-collection FinewebCollection \
-input /path/to/corpus \
-index /output/directory \
-generator DefaultLuceneDocumentGenerator \
-threads 16
# Stage 1: BM25 retrieval + answer matching (CPU only)
python scripts/project.py \
--dataset squad \
--stage 1 \
--index_path /path/to/lucene-index \
--output output/squad_stage1.pkl
# Stage 2: LLM verification (requires GPU)
python scripts/project.py \
--stage 2 \
--input output/squad_stage1.pkl \
--output output/squad_stage2.pkl
# Or run both stages together
python scripts/project.py \
--dataset squad \
--stage both \
--index_path /path/to/lucene-index \
--output output/squad_projected.pkl
NANOCHAT_DIR="${NANOCHAT_DIR:-/path/to/nanochat}"
CHECKPOINT_DIR="${CHECKPOINT_DIR:-/path/to/nanochat-checkpoint}"
STEP="${STEP:?Set STEP to the checkpoint step to evaluate}"
DATASET="${DATASET:-nq}"
QRELS_DIR="${QRELS_DIR:-questions-and-qrels/${DATASET}}"
FINEWEB_INDEX_PATH="${FINEWEB_INDEX_PATH:-/path/to/fineweb-index}"
OUTPUT_DIR="${OUTPUT_DIR:-output}"
DEVICE="${DEVICE:-cuda}"
python scripts/nanochat_inference.py \
--nanochat-dir "${NANOCHAT_DIR}" \
--checkpoint_dir "${CHECKPOINT_DIR}" \
--step "${STEP}" \
--dataset "${DATASET}" \
--qrels_dir "${QRELS_DIR}" \
--fineweb_index_path "${FINEWEB_INDEX_PATH}" \
--output_dir "${OUTPUT_DIR}" \
--device "${DEVICE}"
If nanochat is already importable in your Python environment, --nanochat-dir can be omitted. For repeated runs, you can set NANOCHAT_DIR=/path/to/nanochat instead of passing the argument each time.
The output file is named from the checkpoint directory basename and dataset, for example output/karpathy_nanochat_d32_nq.
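The naming rule can be sketched as follows. This is an assumption about how the name is assembled from the description above, not the code in scripts/nanochat_inference.py; the function name and the checkpoint path are hypothetical.

```python
from pathlib import Path


def predicted_output_path(checkpoint_dir: str, dataset: str, output_dir: str = "output") -> Path:
    """Join the checkpoint directory basename and the dataset name under output_dir."""
    return Path(output_dir) / f"{Path(checkpoint_dir).name}_{dataset}"


# Hypothetical checkpoint location; only the basename matters here.
path = predicted_output_path("/path/to/karpathy_nanochat_d32", "nq")
print(path.as_posix())  # output/karpathy_nanochat_d32_nq
```

A helper like this is handy for passing the right --input_file to the scoring step without retyping the name.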
python scripts/evaluate_model_predictions.py \
--input_file output/karpathy_nanochat_d32_nq \
--output_file nanochat_evaluations/karpathy_nanochat_d32_nq_scored.pkl
python scripts/get_eval_scores.py nanochat_evaluations/karpathy_nanochat_d32_nq_scored.pkl
Example output:
{
"supported_closed_book": {
"count": 2389,
"exact_match_accuracy": 0.19589786521557137,
"llm_judge_accuracy": 0.2293846797823357
},
"supported_w_fineweb_context": {
"count": 2389,
"exact_match_accuracy": 0.46923398911678527,
"llm_judge_accuracy": 0.46672247802427796
},
"unsupported_closed_book": {
"count": 1221,
"exact_match_accuracy": 0.00819000819000819,
"llm_judge_accuracy": 0.04914004914004914
}
}
Replication check: on June 1, 2026, we re-ran the NQ evaluation for karpathy_nanochat_d32 at step 650. Inference completed in 36m56s, scoring completed in 25m40s, and the reproduced scores were:
{
"supported_closed_book": {
"count": 2389,
"exact_match_accuracy": 0.19715362076182502,
"llm_judge_accuracy": 0.22980326496442027
},
"supported_w_fineweb_context": {
"count": 2389,
"exact_match_accuracy": 0.4679782335705316,
"llm_judge_accuracy": 0.47216408539137716
},
"unsupported_closed_book": {
"count": 1221,
"exact_match_accuracy": 0.00819000819000819,
"llm_judge_accuracy": 0.04504504504504504
}
}
NanoKnow/
├── nanoknow/                          # Core library
│   ├── retriever.py                   # Stage 1: BM25 retrieval + answer matching
│   ├── verifier.py                    # Stage 2: LLM-based verification
│   └── evaluator.py                   # Evaluation utilities
├── scripts/                           # Runnable scripts
│   ├── project.py                     # Run the projection pipeline
│   ├── nanochat_inference.py          # Run nanochat checkpoint inference
│   ├── evaluate_model_predictions.py  # Score predictions
│   └── get_eval_scores.py             # Summarize scored evaluation results
├── questions-and-qrels/               # Pre-built benchmark questions, answers, and qrels
│   ├── nq/
│   │   ├── answers.nanoknow-nq.jsonl
│   │   ├── qrels.nanoknow-nq.supported.txt
│   │   ├── topics.nanoknow-nq.supported.tsv
│   │   └── topics.nanoknow-nq.unsupported.tsv
│   └── squad/
│       ├── answers.nanoknow-squad.jsonl
│       ├── qrels.nanoknow-squad.supported.txt
│       ├── topics.nanoknow-squad.supported.tsv
│       └── topics.nanoknow-squad.unsupported.tsv
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md
We evaluated eight checkpoints across three model scales:
| Scale | Checkpoints |
|---|---|
| d20 (~561M params) | sampathchanda/nanochat-d20, shu127/nanochat-d20, pankajmathur/nanochat-d20 |
| d32 (~1.9B params) | karpathy/nanochat-d32, Antigma/nanochat-d32 |
| d34 (~2.2B params) | renatocastro33/nanochat-d34-sft, victoremnm/nanochat-d34-sft, pankajmathur/nanochat-d34-sft-hf |
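When comparing these checkpoints, it can help to reduce each score summary to two differences: how much retrieved context adds on supported questions, and how far closed-book accuracy on supported questions exceeds that on unsupported ones. The sketch below assumes the summary uses the keys shown in the example output of get_eval_scores.py; the function name and the two derived quantities are illustrative, not part of the NanoKnow scripts.

```python
import json


def summarize_gaps(scores: dict, metric: str = "exact_match_accuracy") -> dict:
    """Compute the context gain and the supported-vs-unsupported gap for one metric."""
    closed = scores["supported_closed_book"][metric]
    context = scores["supported_w_fineweb_context"][metric]
    unsupported = scores["unsupported_closed_book"][metric]
    return {
        "context_gain": context - closed,    # benefit of adding FineWeb context
        "support_gap": closed - unsupported,  # closed-book advantage on supported questions
    }


# Exact-match values copied from the example output for karpathy_nanochat_d32 on NQ.
example = json.loads("""
{
  "supported_closed_book": {"count": 2389, "exact_match_accuracy": 0.19589786521557137},
  "supported_w_fineweb_context": {"count": 2389, "exact_match_accuracy": 0.46923398911678527},
  "unsupported_closed_book": {"count": 1221, "exact_match_accuracy": 0.00819000819000819}
}
""")

gaps = summarize_gaps(example)
print({name: round(value, 3) for name, value in gaps.items()})
# {'context_gain': 0.273, 'support_gap': 0.188}
```

Passing metric="llm_judge_accuracy" gives the same comparison for the LLM-judge scores, provided the summary includes that field.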
@article{gu2026nanoknow,
title={NanoKnow: How to Know What Your Language Model Knows},
author={Gu, Lingwei and Jedidi, Nour and Lin, Jimmy},
journal={arXiv preprint arXiv:2602.20122},
year={2026}
}
License: Apache 2.0