castorini/NanoKnow

NanoKnow

Project QA benchmarks onto LLM pre-training corpora.

NanoKnow identifies which benchmark questions have answers in a model's training data, enabling controlled studies of parametric knowledge vs. retrieval-augmented generation (RAG).

🎉 NanoKnow was accepted to SIGIR '26!

arXiv: https://arxiv.org/abs/2602.20122

Overview

Given a QA benchmark and a pre-training corpus, NanoKnow produces relevance judgments (qrels) that partition questions into:

  • Supported: The answer exists in the training data (the model could have memorized it).
  • Unsupported: The answer does not appear in the training data.

The pipeline has three stages:

  1. BM25 Retrieval: Search the corpus for candidate documents using the question as a query.
  2. Answer String Matching: Filter to documents that contain the gold answer as a substring.
  3. LLM Verification: Use an LLM judge to filter out coincidental matches (e.g., "Paris" in a passage about Paris, Texas).
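
As a rough illustration of the answer string matching step (step 2 above), the sketch below keeps only candidate documents that contain a gold answer as a substring. The function name `filter_by_answer`, the `(docid, text)` candidate format, and the case-insensitive comparison are assumptions made for this example; they are not taken from NanoKnow's code.

```python
from typing import Iterable


def filter_by_answer(
    candidates: Iterable[tuple[str, str]],
    gold_answers: list[str],
) -> list[str]:
    """Return docids whose text contains any gold answer as a substring.

    Hypothetical helper: matching is case-insensitive here, which is an
    assumption rather than NanoKnow's documented behavior.
    """
    needles = [a.lower() for a in gold_answers if a.strip()]
    kept = []
    for docid, text in candidates:
        haystack = text.lower()
        if any(needle in haystack for needle in needles):
            kept.append(docid)
    return kept


# Toy candidates standing in for BM25 hits from step 1.
candidates = [
    ("doc_a", "Paris is the capital of France."),
    ("doc_b", "Paris, Texas is a city in Lamar County."),
    ("doc_c", "Berlin is the capital of Germany."),
]

matched = filter_by_answer(candidates, ["Paris"])
print(matched)  # ['doc_a', 'doc_b']
```

Note that `doc_b` survives the substring filter even though it is about Paris, Texas. That kind of coincidental match is what the LLM verification step is meant to remove.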

Pre-built Qrels

We provide pre-built qrels for nanochat models trained on karpathy/fineweb-edu-100b-shuffle:

| Dataset | Questions | Supported | Unsupported |
|---|---|---|---|
| SQuAD | 10,570 | 7,490 (71%) | 3,080 (29%) |
| NQ-Open | 3,610 | 2,389 (66%) | 1,221 (34%) |

The pre-built files are organized by dataset under questions-and-qrels/:

questions-and-qrels/
├── nq/
│   ├── answers.nanoknow-nq.jsonl
│   ├── qrels.nanoknow-nq.supported.txt
│   ├── topics.nanoknow-nq.supported.tsv
│   └── topics.nanoknow-nq.unsupported.tsv
└── squad/
    ├── answers.nanoknow-squad.jsonl
    ├── qrels.nanoknow-squad.supported.txt
    ├── topics.nanoknow-squad.supported.tsv
    └── topics.nanoknow-squad.unsupported.tsv

Each dataset directory contains:

  • topics.nanoknow-<dataset>.supported.tsv: supported questions as qid<TAB>question.
  • topics.nanoknow-<dataset>.unsupported.tsv: unsupported questions as qid<TAB>question.
  • answers.nanoknow-<dataset>.jsonl: gold answers as one JSON object per line, e.g. {"qid": "0", "answer": ["14 December 1972 UTC", "December 1972"]}.
  • qrels.nanoknow-<dataset>.supported.txt: TREC-format qrels for supported questions, e.g. 0 Q0 shard_01177_50695 1.
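
The three file formats above can be read with the standard library alone. The following is a minimal sketch; the function names `load_topics`, `load_answers`, and `load_qrels` are hypothetical and not part of the NanoKnow package, and the parsing assumes files shaped exactly like the examples in the list above.

```python
import json
from collections import defaultdict


def load_topics(path):
    """Read a topics TSV (qid<TAB>question) into {qid: question}."""
    topics = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first tab only, in case a question contains tabs.
            qid, question = line.split("\t", 1)
            topics[qid] = question
    return topics


def load_answers(path):
    """Read the answers JSONL into {qid: [gold answer strings]}."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                answers[str(record["qid"])] = record["answer"]
    return answers


def load_qrels(path):
    """Read TREC qrels (qid, iteration, docid, relevance) into {qid: {docid: rel}}."""
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 4:
                qid, _iteration, docid, rel = parts
                qrels[qid][docid] = int(rel)
    return dict(qrels)
```

For example, `load_qrels("questions-and-qrels/nq/qrels.nanoknow-nq.supported.txt")` would return a mapping from each supported question ID to the documents judged to contain its answer.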

Installation

pip install -r requirements.txt

For BM25 retrieval, you also need Java 11+:

# Ubuntu/Debian
sudo apt install openjdk-11-jdk

FineWeb-Edu Lucene Index

We release a pre-built Lucene index over karpathy/fineweb-edu-100b-shuffle (326 GB):

Download: LingweiGu/NanoKnow-Fineweb-Edu-Index

huggingface-cli download LingweiGu/NanoKnow-Fineweb-Edu-Index --repo-type dataset --local-dir ./fineweb-edu-index

To build the index yourself using Anserini:

bin/run.sh io.anserini.index.IndexCollection \
  -collection FinewebCollection \
  -input /path/to/corpus \
  -index /output/directory \
  -generator DefaultLuceneDocumentGenerator \
  -threads 16

Usage

Project a new benchmark

# Stage 1: BM25 retrieval + answer matching (pipeline steps 1 and 2; CPU only)
python scripts/project.py \
    --dataset squad \
    --stage 1 \
    --index_path /path/to/lucene-index \
    --output output/squad_stage1.pkl

# Stage 2: LLM verification (pipeline step 3; requires GPU)
python scripts/project.py \
    --stage 2 \
    --input output/squad_stage1.pkl \
    --output output/squad_stage2.pkl

# Or run both stages together
python scripts/project.py \
    --dataset squad \
    --stage both \
    --index_path /path/to/lucene-index \
    --output output/squad_projected.pkl

Evaluate a nanochat checkpoint

Step 1: Run inference with a checkpoint

NANOCHAT_DIR="${NANOCHAT_DIR:-/path/to/nanochat}"
CHECKPOINT_DIR="${CHECKPOINT_DIR:-/path/to/nanochat-checkpoint}"
STEP="${STEP:?Set STEP to the checkpoint step to evaluate}"
DATASET="${DATASET:-nq}"
QRELS_DIR="${QRELS_DIR:-questions-and-qrels/${DATASET}}"
FINEWEB_INDEX_PATH="${FINEWEB_INDEX_PATH:-/path/to/fineweb-index}"
OUTPUT_DIR="${OUTPUT_DIR:-output}"
DEVICE="${DEVICE:-cuda}"

python scripts/nanochat_inference.py \
    --nanochat-dir "${NANOCHAT_DIR}" \
    --checkpoint_dir "${CHECKPOINT_DIR}" \
    --step "${STEP}" \
    --dataset "${DATASET}" \
    --qrels_dir "${QRELS_DIR}" \
    --fineweb_index_path "${FINEWEB_INDEX_PATH}" \
    --output_dir "${OUTPUT_DIR}" \
    --device "${DEVICE}"

If nanochat is already importable in your Python environment, --nanochat-dir can be omitted. For repeated runs, you can set NANOCHAT_DIR=/path/to/nanochat instead of passing the argument each time.

The output file is named from the checkpoint directory basename and dataset, for example output/karpathy_nanochat_d32_nq.
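
The naming rule can be sketched as follows. This is an illustration of the convention described above, not the code in `scripts/nanochat_inference.py`; the helper `output_name` and the example checkpoint path `/models/karpathy_nanochat_d32` are hypothetical.

```python
from pathlib import Path


def output_name(checkpoint_dir: str, dataset: str, output_dir: str = "output") -> str:
    """Join the checkpoint directory basename and the dataset name (assumed rule)."""
    basename = Path(checkpoint_dir).name  # a trailing slash is ignored by Path
    return (Path(output_dir) / f"{basename}_{dataset}").as_posix()


print(output_name("/models/karpathy_nanochat_d32", "nq"))
# output/karpathy_nanochat_d32_nq
```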

Step 2: Score predictions with the evaluator

python scripts/evaluate_model_predictions.py \
    --input_file output/karpathy_nanochat_d32_nq \
    --output_file nanochat_evaluations/karpathy_nanochat_d32_nq_scored.pkl

Step 3: Summarize evaluation scores

python scripts/get_eval_scores.py nanochat_evaluations/karpathy_nanochat_d32_nq_scored.pkl

Example output:

{
  "supported_closed_book": {
    "count": 2389,
    "exact_match_accuracy": 0.19589786521557137,
    "llm_judge_accuracy": 0.2293846797823357
  },
  "supported_w_fineweb_context": {
    "count": 2389,
    "exact_match_accuracy": 0.46923398911678527,
    "llm_judge_accuracy": 0.46672247802427796
  },
  "unsupported_closed_book": {
    "count": 1221,
    "exact_match_accuracy": 0.00819000819000819,
    "llm_judge_accuracy": 0.04914004914004914
  }
}
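
To read the report at a glance, it can help to compute two gaps from it: how much retrieved FineWeb context adds on supported questions, and how far closed-book accuracy on supported questions sits above unsupported ones. The sketch below hard-codes the numbers from the example output above; the variable names are illustrative only.

```python
# Scores copied from the example output above.
scores = {
    "supported_closed_book": {
        "exact_match_accuracy": 0.19589786521557137,
        "llm_judge_accuracy": 0.2293846797823357,
    },
    "supported_w_fineweb_context": {
        "exact_match_accuracy": 0.46923398911678527,
        "llm_judge_accuracy": 0.46672247802427796,
    },
    "unsupported_closed_book": {
        "exact_match_accuracy": 0.00819000819000819,
        "llm_judge_accuracy": 0.04914004914004914,
    },
}

gaps = {}
for metric in ("exact_match_accuracy", "llm_judge_accuracy"):
    closed = scores["supported_closed_book"][metric]
    with_context = scores["supported_w_fineweb_context"][metric]
    unsupported = scores["unsupported_closed_book"][metric]
    gaps[metric] = {
        # Gain from adding retrieved context, on supported questions.
        "context_gain": with_context - closed,
        # Closed-book gap between supported and unsupported questions.
        "support_gap": closed - unsupported,
    }
    print(
        f"{metric}: context {100 * gaps[metric]['context_gain']:+.1f} points, "
        f"supported vs. unsupported {100 * gaps[metric]['support_gap']:+.1f} points"
    )
```

In this example, retrieved context adds roughly 27 exact-match points on supported questions, and closed-book exact match is roughly 19 points higher on supported questions than on unsupported ones.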

Replication check: on June 1, 2026, we re-ran the NQ evaluation for karpathy_nanochat_d32 at step 650. Inference completed in 36m56s, scoring completed in 25m40s, and the reproduced scores were:

{
  "supported_closed_book": {
    "count": 2389,
    "exact_match_accuracy": 0.19715362076182502,
    "llm_judge_accuracy": 0.22980326496442027
  },
  "supported_w_fineweb_context": {
    "count": 2389,
    "exact_match_accuracy": 0.4679782335705316,
    "llm_judge_accuracy": 0.47216408539137716
  },
  "unsupported_closed_book": {
    "count": 1221,
    "exact_match_accuracy": 0.00819000819000819,
    "llm_judge_accuracy": 0.04504504504504504
  }
}
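
One way to quantify how closely the re-run matches the original example is to convert each accuracy difference back into a number of questions. The sketch below uses only the values from the two reports above; the structure and names are illustrative.

```python
# Question counts per split, as reported in both runs.
counts = {
    "supported_closed_book": 2389,
    "supported_w_fineweb_context": 2389,
    "unsupported_closed_book": 1221,
}

# (original, rerun) accuracy pairs copied from the two reports above.
runs = {
    ("supported_closed_book", "exact_match_accuracy"): (0.19589786521557137, 0.19715362076182502),
    ("supported_closed_book", "llm_judge_accuracy"): (0.2293846797823357, 0.22980326496442027),
    ("supported_w_fineweb_context", "exact_match_accuracy"): (0.46923398911678527, 0.4679782335705316),
    ("supported_w_fineweb_context", "llm_judge_accuracy"): (0.46672247802427796, 0.47216408539137716),
    ("unsupported_closed_book", "exact_match_accuracy"): (0.00819000819000819, 0.00819000819000819),
    ("unsupported_closed_book", "llm_judge_accuracy"): (0.04914004914004914, 0.04504504504504504),
}

# Difference expressed as a whole number of questions (rerun minus original).
question_deltas = {
    key: round((rerun - original) * counts[key[0]])
    for key, (original, rerun) in runs.items()
}
max_abs_diff = max(abs(rerun - original) for original, rerun in runs.values())

for (split, metric), delta in question_deltas.items():
    print(f"{split} / {metric}: {delta:+d} questions")
print(f"largest absolute difference: {100 * max_abs_diff:.2f} points")
```

Every metric differs by less than one accuracy point between the two runs, and the largest shift corresponds to 13 of the 2,389 supported questions under the LLM judge with FineWeb context.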

Repository Structure

NanoKnow/
├── nanoknow/                  # Core library
│   ├── retriever.py           # Stage 1: BM25 retrieval + answer matching
│   ├── verifier.py            # Stage 2: LLM-based verification
│   └── evaluator.py           # Evaluation utilities
├── scripts/                   # Runnable scripts
│   ├── project.py             # Run the projection pipeline
│   ├── nanochat_inference.py  # Run nanochat checkpoint inference
│   ├── evaluate_model_predictions.py  # Score predictions
│   └── get_eval_scores.py     # Summarize scored evaluation results
├── questions-and-qrels/       # Pre-built benchmark questions, answers, and qrels
│   ├── nq/
│   │   ├── answers.nanoknow-nq.jsonl
│   │   ├── qrels.nanoknow-nq.supported.txt
│   │   ├── topics.nanoknow-nq.supported.tsv
│   │   └── topics.nanoknow-nq.unsupported.tsv
│   └── squad/
│       ├── answers.nanoknow-squad.jsonl
│       ├── qrels.nanoknow-squad.supported.txt
│       ├── topics.nanoknow-squad.supported.tsv
│       └── topics.nanoknow-squad.unsupported.tsv
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md

nanochat Checkpoints

We evaluated eight checkpoints across three model scales:

| Scale | Checkpoints |
|---|---|
| d20 (~561M params) | sampathchanda/nanochat-d20, shu127/nanochat-d20, pankajmathur/nanochat-d20 |
| d32 (~1.9B params) | karpathy/nanochat-d32, Antigma/nanochat-d32 |
| d34 (~2.2B params) | renatocastro33/nanochat-d34-sft, victoremnm/nanochat-d34-sft, pankajmathur/nanochat-d34-sft-hf |

Citation

@article{gu2026nanoknow,
  title={NanoKnow: How to Know What Your Language Model Knows},
  author={Gu, Lingwei and Jedidi, Nour and Lin, Jimmy},
  journal={arXiv preprint arXiv:2602.20122},
  year={2026}
}

License

Apache 2.0
