
Branch 3-0-2/bug fixes perf enhancements#29

Merged
russfellows merged 15 commits into main from branch-3-0-2/bug-fixes-perf-enhancements
May 13, 2026

@russfellows
Owner

PR Summary: branch-3-0-2/bug-fixes-perf-enhancements

Branch: branch-3-0-2/bug-fixes-perf-enhancements
Base: main (mlcommons/storage)
Date: May 13, 2026
Tests: 127 passed, 0 failed (was 112 passed, 13 failed on clean main)


Issues Addressed

Of the 8 recent open issues on mlcommons/storage listed below, 7 are fixed by this
branch (the fix for mlcommons#372 is pending commit). Issue mlcommons#369 was determined
to be an environment/OpenMPI configuration problem with no applicable code fix.

| Issue | Title | Status | Fix location |
|---|---|---|---|
| mlcommons#362 | Training stuck at epoch 1, no NVMe reads | ✅ Fixed | dlio_benchmark (reader_factory.py) |
| mlcommons#363 | collect_cluster_info() missing required results_dir | ✅ Fixed | benchmarks/base.py |
| mlcommons#364 | Flux AU limited by Parquet deserialization throughput | ✅ Fixed | dlio_benchmark (reader_factory.py) + s3dlio |
| mlcommons#365 | Checkpointing split-phase reports wrong operation counts | ✅ Fixed | benchmarks/base.py |
| mlcommons#367 | reportgen crashes with AttributeError on Namespace.file | ✅ Fixed | cli_parser.py |
| mlcommons#369 | orte_init failed: No permission (-17) | ⚪ Not a code bug | OpenMPI environment/permissions issue |
| mlcommons#371 | --params storage.storage_type=direct_fs silently uses pagecache | ✅ Fixed | dlio_benchmark (pytorch_checkpointing.py) |
| mlcommons#372 | 32 GB hard cap blocks large-memory runs | ✅ Fixed (pending commit) | dlio_benchmark (utils/config.py) |

Commit History (above main)

Commit 1 — 022820b

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: cli_parser: guard --file/--object consolidation for non-benchmark subcommands
Fixes: mlcommons#367
Cherry-picked from: PR mlcommons#368

Problem: The reportgen, history, and lockfile subcommands do not call
add_storage_type_arguments(), so their Namespace objects have no .file or
.object attribute. The unconditional read and del in parse_arguments()
crashed with AttributeError.

Changes in mlpstorage_py/cli_parser.py:

  • Guard the --file/--object consolidation block with
    if hasattr(parsed_args, "file") or hasattr(parsed_args, "object"):
  • Use getattr(parsed_args, "file", False) instead of direct attribute access
  • Replace bare del parsed_args.file / del parsed_args.object with a
    for _attr in ("file", "object"): if hasattr(...): delattr(...) loop
    so neither attribute is required to be present

Also includes new unit tests in tests/unit/test_cli.py covering the
parser behaviour for all subcommand types.
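
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in cli_parser.py: the helper name consolidate_storage_flags and the "file"/"object" values written to data_access_protocol are assumptions made for the example.

```python
import argparse


def consolidate_storage_flags(parsed_args: argparse.Namespace) -> argparse.Namespace:
    """Fold --file/--object into one attribute, tolerating their absence.

    Hypothetical helper: the real logic lives inline in parse_arguments().
    """
    # Only benchmark subcommands register --file/--object, so check first.
    if hasattr(parsed_args, "file") or hasattr(parsed_args, "object"):
        use_file = getattr(parsed_args, "file", False)
        use_object = getattr(parsed_args, "object", False)
        # Assumed encoding of the consolidated value, for illustration only.
        if use_object:
            parsed_args.data_access_protocol = "object"
        elif use_file:
            parsed_args.data_access_protocol = "file"
        else:
            parsed_args.data_access_protocol = None
        # Remove whichever of the two attributes is actually present.
        for _attr in ("file", "object"):
            if hasattr(parsed_args, _attr):
                delattr(parsed_args, _attr)
    return parsed_args
```

A Namespace produced by reportgen, history, or lockfile passes through unchanged, which is the case that previously raised AttributeError.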


Commit 2 — 03765a2

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: Remove unwanted file
Cherry-picked from: PR mlcommons#368

Removes a requirements.txt that was accidentally included in the
previous commit.


Commit 3 — 7e4245b

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: Fix #363: pass results_dir to collect_cluster_info
Fixes: mlcommons#363
Cherry-picked from: PR mlcommons#366

Problem: Benchmark._collect_cluster_information() called
collect_cluster_info() without the required positional argument
results_dir. This caused a TypeError at runtime:

WARNING: MPI cluster info collection failed: collect_cluster_info()
missing 1 required positional argument: 'results_dir'

The missing cluster info then propagated as None into reportgen,
causing a downstream crash:

[INVALID] None: Check check_num_files_train failed with error:
'NoneType' object has no attribute 'total_memory_bytes'

Changes in mlpstorage_py/benchmarks/base.py:

  • Extract ssh_username and shared_staging_dir from self.args via
    getattr(..., None) before the call
  • Pass results_dir=self.run_result_output (the benchmark's computed
    output directory) to collect_cluster_info()
  • Pass shared_staging_dir=shared_staging_dir and
    ssh_username=ssh_username so SSH-based collection uses the correct
    credentials and staging path

Changes in mlpstorage_py/tests/test_benchmarks.py:

  • Set benchmark.run_result_output = '/tmp/results/run-001' in the
    test fixture (previously missing; the call site needs this attribute)
  • Update assert_called_once_with to expect results_dir,
    shared_staging_dir, and ssh_username
  • Add TestCollectClusterInfoSignatureBinding regression test class (2
    new tests) that binds the actual kwargs against inspect.signature()
    of the real collect_cluster_info function, so future signature drift
    is caught at unit-test time rather than at runtime
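
The signature-binding idea can be illustrated with the sketch below. The collect_cluster_info defined here is a hypothetical stand-in whose parameter list is an assumption; the real tests bind against the actual function.

```python
import inspect


def collect_cluster_info(hosts, results_dir, shared_staging_dir=None, ssh_username=None):
    """Stand-in with an assumed signature; not the real implementation."""
    return {"hosts": hosts, "results_dir": results_dir}


def call_binds(func, *args, **kwargs) -> bool:
    """Return True if the arguments satisfy func's signature, without calling it."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
        return True
    except TypeError:
        # bind() raises TypeError for missing or unexpected arguments,
        # the same class of failure seen at runtime in mlcommons#363.
        return False
```

Because bind() never invokes the function, a test built this way needs no MPI or SSH environment and still fails as soon as the call site and the signature diverge.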

Commit 4 — 2431011

Author: Russell Fellows
Co-authored-by: Devasena Inupakutika <devasena.i@samsung.com>
Message: Fix #365, #372: metadata override propagation, test suite fixes, env lock
Fixes: mlcommons#365

Fix mlcommons#365 — CLI override_parameters not reflected in metadata.json

Problem: The submission checker reads num_checkpoints_write /
num_checkpoints_read from metadata['parameters'] (the YAML
defaults). For split-phase submissions (write-only or read-only runs),
the correct counts are passed as CLI overrides such as:

override_parameters.num_checkpoints_write=10

These overrides landed in metadata['override_parameters'] only, which
the checker ignores. As a result, a 10-write + 10-read split-phase run
would aggregate to 20 writes + 20 reads and be marked INVALID.

Changes in mlpstorage_py/benchmarks/base.py:

  • Add _apply_dotted_overrides(params, overrides) static method that
    deep-copies params and merges dotted-key overrides into the nested
    dict structure
  • In the metadata property, call _apply_dotted_overrides() so
    metadata['parameters'] reflects the effective runtime configuration
  • metadata['override_parameters'] is still emitted unchanged for a
    full audit trail
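
A minimal sketch of the dotted-key merge, assuming overrides arrive as a flat dict of dotted keys. The function below is illustrative and may differ from the real _apply_dotted_overrides(); the "checkpoint" nesting in the usage lines is likewise an assumed layout.

```python
import copy


def apply_dotted_overrides(params: dict, overrides: dict) -> dict:
    """Return a deep copy of params with dotted-key overrides merged in."""
    merged = copy.deepcopy(params)  # never mutate the caller's YAML defaults
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # Create (or replace a scalar with) an intermediate dict.
                child = {}
                node[part] = child
            node = child
        node[leaf] = value
    return merged


defaults = {"checkpoint": {"num_checkpoints_write": 10, "num_checkpoints_read": 10}}
effective = apply_dotted_overrides(defaults, {"checkpoint.num_checkpoints_read": 0})
print(effective["checkpoint"])
```

The deep copy keeps the original defaults intact, so override_parameters can still be emitted next to the merged result.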

Note: PR mlcommons#370 (crossmeta/zettalane) addresses the same root cause.
That PR is blocked pending CLA signature from @zettalane. This
implementation is carried independently; the two fixes are
functionally equivalent.

Fix — DLIOResultParser system info fallback

Problem: When a DLIO summary.json does not contain a system_info
block (e.g. runs from older DLIO versions), DLIOResultParser.parse()
returned None for ClusterInformation, breaking BenchmarkRun
validation.

Changes in mlpstorage_py/rules/models.py:

  • DLIOResultParser.parse() now accepts an optional metadata kwarg
  • When ClusterInformation.from_dlio_summary_json() returns None,
    fall back to metadata['cluster_information'] if present and
    reconstruct via ClusterInformation.from_dict()
  • BenchmarkRun.__init__ passes the run's metadata object to
    parser.parse() to enable the fallback
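
The fallback order can be sketched as below. The ClusterInformation class here is a deliberately reduced stand-in (one field, taken from the error message in mlcommons#363), and resolve_cluster_information is a hypothetical helper; the real logic sits inside DLIOResultParser.parse().

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class ClusterInformation:
    """Reduced stand-in; the real model carries many more fields."""
    total_memory_bytes: int

    @classmethod
    def from_dict(cls, data: Mapping[str, Any]) -> "ClusterInformation":
        return cls(total_memory_bytes=int(data["total_memory_bytes"]))

    @classmethod
    def from_dlio_summary_json(cls, summary: Mapping[str, Any]) -> Optional["ClusterInformation"]:
        # Assumed layout: older DLIO summaries simply lack a system_info block.
        system_info = summary.get("system_info")
        return cls.from_dict(system_info) if system_info else None


def resolve_cluster_information(summary, metadata=None) -> Optional[ClusterInformation]:
    """Prefer the DLIO summary; fall back to the run metadata when it is empty."""
    info = ClusterInformation.from_dlio_summary_json(summary)
    if info is None and metadata and metadata.get("cluster_information"):
        info = ClusterInformation.from_dict(metadata["cluster_information"])
    return info
```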

Fix — 13 pre-existing test failures

mlpstorage_py/tests/test_cluster_collector.py (10 tests):

  • All MPIClusterCollector(...) constructor calls and
    collect_cluster_info(...) call sites in failing tests were missing
    the now-required results_dir argument — added results_dir='/tmp'
    to all 10 affected call sites
  • test_collector_returns_valid_data_without_error_marker: rewrote to
    use the current shared_staging_dir=tmpdir pattern instead of the
    obsolete UUID-based staging directory approach

mlpstorage_py/tests/test_rules.py (3 tests):

  • TestBenchmarkRunSystemInfoFallback tests were failing with
    ValueError: No summary.json found in /tmp/test_run because they
    attempted real filesystem I/O
  • Patched DLIOResultParser._load_summary and
    DLIOResultParser._load_hydra_configs to return in-memory mock data,
    removing the filesystem dependency
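
The patching approach looks roughly like this. The DLIOResultParser below is a hypothetical stand-in that only mimics the two loader methods named above; the mock payloads are placeholders.

```python
from unittest.mock import patch


class DLIOResultParser:
    """Stand-in parser whose loaders would normally hit the filesystem."""

    def __init__(self, run_dir):
        self.run_dir = run_dir

    def _load_summary(self):
        raise ValueError(f"No summary.json found in {self.run_dir}")

    def _load_hydra_configs(self):
        raise ValueError(f"No hydra configs found in {self.run_dir}")

    def parse(self):
        return {"summary": self._load_summary(), "configs": self._load_hydra_configs()}


def parse_with_mocks(run_dir, summary, configs):
    """Run parse() with both loaders replaced by in-memory data."""
    with patch.object(DLIOResultParser, "_load_summary", return_value=summary), \
         patch.object(DLIOResultParser, "_load_hydra_configs", return_value=configs):
        return DLIOResultParser(run_dir).parse()
```

Patching on the class rather than an instance covers parsers constructed indirectly, for example inside BenchmarkRun.__init__.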

pyproject.toml / uv.lock

  • Add [tool.uv] environments = ["sys_platform == 'linux'"] to
    pyproject.toml so uv lock does not attempt to resolve non-Linux
    platform markers (s3dlio only publishes Linux wheels)
  • Regenerate uv.lock accordingly

dlio_benchmark Fixes (russfellows/dlio_benchmark — feat/parquet-dgen-streaming)

The following fixes are in the dlio_benchmark fork that is pinned by this
branch's pyproject.toml. They are already committed in the fork; issue mlcommons#372
has an additional local change that is pending commit/push.


Fix mlcommons#362 / mlcommons#364 — Training stuck at epoch 1; Flux AU limited by CPU Parquet deserialization

Files: dlio_benchmark/reader/reader_factory.py,
dlio_benchmark/reader/parquet_reader_file_iterable.py (new),
dlio_benchmark/reader/parquet_reader_s3dlio.py
Commit: 1635b79 (feat: s3dlio-gen streaming, iterable dataloader, file iterable reader)

Issue mlcommons#362 — Stuck at epoch 1, no NVMe reads:
reader_factory.py routed LOCAL_FS + Parquet to the legacy ParquetReader,
which calls pf.read_row_group() — full PyArrow deserialization on every read.
This is entirely CPU-bound and saturates the Python GIL, starving DLIO's
DataLoader workers of CPU time. Observed symptom: benchmark reaches
"Starting epoch 1" and then makes no measurable NVMe I/O while CPU pegs at
88-95%.

Issue mlcommons#364 — Flux AU limited by per-process Parquet deserialization:
Same root cause. Even on a 192-vCPU Zen 4 machine, PyArrow's
read_row_group(use_threads=True) spawns additional decode threads per call.
Under DLIO's model (e.g. 4 MPI × 8 read_threads = 32 workers), hundreds of
threads contend on the GIL. On Skylake with data in tmpfs (zero I/O latency), AU was
only 21%, so storage is provably not the bottleneck; CPU decode is.

Fix: reader_factory.py now routes LOCAL_FS + Parquet to
ParquetReaderFileIterable — a new reader that performs raw byte-range reads
via a 64-thread ThreadPoolExecutor without any PyArrow decode. Data is
returned as raw bytes to the training loop. For S3/object storage, the s3dlio
Rust-based reader (ParquetReaderS3dlio) is used, which similarly bypasses
Python-side decode.

# Before (reader_factory.py):
# LOCAL_FS + Parquet → ParquetReader → pf.read_row_group() — full PyArrow decode

# After:
elif _args.storage_type in (StorageType.LOCAL_FS,):
    from dlio_benchmark.reader.parquet_reader_file_iterable import ParquetReaderFileIterable
    return ParquetReaderFileIterable(dataset_type, thread_index, epoch_number)

Result (from issue mlcommons#364 testing, c6in.16xlarge, data on tmpfs):

| Accelerators | use_threads | AU | Throughput | Result |
|---|---|---|---|---|
| 4 | True (before) | 54.38% | ~77 MB/s | ❌ FAIL |
| 4 | False (workaround) | 99.79% | 141.80 MB/s | ✅ PASS |
| 8 | False (workaround) | 99.68% | 283.07 MB/s | ✅ PASS |

The ParquetReaderFileIterable path goes further — no decode at all — giving
even better scaling on older CPU generations (Skylake, Cascade Lake) that lack
AVX-512 Parquet acceleration.
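
For illustration, raw byte-range reads through a thread pool can be sketched as follows. This is not the ParquetReaderFileIterable implementation; the function names and the (offset, length) range format are assumptions, and only the 64-thread pool size comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def read_byte_range(path, offset, length):
    """Read length bytes starting at offset, with no decoding of any kind."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


def read_byte_ranges(path, ranges, max_workers=64):
    """Fetch several (offset, length) ranges concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_byte_range, path, off, ln) for off, ln in ranges]
        return [future.result() for future in futures]
```

File reads release the GIL while blocked in the kernel, so the threads spend their time waiting on storage rather than competing for CPU, which is the behaviour a storage benchmark wants to measure.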


Fix mlcommons#371: --params storage.storage_type=direct_fs silently uses page cache

File: dlio_benchmark/checkpointing/pytorch_checkpointing.py
Commit: present in fork on branch feat/parquet-dgen-streaming

Problem: After PR mlcommons#359 renamed the Python package from mlpstorage to
mlpstorage_py, one import path in dlio_benchmark was missed:

# Before (bug — old package name):
try:
    from mlpstorage.checkpointing import StreamingCheckpointing as _SC  # always fails
except ImportError:
    from dlio_benchmark.checkpointing.simple_streaming_checkpointing import (
        SimpleStreamingCheckpointing as _SC,   # silently falls back here
    )

SimpleStreamingCheckpointing ignores the backend='direct_fs' argument
entirely and uses plain open(path, "wb"). The result: when a user passes
--params storage.storage_type=direct_fs, page cache is never bypassed.
This was confirmed with free -h showing page cache growing during the write
phase and Lustre client cache filling up on a Lustre-backed mount.

Fix (one line):

# After:
from mlpstorage_py.checkpointing import StreamingCheckpointing as _SC

This ensures direct_fs checkpointing correctly uses O_DIRECT via s3dlio's
direct:// URI scheme, bypassing the page cache as intended.
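
The failure mode is generic to try/except ImportError fallbacks: a renamed package looks the same as a missing one. The sketch below is not part of the fix; it shows one way to keep such a fallback observable, using a hypothetical load_first_available helper that reports which module was actually imported.

```python
import importlib


def load_first_available(candidates):
    """Import the first module that resolves and return (module, name).

    Returning the name lets the caller log or assert which backend is in
    use, instead of silently accepting the fallback.
    """
    for name in candidates:
        try:
            return importlib.import_module(name), name
        except ImportError:
            continue
    raise ImportError(f"none of {list(candidates)!r} could be imported")


# A stale first entry falls through to the second without any error.
module, chosen = load_first_available(["package_that_was_renamed", "json"])
print(chosen)
```

The fix in this PR takes the simpler route of importing mlpstorage_py.checkpointing directly, so a wrong path fails loudly at import time.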


Fix mlcommons#372 — 32 GB hard cap blocks large-memory runs

File: dlio_benchmark/utils/config.py
Status: Modified locally in russfellows/dlio_benchmark, pending commit/push

Problem: BUDGET_MB was hard-coded to 32 * 1024 (32 GB). On hosts with
more than 32 GB of RAM this cap artificially constrains the number of DataLoader
workers. The error manifests as:

Exception: Memory budget exceeded: reader.read_threads=2 x comm_size=64 = 128
worker processes, estimated ~64 GB (hard cap: 32 GB). Reduce reader.read_threads
to at most 1 for this run.

On a 377 GB host trying to run 64 accelerators × 2 read_threads, the cap
prevents any run larger than 32 B200 ranks × 2 threads (64 workers, an estimated
32 GB), limiting throughput to ~2.3 GB/s regardless of storage capability (well
below a Gen5 NVMe's 14 GB/s).

Fix:

# Before:
BUDGET_MB = 32 * 1024  # 32 GB hard cap

# After:
BUDGET_MB = psutil.virtual_memory().total // (1024 * 1024)  # actual host RAM

The budget now scales with actual installed RAM, which is the correct
upper bound for in-memory dataset caching.
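
The arithmetic behind the error message can be reproduced with the sketch below. The 512 MB per-worker estimate is inferred from the message itself (128 workers, ~64 GB) and is an assumption, as are the helper names; 377 * 1024 stands in for what psutil would report on the 377 GB host.

```python
PER_WORKER_MB = 512  # assumed: ~64 GB / 128 workers, from the error message


def estimated_memory_mb(read_threads, comm_size, per_worker_mb=PER_WORKER_MB):
    """Estimated footprint of read_threads x comm_size worker processes."""
    return read_threads * comm_size * per_worker_mb


def max_read_threads(comm_size, budget_mb, per_worker_mb=PER_WORKER_MB):
    """Largest read_threads value that stays within the budget."""
    return budget_mb // (comm_size * per_worker_mb)


OLD_BUDGET_MB = 32 * 1024    # previous hard cap
HOST_BUDGET_MB = 377 * 1024  # approximate total RAM of the host in the report

print(estimated_memory_mb(2, 64) // 1024)    # 64 (GB), over the old cap
print(max_read_threads(64, OLD_BUDGET_MB))   # 1, matching the error text
print(max_read_threads(64, HOST_BUDGET_MB))  # 11 with the RAM-based budget
```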


Issue mlcommons#369: orte_init failed: No permission (-17) (No code fix)

Problem: OpenMPI orte_init fails with getting local rank failed → Returned value No permission (-17). This occurs when MPI processes are
launched as root without passing --allow-run-as-root to mpirun, or
when running inside a container with restricted Linux namespaces that
prevent OpenMPI's process management layer from initializing.

Assessment: This is an environment and OpenMPI configuration issue,
not a bug in mlpstorage or dlio_benchmark. The fix is to add
--allow-run-as-root to the mpirun invocation, or to configure the
container/namespace permissions to allow OpenMPI's process manager. No
code change is warranted.


Test Results

Before (clean main):  112 passed, 13 failed
After  (this branch): 127 passed,  0 failed

The net gain of 15 passing tests breaks down as 13 pre-existing failures fixed
plus 2 new regression tests added by PR mlcommons#366.

russfellows and others added 15 commits May 12, 2026 08:45
- flux_datagen.yaml: add use_s3dlio_gen: true, row_group_size: 48
- dlrm_b200.yaml: tune prefetch_size/read_threads for benchmark accuracy
- pyproject.toml: s3dlio>=0.9.100; dlio-benchmark from russfellows fork
  (feat/parquet-dgen-streaming); local s3dlio wheel NOTE comment
- tests/DLRM_test_results.md: direct DLIO benchmark reader comparison results
- docs/Flux_NP_ReadThreads_Scaling_Results.md: new -- NP in {1,2,4,8} x
  RT in {1,2,4,8} scaling sweep results, CPU threshold analysis,
  computation_time impact at 0.5s and 1.35s, samp/s/GPU column
- tests/object-store/: add bench/gen/run scripts for Flux and DLRM workloads
- .gitignore: ignore sweep_logs/, sweep_*.sh, sim_*.tsv*, results/
…ecture docs

── 1. DLRM workload config fixes (configs/dlio/workload/) ───────────────

dlrm_b200.yaml, dlrm_datagen.yaml:
  Reduce num_samples_per_file from 4,718,592 to 1,536,000.
  1,536,000 = 250 row groups x 6,144 rows/RG. This keeps the Parquet
  footer under the s3-ultra 4 MiB single-object GET limit. The previous
  value produced a footer exceeding 4 MiB, causing s3-ultra to reject
  the GET and fall back to a multi-part read, distorting latency.
  Also enables use_s3dlio_gen: true and aligns row_group_size to
  batch_size (6,144) for optimal row-group cache hit rate.

── 2. UNet3D B200 workload config (configs/dlio/workload/unet3d_b200.yaml) ─

New config for UNet3D benchmarking on B200-class hardware.
  - computation_time: 0.162 s (H100 baseline / 2 for B200 throughput target)
  - 7,200 NPZ files, ~140 MiB each, s3dlio storage library
  - batch_size: 4, read_threads: 4

── 3. UNet3D NP sweep scripts (tests/object-store/) ─────────────────────

sweep_unet3d_np.sh:
  Automated NP=1/2/4 scaling sweep for the UNet3D B200 workload.
  Each run writes results to results/unet3d_np_sweep/<timestamp>/.
  Appends a TSV summary row and auto-generates docs/UNet3D_NP_Scaling_Results.md
  at sweep completion. NP=8 excluded -- s3-ultra saturates at NP>=4.

gen_unet3d_npz.sh:
  Generates the 984 GiB UNet3D NPZ dataset on s3-ultra (mlp-unet3d bucket)
  using dlio_benchmark's NPZGenerator fast path (s3dlio generate_npz_bytes(),
  zero Python-side copies, hardware CRC32, Rayon parallel fill).

test_unet3d.sh:
  Single-run smoke test for the UNet3D B200 config (NP=1, 1 epoch).

── 4. DLRM sweep scripts (tests/object-store/) ──────────────────────────

sweep_dlrm_np.sh:      NP=1/2/4 scaling sweep for DLRM Parquet workload.
sweep_dlrm_compute.sh: Compute-time sensitivity sweep for DLRM.

── 5. DataLoader architecture documentation (docs/) ─────────────────────

docs/DATALOADER_ARCHITECTURE.md (new):
  Comprehensive reference covering two major topics:

  Part 1 -- Map-style vs. iterable DataLoaders on S3:
    Why "iterable is better for large datasets" originates from HDD seek
    patterns and does not apply to object storage. The real argument for
    iterable is pipeline depth: TorchIterableDatasetSimple achieves
    64 x num_workers in-flight GETs (vs 1 x num_workers with map-style).
    Covers TorchIterableDatasetSimple implementation mechanics, known
    limitations (per-epoch shuffle propagation, prefetch memory bounds,
    drop-last), and a summary comparison table.

  Part 2 -- O_DIRECT on local NVMe (two independent paths):
    Why O_DIRECT is required for accurate NVMe benchmarking (page cache
    problem). Detailed description and comparison of both available paths:
      - odirect: true  -- Python os.open+os.readv, map-style, 1 read/worker
      - storage_library: direct -- Rust/Tokio O_DIRECT, iterable, 64/worker
    12-property comparison table. Guidance on using both paths together
    to isolate I/O concurrency depth and GIL contention as independent
    variables. Includes TOC with anchor links to all sections.

docs/UNet3D_NP_Scaling_Results.md (new):
  NP=1/2/4 benchmark results for UNet3D B200 on s3-ultra.
  Generated by sweep_unet3d_np.sh.

docs/DLRM_NP_Scaling_Results.md (new):
  NP=1/2/4 benchmark results for DLRM Parquet on s3-ultra.

docs/Flux_NP_ReadThreads_Scaling_Results.md (updated):
  Additional read_threads sweep results appended.

docs/README.md (updated):
  - New "Where to Start" row: Benchmark NVMe with O_DIRECT pointing to
    DATALOADER_ARCHITECTURE.md#o_direct-local-storage-two-independent-paths
  - DATALOADER_ARCHITECTURE.md entry expanded to summarise both parts
    (S3 iterable DataLoader and O_DIRECT NVMe paths) with anchor link.

── 6. pyproject.toml / uv.lock ──────────────────────────────────────────

Switch dlio-benchmark dependency from git branch reference to local
editable path (../dlio_benchmark). Allows iterating on dlio_benchmark
and mlp-storage together without tagging intermediate git commits.
uv.lock updated accordingly.

── 7. .gitignore additions ──────────────────────────────────────────────

Add patterns for runtime artifacts that should never be committed:
  hydra_log/          -- Hydra config output written to cwd during runs
  sweep_unet3d_*.log  -- Timestamped sweep run logs written to repo root
  sweep_dlrm_*.log    -- Timestamped sweep run logs written to repo root
  sweep_flux_*.log    -- Timestamped sweep run logs written to repo root
uv.lock: bump s3dlio wheel to 0.9.100 (skip_head HEAD optimisation,
  PyDataset.from_uris(), items(), collect_batch())

tests/object-store/test_retinanet.sh: end-to-end retinanet 3-epoch benchmark
tests/object-store/gen_retinanet_jpeg.sh: generate retinanet JPEG dataset
tests/object-store/sweep_retinanet_np.sh: sweep concurrency parameters for NP workload
…3dlio 0.9.100)

Benchmark results from 2026-05-12 sweep on co-located 24 vCPU / 48 GB host.
50,000 JPEG files × ~315 KiB/file, 8 epochs, batch=24, read_threads=8.
DataLoader: TorchIterableDatasetSimple + _s3_stream_next() pipelined chunking.
dlio_benchmark commit: fc92d7f (feat/parquet-dgen-streaming).
pyproject.toml:
- dlio-benchmark: local editable -> GitHub rev 3667a0e (v3.0.2)
- s3dlio: local wheel source removed (now resolves from PyPI via >=0.9.100 pin)
- [tool.uv] environments = ['sys_platform == linux'] added (s3dlio Linux-only)

uv.lock:
- dlio-benchmark 3.0.1 -> 3.0.2 from russfellows/dlio_benchmark@3667a0e
- s3dlio 0.9.100 from local wheel -> pypi.org/simple
- mlpstorage 2.0.0b1 -> 3.0.2
- Removed colorama + tzdata (Windows-only, no longer resolved)
…ve historical analysis

Deleted from old-archive/ (31 files):
- All per-library dlio_minio_*.sh, dlio_s3dlio_*.sh, dlio_s3torch_*.sh
  (superseded by unified run_datagen/training/checkpointing/cleanup.sh)
- demo_streaming_checkpoint.sh, test_minio_checkpoint.py,
  test_s3dlio_checkpoint.py, test_s3torch_checkpoint.py
  (superseded by run_checkpointing.sh)
- test_dlio_direct_s3dlio.sh, test_dlio_multilib_demo.py,
  test_mlp_minio/s3dlio/s3torch.sh, test_s3dlio_multilib.sh,
  test_training_mpi_sweep.py (superseded by sweep_*.sh)
- llama3_8b_checkpoint_*.yaml (configs now in configs/dlio/)
- dlio_mpi_object_results.md, Object_Perf_Results.md,
  s3dlio_performance_analysis.md (stale; issues since resolved)

Moved from top-level to old-archive/ (historical reference):
- bench_npz_build.py, bench_parquet_rg_flux.py, bench_wholefile_get.py
- bench-results-retinanet-20260425.md

Remaining old-archive/ contains 10 reference files:
- test_direct_write_comparison.py, test_s3dlio_direct.py,
  test_s3dlio_formats.py/.sh, test_s3lib_get_bench.py,
  S3library_review_21-Mar.md (library API/concurrency reference)
- bench_npz_build.py, bench_parquet_rg_flux.py, bench_wholefile_get.py
  (historical optimization analysis)
- bench-results-retinanet-20260425.md (historical benchmark results)
…ts, add sweeps/

Deleted:
- test_dlrm.sh, test_flux.sh — redundant one-liners; run_dlrm_bench.sh and
  run_flux_bench.sh are the proper scripts (full result parsing, env handling)
- gen_flux_parquet.py — non-standard one-off that bypassed mlpstorage datagen;
  confusing next to the .sh generators; can be replaced with gen_flux_parquet.sh

Moved to old-archive/ (Apr-27, ~16 days old, superseded):
- run_datagen.sh, run_training.sh — generic multi-model wrappers replaced by
  model-specific run_*_bench.sh scripts
- test_multi_endpoint_s3dlio.py — demo script, not a test

New sweeps/ subdirectory:
- sweep_dlrm_compute.sh, sweep_dlrm_np.sh, sweep_flux.sh,
  sweep_retinanet_np.sh, sweep_unet3d_np.sh

Also removed sweep_flux.sh from .gitignore (it was excluded as a scratch
script; now tracked properly under sweeps/)
Replace old run_datagen/run_training-centric docs with:
- Structure diagram showing 4 model types × 1 generator + 1 benchmark each
- Quick Start showing the 3-command flow per model
- Table mapping model → format → generator → benchmark script
- Updated Archived Tests section listing what's in old-archive/

Removed: detailed parameter tables for run_datagen.sh and run_training.sh
(both scripts moved to old-archive in previous commit)
Deleted (superseded by May 12 sweep results in docs/):
- tests/object-store/NPZ-OPTIMIZATION-ANALYSIS.md  (bug now fixed, stale)
- tests/object-store/scaling-analysis-2026-04-25.md (s3dlio v0.9.86 era)
- tests/object-store/s3ultra-test-results-20260425.md (s3dlio v0.9.86 era)

README.md: added Performance Results section linking to current docs/:
- docs/DLRM_NP_Scaling_Results.md
- docs/Flux_NP_ReadThreads_Scaling_Results.md
- docs/RetinaNet_NP_Scaling_Results.md
- docs/UNet3D_NP_Scaling_Results.md
…commands

   reports/history/lockfile subparsers do not call add_storage_type_arguments(),
   so their Namespace has no .file or .object attribute. The unconditional
   read and delete in parse_arguments() crashed with AttributeError. Gate the
   consolidation on attribute presence; downstream code already uses
   getattr(args, 'data_access_protocol', None).

   Fixes mlcommons#367

Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>
… suite fixes, env lock

Fix mlcommons#365: apply CLI override_parameters into metadata.json parameters
  Add _apply_dotted_overrides() static method to Benchmark base class.
  At metadata serialization time, dotted-key CLI overrides are merged into
  the nested parameters dict so the submission checker sees the effective
  config (e.g. split-phase num_checkpoints_write/read). override_parameters
  is still emitted unchanged for full audit trail.
  This addresses the same root cause as PR mlcommons#370 (crossmeta/zettalane);
  that PR is pending CLA so this implementation is carried here independently.

Fix rules/models.py: system info fallback in DLIOResultParser
  When a DLIO summary.json lacks system_info, fall back to
  cluster_information from the run metadata dict. Fixes the
  TestBenchmarkRunSystemInfoFallback test class (3 tests).

Fix test suite: resolve 13 pre-existing test failures
  test_cluster_collector.py: add missing results_dir argument to all
    MPIClusterCollector constructor and collect_cluster_info() call sites
    (10 tests). Update test_collector_returns_valid_data_without_error_marker
    to use current shared_staging_dir=tmpdir pattern.
  test_rules.py: patch DLIOResultParser._load_summary and
    _load_hydra_configs in TestBenchmarkRunSystemInfoFallback tests so
    they use in-memory mock data instead of hitting /tmp/test_run (3 tests).
  All 127 tests now pass (125 pre-existing + 2 added by PR mlcommons#366).

pyproject.toml/uv.lock: pin uv environments to Linux
  s3dlio only publishes Linux wheels; lock the uv environment selector to
  sys_platform == 'linux' so cross-platform lock generation does not fail.

Co-authored-by: Devasena Inupakutika <devasena.i@samsung.com>
russfellows merged commit 4534ae4 into main on May 13, 2026