
Draft: Region per file storage strategy#287

Open
TheAssembler1 wants to merge 8 commits into hpc-io:develop from TheAssembler1:region_per_file


@TheAssembler1 TheAssembler1 commented Aug 6, 2025

Collaborator

Adds the following enum so the storage strategy can be selected:

typedef enum pdc_region_writeout_strategy {
    /**
     * Store data as multiple regions inside a single file.
     * Overlapping writes that are not fully contained append new regions
     * to the end of the file, with metadata tracking region locations.
     * Supports incremental updates without rewriting large parts of the file.
     */
    STORE_REGION_BY_REGION_SINGLE_FILE = 0,

    /**
     * Store the entire object as a single flat file.
     * Reads and writes operate by seeking directly within the file.
     * No region metadata bookkeeping; simpler but less flexible for partial updates.
     */
    STORE_FLATTENED_SINGLE_FILE,

    /**
     * Store each flattened region in its own separate file.
     * Enables independent file management per region.
     */
    STORE_FLATTENED_REGION_PER_FILE
} pdc_region_writeout_strategy;
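
When reading logs or benchmark output, it can help to turn the numeric strategy value back into a short label. The following is a minimal sketch, not code from this PR: it repeats the enum so that it compiles on its own, and the helper name sketch_strategy_label and its label strings (borrowed from the Strategy column of the benchmark table in this thread) are assumptions made for illustration.

```c
/* Enum repeated from the PR description so this sketch is self-contained. */
typedef enum pdc_region_writeout_strategy {
    STORE_REGION_BY_REGION_SINGLE_FILE = 0,
    STORE_FLATTENED_SINGLE_FILE,
    STORE_FLATTENED_REGION_PER_FILE
} pdc_region_writeout_strategy;

/* Hypothetical helper (not part of PDC): map a strategy to a short label. */
static const char *
sketch_strategy_label(pdc_region_writeout_strategy strategy)
{
    switch (strategy) {
        case STORE_REGION_BY_REGION_SINGLE_FILE:
            return "RGN/SINGLE";
        case STORE_FLATTENED_SINGLE_FILE:
            return "FLAT/SINGLE";
        case STORE_FLATTENED_REGION_PER_FILE:
            return "FLAT/PER_FILE";
        default:
            /* Values outside the enum are reported rather than guessed. */
            return "UNKNOWN";
    }
}
```

Because the first enumerator is pinned to 0 and the others follow implicitly, the values 0, 1, and 2 correspond to the three strategies in declaration order.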

STORE_REGION_BY_REGION_SINGLE_FILE is the default strategy. STORE_FLATTENED_REGION_PER_FILE is the new strategy, which stores each region of an object in a separate file. The size of the regions the object is sliced into is decided in:

/**
 * Used to decide how to split an object into chunks, each of which will be a file on disk
 */
static perr_t
PDC_shrink_file_dims(uint64_t *temp_file_dims, const uint64_t *obj_dims, uint8_t obj_ndim, size_t unit)

By default, it tries to slice the object into regions of at most 4 MB each by iteratively halving the largest dimension of the object until the region size is <= 4 MB.

The limit is set inside the PDC_shrink_file_dims function: uint64_t max_bytes_per_file = 4ULL * 1024 * 1024;
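
The halving rule described above can be sketched as follows. This is not the PR's implementation: the function name sketch_shrink_file_dims, the explicit max_bytes parameter, the int return code, and the choice to round up when halving an odd extent are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if unit * prod(dims) > max_bytes, without overflowing uint64_t. */
static int
sketch_exceeds_limit(const uint64_t *dims, uint8_t ndim, size_t unit, uint64_t max_bytes)
{
    uint64_t bytes = unit;
    if (bytes > max_bytes)
        return 1;
    for (uint8_t i = 0; i < ndim; i++) {
        /* bytes * dims[i] > max_bytes, rewritten as a division to avoid overflow */
        if (bytes > max_bytes / dims[i])
            return 1;
        bytes *= dims[i];
    }
    return 0;
}

/*
 * Hypothetical stand-in for PDC_shrink_file_dims.
 * Starts from the full object extents and repeatedly halves the largest
 * extent until one chunk fits in max_bytes. Returns 0 on success, -1 on
 * invalid input. If a single element is larger than max_bytes, every
 * extent ends up as 1 and the function still returns 0.
 */
static int
sketch_shrink_file_dims(uint64_t *file_dims, const uint64_t *obj_dims, uint8_t ndim, size_t unit,
                        uint64_t max_bytes)
{
    if (file_dims == NULL || obj_dims == NULL || ndim == 0 || unit == 0)
        return -1;
    for (uint8_t i = 0; i < ndim; i++) {
        if (obj_dims[i] == 0)
            return -1;
        file_dims[i] = obj_dims[i];
    }

    while (sketch_exceeds_limit(file_dims, ndim, unit, max_bytes)) {
        /* Pick the largest extent; ties go to the lowest index. */
        uint8_t largest = 0;
        for (uint8_t i = 1; i < ndim; i++) {
            if (file_dims[i] > file_dims[largest])
                largest = i;
        }
        if (file_dims[largest] == 1)
            break; /* nothing left to halve */
        /* Halve, rounding up (an assumption; the PR may round differently). */
        file_dims[largest] = (file_dims[largest] + 1) / 2;
    }
    return 0;
}
```

For example, with a 4 MB limit and 4-byte elements, a one-dimensional object of 8,388,608 elements (32 MiB) shrinks to chunks of 1,048,576 elements, and a 4096 x 2048 object shrinks to 1024 x 1024 chunks.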

@TheAssembler1

Collaborator Author

We might want to compare the performance between the storage strategies before merging.

@TheAssembler1 TheAssembler1 marked this pull request as ready for review September 18, 2025 19:29
@TheAssembler1 TheAssembler1 requested a review from a team as a code owner September 18, 2025 19:29
@TheAssembler1 TheAssembler1 changed the title Draft: Region Per File Region Per File Sep 18, 2025
@TheAssembler1 TheAssembler1 changed the title Region Per File Region per file storage strategy Sep 18, 2025
@jeanbez jeanbez changed the title Region per file storage strategy Draft: Region per file storage strategy Oct 21, 2025
@jeanbez jeanbez assigned jeanbez and unassigned jeanbez Oct 21, 2025
@jeanbez jeanbez added the type: enhancement New feature or request label Oct 21, 2025
@jeanbez

jeanbez commented Jan 27, 2026

Member

@TheAssembler1 do you have any updates on this PR?

@jeanbez jeanbez requested review from houjun and jeanbez May 19, 2026 18:19
@TheAssembler1

Collaborator Author

Remaining work: add an object-level property for setting the storage strategy, add documentation for setting the storage strategy, and add performance numbers.

@TheAssembler1 TheAssembler1 self-assigned this May 21, 2026
@TheAssembler1

TheAssembler1 commented Jun 16, 2026

Collaborator Author

PDC Writeout Strategy Benchmark (cache=ON, Perlmutter)

Note: The job exceeded the wall-clock time limit and was cancelled by Slurm (STEP CANCELLED DUE TO TIME LIMIT). The benchmark completed 17 of 54 planned configurations (all cache=ON, 16-client server-scaling tests and the cache=ON, 1-server client-scaling tests up through 4 clients). The cache=OFF block and client counts >= 8 were not reached. Results below reflect completed tests only.

Table 1: Client Operation Times (avg across 5 steps)

All times in seconds.

| Servers | Clients | Strategy | Obj Create | Xfer Create | Xfer Start | Xfer Wait | Xfer Close | Obj Close | Server Close |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16 | 0 RGN/SINGLE | 7.50e-04 | 3.97e-05 | 0.3303 | 0.2390 | 0.0970 | 6.74e-05 | 5.678 |
| 1 | 16 | 1 FLAT/SINGLE | 7.22e-04 | 3.72e-05 | 0.2960 | 0.2653 | 0.0956 | 5.67e-05 | 4.699 |
| 1 | 16 | 2 FLAT/PER_FILE | 7.06e-04 | 3.18e-05 | 0.7388 | 0.2115 | 0.0975 | 6.72e-05 | 102.376 |
| 2 | 16 | 0 RGN/SINGLE | 7.26e-04 | 3.81e-05 | 0.5107 | 0.1005 | 0.0970 | 6.68e-05 | 3.326 |
| 2 | 16 | 1 FLAT/SINGLE | 6.71e-04 | 3.82e-05 | 0.4808 | 0.1065 | 0.0966 | 5.76e-05 | 3.220 |
| 2 | 16 | 2 FLAT/PER_FILE | 7.70e-04 | 4.54e-05 | 0.3535 | 0.4466 | 0.0966 | 6.12e-05 | 34.205 |
| 4 | 16 | 0 RGN/SINGLE | 6.79e-03 | 4.04e-05 | 0.3905 | 0.0625 | 0.0983 | 6.61e-05 | 2.081 |
| 4 | 16 | 1 FLAT/SINGLE | 5.88e-03 | 3.52e-05 | 0.6741 | 0.0281 | 0.0962 | 5.95e-05 | 1.858 |
| 4 | 16 | 2 FLAT/PER_FILE | 8.33e-03 | 3.54e-05 | 0.4432 | 0.2086 | 0.0967 | 6.68e-05 | 9.973 |
| 1 | 1 | 0 RGN/SINGLE | 5.09e-04 | 1.69e-05 | 0.1950 | 0.0096 | 0.0982 | 3.87e-05 | 1.149 |
| 1 | 1 | 1 FLAT/SINGLE | 5.22e-04 | 1.49e-05 | 0.2113 | 0.0097 | 0.0790 | 3.66e-05 | 0.949 |
| 1 | 1 | 2 FLAT/PER_FILE | 5.57e-04 | 1.72e-05 | 0.1951 | 0.0097 | 0.0977 | 4.36e-05 | 2.161 |
| 1 | 2 | 0 RGN/SINGLE | 4.75e-04 | 1.51e-05 | 0.2039 | 0.0181 | 0.0975 | 3.34e-05 | 1.077 |
| 1 | 2 | 1 FLAT/SINGLE | 5.22e-04 | 1.55e-05 | 0.1726 | 0.0177 | 0.0983 | 3.61e-05 | 0.888 |
| 1 | 2 | 2 FLAT/PER_FILE | 4.67e-04 | 1.55e-05 | 0.1689 | 0.0227 | 0.0960 | 3.20e-05 | 4.445 |
| 1 | 4 | 0 RGN/SINGLE | 6.59e-04 | 2.46e-05 | 0.2009 | 0.0612 | 0.0943 | 3.74e-05 | 1.883 |
| 1 | 4 | 1 FLAT/SINGLE | 5.54e-04 | 2.72e-05 | 0.1785 | 0.0623 | 0.0973 | 5.07e-05 | 1.325 |
| 1 | 4 | 2 FLAT/PER_FILE | 5.60e-04 | 3.72e-05 | 0.1597 | 0.0034 | 0.0924 | 4.57e-05 | N/A |

Config: 8,388,608 particles/rank, 5 steps, 20s sleep between steps, cache=ON, Lustre 256 OSTs. Incomplete configs: cache=OFF (all), cache=ON clients in {8, 16, 32} not reached before timeout.

@TheAssembler1

Collaborator Author

Writeout optimization: increased region slice size from 4 MB to 128 MB for STORE_FLATTENED_REGION_PER_FILE.

Tested on Perlmutter with 1 server, 16 clients, 8388608 particles/rank, 5 steps, 20s sleep, cache=ON.

The larger slice size reduces the number of individual file flush operations the server has to perform per object, which significantly cuts per-region flush time and total server drain time at shutdown.

| Metric | Before (4 MB) | After (128 MB) | Improvement |
|---|---|---|---|
| Avg flush time per region | 4.02s | 2.01s | 2x faster |
| Total close time | 102.4s | 35.3s | 2.9x faster |
| Xfer wait (steps 0-3) | ~4ms | ~4ms | no regression |
| Xfer wait (step 4) | 1.04s | 1.23s | no regression |
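
To make the effect of the slice size on file count concrete, the sketch below counts how many region files a given slicing would produce for one object. It is an illustration only: the function name sketch_count_region_files is hypothetical, and it assumes the object is tiled on a regular grid with one file per tile, which the PR description implies but does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of PDC): number of per-region files needed
 * to cover an object of extents obj_dims with tiles of extents file_dims.
 * Each dimension contributes ceil(obj_dims[i] / file_dims[i]) tiles.
 */
static uint64_t
sketch_count_region_files(const uint64_t *obj_dims, const uint64_t *file_dims, uint8_t ndim)
{
    assert(obj_dims != NULL && file_dims != NULL && ndim > 0);

    uint64_t count = 1;
    for (uint8_t i = 0; i < ndim; i++) {
        assert(file_dims[i] > 0);
        /* Ceiling division written so that it cannot overflow. */
        uint64_t tiles = obj_dims[i] / file_dims[i] + (obj_dims[i] % file_dims[i] != 0 ? 1 : 0);
        count *= tiles;
    }
    return count;
}
```

As a worked case under assumed numbers, a one-dimensional object of 8,388,608 four-byte elements (32 MiB) needs 8 files with 1,048,576-element (4 MiB) slices, but only 1 file when the whole object fits under a 128 MB limit. The four-byte element size and the one-object-per-array layout are assumptions, not details taken from the benchmark.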
