Skip to content

Improve performance of h5trav interfaces for links to objects#6400

Open
jhendersonHDF wants to merge 4 commits into
HDFGroup:developfrom
jhendersonHDF:h5trav_perf_improvement
Open

Improve performance of h5trav interfaces for links to objects#6400
jhendersonHDF wants to merge 4 commits into
HDFGroup:developfrom
jhendersonHDF:h5trav_perf_improvement

Conversation

@jhendersonHDF

@jhendersonHDF jhendersonHDF commented May 6, 2026

Copy link
Copy Markdown
Collaborator

For objects with multiple hard links, use hash table to map between object tokens and pathnames during traversal to avoid linear scan over all previous objects for each hard link seen

Use separate hash table for h5trav "table" interface to map between object tokens and an index into the table of visited objects. This facilitates quick lookups of objects when adding hard link name aliases for h5repack processing.

See the linked issue for context

@jhendersonHDF jhendersonHDF added the Component - Tools Command-line tools like h5dump, includes high-level tools label May 6, 2026
@github-project-automation github-project-automation Bot moved this to To be triaged in HDF5 - TRIAGE & TRACK May 6, 2026
@jhendersonHDF jhendersonHDF linked an issue May 6, 2026 that may be closed by this pull request
Comment thread tools/lib/h5trav.h
#include "hdf5.h"

/* Typedefs for visiting objects */
typedef herr_t (*h5trav_obj_func_t)(const char *path_name, const H5O_info2_t *oinfo, const char *first_seen,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These typedefs had to be moved down for the new trav_seen_t parameter to be available

Comment thread tools/lib/h5trav.h
trav_obj_t *objs;

/* Private data for this trav_table_t */
void *priv_data;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a pointer to a structure with a UT_hash_handle, but exposing the uthash API at the h5trav.h level is problematic due to its header-only nature and already being used in H5private.h. This is just a quick hack to hide the implementation details.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes sense to just include H5private.h here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the header-only nature of uthash.h, it being included in H5private.h meant that having different settings for it in different parts of the library became somewhat difficult. In h5trav.c only, I needed to change change HASH_KEYCMP to use H5Otoken_cmp() to keep compatibility, but it became a first-to-include-uthash.h-wins scenario if I included H5private.h in h5trav.h since H5private.h is also including uthash.h with its own settings. The scope of including the uthash.h header should probably be reduced from H5private.h to where it's actually needed.

Comment thread tools/lib/h5trav.c
* where a visited object was placed to facilitate quicker
* lookups when adding path aliases
*/
typedef struct trav_table_hash_t {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This structure is mostly for h5repack, which now has its own hash table separate from the main one used by other h5trav interfaces. I considered modifying the h5trav interfaces to allow h5repack to share a single hash table, but it would have been fairly awkward to do so and would have imposed unnecessary memory overhead for other tools that wouldn't need the extra information that would have been stored, so I instead just used a separate hash table specifically for the h5trav "table" interface.

Comment thread tools/lib/h5trav.c
*-------------------------------------------------------------------------
*/
static int
trav_token_visited_cmp(hid_t loc_id, const H5O_token_t *token1, const H5O_token_t *token2)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called by HASH_FIND and is just used to wrap around H5Otoken_cmp() instead of a plain memcmp of object token bytes

Comment thread tools/lib/h5trav.c Outdated
udata.fields = fields;

/* Check for multiple links to top group */
if (oinfo.rc > 1)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this down below the udata initialization in case the hash table gets initialized by the call to trav_token_add().

Comment thread tools/lib/h5trav.c
Comment thread tools/lib/h5trav.c Outdated
mattjala
mattjala previously approved these changes May 8, 2026

@mattjala mattjala left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a couple minor issues

Comment thread tools/lib/h5trav.c Outdated
/* HASH_ADD modifies what's pointed to by objects_seen_ptr when it
* initializes the hash table after being called for the first time
*/
HASH_ADD(hh, (*objects_seen_ptr), obj.token, sizeof(H5O_token_t), entry);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we certain that tokens never contain garbage bytes? I suspect so just want to make sure, since they are stored as "MAX_TOKEN_SIZE", implying that some connectors don't use all of the bytes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that H5VL_native_addr_to_token zeroes the unused bytes, just curious how strong our documentation is that third party VOLs should do the same.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, I did make an assumption here and I haven't found any specific documentation mentioning that the bytes should be zeroed out or set to a specific value yet. We do provide H5O_TOKEN_UNDEF for undefined token values which in theory would be an initializer, but is not really documented as such. Without that assumption and without knowing how many valid bytes there are, I suppose you can't do much with an object token other than pass it to the token APIs. To be 100% sure here, we would need to either document that, or may need to have a callback for returning a hashed version of the token.

Comment thread tools/lib/h5trav.c
udata.fields = fields;

/* Check for multiple links to top group */
if (oinfo.rc > 1) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this fixing a bug?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same logic as above, just had to be moved down a bit.

Comment thread tools/lib/h5trav.c
return FAIL;
}
else {
for (i = 0; i < table->nobjs; i++) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth adding a hash table for the path names to accelerate this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably yes. I mostly tried to limit the scope of this to the performance problem I directly observed, but there's also likely a similar issue with soft links, as well as possibly with the path names here.

fortnern
fortnern previously approved these changes Jun 4, 2026
For objects with multiple hard links, use hash table to map between
object tokens and pathnames during traversal to avoid linear scan over
all previous objects for each hard link seen

Use separate hash table for h5trav "table" interface to map between
object tokens and an index into the table of visited objects. This
facilitates quick lookups of objects when adding hard link name aliases
for h5repack processing
@jhendersonHDF jhendersonHDF force-pushed the h5trav_perf_improvement branch from 534d36e to ba72bd7 Compare June 18, 2026 15:45
Comment thread tools/lib/h5trav.c Fixed
Comment thread tools/lib/h5trav.c Fixed
@jhendersonHDF jhendersonHDF force-pushed the h5trav_perf_improvement branch from ba72bd7 to 965e961 Compare June 18, 2026 16:23
@HDFGroup HDFGroup deleted a comment from github-actions Bot Jun 18, 2026
Comment thread tools/lib/h5trav.c
*-------------------------------------------------------------------------
*/
static herr_t
trav_token_get_size(hid_t obj_id, size_t *token_size)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of an ugly hack, but about the only way that the tools code can ask a VOL connector for the size of object tokens, similar to what the reference API does

@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review Checklist

This PR touches the following areas. Each needs a sign-off
from its listed owners before merging.

Comment thread tools/lib/h5trav.c
/* Determine how large an object token is so we can properly hash them */
H5E_BEGIN_TRY
{
status = trav_token_get_size(fid, &table->obj_token_size);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does result in a second call to trav_token_get_size() where traverse() would already have called it, but it keeps the usage of the token size to just the trav_table_ interfaces where it's used, as opposed to having to pass it through the h5trav_obj_func_t and h5trav_lnk_func_t callbacks where the other trav_ interfaces wouldn't be using the parameter.

@jhendersonHDF jhendersonHDF marked this pull request as ready for review June 18, 2026 18:32
@github-actions github-actions Bot requested review from lrknox and mattjala June 18, 2026 18:33
@jhendersonHDF jhendersonHDF requested a review from fortnern June 18, 2026 18:34
brtnfld added a commit to brtnfld/hdf5 that referenced this pull request Jun 18, 2026
… surface extra reviewers

Three related fixes uncovered by reviewing PR HDFGroup#6400's actual reviewer-request
timeline:

1. isOpeningRace gated only on "no checklist exists yet" misdiagnosed an old
   PR's first encounter with the bot (or one whose checklist comment was
   manually deleted) as a brand-new PR, forcibly reselecting a human-curated
   reviewer list. Now also requires the PR to have been created within the
   last 10 minutes, matching the race's actual real-world window.

2. removeUnselected/removeUnselectedAfterDelay pruned against the repo-wide
   allCodeOwners set, so a reviewer who owns unrelated areas (e.g. a project
   lead manually added for their judgment, not their path ownership) could be
   silently swept just for being a CODEOWNER somewhere else in the repo.
   GitHub's own CODEOWNERS engine never auto-assigns someone who doesn't own
   a touched path, so pruning now scopes to touchedAreaOwners instead.

3. buildBody had no way to show a reviewer who isn't an owner of any touched
   area and isn't a no-CODEOWNER fallback either — they were on the PR doing
   real review work but invisible in the checklist, with their approval
   silently ignored. Add a catch-all "Additional reviewers" line listing them,
   with a checkmark if they've approved (informational only — it doesn't gate
   any area's sign-off).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component - Tools Command-line tools like h5dump, includes high-level tools

Projects

Status: To be triaged

Development

Successfully merging this pull request may close these issues.

HDF5 tools performance issue for multiply-linked objects

5 participants