
LST: add LSTGeometry package and associated ESProducer #50679

Open
ariostas wants to merge 4 commits into cms-sw:master from SegmentLinking:ariostas/lst_geometry

Conversation

@ariostas

@ariostas ariostas commented Apr 7, 2026

Contributor

This PR adds a new RecoTracker/LSTGeometry package containing the module map computation used by the LST algorithm. Currently, the maps are pre-computed by the code in https://github.com/SegmentLinking/LSTGeometry and they are stored in https://github.com/cms-data/RecoTracker-LSTCore. This PR allows for the on-the-fly computation of these maps via an ESProducer, ensuring that they stay consistent with the tracker geometry being used.

This is the last major task in #46746.

c.c. @slava77

@cmsbuild

cmsbuild commented Apr 7, 2026

Contributor

cms-bot internal usage

@cmsbuild

cmsbuild commented Apr 7, 2026

Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50679/48907

@cmsbuild

cmsbuild commented Apr 7, 2026

Contributor

A new Pull Request was created by @ariostas for master.

It involves the following packages:

  • HLTrigger/Configuration (hlt)
  • RecoTracker/IterativeTracking (reconstruction)
  • RecoTracker/LST (reconstruction)
  • RecoTracker/LSTCore (reconstruction)
  • RecoTracker/LSTGeometry (****)

The following packages do not have a category, yet:

RecoTracker/LSTGeometry
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@Martin-Grunewald, @Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @dgulhan, @elusian, @felicepantaleo, @gpetruc, @missirol, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich

mmusich commented Apr 7, 2026

Contributor

test parameters:

  • enable = hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt

@mmusich

mmusich commented Apr 7, 2026

Contributor

@cmsbuild, please test

@cmsbuild

cmsbuild commented Apr 7, 2026

Contributor

-1

Failed Tests: UnitTests HLTP2Timing
Size: This PR adds an extra 104KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/52513/summary.html
COMMIT: e612f24
CMSSW: CMSSW_17_0_X_2026-04-07-1100/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50679/52513/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

Comparison Summary

Summary:

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 17 workflow step(s) with memory usage exceeding the error threshold:

  • Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.75_TTbar_14TeV+Run4D121_HLT75e33Timing step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7501_TTbar_14TeV+Run4D121_HLT75e33TrackingOnly step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7502_TTbar_14TeV+Run4D121_HLT75e33TrackingNtuple step2 max memory diff 191.9 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.751_TTbar_14TeV+Run4D121_HLT75e33TimingAlpaka step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.752_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5 step2 max memory diff 189.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 166.0 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.755_TTbar_14TeV+Run4D121_HLT75e33TimingLST step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.756_TTbar_14TeV+Run4D121_HLT75e33TimingTrimmedTracking step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.757_TTbar_14TeV+Run4D121_HLT75e33TimingMkFitFit step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.758_TTbar_14TeV+Run4D121_HLT75e33TimingTiclBarrel step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.759_TTbar_14TeV+Run4D121_HLTPhase2WithNano step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.775_TTbar_14TeV+Run4D121_NGTScoutingCAExtensionMergeT5 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.911_TTbar_14TeV+Run4D121_DD4hep step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34496.0_CloseByPGun_CE_E_Front_120um+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34500.0_CloseByPGun_CE_H_Coarse_Scint+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34634.999_TTbar_14TeV+Run4D121PU_PMXS1S2PR step3 max memory diff 191.8 exceeds +/- 90.0 MiB

@makortel

makortel commented Apr 7, 2026

Contributor

Is ~190 MB increase in memory usage expected?

Comment thread RecoTracker/LSTGeometry/test/dumpLSTGeometry.py Outdated
@ariostas

ariostas commented Apr 7, 2026

Contributor Author

Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's likely. I'll double-check. Either way, it is only temporary. Most of it is freed once the maps are constructed.

@makortel

makortel commented Apr 7, 2026

Contributor

Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's likely. I'll double-check. Either way, it is only temporary. Most of it is freed once the maps are constructed.

According to the monitoring, the peak memory usage would increase by ~190 MB, so freeing it afterwards doesn't help much if the job is killed for going over the limit.
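To illustrate the point about peak versus steady-state usage, here is a minimal Python sketch. It is not the LSTGeometry code: `build_maps` and the buffer sizes are hypothetical stand-ins for a producer that needs a large scratch buffer while building a small final product. Note that `tracemalloc` tracks Python-level allocations rather than process RSS, so this only mirrors the idea behind the CI memory comparison.

```python
import tracemalloc

MIB = 1024 * 1024


def build_maps(scratch_bytes, product_bytes):
    """Hypothetical producer: large temporary buffer, small final product."""
    scratch = bytearray(scratch_bytes)  # temporary working memory
    product = bytes(product_bytes)      # what is kept after construction
    del scratch                         # scratch is released before returning
    return product


tracemalloc.start()
product = build_maps(32 * MIB, 1 * MIB)
# get_traced_memory() returns (current, peak) in bytes since start()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / MIB:.1f} MiB, peak: {peak / MIB:.1f} MiB")
```

Even though the scratch buffer is gone by the time the function returns, the recorded peak still includes it, which is the quantity a memory limit acts on.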

@makortel

makortel commented Apr 7, 2026

Contributor

test parameters:

  • workflows_profiling = 34434.0
  • enable_tests = profiling

@makortel

makortel commented Apr 7, 2026

Contributor

@cmsbuild, please test

Maybe one round of profiling tests would be worth it.

@ariostas

Contributor Author

In addition to the problems already discussed, now this branch has conflicts that must be resolved.

Rebased to fix conflicts. I'll get back to looking into this.

Since profiling is still ongoing, this problem was surfaced in #50870 and appears to be a weird clash between CMSSW's jemalloc and nsys. You should be able to run any profile by changing the launch command from cmsRun to cmsRunGlibC.

Thank you! I'll see what I can learn from the nsys profile

@cmsbuild

Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50679/49455

@cmsbuild

Contributor

Pull request #50679 was updated. @Martin-Grunewald, @Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please check and sign again.

@slava77

slava77 commented May 29, 2026

Contributor

@cmsbuild please test

IIUC, the HLT timing tests moved to L4 GPUs with more memory. Let's see if the tests complete this time.

@cmsbuild

Contributor

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/53582/summary.html
COMMIT: 46de12c
CMSSW: CMSSW_17_0_X_2026-05-29-1100/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50679/53582/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 18 workflow step(s) with memory usage exceeding the error threshold:

  • Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.75_TTbar_14TeV+Run4D121_HLT75e33Timing step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7501_TTbar_14TeV+Run4D121_HLT75e33TrackingOnly step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7502_TTbar_14TeV+Run4D121_HLT75e33TrackingNtuple step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7503_TTbar_14TeV+Run4D121_HLTHeterogeneousValid step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.751_TTbar_14TeV+Run4D121_HLT75e33TimingAlpaka step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.755_TTbar_14TeV+Run4D121_HLT75e33TimingLST step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.756_TTbar_14TeV+Run4D121_HLT75e33TimingTrimmedTracking step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.757_TTbar_14TeV+Run4D121_HLT75e33TimingMkFitFit step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.758_TTbar_14TeV+Run4D121_HLT75e33TimingTiclBarrel step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.759_TTbar_14TeV+Run4D121_HLTPhase2WithNano step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7591_TTbar_14TeV+Run4D121_HLTPhase2WithNanoValid step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.775_TTbar_14TeV+Run4D121_NGTScoutingCAExtensionMergeT5 step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.911_TTbar_14TeV+Run4D121_DD4hep step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34496.0_CloseByPGun_CE_E_Front_120um+Run4D121 step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34500.0_CloseByPGun_CE_H_Coarse_Scint+Run4D121 step2 max memory diff 48.9 exceeds +/- 30.0 MiB
  • Error: Workflow 34634.999_TTbar_14TeV+Run4D121PU_PMXS1S2PR step3 max memory diff 48.9 exceeds +/- 30.0 MiB

@cmsbuild cmsbuild mentioned this pull request May 29, 2026
@mmusich

mmusich commented May 30, 2026

Contributor

IIUC, the HLT timing tests moved to L4 GPUs with more memory. Let's see if the tests complete this time.

The test did pass.
A full set of comparisons for different menus and settings is available at https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/53582/hlt-p2-timing/ (although, because the tests were launched prematurely, the baseline still uses a host equipped with T4 GPUs, which has less memory and less compute than the host used to measure the curves labeled "this PR").
The increase in GPU memory usage is large (on the order of +33%).

image

Interestingly, the CPU memory usage profile is also altered (for the worse):

image

@ariostas

Contributor Author

The increase in GPU memory usage is large (on the order of +33%).

Yeah, so a quick update on this. I did some more experiments with the setup where the producer runs but the product is not used. If I trivialize the producer by immediately returning an empty product, the issue goes away. So then I tried adding a sleep timer to simulate the producer doing some work. When the producer takes about a second, you start seeing some VRAM increase (although not as consistently). With about a 5-second delay you see exactly the plots above. So the issue is indeed due to kernel scheduling. The producer takes around 6-9 seconds (depending on the machine), so it seems like that's too long and all streams end up waiting around for it to finish. Then all streams sync and launch kernels at the same time, causing a spike in VRAM usage.

Interestingly, the CPU memory usage profile is also altered (for the worse):

I think it's the same cause, since it looks similar to the VRAM usage with the caching allocator disabled. #50679 (comment)

I'm currently working on speeding up the producer to get the time under a second, which should solve this issue.
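The scheduling effect described in this comment can be illustrated with a toy timeline model. This is only a sketch under stated assumptions, not CMSSW code: `peak_concurrent_memory`, the stream offsets, the kernel duration, and the per-kernel memory figure are all hypothetical, and the numbers are illustrative rather than measured. Each stream would normally launch its kernel at a staggered time, but no stream can launch before the one-time setup (standing in for the ESProducer) has finished.

```python
def peak_concurrent_memory(offsets, setup_delay, kernel_duration, mem_per_kernel):
    """Peak memory held by overlapping kernels in a toy timeline.

    offsets: times at which each stream would launch with no setup delay.
    setup_delay: duration of the one-time setup every stream waits for.
    """
    # A stream cannot launch before the shared setup has finished.
    starts = [max(offset, setup_delay) for offset in offsets]

    events = []
    for start in starts:
        events.append((start, mem_per_kernel))                     # allocate
        events.append((start + kernel_duration, -mem_per_kernel))  # release
    # Tuples sort by time, then by delta, so a release at time t is
    # processed before an allocation at the same t.
    events.sort()

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


offsets = [0, 1, 2, 3]  # four streams, naturally staggered by one time unit
for delay in (0, 1, 5):
    peak = peak_concurrent_memory(offsets, delay, kernel_duration=1, mem_per_kernel=100)
    print(f"setup delay {delay}: peak {peak} MiB")
```

In this model a setup that finishes immediately leaves the kernels staggered (peak of one kernel's memory), a short delay makes two streams coincide, and a delay longer than the whole stagger makes every stream launch at once.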

@fwyzard

fwyzard commented May 30, 2026

Contributor

Hi @ariostas I haven't followed the whole thread, but I found your latest comments interesting, and I may need some clarifications to understand them better.

The producer takes around 6-9 seconds (depending on the machine), so it seems like that's too long and all streams end up waiting around for it to finish.

Is that an EDProducer or an ESProducer?

For an EDProducer I don't understand why "all streams" would end up waiting for it.

For an ESProducer I see how that could happen - but it should only happen once per IOV, and likely once per job. After the first time the products should already be available, and it shouldn't cause anything to wait?

@Dr15Jones

Contributor

@fwyzard said

For an EDProducer I don't understand why "all streams" would end up waiting for it.

It could happen if there is no other work that can be done by all the Events (i.e. insufficient concurrency within an Event at that point in the schedule).

@ariostas

Contributor Author

@fwyzard yeah, it's an ESProducer. And yeah, it only happens once per job, but it causes a big VRAM spike at the beginning of the job. In #50679 (comment) you can see the spikes at the beginning of the jobs. That is with the caching allocator disabled. Keeping the caching allocator enabled causes the excessive allocation to persist (as expected).
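Why a caching allocator would make a one-time spike persist can be sketched with a toy model. This is an assumption-laden illustration, not the allocator used in CMSSW: `ToyCachingAllocator`, `reserved_after_job`, and the 100 MiB block size are hypothetical. The only behaviour modelled is that freed blocks are kept in a cache for reuse instead of being returned to the device.

```python
class ToyCachingAllocator:
    """Toy allocator that keeps freed blocks cached for reuse."""

    def __init__(self, block_mib):
        self.block_mib = block_mib
        self.in_use = 0   # blocks currently handed out
        self.cached = 0   # freed blocks retained by the allocator

    def allocate(self):
        if self.cached > 0:
            self.cached -= 1  # reuse a cached block, no new reservation
        self.in_use += 1

    def free(self):
        self.in_use -= 1
        self.cached += 1      # keep the block instead of releasing it

    @property
    def reserved_mib(self):
        return (self.in_use + self.cached) * self.block_mib


def reserved_after_job(burst, steady_iterations, block_mib=100):
    """Memory still reserved after an initial burst and a steady phase."""
    alloc = ToyCachingAllocator(block_mib)
    for _ in range(burst):               # all streams allocate at once
        alloc.allocate()
    for _ in range(burst):
        alloc.free()
    for _ in range(steady_iterations):   # later, one block at a time
        alloc.allocate()
        alloc.free()
    return alloc.reserved_mib


print(reserved_after_job(burst=4, steady_iterations=10))
print(reserved_after_job(burst=1, steady_iterations=10))
```

In this model the reservation after a burst of four simultaneous allocations stays at four blocks for the rest of the job, even though the steady phase never needs more than one block at a time.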

@fwyzard

fwyzard commented May 30, 2026

Contributor

@ariostas ah, thanks for the clarification.

Is the spike caused by the ESProducer itself, or by all EDProducers running at the same time as soon as the payload is available?

@ariostas

Contributor Author

@fwyzard the spike is caused by the EDProducers running at the same time. I tested it by changing the ESProducer to a dummy one that just waits for 5 seconds.

@fwyzard

fwyzard commented May 31, 2026

Contributor

OK, thanks for confirming it.
So I guess we need a better way to handle this high memory usage due to concurrency.

@cmsbuild

cmsbuild commented Jun 5, 2026

Contributor

Milestone for this pull request has been moved to CMSSW_20_0_X. Please open a backport if it should also go in to CMSSW_17_0_X.

@makortel

makortel commented Jun 5, 2026

Contributor

(I assume this PR does not need a backport to 17_0_X (Run 3 legacy))

@cmsbuild

cmsbuild commented Jun 9, 2026

Contributor

Milestone for this pull request has been moved to CMSSW_20_1_X. Please open a backport if it should also go in to CMSSW_20_0_X.

8 participants