Track Purity DNN for Phase-2 HLT #51084
Co-authored-by: Jade Chismar <jchismar@ucsd.edu>
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-51084/49552
|
|
A new Pull Request was created by @jchismar for master. It involves the following packages:
@Martin-Grunewald, @Moanwar, @cmsbuild, @davidlange6, @fabiocos, @ftenchini, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
Hi @jchismar, thanks. Which workflows are needed to test this PR? |
|
for the record, the needed model is at cms-data/RecoTracker-FinalTrackSelectors#15 (it would be nice to link the two) |
| import FWCore.ParameterSet.Config as cms | ||
|
|
||
| # This modifier sets the use of a deep neural network for high purity track selection | ||
| trackTorchClassifier = cms.Modifier() |
for my understanding why is this proposed via a modifier and not directly in the "production" workflow?
| HLTInitialStepHPSelectionSequence = cms.Sequence( | ||
| hltInitialStepTrackCutClassifier | ||
| +hltInitialStepTrackSelectionHighPurity | ||
| ) |
| +hltInitialStepTrackTorchClassifierOutput | ||
| +hltInitialStepTrackCutClassifier | ||
| +hltInitialStepTrackSelectionHighPurity | ||
| ) |
|
@jchismar what is the cost in terms of timing and GPU memory consumption of these developments? |
|
the modifier solution was mainly motivated by a much earlier interpretation that PyTorchAlpaka carries a significant memory cost (a significant fraction of 1 GB). It sounds from the pixel track DNN that the cost is much smaller. The timing costs were rather small ( @mmusich cmssw/Configuration/Eras/python/Era_Phase2_cff.py Lines 24 to 25 in 2cc100f Once the tests run we can decide if it's OK to move on for production or keep as a modifier |
Thanks @slava77
FWIW, that is fine with me. |
|
Pull request #51084 was updated. @Martin-Grunewald, @Moanwar, @cmsbuild, @davidlange6, @fabiocos, @ftenchini, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please check and sign again. |
|
|
||
| namespace ALPAKA_ACCELERATOR_NAMESPACE { | ||
|
|
||
| class TrackFeatureExtractor : public stream::FixedQueueEDProducer<> { |
| // Create device collection and copy from host | ||
| TrackFeaturesDeviceCollection features_device(iEvent.queue(), nTracks); | ||
| alpaka::memcpy(iEvent.queue(), features_device.buffer(), features_host.const_buffer()); |
Rather than making the device copy explicitly, it's usually better to let the framework take care of that.
Simply producing the host collection (and making sure the definition for the device collection is available) should be enough.
This avoids making an extra copy when running on the CPU.
> Rather than making the device copy explicitly, it's usually better to let the framework take care of that.
doesn't this proposal increase the CPU memory use when running on a GPU backend?
Here the CPU/host side disappears after the module is done with ::produce; in the other case it stays until the event is reset.
Is it practical to ifdef here and bypass the copy on the CPU backend?
| @@ -170,6 +171,7 @@ | |||
| fragment.load("HLTrigger/Configuration/HLT_75e33/psets/seedFromProtoTracks_cfi") | |||
| fragment.load("HLTrigger/Configuration/HLT_75e33/psets/SiStripClusterChargeCutLoose_cfi") | |||
| fragment.load("HLTrigger/Configuration/HLT_75e33/psets/SiStripClusterChargeCutNone_cfi") | |||
| fragment.load("HLTrigger/Configuration/HLT_75e33/services/PyTorchService_cfi") | |||
Wouldn't it be better to move this together with the other services ?
What does "recall" mean in this context? |
|
Are there plans to make the input features available directly on GPU? Otherwise, if it is expected that the input features will always be available only on CPU, would it make more sense to merge the three modules into one? |
recall (TP/(TP+FN)) is what we/HEP call "efficiency" |
unlikely in the context of the full/final tracks (or not for a long while); this implies having the full track fit run on GPU.
"modularity" seems to be an answer to motivate staying with 3 modules here. The |
|
@cmsbuild please test |
|
Tests have not fully completed yet, but I see:
The memory profile is peculiar, both on CPU and GPU summary
I wonder if:
? |
|
+1
Size: This PR adds an extra 16KB to repository
HLT P2 Timing: chart
Comparison Summary
Summary:
Max Memory Comparisons exceeding threshold
@cms-sw/core-l2 , I found 20 workflow step(s) with memory usage exceeding the error threshold:
|
|
Milestone for this pull request has been moved to CMSSW_20_0_X. Please open a backport if it should also go in to CMSSW_17_0_X. |
not expected. This same setup running in a single job up to 32 threads/streams on L4 GPU runs much faster, 10-20 ms (@jchismar has numbers) |
do we have a timing server based measurement that can be verified by experts? |
does the timing server accept cms-sw and cms-data modifications for a test job submission? |
yes, feel free to follow-up in the timing @ HLT mattermost channel for details. |
|
I assume this PR does not need a backport to 17_0_X (Run 3 legacy). Although if you want to continue running the benchmarks as part of the PR tests, one option (until 20_0_0_pre1 RelVal samples arrive) would be to continue testing in 17_0_X. |
for the record, I manually repeated the benchmark on one node in the NGT farm equipped with 4 L40s cards, using 16 jobs, 16 streams and 16 threads, that pretty much confirms the findings from the bot:
I wonder how the benchmark mentioned above was carried out. |




Implementation of a track purity DNN used for high purity selection for HLT tracks. Initial results were presented at the tracking POG meeting on 15 Dec 2025. Since then, the model has been retrained with the latest version of LST, and a separate threshold has been implemented for displaced tracks (|dxy| > 0.5) to improve displaced track efficiency. This threshold is set at a target recall of 99.5% calculated on tracks with |dxy| > 0.5. For tracks with |dxy| $\le$ 0.5, the threshold is set at a target recall of 99.5% calculated on all tracks. Additionally, the number of input features has been reduced from 29 to 15 with no loss of performance. The DNN is configured to run in the HLTInitialStepSequence after the hltInitialStepTracks step when the trackTorchClassifier procModifier is used.
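For readers unfamiliar with the recall-based working points, here is a minimal sketch of how two score thresholds could be derived from labelled tracks, with recall defined as TP/(TP+FN). It is not the code used in this PR: the function names, the `(score, dxy, is_true)` tuple layout, and the rounding convention are all assumptions made for illustration. Only the 99.5% target and the |dxy| boundary of 0.5 come from the description above.

```python
import math


def threshold_for_recall(true_scores, target_recall):
    """Return the largest cut t such that at least `target_recall` of the
    true tracks satisfy score >= t (hypothetical helper, not from the PR)."""
    if not true_scores:
        raise ValueError("need at least one true track")
    ordered = sorted(true_scores, reverse=True)
    # Number of true tracks that must survive the cut to reach the target.
    n_keep = max(math.ceil(target_recall * len(ordered)), 1)
    return ordered[n_keep - 1]


def select_thresholds(tracks, target_recall=0.995, dxy_cut=0.5):
    """tracks: list of (score, dxy, is_true) tuples (assumed layout).

    The cut for |dxy| <= dxy_cut is derived from all true tracks; the
    displaced cut is derived only from true tracks with |dxy| > dxy_cut.
    """
    all_true = [s for s, _, is_true in tracks if is_true]
    displaced_true = [
        s for s, dxy, is_true in tracks if is_true and abs(dxy) > dxy_cut
    ]
    return {
        "prompt": threshold_for_recall(all_true, target_recall),
        "displaced": threshold_for_recall(displaced_true, target_recall),
    }


def is_high_purity(score, dxy, thresholds, dxy_cut=0.5):
    """Apply the threshold that matches the track's |dxy| region."""
    key = "displaced" if abs(dxy) > dxy_cut else "prompt"
    return score >= thresholds[key]


# Toy input with a 50% target so the arithmetic is easy to follow by hand.
toy_tracks = [
    (0.9, 0.0, True), (0.8, 0.1, True), (0.7, 0.2, True),
    (0.2, 1.0, True), (0.3, 2.0, True),
    (0.95, 0.0, False), (0.1, 3.0, False),
]
toy_thresholds = select_thresholds(toy_tracks, target_recall=0.5)
```

In this toy input the displaced true tracks score lower than the prompt ones, so deriving their cut separately yields a looser threshold and keeps displaced tracks that a single global cut would reject.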
MTV performance on TT+PU=200 is shown below.

