Track Purity DNN for Phase-2 HLT #51084

Open
jchismar wants to merge 7 commits into cms-sw:master from jchismar:track-purity-dnn

@jchismar

Implementation of a track purity DNN used for high-purity selection of HLT tracks. Initial results were presented at the tracking POG meeting on 15 Dec 2025. Since then, the model has been retrained with the latest version of LST, and a separate threshold has been implemented for displaced tracks (|dxy| > 0.5) to improve displaced track efficiency. This threshold is set at a target recall of 99.5% calculated on tracks with |dxy| > 0.5. For tracks with |dxy| ≤ 0.5, the threshold is set at a target recall of 99.5% calculated on all tracks. Additionally, the number of input features has been reduced from 29 to 15 with no loss of performance. The DNN is configured to run in the HLTInitialStepSequence after the hltInitialStepTracks step when the trackTorchClassifier procModifier is used.
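
The two-threshold scheme described above can be illustrated with a minimal sketch. This is not the code in this PR: the function names (`threshold_for_recall`, `select_thresholds`, `passes`), the toy data, and the convention that a higher score means a more likely true track are all assumptions made for illustration. The sketch picks, for each category, the largest score cut that still keeps the target fraction of true tracks.

```python
import math


def threshold_for_recall(true_scores, target_recall):
    """Largest cut t such that at least target_recall of the true tracks have score >= t."""
    if not true_scores:
        raise ValueError("need at least one true-track score")
    ranked = sorted(true_scores, reverse=True)
    # Number of true tracks that must survive the cut; the small epsilon
    # guards against floating-point noise in target_recall * len(ranked).
    n_keep = max(1, math.ceil(target_recall * len(ranked) - 1e-9))
    return ranked[n_keep - 1]


def select_thresholds(tracks, target_recall=0.995, dxy_cut=0.5):
    """tracks: iterable of (score, dxy, is_true) tuples (hypothetical layout)."""
    tracks = list(tracks)
    # Threshold for |dxy| <= dxy_cut is computed on all true tracks.
    all_true = [s for s, _, is_true in tracks if is_true]
    # Threshold for displaced tracks is computed on true tracks with |dxy| > dxy_cut.
    displaced_true = [s for s, dxy, is_true in tracks if is_true and abs(dxy) > dxy_cut]
    return {
        "prompt": threshold_for_recall(all_true, target_recall),
        "displaced": threshold_for_recall(displaced_true, target_recall),
    }


def passes(score, dxy, thresholds, dxy_cut=0.5):
    """Apply the threshold that matches the track's displacement category."""
    key = "displaced" if abs(dxy) > dxy_cut else "prompt"
    return score >= thresholds[key]


# Toy data: (score, dxy, is_true). A low target recall keeps the example readable.
toy_tracks = [
    (0.9, 0.0, True), (0.8, 0.1, True), (0.7, 0.0, True), (0.6, 0.2, True),
    (0.3, 1.0, True), (0.2, 2.0, True),
    (0.95, 0.0, False), (0.1, 1.5, False),
]
thresholds = select_thresholds(toy_tracks, target_recall=0.5)
print(thresholds)  # {'prompt': 0.7, 'displaced': 0.3}
```

In this toy sample the displaced true tracks score lower, so the displaced category gets a looser cut; a single cut derived from all tracks would reject them.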

MTV performance on TT+PU=200 is shown below.
[Image: Screenshot 2026-03-31 at 10 48 39 AM]
[Image: Screenshot 2026-03-31 at 10 50 17 AM]

Co-authored-by: Jade Chismar <jchismar@ucsd.edu>
@cmsbuild

cmsbuild commented May 28, 2026

cms-bot internal usage

@cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-51084/49552

@cmsbuild

A new Pull Request was created by @jchismar for master.

It involves the following packages:

  • Configuration/ProcessModifiers (operations)
  • HLTrigger/Configuration (hlt)
  • RecoTracker/FinalTrackSelectors (reconstruction)

@Martin-Grunewald, @Moanwar, @cmsbuild, @davidlange6, @fabiocos, @ftenchini, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @dgulhan, @elusian, @fabiocos, @felicepantaleo, @gpetruc, @makortel, @missirol, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@Moanwar

Moanwar commented May 28, 2026

Hi @jchismar, thanks. Which workflows are needed to test this PR?

@mmusich

mmusich commented May 28, 2026

for the record, the needed model is at cms-data/RecoTracker-FinalTrackSelectors#15 (it would be nice to link the two)

import FWCore.ParameterSet.Config as cms

# This modifier sets the use of a deep neural network for high purity track selection
trackTorchClassifier = cms.Modifier()

for my understanding why is this proposed via a modifier and not directly in the "production" workflow?

HLTInitialStepHPSelectionSequence = cms.Sequence(
hltInitialStepTrackCutClassifier
+hltInitialStepTrackSelectionHighPurity
)

nit: missing newline.

+hltInitialStepTrackTorchClassifierOutput
+hltInitialStepTrackCutClassifier
+hltInitialStepTrackSelectionHighPurity
)

nit: missing newline.

@mmusich

mmusich commented May 28, 2026

@jchismar what is the cost in terms of timing and GPU memory consumption of these developments?
Given they are not run in the "production" workflow we cannot test it via the bot.

@slava77

slava77 commented May 28, 2026

the modifier solution was mainly motivated by a much earlier interpretation that PyTorchAlpaka carries a significant memory cost (a significant fraction of 1 GB). It sounds from the pixel track DNN that the cost is much smaller.

The timing costs were rather small: "Adds ~7ms (on GPU, ~20ms on CPU) to the HLT timing", from https://indico.cern.ch/event/1688301/#5-round-robin-talk-on-dpgpog-s

@mmusich
to minimize the edits I think it would be practical to add a (temporary) commit in

Phase2 = cms.ModifierChain(Run3_noMkFit.copyAndExclude([phase1Pixel,trackingPhase1,seedingDeepCore,displacedRegionalTracking,ctpps_2022,dd4hep]),
phase2_common, phase2_tracker, trackingPhase2PU140, phase2_ecal, phase2_hcal, phase2_hgcal, phase2_muon, phase2_GEM, hcalHardcodeConditions, phase2_timing, phase2_timing_layer, phase2_trigger, trackingMkFitProdPhase2)
and add the modifier here.
Once the tests run we can decide if it's OK to move on for production or keep as a modifier

@mmusich

mmusich commented May 28, 2026

The timing costs were rather small: "Adds ~7ms (on GPU, ~20ms on CPU) to the HLT timing", from https://indico.cern.ch/event/1688301/#5-round-robin-talk-on-dpgpog-s

Thanks @slava77

to minimize the edits I think it would be practical to add a (temporary) commit in ...
and add the modifier here.
Once the tests run we can decide if it's OK to move on for production or keep as a modifier

FWIW, that is fine with me.

@cmsbuild

cmsbuild commented Jun 4, 2026

Pull request #51084 was updated. @Martin-Grunewald, @Moanwar, @cmsbuild, @davidlange6, @fabiocos, @ftenchini, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please check and sign again.


namespace ALPAKA_ACCELERATOR_NAMESPACE {

class TrackFeatureExtractor : public stream::FixedQueueEDProducer<> {

Why FixedQueueEDProducer ?

Comment on lines +73 to +75
// Create device collection and copy from host
TrackFeaturesDeviceCollection features_device(iEvent.queue(), nTracks);
alpaka::memcpy(iEvent.queue(), features_device.buffer(), features_host.const_buffer());

Rather than making the device copy explicitly, it's usually better to let the framework take care of that.

Simply producing the host collection (and making sure the definition for the device collection is available) should be enough.

This avoids making an extra copy when running on the CPU.


Rather than making the device copy explicitly, it's usually better to let the framework take care of that.

doesn't this proposal increase the CPU memory use when running on a GPU backend?
Here the CPU/host side disappears after the module is done with ::produce; in the other case it stays until the event is reset.
Is it practical to ifdef here and bypass the copy on the CPU backend?

@@ -170,6 +171,7 @@
fragment.load("HLTrigger/Configuration/HLT_75e33/psets/seedFromProtoTracks_cfi")
fragment.load("HLTrigger/Configuration/HLT_75e33/psets/SiStripClusterChargeCutLoose_cfi")
fragment.load("HLTrigger/Configuration/HLT_75e33/psets/SiStripClusterChargeCutNone_cfi")
fragment.load("HLTrigger/Configuration/HLT_75e33/services/PyTorchService_cfi")

Wouldn't it be better to move this together with the other services ?

@fwyzard

fwyzard commented Jun 5, 2026

This threshold is set at a target recall of 99.5% calculated on tracks with |dxy| > 0.5.

What does "recall" mean in this context?

@fwyzard

fwyzard commented Jun 5, 2026

Are there plans to make the input features available directly on GPU?

Otherwise, if it is expected that the input features will always be available only on CPU, would it make more sense to merge the three modules into one?

@slava77

slava77 commented Jun 5, 2026

This threshold is set at a target recall of 99.5% calculated on tracks with |dxy| > 0.5.

What does "recall" mean in this context?

recall (TP/(TP+FN)) is what we/HEP call "efficiency"
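
As a small worked illustration of the definition above (a sketch with hypothetical names and toy numbers, not code from this PR): "true" is assumed here to mean a reconstructed track matched to a simulated particle, and a track is selected when its score is at or above the cut. Precision, TP/(TP+FP), is included as the usual counterpart, which roughly corresponds to what is called purity.

```python
def confusion_counts(scores, is_true, threshold):
    """Count true positives, false negatives and false positives for a score cut."""
    tp = fn = fp = 0
    for score, truth in zip(scores, is_true):
        selected = score >= threshold
        if truth and selected:
            tp += 1  # true track kept
        elif truth:
            fn += 1  # true track rejected
        elif selected:
            fp += 1  # fake track kept
    return tp, fn, fp


def recall(tp, fn):
    """Recall ("efficiency"): fraction of true tracks that pass the cut."""
    if tp + fn == 0:
        raise ValueError("no true tracks in the sample")
    return tp / (tp + fn)


def precision(tp, fp):
    """Precision: fraction of selected tracks that are true."""
    if tp + fp == 0:
        raise ValueError("no tracks pass the cut")
    return tp / (tp + fp)


# Toy example: three true tracks and two fakes, cut at 0.5.
scores = [0.9, 0.8, 0.4, 0.7, 0.2]
is_true = [True, True, True, False, False]
tp, fn, fp = confusion_counts(scores, is_true, threshold=0.5)
print(tp, fn, fp)  # 2 1 1
print(round(recall(tp, fn), 3), round(precision(tp, fp), 3))  # 0.667 0.667
```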

@slava77

slava77 commented Jun 5, 2026

Are there plans to make the input features available directly on GPU?

unlikely in the context of the full/final tracks (or not for a long while); this would imply having the full track fit run on GPU.
Also, the goal is to try this in the offline with minimal changes.
There may still be a benefit to try the same scoring for output tracks already on GPU.

Otherwise, if it is expected that the input features will always be available only on CPU, would it make more sense to merge the three modules into one?

"modularity" seems to be an answer to motivate staying with 3 modules here.

The TrackTorchClassifierFromSoA implemented here is now more suitable for HLT (only HP tracks selected/passed). For offline use, multiple score flags and no full track copy would be more appropriate (although that's probably resolvable by adding produceFilteredTracks and dealing with the score->purity flag conversion later; but that seems like a premature optimization).

@slava77

slava77 commented Jun 5, 2026

@cmsbuild please test

@mmusich

mmusich commented Jun 5, 2026

Tests have not fully completed yet, but I see:

  • the new module hltInitialStepTrackTorchClassifier takes 217ms on GPU, and 13ms on CPU

| This PR jchismar@93e8c78 (on GPU backend) | This PR jchismar@93e8c78 (on CPU backend) |
| --- | --- |
| Screenshot from 2026-06-05 17-12-47 | image |

The memory profile is peculiar, both on CPU and GPU summary

| This PR jchismar@93e8c78 (GPU memory) | This PR jchismar@93e8c78 (CPU memory) |
| --- | --- |
| image | image |

I wonder if:

  • the timing of the module on GPU is expected and if not, would it make sense to enforce the CPU backend using the alpaka_serial_sync:: version of the module (until this is resolved)
  • the "spiky" behaviour at the beginning of the job could be mitigated in the same way that @EmanueleCoradin did at 7893376.

?

@cmsbuild

cmsbuild commented Jun 5, 2026

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-983f0c/53692/summary.html
COMMIT: 93e8c78
CMSSW: CMSSW_17_0_X_2026-06-05-1100/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/51084/53692/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

  • You potentially added 19 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 69
  • DQMHistoTests: Total histograms compared: 4949314
  • DQMHistoTests: Total failures: 14916
  • DQMHistoTests: Total nulls: 5
  • DQMHistoTests: Total successes: 4934373
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 68 files compared)
  • Checked 291 log files, 251 edm output root files, 69 DQM output files
  • TriggerResults: found differences in 16 / 67 workflows

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 20 workflow step(s) with memory usage exceeding the error threshold:

  • Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 61.6 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.75_TTbar_14TeV+Run4D121_HLT75e33Timing step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7501_TTbar_14TeV+Run4D121_HLT75e33TrackingOnly step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7502_TTbar_14TeV+Run4D121_HLT75e33TrackingNtuple step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7503_TTbar_14TeV+Run4D121_HLTHeterogeneousValid step2 max memory diff 61.6 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.751_TTbar_14TeV+Run4D121_HLT75e33TimingAlpaka step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 61.8 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.753_TTbar_14TeV+Run4D121_HLT75e33TimingLegacyTracking step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.754_TTbar_14TeV+Run4D121_HLT75e33TimingLegacyTrackingPatatrackQuads step2 max memory diff 61.6 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.755_TTbar_14TeV+Run4D121_HLT75e33TimingLST step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.756_TTbar_14TeV+Run4D121_HLT75e33TimingTrimmedTracking step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.757_TTbar_14TeV+Run4D121_HLT75e33TimingMkFitFit step2 max memory diff 63.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.758_TTbar_14TeV+Run4D121_HLT75e33TimingTiclBarrel step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.759_TTbar_14TeV+Run4D121_HLTPhase2WithNano step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.7591_TTbar_14TeV+Run4D121_HLTPhase2WithNanoValid step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.775_TTbar_14TeV+Run4D121_NGTScoutingCAExtensionMergeT5 step2 max memory diff 51.5 exceeds +/- 30.0 MiB
  • Error: Workflow 34434.911_TTbar_14TeV+Run4D121_DD4hep step2 max memory diff 61.7 exceeds +/- 30.0 MiB
  • Error: Workflow 34496.0_CloseByPGun_CE_E_Front_120um+Run4D121 step2 max memory diff 61.4 exceeds +/- 30.0 MiB
  • Error: Workflow 34500.0_CloseByPGun_CE_H_Coarse_Scint+Run4D121 step2 max memory diff 61.4 exceeds +/- 30.0 MiB
  • Error: Workflow 34634.999_TTbar_14TeV+Run4D121PU_PMXS1S2PR step3 max memory diff 57.9 exceeds +/- 30.0 MiB

@cmsbuild

cmsbuild commented Jun 5, 2026

Milestone for this pull request has been moved to CMSSW_20_0_X. Please open a backport if it should also go in to CMSSW_17_0_X.

@cmsbuild cmsbuild modified the milestones: CMSSW_17_0_X, CMSSW_20_0_X Jun 5, 2026
@slava77

slava77 commented Jun 5, 2026

I wonder if:

* the timing of the module on GPU is expected and if not, would it make sense to enforce the CPU backend using the `alpaka_serial_sync::` version of the module (until this is resolved)

not expected. This same setup running in a single job up to 32 threads/streams on L4 GPU runs much faster, 10-20 ms (@jchismar has numbers)

@mmusich

mmusich commented Jun 5, 2026

not expected. This same setup running in a single job up to 32 threads/streams on L4 GPU runs much faster, 10-20 ms

do we have a timing server based measurement that can be verified by experts?

@slava77

slava77 commented Jun 5, 2026

do we have a timing server based measurement that can be verified by experts?

does the timing server accept cms-sw and cms-data modifications for a test job submission?

@mmusich

mmusich commented Jun 5, 2026

does the timing server accept cms-sw and cms-data modifications for a test job submission?

yes, feel free to follow-up in the timing @ HLT mattermost channel for details.

@makortel

makortel commented Jun 5, 2026

I assume this PR does not need a backport to 17_0_X (Run 3 legacy). Although if you want to continue running the benchmarks as part of the PR tests, one option (until 20_0_0_pre1 RelVal samples arrive) would be to continue testing in 17_0_X.

@mmusich

mmusich commented Jun 6, 2026

not expected. This same setup running in a single job up to 32 threads/streams on L4 GPU runs much faster, 10-20 ms

for the record, I manually repeated the benchmark on one node in the NGT farm equipped with 4 L40s cards, using 16 jobs, 16 streams and 16 threads, and that pretty much confirms the findings from the bot:

I wonder how the benchmark mentioned above was carried out.

10 participants