Skip to content

Add concurrent normalization#3196

Open
martijnbastiaan wants to merge 5 commits into
masterfrom
martijn/concurrent-normalization
Open

Add concurrent normalization#3196
martijnbastiaan wants to merge 5 commits into
masterfrom
martijn/concurrent-normalization

Conversation

@martijnbastiaan
Copy link
Copy Markdown
Member

@martijnbastiaan martijnbastiaan commented Apr 14, 2026

This adds concurrent normalization by wrapping all state (fields) in MVars. Earlier patches used STM, but this didn't yield a performance benefit due to excessive contention/rollbacks. Explicit locks have all the usual drawbacks of potential deadlocks, which is why TracedMVar has been added. This can be enabled by CLASH_DEBUG_MVAR. Earlier tests on the bittide project yielded a 300% performance increase.

The feature is disabled by default, but can be enabled using -fclash-concurrent-normalization. It is enabled in the clash-testsuite.

Still TODO:

  • Write a changelog entry (see changelog/README.md)
  • Check copyright notices are up to date in edited files
  • Rebase on master
  • Actually listen to -fclash-concurrent-normalization
  • Check on bittide-hardware
  • Investigate Hit specialization limit 20 on function `Clash.Explicit.Testbench.outputVerifierWith[454934]'.. This can be triggered semi-reliably by executing cabal run clash-testsuite -- -j8 -p clash --hide-successes -p XilinxDDR.

@martijnbastiaan martijnbastiaan force-pushed the martijn/concurrent-normalization branch from 197045b to ccf882c Compare April 14, 2026 11:37
@martijnbastiaan
Copy link
Copy Markdown
Member Author

@rowanG077 I let Claude have a look at the issue. The reasoning it came up with sounds reasonable to me, but I haven't checked it in detail so please be careful :). The test suite passes locally. I'll test it on bittide-hardware too


Problem

The commit "Add concurrent normalization" introduced a race condition in the specialization cache. The specialisationCache and specialisationHistory were converted from pure state (sequential access via lens (%=)) to MVar-based shared state, but the cache-check → history-increment → cache-insert sequence was not atomic.

With concurrent normalization, multiple threads normalizing different entities (e.g., testBenchGS, testBenchGA, testBenchUS, testBenchUA) would all try to specialize the same function (ignoreFor) simultaneously. Each thread would:

  1. Check the cache → empty
  2. Increment the specialization history counter
  3. Do the specialization work
  4. Insert into the cache

Since steps 1-4 weren't atomic, N threads would each increment the counter for the same logical specialization, causing the counter to reach N× the expected value and exceeding the specialization limit of 20.

Fix

Two changes in Specialize.hs

  1. Changed the cache type from Map key Id to Map key (MVar Id) in Types.hs, using a "future" pattern (same as normalizeTopLvlBndr already uses for the normalized cache).

  2. Atomic cache reservation: When checking the cache, we atomically either find an existing entry (and wait on it if it's still being computed) or insert an empty MVar as a placeholder. This ensures only one thread does each specialization, and the history counter is only incremented once per unique specialization.

@rowanG077
Copy link
Copy Markdown
Member

rowanG077 commented Apr 14, 2026

Ah that's close to what my hunch was. Interesting that it can do that. I will check in the code whether it's actually true. If true I'm surprised how good it is.

@martijnbastiaan
Copy link
Copy Markdown
Member Author

Modern Claude with VSCode / terminal integration is very close to magic.

@rowanG077 rowanG077 force-pushed the martijn/concurrent-normalization branch 2 times, most recently from 99c384a to 80adea0 Compare April 14, 2026 17:23
@rowanG077 rowanG077 self-requested a review April 14, 2026 17:32
@rowanG077 rowanG077 marked this pull request as ready for review April 14, 2026 18:06
@martijnbastiaan
Copy link
Copy Markdown
Member Author

On bittide this does speed up Clash, but it trades it for high memory usage (5.5 GB -> 21 GB). Reducing the thread count from -N8 to -N2 reduces memory usage to 17 GB.

@rowanG077
Copy link
Copy Markdown
Member

And this was not that much of an improvement right and requiring manual OPAQUE on large components?

@martijnbastiaan
Copy link
Copy Markdown
Member Author

With OPAQUEs it is a 33% improvement. Nothing to sneeze at for 18m runs 🫠..

@martijnbastiaan
Copy link
Copy Markdown
Member Author

It seems to be fundamental to how Clash normalization works: due to its lifting of binders and specialization caches Terms just stay around for a looong time. I don't think there is much incremental we can do, though I'm open to ideas.

@rowanG077
Copy link
Copy Markdown
Member

Oh that is MUCH more significant than I thought

@rowanG077 rowanG077 force-pushed the martijn/concurrent-normalization branch from 6ce3ecd to 4ffba9b Compare May 5, 2026 16:19
@rowanG077
Copy link
Copy Markdown
Member

rowanG077 commented May 5, 2026

I fixed a deadlock that occurs when a thread is waiting on a specialization result of another thread that never completed.

I also added listening to -fclash-concurrent-normalization

The actual implementation looks fine to me now.

@martijnbastiaan
Copy link
Copy Markdown
Member Author

Cool, I'll have a look soon.

In retrospect I should have made TracedMVar for qualified use, instead of trying to be consistent with the existing MVar module. What do you think?

@rowanG077
Copy link
Copy Markdown
Member

In retrospect I should have made TracedMVar for qualified use, instead of trying to be consistent with the existing MVar module. What do you think?

I don't have really strong feelings either way. I do like that the API matches the standard MVar which signals to developers that this is just an MVar under the hood. So if you ask me I would keep it like it is currently.

martijnbastiaan and others added 5 commits May 6, 2026 09:44
Co-authored-by: Vanessa McHale <vamchale@gmail.com>

Co-authored-by: Alexander McKenna <alex@qbaylogic.com>

Co-authored-by: Rowan Goemans <rowan@qbaylogic.com>
When the thread responsible for creating a specialization failed after
installing the cache placeholder, waiting threads could block forever on
  the empty result MVar.
@rowanG077 rowanG077 force-pushed the martijn/concurrent-normalization branch from b93c821 to 90321b2 Compare May 6, 2026 07:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants