Use TwoQubitPeepholeOptimization in preset pass managers#16136
mtreinish wants to merge 89 commits into Qiskit:main
Conversation
This commit adds a new transpiler pass for physical optimization, TwoQubitPeepholeOptimization. It replaces the use of Collect2qBlocks, ConsolidateBlocks, and UnitarySynthesis in the optimization stage of a default pass manager setup. The pass logically works the same way: it analyzes the DAG to get a list of 2q runs, calculates the matrix of each run, then synthesizes the matrix and substitutes it in place. The distinction is that this pass does all of this in a single pass, and it also parallelizes the matrix calculation and synthesis steps because there is no data dependency between runs. This new pass is not meant to fully replace the Collect2qBlocks, ConsolidateBlocks, or UnitarySynthesis passes, as those also run in contexts where we don't have a physical circuit; it is meant to replace their usage in the optimization stage only. Accordingly, this new pass also changes the logic for selecting which synthesis to use and when to make a substitution. Previously this logic was primarily handled by the ConsolidateBlocks pass, which only consolidated to a UnitaryGate if the number of basis gates needed (based on the Weyl chamber coordinates) was less than the number of 2q gates in the block (see Qiskit#11659 for discussion). Since this new pass skips the explicit consolidation stage, it instead tries all the available synthesizers. Right now this commit has a number of limitations, the largest being:
- It only supports the target.
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer, because it's the only one currently in Rust.
For plugin handling I left the logic as running the three-pass series, but I'm not sure this is the behavior we want. We could keep the synthesis plugins for `UnitarySynthesis` only and then rely on our built-in methods for physical optimization.
But this also seems less than ideal, because the plugin mechanism is how we support synthesizing to custom basis gates as well as more advanced approximate synthesis methods, both of which are things we need to do as part of the synthesis here. Additionally, this is currently missing tests and documentation, and while running it manually "works" in that it returns a circuit that looks valid, I've not done any validation yet. This will also likely need several rounds of performance optimization and tuning. At this point this is just a rough proof of concept and will need a lot of refinement, along with larger changes to Qiskit's Rust code, before it is ready to merge. Fixes Qiskit#12007 Fixes Qiskit#11659
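The flow the commit describes (collect 2q runs, fold each run into a single matrix, synthesize the matrices in parallel, and substitute only where a result was found) can be sketched roughly as follows. This is a simplified, stdlib-only illustration of the shape of the algorithm, not the actual Rust implementation; `toy_synthesize` is a stand-in for a real two-qubit decomposer.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul4(a, b):
    # 4x4 matrix product over row-major nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
CX = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

def block_matrix(run):
    # Fold a run of 2q gate matrices into one 4x4 block matrix.
    acc = IDENTITY
    for gate in run:
        acc = matmul4(gate, acc)
    return acc

def toy_synthesize(matrix):
    # Toy decomposer: if the block multiplies to identity, the best
    # "resynthesis" is no gates at all.  A real decomposer would emit a
    # KAK-style decomposition here instead.
    if all(abs(matrix[i][j] - IDENTITY[i][j]) < 1e-9
           for i in range(4) for j in range(4)):
        return []
    return None  # no improvement found; keep the original run

def peephole(runs):
    # Matrix calculation and synthesis have no data dependency between
    # runs, so they can run in parallel; substitution stays serial.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda r: toy_synthesize(block_matrix(r)), runs))
    return [new if new is not None else old for old, new in zip(runs, results)]

runs = [[CX, CX], [CX]]       # two back-to-back CX gates cancel to identity
out = peephole(runs)
print([len(r) for r in out])  # → [0, 1]
```

The real pass additionally scores each synthesis result against the original block before substituting, rather than substituting unconditionally.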
Since Qiskit#13139 merged, we have another two qubit decomposer available to run in Rust, the TwoQubitControlledUDecomposer. This commit updates the new TwoQubitPeepholeOptimization to call this decomposer if the target supports appropriate 2q gates.
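A per-edge decomposer selection like the one described might look like the sketch below. The gate-name sets and the preference ordering are assumptions for illustration (the commits elsewhere note that the basis decomposer requires supercontrolled gates, while the controlled-U decomposer works with RZZ, RXX, RYY, or RZX); the real selection lives in the Rust pass.

```python
# Hypothetical gate sets; which gates each decomposer accepts is an
# assumption modeled on the commit messages, not the actual Rust logic.
CONTROLLED_U_GATES = {"rzz", "rxx", "ryy", "rzx"}
SUPERCONTROLLED_GATES = {"cx", "cz", "ecr", "iswap"}

def pick_decomposer(edge_gates):
    # Prefer the basis decomposer when a supercontrolled gate is present;
    # fall back to the controlled-U decomposer for parameterized 2q gates.
    names = set(edge_gates)
    if names & SUPERCONTROLLED_GATES:
        return "TwoQubitBasisDecomposer"
    if names & CONTROLLED_U_GATES:
        return "TwoQubitControlledUDecomposer"
    return None  # no supported 2q basis on this edge

print(pick_decomposer(["cx", "rz", "sx"]))   # → TwoQubitBasisDecomposer
print(pick_decomposer(["rzz", "rz", "sx"]))  # → TwoQubitControlledUDecomposer
```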
Clippy is correctly warning that the size difference between the two decomposer types in the TwoQubitDecomposer enum is large. TwoQubitBasisDecomposer is 1640 bytes while TwoQubitControlledUDecomposer is only 24 bytes, which means each ControlledU element wastes more than 1600 bytes. However, in this case that is acceptable in order to avoid a layer of pointer indirection, as these are stored temporarily in a Vec inside a thread to decompose a unitary. A trait would be a more natural way to define a common interface between all the two qubit decomposers, but since we keep them instantiated for each edge in a Vec they need to be sized, and working around that with something like `Box<dyn TwoQubitDecomposer>` (assuming a trait `TwoQubitDecomposer` instead of an enum) would add runtime overhead. This also weighs that TwoQubitControlledUDecomposer is far less likely in practice, as it only works with targets that have RZZ, RXX, RYY, or RZX gates on an edge, which is less common.
Also don't run scoring more than needed.
The priority for the two qubit peephole pass should be decreasing the 2q gate count; the error-rate heuristic should only matter if the 2q counts are the same. This commit flips the heuristic to check the 2q gate count first, so that reducing the 2q gate count is the first priority.
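The flipped heuristic is naturally expressed as a lexicographic tuple comparison: the 2q count dominates, and estimated error only breaks ties. A minimal sketch (field names are illustrative):

```python
def score(candidate):
    # Lexicographic priority: fewer 2q gates wins outright; estimated
    # error only matters when the 2q counts are equal.
    return (candidate["num_2q"], candidate["error"])

candidates = [
    {"name": "a", "num_2q": 3, "error": 0.001},  # lower error, more 2q gates
    {"name": "b", "num_2q": 2, "error": 0.010},  # fewer 2q gates: wins
]
best = min(candidates, key=score)
print(best["name"])  # → b
```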
This commit removes the unitary synthesis plugin mechanism from the pass. Supporting it was a layering violation, since the pass logic doesn't actually support the plugin interface. If plugin usage is desired, it is easier and clearer to handle that during pass manager construction than to have this pass internally build a pass manager and execute other passes to emulate behavior it doesn't have.
There were two issues identified by the testing which required fixing, as well as adjusting the tests for limitations of the pass. The first issue was that the parameters for the target gate were not handled correctly: when using the controlled-U decomposer we were not passing the computed parameter value correctly to the output circuit, and instead the ParameterExpression from the target was being used. Second, controlled (but not supercontrolled) gates with a fixed angle, which are normally intended for the XX decomposer, were incorrectly being passed to the TwoQubitBasisDecomposer, which can't work with them. This was producing invalid circuit outputs. The TwoQubitBasisDecomposer is now correctly filtered to run only with supercontrolled gates. The tests were adjusted for this limitation because they were mostly copied from the UnitarySynthesis tests, which support the XX decomposer.
…re locking is needed
We don't want to spend time reconstructing an exact copy of the DAG if no substitutions are needed. Prior to using a Vec for tracking the run indices that nodes are part of, we would check whether that map was empty. The Vec is always populated, and determining whether any entries are set would require a worst-case O(n) scan. To avoid that overhead while keeping the check, this adds an atomic bool that tracks whether we've substituted any blocks. If it is never set to true, we can exit early since there are no substitutions to make.
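The shape of that early-exit check, sketched with Python's `threading.Event` standing in for the Rust `AtomicBool` (the block contents and the "improved" condition are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Workers flip a shared flag whenever they find a substitution; if the
# flag is still unset afterwards, the (serial, potentially expensive)
# DAG rebuild is skipped entirely.
made_substitution = threading.Event()

def try_substitute(block):
    improved = len(block) > 1  # placeholder for "synthesis beat the original"
    if improved:
        made_substitution.set()  # safe to set from any worker thread
    return improved

blocks = [[1], [2], [3]]  # single-gate blocks: nothing to improve
with ThreadPoolExecutor() as pool:
    list(pool.map(try_substitute, blocks))

if not made_substitution.is_set():
    print("no substitutions; returning the original DAG unchanged")
```

Checking one flag is O(1) regardless of how many runs were analyzed, which is the point of the change.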
Previously there was a mismatch between the scoring of synthesis results and the peephole pass's comparison with the original block. The pass is documented as using the tuple (num_2q_gates, error, num_gates) and picking the minimum over all the choices. But when we called the unitary synthesis function that selects the best synthesis outcome, it maximized the estimated fidelity without considering the gate counts as the pass is documented to do. This commit corrects the mismatch by making the synthesis function generic over the score type and taking a scorer callback, which lets the peephole pass control the heuristic used for selecting the best result.
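The callback-based design can be sketched as follows; names and candidate fields are illustrative, not the actual Rust signatures. The two scorers show how the old fidelity-only selection and the documented tuple heuristic can disagree on the same candidates:

```python
def best_synthesis(candidates, scorer):
    # Generic over the score type: the caller supplies the heuristic
    # and selection is simply the minimum under it.
    return min(candidates, key=scorer)

candidates = [
    {"num_2q": 2, "num_gates": 9, "fidelity": 0.990},
    {"num_2q": 3, "num_gates": 7, "fidelity": 0.992},
]

# Documented peephole heuristic: (num_2q_gates, error, num_gates).
peephole_scorer = lambda c: (c["num_2q"], 1.0 - c["fidelity"], c["num_gates"])
# Old behavior: maximize estimated fidelity only (minimize error).
fidelity_scorer = lambda c: 1.0 - c["fidelity"]

print(best_synthesis(candidates, peephole_scorer)["num_2q"])  # → 2
print(best_synthesis(candidates, fidelity_scorer)["num_2q"])  # → 3
```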
In testing the pass in the full pass manager, there is an underlying issue with Python-defined gates in the circuit and GIL handling. In the presence of those gates, the thread doing the synthesis and analysis needs the GIL to get the matrix of the Python gate. But in the previous version of this pass, the parent thread retained the GIL while the parallel workers ran. This caused a deadlock because the worker threads could never acquire the GIL when they tried to do so during synthesis. This commit attempts to fix this by splitting the parallel portion of the function from the serial portion. The serial portion rebuilds the DAG from the analysis results and needs the GIL to copy any Python operations in the circuit as it rebuilds, so we re-attach to the GIL prior to running the serial portion. The unitary synthesis decomposer handling needed to be updated as well, because there was implicit usage of the GIL, via the py-clone feature, around handling custom RXX-equivalent gates for the controlled-U decomposer. This was not correct in a threaded context where the GIL might be released, and it would cause a panic. This is updated to explicitly attach to the Python interpreter and handle the Python copy with the py token explicitly. There were places in the decomposer handling code already doing this, but because of the py-clone feature the clone() calls were missed.
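The deadlock and its fix are a generic hold-a-lock-across-a-parallel-section problem, which can be illustrated with an ordinary `threading.Lock` standing in for the GIL (this is an analogy, not the pyo3 code):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# If the parent held `lock` across pool.map, workers that need it would
# block forever -- the deadlock described above.  The fix has the same
# shape as in the pass: release before the parallel section, re-acquire
# only for the serial rebuild.
lock = threading.Lock()

def worker(i):
    with lock:  # workers briefly need the "interpreter" lock
        return i * i

with lock:
    pass        # serial setup while holding the lock
# lock is released here, BEFORE the parallel section starts
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(worker, range(4)))
with lock:      # re-acquire for the serial rebuild portion
    print(results)  # → [0, 1, 4, 9]
```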
This follows on from Qiskit#13419, which added a new optimization pass, TwoQubitPeepholeOptimization, designed to replace the pair of ConsolidateBlocks and UnitarySynthesis in the optimization stage once we have a physical circuit. That PR did not update the preset pass managers, to concentrate the review on just adding the new pass. This PR continues from there by updating the preset pass managers to use the new pass in optimization levels 2 and 3, replacing those levels' previous usage of ConsolidateBlocks and UnitarySynthesis in the optimization stage. This should result in both a runtime performance and a transpilation quality improvement, as the new pass is faster and should produce better-fidelity circuits than the previous peephole optimization. The test updates in this PR are needed because the peephole optimization changes the transpilation output of various test circuits. These were all verified to be valid outputs, and in all cases a "better" output than before. Specifically, for the tests updated, these were the changes in output and why they occurred:
* In the two tests in test.python.circuit.test_scheduled_circuit.TestScheduledCircuit, the single CX gate in the output circuit was flipped from (0, 1) to (1, 0) because in the target the error rate for the (0, 1) direction was higher than the extra error cost of 3 sx gates (the rz gates have 0 error).
* In test_unroll_only_if_not_gates_in_basis from test.python.transpiler.test_preset_passmanagers.TestPresetPassManager, we no longer run ConsolidateBlocks in the optimization loop, so we no longer need to add the 2 executions from the init and translation stages. The test is updated to count the new peephole pass, which is the intent of the count check: to check the pass in the optimization loop.
* test_2q_circuit_5q_backend_v2 from test.python.transpiler.test_vf2_post_layout.TestVF2PostLayoutUndirected had the same cx gate flipping, because the error rate in the original layout for the reverse direction was 0.000779905 vs 0.00163587 in the original direction. The new pass was correctly flipping the cx gate, resulting in a different circuit that vf2 couldn't place anywhere better. To fix this, the test sets a fixed layout on worse qubits so that vf2 will have to place it somewhere better.
* For test_layout_tokyo_fully_connected_cx_4_3 from test.python.transpiler.test_preset_passmanagers.TestFinalLayouts, the output circuit has a better estimated fidelity (although more gates overall). The transpiler output goes from an estimated fidelity of 0.9526614226294913 before the new pass to 0.961996188569715 after. This new circuit with a better fidelity now has a different initial layout set, so the test is updated to use the new layout.
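The trade-offs above (a gate flipped despite adding 3 sx gates; a higher-fidelity circuit with more gates) follow from estimating circuit fidelity as the product of per-gate success probabilities. A sketch under that assumption, with made-up error rates for illustration:

```python
from math import prod

def estimated_fidelity(gate_errors):
    # Fidelity estimate: product of (1 - error) over every gate.
    return prod(1.0 - e for e in gate_errors)

# A circuit with one high-error gate can score worse than a circuit with
# more gates that each have lower error (numbers here are illustrative).
few_bad = [0.02]            # one high-error two-qubit gate
more_good = [0.005] * 3     # three lower-error gates
print(estimated_fidelity(few_bad) < estimated_fidelity(more_good))  # → True
```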
Hmm, I ran the asv transpile runtime benchmarks on this PR and it is showing a regression on the hwb benchmarks, which is unexpected. I'll have to profile why the pass is slower; it does potentially do more work to get the better quality results (it always synthesizes the 2q block's unitary), but in my earlier testing this was always offset by the use of parallelism, which resulted in a ~2x speedup for transpilation. It's also showing a speedup in places I wouldn't have expected, because they're smaller circuits and I'd expect the extra overhead of the parallelism to limit the speedup.
Interesting, looking at that example in particular (with a different seed and backends; I modified the pgo script instead of using the asv benchmark directly) the pass is significantly faster. It takes about 600ms to run while the pair of
This PR is based on top of #13419 and will need to be rebased after that merges. In the meantime you can view the contents of just this PR by looking at the HEAD commit:
471d199