GH-126910: Add gdb support for unwinding JIT frames#146071
diegorusso wants to merge 36 commits into python:main
Conversation
🤖 New build scheduled with the buildbot fleet by @diegorusso for commit ac018d6 🤖 Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F146071%2Fmerge If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
I have some questions about the EH frame generation and how it applies to the different code regions.

Both the shim and the executor region end up covered by the same kind of EH frame. I understand how this is correct for the shim:

```c
_Py_CODEUNIT *
_JIT_ENTRY(...) {
    jit_func_preserve_none jitted = (jit_func_preserve_none)exec->jit_code;
    return jitted(exec, frame, stack_pointer, tstate, ...);
}
```

The compiler will emit exactly the prologue/epilogue the EH frame describes. But I don't understand how the same EH frame is correct for the executor region: the FDE covers the full range, yet the stencils inside it were not compiled with that prologue.

Am I missing something about how the stencils interact with the stack, or is the EH frame intentionally approximate for the executor region?
```c
    struct jit_code_entry *first_entry;
};

static volatile struct jit_descriptor __jit_debug_descriptor = {
```
Should these be non-static? The GDB JIT interface spec says GDB locates __jit_debug_descriptor and __jit_debug_register_code by name in the symbol table. With static linkage they would be invisible in .dynsym on stripped builds and when CPython is loaded as a shared library via dlopen. Am I missing something, or would this silently break in release/packaged builds where .symtab is stripped?
Maybe also worth adding __attribute__((used)) to prevent the linker from eliding them?
Yes, you are right. Instead of removing the `static`, I've exported them with the macro Py_EXPORTED_SYMBOL.
```python
    id(42)
    return

warming_up = True
```
Could this loop hang? When warming_up=True, the call passes warming_up_caller=True which returns immediately at line 8, so the recursive body never actually executes. If the JIT does not activate via some other path, would this not spin forever until the timeout kills it? Should there be a max iteration count as a safety net?
Also, line 16 uses bitwise & instead of and. Was that intentional? It means is_active() is always evaluated even when is_enabled() is False.
I've simplified the test; the loop is now more controlled and deterministic.
```c
        return;
    }
    _PyJitUnwind_GdbRegisterCode(
        code_addr, (unsigned int)code_size, entry, filename);
```
code_size comes in as size_t but gets cast to unsigned int here. I know JIT regions will not be 4GB, but should the API just take size_t throughout for consistency?
This is now done.
What this change synthesises for jit_executor is one unwind description for the executor as a whole, not compiler-emitted per-stencil CFI. Because the stencils are musttail-chained, the jumps between stencils do not add extra native call frames. The unwind job here is just to recover the caller of the executor frame; we don't want to describe each stencil as its own frame. When GDB stops at a PC inside the executor region, it only needs that one rule to step out to the caller.

On AArch64, for most of the covered executor range, the synthetic CFI gives a single steady-state rule for recovering the caller's frame.
Good catch for the testing gap. I've now added a new test that breaks inside the jit executor.
@diegorusso @pablogsal I think I may have come up with a solution that works. EDIT: I think I was wrong about one thing: gdb doesn't only use .eh_frame.

Background info (skip if not interested):

The current issue:

The solution:

This should work. TL;DR: with frame pointers, the eh_frame is simple.
@diegorusso I have to say that I am tremendously confused here. When GDB or libgcc unwinds, it looks up the FDE that covers the current PC and applies the CFI row for that PC. So the real question is not "does the FDE cover the address range?" and it is not "do the stencils form one logical frame?". The real question is: does the CFI row that applies at that PC actually describe the machine state there? That is the part I do not think has been explained. I agree with the narrow musttail point. And that is exactly where I think the argument goes off the rails.
A concrete x86_64 example of why this seems wrong to me: with the same sort of flags used for executor stencils, a stencil can be emitted with no prologue at all, or, if it needs temporary stack space / spills, with something that actually adjusts the stack pointer. In the first case there is no saved frame state where the CFI says to find it; in the second, the CFA moves, so a single static rule cannot be right for both.

Also, I rebuilt the branch locally and tried this by hand. So this is not just a theoretical concern for me. I still do not understand why the model being described here is supposed to work. I am of course not objecting to the goal. I am saying I still do not see the correctness argument. If the claim is that this is actually a correct unwind description for arbitrary PCs in the executor region, that is the part that needs an explicit argument.
Thanks for the comment. I regenerated the x86_64 and AArch64 stencils after the recent frame-pointer changes. What we have today is that the shim gets a real frame-pointer prologue, but the executor stencils still are not uniformly rbp/x29-framed, so I don't think the current generated code is enough to justify a single executor-wide CFA = rbp + 16 / x29 + const rule for arbitrary PCs in the blob.

The current implementation is still one synthetic executor-wide FDE. The unwinder uses the current PC to select that FDE and applies its CFI to recover the caller frame. That works where the actual machine state matches the synthetic rule at the stop PC, but it is still approximate executor-wide unwind metadata, not exact per-stencil CFI.

Separately, once this PR lands, wiring up a libgcc-backed backtrace should be fairly easy. We already synthesise .eh_frame; the remaining work is to call the appropriate libgcc registration entry points.
Ok, I think now I understand. After re-checking the generated stencils I agree the current explanation was too bold. musttail only establishes the narrow point that the stencil-to-stencil transitions do not accumulate one native call frame per stencil; it does not by itself establish the stronger property needed for unwinding: that for an arbitrary PC inside jit_executor, the CFA and saved return state always have the shape described by one executor-wide FDE. That stronger property is the missing invariant here. After looking again at the regenerated x86_64 and AArch64 stencils, I don't think we have that invariant today.
I cannot justify the current synthetic executor-wide FDE as being correct for arbitrary PCs in the executor blob. The new test I added is still useful, but it proves something narrower: that the synthetic FDE works for the exercised in-executor stop. It does not prove that the same CFI is exact for every interior PC in the region (like the one in your example). I think the real options are:
The current implementation does not yet have the invariant needed to justify one executor-wide FDE for jit_executor, but at the same time I don't really like the suggestions above. Let me think about it.
The current generation reserves rbp, so all current stencils assume an rbp frame. Do you think it would fix it if we emitted our own prologue for the very first JIT executor uop?
Unfortunately, it seems you're right here. I dug around libgcc a little more and that's the only interface I see that intercepts the lookup.
Not all of them. I'm not entirely sure your statement is true.
Huh, that's surprising! On x86_64, the current stencils all seemed to reserve rbp.
Oh sorry, I'm wrong, I misread it.
Hey, I did another pass and found something important in the DWARF/GDB unwind info. As I did not want to keep you pushing again and again, I pushed some commits, please check them out.

The issue is that the old GDB CFI was describing the JIT executor like a normal function with a real prologue. On x86_64 it was effectively telling GDB to unwind as if the code began like this:

```asm
push %rbp
mov  %rsp, %rbp
```

and the equivalent DWARF rule was basically a prologue walk ending in CFA = %rbp + 16. That is a valid description for a normal C-style entry sequence, but it is not what the executor stencils actually do. The executor code keeps the frame pointer pinned across the whole region, so there is no real prologue for GDB to "walk through" instruction by instruction. If GDB stops near the start of the executor and the DWARF says "pretend the prologue already happened", it computes the CFA from the wrong place and can read the saved frame pointer / return address from the wrong stack slots.

The new GDB-only path fixes that by describing the frame layout that is actually true while the executor is running. On x86_64 the equivalent rule is now just CFA = %rbp + 16 for the whole region, with no per-PC prologue simulation in the FDE. In other words, instead of telling GDB "watch a fake prologue happen", we now tell it "this is the frame layout for this JIT region, unwind from that". The same idea is used on AArch64 with x29 / x30.

I also validated this by hand in GDB instead of trusting only the Python test harness. I built a clean JIT-enabled tree, broke in builtin_id, and finished until the selected frame was the synthetic executor frame. I also did a negative check by forcing the wrong unwind mode on purpose, and the backtrace immediately turned into garbage, which is a strong sign that this change is fixing a real mismatch between the unwind metadata and the code we actually emit.
Add a shared helper that asserts exactly one py::jit_entry frame above at least one eval frame, so regressions producing duplicate JIT frames or JIT-below-eval can't pass the old tolerant regex.
Remove absolute_addr from the elf_init_ehframe_perf code path. Implement a hack on AArch64 to tell where the shim prologue is positioned; to be properly fixed later.
The GDB CFI hand-rolled in jit_unwind.c couldn't correctly describe the compiled shim's prologue on AArch64 and relied on hardcoded offsets that would silently invalidate under any compiler/flag change. Tools/jit now compiles shim.c with -fasynchronous-unwind-tables, extracts its .eh_frame at build time, and ships the CIE/FDE CFI bytes as a blob in jit_stencils.h; jit_unwind.c splices those bytes into the synthetic EH frame at runtime, so whatever prologue clang emits is described accurately. Executor regions keep the hand-written steady-state rule, which is our pinned-frame-pointer invariant (enforced by Tools/jit/_optimizers.py _validate()), not a guess at compiler output. Also: "Backtrace stopped: frame did not save the PC" is now a hard AssertionError instead of a silent skip (it always indicates a real unwind bug), and the three JIT tests share a get_stack_trace override that opts out of the generic "?? ()" skip so unrelated libc-without-debug-info frames don't mask a passing test.
Link the JIT shim into the binary and remove the old runtime shim-specific unwind machinery. Rename _PyJIT to _PyJIT_Entry and the synthetic executor frame to py::jit:executor. Make executor unwinding materialize _PyJIT_Entry beneath py::jit:executor. Update the GDB tests to require both the executor frame and _PyJIT_Entry.
Accept either JIT backtrace shape in test_jit.py:
- py::jit:executor -> _PyEval_*
- py::jit:executor -> _PyJIT_Entry -> _PyEval_*

This avoids baking in architecture-specific unwind details while still checking that GDB gets out of the JIT region and back into the eval loop.
93 files changed
markshannon left a comment
I've just tried this out and it works nicely (on Linux AArch64). Here's an example. Hitting Ctrl-C while running this under gdb:

```python
import math

def count(seq):
    t = 0
    for i in seq:
        t += math.sqrt(1.0)
```

gives me this backtrace:
```
#0  PyFloat_AsDouble (op=0xfffff7840770) at Objects/floatobject.c:253
#1  0x0000fffff7779074 in math_1 (err_msg=0xfffff777bf38 "expected a nonnegative input, got %s", can_overflow=0, func=<optimized out>, arg=<optimized out>) at ./Modules/mathmodule.c:792
#2  math_sqrt (self=<optimized out>, args=<optimized out>) at ./Modules/mathmodule.c:1161
#3  0x0000fffff7fe72f0 in py::jit:executor ()
#4  0x0000aaaaaab57960 in _PyEval_EvalFrameDefault (tstate=0xaaaaab31b3c0, frame=0xfffffffffffffffd, throwflag=-1424409936, throwflag@entry=0) at Python/generated_cases.c.h:5941
#5  0x0000aaaaaad3a96c in _PyEval_EvalFrame (throwflag=0, frame=0xfffff7fea020, tstate=0xaaaaab1c7de8 <_PyRuntime+344816>) at ./Include/internal/pycore_ceval.h:122
...
```
and I can use the up, down, step and continue commands as I would expect.
Stepping into the jitted code even works (sort of):
```
Single stepping until exit from function py::jit:executor,
which has no line number information.
```
```c
            code_addr, code_size, entry, filename);
        return NULL;
    }
#endif
```
Unless I am missing something, I think that _PyPerfJit_WriteNamedCode() is now invoked from JIT compilation, but perf_map_jit_write_entry_with_name() still updates the global perf_jit_map_state and emits a multi-record PerfUnwindingInfo/PerfLoad sequence without holding map_lock. If this is true then two threads can interleave records and race code_id, producing a corrupted jitdump or duplicate IDs. Or is it guaranteed that we will never have concurrent JIT compilation?
ok, good catch. I will add a lock in perf_map_jit_write_entry_with_name when updating the perf_jit_map_state
```c
    /* Executor steady-state rule (our invariant, not the compiler's). */
#ifdef __x86_64__
    DWRF_U8(DWRF_CFA_def_cfa);               // CFA = %rbp + 16
    DWRF_UV(DWRF_REG_BP);
    DWRF_UV(16);
    DWRF_U8(DWRF_CFA_offset | DWRF_REG_RA);
    DWRF_UV(9);                              // return-to-_PyJIT_Entry at cfa-72
#elif defined(__aarch64__) && defined(__AARCH64EL__) && !defined(__ILP32__)
    DWRF_U8(DWRF_CFA_def_cfa);               // CFA = x29 + 96
    DWRF_UV(DWRF_REG_FP);
    DWRF_UV(96);
    DWRF_U8(DWRF_CFA_offset | DWRF_REG_FP);
    DWRF_UV(12);                             // caller x29 at cfa-96
    DWRF_U8(DWRF_CFA_offset | DWRF_REG_RA);
    DWRF_UV(11);                             // caller x30 at cfa-88
#else
```
I still feel quite worried about this: it hard-codes constants (9, 12, 11, 16, 96, 72) that describe _PyJIT_Entry's frame layout, specifically the slot where the call to jitted(...) spills its return address. I don't think anything guarantees that layout won't change when we change the clang version. Indeed, I can imagine a bunch of things that could change it, like changing the PyStackRef_ZERO_BITS width or the number of preserve_none args in Tools/jit/shim.c, as none of these are pinned. If any of those change, -72 silently becomes wrong and bt produces a readable-looking but bogus frame; the shape assertions in test_jit.py won't catch a misaligned-by-one-slot RA.

Is there anything that guarantees this won't ever change?
As discussed on Discord, for now it will be enough to have a mechanism that crashes at build time if the shape of the shim is different from what we expect.
The correct solution would be to read the eh_frame of the shim at build time and parametrise the above rules. We agreed to leave this for 3.16.
I tried to do a check but I didn't like it; it introduced some complexity. From there, relying on the eh_frame of the shim is not much more work, so I did it here: 13db9a9.
Now the eh_frame is the source of truth for generating the CIE of the DWARF that we emit to unwind the executor.
```c
    offset += shstrtab_size;
    const size_t str_off = offset;
    offset += strtab_size;
    offset = _Py_SIZE_ROUND_UP(offset, sizeof(Elf64_Sym));
```
This is not a power of two. _Py_SIZE_ROUND_UP(n, a) is (((n) + ((a)-1)) & ~((a)-1)), which only works for power-of-2 alignments. I checked and indeed: round_up(1, 24) == 8, round_up(17, 24) == 40, round_up(25, 24) == 32.
Today the code happens to work because Elf64_Sym only requires 8-byte alignment for its uint64_t fields and the macro coincidentally produces 8-aligned offsets. But anyone reading this assumes sym_off is a multiple of 24, and any future code that relies on that invariant will break. We can easily fix it with:

```c
/* Elf64_Sym requires 8-byte alignment for st_value/st_size. */
offset = _Py_SIZE_ROUND_UP(offset, 8);
```

matching sh_addralign = 8 for SH_SYMTAB.
```c
    res->trace = (_PyUOpInstruction *)(res->exits + exit_count);
    res->code_size = length;
    res->exit_count = exit_count;
    res->jit_gdb_handle = NULL;
```
Nit: executor->jit_gdb_handle = NULL is assigned at three sites: allocate_executor(), make_executor_from_uops() and make_cold_executor(). The latter two call allocate_executor() first, which already zeros the field, so the two extra assignments are redundant.
```c
    PyErr_Format(PyExc_RuntimeWarning, "JIT %s (%d)", message, hint);
}

static void *
```
jit_record_code returns void *, but the perf-trampoline branch returns NULL on success. The caller assigns the result into executor->jit_gdb_handle; _PyJIT_Free later only unregisters when it is non-NULL. The "NULL = no registration handle to release" semantics make the perf path and the failure path of the GDB path indistinguishable to the free side. This is correct, but I had to trace through two backends to figure out that NULL is not always a failure. I think we should either add a comment that failures are intentionally swallowed or propagate errors.
ok, fair point. Comment added here: 74e65ca
```c
    struct jit_code_entry *first_entry;
};

static PyMutex jit_debug_mutex = {0};
```
jit_debug_mutex is a process-global PyMutex. After fork(), the child inherits the mutex state; if the parent was inside gdb_jit_register_code() at fork time (mutex held by the dying parent thread), the child deadlocks the next time it tries to register a JIT entry. CPython's _PyOS_AfterFork_Child reset path doesn't know about jit_debug_mutex because it's static.
I'm a bit lost here. In a previous comment (#146071 (comment)) you told me to create a separate static PyMutex.
What do you suggest doing?
Since the child process won't be in the middle of gdb_jit_register_code, you can simply set jit_debug_mutex = {0}; in _PyOS_AfterFork_Child.
You'll need to make jit_debug_mutex non static, of course, and it should be renamed to _Py_jit_debug_mutex
Generate per-target JIT unwind constants from the compiled shim EH frame, emit dispatcher headers for the unwind info, and update the GDB JIT unwind tests.
This PR adds support to GDB for unwinding JIT frames by emitting EH frames.
It reuses part of the existing perf_jit infrastructure from @pablogsal.
This is part of the overall plan laid out here: #126910 (comment)
The output in GDB looks like: