I printf-debugged the mutex locks that looked suspicious. After several rounds of
printf
s, it turned out the problem wasn't in my code at all - the program was hanging in Vec::push
. What.Then I started downgrading my compiler. I wasn't sure what version I had before I updated, so this was a a bit of a search using my "rustup" commits as a guide. Eventually I found that the change in behavior happened between rust nightly-07-05 and nightly-07-06.
Okay, so this was clearly a rust bug. Being pessimistic about the odds of a bugfix if I just dumped my current findings into a Github issue, I started pulling out code to narrow down a repro (I also posted a vague complaint in IRC and on the rust user forum, in the hopes that someone would take enough of an interest to help out). I happened to notice a bug in my initial sizing of a short-lived vector, and bizarrely, fixing the initial size made the bug appear in a totally different patch of code. So it wasn't about the amount of memory I was allocating (I wasn't, for instance, triggering some new heap size limit), it was about the allocation operation.
Given that, it was hard to narrow down a good repro. Naively pulling out just about any code meant removing a bunch of allocations (there's no reason other than laziness for Playform to be allocating that much, but that's a separate issue). I figured maybe I could do something clever by keeping similar allocation patterns and amounts, but slowly replace all the types with integers and the "real" input data with a deterministic set of inputs.
On another train of thought, I also put some time into narrowing down the exact commit that broke Playform. After rediscovering the pain of building the rust compiler, I eventually made it to this commit, which just looks like it might have done something subtle and Playform-breaking.
But then a new wrinkle. I noticed if I ran the Playform server and client as separate binaries (which I wanted to do to make the repro easier), then the bug manifested even before nightly-07-05. What. Ouch. Okay, maybe the compiler change just revealed a pre-existing bug.
At this point, I decided to give up trying to be a super sleuth and just ask what was actually happening. After an exciting time learning about code signing and the exact right way to do this in recent MacOS releases in order to make gdb work (muttering terminal incantations, rebooting, dragging-and-dropping, and typing in an admin password a lot), I finally got gdb to run the Playform client. Better yet, it was still showing the bug!
The backtraces were not good. They were all question marks except for one thing:
jemalloc_mutex_lock
. Well, at least I felt pretty confident that this wasn't my fault. I spent some time googling around, trying to figure out if I should blame MacOS, jemalloc, rust, or Rememberus, the god of memory allocations.One easy thing to try was to stop Playform from using jemalloc and seeing if the problem persisted. So I followed some instructions to change the allocator for the client binary and.. nope, still the same bug. Ugh.
But, oddly enough, the backtrace was still showing jemalloc. I was using the right rust incantation. Maybe I didn't understand some important subtlety about libraries vs binaries. After more googling, I found a Github issue that revealed that the rust compiler commit I'd tracked down had introduced a silent change in syntax for switching allocators.
Using the right syntax, suddenly my bug was gone! I went to patch it into the standalone server+client binary and found out that it already contained patch of code using the old syntax to opt out of jemalloc. I wonder why that was there.
So apparently what happened was that the syntax change introduced in nightly-07-06 had silently undone a careful allocator-switch I'd added and forgotten about, uncovering a bug I'd also entirely forgotten about.
Someday I'll try to find out why jemalloc is deadlocking, but for now I'm just happy that Playform runs again without using a compiler that's about 200 versions behind.
No comments:
Post a Comment