Date: Mon, 21 Apr 2003 16:12:34 -0400
From: Mike Shaver
To: Chris Cooper
Subject: debugging crashes

Some semi-random thoughts; feel free to ask questions.

"Segfault with no mm"[*] is the UML version of an OOPS message,
basically. It indicates the same sort of thing as a SEGV in a user-space
program, which is to say an illegal memory reference. It almost always
means one of three things:

 - dereferencing a NULL pointer
 - accessing freed memory
 - dereferencing an uninitialized pointer (rare, because we pre-zero
   just about everything)

There are a few places to look to get more information about the likely
cause of the problem.

1) The address that was deemed bogus. The frame above panic will usually
   be something like:

  #1  0xa0103813 in segv (address=3735928623, ip=2834630060, is_write=0,
      is_user=0, sc=0xa652f098) at trap_kern.c:43

where "address" is the address that was being accessed, "ip" is the
instruction pointer (address of the instruction that executed the
load/store), "is_write" indicates whether we're writing to or reading
from that address, and "is_user" shows whether we're in kernel or user
context when the fault happens.

The first two are the most interesting, since they'll tell us what was
wrong, and where we hit it. Usually, the rest of the stack trace will
give us good information about where we were (see below), but the
address is good for letting us pick which of the three cases above we
tripped.

Decimal is a pain to read (and it's harder to see patterns in), so we
first print it as hex:

  (gdb) p/x 3735928623
  $1 = 0xdeadbf2f

There's a good hint: it looks like 0xdeadbeef, which is what we set a
pointer to after we OBD_FREE or OBD_VFREE it. If it were very small
(0x00000040, say), it'd be likely that we were dereferencing a NULL
pointer. If we're dereferencing a pointer in a freed structure (say we'd
freed req, and then poked at req->rq_repmsg->foo) you'd see something
like 0x5a5a5a9a -- we "poison" freed memory with a repeat of 0x5a bytes.
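To make the two markers concrete, here is a minimal C sketch of a free routine that poisons the memory and then overwrites the caller's pointer. It is not the real OBD_FREE; sketch_poison and sketch_free are hypothetical names, and only the 0x5a byte and the 0xdeadbeef value come from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE         0x5a                    /* fill byte named in the text */
#define FREED_POINTER_MAGIC ((uintptr_t)0xdeadbeef) /* pointer value named in the text */

/* Hypothetical helper: fill a buffer with the poison byte. */
void sketch_poison(void *buf, size_t size)
{
    memset(buf, POISON_BYTE, size);
}

/*
 * Hypothetical stand-in for a poisoning free: poison the memory, release
 * it, then overwrite the caller's pointer with the magic value so that a
 * later dereference faults at a recognizable address.
 */
void sketch_free(void **pp, size_t size)
{
    assert(pp != NULL && *pp != NULL);
    /* An optimizing compiler may drop this store because the memory is
     * freed right after; real code would have to guard against that. */
    sketch_poison(*pp, size);
    free(*pp);
    *pp = (void *)FREED_POINTER_MAGIC;
}
```

With something like this in place, a stale pointer that was itself overwritten reads back as 0xdeadbeef, while a pointer field fetched out of an already-freed structure reads back as 0x5a5a5a5a.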

To recap:

 - 0xdeadbeef: this pointer's target memory was freed, and now we're
   poking at it
 - 0x00000000: this pointer is NULL, so probably never set, but possibly
   freed and zeroed
 - 0x5a5a5a5a: the structure this pointer is in was freed.

(You'll often see values that are "not quite" the magic values, because
we often dereference the pointers by accessing structure members at some
non-zero offset into the structure; in this case, rq_repmsg->flags,
which is at offset 0x40, and 0xdeadbeef + 0x40 == 0xdeadbf2f.)

OK, so now we have a bit of information about what our bogus pointer
likely resembles. We should try to figure out where that pointer came
from.

 - The frame before is the one that stepped on the land mine. In the
   1144 case, that was:

  #5  0xa8f505ac in target_send_reply (req=0xa572a400, rc=0, fail_id=288)
      at target.c:596

Typing

  (gdb) frame 5

will print out the source line in question:

  DEBUG_REQ(D_HA, req, "not waiting for ack");

The only pointer in play there is "req", so we'll start by printing that
out:

  (gdb) p req
  $9 = (struct ptlrpc_request *) 0xa572a400

That looks like a reasonable pointer, so we'll see what's in the
structure, with the

  (gdb) print *req

you did before. In there, we see

  rq_repmsg = 0xdeadbeef

and we have a good handle on what's happening. A quick check of
DEBUG_REQ shows that it dereferences rq_repmsg if it's non-NULL:

  req->rq_repmsg ? req->rq_repmsg->flags : 0,

So now we have a good start on fixing this: it looks like rq_repmsg was
freed, and then DEBUG_REQ puked on it. A quick look at target_send_reply
shows that we did another DEBUG_REQ on the request just a few lines
earlier, so it's pretty easy to see -- once I turn my brain on -- where
the problem lies.

Et voila, solved.

Mike

[*] Bit of background, some of which you may already know: a "segfault"
occurs when we try to load or store from an illegal memory location.
Usually that means a virtual address that's not mapped to any physical
address, but sometimes it can be from trying to write to a page that's
mapped read-only. This happens all the time, during the course of normal
operation, since it's at the core of the kernel's virtual memory
implementation.

For example, when you start to run a program, the kernel only maps in a
small portion of the binary and libraries. Once it starts to run, the
program will "touch" new areas of memory, and the kernel will be
notified of the segmentation fault. In response, it will use the current
process' "memory mapping" to see what should be underneath that virtual
address, map memory into place (possibly including reading appropriate
data from disk), and then restart the offending instruction.

Inside kernel code, for an address near 0 or 0xdeadbeef, etc., there is
likely no valid mapping. So you get the panic that trips the UML
breakpoint, film at 11.
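
The "near 0 or 0xdeadbeef" triage from the main text can be sketched as a small C helper. This is only an illustration: the function and enum names are made up, addresses are assumed to be 32-bit as in the traces above, and the 0x1000 window for "close enough to a magic value" is an arbitrary assumption about how far into a structure a member offset might reach.

```c
#include <assert.h>
#include <stdint.h>

/* Magic values named in the text. */
#define FREED_POINTER_MAGIC 0xdeadbeefu
#define POISON_WORD         0x5a5a5a5au
/* Assumption: member offsets stay below 4 KiB from the base value. */
#define MAX_MEMBER_OFFSET   0x1000u

enum fault_kind {
    FAULT_NULL_POINTER,    /* address is a small offset from 0 */
    FAULT_FREED_POINTER,   /* address is a small offset from 0xdeadbeef */
    FAULT_POISONED_STRUCT, /* address is a small offset from 0x5a5a5a5a */
    FAULT_UNKNOWN          /* none of the patterns above */
};

/* True if address lies in [base, base + MAX_MEMBER_OFFSET). */
static int near_base(uint32_t address, uint32_t base)
{
    return address >= base && address - base < MAX_MEMBER_OFFSET;
}

/* Hypothetical helper: guess which of the three cases a fault address fits. */
enum fault_kind classify_fault_address(uint32_t address)
{
    if (near_base(address, 0u))
        return FAULT_NULL_POINTER;
    if (near_base(address, FREED_POINTER_MAGIC))
        return FAULT_FREED_POINTER;
    if (near_base(address, POISON_WORD))
        return FAULT_POISONED_STRUCT;
    return FAULT_UNKNOWN;
}

/* The addresses discussed in the text, run through the classifier. */
void classify_examples(void)
{
    assert(classify_fault_address(3735928623u) == FAULT_FREED_POINTER);   /* 0xdeadbf2f */
    assert(classify_fault_address(0x00000040u) == FAULT_NULL_POINTER);
    assert(classify_fault_address(0x5a5a5a9au) == FAULT_POISONED_STRUCT);
    assert(classify_fault_address(0xa572a400u) == FAULT_UNKNOWN);         /* the valid req */
}
```

Anything that comes back as FAULT_UNKNOWN still needs the frame-by-frame inspection described in the main text.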