We’ve previously published a few blog posts covering memory management, memory bugs, and post-mortem memory analysis. Now we are excited to add another memory debugging tool to Backtrace’s offering: stale pointer analysis. In this post, I will explain how this analysis works, its strengths, and its limitations.
Among memory errors, the stale pointer is one of the subtlest and most difficult to debug. It occurs when a memory region is freed or reallocated but the old references (aliases) to that memory are not updated. Any later use of these stale references has unpredictable consequences: it may be benign, it may trigger intermittent failures, or it may crash outright. Even when no error is observable, the bug is a ticking time bomb; a slight change of context or a small code addition can make it catastrophic. In multi-threaded programs, race conditions often produce stale pointers, which adds another dimension of complexity to debugging.
As an example of a stale pointer, let’s take a look at the following code snippet.
The memory bug is obvious here. Function register_student uses the memory pointed to by name_buffer and frees it when it is done. Function join_team tries to do a similar thing but runs into trouble when it uses the already freed name_buffer. This code will crash on RedHat Linux because the memory pointed to by name_buffer is returned to the kernel when it is freed. The system memory allocator, ptmalloc, allocates memory directly from the kernel, and returns it to the kernel once the user frees it, if the request is larger than the tunable mmap_threshold (128 KiB by default). Other operating systems and their allocators may have different policies and implementations, so a user could see totally different symptoms. The above example is oversimplified for the purpose of illustration; real stale pointer bugs in the wild are much more complex and difficult to debug. For example, function join_team may be invoked in different threads with improper synchronization. The execution of the code is transient, which makes it difficult for the user to understand the sequence of events that leads to the final crash.
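The pattern described above can be sketched in a few lines. This is a minimal illustration, not the original listing: the names register_student, join_team, and name_buffer come from the text, while the buffer size, the string formats, and the name_buffer_freed flag are assumptions made for this sketch. The flag exists only so the sketch can run safely; the buggy pattern has no such check and writes straight through the stale pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed size: larger than the 128 KiB default mmap threshold, so that
 * ptmalloc would typically map the block directly and unmap it on free. */
#define NAME_BUFFER_SIZE (256 * 1024)

static char *name_buffer;      /* alias shared by both functions */
static int name_buffer_freed;  /* hypothetical bookkeeping for this sketch */

/* Uses name_buffer, then frees it but leaves the old address in place. */
int register_student(const char *name)
{
    assert(name != NULL);
    name_buffer = malloc(NAME_BUFFER_SIZE);
    if (name_buffer == NULL)
        return -1;
    name_buffer_freed = 0;
    snprintf(name_buffer, NAME_BUFFER_SIZE, "student:%s", name);
    /* ... work with name_buffer ... */
    free(name_buffer);         /* name_buffer is now a stale pointer */
    name_buffer_freed = 1;
    return 0;
}

/* Returns 0 on success, -1 if it would have touched freed or missing memory. */
int join_team(const char *name)
{
    assert(name != NULL);
    /* The buggy version omits this guard and writes to freed memory. */
    if (name_buffer_freed || name_buffer == NULL)
        return -1;
    snprintf(name_buffer, NAME_BUFFER_SIZE, "member:%s", name);
    return 0;
}
```

Calling register_student("alice") followed by join_team("alice") makes join_team return -1 here. Without the guard, the same call sequence would write to memory that has already been handed back, which is the crash the text describes.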
A Practical Approach to Detecting Stale Pointers
Stale pointer issues are difficult to debug in practice. To prevent them, a developer may avoid aliasing, ensure that no references to freed memory remain (for instance, by using reference counting), or manage objects with garbage collection. But none of these is a solution for every application.
Backtrace added a new feature to detect and analyze stale pointers in a recent release. As a post-mortem tool, it takes a different approach from the methods above. The basic idea is simple. When a crash is caused by an invalid memory access, the faulting address must come from the value of a variable, or from a direct or nested data member of a variable. The invalid address was set incorrectly through that variable. Finding the variable that holds the reference can therefore lead to the root cause of the crash. If it points to a freed memory region, it is a stale pointer.
The concept is not new. In fact, it is commonly used in interactive debugging: a developer reviews the faulting thread’s backtrace and inspects local variables to figure out how they relate to the crash. This is sufficient in some cases but often fails for tough issues such as stale pointers. The culprit may live in another thread, as in a race condition; it may be a global object; or it may belong to a heap object that indirectly references the faulting memory address. For any nontrivial program, it is impossible to exhaust all possibilities manually by inspecting variables and heap memory regions interactively, not to mention that doing so is tedious and prone to human error.
Backtrace’s stale pointer module scans the whole address space of the target process to extract all back references to the faulting memory address. The scan may be repeated a couple of times to find indirect back references. The module then analyzes and symbolizes the back references to put them in the context of the program source. For example, global and local variables are annotated with the compilation unit, or the thread and stack frame, to which they belong, as well as their names and values, including elements and data members of aggregate types.
However, a back reference is more likely to belong to a heap object, since most objects are allocated dynamically from the heap. It is crucial to know what these objects are and how they are used in context. Heap objects present a challenge for symbolization. First, the compiler emits no debug symbol for a heap object. Second, a debugger can’t help unless the object is referenced by a local or global variable, which isn’t always the case given that there may be thousands or even millions of heap objects.
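The scan for back references can be sketched as a walk over every pointer-sized word of the captured memory. This is only an illustration under stated assumptions, not Backtrace’s implementation: struct region and find_back_references are hypothetical names, each region is assumed to be a byte copy of one mapping of the target, and only exact matches at pointer-aligned offsets are reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one mapped region of the target process. */
struct region {
    uintptr_t base;             /* address of the region in the target */
    const unsigned char *data;  /* copy of the region's bytes */
    size_t size;                /* length of the region in bytes */
};

/* Scans every pointer-aligned word of every region and stores the target
 * addresses of words equal to `target` in `out` (at most `max` entries).
 * Returns the total number of matches, which may exceed `max`. */
size_t find_back_references(const struct region *regions, size_t nregions,
                            uintptr_t target, uintptr_t *out, size_t max)
{
    size_t found = 0;

    assert(out != NULL || max == 0);
    for (size_t r = 0; r < nregions; r++) {
        const struct region *reg = &regions[r];
        for (size_t off = 0; off + sizeof(uintptr_t) <= reg->size;
             off += sizeof(uintptr_t)) {
            uintptr_t word;
            /* memcpy avoids alignment and aliasing problems on the copy. */
            memcpy(&word, reg->data + off, sizeof word);
            if (word == target) {
                if (found < max)
                    out[found] = reg->base + off;
                found++;
            }
        }
    }
    return found;
}
```

An indirect back reference could then be looked for by calling the function again with one of the returned addresses as the new target. The cost is one pass over all mapped memory per call.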
Backtrace overcomes this difficulty by leveraging heap and object type information retrieved by other Backtrace components. The Backtrace heap module analyzes the memory allocator’s metadata and, given any heap address, tells us whether the enclosing heap region is free or in use. When a heap region holds a back reference to the faulting memory address, a free region may reveal a stale pointer bug; if the region is in use, a bug may still have set the invalid address through that reference. To identify such a heap object, Backtrace first determines the size of the heap region and reads its contents. It analyzes the data pattern and picks out pointer values, i.e. valid memory addresses that fall within the target’s address space, as the signature of the object. The assumption is that these values are pointer-typed data members of the object. This information is then combined with knowledge of the data types declared by the compiler, for example through DWARF debug symbols. The result is a reasonably short list of object types that the heap object might belong to. The user can use this clue to review the source code for potential bugs related to the object.
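The signature matching described above can be illustrated with a small filter. It is a sketch under several assumptions rather than Backtrace’s actual algorithm: struct type_info stands in for whatever summary is extracted from debug information, struct addr_range stands in for the target’s memory map, and a type is kept only if it fits in the heap block and declares a pointer member at every offset where the block holds a pointer-like value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PTR_MEMBERS 8

/* Hypothetical mapped address range [lo, hi) of the target process. */
struct addr_range {
    uintptr_t lo;
    uintptr_t hi;
};

/* Hypothetical type summary, as might be derived from debug symbols. */
struct type_info {
    const char *name;
    size_t size;                          /* sizeof the type */
    size_t nptr;                          /* number of pointer members */
    size_t ptr_offsets[MAX_PTR_MEMBERS];  /* byte offsets of those members */
};

/* A value looks like a pointer if it falls inside a mapped range. */
static int looks_like_pointer(uintptr_t value,
                              const struct addr_range *maps, size_t nmaps)
{
    for (size_t i = 0; i < nmaps; i++)
        if (value >= maps[i].lo && value < maps[i].hi)
            return 1;
    return 0;
}

static int has_pointer_member_at(const struct type_info *t, size_t offset)
{
    for (size_t i = 0; i < t->nptr; i++)
        if (t->ptr_offsets[i] == offset)
            return 1;
    return 0;
}

/* Returns 1 if `t` is a plausible type for the heap block, 0 otherwise. */
int type_matches(const struct type_info *t,
                 const unsigned char *block, size_t block_size,
                 const struct addr_range *maps, size_t nmaps)
{
    assert(t != NULL && block != NULL);
    if (t->size > block_size)
        return 0;  /* the object would not fit in the block */
    for (size_t off = 0; off + sizeof(uintptr_t) <= block_size;
         off += sizeof(uintptr_t)) {
        uintptr_t word;
        memcpy(&word, block + off, sizeof word);
        /* Every pointer-like word must line up with a pointer member. */
        if (looks_like_pointer(word, maps, nmaps) &&
            !has_pointer_member_at(t, off))
            return 0;
    }
    return 1;
}
```

Running type_matches over every known type yields the candidate list. The filter is heuristic in both directions: an integer that happens to fall inside a mapped range can exclude the right type, and a pointer member that is currently NULL places no constraint at all.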
Let’s see an example with the previous code snippet. The following figure shows the result of stale pointer detection. The program crashed while accessing 0x7faf40953000. After scanning the process’s address space, we found five back references. One of the back references lies in the heap: heap analysis reveals that it falls in an active region of 16 bytes starting at address 0x102c260. Many data types could result in a 16-byte heap allocation (this also depends on the implementation of the particular memory allocator, not just on the type itself), but only a fraction of them have a pointer as the data member at offset 8. Backtrace’s type identification lists seven possible object types. If we ignore types from system libraries, only two are left: struct student and struct team_member. The user is prompted to review the source code related to variables of these types, especially the data member that references the faulting memory address.
There are some limitations to this approach. The memory scan has a significant performance cost because it covers the process’s entire address space. Our tests show that response time grows linearly with the total amount of memory mapped by the target process. The effectiveness of the approach also depends on the faulting memory address itself. If the program faults on an address that is a common value, there will be too many false-positive back references for the result to be useful. As an extreme example, a crash caused by dereferencing NULL renders the feature useless, because the NULL pointer, i.e. the value 0, generates numerous back references, most of them unrelated.
Debugging stale pointer bugs is challenging; there is no silver bullet. However, Backtrace provides a practical feature for finding all back references to an invalid memory address. Combined with the heap analysis module built into Backtrace, it offers a good chance of finding a heap object that has been freed but is still referenced by active variables, and of showing how those variables relate to the faulting memory address. This may be just the tool you need when others fall short.