const struct object_vtable { const type_info *typeinfo; void (*method1)(object *self); void (*method2)(object *self, int argument); /* etc. */ } object_vtable = { .typeinfo = &object_typeinfo, .method1 = &object_method1, .method2 = &object_method2, /* etc. */ };And it's easy to believe this is the only way, since C is the "bare metal" language, and these are the tools C gives you to represent these kinds of data structures. The object_vtable structure here is a global constant, full of pointers to other global constants—it should be "free", right? In reality, operating systems provide a fairly elaborate runtime environment in order to make C programs work. Data structures that contain global pointers will in fact allocate memory at program launch before even entering main(). To understand why, we need to peek behind the scenes and look at how the dynamic linker works.
When a process is formed, the contents of the executable file and all of the dynamic libraries it uses get memory-mapped into the new process's address space by the dynamic linker. The dynamic linker uses the kernel's memory mapping feature to do this, associating regions of memory with the contents of the binary files on disk. As long as this memory isn't changed by the running program, the kernel can consider it to be "clean", so that if the system needs to free up memory for other purposes, it can discard these clean pages and reload them from the original binary later, since they haven't been changed. Furthermore, if multiple processes launch using the same executable or dynamic libraries, the exact same clean memory pages can be shared across the address spaces of all of those different processes, significantly reducing the amount of memory needed by the entire system.
However, for a number of reasons, the code and data in a binary on disk can't know for certain what memory address it's going to end up getting mapped to in a running process. Every executable can link against any set of dynamic libraries, so a dynamic library may have to rebase to make room for other libraries in the process, sliding to a new base address. Additionally, as a layer of security, operating systems use address space layout randomization, or ASLR, so that if a program gets exploited, attacker code can't make static assumptions about the location of other exploitable resources in the program's memory, making exploits harder to write. Because the binary on disk doesn't know what address it's going to be mapped to, when pointers appear in its global data, the dynamic linker has to slide all of those pointers to their correct values for the process. This sliding has to happen when the program is loaded, before entering main, delaying the launch of the program. It also causes the pages those pointers are on to become "dirty" from the kernel's perspective, causing the system to use more memory: dirty pages can't be shared among different processes, since the pointer values may be different in each process, and since they no longer match the contents of the binary on disk, they can no longer be merely discarded if the system needs to free up memory for other uses; dirty pages instead have to be written and reloaded from swap.
I wrote a small C program for macOS to observe the impact of global pointers on memory usage and program launch time. This program measures the current time and then spawns another copy of itself, which measures the time immediately after entering main and reports the time difference in nanoseconds. The program also accepts a -stop argument, which will cause the process to suspend itself immediately after reporting its launch time, allowing us to inspect its memory usage after entering main. We can build it in three variants:
$ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=0 -o no_class_records $ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=1 -o non_pointer_class_records $ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=2 -o pointer_class_recordsI ran each variant a thousand times on my computer, a 3.1GHz 2017 15" MacBook Pro, and I got these average timings:
$ average() { awk '{ sum += $0 } END { printf "%u\n", sum / NR }' } $ (for i in $(seq 1 1000); do ./no_class_records; done) | average 1513058 $ (for i in $(seq 1 1000); do ./non_pointer_class_records; done) | average 1517911 $ (for i in $(seq 1 1000); do ./pointer_class_records; done) | average 1712659We can see that the global pointers in the pointer_class_records variant have a small but measurable impact on its launch time—about 0.2 milliseconds, or a 13% increase over the baseline launch time of 1.5 milliseconds. We can also ask dyld, the dynamic linker on macOS, to break down all the work it does at process launch by setting the environment variable DYLD_PRINT_STATISTICS_DETAILS. On my machine, running the no_class_records and pointer_class_records variants with DYLD_PRINT_STATISTICS_DETAILS enabled produces the following output:
$ ./no_class_records ... total rebase fixups: 14 total rebase fixups time: 0.48 milliseconds (23.4%) ... $ ./pointer_class_records ... total rebase fixups: 32,783 total rebase fixups time: 1.62 milliseconds (50.8%) ...Among all the other measurements dyld reports, the "total rebase fixups" measurement makes the difference between no_class_records and pointer_class_record plain. The latter has exactly 32K more pointers to rebase than the baseline, accounting for most of the difference in startup time between the two variants.
Now let's look at the memory impact of these rebases. If I run each variant with the -stop flag, and inspect the stopped processes' memory usage in top, I see something like this:
$ ./no_class_records -stop 2034244 nanoseconds from spawn to main() entry pid 73439 zsh: suspended (signal) ./no_class_records -stop $ ./non_pointer_class_records -stop 2147981 nanoseconds from spawn to main() entry pid 73519 zsh: suspended (signal) ./non_pointer_class_records -stop $ ./pointer_class_records -stop 2221860 nanoseconds from spawn to main() entry pid 73528 zsh: suspended (signal) ./pointer_class_records -stop $ top -pid 73439 -pid 73519 -pid 73528 PID COMMAND ... MEM ... 73528 pointer_clas ... 572K ... 73519 non_pointer_ ... 316K ... 73439 no_class_rec ... 316K ...We can see that the no_class_records and non_pointer_class_records variants have no difference in memory usage, whereas the pointer_class_records variant uses 256KB more memory, due to its global data getting dirtied by the dynamic linker's sliding.
These costs may seem small, but in an operating system with frequently-used dynamic libraries that get loaded into almost every process, and executables that have many instances of the same binary running at the same time, the launch time and dirty page costs add up. On mobile platforms, where memory is constrained and swap is usually unavailable, dirty pages directly affect how many processes and how much user data can be kept in memory, and small-seeming changes in load times can have a big impact on the perceived responsiveness and performance of the platform.
We can make data structures position-independent the same way code does, by having structures reference other structures by their distance from each other within the binary instead of by their absolute addresses. For example, the object_vtable from the beginning of the article could look something like this instead:
#define OFFSET(target, source) \ ((intptr_t)&target - (intptr_t)&source) const struct object_vtable { int typeinfo_offset; int method1_offset; int method2_offset; /* etc. */ } object_vtable = { .typeinfo_offset = OFFSET(object_typeinfo, object_vtable.typeinfo_offset), .method1_offset = OFFSET(object_method1, object_vtable.method1_offset), .method2_offset = OFFSET(object_method2, object_vtable.method2_offset), /* etc. */ };And since these relative distances between objects won't change regardless of where the binary gets mapped into memory, the pages they get loaded into will remain clean. Also, since executables are still normally limited to 2GB even on 64-bit platforms (unless they opt into special "large code models" that increase code size in order to allow the linker to resolve 64-bit addresses), there's an opportunity to reduce size too, since a relative reference to another global in the same binary can be four bytes instead of eight, halving the size of many data structures.
Unfortunately, C and C++ don't consider expressions involving subtracting global addresses to be constant expressions. If we try to compile something like the offset-based object_vtable implementation above, we'll get an error:
$ xcrun clang foo.c foo.c:16:22: error: initializer element is not a compile-time constant .typeinfo_offset = OFFSET(object_typeinfo, object_vtable.typeinfo_offset), ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~So to take advantage of this technique, we have to go deeper. Regardless of what C says, the underlying assembler and linker support this on most contemporary platforms, as does LLVM. Using the assembler on macOS, a data structure with relative references can be generated like this:
.section __TEXT, __const ; Put the following into the constants section of the binary .global _object_vtable ; Export the C symbol 'object_vtable' ; (C symbols begin with underscores in macOS assembly) _object_vtable: ; Define object_vtable .long _object_typeinfo - . ; Emit the distance between object_typeinfo and here ; (A '.' represents the current address in assembly) .long _object_method1 - . .long _object_method2 - .The same thing can be expressed in LLVM with a somewhat verbose constant expression:
%object_vtable = type { i32, i32, i32 } %object_typeinfo = type opaque %object_method = type opaque @object_typeinfo = external constant %object_typeinfo @object_method1 = external constant %object_method @object_method2 = external constant %object_method @object_vtable = constant %object_vtable { i32 trunc (i64 sub (i64 ptrtoint (%object_typeinfo* @object_typeinfo to i64), i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 0) to i64)) to i32), i32 trunc (i64 sub (i64 ptrtoint (%object_method* @object_method1 to i64), i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 1) to i64)) to i32), i32 trunc (i64 sub (i64 ptrtoint (%object_method* @object_method2 to i64), i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 2) to i64)) to i32) }
One tradeoff to using relative references is that they do require slightly more generated code on average to dereference than absolute pointers, leading to small performance and code size costs. To explore these costs, here's another small macOS C program that microbenchmarks absolute and relative references by creating a large number of vtables using either relative or absolute method pointers and measuring the time taken to call through them all. We can build both variations like this:
$ xcrun clang -O3 -fpie invoking-relative-references.c -DVARIATION=0 -o relative $ xcrun clang -O3 -fpie invoking-relative-references.c -DVARIATION=1 -o absolute
And then run them:
$ ./absolute 850393727 nanoseconds to invoke methods $ ./relative 867976645 nanoseconds to invoke methodsThis shows about a 2% cost for using relative references (if we're doing absolutely nothing but indirectly calling methods, which is an unlikely real-world workload). We can also compare other interesting things about the binaries, like total size:
$ ls -l absolute relative -rwxr-xr-x 1 joe staff 270872 Feb 14 20:23 absolute* -rwxr-xr-x 1 joe staff 139792 Feb 14 20:45 relative*The version with absolute pointers is almost twice the size, due to the vtables containing eight-byte pointers instead of four-byte relative offsets. We can also compare code size only, minus the data:
$ objdump -section-headers absolute | grep __text 0 __text 000000e7 0000000100000e50 TEXT $ objdump -section-headers relative | grep __text 0 __text 0000010d 0000000100000e20 TEXTThe code size for following relative references is higher by 38 bytes, 16% more than the absolute variant. All of these costs and benefits are exaggerated by this being an artificial benchmark that does nothing but follow references, but they give a sense of the tradeoffs involved. A real program will generally be doing a lot more between method calls or pointer dereferences, giving contemporary CPUs the opportunity to mask this cost by doing other operations simultaneously with the extra reference math. Nonetheless, for data structures that require maximum performance, the launch-time, dirty memory, and size costs of absolute pointers may still be worth paying to avoid the runtime cost. Otherwise, smaller is generally better, and code and static data is generally cheaper than dirty data at the system level, so optimizing to reduce the size of data, and particularly the amount of dirty data, is a good default stance.
extern int some_variable; int get_value_of_some_variable(void) { return some_variable; }it'll produce something like this assembly language:
_get_value_of_some_variable: mov rax, some_variable@GOTPCREL[rip] ; load the address of some_variable from the GOT mov eax, [rax] ; load the value of some_variable ret ; return itThe @GOTPCREL notation instructs the assembler to calculate the relative offset from the current instruction to a variable's entry in the GOT. We can also build data structures that refer to external symbols using relative references to their global offset table entries. This gives most of the same benefits as relative referencing symbols within the same binary. A global pointer will end up dirtying the memory page it resides in, whether it points at a local or external symbol. Although the GOT itself will be dirtied when the dynamic linker fills it in with resolved addresses, the GOT entries appear together in one contiguous area of memory. When the pointers are scattered through our constant data structures, more memory pages will get dirtied overall than if the pointers are concentrated together in the GOT. Since a GOT entry gets formed for an external symbol anyway when it gets referenced from code, there's no additional cost if we reuse it for our data structures.
The same @GOTPCREL syntax works for data as well as code in assembly language:
.global _external_relative_reference _external_relative_reference: ; Note that GOTPCREL measures the distance from the address at the end of the ; four-byte value rather than the beginning, since x86-64 does ; PC-relative addressing relative to the address of the following instruction. ; Adding 4 compensates for this offset. .long _external_global@GOTPCREL+4LLVM doesn't directly let you manipulate the GOT, but if you define a private global constant containing the address of another variable, and give that constant the unnamed_addr attribute (which tells LLVM that the address of the constant is not meaningful to the program, only its contents, allowing it to coalesce it with other constants containing the same value), then LLVM will cleverly let the linker-generated GOT entry stand in for that global variable. If you initialize another LLVM global variable with a relative reference to one of these "GOT equivalent" constants, LLVM will produce a @GOTPCREL expression in assembly language. For example, try compiling this LLVM code:
@external_global = external constant i32 ; LLVM treats this global variable as a "GOT equivalent", and will replace ; references to @got.external_global with GOT relocations for external_global ; when possible. @got.external_global = private unnamed_addr constant i32* @external_global @external_relative_reference = constant i32 trunc (i64 sub (i64 ptrtoint (i32** @got.external_global to i64), i64 ptrtoint (i32* @external_relative_reference to i64)) to i32)We have two different strategies for referencing symbols now, one for referencing local objects in the same binary by relative reference to their direct address, and one for referencing external objects, by relative reference to their GOT entry. In some situations, maybe you only need to support referencing local objects, such as for references between data structures that always get generated together as part of the same binary, but what if you need to handle both internal and external references? How do we know which mechanism was used at runtime? There are a few choices here, with different tradeoffs:
First of all, we could uniformly reference all symbols by GOT entry, local and external. The GOT doesn't have to exclusively be for external references; if the linker sees a @GOTPCREL reference to a local variable, it will obligingly create a GOT entry for it. Uniformly referencing all symbols by GOT allows code that follows the reference to remain branchless, since it can uniformly load the offset, then load the GOT entry to get the desired address. It also preserves the benefits that a data structure's inline storage can remain in clean memory and use four-byte offsets instead of potentially eight-byte pointers. On the other hand, it forces GOT entries to be created for local symbols that wouldn't otherwise need them, and makes following every reference require a two-load chain, whether local or external. The additional launch time and dirty cost of an additional GOT entry is negligible; however, the added GOT entries incur an indirect size cost for the data structures that demand them—we save four bytes of inline storage using a relative reference over a pointer, but that'd be offset by adding eight bytes for the otherwise unnecessary GOT entry. Furthermore, long load chains increase the latency of an operation, and going through the GOT adds a load to the chain necessary to follow a reference, since the CPU has to first load the offset, wait for the value to come from memory, then load the final address from the GOT. Since the GOT load is dependent on the offset load, the CPU can't do anything to parallelize this operation.
Contemporary CPUs have good branch predictors and depend on instruction-level parallelism for maximum performance, so if we can branch to avoid extending a load chain, it's often worth it. That suggests an alternative approach, where we distinguish local from external references somehow so we can avoid going through the GOT for local references. If our data structures are aligned to 2, 4, 8 byte boundaries, we could encode whether a reference is external by setting one of the low bits of the offset, which would normally always be zero. We can then check this bit and branch to do the extra dereference necessary to load the GOT entry only for an external symbol, something like this:
const void *resolve_indirectable_relative_offset(const int *offset_ptr) { int offset = *offset_ptr; intptr_t resolved_addr = (intptr_t)offset_ptr + (offset & ~1); // If the low bit is set, then the offset refers to a GOT entry, and we need to // load from there to get the resolved address. if (offset_value & 1) { return *(const void *const *)resolved_addr; } else { return (const void *)resolved_addr; } }This increases the code size to follow a reference, but it's likely to be faster overall for local references, since they only need to wait on a single load to resolve, and the branch is likely to get predicted by the CPU. If traversal of the data structure is infrequent, or isolated to a handful of runtime functions, the code size cost is likely to be worth the performance benefits for local references. On the other hand, if you have to follow the reference frequently in code generated everywhere, like as part of method lookup from dispatch tables, the code size hit may be unacceptable. Also, when referencing things like strings or function pointers that don't normally have alignment guarantees, there are no unused bits in the offset to use for discriminating local from nonlocal references.
It's also possible to avoid the need for external references in some situations by generating an equivalent local definition and referencing that instead. The aforementioned two kinds of object, strings and functions, are prime candidates for this. It's uncommon to link strings from other binaries, since C compilers typically give every binary its own string table. For functions, instead of referencing an external function directly, it's possible to instead reference a local thunk that jumps to the external implementation. This duplication simplifies the reference mechanism, which now only has to worry about local relative references, but eats into the size savings we get from relative referencing.
For some examples of data structures using relative references being used in practice, look at the Swift programming language runtime, which uses this technique extensively with its reflection data structures, enabling a Swift program to include a lot of runtime metadata without that metadata impacting the launch time or memory usage of the program. This technique has also been implemented in the Clang C++ compiler, allowing it to emit C++ vtables using relative references, an option Chromium uses to reduce the size, launch time, and memory cost of all of the vtables from their gigantic C++ codebase. If you're working on a language runtime in a similar vein, it's a good technique to consider adopting for your compiler-generated data structures as well.