Optimizing global constant data structures using relative references

February 21, 2018
Building a native compiler for a programming language with rich reflection? Runtime type information, method dispatch tables, and other metadata require complex data structures full of cross references between related language entities. To reduce the size, memory usage, and launch time cost of these pointer-heavy constant data structures, you can try building them out of relative references instead of pointers. Pointers are one of C's defining features, and seemingly the simplest mechanism for building data structures, but they carry hidden costs when used in global constants; we'll explore those costs in this post and look at how to avoid them.

Even if you're not writing a compiler, understanding this optimization is a fun chance to peel back some of the mystique of C, explore a bit of the runtime machinery that makes C programs work in contemporary operating systems, and see how a compiler can make different tradeoffs when not constrained by the abstractions C provides. Most of what I'll describe here applies specifically to macOS, iOS, and Apple's other platforms, although ELF-based operating systems like Linux and FreeBSD on x86 work largely the same way. (Windows dynamic linking is very different in many ways, though this optimization still has benefits there.)

The hidden costs of global pointers

The most common way to represent a C++-style vtable is something like this C structure:
/* Declarations of the referenced types and globals, for completeness: */
typedef struct object object;
typedef struct type_info type_info;

extern const type_info object_typeinfo;
void object_method1(object *self);
void object_method2(object *self, int argument);

const struct object_vtable {
  const type_info *typeinfo;
  void (*method1)(object *self);
  void (*method2)(object *self, int argument);
  /* etc. */
} object_vtable = {
  .typeinfo = &object_typeinfo,
  .method1 = &object_method1,
  .method2 = &object_method2,
  /* etc. */
};
And it's easy to believe this is the only way, since C is the "bare metal" language, and these are the tools C gives you for representing these kinds of data structures. The object_vtable structure here is a global constant, full of pointers to other global constants, so it should be "free", right? In reality, operating systems provide a fairly elaborate runtime environment in order to make C programs work, and data structures that contain global pointers will in fact cost launch time and dirty memory before the program even enters main(). To understand why, we need to peek behind the scenes and look at how the dynamic linker works.

When a process launches, the dynamic linker memory-maps the contents of the executable file and all of the dynamic libraries it uses into the new process's address space, using the kernel's memory mapping facilities to associate regions of memory with the contents of the binary files on disk. As long as this memory isn't changed by the running program, the kernel can consider it to be "clean": if the system needs to free up memory for other purposes, it can discard these clean pages and reload them from the original binary later. Furthermore, if multiple processes launch from the same executable or dynamic libraries, the exact same clean memory pages can be shared across the address spaces of all of those processes, significantly reducing the amount of memory the entire system needs.

However, for a number of reasons, the code and data in a binary on disk can't know for certain what memory address it's going to end up mapped to in a running process. Every executable can link against any set of dynamic libraries, so a dynamic library may have to rebase, sliding to a new base address, to make room for other libraries in the process. Additionally, as a layer of security, operating systems use address space layout randomization, or ASLR, so that if a program gets exploited, attacker code can't make static assumptions about the location of other exploitable resources in the program's memory, making exploits harder to write. Because the binary on disk doesn't know what address it's going to be mapped to, when pointers appear in its global data, the dynamic linker has to slide all of those pointers to their correct values for the process. This sliding has to happen when the program is loaded, before entering main, delaying the launch of the program. It also causes the pages those pointers are on to become "dirty" from the kernel's perspective, causing the system to use more memory: dirty pages can't be shared among different processes, since the pointer values may be different in each process, and since they no longer match the contents of the binary on disk, they can no longer simply be discarded when the system needs to free up memory; they instead have to be written out to swap and reloaded from there.

I wrote a small C program for macOS to observe the impact of global pointers on memory usage and program launch time. This program measures the current time and then spawns another copy of itself, which measures the time immediately after entering main and reports the time difference in nanoseconds. The program also accepts a -stop argument, which will cause the process to suspend itself immediately after reporting its launch time, allowing us to inspect its memory usage after entering main. We can build it in three variants: one with no class records at all, to provide a baseline; one whose class records contain only non-pointer data; and one whose class records are full of global pointers.
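The full benchmark source isn't reproduced here, but the harness might look something like this sketch. The -child flag, the now_ns helper, and the use of exec to relaunch the binary are my assumptions about its structure (exec keeps the pid stable, which matches the shell output shown later); the class-record definitions selected by -DVARIATION are elided:
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Class records selected by -DVARIATION would be emitted here: none for
   VARIATION=0, integer-only records for VARIATION=1, and pointer-filled
   records (which dyld must rebase) for VARIATION=2. */

static long long now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
  return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(int argc, char **argv) {
  long long entry_time = now_ns();
  if (argc >= 3 && strcmp(argv[1], "-child") == 0) {
    /* Relaunched copy: report the time elapsed since just before exec. */
    printf("%lld nanoseconds from spawn to main() entry\n",
           entry_time - atoll(argv[2]));
    if (argc >= 4 && strcmp(argv[3], "-stop") == 0) {
      /* Suspend so memory usage can be inspected in top. */
      printf("pid %d\n", (int)getpid());
      kill(getpid(), SIGSTOP);
    }
    return 0;
  }
  /* First launch: record the time and re-exec, so the relaunched copy's
     measurement includes all of the dynamic linker's work. */
  char start_time[32];
  snprintf(start_time, sizeof start_time, "%lld", now_ns());
  execl(argv[0], argv[0], "-child", start_time,
        argc >= 2 ? argv[1] : (char *)NULL, (char *)NULL);
  return 1;
}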

We can build these three variants like this:
$ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=0 -o no_class_records
$ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=1 -o non_pointer_class_records
$ xcrun clang -O3 -fpie lots-of-global-pointers.c -DVARIATION=2 -o pointer_class_records
I ran each variant a thousand times on my computer, a 3.1GHz 2017 15" MacBook Pro, and I got these average timings:
$ average() { awk '{ sum += $0 } END { printf "%u\n", sum / NR }' }
$ (for i in $(seq 1 1000); do ./no_class_records; done) | average
1513058
$ (for i in $(seq 1 1000); do ./non_pointer_class_records; done) | average
1517911
$ (for i in $(seq 1 1000); do ./pointer_class_records; done) | average
1712659
We can see that the global pointers in the pointer_class_records variant have a small but measurable impact on its launch time—about 0.2 milliseconds, or a 13% increase over the baseline launch time of 1.5 milliseconds. We can also ask dyld, the dynamic linker on macOS, to break down all the work it does at process launch by setting the environment variable DYLD_PRINT_STATISTICS_DETAILS. On my machine, running the no_class_records and pointer_class_records variants with DYLD_PRINT_STATISTICS_DETAILS enabled produces the following output:
$ ./no_class_records
  ...
  total rebase fixups:  14
  total rebase fixups time:   0.48 milliseconds (23.4%)
  ...
$ ./pointer_class_records
  ...
  total rebase fixups:  32,783
  total rebase fixups time:   1.62 milliseconds (50.8%)
  ...
Among all the other measurements dyld reports, the "total rebase fixups" measurement makes the difference between no_class_records and pointer_class_records plain. The latter has almost exactly 32K more pointers to rebase than the baseline, accounting for most of the difference in startup time between the two variants.

Now let's look at the memory impact of these rebases. If I run each variant with the -stop flag, and inspect the stopped processes' memory usage in top, I see something like this:

$ ./no_class_records -stop
2034244 nanoseconds from spawn to main() entry
pid 73439
zsh: suspended (signal)  ./no_class_records -stop

$ ./non_pointer_class_records -stop
2147981 nanoseconds from spawn to main() entry
pid 73519
zsh: suspended (signal)  ./non_pointer_class_records -stop

$ ./pointer_class_records -stop
2221860 nanoseconds from spawn to main() entry
pid 73528
zsh: suspended (signal)  ./pointer_class_records -stop

$ top -pid 73439 -pid 73519 -pid 73528
PID    COMMAND       ...  MEM   ...
73528  pointer_clas  ...  572K  ...
73519  non_pointer_  ...  316K  ...
73439  no_class_rec  ...  316K  ...
We can see that the no_class_records and non_pointer_class_records variants have no difference in memory usage, whereas the pointer_class_records variant uses 256KB more memory, due to its global data getting dirtied by the dynamic linker's sliding.

These costs may seem small, but in an operating system with frequently-used dynamic libraries that get loaded into almost every process, and executables that have many instances of the same binary running at the same time, the launch time and dirty page costs add up. On mobile platforms, where memory is constrained and swap is usually unavailable, dirty pages directly affect how many processes and how much user data can be kept in memory, and small-seeming changes in load times can have a big impact on the perceived responsiveness and performance of the platform.

Using relative references to build position-independent data structures

How can we represent these reference-heavy data structures in a way that doesn't pay the load time and dirty page costs of global pointers? For inspiration, we can look at how compilers emit position-independent code, or PIC, which is machine code that behaves the same way regardless of where it's loaded in memory. Machine code is full of references to represent branches, function calls, and reads or writes of global variables. Most contemporary CPU instruction sets let these operations be represented with PC-relative addressing ("PC" being short for "program counter", the address of the currently-executing instruction), meaning the thing being referenced is named by its relative offset from the current instruction rather than by a pointer to its absolute address. Whereas absolute pointers change values when the binary gets mapped into memory at a different base address, the relative distances between data structures within the binary do not.
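For example, on x86-64, a position-independent load of a global defined in the same binary needs no load-time fixup at all, since the linker can compute the PC-relative offset once when the binary is built (Intel syntax, with a placeholder symbol):
_load_local_global:
  mov eax, _some_global[rip]  ; the operand encodes the distance from the
                              ; next instruction to _some_global, a constant
                              ; that doesn't change when the binary slides
  ret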

We can make data structures position-independent the same way code does, by having structures reference other structures by their distance from each other within the binary instead of by their absolute addresses. For example, the object_vtable from the beginning of the article could look something like this instead:

#define OFFSET(target, source) \
  ((intptr_t)&target - (intptr_t)&source)

const struct object_vtable {
  int typeinfo_offset;
  int method1_offset;
  int method2_offset;
  /* etc. */
} object_vtable = {
  .typeinfo_offset = OFFSET(object_typeinfo, object_vtable.typeinfo_offset),
  .method1_offset = OFFSET(object_method1, object_vtable.method1_offset),
  .method2_offset = OFFSET(object_method2, object_vtable.method2_offset),
  /* etc. */
};
And since these relative distances between objects won't change regardless of where the binary gets mapped into memory, the pages they get loaded into will remain clean. Also, since executables are still normally limited to 2GB even on 64-bit platforms (unless they opt into special "large code models" that increase code size in order to allow the linker to resolve 64-bit addresses), there's an opportunity to reduce size too, since a relative reference to another global in the same binary can be four bytes instead of eight, halving the size of many data structures.
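Reading such a structure back at runtime just reverses the subtraction: add the stored offset to the address of the offset field itself. A minimal sketch, with a helper name of my own choosing:
/* Resolve a relative reference by adding the stored offset to the address
   of the offset field itself. */
static inline const void *resolve_relative(const int *offset_ptr) {
  return (const char *)offset_ptr + *offset_ptr;
}
/* e.g. const type_info *ti = resolve_relative(&object_vtable.typeinfo_offset); */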

Unfortunately, C and C++ don't consider expressions involving subtracting global addresses to be constant expressions. If we try to compile something like the offset-based object_vtable implementation above, we'll get an error:

$ xcrun clang foo.c
foo.c:16:22: error: initializer element is not a compile-time constant
  .typeinfo_offset = OFFSET(object_typeinfo, object_vtable.typeinfo_offset),
                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
So to take advantage of this technique, we have to go deeper. Regardless of what C says, the underlying assembler and linker support this on most contemporary platforms, as does LLVM. Using the assembler on macOS, a data structure with relative references can be generated like this:
.section __TEXT, __const    ; Put the following into the constants section of the binary
.global _object_vtable      ; Export the C symbol 'object_vtable'
                            ; (C symbols begin with underscores in macOS assembly)
_object_vtable:             ; Define object_vtable
  .long _object_typeinfo - .    ; Emit the distance between object_typeinfo and here
                                ; (A '.' represents the current address in assembly)
  .long _object_method1 - .
  .long _object_method2 - .
The same thing can be expressed in LLVM with a somewhat verbose constant expression:
%object_vtable = type { i32, i32, i32 }
%object_typeinfo = type opaque
%object_method = type opaque

@object_typeinfo = external constant %object_typeinfo
@object_method1 = external constant %object_method
@object_method2 = external constant %object_method
@object_vtable = constant %object_vtable {
  i32 trunc (i64 sub
    (i64 ptrtoint (%object_typeinfo* @object_typeinfo to i64),
     i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 0) to i64)) to i32),
  i32 trunc (i64 sub
    (i64 ptrtoint (%object_method* @object_method1 to i64),
     i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 1) to i64)) to i32),
  i32 trunc (i64 sub
    (i64 ptrtoint (%object_method* @object_method2 to i64),
     i64 ptrtoint (i32* getelementptr (%object_vtable, %object_vtable* @object_vtable, i32 0, i32 2) to i64)) to i32)
}

One tradeoff to using relative references is that they do require slightly more generated code on average to dereference than absolute pointers, leading to small performance and code size costs. To explore these costs, here's another small macOS C program that microbenchmarks absolute and relative references by creating a large number of vtables, whose method entries are either relative offsets or absolute pointers, and measuring the time taken to call through them all. We can build both variations like this:

$ xcrun clang -O3 -fpie invoking-relative-references.c -DVARIATION=0 -o relative
$ xcrun clang -O3 -fpie invoking-relative-references.c -DVARIATION=1 -o absolute

And then run them:

$ ./absolute
850393727 nanoseconds to invoke methods
$ ./relative
867976645 nanoseconds to invoke methods
This shows about a 2% cost for using relative references (if we're doing absolutely nothing but indirectly calling methods, which is an unlikely real-world workload). We can also compare other interesting things about the binaries, like total size:
$ ls -l absolute relative
-rwxr-xr-x  1 joe  staff  270872 Feb 14 20:23 absolute*
-rwxr-xr-x  1 joe  staff  139792 Feb 14 20:45 relative*
The version with absolute pointers is almost twice the size, due to the vtables containing eight-byte pointers instead of four-byte relative offsets. We can also compare code size only, minus the data:
$ objdump -section-headers absolute | grep __text                    
  0 __text        000000e7 0000000100000e50 TEXT 
$ objdump -section-headers relative | grep __text
  0 __text        0000010d 0000000100000e20 TEXT 
The code size for following relative references is higher by 38 bytes, 16% more than the absolute variant. All of these costs and benefits are exaggerated by this being an artificial benchmark that does nothing but follow references, but they give a sense of the tradeoffs involved. A real program will generally be doing a lot more between method calls or pointer dereferences, giving contemporary CPUs the opportunity to mask this cost by doing other operations simultaneously with the extra reference math. Nonetheless, for data structures that require maximum performance, the launch-time, dirty memory, and size costs of absolute pointers may still be worth paying to avoid the runtime cost. Otherwise, smaller is generally better, and code and static data are generally cheaper than dirty data at the system level, so optimizing to reduce the size of data, and particularly the amount of dirty data, is a good default stance.
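To make that code size difference concrete: the extra instructions are the address arithmetic needed to turn a stored offset back into an entry point. Glossing over the exact instructions clang chooses, the two call sequences look roughly like this (Intel syntax, assuming rdi holds the vtable's address):
; Calling through an absolute vtable entry:
  call qword ptr [rdi + 8]        ; load the method pointer and call it

; Calling through a relative vtable entry:
  movsxd rax, dword ptr [rdi + 4] ; load and sign-extend the 4-byte offset
  lea rcx, [rdi + 4]              ; take the address of the entry itself
  add rax, rcx                    ; entry address + offset = method address
  call rax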

Leveraging the dynamic linker's data structures for references across libraries

Relative references can eliminate the launch-time and dirty memory overhead of global constant references when the source and target of the reference are both in the same dynamic library or executable. On the other hand, when a data structure needs to reference a symbol that comes from another dynamic library, some work at launch time is inevitable; after all, the whole point of the dynamic linker is to resolve references between independent binary files. Nonetheless, we can minimize the launch time and dirty memory cost of these references by reusing data structures that already get built to make dynamic linking work. When the compile-time linker builds an executable or dynamic library from .o files, and the code in those files references global variables from other dynamic libraries, the linker builds a global offset table, or GOT, with an entry for each external variable the binary references. You can think of each GOT entry as being an extra global variable that the dynamic linker fills in with the resolved address of its associated original variable. (This is in fact how you can work with the GOT from LLVM, which we'll get back to in a minute.) When code tries to read or write the variable, it first has to load the value of the GOT entry variable to get the variable's address. For instance, when the compiler generates assembly for the following C code:
extern int some_variable;

int get_value_of_some_variable(void) {
  return some_variable;
}
it'll produce something like this assembly language:
_get_value_of_some_variable:
  mov rax, _some_variable@GOTPCREL[rip] ; load the address of some_variable from the GOT
  mov eax, [rax]                       ; load the value of some_variable
  ret                                  ; return it
The @GOTPCREL notation instructs the assembler to calculate the relative offset from the current instruction to a variable's entry in the GOT. We can also build data structures that refer to external symbols using relative references to their global offset table entries. This gives most of the same benefits as a direct relative reference to a symbol in the same binary. A global pointer will end up dirtying the memory page it resides in, whether it points at a local or external symbol. Although the GOT itself gets dirtied when the dynamic linker fills it in with resolved addresses, the GOT entries appear together in one contiguous area of memory; when pointers are instead scattered through our constant data structures, more memory pages get dirtied overall than if the pointers are concentrated together in the GOT. And since a GOT entry gets created for an external symbol anyway when it gets referenced from code, there's no additional cost if we reuse it for our data structures.

The same @GOTPCREL syntax works for data as well as code in assembly language:

.global _external_relative_reference
_external_relative_reference:
  ; Note that GOTPCREL measures the distance from the address at the end of the
  ; four-byte value rather than the beginning, since x86-64 does
  ; PC-relative addressing relative to the address of the following instruction.
  ; Adding 4 compensates for this offset.
  .long _external_global@GOTPCREL+4
LLVM doesn't directly let you manipulate the GOT, but if you define a private global constant containing the address of another variable, and give that constant the unnamed_addr attribute (which tells LLVM that the address of the constant is not meaningful to the program, only its contents, allowing it to coalesce it with other constants containing the same value), then LLVM will cleverly let the linker-generated GOT entry stand in for that global variable. If you initialize another LLVM global variable with a relative reference to one of these "GOT equivalent" constants, LLVM will produce a @GOTPCREL expression in assembly language. For example, try compiling this LLVM code:
@external_global = external constant i32

; LLVM treats this global variable as a "GOT equivalent", and will replace
; references to @got.external_global with GOT relocations for external_global
; when possible.
@got.external_global = private unnamed_addr constant i32* @external_global

@external_relative_reference = constant i32 trunc (i64 sub
  (i64 ptrtoint (i32** @got.external_global to i64),
   i64 ptrtoint (i32* @external_relative_reference to i64)) to i32)
We now have two different strategies for referencing symbols: referencing local objects in the same binary by a relative reference to their direct address, and referencing external objects by a relative reference to their GOT entry. In some situations you may only need to support referencing local objects, such as for references between data structures that always get generated together as part of the same binary, but what if you need to handle both internal and external references? How does code following a reference know at runtime which mechanism was used? There are a few choices here, with different tradeoffs:

First of all, we could uniformly reference all symbols by GOT entry, local and external. The GOT doesn't have to be exclusively for external references; if the linker sees a @GOTPCREL reference to a local variable, it will obligingly create a GOT entry for it. Uniformly referencing all symbols by GOT allows code that follows the reference to remain branchless, since it can uniformly load the offset, then load the GOT entry to get the desired address. It also preserves the benefits that a data structure's inline storage can remain in clean memory and use four-byte offsets instead of potentially eight-byte pointers. On the other hand, it forces GOT entries to be created for local symbols that wouldn't otherwise need them, and makes following every reference require a two-load chain, whether local or external. The launch time and dirty memory cost of an additional GOT entry is negligible; however, the added GOT entries incur an indirect size cost for the data structures that demand them: we save four bytes of inline storage using a relative reference over a pointer, but that saving is offset by the eight bytes for the otherwise unnecessary GOT entry. Furthermore, long load chains increase the latency of an operation, and going through the GOT adds a load to the chain necessary to follow a reference, since the CPU has to first load the offset, wait for the value to come from memory, then load the final address from the GOT. Since the GOT load is dependent on the offset load, the CPU can't do anything to parallelize this operation.
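In code, resolving one of these uniformly GOT-indirected references comes down to two dependent loads, something like this sketch (the helper name is mine):
/* Resolve a relative reference that points at a GOT entry: load the offset
   and compute the GOT entry's address, then load the resolved pointer the
   dynamic linker stored there. The second load can't begin until the first
   completes. */
static const void *resolve_got_relative(const int *offset_ptr) {
  const void *const *got_entry =
      (const void *const *)((const char *)offset_ptr + *offset_ptr);
  return *got_entry;
}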

Contemporary CPUs have good branch predictors and depend on instruction-level parallelism for maximum performance, so if we can branch to avoid extending a load chain, it's often worth it. That suggests an alternative approach, where we distinguish local from external references somehow so we can avoid going through the GOT for local references. If our data structures are aligned to 2-, 4-, or 8-byte boundaries, we can encode whether a reference is external by setting the lowest bit of the offset, which would normally always be zero. We can then check this bit and branch to do the extra dereference necessary to load the GOT entry only for an external symbol, something like this:

const void *resolve_indirectable_relative_offset(const int *offset_ptr) {
  int offset = *offset_ptr;
  intptr_t resolved_addr = (intptr_t)offset_ptr + (offset & ~1);
  // If the low bit is set, then the offset refers to a GOT entry, and we need to
  // load from there to get the resolved address.
  if (offset & 1) {
    return *(const void *const *)resolved_addr;
  } else {
    return (const void *)resolved_addr;
  }
}
This increases the code size needed to follow a reference, but it's likely to be faster overall for local references, since they only need to wait on a single load to resolve, and the branch is likely to get predicted by the CPU. If traversal of the data structure is infrequent, or isolated to a handful of runtime functions, the code size cost is likely to be worth the performance benefits for local references. On the other hand, if you have to follow the reference frequently in code generated everywhere, such as in method lookup through dispatch tables, the code size hit may be unacceptable. Also, when referencing things like strings or function pointers that don't normally have alignment guarantees, there are no unused bits in the offset to use for discriminating local from nonlocal references.

It's also possible to avoid the need for external references in some situations by generating an equivalent local definition and referencing that instead. The aforementioned two kinds of object, strings and functions, are prime candidates for this. It's uncommon to link strings from other binaries, since C compilers typically give every binary its own string table. For functions, instead of referencing an external function directly, it's possible to instead reference a local thunk that jumps to the external implementation. This duplication simplifies the reference mechanism, which now only has to worry about local relative references, but eats into the size savings we get from relative referencing.
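Such a thunk can be as simple as a single jump, sketched here in assembly with placeholder symbol names:
; A local, relatively-referenceable stand-in for an external function. Data
; structures can refer to _external_function_thunk with a plain relative
; reference; the jump itself gets bound to the real implementation through
; the normal dynamic-linking machinery for code.
_external_function_thunk:
  jmp _external_function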

When does using this technique make sense?

Finally, we always have the choice of not using relative references at all, and just building data structures with pointers. Now that we've established the costs and potential savings of relative references compared to absolute pointers, let's think about when it makes sense to use them. Although global pointers dirty memory and require launch-time work to initialize, once they're set up, they're still the absolute cheapest kind of reference to resolve, for either local or external references. For data structures that are frequently accessed on hot paths, paying those one-time costs may be worth it. Relative references in global data are also not expressible in standard C or C++, making them difficult to adopt and maintain in a C or C++ program. Furthermore, the two major benefits of relative references, avoiding launch time and dirty memory costs, only apply to data that can be memory-mapped from disk. This is great for data structures that get compiled into the binary, or that can be memory-mapped from data files, but data structures that are built dynamically by the running program are going to have to allocate memory and spend time initializing it no matter what. Relative referencing may still be useful as a size optimization for dynamically-generated data, but it doesn't offer the same major system performance advantages for dynamic data that it does for static global data. On the other hand, if a dynamic data structure is going to be memory-mapped or copied into other processes, serialized to disk and reloaded later, or sent to GPUs or other compute devices, relative referencing can still be a benefit, since the position-independent nature of relative references means they remain valid even when a memory region is transported to a different process, device, or file.

For examples of relative references in practice, look at the Swift programming language runtime, which uses this technique extensively in its reflection data structures, enabling a Swift program to include a lot of runtime metadata without that metadata impacting the launch time or memory usage of the program. The technique has also been implemented in the Clang C++ compiler, which can optionally emit C++ vtables using relative references, an option Chromium uses to reduce the size, launch time, and memory cost of all of the vtables in their gigantic C++ codebase. If you're working on a language runtime in a similar vein, it's a good technique to consider adopting for your compiler-generated data structures as well.