Data-oriented design for 8-bit CPUs: SoA lookup tables

Performance-oriented developers usually throw around "data-oriented design" when talking about optimizing for cutting-edge CPUs and GPUs with advanced SIMD units, but some of their tricks are surprisingly effective even for programming lowly retro architectures like the 6502. In particular, the "struct of arrays", or SoA, data layout technique can be effective at optimizing code that works with larger integer types or data structures on these limited architectures.

An 8-bit processor usually only operates on 8 bits at a time, but even back in the day, we wanted programs that work with larger numbers and keep tables of those larger numbers in memory. We can do this the obvious way you would on a contemporary computer, storing each 16-bit value in two contiguous bytes of memory, one after the other, like you'd get if you wrote a global array literal in C:

const uint16_t funny_numbers[] = {
  69,
  219,
  420,
  6969,
  20721,
  31337,
};

Now let's consider an operation that does a lookup into the table:

uint8_t encode_funny_number(uint8_t index) {
  uint16_t value = funny_numbers[index];
  return (value & 0xFF) ^ (value >> 8);
}

To compile this down for a 6502, we have to compute the address of the index in the table, then work on one byte at a time, advancing the index as we go:

encode_funny_number:
  asl a ; multiply by two to get the byte offset into the table
  tax ; move the offset from accumulator register A to index register X
  lda funny_numbers,X ; fetch the first byte into register A
  inx ; advance the index to the next byte
  eor funny_numbers,X ; xor the next byte into register A
  rts ; return from the function

You could think of the 16-bit number as being a "struct" of two eight-bit bytes, and the lookup table as being an "array of structs", or AoS—each entry is stored in a contiguous block in memory immediately after the entry before it. But we can also store the table as a "struct of arrays", where each part of the struct is in a separate array. It might be weird to think of an integer this way, but we could have one array for the low bytes of each entry, followed by a separate array for the high bytes of each entry:

const uint8_t funny_numbers_lo[] = {
  69 & 0xFF,
  219 & 0xFF,
  420 & 0xFF,
  6969 & 0xFF,
  20721 & 0xFF,
  31337 & 0xFF,
};

const uint8_t funny_numbers_hi[] = {
  69 >> 8,
  219 >> 8,
  420 >> 8,
  6969 >> 8,
  20721 >> 8,
  31337 >> 8,
};

Even though the values are split up and spread into two tables, the index into each table is now a constant byte offset, so we can implement the lookup with much less index manipulation, saving two whole instructions, and four cycles, over the AoS version:

encode_funny_number:
  tax ; move the offset from accumulator register A to index register X
  ; we don't need to multiply A by 2 anymore!
  lda funny_numbers_lo,X ; fetch the first byte into register A
  ; we don't need to increment X anymore!
  eor funny_numbers_hi,X ; xor the next byte into register A
  rts ; return from the function

View and comment on this post in the fediverse, or email me about it.