August 31, 2018

Integer division and remainder are among the slowest fundamental arithmetic operations CPUs
support, an order of magnitude or more slower than multiplication on many architectures, so
compilers have a lot of incentive to turn them into cheaper operations.
It's common for the divisor to be a constant, in which case there's a well-known
transformation your C compiler probably does to turn a division into a multiplication by an
binary approximation of the reciprocal of the divisor.
For example, unsigned divide(unsigned x) { return x / 679; }into the following assembly:

divide: mov eax, edi imul rax, rax, 1619310203 # multiply by ceil(1/679 * 2**40) == 1619310203 shr rax, 40 # divide by 2**40 via right shift retRidiculous Fish explores this optimization in much further depth. A related common operation, the one I want to focus on here, is to test whether a number is an exact multiple of another, using the remainder operator:

bool is_multiple(unsigned x) { return x % 679 == 0; }

is_multiple: mov eax, edi imul rax, rax, 1619310203 # multiply by ceil(1/679 * 2**40) == 1619310203 shr rax, 40 # divide by 2**40 via right shift imul ecx, eax, 679 # multiply the result by 679 again cmp edi, ecx # is it equal to the original value? sete al retThis is an improvement over using a division instruction to perform the remainder, but the above code sequence still involves two multiplications, one of which is a

double-widthmultiplication (getting a 64 bit result from 32 bit operands), along with a double-width right shift. Can we do better? It turns out we can, using some tricks with modular arithmetic. Integer arithmetic in a 32- or 64-bit CPU register wraps if it overflows, effectively implementing the integers modulo 2³² or 2⁶⁴. An interesting property of integers modulo 2ⁿ is that every odd number has a modular inverse, which is another number it can be multiplied with modulo 2ⁿ to produce 1. Some examples modulo 2³²:

3 * 2_863_311_531 == 0x2_0000_0001 == 1 (mod 2**32) 679 * 2_068_415_767 == 0x147_0000_0001 == 1 (mod 2**32) 65_535 * 4_294_901_759 == 0xfffe_0000_0001 == 1 (mod 2**32)For any number that's a multiple of the original number, multiplying by the modular inverse will divide it:

3 * 2_863_311_531 == 1 (mod 2**32) 6 * 2_863_311_531 == 2 (mod 2**32) 9 * 2_863_311_531 == 3 (mod 2**32) 36_912 * 2_863_311_531 == 12304 (mod 2**32) 679 * 2_068_415_767 == 1 (mod 2**32) 1358 * 2_068_415_767 == 2 (mod 2**32) 2037 * 2_068_415_767 == 3 (mod 2**32) 1_180_102 * 2_068_415_767 == 1738 (mod 2**32)Another interesting thing about multiplication by an odd number modulo 2ⁿ is that it

bool is_multiple(unsigned x) { // return x % 679 == 0; return x * 2068415767u < 6325431u; }which compiles down to:

is_multiple: imul ecx, edi, 2068415767 cmp ecx, 6325431 setb al retusing only one multiplication to do what LLVM and GCC do in two! (Before you ask, yes, I filed a bug to add this optimization to LLVM.)

Even numbers do not have a multiplicative inverse modulo 2ⁿ, but we can still use this trick, since they can be factored into a power of two and an odd number. For instance, 1738 = 869 × 2, and a number is a multiple of 1738 if it is a multiple of both 869 and 2. Furthermore, after dividing by the odd factor, the remaining quotient will be a multiple of the power of two if the original number was. Testing whether something is a multiple of 2 is done easily by bitwise-and of the low bits, and we can use the above trick to test whether something is a multiple of the odd factor. So:

bool is_multiple(unsigned x) { return x % 1738 == 0; }can be optimized into:

bool is_multiple(unsigned x) { unsigned inv = x * 148272749u; // inverse of 869 mod 2**32 return inv < 4942425u // ceil(2**32/869) && (inv & 1) == 0; }

bool is_multiple(unsigned x) { unsigned inv = x * 148272749u; // inverse of 869 mod 2**32 inv = (inv >> 1u) | (inv << 31u); // rotate by one bit return inv <= 2471212u; // floor(2**32/869) >> 1 }A reasonable follow-up question: could using the modular inverse be a better optimization for general division or remainder computations, not only the is-multiple test? Unfortunately, the answer is

probably not.For exact multiples of a divisor, we've seen that multiplying by the modular inverse gives the result of the division, so if division by exact multiples is the only case we care about, then this technique could be an improvement over the reciprocal. However, for an arbitrary dividend, the modular inverse doesn't really simplify the problem. As a simple example, let's consider division by seven for four-bit integers (in other words, modulo 16). The inverse of seven modulo 16 happens to be seven. If we multiply each number from zero through fifteen by seven, and arrange the results by remainder modulo seven, we get:

Remainder: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|

0 | 7 | 14 | 5 | 12 | 3 | 10 | |

1 | 8 | 15 | 6 | 13 | 4 | 11 | |

2 | 9 |