Optimizing is-multiple checks with modular arithmetic

Integer division and remainder are among the slowest fundamental arithmetic operations CPUs support, an order of magnitude or more slower than multiplication on many architectures, so compilers have a lot of incentive to turn them into cheaper operations. It's common for the divisor to be a constant, in which case there's a well-known transformation your C compiler probably does to turn a division into a multiplication by an binary approximation of the reciprocal of the divisor. For example, clang and gcc compile the following C code:
unsigned divide(unsigned x) {
  return x / 679;
}
into the following assembly:
divide:
  mov eax, edi
  imul rax, rax, 1619310203  # multiply by ceil(1/679 * 2**40) == 1619310203
  shr rax, 40                # divide by 2**40 via right shift
  ret
Ridiculous Fish explores this optimization in much further depth. A related common operation, the one I want to focus on here, is to test whether a number is an exact multiple of another, using the remainder operator:
bool is_multiple(unsigned x) {
  return x % 679 == 0;
}
clang and gcc leverage the same division-into-multiplication trick to optimize this operation, first performing the division via multiplication, then multiplying by the divisor and comparing the result to see if the result matches:
is_multiple:
  mov eax, edi
  imul rax, rax, 1619310203  # multiply by ceil(1/679 * 2**40) == 1619310203
  shr rax, 40                # divide by 2**40 via right shift
  imul ecx, eax, 679         # multiply the result by 679 again
  cmp edi, ecx               # is it equal to the original value?
  sete al
  ret
This is an improvement over using a division instruction to perform the remainder, but the above code sequence still involves two multiplications, one of which is a double-width multiplication (getting a 64 bit result from 32 bit operands), along with a double-width right shift. Can we do better? It turns out we can, using some tricks with modular arithmetic. Integer arithmetic in a 32- or 64-bit CPU register wraps if it overflows, effectively implementing the integers modulo 2³² or 2⁶⁴. An interesting property of integers modulo 2ⁿ is that every odd number has a modular inverse, which is another number it can be multiplied with modulo 2ⁿ to produce 1. Some examples modulo 2³²:
     3 * 2_863_311_531 ==    0x2_0000_0001 == 1 (mod 2**32)
   679 * 2_068_415_767 ==  0x147_0000_0001 == 1 (mod 2**32)
65_535 * 4_294_901_759 == 0xfffe_0000_0001 == 1 (mod 2**32)
For any number that's a multiple of the original number, multiplying by the modular inverse will divide it:
        3 * 2_863_311_531 ==     1 (mod 2**32)
        6 * 2_863_311_531 ==     2 (mod 2**32)
        9 * 2_863_311_531 ==     3 (mod 2**32)
   36_912 * 2_863_311_531 == 12304 (mod 2**32)

      679 * 2_068_415_767 ==    1 (mod 2**32)
     1358 * 2_068_415_767 ==    2 (mod 2**32)
     2037 * 2_068_415_767 ==    3 (mod 2**32)
1_180_102 * 2_068_415_767 == 1738 (mod 2**32)
Another interesting thing about multiplication by an odd number modulo 2ⁿ is that it permutes the group of numbers. If we take every number from 0 to 2ⁿ – 1 and multiply it by the same odd number modulo 2ⁿ, we will see all of those numbers in the results once, just in a different order. Since the multiples of 679 between 0 and 2³² all map to their quotients when multiplied by the inverse 2,068,415,767, that uses up the values 0 through ⌊2³²/679⌋ = 6,325,430 in the permuted set modulo 2³². This means that every non-multiple must give a result larger than any multiple produces. So we can reimplement the is_multiple function as:
bool is_multiple(unsigned x) {
  // return x % 679 == 0;
  return x * 2068415767u < 6325431u;
}
which compiles down to:
is_multiple:
  imul ecx, edi, 2068415767
  cmp ecx, 6325431
  setb al
  ret
using only one multiplication to do what LLVM and GCC do in two! (Before you ask, yes, I filed a bug to add this optimization to LLVM.)

Even numbers do not have a multiplicative inverse modulo 2ⁿ, but we can still use this trick, since they can be factored into a power of two and an odd number. For instance, 1738 = 869 × 2, and a number is a multiple of 1738 if it is a multiple of both 869 and 2. Furthermore, after dividing by the odd factor, the remaining quotient will be a multiple of the power of two if the original number was. Testing whether something is a multiple of 2 is done easily by bitwise-and of the low bits, and we can use the above trick to test whether something is a multiple of the odd factor. So:

bool is_multiple(unsigned x) {
  return x % 1738 == 0;
}
can be optimized into:
bool is_multiple(unsigned x) {
  unsigned inv = x * 148272749u;  // inverse of 869 mod 2**32
  return inv < 4942425u           // ceil(2**32/869)
      && (inv & 1) == 0;
}
Update: Greg Parker and Harold Aptroot point out that a further refinement is possible: we can rotate the low bit(s) into the high bits and do one unsigned comparison instead:
bool is_multiple(unsigned x) {
  unsigned inv = x * 148272749u;     // inverse of 869 mod 2**32
  inv = (inv >> 1u) | (inv << 31u);  // rotate by one bit
  return inv <= 2471212u;            // floor(2**32/869) >> 1
}
A reasonable follow-up question: could using the modular inverse be a better optimization for general division or remainder computations, not only the is-multiple test? Unfortunately, the answer is probably not. For exact multiples of a divisor, we've seen that multiplying by the modular inverse gives the result of the division, so if division by exact multiples is the only case we care about, then this technique could be an improvement over the reciprocal. However, for an arbitrary dividend, the modular inverse doesn't really simplify the problem. As a simple example, let's consider division by seven for four-bit integers (in other words, modulo 16). The inverse of seven modulo 16 happens to be seven. If we multiply each number from zero through fifteen by seven, and arrange the results by remainder modulo seven, we get:
Remainder: 0 1 2 3 4 5 6
0 7 14 5 12 3 10
1 8 15 6 13 4 11
2 9
The multiples of seven map neatly to their quotients, but the non-multiples don't so much. Every non-multiple with the same remainder maps to a contiguous range of values, so maybe by finding which of these ranges the result falls into, we could subtract the base value and get the quotient that way. For a small divisor like seven, we could perhaps precompute that mapping from remainder to range of values to emit a decision tree, but this wouldn't scale well to larger divisors, and is not as straightforward as multiplying by reciprocal. It might be an interesting alternative to explore for extremely limited hardware that doesn't natively support integer multiplication or performs multiplication very slowly, on which the double-width multiplication necessary for the reciprocal technique might be worth avoiding, but on desktop-class CPUs this is probably not an improvement.

View and comment on this post in the fediverse, or email me about it.