Integer division and remainder are among the slowest fundamental arithmetic operations CPUs
support, an order of magnitude or more slower than multiplication on many architectures, so
compilers have a lot of incentive to turn them into cheaper operations.
It's common for the divisor to be a constant, in which case there's a well-known
transformation your C compiler probably does to turn a division into a multiplication by an
binary approximation of the reciprocal of the divisor.
For example,
clang and
gcc compile the following C code:
unsigned divide(unsigned x) {
return x / 679;
}
into the following assembly:
divide:
mov eax, edi
imul rax, rax, 1619310203 # multiply by ceil(1/679 * 2**40) == 1619310203
shr rax, 40 # divide by 2**40 via right shift
ret
Ridiculous Fish
explores this optimization in much further depth. A related common operation, the one I
want to focus on here, is to test whether a number is an exact multiple of
another, using the remainder operator:
bool is_multiple(unsigned x) {
return x % 679 == 0;
}
clang and
gcc leverage the same division-into-multiplication trick to optimize
this operation, first performing the division via multiplication, then
multiplying by the divisor and comparing the result to see if the result
matches:
is_multiple:
mov eax, edi
imul rax, rax, 1619310203 # multiply by ceil(1/679 * 2**40) == 1619310203
shr rax, 40 # divide by 2**40 via right shift
imul ecx, eax, 679 # multiply the result by 679 again
cmp edi, ecx # is it equal to the original value?
sete al
ret
This is an improvement over using a division instruction to perform the
remainder, but the above code sequence still involves two multiplications, one
of which is a
double-width
multiplication (getting a 64 bit result from 32
bit operands), along with a double-width right shift. Can we do better? It
turns out we can, using some tricks with modular arithmetic. Integer arithmetic
in a 32- or 64-bit CPU register wraps if it overflows, effectively implementing
the integers modulo 2³² or 2⁶⁴. An interesting property of integers modulo 2ⁿ is
that every odd number has a
modular inverse,
which is another number it can be multiplied with modulo 2ⁿ to produce 1. Some examples modulo 2³²:
3 * 2_863_311_531 == 0x2_0000_0001 == 1 (mod 2**32)
679 * 2_068_415_767 == 0x147_0000_0001 == 1 (mod 2**32)
65_535 * 4_294_901_759 == 0xfffe_0000_0001 == 1 (mod 2**32)
For any number that's a multiple of the original number, multiplying by the modular inverse will
divide it:
3 * 2_863_311_531 == 1 (mod 2**32)
6 * 2_863_311_531 == 2 (mod 2**32)
9 * 2_863_311_531 == 3 (mod 2**32)
36_912 * 2_863_311_531 == 12304 (mod 2**32)
679 * 2_068_415_767 == 1 (mod 2**32)
1358 * 2_068_415_767 == 2 (mod 2**32)
2037 * 2_068_415_767 == 3 (mod 2**32)
1_180_102 * 2_068_415_767 == 1738 (mod 2**32)
Another interesting thing about multiplication by an odd number modulo 2ⁿ is that it
permutes the group of numbers. If we take every number from 0 to 2ⁿ – 1 and multiply it by
the same odd number modulo 2ⁿ, we will see all of those numbers in the results
once, just in a different order. Since the multiples of 679 between 0 and 2³²
all map to their quotients when multiplied by the inverse 2,068,415,767, that
uses up the values 0 through ⌊2³²/679⌋ = 6,325,430 in the permuted set modulo 2³². This
means that every non-multiple must give a result larger than any multiple
produces. So we can reimplement the
is_multiple function as:
bool is_multiple(unsigned x) {
// return x % 679 == 0;
return x * 2068415767u < 6325431u;
}
which compiles down to:
is_multiple:
imul ecx, edi, 2068415767
cmp ecx, 6325431
setb al
ret
using only one multiplication to do what LLVM and GCC do in two! (Before you ask, yes, I filed
a bug to add this optimization to LLVM.)
Even numbers do not have a multiplicative inverse modulo 2ⁿ, but we can still use this trick,
since they can be factored into a power of two and an odd number. For instance,
1738 = 869 × 2, and a number is a multiple of 1738 if it is a multiple of both
869 and 2. Furthermore, after dividing by the odd factor, the remaining quotient will be
a multiple of the power of two if the original number was. Testing whether
something is a multiple of 2 is done easily by bitwise-and of the low bits, and
we can use the above trick to test whether something is a multiple of the odd
factor. So:
bool is_multiple(unsigned x) {
return x % 1738 == 0;
}
can be optimized into:
bool is_multiple(unsigned x) {
unsigned inv = x * 148272749u; // inverse of 869 mod 2**32
return inv < 4942425u // ceil(2**32/869)
&& (inv & 1) == 0;
}
Update: Greg Parker
and
Harold Aptroot
point out that a further refinement is possible: we can rotate the low bit(s) into the high bits
and do one unsigned comparison instead:
bool is_multiple(unsigned x) {
unsigned inv = x * 148272749u; // inverse of 869 mod 2**32
inv = (inv >> 1u) | (inv << 31u); // rotate by one bit
return inv <= 2471212u; // floor(2**32/869) >> 1
}
A reasonable follow-up question: could using the modular inverse be a better optimization for
general division or remainder computations, not only the is-multiple test? Unfortunately, the
answer is
probably not.
For exact multiples of a divisor, we've seen that multiplying by the
modular inverse gives the result of the division, so if division by exact multiples is the only
case we care about, then this technique could be an improvement over the
reciprocal. However, for an arbitrary dividend, the modular inverse doesn't
really simplify the problem. As a simple example, let's consider division by
seven for four-bit integers (in other words, modulo 16). The inverse of seven
modulo 16 happens to be seven. If we multiply each number from zero through
fifteen by seven, and arrange the results by remainder modulo seven, we
get:
Remainder: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| 0 | 7 | 14 | 5 | 12 | 3 | 10 |
| 1 | 8 | 15 | 6 | 13 | 4 | 11 |
| 2 | 9 |
The multiples of seven map neatly to their quotients, but the non-multiples
don't so much. Every non-multiple with the same remainder maps to a contiguous
range of values, so maybe by finding which of these ranges the result falls into,
we could subtract the base value and get the quotient that way. For a small
divisor like seven, we could perhaps precompute that mapping from remainder to
range of values to emit a decision tree, but this wouldn't scale well to larger divisors, and
is not as straightforward as multiplying by reciprocal. It might be an interesting alternative
to explore for extremely limited hardware that doesn't natively support integer multiplication
or performs multiplication very slowly, on which the double-width multiplication necessary for
the reciprocal technique might be worth avoiding, but on desktop-class CPUs this is probably
not an improvement.