# Fun with integer division optimizations.

I recently stumbled across a post about a clever optimization that clang applies to division by a constant. If you aren’t interested in reading it yourself, the summary is as follows:

• Arbitrary integer division is slow.
• Division by powers of 2 is fast.
• Given a divisor n, the compiler finds some a and b such that a/2^b approximates 1/n.
• This approximation gives exact results for any 32-bit integer.

I was interested in seeing just how much faster this seemingly-magic optimization was than the regular `div` instruction, so I set up a simple test framework:

```c
#include <stdio.h>
#include <sys/time.h>

extern unsigned udiv19(unsigned arg);

int main(int argc, char **argv) {
    struct timeval start, finish;
    unsigned long long accum = 0;
    int j;

    for (j = 0; j < 1000; j++) {
        gettimeofday(&start, NULL);

        int i;
        for (i = 0; i < 1000000; i++) {
            udiv19(1000);
        }

        gettimeofday(&finish, NULL);

        accum += (1000000l * finish.tv_sec + finish.tv_usec)
               - (1000000l * start.tv_sec + start.tv_usec);
    }

    printf("%f microseconds per 1M operations\n", (float)accum / 1000);

    return 0;
}
```

If you don’t feel like reading C, the algorithm is basically as follows:

• Repeat 1,000 times:
  • Start a stopwatch.
  • Repeat 1,000,000 times:
    • Call the assembly function.
  • Stop the stopwatch and add the time elapsed to the total.
• Spit out the total time elapsed, divided by 1,000. This gives the average time for 1,000,000 calls to the assembly function.

I took the assembly from the other blog post and adapted it so it would assemble and link into this scaffolding:

```asm
.intel_syntax noprefix
.text
.global udiv19
udiv19:
        mov     ecx, edi
        mov     eax, 2938661835
        imul    rax, rcx
        shr     rax, 32
        sub     edi, eax
        shr     edi
        add     eax, edi
        shr     eax, 4
        ret
```

I also wrote a naive version using the `div` instruction (this took a bit of googling to refresh my knowledge of the x86 ABI and instruction set…):

```asm
.intel_syntax noprefix
.text
.global udiv19
udiv19:
        mov     edx, 0
        mov     ecx, 19
        mov     eax, edi
        div     ecx
        ret
```

To compile:

```
$ gcc -o optimized main.c optimized.s
$ gcc -o simple main.c simple.s
```

The results:

```
$ ./optimized
1861.276001 microseconds per 1M operations
$ ./simple
2783.295898 microseconds per 1M operations
```

We can clearly see a ~1.5x speedup (at least on my i5-6400). There’s a caveat, though: we’re only really testing the latency of each implementation, not its throughput. Modern processors pipeline instructions and execute them out of order, which in a real application would mask much of the `div` instruction’s latency. Still, it’s interesting to see the technique in action and confirm that it can pay off, at least in latency-bound code.