I recently stumbled across a post about some crazy optimization that clang does to divisions by a constant. If you aren’t interested in reading it yourself, the summary is as follows:

- Arbitrary integer division is slow.
- Division by powers of 2 is fast.
- Given a divisor \(n\), the compiler finds some \(a, b\) such that \(a/2^b\) approximates \(1/n\).
- This approximation gives exact results for any 32-bit integer.

I was interested in seeing just how much faster this seemingly-magic optimization was than the regular `div`

instruction, so I set up a simple test framework:

```
#include <stdio.h>
#include <sys/time.h>
extern unsigned udiv19(unsigned arg);
int main(int argc, char **argv) {
struct timeval start, finish;
unsigned long long accum = 0;
int j;
for(j = 0; j < 1000; j++){
gettimeofday(&start, NULL);
int i;
for(i = 0; i < 1000000; i++){
udiv19(1000);
}
gettimeofday(&finish, NULL);
accum += ((1000000l * finish.tv_sec + finish.tv_usec) - (1000000l * start.tv_sec + start.tv_usec));
}
printf("%f microseconds per 1M operations\n", (float) accum / 1000);
return 0;
}
```

If you don’t feel like reading C, the algorithm is basically as follows:

- Repeat 1,000 times:
- Start a stopwatch.
- Repeat 1,000,000 times:
- Call the assembly function.

- Stop the stopwatch and add the time elapsed to the total.

- Spit out the total time elapsed, divided by 1,000. This gives the average time for 1,000,000 calls to the assembly function.

I took the assembly from the other blog post and marked it up to be able to compile it into this scaffolding:

```
.intel_syntax noprefix
.text
.global udiv19
udiv19:
mov ecx, edi
mov eax, 2938661835
imul rax, rcx
shr rax, 32
sub edi, eax
shr edi
add eax, edi
shr eax, 4
ret
```

I also wrote a naive version using the `div`

instruction (this took a bit of googling to refresh my knowledge of the x86 ABI and instruction set…):

```
.intel_syntax noprefix
.text
.global udiv19
udiv19:
mov edx, 0
mov ecx, 19
mov eax, edi
div ecx
ret
```

To compile:

```
$ gcc -o optimized main.c optimized.s
$ gcc -o simple main.c simple.s
```

The results:

```
$ ./optimized
1861.276001 microseconds per 1M operations
$ ./simple
2783.295898 microseconds per 1M operations
```

We can clearly see a ~1.5x speedup (at least on my i5-6400). However, there’s a caveat: we’re only really testing the *latency* of the implementation, not its *throughput*. Modern processors have sophisticated ways to increase performance by pipelining instructions and executing them out of order. In a real application, this would drastically reduce the impact of the *div* instruction’s latency. However, it’s still interesting to see this optimization technique and prove that it theoretically *could* be beneficial, maybe, in some particular instances.