Fun with integer division optimizations.

I recently stumbled across a post about some crazy optimization that clang does to divisions by a constant. If you aren’t interested in reading it yourself, the summary is as follows: instead of emitting the slow div instruction, the compiler multiplies by a precomputed “magic” constant (a fixed-point approximation of the reciprocal) and then shifts the result back down, because integer multiplication is far cheaper than integer division.

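The function in question is just an unsigned division by the constant 19; a minimal reconstruction (mine, not code taken from either post) would be:

/* What the compiler sees: a plain unsigned division by the constant 19.
   clang replaces the div this would naively require with a multiply by a
   "magic" constant and a few shifts, as the assembly later in this post shows. */
unsigned udiv19(unsigned x) {
    return x / 19;
}
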
I was interested in seeing just how much faster this seemingly-magic optimization was than the regular div instruction, so I set up a simple test framework:

#include <stdio.h>
#include <sys/time.h>

extern unsigned udiv19(unsigned arg);

int main(int argc, char **argv) {

    struct timeval start, finish;
    
    unsigned long long accum = 0;
    int j;

    for(j = 0; j < 1000; j++){
        gettimeofday(&start, NULL);

        int i;
        for(i = 0; i < 1000000; i++){
            udiv19(1000);
        }
        
        gettimeofday(&finish, NULL);

        accum += ((1000000L * finish.tv_sec + finish.tv_usec) - (1000000L * start.tv_sec + start.tv_usec));
    }
    
    printf("%f microseconds per 1M operations\n", (float) accum / 1000);

    return 0;

}

If you don’t feel like reading C, the algorithm is basically as follows: call udiv19 a million times between two gettimeofday calls, repeat that measurement 1,000 times, accumulate the elapsed microseconds, and print the average time per million calls.

I took the assembly from the other blog post and marked it up so that it could be assembled and linked into this scaffolding:

.intel_syntax noprefix
.text
.global udiv19
udiv19:
    mov     ecx, edi              # n, the argument (edi per the SysV ABI)
    mov     eax, 2938661835       # magic constant: ceil(2^37 / 19) - 2^32
    imul    rax, rcx              # 64-bit product n * magic
    shr     rax, 32               # q = high 32 bits of the product
    sub     edi, eax              # n - q
    shr     edi                   # (n - q) / 2
    add     eax, edi              # q + (n - q) / 2 == (n + q) / 2
    shr     eax, 4                # / 16; overall floor(n * ceil(2^37/19) / 2^37) == n / 19
    ret

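If you’re wondering where 2938661835 comes from, it’s ceil(2^37 / 19) - 2^32: the multiply, subtract, halve, and final shift together compute floor(x * ceil(2^37 / 19) / 2^37), which equals x / 19 for every 32-bit x. Here’s a small sketch (mine, not part of the benchmark; udiv19_magic is just an illustrative name) that rebuilds the constant and brute-force checks the result against a plain division:

#include <stdint.h>
#include <stdio.h>

/* Same arithmetic as the assembly above, expressed in C for checking. */
static uint32_t udiv19_magic(uint32_t x) {
    uint32_t q = (uint32_t)(((uint64_t)x * 2938661835u) >> 32); /* high 32 bits of x * magic */
    return (q + ((x - q) >> 1)) >> 4;
}

int main(void) {
    /* Reconstruct the magic constant: ceil(2^37 / 19) - 2^32. */
    unsigned long long magic = ((1ull << 37) + 18) / 19 - (1ull << 32);
    printf("magic = %llu\n", magic); /* prints 2938661835 */

    /* Exhaustively compare against plain division for every 32-bit input.
       Slow, but it terminates. */
    uint32_t x = 0;
    do {
        if (udiv19_magic(x) != x / 19) {
            printf("mismatch at %u\n", x);
            return 1;
        }
    } while (x++ != 0xFFFFFFFFu);
    printf("all 32-bit inputs agree\n");

    return 0;
}
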
I also wrote a naive version using the div instruction (this took a bit of googling to refresh my knowledge of the x86 ABI and instruction set…):

.intel_syntax noprefix
.text
.global udiv19
udiv19:
    mov     edx, 0                # div divides the 64-bit value edx:eax, so clear the high half
    mov     ecx, 19               # divisor
    mov     eax, edi              # dividend: the argument (edi per the SysV ABI)
    div     ecx                   # unsigned divide edx:eax by ecx; quotient in eax, remainder in edx
    ret

To compile:

$ gcc -o optimized main.c optimized.s 
$ gcc -o simple main.c simple.s

The results:

$ ./optimized 
1861.276001 microseconds per 1M operations
$ ./simple 
2783.295898 microseconds per 1M operations

We can clearly see a ~1.5x speedup (at least on my i5-6400). There’s a caveat, though: this benchmark mostly measures the latency of each implementation, not its throughput. Modern processors pipeline instructions and execute them out of order, so in a real application the impact of the div instruction’s latency would be drastically reduced. Still, it’s interesting to see the optimization in action and to show that it can, at least in some cases, be beneficial.
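
To make the latency/throughput distinction concrete, one option would be a variant of the benchmark in which each call consumes the previous result, forming a dependency chain that the processor cannot overlap. The sketch below is mine, not part of the measurements above (the starting value and the re-biasing constant 4000000000 are arbitrary, chosen only to keep the intermediate values in range):

#include <stdio.h>
#include <sys/time.h>

extern unsigned udiv19(unsigned arg);

int main(void) {

    struct timeval start, finish;
    volatile unsigned sink;        /* keeps the final result observable */
    unsigned x = 4000000000u;      /* arbitrary large starting value */
    int i;

    gettimeofday(&start, NULL);

    for (i = 0; i < 1000000; i++) {
        /* Each iteration depends on the previous quotient, so the divisions
           form a chain and cannot be overlapped by the processor. */
        x = udiv19(x) + 4000000000u;
    }

    gettimeofday(&finish, NULL);
    sink = x;
    (void) sink;

    printf("%ld microseconds for 1M dependent operations\n",
           (1000000L * finish.tv_sec + finish.tv_usec) -
           (1000000L * start.tv_sec + start.tv_usec));

    return 0;
}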