I previously looked at x86 code generation quality and found that GCC does some interesting high-level optimisations but, in that case, any performance benefit was lost in poor quality code generation. In this article I’m going to look at ARM assembly instead.

Compiling the Fibonacci function in C from last time we obtain times ranging from 2.4s to 6.6s for fib(40) using –O2 to –O0. Interestingly, using –O3 actually worsens performance over –O2. As before, inspection of the generated assembly language shows that GCC employs nifty high-level optimisations when given the –O2 option.

The simplest implementation of our Fibonacci function in ARM assembly is perhaps:

fib:

cmp r0, #2

movlt pc, lr

stmfd sp!, {r1, r2, lr}

sub r1, r0, #2

sub r0, r0, #1

bl fib

mov r2, r0

mov r0, r1

bl fib

add r0, r0, r2

ldmfd sp!, {r1, r2, pc}

These 11 instructions are almost identical to the output generated by –O1 except we have been more frugal in order to avoid having to save and restore R3. This takes 3.9s to run.

Perhaps the most obvious optimisation is to inline the initial test (if n<2 then n else …) and then skip it when recursing:

fib2:

cmp r0, #2

bxlt lr

fib2i:

stmfd sp!, {r1, r2, lr}

sub r1, r0, #2

sub r0, r0, #1

cmp r0, #2

blge fib2i

mov r2, r0

mov r0, r1

cmp r0, #2

blge fib2i

add r0, r0, r2

ldmfd sp!, {r1, r2, pc}

This immediately reduces the time taken to 1.975, over 20% faster than any of the C solutions. So with very little effort we have written assembly by hand that is both shorter and faster than the assembly generated by GCC.

Let’s take a look at that high-level optimisation that GCC did. With –O2, GCC generates 17 instructions:

fib:

cmp r0, #1

stmfd sp!, {r4, r5, r6, lr}

mov r6, r0

ble .L4

mov r4, r0

mov r5, #0

.L3:

sub r0, r4, #1

bl fib

sub r4, r4, #2

cmp r4, #1

add r5, r5, r0

bgt .L3

and r6, r6, #1

.L2:

add r0, r5, r6

ldmfd sp!, {r4, r5, r6, pc}

.L4:

mov r5, #0

b .L2

This is equivalent to the following:

let rec loop(r4, r5, r6) =

r5 += fib(r4-1)

if r4>3 then loop(r4-2, r5, r6) else r5+(r6 & 1)

let fib(n) =

if n <= 1 then n else

loop(n, 0, n)

We can rewrite this high-level code in assembly by hand as 15 instructions:

fib3:

cmp r0, #1

bxle lr

stmfd sp!, {r1, r2, r3, lr}

mov r1, r0

mov r2, #0

mov r3, r0

loop:

sub r0, r1, #1

bl fib3

add r2, r2, r0

cmp r1, #3

subgt r1, r1, #2

bgt loop

and r3, r3, #1

add r0, r2, r3

ldmfd sp!, {r1, r2, r3, pc}

Furthermore, whereas the C code took 2.4s this hand-written assembly takes just 1.9s. This is probably because the assembly generated by GCC takes 8 instructions to implement the identity function when n<=1 whereas our solution takes just 2 instructions.

GCC’s choice of high-level optimisation is interesting. Looking at the problem, the most obvious high-level optimisation to me is inlining the recursive calls. This is particularly beneficial because it results to two identical calls to fib(n-3) in the general case and that common subexpression can be factored out. The following assembly does this and runs in just 38ms:

fib5:

cmp r0, #4

bge fib5mid

cmp r0, #2

moveq r0, #1

movgt r0, #2

bx lr

fib5mid:

stmfd sp!, {r1, r2, lr}

mov r1, r0

sub r0, r1, #4

bl fib5

mov r2, r0

sub r0, r1, #3

bl fib5

add r2, r2, r0

cmp r1, #3

sublt r0, r1, #1

addlt r0, r0, r2

blt fib5end

add r2, r2, r0

sub r0, r1, #2

bl fib5

add r0, r0, r2

fib5end:

ldmfd sp!, {r1, r2, pc}

So it seems the folklore wisdom that it is impossible to beat the assembly generated by a modern C compiler simply isn’t true, at least in this case.