
(1)

Optimizing ARM Assembly

Computer Organization and Assembly Languages
Yung-Yu Chuang

with slides by Peng-Sheng Chen

(2)

Optimization

• Compilers do perform optimization, but they have blind spots. There are some optimization tools that you can't explicitly use by writing C, for example:
– Instruction scheduling
– Register allocation
– Conditional execution
You have to use hand-written assembly to optimize critical routines.

• Use ARM9TDMI as the example, but the rules apply to all ARM cores.
• Note that the code is sometimes in armasm format, not gas.

(3)

ARM optimization

• Utilize ARM ISA's features
– Conditional execution
– Multiple register load/store
– Scaled register operand
– Addressing modes

(4)

Instruction scheduling

• ARM9 pipeline
(Pipeline diagram; surviving labels: load/store, load/store 8/16-bit data)
• Hazard/Interlock: if an instruction requires the result of the previous instruction and that result is not yet available, then the processor stalls.

(5)

Instruction scheduling

• No hazard, 2 cycles

• One-cycle interlock

(Pipeline diagram; surviving labels: stall, bubble)

(6)

Instruction scheduling

• One-cycle interlock, 4 cycles

; no effect on performance

(7)

Instruction scheduling

• Branch takes 3 cycles due to stalls

(8)

Scheduling of load instructions

• Loads occur frequently in compiled code, accounting for approximately 1/3 of all instructions. Careful scheduling of loads can avoid stalls.

(9)

Scheduling of load instructions

2-cycle stall. Total 11 cycles for a character.
It can be avoided by preloading and unrolling.
The key is to do some work while awaiting data.

(10)

Load scheduling by preloading

• Preloading: load the data required for the loop at the end of the previous loop, rather than at the beginning of the current loop.
• Since loop i is loading data for loop i+1, there is always a problem with the first and last loops. For the first loop, insert an extra load outside the loop. For the last loop, be careful not to read any data. This can be done effectively with conditional execution.
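The preloading pattern can be sketched at the C level. This sketch assumes the loop being optimized lowercases a NUL-terminated string one character at a time; the routine and the name `str_tolower_preload` are illustrative assumptions, not taken from the slides. The first character is loaded before the loop, and each iteration loads the next character before finishing work on the current one. The load is skipped on the last iteration, which is the role conditional execution plays in assembly.

```c
/* Hypothetical example routine: lowercase a NUL-terminated string. */
void str_tolower_preload(char *out, const char *in)
{
    /* Extra load outside the loop, for the first iteration. */
    unsigned char c = (unsigned char)*in++;

    for (;;) {
        unsigned char cur = c;

        /* Preload the data for the next iteration. On the last
           iteration (cur is the terminator) nothing is read. */
        if (cur != 0)
            c = (unsigned char)*in++;

        /* Work on the current character while the load is in flight. */
        if (cur >= 'A' && cur <= 'Z')
            cur += 'a' - 'A';
        *out++ = (char)cur;

        if (cur == 0)
            break;
    }
}
```

A compiler is free to reorder these loads, so the C form only illustrates the structure; the cycle savings come from scheduling the instructions by hand.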

(11)

Load scheduling by preloading

9 cycles per character.
Speedup: 11/9 ≈ 1.22

(12)

Load scheduling by unrolling

• Unroll and interleave the body of the loop. For example, we can perform three loops together. When the result of an operation from loop i is not ready, we can perform an operation from loop i+1 that avoids waiting for the loop i result.
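As a rough C-level illustration of unrolling and interleaving, the sketch below handles three characters per iteration: all three loads are issued first, then the work for each, then the stores. The lowercasing task, the explicit length parameter, and the name `lower_unrolled3` are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical example: lowercase n characters, three per iteration. */
void lower_unrolled3(char *out, const char *in, size_t n)
{
    while (n >= 3) {
        /* Issue the three independent loads back to back. */
        unsigned char c0 = (unsigned char)in[0];
        unsigned char c1 = (unsigned char)in[1];
        unsigned char c2 = (unsigned char)in[2];

        /* While a later load completes, work on an earlier value. */
        if (c0 >= 'A' && c0 <= 'Z') c0 += 'a' - 'A';
        if (c1 >= 'A' && c1 <= 'Z') c1 += 'a' - 'A';
        if (c2 >= 'A' && c2 <= 'Z') c2 += 'a' - 'A';

        out[0] = (char)c0;
        out[1] = (char)c1;
        out[2] = (char)c2;

        in += 3;
        out += 3;
        n -= 3;
    }

    /* Leftover characters when n is not a multiple of three. */
    while (n > 0) {
        unsigned char c = (unsigned char)*in++;
        if (c >= 'A' && c <= 'Z') c += 'a' - 'A';
        *out++ = (char)c;
        n--;
    }
}
```

The remainder loop is one reason unrolling increases code size.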

(13)

Load scheduling by unrolling

(14)

Load scheduling by unrolling

(15)

Load scheduling by unrolling

21 cycles for three characters, i.e. 7 cycles/character. Speedup: 11/7 ≈ 1.57
More than doubles the code size.
Only efficient for a large data size.

(16)

Register allocation

• APCS requires the callee to save R4~R11 and to keep the stack 8-byte aligned.
Do not use sp (R13) and pc (R15).
Total: 14 general-purpose registers.
• We stack R12 only to keep the stack 8-byte aligned.

(17)

Register allocation

Assume that K<=32 and N is large and a multiple of 256.
(Figure labels: k, 32-k)
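A plain C sketch of the kind of routine these assumptions suggest is given below: it shifts an N-bit value, stored one 32-bit word at a time, left by k bits, carrying the top k bits of each word into the next. The function name `shift_bits`, the word order (least significant word first), and the left-shift direction are assumptions for illustration. To stay within defined C behavior, the sketch requires 0 < k < 32 and N to be a nonzero multiple of 32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: shift an n_bits-wide value left by k bits.
   Assumes 0 < k < 32 and n_bits is a nonzero multiple of 32. */
uint32_t *shift_bits(uint32_t *out, const uint32_t *in,
                     size_t n_bits, unsigned k)
{
    uint32_t carry = 0;
    unsigned kr = 32 - k;              /* complementary shift amount */

    do {
        uint32_t x = *in++;
        *out++ = carry | (x << k);     /* low bits come from the previous word */
        carry = x >> kr;               /* top k bits move to the next word */
        n_bits -= 32;
    } while (n_bits != 0);

    return out;
}
```

Each iteration needs `in`, `out`, the bit count, `k`, `kr`, `carry`, and `x` live at once, so unrolling to several words per iteration quickly raises register pressure.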

(18)

Register allocation

Unroll the loop to handle 8 words at a time and to use multiple load/store

(19)

Register allocation

(20)

Register allocation

• What variables do we have?
(Variable groups on the slide: arguments, read-in, overlap)
• We still need to assign carry and kr, but we have used 13 registers and only one remains.
– Work on 4 words instead
– Use the stack to save the least-used variable, here N
– Alter the code

(21)

Register allocation

• We notice that carry does not need to stay in the same register. Thus, we can use yi for it.

(22)

Register allocation

This is often an iterative process until all variables are assigned to registers.

(23)

More than 14 local variables

• If you need more than 14 local variables, then you store some on the stack.
• Work outwards from the inner loops, since they have more performance impact.

(24)

More than 14 local variables

(25)

More than 14 local variables

(26)

Packing

• Pack multiple (sub-32-bit) variables into a single register.

(27)

Packing

• When shifting by a register amount, ARM uses bits 0~7 and ignores others.
• Shift an array of 40 entries by shift bits.
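One way to picture this packing in C is to keep the loop counter in bits 8 and up of a single variable and the shift amount in bits 0~7. The name `shift_right_packed` and the right-shift direction are assumptions for illustration. C does not ignore the upper bits of a shift count the way an ARM register-specified shift does, so the sketch masks the low byte explicitly and assumes `shift` is less than 32.

```c
#include <stdint.h>

#define ENTRIES 40u   /* array length from the slide */

/* Hypothetical sketch: shift ENTRIES words right by `shift` bits,
   with the loop counter and shift amount packed in one variable.
   Assumes shift < 32. */
void shift_right_packed(uint32_t *out, const uint32_t *in, unsigned shift)
{
    /* Bits 8 and up: remaining count. Bits 0-7: shift amount. */
    uint32_t cntshf = (ENTRIES << 8) | (shift & 0xFFu);

    do {
        /* In ARM assembly the mask is unnecessary: a shift by register
           reads only the low byte of the register. */
        *out++ = *in++ >> (cntshf & 0xFFu);
        cntshf -= 1u << 8;             /* decrement the packed counter */
    } while ((cntshf >> 8) != 0);
}
```

The packed form frees one register inside the loop at no extra cost in instructions on ARM.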

(28)

Packing

(29)

Packing

• Simulate SIMD (single instruction multiple data).
• Assume that we want to merge two images X and Y to produce Z by blending them with weight α: Z = Xα + Y(1-α).

(30)

Example

X Y

Xα+Y(1-α)


(31)

α=0.75


(32)

α=0.5


(33)

α=0.25


(34)

Packing

• Load 4 bytes at a time
• Unpack it and promote to 16-bit data
• Work on 176x144 images
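The steps above can be sketched in C as follows. The fixed-point weight `a` in the range 0..256 (standing for α = a/256), the function names, and the use of `memcpy` for the 4-byte loads are assumptions for illustration; the pixel count is assumed to be a multiple of 4, which holds for 176x144 images. Each 32-bit word holds two 8-bit pixels promoted to 16-bit lanes, so one multiply blends two pixels.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Blend two 16-bit lanes at once. Each lane of xx and yy holds a
   value in 0..255 and a is in 0..256. Per lane,
   a*x + (256-a)*y <= 255*256, which fits in 16 bits, so the lanes
   never carry into each other. */
static uint32_t blend_lanes(uint32_t xx, uint32_t yy, uint32_t a)
{
    return ((a * xx + (256u - a) * yy) >> 8) & 0x00FF00FFu;
}

/* Hypothetical sketch: z = (a*x + (256-a)*y) / 256 for n pixels,
   where n is a multiple of 4. */
void blend_images(uint8_t *z, const uint8_t *x, const uint8_t *y,
                  size_t n, uint32_t a)
{
    for (size_t i = 0; i < n; i += 4) {
        uint32_t xw, yw, zw;

        memcpy(&xw, x + i, 4);         /* load 4 bytes at a time */
        memcpy(&yw, y + i, 4);

        /* Unpack bytes 0 and 2, then bytes 1 and 3, into 16-bit lanes. */
        uint32_t even = blend_lanes(xw & 0x00FF00FFu,
                                    yw & 0x00FF00FFu, a);
        uint32_t odd  = blend_lanes((xw >> 8) & 0x00FF00FFu,
                                    (yw >> 8) & 0x00FF00FFu, a);

        zw = even | (odd << 8);        /* repack the four results */
        memcpy(z + i, &zw, 4);
    }
}
```

For a 176x144 image, the call would be `blend_images(z, x, y, 176 * 144, a)`, with `a = 192` for α = 0.75.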

(35)

Packing

(36)

Packing

(37)

Packing

(38)

Conditional execution

• By combining conditional execution and conditional setting of the flags, you can implement simple if statements without any need for branches.
• This improves efficiency, since branches can take many cycles, and also reduces code size.

(39)

Conditional execution

(40)

Conditional execution

(41)

Conditional execution

(42)

Block copy example

void bcopy(char *to, char *from, int n)
{
    while (n--)
        *to++ = *from++;
}

(43)

Block copy example

@ arguments: R0: to, R1: from, R2: n
bcopy:  TEQ    R2, #0
        BEQ    end
loop:   SUB    R2, R2, #1
        LDRB   R3, [R1], #1
        STRB   R3, [R0], #1
        B      bcopy
end:    MOV    PC, LR

(44)

Block copy example

@ arguments: R0: to, R1: from, R2: n
@ rewrite "n--" as "--n>=0"
bcopy:  SUBS   R2, R2, #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        BPL    bcopy
        MOV    PC, LR

(45)

Block copy example

@ assume n is a multiple of 4; loop unrolling
bcopy:  SUBS   R2, R2, #4
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        BPL    bcopy
        MOV    PC, LR

(46)

Block copy example

@ n is a multiple of 16;
bcopy:  SUBS   R2, R2, #16
        LDRPL  R3, [R1], #4
        STRPL  R3, [R0], #4
        LDRPL  R3, [R1], #4
        STRPL  R3, [R0], #4
        LDRPL  R3, [R1], #4
        STRPL  R3, [R0], #4
        LDRPL  R3, [R1], #4
        STRPL  R3, [R0], #4
        BPL    bcopy
        MOV    PC, LR

(47)

Block copy example

@ n is a multiple of 16;
bcopy:  SUBS   R2, R2, #16
        LDMPL  R1!, {R3-R6}
        STMPL  R0!, {R3-R6}
        BPL    bcopy
        MOV    PC, LR
@ could be extended to copy 40 bytes at a time
@ if not a multiple of 40, add a copy_rest loop
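The idea of a bulk loop followed by a copy_rest loop can be sketched in C. The name `block_copy` and the choice of four bytes per iteration are assumptions for illustration. The remainder loop uses the `--n >= 0` form of the test, which corresponds to the `SUBS`/`BPL` pair in the listings.

```c
/* Hypothetical sketch: unrolled copy with a remainder loop. */
void block_copy(char *to, const char *from, int n)
{
    /* Main loop: four bytes per iteration. */
    while (n >= 4) {
        to[0] = from[0];
        to[1] = from[1];
        to[2] = from[2];
        to[3] = from[3];
        to += 4;
        from += 4;
        n -= 4;
    }

    /* copy_rest: handles the 0 to 3 bytes left over. The decrement
       and the sign test are one operation, like SUBS followed by BPL. */
    while (--n >= 0)
        *to++ = *from++;
}
```

With the remainder loop in place, the caller no longer has to guarantee that n is a multiple of the unroll factor.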

(48)

Search example

int main(void)
{
    int a[10]={7,6,4,5,5,1,3,2,9,8};
    int i;
    int s=4;

    for (i=0; i<10; i++)
        if (s==a[i]) break;
    if (i>=10) return -1;
    else return i;
}

(49)

Search

        .section .rodata
.LC0:
        .word 7
        .word 6
        .word 4
        .word 5
        .word 5
        .word 1
        .word 3
        .word 2
        .word 9
        .word 8

(50)

Search

        .text
        .global main
        .type main, %function
main:   sub    sp, sp, #48
        adr    r4, L9             @ =.LC0
        add    r5, sp, #8
        ldmia  r4!, {r0, r1, r2, r3}
        stmia  r5!, {r0, r1, r2, r3}
        ldmia  r4!, {r0, r1, r2, r3}
        stmia  r5!, {r0, r1, r2, r3}
        ldmia  r4!, {r0, r1}
        stmia  r5!, {r0, r1}
(Stack diagram, from low to high addresses: s, i, a[0] ... a[9])

(51)

Search

        mov    r3, #4
        str    r3, [sp, #0]       @ s=4
        mov    r3, #0
        str    r3, [sp, #4]       @ i=0
loop:   ldr    r0, [sp, #4]       @ r0=i
        cmp    r0, #10            @ i<10?
        bge    end
        ldr    r1, [sp, #0]       @ r1=s
        mov    r2, #4
        mul    r3, r0, r2
        add    r3, r3, #8
        ldr    r4, [sp, r3]       @ r4=a[i]

(52)

Search

        teq    r1, r4             @ test if s==a[i]
        beq    end
        add    r0, r0, #1         @ i++
        str    r0, [sp, #4]       @ update i
        b      loop
end:    str    r0, [sp, #4]
        cmp    r0, #10
        movge  r0, #-1
        add    sp, sp, #48
        mov    pc, lr

(53)

Optimization

• Remove unnecessary load/store
• Remove loop invariant
• Use addressing mode
• Use conditional execution
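In C terms, the first three items roughly correspond to keeping `s` and the loop position in local variables rather than in memory, computing the array base once, and walking the array with a pointer instead of recomputing an offset each time. The sketch below is only an analogy to the assembly changes; the function name `search` and its parameters are assumptions for illustration.

```c
/* Hypothetical sketch: return the index of the first element equal
   to s, or -1 if there is none. */
int search(const int *a, int n, int s)
{
    const int *p = a;          /* array base computed once */
    const int *end = a + n;    /* loop bound computed once */

    /* s and p can stay in registers; only a[i] is loaded per step. */
    while (p < end && *p != s)
        p++;

    return (p < end) ? (int)(p - a) : -1;
}
```

For the array in the example, `search(a, 10, 4)` finds the value at index 2.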

(54)

Search (remove load/store)

        mov    r3, #4
        str    r3, [sp, #0]       @ s=4
        mov    r3, #0
        str    r3, [sp, #4]       @ i=0
loop:   ldr    r0, [sp, #4]       @ r0=i
        cmp    r0, #10            @ i<10?
        bge    end
        ldr    r1, [sp, #0]       @ r1=s
        mov    r2, #4
        mul    r3, r0, r2
        add    r3, r3, #8
(Slide annotations: "r1," beside the s=4 lines and "r0," beside the i=0 lines)

(55)

Search (remove load/store)

        beq    end
        add    r0, r0, #1         @ i++
        b      loop
end:    str    r0, [sp, #4]
        cmp    r0, #10
        movge  r0, #-1
        add    sp, sp, #48

(56)

Search (loop invariant/addressing mode)

        mov    r3, #4
        str    r3, [sp, #0]       @ s=4
        mov    r3, #0
        str    r3, [sp, #4]       @ i=0
loop:   ldr    r0, [sp, #4]       @ r0=i
        add    r2, sp, #8
        cmp    r0, #10            @ i<10?
        bge    end
        ldr    r1, [sp, #0]       @ r1=s
        mov    r2, #4
        mul    r3, r0, r2
        add    r3, r3, #8
        ldr    r4, [r2, r0, LSL #2]
(Slide annotations: "r1," beside the s=4 lines and "r0," beside the i=0 lines)

(57)

Search (conditional execution)

        beq    end
        add    r0, r0, #1         @ i++
        b      loop
end:    str    r0, [sp, #4]
        cmp    r0, #10
        movge  r0, #-1
        add    sp, sp, #48
(Slide annotations: "addeq" over the add instruction and "beq" over the b instruction)

(58)