Compiler Branches

This excerpt from the IBM PowerPC™ Compiler Writer's Guide gives coding examples of how a compiler can avoid branches in the application code that it generates. The document is written for experienced C-language compiler developers. The complete document, written by Warthman Associates, is on IBM's Web site.

The following PowerPC instructions are used in this excerpt:

add Add
add. Add with update (i.e., store the condition codes for the result in the Condition Register)
addi Add immediate
and AND
andc AND with compliment
bge Branch if greater than or equal to
bf Branch if false
ble Branch if greater than or equal to
blt Branch if greater than
bne Branch if not equal
cmpw Compare word
cmpwi Compare word immediate
cror Condition register OR
eqv Equivalent
li Load immediate
lwz Load word and zero
mfcr Move to condition register field
or OR
rlwinm Rotate left word immediate then AND with mask
stw Store word
subf Subtract from
xori XOR immediate

 


3.1.5 Avoiding Branches

Branching, both conditional and unconditional, slows most implementations. Even an unconditional branch or a correctly predicted taken branch may cause a delay if the target instruction is not in the fetch buffer or the cache. It is therefore best to use branch instructions carefully and to devise algorithms that reduce branching. Many operations that normally use branches may be performed either with fewer or no branches.

Reducing the number of branches:

  • Increases the size of the basic blocks, and therefore increases opportunities for scheduling within a basic block.
  • Decreases the number of basic blocks.

 

3.1.5.1 Computing Predicates

Predicates utilize a boolean expression as an integer value. For example, in C:

ne = ((x != y) ? 1 : 0);

Figure 3-15 shows how to calculate this expression using an ordinary control flow translation.

Figure 3-15. Predicate Calculation: Branching Code Sequence

cmpw cr0,Rx,Ry # place compare result in cr0
li R3,1 # R3 = 1
bne cr0,lab # x != y
li R3,0 # R3 = 0
lab:

You can avoid the branch by using Condition Register logical instructions, as shown in Figure 3-16. In this case, the result of the comparison is directly transferred from the Condition Register to a General-Purpose Register, from which the bit is extracted and then flipped.

Figure 3-16. Predicate Calculation: Condition Register Logical Sequence

cmpw cr0,Rx,Ry # place compare result in cr0
mfcr R4 # R4 = condition register
rlwinm R5,R4,3,31,31 # extract the cr0[eq] bit
xori R3,R5,1 # flip the bit to obtain 0/1

Some implementations have delays associated with accessing the Condition Register using the mfcr instruction. An alternative that uses only fixed-point operations is shown in Figure 3-17.

Figure 3-17. Predicate Calculation: Fixed-Point Operation Code Sequence

subf R0,Rx,Ry # R0 = y - x
subf R3,Ry,Rx # R3 = x - y
or R3,R3,R0 # R3 = R3 | R0
# sign bit holds desired result
rlwinm R3,R3,1,31,31 # extract the sign bit

You can generate all boolean predicates with straight-line code that does not use the Condition Register. Figure 3-18 shows arithmetic expressions that yield a sign-bit reflecting the appropriate result.

Figure 3-18. Arithmetic Expressions for Boolean Predicates

Boolean Predicate Arithmetic Expression
x y (x - y) | (y - x)
x = y ((x - y) | (y - x))
x < y (x & y) | ((x y) & (x - y))
x <= y (x | y) & ((x xor.gif (856 bytes) y) | (y - x))
x < y, unsigned (x & y) | ((x y) & (x - y)), or (x & y) | ((x | y) & (x - y))
x <= y, unsigned (x | y) & ((x xor.gif (856 bytes) y) | (y - x))

Shorter sequences exist for many of these operations. The GNU superoptimizer is a program that exhaustively searches a subset of machine instructions to find the shortest branch-free combinations that perform a specified operation. Appendix D lists the PowerPC GNU superoptimizer results for a number of functions.

 

3.1.5.2 Conditionally Incrementing a Value by 1

Figure 3-19 shows the C code fragment for conditionally incrementing a value by 1 and equivalent branching and non-branching assembly code sequences. For simple conditions, branch-free equivalents can be formed using computed predicates.

Figure 3-19. Conditionally Incrementing a Value by 1 Code Example

C Source Code
if (a < b) ++b;
Branching Code
# R3 contains a
# R4 contains b
cmpw cr0,R3,R4 # compare a and b
bge cr0,lab # branch if a >= b
addi R4,R4,1 # b = b + 1
lab:
# R4 contains the desired result
Branch-Free Code
# R3 contains a
# R4 contains b
subf R0,R4,R3 # a - b
eqv R2,R3,R4 # a equival.gif (829 bytes) b
and R2,R0,R2 # (a equival.gif (829 bytes) b) & (a - b)
andc R0,R3,R4 # a & ~b
or R0,R0,R2 # (a & ~b) | ((a b) & (a - b))
rlwinm R0,R0,1,31,31 # extract predicate
add R4,R4,R0 # if (a < b) then b++

 

3.1.5.3 Condition Register Logical

The Condition Register logical instructions can be used to combine several branch conditions, thus reducing the number of branches. For example, Figure 3-20 shows the C code fragment for a complex conditional expression and two equivalent assembly code sequences: one that has a comparison and branch for each side of the OR, and another that uses a Condition Register logical OR to combine the results of the compare and recording operations for a single test by a conditional branch. This form may present itself in situations where a common sub-expression exists between this and other code, thus offering opportunities for multiple compare and recording instructions within a single basic block.

Figure 3-20. Complex Condition Register Logical Code Example

C Source Code
if ((a + b) < 0) || ((x + y) > 0)
...do_something...
Separate Branches
add. R3,Ra,Rb # perform add (a + b) with record
blt cr0,lab1 # if (a + b) < 0, goto lab1
add. R4,Rx,Ry # perform add (x + y) with record
ble cr0,else # if (x + y) <= 0, goto else
lab1:
...statement...
else:
Combined Branch
add R3,Ra,Rb # perform add (a + b)
cmpwi cr3,R3,0 # compare (a + b) to 0
add. R4,Rx,Ry # perform add (x + y) with record
cror 27,1,12 # cr6[so] = cr0[gt] | cr3[lt]
bf cr6[so],else # branch to else if condition bit is false
...statement...
else:

Figure 3-21 shows code sequences for a C switch for which the optimal implementation of the multi-way branch is simply a sequence of compare-branch tests. Because the tests all have a common target address, they can be combined using Condition Register logical instructions, reducing the total number of branches from four to one. For a specific implementation, compare the timing of sequences using Condition Register logical instructions to the equivalent multiple-branch sequences because the relative performance may vary.

Figure 3-21. C Switch: Condition Register Logical Code Example

C Source Code
switch(i){
case 0: case 20: case 30: case 40:
i+=10; break;
}
Assembly Code
lwz R3,i # load i into R3
cmpwi cr0,R3,0 # compare R3 to 0 -> cr0
cmpwi cr1,R3,20 # compare R3 to 20 -> cr1
cmpwi cr6,R3,30 # compare R3 to 30 -> cr6
cmpwi cr7,R3,40 # compare R3 to 40 -> cr7
cror cr5[eq],cr0[eq],
cr1[eq]
# cr5[eq] = cr0[eq] | cr1[eq]
cror cr0[eq],cr7[eq],
cr6[eq]
# cr0[eq] = cr7[eq] | cr6[eq]
cror cr1[eq],cr5[eq],
cr0[eq]
# cr1[eq] = cr5[eq] | cr0[eq]
bne cr1,out # i != 0, 20, 30, 40, goto out
addi R3,R3,10 # i += 10
stw R3,i # store new i
out:

 

Copyright and Trademarks