A neat hack for speeding up arbitrary flow control in dynamically-recompiled code

Julian Brown, 2 March 2001, modified 7 March 2001.

Background

In the model the original ARMphetamine used to run foreign code on a particular platform, a two-tier system was used for emulation. At the core of the system was an interpretive core which took ARM instructions one at a time, decoded them and executed them in software (a relatively slow process). As this traditional emulation was done, the code being executed was profiled and commonly-executed chunks of code were translated into native code to be run in place of the equivalent ARM code block in the future, leading to greatly enhanced performance.

Due to the way the profiling algorithm worked, only a small portion of the whole emulated program was available for analysis during any given invocation of the recompilation algorithm. Often on exit from a particular chunk of code, no further recompiled code is readily available for execution, so the system must fall back to the interpretive emulator and continue emulation in that way.

In more detail, in the middle of a recompiled code block, four different classes of flow control were identified:

Fixed (constant destination) jump within the current recompiled chunk
Fixed jump outside the current recompiled chunk, to the beginning of a previously-recompiled chunk. (Execution may not commence from the middle of a previously-recompiled chunk due to register/flag allocation issues.)
Fixed jump outside the current recompiled chunk, to a currently-unknown location
Calculated jump (ie, switch statement)

The first case is no problem, and can be dealt with easily and efficiently in the translated code.

The second case is also no problem, as control can be passed directly, in the translated machine code, to the previous chunk of translated machine code.

The fourth case is the target of the technique I propose here. Although the third case could perhaps trigger a further (recursive?) invocation of the recompiler from the target address, I'll treat it as a special case of the fourth case for now.

The Hack

Consider the situations in which a 'calculated' jump can occur:

Probably the most common is returning from a subroutine, such as:
```
add:
        add r0,r0,r1
      * movs pc,r14
```

Switch statements:

      * add pc,pc,r0,lsl #2
        mov r0,r0
        b zero_case
        b one_case
        b two_case
        b three_case
        ...

Method call, eg (something like):
```
      * ldr pc,[r5,lsl #2]
```
(any others?)

The easy (but inefficient) way of translating each of these cases to native code is simple - don't. This is the current solution used in ARMphetamine: when the program counter is set to an arbitrary address, just drop back to the interpretive emulator. This isn't quite so bad as it might be - due to the way the profiling system works, it's possible that the emulator will do a hash table lookup on the new program counter address before any code has been (slowly) emulated, and discover that a block of translated code does indeed exist for that address. Unfortunately this involves dropping back into C code and doing a slow (at the level of instruction granularity) hash-table lookup before execution of emulated code can continue.

If we propose the hypothesis that each calculated jump is only likely to transfer control to one of a handful of addresses, we can avoid doing a global hash-table lookup by dynamically compiling the hash table as inline code, in such a way that:

The fall-through case can be used to implement a closed hash table (ie, if a table-lookup collision occurs, check the following entry, etc.)
The default case can be used to call a function which will modify (or rewrite) the hash table to include the new, previously unknown address.

Both of these conditions can be implemented extremely efficiently in the target machine code, and should serve to speed up emulation considerably.

        movl %eax,%ecx           # eax contains emulated system's program counter
        shrl $7,%ecx
        xorl %eax,%ecx           # some fast, arbitrary hash function

tablestart:
        andl $3,%ecx             # size of hash table
        jmp *hashtab(,%ecx,4)

hashtab:
        .long hashcase0
        .long hashcase1
        .long default
        .long hashcase3

hashcase0:
        cmpl $0x12345678,%eax
        je code_at_12345678

hashcase1:
        cmpl $0x23456781,%eax
        je code_at_23456781

hashcase2:
        jmp default              # for fall-through case

hashcase3:
        cmpl $0x23000000,%eax
        je code_at_23000000

        jmp hashcase0            # continue search from start of table

default:
        pushl $tablestart        # rewrite the hash table to contain a new entry
        pushl %eax
        pushl $<global state>
        lcall patchup_hash_table

There must be at least one jump to the default case in the table, otherwise we might go into an infinite loop if we encounter an unknown address. The patchup code should therefore have the ability to extend the hash table as necessary (probably by doubling its size). To do this it's probably necessary (or at least desirable) to maintain some metadata along with each block, containing the (emulated address, native address) tuples which the hash table encodes, to avoid having to parse code we've previously created to recover this information.

The question arises of when to add new entries to each local active hash table. One solution might be to do translation of code a little more 'eagerly', so each time the patchup code is called, follow through to that address and translate it as well. This would mean doing static analysis of program code, and detecting anything longer than a single basic block is likely to get slightly overcomplicated. The alternative would be to do some sort of global backpatching, so an ARM address is tagged with a list of local hash tables to modify when that address eventually gets passed to the recompiler. The latter approach would be more in keeping with the lazy translation approach currently employed by the program. The patchup code then wouldn't actually do any patching up, but instead it would just put in a request to be patched up at some later time. Initially we have to drop back to the emulator as before after making this request, but the point is that eventually the modified table will be created and we'll have fast emulation.

As an aside, I can envisage this technique stressing a memory management system quite badly, due to the need for numerous extensible memory blocks. It might be worth investigating full-scale garbage collection techniques.

Implementation

Later...