Julian Brown, 2 March 2001, modified 7 March 2001.
In the model the original ARMphetamine used to run foreign code on a particular platform, a two-tier system was used for emulation. At the core of the system was an interpretive core which took ARM instructions one at a time, decoded them and executed them in software (a relatively slow process). As this traditional emulation was done, the code being executed was profiled and commonly-executed chunks of code were translated into native code to be run in place of the equivalent ARM code block in the future, leading to greatly enhanced performance.
Due to the way the profiling algorithm worked, only a small portion of the whole emulated program was available for analysis during any given invocation of the recompilation algorithm. Often on exit from a particular chunk of code, no further recompiled code is readily available for execution, so the system must fall back to the interpretive emulator and continue emulation in that way.
In more detail, in the middle of a recompiled code block, four different classes of flow control were identified:
The first case is no problem, and can be dealt with easily and efficiently in the translated code.
The second case is also no problem, as control can be passed directly, in the translated machine code, to the previous chunk of translated machine code.
The fourth case is the target of the technique I propose here. Although the third case could perhaps trigger a further (recursive?) invocation of the recompiler from the target address, I'll treat it as a special case of the fourth case for now.
Consider the situations in which a 'calculated' jump can occur:
add: add r0,r0,r1 * movs pc,r14
* add pc,pc,r0,lsl #2 mov r0,r0 b zero_case b one_case b two_case b three_case ...
* ldr pc,[r5,lsl #2]
The easy (but inefficient) way of translating each of these cases to native code is simple - don't. This is the current solution used in ARMphetamine: when the program counter is set to an arbitrary address, just drop back to the interpretive emulator. This isn't quite so bad as it might be - due to the way the profiling system works, it's possible that the emulator will do a hash table lookup on the new program counter address before any code has been (slowly) emulated, and discover that a block of translated code does indeed exist for that address. Unfortunately this involves dropping back into C code and doing a slow (at the level of instruction granularity) hash-table lookup before execution of emulated code can continue.
If we propose the hypothesis that each calculated jump is only likely to transfer control to one of a handful of addresses, we can avoid doing a global hash-table lookup by dynamically compiling the hash table as inline code, in such a way that:
Both of these conditions can be implemented extremely efficiently in the target machine code, and should serve to speed up emulation considerably.
movl %eax,%ecx # eax contains emulated system's program counter shrl $7,%ecx xorl %eax,%ecx # some fast, arbitrary hash function tablestart: andl $3,%ecx # size of hash table jmp *hashtab(,%ecx,4) hashtab: .long hashcase0 .long hashcase1 .long default .long hashcase3 hashcase0: cmpl $0x12345678,%eax je code_at_12345678 hashcase1: cmpl $0x23456781,%eax je code_at_23456781 hashcase2: jmp default # for fall-through case hashcase3: cmpl $0x23000000,%eax je code_at_23000000 jmp hashcase0 # continue search from start of table default: pushl $tablestart # rewrite the hash table to contain a new entry pushl %eax pushl $<global state> lcall patchup_hash_table
There must be at least one jump to the default case in the table, otherwise we might go into an infinite loop if we encounter an unknown address. The patchup code should therefore have the ability to extend the hash table as necessary (probably by doubling its size). To do this it's probably necessary (or at least desirable) to maintain some metadata along with each block, containing the (emulated address, native address) tuples which the hash table encodes, to avoid having to parse code we've previously created to recover this information.
The question arises of when to add new entries to each local active hash table. One solution might be to do translation of code a little more 'eagerly', so each time the patchup code is called, follow through to that address and translate it as well. This would mean doing static analysis of program code, and detecting anything longer than a single basic block is likely to get slightly overcomplicated. The alternative would be to do some sort of global backpatching, so an ARM address is tagged with a list of local hash tables to modify when that address eventually gets passed to the recompiler. The latter approach would be more in keeping with the lazy translation approach currently employed by the program. The patchup code then wouldn't actually do any patching up, but instead it would just put in a request to be patched up at some later time. Initially we have to drop back to the emulator as before after making this request, but the point is that eventually the modified table will be created and we'll have fast emulation.
As an aside, I can envisage this technique stressing a memory management system quite badly, due to the need for numerous extensible memory blocks. It might be worth investigating full-scale garbage collection techniques.
Later...