The Design of ARMphetamine 2

Julian Brown, 27 March 2001. Modified 25 April 2001.

Background: Accelerated Virtual Machines

Virtual machines have become an integral part of computing today: Java, Transmeta's Crusoe processor, Intel's IA-64 with its built-in IA-32 hardware emulation layer, VMware for running two operating systems simulateously on one PC, Connectix Virtual PC/Gamestation and Microsoft's .NET all utilise virtual machine technology. Virtual machines are useful for security: building safe sandboxes in which to run untrusted software, for the development of new hardware devices, as well as their traditional purpose of preserving software which would otherwise becomes unusable as the legacy hardware it runs on becomes obsolete.

I am currently working on a retargetable, and possibly resourcable, machine emulation system. The original source is an ARM-processor based machine, and the host architecture is Intel IA-32. The aim is to achieve the highest speed possible for running a complete operating system on the guest virtual machine, at great levels of accuracy. This will be achieved by dynamically translating the binary code of the guest system on-the-fly as as it is emulated into the binary code format of the host system.

The first prototype is complete, and is described in this postscript file. This document describes the changes and improvements already implemented or planned for version two of the system.

Context

This work falls into several categories of wider research. Compilation techniques: partial evaluation, dynamic code generation, dynamic optimisation. Architecture: instruction-set simulation, fast approximations to hardware models, etc.

Overall Structure

At the lowest level of the system lies an interpretive emulation core. This supports:

Currently this emulation core is wired into a built-in debugger, which forms a useful test environment. Later, the system will work like this:

Instructions are fetched and executed sequentially by the interpretive core, as would be done by a real processor. These instructions are decoded and executed in software, and their effects are mirrored in the state of the virtual processor and memory system (this is a very processor-intensive task).

As this is done, the flow of control is examined by a profiler, which identifies chunks of commonly-executed code. If a particular chunk of code is executed often, we assume the overall speed of execution will benefit from this piece of code being run faster in the future. So, we take that piece of binary code and translate it into a fragment of binary code for the host computer. When the chunk is encountered on subsequent occasions, we can transfer control directly to the newly-compiled piece of code intead of doing the time-consuming interpretation of the code in software. This technique is known as dynamic recompilation, or perhaps more accurately, dynamic binary code translation, and is capable of speeding up the emulation of a virtual machine by several orders of magnitude.

In Java terminology, this technique would be called Just-In-Time (JIT) compilation. The difference in this case is that binary code for the typical processor is not designed with the intention of being used as the source code for any sort of compiler, so various new methods need to be developed to deal with the special problems involved (discussed further here or here).

Run-Time Environment

The run-time environment consists of virtual implementations of the various support chips required by a processor to make it useful. In this case, the support chips I am targeting are: (subject to change)

Of particular interest for the time being are IOMD (or its equivalent in other system configurations), which contains timers and is able to generate interrupts (which cause problems later), and the MMU which handles the mapping from virtual to physical addresses, and provides access control to memory and memory-mapped devices.

Due to the way the MMU works, it is extremely difficult to get memory accesses to and from the processor going at a good speed in the general case (ie, random access all over the virtual memory map), even if dynamic recompilation is being employed. Vendors of existing solutions have employed tricks such as programming the host memory management unit to behave like that of the guest (in the case of Virtual PC on the Macintosh), or making unsafe assumptions like that instructions which access memory will always access memory and not memory-mapped I/O space (in the case of an experimental dynarec core for UAE, a free Amiga emulator). One safe software-only solution is described below, which should reduce the minimum latency for memory operations to a fairly low level (perhaps 10-30 cycles), without using any hardware tricks or making any unsafe assumptions about memory access patterns.

The basic idea is to use a one-entry TLB, which contains a (virtual, physical) address tuple, a bitmask for the page size, and pointers to byte- and word-sized load and store functions. When a new virtual address is encountered, it is masked, compared to the virtual part of the (virtual, physical) tuple, combined with the physical part if the match is successful then the appropriate access function is transferred to directly. If a miss occurs, code is called which walks over the page tables and generates a new TLB entry. The function pointers may point to functions which handle emulated physical RAM, ROM, or memory-mapped I/O space, providing a uniform interface for each.

The function which performs page walking also encodes access permission into TLB entry by pointing the read/write function pointers to dummy functions which cause a data abort to be raised. The entry is also tagged with the current processor mode when it is set up, to avoid potential unsafe access from user-mode code when the entry was set up in a priveleged mode.

It might be worth investigating whether more than one TLB entry provides better results. Intuition suggests otherwise, since emulating CAM in software is not a particularly fast operation. However, one interesting scheme to try might be to keep a buffer containing the last few translated entries - still always only checking the first on most memory accesses, but keeping the previous few entries in a FIFO or move-to-front buffer to search before doing a full page-table walk.

A Malleable Intermediate Representation: ph2

The original intermediate representation used by ARMphetamine ("Phetacode"), while just about adequate, suffered from a couple of fairly serious flaws which limited its usefulness. These included:

Ideally, we want a language which is amenable to 'real' register allocation and flow control. We want to keep everything well-defined and mind-numbingly simple, so it's easy to see when things are going wrong. We want to be able to do condition-code abstraction (see later), without hurting our heads.

Some things remain the same: the "chunk-based" profiling algorithm, the rationale behind not jumping into the middle of blocks, dynamic register allocation, the existence of an intermediate representation at all, etc. ph2 will primarily support the 32-bit addressing mode of the ARM, but the 26-bit addressing mode will be kept in mind for later addition.

In contrast to the original phetacode, we're going to use a variable-length bytecode, mainly to keep things neat but extendable, but also because we're only building a decoder in software, and it won't hurt much. Note though that it may be necessary to scan this code backwards, so each instruction shall be postfixed with its length in bytes.

Basic operations shall include:

None of these instructions deals with 'real' ARM registers, nor does it reference condition-code flags directly. Instead, a special set of (currently 256) registers is used to hold values used in calculations. A pair of instructions:

Are used to move registers from the ARM to the ph2 registers when they're needed, and move them back when they're done with. This means a register-allocation function need concern itself only with the fetch and commit instructions.

Each time a ph2 instruction generates a result, a new register is allocated. This is important for register-colour based allocation (I believe?). With 256 registers available in one byte of storage, this places a worst-case theoretical limit of 1kb on each chunk of code, but I don't see this being a serious restriction since most chunks will probably be much shorter than this.

If a ph2 instruction must modify ARM flags, it must be wrapped in a pair of instructions:

Writing these explicitly around instructions which modify flags has the advantage that a function to remove redundant flag calculations may modify (only) fexpect and fcommit instructions in-place. There is also an instruction,

Which is used when the following instruction requires a particular set of flags to be valid (eg, ADC, RRX). These instructions should greatly facilitate condition code abstraction.

The instruction,

allows a flag's value to be written directly.

When "meta" control flow is required, such as in the fiddly translation of shift-by-register instructions, the variations of the above instructions:

can be used instead.

ph2 code offers NO internal branch instruction. Instead, the natural grouping of instructions is the basic block, and a graph of basic blocks is generated on-the-fly as ARM code is translated to intermediate code. Each ph2 basic block has an associated predicate (a byte in an array of such predicates), a TRUE destination block and a FALSE destination block. Although this makes translation from ARM code to ph2 code more difficult, the gain in ease of manipulation should make it worthwhile.

A pair of instructions:

are used directly after an fcommit or nfcommit instruction to signal that a particular predicate number should be set, and the condition which it should be set by. Code can be pruned at the IR level to remove unnecessary condition-code storing/retrieval.

Condition-Code Abstraction

One of the driving forces behind the design of ph2 is its amenability to abstracting condition flags and flow control out of the original binary code. On the ARM (as in many) architecture, flow control is achieved using condition codes, which are combinations of particular flags. For example, a piece of code might want to jump somewhere if one register is greater than or equal to another register:

        cmp r4,r5
        bge somewhere

The first instruction sets four flags, named Z (zero), N (negative), V (overflow) and C (carry). The second tests a combination of those flags for the "greater-than or equal" condition, which is defined by "N set and V set, or N clear and V clear", or more concisely as "N==V".

The two architectures in question (ARM and IA-32) both employ flags for controlling flow like this, but have quite different semantics for setting and testing those flags. On the ARM for example, setting flags is optional for ALU operations, and any instruction may be conditionally executed. In IA-32, virtually all instructions unconditionally set (or corrupt) the processor flags. Nevertheless, we can still utilise the native processor's flag-calculation logic in recompiled code, although great care must be taken.

One approach (that used in ARMphetamine v1), is to save and restore flags individually, on-demand, in recompiled code. This was proven to work adequately, but produced bulky and slow code. The reason is mainly a shortcoming of IA-32, which makes it particularly difficult to alter individual bits corresponding to the S, Z and O flags in the flags register. The code sequence used - not necessarily the best - took between 9 and 18 instructions to restore between one and four bits. When this was multiplied by the number of times such an operation was necessary in a chunk of code, this meant a fairly large proportion of code was dedicated to essentially twiddling just a couple of bits.

A better approach is to group flags forming a condition code together, and treat them as an atomic unit (IA-32 provides convenient setxx instruction which allow you to do just that). This means that the state of the flags is not mirrored exactly in recompiled code, and the on-demand scheme can no longer be utilised. Trivially, when we encounter a conditional branch (say), we scan the ph2 representation backwards (over block boundaries if necessary) until we find the instruction which last altered the flags correponding to the relevent condition code. Then we insert an instruction which saves that condition code into a predicate buffer, and use that value rather than the condition flags when we decide whether to take the branch or not. The assembly-language programmer or wizard compiler-writer is capable of writing perfectly legal but nasty code like this:

        movs r0,r1,ror #3   # sets C,N,Z flags
        add r0,r0,r3        # sets nothing, IA-32 add destroys C,V,N,Z equivalents
        tst r0,r2           # sets N,Z flags
        bhi somewhere       # depends on C,Z (unsigned higher)

Inserting an instruction in the ph2 representation of this to save the "hi" condition after the tst instruction will lead to different results from the original code, since the C flag from the add rather than the rotation in the movs instruction will erroneously be used. This isn't insurmountable obviously, but hopefully illustrates that the successful implementation of this algorithm would be a bit tricky. This is mainly why ph2 has such a verbose syntax for "expecting" and "committing" flags.

Summary of ph2

ph2 is essentially a form of register-transfer language. The translation stage from ARM code to ph2 performs register renaming in much the same way as many modern superscalar processors, which aids register allocation later by splitting the liveness of registers in many fragments of ARM code into smaller pieces, which overlap less and thus ease the allocation of registers on the register-starved IA-32 architecture.

Translation of ARM code chunks to ph2 results in a directed graph which represents the flow of control between basic blocks, where one block of ph2 code corresponds exactly to one block of ARM code. These blocks are position-independent and non-internally referencing, which makes them easier to manipulate.

Efficiently Compiling ph2 to IA-32

The algorithm to perform translation of ph2 to IA-32 is a 'proper', albeit lightweight, optimising compiler. Currently the main optimisation it performs is register allocation. (According to Poletto & Sarkar in their Linear Scan paper, this alone can improve performance of generated code by an order of magnitude if done well, over register allocation done badly or not at all). The current code-generator is flexibly-structured enough to make adding peephole optimisation and instruction scheduling fairly easy.

Register Allocation

Register allocation uses the fast Linear Scan algorithm, which the Poletto & Sarkar claim produces code within 10% of the speed of that generated with a register allocator based on Chaitin-style graph colouring, but with O(n) time complexity rather than O(n2).

An additional phase is employed, which I call "source-destination aliasing", which opportunistically aliases the destination register for ph2 instructions to the first source register, which can aid in the generation IA-32 code: since ph2 code uses three addresses for each operation,

        add %3,%6,%5

Whereas IA-32 instructions use a two-address format:

        mov %ebx,%ecx
        add %eax,%ecx

The 'first' source operand, ecx in this case, is overwritten with the result of the operation. We can rewrite the original ph2 code as the following, so long as (the original) register %6 is not used afterwards:

        add %6,%6,%5

Now we've eliminated a mov instruction, so we can just do this:

        add %eax,%ecx
Instruction Selection

We generate code simply by choosing IA-32 instructions which correspond to ph2 instructions one at a time (ARMphetamine v1 used a complicated tree-based matching method). Code is generated indirectly via an abstraction layer (ab86), which performs a similar task to a text-source assembler in choosing instruction variants for different operand types, but also knows about which instructions set up and destroy condition-code flags. A few simple substitutions are performed at this level for optimisation reasons.

(It goes without saying that binary code must be generated directly by the recompiler rather than by, say, forking an external assembler, for performance reasons).

Flow Control

Due to the way the profiling algorithm works, only a small portion of the whole emulated program is available for analysis during any given invocation of the recompilation algorithm. Often on exit from a particular chunk of code, no further recompiled code is readily available for execution, so the system must fall back to the interpretive emulator and continue emulation in that way.

In more detail, in the middle of a recompiled code block, four different classes of flow control are identifiable:

  1. Fixed (constant destination) jump within the current recompiled chunk
  2. Fixed jump outside the current recompiled chunk, to the beginning of a previously-recompiled chunk. (Execution may not commence from the middle of a previously-recompiled chunk due to register/flag allocation issues.)
  3. Fixed jump outside the current recompiled chunk, to a currently-unknown location
  4. Calculated jump (ie, switch statement)

The first case is no problem, and can be dealt with easily and efficiently in the translated code.

The second case is also no problem, as control can be passed directly, in the translated machine code, to the previous chunk of translated machine code.

The fourth case is the target of the technique I propose here. Although the third case could perhaps trigger a further (recursive?) invocation of the recompiler from the target address, I'll treat it as a special case of the fourth case for now.

Active Hashing

Consider the situations in which a 'calculated' jump can occur:

  1. Probably the most common is returning from a subroutine, such as:
    add:
            add r0,r0,r1
          * movs pc,r14
    
  2. Switch statements:
          * add pc,pc,r0,lsl #2
            mov r0,r0
            b zero_case
            b one_case
            b two_case
            b three_case
            ...
    
  3. Method call, eg (something like):
          * ldr pc,[r5,lsl #2]
    
  4. (any others?)

The easy (but inefficient) way of translating each of these cases to native code is simple - don't. This is the solution used in ARMphetamine v1: when the program counter is set to an arbitrary address, just drop back to the interpretive emulator. This isn't quite so bad as it might be - due to the way the profiling system works, it's possible that the emulator will do a hash table lookup on the new program counter address before any code has been (slowly) emulated, and discover that a block of translated code does indeed exist for that address. Unfortunately this involves dropping back into C code and doing a slow (at the level of instruction granularity) hash-table lookup before execution of emulated code can continue.

If we propose the hypothesis that each calculated jump is only likely to transfer control to one of a handful of addresses, we can avoid doing a global hash-table lookup by dynamically compiling the hash table as inline code, in such a way that:

Both of these conditions can be implemented extremely efficiently in the target machine code, and should serve to speed up emulation considerably.

        mov %eax,%ecx           # eax contains emulated system's program counter
        shr $7,%ecx
        xor %eax,%ecx           # some fast, arbitrary hash function

tablestart:
        and $3,%ecx             # size of hash table
        jmp *hashtab(,%ecx,4)

hashtab:
        .long hashcase0
        .long hashcase1
        .long default
        .long hashcase3

hashcase0:
        cmp $0x12345678,%eax
        je code_at_12345678

hashcase1:
        cmp $0x23456781,%eax
        je code_at_23456781

hashcase2:
        jmp default              # for fall-through case

hashcase3:
        cmp $0x23000000,%eax
        je code_at_23000000

        jmp hashcase0            # continue search from start of table

default:
        push $tablestart         # rewrite the hash table to contain a new entry
        push %eax
        push $<global state>
        call patchup_hash_table

There must be at least one jump to the default case in the table, otherwise we might go into an infinite loop if we encounter an unknown address. The patchup code should therefore have the ability to extend the hash table as necessary (probably by doubling its size). To do this it's probably necessary (or at least desirable) to maintain some metadata along with each block, containing the (emulated address, native address) tuples which the hash table encodes, to avoid having to parse code we've previously created to recover this information.

The question arises of when to add new entries to each local hash table. One solution might be to do translation of code a little more 'eagerly', so each time the patchup code is called, follow through to that address and translate it as well. This would mean doing static analysis of program code, and detecting anything longer than a single basic block is likely to get slightly overcomplicated. The alternative would be to do some sort of global backpatching, so an ARM address is tagged with a list of local hash tables to modify when that address eventually gets passed to the recompiler. The latter approach would be more in keeping with the lazy translation approach currently employed by the program. The patchup code then wouldn't actually do any patching up, but instead it would just put in a request to be patched up at some later time. Initially we have to drop back to the emulator as before after making this request, but the point is that eventually the modified table will be created and emulation can proceed at maximum speed.

Summary of "Active Hashing"

At this time, the method described has not been tested. I currently envisage using it in conjunction with a global hash-table based on a balanced binary tree, which if designed properly should allow power-of-two sized sections of code to be invalidated by doing a simple tree walk, and could probably be augmented to allow for LRU code-block replacement.

Return Target Buffer/Shadow Stack

As an extension of the above mechanism, a heuristic could be used to trap the ARM bl and mov[s] pc,lr/ldmfd r13!,{...pc} instructions used in function calls and push and pop appropriate x86 addresses onto a shadow stack. This is not robust on its own of course, but pushing the ARM program counter too and using a simple comparison to ensure the destination pc so obtained is the correct one would probably suffice to make things work.

If this was not done, the tables mentioned above could get prohibitively large for the return addresses of functions which are used a lot.

This requires an extra ph2 instruction, say RTS, to generate code in the appropriate place.

Interrupts & Precise Exceptions in Recompiled Code

Timer interrupts can be handled fairly easily, assuming the timer runs at some (high) fixed division of the processor clock: keep a rough (statically analysed) count of the number of cycles each instruction in a basic block takes to execute, then decrement a counter by that amount at the end of the block. If it crosses zero, call a function to cause a timer interrupt. This obviously won't give the best latency possible and won't be totally accurate, but if the aim is to simulate a machine at a behavioural rather than a timing-sensitive level it should do nicely.

Precise exceptions are more tricky, and might severely limit the extent to which we can perform optimisation on translated code. The basic problem is that every load or store instruction can potentially cause a data abort condition, at which point it must be possible to extract the state of the original machine at the exact location that the exception occurred. We can't return to recompiled code after the exception, because the state of the machine might have caused the code at the return address to become invalid, eg. if the operating system performs a task switch. We can't easily recover the state of the ARM registers, since we've done register allocation (and since thrown away our temporary data) and don't know where the correct values are (in registers, on stack, in the memory-based register file, etc.).

There are two easy solutions: the first is so synchronise the state before every memory access, ie store all registers back to memory. This would cause sub-optimal-for-speed code. The second (due to David Sharp) is to return a flag from each memory access and test it on return, generating specialised escape-condition code for every single memory access. This would cause sub-optimal-for-space code, but is probably the solution I'll go for when the time comes unless I have any better ideas.

Current Status

Just as a quick example, this fragment shows the state of ARM->IA-32 translation at the current time. This is the original binary code:

10000040 : e0030090 : mul r3,r0,r0
10000044 : e0040191 : mul r4,r1,r1
10000048 : e0222493 : mla r2,r3,r4,r2
1000004c : e2511001 : subs r1,r1,#1

And this is the translated output:

   0:   8b 4d 00                mov    0x0(%ebp),%ecx
   3:   0f af c9                imul   %ecx,%ecx
   6:   89 4d 0c                mov    %ecx,0xc(%ebp)
   9:   8b 45 04                mov    0x4(%ebp),%eax
   c:   8b d0                   mov    %eax,%edx
   e:   0f af d0                imul   %eax,%edx
  11:   89 55 10                mov    %edx,0x10(%ebp)
  14:   0f af ca                imul   %edx,%ecx
  17:   8b 5d 08                mov    0x8(%ebp),%ebx
  1a:   8d 0c 19                lea    (%ecx,%ebx,1),%ecx
  1d:   89 4d 08                mov    %ecx,0x8(%ebp)
  20:   83 e8 01                sub    $0x1,%eax
  23:   89 45 04                mov    %eax,0x4(%ebp)

The ARM registers are held in an array whose base is at %ebp. The ratio of source:destination intructions (for this example at least) is better than that reported by some other systems, which are typically of the order 1:10. Addition of load/store instructions will blow the code up immensely though.

The interpretive emulator can run the first (couple of million?) instructions of a real operating system (Acorn's RISC OS 3.70), but is unfortunately let down by poor/non-existent hardware emulation after that.

References

Embra, Shade, IBM's Daisy, Transmeta's Code Morpher, ARDI's Executor, Cifuentes' work on UBQT, Sun's Hotspot, Apple's 68k emulation layer for Mac OS 7-9, UltraHLE N64 emulator, Mike Koenig's DRFAQ, Neil Bradley's DRMZ80, ARMphetamine v1, Dave Sharp's tARMac, ARM's ARMulator, UAE's dynarec, Squish's Generator, Connectix' Virtual PC & Virtual Gamestation, Intel's IA-32 compatibility layer for Itanic, Poletto et al's tcc, probably some I've forgotten (that one about IA-32 compilation in particular).


© Julian Brown, 1999-2001