Notes on revised Intermediate Representation (ph2)

Julian Brown, 21 January 2001. Updated 4 February 2001.

The original intermediate representation used by ARMphetamine ("Phetacode"), while just about adequate, suffered from a couple of fairly serious flaws which limited its usefulness. These included:

Ideally, we want a language which is amenable to 'real' register allocation and flow control. We want to keep everything well-defined and mind-numbingly simple, so it's easy to see when things are going wrong. We want to be able to do condition-code abstraction (see later), without hurting our heads.

Some things remain the same: the "chunk-based" profiling algorithm, the rationale behind not jumping into the middle of blocks, dynamic register allocation, the existence of an intermediate representation at all, etc. ph2 will primarily support the 32-bit addressing mode of the ARM, but the 26-bit addressing mode will be kept in mind for later addition.

In contrast to the original phetacode, we're going to use a variable-length bytecode, mainly to keep things neat but extendable, but also because we're only building a decoder in software, and it won't hurt much. Note though that it may be necessary to scan this code backwards, so each instruction shall be postfixed with its length in bytes.

Basic operations shall include:

None of these instructions deals with 'real' ARM registers, nor does it reference condition-code flags directly. Instead, a special set of (currently 256) registers is used to hold values used in calculations. A pair of instructions:

Are used to move registers from the ARM to the ph2 registers when they're needed, and move them back when they're done with. This means a register-allocation function need concern itself only with the fetch and commit instructions.

Each time a ph2 instruction generates a result, a new register is allocated. This is important for register-colour based allocation (I believe?). With 256 registers available in one byte of storage, this places a worst-case theoretical limit of 1kb on each chunk of code, but I don't see this being a serious restriction since most chunks will probably be much shorter than this.

If a ph2 instruction must modify ARM flags, it must be wrapped in a pair of instructions:

Writing these explicitly around instructions which modify flags has the advantage that a function to remove redundant flag calculations may modify (only) fexpect and fcommit instructions in-place. There is also an instruction,

Which is used when the following instruction requires a particular set of flags to be valid (eg, ADC, RRX). These instructions should greatly facilitate condition code abstraction.

The instruction,

allows a flag's value to be written directly.

When "meta" control flow is required, such as in the fiddly translation of shift-by-register instructions, the variations of the above instructions:

can be used instead.

ph2 code offers NO internal branch instruction. Instead, the natural grouping of instructions is the basic block, and a graph of basic blocks is generated on-the-fly as ARM code is translated to intermediate code. Each ph2 basic block has an associated condition, a TRUE destination block and a FALSE destination block. Although this makes translation from ARM code to ph2 code more difficult, the gain in ease of manipulation should make it worthwhile.

The exact algorithm for translating ph2 code to native x86 code has not yet been fully designed. It will hopefully include:

Opcode listing and format

In full, the current ph2 instruction set looks like this:

CONST Load constant word Immediate word (4-byte, little-endian) value
CONSTB Load constant byte Immediate byte value
FETCH Fetch ARM register ph2 r# (one byte), ARM r# (one byte)
COMMIT Commit ARM register ARM r#, ph2 r#
FEXPECT Prime following instruction for setting flags condition-code set (one byte)
FCOMMIT Indicate that flags should have been set, write into ARM flags condition-code set (one byte)
FENSURE Ensure that flags are resident in native flags register condition-code set (one byte)
NFEXPECT Same as fexpect, but don't involve ARM flags condition-code set (one byte)
NFCOMMIT Indicate that (native) flags should have been set condition-code set (one byte)
NFENSURE Roughly the same as fensure condition-code set (one byte)
FWRITE Write bit zero of register into a flag flag number, ph2 r#
XJMP Transfer control to start of another chunk Address of destination chunk (4 bytes)
LSL Logical shift-left dest r#, src1 r#, src2 r#
LSR Logical shift right dest r#, src1 r#, src2 r#
ASR Arithmetic shift right dest r#, src1 r#, src2 r#
ROR Rotate right dest r#, src1 r#, src2 r#
ROL Rotate left dest r#, src1 r#, src2 r#
RRX Rotate with extend through carry flag dest r#, src r#
RLX Rotate left with extend through carry flag dest r#, src r#
MOV Copy register dest r#, src r#
NOT Copy negated register dest r#, src r#
AND Bitwise AND dest r#, src1 r#, src2 r#
OR Bitwise OR dest r#, src1 r#, src2 r#
EOR Bitwise Exclusive OR dest r#, src1 r#, src2 r#
TEQ Bitwise Exclusive OR, discarding result src1 r#, src2 r#
TST Bitwise AND, discarding result src1 r#, src2 r#
ADD Addition dest r#, src1 r#, src2 r#
ADC Add with carry dest r#, src1 r#, src2 r#
SUB Subtract dest r#, src1 r#, src2 r#
SBC Subtract with carry (in the ARM sense, ie rd = rm + rn + NOT(c)) dest r#, src1 r#, src2 r#
CMP Subtract discarding result src1 r#, src2 r#
CMN Add discarding result src1 r#, src2 r#
MUL Multiply dest r#, src1 r#, src2 r#
LDW Load word dest r#, addr r#
LDB Load byte dest r#, addr r#
STW Store word addr r#, src r#
STB Store byte addr r#, src r#
SWI Generate software interrupt (none?)
UNDEF Trigger undefined instruction exception (none?)
SYNC Synchronise register allocation (dump all registers to memory) (none)
END Finish a basic block (none)

Miscellaneous extras

Interrupts

It is the intention that cycles will be (approximately) counted (decremented), and checked for zero-crossings at the end of each basic block, for the purposes of interrupt handling, which should hopefully provide "good enough" latency for most applications. In the unlikely event that enough instructions accumulate inbetween checks that two or more interrupts should have been fired, both can (probably?) be done simultaneously until the 'debt' is repaid. This will undoubtedly cause havoc with timing-sensitive code.

Memory access

Memory access should use the two-entry TLB (not) documented elsewhere. There is a big problem though, and that is restoring registers after an exception. In general, we won't be returning to the original recompiled chunk after an exception has occured (well, we never will be), but before it happened our register values could have been anywhere.

We're going to take the stance that page faults, access permission faults, etc. hardly ever happen, so try to streamline the most frequent case (that is, hopefully, we have the right TLB entry and the access has the correct permissions). In that case, we run inline code. The next-worst scenario, TLB miss, will be dealt with by a function call, in the same way that memory access works in the current system. This can harmlessly continue running the same recompiled chunk.

In the worst scenario, page fault/access denied, one way of coping with the situation would be to have a look-up table associated with each load/store instruction in each block, which stores the mapping between ARM and x86 registers at the corresponding point in the code, so the complete transfer to the memory-based register file need be done only when an exception occurs. There would probably be a good deal of evil which needed to be done to get that working properly though - whether it'd be worth the effort over just saving all registers before memory accesses and hence avoiding the problem at the expense of potentially less efficient code, I don't know.