Notes on revised Intermediate Representation (ph2)

Julian Brown, 21 January 2001. Updated 4 February 2001.

The original intermediate representation used by ARMphetamine ("Phetacode"), while just about adequate, suffered from a couple of fairly serious flaws which limited its usefulness. These included:

Grungy flag handling, no support for "meta" control-flow
Poorly-defined flow control between basic blocks
Dubious register handling, especially for temporaries
Pessimal translation to x86 code
Bulky, ugly, etc. etc. etc.

Ideally, we want a language which is amenable to 'real' register allocation and flow control. We want to keep everything well-defined and mind-numbingly simple, so it's easy to see when things are going wrong. We want to be able to do condition-code abstraction (see later), without hurting our heads.

Some things remain the same: the "chunk-based" profiling algorithm, the rationale behind not jumping into the middle of blocks, dynamic register allocation, the existence of an intermediate representation at all, etc. ph2 will primarily support the 32-bit addressing mode of the ARM, but the 26-bit addressing mode will be kept in mind for later addition.

In contrast to the original phetacode, we're going to use a variable-length bytecode, mainly to keep things neat but extendable, but also because we're only building a decoder in software, and it won't hurt much. Note though that it may be necessary to scan this code backwards, so each instruction shall be postfixed with its length in bytes.

Basic operations shall include:

Arithmetic (addition, subtraction, multiplication, etc)
Logic (and, or, shifts, etc)
Load/store
The 'xjmp' from the original phetacode, to transfer control to the start of another *chunk*
Software interrupt
Undefined instruction
Load byte/word constant

None of these instructions deals with 'real' ARM registers, nor does it reference condition-code flags directly. Instead, a special set of (currently 256) registers is used to hold values used in calculations. A pair of instructions:

fetch
commit

Are used to move registers from the ARM to the ph2 registers when they're needed, and move them back when they're done with. This means a register-allocation function need concern itself only with the fetch and commit instructions.

Each time a ph2 instruction generates a result, a new register is allocated. This is important for register-colour based allocation (I believe?). With 256 registers available in one byte of storage, this places a worst-case theoretical limit of 1kb on each chunk of code, but I don't see this being a serious restriction since most chunks will probably be much shorter than this.

If a ph2 instruction must modify ARM flags, it must be wrapped in a pair of instructions:

fexpect
fcommit

Writing these explicitly around instructions which modify flags has the advantage that a function to remove redundant flag calculations may modify (only) fexpect and fcommit instructions in-place. There is also an instruction,

fensure

Which is used when the following instruction requires a particular set of flags to be valid (eg, ADC, RRX). These instructions should greatly facilitate condition code abstraction.

The instruction,

fwrite

allows a flag's value to be written directly.

When "meta" control flow is required, such as in the fiddly translation of shift-by-register instructions, the variations of the above instructions:

nfexpect ('n' for native)
nfcommit
nfensure

can be used instead.

ph2 code offers NO internal branch instruction. Instead, the natural grouping of instructions is the basic block, and a graph of basic blocks is generated on-the-fly as ARM code is translated to intermediate code. Each ph2 basic block has an associated condition, a TRUE destination block and a FALSE destination block. Although this makes translation from ARM code to ph2 code more difficult, the gain in ease of manipulation should make it worthwhile.

The exact algorithm for translating ph2 code to native x86 code has not yet been fully designed. It will hopefully include:

Register allocation, done a bit better this time
Less brain-damage in instruction selection
The new runtime assembler
Maybe a bit of peephole optimisation

Opcode listing and format

In full, the current ph2 instruction set looks like this:

CONST Load constant word Immediate word (4-byte, little-endian) value

CONSTB Load constant byte Immediate byte value

FETCH Fetch ARM register ph2 r# (one byte), ARM r# (one byte)

COMMIT Commit ARM register ARM r#, ph2 r#

FEXPECT Prime following instruction for setting flags condition-code set (one byte)

FCOMMIT Indicate that flags should have been set, write into ARM flags condition-code set (one byte)

FENSURE Ensure that flags are resident in native flags register condition-code set (one byte)

NFEXPECT Same as fexpect, but don't involve ARM flags condition-code set (one byte)

NFCOMMIT Indicate that (native) flags should have been set condition-code set (one byte)

NFENSURE Roughly the same as fensure condition-code set (one byte)

FWRITE Write bit zero of register into a flag flag number, ph2 r#

XJMP Transfer control to start of another chunk Address of destination chunk (4 bytes)

LSL Logical shift-left dest r#, src1 r#, src2 r#

LSR Logical shift right dest r#, src1 r#, src2 r#

ASR Arithmetic shift right dest r#, src1 r#, src2 r#

ROR Rotate right dest r#, src1 r#, src2 r#

ROL Rotate left dest r#, src1 r#, src2 r#

RRX Rotate with extend through carry flag dest r#, src r#

RLX Rotate left with extend through carry flag dest r#, src r#

MOV Copy register dest r#, src r#

NOT Copy negated register dest r#, src r#

AND Bitwise AND dest r#, src1 r#, src2 r#

OR Bitwise OR dest r#, src1 r#, src2 r#

EOR Bitwise Exclusive OR dest r#, src1 r#, src2 r#

TEQ Bitwise Exclusive OR, discarding result src1 r#, src2 r#

TST Bitwise AND, discarding result src1 r#, src2 r#

ADD Addition dest r#, src1 r#, src2 r#

ADC Add with carry dest r#, src1 r#, src2 r#

SUB Subtract dest r#, src1 r#, src2 r#

SBC Subtract with carry (in the ARM sense, ie rd = rm + rn + NOT(c)) dest r#, src1 r#, src2 r#

CMP Subtract discarding result src1 r#, src2 r#

CMN Add discarding result src1 r#, src2 r#

MUL Multiply dest r#, src1 r#, src2 r#

LDW Load word dest r#, addr r#

LDB Load byte dest r#, addr r#

STW Store word addr r#, src r#

STB Store byte addr r#, src r#

SWI Generate software interrupt (none?)

UNDEF Trigger undefined instruction exception (none?)

SYNC Synchronise register allocation (dump all registers to memory) (none)

END Finish a basic block (none)

Miscellaneous extras

Interrupts

It is the intention that cycles will be (approximately) counted (decremented), and checked for zero-crossings at the end of each basic block, for the purposes of interrupt handling, which should hopefully provide "good enough" latency for most applications. In the unlikely event that enough instructions accumulate inbetween checks that two or more interrupts should have been fired, both can (probably?) be done simultaneously until the 'debt' is repaid. This will undoubtedly cause havoc with timing-sensitive code.

Memory access

Memory access should use the two-entry TLB (not) documented elsewhere. There is a big problem though, and that is restoring registers after an exception. In general, we won't be returning to the original recompiled chunk after an exception has occured (well, we never will be), but before it happened our register values could have been anywhere.

We're going to take the stance that page faults, access permission faults, etc. hardly ever happen, so try to streamline the most frequent case (that is, hopefully, we have the right TLB entry and the access has the correct permissions). In that case, we run inline code. The next-worst scenario, TLB miss, will be dealt with by a function call, in the same way that memory access works in the current system. This can harmlessly continue running the same recompiled chunk.

In the worst scenario, page fault/access denied, one way of coping with the situation would be to have a look-up table associated with each load/store instruction in each block, which stores the mapping between ARM and x86 registers at the corresponding point in the code, so the complete transfer to the memory-based register file need be done only when an exception occurs. There would probably be a good deal of evil which needed to be done to get that working properly though - whether it'd be worth the effort over just saving all registers before memory accesses and hence avoiding the problem at the expense of potentially less efficient code, I don't know.

`CONST`	Load constant word	Immediate word (4-byte, little-endian) value
`CONSTB`	Load constant byte	Immediate byte value
`FETCH`	Fetch ARM register	ph2 r# (one byte), ARM r# (one byte)
`COMMIT`	Commit ARM register	ARM r#, ph2 r#
`FEXPECT`	Prime following instruction for setting flags	condition-code set (one byte)
`FCOMMIT`	Indicate that flags should have been set, write into ARM flags	condition-code set (one byte)
`FENSURE`	Ensure that flags are resident in native flags register	condition-code set (one byte)
`NFEXPECT`	Same as fexpect, but don't involve ARM flags	condition-code set (one byte)
`NFCOMMIT`	Indicate that (native) flags should have been set	condition-code set (one byte)
`NFENSURE`	Roughly the same as fensure	condition-code set (one byte)
`FWRITE`	Write bit zero of register into a flag	flag number, ph2 r#
`XJMP`	Transfer control to start of another chunk	Address of destination chunk (4 bytes)
`LSL`	Logical shift-left	dest r#, src1 r#, src2 r#
`LSR`	Logical shift right	dest r#, src1 r#, src2 r#
`ASR`	Arithmetic shift right	dest r#, src1 r#, src2 r#
`ROR`	Rotate right	dest r#, src1 r#, src2 r#
`ROL`	Rotate left	dest r#, src1 r#, src2 r#
`RRX`	Rotate with extend through carry flag	dest r#, src r#
`RLX`	Rotate left with extend through carry flag	dest r#, src r#
`MOV`	Copy register	dest r#, src r#
`NOT`	Copy negated register	dest r#, src r#
`AND`	Bitwise AND	dest r#, src1 r#, src2 r#
`OR`	Bitwise OR	dest r#, src1 r#, src2 r#
`EOR`	Bitwise Exclusive OR	dest r#, src1 r#, src2 r#
`TEQ`	Bitwise Exclusive OR, discarding result	src1 r#, src2 r#
`TST`	Bitwise AND, discarding result	src1 r#, src2 r#
`ADD`	Addition	dest r#, src1 r#, src2 r#
`ADC`	Add with carry	dest r#, src1 r#, src2 r#
`SUB`	Subtract	dest r#, src1 r#, src2 r#
`SBC`	Subtract with carry (in the ARM sense, ie rd = rm + rn + NOT(c))	dest r#, src1 r#, src2 r#
`CMP`	Subtract discarding result	src1 r#, src2 r#
`CMN`	Add discarding result	src1 r#, src2 r#
`MUL`	Multiply	dest r#, src1 r#, src2 r#
`LDW`	Load word	dest r#, addr r#
`LDB`	Load byte	dest r#, addr r#
`STW`	Store word	addr r#, src r#
`STB`	Store byte	addr r#, src r#
`SWI`	Generate software interrupt	(none?)
`UNDEF`	Trigger undefined instruction exception	(none?)
`SYNC`	Synchronise register allocation (dump all registers to memory)	(none)
`END`	Finish a basic block	(none)