Currently, all ARM flags are mirrored precisely in blocks of recompiled code. There are two ways in which this strategy can be improved. The first, implemented by most (?) dynarecs, is the removal of redundant flag calculations; this isn't so important for an ARM source processor, since flags are only calculated when the 'S' bit of an instruction is set anyway.
A more complex (and probably more beneficial) optimisation would be to store flags as compounded condition codes, depending on which condition codes the following instructions utilise. The rationale behind this is that although the IA-32 architecture makes it easy to store the values of flags (in byte-wide memory locations), it makes it difficult to do the reverse operation of getting a value in a byte-wide memory location back into a flag. However, IA-32 makes it easy to do a simple comparison with a byte-wide memory location.
So, my proposed scheme is to scan backwards from each conditional instruction until the instruction which generates the condition code it tests is found, then store that condition code directly at that point (using one of the IA-32 SETcc instructions). For example, take this ARM code:
    cmp r0,r1
    .
    (recompiled code which will corrupt flags)
    .
    bgt wherever
Now, if r0 is allocated to the eax register and r1 to ebx, we could generate the following code:
    cmpl  %ebx,%eax
    setg  gtoffset(%ebp)
    .
    (corrupt flags)
    .
    cmpb  $0,gtoffset(%ebp)
    jne   wherever
Compare this with the current approach, which needs 5+2*(number_of_flags) IA instructions each time flags must be restored.
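The backward scan itself might look something like the following C sketch. The insn structure, the arm_cond enum and find_flag_producer are hypothetical names made up for illustration, not anything in the current source, and the sketch assumes a chunk has already been decoded into an array. It gives up if the nearest flag-setting instruction is itself conditional, since then the flags could have come from more than one place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of ARM condition codes, for illustration only. */
enum arm_cond { COND_EQ, COND_NE, COND_GT, COND_AL };

/* Hypothetical decoded instruction: just the fields the scan needs. */
struct insn {
    int sets_flags;      /* non-zero for compares and 'S'-bit instructions */
    enum arm_cond cond;  /* condition under which the instruction executes */
};

/*
 * Scan backwards from the instruction at index 'user' (the one which
 * tests a condition) for the instruction which produced the flags.
 * Returns its index, or -1 if there is none in this chunk or if the
 * nearest flag setter is conditional (so the producer is ambiguous).
 */
long find_flag_producer(const struct insn *code, size_t user)
{
    for (size_t i = user; i-- > 0; ) {
        if (code[i].sets_flags)
            return code[i].cond == COND_AL ? (long)i : -1;
    }
    return -1;
}
```

When the scan returns -1, the recompiler would presumably have to fall back on the existing scheme of restoring all the flags.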
Naturally, it's important that flags hold the right values between recompiled code chunks, i.e. they must be held in a fixed location (as currently) at the entry point and at each possible exit point of each chunk.
It's a bit ugly at the moment ;-)
It may be beneficial to abstract the assembler a bit, so that it can include features like instruction reordering or cache alignment, or substitute more efficient encodings of some instructions (like those involving the ax register).
Write-protecting pages of the ARM address space, then trapping writes and discarding invalidated blocks... how do you go about doing something like this in Linux?
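One way this might be done on Linux is with mmap(), mprotect() and a SIGSEGV handler installed through sigaction() with SA_SIGINFO, which passes the faulting address in si_addr. The following is only a minimal sketch: arm_mem, page_dirty, NUM_PAGES, arm_mem_init and protect_page are hypothetical names, and a real handler would discard the recompiled blocks for the page rather than just setting a flag. Note also that POSIX doesn't list mprotect() as async-signal-safe, so calling it from the handler is an assumption about Linux behaviour, albeit a commonly relied-upon one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NUM_PAGES 4   /* hypothetical size of the emulated ARM memory */

static uint8_t *arm_mem;                            /* emulated ARM memory */
static size_t page_size;                            /* host page size */
static volatile sig_atomic_t page_dirty[NUM_PAGES]; /* set when a write is trapped */

/* Called on a write to a protected page: note it, then unprotect. */
static void write_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)arm_mem;
    (void)sig;
    (void)ctx;

    if (addr < base || addr >= base + NUM_PAGES * page_size)
        _exit(1);   /* a genuine crash, not one of our traps */

    size_t page = (addr - base) / page_size;
    page_dirty[page] = 1;   /* real code would discard blocks here */
    mprotect(arm_mem + page * page_size, page_size, PROT_READ | PROT_WRITE);
    /* On return the faulting write is restarted and now succeeds. */
}

/* Map the emulated memory and install the handler; 0 on success. */
int arm_mem_init(void)
{
    struct sigaction sa;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    arm_mem = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arm_mem == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Write-protect one page, e.g. after recompiling code found in it. */
int protect_page(size_t page)
{
    assert(page < NUM_PAGES);
    page_dirty[page] = 0;
    return mprotect(arm_mem + page * page_size, page_size, PROT_READ);
}
```

The granularity is the host's page size, so one write anywhere in a page would throw away every block recompiled from that page.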
Code expiry could be linked to the virtual memory system, so that each page has a hash table of recompiled blocks, and all of those can be discarded when a page becomes unmapped. Sort of coarse-grained approach, not sure how well it'd work.
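As a rough sketch of what that might look like in C, assuming 4K pages and a tiny address space: each page owns a small hash table of blocks keyed by ARM address, and discarding a page just empties its table. All the names here (struct block, struct page_cache, cache_insert, cache_lookup, cache_discard_page) and the sizes are made up for illustration, and the native code pointer is treated as opaque and isn't freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12   /* assumed 4K pages */
#define NUM_PAGES  16   /* hypothetical, tiny address space */
#define BUCKETS    64   /* hash buckets per page */

struct block {
    uint32_t arm_addr;    /* ARM address the block was recompiled from */
    void *native_code;    /* recompiled code (not owned by the cache) */
    struct block *next;   /* next block in the same bucket */
};

struct page_cache {
    struct block *bucket[BUCKETS];
};

static struct page_cache pages[NUM_PAGES];

static unsigned bucket_of(uint32_t arm_addr)
{
    return (arm_addr >> 2) % BUCKETS;   /* ARM instructions are word-aligned */
}

/* Record a recompiled block; returns 0 on success, -1 on failure. */
int cache_insert(uint32_t arm_addr, void *native_code)
{
    uint32_t page = arm_addr >> PAGE_SHIFT;
    struct block *b;

    if (page >= NUM_PAGES)
        return -1;
    b = malloc(sizeof *b);
    if (b == NULL)
        return -1;
    b->arm_addr = arm_addr;
    b->native_code = native_code;
    b->next = pages[page].bucket[bucket_of(arm_addr)];
    pages[page].bucket[bucket_of(arm_addr)] = b;
    return 0;
}

/* Find the recompiled code for an ARM address, or NULL if none. */
void *cache_lookup(uint32_t arm_addr)
{
    uint32_t page = arm_addr >> PAGE_SHIFT;
    struct block *b;

    if (page >= NUM_PAGES)
        return NULL;
    for (b = pages[page].bucket[bucket_of(arm_addr)]; b != NULL; b = b->next)
        if (b->arm_addr == arm_addr)
            return b->native_code;
    return NULL;
}

/* Throw away every block belonging to a page; returns how many. */
size_t cache_discard_page(uint32_t page)
{
    size_t n = 0;

    if (page >= NUM_PAGES)
        return 0;
    for (unsigned i = 0; i < BUCKETS; i++) {
        struct block *b = pages[page].bucket[i];
        while (b != NULL) {
            struct block *next = b->next;
            free(b);
            n++;
            b = next;
        }
        pages[page].bucket[i] = NULL;
    }
    return n;
}
```

A block which straddles a page boundary would need to be registered with both pages, which this sketch doesn't attempt.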
Exceptions currently aren't dealt with at all, and implementing them could get a bit tricky. On the ARM, they come in the following forms: reset, undefined instruction, software interrupt (SWI), prefetch abort, data abort, IRQ and FIQ.
I didn't envisage most of these being a problem: IRQs and FIQs can probably be dealt with, to a first approximation, outside of recompiled code chunks. SWIs can pre-set the program counter before they are called, and return to the interpreter when they finish.
The problem comes with undefined instructions (including FP operations) and prefetch/data aborts (actually, undefined instructions aren't a problem - they can be done much the same as SWIs). In chunks of recompiled code, it is far preferable to update the program counter 'lazily' than to ensure it's updated after every recompiled instruction. Consider though that any memory access can cause an exception, so the program counter must have the correct value before each one - this would lead to a fair amount of overhead, which is particularly undesirable since exceptions won't happen very often.
My proposed solution to this involves generating a look-up table containing mappings of memory-accessing instructions from IA to ARM addresses. When an exception occurs (which can be detected in whatever subroutine deals with memory accesses), the stack can be examined to find the calling location, and this address can be mapped back to the equivalent ARM instruction, so the exception can be returned from properly. This should give little overhead most of the time, which is probably a good thing.
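The look-up itself could be as simple as a binary search over a table sorted by IA address, built as each chunk is recompiled. This is just a sketch: struct pc_map_entry and pc_map_lookup are hypothetical names, and it assumes the key is the return address pushed by the call to the memory-access subroutine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per memory-accessing instruction in a recompiled chunk. */
struct pc_map_entry {
    uintptr_t native_ret;  /* IA return address after the call to the
                              memory-access subroutine (assumed key) */
    uint32_t arm_pc;       /* address of the equivalent ARM instruction */
};

/*
 * Binary search of a table sorted by native_ret. Returns 1 and stores
 * the ARM address in *arm_pc on an exact match, otherwise returns 0.
 */
int pc_map_lookup(const struct pc_map_entry *tab, size_t n,
                  uintptr_t native_ret, uint32_t *arm_pc)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].native_ret < native_ret)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && tab[lo].native_ret == native_ret) {
        *arm_pc = tab[lo].arm_pc;
        return 1;
    }
    return 0;
}
```

Since code is emitted in ascending address order, the table would come out sorted for free, and the search only ever runs when an exception actually happens.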
Update: I realised that idea wouldn't work, because exceptions within a recompiled chunk won't be able to preserve the ARM<>IA register mapping (and also the base-register writeback semantics over exceptions are horrible). That means that the mapping will need to be synchronised before every memory read/write. That isn't very nice.
OK, I need this before the emulator will be any use to anyone, but it's not ready for it yet...