ARMphetamine v2 executes first test programs correctly. It appears to be fast. Very fast. This is just the recompiler by itself (statically) though, no profile gathering or dynamic translation is done.
Well, I think ARMphetamine is just a spare-time project again, after a brief flirtation with being my academic research. Depends really on whether there's anything academically valid to research on dynarec - I'm having a hard time convincing myself and the people around me that there is, so the research is going off at a tangent on the dynamic optimisation side. Essentially it's the same project, but fiddling with the IR a bit, designing an architecture around it, throwing away the ARM frontend then creating loads of run-time optimisations for it. Ssh, don't tell the multinational corporations.
On the original ARMphetamine side, I've spent a couple of days hacking mercilessly at it, so that now it knows how to flatten basic blocks into ordinary x86 code. I don't know if the way I've done this is overkill - since neither depth-first nor breadth-first traversal do anything to keep, ie, loops contiguous as far as I can tell, I've tried to design it so that strongly-connected basic blocks are kept together if possible. I admit to not thinking this through as carefully as I should have so I've probably got the wrong end of the stick entirely. There's a priority queue in there, so that block reordering based on branch taken/not taken frequency can be used to influence the layout of the final program at some future stage.
Read (well, skimmed through carefully really) Andrew Appel's "Modern Compiler Implementation in ML", and was slightly disappointed that there wasn't more in it with respect to optimisation.
OK, I think I might have this code generation thingy sorted. The problem is that the compilation mechanism (register allocation, code generation, exception handling) seems to need to know everything at once. By producing a list of pseudo target instructions with mutable slots where the operands should be, I think I can get closer to that. Each virtual register in the intermediate code corresponds to one of these slots, so each use of that register can be reallocated somewhere else by just changing the value in one slot. This should enable code generation at the same time as register allocation, which makes the exception stuff easier and also since variable lifetimes are known at that point it's easier to allocate temporaries, etc.
That probably doesn't make a lot of sense, sorry. I guess I'll have to explain it properly at some point (if it works, anyway).
Some messy internals are getting far too messy. Time for yet another rethink of the whole code generation thing I think. Disconnected some obsolete code from the Makefile though, and that's about as productive as I get right now.
Ack, nobody told me about SSA & phi nodes. Hmm.
As you can tell, I'm slacking. This is especially poor since ARMphetamine is now my day job. Well, sorta unofficially anyway, so far.
For a sneak-peek at where the project is going at the moment, take a look at this lovely graph of ph2 control flow (produced with the aid of a neat little program called aiSee). If you look very carefully you might be able to see some branch optimisation, highish-level control flow extraction, etc. The ARM source this was produced from (a division routine) is here.
As a bit of a diversion from register allocation and code generation seeing as I've hit a little snag with it (trying to solve an NP complete problem in linear time may be pushing my luck slightly), I've made a start on a Thumb decoder (ARM's 16-bit instruction set extension). Also tweaks to the memory code to add a framework for adding sign-extending and halfword memory transfer operations which are needed by the '7TDMI, plus a new optimisation phase in the recompiler which I haven't mentioned before ("source-destination aliasing", which seems to be useful for IA-32 code generation).
There's a new design document, meant mainly for my own nefarious 'academic' purposes but you might find it interesting anyway (if you've got this far). Some of it is cut-n-pasted from stuff I've previously put on here, but one little idea to do with MMUs and move-to-front buffers is new, I think.
I may make noises about Gameboy Advance emulation at some point, but don't hold your breath since, err, it'd be lots of work... and there's lots of dynarec stuff which needs sorting first. The idea of nice, simple code as single ROM images in flat address space is quite appealing, and might simplify some stuff though. Maybe it can provide some useful test material. Just why am I writing this emulator anyway? Someone remind me?
Going to buy Black & White tomorrow, if I can find it.
Linear scan now works reasonably convincingly, and a new x86 code generator is well on the way to being half-written. Despite there being no attempt at optimisation and only a handful of opcodes being compiled (fetch, commit, lsl, asl, lsr, ror, rol, and, adc, sub, sbc, mul), the code looks pretty good to me (though there are far too many moves - but we can deal with those in due course ;o)).
The new backend knows about lots more x86 instructions and instruction variants than the old one - it was so cool to see an add instruction being converted one-to-one into an lea instruction. Still missing are flow control, flag handling, lots of opcodes...
I've implemented linear scan register allocation. This isn't necessarily a good thing, since I stayed awake all night to do it. Now the sun is up though, time for bed I think. It doesn't work even slightly, of course, and the code is utterly revolting ;o). Also various changes to ph2 translator, etc.
The run-time assembler has undergone some serious fixing, and now appears to work (at least for the handful of instructions I've tested). Yippee.
Also done much fixing of the MMU & the processor-mode switching code. RISC OS 3.70 boots as far as poking an I/O control byte (at 0x3200000 in the physical address map) in IOMD with some sequence of numbers (0xc3, 0xc2, 0xc0, 0xc2, 0xc3, 0xc3, 0xc2...) in a seemingly infinite loop. (Timers 0/1 don't cause interrupts yet though, maybe that'll help). The only information I have to hand says something about open drains for that address, which despite being something like the smell in the lab of my 'day job', goes right over my head. I'll have to start looking for a RISC PC tech manual, which could be quite hard to find now I think. And I have no money... :-/
I've been working on the interpretive core and hardware support, since finding that some lovely person has uploaded a RISC OS 3.70 ROM image 'somewhere on the internet' since last time I looked. Turns out the copy I ripped out of my own Risc PC back home was corrupted somehow (probably my broken level 2 cache at work again), and it wasn't just my paranoia (I knew that loop couldn't have terminated as it was in my copy!).
Both 26-bit and 32-bit cores can live simulateously now, with different copies of the emulator code invoked through function tables and preprocessor abuse. The VIDC20 code knows a bit about video, and IOMD knows a bit about timers and interrupt requests/masks etc, though currently nothing gets drawn on the 'screen' and no actual interrupts get fired. Even so, the ROM gets as far as enabling virtual memory. Which doesn't actually seem to work ;o). All the same, it's much better than it was before.
To try the new stuff: symlink "Rom" to a copy of RISC OS 3.70, run the emulator, and type script ro.txt.
CVS repository is now on Sourceforge.
Oh yeah, some thoughts about precise exceptions and register rollback, but I'm not sure if it'd work or be efficient enough. The basic idea is this: allocate two copies of the register bank, with the lower aligned at twice the minimum power of two which will wholly contain one copy of the register bank, the second aligned above it in memory in such a way that the two copies can be flipped between by simply changing one bit of the address.
Define a 'rollback unit' as being (something like) the minimum of a basic block and the gap between memory accesses. Start a rollback unit reading from the lower bank of registers, then with each write do a bitwise OR on the address used to access that register, so that write and subsequent reads will access the upper copy not the lower. Multiple ORs are no-ops. Upper registers are committed to lower registers after the end of the rollback unit if no exception occurs. If an exception does occur, we can easily recover the state at the beginning of the rollback unit just by using the lower register bank and falling back to the emulator.
Problems are: another level of indirection on each register access? No better than just saving all registers before each memory access? I see why Transmeta did it in hardware...
Slightly troubled by ph2. I'd quite like to have a go at implementing the "linear scan" register allocation technique (as documented in a paper discovered by David Sharp), but it doesn't seem to entirely mesh with the fetch/commit mechanism which I envisaged using. I've tweaked it so temporaries aren't refetched from registers every time they're used. Now the behaviour is to fetch the minimum set of registers necessary in a basic block as they're needed and commit all the (latest) results generated immediately before exiting the block. This just isn't quite right. Specifically, it creates a sort-of triangular register-need shape through each block, where each block starts out needing no registers and then accumulate liveness for registers as it proceeds, then chucking them all away right at the end. Better if registers are committed after the last time they're used, but that means more passes over the code and/or the ability to insert instructions into blocks. Or perhaps I should treat blocks as immutable, and generate new copies with commits in the right place.
Still doesn't quite mesh with linear scan though. I have to think about that some more, maybe I'll take a couple of weeks off work to do it ;o).
Also, I've possibly bugfixed the runtime assembler re: some bit-manipulation instructions. Still untested. I might get around to testing it one day, but hey, I'm supposed to be working on image analysis all the time. Ho hum.
A new solution to the global transmap problem: Active hashing.
Improvements to ALU shift-generating code. For an example session with the debugger demonstrating ARM->ph2 code translation, see here.
Now the debugger knows how to translate ARM code into ph2 code, using the command phetatrans. Unfinished enhancements to the crappy "parser" in the debugger mean that this one command temporarily uses a different number syntax to all the rest, ie you have to prepend "0x" to your hex values for the start and end address as passed to this command, but not any others.
I have a couple of new reservations about ph2 in general. I'm mainly worried by how opaque the generated code seems to be, despite my best efforts to keep things looking simple. Even single instructions with register-specified shifts are virtually impossible to understand. That might just be because they're broken though.
Doing register allocation 'properly' is going to be really hard, I think. I'm thinking things like preliminary 'priming' passes which ensure that register-shifts get allocated to the x86 cl register and results from function calls get magically put in eax whenever possible, to cut down on unnecessary swapping. Also perhaps it will be necessary or desirable to nominate one particular register for memory-only access in data-processing type operations. Perhaps not always one of the 'source' registers either, like in the case where register result isn't used again except to dump it to memory later, perhaps we might as well just dump it to memory immediately...
I'm considering adding a predicate buffer to each chunk, so each basic block within that chunk just checks whether a particular entry in that buffer (ie, byte-wide memory location) is one or zero, and takes one of two choices depending on the result. How the setting of these predicates can be neatly extracted from low-level flag operations, I don't know exactly. This would replace the scheme where each basic block has a condition associated with it, as currently it's not clear in what sense that condition actually holds - the assumption that the condition is implicit (ie, in the EFLAGS register or whatever) is precisely the nasty sort of thing ph2 was trying to avoid.
<Sigh>, so much talk, so little action.
Debugger enhancements: now offers quite a bit more functionality, including being able to actually emulate programs. Also contains a scripting mechanism: try typing "script startup.txt" when the program starts. (Hint: after succesful execution, r0 contains the result of 0x100 / 0x9, and r1 contains 0x100 % 0x9).
I really will put this thing on Sourceforge soon. Right after I get some unmetered internet access, anyway. (Mumbles incoherently about Telewest, Equifax and the data protection act. Don't even ask.)
Started a new in-built mini debugger, which will become the default 'user interface' to ARMphetamine for the time being, I think. I should have written this a long time ago, but it really didn't occur to me how useful it might be. Bugfixes in multi-word load/store emulation, which was broken when the new memory system was written. More functionality added to ARM->pheta2 translator.
Slightly tidier source available from download section. There's a new, short and unfinished document about the new intermediate representation, which it'd be nice to have some feedback on, if you're knowledgable in these things and have the time...
Moving development onto Sourceforge. See here.
I've been hacking the source of ARMphetamine slowly and irregularly over the last few months. Recent (and not-so-recent) developments include:
msr
,
mrs
, teqp
) in the interpreter.
There's a new run-time assembler, but I don't know if it works and it's not used yet. But it's nicer than the old one.
I've rewritten part of the ALU emulation in x86 assembler, just for fun (speed increase is negligible, though - 5750 vs 5500ish Dhystones/sec). Also, subtle bugfix in dynarec code - BASIC now runs for a little bit longer ;-). Available as a snapshot of my CVS directory, see instructions in Download section for extracting it.