CPU emulation is subtle

I am currently toying with the idea of writing a Gameboy emulator (in Haskell), to try out some programming techniques which seem to be less common in that area, at least judging by the other emulators I have looked at. In this post, however, I will focus less on those techniques and more on the subtleties of emulating a CPU in general. After all, this is my first foray into essentially implementing a whole CPU (albeit in software), and it turns out that there are a lot more peculiarities to find out and take care of than one might think - at least from the perspective of a programmer who usually develops for hardware rather than implementing it.

The Gameboy is in fact a pretty popular system to emulate, owing mostly to its immense popularity at the time and the fact that it has a comparably simple system design. There's mostly just a core CPU, RAM, the game cartridge's ROM and some memory-mapped hardware with only a handful of registers, which makes writing an emulator a pretty painless experience. This in general, and my close attachment sprouting from countless hours spent playing on a Gameboy while growing up in particular, makes it a perfect choice for my experimentation with full hardware emulation.

At first glance, emulating the Gameboy's core CPU looks pretty straightforward. You have some basic processor state (mostly registers), a reasonable number of instructions to modify said state, and some outside interaction, like memory access and interrupts. However, if you want to do it exactly right, that is, if your emulated CPU should ideally be able to fully mirror a Gameboy's behavior for all pieces of existing code, you have to take a second, deeper look at how these things are actually implemented.
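To make that first impression concrete, here is a minimal sketch (in Haskell, with field names of my own choosing) of what such a naive model of the processor state could look like:

```haskell
import Data.Word (Word8, Word16)

-- The register set the Gameboy CPU exposes: eight 8-bit registers plus
-- the 16-bit stack pointer and program counter.
data CPU = CPU
  { regA, regF, regB, regC, regD, regE, regH, regL :: Word8
  , regSP, regPC :: Word16
  } deriving (Show)

-- The naive main loop would just decode one opcode and produce the next
-- state; everything discussed below hides behind a signature like this:
-- step :: CPU -> Memory -> (CPU, Memory)
```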

As developers have been pushing existing hardware beyond its conceived limits, they have started to exploit a CPU's behavior on a much lower level than initially visible and documented. CPUs have implementation details and bugs that cause side effects which seem completely invisible to normal code written according to specification. In an excellent talk about the C64, Michael Steil presents some of these details of the C64's 6502-based CPU, which are often particular to the specific revision of the chip used in that machine.

Here are some tricky areas I came across while looking at the Gameboy CPU, as well as some concrete examples. (Note that I'm not covering the obvious issues like undocumented opcodes, as it is clear at first glance that a CPU which does not know them won't run code using them.)


Memory Access. At first, this seems straightforward. Instructions may read memory at specific locations, and they may modify it. The outcomes of these operations are usually very well documented. Their implementation details, however, may vary. One such detail is the order in which memory accesses occur. For example, if a single instruction needs to read from two different memory locations to compute its result, which one is read first? Exactly when, that is, in which cycle during the instruction's execution, do these accesses occur? Worst of all, some CPUs perform spurious memory accesses that have no effect on the instruction's outcome at all, usually as the result of a bug in the chip's design. They may read a single memory location twice, or they may even read memory locations that aren't involved in the operation in any way at all.
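One way to even be able to express such details is to make every memory access an explicit, per-cycle event instead of hiding it inside an opaque "execute" function. A rough sketch of that idea; the types and the example breakdown are assumptions for illustration, not a verified trace of the real chip:

```haskell
import Data.Word (Word8, Word16)

-- One bus event per machine cycle; spurious or duplicate reads can then
-- be represented just like the "real" ones.
data MemOp
  = ReadAt  Word16        -- read a byte from this address
  | WriteAt Word16 Word8  -- write a byte to this address
  | Internal              -- a cycle without bus activity
  deriving (Show)

-- Hypothetical breakdown of an instruction that reads a 16-bit value:
-- the access order (low byte before high byte) is now explicit and testable.
read16Accesses :: Word16 -> [MemOp]
read16Accesses addr =
  [ Internal             -- internal work after the opcode fetch
  , ReadAt addr          -- low byte first...
  , ReadAt (addr + 1)    -- ...then the high byte
  ]
```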

Why is it important to care about such behavior in an emulator, though? Who really cares if a value is read when it isn't used anyway, and how is access order relevant if the outcome remains the same? The answer is that while it usually doesn't matter for ordinary memory, not everything within the address space is necessarily ordinary memory. Instead, a memory-mapped hardware register may exist at that location, in which case the hardware's state may change not only when that register is written; even a simple read access may trigger a change. For example, on the C64's I/O chips, the CIAs, reading the Interrupt Control and Status Register acknowledges interrupts and clears the flags.
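In an emulator this means a read cannot be modeled as a pure lookup; it has to be able to change the machine state. A minimal sketch of that idea, with a made-up register address standing in for such a side-effecting location:

```haskell
import Data.Word (Word8, Word16)
import qualified Data.Map as Map

data Machine = Machine
  { plainMemory :: Map.Map Word16 Word8  -- ordinary RAM/ROM, simplified
  , statusFlags :: Word8                 -- some memory-mapped hardware state
  }

-- Address of a hypothetical status register whose read clears its flags,
-- in the spirit of the CIA example above. Purely illustrative.
statusRegAddr :: Word16
statusRegAddr = 0xFFAA

-- A read returns the value *and* the possibly changed machine state.
readByte :: Word16 -> Machine -> (Word8, Machine)
readByte addr m
  | addr == statusRegAddr = (statusFlags m, m { statusFlags = 0 })
  | otherwise             = (Map.findWithDefault 0xFF addr (plainMemory m), m)
```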

This becomes a problem when existing code tries to work around peculiar behavior in a manner that breaks if the bug itself doesn't happen, and becomes even worse when clever programmers explicitly use it for performance reasons. In these cases, an emulated CPU that doesn't signal those memory accesses simply won't run that code correctly.

Interrupts. Interrupts aren't so simple to begin with, if only because they disrupt normal control flow. There are already the somewhat obvious issues, like what happens if several occur more or less simultaneously. The Gameboy CPU has a (memory-mapped) flag register that indicates which interrupts occurred. Let's assume the other hardware generates two different interrupts (with different ordinal numbers) during the CPU's execution of one perhaps particularly long-lasting instruction. Which interrupt handler is executed first, the one associated with the interrupt that was signaled slightly earlier, or does the interrupt's ordinal number play a role in deciding this? A more pressing question is which bits are set in the flag register upon entering the first chosen interrupt handler: only the one for the interrupt corresponding to the called handler, or both?
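Most documentation I have seen answers the priority question with the bit number: the lower the bit in the flag register, the higher the priority, independent of which interrupt arrived first. A sketch of a dispatcher built on that assumption (the handler addresses are the documented Gameboy vectors; the function and its behavior of clearing only the serviced bit are my own reading):

```haskell
import Data.Bits (testBit, clearBit, (.&.))
import Data.Word (Word8, Word16)

-- Handler addresses in bit order: VBlank, LCD STAT, Timer, Serial, Joypad.
handlerVectors :: [Word16]
handlerVectors = [0x40, 0x48, 0x50, 0x58, 0x60]

-- Pick the pending, enabled interrupt with the lowest bit number and
-- clear only that bit; any other pending bits stay set for later.
dispatch :: Word8 -> Word8 -> Maybe (Word16, Word8)
dispatch enabled flags =
  case [ b | b <- [0 .. 4], testBit (enabled .&. flags) b ] of
    []      -> Nothing
    (b : _) -> Just (handlerVectors !! b, flags `clearBit` b)
```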

Things like this should be documented, but it can be hard to come across a complete description sometimes, especially if bugs or unusual corner cases are involved. To give a more contrived example, I'm not even really sure yet whether the CPU fetches the would-be-next instruction's opcode (resulting in a memory access) before checking for interrupts and jumping into the handler, throwing away the fetched opcode. Some documentation says "The interrupt will be acknowledged during opcode fetch period of each instruction", which is somewhat ambiguous. While this particular issue seems irrelevant for real-life code, there exists a whole scene (the demo scene) of people who have shown that every single side effect can and possibly will be used to achieve impressive feats that were otherwise inconceivable. Think about it: cleverly taking side effects within instructions into account, effectively abandoning their atomicity, basically allows you to time actions on a sub-instruction level. Which brings us to...


Timing. If you want to be able to run even the most contrived game code, cycle-accurate timing is very important, especially between the CPU and the graphics hardware, where the number of cycles during the various phases of display rendering is of particular interest. Since these timings, while often undocumented, are pretty constant in hardware, programmers tend to design code which relies on them very early in the platform's life cycle. As hinted at, a lot of pretty effects are only really possible with cycle-accurate programming, and with the tendency to push game graphics way beyond their initial specification, the later a game was released, the more likely it is to use egregious techniques which rely on increasingly accurate timing behavior.

Superficially, core CPU timing itself seems not so interesting. There are very handy tables out there that specify exactly how many cycles an instruction takes, but this is just the tip of the iceberg. Instructions with a lot of cycles take so long because they usually do a lot during their execution. Among those internal operations are, once again, memory accesses with their possible side effects. With instructions like CALL NC, a16 that execute for up to 24 cycles, which is about one tenth of the time it takes to draw one screen line, a hypothetical perfect emulator would have to take even that into account (though, granted, the addresses normally used with this particular instruction are rather unlikely to suffer from memory access side effects).
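Even staying at the level of whole instructions, a single cycle number per opcode is not enough, because conditional instructions cost different amounts depending on whether they are taken. A tiny sketch for the CALL NC, a16 case mentioned above; the 12/24 split is what the common opcode tables list, while the function name and flag type are mine:

```haskell
newtype Flags = Flags { carryFlag :: Bool }

-- Cycles spent by CALL NC, a16 according to the usual opcode tables:
-- 12 cycles when the call is not taken (carry set), 24 when it is taken
-- (carry clear) and the return address gets pushed before the jump.
callNcCycles :: Flags -> Int
callNcCycles f
  | carryFlag f = 12
  | otherwise   = 24
```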

Timing is hard to get completely right, partly because documentation is often sparse. For example, one thing that I haven't been able to dig up any useful documentation for is how many cycles pass between the completed execution of the instruction during which an interrupt occurred and the fetch of the first opcode of the interrupt handler.

Opting to measure these things, on the other hand, is often complicated and sometimes inaccurate. It also requires you to have means to inject your own code into the real system in the first place, a task which is not too hard on home computers, but for game systems usually requires access to additional hardware which is hard to come by.

Yet it's important to get it exactly right, because small inaccuracies may result in strange effects (or rather, in their failure to appear) whose cause is hard to pinpoint. If you don't know exactly how the hardware is implemented (which is often the case) and can't conclusively find out, you may end up in situations where fixing one timing problem causes various others.


A look at existing emulators, by the way, only helps to get an overall idea. All emulators with publicly available source code that I have looked at so far had small inconsistencies, mostly in exactly the places where I wasn't able to find conclusive information online. One emulator, for example, executes an interrupt handler immediately after the instruction during which the interrupt happened, without any cycles in between. This seems highly unlikely, however, because the CPU at least needs to push the 16-bit return address onto the stack, which requires two memory writes at minimum. Another emulator just decides to waste a few cycles before jumping into the handler instead. Putting the desire for cycle-accurate memory writes to the stack aside, this may very well be the right amount, but it's hard to just trust such arbitrary numbers with no information on what really happens on a real machine to use up those cycles.

So without proper documentation about the implementation, rather than just the specification of the visible interface, you often have to resort to a lot of guesswork. Unfortunately, such documentation is hard to find, if it even exists in any public form. For the Z80 CPU, on which the Gameboy's core CPU is heavily based, I felt lucky at first to come by a cycle-exact breakdown of every instruction's operation. But it quickly turned out that the Gameboy's core CPU is a different beast that merely bases its instruction set on the Z80, with possibly completely different inner workings. This renders the Z80 tables almost useless, except that they may enable more educated guesses. It also further illustrates the difference between specification and actual implementation.

In the end, the most viable solution is probably testing, testing, testing, and if a game fails to run in some aspect, thinking about which likely implementation would fit the game's expectations. Over time, and over the course of playing countless different ROMs on your emulator, those guesses should converge to something close to the actual implementation of the original device--at least until some obscure game shatters your assumptions with another contrived piece of code.

Unless, of course, you have access to some very deep implementation details, like people interested in emulating a 6502 CPU now have: the people at Visual 6502 have exposed, photographed and meticulously digitized the chip, so that its complete internal workings are now publicly available for all, and they even provide a visual transistor-level simulation of the chip in JavaScript. That, at least, should answer every conceivable question about the CPU's implementation and render virtually all detective work about it unnecessary. But where's the fun in that?
