Emulating video game consoles

Luke Zapart
May 4, 2013
Contents
1 Introduction
2 Emulation
2.1 Accuracy
2.2 System emulation
2.3 CPU emulation
2.3.1 Interpretation
2.3.2 Recompilation
2.3.3 Static recompilation
2.3.4 Dynamic recompilation
2.3.5 Hotspot method
2.4 Graphics emulation
2.4.1 Framebuffers
2.4.2 Tile-based graphics
2.4.3 Sprites
2.4.4 Vector displays
2.4.5 Image scaling
2.5 Sound emulation
2.5.1 Sampling
2.5.2 Frequency modulation
2.5.3 Aliasing
2.6 Documentation
3 The Kiwi emulator
3.1 Architecture
3.2 Challenges
3.2.1 Line rendering
3.2.2 Cycle accuracy
3.2.3 Video game bugs
4 References
4.1 Information
4.2 Emulators

1 Introduction
A video game console is a computer designed primarily for playing video
games. As technology progresses, video game consoles are made progressively
better and capable of running ever more realistic games. As this happens,
old consoles are made obsolete and become unsupported, thus making it
increasingly difficult to experience old games.
One solution to this inconvenience is emulation. A video game emulator is a
piece of software that reproduces the workings of a video game console well
enough to run games designed for that console on a different system, usually
a personal computer or a modern console.
Fans also use emulators to run video game modifications not approved by
the manufacturers of the console, such as fan translations and hacks.
Emulators also aid game development: they enable enhanced debugging, and
they shorten the build-deploy-test cycle, because games can be tested on the
same machine on which they are developed.
This thesis is an introduction to emulating video games and an attempt
to systematize knowledge about emulation. The information provided herein
is based on literature about specific aspects of emulation as well as my own
experience writing emulators.
I used what I learned while writing the thesis to write my own emulator,
Kiwi, which is capable of running a number of games for the Sega Genesis,
a console popular in the early 1990s.
Figure 1: The game Aladdin running in the Kiwi emulator
2 Emulation
Most computers produced after the 1940s implement the Von Neumann architecture (Fig. 2), which consists of a central processing unit (CPU), memory,
and input and output mechanisms. These subdivisions share a common bus, a
subsystem that transfers data between components.
An emulator has to emulate all of the components of a computer architecture as well as their interoperation. An example of such an architecture
is the one implemented in the Sega Genesis. The architecture, as shown in Fig.
3, consists of two general-purpose processors, three Random Access Memory
(RAM) devices, a Video Display Processor (VDP), a sound processing unit,
a Programmable Sound Generator (PSG), input and output (I/O), peripherals
and a replaceable cartridge that holds the program data. As some of
the components are 8-bit and some are 16-bit, there are separate 8-bit and
16-bit buses joined by a bus arbiter. Each bus in turn consists of a control
bus, an address bus and a data bus.
Figure 2: Von Neumann architecture (CPU, memory and I/O connected by a common bus)
Figure 3: Simplified Sega Genesis architecture (components: M68000 CPU, Z80 CPU, 64K RAM, 8K RAM, 64K video RAM, VDP, sound processor, PSG, I/O, peripherals, cartridge, 16-bit bus, 8-bit bus, bus arbiter)
Emulation is primarily the domain of hobbyists. Because of this, there is
no unifying theory of how to emulate computer systems. There is, however,
one thing that all emulators share, and that is the main loop. Typically,
the top-level logic of the emulated computer architecture is reproduced by
a loop, as shown in Fig. 4. Each component is run for a number of steps,
usually equivalent to a fraction of a second, e.g. one frame (there are 50 or
60 frames per second on a 50 or 60 Hz display). Then, the components are
synchronized with each other.
while not finished:
    for component in components:
        component.run_for_some_time()
    synchronize()
Figure 4: Main emulation loop pseudo-code
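A runnable version of the Fig. 4 loop can be sketched in Python, the language the pseudo-code already resembles. It assumes that each component is driven by its own clock and is advanced, once per frame, to the cycle count it should have reached by the end of that frame. The Component class, its run_until method and the clock figures (roughly the NTSC Genesis 68000 and Z80 clocks) are illustrative assumptions, and synchronize is left as an empty placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """Stand-in for an emulated chip; it only tracks how far it has run."""
    name: str
    clock_hz: int
    cycles: int = 0

    def run_until(self, target_cycle: int) -> None:
        # A real component would execute instructions (or render pixels, or
        # generate audio) until its cycle counter reaches target_cycle.
        self.cycles = max(self.cycles, target_cycle)


def synchronize(components: List[Component]) -> None:
    """Placeholder for exchanging interrupts, bus requests, etc. between chips."""


def emulate(components: List[Component], frames: int, fps: int = 60) -> None:
    for frame in range(1, frames + 1):
        for component in components:
            # The target is computed from the running total, so the integer
            # division never accumulates rounding drift across frames.
            component.run_until(component.clock_hz * frame // fps)
        synchronize(components)


m68k = Component("68000", 7_670_453)
z80 = Component("Z80", 3_579_545)
emulate([m68k, z80], frames=60)   # one emulated second at 60 frames per second
```

Deriving each frame's target from the total elapsed time, rather than adding a rounded per-frame cycle count, keeps the two processors exactly in step with their nominal clocks over long runs.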
2.1 Accuracy
Although at first glance it may seem like the best way to go about emulating
each component is to emulate it as accurately as possible, accuracy comes at
a significant cost in performance. Whether performance or fidelity is more
important depends on the requirements of a given emulation project.
The possible accuracy levels have been classified in [15] into classes of
decreasing accuracy:
Data-path Accuracy is the highest level of accuracy. Emulators in this
class simulate the physical characteristics of a given component, down
to the data-path level. This is mostly used in hardware development for
prototyping integrated circuits. Another use is historical preservation, as
seen in the interactive Visual 6502 simulator [26] which simulates a 6502
CPU down to the transistor level by working with digitized microscopic
photographs of the CPU (Fig. 5). This is also the slowest method of
emulating a chip; the Visual 6502 simulator runs at 27Hz on a modern
computer, which at 2GHz is a slowdown by a factor of approximately
75 million.
Cycle Accuracy forgoes emulating the precise workings of a chip, instead
only requiring faithful emulation of the underlying Instruction Set Architecture (ISA) and the timings of instructions relative to each other.
Typically, precise timings are only required on old 8 and 16-bit architectures, where programmers could depend on the timings for certain
effects. An example is VICE [37], which emulates the Commodore 64 cycle-accurately.
Figure 5: Vectorized layers of the MOS 6502 CPU, courtesy of Visual6502.org [27]
Instruction-level Accuracy is concerned with accurately emulating instructions and their side effects while ignoring their precise timing.
This is a fairly common level of accuracy, as it is a reasonable trade-off
between accuracy and speed.
Basic-block Accuracy allows replacing larger blocks of code with native
code equivalents. This method is faster than instruction-level accuracy
but it does not allow single-stepping through code and does not work
with code that relies on its original form, such as self-modifying code.
The Amiga emulator UAE [31] is an example.
High-Level Emulation (HLE) Accuracy identifies structures in code that can be replaced by native high-level functionality. Recent gaming consoles are
often emulated by detecting calls to audio and video library code and replacing them with native library implementations, without ever executing
the original library code. The UltraHLE [29] Nintendo 64 emulator is
famous for being the first emulator of this kind. At the time of its
release in 1999, a mere 3 years after the release of the Nintendo 64, it ran
commercial titles at a playable frame rate on the hardware of the time.
This approach had its drawbacks: at the time of release, UltraHLE was
only able to emulate approximately 20 games to a playable standard.
Overall, faster emulation means less accuracy. Fig. 6 compares the speed
of Visual 6502, VICE and UltraHLE. Some consideration should be given to
undocumented CPU instructions, as games might depend on undocumented
behavior [21].
Emulator      Accuracy class   Speed                  Slowdown factor
Visual 6502   Data-path        27 Hz @ 2 GHz          75 million
VICE          Cycle            2x1 MHz @ 500 MHz      250
UltraHLE      HLE              93.75 MHz @ 350 MHz    3.7
Figure 6: Speed comparison of three emulators from distinct classes
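The slowdown factors in Fig. 6 are the host clock rate divided by the emulated clock rate the emulator achieves. The short Python check below recomputes them, taking VICE's "2x1 MHz" to mean 2 MHz of emulated clock; the exact Visual 6502 ratio, about 74 million, is in line with the approximately 75 million quoted above.

```python
def slowdown(host_hz: float, emulated_hz: float) -> float:
    """Host clock cycles spent per emulated clock cycle."""
    return host_hz / emulated_hz


print(slowdown(2e9, 27) / 1e6)      # Visual 6502: about 74 (million)
print(slowdown(500e6, 2 * 1e6))     # VICE at twice real speed: 250.0
print(slowdown(350e6, 93.75e6))     # UltraHLE: about 3.7
```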
2.2 System emulation
Typically a computer system has one or more address buses, in which addresses and address ranges are assigned to components. For example, the
Sega Genesis has two address buses, as shown in Fig. 7.
Memory mappings fall under two categories:
Linear mapping – a mapping where an address range is translated to a
corresponding physical address range and address lines are used directly
without any decoding logic, for example in accessing physical memory.
Direct mapping – a 1:1 mapping of a unique address to one hardware
register or a physical memory location.

Device            Address (hex)
16-bit bus
  Cartridge       000000-3FFFFF
  8K RAM          A00000-A01FFF
  Sound processor A04000-A04003
  I/O             A10000-A1001F
  VDP Data        C00000
  VDP Control     C00004
  PSG             C00011
  64K RAM         FF0000-FFFFFF
8-bit bus
  8K RAM          0000-1FFF
  Sound processor 4000-4003
  Bank register   6000
  PSG             7F11
  Banked memory   8000-FFFF
Figure 7: Simplified Sega Genesis address map
As seen in the Sega Genesis example (Fig. 7), memory maps often contain
holes. To simplify circuit design, certain mappings are also mirrored: on
the Sega Genesis, for example, offset 1234 of the 64K RAM can be accessed
at both FF1234 and EF1234.
An emulator must decode all memory accesses according to the appropriate memory map.
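Decoding the 68000 side of Fig. 7 can be sketched as a Python function that maps an address to a device and an offset within it. The device names are hypothetical. Mirroring the 64K RAM every 64K across E00000-FFFFFF is an assumption chosen to reproduce the FF1234/EF1234 example above, and other mirrors in the real hardware (for example of the VDP ports) are ignored.

```python
from typing import Tuple


def decode_68k_address(address: int) -> Tuple[str, int]:
    """Map a 68000 address to (device, offset) using the Fig. 7 ranges."""
    address &= 0xFFFFFF                    # the 68000 drives 24 address lines
    if address <= 0x3FFFFF:
        return ("cartridge", address)
    if 0xA00000 <= address <= 0xA01FFF:
        return ("z80_ram", address - 0xA00000)
    if 0xA04000 <= address <= 0xA04003:
        return ("sound", address - 0xA04000)
    if 0xA10000 <= address <= 0xA1001F:
        return ("io", address - 0xA10000)
    if address == 0xC00000:
        return ("vdp_data", 0)
    if address == 0xC00004:
        return ("vdp_control", 0)
    if address == 0xC00011:
        return ("psg", 0)
    if address >= 0xE00000:
        # Assumed mirror: only the low 16 bits select a RAM location.
        return ("ram", address & 0xFFFF)
    return ("unmapped", address)           # a hole in the memory map
```

A full emulator would then forward the access to the chosen device's read or write handler.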
Components send signals called interrupts, of which there are two types.
A hardware interrupt is a signal from a component indicating that it needs
attention. A software interrupt is triggered by a CPU instruction. When a
CPU receives an interrupt, it suspends its current execution and begins
executing an interrupt handler. This usually invokes an exception appropriate
to the kind of interrupt; on the 68000, the address of the handler is read
from the corresponding entry of the exception vector table.
Interrupts are typically used for signaling from components, timers and
traps (user interrupts). A simplified exception table for the main CPU in
the Sega Genesis system is shown in Fig. 8. On the Genesis, the VDP fires
a Vertical Blank interrupt at every vertical blank, which occurs once per
displayed frame, after its last visible line (50 times a second on a PAL TV
and 60 on an NTSC TV); this is useful for timing-dependent functions. The
VDP also signals a Horizontal Blank interrupt at the horizontal blank that
follows a displayed line (a line takes roughly 64 µs). Finally, there is an
external interrupt, used for example to tell the game a light gun has been fired [11].
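Taking an interrupt can be sketched in Python using the autovectored 68000 interrupts of Fig. 8, whose vector numbers are 24 plus the interrupt level and whose handler addresses are stored 4 bytes per vector. The Cpu68k class and its fields are hypothetical and heavily simplified: a real 68000 also saves the status register, switches to supervisor mode and keeps masked interrupts pending, none of which is modelled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cpu68k:
    """Only the parts of 68000 state that this interrupt sketch touches."""
    vectors: Dict[int, int]             # vector table: byte address -> handler address
    pc: int = 0
    interrupt_mask: int = 7             # 3-bit interrupt priority mask
    stack: List[int] = field(default_factory=list)

    def raise_interrupt(self, level: int) -> bool:
        """Try to take an autovectored interrupt of priority `level` (1-7)."""
        # Level 7 cannot be masked; other levels must exceed the current mask.
        if level != 7 and level <= self.interrupt_mask:
            return False                # ignored here; real hardware keeps it pending
        self.stack.append(self.pc)      # remember where execution was suspended
        self.interrupt_mask = level     # block interrupts of equal or lower level
        vector = 24 + level             # Fig. 8: level 6 (vertical blank) is 30
        self.pc = self.vectors[vector * 4]  # each vector is a 4-byte address
        return True


cpu = Cpu68k(vectors={30 * 4: 0x1234, 28 * 4: 0x5678}, pc=0x200, interrupt_mask=3)
cpu.raise_interrupt(6)   # vertical blank: jumps to the handler stored at 0x78
```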
2.3 CPU emulation
[18] describes three approaches to emulating CPUs: interpretation, recompilation and a hybrid method.
2.3.1 Interpretation
Interpretation is the simplest way of emulating a CPU. As shown in Fig.
10, an interpreter directly simulates the fetch-decode-execute cycle (Fig. 9).
That is, until the emulator is shut down, it fetches the next instruction,
decodes and executes it.
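To make the fetch-decode-execute cycle concrete, here is a runnable Python interpreter for a tiny, hand-picked subset of 68000 instructions: NOP, MOVEQ, ADD.L between data registers, and STOP, which this sketch simply treats as "halt". It is an illustrative assumption rather than a complete decoder; condition flags, memory operands and all other instructions are left out.

```python
import struct
from typing import List, Tuple


def interpret(memory: bytes, pc: int = 0, max_steps: int = 10_000) -> Tuple[List[int], int]:
    """Run a toy subset of 68000 code; return (data registers D0-D7, final PC)."""
    d = [0] * 8  # data registers, kept as unsigned 32-bit values

    def fetch_word() -> int:
        nonlocal pc
        (word,) = struct.unpack_from(">H", memory, pc)  # 68000 words are big-endian
        pc += 2
        return word

    for _ in range(max_steps):
        op = fetch_word()                        # fetch
        if op == 0x4E71:                         # decode: NOP
            pass                                 # execute: nothing to do
        elif op == 0x4E72:                       # STOP #imm, treated here as "halt"
            fetch_word()                         # skip its immediate operand word
            break
        elif (op & 0xF100) == 0x7000:            # MOVEQ #imm8,Dn
            reg, imm = (op >> 9) & 7, op & 0xFF
            if imm & 0x80:                       # sign-extend the 8-bit immediate
                imm -= 0x100
            d[reg] = imm & 0xFFFFFFFF
        elif (op & 0xF1F8) == 0xD080:            # ADD.L Dm,Dn
            dst, src = (op >> 9) & 7, op & 7
            d[dst] = (d[dst] + d[src]) & 0xFFFFFFFF
        else:
            raise NotImplementedError(f"opcode {op:#06x} at {pc - 2:#x}")
    return d, pc


# moveq #5,d0 / moveq #-3,d1 / add.l d1,d0 / nop / stop #$2700
program = struct.pack(">6H", 0x7005, 0x72FD, 0xD081, 0x4E71, 0x4E72, 0x2700)
registers, final_pc = interpret(program)
```

Every pass through the loop pays for the fetch, the chain of comparisons and the dispatch, which is the overhead discussed next.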
Interpretation is slow, as for every original instruction a couple dozen native instructions are executed. Additionally, modern CPUs use a technique
called pipelining, which lets them work on many instructions at once, as long
as they can predict which instructions will be executed next. In a simulated
fetch-decode-execute loop, the branch that dispatches each decoded instruction
to its handler depends on the emulated code and is hard to predict, so modern
CPUs often cannot keep their pipelines full when running interpreter code.

Type              Number   Exception
Reset             0        Reset Initial Stack Pointer
                  1        Reset Initial Program Counter
Errors and traps  2        Access Fault
                  3        Address Error
                  4        Illegal Instruction
                  5        Integer Divide by Zero
                  6        CHK Instruction
                  7        TRAPV Instruction
                  8        Privilege Violation
                  9        Trace
                  10       Line 1010 Emulator
                  11       Line 1111 Emulator
Interrupts        24       Spurious Interrupt
                  25       Level 1 Interrupt (Unused)
                  26       Level 2 Interrupt (External)
                  27       Level 3 Interrupt (Unused)
                  28       Level 4 Interrupt (Horizontal Blank)
                  29       Level 5 Interrupt (Unused)
                  30       Level 6 Interrupt (Vertical Blank)
                  31       Level 7 Interrupt (Unused)
Traps             32-47    Trap #0-15 Instructions
Figure 8: Sega Genesis exception vector table for the 68000 CPU [9]
2.3.2 Recompilation
Recompilation translates emulated code to native code. It takes advantage
of the fact that it is unnecessary to decode instructions every time they need
to be executed, and moves the decoding part to the pre-processing phase of
the emulator.
Figure 9: Fetch-decode-execute cycle (fetch, decode, execute, repeated)
Threaded recompilation is the simplest form of recompilation, as it simply replaces each original instruction with a library call to a native code
implementation of the original instruction. An example translation can be
seen in Fig. 11.
Real recompilation, on the other hand, translates original code into native
code directly, without the need for instruction function calls.
Recompilation can be done ahead-of-time (AOT), i.e. before execution,
or just-in-time (JIT), i.e. during execution.
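Threaded recompilation can be sketched in Python by binding each instruction's operands to a prebuilt handler once, so that running the code is just a sequence of calls with no decoding left. The sketch mirrors Fig. 11; it assumes instructions arrive already decoded as (name, operand, destination) tuples and models only 16-bit data registers without flags. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Regs = List[int]                    # data registers D0-D7, modelled as 16-bit values
Handler = Callable[[Regs], None]


# Native implementations of the original instructions (word size, no flags).
def move_imm(imm: int, dest: int) -> Handler:
    def run(d: Regs) -> None:
        d[dest] = imm & 0xFFFF
    return run


def add_reg(src: int, dest: int) -> Handler:
    def run(d: Regs) -> None:
        d[dest] = (d[dest] + d[src]) & 0xFFFF
    return run


def or_imm(imm: int, dest: int) -> Handler:
    def run(d: Regs) -> None:
        d[dest] |= imm & 0xFFFF
    return run


LIBRARY = {"move": move_imm, "add": add_reg, "ori": or_imm}


def translate(decoded: List[Tuple[str, int, int]]) -> List[Handler]:
    """Pre-processing: bind each instruction's operands to its handler once."""
    return [LIBRARY[name](operand, dest) for name, operand, dest in decoded]


def run_threaded(code: List[Handler], d: Regs) -> None:
    for handler in code:            # no decoding left at run time, only calls
        handler(d)


block = translate([("move", 7, 0), ("add", 1, 0), ("ori", 1, 0)])  # Fig. 11
regs = [0, 3, 0, 0, 0, 0, 0, 0]
run_threaded(block, regs)           # regs[0] becomes (7 + 3) | 1
```

The translated block can be run any number of times; real recompilation would go one step further and emit native instructions instead of handler calls.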
2.3.3 Static recompilation
Static recompilation is ahead-of-time recompilation. A statically recompiled
program is indistinguishable from a native program and does not contain any
translation logic.
Like normal ahead-of-time compilers, static recompilers can perform global
optimizations on programs, resulting in fast execution time. This comes at a
cost, however, as static recompilation is a hard problem. Programs on Von
Neumann architectures frequently intertwine code and data and recognizing
what is code and what is data might be hard or impossible, depending on
the architecture.
while not finished:
    # fetch
    instruction = memory[IP]
    IP += 2  # advance the instruction pointer past the 16-bit opcode
    # decode
    if instruction == 0x4e71:
        # execute a NOP instruction
        pass
    elif (instruction >> 8) == 0:
        # decode and execute an ORI (OR immediate) instruction:
        # decode the operand size and the effective address
        size = (instruction >> 6) & 3
        ea_mode = (instruction >> 3) & 7
        ea_register = instruction & 7
        ea = get_ea(ea_mode, ea_register, size)
        # OR the operand at the effective address with an immediate value
        ea.write(ea.read() | get_immediate(size))
        set_flags()
    elif (instruction >> 8) == 1:
        ...
Figure 10: 68000 CPU interpreter pseudo-code
To illustrate this, a jump table, as shown in Fig. 12, is a piece of code that
loads an address based on an index and then jumps to it. In the example,
if index holds the number 0, the code will jump to Level and if index
holds 1, the jump will be to Title.
The problem is, a recompiler does not know how many entries there are
in the jump table, so it can’t know that 02427fff (the machine code representation of the andi.w instruction) is an instruction and not an address
to jump to.
Video games up until the 1990s were often written in assembly, and used
techniques harmful to static analysis, such as self-modifying code. Some
seemingly harmless practices, such as decompressing, decrypting or loading
code into memory, are for recompilation purposes equivalent to self-modifying code.

    move.w  #7, d0          move(1, 7, 0)
    add.w   d1, d0          add(1, 1, 0)
    ori.w   #1, d0          ori(1, 1, 0)

    ori(size, immediate, dest):
        d[dest] |= immediate
        set_flags()
Figure 11: Sample 68000 code and resulting threaded code (pseudo-code)

    move.w  (index).w,d0
    lsl.w   #2,d0                  ; d0 = d0*4 (addresses are 4 bytes long)
    movea.l JumpTable(pc,d0.w),a0  ; load the address from the table
    jmp     (a0)                   ; and jump to it
JumpTable:
    dc.l    Level
    dc.l    Title
    dc.l    Ending
    dc.l    Credits
    andi.w  #$7FFF,d2              ; machine code = 02427fff
    ...
Figure 12: Example 68000 assembly jump table
These issues make it difficult to write a completely accurate static recompiler that works without prior knowledge of the input program, but such
generality is often unnecessary. If only a single game needs to be recompiled,
it suffices to play through the game completely with an interpreting emulator,
logging what is code and what is data, and then to perform static recompilation [1].
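The logging step can be sketched in Python as follows, assuming a hypothetical interpreting emulator that reports the address and length of every instruction it executes during the play-through. Bytes never executed are presumed to be data, so a jump table such as the one in Fig. 12 falls outside the code ranges handed to the static recompiler.

```python
from typing import Iterable, List, Optional, Tuple


def code_map(rom_size: int, executed: Iterable[Tuple[int, int]]) -> List[bool]:
    """Mark every ROM byte that belonged to an executed instruction.

    executed: (address, length_in_bytes) pairs logged during the play-through.
    """
    is_code = [False] * rom_size
    for address, length in executed:
        for offset in range(address, address + length):
            is_code[offset] = True
    return is_code


def code_ranges(is_code: List[bool]) -> List[Tuple[int, int]]:
    """Collapse the per-byte map into [start, end) ranges for the recompiler."""
    ranges: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, flag in enumerate(is_code + [False]):   # sentinel closes a final range
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            ranges.append((start, i))
            start = None
    return ranges


# Code at 0-11, a never-executed table at 12-15, more code at 16-19.
trace = [(0, 4), (4, 2), (6, 4), (10, 2), (16, 4)]
ranges = code_ranges(code_map(20, trace))
```

The result is only as complete as the play-through: code reached only on paths that were never taken is misclassified as data.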
2.3.4 Dynamic recompilation
A dynamic, or just-in-time (JIT) recompiler translates original code to native
code on the fly and then caches the resulting native code for later execution.
Although translation of a single piece of code is more computationally expensive than interpretation, most of the execution time in a program is usually
spent in relatively small loops. A JIT recompiler only has to translate a loop
once, but it can run the loop thousands of times, so for an average program
this is an overall speed improvement.
JIT recompilers do not need advance knowledge of what is code and what
is data, since only code that is actually executed gets translated. JIT
recompilation also copes with self-modifying code: if a write to already
translated code is detected, the translation can be marked as dirty for later
recompilation. For this reason, and for the speed improvements, JIT recompilers
are popular in modern emulators.
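The caching and dirty-marking described above can be sketched as a small Python class. Here translate stands in for a hypothetical recompiler back end that returns a native callable together with the number of guest bytes it was built from. Instead of keeping an explicit dirty flag, a block hit by a write is simply dropped, so it is retranslated the next time it runs.

```python
from typing import Callable, Dict, Set, Tuple

NativeBlock = Callable[[], None]


class JitCache:
    """Translation cache keyed by guest address, invalidated by writes to code."""

    def __init__(self, translate: Callable[[int], Tuple[NativeBlock, int]]):
        # translate(address) -> (native code, number of guest bytes it covers)
        self.translate = translate
        self.blocks: Dict[int, NativeBlock] = {}
        self.sources: Dict[int, Set[int]] = {}   # block start -> guest bytes used

    def run(self, address: int) -> None:
        if address not in self.blocks:           # translate once, on first use
            native, length = self.translate(address)
            self.blocks[address] = native
            self.sources[address] = set(range(address, address + length))
        self.blocks[address]()                   # later runs reuse the native code

    def on_write(self, address: int) -> None:
        """Called for every guest memory write, to catch self-modifying code."""
        for start, covered in list(self.sources.items()):
            if address in covered:               # dirty: drop it, retranslate later
                del self.blocks[start]
                del self.sources[start]
```

Scanning every block on every write keeps the sketch short but is slow; a practical design would track which memory pages contain translated code and only check writes to those pages.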
2.3.5 Hotspot method
The hotspot method, first used in the eponymous HotSpot Java virtual
machine [3], aims to combine the best of both worlds: the low cost of interpretation for rarely executed code and the speed of JIT-compiled code for
often executed code. It does so by interpreting by default, identifying hot
spots, that is, often-run code, and marking them for recompilation. This is
best seen by examining the cost formulas in Fig. 13 [16]. The average cost per
execution is visualized in Fig. 14, in which it can be seen that for typical
values of $c_e$, $c_i$ and $c_r$ (2, 10 and 50 respectively) the hotspot method
approaches dynamic recompilation in cost for often executed code, while keeping
the low cost of interpretation for code that is only run once.
Method               Cost
Interpreter          $n c_i$
Dynamic recompiler   $c_r + n c_e$
Hotspot method       $\begin{cases} t c_i + c_r + (n - t) c_e & \text{if } n > t \\ n c_i & \text{otherwise} \end{cases}$

where
$c_i$   cost of a single interpretation of the code
$c_r$   cost of translating the code
$c_e$   cost of running the translated code
$n$     number of times the code is to be emulated
$t$     number of times the code is interpreted before it is translated
Figure 13: Cost formulas for an interpreter, a dynamic recompiler and the
hotspot method
2.4 Graphics emulation
Graphics and sound hardware is varied and it is impractical to enumerate
all of the different variations; it suffices to introduce the more common
and general concepts. Once the functionality of the given hardware is
known, emulating it is a fairly straightforward process. How the emulation
can be optimized depends largely on the specific components used.
A video display (such as a TV) displays graphics composed of pixels and
refreshes at a specific rate (typically 50 or 60 times per second). A pixel
is the smallest controllable element of a picture. A game console processes
graphics and then sends them to the video display at a specific resolution. For
example, the Sega Genesis is capable of outputting 240 lines each consisting
of 320 pixels (this is typically referred to as a resolution of 320x240). Video
game consoles usually have a distinct video output device which processes
graphics and outputs a video signal.
Figure 14: Average cost graph for different emulation methods (average cost per execution against number of executions, up to 20, for the interpreter, the dynamic recompiler and the hotspot method)
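The Fig. 13 formulas and the dispatch logic behind them can be sketched in Python. The cost functions use the typical values quoted above as defaults; the threshold t = 5 in the usage lines is an arbitrary choice. HotspotDispatcher is a hypothetical illustration: it interprets a block until it has run `threshold` times and translates it on the next execution, which is exactly the schedule the hotspot formula charges for.

```python
from collections import defaultdict
from typing import Callable, Dict


# Fig. 13 cost formulas, with c_i = 10, c_r = 50 and c_e = 2 as defaults.
def interpreter_cost(n: int, ci: float = 10) -> float:
    return n * ci


def recompiler_cost(n: int, cr: float = 50, ce: float = 2) -> float:
    return cr + n * ce


def hotspot_cost(n: int, t: int, ci: float = 10, cr: float = 50, ce: float = 2) -> float:
    return t * ci + cr + (n - t) * ce if n > t else n * ci


class HotspotDispatcher:
    """Interpret a block until it has run `threshold` times, then translate it."""

    def __init__(self, interpret: Callable[[int], object],
                 translate: Callable[[int], Callable[[], object]], threshold: int):
        self.interpret, self.translate, self.threshold = interpret, translate, threshold
        self.counts: Dict[int, int] = defaultdict(int)
        self.compiled: Dict[int, Callable[[], object]] = {}

    def execute(self, address: int) -> object:
        if address in self.compiled:               # already hot: native code
            return self.compiled[address]()
        self.counts[address] += 1
        if self.counts[address] > self.threshold:  # just became hot: translate it
            self.compiled[address] = self.translate(address)
            return self.compiled[address]()
        return self.interpret(address)             # still cold: interpret


# Average cost per execution after 20 executions:
print(interpreter_cost(20) / 20, recompiler_cost(20) / 20, hotspot_cost(20, t=5) / 20)
```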
2.4.1 Framebuffers
The simplest video output device is the framebuffer, which drives a video
display from a memory buffer containing a complete frame of data.
The memory buffer typically consists of color values for each pixel on the
screen. The number of bits used to store each color value is the color depth,
and it determines how many colors the device can display simultaneously: a
depth of d bits allows $2^d$ colors. The bigger the color depth, the more bits
it takes to store a single color value and, in turn, the more memory is needed
to store the framebuffer. For example, a 24-bit 800x600 framebuffer needs
800 × 600 × 24/8 = 1,440,000 bytes (about 1.37 MiB) of memory, while a 4-bit
one with the same resolution only needs 800 × 600 × 4/8 = 240,000 bytes
(about 234 KiB).
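The arithmetic above, and what a 4-bit framebuffer looks like in memory, can be sketched in Python. Packing two pixels per byte with the left pixel in the high nibble is an assumed convention chosen for illustration; hardware differs on this point.

```python
def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Memory for one complete frame, rounded up to whole bytes."""
    return (width * height * bits_per_pixel + 7) // 8


def set_pixel_4bpp(buffer: bytearray, width: int, x: int, y: int, color: int) -> None:
    """Store a 4-bit color value; two pixels share a byte (assumed: left = high nibble)."""
    index = y * width + x
    byte = index // 2
    color &= 0x0F
    if index % 2 == 0:
        buffer[byte] = (buffer[byte] & 0x0F) | (color << 4)
    else:
        buffer[byte] = (buffer[byte] & 0xF0) | color


print(framebuffer_bytes(800, 600, 24))   # 1440000
print(framebuffer_bytes(800, 600, 4))    # 240000
```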
In order to save memory and storage, a technique called indexed color is
used, in which an indexed buffer takes the place of a full color buffer. With
this technique, color information is stored in a buffer called a palette, in
which every color is identified by its position (index) in the palette. Instead
of full color values, an indexed buffer stores a palette index for each pixel.
An example indexed image in 1, 2 and 8-bit color depths is shown in Fig. 15.
As shown in Fig. 16, this method was used in numerous systems. For example,
the Sega Genesis has a palette that can hold up to 64 different colors selected
from 512 different color values.
Figure 15: 1, 2, 8-bit indexed color pictures and original 24-bit color picture: (a) 1-bit, (b) 24-bit (truecolor), (c) 2-bit and palette, (d) 8-bit and palette
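Indexed color can be sketched in Python as a lookup of each pixel's index in the palette. The 512 Genesis color values correspond to 3 bits per red, green and blue channel (8 × 8 × 8 = 512); the 0000BBB0GGG0RRR0 bit layout decoded below is an assumption based on common Genesis documentation and should be checked against it. Changing a palette entry recolors every pixel that uses that index without touching the indexed buffer.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def genesis_color_to_rgb(word: int) -> RGB:
    """Decode a 9-bit color stored as 0000BBB0GGG0RRR0 (assumed layout)."""
    r, g, b = (word >> 1) & 7, (word >> 5) & 7, (word >> 9) & 7
    return (r * 255 // 7, g * 255 // 7, b * 255 // 7)   # stretch 0-7 to 0-255


def resolve(indices: List[int], palette: List[RGB]) -> List[RGB]:
    """Look each pixel's palette index up to get its displayed color."""
    return [palette[i] for i in indices]


palette = [genesis_color_to_rgb(0x000), genesis_color_to_rgb(0x00E)]  # black, red
pixels = [0, 1, 1, 0]
before = resolve(pixels, palette)
palette[1] = genesis_color_to_rgb(0xE00)   # recolor index 1 to blue
after = resolve(pixels, palette)           # the pixel buffer itself is unchanged
```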
Color depth   Number of colors   Used in
1-bit         2                  Atari ST
3-bit         8                  ZX Spectrum
4-bit         16                 Amstrad CPC
6-bit         64                 Sega Genesis
8-bit         256                Atari TT, AGA, Falcon030
11-bit        2048               Sega Saturn
12-bit        4096               Neo Geo
Figure 16: Indexed color depth used in a number of systems

2.4.2 Tile-based graphics
To save even more memory and storage, instead of directly controllable framebuffers some systems implemented a tile-based method of storing graphics
data in memory. Using this method, screens are composed of tiles instead of
individual pixels. In turn, the tiles are composed of pixels or even smaller
tiles. This is illustrated by Fig. 17. The video hardware composes the tiles
on the fly in each frame or line instead of keeping a complete image in memory.
This approach made it possible to reuse frequently occurring tiles, such as the sky,
empty space and other repeating patterns.
As an example, in the Sega Genesis, the screen was composed of two main
tile planes and one auxiliary plane. The main planes could hold 4096 tiles
each. Each tile was 8x8 pixels and video memory could hold approximately
1300-1800 distinct tiles at a time. Each tile could be displayed in one of four
palette lines, further promoting reuse of tiles. Fig. 18 shows the planes that
the screenshot in Fig. 17a is composed of.
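Composing a tile plane can be sketched in Python. The sketch assumes tiles stored as 8x8 grids of 4-bit color indices and a tile map whose cells are (tile index, palette line) pairs; each output pixel becomes an index into a 64-color palette organized as four palette lines of 16 colors. Flipping, priority and scrolling are omitted, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

TILE = 8                      # tiles are 8x8 pixels
Tile = List[List[int]]        # 8 rows of 8 color indices (0-15)


def render_plane(tilemap: List[List[Tuple[int, int]]], tiles: Dict[int, Tile]) -> List[List[int]]:
    """Compose a plane from a tile map whose cells hold (tile index, palette line).

    Each output pixel is palette_line * 16 + tile pixel, an index into a
    64-color palette made of four 16-color palette lines.
    """
    height, width = len(tilemap) * TILE, len(tilemap[0]) * TILE
    image = [[0] * width for _ in range(height)]
    for ty, row in enumerate(tilemap):
        for tx, (tile_index, palette_line) in enumerate(row):
            tile = tiles[tile_index]          # one stored tile may fill many cells
            for y in range(TILE):
                for x in range(TILE):
                    image[ty * TILE + y][tx * TILE + x] = palette_line * 16 + tile[y][x]
    return image


solid = [[1] * TILE for _ in range(TILE)]
plane = render_plane([[(0, 0), (0, 2)]], {0: solid})   # same tile, two palette lines
```

Storing the tile once and referencing it from two cells is the reuse that made tile-based graphics so memory-efficient.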
2.4.3 Sprites
Apart from backgrounds and foregrounds, objects (such as the player character, enemies and items) were common in video games. Unlike
backgrounds and foregrounds, which typically fit well into tile-based graphics,
objects could move in arbitrary ways, making tiling difficult. Because of this,
consoles incorporated sprites into their video graphics hardware.
(a) Frame
(b) Tiles in video memory
Figure 17: Screenshot of the game Ristar and tiles in video memory at the time
Generally, sprites are bitmaps which can be moved on the screen without
altering the data defining the rest of the screen. This introduced considerable
space and bandwidth savings, as instead of modifying the entire affected area
of the screen every time an object was to be drawn, the game only had to send
a small sprite frame to the graphics hardware. This was useful for animated
objects as only the sprite frame had to be changed in an animation. Fig. 19
shows distinct frames of a sprite that appears in the frame in Fig. 17a.
Implementing sprites in hardware was an opportunity to relieve software
of some work that is common when working with sprites. One such
example is scaling and rotation, which was implemented on the Game
Boy Advance. Another common feature was collision detection, as the
sprite hardware already had all the sprite position information. An
overview of sprite hardware capabilities for several systems is shown in Fig. 20.
Figure 18: Planes of a single frame on the Sega Genesis: (a) foreground plane, (b) background plane, (c) sprites plane
Figure 19: Sprite frames for an enemy in the game Ristar
System             Sprites on screen   Sprites per line   Max sprite size   Colors    Scaling     Rotation   Collision detection
Game Boy Advance   128                 128                64x64             15, 255   Yes         Yes        -
Neo Geo            384                 96                 16x512            15        Shrinking   -
Sega Genesis       80                  20                 32x32             15                               Yes
SNES               128                 34                 64x64             15        Limited     Limited    Yes
Figure 20: Sprite hardware capabilities for several systems
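Sprite hardware of this kind can be sketched in Python as a scanline renderer with a per-line sprite limit and a collision flag. The default limit of 20 follows the Genesis row of Fig. 20; treating color index 0 as transparent, letting earlier sprites win, and dropping sprites past the limit are simplifying assumptions, not the exact behavior of any particular chip.

```python
from typing import List, NamedTuple, Tuple


class Sprite(NamedTuple):
    x: int
    y: int
    pixels: List[List[int]]   # rows of palette indices; 0 is transparent


def render_sprite_line(sprites: List[Sprite], line: int, width: int,
                       max_per_line: int = 20) -> Tuple[List[int], bool]:
    """Draw one scanline of sprites; earlier sprites in the list have priority."""
    out = [0] * width
    collision = False
    drawn = 0
    for sprite in sprites:
        if not sprite.y <= line < sprite.y + len(sprite.pixels):
            continue                             # sprite does not cover this line
        if drawn == max_per_line:
            break                                # per-line limit: the rest are dropped
        drawn += 1
        for i, color in enumerate(sprite.pixels[line - sprite.y]):
            x = sprite.x + i
            if color == 0 or not 0 <= x < width:
                continue                         # transparent or off-screen pixel
            if out[x]:
                collision = True                 # two opaque sprite pixels overlap
            else:
                out[x] = color                   # keep the higher-priority pixel
    return out, collision


sprites = [Sprite(0, 0, [[1, 1, 0]]), Sprite(1, 0, [[2, 2]])]
line_pixels, hit = render_sprite_line(sprites, line=0, width=4)
```

Because the hardware already walks every sprite on every line, detecting overlaps costs almost nothing extra, which is why collision flags were a common feature.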
2.4.4 Vector displays
A vector display was a display device in which the image was composed of
drawn lines instead of a grid of pixels. The electron beam would follow an
arbitrary path tracing the connected lines, instead of following the standard
horizontal raster path. To utilize this kind of display, video hardware would
send drawing commands that steer the electron beam, rather than image frames.
The video game console Vectrex was entirely based on this concept. The popular
game Asteroids used vector graphics, as shown in Fig. 21.
Figure 21: Screenshot of the game Asteroids
2.4.5 Image scaling
Old video game consoles ran at much lower resolutions than today's screens are
capable of displaying. It is therefore necessary to upscale the image, which
is a non-trivial task. The naïve way of upscaling an image is to perform
nearest-neighbor interpolation, which simply repeats pixels. However, this
method is not well suited to the task, as its result looks jagged. There are a
number of pixel art scaling algorithms designed specifically to address this
problem. Fig. 22 compares several of them.
The first known algorithm is EPX, developed to port LucasArts games
to early Macintosh computers, which had about double the resolution of the
original system. The algorithm replaces every pixel with a 2x2 block of the
same color. Then, the top-left pixel of the block is set to the color of the
original pixel's left and upper neighbors if those two neighbors have the same
color; the other three pixels of the block are treated likewise, each using the
two neighbors adjacent to its corner. Further magnification can be obtained by
running the algorithm several times.
Figure 22: Comparison of image scaling algorithms performed on a dolphin sprite from the game Super Mario World (4x magnification) [10]: (a) nearest-neighbor, (b) EPX, (c) Super 2xSaI, (d) hq4x, (e) Kopf-Lischinski
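EPX as described above can be sketched in Python. This version uses the widely cited Scale2x formulation of the same rules, which expresses EPX's exception for areas where three or more neighbors share a color as two extra checks per corner. Images are lists of rows of color values, and border pixels reuse themselves as missing neighbors; both are assumptions of this sketch.

```python
from typing import List

Image = List[List[int]]


def epx(src: Image) -> Image:
    """Scale an image 2x with EPX (Scale2x formulation)."""
    h, w = len(src), len(src[0])

    def px(x: int, y: int) -> int:
        # Clamp at the borders so edge pixels stand in for missing neighbors.
        return src[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = [[0] * (2 * w) for _ in range(2 * h)]
    for y in range(h):
        for x in range(w):
            p = px(x, y)
            a, b = px(x, y - 1), px(x + 1, y)   # upper and right neighbors
            c, d = px(x - 1, y), px(x, y + 1)   # left and lower neighbors
            e1 = e2 = e3 = e4 = p               # start from a plain 2x2 copy
            if c == a and c != d and a != b:
                e1 = a                          # top-left: left and upper agree
            if a == b and a != c and b != d:
                e2 = b                          # top-right: upper and right agree
            if d == c and d != b and c != a:
                e3 = c                          # bottom-left: lower and left agree
            if b == d and b != a and d != c:
                e4 = d                          # bottom-right: right and lower agree
            out[2 * y][2 * x], out[2 * y][2 * x + 1] = e1, e2
            out[2 * y + 1][2 * x], out[2 * y + 1][2 * x + 1] = e3, e4
    return out


staircase = epx([[0, 1],
                 [1, 1]])   # the diagonal edge gains an extra smoothing pixel
```

Compared with nearest-neighbor scaling, the only change in this example is the bottom-right pixel of the top-left block, which takes the color shared by its right and lower neighbors.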
A refinement on EPX is the Eagle algorithm, which works similarly but
also considers pixels adjacent diagonally. A number of similar algorithms
exist which differ in the way they calculate the resulting pixels but are similar
in principle, such as 2xSaI [5], Super 2xSaI and Super Eagle. All of these
algorithms suffer from locality issues, as they only consider adjacent pixels.
A more sophisticated approach is used in the hqnx family of algorithms
[19]. First, the color difference is calculated between the magnified pixel and
each of its 8 nearest neighbors. Each difference is compared to a predetermined
threshold to determine whether the two pixels are close or distant. As each of
the 8 neighbors is either close or distant, there are $2^8 = 256$ possible
combinations. Second, a lookup table with 256 entries is used, one entry for
each combination. Each entry describes how to mix the colors in the resulting
image to interpolate pixels of the filtered image. Because the lookup table, the
most computationally intensive part of the algorithm, is prepared in advance,
the work done per pixel at run time is small and the algorithm performs very
well in real time. Because of that, and because it delivers the best results
when compared to other fast algorithms, it is used in numerous emulators, such
as bsnes and FCE Ultra. Algorithms in the hqnx family still, however, suffer
from strict locality issues and cannot resolve certain ambiguities.
All of the algorithms presented so far suffer from locality issues. They
also produce bitmaps and only work for predetermined degrees of magnification. To address these problems, the Kopf-Lischinski algorithm was
developed [10]. It uses a novel way to extract resolution-independent vector
images from pixel images. The vector image can then be used to magnify the
image by an arbitrary amount without feature degradation. The algorithm
resolves features in the input image and converts them into regions with
smoothly varying shading that are separated by piecewise-smooth contour
curves. Although it produces some of the best-looking results, the algorithm
is quite complex, and as it has not been used in an emulator to date, it is
uncertain how it would perform in a real-time setting.
2.5 Sound emulation
2.5.1 Sampling
Sound hardware typically processes digital audio, then performs digital-to-analog conversion and outputs the analog signal. Sound waves are continuous signals, so to process them digitally it is necessary to convert them
to a discrete signal. A process called sampling takes measurements of
a continuous signal at regular intervals and produces a set of digital samples,
as shown in Fig. 23. An example sample is shown in Fig. 24.
The simplest sound processing was to store samples and use them directly
to generate sound. The Nyquist sampling theorem states that, for an analog
signal to be completely determined by its samples, it suffices that the sampling
frequency is more than twice the highest frequency present in the signal.
Because the upper end of the range of human hearing is approximately 20 kHz,
and for technical reasons, sound is typically sampled at 44.1 kHz, a bit above
double 20 kHz. This means that to store a single second of 8-bit mono sound,
44,100 bytes are needed, which was quite prohibitive when every single kilobyte
mattered.
Figure 23: Sampling of a simple sine wave. Samples are the black dots.
Figure 24: Sample of rain, thunder and chirping birds and a 0.1 second excerpt of the sample
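Sampling and the storage arithmetic above can be sketched in Python. A 440 Hz sine wave, far below half of 44.1 kHz, is measured 44,100 times per second and each measurement is quantized to one unsigned byte; the unsigned 0-255 encoding is one common 8-bit convention and is an assumption of this sketch.

```python
import math
from typing import Callable, List


def sample(signal: Callable[[float], float], sample_rate: int, duration_s: float) -> List[float]:
    """Measure a continuous signal (a function of time in seconds) at regular intervals."""
    return [signal(n / sample_rate) for n in range(int(sample_rate * duration_s))]


def to_unsigned_8bit(x: float) -> int:
    """Quantize an amplitude in [-1, 1] to one byte (0-255)."""
    return max(0, min(255, round((x + 1) * 127.5)))


def a440(t: float) -> float:
    return math.sin(2 * math.pi * 440 * t)   # 440 Hz, far below 44.1 kHz / 2


one_second = bytes(to_unsigned_8bit(v) for v in sample(a440, 44_100, 1.0))
print(len(one_second))   # 44100: one byte per sample for 8-bit mono
```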
As with graphics, space and bandwidth concerns meant that chips for generating sound on the fly had to be developed. One type was the programmable
sound generator (PSG), which generated sound waves by synthesizing several
basic waveforms and combining them. PSGs often had a noise generator
to produce explosion and percussion sounds. An example is the SN76489
chip used in the Sega Genesis. As PSGs needed very few bits of input data
to operate, they were used in the years before memory became large and
affordable enough to store sound samples. The downside of PSGs was that
the sound they generated was rather bland.
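A PSG tone channel can be sketched in Python as a down-counter that reloads from the tone register and flips a square-wave output each time it reaches zero. The assumption that the counter is clocked at one sixteenth of the chip's input clock gives the commonly quoted SN76489 formula f = clock / (32 N); treat this constant, and the 3,579,545 Hz example clock, as assumptions to check against the chip's documentation. The noise generator is not modelled.

```python
class ToneChannel:
    """Square-wave tone channel in the style of a PSG such as the SN76489."""

    def __init__(self, period: int):
        self.period = period      # tone register value N
        self.counter = period
        self.output = 1           # current square-wave level: +1 or -1

    def tick(self) -> int:
        """Advance one step of the (pre-divided) chip clock; return the output level."""
        self.counter -= 1
        if self.counter == 0:
            self.counter = self.period   # reload the counter...
            self.output = -self.output   # ...and flip: one half-period has passed
        return self.output


def tone_frequency(clock_hz: float, period: int) -> float:
    """Output frequency, assuming the counter runs at clock / 16."""
    return clock_hz / (32 * period)      # a full wave is two half-periods


channel = ToneChannel(2)
levels = [channel.tick() for _ in range(8)]
print(tone_frequency(3_579_545, 254))     # close to 440 Hz
```

Only a 10-bit period and an attenuation value per channel need to be written to the chip, which is why PSGs required so little input data.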
2.5.2 Frequency modulation
A very popular technique called frequency modulation [2] was used to generate better sounding audio. The equation for a frequency-modulated wave
of peak amplitude A is

$$F = A \sin(\omega_c t + I f(t))$$

An arbitrary modulating function $f(t)$ is multiplied by the modulation
(change) index $I$ and added to the phase of the carrier sine wave, whose
angular frequency is $\omega_c$. In the case that the modulating wave is also a
sine wave, with angular frequency $\omega_m$, the equation becomes

$$F = A \sin(\omega_c t + I \sin(\omega_m t))$$
To create realistic sounding instruments, operators are used, each possessing a frequency, an envelope and the capability to modulate its input using
the frequency and envelope. Operators are arranged in circuits. Several such
circuits are shown in Fig. 25. Fig. 26 shows which circuits are well-suited
for which use. To simulate chords, multiple FM channels are used.
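The two-operator case of the equations above can be sketched in Python: the output of a modulator operator, multiplied by the index I, is added to the phase of a carrier operator with amplitude A = 1. The linear-decay envelope is a hypothetical stand-in for a real envelope generator, and none of the YM2612's actual parameters are modelled; the sketch only shows how a decaying modulator makes the timbre change over time while the carrier keeps the pitch.

```python
import math
from typing import List, Optional


def operator(t: float, freq_hz: float, phase_in: float = 0.0,
             level: float = 1.0, decay_s: Optional[float] = None) -> float:
    """One FM operator: a sine oscillator with a simple linear-decay envelope.

    phase_in is the modulating input added to the sine's phase.
    """
    envelope = level if decay_s is None else level * max(0.0, 1.0 - t / decay_s)
    return envelope * math.sin(2 * math.pi * freq_hz * t + phase_in)


def two_operator_voice(sample_rate: int, duration_s: float, carrier_hz: float,
                       modulator_hz: float, index: float, decay_s: float) -> List[float]:
    """F = A sin(w_c t + I f(t)), where f(t) is the modulator operator's output."""
    samples = []
    for n in range(int(sample_rate * duration_s)):
        t = n / sample_rate
        modulation = index * operator(t, modulator_hz, decay_s=decay_s)
        samples.append(operator(t, carrier_hz, phase_in=modulation))
    return samples


voice = two_operator_voice(8000, 0.5, carrier_hz=440, modulator_hz=880,
                           index=4.0, decay_s=0.1)
```

Once the modulator's envelope has decayed to zero, after 0.1 s here, the output is a pure 440 Hz sine; a chip would chain four operators in one of the circuits of Fig. 25 instead of two.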
This method was so pervasive in the 80s and early 90s that the patent
Stanford licensed to Yamaha for the method generated over $20 million for
the university in licensing fees alone before it expired in 1994, making it the
second most lucrative licensing agreement in Stanford's history to that date
[22].
An example of a frequency modulation synthesis chip was the Yamaha
YM2612 which was used in the Sega Genesis and a number of arcade systems. It was capable of six concurrent FM channels, using four operators
per channel (as shown in Fig. 25). One of the channels could be set to play
digital samples instead [14].
Figure 25: FM operator circuits (a) to (h) in the YM2612 chip, each built from four operators numbered 1 to 4 [14]
Circuit   Instruments it can be used to mimic
(a)       distortion guitar, bass
(b)       harp, PSG-like sound
(c)       bass, electric guitar, brass, piano, woods
(d)       strings, folk guitar, chimes
(e)       flute, bells, chorus, bass drum, snare drum
(f)       brass, organ
(g)       xylophone, vibraphone, snare drum, tom-tom
(h)       pipe organ
Figure 26: FM operator circuits in Fig. 25 and what instruments they can be
used to mimic
2.5.3 Aliasing
In the original hardware, the generated sound typically went straight to the speakers. An emulator, however, simulates the sound generation and stores the result as a sequence of discrete samples. This causes aliasing. Harmonics above the Nyquist frequency of the output (half the sample rate) fold back into the audible range and produce rough-sounding aliases. The solution to this problem is to perform band-limited synthesis [7].
2.6 Documentation
As discussed before, to emulate a specific system, an emulator has to emulate each of its components as well as the system as a whole, so information is needed on how all of them operate. For publicly available components, such as most CPUs and popular video and sound chips, the manufacturer usually provides ample technical documentation. Proprietary components (such as the lockout chip on the Nintendo SNES), peripherals and the inner workings of the system are a different matter. Their only primary documentation is usually confidential and not always possible to obtain. Reverse engineering is therefore sometimes necessary: discovering how the system works by taking it apart and analyzing how it behaves.
Another source of information is the source code of many open source emulators. Particularly useful are the MAME project [30], whose stated goal is to serve as a reference to the inner workings of various arcade systems, and its sister project MESS [36], which does the same for console systems. The licenses of these projects are incompatible with commonly used free software and open source licenses, so their source code cannot be combined with code under those licenses. They still serve as secondary documentation for the systems they emulate.
3 The Kiwi emulator
Using the information I gathered about emulation, I set out to write my own emulator. The result is the Kiwi emulator, which emulates most of the features of the Sega Genesis console and runs most of the games I tested with a high degree of accuracy. It has been tested on several Linux and OS X machines.
3.1 Architecture
Figure 3 in Chapter 2 showed a diagram of the Sega Genesis architecture.
Kiwi emulates everything in the diagram except for sound and the Z80 CPU, which is mainly used for sound.
The architecture of the emulator, as seen in Fig. 27, consists of several
parts. The main kiwi.py module interprets keyboard input and passes it
to the input module, which emulates the Sega Genesis joypad controllers.
The vdp module handles the emulation of the Sega Genesis Video Display
Processor (VDP), as well as rendering the raw screen output. The CPU core
is emulated by Musashi [34], an open-source library used in several popular
Sega Genesis emulators, including DGen [24] and Genesis Plus [28]. Musashi
is an interpreting emulator of the Motorola 68000 CPU with accurate cycle
timing.
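Musashi does not know about the rest of the machine. The host program supplies the memory access callbacks (in Musashi these are functions named m68k_read_memory_8, m68k_write_memory_8 and their 16- and 32-bit variants), and the host decides which component each address belongs to.

The sketch below shows that routing as plain C that runs without Musashi. The function names, the stub I/O and VDP handlers and the simplified address ranges are assumptions based on commonly published Genesis memory maps, not Kiwi's code. In a real build, a callback such as m68k_read_memory_8 would simply forward to read_memory_8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified Genesis 68000 memory map (assumed; most mirrors, the Z80
   area and many registers are omitted). */
#define ROM_END   0x3FFFFFu  /* cartridge ROM: 0x000000-0x3FFFFF */
#define IO_START  0xA10000u  /* joypad / I/O registers */
#define IO_END    0xA1001Fu
#define VDP_START 0xC00000u  /* VDP data and control ports */
#define VDP_END   0xC0001Fu
#define RAM_START 0xE00000u  /* 64 KB work RAM, mirrored up to 0xFFFFFF */

static uint8_t rom[ROM_END + 1];
static size_t rom_size;
static uint8_t ram[0x10000];

/* Hypothetical stubs standing in for the input and VDP modules. */
static uint8_t io_read_8(uint32_t addr)           { (void)addr; return 0xFF; }
static uint8_t vdp_read_8(uint32_t addr)          { (void)addr; return 0x00; }
static void io_write_8(uint32_t addr, uint8_t v)  { (void)addr; (void)v; }
static void vdp_write_8(uint32_t addr, uint8_t v) { (void)addr; (void)v; }

void load_rom(const uint8_t *data, size_t size)
{
    rom_size = size < sizeof rom ? size : sizeof rom;
    memcpy(rom, data, rom_size);
}

uint8_t read_memory_8(uint32_t address)
{
    address &= 0xFFFFFFu;                     /* the 68000 has a 24-bit address bus */
    if (address <= ROM_END)
        return address < rom_size ? rom[address] : 0xFF;
    if (address >= IO_START && address <= IO_END)
        return io_read_8(address);
    if (address >= VDP_START && address <= VDP_END)
        return vdp_read_8(address);
    if (address >= RAM_START)
        return ram[address & 0xFFFFu];        /* RAM repeats every 64 KB */
    return 0xFF;                              /* unmapped (simplified) */
}

void write_memory_8(uint32_t address, uint8_t value)
{
    address &= 0xFFFFFFu;
    if (address >= IO_START && address <= IO_END)
        io_write_8(address, value);
    else if (address >= VDP_START && address <= VDP_END)
        vdp_write_8(address, value);
    else if (address >= RAM_START)
        ram[address & 0xFFFFu] = value;
    /* writes to ROM and unmapped space are ignored */
}

/* The 68000 is big-endian: the byte at the lower address is the high byte. */
uint16_t read_memory_16(uint32_t address)
{
    return (uint16_t)((read_memory_8(address) << 8) | read_memory_8(address + 1));
}
```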
[Diagram: the modules kiwi.py, megadrive.c, vdp.c, input.c and scale.c, the Musashi 68K and hqx libraries, keyboard input and screen output]
Figure 27: Kiwi emulator architecture
The megadrive module is responsible for the high-level emulation of the Sega Genesis architecture. Every 1/60th of a second, kiwi.py calls on the megadrive module to emulate a single frame. Emulating a frame consists of rendering the screen line by line and running the CPU core for a short time between lines.
Having rendered the frame line by line, the megadrive module returns
the single frame image to kiwi.py. In case the user has chosen a resolution
higher than the native 320x240, the scale module performs the necessary
upscaling of the image. The user can choose from three of the image scaling
algorithms seen in Fig. 22: nearest-neighbor scaling, the EPX algorithm and
the hqx algorithm. Hqx is provided by the open-source library hqx [35].
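Of the three algorithms, EPX is short enough to show in full. For each source pixel P, it looks at the neighbors A (above), B (right), C (left) and D (below). It writes a 2x2 block that starts as four copies of P. A corner takes a neighbor's color when the two neighbors meeting at that corner match each other and differ from the other two.

Below is a minimal C sketch. It assumes 32-bit pixels in row-major buffers and repeats edge pixels at the borders. The function is illustrative, not Kiwi's own scale.c.

```c
#include <stddef.h>
#include <stdint.h>

/* EPX (also known as Scale2x) 2x upscaler. src holds w*h pixels and dst must
   hold (2w)*(2h). Pixels outside the image are treated as copies of the edge. */
void scale_epx(const uint32_t *src, uint32_t *dst, size_t w, size_t h)
{
    size_t ow = 2 * w;  /* output row width */
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            uint32_t p = src[y * w + x];
            uint32_t a = src[(y > 0 ? y - 1 : y) * w + x];      /* above */
            uint32_t b = src[y * w + (x + 1 < w ? x + 1 : x)];  /* right */
            uint32_t c = src[y * w + (x > 0 ? x - 1 : x)];      /* left  */
            uint32_t d = src[(y + 1 < h ? y + 1 : y) * w + x];  /* below */

            /* top-left, top-right, bottom-left, bottom-right */
            uint32_t e0 = p, e1 = p, e2 = p, e3 = p;
            if (c == a && c != d && a != b) e0 = a;
            if (a == b && a != c && b != d) e1 = b;
            if (d == c && d != b && c != a) e2 = c;
            if (b == d && b != a && d != c) e3 = d;

            dst[(2 * y) * ow + 2 * x]         = e0;
            dst[(2 * y) * ow + 2 * x + 1]     = e1;
            dst[(2 * y + 1) * ow + 2 * x]     = e2;
            dst[(2 * y + 1) * ow + 2 * x + 1] = e3;
        }
    }
}
```

On a diagonal line of single pixels, EPX fills the inner corners of the enlarged pixels. This smooths the staircase that nearest-neighbor scaling would produce.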
I wrote the emulator mostly in C, with a small amount of Python. C was used for the emulation part of the program because of its low CPU and memory overhead. Experiments with writing some parts of the emulation in Python showed that Python was too slow for the task. In the end, Python was used for the graphical user interface of the program.
3.2 Challenges
While writing Kiwi, I encountered some interesting trade-offs between speed and accuracy, as well as other curiosities.
3.2.1 Line rendering
The Sega Genesis outputs graphics using a color palette of 64 entries, each holding a 9-bit color. Under normal circumstances, therefore, only 64 colors can be drawn on the screen at the same time. Some games, however, used a clever trick: changing the palette mid-frame. For example, the top third of the screen would be drawn using one set of 64 colors. The palette would then be replaced with a different set of 64 colors, which was used to draw the remaining two-thirds of the screen.
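A palette entry can be decoded as shown below. The sketch assumes the CRAM word layout 0000 BBB0 GGG0 RRR0 (three bits per color component) described in VDP documentation such as [11]. It scales each component linearly to 8 bits, which only approximates the real hardware's output levels. The function names are hypothetical. The RGB table is rebuilt from CRAM for every line rather than once per frame, so a mid-frame palette change shows up on screen.

```c
#include <stdint.h>

#define CRAM_ENTRIES 64  /* 4 palettes of 16 colors */

/* Convert one 9-bit CRAM color (assumed layout 0000 BBB0 GGG0 RRR0)
   to a 24-bit 0xRRGGBB value. */
uint32_t cram_to_rgb888(uint16_t entry)
{
    unsigned r = (entry >> 1) & 7;
    unsigned g = (entry >> 5) & 7;
    unsigned b = (entry >> 9) & 7;
    /* Scale 0..7 linearly to 0..255 (an approximation of the real DAC). */
    r = r * 255 / 7;
    g = g * 255 / 7;
    b = b * 255 / 7;
    return (r << 16) | (g << 8) | b;
}

/* Rebuild the RGB lookup table from the current CRAM contents. */
void build_palette(const uint16_t cram[CRAM_ENTRIES], uint32_t rgb[CRAM_ENTRIES])
{
    for (int i = 0; i < CRAM_ENTRIES; i++)
        rgb[i] = cram_to_rgb888(cram[i]);
}
```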
This means that an emulator rendering the screen frame by frame would be inadequate, because it would miss the mid-frame palette change. Fig. 28 shows a water level in the game Sonic the Hedgehog drawn correctly and incorrectly side by side. The game uses the trick to draw the part of the screen above water with one palette and the underwater part with another.
The simplest, but incorrect, way to emulate a Sega Genesis frame would be to run the CPUs for 1/60th of a second and then render the frame. To emulate the trick properly, my emulator renders each line of the screen separately. After each line, it runs the CPUs for 1/13440th of a second (13440 is 60 times the number of lines on the screen, which in NTSC mode is 224).

Figure 28: A screenshot of a water level in the game Sonic the Hedgehog. The screen on the left uses correct line rendering. The screen on the right uses incorrect frame rendering.
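The line-by-line scheme can be sketched as follows. This is a minimal, self-contained illustration rather than Kiwi's code. The Machine struct, its callbacks and the toy "game" are hypothetical. The 68000 clock of 7,670,453 Hz (NTSC) is an assumption, used to turn 1/13440th of a second into a budget of about 570 cycles per line. The toy game switches palettes once the CPU has run through a third of the frame, imitating the water-level trick.

```c
#define LINES_PER_FRAME   224        /* visible lines in NTSC mode */
#define FRAMES_PER_SECOND 60
#define CPU_CLOCK_HZ      7670453L   /* assumed NTSC 68000 clock */
#define CYCLES_PER_LINE   (CPU_CLOCK_HZ / (FRAMES_PER_SECOND * LINES_PER_FRAME))

/* Hypothetical hooks standing in for the CPU core and the VDP. */
typedef struct {
    long (*run_cpu)(void *ctx, long cycles);  /* returns cycles actually run */
    void (*render_line)(void *ctx, int line);
    void *ctx;
} Machine;

/* Render one frame line by line, running the CPU after each line so that
   palette or register writes made mid-frame affect the lines drawn later. */
long emulate_frame(const Machine *m)
{
    long total = 0;
    for (int line = 0; line < LINES_PER_FRAME; line++) {
        m->render_line(m->ctx, line);
        total += m->run_cpu(m->ctx, CYCLES_PER_LINE);
    }
    return total;
}

/* Toy machine: the "game" loads a second palette once the CPU has run for
   a third of the frame. */
typedef struct {
    long cycles;                        /* CPU cycles run so far this frame */
    int palette;                        /* palette currently loaded */
    int line_palette[LINES_PER_FRAME];  /* palette each line was drawn with */
} Toy;

static long toy_run_cpu(void *ctx, long cycles)
{
    Toy *t = ctx;
    t->cycles += cycles;
    if (t->cycles >= (LINES_PER_FRAME / 3) * CYCLES_PER_LINE)
        t->palette = 1;                 /* the "underwater" palette */
    return cycles;
}

static void toy_render_line(void *ctx, int line)
{
    Toy *t = ctx;
    t->line_palette[line] = t->palette;
}
```

In this toy, line rendering draws lines 0 to 73 with the first palette and the rest with the second. A frame renderer that ran the CPU for the whole 1/60th of a second before drawing would see only the second palette, which is the error shown on the right of Fig. 28.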
3.2.2 Cycle accuracy
The main CPU used in the Sega Genesis, the Motorola 68000, does not execute all instructions equally fast. For example, a multiplication instruction might take 64 clock cycles, while a subtraction instruction might take only 8. Because some games depend on these differences, an emulator must reproduce the cycle timing of CPU instructions instead of just emulating their functionality. An example of this is a stage in the game Sonic the Hedgehog 2, shown in Fig. 29, which plays too fast when precise cycle timing is disabled.
Figure 29: A screenshot of a level in the game Sonic the Hedgehog 2 which
requires precise instruction timing.
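Cycle-accurate interpretation charges every instruction a cost and runs the CPU core for a budget of cycles rather than a number of instructions. Musashi works this way: m68k_execute takes a cycle count and returns the number of cycles actually used. The toy interpreter below illustrates the principle. Its three-instruction set, its names and its costs (64 cycles for a multiplication and 8 for a subtraction, echoing the examples above) are hypothetical, not real 68000 timings.

```c
#include <stddef.h>

/* A toy instruction set with per-opcode cycle costs (illustrative only). */
typedef enum { OP_SUB, OP_MUL, OP_NOP } Opcode;

static const int cycle_cost[] = { [OP_SUB] = 8, [OP_MUL] = 64, [OP_NOP] = 4 };

typedef struct {
    const Opcode *program;
    size_t len;
    size_t pc;   /* index of the next instruction */
    long acc;    /* single accumulator register */
} Cpu;

/* Run instructions until at least `budget` cycles have elapsed or the program
   ends, and return the cycles actually used. Like many cycle-counting
   interpreters, this may overshoot the budget by part of one instruction. */
long cpu_execute(Cpu *cpu, long budget)
{
    long used = 0;
    while (used < budget && cpu->pc < cpu->len) {
        Opcode op = cpu->program[cpu->pc++];
        switch (op) {
        case OP_SUB: cpu->acc -= 1; break;
        case OP_MUL: cpu->acc *= 2; break;
        case OP_NOP: break;
        }
        used += cycle_cost[op];
    }
    return used;
}
```

Suppose every instruction were instead charged the same small cost. A multiplication-heavy game loop would then execute more iterations within each line's budget. This is one plausible way a game ends up running too fast, as in Fig. 29.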
3.2.3 Video game bugs
Sometimes video game programmers make mistakes in their code that go unnoticed for one reason or another. One such bug persisted in a game called Bass Masters Classic Pro Edition. The Sega Genesis video processor (VDP) normally operates in what is called Mode 5. For backward compatibility, the video chip also supports a mode used by earlier hardware, called Mode 4. At one point, Bass Masters mistakenly switches to Mode 4, sets the screen resolution, and then switches back to Mode 5. Real hardware ignores the resolution setting made in Mode 4. If an emulator does not reproduce this and instead lets the setting stick, the logo screen is garbled, as illustrated in Fig. 30.
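Emulating this quirk comes down to ignoring the resolution write while the VDP is in Mode 4. The sketch below models only that behavior. The register numbers and bits are assumptions based on VDP documentation such as [11]: register 1, bit 2 selects Mode 5, and register 12, bit 0 selects the 320-pixel width. The struct and function names are hypothetical as well.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VDP_NUM_REGS 24
#define REG_MODE_SET_2 1    /* assumed: bit 2 = Mode 5 enable */
#define REG_MODE_SET_4 12   /* assumed: bit 0 = 320-pixel (H40) width */

typedef struct {
    uint8_t reg[VDP_NUM_REGS];
    int width;              /* active display width in pixels */
} Vdp;

void vdp_init(Vdp *v)
{
    memset(v->reg, 0, sizeof v->reg);
    v->width = 256;         /* 256-pixel (H32) width by default */
}

static bool vdp_in_mode5(const Vdp *v)
{
    return (v->reg[REG_MODE_SET_2] & 0x04) != 0;
}

void vdp_write_register(Vdp *v, int index, uint8_t value)
{
    if (index < 0 || index >= VDP_NUM_REGS)
        return;
    if (index == REG_MODE_SET_4 && !vdp_in_mode5(v))
        return;             /* resolution writes made in Mode 4 are ignored */
    v->reg[index] = value;
    if (index == REG_MODE_SET_4)
        v->width = (value & 0x01) ? 320 : 256;
}
```

Replaying the sequence from Bass Masters (Mode 4, resolution write, Mode 5) leaves the width unchanged. An emulator that let the write stick would instead switch resolutions.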
[Screenshots: (a) Incorrect emulation; (b) Correct emulation]
Figure 30: A screenshot from the game Bass Masters Classic Pro Edition

A similar thing happened when Microsoft was developing Windows 95. The game SimCity would not work in beta versions of Windows 95. Microsoft tracked down the bug and found that SimCity was reading memory after freeing it. On an operating system whose allocator reuses freed memory, this can fail. On Windows 3.x, the operating system preceding Windows 95, it worked fine because the freed memory was never used again. So the Windows 95 programmers added code that looks for SimCity and puts the memory allocator in a special mode that does not free memory right away [17].
4 References
4.1 Information
[1] Neil Bradley. Ms Pacman lives! (Yahoo Groups). URL http://tech.groups.yahoo.com/group/staticrecompilers/message/287.
[2] John Chowning. The Synthesis of Complex Audio Spectra by Means of Frequency Modulation. Journal of the Audio Engineering Society, 21(7):526–534, 1973.
[3] Oracle Corporation. Java SE HotSpot at a Glance. URL http://java.sun.com/products/hotspot/.
[4] Victor Moya del Barrio. Study of the techniques for emulation programming. Master's thesis, FIB UPC, June 2001.
[5] Derek Liauw Kie Fa. 2xSaI: The advanced 2x Scale and Interpolation Engine. URL http://vdnoort.home.xs4all.nl/emulation/2xsai/.
[6] Marat Fayzullin. How to write a computer emulator. URL http://fms.komkon.org/EMUL8/HOWTO.html.
[7] Shay Green. Band-Limited Sound Synthesis. URL http://www.slack.net/~ant/bl-synth/.
[8] LLVM Developer Group. The LLVM Compiler Infrastructure. URL http://llvm.org/.
[9] Motorola Inc. M68000 Family Programmer's Reference Manual. Prentice Hall, 1992.
[10] Johannes Kopf and Dani Lischinski. Depixelizing Pixel Art. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2011), 30(4):99:1–99:8, 2011.
[11] Charles MacDonald. Sega Genesis VDP documentation, 2000. URL http://cgfm2.emuviews.com/txt/genvdp.txt.
[12] Anders Montonen. antime's feeble Sega Saturn page. URL http://koti.kapsi.fi/~antime/sega/index.html.
[13] Mark Probst. Fast Machine-Adaptable Dynamic Binary Translation, 2002.
[14] Roger Sanders. New Documentation: An authoritative reference on the YM2612. URL http://gendev.spritesmind.net/forum/viewtopic.php?t=386.
[15] Joshua H. Shaffer. A Performance Evaluation of Operating System Emulators. Master's thesis, Bucknell University, May 2004.
[16] David Sharp. Tarmac, Dynamically Recompiling ARM Emulator, 2001.
[17] Joel Spolsky. Strategy Letter II: Chicken and Egg Problems, May 2000. URL http://www.joelonsoftware.com/articles/fog0000000054.html.
[18] Michael Steil. Dynamic Re-compilation of Binary RISC Code for CISC Architectures. PhD thesis, Technische Universität München, Institut für Informatik, September 2004.
[19] Maxim Stepin. hq2x, hq3x, hq4x filters. URL http://www.hiend3d.com/demos.html.
[20] Bart Trzynadlowski. Sega Genesis Emulator Save State Reference, February 2002.
[21] Bart Trzynadlowski. 68000 Undocumented Behavior Notes, May 2003.
[22] Stanford University. Music synthesis approaches sound quality of real instruments, June 1994. URL http://news.stanford.edu/pr/94/940607Arc4222.html.
4.2 Emulators
[23] Stéphane Dallongeville. Gens emulator. URL http://www.gens.me/.
[24] Dave. The DGen emulator. URL http://dgen.sourceforge.net.
[25] Aaron Giles. Aaron's MAME Memories. URL http://www.aarongiles.com/mamemem/index.html.
[26] Greg James, Barry Silverman, and Brian Silverman. Visual 6502 in JavaScript. URL http://www.visual6502.org/JSSim/index.html.
[27] Greg James, Barry Silverman, and Brian Silverman. Visual Transistor-level Simulation of the 6502 CPU and other chips. URL http://www.visual6502.org/.
[28] Charles MacDonald and Eke-Eke. The Genesis Plus GX emulator. URL http://code.google.com/p/genplus-gx/.
[29] RealityMan and Epsilon. UltraHLE. URL http://www.emuunlim.com/UltraHLE/.
[30] Nicola Salmoria and the MAME Team. MAME, Multiple Arcade Machine Emulator. URL http://mamedev.org/.
[31] Bernd Schmidt. UAE Amiga Emulator. URL http://www.amigaemulator.org/.
[32] Steve Snake. The Kega Fusion emulator. URL http://www.eidolons-inn.net/tiki-index.php?page=Kega.
[33] Michael Steil, Orlando Bassotto, Gianluca Guida, et al. Libcpu. URL http://www.libcpu.org/wiki/Main_Page.
[34] Karl Stenerud. The Musashi 68000 emulator. URL http://www.zophar.net/linux/68000/musashi.html.
[35] Maxim Stepin and Cameron Zemek. The hqx scaling library. URL http://code.google.com/p/hqx/.
[36] MESS team. MESS, Multi Emulator Super System. URL http://www.mess.org/.
[37] The VICE team. The VICE Emulator. URL http://www.viceteam.org/.
[38] VBA Team. VisualBoyAdvance emulator. URL http://vba.ngemu.com/.