Making of "nano" by KeyJ / Kakiarts
One of the greatest releases of the Evoke 2006 demoparty was "nano", the first demo for the (you guessed it) Apple iPod nano MP3 player. Hoping to attract more coders to this nice platform, I'm going to reveal much of the inner workings of the demo in this article.
Strictly speaking, "nano" isn't the first iPod demo. It's only the first demo on the iPod nano, or color iPods in general. There has already been a demo on the older monochrome hard disk-based models (1G-4G): Podfather by Hooy-Program. While this demo was an outstanding achievement by itself, the content was rather "old-schoolish" and seemed to me more like a proof of concept than a modern demo.
To make a long story short, I wanted to do better :) I wanted to do a demo that is not only good because it runs on the second smallest device that has ever been "demoed" (the smallest one being the Nintendo Pokémon Mini). It also should be a demo I'd personally like, it should be new-school. It should have a consistent theme and mood, pleasing graphics, a good soundtrack and all that. I actually strove for TBL-like excellence, although I knew from the start that I'm going nowhere near that. But hey, it's always good to have dreams :)
The problem with all these wishes and goals was that I didn't have a clear idea of the general topic of the demo. Finally, I settled for the obvious one: It's a demo for an iPod nano, so I made "nano" the title and topic of the project. At first, I only wanted to show some unsorted microscopic, nanotechnological, biological or chemical stuff. Later on, I refined that idea into its final shape: A journey from very large-scale stuff like galaxies and planets down to real nanoscopic things like subatomic particles.
The greatest issue during development was that everything was finished a little too late. In particular, our musician, dq, had no real idea of the mood the demo was going to have, because a first beta version wasn't available until one week before the party. Anyway, dq came up with a really nice tune about a month before Evoke. The music was almost perfect, except that the vocals didn't fit the demo's theme at all. So after some complaining, I got a version without vocals and kept it. The music fit the demo really well, even the timing of the individual parts matched the optimal scene lengths so closely that synchronization was really easy.
The graphics and general design work was much less problematic: Our graphician, Gabi, and I did some brainstorming about which scenes belong in the demo, complete with some early sketches of how the scenes should look. The final graphics were done half by her and half by me. We passed our suggestions back and forth: I said what type of image I needed, she sent it, I integrated it, sent her back a screenshot or video of the resulting scene, she added corrections, and so on.
The development platform
As you might expect, Apple's original iPod firmware doesn't offer a way to execute custom-made code, so another operating system has to be installed first. Fortunately, unlike some other closed systems like game consoles, the iPod isn't protected like a fortress. Even better, it doesn't even require any hardware intervention like flashing the boot PROM: Everything is done by modifying the contents of the USB-accessible storage area.
There are two alternative operating systems available for the iPod nano: Rockbox is a fully customized firmware for music players of various brands, and iPodLinux is a full uClinux system that includes a media player application (among others). I chose the latter, because I'm already familiar with Linux and its APIs. In fact, getting started with iPod development was quite easy and isn't too different from writing a normal desktop Linux application.
Having a development platform that is very similar to the platform that is being developed for has other benefits, too. The most important one is that, if done properly, only a minimum amount of target-specific code is required. By implementing only these tiny parts for both the development system (Linux on x86 here) and the target (iPodLinux), it's easily possible to switch between the devices just by recompiling the code. This is very useful, because the largest part of the development process can thus be done on the fast desktop computer with its comfortable debugging facilities and short edit-compile-run cycles. In fact, if this had not been possible, I wouldn't have done the demo in the first place. Just imagine having to do each and every tiny test on the nano! Edit, compile, plug in iPod, mount iPod, copy binary, unmount iPod, detach iPod, boot iPod, run, curse the damn thing because you can't do any real debugging, curse it again because it has no MMU and thus hangs instead of just segfaulting, and so on. It'd have been pure horror. But with the ability to write the code on the x86 machine, it was just edit, compile, run, debug. Tests on real hardware were only necessary for performance evaluation and (of course) writing the hardware-specific code parts.
Coding considerations
The first design decision made during development was: Write everything in C, unless it is absolutely necessary to use assembler. The main reason for this is of course laziness (even though ARM assembly is very easy to write, C is even easier), but it also served to minimize target-specific code. Every part written in ARM assembly would have required testing directly on the iPod, and hand-written assembler tends to have a much higher bug density and thus requires higher debugging effort. Fortunately, the iPod is so fast that I didn't require any assembly optimizations at all. The whole demo is written in pure C and only optimized by GCC with the -O2 switch.
But this doesn't mean that no care has been taken of platform specifics. After all, C code is quite close to the hardware, and careful choice of the C constructs used has a major influence on the quality of the machine code generated by the compiler. So, knowing that the nano is based on an ARM7TDMI core, I followed some basic rules like:
- only using fixed-point arithmetic
- minimizing the number of divisions, because there's no hardware support for that
- using shifts, where appropriate (ARM has a barrel shifter in front of one ALU input, so shifting is almost free)
- using tables where appropriate (the iPod's memory interface is crappy, so tables aren't useful if they cover less than ~3 multiplications)
- using pointer incrementation instead of indexed addressing (ARM doesn't have base+index*scale addressing modes like x86 has, but it has lots of registers)
The engine
The iPod is a device that clearly wasn't built for fast real-time graphics. Everything the original Apple firmware ever does in the graphics department is blitting some pre-rendered font bitmaps to the screen, loading raw album artwork images into the framebuffer or scrolling some parts of the screen. So it's no wonder that the hardware isn't graphics-friendly either. The pixel format is quite weird (RGB565, but with a nifty endianness swap that makes it something like GBRG3553) and the framebuffer interface is crappy as hell: no memory-mapped framebuffer, no DMA, only slow 32-bit MMIO with lots of wait states.
To get started with graphics programming, I experimented with some iPodLinux example framebuffer code first. The results were disappointing, to say the least: 32 fps only, just for copying a raw image from system memory to the frame buffer. This was mainly because the example code is highly device-independent and hence full of 'if's. Also, it's very well structured, so that a single register write access can span up to three function calls. I removed all the unneeded stuff and made a stripped-down, nano-only optimized version that finally ran "oneframe", that is, at 60 fps.
This solved the problem with the slow framebuffer interface, but it didn't solve the problem with the impractical color format: Rendering graphics directly in GBRG3553 is possible, but a little bit hard to code. So I took the easy way again and made up a memory-based framebuffer with much better properties. This buffer is 256 pixels wide so that (x,y) pixel addressing can be done with shifts and ORs (a single instruction in ARM!) and each pixel uses a 32-bit RGBA format with 8 bits per component. Conversion from this simple, abstract frame buffer format to the one required by the LCD controller is done directly in the display routine, while waiting for the controller to acknowledge the last pixel transfer. The alpha channel of the framebuffer image is discarded in this process, so it can be used by the rendering routines for any purpose.
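A minimal sketch of the two ideas in this paragraph: addressing the 256-pixel-wide work buffer with a shift and an OR, and converting one work-buffer pixel to byte-swapped RGB565. The function names and the 0xAARRGGBB component order are assumptions made for illustration; the article does not state the actual layout.

```c
#include <stdint.h>

#define FB_SHIFT 8   /* 256 pixels per row, so a row offset is y << 8 */

/* Index of pixel (x, y) in the 256-wide work buffer: one shift, one OR. */
static inline uint32_t fb_index(uint32_t x, uint32_t y)
{
    return (y << FB_SHIFT) | x;
}

/* Convert one work-buffer pixel to byte-swapped RGB565.
 * Assumed input layout: 0xAARRGGBB; the top byte is simply dropped. */
static inline uint16_t rgba_to_lcd(uint32_t p)
{
    uint32_t r = (p >> 16) & 0xFF;
    uint32_t g = (p >> 8) & 0xFF;
    uint32_t b = p & 0xFF;
    uint16_t rgb565 = (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));

    /* Swapping the two bytes gives the G3-B5-R5-G3 bit order. */
    return (uint16_t)((rgb565 << 8) | (rgb565 >> 8));
}
```

With this layout, pure red (0x00FF0000) becomes 0x00F8 and pure green (0x0000FF00) becomes 0xE007, which shows how the green bits end up split across both bytes.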
Computations are done using a single 21.11 fixed-point integer format throughout the whole demo. I chose the 11 bits because I required the ability to multiply pixel coordinates (~9 bits in the integer part, plus sign bit) with fractional values (0..1), but at the same time, I wanted to retain as much precision as possible. The format worked out well for all uses I had in mind so there was no reason to introduce another one.
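The arithmetic behind that choice can be sketched as follows. The helper names are hypothetical and the demo's real macros are not shown in this article. A coordinate needs about 10 integer bits including the sign, plus 11 fraction bits; multiplied by a factor in 0..1 (at most 11 fraction bits plus one), the product still fits into a signed 32-bit register, so a plain 32-bit multiply followed by a shift is enough.

```c
#include <stdint.h>

#define FP_BITS 11
#define FP_ONE  (1 << FP_BITS)      /* 1.0 in 21.11 fixed point */

typedef int32_t fixed;

static inline fixed fp_from_int(int32_t i) { return i * FP_ONE; }

/* Relies on arithmetic right shift of negative values, as GCC provides. */
static inline int32_t fp_to_int(fixed f) { return f >> FP_BITS; }

/* 32-bit multiply: only safe while |a * b| stays below 2^31, which holds
 * for a screen coordinate (|a| < 512.0) times a factor b in 0..1. */
static inline fixed fp_mul(fixed a, fixed b)
{
    return (a * b) >> FP_BITS;
}
```

For example, fp_mul(fp_from_int(100), FP_ONE / 2) yields fp_from_int(50). With more than 11 fraction bits, the same product could overflow 32 bits for coordinates near the screen edge.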
Music
Unsurprisingly, playing music on the iPod is a relatively easy task. iPodLinux implements a standard OSS/Linux sound API for that purpose. However, it may be surprising that there's no hardware or hardware-assisted sound decoder of any sort in the iPod: Decoding has to take place in software only! This is why I chose not to use MP2, MP3, Vorbis or something like that. I needed every CPU cycle I could get for the graphics, and these modern lossy audio codecs eat up a massive amount of processing power. The solution was to compress the audio slightly with a fast algorithm: ADPCM. I implemented the DVI/IMA ADPCM spec that compresses 16-bit audio data down to 4 bits per sample using only a handful of instructions on the decoder side. The resulting quantization noise is indiscernible most of the time, so all in all, I got a good balance between file size, processing power and audio quality.
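To show how little work the decoder has to do, here is a sketch of decoding a single 4-bit IMA ADPCM code. The step and index tables are the standard IMA ones; the struct and function names are illustrative, and the way the demo packs nibbles into its file is not described here, so it is left out.

```c
#include <stdint.h>

static const int16_t ima_step[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
    19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
    130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
    337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
    876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
    5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};
static const int8_t ima_index_adj[8] = { -1, -1, -1, -1, 2, 4, 6, 8 };

typedef struct { int32_t predictor; int32_t index; } ima_state;

/* Decode one 4-bit code into a 16-bit sample and update the state. */
static int16_t ima_decode_nibble(ima_state *s, uint8_t code)
{
    int32_t step = ima_step[s->index];
    int32_t diff = step >> 3;

    if (code & 4) diff += step;
    if (code & 2) diff += step >> 1;
    if (code & 1) diff += step >> 2;
    if (code & 8) s->predictor -= diff;   /* bit 3 is the sign */
    else          s->predictor += diff;

    if (s->predictor > 32767)  s->predictor = 32767;
    if (s->predictor < -32768) s->predictor = -32768;

    s->index += ima_index_adj[code & 7];
    if (s->index < 0)  s->index = 0;
    if (s->index > 88) s->index = 88;

    return (int16_t)s->predictor;
}
```

Starting from a zeroed state, the code 7 decodes to the sample 11 and moves the step index to 8; no multiplication or division is involved anywhere.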
Basic effects
The first graphics routine implemented was a simple sprite engine that loads sprites from raw 32-bit files (matching the framebuffer format) and displays them on-screen using a small range of blending modes. Among these, there is a special "Z sprite" mode that assumes that the alpha values in both the framebuffer and the sprite are in fact depth values. It is used to draw the 3D, ball-like sprites with real depth testing.
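The "Z sprite" mode could look roughly like the sketch below. It assumes that the depth sits in the top byte of each 32-bit pixel and that a larger value means "closer to the viewer"; both are assumptions, as is the function name.

```c
#include <stddef.h>
#include <stdint.h>

/* Draw one row of a Z sprite: a sprite pixel replaces the framebuffer
 * pixel only if its depth byte says it is closer (assumed: larger = closer). */
static void zsprite_row(uint32_t *dst, const uint32_t *src, size_t n)
{
    while (n--) {
        uint32_t s = *src++;
        if ((s >> 24) > (*dst >> 24))
            *dst = s;       /* copies color and depth in one store */
        dst++;
    }
}
```

Under these assumptions, sprite pixels with depth 0 never pass the test, so they act as fully transparent without any extra check.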
The basis for the main effect in the demo ought to be a 3D engine. I say "ought to", because most of the 3D scenes were scrapped due to the very bad quality of the engine. I had never written such a thing before: I entered the active demoscene when OpenGL and Direct3D already were the normal way of doing things. So I wrote an OpenGL-like graphics pipeline from scratch, based on 4x4 matrix transforms, a simplified perspective mapper, back face culling, the ability to draw points (using the Z sprites explained above), lines, triangles and triangle strips - but it lacks proper clipping. This was the engine's Achilles' heel, as off-screen vertices become distorted so badly that their coordinates may overflow and appear at the opposite side of the screen.
The triangle renderer at the end of the graphics pipeline interpolates 4 values (z, a, b, c) linearly along the edges and scanlines. The Z value is used for depth testing against the aforementioned alpha/depth frame buffer component. The other values can be interpreted either as RGB color values for drawing untextured, colored polygons or they can be interpreted as u/v texture coordinates, allowing for affine texture mapping.
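As an illustration of the linear interpolation along a scanline, here is a sketch that steps the four values across one span. The type and function names are made up, the values are assumed to be 21.11 fixed point, and a real filler would write pixels (after the depth test) instead of storing the interpolants in an array.

```c
#include <stdint.h>

typedef struct { int32_t z, a, b, c; } interp4;   /* assumed 21.11 values */

/* Linearly interpolate (z, a, b, c) across a span of n pixels. */
static void span_interpolate(interp4 *out, interp4 left, interp4 right, int n)
{
    interp4 step = { 0, 0, 0, 0 };
    int i;

    if (n > 1) {            /* four divisions per span, none per pixel */
        step.z = (right.z - left.z) / (n - 1);
        step.a = (right.a - left.a) / (n - 1);
        step.b = (right.b - left.b) / (n - 1);
        step.c = (right.c - left.c) / (n - 1);
    }
    for (i = 0; i < n; i++) {
        out[i] = left;
        left.z += step.z;
        left.a += step.a;
        left.b += step.b;
        left.c += step.c;
    }
}
```

Whether (a, b, c) are then treated as a color or as texture coordinates only changes what the inner loop does with them, not the interpolation itself.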
Scene 1: Galaxy
The background image is a sprite that only contains grayscale values. The galaxy is made of 3D transformed point sprites that are simply added to the framebuffer without overflow checking. A final post-processing step clips the values into the usual (0..255) range.
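A sketch of this add-then-clip approach, assuming each framebuffer word holds one plain intensity value while the scene is being drawn. The function names and the expansion to a gray 0x00RRGGBB pixel in the final pass are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Add a row of sprite intensities to the framebuffer, no overflow check. */
static void add_sprite_row(uint32_t *dst, const uint32_t *src, size_t n)
{
    while (n--)
        *dst++ += *src++;
}

/* Final pass: clip each intensity to 0..255 and expand it to a gray pixel. */
static void clamp_to_gray(uint32_t *fb, size_t n)
{
    while (n--) {
        uint32_t v = *fb;
        if (v > 255)
            v = 255;
        *fb++ = (v << 16) | (v << 8) | v;
    }
}
```

Deferring the clip to one pass per frame keeps the sprite loop down to a load, an add and a store per pixel.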
Scene 2: Planet
This scene uses a simple raycasting engine: for each screen pixel, the point of collision with the sphere is computed and then transformed from (x, y, z) into polar (latitude, longitude) form. These values are then used to pick the proper texture pixel. The texture is an old, but nice NASA image of the whole earth surface, mapped suitably for polar coordinates. I added some clouds to it to generate a more realistic look.
The main problem with this scene was the transformation into polar coordinates: Among others, it requires an atan2() operation, which is really, really slow. When doing the timing, I was glad that the planet scene has such a small timeframe that I could cut it just before it really starts to become jerky.
Scene 3: Mountains
The only scene in the whole demo that is made entirely with polygon 3D. It is a simplified remake of a scene from my then-favorite TBL demo, "Silkcut". The landscape was made with Terragen. The height map is hand-drawn using Terragen's built-in editor. Afterwards, I exported it as a small 11x11 bitmap. In the demo, the landscape is made up of 10x10 quads whose vertex Z coordinates are taken from the bitmap. The texture is again taken from Terragen.
I had a really hard time with this scene. It took hours to choose a camera path and fine-tune some additional hacks so that no larger artifacts (from the bad, clipping-less 3D engine) are visible. Since the polygon filler is a little bit "lazy" in that it doesn't draw all of the edge pixels, the contours of the polygons are not drawn completely and thus become clearly visible. I hid this effect by simply not clearing the framebuffer between frames. This way, the missing pixels at least contained colors that somewhat resembled the correct ones. In addition, a large sprite was put in the upper-right corner of the screen, because the scene wasn't large enough, and tearing-like artifacts would have been visible in that corner. To sum this up, the whole scene is nothing but a dirty hack.
Scene 4: Wobble
There's not much to say about this scene: It's just a dual-sine wobble effect with an unanimated image. Due to optimization, the wobbler only draws 128 of the 132 screen lines, so a sprite is put at the bottom of the screen to fill it up.
Scene 5: Blood
This scene is a composite of two completely different, but synchronized effects: A tunnel effect that draws the walls of the blood vessel, and polygon 3D for the blood cells (slow as hell!). The cells are simply low-polygon tori with a simple texture. The tunnel is again a raycasting effect: This time, the coordinates (x, y, z) of the intersection between the eye ray and the walls (i.e. the cylinder) are transformed into a (phi, z) pair that acts as texture coordinates. From the raycasting point of view, the scene isn't animated: The rays will always hit the same spots of the surrounding cylinder. For this reason, all the computations (including the atan2() to get the phi value) are done only once during preparation and stored in a 2D table. To simulate animation, the coordinates taken from this table are translated prior to the texture lookup.
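The per-frame part of such a table-driven tunnel can be sketched like this. The precomputation itself is omitted; the table entry layout, the 256x256 texture size and the names are assumptions chosen so that wrapping is free.

```c
#include <stddef.h>
#include <stdint.h>

/* One precomputed table entry per screen pixel: texture coordinates of the
 * point where that pixel's ray hits the cylinder (u from phi, v from z). */
typedef struct { uint8_t u, v; } tunnel_entry;

/* Draw the tunnel: translate the stored coordinates by (du, dv), then look
 * up a 256x256 texture. The uint8_t arithmetic wraps around automatically. */
static void tunnel_draw(uint32_t *dst, const tunnel_entry *table,
                        const uint32_t *texture, size_t npixels,
                        uint8_t du, uint8_t dv)
{
    while (npixels--) {
        uint8_t u = (uint8_t)(table->u + du);
        uint8_t v = (uint8_t)(table->v + dv);
        *dst++ = texture[((uint32_t)v << 8) | u];
        table++;
    }
}
```

Increasing dv every frame moves the viewer along the vessel, and changing du rotates the walls, all without touching atan2() again.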
To give the whole scene more depth, a nifty post-processing effect is applied after both the walls and the cells have been drawn: Each pixel's color values are multiplied by its Z value. This makes remote pixels appear darker than close ones, which makes up for a sufficiently convincing 3D effect.
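A sketch of that post-processing step for a single pixel, assuming the depth is stored in the top byte (0 = far, 255 = near) above 0xRRGGBB color bits; the layout and the function name are assumptions.

```c
#include <stdint.h>

/* Darken a pixel according to its depth byte: each color component is
 * multiplied by z and scaled back with a shift instead of a division. */
static uint32_t depth_shade(uint32_t p)
{
    uint32_t z = p >> 24;
    uint32_t r = (((p >> 16) & 0xFF) * z) >> 8;
    uint32_t g = (((p >> 8) & 0xFF) * z) >> 8;
    uint32_t b = ((p & 0xFF) * z) >> 8;

    return (p & 0xFF000000u) | (r << 16) | (g << 8) | b;
}
```

A pixel at half depth (z = 128) comes out at roughly half brightness, and a pixel at depth 0 turns black.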
Scene 6: Cells
This was the first effect to be finished, and it's also one of the simplest. Like the galaxy scene, everything is rendered in a pure 32-bit format without separate RGBA bitfields. The cells are actually circular sprites that are added to the static background. A final color mapping process then converts the simple intensity values into the greenish RGBA ones that can be seen on the screen. Overbright values aren't simply clipped, but inverted. This leads to funny, somewhat organic-looking effects when two or more cells overlap.
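One way to read "inverted instead of clipped" is to reflect overbright intensities back down from 255. The sketch below does that and then maps the result to a greenish color; the reflection formula and the green ramp are both guesses for illustration, not the demo's actual mapping.

```c
#include <stdint.h>

/* Map a summed intensity to a greenish 0x00RRGGBB pixel.
 * Values above 255 are mirrored (256 -> 254, 300 -> 210, ...) rather
 * than clipped; anything above 510 ends up black. */
static uint32_t cell_color(uint32_t v)
{
    if (v > 255)
        v = (v > 510) ? 0 : 510 - v;

    return ((v >> 1) << 16) | (v << 8) | (v >> 2);
}
```

Where two bright cells overlap, the sum overshoots and the overlap region gets darker toward its center, which would produce the membrane-like rims described above.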
Scene 7: Molecules / Greetings
The basic idea of having greetings in the form of (fictional) chemical formulae was one of the first during the design process. I actually wrote a Python script to generate as-real-as-possible formulae from arbitrary words (i.e. those that don't use non-existing element names unless absolutely necessary).
The molecules are rendered as 3D point sprites, with geometry data loaded from disk. The simpler molecules are hand-written (coordinates typed in manually), others are generated. The CNT molecule (carbon nano tube) is a good example for this. The large fictional molecule at the right end of the scene is generated by starting with one atom and appending new atoms in any of the 6 spatial directions at random. The buckyball (C60 molecule) was tricky, though: I couldn't come up with an easy algorithm to compute the atom coordinates. So I googled a bit and found some Java applets that show these molecules in 3D. After some reverse-engineering, I got hold of the input files for one of these applets and extracted the required coordinates from it.
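The random growth of the fictional molecule could be sketched as follows. The article does not say which atom a new one is attached to or how collisions are handled; this version picks a random existing atom and rejects occupied grid positions, and all names are hypothetical.

```c
#include <stdlib.h>

typedef struct { int x, y, z; } atom;

static const atom dirs[6] = {
    { 1, 0, 0 }, { -1, 0, 0 }, { 0, 1, 0 },
    { 0, -1, 0 }, { 0, 0, 1 }, { 0, 0, -1 }
};

/* Return 1 if position p is already taken by one of the first n atoms. */
static int occupied(const atom *atoms, int n, atom p)
{
    int i;
    for (i = 0; i < n; i++)
        if (atoms[i].x == p.x && atoms[i].y == p.y && atoms[i].z == p.z)
            return 1;
    return 0;
}

/* Grow a molecule of up to `count` atoms on a unit grid, starting at the
 * origin. Returns the number of atoms actually placed. */
static int grow_molecule(atom *atoms, int count)
{
    int n = 1, tries = 0;

    if (count < 1)
        return 0;
    atoms[0] = (atom){ 0, 0, 0 };
    while (n < count && tries++ < 100 * count) {
        atom base = atoms[rand() % n];
        atom d = dirs[rand() % 6];
        atom p = { base.x + d.x, base.y + d.y, base.z + d.z };
        if (!occupied(atoms, n, p))
            atoms[n++] = p;
    }
    return n;
}
```

Seeding rand() with a fixed value makes the result reproducible, which matters if the geometry is generated once and then stored in a file.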
Scene 8: Electron Microscope
This scene was one of Gabi's first ideas. It is rendered using a simplified voxel engine without perspective. I guess there's no need to explain that further, as everyone who's still reading the article at this point should have written a voxel engine himself ;)
The Benzene formula that is shown at the top is only made up of half-transparent white lines. The vertex coordinates are computed using a simple 3x2 matrix transform that allows for translation, rotation and scaling in 2D space. In other words, it's simple vector graphics.
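A 3x2 matrix transform of this kind can be sketched in a few lines. The struct layout and names are assumptions; the four linear coefficients are taken to be 21.11 fixed point, while coordinates and the translation are plain pixel units.

```c
#include <stdint.h>

/* 2D affine transform: | a b tx |
 *                      | c d ty |
 * a, b, c, d are 21.11 fixed point; tx, ty are in pixels. */
typedef struct { int32_t a, b, c, d, tx, ty; } mat3x2;

static void transform_point(const mat3x2 *m, int32_t x, int32_t y,
                            int32_t *ox, int32_t *oy)
{
    *ox = ((m->a * x + m->b * y) >> 11) + m->tx;
    *oy = ((m->c * x + m->d * y) >> 11) + m->ty;
}
```

Rotation and scaling go into a, b, c and d (for an angle t and scale s: a = d = s*cos t, b = -s*sin t, c = s*sin t), so every vertex costs four multiplications and two shifts.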
Scene 9: Wilson cloud chamber / Credits
Another effect that uses a pure non-RGBA grayscale representation of the framebuffer. However, there's a special twist to it this time: The framebuffer values are re-used after every frame to implement the slowly fading clouds. So the first thing the renderer does is chopping off the green and blue components from the framebuffer values and attenuating the red ones a bit, resulting in nice plain 32-bit intensity values. Furthermore, some slight noise is added to every pixel during this process. Since the amount of fade is dependent on the frame rate, the scene must run with 20-30 fps for optimal visual quality. This is the case on the iPod, but in the PC version, the engine needs to be artificially slowed down for this purpose.
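The pre-conditioning step might look like this for a single pixel. The 15/16 attenuation factor, the two-bit noise amplitude and the 0x00RRGGBB layout are assumptions, and the noise value is passed in so that any random source can be used.

```c
#include <stdint.h>

/* Turn last frame's RGBA pixel back into a faded intensity value:
 * keep only the red component, attenuate it, add a little noise. */
static uint32_t fade_pixel(uint32_t rgba, uint32_t noise)
{
    uint32_t v = (rgba >> 16) & 0xFF;   /* drop green and blue */

    v = (v * 15) >> 4;                  /* assumed fade factor of 15/16 */
    return v + (noise & 3);             /* assumed noise range 0..3 */
}
```

Because the fade is applied once per frame, the clouds decay faster at higher frame rates, which is why the frame rate matters for this scene.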
After this pre-conditioning step, the elements of the scene are added one after another (literally added: the pixel values in the frame buffer are incremented). First, 1000 random pixels will get a random (1/4 max.) intensity gain. Second, one "gamma ray" with the same maximum intensity will be added per frame. A gamma ray, in the scene's context, is simply a line somewhere on the screen, in any orientation, that crosses the complete screen. Third, the heavier nucleon traces (20 of them) are drawn. These are slower and react to the chamber's magnetic field, hence they travel along circular paths. Finally, the credits are added using sprites with a quickly jittering position. After all these drawing steps, the screen is again converted from the flat intensity format to the common RGBA one.
Scene 10: Atoms
This is a very straightforward scene: A background, some point-sprite nucleons and electrons, and some orbital lines. The nucleons' geometry data is loaded from files that have been generated by a (painfully slow) Python script, based on the number of protons and neutrons in each of the six elements (hydrogen, helium, carbon, silicon, gold and uranium). The electron orbitals are drawn using partially-transparent white lines. They aren't exactly circular, but regular 32-sided polygons, except for the gold and uranium ones: To save processing power, the polygons are only 16-sided for these atoms. Nevertheless, this is the scene that drives the iPod (or rather, my bad programming :) closest to its limits, with down to ~3 fps while the uranium atom is fully shown.
Scene 11: Particle Accelerator / Collision
These are in fact two scenes: The first one is a Ken Burns type zoom into the accelerator and the second one is the collision itself, neither of which is anything special, technology-wise.
For the introductory scene, I chose an image of the DELPHI detector of the (now dismantled) LEP accelerator at CERN. The effect itself isn't actually worth speaking about: It's basically a rotozoomer, just without the rotation component :)
The collision scene is again made with 3D-transformed point sprites, so there's not much to say about it either. The background image is a funny thing, though. It's not even remotely related to particle accelerators: It's a photo from inside a mere plastic pipe ...
Scene 12: Kakiarts Logo
Right from the start, I wanted to have the Kakiarts logo in the form of a hyper-complex molecule. To accomplish this, I used a Python script (again) that takes a black/white image and places atoms only in places that map to black pixels. I fed a simplified image of our mascot into the script and voilà, I had a nice 256-atom logo molecule. Unfortunately, the nano screen proved a little bit too small for it: The monkey mascot was hardly recognizable :(
Just before I decided to ditch the molecule monkey altogether, Gabi came up with a brilliant idea: Blend the molecule logo slowly into the normal, bitmapped one. For some extra effect, I let the atoms from the molecule disappear one after another while the bitmap slowly faded from black. I was delighted to see that this just looked gorgeous :) Seemingly, other sceners share this opinion, because that scene is the most frequently praised one ....
I conclude with a little bit of trivia: The last three numbers in the Kakiarts formula at the end (12-8-2006) together form the release date of the demo: August 12, 2006.
-- KeyJ / Kakiarts 
            