Basic cache optimisations
Since Adok requested more coding articles on the Hugi website, for example about speed optimisations, I decided to write a little text about caches and memory, and it's aimed at the newbies. Experienced coders should know this stuff already.
"Pretty soon, computers will be fast"
I don't remember his name, but some computer-expert said the above quote in the seventies. :) Well, I still think my PC is too slow, but we've got to admit that computers are getting faster. Unfortunately, not all parts of the hardware increase their speed equally fast. Every 18 month, the processing power of the CPU's doubles. But memory chips do not double their bandwidth every 18 months, it takes more time to do that. And harddisks increase their bandwidth even slower. (Remember we are talking about bandwidth here, not capacity.) The result is that the CPU has to wait relatively longer and longer to get the data it needs to process. And of course it's no use to make a CPU that's twice as fast, when it will waste 50% of its cycles waiting for its data to arrive from the memory chips.
Time to clarify this difficult theory with some ASCII :
   Hey...                                   zZzZz
   Here...                                    Zz
   is...                                     zZ
   your...                                  zZzZz
   data...                                   ...
 +--------------+                       +-------------+
 |              | 0 1 1 0 1             |             |
 | Slow Memory  >=>=>=>=>=>============== Waiting CPU |
 |              |                       |             |
 +--------------+                       +-------------+
Didn't that help a lot?
The problem is not that we don't know how to make memory chips that are equally fast as a CPU, we can do that perfectly. The problem is that such memory is damn expensive, so you can't make a gigabyte of that stuff and still hope to sell it to millions of customers. Luckily you can use a small amount of super-fast memory, and still give the CPU the impression that all you memory is very fast.
Caches to the rescue!
If you examine in which way normal programs use their data, you'll notice that the different memory addresses they need are not completely random. They are related in two ways:
- Related in time: If the program recently needed a variable Counter, there is a good chance it will soon need the same Counter again.
- Related in space: If the program recently needed the first element in an array, it will probably soon ask for the second element and the rest too. If it needed one member in a structure, it might ask for the other members too.
The idea behind a cache is to put some superfast memory with some control logic (together called the cache) between the CPU and the slow memory chips. This cache will store the most recently used data. When the CPU ask for the contents of some variable, the cache will quickly look if it already contains that data. If it does, it can send the data to the CPU immediately. If not, it will send the request to the slow memory. When the data arrives, the cache will send it to the CPU, throw some old info inside itself away, and store the new data in the empty space (the new data is now "cached").
More ASCII diagrams: first the bad situation.
The request:
  Address 100 ?...        Oops, I don't        I want address 100!
  I'm searching...        have that one :(           Now!
 +--------------+            +---------+         +-------------+
 | Slow Memory  |            |  Cache  |         | Waiting CPU |
 |  .  ...   .  |            |         |         |             |
 | Adr 100: 245 ====<=<=<===== 090: 78 ===<=<=<===             |
 | Adr 101: 070 |            | 104: 16 |         |             |
 | Adr 102: 054 |            | 123: 00 |         |             |
 +--------------+            +---------+         +-------------+
And after some time:
  Here...you...         I'll keep this instead    Aha, finally!
  have...it...              of address 090              
 +--------------+            +---------+         +-------------+
 | Slow Memory  |            |  Cache  |         |     CPU     |
 |  .  ...   .  |            |         |         |             |
 | Adr 100: 245 ====>=>=>===== 100:245 ===>=>=>===  245        |
 | Adr 101: 070 |            | 104: 16 |         |             |
 | Adr 102: 054 |            | 123: 00 |         |             |
 +--------------+            +---------+         +-------------+
And now the good situation:
                           Searching.. Found!     I need address
                           Here you have it         100 again!
 +--------------+            +---------+         +-------------+
 | Slow Memory  |            |  Cache  |         | Waiting CPU |
 |  .  ...   .  |            |         |         |             |
 | Adr 100: 245 ============== 100:245 ===>=>=>===             |
 | Adr 101: 070 |            | 104: 16 |         |             |
 | Adr 102: 054 |            | 123: 00 |         |             |
 +--------------+            +---------+         +-------------+
(Yeah my doctor recommended me to draw ASCII boxes as a stress-therapy, so no criticism on them, OK?)
To get a significant speed increase, the good case should happen much more often than the bad case, and in normal programs, this is almost always the case.
Caches in the real world
Now the real situation is a tad more complex than I can (or want to) draw in a simple ASCII diagram. You should keep the following additional things in mind, especially the first:
- The memory inside the cache is not organised byte-by-byte, but in chunks of succeeding bytes. On the Intel platform, such a chunk is typically 32 bytes. Rather than asking the RAM a single byte at a time, the cache will ask 32 bytes at once. This way, if the CPU asks the bytes following the first byte, the cache already has them.
- There isn't just one single cache in a PC. There is a small cache, typically between 4 KB and 64 KB, directly on the CPU itself. This is the Level 1 (L1) cache, and it's as fast as the CPU. Then there is a larger Level 2 (L2) cache, 128 KB to 512 KB big, that is located on the motherboard or in the same package as the CPU. The L2 cache is slower than the L1 cache, but still faster than the main memory. The L1 cache is split in a part for the code and a part for the data. In modern computers, the CPU will ask both L1 and L2 caches to look for a certain address in parallel, so if it's not in the L1 cache, the L2 cache is already searching.
- If you know assembler, you can use the CPUID instruction to get more information on the size and organisation of the caches. However, there are so many different combinations of L1 and L2 caches that you shouldn't try to optimise your brand new demo effect for a specific combination. Over-specialisation will make it run slower on CPUs with different cache- characteristics. Just stick to some basic rules, and your code will run close to optimal on most CPUs.
Caches versus Demos
  
You might have seen that in the paragraphs above, I talked about "normal"
programs. I do not consider demos to be "normal" programs  
So, why should democoders bother about caches, if they do not boost the
speed of your code by huge amounts? Well, for two reasons:
 
- If you're working with large amounts of bulk data, like textures, height
  maps, and any other large 2-dimensional array, the cache may make your
  code two or three times SLOWER if you're not careful.
 
- If you're working with large collections of small structures, the cache
  CAN improve your performance, if you think a while about organising
  your data correctly.
 
Big disclaimer about the language used...
 
First I apology for using assembler, I know that a lot of newbies do not
read assembler very fluently, but I want to read some data exactly once
from the RAM to the CPU, and a high-level language like C/C++ or Pascal
doesn't allow you to do that. Don't panic, I'll try to explain everything
in detail in the following:
 
>> Emergency micro-course assembler just for this article <<
 
eax, ebx, ecx and edx: these are 32-bit registers. You can think of them
   as variables that are defined inside the CPU. There are more registers,
   but these are the only one I'll use.
 
[eax], [ebx+4], [my_data]: everything inside square brackets is treated
   as a pointer. So if ebx contains the value 1000, then [ebx+4] means:
   the value at address 1004.
 
mov eax, [my_data]: the MOV instruction moves data TO the first argument
  FROM the second argument. At least one of the arguments has to be a
  register. So here, we read a 32-bit word FROM the variable my_data (in
  memory) TO register eax (on the CPU). 
 
add eax, 4: Can you guess what this does? Yeah, it adds 4 (four) to eax!
  (If  you guessed it correctly, you win a free subscription to Hugi for
  the rest of your life!) If the result is zero, or negative, the CPU will
  set the zero- or the signed-flag. Flags are little 1-bit variables on
  the CPU, and other instructions can read these flags and do something
  depending on their value.
 
and eax, 16777215: In Pascal, this is "eax := eax AND 16777215;" and in C
  it's "eax &= 16777215;". Just like ADD, AND will set some flags.
 
label:   a name followed by a colon is a label. It specifies a position in
  you code. Jump instructions can jump to labels, to form loops. There are
  no for/while/... loops in assembler, you must make them yourself.
 
jnz label: This abbreviation means: Jump if Not Zero. It will look at the
  zero flag, and if this isn't set, it will jump to label: and continue
  with the code that follows label.
 
dec ecx: this subtracts 1 from register ecx, and sets the flags.
 
OK, that's all we'll need.
 
Example 1: reading a big 2D array
 
First I have allocated 16 MB of memory. I will start reading the memory from
the start to the end, one Dword (4 bytes) at a time:
 
 
So I fill ebx with the address of my 2D array. Eax is the index in my array,
and I add 4 to it in each iteration to address the next 32-bit dword. By
ANDing eax, I clear the highest 8 bits, effectively limiting the index to
16 MB. After 4194304 iterations, eax becomes zero after ANDing and the loop
ends.
 
Executing the above loop 100 times takes 4.35 seconds.
 
Next I changed the index updating to: add eax, 4*53. (I choose 53 because it
is a prime number, and hence it will still take 4194304 iterations before
eax AND 16777215 becomes zero. This way I don't need to add extra instructions
in the loop. The only thing that is important is that the whole 16 MB will
still be read exactly once per loop. Trust me. :))
 
Now, executing the loop 100 times again takes 17.42 seconds! That's three
times as much!
 
How comes? In the first loop, when the CPU ask for the first 4 bytes, the
cache will ask 32 bytes from the slow memory chips and keep those cached.
The next 7 times the CPU asks for the following 4 bytes, these are already
in the cache, and can be send to the CPU immediately.
 
In the second loop, each new address is 212 bytes further than the previous.
That means that the cache has to read 32 new bytes from the slow memory each
time. 28 of these will be wasted, because they will have been be overwritten
by other data before our index has wrapped around and is in the same
neighbourhood again. This means that in the second loop, 16*8 = 128 MB
memory is transferred from the RAM to the CPU. Needless to say, this is a
big waste.
 
So what is the moral of the story? Simple: never read your big 2D arrays
like textures/heightmaps/etc column by column (vertically), read them row
by row (horizontally).
 
Now what should you do if you cannot work row by row? For example, if you
want to make a copy of a picture rotated by 90 degrees? Either you'll read
in rows and write columns, or read in columns and write rows, right?
 
Nope. The solution is to use little blocks of pixels, for example 8 by 8
large. If you use 32-bit colors, reading the leftmost column of an 8 by 8
block will result in the whole block being read into the cache. In the next
iteration we read the following column from the same block. Because it's
from the same block, all data will still be in the cache. Nice, isn't it?
After reading 8 columns, we are ready with this block and we can go on
with the one below or to the right.
 
Example 2: Avoid big powers of 2
 
In a perfect world, caches would always store the most recent data. When
an address is requested that is not in the cache, the oldest 32 bytes
would be removed and be replaced by the new data. Unfortunately, the
world is not perfect.
 
The cache needs to check if it contains a given address extremely quickly.
To allow this, each address has a fixed position in the cache. But because
the cache is way smaller than the memory, several addresses are mapped to
the same cache position: they collide. Using addresses that collide will
degrade performance, as you will see now.
 
Take a look at this loop:
 
 
The inner part of the loop will read the same address five times in
register eax. Ecx is decremented at the end of every loop, until it is
zero. So we read 256 million times the same address five times in eax.
This pretty useless code takes 1.44 seconds to complete. Of course, all
the data is in the cache after the first iteration.
 
If I change one instruction in the inner loop to "mov eax, [ebx + 65536]",
the loop will still take 1.44 seconds to complete.
 
Changing another one to "mov eax, [ebx + 131072]": exactly the same.
 
Changing yet another one to "mov eax, [ebx + 196608]": still 1.44 seconds.
 
Changing yet another to "mov eax, [ebx + 262144]" results in this inner
loop:
 
 
And suddenly it takes 2.43 seconds. Huh?
 
The reason is that there is 64KB between each address. Addresses that differ
by large powers of two will collide in the cache. Fortunately, the cache-designers
have taken precautions so that a small amount of colliding addresses
can co-exist without pushing each other out of the cache. This amount is
called the "associativity" of the cache. Typical values are 2, 4 or 8.
 
The code shows that my L1 cache is 4-way associative. As long as there were
4 or less conflicting addresses, they could all be kept in the cache. But
when a fifth address was added, it pushed the first out of the cache. When
the first was read in again, it pushed the second out of the cache, and on
and on. Luckily my L2 cache is 8-way associative, so all data could be kept
there. Otherwise the difference would have been much higher.
 
Now imagine that you have a weird new effect which works like this: you have
9 monochrome 8-bit textures. To get a final psychedelic texture, you do this:
 
(Pseudocode instead of assembler, I'm lazy.)
 
 
If your textures have a size that's a power of two, and they are located
one after another in memory, all addresses will collide and you'll need
to read 9 * 32 bytes for each iteration.
 
You can solve this by making the textures a bit larger than necessary. If
you add 32 bytes to each texture, no addresses will collide, and you'll
need to read 9*32 bytes new data only once every 32 iterations. If you
want to play safe, you should allocate one huge memory block, and place
your textures inside that block yourself. This way, you don't have to
worry about weird things the memory allocation functions might do.
However make sure your data is aligned on 4-byte boundaries! (see further)
 
Example 3: optimising data structures
 
For one effect, I once needed the following structure:
 
 
I had an array with 1000 points, whose position in the array did not change
during runtime. To update the screen, I had to use the X and Y parameter
of every point in the array once for every pixel on the screen. The other
members (speed_x, speed_y and color) were used far less, once every frame.
 
On a 32-bit machine, the point-structure takes 20 bytes, so the entire array
is almost 20KB. That's too much to fit in my L1 cache, also because I needed
some other data too. However, I could increase the speed of my effect by 10%
by splitting point in two parts:
 
 
I had two arrays, arrayA with 1000 pointA's, and arrayB with 1000 pointB's.
The inner loop only required arrayA, which was now only 8 KB, and whenever I
needed the speed or color of the point in arrayA[i], I looked in arrayB[i].
Well, the effect was still to slow to be of any use, but I still hope to
improve it. :)
 
So to optimise your data structures, you should try to keep the most-often
used parts close together, so they are read into the cache together. Put
less-often used parts in a separate structure.
 
Last example: aligning
 
This isn't really about caches, but it's still something you should know.
32-bit CPU's will always read data in 32-bit chunks, which should be aligned
on 32-bit boundaries. This means that if you need byte 0, the CPU will ask
for bytes 0, 1 ,2 and 3. If you ask byte 5, the CPU will ask for bytes
4, 5, 6, and 7. But if you need the word that occupies bytes 7 and 8, the
CPU will first read bytes 4 to 7, and then bytes 8 to 11. This is because
the data you want spans two 4-bit areas, and the CPU will need them both.
 
It can get even worse with caches, because the cache always reads 32 bytes
at the same time, and these chunks are aligned on 32-byte boundaries. This
means that if you read the dword at address 30, the cache will read both
chunk 0 to 31 and chunk 32 to 63.
 
I repeated the tests of example 1 (reading 16 Mb of memory, first
sequentially, then non-sequentially). The only difference was that I added
30 to ebx, which is the start of the memory. As a result, the time to read
the entire block 100 times sequentially increased from 4.35 to 4.83 seconds.
That isn't a mayor slowdown, because the extra data in the cache is always
used in the next iteration. But the time to do the non-sequential test 
increased from 17.42 to 21.22 seconds, because now you need to read 64
bytes per iteration, of which only 4 are actually used.
 
Conclusion:
 
There are still some things I haven't mentioned yet, for example caches
differ significantly on how they write data back to the slow memory, but
I'm quite tired and I want to send this to Adok tomorrow. I hope that
reading this article inspired you to optimise your effects to the max,
and if you want to give feedback, mail me at seven7@writeme.com. And elite
coders for who this article was not intended but who have read it anyway
and want to point out any errors or tell about other tricks are welcome
too.
  mov   ebx, [start_of_16MB_data]
  mov	eax, 0
read_next_dword:
  mov	ecx, [ebx+eax]
  add	eax, 4
  and 	eax, 16777215
  jnz	read_next_dword
  mov   ecx, 256000000
  mov	ebx, [start_of_1MB_data]
label:
  mov	eax, [ebx]
  mov	eax, [ebx]
  mov   eax, [ebx]
  mov	eax, [ebx]
  mov	eax, [ebx]
  dec	ecx
  jnz	label
  mov   eax, [ebx]
  mov	eax, [ebx + 65536]
  mov   eax, [ebx + 131072]
  mov	eax, [ebx + 196608]
  mov	eax, [ebx + 262144]
  i = 0;
  while i < size texture
  {
    final[i].blue  = texture1[i] + texture2[i] + texture3[i];
    final[i].green = texture4[i] + texture5[i] + texture6[i];
    final[i].red   = texture7[i] + texture8[i] + texture9[i];
    i = i + 1;
  }
struct {
  int X, Y;
  int speed_x, speed_y;
  unsigned long color;
} point;
struct {
  int X, Y;
} pointA;
struct {
  int speed_x, speed_y;
  unsigned long color;  
} pointB;