Basic cache optimisations

Seven

Since Adok requested more coding articles on the Hugi website, for example about speed optimisations, I decided to write a little text about caches and memory. It's aimed at newbies; experienced coders should know this stuff already.

"Pretty soon, computers will be fast"

I don't remember his name, but some computer expert said the above quote in the seventies. :) Well, I still think my PC is too slow, but we've got to admit that computers are getting faster. Unfortunately, not all parts of the hardware increase their speed equally fast. Every 18 months, the processing power of CPUs doubles. But memory chips do not double their bandwidth every 18 months; that takes longer. And harddisks increase their bandwidth even more slowly. (Remember we are talking about bandwidth here, not capacity.) The result is that the CPU has to wait relatively longer and longer to get the data it needs to process. And of course there's no point in making a CPU that's twice as fast when it will waste 50% of its cycles waiting for its data to arrive from the memory chips.

Time to clarify this difficult theory with some ASCII:

   Hey...                                   zZzZz
   Here...                                    Zz
   is...                                     zZ
   your...                                  zZzZz
   data...                                   ...
 +--------------+                       +-------------+
 |              | 0 1 1 0 1             |             |
 | Slow Memory  >=>=>=>=>=>============== Waiting CPU |
 |              |                       |             |
 +--------------+                       +-------------+

Didn't that help a lot?

The problem is not that we don't know how to make memory chips that are as fast as a CPU; we can do that perfectly well. The problem is that such memory is damn expensive, so you can't make a gigabyte of that stuff and still hope to sell it to millions of customers. Luckily you can use a small amount of super-fast memory and still give the CPU the impression that all your memory is very fast.

Caches to the rescue!

If you examine the way normal programs use their data, you'll notice that the memory addresses they need are not completely random. They are related in two ways:

- Related in time: If the program recently needed a variable Counter, there is a good chance it will soon need the same Counter again.

- Related in space: If the program recently needed the first element in an array, it will probably soon ask for the second element and the rest too. If it needed one member in a structure, it might ask for the other members too.
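
To make those two kinds of relationship concrete, here is a tiny C fragment (the variable names are made up for illustration):

  int table[1000];         /* filled in elsewhere                             */
  int counter = 0;         /* reused in every iteration: related in time      */
  int i;

  for (i = 0; i < 1000; i++)
    counter += table[i];   /* neighbouring elements: related in space         */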

The idea behind a cache is to put some super-fast memory with some control logic (together called the cache) between the CPU and the slow memory chips. This cache stores the most recently used data. When the CPU asks for the contents of some variable, the cache quickly checks whether it already contains that data. If it does, it can send the data to the CPU immediately. If not, it passes the request on to the slow memory. When the data arrives, the cache sends it to the CPU, throws away some old data it was holding, and stores the new data in the freed-up space (the new data is now "cached").
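
If you prefer code over prose, here is a very simplified sketch of that logic in C. It is only a toy model: the sizes and the fetch_from_slow_memory() routine are invented, and real caches are organised a little differently, as we'll see later.

  #define LINE_SIZE 32                      /* bytes fetched per request       */
  #define NUM_LINES 64                      /* toy cache: 64 * 32 = 2 KB       */

  /* Hypothetical routine that asks the slow memory chips for 32 bytes. */
  extern void fetch_from_slow_memory(unsigned char *dest, unsigned int address);

  static unsigned char line_data[NUM_LINES][LINE_SIZE]; /* the super-fast memory */
  static unsigned int  line_tag[NUM_LINES];             /* which address it holds */
  static int           line_valid[NUM_LINES];

  unsigned char read_byte(unsigned int address)
  {
    unsigned int line = (address / LINE_SIZE) % NUM_LINES;
    unsigned int tag  = address / LINE_SIZE / NUM_LINES;

    if (!line_valid[line] || line_tag[line] != tag)
    {
      /* Miss: throw the old contents of this line away and fetch the    */
      /* 32 bytes around the requested address from the slow memory.     */
      fetch_from_slow_memory(line_data[line], address & ~(LINE_SIZE - 1));
      line_tag[line]   = tag;
      line_valid[line] = 1;
    }
    /* Hit: the data is in the fast memory, hand it over immediately.    */
    return line_data[line][address % LINE_SIZE];
  }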

More ASCII diagrams: first the bad situation.

The request:

  Address 100 ?...        Oops, I don't        I want address 100!
  I'm searching...        have that one :(           Now!
 +--------------+            +---------+         +-------------+
 | Slow Memory  |            |  Cache  |         | Waiting CPU |
 |  .  ...   .  |            |         |         |             |
 | Adr 100: 245 ====<=<=<===== 090: 78 ===<=<=<===             |
 | Adr 101: 070 |            | 104: 16 |         |             |
 | Adr 102: 054 |            | 123: 00 |         |             |
 +--------------+            +---------+         +-------------+

And after some time:

  Here...you...         I'll keep this instead    Aha, finally!
  have...it...              of address 090              
 +--------------+            +---------+         +-------------+
 | Slow Memory  |            |  Cache  |         |     CPU     |
 |  .  ...   .  |            |         |         |             |
 | Adr 100: 245 ====>=>=>===== 100:245 ===>=>=>===  245        |
 | Adr 101: 070 |            | 104: 16 |         |             |
 | Adr 102: 054 |            | 123: 00 |         |             |
 +--------------+            +---------+         +-------------+

And now the good situation:

                           Searching.. Found!     I need address
                           Here you have it         100 again!
 +--------------+            +---------+         +-------------+
 | Slow Memory  |            |  Cache  |         | Waiting CPU |
 |  .  ...   .  |            |         |         |             |
 | Adr 100: 245 ============== 100:245 ===>=>=>===             |
 | Adr 101: 070 |            | 104: 16 |         |             |
 | Adr 102: 054 |            | 123: 00 |         |             |
 +--------------+            +---------+         +-------------+

(Yeah, my doctor recommended drawing ASCII boxes as stress therapy, so no criticism of them, OK?)

To get a significant speed increase, the good case should happen much more often than the bad case, and in normal programs, this is almost always the case.

Caches in the real world

Now the real situation is a tad more complex than I can (or want to) draw in a simple ASCII diagram. You should keep the following additional things in mind, especially the first:

- The memory inside the cache is not organised byte by byte, but in chunks of consecutive bytes. On the Intel platform, such a chunk is typically 32 bytes. Rather than asking the RAM for a single byte at a time, the cache asks for 32 bytes at once. This way, if the CPU asks for the bytes following the first one, the cache already has them.

- There isn't just one single cache in a PC. There is a small cache, typically between 4 KB and 64 KB, directly on the CPU itself. This is the Level 1 (L1) cache, and it's as fast as the CPU. Then there is a larger Level 2 (L2) cache, 128 KB to 512 KB in size, located on the motherboard or in the same package as the CPU. The L2 cache is slower than the L1 cache, but still faster than the main memory. The L1 cache is split into a part for code and a part for data. In modern computers, the CPU asks both the L1 and L2 caches to look for a certain address in parallel, so if it's not in the L1 cache, the L2 cache is already searching.

- If you know assembler, you can use the CPUID instruction to get more information on the size and organisation of the caches. However, there are so many different combinations of L1 and L2 caches that you shouldn't try to optimise your brand-new demo effect for a specific combination. Over-specialisation will make it run slower on CPUs with different cache characteristics. Just stick to some basic rules, and your code will run close to optimal on most CPUs.

Caches versus Demos

You might have seen that in the paragraphs above, I talked about "normal" programs. I do not consider demos to be "normal" programs. One of the reasons is that demos use memory in very large blocks: a 512 by 512 32-bit texture, for example, takes 1 MB of memory. If you read that texture from the first byte to the last and then repeat the process, not a single byte will be found in the cache the second time you read it. The texture is just too large, and when the cache is full, the newer parts you read from the texture overwrite the older parts that were in the cache. (Of course I know that everyone uses 3D cards nowadays and that they put the textures in the 3D card's memory, but this is just for the sake of the example.)

So, why should democoders bother about caches, if they do not boost the speed of your code by huge amounts? Well, for two reasons:

- If you're working with large amounts of bulk data, like textures, height maps, and any other large 2-dimensional array, the cache may make your code two or three times SLOWER if you're not careful.

- If you're working with large collections of small structures, the cache CAN improve your performance, if you think a while about organising your data correctly.

Big disclaimer about the language used...

First, I apologise for using assembler. I know that a lot of newbies do not read assembler very fluently, but I want to read some data exactly once from the RAM into the CPU, and a high-level language like C/C++ or Pascal doesn't let you control that precisely. Don't panic, I'll try to explain everything in detail in the following:

>> Emergency micro-course assembler just for this article <<

eax, ebx, ecx and edx: these are 32-bit registers. You can think of them as variables that are defined inside the CPU. There are more registers, but these are the only ones I'll use.

[eax], [ebx+4], [my_data]: everything inside square brackets is treated as a pointer. So if ebx contains the value 1000, then [ebx+4] means: the value at address 1004.

mov eax, [my_data]: the MOV instruction moves data TO the first argument FROM the second argument. At least one of the arguments has to be a register. So here, we read a 32-bit word FROM the variable my_data (in memory) TO register eax (on the CPU).

add eax, 4: Can you guess what this does? Yeah, it adds 4 (four) to eax! (If you guessed it correctly, you win a free subscription to Hugi for the rest of your life!) If the result is zero or negative, the CPU sets the zero flag or the sign flag. Flags are little 1-bit variables on the CPU, and other instructions can read these flags and do something depending on their value.

and eax, 16777215: In Pascal, this is "eax := eax AND 16777215;" and in C it's "eax &= 16777215;". Just like ADD, AND will set some flags.

label: a name followed by a colon is a label. It marks a position in your code. Jump instructions can jump to labels, to form loops. There are no for/while/... loops in assembler; you must build them yourself.

jnz label: This abbreviation means: Jump if Not Zero. It will look at the zero flag, and if this isn't set, it will jump to label: and continue with the code that follows label.

dec ecx: this subtracts 1 from register ecx, and sets the flags.

OK, that's all we'll need.

Example 1: reading a big 2D array

First I have allocated 16 MB of memory. I will start reading the memory from the start to the end, one Dword (4 bytes) at a time:

  mov   ebx, [start_of_16MB_data]
  mov   eax, 0
read_next_dword:
  mov   ecx, [ebx+eax]
  add   eax, 4
  and   eax, 16777215
  jnz   read_next_dword

So I fill ebx with the address of my 2D array. eax is the index into the array, and I add 4 to it in each iteration to address the next 32-bit dword. By ANDing eax with 16777215, I clear the highest 8 bits, effectively limiting the index to 16 MB. After 4194304 iterations, eax becomes zero after the AND and the loop ends.

Executing the above loop 100 times takes 4.35 seconds.

Next I changed the index update to: add eax, 4*53. (I chose 53 because it is a prime number, so it shares no factor with the power-of-two size, and it will therefore still take 4194304 iterations before eax AND 16777215 becomes zero. This way I don't need to add extra instructions to the loop. The only thing that matters is that the whole 16 MB is still read exactly once per loop. Trust me. :))
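
If you don't want to take my word for it, this small throwaway C program (not part of the original code) checks that the stride of 4*53 really visits every dword exactly once before eax wraps back to zero:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
    unsigned char *seen = calloc(4194304, 1);   /* one flag per dword in 16 MB */
    unsigned int eax = 0, iterations = 0;

    do {
      seen[eax / 4] = 1;
      eax = (eax + 4 * 53) & 16777215;          /* same update as the asm loop */
      iterations++;
    } while (eax != 0);

    printf("%u iterations, every dword visited: %s\n", iterations,
           memchr(seen, 0, 4194304) == NULL ? "yes" : "no");
    free(seen);
    return 0;
  }

It should report 4194304 iterations with every dword visited.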

Now, executing the loop 100 times again takes 17.42 seconds! That's four times as long!

How come? In the first loop, when the CPU asks for the first 4 bytes, the cache asks for 32 bytes from the slow memory chips and keeps them cached. The next 7 times the CPU asks for the following 4 bytes, they are already in the cache and can be sent to the CPU immediately.

In the second loop, each new address is 212 bytes further on than the previous one. That means that the cache has to read 32 new bytes from the slow memory every time. 28 of these are wasted, because they will have been overwritten by other data before our index has wrapped around and is in the same neighbourhood again. So every 4 useful bytes cost a 32-byte read, which means that in the second loop, 16*8 = 128 MB is transferred from the RAM to the CPU. Needless to say, this is a big waste.

So what is the moral of the story? Simple: never read your big 2D arrays like textures/heightmaps/etc column by column (vertically), read them row by row (horizontally).
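
Here is the same moral in C, as a rough sketch. The texture array is made up for the example; the key fact is that C stores texture[y][x] and texture[y][x+1] next to each other in memory:

  static unsigned int texture[512][512];   /* 1 MB, rows are contiguous        */
  unsigned int sum = 0;
  int x, y;

  /* Good: walk along the rows, so consecutive reads hit the same cache line.  */
  for (y = 0; y < 512; y++)
    for (x = 0; x < 512; x++)
      sum += texture[y][x];

  /* Bad: walk down the columns, so every single read lands on a new line.     */
  for (x = 0; x < 512; x++)
    for (y = 0; y < 512; y++)
      sum += texture[y][x];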

Now what should you do if you cannot work row by row? For example, if you want to make a copy of a picture rotated by 90 degrees? Either you'll read in rows and write columns, or read in columns and write rows, right?

Nope. The solution is to use little blocks of pixels, for example 8 by 8 pixels. If you use 32-bit colors, reading the leftmost column of an 8 by 8 block results in the whole block being read into the cache, because each row of the block is exactly one 32-byte chunk. In the next iteration we read the following column from the same block. Because it's from the same block, all the data is still in the cache. Nice, isn't it? After reading 8 columns, we are done with this block and can go on with the one below or to the right.
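
A rough C sketch of that blocked approach (the sizes and the rotation direction are assumptions for the example; a real routine would also handle widths and heights that aren't multiples of 8):

  #define W 512
  #define H 512

  /* Rotate src 90 degrees clockwise into dst, working on 8x8 blocks of       */
  /* 32-bit pixels so that the reads and writes stay inside a handful of      */
  /* cache lines that are each reused 8 times.                                */
  void rotate90_blocked(const unsigned int *src, unsigned int *dst)
  {
    int bx, by, x, y;

    for (by = 0; by < H; by += 8)
      for (bx = 0; bx < W; bx += 8)
        for (y = by; y < by + 8; y++)
          for (x = bx; x < bx + 8; x++)
            dst[x * H + (H - 1 - y)] = src[y * W + x];
  }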

Example 2: Avoid big powers of 2

In a perfect world, caches would always store the most recent data. When an address is requested that is not in the cache, the oldest 32 bytes would be removed and be replaced by the new data. Unfortunately, the world is not perfect.

The cache needs to check if it contains a given address extremely quickly. To allow this, each address has a fixed position in the cache. But because the cache is way smaller than the memory, several addresses are mapped to the same cache position: they collide. Using addresses that collide will degrade performance, as you will see now.

Take a look at this loop:

  mov   ecx, 256000000
  mov   ebx, [start_of_1MB_data]
label:
  mov   eax, [ebx]
  mov   eax, [ebx]
  mov   eax, [ebx]
  mov   eax, [ebx]
  mov   eax, [ebx]
  dec   ecx
  jnz   label

The inner part of the loop reads the same address into register eax five times. ecx is decremented at the end of every iteration, until it is zero. So 256 million times, we read the same address five times into eax. This pretty useless code takes 1.44 seconds to complete. Of course, all the data is in the cache after the first iteration.

If I change one instruction in the inner loop to "mov eax, [ebx + 65536]", the loop will still take 1.44 seconds to complete.

Changing another one to "mov eax, [ebx + 131072]": exactly the same.

Changing yet another one to "mov eax, [ebx + 196608]": still 1.44 seconds.

Changing yet another to "mov eax, [ebx + 262144]" results in this inner loop:

  mov   eax, [ebx]
  mov   eax, [ebx + 65536]
  mov   eax, [ebx + 131072]
  mov   eax, [ebx + 196608]
  mov   eax, [ebx + 262144]

And suddenly it takes 2.43 seconds. Huh?

The reason is that there are 64 KB between each pair of consecutive addresses. Addresses that differ by large powers of two collide in the cache. Fortunately, the cache designers have taken precautions so that a small number of colliding addresses can co-exist without pushing each other out of the cache. This number is called the "associativity" of the cache. Typical values are 2, 4 or 8.

The code shows that my L1 cache is 4-way associative. As long as there were 4 or fewer conflicting addresses, they could all be kept in the cache. But when a fifth address was added, it pushed the first out of the cache. When the first was read in again, it pushed the second out of the cache, and so on. Luckily my L2 cache is 8-way associative, so all the data could be kept there; otherwise the difference would have been much bigger.
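
To see why addresses that are large powers of two apart always fight for the same position, here is the arithmetic a set-associative cache does. The numbers are only one plausible layout (16 KB, 4-way, 32-byte lines); your cache may differ:

  #define LINE_SIZE 32
  #define NUM_SETS  128      /* 16 KB / 32-byte lines / 4 ways = 128 sets      */

  unsigned int cache_set(unsigned int address)
  {
    return (address / LINE_SIZE) % NUM_SETS;
  }

  /* LINE_SIZE * NUM_SETS = 4096, so any two addresses that differ by a       */
  /* multiple of 4096 (and certainly by 65536) land in the same set, and      */
  /* only 4 of them can live in that set at the same time.                    */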

Now imagine that you have a weird new effect which works like this: you have 9 monochrome 8-bit textures. To get a final psychedelic texture, you do this:

(Pseudocode instead of assembler, I'm lazy.)

  i = 0;
  while i < texture_size
  {
    final[i].blue  = texture1[i] + texture2[i] + texture3[i];
    final[i].green = texture4[i] + texture5[i] + texture6[i];
    final[i].red   = texture7[i] + texture8[i] + texture9[i];
    i = i + 1;
  }

If your textures have a size that's a power of two, and they are located one after another in memory, all addresses will collide and you'll need to read 9 * 32 bytes for each iteration.

You can solve this by making the textures a bit larger than necessary. If you add 32 bytes to each texture, no addresses will collide, and you'll need to read 9*32 bytes of new data only once every 32 iterations. If you want to play it safe, you should allocate one huge memory block and place your textures inside that block yourself. This way, you don't have to worry about weird things the memory allocation functions might do. However, make sure your data is aligned on 4-byte boundaries! (see further below)
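
Here is a sketch of that trick in C. The texture size and the names are invented for the example; the point is only that the distance between the starts of the textures is no longer an exact power of two:

  #include <stdlib.h>

  #define TEX_SIZE 65536                /* e.g. 256 x 256, 8 bits per pixel    */
  #define PADDING  32                   /* one extra cache line per texture    */

  unsigned char *textures[9];

  int init_textures(void)
  {
    /* One big allocation for all nine textures plus their padding.           */
    unsigned char *block = malloc(9 * (TEX_SIZE + PADDING));
    int i;

    if (block == NULL)
      return 0;

    for (i = 0; i < 9; i++)
      textures[i] = block + i * (TEX_SIZE + PADDING);  /* starts differ by    */
                                                       /* 65568, not 65536    */
    return 1;
  }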

Example 3: optimising data structures

For one effect, I once needed the following structure:

struct point {
  int X, Y;
  int speed_x, speed_y;
  unsigned long color;
};

I had an array with 1000 points, whose positions in the array did not change at runtime. To update the screen, I had to use the X and Y members of every point in the array once for every pixel on the screen. The other members (speed_x, speed_y and color) were used far less, once every frame.

On a 32-bit machine, the point structure takes 20 bytes, so the entire array is almost 20 KB. That's too much to fit in my L1 cache, especially since I needed some other data too. However, I could increase the speed of my effect by 10% by splitting point into two parts:

struct pointA {
  int X, Y;
};

struct pointB {
  int speed_x, speed_y;
  unsigned long color;
};

I had two arrays, arrayA with 1000 pointA's and arrayB with 1000 pointB's. The inner loop only required arrayA, which was now only 8 KB, and whenever I needed the speed or color of the point at arrayA[i], I looked at arrayB[i]. Well, the effect was still too slow to be of any use, but I still hope to improve it. :)
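
In use, the split looks roughly like this. The function names and some_function() are made up; the point is only that the per-pixel loop touches arrayA and nothing else:

  #define NUM_POINTS 1000

  struct pointA arrayA[NUM_POINTS];   /* hot: read for every pixel, about 8 KB */
  struct pointB arrayB[NUM_POINTS];   /* cold: read once per frame             */

  /* Hypothetical per-point contribution to a pixel. */
  extern int some_function(int px, int py, int x, int y);

  /* Inner loop: only arrayA competes for the L1 cache here. */
  int render_pixel(int px, int py)
  {
    int value = 0, i;
    for (i = 0; i < NUM_POINTS; i++)
      value += some_function(px, py, arrayA[i].X, arrayA[i].Y);
    return value;
  }

  /* Once per frame: this is the only place the cold data is walked. */
  void update_points(void)
  {
    int i;
    for (i = 0; i < NUM_POINTS; i++)
    {
      arrayA[i].X += arrayB[i].speed_x;
      arrayA[i].Y += arrayB[i].speed_y;
    }
  }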

So to optimise your data structures, you should try to keep the most-often used parts close together, so they are read into the cache together. Put less-often used parts in a separate structure.

Last example: aligning

This isn't really about caches, but it's still something you should know. 32-bit CPUs always read data in 32-bit chunks, which are aligned on 32-bit boundaries. This means that if you need byte 0, the CPU will ask for bytes 0, 1, 2 and 3. If you ask for byte 5, the CPU will ask for bytes 4, 5, 6 and 7. But if you need the word that occupies bytes 7 and 8, the CPU will first read bytes 4 to 7, and then bytes 8 to 11. This is because the data you want spans two 4-byte areas, and the CPU needs them both.

It can get even worse with caches, because the cache always reads 32 bytes at a time, and these chunks are aligned on 32-byte boundaries. This means that if you read the dword at address 30, the cache will read both the chunk from 0 to 31 and the chunk from 32 to 63.

I repeated the tests of example 1 (reading 16 MB of memory, first sequentially, then non-sequentially). The only difference was that I added 30 to ebx, which holds the start of the memory block. As a result, the time to read the entire block 100 times sequentially increased from 4.35 to 4.83 seconds. That isn't a major slowdown, because the extra data in the cache is always used in the next iteration. But the time for the non-sequential test increased from 17.42 to 21.22 seconds, because now you need to read 64 bytes per iteration, of which only 4 are actually used.
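
By the way, malloc() only guarantees a small alignment, so if you place data inside one big block yourself (as suggested in example 2), you may want to align it by hand. A minimal sketch, not taken from the original code and assuming a 32-bit platform:

  #include <stdlib.h>

  void *raw_block;   /* keep the original pointer: it is what you must free() */

  /* Allocate size bytes and return a pointer rounded up to a 32-byte boundary. */
  void *alloc_aligned32(size_t size)
  {
    raw_block = malloc(size + 31);
    if (raw_block == NULL)
      return NULL;
    return (void *)(((unsigned long)raw_block + 31) & ~31UL);
  }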

Conclusion:

There are still some things I haven't mentioned yet, for example that caches differ significantly in how they write data back to the slow memory, but I'm quite tired and I want to send this to Adok tomorrow. I hope that reading this article inspired you to optimise your effects to the max, and if you want to give feedback, mail me at seven7@writeme.com. And elite coders for whom this article was not intended, but who have read it anyway and want to point out errors or share other tricks, are welcome too.

Seven