I have run some tests on a K6-2 (0.25 micron), an Athlon (0.18 micron) and a Pentium III (0.25 micron). The tests do geometry transformation. The results are below. The 3DNow! code was optimised for the K6-2 and is equivalent to the SSE code. "1k" means that 1000 vectors were processed, "100k" means 100,000 vectors. "-" means that prefetching wasn't used, "+" means that the optimal prefetching (found experimentally) was used. The code was run on Win9x. On WinNT the results are much better, but I was unable to test all variants there. The results are in cycles per vector and are approximate.
CPU | 1k - | 1k + | 100k - | 100k +
K6-2 | 35 | 35 | 72 | 74
Athlon | 29 | 28 | 70 | 52
Pentium III | 55 | 46 | 112 | 83
Athlon (WinNT) | 19 | 20 | 55 | 33
One thing we notice is that performance improves significantly on WinNT (presumably thanks to its much better task management mechanism).
Another is that prefetching on the K6-2 didn't help at all (probably because of a poor motherboard). I tried many different prefetch distances, and nothing changed.
Interestingly, the Pentium III performs worse than the Athlon and even the K6-2. What's more, I tried multiple prefetch distances on it (which isn't covered in the results), and the larger the distance was, the slower it ran. This didn't happen on the AMD processors.
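To make "prefetch distance" concrete: it is how many vectors ahead of the current one the loop asks the cache to start loading. Below is a minimal sketch in plain scalar C, not the 3DNow!/SSE code used for the tests. The `vec4` layout (four floats, 16 bytes), the `transform` function and its `prefetch_dist` parameter are hypothetical names for illustration, and the hint relies on the GCC/Clang `__builtin_prefetch` builtin (it becomes a no-op elsewhere).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vector layout: four floats, 16 bytes per vector. */
typedef struct { float x, y, z, w; } vec4;

/* Prefetch hint; compiles to nothing where the builtin is unavailable. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Multiply n input vectors by the row-major 4x4 matrix m.
   prefetch_dist is the look-ahead in vectors; 0 disables prefetching. */
static void transform(const float m[16], const vec4 *in, vec4 *out,
                      size_t n, size_t prefetch_dist)
{
    assert(m != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* Ask for a vector that will be needed prefetch_dist iterations later. */
        if (prefetch_dist != 0 && i + prefetch_dist < n)
            PREFETCH(&in[i + prefetch_dist]);

        const vec4 v = in[i];  /* copy first, so in == out also works */
        out[i].x = m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w;
        out[i].y = m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w;
        out[i].z = m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w;
        out[i].w = m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w;
    }
}
```

Calling `transform(m, in, out, n, 0)` corresponds to the "-" columns, and a non-zero last argument to the "+" columns. The output is the same either way; only the timing may differ, and the best distance has to be found by measurement on each machine.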
The influence of prefetching in the 1000-vector test on the Pentium III could result from its tiny 16KB L1 data cache: 1000 vectors occupy 16000 bytes. The K6-2 has 32KB of L1 data cache and the Athlon has 64KB, so on these processors prefetching has no impact with such a small amount of data.
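The arithmetic behind that guess can be sketched as follows. This is only an illustration: the 16 bytes per vector comes from the 16000 bytes per 1000 vectors quoted above, the helper names are hypothetical, and only the input vectors are counted.

```c
#include <assert.h>
#include <stddef.h>

/* 16000 bytes / 1000 vectors, as quoted in the text. */
enum { BYTES_PER_VECTOR = 16 };

/* Bytes occupied by n input vectors. */
static size_t working_set_bytes(size_t n_vectors)
{
    return n_vectors * BYTES_PER_VECTOR;
}

/* Share of an L1 data cache of l1_kb kilobytes taken by n input vectors,
   as an integer percentage rounded down. */
static unsigned l1_fill_percent(size_t n_vectors, size_t l1_kb)
{
    assert(l1_kb > 0);
    return (unsigned)(working_set_bytes(n_vectors) * 100 / (l1_kb * 1024));
}
```

For 1000 vectors this gives about 97% of the Pentium III's 16KB, 48% of the K6-2's 32KB and 24% of the Athlon's 64KB. So on the Pentium III the input alone nearly fills the L1 data cache, and anything else the loop touches has to compete for the remainder, while the AMD parts have plenty of room to spare.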