Appendix C - Comparison of SSE on Pentium III and 3DNow! on Athlon

In the table below, num is the number of data units processed by a single instruction, lat is the instruction latency (the number of cycles after which the result is ready), and thru is the instruction throughput (the number of cycles after which another instruction can be issued).

SSE                                 num  lat  thru   3DNow!                                num  lat  thru
addps, subps, maxps, minps, cmpps   4    4    2      pfadd, pfsub*, pfmin, pfmax, pfcmp*   2    4    1
addss, subss, maxss, minss, cmpss   1    3    1      -
mulps                               4    5    2      pfmul                                 2    4    1
mulss                               1    4    1      -
rcpps, rsqrtps                      4    2    2      pfrcp, pfrsqrt                        2    4    1

MMX-mult                                 3    1      MMX-mult                                   4    1
other MMX                                1    1      other MMX                                  2    1

Note that, in practice, the result of a 3DNow! instruction can be fetched by a subsequent instruction after 3 cycles, even though it is nominally ready only after 4 cycles.
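
To make the three columns concrete, the following minimal sketch turns a few rows of the table into rough cycle estimates. It assumes an idealized pipeline: one instruction stream, no decode or memory stalls, and no competition for execution units. The names `Timing`, `peak_units_per_cycle`, `independent_cycles` and `dependent_cycles` are hypothetical helpers introduced only for this illustration; the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    num: int   # data units processed by one instruction
    lat: int   # cycles after which the result is ready
    thru: int  # cycles after which another instruction can be issued


# Rows copied from the table (SSE on Pentium III, 3DNow! on Athlon).
ADDPS = Timing(num=4, lat=4, thru=2)
PFADD = Timing(num=2, lat=4, thru=1)
MULPS = Timing(num=4, lat=5, thru=2)
PFMUL = Timing(num=2, lat=4, thru=1)


def peak_units_per_cycle(t: Timing) -> float:
    """Upper bound on data units completed per cycle in steady state."""
    return t.num / t.thru


def independent_cycles(t: Timing, n: int) -> int:
    """Idealized cycles for n mutually independent instructions.

    A new instruction is issued every `thru` cycles and the last
    result arrives `lat` cycles after the last issue.
    """
    return t.lat + (n - 1) * t.thru


def dependent_cycles(n: int, effective_lat: int) -> int:
    """Idealized cycles for a chain of n instructions in which each
    one needs the result of the previous one."""
    return n * effective_lat


if __name__ == "__main__":
    for name, t in [("addps", ADDPS), ("pfadd", PFADD),
                    ("mulps", MULPS), ("pfmul", PFMUL)]:
        print(name, peak_units_per_cycle(t), "units/cycle")

    # Adding 1024 independent floats: 256 addps versus 512 pfadd.
    print("addps:", independent_cycles(ADDPS, 1024 // ADDPS.num), "cycles")
    print("pfadd:", independent_cycles(PFADD, 1024 // PFADD.num), "cycles")

    # Chain of 8 dependent pfadd: nominal latency 4, practical latency 3.
    print("chain, lat 4:", dependent_cycles(8, 4), "cycles")
    print("chain, lat 3:", dependent_cycles(8, 3), "cycles")
```

Under these assumptions both addps and pfadd peak at 2 data units per cycle: SSE processes twice as many units per instruction, but 3DNow! can issue twice as often. The practical 3-cycle latency from the note matters mainly for dependent chains, where the estimate for 8 chained pfadd instructions drops from 32 to 24 cycles.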