Appendix B - SSE instructions summary

SIMD-FP instructions
addps/addss parallel addition / scalar addition
subps/subss subtraction
maxps/maxss max
minps/minss min
andps logical and
orps logical or
xorps logical xor
cmpps/cmpss comparison (type specified by immediate)
mulps/mulss multiplication
divps/divss division
sqrtps/sqrtss square root
rcpps/rcpss reciprocal (approximate)
rsqrtps/rsqrtss reciprocal square root (approximate)
comiss compare and store result in EFLAGS
ucomiss like comiss, but quiet NaNs do not raise the invalid-operation exception (only signaling NaNs do)
shufps shuffle contents of XMMn register (for example can reverse order of floats)
unpckhps interleave high floats of two registers (similar to punpckhdq)
unpcklps interleave low floats of two registers (similar to punpckldq)
cvtpi2ps packed signed dwords from MMn to floats in low XMMn
cvtps2pi opposite to cvtpi2ps
cvttps2pi like cvtps2pi, but with truncation
cvtsi2ss dword integer register to float in low XMMn
cvtss2si opposite to cvtsi2ss
cvttss2si like cvtss2si, but with truncation
movaps move aligned data
movups move unaligned data
movhps move two floats (64 bits) between memory and high part of XMMn register
movlps the same, but concerns low part of XMMn register
movhlps move high part to low part
movlhps move low part to high part
movntps move non-temporal data from XMMn to aligned memory area
movmskps move sign bits from XMMn to an integer register
movss move only the lowest FP number
ldmxcsr set SSE control register
stmxcsr get SSE control register
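
A few of the instructions above can be tried from C through the compiler intrinsics in xmmintrin.h. The following is only a minimal sketch: it assumes an x86 compiler with SSE enabled, and the function names (add4, reverse4, negative_mask4, round_to_int, truncate_to_int) are made up for this illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 compiler with SSE enabled */

/* addps: add four floats at once (movups handles the unaligned loads and store). */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* shufps: reverse the order of four floats. */
void reverse4(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    /* _MM_SHUFFLE(0, 1, 2, 3) selects elements 3, 2, 1, 0 for result slots 0..3 */
    v = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, v);
}

/* cmpps + movmskps: bit i of the result is set when in[i] < 0. */
int negative_mask4(const float *in)
{
    __m128 v = _mm_loadu_ps(in);
    __m128 lt = _mm_cmplt_ps(v, _mm_setzero_ps());  /* all-ones where v < 0 */
    return _mm_movemask_ps(lt);                     /* collect the sign bits */
}

/* cvtss2si: convert using the current MXCSR rounding mode. */
int round_to_int(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* cvttss2si: convert with truncation toward zero. */
int truncate_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With the default rounding mode (round to nearest), round_to_int(2.7f) gives 3 while truncate_to_int(2.7f) gives 2, which is the difference between cvtss2si and cvttss2si.
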
MMX extension introduced with SSE, also available on Athlon
pshufw shuffle words
pinsrw insert a word from integer register
pextrw extract a word to integer register
psadbw sum of absolute differences of unsigned bytes
pminub/pminsw min
pmaxub/pmaxsw max
pavgb/pavgw average of bytes and words
pmulhuw like pmulhw, but the operands and the result are unsigned (at last!!!)
sfence store fence - ensures that all preceding stores are globally visible before any following store
prefetcht0 prefetches temporal data into all cache levels (including the closest)
prefetcht1 the need for this instruction is questionable
prefetcht2 ditto
prefetchnta prefetches non-temporal data
movntq move non-temporal data from MMn to aligned memory area
maskmovq move to [edi] using byte mask in source MMn register
pmovmskb move most significant bits of bytes in MMn to integer register
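
The MMX extensions can be sketched in C the same way. This example assumes a compiler that still exposes the 64-bit MMX intrinsics (GCC and Clang on x86 do); the helper load8 and the functions sad8, avg8 and msb_mask8 are hypothetical names used only here.

```c
#include <string.h>     /* memcpy */
#include <xmmintrin.h>  /* also declares the MMX extensions that came with SSE */

/* Loads eight bytes into an MMX value without assuming any alignment. */
static __m64 load8(const unsigned char *p)
{
    __m64 v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* psadbw: sum of absolute differences of eight unsigned bytes. */
int sad8(const unsigned char *a, const unsigned char *b)
{
    __m64 sum = _mm_sad_pu8(load8(a), load8(b));
    int result = _mm_cvtsi64_si32(sum);  /* movd: the sum sits in the low word */
    _mm_empty();                         /* emms: hand the registers back to the FPU */
    return result;
}

/* pavgb: out[i] = (a[i] + b[i] + 1) >> 1 for each of the eight bytes. */
void avg8(const unsigned char *a, const unsigned char *b, unsigned char *out)
{
    __m64 avg = _mm_avg_pu8(load8(a), load8(b));
    memcpy(out, &avg, sizeof avg);
    _mm_empty();
}

/* pmovmskb: bit i of the result is the most significant bit of byte i. */
int msb_mask8(const unsigned char *a)
{
    int mask = _mm_movemask_pi8(load8(a));
    _mm_empty();
    return mask;
}
```

For the byte sequences 1..8 and 8..1, sad8 returns 7+5+3+1+1+3+5+7 = 32, and avg8 writes 5 into every byte because each pair sums to 9 and pavgb rounds up.
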

As a side note, some of the new instructions, particularly movntps, movntq and maskmovq (together with sfence, which orders the resulting stores), were introduced to bypass the cache and write data directly to memory. Intel introduced this technique with Rambus memory in mind, which later turned out to have too little bandwidth for systems with 64MB+ of RAM.
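
The cache-bypassing stores mentioned above might be used as in the following sketch. It assumes an x86 compiler with SSE enabled, a 16-byte aligned destination and a length that is a multiple of four floats; fill_stream is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Fills n floats at dst with value using non-temporal stores (movntps).
   Assumptions: dst is 16-byte aligned and n is a multiple of 4. */
void fill_stream(float *dst, size_t n, float value)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(n % 4 == 0);

    __m128 v = _mm_set1_ps(value);      /* broadcast value to all four lanes */
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);      /* movntps: store without polluting the cache */

    _mm_sfence();                       /* sfence: make the stores visible before returning */
}
```

An aligned buffer can be obtained with _mm_malloc(n * sizeof(float), 16) and released with _mm_free. Whether non-temporal stores actually help depends on the memory system and on whether the data is read again soon, so they are worth measuring rather than assuming.
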