Appendix B - SSE instructions summary

SIMD-FP solution

addps/addss parallel addition / scalar addition
subps/subss subtraction
maxps/maxss max
minps/minss min
andps logical and
orps logical or
xorps logical xor
cmpps/cmpss comparison (type specified by immediate)
mulps/mulss multiplication
divps/divss division
sqrtps/sqrtss square root
rcpps/rcpss approximate reciprocal
rsqrtps/rsqrtss approximate reciprocal square root
comiss compare and store result in EFLAGS
ucomiss as comiss but unordered (raises the invalid-operation exception only for signaling NaNs, while comiss raises it for quiet NaNs too)
shufps shuffle contents of XMMn register (for example can reverse order of floats)
unpckhps interleave high floats of two registers (similar to punpckhdq)
unpcklps interleave low floats of two registers (similar to punpckldq)
cvtpi2ps packed signed dwords from MMn to floats in low XMMn
cvtps2pi opposite to cvtpi2ps
cvttps2pi like cvtps2pi, but with truncation
cvtsi2ss dword integer register to float in low XMMn
cvtss2si opposite to cvtsi2ss
cvttss2si like cvtss2si, but with truncation
movaps move aligned data
movups move unaligned data
movhps move 64 bits between memory and high part of XMMn register
movlps the same, but concerns low part of XMMn register
movhlps move high part to low part
movlhps move low part to high part
movntps move non-temporal data from XMMn to aligned memory area
movmskps move sign bits from XMMn to an integer register
movss move only the lowest FP number
ldmxcsr set SSE control register
stmxcsr get SSE control register
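
A few of the instructions above can be tried from C through the compiler intrinsics in xmmintrin.h instead of hand-written assembly. The following is only a minimal sketch, assuming an x86 compiler with SSE enabled (for example GCC or Clang on x86-64). The function names (add4, reverse4, less_than_mask, round_to_int, truncate_to_int) are made up for this illustration, and the instruction named in each comment is the one the intrinsic normally maps to; the compiler is free to pick an equivalent encoding.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two vectors of four floats (addps). */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            /* movups: no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* addps, then movups */
}

/* Reverse the order of four floats (shufps). */
void reverse4(const float *src, float *dst)
{
    __m128 v = _mm_loadu_ps(src);
    /* _MM_SHUFFLE(0, 1, 2, 3) selects elements 3, 2, 1, 0 for
       result positions 0, 1, 2, 3. */
    v = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(dst, v);
}

/* Bit i of the result is set when a[i] < b[i] (cmpps + movmskps). */
int less_than_mask(const float *a, const float *b)
{
    /* cmpps with the "less than" predicate yields all-ones or all-zeros
       per element; movmskps collects the four sign bits. */
    __m128 m = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_ps(m);
}

/* cvtss2si: converts using the MXCSR rounding mode
   (round to nearest unless changed with ldmxcsr). */
int round_to_int(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* cvttss2si: always truncates toward zero. */
int truncate_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With a = {1, 2, 3, 4} and b = {4, 3, 2, 1}, less_than_mask(a, b) sets only bits 0 and 1, because only the first two elements of a are smaller. The pair round_to_int and truncate_to_int shows the difference between cvtss2si and cvttss2si on a value such as 2.7.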

MMX extension introduced with SSE, also available on Athlon

pshufw shuffle words
pinsrw insert a word from integer register
pextrw extract a word to integer register
psadbw sum of absolute differences of unsigned bytes
pminub/pminsw min
pmaxub/pmaxsw max
pavgb/pavgw average of bytes and words
pmulhuw like pmulhw, but operands and result are unsigned (at last!!!)
sfence store fence - ensure that store queue is empty
prefetcht0 prefetches temporal data into all cache levels (including the closest)
prefetcht1 the need for this instruction is questionable
prefetcht2 ditto
prefetchnta prefetches non-temporal data
movntq move non-temporal data from MMn to aligned memory area
maskmovq move to [edi] using byte mask in source MMn register
pmovmskb move most significant bits of bytes in MMn to integer register
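
The integer instructions above are also reachable from C through xmmintrin.h, where they operate on the __m64 type. The sketch below is an illustration only: it assumes GCC or Clang on x86, where these MMX-register intrinsics are available, and the function names (sad8, avg8, msb_mask8) are hypothetical. The instruction named in each comment is the one the intrinsic is documented to correspond to.

```c
#include <string.h>     /* memcpy */
#include <xmmintrin.h>  /* SSE intrinsics, including the MMX extensions */

/* Sum of absolute differences of eight unsigned bytes (psadbw). */
unsigned sad8(const unsigned char *a, const unsigned char *b)
{
    __m64 va, vb, sad;
    unsigned result;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    sad = _mm_sad_pu8(va, vb);                    /* psadbw: sum in the low word */
    result = (unsigned)_mm_extract_pi16(sad, 0);  /* pextrw: word 0 to integer */
    _mm_empty();                                  /* emms: release the MMX state */
    return result;
}

/* Rounded average of eight unsigned byte pairs (pavgb): (a + b + 1) >> 1. */
void avg8(const unsigned char *a, const unsigned char *b, unsigned char *out)
{
    __m64 va, vb, avg;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    avg = _mm_avg_pu8(va, vb);                    /* pavgb */
    memcpy(out, &avg, sizeof avg);
    _mm_empty();
}

/* Bit i of the result is the most significant bit of byte i (pmovmskb). */
int msb_mask8(const unsigned char *a)
{
    __m64 va;
    int mask;
    memcpy(&va, a, sizeof va);
    mask = _mm_movemask_pi8(va);                  /* pmovmskb */
    _mm_empty();
    return mask;
}
```

The _mm_empty() calls are there because the MMn registers share state with the x87 FPU, so the state is cleared before any floating-point code may run.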

As a side note, some of the new instructions, particularly movntps, movntq and maskmovq, were introduced to bypass the cache and write data directly to memory, with sfence making sure such stores have completed. Intel introduced this technique with Rambus memory in mind, which later turned out to have too little bandwidth for systems with 64 MB or more of RAM.
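
The cache-bypassing stores mentioned in the side note can be sketched in C as follows. This is a minimal illustration, not a tuned routine: the function name copy_nontemporal is made up, the prefetch distance of 16 floats is an arbitrary assumption, and the code assumes that dst is 16-byte aligned and that n is a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Copy n floats from src to dst using non-temporal stores.
   Assumptions: dst is 16-byte aligned, n is a multiple of 4. */
void copy_nontemporal(float *dst, const float *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 4) {
        /* prefetchnta: hint that upcoming source data is used only once.
           The distance of 16 floats is an arbitrary choice for this sketch. */
        if (i + 16 < n)
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);

        /* movntps: store four floats to aligned memory, bypassing the cache. */
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));
    }
    /* sfence: make the streamed stores globally visible before later stores. */
    _mm_sfence();
}
```

A destination buffer with the required alignment can be obtained with _mm_malloc(size, 16) and released with _mm_free, both of which GCC and Clang provide alongside these intrinsics. Whether such a copy is actually faster than an ordinary one depends on the buffer size and the memory system, so it is worth measuring.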
