MMX and Mixing Samples

Chris Dragan

In this article I will try to show how we can do lossless (32-bit) sample mixing with optional interpolation using MMX.

First of all, 32-bit mixing can be done it two stages. In the first stage all the samples are added into a mixing buffer (summation buffer), and then the buffer can be downgraded either to 8-bit or 16-bit sample, which will be played by the soundcard.

If you have ever used a tracker, you know that samples can have different frequencies depending on the current note, and they can also have different volumes. If the sound buffer we are going to fill has frequency F1, and a sample has frequency F2, we can simply copy the sample into the buffer, moving in the sample with step F2/F1, and scaling sample data by volume. Example:

	     for (i=0; i<SoundBufferSize; i++) {
	   SoundBuffer[i] = Sample[SamplePos] * SampleVolume / 0x40;
	   SamplePos += F2/F1;
	}

In the above example, I assume that sample volume varies between 0 (0%) and 0x40 (100%), like in typical modules.

The only problem in the above example is that SamplePos is moved by floating-point step, as F2/F1 is a floating-point value. To solve this, we can do simple linear interpolation, that is, separately add integral step and fractional step, like this:

	     IntStep = long(F2/F1);
	FracStep = long((F2<<32)/F1); // 64-bit / 32-bit division
	for (i=0; i<SoundBufferSize; i++) {
	   SoundBuffer[i] = Sample[IntSamplePos] * SampleVolume >> 6;
	   FracSamplePos += FracStep;
	   IntSamplePos += IntStep + CY;
	}

In the above example, integral step is added to integral sample position, along with carry from fractional step addition.

As we have probably multiple samples to mix, we have to add them all to the summation buffer. Of course we must first clear the buffer.

	     for (i=0; i<SoundBufferSize; i++)
	   SoundBuffer[i] = 0;
	for (k=0; k<NumSamplesToMix; k++) {
	   IntStep = long(Sample[k].F/MixingF);
	   FracStep = long((Sample[k].F<<32)/MixingF);
	   for (i=0; i<SoundBufferSize; i++) {
	      SoundBuffer[i] += Sample[k].Buffer[Sample[k].IntPos]
				* Sample[k].Volume >> 6;
	      Sample[k].FracPos += FracStep;
	      Sample[k].IntPos += IntStep + CY;
	   }
	}

In the above example I replaced F1 with MixingF and F2 with Sample[k].F. MixingF is the frequency of the sound buffer, and Sample[k].F is the current sample frequency.

In a real situation, the samples are added into the mixing buffer, and then we take the contents of the buffer, saturate them and put in the sound buffer.

	     for (i=0; i<MixingBufferSize; i++)
	   MixingBuffer[i] = 0;
	for (k=0; k<NumSamplesToMix; k++) {
	   IntStep = long(Sample[k].F/MixingF);
	   FracStep = long((Sample[k].F<<32)/MixingF);
	   for (i=0; i<MixingBufferSize; i++) {
	      MixingBuffer[i] += Sample[k].Buffer[Sample[k].IntPos]
				 * Sample[k].Volume;
	      Sample[k].FracPos += FracStep;
	      Sample[k].IntPos += IntStep + CY;
	   }
	}
	for (i=0; i<MixingBufferSize; i++) {
	   temp = MixingBuffer[i] >> 6;
	   if (temp < -0x8000) temp = 0x8000;
	   else if (temp > 0x7FFF) temp = 0x7FFF;
	   SoundBuffer[i] = short(temp);
	}

In the above example, the mixing buffer is 32-bit (long MixingBuffer[]) and the sound buffer is 16-bit (short SoundBuffer[]). Therefore after shifting the value taken from the mixing buffer by volume scale, we have to saturate it if we do not want to have ugly clicks. MMX provides proper instructions for automatical saturation, so the overhead is minimal.

The following code does the task of 32-bit => 16-bit conversion and saturation.

	     ; esi = pointer to the beginnig of the mixing buffer
	; edi = pointer to some place in the sound buffer
	; ecx = number of values/4 to convert
_next:	movq	mm0, [esi]	; mm0 =	   s1	    |	   s0
	psrad	mm0, 6		; mm0 =	  s1>>6     |	  s0>>6
	movq	mm1, [esi+8]	; mm1 =	   s3	    |	   s2
	psrad	mm1, 6		; mm1 =	  s3>>6     |	  s2>>6
	packssdw mm0, mm1       ; mm0 = s3>>6 | s2>>6 | s1>>6 | s0>>6
	movq	[edi], mm0	; store values
	add	esi, 16 	; move source (mixing buffer) pointer
	add	edi, 8		; move dest. (sound buffer) pointer
	dec	ecx		; loop
	jnz	_next

The above routine executes in 9 cycles per loop (2.25 cycles per value), though we could optimize it to execute in less than 6 cycles per loop, and even issue prefetching (this is a topic for another article).

Now, how the actual mixing can be done ? Let's assume that our samples are mono, 16-bit, and the sound buffer is stereo, 16-bit. The actual routine for adding one sample could look like this:

	     ; esi = integral sample position/2
	; ebx = fractional sample position
	; edi = pointer to the beginning of the mixing buffer
	; ecx = number of values/2 to put in the mixing buffer
	; mm7 = rightvol | leftvol | rightvol | leftvol
_next:	movq	mm0, [esi+esi]	; mm0 = ? | ? | ? | s0
	add	ebx, [FracStep] ; add fractional position
	adc	esi, [IntStep]	; add integral position and carry
	cmp	esi, [OffLoopP2]; check agains sample loop
	jb	_l1
	; loop fixing code here
_l1:	punpcklwd mm0, [esi+esi]; mm0 = ? | ? | s1 | s0
	add	ebx, [FracStep]
	adc	esi, [IntStep]
	cmp	esi, [OffLoopP2]
	jb	_l2
	; loop fixing code here
_l2:	punpcklwd mm0, mm0	; mm0 = s1 | s1 | s0 | s0
	movq	mm2, mm0
	pmullw  mm0, mm7	    ; mm0 = lo(s1*rightvol) |
					lo(s1*leftvol) |
					lo(s0*rightvol) |
					lo(s0*leftvol)
	pmulhw  mm2, mm7	    ; mm2 = hi(s1*rightvol) |
					hi(s1*leftvol) |
					hi(s0*rightvol) |
					hi(s0*leftvol)
	movq	mm1, mm0
	punpcklwd mm0, mm2	; mm0 = s0*rightvol | s0*leftvol
	punpckhwd mm1, mm2	; mm1 = s1*rightvol | s1*leftvol
	paddd	mm0, [edi]	; add the current values from
	paddd	mm1, [edi+8]	;   the mixing buffer
	movq	[edi], mm0	; store the sums
	movq	[edi+8], mm1
	add	edi, 16 	; move mixing buffer pointer
	dec	ecx		; loop
	jnz	_next

The above routine executes in 27 cycles per loop (13.5 cycles per sample), and of course it can be optimized. There is a small average penalty of 5 cycles per loop from unaligned memory accesses. Of course still a real bottleneck is cache access, which can only be reduced with prefetching (K6-2 and P3).

Note that stereo samples have right and left channel values interleaved.

The samples sound poorly if their frequency is less than the mixing frequency. This can be avoided using sample interpolation.

The above image shows how we perform when the sample pointer is between two values. This is called linear interpolation. Since we use fractional sample position, which varies between 0/0x100000000 and 0x7FFFFFFF/0x100000000, we can use it to calculate the actual value we will use.

	     value = s1 + (((s2-s1)*FracPos) >> 32) =
	      = ((s1*(0x100000000-FracPos)) >> 32)
	       +((s2*FracPos) >> 32)

Since MMX can only multiply signed words, we can write the above as:

	     value = ((s1*(0x8000-FracPos)) >> 15)
	       +((s2*FracPos) >> 15)

where in this case FracPos is a 15-bit value (actual FracPos shifted right by 17).

The following routine uses the above formula to add a sample to the mixing buffer using linear interpolation. Fractional sample position is kept both in ebx (32-bit) and in mm4 (15-bit). In the other case when the fractional step is added (that resides in mm5), the value is then anded with a mask (in mm6) clearing the most significant bits.

	     ; esi = integral sample position/2
	; ebx = fractional sample position
	; edi = pointer to the beginning of the mixing buffer
	; ecx = number of values/2 to put in the mixing buffer
	; mm4 = (FracPos+FracStep) >> 17 |
		0x8000 - ((FracPos+FracStep) >> 17) |
		FracPos>>17 |
		0x8000 - (FracPos>>17)
	; mm5 = (FracStep*2)>>17 |
		0x8000 - ((FracStep*2) >> 17) |
		(FracStep*2)>>17 |
		0x8000 - ((FracStep*2) >> 17)
	; mm6 = 0x7FFF | 0x7FFF | 0x7FFF | 0x7FFF
	; mm7 = rightvol | leftvol | rightvol | leftvol
_next:	movq	mm0, [esi+esi]	; mm0 = ? | ? | s0+1 | s0
	; here esi and ebx should be updated
	punpckldq mm0, [esi+esi]; mm0 = s1+1 | s1 | s0+1 | s0
	; here esi and ebx should be updated
	pmaddwd mm0, mm4	; do the interpolation
	psrad	mm0, 15 	; issue shift by 15
	paddw	mm4, mm5	; add steps to positions
	pand	mm4, mm6	; mask off MSBs
	; now we have mm0 = v1 | v0 (after interpolation)
	packssdw mm0, mm0	; mm0 = v1 | v0 | v1 | v0
	punpcklwd mm0, mm0	; mm0 = v1 | v1 | v0 | v0
	; and the rest is as in the normal routine
	movq	mm2, mm0
	pmullw	mm0, mm7	; multiply with volume (lo)
	pmulhw	mm2, mm7	; multiply with volume (hi)
	movq	mm1, mm0
	punpcklwd mm0, mm2	; mm0 = v0*rightvol | v0*leftvol
	punpckhwd mm1, mm2	; mm1 = v1*rightvol | v1*leftvol
	paddd	mm0, [edi]
	paddd	mm1, [edi+8]
	movq	[edi], mm0
	movq	[edi+8], mm1
	add	edi, 16
	dec	ecx
	jnz	_next

The above routine executes in 32 cycles per loop (I did not place the esi and ebx updating code which takes 2*3=6 cycles). This is 16 cycles per sample, not counting misalignment penalties. The notes are like for the previous routine.

Usually, the fractional step residing in an integer register is 32-bit. However the step used here for interpolation, kept in mm5 register, is 15-bit. To retain compatibility of the two instances of the value, one should clear 17 less significant bits of the step in an integer register, thus avoiding carries.

There is also a problem of some clicking. When the fractional position becomes 0, it's negative image is also 0 - the resulting sample value will be 0. We can avoid that doing a simple compare to 0 and setting all bits of the negative image of position to ones.

	     pmaddwd mm0, mm3	     ; do the interpolation
	psrad	mm0, 15 	; issue shift by 15
	paddw	mm4, mm5	; add steps to positions
	pand	mm4, mm6	; mask off MSBs
	pxor	mm2, mm2	; mm2 = 0
	movq	mm3, mm4
	pcmpeqd mm2, mm4	; place ones when parts of mm4 are 0
	pand	mm2, [MASK]	; and 0000 | 7FFFh | 0000 | 7FFFh
	por	mm3, mm2	; set negative position if 0
	; now we have mm0 = v1 | v0 (after interpolation)

The presented way of mixing samples is one of the best. Of course using 16-bit mixing buffer instead of 32-bit (hence 16-bit mixing) would be faster, as less memory would be accessed. Nevertheless, this method is extremely fast. Obviously optimization of the above code is recommended, as well as using prefetching.

If you are not familiar with MMX, you should study some manuals from Intel or from AMD. Especially pack, unpack and multiply instructions can be difficult to understand; other work in a straighforward fashion.

And here is some bonus for you.

Chris Dragan