Daisy Seed DMA tx/rx timing and issues with small block sizes

I have been working on a guitar overdrive pedal using Daisy Seed and the Cleveland Music Hothouse. I have run into a few issues along the way and wanted to share my solutions with the community.

The first issue is that there is significant audible noise at the callback frequency (i.e. sample_rate/block_size) coming from the ADC when the input signal is boosted significantly, as is necessary in an overdrive pedal. This seems to be a known issue (see https://forum.electro-smith.com/t/questions-about-digital-noise-and-grounding/432/10), and I am working around it by decreasing the block size to 4 with a sample rate of 96kHz, which puts the callback frequency at 24kHz, out of the audible range.
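
For reference, setting this up with the stock DaisySeed API looks roughly like this (a minimal sketch; the Hothouse support code may wrap these calls differently):

```cpp
#include "daisy_seed.h"

using namespace daisy;

static DaisySeed hw;

static void AudioCallback(AudioHandle::InputBuffer  in,
                          AudioHandle::OutputBuffer out,
                          size_t                    size)
{
    for(size_t i = 0; i < size; i++)
    {
        out[0][i] = in[0][i]; // bypass; the overdrive processing goes here
        out[1][i] = in[1][i];
    }
}

int main()
{
    hw.Init();
    hw.SetAudioSampleRate(SaiHandle::Config::SampleRate::SAI_96KHZ);
    hw.SetAudioBlockSize(4); // callback frequency = 96kHz / 4 = 24kHz
    hw.StartAudio(AudioCallback);
    while(true) {}
}
```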

This seems to have caused another problem though: looking at the output on an oscilloscope, there is a strange ripple in the signal, which goes away when bypassing the overdrive (in bypass I just memcpy the input to the output).

At first I thought that this might be due to instability in the potentiometer readings, but on further investigation it turned out to be related to CPU utilization, which was surprising because my CPU usage was below 80%. My effect oversamples the input in order to reduce aliasing from the non-linear transformation performed by the overdrive, and halving the oversampling factor got the CPU usage below 50% and fixed the issue. This is not a satisfying solution though, because I would like to keep the higher oversampling rate and be able to use the whole CPU.

My next theory as to the cause was that it could be due to the timing of the transfers from the ADC and to the DAC, and this theory turned out to be correct. The Daisy Seed uses circular DMA for the transfers, with a buffer of 2x the block size. After receiving half of the buffer from the ADC, the HAL_SAI_RxHalfCpltCallback function is called in an interrupt service routine, which calls some Daisy Seed internal callbacks and eventually calls the user’s audio callback. The Daisy Seed internal callback which calls the user’s audio callback converts the data in the ADC’s rx buffer from ints to floats and puts them in another buffer on the stack, and the user’s audio callback writes its output into another float buffer on the stack. After the user’s audio callback returns, the Daisy Seed internal callback converts the floats back to ints and stores them in the DAC’s tx buffer. This then happens again for the second half of the buffer with the HAL_SAI_RxCpltCallback callback, and then it loops back to the beginning of the buffer, alternately calling these two callbacks whenever one half of the buffer has been completely filled.
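
To make that flow concrete, here is a heavily simplified sketch of the chain for a mono signal (the buffer names, scaling, and helpers are illustrative, not libDaisy’s actual identifiers):

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t block_size = 4;
static int32_t dma_rx_buffer[2 * block_size]; // filled by the SAI rx DMA
static int32_t dma_tx_buffer[2 * block_size]; // drained by the SAI tx DMA

static void user_audio_callback(const float* in, float* out, size_t size)
{
    for(size_t i = 0; i < size; i++)
        out[i] = in[i]; // bypass; the user's processing goes here
}

static void internal_callback(size_t offset, size_t size)
{
    float in[block_size], out[block_size];              // stack buffers
    for(size_t i = 0; i < size; i++)
        in[i] = dma_rx_buffer[offset + i] / 8388608.f;  // int -> float
    user_audio_callback(in, out, size);                 // user's processing
    for(size_t i = 0; i < size; i++)                    // float -> int,
        dma_tx_buffer[offset + i]                       // straight into the
            = static_cast<int32_t>(out[i] * 8388608.f); // tx DMA buffer
}

// Called from the DMA interrupt; these correspond to HAL_SAI_RxHalfCpltCallback
// and HAL_SAI_RxCpltCallback respectively.
static void on_rx_half_complete() { internal_callback(0, block_size); }
static void on_rx_complete() { internal_callback(block_size, block_size); }
```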

There are four events whose timing we need to understand in order to see what is happening here and how to fix it. They are:

  1. Start time of ADC rx transfer
  2. End time of ADC rx transfer
  3. Start time of DAC tx transfer
  4. End time of DAC tx transfer

The audio callback needs to start after the end of the ADC rx transfer and must read all of its input before the start of the next ADC rx transfer for the same half of the buffer. It must write its result to the tx buffer before the start of the next DAC tx transfer for its half of the buffer, but it must not write any output before the end of the previous DAC tx transfer for its half of the buffer.

I measured the timing of the end time events by instrumenting HAL_SAI_RxHalfCpltCallback, HAL_SAI_RxCpltCallback, HAL_SAI_TxHalfCpltCallback, and HAL_SAI_TxCpltCallback (the latter two are not currently used by Daisy Seed but I added them for this measurement) with some code to track their relative timing. I ran this with varying block sizes, and discovered that the order and timing is:

  1. Tx half complete at time 0 us
  2. Rx half complete at time 10 us
  3. Tx complete at time block_size/sample_rate seconds
  4. Rx complete at time block_size/sample_rate seconds + 10 us

This pattern repeats indefinitely with the next iteration starting at time 2*block_size/sample_rate seconds.
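
For anyone who wants to reproduce the measurement, the instrumentation can be as simple as timestamping each callback into a small log. A sketch, assuming daisy::System::GetUs() for the timestamps; the Tx callbacks are weak HAL symbols that libDaisy leaves undefined, so they can live in user code, while the equivalent logging for the Rx callbacks has to be added inside libDaisy’s SAI driver where they are defined:

```cpp
#include "daisy_seed.h"

enum Event : uint8_t { kTxHalf, kTxFull, kRxHalf, kRxFull };

struct Stamp
{
    Event    event;
    uint32_t us;
};

static Stamp    event_log[256];
static uint32_t event_count = 0;

static void LogEvent(Event e)
{
    event_log[event_count % 256] = {e, daisy::System::GetUs()};
    event_count++;
}

extern "C" void HAL_SAI_TxHalfCpltCallback(SAI_HandleTypeDef* hsai)
{
    LogEvent(kTxHalf);
}

extern "C" void HAL_SAI_TxCpltCallback(SAI_HandleTypeDef* hsai)
{
    LogEvent(kTxFull);
}
```

Dumping event_log from the main loop after it runs for a while is enough to see the ordering and spacing described above.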

Measuring the start times is trickier because there are no callbacks or diagnostics for when they occur. I was able to measure the rx start times by spin looping on the first element of the rx buffer, tracking when it changed, and comparing this time to the callback timings. The ADC rx DMA transfer seems to start approximately 3-4 us before the callback is called (this may be somewhat longer with higher block sizes). The tx start times are the hardest to measure, but if my theory was correct, adding a delay to the audio callback and increasing it until just below the threshold where the ripple starts to occur would let me infer them. I did this, and found that the ripple consistently starts to occur when the delay is around 10-15 us shorter than block_size/sample_rate seconds, for various block sizes.
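
The spin-loop measurement was along these lines (sketch only; rx_first_sample has to point at the first element of the SAI rx DMA buffer, which libDaisy does not expose publicly, so it needs to be dug out of the driver somehow):

```cpp
#include <cstdint>
#include "daisy_seed.h"

// Returns a timestamp for (approximately) the start of the next rx DMA
// transfer into the first half of the buffer.
static uint32_t TimeOfNextRxWrite(volatile int32_t* rx_first_sample)
{
    const int32_t previous = *rx_first_sample;
    while(*rx_first_sample == previous) {} // spin until the DMA overwrites it
    return daisy::System::GetUs();
}
```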

What is causing the ripple is that the DMA transfer from the DAC’s tx buffer actually starts ~10-15 microseconds before the next ADC rx callback happens. If the previous rx callback isn’t complete by this time, you will end up transferring some of the old samples that were already in the buffer instead of the newly computed samples. This isn’t a huge deal when using a large block size, because if the callbacks are only happening once a millisecond then losing 10-15 microseconds of processing time is only 1-2% of CPU, but if the callbacks are happening 24,000 times per second then you only have 41.6 microseconds between callbacks, and 10-15 microseconds translates into losing ~25-35% of your CPU!

This issue can be solved by delaying the output by block_size samples (i.e. one half of the DMA buffer). There are a few approaches I have found that appear to work (a rough sketch of the first is included after the list):

  1. Store the output of the user’s audio callback into an intermediate buffer instead of the DAC tx buffer, and then copy it into the DAC tx buffer in the next call to HAL_SAI_TxHalfCpltCallback or HAL_SAI_TxCpltCallback.
  2. Move the call to the user’s audio callback from the HAL_SAI_Rx callbacks to the HAL_SAI_Tx callbacks. This also requires flipping which half of the buffer you read from in the audio callback, i.e. in the TxHalfCpltCallback you need to read from the back half of the rx buffer and write to the front half of the tx buffer, and vice versa for the TxCpltCallback.
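
Continuing the simplified sketch from earlier, approach 1 might look something like this (illustrative names again, not the actual libDaisy or PR code; it also assumes the Rx and Tx DMA interrupts cannot preempt each other, so the staged block is always complete before it is copied):

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t block_size = 4;
static int32_t dma_rx_buffer[2 * block_size];
static int32_t dma_tx_buffer[2 * block_size];
static int32_t pending_tx[block_size]; // most recent audio callback output

static void user_audio_callback(const float* in, float* out, size_t size);

static void internal_rx_callback(size_t offset, size_t size)
{
    float in[block_size], out[block_size];
    for(size_t i = 0; i < size; i++)
        in[i] = dma_rx_buffer[offset + i] / 8388608.f;
    user_audio_callback(in, out, size);
    for(size_t i = 0; i < size; i++) // stage instead of writing the DMA buffer
        pending_tx[i] = static_cast<int32_t>(out[i] * 8388608.f);
}

// Corresponds to HAL_SAI_TxHalfCpltCallback: the front half just finished
// transmitting, so it is safe to refill it with the staged block. The copy
// happens almost a full half-period before that half's next transfer starts,
// and the output is delayed by one block relative to the stock scheme.
static void on_tx_half_complete()
{
    for(size_t i = 0; i < block_size; i++)
        dma_tx_buffer[i] = pending_tx[i];
}

// Corresponds to HAL_SAI_TxCpltCallback: same for the back half.
static void on_tx_complete()
{
    for(size_t i = 0; i < block_size; i++)
        dma_tx_buffer[block_size + i] = pending_tx[i];
}
```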

With either of these approaches, I can push the audio callback up to ~37 microseconds of computation before the ripple appears, whereas without them it starts to appear around ~26 microseconds of computation.

Due to the additional delay of 1 block, this is probably not desirable across the board, but it would be nice to have an option to enable one of these approaches in libDaisy for users who are using small block sizes (where the extra delay of 1 block probably doesn’t matter anyway). I can submit a pull request for this if there is interest.


Interesting analysis.

Did you try blocksize=8 at 96k? 12kHz is pretty high up.

I did try block size = 8 at 96kHz and the 12 kHz noise is pretty annoying when the gain is up and I’m not playing anything. On the spectrum analyzer it measures about 10-15 dB over the noise floor. I’d also still be losing ~10-20% of the CPU with block size = 8 at 96kHz, and I’d prefer not to, and with either of my solutions I don’t have to.

I wonder if running a bunch of that code, especially the callbacks, from the ITCM would help?

I don’t think moving the code to ITCM would help that much because the instruction cache is pretty large and the callbacks should stay in it. Even if moving code to ITCM did help, the timing issue between the DMA rx and tx would still prevent you from using more than ~60% of CPU in your audio callback when using a block size of 4 at 96 kHz (or 2 at 48 kHz), and it would still be beneficial to use one of my proposed solutions in addition to ITCM so that you can safely use 90% of CPU in your audio callback.

I’ve opened a PR for libDaisy with my recommended fix: added optional intermediate buffer to SaiHandle to allow full use of cpu in audio callback at low block sizes by jeffplaisance · Pull Request #656 · electro-smith/libDaisy · GitHub

Noise always wins, but fight the fight.

I have recently been up against this and related issues, and I came up with a few solutions. For me the block noise was visible on an RTA as an unwanted tone at the audio callback frequency and at some harmonics above that as well. It was about 15 dB above the rest of the noise floor, so I was motivated to get rid of it.

There are a few issues here. The overhead of small blocks is well described above. The tone has also been well described elsewhere, but in short: most people’s code is written in a way that makes the processor’s power draw fluctuate perfectly periodically with the audio callback; this modulates ground potentials and becomes audible, especially at the ADC inputs.

This solution is not terrible and was effective for me. Essentially you need to insert another buffer into the process, and you need to wiggle it (a rough sketch follows the list below).

  1. Reduce the audio callback block size to 1. Yep, 1.
  2. Strip down the audio callback to two tiny bits of functionality: push the incoming samples into an input ring buffer, and read the outgoing samples out of an output ring buffer.
  3. Now do all your audio processing in the main loop. Instead of waiting for the cursed audio callback, just read samples out of the input ring buffer and write them out to the output buffer whenever you want as long as you keep up.
  4. Now, add random delays between processing each sample frame. For me, 0 to 2 us was about all the delay I could spare, but it was enough.
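
A rough sketch of the idea (illustrative only; it assumes the stock DaisySeed API plus daisy::System::DelayUs() for the jitter, hand-rolls a tiny FIFO rather than relying on any particular library utility, and omits overflow/underflow handling):

```cpp
#include <cstdlib>
#include "daisy_seed.h"

using namespace daisy;

// Tiny single-producer/single-consumer ring buffer (one slot is left unused).
template <size_t N>
struct Fifo
{
    float           buf[N];
    volatile size_t w = 0, r = 0;

    bool Push(float x)
    {
        size_t next = (w + 1) % N;
        if(next == r)
            return false; // full
        buf[w] = x;
        w      = next;
        return true;
    }

    bool Pop(float &x)
    {
        if(r == w)
            return false; // empty
        x = buf[r];
        r = (r + 1) % N;
        return true;
    }
};

static DaisySeed hw;
static Fifo<64>  in_fifo, out_fifo;

static float Process(float x)
{
    return x; // placeholder: the real DSP goes here
}

// Block size 1: the callback only shuttles samples between the DMA buffers
// and the FIFOs; all real processing happens in the main loop.
static void AudioCallback(AudioHandle::InputBuffer  in,
                          AudioHandle::OutputBuffer out,
                          size_t                    size)
{
    for(size_t i = 0; i < size; i++)
    {
        in_fifo.Push(in[0][i]);
        float y = 0.f;
        out_fifo.Pop(y); // outputs silence if the main loop has fallen behind
        out[0][i] = y;
        out[1][i] = y;
    }
}

int main()
{
    hw.Init();
    hw.SetAudioBlockSize(1);
    hw.StartAudio(AudioCallback);

    while(true)
    {
        float x;
        if(in_fifo.Pop(x))
        {
            out_fifo.Push(Process(x));
            System::DelayUs(rand() % 3); // 0-2 us of random delay per frame
        }
    }
}
```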

There are some nuances to work through, but it’s pretty straightforward. You can still process in blocks by reading several samples at a time, but the jitter might be less effective. The effect for me was that the callback frequency is completely gone, down into the noise floor. The overall effect was only 3 dB less noise A-weighted, but it should be a much less audible form of noise.

It was about 15 dB above the rest of the noise floor, so I was motivated to get rid of it.

This is around the same level that I’ve measured as well (10-15 dB above the noise floor), and it is very annoying when the level is boosted.

Reduce the audio callback block size to 1. Yep, 1.

Are you really seeing a benefit from block size 1 vs block size 2? At a 48kHz sample rate, block size 2 should put the callback frequency at 24kHz, which should be a) completely inaudible and b) filtered out by the DAC’s low pass filter anyway. I am using a sampling frequency of 96kHz because it allows me to use a much less steep FIR filter for my upsampling (to 768kHz) and downsampling (back to 96kHz), which is significantly cheaper to compute. I am currently using a block size of 4, which puts the callback frequency at 24kHz, and using a block size of 1 at 96kHz seems like it would dramatically increase the callback/IRQ overhead. I haven’t measured it precisely, but it seems like an IRQ has an overhead somewhere around 1-5 microseconds, which would be substantial at a callback frequency of 96kHz, with a callback happening every 10.4 microseconds.

The effect for me was that the callback frequency is completely gone, down into the noise floor. The overall effect was only 3 dB less noise A-weighted, but it should be a much less audible form of noise.

My solution using a block size of 4 at 96kHz (equivalent to block size 2 at 48kHz) also completely eliminates the spike at the callback frequency, at least below 20kHz in the audible spectrum. With my patch to libDaisy linked above, I am able to use 90% of the CPU’s cycles in my audio callback with no other issues, which is about as good as I’d ever expect to be possible. I’d expect your solution to also allow higher CPU utilization than the standard libDaisy approach since you are also introducing an additional buffer. I haven’t done any precise noise measurements beyond eyeballing it on a spectrum analyzer and verifying that the offending frequencies are gone. There is still white noise, but like you said, it is much less annoying than the spike at 1kHz.

Just noticed something else that is interesting - there are measurable spikes at 8kHz and 16kHz with the ST-LINK attached which aren’t there when it isn’t attached.

ST-LINK Attached:

ST-LINK Not Attached:

Yes. I also forgot that I had the ST-LINK attached once and was going crazy trying to figure out where that noise was coming from.

I suggested a block size of 1 just to make the logic a lot simpler. But theoretically, I believe that my scheme allows me to raise the block size considerably, since feeding the ring buffers is completely decoupled from feeding the audio callback. It just makes managing latency and catching the overflow conditions a little more challenging. I’ll probably try this as I optimize further.