Feedback Effect (Freqout, AcouFiend, etc.)

I won’t disagree with this, lol. Except that I have some doubts about O(log2(N)) complexity being usable for a crossover filterbank. Apparently you can compensate the phase shifts when you sum the bands, which replaces the quadratic growth in the number of APF passes with linear growth (because you can compensate multiple bands in a single pass this way). I’m not sure this can be done if the filters are arranged in a tree; if not, then it’s not worth the hassle. But this is getting a bit off topic.

So returning to non-crossover banks, MI Warps has one and it has some things relevant to this discussion:

Note that the SVF code quoted a few messages above (a Chamberlin filter) is not that inefficient. The “unused” outputs are actually needed, because they are either temporaries or states: lpass and bpass are filter states, so they must persist between iterations, while notch and hpass are temporaries that must be computed but can be thrown away at the end of each iteration. This filter must be oversampled twice to be stable and accurate up to Nyquist.
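For reference, here is a minimal sketch of the Chamberlin core (illustrative names and layout, not the exact code quoted above), showing which values are states and which are throwaway temporaries:

    // Minimal Chamberlin SVF sketch. lp and bp are the persistent states;
    // hp (and notch, if you want it) are per-iteration temporaries.
    struct ChamberlinSVF
    {
        float f = 0.1f;           // frequency coefficient 2*sin(pi*fc/fs'),
                                  // where fs' is the 2x-oversampled rate
        float q = 1.0f;           // damping = 1/Q
        float lp = 0.f, bp = 0.f; // states: must persist across calls

        float ProcessLP(float in)
        {
            for(int i = 0; i < 2; ++i) // run the core twice = 2x oversampling
            {
                lp += f * bp;                // lowpass state update
                float hp = in - lp - q * bp; // temporary: highpass output
                bp += f * hp;                // bandpass state update
                // float notch = hp + lp;    // temporary: compute only if used
            }
            return lp;
        }
    };

Counting both passes gives the 14 operations (6 of them multiplies) in the comparison below.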

We can compare the performance of the different SVF implementations. I know the number of operations is not everything, but this gives a rough idea:

Chamberlin, 2x-oversampled   : 14 op, incl. 6 mul
Andrew Simper, increment form: 13 op, incl. 4 or 6 mul
Andrew Simper, standard form : 12 op, incl. 4 or 6 mul
Émilie Gillet                : 12 op, incl. 5 mul

So there is not a big difference in the raw processing, and the output selection should not add any overhead if the code is properly inlined. However, correct input oversampling and output downsampling will likely add more significant overhead.

About the crossover filterbanks: they can be arranged in a tree, and the phase compensation complexity is O(N*log(N)). I think the figure could be reduced even further by assuming that distant bands are attenuated enough to skip their compensation, but I never tried to explore this.


Quite the deep dive into SV filters, guys. I hope it doesn’t make the OP’s head explode! :sweat_smile:

I coded that filter bank in 15 minutes to use as a test bed. I wanted all the outputs to play with, to compare BP vs HP for fundamental identification and harmonic isolation, and perhaps Notch output for fundamental suppression.

I will review your advice to optimize them once I have got the rest of the effect working. Thanks! :+1:

Haha, no sweat. I said I wanted a deep dive, and I got dunked into the Mariana Trench. It’s great!

Sure, except that it’s not going to happen here: the outputs are stored in the object, so the compiler won’t discard them. But I was curious to check how it behaves, so I built an SVF example that only uses 2 of the outputs, then had a look at the disassembled function in the ELF file.

SVF::Process has 22 multiplications, 9 additions and 8 subtractions in the source. The disassembled code has 9 multiplications, 4 subtractions and 13 FMA/FMS operations. Some operations got fused into single instructions, but the total number of operations is exactly the same in the source and the compiled code. Which means that all of the unused outputs are computed (and stored in the object, naturally).

I understand that your numbers relate to the efficiency of the algorithm in general, but that doesn’t change the fact that the current DaisySP SVF code is not great for large filterbanks. At least @shensley mentioned that he’s considering replacing it with a faster version from the MI sources.

Sure, I was writing about the basic algorithm implementation, because donstavely was rewriting his own version, not the DaisySP one, which is not inlined and has some distortion code added.

So I finally messed around with @donstavely’s suggestions. I sort of have something. I’m using the autocorrelating pitch detector @recursinging turned me on to. I need to clean up the repo as I had to make some changes to the Q lib to get it working, but I’ll get something up tomorrow.

Here’s a sound clip with the effect (sorry terrible volume on the recording):
https://drive.google.com/file/d/1fZMV363wyZ0qcYoNd1cS5CAiub1CfwV_

And one of the dry signal:
https://drive.google.com/file/d/1wLHb5dzvzlRXqoPL-gnRjC5U9uwfTNhd/view?usp=sharing

It’s got a bunch of issues, but it kind of works. The input signal is super noisy (I think my DaisyPod must not be properly grounded) and the feedback generates a ton of noise on top of that… I’m also a complete newbie at this, so I’m probably doing a ton of no-nos. Here’s the signal chain, step by step (a rough code sketch follows the list):

  1. Track RMS volume to detect when the signal is in an interesting (feedbacky) range.
  2. Run pitch detection only when it’s in that range.
  3. Apply an interval (I’m just doubling the detected frequency to get an octave up).
  4. Use a biquad peaking filter (another gift from that Q library) at that frequency. I’m kind of hacking it to work with a changing frequency though; I’ll share that code tomorrow, as this is where some of the no-nos probably are.
  5. Read from the delay line at an offset matching the feedback frequency.
  6. Ramp a mix value up/down while the RMS is in that range (the ramp time is approximately one second, but my interpolator is continuous, so the curve is exponential and you hear the effect much sooner).
  7. Feed the input sample plus a generated sine tone at the feedback frequency into the filter (the tone helps with the infinite sustain; I could do without it, but then it gets noisier).
  8. Feed the filtered value from step 7 into the delay line, summing in the feedback value times the ramped mix value from step 6 (so filtered + delaySample*ramp goes back into the delay line).
  9. Write the incoming value + delaySample*ramp to the output samples.
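Roughly, per sample, the steps fit together like this. This is a hedged sketch of my description above, not the actual repo code; DetectPitch, PeakFilter and DelayLine are stand-in names (and the ramp here is linear rather than my continuous interpolator):

    #include <cmath>
    #include <algorithm>

    // Stand-ins for pieces assumed to exist elsewhere (hypothetical names):
    float DetectPitch(float in); // autocorrelating detector (Q library port)
    struct PeakFilter { void SetFreq(float f); float Process(float x); };
    struct DelayLine  { float Read(float samplesBack); void Write(float x); };

    struct FeedbackFx
    {
        float sampleRate = 48000.0f;
        float freq = 110.0f, ramp = 0.0f, phase = 0.0f, rmsSq = 0.0f;
        PeakFilter peakFilter;
        DelayLine  delayLine;

        float Process(float in)
        {
            rmsSq = 0.999f * rmsSq + 0.001f * in * in;    // 1. RMS tracker
            bool active = rmsSq > 1e-5f && rmsSq < 1e-2f; // "feedbacky" range
            if(active)
                freq = 2.0f * DetectPitch(in);            // 2+3. octave up
            peakFilter.SetFreq(freq);                     // 4. retune filter
            float fb = delayLine.Read(sampleRate / freq); // 5. one period back
            ramp += (active ? 1.0f : -1.0f) / sampleRate; // 6. ~1 s ramp
            ramp = std::clamp(ramp, 0.0f, 1.0f);
            phase = std::fmod(phase + freq / sampleRate, 1.0f);
            float sine = std::sin(6.2831853f * phase);    // 7. excitation tone
            float filtered = peakFilter.Process(in + 0.05f * sine);
            delayLine.Write(filtered + fb * ramp);        // 8. back into delay
            return in + fb * ramp;                        // 9. dry + feedback
        }
    };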

I’ll get the source up tomorrow.


Another interesting technique that’s sort of in the same realm, which may provide inspiration:

Code here:

Very nice work, @cirrus! You are way ahead of me. I have been trying to get my filter bank pitch isolator really solid. It works well on most sources and frequencies, but I also get some glitching when it hops from one filter band to another on certain notes. It works on your dry sample, again with some ticks on the decay.

BTW, please keep posting some audio clips, including the dry samples. I would like to keep using them to test my effects as well. I don’t have a guitar and couldn’t play one if I did!

I finally have my pitch detection working the way I want it to. My test bed runs the 32-band 1/4-octave filter and envelope detector bank, chooses the two bands with the highest amplitudes, and interpolates the frequency between them. (I interpolate only if they are adjacent bands; I don’t want to interpolate between harmonics or the notes of a chord!) I see a maximum frequency estimation error of 5%.
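To illustrate the idea, a sketch of the adjacent-band interpolation (assumed names throughout; the amplitude weighting is one plausible choice, and whether the actual code weights by amplitude is asked a few posts down):

    #include <cstdlib>
    #include <cmath>

    // env[] (envelope detector outputs) and bandFreq[] (band center
    // frequencies) are assumed names; the weighting is illustrative.
    float EstimateFundamental(const float env[32], const float bandFreq[32])
    {
        int best = 0, second = 1;
        for(int i = 1; i < 32; ++i) // find the two strongest bands
        {
            if(env[i] > env[best])        { second = best; best = i; }
            else if(env[i] > env[second]) second = i;
        }
        if(std::abs(best - second) != 1) // only adjacent bands: don't blend
            return bandFreq[best];       // harmonics or notes of a chord
        float w = env[second] / (env[best] + env[second]);
        // Bands are 1/4-octave spaced, so interpolate geometrically.
        return bandFreq[best] * std::pow(bandFreq[second] / bandFreq[best], w);
    }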

The pics below show it acquiring the fundamental on a sawtooth burst, then using it to set a variable frequency peaking highpass filter to enhance the second harmonic of the original signal. The left is an 80Hz burst and the right is at 2000Hz.

I am only at 33% processor utilization. Next I will add the resonator delay line as @cirrus did.


That sounds great. Smart to only interpolate when adjacent: a cheap way to get better precision with less work. Are you using any dynamic amplitude ratio for the interpolation?

Is the red waveform the response/output? How can you tell it’s the second harmonic? The period (again, naively) looks the same. I’m guessing the dynamic bits at the start/end are it reacting/responding? So awesome to get such a visual representation of the impact…

I posted my slightly cleaned up and very poorly documented code here: daisy-constructive-feedback/main.cpp at master · luigi-rosso/daisy-constructive-feedback · GitHub

If you’re trying to clone that repo, make sure you recurse submodules and have premake5 (there’s a build.sh under dev; you may need to tweak some args in my scripts). I’ll add a README tomorrow; sorry, work’s been all-encompassing…

@cirrus, I use the filter and envelope detector bank to estimate the approximate fundamental frequency (within 5%) and then for this test, I use that frequency, doubled, to set the variable filter. It suppresses but doesn’t fully eliminate the fundamental. So you are seeing some of it in the output, along with the boosted second harmonic.

Yes, it takes a few cycles at the beginning of the note for the detection algorithm to get the right answer. The tail is just the resonant filter’s natural response. I am not using any note trigger or gating yet, as I would in the final effect.

Using the resonant delay instead of the variable filter should be more effective at suppressing the fundamental, if the delay length is set to the period of the second harmonic. This is analogous to touching a guitar string in the middle, which allows only the even harmonics and kills the fundamental.
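A sketch of that idea as a simple feedback comb (DelayLine and the other names are stand-ins): with the loop length equal to one period of the second harmonic, the comb peaks land on 2*f0, 4*f0, and so on, while the fundamental falls between peaks and is attenuated by roughly 1/(1+g):

    // Hypothetical feedback comb; DelayLine is an assumed helper.
    struct DelayLine { float Read(float samplesBack); void Write(float x); };
    DelayLine delayLine;

    float ResonatorTick(float in, float sampleRate, float fundamental)
    {
        float delaySamples = sampleRate / (2.0f * fundamental); // 2nd-harmonic period
        float y = in + 0.9f * delayLine.Read(delaySamples);     // g = 0.9 feedback
        delayLine.Write(y);
        return y;
    }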


Makes sense and now I see the harmonic in there too. Super cool!

When you tap the delay line, how many samples back do you offset? I had trouble getting this right: when I went back what I thought was one exact period, it sounded wrong (I really need to set up some analysis software like you have). I was using sample rate / frequency. I have some arbitrary value in the code right now that ended up sounding right (I think it’s just the frequency number itself, which makes no sense to me). Do we care about shifting for phase alignment, or should it inherently be phase aligned? Imagining the delay buffer filling up with a constant frequency, it seems like you’d just need to offset back by one period to get full phase alignment…
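For what it’s worth, sampleRate / freq is the right one-period offset, but it’s generally a fractional number of samples, so a plain integer tap lands slightly off-period. An interpolated read keeps it phase-aligned (illustrative sketch, not from either codebase):

    // Read one period back with linear interpolation; assumes the delay
    // fits in the buffer (d < size).
    float ReadOnePeriodBack(const float* buf, int size, int writePos,
                            float sampleRate, float freq)
    {
        float d    = sampleRate / freq; // e.g. 48000 / 440 ≈ 109.09 samples
        int   i    = (int)d;
        float frac = d - (float)i;
        int a = (writePos - i + size) % size;     // newer sample
        int b = (writePos - i - 1 + size) % size; // older sample
        return buf[a] + frac * (buf[b] - buf[a]); // linear interpolation
    }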

I’m working on a port of the bitstream autocorrelation algorithm from the Q library for time-domain pitch detection. It’s monophonic (which is fine for my use case as a vocalist) and seems promisingly efficient at the lowest possible latency. Unfortunately, the modern C++ is making it difficult for me to engineer it back down to something more rudimentary. There is a simplified version here, which I’m working off of at the moment. I’d be grateful for any help getting it into DaisySP.

I finally read the whole article. It is quite impressive, especially listening to how well it tracks some very fast playing. @recursinging, how is your port coming? I too would love to see this find its way to the DaisySP library! :+1:

It’s an interesting challenge, and I’ve been discussing it all week with Joel de Guzman, the author of the blog post and the Q library, on his Discord channel. I don’t want to get too off topic here, so I won’t go into great detail, but there is one main issue I’ve run into.

Autocorrelation of a bitstream requires retrieving the Hamming weight (population count) of the window of interest for every sample. x86 has a POPCNT instruction for this purpose, whereas ARMv7 does not. GCC provides a best-effort builtin alternative, but it is still some 13 instructions long, including memory access to a LUT.

The result is that, on ARM, the naive implementation I linked above is not usable; it needs too many cycles. The implementation in the Q library takes a different approach that reduces the amount of autocorrelation necessary, and that makes it workable on the Daisy for frequencies up to about 1200Hz (and gives fantastic results).

Although workable, the algo still spends more than 50% of its time just getting the population count of the bitstream window. I’m working with Joel to find an optimization.

Are you doing it on 32-bit ints? With a 64 KB LUT in DTCMRAM you should be able to get zero-latency access and just do something like LUT[bits & 0xFFFF] + LUT[bits >> 16]. You can populate it when the program starts.

Obviously GCC can’t use LUTs of this size, so they must be using a 256-byte LUT with 4 lookups, giving the instruction count that you’ve mentioned.

I take it back. I misread the source code. GCC doesn’t seem to be using a LUT for __builtin_popcount in this case. This is what objdump gives me:

0800074c <__popcountsi2>:
 800074c:	0843      	lsrs	r3, r0, #1
 800074e:	f003 3355 	and.w	r3, r3, #1431655765	; 0x55555555
 8000752:	1ac0      	subs	r0, r0, r3
 8000754:	0883      	lsrs	r3, r0, #2
 8000756:	f003 3333 	and.w	r3, r3, #858993459	; 0x33333333
 800075a:	f000 3033 	and.w	r0, r0, #858993459	; 0x33333333
 800075e:	4418      	add	r0, r3
 8000760:	eb00 1010 	add.w	r0, r0, r0, lsr #4
 8000764:	f000 300f 	and.w	r0, r0, #252645135	; 0xf0f0f0f
 8000768:	eb00 2000 	add.w	r0, r0, r0, lsl #8
 800076c:	eb00 4000 	add.w	r0, r0, r0, lsl #16
 8000770:	0e00      	lsrs	r0, r0, #24
 8000772:	4770      	bx	lr

Yes.

That’s an interesting idea, but that’s an awful lot of memory! There is also an ARM VCNT instruction, but I’m not sure it makes sense here. Honestly, I’m out of my league with this low-level stuff. I’ve been looking for ways to optimize the algo by avoiding autocorrelation altogether, albeit not with great success (yet).

I assume that you’re not using DTCM for anything else in this patch, so it’s an awful lot of memory that would otherwise sit unused! Which means it’s essentially free for you.

So what you should do is declare something like uint8_t DTCM_MEM_SECTION popcount_lut[65536];, which allocates an empty buffer in DTCM RAM using the macro defined in daisy_core.h. Then you can fill it on startup using that GCC builtin, or just write a naive function to count bits. It’s important to use TCM memory, because that prevents the MCU cache from being thrashed by the table.
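Sketched out, the suggestion looks something like this (DTCM_MEM_SECTION is the libDaisy macro; the rest is illustrative):

    #include <stdint.h>

    // 64 KB lookup table in zero-wait-state DTCM RAM (macro from daisy_core.h).
    uint8_t DTCM_MEM_SECTION popcount_lut[65536];

    void InitPopcountLut() // call once at startup
    {
        for(uint32_t i = 0; i < 65536; ++i)
        {
            uint32_t v = i, c = 0;
            while(v) { c += v & 1u; v >>= 1u; } // naive count, runs only once
            popcount_lut[i] = (uint8_t)c;
        }
    }

    inline uint32_t Popcount32(uint32_t bits) // two lookups per 32-bit word
    {
        return popcount_lut[bits & 0xFFFFu] + popcount_lut[bits >> 16];
    }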

VCNT is a NEON SIMD instruction, and the H7 doesn’t have NEON, so you can’t use it here.

Another interesting idea: the actual data coming from the codec is only 24-bit ints, so it’s possible to process the raw data and use just a 4 KB LUT (two 12-bit lookups). But this would require replacing the audio IRQ handler to process the data before it’s converted to float. You can’t use a bigger LUT to reduce the number of operations, because a full 24-bit LUT needs 16 MB :wink: and simply having a smaller LUT won’t make any difference in this case. It might even give more precise results, as you wouldn’t be converting int->float->int before counting bits, but it sounds like it could be too much effort.

The bits being counted here are not samples but zero crossings, stored as bits in a vector of (in this case) uint32_t, spanning twice the period of the lowest frequency of interest. That is the novelty of this approach: the autocorrelation function is reduced to an XOR. Unfortunately, you still have to do a popcount to get the result, and that is the hot spot here.
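In outline, it looks something like this (a hedged sketch; the real Q implementation shifts the lag by single bits across word boundaries, which is skipped here for brevity, and Popcount32 is as sketched above):

    // Zero crossings are packed one bit per sample, so correlating at a
    // given lag is just XOR plus a population count of the mismatches.
    // Word-aligned lags only, for illustration.
    uint32_t MismatchesAtLag(const uint32_t* bits, int words, int lagWords)
    {
        uint32_t mismatches = 0;
        for(int i = 0; i < words - lagWords; ++i)
            mismatches += Popcount32(bits[i] ^ bits[i + lagWords]);
        return mismatches; // the lag minimizing this is ~ one period
    }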

Reviving the topic: I have been thinking about this and playing with code off and on since the OP. Unfortunately, I led myself down the rabbit hole of pitch detection: positive and negative peak detection, the peaking filter bank, even a sparse autocorrelation idea I wanted to try. None of them were really robust on the real-world guitar clips I used as test cases. The bit-wise AC is intriguing, but given that it isn’t that efficient on ARM, and because I am tired of this rabbit hole, I am rethinking the whole problem.

A guitarist standing in front of his amp is, at its simplest, a delay of around a millisecond per foot of separation. Beyond that, there are the characteristics of the amp and speaker. So we could add an amp and cabinet sim in series with a delay? Beyond that, how does the sound from the speaker interact with the strings? It is probably a complex process, but maybe trying to model it is overkill.

Going back to the other extreme: well before the age of digital effects, Boss had a pedal that made some pretty cool feedbacky sounds:

How the heck did they do it? It uses OC-2-style peak-to-peak pitch detection to drive a PLL configured as a frequency multiplier to get the octave-up effect. It was way ahead of its time, IMO. I know this is not what @cirrus had in mind; I bring it up in all humility, realizing that Boss did in 1985 with op amps and CMOS logic what I can’t do as well today with a Cortex-M7!
