GPU based synthesizer

category: code [glöplog]
Your soundcard is now free to render the video?
raer: both
added on the 2013-07-10 18:53:45 by TLM TLM
w00t! sounds interesting! :D

i think i will give it a try!
added on the 2013-07-10 19:15:55 by rez rez
rez: thanks man! I will gladly help, contact me at tlm.thelostman at gmail
added on the 2013-07-10 19:24:41 by TLM TLM
I wrote a GLSL synthesizer for my last 4k intro.

I thought a GPU synthesizer could be smaller than a CPU one, because shaders are stored as source code in a 4k intro and my 4k intro already has GLSL and OpenGL code.
So the GPU synthesizer code can be compressed effectively.

I don't think a GPU synthesizer has a speed advantage unless your synthesizer requires heavy calculation.
You can spend 1 minute pre-generating the music.

Another advantage is that the sound synth code and the graphics shader code can share GLSL code.
By sharing code, you can save code size and easily synchronize visuals and sound.
I used Boost.Wave to preprocess #include in GLSL code.
That preprocessor can be used as a library.

You can also share a header file between the cpp file and the GLSL sound synth code.
In my intro, they share the sampling rate and the sound length (sec).

My intro executes a vertex shader to generate the samples.
They are written to a buffer object using transform feedback.
Then they are downloaded to main memory and the Win32 API is called to play the sound.

Samples are generated in parallel, so there is less flexibility than with a single-threaded CPU synthesizer.
But I think that by using a compute shader, you can generate all the samples in a single thread, because a compute shader can do random writes to a buffer.
But that would be much slower than multiple vertex shader executions and requires a newer GPU (>= OpenGL 4.3).
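
A minimal CPU-side sketch of what "generated in parallel" implies, written in Python rather than GLSL: every sample is a pure function of its own index, the way one vertex shader invocation only sees its own vertex ID. The names `SAMPLE_RATE`, `sample_at` and `render`, the 44100 Hz rate and the decaying sine are assumptions for illustration, not taken from the intro.

```python
import math

SAMPLE_RATE = 44100  # assumed; the intro's actual rate is not stated


def sample_at(index, freq=440.0):
    """Compute one sample from its index alone, with no state from earlier samples."""
    t = index / SAMPLE_RATE
    envelope = math.exp(-3.0 * t)  # closed-form decay, no running state
    return envelope * math.sin(2.0 * math.pi * freq * t)


def render(start, count):
    """Stand-in for the draw call: each index could run on its own GPU thread."""
    return [sample_at(i) for i in range(start, start + count)]
```

Because no sample depends on another, rendering the second half on its own gives the same values as rendering everything. Effects that carry state from sample to sample (a recursive filter, for example) do not fit this shape directly, which is the flexibility loss described above.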
added on the 2013-07-10 20:07:55 by tomohiro tomohiro
I looked into it and Punqtured missed a few details.

BB Image

This is accurate.
added on the 2013-07-10 21:45:30 by superplek superplek
Psycho and I have discussed a few times what it would take to make an (almost) pure GPU implementation of the player for my current synth. The motivation is primarily size, as the shader code would compress well together with all the other shader code in the intro, especially for pure DirectCompute intros.

One advantage of this particular synth in this regard is that each trigger of a note for a particular instrument is independent of all the others. I.e. there are no cross-channel effects or infinite response effects (except for a simple delay effect, but that could be handled as a special case). Thus, there are plenty of opportunities for parallelizing the computation.

One way to distribute it would be to run one thread per note. This would mean each thread only needs to read one note and then write a bunch of samples. It would only need a simple preprocessing on the CPU to calculate the position of each note from the songdata. This approach has a number of drawbacks though: it would need to perform atomic updates on the output buffer (since notes can overlap) which means the buffer cannot use floats. And the number of threads would probably be too small to really utilize the GPU, since there would typically be only a few hundred notes in each track at maximum.
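
To make the atomic-update constraint concrete, here is a small CPU sketch in Python of the per-note scatter. Each note adds its samples into a shared buffer, and the buffer holds fixed-point integers because that is what integer atomic adds would operate on. `Note`, `FIXED_ONE`, `render_note`, `mix` and the decaying sine voice are hypothetical; on a GPU the `+=` would be an atomic add.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 44100   # assumed
FIXED_ONE = 1 << 16   # assumed 16.16 fixed-point scale for the accumulator


@dataclass
class Note:
    start: int    # first output sample, precomputed on the CPU from the song data
    length: int   # duration in samples
    freq: float   # pitch in Hz


def render_note(note, out):
    """Work of one 'thread': read one note, add its samples into the shared buffer."""
    for k in range(note.length):
        pos = note.start + k
        if pos >= len(out):
            break
        t = k / SAMPLE_RATE
        value = math.exp(-4.0 * t) * math.sin(2.0 * math.pi * note.freq * t)
        # On a GPU this would be an integer atomic add, since notes can overlap.
        out[pos] += int(round(value * FIXED_ONE))


def mix(notes, total):
    """Accumulate all notes, then convert the fixed-point sums back to floats."""
    out = [0] * total
    for note in notes:  # each iteration would be an independent thread
        render_note(note, out)
    return [v / FIXED_ONE for v in out]
```

Integer adds give the same sum whatever order the notes arrive in, so the result does not depend on thread scheduling.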

A different approach would be to run one thread per output sample. Each thread would iterate (or do a binary search) through the notes to find the ones that overlap the sample. This would be a lot of iteration, but it would be shared nicely between threads in a workgroup, and there would be plenty of threads and no atomics needed.
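
A sketch of the per-sample gather with a binary search, again in Python on the CPU. It assumes the notes are sorted by start position and that note length has a known upper bound (`MAX_NOTE_LEN`), which is what lets the search cut off on both sides. `voice`, `sample_at` and the tuple layout are hypothetical.

```python
import bisect
import math

SAMPLE_RATE = 44100   # assumed
MAX_NOTE_LEN = 22050  # assumed upper bound on note length, in samples


def voice(k, freq):
    """Hypothetical instrument: sample k of a note, as a pure function of k."""
    t = k / SAMPLE_RATE
    return math.exp(-4.0 * t) * math.sin(2.0 * math.pi * freq * t)


def sample_at(pos, starts, notes):
    """Work of one 'thread': gather every note overlapping output sample pos.

    notes is a list of (start, length, freq) sorted by start; starts holds
    just the start positions so that bisect can search them.
    """
    # Notes before lo ended before pos; notes from hi onward start after pos.
    lo = bisect.bisect_left(starts, pos - MAX_NOTE_LEN + 1)
    hi = bisect.bisect_right(starts, pos)
    total = 0.0
    for start, length, freq in notes[lo:hi]:
        k = pos - start
        if k < length:
            total += voice(k, freq)
    return total
```

Each output position is written by exactly one thread, so the buffer can stay in floats and no atomics are involved.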

The real showstopper though, and the reason we have not proceeded further, is precision. There are some computations in the synth (in particular related to pitch slide) which absolutely need double precision, which is still not widely available in consumer GPUs. For an example of how bad it can be when the precision is not sufficient, listen to http://pouet.net/prod.php?which=54110 and compare to the soundtrack link (which sounds as it was intended).
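
A rough illustration of how quickly single precision runs out, not the synth's actual pitch slide computation: if a phase is tracked in cycles since the start of the song, a 440 Hz tone reaches 79200 cycles after 3 minutes, and float32 values near that magnitude are 1/128 of a cycle apart. The helper names `to_f32` and `f32_spacing` and the chosen numbers are assumptions.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def f32_spacing(x):
    """Gap between adjacent float32 values near a normal, nonzero x."""
    _, exponent = math.frexp(abs(x))
    return 2.0 ** (exponent - 24)  # float32 has a 24-bit significand


# Phase of a 440 Hz tone after 3 minutes, counted in cycles.
cycles = 440.0 * 180.0        # 79200.0
step = f32_spacing(cycles)    # 2**-7 cycles, about 2.8 degrees of phase
nudged = cycles + 0.001       # a small phase change, well below one step
```

Here `to_f32(nudged)` and `to_f32(cycles)` are the same number, so a phase change of a thousandth of a cycle is simply lost in single precision, while the double-precision values still differ.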
added on the 2013-07-10 23:37:00 by Blueberry Blueberry
I had a crazy idea... would it be possible to create a synth that uses the GPU to implement multi-oscillators? Like a saw/triangle/sine oscillator that's just the sum of some 10,000 oscillators with the same parameters, except for a slight detune and phase variance? That could sound FAT, if the memory bandwidth plays along.
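
A minimal CPU sketch of such a stack, in Python. It assumes a naive (non-band-limited) saw, a detune range of ±15 cents and uniformly random start phases; `make_voices`, `saw` and `stacked_saw` are hypothetical names, and nothing here is tuned for how it actually sounds.

```python
import math
import random

SAMPLE_RATE = 44100  # assumed


def make_voices(count, detune_cents=15.0, seed=1):
    """Per-oscillator (pitch ratio, start phase) pairs; fixed seed for repeatability."""
    rng = random.Random(seed)
    voices = []
    for _ in range(count):
        cents = rng.uniform(-detune_cents, detune_cents)
        voices.append((2.0 ** (cents / 1200.0), rng.random()))
    return voices


def saw(phase):
    """Naive sawtooth from a non-negative phase counted in cycles."""
    return 2.0 * (phase - math.floor(phase)) - 1.0


def stacked_saw(index, freq, voices):
    """One output sample: the average of all detuned oscillators at this instant."""
    t = index / SAMPLE_RATE
    total = 0.0
    for ratio, phase0 in voices:  # on a GPU this loop is what would be spread out
        total += saw(freq * ratio * t + phase0)
    return total / len(voices)
```

Each oscillator only needs its two parameters and the time, so in this form the cost is mostly arithmetic, with the per-oscillator parameters as the only data to fetch.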
added on the 2013-07-10 23:52:48 by xTr1m xTr1m
xTr1m nice idea! Do you have any links to how that sounds?
like a chorus on meth, obviously :)
added on the 2013-07-11 02:01:13 by superplek superplek
xTrim - I tried it with 137 osc's in the past - sounds fatish :P

added on the 2013-07-11 05:36:57 by Shabby Shabby
Blueberry: Very nice (and true) analysis.

I have basically chosen to go with the 2nd approach (one thread per output sample). One of the best by-products of this method is the tiny code size on the intro side.

The entire process can be seen as:
1. Setup OpenGL, FBO, Compile shader
2. Render a full-screen quad to a huge buffer (RGBA float pixel format, 4096x2048, each pixel is 2 samples x 2 channels)
3. Get that buffer back to PC memory (glReadPixels)
4. Play buffer

The cool thing is that most 4k intros are doing steps 1, 2 and 4 anyway; all you need is to implement step 3, a single function call, and you have a synth.
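
The buffer layout from step 2 can be worked through in a few lines of Python. The 4096x2048 size and the 2 samples x 2 channels per pixel come from the list above; the 44100 Hz rate, the row-major sample order and the R,G / B,A channel assignment are assumptions.

```python
WIDTH, HEIGHT = 4096, 2048  # render target size from the post
SAMPLES_PER_PIXEL = 2       # each RGBA pixel holds 2 stereo sample pairs
SAMPLE_RATE = 44100         # assumed; the post does not state the rate


def pixel_to_first_sample(x, y):
    """Index of the first stereo sample stored in pixel (x, y), rows in order."""
    return (y * WIDTH + x) * SAMPLES_PER_PIXEL


def unpack_pixel(rgba):
    """Split one pixel into two (left, right) pairs.

    R,G as the first pair and B,A as the second is an assumed ordering.
    """
    r, g, b, a = rgba
    return [(r, g), (b, a)]


total_samples = WIDTH * HEIGHT * SAMPLES_PER_PIXEL  # stereo sample pairs
seconds = total_samples / SAMPLE_RATE               # a bit over 380 seconds
buffer_bytes = WIDTH * HEIGHT * 4 * 4               # 4 floats of 4 bytes per pixel
```

Under these assumptions one render covers roughly 6 minutes 20 seconds of stereo audio in a 128 MiB buffer, and the floats read back are already interleaved left/right.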

Regarding the note search, one of the things I wanted was for the pixel shader to be super quick. The idea is that a quick pixel shader allows for complicated effects (like what xTr1m mentions).

For example, an echo (or a delay line) would be:
Code:output_sample = ComputeSample(t) + ComputeSample(t-10)*0.4;

but a better echo would be:
Code:for (int i = 0; i < 10; i++) output_sample += ComputeSample(t - 10 * i) * (11 - i);
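
A runnable CPU sketch of the same multi-tap idea, in Python. Two things differ from the snippets above and are assumptions: the taps decay geometrically (`gain ** i`) instead of using the `(11 - i)` weights, and the delay is given in samples. `compute_sample` is a hypothetical dry signal standing in for the real `ComputeSample`.

```python
import math

SAMPLE_RATE = 44100  # assumed


def compute_sample(index):
    """Hypothetical dry signal: a short decaying blip, silent before time zero."""
    if index < 0:
        return 0.0
    t = index / SAMPLE_RATE
    return math.exp(-20.0 * t) * math.sin(2.0 * math.pi * 440.0 * t)


def echo_sample(index, delay=11025, taps=10, gain=0.5):
    """Multi-tap echo for one output sample, re-evaluating the dry signal per tap."""
    total = 0.0
    for i in range(taps):
        # Tap i is the dry signal from i * delay samples ago, attenuated by gain**i.
        total += compute_sample(index - i * delay) * gain ** i
    return total
```

Every output sample costs `taps` evaluations of the dry signal, which is why the per-sample function has to be cheap for this kind of effect to be affordable.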

So the shader needs to be quick, which means the fewer texture look-ups the better. I actually found a way to have no texture look-ups and no iterative searches. I basically created a function (an equation) that takes the time and a channel and returns the note and the time relative to the note's start. The cool thing is that this equation turned out to be super small compared to having the entire song data stored in a texture. Feel free to look into the shader code for more details.
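
To show the general shape of such a function (this is not the actual equation from the shader, which is not reproduced here), a made-up Python example: notes sit on a fixed step grid, so the step number and the offset into the step fall out of integer division, and the pitch comes from a small arithmetic rule instead of a song table. `STEP`, `PATTERN_LEN` and the arpeggio rule are all hypothetical.

```python
STEP = 5512       # assumed samples per sequencer step
PATTERN_LEN = 16  # assumed steps per pattern


def note_at(index, channel):
    """Map a sample index and channel to (midi_note, samples_since_note_start).

    The melody rule below is invented; the point is that the lookup is pure
    arithmetic on the step number, with no song data and no search.
    """
    step = index // STEP
    since_start = index % STEP
    in_pattern = step % PATTERN_LEN
    pattern = step // PATTERN_LEN
    # Hypothetical generative rule: an arpeggio whose root shifts each pattern.
    root = 48 + 12 * channel + (pattern % 4) * 3
    note = root + (0, 3, 7, 12)[in_pattern % 4]
    return note, since_start
```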

One last thing, the accuracy issue: you are right, there is a problem, and I had to find it the hard way. I solved it simply by using an integer for the sample index.
I actually decided to use x100 resolution, which allows nice and cheap sub-sampling.
added on the 2013-07-11 06:56:08 by TLM TLM
xTrim: very impressive, can you please share the shader code?
added on the 2013-07-11 07:23:10 by TLM TLM
ah, correction:
Shabby: very impressive, can you please share the shader code?
added on the 2013-07-11 08:28:39 by TLM TLM
I'm afraid the src code went to the shrine of unfinished projects :P since it was all right-brain doodling - but I found all the groundwork I did for it on the GLSL sandbox for you. The phase distortion oscillator is the key part of the synth I did in that vid. For the JP8k it's simply a loop with a pitch multiplier for each voice etc.






Have fun!
added on the 2013-07-11 13:38:22 by Shabby Shabby
very nice! Thanks
added on the 2013-07-11 13:50:35 by TLM TLM
Cool. (but i hear nothing). Am I supposed to use a very old CRT, so that changes in the electron beam state make noise?
added on the 2013-07-11 14:28:26 by Tigrou Tigrou
@Blueberry: Instead of doing atomic updates on the output buffer, you can draw a line/quad per note with a fragment shader and use additive blend instead. The blending unit supports full 32bit float and is much faster than atomic access anyway. The vertex shader would then only have to figure out the length and location of the note in samples and position the vertices appropriately.
added on the 2013-07-11 15:01:05 by Noobody Noobody
xTrim: on the topic of a bunch of oscillators - how about attempting to recreate the THX sound ("Deep Note") on the GPU?
added on the 2013-07-12 11:30:17 by bloodnok bloodnok
Noobody: True, but that requires the use of the graphics pipeline. Our usecase was mainly pure DirectCompute intros, where the setup overhead of the graphics pipeline (which is considerable in D3D10/11) can be avoided. For OpenGL intros, this approach is definitely viable.

In terms of parallelization strategy, this approach corresponds to parallelizing on both notes and output samples.
added on the 2013-07-12 11:55:34 by Blueberry Blueberry
Guys, did anyone actually try it? (I'm talking about the first post in the thread)
added on the 2013-07-12 12:45:08 by TLM TLM
blueberry: what about one threadgroup per channel(/voice) running for the whole song, each with its own RW buffer to write to and read back in for delay lines + reverbs.
for the points in the operation where it's possible, e.g. combining a bunch of oscillators or summing multiple delay lines, you go wide across the threads in the group.

the simplicity and reduced pain of that approach, as it's a lot closer to how a cpu synth looks, could surely weigh off against that approach being less perfectly parallel.
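
a rough CPU sketch of that shape in Python, under the assumption that the effect is a single feedback delay line: time advances sequentially within a channel because each block reads output written earlier, while the samples inside one block of `delay` samples have no dependencies on each other. `feedback_delay` and `render_channels` are hypothetical names, and `delay` must be at least 1.

```python
def feedback_delay(dry, delay, feedback):
    """Feedback echo: out[n] = dry[n] + feedback * out[n - delay].

    out plays the role of the channel's own read/write buffer. Each block of
    `delay` samples depends only on the previous block, so the inner loop has
    no dependencies between iterations and could be spread across a group.
    """
    out = [0.0] * len(dry)
    for block_start in range(0, len(dry), delay):  # sequential in time
        block_end = min(block_start + delay, len(dry))
        for n in range(block_start, block_end):    # independent within a block
            wet = out[n - delay] if n >= delay else 0.0
            out[n] = dry[n] + feedback * wet
    return out


def render_channels(channels, delay, feedback):
    """Each channel is independent, like one threadgroup per channel."""
    return [feedback_delay(dry, delay, feedback) for dry in channels]
```

feeding it an impulse gives the usual decaying train of echoes, which a purely per-sample formulation can only approximate with a fixed number of taps.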
added on the 2013-07-12 13:20:49 by smash smash