pouët.net

waveOutGetPosition and syncing: do you use an adjusting offset

category: code [glöplog]
 
I'm using waveOutGetPosition() with time format TIME_SAMPLES to sync my visuals to music in 4k intros. It works nicely, but I feel that if I use the returned position value directly as the "current time" I pass to the shader, the visuals are slightly behind the beat, and I must add a small offset to get it to look just right. I wonder if this is because it takes some time to render the buffer, and meanwhile the audio playback has already progressed, or if it's some kind of hidden buffering caused by the audio drivers or something. However, so far the offset I've added has looked reasonably good on at least four different Windows systems I've tested on.

What do you others use, and have you experienced a similar phenomenon?

I just tested it with exactly 44100/60=735 samples, but it feels too little. 2000 feels better. I don't know, I'm just trying it by ear/eye.
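
A minimal sketch of the offset described above, assuming 44.1 kHz output. The function name and the constant are made up for illustration; the position argument stands for whatever waveOutGetPosition reports in TIME_SAMPLES mode.

```c
#include <stdint.h>

/* Assumed output rate, in samples per second. */
#define SAMPLE_RATE 44100.0

/* Convert a playback position in samples into the time passed to the
   shader, pushed forward by a fixed, hand-tuned number of samples. */
static double shader_time_seconds(uint32_t position_samples,
                                  uint32_t offset_samples)
{
    /* Add as doubles so the sum cannot wrap around. */
    return ((double)position_samples + (double)offset_samples) / SAMPLE_RATE;
}
```

With an offset of 735 the visuals run one 60 Hz frame ahead of the reported position; with 2000 they run roughly 45 ms ahead.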
added on the 2014-05-28 20:42:37 by yzi yzi
Just asking, but is there a way to adjust the frequency the audio is fed to the device?

We have solved this issue by manually setting the SDL audio buffer to be only 256 samples long. This would result in 44100 / 256 =~ 172Hz that should be fast enough for no delay to be detectable by the user. This audio buffer size has worked as a sufficient timer resolution for several intros now.

Actually, on the *nix side, there is no need for a call to query the position per se. You can access the pointer last updated by the audio callback and subtract the start of the audio buffer from that.

Even if 256 is not properly divisible by the framerate, it does not seem to matter. The position has always advanced at least once by the time of the next frame anyway. I've personally not been able to see errors with the human eye.
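
The arithmetic behind the 256-sample buffer can be sketched as below. The helper names are hypothetical, and the numbers only describe how often a callback-updated position changes; they say nothing about extra buffering further down in the driver.

```c
/* How many times per second the audio callback fires for a given
   buffer length. */
static double callback_rate_hz(unsigned sample_rate, unsigned buffer_samples)
{
    return (double)sample_rate / (double)buffer_samples;
}

/* Size of one position step in milliseconds, i.e. the coarsest error a
   callback-updated position can have on its own. */
static double position_step_ms(unsigned sample_rate, unsigned buffer_samples)
{
    return 1000.0 * (double)buffer_samples / (double)sample_rate;
}
```

At 44100 Hz and 256 samples this gives about 172 updates per second, or a step of about 5.8 ms, which is well under one 60 Hz frame.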
added on the 2014-05-29 01:41:34 by Trilkk Trilkk
General: barring adaptive refresh, transmitting a frame's worth of data to the display takes pretty much a frame.

So if you're frame-locked, instantly render a frame just before VBlank is done and immediately flip, the top scanline of the display will be "on the nose", but the bottom scanline will be updated approximately 15.84ms later at 60Hz display (VBlank is ~5% of the frame time at normal display timings). So the "average scanline" will have about 8ms latency even if there's nothing else going on (and this is as true on LCDs as it was on CRTs). There's also LCD switching time if you wanna get really precise, but let's just assume that's zero.
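
As a rough illustration of the scan-out numbers above, assuming the active lines are sent at a constant rate and VBlank takes a fixed fraction of the frame (the function and parameter names are made up):

```c
/* Delay between the flip and the moment a given scanline is sent to the
   display. row_fraction is 0.0 for the top line and 1.0 for the bottom. */
static double scanline_latency_ms(double refresh_hz,
                                  double vblank_fraction,
                                  double row_fraction)
{
    double frame_ms  = 1000.0 / refresh_hz;
    double active_ms = frame_ms * (1.0 - vblank_fraction);
    return active_ms * row_fraction;
}
```

For 60 Hz with 5% VBlank, the bottom line comes out a little under 16 ms after the top one, and the middle of the screen at about 8 ms.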

Of course, frames don't render instantly. You normally get your timing information at the start of a frame, render your stuff, then flip. That means that whatever you render is gonna be late by at least the amount of time it takes to render. This is your CPU time from start of the frame to finish, plus the time the GPU spends on that frame rendering it once it gets the command buffer, plus however much delay there is to wait for the next vsync to show the results (assuming you're VSync'ing).

In practice, with no extra buffering and fast frames, you'll average about 16.67ms latency on the rendering side: you query the sample counter just after the flip, which will in the steady state be limited by your VSync rate, and then the earliest point that frame can be displayed is on the next VSync.

Except if you rendered like that, the GPU would be going idle all the time - it would finish rendering before VSync, and then your code is waiting for VSync too, and until you've submitted some significant work somewhere in the middle of the next frame the GPU won't even start.

So that's not what actually happens. What actually happens is that drivers are allowed to buffer data for a few frames. So the CPU side is ahead by a bit, and the GPU has enough work queued up that it won't go idle every time your rendering thread gets context-switched out (or even just when some part of your frame takes longer to prepare on the CPU side than it takes the GPU to render). On Windows, the default max latency is 3 frames (you can set this via DXGI, SetMaximumFrameLatency), which includes the frame you're currently preparing. So there may be up to 2 more frames worth of GPU work queued.

In practice, total time from "CPU starts submitting batches" to "GPU finishes rendering", on a test that's properly running at full frame rate without dropping frames, is usually those 2-3 queued frames' worth, plus whatever scan-out delay you have on top. So aim for 2.5 to 3.5 frames worth of delay, which in your case works out to a delay between 1837 and 2572 samples.
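
The conversion from frames of delay to samples is simple enough to sketch, assuming 44100 Hz audio and a 60 Hz display as in the original question (the function name is hypothetical):

```c
/* Number of audio samples played during a given number of display frames. */
static double frames_to_samples(double frames,
                                double sample_rate,
                                double refresh_hz)
{
    return frames * sample_rate / refresh_hz;
}
```

One frame is 735 samples, so 2.5 frames is 1837.5 samples and 3.5 frames is 2572.5 samples, which is where the 1837 to 2572 range comes from.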

You can also futz around with SetMaximumFrameLatency or enforce no driver queuing by other means (e.g. by reading a GPU resource in a synchronous manner from the CPU side, which forces a flush), but that's kinda costly in terms of GPU time and should be avoided unless you really *really* need low latency.

And of course, on other OSes there's similar things, except it's usually not as explicit (nor as documented). Good luck.
added on the 2014-05-29 06:27:08 by ryg ryg
Thanks for the answers. Actually what I wanted to know was, if the correct offset value is dependent on the Windows system and is different based on what drivers and settings you have. But from Ryg's answer I'd say it's more dependent on the display drivers and settings than audio.

As for SDL and tracker module players, I think I've got it covered since "Once upon a Time in the East". http://www.pouet.net/prod.php?which=59910 (source available) http://ftp.kameli.net/pub/fit/once_upon/

Before that prod, I think we used to use just the same thing Trilkk suggested, asking for a small buffer and hoping for the best. The results varied across systems, but on some systems it wasn't good enough. The problem was, we were calling the mod player in the audio callback and using the playroutine's MOD playback position (and note) variables for syncing, and the variables, as perceived by the main program, were jumping ahead with jittery jumps, because the effective position values were whatever they were left at by the playroutine after rendering the amount of audio asked for by the SDL callback. The amount of latency and jitter (i.e. how much the latency varies from frame to frame) depended not only on the length of the audio buffer and the tempo of the particular song, but there also seemed to be some extra hidden audio buffering that was invisible to SDL. It was like, if you asked for just two audio buffers, there seemed to be three, on some platforms. So I made a system that time-stamps the playback positions inside the audio buffers, and compensates for the system/driver-specific hidden buffering. And then I empirically hand-tuned the compensation values separately for Mac and Windows, which seemed to be different from Linux.
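
The time-stamping idea can be sketched roughly like this. It is an illustration only, not the code from the prod: all names are hypothetical, the caller is assumed to supply an estimate of how many samples have been played, and a real version would need proper synchronization between the audio callback and the main loop.

```c
#include <stdint.h>

#define MARK_COUNT 64u  /* ring size, assumed to cover all buffered audio */

typedef struct {
    uint64_t start_sample;  /* absolute output sample where this row starts */
    int      row;           /* song position reported by the playroutine */
} PositionMark;

typedef struct {
    PositionMark marks[MARK_COUNT];
    uint32_t     written;   /* total number of marks pushed so far */
} PositionLog;

/* Audio callback side: record where in the output stream a row begins. */
static void position_log_push(PositionLog *pl, uint64_t start_sample, int row)
{
    PositionMark *m = &pl->marks[pl->written % MARK_COUNT];
    m->start_sample = start_sample;
    m->row = row;
    pl->written++;
}

/* Main loop side: return the row that is audible once the hand-tuned
   hidden latency is subtracted, or -1 if nothing is audible yet. */
static int position_log_row_at(const PositionLog *pl,
                               uint64_t played_samples,
                               uint64_t hidden_latency_samples)
{
    uint64_t heard = played_samples > hidden_latency_samples
                   ? played_samples - hidden_latency_samples : 0;
    uint32_t oldest = pl->written > MARK_COUNT ? pl->written - MARK_COUNT : 0;
    int row = -1;
    for (uint32_t i = oldest; i < pl->written; i++) {
        const PositionMark *m = &pl->marks[i % MARK_COUNT];
        if (m->start_sample > heard)
            break;          /* marks are pushed in playback order */
        row = m->row;
    }
    return row;
}
```

Because the main loop looks positions up by sample index instead of reading the playroutine's variables directly, the audio buffer length no longer affects which row is reported.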
added on the 2014-05-29 07:58:29 by yzi yzi
The latency for winmm.dll is quite hard to determine since winmm runs on top of other APIs (and drivers, and hardware) which might do processing, however, from experience 30-50ms should be about right.

I don't think (I might be wrong though) there's a direct way in winmm to determine the exact latency, but it *should not* differ much from system to system, so doing it by trial and error should be just as good. If to you it looks good on 2000 samples, it'll probably look good on 2000 samples on every system.
added on the 2014-05-29 07:59:04 by Jcl Jcl
Forgot to say that after making that change to the "cool_mzx" SDL MOD player, I was able to increase the audio buffer length as much as I wanted, without affecting the sync accuracy at all.

My original question was about Windows "multimedia" audio I use in 4k intros. I already submitted my remote entry to Outline, so let's see how it goes. I used a time offset of 2000 samples, which felt good enough on my system.
added on the 2014-05-29 08:03:16 by yzi yzi
Jcl: thanks. I guess that's what everyone must be doing then. For some reason I couldn't find anything on this subject, even though surely everyone who's used winmm must have faced the same question at some point.
added on the 2014-05-29 08:10:46 by yzi yzi
There's always this trick as well, but well... it requires a head-mounted display :)

http://www.extremetech.com/gaming/181093-oculus-rifts-time-warping-feature-will-make-vr-easier-on-your-stomach

Thanks to good old John Carmack.

Basically that'll decouple the pixel scanout to the display and the frame production into separate modules, where the scanout is synchronized to the screen halves for both eye images.
added on the 2014-05-29 08:11:51 by visy visy
yzi
Quote:
The amount of latency and jitter (i.e. how much the latency varies from frame to frame) depended not only on the length of the audio buffer and the tempo of the particular song, but there also seemed to be some extra hidden audio buffering that was invisible to SDL.

I've noticed the same thing. On Linux, going below 512 samples was trouble, on Windows, the same value seemed to be 1k or 2k, which is nowhere near small enough to be usable as a time resolution.

I was not able to find out which factors affected this. For this reason, if it's not about size, we just sync to the tick count since program start.
added on the 2014-05-29 13:57:58 by Trilkk Trilkk
Winmm latency doesn't get into it!

This only matters if you're trying to respond to user events, not if you're syncing a known video with a known audio stream.

The latency is between "samples getting submitted" and "samples getting played". It does not matter at all for timing when you use waveOutGetPosition (or equivalents in other APIs) for sync. It only does when you use some other timer (e.g. QueryPerformanceCounter) relative to the time you started playing sound - but don't do that anyway!
added on the 2014-05-29 14:21:01 by ryg ryg
I have also found that a value around 2000 works well when running at full 60 fps. I usually use 2048, since it compresses better. ;)

The Clinkster player API has this delay compensation built in, before conversion to music-tick-based time. The default is 2048 samples, but it can be adjusted if desired.
added on the 2014-05-29 16:15:20 by Blueberry Blueberry
It's weird when the sound comes first and the visuals after.
Light travels faster than sound and we're so used to it that reversing the order just feels weird.
If the sound comes after the visuals you still associate the visual effect with the sound effect.
And then there is the issue that who knows how many buffers and how much delay you have in a computer.

Because of these reasons I always use some offset.
added on the 2014-05-29 18:09:40 by musk musk
