Steinar H. Gunderson

Fri, 08 Oct 2010 - VLC latency part 4: Timing

Previously in this series on our attempt at a low-latency VLC setup for The Gathering 2011, I've written about the motivation and the overall plan, and yesterday about codec latency. Today we'll look at the topic of timestamps and timing, a part most people probably think relatively little about, but which is still central to any sort of audio/video work. (It is also probably the last general exposition in the series; we're running out of relevant wide-ranging topics to talk about, at least of the things I pretend to know anything about.)

Timing is surprisingly difficult to get right; in fact, I'd be willing to bet that more hair has been ripped out over the supposedly mundane issues of demuxing and timestamping than over most other issues in creating a working media player. (Of course, never having made one, that's just a guess.) The main source of complexity can be expressed through this quote (usually attributed to one “Lee Segall”, whoever that is):

“A man with a watch knows what time it is. A man with two watches is never sure.”

In a multimedia pipeline, we have not only two but several different clocks to deal with: The audio and video streams both have clocks, the kernel has its own clock (usually again based on several different clocks, but you don't need to care much about that), the client at the other end has a kernel clock, and the video and audio cards for playback both have clocks. Unless they are somehow derived from exactly the same clock, all of these can have different bases, move at different rates, and drift out of sync from each other. That's of course assuming they are all stable and don't do weird things like suddenly jump backwards five hours and then back again (or not).

VLC, as a generalist application, generally uses the only reasonable general approach, which is to try to convert all of them into a single master timer, which comes from the kernel. (There are better specialist approaches in some cases; for instance, if you're transcoding from one file to another, you don't care about the kernel's idea of time, and perhaps you should choose one of the input streams' timestamps as the master instead.)

This happens separately for audio and video, in our case right before it's sent to the transcoder — VLC takes a system timestamp, compares it to the stream timer, and then tries to figure out how the stream timer and the system timer relate, so it can convert from one to the other. (The observant reader, who unfortunately has never existed, will notice that it should have taken this timestamp when it actually received the frame, not at the point where it's about to encode it. There's a TODO about this in the source code.) As the relations might change over time, it tries to slowly adjust the bases and rates to match reality. (Of course, this is rapidly getting into control theory, but I don't think you need to go there to get something that works reasonably well.) Similarly, if audio and video go too much out of sync, the VLC client will actually start to stretch audio one way or the other to get the two clocks back in sync without having to drop frames. (Or so I think. I don't know the details very well.)
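To make the idea a bit more concrete, here is a minimal sketch of such a conversion — not VLC's actual code (the real logic also estimates the rate, not just the offset, and is considerably more involved), just the general shape of estimating an offset from observations and adjusting it slowly:

    /* Minimal sketch of stream->system clock conversion; not VLC's actual
     * implementation, which also tracks the rate, not just the offset. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        int has_ref;
        int64_t offset_us;  /* estimated (system time - stream time), microseconds */
    } clock_est_t;

    /* Feed one (stream timestamp, system timestamp) observation. */
    static void clock_update(clock_est_t *c, int64_t stream_us, int64_t system_us)
    {
        int64_t observed = system_us - stream_us;
        if (!c->has_ref) {
            c->offset_us = observed;  /* the first observation sets the base */
            c->has_ref = 1;
        } else {
            /* Move only slowly toward the observed offset, so that jitter in
             * the observations does not turn into jumps in converted time. */
            c->offset_us += (observed - c->offset_us) / 16;
        }
    }

    /* Convert a stream timestamp into the system (master) time base. */
    static int64_t clock_convert(const clock_est_t *c, int64_t stream_us)
    {
        return stream_us + c->offset_us;
    }

    int main(void)
    {
        clock_est_t c = { 0, 0 };
        clock_update(&c, 0, 1000000);      /* stream t=0 seen at system t=1 s */
        clock_update(&c, 20000, 1020500);  /* next frame, with 0.5 ms of jitter */
        printf("%lld\n", (long long)clock_convert(&c, 40000));
        return 0;
    }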

But wait, there's more. All data blocks have two timestamps, the presentation timestamp (PTS) and the decode timestamp (DTS). (This is not a VLC invention by any means, of course.) You can interpret both as deadlines; the PTS is a deadline for when you are to display the block, and the DTS is a deadline for when the block is to be decoded. (For streaming over the network, you can interpret “display” and “decode” figuratively; the UDP output, for instance, tries to send out the block before the DTS has arrived.) For a stream, generally PTS=DTS, except when you need to decode frames out of order (think B-frames). Inside VLC, after the time has been converted to the global base, there's a concept of a “PTS delay”, which despite the name is added to both PTS and DTS. Without a PTS delay, the deadline would be equivalent to the stream acquisition time, so all the packets would be sent out too late, and if you had --sout-transcode-hurry-up (which is the default), the frames would simply get dropped. Again confusingly, the PTS delay is set by the various --*-caching options, so basically you want to set --decklink-caching as low as you can without warnings about “packet sent out too late” showing up en masse.
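As a rough illustration (again a sketch of the principle, not VLC's actual code path; the field names only loosely mirror VLC's block_t), the effect of the PTS delay and a hurry-up-style drop check looks something like this:

    /* Sketch of the PTS-delay idea; not the actual VLC code. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        int64_t i_pts;  /* presentation deadline, microseconds */
        int64_t i_dts;  /* decode deadline, microseconds */
    } block_t;

    /* The PTS delay (set through the --*-caching options) is added to both
     * timestamps; without it, the deadlines equal the acquisition time, and
     * every block is already late by the time it reaches the output. */
    static void apply_pts_delay(block_t *b, int64_t pts_delay_us)
    {
        b->i_pts += pts_delay_us;
        b->i_dts += pts_delay_us;
    }

    /* A hurry-up-style check: if the decode deadline has passed, drop the block. */
    static bool should_drop(const block_t *b, int64_t now_us)
    {
        return now_us > b->i_dts;
    }

    int main(void)
    {
        block_t b = { 1000000, 1000000 };  /* acquired at t=1 s, PTS=DTS */
        apply_pts_delay(&b, 300000);       /* 300 ms of caching */
        printf("drop at t=1.1 s? %d\n", should_drop(&b, 1100000));  /* 0: still in time */
        return 0;
    }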

Finally, all blocks in VLC have a concept of length, in milliseconds. This sounds like an obvious choice until you realize that not all lengths are a whole number of milliseconds; for instance, the LPCM blocks are 80 samples long, which is 5/3 ms (about 1.667 ms). Thus, you need to get the rounding right if you want all the lengths to add up — there are functions to help with this if your timestamps can be expressed as rational numbers. And of course, since consecutive blocks might get converted to system time using different parameters, pts2 - pts1 might very well be different from the length. (Otherwise, you could never ever adjust a stream base.) And to make things even more confusing, the length parameter is described as optional for some types of blocks, but only in some parts of the VLC code. You can imagine latency problems being pretty difficult to debug in an environment like this, with several different time bases in use from different threads at the same time.
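As an example of the rounding issue — a sketch, not VLC's own helpers — keeping the accumulated length as an exact rational (samples over sample rate) and rounding only at the point of use makes the LPCM block lengths add up, where naive per-block rounding to whole milliseconds drifts badly:

    /* Sketch: keep block lengths as an exact rational (samples / sample rate)
     * and round only when a timestamp is actually needed, so that LPCM blocks
     * of 80 samples (5/3 ms at 48 kHz) still add up correctly. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        int64_t base_ms;   /* time base, milliseconds */
        uint64_t samples;  /* samples accumulated since the base */
        uint64_t rate;     /* sample rate, e.g. 48000 */
    } rational_clock_t;

    static void rc_add_samples(rational_clock_t *rc, uint64_t n)
    {
        rc->samples += n;
    }

    /* Current time in ms, rounded once, at the point of use. */
    static int64_t rc_now_ms(const rational_clock_t *rc)
    {
        return rc->base_ms + (int64_t)((rc->samples * 1000 + rc->rate / 2) / rc->rate);
    }

    int main(void)
    {
        /* 600 LPCM blocks of 80 samples at 48 kHz is exactly one second. */
        rational_clock_t rc = { 0, 0, 48000 };
        int64_t naive_ms = 0;
        for (int i = 0; i < 600; i++) {
            rc_add_samples(&rc, 80);
            naive_ms += 2;  /* 5/3 ms rounded up per block */
        }
        printf("rational: %lld ms, naive: %lld ms\n",
               (long long)rc_now_ms(&rc), (long long)naive_ms);  /* 1000 vs 1200 */
        return 0;
    }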

But again, our problem at hand is simpler than this, and with some luck, we can short-circuit away much of the complexity we don't need. To begin with, SDI has locked audio and video; at 50 fps, you always get exactly 20 ms of audio (960 samples at 48000 Hz) with every video frame. So we don't have to worry about audio and video going out of sync, as long as VLC doesn't do too different things to the two.

This is not necessarily a correct assumption — for instance, remember that VLC can sample the system timer for the audio and the video at different times and in different threads, so even though they originally have the same timestamp from the card, VLC can think they have different time bases, and adjust the time for the audio and video blocks differently. They start out locked, but VLC does not process them as such, and once they drift even a little out of sync, things get a lot harder.

Thus, I eventually found out that the easiest thing for me was to take the kernel timer out of the loop. The Blackmagic cards give you access to the SDI system timer, which is locked to the audio and video timers. You only get the audio and video timestamps when you receive a new frame, but you can query the SDI system timer at any time, just like the kernel timer. You can also ask it how far you are into the current frame, so if you just subtract that, you will get a reliable timestamp for the acquisition of the previous frame — assuming you haven't been so busy that you skipped an entire frame, in which case you lose anyway.
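The arithmetic is simple; here is a sketch with hypothetical (here simulated) sdi_* helpers standing in for whatever the card's hardware-clock query actually looks like:

    /* Sketch: subtract "how far into the current frame are we?" from the SDI
     * system timer to get a reliable acquisition timestamp for the previous
     * frame. The sdi_* helpers are hypothetical stand-ins, simulated below;
     * the arithmetic is the point, not the API. */
    #include <stdint.h>
    #include <stdio.h>

    #define FRAME_DURATION_US 20000  /* 50 fps */

    /* Simulated SDI system timer, microseconds since some arbitrary base. */
    static int64_t sdi_system_time_us(void)
    {
        return 123456789;  /* in reality: queried from the card at any time */
    }

    /* Simulated "how far into the current frame are we?" query. */
    static int64_t sdi_time_in_frame_us(void)
    {
        return sdi_system_time_us() % FRAME_DURATION_US;
    }

    /* Acquisition timestamp for the previous frame — assuming we haven't
     * been so busy that we skipped an entire frame. */
    static int64_t frame_acquisition_time_us(void)
    {
        return sdi_system_time_us() - sdi_time_in_frame_us();
    }

    int main(void)
    {
        printf("previous frame acquired at %lld us (SDI time base)\n",
               (long long)frame_acquisition_time_us());
        return 0;
    }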

The SDI timer's resolution is only 4 ms, it seems, but that's enough for us — also, even though its rate is 1:1 with the other timers, its base is not the same, so there's a fixed offset. However, VLC can already deal with situations like this, as we've seen earlier; as long as the base never changes, it will be right from the first block, and there will never be any drift. I wrote a patch to a) propagate the actual frame acquisition time to the clock correction code (so it's timestamped only once, and the audio and video streams will get the same system timestamps), and b) make VLC's timer functions fetch the SDI system timer from the card instead of from the kernel. I don't know if the last part was actually necessary, but it certainly made debugging/logging a lot easier for me. One True Time Base, hooray. (The patch has not been sent upstream yet; I don't know whether it would realistically be accepted, and it needs more cleanup anyhow.)

So, now our time bases are in sync. Wonder how much delay we have? The easiest thing is of course to run a practical test, timestamping things on input and looking at what happens at the output. By “timestamping”, I mean in the easiest possible way: just let the video stream capture a clock, and compare the output with another (in-sync) clock. Of course, this means your two clocks need to be either the same or in sync — and for the first test, my laptop suddenly plain refused to sync properly to NTP, staying off by 50 ms or so. Still, we did the test, with about 50 ms network latency, with a PC running a simple clock program to generate the video stream:

Clock delay photo

(Note, for extra bonus, that I didn't think of taking a screenshot instead of a photo. :-) )

You'll see that despite tuning codec delay, despite turning down the PTS delay, and despite having SDI all the way in the input stream, we have a delay of a whopping 1.2 seconds. Disheartening, no? Granted, it's at 25 fps (50i) and not 50 fps, so all frames take twice as long, but even after a theoretical halving (which is unrealistic), we're a far cry from the 80 ms we wanted.

However, with that little cliffhanger, our initial discussion of timing is done. (Insert lame joke about “blog time base” or similar here.) We'll look at tracing the source(s) of this unexpected amount of latency tomorrow, when we look at what happens when the audio and video streams are to go back into one, and what that means for the pipeline's latency.


Steinar H. Gunderson <sgunderson@bigfoot.com>