Steinar H. Gunderson

Thu, 07 Oct 2010 - VLC latency, part 3: Codec latency

In the previous parts, I wrote a bit about the motivation and overall plan for our attempt at a low-latency VLC setup. Today we've come to our first specific source of latency, namely codec latency.

We're going to discuss VLC's streaming architecture in more detail later on, but for now we can live with the (over)simplified idea that the data comes in from some demuxer which separates audio and video, then audio and video are decoded and encoded to their new formats in separate threads, and finally a mux combines the newly encoded audio and video into a single bit stream again, which is sent out to the client.

In our case, there are a few givens: The Blackmagic SDI driver actually takes on the role of a demuxer (even though a demuxer normally works on some bitstream on disk or from the network), and we have to use the TS muxer (MPEG Transport Stream, a very common choice) because that's the only thing that works with UDP output, which we need because we are to use multicast. Also, in our case, the “decoders” are pretty simple, given that the driver outputs raw (PCM) audio and video.

So, there are really only two choices to be made, namely the audio and video codec. These were also the first places where I started to attack latency, given that they were the most visible pieces of the puzzle (although not necessarily the ones with the most latency).

For video, x264 is a pretty obvious choice these days, at least in the free software world, and in fact, what originally inspired the project was this blog post on x264's newfound support for various low-latency features. (You should probably go read it if you're interested; I'm not going to repeat what's said there, given that the x264 people can explain their own encoder a lot better than I can.)

Now, in hindsight I realized that most of these are not really all that important to us, given that we can live with somewhat unstable bandwidth use. Still, I wanted to try out at least Periodic Intra Refresh in practice, and some of the other ones looked quite interesting as well.

VLC gives you quite a lot of control over the flags sent to x264; it used to be really cumbersome to control given that VLC had its own set of defaults that was wildly different from x264's own defaults, but these days it's pretty simple: VLC simply leaves x264's defaults alone in almost all cases unless you explicitly override them yourself, and apart from that lets you specify one of x264's speed/quality presets (from “ultrafast” down to “placebo”) plus tunings (we use the “zerolatency” and “film” tunings together, as they don't conflict and both are relevant to us).

At this point we've already killed a few frames of latency. In particular, we no longer use B-frames, which by definition require us to buffer at least one frame, and the “zerolatency” tuning enables slice-based threading, which uses all eight CPUs to encode the same frame instead of encoding eight frames at a time (one on each CPU, with some fancy system for sending the required data back and forth between the threads as it's needed for inter-frame compression). Reading about the latter suddenly made me understand why we always got more problems with “video buffer late for mux” (aka: the video encoder isn't delivering frames fast enough to the mux) when we enabled more CPUs in the past :-)

However, we still had unexpectedly high latency, and some debug printfs (never underestimate debug printfs!) indicated that VLC was sending five full frames to x264 before anything came out at the other end. I dug through VLC's x264 encoder module with some help from the people at #x264dev, and lo and behold, there was a single parameter VLC didn't keep at default, namely the “lookahead” parameter, which was set to... five. (Lookahead is useful to know whether you should spend many or fewer bits on the current frame, but in our case we cannot afford that luxury. In any case, the x264 people pointed out that five is a completely useless number to use; either you have lookahead of several seconds or you just drop the concept entirely.) Add --sout-x264-lookahead 0 and voilà, that problem disappeared.
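
To put a rough number on what those five frames cost: every frame the encoder holds back adds one frame duration of delay. A minimal sketch in Python, assuming a 50 fps source (an assumption, chosen to match the 20 ms frame time used later in this post; the function name is made up for illustration):

```python
def buffering_latency_ms(frames_held: int, fps: float) -> float:
    """Delay added by an encoder that holds back `frames_held` frames."""
    return frames_held * 1000.0 / fps


# Assumed frame rate: 50 fps, i.e. 20 ms per frame.
print(buffering_latency_ms(5, 50.0))  # lookahead of five frames -> 100.0
print(buffering_latency_ms(0, 50.0))  # lookahead disabled       -> 0.0
```

The same arithmetic applies to any other stage that waits for whole frames, such as B-frame reordering.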

Periodic Intra Refresh (PIR), however, was another story. It's easily enabled with --sout-x264-intra-refresh (which also forces a few other options currently, such as --sout-x264-ref 1, i.e., use reference pictures at most one frame back; most of these are not conceptual limitations, though, just an effect of the current x264 implementation), but it causes problems for the client. Normally, when the VLC client “tunes in” to a running stream, it waits until the first key frame before it starts showing anything. With PIR, you can run for ages with no key frames at all (if there's no clear scene cut); that's sort of the point of it all. Thus, unless the client happened to actually see the start of the stream, it could be stuck in a state where it would be unable to show anything at all. (It should be said that there was also a server-side shortcoming in VLC here at one time, where it didn't always mark the right frames as keyframes, but that's also fixed in the 1.1 series.)

So, we have to patch the client. It turns out that the Right Thing(TM) to do is to parse something called SEI recovery points, which is a small piece of metadata the encoder inserts whenever it's beginning a new round of its intra refresh. Essentially this says something like “if you start decoding here now, in NN frames you will have a correct [or almost correct, if a given bit is set] picture no matter what you have in your buffer at this point”. I made a patch which was reviewed and is now in VLC upstream; there have been some concerns about correctness, though (although none that cover our specific use-case), so it might unfortunately be reverted at some point. We'll see how it goes.
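
The idea can be sketched as a small state machine. This is not the actual VLC patch, only an illustration in Python: the Frame class and displayable_frames function are hypothetical, the field name recovery_frame_cnt is borrowed from the H.264 syntax element, the count is treated as a plain number of frames, and the “almost correct” bit is ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Frame:
    """Hypothetical, already-parsed frame; not a VLC or x264 structure."""
    number: int
    is_keyframe: bool = False
    # Set when the frame carries an SEI recovery point: the number of
    # frames after which the picture is expected to be correct.
    recovery_frame_cnt: Optional[int] = None


def displayable_frames(frames: Iterable[Frame]) -> Iterator[Frame]:
    """Yield the frames a client that tuned in mid-stream could show."""
    frames_until_clean: Optional[int] = None  # None: not decoding yet
    for frame in frames:
        if frame.is_keyframe:
            frames_until_clean = 0            # correct immediately
        elif frames_until_clean is None:
            if frame.recovery_frame_cnt is None:
                continue                      # nothing to start from yet
            frames_until_clean = frame.recovery_frame_cnt
        if frames_until_clean > 0:
            frames_until_clean -= 1           # decode, but do not show
            continue
        yield frame


# A client tunes in at frame 0; frame 2 starts a new refresh round.
stream = [Frame(0), Frame(1), Frame(2, recovery_frame_cnt=3),
          Frame(3), Frame(4), Frame(5), Frame(6)]
print([f.number for f in displayable_frames(stream)])  # [5, 6]
```

Without the recovery point, the same client would show nothing at all, since the stream above contains no key frame.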

Anyhow, now we're down to theoretical sub-frame (<20 ms) latency in the video encoder, so let's talk about audio. It might not be obvious to most people, but the typical audio codecs we use today (MP3, Vorbis, AAC, etc.) have quite a bit of latency inherent to the design. For instance, MP3 works in 576-sample blocks at some point; that's 12 ms at 48 kHz, and the real situation is much worse, since that's within a subband, which has already been filtered and downsampled. You'll probably find that MP3 latency in practice is about 150–200 ms or so (IIRC), and AAC is something similar; in any case, at this point audio and video were noticeably out of sync.
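
The 12 ms figure is just block size divided by sample rate: a block-based encoder cannot emit anything until it has collected a full block. A tiny sketch in Python (the function name is made up for illustration; this only gives a lower bound and says nothing about the filtering delay mentioned above):

```python
def block_duration_ms(samples: int, sample_rate_hz: int) -> float:
    """Time span covered by one block of audio samples, in milliseconds."""
    return samples * 1000.0 / sample_rate_hz


print(block_duration_ms(576, 48_000))  # 12.0
```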

The x264 post mentions CELT as a possible high-quality, low-latency audio codec. I looked a bit at it, but

  1. VLC doesn't currently support it,
  2. It's not bitstream stable (which means that people will be very reluctant to distribute anything linked against it, as you can break client/server compatibility at any time), and
  3. It does not currently have a TS mapping (a specification for how to embed it into a TS mux; every codec needs such a mapping), and I didn't really feel like going through the procedure of defining one, getting it standardized and then implementing it in VLC.

I looked through the list of what usable codecs were supported by the TS demuxer in the client, though, and one caught my eye: LPCM. (The “L” stands for simply “linear” — it just means regular old PCM for all practical purposes.) It turns out both DVDs and Blu-rays have support for PCM, including surround and all, and they have their own ways of chopping the PCM audio into small blocks that fit neatly into a TS mux. It eats bandwidth, of course (48 kHz 16-bit stereo is about 1.5 Mbit/sec), but we don't really need to care too much; one of the privileges of controlling all parts of the chain is that you know where you can cut the corners and where you cannot.
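
For reference, the bandwidth figure is simply sample rate times sample size times channel count. A minimal sketch in Python (the function name is made up for illustration, and the number excludes whatever the LPCM and TS headers add on top):

```python
def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw PCM payload bitrate in bits per second, excluding container overhead."""
    return sample_rate_hz * bits_per_sample * channels


print(pcm_bitrate_bps(48_000, 16, 2) / 1e6)  # 1.536 (Mbit/sec)
```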

The decoder was already in place, so all I had to do was to write an encoder. The DVD LPCM format is dead simple; the decoder was a bit underdocumented, but it was easy to find more complete specs online and update VLC's comments. The resulting patch was again sent in to VLC upstream, and is currently pending review. (Actually I think it's just forgotten, so I should nag someone into taking it in. It seems to be well received so far.)

With LPCM in use, the audio and video dropped neatly back into sync, and at this point, we should have effectively zero codec latency except the time spent on the encoding itself (which should surely be below one frame, given that the system works in realtime). That means we can start hacking at the rest of the system; this is essentially where the hard, tedious part starts, given that we're venturing into the unknowns of VLC internals.

This also means we're done with part three; tomorrow we'll be talking about timing and timestamps. It's perhaps a surprising topic, but very important for understanding VLC's architecture (or that of any video player in general), for appreciating the difficulties of finding and debugging latency issues, and for seeing where we can find hidden sources of latency.

[23:14] | | VLC latency, part 3: Codec latency
