In previous parts, I've been talking about the motivations for our low-latency VLC setup, the overall plan, codec latency and finally timing. At this point we're going down into more specific parts of VLC, in an effort to chop away the latency.
So, when we left the story the last time, we had a measured 1.2 seconds of latency. Obviously there's a huge source of latency we've missed somewhere, but where?
At this point I did what I guess most others would have done: invoked VLC with --help and looked for interesting flags. There's a first obvious candidate, namely --sout-udp-caching, which is yet another buffering flag, but this time on the output side. (It seems to delay the DTS, i.e. the sending time in this case, by that many milliseconds, so it's a sender-side equivalent of the PTS delay.) Its default, just like the other “-caching” options, is 300 ms. Set it down to 5 ms (later 1 ms), and whoosh, there goes some delay. (There's also a flag called “DTS delay” which seems to adjust the PCR relative to the DTS, to give the client some more chance at buffering. I have no idea why the client would need the encoder to specify this.)
But there's still lots of delay left, and with some help from the people on #x264dev (it seems like many of the VLC developers hang there, and it's a lot less noisy than #videolan :-) ) I found the elusively-named flag --sout-ts-shaping, which belongs to the TS muxer module.
To understand what this parameter (the “length of the shaping interval, in milliseconds”) is about, we'll need to take a short look at what a muxer does. Obviously it does the opposite of a demuxer — take in audio and video, and combine them into a finished bitstream. It's imperative at this point that they are in the same time base, of course; if you include video from 00:00:00 and audio from 00:00:15 next to each other, you can be pretty sure there's a player out there that will have problems playing your audio and video in sync. (Not to mention you get fifteen seconds delay, of course.)
VLC's TS muxer (and muxing system in general) does this by letting the audio and video threads post to separate FIFOs, which the muxer can read from. (There's a locking issue in here in that the audio and video encoding seem to take the same lock before posting to these FIFOs, so they cannot go in parallel, but in our case the audio decoding is nearly free anyway, so it doesn't matter. You can add separate transcoder threads if you want to, but in that case, the video goes via a ring buffer that is only polled when the next frame comes, so you add about half a frame of extra delay.) The muxer then is alerted whenever there's new stuff added to any of the FIFOs, and sees if it can output a packet.
Now, I've been told that VLC's TS muxer is a bit suboptimal in many respects, and that there's a new one that has been living out-of-tree for a while, but this is roughly how the current one works:
1. Pick one stream as the PCR stream (PCR is MPEG-speak for “Program Clock Reference”, the global system clock for that stream), and read blocks of total length equivalent to the shaping period. VLC's TS muxer tries to pick the video stream as the PCR stream if one exists. (Actually, VLC waits a bit at the beginning to give all streams a chance to start sending data, which identifies the stream. That's what the --sout-mux-caching flag is for. For us, it doesn't matter too much, though, since what we care about is the steady state.)
2. Read data from the other streams until they all have at least caught up with the PCR stream.
From #1 it's pretty obvious that the default shaping interval of 200 ms is going to delay our stream by several frames. Setting it down to 1, the lowest possible value, again chops off some delay.
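To make the effect of the shaping interval concrete, here is a minimal sketch of one muxing round following the two steps listed above. It is a toy model, not VLC's code: the Block type, the mux_round function and the deque-based FIFOs are all hypothetical names invented for illustration, and timestamps are in milliseconds.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Block:
    dts: float     # decode timestamp, in ms
    length: float  # duration, in ms


def mux_round(pcr_fifo, other_fifos, shaping_ms):
    """Try one muxing round; return the blocks to send, or None to keep waiting.

    Hypothetical model of the two steps described in the text.
    """
    # Step 1: need PCR-stream blocks covering at least the shaping period.
    total = 0.0
    needed = 0
    for block in pcr_fifo:
        total += block.length
        needed += 1
        if total >= shaping_ms:
            break
    if total < shaping_ms:
        return None  # not enough video buffered yet

    last = pcr_fifo[needed - 1]
    pcr_end = last.dts + last.length

    # Step 2: every other stream must have caught up with the PCR stream.
    for fifo in other_fifos:
        if not fifo or fifo[-1].dts + fifo[-1].length < pcr_end:
            return None  # e.g. one audio block short: wait for more data

    out = [pcr_fifo.popleft() for _ in range(needed)]
    for fifo in other_fifos:
        while fifo and fifo[0].dts < pcr_end:
            out.append(fifo.popleft())
    return sorted(out, key=lambda b: b.dts)
```

With 40 ms video frames, a 200 ms shaping interval cannot produce anything until five frames are queued, while an interval of 1 ms lets a single frame through as soon as the audio has caught up with it.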
At this point, I started adding printfs to see how the muxer worked, and it seemed to be behaving relatively well; it picked video blocks (one frame, 40 ms), and then a bunch of audio blocks (the audio is chopped into 80-sample blocks by the LPCM encoder, in addition to a 1024-sample chop at some earlier point I don't know the rationale for). However, sometimes it would be an audio block short, and refuse to mux until it got more data (read: the next frame). More debugging ensued.
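For a sense of scale, the block sizes in this paragraph work out as below. The 48 kHz sample rate is an assumption (it is the usual rate for SDI embedded audio, but the text does not state it); the 40 ms frame and the 80-sample blocks come from the paragraph above.

```python
SAMPLE_RATE_HZ = 48_000   # assumption: not stated in the text
FRAME_MS = 40             # one video frame at 25 fps
LPCM_BLOCK_SAMPLES = 80   # block size produced by the LPCM encoder

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
blocks_per_frame = samples_per_frame // LPCM_BLOCK_SAMPLES
block_ms = 1000 * LPCM_BLOCK_SAMPLES / SAMPLE_RATE_HZ

print(samples_per_frame)   # 1920
print(blocks_per_frame)    # 24
print(round(block_ms, 3))  # 1.667
```

Under that assumption, a single missing audio block is less than 2 ms of sound, yet it is enough to hold back a whole 40 ms video frame.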
At this point, I take some narrative pain for having presented the story a bit out-of-order; it was actually first at this point I added the SDI timer as the master clock. The audio and video having different time bases would cause problems where the audio would be moved, say, 2ms more ahead than the video. Sorry, you don't have enough video to mux, wait for more. Do not collect $500. (Obviously, locking the audio and video timestamps fixed this specific issue.) Similarly, I found and fixed a few rounding issues in the length calculations in the LPCM encoder that I've already talked briefly about.
But there's more subtlety, and we've touched on it before. How do you find the length of a block? The TS muxer doesn't trust the length parameter from previous rounds, and perhaps with good reason; the time base correction could have moved the DTS and PTS around, which certainly should also skew the length of the previous block. Think about it; if you have a video frame at PTS=0.000 and then one at PTS=0.020, and the second frame gets moved to PTS=0.015, the first one should obviously have length 0.015, not 0.020. However, it may already have been sent out, so you have a problem. (You could of course argue that you don't have a problem, since your muxing algorithm shouldn't necessarily care about the lengths at all, and the new TS muxer reportedly does not. However, this is the status quo, so we have to care about the length for now. :-) )
The TS muxer solves this in a way that works fine if you don't care about latency; it never processes a block until it also has the next block. By doing this, it can simply say that this_block.length = next_block.dts - this_block.dts, and simply ignore the incoming length parameter of the block.
This makes for chaos for our purposes, of course -- it means we will always have at least one video frame of delay in the mux, and if for some reason the video should be ahead of the audio (we'll see quite soon that it usually was!), the muxer will refuse to mux the packet on time because it doesn't trust the length of the last audio block.
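The look-ahead rule and its cost can be sketched in a few lines. This is a hypothetical model, not the TS muxer's actual data structure; the class and method names are made up, and the timestamps reuse the PTS=0.000 / 0.020 / 0.015 example from earlier.

```python
from collections import deque


class LookaheadFifo:
    """Releases a block only once its successor has arrived (toy model)."""

    def __init__(self):
        self._pending = deque()

    def push(self, dts, length=None):
        # The incoming length is deliberately ignored, as described above.
        self._pending.append(dts)

    def pop_ready(self):
        """Return (dts, derived_length) pairs for blocks that have a successor."""
        ready = []
        while len(self._pending) >= 2:
            this_dts = self._pending.popleft()
            next_dts = self._pending[0]
            ready.append((this_dts, next_dts - this_dts))
        return ready


fifo = LookaheadFifo()
fifo.push(0.000, length=0.020)
print(fifo.pop_ready())  # [] : the frame is stuck until the next one arrives
fifo.push(0.015)         # the second frame was moved from 0.020 to 0.015
print(fifo.pop_ready())  # [(0.0, 0.015)]
```

The derived length is always right, but the newest block in each FIFO can never leave, which is exactly the one-block delay complained about here.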
I don't have a good upstream fix for this, except that again, this is supposedly fixed in the new muxer. In my particular case, I did a local hack and simply made the muxer trust the incoming length -- I know it's good anyway. (Also, of course, I could then remove the demand that there be at least two blocks left in the FIFO to fetch out the first one.)
But even after this hack, there was a problem that I'd been seeing throughout the entire testing, but never really understood; the audio was consistently coming much later than the video. This doesn't make sense, of course, given that the video goes through x264, which takes a lot of CPU time, and the audio is just chopped into blocks and given DVD LPCM headers. My guess was some serialization somewhere in the pipeline (and I did indeed find one, the serialized access to the FIFOs mentioned earlier, but it was not the right source), and I started searching. Again lots of debug printfs, but this time, at least I had pretty consistent time bases throughout the entire pipeline, and only two to care about. (That, and I drew a diagram of all the different parts of the code; again, it turns out there's a lot of complexity that's basically short-circuited since we work with raw audio/video in the input.)
The source of this phenomenon was finally found in a place I didn't even know existed: the “copy packetizer”. It's not actually in my drawing, but it appears it sits between the raw audio decoder (indeed…) and the LPCM encoder. It seems mostly to be a dummy module because VLC needs there to actually be a packetizer, but it does do some sanity checking and adjusts the PTS, DTS and... length. Guess how it finds the new length. :-)
For those not following along: It waits for the next packet to arrive, and sets the new length equivalent to next_dts - this_dts, just like the TS muxer. Obviously this means one block of latency, which means that the TS muxer will always be one audio block short when trying to mux the video frame that just came in. (The video goes through no similar treatment along its path to the mux.) This, in turn, translates to a minimum latency of one full video frame in the mux.
So, again a local hack: I have no idea what downstream modules may rely on the length being correct, but in my case, I know it's correct, so I can just remove this logic and send the packet directly on.
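A rough sketch of why the held-back block matters downstream, comparing the look-ahead behaviour with the pass-through hack. Both functions are hypothetical stand-ins rather than the copy packetizer's real code, and the 10 ms audio blocks are chosen only to keep the numbers readable.

```python
def lookahead_packetizer(blocks):
    """Yield each (dts, length) block only after its successor has arrived."""
    previous = None
    for block in blocks:
        if previous is not None:
            dts, _ = previous
            yield (dts, block[0] - dts)  # length derived from the next DTS
        previous = block
    # The last block is still being held when the input runs dry.


def passthrough_packetizer(blocks):
    """The local hack: trust the incoming length and forward immediately."""
    yield from blocks


def covered_until(blocks):
    """End time (ms) of the audio that has actually reached the muxer."""
    return max((dts + length for dts, length in blocks), default=0)


# Audio delivered alongside one 40 ms video frame, as (dts, length) in ms.
frame_audio = [(0, 10), (10, 10), (20, 10), (30, 10)]

print(covered_until(lookahead_packetizer(frame_audio)))    # 30
print(covered_until(passthrough_packetizer(frame_audio)))  # 40
```

With look-ahead, the muxer only sees audio up to 30 ms for a video frame that ends at 40 ms, so the frame has to wait until the next batch of audio shows up; with pass-through, the audio covers the full frame right away.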
And now, for perhaps some disappointing news: This is the point where the posting catches up with what I've actually done. How much latency is there? I don't know, but if I set x264 to “ultrafast” and add some printfs here and there, it seems like I can start sending out UDP packets about 4 ms after receiving the frame from the driver. What does this translate to in end-to-end latency? I don't know, but the best test we had before the fixes mentioned (the SDI master clock, the TS mux one-block delay, and the copy packetizer one-block delay) showed about 250 ms.
That was on a machine that plays its own stream from the local network, so intrinsically about 50 ms better than our previous test, but the difference between 1.2 seconds and 250 ms is obviously quite a lot nevertheless.
My guess is that with these fixes, we'll touch about 200 ms, and then when we go to true 50p (so we actually get 50 frames per second, as opposed to 25 frames which each represent two fields), we'll about halve that. Encoding 720p50 in some reasonable quality is going to take some serious oomph, though; that's really out of our hands, and I trust the x264 guys to keep doing their magic much better than I can.
So, I guess that rounds off the series; all that's left for me to write at the current stage is a list of the corrections I've received, which I'll do tomorrow. (I'm sure there will be new ones to this part :-) )
What's left for the future? Well, obviously I want to do a new end-to-end test, which I'll do as soon as I have the opportunity. Then I'm quite sure we'll want to run an actual test at 720p50, and for that I think I'll need to actually get hold of one of these cards myself (including a desktop machine fast enough to drive it). Hello Blackmagic, if you by any chance are throwing free cards at people, I'd love one that does HDMI in to test with :-P Of course, I'm sure this will uncover new issues; in particular, we haven't looked much at the client yet, and there might be lurking unexpected delays there as well.
And then, of course, we'll see how it works in practice at TG. My guess is that we'll hit lots of weird issues with clients doing stupid things — with 5000 people in the hall, you're bound to have someone with buggy audio or video drivers (supposedly audio is usually the worst sinner here), machines with timers that jump around like pinballs, old versions of VLC despite big warnings that you need at least X.Y.Z, etc… Only time will tell, and I'm pretty glad we'll have a less fancy fallback stream. :-)