Sun, 10 Oct 2010 - VLC latency, part 6: Notes and corrections
My five-part series on our low-latency VLC setup ended yesterday, with
a discussion about muxing and various one-frame-latency-inducing effects in
different parts of VLC. Before that, I'd been talking about
motivations, the overall plan, codec latency and
timing.
This time there is no new material, only a few notes and corrections from
readers of the series over the last few days. Thanks; you know who you are.
- The UDP output is not the only VLC output that can multicast; in particular,
the RTP mux/output can do it, too, in which case the audio and video streams
are just sent out more or less as separate streams. This would avoid any
problems the TS mux might have, and probably replace them with a whole new
and much more interesting set. (I've never ever had any significant success
with RTP as a user, so I don't think I'll dare to unleash it from the server
side. In any case we need TS for the HTTP streams, and sticking with fewer
sets of options is less confusing if we can get away with it.)
- VLC 1.1.x can reportedly use H.264 decoding hardware if your
machine, OS and drivers all support that. I have no idea how it impacts latency
(it could go both ways, I reckon), but presumably it should at least make
decoding on the client side cheaper in terms of CPU.
- If you're running slice-based threads, you might not want to use too many
threads, or bitrate allocation might be very inefficient.
- The DVD LPCM encoder is now in VLC's git repository. Yay :-)
- The collective bag of hacks I use that have not been sent upstream can
be found here.
Caveat emptor.
Finally, I realized I hadn't actually posted our current command-line anywhere,
so here goes:
vlc \
--decklink-caching 20 --sout-mux-caching 1 --sout-ts-dts-delay 60 --sout-ts-shaping 1 --sout-udp-caching 1 --no-sout-transcode-hurry-up \
--decklink-aspect-ratio 16:9 --decklink-mode pal \
--sout-x264-preset medium --sout-x264-tune zerolatency,film \
--sout-x264-intra-refresh --no-sout-x264-interlaced --sout-x264-weightp 0 --sout-x264-bpyramid strict --sout-x264-ref 1 \
--sout-x264-keyint 25 --sout-x264-lookahead 0 \
--sout-x264-vbv-bufsize 60 --sout-x264-vbv-maxrate 1500 \
-I dummy -v decklink:// vlc://quit \
--sout '#transcode{vcodec=h264,vb=1500,acodec=lpcm}:std{access=udp,mux=ts,dst=10.0.0.1:9094}'
You'll probably want to up the bitrate for a real stream. A lot. And of course, coding PAL
as non-interlaced doesn't make much sense either, unless it happens to be a progressive
signal sent as interlaced.
I'll keep you posted if there's any new movement (new latency tests, more patches
going into VLC, or similar), but for now, that's it. Enjoy :-)
Sat, 09 Oct 2010 - VLC latency, part 5: Muxing
In previous parts, I've been talking about the motivations for our
low-latency VLC setup, the overall plan, codec latency
and finally timing. At this point we're going down into more
specific parts of VLC, in an effort to chop away the latency.
So, when we left the story the last time, we had a measured 1.2 seconds of
latency. Obviously there's a huge source of latency we've missed somewhere,
but where?
At this point I did what I guess most others would have done: invoked VLC with
--help and looked for interesting flags. There's a first obvious candidate,
namely --sout-udp-caching, which is yet another buffering flag, but this time
on the output side. (It seems to delay the DTS, ie. the send time in this
case, by that many milliseconds, so it's a sender-side equivalent of the PTS
delay.) Its default,
just like the other “-caching” options, is 300 ms. Set it down to 5 ms (later
1 ms), and whoosh, there goes some delay. (There's also a flag called
“DTS delay” which seems to adjust the PCR relative to the DTS, to give the
client some more chance at buffering. I have no idea why the client would
need the encoder to specify this.)
But there's still lots of delay left, and with some help from the people on
#x264dev (it seems like many of the VLC developers hang there, and it's a lot
less noisy than #videolan :-) ) I found the elusively-named flag
--sout-ts-shaping, which belongs to the TS muxer module.
To understand what this parameter (the “length of the shaping interval, in
milliseconds”) is about, we'll need to take a short look at what a muxer does.
Obviously it does the opposite of a demuxer — it takes in audio and video
and combines them into a finished bitstream. It's imperative at this point that
they are in the same time base, of course; if you include video from
00:00:00 and audio from 00:00:15 next to each other, you can be pretty
sure there's a player out there that will have problems playing your audio
and video in sync. (Not to mention you get fifteen seconds delay, of course.)
VLC's TS muxer (and muxing system in general) does this by letting the audio
and video threads post to separate FIFOs, which the muxer can read from.
(There's a locking issue in here in that the audio and video encoding seem
to take the same lock before posting to these FIFOs, so they cannot go in
parallel, but in our case the audio decoding is nearly free anyway, so it
doesn't matter. You can add separate transcoder threads if you want
to, but in that case, the video goes via a ring buffer that is only polled
when the next frame comes, so you add about half a frame of extra delay.)
The muxer then is alerted whenever there's new stuff added to any of the FIFOs,
and sees if it can output a packet.
Now, I've been told that VLC's TS muxer is a bit suboptimal in many aspects,
and that there's a new one that has been living out-of-tree for
a while, but this is roughly how the current one works:
- Pick one stream as the PCR stream (PCR is MPEG-speak for “Program Clock
Reference”, the global system clock for that stream), and read blocks
of total length equivalent to the shaping period. VLC's TS muxer tries to
pick the video stream as the PCR stream if one exists. (Actually, VLC waits
a bit at the beginning to give all streams a chance to start sending data,
which is how the streams get identified in the first place. That's what the
--sout-mux-caching flag is for. For us, it doesn't matter too much, though,
since what we care about is the steady state.)
- Read data from the other streams until they all have at least caught up with
the PCR stream.
From #1 it's pretty obvious that the default shaping interval of 200 ms is
going to delay our stream by several frames. Setting it down to 1, the lowest
possible value, again chops off some delay.
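To make that a bit more concrete, here is a toy model of the shaping logic.
It is just a sketch to illustrate the idea; all the types and helper names
are made up, and this is not VLC's actual muxer code:

/* Toy model of the shaping logic described above -- not the real VLC muxer.
 * All names (block_t, fifo_t, ...) are invented for this sketch. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t dts_us;     /* decode timestamp, in microseconds */
    int64_t length_us;  /* block length, in microseconds */
} block_t;

typedef struct {
    block_t blocks[64];
    int     head, count;
} fifo_t;

static block_t *fifo_peek(fifo_t *f) { return f->count ? &f->blocks[f->head] : NULL; }
static block_t *fifo_pop(fifo_t *f)  { block_t *b = fifo_peek(f); if (b) { f->head++; f->count--; } return b; }

/* One muxing round: take blocks from the PCR (video) stream until we have
 * covered one shaping interval, then drain the other (audio) stream until
 * it has caught up with the PCR stream. Returns 0 if we ran dry. */
static int mux_one_round(fifo_t *pcr, fifo_t *other, int64_t shaping_us)
{
    int64_t covered = 0, pcr_end_dts = 0;

    while (covered < shaping_us) {
        block_t *b = fifo_pop(pcr);
        if (!b)
            return 0;                      /* not enough video yet; wait */
        covered    += b->length_us;
        pcr_end_dts = b->dts_us + b->length_us;
        printf("mux video block dts=%lld\n", (long long)b->dts_us);
    }
    while (fifo_peek(other) && fifo_peek(other)->dts_us < pcr_end_dts) {
        block_t *b = fifo_pop(other);
        printf("mux audio block dts=%lld\n", (long long)b->dts_us);
    }
    return 1;
}

int main(void)
{
    fifo_t video = { .head = 0, .count = 0 }, audio = { .head = 0, .count = 0 };

    /* 25 fps video (40 ms frames) and 5/3 ms audio blocks, as in the text. */
    for (int i = 0; i < 5; i++)
        video.blocks[video.count++] = (block_t){ .dts_us = i * 40000, .length_us = 40000 };
    for (int i = 0; i < 60; i++)
        audio.blocks[audio.count++] = (block_t){ .dts_us = i * 5000 / 3, .length_us = 5000 / 3 };

    /* With the default 200 ms shaping interval we would need five whole
     * video frames queued before anything at all left the muxer; with 1 ms,
     * one frame is enough. */
    mux_one_round(&video, &audio, 1000);
    return 0;
}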
At this point, I started adding printfs to see how the muxer worked, and it
seemed to be behaving relatively well; it picked video blocks (one frame,
40 ms), and then a bunch of audio blocks (the audio is chopped into 80-sample
blocks by the LPCM encoder, in addition to a 1024-sample chop at some
earlier point whose rationale I don't know). However, sometimes it would
be an audio block short, and refuse to mux until it got more data (read:
the next frame). More debugging ensued.
At this point, I take some narrative pain for having presented the story
a bit out-of-order; it was actually only at this point that I added the SDI
timer as the master clock. The audio and video having different time bases
would cause problems where the audio would end up, say, 2ms further ahead than
the video. Sorry, you don't have enough video to mux, wait for more. Do not collect
$500. (Obviously, locking the audio and video timestamps fixed this specific
issue.) Similarly, I found and fixed a few rounding issues in the length
calculations in the LPCM encoder that I've already talked briefly about.
But there's more subtlety, and we've touched on it before. How do you find
the length of a block? The TS muxer doesn't trust the length parameter from
previous rounds, and perhaps with good reason; the time base correction could
have moved the DTS and PTS around, which certainly should also skew the
length of the previous block. Think about it; if you have a video frame at
PTS=0.000 and then one at PTS=0.020, and the second frame gets moved to
PTS=0.015, the first one should obviously have length 0.015, not 0.020.
However, it may already have been sent out, so you have a problem. (You could
of course argue that you don't have a problem, since your muxing algorithm
shouldn't necessarily care about the lengths at all, and the new TS muxer
reportedly does not. However, this is the status quo, so we have to care about
the length for now. :-) )
The TS muxer solves this in a way that works fine if you don't care about
latency; it never processes a block until it also has the next block.
By doing this, it can simply say that this_block.length = next_block.dts -
this_block.dts, and simply ignore the incoming length parameter of the block.
This makes for chaos for our purposes, of course -- it means we will always
have at least one video frame of delay in the mux, and if for some reason
the video should be ahead of the audio (we'll see quite soon that it usually
was!), the muxer will refuse to mux the packet on time because it doesn't
trust the length of the last audio block.
I don't have a good upstream fix for this, except that again, this is supposedly
fixed in the new muxer. In my particular case, I did a local hack and simply
made the muxer trust the incoming length -- I know it's good anyway. (Also,
of course, I could then remove the demand that there be at least two blocks
left in the FIFO to fetch out the first one.)
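In code, the difference between the two strategies is tiny; something like
this, reusing the toy block_t from the sketch above (again, not the real
muxer code):

/* Status quo: don't trust the stored length; wait until the *next* block has
 * arrived and derive the length from the DTS difference. This is what forces
 * at least one extra block (one whole video frame) of latency in the mux. */
int64_t length_from_next_block(const block_t *cur, const block_t *next)
{
    return next->dts_us - cur->dts_us;
}

/* Local hack: trust the length the encoder set, so the block can be muxed
 * the moment it arrives, without waiting for its successor. */
int64_t length_from_block_itself(const block_t *cur)
{
    return cur->length_us;
}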
But even after this hack, there was a problem that I'd been seeing throughout
the entire testing, but never really understood; the audio was consistently
coming much later than the video. This doesn't make sense, of course, given
that the video goes through x264, which takes a lot of CPU time, and the audio
is just chopped into blocks and given DVD LPCM headers. My guess was some
serialization somewhere in the pipeline (and I did indeed find one, the
serialized access to the FIFOs mentioned earlier, but it was not the right source),
and I started searching. Again lots of debug printfs, but this time, at least
I had pretty consistent time bases throughout the entire pipeline, and only
two to care about. (That, and I drew a diagram of all the different parts of
the code; again, it turns out there's a lot of complexity that's basically
short-circuited since we work with raw audio/video in the input.)
The source of this phenomenon was finally found in a place I didn't even
know existed: the “copy packetizer”. It's not actually in my drawing, but
it appears it sits between the raw audio decoder (indeed…) and the LPCM
encoder. It seems mostly to be a dummy module because VLC needs there to
actually be a packetizer, but it does do some sanity checking and adjusts
the PTS, DTS and... length. Guess how it finds the new length. :-)
For those not following along: It waits for the next packet to arrive, and
sets the new length equivalent to next_dts - this_dts, just like the TS
muxer. Obviously this means one block latency, which means that the TS
muxer will always be one audio block short when trying to mux the video
frame that just came in. (The video goes through no similar treatment along
its path to the mux.) This, in turn, translates to a minimum latency of
one full video frame in the mux.
So, again a local hack: I have no idea what downstream modules may rely on
the length being correct, but in my case, I know it's correct, so I can
just remove this logic and send the packet directly on.
And now, for perhaps some disappointing news: This is the point where the
posting catches up with what I've actually done. How much latency is there?
I don't know, but if I set x264 to “ultrafast” and add some printfs here
and there, it seems like I can start sending out UDP packets about 4 ms
after receiving the frame from the driver. What does this translate to in
end-to-end latency? I don't know, but the best test we had before the
fixes mentioned (the SDI master clock, the TS mux one-block delay, and the copy
packetizer one-block delay) was about 250 ms:
That's a machine that plays its own stream from the local network, so
intrinsically about 50 ms better than our previous test, but the difference
between 1.2 seconds and 250 ms is obviously quite a lot nevertheless.
My guess is that with these fixes, we'll touch about 200 ms, and then when
we go to true 50p (so we actually get 50 frames per second, as opposed to 25
frames which each represent two fields), we'll about halve that. Encoding
720p50 in some reasonable quality is going to take some serious oomph, but
that's really out of our hands — I trust the x264 guys to keep doing
their magic much better than I can.
So, I guess that rounds off the series; all that's left for me to write at
the current stage is a list of the corrections I've received, which I'll do
tomorrow. (I'm sure there will be new ones to this part :-) )
What's left for the future? Well, obviously I want to do a new end-to-end
test, which I'll do as soon as I have the opportunity. Then I'm quite sure
we'll want to run an actual test at 720p50, and for that I think I'll need
to actually get hold of one of these cards myself (including a desktop machine
fast enough to drive it). Hello Blackmagic, if you by any chance are throwing
free cards at people, I'd love one that does HDMI in to test with :-P
Of course, I'm sure this will uncover new issues; in particular, we haven't
looked much at the client yet, and there might be lurking unexpected delays
there as well.
And then, of course, we'll see how it works in practice at TG. My guess is
that we'll hit lots of weird issues with clients doing stupid things — with
5000 people in the hall, you're bound to have someone with buggy audio or video
drivers (supposedly audio is usually the worst offender here), machines with
timers that jump around like pinballs, old versions of VLC despite big warnings
that you need at least X.Y.Z, etc… Only time will tell, and I'm pretty glad
we'll have a less fancy fallback stream. :-)
Fri, 08 Oct 2010 - VLC latency part 4: Timing
In the previous posts on our attempt at a low-latency VLC setup for The
Gathering 2011, I've written about motivation, the
overall plan and, yesterday, codec latency. Today we'll
look at the topic of timestamps and timing, a part most people probably think
relatively little about, but which is still central to any sort of audio/video
work. (It is also probably the last general exposition in the series; we're
running out of relevant wide-ranging topics to talk about, at least of the
things I pretend to know anything about.)
Timing is surprisingly difficult to get right; in fact, I'd be willing to bet
that more hair has been ripped out over the supposedly mundane issues of
demuxing and timestamping than most other issues in creating a working
media player. (Of course, never having made one, that's just a guess.)
The main source of complexity can be expressed through this quote
(usually attributed to one “Lee Segall”, whoever that is):
“A man with a watch knows what time it is. A man with two watches is never sure.”
In a multimedia pipeline, we have not only two but several different clocks
to deal with: The audio and video streams both have clocks, the kernel has its
own clock (usually again based on several different clocks, but you don't
need to care much about that), the client in the other end has a kernel clock,
and the video and audio cards for playback both have clocks. Unless they are
somehow derived from exactly the same clock, all of these can have different
bases, move at different rates, and drift out of sync from each other.
That's of course assuming they are all stable and don't do weird things like
suddenly jump backwards five hours and then back again (or not).
VLC, as a generalist application, generally uses the only reasonable general
approach, which is to try to convert all of them into a single master timer,
which comes from the kernel. (There are better specialist approaches in some
cases; for instance, if you're transcoding from one file to another, you
don't care about the kernel's idea of time, and perhaps you should choose one
of the input streams' timestamps as the master instead.)
This happens separately for audio and video, in our case right before it's sent
to the transcoder — VLC takes a system timestamp, compares it to the stream
timer, and then tries to figure out how the stream timer and the system timer
relate, so it can convert from one to the other. (The observant reader, who
unfortunately has never existed, will notice that it should have taken this
timestamp when it actually received the frame, not the point where it's about
to encode it. There's a TODO about this in the source code.) As the relations
might change over time, it tries to slowly adjust the bases and rates to
match reality. (Of course, this is rapidly getting into control theory,
but I don't think you need to go there to get something that works reasonably
well.) Similarly, if audio and video go too much out of sync, the VLC client
will actually start to stretch audio one way or the other to get the two
clocks back in sync without having to drop frames. (Or so I think. I don't
know the details very well.)
But wait, there's more. All data blocks have two timestamps, the presentation
timestamp (PTS) and the decode timestamp (DTS). (This is not a VLC invention
by any means, of course.) You can interpret both as deadlines; the PTS is
a deadline for when you are to display the block, and the DTS is a deadline
for when the block is to be decoded. (For streaming over the network, you
can interpret “display” and “decode” figuratively; the UDP output, for instance,
tries to send out the block before the DTS has arrived.) For a stream, generally
PTS=DTS except when you need to decode frames out-of-order (think B-frames).
Inside VLC after the time has been converted to the global base, there's a
concept of “PTS delay”, which despite the name is added to both PTS and DTS.
Without a PTS delay, the deadline would be equivalent to the stream acquisition
time, so all the packets would be sent out too late, and if you had the
--sout-transcode-hurry-up flag set (which is the default), the frames would
simply get dropped. Again confusingly, the PTS delay is set by the various
--*-caching options, so basically you want to set --decklink-caching as low
as you can without warnings about “packet sent out too late” showing up en masse.
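If you prefer code to prose, here is a very stripped-down model of that clock
conversion plus the PTS delay. The real code also slowly adjusts the rate over
time; all the names here are invented for illustration:

/* Very simplified model of the stream-to-system clock conversion; assumes a
 * 1:1 rate so only the bases differ. Not VLC's actual implementation. */
#include <stdint.h>

typedef struct {
    int64_t stream_base_us;  /* stream timestamp of the first block seen   */
    int64_t system_base_us;  /* system time when that first block was seen */
    int64_t pts_delay_us;    /* the "--*-caching" value, e.g. 20 ms        */
    int     initialized;
} clock_map_t;

/* Convert a stream timestamp (PTS or DTS) into a system-time deadline. */
int64_t to_system_time(clock_map_t *c, int64_t stream_ts_us,
                       int64_t system_now_us)
{
    if (!c->initialized) {
        c->stream_base_us = stream_ts_us;
        c->system_base_us = system_now_us;  /* ideally the acquisition time */
        c->initialized = 1;
    }
    /* The PTS delay is added to both PTS and DTS, so that the deadlines land
     * comfortably in the future instead of being already missed. */
    return stream_ts_us - c->stream_base_us + c->system_base_us
           + c->pts_delay_us;
}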
Finally, all blocks in VLC have a concept of length, in milliseconds. This
sounds like an obvious choice until you realize that not all lengths are
a whole number of milliseconds; for instance, the LPCM blocks are 80 samples
long, which is 5/3 ms (about 1.667 ms). Thus, you need to get the rounding right
if you want all the lengths to add up — there are functions to help with this
if your timestamps can be expressed as rational numbers. And of course, since consecutive
blocks might get converted to system time using different parameters,
pts2 - pts1 might very well be different from length. (Otherwise, you
could never ever adjust a stream base.) And to make things
even more confusing, the length parameter is described as optional for some
types of blocks, but only in some parts of the VLC code. You can imagine
latency problems being pretty difficult to debug in an environment like this,
with several different time bases in use from different threads at the same
time.
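As an illustration of the rounding issue (and the way around it), here is a
small standalone example, using microseconds and plain integer math rather
than VLC's actual helper functions:

/* Keep the running position as a sample count and derive timestamps from it,
 * instead of accumulating a rounded per-block length. Sketch only. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int64_t rate = 48000, block_samples = 80;  /* 80 samples = 5/3 ms */
    int64_t samples = 0;

    for (int i = 0; i < 5; i++) {
        int64_t pts_us  = samples * 1000000 / rate;
        int64_t next_us = (samples + block_samples) * 1000000 / rate;
        printf("block %d: pts=%lld us, length=%lld us\n",
               i, (long long)pts_us, (long long)(next_us - pts_us));
        samples += block_samples;
    }
    /* Individual lengths come out as 1666 or 1667 microseconds, but the
     * running total never drifts, because the rounding error is derived
     * fresh each time instead of being accumulated. */
    return 0;
}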
But again, our problem at hand is simpler than this, and with some luck,
we can short-circuit away much of the complexity we don't need. To begin
with, SDI has locked audio and video; at 50 fps, you always get exactly
20 ms of audio (960 samples at 48000 Hz) with every video frame. So we
don't have to worry about audio and video going out of sync, as long as VLC
doesn't do too different things to the two.
This is not necessarily a correct assumption — for instance, remember that VLC
can sample the system timer for the audio and video at different times and
in different threads, so even though they originally have the same timestamp
from the card, VLC can think they have different time bases, and adjust the
time for the audio and video blocks differently. They start as locked, but VLC
does not process them as such, and once they drift even a little out of sync,
things get a lot harder.
Thus, I eventually found out that the easiest thing for me was to take the
kernel timer out of the loop. The Blackmagic cards give you access to the
SDI system timer, which is locked to the audio and video timers. You only
get the audio and video timestamps when you receive a new frame, but you can
query the SDI system timer at any time, just like the kernel timer. You
can also ask it how far you are into the current frame, so if you just
subtract that, you will get a reliable timestamp for the acquisition of
the previous frame, assuming you haven't been so busy you skipped an entire
frame, in which case you lose anyway.
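The timestamping trick itself is just a subtraction; here is a minimal sketch,
with hypothetical sdi_*() wrappers standing in for whatever the SDK calls are
actually named:

/* Sketch of the timestamping trick described above. The two sdi_*() functions
 * are stand-ins for the SDK calls that let you query the SDI system clock and
 * how far you are into the current frame; the dummy values below simulate
 * being 13 ms into a 40 ms frame. */
#include <stdint.h>
#include <stdio.h>

static int64_t sdi_clock_now(void)     { return 123456000 + 13000; }
static int64_t sdi_time_in_frame(void) { return 13000; }

int main(void)
{
    /* The frame we are holding was acquired when the previous frame boundary
     * passed, i.e. time_in_frame microseconds ago. This assumes we haven't
     * fallen a whole frame behind, in which case we have bigger problems
     * than timestamping anyway. */
    int64_t acquisition_time = sdi_clock_now() - sdi_time_in_frame();
    printf("frame acquired at %lld us (SDI time base)\n",
           (long long)acquisition_time);
    return 0;
}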
The SDI timer's resolution is only 4 ms, it seems, but that's enough for us — also, even
though its rate is 1:1 to the other timers, its base is not the same, so
there's a fixed offset. However, VLC can already deal with situations like
this, as we've seen earlier; as long as the base never changes, it will be
right from the first block, and there will never be any drift. I wrote a patch
to a) propagate the actual frame acquisition time to the clock correction
code (so it's timestamped only once, and the audio and video streams will
get the same system timestamps), and b) make VLC's timer functions fetch the
SDI system timer from the card instead of from the kernel. I don't know if
the last part was actually necessary, but it certainly made debugging/logging
a lot easier for me. One True Time Base, hooray. (The patch is not sent
upstream yet; I don't know if it would be realistically accepted or not,
and it requires more cleanup anyhow.)
So, now our time bases are in sync. Wonder how much delay we have? The
easiest thing is of course to run a practical test, timestamping things
on input and looking what happens at the output. By “timestamping”, I
mean in the easiest possible way; just let the video stream capture a clock,
and compare the output with another (in sync) clock. Of course, this means
your two clocks need to be either the same or in sync — and for the first
test, my laptop suddenly plain refused to sync to NTP any better than
50 ms or so. Still, we did the test, with about 50 ms network latency,
with a PC running a simple clock program to generate the video stream:
(Note, for extra bonus, that I didn't think of taking a screenshot instead
of a photo. :-) )
You'll see that despite tuning codec delay, despite turning down the PTS
delay, and despite having SDI all the way in the input stream, we have a delay
of a whopping 1.2 seconds. Disheartening, no? Granted, it's at 25 fps (50i)
and not 50 fps, so all frames take twice as long, but even after a theoretical
halving (which is unrealistic), we're a far cry from the 80 ms we wanted.
However, with that little cliffhanger, our initial discussion of timing is
done. (Insert lame joke about “blog time base” or similar here.)
We'll look at tracing the source(s) of this unexpected amount of latency
tomorrow, when we look at what happens when the audio and video streams are
to go back into one, and what that means for the pipeline's latency.
Thu, 07 Oct 2010 - VLC latency, part 3: Codec latency
In previous parts, I wrote a bit about motivation and
overall plan for our attempt at a low-latency VLC setup.
Today we've come to our first specific source of latency, namely
codec latency.
We're going to discuss VLC streaming architecture in more detail later on, but
for now we can live with the (over)simplified idea that the data comes in from
some demuxer which separates audio and video, then audio and video are
decoded and encoded to their new formats in separate threads, and finally
a mux combines the newly encoded audio and video into a single bit stream
again, which is sent out to the client.
In our case, there are a few givens: The Blackmagic SDI driver actually takes
on the role of a demuxer (even though a demuxer normally works on some
bitstream on disk or from network), and we have to use the TS muxer (MPEG
Transport Stream, a very common choice) because that's the only thing that
works with UDP output, which we need because we are to use multicast.
Also, in our case, the “decoders” are pretty simple, given that the driver
outputs raw (PCM) audio and video.
So, there are really only two choices to be made, namely the audio and video
codec. These were also the first places where I started to attack latency,
given that they were the most visible pieces of the puzzle (although not
necessarily the ones with the most latency).
For video, x264 is a pretty obvious choice these days, at least
in the free software world, and in fact, what originally inspired the project
was this blog post on x264's newfound support for various low-latency
features. (You should probably go read it if you're interested; I'm not going
to repeat what's said there, given that the x264 people can explain their
own encoder a lot better than I can.)
Now, in hindsight I realized that most of these are not really all that
important to us, given that we can live with somewhat unstable bandwidth
use. Still, I wanted to try out at least Periodic Intra Refresh in practice,
and some of the other ones looked quite interesting as well.
VLC gives you quite a lot of control over the flags sent to x264; it used
to be really cumbersome to control given that VLC had its own set of defaults
that was wildly different from x264's own defaults, but these days it's
pretty simple: VLC simply leaves x264's defaults alone in almost all cases
unless you explicitly override them yourself, and apart from that lets you
specify one of x264's speed/quality presets (from “ultrafast” down to
“placebo”) plus tunings (we use the “zerolatency” and “film” tunings together,
as they don't conflict and both are relevant to us).
At this point we've already killed a few frames of latency — in particular,
we no longer use B-frames, which by definition require us to buffer at least
one frame, and the “zerolatency” tune enables slice-based threading,
which uses all eight CPUs to encode the same frame instead of encoding eight
frames at a time (one on each CPU, with some fancy system for sending the
required data back and forth between the threads as it's needed for
inter-frame compression). Reading about the latter suddenly made me understand
why we always got more problems with “video buffer late for mux” (aka:
the video encoder isn't delivering frames fast enough to the mux) when we
enabled more CPUs in the past :-)
However, we still had a lot more latency than expected, and some debug printfs
(never underestimate debug printfs!) indicated that VLC was sending five
full frames to x264 before anything came out in the other end. I dug through
VLC's x264 encoder module with some help from the people at #x264dev, and
lo and behold, there was a single parameter VLC didn't keep at default,
namely the “lookahead” parameter, which was set to... five. (Lookahead is
useful for knowing whether you should spend more or fewer bits on the current
frame, but in our case we cannot afford that luxury. In any case, the x264
people pointed out that five is a completely useless number to use; either you
have lookahead of several seconds or you just drop the concept entirely.)
--sout-x264-lookahead 0, and voila, that problem disappeared.
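For reference, here is roughly what those flags boil down to at the libx264
level. This is an illustrative sketch only, not VLC's actual encoder module
(error handling and most other parameters are omitted):

/* Roughly the x264 settings behind the command line in part 6. Sketch only. */
#include <stdint.h>
#include <x264.h>

x264_t *open_low_latency_encoder(int width, int height, int fps)
{
    x264_param_t p;

    /* "zerolatency" already disables B-frames, enables slice-based threading
     * and turns off the lookahead; "film" tunes for our kind of content. */
    if (x264_param_default_preset(&p, "medium", "zerolatency,film") < 0)
        return NULL;

    p.i_width   = width;
    p.i_height  = height;
    p.i_fps_num = fps;
    p.i_fps_den = 1;

    p.rc.i_lookahead    = 0;    /* the value VLC used to silently set to 5 */
    p.b_intra_refresh   = 1;    /* Periodic Intra Refresh instead of key frames */
    p.i_frame_reference = 1;    /* currently forced by intra refresh anyway */
    p.i_keyint_max      = fps;  /* one refresh cycle per second */

    p.rc.i_rc_method       = X264_RC_ABR;
    p.rc.i_bitrate         = 1500;  /* kbit/s, as in the test command line */
    p.rc.i_vbv_max_bitrate = 1500;
    p.rc.i_vbv_buffer_size = 60;    /* small VBV buffer, small buffering delay */

    return x264_encoder_open(&p);
}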
Periodic Intra Refresh (PIR), however, was another story. It's easily enabled
with --sout-x264-intra-refresh (which also forces a few other options
currently, such as --sout-x264-ref 1, ie. use reference pictures at most one
frame back; most of these are not conceptual limitations, though, just an
effect of the current x264 implementation), but it causes
problems for the client. Normally, when the VLC client “tunes in” to a running
problems for the client. Normally, when the VLC client “tunes in” to a running
stream, it waits until the first key frame before it starts showing anything.
With PIR, you can run for ages with no key frames at all (if there's no clear
scene cut); that's sort of the point of it all. Thus, unless the client
happened to actually see the start of the stream, it could be stuck in a state
where it would be unable to show anything at all. (It should be said that
there was also a server-side shortcoming in VLC here at one point, where it didn't
always mark the right frames as keyframes, but that's also fixed in the 1.1
series.)
So, we have to patch the client. It turns out that the Right Thing(TM) to do
is to parse something called SEI recovery points, which is a small piece of
metadata the encoder inserts whenever it's beginning a new round of its
intra refresh. Essentially this says something like “if you start decoding
here now, in NN frames you will have a correct [or almost correct, if a
given bit is set] picture no matter what you have in your buffer at this
point”. I made a patch which was reviewed and is now in VLC upstream;
there have been some concerns about correctness, though (although none that
cover our specific use-case), so it might unfortunately be reverted at some
point. We'll see how it goes.
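For the curious, the parsing itself is not a lot of code. Here is a simplified,
standalone sketch of the idea, not the actual VLC patch; among other shortcuts,
it assumes the emulation prevention bytes have already been stripped from the
NAL, which a real parser must handle:

/* Walk a raw SEI NAL unit and pull recovery_frame_cnt out of any recovery
 * point message (SEI payload type 6). Simplified sketch. */
#include <stdint.h>
#include <stddef.h>

typedef struct { const uint8_t *p; size_t size, byte; int bit; } bitreader_t;

static unsigned read_bit(bitreader_t *br)
{
    unsigned b = (br->p[br->byte] >> (7 - br->bit)) & 1;
    if (++br->bit == 8) { br->bit = 0; br->byte++; }
    return b;
}

static unsigned read_ue(bitreader_t *br)   /* Exp-Golomb, as used in H.264 */
{
    int zeros = 0;
    while (read_bit(br) == 0)
        zeros++;
    unsigned value = 1;
    for (int i = 0; i < zeros; i++)
        value = (value << 1) | read_bit(br);
    return value - 1;
}

/* Returns recovery_frame_cnt, or -1 if the NAL contains no recovery point.
 * sei points just past the one-byte header of an SEI NAL (type 6). */
int parse_recovery_point(const uint8_t *sei, size_t size)
{
    size_t pos = 0;
    while (pos < size && sei[pos] != 0x80) {       /* 0x80: RBSP trailing bits */
        unsigned type = 0, length = 0;
        while (sei[pos] == 0xFF) { type += 255; pos++; }
        type += sei[pos++];
        while (sei[pos] == 0xFF) { length += 255; pos++; }
        length += sei[pos++];

        if (type == 6) {                           /* recovery point SEI */
            bitreader_t br = { sei + pos, length, 0, 0 };
            return (int)read_ue(&br);              /* recovery_frame_cnt */
        }
        pos += length;                             /* skip other payloads */
    }
    return -1;
}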
Anyhow, now we're down to theoretical sub-frame (<20ms) latency in the
video encoder, so let's talk about audio. It might not be obvious to most
people, but the typical audio codecs we use today (MP3, Vorbis, AAC, etc.)
have quite a bit of latency inherent to the design. For instance, MP3 works
in 576-sample blocks at some point; that's 12ms at 48 kHz, and the real
situation is much worse, since that's within a subband, which has already
been filtered and downsampled. You'll probably find that MP3 latency in
practice is about 150–200 ms or so (IIRC), and AAC is something similar;
in any case, at this point audio and video were noticeably out of sync.
The x264 post mentions CELT as a possible high-quality, low-latency
audio codec. I looked a bit at it, but
- VLC doesn't currently support it,
- It's not bitstream stable (which means that people will be very reluctant
to distribute anything linked against it, as you can break client/server
compatibility at any time), and
- It does not currently have a TS mapping (a specification for how to embed
it into a TS mux; every codec needs such a mapping), and I didn't really
feel like going through the procedure of defining one, getting it
standardized and then implementing it in VLC.
I looked through the list of what usable codecs were supported by the
TS demuxer in the client, though, and one caught my eye: LPCM. (The “L”
simply stands for “linear” — it just means regular old PCM for all practical
purposes.) It turns out both DVDs and Blu-rays have support for PCM,
including surround and all, and they have their own ways of chopping the PCM
audio into small blocks that fit neatly into a TS mux. It eats bandwidth,
of course (48 kHz 16-bit stereo is about 1.5 Mbit/sec), but we don't really
need to care too much; one of the privileges of controlling all parts of
the chain is that you know where you can cut the corners and where you cannot.
The decoder was already in place, so all I had to do was to write an encoder.
The DVD LPCM format is dead simple; the decoder was a bit underdocumented,
but it was easy to find more complete specs online and update VLC's comments.
The resulting patch was again sent in to VLC upstream, and is
currently pending review. (Actually I think it's just forgotten, so I should
nag someone into taking it in. It seems to be well received so far.)
With LPCM in use, the audio and video dropped neatly back into sync, and at
this point, we should have effectively zero codec latency except the time spent
on the encoding itself (which should surely be below one frame, given that the
system works in realtime). That means we can start hacking at the rest of
the system; essentially here the hard, tedious part starts, given that we're
venturing into the unknowns of VLC internals.
This also means we're done with part three; tomorrow we'll be talking about
timing and timestamps. It's perhaps a surprising topic, but it is central to
understanding VLC's architecture (or that of any video player in general), the
difficulties of finding and debugging latency issues, and where we can find
hidden sources of latency.
Wed, 06 Oct 2010 - VLC latency, part 2
Yesterday,
I introduced my (ongoing) project to set up a low-latency video stream with
VLC. Today it's time to talk a bit about the overall architecture, and the
signal acquisition.
First of all, some notes on why we're using VLC. As far as I can remember,
we've been using it at TG, either alone or together with other software,
since 2003. It's been serving us well; generally better than the other
software we've used (Windows Media Encoder, Wowza), although not without
problems of its own. (I remember we discovered pretty late one year that
while VLC could encode, play and serve H.264 video correctly at the
time, it couldn't actually reflect it reliably. That caused interesting
headaches when we needed a separate distribution point outside the hall.)
One could argue that writing something ourselves would give more control
(including over latency), but there are a few reasons why I'm reluctant:
- Anything related to audio or video tends to have lots of subtle little
issues that are easy to get wrong, even with a DSP education. Not to mention
that even though the overarching architecture looks simple, there's a heck
of a lot of small details (when did you last write a TS muxer?).
- VLC is excellent as generalist software; I'm wary of removing all the
tiny little options, even though we use few of them. If I suddenly need
to scale the picture down by 80%, or use a different input, I'm hosed
if I've written my own, since that code just isn't there when I need it.
We're going to need VLC for the non-low-latency streams anyhow, so
if we can do with one tool, it's a plus.
- Contributing upstream is, generally, better than making yet another
project.
So, we're going to try to run it with VLC first, and then have “write our own”
as not plan B or C, but probably plan F somewhere. With that out of the way,
let's talk about the general latency budget (remember that our overall goal
is 80ms, or four frames at 50fps):
- Inputting the frame: 20ms. While SDI is a cut-through
system, there's no way we can get VLC to operate on less than a whole
frame at a time, and the SDI card doesn't support it anyway as far as I
know. (SDI sends the frame at the signal rate, so it really takes one frame
to input one frame.)
- H.264 encoding: 10ms. We assume we can encode the frame
at twice the realtime speed in acceptable quality. This probably requires
a relatively fast quad- or octocore.
- Network latency: 5ms. Since we're only distributing within
the local network, bandwidth is essentially free, but there's always
going to be some delay in various networking stacks, routers, etc.
anyway, and of course, outputting a megabit of data at gigabit speeds will
take you a millisecond. (A megabit per frame would of course mean 50Mbit/sec
at 50 fps, and we're not going that high, but still.)
- H.264 decoding: 10ms. We assume decoding can happen in
twice the realtime speed; of course, decoding is usually a lot faster
than encoding (even though VLC cannot, as far as I know, use H.264 hardware
acceleration at this point); this should give us some leeway for slower
computers.
- Client buffering: 20ms. We don't really have all that much
control over the client, and this is sort of a grey area, so it's better
to be prepared for this.
- Screen display: 20ms. Again, there might be various forms
of buffering in place here, not to mention that you'll need to wait for
vertical sync.
As you can see, a lot of these numbers are based on guesswork. Also, they
sum up to 85ms — slightly more than our 80ms goal, but okay, at least it's
in the right ballpark. We'll see how things pan out later.
This also explains why we need to start so early — not only are there lots
of unknowns, but some of the latency is pretty much bound to be lurking in
the client somewhere. We have full control over the server, but we're not
going to be distributing our own VLC version to the 5000+ potential viewers
(maintaining custom VLC builds for
Windows, OS X and various Linux distributions is not very high on my list),
so it means we need to get things to upstream, and then wait for the code
to find its way to a release and then down onto people's computers in various
forms. (To a degree, we can ask people to upgrade, but it's better if the
right version is just already on their system.)
So, time to get the signal into the system. Earlier we've been converting
the SDI signal to DV, plugged it into a laptop over Firewire, and then sent
it over the network to VLC using parts of the DVSwitch suite.
(The video mixing rig and the server park are physically quite far apart.)
This has worked fine for our purposes, and we'll continue to use DV for
the other streams (and also as a backup for this one), but it's an
unnecessary part that probably adds a frame or three of latency on its
own, so it has to go.
Instead, we're going to send SDI directly into the encoder machine, and
multicast straight out from there. As an extra bonus, we'll have access to
the SDI clock that way; more on that in a future post.
Anyhow, you probably guessed it: VLC has no support for the Blackmagic
SDI cards we plan to use, or really any SDI cards. Thus, the first thing to
do was to write an SDI driver for VLC. Blackmagic has recently released a Linux
SDK, and it's pretty easy to use (kudos for thorough documentation, BM),
so it's really only a matter of writing the right glue; no reverse engineering
or kernel hacking needed.
I don't have any of these cards myself (or really, a desktop machine that
could hold them; I seem to be mostly laptop-only at home these days), but I've
been given access to a beefy machine with an SDI card and some test inputs
by Frikanalen, a non-commercial Norwegian TV station. There's
a lot of weird things being sent, but as long as it has audio and is moving,
it works fine for me.
So, I wrote a simple (input-only) driver for these cards, tested it a bit,
and sent it upstream. It actually generated quite a bit of an
email thread, but it's been mostly very constructive, so I'm
confident it will go in. Support for these cards seems to be a much
sought-after feature (the only earlier VLC support for them was on Windows, via
DirectShow), so a lot of people want their own “pony feature” in, but OK.
In any case, should there be some fatal issue and it would not go in, it's
actually not so bad; we can maintain a delta at the encoder side if we really
need to, and it's a pretty much separate module.
Anyhow, this post became too long too, so it's time to wrap it up. Tomorrow,
I'm going to talk about perhaps the most obvious sort of latency (and the
one that inspired the project to begin with — you'll understand what I mean
tomorrow), namely codec latency.
Tue, 05 Oct 2010 - VLC latency, part 1: Introduction/motivation
The Gathering 2011 is as of this time still unannounced (there's not even
an organizer group yet), but it would be an earthquake if it were not held
in Hamar, Easter 2011. Thus, some of us have already started planning our
stuff. (Exactly why we need to start so early with this specific piece will
become clear pretty soon.)
TG has, since the early 2000s, had a video stream of stuff happening on the
stage, usually also including the demo competitions. There are obviously three
main targets for this:
- People in the hall who are just too lazy to get to the stage (or have
problems seeing in some other way, perhaps because it's very crowded).
This includes those who can see the stage, but perhaps not hear it
too well.
- People external to the party, who want to watch the compos.
- Post-party use, from normal video files.
My discussion here will primarily be centered around #1, and for that, there's
one obvious metric you want to minimize: Latency. Ideally you want your stream
to appear in sync with the audio and video from the main stage; while that's
of course impossible for video, you can actually beat the audio (sound is slow
across a big hall) if you really want, and a video delay of 100ms is pretty
much imperceptible anyway. (I don't
have numbers ready for how much is “too much”, though. Somebody can probably
fill in numbers from research if they exist.)
So, we've set a pretty ambitious goal: Run an internal 720p50 stream (TG
is in Europe, where we use PAL, y'know) with less than four frames (80ms)
latency from end to end (ie., from the source to what's shown on a local end user's
display). Actually, that's measured compared to the big screens, which will
run the same stream, so compared to the stage there will be 2–3 extra frames;
TG has run their A/V production on SDI
the last few years, which is essentially based on cut-through switching,
but there are still some places where you need to do store-and-forward
to process an entire frame at a time, so the real latency will be something
like 150ms. If the 80ms goal holds, of course...
I've decided to split this post into several parts, once per day, simply
because there will be so much text otherwise. (This one is already more than
long enough, but the next ones will hopefully be shorter.) I have to warn
that a lot of the work is pretty much in-progress, though — I pondered waiting
until it was all done and perfect, which usually makes the overall narrative
a lot clearer, but then again, it's perhaps more in the blogging spirit to
provide as-you-go technical reports. So, it might be that we won't reach our
goal, it might be that the TG stream will be a total disaster, but hopefully
the journey, even the small part we've completed so far, will be interesting.
Tomorrow, in part 2, we're going to look at the intended base architecture,
the latency budget, and signal acquisition (the first point in the chain).
Stay tuned :-)