Sun, 10 Oct 2010 - VLC latency, part 6: Notes and corrections
My five-part series on our low-latency VLC setup ended yesterday, with
a discussion about muxing and various one-frame-latency-inducing effects in
different parts of VLC. Before that, I'd been talking about
motivations, the overall plan, codec latency and
timing.
This time there is no new material, only a few notes and corrections from
readers of the series over the last few days. Thanks; you know who you are.
- The UDP output is not the only VLC output that can multicast; in particular,
the RTP mux/output can do it, too, in which case the audio and video streams
are just sent out more or less as separate streams. This would avoid any
problems the TS mux might have, and probably replace them with a whole new
and much more interesting set. (I've never ever had any significant success
with RTP as a user, so I don't think I'll dare to unleash it from the server
side. In any case we need TS for the HTTP streams, and sticking with fewer
sets of options is less confusing if we can get away with it.)
- VLC 1.1.x can reportedly use H.264 decoding hardware if your
machine, OS and drivers all support that. I have no idea how it impacts latency
(it could go both ways, I reckon), but presumably it should at least make
decoding on the client side cheaper in terms of CPU.
- If you're running slice-based threads, you might not want to use too many
threads, or bitrate allocation might be very inefficient.
- The DVD LPCM encoder is now in VLC's git repository. Yay :-)
- The collective bag of hacks I use that have not been sent upstream can
be found here.
Caveat emptor.
Finally, I realized I hadn't actually posted our current command-line anywhere,
so here goes:
vlc \
--decklink-caching 20 --sout-mux-caching 1 --sout-ts-dts-delay 60 --sout-ts-shaping 1 --sout-udp-caching 1 --no-sout-transcode-hurry-up \
--decklink-aspect-ratio 16:9 --decklink-mode pal \
--sout-x264-preset medium --sout-x264-tune zerolatency,film \
--sout-x264-intra-refresh --no-sout-x264-interlaced --sout-x264-weightp 0 --sout-x264-bpyramid strict --sout-x264-ref 1 \
--sout-x264-keyint 25 --sout-x264-lookahead 0 \
--sout-x264-vbv-bufsize 60 --sout-x264-vbv-maxrate 1500 \
-I dummy -v decklink:// vlc://quit \
--sout '#transcode{vcodec=h264,vb=1500,acodec=lpcm}:std{access=udp,mux=ts,dst=10.0.0.1:9094}'
You'll probably want to up the bitrate for a real stream. A lot. And of course, coding PAL
as non-interlaced doesn't make much sense either, unless it happens to be a progressive
signal sent as interlaced.
I'll keep you posted if there's any new movement (new latency tests, more patches
going into VLC, or similar), but for now, that's it. Enjoy :-)
Sat, 09 Oct 2010 - VLC latency, part 5: Muxing
In previous parts, I've been talking about the motivations for our
low-latency VLC setup, the overall plan, codec latency
and finally timing. At this point we're going down into more
specific parts of VLC, in an effort to chop away the latency.
So, when we left the story the last time, we had a measured 1.2 seconds of
latency. Obviously there's a huge source of latency we've missed somewhere,
but where?
At this point I did what I guess most others would have done: invoked VLC with
--help and looked for interesting flags. There's a first obvious candidate,
namely --sout-udp-caching, which is yet another buffering flag, but this time
on the output side. (It seems to delay the DTS, ie. the send time in this
case, by that many milliseconds, so it's a sender-side equivalent of the PTS
delay.) Its default,
just like the other “-caching” options, is 300 ms. Set it down to 5 ms (later
1 ms), and whoosh, there goes some delay. (There's also a flag called
“DTS delay” which seems to adjust the PCR relative to the DTS, to give the
client some more chance at buffering. I have no idea why the client would
need the encoder to specify this.)
But there's still lots of delay left, and with some help from the people on
#x264dev (it seems like many of the VLC developers hang there, and it's a lot
less noisy than #videolan :-) ) I found the elusively-named flag
--sout-ts-shaping, which belongs to the TS muxer module.
To understand what this parameter (the “length of the shaping interval, in
milliseconds”) is about, we'll need to take a short look at what a muxer does.
Obviously it does the opposite of a demuxer — it takes in audio and video
and combines them into a finished bitstream. It's imperative at this point that
they are in the same time base, of course; if you include video from
00:00:00 and audio from 00:00:15 next to each other, you can be pretty
sure there's a player out there that will have problems playing your audio
and video in sync. (Not to mention you get fifteen seconds delay, of course.)
VLC's TS muxer (and muxing system in general) does this by letting the audio
and video threads post to separate FIFOs, which the muxer can read from.
(There's a locking issue in here in that the audio and video encoding seem
to take the same lock before posting to these FIFOs, so they cannot go in
parallel, but in our case the audio decoding is nearly free anyway, so it
doesn't matter. You can add separate transcoder threads if you want
to, but in that case, the video goes via a ring buffer that is only polled
when the next frame comes, so you add about half a frame of extra delay.)
The muxer then is alerted whenever there's new stuff added to any of the FIFOs,
and sees if it can output a packet.
Now, I've been told that VLC's TS muxer is a bit suboptimal in many aspects,
and that there's a new one that has been living out-of-tree for
a while, but this is roughly how the current one works:
- Pick one stream as the PCR stream (PCR is MPEG-speak for “Program Clock
Reference”, the global system clock for that stream), and read blocks
of total length equivalent to the shaping period. VLC's TS muxer tries to
pick the video stream as the PCR stream if one exists. (Actually, VLC waits
a bit at the beginning to give all streams a chance to start sending data,
which is how the streams get identified in the first place. That's what the
--sout-mux-caching flag is for. For us, it doesn't matter too much, though,
since what we care about is the steady state.)
- Read data from the other streams until they all have at least caught up with
the PCR stream.
From #1 it's pretty obvious that the default shaping interval of 200 ms is
going to delay our stream by several frames. Setting it down to 1, the lowest
possible value, again chops off some delay.
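To make that a bit more concrete, here is a toy model of the shaping logic.
It is just a sketch to illustrate the idea; all the types and helper names
are made up, and this is not VLC's actual muxer code:

/* Toy model of the shaping logic described above -- not the real VLC muxer.
 * All names (block_t, fifo_t, ...) are invented for this sketch. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t dts_us;     /* decode timestamp, in microseconds */
    int64_t length_us;  /* block length, in microseconds */
} block_t;

typedef struct {
    block_t blocks[64];
    int     head, count;
} fifo_t;

static block_t *fifo_peek(fifo_t *f) { return f->count ? &f->blocks[f->head] : NULL; }
static block_t *fifo_pop(fifo_t *f)  { block_t *b = fifo_peek(f); if (b) { f->head++; f->count--; } return b; }

/* One muxing round: take blocks from the PCR (video) stream until we have
 * covered one shaping interval, then drain the other (audio) stream until
 * it has caught up with the PCR stream. Returns 0 if we ran dry. */
static int mux_one_round(fifo_t *pcr, fifo_t *other, int64_t shaping_us)
{
    int64_t covered = 0, pcr_end_dts = 0;

    while (covered < shaping_us) {
        block_t *b = fifo_pop(pcr);
        if (!b)
            return 0;                      /* not enough video yet; wait */
        covered    += b->length_us;
        pcr_end_dts = b->dts_us + b->length_us;
        printf("mux video block dts=%lld\n", (long long)b->dts_us);
    }
    while (fifo_peek(other) && fifo_peek(other)->dts_us < pcr_end_dts) {
        block_t *b = fifo_pop(other);
        printf("mux audio block dts=%lld\n", (long long)b->dts_us);
    }
    return 1;
}

int main(void)
{
    fifo_t video = { .head = 0, .count = 0 }, audio = { .head = 0, .count = 0 };

    /* 25 fps video (40 ms frames) and 5/3 ms audio blocks, as in the text. */
    for (int i = 0; i < 5; i++)
        video.blocks[video.count++] = (block_t){ .dts_us = i * 40000, .length_us = 40000 };
    for (int i = 0; i < 60; i++)
        audio.blocks[audio.count++] = (block_t){ .dts_us = i * 5000 / 3, .length_us = 5000 / 3 };

    /* With the default 200 ms shaping interval we would need five whole
     * video frames queued before anything at all left the muxer; with 1 ms,
     * one frame is enough. */
    mux_one_round(&video, &audio, 1000);
    return 0;
}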
At this point, I started adding printfs to see how the muxer worked, and it
seemed to be behaving relatively well; it picked video blocks (one frame,
40 ms), and then a bunch of audio blocks (the audio is chopped into 80-sample
blocks by the LPCM encoder, in addition to a 1024-sample chop at some
earlier point whose rationale I don't know). However, sometimes it would
be an audio block short, and refuse to mux until it got more data (read:
the next frame). More debugging ensued.
At this point, I take some narrative pain for having presented the story
a bit out-of-order; it was actually only at this point that I added the SDI
timer as the master clock. The audio and video having different time bases
would cause problems where the audio would end up, say, 2ms further ahead than
the video. Sorry, you don't have enough video to mux, wait for more. Do not collect
$500. (Obviously, locking the audio and video timestamps fixed this specific
issue.) Similarly, I found and fixed a few rounding issues in the length
calculations in the LPCM encoder that I've already talked briefly about.
But there's more subtlety, and we've touched on it before. How do you find
the length of a block? The TS muxer doesn't trust the length parameter from
previous rounds, and perhaps with good reason; the time base correction could
have moved the DTS and PTS around, which certainly should also skew the
length of the previous block. Think about it; if you have a video frame at
PTS=0.000 and then one at PTS=0.020, and the second frame gets moved to
PTS=0.015, the first one should obviously have length 0.015, not 0.020.
However, it may already have been sent out, so you have a problem. (You could
of course argue that you don't have a problem, since your muxing algorithm
shouldn't necessarily care about the lengths at all, and the new TS muxer
reportedly does not. However, this is the status quo, so we have to care about
the length for now. :-) )
The TS muxer solves this in a way that works fine if you don't care about
latency; it never processes a block until it also has the next block.
By doing this, it can simply say that this_block.length = next_block.dts -
this_block.dts, and simply ignore the incoming length parameter of the block.
This makes for chaos for our purposes, of course -- it means we will always
have at least one video frame of delay in the mux, and if for some reason
the video should be ahead of the audio (we'll see quite soon that it usually
was!), the muxer will refuse to mux the packet on time because it doesn't
trust the length of the last audio block.
I don't have a good upstream fix for this, except that again, this is supposedly
fixed in the new muxer. In my particular case, I did a local hack and simply
made the muxer trust the incoming length -- I know it's good anyway. (Also,
of course, I could then remove the demand that there be at least two blocks
left in the FIFO to fetch out the first one.)
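In code, the difference between the two strategies is tiny; something like
this, reusing the toy block_t from the sketch above (again, not the real
muxer code):

/* Status quo: don't trust the stored length; wait until the *next* block has
 * arrived and derive the length from the DTS difference. This is what forces
 * at least one extra block (one whole video frame) of latency in the mux. */
int64_t length_from_next_block(const block_t *cur, const block_t *next)
{
    return next->dts_us - cur->dts_us;
}

/* Local hack: trust the length the encoder set, so the block can be muxed
 * the moment it arrives, without waiting for its successor. */
int64_t length_from_block_itself(const block_t *cur)
{
    return cur->length_us;
}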
But even after this hack, there was a problem that I'd been seeing throughout
the entire testing, but never really understood; the audio was consistently
coming much later than the video. This doesn't make sense, of course, given
that the video goes through x264, which takes a lot of CPU time, and the audio
is just chopped into blocks and given DVD LPCM headers. My guess was some
serialization somewhere in the pipeline (and I did indeed find one, the
serialized access to the FIFOs mentioned earlier, but it was not the right source),
and I started searching. Again lots of debug printfs, but this time, at least
I had pretty consistent time bases throughout the entire pipeline, and only
two to care about. (That, and I drew a diagram of all the different parts of
the code; again, it turns out there's a lot of complexity that's basically
short-circuited since we work with raw audio/video in the input.)
The source of this phenomenon was finally found in a place I didn't even
know existed: the “copy packetizer”. It's not actually in my drawing, but
it appears it sits between the raw audio decoder (indeed…) and the LPCM
encoder. It seems mostly to be a dummy module because VLC needs there to
actually be a packetizer, but it does do some sanity checking and adjusts
the PTS, DTS and... length. Guess how it finds the new length. :-)
For those not following along: It waits for the next packet to arrive, and
sets the new length equivalent to next_dts - this_dts, just like the TS
muxer. Obviously this means one block latency, which means that the TS
muxer will always be one audio block short when trying to mux the video
frame that just came in. (The video goes through no similar treatment along
its path to the mux.) This, in turn, translates to a minimum latency of
one full video frame in the mux.
So, again a local hack: I have no idea what downstream modules may rely on
the length being correct, but in my case, I know it's correct, so I can
just remove this logic and send the packet directly on.
And now, for perhaps some disappointing news: This is the point where the
posting catches up with what I've actually done. How much latency is there?
I don't know, but if I set x264 to “ultrafast” and add some printfs here
and there, it seems like I can start sending out UDP packets about 4 ms
after receiving the frame from the driver. What does this translate to in
end-to-end latency? I don't know, but the best test we had before the
fixes mentioned (the SDI master clock, the TS mux one-block delay, and the copy
packetizer one-block delay) was about 250 ms:
That's a machine that plays its own stream from the local network, so
intrinsically about 50 ms better than our previous test, but the difference
between 1.2 seconds and 250 ms is obviously quite a lot nevertheless.
My guess is that with these fixes, we'll touch about 200 ms, and then when
we go to true 50p (so we actually get 50 frames per second, as opposed to 25
frames which each represent two fields), we'll about halve that. Encoding
720p50 in some reasonable quality is going to take some serious oomph, but
that's really out of our hands — I trust the x264 guys to keep doing
their magic much better than I can.
So, I guess that rounds off the series; all that's left for me to write at
the current stage is a list of the corrections I've received, which I'll do
tomorrow. (I'm sure there will be new ones to this part :-) )
What's left for the future? Well, obviously I want to do a new end-to-end
test, which I'll do as soon as I have the opportunity. Then I'm quite sure
we'll want to run an actual test at 720p50, and for that I think I'll need
to actually get hold of one of these cards myself (including a desktop machine
fast enough to drive it). Hello Blackmagic, if you by any chance are throwing
free cards at people, I'd love one that does HDMI in to test with :-P
Of course, I'm sure this will uncover new issues; in particular, we haven't
looked much at the client yet, and there might be lurking unexpected delays
there as well.
And then, of course, we'll see how it works in practice at TG. My guess is
that we'll hit lots of weird issues with clients doing stupid things — with
5000 people in the hall, you're bound to have someone with buggy audio or video
drivers (supposedly audio is usually the worst offender here), machines with
timers that jump around like pinballs, old versions of VLC despite big warnings
that you need at least X.Y.Z, etc… Only time will tell, and I'm pretty glad
we'll have a less fancy fallback stream. :-)
Fri, 08 Oct 2010 - VLC latency part 4: Timing
In the previous posts on our attempt at a low-latency VLC setup for The
Gathering 2011, I've written about motivation, the
overall plan and, yesterday, codec latency. Today we'll
look at the topic of timestamps and timing, a part most people probably think
relatively little about, but which is still central to any sort of audio/video
work. (It is also probably the last general exposition in the series; we're
running out of relevant wide-ranging topics to talk about, at least of the
things I pretend to know anything about.)
Timing is surprisingly difficult to get right; in fact, I'd be willing to bet
that more hair has been ripped out over the supposedly mundane issues of
demuxing and timestamping than most other issues in creating a working
media player. (Of course, never having made one, that's just a guess.)
The main source of complexity can be expressed through this quote
(usually attributed to one “Lee Segall”, whoever that is):
“A man with a watch knows what time it is. A man with two watches is never sure.”
In a multimedia pipeline, we have not only two but several different clocks
to deal with: The audio and video streams both have clocks, the kernel has its
own clock (usually again based on several different clocks, but you don't
need to care much about that), the client in the other end has a kernel clock,
and the video and audio cards for playback both have clocks. Unless they are
somehow derived from exactly the same clock, all of these can have different
bases, move at different rates, and drift out of sync from each other.
That's of course assuming they are all stable and don't do weird things like
suddenly jump backwards five hours and then back again (or not).
VLC, as a generalist application, generally uses the only reasonable general
approach, which is to try to convert all of them into a single master timer,
which comes from the kernel. (There are better specialist approaches in some
cases; for instance, if you're transcoding from one file to another, you
don't care about the kernel's idea of time, and perhaps you should choose one
of the input streams' timestamps as the master instead.)
This happens separately for audio and video, in our case right before it's sent
to the transcoder — VLC takes a system timestamp, compares it to the stream
timer, and then tries to figure out how the stream timer and the system timer
relate, so it can convert from one to the other. (The observant reader, who
unfortunately has never existed, will notice that it should have taken this
timestamp when it actually received the frame, not the point where it's about
to encode it. There's a TODO about this in the source code.) As the relations
might change over time, it tries to slowly adjust the bases and rates to
match reality. (Of course, this is rapidly getting into control theory,
but I don't think you need to go there to get something that works reasonably
well.) Similarly, if audio and video go too much out of sync, the VLC client
will actually start to stretch audio one way or the other to get the two
clocks back in sync without having to drop frames. (Or so I think. I don't
know the details very well.)
But wait, there's more. All data blocks have two timestamps, the presentation
timestamp (PTS) and the decode timestamp (DTS). (This is not a VLC invention
by any means, of course.) You can interpret both as deadlines; the PTS is
a deadline for when you are to display the block, and the DTS is a deadline
for when the block is to be decoded. (For streaming over the network, you
can interpret “display” and “decode” figuratively; the UDP output, for instance,
tries to send out the block before the DTS has arrived.) For a stream, generally
PTS=DTS except when you need to decode frames out-of-order (think B-frames).
Inside VLC after the time has been converted to the global base, there's a
concept of “PTS delay”, which despite the name is added to both PTS and DTS.
Without a PTS delay, the deadline would be equivalent to the stream acquisition
time, so all the packets would be sent out too late, and if you had the
--sout-transcode-hurry-up flag set (which is the default), the frames would
simply get dropped. Again confusingly, the PTS delay is set by the various
--*-caching options, so basically you want to set --decklink-caching as low
as you can without warnings about “packet sent out too late” showing up en masse.
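If you prefer code to prose, here is a very stripped-down model of that clock
conversion plus the PTS delay. The real code also slowly adjusts the rate over
time; all the names here are invented for illustration:

/* Very simplified model of the stream-to-system clock conversion; assumes a
 * 1:1 rate so only the bases differ. Not VLC's actual implementation. */
#include <stdint.h>

typedef struct {
    int64_t stream_base_us;  /* stream timestamp of the first block seen   */
    int64_t system_base_us;  /* system time when that first block was seen */
    int64_t pts_delay_us;    /* the "--*-caching" value, e.g. 20 ms        */
    int     initialized;
} clock_map_t;

/* Convert a stream timestamp (PTS or DTS) into a system-time deadline. */
int64_t to_system_time(clock_map_t *c, int64_t stream_ts_us,
                       int64_t system_now_us)
{
    if (!c->initialized) {
        c->stream_base_us = stream_ts_us;
        c->system_base_us = system_now_us;  /* ideally the acquisition time */
        c->initialized = 1;
    }
    /* The PTS delay is added to both PTS and DTS, so that the deadlines land
     * comfortably in the future instead of being already missed. */
    return stream_ts_us - c->stream_base_us + c->system_base_us
           + c->pts_delay_us;
}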
Finally, all blocks in VLC have a concept of length, in milliseconds. This
sounds like an obvious choice until you realize that not all lengths are
a whole number of milliseconds; for instance, the LPCM blocks are 80 samples
long, which is 5/3 ms (about 1.667 ms). Thus, you need to get the rounding right
if you want all the lengths to add up — there are functions to help with this
if your timestamps can be expressed as rational numbers. And of course, since consecutive
blocks might get converted to system time using different parameters,
pts2 - pts1 might very well be different from length. (Otherwise, you
could never ever adjust a stream base.) And to make things
even more confusing, the length parameter is described as optional for some
types of blocks, but only in some parts of the VLC code. You can imagine
latency problems being pretty difficult to debug in an environment like this,
with several different time bases in use from different threads at the same
time.
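As an illustration of the rounding issue (and the way around it), here is a
small standalone example, using microseconds and plain integer math rather
than VLC's actual helper functions:

/* Keep the running position as a sample count and derive timestamps from it,
 * instead of accumulating a rounded per-block length. Sketch only. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int64_t rate = 48000, block_samples = 80;  /* 80 samples = 5/3 ms */
    int64_t samples = 0;

    for (int i = 0; i < 5; i++) {
        int64_t pts_us  = samples * 1000000 / rate;
        int64_t next_us = (samples + block_samples) * 1000000 / rate;
        printf("block %d: pts=%lld us, length=%lld us\n",
               i, (long long)pts_us, (long long)(next_us - pts_us));
        samples += block_samples;
    }
    /* Individual lengths come out as 1666 or 1667 microseconds, but the
     * running total never drifts, because the rounding error is derived
     * fresh each time instead of being accumulated. */
    return 0;
}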
But again, our problem at hand is simpler than this, and with some luck,
we can short-circuit away much of the complexity we don't need. To begin
with, SDI has locked audio and video; at 50 fps, you always get exactly
20 ms of audio (960 samples at 48000 Hz) with every video frame. So we
don't have to worry about audio and video going out of sync, as long as VLC
doesn't do too different things to the two.
This is not necessarily a correct assumption — for instance, remember that VLC
can sample the system timer for the audio and video at different times and
in different threads, so even though they originally have the same timestamp
from the card, VLC can think they have different time bases, and adjust the
time for the audio and video blocks differently. They start as locked, but VLC
does not process them as such, and once they drift even a little out of sync,
things get a lot harder.
Thus, I eventually found out that the easiest thing for me was to take the
kernel timer out of the loop. The Blackmagic cards give you access to the
SDI system timer, which is locked to the audio and video timers. You only
get the audio and video timestamps when you receive a new frame, but you can
query the SDI system timer at any time, just like the kernel timer. You
can also ask it how far you are into the current frame, so if you just
subtract that, you will get a reliable timestamp for the acquisition of
the previous frame, assuming you haven't been so busy you skipped an entire
frame, in which case you lose anyway.
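The timestamping trick itself is just a subtraction; here is a minimal sketch,
with hypothetical sdi_*() wrappers standing in for whatever the SDK calls are
actually named:

/* Sketch of the timestamping trick described above. The two sdi_*() functions
 * are stand-ins for the SDK calls that let you query the SDI system clock and
 * how far you are into the current frame; the dummy values below simulate
 * being 13 ms into a 40 ms frame. */
#include <stdint.h>
#include <stdio.h>

static int64_t sdi_clock_now(void)     { return 123456000 + 13000; }
static int64_t sdi_time_in_frame(void) { return 13000; }

int main(void)
{
    /* The frame we are holding was acquired when the previous frame boundary
     * passed, i.e. time_in_frame microseconds ago. This assumes we haven't
     * fallen a whole frame behind, in which case we have bigger problems
     * than timestamping anyway. */
    int64_t acquisition_time = sdi_clock_now() - sdi_time_in_frame();
    printf("frame acquired at %lld us (SDI time base)\n",
           (long long)acquisition_time);
    return 0;
}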
The SDI timer's resolution is only 4 ms, it seems, but that's enough for us — also, even
though its rate is 1:1 to the other timers, its base is not the same, so
there's a fixed offset. However, VLC can already deal with situations like
this, as we've seen earlier; as long as the base never changes, it will be
right from the first block, and there will never be any drift. I wrote a patch
to a) propagate the actual frame acquisition time to the clock correction
code (so it's timestamped only once, and the audio and video streams will
get the same system timestamps), and b) make VLC's timer functions fetch the
SDI system timer from the card instead of from the kernel. I don't know if
the last part was actually necessary, but it certainly made debugging/logging
a lot easier for me. One True Time Base, hooray. (The patch is not sent
upstream yet; I don't know if it would be realistically accepted or not,
and it requires more cleanup anyhow.)
So, now our time bases are in sync. Wonder how much delay we have? The
easiest thing is of course to run a practical test, timestamping things
on input and looking what happens at the output. By “timestamping”, I
mean in the easiest possible way; just let the video stream capture a clock,
and compare the output with another (in sync) clock. Of course, this means
your two clocks need to be either the same or in sync — and for the first
test, my laptop suddenly plain refused to sync to NTP any better than
50 ms or so. Still, we did the test, with about 50 ms network latency,
with a PC running a simple clock program to generate the video stream:
(Note, for extra bonus, that I didn't think of taking a screenshot instead
of a photo. :-) )
You'll see that despite tuning codec delay, despite turning down the PTS
delay, and despite having SDI all the way in the input stream, we have a delay
of a whopping 1.2 seconds. Disheartening, no? Granted, it's at 25 fps (50i)
and not 50 fps, so all frames take twice as long, but even after a theoretical
halving (which is unrealistic), we're a far cry from the 80 ms we wanted.
However, with that little cliffhanger, our initial discussion of timing is
done. (Insert lame joke about “blog time base” or similar here.)
We'll look at tracing the source(s) of this unexpected amount of latency
tomorrow, when we look at what happens when the audio and video streams are
to go back into one, and what that means for the pipeline's latency.
Thu, 07 Oct 2010 - VLC latency, part 3: Codec latency
In previous parts, I wrote a bit about motivation and
overall plan for our attempt at a low-latency VLC setup.
Today we've come to our first specific source of latency, namely
codec latency.
We're going to discuss VLC streaming architecture in more detail later on, but
for now we can live with the (over)simplified idea that the data comes in from
some demuxer which separates audio and video, then audio and video are
decoded and encoded to their new formats in separate threads, and finally
a mux combines the newly encoded audio and video into a single bit stream
again, which is sent out to the client.
In our case, there are a few givens: The Blackmagic SDI driver actually takes
on the role of a demuxer (even though a demuxer normally works on some
bitstream on disk or from network), and we have to use the TS muxer (MPEG
Transport Stream, a very common choice) because that's the only thing that
works with UDP output, which we need because we are to use multicast.
Also, in our case, the “decoders” are pretty simple, given that the driver
outputs raw (PCM) audio and video.
So, there are really only two choices to be made, namely the audio and video
codec. These were also the first places where I started to attack latency,
given that they were the most visible pieces of the puzzle (although not
necessarily the ones with the most latency).
For video, x264 is a pretty obvious choice these days, at least
in the free software world, and in fact, what originally inspired the project
was this blog post on x264's newfound support for various low-latency
features. (You should probably go read it if you're interested; I'm not going
to repeat what's said there, given that the x264 people can explain their
own encoder a lot better than I can.)
Now, in hindsight I realized that most of these are not really all that
important to us, given that we can live with somewhat unstable bandwidth
use. Still, I wanted to try out at least Periodic Intra Refresh in practice,
and some of the other ones looked quite interesting as well.
VLC gives you quite a lot of control over the flags sent to x264; it used
to be really cumbersome to control given that VLC had its own set of defaults
that was wildly different from x264's own defaults, but these days it's
pretty simple: VLC simply leaves x264's defaults alone in almost all cases
unless you explicitly override them yourself, and apart from that lets you
specify one of x264's speed/quality presets (from “ultrafast” down to
“placebo”) plus tunings (we use the “zerolatency” and “film” tunings together,
as they don't conflict and both are relevant to us).
At this point we've already killed a few frames of latency — in particular,
we no longer use B-frames, which by definition require us to buffer at least
one frame, and the “zerolatency” tune enables slice-based threading,
which uses all eight CPUs to encode the same frame instead of encoding eight
frames at a time (one on each CPU, with some fancy system for sending the
required data back and forth between the threads as it's needed for
inter-frame compression). Reading about the latter suddenly made me understand
why we always got more problems with “video buffer late for mux” (aka:
the video encoder isn't delivering frames fast enough to the mux) when we
enabled more CPUs in the past :-)
However, we still had a lot more latency than expected, and some debug printfs
(never underestimate debug printfs!) indicated that VLC was sending five
full frames to x264 before anything came out in the other end. I dug through
VLC's x264 encoder module with some help from the people at #x264dev, and
lo and behold, there was a single parameter VLC didn't keep at default,
namely the “lookahead” parameter, which was set to... five. (Lookahead is
useful for knowing whether you should spend more or fewer bits on the current
frame, but in our case we cannot afford that luxury. In any case, the x264
people pointed out that five is a completely useless number to use; either you
have lookahead of several seconds or you just drop the concept entirely.)
--sout-x264-lookahead 0, and voila, that problem disappeared.
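For reference, here is roughly what those flags boil down to at the libx264
level. This is an illustrative sketch only, not VLC's actual encoder module
(error handling and most other parameters are omitted):

/* Roughly the x264 settings behind the command line in part 6. Sketch only. */
#include <stdint.h>
#include <x264.h>

x264_t *open_low_latency_encoder(int width, int height, int fps)
{
    x264_param_t p;

    /* "zerolatency" already disables B-frames, enables slice-based threading
     * and turns off the lookahead; "film" tunes for our kind of content. */
    if (x264_param_default_preset(&p, "medium", "zerolatency,film") < 0)
        return NULL;

    p.i_width   = width;
    p.i_height  = height;
    p.i_fps_num = fps;
    p.i_fps_den = 1;

    p.rc.i_lookahead    = 0;    /* the value VLC used to silently set to 5 */
    p.b_intra_refresh   = 1;    /* Periodic Intra Refresh instead of key frames */
    p.i_frame_reference = 1;    /* currently forced by intra refresh anyway */
    p.i_keyint_max      = fps;  /* one refresh cycle per second */

    p.rc.i_rc_method       = X264_RC_ABR;
    p.rc.i_bitrate         = 1500;  /* kbit/s, as in the test command line */
    p.rc.i_vbv_max_bitrate = 1500;
    p.rc.i_vbv_buffer_size = 60;    /* small VBV buffer, small buffering delay */

    return x264_encoder_open(&p);
}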
Periodic Intra Refresh (PIR), however, was another story. It's easily enabled
with --sout-x264-intra-refresh (which also forces a few other options
currently, such as --sout-x264-ref 1, ie. use reference pictures at most one
frame back; most of these are not conceptual limitations, though, just an
effect of the current x264 implementation), but it causes
problems for the client. Normally, when the VLC client “tunes in” to a running
problems for the client. Normally, when the VLC client “tunes in” to a running
stream, it waits until the first key frame before it starts showing anything.
With PIR, you can run for ages with no key frames at all (if there's no clear
scene cut); that's sort of the point of it all. Thus, unless the client
happened to actually see the start of the stream, it could be stuck in a state
where it would be unable to show anything at all. (It should be said that
there was also a server-side shortcoming in VLC here at one point, where it didn't
always mark the right frames as keyframes, but that's also fixed in the 1.1
series.)
So, we have to patch the client. It turns out that the Right Thing(TM) to do
is to parse something called SEI recovery points, which is a small piece of
metadata the encoder inserts whenever it's beginning a new round of its
intra refresh. Essentially this says something like “if you start decoding
here now, in NN frames you will have a correct [or almost correct, if a
given bit is set] picture no matter what you have in your buffer at this
point”. I made a patch which was reviewed and is now in VLC upstream;
there have been some concerns about correctness, though (although none that
cover our specific use-case), so it might unfortunately be reverted at some
point. We'll see how it goes.
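For the curious, the parsing itself is not a lot of code. Here is a simplified,
standalone sketch of the idea, not the actual VLC patch; among other shortcuts,
it assumes the emulation prevention bytes have already been stripped from the
NAL, which a real parser must handle:

/* Walk a raw SEI NAL unit and pull recovery_frame_cnt out of any recovery
 * point message (SEI payload type 6). Simplified sketch. */
#include <stdint.h>
#include <stddef.h>

typedef struct { const uint8_t *p; size_t size, byte; int bit; } bitreader_t;

static unsigned read_bit(bitreader_t *br)
{
    unsigned b = (br->p[br->byte] >> (7 - br->bit)) & 1;
    if (++br->bit == 8) { br->bit = 0; br->byte++; }
    return b;
}

static unsigned read_ue(bitreader_t *br)   /* Exp-Golomb, as used in H.264 */
{
    int zeros = 0;
    while (read_bit(br) == 0)
        zeros++;
    unsigned value = 1;
    for (int i = 0; i < zeros; i++)
        value = (value << 1) | read_bit(br);
    return value - 1;
}

/* Returns recovery_frame_cnt, or -1 if the NAL contains no recovery point.
 * sei points just past the one-byte header of an SEI NAL (type 6). */
int parse_recovery_point(const uint8_t *sei, size_t size)
{
    size_t pos = 0;
    while (pos < size && sei[pos] != 0x80) {       /* 0x80: RBSP trailing bits */
        unsigned type = 0, length = 0;
        while (sei[pos] == 0xFF) { type += 255; pos++; }
        type += sei[pos++];
        while (sei[pos] == 0xFF) { length += 255; pos++; }
        length += sei[pos++];

        if (type == 6) {                           /* recovery point SEI */
            bitreader_t br = { sei + pos, length, 0, 0 };
            return (int)read_ue(&br);              /* recovery_frame_cnt */
        }
        pos += length;                             /* skip other payloads */
    }
    return -1;
}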
Anyhow, now we're down to theoretical sub-frame (<20ms) latency in the
video encoder, so let's talk about audio. It might not be obvious to most
people, but the typical audio codecs we use today (MP3, Vorbis, AAC, etc.)
have quite a bit of latency inherent to the design. For instance, MP3 works
in 576-sample blocks at some point; that's 12ms at 48 kHz, and the real
situation is much worse, since that's within a subband, which has already
been filtered and downsampled. You'll probably find that MP3 latency in
practice is about 150–200 ms or so (IIRC), and AAC is something similar;
in any case, at this point audio and video were noticeably out of sync.
The x264 post mentions CELT as a possible high-quality, low-latency
audio codec. I looked a bit at it, but
- VLC doesn't currently support it,
- It's not bitstream stable (which means that people will be very reluctant
to distribute anything linked against it, as you can break client/server
compatibility at any time), and
- It does not currently have a TS mapping (a specification for how to embed
it into a TS mux; every codec needs such a mapping), and I didn't really
feel like going through the procedure of defining one, getting it
standardized and then implementing it in VLC.
I looked through the list of what usable codecs were supported by the
TS demuxer in the client, though, and one caught my eye: LPCM. (The “L”
simply stands for “linear” — it just means regular old PCM for all practical
purposes.) It turns out both DVDs and Blu-rays have support for PCM,
including surround and all, and they have their own ways of chopping the PCM
audio into small blocks that fit neatly into a TS mux. It eats bandwidth,
of course (48 kHz 16-bit stereo is about 1.5 Mbit/sec), but we don't really
need to care too much; one of the privileges of controlling all parts of
the chain is that you know where you can cut the corners and where you cannot.
The decoder was already in place, so all I had to do was to write an encoder.
The DVD LPCM format is dead simple; the decoder was a bit underdocumented,
but it was easy to find more complete specs online and update VLC's comments.
The resulting patch was again sent in to VLC upstream, and is
currently pending review. (Actually I think it's just forgotten, so I should
nag someone into taking it in. It seems to be well received so far.)
With LPCM in use, the audio and video dropped neatly back into sync, and at
this point, we should have effectively zero codec latency except the time spent
on the encoding itself (which should surely be below one frame, given that the
system works in realtime). That means we can start hacking at the rest of
the system; essentially here the hard, tedious part starts, given that we're
venturing into the unknowns of VLC internals.
This also means we're done with part three; tomorrow we'll be talking about
timing and timestamps. It's perhaps a surprising topic, but it is central to
understanding VLC's architecture (or that of any video player in general), the
difficulties of finding and debugging latency issues, and where we can find
hidden sources of latency.
Wed, 06 Oct 2010 - VLC latency, part 2
Yesterday,
I introduced my (ongoing) project to set up a low-latency video stream with
VLC. Today it's time to talk a bit about the overall architecture, and the
signal acquisition.
First of all, some notes on why we're using VLC. As far as I can remember,
we've been using it at TG, either alone or together with other software,
since 2003. It's been serving us well; generally better than the other
software we've used (Windows Media Encoder, Wowza), although not without
problems of its own. (I remember we discovered pretty late one year that
while VLC could encode, play and serve H.264 video correctly at the
time, it couldn't actually reflect it reliably. That caused interesting
headaches when we needed a separate distribution point outside the hall.)
One could argue that writing something ourselves would give more control
(including over latency), but there are a few reasons why I'm reluctant:
- Anything related to audio or video tends to have lots of subtle little
issues that are easy to get wrong, even with a DSP education. Not to mention
that even though the overarching architecture looks simple, there's a heck
of a lot of small details (when did you last write a TS muxer?).
- VLC is excellent as generalist software; I'm wary of removing all the
tiny little options, even though we use few of them. If I suddenly need
to scale the picture down by 80%, or use a different input, I'm hosed
if I've written my own, since that code just isn't there when I need it.
We're going to need VLC for the non-low-latency streams anyhow, so
if we can do with one tool, it's a plus.
- Contributing upstream is, generally, better than making yet another
project.
So, we're going to try to run it with VLC first, and then have “write our own”
as not plan B or C, but probably plan F somewhere. With that out of the way,
let's talk about the general latency budget (remember that our overall goal
is 80ms, or four frames at 50fps):
- Inputting the frame: 20ms. While SDI is a cut-through
system, there's no way we can get VLC to operate on less than a whole
frame at a time, and the SDI card doesn't support it anyway as far as I
know. (SDI sends the frame at the signal rate, so it really takes one frame
to input one frame.)
- H.264 encoding: 10ms. We assume we can encode the frame
at twice the realtime speed in acceptable quality. This probably requires
a relatively fast quad- or octocore.
- Network latency: 5ms. Since we're only distributing within
the local network, bandwidth is essentially free, but there's always
going to be some delay in various networking stacks, routers, etc.
anyway, and of course, outputting a megabit of data at gigabit speeds will
take you a millisecond. (A megabit per frame would of course mean 50Mbit/sec
at 50 fps, and we're not going that high, but still.)
- H.264 decoding: 10ms. We assume decoding can happen in
twice the realtime speed; of course, decoding is usually a lot faster
than encoding (even though VLC cannot, as far as I know, use H.264 hardware
acceleration at this point); this should give us some leeway for slower
computers.
- Client buffering: 20ms. We don't really have all that much
control over the client, and this is sort of a grey area, so it's better
to be prepared for this.
- Screen display: 20ms. Again, there might be various forms
of buffering in place here, not to mention that you'll need to wait for
vertical sync.
As you can see, a lot of these numbers are based on guesswork. Also, they
sum up to 85ms — slightly more than our 80ms goal, but okay, at least it's
in the right ballpark. We'll see how things pan out later.
This also explains why we need to start so early — not only are there lots
of unknowns, but some of the latency is pretty much bound to be lurking in
the client somewhere. We have full control over the server, but we're not
going to be distributing our own VLC version to the 5000+ potential viewers
(maintaining custom VLC builds for
Windows, OS X and various Linux distributions is not very high on my list),
so it means we need to get things to upstream, and then wait for the code
to find its way to a release and then down onto people's computers in various
forms. (To a degree, we can ask people to upgrade, but it's better if the
right version is just already on their system.)
So, time to get the signal into the system. Earlier we've been converting
the SDI signal to DV, plugged it into a laptop over Firewire, and then sent
it over the network to VLC using parts of the DVSwitch suite.
(The video mixing rig and the server park are physically quite far apart.)
This has worked fine for our purposes, and we'll continue to use DV for
the other streams (and also as a backup for this one), but it's an
unnecessary part that probably adds a frame or three of latency on its
own, so it has to go.
Instead, we're going to send SDI directly into the encoder machine, and
multicast straight out from there. As an extra bonus, we'll have access to
the SDI clock that way; more on that in a future post.
Anyhow, you probably guessed it: VLC has no support for the Blackmagic
SDI cards we plan to use, or really any SDI cards. Thus, the first thing to
do was to write an SDI driver for VLC. Blackmagic has recently released a Linux
SDK, and it's pretty easy to use (kudos for thorough documentation, BM),
so it's really only a matter of writing the right glue; no reverse engineering
or kernel hacking needed.
I don't have any of these cards myself (or really, a desktop machine that
could hold them; I seem to be mostly laptop-only at home these days), but I've
been given access to a beefy machine with an SDI card and some test inputs
by Frikanalen, a non-commercial Norwegian TV station. There's
a lot of weird things being sent, but as long as it has audio and is moving,
it works fine for me.
So, I wrote a simple (input-only) driver for these cards, tested it a bit,
and sent it upstream. It actually generated quite a bit of an
email thread, but it's been mostly very constructive, so I'm
confident it will go in. Support for these cards seems to be a much
sought-after feature (the only earlier VLC support for them was on Windows, via
DirectShow), so a lot of people want their own “pony feature” in, but OK.
In any case, should there be some fatal issue and it would not go in, it's
actually not so bad; we can maintain a delta at the encoder side if we really
need to, and it's a pretty much separate module.
Anyhow, this post became too long too, so it's time to wrap it up. Tomorrow,
I'm going to talk about perhaps the most obvious sort of latency (and the
one that inspired the project to begin with — you'll understand what I mean
tomorrow), namely codec latency.
Tue, 05 Oct 2010 - VLC latency, part 1: Introduction/motivation
The Gathering 2011 is as of this time still unannounced (there's not even
an organizer group yet), but it would be an earthquake if it were not held
in Hamar, Easter 2011. Thus, some of us have already started planning our
stuff. (Exactly why we need to start so early with this specific piece will
become clear pretty soon.)
TG has, since the early 2000s, had a video stream of stuff happening on the
stage, usually also including the demo competitions. There are obviously three
main targets for this:
- People in the hall who are just too lazy to get to the stage (or have
problems seeing in some other way, perhaps because it's very crowded).
This includes those who can see the stage, but perhaps not hear it
too well.
- People external to the party, who want to watch the compos.
- Post-party use, from normal video files.
My discussion here will primarily be centered around #1, and for that, there's
one obvious metric you want to minimize: Latency. Ideally you want your stream
to appear in sync with the audio and video from the main stage; while that's
of course impossible for video, you can actually beat the audio (sound is slow
across a big hall) if you really want, and a video delay of 100ms is pretty
much imperceptible anyway. (I don't
have numbers ready for how much is “too much”, though. Somebody can probably
fill in numbers from research if they exist.)
So, we've set a pretty ambitious goal: Run an internal 720p50 stream (TG
is in Europe, where we use PAL, y'know) with less than four frames (80ms)
latency from end to end (ie., from the source to what's shown on a local end user's
display). Actually, that's measured compared to the big screens, which will
run the same stream, so compared to the stage there will be 2–3 extra frames;
TG has run their A/V production on SDI
the last few years, which is essentially based on cut-through switching,
but there are still some places where you need to do store-and-forward
to process an entire frame at a time, so the real latency will be something
like 150ms. If the 80ms goal holds, of course...
I've decided to split this post into several parts, once per day, simply
because there will be so much text otherwise. (This one is already more than
long enough, but the next ones will hopefully be shorter.) I have to warn
that a lot of the work is pretty much in-progress, though — I pondered waiting
until it was all done and perfect, which usually makes the overall narrative
a lot clearer, but then again, it's perhaps more in the blogging spirit to
provide as-you-go technical reports. So, it might be that we won't reach our
goal, it might be that the TG stream will be a total disaster, but hopefully
the journey, even the small part we've completed so far, will be interesting.
Tomorrow, in part 2, we're going to look at the intended base architecture,
the latency budget, and signal acquisition (the first point in the chain).
Stay tuned :-)