Steinar H. Gunderson

Sat, 12 May 2012 - TCP optimization for video streaming

At this year's The Gathering, I was once again head of Tech:Server, and one of our tasks was to get the video stream (showing events, talks, and not least the demo competitions) out to viewers both inside and outside the hall.

As I've mentioned earlier, we've been using VLC as our platform, streaming to both an embedded Flash player and people using the standalone VLC client. (We had viewers both inside and outside the hall; multicast didn't really work properly from day one this year, so we did everything over unicast, from a machine with a 10 Gbit/sec Intel NIC. We had more machines/NICs in case we needed more capacity, but we peaked at “only” about 6 Gbit/sec, so one machine was fine.)

But once we started streaming demo compos, we started getting reports from external users that the stream would sometimes skip and break up. With users far away, like in the US, we could handwave it away; TCP works relatively poorly over long-distance links, mainly because you have no control over the congestion along the path, and the high round-trip time (RTT) causes information about packet loss etc. to come back very slowly. (Also, if you have an ancient TCP stack on either side, you're limited to 64 kB windows, but that wasn't the problem in this case.) We tried alleviating that with an external server hosted in France (for lower RTTs, plus having an alternative packet path), but that could not really explain how a user on a 30/30 Mbit/sec line only 30 ms away (even with the same ISP as us!) couldn't watch our 2 Mbit/sec stream.
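To see why a small window and a long RTT are such a bad combination, a back-of-the-envelope calculation helps: TCP can never have more than one window of data in flight per round trip, so throughput is bounded by window size divided by RTT. A quick sketch (the numbers are illustrative, not measurements from TG):

```python
# Upper bound on TCP throughput: at most one window of data per round trip.
def max_throughput_mbit(window_bytes: int, rtt_ms: float) -> float:
    """Maximum achievable TCP throughput for a given window and RTT."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6

# An ancient stack without window scaling is capped at 64 kB of window:
print(max_throughput_mbit(64 * 1024, 30))   # ~17.5 Mbit/sec at 30 ms RTT
print(max_throughput_mbit(64 * 1024, 150))  # ~3.5 Mbit/sec at a transatlantic 150 ms
```

Even without any packet loss, the 64 kB cap alone would sink a 2 Mbit/sec stream at a few hundred milliseconds of RTT; with bufferbloat inflating the RTT (more on that below), the bound only gets worse.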

(At this point, just about everybody I've talked to goes on some variant of “but you should have used UDP!”. While UDP undoubtedly has no similar problem of stream breakdown on congestion, it's also completely out of the question as the only solution for us, for the simple reason that it's impossible to get it to most of our end users. The best you can do with Flash or VLC as the client is RTSP with RTP over UDP, and only a small fraction of NATs will let that pass. It's simply not usable as a general solution.)

To understand what was going on, it's useful to take a slightly deeper dive and look at what the packet stream really looks like. When presented with the concept of “video streaming”, the most natural reaction would be to imagine a pretty smooth, constant packet flow. (Well, that or a YouTube “buffering” spinner.) However, that's really about as far from the truth as you could get. I took the time to visualize a real VLC stream from a gigabit line in Norway to my 20/1 Mbit/sec cable line in Switzerland, slowed down a lot (40x) so you can see what's going on:

(The visualization is inspired by Carlos Bueno's Packet Flight videos, but I used none of his code.)

So, what you can see here is TCP being even burstier than its usual self: The encoder card outputs a frame for encoding every 1/25th second (or 1/50th for the highest-quality streams), and after x264 has chewed on the data, TCP immediately sends out all of it as fast as it possibly can. Getting the packets down to my line's speed of 20 Mbit/sec is regarded as someone else's problem (you can see it really does happen, though, as the packets arrive more spaced out at the other end); and the device doing it has to pretty much buffer up the entire burst. At TG, this was even worse, of course, since we were sending at 10 Gbit/sec speeds, with TSO so that you could get lots of packets out back-to-back at line rates. To top it off, encoded video is inherently highly bursty on the micro scale; a keyframe is easily twenty times the size of a B frame, if not more. (B frames also present the complication that they can't be encoded until the following frame has been encoded, but I'll ignore that here.)
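To put some rough numbers on that burstiness (the figures here are assumptions for illustration, not measurements from the stream above): a 2 Mbit/sec stream at 25 fps averages 10 kB per frame, and each frame leaves a fast sender essentially instantaneously compared to how long it takes to drain at the receiver's line rate.

```python
# Rough model of the per-frame burstiness: each frame goes out as one burst.
STREAM_BITRATE = 2_000_000          # bit/sec, the stream's nominal rate
FPS = 25                            # encoder emits a frame every 40 ms
avg_frame_bytes = STREAM_BITRATE / 8 / FPS   # 10 kB per frame on average

def wire_time_ms(frame_bytes: float, link_bps: float) -> float:
    """Time one frame-sized burst occupies a link of the given speed."""
    return frame_bytes * 8 / link_bps * 1000

print(wire_time_ms(avg_frame_bytes, 10e9))  # ~0.008 ms at the 10 Gbit/sec sender
print(wire_time_ms(avg_frame_bytes, 20e6))  # ~4 ms at a 20 Mbit/sec receiver line
```

In other words, out of each 40 ms frame interval, the wire is dead silent except for one burst that something along the path has to stretch out by a factor of hundreds, and a keyframe multiplies the burst by another twenty or so.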

Why are high-speed bursts bad? Well, the answer really has to do with router buffering along the way. When you're sending such a huge burst and the router can't send it on right away (i.e., it's downconverting to a lower speed, either because the output interface is only e.g. 1 Gbit/sec, or because it's enforcing the customer's maximum speed), you run the risk of the router running out of buffer space and dropping packets. If so, you need to wait at least one RTT for the retransmit; let's just hope you have selective ACK in your TCP stack, so the rest of the traffic can flow smoothly in the meantime.
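The tail-drop behavior is simple enough to sketch; the buffer and burst sizes below are made up for illustration, not taken from any router we actually traversed:

```python
# A toy tail-drop router queue: once the buffer is full, arriving packets
# are simply discarded, and each loss costs at least one RTT to repair.
from collections import deque

class TailDropQueue:
    def __init__(self, capacity_pkts: int):
        self.q = deque()
        self.capacity = capacity_pkts
        self.dropped = 0

    def enqueue(self, pkt) -> None:
        if len(self.q) >= self.capacity:
            self.dropped += 1      # buffer full: packet is lost
        else:
            self.q.append(pkt)

# A 150 kB burst of 1500-byte packets (100 packets) arrives back-to-back,
# faster than the output link can drain anything, into a 64-packet buffer:
router = TailDropQueue(capacity_pkts=64)
for seq in range(100):
    router.enqueue(seq)
print(router.dropped)  # 36 packets lost from a single burst
```

The point is that the losses come in clumps at the tail of each burst, exactly where they hurt a streaming connection the most.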

Even worse, maybe your router is not dropping packets when it's overloaded, but instead keeps buffering them up. In many ways this is worse still, because now your RTT increases, and as we already discussed, high RTT is bad for TCP. Packet loss happens whether you want it to or not (not just due to congestion; for instance, my iwl3945 card goes on a scan through the available 802.11 channels every 120 seconds to see if there are any better APs on other channels), and when it inevitably happens, you're pretty much hosed and eventually your stream will go south. This is known as bufferbloat, and I was really surprised to see it in play here; I had connected it only to uploading before (in particular, BitTorrent). But modern TCP supports continuous RTT measurement through timestamps, and some of the TCP traces (we took tcpdumps for a few hours during the most intensive period) unmistakably show the RTT increasing by several hundred milliseconds at times.

So, now that we've established that big bursts are at least part of the problem, there are two obvious ways to mitigate it: reduce the size of the bursts, or make them smoother (less bursty). I guess you can look at the two as the macroscopic and the microscopic solution, respectively.

As for the first part, we noticed after a while that what really seemed to give people problems was when we'd shown a static slide for a while and then faded to live action; a lot of people would invariably report problems when that happened. This was a clear sign that we could do something on the macroscopic level; most likely, the encoder had saved up a lot of bits while encoding the simple, static image, and was now ready to blow away its savings all at once in that fade.

And sure enough, tuning the VBV (video buffering verifier) settings so that the bitrate budget was calculated over one second instead of whatever the default was (I still don't know what the default behavior of x264 under VLC is) made an immediate difference. Things still weren't really good, but it pretty much fixed the issue with fades, and in general people seemed happier.
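The effect of the VBV constraint can be sketched as a leaky bucket; this is my own simplified model for illustration, not x264's actual rate control code, and the parameter names merely echo x264's vbv-maxrate/vbv-bufsize options:

```python
# Simplified VBV model: the bit budget refills at VBV_MAXRATE and is capped
# at VBV_BUFSIZE, so savings from static scenes can't pile up without bound.
VBV_MAXRATE = 2_000_000   # bit/sec, matching the 2 Mbit/sec stream
VBV_BUFSIZE = 2_000_000   # bits; bufsize == maxrate gives a one-second budget
FPS = 25

fullness = VBV_BUFSIZE    # bits of budget currently available

def max_frame_bits() -> float:
    """Largest frame the encoder may emit right now without underflowing."""
    return fullness

def emit_frame(bits: float) -> None:
    global fullness
    assert bits <= fullness, "VBV underflow: frame too big for the budget"
    fullness = min(VBV_BUFSIZE, fullness - bits + VBV_MAXRATE / FPS)

# Ten seconds of tiny frames on a static slide builds up no extra credit:
for _ in range(250):
    emit_frame(1000)
print(max_frame_bits())   # still capped at 2,000,000 bits (one second's worth)
```

With this cap in place, a fade after a static slide can burst at most about one second's worth of bits, rather than the unbounded backlog that was knocking people's connections over.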

As for the micro-behavior, this seems to be pretty hard to actually fix; there is something called “paced TCP”, with several papers on it, but nothing in the mainline kernel. (TCP Hybla is supposed to do this, but the mainline kernel doesn't have the pacing part. I haven't tried the external patch yet.) I tried implementing pacing directly within VLC by just sending slowly, and this made the traffic a lot smoother... until we needed to retransmit, in which case the TCP stack doesn't care how smoothly the data came in in the first place; it just bursts like crazy again. So, lose. We even tried replacing the 10GigE link with 8x1GigE links, using a Cisco 4948E to at least smooth things down to gigabit rates, but it didn't really seem to help much.
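The application-level pacing I tried amounts to something like the following sketch (a hypothetical helper for illustration, not VLC's actual code): instead of writing a whole encoded frame in one go, spread it over the frame interval in roughly MTU-sized chunks.

```python
# Application-level pacing sketch: break one frame's worth of data into
# small chunks and spread them across the frame interval.
import time

def paced_send(sock_send, data: bytes, interval_s: float = 0.04,
               chunk: int = 1400) -> None:
    """Send `data` in `chunk`-byte pieces spread across `interval_s` seconds."""
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    delay = interval_s / max(1, len(chunks))
    for c in chunks:
        sock_send(c)
        time.sleep(delay)   # smooths only the first transmission; kernel
                            # retransmits still go out in one big burst

# Usage with a dummy sink instead of a real socket:
sent = []
paced_send(sent.append, b"x" * 14000, interval_s=0.004)
print(len(sent))   # 10 chunks of 1400 bytes
```

As the comment notes, this is exactly why it failed for us: the kernel's retransmission logic sits below the application and bursts on its own schedule.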

During all of this, I had a thread going on the bufferbloat mailing list (with several helpful people; thanks!), and it was from there the second breakthrough came, more or less in the last hour: Dave Täht suggested that we could reduce the amount of memory given to TCP for write buffering, instead of increasing it as one normally would for higher throughput. (We did this by changing the global flag in /proc/sys; one could also use the SO_SNDBUF socket option.) Amazingly enough, this helped a lot! We only dared to do it on one of the alternative streaming servers (in hindsight this was the wrong decision, but we were streaming to hundreds of people at the time and didn't dare mess things up too much for those it did work for), and it really only caps the maximum burst size, but that seemed to push us just over the edge to working well for most people. It's a suboptimal solution in many ways, though; for instance, if you send a full buffer (say, 150 kB or whatever) and the first packet gets lost, your connection is essentially frozen until the retransmit happens and the ack comes back. Furthermore, it doesn't really solve the problem of the burstiness itself; it solves it more on a macro level again (or maybe a mid-level, if you really want to).
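For reference, the per-socket variant looks like this; the 64 kB figure is just an example value, not what we actually used at TG:

```python
# Capping the kernel's write buffer for one socket with SO_SNDBUF, the
# per-socket equivalent of the global /proc/sys change described above.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask for a small send buffer so a single write can't queue up hundreds of
# kB of burst inside the kernel. Note that Linux doubles the requested
# value internally to account for bookkeeping overhead.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 64 * 1024)
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(actual)   # typically 131072 on Linux (the doubled value)
s.close()
```

The trade-off described above falls straight out of this: the buffer cap bounds the burst, but it also bounds how much data can be in flight while waiting for a retransmit.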

In any case, it was good enough for us to let it stay as it was, and the rest of the party went pretty smoothly, save for some odd VLC bugs here and there. The story doesn't really end there, though—in fact, it's still being written, and there will be a follow-up piece in not too long about post-TG developments and improvements. For now, though, you can take a look at the following teaser, which is what the packet flow from my VLC looks like today:

Stay tuned. And don't send the packets too fast.

