Steinar H. Gunderson

Sun, 30 Aug 2009 - Trying to understand TCP performance (long)

I guess most people who have ever been using the Internet have experienced slow downloads (where “slow” means “anything that doesn't fill up my line”). Usually it's pretty easy to shrug it off -- it's probably the other server being slow. Or, perhaps some cable in the middle is clogged. Who knows. (Way way too many people “solve” this problem by using what's known as a “download accelerator”, which opens tons of connections to get the same file. And guess what, it usually works, and you don't really care about the server in the other end being bogged down or everyone else getting less bandwidth. “I pay my ISP for 10Mbit/sec! I will have 10Mbit/sec!” -- but this is a digression.)

However, once you start getting into bigger and bigger networks and see the whole picture, it's not always so easy anymore. Why do I only get 25 Mbit/sec on my download, when both client and server are on gigabit (but in different cities), and the network in-between is only lightly loaded? The answer is pretty long, although I'll try to skip as much detail as possible. (I might also get things seriously wrong, so I'd appreciate any feedback :-) )

At this point, to try to understand anything at all, you have to delve into TCP, this protocol everybody uses but which I'm sure only about a hundred people in the world truly understand. (I'm certainly not one of them.) On the surface of it, it's a pretty easy thing -- the sender splits the data stream into packets, and the receiver acknowledges them as they come in. If data gets lost, the sender retransmits until the data eventually comes through or it has to give up.

Now, it should be mentioned that TCP today may look similar to the TCP that was standardized in 1974, but there are lots of extensions both to the protocol itself and to how hosts actually handle it. We'll get back to that in a second, but it means that there's actually a ton of different implementations out there that have to coexist in various ways.

Anyways, since the sender may have to retransmit data, it needs to remember all data it has sent out until the other host has acknowledged it. It keeps this data in a sender buffer, which I'll blissfully ignore from now on. Similarly, the receiver has a receive buffer or flow-control window of a given (advertised) size -- basically, it tells the sender “I can handle xxx kB of data before you have to pause and wait for an acknowledgement”. This is a hard limitation; the sender is not allowed to break it under any circumstance. It exists because the TCP stack in the other end does not have space for storing up unlimited amounts of data for the receiving process, so at some point, the sender just has to wait for the data to be processed. (No use in sending at 100Mbit/sec if the data is being saved to a floppy anyway, right?)

There's also a congestion window which the sender calculates for itself, as it tries to send just enough data to maximize the speed, but not enough data to overflow the network. The algorithms for this are very complex beasts, and it quickly gets into the realm of control theory, of which I know way too little. Suffice to say that the sender can never have more unacknowledged data in flight than the smaller of the congestion window and the flow-control window, and that there are tons of different algorithms in use for this. They all behave subtly differently in response to things like handling packet drops (classical TCP assumed dropped packets were always due to congestion, but now that we have wireless and mobile links that assumption is probably not really true anymore), queueing delays (TCP tends to send packets in big bursts and then rely on buffers in network cards and routers to smooth out the pace), feedback from the network, fairness between flows and different types of TCP, etc. It's a big mess, it's important if you want to really understand TCP performance, and I'm not going to talk much about it here.

Anyways, the flow-control window. Classical TCP has a limit of 64 kB for the window (the protocol field is a 16-bit integer). The advertised window size can be changed as time goes by (for instance because it may be, well, filled up with data), but let's for the moment assume it's constant. Now, imagine the sender and receiver being on different continents, with a round-trip time of 200 ms, and with the flow-control window at the max of 64 kB. The sender sends these 64 kB, and then cannot send any more until an ack comes back. That takes 200 ms, at which point it can send 64 kB more, etc. So you get to send 64 kB five times a second, for a whopping total of... 320 kB/sec, or about 2.5 Mbit/sec. I'm sure these were unimaginable speeds in 1974, but the world has certainly progressed from there. Even worse, Windows pre-Vista (yes, Microsoft certainly has its part in this mess) has a default receive window size of 17 kB if you're connected over 100Mbit/sec or less! No wonder you can't get your file fast enough.
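The arithmetic above is easy to redo for other windows and round-trip times. Here's a minimal sketch, assuming the window stays constant, the congestion window never gets in the way, and that “kB” means 1000 bytes; the function name is just something I made up for illustration.

```python
def window_limited_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput in bits per second when at most
    window_bytes may be unacknowledged at any one time."""
    return window_bytes * 8 / rtt_seconds


# The two cases from the text: a full classical window, and the old
# Windows default, both over a 200 ms round trip.
classic = window_limited_throughput(64 * 1000, 0.200)
old_windows = window_limited_throughput(17 * 1000, 0.200)

print(f"64 kB window: {classic / 1e6:.2f} Mbit/sec")
print(f"17 kB window: {old_windows / 1e6:.2f} Mbit/sec")
```

Note that the line speed doesn't appear anywhere in the formula; as long as the window is the bottleneck, only the window size and the round-trip time matter.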

At this point we need to start considering TCP options. Thankfully, TCP has some space where hosts can advertise their support for various options, and the first one we're going to consider is TCP window scaling (or wscale). Basically TCP window scaling says “my number for the receive window is not in units of bytes, it's in units of 2^N bytes”. So, whoop, set wscale=8 and you can advertise a buffer size of 16 MB, allowing you to potentially reach over half a gigabit per second over your 200 ms link (assuming of course you have that much transatlantic bandwidth, and that the congestion algorithms work perfectly). And of course, it doesn't stop at 8; the option allows N to go as high as 14.
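To get a feel for which scale factor a given path needs, here's a rough sketch. It assumes the receiver advertises the largest value the 16-bit field can hold, and it uses 14 as the largest shift the window scale option permits; the function names are hypothetical.

```python
MAX_WINDOW_FIELD = 0xFFFF  # largest value of the 16-bit window field
MAX_WSCALE = 14            # largest shift the window scale option permits


def effective_window(window_field: int, wscale: int) -> int:
    """Window in bytes once the advertised field is scaled by 2^wscale."""
    return window_field << wscale


def wscale_needed(bandwidth_bits_per_sec: float, rtt_seconds: float) -> int:
    """Smallest shift whose maximum window covers bandwidth * RTT."""
    bdp_bytes = bandwidth_bits_per_sec * rtt_seconds / 8
    for shift in range(MAX_WSCALE + 1):
        if effective_window(MAX_WINDOW_FIELD, shift) >= bdp_bytes:
            return shift
    raise ValueError("path needs a larger window than TCP can advertise")


print(effective_window(MAX_WINDOW_FIELD, 8))  # bytes available at wscale=8
print(wscale_needed(1e9, 0.200))              # gigabit over a 200 ms round trip
```

So wscale=8 gets you the roughly 16 MB mentioned above, but actually filling a gigabit link at 200 ms takes one step more.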

But there's more: Another important option is selective acks (SACK). This allows your TCP stack to say things like “I've got all your data up to byte 5000, and then I've got from 8000-10000”. Classical TCP only allows the “up to byte 5000” part, and then until you've got an ack for the 5000-8000 part you're totally in the dark as to whether 8000-10000 came through or if you need to retransmit it. SACK allows the congestion algorithm to make much better decisions in the face of packet loss.
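To make the difference concrete, here's a small sketch of what a sender can work out from a SACK-style report. It's an illustration only: it treats byte ranges as half-open (start, end) pairs, ignores sequence number wraparound, and the function name is made up.

```python
def missing_ranges(cumulative_ack: int,
                   sack_blocks: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the byte ranges [start, end) that the receiver has not
    reported, between the cumulative ack and the highest SACKed byte."""
    holes = []
    expected = cumulative_ack
    for start, end in sorted(sack_blocks):
        if start > expected:
            holes.append((expected, start))  # gap before this block
        expected = max(expected, end)
    return holes


# The example from the text: everything up to 5000, plus 8000-10000.
print(missing_ranges(5000, [(8000, 10000)]))  # [(5000, 8000)]
```

Without the SACK blocks, all the sender knows is the 5000; with them, it knows that only 5000-8000 needs to go out again.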

There are also TCP timestamps. This is a sort of ping-on-top-of-TCP, which allows the TCP stack to actually properly estimate the round-trip-time (again helping some congestion algorithms), protect against wraparound of some counters, and other neat things.
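The ping-like part boils down to the other end echoing your timestamp back, so that you can subtract it from your current clock. A minimal sketch, assuming both values are in the sender's own (implementation-defined) clock ticks; the only subtlety is that the timestamp values are 32-bit and wrap around. The function name is hypothetical.

```python
TS_MODULUS = 1 << 32  # timestamp values are 32-bit and wrap around


def rtt_ticks(now_ts: int, echoed_ts: int) -> int:
    """Round-trip time in clock ticks, given the sender's current
    timestamp and the timestamp the peer echoed back."""
    return (now_ts - echoed_ts) % TS_MODULUS


print(rtt_ticks(1200, 1000))             # 200
print(rtt_ticks(50, TS_MODULUS - 150))   # 200, even across the wraparound
```

A real stack would feed these samples into a smoothed estimate instead of trusting any single one.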

Finally, we have something called ECN, or Explicit Congestion Notification (which is more a feature of IP than of TCP, but it's surely intended for things like TCP to take advantage of). The idea is that a router on the way can signal that the network is getting congested before it gets to the point where the packets have to get dropped.

So, receive window size, window scaling, SACK, TCP timestamps and ECN. Which of these are actually used in practice? This is where you have to go out with tcpdump and look at what's actually being sent. I'll save you the hassle:

So, to sum it up: UNIXes have sane defaults, old Windows has pretty slow defaults, and newer Windows has relatively good but not perfect defaults (sometimes due to the fact that Windows XP is getting really old, and sometimes just due to Microsoft's extreme conservatism in breaking anything, for better or for worse). ECN is barely used at all, but I don't know the impact of this and I don't know how many routers are set up to signal these flags anyway.
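If you want to check a SYN yourself instead of taking my word for it, the options field is just a sequence of kind/length/value records and isn't hard to decode by hand. Here's a minimal sketch that handles only the options discussed above (kind 2 is the maximum segment size, 3 is window scale, 4 is SACK-permitted, 8 is timestamps); the function name and the sample bytes are made up for illustration. ECN doesn't show up here, since it's negotiated through header flags and not through an option.

```python
OPTION_NAMES = {2: "mss", 3: "wscale", 4: "sack_permitted", 8: "timestamps"}


def parse_tcp_options(raw: bytes) -> dict:
    """Decode the options area of a TCP header into a dict."""
    options = {}
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:          # end of option list
            break
        if kind == 1:          # no-op, used for padding
            i += 1
            continue
        if i + 1 >= len(raw):  # truncated: no length byte
            break
        length = raw[i + 1]    # length covers the kind and length bytes too
        if length < 2 or i + length > len(raw):
            break              # malformed, stop instead of guessing
        value = raw[i + 2:i + length]
        name = OPTION_NAMES.get(kind, f"kind_{kind}")
        if kind == 2 and len(value) == 2:
            options[name] = int.from_bytes(value, "big")
        elif kind == 3 and len(value) == 1:
            options[name] = value[0]
        elif kind == 4:
            options[name] = True
        elif kind == 8 and len(value) == 8:
            options[name] = (int.from_bytes(value[:4], "big"),
                             int.from_bytes(value[4:], "big"))
        else:
            options[name] = value  # unknown option, keep the raw bytes
        i += length
    return options


# Hand-written sample: MSS 1460, SACK permitted, timestamps, no-op, wscale 7.
sample = bytes.fromhex("020405b4 0402 080a 00003039 00000000 01 030307")
print(parse_tcp_options(sample))
```

Remember that the window field in the SYN itself is never scaled, so the wscale value only tells you what later packets will mean.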

However, even after this, there are differences. I've seen cases where a Linux and Windows host (even on modern Windows) advertise the same options and similar receive buffer sizes, yet the Linux host downloads a given file from a given server five times as fast as the Windows host (or even more -- I've seen 20Mbit/sec vs. 950Mbit/sec). And that's just the receiver side, which is supposed to be pretty simple! The sender side is, of course, much more complex, since the sender is the one adjusting the congestion window. I don't have any Windows machines around, though, so I haven't been able to look at it in depth.

I put together a small web page that checks the TCP options your browser connects with; it's fun seeing how different appliances react. You can find it at http://tcpmeasure.sesse.net/ -- it's a bit hacky, and it's not really user-friendly, but it's at least easier than trying to tcpdump your TV. (And I can't wait until the different crawlers find it =) ) Happy tuning! :-)

Update: I had misunderstood the meaning of the window size on the first SYN packet (it's not affected by window scaling), which invalidated some of my OS findings. Corrected, and tcpmeasure has been changed accordingly.


Steinar H. Gunderson <sgunderson@bigfoot.com>