Steinar H. Gunderson

Sun, 28 Apr 2013 - Precise cache miss monitoring with perf

This should have been obvious, but seemingly it's not (perf is amazingly undocumented, and has this huge lex/yacc grammar for its command-line parsing), so here goes:

If you want precise cache miss data from perf (where “precise” means using PEBS, so that it gets attributed to the actual load and not some random instruction a few cycles later), you cannot use “cache-misses:pp” since “cache-misses” on Intel maps to some event that's not PEBS-capable. Instead, you'll have to use “perf record -e r10cb:pp”. The trick is, apparently, that “perf list” very much suggests that what you want is rcb10 and not r10cb, but that's not the way it's really encoded: the raw format puts the unit mask first and the event code last, so event 0xcb with umask 0x10 becomes r10cb.

FWIW, this is LLC misses, so it's really things that go to either another socket (less likely), or to DRAM (more likely). You can change the 10 to something else (see “perf list”) if you want e.g. L2 hits.
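To make the encoding concrete, here is a minimal sketch that builds the raw event string from an event code and a unit mask. The helper name perf_raw_event is made up for illustration; it only formats a string and does not check that the event exists or is PEBS-capable on your CPU.

```python
def perf_raw_event(event_code: int, umask: int, precise: int = 0) -> str:
    """Format a perf raw event as r<umask><event>, plus optional :p modifiers.

    Hypothetical helper: it only builds the string, nothing is validated
    against the actual CPU.
    """
    if not (0 <= event_code <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event code and umask must each fit in one byte")
    if not 0 <= precise <= 3:
        raise ValueError("precise level must be between 0 and 3")
    # The unit mask is the high byte and the event code the low byte,
    # which is why the umask comes first in the hex string.
    spec = f"r{umask:02x}{event_code:02x}"
    if precise:
        spec += ":" + "p" * precise
    return spec


# Event 0xcb with umask 0x10, requesting precise (PEBS) sampling.
print("perf record -e " + perf_raw_event(0xCB, 0x10, precise=2))
```

Swapping in a different umask from “perf list” gives the string for the other variants of the same event.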

[22:53]

Mon, 15 Apr 2013 - TG and VLC scalability

With The Gathering 2013 well behind us, I wanted to write a followup to the posts I had on video streaming earlier.

Some of you might recall that we identified an issue at TG12, where the video streaming (to external users) suffered from us simply having too fast a network; bursting frames to users at 10 Gbit/sec overloads buffers in the down-conversion to lower speeds, causing packet loss, which triggers new bursts, sending the TCP connection into a spiral of death.

Lacking proper TCP pacing in the Linux kernel, the workaround was simple but rather ugly: Set up a bunch of HTB buckets (literally thousands), put each client in a different bucket, and shape each bucket to approximately the stream bitrate (plus some wiggle room for retransmits and bitrate peaks, although the latter are kept under control by the encoder settings). This requires a fair amount of cooperation from VLC, which we use as both encoder and reflector; it needs to assign a unique mark (fwmark) to each connection, which tc can then use to put the client into the right HTB bucket.
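For a rough idea of what such a setup looks like, here is a small sketch that generates the tc commands for one HTB class and one fwmark filter per client. This is not our actual script. The function name, the device name and the rate are made-up example values, and it assumes the application has already put marks 1..N on the connections. It only prints the commands and does not run anything.

```python
def htb_commands(dev: str, num_clients: int, rate_kbit: int) -> list[str]:
    """Return tc commands for one HTB class per fwmark (illustrative only)."""
    # Class minor IDs are 16-bit, so this simple 1:1 mapping is limited.
    if not 1 <= num_clients <= 0xFFFE:
        raise ValueError("num_clients must be between 1 and 65534")
    cmds = [f"tc qdisc add dev {dev} root handle 1: htb"]
    for mark in range(1, num_clients + 1):
        # tc parses class IDs as hexadecimal, while the fw handle
        # (the mark) is given in decimal here.
        classid = f"1:{mark:x}"
        cmds.append(
            f"tc class add dev {dev} parent 1: classid {classid} "
            f"htb rate {rate_kbit}kbit"
        )
        cmds.append(
            f"tc filter add dev {dev} parent 1: protocol ip prio 1 "
            f"handle {mark} fw flowid {classid}"
        )
    return cmds


# Example values only: three clients, each shaped to 2000 kbit/sec.
for cmd in htb_commands("eth0", 3, 2000):
    print(cmd)
```

With thousands of clients this turns into thousands of classes and filters, which is exactly the "rather ugly" part.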

Although we didn't collect systematic user experience data (apart from my own tests done earlier, streaming from Norway to Switzerland), it's pretty clear that the effect was as hoped for: Users who had reported quality for a given stream as “totally unusable” now reported it as “perfect”. (Well, at first it didn't seem to have much effect, but that was due to packet loss caused by a faulty switch supervisor module. Only shows that real-world testing can be very tricky. :-) )

However, suddenly this happened on the stage:

[Image: Cosplayer on stage]

which led to this happening to the stream load:

[Image: Graph going up and to the right]

and users, especially ones external to the hall, reported things breaking up again. It was obvious that the load (1300 clients, or about 2.1 Gbit/sec) had something to do with it, but the server wasn't out of CPU—in fact, we killed a few other streams and hung processes, freeing up three or so cores, without any effect. So what was going on?

At the time, we really didn't get to go deep enough into it before the load had lessened; perf didn't really give an obvious answer (even though HTB is known to be a CPU hog, it didn't really figure high up in the list), and the little tuning we tried (including removing HTB) didn't really help.

It wasn't until this weekend, when I finally got access to a lab with 10gig equipment (thanks, Michael!), that I could verify my suspicions: VLC's HTTP server is single-threaded, and not particularly efficient at that. In fact, on the lab server, which is a bit slower than what we had at TG (4x2.0GHz Nehalem versus 6x3.2GHz Sandy Bridge), the most I could get from VLC was 900 Mbit/sec, not 2.1 Gbit/sec! Clearly we were a bit lucky, both with our hardware and in having more than one stream (VLC vs. Flash) to distribute our load across. HTB was not the culprit, since this was run entirely without HTB, and the server wasn't doing anything else at all.

(It should be said that this test is nowhere near 100% exact, since the server was only talking to one other machine, connected directly to the same switch, but it would seem a very likely bottleneck, so in lieu of $100k worth of testing equipment and/or a very complex netem setup, I'll accept it as the explanation until proven otherwise. :-) )

So, how far can you go, without switching streaming platforms entirely? The answer comes in the form of Cubemap, a replacement reflector I've been writing over the last week or so. It's multi-threaded, much more efficient (using epoll and sendfile; yes, sendfile), and is also more robust due to being less intelligent (VLC needs to demux and remux the entire signal to reflect it, which doesn't always go well for more esoteric signals; in particular, we've seen issues with the Flash video mux).
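To illustrate the epoll-plus-sendfile pattern in general terms, here is a minimal, Linux-only sketch. It is not Cubemap's code: the function name reflect is made up, and it sends a fixed number of bytes from a regular file rather than a live stream. It waits with epoll until a client socket is writable, then lets the kernel copy the data with sendfile, so the bytes never pass through userspace.

```python
import os
import select
import socket
import tempfile


def reflect(data_fd: int, size: int, clients: list[socket.socket]) -> None:
    """Send the first `size` bytes of data_fd to every client socket."""
    ep = select.epoll()
    offsets = {}  # client fd -> number of bytes already sent
    for sock in clients:
        sock.setblocking(False)
        ep.register(sock.fileno(), select.EPOLLOUT)
        offsets[sock.fileno()] = 0
    try:
        while offsets:
            for fd, events in ep.poll():
                if events & (select.EPOLLERR | select.EPOLLHUP):
                    ep.unregister(fd)  # client went away
                    del offsets[fd]
                    continue
                try:
                    sent = os.sendfile(fd, data_fd, offsets[fd],
                                       size - offsets[fd])
                except BlockingIOError:
                    continue  # socket buffer full, wait for next EPOLLOUT
                offsets[fd] += sent
                # sent == 0 means the file is shorter than `size`.
                if sent == 0 or offsets[fd] >= size:
                    ep.unregister(fd)
                    del offsets[fd]
    finally:
        ep.close()


if __name__ == "__main__":
    # Tiny demo: a local socket pair stands in for a network client.
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 4096)
        f.flush()
        server_side, client_side = socket.socketpair()
        reflect(f.fileno(), 4096, [server_side])
        print("client received", len(client_side.recv(8192)), "bytes")
```

A real reflector would of course keep appending to the data and never "finish" a client, and would run one such loop per thread. The sketch only shows why so little time ends up being spent in userspace.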

Running Cubemap on the same server, with the same test client (which is somewhat more powerful), gives a result of 12 Gbit/sec, clearly better than 900 Mbit/sec! (Each machine has two Intel 10Gbit/sec NICs connected with LACP to the switch, load-balancing on TCP port number.) Granted, if you did this kind of test using real users, I doubt they'd get a very good experience; it was dropping bytes like crazy since it couldn't get the bytes quickly enough to the client (and I don't think it was the client that was the problem, although that machine was also clearly very very heavily loaded). At this point, the problem is almost entirely about kernel scalability; less than 1% is spent in userspace, and you need a fair amount of mucking around with multiple NIC queues to get the right packets to the right processor without them stepping too much on each other's toes. (Check out /usr/src/linux/Documentation/networking/scaling.txt for some essential tips here.)

And now, finally, what happens if you enable our HTB setup? Unfortunately, it doesn't really go well; the nice 12 Gbit/sec drops to 3.5–4 Gbit/sec! Some of this is just increased amounts of packet processing (for instance, the two iptables rules we need to mark non-video traffic alone take the speed down from 12 to 8), but it also pretty much shows that HTB doesn't scale: A lot of time is spent in locking routines, probably the different CPUs fighting over locks on the HTB buckets. In a sense, it's maybe not so surprising when you look at what HTB really does; you can't process each packet independently, since the entire point is to delay packets based on other packets. A more welcome result is that setting up a single fq_codel qdisc on the interface hardly mattered at all; it went down from 12 to 11.7 or something, but inter-run variation was so high that this is basically just noise. I have no idea if it actually had any effect at all, but it's at least good to know that it doesn't do any harm.

So, the conclusion is: Using HTB to shape works well, but it doesn't scale. (Nevertheless, I'll eventually post our scripts and the VLC patch here. Have some patience, though; there's a lot of cleanup to do after TG, and only so much time/energy.) Also, VLC only scales up to a thousand clients or so; after that, you want Cubemap. Or Wowza. Or Adobe Media Server. Or nginx-rtmp, if you want RTMP. Or… or… or… My head spins.

[00:14]

Steinar H. Gunderson <sgunderson@bigfoot.com>