This should have been obvious, but seemingly it's not (perf is amazingly
undocumented, and has this huge lex/yacc grammar for its command-line
parsing), so here goes:
If you want precise cache miss data from perf (where “precise” means using
PEBS, so that it gets attributed to the actual load and not some random
instruction a few cycles later), you cannot use “cache-misses:pp” since
“cache-misses” on Intel maps to some event that's not PEBS-capable.
Instead, you'll have to use “perf record -e r10cb:pp”. The catch is that
“perf list” very much suggests that what you want is rcb10 and not r10cb,
but that's not how the event is actually encoded.
FWIW, this event counts LLC misses, so it's really things that go either to
another socket (less likely) or to DRAM (more likely). You can change the 10
to something else (see “perf list”) if you want e.g. L2 hits.
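In other words, something like this (the profiled command is just a placeholder, and the exact event code may differ between Intel generations, so check “perf list” and your CPU's documentation before copying it blindly):

  perf record -e r10cb:pp ./some-program    # precise (PEBS) LLC-miss samples
  perf report                               # misses attributed to the actual loads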
With The Gathering 2013 well behind us, I
wanted to write a follow-up to my earlier posts on video streaming.
Some of you might recall that we identified an issue at TG12, where the video
streaming (to external users) suffered simply because our network was too
fast: bursting frames to users at 10 Gbit/sec overloads buffers in the
down-conversion to lower speeds, causing packet loss, which triggers new
bursts, sending the TCP connection into a spiral of death.
Lacking proper TCP pacing in the Linux kernel, the workaround was simple
but rather ugly: Set up a bunch of HTB buckets (literally thousands),
put each client in a different bucket, and shape each bucket to approximately
the stream bitrate (plus some wiggle room for retransmits and bitrate peaks,
although the latter are kept under control by the encoder settings).
This requires a fair amount of cooperation from VLC, which we use as both
encoder and reflector; it needs to assign a unique mark (fwmark) to each
connection, which tc can then use to put the client into the right HTB
bucket.
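To give an idea of what that looks like, here is a minimal sketch (the interface, rates and mark value are made-up placeholders rather than our actual TG configuration, and the class/filter pair is repeated for every client):

  # one root HTB qdisc on the outgoing interface
  tc qdisc add dev eth0 root handle 1: htb
  # one class per client, shaped to roughly the stream bitrate plus wiggle room
  tc class add dev eth0 parent 1: classid 1:100 htb rate 3mbit ceil 4mbit
  # put packets carrying fwmark 100 (as assigned by VLC) into that class
  tc filter add dev eth0 parent 1: protocol ip prio 1 handle 100 fw flowid 1:100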
Although we didn't collect systematic user experience data (apart from my own
tests done earlier, streaming from Norway to Switzerland), it's pretty clear
that the effect was as hoped for: Users who had reported quality for a given
stream as “totally unusable” now reported it as “perfect”. (Well, at first
it didn't seem to have much effect, but that was due to packet loss caused by
a faulty switch supervisor module. It just goes to show that real-world
testing can be very tricky. :-) )
However, suddenly this happened on the stage:
which led to this happening to the stream load:
and users, especially ones external to the hall, reported things breaking up
again. It was obvious that the load (1300 clients, or about 2.1 Gbit/sec) had something to do
with it, but the server wasn't out of CPU—in fact, we killed a few other
streams and hung processes, freeing up three or so cores, without any effect.
So what was going on?
At the time, we didn't get to dig deep enough into it before the load
had lessened; perf didn't give an obvious answer (even though HTB is
known to be a CPU hog, it didn't figure high up in the list), and the
little tuning we tried (including removing HTB) didn't really help.
It wasn't until this weekend, when I finally got access to a lab with 10gig
equipment (thanks, Michael!), that I could verify my suspicions: VLC's HTTP server is
single-threaded, and not particularly efficient at that. In fact, on the lab
server, which is a bit slower than what we had at TG (4x2.0GHz Nehalem versus
6x3.2GHz Sandy Bridge), the most I could get out of VLC was 900 Mbit/sec, not
2.1 Gbit/sec! Clearly we were a bit lucky both with our hardware and in having
more than one stream (VLC vs. Flash) to spread the load across. HTB was
not the culprit here, since this test was run entirely without HTB, and the
server wasn't doing anything else at all.
(It should be said that this test is nowhere near 100% exact, since the
server was
only talking to one other machine, connected directly to the same switch,
but it would seem a very likely bottleneck, so in lieu of $100k worth of
testing equipment and/or a very complex netem setup, I'll accept it as
the explanation until proven otherwise. :-) )
So, how far can you go, without switching streaming platforms entirely?
The answer comes in the form of
Cubemap, a replacement
reflector I've been writing over the last week or so. It's multi-threaded,
much more efficient (using epoll and sendfile—yes, sendfile), and also
is more robust due to being less intelligent (VLC needs to demux and remux
the entire signal to reflect it, which doesn't always go well for more
esoteric signals; in particular, we've seen issues with the Flash video mux).
Running Cubemap on the same server, with the same test client (which is
somewhat more powerful), gives a result of 12 Gbit/sec—clearly better than
900 Mbit/sec! (Each machine has two Intel 10Gbit/sec NICs connected with LACP
to the switch, load-balancing on TCP port number.) Granted, if you did this kind of test with real users, I doubt
they'd get a very good experience; it was dropping bytes like crazy since it
couldn't get the bytes quickly enough to the client (and I don't think it was
the client that was the problem, although that machine was also clearly very
very heavily loaded). At this point, the problem is
almost entirely about kernel scalability; less than 1% is spent in userspace,
and you need a fair amount of mucking around with multiple NIC queues to get
the right packets to the right processor without them stepping too much on
each others' toes. (Check out
/usr/src/linux/Documentation/networking/scaling.txt
for some essential tips here.)
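To give a flavor of the mucking around involved (the IRQ number, interface name and CPU masks below are placeholders; the right values depend entirely on your NIC and core layout):

  # pin a given NIC queue's interrupt to one core (IRQ numbers from /proc/interrupts)
  echo 2 > /proc/irq/123/smp_affinity
  # or steer packets for a queue in software via RPS/XPS
  echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
  echo f > /sys/class/net/eth0/queues/tx-0/xps_cpus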
And now, finally, what happens if you enable our HTB setup? Unfortunately,
it doesn't really go well; the nice 12 Gbit/sec drops to 3.5–4 Gbit/sec!
Some of this is just increased amounts of packet processing (for instance,
the two iptables rules we need to mark non-video traffic alone take the
speed down from 12 to 8 Gbit/sec), but it also pretty much shows that HTB doesn't
scale: A lot of time is spent in locking routines, probably the different
CPUs fighting over locks on the HTB buckets. In a sense, it's maybe not
so surprising when you look at what HTB really does; you can't process
each packet independently, since the entire point is to delay packets based
on other packets. A more welcome result is that setting up a single fq_codel
qdisc on the interface hardly mattered at all; it went down from 12 to
11.7 or something, but the inter-run variation was so high that this is
basically just noise. I have no idea if it actually had any effect at all, but it's
at least good to know that it doesn't do any harm.
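For reference, the fq_codel test is nothing more exotic than replacing the root qdisc (eth0 again being a placeholder for the actual interface):

  tc qdisc replace dev eth0 root fq_codel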
So, the conclusion is: Using HTB to shape works well, but it doesn't
scale. (Nevertheless, I'll eventually post our scripts and the VLC patch
here. Have some patience, though; there's a lot of cleanup to do after
TG, and only so much time/energy.) Also, VLC only scales up to a thousand
clients or so; after that, you want Cubemap. Or Wowza. Or Adobe Media Server.
Or nginx-rtmp, if you want RTMP. Or… or… or… My head spins.