Steinar H. Gunderson

Tue, 28 Feb 2012 - Code dive: Mantis DVB-S2 driver

I recently had a run-in with the driver for a DVB-S2 card; you know, one of the things that you can connect to a satellite dish and use to watch TV with. (I wanted to use it to demodulate and decrypt an entire transponder and blast each channel out over multicast.) The problem in question was that I couldn't get it to decrypt channels in a stable fashion, especially with SMP, and some searching showed that this was a common problem. So, I thought I'd get my hands dirty and dig into the kernel source, and I thought it would be interesting to write down some of my findings, primarily because it gives some insight into how Linux deals with typical hardware these days. (As usual, I might be completely wrong; if I were an expert in these matters, I probably wouldn't find this interesting enough to blog about!)

The card's product name is Terratec Cinergy DVB-S2, but internally it's called Mantis, as are many other cards that are essentially clones with different PCI IDs. There is already a Mantis driver in Linux mainline (since 2.6.33), but there are also, as far as I know, two different out-of-tree drivers (one of them is the same as the one that was merged into mainline, but I don't know to what degree they have diverged).

When you get down to the driver level, it's actually less useful to think of the device as “a card”; it's more a collection of devices bound together by some glue: the Mantis PCI bridge itself, the STB0899 demodulator (the “frontend”), which is talked to over I²C, the CAM slot that holds the decryption module, an IR receiver, and so on.

Now, if you were writing a Windows driver, you'd need to write code for all of these. However, in Linux, component drivers are typically shared; the STB0899 frontend already has a stable driver in Linux, there's an entire I²C subsystem that the STB0899 driver talks to, there's a CAM subsystem that knows how to talk to and poll CAMs (if you can give it some hooks to talk to), there's an IR subsystem, and so on. And of course, there's a standardized DVB subsystem, so that userspace largely doesn't need to care what kind of card it's talking to. (If you're a card manufacturer, perhaps this is not really a positive, though, as you can no longer distinguish your product through included software in the same way. I'm sure Windows has some sort of standardized DVB subsystem too, but bundled software is probably a much more visible part of the product for most people.)

So what's left is largely the glue code, which means there's a whole lot less code to worry about. The Mantis driver is about 4000 lines of code, which covers at least seven different designs (many of them for cable rather than satellite); in comparison, the driver for the STB0899 frontend alone is about 3200 lines. But let's take a look at how the PC communicates with the Mantis card.

Like much other modern hardware, the Mantis card is generally operated by memory-mapped I/O (MMIO). Basically, you map a portion of the address space to the card, and any read or write to that area goes over the PCI bus, where it is essentially treated as a command. (In other words, the memory in question doesn't behave like regular RAM in any reasonable way.) Short of the DVB stream itself, which is DMAed, there's really nothing here that needs a lot of bandwidth, so one can deal with a pretty simplistic scheme.
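
For illustration, here's a minimal sketch of what that access pattern usually looks like in Linux kernel code; the real Mantis driver wraps its accesses in similar mmread()/mmwrite() macros, but the exact shape here (the global mmio pointer in particular) is my simplification, not the driver's actual code:

#include <linux/io.h>

static void __iomem *mmio;  /* mapped from the card's PCI BAR at probe time */

/* Each of these is a single read or write transaction on the PCI bus. */
static inline u32 mmread(u32 reg)
{
        return readl(mmio + reg);
}

static inline void mmwrite(u32 val, u32 reg)
{
        writel(val, mmio + reg);
}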

The way you communicate with anything, say, the CAM, is thus to do it one byte at a time. There's no such thing as truly synchronous I/O at this level (that's really just an illusion anyway), so you'll need some communication back and forth. So, you poke an MMIO address with the first byte, and then after a while (say, a few microseconds) the hardware will have processed that byte, and an “operation done” bit will show up in a status register (MANTIS_GPIF_STATUS).
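
The naive way to wait for that bit is something like the following sketch (GPIF_STATUS_DONE is a bit name I made up for illustration; the real bit definitions live in the driver's headers):

#include <linux/delay.h>

static void gpif_wait_done_naive(void)
{
        /* Every iteration is another read transaction on the PCI bus. */
        while (!(mmread(MANTIS_GPIF_STATUS) & GPIF_STATUS_DONE))
                udelay(2);
}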

However, reading that status register in a busy loop until the operation is done is wasteful; for one, the host CPU could perhaps be doing something more useful, and besides, every poll creates traffic across the PCI bus, which can take up space you'd want for the DMA operations (or for other cards). So instead, you can ask (by poking a different MMIO port, namely MANTIS_GPIF_IRQCFG) that the card raise an interrupt line whenever there's a change in the status, and then go to sleep (i.e., let some userspace process or other kernel thread use the CPU). When the card signals the IRQ line, the CPU jumps directly into the interrupt handler, where you can poll away to your heart's content. It's a bit roundabout, though; first, you poll MANTIS_INT_STAT, which contains a lot of other status bits (including “new data ready for DMA”, a bit set by the card's RISC mini-programs that drive the DMA), and if MANTIS_INT_IRQ0 is set, you should go poll MANTIS_GPIF_STATUS, since it's probably changed. (I guess this is related to how the devices are wired up internally on the card, in that IRQ0 really is the DVB-S2 frontend asserting an IRQ on the Mantis bridge. IRQ1 is for the serial port. Note that 0 and 1 refer to the IRQ lines on the Mantis bridge, not on your PC; for instance, the Mantis card has only one IRQ on my PC, namely number 20.)
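
As a sketch, the interrupt path just described might look something like this (the register names match the driver, but the struct layout is illustrative, and the real handler deals with many more status bits than just IRQ0):

#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct mantis_pci {
        struct work_struct gpif_work;  /* processes GPIF status changes */
        /* ... lots more per-card state ... */
};

static irqreturn_t mantis_irq(int irq, void *dev_id)
{
        struct mantis_pci *mantis = dev_id;
        u32 stat = mmread(MANTIS_INT_STAT);

        if (!(stat & MANTIS_INT_IRQ0))
                return IRQ_NONE;  /* not the frontend; DMA, serial, etc. */

        mmwrite(stat, MANTIS_INT_STAT);     /* acknowledge what we saw */
        schedule_work(&mantis->gpif_work);  /* defer the GPIF polling */
        return IRQ_HANDLED;
}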

Now, you don't really want to do a lot of work in your interrupt handler, mainly due to latency issues (interrupt handlers are never preemptible), so after polling and determining that the IRQ0 bit is set, the interrupt handler defers to a work queue, a kind of worker thread, which can run at its leisure, on another CPU if needed. (There are also tasklets, which are similar but come with different guarantees; the Mantis driver uses tasklets to process data coming in from DMA.) The worker thread processing the work queue, when it eventually is scheduled, will read the GPIF status, figure out that the operation is done, and wake up the thread that wrote the byte in the first place. That thread can then poke the next byte, and so on, until it's done. Fortunately, the PCI bus is pretty fast, and we only need to poke something like four or five bytes every now and then (plus some length bytes), so it doesn't take long. Reading data works in a similar fashion; you poke an address into an MMIO register, and after a while the result is placed in another MMIO register, the IRQ0 status bit is set, and the Mantis IRQ is raised.
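
Putting the pieces together, the write-a-byte round trip might be sketched like this (again with made-up gpif_* names; the real driver keeps this state in its per-card struct, and the 500 ms timeout is the one the hack further down reduces):

#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(gpif_waitq);
static int gpif_done;

static int gpif_write_byte(u32 reg, u8 data)
{
        gpif_done = 0;
        mmwrite(data, reg);  /* kicks off the operation on the card */

        /* Sleep until the work queue reports the status change. */
        if (!wait_event_timeout(gpif_waitq, gpif_done,
                                msecs_to_jiffies(500)))
                return -ETIMEDOUT;
        return 0;
}

/* Called from the work queue once MANTIS_GPIF_STATUS says we're done. */
static void gpif_complete(void)
{
        gpif_done = 1;
        wake_up(&gpif_waitq);
}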

This is the theory. Unfortunately, in my case, when trying to poll for new data, the IRQ0 bit would often not be asserted for a long time (several hundred milliseconds), if ever. This seemed to happen when the Mantis bridge was busy spewing interrupts at the CPU for other reasons, like ongoing DMA transfers; it might also be a bug somewhere in the driver causing it to clear out the IRQ0 bit before it's ever read. Anyway, no matter the cause, this would cause components that tried to read from the CAM to think the CAM was dead, and either give up or try to reset it, neither of which is very good. I still don't understand why it seems to happen a lot more often on SMP than on non-SMP; maybe there's a timing issue of some sort. (It would seem there were also some SMP-related bugs in the driver itself, but I think I've fixed most of those.)

So, what's the brilliant fix? Well, it's really a hack taken from the mailing list: reduce the timeout from 500 ms to 2 ms, and if it times out, just ignore the error, go on, and hope the data has ended up in the right place. It's an unfortunate hack, but in lieu of a proper fix, at least it gives me rock-steady CAM operation. (Due to a bug, timeouts were always handled as a success anyway.)

I've submitted the fixes (or rather, collections of hacks) in a message to the linux-media list; we'll see if anybody finds them interesting.

[18:57] | | Code dive: Mantis DVB-S2 driver

Fri, 10 Feb 2012 - Rant: Your custom rounding code is wrong

Maxims (or “TL;DR” if you want):

- Your custom rounding code is wrong.
- Just call lrintf(); you are almost certainly not going to beat it.
- Rounding twice is doomed to fail.

Why?

So, let's start off by assuming you have a float and you want to convert that to an int. (Or maybe you have five million of them.) I'll assume a more or less x86/Linux environment, although of course, the real world is a bit more complex.

OK, so you've decided to write your own rounding function.

Anyway, first of all, what are you going to do about the floating-point values that don't represent real numbers? Positive and negative infinity? NaN? OK, maybe only floating-point buffs ever care about those. (I'm not one of them.) Maybe it would be nice to have some sort of consistent behavior for, say, the value 2^48, but maybe you also don't care, since you have a certain amount of control over where these numbers come from anyway. Sure.

So, you write your function. Most likely, someone somewhere taught you that you can do proper rounding by adding 0.5, so you write:

return (int)(x + 0.5f);

Now, of course, the (int) cast is a rounding in itself, in truncation mode (this is, by the way, my single least favorite decision by K&R). If you're unlucky enough to be on 32-bit, this makes the compiler spew out tons of stuff for changing the rounding mode, so you've pretty much lost already. Perhaps you write something similar using inline assembler and the x87 FISTP instruction; now you've just made sure that on 64-bit, all your floats have to round-trip from SSE to x87 just to get to ints. (Or maybe you use some dodgy trick you learned from a web forum at some point and thought was exceedingly cool and elegant, which destroys your range. Who knows.)

OK, so maybe you can get the int-truncate step fast and correct. After all, CPUs eventually got their own instructions for convert-to-int-with-truncate, since C has made that operation so common. So your function converts 0.4f to 0, and 0.5f to 1. Fine. Now, what do you do with negative values? -0.5f becomes 0, which means you just violated the identity round(-x) = -round(x). (Even worse, -0.6f becomes 0.) OK, maybe you don't care about that (in fact, maybe you think, correctly or not, that this is exactly what you want). Maybe you change your inline assembler to test whether the number is negative, and add -0.5f instead, in which case you've just incurred a (costly) branch. But OK, let's say you're happy about the behavior with negative numbers.
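
If you want to see the asymmetry for yourself, here's a tiny self-contained check (bad_round is my name for the add-0.5 function above; it relies only on the C-mandated truncation of the (int) cast):

#include <stdio.h>

static int bad_round(float x)
{
        return (int)(x + 0.5f);
}

int main(void)
{
        printf("%d %d %d\n",
               bad_round(0.5f), bad_round(-0.5f), bad_round(-0.6f));
        /* Prints: 1 0 0, so round(-x) != -round(x). */
        return 0;
}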

Maybe you also don't worry that this introduces a bias towards rounding up, both for positive and negative numbers; slightly more floats will be rounded up than down. After all, you're most likely not in the hair-splitting business. It should be noted, though, that IEEE 754 has a solution for this, called “round-to-nearest-even”, where numbers exactly halfway between two integers are rounded to the nearest even integer, avoiding just this bias (-1.5f → -2, -0.5f → 0, 0.5f → 0, 1.5f → 2, and so on).

OK, but even if you're happy with your behavior for numbers ending in exactly .5, at least it should round the right way for things between 0.0 and 0.5, right? Here comes another maxim: Rounding twice is doomed to fail. This might seem obvious, but not everybody understands it; when I was ten, I had a maths teacher who insisted that 1.46 could be rounded to 2 (“I wouldn't grade that as an error”, in her own words), because you could round 1.46 → 1.5 → 2.0. But whenever you do an add, you also do a round, so you do get rounding twice. The case in point is ~0.4999999701976776123046875, the largest floating-point number below 0.5. If you add 0.5 to this, you'll end up with 0.9999999701976776…, except that you don't have enough mantissa bits to represent that exactly, so it gets rounded off to 1.0, and now your function yields round(0.49999997) = 1! Oops.
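
This one is easy to reproduce; a minimal sketch, assuming x86-64 (where plain float arithmetic isn't done in extended precision) and compiled with something like gcc -O2 demo.c -lm. lrintf() is the standard-library function the next paragraph gets to:

#include <math.h>
#include <stdio.h>

int main(void)
{
        /* The largest float below 0.5, i.e. 0.4999999701976776... */
        float x = nextafterf(0.5f, 0.0f);

        printf("x + 0.5f == 1.0f? %d\n", (x + 0.5f) == 1.0f);  /* 1 */
        printf("(int)(x + 0.5f) = %d, lrintf(x) = %ld\n",
               (int)(x + 0.5f), lrintf(x));  /* 1 vs. 0 */
        return 0;
}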

OK. So maybe you don't care that your function rounds the wrong way every now and then. (And if you think this won't ever happen: I've seen it happen in real code. For doubles.) Sometimes speed is more important than correctness. But now you've basically said that you do not care, so allow me to suggest just calling lrintf(), the standard library's function for converting floats to ints. The x86-64 implementation looks like this:

cvtss2si %xmm0, %eax
ret

If you run with -fno-math-errno (implied by -ffast-math), it will even get inlined to this single instruction! ¹ Do you really think you can beat that? It will even follow whatever rounding mode you've set, which is most likely the default, sane, bias-free round-to-even, with no issues around 0.499999975 or similar madness. The inlined version will take you all of three cycles (at least two of which can be overlapped with other stuff), and there's no way on Earth you can beat that for speed. (I mean, your CPU has special circuitry for doing this. If adding 0.5 were the fastest way, maybe it would just do that internally.) You'd think people would have noticed this and stopped writing their own functions that are slower than what the standard library gives them, but a) maybe the code was written ten years ago on an entirely different platform where this was not the case, and b) people generally don't measure their code; a profiler seems to be a completely unheard-of tool for most developers, even those who write performance-critical code, so they miss the fact that 70% of their time is spent in their brilliant inlined rounding function.
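
To see the bias-free behavior in action, here's a quick usage sketch (the build line is a suggestion; -lm is needed for lrintf on most Linux setups):

/* gcc -O2 -fno-math-errno roundtest.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
        printf("%ld %ld %ld %ld\n",
               lrintf(-1.5f), lrintf(-0.5f), lrintf(0.5f), lrintf(1.5f));
        /* Under the default round-to-nearest-even mode: -2 0 0 2 */
        return 0;
}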

Granted, this is the most dodgy part of my assertions; for instance, if you're on ARM, lrintf() is vastly more complex, and the compiler matters here. (I did take a reservation. :-) ) But even if so, the other maxims really come into play; if you really care about speed, the right thing to do here on ARM is to do a single round-to-int instruction (VCVTR), not to add 0.5 as a fudge factor.

¹ Unfortunately, GCC won't output this sequence without -fno-math-errno, due to language-lawyering reasons; basically, it is seemingly not allowed to emit a conversion that doesn't set errno on overflow, even though glibc is, since glibc has communicated this to the program through other means.

Update: Sam pointed out that the constant in question is not 0.499999975, but slightly larger.

[19:20] | | Rant: Your custom rounding code is wrong

Steinar H. Gunderson <sgunderson@bigfoot.com>