Steinar H. Gunderson

Wed, 26 Oct 2016 - Why does software development take so long?

Nageru 1.4.0 is out (and on its way through the Debian upload process right now), so now you can do live video mixing with multichannel audio to your heart's content. I've already blogged about most of the interesting new features, so instead, I'm trying to answer a question: What took so long?

To be clear, I'm not saying 1.4.0 took more time than I really anticipated (on the contrary, I pretty much understood the scope from the beginning, and there was a reason why I didn't go for building this stuff into 1.0.0); but if you just look at the changelog from the outside, it's not immediately obvious why “multichannel audio support” should take the better part of three months of development. What I'm going to say is of course going to be obvious to most software developers, but not everyone is one, and perhaps my experiences will be illuminating.

Let's first look at some obvious things that aren't the case: First of all, development is not primarily limited by typing speed. There are about 9,000 lines of new code in 1.4.0 (depending a bit on how you count), and if it were just about typing them in, I would be done in a day or two. On a good keyboard, I can type plain text at more than 800 characters per minute—but you hardly ever write code for even a single minute at that speed. Just as when writing a novel, most time is spent thinking, not typing.

I also didn't spend a lot of time backtracking; most code I wrote actually ended up in the finished product as opposed to being thrown away. (I'm not as lucky in all of my projects.) It's pretty common to do so if you're in an exploratory phase, but in this case, I had a pretty good idea of what I wanted to do right from the start, and that plan seemed to work. This wasn't a difficult project per se; it just needed to be done (which, in a sense, just increases the mystery).

However, even if this isn't at the forefront of science in any way (most code in the world is pretty pedestrian, after all), there's still a lot of decisions to make, on several levels of abstraction. And a lot of those decisions depend on information gathering beforehand. Let's take a look at an example from late in the development cycle, namely support for using MIDI controllers instead of the mouse to control the various widgets.

I've kept a pretty meticulous TODO list; it's just a text file on my laptop, but it serves the purpose of a ghetto bugtracker. For 1.4.0, it contains 83 work items (all but a single-digit number of them are ticked off; the rest are mostly things I decided not to do), which corresponds roughly 1:2 to the number of commits. So let's have a look at what the ~20 MIDI controller items went into.

First of all, to allow MIDI controllers to influence the UI, we need a way of getting to them. Since Nageru is Linux-only, ALSA is the obvious choice (if not, I'd probably have to look for a library to put in-between), but ALSA, as it turns out, has two interfaces (raw MIDI and sequencer). Which one do you want? Raw MIDI sounds like what we want, but it's actually the sequencer interface (it does more of the MIDI parsing for you, and is generally friendlier).

The first question is where to start picking events from. I went the simplest path and just said I wanted all events—anything else would necessitate a UI, a command-line flag, figuring out if we wanted to distinguish between different devices with the same name (and not all devices potentially even have names), and so on. But how do you enumerate devices? (Relatively simple, thankfully.) What do you do if the user inserts a new one while Nageru is running? (Turns out there's a special device you can subscribe to that will tell you about new devices.) What if you get an error on subscription? (Just print a warning and ignore it; it's legitimate not to have access to all devices on the system. By the way, for PCM devices, all of these answers are different.)

So now we have a sequencer device; how do we get events from it? Can we do it in the main loop? Turns out it probably doesn't integrate too well with Qt, but it's easy enough to put it in a thread. The class dealing with the MIDI handling now needs locking; what mutex granularity do we want? (Experience will tell you that you nearly always want just one mutex. Two mutexes give you all sorts of headaches with ordering them, and nearly never give any gain.) ALSA expects us to poll() a given set of descriptors for data, but on shutdown, how do you break out of that poll to tell the thread to go away? (The simplest way on Linux is an eventfd.)

There's a quirk where if you get two or more MIDI messages right after each other and only read one, poll() won't trigger to alert you there are more left. Did you know that? (I didn't. I also can't find it documented. Perhaps it's a bug?) It took me some looking into sample code to find it. Oh, and ALSA uses POSIX error codes to signal errors (like “nothing more is available”), but it doesn't use errno.

OK, so you have events (like “controller 3 was set to value 47”); what do you do about them? The meaning of the controller numbers is different from device to device, and there's no open format for describing them. So I had to make a format describing the mapping; I used protobuf (I have lots of experience with it) to make a simple text-based format, but it's obviously a nightmare to set up 50+ controllers by hand in a text file, so I had to make a UI for this. My initial thought was making a grid of spinners (similar to how the input mapping dialog already worked), but then I realized that there isn't an easy way to make headings in Qt's grid. (You can substitute a label widget for a single cell, but not for an entire row. Who knew?) So after some searching, I found out that it would be better to have a tree view (Qt Creator does this), and then you can treat it more or less as a table for the rows that should be editable.
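To make the idea concrete, a text-format protobuf mapping might look something like this — the field names here are invented for illustration, and Nageru's real schema may differ:

```
# Hypothetical protobuf text format for a MIDI mapping.
# One block per bus; each widget gets a controller number.
bus_mapping {
  bus_index: 0
  fader:  { controller_number: 7 }
  treble: { controller_number: 74 }
  bass:   { controller_number: 71 }
}
```

Text-format protobuf gets you parsing, validation, and forward compatibility essentially for free, at the cost of verbosity — hence the need for an editor UI.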

Of course, guessing controller numbers is impossible even in an editor, so I wanted it to respond to MIDI events. This means the editor needs to take over the role as MIDI receiver from the main UI. How do you do that in a thread-safe way? (Reuse the existing mutex; you don't generally want to use atomics for complicated things.) Thinking about it, shouldn't the MIDI mapper just support multiple receivers at a time? (Doubtful; you don't want your random controller fiddling during setup to actually influence the audio on a running stream. And would you use the old or the new mapping?)

And do you really need to set up every single controller for each bus, given that the mapping is pretty much guaranteed to be similar across them? A “guess bus” button doesn't seem too difficult: if you have one correctly set-up controller on the bus, it can guess the rest from a neighboring bus (assuming a static offset). But what if there's conflicting information? OK; then you should disable the button. So now the enable/disable status of that button depends on which cell in your grid has the focus; how do you get at those events? (Install an event filter, or subclass the spinner.) And so on, and so on, and so on.

You could argue that most of these questions go away with experience; if you're an expert in a given API, you can answer most of these questions in a minute or two even if you haven't heard the exact question before. But you can't expect even experienced developers to be experts in all possible libraries; if you know everything there is to know about Qt, ALSA, x264, ffmpeg, OpenGL, VA-API, libusb, microhttpd and Lua (in addition to C++11, of course), I'm sure you'd be a great fit for Nageru, but I'd wager that precious few developers fit that bill. I've written C++ for almost 20 years now (almost ten of them professionally), and that experience certainly helps boost productivity, but I can't say I expect a 10x reduction in my own development time at any point.

You could also argue, of course, that spending so much time on the editor is wasted, since most users will only ever see it once. But here's the point: it's not actually a lot of time. The only reason why it seems like so much is that I bothered to write two paragraphs about it; it's not a particular pain point, it just adds to the total. Also, the first impression matters a lot—if the user can't get the editor to work, they also can't get the MIDI controller to work, and are likely to just go do something else.

A common misconception is that just switching languages or using libraries will help you a lot. (Witness the never-ending stream of software that advertises “written in Foo” or “uses Bar” as if it were a feature.) For the former, note that nothing I've said so far is specific to my choice of language (C++), and I've certainly avoided a bunch of battles by making that specific choice over, say, Python. For the latter, note that most of these problems are actually related to library use—libraries are great, and they solve a bunch of problems I'm really glad I didn't have to worry about (how should each button look?), but they still give their own interaction problems. And even when you're a master of your chosen programming environment, things still take time, because you have all those decisions to make on top of your libraries.

Of course, there are cases where libraries really solve your entire problem and your code gets reduced to 100 trivial lines, but that's really only when you're solving a problem that's been solved a million times before. Congrats on making that blog in Rails; I'm sure you're advancing the world. (To make things worse, usually this breaks down when you want to stray ever so slightly from what was intended by the library or framework author. What seems like a perfect match can suddenly become a development trap where you spend more of your time trying to become an expert in working around the given library than actually doing any development.)

The entire thing reminds me of the famous essay No Silver Bullet by Fred Brooks, but perhaps even more so, this quote from John Carmack's .plan has stuck with me (incidentally about mobile game development in 2006, but the basic story still rings true):

To some degree this is already the case on high end BREW phones today. I have a pretty clear idea what a maxed out software renderer would look like for that class of phones, and it wouldn't be the PlayStation-esq 3D graphics that seems to be the standard direction. When I was doing the graphics engine upgrades for BREW, I started along those lines, but after putting in a couple days at it I realized that I just couldn't afford to spend the time to finish the work. "A clear vision" doesn't mean I can necessarily implement it in a very small integral number of days.

In a sense, programming is all about what your program should do in the first place. The “how” question is just the “what”, moved down the chain of abstractions until it ends up where a computer can understand it, and at that point, the three words “multichannel audio support” have become those 9,000 lines that describe in perfect detail what's going on.

[09:29] | | Why does software development take so long?

Tue, 04 Feb 2014 - FOSDEM video stream goodiebag

Borrowing a tradition from TG, we have released a video streaming goodiebag from FOSDEM 2014. In short, it contains all the scripts we used for the streaming part (nothing from the video team itself, although I believe most of what they do is developed out in the open).

If you've read my earlier posts on the subject, you'll know that it's all incredibly rough, and we haven't cleaned it up much afterwards. So you get the truth, but it might not be pretty :-) However, feedback is of course welcome.

You can find it at; it's only 9 kB large. Enjoy!

[23:37] | | FOSDEM video stream goodiebag

Sun, 02 Feb 2014 - FOSDEM video streaming, post-mortem

Wow, what a ride that was. :-)

I'm not sure if people generally are aware of it, but the video streaming at FOSDEM this year came together on extremely short notice. I got word late Wednesday that the video team was overworked and would not have the manpower to worry about streaming, and consequently, that there would be none (probably not even of the main talks, like last year).

I quickly conferred with Berge on IRC; we both agreed that something as big as FOSDEM shouldn't be without at least rudimentary streams. Could we do something about it? After all, all devrooms (save for a few that could not be, due to licensing issues) would be recorded using DVswitch anyway, where it's trivial to just connect another sink to the master, and we both had extensive experience doing streaming work from The Gathering.

So, we agreed to do a stunt project; either it would work or it would crash and burn, but at least it would be within the playful spirit of free software. The world outside does not stand still, and neither should we.

The FOSDEM team agreed to give us access to the streams, and let us use the otherwise unused cycles on the “slave” laptops (the ones that just take in a DV switch from the camera and send it to the master for mixing). Since I work at Google, I was able to talk to the Google Compute Engine people, who were able to turn around on extremely short notice and sponsor GCE resources for the actual video distribution. This took a huge unknown out of the equation for us; since GCE is worldwide and scalable, we'd be sure to have adequate bandwidth for serving our viewers almost no matter how much load we got.

The rest was mainly piecing together existing components in new ways. I dealt with the encoding (on VLC, using WebM, since that's what FOSDEM wanted), hitting one or two really obscure bugs in the process, and Berge dealt with all the setup of distribution (we used cubemap, which had already been tuned for the rather unique needs of WebM during last Debconf), parsing the FOSDEM schedule to provide live program information, and so on. Being a team of two was near-ideal here; we already know each other extremely well from previous work, and despite the frantic pace, everything felt really relaxed and calm.

So, less than 72 hours after the initial “go”, the streaming laptops started coming up in the various devrooms, and I rsynced over my encoding chroot to each of them and fired up VLC, which cubemap would then pick up and send on. And amazingly enough, it worked! We had a peak of about 380 viewers, which is about 80% more than the peak of 212 last year (and this was with almost no announcement before the conference). Amusingly, the most popular stream by far was not a main track, but that of the Go devroom; at times, they had over half the total viewers. (I never got to visit it myself, because it was super-packed every time I went there.)

I won't pretend everything went perfectly—we found a cubemap segfault on the way, and also some other issues (such as initially not properly restarting the encoding when the DVswitch master went down and up again). But I'm extremely happy that the video team believed in us and gave us the chance; it was fun, it was the perfect icebreaker when meeting new people at FOSDEM, and hopefully, we let quite a few people sitting at home learn something new or interesting.

Oh, and not the least, it made my own talk get streamed. :-)

[22:06] | | FOSDEM video streaming, post-mortem

Mon, 19 Aug 2013 - Whole-disk dm-cache

dm-cache is an interesting new technology in the 3.10 kernel onwards; basically, it is a way to use SSDs as a cache layer in front of rotating media, supposedly getting the capacity of the latter and the speed of the former, similar to how the page cache already tries to exploit the good properties of both RAM and disks. (This is, historically, nothing new; for instance, ZFS has had this ability for years, in the form of a patented cache algorithm called L2ARC.)

dm-cache is not the only technology that does this; it competes with, for instance, bcache (also merged in 3.10). However, bcache expects you to format the data volume, which was a no-go in my case: What I wanted, was for dm-cache to sit below my main RAID-6 LVM (which has tons of volumes), without having to erase anything.

This is all a bit raw. Bear with me.

First of all, after a new enough kernel has been installed (you probably want 3.11-rc-something, actually), we want some basic scripts to hook onto initramfs-tools and so on. I used dmcache-tools, and simply converted it to a Debian package with alien. It comes with a tool called dmcache-format-blockdev that tries to partition your block device as an LVM, split into blocks and metadata volumes (seemingly they are separate in case you want e.g. RAID-1 for your metadata only), but I found it to make a metadata volume that was too small for my use. I ended up with 512MB for metadata and then the rest for blocks.

The next part is how to get startup right. First of all, we want an /etc/cachetab so that dmcache-load-cachetab knows how to set up the cache:

cache:/dev/cache/metadata:/dev/cache/blocks:/dev/md1:1024:1 writeback default 4 random_threshold 8 sequential_threshold 512
This gives you a new /dev/mapper/cache that's basically identical to /dev/md1 except faster due to the extra cache. Then, you'll have to tell LVM that it should never try to use /dev/md1 as a physical volume on its own (that would be very bad if the cache had dirty blocks!), so /etc/lvm/lvm.conf needs to contain something like:
    filter = [ "a/md2/", "r/md/", "a/.*/" ]
Note that my SSD RAID is on md2, so I'll need to make an exception for it. LVM aficionados will probably know of something more efficient here (r/md1/ didn't work for me, since there's also /dev/md/1 and possibly others). Then, we need to get everything set up right during boot. This is governed by /sbin/dmcache-load-cachetab. Unfortunately, LVM is not started by udev, but rather late in the process, so /dev/cache/blocks and /dev/cache/metadata are not available when dmcache-load-cachetab runs! I hacked that in, just before the “Devices not ready?” comment, by simply adding the LVM load line used elsewhere in the initramfs:
/sbin/lvm vgchange -aly --ignorelockingfailure
Finally, we need to make sure the hook is installed in the first place. The hook script has a line to check if dm-cache is needed for the root volume, but it's far too simplistic, so I simply changed /usr/share/initramfs-tools/hooks/dmcache so that should_install() always returned true:
should_install() {
        # sesse hack
        echo yes
}

After that, all you need to do is clear the first few kilobytes of the metadata filesystem using dd, update the initramfs, and voila! Cache.

It would seem the code in the kernel is still a bit young; it has memory allocation issues and doesn't cache all that aggressively yet, but most of my writes are already going to the cache, and an increasing amount of reads, so I think this is going to be quite OK in a few revisions.

The integration with Debian could use some work, though =)

[23:24] | | Whole-disk dm-cache

Sun, 28 Apr 2013 - Precise cache miss monitoring with perf

This should have been obvious, but seemingly it's not (perf is amazingly undocumented, and has this huge lex/yacc grammar for its command-line parsing), so here goes:

If you want precise cache miss data from perf (where “precise” means using PEBS, so that it gets attributed to the actual load and not some random instruction a few cycles later), you cannot use “cache-misses:pp”, since “cache-misses” on Intel maps to some event that's not PEBS-capable. Instead, you'll have to use “perf record -e r10cb:pp”. The trick is, apparently, that “perf list” very much suggests that what you want is rcb10 and not r10cb, but that's not the way it's really encoded. (The raw event syntax is r followed by umask and then event code, so r10cb means umask 0x10, event 0xcb.)

FWIW, this is LLC misses, so it's really things that go to either another socket (less likely), or to DRAM (more likely). You can change the 10 to something else (see “perf list”) if you want e.g. L2 hits.

[22:53] | | Precise cache miss monitoring with perf

Mon, 15 Apr 2013 - TG and VLC scalability

With The Gathering 2013 well behind us, I wanted to write a followup to the posts I had on video streaming earlier.

Some of you might recall that we identified an issue at TG12, where the video streaming (to external users) suffered from us simply having too fast a network; bursting frames to users at 10 Gbit/sec overloads buffers in the down-conversion to lower speeds, causing packet loss, which triggers new bursts, sending the TCP connection into a spiral of death.

Lacking proper TCP pacing in the Linux kernel, the workaround was simple but rather ugly: Set up a bunch of HTB buckets (literally thousands), put each client in a different bucket, and shape each bucket to approximately the stream bitrate (plus some wiggle room for retransmits and bitrate peaks, although the latter are kept under control by the encoder settings). This requires a fair amount of cooperation from VLC, which we use as both encoder and reflector; it needs to assign a unique mark (fwmark) to each connection, which then tc can use to put the client into the right HTB bucket.

Although we didn't collect systematic user experience data (apart from my own tests done earlier, streaming from Norway to Switzerland), it's pretty clear that the effect was as hoped for: Users who had reported quality for a given stream as “totally unusable” now reported it as “perfect”. (Well, at first it didn't seem to have much effect, but that was due to packet loss caused by a faulty switch supervisor module. Only shows that real-world testing can be very tricky. :-) )

However, suddenly this happened on the stage:

[Image: cosplayer on stage]

which led to this happening to the stream load:

[Image: graph going up and to the right]

and users, especially ones external to the hall, reported things breaking up again. It was obvious that the load (1300 clients, or about 2.1 Gbit/sec) had something to do with it, but the server wasn't out of CPU—in fact, we killed a few other streams and hung processes, freeing up three or so cores, without any effect. So what was going on?

At the time, we really didn't get to go deep enough into it before the load had lessened; perf didn't really give an obvious answer (even though HTB is known to be a CPU hog, it didn't really figure high up in the list), and the little tuning we tried (including removing HTB) didn't really help.

It wasn't until this weekend, when I finally got access to a lab with 10gig equipment (thanks, Michael!), that I could verify my suspicions: VLC's HTTP server is single-threaded, and not particularly efficient at that. In fact, on the lab server, which is a bit slower than what we had at TG (4x2.0GHz Nehalem versus 6x3.2GHz Sandy Bridge), the most I could get from VLC was 900 Mbit/sec, not 2.1 Gbit/sec! Clearly we were both a bit lucky with our hardware, and that we had more than one stream (VLC vs. Flash) to distribute our load on. HTB was not the culprit, since this was run entirely without HTB, and the server wasn't doing anything else at all.

(It should be said that this test is nowhere near 100% exact, since the server was only talking to one other machine, connected directly to the same switch, but it would seem a very likely bottleneck, so in lieu of $100k worth of testing equipment and/or a very complex netem setup, I'll accept it as the explanation until proven otherwise. :-) )

So, how far can you go, without switching streaming platforms entirely? The answer comes in the form of Cubemap, a replacement reflector I've been writing over the last week or so. It's multi-threaded, much more efficient (using epoll and sendfile—yes, sendfile), and also is more robust due to being less intelligent (VLC needs to demux and remux the entire signal to reflect it, which doesn't always go well for more esoteric signals; in particular, we've seen issues with the Flash video mux).

Running Cubemap on the same server, with the same test client (which is somewhat more powerful), gives a result of 12 Gbit/sec—clearly better than 900 Mbit/sec! (Each machine has two Intel 10Gbit/sec NICs connected with LACP to the switch, and load-balance on TCP port number.) Granted, if you did this kind of test using real users, I doubt they'd get a very good experience; it was dropping bytes like crazy since it couldn't get the bytes quickly enough to the client (and I don't think it was the client that was the problem, although that machine was also clearly very very heavily loaded). At this point, the problem is almost entirely about kernel scalability; less than 1% is spent in userspace, and you need a fair amount of mucking around with multiple NIC queues to get the right packets to the right processor without them stepping too much on each others' toes. (Check out /usr/src/linux/Documentation/networking/scaling.txt for some essential tips here.)

And now, finally, what happens if you enable our HTB setup? Unfortunately, it doesn't really go well; the nice 12 Gbit/sec drops to 3.5–4 Gbit/sec! Some of this is just increased amounts of packet processing (for instance, the two iptables rules we need to mark non-video traffic alone take the speed down from 12 to 8), but it also pretty much shows that HTB doesn't scale: A lot of time is spent in locking routines, probably the different CPUs fighting over locks on the HTB buckets. In a sense, it's maybe not so surprising when you look at what HTB really does; you can't process each packet independently, since the entire point is to delay packets based on other packets. A more welcome result is that setting up a single fq_codel qdisc on the interface hardly mattered at all; it went down from 12 to 11.7 or something, but inter-run variation was so high that this is basically only noise. I have no idea if it actually had any effect at all, but it's at least good to know that it doesn't do any harm.

So, the conclusion is: Using HTB to shape works well, but it doesn't scale. (Nevertheless, I'll eventually post our scripts and the VLC patch here. Have some patience, though; there's a lot of cleanup to do after TG, and only so much time/energy.) Also, VLC only scales up to a thousand clients or so; after that, you want Cubemap. Or Wowza. Or Adobe Media Server. Or nginx-rtmp, if you want RTMP. Or… or… or… My head spins.

[00:14] | | TG and VLC scalability

Mon, 25 Mar 2013 - Net::Telnet::Cisco with SSH

This should be obvious, but I don't really think anybody thought of it before, given that nobody has updated Net::Telnet::Cisco in years, and web search results are really inconclusive :-)

use Net::OpenSSH;
use Net::Telnet::Cisco;

my $ssh = Net::OpenSSH->new($username . ':' . $password . '@' . $hostname, timeout => 60);
if ($ssh->error) {
    print STDERR "$hostname: " . $ssh->error . "\n";
    next;
}

my ($pty, $pid) = $ssh->open2pty({stderr_to_stdout => 1})
    or next;
my $telnet = Net::Telnet::Cisco->new(
    -fhopen => $pty,
    -telnetmode => 0,
    -cmd_remove_mode => 1);

and voila, you can use Net::Telnet::Cisco without actually having to enable telnet on your Cisco router. :-)

[19:37] | | Net::Telnet::Cisco with SSH

Thu, 14 Mar 2013 - Introduction to gamma

Stuck in a suburb of Auckland for the night, mostly due to Air New Zealand. *sigh* Well, OK, maybe I can at least write that blog entry I've been meaning to for a while...

When I wrote about color a month ago, my post included, in a small parenthesis, the following: “Let me ignore the distinction between Y and Y' for now.” Such a small sentence, and so much it hides :-) Let's take a look.

First, let's remember that Y measures the overall brightness, or luminance. Let's ignore the fact that there are multiple frequencies in play (again, sidestepping “what is white?”), and let's just think of them as a bunch of equal photons. If so, there's a very natural way to measure the luminance of a pixel: conceptually, just look at the number of photons emitted per second, and normalize by some value.

However, this is not usually the way we choose to store these values. First of all, note that there's typically not infinite precision when storing pixel data; although we could probably allow ourselves to store full floating point these days (and we sometimes do, although it's not very common), back in the day, when all of these conventions were effectively decided, we certainly could not. You had a fixed number of bits to represent the different gray tones, and even today's eight bits (giving 256 distinct levels, bordering on the limits of what the human eye can distinguish) was a far-fetched luxury.

So, can we quantize linearly to 256 levels and just be done with it? The answer is no, and there are two good reasons why not. The first has to do, as so many things, with how our visual system works. Let's take a look at a chart that I shamelessly stole from Anti-Grain Geometry:

To quote AGG: “On the right there are two pixels and we can credibly say that they emit two times more photons pre (sic) second than the pixel on the left.” Yet, it doesn't really appear twice as bright! (What does “twice as bright” really mean, by the way? I don't know, but there's some sort of intuitive notion of it. In any case, we could rephrase the question in terms of being capable of distinguishing between different levels, but it just complicates things.)

So, the eye's response to luminance is not linear, but more like the square root (actually, more like the exponent of 1/2.2 or 1/2.4). Thus, if we want to quantize luminance into N (for instance 256) distinct levels, we'd better not space them out linearly; let's instead do x^(1/2.2) (or something similar) and then quantize linearly. This is equivalent to a non-uniform quantizer; we say that we have encoded the signal with gamma 2.2. (In reality, we don't use exactly this, but it's close, and the reasons are more of electrical than perceptual nature.) Also, to distinguish this gamma-compressed representation of the luminance from the actual (linear) luminance Y, we now add a little prime to the symbol, and say that Y' is the luma.

The other reason is a very interesting coincidence. A CRT monitor takes in an input voltage and outputs (through some electronics controlling an electron gun, lighting up phosphor) luminance. However, the output luminance is not linearly dependent on the input voltage; it's more like the square! (This has nothing to do with the phosphor, by the way; it's the electrical circuits behind it. It's partially by coincidence and partially by engineering.) In other words, a CRT doesn't even need to undo the gamma-compressed quantization; it can just take the signal in and push it through the circuit, and get the intended luminance back out.

Of course, LCDs don't work that way anymore, but by the time they became commonplace, the convention was already firmly in place, and again, the perceptual reasons still apply.

Now, what does this mean for pixel processing, and Movit in particular? Noting that many of the filters we typically apply to our videos (say, blur) are physical processes that work on light, and that light behaves quite linearly, it's quite obvious that we want to process luminance, not some arbitrarily-compressed version of it. But this is not what most software does. Most software just takes the gamma-encoded RGB values (you encode the three channels separately) and does mathematics on them as if they represented linear values, which ends up being subtly wrong in some cases and massively wrong in others. There's an article by Eric Brasseur that has tons of detail about this if you care, but in general, I can say that correct processing is more the exception than the norm.

So, what does Movit do? The answer is quite obvious: Convert to linear values on the input side (by applying the right gamma curve; something like x^2.2 for each color channel), do the processing, and then compress back again afterwards. (Movit works in 16-bit and 32-bit floating point internally, by virtue of it being supported and fast in modern GPUs, so we don't have problems with quantization that you'd get in 8-bit fixed point.) Actually, it's a bit more complex than that, since some filters don't really care (e.g., if you just want to flip an image vertically, who cares about gamma), but the general rule is:

If you want to do more with pixels than moving them around (especially combining two or more, or doing arithmetic on them), you want to work in linear gamma.

There, I said it. And now to try to get dinner before getting up at 5am tomorrow (which is 2am on my internal clock, since I just arrived from Tokyo). Gah.

[07:01] | | Introduction to gamma

Sun, 24 Feb 2013 - IPv6 reverse generation

We recently renumbered (for the first time, and I hope the last), and in the process, the question of IPv6 addressing came up: how do you assign static IPv6 addresses within a given /64?

I won't be going into the full discussion of the various different strategies, but I'll say that one element of the solution chosen was that if you had an IPv4 address ending in .123, you'd also get an IPv6 address ending in ::123. (IPv6-only hostnames would be handled differently.)

But then, how do you make sure the reverses are in sync? For some reason, BIND doesn't have a good way of synthesizing a PTR name from an IPv6 address, so you're stuck with typing everything in by hand and hoping you got it all right. It's a pain.
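For reference, the PTR name in question is simply all 32 nibbles of the expanded address, reversed, under ip6.arpa. A quick sketch with Python's ipaddress module (using the 2001:67c:29f4:: prefix that appears in the example below) shows the shape of it:

```python
import ipaddress

# A host whose IPv4 address ends in .123 gets an IPv6 address ending in ::123.
addr = ipaddress.IPv6Address('2001:67c:29f4::123')
print(addr.reverse_pointer)
# 3.2.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.4.f.9.2.c.7.6.0.1.0.0.2.ip6.arpa
```

Note how the last nonzero nibbles come out as 3.2.1: exactly the labels the $GENERATE lines below match on.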

So instead, behold:


$GENERATE 1-9   $.0.0   CNAME   $
$GENERATE 0-9   $.1.0   CNAME   1$
$GENERATE 0-9   $.2.0   CNAME   2$
$GENERATE 0-9   $.3.0   CNAME   3$
$GENERATE 0-9   $.4.0   CNAME   4$
$GENERATE 0-9   $.5.0   CNAME   5$
$GENERATE 0-9   $.6.0   CNAME   6$
$GENERATE 0-9   $.7.0   CNAME   7$
$GENERATE 0-9   $.8.0   CNAME   8$
$GENERATE 0-9   $.9.0   CNAME   9$
$GENERATE 0-9   $.0.1   CNAME   10$
$GENERATE 0-9   $.1.1   CNAME   11$
$GENERATE 0-9   $.2.1   CNAME   12$
$GENERATE 0-9   $.3.1   CNAME   13$
$GENERATE 0-9   $.4.1   CNAME   14$
$GENERATE 0-9   $.5.1   CNAME   15$
$GENERATE 0-9   $.6.1   CNAME   16$
$GENERATE 0-9   $.7.1   CNAME   17$
$GENERATE 0-9   $.8.1   CNAME   18$
$GENERATE 0-9   $.9.1   CNAME   19$
$GENERATE 0-9   $.0.2   CNAME   20$
$GENERATE 0-9   $.1.2   CNAME   21$
$GENERATE 0-9   $.2.2   CNAME   22$
$GENERATE 0-9   $.3.2   CNAME   23$
$GENERATE 0-9   $.4.2   CNAME   24$
$GENERATE 0-4   $.5.2   CNAME   25$

and voila:

pannekake:~> host 2001:67c:29f4::50
[...] is an alias for [...]
[...] domain name pointer [...]

Instant IPv4/IPv6 sync.

[14:17] | | IPv6 reverse generation

Sun, 03 Feb 2013 - Color and color spaces: An introduction

One of the topics that has come up a few times during the development of Movit (my high-performance, high-quality video filter library) is the one of color and color spaces. There's a lot of information out there, but it took me quite a while to put everything together in my own head. Thus, allow me to share a distilled version; I'll try to skip all the detail and the boring parts. Color is an extremely complex topic, though; the more I understand, the more confusing it becomes. Thus, this post will probably become quite long.

What is color?

Color is, ultimately, the way our vision reacts to the fact that light comes in many different frequencies. (In a sense, the field of color is actually more a subfield of biology than of physics.) In its most exact form, you can describe this with a frequency spectrum. For instance, here is (from Wikipedia) a typical spectrum of the sky on a clear summer day:

Spectrum of blue sky

However, human eyes are not spectrometers; there are many colors with different frequency spectra that we perceive as the same. Thus, it's useful to invent some sort of representation that more closely corresponds to how we see color.

Now, almost everybody knows that we represent colors on computers with various amounts of red, green and blue. This is correct, but how do we go from those spectra to RGB values?


The first piece of the puzzle comes in the form of the CIE 1931 XYZ color space. It defines three colors, X, Y and Z, that look like this (again from Wikipedia, as are all the images in this post):

XYZ color matching functions

Don't be confused that they are drawn in red, green and blue, because they don't correspond to RGB. (They also don't correspond to the different cones in the eye.)

In any case, almost all modern color theory starts off by saying that describing frequency spectra by various mixtures of X, Y and Z is a good enough starting point. (In particular, this means we discard infrared and ultraviolet.) As a handy bonus, Y corresponds very closely to our perception of overall brightness, so if you set X=Z=0, you can describe a black-and-white picture with only Y. (This is the same Y as you might have seen in YUV or YCbCr. Let me ignore the distinction between Y and Y' for now.)

Actually we tend to go one step further when discussing color, since we don't care about the brightness; we normalize the XYZ coordinates so that x+y+z=1, after which a color is uniquely defined with only its x and y (note that we now write lowercase!) values. (If we also include the original Y value, we have the full description of the same color again, so the xyY color space is equivalent to the XYZ one.)
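The normalization itself is trivial; as a quick sketch, here it is applied to the XYZ coordinates of the D65 white point (roughly X=0.9505, Y=1.0, Z=1.089, used here purely for illustration):

```python
def xyz_to_xy(X, Y, Z):
    """Project out brightness: chromaticity is each component's share of the total."""
    s = X + Y + Z
    return X / s, Y / s

# The D65 white point in XYZ (Y normalized to 1.0):
x, y = xyz_to_xy(0.9505, 1.0, 1.089)
print(round(x, 4), round(y, 4))  # 0.3127 0.329
```

These (0.3127, 0.3290) are the xy coordinates you'll see marked as D65 in the chromaticity diagrams further down.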

RGB and spectral colors

So, now we have a way to describe colors in an absolute sense with only three numbers. As we already said, we usually do this with RGB. However, this raises the question: When I say “red”, which color do I mean exactly? What would its xy coordinates be?

The natural answer would probably be some sort of spectral color. We all know the spectral colors from the rainbow; they are the ones that contain a single wavelength of light. (Then we start saying stupid things like “white is not a color” since it is a mixture of many wavelengths, conveniently ignoring that e.g. brown is also a mixture and thus not in the rainbow. I've never heard anyone saying brown is not a color.) If you take all the spectral values, convert them to xy coordinates and draw them in a diagram, you get something like this:

Planckian locus

(To be clear, what I'm talking about is the big curve, with markings from 380 nm to 700 nm.)

So now, we can define “red” to be e.g. light at 660 nm, and similarly for green and blue. This gives rise to a gamut, the range of colors we can represent by mixing R, G and B in various amounts. For instance, here's the Rec. 2020 (Rec. = Recommendation) color space, used in the upcoming UHDTV standard:

Rec. 2020 color space

You can see that we've limited ourselves a bit here; some colors (like spectral light at 500 nm) fall outside our gamut and cannot be represented except as some sort of approximation. Still, it's pretty good.

For a full description, we also need a white point that says where we end up when we set R=G=B, but let me skip over the discussion of “what is white” right now. (Hint: Usually it's not “equal amount of all wavelengths”.) There's also usually all sorts of descriptions about ambient lighting and flare in your monitor and whatnot—again, let me skip over them. You can see the white point marked off as “D65” in the diagram above.

A better compromise

You might have guessed by now that we rarely actually use spectral primaries today, and you're right. There are a few very important reasons for this:

First, it makes for a color space that is very hard to realize in practice. How many things do you know of that can make exactly single-frequency light? Probably only one: Lasers. I'm sure that having a TV built with a ton of lasers would be cool (*pew pew*!), but right now, we're stuck with LCD and LED and such. (You may have noticed that outside a certain point, all the colors in the diagram look the same. Your monitor simply can't show the difference anymore.) You could, of course, argue that we should let it be the monitor's problem to figure out what to do with the colors it can't represent, but proper gamut mapping is very hard, and the subject of much research.

Second, the fact that the primaries are far from each other means that we need many bits to describe transitions between them smoothly. The typical 8 bits of today are not really enough; UHDTV will be done with 10- or 12-bit precision. (Processing should probably have even more; Movit uses full floating-point.)

Third, pure colors are actually quite dim (they contain little energy). When producing a TV, color reproduction is not all you care about; you also care about light level for e.g. white. If we reduce the saturation of our primaries a bit (moving them towards the white point), we make it easier to get a nice and bright output image.

So, here are the primaries of the sRGB color space, which is pretty much universally used on PCs today (and the same primaries as Rec. 709, used for today's HDTV broadcasts):

sRGB color space

Quite a bit narrower; in particular, we've lost a lot in the greens. This is why some photographers prefer to work in a wider color space like e.g. Adobe RGB; no need to let your monitor's limitations come between what your camera and printer can do. (Printer gamuts are a whole new story, and they don't really work the same way monitor gamuts do.)

Color spaces and Movit

So, this is why Movit, and really anything processing color data, has to care: To do accurate color processing, you must know what color space you are working in. If you take in RGB pixels from an sRGB device, and then take those exact values and show them on an SDTV (which uses a subtly different color space, Rec. 601), your colors will be slightly off. Remember, red is not red. sRGB and SDTV are not so different, but what about sRGB and Rec. 2020? If you take your sRGB data and try to send it on the air for UHDTV, it will look strangely oversaturated.

You could argue that almost everything is sRGB right now anyway, and that the difference between sRGB and Rec. 601 is so small that you can ignore it. Maybe; I prefer not to give people too many reasons to hate my software in the future. :-)

So Movit solves this by converting everything into the same color space on input, and processing everything as sRGB internally. (Basically, you convert the color from whatever color space it's in to XYZ, and then from XYZ to sRGB. On output, you go the other way.) Lightroom does something similar, only with a huge-gamut color space (so big it includes “imaginary colors”, colors that can't actually be represented as spectra) called ProPhoto RGB; I might go that way in the future, but currently, sRGB will do.
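A sketch of what “convert via XYZ” looks like in practice, using the well-known sRGB/Rec. 709 primaries with D65 white point. Note that these matrices operate on linear values, so any gamma curve must be undone before this step and reapplied afterwards:

```python
# RGB <-> XYZ conversion with the standard sRGB (Rec. 709 primaries,
# D65 white point) matrices; values are the commonly published 4-decimal
# approximations, so the two matrices are near-exact inverses.

SRGB_TO_XYZ = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
]
XYZ_TO_SRGB = [
    [ 3.2406, -1.5372, -0.4986],
    [-0.9689,  1.8758,  0.0415],
    [ 0.0557, -0.2040,  1.0570],
]

def mul(m, v):
    """3x3 matrix times 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

rgb = [0.2, 0.5, 0.8]               # some linear sRGB color
xyz = mul(SRGB_TO_XYZ, rgb)
back = mul(XYZ_TO_SRGB, xyz)        # round trip: should match the input
print([round(c, 3) for c in back])  # [0.2, 0.5, 0.8]
```

Converting from some other color space (say, Rec. 601) would use that space's own to-XYZ matrix on the input side, with the XYZ-to-sRGB step staying the same.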

[18:58] | | Color and color spaces: An introduction

Steinar H. Gunderson <>