I'm currently trying to make sense of some still image coding (more details to come at a much later stage!), and for a variety of reasons, I've chosen to use rANS as the entropy coder. However, there's an interesting little detail that I haven't actually seen covered anywhere; maybe it's just because I've missed something, or maybe because it's too blindingly obvious, but I thought I would document what I ended up with anyway. (I had hoped for something even more elegant, but I guess the obvious would have to do.)

For those that don't know rANS coding, let me try to handwave it as much as possible. Your state is typically a single word (in my case, a 32-bit word), which is refilled from the input stream as needed. The encoder and decoder works in reverse order; let's just talk about the decoder. Basically it works by looking at the lowest 12 (or whatever) bits of the decoder state, mapping each of those 2^12 slots to a decoded symbol. More common symbols are given more slots, proportionally to the frequency. Let me just write a tiny, tiny example with 2 bits and three symbols instead, giving four slots:

Lowest bits | Symbol |
---|---|

00 | 0 |

01 | 0 |

10 | 1 |

11 | 2 |

Note that the zero coefficient here maps to one out of two slots (ie., a range); you don't choose which one yourself, the encoder stashes some information in there (which is used to recover the next control word once you know which symbol there is).

Now for the actual problem: When storing DCT coefficients, we typically want to also store a sign (ie., not just 1 or 2, but also -1/+1 and -2/+2). The statistical distribution is symmetrical, so the sign bit is incompressible (except that of course there's no sign bit needed for 0). We could have done this by introducing new symbols -1 and -2 in addition to our three other ones, but this means we'll need more bits of precision, and accordingly larger look-up tables (which is negative for performance). So let's find something better.

We could also simply store it separately somehow; if the coefficient is non-zero, store the bits in some separate repository. Perhaps more elegantly, you can encode a second symbol in the rANS stream with probability 1/2, but this is more expensive computationally. But both of these have the problem that they're divergent in terms of control flow; nonzero coefficients potentially need to do a lot of extra computation and even loads. This isn't nice for SIMD, and it's not nice for GPU. It's generally not really nice.

The solution I ended up with was simulating a larger table with a smaller one. Simply rotate the table so that the zero symbol has the top slots instead of the bottom slots, and then replicate the rest of the table. For instance, take this new table:

Lowest bits | Symbol |
---|---|

000 | 1 |

001 | 2 |

010 | 0 |

011 | 0 |

100 | 0 |

101 | 0 |

110 | -1 |

111 | -2 |

(The observant reader will note that this doesn't describe the exact same distribution as last time—zero has twice the relative frequency as in the other table—but ignore that for the time being.)

In this case, the bottom half of the table doesn't actually need to be stored! We know that if the three bottom bits are >= 110 (6 in decimal), we have a negative value, can subtract 6, and then continue decoding. If we are go past the end of our 2-bit table despite that, we know we are decoding a zero coefficient (which doesn't have a sign), so we can just clamp the read; or for a GPU, reads out-of-bounds on a texture will typically return 0 anyway. So it all works nicely, and the divergent I/O is gone.

If this pickled your interest, you probably want to read up on rANS in general; Fabian Giesen (aka ryg) has some notes that work as a good starting point, but beware; some of this is pretty confusing. :-)