Sat, 01 Jul 2023 - The bestest sine function, redux
A small followup to the last post:
- Sreepathi Pai pointed out that Z3 isn't the fastest solver when it comes to
floats; CVC5 is faster
but cannot optimize. However, OptiMathSAT,
while non-free, dethrones Z3 on this problem; the minimization time goes
down from eight hours to five minutes! (Pai also pointed to
sat-smt.codes, if you want to read 650+ pages
about SAT/SMT techniques.)
- I discovered that the desired range is not [0,16384) but [0,16384];
that is, you need to include 16384. This makes for slightly different
optimal polynomials.
With that in mind, here are the final, bestest coefficients given the
aforementioned constraints:
- Cosine: 1.0f + (x*x) * (-4.564684e-9f + (x*x) * 3.1372154e-18f); maxerr=7.3719024e-4
- Sine: x * (9.584001e-5f + (x*x) * (-1.4590816e-13f + (x*x) * 6.0535898e-23f)); maxerr=8.0764293e-5
The original LUT approach has maxerr 2.8767669e-4 (since it chops off the
lowest two bits; presumably some rounding would be better), so the sine
approach actually beats it in accuracy.
Thu, 29 Jun 2023 - Finding the BESTEST sine function for Nintendo 64
YouTube gave me the video “Finding the BEST sine function for Nintendo 64”
yesterday, and while I thought the title was very intriguing, I came away
a bit disappointed; the video is hardly about the specifics of Nintendo 64
at all, and the author mostly spends time comparing straight-up LUTs with
some hand-made splines under a proxy metric (no real comparative benchmarks, hardly any
numerics).
I have no idea what the optimal solution is, but it got me thinking: How good
can we get with the same basic formulas, just better constants?
The video is pretty sparse with actual source code and data, but I think
I've extracted enough info to go work on his “5th order sine” and “4th order
cosine” methods (both straight-up polynomials).
Let's start with the 5th order sine; it's basically an odd polynomial
(he does simple range reduction to [-pi/2,pi/2)). The implementation goes
roughly like this, taking in a value between 0 and 65535 inclusive:
float x = i * (float)(2.0 * M_PI / 65536.0); // Prescaler reverse-engineered from the video
float sin = x * (1.0f + x * x * (b + x * x * a));
He uses 1.0f specifically because it's ready in a register (so you don't need
to load the constant), but it's not hard to see that this is exactly the
same as dropping the prescaler and evaluating ax⁵ + bx³ + cx. So let's see what
Sollya thinks:
> fpminimax(sin(x*2*Pi/65536.0),[|x,x^3,x^5|],[|SG...|],[0;16384]);
Warning: For at least 2 of the constants displayed in decimal, rounding has happened.
x * (9.58634263952262699604034423828125e-5 + x^2 * (-1.46252571143166976153082714517950080335140228271484e-13 + x^2 * 6.1585575698811928868593794069233315902067715796875e-23))
We can bruteforce-evaluate this over all 16384 relevant inputs (corresponding
to 0 through pi/2), giving maximum absolute error 0.000108. Given that he
claims maximum absolute error of 0.000395 for his own version (nearly 4x
as high), and we saved the initial multiplication (though not the load),
this is a good start!
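A minimal sketch of that brute force in Python, using NumPy's float32 type
to mimic the single-precision arithmetic (the reference value is computed
in double):

import math
import numpy as np

a = np.float32(6.1585575698811928868593794069233315902067715796875e-23)
b = np.float32(-1.46252571143166976153082714517950080335140228271484e-13)
c = np.float32(9.58634263952262699604034423828125e-5)

max_err = 0.0
for i in range(16384):
    x = np.float32(i)
    xsq = x * x  # every operation here stays in float32
    approx = x * (c + xsq * (b + xsq * a))
    err = abs(float(approx) - math.sin(i * 2.0 * math.pi / 65536.0))
    max_err = max(max_err, err)
print(max_err)  # comes out around 0.000108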
But we're not looking for better. We're looking for BESTEST.
And we want to prove it.
So we must unleash… the SAT solver. Usually I use OR-Tools,
but for a project this bizarre, we need the big guns. We need Z3.
Z3 has full support for IEEE floats of any size, so we can start off by
declaring that a,b,c are floats (the N64 supports double just fine, but his
assembly strongly suggests he's working in single precision):
(declare-const a Float32)
(declare-const b Float32)
(declare-const c Float32)
We also need a maximum tolerable error. Since we're trying to beat Sollya,
let's start off at 0.0001:
(declare-const eps Float32)
(assert (= eps (fp #b0 #x71 #b10100011011011100010111))) ; 0.0001
Now we want to ask the solver to minimize the error. We need to do that
point by point. Let's first start with i=16383 (testing i=0 is meaningless,
since it will be zero no matter what the constants are):
(declare-const x16383 Float32)
(declare-const x16383sq Float32)
(declare-const y16383 Float32)
(declare-const yy16383 Float32)
(assert (= x16383 (fp #b0 #x8c #b11111111111110000000000)))
(assert (= x16383sq (fp #b0 #x9a #b11111111111100000000000)))
(assert (= y16383 (fp #b0 #x7F #b00000000000000000000000))) ; 1.0
(assert (= yy16383
    (fp.mul roundNearestTiesToEven x16383
        (fp.add roundNearestTiesToEven c
            (fp.mul roundNearestTiesToEven x16383sq
                (fp.add roundNearestTiesToEven b
                    (fp.mul roundNearestTiesToEven a x16383sq)))))))
(assert (fp.lt (fp.abs (fp.sub roundNearestTiesToEven y16383 yy16383)) eps))
So we're defining a constant for 16383.0f, and then one for the same value
squared (precomputing this creates fewer SAT clauses, and we're really
creating enough already!), and then the expected output value (y16383);
the true sine at this point is so close to 1 that it rounds to exactly
1.0f in single precision. We then define that yy16383 is the output of the
formula x * (c + xsq * (b + xsq * a)), done with standard IEEE 754
round-to-nearest, ties-to-even mode. Finally, we say that the
absolute difference between yy16383 and the expected answer (where the rounding
mode doesn't matter much) must be less than our tolerance.
Behind the scenes, Z3 will create a bazillion SAT clauses for us, presumably
with adders and normalization and whatever. The beauty of this is that we
should be able to match hardware exactly; no matter what may happen with
rounding or overflow or denormals or whatever, our simulation will give us
the right error. We could even model tricks like “we only care to load the
upper 16 bits of the float, to save on a MIPS instruction” and see if we
could still get a reasonable approximation. There are no limits (especially
not if P=NP and SAT admits a solution in polynomial time!).
There's one more clause we want:
(assert (fp.leq yy16383 one))
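(This assumes a helper constant one, pinned to 1.0f, declared earlier along
these lines:)

(declare-const one Float32)
(assert (= one (fp #b0 #x7F #b00000000000000000000000))) ; 1.0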
We don't want any value to exceed 1.0f, since his code relies on using
sqrt(1 - x²) to calculate the cosine from sine, and we can't take the
square root of a negative number. (I think using the square root here
is rather dubious performance-wise, but we'll roll with it; again,
we only want to change constants.)
Now, we could repeat this for all the other 16382 points. However,
that would create literally millions of SAT clauses and take forever
to work out. So I just put in 16 roughly equidistant points and let it
roll:
(check-sat)
(get-value (a b c))
And after four minutes or so of calculation, Z3 (well, my script
wrapper around Z3) told me that it was indeed possible to get under
0.0001 maximum absolute error:
sat
((a (fp #b0 #x35 #b00100101101100010000110)) ; 0.000000000000000000000060733997
(b (fp #b1 #x54 #b01001000110000100101110)) ; -0.00000000000014599842
(c (fp #b0 #x71 #b10010010000001010001001))) ; 0.00009584899
Now, of course, this was only for those points, but there's a limit
to how fast the error polynomial can ripple, so it turns out that the
error over the full input range wasn't far away (it was 0.00010049343).
But better still isn't what we want! BESTEST. Let me take a little
detour towards the other simple polynomial, namely the “4th order cosine”
(which calculates sin from cos, instead of the other way round).
We can implement this in exactly the same way:
(fp.add roundNearestTiesToEven one
    (fp.mul roundNearestTiesToEven x16383sq
        (fp.add roundNearestTiesToEven b
            (fp.mul roundNearestTiesToEven a x16383sq))))
That is, 1 + x_sq * (b + a * x_sq). Here there's no
prescaler in his version (and the coefficients are visible in the video),
so we've kept the constraint that one of the constants must be 1.0f,
and thus Z3 will have less work to do. So this time, instead of asking
for something better than a fixed target, let's ask it to push eps as
low as possible:
(assert (fp.isPositive eps))
(declare-const eps_e (_ BitVec 8))
(declare-const eps_m (_ BitVec 23))
(assert (= eps (fp #b0 eps_e eps_m)))
(minimize eps_e)
(minimize eps_m)
It seems roundabout that we need to declare separate variables
for the exponent and mantissa and then minimize those,
instead of just minimizing the float directly, but evidently,
this is the standard trick. (It works because a positive IEEE float's
bit pattern, read as an integer, orders the same way as its value,
so minimizing the exponent first and then the mantissa minimizes
the float itself.) So we let Z3 churn for half an
hour, and lo and behold:
sat
((a (fp #b0 #x44 #b11001110111101011111101)) ; 0.0000000000000000031371447
(b (fp #b1 #x63 #b00111001101011100110110)) ; -0.000000004564664
(eps (fp #b0 #x74 #b10000001110110000000001))) ; 0.0007359386
So now it's found the optimal epsilon all by itself! Of course, it still
is over that set of points. But the truth is not far behind; the actual
worst case has error 0.0007380843. So it's way less than a percent off.
Unfortunately Z3 is much slower when we ask it to minimize; seemingly,
it switches to a more powerful but slower solver (SMT instead of plain SAT).
The alternative would be to drive it ourselves by repeated queries with
binary search or similar for epsilon, but OK, half an hour per run is
entirely tolerable. But we want the bestest. So what can we do?
Well, we find that the worst case for this polynomial is i=7231.
So we just add that to our list of points to check. This is rather
similar to how the Remez exchange algorithm (used by, among others, Sollya)
works; it is bound to fix at least that single point, and hopefully
the procedure will converge long before we add all 16383 of them.
(That sounds very plausible even without any deep proof; again, the
error function can only oscillate so fast, so there are bound to be
tons of values that will never be close to the max error.)
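In driver-script form, the refinement loop goes roughly like this sketch
(Python, with hypothetical helper names; the real thing is a script
wrapper around Z3):

points = initial_equidistant_points()           # e.g. 16 points in 1..16383
while True:
    a, b, eps = solve_smt_minimize_eps(points)  # the (minimize ...) query from above
    worst_i, worst_err = brute_force_worst_case(a, b)
    if worst_i in points:
        break  # the binding point is already constrained, so we are optimal
    points.append(worst_i)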
And indeed, after twelve iterations or so, we get:
((a (fp #b0 #x44 #b11001110111110101000110)) ; 0.0000000000000000031372656
(b (fp #b1 #x63 #b00111001101011101110110)) ; -0.0000000045646926
(eps (fp #b0 #x74 #b10000010010111011110000))) ; 0.00073693599
This has max error 0.00073693593 (remember, we asked for the error to be
less than eps, and that eps is indeed the next possible float after
the max error), and that error is in i=16383, which was already a point
we asked it to optimize,
and so this is THE BESTEST 4th-order cosine approximation under the given
constraints. With his best cosine being 0.002787, it's a nice 3–4x
improvement on error just by swapping out constants.
The sine is much slower to minimize (five hours or so per iteration),
so it hasn't converged yet; still waiting for the absolute truth there.
So for now, the one above is what we've got. At least we have a lower bound
saying that it's less than 1% off the best possible one (asking for anything
lower produces an unsat result). But the bestest 4th-order cosine
approximation is indeed 1.0f + (x * x) * (-0.0000000045646926f + (x * x) * 0.0000000000000000031372656f).
On the Nintendo 64.
Tue, 30 Oct 2018 - SAT solvers for fun and fairness
Trøndisk 2018, the
first round of the Norwegian ultimate series (the frisbee sport, not the
fighting style) is coming up this weekend! Normally that would mean that
I would blog something about all the new and exciting things we are doing
with Nageru for the stream, but for now,
I will just point out that the stream is on plastkast.no
and will be live from 0945–1830 CET on Saturday (group stage) and 1020–1450
(playoffs) on Sunday.
Instead, I wanted to talk about a completely different but interesting
subproblem we had to solve; how do you set up a good schedule for the group
stages? There are twelve teams, pre-seeded and split into two groups (call them A0–A5
and B0–B5) that are to play
round-robin, but there are only two fields—and only one of them is streamed.
You want a setup that maximizes fairness in the sense that people get
adequate rest between matches, and also more or less equal number of streamed
games. Throw in that one normally wants the more exciting games last, and
it starts to get really tricky to make something good by hand. Could we do
it programmatically?
My first thought was that since this is all about the ordering, it sounded
like a variant of the infamous travelling salesman problem. It's
well-known that TSP is NP-hard (or NP-complete, but I won't bother with
the details), but there are excellent heuristic implementations in practice.
In particular, I had already used OR-Tools, Google's optimization
toolkit, to solve TSP problems in the past; it contains a TSP solver that
can deal with all sorts of extra details, like multiple agents to travel
(in our case, multiple fields), subconstraints on ordering and so on.
(OR-Tools probably doesn't contain the best TSP solver in the world—there
are specialized packages that do even better—but it's much better than
anything I could throw together myself.)
However, as I tried figuring out something, and couldn't quite get it to
fit (there are so many extra nonlocal constraints), I saw that the OR-Tools
documentation had a subsection on scheduling problems. It turns
out this kind of scheduling can be represented as a so-called SAT
(satisfiability) problem, and OR-Tools also has a SAT solver. (SAT, in its
general forms, is also NP-hard, but again, there are great heuristics.)
I chose the Python frontend, which probably wasn't the best idea in the
world (it's poorly documented, and I do wonder when Python will take the
step into the 90s and make spelling errors in variables into compile-time
errors instead of throwing a runtime exception four hours into a
calculation), but that's what the documentation used, and the backend
is in C++ anyway, so speed doesn't matter.
The SAT solver works by declaring variables and various constraints
between them, and then asking the machine to either come up with a solution
that fits, or to prove that it's not possible. Let's have a look at some
excerpts to get a feel for how it all works:
We know we have 15 rounds, with two fields in each, and every field should
hold a match. So let's generate 30 such variables, each containing
a match number (we use the convention that the matches in slots 0, 2, 4, 6,
etc. are on the stream field and those in slots 1, 3, 5, 7, etc. are played
in parallel on the other field):
matchnums = []
for match_idx in range(num_matches):
    matchnums.append(model.NewIntVar(0, num_matches - 1, "matchnum%d" % (match_idx)))
So this is 30 variables, each going from 0 to 29, inclusive.
We start with a fairly obvious constraint; we can only play each match once:
model.AddAllDifferent(matchnums)
The SAT solver might make this into a bunch of special constraints
underneath, or it might not. We don't care; it's abstracted away for us.
Now, it's not enough to just find any ordering—after all, we want to
find an ordering with some constraints. However, the constraints are
rarely about the match numbers, but more about the teams that play in
those matches. So we'll need some helper variables. For instance,
it would be interesting to know which teams play in each match:
home_teams = []
away_teams = []
for match_idx in range(num_matches):
    home_teams.append(model.NewIntVar(0, num_teams - 1, "home_team_match%i" % (match_idx)))
    away_teams.append(model.NewIntVar(0, num_teams - 1, "away_team_match%i" % (match_idx)))
    model.AddElement(matchnums[match_idx], home_teams_for_match_num, home_teams[match_idx])
    model.AddElement(matchnums[match_idx], away_teams_for_match_num, away_teams[match_idx])
AddElement() here simply is an indexing operation; since there's no difference
between home and away teams for us, we've just pregenerated all the matches as
A0 vs. A1, A0 vs. A2, etc. up until A3 vs. A5 and A4 vs. A5, and then similarly
for the other group. The “element” constraint makes sure that e.g. home_team_match0 = home_teams_for_match_num[matchnum0].
Note that even though I think of this as an assignment where the home team
for match 0 follows logically from which match is being played as match 0,
it is a constraint that goes both ways; the solver is free to do inference
that way, or instead first pick the home team and then deal with the
consequences for the match number. (E.g., if it picks A4 as the home team,
the match number most certainly needs to be 14, which corresponds to A4–A5.)
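(For reference, pregenerating that match list could look something like the
following; numbering the B teams as 6–11 is my assumption here:)

home_teams_for_match_num = []
away_teams_for_match_num = []
for group_start in (0, 6):  # A teams are 0-5, B teams are 6-11
    for i in range(6):
        for j in range(i + 1, 6):
            home_teams_for_match_num.append(group_start + i)
            away_teams_for_match_num.append(group_start + j)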
We're not quite done with the helpers yet; we want to explode these variables
into booleans:
home_team_in_match_x_is_y = [[
    model.NewBoolVar('home_team_in_match_%d_is_%d' % (match_idx, team_idx)) for team_idx in range(num_teams)
] for match_idx in range(num_matches)]
for match_idx in range(num_matches):
    model.AddMapDomain(home_teams[match_idx], home_team_in_match_x_is_y[match_idx])
and similarly for away team and match number.
So now we have a bunch of variables of the type “is the home team in match 6
A4 or not?”. Finally we can make some interesting constraints! For instance,
we've decided already that the group finals (A0–A1 and B0–B1) should be the
last two matches of the day, and on the stream field:
model.AddBoolOr([match_x_has_num_y[28][0], match_x_has_num_y[28][15]])
model.AddBoolOr([match_x_has_num_y[26][0], match_x_has_num_y[26][15]])
This is a hard constraint; we don't have a solution unless matches 0 and 15
fill the stream-field slots of the last two rounds (and we earlier said that
they must be different).
We're going to need even more helper variables now. It's useful to know whether
a team is playing at all in a given round; that's the case if they are the
home or away team on either field:
plays_in_round = {}
for team_idx in range(num_teams):
    plays_in_round[team_idx] = {}
    for round_idx in range(num_rounds):
        plays_in_round[team_idx][round_idx] = model.NewBoolVar('plays_in_round_t%d_r%d' % (team_idx, round_idx))
        model.AddMaxEquality(plays_in_round[team_idx][round_idx], [
            home_team_in_match_x_is_y[round_idx * 2 + 0][team_idx],
            home_team_in_match_x_is_y[round_idx * 2 + 1][team_idx],
            away_team_in_match_x_is_y[round_idx * 2 + 0][team_idx],
            away_team_in_match_x_is_y[round_idx * 2 + 1][team_idx]])
Now we can establish a few other very desirable properties; in particular,
each team should never need to play two matches back-to-back:
for round_idx in range(num_rounds - 1):
    for team_idx in range(num_teams):
        model.AddBoolOr([plays_in_round[team_idx][round_idx].Not(), plays_in_round[team_idx][round_idx + 1].Not()])
Note that there's nothing here that says the same team can't be assigned
to play on both fields at the same time! However, this is taken care of
by some constraints on the scheduling that I'm not showing for brevity
(in particular, we established that each round must have exactly one game
from group A and one from group B).
Now we're starting to get out of “hard constraint” territory and more
into things that would merely be nice. For this, we need objectives. One such
objective is what I call “tiredness”; playing matches nearly back-to-back
(ie., game - rest - game) should carry a penalty, and the solution should
try to avoid it.
tired_matches = []
for round_idx in range(num_rounds - 2):
    for team_idx in range(num_teams):
        tired = model.NewBoolVar('team_%d_is_tired_in_round_%d' % (team_idx, round_idx))
        model.AddMinEquality(tired, [plays_in_round[team_idx][round_idx], plays_in_round[team_idx][round_idx + 2]])
        tired_matches.append(tired)
sum_tiredness = sum(tired_matches)
So here we have helper variables that are being set to the minimum
(effectively a logical AND) of “do I play in round N” and “do I play in round
N + 2”. Tiredness is simply a sum of those 0–1 variables, which we can seek
to minimize:
model.Minimize(sum_tiredness)
You may wonder how we went from a satisfiability problem to an
optimization problem. Conceptually, however, this isn't so hard.
Just ask the solver to find any solution, e.g. something with sum_tiredness 20.
Then simply add a new constraint saying sum_tiredness <= 19 and ask
for a re-solve (or continue). Eventually, the solver will either come back
with a better solution (in which case you can tighten the constraint
further), or the message that you've asked for something impossible, in
which case you know you have the optimal solution. (I have no idea whether
modern SAT solvers actually work this way internally, but again, conceptually
it's simple.)
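In sketch form, with solve_with_bound as a hypothetical helper that rebuilds
the model with an extra upper-bound constraint and solves it:

def find_optimum(solve_with_bound):
    # solve_with_bound(bound) returns a solution with objective <= bound,
    # or None if that is infeasible; a bound of None means anything feasible.
    best = solve_with_bound(None)
    while best is not None:
        better = solve_with_bound(best.objective - 1)
        if better is None:
            return best  # nothing strictly better exists, so this is optimal
        best = better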
As an extra bonus, you do get incrementally better solutions as you go.
These problems are theoretically very hard—in fact, I've now let it run for fun
for a week, and it still hasn't found a provably optimal solution—and in
practice, you just take some intermediate solution that is “good enough”.
There are always constraints that you don't bother adding to the program
anyway, so there's some eyeballing involved, but it still feels like a fairer
process than trying to nudge it by hand.
We had many more objectives, some of them contradictory (e.g., games between
more closely seeded opponents are more “exciting”, and should be put last—but
they should also be put on the stream, so do you put them early on the stream
field or late on the non-stream field?). It's hard to weigh all the factors
against each other, but in the end, I think we ended up with something pretty
nice. Every team gets to play two or three times (out of five) on the stream,
only one team needs to be “tired” twice (and I checked; if you ask for a
hard maximum of once for every team, it comes back pretty fast as
infeasible), many of the tight matches are scheduled near the end… and most importantly, we don't have to play the first matches
while I'm still debugging the stream. :-)
You can see the final schedule
here.
Good luck to everyone, and consider using a SAT solver next time you have
a thorny scheduling problem!
Tue, 12 Sep 2017 - rANS encoding of signed coefficients
I'm currently trying to make sense of some still image coding
(more details to come at a much later stage!), and for a variety
of reasons, I've chosen to use rANS as the entropy coder. However,
there's an interesting little detail that I haven't actually seen
covered anywhere; maybe it's just because I've missed something,
or maybe because it's too blindingly obvious, but I thought I would
document what I ended up with anyway. (I had hoped for something
even more elegant, but I guess the obvious would have to do.)
For those that don't know rANS coding, let me try to handwave it
as much as possible. Your state is typically a single word
(in my case, a 32-bit word), which is refilled from the input stream
as needed. The encoder and decoder work in reverse order of each other;
let's just talk about the decoder. Basically it works by looking
at the lowest 12 (or whatever) bits of the decoder state, mapping
each of those 2^12 slots to a decoded symbol. More common symbols
are given more slots, proportionally to the frequency. Let me just
write a tiny, tiny example with 2 bits and three symbols instead,
giving four slots:
Lowest bits | Symbol
00          | 0
01          | 0
10          | 1
11          | 2
Note that the zero coefficient here maps to one out of two slots
(ie., a range); you don't choose which one yourself, the encoder
stashes some information in there (which is used to recover the
next control word once you know which symbol there is).
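In sketch form, one decode step of such a table-based decoder (with n bits
of slot precision; the names here are illustrative, not from any particular
implementation) looks something like:

def decode_step(x, n, symbol_of_slot, freq, cumfreq):
    slot = x & ((1 << n) - 1)  # the lowest n bits pick a slot
    s = symbol_of_slot[slot]
    x = freq[s] * (x >> n) + slot - cumfreq[s]  # recover the previous state
    # (refill x from the input stream here if it fell below the lower bound)
    return s, x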
Now for the actual problem: When storing DCT coefficients, we
typically want to also store a sign (ie., not just 1 or 2, but
also -1/+1 and -2/+2). The statistical distribution is symmetrical,
so the sign bit is incompressible (except that of course there's
no sign bit needed for 0). We could have done this by introducing new symbols
-1 and -2 in addition to our three other ones, but this means we'll
need more bits of precision, and accordingly larger look-up tables
(which is negative for performance). So let's find something better.
We could also simply store it separately somehow; if the coefficient
is non-zero, store the bits in some separate repository. Perhaps
more elegantly, you can encode a second symbol in the rANS stream
with probability 1/2, but this is more expensive computationally.
But both of these have the problem that they're divergent in terms
of control flow; nonzero coefficients potentially need to do a lot
of extra computation and even loads. This isn't nice for SIMD, and
it's not nice for GPU. It's generally not really nice.
The solution I ended up with was simulating a larger table with
a smaller one. Simply rotate the table so that the zero symbol
has the top slots instead of the bottom slots, and then replicate
the rest of the table. For instance, take this new table:
Lowest bits | Symbol
000         | 1
001         | 2
010         | 0
011         | 0
100         | 0
101         | 0
110         | -1
111         | -2
(The observant reader will note that this doesn't describe the
exact same distribution as last time—zero has twice the relative
frequency as in the other table—but ignore that for the time
being.)
In this case, the bottom half of the table doesn't actually need
to be stored! We know that if the three bottom bits are >= 110
(6 in decimal), we have a negative value, can subtract 6, and then
continue decoding. If we go past the end of our 2-bit table
despite that, we know we are decoding a zero coefficient (which
doesn't have a sign), so we can just clamp the read; or for a GPU,
reads out-of-bounds on a texture will typically return 0 anyway.
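A sketch of that slot-to-symbol step, with the table above (only the first
four entries are actually stored):

TABLE = [1, 2, 0, 0]  # slots 000..011; everything else is derived

def slot_to_symbol(slot):
    if slot >= 6:               # slots 110 and 111 hold the negative values
        return -TABLE[slot - 6]
    return TABLE[min(slot, 3)]  # slots 100/101 read past the end; clamp to 0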
So it all works nicely, and the divergent I/O is gone.
If this piqued your interest, you probably want to read up on
rANS in general; Fabian Giesen (aka ryg) has some notes
that work as a good starting point, but beware; some of this is
pretty confusing. :-)
Sun, 25 Dec 2016 - Cracking a DataEase password
I recently needed to get access to a DataEase database; the person I helped
was the legitimate owner of the data, but had forgotten the password,
as the database was largely from 1996. There are various companies
around the world that seem to do this, or something similar (like give
you an API), for a usually unspecified fee; they all have very 90s homepages
and in general seem like they have gone out of business a long time ago.
And I wasn't prepared to wait.
For those of you who don't know DataEase, it's a sort-of relational database
for DOS that had its heyday in the late 80s and early 90s (being sort of
the cheap cousin of dBase); this is before SQL gained traction as the
standard query language, before real multiuser database access, and before
variable-width field storage.
It is also before reasonable encryption. Let's see what we can do.
DataEase has a system where tables are mapped through the data dictionary,
which is a table on its own. (Sidenote: MySQL pre-8.0 still does not have
this.) This is the file RDRRTAAA.DBM; I don't really know what RDRR stands
for, but T is the “database letter” in case you wanted more than one database
in the same directory, and AAA, AAB, AAC etc. is a counter for when a table
grows too big for one file. (There's also .DBA files for the structure of
non-system tables, and then some extra stuff for indexes.)
DBM files are pretty much the classical, fixed-length 80s-style database
files; each row has some flags (I believe these are for e.g. “row is
deleted”) and then just the rows in fixed format right after each other.
For instance, here's one I created as part of testing (just the first few
lines of the hexdump are shown):
00000000: 0e 00 01 74 65 73 74 62 61 73 65 00 00 00 00 00 ...testbase.....
00000010: 00 00 00 00 00 00 00 73 46 cc 29 37 00 09 00 00 .......sF.)7....
00000020: 00 00 00 00 00 43 3a 52 44 52 52 54 41 41 41 2e .....C:RDRRTAAA.
00000030: 44 42 4d 00 00 01 00 0e 00 52 45 50 4f 52 54 20 DBM......REPORT
00000040: 44 49 52 45 43 54 4f 52 59 00 00 00 00 00 1c bd DIRECTORY.......
00000050: d4 1a 27 00 00 00 00 00 00 00 00 00 43 3a 52 45 ..'.........C:RE
00000060: 50 4f 54 41 41 41 2e 44 42 4d 00 00 01 00 0e 00 POTAAA.DBM......
00000070: 52 65 6c 61 74 69 6f 6e 73 68 69 70 73 00 00 00 Relationships...
Even without going in-depth, we can see the structure here; there's
“testbase” which maps to C:RDRRTAAA.DBM (the RDRR itself), there's a table
called “REPORT DIRECTORY” that maps to C:REPOTAAA.DBM, and then more stuff
after that, and so on.
However, other tables are not so easily read, because you can ask DataEase
to encrypt a table. Let's look at such an encrypted table, like the “Users”
table (containing usernames, passwords—not password hashes—and some extra
information like access level), which is always encrypted:
00000000: 0c 01 9f ed 94 f7 ed 34 ba 88 9f 78 21 92 7b 34 .......4...x!.{4
00000010: ba 88 0f d9 94 05 1e 34 ba 88 a0 78 21 92 7b 34 .......4...x!.{4
00000020: e2 88 9f 78 21 92 7b 34 ba 88 9f 78 21 92 7b 34 ...x!.{4...x!.{4
00000030: ba 88 9f 78 21 92 7b 34 ba 88 9f 78 21 92 7b ...x!.{4...x!.{
Clearly, this isn't very good encryption; it uses a very short, repetitive
key of eight bytes (64 bits). (The data is mostly zero padding, which makes
it much easier to spot this.) In fact, in actual data tables, only five of
these bytes are set to a non-zero value, which means we have a 40-bit key;
export controls?
My first assumption here was of course XOR, but through some experimentation,
it turned out what you need is actually 8-bit subtraction (with wraparound).
The key used is derived from both a database key and a per-table key,
both stored in the RDRR; again, if you disassemble, I'm sure you can find the
key derivation function, but that's annoying, too. Note, by the way, that
this precludes making an attack by just copying tables between databases,
since the database key is different.
So let's do a plaintext attack. If you assume the plaintext of the bottom row
is all padding, that's your key and here's what you end up with:
00000000: 52 79 00 75 73 65 72 00 00 00 00 00 00 00 00 00 Ry.user.........
00000010: 00 00 70 61 73 73 a3 00 00 00 01 00 00 00 00 00 ..pass..........
00000020: 28 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 (...............
00000030: 00 00 00 00 00 00 00 00 ........
Not bad, eh? Actually the first byte of the key here is wrong as far as I
know, but it didn't interfere with the fields, so we have what we need to
log in. (At that point, we've won, because DataEase will helpfully decrypt
everything transparently for us.)
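The attack itself is only a few lines; here's a sketch of the idea in Python,
assuming the 8-byte key and the bytewise subtraction described above:

def decrypt(data, key):
    # plaintext = ciphertext - key, bytewise with 8-bit wraparound
    return bytes((c - key[i % len(key)]) & 0xFF for i, c in enumerate(data))

# If some aligned 8-byte stretch is known to be all-zero padding,
# its ciphertext is the key itself (the offset here is hypothetical):
key = data[0x28:0x30]
print(decrypt(data, key))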
However, there's a twist; if the password is longer than four characters,
the entire decryption of the Users table changes. Of course, we could run
our plaintext attack against every data table and pick out the information
by decoding the structure, but again; annoying. So let's see what it looks
like if we choose “passs” instead:
00000000: 0e 01 9f 7a ae 9e 21 f5 08 63 07 6d a3 a1 17 5d ...z..!..c.m...]
00000010: 70 cb df 36 7e 7c 91 c5 d8 33 d8 3d 73 71 e7 2d p..6~|...3.=sq.-
00000020: 7b 9b 3f a5 db d9 4f 95 a8 03 a7 0d 43 41 b7 fd {.?...O.....CA..
00000030: 10 6b 0f 75 ab a9 1f 65 78 d3 77 dd 13 11 87 .k.u...ex.w....
Distinctly more confusing. At this point, of course, we know at which byte
positions the username and password start, so if we wanted to, we could just try
setting the start byte of the password to every possible byte in turn until
we hit 0x00 (DataEase truncates fields at the first zero byte), which would
allow us to get in with an empty password. However, I didn't know the
username either, and trying two bytes would mean 65536 tries, and I wasn't
up for automating macros through DOSBox. So an active attack wasn't
too tempting.
However, we can look at the last hex byte
(where we know the plaintext is 0); it goes 0x5d, 0x2d, 0xfd... and some
other bytes go 0x08, 0xd8, 0xa8, 0x78, and so on. So clearly there's an
obfuscation here where we have a per-line offset that decreases by 0x30
per line. (Actually, the increase/decrease per line seems to be derived from
the key somehow, too.) If we remove that, we end up with:
00000000: 0e 01 9f 7a ae 9e 21 f5 08 63 07 6d a3 a1 17 5d ...z..!..c.m...]
00000010: a0 fb 0f 66 ae ac c1 f5 08 63 08 6d a3 a1 17 5d ...f.....c.m...]
00000020: db fb 9f 05 3b 39 af f5 08 63 07 6d a3 a1 17 5d ....;9...c.m...]
00000030: a0 fb 9f 05 3b 39 af f5 08 63 07 6d a3 a1 17 ....;9...c.m...
Well, OK, this wasn't much more complicated; our fixed key is now 16 bytes
long instead of 8 bytes long, but apart from that, we can do exactly the same
plaintext attack. (Also, it seems to change per-record now, but we don't see
it here, since we've only added one user.) Again, assume the last line is
supposed to be all 0x00 and thus use that as a key (plus the last byte from
the previous line), and we get:
00000000: 6e 06 00 75 73 65 72 00 00 00 00 00 00 00 00 00 n..user.........
00000010: 00 00 70 61 73 73 12 00 00 00 01 00 00 00 00 00 ..pass..........
00000020: 3b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ;...............
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...............
Well, OK, it wasn't perfect; we got “pass\x12” instead of “passs”, so we
messed up somehow. I don't know exactly why the fifth character gets messed
up like this; actually, it cost me half an hour of trying because the
password looked very real but the database wouldn't let me in, but
eventually, we just guessed at what the missing letter was supposed to be.
So there you have it; practical small-scale cryptanalysis of DOS-era
homegrown encryption. Nothing advanced, but the user was happy about getting
the data back after a few hours of work. :-)
Wed, 26 Oct 2016 - Why does software development take so long?
Nageru 1.4.0 is out (and on its way through
the Debian upload process right now), so now you can do live video mixing with multichannel
audio to your heart's content. I've already blogged about most of the
interesting new features, so instead, I'm trying to answer a question:
What took so long?
To be clear, I'm not saying 1.4.0 took more time than I really anticipated
(on the contrary, I pretty much understood the scope from the beginning,
and there was a reason why I didn't go for building this stuff into 1.0.0);
but if you just look at the changelog from the outside, it's not immediately
obvious why “multichannel audio support” should take the better part of three
months of development. What I'm going to say is of course going to be obvious
to most software developers, but not everyone is one, and perhaps my
experiences will be illuminating.
Let's first look at some obvious things that aren't the case: First of all,
development is not primarily limited by typing speed. There are about 9,000
lines of new code in 1.4.0 (depending a bit on how you count), and if it was
just about typing them in, I would be done in a day or two. On a good
keyboard, I can type plain text at more than 800 characters per minute—but
you hardly ever write code for even a single minute at that speed. Just as
when writing a novel, most time is spent thinking, not typing.
I also didn't spend a lot of time backtracking; most code I wrote actually
ended up in the finished product as opposed to being thrown away. (I'm not
as lucky in all of my projects.) It's pretty
common to do so if you're in an exploratory phase, but in this case, I had a
pretty good idea of what I wanted to do right from the start, and that plan
seemed to work. This wasn't a difficult project per se; it just needed
to be done (which, in a sense, just increases the mystery).
However, even if this isn't at the forefront of science in any way (most code
in the world is pretty pedestrian, after all), there's still a lot of
decisions to make, on several levels of abstraction. And a lot of those
decisions depend on information gathering beforehand. Let's take a look at
an example from late in the development cycle, namely support for using MIDI
controllers instead of the mouse to control the various widgets.
I've kept a pretty meticulous TODO list; it's just a text file on my laptop,
but it serves the purpose of a ghetto bugtracker. For 1.4.0, it contains 83
work items (a single-digit number is not ticked off, mostly because I decided
not to do those things), which corresponds roughly 1:2 to the number of
commits. So let's have a look at what went into the ~20 MIDI controller items.
First of all, to allow MIDI controllers to influence the UI, we need a way
of getting to it. Since Nageru is single-platform on Linux, ALSA is the
obvious choice (if not, I'd probably have to look for a library to put
in-between), but seemingly, ALSA has two interfaces (raw MIDI and sequencer).
Which one do you want? It sounds like raw MIDI is what we want, but actually,
it's the sequencer interface (it does more of the MIDI parsing for you,
and generally is friendlier).
The first question is where to start picking events from. I went the simplest
path and just said I wanted all events—anything else would necessitate a UI,
a command-line flag, figuring out if we wanted to distinguish between
different devices with the same name (and not all devices potentially even
have names), and so on. But how do you enumerate devices? (Relatively simple,
thankfully.) What do you do if the user inserts a new one while Nageru is
running? (Turns out there's a special device you can subscribe to that will
tell you about new devices.) What if you get an error on subscription?
(Just print a warning and ignore it; it's legitimate not to have access to
all devices on the system. By the way, for PCM devices, all of these answers
are different.)
So now we have a sequencer device, how do we get events from it? Can we do it in the main loop? Turns out
it probably doesn't integrate too well with Qt, but it's easy enough to put
it in a thread. The class dealing with the MIDI handling now needs locking;
what mutex granularity do we want? (Experience will tell you that you nearly
always just want one mutex. Two mutexes give you all sorts of headaches with
ordering them, and nearly never gives any gain.) ALSA expects us to poll()
a given set of descriptors for data, but on shutdown, how do you break out
of that poll to tell the thread to go away? (The simplest way on Linux is
using an eventfd.)
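The real code is C++, but the eventfd trick is easy enough to sketch in
Python as well (os.eventfd needs Python 3.10+):

import os, select, struct

efd = os.eventfd(0)                     # eventfd(2) underneath
poller = select.poll()
poller.register(efd, select.POLLIN)
# ...register the descriptors ALSA handed us here as well...

def wake_midi_thread():                 # call at shutdown, from any thread
    os.write(efd, struct.pack('Q', 1))  # eventfd writes must be 8 bytes

# in the MIDI thread:
done = False
while not done:
    for fd, _event in poller.poll():
        if fd == efd:
            done = True  # shutdown requested; stop instead of blocking again
        # else: drain MIDI events from the ALSA descriptor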
There's a quirk where if you get two or more MIDI messages right after each
other and only read one, poll() won't trigger to alert you there are more
left. Did you know that? (I didn't. I also can't find it documented. Perhaps
it's a bug?) It took me some looking into sample code to find it. Oh, and
ALSA uses POSIX error codes to signal errors (like “nothing more is
available”), but it doesn't use errno.
OK, so you have events (like “controller 3 was set to value 47”); what do you do
about them? The meaning of the controller numbers is different from
device to device, and there's no open format for describing them. So I had to
make a format describing the mapping; I used protobuf (I have lots of
experience with it) to make a simple text-based format, but it's obviously
a nightmare to set up 50+ controllers by hand in a text file, so I had to
make a UI for this. My initial thought was making a grid of spinners
(similar to how the input mapping dialog already worked), but then I realized
that there isn't an easy way to make heading rows in Qt's grid. (You can
substitute a label widget for a single cell, but not for an entire row.
Who knew?) So after some searching, I found out that it would be better
to have a tree view (Qt Creator does this), and then you can treat that
more-or-less as a table for the rows that should be editable.
Of course, guessing controller numbers is impossible even in an editor,
so I wanted it to respond to MIDI events. This means the editor needs
to take over the role as MIDI receiver from the main UI. How do you do
that in a thread-safe way? (Reuse the existing mutex; you don't generally
want to use atomics for complicated things.) Thinking about it, shouldn't the
MIDI mapper just support multiple receivers at a time? (Doubtful; you don't
want your random controller fiddling during setup to actually influence
the audio on a running stream. And would you use the old or the new mapping?)
And do you really need to set up every single controller for each bus,
given that the mapping is pretty much guaranteed to be similar for them?
Making a “guess bus” button doesn't seem too difficult, where if you
have one correctly set up controller on the bus, it can guess from
a neighboring bus (assuming a static offset). But what if there's
conflicting information? OK; then you should disable the button.
So now the enable/disable status of that button depends on which cell
in your grid has the focus; how do you get at those events? (Install an event
filter, or subclass the spinner.) And so on, and so on, and so on.
You could argue that most of these questions go away with experience;
if you're an expert in a given API, you can answer most of these questions
in a minute or two even if you haven't heard the exact question before.
But you can't expect even experienced developers to be an expert in all
possible libraries; if you know everything there is to know about Qt,
ALSA, x264, ffmpeg, OpenGL, VA-API, libusb, microhttpd and Lua
(in addition to C++11, of course), I'm sure you'd be a great fit for
Nageru, but I'd wager that pretty few developers fit that bill.
I've written C++ for almost 20 years now (almost ten of them professionally),
and that experience certainly helps boost productivity, but I can't
say I expect a 10x reduction in my own development time at any point.
You could also argue, of course, that spending so much time on the editor
is wasted, since most users will only ever see it once. But here's the
point; it's not actually a lot of time. The only reason why it seems
like so much is that I bothered to write two paragraphs about it;
it's not a particular pain point, it just adds to the total. Also,
the first impression matters a lot—if the user can't get the editor
to work, they also can't get the MIDI controller to work, and are likely
to just go do something else.
A common misconception is that just switching languages or using libraries
will help you a lot. (Witness the never-ending stream of software that
advertises “written in Foo” or “uses Bar” as if it were a feature.)
For the former, note that nothing I've said so far is specific to my choice
of language (C++), and I've certainly avoided a bunch of battles by making
that specific choice over, say, Python. For the latter, note that most of these problems are actually related
to library use—libraries are great, and they solve a bunch of problems
I'm really glad I didn't have to worry about (how should each button look?),
but they still give their own interaction problems. And even when you're a
master of your chosen programming environment, things still take time,
because you have all those decisions to make on top of your libraries.
Of course, there are cases where libraries really solve your entire problem
and your code gets reduced to 100 trivial lines, but that's really only when
you're solving a problem that's been solved a million times before. Congrats
on making that blog in Rails; I'm sure you're advancing the world. (To make
things worse, usually this breaks down when you want to stray ever so
slightly from what was intended by the library or framework author. What
seems like a perfect match can suddenly become a development trap where you
spend more of your time trying to become an expert in working around the
given library than actually doing any development.)
The entire thing reminds me of the famous essay
No Silver Bullet by Fred
Brooks, but perhaps even more so, this quote from
John Carmack's .plan has
stuck with me (incidentally about mobile game development in 2006,
but the basic story still rings true):
To some degree this is already the case on high end BREW phones today. I have
a pretty clear idea what a maxed out software renderer would look like for
that class of phones, and it wouldn't be the PlayStation-esq 3D graphics that
seems to be the standard direction. When I was doing the graphics engine
upgrades for BREW, I started along those lines, but after putting in a couple
days at it I realized that I just couldn't afford to spend the time to finish
the work. "A clear vision" doesn't mean I can necessarily implement it in a
very small integral number of days.
In a sense, programming is all about what your program should do in the first
place. The “how” question is just the “what”, moved down the chain of
abstractions until it ends up where a computer can understand it, and at that
point, the three words “multichannel audio support” have become those 9,000
lines that describe in perfect detail what's going on.
Wed, 23 Jul 2014 - The sad state of Linux Wi-Fi
I've been using 802.11 on Linux now for over a decade, and to be honest, it's
still a pretty sad experience. It works well enough that I mostly don't
care... but when I care, and try to dig deeper, it always ends up in the
answer “this is just crap”.
I can't say exactly why this is; between the Intel cards I've always been
using, the Linux drivers, the firmware, the mac80211 layer, wpa_supplicant
and NetworkManager, I have no idea who is supposed to get all these things
right, and I have no idea how hard or easy they actually are to pull off.
But there are still things annoying me frequently that we should really
have gotten right after ten years or more:
- Why does my Intel card consistently pick 2.4 GHz over 5 GHz? The 5 GHz
signal is just as strong, and it gives a less crowded 40 MHz channel
(twice the bandwidth, yay!) instead of the busy 20 MHz channel the 2.4 GHz
one has to share. The worst part is, if I use an access point with
band-select (essentially forcing the initial connection to be to 5
GHz—this is of course extra fun when the driver sees ten APs
and tries to connect to all of them over 2.4 in turn before trying
5 GHz), the driver still swaps onto 2.4 GHz a few minutes later!
- Rate selection. I can sit literally right next to an AP and get a
connection on the lowest basic rate (which I've set to 11 Mbit/sec
for the occasion). OK, maybe I shouldn't trust the output of iwconfig
too much, since rate is selected per-packet, but then again, when
Linux supposedly has a really good rate selection algorithm (minstrel),
why are so many drivers using their own instead? (Yes, hello
“iwl-agn-rs”, I'm looking at you.)
- Connection time. I dislike OS X pretty deeply and think that many
of its technical merits are way overblown, but it's got one thing
going for it; it connects to an AP fast.
RFC4436 describes some of the
tricks they're using, but Linux uses none of them. In any case, even
the WPA2 setup is slow for some reason, it's not just DHCP.
- Scanning/roaming seems to be pretty random; I have no idea how much
thought really went into this, and I know it is a hard problem,
but it's not unusual at all to be stuck at some low-speed AP when
a higher-speed one is available. (See also 2.4 vs. 5 above.)
I'd love to get proper support for CCX (Cisco Client Extensions) here,
which makes this tons better in a larger Wi-Fi setting (since the access
point can give the client a lot of information that's useful for
roaming, e.g. “there's an access point on channel 52 that sends
its beacons every 100 ms with offset 54 from mine”, which means you
only need to swap channel for a few milliseconds to listen instead of
a full beacon period), but I suppose that's covered by licensing
or patents or something. Who knows.
With now a billion mobile devices running Linux and using Wi-Fi
all the time, maybe we should have solved this a while ago.
But alas. Instead we get access points trying to layer hacks upon
hacks to try to force clients into making the right decisions.
And separate ESSIDs for 2.4 GHz and 5 GHz.
Augh.
Thu, 08 May 2014 - Static library linking behavior
A static library (an .a file) is, at its heart, nothing special—it's
just a bunch of object files (.o files) in an archive, with an optional
index. Thus, I've always treated them mentally the same to a bunch of object
files, but it turns out there's an edge case where there is a difference between linking to
an .a file and a bunch of .o files (even if you discount the fact that
the .a file is a linker group and the .o files would need -( and -) around them
to get the same effect).
The difference manifests itself when you are building a shared library
(a .so file).
For an .o file, the linker does what you'd expect; any symbol that has
global linkage (ie., for C or C++: it's not marked as static or in an
anonymous namespace) and has default visibility (which, unfortunately,
means almost all symbols unless you're clever enough to use -fvisibility=hidden
and explicitly mark your exports with __attribute__((visibility("default")))
or similar),
will be exported in your shared library to external clients. This means,
among others, that it will take up space because the name and a few other
things need to be stored.
However, for an .a file, things are different! If you have a symbol in an .a file, and it's not used by anything
from an .o file (directly or indirectly), the linker will silently just
discard it as unused. It doesn't matter what visibility it has; it will
just not be there. (To be honest, I haven't checked if this is on the level
of a symbol, a section, an .o file within the library, or the entire .a file.)
The workaround for this is to add --whole-archive before the .a file in
question (and presumably --no-whole-archive after it, so the effect doesn't
leak into the following ones), which disables this behavior. However, I am
unsure if it also has other effects, like messing with visibility, so beware.
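For instance, a hypothetical link line (the flags have to be smuggled through
to the linker via -Wl, since the compiler driver does the linking):

g++ -shared -o libfoo.so foo.o -Wl,--whole-archive libbar.a -Wl,--no-whole-archive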
To be honest, I think this is insane behavior, but given that so few people
set visibility explicitly, I guess there would never be any pruning if
the behavior were different (and ELF symbol visibility is a relatively new
invention anyway), so I can understand why it is like that.
Of course, I figured this out the hard way. Poof, there goes a few hours of
my life.
Tue, 04 Feb 2014 - FOSDEM video stream goodiebag
Borrowing a tradition from TG, we have released
a video streaming goodiebag from FOSDEM 2014. In short, it contains all the
scripts we used for the streaming part (nothing from the video team itself,
although I believe most of what they do is developed out in the open).
If you've read my earlier posts on the subject, you'll know that it's all
incredibly rough, and we haven't cleaned it up much afterwards. So you get
the truth, but it might not be pretty :-) However, feedback is of course
welcome.
You can find it at
http://storage.sesse.net/fosdem14-video.tar.gz;
it's only 9 kB. Enjoy!
Sun, 02 Feb 2014 - FOSDEM video streaming, post-mortem
Wow, what a ride that was. :-)
I'm not sure if people generally are aware of it, but the video streaming
at FOSDEM this year came together on extremely short notice. I got word
late Wednesday that the video team was overworked and would not have the
manpower to worry about streaming, and consequently, that there would be
none (probably not even of the main talks, like last year).
I quickly conferred with Berge on IRC; we both agreed that something
as big as FOSDEM shouldn't be without at least rudimentary streams. Could we
do something about it? After all, all devrooms (save for some that could not
be, due to licensing issues) would be recorded using DVswitch anyway, where it's
trivial to just connect another sink to the master, and we both had extensive
experience doing streaming work from The Gathering.
So, we agreed to do a stunt project; either it would work or it would crash
and burn, but at least it would be within the playful spirit of free software.
The world outside does not stand still, and neither should we.
The FOSDEM team agreed to give us access to the streams, and let us use the
otherwise unused cycles on the “slave” laptops (the ones that just take in
a DV stream from the camera and send it to the master for mixing).
Since I work at Google, I was able to talk to the Google Compute Engine people,
who were able to turn around on extremely short notice and sponsor GCE resources
for the actual video distribution. This took a huge unknown out of the
equation for us; since GCE is worldwide and scalable, we'd be sure to have
adequate bandwidth for serving our viewers almost no matter how much load we got.
The rest was mainly piecing together existing components in new ways. I dealt
with the encoding (on VLC, using WebM, since that's what FOSDEM
wanted), hitting one or two really obscure bugs in the process, and
Berge dealt with all the setup of distribution (we used cubemap,
which had already been tuned for the rather unique needs of WebM during last
Debconf), parsing the FOSDEM schedule to provide live program information,
and so on. Being a team of two was near-ideal here; we already know each other
extremely well from previous work, and despite the frantic pace, everything
felt really relaxed and calm.
So, less than 72 hours after the initial “go”, the streaming laptops started
coming up in the various devrooms, and I rsynced over my encoding chroot
to each of them and fired up VLC, which then cubemap would pick up and send
on. And amazingly enough, it worked! We had a peak of about 380 viewers,
which is about 80% more than the peak of 212 last year (and this was with
almost no announcement before the conference). Amusingly, the most popular
stream by far was not a main track, but that of the Go devroom; at times,
they had over half the total viewers. (I never got to visit it myself, because
it was super-packed every time I went there.)
I won't pretend everything went perfect—we found a cubemap segfault
on the way, and also some other issues (such as initially not properly
restarting the encoding when the DVswitch master went down and up again).
But I'm extremely happy that the video team believed in us and gave us
the chance; it was fun, it was the perfect icebreaker when meeting new
people at FOSDEM, and hopefully, we let quite a few people sitting at home
learn something new or interesting.
Oh, and not the least, it made my own talk get streamed. :-)