Sat, 01 Jul 2023 - The bestest sine function, redux
A small followup to the last post:
- Sreepathi Pai pointed out that Z3 isn't the fastest solver when it comes to
floats; CVC5 is faster
but cannot optimize. However, OptiMathSAT,
while non-free, dethrones Z3 on this problem; the minimization time goes
down from eight hours to five minutes! (Pai also pointed to
sat-smt.codes, if you want to read 650+ pages
about SAT/SMT techniques.)
- I discovered that the desired range is not [0,16384) but [0,16384];
that is, you need to include 16384. This makes for slightly different
optimal polynomials.
With that in mind, here are the final, bestest coefficients given the
aforementioned constraints:
- Cosine: 1.0f + (x*x) * (-4.564684e-9f + (x*x) * 3.1372154e-18f); maxerr=7.3719024e-4
- Sine: x * (9.584001e-5f + (x*x) * (-1.4590816e-13f + (x*x) * 6.0535898e-23f)); maxerr=8.0764293e-5
The original LUT approach has maxerr 2.8767669e-4 (since it chops off the
lowest two bits; presumably some rounding would be better), so the sine
approach actually beats it in accuracy.
Thu, 29 Jun 2023 - Finding the BESTEST sine function for Nintendo 64
YouTube gave me the video “Finding the BEST sine function for Nintendo 64”
yesterday, and while I thought the title was very intriguing, I came away
a bit disappointed; the video is hardly about the specifics of Nintendo 64
at all, and the author mostly spends time comparing straight-up LUTs with
some hand-made splines under a proxy metric (no real comparative benchmarks, hardly any
numerics).
I have no idea what the optimal solution is, but it got me thinking: How good
can we get with the same basic formulas, just better constants?
The video is pretty sparse with actual source code and data, but I think
I've extracted enough info to go work on his “5th order sine” and “4th order
cosine” methods (both straight-up polynomials).
Let's start with the 5th order sine; it's basically an odd polynomial
(he does simple range reduction to [-pi/2,pi/2)). The implementation goes
roughly like this, taking in a value between 0 and 65535 inclusive:
float x = i * (float)(2.0 * M_PI / 65536.0); // Prescaler reverse-engineered from the video
float sin = x * (1.0f + x * x * (b + x * x * a));
He uses 1.0f specifically because it's ready in a register (so you don't need
to load the constant), but it's not hard to see that this is exactly the
same as dropping the prescaler and evaluating ax⁵ + bx³ + cx. So let's see what
Sollya thinks:
> fpminimax(sin(x*2*Pi/65536.0),[|x,x^3,x^5|],[|SG...|],[0;16384]);
Warning: For at least 2 of the constants displayed in decimal, rounding has happened.
x * (9.58634263952262699604034423828125e-5 + x^2 * (-1.46252571143166976153082714517950080335140228271484e-13 + x^2 * 6.1585575698811928868593794069233315902067715796875e-23))
We can bruteforce-evaluate this over all 16384 relevant inputs (corresponding
to 0 through pi/2), giving maximum absolute error 0.000108. Given that he
claims maximum absolute error of 0.000395 for his own version (nearly 4x
as high), and we saved the initial multiplication (though not the load),
this is a good start!
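A minimal sketch of that brute force in Python, using NumPy's float32 type
to mimic the single-precision arithmetic (the reference value is computed
in double):

import math
import numpy as np

a = np.float32(6.1585575698811928868593794069233315902067715796875e-23)
b = np.float32(-1.46252571143166976153082714517950080335140228271484e-13)
c = np.float32(9.58634263952262699604034423828125e-5)

max_err = 0.0
for i in range(16384):
    x = np.float32(i)
    xsq = x * x  # every operation here stays in float32
    approx = x * (c + xsq * (b + xsq * a))
    err = abs(float(approx) - math.sin(i * 2.0 * math.pi / 65536.0))
    max_err = max(max_err, err)
print(max_err)  # comes out around 0.000108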
But we're not looking for better. We're looking for BESTEST.
And we want to prove it.
So we must unleash… the SAT solver. Usually I use OR-Tools,
but for a project this bizarre, we need the big guns. We need Z3.
Z3 has full support for IEEE floats of any size, so we can start off by
declaring that a,b,c are floats (the N64 supports double just fine, but his
assembly strongly suggests he's working in single precision):
(declare-const a Float32)
(declare-const b Float32)
(declare-const c Float32)
We also need a maximum tolerable error. Since we're trying to beat Sollya,
let's start off at 0.0001:
(declare-const eps Float32)
(assert (= eps (fp #b0 #x71 #b10100011011011100010111))) ; 0.0001
Now we want to ask the solver to minimize the error. We need to do that
point by point. Let's first start with i=16383 (testing i=0 is meaningless,
since it will be zero no matter what the constants are):
(declare-const x16383 Float32)
(declare-const x16383sq Float32)
(declare-const y16383 Float32)
(declare-const yy16383 Float32)
(assert (= x16383 (fp #b0 #x8c #b11111111111110000000000)))
(assert (= x16383sq (fp #b0 #x9a #b11111111111100000000000)))
(assert (= y16383 (fp #b0 #x7F #b00000000000000000000000))) ; 1.0
(assert (= yy16383
    (fp.mul roundNearestTiesToEven x16383
        (fp.add roundNearestTiesToEven c
            (fp.mul roundNearestTiesToEven x16383sq
                (fp.add roundNearestTiesToEven b
                    (fp.mul roundNearestTiesToEven a x16383sq)))))))
(assert (fp.lt (fp.abs (fp.sub roundNearestTiesToEven y16383 yy16383)) eps))
So we're defining a constant for 16383.0f, and then one for the same value
squared (precomputing this creates fewer SAT clauses, and we're really
creating enough already!), and then the expected output value (y16383);
the true sine at this point is so close to 1 that it rounds to exactly
1.0f in single precision. We then define that yy16383 is the output of the
formula x * (c + xsq * (b + xsq * a)), done with standard IEEE 754
round-to-nearest, ties-to-even mode. Finally, we say that the
absolute difference between yy16383 and the expected answer (where the rounding
mode doesn't matter much) must be less than our tolerance.
Behind the scenes, Z3 will create a bazillion SAT clauses for us, presumably
with adders and normalization and whatever. The beauty of this is that we
should be able to match hardware exactly; no matter what may happen with
rounding or overflow or denormals or whatever, our simulation will give us
the right error. We could even model tricks like “we only care to load the
upper 16 bits of the float, to save on a MIPS instruction” and see if we
could still get a reasonable approximation. There are no limits (especially
not if P=NP and SAT admits a solution in polynomial time!).
There's one more clause we want:
(assert (fp.leq yy16383 one))
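(This assumes a helper constant one, pinned to 1.0f, declared earlier along
these lines:)

(declare-const one Float32)
(assert (= one (fp #b0 #x7F #b00000000000000000000000))) ; 1.0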
We don't want any value to exceed 1.0f, since his code relies on using
sqrt(1 - x²) to calculate the cosine from sine, and we can't take the
square root of a negative number. (I think using the square root here
is rather dubious performance-wise, but we'll roll with it; again,
we only want to change constants.)
Now, we could repeat this for all the other 16382 points. However,
that would create literally millions of SAT clauses and take forever
to work out. So I just put in 16 roughly equidistant points and let it
roll:
(check-sat)
(get-value (a b c))
And after four minutes or so of calculation, Z3 (well, my script
wrapper around Z3) told me that it was indeed possible to get under
0.0001 maximum absolute error:
sat
((a (fp #b0 #x35 #b00100101101100010000110)) ; 0.000000000000000000000060733997
(b (fp #b1 #x54 #b01001000110000100101110)) ; -0.00000000000014599842
(c (fp #b0 #x71 #b10010010000001010001001))) ; 0.00009584899
Now, of course, this was only for those points, but there's a limit
to how fast the error polynomial can ripple, so it turns out that the
error over the full input range wasn't far away (it was 0.00010049343).
But better still isn't what we want! BESTEST. Let me take a little
detour towards the other simple polynomial, namely the “4th order cosine”
(which calculates sin from cos, instead of the other way round).
We can implement this in exactly the same way:
(fp.add roundNearestTiesToEven one
    (fp.mul roundNearestTiesToEven x16383sq
        (fp.add roundNearestTiesToEven b
            (fp.mul roundNearestTiesToEven a x16383sq))))
That is, 1 + x_sq * (b + a * x_sq). Here there's no
prescaler in his version (and the coefficients are visible in the video),
so we've kept the constraint that one of the constants must be 1.0f,
and thus Z3 will have less work to do. So this time, instead of asking
for something better than a fixed target, let's ask it to push eps as
low as possible:
(assert (fp.isPositive eps))
(declare-const eps_e (_ BitVec 8))
(declare-const eps_m (_ BitVec 23))
(assert (= eps (fp #b0 eps_e eps_m)))
(minimize eps_e)
(minimize eps_m)
It seems roundabout that we need to declare separate variables
for the exponent and mantissa and then minimize those,
instead of just minimizing the float directly, but evidently,
this is the standard trick. (It works because a positive IEEE float's
bit pattern, read as an integer, orders the same way as its value,
so minimizing the exponent first and then the mantissa minimizes
the float itself.) So we let Z3 churn for half an
hour, and lo and behold:
sat
((a (fp #b0 #x44 #b11001110111101011111101)) ; 0.0000000000000000031371447
(b (fp #b1 #x63 #b00111001101011100110110)) ; -0.000000004564664
(eps (fp #b0 #x74 #b10000001110110000000001))) ; 0.0007359386
So now it's found the optimal epsilon all by itself! Of course, it still
is over that set of points. But the truth is not far behind; the actual
worst case has error 0.0007380843. So it's way less than a percent off.
Unfortunately Z3 is much slower when we ask it to minimize; seemingly,
it switches to a more powerful but slower solver (SMT instead of plain SAT).
The alternative would be to drive it ourselves by repeated queries with
binary search or similar for epsilon, but OK, half an hour per run is
entirely tolerable. But we want the bestest. So what can we do?
Well, we find that the worst case for this polynomial is i=7231.
So we just add that to our list of points to check. This is rather
similar to how the Remez exchange algorithm (used by, among others, Sollya)
works; it is bound to fix at least that single point, and hopefully
the procedure will converge long before we add all 16383 of them.
(That sounds very plausible even without any deep proof; again, the
error function can only oscillate so fast, so there are bound to be
tons of values that will never be close to the max error.)
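In driver-script form, the refinement loop goes roughly like this sketch
(Python, with hypothetical helper names; the real thing is a script
wrapper around Z3):

points = initial_equidistant_points()           # e.g. 16 points in 1..16383
while True:
    a, b, eps = solve_smt_minimize_eps(points)  # the (minimize ...) query from above
    worst_i, worst_err = brute_force_worst_case(a, b)
    if worst_i in points:
        break  # the binding point is already constrained, so we are optimal
    points.append(worst_i)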
And indeed, after twelve iterations or so, we get:
((a (fp #b0 #x44 #b11001110111110101000110)) ; 0.0000000000000000031372656
(b (fp #b1 #x63 #b00111001101011101110110)) ; -0.0000000045646926
(eps (fp #b0 #x74 #b10000010010111011110000))) ; 0.00073693599
This has max error 0.00073693593 (remember, we asked for the error to be
less than eps, and that eps is indeed the next possible float after
the max error), and that error is in i=16383, which was already a point
we asked it to optimize,
and so this is THE BESTEST 4th-order cosine approximation under the given
constraints. With his best cosine being 0.002787, it's a nice 3–4x
improvement on error just by swapping out constants.
The sine is much slower to minimize (five hours or so per iteration),
so it hasn't converged yet; still waiting for the absolute truth there.
So for now, the one above is what we've got. At least we have a lower bound
saying that it's less than 1% off the best possible one (asking for anything
lower produces an unsat result). But the bestest 4th-order cosine
approximation is indeed 1.0f + (x * x) * (-0.0000000045646926f + (x * x) * 0.0000000000000000031372656f).
On the Nintendo 64.
Tue, 30 Oct 2018 - SAT solvers for fun and fairness
Trøndisk 2018, the
first round of the Norwegian ultimate series (the frisbee sport, not the
fighting style) is coming up this weekend! Normally that would mean that
I would blog something about all the new and exciting things we are doing
with Nageru for the stream, but for now,
I will just point out that the stream is on plastkast.no
and will be live from 0945–1830 CET on Saturday (group stage) and 1020–1450
(playoffs) on Sunday.
Instead, I wanted to talk about a completely different but interesting
subproblem we had to solve; how do you set up a good schedule for the group
stages? There are twelve teams, pre-seeded and split into two groups (call them A0–A5
and B0–B5) that are to play
round-robin, but there are only two fields—and only one of them is streamed.
You want a setup that maximizes fairness in the sense that people get
adequate rest between matches, and also more or less equal number of streamed
games. Throw in that one normally wants the more exciting games last, and
it starts to get really tricky to make something good by hand. Could we do
it programmatically?
My first thought was that since this is all about the ordering, it sounded
like a variant of the infamous travelling salesman problem. It's
well-known that TSP is NP-hard (or NP-complete, but I won't bother with
the details), but there are excellent heuristic implementations in practice.
In particular, I had already used OR-Tools, Google's optimization
toolkit, to solve TSP problems in the past; it contains a TSP solver that
can deal with all sorts of extra details, like multiple agents to travel
(in our case, multiple fields), subconstraints on ordering and so on.
(OR-Tools probably doesn't contain the best TSP solver in the world—there
are specialized packages that do even better—but it's much better than
anything I could throw together myself.)
However, as I tried figuring out something, and couldn't quite get it to
fit (there are so many extra nonlocal constraints), I saw that the OR-Tools
documentation had a subsection on scheduling problems. It turns
out this kind of scheduling can be represented as a so-called SAT
(satisfiability) problem, and OR-Tools also has a SAT solver. (SAT, in its
general forms, is also NP-hard, but again, there are great heuristics.)
I chose the Python frontend, which probably wasn't the best idea in the
world (it's poorly documented, and I do wonder when Python will take the
step into the 90s and make spelling errors in variables into compile-time
errors instead of throwing a runtime exception four hours into a
calculation), but that's what the documentation used, and the backend
is in C++ anyway, so speed doesn't matter.
The SAT solver works by declaring variables and various constraints
between them, and then asking the machine to either come up with a solution
that fits, or to prove that it's not possible. Let's have a look at some
excerpts to get a feel for how it all works:
We know we have 15 rounds, with two fields in each, and every field should
hold a match. So let's generate 30 such variables, each containing
a match number (we use the convention that the matches in slots 0, 2, 4, 6,
etc. are on the stream field and those in slots 1, 3, 5, 7, etc. are played
in parallel on the other field):
matchnums = []
for match_idx in range(num_matches):
    matchnums.append(model.NewIntVar(0, num_matches - 1, "matchnum%d" % (match_idx)))
So this is 30 variables, each going from 0 to 29, inclusive.
We start with a fairly obvious constraint; we can only play each match once:
model.AddAllDifferent(matchnums)
The SAT solver might make this into a bunch of special constraints
underneath, or it might not. We don't care; it's abstracted away for us.
Now, it's not enough to just find any ordering—after all, we want to
find an ordering with some constraints. However, the constraints are
rarely about the match numbers, but more about the teams that play in
those matches. So we'll need some helper variables. For instance,
it would be interesting to know which teams play in each match:
home_teams = []
away_teams = []
for match_idx in range(num_matches):
    home_teams.append(model.NewIntVar(0, num_teams - 1, "home_team_match%i" % (match_idx)))
    away_teams.append(model.NewIntVar(0, num_teams - 1, "away_team_match%i" % (match_idx)))
    model.AddElement(matchnums[match_idx], home_teams_for_match_num, home_teams[match_idx])
    model.AddElement(matchnums[match_idx], away_teams_for_match_num, away_teams[match_idx])
AddElement() here simply is an indexing operation; since there's no difference
between home and away teams for us, we've just pregenerated all the matches as
A0 vs. A1, A0 vs. A2, etc. up until A3 vs. A5 and A4 vs. A5, and then similarly
for the other group. The “element” constraint makes sure that e.g. home_team_match0 = home_teams_for_match_num[matchnum0].
Note that even though I think of this as an assignment where the home team
for match 0 follows logically from which match is being played as match 0,
it is a constraint that goes both ways; the solver is free to do inference
that way, or instead first pick the home team and then deal with the
consequences for the match number. (E.g., if it picks A4 as the home team,
the match number most certainly needs to be 14, which corresponds to A4–A5.)
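(For reference, pregenerating that match list could look something like the
following; numbering the B teams as 6–11 is my assumption here:)

home_teams_for_match_num = []
away_teams_for_match_num = []
for group_start in (0, 6):  # A teams are 0-5, B teams are 6-11
    for i in range(6):
        for j in range(i + 1, 6):
            home_teams_for_match_num.append(group_start + i)
            away_teams_for_match_num.append(group_start + j)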
We're not quite done with the helpers yet; we want to explode these variables
into booleans:
home_team_in_match_x_is_y = [[
    model.NewBoolVar('home_team_in_match_%d_is_%d' % (match_idx, team_idx)) for team_idx in range(num_teams)
] for match_idx in range(num_matches)]
for match_idx in range(num_matches):
    model.AddMapDomain(home_teams[match_idx], home_team_in_match_x_is_y[match_idx])
and similarly for away team and match number.
So now we have a bunch of variables of the type “is the home team in match 6
A4 or not?”. Finally we can make some interesting constraints! For instance,
we've decided already that the group finals (A0–A1 and B0–B1) should be the
last two matches of the day, and on the stream field:
model.AddBoolOr([match_x_has_num_y[28][0], match_x_has_num_y[28][15]])
model.AddBoolOr([match_x_has_num_y[26][0], match_x_has_num_y[26][15]])
This is a hard constraint; we don't have a solution unless matches 0 and 15
fill the stream-field slots of the last two rounds (and we earlier said that
they must be different).
We're going to need even more helper variables now. It's useful to know whether
a team is playing at all in a given round; that's the case if they are the
home or away team on either field:
plays_in_round = {}
for team_idx in range(num_teams):
    plays_in_round[team_idx] = {}
    for round_idx in range(num_rounds):
        plays_in_round[team_idx][round_idx] = model.NewBoolVar('plays_in_round_t%d_r%d' % (team_idx, round_idx))
        model.AddMaxEquality(plays_in_round[team_idx][round_idx], [
            home_team_in_match_x_is_y[round_idx * 2 + 0][team_idx],
            home_team_in_match_x_is_y[round_idx * 2 + 1][team_idx],
            away_team_in_match_x_is_y[round_idx * 2 + 0][team_idx],
            away_team_in_match_x_is_y[round_idx * 2 + 1][team_idx]])
Now we can establish a few other very desirable properties; in particular,
each team should never need to play two matches back-to-back:
for round_idx in range(num_rounds - 1):
    for team_idx in range(num_teams):
        model.AddBoolOr([plays_in_round[team_idx][round_idx].Not(), plays_in_round[team_idx][round_idx + 1].Not()])
Note that there's nothing here that says the same team can't be assigned
to play on both fields at the same time! However, this is taken care of
by some constraints on the scheduling that I'm not showing for brevity
(in particular, we established that each round must have exactly one game
from group A and one from group B).
Now we're starting to get out of “hard constraint” territory and more
into things that would merely be nice. For this, we need objectives. One such
objective is what I call “tiredness”; playing matches nearly back-to-back
(ie., game - rest - game) should carry a penalty, and the solution should
try to avoid it.
tired_matches = []
for round_idx in range(num_rounds - 2):
    for team_idx in range(num_teams):
        tired = model.NewBoolVar('team_%d_is_tired_in_round_%d' % (team_idx, round_idx))
        model.AddMinEquality(tired, [plays_in_round[team_idx][round_idx], plays_in_round[team_idx][round_idx + 2]])
        tired_matches.append(tired)
sum_tiredness = sum(tired_matches)
So here we have helper variables that are being set to the minimum
(effectively a logical AND) of “do I play in round N” and “do I play in round
N + 2”. Tiredness is simply a sum of those 0–1 variables, which we can seek
to minimize:
model.Minimize(sum_tiredness)
You may wonder how we went from a satisfiability problem to an
optimization problem. Conceptually, however, this isn't so hard.
Just ask the solver to find any solution, e.g. something with sum_tiredness 20.
Then simply add a new constraint saying sum_tiredness <= 19 and ask
for a re-solve (or continue). Eventually, the solver will either come back
with a better solution (in which case you can tighten the constraint
further), or the message that you've asked for something impossible, in
which case you know you have the optimal solution. (I have no idea whether
modern SAT solvers actually work this way internally, but again, conceptually
it's simple.)
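In sketch form, with solve_with_bound as a hypothetical helper that rebuilds
the model with an extra upper-bound constraint and solves it:

def find_optimum(solve_with_bound):
    # solve_with_bound(bound) returns a solution with objective <= bound,
    # or None if that is infeasible; a bound of None means anything feasible.
    best = solve_with_bound(None)
    while best is not None:
        better = solve_with_bound(best.objective - 1)
        if better is None:
            return best  # nothing strictly better exists, so this is optimal
        best = better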
As an extra bonus, you do get incrementally better solutions as you go.
These problems are theoretically very hard—in fact, I've now let it run for fun
for a week, and it still hasn't found a provably optimal solution—and in
practice, you just take some intermediate solution that is “good enough”.
There are always constraints that you don't bother adding to the program
anyway, so there's some eyeballing involved, but it still feels like a fairer
process than trying to nudge it by hand.
We had many more objectives, some of them contradictory (e.g., games between
more closely seeded opponents are more “exciting”, and should be put last—but
they should also be put on the stream, so do you put them early on the stream
field or late on the non-stream field?). It's hard to weigh all the factors
against each other, but in the end, I think we ended up with something pretty
nice. Every team gets to play two or three times (out of five) on the stream,
only one team needs to be “tired” twice (and I checked; if you ask for a
hard maximum of once for every team, it comes back pretty fast as
infeasible), many of the tight matches are scheduled near the end… and most importantly, we don't have to play the first matches
while I'm still debugging the stream. :-)
You can see the final schedule
here.
Good luck to everyone, and consider using a SAT solver next time you have
a thorny scheduling problem!
Tue, 12 Sep 2017 - rANS encoding of signed coefficients
I'm currently trying to make sense of some still image coding
(more details to come at a much later stage!), and for a variety
of reasons, I've chosen to use rANS as the entropy coder. However,
there's an interesting little detail that I haven't actually seen
covered anywhere; maybe it's just because I've missed something,
or maybe because it's too blindingly obvious, but I thought I would
document what I ended up with anyway. (I had hoped for something
even more elegant, but I guess the obvious would have to do.)
For those that don't know rANS coding, let me try to handwave it
as much as possible. Your state is typically a single word
(in my case, a 32-bit word), which is refilled from the input stream
as needed. The encoder and decoder work in reverse order of each other;
let's just talk about the decoder. Basically it works by looking
at the lowest 12 (or whatever) bits of the decoder state, mapping
each of those 2^12 slots to a decoded symbol. More common symbols
are given more slots, proportionally to the frequency. Let me just
write a tiny, tiny example with 2 bits and three symbols instead,
giving four slots:
Lowest bits | Symbol
00          | 0
01          | 0
10          | 1
11          | 2
Note that the zero coefficient here maps to one out of two slots
(ie., a range); you don't choose which one yourself, the encoder
stashes some information in there (which is used to recover the
next control word once you know which symbol there is).
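In sketch form, one decode step of such a table-based decoder (with n bits
of slot precision; the names here are illustrative, not from any particular
implementation) looks something like:

def decode_step(x, n, symbol_of_slot, freq, cumfreq):
    slot = x & ((1 << n) - 1)  # the lowest n bits pick a slot
    s = symbol_of_slot[slot]
    x = freq[s] * (x >> n) + slot - cumfreq[s]  # recover the previous state
    # (refill x from the input stream here if it fell below the lower bound)
    return s, x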
Now for the actual problem: When storing DCT coefficients, we
typically want to also store a sign (ie., not just 1 or 2, but
also -1/+1 and -2/+2). The statistical distribution is symmetrical,
so the sign bit is incompressible (except that of course there's
no sign bit needed for 0). We could have done this by introducing new symbols
-1 and -2 in addition to our three other ones, but this means we'll
need more bits of precision, and accordingly larger look-up tables
(which is negative for performance). So let's find something better.
We could also simply store it separately somehow; if the coefficient
is non-zero, store the bits in some separate repository. Perhaps
more elegantly, you can encode a second symbol in the rANS stream
with probability 1/2, but this is more expensive computationally.
But both of these have the problem that they're divergent in terms
of control flow; nonzero coefficients potentially need to do a lot
of extra computation and even loads. This isn't nice for SIMD, and
it's not nice for GPU. It's generally not really nice.
The solution I ended up with was simulating a larger table with
a smaller one. Simply rotate the table so that the zero symbol
has the top slots instead of the bottom slots, and then replicate
the rest of the table. For instance, take this new table:
Lowest bits | Symbol
000         | 1
001         | 2
010         | 0
011         | 0
100         | 0
101         | 0
110         | -1
111         | -2
(The observant reader will note that this doesn't describe the
exact same distribution as last time—zero has twice the relative
frequency as in the other table—but ignore that for the time
being.)
In this case, the bottom half of the table doesn't actually need
to be stored! We know that if the three bottom bits are >= 110
(6 in decimal), we have a negative value, can subtract 6, and then
continue decoding. If we go past the end of our 2-bit table
despite that, we know we are decoding a zero coefficient (which
doesn't have a sign), so we can just clamp the read; or for a GPU,
reads out-of-bounds on a texture will typically return 0 anyway.
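A sketch of that slot-to-symbol step, with the table above (only the first
four entries are actually stored):

TABLE = [1, 2, 0, 0]  # slots 000..011; everything else is derived

def slot_to_symbol(slot):
    if slot >= 6:               # slots 110 and 111 hold the negative values
        return -TABLE[slot - 6]
    return TABLE[min(slot, 3)]  # slots 100/101 read past the end; clamp to 0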
So it all works nicely, and the divergent I/O is gone.
If this piqued your interest, you probably want to read up on
rANS in general; Fabian Giesen (aka ryg) has some notes
that work as a good starting point, but beware; some of this is
pretty confusing. :-)
Sun, 25 Dec 2016 - Cracking a DataEase password
I recently needed to get access to a DataEase database; the person I helped
was the legitimate owner of the data, but had forgotten the password,
as the database was largely from 1996. There are various companies
around the world that seem to do this, or something similar (like give
you an API), for a usually unspecified fee; they all have very 90s homepages
and in general seem like they have gone out of business a long time ago.
And I wasn't prepared to wait.
For those of you who don't know DataEase, it's a sort-of relational database
for DOS that had its heyday in the late 80s and early 90s (being sort of
the cheap cousin of dBase); this is before SQL gained traction as the
standard query language, before real multiuser database access, and before
variable-width field storage.
It is also before reasonable encryption. Let's see what we can do.
DataEase has a system where tables are mapped through the data dictionary,
which is a table on its own. (Sidenote: MySQL pre-8.0 still does not have
this.) This is the file RDRRTAAA.DBM; I don't really know what RDRR stands
for, but T is the “database letter” in case you wanted more than one database
in the same directory, and AAA, AAB, AAC etc. is a counter for when a table
grows too big for one file. (There's also .DBA files for the structure of
non-system tables, and then some extra stuff for indexes.)
DBM files are pretty much the classical, fixed-length 80s-style database
files; each row has some flags (I believe these are for e.g. “row is
deleted”) and then just the rows in fixed format right after each other.
For instance, here's one I created as part of testing (just the first few
lines of the hexdump are shown):
00000000: 0e 00 01 74 65 73 74 62 61 73 65 00 00 00 00 00 ...testbase.....
00000010: 00 00 00 00 00 00 00 73 46 cc 29 37 00 09 00 00 .......sF.)7....
00000020: 00 00 00 00 00 43 3a 52 44 52 52 54 41 41 41 2e .....C:RDRRTAAA.
00000030: 44 42 4d 00 00 01 00 0e 00 52 45 50 4f 52 54 20 DBM......REPORT
00000040: 44 49 52 45 43 54 4f 52 59 00 00 00 00 00 1c bd DIRECTORY.......
00000050: d4 1a 27 00 00 00 00 00 00 00 00 00 43 3a 52 45 ..'.........C:RE
00000060: 50 4f 54 41 41 41 2e 44 42 4d 00 00 01 00 0e 00 POTAAA.DBM......
00000070: 52 65 6c 61 74 69 6f 6e 73 68 69 70 73 00 00 00 Relationships...
Even without going in-depth, we can see the structure here; there's
“testbase” which maps to C:RDRRTAAA.DBM (the RDRR itself), there's a table
called “REPORT DIRECTORY” that maps to C:REPOTAAA.DBM, and then more stuff
after that, and so on.
However, other tables are not so easily read, because you can ask DataEase
to encrypt a table. Let's look at such an encrypted table, like the “Users”
table (containing usernames, passwords—not password hashes—and some extra
information like access level), which is always encrypted:
00000000: 0c 01 9f ed 94 f7 ed 34 ba 88 9f 78 21 92 7b 34 .......4...x!.{4
00000010: ba 88 0f d9 94 05 1e 34 ba 88 a0 78 21 92 7b 34 .......4...x!.{4
00000020: e2 88 9f 78 21 92 7b 34 ba 88 9f 78 21 92 7b 34 ...x!.{4...x!.{4
00000030: ba 88 9f 78 21 92 7b 34 ba 88 9f 78 21 92 7b ...x!.{4...x!.{
Clearly, this isn't very good encryption; it uses a very short, repetitive
key of eight bytes (64 bits). (The data is mostly zero padding, which makes
it much easier to spot this.) In fact, in actual data tables, only five of
these bytes are set to a non-zero value, which means we have a 40-bit key;
export controls?
My first assumption here was of course XOR, but through some experimentation,
it turned out what you need is actually 8-bit subtraction (with wraparound).
The key used is derived from both a database key and a per-table key,
both stored in the RDRR; again, if you disassemble, I'm sure you can find the
key derivation function, but that's annoying, too. Note, by the way, that
this precludes making an attack by just copying tables between databases,
since the database key is different.
So let's do a plaintext attack. If you assume the plaintext of the bottom row
is all padding, that's your key and here's what you end up with:
00000000: 52 79 00 75 73 65 72 00 00 00 00 00 00 00 00 00 Ry.user.........
00000010: 00 00 70 61 73 73 a3 00 00 00 01 00 00 00 00 00 ..pass..........
00000020: 28 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 (...............
00000030: 00 00 00 00 00 00 00 00 ........
Not bad, eh? Actually the first byte of the key here is wrong as far as I
know, but it didn't interfere with the fields, so we have what we need to
log in. (At that point, we've won, because DataEase will helpfully decrypt
everything transparently for us.)
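The attack itself is only a few lines; here's a sketch of the idea in Python,
assuming the 8-byte key and the bytewise subtraction described above:

def decrypt(data, key):
    # plaintext = ciphertext - key, bytewise with 8-bit wraparound
    return bytes((c - key[i % len(key)]) & 0xFF for i, c in enumerate(data))

# If some aligned 8-byte stretch is known to be all-zero padding,
# its ciphertext is the key itself (the offset here is hypothetical):
key = data[0x28:0x30]
print(decrypt(data, key))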
However, there's a twist; if the password is longer than four characters,
the entire decryption of the Users table changes. Of course, we could run
our plaintext attack against every data table and pick out the information
by decoding the structure, but again; annoying. So let's see what it looks
like if we choose “passs” instead:
00000000: 0e 01 9f 7a ae 9e 21 f5 08 63 07 6d a3 a1 17 5d ...z..!..c.m...]
00000010: 70 cb df 36 7e 7c 91 c5 d8 33 d8 3d 73 71 e7 2d p..6~|...3.=sq.-
00000020: 7b 9b 3f a5 db d9 4f 95 a8 03 a7 0d 43 41 b7 fd {.?...O.....CA..
00000030: 10 6b 0f 75 ab a9 1f 65 78 d3 77 dd 13 11 87 .k.u...ex.w....
Distinctly more confusing. At this point, of course, we know at which byte
positions the username and password start, so if we wanted to, we could just try
setting the start byte of the password to every possible byte in turn until
we hit 0x00 (DataEase truncates fields at the first zero byte), which would
allow us to get in with an empty password. However, I didn't know the
username either, and trying two bytes would mean 65536 tries, and I wasn't
up for automating macros through DOSBox. So an active attack wasn't
too tempting.
However, we can look at the last hex byte
(where we know the plaintext is 0); it goes 0x5d, 0x2d, 0xfd... and some
other bytes go 0x08, 0xd8, 0xa8, 0x78, and so on. So clearly there's an
obfuscation here where we have a per-line offset that decreases by 0x30
per line. (Actually, the increase/decrease per line seems to be derived from
the key somehow, too.) If we remove that, we end up with:
00000000: 0e 01 9f 7a ae 9e 21 f5 08 63 07 6d a3 a1 17 5d ...z..!..c.m...]
00000010: a0 fb 0f 66 ae ac c1 f5 08 63 08 6d a3 a1 17 5d ...f.....c.m...]
00000020: db fb 9f 05 3b 39 af f5 08 63 07 6d a3 a1 17 5d ....;9...c.m...]
00000030: a0 fb 9f 05 3b 39 af f5 08 63 07 6d a3 a1 17 ....;9...c.m...
Well, OK, this wasn't much more complicated; our fixed key is now 16 bytes
long instead of 8 bytes long, but apart from that, we can do exactly the same
plaintext attack. (Also, it seems to change per-record now, but we don't see
it here, since we've only added one user.) Again, assume the last line is
supposed to be all 0x00 and thus use that as a key (plus the last byte from
the previous line), and we get:
00000000: 6e 06 00 75 73 65 72 00 00 00 00 00 00 00 00 00 n..user.........
00000010: 00 00 70 61 73 73 12 00 00 00 01 00 00 00 00 00 ..pass..........
00000020: 3b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ;...............
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...............
Well, OK, it wasn't perfect; we got “pass\x12” instead of “passs”, so we
messed up somehow. I don't know exactly why the fifth character gets messed
up like this; actually, it cost me half an hour of trying because the
password looked very real but the database wouldn't let me in, but
eventually, we just guessed at what the missing letter was supposed to be.
So there you have it; practical small-scale cryptanalysis of DOS-era
homegrown encryption. Nothing advanced, but the user was happy about getting
the data back after a few hours of work. :-)
Wed, 26 Oct 2016 - Why does software development take so long?
Nageru 1.4.0 is out (and on its way through
the Debian upload process right now), so now you can do live video mixing with multichannel
audio to your heart's content. I've already blogged about most of the
interesting new features, so instead, I'm trying to answer a question:
What took so long?
To be clear, I'm not saying 1.4.0 took more time than I really anticipated
(on the contrary, I pretty much understood the scope from the beginning,
and there was a reason why I didn't go for building this stuff into 1.0.0);
but if you just look at the changelog from the outside, it's not immediately
obvious why “multichannel audio support” should take the better part of three
months of development. What I'm going to say is of course going to be obvious
to most software developers, but not everyone is one, and perhaps my
experiences will be illuminating.
Let's first look at some obvious things that aren't the case: First of all,
development is not primarily limited by typing speed. There are about 9,000
lines of new code in 1.4.0 (depending a bit on how you count), and if it was
just about typing them in, I would be done in a day or two. On a good
keyboard, I can type plain text at more than 800 characters per minute—but
you hardly ever write code for even a single minute at that speed. Just as
when writing a novel, most time is spent thinking, not typing.
I also didn't spend a lot of time backtracking; most code I wrote actually
ended up in the finished product as opposed to being thrown away. (I'm not
as lucky in all of my projects.) It's pretty
common to do so if you're in an exploratory phase, but in this case, I had a
pretty good idea of what I wanted to do right from the start, and that plan
seemed to work. This wasn't a difficult project per se; it just needed
to be done (which, in a sense, just increases the mystery).
However, even if this isn't at the forefront of science in any way (most code
in the world is pretty pedestrian, after all), there's still a lot of
decisions to make, on several levels of abstraction. And a lot of those
decisions depend on information gathering beforehand. Let's take a look at
an example from late in the development cycle, namely support for using MIDI
controllers instead of the mouse to control the various widgets.
I've kept a pretty meticulous TODO list; it's just a text file on my laptop,
but it serves the purpose of a ghetto bugtracker. For 1.4.0, it contains 83
work items (a single-digit number is not ticked off, mostly because I decided
not to do those things), which corresponds roughly 1:2 to the number of
commits. So let's have a look at what went into the ~20 MIDI controller items.
First of all, to allow MIDI controllers to influence the UI, we need a way
of getting to it. Since Nageru is single-platform on Linux, ALSA is the
obvious choice (if not, I'd probably have to look for a library to put
in-between), but seemingly, ALSA has two interfaces (raw MIDI and sequencer).
Which one do you want? It sounds like raw MIDI is what we want, but actually,
it's the sequencer interface (it does more of the MIDI parsing for you,
and generally is friendlier).
The first question is where to start picking events from. I went the simplest
path and just said I wanted all events—anything else would necessitate a UI,
a command-line flag, figuring out if we wanted to distinguish between
different devices with the same name (and not all devices potentially even
have names), and so on. But how do you enumerate devices? (Relatively simple,
thankfully.) What do you do if the user inserts a new one while Nageru is
running? (Turns out there's a special device you can subscribe to that will
tell you about new devices.) What if you get an error on subscription?
(Just print a warning and ignore it; it's legitimate not to have access to
all devices on the system. By the way, for PCM devices, all of these answers
are different.)
So now we have a sequencer device, how do we get events from it? Can we do it in the main loop? Turns out
it probably doesn't integrate too well with Qt, but it's easy enough to put
it in a thread. The class dealing with the MIDI handling now needs locking;
what mutex granularity do we want? (Experience will tell you that you nearly
always just want one mutex. Two mutexes give you all sorts of headaches with
ordering them, and nearly never gives any gain.) ALSA expects us to poll()
a given set of descriptors for data, but on shutdown, how do you break out
of that poll to tell the thread to go away? (The simplest way on Linux is
using an eventfd.)
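The real code is C++, but the eventfd trick is easy enough to sketch in
Python as well (os.eventfd needs Python 3.10+):

import os, select, struct

efd = os.eventfd(0)                     # eventfd(2) underneath
poller = select.poll()
poller.register(efd, select.POLLIN)
# ...register the descriptors ALSA handed us here as well...

def wake_midi_thread():                 # call at shutdown, from any thread
    os.write(efd, struct.pack('Q', 1))  # eventfd writes must be 8 bytes

# in the MIDI thread:
done = False
while not done:
    for fd, _event in poller.poll():
        if fd == efd:
            done = True  # shutdown requested; stop instead of blocking again
        # else: drain MIDI events from the ALSA descriptor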
There's a quirk where if you get two or more MIDI messages right after each
other and only read one, poll() won't trigger to alert you there are more
left. Did you know that? (I didn't. I also can't find it documented. Perhaps
it's a bug?) It took me some looking into sample code to find it. Oh, and
ALSA uses POSIX error codes to signal errors (like “nothing more is
available”), but it doesn't use errno.
OK, so you have events (like “controller 3 was set to value 47”); what do you do
about them? The meaning of the controller numbers is different from
device to device, and there's no open format for describing them. So I had to
make a format describing the mapping; I used protobuf (I have lots of
experience with it) to make a simple text-based format, but it's obviously
a nightmare to set up 50+ controllers by hand in a text file, so I had to
make a UI for this. My initial thought was making a grid of spinners
(similar to how the input mapping dialog already worked), but then I realized
that there isn't an easy way to make heading rows in Qt's grid. (You can
substitute a label widget for a single cell, but not for an entire row.
Who knew?) So after some searching, I found out that it would be better
to have a tree view (Qt Creator does this), and then you can treat that
more-or-less as a table for the rows that should be editable.
Of course, guessing controller numbers is impossible even in an editor,
so I wanted it to respond to MIDI events. This means the editor needs
to take over the role as MIDI receiver from the main UI. How do you do
that in a thread-safe way? (Reuse the existing mutex; you don't generally
want to use atomics for complicated things.) Thinking about it, shouldn't the
MIDI mapper just support multiple receivers at a time? (Doubtful; you don't
want your random controller fiddling during setup to actually influence
the audio on a running stream. And would you use the old or the new mapping?)
And do you really need to set up every single controller for each bus,
given that the mapping is pretty much guaranteed to be similar for them?
Making a “guess bus” button doesn't seem too difficult, where if you
have one correctly set up controller on the bus, it can guess from
a neighboring bus (assuming a static offset). But what if there's
conflicting information? OK; then you should disable the button.
So now the enable/disable status of that button depends on which cell
in your grid has the focus; how do you get at those events? (Install an event
filter, or subclass the spinner.) And so on, and so on, and so on.
You could argue that most of these questions go away with experience;
if you're an expert in a given API, you can answer most of these questions
in a minute or two even if you haven't heard the exact question before.
But you can't expect even experienced developers to be an expert in all
possible libraries; if you know everything there is to know about Qt,
ALSA, x264, ffmpeg, OpenGL, VA-API, libusb, microhttpd and Lua
(in addition to C++11, of course), I'm sure you'd be a great fit for
Nageru, but I'd wager that pretty few developers fit that bill.
I've written C++ for almost 20 years now (almost ten of them professionally),
and that experience certainly helps boost productivity, but I can't
say I expect a 10x reduction in my own development time at any point.
You could also argue, of course, that spending so much time on the editor
is wasted, since most users will only ever see it once. But here's the
point; it's not actually a lot of time. The only reason why it seems
like so much is that I bothered to write two paragraphs about it;
it's not a particular pain point, it just adds to the total. Also,
the first impression matters a lot—if the user can't get the editor
to work, they also can't get the MIDI controller to work, and are likely
to just go do something else.
A common misconception is that just switching languages or using libraries
will help you a lot. (Witness the never-ending stream of software that
advertises “written in Foo” or “uses Bar” as if it were a feature.)
For the former, note that nothing I've said so far is specific to my choice
of language (C++), and I've certainly avoided a bunch of battles by making
that specific choice over, say, Python. For the latter, note that most of these problems are actually related
to library use—libraries are great, and they solve a bunch of problems
I'm really glad I didn't have to worry about (how should each button look?),
but they still give their own interaction problems. And even when you're a
master of your chosen programming environment, things still take time,
because you have all those decisions to make on top of your libraries.
Of course, there are cases where libraries really solve your entire problem
and your code gets reduced to 100 trivial lines, but that's really only when
you're solving a problem that's been solved a million times before. Congrats
on making that blog in Rails; I'm sure you're advancing the world. (To make
things worse, usually this breaks down when you want to stray ever so
slightly from what was intended by the library or framework author. What
seems like a perfect match can suddenly become a development trap where you
spend more of your time trying to become an expert in working around the
given library than actually doing any development.)
The entire thing reminds me of the famous essay
No Silver Bullet by Fred
Brooks, but perhaps even more so, this quote from
John Carmack's .plan has
stuck with me (incidentally about mobile game development in 2006,
but the basic story still rings true):
To some degree this is already the case on high end BREW phones today. I have
a pretty clear idea what a maxed out software renderer would look like for
that class of phones, and it wouldn't be the PlayStation-esq 3D graphics that
seems to be the standard direction. When I was doing the graphics engine
upgrades for BREW, I started along those lines, but after putting in a couple
days at it I realized that I just couldn't afford to spend the time to finish
the work. "A clear vision" doesn't mean I can necessarily implement it in a
very small integral number of days.
In a sense, programming is all about what your program should do in the first
place. The “how” question is just the “what”, moved down the chain of
abstractions until it ends up where a computer can understand it, and at that
point, the three words “multichannel audio support” have become those 9,000
lines that describe in perfect detail what's going on.
Wed, 23 Jul 2014 - The sad state of Linux Wi-Fi
I've been using 802.11 on Linux now for over a decade, and to be honest, it's
still a pretty sad experience. It works well enough that I mostly don't
care... but when I care, and try to dig deeper, it always ends up in the
answer “this is just crap”.
I can't say exactly why this is; between the Intel cards I've always been
using, the Linux drivers, the firmware, the mac80211 layer, wpa_supplicant
and NetworkManager, I have no idea who is supposed to get all these things
right, and I have no idea how hard or easy they actually are to pull off.
But there are still things annoying me frequently that we should really
have gotten right after ten years or more:
- Why does my Intel card consistently pick 2.4 GHz over 5 GHz? The 5 GHz
signal is just as strong, and it gives a less crowded 40 MHz channel
(twice the bandwidth, yay!) instead of the busy 20 MHz channel the 2.4 GHz
one has to share. The worst part is, if I use an access point with
band-select (essentially forcing the initial connection to be to 5
GHz—this is of course extra fun when the driver sees ten APs
and tries to connect to all of them over 2.4 in turn before trying
5 GHz), the driver still swaps onto 2.4 GHz a few minutes later!
- Rate selection. I can sit literally right next to an AP and get a
connection on the lowest basic rate (which I've set to 11 Mbit/sec
for the occasion). OK, maybe I shouldn't trust the output of iwconfig
too much, since rate is selected per-packet, but then again, when
Linux supposedly has a really good rate selection algorithm (minstrel),
why are so many drivers using their own instead? (Yes, hello
“iwl-agn-rs”, I'm looking at you.)
- Connection time. I dislike OS X pretty deeply and think that many
of its technical merits are way overblown, but it's got one thing
going for it; it connects to an AP fast.
RFC4436 describes some of the
tricks they're using, but Linux uses none of them. In any case, even
the WPA2 setup is slow for some reason, it's not just DHCP.
- Scanning/roaming seems to be pretty random; I have no idea how much
thought really went into this, and I know it is a hard problem,
but it's not unusual at all to be stuck at some low-speed AP when
a higher-speed one is available. (See also 2.4 vs. 5 above.)
I'd love to get proper support for CCX (Cisco Client Extensions) here,
which makes this tons better in a larger Wi-Fi setting (since the access
point can give the client a lot of information that's useful for
roaming, e.g. “there's an access point on channel 52 that sends
its beacons every 100 ms with offset 54 from mine”, which means you
only need to swap channel for a few milliseconds to listen instead of
a full beacon period), but I suppose that's covered by licensing
or patents or something. Who knows.
With now a billion mobile devices running Linux and using Wi-Fi
all the time, maybe we should have solved this a while ago.
But alas. Instead we get access points trying to layer hacks upon
hacks to try to force clients into making the right decisions.
And separate ESSIDs for 2.4 GHz and 5 GHz.
Augh.
Thu, 08 May 2014 - Static library linking behavior
A static library (an .a file) is, at its heart, nothing special—it's
just a bunch of object files (.o files) in an archive, with an optional
index. Thus, I've always treated them mentally the same to a bunch of object
files, but it turns out there's an edge case where there is a difference between linking to
an .a file and a bunch of .o files (even if you discount the fact that
the .a file is a linker group and the .o files would need -( and -) around them
to get the same effect).
The difference manifests itself when you are building a shared library
(a .so file).
For an .o file, the linker does what you'd expect; any symbol that has
global linkage (ie., for C or C++: it's not marked as static or in an
anonymous namespace) and has default visibility (which, unfortunately,
means almost all symbols unless you're clever enough to use -fvisibility=hidden
and explicitly mark your exports with __attribute__((visibility("default")))
or similar),
will be exported in your shared library to external clients. This means,
among others, that it will take up space because the name and a few other
things need to be stored.
However, for an .a file, things are different! If you have a symbol in an .a file, and it's not used by anything
from an .o file (directly or indirectly), the linker will silently just
discard it as unused. It doesn't matter what visibility it has; it will
just not be there. (To be honest, I haven't checked if this is on the level
of a symbol, a section, an .o file within the library, or the entire .a file.)
The workaround for this is to add --whole-archive before the .a file in
question (and presumably --no-whole-archive after it, so the effect doesn't
leak into the following ones), which disables this behavior. However, I am
unsure if it also has other effects, like messing with visibility, so beware.
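For instance, a hypothetical link line (the flags have to be smuggled through
to the linker via -Wl, since the compiler driver does the linking):

g++ -shared -o libfoo.so foo.o -Wl,--whole-archive libbar.a -Wl,--no-whole-archive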
To be honest, I think this is insane behavior, but given that so few people
set visibility explicitly, I guess there would never be any pruning if
the behavior were different (and ELF symbol visibility is a relatively new
invention anyway), so I can understand why it is like that.
Of course, I figured this out the hard way. Poof, there goes a few hours of
my life.
Tue, 04 Feb 2014 - FOSDEM video stream goodiebag
Borrowing a tradition from TG, we have released
a video streaming goodiebag from FOSDEM 2014. In short, it contains all the
scripts we used for the streaming part (nothing from the video team itself,
although I believe most of what they do is developed out in the open).
If you've read my earlier posts on the subject, you'll know that it's all
incredibly rough, and we haven't cleaned it up much afterwards. So you get
the truth, but it might not be pretty :-) However, feedback is of course
welcome.
You can find it at
http://storage.sesse.net/fosdem14-video.tar.gz;
it's only 9 kB. Enjoy!
Sun, 02 Feb 2014 - FOSDEM video streaming, post-mortem
Wow, what a ride that was. :-)
I'm not sure if people generally are aware of it, but the video streaming
at FOSDEM this year came together on extremely short notice. I got word
late Wednesday that the video team was overworked and would not have the
manpower to worry about streaming, and consequently, that there would be
none (probably not even of the main talks, like last year).
I quickly conferred with Berge on IRC; we both agreed that something
as big as FOSDEM shouldn't be without at least rudimentary streams. Could we
do something about it? After all, all devrooms (save for some that could not
be, due to licensing issues) would be recorded using DVswitch anyway, where it's
trivial to just connect another sink to the master, and we both had extensive
experience doing streaming work from The Gathering.
So, we agreed to do a stunt project; either it would work or it would crash
and burn, but at least it would be within the playful spirit of free software.
The world outside does not stand still, and neither should we.
The FOSDEM team agreed to give us access to the streams, and let us use the
otherwise unused cycles on the “slave” laptops (the ones that just take in
a DV stream from the camera and send it to the master for mixing).
Since I work at Google, I was able to talk to the Google Compute Engine people,
who were able to turn around on extremely short notice and sponsor GCE resources
for the actual video distribution. This took a huge unknown out of the
equation for us; since GCE is worldwide and scalable, we'd be sure to have
adequate bandwidth for serving our viewers almost no matter how much load we got.
The rest was mainly piecing together existing components in new ways. I dealt
with the encoding (on VLC, using WebM, since that's what FOSDEM
wanted), hitting one or two really obscure bugs in the process, and
Berge dealt with all the setup of distribution (we used cubemap,
which had already been tuned for the rather unique needs of WebM during last
Debconf), parsing the FOSDEM schedule to provide live program information,
and so on. Being a team of two was near-ideal here; we already know each other
extremely well from previous work, and despite the frantic pace, everything
felt really relaxed and calm.
So, less than 72 hours after the initial “go”, the streaming laptops started
coming up in the various devrooms, and I rsynced over my encoding chroot
to each of them and fired up VLC, which then cubemap would pick up and send
on. And amazingly enough, it worked! We had a peak of about 380 viewers,
which is about 80% more than the peak of 212 last year (and this was with
almost no announcement before the conference). Amusingly, the most popular
stream by far was not a main track, but that of the Go devroom; at times,
they had over half the total viewers. (I never got to visit it myself, because
it was super-packed every time I went there.)
I won't pretend everything went perfect—we found a cubemap segfault
on the way, and also some other issues (such as initially not properly
restarting the encoding when the DVswitch master went down and up again).
But I'm extremely happy that the video team believed in us and gave us
the chance; it was fun, it was the perfect icebreaker when meeting new
people at FOSDEM, and hopefully, we let quite a few people sitting at home
learn something new or interesting.
Oh, and not the least, it made my own talk get streamed. :-)