And I Have Come Upon This Place By Lost Ways; or, team Celestial Dire Badger's 2007 ICFP contest writeup

Team Celestial Dire Badger is, like last year, and like Team XLERB before it, the set containing exactly me: Jed Davis. Familiarity with this year's problem and the contest in general is assumed. After entirely too much revising I wound up doing this writeup chronologically (with occasional help from the version control log to remind me of what was when), but:

Subject index:

Worse Is Better (Sat. morning)
Programs
- Innuendo (Fri.)
- Condenser (Sat.)
- Elider (Sun. afternoon)
- Tightener (Sun. evening)
- Boxer (Sun. night)
Results
Aftermath

Friday

[morning]

In my time zone, the contest began at 6 AM. So my team (i.e., I) got up then and read the problem. I'd reserved judgment on specific programming languages until now, but it was quickly clear I'd need something reasonably practical and general-purpose; i.e., not m4.

In particular, the pattern and template recognition specs seemed to be crying out for implementation in ocamllex, which I did; the result was a fairly literal transcription, except that I replaced all the mutable state with accumulator-passing -- this is a functional programming contest, after all.

Having a REPL (er, “toplevel”) to try things out in proved quite useful, for the usual reasons — ease of quickly testing individual modules, one-off acts of data processing (e.g., disassembling a mystery prefix), and so on.

It was already clear to me that flat strings weren't quite what I'd want here, for the DNA sequence, so I tossed together something with a list of (string,offset,length) triples, which I called Mbuf, and used this for the sequence as a whole, but not for the substitution environment.

[afternoon]

That didn't work so well; questions of efficiency aside, I ran into OCaml's maximum string length (16MB). Some more work, and it was still slow, because I'd assumed that most sequence alteration would be around the beginning, and it wasn't.

So I replaced the lists with trees (still pure-functional, except where I had to mutate a buffer to feed ocamllex; thus, the primitive operations were of_string, append, take and drop, and blit_to_string). And it worked!... except for the very last iteration of unprefixed Endo, where it went off the rails. This was distinctly frustrating to debug, such that I gave up on inspecting the program, went back to read the spec more closely, and found several boundary conditions I'd mishandled.

[evening]

Plain Endo finally terminated, but I couldn't do anything with the RNA yet. The drawing layer was shamelessly imperative, and in fact the only notable deviation from spec was replacing the recursive DFS flood-fill with a BFS using OCaml's Queue module. I did go to my bookshelf in search of a less naive flood fill, and found one, but decided I'd deal with that later if became a problem; it didn't.

And that got written in about an hour and a half, according to the commit log, and then another hour to figure out why nothing was drawing: I had reversed the RNA list one too many times in the DNA interpreter. (Mini-lesson: this is what happens when you do mechanical program transformations by hand.)

Amazingly, the entire self-check worked the first time after that. In a moment of gratuitous random humor, I named the little program that glued all this together innuendo.

[night]

For some reason, perhaps because I'd already been contemplating ditching the genome and attacking the image-similarity problem directly, it occurred to me to adjust the renderer to take a snapshot before each canvas-stack push. This, of course, let me see the sun — but not in real life, as it was only midnight. (In fact, it was early enough that that got me into the top 15 for surprisingly long.)

I then spent entirely too long misunderstanding the cryptically worded advice about how to page through the manual. (I was, I later learned, very much not alone in that specific error.) I eventually realized what it actually meant, and browsed the manual, and could see what I was meant to do next.

But it seemed far too likely that I'd spend the most of the time struggling with and failing to divine the meaning of obscure hints; nor, with a team of exactly one person, was I feeling optimistic about trying to reverse-engineer the genome by more invasive means.

But it was late; and so, after finding the secret multimedia clue, I slept.

Saturday

[morning]

I'm not entirely sure what possessed me to spend quite so much more time futzing with the DNA sequence representation, which already had acceptable performance. It is however interesting that actually attempting to balance the trees didn't seem to noticeably help run times, and even merging small leaves didn't help much. Also, making the sub-sequence search use the Str library on successive windows was a noticeable improvement over the ridiculously naive thing it replaced — most noticeably so on one particular prefix, where it was a 4x speedup (and that from one single failed search).

As performance of DNA interpretation was a topic of some interest, I'll report that mine takes 17s to process the unprefixed Endo on a recent PC (110 kips), and 32s on a year-old laptop (60 kips).

[afternoon, evening, night]

Meanwhile, I'd been weighing my options. A daring plan had slowly condensed for ignoring the genome entirely and just treating this as a problem in lossy image compression, though with a measure of image similarity rather different from those commonly studied. After all, one of the common themes of ICFP contest problems has been optimization — see SML/NG, the text adventure optimizer, the raytracer, and the racecars, for example — and maybe this year's organizers even actually intended this.

The key observations were that the DNA language is trivially usable to compress repeated substrings (think "cut and paste"), that most of the hard parts of the target image are already present in the sun image, and that most of the rest can be passably approximated by a set of equal-sized, axis-aligned, solid-color rectangles (preferably drawn in order of color, then position).

A little back-of-the-envelope convinced me that this might even actually work; but, more the point, it would almost certainly be fun.

My new motto: Worse Is Better. Or something like that.

So, as for the lossless DNA compressor, called condenser: the basic idea is to find a bunch of repeated substrings, stick them in the data section, use (!_n) patterns to load them into the environment, then replace them with i₀ templates. Any remaining subsequences can be either quoted or cut-and-pasted as singletons, whichever is shorter.

More to the point, such a utility should, I reasoned, be able to compress a run of k repeats of some sequence into space logarithmic in k. The trick would be to feed its output back in for further condensation, until it fails to shrink in size; any one pass would need do no more than condense two of the same template symbol into one. (Iterating the “decompressor” is trivial, of course — it's right there in the matchreplace spec.)

This, alone of these programs, was written in C++. I was initially going to continue with OCaml, but there's the array-length limit, and I could tell that performance was going to be critical, and compatibility with the libraries I'd alredy written more or less irrelevant. And it wasn't all that bad, but there are still a few mildly impolite comments towards the language in the source.

It's obvious that the lower-numbered environment slots are more valuable, due to their shorter code words. So here I proceeded greedily: find the best candidate for slot 0, then mark off all the acids involved as unavailable, then find the best candidate for slot 1 from among the remaining acids, and so on until there ceases to be any benefit.

Then, the inner loops are to, for each n-gram of interest, iterate through its instances in order of position, throwing out any self-overlap and instances that overlap with extents claimed by earlier passes. This was slightly intricate, but not really difficult, and fairly inexpensive due to the indices' being sorted.

Data Structure #1: everyone likes suffix arrays, but here they're not quite what I want, which is an ordered list of each position of a given n-gram. So I stopped the comparison at n chars and sorted stably. Yes, this does mean a separate array for each value of n under test; there are worse things.

Data Structure #2: it was called ralist, and it was Okasaki's random-access list (1995) I was thinking of, but I misremembered some important details and so it's not that. Instead, in hindsight, it's more like Myers' random-access stack (1983). But this is good, because Okasaki's primary contribution was log-time functional update, which operation I didn't need, and that came at the cost of having to allocate in order to obtain a tail of the list, which would have been completely prohibitive to have in my critical path. Anyway, it's to store the marked-off extent list, and is used purely functionally.

And, after many hours and more segfaults, it worked; but that wasn't quite enough. I noticed that the work for each n could proceed largely in parallel with the others, synchronizing only after the initial array sort and once for each slot, to pick the overall winner. So I did that thing, with POSIX threads.

That didn't work so well, apparently because I'd tried to use C++'s OO features to abstract out thread management for the two kinds of parallel task. Stuff broke, apparently because vtable pointers weren't being properly initialized or some such. Um. And, I mean, I know that Threads Cannot Be Implemented As A Library, but this is kind of ridiculous.

And with that, I slept.

Sunday

[morning]

And in the morning I fixed the threading, by the simple expedient of ripping out the use of inheritance and replacing it with plain old dependable functions.

And then there were some more minor improvements; at the end of this, innuendo, which begin as a literal one-liner with everything hard-coded, had gained a raft of useful options (with help strings!), the perhaps most noteworthy being those to dump and load the intermediate RNA.

Also appearing this morning was a little utility called reverse_transcriptase, with the obvious functionality; I'd decided on using OCaml's Marshal module to toss RNA lists around internally.

So, at this point I could create a prefix with the same effect as the sun one that increasingly many teams had, only much longer (and thus worse). Sounds about right for 55 hours in....

And I went and wrote functions for color-to-RNA conversion, even though wouldn't be useful until much later.

[afternoon]

Remember the thing about iterating the condenser until the file size stops decreasing? This was when I stopped doing that by hand and wrote a small shell script, calling it fix_condenser for the obvious reason.

And then I spent a bunch of time experimenting with and tuning the condenser, both finding appropriate parameters for dealing with pure RNA, and adjusting its internal heuristics. Specifically, I gave it a notion of “pessimism”, which is how much a given slot expects future slots to compress/expand anything it doesn't take care of. It starts out very optimistic, then whenver it can't find anything worth doing it exclaims with woe and increments pessimism, until it can take no more (i.e., passes 5/4).

If you're looking at this and wondering what sort of lunatic I am to be fine-tuning stuff that already runs when so much functionality remained to be written: well, so am I. But it worked out, so whatever; also, I was usually doing a certain amount of planning of later stuff at times like these (see also: pipelining, ILP). I think I'd fallen out of the top 15 by this point.

Anyway, back to useful stuff. I needed a way to remove the space junk from the landscape. Fortunately, the snapshots (remember them?) had shown that each object was generally drawn on its own canvas and composed down — usually with nesting for anything with smaller parts, even. So, I could just operate on the composition tree.

Or, imperatively, to run through a stream of RNA with an output enable bit, toggling it whenever a designated canvas-push is hit, and stacking those events to re-toggle the bit at the matching compose or clip. Not that I ever needed to delete an item and reinstate a subitem of it, but it was nice that I could.

But wait! It's not that simple! Only the line and fill operations (and balanced canvas ops) can be safely removed; everything else side-effects the graphics context in a way that will be depended on. So, as-is, this certainly gets rid of the objects, but it doesn't do much about the length. Oops.

[evening]

In the Unix spirit, I made the thing to tighten up all those paths that wander aimlessly and throw out unused paint a separate utility, called (continuing the agentive theme) tightener.

It was fairly quickly obvious I couldn't take shortcuts to the goal of finding shortcuts, so to speak; I'd have to actually get the state and work backwards from that. But I didn't want to duplicate the graphics code; so clearly this calls for: emergency refactoring!

And thus was created the 'a grafport, a record of imperative functions, each a command in a simplification of RNA where line and fill are parameterized on such positions and bucket state as they need; and done : unit -> 'a, which may be my only use of polymorphism in the entire project.

Thereby did the original renderer backend yield a pixmap grafport, and the tightening one an rna grafport, and all was pretty and elegant. Except for the routine to emit the RNA to goto a point, which was a copy-and-pasted sprawl, but nonetheless worked.

At last, now that I had all the tools, I went through the snapshots picking off the indices of space junk; one commit message reads simply "Restore the river grass." (Of course the script I stored them in got an odd name; it was factotum, with apologies to Plan9.)

But I couldn't submit any of this without a wrapper to delete the original genome, so I finally got around to writing pattern/template unparsers and did that thing. As this seems like it'd be a rather drastic thing to do in terms of the Endo story, and I'm justifying my course of action by the urgent nature of the situation, the bootloader generator was named extreme_unction, for a similarly drastic measure.

And in that way, with ~8 hours to go, I returned to the top 15, by removing all the space junk. Including any form of Endo, who'd thus have to live on as an animist spirit inhabiting the tulips, perhaps. So, not terribly good odds — 2.7%, to be exact.

Also, I'd need the tightener to optionally restore the graphics context to the default at the end, so I could compose (in the functional sense) its output with other sequences, but that was a fairly trivial addition.

[night]

At some point I became aware of the watermark on the image (as hinted at on page 3 of the manual). Of course I could elide it from the denuded landscape, and I even had the foresight to make the elider general enough that I could reverse its sense and remove everything but the watermark; but I needed a de-watermarked target image to approximate.

Now, inverting composition in this manner will not necessarily yield a single value; it may be an interval. And I tried to deal with this in proper generality, and... it didn't work right. For such small alphas, only a few pixels should have gotten a choice of values, but that was happening all over the place; on the other hand, the lower bound was agreeing with the correct value almost everywhere.

As the hour was growing ever later and I ever more confused, I gave up on understanding and just plugged in the lower bound.

And finally, finally, the important part: dividing up the image with an evenly spaced grid and doing paint-by-numbers on the rectangles. Or, rather, for each box, finding the plurality color in that part of the target image, and seeing it if accounts for more pixels than are already correct from the backdrop.

And then tossing out boxes of overall underperforming colors (this was less important than I thought), sorting the rest by (color,y,x), and emitting graphics commands to actually draw the little colored boxes. This thing, of course, was called the boxer.

It's probably good that I'd done much of the planning for this earlier, not only because of the time crunch, but because my better judgment was fading fast. Anyway, it worked without too much struggle.

So with three hours to go, I picked an arbitrary box size, ran everything, found the prefix to not be ruinously long, submitted it, and... oh my; that's a much better percentage than I'd been expecting. It was already clear that I'd remain in the top 15 at the end of this.

Nor was I done yet. The next half hour went on fiddling with the box size (oh, and fixing a bug with odd box widths so I could do that); 3x3 turned out to be the best compromise between fidelity and entropy. And also raised my survival percentage to an even higher number than anything I'd hoped for.

I also tried having the boxer move in all four directions to reach the next box, instead of just going south and east and wrapping around if needed; and of course this reduced the RNA size quite a bit, but the final condensed DNA size went up. Go figure.

Following that triumph, I ran out of Coke and took leave of my senses, in that order. Specifically, I spent most of the rest of the time writing a variant of boxer that used horizontal lines of arbitrary length, called liner, of course. It was, in hindsight not surprisingly, a miserable failure (but see below). I lament this thing in that there were other things I could have done, even without the benefit of hindsight, that might have actually helped. And then I tried multiple boxer runs with different box sizes (coarse then fine), which also didn't help.

And it was over.

[morning']

Well, not quite. I still had to hand in my files within the next three hours. I'd had the common sense, towards the end, to record the web of commands that generated my final result — as a shell script. So it wasn't much work to get it to the point where I could say "just run this, and it'll probably mostly work"; but I also wanted to do a proper README, with a little explanatory text about my madness and the method behind it; i.e, the overall strategy and each of the commands. So it wasn't until 8:30 or so until that it was actually over and I, with personal uptime around 24 hours, and too little sleep the previous three nights, and suspecting that maybe I'm starting to get a little too old for this, could fall over.

Results

Pixels Correct: 341994
Pixels Incorrect: 18006
Prefix Length: 113838
Total Score: 293898
Survival Chance: 75.59%
Rank: 3rd, and the Judges' Prize.
CPU Time: 3 minutes, on a 3GHz Intel Core 2 Duo

That CPU time is to compile all of my tools from source and generate my solution given Endo's DNA and the target PNG, as automated by my shell script unifier. (Obviously I burned far more than that over the course of the contest!) The only “precomputation” is my list of objects to remove, which as I've mentioned was obtained by manually going through all 135 snapshots; and assorted tuning parameters, selected by half-enlightened trial-and-error. So, hopefully I've allayed any worries that this was done by applying unreasonable amounts of processor power.

(The source code might be posted at some point, and if I haven't said anything too unfortunate in the revision control history I might post that too.)

Aftermath

The observant reader may have noticed that some of the objects I approximated with opaque solid color were actually alpha-blended over a sometimes nontrivial background. The observant reader should know that that occurred to me back on Saturday, but I dismissed the idea of trying to extract the alpha values as too hard, at least for the first attempt. And the I never got back to it. Which is a shame, becuase I realized a day or so after this was all over that it actually wouldn't have been hard to do with brute force.

Other regrets: not thinking to combine boxer and liner, or better just boxer, so as to pick up the horizontal gradient on the house-thing intead of just every third line. Nor trying a bit more tuning on the condenser; there was a thing or two I'd had in mind to try at some point, but didn't. (In further hindsight, starting the pessimism higher for the later runs might have helped; and it wouldn't have taken too long to actually experimentally test my hunch that boxes should be in color-major and not position-major order.) Given that I haven't bothered to follow up on any of this for several months now, it's safe to say that I probably don't care all that much.

Interesting realization about the condenser: to handle a repeating sequence of period n, it's not necessary to be looking for words of length exactly n. For example, for a period-37 sequence, recognizing some length-30 subsequence will compress each 37 down to 7+4+i — or, with two 17s, to 3+4+i+4+i' — at which point the next pass should be able to grab multiple repeats at once (and, again, it needn't be exact).

Related Work

Several other teams reported experimenting with direct reproduction of all or part of the target image, by both automatic means and machine-assisted hand-drawing of polygons. However, as far as I'm aware, no other team to date has admitted to dissecting the genome RNA, nor de-watermarking the target image, nor performing general-purpose automated DNA compression. But, as I write this, the contest organizers' long-awaited presentation is nigh, and so more information may be released soon....

I suppose I ought to see what actual information theory and/or algorithms research has to say about the condenser problem, but I haven't yet.