we’re howling into the moon, here

The activity tracker in the project management thing we are using (Redmine) has sort of drifted off into the wilderness. So this is just to say that some of the issues are getting closed out, even though you’d be hard pressed to tell.

Errors

We made a design decision a while back. Every objective should be a scalar minimization.

What’s an “error”? Programs just run in Push and Nudge. They don’t make decisions, they don’t do much of anything; they just shove shit around, they do a little math now and then, they’re in their own little worlds. So it’s necessary to consider how best to interrogate the state of Nudge program at the end of an interpreter run. The interpreter just hands you back a set of stacks. You might have a stack of integers, and a stack of floats, and a stack of booleans, and so on. How do you use that mess to derive a scalar evaluation of your program’s performance?

Let’s restrict ourselves to the case where we’re creating a numerical model. We handed some input data to the program, and we want to interpret something it’s done as “output”. For the sake of argument, say we’re looking at one integer. We gave it weather data from yesterday, it’s trying to predict temperature today in degrees. We gave it medical observations from a patient, it’s trying to predict their life expectancy in days. We gave it the number of cars parked in all the garages in Ann Arbor over the last week, and it’s trying to predict the total beer revenue at Arbor Brewing Company next week, in dollars.

The program doesn’t “know” what a degree, day or dollar is. It’s like some stupid tourist in a far-off land, emptying its pocket and proffering a bunch of foreign coins and gum and bus tokens. “Here! I don’t know! You take what you need! Please, just take it!” It’s got integers, it’s got floats, it’s got booleans.

You’re the decider.

The simplest thing, of course, is to just take the top integer on the program’s integer stack. Use an arbitrary rule: top integer is defined to be the “output”. You want an integer (degrees, days, dollars), you can have one. If you want to get fancy later on, maybe you make a more complicated rule: topmost non-negative integer, or sum of the top five, or something. For now, stick with “top”.

You run the program and hand it the input data, and it stops running and holds out its random bundle of crap. You reach precisely into the pile and pluck off the “output”. If the program hasn’t managed to drum even one integer up in the course of its run, or if it’s frittered them all away, well too bad, Program: penalize severely. Otherwise, just say “prediction←topmost integer on stack” and be done.

But you have an integer. Assuming you have a bunch of test cases, you have a bunch of integers. How do you assess the accuracy of the “model” your stupid program has internalized?

There are dozens of well-known possibilities for error statistics, and unlimited potential to make stuff up. Don’t forget what it says in the book: every function of data is a statistic. Assuming we want to stick to well-trodden commonplace material, this summary may be a good place to start a shorter list.

I like their list. Mean Squared Error is common and familiar; everybody learns it in high school. It’s rare to meet somebody even in these advanced days who understands why it’s stupid to use it unthinkingly, but it’ll do in most cases. It’s useful.

Note though there are others. A lot of people pushing decision support tools don’t bother to include Bias measures, or Mean Absolute Percent [or Proportional] Error. And that distorts the user’s perception of success. There are many ways to be wrong in numerical modeling; you should always consider different gauges to see what’s happening with your models.

There’s a whole slew of specialized (and general) metrics that should also be available for the general-purpose modeling framework. You may not always want to fit data individually; maybe you want to match a distribution, and you’ll want to use KS or AD tests. Maybe you want to assume a bunch about the way your model behaves, and can get away with using 1−R2 (since we’re always trying to minimize error, you’ll want a smaller objective to be better). If you’re doing classification instead of numerical modeling, you’ll want measures of classification error.

In any case, here’s the interesting benefit about multiobjective optimization: You don’t need to decide. You can throw them all in, if you want; you can minimize MSE and minimize [the magnitude of] Bias and minimize MAPE. All at the same time. Some models will be better at matching by MSE; some will be less biased. You want to collect them all, at least the nondominated ones.

We’re all stupid. But it’s less immediately stupid if you postpone decisions about which error measures to use until the last moment. Be like the programs themselves: “Here, I have these! You pick!”

Search algorithms

There are really three basic genetic programming algorithms I like to use out of the box. One’s stupid, one’s (almost) standard, and one’s pretty fancy and effective.

Start with one nugget of assumption: Every search algorithm should be multi-objective. Period. No exceptions. You do a single-objective search on any real-world problem, you’re either lazy or a liar. The end, no exceptions. I don’t care if you’re an Operations Research guru or a lone stock trader looking to make a buck, no problem is single-objective.

So when I say “better” in any context, I am talking about Pareto domination, or if you’re an economist kinda girl, “relative Pareto efficiency”. Say you have two alternatives, A and B, and you’re deciding between them on the basis of two minimization objectives, w1 and w2. A is dominated by B if w1(B)<w1(A), and w2(B)<w2(A). B is dominated by A if w1(A)<w1(B), and w2(A)<w2(B). Neither one dominates the other unless it is strictly better on all comparisons. Otherwise, A and B are nondominated by one another; they can’t be differentiated in a meaningful way without saying which objective is “more important”.

You want to survive a car crash, or do you want to drive really really fast? Which is better? So, get that into your head. There is never one objective. Never.

Let’s make sure we understand why there are always multiple objectives in GP runs. You want accuracy, right? But you also want parsimony. You’re evolving models, solutions to problems, functions, structures. Things can get hairy very quickly in a runaway evolutionary world: programs can get superstitious, they can inherit weird cultish quirks, like learning to use “Cosine(0)” or Sqrt(Sqrt(Sqrt(Sqrt(Sqrt(8.9912)))) instead of “1″, if some forgotten ancestor happened to find that first and passed it on as Big Math Juju. So as God of Your Little Engineering Problem trust me: you honestly want to reach down out of the clouds and say, “Hey, guys, chill. Remember, we use our human-readable voice when we model, OK? Two commandments here: accurate, and succinct.”

So there’s an intrinsic reason right there why you want to minimize model error and model complexity. At least two objectives. Always.

Because there’s always a less efficient way to solve any problem without sacrificing accuracy.

Anyway, those search algorithms.

“Stupid” is random search:

  1. Create a random program, and evaluate it
  2. Create a new random program, and evaluate that
  3. Throw away the one that’s dominated, or the newer one if neither is dominated
  4. Go to 2

Seem so ridiculous it’s not worth considering? Not really. It’s essentially the worst you could possibly do, knowing nothing about your search problem. So if nothing else, it provides a useful baseline, as well as a nice test of your randomization methods, evaluator timing, and other infrastructure that almost any search algorithm will share with this dumb-as-a-sack-of-hammers approach.

Plus: Easy to explain? Damn straight. Always run random search on your problem, with your representation, full-fledged evaluation, and data collection system in place.

“(Almost) standard”? Steady-state Pareto tournament selection works for me for a number of reasons, at least as a first approach:

  1. Create a population of, oh I don’t know, 500 random programs
  2. Select a tournament of maybe, ummm, 10? yeah 10 programs from the population, uniformly at random, and evaluate them
  3. Identify the nondominated “best” programs among them, and the dominated not-so-best ones, and split them into two groups
  4. If there are no dominated ones, pick one of the “best” pool at random and demote it out of spite. If there’s only one “best” program, it’s lonely, and inbreeding is bad, so promote one of the lesser ones out of pity.
  5. Replace all the programs in the not-so-best pool with offspring of the nondominated “best” pool
    • Pick two parents, with replacement, from the nondominated set
    • Use one-point crossover to make offspring
    • Mutate the babies slightly, here and there, to taste
  6. go to 2

OK, I confess. This is standard in almost no ways whatsoever. It’s a “standard” toolkit for me, sure, but it’s a pretty far cry from Koza’s classic lockstep population-based methods, the kind everybody and their brother has copied wholesale. I like it anyway: it gives rapid feedback on progress, it’s kindof elitist (except for that “spite” step), it saves a lot of search time by limiting expensive multiobjective comparisons to relatively small pools, it works with fitness caching. It’s not perfect, but I already linked to the NFL Wikipedia page, so what are you, so dumb you can’t read? Nothing’s perfect.

Fancy and effective? I really like the ParetoGP method from Mark Kotanchek and Guido Smits (and their colleagues):

  1. Create an Archive of 500 random programs
  2. Create a Population of 500 random programs
  3. Create a container for the Next Generation (an empty list; hold onto that)
  4. Select a tournament of 20 Archive programs with uniform probability, and evaluate them, and forget about the dominated ones. Just make a list of the nondominated ones from the Archive tournament.
  5. Select a tournament of 20 Population programs with uniform probability, and evaluate them, and forget about the dominated ones. Just keep the subset of nondominated ones from the Population tournament.
  6. For each Population tournament winner you got from step 5, pick an Archive tournament winner you kept from step 4, at random. Cross them over to create an offspring, maybe with some mutation if you must.
  7. Stick the baby into the Next Generation. If you have 500 programs in the Next Generation, replace the Population with it and go to 3. If you’ve run out of tournament winners before filling the Next Generation, just go to 4. If you choose to attack the Orc, turn to page 88.
  8. You’re going to be wondering when you stop cycling the Next Generations back into the Population, aren’t you? Oh, I dunno. Pick a number. Say “10″. Go on, say it. There: do it ten times. Or more. Or less. Whatever; it’ll work.
  9. Aha! But then: once you’ve done your ten renewals of the Population (what Kotanchek and such call a “cascade”), you’re not done. Oh, no, boyo. You get your butt up there and you start the whole thing, all over again at step 2. That’s right. Another cascade, from another new, random Population, and another ten generations or so. But—but—before you do, take the final Population, the best best best of that cascade you just finished, and you stick it into the Archive. Add it to the programs that were already there, and take the time to trim the Archive back down to 500 programs. You can do a full-fledged 1000-wise Pareto tournament if you have the time, or you could do some other cunning selection method if you want. Often I tend to do something quick and dirty, like just picking tournaments of 20 and culling all the dominated ones until the Archive size drops back down to 500. So, anyway, you were going back to Step 2 to run another cascade.
  10. Do that until you’re bored.

Now this is a close approximation of the ParetoGP algorithm described in a number of papers by Mark Kotanchek, Guido Smits and Katya Vladislavleva. It’s a broadly customizable framework, and if it seems like a lot of extra work compared with the “simpler” GP algorithms described above… well, that may be true. I like it very much because it exposes exploitation (via the Archive) and exploration (via the Population) as almost separable processes. You feel like things are getting stuck, let the Population grow a bit and do more random restarting and exploratory search for weird alternatives; you feel like things are getting too wild, you do fewer, shorter cascades and keep a bigger Archive, really nail down the variations on the already-discovered themes you’ve collected there. Play around, fiddle the knobs, see what you get.

And that’s the point, of course. There is no “best” search algorithm. Don’t make me link to that paper again.

You have to pay attention. You have to bring to bear a vast combinatorial armamentarium. You have to keep a seven-drawer toolbox in the virtual garage, with a hundred different wrenches in it, and then look at your problem and listen to your algorithm purring along, and reach into a drawer without looking and find just the right tool, and give that algorithm a little tightening here, a little shimmy there, grind a bit off the edge, take it apart and put it together again slightly better. With grease.

See, it’s not about solving your problem as fast as possible. It’s about watching it improve, and having the sense to see it improving. About having the tools on hand and the experience under your hat so you can intervene in a reasonable and effective way.

So, yes: there are other ways. None are best.

How we think we’re doing it, here

In stolen moments. Afternoons, weekends. Together, mostly.

So far, that’s about it. (And so far, so good.)

Sufficiently

One problem: we haven’t done sufficiently for one another to make it easy to work on Nudge in these stolen moments and weekly-ish meetings together. The whole stupid xp / agile thing doesn’t work, since none of us are embedded in the project on a steady, consistent basis. We don’t need to justify our rain-hungry expanses of time to some control tower, and then bail half-baked some fiscal quarters later. (We can bail half-baked without structure to enable that behavior, thanks.)

Nudge is important, but it’s not the first thing on anybody’s list.

So there’s this push towards a collection of stories goals designs objectives: a frontier of things that need to happen for Nudge to happen.

With some edges of this frontier scoped out sufficiently that we can peel off little one-hour-ish chunks of work to perform and commit.

Not the best

I wonder about quality of the work we do in this environment. I know it’s not my best. Part of that is because I am learning by doing (all this GP stuff is new to me; that’s why I’m grateful for this little black book PDF). Another part is that Nudge never gets the best of my time or energy. But software isn’t about quality, it’s about bringing light into the world. It just has to be something that wasn’t there before, with [1] decent test coverage and [2] reasons for being both plusses. And on that count, I’m pleased by what’s happened so far, since it’s more than we had before, and it seems in danger of becoming something genuinely, demonstrably useful.

What we think we’re doing here

…in one-liner bullets:

As of this writing, we’re:

With luck, we’re getting close to our next iteration release. We’ll post a coherent code repository. Watch this space.