We made a design decision a while back. Every objective should be a scalar minimization.
What’s an “error”? Programs just run in Push and Nudge. They don’t make decisions, they don’t do much of anything; they just shove shit around, they do a little math now and then, they’re in their own little worlds. So it’s necessary to consider how best to interrogate the state of a Nudge program at the end of an interpreter run. The interpreter just hands you back a set of stacks. You might have a stack of integers, and a stack of floats, and a stack of booleans, and so on. How do you use that mess to derive a scalar evaluation of your program’s performance?
Let’s restrict ourselves to the case where we’re creating a numerical model. We handed some input data to the program, and we want to interpret something it’s done as “output”. For the sake of argument, say we’re looking at one integer. We gave it weather data from yesterday, it’s trying to predict temperature today in degrees. We gave it medical observations from a patient, it’s trying to predict their life expectancy in days. We gave it the number of cars parked in all the garages in Ann Arbor over the last week, and it’s trying to predict the total beer revenue at Arbor Brewing Company next week, in dollars.
The program doesn’t “know” what a degree, day or dollar is. It’s like some stupid tourist in a far-off land, emptying its pocket and proffering a bunch of foreign coins and gum and bus tokens. “Here! I don’t know! You take what you need! Please, just take it!” It’s got integers, it’s got floats, it’s got booleans.
You’re the decider.
The simplest thing, of course, is to just take the top integer on the program’s integer stack. Use an arbitrary rule: top integer is defined to be the “output”. You want an integer (degrees, days, dollars), you can have one. If you want to get fancy later on, maybe you make a more complicated rule: topmost non-negative integer, or sum of the top five, or something. For now, stick with “top”.
You run the program and hand it the input data, and it stops running and holds out its random bundle of crap. You reach precisely into the pile and pluck off the “output”. If the program hasn’t managed to drum up even one integer in the course of its run, or if it’s frittered them all away, well too bad, Program: penalize severely. Otherwise, just say “prediction = topmost integer on stack” and be done.
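Here’s a minimal sketch of that rule in Python. Everything in it is an assumption made for illustration, not Nudge’s actual interpreter interface: the stacks are a plain dict of lists with the “top” at the end of each list, the stack name `"int"` is made up, and `PENALTY` is an arbitrary large number standing in for “penalize severely”.

```python
# Hypothetical representation: stacks are a dict of lists, top item last.
# None of these names come from Nudge itself.
PENALTY = 10**9  # arbitrary stand-in for "penalize severely"


def top_integer(stacks):
    """Return the top of the integer stack, or None if there isn't one."""
    ints = stacks.get("int", [])
    return ints[-1] if ints else None


def absolute_error(stacks, target, penalty=PENALTY):
    """Score one test case: |prediction - target|, or the penalty if no integer."""
    prediction = top_integer(stacks)
    if prediction is None:
        return penalty
    return abs(prediction - target)


# One run that left integers behind, one that frittered them all away.
good_run = {"int": [3, 7], "float": [1.5], "bool": [True]}
empty_run = {"int": [], "bool": [False]}

print(absolute_error(good_run, target=10))   # prediction is 7
print(absolute_error(empty_run, target=10))  # no integer: penalty
```

Swapping in a fancier rule later (topmost non-negative integer, sum of the top five) would only mean changing `top_integer`.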
So now you have an integer. Assuming you have a bunch of test cases, you have a bunch of integers. How do you assess the accuracy of the “model” your stupid program has internalized?
There are dozens of well-known possibilities for error statistics, and unlimited potential to make stuff up. Don’t forget what it says in the book: every function of data is a statistic. Assuming we want to stick to well-trodden commonplace material, this summary may be a good place to start a shorter list.
I like their list. Mean Squared Error is common and familiar; everybody learns it in high school. It’s rare to meet somebody even in these advanced days who understands why it’s stupid to use it unthinkingly, but it’ll do in most cases. It’s useful.
Note though there are others. A lot of people pushing decision support tools don’t bother to include Bias measures, or Mean Absolute Percent [or Proportional] Error. And that distorts the user’s perception of success. There are many ways to be wrong in numerical modeling; you should always consider different gauges to see what’s happening with your models.
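To make the difference between those gauges concrete, here’s a rough Python sketch of three of them using the usual textbook definitions. The function names and the toy numbers are mine, and the MAPE version simply assumes no target is zero.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def bias(predictions, targets):
    """Average signed difference; positive means over-prediction."""
    return sum(p - t for p, t in zip(predictions, targets)) / len(targets)


def mean_absolute_percent_error(predictions, targets):
    """Average of |error / target|, as a percentage. Assumes no zero targets."""
    return 100 * sum(abs((p - t) / t) for p, t in zip(predictions, targets)) / len(targets)


targets = [10, 10, 10]
scattered = [12, 8, 10]   # misses in both directions
offset = [11, 11, 11]     # always one too high

for name, preds in [("scattered", scattered), ("offset", offset)]:
    print(name,
          mean_squared_error(preds, targets),
          bias(preds, targets),
          mean_absolute_percent_error(preds, targets))
```

On this toy data the “scattered” model has zero bias but the worse MSE, while the “offset” model has the better MSE and a bias of one. Look at only one gauge and you’d rank them differently than if you’d looked at the other.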
There’s a whole slew of specialized (and general) metrics that should also be available for the general-purpose modeling framework. You may not always want to fit data individually; maybe you want to match a distribution, and you’ll want to use Kolmogorov-Smirnov (KS) or Anderson-Darling (AD) tests. Maybe you want to assume a bunch about the way your model behaves, and can get away with using 1 − R² (since we’re always trying to minimize error, you’ll want a smaller objective to be better). If you’re doing classification instead of numerical modeling, you’ll want measures of classification error.
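As one example of turning a “bigger is better” statistic into something to minimize, here’s a sketch of 1 − R² in Python, taking R² to be the ordinary coefficient of determination. The function name is made up, and it assumes the targets aren’t all identical (otherwise the denominator is zero).

```python
def one_minus_r_squared(predictions, targets):
    """1 - R^2 = (residual sum of squares) / (total sum of squares).

    Zero is a perfect fit; 1.0 is no better than always guessing the mean.
    Assumes the targets are not all the same value.
    """
    mean_target = sum(targets) / len(targets)
    ss_residual = sum((t - p) ** 2 for p, t in zip(predictions, targets))
    ss_total = sum((t - mean_target) ** 2 for t in targets)
    return ss_residual / ss_total


targets = [1, 2, 3]
print(one_minus_r_squared([1, 2, 3], targets))  # perfect predictions
print(one_minus_r_squared([2, 2, 2], targets))  # always predicts the mean
```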
In any case, here’s the interesting benefit about multiobjective optimization: You don’t need to decide. You can throw them all in, if you want; you can minimize MSE and minimize [the magnitude of] Bias and minimize MAPE. All at the same time. Some models will be better at matching by MSE; some will be less biased. You want to collect them all, at least the nondominated ones.
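“Nondominated” here has its usual meaning: a model is dominated if some other model is at least as good on every objective and strictly better on at least one. A small Python sketch of that filter follows; the model names and score tuples are invented for illustration, and every objective is assumed to be a scalar to minimize.

```python
def dominates(a, b):
    """True if score vector a is no worse than b everywhere and better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def nondominated(scores):
    """Keep only the entries that no other entry dominates."""
    return {
        name: score
        for name, score in scores.items()
        if not any(dominates(other, score)
                   for other_name, other in scores.items()
                   if other_name != name)
    }


# Hypothetical scores as (MSE, |bias|, MAPE); smaller is better on all three.
scores = {
    "scattered": (2.67, 0.0, 13.3),
    "offset": (1.0, 1.0, 10.0),
    "hopeless": (9.0, 2.0, 30.0),
}

print(sorted(nondominated(scores)))
```

The “hopeless” model loses to both others on every objective and gets dropped. The other two each win somewhere, so both stay, and nobody had to pick a favorite error measure up front.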
We’re all stupid. But it’s less immediately stupid if you postpone decisions about which error measures to use until the last moment. Be like the programs themselves: “Here, I have these! You pick!”