Best Seen plots are best not seen…

The Best Seen plot

For decades (lit­er­ally), evo­lu­tion­ary algo­rithms (espe­cially genetic pro­gram­ming) have been visu­al­ized with “Best Seen plots”, or occa­sion­ally “Best-and-average plots”. Sim­ple enough to do, these are just a line chart of the first order sta­tis­tic of the sam­ple of fit­nesses seen so far. Because some­body with some sta­tis­ti­cal back­ground said some­thing back in the 90s, many Best Seen plots are actu­ally plots of the aver­age of sev­eral runs of the evo­lu­tion­ary algo­rithm, “so as bet­ter to com­pare per­for­mance between methods.”

In my expe­ri­ence, very few peo­ple have an intu­ition of how order sta­tis­tics work. I know I don’t.

First of all, it seems to me that the implicit premise of a Best Seen plot is that the under­ly­ing process you’re plot­ting is iid, and of course we all know that evo­lu­tion­ary algo­rithms of all sorts con­verge, and that the dis­tri­b­u­tion of vari­a­tions often shifts from “explo­ration” to “exploita­tion” (and back again, in some cases). So – it seems to me – the thing one wants to see is where that shift arises, the “elbow” of the curve, at least the change in rate of advancement.

If you want to know any­thing about the “best so far” at all.

More impor­tant, at least to me, is the infor­ma­tion these Best Seen plots hide. I can’t tell you how many papers I’ve seen where there’s a Aver­aged Best Seen pile of over­lap­ping hockey-sticks for dif­fer­ent test con­di­tions, and the authors spend two or three columns fret­ting about try­ing to “explain” what’s hap­pen­ing under var­i­ous con­di­tions. This just came up at the GPTP work­shop last week: Why is the lit­tle red hockey-stick curve taller than the blue one in the end, but they cross in the mid­dle? What makes the qual­i­ta­tive behav­ior of the two treat­ments differ?

Why don’t you plot the data you’re talk­ing about?

The Piano Roll plot

I’m a big fan of the use of sum­mary sta­tis­tics (includ­ing order sta­tis­tics, aver­ages, medi­ans, all that junk) for sto­ry­telling about the data. For explana­tory power they pro­vide, and the the­o­ret­i­cal under­pin­nings that can be use­ful when the accom­pa­ny­ing assump­tions are met.

But when­ever I am actu­ally try­ing to explain what’s hap­pen­ing in the course of an evo­lu­tion­ary algorithm’s run, I use a Piano Roll plot. Here’s one I dug up:

10100.jpg

  • A piano roll, as you know, is a long sheet of rolled paper, into which are cut a series of holes which trig­ger the keys of an auto­mated player piano or music box. A Piano Roll plot is sim­ply a plot of every eval­u­ated fit­ness in the order of evaluation.

    No sum­mary sta­tis­tics applied.

    I mean look at the infor­ma­tion you have there. Dur­ing the entire course of the run, you can see the “themes” of indi­vid­u­als with log error about 42, and about 14, and about 8. You can see all the actual work being done in the first 2500 eval­u­a­tions or so. You can see when minor vari­a­tions make incre­men­tal improvements.

    What you’re see­ing is the real impact your algo­rithm has on the mea­sured fitnesses.

    Do those weird lines at 42, 14 and 8 arise because of the inher­ent prob­lem struc­ture, or because there is an inher­ent ten­dency in the problem’s for­mu­la­tion to ran­domly pro­duce solu­tions with that quality?

    I can tell you imme­di­ately, even though these are results from years ago. You see that green line? That indi­cates the point where ran­dom guess­ing (dur­ing ini­tial­iza­tion) tran­si­tions to active search. We see there are def­i­nitely peaks at 42, 14 and 8, and so yes: those lines are per­fectly rea­son­able out­comes of my search algo­rithm “break­ing” bet­ter solu­tions back into their com­po­nent approximations.

    Why does that lit­tle stub of a line around log error 3 appear and then disappear?

    I can tell you imme­di­ately: Actual search is work­ing there. That fam­ily of solu­tions is the real “best seen” in those first 2500 steps or so, but it clearly gives rise to a sit­u­a­tion where selec­tion can pre­vent rever­sion to that form. Com­pare it to the long lines that cover the entire time-span, and you can see that in this case the appear­ance and sub­se­quent dis­ap­pear­ance of that vari­a­tion around 3 is qual­i­ta­tively dif­fer­ent from the one around 12.

    Why does the “best seen” curve (the lowest-error solu­tion, in this ori­en­ta­tion) have such a tight “elbow” around 2500 steps?

    In a Best Seen plot, you would lose all infor­ma­tion you might use to explain the dynam­ics of the “elbow” we see. Here we can see exploita­tion has stopped by about 2500 steps, and what we’re get­ting is mainly ran­dom search for the remain­ing 17000 steps or so. How can I say that? Because the rolling his­togram of fit­nesses, say over 200 steps, would look a lot like the ini­tial ran­dom search plus the best solu­tions found.

    Assum­ing this search is not “finished” — that is, that it hasn’t found the “right” solu­tion — I can tell you imme­di­ately what’s hap­pened: we’ve con­verged to a sub­op­ti­mal solution.

    You couldn’t tell me that with a Best Seen plot. You could guess. But as we begin to use more real­is­ti­cally com­pli­cated algo­rithms your guess is less likely to be useful.

    And draw­ing five or six guesses — five or six Best Seen plots — all on one plot?

    Do you want me to inter­ro­gate you in pub­lic and irri­tate you and make your speak­ing engage­ment a liv­ing hell, only to have you look at the data you threw away to answer a rea­son­able ques­tion like mine?

    No, you do not. Stop it.

    Another exam­ple

    Here’s another plot from a project I’m work­ing on right now:

    pianoRoll2.jpg

    Yes, there are a lot of points. I’m not look­ing at the points, I’m look­ing at the behav­ior of my algo­rithm in the con­text of the prob­lem it’s being applied to solve.

    I see a lot of things here, mainly in com­par­i­son to other prob­lems and other vari­a­tions of the algo­rithm. But there’s some­thing I want to point out (I’ve anno­tated the PDF): around 20000 steps, there is a tiny improve­ment in the error (you can see the small drop at the lower edge).

    But the inter­est­ing thing to me — the guy writ­ing the algo­rithm and hop­ing to make it work for a given task — is all that activ­ity hap­pen­ing sud­denly in “worse” solu­tions, start­ing around 18000 steps. A whole slew of new error rates start crop­ping up in the 130 – 140 range, which have rarely if ever been seen before in the run.

    Where do those come from? I can only infer they’re related to the slightly-better best that appears at the same time. Crossover and muta­tion prod­ucts, some favor­able mat­ings with what we had on hand.

    This is infor­ma­tion lost in the Best Seen plot. Explana­tory, use­ful information.

    Think about my response to the two forms:

    1. If I were watch­ing merely the Best Seen plot as this run was unfold­ing, I’d see an elbow around 2000 eval­u­a­tions, and a long plateau. Noth­ing seems to be chang­ing, even though within the actual pop­u­la­tion there’s plenty going on under the sur­face. If I were patient enough, and could have sat and stared at noth­ing for 17000 eval­u­a­tions, finally a lit­tle “blip” but then noth­ing bet­ter again for a long time. I’d get bored.
    2. Since I am watch­ing the Piano Roll plot, I see there was poten­tial for improve­ment. While the search is run­ning, I am spend­ing time work­ing on a diversity-preserving vari­a­tion that I hope can intro­duce “new blood” more often after about 3000 eval­u­a­tions. About the time I’m done writ­ing that, the Piano Roll shows me not a lit­tle “blip” but a qual­i­ta­tive change in the pop­u­la­tion dynam­ics around 17000 steps. And I can tell a use­ful story about that: the long wait was specif­i­cally the period when the pop­u­la­tion was wrestling with com­ing up with a novel approach to the prob­lem using muta­tion, not crossover. Fur­ther sup­port­ing evi­dence for my idea that more raw mate­ri­als are needed.

    And again, imag­ine if I had not only looked at the Best Seen plot for one run, but I had run the thing five or twenty or a hun­dred times and aver­aged those record curves.

    I won’t say the Best Seen plot is utterly use­less, because clearly peo­ple find enough value in them that they fill the pages of pro­ceed­ings vol­umes. But I can­not use one in my work, and nor can any­body else who is try­ing to solve the prob­lem on which they are searching.

    Best Seen plots are, appar­ently, for telling a very par­tic­u­lar kind of story about evo­lu­tion­ary algo­rithms, in a way that makes the world seem smooth and clean and unen­cum­bered by inter­est­ing variety.

    The world I work in is con­tin­gent. It’s the same one you work in, as it hap­pens. So when you tell me one of those nice smooth sto­ries about the imag­i­nary world, you leave your­self open for me to tell, quite sim­ply, five other sto­ries just as good.

    Do not show me your Best Seen plot, unless you have your story down pat.

    Comments are closed.