Friday, August 15, 2014

Sampling Bias Is Not Information

[Edits 8/19/2014 and 8/20/2014: Although the proof of the so-called “no free lunch” theorem is logically very simple, it looked like a vomited hairball in the first version of this post. I’ve made it considerably more inviting. Note also that the online version of “Evaluation of Evolutionary and Genetic Optimizers: No Free Lunch” now has a preface.]

This corrects a post that I retracted, and promised to replace, more than a year ago. It took me only a few days to redo the math. But then I executed a fabulously ADDled drift into composition of a preface for my first paper on “no free lunch” in optimization, planning to use it also as a replacement post. Now I must admit that I am ensnarled in words, and that I don’t know when, or even if, I will break free. Consequently, I am dumping here some fragments that should make sense, though they say much less than I would like to get said.

For those of you who are more interested in ID creationism than in NFL, the highlight is that my references to “conservation of information,” elevated to “English’s Principle of Conservation of Information” by Dembski and Marks in “Conservation of Information in Search: Measuring the Cost of Success,” amount to nothing but failure to recognize that I was discussing statistical independence. Much of what I say in the subsection “Understanding the Misunderstanding of Information” applies not only to me, but also to Dembski and Marks. In fact, I recommend reading the relatively accessible “Bob Marks grossly misunderstands ‘no free lunch’” first.



Abstract

The recent “no free lunch” theorems of Wolpert and Macready indicate the need to reassess empirical methods for evaluation of evolutionary and genetic optimizers. Their main theorem states, loosely, that the average performance of all optimizers is identical if the distribution of functions is average. [An “optimizer” selects a sample of the values of the objective function. Its “performance” is a statistic of the sample.] The present work generalizes the result to an uncountable set of distributions. The focus is upon the conservation of information as an optimizer evaluates points [statistical independence of the selection process and the selected values]. It is shown that the information an optimizer gains about unobserved values is ultimately due to its prior information of value distributions. [The paper mistakes selection bias for prior information of the objective function.] Inasmuch as information about one distribution is misinformation about another, there is no generally superior function optimizer. Empirical studies are best regarded as attempts to infer the prior information optimizers have about distributions [match selection biases to constrained problems] — i.e., to determine which tools are good for which tasks.


0. Sampling Bias Is Not Information

This preface (Sect. 0) addresses expository errors in the original paper (English, 1996), but stands as a self-contained report. Specific corrections and amplifications appear in the remainder of the text (Sects. 1–6).


Figure 1: Objective function $ \newcommand{\disteq}{\stackrel{\scriptscriptstyle{\mathcal{D}}}{=}} \newcommand{\Fseq}[2]{% {F_1^{#1}\hspace{-1pt}({#2)}} } \newcommand{\selection}{{X_1^n}} \newcommand{\sample}{{\Fseq{n}{X}}} \newcommand{\Fn}[1]{{\Fseq{n}{#1}}} \newcommand{\wn}{{w_1^n}} \newcommand{\xn}{{x_1^n}} \newcommand{\yn}{{y_1^n}} \newcommand{\zn}{{z_1^n}} \newcommand{\boldx}{\mathbf{x}} \newcommand{\boldw}{\mathbf{w}} \newcommand{\X}{\mathcal{X}} \newcommand{\Xdiamond}{\mathcal{X}_\diamond} \newcommand{\Y}{\mathcal{Y}} \newcommand{\Ydiamond}{\mathcal{Y}_\diamond} F$ and quality measure $\psi$ comprise the optimization problem. An optimizer is a sampler $X,$ along with a reduction $r$ of the selection $\xn$ and the sample $\yn$ to outputs $\tilde{x}_1^m = x_{j_1}, \dotsc, x_{j_m}$ and $\tilde{y}_1^m = y_{j_1}, \dotsc, y_{j_m},$ with $j_1^m$ strictly increasing. NFL analyses make four assumptions: the selection is non-repeating, the reduction is identical for all optimizers under consideration, the output $\tilde{y}_1^m$ depends only on the sample, and quality depends only on $\tilde{y}_1^m.$ Accordingly, write $\tilde{y}_1^m = r(\yn)$ and $\phi_n = \psi(\tilde{y}_1^m).$ Then the quality $\psi(r(\yn))$ of the optimizer’s output is a statistic $\phi = \psi \circ r$ of the sample it generates.


Black-box optimization may be decomposed into generation of a sample of the values of the objective function, and use of the sample to produce a result. “No free lunch” (NFL) analyses assume explicitly that sampling is without repetition, and define quality of the optimization result to be a statistic of the sample (Wolpert and Macready, 1995, 1997). Fig. 1 unpacks the tacit assumptions, the most important of which is that the optimizers under consideration differ only in selection of samples. Fig. 2 shows how postprocessing of the sample, assumed to be identical for all optimizers, is embedded in the statistic. Although the statistic combines part of the optimizer with part of the optimization problem, the NFL literature refers to it as the performance measure, and to samplers as optimization (or search) algorithms.


Figure 2: Random sampler $X$ is statistically independent of random objective function $F,$ and the selection $X(\lambda) = x_1, \dotsc, X(y_1, \dotsc, y_{n-1}) = x_n$ is statistically independent of the sample $F(x_1) = y_1,$ $\dotsc, F(x_n) = y_n.$ The statistic $\phi = \psi \circ r$ is the composition of the quality measure $\psi$ on optimizer outputs, given in the optimization problem, and the reduction $r$ of samples to outputs, assumed to be identical for all optimizers. The selection and the sample are abbreviated $\selection = \xn$ and $\sample = \yn,$ respectively.


The better known of NFL theorems, including that of Sect. 3.2 (which appears also in Streeter, 2003, and Igel and Toussaint, 2005), are reinventions of basic results in probability (Häggström, 2007). They address sampling and statistics. This claim is justified by showing that the selection process \[ X_i \equiv X(F(X_1), \dotsc, F(X_{i-1})), \] $1 \leq i \leq n,$ is statistically independent of the selected values \[ \sample \equiv F(X_1), \dotsc, F(X_n), \] despite its data processing. Sect. 3.1 supplies proof for the special case of deterministic selection, with the domain and codomain of the random objective function restricted to finite sets. Here the result is generalized to random selection, with repetition allowed, and with the domain and codomain possibly infinite, though countable. Furthermore, the selection process is equipped to terminate. Although this would seem to complicate matters, the proof is greatly simplified.

The selection process cannot be regarded as having or gaining information of unselected values of the objective function when it is statistically independent of the selected values. Data processing is a source of bias in the selection. The paper fails, in exposition, to recognize statistical independence for what it is, and refers instead to “conservation of information.” (The formal results are correct.) It furthermore equates the propensity of an optimizer to perform better for some distributions of the random objective function and worse for others with prior information that somehow inheres in the optimizer. The notion that one entity simply has, rather than acquires, varying degrees of prior information and misinformation of others is incoherent.

The following subsection formalizes the system sketched in Fig. 2, proves that the selection is indeed statistically independent of the selected values, and goes on to demonstrate that a general NFL theorem, so called, follows almost immediately. This exercise leaves very little room for contention that the NFL theorems address something other than sampling. The preface concludes with a sub­sec­tion that briefly deconstructs the paper’s erroneous claims about information.


0.1 Formal Results on Sampling

Sampling processes do not necessarily terminate, and it is convenient to treat them all as infinite. The distinguished symbol $\diamond$ is interpreted as the sample terminator. Let countable sets $\Xdiamond$ = $\X \cup \{\diamond\}$ and $\Ydiamond$ = $\Y \cup \{\diamond\},$ where sets $\X$ and $\Y$ exclude $\diamond.$ The random objective function $F$ maps $\Xdiamond$ to $\Ydiamond,$ with $F(x) = \diamond$ if and only if $x = \diamond.$

Finite sequences of the forms $\alpha_1, \dotsc,$ $\alpha_n$ and $\gamma(\alpha_1), \dotsc,$ $\gamma(\alpha_n)$ are abbreviated $\alpha_1^n$ and $\gamma_1^n(\alpha),$ respectively. A sampler is a random function, statistically independent of $F,$ from the set of all finite sequences on $\Ydiamond$ to $\Xdiamond.$ A selection is a random vector $\selection$ with \[ X_i \equiv X(F(X_1), \dotsc, F(X_{i-1})) \] $1 \leq i \leq n,$ where $X$ is a sampler. The selection is non-repeating if $\selection \in \pi(\X)$ surely, where $\pi(\X)$ is the set of all non-repeating sequences on $\X.$

A statistic is a function with the set of all finite sequences on $\Ydiamond$ as its domain. Assume, without loss of generality, that the codomain of a statistic is a countable superset of its range, which is countable like the domain.

Probability measure is unproblematic when considering the selection $\selection$ and the corresponding sequence $\sample,$ because both take values in countable sets. The following lemma establishes that the selection is statistically independent of the selected values, i.e., that it is correct to refer to $\sample$ as a sample of the values $\{F(x) \mid x \in \Xdiamond\}$ of the objective function. The data processing in extensions of the selection, highlighted in the proof, is a potential source of selection bias, not information about the objective function.

Lemma. Selection $X_1, \dotsc, X_n$ is statistically independent of $F(X_1), \dotsc, F(X_n).$

Proof. Let $\xn$ and $\yn$ be nonempty sequences on $\Xdiamond$ and $\Ydiamond,$ respectively, such that $P( \Fseq{n}{x} = \yn) > 0.$ Then \begin{align*} P( \selection = \xn, \sample = \yn) &= P(\selection = \xn, \Fseq{n}{x} = \yn) \\ &= P(\selection = \xn \mid \Fseq{n}{x} = \yn) \cdot P(\Fseq{n}{x} = \yn), \end{align*} and \begin{align*} &P(\selection = \xn \mid \Fseq{n}{x} = \yn) \\ &\quad= P(X(\Fseq{0}{x}) = x_1, \dotsc, X(\Fseq{n-1}{x}) = x_n \mid \Fseq{n}{x} = \yn) \\ &\quad= P(X(y_1^0) = x_1, \dotsc, X(y_1^{n-1}) = x_n \mid \Fseq{n}{x} = \yn) \\ &\quad= P(X(y_1^0) = x_1, \dotsc, X(y_1^{n-1}) = x_n) \end{align*} because sampler $X$ is statistically independent of objective function $F.$ 

Corollary 1. Selection $X_1, \dotsc, X_n$ is statistically independent of $\phi(F(X_1), \dotsc, F(X_n))$ for all statistics $\phi.$

The following NFL theorem, so called, uses the symbol $\disteq$ to denote equality in probability distribution. It says, loosely, that the distribution of a statistic depends on the choice of sampler if and only if the distribution of the statistic depends on the choice of sample.

Theorem. Let $x_1, \dotsc, x_n$ be a nonempty, non-repeating sequence on $\X,$ and let $\phi$ be a statistic. Then \begin{equation} \phi(F(X_1), \dotsc, F(X_n)) \disteq \phi(F(x_1), \dotsc, F(x_n)) \label{eq:selections} \end{equation} for all non-repeating selections $X_1, \dotsc, X_n$ if and only if \begin{equation} \phi(F(w_1), \dotsc, F(w_n)) \disteq \phi(F(x_1), \dotsc, F(x_n)) \label{eq:sequences} \end{equation} for all non-repeating sequences $w_1, \dotsc, w_n$ on $\X.$

Proof.
($\Rightarrow$) Suppose that (1) holds for all non-repeating selections $\selection,$ and let $\wn$ be a non-repeating sequence on $\X.$ There exists sampler $X$ with constant selection $\selection = \wn,$ and thus (2) follows.
($\Leftarrow$) Suppose that (2) holds for all non-repeating sequences $\wn$ on $\X,$ and let $\selection$ be a non-repeating selection. A condition stronger than (1) follows. For each and every realization $\selection = \wn,$ non-repeating on $\X$ by definition, \begin{equation*} \phi(\sample) \disteq \phi(\Fseq{n}{w}) \disteq \phi(\Fseq{n}{x}). \end{equation*} The first equality holds by Corollary 1, and the second by assumption. 

Corollary 2. Let $x_1, \dotsc, x_n$ be a nonempty, non-repeating sequence on $\X.$ Then \begin{equation} F(X_1), \dotsc, F(X_n) \disteq F(x_1), \dotsc, F(x_n) \label{eq:distselections} \end{equation} for all non-repeating selections $X_1, \dotsc, X_n$ if and only if \begin{equation} F(w_1), \dotsc, F(w_n) \disteq F(x_1), \dotsc, F(x_n) \label{eq:distsequences} \end{equation} for all non-repeating sequences $w_1, \dotsc, w_n$ on $\X.$

Proof. Set the statistic $\phi$ in the theorem to the identity function. 

The corollary is essentially the NFL theorem of English (1996, Sect. 3.2). When (4) holds for all $n,$ the random values $\{F(x) \mid x \in \X\}$ of the objective function are said to be exchangeable (Häggström, 2007).


0.2 Understanding the Misunderstanding of Information

The root error is commitment to the belief that information is the cause of performance in black-box optimization (search). The NFL theorems arrived at a time when researchers commonly claimed that evolutionary optimizers gained information about the fitness landscape, and adapted themselves dynamically to improve performance. Wolpert and Macready (1995) observe that superior performance on a subset of functions is offset precisely by inferior performance on the complementary subset. In online discussion of their paper, Bill Spears referred to this as conservation of performance. My paper suggests that conservation of information accounts for conservation of performance.

The lemma of Sect. 3.1, “Conservation of Information,” expresses the absolute uninformedness of the sample selection process. The performance of a black-box optimizer has nothing whatsoever to do with its information of the objective function. But the paper recognizes only that information gain is impossible, and claims incoherently that prior information resides in the optimizer itself. Conservation of this chimeral information supposedly accounts for conservation of performance in optimization. Here are the salient points of illogic:

  1. Information causes performance.
  2. The optimizer gains no exploitable information by observation, so it must be prior information that causes performance.
  3. There is no input by which the optimizer might gain prior information, so it must be that prior information inheres in the optimizer.
  4. Prior information of one objective function is prior misinformation of another. Conservation of performance is due to conservation of information.

It should have been obvious that prior information is possessed only after it is acquired. The error is due in part to a mangled, literalistic half-reading of (Wolpert and Macready, 1995, p. 8, emphasis added):

The NFL theorem illustrates that even if we know something about [the objective function] … but don’t incorporate that knowledge into [the sampler] then we have no assurances that [the sampler] will be effective; we are simply relying on a fortuitous matching between [the objective function] and [the sampler].
The present work (Sect. 6) concludes that:
The tool literally carries information about the task. Furthermore, optimizers are literally tools — an algorithm implemented by a computing device is a physical entity. In empirical study of optimizers, the objective is to determine the task from the information in the tool.
This reveals confusion of one type of information with another. When a toolmaker imparts form to matter, the resulting tool is in-formed to suit a task. But such form is not prior information. Having been formed to perform is different from having registered a signal relevant to the task. An optimization practitioner may gain information of a problem by observation, and then form a sampler to serve as a proxy in solving it. Although the sampler is informed to act as the practitioner would, it is itself uninformed of the problem to which it is applied, and thus cannot justify its own actions. The inherent form that accounts for its performance is sampling bias.

Unjustified application of a biased sampler to an optimization problem is merely biased sampling by proxy. The NFL theorems do not speak to this fundamental point. They specify conditions in which all of the samplers under consideration are equivalent in overall performance, or are nearly so. Ascertaining that none of these theorems applies to a real-world circumstance does not justify a bias, but instead suggests that justification may be possible. There is never a “free lunch” for the justifier.


References

T. M. English. Evaluation of evolutionary and genetic optimizers: No free lunch. In L. J. Fogel, P. J. Angeline, and T. Bäck, editors, Evolutionary Programming V: Proceedings of the Fifth Annual Conference on Evolutionary Programming, pages 163–169. MIT Press, Cambridge, MA, 1996.

O. Häggström. Intelligent design and the NFL theorems. Biology & Philosophy, 22(2):217–230, 2007.

C. Igel and M. Toussaint. A no-free-lunch theorem for non-uniform distributions of target functions. Journal of Mathematical Modelling and Algorithms, 3(4):313–322, 2005.

M. J. Streeter. Two broad classes of functions for which a no free lunch result does not hold. In Genetic and Evolutionary Computation – GECCO 2003, pages 1418–1430. Springer, 2003.

D. H. Wolpert and W. G. Macready. No free lunch theorems for search. Technical report, SFI-TR-95-02-010, Santa Fe Institute, 1995.

D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.

No comments :

Post a Comment