Concurrent Effect Search in Evolutionary Systems
Keki M. Burjorjee
Pandora Media Inc.
Oakland, CA
[email protected]
ABSTRACT
Concurrent Effect Search (CES) is a form of efficient computational learning thought to underlie general-purpose, nonlocal, noise-tolerant adaptation in evolutionary algorithms.
We demonstrate that CES is indeed efficient by showing
that it can be used to obtain optimal bounds on the time
and queries required to approximately correctly solve a subclass (k = 7, η = 1/5) of a familiar computational learning
problem: learning parities with noisy membership queries,
where k is the number of relevant attributes and η is the
oracle’s noise rate.
We show that a simple genetic algorithm that treats the
noisy membership query oracle as a fitness function can be
straightforwardly used to PAC-learn the relevant variables
in O(log(n/δ)) queries and O(n log(n/δ)) time, where n is
the total number of attributes and δ is the probability of
error. To the best of our knowledge, this is the first rigorous
identification of optimally efficient computation in an evolutionary algorithm on a non-trivial learning problem. Our
proof technique relies on accessible symmetry arguments and
the use of statistical hypothesis testing to reject a global null
hypothesis at the 10^-100 level of significance.
Both the result obtained and the implicit implementation of CES by a simple genetic algorithm depend crucially on recombination. This dependence yields a fresh explanation for the role of sex (based on the unmixability of genes). This new explanation follows straightforwardly from the CES hypothesis, further elevating its appeal.
1. INTRODUCTION
In recent years, theoretical computer scientists have become increasingly perturbed by the problem posed by evolution. The subject of consternation is a system thought
to have computed the encoding of every biological form to
have lived, that, as luck would have it, represents information the way a Turing Machine might—digitally; in strings
drawn from a quaternary alphabet—and manipulates information in ways that are well understood (e.g. meiosis) or
amenable to abstraction (e.g. natural selection). Contemplating what this computational system has achieved given
the resources at its disposal leaves one awestruck. Yet, theoretical computer science, for all its success in other areas,
has not identified anything particularly efficient or arresting about evolution.1 We have on our hands what might
be called a computational origins problem—while our physical origins have been worked out, our computational origins
remain a mystery. Referring to this problem, Valiant speculates that future generations will wonder why it was not
regarded with a greater sense of urgency [29].
If the computational origins problem is cause for reflection within Theoretical Computer Science, it is doubly so
amongst Evolutionary Computation theorists. The promise
of Evolutionary Computation was twofold: 1) That the computational efficiencies at play in natural evolution would be
identified and 2) That these efficiencies would be harnessed,
via biomimicry, and used to efficiently procure solutions to
human problems. One might expect these outcomes to be
realized in order or simultaneously. In a curious twist, however, the field seems to be making good on the second part
of its promise, but has not delivered on the first. This twist
makes evolutionary algorithms rather interesting from a theoretical standpoint. Find something computationally efficient about one of the more bioplausible ones, and a piece
of the computational origins puzzle may fall into place.
We have previously suggested that an efficient form of
computational learning underlies adaptation in evolutionary
algorithms [4, 3]. Here, we describe the hypothesized computational efficiency—Concurrent Effect Search (CES)—in
detail and explain how it can drive efficient general-purpose,
non-local, noise-tolerant adaptation. We establish that
CES is a bona fide form of efficient computational learning by using it to derive optimal bounds on the query and
time complexity of an algorithm that PAC-solves a subclass
(k = 7, η = 1/5) of a non-trivial computational learning
problem: learning parities with noisy membership queries
[28, 8], where k is the number of relevant attributes, and η
is the probability that the membership query (MQ) oracle misclassifies a query (i.e. returns a 1 instead of a 0,
or vice versa). This result—to the best of our knowledge,
the first time efficient, not to mention optimally efficient,
non-trivial computation is shown to occur in an evolutionary algorithm—hinges on an empirical conclusion with a p-value less than 10^-100.

¹The Computational Theories of Evolution Workshop held at the Simons Institute, Berkeley, CA from March 17 – March 21, 2014 brought together researchers from multiple disciplines to discuss the issue. Presentations available at http://simons.berkeley.edu/workshops/abstracts/326
1.1 The Integral Role of Recombination
Of all the mysteries evolution holds for a computer scientist, surely, sex is one of the greatest. No account of the
computational power of evolution can be considered complete or even close to complete as long as the part played by
sex remains murky, or marginal. The role of recombination
in the implicit implementation of Concurrent Effect Search
is neither. Following the derivation of our result we explain
why recombination is crucial to the derivation. Briefly, parity functions induce a property called unmixability, which
recombination adroitly exploits.
2. CONCURRENT EFFECT SEARCH
We begin with a primer on schemata and schema partitions [10, 19]. For any positive integer k, let [k] denote
the set {1, . . . , k}. For any mathematical expression a, and
boolean valued expression b, let [b]{a} denote a if b evaluates to true and 0 otherwise. For any positive integer n, a
schema (plural, schemata) of the set of binary strings {0, 1}n
is a subset of {0, 1}n that is traditionally represented by a
schema template. A schema template is a string in {0, 1, ∗}n .
As will become clear from the formal definition below, the
symbol ∗ stands for ‘wildcard’ at the positions at which it
occurs. Given a schema template s ∈ {0, 1, ∗}^n, let JsK denote the schema represented by s. Then,

    JsK = {x ∈ {0, 1}^n | ∀i ∈ [n], s_i = ∗ ∨ x_i = s_i}
Let I ⊆ {1, . . . , n} be some set of integers. Then I represents a partition of {0, 1}^n into 2^|I| schemata, whose templates are given by {s ∈ {0, 1, ∗}^n | s_i ≠ ∗ ⇔ i ∈ I}. This partition of {0, 1}^n, denoted JIKn, can also be represented in template form by a string t of length n drawn from the alphabet {#, ∗} where t_i = # ⇐⇒ i ∈ I. Here the symbol
# stands for ‘defined bit’ and the symbol ∗ stands, as it did
before, for ’wildcard’. The schema partition represented by
a schema partition template t is denoted JtK. We omit the
double brackets (i.e., J·K) when their use is clear from the
context.
The order of a schema partition is simply the number of
# symbols in the schema partition’s template. It is easily seen that schema partitions of lower order are coarser
than schema partitions of higher order. More specifically, a
schema partition of order k is comprised of 2k schemata.
Example 1. The index set {1, 2, 4} induces an order
three schema partition of {0, 1}5 into eight schemata as
shown in Figure 1.
    00*0*  00*1*  01*0*  01*1*  10*0*  10*1*  11*0*  11*1*
    00000  00010  01000  01010  10000  10010  11000  11010
    00001  00011  01001  01011  10001  10011  11001  11011
    00100  00110  01100  01110  10100  10110  11100  11110
    00101  00111  01101  01111  10101  10111  11101  11111

Figure 1: A tabular depiction of the schema partition ##*#* of order three. The table headings give the templates of the schemata comprising the partition. The elements of each schema in the partition appear in the column below its template.
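For concreteness, here is a minimal Python sketch of the schema machinery just described; it enumerates the schemata of the partition induced by an index set, and with I = {1, 2, 4} and n = 5 it reproduces the columns of Figure 1. The function names are illustrative only.

    from itertools import product

    def schema_elements(template):
        # All strings in {0,1}^n that match a schema template over {0, 1, *}.
        choices = ['01' if c == '*' else c for c in template]
        return [''.join(bits) for bits in product(*choices)]

    def partition_templates(index_set, n):
        # Templates of the 2^|I| schemata in the partition induced by index_set
        # (1-based indices), listed in the order used in Figure 1.
        templates = []
        for bits in product('01', repeat=len(index_set)):
            t = ['*'] * n
            for pos, b in zip(sorted(index_set), bits):
                t[pos - 1] = b
            templates.append(''.join(t))
        return templates

    # The order-three partition of {0,1}^5 induced by the index set {1, 2, 4}.
    for t in partition_templates({1, 2, 4}, 5):
        print(t, schema_elements(t))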
Schemata in a schema partition are assumed to be in lexicographical order of their templates. Formally, for any positive integer n, and any index set I ⊆ [n], we assume that the schemata in JIKn are ordered as follows: Js_1K, . . . , Js_{2^|I|}K, where s_1, . . . , s_{2^|I|} is a list of templates of the schemata in JIKn ordered lexicographically.

Definition 2 (Underlying Effect). For any positive integer n, any non-negative integer k, and any index set I ⊆ [n] such that |I| = k, let S_i denote the ith schema of JIKn. The underlying effect of the schema partition JIKn with respect to some fitness function φ : {0, 1}^n → R is a value ξ_I defined as follows:

    ξ_I = (1/2^k) Σ_{i=1}^{2^k} (F^φ_{S_i} − F^φ_{{0,1}^n})^2

where, for any set S ⊆ {0, 1}^n, F^φ_S denotes the mean fitness of the elements of S.
In words, the underlying effect of the schema partition
JIKn is the variance of the mean fitness values of the
schemata in JIKn . Considering the size of typical search
spaces, the underlying effects of non-trivial schema partitions are unknowable. However, given some number of independent samples drawn from the uniform distribution over
{0, 1}n and their fitness values, one can construct an estimator like the one below.
Definition 3 (Sampling Effect). For any positive integer n, any non-negative integer k, and any index set I ⊆ [n] such that |I| = k, let S_i denote the ith schema of JIKn. Let X_1, . . . , X_r be random variables drawn independently from the uniform distribution over {0, 1}^n. For any i ∈ [2^k], let Y_i be a random variable defined as follows:

    Y_i = Σ_{j=1}^{r} [X_j ∈ S_i]{1}

and let W be a random variable defined as follows:

    W = Σ_{i=1}^{2^k} [Y_i > 0]{1}

For any i ∈ [2^k] such that Y_i > 0, let the random variable Z_ij be X_ℓ, where ℓ is the index of the jth random variable in the list X_1, . . . , X_r that belongs to schema S_i. The sampling effect of JIKn with respect to some fitness function φ : {0, 1}^n → R is given by the estimator ξ̂_I defined as follows:

    ξ̂_I = (1/W) Σ_{i=1}^{2^k} [Y_i > 0] { (F̂^φ_{S_i} − F̂^φ_{{0,1}^n})^2 }

where F̂^φ_{{0,1}^n} is an estimator defined as follows:

    F̂^φ_{{0,1}^n} = (1/r) Σ_{i=1}^{r} φ(X_i)

and, for any i ∈ [2^k] such that Y_i > 0, F̂^φ_{S_i} is an estimator defined as follows:

    F̂^φ_{S_i} = (1/Y_i) Σ_{j=1}^{Y_i} φ(Z_ij)
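The estimator ξ̂_I of Definition 3 is straightforward to compute from a sample. The Python sketch below (the function and variable names are illustrative, and the fitness function in the usage example is a made-up stand-in) groups samples by the schema they fall into and averages, over the non-empty schemata, the squared deviation of each schema's sample mean fitness from the global sample mean fitness.

    import random
    from collections import defaultdict

    def sampling_effect(index_set, samples, fitnesses):
        # Estimate the sampling effect of the partition induced by index_set
        # (1-based indices). samples: bit-tuples drawn uniformly from {0,1}^n.
        global_mean = sum(fitnesses) / len(fitnesses)
        by_schema = defaultdict(list)
        for x, fx in zip(samples, fitnesses):
            key = tuple(x[i - 1] for i in sorted(index_set))
            by_schema[key].append(fx)
        # Average over the W non-empty schemata of the squared deviation of the
        # schema's sample mean fitness from the global sample mean fitness.
        devs = [(sum(v) / len(v) - global_mean) ** 2 for v in by_schema.values()]
        return sum(devs) / len(devs)

    # Usage sketch with a made-up fitness function over {0,1}^10.
    n, r = 10, 2000
    phi = lambda x: (x[2] ^ x[6]) + 0.05 * random.random()   # hypothetical fitness
    samples = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(r)]
    fits = [phi(x) for x in samples]
    print(sampling_effect({3, 7}, samples, fits), sampling_effect({1, 2}, samples, fits))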
As estimators are random variables, they have means and variances. Let us informally examine how the mean and variance of ξ̂_I change as we coarsen a schema partition. Coarsening a schema partition JIKn amounts to removing elements from I. For any I′ ⊂ I it is easily seen that ξ_I′ ≤ ξ_I. We conjecture, therefore, that E[ξ̂_I′] ≤ E[ξ̂_I]. What about the variance of ξ̂_I? Here we rely on the observation that coarsening JIKn reduces the variance of the estimators F̂^φ_{S_i} for all i ∈ [2^k] such that Y_i > 0. This observation underlies our conjecture that Var[ξ̂_I′] ≤ Var[ξ̂_I].
The variance of an estimator is, of course, inversely related to the statistical significance of statements about the
expected value of the estimator. Suppose one’s goal is to
find schema partitions with statistically significant sampling
effects. Then the conjectures above suggest that coarsening schema partitions is a mixed blessing. On the positive
side, coarsening decreases the variance of the sampling effect, driving up statistical significance. On the other hand
it pushes the expected sampling effect to zero, adversely affecting statistical significance. With respect to the fitness
functions arising in practice, we conjecture the existence
of one or more “goldilocks” schema partitions, i.e. schema
partitions that are coarse enough that the variance of the
sampling effect is not too high, and not so coarse that the
expected sampling effect is too low.
Assuming, conservatively, that goldilocks schema partitions are rare, how many coarse schema partitions must one evaluate on average before such a schema partition is found? For large values of n, the numbers are astronomical. For example, when n = 10^6, the number of schema partitions of order 2, 3, 4, and 5 are on the order of 10^11, 10^17, 10^22, and 10^28 respectively. Implicit concurrent effect evaluation is a capacity possessed by simple genetic algorithms with uniform crossover for scalably (with respect to n) finding one or more coarse schema partitions with non-negligible effects. It amounts to a capacity for efficiently performing vast numbers of concurrent effect/no-effect multivariate analyses to identify small numbers of interacting loci with statistically significant non-zero sampling effects.
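These counts are just binomial coefficients and can be checked directly with a trivial sketch:

    from math import comb

    n = 10 ** 6
    for k in (2, 3, 4, 5):
        # Number of schema partitions of order k for chromosomes of length n.
        print(k, comb(n, k))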
2.1 Prior Theoretical Evidence

Theoretical support for the idea that evolution can implicitly perform vast numbers of concurrent operations appears in a prior paper [5], where it is shown that for any schema partition JIKn, infinite population evolution over {0, 1}^n using fitness proportional selection, homologous recombination, and standard bit mutation implicitly induces infinite population evolution over the set of schemata in JIKn.

More formally, let β : {0, 1}^n → JIKn be a function that maps binary strings in {0, 1}^n to their corresponding schemata in JIKn. Let S, V, and Ξ be the parameterized selection, variation, and projection operators defined in the paper [4]. Then, for any probability distribution p over {0, 1}^n, fitness function φ : {0, 1}^n → R, and transmission function T over {0, 1}^n that models homologous recombination followed by canonical bit mutation, there exist probability distributions q and r over {0, 1}^n, probability distributions p′, q′, r′ over JIKn, fitness function φ′ : JIKn → R, and transmission function T′ over JIKn such that the following diagram commutes:

          S_φ            V_T
    p ----------> q ----------> r
    |             |             |
    | Ξ_β         | Ξ_β         | Ξ_β
    v             v             v
    p′ ---------> q′ ---------> r′
          S_φ′           V_T′

The proof of the above is constructive. That is, φ′ and T′ are precisely specified. Crucially, there is no restriction on the index set I. In other words, the above holds true simultaneously for all schema partitions of {0, 1}^n. It is this simultaneity that underlies our sense that evolution implicitly carries out vast numbers of implicit operations concurrently.²

²The function β in the above was mistakenly called a coarse-graining by Burjorjee and Pollack [5]. The mistake was corrected in a later paper [2], where the idea of coarse-graining was explained in detail. We mention this because the concept of coarse-graining is used later on in this paper (in Section 6). As φ′ in the commutative diagram is not invariant to p, no coarse-graining of evolutionary dynamics was shown by Burjorjee and Pollack. What was actually shown is arguably more interesting—implicit concurrent evolution.

2.2 A Need for Science, Practiced With Rigor

Formal evidence for the CES hypothesis beyond that referenced above, specifically formal evidence pertaining to evolutionary algorithms with finite population sizes, seems difficult to obtain, not least because the analysis of evolutionary algorithms with finite populations is notoriously unwieldy. We have argued previously that resorting to the scientific method is a necessary and appropriate response to this hurdle [4]. After all, the scientific method, rigorously practiced, is the foundation of many a useful field of engineering. A hallmark of rigorous science is the ongoing making and testing of predictions. Predictions found to be true lend credence to the hypotheses that entail them. The more unexpected a prediction (in the absence of the hypothesis), the greater the credence owed the hypothesis if the prediction is borne out [23, 22].

The work that follows is meant to be received in this context. As we explain in the next section, the CES hypothesis straightforwardly entails that a genetic algorithm with uniform crossover (UGA) can be used in the straightforward construction of an algorithm that efficiently solves the problem of learning parities with a noisy MQ oracle for small but non-trivial values of k, the number of relevant attributes, and η ∈ (0, 1/2), the probability that the oracle makes a classification error (returns a 1 instead of a 0, or vice versa) on a given query. Such a result is completely unexpected in the absence of the CES hypothesis.
3. PAC-LEARNING PARITIES WITH MQS
The problem of learning parities is a refinement of the
learning juntas problem [20], so we approach the former
problem by way of the latter. For any boolean function
f over n variables, a variable i is said to be relevant if there
exist binary strings x, y ∈ {0, 1}n that differ only in their ith
coordinate such that f(x) ≠ f(y). Variables that are not relevant are said to be irrelevant. For any non-negative integer
n and any non-negative integer k ≤ n, a k-junta is a function
f : {0, 1}n → {0, 1} such that for some non-negative integer
j ≤ k, only j of the n inputs to f are relevant. These j relevant variables are said to be the juntas of f. The function f
is completely specified by its juntas (characterizable by the
set J ⊆ [n] of junta indices) and by a hidden boolean function
h over the j juntas. The output of f is just the output of
h on the values of the j relevant variables (the values of the
irrelevant variables are ignored). The problem of identifying
the relevant variables of any k-junta f and the truth table
of its hidden boolean function is called the learning k-juntas
problem.
The learning k-parities problem is a refinement of the
learning k-juntas problem where it is additionally specified
that j = k, and the hidden function h is the parity (i.e.
xor) function over k inputs. In this case, the function f is
completely specified by its juntas.
An algorithm A with access to some oracle φ is said to
solve the learning k-parities problem if for any k-parity function f : {0, 1}n → {0, 1} whose juntas are given by the set
J ⊆ [n] and any δ ∈ (0, 1/2), Aφ (n, δ) outputs a set S ⊆ [n]
such that Pr(S ≠ J) ≤ δ.
A noisy MQ oracle φ behaves as follows. For any string
x ∈ {0, 1}n , φ returns ¬f (x) with probability η ∈ (0, 1/2)
and f (x) with probability 1 − η. The parameter η is called
the noise rate. Bounds derived for the time and query complexity of Aφ with respect to n and δ speak to the efficiency
with which Aφ solves the problem. The algorithmic learning
described here is Probably Approximately Correct learning
[14] with the inaccuracy tolerance set to zero: Probably
Correct (PC) learning, if you will.
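As a concrete illustration of this setup, here is a minimal Python sketch of a noisy MQ oracle for a k-parity function; the class and parameter names are illustrative only. It returns f(x) with probability 1 − η and ¬f(x) with probability η, with the noise drawn independently on every query (a persistent-noise variant would memoize its answers).

    import random

    class NoisyParityMQOracle:
        # Noisy membership-query oracle for a k-parity function over {0,1}^n.
        def __init__(self, juntas, eta, seed=None):
            self.juntas = juntas          # 1-based indices of the relevant variables
            self.eta = eta                # noise rate
            self.rng = random.Random(seed)

        def query(self, x):               # x is a sequence of n bits
            parity = sum(x[i - 1] for i in self.juntas) % 2
            return parity ^ (self.rng.random() < self.eta)

    # Usage sketch: a 7-parity over {0,1}^100 with noise rate 1/5.
    oracle = NoisyParityMQOracle(juntas={3, 11, 22, 40, 57, 76, 99}, eta=0.2, seed=0)
    x = [random.randint(0, 1) for _ in range(100)]
    print(oracle.query(x))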
3.1 The Noise Model
Blum and Langley [1] give a simple binary search based
method that learns a single relevant variable of any k-junta
in O(n log n) time and O(log n) queries with noise free membership queries. As explained in Section 3.1 of a paper by
Mossel and O’Donnell [20], once a single relevant variable is
found, the method can be recursively repeated O(k2k ) times
to find the remaining relevant variables. In an MQ setting,
the introduction of noise does not complicate the situation
much if the corruption of a given answer is independent of
the corruption of previous answers. In this case, noise can
be eliminated to an arbitrary level simply by sampling the MQ oracle p(1/(1 − 2η)) times, where p(·) is some polynomial, and taking the majority value.
An appropriate departure from independent noise is the
random persistent classification noise model due to Goldman, Kearns, and Schapire [11] wherein on any query x, the
oracle operates as follows: If the oracle has been queried
on x before, it returns its previous answer to x. Otherwise,
it returns the correct answer with probability 1 − η. Such
an oracle appears deterministic from the outside, so clearly,
querying it some number of times and taking the majority
value does not reduce the effect of noise.
While persistent noise is an appropriate noise model for a
membership query oracle, it tends to make analysis difficult.
Fortunately, if it is extremely unlikely that an algorithm A
will query the oracle twice with the same value, then A can
be treated as if it is making calls to an MQ oracle with random independent classification noise [8]. The analysis in the
following sections takes advantage of this loophole. As n
gets large it becomes extremely unlikely that the membership query oracle will be queried twice or more on the same
input. We therefore treat the noise as if it were independent.
3.2 Information Theoretic Lower Bound
For any positive integer k, a simple information theoretic
argument shows that it is not possible to PAC-learn k-parity
functions in less than O(n log n) time or less than O(log n)
queries (not possible, in other words, to PAC-learn the relevant variables in o(n log n) time or o(log n) queries). The
argument relies on Shannon’s source coding theorem, part
of which states that if N i.i.d. random variables each with
entropy H(X) are transmitted in less than N H(X) bits it
is virtually certain that information will be lost [16].
Let us consider the minimum time and queries required
to learn just one relevant variable. Observe that the oracle
can transmit at most one bit per query and that the time
required by A to generate each query is Ω(n). Finally recall
that the entropy of a random variable X that can take an
arbitrary value in [n] is Ω(log n). Thus, by Shannon's source
coding theorem, the transmission of the index of a single
relevant variable with an arbitrarily small possibility of error
takes Ω(log n) queries and Ω(n log n) time.
3.3 Sampling Effects of k-parities
Let f be some k-parity function whose relevant variables
are given by the set J. It can be seen that a noisy MQ
oracle induces an expected sampling effect of zero for all
schema partitions with order less than or equal to k − 1, and
a positive expected sampling effect for precisely one schema
partition JIKn of order k—the one where I = J. While the
variance of the sampling effect of JIKn increases with η, the
expected effect remains unchanged. The CES hypothesis
predicts that for low values of k and η, a UGA can be used
to efficiently identify J.
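A closely related check can be done by brute force on small instances: using the oracle's expected value, η + (1 − 2η)f, as the fitness function (an affine transform of f, so it has the same zero/non-zero effect structure as the noiseless parity), the underlying effect of Definition 2 is non-zero only for the partition whose index set equals the junta set. The Python sketch below (function names illustrative) verifies this for a 3-parity over {0, 1}^6.

    from itertools import combinations, product

    def underlying_effect(index_set, phi, n):
        # Underlying effect (Definition 2) of the partition induced by index_set,
        # computed exactly by brute force over {0,1}^n.
        points = list(product((0, 1), repeat=n))
        fits = {x: phi(x) for x in points}
        global_mean = sum(fits.values()) / len(points)
        total = 0.0
        for bits in product((0, 1), repeat=len(index_set)):
            schema = [x for x in points
                      if all(x[i - 1] == b for i, b in zip(sorted(index_set), bits))]
            mean = sum(fits[x] for x in schema) / len(schema)
            total += (mean - global_mean) ** 2
        return total / 2 ** len(index_set)

    n, eta, juntas = 6, 0.2, (2, 4, 5)
    phi = lambda x: eta + (1 - 2 * eta) * (sum(x[i - 1] for i in juntas) % 2)
    for I in combinations(range(1, n + 1), 3):
        e = underlying_effect(I, phi, n)
        if e > 1e-12:
            print(I, e)   # only the partition whose index set equals the juntas survives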
4. OUR RESULT AND APPROACH
We give an evolutionary computation based algorithm that probably correctly learns 7-parities in O(log(n/δ)) queries and O(n log(n/δ)) time given access to an MQ oracle with a noise rate of 1/5. Our argument is comprised of two parts. In the first, we define a form of learning for the learning parities problem called Piecewise Probably Correct (PPC) learning and show that an algorithm that PPC-learns k-parities in O(n) time and O(1) queries can be used in the construction of an algorithm that PC-learns k-parities in O(n log(n/δ)) time and O(log(n/δ)) queries. In the second part we rely on a symmetry argument and a hypothesis testing based rejection of two null hypotheses at the 10^-100 level of significance to conclude that for η = 1/5, a UGA can PPC-learn 7-parities in O(n) time and O(1) queries.
5. PPC TO PC LEARNING
An algorithm A with access to some oracle φ is said to piecewise probably correctly (PPC) learn k-parities if there exists some δ ∈ (0, 1/2) such that for any k-parity f, whose juntas are given by J ⊆ [n], Aφ(n) outputs a set S such that for any x ∈ [n], Pr(¬(x ∈ S ⇔ x ∈ J)) ≤ δ. That is, the probability that A misclassifies x is less than or equal to δ. Given that it takes O(n) time to formulate a query, PPC-learning in O(n) time and O(1) queries is clearly optimal. The following theorem shows a close relationship between PPC-learning k-parities and PC-learning k-parities.
Theorem 4. If k-parities is PPC-learnable in O(n) time and O(1) queries, then k-parities is PC-learnable in O(n log(n/δ)) time and O(log(n/δ)) queries, where δ is the probability of error.

Figure 2: A hypothetical population of m chromosomes, each n bits long. The 3rd, 5th, and 9th loci of the population are shown in grey.
Given the information theoretic lower bound on learning relevant variables of a parity function with respect to n, for any positive integer k, Theorem 4 states that the PPC-learnability of k-parities in O(n) time and O(1) queries entails that k-parities are PC-learnable with optimal efficiency with respect to n. The proof of the theorem relies on two well known bounds. The first is the additive upper Chernoff bound [14]:

Theorem 5 (Additive Upper Chernoff Bound). Let X_1, . . . , X_r be r independent Bernoulli random variables, let μ̂ = (1/r)(X_1 + . . . + X_r) be an estimator for the mean of these variables, and let μ = E[μ̂] be the expected mean. Then, for any ε > 0, the following inequality holds:

    Pr(μ̂ > μ + ε) ≤ e^(−2rε²)
The second is the union bound [14], which is as follows:
Theorem 6 (Union Bound). For any probability
space, and any two events A and B over that space,
Pr[A ∪ B] ≤ Pr[A] + Pr[B]
Crucially, the events A and B need not be independent.
Proof of Theorem 4. Let Aφ be an algorithm that PPC-learns k-parities in O(n) time and O(1) queries with a per attribute error probability δ′. For any k-parity function f over n variables whose juntas are given by J, let S_1, . . . , S_r be sets output by Aφ on r independent runs, and let S be a set defined as follows:

    x ∈ S ⇐⇒ (1/r) Σ_{i=1}^{r} [x ∈ S_i]{1} > 1/2

That is, x ∈ S iff x appears in more than half the sets S_1, . . . , S_r. We claim that Pr(S ≠ J) ≤ δ if

    r > (2/(1 − 2δ′)²) log(n/δ)

Considering that it takes O(nr) time to compute S given S_1, . . . , S_r, Theorem 4 follows straightforwardly from the claim. For a proof of the claim observe that for each x ∈ [n] we have exactly two cases: (i) x ∈ J and (ii) x ∉ J.

Case i) [x ∈ J]: Let μ̂_x be a random variable defined as follows:

    μ̂_x = (1/r) Σ_{i=1}^{r} [x ∉ S_i]{1}

Theorem 5 entails that for any ε > 0,

    Pr(μ̂_x > E[μ̂_x] + ε) ≤ e^(−2rε²)

Note that E[μ̂_x] = δ′. So by the premise of Theorem 4, E[μ̂_x] < 1/2. Setting ε = 1/2 − δ′ in the expression above yields

    Pr(μ̂_x > 1/2) ≤ e^(−r(1−2δ′)²/2)

Thus, Pr(x ∉ S) ≤ e^(−r(1−2δ′)²/2).

Case ii) [x ∉ J]: An argument similar to the one above, with μ̂_x defined as follows, yields Pr(x ∈ S) ≤ e^(−r(1−2δ′)²/2).

    μ̂_x = (1/r) Σ_{i=1}^{r} [x ∈ S_i]{1}

By combining the two cases we get that for all x ∈ [n], Pr(¬(x ∈ S ⇔ x ∈ J)) ≤ e^(−r(1−2δ′)²/2). The application of the union bound yields

    Pr(¬(1 ∈ S ⇔ 1 ∈ J) ∨ . . . ∨ ¬(n ∈ S ⇔ n ∈ J)) ≤ ne^(−r(1−2δ′)²/2)

In other words, Pr(S ≠ J) ≤ ne^(−r(1−2δ′)²/2). Finally, setting ne^(−r(1−2δ′)²/2) < δ and taking logarithms yields the claim.
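The amplification in the proof is easy to implement. Here is a minimal Python sketch (the function and parameter names are illustrative; the PPC learner is passed in as a black box, and r is chosen per the claim above):

    from math import ceil, log

    def pc_learn(ppc_learn, n, delta, delta_prime):
        # Majority-vote amplification from the proof of Theorem 4. ppc_learn(n)
        # returns a set of attribute indices, each misclassified with probability
        # at most delta_prime < 1/2.
        r = ceil(2.0 / (1.0 - 2.0 * delta_prime) ** 2 * log(n / delta)) + 1
        counts = [0] * (n + 1)                 # 1-based counts of appearances
        for _ in range(r):
            for i in ppc_learn(n):
                counts[i] += 1
        # Keep the attributes that appear in more than half of the r runs.
        return {i for i in range(1, n + 1) if counts[i] > r / 2}

Since each of the r = O(log(n/δ)) runs takes O(n) time and O(1) queries, and the vote takes O(nr) time, the totals match the bounds in Theorem 4.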
6. SYMMETRY ANALYSIS
For any positive integer m, let Dm denote the set {0, 1/m, 2/m, . . . , (m − 1)/m, 1}. Let G be a UGA with a population of size m and binary chromosomes of length n.
of size m and binary chromosomes of length n. A hypothetical population is shown in Figure 2. The 1-frequency
of some locus i ∈ [n] at some time step t is a value in Dm
that gives the frequency of the bit 1 at locus i at time step
t (in other words the number of ones in the population of
G at locus i in generation t divided by m, the size of the
population).
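For reference, the 1-frequencies of all loci can be read off a population array in one line; a trivial numpy sketch (names illustrative):

    import numpy as np

    def one_frequencies(pop):
        # 1-frequency of each locus: the fraction of ones in each column of an
        # m-by-n population array of bits; each value lies in D_m.
        return pop.mean(axis=0)

    # A hypothetical random population with m = 1500 and n = 8.
    pop = (np.random.rand(1500, 8) < 0.5).astype(int)
    print(one_frequencies(pop))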
Let f be a k-junta over {0, 1}^n whose juntas are given by J and let h be the hidden function of f such that h is symmetric, i.e. for any permutation π : [k] → [k] and any element x ∈ {0, 1}^k, h(x_1, . . . , x_k) = h(x_{π(1)}, . . . , x_{π(k)}). Consider a noisy MQ oracle φf that internally uses f. Let G be a UGA that uses φf as a fitness function, and let 1^t_i be a random variable that gives the 1-frequency of locus i of G at time step t. Then for any time step t, any loci i, j ∈ J, and any loci i′, j′ ∈ [n]\J, an appreciation of algorithmic symmetry (the absence of positional bias in uniform crossover and the fact that h is symmetric) yields the following conclusions:

Conclusion 7. ∀ x ∈ Dm, Pr(1^t_i = x) = Pr(1^t_j = x)

Conclusion 8. ∀ x ∈ Dm, Pr(1^t_{i′} = x) = Pr(1^t_{j′} = x)

Which is to say that for all i, j ∈ J, 1^t_i and 1^t_j are drawn from the same distribution, which we denote p_t, and for all i′, j′ ∈ [n]\J, 1^t_{i′} and 1^t_{j′} are drawn from the same distribution, which we denote q_t. (It is not to say that 1^t_i and 1^t_j are independent, or that 1^t_{i′} and 1^t_{j′} are independent.) Appreciating that the location of the juntas of f is immaterial to the 1-frequency dynamics of the relevant and irrelevant loci yields the following conclusion:

Conclusion 9. For all t, p_t and q_t are invariant to J

Finally, if it is known that the per bit probability of mutation is not dependent on the length of the chromosomes, then appreciating that the non-relevant loci are just "along for the ride" and can be spliced out without affecting the 1-frequency dynamics at other loci gives us the following conclusion:

Conclusion 10. For all t, p_t and q_t are invariant to n
6.1 Note on our Use of Symmetry
This section is a lightly modified version of Section 3 in
an earlier paper [4]. We include it here because our case for
the use of symmetry arguments remains the same.
A simple genetic algorithm with a finite but non-unitary
population of size m (the kind of GA used in this paper) can
be modeled by a Markov Chain over a state space consisting
of all possible populations of size m [21]. Such models tend
to be unwieldy [12] and difficult to analyze for all but the
most trivial fitness functions. Fortunately, it is possible to
avoid modeling and analysis of this kind, and still obtain
precise results for non-trivial fitness functions by exploiting some simple symmetries introduced through the use of
uniform crossover and length independent mutation.
A homologous crossover operation between two chromosomes of length n can be modeled by a vector of n random binary variables ⟨X_1, . . . , X_n⟩ representing a crossover mask. Likewise, a mutation operation can be modeled by a vector of n random binary variables ⟨Y_1, . . . , Y_n⟩ representing a mutation mask. Only in the case of uniform crossover are the random variables X_1, . . . , X_n independent and identically distributed. This absence of positional bias [7] in uniform crossover constitutes a symmetry. Essentially, permuting the bits of all chromosomes using some permutation π before crossover, and permuting the bits back using π⁻¹ after crossover has no effect on the overall dynamics
of a UGA. If, in addition, the random variables Y1 , . . . , Yn
that model the mutation operator are identically distributed
(which is typical), conditionally independent given the per
bit mutation rate, and independent of the value of n, then
in the event that the values of chromosomes at some locus i
are immaterial to the fitness evaluation, the locus i can be
“spliced out” without affecting allele dynamics at other loci.
In other words, the dynamics of the UGA can be exactly
coarse-grained [2].
These conclusions flow readily from an appreciation of the
symmetries induced by uniform crossover and length independent mutation. While the use of symmetry arguments
is uncommon in Theoretical Computer Science, symmetry
arguments form a crucial part of the foundations of Physics
and Chemistry. Indeed, according to the theoretical physicist E. T. Jaynes “almost the only known exact results in
atomic and nuclear structure are those which we can deduce
by symmetry arguments, using the methods of group theory”
[13, p331-332]. Note that the conclusions above hold true regardless of the selection scheme (fitness proportionate, tournament, truncation, etc), and any fitness scaling that may
occur (sigma scaling, linear scaling etc). “The great power
of symmetry arguments lies just in the fact that they are
not deterred by any amount of complication in the details”,
writes Jaynes [13, p331]. An appeal to symmetry, in other
words, allows one to cut through complications that might
hobble attempts to reason within a formal axiomatic system. Of course, symmetry arguments are not without peril.
However, when used sparingly and only in circumstances
where the symmetries are readily apparent, they can yield
significant insight at low cost.
7. STATISTICAL HYPOTHESIS TESTING
For any positive integer n, let f be a 7-parity function
over {0, 1}n , and let φf be a noisy MQ oracle such that for
any x ∈ {0, 1}n Pr(φ(x) = ¬f (x)) = 1/5 and Pr(φ(x) =
f(x)) = 4/5. Let G(n) be the simple genetic algorithm given in Algorithm 1 with chromosomes of length n, population size m = 1500, uniform recombination (asex = false), and per bit mutation probability pmut = 0.004. Let Aφ(n) be
an algorithm that runs G(n) for 800 generations using φf
as the fitness function and returns a set S ⊆ [n] such that
i ∈ S if and only if the 1-frequency of locus i at generation
800 exceeds 1/2.
Claim 11. Aφf PPC-solves the learning 7-parities problem in O(n) time and O(1) queries.
Argument. Let D′_m be the set {x ∈ Dm | 0.05 < x < 0.95}. Note that the hidden function of f is invariant to a reordering of its inputs and the per bit probability of mutation in G is constant with respect to n. Thus, Conclusions 7, 8, 9, and 10 hold. Consider the following two null hypotheses:

    H0^p : Σ_{x ∈ D′_1500} p_800(x) ≥ 1/8

    H0^q : Σ_{x ∉ D′_1500} q_800(x) ≥ 1/8

We seek to reject H0^p ∨ H0^q at the 10^-100 level of significance. Assume H0^p is true; then for any independent random variables X_1, . . . , X_3000 drawn from the distribution p_800, and any i ∈ [3000],

    Pr(X_i ∈ D′_1500) ≥ 1/8

which entails that

    Pr(X_i ∉ D′_1500) ≤ 7/8

The independence of the random variables entails that

    Pr(X_1 ∉ D′_1500 ∧ . . . ∧ X_3000 ∉ D′_1500) ≤ (7/8)^3000

Figure 3: The 1-frequency dynamics over 3000 runs of the first (left figure) and last (right figure) loci of Algorithm 1 with m = 1500, n = 8, τ = 800, pmut = 0.004, and asex = false using the membership query oracle φ_f∗, described in the text, as a fitness function. The dashed lines mark the frequencies 0.05 and 0.95.

Let f∗ be the 7-parity function over {0, 1}^8 whose juntas are given by the set {1, . . . , 7}. Figures 3a and 3b show the 1-frequency of the first and last loci, respectively, of G(8) given the fitness function φ_f∗ in 3000 independent runs, each 800 generations long.³ Thus, the chance that the 1-frequency of the first locus of G(8) is in D_1500 \ D′_1500 in generation 800 of all 3000 runs, as seen in Figure 3a, is at most (7/8)^3000. As (7/8)^3000 < 10^-173, we can reject hypothesis H0^p at the 10^-173 level of significance.

Likewise, if H0^q is true, then for any independent random variables X_1, . . . , X_3000 drawn from the distribution q_800, and any i ∈ [3000],

    Pr(X_i ∉ D′_1500) ≥ 1/8

which entails that

    Pr(X_i ∈ D′_1500) ≤ 7/8

and

    Pr(X_1 ∈ D′_1500 ∧ . . . ∧ X_3000 ∈ D′_1500) ≤ (7/8)^3000

Thus, the chance that the 1-frequency of the last locus of G(8) could be in D′_1500 in generation 800 of all 3000 runs, as seen in Figure 3b, is at most (7/8)^3000. We thus reject hypothesis H0^q at the 10^-173 level of significance.

Each p-value is less than a Bonferroni adjusted critical value of 10^-100/2, so we reject the global null hypothesis H0^p ∨ H0^q at the 10^-100 level of significance. We are left with the following conclusions:

    Σ_{x ∈ D′_1500} p_800(x) < 1/8

    Σ_{x ∉ D′_1500} q_800(x) < 1/8

The observation that running G(n) for 800 generations takes O(n) time and O(1) queries completes the argument.

³The experiment can be rerun and the results examined by visiting https://github.com/burjorjee/evolve-parities.git and following instructions.
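The two numbers used above are easy to check with a trivial sketch:

    from math import log10

    # Each per-hypothesis p-value is bounded by (7/8)^3000: all 3000 independent
    # runs land in an event of probability at most 7/8 under the null hypothesis.
    log10_p = 3000 * log10(7 / 8)
    print(log10_p)                       # about -174, i.e. (7/8)^3000 < 10^-173

    # Bonferroni-adjusted critical value for two tests at a global level of 10^-100.
    print(log10_p < log10(1e-100 / 2))   # True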
8. REMARKS

Now that our prediction has been validated, some remarks and clarifications are in order.

8.1 Other values of k and η?

The result obtained above pertains only to k = 7 and η = 1/5. We expect that the proof technique used here can be used to derive identical bounds with respect to n and δ for other values of k and η as long as these values remain small. This conjecture can only be verified on a case by case basis with the proof technique provided. The symmetry argument used in the proof precludes the derivation of bounds with respect to k and η as is typically done in the computational learning literature. Our goal, however, is not to derive such bounds for all k and η, but to verify a prediction that follows from the CES hypothesis, and in doing so to give the first proof of efficient computational learning in an evolutionary algorithm on a non-trivial learning problem. That the computational learning is optimally efficient is a welcome finding.

Algorithm 1: Pseudocode for a simple genetic algorithm with uniform crossover. The population is stored in an m by n array of bits, with each row representing a single chromosome. shuffle(·) randomly shuffles the contents of an array in-place, rand() returns a number drawn uniformly at random from the interval [0,1], ones(a, b) returns an a by b array of ones, and rand(a, b) < c resolves to an a by b array of bits each of which is 1 with probability c.

    Input: m: population size
    Input: n: length of bitstrings
    Input: τ: number of generations
    Input: pmut: per bit mutation probability
    Input: asex: (flag) perform asexual evolution

    pop ← rand(m, n) < 0.5
    for t ← 1 to τ do
        fitnessVals ← evaluate-fitness(pop)
        totalFitness ← 0
        for i ← 1 to m do
            totalFitness ← totalFitness + fitnessVals[i]
        end
        cumFitnessVals[1] ← fitnessVals[1]
        for i ← 2 to m do
            cumFitnessVals[i] ← cumFitnessVals[i − 1] + fitnessVals[i]
        end
        for i ← 1 to 2m do
            k ← rand() ∗ totalFitness
            ctr ← 1
            while k > cumFitnessVals[ctr] do
                ctr ← ctr + 1
            end
            parentIndices[i] ← ctr
        end
        shuffle(parentIndices)
        if asex then
            crossOverMasks ← ones(m, n)
        else
            crossOverMasks ← rand(m, n) < 0.5
        end
        for i ← 1 to m do
            for j ← 1 to n do
                if crossOverMasks[i, j] = 1 then
                    newPop[i, j] ← pop[parentIndices[i], j]
                else
                    newPop[i, j] ← pop[parentIndices[i + m], j]
                end
            end
        end
        mutationMasks ← rand(m, n) < pmut
        for i ← 1 to m do
            for j ← 1 to n do
                newPop[i, j] ← xor(newPop[i, j], mutationMasks[i, j])
            end
        end
        pop ← newPop
    end

8.2 Use and Misuse of CES
We have previously explained how CES powers efficient
general-purpose, non-local, noise-tolerant optimization in
genetic algorithms with uniform crossover [4]. For the sake
of completeness, we provide a sketch below. Consider the
following heuristic: Use CES to find a coarse schema partition JIKn with a significant effect. Now limit future search
to a schema S ∈ JIKn with an above average sampling fitness. Limiting future search in this way amounts
to permanently setting, i.e. fixing, the values at each locus i ∈ I in the population to the binary value given by
the ith symbol of the template of S and performing search
over the remaining loci. Fixing a schema thus yields a new,
lower-dimensional search space. We now make use of the
staggered conditional effects assumption [4]. We assume, in
other words, that there exists a coarse schema partition in
the new search space that may have had a statistically insignificant effect in the old search space but has a significant
effect in the new space. Once again, use CES to find such a
schema partition and limit future search to a schema in the
partition with an above average sampling fitness. Recurse
in this manner until no statistically significant conditional
effects can be found.
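A high-level Python sketch of this heuristic follows; the two statistical subroutines are hypothetical placeholders, not part of the paper.

    def ces_decimation_search(phi, n, find_significant_partition, best_schema_in):
        # Sketch of the decimation heuristic described above. The helpers are
        # hypothetical: find_significant_partition returns a small index set with
        # a statistically significant sampling effect over the current restricted
        # search space (or None); best_schema_in returns the schema in that
        # partition with the highest sampling fitness, as a dict index -> bit.
        fixed = {}                                  # loci fixed so far
        free = set(range(1, n + 1))
        while True:
            I = find_significant_partition(phi, fixed, free)
            if I is None:                           # no significant conditional effect left
                return fixed
            schema = best_schema_in(phi, fixed, I)  # e.g. {3: 1, 7: 0, ...}
            fixed.update(schema)                    # restrict future search to this schema
            free -= set(schema)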
Such a heuristic is non-local because it does not make use
of neighborhood information. It is noise-tolerant because
it is sensitive only to the average fitness values of coarse
schemata. We consider it to be general-purpose firstly, because it relies on a very weak assumption about the distribution of fitness over a search space—the existence of staggered
conditional effects; and secondly, because it is an example
of a decimation heuristic, and as such is in good company.
Decimation heuristics such as Survey Propagation [18, 15]
and Belief Propagation [17], when used in concert with local
search heuristics (e.g. WalkSat [27]), are state of the art
methods for solving large instances of several NP-Hard
combinatorial optimization problems close to their solvability/unsolvability thresholds.
We have hypothesized that the heuristic sketched above
is the abstract heuristic that UGAs implement—or, as is
the case more often than not, misimplement. It stands to
reason that an unspecified computational efficiency might
not be harnessed properly while it remains unspecified. The
common practice of making the per bit mutation rate depend on the length of chromosomes so as to make the expected number of mutations per chromosome independent
of chromosomal length is a mistake; a non-bioplausible one
at that.
8.3 CES is not Implicit Parallelism
Given its description in terms of concepts from schema
theory, CES bears a resemblance to implicit parallelism, the
computational efficiency hypothesized to be powering adaptation in genetic algorithms according to the beleaguered
building block hypothesis [10, 24]. While the implicit parallelism hypothesis has informed the formulation of the CES
hypothesis, the two hypotheses are emphatically not the
same.
The objects supposedly evaluated during implicit parallel search are low order schemata satisfying an adjacency constraint: The maximum number of ∗ symbols between any
two non-∗ symbols in a schema’s template must be small.
The objective of the search is to find schemata whose expected sampling fitness is greater than the average sampling
fitness of the population. If a schema fits the bill, then according to the implicit parallelism hypothesis, its frequency
will increase multiplicatively in each generation.
On the other hand, the objects evaluated in concurrent
effect search, are coarse schema partitions. The objective of
the search is to find schema partitions with statistically significant sampling effects. If a schema partition fits the bill
across multiple generations, then according to the CES hypothesis, a single schema in the partition with above average
expected sampling fitness will go to fixation.
Looking at the number of objects evaluated as n gets large,
a simple argument shows that the adjacency constraint required by implicit parallelism gives concurrent effect search
the upper hand. For chromosomes of length n, the number of schema partitions of order k is C(n, k) ∈ Ω(n^k) [6]. Thus,
the number of schema partitions evaluated by concurrent effect search scales polynomially with n. On the other hand,
the adjacency constraint, restricts the number of schemata
supposedly evaluated by implicit parallelism to O(n).
9. THE ROLE OF SEX
It is evident that higher organisms reproduce sexually;
however, the nature of the advantage conferred by sex is
still under debate. Sex seems contradictory given what we
know about the prevalence of epistasis in biological genomes.
Interactions between genes (epistasis) are the norm, not the
exception [25]. What sense does it make, then, to break up
genes that play well together?
We build up to an answer by beginning with a simpler
question: What role does recombination play in procuring
the result in Figure 3? One way to approach the question
is to repeat the experiment with recombination turned off.
Figure 4: The 1-frequency dynamics over 75 runs of the first (left figure) and last (right figure) loci of Algorithm 1 with m = 1500, n = 8, τ = 10000, pmut = 0.004, and asex = true using the membership query oracle φ_f∗, described in the text, as a fitness function. The dashed lines mark the frequencies 0.05 and 0.95.

Figure 4 shows the 1-frequencies of the first and last loci of Algorithm 1 over 75 runs using φ_f∗ as a fitness function, with m = 1500 and n = 8 as before. The values of τ and asex were changed to 10000 and true respectively. In other words, we greatly reduced the number of runs, greatly increased the length of each run, and, most importantly, disabled recombination. As one can see, the first locus does not go to fixation even once during the 75 runs, despite the increase in the run length to 10000 generations.

Perhaps some other settings of m or pmut might do a better job of sending the first locus to fixation while keeping the last locus unfixed. To understand why this isn't likely, consider the 2-parity function f′ with juntas {1, 2} over {0, 1}^3 and consider infinite population evolution over {0, 1}^3 using φ_f′ as the fitness function. The frequencies of 00∗, 01∗, 10∗, and 11∗ in any population then form a point in the simplex over these values. In the absence of recombination, ⟨0, 1/2, 1/2, 0⟩ is the only fixed point of the system and it is stable. In the presence of uniform recombination, however, ⟨0, 1/2, 1/2, 0⟩ is an unstable fixed point. The instability arises from the fact that 01∗ and 10∗ are unmixable, i.e. tend to yield chromosomes of lower value when crossed. The only two stable fixed points of the system are ⟨0, 1, 0, 0⟩ and ⟨0, 0, 1, 0⟩.

9.1 Sex, Effects, and Unmixability
It can be seen that a schema partition with unmixable
schemata must have a non-zero effect. It is this property
that gives recombination its power. Essentially, recombination, selection, and the unmixability of the schemata in
a schema partition ensure that one of the schemata in the
partition will go to fixation. A second and equally important benefit of sex is that it curbs hitchhiking [26, 9, 19], so only the loci participating in an effect go to fixation.
The explanation above can be short and sweet because the bulk of the explanatory heavy lifting has already occurred in the CES hypothesis. Within the rubric provided by
this hypothesis, the computational purpose of sex becomes
transparent. It is difficult, in fact, to think of another operation that can play the part played by sex (driving the fixation
of genes participating in an effect while curbing hitchhiking)
as effectively.
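The fixed-point discussion above can also be explored numerically. The Python sketch below iterates the infinite-population dynamics for the 2-parity example under some simplifying modeling assumptions that are ours, not the paper's: selection proportional to the oracle's expected value (with η = 0.2), uniform crossover modeled as free recombination between the two relevant loci, and a small per bit mutation rate.

    def step(p, recombine, eta=0.2, u=0.004):
        # Expected oracle value of each genotype for the 2-parity over loci {1, 2}.
        fit = {'00': eta, '01': 1 - eta, '10': 1 - eta, '11': eta}
        tot = sum(p[g] * fit[g] for g in p)
        sel = {g: p[g] * fit[g] / tot for g in p}          # fitness proportional selection
        if recombine:
            a1 = sel['10'] + sel['11']                     # marginal frequency of 1 at locus 1
            a2 = sel['01'] + sel['11']                     # marginal frequency of 1 at locus 2
            mix = {'00': (1 - a1) * (1 - a2), '01': (1 - a1) * a2,
                   '10': a1 * (1 - a2), '11': a1 * a2}
            sel = {g: 0.5 * sel[g] + 0.5 * mix[g] for g in sel}   # uniform crossover
        nxt = {g: 0.0 for g in sel}
        for g, pg in sel.items():                          # per bit mutation with rate u
            for h in nxt:
                flips = sum(x != y for x, y in zip(g, h))
                nxt[h] += pg * (u ** flips) * ((1 - u) ** (2 - flips))
        return nxt

    for recombine in (True, False):
        p = {'00': 0.0, '01': 0.51, '10': 0.49, '11': 0.0}   # slightly perturbed start
        for _ in range(2000):
            p = step(p, recombine)
        print(recombine, {g: round(v, 3) for g, v in p.items()})

Iterating with and without recombination from a slightly asymmetric start lets one check the stability claims about ⟨0, 1/2, 1/2, 0⟩ made in the previous section.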
10. CONCLUSION
For the first time, optimally efficient non-trivial computation is shown to occur in an evolutionary system. This
demonstration is accompanied by a hypothesis about the
more general computation that evolutionary algorithms implicitly execute efficiently—Concurrent Effect Search. We
explained how Concurrent Effect Search in evolutionary
algorithms can power non-local, noise-tolerant, general-purpose search—the sort of thing evolutionary systems are
known to be good at. Finally, we highlighted the crucial
role played by sex and unmixability in the derivation of our
optimality result, and, more generally, in the implicit implementation of concurrent effect search.
11. REFERENCES
[1] Avrim L Blum and Pat Langley. Selection of relevant
features and examples in machine learning. Artificial
intelligence, 97(1):245–271, 1997.
[2] Keki M. Burjorjee. Sufficient conditions for
coarse-graining evolutionary dynamics. In Foundations
of Genetic Algorithms 9 (FOGA IX), 2007.
[3] Keki M. Burjorjee. Generative Fixation: A Unified
Explanation for the Adaptive Capacity of Simple
Recombinative Genetic Algorithms. PhD thesis,
Brandeis University, 2009.
[4] Keki M. Burjorjee. Explaining optimization in genetic
algorithms with uniform crossover. In Proceedings of
the twelfth workshop on Foundations of genetic
algorithms XII. ACM, 2013.
[5] Keki M. Burjorjee and Jordan B. Pollack. A general
coarse-graining framework for studying simultaneous
inter-population constraints induced by evolutionary
operations. In GECCO 2006: Proceedings of the 8th
annual conference on Genetic and evolutionary
computation. ACM Press, 2006.
[6] T. H. Cormen, C. H. Leiserson, and R. L. Rivest.
Introduction to Algorithms. McGraw-Hill, 1990.
[7] L.J. Eshelman, R.A. Caruana, and J.D. Schaffer.
Biases in the crossover landscape. In Proceedings of the Third International Conference on Genetic Algorithms, pages 10–19, 1989.
[8] Vitaly Feldman. Attribute-efficient and non-adaptive
learning of parities and DNF expressions. Journal of Machine Learning Research, 8:1431–1460, 2007.
[9] Stephanie Forrest and Melanie Mitchell. Relative
building-block fitness and the building-block
hypothesis. In L. Darrell Whitley, editor, Foundations
of Genetic Algorithms 2, pages 109–126, San Mateo,
CA, 1993. Morgan Kaufmann.
[10] David E. Goldberg. Genetic Algorithms in Search,
Optimization & Machine Learning. Addison-Wesley,
Reading, MA, 1989.
[11] Sally A Goldman, Michael J Kearns, and Robert E
Schapire. Exact identification of read-once formulas
using fixed points of amplification functions. SIAM
Journal on Computing, 22(4):705–726, 1993.
[12] John H. Holland. Building blocks, cohort genetic
algorithms, and hyperplane-defined functions.
Evolutionary Computation, 8(4):373–391, 2000.
[13] E.T. Jaynes. Probability Theory: The Logic of Science.
Cambridge University Press, 2007.
[14] Michael J Kearns and Umesh Virkumar Vazirani. An
introduction to computational learning theory. MIT
press, 1994.
[15] Lukas Kroc, Ashish Sabharwal, and Bart Selman.
Survey propagation revisited. In Ronald Parr and
Linda C. van der Gaag, editors, UAI, pages 217–226.
AUAI Press, 2007.
[16] David JC MacKay. Information theory, inference, and
learning algorithms, volume 7. Cambridge University
Press, 2003.
[17] Elitza Maneva, Elchanan Mossel, and Martin J.
Wainwright. A new look at survey propagation and its
generalizations. J. ACM, 54(4), July 2007.
[18] M. Mézard, G. Parisi, and R. Zecchina. Analytic and
algorithmic solution of random satisfiability problems.
Science, 297(5582):812–815, 2002.
[19] Melanie Mitchell. An Introduction to Genetic
Algorithms. The MIT Press, Cambridge, MA, 1996.
[20] Elchanan Mossel, Ryan O’Donnell, and Rocco P
Servedio. Learning juntas. In Proceedings of the
thirty-fifth annual ACM symposium on Theory of
computing, pages 206–212. ACM, 2003.
[21] A.E. Nix and M.D. Vose. Modeling genetic algorithms
with Markov chains. Annals of Mathematics and
Artificial Intelligence, 5(1):79–88, 1992.
[22] Karl Popper. Conjectures and Refutations. Routledge,
2007.
[23] Karl Popper. The Logic Of Scientific Discovery.
Routledge, 2007.
[24] C.R. Reeves and J.E. Rowe. Genetic Algorithms:
Principles and Perspectives: a Guide to GA Theory.
Kluwer Academic Publishers, 2003.
[25] Sean H. Rice. The evolution of developmental
interactions. Oxford University Press, 2000.
[26] J. David Schaffer, Larry J. Eshelman, and Daniel
Offut. Spurious correlations and premature
convergence in genetic algorithms. In Gregory J. E.
Rawlins, editor, Foundations of Genetic Algorithms,
pages 102–112, San Mateo, 1991. Morgan Kaufmann.
[27] B. Selman, H. Kautz, and B. Cohen. Local search
strategies for satisfiability testing. Cliques, coloring,
and satisfiability: Second DIMACS implementation
challenge, 26:521–532, 1993.
[28] Uehara, Tsuchida, and Wegener. Identification of
partial disjunction, parity, and threshold functions.
TCS: Theoretical Computer Science, 230, 2000.
[29] Leslie Valiant. Probably approximately correct:
nature’s algorithms for learning and prospering in a
complex world. Basic Books, 2013.