Introduction to Adaptive Systems

Marco A. Wiering
Contents
1 Introduction
  1.1 Adaptive Systems
  1.2 Intelligent Agents
  1.3 Model for Adaptive Systems
    1.3.1 Reward function
    1.3.2 The internal state
  1.4 Total System Perspective
    1.4.1 An example: a room heater with a thermostat
  1.5 Environments
  1.6 Multi-agent Systems
    1.6.1 Model of a multi-agent system
  1.7 Complex Adaptive Systems
    1.7.1 Predator-Prey systems
    1.7.2 State dynamics
  1.8 Outline of this Syllabus
2 Artificial Life
  2.1 Genetic Algorithms and Artificial Life
    2.1.1 Interaction between evolution and learning
  2.2 Cellular Automata
    2.2.1 Formal description of CA
    2.2.2 Example CA
    2.2.3 Dynamics of the CA
    2.2.4 Processes in CA
    2.2.5 Examples of cyclic processes
    2.2.6 Elimination of basis patterns
    2.2.7 Research in CA
  2.3 Ecological Models
    2.3.1 Strategic Bugs
  2.4 Artificial Market Models
    2.4.1 Are real markets predictable?
    2.4.2 Models of financial theories
  2.5 Artificial Art and Fractals
  2.6 Conclusion
3 Evolutionary Computation
  3.1 Solving Optimisation Problems
    3.1.1 Formal description of an optimisation problem
    3.1.2 Finding a solution
  3.2 Genetic Algorithms
    3.2.1 Steps for making a genetic algorithm
    3.2.2 Constructing a representation
    3.2.3 Initialisation
    3.2.4 Evaluating an individual
    3.2.5 Mutation operators
    3.2.6 Recombination operators
    3.2.7 Selection strategies
    3.2.8 Replacement strategy
    3.2.9 Recombination versus mutation
  3.3 Genetic Programming
    3.3.1 Mutation in GP
    3.3.2 Recombination in GP
    3.3.3 Probabilistic incremental program evolution
  3.4 Memetic Algorithms
  3.5 Discussion
4 Physical and Biological Adaptive Systems
  4.1 From Physics to Biology
  4.2 Non-linear Dynamical Systems and Chaos Theory
    4.2.1 The logistic map
  4.3 Self-organising Biological Systems
    4.3.1 Models of infection diseases
  4.4 Swarm Intelligence
    4.4.1 Sorting behavior of ant colonies
    4.4.2 Ant colony optimisation
    4.4.3 Foraging ants
    4.4.4 Properties of ant algorithms
  4.5 Discussion
5 Co-Evolution
  5.1 From Natural Selection to Co-evolution
  5.2 Replicator Dynamics
  5.3 Daisyworld and Gaia
    5.3.1 Cellular automaton model for Daisyworld
    5.3.2 Gaia hypothesis
  5.4 Recycling Networks
  5.5 Co-evolution for Optimisation
  5.6 Conclusion
6 Unsupervised Learning and Self Organising Networks
  6.1 Unsupervised Learning
    6.1.1 K-means clustering
  6.2 Competitive Learning
    6.2.1 Normalised competitive learning
    6.2.2 Unnormalised competitive learning
    6.2.3 Vector quantisation
  6.3 Learning Vector Quantisation (LVQ)
  6.4 Kohonen Networks
    6.4.1 Kohonen network learning algorithm
    6.4.2 Supervised learning in Kohonen networks
  6.5 Discussion
Chapter 1
Introduction
Everywhere around us we can observe change; in fact, without change life would be extremely
boring. Life implies change, since if there were no change anymore in the universe,
everything would be dead. Physicists think it is likely that after very many years (think about
10^500 years), the universe will stop changing and enter a state of thermal equilibrium in
which it is extremely cold (near the absolute minimum temperature) and in which all particles
(electrons, neutrinos and protons) are isolated and stable (in this stable state even black holes
will have evaporated). This means that the particles will not interact anymore and change
will stop. This view relies on the theory that the universe is expanding and that this expansion
is accelerating, which is implied by a positive cosmological constant (the energy density of the
vacuum). The theory that the universe would contract again after some time (which may
imply a harmonic universe) is not taken very seriously anymore nowadays. So, after a long time,
the universe will reach a stable state without change. Fortunately, since this takes so long, we
should not worry at the moment. Furthermore, there are some thoughts that intelligent life
may change all of this.
A realistic model of any changing system (e.g. the weather or the stock market) consists
of a description of the state at the current time-step and some function or model which
determines the next state given the current state, either in a deterministic way (only one
successor state is possible) or in a stochastic way (there are multiple possible successor
states, each of which may occur with some probability). We will call the state of the system
at time-step t: S(t). If we examine the state of the system over time, there is a sequence of
states: S(t), S(t + 1), . . . , S(t + n), . . . Such a sequence of states is often referred to as the
state-trajectory of the system. Note that we often consider time to be discrete, that is, all
time-steps are natural numbers: t ∈ {0, 1, 2, . . .}. We only use discrete time for reasons of
computation and simplicity, since representing continuous numbers exactly on a computer
using bit-representations is not feasible (although very precise approximations are of
course possible). Mathematically, we could also consider time as being continuous, although
the mathematics would involve some different notation.
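The difference between a deterministic and a stochastic transition can be made concrete with a small sketch. The toy system below (four states numbered 0 to 3, and the two step functions) is an assumption made purely for illustration and does not come from the text; only the notion of a state-trajectory S(0), S(1), . . . , S(n) does.

```python
import random

def deterministic_step(state):
    """Deterministic transition: exactly one successor state is possible."""
    return (state + 1) % 4

def stochastic_step(state, rng):
    """Stochastic transition: stay or move on, each with probability 0.5."""
    return state if rng.random() < 0.5 else (state + 1) % 4

def trajectory(step, start, n):
    """Return the state-trajectory S(0), S(1), ..., S(n)."""
    states = [start]
    for _ in range(n):
        states.append(step(states[-1]))
    return states

rng = random.Random(0)  # fixed seed so that the run can be repeated
det = trajectory(deterministic_step, 0, 5)
sto = trajectory(lambda s: stochastic_step(s, rng), 0, 5)
print(det)  # [0, 1, 2, 3, 0, 1]
print(sto)  # depends on the random draws
```

The deterministic trajectory is the same on every run, whereas the stochastic one is only fixed here because the random generator is seeded.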
1.1 Adaptive Systems
Although the “objective” state of the universe would consist of a single representation of all
elements, and therefore a single state, in reality we can observe different objects which can
be modelled as separate elements. Therefore instead of a single state at time t: S(t), we
may consider the world to consist of l objects and write S_i(t), where 1 ≤ i ≤ l, to denote
the state of object i at time t. In this way there are trajectories for all different objects. If
all these objects evolved completely separately, the universe would basically consist of
many sub-universes, and we could look at the trajectory of every single object alone. However,
in most real systems, the objects will interact. Interaction means that the state of some
object influences the trajectory of another object; think, for example, of Newton's laws, in which
gravity causes attraction from one object to another one.
At this point we are ready to understand what an adaptive system is.
An adaptive system is a system in which there is interaction between the system
and its environment so that both make transitions to changing states.
Of course it may happen that after a long period of time, the adaptive system enters a
stable state and does not change anymore. In that case we still speak of an adaptive system,
but if the adaptive system never made transitions to different states, it would not be an
adaptive system. So the first requirement is that an adaptive system is dynamic (changing),
at least for a while. Sometimes an adaptive system is part of another system. Think for
example about some robot which walks in a room, but does not displace any objects in that
room. We have to think about this situation as a room which has a robot inside of it. Since
the robot is changing its position, the room is also changing. So in this case the robot is the
adaptive system and the room is the changing environment.
Another requirement for an adaptive system is that the adaptive system will change itself
or its environment using its trajectory of states in order to attain a goal. The goal may be to
simulate some process in order to understand what will happen under some conditions (e.g. we
can simulate what happens if we put ten sharks in a pool and do not feed them), or it may be
to optimize something (e.g. a robot which keeps the floors clean).
Finally there can be learning adaptive systems that have the ability to measure their
own performance and are able to change their own internal knowledge parameters in order
to improve their performance. In this case we say that the adaptive system is optimizing its
behavior for solving a particular task. If we call the state of the internal knowledge parameters
S_I(t), then learning means changing the state of the internal knowledge parameters after each
iteration (time-step), so learning will cause a trajectory S_I(t), S_I(t + 1), . . . , S_I(T), where the
final state S_I(T) may be a stable state with (near-)optimal performance on the task. If
you are not acquainted with machine learning, a learning computer system may seem strange.
However, machine learning receives a lot of interest in the artificial intelligence community
nowadays, and learning computer programs certainly exist. A very simple example of learning
is a computer program which can decide between option 1 and option 2. Each time it selects
option 1 the environment (possibly a human teacher) tells the system that it was a success, and
each time the program selects option 2 it is told that it was a failure. It will not be surprising
that with a simple learning program the system will quickly come to always select option 1 and
so optimize its performance. More advanced learning systems, such as those for speech recognition,
face recognition, or handwritten text recognition, are also widespread.
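The two-option example can be sketched in a few lines. The learning rule below (a running average of the feedback per option, with a small exploration probability epsilon) is one possible choice assumed for illustration; the text does not prescribe a particular learning program. The dictionary `value` plays the role of the internal knowledge parameters.

```python
import random

def run_learner(steps=50, epsilon=0.1, seed=0):
    """Learn which of two options the environment calls a success."""
    rng = random.Random(seed)
    value = {1: 0.0, 2: 0.0}  # internal knowledge parameters
    count = {1: 0, 2: 0}      # how often each option was tried
    choices = []
    for _ in range(steps):
        if rng.random() < epsilon:
            option = rng.choice([1, 2])         # occasionally explore
        else:
            option = max(value, key=value.get)  # otherwise pick the best so far
        feedback = 1.0 if option == 1 else 0.0  # option 1: success, option 2: failure
        count[option] += 1
        # Running average of the feedback received for this option.
        value[option] += (feedback - value[option]) / count[option]
        choices.append(option)
    return value, choices

value, choices = run_learner()
print(value)
print(choices.count(1), "selections of option 1 out of", len(choices))
```

After the first successful selection the estimate for option 1 exceeds that for option 2, so the program keeps choosing option 1 except for the occasional exploratory step.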
Other terms which are closely related to adaptive systems are: cybernetics, self-organising
systems, and complex adaptive systems. The term cybernetics as it is used nowadays stems
from Norbert Wiener and is motivated in his book: Cybernetics: or, Control and Communication in the Animal and the Machine (1948). Before that, Norbert Wiener worked on gunfire
control. Freudenthal wrote about this:
While studying anti-aircraft fire control, Wiener may have conceived the idea of
considering the operator as part of the steering mechanism and of applying to
him such notions as feedback and stability, which had been devised for mechanical
systems and electrical circuits. ... As time passed, such flashes of insight were
more consciously put to use in a sort of biological research ... [Cybernetics] has
contributed to popularising a way of thinking in communication theory terms, such
as feedback, information, control, input, output, stability, homeostasis, prediction,
and filtering. On the other hand, it also has contributed to spreading mistaken
ideas of what mathematics really means.
There are many adaptive systems to be found. Some examples are:
• Robots which navigate through an environment with some particular goal (e.g. showing
visitors of a museum a sequence of different objects or helping people in homes for the elderly
to walk around in the corridors)
• Learning systems which receive data and output knowledge, e.g. classifying the gender
of humans using photos of their faces, or recognising speech from recorded and annotated
speech fragments
• Self-driving cars or unmanned aerial vehicles (UAVs)
• Evolutionary systems in which the distribution of the gene-pool adapts itself to the
environment
• Economic systems in which well-performing companies expand and badly performing
ones go out of business
• Natural systems such as earthquakes or forest fires
1.2 Intelligent Agents
A fairly new concept in artificial intelligence is an Agent. The definition of an agent is a
computer system that is situated in some environment, and that is capable of autonomous
action in this environment in order to meet its design objectives. An agent possesses particular
characteristics such as:
• Autonomy: The agent makes its own choices based on its (virtual) inputs from the environment; even if a user tells the agent to drive off a cliff, the agent can refuse
• Reactivity: Agents are able to perceive their environment, and respond in a timely
fashion to changes that occur in it in order to satisfy their design objectives
• Pro-activeness: Agents are able to exhibit goal-directed behavior by taking the initiative
in order to satisfy their design objectives
• Social Ability: Intelligent agents are capable of interacting with other agents (and
possibly humans)
Examples of agents are robots, mail-clients, and thermostats. The advantages of using
the agent metaphor become clear when we have to control a system (e.g. a robot). First
of all it becomes easier to speak about the sensory inputs which an agent receives from its
environment through its (virtual) sensors. Using the inputs and possibly its current internal
state, the agent selects an action. The action leads to a change in the environment. The agent
usually has goals which it should accomplish. There can be goals of achievement (reaching a
particular goal state) or maintenance goals (keeping a desired state of the system). The goals
can often be easily modelled as a reward function which sends the agent utility values for
reaching particular states. The reward function could also give a reward (or penalty which is
a negative reward) for individual actions. E.g. if the task for a robot-agent is to go to office
R12 as soon as possible, the reward function could emit -1 for every step (a penalty) and a
big reward of +100 if the agent reaches the desired office.
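As a minimal sketch, the reward function of the office example could look as follows. Two things are assumptions made for illustration: the step that enters the goal office yields +100 instead of -1 (the text leaves open whether the step penalty also applies there), and all room names other than R12 are hypothetical.

```python
def reward(room_entered, goal_room="R12"):
    """+100 for entering the goal office, -1 (a penalty) for any other step."""
    return 100 if room_entered == goal_room else -1

def total_reward(path, goal_room="R12"):
    """Sum the rewards over the rooms entered along a path."""
    return sum(reward(room, goal_room) for room in path)

# Hypothetical routes: each entry is the room entered at one step.
short_path = ["R10", "R11", "R12"]
long_path = ["R10", "R14", "R13", "R11", "R12"]

print(total_reward(short_path))  # 98
print(total_reward(long_path))   # 96
```

Because every step costs 1, the shorter route collects more total reward, which is exactly what "as soon as possible" asks for.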
An intelligent agent can perceive its environment, reason, predict, and act (using its
actuators). A rational agent acts to maximize its performance measure so that it will
reach its goal with the least amount of effort. An autonomous agent acts according to its
own experiences. So it does not execute a fixed algorithm which always performs the same
operations (such as a sorting algorithm), but uses its perceptions to direct its behavior. The
agent is modelled in a program which is executed on an architecture (computer, hardware).
The program, architecture, and environment determine the behavior of the agent.
1.3 Model for Adaptive Systems
We now want to make a formal model of an adaptive system which interacts with an environment. The objective state of the world is the state of the world at some time-step. Often
the adaptive system does not perceive this complete state, but receives (partial) inputs from
the environment. Next to current inputs from the environment, the system can have beliefs
about the world from its past interaction with the environment. Furthermore, the agent can
perform a number of actions, and chooses one of them at every time-step. The control method
which uses beliefs and inputs to select an action is often referred to as the policy. There is
also a transition function which changes the state of the world according to the previous
state and the action that the agent executed. Then there is a reward function which provides
rewards to the agent after executing actions in the environment. Finally the system requires
a function to update the internal (belief) state. So when we put these together, we get a
model M = < t, S, I, B, A, π, T, R, U > with:
• A time-element t ∈ {1, 2, 3, . . .}
• A state of the environment at time t: S(t)
• An input of the environment received at time t: I(t)
• An internal state (belief) of the agent at time t: B(t)
• A number of possible actions A with A(t): the action executed by the agent at time t.
• A policy which maps the input and belief to an action of the agent: π(I(t), B(t)) → A(t)
• A transition-rule which maps the state of the environment and the action of the agent
to a new state of the environment: T (S(t), A(t)) → S(t + 1)
• A reward-function which gives rewards to the system. There are two possibilities: if
the reward function is located in the environment, we get
R(S(t), A(t)) → R(t); if the reward function is located in the agent and the agent
cannot know S(t), we have to use R(I(t), B(t), A(t)) → R(t).
• An update function for the internal (belief) state of the agent U (I(t), B(t), A(t)) →
B(t + 1).
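The elements of the model can be tied together in a simple interaction loop. The sketch below is only one way to read the model: the function `observe`, which derives the input I(t) from the state S(t), is not part of the tuple M and is an assumption, as is the toy counting environment used to run the loop. The reward function here is the variant located in the environment.

```python
def run_model(S0, B0, observe, policy, transition, reward, update, steps):
    """Iterate the model and return a list of (t, S, I, A, R) tuples."""
    S, B = S0, B0
    history = []
    for t in range(1, steps + 1):
        I = observe(S)        # input I(t) received from the environment
        A = policy(I, B)      # pi(I(t), B(t)) -> A(t)
        R = reward(S, A)      # R(S(t), A(t)) -> R(t)
        history.append((t, S, I, A, R))
        B = update(I, B, A)   # U(I(t), B(t), A(t)) -> B(t + 1)
        S = transition(S, A)  # T(S(t), A(t)) -> S(t + 1)
    return history

# Hypothetical toy instance: the state is a counter that should reach 3.
history = run_model(
    S0=0,
    B0=None,
    observe=lambda S: S,                                  # fully observable
    policy=lambda I, B: "inc" if I < 3 else "stay",
    transition=lambda S, A: S + 1 if A == "inc" else S,
    reward=lambda S, A: 0 if S == 3 else -1,
    update=lambda I, B, A: A,                             # remember last action
    steps=5,
)
for row in history:
    print(row)
```

Each pass through the loop corresponds to one time-step, and the order of the calls follows the arrows of the definitions above.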
We can note a number of causal relations in the model which are depicted in Figure 1.1.
[Figure: two panels, "Causality in time" and "Causal Graph", relating S, I, B, A and R at time-steps t and t+1.]
Figure 1.1: The relations between the different elements of an adaptive system.
If we study the figure, we can see that there is one big feedback loop, going from Belief
to Action to State to Input to Belief. So the belief at one time-step influences the belief at a
later time-step. Note that not all adaptive systems use an internal state (belief); we will go
into this in more detail later.
1.3.1 Reward function
An agent usually has one or more goals which it wants to achieve or maintain. To formalise the notion of goal, one could use qualitative goals which can be true or false, such as
Goal(go home). Such qualitative goals are usually used in logical agents that try to make a
plan using operators which bring the current state to a goal state (the plan can be computed
forwards from the current state to the goal or alternatively backwards from the goal state to
the current state). Another possibility is to use a more quantitative notion of a goal using
a reward signal which is emitted after each time-step. The advantage of the latter is that it
becomes easier to distinguish between multiple plans which bring about a trajectory which
attains a specific goal. E.g. if an agent uses 100 steps or 20 steps to find the kitchen, then
clearly using 20 steps should be preferred. However, when qualitative goals are used, they
both become true after some time. Even if the planner tries to come up with the shortest
plan, efforts to execute the plan are not easily incorporated. Using a reward function we can
emit after each step a reward of -1 (so a cost of 1) and for reaching the goal, the agent may
be rewarded with a high bonus. In this way shorter paths are preferred. Furthermore, when
different actions require different effort, we can use different costs for different actions (e.g.
when climbing a mountain it costs usually a lot of effort to take steep paths). In decision
theory usually utilities or reward signals are used. The goal for the agent then becomes to
maximize its obtained rewards in its future. So its policy should maximize:
$$\sum_{t=0}^{\infty} \gamma^{t} R(t) \qquad (1.1)$$
Where 0 ≤ γ ≤ 1 is the discount factor which determines how future rewards are traded off
against immediate rewards. E.g. if we find it is important to get a lot of reward during the
current day and are not interested in the examination tomorrow, we will set the discount
factor to a very low number, maybe resulting in drinking a lot of beer in a bar and failing the
examination tomorrow. However, if we are interested in life-long happiness, we should use a
high discount factor (close to 1).
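The effect of the discount factor can be checked with a small calculation. The reward numbers for the two plans below are invented for illustration (a pleasant evening followed by a failed examination, versus a dull evening followed by a passed one), and the sum is cut off after two time-steps.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * R(t) over a finite list of rewards R(0), R(1), ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical rewards: [today, tomorrow]
go_to_bar = [5, -20]  # fun tonight, failed examination tomorrow
study = [-1, 10]      # dull evening, passed examination tomorrow

for gamma in (0.1, 0.9):
    bar = discounted_return(go_to_bar, gamma)
    stu = discounted_return(study, gamma)
    best = "go to the bar" if bar > stu else "study"
    print(f"gamma={gamma}: bar={bar:.1f}, study={stu:.1f} -> {best}")
```

With the low discount factor the immediate reward dominates and the bar is preferred; with the high discount factor tomorrow's outcome counts almost fully and studying is preferred.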
1.3.2 The internal state
Often no internal state (IS) is used, but without internal state we can only construct a
reactive agent. A reactive agent uses a policy which maps the current input to an action.
It does not use any memory of previous inputs or actions. For a game like chess, a reactive
agent is perfect, because it does not really matter how a particular board-position came
about, the best move only depends on the current state of the board which is fully accessible
(completely observable) for the agent. However, suppose you are looking for a restaurant and
someone tells you “go straight until the second traffic light and then turn left.” Then you
have to use memory, because when you see a traffic light you cannot know whether to turn
left or not without knowing (remembering) whether you have seen another traffic light before.
In more complex agents, internal state is very important. Note that we define the internal
state as a recollection of past inputs and performed actions and not the knowledge learned
by the agent about how to perform (this knowledge is in the adaptive policy). If an agent
has to count to ten, it can map the next number using the previous one and does not need
to remember what was before. In such cases there is therefore only a previous state which
is the input for the policy. If the agent has to remember the capital of the United States,
and uses it a long time afterwards, then it uses some kind of internal memory, but in some
cases it would use long-term memory that is stored in the policy by learning the response
to the question “what is the capital of the US?” Therefore we can speak of long-term and
short-term memory, and the long-term memory resides usually in the policy (or knowledge
representation) whereas short-term information which needs to be remembered only for a
while is stored in short-term memory or the internal state. When we speak about belief (e.g.
facts which are believed by the agent with some probability), however, it can also be stored in
long-term memory, and therefore it would be better to make a distinction between short-term
internal state and long-term belief. For acting one would still use knowledge stored in the
policy, although this would usually be procedural knowledge (for learned skills) in contrast
to declarative knowledge (knowledge and beliefs about the world). For now we just use the
distinction between the internal state (to remember facts), which is the short-term changing
belief, and a policy for acting.
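The traffic light example can be sketched to show why a reactive agent fails where an agent with an internal state succeeds. The observation names and the route are assumptions made for illustration; the internal state is simply the number of traffic lights seen so far.

```python
# A reactive policy maps the current input directly to an action.
reactive_policy = {"road": "straight", "traffic light": "turn left"}

def memory_policy(observation, lights_seen):
    """Return (action, new internal state) for one time-step."""
    if observation == "traffic light":
        lights_seen += 1
        if lights_seen == 2:
            return "turn left", lights_seen
    return "straight", lights_seen

# Hypothetical route with two traffic lights.
route = ["road", "traffic light", "road", "traffic light", "road"]

reactive_actions = [reactive_policy[obs] for obs in route]

lights_seen = 0  # internal state (short-term memory)
memory_actions = []
for obs in route:
    action, lights_seen = memory_policy(obs, lights_seen)
    memory_actions.append(action)

print(reactive_actions)  # turns left at the first light already
print(memory_actions)    # turns left only at the second light
```

Whatever action the reactive policy assigns to a traffic light, it must be the same at both lights, so it cannot follow the instruction.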
Humans possess a very complex internal state. If you close your eyes and ears, and stop
focusing on your senses, then you do not receive any inputs from the environment. But
still, thoughts arise. These thoughts come from the internal state; most often the thoughts
are about things which happened not so long ago (like a minute ago, today or yesterday).
Of course you can also act and direct your thoughts. In this way your brain becomes the
environment and there is an interaction between you and your brain. Therefore when you
think about how it would be to walk on the beach, you use your imagination and some policy
for choosing what to do next. In that case, the internal state is only there to remind you of
the start of the walk on the beach and whether you saw the sun shining or not. In many
forms of meditation, one closes one's eyes and concentrates on breathing. In this way,
there is no information at all in the brain; basically one starts to think about nothing at
all. In that case, there is no input and a diminishing internal state until it becomes empty
too, and this may cause a very relaxing experience. Note that meditation is not the same as
sleeping: some people say that sleeping takes place in the inactive consciousness, whereas
meditation takes place in the subconscious, where people are still experiencing things but can
concentrate on some thoughts (such as nothingness) much better. Finally, the opposite of a yogi
is someone who has schizophrenia. In schizophrenia, one believes very much in the current internal
state, and the actions focus on the information present in the internal state. So new inputs which
disprove strange ideas residing in the internal state are hardly taken into account, and it
is very difficult to convince such people that they are living in a reality set up by themselves
without any logic or correspondence to the real world.
1.4 Total System Perspective
An adaptive system (e.g. an agent) interacts with an environment. In principle there may be
multiple agents acting in the environment, and it is important to understand the interaction
between the agents and their environment. Therefore we usually have to look at the total
system which consists of the smaller parts. Looking at the complete system gives different
possible views on what the agents are and what they should do. For example, consider forest
fire control. The entities which play a role are the trees, firemen, bulldozers, airplanes, fire,
smoke columns, the weather etc. If we examine these entities, we can easily see that only
the bulldozers, firemen, and airplanes can be controlled, and therefore we can model them
as agents with their own behavior, goals, etc. Sometimes it is not so easy to abstract from
reality; we do not want to model all details, but we want a realistic interaction between the
agent and the environment.
Example 1. Consider a restaurant. Which entities play a role and which could be modelled
as an agent? If we examine possible scenarios we can exploit our creativity on this topic. For
example the entities may be the kitchen, tables, chairs, cook, waiter, lights, etc. Now we might
consider making them all agents, e.g. lights which dim if some romantic couple is sitting
below them, or tables and chairs which can move by themselves so that a new configuration
of tables can be made automatically when a large group of people enters the restaurant.
Would such a futuristic restaurant not be nice to visit?
1.4.1 An example: a room heater with a thermostat
Consider a thermostat for a room heater which regulates the temperature of a room. The
heater uses the thermostat to measure the temperature of the room. This is the input of
the system. The heater has actions: heat, or do-nothing. The temperature of the room
will decrease (until some lower limit value) if the heater does not heat the room, and the
temperature of the room will increase if the heater is on. Figure 1.2 shows the interaction
between the heater and the temperature of the room.
[Figure: diagram with the elements Room, Heater, Input, Temperature and Action.]
Figure 1.2: The interaction between a heater and the temperature in a room.
Making a model for the heater
The state of the environment which should first be modelled is the temperature of the room
at a specific time. Since this is the environmental state, we denote it as S(t). The input of the
heater is in this case also the temperature of the room (although it might contain noise due
to imprecise measurements); we denote this input as I(t). The internal state of the heater is
denoted as B(t) and records whether the heater is on (heating) or
off (doing nothing). The possible actions of the heater are: heat or do-nothing.
Policy of the heater. Now we have to design the policy of the heater which is the most
important element, since this is our control objective. Of course we can design the policy
in many possible ways, but if there is a reward function, the control policy should be the
one which optimizes the cumulative reward over time. The construction of the policy can be
done by manual design, although it could also be learned. We will not go into details at this
moment how learning this policy should be done, instead we manually design a policy since
it is easy enough to come up with a good solution (so learning is not required). An example
policy of the heater uses the following if-then rules:
1. If I(t) ≤ 21 then heat
2. If I(t) > 21 and I(t) ≤ 23 and B(t) == heat then heat
3. If I(t) > 21 and I(t) ≤ 23 and B(t) == do nothing then do-nothing
4. If I(t) > 23 then do-nothing
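The four rules translate almost directly into code. The sketch below is one possible rendering; the constant names are assumptions, and rules 2 and 3 are merged into a single branch that keeps doing what the heater was already doing.

```python
HEAT = "heat"
DO_NOTHING = "do-nothing"

def heater_policy(temperature, belief):
    """Map input I(t) and belief B(t) to an action according to the rules."""
    if temperature <= 21:  # rule 1
        return HEAT
    if temperature <= 23:  # rules 2 and 3: keep doing what we were doing
        return HEAT if belief == HEAT else DO_NOTHING
    return DO_NOTHING      # rule 4

print(heater_policy(20, DO_NOTHING))  # heat
print(heater_policy(22, HEAT))        # heat
print(heater_policy(22, DO_NOTHING))  # do-nothing
print(heater_policy(24, HEAT))        # do-nothing
```

Between 21 and 23 degrees the action depends on the belief alone, which prevents the heater from switching on and off around a single threshold.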
If we examine the rules, we can see they are exclusive: at each time-step only one rule can
be applied (applying a rule is sometimes called firing the rule). If rules overlapped,
the system would become more complex, since some mechanism would then have to be constructed
which chooses the final decision. Research in fuzzy logic uses membership functions for rules,
e.g. if the temperature is warm then do-nothing. The membership function then determines
to what degree it is warm, e.g. is 24 degrees warm, and what about 27 degrees? This membership function should
be designed (although it may also be learned) and the rules all fire using their activation which
is given by the application of the membership functions to the input. After this all actions
are integrated using the activations as votes. We will not go into detail about fuzzy logic here,
but just mention that it can be used when it is difficult to set absolute thresholds for rules
(such as 23 degrees in the above example).
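To give a rough impression of the idea, the sketch below uses two fuzzy rules, "if warm then do-nothing" and "if cold then heat", whose activations act as votes. The shape of the membership functions (a linear ramp from 21 to 25 degrees) and the rule "if cold then heat" are assumptions chosen for illustration; they are not taken from the text, and real fuzzy controllers usually combine the votes in a more refined way.

```python
def warm(temperature):
    """Degree in [0, 1] to which a temperature counts as warm (assumed ramp)."""
    return min(1.0, max(0.0, (temperature - 21.0) / 4.0))

def cold(temperature):
    """Degree to which a temperature counts as cold (complement of warm)."""
    return 1.0 - warm(temperature)

def fuzzy_policy(temperature):
    """Let every rule vote with its activation and pick the strongest action."""
    votes = {
        "do-nothing": warm(temperature),  # if warm then do-nothing
        "heat": cold(temperature),        # if cold then heat
    }
    return max(votes, key=votes.get), votes

for temperature in (20, 22, 24, 27):
    action, votes = fuzzy_policy(temperature)
    print(temperature, action, votes)
```

Under these assumptions 24 degrees is warm to degree 0.75 and 27 degrees to degree 1.0, so no single hard threshold has to be chosen.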
Another important issue is that the policy used creates a negative feedback
loop. This means that if the temperature goes up, the heater will stop heating,
so that the temperature will go down again. In this way the system remains
stable between the temperature bounds. If we created a policy which heats the
room more when the temperature becomes higher, we would create a positive feedback
loop, leading to a temperature which keeps rising until possibly the heater breaks
down. Negative feedback loops are therefore important for stable
systems, although positive feedback loops can also be useful, e.g. if one wants to reach a desired
speed very quickly, the system can increase the speed with larger and larger jumps until finally
a negative feedback loop takes over.
Another way to construct the policy is to use decision trees. A decision tree makes a
choice by starting at the root node of the tree and following branches with choice labels until
we finally arrive at a leaf node which makes a decision. A decision tree which is equivalent
to the above set of rules is shown in Figure 1.3.
ROOT
  I(t) <= 21: HEAT
  I(t) > 21 and I(t) <= 23:
    B(t) = HEAT: HEAT
    B(t) = DO NOTHING: DO NOTHING
  I(t) > 23: DO NOTHING
Figure 1.3: The policy of the heater designed as a decision tree.
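The equivalence between the four if-then rules and the decision tree can be checked with a small sketch. The function names policy_rules and policy_tree and the string constants are hypothetical and chosen only for this illustration.

```python
HEAT, DO_NOTHING = "heat", "do nothing"


def policy_rules(temp, belief):
    """The four mutually exclusive if-then rules of the heater."""
    if temp <= 21:
        return HEAT
    if 21 < temp <= 23 and belief == HEAT:
        return HEAT
    if 21 < temp <= 23 and belief == DO_NOTHING:
        return DO_NOTHING
    if temp > 23:
        return DO_NOTHING
    raise ValueError("no rule fired")


def policy_tree(temp, belief):
    """The same policy written as a decision tree (nested choices)."""
    if temp <= 21:
        return HEAT
    elif temp <= 23:
        # Inner node: the decision depends on the internal state B(t).
        if belief == HEAT:
            return HEAT
        else:
            return DO_NOTHING
    else:
        return DO_NOTHING


# Compare both formulations on a grid of temperatures and both beliefs.
same = all(
    policy_rules(i / 10, b) == policy_tree(i / 10, b)
    for i in range(150, 300)
    for b in (HEAT, DO_NOTHING)
)
print(same)
```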
The update and transition function. To make the model of the system complete,
we also have to specify the belief update function and the environmental transition function. In
our simple model, these are easily obtained (although the environmental transition function
might depend on many different factors, such as the temperature outside, whether a door or
window is open, etc.). The belief update function is modelled as follows:
• U (∗, ∗, heat) → heat
• U (∗, ∗, do nothing) → do nothing
Here ∗ denotes the don’t-care symbol, which can take on any value for the function (or
rule) to be applied. So the update function for the internal state or belief just remembers
the previous action. We make the following simple transition function of the environment
(in reality this transition function does not have to be known, but we construct it here to
make our model complete). If the heater is on then the temperature will increase. Let’s say
that the temperature increases linearly over time. This is of course not true in reality, since
there is an upper limit to the temperature and more heat is lost through
interaction with the outside when the temperature difference is larger. In reality the heat loss
is approximately a linear function of the temperature difference, but in our model we do not include the
outside temperature, since then insulation would also be important and we would get too many details
to model. We also make a simple transition function for when the heater is off. Using our
simple assumptions we make the following environmental transition function:
• T (S(t), heat) → S(t) + 0.1
• T (S(t), do nothing) → S(t) − 0.05
The reward function is only needed for self-adapting systems. However, we can also use it
as a measurement function on the performance of a policy. Let’s say that we want the room’s
temperature to remain close to 22 degrees, then the reward function may look like:
R(I, ∗, ∗) = −(I − 22)^2
Dynamics of the interaction
When we let the heater interact with the temperature of the room, we will note that there
will be constant change or dynamics of a number of variables. The following variables will
show dynamics:
• The state of the environment S(t)
• The input of the heater (in this case equal to the state of the environment): I(t)
• The action of the heater: A(t)
• The received reward: R(t)
• The internal state of the heater (in this case equal to the previous action of the heater):
B(t)
If we let the temperature of the room start at 15 degrees, we can examine the dynamics
of the room’s temperature (the state of the environment). This is shown in Figure 1.4.
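The interaction can be reproduced with a short simulation. The sketch below follows the policy, belief update, transition function, and reward function given above; the function names and the initial belief (do nothing) are assumptions made for the example.

```python
HEAT, DO_NOTHING = "heat", "do nothing"


def policy(temp, belief):
    """Heater policy: heat below 21, stop above 23, otherwise keep the last action."""
    if temp <= 21:
        return HEAT
    if temp <= 23:
        return belief
    return DO_NOTHING


def update_belief(action):
    """U(*, *, action) -> action: the belief remembers the previous action."""
    return action


def transition(temp, action):
    """T(S, heat) -> S + 0.1 and T(S, do nothing) -> S - 0.05."""
    return temp + 0.1 if action == HEAT else temp - 0.05


def reward(temp):
    """R(I, *, *) = -(I - 22)^2."""
    return -(temp - 22.0) ** 2


def simulate(start_temp=15.0, steps=130, belief=DO_NOTHING):
    """Return the temperature and reward at every time-step."""
    temp, temps, rewards = start_temp, [], []
    for _ in range(steps):
        action = policy(temp, belief)
        temps.append(temp)
        rewards.append(reward(temp))
        temp = transition(temp, action)
        belief = update_belief(action)
    return temps, rewards


temps, rewards = simulate()
print(min(temps[90:]), max(temps[90:]), sum(rewards))
```

After the initial warm-up phase the temperature keeps moving up and down between roughly 21 and 23 degrees, which is the repetition visible in the figure.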
1.5 Environments
The interaction with the environment depends a lot on the environment itself. We can make
a very simple system which shows very complex behavior when the environment is complex.
One good example of this is Simon’s ant. Herbert Simon is a well-known researcher in artificial
intelligence and he thought about a simple ant which follows the coast line along the beach.
Since the waves make different complex patterns on the beach, the ant which follows the coast
line will also show complex behavior, although the design of this ant may be very simple.
On the other hand, the environment can also make the design of a system much more
complicated. There are some characteristics of environments which are important to study,
before we can understand how complex the construction of a well performing system will be.
The following characteristics of environments are most important:
Figure 1.4: The dynamics of the room’s temperature while interacting with the heater with
the given policy (horizontal axis: time; vertical axis: temperature). Note that there is a repetition in the dynamics.
• Completely / Partially observable. The question here is about the perception of
the agent of the environment. Can it perceive the complete state of the environment
through its (virtual) sensors? If so, the environment is completely observable; otherwise it is partially observable. Complete observability is
for example the case in many board-games (but not in Stratego).
• Deterministic / Non-deterministic. If the next state of an environment given the
previous state and action of an agent is always unique, then it is a deterministic environment. If the successor state can be one of many possible states, usually a probability
distribution is used and then the environment is non-deterministic (also called stochastic).
• Episodic / Non-episodic. If the task always requires only a single interaction with the
environment, then the interaction with the environment is episodic. If a complete
sequence of actions has to be planned and executed, the interaction with the environment is non-episodic.
• Static / Dynamic. If the environment does not change when we do not regard the
action of the agent, then the environment is static. In case the environment changes on
its own independently of the action of the agent, we say the environment is dynamic.
In case the reward function changes, we say the environment is semi-dynamic.
• Discrete / Continuous. If the state of the environment only uses discrete variables
such as in chess, the environment is discrete. If continuous variables are necessary to
accurately describe the state of the environment, the environment is continuous (as is
the case with robotics where the position and orientation are continuous).
If we use these dimensions to characterise the environment, it is not surprising
that the environments that are most complex to control perfectly are partially observable,
non-deterministic, non-episodic, dynamic, and continuous. We can always try to
simulate these environments, although building a good model is itself complicated (as for example in
weather prediction).
We can make a list of environments and show the characteristics of these environments.
Figure 1.5 shows such a mapping of tasks (and environments) to characteristics.
Environment                  Completely observable  Deterministic  Episodic  Static  Discrete
Chess with clock             Yes                    Yes            No        Semi    Yes
Chess without clock          Yes                    Yes            No        Yes     Yes
Poker                        No                     No             No        Yes     Yes
Backgammon                   Yes                    No             No        Yes     Yes
Taxi driving                 No                     No             No        No      No
Medical diagnosis            No                     No             No        No      No
Object recognition           Yes                    Yes            Yes       Semi    No
Interactive English teacher  No                     No             No        No      Yes
Figure 1.5: A mapping from environments and tasks to characteristics.
1.6 Multi-agent Systems
In particular tasks, there are multiple agents which may be working together to solve a
problem, or they may be competing to get the best out of the situation for themselves. In
the case of multiple agents interacting with each other and the environment, we speak of a
Multi-agent System (MAS). In principle the whole MAS could be modelled as one super-agent which selects actions for all individual agents. However, thinking about a MAS as a
decentralised architecture has some advantages:
• Robustness. If the super-agent stops working, nothing can be done anymore,
whereas if a single agent in a big group of agents stops working, the system can still
continue to solve most tasks.
• Speed. In case of multiple agents, each agent could easily run on its own computer
(distributed computing), making the whole system much faster than using a single
computer.
• Simplicity to extend or modify the system. It is much easier to add a new agent running
its own policy than to change one big program of the super-agent.
• Information hiding. If some companies have secret information, they do not want other
agents to access that information. Therefore this information should only be known to
a single agent. If everything runs on a super-agent the privacy rules are much harder
to guarantee.
1.6.1 Model of a multi-agent system
If we are dealing with a MAS, we can still model the individual agents with the same formal
methods as with single agents, so with inputs, actions, internal state, policy, reward function,
and belief update function. In many cases, however, there will also be communication between
the agents. In that case the agents possess communication signals (usually some language)
and they map inputs and internal states to communication signals which they can send to
individual agents or broadcast to all of them. Communication is important if the agents have
to cooperate. Coordination of agents is important to optimize a MAS, since otherwise they
might all start to do the same job and go to the same places etc. It is clearly more efficient if the
agents can discuss among themselves what role they will play in solving a task. Furthermore
there may also be management agents which give roles and tasks to individual agents etc.
A current challenging research field is to study self-adaptive structures or architectures of
multi-agent organisations.
1.7 Complex Adaptive Systems
Some systems consisting of multiple interacting entities are called complex adaptive systems.
The difference between complex adaptive systems and MASs is that in complex adaptive
systems, the individual entities do not have a goal; they are just part of the overall system.
Basically, these entities are smaller than a complete agent (think about the difference between
your body-cells and you as a complete organism). Therefore complex adaptive systems also
do not have to be able to control some process or solve some task; they are more important for
simulating processes. We do not consider such complex adaptive systems as being rational,
although they may still adapt themselves and can be very complex. In complex adaptive
systems, simple rules can create complex behavior if multiple simple entities interact. We
then often say that the overall system behavior emerges from the interaction between the
entities. Examples of processes which we can model with complex adaptive systems are:
• Traffic consisting of many vehicles or other users of infrastructures
• Forest fires consisting of trees, grass, etc. which propagate the fire
• Infectious diseases consisting of viruses and virus-carriers
• Magnetism consisting of elementary particles which can be positively or negatively
charged
• Ecological systems which consist of many organisms which can eat each other and
reproduce
• Economic markets which consist of many stocks and investors
For some of the above processes, we might also use a MAS to model them and try to
optimize the process. This is especially clear for traffic and economic markets.
1.7.1 Predator-Prey systems
A simple example of a system consisting of multiple entities is a predator-prey system. The
predator looks for food (prey) to eat and produces offspring. The prey also looks for food,
reproduces itself, and tries to avoid being eaten by predators. The interesting phenomenon is that the populations of prey and predators depend on each other. If there are
many predators, the population of prey will decrease since many of them will be eaten. But
if there are few prey, the population of predators will decrease since there will not be enough
food for all of them. If there are then few predators left, the population of prey will increase
again, leading to repetitive dynamics.
Lotka-Volterra Equations. Lotka and Volterra captured the predator-prey system
with a pair of equations. We will call the size of the prey-population x and the size of
the predator-population y. The environmental state is then S(t) = (x(t), y(t)). The state will
change according to the following two rules:
• x(t + 1) = x(t) + Ax(t) − Bx(t)y(t)
• y(t + 1) = y(t) − Cy(t) + Dx(t)y(t)
When we choose starting population sizes: S(0) = (x(0), y(0)) and we take some parameter
values for A, B, C, D we get a dynamical system which behaves for example as seen in Figure
1.6.
Figure 1.6: The predator-prey dynamics using Lotka-Volterra equations. Note that the predator population y will grow if there is a lot of prey and the prey population will decrease if
there are many predators.
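The two update rules can be iterated directly. In the sketch below the parameter values and starting populations are arbitrary choices for illustration, not the values used for the figure; with larger parameter values this discrete update can overshoot and produce negative populations.

```python
def lotka_volterra_step(x, y, A, B, C, D):
    """One application of the two update rules for prey x and predators y."""
    x_next = x + A * x - B * x * y
    y_next = y - C * y + D * x * y
    return x_next, y_next


def simulate(x0, y0, A, B, C, D, steps):
    """Iterate the rules and return the prey and predator trajectories."""
    xs, ys = [x0], [y0]
    for _ in range(steps):
        x, y = lotka_volterra_step(xs[-1], ys[-1], A, B, C, D)
        xs.append(x)
        ys.append(y)
    return xs, ys


# Hypothetical parameters: the populations oscillate around x = 50, y = 10.
xs, ys = simulate(60.0, 10.0, A=0.05, B=0.005, C=0.05, D=0.001, steps=300)
for t in range(0, 301, 50):
    print(t, round(xs[t], 2), round(ys[t], 2))
```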
1.7.2 State dynamics
We have seen that the state of the environment shows a particular kind of dynamics. We can
distinguish between three kinds of dynamics: dynamics to a stable point, dynamics leading
to a periodic cycle, and chaotic dynamics. When the state enters a stable point, it will
always stay there. This means that the dynamics basically ends and S(t+1) = S(t) for all t ≥ n,
where n is some time-step at which the process enters the stable point. We can compute what
the stable point of the dynamics of the Lotka-Volterra equations will be depending on the
parameters A, B, C, D. Whether the process will enter the stable point may also depend on
the initial state. The following should hold for a stable point for the Lotka-Volterra process:
(x(t + 1), y(t + 1)) = (x(t), y(t))
Then we can find a stable point S(∗) = (x(∗), y(∗)) as follows:
x(∗) = x(∗) + Ax(∗) − Bx(∗)y(∗)    (1.2)
0 = A − By(∗)    (1.3)
y(∗) = A/B    (1.4)
y(∗) = y(∗) − Cy(∗) + Dx(∗)y(∗)    (1.5)
0 = −C + Dx(∗)    (1.6)
x(∗) = C/D    (1.7)
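The result can be checked numerically. The sketch below uses exact fractions so that the equality test is not disturbed by rounding; the parameter values are arbitrary. Note that the derivation divides by x(∗) and y(∗), so it assumes both are non-zero; the state (0, 0) is also left unchanged by the update rules.

```python
from fractions import Fraction


def lotka_volterra_step(x, y, A, B, C, D):
    """One application of the two Lotka-Volterra update rules."""
    return x + A * x - B * x * y, y - C * y + D * x * y


# Hypothetical parameter values, stored as exact fractions.
A, B, C, D = Fraction(1, 20), Fraction(1, 200), Fraction(1, 20), Fraction(1, 1000)

# Stable point from the derivation: x(*) = C/D and y(*) = A/B.
x_star, y_star = C / D, A / B

# Applying the update rules at the stable point leaves the state unchanged.
unchanged = lotka_volterra_step(x_star, y_star, A, B, C, D) == (x_star, y_star)
print(x_star, y_star, unchanged)
```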
Periodic Cycle. For a periodic cycle, after some initial transient process, the state-sequence should always repeat itself after some period of fixed length. We have already seen
two processes which lead to a periodic cycle: the heater and the Lotka-Volterra equations.
Formally for a periodic cycle the following should hold:
S(t) = S(t + n)
S(t + 1) = S(t + n + 1)
...
S(t + n − 1) = S(t + 2n − 1)
Here we say that the length of the periodic cycle is n. Note that a stable point is equivalent
to a periodic cycle of length 1. Sometimes a process slowly converges to a cyclic behavior.
We then say that the final attractor is a limit cycle.
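For a deterministic process with exactly representable states, the transient length and the period n can be found by remembering when each state was first visited. The sketch below applies this to the heater example. Two choices here are assumptions of the example: temperatures are stored as integers in hundredths of a degree to avoid rounding problems, and the total state is taken to be the pair (temperature, previous action), because the next temperature depends on both.

```python
def find_cycle(step, start):
    """Return (transient_length, period) of the sequence start, step(start), ...

    Works for any deterministic `step` on hashable states, provided the
    sequence eventually revisits a state.
    """
    first_seen = {}
    state, t = start, 0
    while state not in first_seen:
        first_seen[state] = t
        state = step(state)
        t += 1
    return first_seen[state], t - first_seen[state]


def heater_step(state):
    """Heater and room in hundredths of a degree: +10 when heating, -5 otherwise."""
    temp, belief = state
    if temp <= 2100:
        action = "heat"
    elif temp <= 2300:
        action = belief  # keep doing what was done in the previous step
    else:
        action = "do nothing"
    next_temp = temp + 10 if action == "heat" else temp - 5
    return next_temp, action


transient, period = find_cycle(heater_step, (1500, "do nothing"))
print(transient, period)  # prints: 61 63
```

With these settings the cycle consists of 21 heating steps followed by 42 steps in which the heater does nothing.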
Chaotic dynamics. In case the process does not lead to a stable point or to a periodic
cycle (also called a stable limit cycle), the process might be called chaotic although there
are some additional conditions for a true definition of chaos explained below. In chaotic
dynamics it is very hard to predict what will happen after a long time, although according to
the above definition alone it may be simple in some cases, e.g. the equation S(t+1) = S(t)+1
would according to the above definition also lead to chaotic dynamics. This is of course very
strange, since we always think about chaotic processes as being unpredictable. Therefore we
have to include the condition that the process is non-linear and sensitive to initial conditions.
This means that when we start with two initial states S1(0) and S2(0) which may be very
close to each other, the difference between the trajectories will increase (exponentially)
after iterating the process over time. In the case of the equation S(t + 1) = S(t) + 1 the
difference between two starting states will not grow but remain the same and the system is
clearly linear. But there are processes which are non-linear for which the difference between
the state trajectories grows which are still predictable such as S(t + 1) = S(t) × S(t) where
S(0) ≥ 1. Therefore even this requirement may not be strict enough, and to eliminate such
trivial cases we have to add the condition that the state trajectory does not go to infinity, but
remains bounded in some subspace. This bounded subspace is called a chaotic attractor, and
although the state trajectory will remain in the attractor, it is unpredictable where it will be
if we do not know the precise initial state and model of the chaotic system. All we can do is
to compute a probability function over this subspace to guess in which area the process will
be at some time-step.
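Sensitivity to initial conditions can be illustrated with the logistic map S(t + 1) = 4 S(t)(1 − S(t)). This map is not discussed in the text above; it is used here only as a standard example of a non-linear process whose state stays bounded between 0 and 1. The starting value and the size of the initial difference are arbitrary.

```python
def logistic(s):
    """Non-linear, bounded update rule: S(t+1) = 4 * S(t) * (1 - S(t))."""
    return 4.0 * s * (1.0 - s)


def trajectory(s0, steps):
    """Return the list S(0), S(1), ..., S(steps)."""
    states = [s0]
    for _ in range(steps):
        states.append(logistic(states[-1]))
    return states


# Two initial states that differ by only 1e-10.
a = trajectory(0.2, 100)
b = trajectory(0.2 + 1e-10, 100)
gaps = [abs(p - q) for p, q in zip(a, b)]

# The gap starts tiny and becomes as large as the attractor itself.
for t in (0, 10, 20, 30, 40, 50):
    print(t, gaps[t])
```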
The requirement that the difference between two initial states will grow makes the prediction problem much harder, since if our measured initial state has some small error ε then
after some time the error will have grown so drastically that our prediction of the state will
no longer be valid or useful. Since measuring both a state and the change of the state of a
complex non-linear system exactly at the same time is not possible in practice (for change we need to look at the
difference between two states), we can never have a perfectly precise measurement of the current state
(where the state includes position and velocity or change). Therefore, when the process is
chaotic, it cannot be predicted accurately over long periods of time.
Another interesting thought is that chaos is not really possible on a computer, since a
computer has only a finite number of states. Therefore, since a chaotic system always uses a
deterministic transition function, we will always come back at some time to a state visited before and
then repeat the same sequence of states, leading to some periodic cycle of very large period. It is also
true that it is often hard to distinguish between chaotic dynamics and a periodic cycle, since
the period may be so large that the process appears to be chaotic, but in reality has a very
large period which did not appear in the generated state trajectory. Finally we should note
that there is a big difference between non-determinism (randomness) and a chaotic process.
A chaotic system is deterministic, but may appear random to an observer. On the other
hand in non-determinism the process will never follow exactly the same state trajectory,
so one might think such processes are chaotic. However, in a chaotic system we could in
principle predict future states if the current state is exactly known. The impossibility to
predict future states comes from the impossibility to know exactly the current state. On
the other hand, in a non-deterministic system, even if we would know the exact initial state,
prediction of a trajectory would be impossible since there would be many possible future
trajectories. If we examine random-number generators, they are in reality pseudo-random
number generators which provide us with seemingly random numbers, but which basically draw
the numbers from a huge periodic cycle of fixed length. Real randomness probably
exists in nature, although it is extremely difficult to find out whether it is not deterministic
chaos which makes nature appear random.
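The periodic nature of a pseudo-random number generator is easy to see in a toy example. The sketch below uses a linear congruential generator with deliberately tiny, hypothetical parameters (a = 5, c = 3, m = 16); real generators work on the same principle of a deterministic update, but with a vastly longer period.

```python
def lcg(x, a=5, c=3, m=16):
    """Toy linear congruential generator: x(t+1) = (a * x(t) + c) mod m."""
    return (a * x + c) % m


def period(seed):
    """Number of steps until the generator returns to `seed`.

    This terminates because with these parameters the update is a
    one-to-one mapping on 0..15, so every seed lies on a cycle.
    """
    x, n = lcg(seed), 1
    while x != seed:
        x = lcg(x)
        n += 1
    return n


x, numbers = 0, []
for _ in range(20):
    x = lcg(x)
    numbers.append(x)

print(numbers)    # the first 16 numbers repeat from position 17 onwards
print(period(0))  # prints: 16
```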
1.8 Outline of this Syllabus
This syllabus describes a wide variety of adaptive systems, ranging from artificial life models
such as cellular automata to machine learning methods such as artificial neural networks.
Since the topic of adaptive systems is so broad, there may not always be an evident connection
between the different topics. For example in machine learning, knowledge may be learned
from examples. The interaction with the environment may not be very clear in such cases,
since the knowledge representation is changing according to the learning dynamics generated
by the interaction between the learning algorithm and the examples. Therefore the concept of
environment should be considered also very broad ranging from the system itself or examples
to a real world environment. In this syllabus the following topics will be covered:
• Cellular Automata which are useful as models for complex adaptive systems and studying artificial life.
• Biological adaptive systems in which systems inspired by swarm intelligence (e.g. ants)
are used to solve complex problems
• Evolutionary computation in which a model of evolutionary processes is used to solve
complex optimisation problems
• Robotics, where physical robots interact with an environment to solve some specific
task
• Machine learning, in which different algorithms such as decision trees, Bayesian learning,
neural networks, and self-organising maps are studied in their way of learning knowledge
from examples. This knowledge may then be used to solve classification tasks such as
mapping mushroom-features to the concept whether they are edible or poisonous.
• Reinforcement learning, which is a part of machine learning, but where the focus is more
on an agent which can learn to behave by interacting with some specific environment.
Chapter 2
Artificial Life
Artificial life researchers study computational models of life-like and emergent processes in
which complex dynamics or patterns arise from the interaction between many simple entities.
Artificial Life is a broad interdisciplinary field where research runs from biology, chemistry,
physics to computer science and engineering. The first artificial life workshop was held in
Santa Fe in 1987 and after this the interest in this field grew tremendously. One of the most
ambitious goals of artificial life is to study the principles of life itself. To study the properties
of life there are basically two roads; to study carbon life forms and their development (mainly
done in biochemistry) and to examine life forms and their properties using a computer. What
both fields have in common is that life emerges from building blocks which cannot be called
alive on their own. So the interaction between the elements makes the whole system appear
to be alive. Since the interactions are usually not well understood, the study of artificial life is
usually holistic in nature, which means that we look at the whole system without being able to
make clear separations into smaller modules. Still today many scientists think that life evolved
from chemicals in the primordial soup (containing a large number of carbon compounds),
although some scientists believe that life may have come from space on a comet. Some assert
that all life in the universe must be based on the chemistry of carbon compounds, which is
also referred to as “carbon chauvinism”.
Thus, artificial life constructs models and simulates them to study living entities or other
complex systems in computer systems. Some research questions which it tries to answer are:
• Biology: How do living organisms interact in biological processes such as finding/eating
food, survival strategies, reproduction?
• Biochemistry: How can living entities emerge from the interaction of non-living chemical
substrates?
• Sociology: How do agents interact in artificial societies if they have common or competing goals?
• Economy: How do rational entities behave and interact in economical environments
such as in stock-markets, e-commerce, auctions, etc.?
• Physics: How do physical particles interact in a particular space?
• Artificial Art: How can we use artificial life to construct computer art?
One important goal of artificial life is to understand the source and functionality of life.
One particular way of doing that is to make computer programs which simulate organisms
using some encoding (which might be similar to DNA encoding, but can also range to
computer programs resembling Turing machines). The development of artificial creatures
which can be called alive also requires us to have a good definition of alive. For this we cite:
http://www.wordiq.com/definition/Life
In biology a conventional definition of an entity that is considered alive has to
exhibit all the following phenomena at least once during its existence:
• Growth
• Metabolism; consuming, transforming and storing energy/mass growing by
absorbing and reorganizing mass; excreting waste
• Motion, either moving itself, or having internal motion
• Reproduction; the ability to create entities which are similar to itself
• Response to stimuli; the ability to measure properties of its surrounding
environment, and act upon certain conditions
A problem with this definition is that one can easily find counterexamples and
examples that require further elaboration, e.g. according to the above definition
fire would be alive, male mules are not alive as they are sterile and cannot reproduce, viruses are not alive as they do not grow. One could restrict the definition
to say that living organisms found in biology should consist of at least one cell
and require both energy and matter to continue living, but these restrictions do
not help us to understand artificial life. Finally one could change the definition
of reproduction to say that organisms such as mules and ants are still alive by
applying the definition to the level of entire species or of individual genes.
As we can see, there are still many possible definitions and, just as with the concept of
intelligence, we may not easily get one unique definition of “alive”.
2.1 Genetic Algorithms and Artificial Life
One well-known algorithm in artificial intelligence that is based on evolutionary theory is the
genetic algorithm (GA). Darwin speculated (without knowing anything about the existence of
genes) that evolution works by recombination of material of parents which pass the selective
pressure of the environment. If there are many individuals, only some can remain alive and
reproduce. This selection is very important for nature since it allows the fittest individuals
to reproduce (survival of the fittest). Once parents are selected they are allowed to create
offspring and this offspring is slightly mutated so that the offspring will not contain exactly
the same genetic material as the parents. Genetic algorithms can be used for combinatorial
optimization problems, function optimization, robot control, and the study of artificial life
societies. We will not go into detail on genetic algorithms here, since they will be described
thoroughly in a separate chapter. In short, genetic algorithms are able to mimic the concept
of reproduction. Say some artificial organism is stored in some representation, such as a
bitstring (a string of 0’s and 1’s). Then we can take two parents, cut their strings in two
parts and glue these parts together to create a new offspring, which could possibly be better
in the task than its parents. Since parents which are allowed to reproduce are selected on
their fitness in the environment, they are likely to possess good blocks of genetic material
which may then be propagated to the child (offspring). In combination with artificial life,
genetic algorithms allow us to study a wide variety of topics, including:
• Robots which interact with an environment to solve some task
• Competitive evolutionary models such as the arms races studied by Karl Sims. In the arms-race experiment different morphologies and behaviors were evolved in 3D structures
where two organisms had to compete against each other by harming the opponent. The
winning individual passed the test and was able to reproduce, leading to a wide variety
of improving morphologies and behaviors.
• Models of social systems such as the study of emerging societies of individuals which
work together
• Economical models such as the development of buying and selling strategies
• Population genetics models where one examines which groups of genes remain in the
population
• The study of the interaction between learning and evolution
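The reproduction step described above (cutting two parent bitstrings and gluing the parts together, followed by a slight mutation) can be sketched as follows. This is only a minimal illustration under assumed names (one_point_crossover, mutate) and an assumed mutation rate; selection of the parents on fitness is left out.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Cut both parents at the same random point and glue the parts together."""
    cut = rng.randrange(1, len(parent_a))  # cut point strictly inside the string
    return parent_a[:cut] + parent_b[cut:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flipped = {"0": "1", "1": "0"}
    return "".join(flipped[b] if rng.random() < rate else b for b in bits)


rng = random.Random(0)  # fixed seed so that the run can be repeated
parent_a, parent_b = "00000000", "11111111"

child = one_point_crossover(parent_a, parent_b, rng)
child = mutate(child, rate=0.05, rng=rng)  # 0.05 is an arbitrary choice
print(child)
```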
2.1.1 Interaction between evolution and learning
In evolutionary theory, sociology, and psychology one often considers the difference between
nature and nurture. Nature is what a newborn organism possesses at its birth. E.g. Chomsky
claims that a lot of the knowledge needed for learning a language is already present in the brain of a
child when it is born. Nurture is the knowledge, skills, and behaviors which an organism
develops through its adaptation and learning process while interacting with an environment.
The nature/nurture question is whether something was inborn in an organism
or whether it developed due to the interaction with the environment. Examples of this are
whether criminals are born as criminals or whether they become criminals due to their upbringing
and life. Another example is whether homosexuality or intelligence is inborn and stored in
the genes or not. Often it is better to say that nature gives a bias towards some behavior
or the other, and nurture causes some behaviors to be expressed. E.g. if someone has genes
which are similar to those of other people having schizophrenia, such a
person will not necessarily develop the disease. This depends a lot on circumstances, but if such a person
suffers from a lot of stress, the genes may be expressed with a much higher probability.
In artificial life simulations a number of machine learning algorithms can be used which
can learn from the interaction with the world. Examples of this are reinforcement learning
and neural networks. Although these topics will be discussed in separate chapters, they could
also be used together with genetic algorithms in an environment consisting of many entities
that interact and evolve. Now if we want to study the interaction between evolution and
learning we see that evolution is very slow and takes place over generations of individuals,
whereas learning is very fast and takes place within an individual (agent). The combination
of these two leads to two possible effects:
• Baldwin effect. Here an individual learns during its interaction with the environment.
This learning may increase the fitness of the individual, so that individuals which are
good at learning may receive higher fitness values (are better able to act in the environment) than slow-learning individuals. Therefore individuals which are good at learning
may reproduce with a higher probability, leading to offspring which are potentially also
very good at learning. Thus, although the skill of learning is propagated to offspring,
learned knowledge is not directly propagated to the offspring.
• Lamarckian learning. Here an individual learns during its life and when it gets
offspring it also propagates its learned knowledge to its children which then do not have
to learn this knowledge anymore.
Lamarckian learning is biologically not very realistic, but in computer programs it is
easily feasible. E.g. suppose that a group of robots each learn to use a language;
then if they meet they can create offspring which immediately possess multiple languages. In
this way the evolutionary process could become much more efficient. Although Lamarckian
learning has so far not been realistic from a biological point of view, research in genetic
engineering has developed methods to change the DNA of an organism, which can
then be transmitted to its offspring.
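In a program the two effects differ only in what is passed on after lifetime learning. The sketch below is a toy illustration under strong assumptions: the genome is a list of bits, the hypothetical task is to have as many 1-bits as possible, and learning is modelled as simple trial and error on single bits.

```python
import random


def fitness(bits):
    """Hypothetical task: the more 1-bits, the better."""
    return sum(bits)


def learn(genome, trials, rng):
    """Lifetime learning as trial and error: keep single-bit changes that help."""
    learned = list(genome)
    for _ in range(trials):
        i = rng.randrange(len(learned))
        candidate = learned[:]
        candidate[i] = 1 - candidate[i]
        if fitness(candidate) > fitness(learned):
            learned = candidate
    return learned


def evaluate(genome, trials, rng, lamarckian):
    """Return (fitness after learning, genetic material passed to offspring).

    Baldwin effect: fitness benefits from learning, the genome is unchanged.
    Lamarckian learning: the learned result itself is inherited.
    """
    learned = learn(genome, trials, rng)
    inherited = learned if lamarckian else list(genome)
    return fitness(learned), inherited


rng = random.Random(0)
genome = [0] * 8
print(evaluate(genome, 20, rng, lamarckian=False))
print(evaluate(genome, 20, rng, lamarckian=True))
```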
2.2 Cellular Automata
Cellular automata are often used by researchers working in artificial life. The inventor of
cellular automata (CA) is John von Neumann, who also devised the modern computer and
played an important role in (economic) game theory. Cellular automata are decentralised
spatial systems with a large number of simple, identical components which are locally connected. The interesting thing about cellular automata is that they are very well suited for visualizing
processes, and that although they consist of simple components and some simple rules, they
can show very complex behaviors. CA are used in a number of fields to model biological, social,
and physical processes such as:
• Fluid dynamics
• Galaxy formation
• Earthquakes
• Biological pattern formation
• Forest fires
• Traffic models
• Emergent cooperative and collective behavior
2.2.1 Formal description of CA
A cellular automaton consists of two components:
• The cellular space. The cellular space consists of a lattice of N identical cells. Usually
all cells have the same local connectivity to other cells. Let $\Sigma$ be the set of possible
states for a single cell. Then $k = |\Sigma|$ is the number of possible states per cell. A cell
with index i at time-step t is in state $s_i^t$. The state $s_i^t$ together with the states of the
cells to which i is connected is called the neighborhood $n_i^t$ of cell i.
• The transition rule. The transition rule $r(n_i^t)$ gives an update for cell i to its next
state $s_i^{t+1}$ as a function of its neighborhood. Usually all cells are updated synchronously (at the
same time). The rule is often implemented as a lookup table.
2.2.2 Example CA
The following gives an example of a CA consisting of a 1-dimensional lattice of 11 cells with
periodic boundary conditions. The periodic boundary conditions mean that the leftmost
cell has the rightmost cell as its left neighbour and vice versa. Since the neighborhood
of a cell consists of the cell itself, the cell to its left, and the cell to its right, the size of the
neighborhood is 3. Therefore, since the number of possible states of a single cell is only 2 (1
or 0), the transition rule consists of $2^3 = 8$ components; for each neighborhood there is one
successor state for the cell. Note that in this example there are $2^{11} = 2048$ possible
complete state configurations for the CA.
Rule Table R:
Neighborhood: 000 001 010 011 100 101 110 111
Output bit:     0   1   1   1   0   1   1   0
Lattice (periodic boundary conditions):
t=0: 1 0 1 0 0 1 1 0 0 1 0
t=1: 1 1 1 0 1 1 1 0 1 1 1
Figure 2.1: A cellular automaton using a 1-dimensional lattice, a neighborhood size of 3, and
2 possible states (0 or 1) per cell. The figure shows the CA configuration at time t = 1
computed using the transition rule on the CA configuration at time t = 0.
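The update of Figure 2.1 can be reproduced with a few lines of code. The following is a minimal sketch, assuming the rule is stored as a lookup table keyed by the (left, self, right) neighborhood; the names RULE and ca_step are illustrative and not part of the original text.

```python
# Rule table R from the example: neighborhood (left, self, right) -> next state.
RULE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 0,
}

def ca_step(cells, rule=RULE):
    """Synchronously update a 1-D lattice with periodic boundary conditions."""
    n = len(cells)
    # The modulo wraps the indices, so the leftmost and rightmost cells are neighbours.
    return [rule[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]

t0 = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0]   # configuration at t = 0
t1 = ca_step(t0)
print(t1)  # [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1], the row for t = 1
```

Calling ca_step repeatedly and printing each row gives a space-time picture of the CA.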
2.2.3 Dynamics of the CA
The CA given in the previous subsection only uses 1 dimension, a neighborhood size of only 3,
and 2 possible states per cell. Therefore, it is one of the simplest CA. But even this CA can
show complex behavior if we iterate it over time and show the dynamics in the space-time
dimensions, see Figure 2.2.
It will not be a surprise that cellular automata with more complex transition rules and a
larger number of possible states can show even more complex behavior. In principle
there are other possible iterative networks or automata networks; cellular automata are just
one kind of automata of this family.
Figure 2.2: The sequence of cellular patterns of the CA given in Figure 2.1 generated by
iterating it over 100 time steps.
2.2.4 Processes in CA
In Chapter one we have already seen that when we have bounded spaces, we can divide a
process resulting in a pattern into three different classes: stable, periodic, and chaotic. Since
the cellular configuration state space of a CA is bounded, we can divide patterns created by
a CA into these three groups. Note however that the set of possible complete states of a CA
is not only bounded, but also finite. The three possible resulting patterns of a CA are:
• A stable state (or point). After entering the stable state, the process remains in the same
state and change stops.
• A cyclic pattern. The CA traverses a repeating pattern of some periodic length.
If there are multiple sub-patterns each with their own periodic length, the complete
pattern will be periodic but with a larger length (e.g. if two sub-patterns which do
not interact in the CA have periodic lengths of 2 and 3, the complete pattern will have
periodic length 6).
• Chaotic behavior. The CA always goes to new, unseen patterns. A deterministic
system can in principle behave chaotically. However, since the number of possible
configurations of a CA on a computer is finite (although it is often huge), there will after finite time
always be a configuration which has been seen before, after which the process repeats the same
cycle of configurations. Therefore real chaotic behavior in a finite CA is not possible; only a
periodic cycle of very large length is possible.
It is important to understand that an initial configuration may lead to a sequence of
patterns which are all different, after which it may enter a stable state or a periodic cycle.
The time until the CA enters a stable state or periodic cycle is called the transient period.
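Because a finite CA must eventually revisit a configuration, the transient period and the cycle length can be measured by remembering when each configuration was first seen. The sketch below assumes a 1-dimensional CA with a neighborhood of size 3 and periodic boundaries; the function names are illustrative.

```python
def ca_step(cells, rule):
    """One synchronous update of a 1-D CA with periodic boundary conditions."""
    n = len(cells)
    return tuple(rule[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
                 for i in range(n))

def transient_and_period(initial, rule):
    """Iterate until a configuration repeats; return (transient, period)."""
    seen = {}                       # configuration -> time-step of first visit
    state, t = tuple(initial), 0
    while state not in seen:
        seen[state] = t
        state = ca_step(state, rule)
        t += 1
    transient = seen[state]         # steps taken before the cycle was entered
    return transient, t - transient

# Rule table of Figure 2.1 and a rule that always outputs 1.
RULE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 0,
}
ALWAYS_ONE = {key: 1 for key in RULE}

print(transient_and_period([1, 0, 1], ALWAYS_ONE))   # (1, 1): stable after one step
print(transient_and_period([1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0], RULE))
```

A period of 1 corresponds to a stable state; any larger period is a cyclic pattern. The memory used grows with the number of distinct configurations visited, so this approach only suits small lattices.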
Some researchers also like to include structured behavior with the above mentioned three
types of behavior. In structured behavior, the behavior seems very structured, but there is
no repetitive sequence (at least not for a long time).
The dynamics of CA can be influenced by the transition rules. Some transition rules can
lead to very simple behavior, whereas others lead to very complex behavior. Some people
find it a sport to make a transition rule which has the longest possible periodic length.
2.2.5 Examples of cyclic processes
A stable state is easy to make, e.g. it can consist of only 1's. Then if we make transition
rules which always output a 1, we get the resulting stable state from any possible initial
configuration after one time step. Periodic cycles can be made in many possible ways. Here
we show a simple example. Suppose we have a 2-dimensional lattice. The transition rule is: if
2 neighbours (out of 4) are active, then the cell is activated (becomes 1 or black). Otherwise
the cell is not activated (becomes 0 or white). Figure 2.3 shows a lattice without boundary
conditions (basically we show a small part of the lattice which is everywhere else empty so
that we still have identical connectivity for all cells), resulting in a periodic cycle of length
2.
Figure 2.3: A cellular automaton configuration with a repeating pattern (the periodic length
is 2).
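As a rough illustration of such a cycle, the sketch below applies the rule on an unbounded lattice by storing only the coordinates of active cells. It assumes that "2 neighbours" means exactly 2, and the starting pattern (two diagonally adjacent cells) is a hypothetical example, not necessarily the one drawn in Figure 2.3.

```python
from collections import Counter

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))   # the 4 nearest cells

def step(active):
    """Cells with exactly 2 active neighbours become active; all others inactive."""
    # Count, for every cell next to an active cell, how many active neighbours it has.
    counts = Counter((x + dx, y + dy) for x, y in active for dx, dy in NEIGHBOURS)
    return frozenset(cell for cell, n in counts.items() if n == 2)

a = frozenset({(0, 0), (1, 1)})
b = step(a)
print(sorted(b))        # [(0, 1), (1, 0)]
print(step(b) == a)     # True: the pattern repeats with periodic length 2
```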
Problem. Given a 2-dimensional lattice with transition rule: if one neighbour is active
and the cell was inactive, then the cell becomes active. Else if the cell was active at the
previous time-step keep the cell active in the next time-step. Otherwise the cell remains
inactive. Now evolve the CA in Figure 2.4.
Figure 2.4: The lattice of the CA for the problem (panels t=0, t=1, t=2, t=3). Try to evolve the CA over time with the
above given transition rule.
2.2.6 Elimination of basis patterns
When one evolves a CA, there are often some regularities involved, and other parts which
are completely unpredictable. Therefore some researchers have tried to use methods for
eliminating the basis of the evolutionary transitions in a CA. This basis can consist of walls,
singularities, etc. and can then be eliminated from the process.
The importance of eliminating the basis patterns is to get more insight into possible chaotic
or turbulent processes. For example take the process from Figure 2.5. If we remove the
regularities from this process, we get the process shown in Figure 2.6. We can see that most
of the seemingly complex process is removed, but some embedded particles move about
in a seemingly random way. It turns out that when these embedded particles hit each other,
they are destroyed.
Figure 2.5: A CA process iterated over time.
Figure 2.6: The process of Figure 2.5 with the regular basis patterns removed.
2.2.7 Research in CA
One important insight is that cellular automata are universal machines. That means that
they can compute any computable function and are therefore just as powerful as Turing
Machines. This also means that any algorithm which can be implemented on the usual
sequential computer can in principle also be implemented in a CA.
Conway’s game of life
The game of life was invented by the mathematician John Conway in 1970. He chose the rules
carefully after trying many other possibilities, some of which caused the cells to die too fast
and others which caused too many cells to be born. The game of life balances these tendencies,
making it hard to tell whether a pattern will die out completely, form a stable population, or
grow forever. Conway's game of life uses a 2-dimensional lattice with 8 neighbours for each
cell. The transition rules are:
• If a cell is not active (dead, white, or 0) and it has exactly 3 active neighbours, then the
cell will become active (rule of birth)
• If a cell is active and it has 2 or 3 neighbours which are active, then the cell stays active
(rule of survival)
• In all other cases the cell becomes or stays not active (rule of death due to overcrowding or
loneliness).
One of the interesting things about the game of life is that it has universal computing
power, even with the three rules given above. This universal computing power relies on
particular patterns known as gliders. Such gliders are living entities which cross the 2-D
lattice and which can pass information, so that it becomes possible to make logical AND and
NOT gates. For an example of the behavior of a glider, see Figure 2.7.
Figure 2.7: A glider moving one step diagonally every 4 time-steps (panels t=0 to t=4).
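The three rules and the glider's motion can be checked with a short program. This is a minimal sketch that stores only the active cells of an unbounded lattice; the coordinate convention (y increasing downwards) and the names life_step and glider are assumptions made for the illustration.

```python
from collections import Counter

def life_step(alive):
    """Apply the rules of birth, survival, and death to a set of active cells."""
    # Count the active neighbours (out of 8) of every cell next to an active cell.
    counts = Counter((x + dx, y + dy)
                     for x, y in alive
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # 3 neighbours: birth or survival; 2 neighbours: survival only.
    return frozenset(cell for cell, n in counts.items()
                     if n == 3 or (n == 2 and cell in alive))

glider = frozenset({(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)})
state = glider
for _ in range(4):
    state = life_step(state)

shifted = frozenset((x + 1, y + 1) for x, y in glider)
print(state == shifted)   # True: same shape, moved one cell diagonally
```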
Another important object in the game of life is the glider gun. Glider guns can
fire gliders and remain stable, which makes it possible to propagate information at some rate.
By using multiple glider guns which shoot gliders, we can make interactions between different
patterns which are propagated in the cellular space. An example of this is to have two gliders
which collide, after which they are destroyed. This would be useful to make a NOT gate.
Making a CA using the game of life rules to compute arbitrary functions is very complicated,
because it requires a very careful development of the initial configuration consisting of glider
guns and other patterns, but in principle it would be possible.
Another interesting pattern in the game of life which shows very complex behavior is
known as the R-pentomino, shown in Figure 2.8. It is remarkable that such a
simple pattern can create complex behavior including gliders and many other patterns.
Figure 2.8: The pattern called R-pentomino which creates very complex behavior.
Development of cellular automata
One goal of artificial life is to make artificial systems which can be called alive. For this,
reproduction seems necessary, and therefore researchers investigated whether this was possible
in cellular automata. In work published in 1966, John von Neumann constructed a cellular automaton which
was able to reproduce itself, demonstrating one of the necessary abilities of living systems.
Some other researchers examined whether cellular automata could be used for recognizing
languages. In 1972, Smith constructed a CA which could recognize context-sensitive languages
such as palindromes (palindromes are strings which are the same if you read them from left
to right or from right to left). After that, Mitchell et al. (1994) used genetic algorithms to
evolve the transition rules of CA. They tried this using the majority problem as a testbed.
In the majority problem a bitstring of some size is given and each bit of the string can be on
or off. The system should tell whether the majority of bits is on or off. The system
can indicate this by making all bits on (off) if the majority was on (off) after a number of
iterations. Although this problem can of course be simply solved by counting all bits, such
a counter would require some form of register or additional memory which is not available inside
the cellular automaton. Thus, the question was whether the genetic algorithm could evolve
transition rules which can solve the problem. The result was that the genetic algorithms
found different solutions, which are however not optimal for solving all initial configurations (with
any order of 1's and 0's). Some of the solutions used embedded particles. The reason that no
optimal solution was evolved was the limited local connectivity, which does not allow
all bits to communicate with each other.
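To make the majority problem concrete, the sketch below shows how a candidate transition rule could be scored, in the way a fitness function for a genetic algorithm might do it. Everything here is an assumption for illustration: the neighborhood radius, lattice size, number of steps and trials, and the hand-written local_majority rule are not taken from the study described above.

```python
import random
from itertools import product

def run_ca(cells, rule, radius, steps):
    """Iterate a 1-D CA with periodic boundaries for a fixed number of steps."""
    n = len(cells)
    for _ in range(steps):
        cells = [rule[tuple(cells[(i + d) % n] for d in range(-radius, radius + 1))]
                 for i in range(n)]
    return cells

def majority_fitness(rule, radius=3, n_cells=49, steps=100, trials=20, seed=0):
    """Fraction of random initial configurations classified correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        start = [rng.randint(0, 1) for _ in range(n_cells)]
        target = 1 if sum(start) > n_cells / 2 else 0    # n_cells is odd: no ties
        final = run_ca(start, rule, radius, steps)
        correct += all(c == target for c in final)       # every bit must agree
    return correct / trials

# A naive hand-written candidate: each cell takes the majority of its neighborhood.
local_majority = {nb: int(sum(nb) > len(nb) / 2) for nb in product((0, 1), repeat=7)}
print(majority_fitness(local_majority))
```

A genetic algorithm would treat the 128 output bits of such a rule table as the chromosome and use a score like this one to select rules for reproduction.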
Other cellular automata
Cellular automata can also be simply and efficiently used for simulating particular processes
such as:
• Modelling Traffic. Here a cell is active if there is a car and it is inactive if there is
no car. The rules are simple to make too: if the cell ahead (the predecessor cell) is empty, move to that
cell, otherwise stop. The CA can be made more complicated by adding to each cell
occupied by a car some internal state which models the destination address of the car.
Also different speeds can be taken into account.
• Modelling Epidemics. Here a cell can be a sick, healthy, or immune person.
• Modelling Forest Fires. A cell can be a tree on fire, water, a tree without being on fire,
grass, sand, etc. It is also possible to include external parameters such as wind-strength
and wind-direction, humidity etc. to influence the behavior of the model.
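The traffic rule from the list above can be sketched as follows. The circular single-lane road and the name traffic_step are assumptions made for this illustration; cars drive towards higher indices.

```python
def traffic_step(road):
    """One synchronous update of a circular single-lane road (1 = car, 0 = empty)."""
    n = len(road)
    new_road = [0] * n
    for i, cell in enumerate(road):
        if cell == 1:
            ahead = (i + 1) % n
            if road[ahead] == 0:
                new_road[ahead] = 1      # the cell ahead is free: move
            else:
                new_road[i] = 1          # blocked by another car: stop
    return new_road

road = [1, 1, 0, 0, 1, 0]
print(traffic_step(road))   # [1, 0, 1, 0, 0, 1]
```

The first car is blocked and waits, while the other two advance. The number of cars is conserved, and jams arise on their own when the road is dense.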
Power laws
There is a lot of research using CA for examining chaotic processes, for example as studied in
sandpile models. In cellular automata sandpile models a granular material in a gravitational
field is simulated (the model can be two or three dimensional). There are two kinds of cells:
immovable ground cells and movable sand grains. Grains fall from a source at the top of
the window and proceed down to the ground. Grains pile up and redistribute themselves
according to the cellular automata rules (e.g. if two cells on top of each other contain a grain,
and a neighboring cell does not, then the top grain will make a transition to the empty
neighboring cell). One interesting aspect of CA implementations of such physical models is
that there will sometimes be long shifts of grain during the redistribution. Such a shift is
often called an avalanche. Large avalanches are much
less probable than smaller ones, and the probability distribution follows a power
law (also known as Zipf's law or the Pareto distribution). E.g. if we take English words according to their
number of occurrences and we rank all the words according to their usage (so rank 1 means
the word is used most often), then Zipf's law states that the size y of occurrence of an event
(in this example the occurrence of a word) is inversely proportional to its rank r according
to:
$$y = a r^{-b}$$
where a is some constant and the exponent b is close to 1. Such a power law has been
demonstrated in many research fields, such as in social studies where the number of users of
web-pages are counted to examine Website popularity. There are few web-pages with a large
number of users, and many web-pages with few users, and the distribution follows Zipf’s law.
Pareto looked at income and found that there are few millionaires whereas there are many
people with a modest income. Also for earthquakes, there are few very heavy earthquakes
and many smaller ones, etc. To show whether some data provides evidence for a power law, it
can be hard to work with very large values appearing in the data. In that case we can make
a log-log plot by taking the logarithm on both sides (note that they should be positive) so
that we get:
$$\log y = \log\left(a r^{-b}\right)$$
$$\log y = \log a + \log r^{-b}$$
$$\log y = \log a - b \log r \qquad (2.1)$$
Thus in a log-log plot of the data, the resulting function relating the two variables should be a
line (with negative slope $-b$).
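Equation 2.1 suggests a simple way to estimate a and b from data: fit a straight line to the points (log r, log y). The sketch below uses ordinary least squares on synthetic data generated with a = 1000 and b = 1; the data and the name fit_power_law are assumptions for illustration only.

```python
import math

def fit_power_law(ranks, sizes):
    """Least-squares line through (log r, log y); returns (a, b) for y = a * r**(-b)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(y) for y in sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x          # this is log a
    return math.exp(intercept), -slope           # the slope of the line is -b

ranks = list(range(1, 101))
sizes = [1000.0 * r ** -1.0 for r in ranks]      # synthetic, noise-free power law
a, b = fit_power_law(ranks, sizes)
print(round(a, 3), round(b, 3))
```

On real data the points only approximately lie on a line, and a good straight-line fit on a log-log plot is suggestive rather than conclusive evidence for a power law.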
2.3 Ecological Models
In biology and ecology, simulation models often make use of cellular automata due to their
insightfulness and easy implementation while still providing interesting and complex behaviors. Ecological models can be used to study social phenomena, immunology and epidemics,
population dynamics of different species etc. An artificial ecosystem consists of a number of
individuals (agents) which:
• Occupy a position in the environment
• Interact with the environment and with other agents
• Possess some internal state such as amount of energy or money
By examining the evolutionary process in an ecosystem it is possible to research the creation
and continuity of processes such as:
• Cooperation: E.g., trading behavior between individuals
• Competition: E.g., fighting behavior between individuals
• Imitation: E.g., an agent learns what he should do by looking at and imitating other
agents
• Parasitic behavior: An individual profits from another individual whereas the other
individual is harmed by this. Parasitic behavior can be found in many places in nature,
a good example of this are viruses.
• Communities: If a large group of individuals are put together they might form communities for the benefit of all. An example of this is fish-schools which can better protect
the fish from predators (especially the fish which swim in the middle). Another advantage of communities is that individuals can cooperate and specialise on their own
task.
2.3.1 Strategic Bugs
Bedau and Packard developed the artificial life model called Strategic bugs (1992). This
model of an ecosystem uses individuals which try to find food and reproduce. The model
consists of:
• An environment modelled as a 2-dimensional lattice.
• A cell in the environment can be occupied by food or by a bug or is empty
• Food will grow automatically in the environment; food is added in a cell with some
probability if there was no food or bug there
• Bugs survive by finding food
• Bugs use energy to move and die if they do not have any energy anymore
• Bugs can clone themselves or reproduce with another bug if they have sufficient energy.
The behavior of a bug evolves from the interaction of the policy of the bug and the
environment. The bug’s policy uses a lookup table to map environmental inputs to actions.
An example rule is: if there are more than 5 food units in the east, then make a step to the
east.
Bedau and Packard tried to come up with a measure for the evolutionary dynamics. If
such an ecosystem is simulated and new individuals will be generated all the time, then
the question is “What is really new and which individual traits are evolved in the system?”
For this they examined the evolutionary activity which looks at the genetic changes in the
chromosome strings. The experiments showed that there were waves of evolutionary activity,
new genetic material was often found after some time and then stayed in the population for
some period. Thus it was seen that new genetic material and therefore behavior was found
and exploited during the evolutionary process.
2.4 Artificial Market Models
Financial markets such as stock markets are difficult to predict. Some might think their behavior is
completely random, but the investors involved do not seem to make random decisions; on
the contrary, they seem to make rational ones. Thus it seems to be more of a chaotic process emerging from
the large number of investors and unforeseen circumstances.
One important question is under what conditions predictions about the dynamics of financial markets are possible. To study this question we first have to look
at the efficient market hypothesis (EMH). In an information-efficient market all price
fluctuations are unpredictable if all necessary investment information is taken into account
by the investors. The information is taken into account if the expectations, intentions, and
(secret) information of the market participants are incorporated in the prices. From this it follows that the more efficient a market is, the more random (and therefore unpredictable) the price fluctuations generated by
the market are. Basically this is caused by the
fact that if some investors had even a small information advantage, the
actions of these investors would immediately correct the prices, so that further gain would become
impossible.
2.4.1 Are real markets predictable?
Some people tend to make a lot of gain from stock markets. One important case is that of
an analyst who has such importance that (s)he is considered a guru for predicting which
stocks will rise and fall. If the guru tells everyone that stock X will increase a lot, then there
will be many people buying that stock. The effect is that of a self-fulfilling prophecy; the
stock price will increase since the prophet announced it and many people believe it and will
buy that stock. Only the buyers who were the last to buy that stock will lose money;
the investors who are quickest will gain money by selling immediately after the price
has increased sufficiently. There are other cases and reasons to believe that stock markets can
be predictable. One reason is that investors trade off expected risk and expected gain. This
means that a risk-averse (in contrast to a risk-seeking) investor will sell stocks with a high
risk even when they also offer an expected gain. The distribution between risk-averse and risk-seeking
individuals will then cause different price fluctuations, which are therefore not completely
random. In fact a number of studies have indicated that price fluctuations are not completely
random.
When we examine the efficient market hypothesis, we see that it requires rational and completely
informed investors. However these assumptions are not realistic. Investors are not completely
rational and sometimes hard to predict. Furthermore, information is often difficult to interpret, technologies and companies change, and there are costs associated with transactions
and information gathering.
One seemingly efficient method for trading stocks is to examine the relative competitive
advantage between different markets. When one compares some market to other markets,
one can see that one market (such as the bond market) was more promising during the
previous period, so that it is likely that more investors will move to that relatively more
advantageous market, which leads to more profit on that market. Comparing markets (e.g.
between countries, or between kinds of markets, such as bonds versus stocks) can therefore be a
good option.
2.4.2 Models of financial theories
For a long time people have been trying to come up with financial theories,
since a theory that works could earn its inventor a lot of money. It should be said, however, that
if you ever find a theory which works, you should not tell it to other people. The
reason is that your advantage in using this theory will be lost if everyone knows it. People
could even trade in such a way that you will lose money with your once so well working
theory. Therefore we can only show general approaches that have been invented to come up
with models to predict the price fluctuations:
• Psychological models. Here the model tries to analyse the risk-taking behavior of investors and examines how human attitudes to the market influence the stock prices.
• Learning models. Here data about the stock prices of the past is used to train a model
to predict its development in the future.
• Agent models. Here investors are modelled as agents which use particular strategies.
By letting the modelled agents interact the complex dynamic of stock markets can be
simulated.
• Evolutionary algorithms for developing strategies. Here the evolution of strategies of
investors is mimicked. Competitive strategies could be used to create other strategies.
Finally a strategy which was observed to gain most money in the past could be used to
trade in the future.
2.5 Artificial Art and Fractals
Iterating a simple function can create very complex, artistic patterns. This was shown by
Benoit Mandelbrot who discovered the Mandelbrot set, which is a fractal. A fractal is a
pattern which is self-similar at different scales, so if we look at a zoomed-in picture of some
details of the fractal we can recognize features which were also shown in the bigger pattern.
It should be said that a fractal can be very complex and not all small scale components look
similar to the whole pattern. So how can we get the Mandelbrot set? First of all consider the
function:
$$x_{k+1} = x_k^2$$
If we look at the starting values $x_0$ for which the iteration converges to a single point, we
can see that these are the values $-1 < x_0 < 1$, and the final point will be $x_\infty = 0$. If $x_0 < -1$
or $x_0 > 1$ then the value after many iterations goes to infinity. If $x_0$ is -1 or 1 then the point
will stay in 1, but this point is unstable, since small perturbations (changes of $x_k$) will let the
value go to 0 or $\infty$. In principle the set of values for which the iteration stays bounded is called the
Julia set, although more interesting Julia sets are associated to Mandelbrot sets as we will
see later. So for the function $f(x) = x^2$, the Julia set would be the region between -1 and 1.
In the space of real numbers, not so many interesting things can happen. But now let's
consider the use of complex numbers. Complex numbers consist of a real and an imaginary
part, so we write them as $x = ai + b$, where i is defined as $i = \sqrt{-1}$. We can add, subtract,
multiply and divide complex numbers just as we can with real numbers. For example if we
take $x = 3i$, then $x^2 = -9$. Complex numbers are used in many sciences such as in quantum
mechanics and electrical engineering, but we will not go into details about them here.
Now consider the functions of the type:
$$x_{k+1} = x_k^2 + C$$
The question is: if we start with $x_0 = 0$, for which complex numbers C will the iteration of
this function not become infinite? This set of complex numbers for which the iterations will
stay bounded is called the Mandelbrot set, and it is displayed in Figure 2.9. We can see its
complex shape in the complex plane (the real part is depicted on the x-axis and the imaginary
part of the points belonging to the set are shown on the y-axis). The points in black belong
to the Mandelbrot set, and the others do not. This is an example of a fractal, a self-similar
structure. The word fractal was also invented by Mandelbrot.
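Membership of the Mandelbrot set can be approximated numerically by iterating the function a limited number of times. The sketch below is an approximation under stated assumptions: an orbit is treated as unbounded once its absolute value exceeds 2 (a standard escape criterion for this iteration), and 200 iterations is an arbitrary cut-off, so points very close to the boundary may be misclassified.

```python
def in_mandelbrot(c, max_iter=200, escape_radius=2.0):
    """Approximate test: does x -> x*x + c stay bounded when started at x = 0?"""
    x = 0j
    for _ in range(max_iter):
        x = x * x + c
        if abs(x) > escape_radius:
            return False          # the orbit escapes, so c is not in the set
    return True                   # no escape seen within max_iter iterations

# Coarse text picture of the complex plane: real part on the horizontal axis,
# imaginary part on the vertical axis.
for im in range(10, -11, -2):
    print("".join("#" if in_mandelbrot(complex(re / 20, im / 10)) else "."
                  for re in range(-40, 11)))
```

For example, C = -1 belongs to the set (the orbit alternates between -1 and 0), whereas C = 1 does not (the orbit 1, 2, 5, 26, ... grows without bound).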
Figure 2.9: The Mandelbrot fractal
Now look at what happens if we zoom in on the picture. The zoomed-in figure of the lower
part of Figure 2.9 is shown in Figure 2.10. Note that this pattern is very similar to the original
Mandelbrot set, and we already see that there are many more self-similar structures to be
found in the picture.
Figure 2.10: A zoomed in pattern of the Mandelbrot fractal
Now, consider again the iterated function
$$x_{k+1} = x_k^2 + C$$
but now with a fixed value for C which is an element of the Mandelbrot set. Then
another question we can ask is: which initial values $x_0$ in the complex plane cause the iteration
to remain bounded? This set, which belongs to a particular value of C, is called the Julia set
for C. An example pattern from the Julia set is shown in Figure 2.11.
Figure 2.11: An example pattern from the Julia set
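The same escape test can be reused for this second question by fixing C and varying the starting value. The sketch below is again an approximation with an assumed escape radius of 2 and an arbitrary iteration limit; in the literature the set of starting values with bounded orbits is often called the filled Julia set.

```python
def in_julia(x0, c, max_iter=200, escape_radius=2.0):
    """Approximate test: does x -> x*x + c stay bounded when started at x0?"""
    x = x0
    for _ in range(max_iter):
        if abs(x) > escape_radius:
            return False
        x = x * x + c
    return abs(x) <= escape_radius

c = -1 + 0j    # a value of C that lies in the Mandelbrot set
for im in range(10, -11, -2):
    print("".join("#" if in_julia(complex(re / 10, im / 10), c) else "."
                  for re in range(-20, 21)))
```

With C = 0 this reduces to the first example of the section: real starting values between -1 and 1 stay bounded and all others escape.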
Computer artists like to use fractals, since although the equations are simple, as long as
they are non-linear (linear maps cannot produce interesting patterns like fractals) they can
produce a large variety of complex patterns, and zooming in in the pictures creates many
other patterns. This is just another example of using simple rules to create very complex
patterns.
2.6 Conclusion
Artificial life is useful for simulating many biological, physical, sociological, and economic
processes. One goal of artificial life is to understand the principles underlying living entities
and the emergence of life forms. Artificial life can be combined with genetic algorithms
for optimizing individual behaviors by adapting them to the (changing) environment. If
multiple individuals adapt themselves and also adapt the environment, the resulting dynamics
can be very complex and unpredictable. Even with simple entities such as used in cellular
automata, complex behavior can result from the interaction between simple components.
Cellular automata are very useful for modelling and visualizing spatial processes such as
forest fires and can be used to study the behavior of many different complex processes. One
interesting thing is that cellular automata are just as powerful as Turing machines which
means that any computable function can be implemented using a cellular automaton.
Another aspect of artificial life is the study of price dynamics in financial markets. Although an efficient market would be completely unpredictable, in reality there are many
reasons to believe that price fluctuations are not completely random. Making models for predicting price changes is a challenging research topic, although theories that are found may never be
published, since they would lose their usefulness once they are known by many investors.
Finally we have shown that using the complex plane, simple iterative functions can create
complex patterns, called fractals. Examples of these are the Mandelbrot and Julia sets.
Computer artists like to use fractals, because they look complex, but are easy to make.
Fractals also play a role in chaotic systems as we will see in a later chapter.
Chapter 3
Evolutionary Computation
Inspired by the success of nature in evolving such complex creatures as human beings, researchers in artificial intelligence have developed algorithms which are based on evolution
theory. The class of these algorithms is called evolutionary algorithms and consists among
others of genetic algorithms, evolutionary strategies, and genetic programming. Genetic algorithms (GAs) are the most famous ones and they were invented by John Holland. Evolutionary
algorithms are optimisation algorithms that are inspired by Darwin's evolution theory, known
as natural selection or survival of the fittest, and they were developed during the 1960's and
1970's. One of their strengths is that they can find very good solutions in very large search
spaces, where exhaustive search (trying out all possible solutions) would cost far too much
time. The principle of evolutionary algorithms is that solutions are evaluated after which the
best solutions are allowed to reproduce most offspring (children). If the parent individuals
form good solutions, they are likely to possess good building blocks of genetic material (the genetic material makes up the solution) that may be useful for creating new individuals. Genetic
algorithms usually take two parent individuals and they recombine their genetic material to
produce a child that inherits genetic material from both parents. If the child performs well on
the evaluation test (evaluating an individual and measuring how well an individual performs
is commonly done by the use of a fitness function), it will also be selected for reproduction
and in this way the genetic material can again be propagated to new generations. Since the
individuals themselves will usually die (they are often replaced by individuals of the next
generation), Richard Dawkins came up with the selfish gene hypothesis. This hypothesis says
that basically the genes are alive and use the mortal individuals (e.g. us) as hosts so that
they are able to propagate themselves further. Some genes may be found in many individuals,
whereas other genes are only found in a small subset of individuals. In this way, the genes
seem to compete for hosts, and genes which occupy well performing individuals are likely to
be able to reproduce themselves. The other way around we can say that genes which occupy
well performing individuals give advantages for the individual and therefore it is good if they
are allowed to reproduce.
In this chapter we will look at evolutionary algorithms in general and focus on genetic
algorithms, although most issues involved also play a role for other evolutionary algorithms.
We first describe optimisation problems and then examine which steps should be pursued for
constructing an evolutionary algorithm, and what kind of representations are useful for the
algorithm for solving a particular problem. Finally we will examine some other evolutionary
algorithms.
3.1 Solving Optimisation Problems
A lot of research in computer science and artificial intelligence has been devoted to solving
optimisation problems. There are many different optimisation problems; e.g. one of them is
shortest path-planning which requires the algorithm to compute the shortest path from a state
to a particular goal state. Well known applications for such algorithms are planners used by
cars (e.g. the Carin system) or for train-passengers. In principle shortest path problems are
simple problems, and can be solved efficiently by algorithms such as Dijkstra’s shortest path
algorithm or the A* algorithm. These algorithms can compute the shortest path in a very
short time for problems consisting of more than 100,000 cities (or nodes if we formalise the
problem as a graph using nodes and weighted edges representing the distances of connections
between nodes). On the other hand, there also exist combinatorial optimisation problems
which are very hard to solve. One example is the traveling salesman problem (TSP). This
problem requires that a salesman goes to N customers which live in different cities, so that
the total tour he has to make from his starting city to single visits to all customers and
back to his starting place should be minimal. The decision version of this problem is known to be
NP-complete and the optimisation version is NP-hard, so unless P = NP it is not solvable in polynomial time. For example if we use an
exhaustive search algorithm which computes and evaluates all possible tours, then it has to
examine about N! tours, a number which grows even faster than exponentially with N. Thus for a problem with 50
cities, the exhaustive search algorithm would need to evaluate 50! solutions. Let’s say that
evaluating one solution costs 1 nanosecond (which is $10^{-9}$ second), then evaluating all possible
solutions would cost about $9.6 \times 10^{47}$ years, which is therefore much longer than the age of the
universe. Clearly exhaustive search approaches cannot be used for solving such combinatorial
optimisation problems and heuristic search algorithms have to be used which can find good
solutions in a short time, although they do not always come up with the optimal solution.
There are a number of different heuristic search algorithms such as Tabu search, simulated
annealing, multiple restart local hill-climbing, ant colony algorithms, and genetic algorithms.
Genetic algorithms differ from the others in the way that they keep a population of solutions
and use recombination operators to form new solutions.
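The running-time estimate for exhaustive search on a 50-city problem can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and the year length of 365.25 days are our own assumptions, and the cost of 1 nanosecond per tour is the one assumed in the text.

```python
import math

# Assumption: a year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def exhaustive_search_years(n_cities, seconds_per_tour=1e-9):
    """Years needed to evaluate n_cities! tours at a fixed cost per tour."""
    tours = math.factorial(n_cities)
    return tours * seconds_per_tour / SECONDS_PER_YEAR

print(f"{exhaustive_search_years(50):.1e} years")
```

Even for 20 cities the same function already reports several decades, which illustrates how quickly N! grows.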
3.1.1 Formal description of an optimisation problem
Optimisation problems consist of two components; the representation space and the evaluation
(or fitness) function. The representation space denotes all possible solutions. For example if
we want to solve the TSP, the representation space consists of all possible tours which are
encoded in some specific way. If we want to throw a spear at some target and can select
the force and the angle to the ground, the representation space might consist of 2 continuous
dimensions which take on all possible values for the force and angle. On the other hand,
one could restrict this space by allowing only angles between 0 and 360 degrees and positive
forces which are smaller than the maximum force one can use to throw the spear. Let’s call
the representation space S and a single solution s ∈ S.
The evaluation function (which in the context of evolutionary algorithms is usually called
a fitness function) compares different solutions to each other. Although solutions could be
compared on multiple criteria, let’s assume for now that there is a single fitness function f (.)
which maps a solution s to a specific fitness value $f(s) \in \Re$. The goal is to find the solution
$s_{max}$ which has the maximal fitness:
$$f(s_{max}) \geq f(s) \quad \forall s \in S$$
It may happen that there are multiple different solutions with the same maximal fitness value.
We may then be required to find all of them, or only one (which is of course simpler).
So the goal is to search through the representation space for a solution which has the
maximal possible fitness value given the fitness function f (.). Since the representation space
may consist of a huge number of possible solutions or may be continuous, the optimal solution
may be very hard to find. Therefore, in practice algorithms are compared by their best found
solutions within the same amount of computational time. Among these algorithms there
could also be a human (expert) who tries to come up with a solution, but if the fitness
function gets more complicated and the representation space becomes bigger, the ability
of computers to try out millions of solutions within a short period of time
outcompetes the ability of any human to find a good solution.
3.1.2 Finding a solution
Heuristic search algorithms usually start with one or more random solutions which are then
evaluated. For example local hill-climbing starts with a random solution and then changes
this solution slightly in some way. Then, this new solution is evaluated and if it is a better one
than the previous one, it is kept and otherwise the previous one is kept. This simple process is
repeated until the solution is good enough or time has expired. The local hill-climbing algorithm
looks as follows:
• Generate initial solution $s_0$; $t = 0$
• Repeat until the stopping criterion holds:
    • $s_{new} = change(s_t)$
    • if $f(s_{new}) \geq f(s_t)$ then $s_{t+1} = s_{new}$
    • else $s_{t+1} = s_t$
    • $t = t + 1$
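The steps above can be sketched in Python as follows. This is a minimal sketch under our own assumptions: solutions are bit strings, the change function flips one randomly chosen bit, the fitness function simply counts the ones, and the stopping criterion is a fixed number of steps. None of these choices is prescribed by the algorithm itself.

```python
import random

def onemax(bits):
    """Illustrative fitness function (assumption): the number of ones."""
    return sum(bits)

def change(solution, rng):
    """Keep most of the old solution and flip one randomly chosen bit."""
    neighbour = list(solution)
    i = rng.randrange(len(neighbour))
    neighbour[i] = 1 - neighbour[i]
    return neighbour

def hill_climb(fitness, initial, steps, rng):
    current = list(initial)
    for _ in range(steps):                 # stopping criterion: fixed step budget
        candidate = change(current, rng)
        if fitness(candidate) >= fitness(current):
            current = candidate            # keep the new solution
        # otherwise the previous solution is kept
    return current

rng = random.Random(0)
start = [rng.randint(0, 1) for _ in range(20)]
best = hill_climb(onemax, start, steps=500, rng=rng)
print(onemax(start), "->", onemax(best))
```

Because a candidate is only accepted when it is at least as fit as the current solution, the fitness of the returned solution can never be lower than that of the initial one.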
Using this algorithm and a random initial solution $s_0$, a sequence of solutions $s_0, s_1, \ldots, s_T$
is generated, where each later solution has a larger or equal fitness value compared to all
preceding solutions. The most important function in this algorithm is the function change.
By changing a solution, we do not mean to generate a new random solution, since if we would
generate and evaluate random solutions all the time, there would not be any progressive
search towards a better solution. Instead, random search would probably work just as well
as exhaustive search and is not a heuristic search algorithm. So it should be clear that the
function change should keep some part of the old solution in the new solution and change
some other part. As an example consider a representation space consisting of bitstrings of
some specific length N. It is clear that the representation space in this case is: $S = \{0, 1\}^N$.
Now we could make a function change which changes a single bit (i.e. mutating it from 0 to
1 or from 1 to 0). In this case a solution would have N neighbours with this change operator.
Now one possible local hill-climbing algorithm would try all solutions in the neighbourhood
of the current solution and then select the best one as $s_{new}$. Or, alternatively, it could select
a single random solution from the neighbourhood. In both cases, for many fitness functions,
the local hill-climbing algorithm could get stuck in a local optimum. A local optimum is a
solution which is not the global optimum (the best solution in the representation space), but
one which cannot be improved using the specific change operator. Thus, a local optimum
is the best one in a specific subspace (or attractor in the fitness landscape). Since the local
hill-climbing algorithm would not generate a new solution if it has found a local optimum, the
algorithm gets stuck and will not find the global optimum. This could be avoided of course
by changing the change operator, but this is not trivial. If we allow the change
operator to change two bits, the neighbourhood becomes bigger, but since still not all
solutions can be reached, we can again easily get trapped in a local optimum. Only if we allow
the change operator to change all bits can we always eventually find the global optimum,
but as mentioned before changing all bits amounts to exhaustive or random search. A
solution to the above problem is to change each bit with a specific small probability. In this way,
usually small changes will be made, but it is always possible to escape from a local optimum
with some probability. Another possibility is used by algorithms such as simulated annealing,
which always accepts improving solutions, but can also select a new solution with a lower fitness
value than the current one, albeit with a probability smaller than 1. Specifically, simulated
annealing accepts a new solution with probability:
$$\min\left(1, e^{(f(s_{new}) - f(s_t))/T}\right)$$
where T is the temperature which allows the algorithm to explore more (using a large T )
or to only accept improving solutions (using T = 0). Usually the temperature is cooled
down (annealed) starting with a high temperature and ending with a temperature of 0. If
annealing the temperature from infinity to 0 is done with very slow steps, the algorithm will
finally converge to the global optimum. However, in practice annealing has to be done faster
and the algorithm usually converges to a local maximum just like local hill-climbing. A practical
method to deal with this is to use multiple restarts with different initial solutions and finally
to select the best solution found during all runs.
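The acceptance rule of simulated annealing can be sketched as follows. The linear cooling schedule, the one-bit change operator, the use of the number of ones as fitness and the bookkeeping of the best solution seen so far are all assumptions made for illustration. The case T = 0 is handled separately, since the formula would otherwise divide by zero.

```python
import math
import random

def acceptance_probability(f_new, f_current, temperature):
    """min(1, exp((f_new - f_current) / T)); T = 0 accepts only non-worsening moves."""
    if f_new >= f_current:
        return 1.0
    if temperature <= 0:
        return 0.0
    return math.exp((f_new - f_current) / temperature)

def simulated_annealing(fitness, change, initial, steps, t_start, rng):
    current = best = initial
    for t in range(steps):
        temperature = t_start * (1 - t / steps)   # assumed linear cooling schedule
        candidate = change(current, rng)
        p = acceptance_probability(fitness(candidate), fitness(current), temperature)
        if rng.random() < p:
            current = candidate
        if fitness(current) > fitness(best):      # remember the best solution seen
            best = current
    return best

def flip_one_bit(bits, rng):
    """Assumed change operator: flip one randomly chosen bit."""
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

rng = random.Random(0)
start = [rng.randint(0, 1) for _ in range(20)]
best = simulated_annealing(sum, flip_one_bit, start, steps=2000, t_start=2.0, rng=rng)
print(sum(start), "->", sum(best))
```

Multiple restarts then amount to calling `simulated_annealing` several times with different initial solutions and keeping the fittest result.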
3.2 Genetic Algorithms
In contrast to local hill-climbing and simulated annealing, genetic algorithms use a population
of individuals to search for solutions. The advantage of a population is that the search is done
in a distributed way and that individuals are enabled to exchange genetic material (in principle
the individuals are able to communicate). Making the search using a population also allows
for parallel computation, which is especially useful if executing the fitness function costs a
long time. However, it would also be possible to parallelize local hill-climbing or simulated
annealing, so that different initial solutions are brought to different final solutions after which
the best can be selected. Therefore the real advantage lies in the possibility of individuals
to exchange genetic material by using recombination operators and by the use of selective
pressure on the whole population so that the best individuals are most likely to reproduce and
continue the search for novel solutions. A genetic algorithm looks as follows in pseudo-code:
1. Initialize a population of N individuals
2. Repeat:
(a) Evaluate all individuals in the population using the fitness function
(b) Repeat N times:
• Select two individuals for reproduction according to their fitness values
• Recombine these two parent individuals to create one offspring
• Mutate the offspring
• Insert the offspring in a new population
(c) Replace the population by the new population
Every individual has a state, and since a population consists of N individuals, the
population also has a state. Therefore after each iteration of this algorithm (usually called a
generation), the population state makes a transition to a new state. Finally after a long time,
it may happen that the population contains the optimal solution. Since the optimal solution
may get lost again, we always store the best solution found so far (or alternatively the
elitist strategy may be used, which always copies the best solution found to the new population).
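The pseudo-code can be turned into a small runnable sketch. Everything specific here is an assumption chosen for illustration: individuals are bit strings, fitness is the number of ones, parents are drawn with probability proportional to fitness plus one, recombination is 1-point crossover, and the mutation probability is one over the string length. The best solution found so far is stored separately, as described above.

```python
import random

def onemax(bits):
    """Illustrative fitness function (assumption): the number of ones."""
    return sum(bits)

def select(population, fitnesses, rng):
    """Fitness-proportional selection; the +1 keeps every weight positive."""
    weights = [f + 1 for f in fitnesses]
    return rng.choices(population, weights=weights, k=1)[0]

def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, p_m, rng):
    return [1 - b if rng.random() < p_m else b for b in bits]

def genetic_algorithm(fitness, n_individuals, n_bits, generations, rng):
    # 1. Initialize a population of N individuals
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(n_individuals)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # (a) Evaluate all individuals
        fitnesses = [fitness(ind) for ind in population]
        # (b) Create N offspring by selection, recombination and mutation
        new_population = []
        for _ in range(n_individuals):
            parent_a = select(population, fitnesses, rng)
            parent_b = select(population, fitnesses, rng)
            child = one_point_crossover(parent_a, parent_b, rng)
            new_population.append(mutate(child, 1 / n_bits, rng))
        # (c) Replace the population and remember the best solution so far
        population = new_population
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best
    return best

best = genetic_algorithm(onemax, n_individuals=30, n_bits=20,
                         generations=40, rng=random.Random(1))
print(onemax(best), best)
```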
3.2.1 Steps for making a genetic algorithm
For solving real world problems with genetic algorithms, such as a time-tabling problem which
requires us to assign, for example, drivers to buses so that every bus has one driver and
no driver has to drive when (s)he has indicated that (s)he does not want to drive, the question
arises how to make a representation of the problem. This is often more art than science,
and research has indicated that particular representations allow better solutions to be found
much earlier. For other problems, making a representation does not need to be hard but the
chosen representation can influence how fast good solutions are found. Take for example the
colouring problem, which is also an NP-hard problem. In a colouring problem multiple cities
may be connected to each other and we want to assign different colours to cities if they are
connected. The goal is to find a feasible solution while minimising the number of colours used.
To solve this problem we may choose a representation which consists of N numbers, where N
is the number of cities and each number indicates the colour assigned to a city. On the other
hand, we could also design a representation in which we have a maximum of M colours and
N × M binary states, in which each element of the list of N × M states indicates whether a given city
has a given colour or not. One should note that the second representation is larger, although it
requires only binary states. Furthermore, the second representation generates infeasible
solutions (solutions which do not respect the conditions of the problem) much more easily,
since it allows cities to have multiple colours or none at all. Therefore, the first representation should
be preferred.
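The difference between the two representations can be made concrete with a short sketch. The function names, the row-by-row layout of the N × M bit list and the small three-city example are assumptions chosen for illustration.

```python
def is_feasible(colours, edges):
    """First representation: colours[i] is the colour number of city i."""
    return all(colours[a] != colours[b] for a, b in edges)

def decode_binary(bits, n_cities, n_colours):
    """Second representation: bits[city * n_colours + c] == 1 means city has colour c."""
    return [[c for c in range(n_colours) if bits[city * n_colours + c] == 1]
            for city in range(n_cities)]

def binary_is_legal(bits, n_cities, n_colours):
    """A bit list is only legal if every city has exactly one colour."""
    return all(len(cs) == 1 for cs in decode_binary(bits, n_cities, n_colours))

edges = [(0, 1), (1, 2)]            # city 1 is connected to cities 0 and 2
print(is_feasible([0, 1, 0], edges))                 # every list of N numbers is legal
print(binary_is_legal([1, 0, 0, 1, 1, 0], 3, 2))     # one colour per city
print(binary_is_legal([1, 1, 0, 0, 1, 0], 3, 2))     # city 0 has two colours, city 1 none
```

With the first representation only the feasibility of the colouring has to be checked, whereas with the second one a mutation of a single bit is already enough to leave the set of legal solutions.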
Besides constructing a representation, we also need to find ways to initialize a population, to construct a mapping from genotype to phenotype (the genotype is the encoding in
the chromosome on which the genetic operators work, whereas the phenotype is tested using
the fitness function), and also to make a fitness function for evaluating an individual (some
fitness functions would favour the same optimal solution, but one of these can be more useful
for the genetic algorithm to find it).
There are also more specific steps; we need to design a mutation operator, a recombination
operator, we have to determine how parents are selected for reproduction, we need to decide
how individuals are used to construct a new population, and finally we have to decide when
the algorithm has to stop. We will explain these steps in more detail below.
3.2.2 Constructing a representation
The first decision we have to make when we want to implement a genetic algorithm for solving
a specific problem is the representation we want to use. As mentioned above, there are often
many possible representations, and therefore we have to examine the problem to choose one.
Although the representation is often the first decision, we also have to take into account a
possible fitness function and which genetic operators (mutation and crossover) we would like
to use. For example, if we want to evolve a robot which drives as fast as possible without
hitting any obstacles, we could decide to use a function which maps sensory information of the
robot to actions (e.g. left motor speed and right motor speed). The obvious representation
used in this case would consist of continuous parameters making up the function. Therefore,
we may prefer to use particular representations which allow for continuous numbers, although
this is not strictly necessary since we may also construct the genotype to phenotype mapping
in some way that converts discrete symbols to continuous numbers.
Binary representations and finite discrete sets
The most often used representation in genetic algorithms uses binary values, encoding a chromosome using a bitstring of N bits. See Figure 3.1 for an example. Of course it would also
be possible to use a different set of discrete values, e.g. like the one used by biological DNA:
{C, G, A, T }. It depends on the problem whether a binary representation would be more
suitable than using different sets of values. It should be said that by concatenating two
neighboring binary values, one could also encode each value from a set containing 4 different
values. However, in this case a binary encoding would not be preferred, since the recombination operator would not respect the primitive element being a single symbol and could easily
destroy such symbols through crossover. Furthermore, a solution in which each primitive symbol
is mapped to a single gene would be more readable.
Chromosome: 1 0 1 0 0 0 1 1 (each position is a gene)
Figure 3.1: A chromosome which uses a binary representation and which is therefore encoded
as a bitstring.
If we have a binary representation for the genotype, we can still use it to construct different
representations for phenotypes. It should be said that search using the genetic operators takes
place in the genotype space, but the phenotype is an intermediary representation which is
easier to evaluate by the fitness function. Often, however, the mapping from genotype to
phenotype can be an identity mapping meaning that they are exactly the same.
For example, using the 8-bit genotype given before, we can construct an integer number
by computing the natural value of the binary representation. E.g. in the example genotype of
Figure 3.1 we could convert the genotype to the integer: $2^7 + 2^5 + 2^1 + 2^0 = 163$. Alternatively,
if we want a phenotype which is a number between 2.5 and 20.5 we could compute
$x = 2.5 + \frac{163}{256}(20.5 - 2.5) = 13.9609$.
Thus, using a mapping from genotype to phenotype gives us additional freedom. In the
first example, small changes of the genotype (e.g. mutating the first bit) would correspond
to big changes in the phenotype (changing from 163 to 35). We note, however, that in the
second example, not all solutions between 2.5 and 20.5 can be represented using the limited
precision of the 8-bit genotype.
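Both genotype to phenotype mappings of this example can be written down directly. The function names are our own; the scaling by $2^8 = 256$ follows the computation in the text.

```python
def bits_to_int(bits):
    """Natural value of a bitstring, most significant bit first."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value

def bits_to_real(bits, low, high):
    """Map a bitstring to a real number in the interval starting at low."""
    return low + bits_to_int(bits) / 2 ** len(bits) * (high - low)

genotype = [1, 0, 1, 0, 0, 0, 1, 1]          # the chromosome of Figure 3.1
print(bits_to_int(genotype))                              # 163
print(round(bits_to_real(genotype, 2.5, 20.5), 4))        # 13.9609

mutated = [0] + genotype[1:]                 # mutate the first bit
print(bits_to_int(mutated))                               # 35
```

Two neighbouring integers differ by (20.5 − 2.5)/256 ≈ 0.07 in the real-valued phenotype, which is the limited precision mentioned above.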
Representing real numbers
If we want to construct a phenotype of real numbers, it is a more natural way to encode
these real numbers immediately in the genotype and to search in the space of real numbers.
We have already seen that this can lead to more precise solutions, since the binary encoding
would have a limited precision unless we use a very long bitstring. Another advantage is that
the encoding is much smaller, although this comes at the cost of creating a continuous search
space.
Thus, if our problem requires the combined optimisation of n real numbers we could use
a genotype $X = (x_1, x_2, \ldots, x_n)$ where $x_i \in \Re$. The representation space would therefore be
$S = \Re^n$. For real numbered representations, we have to use a fitness function which maps
a solution to a real number, therefore the fitness function is a mapping $f : \Re^n \to \Re$. This
encoding is often used for parameter optimisation, e.g. when we want to construct a washing
machine which has to determine how much water to consume, how much power to use for
turning the cabinet, etc. The fitness function could then trade off costs against the quality of
the washing machine.
Representing ordering problems
For particular problems there are natural constraints which the representation should obey.
An example is the traveling salesman problem which requires a solution that is a tour from
a starting city to a last city while visiting all cities in between exactly once. A natural
representation for such an ordering problem is to use a list of numbers where each number
represents a city. An example is the chromosome in Figure 3.2.
Chromosome: 3 4 8 6 1 2 7 5
Figure 3.2: A chromosome which uses a list encoding of natural numbers to represent ordering
problems.
3.2.3 Initialisation
Before running the genetic algorithm, one should have an initial population. Often one does
not have any a-priori knowledge of the problem so that the initialisation is usually done using
a pseudo-random generator. As with all decisions in a GA, the initialisation also depends on
the representation, so that we have different possible initialisations:
• Binary strings. Each single bit on each location in the string of each individual receives
50% probability to become a 0 and 50% probability to become a 1. Note that the whole
string will then likely contain about as many 0’s as 1’s. If we have a-priori knowledge, we
might want to change the generation probability of 50%. For discrete sets with
more than 2 elements, one can choose uniformly at random between all possible symbols
to initialize each location in a genetic string.
• Real numbers. If the space of the real numbers is bounded by lower and higher limits,
it would be natural to generate a uniform number in between these boundaries. If we
have an unbounded space (e.g. the space of all real numbers) then we cannot generate
uniformly distributed random numbers, but have to use for example a Gaussian distribution
with a mean value and a standard deviation for initialisation. If one would not have
any a-priori information about the location of fit individuals, initialisation in this case
would be difficult, and one should try some short runs with different initialisations to
locate good regions in the fitness landscape.
• Ordered lists. In this case, we should take care that we have a legal initial population
(each city has to be represented in each individual exactly one time). This can be easily
done by generating numbers randomly and eliminating those numbers that have been
used before during the initialisation of an individual coding a tour.
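The three kinds of initialisation listed above could be sketched as follows. The function names and parameters are assumptions. For ordered lists we shuffle the list of all cities, which yields a random legal tour just like drawing numbers and discarding those already used.

```python
import random

def init_bitstring(n_bits, rng, p_one=0.5):
    """Each bit becomes 1 with probability p_one (50% without prior knowledge)."""
    return [1 if rng.random() < p_one else 0 for _ in range(n_bits)]

def init_real_bounded(lows, highs, rng):
    """Uniform initialisation between a lower and an upper limit per gene."""
    return [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]

def init_real_gaussian(n, mean, std, rng):
    """Gaussian initialisation for an unbounded space."""
    return [rng.gauss(mean, std) for _ in range(n)]

def init_tour(n_cities, rng):
    """A legal tour: every city 1..n_cities appears exactly once."""
    tour = list(range(1, n_cities + 1))
    rng.shuffle(tour)
    return tour

rng = random.Random(42)
print(init_bitstring(8, rng))
print(init_real_bounded([0.0, -1.0], [10.0, 1.0], rng))
print(init_real_gaussian(3, mean=0.0, std=1.0, rng=rng))
print(init_tour(8, rng))
```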
Sometimes, one possesses a-priori knowledge of possible good solutions. This may be
through heuristic knowledge or from previous runs of the genetic algorithm or another optimisation algorithm. Although this has the advantage that the starting population may have
higher average fitness, there are also some disadvantages to this approach:
• It is more likely that genetic diversity in the initial population is decreased, which can
make the population converge much faster to a population of equal individuals.
• Due to the initial bias which is introduced in this way, it is more difficult for the
algorithm to search through the whole state space, possibly making it almost impossible
to find a global optimum which is distant from the individuals in the initial population.
3.2.4 Evaluating an individual
Since most operations in a genetic algorithm can be executed in a very short time, the
time needed for evaluating an individual is often a bottleneck. The evaluation can be done
by a subroutine, a (black-box) simulator, or an external process (e.g. robots). In some
cases evaluating an individual can be quite fast, e.g. in the traveling salesman problem the
evaluation would cost at most a number of computations which is linear in the number of
cities (i.e. one can simply sum all the distances between cities which are directly connected
in the tour). In other cases, especially for real world problems, evaluating an individual can
consume a lot of time. For example if one wants to use genetic algorithms to learn to control
a robot for solving some task, even the optimal controller might already take several minutes
to solve the task. Clearly in such a case, populations cannot be very large and the number of
generations should also be limited. One method to reduce evaluation time for such problems
is to store the evaluations of all individuals in memory, so that a possible solution which has
already been evaluated before does not need to be re-evaluated.
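Both points of this paragraph, the linear-time evaluation of a tour and the storing of earlier evaluations, can be illustrated with a short sketch. The class name, the coordinates and the use of the negative tour length as fitness are assumptions made for this example.

```python
import math

def tour_length(tour, coords):
    """Sum of the distances between consecutive cities, returning to the start."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
               for i in range(n))

class CachedFitness:
    """Wraps a fitness function and stores the result for every individual seen."""

    def __init__(self, fitness):
        self.fitness = fitness
        self.cache = {}
        self.evaluations = 0        # number of real (expensive) evaluations

    def __call__(self, individual):
        key = tuple(individual)
        if key not in self.cache:
            self.cache[key] = self.fitness(individual)
            self.evaluations += 1
        return self.cache[key]

coords = [(0, 0), (0, 1), (1, 1), (1, 0)]      # four cities on a unit square
fitness = CachedFitness(lambda tour: -tour_length(tour, coords))

print(fitness([0, 1, 2, 3]))    # evaluated: -4.0
print(fitness([0, 1, 2, 3]))    # looked up in the cache
print(fitness.evaluations)      # 1
```

Such a cache only pays off when the same individual is likely to reappear and when one evaluation is much more expensive than a dictionary lookup.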
If the evaluation time is so long that too few solutions can be evaluated in order for the
algorithm to come up with good solutions starting with a random initial population, one
could try to approximate the evaluation function by a model which is much faster albeit not
as accurate as the real evaluation function. After evolving populations using this approximate
fitness function, the best individuals may be further evolved using the real fitness function. A
possibility for computing an approximate fitness function is to evaluate a number of solutions
and to use a function approximator (such as a neural network) to learn to approximate the
fitness landscape. Since the approximate fitness function often does not approximate the real
one accurately, one should not run too many generations to find optimal solutions for this
approximate fitness function, but only use it to come up with a population which can perform
reasonably in the real problem. In case of robotics, some researchers try to come up with
very good simulators which make the evolution much faster than executing the robots in the
real world. If the simulator accurately models the problem in the real world, good solutions
which have been evolved using the simulator often also perform very well in the real world.
Another role of the fitness function is to deal with constraints on the solution
space. For particular problems there may be hard or soft constraints which a solution has to
obey. Possibilities to deal with such constraints are:
• Use a penalty term which punishes illegal solutions. A problem with this approach is that
in some cases where there are many constraints a large proportion of a population may
consist of illegal solutions, and even if these are immediately eliminated, they make the
search much less efficient.
• Use specific evolutionary operators which make sure that all individuals form legal
solutions. This is often preferred, but can be harder to implement, especially if not all
constraints in the problem are known.
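As an illustration of the first possibility, a penalty term for the colouring problem of Section 3.2.1 might look as follows. The weight of 10 per violated constraint and the use of the negative number of colours as base fitness are assumptions, not values taken from the text.

```python
def penalised_fitness(colours, edges, penalty=10.0):
    """Fewer colours is better; every pair of connected cities that shares
    a colour is punished with a fixed penalty (assumed weight)."""
    violations = sum(1 for a, b in edges if colours[a] == colours[b])
    return -len(set(colours)) - penalty * violations

edges = [(0, 1), (1, 2)]
print(penalised_fitness([0, 1, 0], edges))   # legal, two colours: -2.0
print(penalised_fitness([0, 0, 0], edges))   # one colour, two violations: -21.0
```

The penalty has to be large enough that an illegal solution never scores better than a legal one, which is why it outweighs the gain of saving a colour here.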
3.2.5 Mutation operators
In genetic algorithms there are two operators which determine the search for solutions in the
genotype space. The first one is mutation. Mutation is used to perturb (slightly change)
an individual so that a new individual is created, but which still resembles the previous one
(in genetic algorithms mutation is often performed after recombination so that the previous
one is already a new individual). Mutation is an important operator, since it allows us
to explore the representation space. Without it, it would become possible that the whole
population contains the same allele (value on some locus or location in the genetic string),
so that different values for this locus would never be examined. Mutation is also useful to
create more diversity and to escape from a converged population which otherwise would not
explore different solutions anymore. It is possible to use different mutation operators for the
same representation, but it is important that:
• At least one mutation operator should make it possible to search through the whole
space of solutions
• The size of the mutation operator should be controllable
• Mutation should create valid (legal) individuals
Mutation for binary representations
Mutation on a bitstring usually is performed by changing a bit to its opposite (0 → 1 or
1 → 0). This is usually done on each locus of a genetic string with some probability Pm .
Thus the mean number of mutations is N Pm where N is the length of the bitstring. By
increasing Pm the algorithm becomes more explorative, but may also lose more important
genetic material that was evolved before. A good heuristic is to set $P_m = 1/N$, which
creates a mean number of mutations of 1. Figure 3.3 shows schematically how mutation is
done on a bitstring.
Before mutation: 1 1 1 1 1 1 1 1
After mutation: 1 1 1 0 1 1 1 1 (the fourth gene is mutated)
Figure 3.3: A chromosome represented as a bitstring is changed by mutation.
In case of multi-valued discrete representations with a finite number of elements, mutation is usually done by first examining each locus and using the probability Pm to choose
whether mutation should occur, and if a mutation should occur, each possible symbol has
equal probability to replace the previous symbol on that location in the chromosome.
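Both mutation operators for discrete representations can be sketched in a few lines. The function names are our own, and choosing the replacement symbol only among the symbols that differ from the current one is an assumption.

```python
import random

def mutate_bits(bits, p_m, rng):
    """Flip every bit independently with probability p_m."""
    return [1 - b if rng.random() < p_m else b for b in bits]

def mutate_symbols(symbols, alphabet, p_m, rng):
    """With probability p_m replace a symbol by a uniformly chosen other symbol."""
    mutated = []
    for s in symbols:
        if rng.random() < p_m:
            s = rng.choice([a for a in alphabet if a != s])
        mutated.append(s)
    return mutated

rng = random.Random(3)
bits = [1] * 8
print(mutate_bits(bits, 1 / len(bits), rng))            # on average one mutation
print(mutate_symbols(list("CGATTACA"), "CGAT", 0.25, rng))
```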
Mutation for real numbers
If a representation of real numbers is used, we also need a different mutation operator. We
can use the same procedure as before to select a locus which will be mutated with probability
Pm . But now the value of the locus is a real number. We can perturb this number using a
particular form of added randomness. Usually Gaussian distributed zero-mean noise is used
with a particular standard deviation, so that we get for the chosen value of the gene xi in a
chromosome:
$$x_i \leftarrow x_i + \mathcal{N}(0, \sigma)$$
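A minimal sketch of this operator, with a function name and parameter values that are our own assumptions:

```python
import random

def mutate_reals(xs, p_m, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each
    gene that is selected for mutation with probability p_m."""
    return [x + rng.gauss(0.0, sigma) if rng.random() < p_m else x for x in xs]

rng = random.Random(0)
print(mutate_reals([1.0, 2.0, 3.0, 4.0], p_m=0.25, sigma=0.1, rng=rng))
```

Here the standard deviation sigma controls the size of the mutation, in the same way as the number of flipped bits does for bitstrings.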
Mutation for ordered representations
For mutating ordered representations we should try to make sure that the resulting individual
respects the constraints of the problem. That means that for a traveling salesman problem
all cities are used exactly one time in the chromosome. We can do this by using a swap of
two values on two different loci. Thus we generate two locations and swap their values as
demonstrated in Figure 3.4.
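The swap mutation can be sketched as follows; the function name is an assumption. Since two values only change places, every city still occurs exactly once in the result.

```python
import random

def swap_mutation(tour, rng):
    """Swap the values on two different, randomly chosen loci."""
    i, j = rng.sample(range(len(tour)), 2)
    mutated = list(tour)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

rng = random.Random(0)
tour = [7, 3, 1, 8, 2, 4, 6, 5]
print(swap_mutation(tour, rng))
```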
Before mutation: 7 3 1 8 2 4 6 5
After mutation: 7 3 6 8 2 4 1 5
Figure 3.4: A chromosome represented as an ordered list is mutated by swapping the values
of two locations.
3.2.6 Recombination operators
The advantage of using recombination operators is that it becomes possible to combine useful
genetic material from multiple parents. Therefore, if one parent has particularly good building
blocks, and another parent has different good building blocks, the offspring created by recombining
these parents may immediately possess all good building blocks from both parents. Of course
this is only the case if recombination succeeds very well; an offspring may also contain those
parts of the parents which are not useful. However, good individuals will be kept in the
population and the worse ones will die, so that it is often still useful to use recombination.
A recombination operator usually maps two parent individuals to one or two children. We
can use one or more recombination operators, but it is important that:
• The child must inherit particular genetic material from both parents. If it only inherits
genetic material from one of the parents, it is basically a mutation operator
• The recombination operator must be designed together with the representation of an
individual and the fitness function so that recombination is not often a catastrophe
(generating bad individuals)
• The recombination operator should generate legal individuals, if possible
Recombination for binary strings
For binary strings there exist a number of different crossover operators. One of them is 1-point
crossover in which there is a single cutting point that is randomly generated after which both
individuals are cut at that point in two parts. Then these parts are combined, resulting in
two possible children of which finally one or both will be kept in the new population (usually
after mutating them as well). Figure 3.5 shows how 1-point crossover is done on bitstrings.
Instead of using a single cutting point, one could also use two cutting points and take both
sides of one parent together with the middle part of the other parent to form new solutions.
This crossover operator is known as 2-point crossover. Another possibility is to use uniform
crossover, in which a random choice decides for each location separately whether the value
of the first individual or of the second individual is used in the offspring. We can see the
different effects of a generated crossover operator using crossover masks. Figure 3.6 shows a
crossover mask which is used to create two children from two parents.
Note that these recombination operators are useful for all finite discrete sets and are thus
more widely applicable than only to binary strings.
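The three crossover operators can all be expressed with list slicing or with a crossover mask. The function names are assumptions. The two printed examples use the parents of Figures 3.5 and 3.6.

```python
import random

def one_point(a, b, cut):
    """Cut both parents at the same point and exchange the tails."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point(a, b, cut1, cut2):
    """Both sides of one parent combined with the middle part of the other."""
    return a[:cut1] + b[cut1:cut2] + a[cut2:], b[:cut1] + a[cut1:cut2] + b[cut2:]

def mask_crossover(a, b, mask):
    """Where the mask is 1 the first child copies a, elsewhere b (and vice versa)."""
    child1 = [x if m else y for x, y, m in zip(a, b, mask)]
    child2 = [y if m else x for x, y, m in zip(a, b, mask)]
    return child1, child2

def uniform(a, b, rng):
    """Uniform crossover: a random mask decides for each location separately."""
    mask = [rng.randint(0, 1) for _ in a]
    return mask_crossover(a, b, mask)

print(one_point([1] * 7, [0] * 7, cut=3))
print(mask_crossover([1, 1, 1, 1, 0, 1, 1], [0, 0, 1, 0, 0, 0, 0],
                     [1, 1, 0, 0, 1, 0, 0]))
print(uniform([1] * 7, [0] * 7, random.Random(0)))
```

1-point and 2-point crossover correspond to masks of the form 1…10…0 and 1…10…01…1, so a mask is a convenient common description of all three operators.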
Parents: 1 1 1 | 1 1 1 1 and 0 0 0 | 0 0 0 0 (the cut is marked by |)
Children: 1 1 1 0 0 0 0 and 0 0 0 1 1 1 1
Figure 3.5: The recombination operator known as 1-point crossover. Here the part left to the
cutting point of the first parent is combined with the part right to the cutting point of the
second parent (and vice versa).
Mask (uniform): 1 1 0 0 1 0 0
Parents: 1 1 1 1 0 1 1 and 0 0 1 0 0 0 0
Children: 1 1 1 0 0 0 0 and 0 0 1 1 0 1 1
Figure 3.6: The effect of a recombination operator can be shown by a crossover mask. Here
the crossover mask is uniformly generated, after which this mask is used to decide which
values on which location to use from both parents in the offspring.
Recombination for real numbered representations
If we have representations which consist of real numbers, one might also want to use the
recombination operators that are given above for binary strings. However, another option is
to average the numbers on the same location, so that we get:
$$\left(x^c_1 = \frac{x^a_1 + x^b_1}{2}, \ldots, x^c_n = \frac{x^a_n + x^b_n}{2}\right)$$
The two different recombination operators for real numbers can also be used together by
randomly selecting one of them each time.
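A sketch of the averaging operator, and of the random choice between it and a crossover that copies values, is given below. The function names, the use of uniform crossover as the copying operator and the 50% choice between the two are assumptions.

```python
import random

def average_crossover(a, b):
    """The child takes the mean of both parents on every location."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def uniform_crossover(a, b, rng):
    """The child copies each value from a randomly chosen parent."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def recombine_reals(a, b, rng):
    """Randomly select one of the two operators each time."""
    if rng.random() < 0.5:
        return average_crossover(a, b)
    return uniform_crossover(a, b, rng)

print(average_crossover([1.0, 4.0, 2.5], [3.0, 0.0, 2.5]))
print(recombine_reals([1.0, 4.0, 2.5], [3.0, 0.0, 2.5], random.Random(0)))
```

Note that averaging creates values which occur in neither parent, whereas the copying operators only rearrange values that are already present in the population.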
Recombination for ordered representations
Designing recombination operators for ordered representations is usually more difficult, since
we have to ensure that we get children that respect the constraints of the problem. E.g. if
we would use 1-point crossover for the TSP, we will almost surely get children which have
some cities twice and some other cities not at all in their representation, which would amount
to many illegal solutions. Penalising such solutions would also not be effective, since almost
all individuals would become illegal. There has been a lot of research on designing recombination operators for ordered representations, but we only mention one possible recombination
operator here.
Since the constraint on a recombination operator is that it has to inherit information from
both parents, we start by selecting a part of the first parent and copy that to the child. After
this, we want to use information from the second parent about the order of values which is
not yet copied to the child. This we do by looking at the second parent, examining the order
in the second parent of the cities which are not yet inside the child, and attaching these cities
in this order to the child. Figure 3.7 shows an illustration of this recombination operator for
ordered lists.
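[Figure 3.7: Parent 1 = 7 3 1 8 2 4 6 5; Parent 2 = 4 3 2 8 6 7 1 5; the part 1 8 2 is copied from Parent 1; the remaining cities 7,3,4,6,5 appear in Parent 2 in the order 4,3,6,7,5; Child 1 = 7 5 1 8 2 4 3 6.]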
Figure 3.7: A possible recombination operator for ordered representations such as for the
TSP. The operator copies a part of the first parent to the child and attaches the remaining
cities to the child while respecting their order in the second parent.
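The operator for ordered lists can be sketched as follows. The function name and the `start`/`end` parameters are hypothetical, and the sketch assumes that the remaining cities are filled in starting directly after the copied part and wrapping around to the front, which is what the example values of Figure 3.7 suggest.

```python
def order_crossover(parent1, parent2, start, end):
    """Copy parent1[start:end] to the child and fill the free positions with
    the remaining cities in the order in which they occur in parent2."""
    n = len(parent1)
    child = [None] * n
    child[start:end] = parent1[start:end]
    copied = set(parent1[start:end])
    # Cities not yet in the child, in the order of the second parent.
    remaining = [city for city in parent2 if city not in copied]
    # Free positions, starting right after the copied part and wrapping around.
    free = [(end + k) % n for k in range(n) if child[(end + k) % n] is None]
    for position, city in zip(free, remaining):
        child[position] = city
    return child

# Example values read from Figure 3.7: the part "1 8 2" is copied from parent 1.
parent1 = [7, 3, 1, 8, 2, 4, 6, 5]
parent2 = [4, 3, 2, 8, 6, 7, 1, 5]
child1 = order_crossover(parent1, parent2, start=2, end=5)
```

Because every city is placed exactly once, the child is always a legal tour, so no penalty or repair step is needed.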
3.2.7 Selection strategies
Another important topic in the design of GAs is the selection of the parents that are allowed to create
children. If parents were always chosen at random, there would not
be any selective pressure for obtaining better individuals. Thus, good individuals must have
a larger probability for generating offspring than worse individuals. The selection strategy
determines how individuals of a population are chosen for generating offspring. Often the
selection strategy allows bad individuals to generate offspring as well, albeit with a much
smaller probability, although some selection strategies only create offspring with the best
individuals. The reason for also using individuals of below-average fitness for creating offspring is
that they can still contain good genetic material and that the good individuals may resemble
each other very much. Therefore, using bad individuals may create more diverse populations.
In the following we will describe a number of different selection strategies.
Fitness proportional selection
In fitness proportional selection, parents which are allowed to reproduce are assigned a probability for reproduction that is based on their fitness. Suppose all fitness values
are positive. Then fitness proportional selection computes the probability $p_i$ that individual i
is used for creating offspring as:
$$p_i = \frac{f_i}{\sum_j f_j}$$
where $f_i$ indicates the fitness of the ith individual. If some fitness values are negative, one
should first subtract the fitness of the worst individual from all fitness values, so that no negative
values remain. There are some disadvantages to this selection strategy:
• There is a danger of premature convergence, since good individuals with a much larger
fitness value than other individuals can quickly take over the whole population
• There is little selection pressure if the fitness values all lie close to each other
• If we add some constant to all fitness values, the resulting probabilities will become
different, so that similar fitness functions lead to completely different results
A possible way to deal with some of these disadvantages is to scale all fitness values, for
example between values of 0 and 1. For this scaling one might use different functions such
as the square root etc. Although this might seem a solution, the scaling method should be
designed ad-hoc for a particular problem and therefore requires a lot of experimental testing.
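The selection probabilities and the effect of adding a constant can be illustrated with a short sketch. The function names are hypothetical; shifting by the worst fitness follows the text, and falling back to uniform probabilities when all shifted values are zero is an added assumption.

```python
import random

def selection_probabilities(fitness):
    """Fitness proportional probabilities p_i = f_i / sum_j f_j."""
    lowest = min(fitness)
    if lowest < 0:
        # Shift so that no negative values remain (the worst individual gets 0).
        fitness = [f - lowest for f in fitness]
    total = sum(fitness)
    if total == 0:
        # Assumption: fall back to uniform selection if all values are zero.
        return [1.0 / len(fitness)] * len(fitness)
    return [f / total for f in fitness]

def select_parent(population, fitness, rng=random):
    """Draw one parent according to the fitness proportional probabilities."""
    weights = selection_probabilities(fitness)
    return rng.choices(population, weights=weights, k=1)[0]

# Adding a constant to all fitness values changes the probabilities:
plain = selection_probabilities([1, 2, 3, 4])            # best gets 4/10
shifted = selection_probabilities([101, 102, 103, 104])  # best gets 104/410
```

In the shifted case all probabilities lie close to 1/4, which illustrates both the sensitivity to constants and the low selection pressure when fitness values lie close together.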
Tournament selection
Tournament selection does not have the problems mentioned above, and is therefore used
much more often, also because it is very easy to implement. In tournament selection k
individuals are selected randomly from the population without replacement (so each individual
can only be selected one time), and then the best individual of this group of k individuals is
used for creating offspring. Here, k is known as the tournament size, and is usually set to 2
or 3 (although the best value also depends on the size of the population). Very high values of
k cause too high a selection pressure and can therefore easily lead to premature convergence.
Figure 3.8 shows how this selection strategy works.
[Figure 3.8: a population of individuals labelled with their fitness values, the k = 3 randomly chosen participants, and the winner of the tournament.]
Figure 3.8: In tournament selection k individuals are selected and the best one is used for
creating offspring.
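Tournament selection is indeed short to implement. The sketch below assumes the fitness values are already available in a list parallel to the population; the function name is hypothetical.

```python
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Pick k distinct individuals at random and return the fittest of them."""
    participants = rng.sample(range(len(population)), k)  # without replacement
    winner = max(participants, key=lambda i: fitness[i])
    return population[winner]
```

Only the ordering of the fitness values matters here, so adding a constant to all fitness values or rescaling them monotonically does not change the outcome.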
Rank-based selection
In rank-based selection all individuals receive a rank where higher ranks are assigned to
better individuals. Then this rank is used to select a parent. So if we have a population of
N individuals, the best individual gets a rank of N, and the worst one a rank of 1. Then we
compute probabilities of each individual to become a parent as:
$$p_i = \frac{r_i}{\sum_j r_j}$$
where $r_i$ is the rank of the ith individual.
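A small sketch of the rank computation, with a hypothetical function name. Ties are broken by position in the list, which is an assumption; the text does not say how equal fitness values should be ranked.

```python
def rank_probabilities(fitness):
    """Selection probabilities p_i = r_i / sum_j r_j with ranks 1 (worst) to N (best)."""
    n = len(fitness)
    order = sorted(range(n), key=lambda i: fitness[i])  # worst individual first
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = n * (n + 1) // 2  # sum of the ranks 1..N
    return [r / total for r in ranks]

# One individual with a much larger fitness does not dominate the selection:
probs = rank_probabilities([10, 1000, 11])
```

Here the individual with fitness 1000 receives probability 3/6 rather than almost 1, as it would under fitness proportional selection.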
Truncated selection
In truncated selection the best M < N individuals are selected and used for generating
offspring with equal probability. The problem of truncated selection is that it does not make
distinctions between the best and the Mth best individual. Some researchers have used
truncated selection where the best 25% of the individuals in the population are used for
creating offspring, but this is a very high selection pressure and can therefore easily lead to
premature convergence.
3.2.8 Replacement strategy
The selective pressure is also influenced by the way individuals of the current population
are eliminated to make place for new individuals. In a generational genetic algorithm, one
usually kills the old population and replaces it by a completely new population, whereas in
a steady-state genetic algorithm at each time one new individual is created which replaces
one individual of the old population (usually the worst one). Generational GAs are most
often used, but sometimes part of the old population is kept in the new population. E.g. one
well-known approach is to always keep the best individual and copy it to the next population;
this approach is called elitism (or elitist strategy). We recall that even if the elitist strategy
is not used, we always keep the best solution found so far in memory.
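The difference between generational and steady-state replacement can be sketched as follows. The callback `make_child(population, scores)`, which is assumed to perform selection, recombination and mutation, is hypothetical, as are the function names.

```python
def next_generation(population, fitness, make_child, elitism=True):
    """Generational replacement: build a completely new population,
    optionally copying the best individual unchanged (elitism)."""
    scores = [fitness(ind) for ind in population]
    new_population = []
    if elitism:
        best = max(range(len(population)), key=lambda i: scores[i])
        new_population.append(population[best])
    while len(new_population) < len(population):
        new_population.append(make_child(population, scores))
    return new_population

def steady_state_step(population, fitness, make_child):
    """Steady-state replacement: create one child and let it replace
    the worst individual of the old population."""
    scores = [fitness(ind) for ind in population]
    worst = min(range(len(population)), key=lambda i: scores[i])
    updated = list(population)
    updated[worst] = make_child(population, scores)
    return updated
```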
3.2.9 Recombination versus mutation
The two search operators used in genetic algorithms serve different purposes. The recombination
operator causes new individuals to depend on the whole population (genetic material of
individuals is mixed). Its utility relies on the schema theorem, which tells us that if the
crossover operator does not destroy good building blocks too often, they can be quickly
mixed and stay in the population, since an individual consisting of two good building blocks
(schemata) is likely to have a higher fitness value and is therefore more likely to propagate
its genetic material. In principle, the crossover operator exploits previously found genetic
material and leads to faster convergence. In case the whole population has converged to the
same individual, the crossover operator will not have any effect anymore. Thus, with less
diverse populations, the effect of crossover diminishes.
On the other hand, the mutation operator possesses different properties. It allows a population to escape from a single local optimum. Furthermore, it allows values of locations which
have been lost to be reinserted. Thus we should regard it as an exploration operator.
Genetic algorithms and evolutionary strategies
Independently of the development of genetic algorithms, Rechenberg invented evolutionary
strategies (ES). There are a number of different evolutionary strategies, but in principle ES
resemble GAs a lot. Like GAs they rely on reproducing parents for creating new solutions. The
differences between GAs and ES are that ES usually work on real numbered representations
and that they also evolve their own mutation parameter σ. Furthermore, most ES do not use
crossover, and some ES only use a single individual whereas GAs always use a population.
The choice whether to use crossover or not depends on:
• Is the fitness function separable into additive components? (E.g. if we want to maximize
the number of 1’s in a bitstring, then the fitness function is the sum of the fitness of
each separate location.) In case of separable fitness functions, the use of recombination
can lead to much faster search times for optimal solutions.
• Are there building blocks? If there are no real building blocks, then crossover does not
make sense.
• Is there a semantically meaningful recombination operator? If recombination is meaningful it should be used.
3.3 Genetic Programming
Although genetic algorithms can be used for learning (robot) controllers or functions mapping
inputs to outputs, the use of binary representations or real numbers without a structure does
not provide immediate means for doing so. Therefore in the late 1980’s Genetic Programming
(GP) was invented and made famous by the work and books of John Koza. The main element
of genetic programming is the use of functional (or program) trees which are used to map
inputs to outputs. E.g., for robot control the inputs may consist of sensory inputs and the
outputs may be motor commands. By evolving functional program trees, those programs
which work best for the task at hand will remain in the population and reproduce.
A program tree may consist of a large number of functions such as cos, sin, ×, +, /, exp,
and random constants. These functions usually require a fixed number of inputs. Therefore
a program tree must obey some constraints which make it legal. To make a program tree
legal, functions which require n arguments (called n-ary functions), should have n branches
to child-nodes where each child-node is filled in by another function or variable. The leaf
nodes of the tree are input-variables or random constants. Figure 3.9 shows an example of a
program tree.
Genetic programming has been used for a number of different problems, including
supervised learning (machine learning) to map inputs to outputs, learning to control robots,
and pattern recognition to distinguish between different objects from pixel data.
Genetic programming is quite flexible in its use of functions and primitive building blocks.
Loops, memory registers, special random numbers, and more have been used to solve particular tasks. As in genetic algorithms, one has to devise mutation and crossover operators for
program trees. The other elements of a genetic programming algorithm can be equal to the
ones used by genetic algorithms.
[Figure 3.9: the program tree COS(*(+(X1, X2), 2)) and the function it represents, Cos((X1 + X2) * 2).]
Figure 3.9: A program tree and its corresponding function.
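One possible way to represent and evaluate such a program tree is with nested tuples. The representation, the function table and the names below are assumptions made for illustration; they are not taken from a particular GP system.

```python
import math

# Each function symbol maps to (arity, implementation).
FUNCTIONS = {
    "cos": (1, math.cos),
    "sin": (1, math.sin),
    "+": (2, lambda a, b: a + b),
    "*": (2, lambda a, b: a * b),
}

def evaluate(tree, inputs):
    """Evaluate a tree given as (function, child, ...), a variable name or a constant."""
    if isinstance(tree, tuple):
        name, *children = tree
        arity, func = FUNCTIONS[name]
        if len(children) != arity:
            raise ValueError(f"{name} needs {arity} arguments")  # illegal tree
        return func(*(evaluate(child, inputs) for child in children))
    if isinstance(tree, str):
        return inputs[tree]  # leaf node: input variable
    return tree              # leaf node: constant

# The tree of Figure 3.9: cos((X1 + X2) * 2)
tree = ("cos", ("*", ("+", "X1", "X2"), 2))
value = evaluate(tree, {"X1": 1.0, "X2": 2.0})
```

The arity check corresponds to the legality constraint in the text: a function that requires n arguments must have exactly n child nodes.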
3.3.1 Mutation in GP
The mutation operator can adjust a node in the tree. If the new function in the node has
the same number of arguments, this is easy; otherwise the number of child nodes no longer
matches and the tree has to be repaired. In
the case of point mutations one only allows mutating a terminal to a different terminal and
a function to a different function of the same arity. Other researchers have used mutation of
subtrees, in which a complete subtree is replaced by a randomly created new subtree. Figure
3.10 shows an example of a point mutation in GP.
[Figure 3.10: before mutation COS(*(+(X1, X2), 2)); after mutation COS(+(+(X1, X2), 2)).]
Figure 3.10: Point mutation in genetic programming. A function in a node is replaced by a
different function with the same number of arguments.
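Point mutation can be sketched on a nested-tuple tree representation, where a tree is `(function, child, ...)`, a variable name, or a constant. The symbol sets, the per-node mutation rate and the choice to leave constants untouched are assumptions for this illustration.

```python
import random

# Function symbols grouped by arity, and the available input variables.
FUNCTIONS_BY_ARITY = {1: ["cos", "sin"], 2: ["+", "*", "/"]}
VARIABLES = ["X1", "X2"]

def point_mutate(tree, rng, rate=0.1):
    """With probability `rate` per node, replace a function by a different function
    of the same arity and a variable by a different variable.
    The shape of the tree never changes; constants are left as they are."""
    if isinstance(tree, tuple):
        name, *children = tree
        if rng.random() < rate:
            options = [f for f in FUNCTIONS_BY_ARITY[len(children)] if f != name]
            name = rng.choice(options)
        return (name, *(point_mutate(child, rng, rate) for child in children))
    if isinstance(tree, str) and rng.random() < rate:
        return rng.choice([v for v in VARIABLES if v != tree])
    return tree

before = ("cos", ("*", ("+", "X1", "X2"), 2))
after = point_mutate(before, random.Random(0), rate=0.2)
```

Because only symbols of the same arity are exchanged, the mutated tree is always legal and needs no repair.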
3.3.2 Recombination in GP
The recombination operator also works on program trees. First particular subtrees are cut
from the main program trees for both parent individuals and then these subtrees are exchanged. Figure 3.11 shows an example of the recombination operator in GP.
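Subtree exchange can be sketched on a nested-tuple tree representation, where a tree is `(function, child, ...)` or a leaf. Choosing the cut points uniformly over all nodes is an assumption, and the helper names and example parents are hypothetical.

```python
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` in which the node at `path` is replaced."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def subtree_crossover(parent1, parent2, rng):
    """Cut one random subtree from each parent and exchange them."""
    path1, sub1 = rng.choice(list(subtrees(parent1)))
    path2, sub2 = rng.choice(list(subtrees(parent2)))
    return replace_at(parent1, path1, sub2), replace_at(parent2, path2, sub1)

parent_a = ("cos", ("*", ("+", "X1", "X2"), 2))
parent_b = ("sin", ("+", ("*", 2, "X2"), ("cos", "X1")))
child_a, child_b = subtree_crossover(parent_a, parent_b, random.Random(3))
```

Since a whole subtree is always replaced by another whole subtree, both children are legal program trees, although their sizes can differ from those of the parents.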
[Figure 3.11: two parent program trees, rooted at COS and SIN, each with a marked cut point, and the two children obtained by exchanging the cut subtrees.]
Figure 3.11: Recombination in genetic programming. A subtree of one parent is exchanged
with a subtree of another parent.
3.3.3 Probabilistic incremental program evolution
Instead of using a population of individuals, one could also use generative prototypes which
generate individuals according to some probability distribution. Baluja invented population
based incremental learning (PBIL), which encodes a prototype chromosome for generating bitstrings.
This chromosome consists of probabilities for generating a 1 at each specific location (and 1
minus that probability for generating a 0). Using this prototype chromosome, individuals can
be generated and evaluated. After that the prototype chromosome can be adjusted towards
the best individual so that it will generate solutions around the best individuals with higher
probability.
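A minimal PBIL sketch along these lines is given below. The text only says that the prototype is "adjusted towards the best individual"; the specific update rule (moving each probability a fraction `learning_rate` towards the corresponding bit of the winner) and all parameter values are assumptions for this illustration.

```python
import random

def pbil(fitness, length, generations=100, samples=20, learning_rate=0.1, rng=None):
    """Evolve a vector of probabilities instead of a population of bitstrings."""
    rng = rng or random.Random()
    probs = [0.5] * length                      # prototype chromosome
    best, best_score = None, float("-inf")
    for _ in range(generations):
        # Generate individuals from the prototype chromosome.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(samples)]
        winner = max(population, key=fitness)
        score = fitness(winner)
        if score > best_score:                  # remember the best solution so far
            best, best_score = winner, score
        # Assumed update rule: shift every probability towards the winner's bit.
        probs = [(1 - learning_rate) * p + learning_rate * bit
                 for p, bit in zip(probs, winner)]
    return probs, best

# Example: maximise the number of 1's in a bitstring of length 8.
final_probs, best_found = pbil(sum, length=8, rng=random.Random(0))
```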
This idea was pursued by Rafal Salustowicz for transforming populations of program trees
in a representation using a probabilistic program tree (PPT). The idea is known as probabilistic incremental program evolution (PIPE) and it uses probabilities to generate functions
in a particular node. The probabilistic program tree which is used for generating program
trees consists of a single large tree consisting of probabilities of functions in each node, as
shown in Figure 3.12.
The PPT is used to generate an individual as follows:
• Start at the root node and select a function according to the probabilities
• Go to the subtrees of the PPT to generate the necessary arguments for the previously
generated functions
• Repeat this until the program is finished (all leaf nodes consist of terminals such as
variables or constants)
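The generation steps above can be sketched as follows. The symbol set follows Figure 3.12, but the class, the uniform initial probabilities, the lazy creation of child nodes and the depth limit (which forces terminals at the deepest level so that generation always stops) are assumptions of this sketch, not details of PIPE as published.

```python
import random

# Symbols as in Figure 3.12, with the number of arguments each one needs.
ARITY = {"SIN": 1, "COS": 1, "*": 2, "+": 2, "/": 2, "X1": 0, "X2": 0}

class PPTNode:
    """One node of the probabilistic prototype tree."""
    def __init__(self):
        # Assumption: start with uniform probabilities over all symbols.
        self.probs = {symbol: 1.0 / len(ARITY) for symbol in ARITY}
        self.children = []

    def child(self, i):
        """Return the i-th child node, creating nodes lazily when needed."""
        while len(self.children) <= i:
            self.children.append(PPTNode())
        return self.children[i]

def sample_program(node, rng, depth=0, max_depth=5):
    """Select a symbol at this node, then generate its arguments from the subtrees."""
    symbols = list(node.probs)
    if depth >= max_depth:
        # Assumed depth limit: only terminals are allowed at the deepest level.
        symbols = [s for s in symbols if ARITY[s] == 0]
    weights = [node.probs[s] for s in symbols]
    symbol = rng.choices(symbols, weights=weights, k=1)[0]
    args = [sample_program(node.child(i), rng, depth + 1, max_depth)
            for i in range(ARITY[symbol])]
    return (symbol, *args)

root = PPTNode()
program = sample_program(root, random.Random(0))
```

Each generated program is a nested tuple such as `("COS", ("X1",))`; every leaf is a terminal, so the program is complete.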
For learning in PIPE, the PPT must be changed so that the individuals
which are generated from it obtain higher fitness values. For this PIPE repeats the following
steps:
• Generate N individuals with the prototype tree
• Evaluate these N individuals
• Select the best individual and increase the probabilities of the functions and terminals
used by this best individual
• Mutate the probabilities of the PPT a little bit
[Figure 3.12: a probabilistic prototype tree in which every node stores a probability for each of the symbols SIN, COS, *, +, /, X1 and X2.]
Figure 3.12: The probabilistic prototype tree used in PIPE for generating individuals.
PIPE has been compared to GP and it was experimentally found that PIPE can find good
solutions faster than GP for particular problems.
3.4 Memetic Algorithms
There is an increasing amount of research which combines GAs with local hill-climbing techniques. Such algorithms are known as memetic algorithms. Memetic algorithms are inspired
by memes [Dawkins, 1976], pieces of mental content such as stories, ideas, and gossip, which reproduce (propagate) themselves through a population of meme carriers. Analogous to the
selfish gene idea [Dawkins, 1976], each meme uses the host (the individual)
to propagate itself further through the population, and in this way competes with different
memes for the limited resources (there is always limited memory and time for knowing and
telling all ideas and stories).
The difference between genes and memes is that the first are inspired by biological evolution and the second by cultural evolution. Cultural evolution is different because Lamarckian
learning is possible in this model. That means that each transmitted meme can be changed
as more information is received from the environment. This makes it possible to
locally optimize each meme before it is transmitted to other individuals. Although
optimising memes before they are propagated further seems an efficient way
of knowledge propagation or population-based optimisation, the question is how we can
optimize a meme or individual. For this we can combine genetic algorithms with different
optimisation methods. The optimisation technique which is most often used is a simple local hill-climber, but some researchers have also proposed different techniques such as Tabu
Search. Because a local hill-climber is used, each individual is not truly optimized, but only
brought to its local maximum. If it were possible to fully optimize the individual, we
would not need a genetic algorithm at all.
An advantage of memetic algorithms compared to genetic algorithms is that genetic
algorithms usually have problems in fine-tuning a good solution to make it an optimal one.
E.g. suppose that a bitstring contains perfect genetic material except for a single bit. In
this case there are many more possible mutations which harm the individual than mutations
which bring it to the true global optimum. Memetic algorithms do not have this problem
and they also have the advantage that all individuals in the population are in local maxima.
However, this also involves a cost, since the local hill-climber can require many evaluations
to bring an individual to a local maximum in its region.
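The local search step of a memetic algorithm can be sketched with a simple bit-flip hill-climber. The neighbourhood (all strings that differ in one bit) and the function name are assumptions; in a memetic algorithm this routine would be applied to every new individual before it enters the population.

```python
def hill_climb(bits, fitness):
    """Flip single bits as long as this improves the fitness.
    Returns the local maximum that was reached and its fitness."""
    bits = list(bits)
    best = fitness(bits)
    improved = True
    while improved:
        improved = False
        for i in range(len(bits)):
            bits[i] ^= 1                  # try flipping bit i
            score = fitness(bits)
            if score > best:
                best = score              # keep the improvement
                improved = True
            else:
                bits[i] ^= 1              # undo the flip
    return bits, best

# A bitstring that is perfect except for a single bit is repaired directly:
repaired, score = hill_climb([1, 1, 0, 1], fitness=sum)
```

Each pass over the string costs one fitness evaluation per bit, which illustrates the extra evaluation cost mentioned above.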
Memetic algorithms have already been compared to GAs on a number of combinatorial optimisation problems such as the traveling salesman problem (TSP) [Radcliffe and Surry, 1994]
and experimental results indicated that the memetic algorithms found much better solutions than standard genetic algorithms. Memetic algorithms have also been compared to the
Ant Colony System [Dorigo et al., 1996], [Dorigo and Gambardella, 1997] and to Tabu Search
[Glover and Laguna, 1997] and results indicated that memetic algorithms outperformed both
of them on the Quadratic Assignment Problem [Merz and Freisleben, 1999].
3.5 Discussion
Evolutionary algorithms have the advantage that they can be used for solving a large number
of different problems. For example if one wants to make a function which generates particular
patterns and no other learning method exists, one could always use an evolutionary algorithm.
Furthermore, evolutionary algorithms are good at searching through very large spaces and
can easily be parallelized.
A problem with evolutionary algorithms is that sometimes the population converges prematurely to a suboptimal local optimum. Therefore a lot of research effort has gone into
methods for keeping diversity during the evolution. Another problem is that many individuals are evaluated and then never used anymore, which seems a waste of computer power.
Furthermore, the learning progress can be quite slow for some problems, and if many individuals have the same fitness value there is not much selective pressure. E.g. if there is only a
good/bad final evaluation, it is very hard to come up with solutions which are evaluated as good
if in the beginning all individuals are bad. Therefore, the fitness function should be designed
to be as informative as possible.
A lot of current research focuses on “linkage learning”. We have seen that recombination
is a useful operator which can allow for quickly combining good genetic material (building
blocks). However, uniform crossover is very disruptive: since it is a random crossover operator,
it does not keep building blocks together as a whole. On the other hand, 1-point crossover
may keep building blocks together if the building blocks are encoded on bits which lie nearby
on a genetic string (i.e. next to each other). It may happen, however, that a building block is
not encoded in a genetic string as material next to each other, but distributed over the whole
string. In order to use effective crossover for such problems one must identify the building
blocks, which is known as linkage learning. Since building blocks can be quite large, finding
the complete block can be very difficult, but effective progress in this direction has been made.
Chapter 4
Physical and Biological Adaptive Systems
Before the 16th century, Western thinkers relied on a deductive approach to establish truth. For example, Aristotle held that heavy objects fall to the
ground faster than lighter objects. It was not until Galileo Galilei (1564-1642) tested this (according to some by dropping objects from the tower of Pisa) that this
hypothesis turned out to be false (if we disregard air resistance). After this Galilei played
an important role in using mathematics for making predictive models, and he also argued that
the planets move around the sun instead of around the earth (a hypothesis which the Church
forced him to retract). This was the start of a natural science in which experiments were
used to make (predictive) models. Christiaan Huygens also played an important role through his
invention of much better clocks, which made measuring time much more precise, his construction of
better lenses and telescopes, and his proposal that light could be described by waves instead
of particles. Kepler (1571-1630) contributed to the new science by describing the orbits of the planets with ellipses instead of the circles
that were commonly assumed at the time.
Isaac Newton (1642-1727) discovered the law of gravitation and the laws of mechanics, which
were the final breakthrough for a new natural science. Newton’s law of gravitation states that two
objects (e.g. planets) attract each other with a force proportional to the product of their masses divided
by the square of the distance between them. It is very accurate for big objects which do
not move at very high speed (for very small particles quantum mechanics introduced
different laws, and for very high speeds relativity theory was later developed). Newton’s laws
of mechanics were also used to predict that the orbits of planets are ellipses and that the planets
circle around the sun, whose movement is hardly influenced by the planets.
After this fruitful period of scientific revolutions, researchers started to think that the
universe worked like a clock and that everything could be predicted. This even led to
Laplace’s idea of a Genius, an almighty entity which could predict the future
and the past based on the current state and the mechanical laws. Although this idea of a
universal clock brought many fruitful machines such as computers and television, already at
the end of the 19th century Poincaré had discovered that not everything could be predicted.
Poincaré was studying three-body problems, like three planets moving around each other, and
discovered that there were not enough known equations to come up with a single analytical
solution for predicting their movements. This eventually led to chaos theory, where a model
can be deterministic, but still shows unpredictable behavior if we cannot exactly measure the
initial state.
Although the word chaos often refers to a state without order, researchers have found
remarkable structures in chaotic systems. Even in chaotic systems there seems to be a kind
of self-organisation. In this chapter we will look at the path from physics to biology, take a
look at chaotic systems, and then we will examine self-organising biological systems such as
ants and the use of these systems for solving optimisation problems.
4.1 From Physics to Biology
In Newtonian mechanics, systems are reversible, which means that we can turn around the
arrow of time and compute the past instead of the future. Physical systems obey specific laws
such as the conservation of energy, which states that the sum of the potential and kinetic
energy of an object must remain constant. An example is a ball which we throw up in the air.
In the beginning the kinetic energy (due to its speed) is maximal, and it becomes 0 at
the highest point, where the potential energy (due to gravitation) is maximal. Then the ball
falls again while conserving its energy until it finally bounces against the ground. In reality
the ball loses energy at each bounce and through friction, so that the height it reaches is damped;
without loss of energy the ball would continue bouncing forever.
An energy-preserving system will continue its movement forever. A
good example is a pendulum. Suppose a pendulum is mounted at some point, and there is
no friction at this point and no friction due to air resistance. If we give the pendulum a push to
the right, it will keep moving to the left and to the right. If we give the pendulum a
harder push, it will go over the top and continue going around. Let’s look at the phase diagram
in Figure 4.1, which shows possible trajectories in the plane with the angle on the x-axis and
the (normalised) angular speed on the y-axis.
Figure 4.1: The phase diagram of the pendulum
In the middle of the figure a stable equilibrium is shown, in which the pendulum is not moving
at all. Trajectories a and b show periodic cycles (orbits) in which the pendulum is moving to
the left, right, left, etc. Orbit c leads to an unstable equilibrium in which the pendulum
goes to the highest point and stops moving there. This point is unstable, because a slight
perturbation will cause it to move again. Finally, in orbit d the pendulum is going over its
head.
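The four kinds of trajectories can be told apart by the conserved energy. The sketch below assumes a frictionless pendulum in normalised units (equation of motion θ'' = −sin θ, with θ = 0 the lowest point), so that the energy is E = ω²/2 − cos θ; these units and the function names are assumptions of the sketch, not taken from Figure 4.1.

```python
import math

def pendulum_energy(theta, omega):
    """Conserved energy of the normalised frictionless pendulum (assumed units)."""
    return 0.5 * omega ** 2 - math.cos(theta)

def classify(theta, omega, tol=1e-9):
    """Name the kind of orbit that passes through the phase-space point (theta, omega)."""
    energy = pendulum_energy(theta, omega)
    if abs(energy + 1.0) <= tol:
        return "stable equilibrium"              # hanging still at the lowest point
    if abs(energy - 1.0) <= tol:
        return "orbit to unstable equilibrium"   # just reaches the highest point (orbit c)
    if energy < 1.0:
        return "periodic swinging"               # left, right, left, ... (orbits a, b)
    return "going over its head"                 # enough energy to rotate (orbit d)

# Pushes of increasing strength, given at the lowest point:
kinds = [classify(0.0, omega) for omega in (0.0, 1.0, 2.0, 2.5)]
```

In these units a push with angular speed exactly 2 at the lowest point gives E = 1, which is just enough to reach the top.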
The pendulum is an example of a reversible system in which energy is conserved. Ideally,
such mechanical systems always conserve their energy. However, there are also many systems
which are irreversible: the thermodynamic systems. After the industrial revolution,
many scientists were interested in making the optimal machine (a perpetuum mobile): one
which would continue to work forever. But soon they discovered that every machine
loses useful energy due to the production of heat. An example of a thermodynamic system is a
box with 2 halves. In one half there are N gas molecules and in the
other half there are none. The initial state is very ordered since all the gas molecules are in
the left half. Now we take away the border between the halves and we will observe that after
some time both halves contain roughly the same number of molecules. This is an example
of an irreversible system, since if the system is in a state with the same number of
molecules in both halves it will probably never return to the state with all molecules in one
half. To describe such processes, Clausius introduced the concept of entropy, which Boltzmann later explained statistically. Entropy
corresponds to the amount of disorder which is caused by the production of useless energy
such as heat, which cannot be converted back into useful energy without a temperature difference. Entropy
has also been widely used in thermodynamics to explain why heat will always flow from a hot
space to a colder space.
Consider the case of the N gas molecules again. Boltzmann gave a statistical explanation
of why the molecules would mix and a state of disorder would arise. Consider N molecules
and the number of permutations that can describe a possible state. For example, all N
molecules in one half of the box corresponds to only one possible state, while one molecule in one half
and the rest in the other half corresponds to N possible states. Now if we divide the N molecules
into $N_1$ and $N_2$ molecules in the two halves, the number of permutations is:
$$P = \frac{N!}{N_1!\,N_2!}$$
Now it is logical that the system will go to the equilibrium with the most possible states, that is
where $N_1 = N_2$. For this Boltzmann defined the entropy of a system as:
$$S = k \log P$$
where k is called the Boltzmann constant.
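As a small numerical check of these formulas, the sketch below computes the number of permutations and the entropy for every division of N = 10 molecules over the two halves. The function names are hypothetical and the natural logarithm is assumed.

```python
import math

def multiplicity(n1, n2):
    """Number of permutations P = N! / (N1! N2!) with N = N1 + N2."""
    return math.comb(n1 + n2, n1)

def entropy(n1, n2, k=1.380649e-23):
    """Boltzmann entropy S = k log P (natural logarithm, k in J/K)."""
    return k * math.log(multiplicity(n1, n2))

# All divisions of N = 10 molecules over the two halves of the box.
counts = [multiplicity(n1, 10 - n1) for n1 in range(11)]
most_likely_n1 = max(range(11), key=lambda n1: counts[n1])
```

The list `counts` starts with 1 (all molecules in one half) and 10 (one molecule in the other half) and peaks at 252 for the even division, which is where the entropy is maximal.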
So although all microscopic states are equally probable, there are
many more microscopic states around the macroscopic situation of having the same number of
molecules in both halves, so this situation will arise after some time. Of course small deviations
from the macroscopic equilibrium can happen, but the system’s state will oscillate around this
equilibrium. We can see that entropy is maximised and that disorder will be the result. Since
entropy production is always positive in a closed system and there is a state with maximal
entropy, the system will always converge to such an equilibrium. Since the initial state gets
lost in this case, the process is not reversible (many states lead to the same final state). Note
the difference with the energy-preserving pendulum, which is reversible. It is important to
note that there are many irreversible processes in machines caused by loss of heat, friction
etc., so that it is not possible to make a machine which continues forever without receiving
additional energy. These processes cause an increase of the entropy of the system.
But how is this with open systems such as living systems? Here the change of entropy
of the system is governed by an internal change of entropy $dS_i/dt$, which is irreversible and
always positive, and an exchange of entropy between the system and its environment $dS_u/dt$,
which can be positive or negative. We note that exchange of entropy is not possible for a closed system
(since there is no environment to interact with), so that its entropy can only
increase or remain constant (at the maximal value). In this case, the entropy determines the
direction of time; for all closed systems the future lies in the direction of increased entropy.
This led to the two laws of thermodynamics as formulated by Clausius in 1865:
• The energy of the world is constant
• The entropy of the world goes to a maximal value
Thus in the thermodynamic equilibrium the entropy and disorder will be at their maximum.
However, living systems can exchange entropy with their environment. This allows them to
keep their entropy low. E.g. by consuming food and energy, a living system is able to keep
its order without having to increase its entropy. This is the essential difference between open
and closed systems. An open system can receive useful energy from the environment and
thereby it can reduce its disorder and create more order.
4.2 Non-linear Dynamical Systems and Chaos Theory
As mentioned before, Poincaré had already discovered that there are no analytical solutions
to be found for the n-body problem with n larger than 2. For 2 planets, there is an analytical
solution which determines the position and velocity of both interacting planets given the
initial conditions. These planets will move around their common centre of mass as shown in
Figure 4.2.
[Figure 4.2: Planet 1 and Planet 2 orbiting around their point of common mass.]
Figure 4.2: The orbits of two interacting planets
For the n-body problem with n ≥ 3, Poincaré demonstrated that there were not
enough differential equations to be able to compute a solution to the problem. The problem
was therefore not integrable to a closed-form analytical solution. Poincaré also demonstrated that small perturbations could cause large differences between trajectories in this case. This
was the first time chaotic dynamics had been mentioned.
After this, for a long time few researchers studied chaotic systems. One major
breakthrough in their understanding came when computers were used which could visualise
such processes. The first famous demonstration of chaos using computer simulations was
described by the meteorologist Edward Lorenz, who was studying weather prediction. In 1961
he discovered sensitivity to initial conditions by accident: he wanted to repeat a
simulation, but found completely different results.
After some time he discovered that the values he used in his second simulation were rounded
to three decimals, whereas the computer used values with 6 decimals during the entire run.
These minimal differences quickly caused large deviations, as is seen in Figure 4.3.
Figure 4.3: The simulations done by Lorenz showed sensitivity to initial conditions. Although
the initial values were almost identical, the difference between the trajectories became very
large.
In chaos theory it is often said that small causes can have big consequences. After simplifying
his model to three variables, Lorenz first noted something like random behavior, but after plotting
the values in a coordinate space, he obtained his famous Lorenz attractor depicted in Figure
4.4. We can see an ordered structure, so again we should not confuse chaotic dynamics with
non-determinism.
Figure 4.4: The Lorenz attractor
The dynamical system of Lorenz is quite complicated to analyse, and therefore we will
use an example from biology to demonstrate chaotic dynamics in an easier way.
4.2.1 The logistic map
Around 1800, T.R. Malthus assumed that the growth rate of a population would be proportional to
the number of individuals x(t). The mathematical expression is the differential equation:
\[ \frac{dx}{dt} = kx \]
which has as closed-form solution an exponentially growing population:
\[ x(t) = x(0)\exp(kt) \]
In 1844 P.F. Verhulst noted that in a growing population competition must arise, so that
the population would stop growing at some time. He assumed that the population would grow
in proportion to both the number of individuals and the difference between the amount of available
resources and the resources needed to sustain the population. This model is known as the
Verhulst equation:
\[ \frac{dx}{dt} = Ax(N - x) \]
with AN the maximal amount of available resources and Ax the amount needed for x persons.
The logistic map equation can be derived from this by using discrete time and a change of
variables. The logistic map equation looks as follows:
\[ x(t + 1) = r\,x(t)\,(1 - x(t)) \]
where x has a value between 0 and 1. For values of r below 1, the population will die out
(x(∞) = 0). If r is between 1 and 3, there is one single final state x(∞). Now if we keep
increasing r, there will arise period-2 cycles and higher periodic cycles. Each value of r at which
the period increases (in the beginning it doubles) is called a bifurcation point. Figure
4.5 shows a period-2 cycle of this map with a value of r a little bit larger than 3.
Figure 4.5: A period-2 cycle of the logistic map.
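The step from the Verhulst equation to the logistic map is only mentioned above, so here is one possible way to carry it out. It is a sketch that assumes a forward-Euler discretisation with a unit time step (the text does not say which discretisation is meant); the rescaled variable y and the relation r = 1 + AN are introduced here purely for illustration.

```latex
\begin{align*}
x(t+1) &= x(t) + A\,x(t)\,\bigl(N - x(t)\bigr)
        = x(t)\,\bigl(1 + AN - A\,x(t)\bigr) \\
\text{Define } r &= 1 + AN \quad\text{and}\quad y(t) = \frac{A}{r}\,x(t),
        \quad\text{so that } A\,x(t) = r\,y(t). \\
y(t+1) &= \frac{A}{r}\,x(t)\,\bigl(r - A\,x(t)\bigr)
        = y(t)\,\bigl(r - r\,y(t)\bigr)
        = r\,y(t)\,\bigl(1 - y(t)\bigr)
\end{align*}
```

Renaming y back to x gives the logistic map in the form used in this section.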
Figure 4.6 shows a larger periodic cycle. Although the periodic attractor is difficult to
see, it is important to note that trajectories from different starting points x0 approach this
limit cycle.
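The long-run behavior described above can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the starting values and the chosen values of r are assumptions made for illustration.

```python
def logistic_map(x, r):
    """One step of the logistic map x(t+1) = r * x(t) * (1 - x(t))."""
    return r * x * (1.0 - x)


def long_run_values(r, x0=0.2, transient=1000, keep=8):
    """Iterate past a transient, then return the next `keep` values (rounded)."""
    x = x0
    for _ in range(transient):
        x = logistic_map(x, r)
    values = []
    for _ in range(keep):
        x = logistic_map(x, r)
        values.append(round(x, 6))
    return values


if __name__ == "__main__":
    # r below 1: extinction; r between 1 and 3: one final state;
    # r slightly above 3: a period-2 cycle.
    for r in (0.8, 2.5, 3.2):
        for x0 in (0.2, 0.7):
            print(r, x0, sorted(set(long_run_values(r, x0))))
```

For each value of r, the two starting points x0 should end up on the same set of values, which illustrates that trajectories from different starting points approach the same attractor.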
Figure 4.6: A larger periodic cycle of the logistic map.
Now, look what happens if we plot, for each value of r, the values which x can take after a
long transient period (so we eliminate the initial values x(t) by waiting for 1000 steps). This
plot is shown in Figure 4.7. The figure shows a very complicated bifurcation diagram. In the
beginning there is a single steady state (for r ≤ 1 all trajectories go to x(∞) = 0). When
1 < r < 3 there is a single stable state for x, although the final value x(∞) depends on
r. Now if we increase r to a value higher than 3, there is a periodic cycle of length 2, which
is shown in the bifurcation diagram by the two branches which give the
values of x which are part of the periodic cycle. Increasing r further leads to periodic cycles of
length 4, 8, 16, etc., until finally the period becomes infinite and the system shows chaotic
behavior.
Figure 4.7: A plot of the value of the control parameter r to the values which x will take after
some transient period.
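The data behind such a bifurcation diagram can be generated with a few lines of code. The sketch below is an illustration only: the function names, the number of samples, the rounding precision and the chosen values of r are assumptions, and plotting is left out so that the example stays self-contained.

```python
def attractor_values(r, x0=0.3, transient=1000, samples=200, digits=4):
    """Distinct values (rounded) visited by the logistic map after a transient."""
    x = x0
    for _ in range(transient):
        x = r * x * (1.0 - x)
    seen = set()
    for _ in range(samples):
        x = r * x * (1.0 - x)
        seen.add(round(x, digits))
    return sorted(seen)


def bifurcation_data(r_values):
    """List of (r, values) pairs; plotting each value against r gives the diagram."""
    return [(r, attractor_values(r)) for r in r_values]


if __name__ == "__main__":
    for r, values in bifurcation_data((0.5, 2.0, 3.2, 3.5, 3.9)):
        print(f"r = {r}: {len(values)} distinct value(s)")
```

Values of r very close to a bifurcation point (for example r = 1 or r = 3) converge slowly, so a longer transient would be needed there.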
In Figure 4.8, we see a more detailed view of this bifurcation diagram for values of r
between 3.4 and 4. It shows that although there are values of r displaying chaotic behavior,
for some values of r there are again periodic cycles, which is shown by the bands with only
a few branches. If we zoom in further on part of Figure 4.8, we get the diagram displayed in
Figure 4.9. This figure shows clearly that there are periodic cycles alternating with chaotic
dynamics.
A remarkable property of the chaotic dynamics generated by the logistic map appears when
we zoom in further on part of Figure 4.9 and get Figure 4.10. This figure clearly
shows that there is a self-similar pattern on a smaller scale. Again we see bifurcation points
and periods which double, until again we arrive at chaotic dynamics which visit an
infinite number of points.
Figure 4.8: A plot of the value of the control parameter r between 3.4 and 4 to the values which x will take after
some transient period.
Figure 4.9: A plot of the value of the control parameter r between 3.73 and 3.753 to the
values which x will take after some transient period.
So what can we learn from this? First of all, even simple equations can display chaotic
behavior. For a map (or difference equation) chaotic dynamics can be obtained with a single
variable (the population x). When using differential equations, it turns out that at least
three coupled differential equations forming a non-linear system are needed for the system
to display chaotic behavior. Furthermore, when chaotic dynamics arise, even a very small
difference between two initial states can cause a very different trajectory. This means that
if we cannot measure the initial state exactly, our hope of predicting the long-term dynamics of
the system is lost. Of course, here we have shown simple mathematical equations leading to
chaotic behavior; the question therefore is whether chaos also arises in real natural systems.
The answer to this is yes; research has shown that the heartbeat follows an irregular
non-periodic pattern, and using an EEG it was shown that the brain also possesses chaotic
dynamics. Furthermore, in biological systems the population of particular kinds of flies also
shows chaotic behavior. And of course, to come back to Lorenz, the weather is unpredictable
since it is very sensitive to initial conditions. This sensitivity in chaos theory is often related
to the possibility that a butterfly in Japan can cause a tornado in Europe.
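Sensitivity to initial conditions can be reproduced with the logistic map itself. The sketch below mimics the rounding experiment described earlier: one run starts from a value with six decimals, the other from the same value rounded to three decimals. The function names, the choice r = 4 and the starting value are assumptions made for illustration.

```python
def trajectory(x0, r=4.0, steps=50):
    """Return the list x(0), x(1), ..., x(steps) of the logistic map."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def divergence(x0, decimals=3, r=4.0, steps=50):
    """Absolute gap between the run from x0 and the run from x0 rounded."""
    full = trajectory(x0, r, steps)
    rounded = trajectory(round(x0, decimals), r, steps)
    return [abs(a - b) for a, b in zip(full, rounded)]


if __name__ == "__main__":
    gaps = divergence(0.506127)
    for t in (0, 5, 10, 15, 20, 30):
        print(t, gaps[t])
```

The gap starts at roughly one part in ten thousand and, in the chaotic regime, grows until the two runs are no longer recognisably related.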
Figure 4.10: A plot of the value of the control parameter r between 3.741 and 3.745 to the
values which x will take after some transient period.
Instead of only disorder, we can also see ordered patterns in chaotic systems. One example
is the self-similar structure we see if we look at a bifurcation diagram at different scales.
Furthermore, if we look at the Lorenz attractor, we can see that not all states are possible;
the state trajectories all lie on a particular manifold (a subspace of the whole space). In
contrast, if we used a stochastic (non-deterministic) system, the trajectories would
fill up the whole phase diagram. In reality chaos therefore also displays order, which is also
the statement of Ilya Prigogine: “order out of chaos”.
4.3 Self-organising Biological Systems
Adaptive systems can be used fruitfully to model biological systems. We have already seen
that the model can consist of mathematical equations, but models can also have a spatial
configuration using individualistic models such as cellular automata. The advantage of using
individualistic models moving in a particular space is that there is an additional degree of
freedom for the physical space and therefore additional emergent patterns. By simulating
an individualistic model, it also becomes much easier to visualise processes such as the fire
propagation in forest fires. The disadvantage of spatial models compared to mathematical
equations is that they are much slower to simulate. Some examples of processes which
can be modelled are:
• Infectious diseases
• Forest fires
• Floods
• Volcano eruptions
• Co-evolving species
The first four processes mentioned above show a common aspect; they propagate themselves
over paths which depend on the environment. To stop the propagation, such paths should be
“closed”. This is essential for controlling these natural disasters, but will not be the issue in
this chapter.
4.3.1 Models of infectious diseases
We will look at two different models for simulating infectious diseases. In infectious diseases,
we can distinguish between three kinds of individuals in the population:
• Healthy individuals (H)
• Infected, sick individuals (S)
• Immune individuals which have had the disease (I)
If a healthy person comes into the neighbourhood of an infected individual, the healthy
person will also become infected in our model (although in reality this will only happen with
some probability). If an infected person has been sick long enough, that person becomes an immune
individual who is not sick anymore.
Mathematical model of infectious diseases
We can make a model using difference equations. We start with a state of the population
(H(0), I(0), S(0)), and use the following equations to determine the evolution of the
system:
\[ S(t + 1) = S(t) + aS(t)H(t) - bS(t) \]
\[ I(t + 1) = I(t) + bS(t) \]
\[ H(t + 1) = H(t) - aS(t)H(t) \]
Here we have two control parameters a and b: a determines how quickly healthy individuals
become infected and b how quickly sick individuals become immune. Note that the values H, I, S should not become
negative! If we examine the model, we can see that the number of immune individuals is
always increasing or stays equal. Therefore a stable attractor point is a situation with all
people immune to the disease. However, if the control parameter b is set to a very large value,
the population of sick people might decrease too fast and might become 0 before all healthy
people have become sick. Therefore other possible stable states include a number of healthy and
immune people. Also when there are no sick or immune people at all at the start, the stable point
would consist only of healthy people.
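The difference equations can be iterated directly. The following sketch makes two assumptions that are not stated in the text: H, I and S are treated as fractions of the population (so that they sum to 1), and both control parameters are restricted to the range 0 to 1, which is enough to keep all three values non-negative. It also assumes that the infection term in the equation for S carries the same factor a as the one in the equation for H, so that the population size is conserved. All names are illustrative.

```python
def step(h, i, s, a, b):
    """One update of the (H, I, S) difference equations."""
    new_infections = a * s * h   # healthy individuals that become sick
    recoveries = b * s           # sick individuals that become immune
    return h - new_infections, i + recoveries, s + new_infections - recoveries


def simulate(h0, i0, s0, a, b, steps):
    """Return the list of (H, I, S) states for t = 0 .. steps."""
    if not (0.0 <= a <= 1.0 and 0.0 <= b <= 1.0):
        raise ValueError("this sketch assumes 0 <= a <= 1 and 0 <= b <= 1")
    history = [(h0, i0, s0)]
    for _ in range(steps):
        history.append(step(*history[-1], a, b))
    return history


if __name__ == "__main__":
    for b in (0.1, 0.9):
        h, i, s = simulate(0.99, 0.0, 0.01, a=0.5, b=b, steps=300)[-1]
        print(f"b = {b}: healthy {h:.3f}, immune {i:.3f}, sick {s:.3f}")
```

With a small b the disease spreads through most of the population, whereas with a large b the sick individuals become immune before they can infect many others, so most people stay healthy.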
Cellular automaton model of infectious diseases
We can also use a cellular automaton (CA) in which we have to make rules to update the
state of the CA. Suppose we take the 2-dimensional CA with individuals as shown in Figure
4.11. Cells can be empty or be occupied by a sick, immune, or healthy person.
The CA also needs transition rules to change the state of the system. We can make the
following rules:
• If an H has an S in a cell next to it, the H becomes an S.
• An S has a chance at each time step to become an I.
• For navigation, all individuals make a random step at each time step.
Figure 4.11: The CA for infectious diseases. H = healthy person, I = immune person, S =
sick person
The second rule above uses a probability to change the state of a cell and navigation also uses randomness;
therefore this is an example of a stochastic cellular automaton. Finally, we can also devise
another navigation strategy so that healthy persons stay away from sick individuals. This
could lead to different evolving patterns where healthy persons are in one corner, far away
from the sick individuals.
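The three transition rules can be turned into a small stochastic CA. The sketch below is one possible reading of the rules, not the author's implementation: it assumes a neighbourhood of four cells (up, down, left, right), a grid that wraps around at the borders, and a movement rule in which an individual only moves if the randomly chosen neighbouring cell is empty. All names and parameter values are illustrative.

```python
import random

EMPTY, HEALTHY, SICK, IMMUNE = ".", "H", "S", "I"
NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def make_grid(size, n_healthy, n_sick, rng):
    """Place individuals on random distinct cells of a size x size grid."""
    cells = [(r, c) for r in range(size) for c in range(size)]
    rng.shuffle(cells)
    grid = [[EMPTY] * size for _ in range(size)]
    for k, (r, c) in enumerate(cells[:n_healthy + n_sick]):
        grid[r][c] = HEALTHY if k < n_healthy else SICK
    return grid


def step(grid, p_recover, rng):
    """Apply infection, recovery and random movement once; returns a new grid."""
    size = len(grid)
    new = [row[:] for row in grid]
    # Rules 1 and 2 are evaluated on the old grid (synchronous update).
    for r in range(size):
        for c in range(size):
            if grid[r][c] == HEALTHY:
                if any(grid[(r + dr) % size][(c + dc) % size] == SICK
                       for dr, dc in NEIGHBOURS):
                    new[r][c] = SICK
            elif grid[r][c] == SICK and rng.random() < p_recover:
                new[r][c] = IMMUNE
    # Rule 3: every individual tries one random step into an empty cell.
    occupied = [(r, c) for r in range(size) for c in range(size)
                if new[r][c] != EMPTY]
    rng.shuffle(occupied)
    for r, c in occupied:
        dr, dc = rng.choice(NEIGHBOURS)
        nr, nc = (r + dr) % size, (c + dc) % size
        if new[nr][nc] == EMPTY:
            new[nr][nc], new[r][c] = new[r][c], EMPTY
    return new


def counts(grid):
    """Number of healthy, sick and immune individuals on the grid."""
    flat = [cell for row in grid for cell in row]
    return {state: flat.count(state) for state in (HEALTHY, SICK, IMMUNE)}


if __name__ == "__main__":
    rng = random.Random(0)
    grid = make_grid(20, 150, 5, rng)
    for t in range(100):
        grid = step(grid, p_recover=0.1, rng=rng)
    print(counts(grid))
```

A navigation strategy in which healthy persons avoid sick ones could be tried by replacing the random choice of direction in the movement loop.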
4.4 Swarm Intelligence
It is well known that a large group of simple organisms such as ants or bees can show intelligent
behavior. The question is how this collective intelligent behavior emerges from simple individuals. In this section we will see examples of this phenomenon and how this self-organising
collective behavior can be used for making optimisation algorithms.
First we will look at some smart collective behaviors:
• Foraging behavior: individuals search for food and bring it to their nest
• Protection of the nest: individuals have received an altruistic and non-producing task
which helps the group to survive
• Building of a nest: E.g. how do termites construct their nest or how are honeycombs
made by bees.
• Stacking food and spreading it
It is clear that there is no super controller which sends the individuals messages telling them how to do
their task. Instead, the behaviors emerge from simple individual behaviors. E.g. if we
look at the process of creating honeycombs, then we can see that the structure emerges from
local interactions between the bees. Every bee creates a single cell in the wax by hollowing
out part of the space of the wax. Whenever a bee makes a cell it takes away parts of the
borders of the cell. When it feels that there is another bee working in the cell right next to it,
it stops taking wax out in the direction of that bee. In this way a hexagonal pattern emerges
with very similar cells (because bees have similar sizes), see Figure 4.12.
Figure 4.12: A honeycomb
It is also known that ants can solve particular problems, such as finding the shortest path
to a food pile, clustering or sorting food, and clustering dead ant bodies. Although a single
ant is not intelligent, the whole colony shows intelligent group behavior (super-intelligence).
4.4.1 Sorting behavior of ant colonies
When many ants die at the same time, the living group makes cemeteries of the dead ants by
stacking them all in the same place. How can this be done if single ants are not intelligent
enough to know where to put the dead ant they may be carrying? To explain this, we can
make a simple model with three rules:
• An ant walks in arbitrary directions
• Whenever an ant does not carry anything and finds a dead ant, it takes it and will carry
it to some different place
• Whenever an ant carries a dead ant and sees a pile of dead ants, it will drop the ant
near that pile
These three simple rules can explain the group behavior of sorting ants. A similar model can
be made to let ants make piles of sugar and chocolate. Since each ant is very simple, it would
take a long time until some organisation would emerge using a single ant. However, when
many ants are used, the self-organisation of matter in the space can occur within a very short
time. This is also a reason why some researchers investigate collective swarm robotics to
make many simple small robots collaborate to perform different tasks, instead of a
single large robot which has to do everything alone.
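The three rules of the cemetery model can be simulated on a grid. The sketch below is only one possible interpretation: it assumes a wrap-around grid with at most one dead ant per cell, that "seeing a pile" means having a dead ant in one of the four adjacent cells, and that a carried ant is dropped on the carrier's own (free) cell. The function names and the simple clustering measure are illustrative additions.

```python
import random

MOVES = ((-1, 0), (1, 0), (0, -1), (0, 1))


def simulate_sorting(size=20, n_dead=60, n_ants=10, steps=20000, seed=0):
    """Run the three-rule model; return (cells with a dead ant, number still carried)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(size) for c in range(size)]
    dead = set(rng.sample(cells, n_dead))
    ants = [{"pos": rng.choice(cells), "carrying": False} for _ in range(n_ants)]
    for _ in range(steps):
        for ant in ants:
            # Rule 1: walk in an arbitrary direction.
            dr, dc = rng.choice(MOVES)
            r, c = ant["pos"]
            pos = ((r + dr) % size, (c + dc) % size)
            ant["pos"] = pos
            if not ant["carrying"]:
                # Rule 2: an ant without load that finds a dead ant picks it up.
                if pos in dead:
                    dead.remove(pos)
                    ant["carrying"] = True
            elif pos not in dead:
                # Rule 3: a loaded ant on a free cell next to a dead ant drops its load.
                if any(((pos[0] + a) % size, (pos[1] + b) % size) in dead
                       for a, b in MOVES):
                    dead.add(pos)
                    ant["carrying"] = False
    return dead, sum(ant["carrying"] for ant in ants)


def clustered_fraction(dead, size):
    """Fraction of dead ants that have at least one dead ant in an adjacent cell."""
    if not dead:
        return 0.0
    touching = sum(any(((r + a) % size, (c + b) % size) in dead for a, b in MOVES)
                   for r, c in dead)
    return touching / len(dead)


if __name__ == "__main__":
    start, _ = simulate_sorting(steps=0)
    end, carried = simulate_sorting(steps=20000)
    print("clustered before:", clustered_fraction(start, 20))
    print("clustered after: ", clustered_fraction(end, 20), "carried:", carried)
```

Running the model with more ants for the same number of steps is a simple way to explore the claim that many simple agents organise the space faster than a single one.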
4.4.2 Ant colony optimisation
A new kind of multi-agent adaptive system for combinatorial optimisation was invented
by Marco Dorigo in the 1990s. In this algorithm, a colony of ants works together to find
solutions to difficult path-planning problems such as the traveling salesman problem. The
algorithm is inspired by how ant colonies work in reality. The foraging ants leave a chemical
substance known as pheromone on the ground when they go from their nest to a food source
and vice versa. Other foraging ants follow the paths with most pheromone according to a
probability distribution. While following these paths they strengthen them by leaving additional pheromone. This collective foraging behavior enables an ant colony to find the shortest
path between the nest and a food source. Optimisation algorithms which are inspired by the
collective foraging behavior of ants are called ant colony systems or simply ant algorithms.
We will first examine combinatorial optimisation problems, which form the class of
problems which are hard to solve and to which ant colony systems can be applied.
Combinatorial optimisation
Some problems take an exponential amount of time to solve. To get an idea of an exponential
problem, consider a solution that consists of n states where the time to solve it is 2^n or n!. An
example is to find a bitstring of only 1’s when the fitness is 0 for all solutions except for
the state with all 1’s, which gets a higher fitness (known as a needle in a haystack problem).
Exponential time requirements grow much faster than polynomial ones:
\[ \lim_{n \to \infty} \frac{n^p}{e^n} = 0 \]
where p is the degree of some polynomial function and e is the base of the natural logarithm. A number of
well known mathematical problems are called combinatorial optimisation problems; a subset
of these are NP-complete problems, which cannot be solved in polynomial time unless P =
NP. The question whether P = NP is known as one of the most important open questions in
computer science and optimisation. The interesting thing is that if one of these NP-complete
problems can be solved by some algorithm in polynomial time, all these problems can be
solved in polynomial time. So far, however, no polynomial time algorithm has been found to solve one
of these problems.
Since computer power does not increase faster than exponentially in time (Moore’s law states
that computer power doubles every two years), some big combinatorial optimisation problems
cannot in practice be solved optimally. Some examples of combinatorial optimisation problems are:
• The traveling salesman problem: find the shortest tour through n cities
• Quadratic assignment problem: minimize the flow (total distance which has to be travelled) if a number of employees have to visit each other daily in a building according
to some frequency. The total cost is the sum, over all pairs of people, of the distance between their
locations multiplied by the frequency of their visits. The problem requires assigning locations to all people to minimize
the total cost. This often involves putting people who meet each other frequently in
nearby locations.
• 3-satisfiability: Find truth-values for n propositions to make the following kind of formula true:
\[ \{x_1 \lor \lnot x_2 \lor x_4\} \land \ldots \land \{x_1 \lor \lnot x_5 \lor x_7\} \]
• Job-shop scheduling: Minimize the total time to do a number of jobs on a number of
machines where each job has to visit a sequence of machines in a specific order and each
machine can only handle one job at a time.
We will elaborate a bit on the traveling salesman problem here, since ant algorithms were
first used to solve this kind of problem. In the traveling salesman problem (TSP) there is a
salesman who wants to visit n cities and come back to his starting city. All cities i and j are
connected by a road of distance l(i, j). These lengths are represented in a distance matrix.
The agent must compute a tour which minimizes the total length. An example of a
tour with 8 cities with distance 31 is shown in Figure 4.13.
Figure 4.13: A tour in a traveling salesman problem (the eight edges of the tour have lengths 4, 3, 5, 4, 5, 4, 4 and 2)
How can we generate a tour for the traveling salesman problem? The constraints are that
all cities have to be visited exactly once and that the tour ends at the starting city. Now
we keep a set of all cities which have not been visited: J = {i | i is not visited}. In the
beginning J consists of all cities. After visiting a city, we remove that city from the set J.
The algorithm for making a tour now consists of the following steps:
1. Choose an initial city s0 and remove it from J
2. For t = 1 to n − 1:
(a) Choose city st out of J and remove st from J
3. Compute the total length of the tour:
\[ L = \sum_{t=0}^{n-2} l(s_t, s_{t+1}) + l(s_{n-1}, s_0) \]
Of course the most important thing is to make the rule for choosing the next city given the
current one and the set J. Different algorithms for computing tours can be compared by the
final value L returned by the algorithm (note that for very large problems, it is extremely
hard to find an optimal solution, so that an algorithm should just find a good one).
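The tour-construction steps above can be sketched with the choice rule passed in as a function, so that different rules can be compared by the tour length they produce. The two rules shown here (a uniformly random choice and a nearest-neighbour choice) and all function names are assumptions chosen for illustration; neither rule is the ant-based rule discussed later.

```python
import math
import random


def distance_matrix(points):
    """Euclidean distances l(i, j) between all pairs of (x, y) points."""
    return [[math.hypot(ax - bx, ay - by) for bx, by in points]
            for ax, ay in points]


def tour_length(tour, dist):
    """Length of the closed tour: consecutive legs plus the return to the start."""
    legs = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
    return legs + dist[tour[-1]][tour[0]]


def build_tour(dist, choose_next, start=0):
    """Keep the set J of unvisited cities and repeatedly pick the next city from it."""
    unvisited = set(range(len(dist)))   # the set J
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        city = choose_next(tour[-1], unvisited, dist)
        unvisited.remove(city)
        tour.append(city)
    return tour


def random_rule(rng):
    """Choice rule that picks any unvisited city with equal probability."""
    def choose(current, unvisited, dist):
        return rng.choice(sorted(unvisited))
    return choose


def nearest_neighbour(current, unvisited, dist):
    """Choice rule that picks the closest unvisited city."""
    return min(sorted(unvisited), key=lambda city: dist[current][city])


if __name__ == "__main__":
    rng = random.Random(0)
    points = [(rng.random(), rng.random()) for _ in range(8)]
    dist = distance_matrix(points)
    for name, rule in (("random", random_rule(rng)), ("nearest", nearest_neighbour)):
        tour = build_tour(dist, rule)
        print(name, tour, round(tour_length(tour, dist), 3))
```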
4.4.3 Foraging ants
One algorithm for making an adaptive rule for selecting the next city given the current one
and the set J is inspired by the collective foraging behavior of ants. We will first examine
why ants can find shortest paths from the nest to a food source. Let’s have a look at Figure
4.14. It shows two paths from the left to the right and ants approaching the point where they
need to choose one of them. In the beginning their choice will be completely random, so 50%
will take the upper path and the other 50% the lower path.
Figure 4.14: In the beginning the ant colony does not have any information about which path
to take
Now in Figure 4.15 it becomes clear that ants which took the lower (shorter) path will arrive at
the destination earlier than those which took the upper path. Therefore, as we can see in
Figure 4.16, the lower path will accumulate more pheromone and will be preferred by most
ants, leading to more and more strengthening of this path (see Figure 4.17).
Figure 4.15: Ants which take the lower path will arrive sooner at the destination
Figure 4.16: This causes more ants to follow the lower path
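The positive feedback between path choice and pheromone can be illustrated with a very small averaged model. This is a sketch under several assumptions that are not in the text: ants choose a path with probability proportional to its pheromone, the pheromone laid per time step on a path is the fraction of ants choosing it divided by the path length (ants on the shorter path complete their trips more often), and a fixed fraction of pheromone evaporates each step. The names and parameter values are illustrative.

```python
def two_path_model(len_short=1.0, len_long=2.0, evaporation=0.05, steps=200):
    """Return the fraction of ants choosing the short path at each time step."""
    tau_short = tau_long = 1.0   # equal pheromone: no information at the start
    history = []
    for _ in range(steps):
        p_short = tau_short / (tau_short + tau_long)
        history.append(p_short)
        # Evaporation, then deposits that are larger per time step on the shorter path.
        tau_short = (1 - evaporation) * tau_short + p_short / len_short
        tau_long = (1 - evaporation) * tau_long + (1 - p_short) / len_long
    return history


if __name__ == "__main__":
    history = two_path_model()
    for t in (0, 10, 50, 100, 199):
        print(t, round(history[t], 3))
```

In this model the choice starts at 50% for each path and drifts towards the shorter path as its pheromone level pulls ahead.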
Figure 4.17: The amount of pheromone keeps on strengthening more along the lower path
than along the upper path, therefore finally almost all ants will follow the lower path.
4.4.4 Properties of ant algorithms
There are multiple different ant algorithms, but they all share the following properties:
• They consist of a colony of artificial ants
• Ants make discrete steps
• Ants put pheromone on chosen paths
• Ants use the pheromone to decide which steps to make
Ant algorithms have been used for a wide variety of combinatorial optimisation problems
such as the traveling salesman problem, the quadratic assignment problem, and network
routing. The idea of letting individuals interact indirectly, through changes that one of them makes to the environment, is
called stigmergy. The first ant algorithm, the ant system, was initially tested on the TSP.
It works as follows:
• All N ants make a tour for which they use pheromone between cities to select the next
city
• All not followed edges lose a bit of pheromone due to evaporation
• All followed edges receive additional pheromone where edges belonging to shorter tours
receive more pheromone.
The ant-system was thereafter changed in some ways and this led to the ant colony system.
We will now give a formal description of the ant-colony system used for the traveling salesman
problem (although the ant-colony system is often called a meta-heuristic that includes many
possible algorithms and can be used for different problems).
The ant colony system consists of K ants. The amount of pheromone between two cities i
and j is denoted as m(i, j). For choosing the next city an additional heuristic is used which
is the inverse of the length between two cities:
\[ v(i, j) = \frac{1}{l(i, j)} \]
Now every ant k = 1, . . . , K makes a tour:
• Choose a random starting city for ant k : i = random(0, N ) and take the city out of
the set Jk of unvisited cities for ant k
• Choose next cities given the previous one according to:
\[ j = \begin{cases} \arg\max_{h \in J_k} \{ [m(i, h)] \cdot [v(i, h)]^{\beta} \} & \text{if } q \le q_0 \\ S & \text{else} \end{cases} \tag{4.1} \]
Here β is a control parameter, 0 ≤ q ≤ 1 is a random number, and the control parameter
0 ≤ q0 ≤ 1 determines the relative importance of exploration versus exploitation. If
exploration is used (q > q0), we generate S which is a city chosen according to the probability
distribution given by the following equation:
\[ p_{ij} = \begin{cases} \dfrac{[m(i, j)] \cdot [v(i, j)]^{\beta}}{\sum_{h \in J_k} [m(i, h)] \cdot [v(i, h)]^{\beta}} & \text{if } j \in J_k \\ 0 & \text{else} \end{cases} \tag{4.2} \]
Once all ants have made a tour, we can update the pheromone trails as follows. First
we determine which generated tour was the best one during the last generation; let’s call this
tour Sgb for global-best solution. This tour has length Lgb. Now the update rule looks as
follows:
\[ m(i, j) = (1 - \alpha) \cdot m(i, j) + \alpha \cdot \Delta m(i, j) \]
\[ \text{where } \Delta m(i, j) = \begin{cases} (L_{gb})^{-1} & \text{if edge } (i, j) \in S_{gb} \\ 0 & \text{else} \end{cases} \]
Here, α is a control parameter similar to the learning-rate. Note that the added pheromone
depends on the length of the best tour, and that pheromone on other edges evaporates.
This is one possible ant colony system; it is also possible to let the pheromone be adapted
to the best tour ever found, instead of the best tour of the last cycle. Other rules for
choosing paths are also possible, but the method given above usually works a bit better. Note
also that there are many parameters to set: α, β, q0 and the initial values for the pheromone.
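The formal description can be put together in a compact program. The following is a sketch under stated assumptions rather than a reference implementation: it assumes a symmetric distance matrix with positive distances between different cities, updates the pheromone with the best tour of the last generation as described above, and uses default parameter values that were picked only for illustration. The function and variable names are hypothetical.

```python
import math
import random


def ant_colony_system(dist, n_ants=10, n_iterations=50, alpha=0.1, beta=2.0,
                      q0=0.9, initial_pheromone=1.0, seed=0):
    """Sketch of the ant colony system for a symmetric TSP; returns (tour, length)."""
    rng = random.Random(seed)
    n = len(dist)
    m = [[initial_pheromone] * n for _ in range(n)]    # pheromone m(i, j)
    v = [[1.0 / dist[i][j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]                            # heuristic v(i, j) = 1 / l(i, j)

    def next_city(i, unvisited):
        cities = sorted(unvisited)
        weights = [m[i][h] * v[i][h] ** beta for h in cities]
        if rng.random() <= q0:                         # exploitation, equation (4.1)
            return cities[weights.index(max(weights))]
        return rng.choices(cities, weights=weights)[0]  # exploration, equation (4.2)

    best_tour, best_length = None, math.inf
    for _ in range(n_iterations):
        round_tour, round_length = None, math.inf
        for _ant in range(n_ants):
            start = rng.randrange(n)
            tour, unvisited = [start], set(range(n)) - {start}
            while unvisited:
                city = next_city(tour[-1], unvisited)
                unvisited.remove(city)
                tour.append(city)
            length = sum(dist[tour[t]][tour[(t + 1) % n]] for t in range(n))
            if length < round_length:
                round_tour, round_length = tour, length
        if round_length < best_length:
            best_tour, best_length = round_tour, round_length
        # Pheromone update with the best tour of this generation (both directions).
        edges = {(round_tour[t], round_tour[(t + 1) % n]) for t in range(n)}
        edges |= {(b, a) for a, b in edges}
        for i in range(n):
            for j in range(n):
                delta = 1.0 / round_length if (i, j) in edges else 0.0
                m[i][j] = (1 - alpha) * m[i][j] + alpha * delta
    return best_tour, best_length


if __name__ == "__main__":
    rng = random.Random(1)
    points = [(rng.random(), rng.random()) for _ in range(12)]
    dist = [[math.hypot(ax - bx, ay - by) for bx, by in points] for ax, ay in points]
    tour, length = ant_colony_system(dist)
    print(tour, round(length, 3))
```

Replacing `round_tour` by `best_tour` in the update step gives the variant in which the pheromone is adapted to the best tour ever found.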
4.5 Discussion
Biological systems differ from mechanical systems or thermodynamic systems since they are
able to take energy from the environment in order to decrease their internal entropy (state of
disorder). We have seen that there are dynamic systems which look very simple, but which
can lead to chaotic dynamics. An example is the logistic map, whose behavior depends on
the control parameter r. If we increase r we can see that instead of a single stable state, there
will arise bifurcations to periodic cycles of higher order, finally leading to chaotic dynamics.
Chaotic dynamics lead to unpredictable systems, since if we do not know the exact initial
condition of the system, the evolution of the system will create large discrepancies between
the predicted and the real observed behavior. Although chaotic systems are unpredictable,
they also show some kind of order which is seen from the emergence of manifolds on which
all points lie (such as in the Lorenz attractor) or the self-similar structure when one looks at
chaotic dynamics from different scales.
In biology, there are often simple organisms which can fulfil complex tasks. We have seen
that this intelligent collective behavior can emerge from simple individual rules. An example
of this is when ants build ant-cemeteries. Furthermore, this ability of swarm intelligence has
also inspired researchers to develop algorithms for solving complex problems. A well-known
example of this is the ant colony system which has been fruitfully used to solve combinatorial
optimisation problems such as the traveling salesman problem.
Chapter 5
Co-Evolution
Let us first consider the history of the earth. From the website “http://www.solstation.com/life.htm”
the following summary can be extracted:
Our solar system was born about 4.6 billion years ago. In this time protoplanets
agglomerated from a circum-Solar disk of dust and gas. Not long after that the
protoplanetary Earth was struck by a Mars-sized body to form the Earth and
Moon. Geologists have determined that the Earth is about 4.56 billion years old.
Initially, the Earth’s surface was mostly molten rock that cooled down due to the
radiation of heat into space, whereas the atmosphere consisted mostly of water
(H2O), carbon dioxide (CO2), nitrogen (N2), and hydrogen (H2) with only a little bit of oxygen (O2). Eventually a rocky crust was formed and some areas were
covered with water rich with organic compounds. From these organic compounds,
self-replicating, carbon-based microbial life developed during the first billion years
of Earth’s existence. The microbes spread widely in wet habitats and life diversified and adapted to new biotic niches, some on land, but life stayed single-celled.
After some time microbes were formed which produced oxygen and these became
widespread. Chemical reactions caused the production of ozone (O3 ) which protected carbon-based life forms from the Sun’s ultraviolet radiation. Although
the large concentration of CO2 caused the Earth to warm-up, the produced O2
caused a chilling effect and as a result the Earth’s surface was frozen for large
parts, although some prokaryotic microbial life survived in warm ocean seafloors,
near volcanos and other warm regions. Due to a large volcanic activity, the Earth
warmed up again, but leading to a different niche which led to heavy evolutionary
pressure. About 2.5 billion years ago some microbes developed a nucleus using
cellular membranes to contain their DNA (eukaryotes), perhaps through endosymbiosis in which different microbes merged to new life-forms. The first multi-cellular
life-forms (e.g. plants) evolved after 2.6 billion years of Earth’s existence. This
multi-cellularity allowed the plants to grow larger than their microbial ancestors.
Between 3.85 and 4.02 billion years after the birth of the solar system, there may
have been a cycle between ice climates and acid hothouses, leading to strong selective pressure. After a massive extinction, intense evolutionary pressure may
have resulted in a burst of multi-cellular evolution and diversity leading to the
first multi-cellular animals. After this, dinosaurs evolved; they may have become extinct 65 million years ago with the assistance of a large cometary impact.
The extinction of the dinosaurs created ecological conditions which eventually led
to the emergence of modern humans (Homo sapiens sapiens), which originated only
100,000 years ago.
What we can observe from the history of the Earth is that life adapts itself to the biological niche. If environmental circumstances are good for some organisms they can multiply,
but there have been many species which became extinct due to environmental conditions or
cometary impacts. The way that evolution works is therefore really governed by environmental selection; there is no optimisation but only adaptation to the environment.
5.1 From Natural Selection to Co-evolution
No biologist doubts that natural evolution has occurred and created the diversity of organisms
alive today. For the evolutionary theory there are enough indicative facts such as the existence
of DNA, organisms which have been shown to mutate themselves to cope with changing
environments, and observed links between different organisms in the phylogenetic tree.
The current debate is more on the question how evolution has come about and which
mechanisms play a role in evolutionary processes on a planetary scale. In Darwin’s evolutionary theory survival of the fittest plays an eminent role to explain the evolution of organisms.
We can explain the birth of this competitive mechanism by looking at a planet which is initially populated by some organisms of a specific type with a plentiful (though finite) amount of
nutrients for them to survive. As long as the initial circumstances are good, the population will grow. However, this growth will always lead to a situation in which there are so
many organisms that the resources (space, food) will become limited. If the resources are
scarce, not all individuals will be able to get enough food and multiply themselves. In such
a situation there will arise a competition for the resources and those organisms which are
best able to get food will survive and create offspring. The question is which organisms will
survive and reproduce. For this we have to examine their genetic material. The existence of
particular genes in an individual will give it an advantage and this allows such genes to be
reproduced. Therefore there will be more and more offspring which carry these genes
in their genetic material. Since the resources will usually not grow very much, the population
will not grow anymore and only the genetic material inside individual organisms will change.
Finally, it may happen that all organisms of the same population will resemble each other
very much, especially if the environmental conditions are the same over the whole planet.
However, if there are different biological niches, individuals may have adapted themselves to
their local niche, so that individuals of the same population will remain somewhat different.
Since mutation keeps on occurring during reproduction, it may happen that many mutations
after many generations create a new organism which does not resemble the original one. In
this way, multiple organisms can evolve and keep on adapting to their local niche.
Since evolution through natural selection is just a mechanism we can implement it in a
computer program. A known example of artificial evolution is the use of genetic algorithms.
In genetic algorithms, a fitness function is used to evaluate individuals. Such a fitness function
is designed a-priori by the programmer and determines how many children an individual can
obtain in a given population. Although these genetic algorithms are very good for solving
optimisation problems, they do not exactly resemble natural evolution. The problem is that
the fitness function is defined a-priori, whereas in natural evolution there is nobody who
determines the fitness function.
In reality the (implicit) fitness of an individual depends on its environment, in which
other species interact with it. Such a fitness function is therefore non-stationary and changes
according to the adaptations of different populations in the environment. Here we speak of
co-evolution. Co-evolutionary processes can be quite complex, since everything depends
on everything else. Therefore we have to look at the whole system or environment to study the
population dynamics.
5.2 Replicator Dynamics
We have already seen two different models for studying the dynamics of interacting species:
• With differential equations (mathematical rules which specify how the variables change).
An example of this are the Lotka-Volterra equations.
• With cellular automata
We can also generalise the Lotka-Volterra equations to multiple organisms; this is done
using the model of Replicator dynamics. We will first study a model in which the fitness
of an organism (phenotype) is fixed and independent of its environment. The replicator
equation describes the behavior of a population of organisms which is divided into n phenotypes
E1 , . . . , En . The relative frequencies of these phenotypes are denoted as x1 , . . . , xn , and so
we obtain a relative frequency vector $\vec{x} = (x_1, x_2, \ldots, x_n)$, where $\sum_i x_i = 1$. The fitness of a
phenotype Ei is fixed and is denoted as $f_i(\vec{x})$.
Now we can first compute the average fitness of a population using:
$$\hat{f}(\vec{x}) = \sum_{i=1}^{n} x_i f_i(\vec{x})$$
The change of the frequency of phenotype Ei is related to the difference in fitness of Ei
and the average of the population:
$$\frac{\partial x_i}{x_i} = f_i(\vec{x}) - \hat{f}(\vec{x})$$
Now we get the replicator equation with adaptation speed α (α can be seen as a time-operator dt after which we recompute the relative frequencies):
$$\Delta x_i = \alpha x_i (f_i(\vec{x}) - \hat{f}(\vec{x}))$$
If the fitness values of the existing phenotypes are different, the replicator equation will
change their relative frequencies. If the environment does not change from outside and
the fitness values of phenotypes remain constant, then the phenotype with the largest fitness
will take over the whole population. This assumption is of course unrealistic: the environment
and fitness values will change due to the changing frequencies.
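The replicator equation with fixed fitness values can be sketched in a few lines of Python. This is only a minimal illustration: the function name replicator_step, the three fitness values, the adaptation speed α = 0.1, and the number of iterations are assumptions chosen for the example and do not come from the model description.

```python
def replicator_step(x, f, alpha=0.1):
    """Apply one update: delta x_i = alpha * x_i * (f_i - f_hat)."""
    # Average fitness of the population, weighted by relative frequency.
    f_hat = sum(xi * fi for xi, fi in zip(x, f))
    return [xi + alpha * xi * (fi - f_hat) for xi, fi in zip(x, f)]


# Three phenotypes that start with equal relative frequencies.
x = [1 / 3, 1 / 3, 1 / 3]
f = [1.0, 2.0, 3.0]  # illustrative fixed fitness values (assumption)

for _ in range(500):
    x = replicator_step(x, f)

print(x)  # the frequency of the fittest phenotype approaches 1
```

Because the changes sum to zero whenever the frequencies sum to one, the vector stays a valid frequency vector, and the phenotype with the largest fixed fitness gradually takes over.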
Now we will look at a model for co-evolutionary replicator dynamics. Here we make the
fitness of a phenotype dependent on other existing phenotypes and the relative frequencies.
We do this by computing the fitness value at some time-step as follows:
$$f_i(\vec{x}) = \sum_{j=1}^{n} a_{ij} x_j$$
Here the values aij make up the fitness value of phenotype Ei in the presence of Ej . We can
immediately see that phenotypes can let the fitness of other phenotypes increase or decrease.
It can therefore happen that both aij and aji are positive and quite large. The result will
be that these species co-operate and obtain a higher fitness due to this co-operation. Since
we always compute relative frequencies with replicator dynamics, we do not always see this
co-operation in the values xi . However, in reality we may assume that both populations will
grow, although one may grow faster than the other.
On the other hand when aij and aji are negative and quite large, these species are competitive, and the one with the largest frequency will dominate and can make the other one
extinct (dependent on the rest of the environment of course).
Instead of two cooperating or competitive organisms, there can also be whole groups of
cooperating organisms which may compete with other groups. In this sense we can clearly
see the dependence of an organism on its environment.
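As a small illustration of frequency-dependent fitness, the following Python sketch iterates the replicator equation with fitness values computed from a matrix of values a_ij. The matrix entries (two competing phenotypes with a_12 = a_21 = -1), the starting frequencies, and the adaptation speed are assumptions made for this example only.

```python
def fitness(A, x):
    """Frequency-dependent fitness: f_i = sum_j a_ij * x_j."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def coevolution_step(A, x, alpha=0.1):
    """One replicator update using the current frequency-dependent fitness."""
    f = fitness(A, x)
    f_hat = sum(xi * fi for xi, fi in zip(x, f))
    return [xi + alpha * xi * (fi - f_hat) for xi, fi in zip(x, f)]


# Two competitive phenotypes: each lowers the fitness of the other.
A = [[0.0, -1.0],
     [-1.0, 0.0]]
x = [0.6, 0.4]  # phenotype 1 starts with the larger frequency

for _ in range(500):
    x = coevolution_step(A, x)

print(x)  # the initially more frequent phenotype dominates
```

With these illustrative values the phenotype that starts with the larger frequency drives the other one towards extinction, in line with the competitive case described above.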
5.3 Daisyworld and Gaia
In 1983, James Lovelock presented his Daisyworld model to explore the
relationship between organisms and their environment. Daisyworld is a computer model of an
imaginary planet consisting of white and black daisies. Daisies can change their environment,
reproduce, grow, and die. There is a global variable: the temperature of the planet, which
may slowly increase due to the radiation of an imaginary sun.
Now the temperature of the planet has an influence on the growth, reproduction, and
death of daisies. White daisies have a favourite temperature at which they grow and reproduce
fastest, and this temperature is higher than the favourite temperature of black daisies. As a
consequence, if the temperature of the planet increases, the population
of white daisies becomes bigger than the population of black daisies. If the planet’s
temperature did not stop increasing, however, the planet would become too hot for
any living organism to survive, leading to a planet without life-forms.
Due to the albedo effect of white daisies, however, the solar radiation is reflected which
causes the temperature of the planet to decrease when there are enough white daisies. Therefore when the planet is warmed up and there are many white daisies the planet will cool
down. If the white daisies would continue to decrease the planet’s temperature, the planet
would become too cold and all life forms would also become extinct.
However, black daisies absorb the heat of the sun and therefore they increase the temperature of the planet. Therefore, if the planet becomes colder, the number of black daisies
would become larger than the number of white daisies (since the black daisies’ favourite temperature for growth is lower), and the planet would become warmer again. This again leads
to a temperature which is closer to the favourite temperature of the white daisies so that the
population of white daisies would grow again and thereby cool down the planet.
Thus, we can see that in Daisyworld the daisies influence the environment, and the environment has an influence on the population growth of the daisies. The daisies are also
related, since if there would only be black daisies, the temperature could only increase so
that life becomes impossible. By increasing and decreasing the temperature of the planet, the
different daisy populations are linked to each other, leading to cooperative co-evolutionary
dynamics. Furthermore, since the daisies make the temperature suitable for both to survive,
they regulate the temperature, like a thermostat of a heater would regulate the temperature
of a room. Therefore we can see that there is a self-regulating feedback loop.
5.3.1 Cellular automaton model for Daisyworld
We can use a cellular automaton as a spatial model for Daisyworld. Each cell can be a black
or white daisy or a black or white daisy-seed. Furthermore, each cell has its local temperature.
Each cycle we can increase the temperature of all cells by, for example, one degree (of course
we can also decrease the temperature). If the temperature of each cell continued to increase,
it would eventually reach 100 degrees and all life-forms would die.
The rules of the CA look as follows:
• Black daisies have the highest probability of surviving at a temperature of 40 degrees, and
white daisies at 60 degrees. For each 20 degrees away from their favourite temperature, the
survival probability decreases by 50%.
• Black daisies increase the temperature of the 49 cells around their cell by 3 degrees. White
daisies cool down the 49 cells around them by 3 degrees.
• White daisies produce 6 seeds at random locations in their 25-cell neighbourhood, with
the highest probability (40%) at 60 degrees, and black daisies do the same at 40 degrees.
• Daisy seeds have a probability of 10% of dying each cycle. White (black) seeds become
white (black) daisies with the highest probability at 60 (40) degrees.
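Two of the rules above can be sketched in Python as follows. Several details are not specified by the rules and are assumptions of this sketch: the survival probability is halved for every 20 degrees of distance (interpolated smoothly in between), the maximum survival probability p_max is 1, the 49 cells are taken to be the 7 x 7 block centred on the daisy (including its own cell), and the grid wraps around at its borders. The function names are hypothetical.

```python
FAVOURITE = {"black": 40.0, "white": 60.0}  # favourite temperatures from the rules


def survival_probability(colour, temperature, p_max=1.0):
    """Halve the survival probability for every 20 degrees away from the favourite.

    p_max and the smooth interpolation between multiples of 20 degrees are
    assumptions; the rules only give the 50% decrease per 20 degrees.
    """
    distance = abs(temperature - FAVOURITE[colour])
    return p_max * 0.5 ** (distance / 20.0)


def apply_daisy_effect(temps, row, col, colour, radius=3, delta=3.0):
    """Heat (black) or cool (white) the 7 x 7 block centred on (row, col).

    The wrap-around (toroidal) grid is an assumption of this sketch.
    """
    sign = 1.0 if colour == "black" else -1.0
    n_rows, n_cols = len(temps), len(temps[0])
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            temps[r % n_rows][c % n_cols] += sign * delta


temps = [[50.0] * 10 for _ in range(10)]
apply_daisy_effect(temps, 5, 5, "black")   # warms 49 cells by 3 degrees
print(survival_probability("white", 40.0))  # 20 degrees from favourite
```

A full simulation would additionally need the seed and reproduction rules and the global temperature increase per cycle, which are left out here.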
We can see the Cellular Automaton model of Daisyworld in Figure 5.1.
Figure 5.1: A cellular automaton model of Daisyworld. At the right the average temperature
of the planet is shown and the temperature in all cells.
Now there are two evolutionary processes in this model: natural selection and self-regulation. Natural selection in Daisyworld takes place because there is competition between
the different daisy types, since there are limited resources (cells or space to grow). Now let’s examine what happens if we use mutation in the model. Mutation is an arbitrary small change
of a genotype of an organism. Such a small change results in a small change of the color,
which means a difference in the absorption or reflection of solar energy and therefore different
cooling or heating behaviors. In general a mutation can be good for an individual organism,
although most mutations are damaging or neutral. However, even if a mutation only gives an
advantage one in a million times, once it has occurred the new organism may quickly propagate
through the environment.
The most interesting aspect of Daisyworld is the self-regulation which appears to be at
a higher level than natural selection. This self-regulation is good for all individuals, because
it keeps the temperature of a planet at a level which makes life possible. Because this self-regulation
is good for all individuals, we might think that it is itself caused by natural
selection. However, in Daisyworld self-regulation is not participating in a competitive or
reproductive mechanism and therefore is not created by some form of higher level evolutionary
process. We can better say that natural selection prefers daisy properties and patterns which
lead to self-regulating dynamics.
5.3.2 Gaia hypothesis
In the early sixties, James Lovelock was working for NASA, which wanted to investigate
whether there was life on Mars. Lovelock wondered what kind of tests would be possible to
demonstrate the existence of life. Of course it would be possible to check the surface of Mars
and to look whether some organisms live there, but it would always be possible that no life
forms existed at the place where the spaceship landed, whereas life forms might
exist on other parts of the planet.
Lovelock thought about examining processes that reduce the entropy of the planet. This
can best be explained by looking at a beach. When we see a sand-castle on the beach, we
see a very ordered object which must have been constructed by life forms. On the other hand, if
there were no life forms on the beach, the surface of the sand on the beach would
be completely smooth and not contain any order. But how can this be measured, since not
all organisms make sand castles? Lovelock thought about the atmospheric conditions of the
planet. If we consider our planet, the Earth, then we can see that the constituents of the
atmosphere are very much out of equilibrium. For example, there is much too much oxygen
(O2) and much too little carbon dioxide (CO2). If we look at Venus, there is 98% carbon
dioxide and only a tiny bit of oxygen in the atmosphere. On Mars, there is 95% carbon dioxide
and 0.13% oxygen. If we compare this to the Earth, where there is 0.03% carbon dioxide
and 21% oxygen, we can see a huge difference. Lovelock explained this difference by the
existence of self-regulatory mechanisms of the biosphere on Earth, which he called Gaia. If
there were no life on Earth, the gases would react with each other and this would
lead to an equilibrium similar to that of Mars or Venus. However, since life forms regulate the
complete atmosphere, it can continuously stay far out of equilibrium and make life possible.
Lovelock predicted that because the planet Mars has an atmosphere which is in a chemical
equilibrium, there cannot be any life on Mars. On the other hand, because the atmosphere on
Earth is far out of equilibrium, there is a complex organising self-regulating force called Gaia
which makes life possible. Without this self-regulation the amount of carbon dioxide may
become much too large and heat up the planet, making life impossible. If one looks at the
mechanisms of Gaia, one can see a complex web consisting of bacteria, algae and green plants
which play a major role in transforming chemical substances so that life can flourish. In this
way Gaia has some kind of metabolism, keeping its temperature constant like a human does.
For example, if a human being is very cold, he begins to shiver; this causes movements of the
body and muscles which raise the body temperature. On the other hand, if a human
being is warm, he will perspire and thereby lose body heat. These mechanisms therefore
keep the temperature of a human more or less constant, and without them (e.g. without feeling
cold when it is very cold) people would have died out a long time ago.
The name Gaia refers to the Greek goddess Gaea, see Figure 5.2. Since the whole web
of organisms creates a self-regulating mechanism, one may speculate that this entire superorganism is alive as well. This led to three forms of the Gaia-hypothesis:
• Co-evolutionary Gaia is a weak form of the Gaia hypothesis. It says that life determines the environment through a feedback loop between organisms and the environment
which shape the evolution of both.
• Geophysiological Gaia is a strong form of the Gaia hypothesis. It says that the
Earth itself is a living organism and that life itself optimizes the physical and chemical
environment.
• Homeostatic Gaia is between these extremes. It says that the interactions between
organisms and the environment are dominated by mostly negative feedback loops and
some positive feedback loops that stabilize the global environment.
Figure 5.2: The Greek Goddess Gaea, or mother Earth.
There are many examples to demonstrate the homeostatic process of Gaia. Some of these
are:
• The amount of oxygen. Lovelock demonstrated that Gaia worked to keep the amount
of oxygen high in the atmosphere, but not too high so that a fire would spread too fast
and destroy too much.
• Temperature. The average ground temperature per year around the equator has been
between 10 and 20 degrees for more than a billion years. The temperature on Mars
fluctuates much more and is not suitable for life-forms (-53 degrees is much too cold).
• Carbon-dioxide. The stability of the temperature on the Earth is partially regulated
by the amount of carbon dioxide in the atmosphere. The decrease of heat absorption
of the Earth in some periods is caused by a smaller amount of carbon dioxide which is
regulated by life-forms.
In Figure 5.3 we can see that the temperature of the world has increased during the last
century. This may be caused by the large amount of fossil fuels burned during this period,
although differences in temperatures are also often caused by changes in the Earth’s orbit
around the sun. The Gaia hypothesis states that mankind cannot destroy life on Earth by
e.g. burning all fossil fuels, or using gases which deplete the ozone layer, since the metabolism
of the Earth is much too strong and some organisms will always survive. Even if we
were to use all nuclear weapons, we would not destroy all life forms, and life would continue,
albeit without human beings.
Figure 5.3: The northern hemisphere shows an increasing temperature during the last 100
years.
5.4 Recycling Networks
If there are multiple co-evolving organisms in an environment, they can also interact with
the available resources such as chemical compounds. It is possible that these organisms recycle
each other’s waste so that all compounds remain available in the environment. Of course
some of these processes will cost energy, which is usually obtained from the sun through e.g.
photosynthesis. Some other processes will create free energy for an organism, which it can
use to move or to reproduce.
An example of such a process is when we put plants and mammals together in an environment and make a simplified model:
• Plants transform CO2 into C and O2 molecules
• Mammals transform C and O2 into CO2 molecules
• External chemical reactions transform C and O2 into CO2
• Mammals can eat plants and thereby increase their mass with C molecules which they
store.
We can implement this model in a cellular automaton consisting of plants, mammals,
and molecules. In Figure 5.4 we show the simple model in which we use a layered cellular
automaton, one layer of the CA consisting of the positions of molecules and the other layer
consisting of plants and mammals. These two layers will interact on a cell by cell basis (for
simplicity mammals have the same size as molecules which is of course very unrealistic).
Figure 5.4: A layered cellular automaton for modelling a recycling network. (The figure shows a grid whose cells are labelled P, M, CO, or O.)
To make the CA model complete, we also need to model the amount of carbon (C) inside
plants and mammals. Therefore, Figure 5.4 does not show us the complete picture: there are
internal states of plants and mammals which model the amount of C molecules.
Furthermore, we need to make transition rules to let plants and mammals reproduce and
die. Mammals should also have the possibility to navigate on the grid and look for food.
We do not model these issues here, although an implementation of these rules would be
straightforward. Here we are more interested in examining the feedback loops in the model
which will create a recycling network.
If we examine this simple ecology consisting of plants, mammals, and chemical molecules,
we can see that the molecules will be recycled under good conditions. If they would not be
recycled then the mammals would die since there would not be any O2 molecules anymore
for them. We can see the following dynamics in this model:
• Without plants, all C and O2 molecules will be transformed to CO2 molecules. This
will lead to a stable chemical equilibrium where no reactions can take place anymore,
resulting in the death of all mammals.
• If there are many plants, the number of O2 molecules would grow, leading to fewer CO2
molecules for the plants, possibly also leading to the death of some plants. This will
cause the transformation of CO2 to C and O2 molecules done by the plants to become
much slower, and will give the external reactions and mammals the ability to create
more CO2 molecules, leading to a homeostatic equilibrium. If there would not be any
mammals it can be easily seen that there cannot be too many plants, because the speed
of the external reactions can be very slow. Therefore, the existence of mammals may
be profitable for plants (although the mammals also eat the plants).
• If there are many plants and mammals, they will quickly recycle the molecules. This
leads to a situation that even with few molecules, many plants and mammals can survive.
It should be noted that these amounts of plants and mammals depend heavily on each
other, but natural processes are likely to create a good situation.
• If there are too many mammals, many plants will be eaten. If this causes few plants
to survive, there will not be enough food for all mammals, causing many mammals to
die. The resulting growth and decline of the mammal population therefore makes it unlikely
that all plants will be eaten and that the mammals thereby cause their own extinction (this
is similar to predator-prey dynamics).
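The first of these dynamics can be illustrated with a very small, well-mixed Python sketch that only tracks molecule counts instead of a full cellular automaton. Everything quantitative here is an assumption made for illustration: the rate constants, the fixed numbers of plants and mammals, the starting counts, and the omission of carbon stored inside organisms, eating, reproduction, and death.

```python
def recycle_step(state, plants, mammals,
                 plant_rate=0.01, mammal_rate=0.01, external_rate=0.001):
    """Advance the molecule counts (co2, c, o2) by one time step."""
    co2, c, o2 = state
    # Plants transform CO2 into C and O2.
    split = min(co2, plant_rate * plants * co2)
    # Mammals and external reactions transform C and O2 back into CO2.
    combine = min(c, o2, (mammal_rate * mammals + external_rate) * min(c, o2))
    return (co2 - split + combine, c + split - combine, o2 + split - combine)


def simulate(state, plants, mammals, steps=2000):
    """Iterate recycle_step with fixed population sizes (an assumption)."""
    for _ in range(steps):
        state = recycle_step(state, plants, mammals)
    return state


start = (100.0, 50.0, 50.0)  # illustrative counts of CO2, C and O2
no_plants = simulate(start, plants=0, mammals=10)
with_plants = simulate(start, plants=10, mammals=10)
print(no_plants)    # C and O2 are used up, only CO2 remains
print(with_plants)  # all three kinds of molecules remain available
```

Without plants the system runs into the chemical equilibrium in which only CO2 is left, whereas with both plants and mammals the molecules keep being recycled and settle at a balance in which all three remain present.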
Recycling networks are very important for Gaia and co-evolutionary systems. For example
in Gaia many molecules are recycled causing almost optimal conditions for life. One example
is the amount of salt in the seas. When this becomes too large, almost all sea life-forms
will dry out and die. However, every year a lot of salt is moved from the land to the seas
which might easily lead to very large concentrations of salt in the sea. It has been shown by
Lovelock that the sea functions as a kind of salt pump keeping the concentration of salt at
levels which are advantageous for life forms.
Also in rain-forests the plants and trees cause a very efficient recycling of water molecules.
In this way, even with a small amount of H2 O molecules many plants and trees can survive.
Furthermore this recycling and co-evolutionary dynamics also causes a profitable temperature
for the plants and trees which makes it possible to have rain-forests in hot countries that create
their own local redistribution of water.
5.5 Co-evolution for Optimisation
Co-evolutionary processes are not only important for studying population dynamics in ecologies, but can also be used for making optimisation algorithms. We already studied genetic
algorithms which can be used for searching for optimal (or near-optimal) solutions for many
different problems for which exhaustive search would never work.
There is a growing body of research on using co-evolution to improve the ability of
genetic algorithms to find solutions. The idea relies on evolving a population of individuals
to solve some problem which can be described by a large set of tests which an individual
should pass. If the tests are not clearly specified, they can also be evolved by evolutionary
algorithms. An example is learning to play backgammon. If you want to be sure that your
individual, which encodes a backgammon playing program, is very good at backgammon,
you want to test it against other programs. When the other programs are not available,
you can evolve these programs. The individual which plays best against these test-programs
(some call them parasites since they are used to kill individuals by determining their fitness)
may reproduce in the learner population. The tests which are good for evaluating learners
can also reproduce to create other tests. This is then a co-evolutionary process and makes
sense since there is no clear fitness function to specify what a good backgammon player is.
We will now examine a specific problem in which a solution must be able to solve a
specific task such as sorting a series of numbers. In principle there are many instantiations
of the sorting problem, since we can vary the numbers, how many numbers there are, their
initial order, etc. So suppose we take N instantiations of the sorting problem and keep these
fixed (like we would do with normal evolutionary computation). Now we can use as a fitness
function the number of instantiations of the sorting task which are solved by an individual.
The problem with this is that it can cost a lot of time to evaluate all individuals on all N
tasks if N is large. And if we take the number of instantiations too low, maybe we evolve
an individual which can sort these instantiations of the sorting problem, but performs poorly
on other sorting problems. Furthermore, it is possible that the best individuals always are
able to sort the same 0.7N problems and never the others. In this case there is not a good
gradient (search direction) for further evolution.
Co-evolution for optimisation. A solution to these problems is to use co-evolution
with learners (the individuals which need to solve the task) and problem-instantiations (the
parasites or the tests). There are K tests which can be much smaller than the N tests we
needed for a complete evaluation in normal evolution, since these K tests also evolve. There
are also L learners which are tested on the tests (can be all tests, but might also be a part of
all tests). The fitness of a learner is higher if it scores better on the tests it is evaluated on.
This creates improving learners, but how can we evolve the test-individuals?
An initial idea would be to make the fitness of a test higher when fewer learners can solve it
(we will later examine the problems of this method for assigning such fitness values to tests).
In this way, the learners and tests will co-evolve. The parasites make the tests harder and
harder and the individuals have to solve these increasingly difficult tests.
A problem of the above fitness definition of tests is that it becomes possible that only
tests remain which cannot be solved by any learner. This leads to all learners having the
same fitness and would stop further evolution. Therefore it is much better to let the fitness
of a test depend on the way it can differentiate between different learners. In this way when
a test is solved by all learners or is not solved by any learner, the test is basically useless at
the current stage of evolution and will get a low fitness so that it is not allowed to stay in
the population or to reproduce. If two tests make exactly the same distinctions between all
learners, it is possible to reduce the fitness of one of them since they would essentially encode
the same distinction.
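One possible way to make this idea concrete is to count, for each test, the number of pairs of learners on which the test gives different outcomes. The Python sketch below does this; counting distinguished pairs is an assumption of this example (the text only requires that the fitness reflects how well a test differentiates between learners), and the function name and the result table are hypothetical.

```python
from itertools import combinations


def distinction_fitness(results):
    """Return, per test, the number of learner pairs the test tells apart.

    results[i][j] is True if learner i solves test j.
    """
    n_tests = len(results[0])
    fitness = []
    for j in range(n_tests):
        outcomes = [row[j] for row in results]
        # A pair is distinguished if exactly one of the two learners passes.
        fitness.append(sum(a != b for a, b in combinations(outcomes, 2)))
    return fitness


# Three learners evaluated on three tests (illustrative outcomes).
results = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
]
print(distinction_fitness(results))  # [0, 2, 0]
```

The first test is solved by every learner and the third by none, so both receive fitness 0; only the second test separates the learners. Reducing the fitness of tests that make identical distinctions would be a further step that is not shown here.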
Pareto-front in co-evolutionary GA. If we have a number of learners with their result
on all tests, we want to examine which learners are allowed to reproduce themselves. For this
we will examine the Pareto-front of individuals which means the set of individuals which are
not dominated by any other individual. When a learner passes a number of tests and another
learner passes the same number of tests, but also an additional one, it is not hard to see that
the second learner performs strictly better than the first one. In this case we say that the
first learner is dominated by the second one. We can make this more formal by the following
definition, where fi (j) is the fitness of learner i on test j.
We define:
$$\mathrm{dominates}(k, i) = \forall j : f_i(j) \leq f_k(j) \;\wedge\; \exists l : f_i(l) < f_k(l)$$
So dominates(k, i) says that learner i is dominated by learner k. Now we define the Pareto-front as all learners which are not dominated by any other learner. We then let only the learners
in the Pareto-front reproduce and eliminate all other learners, which are dominated by some
other learner.
This Pareto-front optimisation is also a widely used and effective method for multi-objective optimisation, in which there are several criteria to evaluate an individual.
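The definition of dominance and the selection of the Pareto-front can be sketched in Python as follows. The score table is an illustrative assumption, with scores[i][j] playing the role of f_i(j), the fitness of learner i on test j; learners are numbered from 0 in the code.

```python
def dominates(f_k, f_i):
    """True if the learner with scores f_k dominates the learner with scores f_i."""
    no_worse = all(a <= b for a, b in zip(f_i, f_k))
    strictly_better = any(a < b for a, b in zip(f_i, f_k))
    return no_worse and strictly_better


def pareto_front(scores):
    """Indices of all learners that are not dominated by any other learner."""
    return [
        i for i, f_i in enumerate(scores)
        if not any(dominates(f_k, f_i) for k, f_k in enumerate(scores) if k != i)
    ]


# Four learners scored on four tests (1 = passed, 0 = failed).
scores = [
    [1, 1, 0, 0],  # dominated by learner 1
    [1, 1, 1, 0],
    [0, 0, 0, 1],  # passes a test nobody else passes
    [1, 0, 0, 0],  # dominated by learners 0 and 1
]
print(pareto_front(scores))  # [1, 2]
```

Learner 2 passes only one test but stays in the front, because no other learner passes that test; this is how the Pareto-front preserves learners with different strengths.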
5.6 Conclusion
In this chapter we studied co-evolutionary processes which are important in natural evolution.
We have seen that instead of Darwin’s survival of the fittest, there can be groups of cooperating
organisms which struggle for the same spatial resources, but which may help each other to
survive at the same time. We also looked at the methods that life-forms use to alter their
environment. There are many mechanisms which keep the environment profitable for life to
sustain itself. Lovelock studied this complex web of many coupled processes for the first time
and called this entire mechanism of a homeostatic Earth Gaia. There are many examples
of Gaian processes, and in this chapter we examined only a few important ones. Gaian
homeostasis also relies on recycling networks in which chemical compounds are transformed
through a sequence of different organisms so that resources never become depleted. This is
very important, since if some compound were lost, the whole recycling network might
starve since its required resources would not be available. Finally we have examined how co-evolution can be used in evolutionary computation to make the search for optimal solutions
different and for some problems more successful than the search process of normal evolutionary
algorithms. Here learners and tests evolve together to allow learners to become better in
solving the tests, and the tests to create harder and harder problems while still being able to
differentiate between learners.
It is important that we look at the co-evolutionary mechanisms when we use the Earth’s
resources and kill organisms. Particular organisms may play very important roles to keep
the homeostatic equilibrium of the Earth or of a local environmental niche. Since we cannot
study a single organism alone, apart from its environment, we again need a holistic approach
in which all elements are studied in a total perspective.
Chapter 6
Unsupervised Learning and Self Organising Networks
Unsupervised learning is one of the three forms of machine learning: supervised, unsupervised,
and reinforcement learning. The special aspect of unsupervised learning is that there are only
input (sensory) signals and no desired outputs or evaluations of actions to specify an optimal
behavior. In unsupervised learning it is more important to deal with input signals and to
form a meaningful representation of these. E.g. if we look at different objects, we can cluster
objects which look similar together. This clustering does not take into account what the
label of an object is, and therefore trees and plants may be grouped together in unsupervised
learning, whereas in supervised learning we may want to map inputs to the label plant or tree.
It is also possible that particular plants form their own cluster (or group), such as cactuses.
This grouping is based only on their input representation (e.g. their visual input), and not
on any a priori specified target concept which we want to learn. We may therefore
start to think that unsupervised learning is less important than supervised learning, but this
is not true since they have different objectives. Unsupervised learning has an important task
in preprocessing the sometimes high-dimensional input and can therefore be used to make the
supervised learning task simpler. Furthermore, supervised learning always requires labelled
data, but labelled data is much harder to obtain than unlabelled data. Therefore unsupervised
learning can be applied continuously without the need for a teacher. This is a big advantage,
since it makes continual life-long learning possible.
The second topic in this chapter is self-organising networks, often called self-organising
maps (SOMs). These SOMs are neural networks and can be applied for unsupervised learning
purposes and can be easily extended for supervised and reinforcement learning problems. In
principle a SOM consists of a number of neurons that have a position in the input space. Every
time a new input arrives, the SOM computes distances between the input and all neurons,
and thereby activates those neurons which are closest to the input. This idea of looking at
similarities is a very general idea for generalization, since we usually consider objects that
look alike (have a small distance according to some distance measure) to be of the same group
of objects. To train the SOM, activated neurons are brought closer to the generated inputs,
in order to minimize the distance between generated inputs and activated neurons. In this
way a representation of the complete set of inputs is formed in which the distance between
generated inputs and activated neurons is slowly minimized. By using SOMs in this way, we
can construct a lower dimensional representation of a continuous, possibly high-dimensional
input space. E.g. if we consider faces with different orientations as input, the input-space is
high-dimensional, but activated neurons in the SOM essentially represent the orientation of
the face which is of much smaller dimensionality.
6.1 Unsupervised Learning
In unsupervised learning the program receives at each time-step an input pattern $x^p$ which is
not associated with a target concept. Therefore all learned information must be obtained from
the input patterns alone. Possible uses of unsupervised learning are:
• Clustering: The input patterns are grouped into clusters where input patterns inside a
cluster are similar and input patterns between clusters are dissimilar according to some
distance measure.
• Vector quantisation: A continuous input-space is discretized.
• Dimensionality reduction: The input-space is projected to a feature space of lower
dimensionality while still containing most information about the input patterns.
• Feature extraction: particular characteristic features are obtained from input patterns.
6.1.1 K-means clustering
One well-known clustering method is called K-means clustering. K-means clustering uses K
prototypes which will form K clusters of all input patterns. In principle K-means clustering
is a batch learning method, which means that all the data should be collected beforehand, after which the
algorithm is executed once on all this data. Running the algorithm on this data creates a
specific set of clusters. If another input pattern is collected, the algorithm has to be executed
again on all data examples, which can therefore cost more time than online clustering methods.
K-means clustering is usually executed on input patterns consisting of continuous attributes, although it can be extended to patterns partly consisting of nominal or ordinal
attributes.
The K-means algorithm uses K prototype vectors $w^1, \ldots, w^K$, where each prototype
vector is an element of $\mathbb{R}^N$ and N is the number of attributes describing an input pattern.
Each prototype vector $w^i$ represents a cluster $C^i$, which is the set of input patterns that are
elements of that cluster. So the algorithm partitions the data into the K clusters. We assume
that there are n input patterns (examples) denoted as $x^1, \ldots, x^n$.
The algorithm works as follows:
• Initialize the weight-vectors $w^1, \ldots, w^K$.
• Repeat the following steps until the clusters do not change anymore:
1. Assign all examples $x^1, \ldots, x^n$ to one of the clusters. This is done as follows:
An example $x^i$ is an element of cluster $C^j$ if the prototype vector $w^j$ is closer to
the input pattern $x^i$ than all other prototype vectors:
$$d(w^j, x^i) \leq d(w^l, x^i) \quad \text{for all } l \neq j$$
6.2. COMPETITIVE LEARNING
93
The distance d(x, y) between two vectors is computed using the Euclidean distance
measure:
s
d(x, y) =
X
i
(xi − yi )2
In case the distances to multiple prototype vectors are exactly equal, the example
can be assigned to a random one of these.
2. Set the prototype vector to the center of all input patterns in the corresponding
cluster. So for each cluster C j we compute:
wij =
P
k
k∈C j xi
|C j |
Where |C| denotes the number of elements in the set C.
An example of K-means clustering. Suppose we have four examples consisting of
two continuous attributes. The examples are: (1,2); (1,4); (2,3); (3,5).
Now we want to cluster these examples using K = 2 clusters. We first initialize these
clusters, suppose that w1 = (1, 1) and w2 = (3, 3). Now we can see that if we assign the
examples to the closest prototypes, we get the following assignment:
(1, 2) → 1
(1, 4) → 2
(2, 3) → 2
(3, 5) → 2
Now we compute the new prototype vectors and obtain: w1 = (1, 2) and w2 = (2, 4). We
have to repeat the process to see whether the clusters stay the same after the prototype vectors
have changed. If we repeat the assignment of examples to clusters, we can see that the examples
stay in the same clusters, and therefore we can stop (continuing would not change anything).
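The two steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `k_means`, the `max_iter` safeguard, the rule that ties go to the lowest prototype index, and the rule that an empty cluster keeps its old prototype are all assumptions that the text does not prescribe.

```python
import math

def k_means(examples, prototypes, max_iter=100):
    """Batch K-means: alternate assignment and prototype update until stable."""
    prototypes = [tuple(w) for w in prototypes]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each example to its closest prototype.
        # Ties go to the lowest index here (the text allows a random choice).
        new_assignment = [
            min(range(len(prototypes)), key=lambda j: math.dist(prototypes[j], x))
            for x in examples
        ]
        if new_assignment == assignment:
            break  # the clusters did not change, so we can stop
        assignment = new_assignment
        # Step 2: move each prototype to the mean of its cluster.
        for j in range(len(prototypes)):
            members = [x for x, a in zip(examples, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old prototype
                prototypes[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return prototypes, assignment

# The worked example from the text: four examples, K = 2.
examples = [(1, 2), (1, 4), (2, 3), (3, 5)]
prototypes, assignment = k_means(examples, [(1, 1), (3, 3)])
print(prototypes)   # final prototype vectors
print(assignment)   # cluster index (0-based) for each example
```

Cluster indices in the code are 0-based, so index 0 corresponds to cluster 1 in the text.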
6.2
Competitive Learning
K-means clustering works on a given collection of data and when the data changes, the algorithm has to be executed again on all examples. There also exist a number of online clustering
approaches which are based on artificial neural network models. Competitive learning is one
of these methods and partitions the data into specific clusters by iterating an update rule a
single time each time a new input pattern arrives. Therefore these online competitive learning
algorithms are more suitable for changing environments, since they can change the clusters
online according to the changing distributions of input patterns. Again only input patterns
xp are given to the system. The system consists of a particular neural network as a representation of the clustering. The network propagates the input to the top where an output is
given which tells us in which cluster an input pattern falls. Like in the K-means algorithm,
the number of clusters should be given to the system and usually stays fixed during learning.
In a simple competitive learning network all inputs are connected to all outputs representing the clusters, see Figure 6.1. The inputs describe a specific input pattern and when
given these inputs, the competitive learning network can easily compute in which cluster the
input falls.
Figure 6.1: In a competitive network all input units i are connected to the output units o
through a set of weights $w^o_i$.
6.2.1
Normalised competitive learning
There are two versions of the competitive learning algorithm, the normalised and unnormalised versions. We first examine the normalised version which normalises all weight vectors
and input vectors to a length of 1. Normalising a vector v means that its norm ||v|| will be
one. The norm of a vector is computed as:
$$\|v\| = \sqrt{v_1^2 + v_2^2 + \ldots + v_N^2} = \sqrt{\sum_{i=1}^{N} v_i^2}$$
Basically the norm of a vector is its Euclidean distance to the origin of the coordinate system.
This origin is a vector with only 0’s. Normalising a vector is then done by dividing a vector
by its norm:
$$x_{norm} = \frac{x}{\|x\|}$$
So if all vectors are normalised, all weight vectors (each output unit has one weight vector
which determines how it will be activated by an input pattern) will have length 1, which
means that they all fall on a circle when there are 2 dimensions (N = 2). Therefore, the
weights can only move on the circle.
So, how do we adapt the weight vectors? Just as in the K-means algorithm, we initialize the weight vectors for the chosen number of clusters (represented by as many output
units). Then, the normalised competitive learning algorithm performs the following steps
after receiving an input pattern:
• Each output unit o computes its activation $y^o$ by the dot- or inner-product:
$$y^o = \sum_i w^o_i x_i = w^o \cdot x$$
• Then the output neuron k with the highest activation will be selected as the winning
neuron:
$$\forall o \ne k: \quad y^o \le y^k$$
• Finally, the weights of the winning neuron k will be updated by the following learning
rule:
$$w^k(t+1) = \frac{w^k(t) + \gamma (x(t) - w^k(t))}{\| w^k(t) + \gamma (x(t) - w^k(t)) \|}$$
The divisor in the fraction makes sure that the weight vector remains normalised.
The mechanism of normalised competitive learning causes the winning weight-vector to
turn towards the input pattern. This causes weight-vectors to point to regions where there
are many inputs, see Figure 6.2.
Figure 6.2: In a normalised competitive network, the weight-vectors (w1, w2 and w3 in the
figure) will start to point to clusters with many inputs.
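One update of normalised competitive learning could be sketched as follows. The helper names `normalise` and `normalised_step` and the example vectors are illustrative assumptions, and the sketch does not guard against the degenerate case in which the moved weight vector has length zero.

```python
import math

def normalise(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def normalised_step(weights, x, gamma):
    """One online update; returns the index of the winning output unit."""
    x = normalise(x)
    # Activation of each output unit: dot product of its weight vector and x.
    activations = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    k = max(range(len(weights)), key=lambda o: activations[o])
    # Move the winner towards x, then divide by the norm to stay on the unit circle.
    moved = [wi + gamma * (xi - wi) for wi, xi in zip(weights[k], x)]
    weights[k] = normalise(moved)
    return k

# Two hypothetical unit-length weight vectors and one input pattern.
weights = [normalise([1.0, 0.0]), normalise([0.0, 1.0])]
winner = normalised_step(weights, [1.0, 3.0], gamma=0.5)
print(winner, weights)
```

Only the winning weight vector turns towards the input; the other weight vector is left untouched.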
If we did not use normalised weight vectors, there would be a problem with this
algorithm, which is illustrated in Figure 6.3. It shows that if weight-vectors differ
in length, longer weight vectors would win against shorter ones, since their dot-product with
input vectors is larger, even though their (Euclidean) distance to an example is larger.
Figure 6.3: (A) With normalised weight vectors the algorithm works appropriately. (B) If the
weight vectors were not normalised, we would get undesirable effects, since larger weight
vectors would start to win against small weight vectors. In both panels the figure marks
neuron 1 as the winner for the input x.
6.2.2
Unnormalised competitive learning
Instead of using the dot-product between two vectors to determine the winner for which we
need normalised vectors, we can also use the Euclidean distance to determine the winning
neuron. Then we do not need normalised weight vectors anymore, but we will deal with
unnormalised ones. So in this case all weight-vectors are again randomly initialised and we
determine the winner with the Euclidean distance:
$$\text{Winner } k: \quad \|w^k - x\| \le \|w^o - x\| \quad \forall o.$$
So here we take the norm of the difference between two vectors, which is the same as taking the
Euclidean distance d(wk , x). The neuron with the smallest distance will win the competition.
If all weight-vectors are normalised, this will give us the same results as computing the winner
with the dot-product, but if the vectors are not normalised different results will be obtained.
After determining the winning neuron for an input vector, we move that neuron closer to
the input vector using the following learning rule:
$$w^k(t+1) = w^k(t) + \gamma (x(t) - w^k(t)) \qquad (6.1)$$
where 0 ≤ γ ≤ 1 is a learning rate which determines how far the neuron moves towards
the pattern. If γ = 1 the neuron jumps onto the input vector, so when learning continues
there will be a lot of jumping around. When the learning rate is decreased as more
updates are made, a real “average” of the represented input patterns can be learned.
Example unnormalised competitive learning. Suppose we start with K = 2 neurons
with initialized weight-vectors: w1 = (1, 1) and w2 = (3, 2). Now we receive the following
four examples:
x1 = (1, 2)
x2 = (2, 5)
x3 = (3, 4)
x4 = (2, 3)
When we set the learning rate γ to 0.5, the following updates will be made:
On $x^1 = (1, 2)$: $d(w^1, x^1) = 1$, $d(w^2, x^1) = 2$. Therefore: Winner $w^1 = (1, 1)$. Application
of the update rule gives:
$w^1 = (1, 1) + 0.5((1, 2) - (1, 1)) = (1, 1.5)$.
On $x^2 = (2, 5)$: $d(w^1, x^2) = \sqrt{13.25}$, $d(w^2, x^2) = \sqrt{10}$. Therefore: Winner $w^2 = (3, 2)$.
Application of the update rule gives:
$w^2 = (3, 2) + 0.5((2, 5) - (3, 2)) = (2.5, 3.5)$.
On $x^3 = (3, 4)$: $d(w^1, x^3) = \sqrt{10.25}$, $d(w^2, x^3) = \sqrt{0.5}$. Therefore: Winner $w^2 = (2.5, 3.5)$.
Application of the update rule gives:
$w^2 = (2.5, 3.5) + 0.5((3, 4) - (2.5, 3.5)) = (2.75, 3.75)$.
Now try it yourself on the fourth example.
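The updates of this example can be reproduced with a short Python sketch, which can also be used to check your own answer for the fourth example. The function name `competitive_step` is an assumption, and ties between equally close neurons are resolved in favour of the lowest index.

```python
import math

def competitive_step(weights, x, gamma):
    """Apply Equation 6.1 to the closest weight vector; return the winner index."""
    k = min(range(len(weights)), key=lambda o: math.dist(weights[o], x))
    weights[k] = [wi + gamma * (xi - wi) for wi, xi in zip(weights[k], x)]
    return k

weights = [[1.0, 1.0], [3.0, 2.0]]           # initial w1 and w2 from the example
examples = [(1, 2), (2, 5), (3, 4), (2, 3)]   # x1 .. x4
winners = []
for x in examples:
    k = competitive_step(weights, x, gamma=0.5)
    winners.append(k)
    print(f"x = {x}: winner is neuron {k + 1}, weights are now {weights}")
```

Neuron indices are 0-based in the code, so the printout adds 1 to match the numbering in the text.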
Initialisation
A problem of the recursive (online) clustering methods, which also holds for the K-means
clustering algorithm, is a possibly poor initialisation of the weight vectors of the neurons.
As a result, it can happen that some neuron never becomes a winner and therefore never learns.
In that case we are basically dealing with a dead (or silent) neuron and have one cluster less
in our algorithm. To deal with this problem, there are two methods:
• Initialise a neuron on some input pattern
• Use “leaky learning”. For this we let all neurons adapt on all examples, although we
use a very small learning rate for this adaptation so that this will only make a difference
in the long run. The leaky learning rule adapts all neurons (except for the winning
neuron) to the current example with a very small learning rate $\gamma' \ll \gamma$:
$$w^l(t+1) = w^l(t) + \gamma' (x(t) - w^l(t)), \quad \forall l \ne k$$
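The effect of leaky learning on a dead neuron can be illustrated with a small sketch. The function name `leaky_step`, the initial weight (100, 100), the leak rate 0.001 and the number of passes are all made-up values chosen only to make the effect visible.

```python
import math

def leaky_step(weights, x, gamma, gamma_leak):
    """The winner moves with rate gamma; every other neuron with rate gamma_leak."""
    dists = [math.dist(w, x) for w in weights]
    k = dists.index(min(dists))
    for l, w in enumerate(weights):
        rate = gamma if l == k else gamma_leak
        weights[l] = [wi + rate * (xi - wi) for wi, xi in zip(w, x)]
    return k

# Neuron 2 starts far away from all inputs, so at first it never wins.
weights = [[1.0, 1.0], [100.0, 100.0]]
examples = [(1, 2), (2, 5), (3, 4), (2, 3)]
for _ in range(1000):                 # 1000 passes over the four examples
    for x in examples:
        leaky_step(weights, x, gamma=0.5, gamma_leak=0.001)
print(weights[1])   # neuron 2 has drifted into the region of the inputs
```

With `gamma_leak=0.0` the second neuron would stay at (100, 100) forever, which is exactly the dead-neuron situation described above.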
Minimising the cost function
The goal of a clustering method is to obtain a clustering in which the similarities between
inputs of the same cluster are much larger than similarities between inputs of different clusters.
The similarity between two inputs can be computed using the inverse of the (Euclidean)
distance between the two inputs. Therefore if we minimize the distances between a neuron
and all the examples in the cluster, we will maximize the similarities between the inputs in a
cluster.
A common measure to compute the quality of a final obtained set of clusters on a number
of input patterns is to use the following quadratic cost function E:
$$E = \frac{1}{2} \sum_p \|w^k - x^p\|^2 = \frac{1}{2} \sum_p \sum_i (w^k_i - x^p_i)^2$$
In which k is the winning neuron on input pattern xp .
Now we can prove that competitive learning searches for the minimum of this cost function
by following the negative gradient of this cost function.
Proof that the cost function is minimized. The cost-function for pattern xp :
$$E^p = \frac{1}{2} \sum_i (w^k_i - x^p_i)^2$$
in which k is the winning neuron is minimized by Equation 6.1.
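We first examine how the weight-vectors should be adjusted to minimize the cost-function
$E^p$ on pattern $x^p$:
$$\Delta_p w^o_i = -\gamma \frac{\partial E^p}{\partial w^o_i}$$
Now we have as the partial derivative of $E^p$ with respect to the weights:
$$\frac{\partial E^p}{\partial w^o_i} = \begin{cases} w^o_i - x^p_i, & \text{if unit } o \text{ wins} \\ 0, & \text{else} \end{cases} \qquad (6.2)$$
From this follows (for winner o):
$$\Delta_p w^o_i = \gamma (x^p_i - w^o_i)$$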
Thus we demonstrated that the cost-function is minimized by repetitive weight-vector updates. Some notes on this are:
• If we continue the updating process with a fixed learning rate, the weight-vectors will
always make some update step, and therefore we do not obtain a stable clustering.
To obtain a stable clustering we should decrease the learning-rate γ after each update
according to the conditions of stochastic approximation: (1) $\sum_{t=1}^{\infty} \gamma_t = \infty$ and (2)
$\sum_{t=1}^{\infty} \gamma_t^2 < \infty$. The first condition makes sure that the weight-vectors are able to move
an arbitrarily long distance to their final cluster-point, and the second condition makes
sure that the variance of updates goes to zero which means that finally a stable state will
be obtained. A possible way of setting the learning rate which respects these conditions
is: $\gamma_t = \frac{1}{t}$.
• It is important to note that the cost-function is likely to contain local minima. Therefore
the algorithm does not always obtain the global minimum of the cost-function. Although
the algorithm will converge (given the conditions on the learning-rate), convergence to
a global minimum is not guaranteed. Better results can therefore be obtained if we
execute the algorithm multiple times starting with different initial weight-vectors.
• Choosing the number of cluster-points (or neurons) is an art and not a science. Of
course the minimum of the cost-function can be obtained if we use as many cluster-points
as input-patterns and place each cluster-point on a different input-pattern. This
would result in a cost of 0. However, using as many cluster-points as input-patterns
does not make any sense since we want to obtain an abstraction of the input data. It is
also logical that increasing K leads to a smaller minimal cost, so how should we then
choose K? Often we need to trade off the complexity of the clustering (the number of
used cluster-points) and the obtained error-function. Thus, we like to minimize a new
cost-function:
$$E_f = E + \lambda K$$
where the user-defined parameter λ trades off complexity versus clustering cost. $E_f$ can
then be minimized by running the algorithm with different K.
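To illustrate the first note above, the sketch below applies Equation 6.1 with the decreasing learning rate 1/t to a single neuron. It assumes, for simplicity, that this neuron wins on every pattern; under that assumption the weight vector ends up at the plain average of the patterns seen so far. The function name `running_prototype` is hypothetical.

```python
def running_prototype(patterns):
    """Update one weight vector with gamma_t = 1/t on each pattern in turn."""
    w = [0.0] * len(patterns[0])   # the initial value is overwritten, since gamma_1 = 1
    for t, x in enumerate(patterns, start=1):
        gamma_t = 1.0 / t
        w = [wi + gamma_t * (xi - wi) for wi, xi in zip(w, x)]
    return w

patterns = [(1, 2), (2, 5), (3, 4), (2, 3)]
w = running_prototype(patterns)
mean = [sum(c) / len(patterns) for c in zip(*patterns)]
print(w, mean)   # the two vectors agree up to rounding error
```

With a fixed learning rate such as 0.5 the final weight vector would instead depend strongly on the most recent patterns.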
6.2.3
Vector quantisation
Another important use of competitive learning is vector quantisation. In vector quantisation
we divide the whole input space into a number of non-overlapping subspaces. The difference
with clustering is that we are not so much interested in the clusters of similar input-patterns,
but more in the quantisation of the whole input space. Vector quantisation uses the same
(unnormalised) competitive learning algorithm as described before, but we will finally examine
the subspaces and not the clusters. It should be noted that the distribution of input-patterns
is respected by competitive learning; more inputs in a region lead to more cluster-points. An
example of an obtained vector quantisation is shown in Figure 6.4.
Vector quantisation combined with supervised learning
Vector quantisation can also be used in a preprocessing phase for supervised learning purposes.
In this case, each neuron corresponds to some output value which is the average of the output
values for all input-patterns for which this neuron wins the competition. The output-values
for multiple outputs belonging to some neuron that represents a subspace of the input space are
usually stored in the weights from this neuron to the output neurons. Thus we can denote the
value for output o which is computed when neuron h is activated as $w^h_o$. If there is only one
output, we sometimes write $y^h$ to indicate that this value is the output of neuron h. Figure
6.5 shows a supervised vector quantisation network in which vector quantisation in the first
layer is combined with supervised learning in the second layer.
Figure 6.4: A final set of clusters (the big black dots) corresponds with a quantisation of the
input space into subspaces.
Figure 6.5: A supervised vector quantisation network. First the input is mapped by a competitive network to a single activated internal neuron. Then this neuron is used for determining
the output of the architecture. (Diagram labels: input i, vector quantisation weights $w^h_i$, internal neuron h, feed-forward weights $w^h_o$, output o with value Y.)
For learning this network we can first perform the (unsupervised) vector quantisation
steps with the unnormalised vector quantisation algorithm and then perform the supervised
learning steps, but it is also possible to perform these two updates at the same time. The
supervised learning step can be done with a simple version of the delta-rule. The
complete algorithm for supervised vector quantisation looks as follows:
• Present the network with input x and target value D = f (x)
• Apply the unsupervised quantisation step: determine the distance of x to the (input)
weight-vector of each neuron and determine the winner k, then update the (input)
weight-vector of neuron k with the unsupervised competitive learning rule (Eq. 6.1).
• Apply the supervised approximation step, for all outputs o do:
$$w^k_o(t+1) = w^k_o(t) + \alpha (D_o - w^k_o(t))$$
This is a simple version of the delta-rule where α is the learning-rate and k the winning
neuron.
This algorithm can work well for smooth functions, but may have problems with fluctuating functions. The reason is that inside a subspace in which a single neuron is activated, the
generated network output is always the same. This means that large fluctuations will cause
problems and can only be approximated well when enough neurons are used. For smooth
functions, however, the target values inside a subspace are quite similar so that the approximation can be quite good. Given a vector quantisation and input-patterns with their target
outputs, we can compute to which values $w^k_o$ the network converges. First we define a function
g(x, k) as:
$$g(x, k) = \begin{cases} 1, & \text{if } k \text{ is the winner} \\ 0, & \text{else} \end{cases}$$
Now it can be shown that the supervised vector quantisation learning rule converges to:
$$w^h_o = \frac{\int_{\mathbb{R}^n} D_o(x)\, g(x, h)\, p(x)\, dx}{\int_{\mathbb{R}^n} g(x, h)\, p(x)\, dx}$$
where Do (x) is the desired output value of output o on input-pattern x and p(x) is a probability
density function which models the probabilities of receiving different inputs. Thus, each
weight from neuron h to output o converges to the average target output value for output o
for all the cases that neuron h wins.
Example of supervised vector quantisation. The winning neuron moves according to
the same update rule as unnormalised competitive learning (Eq. 6.1). Since there is only a single output
in the example below, we will write $y^k$ to denote the output value of neuron k. The value $y^k$
for the winning neuron $w^k$ is adapted after each example by the following update rule:
$$y^k = y^k + \alpha (D^p - y^k)$$
Suppose we start again with 2 cluster-points and set their output-values to 0 :
w1 = (1, 1), y 1 = 0 and w2 = (3, 2), y 2 = 0.
Now we receive the following learning examples:
(x1 → D 1 ) = (1, 2 → 3)
(x2 → D 2 ) = (2, 5 → 7)
(x3 → D 3 ) = (3, 4 → 7)
(x4 → D 4 ) = (2, 3 → 5)
Suppose we set the learning-rate γ to 0.5 and the learning rate for the supervised learning
step α = 0.5. Now if we update on the four learning examples, the following updates are
made:
x1 = (1, 2) → d(w1 , x1 ) = 1, d(w2 , x1 ) = 2. Thus: Winner w1 = (1, 1). Application of the
update rule gives:
w1 = (1, 1) + 0.5((1, 2) − (1, 1)) = (1, 1.5).
This is just the same as in the example of unnormalised competitive learning before.
The only difference in computations is that we also adjust the output values of the winning
neuron:
y 1 = 0 + 0.5(3 − 0) = 1.5
Since the weight-vectors wi are adjusted in the same way as in the example of competitive
learning, we only show the updates of the neurons’ output values:
x2 = (2, 5). Winner is neuron 2.
y 2 = 0 + 0.5(7 − 0) = 3.5.
x3 = (3, 4). Winner is neuron 2.
y 2 = 3.5 + 0.5(7 − 3.5) = 5.25.
Now try it yourself on the fourth example.
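The combined update of this example can be sketched as follows; running it also lets you check your answer for the fourth example. The function name `supervised_vq_step` is an assumption, and the sketch handles only the single-output case used in the example.

```python
import math

def supervised_vq_step(weights, outputs, x, target, gamma, alpha):
    """Eq. 6.1 for the winner's weight vector, delta rule for the winner's output."""
    dists = [math.dist(w, x) for w in weights]
    k = dists.index(min(dists))   # winner is chosen before any update is made
    weights[k] = [wi + gamma * (xi - wi) for wi, xi in zip(weights[k], x)]
    outputs[k] += alpha * (target - outputs[k])
    return k

weights = [[1.0, 1.0], [3.0, 2.0]]   # w1, w2
outputs = [0.0, 0.0]                 # y1, y2
data = [((1, 2), 3), ((2, 5), 7), ((3, 4), 7), ((2, 3), 5)]
for x, target in data:
    k = supervised_vq_step(weights, outputs, x, target, gamma=0.5, alpha=0.5)
    print(f"x = {x}: winner is neuron {k + 1}, outputs are now {outputs}")
```

The weight vectors follow exactly the same trajectory as in the unnormalised competitive learning example, since the supervised step only touches the output values.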
6.3
Learning Vector Quantisation (LVQ)
Learning vector quantisation is basically a supervised learning algorithm, since the neurons
have labels associated with them and can therefore classify inputs into a fixed number of categories. Using the training examples, which in this case consist of an input pattern and an
associated discrete label (or output), LVQ learns decision boundaries which partition the input space into subspaces with an associated label. The goal is that each input pattern falls
into a subspace with the same associated label.
The algorithm looks as follows:
• Initialize the weight-vectors of a number of neurons and label each neuron o with a
discrete class label $y^o$
• Present a training example $(x^p, d^p)$
• Use the distance measure between the weight-vectors of the neurons and the input vector
$x^p$ to compute the winning neuron $k_1$ and the second closest neuron $k_2$:
$$\|x^p - w^{k_1}\| < \|x^p - w^{k_2}\| < \|x^p - w^i\| \quad \forall i \ne k_1, k_2$$
• The labels $y^{k_1}$ and $y^{k_2}$ are compared to the desired label of the example $d^p$ from which
an update is computed
The update rule causes the winning neuron to move closer to the input example when its
label corresponds to the desired label for that example. In case the labels are not the same,
the algorithm looks at the second-best neuron and when its label is correct it is moved closer
and in this case the winning neuron is moved away from the input example. Formally, the
update rules look as follows:
• If $y^{k_1} = d^p$: Apply the weight update rule for $k_1$:
$$w^{k_1}(t+1) = w^{k_1}(t) + \gamma (x^p - w^{k_1}(t))$$
• Else, if $y^{k_1} \ne d^p$ and $y^{k_2} = d^p$: Apply the weight update rule for $k_2$:
$$w^{k_2}(t+1) = w^{k_2}(t) + \gamma (x^p - w^{k_2}(t))$$
and move the winning neuron away from the example:
$$w^{k_1}(t+1) = w^{k_1}(t) - \gamma (x^p - w^{k_1}(t))$$
The algorithm does not perform any update if neither the label of the winning neuron nor that
of the second-best neuron agrees with the label of the example. One could make an algorithm which
would move the closest neuron with the correct label to the example (and possibly move all
others away from it), but this is not done in LVQ. A possible problem of this would be strong
oscillation of the weight-vectors of the neurons due to noise.
LVQ example. In LVQ, we use K cluster-points (neurons) with a labelled output. We
compute the closest (winning) neuron wk1 and the second closest neuron wk2 for each training
example and apply the weight update rules.
Suppose we start with 2 cluster-points: w1 = (1, 1) with label y 1 = A, and w2 = (3, 2)
with label y 2 = B. We set the learning rate γ to 0.5.
Now we receive the following training examples:
(x1 → D 1 ) = (1, 2 → A)
(x2 → D 2 ) = (2, 5 → B)
(x3 → D 3 ) = (3, 4 → A)
(x4 → D 4 ) = (2, 3 → B)
Then we get the following updates: For (1, 2 → A), the winner is neuron 1 and the second
best is neuron 2. The label of neuron 1 equals the desired label: $y^1 = D^1$. Therefore neuron 1 is moved closer to the
example:
w1 = (1, 1) + 0.5((1, 2) − (1, 1)) = (1, 1.5).
x2 = (2, 5). Winner is neuron 2. Second closest is neuron 1. The label of neuron 2 is the
same as the label D 2 , therefore neuron 2 is moved closer to the example:
w2 = (3, 2) + 0.5((2, 5) − (3, 2)) = (2.5, 3.5).
x3 = (3, 4). Winner is neuron 2. Second closest is neuron 1. The label of neuron 2 is not the
same as the label D 3 . The label of neuron 1 is the same as D 3 . Therefore we move neuron 1
closer to the example, and neuron 2 away from the example:
w1 = (1, 1.5) + 0.5((3, 4) − (1, 1.5)) = (2, 2.75).
w2 = (2.5, 3.5) − 0.5((3, 4) − (2.5, 3.5)) = (2.25, 3.25).
Now try it yourself on the fourth example. An example partitioning of a 2-dimensional
input space is shown in Figure 6.6. The structure of the decision boundaries of such a
partitioning is often called a Voronoi diagram.
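The LVQ update rules can be sketched in Python as follows. The function name `lvq_step` is an assumption, and the sketch requires at least two neurons. Only the first three training examples of the LVQ example are run here; appending `((2, 3), "B")` to `data` lets you check your own answer for the fourth.

```python
import math

def lvq_step(weights, labels, x, target, gamma):
    """One LVQ update; returns the indices of the winner and the runner-up."""
    order = sorted(range(len(weights)), key=lambda o: math.dist(weights[o], x))
    k1, k2 = order[0], order[1]
    if labels[k1] == target:
        # The winner has the correct label: move it towards the example.
        weights[k1] = [w + gamma * (xi - w) for w, xi in zip(weights[k1], x)]
    elif labels[k2] == target:
        # The runner-up is correct: pull it closer and push the winner away.
        weights[k2] = [w + gamma * (xi - w) for w, xi in zip(weights[k2], x)]
        weights[k1] = [w - gamma * (xi - w) for w, xi in zip(weights[k1], x)]
    # Otherwise neither label matches and no update is made.
    return k1, k2

weights = [[1.0, 1.0], [3.0, 2.0]]
labels = ["A", "B"]
data = [((1, 2), "A"), ((2, 5), "B"), ((3, 4), "A")]
for x, target in data:
    k1, k2 = lvq_step(weights, labels, x, target, gamma=0.5)
    print(f"x = {x}: winner {k1 + 1}, second {k2 + 1}, weights {weights}")
```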
Figure 6.6: An example of a partitioning created by LVQ. The regions in the figure carry the
class labels A, B, C and D, with the label A appearing twice.
6.4
Kohonen Networks
Kohonen networks or Kohonen maps are self-organising maps (SOMs) in which the neurons
are ordered in a specific structure such as a 2-dimensional grid. This ordering or structure
determines which neurons are neighbours. Input patterns which are lying close together are
mapped to neurons in the structure S which are close together (the same neuron or neighbouring neurons). The learning algorithm causes the structure of the neurons to get a specific
shape which reflects the underlying (low dimensional) manifold of the input patterns received
by the algorithm. The structure of a Kohonen network is determined before the learning
process, and often a structure is used which has lower dimensionality than the dimensionality
of the input space. This is very useful to visualise the structure of inputs which fall on a subspace of the input space, see Figure 6.7. The structure used here is a 2-dimensional structure
consisting of 4 × 4 neurons.
Figure 6.7: In this example, the 2-dimensional 4 × 4 structure of the Kohonen network covers
a manifold of lower dimensionality than the input space.
6.4.1
Kohonen network learning algorithm
Again we compute the winning neuron for an incoming input pattern using some distance
measure such as the Euclidean distance. Instead of only updating the winning neuron, we
also update the neighbours of the winning neuron for which we use a neighbourhood function
g(o, k) between two neurons. Here we define g(k, k) = 1 and with a longer separation distance
in the structure we decrease the value of the neighbourhood function g(o, k). So the update
is done using:
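$$w^o(t+1) = w^o(t) + \gamma\, g(o, k) (x(t) - w^o(t)) \quad \forall o \in S$$
where k is the winning neuron and we have to define a function g(o, k). For example we can
use a function that decays exponentially with the distance in the structure (a Gaussian of this distance is another common choice):
$$g(o, k) = \exp(-\text{distance}_S(o, k))$$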
Where distanceS (o, k) computes the distance in the structure S between two neurons. This
distance is the minimal number of edges which have to be traversed in the structure to arrive
at neuron o from winning neuron k.
By this collective learning method input patterns which lie close together are mapped to
neurons which are close together in the structure. In this way the topology which can be
found in the input signals is represented in the learned Kohonen network. Figure 6.8 shows
an example of the learning process in which input patterns are drawn randomly from the
2-dimensional subspace.
Figure 6.8: The Kohonen network learns a representation which preserves the structure of
the input patterns. The panels show the network at iteration 0, iteration 600 and iteration 1900.
If the intrinsic dimensionality of the structure S is smaller than the dimensionality of the
input space, the neurons of the network are “folded” in the input space. This can be seen in
Figure 6.9.
Figure 6.9: If the dimensionality of the structure is smaller than the manifold from which
input patterns are generated, the resulting Kohonen map is folded in the input space. Here
this folding is shown for a 1-dimensional structure in a 2-dimensional input-space.
Example Kohonen network. Suppose we use a Kohonen network with 3 neurons
connected in a line (thus 1-dimensional) structure. We use a neighbourhood relation as
follows: g(k, k) = 1 and g(h, k) = 0.5 if h and k are direct neighbours on the line, else
g(h, k) = 0.
Again we always compute the winning neuron on each input pattern, and then we update
all neurons as follows:
wi = wi + γg(i, k)(xp − wi )
We initialise: w1 = (1, 1), w2 = (3, 2), w3 = (2, 4). We set γ = 0.5. Now we obtain the
examples:
x1 = (1, 2)
x2 = (2, 5)
x3 = (3, 4)
x4 = (2, 3)
On x1 = (1, 2) neuron 1 wins the competition. This results in the update:
w1 = (1, 1) + 0.5 ∗ 1((1, 2) − (1, 1)) = (1, 1.5).
We also have to update the neighbours: g(2, 1) = 0.5 and g(3, 1) = 0. So we update neuron 2:
w2 = (3, 2) + 0.5 ∗ 0.5((1, 2) − (3, 2)) = (2.5, 2).
On x2 = (2, 5) neuron 3 wins. This results in the update:
w3 = (2, 4) + 0.5 ∗ 1((2, 5) − (2, 4)) = (2, 4.5).
We also have to update the neighbours: g(2, 3) = 0.5 and g(1, 3) = 0. So we update neuron 2:
w2 = (2.5, 2) + 0.5 ∗ 0.5((2, 5) − (2.5, 2)) = (2.375, 2.75).
On x3 = (3, 4) neuron 3 wins. This results in the update:
w3 = (2, 4.5) + 0.5 ∗ 1((3, 4) − (2, 4.5)) = (2.5, 4.25).
We also have to update the neighbours. Again g(2, 3) = 0.5 and g(1, 3) = 0. So we update
neuron 2:
w2 = (2.375, 2.75) + 0.5 ∗ 0.5((3, 4) − (2.375, 2.75)) = (2.53, 3.06).
Try it yourself on the last example.
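The Kohonen example can be reproduced with the sketch below. The names `kohonen_step` and `line_neighbourhood` are assumptions; the neighbourhood function encodes the values g = 1, 0.5 and 0 chosen in the example for neurons on a line. Only the first three input patterns are run; appending `(2, 3)` to the list lets you check your answer for the last one.

```python
import math

def line_neighbourhood(o, k):
    """g = 1 for the winner, 0.5 for its direct neighbours on the line, else 0."""
    return {0: 1.0, 1: 0.5}.get(abs(o - k), 0.0)

def kohonen_step(weights, x, gamma, neighbourhood):
    """Update every neuron, scaled by its neighbourhood value to the winner."""
    dists = [math.dist(w, x) for w in weights]
    k = dists.index(min(dists))   # the winner is determined before any update
    for o, w in enumerate(weights):
        g = neighbourhood(o, k)
        weights[o] = [wi + gamma * g * (xi - wi) for wi, xi in zip(w, x)]
    return k

weights = [[1.0, 1.0], [3.0, 2.0], [2.0, 4.0]]   # w1, w2, w3
winners = [kohonen_step(weights, x, 0.5, line_neighbourhood)
           for x in [(1, 2), (2, 5), (3, 4)]]
print(winners)   # 0-based winner index for each input pattern
print(weights)
```

The code keeps the unrounded value of w2, which the text rounds to two decimals.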
6.4.2
Supervised learning in Kohonen networks
A Kohonen network can also be used for supervised learning. For this we use outputs $w^h_o$ for
each neuron h and each output o. In case there is only a single output we can denote the
output of a neuron h as $y^h$. To determine the overall output on a training example, we use
the outputs of all activated neurons (neurons are activated if g(h, k) > 0). So we obtain the
output $y_o$ by the following formula which weighs the neuron outputs by their activations:
$$y_o = \frac{\sum_{h \in S} g(h, k)\, w^h_o}{\sum_{h \in S} g(h, k)}$$
This is basically a weighted average and causes smoother functions when larger neighbourhood
function values are used.
Now each neuron can learn output values in two different ways. The first possibility is to
let neurons learn the average output weighted by its activation using:
$$w^h_o = w^h_o + \alpha (D_o - w^h_o) \frac{g(h, k)}{\sum_{i \in S} g(i, k)}$$
where $D_o$ is the target value for output o.
We can also let each neuron learn to reduce the overall error of the network. In this case
neurons collaborate more. The following learning rule does this:
$$w^h_o = w^h_o + \alpha (D_o - y_o) \frac{g(h, k)}{\sum_{i \in S} g(i, k)}$$
Furthermore for supervised learning in Kohonen networks, the unsupervised steps can be
changed so that neurons with small errors are moved faster to the input pattern than neurons
with larger errors.
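The network output and the two output learning rules can be compared in a small sketch for the single-output case. The function names, the neuron outputs (1, 2, 4), the target value 3 and the learning rate 0.5 are made-up values for illustration; the neighbourhood values correspond to three neurons on a line where the middle neuron wins.

```python
def kohonen_output(outputs, g):
    """Network output: neuron outputs averaged with weights g[h] = g(h, k)."""
    return sum(gh * yh for gh, yh in zip(g, outputs)) / sum(g)

def update_individual(outputs, g, target, alpha):
    """First rule: each neuron moves its own output towards the target."""
    total = sum(g)
    return [yh + alpha * (target - yh) * gh / total for yh, gh in zip(outputs, g)]

def update_collective(outputs, g, target, alpha):
    """Second rule: each neuron reduces the error of the overall network output."""
    total = sum(g)
    error = target - kohonen_output(outputs, g)
    return [yh + alpha * error * gh / total for yh, gh in zip(outputs, g)]

g = [0.5, 1.0, 0.5]          # g(h, k) when the middle neuron k wins
outputs = [1.0, 2.0, 4.0]    # hypothetical neuron outputs y^h
y = kohonen_output(outputs, g)
individual = update_individual(outputs, g, target=3.0, alpha=0.5)
collective = update_collective(outputs, g, target=3.0, alpha=0.5)
print(y, individual, collective)
```

In this made-up case the first rule lowers the output of the third neuron, because its own output lies above the target, whereas the second rule raises all three outputs, because the overall network output lies below the target.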
6.5
Discussion
In this chapter we examined unsupervised learning methods which can be used for clustering
data, vector quantisation, dimensionality reduction, and feature extraction. The K-means
algorithm is a well-known method for clustering, but is a batch learning method meaning
that it has to be executed on all input patterns. In competitive learning, updates are made
online. The neurons compete for becoming activated based on their distance to the input
pattern. Unsupervised learning methods can also be extended with additional output weights
to make supervised learning possible. In this case we can simply use the delta rule for learning
outputs of each neuron. The algorithms shown deal well with continuous inputs;
for discrete inputs some adaptations may be necessary to improve the algorithms. All learning
algorithms respect the locality principle; inputs which lie close together in the input space
are grouped together. For supervised learning, the shown algorithms can be very suitable if
the function is smooth. By using additional neurons a good approximation of a fluctuating
target function can be learned, but finding the winning neuron becomes slow if many neurons
are used.
Transparanten bij het vak Inleiding Adaptieve Systemen: Reinforcement Leren. M.
Wiering
Stap = overgang (transitie)
van de ene toestand
P
naar de volgende ( j P (i, a, j) = 1)
Toestanden kunnen terminaal zijn: ketens van
stappen die hier terecht komen worden niet verder voortgezet
Contents
• Markov decision problems
• Dynamic programming: recap
• Reinforcement learning: principles
• Temporal difference learning
• Q-learning
• Model-based learning

Learning goals:
1. Understand the theory and be able to write down and use the RL algorithms.
2. Understand why exploration and generalisation matter, and be able to describe ways to address them.
3. Be able to think of applications for RL.

Markov decision problems
A Markov decision process (MDP) consists of:
• $S$: a finite set of states $\{S_1, S_2, \ldots, S_n\}$.
• $A$: a finite set of actions.
• $P(i, a, j)$: the probability of making a step to state $j$ when action $a$ is selected in state $i$.
• $R(i, a, j)$: the reward for making a transition from state $i$ to state $j$ by executing action $a$.
• $\gamma$: the discount parameter for future rewards ($0 \le \gamma \le 1$).

Markov property
The current state and action give all possible information for predicting to which next state a step will be made:
$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, \ldots, s_1, a_1)$
So, for predicting the future it does not matter how you arrived in the current state.
Compare processes in physics: where would the past have to be represented?

Example MDP:
[Figure: deterministic MDP with 5 states and 2 actions. The transitions carry reward 0, except for one transition with reward +1 and one with reward -1.]

Passive learning: learns the outcome of a process without being able to make decisions that influence that outcome → prediction.
Example: in the MDP above all actions are chosen with probability 50%. What is the expected sum of future rewards?
Active learning: learns a policy that selects actions such that the outcome of the process is as good as possible for the agent → control.
Example: in the MDP above, what is the optimal action in each state? What is then the expected sum of future rewards?

Action selection policy
A policy $\Pi$ selects an action as a function of the current state:
$a_t = \Pi(s_t)$
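The MDP ingredients listed above can be written down as plain data structures. The following is only a minimal sketch: the two-state example, the state and action names, and the helper names `is_valid_model` and `policy` are assumptions made for illustration and do not come from the slides.

```python
# Hypothetical MDP, for illustration only.
# P[(i, a)] maps every successor state j to the probability P(i, a, j).
P = {
    ("s1", "left"): {"s1": 1.0},
    ("s1", "right"): {"s1": 0.2, "s2": 0.8},
}
# R[(i, a, j)] is the reward for the transition i -> j under action a.
R = {
    ("s1", "left", "s1"): 0.0,
    ("s1", "right", "s1"): 0.0,
    ("s1", "right", "s2"): 1.0,
}
GAMMA = 0.9          # discount parameter, 0 <= gamma <= 1
TERMINAL = {"s2"}    # chains of steps stop here


def is_valid_model(transitions, tolerance=1e-9):
    """Check that sum_j P(i, a, j) = 1 for every state/action pair."""
    return all(abs(sum(dist.values()) - 1.0) <= tolerance
               for dist in transitions.values())


def policy(state):
    """A deterministic policy a_t = Pi(s_t), stored as a lookup table."""
    return {"s1": "right"}[state]
```

Storing the transition function per state/action pair keeps the check that the probabilities sum to 1 to a single line, and a tabular policy is enough as long as the state set is finite.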
Goal: learn the policy $\Pi^*$ that maximises the expected future rewards:
$\Pi^* = \arg\max_{\Pi} E\left(\sum_{t=0}^{\infty} \gamma^t R(s_t, \Pi(s_t), s_{t+1}) \mid s_0 = s\right)$

Example policy:
[Figure: example policy; G marks the goal state.]

There are $|A|^{|S|}$ policies; how do we know which policy is best?

Value function (utility function): the value of a state estimates the expected future rewards:
$V(s) = E\left(\sum_{t=0}^{\infty} \gamma^t R(s_t, \Pi(s_t), s_{t+1}) \mid s_0 = s\right)$
The Q-function estimates the value of selecting an action in a given state:
$Q(s_t, a_t) = \sum_{s_{t+1}} P(s_t, a_t, s_{t+1}) \left(R(s_t, a_t, s_{t+1}) + \gamma V(s_{t+1})\right)$

Example value function (in a deterministic world):
[Figure: grid world with state values ranging from 4 to 10.]

The V-function and the Q-function
We use two value functions: the V-function for evaluating states and the Q-function for evaluating state/action pairs.
If $V$, the state value function, is known, then in a state we can try out all actions, determine the new state (using the model) and select the action that leads to the largest expected sum of future rewards.
If the Q-function is known, then in every state we can directly select the action with the highest Q-value (no model is needed for this).

Dynamic programming (Bellman 1957)
The optimal Q-function satisfies the Bellman equation:
$Q^*(i, a) = \sum_j P(i, a, j) \left(R(i, a, j) + \gamma V^*(j)\right)$
Here $V^*(j) = \max_a Q^*(j, a)$.
We then obtain the optimal policy by:
$\Pi^*(i) = \arg\max_a Q^*(i, a)$
Remarks:
• $V^*$ is uniquely determined.
• $\Pi^*$ is not always uniquely determined (sometimes there are several optimal policies).

Value iteration
We can compute the optimal policy and the Q-function with a dynamic programming algorithm.
Value iteration:
1. Initialise the Q-values and V-values (e.g. to 0).
2. Make an "update" for the Q-values:
$Q(i, a) := \sum_j P(i, a, j) \left(R(i, a, j) + \gamma V(j)\right)$
For terminal states: $P(i, a, i) = 1$ and $R(i, a, i) = 0$ for every action (or $P(i, a, j) = 0$ for all $j$).
3. Then compute the new value function:
$V(i) := \max_a Q(i, a)$   (1)
4. Adapt the policy so that in every state the action with the maximal current value is selected:
$\Pi(i) := \arg\max_a Q(i, a)$
5. Go to (2) until $V$ no longer changes.

Evaluating a policy
If we fix the policy, we can compute the exact value of a given state. This corresponds to passive learning (where the fixed policy determines the transition probabilities).
Because we now have a fixed policy $\Pi$, we can eliminate the actions from the transition and reward functions:
$P(i, j) = P(i, \Pi(i), j)$ and $R(i, j) = R(i, \Pi(i), j)$
Now $V^{\Pi}(i)$ is fixed for every state $i$:
• for terminal states $i$: $V^{\Pi}(i) = 0$
• for non-terminal states $i$: $V^{\Pi}(i) = \sum_j P(i, j) \left(R(i, j) + \gamma V^{\Pi}(j)\right)$
This is a system of $n$ linear equations with $n$ unknowns $V(i)$ $\Rightarrow$ exactly one solution for the V-function.
How do you determine the $n$ unknowns $V(i)$?
Two methods:
• Gaussian elimination (row reduction). Write down
$V(1) = \sum_j P(1, j) \left(R(1, j) + \gamma V(j)\right)$
$V(2) = \sum_j P(2, j) \left(R(2, j) + \gamma V(j)\right)$
$\vdots$
$V(n) = \sum_j P(n, j) \left(R(n, j) + \gamma V(j)\right)$
and solve for the unknowns.
• Policy evaluation. Start with $V(i) = 0$ for all states $i$ and repeat
$V(i) := \sum_j P(i, j) \left(R(i, j) + \gamma V(j)\right)$
a large number of times for all non-terminal states $i$.

Exercise:
Given the states 1 to 4, of which 4 is terminal:
[Figure: transition graph over states 1 to 4. Edge labels as extracted: (P = 0.5, R = -1), (P = 0.5, R = 1), (P = 0.5, R = -1), (P = 1, R = 1), (P = 0.5, R = 2). The assignment of labels to edges is not recoverable from the text.]
Compute the values for all states.

Dynamic programming as a planning tool
Planning: compute actions to achieve a goal. Example: A* planning. Planning runs into problems in non-deterministic environments.
DP: given a state, select an action and then follow the (optimal) policy.
DP advantage: at run time, actions can be selected directly (so without costly planning operations).
DP disadvantage: the value function must be accurate.
Problems with using dynamic programming:
• An a-priori model of the Markov decision process is needed (the environment must be known).
• If there are many variables the state space becomes very large (e.g. $n$ binary variables → $2^n$ states). DP then becomes computationally very expensive.
• What if the state space is continuous?
• What if actions or time are continuous?
• What if the environment is non-Markov?

Learning the best action with RL
Suppose you play the two-armed bandit: there are two actions (L and R), and each costs one euro.
The left arm has a 10% chance of paying out 6 euros.
The right arm has a 1% chance of paying out 100 euros.
Unfortunately you do not know the probabilities and the payouts.
By repeatedly trying both arms, you can learn the probability of winning and the amount won (simply by computing the average).
If the probabilities and the amounts are known accurately, playing optimally is simple (here: L yields $0.1 \times 6 - 1 = -0.4$ euro per play on average, R yields $0.01 \times 100 - 1 = 0$).

Two-armed bandit and exploration
Suppose you play the game and get the following results:
(1, L, -1)
(2, R, -1)
(3, L, +5)
We can write the expected values as a quadruple:
(action A, probability P, amount won R, average value V)
For the experiences above these become:
(L, 0.5, 5, 2.0) and (R, 0, ?, -1).
If we now continue playing, should we immediately choose L, or keep trying R?
This is called the exploration/exploitation dilemma.
Compare: do we choose information that may bring more future reward, or immediate reward?

Reinforcement learning
No a-priori given model (transition probabilities, rewards) is needed.
Reinforcement learning learns a subjective "view" of the world through interaction with the world.
A policy is tested, which yields experiences from which a new policy can be learned.
Exploration of the state space is needed.
[Figure: the agent's subjective view of the world, with goal G. An epoch is a sequence of experiences (steps).]

Principles of reinforcement learning (RL)
To learn the Q-function, RL algorithms continually repeat the following:
1. Select action $a_t$ given the state $s_t$.
2. Collect the reward $r_t$ and observe the successor state $s_{t+1}$.
3. Make an "update" of the Q-function using the latest experience $(s_t, a_t, r_t, s_{t+1})$.

Epoch = chain of successive states ending in a terminal state (or after a fixed number of steps).
[Figure: example epoch with rewards +1, -1, +2, 0.]
From the epochs we want to learn the value function and the optimal strategy.
Four RL methods:
• Monte Carlo sampling (naive updating)
• Temporal difference (TD) learning
• Q-learning
• Model-based dynamic programming
The first three methods use no transition model and are therefore often called direct RL or model-free RL.
The fourth first estimates a transition model and computes the value function with dynamic-programming-like methods. This method is therefore also called indirect RL or model-based RL.

Monte Carlo sampling
• For each state $s$ in an epoch $k$, determine the reward-to-go $a_k$: the sum of all rewards in that epoch from the first moment that state is visited until the epoch has ended.
• Estimate of the utility of a state: take the average of all rewards-to-go over all the times that state occurs in an epoch:
$V(s) = \frac{\sum_{i=1}^{k} a_i(s) \,[s \text{ visited in epoch } i]}{\text{number of epochs in which } s \text{ was visited}}$
Objection: this method converges very slowly (the variance of the updates is very large).

Temporal difference learning
Instead of directly using the whole chain of states for an update, we can also use only the successor state.
For each step from $i$ to $j$ in an epoch, do:
• if $j$ is terminal: $V(i) := V(i) + \alpha(R(i, j) - V(i))$
• if $j$ is not terminal: $V(i) := V(i) + \alpha(R(i, j) + \gamma V(j) - V(i))$
Here $\alpha$ is a small positive number, the learning rate.
Idea: each time, give $V(i)$ a small push in the desired direction.
With a fixed $\alpha$ this quickly gets close to the true utility, but then does not converge any further.
If $\alpha$ becomes smaller as state $i$ is visited more often, it does converge.

Example:
If $P(i, j) = \frac{1}{3}$ and $P(i, k) = \frac{2}{3}$, and the transition $i \to j$ occurs 10 times and the transition $i \to k$ occurs 20 times, then:
10×: $V(i) := V(i) + \alpha(R(i, j) + \gamma V(j) - V(i))$
20×: $V(i) := V(i) + \alpha(R(i, k) + \gamma V(k) - V(i))$
$\approx V(i) := V(i) + \alpha(10R(i, j) + 10\gamma V(j) + 20R(i, k) + 20\gamma V(k) - 30V(i))$
When the net change is zero, this is equivalent to
$30\alpha V(i) = \alpha(10R(i, j) + 10\gamma V(j) + 20R(i, k) + 20\gamma V(k))$,
which is exactly a step in the desired direction $\Rightarrow$
$V(i) := \frac{1}{3}(R(i, j) + \gamma V(j)) + \frac{2}{3}(R(i, k) + \gamma V(k))$

Exercise:
[Figure: transition graph over states 1 to 7 with rewards +5 and -5 on the transitions. The assignment of rewards to transitions is not recoverable from the text.]
Assume every transition has probability 50%.
Suppose the agent then experiences the following epochs (sequences of states):
{1, 2, 3}
{1, 4, 7}
{1, 2, 5}
{1, 2, 3}
Which updates of the V-function will the agent make with Monte Carlo sampling?
Which with TD learning?
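The Monte Carlo and TD rules can be compared on the four epochs of the exercise. The sketch below is illustrative only: because the rewards in the figure did not survive extraction, the rewards attached to each transition (+5 for 1→2, 2→3 and 4→7, -5 for 1→4 and 2→5), the settings α = 0.5 and γ = 1, the terminal set, and the function names are all assumptions.

```python
from collections import defaultdict

# Each epoch is a list of transitions (state i, reward R(i, j), next state j).
# The rewards are assumed values, not taken from the slides.
EPOCHS = [
    [(1, +5, 2), (2, +5, 3)],
    [(1, -5, 4), (4, +5, 7)],
    [(1, +5, 2), (2, -5, 5)],
    [(1, +5, 2), (2, +5, 3)],
]
TERMINAL_STATES = {3, 5, 6, 7}  # assumed leaves of the transition graph


def monte_carlo_values(epochs):
    """Average the reward-to-go from the first visit of each state per epoch."""
    totals, counts = defaultdict(float), defaultdict(int)
    for epoch in epochs:
        seen = set()
        for t, (state, _reward, _next_state) in enumerate(epoch):
            if state in seen:
                continue  # only the first visit within an epoch counts
            seen.add(state)
            totals[state] += sum(r for _, r, _ in epoch[t:])
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


def td_values(epochs, terminal, alpha=0.5, gamma=1.0):
    """Apply the TD update after every single step i -> j."""
    values = defaultdict(float)
    for epoch in epochs:
        for i, reward, j in epoch:
            target = reward if j in terminal else reward + gamma * values[j]
            values[i] += alpha * (target - values[i])
    return dict(values)


mc = monte_carlo_values(EPOCHS)
td = td_values(EPOCHS, TERMINAL_STATES)
```

Monte Carlo only updates a state once the whole epoch is known, while TD changes `values[i]` after each step and bootstraps from the current estimate of the successor state.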
Q-learning
Q-learning (Watkins, 1989) changes a single Q-value given the experience $(s_t, a_t, r_t, s_{t+1})$:
$Q(s_t, a_t) := Q(s_t, a_t) + \alpha(r_t + \gamma V(s_{t+1}) - Q(s_t, a_t))$
Here $V(s) = \max_a Q(s, a)$.
When Q-learning is used, the Q-function converges to the optimal Q-function if all state/action pairs are visited infinitely often (and the learning rate decreases).
Q-learning is the most widely used RL method.
Advantage of Q-learning: simple to implement.
Disadvantage of Q-learning: it can take a long time before a reward at the end of a chain has been propagated back to a state.

Example Q-learning
We have the following state graph with transitions for the actions left (L) and right (R).
There are 5 states: A, B, C, D, E. E is a terminal state.
[Figure: chain of states A, B, C, D, E. Every transition has reward R = -1. Edge labels as extracted: three edges with P(L) = 0.9 and P(R) = 0.1, three edges with P(L) = 0.1 and P(R) = 0.9, and one edge with P(L) = 1.0 and P(R) = 1.0.]
Suppose the following transitions are made:
(A, L, B); (B, R, C); (C, R, D); (D, R, E)
(C, L, D); (D, L, C); (C, R, D); (D, R, E)
(B, L, A); (A, R, B); (B, L, C); (C, R, D); (D, L, E)
Question: what are the resulting Q-values if Q-learning is used ($\alpha = 0.5$)?

Model-based RL
Model-based RL first estimates the transition and reward functions.
Estimate $P(i, a, j)$:
$\hat{P}(i, a, j) := \frac{\#\text{ transitions from } i \text{ to } j \text{ in which } a \text{ was chosen}}{\#\text{ times action } a \text{ was chosen in state } i}$
Do the same for the reward:
$\hat{R}(i, a, j) := \frac{\text{sum of rewards on transitions from } i \text{ to } j \text{ after choosing } a}{\#\text{ transitions from } i \text{ to } j \text{ in which } a \text{ was chosen}}$
Repeat the update
$Q(i, a) := \sum_j \hat{P}(i, a, j) \left(\hat{R}(i, a, j) + \gamma V(j)\right)$
a number of times for all non-terminal states.
It is often unnecessary to update all Q-values: only a subset of the Q-values will change significantly as a result of the latest experience. Faster update methods take this into account (e.g. prioritized sweeping).

Experimental comparison
[Figure: maze with start S and goal G.]
50 × 50 maze
Reward goal = +100; reward blocked = -2; reward penalty = -10; otherwise -1
10% noise in action execution
Max-random exploration (30% → 0% noise)
[Figure: two plots comparing Q-learning, Q(0.5)-learning, model-based Q and prioritized sweeping. Left: cumulative reward per 10000 steps against #steps (0 to 1e+06). Right: number of steps in trial (logarithmic axis) against trial number.]

Indirect vs. direct RL
Advantages of direct RL:
• Less memory is needed (the transition function can be large)
• It also works with non-discrete representations (e.g. neural networks)
• It can work better when the Markov property does not hold
Disadvantages of direct RL:
• Much information is thrown away
• The agent has no possibility of introspection, e.g. which action have I tried only rarely (for exploration)
• Learning can take much longer
• The learned value function is usually much less accurate
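The Q-learning update from the example earlier in these slides can be traced mechanically. The following is a sketch under stated assumptions: the slides give α = 0.5 and a reward of -1 on every transition, but the discount γ = 1, the initial Q-values of 0, and the function and variable names are assumptions made for illustration.

```python
ACTIONS = ("L", "R")


def q_learning(experiences, terminal, alpha=0.5, gamma=1.0):
    """Apply the tabular Q-learning update to a list of (s, a, r, s_next)."""
    q = {}

    def value(state):
        # V(s) = max_a Q(s, a); terminal states have value 0.
        if state in terminal:
            return 0.0
        return max(q.get((state, a), 0.0) for a in ACTIONS)

    for s, a, r, s_next in experiences:
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + alpha * (r + gamma * value(s_next) - old)
    return q


# The three sequences of transitions from the example, each with reward -1.
TRANSITIONS = [
    ("A", "L", "B"), ("B", "R", "C"), ("C", "R", "D"), ("D", "R", "E"),
    ("C", "L", "D"), ("D", "L", "C"), ("C", "R", "D"), ("D", "R", "E"),
    ("B", "L", "A"), ("A", "R", "B"), ("B", "L", "C"), ("C", "R", "D"),
    ("D", "L", "E"),
]
EXPERIENCES = [(s, a, -1.0, s_next) for s, a, s_next in TRANSITIONS]

q_values = q_learning(EXPERIENCES, terminal={"E"})
for (state, action), q in sorted(q_values.items()):
    print(state, action, q)
```

Only the Q-value of the visited state/action pair changes per experience, which is why a reward at the end of a chain needs many passes before it reaches the early states.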
Slides for the course Introduction to Adaptive Systems: Optimal Control. M. Wiering

Reinforcement learning
Supervised learning: learning from data that all come with the desired outcome.
Reinforcement learning: learning by trying out actions, where after some actions a reward or a punishment is given.
Examples:
• Determine the route of a robot
  – Reward: when the desired position is reached
  – Punishment: when the robot bumps into something
• Play chess, checkers, backgammon, . . .
  – Reward: when the game is won
  – Punishment: when the game is lost
Punishment = reward with a negative value

Contents
• Shortest path algorithms
• Optimal control, dynamic programming
• Reinforcement learning: principles
• Temporal difference learning
• Q-learning
• Model-based learning

Learning goals:
1. Understand Markov decision problems.
2. Understand dynamic programming algorithms and be able to apply them.
3. Understand the RL principles and be able to write down and use the RL algorithms.
4. Understand why exploration and generalisation matter, and be able to describe ways to address them.
5. Be able to think of applications for RL.

Reinforcement learning:
Learn to control an agent by trying out actions and using the obtained feedback (rewards) to reinforce behaviour.
The agent interacts with an environment through (virtual) sensors and actuators.
The reward function determines which behaviour of the agent is most desired.
[Figure: agent-environment loop. The environment sends an input and a reward to the agent; the agent sends an action to the environment.]

Reinforcement learning (RL) and evolutionary algorithms (EA)
Suppose you want to learn to play chess; you then want to learn an evaluation function.
You want to know the evaluation of a position. What you can do is play out the position 1000 times (trying different moves) and see how often the game is won.
By playing against yourself, you can in this way learn an ever better evaluation function (with RL).
Another possibility is to let player A play against player B. The winner continues and gets an opponent that differs only slightly.
By repeatedly running competitions, such evolutionary algorithms can learn to play chess.
Some well-known applications
Samuel's checkers program learned to play checkers (on 64 squares) by playing against itself and became the first game-playing program to beat its programmer (1959).
Tesauro built TD-Gammon (1992), an RL program that learned to play backgammon at world-class level. TD-Gammon became much better than Neurogammon, a program that used supervised learning.
Crites and Barto (1996) used RL to learn a controller for multiple elevators in a simulated environment.
Further:
• Robot control
• Combinatorial optimization
• Network routing
• Traffic control

Shortest path algorithms
Reinforcement learning algorithms learn paths in a search space with the highest sum of rewards or the lowest path costs.
We may already know a number of search algorithms that can compute the shortest path (e.g. breadth-first search).
Reinforcement learning can therefore also be used well for finding the shortest path.
For the shortest path problem in a network, however, more efficient algorithms exist.
The search algorithms discussed so far are inefficient because they expand the search tree again and again for every search node.
Dijkstra's shortest path algorithm (1959) is the most efficient, provided the environment is deterministic and known.

Dijkstra's shortest path algorithm (DKPA)
DKPA exploits the structure of the problem (a graph instead of a tree).
For every search node it keeps track of the following information:
• The state
• The minimal distance to the start node found so far
• The parent node, which indicates the previous state on the best path found
We divide the states over 2 sets:
• a set for which we know that we have determined the shortest path
• a set for which we do not know that yet.
The data structure: (node, cost, parent, expanded).

Example search problem
[Figure: graph with nodes A to F and edge costs A-B 7, A-D 3, B-C 2, C-D 1, C-E 4, D-F 9, E-F 2.]
(1) Initialisation: set the start node to path cost 0. Its optimal path is known (but the node has not been expanded). Set the other nodes to a maximal value.
Known = [(A, 0, [A], 0)]
Unknown = [(B, 1000, []), (C, 1000, []), (D, 1000, []), (E, 1000, []), (F, 1000, [])]
(2) Expand the known, unexpanded node (A):
Known = [(A, 0, [A], 1)]
Unknown = [(B, 7, [BA]), (C, 1000, []), (D, 3, [DA]), (E, 1000, []), (F, 1000, [])]
(3) The unknown node with the shortest path cost becomes known (D):
Known = [(A, 0, [A], 1), (D, 3, [DA], 0)]
Unknown = [(B, 7, [BA]), (C, 1000, []), (E, 1000, []), (F, 1000, [])]
(2) Expand the known, unexpanded node (D):
Known = [(A, 0, [A], 1), (D, 3, [DA], 1)]
Unknown = [(B, 7, [BA]), (C, 4, [CDA]), (E, 1000, []), (F, 12, [FDA])]
(3) The unknown node with the shortest path cost becomes known (C):
Known = [(A, 0, [A], 1), (D, 3, [DA], 1), (C, 4, [CDA], 0)]
Unknown = [(B, 7, [BA]), (E, 1000, []), (F, 12, [FDA])]
(2) Expand the known, unexpanded node (C):
Known = [(A, 0, [A], 1), (D, 3, [DA], 1), (C, 4, [CDA], 1)]
Unknown = [(B, 6, [BCDA]), (E, 8, [ECDA]), (F, 12, [FDA])]
(3) The unknown node with the shortest path cost becomes known (B):
Known = [(A, 0, [A], 1), (D, 3, [DA], 1), (C, 4, [CDA], 1), (B, 6, [BCDA], 0)]
Unknown = [(E, 8, [ECDA]), (F, 12, [FDA])]
(2) Expand the known, unexpanded node (B):
Known = [(A, 0, [A], 1), (D, 3, [DA], 1), (C, 4, [CDA], 1), (B, 6, [BCDA], 1)]
Unknown = [(E, 8, [ECDA]), (F, 12, [FDA])]
(3) The unknown node with the shortest path cost becomes known (E):
Known = [(A, 0, [A], 1), (D, 3, [DA], 1), (C, 4, [CDA], 1), (B, 6, [BCDA], 1), (E, 8, [ECDA], 0)]
Unknown = [(F, 12, [FDA])]
(2) Expand the known, unexpanded node (E):
Known = [(A, 0, [A], 1), (D, 3, [DA], 1), (C, 4, [CDA], 1), (B, 6, [BCDA], 1), (E, 8, [ECDA], 1)]
Unknown = [(F, 10, [FECDA])]
(3) The unknown node with the shortest path cost becomes known (F):
Known = [(A, 0, [A], 1), (D, 3, [DA], 1), (C, 4, [CDA], 1), (B, 6, [BCDA], 1), (E, 8, [ECDA], 1), (F, 10, [FECDA])]

Complexity of Dijkstra's shortest path algorithm
The complexity of Dijkstra's shortest path algorithm is $O(n^2)$ for a naive implementation ($n$ is the number of states).
The complexity can be improved by using more efficient data structures for finding, among the states that are not yet known, the one with minimal cost that becomes known next.
Dijkstra's algorithm only works if all operators have positive costs.
Dijkstra's algorithm only works for deterministic environments.
Dijkstra's algorithm can only be used if the costs and the effect (the successor state) of operators are known.

Dynamic programming
We can use dynamic programming when:
• The environment is stochastic (there are several possible successor states when an operator is used in a given state)
• Operators can have negative costs
• The effects of operators are known.
If an environment is stochastic, the path cost of a state can depend on its parent states. This causes cyclic dependencies between states.
[Figure: states A, B and C. From A, action R leads to B (P = 1, K = 2). From B, action L leads back to A (P = 1, K = 1); action R leads to A with P = 0.5 and K = 2, and to C with P = 0.5 and K = 4.]

Computing with dynamic programming
If an environment is stochastic, we have to compute the average path costs. While generating a path, we let a random number generator determine the outcome of an operator.
For the problem above the following paths are possible (if action R is always chosen):
[A, B, C]; [A, B, A, B, C]; [A, B, A, B, A, B, C] etc.
Each of these paths has a probability:
$P([A, B, C]) = 0.5$; $P([A, B, A, B, C]) = 0.25$; etc.
Each of these paths also has a total cost:
$K([A, B, C]) = 6$; $K([A, B, A, B, C]) = 10$ etc.
The expected path cost $V$ from A to C when action R is always chosen is then the average:
$V([A, C]) = P([A, B, C]) K([A, B, C]) + P([A, B, A, B, C]) K([A, B, A, B, C]) + \ldots$
NB. If action L is always chosen in state B, the path cost $V[A, C]$ is infinitely large!
We therefore often write $V^{\Pi}[A, C]$, in which the policy $\Pi$ indicates which action is chosen in each state.

Using dependencies
If we look at the problem, we see that the cost from A to C equals the cost of going from A to B plus the cost of going from B to C.
We can write this as:
$V[A, C] = K[A, B] + V[B, C] = 2 + V[B, C]$   (1)
We can do the same for state B (again assuming that action R is chosen):
$V[B, C] = 0.5(K[B, A] + V[A, C]) + 0.5(K[B, C] + V[C, C])$
$= 0.5 \cdot 2 + 0.5 V[A, C] + 0.5 \cdot 4 + 0.5 \cdot 0$
$= 3 + 0.5 V[A, C]$
By using the dependencies
(1) $V[A, C] = 2 + V[B, C]$, and
(2) $V[B, C] = 3 + 0.5 V[A, C]$
we can compute $V[A, C]$ and $V[B, C]$.

Methods for computing the V-function
We can do this by:
• Iterating: start with $V[A, C] = 0$ and $V[B, C] = 0$. Then compute the values again and again using the equations.
• Eliminating: we can substitute dependency (1) into (2) and obtain:
$V[B, C] = 3 + 0.5(2 + V[B, C])$
$0.5 V[B, C] = 4$
$V[B, C] = 8$
From this follows: $V[A, C] = 10$.
Above we assumed that the policy $\Pi$ was already known. Then $V^{\Pi}$ can be computed.

Example
Consider the following cyclic graph. Suppose that the policy always selects action R. Now compute the value function.
[Figure: cyclic graph with states A, B, C and D. Edge labels as extracted: (A = L, P = 1, K = 1), (A = R, P = 0.5, K = 2), (A = R, P = 0.5, K = 2), (A = R, P = 1, K = 3), (A = R, P = 0.5, K = 4), (A = R, P = 0.5, K = 4). The assignment of labels to edges is not recoverable from the text.]

Computing action values
Normally we want to compute the optimal policy $\Pi^*$.
For this we use Quality (Q) values for actions.
Example: $Q([A, C], R) = 2 + V[B, C]$
Example: $Q([B, C], L) = 1 + V[A, C]$
Example: $Q([B, C], R) = 0.5(2 + V[A, C]) + 0.5(4 + V[C, C])$
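The two ways of solving dependencies (1) and (2) can be checked numerically. This is a small sketch for the policy that always chooses R; the function names and the number of sweeps are assumptions, while the constants come from the equations $V[A, C] = 2 + V[B, C]$ and $V[B, C] = 3 + 0.5 V[A, C]$.

```python
def solve_by_iteration(sweeps=100):
    """Start at 0 and apply equations (1) and (2) again and again."""
    v_ac, v_bc = 0.0, 0.0
    for _ in range(sweeps):
        v_ac = 2 + v_bc          # equation (1)
        v_bc = 3 + 0.5 * v_ac    # equation (2)
    return v_ac, v_bc


def solve_by_elimination():
    """Substitute (1) into (2): V[B,C] = 3 + 0.5 (2 + V[B,C])."""
    # Rearranged: 0.5 * V[B,C] = 4
    v_bc = 4 / 0.5
    v_ac = 2 + v_bc              # back-substitute into (1)
    return v_ac, v_bc


iterated = solve_by_iteration()
eliminated = solve_by_elimination()
```

Elimination gives the exact answer in one step, whereas iteration halves the remaining error in every sweep and so only approaches the same values.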
Given the values $Q([B, C], L)$ and $Q([B, C], R)$ we can select the best action (the one with the lowest Q-value).
In this way we can express $V$ in terms of $Q$:
$V([B, C]) = \min_a Q([B, C], a)$
Given the Q-values we can now also select the best action:
$\Pi([B, C]) = \arg\min_a Q([B, C], a)$

The general case
We do not always want to mention the goal state explicitly, since there can be several goal states.
Therefore we simply write $V(S)$ and $Q(S, A)$ for the values (expected path costs) from state $S$.
Furthermore we use $P(S, A, T)$ for the probability of a transition from state $S$ to state $T$ when action $A$ is executed.
$K(S, A, T)$ describes the cost of going from state $S$ to state $T$ with action $A$.
Now we can use a dynamic programming algorithm to compute all V-values and Q-values and the optimal policy.

Value iteration:
• (1) Initialisation: $V(S) = 0$; $Q(S, A) = 0$ for all states and actions.
• (2) Iteration: compute the Q-values
$Q(S, A) = \sum_T P(S, A, T)(K(S, A, T) + V(T))$
• (3) Iteration: compute the V-values
$V(S) = \min_A Q(S, A)$
• (4) Iteration: compute the new policy actions
$\Pi(S) = \arg\min_A Q(S, A)$
• Repeat steps (2-4) for all $(S, A)$ pairs until the value function $V$ no longer changes, or hardly changes.
This yields the optimal policy. Unfortunately, iterating until the value function $V$ no longer changes can take infinitely long.
If we stop earlier, we obtain a suboptimal solution, which becomes better the longer we iterate.

Example: dynamic programming
Consider a deterministic maze. The cost of every action is 1. G is the goal state ($V(G) = 0$).
If we apply value iteration we successively obtain:
[Figure: the maze after successive sweeps of value iteration. All values start at 0; after each sweep the values of states further from G increase (0, 1, 2, 3, ...).]
The complexity of dynamic programming for a deterministic maze is $O(NAL)$, where $N$ is the number of states, $A$ the number of actions, and $L$ the length of the longest optimal path.
We cannot use dynamic programming if the effects and costs of operators are unknown.

Conclusion
• Dijkstra's shortest path algorithm can be used if the environment is known, all actions have positive costs and all actions are deterministic.
• Dynamic programming can be used if the environment is known. Actions can have negative costs and can be non-deterministic.
• Dynamic programming relies on a V-function, which estimates the value of being in a state, and on a Q-function, which estimates the value of an action in a state.
• DP cannot be used if the environment is unknown.
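The two methods contrasted in the conclusion can be sketched side by side on the examples from these slides. This is an illustrative sketch, not the course's reference implementation: the edge list is read off the Dijkstra trace and treated as undirected (an assumption), the cost model is the three-state A/B/C example, and all function names, the heap-based queue and the stopping tolerance are assumptions.

```python
import heapq


def dijkstra(graph, start):
    """Shortest path costs in a known, deterministic graph with positive costs."""
    cost = {start: 0}
    known = set()
    queue = [(0, start)]
    while queue:
        c, node = heapq.heappop(queue)
        if node in known:
            continue  # stale queue entry
        known.add(node)
        for neighbour, k in graph[node].items():
            if neighbour not in known and c + k < cost.get(neighbour, float("inf")):
                cost[neighbour] = c + k
                heapq.heappush(queue, (c + k, neighbour))
    return cost


def value_iteration(model, goals, tolerance=1e-9):
    """Cost-based value iteration; model[S][A] is a list of (P, K, T) triples."""
    values = {s: 0.0 for s in model}
    values.update({g: 0.0 for g in goals})
    while True:
        # Step (2): Q(S, A) = sum_T P(S, A, T) (K(S, A, T) + V(T))
        q = {(s, a): sum(p * (k + values[t]) for p, k, t in outcomes)
             for s in model for a, outcomes in model[s].items()}
        # Step (3): V(S) = min_A Q(S, A)
        change = 0.0
        for s in model:
            best = min(q[(s, a)] for a in model[s])
            change = max(change, abs(best - values[s]))
            values[s] = best
        if change < tolerance:
            break
    # Step (4): Pi(S) = argmin_A Q(S, A)
    policy = {s: min(model[s], key=lambda a: q[(s, a)]) for s in model}
    return values, policy


EDGES = [("A", "B", 7), ("A", "D", 3), ("B", "C", 2), ("C", "D", 1),
         ("C", "E", 4), ("D", "F", 9), ("E", "F", 2)]
GRAPH = {}
for u, v, k in EDGES:
    GRAPH.setdefault(u, {})[v] = k
    GRAPH.setdefault(v, {})[u] = k

MODEL = {
    "A": {"R": [(1.0, 2, "B")]},
    "B": {"L": [(1.0, 1, "A")], "R": [(0.5, 2, "A"), (0.5, 4, "C")]},
}

path_costs = dijkstra(GRAPH, "A")
values, policy = value_iteration(MODEL, goals={"C"})
```

Dijkstra settles each node exactly once, which is only valid for deterministic transitions with positive costs. Value iteration keeps revisiting every state, which is what lets it handle the cyclic, stochastic dependency between A and B.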
Slides for the course Introduction to Adaptive Systems: Unsupervised Learning / Self-organizing networks. M. Wiering

Unsupervised Learning and Self-Organizing Networks
Learning goals:
• Know what unsupervised learning is
• Know how K-means clustering works
• Understand competitive learning and be able to explain it
• Understand LVQ and be able to apply it
• Understand Kohonen networks and know the learning formulas
• Be able to compute the weight changes in a competitive learning system

Unsupervised learning
In unsupervised learning we only get patterns $x^p$ as input and no target output.
The learned information must therefore be extracted entirely from the input patterns.
Unsupervised learning can be used for:
• Clustering: group the data in clusters.
• Vector quantisation: discretise a continuous input space.
• Dimensionality reduction: group the data in a subspace of lower dimension than the dimensionality of the data.
• Feature extraction: extract features from the data.

K-means clustering
K-means clustering can be used on continuous attributes.
The algorithm starts with K prototype vectors $w^1, \ldots, w^K$, each of which represents a cluster ($C^1, \ldots, C^K$).
Repeat until the clusters no longer change:
1. Compute which examples $x^1, \ldots, x^n$ fall in each of the clusters.
An example $x^i$ falls in cluster $C^j$ if $w^j$ is the prototype vector with the smallest Euclidean distance to the example:
$d(w^j, x^i) \le d(w^l, x^i)$ for all $l \ne j$,
with $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ the Euclidean distance between $x$ and $y$.
2. Set the prototype vectors to the centre of the input examples in the corresponding cluster. For all clusters $C^j$ do:
$w^j_i = \frac{\sum_{k \in C^j} x^k_i}{|C^j|}$

Example
We have four examples:
(1,2)
(1,4)
(2,3)
(3,5)
We initialise 2 prototype vectors $w^1 = (1, 1)$ and $w^2 = (3, 3)$.
We see that the examples fall in the following clusters:
(1,2) → 1
(1,4) → 2
(2,3) → 2
(3,5) → 2
Now we compute the new prototype vectors and obtain: $w^1 = (1, 2)$ and $w^2 = (2, 4)$.
After this all examples fall in the same clusters again, so we are done.

Competitive learning
Competitive learning divides the input space over clusters in the input data.
Only input patterns $x^p$ are presented.
The output of the learning network for an input pattern is the cluster in which $x^p$ falls.
In a simple competitive learning network all inputs $i$ are connected to all outputs $o$.
How is the active (winning) cluster determined?
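The two K-means steps described earlier in these slides can be written as a short loop and run on the four example points. This is a minimal sketch; the function name `k_means`, the use of `math.dist`, and the choice to leave an empty cluster's prototype unchanged are assumptions, not part of the slides.

```python
import math


def k_means(points, prototypes):
    """Repeat assignment and centre update until the clusters stop changing."""
    prototypes = [tuple(w) for w in prototypes]
    assignment = None
    while True:
        # Step 1: each example goes to the prototype with the smallest distance.
        new_assignment = [
            min(range(len(prototypes)), key=lambda j: math.dist(prototypes[j], x))
            for x in points
        ]
        if new_assignment == assignment:
            return prototypes, assignment
        assignment = new_assignment
        # Step 2: move each prototype to the centre of its cluster.
        for j in range(len(prototypes)):
            members = [x for x, c in zip(points, assignment) if c == j]
            if members:  # assumption: an empty cluster keeps its prototype
                prototypes[j] = tuple(sum(col) / len(members)
                                      for col in zip(*members))


POINTS = [(1, 2), (1, 4), (2, 3), (3, 5)]
final_prototypes, clusters = k_means(POINTS, [(1, 1), (3, 3)])
```

Cluster indices are zero-based here, so cluster 1 of the slides is index 0 and cluster 2 is index 1.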
[Figure: simple competitive network in which every input $i$ is connected to every output $o$ by a weight $w_{io}$.]
We first assume that weights and inputs are normalised to length 1.

Adaptive steps for the network
(1) Each output unit $o$ computes its activation $y_o$ with the dot product:
$y_o = \sum_i w_{io} x_i = w_o \cdot x$
(2) Next, the output neuron $k$ with maximal activation is chosen:
$\forall o \ne k: \; y_o \le y_k$
Activations are reset so that $y_k = 1$ and $y_{o \ne k} = 0$.
(3) Finally the weights of the winning neuron $k$ are changed by the following learning rule:
$w_k(t + 1) = \frac{w_k(t) + \gamma(x(t) - w_k(t))}{\| w_k(t) + \gamma(x(t) - w_k(t)) \|}$
The divisor ensures that the weight vector is normalised.

How competitive learning works
In principle the weight vectors are rotated in the input space towards the example.
So weight vectors come to point to regions where many inputs appear.
It can go wrong if inputs and weight vectors are not normalised: the undesired effect is that large weight vectors win against small weight vectors.
[Figure: weight vectors $w_1$, $w_2$, $w_3$ and an input $x$, shown with and without normalisation; in both panels the winner is 1.]

Selecting the winner with the Euclidean distance
If inputs and weight vectors are not normalised, we can use the Euclidean distance to determine the winner:
Winner $k$: $\| w_k - x \| \le \| w_o - x \| \;\; \forall o$.
If the vectors are normalised, using the Euclidean distance returns the same winner as using the dot product.
The weight update rule shifts the weight vector of the winning neuron towards the input pattern:
$w_k(t + 1) = w_k(t) + \gamma(x(t) - w_k(t))$   (1)

Competitive learning (not normalised)
There are K cluster points (neurons) $w^i$, $1 \le i \le K$.
First we compute for each cluster point the Euclidean distance to an example with:
$d(w^i, x^p) = \sqrt{\sum_j (w^i_j - x^p_j)^2}$
The winning neuron $k$ has the minimal $d(w^k, x^p)$.
Next we shift the winning neuron $k$ towards example $x^p$:
$w^k_i = w^k_i + \gamma(x^p_i - w^k_i)$
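The claim that the dot product and the Euclidean distance select the same winner for normalised vectors can be tried out directly. The sketch below is illustrative only: the three weight vectors, the input and the learning rate are made-up values, and the helper names are assumptions.

```python
import math


def normalise(v):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def winner_by_dot_product(weights, x):
    """Winner = output unit with maximal activation y_o = w_o . x."""
    return max(range(len(weights)),
               key=lambda o: sum(w * xi for w, xi in zip(weights[o], x)))


def winner_by_distance(weights, x):
    """Winner = output unit with minimal Euclidean distance to x."""
    return min(range(len(weights)), key=lambda o: math.dist(weights[o], x))


def normalised_update(w, x, gamma):
    """Shift w towards x and renormalise (the divisor in the learning rule)."""
    return normalise(tuple(wi + gamma * (xi - wi) for wi, xi in zip(w, x)))


# Hypothetical normalised weight vectors and input.
WEIGHTS = [normalise(v) for v in [(1, 0), (0, 1), (1, 1)]]
X = normalise((2, 1))

k_dot = winner_by_dot_product(WEIGHTS, X)
k_dist = winner_by_distance(WEIGHTS, X)
new_w = normalised_update(WEIGHTS[k_dot], X, gamma=0.5)
```

For unit-length vectors $\| w - x \|^2 = 2 - 2\, w \cdot x$, so maximising the dot product and minimising the distance are the same choice.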
Example: we have 2 (K = 2) neurons with initial values $w^1 = (1, 1)$ and $w^2 = (3, 2)$.
Now we get 4 learning examples:
$x^1 = (1, 2)$
$x^2 = (2, 5)$
$x^3 = (3, 4)$
$x^4 = (2, 3)$
We set the learning rate $\gamma = 0.5$. Then we get:
$x^1 = (1, 2)$ → $d(w^1, x^1) = 1$, $d(w^2, x^1) = 2$. So the winner is $w^1 = (1, 1)$. Applying the update equation gives:
$w^1 = (1, 1) + 0.5((1, 2) - (1, 1)) = (1, 1.5)$.
$x^2 = (2, 5)$ → $d(w^1, x^2) = \sqrt{13.25}$, $d(w^2, x^2) = \sqrt{10}$. So the winner is $w^2 = (3, 2)$. Applying the update equation gives:
$w^2 = (3, 2) + 0.5((2, 5) - (3, 2)) = (2.5, 3.5)$.
$x^3 = (3, 4)$ → $d(w^1, x^3) = \sqrt{10.25}$, $d(w^2, x^3) = \sqrt{0.5}$. So the winner is $w^2 = (2.5, 3.5)$. Applying the update equation gives:
$w^2 = (2.5, 3.5) + 0.5((3, 4) - (2.5, 3.5)) = (2.75, 3.75)$.
Now try it yourself for the last example.

Initialisation
A problem of the recursive clustering methods is the initialisation: it can happen that a neuron never becomes the winner and therefore does not learn.
Two solutions:
• Initialise a neuron on an input pattern
• Use "leaky learning". Here all neurons learn on all input patterns with a very small learning rate $\gamma' \ll \gamma$:
$w_l(t + 1) = w_l(t) + \gamma'(x(t) - w_l(t)), \;\; \forall l \ne k$

Minimising the cost function
Clustering means that the similarities between inputs in the same cluster are much larger than the similarities with inputs in other clusters.
The degree of similarity is often expressed using the (inverse of the) Euclidean distance measure.
A common measure for the quality of a clustering is given by the following squared-error cost function:
$E = \frac{1}{2} \sum_p \| w_k - x^p \|^2$
Here $k$ is again the winning neuron for input pattern $x^p$.
We can show that competitive learning searches for the minimum of the cost function by following the negative derivative.
Claim: the error function for pattern $x^p$,
$E^p = \frac{1}{2} \sum_i (w_{ik} - x^p_i)^2$
where $k$ is the winning neuron, is minimised by the weight update rule of equation (1).
Proof: we compute the effect of a weight change on the error function:
$\Delta_p w_{io} = -\gamma \frac{\partial E^p}{\partial w_{io}}$
Now the partial derivative of $E^p$ is:
$\frac{\partial E^p}{\partial w_{io}} = w_{io} - x^p_i$ if unit $o$ wins, and $= 0$ otherwise   (2)
From this follows (for the winner $o$):
$\Delta_p w_{io} = \gamma(x^p_i - w_{io})$
So: the cost function is minimised by repeated weight updates.

Vector quantisation
Another important use of competitive learning is vector quantisation.
A vector quantisation divides the input space into a number of distinct subspaces.
The difference with clustering is that we are now not so much interested in clusters of similar data, but more in quantising the whole input space.
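The worked competitive-learning example ($w^1 = (1, 1)$, $w^2 = (3, 2)$, $\gamma = 0.5$) can be replayed with a few lines of code, including the fourth example that is left to the reader. This is a sketch; the function name `competitive_step` and the zero-based neuron indices are assumptions.

```python
import math


def competitive_step(weights, x, gamma):
    """Find the nearest neuron and shift it towards x; return its index."""
    k = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    weights[k] = tuple(w + gamma * (xi - w) for w, xi in zip(weights[k], x))
    return k


weights = [(1.0, 1.0), (3.0, 2.0)]            # w^1 and w^2 from the example
examples = [(1, 2), (2, 5), (3, 4), (2, 3)]   # x^1 .. x^4
history = []
for x in examples:
    winner = competitive_step(weights, x, gamma=0.5)
    history.append((winner, list(weights)))
    print("input", x, "winner", winner + 1, "weights", weights)
```

The winner is determined from the weights before the update, so the order in which the examples are presented affects the final weight vectors.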
De quantisatie welke geleerd wordt door competitive leren respecteert de verdeling van inputs:
meer inputs in een regio leidt tot meer clusters
uit:
Een voorbeeld vector quantisatie:
Dit is simpelweg de delta-regel met yo =
wko waar k de winnende neuron is.
wko (t + 1) = wko (t) + γ(d − wko (t))
Als we een functie g(x, k) definieren als:
g(x, k)
= 1,
= 0,
Als k de winnaar is
Anders
Dan kan aangetoond worden dat de bovenstaande leerprocedure convergeert naar:
R
n yo g(x, h)dx
Who = R<
<n g(x, h)dx
Combinatie met Supervised leren
Vector quantisatie kan gebruikt worden als preprocessing stadium voor een supervised lerend
systeem.
Dus: elke tabel-entry (gewicht van hidden unit
h naar output unit o) convergeert naar het gemiddelde van de doelwaarde voor alle keren dat
de neuron wint.
Een voorbeeld is het volgende netwerk dat vector quantisatie in de eerste laag combineert met
supervised leren in de 2e laag:
Supervised Vector Quantization
Vector
Quantisatie
i
wih
Feed
Forward
h
who
De winnende neuron verschuift volgens dezelfde update regel als bij vector quantisatie, maar
bovendien wordt er een output y k voor de winnende neuron wk na het leervoorbeeld bijgesteld
met:
o
Y
y k = y k + α(Dp − y k )
We can first perform the vector quantization and then make the supervised learning step, or we can adapt both layers at the same time.
Learning algorithm for supervised vector quantization
• Present the network with input x and target value d = f(x).
• Perform the unsupervised quantization step. Compute the distance from x to each weight vector and determine the winner k. Adjust the weight vector with the unsupervised learning rule.
• Perform the supervised approximation learning step.
Example
Suppose again 2 cluster points:
w1 = (1, 1), y1 = 0 and w2 = (3, 2), y2 = 0.
We receive the following training examples:
(x1 → D1) = (1, 2 → 3)
(x2 → D2) = (2, 5 → 7)
(x3 → D3) = (3, 4 → 7)
(x4 → D4) = (2, 3 → 5)
Suppose we set the learning rates γ = 0.5 and α = 0.5. Then we get:
x1 = (1, 2) → d(w1, x1) = 1, d(w2, x1) = 2.
So: the winner is w1 = (1, 1). Applying the update equation gives:
w1 = (1, 1) + 0.5((1, 2) − (1, 1)) = (1, 1.5).
This is exactly as before.
The only difference is that we now also have to adjust the output of the winning neuron:
y1 = 0 + 0.5(3 − 0) = 1.5
We now only show how the output values of the neurons change; the weight vectors wi change just as before.
x2 = (2, 5). Winner is neuron 2.
y2 = 0 + 0.5(7 − 0) = 3.5.
x3 = (3, 4). Winner is neuron 2.
y2 = 3.5 + 0.5(7 − 3.5) = 5.25.
Now try to compute the update for example 4 yourself.
Learning Vector Quantization (LVQ)
This is really a supervised learning algorithm for discrete outputs.
These networks try to learn "decision boundaries" from labeled examples, so that each example falls in a region with the correct class label.
We now have K cluster points (neurons) with a labeled output. We compute the nearest neuron w_k1 and the second-nearest neuron w_k2.
The algorithm looks as follows:
1. Associate a class label y_o with each output neuron o
2. Present a training example (x^p, d^p)
3. Use a distance measure between the weight vectors and the input vector x^p to find the winning neuron k1 and the runner-up neuron k2:
$\|x^p - w_{k_1}\| < \|x^p - w_{k_2}\| < \|x^p - w_i\| \quad \forall i \neq k_1, k_2$
4. The labels y_k1 and y_k2 are compared with d^p, from which a weight change is determined.
Update rules
• If y_k1 = d^p: apply the weight update rule for k1:
$w_{k_1}(t+1) = w_{k_1}(t) + \gamma (x^p - w_{k_1}(t))$
• Otherwise, if y_k1 ≠ d^p and y_k2 = d^p: apply the weight update rule for k2:
$w_{k_2}(t+1) = w_{k_2}(t) + \gamma (x^p - w_{k_2}(t))$
and move the winning neuron away from the example:
$w_{k_1}(t+1) = w_{k_1}(t) - \gamma (x^p - w_{k_1}(t))$
So: the neuron with the correct label (if the winner or the runner-up has it) is moved toward the input pattern.
[Figure: input space with regions labeled A, B, C and D]
LVQ: example
Suppose we start with the following cluster points:
w1 = (1, 1) with label y1 = A. w2 = (3, 2) with label y2 = B.
We receive the following training examples:
(x1 → D1) = (1, 2 → A)
(x2 → D2) = (2, 5 → B)
(x3 → D3) = (3, 4 → A)
(x4 → D4) = (2, 3 → B)
With γ = 0.5 we then get:
x1 = (1, 2). Winner is neuron 1, runner-up is neuron 2. The label of neuron 1 equals D1. So: neuron 1 shifts toward the example:
w1 = (1, 1) + 0.5((1, 2) − (1, 1)) = (1, 1.5).
x2 = (2, 5). Winner is neuron 2, runner-up is neuron 1. The label of neuron 2 equals the label D2, so neuron 2 shifts toward the example:
w2 = (3, 2) + 0.5((2, 5) − (3, 2)) = (2.5, 3.5).
x3 = (3, 4). Winner is neuron 2, runner-up is neuron 1. The label of neuron 2 is not equal to the label D3. The label of neuron 1 is equal to D3. So we shift neuron 1 toward the example and neuron 2 away from the example:
w1 = (1, 1.5) + 0.5((3, 4) − (1, 1.5)) = (2, 2.75).
w2 = (2.5, 3.5) − 0.5((3, 4) − (2.5, 3.5)) = (2.25, 3.25).
Now determine the updates for example 4 yourself.
Question: draw the decision boundaries.
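The LVQ update rules above can be sketched in a few lines. This is an illustrative sketch only: the name `lvq_step` is hypothetical, and doing nothing when neither the winner nor the runner-up carries the correct label is an assumption, since the slides only describe the two cases shown. The data are the first three training examples of the LVQ example.

```python
import math

def lvq_step(weights, labels, x, d, gamma):
    """One LVQ update for input x with class label d (weights change in place)."""
    order = sorted(range(len(weights)), key=lambda k: math.dist(weights[k], x))
    k1, k2 = order[0], order[1]  # winner and runner-up

    def move(k, sign):
        # sign = +1 moves neuron k toward x, sign = -1 moves it away.
        weights[k] = [w + sign * gamma * (xi - w) for w, xi in zip(weights[k], x)]

    if labels[k1] == d:
        move(k1, +1)
    elif labels[k2] == d:
        move(k2, +1)
        move(k1, -1)
    # Assumption: if neither label matches, no weight is changed.
    return k1, k2

lvq_weights = [[1.0, 1.0], [3.0, 2.0]]
lvq_labels = ["A", "B"]
for x, d in [((1, 2), "A"), ((2, 5), "B"), ((3, 4), "A")]:
    lvq_step(lvq_weights, lvq_labels, x, d, gamma=0.5)
    print(x, d, lvq_weights)
```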
The learned partition of the input space is also called a Voronoi diagram.
Question: Given the following placement of output neurons, draw (approximately) the corresponding Voronoi diagram.
Kohonen network
In a Kohonen network the output units are ordered in a certain way, e.g. in a 2-dimensional grid.
This ordering determines which neurons are neighbors of each other.
Inputs that fall close together must be mapped onto output units (in S) that lie close together (the same neuron or neighbors).
As a result, the topology inherent in the input signals is preserved in the learned Kohonen network.
Often a Kohonen network of lower dimensionality than the input vectors is used. This is especially useful when the inputs fall in a subspace of $\Re^n$.
If the intrinsic dimensionality in S is smaller than N, the neurons of the network are "folded" in the input space.
Kohonen network learning algorithm
For a training example the winning neuron k is again computed with the familiar Euclidean distance measure.
Next, the input weights of the winning neuron and of its (not only direct) neighbors are adjusted by:
$w_o(t+1) = w_o(t) + \gamma \, g(o,k) \, (x(t) - w_o(t)) \quad \forall o \in S$
Here g(o, k) is a decreasing function of the distance between units o and k such that g(k, k) = 1. For example:
$g(o,k) = \exp(-\mathrm{neighbordistance}(o,k)^2)$
Through this collective learning method, inputs that fall close together are mapped onto output neurons that are close together.
Example learning process
[Figure: the network at iteration 0, iteration 600 and iteration 1900]
Kohonen network: Example
We have a Kohonen network with 3 neurons connected in a line. We use the neighborhood relations by taking g(k, k) = 1 and setting g(h, k) = 0.5 if h and k are direct neighbors; otherwise g(h, k) = 0.
We again first compute the winning neuron k, and then we update all neurons as follows:
$w_i = w_i + \gamma \, g(i,k) \, (x^p - w_i)$
We initialize w1 = (1, 1), w2 = (3, 2), w3 = (2, 4). Again we set γ = 0.5.
We again receive the examples:
x1 = (1, 2)
x2 = (2, 5)
x3 = (3, 4)
x4 = (2, 3)
On x1 = (1, 2) neuron 1 wins. This results in the update:
w1 = (1, 1) + 0.5 ∗ 1((1, 2) − (1, 1)) = (1, 1.5).
We must also update the neighbors. g(2, 1) = 0.5 and g(3, 1) = 0. So we update neuron 2:
w2 = (3, 2) + 0.5 ∗ 0.5((1, 2) − (3, 2)) = (2.5, 2).
On x2 = (2, 5) neuron 3 wins. This results in the update:
w3 = (2, 4) + 0.5 ∗ 1((2, 5) − (2, 4)) = (2, 4.5).
w2 = (2.5, 2) + 0.5 ∗ 0.5((2, 5) − (2.5, 2)) = (2.375, 2.75).
On x3 = (3, 4) neuron 3 wins. This results in the update:
w3 = (2, 4.5) + 0.5 ∗ 1((3, 4) − (2, 4.5)) = (2.5, 4.25).
w2 = (2.375, 2.75) + 0.5 ∗ 0.5((3, 4) − (2.375, 2.75)) = (2.53, 3.06).
Now try it yourself for the last example.
Kohonen network: Supervised learning
A Kohonen network can also be used for supervised learning. For this we can give each output neuron h a table entry (w_ho).
To determine the total output y we can let the outputs of the neighbors count as well:
$y = \frac{\sum_{h \in S} g(h,k) \, w_{ho}}{\sum_{h \in S} g(h,k)}$
Each neuron then learns the average output over all weighted input patterns:
$w_{ho} = w_{ho} + \gamma (d - w_{ho}) \frac{g(h,k)}{\sum_{i \in S} g(i,k)}$
The unsupervised learning step can optionally be adapted so that the best neurons shift fastest toward the training example.
Conclusion
• Unsupervised learning methods can be used for: clustering, vector quantization, dimensionality reduction and feature extraction.
• In competitive learning the neurons compete to be activated.
• The unsupervised learning methods can be extended with an extra output layer so that they can also learn in a supervised way. The delta rule is used for the new layer.
• The learning algorithms shown handle continuous inputs best; for discrete inputs extra adaptations are needed.
• The learning algorithms respect the locality principle: inputs that lie close together are grouped together.
• For supervised learning, the learning algorithms shown are suitable when the function is very erratic (not smooth). By adding extra neurons a good approximation of a function can be learned.
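The Kohonen update from this lecture can be sketched for the three-neuron line used in the worked example. The sketch below is only an illustration: the names `g` and `kohonen_step` are hypothetical, neuron positions on the line are assumed to be their list indices, and only the first three examples are presented so that the last one remains an exercise.

```python
import math

def g(h, k):
    """Neighborhood function for neurons on a line (indices are positions)."""
    gap = abs(h - k)
    if gap == 0:
        return 1.0
    return 0.5 if gap == 1 else 0.0

def kohonen_step(weights, x, gamma):
    """Find the winner k and move every neuron i by gamma * g(i, k)."""
    k = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    for i, w in enumerate(weights):
        rate = gamma * g(i, k)
        weights[i] = [wi + rate * (xi - wi) for wi, xi in zip(w, x)]
    return k

som_weights = [[1.0, 1.0], [3.0, 2.0], [2.0, 4.0]]
for x in [(1, 2), (2, 5), (3, 4)]:
    winner = kohonen_step(som_weights, x, gamma=0.5)
    print(f"x={x}: winner is neuron {winner + 1}, weights={som_weights}")
```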
Slides for the course Introduction to Adaptive Systems: Biological Adaptive Systems. M. Wiering
Natural science
Before the 16th century many scientists still believed in a deductive approach to acquiring knowledge.
Aristotle, for example, thought that heavy objects fall faster than light objects.
This lasted until Galileo Galilei (1564 - 1642) tested it, which showed that this hypothesis was wrong.
After this a number of important scientific breakthroughs followed, which resulted in a new natural science:
• Galilei made predictive methods (the earth revolves around the sun)
• Huygens could make better clocks, lenses and telescopes, so that experiments could be done much more precisely
• Kepler approximated the orbits of planets with ellipses instead of the usual circles
• Newton discovered the law of attraction between 2 objects, from which it also followed that planetary orbits are ellipses
Reversible systems
The new science led to the idea that the universe was predictable (Laplace's demon).
Newton's mechanical laws describe a reversible system. This means that if we change the direction of time, we swap the past and the future.
Reversible systems conserve their energy and can therefore continue their motion.
An example is a pendulum clock if we neglect friction.
Irreversible systems
There are also many systems in which usable energy is lost (thermodynamic systems).
It is, for example, not possible to build a machine that keeps running forever without receiving extra energy.
As an example of an irreversible system we take a vessel with two halves in which initially all gas molecules are in 1 half (an ordered state).
If we remove the wall that separates the halves, the disorder of the system will only increase.
Entropy
Boltzmann devised the measure entropy to describe the disorder of a system.
Suppose there are N molecules, of which N1 are in one half of the vessel and N2 in the other half. Then the number of permutations of that state is:
$P = \frac{N!}{N_1! \, N_2!}$
It is therefore logical that the system goes to an equilibrium with the largest number of possible states, so with N1 = N2.
Boltzmann defined the entropy of the system:
$S = k \log P$
Because the entropy increases continually and there is a state with maximal entropy, the system will eventually end up in an equilibrium; the system is therefore not reversible.
This led to the two laws of thermodynamics (Clausius 1865):
• The energy of the world is constant
• The entropy of the world goes to a maximal value
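The permutation count and Boltzmann's entropy can be evaluated directly for a small vessel. This is a hedged sketch: the function names are hypothetical, the constant k is set to 1 for illustration (in physics it is Boltzmann's constant), and the logarithm is taken to be the natural logarithm.

```python
import math

def permutations(n1, n2):
    """P = N! / (N1! N2!) with N = N1 + N2."""
    return math.comb(n1 + n2, n1)

def entropy(n1, n2, k=1.0):
    """S = k log P; k = 1 is an illustrative assumption."""
    return k * math.log(permutations(n1, n2))

N = 10
for n1 in range(N + 1):
    print(n1, N - n1, permutations(n1, N - n1), round(entropy(n1, N - n1), 3))

# The split with the highest entropy is the even one, N1 = N2.
most_likely_n1 = max(range(N + 1), key=lambda n1: entropy(n1, N - n1))
```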
Note that this holds for closed systems. Open systems such as living beings can reduce their entropy by taking up usable energy from the environment.
Chaos theory: the Lorenz attractor
In some systems small differences in initial conditions cause large differences in the future. Such systems show chaotic behavior.
The meteorologist Edward Lorenz found a chaotic system by chance with the help of his computer when he made a model of the weather.
Although he had used almost the same parameters and the system was deterministic, after a while the system followed a completely different trajectory.
Logistic map
The Lorenz attractor is difficult to analyze, so we use the simpler logistic map described by:
$x(t+1) = r \, x(t) \, (1 - x(t))$
Here x(t) has a value between 0 and 1. The interesting thing is to look at what happens when we change the control parameter r.
For r < 1, x always goes to 0. If we increase r we first get 1 stable end point, which depends on the value of r.
If we increase r even more we get a periodic cycle of length 2.
Chaos in the logistic map
But what happens if we keep increasing r?
And if we then zoom in: [figure]
Biological Adaptive Systems
Adaptive systems can be used well to model biological processes.
Some examples are:
• Infectious diseases
• Forest fires
• Floods
• Volcanic eruptions
• Co-evolving species
The first 4 examples have an aspect in common:
They spread (propagate themselves) along paths that depend on the environment.
To counter the propagation, propagation paths must be "closed".
Spatial models vs. non-spatial models
We can model biological processes with spatial models such as cellular automata, but we can also model them directly with equations.
The use of spatial models gives an extra degree of freedom and thereby more possible emergent patterns.
Spatial models make certain processes easier to visualize (e.g. forest fires).
Spatial models are, however, much slower to simulate.
Modeling infectious diseases
We will now look at 2 ways to simulate infectious diseases.
With infectious diseases there are 3 kinds of individuals (agents or populations):
• Healthy individuals (G)
• Infected, sick individuals (Z)
• Recovered, immune individuals (I)
If a healthy individual comes into contact with an infected individual, the healthy individual becomes infected as well.
If an infected individual has been sick long enough, it becomes a recovered, immune individual.
Non-spatial model for infectious diseases
We can make a model using update equations.
We start with a population (Z(0), I(0), G(0)), after which we use:
$Z(t+1) = Z(t) + a Z(t) G(t) - b Z(t)$
$G(t+1) = G(t) - a Z(t) G(t)$
$I(t+1) = I(t) + b Z(t)$
Starting with an initial population and chosen parameters (a, b), we can simulate the behavior.
Note that Z, I, G must not become negative!
Eventually all individuals can become recovered, immune individuals.
Question: start with (Z(0) = 10, I(0) = 0, G(0) = 90), take a = 0.01 and b = 0.1, and iterate the model a few times.
Spatial model of infectious diseases
We can also use a cellular automaton (CA):
[Figure: CA grid whose cells are labeled G, Z or I]
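The two iterated update rules of this part, the logistic map and the non-spatial infection model, can be sketched as follows. The function names are hypothetical. Capping the number of new infections at G and the number of recoveries at Z is one possible way, assumed here, to keep the populations non-negative; the slides only state that they must not become negative.

```python
def logistic_map(r, x0, steps):
    """Iterate x(t+1) = r x(t) (1 - x(t)) and return the final x."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x

def infection_step(Z, I, G, a, b):
    """One step of the non-spatial model; returns the new (Z, I, G)."""
    # Assumption: cap both flows so that no population becomes negative.
    new_infections = min(a * Z * G, G)
    recoveries = min(b * Z, Z)
    return (Z + new_infections - recoveries, I + recoveries, G - new_infections)

# For r < 1 the state dies out; for a somewhat larger r it settles on one point.
print(logistic_map(0.5, 0.5, 100), logistic_map(2.5, 0.3, 200))

state = (10.0, 0.0, 90.0)  # (Z(0), I(0), G(0)) from the question above
for t in range(1, 6):
    state = infection_step(*state, a=0.01, b=0.1)
    print(t, [round(v, 2) for v in state])
```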
Now we must set up rules to change the states in the CA.
Rules for the CA model
If a G has a Z in a cell next to it, this individual becomes a Z as well.
A Z has a probability p at each time step of becoming an I.
For the navigation we can use a random walk of all agents; they make random steps in all directions.
We could also model that healthy people stay away from infected people, which causes very different dynamics.
Question: What phenomena could occur if healthy individuals stay away from infected individuals?
Swarm Intelligence
Large groups of simple organisms, such as bees or ants, can together show intelligent behavior.
Examples are:
• Foraging behavior (searching for food)
• Protection of the nest
• Building the nest (e.g. termites; where is the blueprint?)
• Food distribution and storage
It is known that ant colonies can solve certain well-known problems, such as finding the shortest path to a food source, and sorting (clustering and stacking) food or ant corpses.
Although a single ant or bee is not very intelligent, the behavior of the whole colony does show intelligent behavior (super-intelligence).
Sorting behavior of ant colonies
A single ant has only very limited intelligent capabilities.
Yet the total group behavior that emerges from the cooperation between many ants in a colony can appear quite intelligent.
How can ants build cemeteries with piles of corpses?
We can make a simple model of an ant:
• The ant moves in random directions
• If an ant carries nothing and sees a dead ant, the ant picks up the corpse
• If the ant carries a corpse and encounters a collection of other corpses, it leaves the corpse there.
These 3 simple rules cause the cemeteries of ant corpses that can be observed.
Question: think of how ants can separate grains of sugar and chocolate from each other.
Combinatorial optimization
Certain problems take exponential time to solve optimally.
Example of exponential-time problems: suppose a problem consists of n states and the time it takes to solve it is $2^n$ or $n!$.
The cost of exponential-time problems grows much faster than that of polynomial-time problems:
$\lim_{n \to \infty} \frac{n^p}{e^n} = 0$
A number of well-known mathematical problems are called combinatorial optimization problems (an example are NP-complete problems, which cannot be solved in polynomial time unless P = NP).
Since computing power cannot increase faster than exponentially, we can never solve certain large combinatorial optimization problems optimally.
Examples of combinatorial optimization problems
Some examples of combinatorial optimization problems are:
• Traveling salesman problem: find the shortest tour between n cities.
• Quadratic assignment problem: minimize the flow (total distance traveled) when a number of employees visit each other in a building with a certain frequency.
• 3-Satisfiability: find truth values for propositions that make a formula of the following form true:
$(x_1 \vee \neg x_2 \vee x_4) \wedge \ldots \wedge (x_1 \vee \neg x_5 \vee x_7)$
• Job-shop scheduling: minimize the total time needed to have a number of jobs completed by a number of machines, which each job must pass through in sequence.
The course on (logical) complexity theory goes into this in more detail.
Traveling salesman problem (TSP)
There is a salesman who wants to visit n cities and return to his starting city.
All cities i and j are connected by a road of length l(i, j). These lengths are stored in a distance matrix.
The agent must now devise a tour that minimizes the total cost of visiting all cities.
[Figure: example graph of cities with edge lengths]
Question: How many possible tours are there with N cities?
Generating a tour
How can we generate a tour?
The constraints are that all cities are visited exactly once and that the tour returns to its starting city.
We can keep a list of cities that have not been visited yet: J = {i | i has not been visited yet}
In the beginning J contains all cities. After a city is visited, that city is removed from J.
1. Choose initial city s1 and remove s1 from J
2. For t = 2 To N:
3. Choose city st from J and remove it from J
4. Compute the length of this tour:
$L = \sum_{t=1}^{N-1} l(s_t, s_{t+1}) + l(s_N, s_1)$
The goal is to find the tour with minimal total length L.
Ant Algorithms
A new kind of multi-agent adaptive system for combinatorial optimization was devised by Marco Dorigo in 1992.
A colony of ants works together to search for, e.g., an optimal tour for the TSP.
Foraging ants deposit a chemical substance (called pheromone) when they go from their nest to a food source and vice versa.
Other foraging ants follow the tracks with the highest pheromone level according to a probability distribution. This collective foraging behavior enables the ants to find the shortest path from their nest to a food source.
Optimization algorithms inspired by the collective foraging behavior of ants are called Ant Algorithms.
Example: Foraging ants
[figure]
Foraging ants (continued)
[figure]
Properties of Ant Algorithms
There is a range of different Ant Algorithms, but they all share the following properties:
• They consist of an artificial colony of cooperating ants
• Ants make discrete steps.
• Ants deposit pheromone on their chosen paths
• Ants use the deposited pheromone tracks to choose where they go
Ant Algorithms are used for a large number of combinatorial optimization problems such as TSP, QAP and network routing.
The idea that two individuals interact indirectly, because one of them changes the environment and the other uses the changed environment for making decisions, is called stigmergy.
Ant System
The first Ant algorithm was "the Ant System" (AS). The Ant System was initially tested on the TSP.
The Ant System works as follows:
1. All ants make a tour, using the pheromone tracks to choose the next city.
2. All edges that were not followed lose a little pheromone through evaporation
3. All followed tours receive extra pheromone, where the shortest tour receives more pheromone than longer tours.
Question: think of a variant of this and analyze its advantages and disadvantages.
Formal specification of the Ant System
The colony consists of K ants.
The amount of pheromone between 2 cities i and j is denoted m(i, j).
In choosing a next city we use an extra heuristic: the inverse of the length between 2 cities: $v(i,j) = \frac{1}{l(i,j)}$
Now each ant k = 1..K makes a tour:
1. Choose a random start city for ant k: i = random(1, N), and remove this start city from the list $J_k$ of unvisited cities for ant k
2. Repeatedly choose the next city for ant k as follows:
$j = \begin{cases} \arg\max_{h \in J_k} \{ [m(i,h)] \cdot [v(i,h)]^{\beta} \} & \text{if } q \leq q_0 \\ S & \text{otherwise} \end{cases} \quad (1)$
Here q is a random number (0 ≤ q ≤ 1). The parameter q0 (0 ≤ q0 ≤ 1) determines the relative importance of exploitation versus exploration.
S is a city chosen according to the probability distribution given in the following equation:
$p_{ij} = \begin{cases} \frac{[m(i,j)] \cdot [v(i,j)]^{\beta}}{\sum_{h \in J_k} [m(i,h)] \cdot [v(i,h)]^{\beta}} & \text{if } j \in J_k \\ 0 & \text{otherwise} \end{cases} \quad (2)$
Now all ants have made a tour.
Other ways to select the next cities are also possible, but the ones above usually work somewhat better.
The parameters that must be set are how large the pheromone tracks m(i, j) are initially, what q0 is, and what β should be.
Updating the pheromone tracks
There are several possible update rules. One can, for example, choose to update only the best tour (and not all of them).
We call the best tour found by all ants in the last generation $S_{gb}$ (S global best). This tour has length $L_{gb}$.
The update rule then looks as follows:
$m(i,j) = (1 - \alpha) \cdot m(i,j) + \alpha \cdot \Delta m(i,j)$
where
$\Delta m(i,j) = \begin{cases} (L_{gb})^{-1} & \text{if edge } (i,j) \in S_{gb} \\ 0 & \text{otherwise} \end{cases}$
Here α is the learning rate. Note that the extra amount of pheromone depends on the length of the best tour found, and that the other edges evaporate.
Discussion
There are many kinds of biological adaptive systems.
We first looked at how we can make models for the spread of infectious diseases.
Spatial models offer more simulation freedom, but take longer to run on a computer.
Next we looked at how a colony of "dumb" ants can still show intelligent behavior.
For example, ants can cluster objects using a few simple rules.
Ant algorithms can be used to solve complex combinatorial problems such as the traveling salesman problem.
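The tour construction of equations (1) and (2) and the best-tour pheromone update can be sketched as below. This is a minimal sketch under stated assumptions, not Dorigo's reference implementation: the function names, the four cities on a unit square, the parameter values and the symmetric treatment of edges (i, j) and (j, i) are all illustrative choices.

```python
import math
import random

def tour_length(tour, dist):
    """L = sum of l(s_t, s_t+1) plus the edge back to the start city."""
    return sum(dist[tour[t]][tour[(t + 1) % len(tour)]] for t in range(len(tour)))

def build_tour(dist, m, beta, q0, rng):
    """Let one ant build a tour using pheromone m and heuristic v = 1 / l."""
    n = len(dist)
    i = rng.randrange(n)
    tour, unvisited = [i], set(range(n)) - {i}
    while unvisited:
        cities = sorted(unvisited)
        scores = [m[i][h] * (1.0 / dist[i][h]) ** beta for h in cities]
        if rng.random() <= q0:
            j = cities[scores.index(max(scores))]        # exploitation, eq. (1)
        else:
            j = rng.choices(cities, weights=scores)[0]   # exploration, eq. (2)
        tour.append(j)
        unvisited.remove(j)
        i = j
    return tour

def update_pheromone(m, best_tour, best_length, alpha):
    """m(i,j) = (1 - alpha) m(i,j) + alpha * delta, delta = 1/L on best-tour edges."""
    n = len(m)
    on_best = set()
    for t in range(n):
        a, b = best_tour[t], best_tour[(t + 1) % n]
        on_best.update({(a, b), (b, a)})  # assumption: edges are undirected
    for i in range(n):
        for j in range(n):
            delta = 1.0 / best_length if (i, j) in on_best else 0.0
            m[i][j] = (1 - alpha) * m[i][j] + alpha * delta

points = [(0, 0), (1, 0), (1, 1), (0, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
m = [[1.0] * 4 for _ in range(4)]
rng = random.Random(0)
for generation in range(10):
    tours = [build_tour(dist, m, beta=2.0, q0=0.5, rng=rng) for _ in range(5)]
    best = min(tours, key=lambda t: tour_length(t, dist))
    update_pheromone(m, best, tour_length(best, dist), alpha=0.1)
print(best, tour_length(best, dist))
```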
Slides for the course Introduction to Adaptive Systems: Neural Networks. M. Wiering
Neural networks
Learning goals:
• Know when neural networks are applicable
• Know the Delta learning rule
• Be able to compute the weight changes in a linear network given a training example
• Know what multi-layer feedforward neural networks are
• Be able to write down and explain the backpropagation learning rule
• Know what recurrent neural networks are
Neural networks
Artificial neural networks (ANN) consist of a set of neurons (computational units) that are connected in networks (McCulloch and Pitts, 1943).
They have useful computational properties (e.g. they can approximate all continuous functions with 1 hidden layer and all functions with 2 hidden layers).
They offer us the possibility to learn how the brain works.
A neuron or nerve cell is the fundamental building block of the brain.
A neuron consists of a cell body: the soma.
From the cell body branch dendrites and an axon. An axon connects to the dendrites of other neurons in synapses, the connection points.
A human neural network
[Figure: a neuron, with labels synapse, axon of another neuron, nucleus, dendrite, axon, soma]
Chemical transmitter substances are released in the synapses and flow into the dendrites.
This has the effect that the action potential in the soma decreases or increases.
When the action potential exceeds a certain threshold, an electrical pulse is passed on to the axon (the neuron fires).
Synapses that increase the action potential are called excitatory.
Synapses that decrease the action potential are called inhibitory.
Artificial neural networks
A neural network consists of a number of neurons (units) and connections between the neurons.
Each connection has a weight associated with it (a number).
Learning usually happens by adjusting the weights.
Each neuron has a number of incoming connections from other neurons, a number of outgoing connections and an activation level.
The idea is that each neuron performs a local computation, using its incoming connections.
To build a neural network one must set the topology of the network (how the neurons are connected).
Weights are usually initialized randomly.
Comparison of ANN and biological NN
Consider humans:
• Neuron switching time: 0.001 second
• Number of neurons: $10^{10}$ to $10^{11}$
• Connections per neuron: $10^{4}$ to $10^{5}$
• Visual recognition time: 0.1 second
• 100 inference steps does not seem enough → a lot of parallel computation
Properties of neural networks (ANN)
• Many neuron-like threshold switching units
• Many weighted connections between units
• Highly parallel, distributed process
• Emphasis on learning the weights automatically
When can neural networks be used?
• Input is high-dimensional, discrete or continuous (e.g. raw sensor input)
• Output is discrete or continuous
• Output is a vector of values
• Possibly noisy data
• The form of the target function is unknown
• Human readability of the solution found is unimportant
Examples:
• Speech recognition
• Image classification (face recognition)
• Financial prediction
• Pattern recognition (postal codes)
Example: driving a car
[Figure: network with a 30x32 sensor input retina, 4 hidden units and 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right]
Feedforward neural networks
These form a layered structure. All connections go from 1 layer to the next layer.
We distinguish the following layers:
• The input layer: the inputs of the network are copied to this layer.
• The hidden layer: internal (non-linear) computations are performed here.
• The output layer: the values of the outputs of the network are computed here.
[Figure: Input Layer, Hidden Layer, Output Layer]
Neuron
In a network an individual neuron looks as follows:
[Figure: neuron i. Input links carry activations A_j weighted by W_{j,i}; the input function Σ computes I_i; the activation function g gives the output A_i = g(I_i), which is sent along the output links.]
Learning proceeds by adjusting the weights on the basis of the error on training examples.
Example: the output of a network is 0.9. The desired output is 1.0. Increase the weights that make the output of the network increase. Decrease the weights that make the output decrease.
A linear neural network
The simplest neural network is a linear neural network. It consists of only an input and an output layer.
A linear neural network looks as follows:
[Figure: input units X1, X2, X3 and a bias unit (1), connected through weights w1, w2, w3, w4 to the output unit Y]
A bias unit is used to be able to represent all linear functions. It can be passed along as an extra value (1) in the input vector.
The linear network is viewed as a function mapping from the inputs X1, ..., XN to the output Y:
$Y = \sum_i w_i X_i$
Representation
A linear network can, for example, represent the AND function:
[Figure: the AND function plotted on the X1 and X2 axes]
The following network (the Perceptron) does this (if the output > 0 then Y = 1, otherwise Y = 0):
[Figure: Perceptron with weight 1 from X1, weight 1 from X2 and weight −1.5 from the bias to the output Y]
Learning
An initial network is made with random weights (e.g. between −0.5 and 0.5).
Learning proceeds as follows:
• Define the error as the squared difference between the desired outcome D and the obtained outcome Y for an example X1, ..., XN → D:
$E = \frac{1}{2}(D - Y)^2$
• We now want to compute the derivative of the error E with respect to the weights w1, ..., wN:
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial w_i} = -(D - Y) X_i$
• Now we update the weights with learning rate α > 0 to reduce the error. The Delta learning rule looks as follows:
$w_i = w_i + \alpha (D - Y) X_i$
• We stop when the total error over all training examples is small enough.
Example
Given the training example (0.5, 0.5 → 1).
We make a linear network with initial weights 0.3 and 0.5 and 0.0 (for the bias).
We choose a learning rate, e.g. α = 0.5.
Now we can adjust the weights:
Y = 0.3 ∗ 0.5 + 0.5 ∗ 0.5 + 0.0 ∗ 1.0 = 0.4.
E = 1/2 (1.0 − 0.4)² = 0.18
w1 = 0.3 + 0.5 ∗ 0.6 ∗ 0.5 = 0.45
w2 = 0.5 + 0.5 ∗ 0.6 ∗ 0.5 = 0.65
w3 = 0.0 + 0.5 ∗ 0.6 ∗ 1.0 = 0.30
At a next presentation of the training example the new outcome is:
Y' = 0.45 ∗ 0.5 + 0.65 ∗ 0.5 + 0.3 ∗ 1.0 = 0.85.
Question: Suppose the same training example is presented once more. Compute the new weights.
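The Delta rule for a linear network fits in a few lines. The sketch below uses hypothetical function names and the training example, initial weights and learning rate of the worked example; it applies a single update, so the question above remains open.

```python
def predict(weights, x):
    """Linear network output Y = sum_i w_i X_i (the bias input is part of x)."""
    return sum(w * xi for w, xi in zip(weights, x))

def delta_rule_step(weights, x, d, alpha):
    """Return the weights after one Delta-rule update w_i += alpha (D - Y) X_i."""
    y = predict(weights, x)
    return [w + alpha * (d - y) * xi for w, xi in zip(weights, x)]

initial = [0.3, 0.5, 0.0]   # w1, w2 and the bias weight
x = [0.5, 0.5, 1.0]         # two inputs plus the constant bias input 1
updated = delta_rule_step(initial, x, d=1.0, alpha=0.5)
print(predict(initial, x), updated, predict(updated, x))
```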
Batch vs Stochastic Gradient Descent
There are in principle 2 methods to deal with the data:
• Batch learning: tries to reduce the error for all examples in the training set at once. For this the total gradient is computed and the weights are adjusted in one go:
$E = \frac{1}{2} \sum_p (D^p - Y^p)^2$
So:
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial w_i} = \sum_p -(D^p - Y^p) X_i^p \quad (1)$
• Online learning: adjusts the weights after each training example. It therefore makes stochastic steps in the error landscape (the total error can decrease or increase):
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial Y^p} \frac{\partial Y^p}{\partial w_i} = -(D^p - Y^p) X_i^p$
Usually online learning is used. It can converge one to two orders of magnitude faster (10 to 100 times as fast).
Intuition of the learning process
The error of a training example has a derivative with respect to each weight. Minimize the error by descending along the derivative (gradient).
Example target function: Y = 2X.
Training example p = (1, 2).
Error landscape example:
[Figure: the error E(p) plotted against the weight W, for W between −3 and 3]
Computing optimal weights
We can also use linear algebra to compute the optimal weights for a linear network.
We have $Y = W^T X$ for all examples.
We put the examples in the matrices X and Y as follows:
$Y = [Y^1, Y^2, Y^3, \ldots, Y^N]$ and $X = [X^1, X^2, X^3, \ldots, X^N]$.
Now we have (pseudo-inverse):
$Y = W^T X$
$Y X^T = W^T X X^T$
$Y X^T (X X^T)^{-1} = W^T$
$W^T = Y X^T (X X^T)^{-1}$
Example pseudo-inverse
Example: a linear network with 1 input unit and 1 bias (with constant value 1). Data examples:
(0 → 1)
(1 → 2)
(2 → 3)
$X X^T = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{pmatrix} = \begin{pmatrix} 5 & 3 \\ 3 & 3 \end{pmatrix}$
$(X X^T)^{-1} = \begin{pmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{5}{6} \end{pmatrix}$
$X^T (X X^T)^{-1} = \begin{pmatrix} -\frac{1}{2} & \frac{5}{6} \\ 0 & \frac{1}{3} \\ \frac{1}{2} & -\frac{1}{6} \end{pmatrix}$
$W^T = Y X^T (X X^T)^{-1} = \begin{pmatrix} 1 & 2 & 3 \end{pmatrix} \begin{pmatrix} -\frac{1}{2} & \frac{5}{6} \\ 0 & \frac{1}{3} \\ \frac{1}{2} & -\frac{1}{6} \end{pmatrix} = \begin{pmatrix} 1 & 1 \end{pmatrix}$
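The pseudo-inverse computation for the three data examples can be checked with exact fractions. This is a small sketch using only the standard library; the helper names `matmul`, `transpose` and `inverse_2x2` are hypothetical, and the closed-form 2x2 inverse is used because X X^T is 2x2 in this example.

```python
from fractions import Fraction

def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Columns of X are the examples: first row the input, second row the bias (1).
X = [[Fraction(0), Fraction(1), Fraction(2)],
     [Fraction(1), Fraction(1), Fraction(1)]]
Y = [[Fraction(1), Fraction(2), Fraction(3)]]

XXt = matmul(X, transpose(X))
W_t = matmul(matmul(Y, transpose(X)), inverse_2x2(XXt))
print(XXt, W_t)
```

The result W^T = [1, 1] corresponds to Y = 1 * X + 1, which fits all three examples exactly.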
Limitations of linear networks
[Figure: the X-OR function plotted on the X1 and X2 axes]
A linear network cannot represent the X-OR function. The examples of the X-OR function are not linearly separable.
After the publication of the book Perceptrons by Minsky and Papert (1969), in which these problems were demonstrated, the initial enthusiasm for neural networks disappeared.
A small group of researchers did continue. This led to a number of different neural networks.
In 1986 neural networks became popular again after the invention of the backpropagation algorithm, with which non-linear (multi-layer) feedforward neural networks could also be trained by using the chain rule.
Representation in multi-layer feedforward neural networks
We represent the network as a directed graph. The optimal representation can have an arbitrarily small error for a given target function.
Usually the topology of the network is chosen beforehand.
This does, however, cause a representation error (even the optimal weights in a chosen representation have a certain error).
It also often fails to find the optimal weights (learning error) because of the local minima.
We distinguish 2 steps: forward propagation (use) and backward propagation (learning).
Forward propagation in multi-layer feedforward neural networks
• Clamp the input vector $\vec{a}$
• Compute the summed input of the hidden units:
$i_i = \sum_j w_{ji} a_j + b_i$
• Compute the activation of the hidden units (a different $\vec{a}$):
$a_i = F_i(i_i) = \frac{1}{1 + e^{-i_i}}$
• Compute the activation of the output units s:
$s_i = F_i\left(\sum_j w_{ji} a_j + b_i\right)$
Activation functions
Several activation functions can be used.
Different activation functions are useful for representing a particular function (prior knowledge).
[Figure: three activation functions A plotted against the input I. Linear: $A = I$. Sigmoid: $A = \frac{1}{1 + \exp(-I)}$. Radial basis (Gaussian): $A = \exp\left(-\frac{(m - I)^2}{s^2}\right)$.]
The hidden layer usually uses sigmoid functions or radial basis functions (more local).
The output layer usually uses a linear activation function (so that all functions can be represented).
Backpropagation
Minimize the error function:
$E = \frac{1}{2} \sum_i (d_i - s_i)^2$
by adjusting the weights using gradient descent:
$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial i_i} \frac{\partial i_i}{\partial w_{ji}}$
- Learning rule with learning rate α:
∆wji = −α
• Zal lokaal minimum vinden en niet noodzakelijk globaal minimum. Kan met meerdere restarts toch goed werken.
∂E
= αδi aj
∂wji
• Gebruikt soms een momentum term:
- Output unit:
δi = −
∆wij (t) = γδj xi + µ∆wij (t − 1)
∂E
= (di − si )Fi0 (ii )
∂ii
• Minimaliseert fout over alle trainings voorbeelden, zal het goed generaliseren naar
opvolgende voorbeelden?
- Hidden unit:
δi = Fi0 (ii )
X
δo wio
– Pas op met teveel hidden units →
overfitting
o∈Outputs
Hier is Fi0 (ii ) = (1 − ai )ai , als F de sigmoid
functie is.
– Werkt goed met genoeg voorbeelden:
Vuistregel: aantal leervoorbeelden is
veelvoud van aantal gewichten.
en Fi0 (ii ) = 1, als F de lineaire functie is.
• Leren kan duizenden iteraties duren → traag!
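The forward and backward propagation rules above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: one hidden layer of sigmoid units, linear output units, and hypothetical names (`make_net`, `forward`, `deltas`, `train_step`) and initial weight range chosen for illustration. Here `w_hid[h][j]` is the weight from input j to hidden unit h, and `w_out[o][h]` the weight from hidden unit h to output o.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_net(n_in, n_hid, n_out, seed=0):
    """Small random weights; the range [-0.5, 0.5] is an arbitrary choice."""
    rng = random.Random(seed)

    def rand_row(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    return {
        "w_hid": [rand_row(n_in) for _ in range(n_hid)],
        "b_hid": rand_row(n_hid),
        "w_out": [rand_row(n_hid) for _ in range(n_out)],
        "b_out": rand_row(n_out),
    }


def forward(net, a_in):
    """Forward propagation: summed input, sigmoid hidden units, linear outputs."""
    i_hid = [sum(w * a for w, a in zip(ws, a_in)) + b
             for ws, b in zip(net["w_hid"], net["b_hid"])]
    a_hid = [sigmoid(i) for i in i_hid]
    s = [sum(w * a for w, a in zip(ws, a_hid)) + b
         for ws, b in zip(net["w_out"], net["b_out"])]
    return a_hid, s


def error(net, a_in, d):
    """E = 1/2 * sum_i (d_i - s_i)^2 for one training example."""
    _, s = forward(net, a_in)
    return 0.5 * sum((di - si) ** 2 for di, si in zip(d, s))


def deltas(net, a_in, d):
    """Delta terms for output units (F' = 1) and hidden units (F' = (1 - a) a)."""
    a_hid, s = forward(net, a_in)
    delta_out = [di - si for di, si in zip(d, s)]
    delta_hid = [
        (1.0 - a) * a * sum(delta_out[o] * net["w_out"][o][h]
                            for o in range(len(delta_out)))
        for h, a in enumerate(a_hid)
    ]
    return a_hid, delta_hid, delta_out


def train_step(net, a_in, d, alpha=0.1):
    """One gradient-descent step: delta_w = alpha * delta_i * a_j."""
    a_hid, delta_hid, delta_out = deltas(net, a_in, d)
    for o, do in enumerate(delta_out):
        for h, a in enumerate(a_hid):
            net["w_out"][o][h] += alpha * do * a
        net["b_out"][o] += alpha * do
    for h, dh in enumerate(delta_hid):
        for j, a in enumerate(a_in):
            net["w_hid"][h][j] += alpha * dh * a
        net["b_hid"][h] += alpha * dh


net = make_net(2, 2, 1)
before = error(net, [1.0, 0.0], [1.0])
train_step(net, [1.0, 0.0], [1.0])
print(before, error(net, [1.0, 0.0], [1.0]))  # the error should decrease
```

The deltas are computed before any weight is changed, so the hidden-unit deltas use the old output weights, as the derivation requires.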
Learning as search
Gradient descent on the error landscape works as follows:
[Figure: error landscape, with the error plotted against the state W (the weights).]
Problems:
• Long flat plateaus. If the error landscape is very flat somewhere, learning proceeds very slowly (the gradient is very small).
• Local minima. If the network ends up in a local minimum, it can no longer be improved with gradient descent.
• Learning an optimal network is an NP-hard problem.

More on backpropagation
• Gradient descent over the entire network weight vector.
• Easily generalized to arbitrary directed graphs.
• Using a trained network is fast.

Representation of hidden units
[Figure: a network with inputs, hidden units and outputs.]
Learned hidden layer representation:
Input        Hidden values      Output
10000000  →  .89 .04 .08  →  10000000
01000000  →  .01 .11 .88  →  01000000
00100000  →  .01 .97 .27  →  00100000
00010000  →  .99 .97 .71  →  00010000
00001000  →  .03 .05 .02  →  00001000
00000100  →  .22 .99 .99  →  00000100
00000010  →  .80 .01 .98  →  00000010
00000001  →  .60 .94 .01  →  00000001
(Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997)

Evolution of the learning process
Evolution of the learning process (2)
[Figure: sum of squared errors for each output unit, plotted against the number of training iterations (0 to 2500).]
[Figure: weights from the inputs to one hidden unit, plotted against the number of training iterations (0 to 2500).]

Recurrent neural networks
Feedforward neural networks are used most often; they are excellent for pattern recognition.
Recurrent neural networks are suited to problems in which temporal prediction plays a role (e.g. speech recognition).
Recurrent networks can take previous inputs into account in their prediction of the current state of the system.
Recurrent networks can also be used to infer a complete pattern from a partial pattern (pattern completion).
We distinguish the following recurrent networks:
• Elman networks
• Jordan networks
• Time delay neural networks (TDNN)
• Hopfield networks (Boltzmann machines)

Elman networks
Elman networks feed the activation of the hidden units back to the inputs: good for prediction in which time plays an important role.
[Figure: Elman network with input units and context units, and outputs Y(t-1), Y(t), Y(t+1).]
Learning algorithm: recurrent backpropagation through time.

Jordan networks
Jordan networks feed the activation of the output units back to the inputs: good for prediction in which time plays an important role and in which a sequence of decisions plays a role.
Jordan and Elman networks work about equally well.
They have great difficulty when the gradient through time becomes a very weak signal → the weights are adjusted very slowly.
[Figure: Jordan network with input units and context units.]
Note: learning is often much slower than in feedforward networks. An alternative learning method is evolutionary computation.

Time Delay Neural Networks (TDNN)
TDNNs use inputs from previous time steps for the current prediction.
[Figure: OUTPUT(T) computed from INPUTS(T-m) .... INPUTS(T-1) and INPUTS(T).]
They have problems with the Markov order (m):
• How many previous inputs have to be supplied?
• Inputs seen longer ago can never be taken into account in the decision.
• Sometimes results in a very large network.
They have no problems with a vanishing gradient (e.g. 100000001 → 1 and 000000001 → 0).

Hopfield network
Autoassociative networks (Hopfield network, Boltzmann machine): a kind of memory for storing patterns; good for pattern completion.
Learning rules strengthen the connections between inputs that are on at the same time. When a partial pattern is presented, inputs that often occur together with other inputs will also switch on.
The newly activated inputs can in turn activate other inputs.
[Figure: fully connected network of units A1 to A6 with weights W.]
Symmetric weights / asymmetric weights. They resemble Bayesian belief networks.

Advantages and disadvantages of neural networks
Disadvantages:
• Often end up in local minima.
• No direct way of dealing with missing values.
• The learning process is sometimes very slow.
• Sometimes the network forgets learned knowledge when it is trained on new knowledge (learning interference).
• It can take a lot of experimentation time to find a good topology and good learning parameters.
• It is not so easy to put a-priori knowledge into a network.
• Learning an optimal neural network is an NP-hard problem.
Advantages:
• Can represent all functions (to arbitrary accuracy).
• Can deal well with noise.
• Can deal well with redundancy.
• Can deal well with high-dimensional input spaces.
• Can approximate continuous functions directly.
• Is robust against the loss of neurons → graceful degradation.
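The Hopfield learning rule described earlier (strengthen the connections between inputs that are on together, then let a partial pattern switch on the rest) can be sketched as follows. This is a minimal sketch under assumptions that the slides do not state: units take the values +1 (on) and -1 (off), weights are symmetric, and the names `train_hopfield` and `recall` are hypothetical.

```python
def train_hopfield(patterns):
    """Hebbian rule: raise w[i][j] when units i and j have the same state."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    w[i][j] += p[i] * p[j]
    return w


def recall(w, state, sweeps=5):
    """Update the units one by one until the pattern has settled."""
    state = list(state)
    n = len(state)
    for _ in range(sweeps):
        for i in range(n):
            total = sum(w[i][j] * state[j] for j in range(n))
            if total != 0:
                state[i] = 1 if total > 0 else -1
    return state


stored = [1, 1, 1, -1, -1, -1]
w = train_hopfield([stored])
partial = [1, 1, -1, -1, -1, -1]  # third unit is wrong
print(recall(w, partial))  # → [1, 1, 1, -1, -1, -1]
```

With a single stored pattern the corrupted unit is pulled back by the other five; with many stored patterns recall can fail, which this sketch does not explore.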
Slides for the course Introduction to Adaptive Systems: Co-evolution. M. Wiering

Co-evolution
No biologist doubts that evolution has occurred, because there is enough directly observed evidence for it.
The current debate is about how evolution came about and which mechanisms play a role in it.
In this lecture we cover the following aspects:
• Natural selection
• Co-evolution
• Replicator dynamics
• Daisyworld
• Gaia hypothesis
• Recycling networks
• Co-evolution for optimization

Evolution by natural selection
In Darwin's evolutionary theory, survival of the fittest plays the most important role in explaining the evolution of organisms.
Consider a world populated with organisms that have a considerable rate of reproduction.
As long as conditions are good the population will grow, but at some point there are too few resources or too little space for all organisms to reproduce.
Therefore only certain individuals will reproduce, and the question is which ones.
Organisms differ because they have different genes. These differences give certain organisms an advantage in dealing with the environment.
Organisms with advantages have a greater chance of reproducing, and therefore there will be more offspring with these genetic properties.

Evolution in a computer
Because evolution by natural selection is a mechanism, we can program it in a computer program.
A well-known example of a program that uses artificial evolution by means of natural selection is the genetic algorithm.
It uses a fitness function that is defined by the programmer.
The fitness function determines (indirectly) how many offspring an individual can generate.
Although GAs are suitable for optimization purposes, they do not resemble natural evolution perfectly.
The fitness function has to be defined, and in natural evolution there is nobody who determines the fitness function.

Co-evolution
In reality the fitness of an individual depends on its environment, including the other species that occur in it.
Such a fitness function is therefore non-stationary: it changes over time, since the sizes of the different populations change.
We have previously looked at 2 different models for studying the dynamics of interacting species:
• With differential equations (mathematical rules that specify how certain variables change)
• With cellular automata
A well-known example of the first are the Lotka-Volterra equations.
We can also generalize the Lotka-Volterra equations to multiple organisms.

Replicator Dynamics
Let us first look at a model in which the fitness of an organism (phenotype) is given.
The replicator equation describes the behaviour of a population that is divided into n phenotypes $E_1$ to $E_n$.
We denote the relative frequencies by $x_1$ to $x_n$, specified by the vector $\vec{x} = (x_1, x_2, \ldots, x_n)$ (here $\sum_i x_i = 1$).
The fitness of phenotype $E_i$ is denoted by $f_i(\vec{x})$.
The average fitness of the population is:
$\hat{f}(\vec{x}) = \sum_{i=1}^{n} x_i f_i(\vec{x})$

Replicator dynamics, continued
The rate of increase of the frequency of $E_i$ (relative to its current frequency) is equal to the difference between the fitness of $E_i$ and the average fitness of the population:
$\frac{1}{x_i} \frac{\partial x_i}{\partial t} = f_i(\vec{x}) - \hat{f}(\vec{x})$
This gives the replicator equation with adaptation rate α:
$\Delta x_i = \alpha x_i \left(f_i(\vec{x}) - \hat{f}(\vec{x})\right)$
If the fitness values of the phenotypes differ, the vector of relative frequencies $\vec{x}$ changes.
If the environment does not change and the fitness functions also stay the same (constant selection), the phenotype with the highest fitness will take over the whole population.
These assumptions are of course unrealistic: the environment and the fitness functions will change as a result of the selection.

Co-evolution with replicator dynamics
We can also let the fitness of a phenotype depend on the other phenotypes that are present:
$f_i(\vec{x}) = \sum_{j=1}^{n} a_{ij} x_j$
Here $a_{ij}$ is the fitness of $E_i$ in the presence of $E_j$.
We see that phenotypes can decrease or increase the fitness of other phenotypes.
It may be that $a_{ij}$ and $a_{ji}$ are both positive and large. The consequence is that the two cooperate so that both increase in frequency.
In this way certain mutually cooperating phenotypes can arise. These groups can in turn compete with other groups of phenotypes.
Question: What kind of values do $a_{ij}$ and $a_{ji}$ have for predator-prey phenotypes?

Daisyworld
In 1983 Lovelock presented the Daisyworld model, which he built to explore the relation between organisms and their environment.
Daisyworld is a computer model of an imaginary planet on which black and white daisies live.
Daisies can change their environment, grow, reproduce, and die.
There is one global variable: the temperature of the planet, which slowly increases because of an imaginary sun.
White daisies have a favourite temperature at which they grow fastest, and this temperature is higher than the favourite temperature of black daisies.
Therefore, when the temperature starts to become fairly high, the population of white daisies will grow faster than the population of black daisies.

Daisyworld, continued
White daisies, however, reflect the sun and cool the planet. So when their number increases, the temperature of the planet will decrease.
Black daisies absorb the heat of the sun and therefore raise the temperature of the environment.
Daisies change the environment, and the environment has an impact on the population growth of the daisies.
Growing numbers of white daisies cool the planet, which becomes favourable for the growth of black daisies.
The population of black daisies will therefore increase and the temperature will rise again.
It is the rise and fall of the temperature of the planet that connects the two kinds of daisies with their environment.
So there is a self-regulating feedback loop.
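The replicator equation from the replicator-dynamics slides, with the fitness $f_i(\vec{x}) = \sum_j a_{ij} x_j$, can be sketched as a simple update loop. This is a minimal sketch; the function name `replicator_step`, the interaction matrix and the adaptation rate are illustrative assumptions. The matrix below gives every phenotype a fixed fitness (constant selection), so the fittest phenotype should take over.

```python
def replicator_step(x, a, alpha=0.1):
    """One update: delta x_i = alpha * x_i * (f_i(x) - average fitness)."""
    n = len(x)
    f = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    f_avg = sum(x[i] * f[i] for i in range(n))
    return [x[i] + alpha * x[i] * (f[i] - f_avg) for i in range(n)]


# Constant selection: row i holds the same value everywhere, so f_i does not
# depend on x (because the frequencies sum to 1).
a = [[1.0, 1.0, 1.0],
     [2.0, 2.0, 2.0],
     [3.0, 3.0, 3.0]]
x = [1 / 3, 1 / 3, 1 / 3]
for _ in range(500):
    x = replicator_step(x, a)
print(x)  # the third phenotype ends up with almost the whole population
```

Replacing `a` with a matrix in which two phenotypes give each other large positive values is one way to explore the cooperating groups mentioned in the text.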
CA model for Daisyworld
We can use a CA as a spatial model of Daisyworld. Each cell can contain a (white or black) daisy or a (white or black) daisy seed.
In addition, each cell has a temperature. Every cycle the temperature of each cell is raised by 1 degree.
If white daisies do not cool the average temperature, the temperature will reach 100 degrees everywhere and all life will die.

CA model for Daisyworld, continued
Black daisies have the best chance of surviving at 40 degrees, white daisies at 60 degrees. For every 20 degrees away from that, the survival chance drops by 50%.
Black daisies warm all 49 cells around them by 3 degrees. White daisies cool the 49 cells by 3 degrees.
White daisies produce 6 seeds at random locations in their neighbourhood of 25 grid cells, with the highest probability (40%) at 60 degrees (and black daisies at 40 degrees).
Daisy seeds have a 10% chance of dying each cycle. White (black) seeds become white (black) daisies with the highest probability at 60 (40) degrees.
[Figure: picture of the Daisyworld CA.]

Natural selection in Daisyworld
In Daisyworld there is competition (and therefore natural selection) because the daisies both fight for limited resources (namely the available space).
Mutation is a random change in the genotype of an organism.
Such a change can result in a small difference in the colour of the daisy, which means a change in the daisy's absorption of heat.
In general a mutation can be good for an organism, although most mutations are harmful or neutral.
However, even if a mutation is beneficial in only 1 in a million cases, it will be able to propagate quickly through the population.

Self-regulation and natural selection
The most interesting aspect of Daisyworld, however, is the self-regulation, which occurs at a higher level than natural selection.
The self-regulation is beneficial for all individuals because it keeps the temperature at a level that makes life possible.
Because the self-regulation is beneficial for all individuals, one might think that self-regulation exists because of natural selection.
In Daisyworld, however, the self-regulation is not involved in any form of competition or reproduction.
We can say, however, that natural selection prefers daisy properties that lead to a self-regulating environment.
Question: What is the connection between self-regulation and the fitness function?

Gaia Hypothesis
James Lovelock conceived the Gaia hypothesis while investigating whether there is life on Mars.
He realised that it was not necessary to visit Mars; only a small study of the atmosphere of Mars was needed.
Because the atmosphere of Mars is in chemical equilibrium, there can be no activity or life on Mars.
On Earth, however, there is a chemical non-equilibrium; our atmosphere consists of many gases that can react with each other, yet remain present in different proportions.
The Gaia theory says that this non-equilibrium is a result of self-regulation.
The idea that Gaia is a self-regulating system led to the idea that Gaia itself is a living organism.
[Figure: the Greek goddess Gaea.]

Three forms of the Gaia Hypothesis
There are three forms of the Gaia Hypothesis:
• Co-evolutionary Gaia is a weak form of Gaia. It states that life determines the environment and that this feedback coupling between living beings and the environment can shape the evolution of both.
• Geophysiological Gaia is the strong form of Gaia. It states that the Earth itself is a living organism and that life itself optimizes the physical and chemical environment.
• Homeostatic Gaia lies in between. It states that the interaction between organisms and the environment is dominated by positive and negative feedback loops that stabilize the global environment.
Lovelock himself says that his theory requires a reaction of the system, but no consciousness, planning, or intention.

Examples of Gaia processes
• Oxygen: Lovelock shows that Gaia works to keep the oxygen level in the atmosphere high. The atmospheres of Venus and Mars contain only 0 and 0.13 percent free oxygen.
• Temperature: The average ground temperature of the Earth has been between 10 and 20 degrees Celsius for more than 120 million years. On Mars the temperature varies much more every day.
• Carbon dioxide: The stability of the Earth's temperature is safeguarded by varying amounts of carbon dioxide in the atmosphere. The reduction in absorbed solar radiation comes from a reduction of carbon dioxide in those periods.
[Figure: increasing temperature of the Earth.]

Recycling networks
If we have several co-evolving species in an environment, they can also interact with resources that are present in the environment, such as chemical compounds.
The ecology recycles the molecules.
An example is when we put plants and mammals together in an environment and make a simplified model:
• Plants convert CO2 into C and O2 molecules.
• Mammals convert C and O2 into CO2 molecules.
• External reactions convert C and O2 into CO2.
• Mammals can eat plants and thereby gain mass in C.
We can implement this model in a cellular automaton consisting of plants, mammals, and molecules.
For clarity we can use 2 CAs that interact with each other; 1 indicates where the plants and animals are, 1 indicates where the molecules are.

CA for recycling networks
[Figure: two interacting CA grids, one showing plants (P) and mammals (Z), the other showing O2 and CO2 molecules.]
Note that we also have to model the amount of C in plants (P) and mammals (Z). So we have not shown all states above. An internal state variable is needed for this.
We also have to set up rules that let plants and mammals reproduce and die. Furthermore, mammals must be able to move and, if need be, go looking for plants.

Recycling in the model
What do we see now:
• Without plants, all (free) C and O2 molecules are converted into CO2. So an equilibrium arises.
• If there are very many plants, the amount of O2 grows. As a result there will be little CO2 left for the plants.
• If there are both plants and mammals, they recycle the molecules. This ensures that more plants and mammals can exist side by side.
• If there are too many mammals, all plants are eaten quickly. But when few plants are left, the mammals will die sooner.
Recycling occurs in many co-evolutionary systems and ecologies. For example, in rainforests it provides an efficient way of dealing with limited amounts of H2O.

Co-evolution for optimization
We have already become acquainted with genetic algorithms, which can be used for optimization purposes.
Some researchers try to make a GA more efficient by using co-evolution.
Suppose the problem requires solving N tasks (e.g. sorting different sequences of numbers).
As the fitness function you can now use how many of the N tasks are solved by an individual.
The problem with this is that a lot of time goes into the evaluation every time.
Furthermore, it is possible that the best individuals keep solving 0.7N tasks, but no more. There is then no good direction for the next evolution step.

Co-evolution for optimization, continued
A possible solution is to use co-evolution between problem instances (parasites) and solvers (the individuals).
There are K parasites, each containing a small subset of the N problem instances (this makes it efficient).
There are also K individuals, each of which is tested on a particular chosen parasite.
The fitness of the individual is higher the better the individual scores on the test problems generated by the assigned parasite.
The fitness of the parasite is higher the worse the individual scores on its test problems.
In this way parasites and individuals co-evolve. Parasites make the tasks harder and harder, while the individuals have to score better and better on the harder tasks.

Spatial co-evolution for optimization
Some researchers use a spatial structure (a CA) in which the individuals and the parasites co-evolve.
The propagation of an individual is then local, and so is that of the parasites, so that an individual does not come into contact with totally different parasites too quickly.
[Figure: a grid of individuals (I) next to a grid of parasites (P).]

Differentiating parasites
A problem can be that all individuals score badly because the parasites have developed into problems that are too hard.
Therefore it is better to let the fitness of a parasite depend on the extent to which it can differentiate between different individuals.
For example, if a parasite is solved by all individuals, or by no individual at all, it has a low fitness.
If two parasites make the same distinctions between the individuals in the population, the fitness of 1 of them can also be lowered.

Pareto front in co-evolutionary GA
Individuals can be dominated by other individuals if all parasites that they solve are also solved by at least 1 other individual.
Suppose we call the fitness of individual i on parasite j: $f_i(j)$
If the following holds:
$\mathrm{dominates}(k, i) = \forall j:\ f_i(j) \le f_k(j)\ \wedge\ \exists l:\ f_i(l) < f_k(l)$
then i is dominated by individual k.
The Pareto front consists of all individuals that are not dominated at all.
Now we can evolve by using the individuals that are in the Pareto front.
This is also a good method for multi-objective optimization, i.e. when there are several different fitness values for an individual.

Conclusion
We have looked at natural selection and the problem of the fitness function.
In ecologies the fitness of an organism is determined by the other organisms that are present. This is a case of co-evolution.
The different organisms can also cooperate, so that a self-regulating system can arise.
The (weak) Gaia hypothesis says that life determines the environment and that this feedback coupling between living beings and the environment shapes the evolution of both.
In certain ecologies organisms need each other to recycle resources. Recycling networks can then arise.
Optimization algorithms can also make good use of co-evolution.
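The dominance relation and the Pareto front from the co-evolutionary GA slides can be sketched directly from the definition. This is a minimal sketch; the score matrix is invented for illustration, with `f[i][j]` standing for the fitness $f_i(j)$ of individual i on parasite j, and the function names are hypothetical.

```python
def dominates(f, k, i):
    """True if k dominates i: f_i(j) <= f_k(j) for all j, and < for some l."""
    no_worse = all(fi <= fk for fi, fk in zip(f[i], f[k]))
    strictly_better = any(fi < fk for fi, fk in zip(f[i], f[k]))
    return no_worse and strictly_better


def pareto_front(f):
    """Indices of all individuals that are not dominated by anyone."""
    n = len(f)
    return [i for i in range(n)
            if not any(dominates(f, k, i) for k in range(n) if k != i)]


# Rows: individuals, columns: parasites (1 = solved, 0 = not solved).
scores = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],  # dominated by individual 0
    [1, 0, 0],  # dominated by individuals 0 and 1
]
print(pareto_front(scores))  # → [0, 1]
```

Individuals 0 and 1 each solve a parasite the other does not, so neither dominates the other and both stay in the front.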
Slides for the course Introduction to Adaptive Systems: Decision trees. M. Wiering

Learning objectives
• Understand decision trees and be able to use them to determine a classification
• Be able to calculate how decision trees are built
• Be able to calculate the entropy of a probability distribution
• Know when decision trees can be applied

Decision trees
Decision trees can represent functions with discrete outputs.
They have been used for some time (since 1986) for data mining and supervised learning.
Example: you see the characteristics (features) of a number of mushrooms (colour, smell, size) and are told which ones are poisonous:
Colour   Smell    Size     Poisonous?
green    sickly   medium   yes
white    spicy    small    no
brown    sickly   medium   no
green    none     large    yes
green    spicy    small    no
brown    sickly   large    no
white    spicy    medium   no
green    none     small    yes
white    none     small    no
green    sickly   large    yes

Example decision tree
We want a function that maps (Colour, Smell, Size) to (Poisonous).
For this we can generate the following consistent decision tree:
Colour
  green → Smell
            none   → Yes
            spicy  → No
            sickly → Yes
  white → No
  brown → No
To determine the classification you simply walk down the path of the tree until you arrive in a leaf.
Since each attribute occurs at most once on a path, you have the classification after at most M (the number of attributes) steps.

Learning functions with decision trees
We can use attributes in two different ways.
(1) We can generate subtrees for all discrete values of an attribute.
(2) We can use greater than or less than on certain numerical attributes such as age.
A decision tree can represent all Boolean functions:
[Figure: a complete tree that tests Attr. 1 at the root and Attr. 2 below it; the depth is the number of attributes and the leaves hold the classification.]

Representation of functions
With M binary attributes there are in this way $2^{2^M}$ functionally different decision trees (there are even more, because attributes can be tested in different orders).
Example: for M = 5 there are about $4 \cdot 10^9$ decision trees.
In principle a set of examples can be copied directly into paths in the tree.
This is of no use to us, however, because we cannot generalize with it. (Compare Ockham's razor.)
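Walking the example mushroom tree from the root to a leaf, as described for the example decision tree, can be sketched as follows. This is a minimal sketch: the nested-tuple representation and the name `classify` are assumptions made for illustration, and the attribute values are English renderings of Kleur, Geur and Formaat from the table.

```python
# An inner node is (attribute, {value: subtree}); a leaf is "yes" or "no".
TREE = ("colour", {
    "green": ("smell", {"none": "yes", "spicy": "no", "sickly": "yes"}),
    "white": "no",
    "brown": "no",
})

# The ten mushrooms from the table: (colour, smell, size, poisonous?).
EXAMPLES = [
    ("green", "sickly", "medium", "yes"),
    ("white", "spicy", "small", "no"),
    ("brown", "sickly", "medium", "no"),
    ("green", "none", "large", "yes"),
    ("green", "spicy", "small", "no"),
    ("brown", "sickly", "large", "no"),
    ("white", "spicy", "medium", "no"),
    ("green", "none", "small", "yes"),
    ("white", "none", "small", "no"),
    ("green", "sickly", "large", "yes"),
]


def classify(tree, example):
    """Follow the branch for each tested attribute until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


for colour, smell, size, label in EXAMPLES:
    example = {"colour": colour, "smell": smell, "size": size}
    print(example, classify(TREE, example), label)
```

The tree never tests the size, and each path tests an attribute at most once, so a classification takes at most two steps here.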
Some functions cannot be represented efficiently.
An example is the parity function (an even number of attributes is set to 1):
[Figure: parity decision tree; every path tests all attributes 1 to 4 and ends in a leaf 1 or 0.]
Figure: Parity decision tree
This requires $2^M$ leaves and $2^M - 1$ nodes.
The same holds for the majority function:
If most attributes are 1, the answer is 1; otherwise the answer is 0.
It cannot be represented efficiently:
[Figure: majority decision tree over attributes 1 to 4.]
Figure: Majority decision tree
You have to test at least M/2 attributes to determine the answer. ⟹ More than $2^{M/2}$ leaves.

Exponentially large functions
Exponentially large functions are unavoidable.
There are $2^{2^M}$ different functions from $\{0, 1\}^M$ to $\{0, 1\}$.
So on average you need about $2^M$ bits to represent a function.
In any notation, at least half of the functions have a representation of at least $2^M - 1$ bits.
This also holds for propositional notation:
$(x_1 \wedge x_2) \vee \neg(x_5 \wedge (x_3 \wedge \neg x_1))$.
Such functions are therefore not well learnable, because we want a small representation so that we can generalize over new examples.

Inducing decision trees
A logical example is given by the values of the properties (attributes) and the value of the goal predicate.
Given the following examples:
Ex.  Alt  Bar  Fri  Hun  People  Rain  Res  Type     Est    Wait
x1   Yes  No   Yes  No   Some    No    Yes  French   0-10   Yes
x2   Yes  No   No   Yes  Full    No    No   Thai     30-60  No
x3   No   Yes  No   No   Some    No    No   Burger   0-10   Yes
x4   Yes  No   Yes  Yes  Full    No    No   Thai     10-30  Yes
x5   Yes  No   Yes  No   Full    No    Yes  French   >60    No
x6   No   Yes  No   Yes  Some    Yes   Yes  Italian  0-10   Yes
x7   No   Yes  No   No   None    Yes   No   Burger   0-10   No
A trivial solution copies each example into a path in the decision tree.
This is of little use, because the point is to generalize to examples that are unknown.
We must not only find a decision tree that is consistent with the examples, but also the smallest possible one (Ockham's Razor).

The learning algorithm
The algorithm chooses the most important attribute first. Most important here means the one that makes the largest difference in the classifications of the examples.
Split on People?
  None: + (none), - x7
  Some: + x1, x3, x6, - (none)
  Full: + x4, - x2, x5
Split on Type?
  French:  + x1, - x5
  Italian: + x6, - (none)
  Thai:    + x4, - x2
  Burger:  + x3, - x7
• If there are positive and negative examples, choose the best attribute to split them.
• If all remaining examples are positive or all are negative, we are done and return yes or no.
• If there are no examples left, we return the majority decision of the parent node.
• If there are still positive and negative examples but no attributes left, the data is incorrect (there is noise). We then choose the majority decision.
Example of a constructed decision tree
If we now continue building the decision tree, we get the following result:
People?
  None: No (- x7)
  Some: Yes (+ x1, x3, x6)
  Full: WaitEstimate?
          >60:   No (- x5)
          30-60: No (- x2)
          10-30: Yes (+ x4)
          0-10:  no examples
Suppose we have the following data:
Ex.  A  B  C  Output
x1   1  0  0  0
x2   1  1  0  1
x3   0  1  0  0
x4   0  0  0  0
Question: Which decision tree would the algorithm generate for the data above?

Choosing the split attribute
The most important function in inducing a tree is the choice of split attributes.
We want the attribute that discriminates between the classifications as much as possible.
To calculate the value (gain) of an attribute we use information theory.
Information theory is concerned with determining the number of bits needed to encode something (in this case positive and negative examples).
Example: if a coin is tossed and we want to represent the outcome, we need 1 bit (heads = 1, tails = 0).

Information theory
The unit of information is the number (n) of bits.
Take (0 or 1): 2 possibilities, 1 bit needed.
Take (00, 01, 10, 11): 4 possibilities, 2 bits needed.
Take a choice out of 8 possibilities: 3 bits needed.
A choice out of $k = 2^n$ possibilities → n times a choice out of 2 possibilities. Then $\log_2(k) = n$ bits are needed.
A choice out of k possibilities needs $\log_2(k)$ bits. This does not have to be an integer.
A choice out of 3 possibilities → $\log_2(3) \approx 1.58$ bits.
Total: a choice out of 63 possibilities → adding up gives $\log_2(63) \approx 5.98$ bits.

Information theory for probabilities
Something that happens with probability 1 in n ≈ a choice out of n possibilities, and has an information content of $\log_2(n)$ bits.
Something that happens with probability P has an information content of $\log_2(\frac{1}{P}) = -\log_2(P)$ bits.
If there are n possible events with respective probabilities $P_1, P_2, \ldots, P_n$ ($\sum_i P_i = 1.0$), then the expected information content is:
$P_1(-\log_2 P_1) + P_2(-\log_2 P_2) + \ldots + P_n(-\log_2 P_n) = \sum_{i=1}^{n} P_i(-\log_2 P_i)$
Here $0 \log_2 0 = 0$.
This value is also called the entropy (H).

Examples
Fair coin: n = 2, $P_1 = P_2 = \frac{1}{2}$
Entropy (H) $= -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} = \frac{1}{2} + \frac{1}{2} = 1$
Unfair coin, for example a throw of a die, 6 or not 6 → n = 2, $P_1 = \frac{1}{6}$, $P_2 = \frac{5}{6}$
Entropy (H) $= -\frac{1}{6} \log_2 \frac{1}{6} - \frac{5}{6} \log_2 \frac{5}{6} = 0.65$.
This last value is lower, so there is also less uncertainty.
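The entropy formula and the two coin examples can be checked with a few lines of Python. This is a minimal sketch; the function name `entropy` is an illustrative choice, and zero probabilities are skipped to implement the convention $0 \log_2 0 = 0$.

```python
import math


def entropy(probabilities):
    """Expected information content H = sum_i P_i * (-log2 P_i), in bits."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


print(entropy([1 / 2, 1 / 2]))            # fair coin: 1.0
print(round(entropy([1 / 6, 5 / 6]), 2))  # six or not six: 0.65
print(round(math.log2(3), 2))             # choice out of 3: 1.58
```

A certain event, `entropy([1.0, 0.0])`, gives 0 bits: there is nothing left to encode.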
Entropy for n = 2
[Figure: entropy H for n = 2 as a function of the probability P, with P = 1/6, 1/2 and 5/6 marked; H reaches its maximum of 1 at P = 1/2.]

Using entropy to choose attributes
Choose-attribute(attributes, training examples)
Each attribute results in new sets of positive and negative clusters (in the new subtrees/leaves).
For 1 cluster with p positive examples and n negative examples, the entropy (expected information content) is:
$-\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$
Example: mushrooms. At the start there are 4 positive and 6 negative examples.
Entropy $= -\frac{4}{10} \log_2 \frac{4}{10} - \frac{6}{10} \log_2 \frac{6}{10} = 0.97$

Choosing an attribute
Now we have to choose 1 of the 3 attributes. For example: size.
Of the 10 there are:
3 large, of which 2 positive and 1 negative.
3 medium, of which 1 positive and 2 negative.
4 small, of which 1 positive and 3 negative.
Intuitively: a bit of everything, so not such a good choice.
So the entropy will be fairly large.
The (remaining) entropy is the average entropy of the subsets:
$\mathrm{Rest}(\mathit{size}) = \frac{3}{10} E(2, 1) + \frac{3}{10} E(1, 2) + \frac{4}{10} E(1, 3)$
Here $E(p, n) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$.

Working out the mushroom example
If we now look at colour, that gives the following:
5 green, of which 4 positive and 1 negative
3 white, all negative
2 brown, both negative
$\mathrm{Rest}(\mathit{colour}) = \frac{5}{10} E(4, 1) + \frac{3}{10} E(0, 3) + \frac{2}{10} E(0, 2)$
Calculating everything gives: 0 < Rest(colour) = 0.36 < Rest(smell) = 0.68 < Rest(size) = 0.88 < 1.
So Choose-attribute first returns the colour:
Colour
  green → continue recursively here
  brown → 0
  white → 0

The general Choose-attribute function
1) Compute for each attribute a ∈ attributes:
$\mathrm{Rest}(a) = \sum_{i=1}^{k} \frac{p_i + n_i}{M} E(p_i, n_i)$
where $a_1, \ldots, a_k$ are the attribute values for a,
$p_i$ = number of positive examples with a-value $a_i$
$n_i$ = number of negative examples with a-value $a_i$
M = total number of examples
Choose-attribute chooses the a for which Rest(a) is minimal.
Equivalently: choose the a for which
$\mathrm{Gain}(a) = E(p, n) - \mathrm{Rest}(a)$ is maximal.
Gain(a) is always ≥ 0 and is called the expected information gain.

Finding the smallest decision tree
Decision tree algorithms (ID3, C4.5) compute Rest(a) for all attributes and choose the attribute with minimal Rest(a).
Does the algorithm always learn the smallest decision tree?
No, although in practice fairly small and usable decision trees are found.
Voorbeeld: 3 binaire attributen (stip, hoog, breed),
6 leervoorbeelden
• Worp dobbelsteen (1, 2, 3, 4, 5, 6)
• Worp munt (kop, munt)
Classificatie 1
Classificatie 0
• Weer (zonnig, bewolkt, regen)
Doel: voorspellen kop/munt aan de hand van
andere attributen.
Het algoritme zal een bijna consistente beslisboom vinden.
Toch is beslisboom niet bruikbaar.
Kleinste, consistente boom:
Cross validatie
Stip
nee
ja
Hoog
Na het induceren van een beslisboom op de leervoorbeelden gaan we deze testen op de verzameling testvoorbeelden.
Breed
ja
nee
ja
nee
1
0
1
0
Als we de rare beslisboom gaan testen op de
testvoorbeelden komt er een lage (random) score
uit.
Het bouwen van een grote beslisboom zorgt vaak
voor een goede score op leervoorbeelden en slechte score op testvoorbeelden (overfitting)
Er zijn wel meer consistente beslisbomen, maar
die hebben allemaal meer dan 3 knopen, meer
dan 4 bladeren en diepte groter dan 2.
Gegeven een hypothese ruimte H. Een hypothese h overfits de leervoorbeelden, als er een
andere hypothese h0 is zodat h een kleinere fout
heeft over de leervoorbeelden, maar h0 een kleinere fout heeft over de hele verzameling voorbeelden.
Het gewenste attribuut Stip wordt echter niet
door het algoritme gekozen, want :
gain(stip = 0)
gain(hoog) = gain(breed) = 0.044
Het algoritme vindt dus niet de kleinste consistente beslisboom.
Precisie
Het vinden van de kleinste consistente beslisboom is ook een NP-moeilijk probleem.
Leervoorbeelden
0.9
Testvoorbeelden
0.7
Ander probleem: leren van X-or functie (00 =
0, 01 = 1, 10 = 1, 11 = 0).
0.5
0
Alle attributen hebben gain(x) = 0.
20
40
60
Grootte van boom
Dan wordt er maar een attribuut gegokt, waarna
er verder geleerd kan worden.
Tegengaan van overfitting
Een raar experiment
Twee benaderingen:
Stel iemand verzamelt de volgende data gedurende enkele weken:
• Stop het groeien van de boom voordat het
de leervoorbeelden precies classificeert.
• Dag van de week (zo, ma, di, wo, do, vr,
za)
• Sta toe dat de beslisboom de leervoorbeelden overfits, maar snoei daarna de geleerde beslisboom (meer succesvol).
• Uur van de dag (9, 10, 11, . . ., 20)
• Kleur stoplicht (groen, oranje, rood)
5
Pruning the decision tree can be done in the following ways:
• Use the set of test examples to evaluate the utility of pruning branches (simple), after the whole decision tree has been learned.
• Use all examples for learning, but use a chi-square test to evaluate whether a particular node yields an improvement beyond the training examples.

Extensions of decision trees
• Missing data:
– Replace by the most frequent value.
– Better: use probabilities.
• Attributes with a price tag
– Put the price into the gain criterion (e.g. use Gain/Cost).
• Attributes with many values
– Replace gain by gain ratio.
• Attributes with numeric values
– Make a binary choice between a < A and a ≥ A. Choose A such that the gain is maximal.
– Extension: oblique decision trees → choice based on linear combinations of numeric attributes.
• Multivariate tests → choose on the basis of several attributes at once.
• Windowing
– Learn on a subset of the training examples, add the misclassified examples and learn again.
– Produces several different decision trees. A majority vote over all learned trees can be used.

Practical use of decision trees
• Designing oil platform equipment
– In 1986 BP used an expert system called GASOIL.
– This system could design systems that separate oil from gas.
– It was the largest commercial expert system, with 2500 rules.
– Building such a system by hand would have cost 10 person-years. Learning it took only 100 days.
– The system saves the company many millions of guilders per year.
• Learning to fly
– Learning to fly a Cessna on a flight simulator.
– The data were generated by observing the flying behaviour of an experienced human pilot.
– This produced 90,000 examples, each described by 20 variables.
– The learned system could fly better than its teachers!

Further applications
• Induction: try to predict the classification of examples outside the given data set.
• Data mining: try to find rules that classify certain features from the values of other features. The target attribute can be anything.

Finally: several equivalent decision trees can be built for the same data set:
[Figure: the same set of o and x data points partitioned by different, equivalent decision trees]

Discussion
• Decision trees can be used for supervised learning with discrete output values.
• Thanks to the greedy recursive splitting algorithm they are often very fast (+)
• The learned hypothesis is easily readable for domain experts (+)
• The algorithm must be used with care to counter overfitting (-)
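The XOR remark in the decision tree slides (every attribute has gain 0) can be verified directly. The sketch below is only an illustration: the helper names and the encoding of examples as `((x1, x2), label)` pairs are assumptions, not part of the slides.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    frac = sum(labels) / len(labels)
    return -sum(f * log2(f) for f in (frac, 1 - frac) if f > 0)

def gain(examples, attr):
    """Gain(a) = E(p, n) - Rest(a) for the attribute at index attr."""
    labels = [y for _, y in examples]
    rest = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        rest += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - rest

# The XOR function: 00 = 0, 01 = 1, 10 = 1, 11 = 0
xor_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
gains = [gain(xor_examples, a) for a in (0, 1)]
print(gains)
```

Splitting on either input leaves two subsets that are each half positive and half negative, so neither attribute looks informative on its own; only after a first (guessed) split does the second attribute separate the classes.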
Slides for the course Introduction to Adaptive Systems: Introduction to Machine Learning. M. Wiering

Learning Machines

What is the learning problem?
Learning = improving at performing a task through experience
• Improve at task T,
• With respect to performance measure P,
• Based on experience E.
Example: learning to play chess
• T: play chess
• P: % of games won in a tournament
• E: the opportunity to play N games against itself
Example: learning to recognise speech
• T: convert speech into text
• P: % of correctly classified words
• E: listening to known text spoken by a speaker
Question: what are T, P and E for learning to play the piano?

Why do we apply learning?
• It can be a lot of work to elicit knowledge and enter it into a computer. Example: an expert system that classifies trees.
• For certain domains a human has insufficient knowledge to program everything out. Example: speech recognition.
• An environment can be completely unknown a priori. Example: a robot that has to perform a task in an unknown environment.
• Human knowledge can be imperfect. Example: building a chess program: what does the evaluation function look like?
• Continuous automatic adaptation to the user: learning search programs, intelligent information retrieval.

Example: you see the characteristics (features) of a number of animals and are told which animal each one is.
You are interested in learning a concept that distinguishes elephants from other animals.
An animal is given as [Weight, Colour, Number of legs, Length]
Now we have a number of (training) examples:
[16kg, brown, 4, 37cm, Dog]
[560kg, brown, 4, 200cm, Horse]
[5100kg, grey, 4, 310cm, Elephant]
[5420kg, grey, 4, 320cm, Elephant]
Which concepts would distinguish the elephant from the rest?
[?, grey, ?, ?, Elephant]
[> 5000kg, ?, 4, ?, Elephant]
[> 5099kg, grey, 4, > 309cm, Elephant]
Question: which concept would you choose?

What does a learning program consist of?
We distinguish a learning element, which makes improvements, and a performance element, which selects actions or outputs.
[Figure: an agent in which the performance element maps inputs to outputs/actions, and the learning element uses feedback and prior information]
The design of learning elements is determined by the following:
• Which components of the performance element have to be improved?
• Which representation is chosen for those components?
• What kind of feedback is available?
• What prior information is available?

3 forms of learning
Based on the available feedback we distinguish 3 forms of learning:
• Supervised learning: the agent receives the input and the desired output. The goal is to learn a function that approximates the input-output mapping well.
• Reinforcement learning: the agent only receives an evaluation of its chosen action, not the optimal action. The goal is to learn an optimal action-selection policy.
• Unsupervised learning: the agent receives only inputs and no outputs. By means of an objective (distance) function the agent can learn patterns in the input, such as clusters in the input space.
Certain optimisation (search) algorithms, such as genetic algorithms, are often also regarded as learning.
Some examples:
Supervised: handwritten text recognition, speech recognition, deciding whether someone gets a loan, face recognition.
Reinforcement learning: learning games by playing against yourself, robot control, shortest-path finding, lift control.
Unsupervised learning: compression of data (for example images), data analysis (what are the most important features?), market basket analysis (association rules: if someone buys sauerkraut, does that person also buy a bottle of red or white wine?)
Optimisation (GAs): robot control, function optimisation (how should parameters be set for an optimal result?), combinatorial problems (scheduling, TSP).
Question: which form of learning is suitable for translating a document?

Forms of supervised learning
There are 3 different learning problems in supervised learning:
• Binary classification. The data has 2 possible outcomes. The goal is to separate one category from the other. If this cannot be done perfectly (consistently), the number of errors has to be minimised.
• Multi-class classification. In this case there are several possible output classes and the optimal class has to be returned for each input.
• Regression. In this case the output is a real value. Now the (squared) error with respect to the desired output has to be minimised.
NB. Regression problems cannot be solved by all learning algorithms.

Ingredients of a learning system
To learn we first of all need a representation. Example: a neural network, a decision tree. The representation determines which hypotheses (functions) can be learned.
A hypothesis is an instantiation of a representation (e.g. a trained neural network).
Furthermore we need an evaluation function (performance function) to evaluate a given hypothesis (e.g. count the number of errors).
Finally we need training examples. In the case of reinforcement learning we need a learning environment; the training examples are generated by the system itself.
The training examples may contain noise: this can make learning considerably harder. So we should not learn the noise perfectly, but concentrate on the generalisation properties.

Inductive supervised learning
In supervised learning the agent receives a number of input-output examples and has to learn a function that approximates the examples well (induction).
A function can be a continuous mathematical function, but also a logical function consisting of propositions etc.
A training example is a pair (x, f(x)). Sometimes we use y = f(x) to denote the output.
The inductive inference task is as follows: given a number of examples of f(x), return a function h that approximates f.
The function h is also called a hypothesis.
Inductive bias: a learning algorithm has a preference for certain hypotheses over other hypotheses.

Different kinds of inductive bias
[Figure: two hypotheses separating + and - examples, one making 5 errors and one making 6 errors]

Example of generating a hypothesis
Given the following training examples:
[Figure: a cluster of + examples surrounded by - examples]
Generate a hypothesis that is consistent with the training examples:
[Figure: the same examples with the most general consistent hypothesis and the most specific consistent hypothesis]

Testing a hypothesis
The total set of examples is usually split into a training set and a test set.
The test set is used to evaluate how well the learned hypothesis generalises (to other, unseen examples).
It can happen that the error on the training set is 0 while the error on the test set is quite high.

Rote learning
Rote learning is really nothing more than memorising all training examples.
The hypothesis in this case is the entire set of seen examples with their classifications.
Although it is useful for learning telephone numbers, rote learning has the following major disadvantages:
• The system can only classify examples correctly if they have been seen before.
• For (almost) perfect classification all examples have to be seen.
• The number of training examples (learning time) and the storage space for the hypothesis are enormous.
Learning is really compression!

Ockham's razor
If several hypotheses classify the same training examples correctly, we prefer the smallest one (with the fewest parameters).
The smallest one can generalise best.
Example: suppose we have the following training examples for faces:
[beard, mouth, nose, moustache, 2 blue eyes, 2 ears, brown hair, face]
[¬beard, mouth, nose, ¬moustache, 2 green eyes, 2 ears, brown hair, face]
[¬beard, ¬mouth, ¬nose, ¬moustache, 5 white eyes (pips), ¬ears, ¬hair, die]
We can now form, for example, the following 2 concepts for a face:
[mouth, nose, 2 ears, 2 eyes] or
[mouth, nose, 2 ears, 2 eyes, brown hair]
The first clearly generalises better.

Learning is searching for parameters
Learning can be seen as searching for the parameters that minimise the error.
The parameters depend on the representation used.
E.g. logical propositions for decision trees, real-valued weights for neural networks.
By using a learning algorithm you get a guided search through the error landscape:
[Figure: error landscape, with the error plotted against the state W]

What do the training examples look like?
We learn a mapping from:
Input [x1, x2, . . . , xn] to
Output [y1, . . . , ym] examples
To characterise the learning task the following points matter:
• Number of input attributes
• Kind of inputs: continuous or discrete (how many possibilities)
• Are there missing values for input attributes?
• Is there noise in the input attributes (the value of an attribute has been changed)?
• Number of output attributes
• Kind of outputs: continuous or discrete
• Are there missing values for output attributes?
• Is there noise in the output attributes?
• What does the process behind the training examples look like (a priori knowledge)?

Steps for using machine learning
[Figure: flow chart. Determine the learning task (task T) → collect examples for this task (training examples, test examples) → select a representation for the function (representation) → run the learning algorithm (hypothesis) → test (evaluation)]
Some topics in machine learning
• Which algorithms can represent which functions well?
• How does the number of training examples affect the accuracy?
• How does the complexity of the hypothesis affect the accuracy?
• How does noise affect the accuracy?
• What are the theoretical limits of learnability?
• How can a priori knowledge be used?
• What ideas do biological systems offer us?
• How can the representation be changed by the algorithm?

What kinds of learning algorithms exist?
Supervised learning:
• Version spaces
• Decision trees
• Naive Bayes classifiers
• Neural networks
• k-nearest neighbours / locally weighted learning
• Self-organising maps
• Bayesian belief networks
• Support vector machines
Reinforcement learning:
• Optimal control, dynamic programming
• Reinforcement learning (RL) algorithms
• Exploration
• Function approximation and RL, multi-agent RL
Unsupervised learning:
• Clustering
• K-means clustering
• Principal component analysis
• Independent component analysis
• Kohonen networks / self-organising networks
Evolutionary computation:
• Genetic algorithms
• Probabilistic Incremental Program Evolution (PIPE)
• Genetic programming
• Evolutionary approaches to optimise neural networks
• Evolutionary Markov Chain Monte Carlo (EMCMC)

Related disciplines
• Artificial intelligence
• Statistics
• Computational complexity theory
• Control theory
• Cognitive psychology
• Biology
• Philosophy
• Neurophysiology
• Economics
• ...

Conclusion
There are three forms of learning: supervised, unsupervised and reinforcement learning.
To learn we need training examples (or a learning environment).
Learning is searching for a hypothesis (an instantiation of a representation) that performs well on the training examples.
To deal with noise and to be able to generalise, we usually prefer the smallest possible (consistent) hypothesis.
To measure the generalisation of a hypothesis we use test examples that have not yet been seen by the learning algorithm.
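The difference between training error and test error can be made concrete with the rote learner discussed in these slides. The following Python sketch is an illustration only: the toy data set (classify integers as even or odd) and all function names are assumptions.

```python
def rote_learn(training_examples):
    """Rote learning: the hypothesis is just the stored set of examples."""
    return dict(training_examples)

def accuracy(hypothesis, examples):
    """Fraction classified correctly; unseen inputs count as errors."""
    correct = sum(1 for x, y in examples if hypothesis.get(x) == y)
    return correct / len(examples)

# Hypothetical toy data: x is an integer, f(x) = 1 if x is even, else 0
data = [(x, int(x % 2 == 0)) for x in range(20)]
train, test = data[:14], data[14:]

h = rote_learn(train)
train_acc = accuracy(h, train)
test_acc = accuracy(h, test)
print(train_acc, test_acc)
```

The rote learner is perfect on the examples it has stored and useless on the held-out ones, which is why generalisation has to be measured on test examples the learner has not seen.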
Slides for the course Introduction to Adaptive Systems: Evolutionary Computation. M. Wiering

Evolutionary Computation (EC)
Optimisation algorithms inspired by Darwin's theory of evolution ("natural selection"), developed in the 1960s and 1970s.
Good at distributed search in large spaces.
They use genetic strings (usually a sequence of zeros and ones) as representation.
Evolutionary algorithms can be used for:
• Control tasks
• Combinatorial optimisation
• Function optimisation

Learning goals
• Understand what optimisation problems are about
• Understand genetic algorithms
• Be able to devise a representation for a given problem
• Be able to construct mutation and recombination operators for a given representation
• Know different selection strategies
• Understand the difference between genetic algorithms and evolution strategies
• Understand genetic programming
• Understand PIPE

Optimisation problems
In an optimisation problem there is a representation (solution) space S and a fitness (evaluation) function f.
The goal is to find the solution s_max ∈ S with maximal fitness:
f(s_max) ≥ f(s) ∀ s ∈ S
Because the representation space is often very large, we cannot search it exhaustively.
Therefore we have to use heuristic search algorithms.
Examples are: genetic algorithms, tabu search, local hill-climbing, simulated annealing, ant colony systems.

Local hill-climbing
Local hill-climbing works as follows for finding a solution:
• Generate a solution s_0, t = 0
• Repeat until the stop condition holds:
– s_new = change(s_t)
– If f(s_new) > f(s_t) then s_{t+1} = s_new
– Otherwise: s_{t+1} = s_t
– t = t + 1
The most important function here is change(s).
A disadvantage of local hill-climbing is that it often ends up in a local optimum.
To deal with this, the algorithm is usually started from several different starting points.

Genetic algorithms
Initialise: generate a population of individuals (genetic strings / genotypes)
Repeat:
(1) Evaluate all individuals (compute fitness)
(2) Select individuals on the basis of their fitness
(3) Recombine the selected individuals using crossover and mutation
Theory: schemata are passed on to children.
[Figure: GA cycle. A population of bit strings is evaluated (fitness values 0.72, 0.45, 0.56, 0.11, 0.89), followed by selection/recombination]

Steps for making an EA
How do we make an evolutionary algorithm?
Abstract steps:
• Design a representation
• Initialise the population
• Design a way to convert a genotype into a phenotype
• Design an evaluation (fitness) function to evaluate an individual
More specific steps:
• Design particular recombination operator(s)
• Design particular mutation operator(s)
• Decide how parents are selected
• Decide how individuals are inserted into the new population
• Decide when the algorithm can stop

Designing a representation
We have to devise a way to represent an individual as a genotype.
There are many possibilities for this. The way we choose has to be relevant to the problem we want to solve.
When choosing a representation, the evaluation method and the genetic operators have to be taken into account.
The representation can use discrete values (binary, integer etc.).
Example: binary representation:
[Figure: chromosome 1 0 1 0 0 0 1 1; each position is a gene]

Converting a genotype into a phenotype
Once we have a representation for the genotypes, we still have complete freedom in translating it into a phenotype.
The search takes place in the genotype representation space.
The phenotype is evaluated.
[Figure: the 8-bit genotype 1 0 1 0 0 0 1 1 is mapped to a phenotype: integer, real number, schedule, ...]
Example: phenotype is a natural number: the value of the binary representation = 163
Example: phenotype is a number between 2.5 and 20.5:

$$x = 2.5 + \frac{163}{256}\,(20.5 - 2.5) = 13.9609$$

Representing real numbers
If we want a phenotype of real numbers, a natural way is to encode the real numbers directly in the genotype.
Example: a tuple of n real numbers: $X = (x_1, x_2, \ldots, x_n)$, $x_i \in \mathbb{R}$
A well-known application of this is parameter optimisation.
The fitness function f maps the tuple to a single number: $f : \mathbb{R}^n \to \mathbb{R}$

Representation for ordering problems. For certain problems such as the TSP, an order of cities has to be found. We encode this as:
3 4 8 6 1 2 7 5
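The genotype-to-phenotype examples above (163 and 13.9609) can be reproduced with a few lines of Python. This is a minimal sketch; the function names `bits_to_int` and `bits_to_real` are made up for the illustration.

```python
def bits_to_int(bits):
    """Interpret a list of 0/1 genes as an unsigned binary number."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value

def bits_to_real(bits, low, high):
    """Map the genotype linearly onto the interval [low, high)."""
    return low + bits_to_int(bits) / 2 ** len(bits) * (high - low)

genotype = [1, 0, 1, 0, 0, 0, 1, 1]
print(bits_to_int(genotype))                         # 163
print(round(bits_to_real(genotype, 2.5, 20.5), 4))   # 13.9609
```

Because the integer is divided by 256 (2 to the power of the number of bits), as in the slide's formula, the upper bound 20.5 is approached but never reached exactly; dividing by 255 instead would include it.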
Initialisation
The following methods exist for initialisation:
• Uniformly at random over the entire search space:
1. Binary strings: 0 and 1 each with probability 0.5
2. Real numbers: uniform on a given interval (does not work if the interval is unbounded)
• Use previous results to initialise the population, or use heuristics
1. Disadvantage: possible loss of genetic diversity
2. Disadvantage: sometimes impossible to escape from the initial bias.

Evaluating an individual
This can sometimes be a very costly process, certainly for real-world problems.
It can be a subroutine, a black-box simulator or an external process (e.g. robots).
You can start with an evaluation function that approximates the outcome of the process, but not for too long (e.g. learn with a neural network what the fitness landscape looks like).
• Do not re-evaluate unchanged individuals
Dealing with constraints: what if the phenotype does not satisfy a certain constraint?
• Use a penalty term in the fitness function
• Use specific evolutionary algorithms that handle constraints
Problem: dealing with multiple objectives in the fitness function: use compromises between sub-goals.

Mutation operators
We can have 1 or more mutation operators. Important is:
• At least 1 mutation operator has to make it possible to search the entire search space
• The size of the mutation step has to be controllable
• Mutation has to yield valid chromosomes
Example mutation:
Before mutation: 1 1 1 1 1 1 1 1
After mutation: 1 1 1 0 1 1 1 1 (the fourth gene is the mutated gene)
Mutation usually happens with probability P_m on each gene.

Special mutation operators
Mutation on real numbers can be done by applying a perturbation with random noise.
Often a Gaussian noise distribution (with mean 0) is used:
$x_i \leftarrow x_i + N(0, \sigma)$
Mutation for order-specific representations (swap): select 2 genes and swap them:
7 3 1 8 2 4 6 5 → 7 3 6 8 2 4 1 5

Recombination operators
A recombination operator usually maps 2 parents to 1 or 2 children.
We can construct 1 or more recombination operators for a representation.
Important here is:
• The child has to inherit something from each parent. Otherwise it is a mutation operator.
• The recombination operator has to be designed together with the representation, so that recombination is not often catastrophic.
• Recombination has to yield valid chromosomes.
Example recombination (1-point crossover):
[Figure: two parent bit strings are cut at the same position; each child combines the head of one parent with the tail of the other]

Recombination masks
Recombination (crossover) can be one-point, two-point or uniform. We can visualise this as a crossover mask:
[Figure: a uniform crossover mask applied to two parents, producing two children]
Crossover on real-valued representations can be done by averaging the numbers:

$$\left(x^c_1 = \frac{x^a_1 + x^b_1}{2},\ \ldots,\ x^c_n = \frac{x^a_n + x^b_n}{2}\right)$$

Crossover for order-dependent representations
Here we have to take care that we still obtain valid chromosomes.
Choose a random part of the first parent and copy it to the first child.
Copy the remaining genes that are not in the copied part to the first child (using the order of the genes in the 2nd parent).
[Figure: Parent 1 = 7 3 1 8 2 4 6 5, Parent 2 = 4 3 2 8 6 7 1 5. The part 1 8 2 is copied from parent 1; the remaining genes 7,3,4,6,5 are taken in the order of parent 2, i.e. 4,3,6,7,5, giving Child 1 = 7 5 1 8 2 4 3 6]

Selection strategy
We want a scheme in which better individuals have a larger chance of becoming parents than less good individuals.
This provides the selection pressure that enables the population to improve.
We have to give less good individuals a small chance to propagate as well, because they may contain useful genetic material.
E.g. fitness-proportional selection: parent selection with probability $f_i / \sum_j f_j$
Expected number of times individual i is chosen as a parent = $f_i / \hat{f}$, with $\hat{f}$ the average population fitness.
Disadvantages of fitness-proportional selection:
• Danger of premature convergence, because very good individuals quickly take over the whole population
• Low selection pressure when fitness values lie close together
• Behaves differently under translations (shifts) of the fitness function
Solution: scale the fitness function: fitness between 0 and 1 (best), sum of fitness values = 1.

Tournament selection
Select k random individuals without replacement
[Figure: tournament selection with k = 3. Three contestants are drawn from the population and the one with the highest fitness becomes the champion]
Take the best (k is the size of the tournament)

Another selection strategy: rank-based selection
Rank-based selection orders all individuals by their fitness. The position in this ordered list is called the rank.
We use the rank to select an individual. Individuals with a high rank (good fitness) are chosen more often.

Replacement strategy
The selection pressure is also influenced by the way in which individuals from the previous generation are killed to make room for new individuals.
We can generate a completely new population each time, or eliminate part of the old population on the basis of fitness.
We can for example use the elitist strategy and decide never to remove the best individual from the population.
In any case we store the best individual in a safe place.

Recombination vs. mutation
Recombination:
• Changes depend on the whole population
• Decreasing effect as the population converges
• Exploitation operator
Mutation:
• Required to escape from local optima
• Exploration operator

GA vs. ES
GA uses crossover and mutation.
Evolution strategies (ES) use only mutation.
The choice depends on:
• Is the fitness function separable?
• Do building blocks exist?
• Is there a semantically meaningful recombination operator?
If recombination is meaningful → use it!

Genetic programming
Uses functional trees as representation instead of binary strings.
Functions such as: cos, sin, *, +, /, random constant
Example program:
[Figure: program tree for the function Cos((X1 + X2) * 2): root COS with child *, whose children are + (with children X1 and X2) and 2]
Use:
• Supervised learning on continuous inputs/outputs
• Robot control tasks
• Image classification
• Flexible: loops, memory registers, special random numbers etc.

Mutation and recombination in GP
The mutation operator can then modify a node of the tree:
The recombination operator also works on programs:
[Figure: mutation in GP. Before mutation the tree is COS(*(+(X1, X2), 2)); after mutation one node has been changed (the * node becomes +)]
[Figure: crossover in GP. Two parent trees, rooted at COS and SIN, are each cut at a subtree; swapping the cut subtrees gives two children]

PIPE
Probabilistic Incremental Program Evolution (PIPE) uses probabilities to generate the function at each node of the program tree.
It stores a single probabilistic prototype tree (PPT) instead of all individuals:
[Figure: probabilistic prototype tree. Each node holds a probability distribution over SIN, COS, *, +, /, X1, X2, for example (0.51, 0.20, 0.09, 0.04, 0.06, 0.09, 0.01) at one node]
Individuals are created with the help of the PPT:
• Start at the root and choose a function for it according to the probability distribution
• Go to the subtrees of the PPT to generate the required arguments for the previously generated functions
• Until the program is complete

Learning process in PIPE
Repeat until the stop condition holds:
(1) Generate a population of M individuals in the manner described above
(2) Evaluate all M individuals
(3) Select the best one and shift the probabilities so that the probability of generating this individual again increases
(4) Mutate the PPT so that the probability distribution is slightly perturbed
PIPE was compared with GP and turned out to be able to find better solutions for some problems.

Example of a task for a GA
The bin packing problem is an NP-hard combinatorial optimisation problem:
Given a list of objects with their weights, and bins with a certain maximum capacity.
Goal: distribute the objects over the bins such that a minimum number of bins is needed to pack all objects.
Representation: list of the bin number in which each object is placed:
Examples: [1, 2, 1, 3, 1, 2, 1, 3, 1, 2, 1, 3]
and [2, 2, 4, 3, 1, 3, 1, 2, 1, 2, 1, 3]
Fitness function: how many bins are used? How full are the used bins?
Question: how many children can be generated by applying uniform crossover to 2 strings of length n?

Advantages and disadvantages of EC
Advantages:
• Evolutionary algorithms can often be applied directly
• They are good at searching large search spaces
• Can be used for many kinds of problems
• Easy to parallelise
Disadvantages:
• Sometimes suffer from premature convergence: loss of genetic diversity
• Many individuals are evaluated and then not used again
• Can result in a slow learning process
• If most individuals have the same, poor fitness there is little selection pressure (e.g. task: right answer / wrong answer).
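To tie the pieces of these slides together (bit-string representation, tournament selection, one-point crossover, per-gene mutation with probability P_m, elitism), here is a minimal genetic algorithm sketch in Python. It is an illustration under stated assumptions, not the course's implementation: the toy fitness function (count the ones, often called OneMax), the parameter values and all function names are chosen for this example.

```python
import random

def one_max(bits):
    """Toy fitness (an assumption for this sketch): the number of ones."""
    return sum(bits)

def tournament(population, fitness, k, rng):
    """Draw k individuals without replacement and return the best."""
    return max(rng.sample(population, k), key=fitness)

def one_point_crossover(a, b, rng):
    """Cut both parents at the same point and exchange the tails."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, p_m, rng):
    """Flip each gene independently with probability p_m."""
    return [1 - g if rng.random() < p_m else g for g in bits]

def run_ga(n_bits=20, pop_size=30, generations=40, k=3, p_m=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        best = max(population, key=one_max)
        best_per_generation.append(one_max(best))
        new_population = [best[:]]  # elitism: the best individual survives
        while len(new_population) < pop_size:
            p1 = tournament(population, one_max, k, rng)
            p2 = tournament(population, one_max, k, rng)
            for child in one_point_crossover(p1, p2, rng):
                new_population.append(mutate(child, p_m, rng))
        population = new_population[:pop_size]
    return max(population, key=one_max), best_per_generation

best, history = run_ga()
print(one_max(best), history[0], history[-1])
```

Because the best individual is copied unchanged into every new population, the best fitness per generation can never decrease; removing the elitism line makes that guarantee disappear.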
Slides for the course Introduction to Adaptive Systems: Artificial Life. M. Wiering

Artificial Life
Artificial life is concerned with modelling and simulating living entities and complex systems. Applications can be found in:
• Biology: how do living organisms interact in biological processes?
• Economics: how do rational agents interact in economic environments such as stock markets, e-commerce, etc.?
• Sociology: how do agents interact in communities when they have competitive/collaborative goals?
• Physics: how do physical particles interact in a given space?
ALife tries to understand the origin and functionality of life.
ALife tries to design artificial organisms that can meaningfully be called alive.
ALife tries to understand interaction patterns between organisms.
ALife tries to construct multi-agent systems that model, imitate and possibly optimise a particular multi-agent behaviour.
Examples:
• Market economies
• Social systems
• Immune systems
• Ecosystems
Question: what is the difference between AI and ALife?

GA and ALife
Genetic algorithms are used for various purposes:
• Optimisation. E.g. numerical optimisation, job-shop scheduling.
• Robot learning.
• Economic models. Such as the development of pricing strategies.
• Ecological models. Such as arms races, co-evolution.
• Population genetics models. Which genes remain in the population?
• Studying interactions between evolution and learning.
• Models of social systems. Such as the evolution of cooperation.

Interaction between evolution and learning
Evolution and learning are two adaptation processes for organisms. Evolution is slow and takes place over generations of individuals. Learning is fast and takes place within an individual (agent).
There are 2 possible effects of combining evolution and learning.
• Baldwin effect. An individual learns through its interaction with the environment. This learning influences the fitness of the individual. As a result, organisms that are good at learning can attain a higher fitness. Learned knowledge is not passed on to the offspring.
• Lamarckian learning. An individual learns during its life and passes the learned knowledge on to its offspring.
Research shows that learning and evolution together can work better than either one alone.

Ecosystems and evolutionary dynamics
Ecosystems consist of a number of individuals (agents) which:
• Occupy a position in the environment
• Interact with the environment and other agents
• Een bepaalde interne toestand (bv. energie, geld) hebben.
By studying the evolution of an ecosystem, one can find all kinds of analogies with real ecosystems. Examples are:
• Cooperation. E.g. trade between agents.
• Competition. E.g. fights between agents.
• Imitation. E.g. an agent learns what to do by watching another agent.
• Parasitic behavior. An agent profits from another agent.
• Evolution of communities. Within communities agents can cooperate and specialize.

Cellular Automata
Cellular automata (CA) are decentralized spatial systems with a large number of simple, identical components with local connectivity.
They are used as models for biological, social and physical processes, such as:
• Fluid flow
• Galaxy formation
• Earthquakes
• Biological pattern formation
• Emergent cooperative and collective behavior
• Traffic models
• Forest fires

Formal description of CA
A CA consists of 2 components:
• The cellular space. The cellular space is a lattice of N identical cells, each with an identical pattern of local connectivity to other cells.
Let Σ be the set of states of a cell. k = |Σ| is the number of states per cell.
A cell with index i is in state $s_i^t$ at time t. The state $s_i^t$ together with the states of the cells to which i is connected is called the neighborhood $n_i^t$.
• The transition rule. The transition rule $r(n_i^t)$ gives the updated state $s_i^{t+1}$ as a function of the cell's neighborhood.
Usually all cells are updated synchronously.
The rule is usually implemented as a lookup table.

Example CA
Rule table R:
Neighborhood: 000 001 010 011 100 101 110 111
Output bit:     0   1   1   1   0   1   1   0
Lattice (periodic boundary conditions):
t=0: 1 0 1 0 0 1 1 0 0 1 0
t=1: 1 1 1 0 1 1 1 0 1 1 1
Question: determine the new configuration at time step t=2

Behavior of the CA over time
The CA above, with 1 dimension and a neighborhood size of 3, is one of the simplest CA.
Still, such a CA can show complex behavior if we keep iterating it.
[Figure: the example CA iterated over many time steps]
CA belong to the class of iterative networks or automata networks.
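The example CA with rule table R can be reproduced in a few lines. The Python sketch below is an illustration, not part of the original slides: it stores the rule table as a lookup table keyed by the (left, self, right) neighborhood, updates all cells synchronously, and uses periodic boundary conditions. The names RULE, ca_step and ca_run are chosen here for illustration only.

```python
# Rule table R from the example: neighborhood (left, self, right) -> output bit.
RULE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 0,
}


def ca_step(cells, rule=RULE):
    """Synchronously update all cells once, with periodic boundary conditions."""
    n = len(cells)
    return [
        rule[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
        for i in range(n)
    ]


def ca_run(cells, steps, rule=RULE):
    """Return the list of configurations for t = 0 .. steps."""
    history = [list(cells)]
    for _ in range(steps):
        history.append(ca_step(history[-1], rule))
    return history


if __name__ == "__main__":
    t0 = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0]
    for t, row in enumerate(ca_run(t0, 5)):
        print(f"t={t}: " + " ".join(map(str, row)))
```

Applying ca_step once to the t=0 row reproduces the t=1 row of the example; applying it again gives the configuration asked for at t=2.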
Different kinds of processes in CA
If we keep iterating a given CA from an initial state, we can obtain different kinds of patterns:
• A stable end state, after which the evolution (change) stops.
• A cyclic pattern. Here the CA passes through different configurations in succession, but these configurations form a cycle, so the same states keep returning.
• Chaotic behavior. The configuration keeps changing and never returns to the same state (NB: in principle this is only possible for infinitely large automata).
• Some initial patterns result in complex local structures, which sometimes persist for a long time. These often contain the most information.
We call the period of time until a fixed point or periodic cycle occurs the transient period.
Somewhere on the boundary there is a phase transition from complex ordered behavior to chaotic behavior.

Example of cyclic patterns
A stable end state can, for example, consist of only 1s. Transition rules with such end states are easy to construct.
Suppose we have a 2-dimensional lattice. The transition rule is: if 2 neighbors are active, the cell is activated. Otherwise the cell is switched off.
See below a lattice without boundary conditions (we actually show only a small part of the complete lattice, which is empty elsewhere).
[Figure: initial configuration on the 2-dimensional lattice]
This leads to a periodic cycle of length 2 for the initial configuration above.
Question: do you think the transition rule above can lead to chaotic behavior?

Exercise
Given a 2-dimensional lattice and the update rule: if 1 neighbor is active and the cell was inactive, make the cell active. Otherwise, if the cell was active at time step t, keep it active at the next time step.
Evolve the CA in the figure below:
[Figure: lattice to be filled in at t=0, t=1, t=2, t=3]
Question: what shape emerges if we keep iterating this process on a larger lattice?

Elimination of the pattern basis
Some researchers try to find a basis that forms a background (walls, singularities, etc.).
Removing the regular basis can give insight into the evolution of chaotic or turbulent processes.
Take the following chaotic process:
[Figure: chaotic process]
If we remove the regular background we obtain:
[Figure: the same process with the regular background removed]
The "embedded particles" show "random walk" behavior.
When the "embedded particles" cross each other, they are annihilated.

Research on CA
CA are universal computing machines. They are therefore just as powerful as Turing machines.
There is a lot of research on CA. A well-known example of a CA is Conway's Game of Life.
Conway's Game of Life is a 2-dimensional lattice with 8 neighbors for each cell. The transition rules are:
• If a cell is active and 2 or 3 of its neighbors are active, the cell remains active.
• Otherwise, if the cell is inactive and exactly 3 neighbors are active, the cell becomes active.
• Otherwise the cell becomes inactive.

Universal computing power in the Game of Life
Conway's Game of Life also has universal computing power. This power rests on gliders, which can propagate information and with which logical AND and NOT gates can be built.
[Figure: a glider at t=0, t=1, t=2, t=3, t=4]
This glider moves 1 square diagonally every 4 time steps.
By using glider guns, interactions between gliders can take place.
Example: when gliders meet, they are annihilated. This is used to make a NOT gate.

Development of CA
Von Neumann (1966) constructed a CA that was able to reproduce itself.
Smith (1972) constructed a CA that could recognize context-sensitive languages (e.g. palindromes).
Mitchell et al. (1994) used GA to evolve CA rules. They tried this on the majority problem in 1-dimensional CA.
This problem is: if the majority of the cells is on, then after some period of time all cells must switch on. Otherwise all cells must switch off.
The GA found various solutions. Some of them made use of embedded particles. Because of the local connectivity, however, no solution could generate the right answer for all initial configurations.
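The Game of Life transition rules and the glider described in this lecture can be checked with a short program. The Python sketch below is illustrative only: it assumes an unbounded lattice and represents a configuration as a set of (row, column) coordinates of active cells. The names life_step and GLIDER, and the particular glider orientation, are choices made for this example.

```python
from collections import Counter


def life_step(live):
    """One Game of Life step on an unbounded lattice; `live` is a set of (row, col)."""
    # For every cell adjacent to an active cell, count its active neighbors (8-neighborhood).
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Active cell with 2 or 3 active neighbors: stays active.
    # Inactive cell with exactly 3 active neighbors: becomes active.
    # Every other cell: inactive.
    return {
        cell
        for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


# One common orientation of the glider (an assumption of this sketch).
GLIDER = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}

if __name__ == "__main__":
    cells = GLIDER
    for t in range(5):
        print(f"t={t}: {sorted(cells)}")
        cells = life_step(cells)
```

After 4 steps the set of active cells is the original glider shifted by one row and one column, which matches the statement that the glider moves 1 square diagonally every 4 time steps.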
Other CA
CA are also well suited to problems such as modelling traffic. Here a cell is, for example, active if there is a car at that location.
The CA can be made more complex by giving each car an internal state containing its destination. (This can be encoded in the cell itself.)
CA can also be used to model epidemics. Here a cell can be a sick, healthy, or recovered person.
CA can also be used for forest fires. Here a cell can, for example, be in one of the states (Fire, Water, Trees, Grass, Sand).
Extra external parameters such as wind can also influence the behavior of the CA (possibly through dynamic transition rules).

Example: Strategic Bugs
Strategic Bugs is an ALife model (Bedau and Packard, 1992) in which:
• The environment is a 2-dimensional lattice.
• A cell in the environment can contain an agent or food.
• Food grows automatically in the environment: every now and then food appears in a random cell.
• Bugs (agents) survive by finding and eating food.
• Bugs use energy to move and die when they have no energy left.
• Bugs can clone themselves or produce offspring with another bug.
A bug is modelled by a lookup-table representation. An example rule is: if there are more than 5 food units to the north-west, take a step to the north-west.
Bedau and Packard study the evolutionary activity: to what extent are new genetic innovations brought about in the population?
The experiments show that waves of activity evolve over periods of time. So the population continually finds and exploits new genetic material.
Although the model is quite simple, the evolutionary activity is seen as a useful variable for measuring evolutionary processes.

Market models
Financial markets are hard to predict. The question is: under which circumstances are predictions possible at all?
A starting point is the "Efficient Market Hypothesis" (EMH).
In an informationally efficient market, price changes are unpredictable if they are properly anticipated, that is, if they fully incorporate the expectations and information of market participants.
It follows that the more efficient a market is, the more random the price changes generated by the market are.
This is because even with the smallest informational advantages of investors, the actions of those investors immediately eliminate the possible profit on those stocks.

Are markets predictable?
But: one of the tendencies of investors is the trade-off between expected risk and expected profit.
As a result, risk-averse investors may, for example, sell stocks that have a high risk but also an expected profit.
The consequence is that not all price fluctuations are random.
Certain studies have indeed shown that price fluctuations are not completely random.
Question: do you think computer algorithms are able to make considerable profits on the stock market?

Agents are not rational and fully informed
The hypothesis that investors are fully rational and possess all the information they need is not realistic.
This is because human behavior is unpredictable, information is hard to interpret, technologies and institutions change, and there are costs attached to transactions and to gathering information.
A promising direction is the relative efficiency of a market in comparison with other markets, e.g. futures contracts versus options.
Just as a refrigerator with an efficiency of 40% seems quite good and is preferred over a refrigerator with an efficiency of 35%, certain markets are preferred as well.

Models for financial theories
The wish to develop financial theories has led to different approaches to explaining the market:
• Psychological approaches to the risk-taking behavior of investors. These analyze how human psychology influences the economic decision process.
• Evolutionary game theory. This studies the evolution of stable states of populations of competing market strategies. It is usually done in simplified, ideal simulated markets.
• Agent-based modelling of markets. Here investors are modelled as agents that use certain strategies. This makes it possible to study complex learning behavior and dynamics in financial markets.

Agent-based models
A promising direction is agent-based modelling.
Market participants are computational entities that apply strategies to limited information.
Some market participants make a loss and others a profit.
Profitable strategies accumulate money, while loss-making strategies lose money and eventually disappear.
This can therefore be done with genetic algorithms.
The creation of new strategies can make other strategies less or more attractive.
A financial market can therefore be seen as a co-evolving ecology of trading strategies.

Reinforcement learning for agent models
Reinforcement learning can be used to let agents learn from their actions. If certain strategies yield a profit, they are used more often in the future.
The core of such an agent is the mapping from partial information to actions. This mapping is adjusted through interaction with the environment (the market).
Reinforcement learning thus adjusts an agent during its life, whereas GA make an agent disappear or copy it, so that certain profitable strategies become more common in the population.
To make market models realistic, all kinds of financial indices, technological innovations, mergers, etc. can also be modelled in the simulation.

Artificial Art and Fractals
Iterating a simple function can generate complex, artistic patterns.
A fractal is a pattern that resembles its own sub-patterns and that can often be produced by simple equations.
Consider the function:
$x_{k+1} = x_k^2$
Look at the start values $x_0$ for which the iteration converges to a stable point.
We can see directly that for the values $-1 < x_0 < 1$ the iterative update converges, and the stable point is $x_\infty = 0$.
The values for which the iteration does not go to infinity are called the Julia set.
Fractals made with complex numbers
With real numbers not many interesting things can happen. So let us use complex numbers, described by $x = ai + b$.
Now consider functions of the type:
$x_{k+1} = x_k^2 + C$
If we start with $x_0 = 0$, for which complex numbers C will the iteration of this function not go to infinity?
This set of complex numbers for which the iteration does not go to infinity is called the Mandelbrot set.

Julia set
Consider again the iterated function:
$x_{k+1} = x_k^2 + C$
Now we take a value for C that is an element of the Mandelbrot set.
Which start values $x_0$ in the complex plane ensure that the iteration does not go to infinity?
This set, which belongs to a particular value of C, is called the Julia set for C.

Conclusion
ALife is suited to studying complex processes that consist of interacting entities.
Genetic algorithms can be used to adapt entities (agents) to their changing environment.
If several agents evolve and the fitness of an agent is determined by the whole community, this is also called co-evolution.
Cellular automata are suited to modelling certain biological, social and physical processes.
Market economies can be modelled as complex interactions between agents with their own behaviors. The workings of such a system have analogies with ecosystems.
Fractals are patterns that resemble their sub-patterns. Mandelbrot fractals can be made with a simple iterative algorithm.
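The simple iterative algorithm behind the Mandelbrot and Julia sets can be sketched as follows. This Python example is an illustration under stated assumptions, not the author's implementation: it decides membership approximately by iterating $x_{k+1} = x_k^2 + C$ a fixed number of times (max_iter, an assumed parameter) and treating a point as escaped once $|x_k| > 2$, the usual escape radius when $|C| \le 2$. The function names are hypothetical.

```python
def escapes(c, x0=0j, max_iter=100, bound=2.0):
    """Iterate x_{k+1} = x_k**2 + c and report whether |x_k| ever exceeds `bound`."""
    x = x0
    for _ in range(max_iter):
        if abs(x) > bound:
            return True
        x = x * x + c
    return abs(x) > bound


def in_mandelbrot(c, max_iter=100):
    """Approximate test: start at x0 = 0 and ask whether the iteration stays bounded for this C."""
    return not escapes(c, x0=0j, max_iter=max_iter)


def in_julia(x0, c, max_iter=100):
    """Approximate test: fix C and ask whether the iteration stays bounded for start value x0."""
    return not escapes(c, x0=x0, max_iter=max_iter)


if __name__ == "__main__":
    # Coarse text rendering of the Mandelbrot set: '#' marks values of C that stay bounded.
    for row in range(-10, 11):
        line = ""
        for col in range(-40, 21):
            c = complex(col / 20.0, row / 10.0)
            line += "#" if in_mandelbrot(c) else "."
        print(line)
```

Because only a finite number of iterations is run, points close to the boundary of the set may be misclassified; raising max_iter sharpens the picture at the cost of more computation.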
Slides for the course Introduction to Adaptive Systems: Introduction. M. Wiering

Adaptive Systems
Adaptive systems are systems in which there is an interaction between the system and its environment, so that both keep making transitions to changing states.
On the basis of the changing states of the system, an adaptive system will adapt itself and/or the environment in order to achieve a certain goal (simulation, optimization).
A learning adaptive system has a way to measure its own performance and can change its own internal parameters to improve (optimize) itself.
Other associated names are:
• Cybernetics
• Self-organizing systems
• Complex adaptive systems
Examples of adaptive systems are:
• Robots that can find their way in a certain environment
• Learning systems that turn data into knowledge (e.g. tree classification)
• Self-driving cars or automatic aircraft pilots
• Evolutionary systems in which the distribution of the gene pool adapts to the environment
• Economies in which well-performing companies expand and others go bankrupt
• Organismal systems that maintain certain essential properties in order to survive (such as homeostasis, autonomy)
Question: Think of another example of an adaptive system

Modelling with agents
If a system has to be controlled, we usually use the agent metaphor.
An agent receives information about the state of the environment through inputs, and chooses an action on the basis of its internal state and the input.
The action leads to a change in the state of the environment.
Furthermore, an agent is autonomous: it determines its own action and can weigh certain subgoals against each other.
An agent can also communicate with other agents; in that case we speak of a social agent in a multi-agent system.
The agent can have goals, which can be modelled with a reward function.

Model for adaptive systems
An adaptive system that interacts with an environment can be modelled with:
• A time element t ∈ {1, 2, 3, ...}
• The state of the environment at time t: S(t)
• The input the agent receives at time t: I(t)
• An internal state at time t: B(t)
• A set of possible actions: A, with A(t) the action at time t
• A policy that maps the input and internal state to an action of the system:
Π(I(t), B(t)) → A(t)
• A transition rule that maps the state of the environment and the action of the system to a next state of the environment:
T(S(t), A(t)) → S(t + 1)
• A reward function for the system:
R(I(t), B(t), A(t)) → R(t) or
R(S(t), A(t)) → R(t)
• An update function for the internal state of the system:
U(I(t), B(t), A(t)) → B(t + 1)

Causal relations in the model
We see certain causal relations in the model:
[Figure: causality in time, relating S, I, B, A and R at time t to S, I, B and A at time t+1]
[Figure: causal graph over S, I, B, A and R]

Reward function
An agent can have certain goals that it has to reach. For this one could use qualitative goals.
Another possibility is to give a quantitative reward signal to evaluate an action of the agent.
In logic one usually uses goals, and in decision theory usually utilities or reward signals.
The goal of the agent is to maximize the sum of the rewards it receives in the future by using a certain policy:
$\sum_{t=0}^{\infty} \gamma^{t} R_t$
Here 0 ≤ γ ≤ 1 is the discount factor, which determines to what extent long-term rewards count.

Intelligent agents
An intelligent agent can perceive (through sensors), reason, predict, and act (through actuators).
A rational agent acts to maximize its performance measure, e.g. reaching a goal with the lowest possible cost.
An autonomous agent acts on the basis of its own experience. The agent therefore does not execute a fixed algorithm, but uses its observations to steer (and possibly adjust) its behavior.
An agent is driven by a program that runs on an architecture (computer, hardware).
The program, the architecture, and the environment determine the behavior of the agent.
Question: To what extent does the architecture determine the behavior of an agent?

Total System Perspective
To understand the interaction between the agent and the environment, it is important to look at the whole system.
An example: forest fire control. The entities that play a role are trees, bulldozers, airplanes, fire, smoke columns, the weather, etc.
Here the bulldozers and airplanes are the (controllable) autonomous agents.
Sometimes it can be hard to abstract from reality: we do not want to include all details, but we do want to model a realistic interaction between agent and environment.
Question: Which entities play a role in running a restaurant? What are the agents?

An example: the familiar heater
Consider a heater that has to regulate the room temperature.
The heater has a thermometer to measure the room temperature. This is the input of the system.
The heater has the actions: Heat or Do nothing.
The temperature of the room drops when the heater is off and rises when the heater is on.
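The heater example can be simulated directly with the model components Π, T, U and R. The Python sketch below is an illustration under stated assumptions, using the numbers given for the heater in this lecture: the policy heats at or below 21 degrees, stops above 23 degrees and otherwise keeps doing what it was doing; heating raises the temperature by 0.1 per time step and doing nothing lowers it by 0.05; the reward is −|I − 22|. The initial internal state (Do nothing), the discount factor γ = 0.99, the number of steps and all function names are assumptions made for this example.

```python
HEAT, DO_NOTHING = "Heat", "Do nothing"


def policy(i, b):
    """Policy Pi(I(t), B(t)) -> A(t), written as the heater's if-then rules."""
    if i <= 21:
        return HEAT
    if i <= 23:
        return b  # between 21 and 23 degrees: keep doing what the heater was doing
    return DO_NOTHING


def transition(s, a):
    """T(S(t), A(t)) -> S(t+1): heating adds 0.1 degrees, otherwise the room loses 0.05."""
    return s + 0.1 if a == HEAT else s - 0.05


def reward(i):
    """R(I, *, *) = -|I - 22|: a penalty for deviating from 22 degrees."""
    return -abs(i - 22)


def simulate(s0=15.0, steps=400, gamma=0.99):
    """Run the agent-environment loop; return temperatures, actions and discounted return."""
    s, b = s0, DO_NOTHING          # assumed initial internal state
    temps, actions, ret = [s], [], 0.0
    for t in range(steps):
        i = s                      # the input equals the state of the environment
        a = policy(i, b)
        ret += gamma ** t * reward(i)
        s = transition(s, a)
        b = a                      # U(*, *, A(t)) -> B(t+1)
        temps.append(s)
        actions.append(a)
    return temps, actions, ret


if __name__ == "__main__":
    temps, actions, ret = simulate()
    for t in range(0, 401, 20):
        print(f"t={t:3d}  temperature={temps[t]:.2f}")
    print(f"discounted return: {ret:.2f}")
```

Starting from 15 degrees the heater first heats continuously and then oscillates between roughly 21 and 23 degrees; the internal state is what prevents it from switching on and off at every step inside that band.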
Model of the heater (1)
[Figure: the room and the heater, connected by the input (temperature) and the action]
The modelled state of the environment S(t) is the room temperature.
The input I(t) of the heater is in this case also the room temperature at time t.
As internal state B(t) the heater only stores whether it was already on or off (Heat or Do nothing).
The actions of the heater are: Heat, Do nothing.

Policy of the heater
The most important element is the policy, because it determines the interaction with the environment (and may have to be optimized).
A policy can be constructed by hand, but it can also be learned.
We can make a policy for the heater, e.g. consisting of if-then rules:
(1) If I(t) ≤ 21 Then Heat
(2) If I(t) > 21 and I(t) ≤ 23 and B(t) == Heat Then Heat
(3) If I(t) > 21 and I(t) ≤ 23 and B(t) == Do nothing Then Do nothing
(4) If I(t) > 23 Then Do nothing

Decision trees
We can also show this policy with a decision tree:
ROOT: input I(t) (temperature)
    I(t) <= 21: action HEAT
    I(t) > 21 and I(t) <= 23:
        B(t) = HEAT: action HEAT
        B(t) = DO NOTHING: action DO NOTHING
    I(t) > 23: action DO NOTHING

Model of the heater (2)
The update function for the internal state of the heater is simple and looks as follows:
U(∗, ∗, Heat) → Heat
U(∗, ∗, Do nothing) → Do nothing
A simple transition rule for the environment is as follows:
T(S(t), Heat) → S(t) + 0.1
T(S(t), Do nothing) → S(t) − 0.05
We only need the reward function for self-learning systems. We could set it to:
R(I, ∗, ∗) = −|I − 22|
So the system is penalized whenever the temperature deviates from 22 degrees.

Dynamics of the interaction
If we let the heater interact with the environment, starting from an initial temperature of, say, 15 degrees, we obtain the dynamics of the following quantities:
• The state of the environment
• The input of the heater (in this case equal to the state of the environment)
• The action of the heater
• The reward received
Example: temperature

Environments
There are many different environments; they can be characterized by the following features:
• Full information vs. partial information. Does an agent see the full state of an environment with its sensors, or only part of it?
[Figure: temperature (axis from 0 to 25) plotted against time (axis from 10 to 130)]
• Deterministic vs. non-deterministic. Is the next state of the environment uniquely determined by the current state and the action chosen by the agent, or is there a probability distribution over successor states?
• Episodic vs. non-episodic. Is the agent faced each time with a single stand-alone choice that is independent of subsequent actions, or is the agent in a world in which the whole sequence of actions matters?
• Static vs. dynamic. Does the world change while the agent is acting, or not? If the transitions between states change, the environment is dynamic and time plays a role. If the performance measure (reward function) changes, the environment is called semi-dynamic.
• Discrete vs. continuous. Is the state of the world represented discretely (e.g. chess) or not, e.g. a robot in a continuous environment with speed, x and y position, etc.
Partial-information, non-deterministic, non-episodic, dynamic, continuous environments are the hardest for optimizing agents.
Simulation is always possible, but a good predictive model is sometimes very hard to make (e.g. weather forecasting).
The complexity of the behavior of an agent often depends on the complexity of the environment (e.g. Simon's Ant).

Examples of environments
Environment                  Full information  Deterministic  Episodic  Static  Discrete
Chess with a clock           Yes               Yes            No        Semi    Yes
Chess without a clock        Yes               Yes            No        Yes     Yes
Poker                        No                No             No        Yes     Yes
Backgammon                   Yes               No             No        Yes     Yes
Taxi driving                 No                No             No        No      No
Medical diagnosis            No                No             No        No      No
Image analysis               Yes               Yes            Yes       Semi    No
Interactive English teacher  No                No             No        No      Yes
Question: think of an environment yourself and determine its characteristics.

The internal state
Often no internal state is used; such a reactive agent maps input directly to action.
In some cases the agent receives the same input in different states. A reactive agent's actions then cannot differ between those states, even if the optimal actions do. In such cases we use the internal state to disambiguate the current input.
The internal state summarizes the past; all previous inputs and actions can determine the current internal state.
If the action depends strongly on the internal state and weakly on the input, we can speak of an "introverted" agent.
In the case of psychoses, the internal state goes its own way and the belief in it is hard to refute with new inputs and actions.
Question: How does the yogi change his internal state through meditation?

Multi-agent systems
If there are several agents, we speak of a multi-agent system (MAS).
Although the whole system can in many cases also be modelled with 1 superagent, splitting the system into several decentralized agents has certain advantages:
• Robustness
• Speed (distributed computing)
• Easier extension or modification
• Information hiding (privacy)

Model of a multi-agent system
When we are dealing with a multi-agent system, we can model the individual agents as before (with inputs, actions, internal state, policy, reward function, update function).
In some cases the system is also extended with communication between agents.
The agents then have communication signals (a language) at their disposal and map inputs and internal state to a communication signal.
Such a communication signal can be sent to all agents (broadcasting), but also to a single agent.
Communication is important when the agents have to cooperate. The agents then have to be coordinated in some way.

Complex Adaptive Systems
Certain systems consisting of several entities are in some cases called complex adaptive systems.
The difference with multi-agent systems is that the focus is not so much on the control and optimization of a system, but more on simulation. Entities are often assumed to be non-rational here.
In complex adaptive systems, simple rules can generate complex behavior when several entities can interact with each other.
We then say that the overall dynamics emerges from the interaction between the entities and the environment.

Complex adaptive systems (continued)
Examples of processes that we can model with complex adaptive systems are:
• Traffic, consisting of many road users
• Forest fires, consisting of many trees that propagate the fire
• Infectious diseases, consisting of viruses and virus carriers
• Magnetism, consisting of particles that can be positively or negatively charged
• Ecologies, consisting of many species of animals that eat each other and produce offspring
• Economic markets, consisting of many different investors
Question: which of these processes can better be modelled with MAS?

Predator-prey system
A simple example of a system consisting of several entities is a predator-prey system.
The predator searches for prey to eat, and reproduces.
The prey itself also searches for food, tries to avoid the predator, and reproduces.
The populations of prey and predators depend on each other.
If there are many predators, the prey population declines.
If there are then few prey left, the predator population declines.
If there are few predators left, the prey population grows.

Lotka-Volterra equations
We can capture such a system with certain rules. A well-known model uses the Lotka-Volterra equations.
We call the size of the prey population x and the size of the predator population y.
Now we define the state S(t) = (x(t), y(t)) with:
x(t + 1) = x(t) + Ax(t) − Bx(t)y(t)
y(t + 1) = y(t) − Cy(t) + Dx(t)y(t)
Examples of dynamics:
[Figure: example population dynamics]
Question: Draw the dynamics on the x- and y-axes

Stable point
The dynamics can lead to a stable point, repeat itself periodically, or be chaotic (always changing).
For a stable point everything has to stay the same. So x(t+1) = x(t) and y(t+1) = y(t).
We can find a stable point (x(∗), y(∗)) from the equations:
x(∗) = x(∗) + Ax(∗) − Bx(∗)y(∗)    (1)
0 = A − By(∗)    (2)
y(∗) = A / B    (3)
y(∗) = y(∗) − Cy(∗) + Dx(∗)y(∗)    (4)
0 = −C + Dx(∗)    (5)
x(∗) = C / D    (6)
Steps (2) and (5) divide by x(∗) and y(∗) respectively, so they assume that both populations are non-zero.

Periodic cycles
In certain cases a repetition occurs in a system. We then speak of a periodic cycle.
We have a periodic cycle of length n if:
S(t) = S(t + n)
S(t + 1) = S(t + n + 1)
. . .
S(t + n − 1) = S(t + 2n − 1)
A system is chaotic if it does not end up in a stable point and there are no periodic cycles either. The system then never repeats itself.
It is often hard to detect whether a system is really chaotic; the cycle could be very long!
Question: do you think chaos is possible in a computer program?

Contents of this course
We will cover the following topics in the subsequent lectures:
• Cellular Automata
• Artificial Life
• Biological Adaptive Systems
• Robotics
• Co-evolution
• Evolutionary Computation (2 lectures)
• Learning Machines
1. Decision Trees
2. Bayesian Learning
3. Neural Networks
4. Self-organizing maps
• Reinforcement learning (3 lectures)

Practical Information
The assessment of the course consists of:
• 2 partial exams (each 40% of the final grade)
• Hand-in assignments (20% of the final grade)
• Practical assignments (a passing grade is required)
The hand-in assignments must be handed in within a week after they have been distributed.
Some of the practical assignments can be downloaded from the lokweb (www.ou.nl/lokweb).
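Returning to the Lotka-Volterra model of this lecture: the update rules for x(t+1) and y(t+1) and the stable point x(∗) = C/D, y(∗) = A/B can be checked numerically. The Python sketch below is an illustration only; the parameter values A = 0.5, B = 0.25, C = 0.5, D = 0.125 and the function names are assumptions chosen so that the arithmetic is easy to follow.

```python
def lotka_volterra_step(x, y, a, b, c, d):
    """One discrete-time update of the prey population x and the predator population y."""
    x_next = x + a * x - b * x * y
    y_next = y - c * y + d * x * y
    return x_next, y_next


def stable_point(a, b, c, d):
    """Non-trivial stable point of the update: x(*) = C / D, y(*) = A / B."""
    return c / d, a / b


def trajectory(x0, y0, steps, a, b, c, d):
    """Return the list of states S(0), S(1), ..., S(steps)."""
    states = [(x0, y0)]
    for _ in range(steps):
        states.append(lotka_volterra_step(*states[-1], a, b, c, d))
    return states


if __name__ == "__main__":
    params = dict(a=0.5, b=0.25, c=0.5, d=0.125)   # assumed example values
    xs, ys = stable_point(**params)
    print("stable point:", (xs, ys))
    print("one step from the stable point:", lotka_volterra_step(xs, ys, **params))
    for t, (x, y) in enumerate(trajectory(5.0, 2.0, 10, **params)):
        print(f"t={t:2d}  prey={x:.3f}  predators={y:.3f}")
```

Starting exactly at the stable point leaves both populations unchanged, while a start slightly away from it makes the populations move; plotting x against y for such a run is one way to approach the drawing question in the slides.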
Conclusion
We have gained insight into adaptive systems that interact with an environment.
Adaptive systems can be modelled well by agents with policies and an internal state.
If there are several agents, we speak of multi-agent systems (MAS). The agents in a MAS can often communicate with each other.
Many natural processes can be modelled with multi-agent systems (optimization) or complex adaptive systems (simulation).
The dynamics of a system results from the initial state and the rules that determine the new state.
The dynamics can lead to a stable point, show periodic behavior, or be chaotic.
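The distinction between a stable point and a periodic cycle can be made operational for any deterministic system with finitely many states: iterate the update rule, remember when each state was first seen, and stop at the first repeat. The Python sketch below is an illustration, not part of the original slides; find_cycle and its parameters are hypothetical names. It returns the length of the transient period and the cycle length n for which S(t) = S(t + n), where n = 1 corresponds to a stable point.

```python
def find_cycle(step, initial, max_steps=10_000):
    """Iterate `step` from `initial` until a state repeats.

    Returns (transient, period), or None if no state repeats within max_steps.
    States must be hashable (e.g. tuples).
    """
    seen = {}          # state -> first time step at which it occurred
    state = initial
    for t in range(max_steps + 1):
        if state in seen:
            first = seen[state]
            return first, t - first
        seen[state] = t
        state = step(state)
    return None


if __name__ == "__main__":
    def flip(s):
        """Invert every bit: a system with a periodic cycle of length 2."""
        return tuple(1 - b for b in s)

    def all_on(s):
        """Switch every cell on: a system that reaches a stable point."""
        return tuple(1 for _ in s)

    print(find_cycle(flip, (0, 1)))
    print(find_cycle(all_on, (0, 1, 0)))
```

A result of None only means that no repeat was found within max_steps; as the lecture notes, a very long cycle is hard to tell apart from chaotic behavior.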
Reinforcement Learning
Marco A. Wiering
([email protected])
Intelligent Systems Group
Institute of Computing and Computing Sciences
Universiteit Utrecht
Abstract
This short survey article describes reinforcement learning as a method for learning to
control agents. First, the theoretical background of optimal control theory is
discussed. Next, the principles of reinforcement learning algorithms and the 3 most
important algorithms are described. Finally, we briefly discuss how RL algorithms can
be extended with exploration techniques and function approximators so that they can
be used for more practical problems.
1 Introduction
Reinforcement learning (RL) enables an agent to learn from its own experience. The
name refers to reinforcement in the sense of strengthening: if the agent does
something for which it receives a reward, it will try that behavior more often
afterwards. The goal of an agent is to collect as much reward as possible over the
long term. To do so, the agent has to try out and evaluate different behaviors. By
learning from trying out different actions, the agent can at some point learn which
behavior is optimal. For this, value functions are used, which indicate how good it is
for the agent to be in a particular state of the world and how good it is to then
perform a particular action. Figure 1 illustrates the interaction of an RL agent with
its environment. The agent receives inputs from the environment and uses them to
select its action. After performing its action, the agent receives a reward signal and
moves to a new state of the world.
[Figure: the environment sends input and reward to the agent; the agent sends an action to the environment]
Figure 1: An RL agent interacting with its environment
Reinforcement learning is a growing research field. It has, for example, been used to
learn the game of Backgammon (Tesauro, 92). Initially a neural network was randomly
initialized and used as an evaluation function for board positions, so at first the
network played arbitrary moves. After each game played, the evaluation function was
adjusted by an RL algorithm. After playing more than 1 million games against itself,
which in 1992 took three months on an RS6000 computer, the program had become so good
that it ranked among the best players in the world. Reinforcement learning has also
been used to control robots, to learn to play chess, to control elevators in a
simulated building, for traffic light control, etc.
The theory of reinforcement learning goes back to the theory of optimally
controlling systems in a known environment. This theory of "optimal control" uses
dynamic programming to compute an optimal solution given a problem specification.
What makes reinforcement learning interesting is that the environment can initially
be completely unknown: the agent itself has to learn the effects of its actions and
also what the goal is in this environment. An RL agent will therefore start by
executing random actions to gather information. Once the agent has received a
reward, it could of course simply repeat the behavior that led to the reward, but
this is of little use for two reasons: (1) the environment is usually stochastic, so
repeating the chosen actions does not always lead to the goal again; (2) the agent
wants to learn an optimal controller, and initial trials that reach the goal are
usually far from optimal. The agent therefore has to explore and try out other
actions as well.
The theory of reinforcement learning tells us that an agent will eventually become
optimal if a number of conditions are met. "Eventually" means here that the agent
must have tried all actions in all states infinitely often. Although this is
impossible in practical applications, learning can often be stopped after a finite
number of trials, at which point the controller is usually already quite good.
Trying longer can improve the agent further, but an optimal solution can never be
guaranteed in finite time. This is due to the stochasticity of the environment. The
agent will never know the exact probabilities of the true transition function: what
is the probability that the agent ends up in a particular new state after executing
a particular action in a particular state? If this probability is, for example,
0.543 and the agent tries the action 1000 times, the estimated probability might be
0.527, but the agent never knows it exactly. In some cases all transition
probabilities must be known exactly in order to learn the optimal behavior, and this
simply takes an infinite amount of time. Nevertheless, many studies show that it is
perfectly possible to stop an agent after a finite time. In most cases a good or
nearly optimal solution has then been found.
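The effect of a finite number of tries can be illustrated with a small simulation.
This is only a sketch: the function name estimate_probability and the seeds are
assumptions made for this example, and the true probability 0.543 is the value used
in the text above.

```python
import random

def estimate_probability(true_p, n_trials, seed=0):
    """Estimate a transition probability as a relative frequency over n_trials tries."""
    rng = random.Random(seed)
    # Each try "succeeds" (reaches the particular next state) with probability true_p.
    hits = sum(1 for _ in range(n_trials) if rng.random() < true_p)
    return hits / n_trials

if __name__ == "__main__":
    for n in (100, 1000, 100000):
        # The estimate gets closer to 0.543 as n grows, but is never guaranteed exact.
        print(n, estimate_probability(0.543, n))
```

The estimate fluctuates around the true value with a standard deviation that
shrinks proportionally to one over the square root of the number of tries, which is
why an exact value would require infinitely many tries.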
In this short survey we first discuss how the problem is modeled as a Markov
decision problem (MDP) and how dynamic programming algorithms can compute optimal
solutions for it. Section 3 then covers reinforcement learning: it discusses the
principles of RL algorithms and describes three different algorithms. Section 4
briefly describes how the agent can explore in order to cover the whole state space
and how the agent can use function approximators to learn value functions for very
large or continuous state spaces. Section 5 concludes this survey.
2 Dynamic Programming
Dynamic programming is used in a variety of applications, such as vision,
bioinformatics, and optimal control theory. It typically works by repeatedly
updating a function over a state space until a resulting (optimal) function is
found. We first discuss how to model the control problem we are interested in.
2.1 Markov Decision Problems
Markov decision problems (MDPs) consist of a number of discrete states, a number of
actions that can be executed in the different states, a transition function that
specifies to which state the world will go when the agent executes a particular
action in a state, and a reward function that specifies how much reward the agent
receives when it executes an action in a particular state. More formally, an MDP
consists of:
1) A set of possible states of the world: S = {s1, s2, ..., sN}
2) A set of actions that can be executed: A = {a1, a2, ..., aM}
3) A transition function that gives the probability of going to a new state s(t+1)
after executing action a(t) in state s(t). This probability is denoted
P(s(t), a(t), s(t+1)). The transition function must be specified for all actions in
all states and is usually stationary (independent of time).
4) A reward function that specifies how much reward r(t) the agent receives for
executing action a(t) in state s(t), after which the agent moves to state s(t+1).
This reward function is denoted R(s(t), a(t), s(t+1)).
5) A discount factor γ between 0 and 1 that weighs immediate rewards against rewards
received later. It ensures, for example, that an agent prefers to win a chess game
as quickly as possible rather than after a large number of moves.
6) A time index t ∈ {1, 2, ..., T}, where T may also be infinite.
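The components listed above can be written down as a small data structure. The
following is a minimal sketch, not a prescribed representation: the class name MDP,
its field names, and the two-state example with its probabilities and rewards are
all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S
    actions: list       # A
    transitions: dict   # (s, a) -> {s_next: P(s, a, s_next)}
    rewards: dict       # (s, a, s_next) -> R(s, a, s_next)
    gamma: float        # discount factor

    def check(self):
        """Verify that P(s, a, .) is a probability distribution for every (s, a)."""
        for s in self.states:
            for a in self.actions:
                total = sum(self.transitions[(s, a)].values())
                assert abs(total - 1.0) < 1e-9, (s, a, total)

    def step(self, s, a, rng=random):
        """Sample s(t+1) from P(s, a, .) and return it with R(s, a, s(t+1))."""
        successors = self.transitions[(s, a)]
        s_next = rng.choices(list(successors), weights=list(successors.values()))[0]
        return s_next, self.rewards[(s, a, s_next)]

# Hypothetical two-state, two-action example.
toy = MDP(
    states=["s1", "s2"],
    actions=["a1", "a2"],
    transitions={
        ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
        ("s1", "a2"): {"s1": 1.0},
        ("s2", "a1"): {"s2": 1.0},
        ("s2", "a2"): {"s1": 0.5, "s2": 0.5},
    },
    rewards={
        ("s1", "a1", "s1"): 0.0, ("s1", "a1", "s2"): 1.0,
        ("s1", "a2", "s1"): 0.0,
        ("s2", "a1", "s2"): 0.0,
        ("s2", "a2", "s1"): 0.0, ("s2", "a2", "s2"): 2.0,
    },
    gamma=0.9,
)
toy.check()
```

Storing only the successors with non-zero probability keeps the table small when
most transitions are impossible, which is the usual situation in maze-like problems.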
The agent should now execute a sequence of actions that maximizes its discounted
sum of rewards. The agent uses its policy function π(.) to map states to actions:
$$a(t) = \pi(s(t))$$
Initially the optimal policy is unknown. The goal is to find the policy that
maximizes the expected discounted sum of rewards:
$$\pi^*(\cdot) = \arg\max_{\pi(\cdot)} E\left[\sum_t \gamma^t R(s(t), \pi(s(t)), s(t+1))\right]$$
Here we consider all possible policy functions and examine how much discounted
reward the agent receives on average by following a particular policy. The operator
E is the expectation operator, which averages over all possible trajectories through
the state space. The amount of reward is the (discounted) sum of all rewards
obtained during all future steps (actions). Clearly we cannot simply compute this
optimal policy directly. We could in principle compute the expected reward sum of
every possible policy, but the number of policies is exponential: there are $M^N$
deterministic policies, where M is the number of actions available to the agent and
N is the number of states.
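To get a feeling for this number, consider a hypothetical maze with N = 20 states
and M = 4 actions (values chosen here only for illustration, they do not come from
the text):
$$M^N = 4^{20} = 2^{40} = 1\,099\,511\,627\,776 \approx 1.1 \times 10^{12}$$
Evaluating more than a trillion policies one by one is clearly impractical, even
for such a small problem.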
2.2 Dynamic Programming
We can compute the optimal policy with dynamic programming algorithms. We first
introduce two value functions. The state value function V(s) gives the expected sum
of future rewards when the agent is in state s and policy π is used to select
actions. The action value function Q(s,a) gives the expected sum of future rewards
when the agent is in state s and selects action a, after which it continues with
policy π. These functions serve to compare different policies with each other and
can be computed recursively. First of all, we want V(s) to take the following value:
$$V(s) = E\left[\sum_t \gamma^t R(s(t), \pi(s(t)), s(t+1)) \,\middle|\, s(0) = s\right]$$
In other words, V(s) is the expected discounted sum of future rewards. We can now
use a local equation to determine Q(s,a). We observe that the value of an action in
a state must equal the immediately received reward plus the discounted sum of
rewards obtained from the new state onward. Furthermore, there is a probability
distribution over successor states. With this knowledge we can determine Q(s,a) as
follows:
$$Q(s,a) = \sum_{s'} P(s,a,s') \left( R(s,a,s') + \gamma V(s') \right)$$
This lets us determine Q if we know V. We do not know V yet, but we can express V
recursively in terms of the Q-values of the different actions: the value of a state
equals the Q-value of the best action in that state. We can therefore determine V as
follows:
$$V(s) = \max_a Q(s,a)$$
Finally, we can determine the policy once we know the Q-values of the different
actions. The best action in a state is the action with the highest Q-value in that
state:
$$\pi(s) = \arg\max_a Q(s,a)$$
We now have three equations for determining V, Q, and π. However, they all depend
on each other: if we do not know the value of a state, we cannot yet determine the
value of an action that has some probability of leading to that not-yet-evaluated
state. We solve this by iterating. The Bellman equation (Bellman, 57) tells us that
the optimal values Q* and V* must satisfy:
$$Q^*(s,a) = \sum_{s'} P(s,a,s') \left( R(s,a,s') + \gamma V^*(s') \right),$$
where V* can again be expressed in terms of Q*:
$$V^*(s) = \max_a Q^*(s,a)$$
This means that there is a fixed point in the space of value functions. Once we
have found the optimal functions Q* and V*, further iteration no longer changes
anything. Figures 2a and 2b illustrate this: they show the optimal policy and value
function for a small maze problem in which every step costs 1 point and reaching
the goal is rewarded with 10 points. Note that there is only one optimal V-function,
but there can be several optimal policies (some states have more than one optimal
action).
Figure 2a: The optimal policy in a maze.
Figure 2b: The optimal value function in this maze.
How can we compute this optimal policy and these value functions? We start with an
initial Q- and V-function and iterate until nothing changes anymore. This is done,
for example, by the value iteration algorithm:
1) Initialize V(s) and Q(s,a) for all states s and actions a. Usually V(s) = 0 and
Q(s,a) = 0 are used for all states and actions.
2) Repeat steps 3-5 until the value function no longer changes, or hardly changes.
3) Update the Q-function for all states and actions using the following equation:
$$Q(s,a) = \sum_{s'} P(s,a,s') \left( R(s,a,s') + \gamma V(s') \right)$$
4) Update the V-function:
$$V(s) = \max_a Q(s,a)$$
5) Update the policy:
$$\pi(s) = \arg\max_a Q(s,a)$$
This algorithm returns the optimal Q*, V*, and policy, although in general only in
the limit of infinitely many iterations. In practice the iteration is stopped when
the value function V hardly changes anymore, i.e. when
$|V_{t+1}(s) - V_t(s)| < \epsilon$ for all states s, where ε is a small positive
value. We then obtain a sub-optimal solution that is better the smaller ε is.
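The five steps above can be sketched in a few lines of code. This is a minimal
illustration under stated assumptions, not the implementation used for Figure 2:
the five-state corridor, the function names, and the choice γ = 1 are hypothetical.
As in the maze example, every step costs 1 point and entering the goal is rewarded
with 10 points.

```python
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    """P[(s, a)] maps successor states to probabilities; R(s, a, s2) is the reward."""
    V = {s: 0.0 for s in states}                      # step 1
    while True:                                       # step 2
        Q = {(s, a): sum(p * (R(s, a, s2) + gamma * V[s2])
                         for s2, p in P[(s, a)].items())
             for s in states for a in actions}        # step 3
        new_V = {s: max(Q[(s, a)] for a in actions) for s in states}  # step 4
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < eps:                               # stop when V hardly changes
            break
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}  # step 5
    return Q, V, policy

# Hypothetical corridor: states 0..4, state 4 is an absorbing goal.
STATES = [0, 1, 2, 3, 4]
ACTIONS = ["left", "right"]
GOAL = 4

def successor(s, a):
    if s == GOAL:
        return GOAL
    return max(s - 1, 0) if a == "left" else min(s + 1, GOAL)

P = {(s, a): {successor(s, a): 1.0} for s in STATES for a in ACTIONS}

def R(s, a, s2):
    if s == GOAL:
        return 0.0                       # nothing more to gain in the goal
    return 10.0 if s2 == GOAL else -1.0  # goal reward, otherwise step cost

Q, V, policy = value_iteration(STATES, ACTIONS, P, R, gamma=1.0)
```

With γ = 1 the iteration still terminates here because the goal is absorbing and
every other step has a cost; for general problems a discount factor below 1 is the
safer choice.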
2.3 Dynamic Programming as a Tool for Planning
If the model with transition probabilities and rewards is given a priori, dynamic
programming algorithms can compute the optimal solution. For path-planning problems
this is often much more efficient than conventional path planners such as
breadth-first or depth-first search. In principle the algorithm resembles A* or
Dijkstra's shortest-path algorithm, but these standard algorithms cannot deal with
probability distributions or with positive rewards (negative costs), which makes DP
algorithms more generally applicable. An advantage of DP is that all information
needed to select an action is contained in the policy. So if the Q- and V-functions
are accurate, nothing needs to be planned during the execution of an agent; the
agent can reactively select and execute an action in each state. There is thus a
trade-off between the accuracy of the value functions and the need for a planning
or lookahead strategy that examines which sequences of actions can be taken. A good
example that makes this distinction clear is a game-playing program such as a chess
program. A chess program consists of an evaluation function and a method for
searching moves ahead. If the evaluation function is perfect, there is no need to
search deeper than 1 move. For chess programs, however, it is practically
impossible to have a perfect evaluation function because of the enormous number of
states, so lookahead remains necessary. For certain problems such as path planning
or navigation in stationary worlds, lookahead is not necessary and we can use a
fast reactive agent. Unfortunately, DP is only possible if the model is given a
priori. If that is not the case, reinforcement learning algorithms have to be
applied. This is discussed in the next section.
3 Reinforcement Learning
Reinforcement learning (Kaelbling et al., 96; Sutton and Barto, 98) enables an agent
to learn from its interaction with an environment. Initially the agent is placed in
a start position and knows nothing about the transition probabilities and rewards.
Note that the agent therefore does not know the goal initially; all it wants is to
maximize its sum of future rewards, but it has to learn by itself how to do so.
After a trial has started, the agent obtains information to learn from after every
action (we again first assume a discrete state/action space):
1) The agent is in state s(t).
2) The agent selects action a(t) = π(s(t)).
3) The agent executes the action, moves to state s(t+1), and receives reward r(t).
4) The agent updates its value functions using the information
(s(t), a(t), s(t+1), r(t)).
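The four steps of this interaction loop can be sketched as follows. The corridor
environment, the function names, and the cap on the number of steps are assumptions
made for this example; the learning update itself is left as a placeholder because
it depends on the RL algorithm used.

```python
import random

GOAL = 4  # hypothetical corridor with states 0..4; the trial ends in state 4

def environment_step(state, action):
    """Actions are -1 (left) and +1 (right); the walls keep the agent in 0..4."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward

def run_trial(policy, start_state=0, max_steps=1000):
    """Run one trial and return the experiences (s(t), a(t), s(t+1), r(t))."""
    experiences = []
    state = start_state                                        # step 1
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = policy(state)                                 # step 2
        next_state, reward = environment_step(state, action)   # step 3
        experiences.append((state, action, next_state, reward))
        # Step 4: an RL algorithm would update its value functions here.
        state = next_state
    return experiences

rng = random.Random(0)
random_trial = run_trial(lambda state: rng.choice([-1, 1]))
```

A policy that always moves right, run_trial(lambda s: 1), reaches the goal in four
steps, whereas the random policy above typically needs many more.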
When an RL algorithm is used, the agent gradually learns to select actions that
yield a high long-term sum of rewards. There are three conventional RL algorithms
for updating the value functions: (1) Q-learning, (2) Monte Carlo sampling, and
(3) model-based RL. We discuss all three.
3.1 Q-learning
Q-learning (Watkins, 89) updates a single Q-value after every step. After a step
has been made and the information (s(t), a(t), s(t+1), r(t)) is known, Q-learning
makes the following learning step:
$$Q(s(t), a(t)) = (1 - \alpha)\, Q(s(t), a(t)) + \alpha \left( r(t) + \gamma V(s(t+1)) \right)$$
Here V(s(t+1)) is the maximum of Q(s(t+1), a) over all actions a, as in Section 2.2,
and α is the learning rate, which has a value between 0.0 and 1.0. In essence, the
Q-learning rule shifts the Q-function slightly after each step to take the latest
experience (step) into account. Note how close the Q-learning rule is to the DP
algorithms: the probability distribution has been replaced by a learning rate. If α
decreases gradually such that certain conditions are met, Q-learning converges to
the optimal Q*-function provided that all state/action pairs are tried infinitely
often. Q-learning has been used, among other things, to control elevators in a
simulated building. For this, Q-learning was combined with neural networks (Crites
and Barto, 96). The resulting algorithm was able to control 4 elevators better than
conventional controllers and reduced the total waiting time by 15%.
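The Q-learning rule can be tried out on a tiny problem. The sketch below is an
illustration under assumptions, not a reference implementation: the five-state
corridor, the parameter values, and the purely random behavior during learning are
choices made for this example.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical corridor with states 0..4
ACTIONS = ["left", "right"]

def step(state, action):
    """Every step costs 1 point; entering the goal is rewarded with 10 points."""
    move = -1 if action == "left" else 1
    next_state = min(max(state + move, 0), GOAL)
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward

def q_learning(n_trials=500, alpha=0.5, gamma=0.9, seed=0, max_steps=1000):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_trials):
        s = 0
        for _ in range(max_steps):
            if s == GOAL:
                break
            a = rng.choice(ACTIONS)   # random behavior, chosen for simplicity
            s_next, r = step(s, a)
            # V(s(t+1)) is the value of the best action in the next state.
            v_next = max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_next)
            s = s_next
    return Q

Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

Even though the agent behaves randomly while learning, the greedy policy derived
from the learned Q-function moves right in every state, because the update uses the
value of the best action in the next state rather than the action actually taken.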
Q-learning only learns one step back at a time; updates therefore propagate only
slowly from a goal location back to a start location. Suppose, for example, that an
agent has to learn that a goal location lies a hundred steps to the right. During
the first trial the agent moves randomly (it has not learned anything yet). When
the goal location is found and the agent receives a reward for it, only the last
step is updated by the final positive reward. This can make learning slow. One way
to speed up learning is Q(λ)-learning (Peng and Williams, 96), which we do not
discuss further here.
3.2 Monte Carlo Sampling
The term Monte Carlo sampling covers a variety of sampling methods. In essence,
these methods generate outcomes and average them to obtain an estimate of the true
value (or of the probability of something). Monte Carlo sampling for RL works as
follows: the agent performs a trial and postpones learning until the trial is over
(it is thus an off-line learning agent). During the trial the agent records exactly
which states it has seen, which action it executed in each of them, and how much
reward it received. After the trial has ended, a cumulative reinforcement signal is
first computed for every time step t until the last time step T of the trial:
$$R(t) = \sum_{i=t}^{T} \gamma^{i-t} r(i)$$
Then all (s(t), a(t)) pairs that occurred are updated using:
$$Q(s(t), a(t)) = (1 - \alpha)\, Q(s(t), a(t)) + \alpha R(t)$$
The Q-function is thus shifted toward the sum of rewards observed in the trial. A
distinction can be made between first-visit and every-visit Monte Carlo methods.
The first-visit method makes one update per trial, only for the first occurrence of
an (s, a) pair, whereas the every-visit method makes an update for every time
(s, a) occurred in a trial. If we want a true average as the outcome (we then
compute the average over all trials in which (s, a) occurred), we can achieve this
by letting the learning rate decrease as follows:
$$\alpha = 1 / N(s,a)$$
where N(s,a) is the number of times the pair (s,a) has been updated. Monte Carlo
sampling has the advantage that the whole future is used for the update. It may
therefore seem that Monte Carlo sampling converges faster to the optimal Q-function
than Q-learning. This is often not the case, however, because the variance of the
updates is much larger. Although the Monte Carlo update has no bias (the current
Q-function is not used in the update target at all), its variance is very high,
because the whole stochastic future is used in the update and this future can
produce many different outcomes. Since the error of the Q-function decomposes into
bias and variance, Monte Carlo sampling can have a large error in its current
estimate because of the high variance of its updates.
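The computation of the cumulative reinforcement signal and the averaging update can
be sketched as follows. The function names and the example trial are assumptions
made for illustration; the learning rate 1 / N(s,a) is the one given above, with
N(s,a) counted including the current update.

```python
from collections import defaultdict

def discounted_returns(rewards, gamma):
    """Compute R(t) for every step t by sweeping backward through the trial."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R(t) = r(t) + gamma * R(t+1)
        returns[t] = running
    return returns

def monte_carlo_update(Q, N, trial, gamma, first_visit=False):
    """Update Q from one finished trial, given as a list of (state, action, reward)."""
    returns = discounted_returns([r for _, _, r in trial], gamma)
    seen = set()
    for (s, a, _), ret in zip(trial, returns):
        if first_visit and (s, a) in seen:
            continue                             # first-visit: one update per pair
        seen.add((s, a))
        N[(s, a)] += 1
        alpha = 1.0 / N[(s, a)]                  # yields a true running average
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * ret

Q = defaultdict(float)
N = defaultdict(int)
# Hypothetical trial: two steps that cost 1 point, then a reward of 10 points.
trial = [(0, "right", -1.0), (1, "right", -1.0), (2, "right", 10.0)]
monte_carlo_update(Q, N, trial, gamma=0.5)
```

The backward sweep computes all returns of a trial in a single pass, which avoids
recomputing the sum of future rewards for every time step separately.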
Another problem is that exploratory actions can have a strongly disturbing effect
in Monte Carlo sampling. Exploration is needed (see also the next section) to learn
the optimal Q-function, but a lot of exploration makes the Monte Carlo feedback
biased by the exploration. Because exploration sometimes selects bad or sub-optimal
actions, the obtained sum of rewards can be much lower than what the current best
policy would have obtained. Q-learning does not suffer from this and is therefore
called off-policy learning: its update uses only the immediate reward of the
executed action and the value of the best action in the next state, so exploratory
actions taken later do not disturb the update.
3.3 Model-based Reinforcement Learning
The third and last reinforcement learning algorithm is based on the use of a model.
The agent estimates the transition probabilities and the rewards on the transitions
and then uses dynamic programming-like algorithms to update the value functions.
This often results in a learning algorithm that needs far fewer experiences to
learn a good or the optimal Q-function. The algorithm does require the transition
function to be stored, however, so its space complexity is higher than that of
Q-learning or Monte Carlo sampling.
In model-based reinforcement learning we use a number of counters to estimate the
transition and reward functions from the agent's experiences. We introduce the
following counters:
C(s, a) = number of times action a has been executed in state s.
C(s, a, s') = number of times action a has been executed in state s and the agent
made a step to state s'.
RT(s, a, s') = sum of the immediate rewards received after action a was executed in
state s and a step was made to state s'.
From these counters (which may be real numbers), we can estimate the transition and
reward functions as follows:
$$P'(s, a, s') = C(s, a, s') / C(s, a)$$
and
$$R'(s, a, s') = RT(s, a, s') / C(s, a, s')$$
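The counters and the two estimates can be sketched as a small class. The class and
method names are assumptions made for this example, and returning 0.0 for
transitions that have never been observed is a choice made here, not something the
text prescribes.

```python
from collections import defaultdict

class EstimatedModel:
    """Maximum-likelihood model built from the counters C(s,a), C(s,a,s'), RT(s,a,s')."""

    def __init__(self):
        self.C_sa = defaultdict(float)    # C(s, a)
        self.C_sas = defaultdict(float)   # C(s, a, s')
        self.RT = defaultdict(float)      # RT(s, a, s')

    def update(self, s, a, s_next, r):
        """Process one experience (s(t), a(t), s(t+1), r(t))."""
        self.C_sa[(s, a)] += 1
        self.C_sas[(s, a, s_next)] += 1
        self.RT[(s, a, s_next)] += r

    def P(self, s, a, s_next):
        """Estimated transition probability P'(s, a, s')."""
        n = self.C_sa.get((s, a), 0.0)
        if n == 0:
            return 0.0                    # assumption: unseen pairs get probability 0
        return self.C_sas.get((s, a, s_next), 0.0) / n

    def R(self, s, a, s_next):
        """Estimated reward R'(s, a, s')."""
        n = self.C_sas.get((s, a, s_next), 0.0)
        if n == 0:
            return 0.0                    # assumption: unseen transitions get reward 0
        return self.RT[(s, a, s_next)] / n

model = EstimatedModel()
# Hypothetical experiences for one state/action pair.
for s_next, r in [(1, 2.0), (1, 4.0), (0, 0.0), (1, 0.0)]:
    model.update(0, "a", s_next, r)
```

After these four experiences the estimated probability of reaching state 1 is 3/4
and the estimated reward on that transition is (2 + 4 + 0) / 3 = 2.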
After updating the transition and reward functions after each step, we can use
dynamic programming algorithms (value iteration) to compute a new value function
right away. This is very slow, however, because the whole Q-function is recomputed
on the basis of a single change to the model. Two algorithms have been devised to
speed this up: real-time dynamic programming and Prioritized Sweeping (Moore and
Atkeson, 93). Real-time dynamic programming updates the Q-function for a limited
number of iterations (e.g. 1 or 5) so that each update does not take too long.
Prioritized Sweeping (PS) determines which Q-values are affected by the last update
and adjusts only those values. In essence PS works like a news flash: when a value
is updated somewhere, only the neighboring states are updated in turn. This yields
a very fast algorithm that performs the necessary updates for the Q-values that are
affected most and leaves the other Q-values untouched. Experiments have shown that
PS can learn a good estimate of the optimal Q-function much faster than Q-learning
or Monte Carlo sampling. Model-based RL can only be used, however, if the state
space can be discretized well and is not too large.
4 RL in Practice
We have now seen what three different RL algorithms look like. In practice we
combine these techniques with a few other methods to obtain a well-functioning
system. The first is the use of exploration to learn the optimal Q-function; the
second uses function approximators to deal with large or continuous state/action
spaces.
4.1 Exploration
If we always select actions with the current greedy policy π(.), we usually end up
in a local optimum. It may be that the Q-values of the actions chosen by the
current policy are higher than the current Q-values of alternative actions, so that
the alternatives are never tried. The goal of exploration is therefore to select
alternative actions from time to time in order to search for the optimal policy.
The simplest exploration method is the Max-random method (also known as ε-greedy).
It selects the current best action with probability (1 − ε) and a random
alternative action with probability ε. If the exploration rate ε slowly goes to 0,
more and more actions are executed according to the greedy policy, while there is a
lot of exploration at the beginning. Max-random is the most frequently used
exploration method. One can also use the Boltzmann exploration method, which
determines the probability of selecting an action given the Q-function as follows:
$$P(a(t) \mid s(t)) = \frac{\exp(Q(s(t), a(t)) / \tau)}{\sum_a \exp(Q(s(t), a) / \tau)}$$
Here τ is the (decreasing) temperature: as τ approaches 0 the greedy action is
selected, and as τ goes to infinity actions are selected uniformly at random.
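Both exploration rules can be sketched in a few lines. The function names are
assumptions made for this example; q_values is taken to be the list of Q-values of
all actions in the current state, indexed by action. Subtracting the maximum
Q-value before exponentiating is a standard trick for numerical stability that
leaves the Boltzmann probabilities unchanged.

```python
import math
import random

def max_random_action(q_values, epsilon, rng=random):
    """Max-random: greedy action with probability 1 - epsilon, else a random other one."""
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    others = [a for a in range(len(q_values)) if a != greedy]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return greedy

def boltzmann_probabilities(q_values, tau):
    """Selection probabilities proportional to exp(Q / tau)."""
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_action(q_values, tau, rng=random):
    """Sample an action index from the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs)[0]

# Hypothetical Q-values for three actions in some state.
q = [1.0, 2.0, 3.0]
cold = boltzmann_probabilities(q, tau=0.01)   # almost all mass on the greedy action
hot = boltzmann_probabilities(q, tau=1e6)     # almost uniform
```

Unlike Max-random, Boltzmann exploration takes the size of the Q-value differences
into account: actions that look only slightly worse than the greedy action are
tried more often than actions that look much worse.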
The two exploration methods above are undirected exploration methods: they do not
take into account what the agent has already visited or in which states the agent
has already gathered a lot of information. Directed exploration methods do use this
information. For example, the agent can always select the action that has been
selected least often, and once all actions in all states have been selected N
times, the agent can start behaving more greedily. For model-based learning there
are some efficient directed exploration algorithms for covering the whole state
space (Wiering and Schmidhuber, 98). These algorithms can considerably speed up the
learning of a (nearly) optimal policy. Research on exploration has also shown that
optimistic value functions, which estimate the value of states optimistically
(according to a statistical method), can yield very good exploration methods.
4.2 Function Approximation
If the state space is very large (e.g. because of multiple input dimensions) or
continuous, it is often infeasible to store all state/action pairs in a tabular
representation. Function approximators are used in such cases. This has the
following advantage: if a state has never been visited, the function approximator
may still have learned a Q-value for it through generalization, so the whole state
space does not need to be searched. Learning can thus be much faster with function
approximators, but a disadvantage is that the function approximator often has
trouble representing the exact optimal value function. Furthermore, the function
approximator may forget certain state values and may sometimes overgeneralize, so
that the precise value of a state/action pair is not learned well. There are
various possible function approximators: (1) neural networks are widely used for
game-playing programs and also for robots; (2) CMACs first discretize the state
space by splitting it into a set of discrete cells, after which the cells determine
the Q-value of a state. CMACs have been used, among other things, to learn
simulated multi-agent soccer; (3) self-organizing networks, decision trees, locally
weighted regression, etc. are also used in combination with RL.
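The idea behind a CMAC can be sketched for a one-dimensional continuous state. This
is a simplified illustration under assumptions, not the CMAC used in the cited
work: the class name, the number of tiles and tilings, and the evenly spaced
offsets are choices made for this example. Several shifted discretizations
(tilings) each contribute one cell, and the value of a state is the sum of the
weights of its active cells. For a Q-function one could keep one such approximator
per action.

```python
import math

class TileCodedValue:
    """CMAC-like approximator for a function of x in [low, high)."""

    def __init__(self, low, high, n_tiles=10, n_tilings=4):
        self.low = low
        self.width = (high - low) / n_tiles
        self.n_tilings = n_tilings
        # One extra cell per tiling because the shifted tilings overhang the range.
        self.weights = [[0.0] * (n_tiles + 1) for _ in range(n_tilings)]

    def _active_cells(self, x):
        """Return the one active cell (tiling index, cell index) of every tiling."""
        cells = []
        for k in range(self.n_tilings):
            offset = k * self.width / self.n_tilings
            index = int(math.floor((x - self.low + offset) / self.width))
            cells.append((k, index))
        return cells

    def value(self, x):
        return sum(self.weights[k][i] for k, i in self._active_cells(x))

    def update(self, x, target, alpha=0.1):
        """Shift the value at x a fraction alpha toward the target."""
        error = target - self.value(x)
        for k, i in self._active_cells(x):
            self.weights[k][i] += alpha * error / self.n_tilings

approx = TileCodedValue(0.0, 1.0)
for _ in range(200):
    approx.update(0.5, 1.0)   # hypothetical target value 1.0 at state x = 0.5
```

After training only at x = 0.5, a nearby state such as x = 0.56 shares some cells
with it and therefore gets a value between 0 and 1, while a distant state such as
x = 0.9 shares none and keeps the value 0. This is the generalization, and the risk
of overgeneralization, described above.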
Combining function approximators with RL poses no great problems when Q-learning or
Monte Carlo sampling is used. In that case the function approximator can be used
directly to make the updates. In the case of model-based RL, neural networks cannot
be used well, because learning the transition and reward functions with a neural
network causes serious problems: how many outputs should the network have if the
environment is stochastic? Even if this is a fixed, given number, the combination
of neural networks and model-based RL is often very slow and therefore, for
efficiency reasons, not always applicable.
An overview of research on the use of function approximators in combination with RL
can be found in (Kaelbling et al., 96).
4.3 Applications
RL has many different applications. In general it is used for control or
prediction, but there are also applications of RL to combinatorial optimization
problems, among them the Ant Colony Systems (Dorigo, 97). RL has already been
applied effectively to learning games (chess, checkers, backgammon) and is perhaps
the most promising method for learning a game by letting a program play against
itself. RL has also been used for network routing (Littman and Boyan, 93), in which
a number of packets must be transported over a connected network (compare internet
routing systems). RL systems can often deal well with the available resources or
with changes in the environment. RL has also been applied to elevator control
(Crites and Barto, 96) and traffic light control (Wiering, 00). For traffic light
control, RL was used to learn the waiting time of cars when a particular traffic
light is red and when it is green. Each car has a certain gain if its light is set
to green now, and this gain equals the waiting time given a red light minus the
waiting time given a green light. The traffic lights at a junction were then set so
as to maximize the total gain. There is now a traffic light simulator
(http://www.sourceforge.net/project/stoplicht) with which one can experiment with
different learning and static traffic light controllers. The preliminary results
show that RL can considerably improve traffic flow compared to fixed controllers.
RL is also used in combination with game theory to train rational agents that play
certain matrix games against each other and can learn from the outcome. For
example, Sandholm and Crites (1995) investigated whether RL agents could learn the
tit-for-tat strategy in the Prisoner's Dilemma, and this indeed turned out to be
the case. RL is also used to control robots, but because RL may need many trials,
data provided by hand in advance (via a joystick or a program) is often used so
that the starting point is already reasonable and the agent does not have much
trouble obtaining rewards now and then.
5 Conclusion
In this short survey we have described how reinforcement learning works. We first
described the theoretical background of optimal control and discussed dynamic
programming. Dynamic programming algorithms use value functions for states and
state/action pairs and can use them to compute an optimal policy if the model is
known. If the model is unknown, we can use reinforcement learning (RL) to learn a
policy through interaction with the environment. When RL is used with a tabular
representation, an agent learns the optimal policy under certain conditions if all
states are visited infinitely often. Although this is not a strong theoretical
guarantee (exploring infinitely long is infeasible in reality), agents often learn
a good policy after a relatively short time, which then gradually improves.
We have also briefly discussed the combination of RL with exploration methods and
function approximators. Current research in RL examines how problems can be scaled
up with even more efficient RL methods, and which function approximators are
suitable for solving particular problems. RL is also increasingly used in learning
multi-agent systems. Because RL is an interesting field with many possible new
applications, more and more people will start to use RL as a method for solving
their particular problem.
References
Bellman, 57: R. Bellman, Dynamic Programming. Princeton University Press, 1957.
Crites and Barto, 96: R. Crites and A. Barto, Improving Elevator Performance using
Reinforcement Learning. Advances in Neural Information Processing
Systems 8, pp: 1017-1023, 1996.
Dorigo, 97: M. Dorigo and L. Gambardella, Ant Colony System: A Cooperative
Learning Approach to the Traveling Salesman Problem. Evolutionary
Computation 1(1), pp: 53-66, 1997.
Kaelbling et al., 96: L. Kaelbling, M. Littman, and A. Moore, Reinforcement
Learning: a Survey. Journal of Artificial Intelligence Research 4,
pp: 257-285, 1996.
Littman and Boyan, 93: M. Littman and J. Boyan, A Distributed Reinforcement
Learning Scheme for Network Routing. First International Workshop on
Applications of Neural Networks to Telecommunication, pp: 45-51, 1993.
Moore and Atkeson, 93: A. Moore and C. Atkeson, Prioritized Sweeping:
Reinforcement Learning with less Data and less Time. Machine Learning
13, pp: 103-130, 1993.
Peng and Williams, 96: J. Peng and R. Williams, Incremental Multi-step Q-learning.
Machine Learning 22, pp: 283-290, 1996.
Sandholm and Crites, 95: T. Sandholm and R. Crites, On Multi-agent Q-learning in a
Semi-Competitive Domain. IJCAI'95 workshop: Adaption and Learning in
Multi-Agent Systems, pp: 164-176, 1995.
Sutton and Barto, 98: R. Sutton and A. Barto, Reinforcement Learning: an
Introduction. MIT Press, 1998.
Tesauro, 92: G. Tesauro, Practical Issues in Temporal Difference Learning.
Advances in Neural Information Processing Systems 4, pp: 259-266, 1992.
Watkins, 89: C. Watkins, Learning from Delayed Rewards. PhD thesis, King's
College, Cambridge, England, 1989.
Wiering and Schmidhuber, 98: M. Wiering and J. Schmidhuber, Efficient Model-based
Exploration. Sixth International Conference on Simulation of
Adaptive Behavior: From Animals to Animats 6, pp: 223-228, 1998.
Wiering, 00: M. Wiering, Multi-agent Reinforcement Learning for Traffic Light
Control. Seventeenth International Conference on Machine Learning, pp:
1151-1158, 2000.
