Hierarchical Decision Making using Spatio-Temporal
Abstraction in Reinforcement Learning
A Project Report
submitted by
PEEYUSH KUMAR
in partial fulfilment of the requirements
for the award of the degree of
DUAL DEGREE - BACHELOR & MASTER OF TECHNOLOGY
DEPARTMENT OF APPLIED MECHANICS
INDIAN INSTITUTE OF TECHNOLOGY MADRAS.
MAY 2013
THESIS CERTIFICATE
This is to certify that the thesis titled Hierarchical Decision Making using Spatio-Temporal Abstraction in Reinforcement Learning, submitted by Peeyush Kumar,
to the Indian Institute of Technology, Madras, for the award of the degree of Dual
Degree, is a bona fide record of the research work done by him under our supervision.
The contents of this thesis, in full or in parts, have not been submitted to any other
Institute or University for the award of any degree or diploma.
Balaraman Ravindran
Research Guide
Associate Professor
Dept. of Computer Science
IIT-Madras, 600 036
Place: Chennai
Date: 10th May 2013
M. Manivannan
Research Guide
Associate Professor
Dept. of Applied Mechanics
IIT-Madras, 600 036
ACKNOWLEDGEMENTS
I would like to thank my research advisor, Dr. B. Ravindran, for giving me the opportunity, freedom and encouragement to pursue this research, and for listening patiently to my many
harebrained ideas. I could not have imagined having a better advisor and mentor for my
study. He was truly inspiring and I consider myself very lucky to work with him.
I would also like to thank Dr. M. Manivannan for giving me enough freedom to pursue
my research and for the various interesting discussions I had with him.
I would also like to thank my colleagues Niveditha Narasimhan and Vimal Mathew
for their contribution relevant to the project. Their invaluable help is sincerely acknowledged.
ABSTRACT
KEYWORDS: Reinforcement learning; Markov decision process; Automated state abstraction; Metastability; Dynamical systems; Subtasks; Skills; Options
Reinforcement learning (RL) is a machine learning framework in which an agent manipulates its environment through a series of actions, receiving a scalar feedback signal
or reward after each action. The agent learns to select actions so as to maximize the
expected total reward.
The nature of the rewards and their distribution across the environment decides the
task that the agent learns to perform. Identifying useful structures present in a task often
provides ways to simplify and speed up learning algorithms, and enables such algorithms to generalize over multiple tasks without relearning policies for the entire task.
One approach to using the task structure involves identifying a hierarchical description
in terms of abstract states and extended actions between abstract states (Barto and Mahadevan, 2003). The decision policy over the entire state space can then be decomposed
over abstract states, simplifying the solution to the current task and allowing for reuse
of partial solutions in other related tasks.
This thesis introduces a general principle for automated skill acquisition based on
the interaction of a reinforcement learning agent with its environment. We use ideas
from dynamical systems to find metastable regions in the state space and associate them
with abstract states. Skills are defined in terms of transitions between such abstract
states. These skills or subtasks are defined independent of any current task and we
show how they can be efficiently reused across a variety of tasks defined over some
common state space. We demonstrate empirically that our method finds effective skills
across a variety of domains.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
NOTATION
1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Organization
2 Overview
3 Background
  3.1 A Reinforcement Learning problem and its related random-walk operator
  3.2 Spectral Graph Theory
  3.3 A Spectral Clustering Algorithm: PCCA+
    3.3.1 Numerical Analysis
    3.3.2 Options
  3.4 Taxi Domain
  3.5 The Infinite Mario
    3.5.1 States
    3.5.2 Actions
    3.5.3 Rewards
4 Spatial Abstraction
  4.1 Utility of PCCA+
5 Temporal Abstraction
  5.1 Subtask options
    5.1.1 Initiation Set I
    5.1.2 Option Policy µ
    5.1.3 Termination Condition β
6 PCCA+HRL: An Online Framework for Task Solving
  6.1 Interpretation of subtasks identified
  6.2 Extending the Transition Model to include the underlying Reward Structure
7 Experiments
  7.1 2 room Domain
  7.2 Taxi Domain
  7.3 Mario
8 Conclusion and Future Directions
A PCCA+ on other domains
  A.1 Image Segmentation
  A.2 Text Clustering
  A.3 Synthetic Datasets
LIST OF TABLES

A.1 Normalized Cut Values for PCCA+ and N-Cut (lower is better)
A.2 Accuracy Measure for TFTD2
A.3 Accuracy Measure for Reuters
A.4 Purity measure for Document Clusters
A.5 Details of the used Reuters21578 top 7 Categories
A.6 Macro connectivity information for Reuters21578 top 7 Categories
A.7 Purity measure for Synthetic Dataset
LIST OF FIGURES

1.1 2 room domain
2.1 3 room world illustration domain
2.2 Membership Functions for the 3 room world domain in Figure 2.1
2.3 Sample Option Model
2.4 Policy over meta states. Paths marked in red are part of the optimal policy
2.5 Solution to the illustration problem in Figure 2.1
3.1 Taxi world domain
3.2 A Mario visual scene
4.1 Simplex First order and Higher order Perturbation
4.2 PCCA+ Numerical analysis for 3 room world domain in Figure 2.1
4.3 Comparison of clustering algorithms on the maze domain
4.4 Comparison of clustering algorithms on a room-inside-room domain
4.5 20 Metastable States. The state space partitioning identifies 20 metastable states. For every passenger position and destination pair the algorithm identifies 1 metastable state, hence 5×4 gives 20 metastable states. The subtasks which can be assigned are a) Pickup if the taxi is on the pickup location, b) Putdown if the taxi is on the destination location, and c) Navigate when the taxi is far from the destination or the pickup location, depending on what subtask is being solved
4.6 Segmentation of the Physical Space in the Taxi Domain
5.1 Two episodes with different starting states. A skill in this context is defined as a transition between rooms. Skills shown by the down arrow are not learned again; instead they are reused from the first task
5.2 Policy for option going from Abstract State 1 to Abstract State 2 for the domain in Figure 2.1
5.3 Room inside a room: the policy for the domain in Figure 4.4
5.4 A few more examples of policies composed using PCCA+HRL
5.5 Termination Condition for option going from Abstract State 1 to Abstract State 2 for the domain in Figure 2.1
7.1 Experiments
7.2 CMACs involve multiple overlapping tilings of the state space. Here we show two 5 × 5 regular tilings, offset and overlaid over a continuous, two-dimensional state space. Any state, such as that shown by the dot, is in exactly one tile of each tiling. A state's tiles are used to represent it in the Sarsa algorithm described above. The tilings need not be regular grids such as those shown here. In particular, they are often hyperplanar slices, the number of which grows sub-exponentially with the dimensionality of the space. CMACs have been widely used in conjunction with reinforcement learning systems
7.3 Mario Domain: cumulative goal completions
A.1 Partitioning into multiple segments. Comparison with N-Cut
A.2 Membership Functions for some clusters
A.3 Segmentation for some other images
A.4 Clustering on Synthetic Datasets using PCCA+
A.5 Simplex identified by PCCA+ for the spiral dataset in Figure A.4b. The data points are shown as colored dots and are clustered around the vertices of the transformed simplex (shown as black dots). The original basis is shown in black lines
ABBREVIATIONS

RL      Reinforcement Learning
MDP     Markov Decision Process
HRL     Hierarchical Reinforcement Learning
AI      Artificial Intelligence
N-Cut   Normalized Cut
PCCA    Perron Cluster Analysis
NOTATION

$R^a_{ss'}$   Reward achieved while transitioning from state s to state s' while taking action a
$P^a_{ss'}$   Probability of reaching state s' after a single time step, starting from state s and taking action a
β             Termination Condition
µ             Option Policy
V             Value Function
Q             Action Value Function
W             Adjacency Matrix
D             Valency Matrix
Ỹ             Vertices of the Simplex
χ             Membership Matrix
m             Membership function
γ             Discount factor
ε             Simplex Deviation factor
L             Laplacian Matrix
T             Transition Matrix
λ             Eigenvalue
A             Set of Actions
S             Set of States
φ             Transition Counts
δ             Temporal Difference Error
$R^o_{ss'}$   Discounted Cumulative Reward achieved while transitioning from state s to state s' while taking option o
$P^o_{ss'}$   Probability of reaching state s' at the termination of option o, starting from state s and taking option o
CHAPTER 1
Introduction
Autonomous agents are used in a variety of settings like hazardous tasks, repetitive
tasks, assistive tasks, etc. Such systems have had many successful applications, for example the Google car, autonomous helicopters, games, and software agents. Learning is a crucial component
of autonomy. Any autonomous agent is required to plan and learn decision policies for
controlled action execution in the environment. There has been a lot of work on learning systems in the Artificial Intelligence community. Reinforcement Learning (Sutton
and Barto, 1998) is a whole branch dedicated to developing such systems. It is a very
powerful way to design autonomous agents, because of its generalizing and adaptive capabilities. Learning about structure present in the environment makes RL agents effective learners. In this thesis we use the reinforcement learning framework to build such
agents which can perform autonomously by identifying and using structures present in
the environment while learning policies to solve a given task.
Figure 1.1: 2 room domain
Consider a simple example (Figure 1.1) consisting of two rooms connected by a door. Assume that locations in this environment are described by discrete states as viewed by a
learning agent. The agent can take actions North, South, East and West to move around,
with each action causing a change in the state of the agent. The set of all states could
be partitioned based on the room to which each state belongs, with each room corresponding to an abstract state. The environment can then be described, at a high level,
in terms of two interconnected states and transitions between these states. A transition
between the two rooms would correspond to a sequence of actions over the lower-level
states, leading to a temporal abstraction. Therefore, a set of temporal abstractions can
be defined in terms of transitions between abstract states. In the example given, the
temporal abstractions correspond to going from one room to another (for example, go
to room 2 from room 1 or vice-versa). Further, the aggregation of states as we move to
a higher-level implies that the environment can be described in terms of fewer (abstract)
states. Solving a task over this smaller state space has computational advantages. By
repeating this process of spatio-temporal abstraction, a hierarchy of abstractions can be
defined over the environment. Researchers have worked on this framework of using
abstraction to simplify problem solving for quite some time (Knoblock, 1990; Dean
and Lin, 1995).
1.1 Motivation
The environment contains a lot of structural information. As humans we detect such
structures all the time, in order to perform efficiently in the environment. For example,
consider a simple task of getting a coffee from the snacks room which is located down
the corridor. This environment contains a lot of useful structures, like the door to enter the corridor, the coffee table, the coffee machine, etc. Conventional controlled artificial dynamical systems act by learning sequences of primitive actions at individual states in the state space (the policy is of the form: at state s take action a). Although such policies are adequate for smaller domains, they prove to be highly inefficient and unnecessary for larger domains and real-world problems which contain a lot of structural information. Moreover, defining a generalization that transfers such policies across domains proves to be a highly difficult task. Hence we need an approach that addresses these issues.
Markov decision processes (MDPs) are widely used in reinforcement learning (RL)
to model controlled dynamical systems as well as sequential decision problems (Barto
et al., 1995). Many researchers have developed algorithms that determine optimal or
near-optimal decision policies for MDPs. Such algorithms are generally very subjective
and task specific. Moreover algorithms that identify decision policies on MDPs scale
poorly with the size of the task. The structure present in a task often provides ways
to simplify and speed up such algorithms. Identifying useful structures often provides
ways to generalize such algorithms over multiple tasks without relearning policies for
the entire task. One approach to using the task structure involves identifying a hierarchical description in terms of abstract states and extended actions between abstract
states (Barto and Mahadevan, 2003). The decision policy over the entire state space can
then be decomposed over abstract states, simplifying the solution to the current task
and allowing for reuse of partial solutions in other related tasks. While there have been
several frameworks proposed for hierarchical RL (HRL), in this work we will focus
on the options framework, though our results generalize to other frameworks as well.
We propose a framework, PCCA+HRL, that exploits the structural properties of the
underlying transition model while returning methods to generate option models. We
show how such generative models speed up learning by removing the requirement of
learning options in MDPs.
Hierarchical decomposition exploits task structure by introducing models defined by
stand-alone policies (known as macro-actions, temporally-extended actions, options, or
skills) that can take multiple time steps to execute. Skills can exploit repeating structure by representing subroutines that are executed multiple times during solution of a
task. If a skill has been learned in one task, it can be reused in other tasks that require execution of the same subroutine. Options also enable more efficient exploration
by providing high-level behavior that enables a decision maker to look ahead to the
completion of the associated subroutine. One of the crucial questions in HRL is where
do the abstract actions come from. There has been much research on automated temporal abstraction. One approach is to identify bottlenecks (McGovern, 2002; Şimşek
et al., 2005), and cache away policies for reaching such bottle neck states as options.
Another approach is to identify strongly connected components of the MDP using clustering methods, spectral or otherwise, and to identify access-states that connect such
components (Menache et al., 2002; Meuleau et al., 1998). Yet another approach is to
identify the structure present in a factored state representation (Hengst, 2002; Jonsson
and Barto, 2006), and find options that cause what are normally infrequent changes in
the state variables. While these methods have had varying amounts of success they
have certain crucial deficiencies. Bottleneck-based methods do not have a natural way of identifying in which parts of the state space the options are applicable without some
form of external knowledge about the problem domain. Spectral methods need some
form of regularization in order to make them well conditioned, and this can lead to
arbitrary splitting of the state space (e.g., see Figure 4.3).
Exploiting spatial structures in the environment provides an autonomous way to
learn object oriented decision policies. Usually the environment consists of regions
(which we will denote as macro states or metastable states) which are very well connected with sparse connections to other densely connected regions. Spatial abstraction
provides a way to group such states which are very well connected. The idea here is
that for densely connected states with low mixing times, it is easy for an autonomous
agent to find its way around the region. Hence, given a mechanism to generate good
options within these regions, the learning problem reduces to finding decision policies
across such metastable states instead of learning policies over the entire state space.
1.2 Objective
In this work we propose a robust approach for automated skill acquisition based on the
interaction of a RL agent with its environment. PCCA+HRL addresses all the aforementioned issues associated with previous work. This framework detects well-connected or
metastable regions of the state-space using PCCA+, a spectral clustering algorithm. We
then propose a very effective way of composing options using the same framework to
take us from one metastable region to another. Using these options we use SMDP Value
learning to learn a policy over these subtasks to solve the given task. One important contribution of this work is that while exploiting the structural aspects of the state space,
we propose a way of composing options that are robust across tasks. Indeed we see in
chapter 7 that this method of composing options while exploiting the structure present
in the state space gives us a distinct advantage over all the other previous works.
The additional significance of PCCA+HRL is that it does not assume any model of
the MDP. The spatial abstraction is detected on sampled trajectories as shown in Chapter 5. We observe that even with limited sampling experience our method was able
to learn reasonably good skills. Such abstractions are particularly useful in situations
where exploration is limited by the environment costs. PCCA+HRL also provides a way
to refine these abstractions for every consecutive run without explicitly reconstructing
the entire MDP. Since we do not need to learn options, the proposed framework allows us to do this effectively, as shown in Section 5. Moreover, with PCCA+HRL we are able to run PCCA+ on sampled trajectories to find suitable spatial abstractions, so the computational time complexity decreases drastically; it is usually quite expensive for other spectral-clustering-based methods.
PCCA+HRL gives us several advantages:
1. The clustering algorithm produces characteristic functions that describe the degree of membership for all states belonging to a given metastable region. These characteristic functions give us a powerful way to naturally compose options. Chapter 5 shows how we use these characteristic functions to define the subtask
dynamics.
2. The PCCA+ algorithm also returns connectivity information between the metastable
regions. This allows us to construct an abstract graph of the state space, where
each node is a metastable region thus combining both spatial and temporal abstraction meaningfully. This graph serves as the base for deriving good policies
very rapidly when faced with a new task. While it is possible to create such
graphs with any set of options, typically some ad-hoc processing has to be done
based on domain knowledge (Lane and Kaelbling, 2002). Still, the structure of
the graphs so derived would depend on the set of options used, and may not reflect
the underlying spatial structure completely.
3. Since the clustering algorithm in PCCA+HRL looks for well-connected regions and not bottleneck states, it discovers options that are better aligned to the
structure of the state space.
4. PCCA+HRL can acquire skills on the sampled trajectories. Hence, it does not
require any prior model of the MDP. This is particularly useful in situations where the environment model is not available and exploration is costly.
5. Low computation time complexity as compared to other spectral clustering based
methods. This is because the time complexity is a function of the length of the sampled trajectory rather than the size of the entire state space.
6. PCCA+HRL allows us to reuse policies across similar subtasks.
7. Since PCCA+HRL only requires transition information gathered through sampled trajectories, it can be used with symbolic state representations of domains, as shown in the Mario game (see Section 3.5).
This approach is based on the ideas from conformal dynamics. Dynamical systems
may possess certain regions of states between which the system switches only once in a
while. The reason is the existence of dynamical bottlenecks which confine the system
for very long periods of time in some regions in the phase space, usually referred to as
metastable states. Metastability is characterized by the slow changing dynamics of the
system. Analogously, a subtask can be associated with a particular dynamics, where the distinction between subtasks is defined in terms of the dynamics associated with each subtask. Hence, if the agent knows a good way of solving a particular subtask (in other words, if the agent exhibits a dynamics which takes it to the relevant dynamical bottleneck soon enough), we can learn a recursively optimal policy to solve the entire task (Chapter 6).
1.3 Organization
The rest of the paper is organized as follows: Section 2 gives an overview of PCCA+HRL
using a simple example on a 2 room world domain. Section 3 gives the background on
the problem PCCA+HRL attempts to solve while explaining various terminologies used
in this paper. In the subsequent sections, we specify a complete algorithm for detecting metastable regions and for efficiently discovering options for different sub-tasks by
reusing experience gathered while estimating the model parameters. It should be noted
that our approach for option discovery can be dependent on the underlying reward structure or just on the transition structure of the MDP. We demonstrate the utility of using
PCCA+ in discovering metastable regions, as well as empirically validate PCCA+HRL
on several RL domains like the Taxi Domain, and the Infinite Mario Project. We observe
in Section 7.2 that PCCA+HRL is able to acquire good skills for the taxi domain which
are quite intuitive. Further we observe in Section 7.3 that PCCA+HRL is able to acquire skills to learn object oriented policies for Mario to solve the entire task. Joshi et al.
(2012) show the significance of learning object-oriented policies. The contribution of
our work is that PCCA+HRL acquires and learns such policies entirely autonomously.
We do not know of any other work in hierarchical reinforcement learning which could do
this as effectively. Another advantage of learning object oriented policies is the added
capability to reuse policies to solve other subtasks. This is indeed what we observe in
the Mario gameplay, as PCCA+HRL reuses policies across different frames when faced with
a similar subtask. This capability also allows us to use a smaller feature representation
of the state space while aliasing similar states with the same symbol.
CHAPTER 2
Overview
Figure 2.1: 3 room world illustration domain
In this section we give a brief overview of the working of PCCA+HRL. Let us consider a simple 3 room domain as shown in Figure 2.1 for illustration purposes. The world is composed of rooms separated by walls, marked in black. For the current task, let us assume that the goal of the agent is to start from the tile marked S and reach the goal tile marked G. The agent can move a step north, east, west, or south. PCCA+HRL starts by constructing a transition model using a sampled
trajectory. We will assume for the current illustration that the agent samples a trajectory
using a uniform random walk but later in section 5 we introduce a more sophisticated
sampling method based on the idea of UCT (Kocsis and Szepesvári, 2006). We then
use this sampled random trajectory to find suitable spatial abstractions using the spectral clustering algorithm PCCA+. PCCA+ returns a set of membership functions, one for
every identified abstract state. We plot these membership functions in Figure 2.2. We
observe that PCCA+HRL segments the state space into rooms defined by degrees of
membership of states to rooms. One important point to note here is that the rooms are
very well connected within, with low mixing times, while there are evidently very few
connections across rooms. Such connections have conventionally been labeled as bottlenecks (McGovern, 2002; Şimşek et al., 2005). Note that our method does not explicitly look for bottlenecks; their identification is intrinsic to the way PCCA+HRL identifies spatial abstractions. This is useful when spatial abstractions must be defined based on the underlying reward structure along with the transition information. Identifying such spatial abstractions now reduces the
problem of learning policies for transitioning from each state (for example, take a step North or take a step South) to learning policies for transitioning from each abstract state (for example, go to room 2). As one can infer, such policies are object-oriented policies, where the objects are special structures detected in the environment.
(a) Room 1    (b) Room 2    (c) Room 3    (d) World Segmentation
Figure 2.2: Membership Functions for the 3 room world domain in Figure 2.1
We now compose options using membership functions as demonstrated in Section 5.
One of the options composed using this method is shown in Figure 2.3. The figure
shows the stochastic option policy as arrows, where the length of each arrow is proportional to its respective probability. The colored portion is the initiation set of the option, with the background gradient showing the termination condition of the option (a lighter background implies termination with higher probability, β_white = 1).

Figure 2.3: Sample Option Model

Figure 2.4: Policy over meta states. Paths marked in red are part of the optimal policy

One important
point to note here is that the option model is not learned but computed directly from the
membership functions. Though there are no guarantees on the optimality of the option
policy constructed, it is observed that these option models are generally close to
optimal and sometimes optimal as shown in Figure 2.5. This is particularly useful for
large domains as we see in Section 7. Now given such option models we could use any
of our favorite learning methods over the abstract states to compute a recursively optimal
decision policy. For our experiments we use the SMDP Value learning (Sutton et al.,
1999), unless otherwise specified. For this problem the estimated SMDP policy, which
is a trivial one, is shown in Figure 2.4. The entire solution to the problem is shown
in Figure 2.5. Figure 2.5a shows the solution to reach the abstract state containing the
goal state. Since we only used the transition information to find structures in the state
space, the spatial abstraction does not give us enough information to reach the
goal state once the agent is inside the abstract state containing the goal state. Once the
agent is inside the abstract state containing the goal state, it can use any MDP solver to
reach the goal state; this is an easy problem to solve since all abstract states are densely
connected regions. We also plot the entire solution in Figure 2.5b when PCCA+HRL
uses the transition as well as the reward structure of the underlying MDP, as discussed
in Section 5. This way we detect 4 abstract states with the fourth meta state being the
lone goal state itself. Also this way the option models for the abstract state containing
the goal state could be directly constructed from the information extracted through the
structures identified in the state space. The interesting observation here is that we could
solve this entire problem with absolutely no learning (except the decision to choose the
option to go from abstract state 2 to 3 rather than the option to go from abstract state 2 to
1), while any flat RL method would have to learn the entire decision policy to go from
the starting state to the goal state. These kind of scenarios are frequently seen in large
domains where, if not all, a subset of the problem could be solved without explicitly
learning decision policies.
(a) Policy composed from transition structure only    (b) Policy composed from transition and reward structure
Figure 2.5: Solution to the illustration problem in Figure 2.1
CHAPTER 3
Background
In this section, we provide a background to the problem that our framework attempts to
solve, and we introduce notation of concepts that we use throughout the paper.
3.1 A Reinforcement Learning problem and its related random-walk operator
Markov Decision Processes (MDPs) are widely used to model controlled dynamical
systems (Barto et al., 1995). Consider a discrete MDP M = (S, A, P, R) with a finite
set of discrete states S, a finite set of actions A, a transition model P (s, a, s0 ) specifying
the distribution over future states s0 when an action a is performed in state s, and a
corresponding reward model R(s, a, s0 ), specifying a scalar cost or reward. The set of
actions possible at a state s is given by A(s).
A policy π : S × A → [0, 1] specifies a way of behaving. The state value function,
V π : S → IR, is the expected discounted return when following π. We denote by V ∗ =
maxπ V π the optimal value function. A policy that is greedy with respect to the optimal
value function is an optimal policy. Similarly, one can define action-value functions
Qπ , Q∗ , which allow the first action choice to be different from π.
Value functions obey recursive sets of equations, called Bellman equations. For
example, the state-value function V π obeys:
$$V^\pi(s) = \sum_{a} \pi(s, a)\Big[R(s, a) + \gamma \sum_{s'} P^a_{ss'} V^\pi(s')\Big], \quad \forall s \in S$$
Temporal-difference (TD) methods are incremental learning rules derived from the
Bellman equations, which can compute value functions directly without using a model.
For example, TD(0) performs the following update on every time step t:
$$V(s_t) \leftarrow V(s_t) + \alpha \delta_t$$
where $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ and $\alpha \in [0, 1)$ is a learning rate parameter, possibly
dependent on t. Under appropriate conditions, V → V π as t → ∞.
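A minimal sketch of this TD(0) update over one episode with a tabular value function; the environment interface (reset/step returning (next state, reward, done)) and the policy function are illustrative assumptions, not part of any specific domain described here.

```python
def td0_episode(env, policy, V, alpha=0.1, gamma=0.95):
    """One episode of tabular TD(0). V is an array (or dict) indexed by state."""
    s = env.reset()                           # assumed environment interface
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)         # assumed to return (s', reward, done)
        delta = r + gamma * V[s_next] - V[s]  # TD error delta_t
        V[s] += alpha * delta                 # V(s_t) <- V(s_t) + alpha * delta_t
        s = s_next
    return V
```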
The state-space can also be modeled as a weighted graph G = (V , W ). The states
S correspond to the vertices V of the graph. The edges of the graph are given by
E = {(i, j) : W ij > 0}. Two states s and s0 are considered adjacent if there exists
an action with non-zero probability of transition from s to s0 . This adjacency matrix
A provides a binary weight (W = A) that respects the topology of the state space
(Mahadevan, 2005). The weight matrix on the graph and the adjacency matrix W will
be used interchangeably from here onwards. A random-walk operator can be defined on
the state space of an MDP as T = D−1 W where W is the adjacency matrix and D is a
valency matrix, i.e., a diagonal matrix with entries corresponding to the degree of each
vertex (i.e., the row sums of W ). This random-walk operator models an uncontrolled
dynamical system extracted from the MDP.
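As a concrete illustration, the random-walk operator T = D^{-1} W can be built directly from a binary adjacency matrix; this is a small sketch using numpy, not tied to any particular domain above.

```python
import numpy as np

def random_walk_operator(W):
    """Random-walk operator T = D^{-1} W for an adjacency matrix W."""
    degrees = W.sum(axis=1)                             # row sums = vertex degrees
    D_inv = np.diag(1.0 / np.maximum(degrees, 1e-12))   # guard against isolated vertices
    return D_inv @ W

# Example: adjacency of a 3-cycle; every row of T sums to 1
W = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
T = random_walk_operator(W)
```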
3.2 Spectral Graph Theory
We want to detect suitable spatial structures in the environment so that we can define
temporal abstractions which are aligned to the structure of the underlying MDP. Algebraic methods have proven to be especially effective in treating graphs which are regular
and symmetric. It has been observed that the spectrum of a graph, constituted by its eigensystem, provides a suitable embedding which supplies a good picture of the underlying
planar graph. It is seen that eigenvalues are closely related to almost all major invariants of a graph, linking one extremal property to another. There are various algebraic
representations of a graph. We use the Laplacian (L) to represent a graph in a matrix form. The Laplacian allows a natural link between discrete representations, such
as graphs, and continuous representations, such as vector spaces and manifolds. The
most important application of the Laplacian is spectral clustering that corresponds to a
computationally tractable solution to the graph partitioning problem (Luxburg, 2007).
Hence it provides a very elegant way to detect suitable spatial abstractions in the environment. In the literature there is no unique convention as to which matrix exactly is called the "graph Laplacian". A good graph Laplacian has the following properties:
• For every vector $f \in {\rm IR}^n$ we have
$$f' L f = \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}(f_i - f_j)^2$$
• L is symmetric and positive semi-definite.
• The smallest eigenvalue of L is 0, the corresponding eigenvector is the constant
one vector 1.
• L has n non-negative, real-valued eigenvalues 0 = λ1 ≤ λ2 ≤ · · · ≤ λn.
For the current discussion we will use the normalized graph Laplacian defined as
follows L = D−1 (D − W ) because it is easy to compute this using the random walk
operator defined above. In fact, for all our purposes we can replace the Laplacian with
random walk operator as T = I − L, where I is the identity matrix.
3.3 A Spectral Clustering Algorithm: PCCA+
Given an algebraic representation of the graph representing an MDP, we want to find
suitable abstractions aligned to the underlying structure. We use a spectral clustering
algorithm to do this. There have been many successful applications of spectral clustering methods on real-world data and in other work on spatial abstraction (e.g., Shi
and Malik (2000); White and Smyth; Wolfe and Barto (2005); Shahnaz et al. (2006);
Cai et al. (2005), etc.). Central to the idea of spectral clustering is the graph Laplacian
which is obtained from the similarity graph (Luxburg (2007)). There are many tight
connections between the topological properties of graphs and the graph Laplacian matrices, which spectral clustering methods exploit to partition the data into clusters. In
this work we will use a clustering approach that attempts to exploit the structural properties in the configuration space of objects as well as the spectral sub-space. We take
inspiration from the conformal dynamics literature, where Weber et al. (2004) does a
similar analysis to detect conformal states of a dynamical system. They propose a spectral clustering algorithm PCCA+, which is based on the principles of Perron Cluster
Analysis of the transition structure of the system. We extend their analysis to detect
spatial abstractions in autonomous controlled dynamical systems.
Using this algorithm gives us various advantages:
• Global as well as local information to define spatial abstractions. It exploits the
local structural properties of the underlying data space by using pairwise similarity functions, while using spectral methods to encode the global structural
properties.
• A formal notion of macro states as vertices of a simplex in the eigen-subspace
of the Laplacian. The clustering is performed by minimizing deviations from
a simplex structure and hence does not require any arbitrary regularization term.
The clustering procedure does not assume anything about the underlying structure
and the mapping to a simplex is inherently built in the properties of the Laplacian.
• Characteristic functions that describe the degree of membership of each state to a
given abstract state. We can interpret the membership functions as the likelihood
of a state belonging to a particular abstract state (see Singpurwalla and Booker
(2004)). The algorithm could also generate crisp partitioning of the states into
abstract states, as and when required.
• Connectivity information between the abstract states, which is often required in
dynamical systems. For example one might be interested to know the connectivity
information between abstract states to learn decision policies across such states
as shown in Section 6.
3.3.1 Numerical Analysis
For a graph with disjoint components, i.e., one whose similarity matrix can be reduced to a block diagonal form, the Laplacian matrix L̃ (from here on we will refer to the random walk operator as the Laplacian, as described above) has a block structure, where each block is a matrix which corresponds to the Laplacian matrix for a disjoint set of vertices. Each vertex vi ∈ V can be mapped to the ith row of the eigenvector matrix Ỹnk, where n is the number of vertices and k is the number of eigenvectors corresponding to eigenvalue 1. As it turns out, the eigenvectors of the Laplacian matrix can
be interpreted as an indicator of membership for each object to a suitable disjoint set.
Lemma 1. Eigenvectors of the Laplacian L with a block diagonal structure form the
vertices of a simplex in IR^(k−1) (Weber et al., 2004).
Proof. Each block matrix has its own Laplacian L̂; since the rows of the Laplacian sum to 1, a vector with all identical elements is an eigenvector of this system.
Transforming this to the full Laplacian matrix, components of each of the eigenvectors
[Y1 , Y2 , . . . , Yk ] of L are pairwise identical for indices corresponding to the same block.
Regarding the rows of Y in IRk as k distinct points in IRk , they form the vertices of a
simplex because by definition the convex hull of k distinct points in IRk form a simplex
σ̂ k−1 .
We call these vertices in the eigenspace IR^k the (first-order) clusters Ck. The Laplacian of a system which exhibits connections across groups can be approximated as a perturbation of the disjoint case:
$$\tilde{L} = L + \epsilon L^{(1)} + \epsilon^2 L^{(2)} + \dots$$
where $L^{(1)}, L^{(2)}, \dots$ are respectively the first-order and higher-order Laplacian perturbation terms, and $\epsilon$ is the perturbation parameter. With the perturbation analysis of this equation, the perturbations of the eigenvectors and eigenvalues can similarly be written as
$$\tilde{Y} = Y + \epsilon Y^{(1)} + O(\epsilon^2)$$
$$\tilde{\Lambda} = \Lambda - \epsilon \Lambda^{(1)} - O(\epsilon^2)$$
Lemma 2. The first-order perturbation term $Y^{(1)} = YB$, with $B \in {\rm IR}^{k\times k}$, is a linear mapping ${\rm IR}^k \mapsto {\rm IR}^k$.
Proof. Consider the ith eigenvector:
$$\tilde{L}\tilde{y}_i = \tilde{\lambda}_i \tilde{y}_i$$
Writing it in terms of the perturbation expansion and matching terms of the same order (for the first-order perturbation), we get
$$L y_i^{(1)} + L^{(1)} y_i = \lambda_i y_i^{(1)} - \lambda_i^{(1)} y_i$$
therefore,
$$(L^{(1)} + \lambda_i^{(1)} I)\, y_i + (L - \lambda_i I)\, y_i^{(1)} = 0$$
Taking a dot product of this equation with another eigenvector $y_j$, we get
$$\big\langle (L^{(1)} + \lambda_i^{(1)} I)\, y_i,\; y_j \big\rangle + \big\langle (L - \lambda_i I)\, y_i^{(1)},\; y_j \big\rangle = 0$$
$$\big\langle (L^{(1)} + \lambda_i^{(1)} I)\, y_i,\; y_j \big\rangle + \big\langle y_i^{(1)},\; (L - \lambda_i I)\, y_j \big\rangle = 0$$
The second term goes to zero because $(L - \lambda_i I)\, y_j = 0$; hence the first term is zero as well. This implies that $(L^{(1)} + \lambda_i^{(1)} I)$ is a linear transformation which takes a vector $y_i$ and transforms it either perpendicular to itself or onto itself, because $\langle y_i, y_j \rangle = 0$. Also
$$y_i^{(1)} = (\lambda_i I - L)^{-1}(L^{(1)} + \lambda_i^{(1)} I)\, y_i$$
Hence we get $y_i^{(1)} = B y_i$.
This implies that the perturbation of the simplex structure can at most be of the order
O(ε²) (see Figure 4.1). In other words, the simplex structure is preserved under first-order perturbations, while higher-order perturbations deform it.
Hence we have here a formal definition of clusters, in the abstract notion, as vertices Ck
of this simplex structure.
Def 1. A vertex $v_i$ is said to belong to the cluster $C_k$ with perfect membership if $Y(i, :) = C_k$.
In soft clustering, a continuous indicator of membership $\tilde{\chi}_i : V \mapsto [0, 1]$ is used, which assigns a grade of membership between 0 and 1 to each vertex $v_i \in V$, for all i. Therefore, a vertex may correspond to different clusters with different grades of membership. For each vertex $v \in V$ the sum of the grades of membership with regard to the different clusters is 1, i.e.
$$\sum_{i=1}^{k} \tilde{\chi}_i(v) = 1$$
Each vertex is represented by a vector (χ1 (v), . . . , χk (v)) ∈ IRk . Since these vectors are positive and the partition of unity holds, they lie in the standard σk−1 simplex
spanned by the k unit vectors of IRk . Therefore, clustering can be seen as a simple
linear mapping from the rows of Y to the rows of a membership matrix χ̃. The linear
mapping is expressed by a regular k × k matrix A:
χ̃ = Ỹ A
This matrix maps the vertices of the simplex contained in the rows of Y onto the vertices
of the simplex $\sigma_{k-1}$. Therefore, if one finds the indices $\pi_1, \dots, \pi_k \in [1, N]$ of the vertices in Y, one can construct the linear mapping as follows:
$$A^{-1} = \begin{pmatrix} \tilde{Y}_{\pi_1,1} & \cdots & \tilde{Y}_{\pi_1,k} \\ \vdots & \ddots & \vdots \\ \tilde{Y}_{\pi_k,1} & \cdots & \tilde{Y}_{\pi_k,k} \end{pmatrix}$$
Weber et al. (2002) show that a solution for A exists if and only if the convex hull of the rows of Ỹ is a simplex. From perturbation analysis we know that this is the case with a deviation of order O(ε²). To partition the data, as and when required, we assign each state to a partition numbered $P(s) = \arg\max_{k=1}^{n} \tilde{\chi}_k(s)$.
To estimate the number of clusters k, the spectral gap is used as an indicator of
deviation from the simplex structure. Spikes in the eigenvalues indicate the presence of
a group structure in the graph.
The connectivity information between the various clusters can also be recovered
from the membership function.
Def 2. The connectivity information across different clusters is given by
$$L_{macro} = \tilde{\chi}^T \tilde{L} \tilde{\chi}$$
where $L_{macro}$ is the Laplacian in the macro space. In this representation each cluster is represented by a single node; the connectivity information across clusters is given by $L_{macro}(i, j)$ for i ≠ j, while the relative connectivity information within a cluster is given by $L_{macro}(i, i)$ for all i.
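As a small illustrative sketch (not the authors' implementation), given the membership matrix χ̃ with one row per state and one column per cluster, and the Laplacian L̃, both the crisp partition and the macro-level Laplacian of Def 2 are one-liners:

```python
import numpy as np

def crisp_partition(chi):
    """P(s) = argmax_k chi[s, k]: assign each state to its most likely cluster."""
    return np.argmax(chi, axis=1)

def macro_laplacian(chi, L):
    """Connectivity between metastable regions: L_macro = chi^T L chi."""
    return chi.T @ L @ chi
```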
3.3.2 Options
Our goal is to define temporally abstract policies (i.e., the ability to reason at a higher
level than just primitive actions) on the structures identified in the environment. We use
the framework of options, which can be viewed as fixed policies with preconditions and
termination conditions.
Formally an option is a triple (I, µ, β), where I ⊂ S is the initiation set of the
option, µ : S 7→ A is the option policy and β : S 7→ [0, 1] gives the probability of
termination in each state s ∈ S. An MDP endowed with an option set becomes a
Semi-Markov Decision Process (SMDP). In each SMDP time step, the agent selects an
option available at its current state and follows the option’s policy µ until termination
according to β.
For any Markov policy π : S × O 7→ [0, 1], Bellman equations for value functions
in the SMDP exist, and they are a direct extension of the Bellman equations in the MDP case, e.g.:
$$V^\pi(s) = \sum_{o \in O_s} \pi(s, o)\Big[R(s, o) + \sum_{s'} P^o_{ss'} V^\pi(s')\Big]$$
o∈Os
Here, R(s, o) is the expected discounted reward obtained during the execution of the
option:
$$R(s, o) = E\big[r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} \,\big|\, s_t = s, o_t = o\big]$$
where k is the duration of the option. $P^o_{ss'}$ is the option's transition model, defined similarly.
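A rough sketch of the kind of SMDP value-learning update used later with such options (in the spirit of Sutton et al., 1999): an option is executed until termination, the discounted return is accumulated, and a single update is applied with discount γ^k. The option and environment interfaces (policy, beta, id, step) are illustrative assumptions only.

```python
import numpy as np

def smdp_q_update(Q, s, option, env, alpha=0.1, gamma=0.95):
    """Execute one option to termination, then apply an SMDP Q-learning update.
    Q is a (num_states x num_options) table; option exposes policy(s), beta(s), id."""
    s0, ret, k = s, 0.0, 0
    done = False
    while not done:
        a = option.policy(s)                    # option policy mu
        s, r, env_done = env.step(a)            # assumed environment interface
        ret += (gamma ** k) * r
        k += 1
        done = env_done or np.random.rand() < option.beta(s)  # termination beta
    # In practice the max would range only over options available at s
    target = ret + (gamma ** k) * np.max(Q[s])
    Q[s0, option.id] += alpha * (target - Q[s0, option.id])
    return s
```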
3.4 Taxi Domain
Dietterich (1998) created the taxi task (Figure 3.1) to demonstrate MAXQ hierarchical
reinforcement learning. We use the episodic form of the same domain to illustrate our
framework. The taxi problem can be formulated as an episodic MDP with three state variables: the location of the taxi (values 1-25), the passenger location (values 1-5, where 5 means in the taxi), and the destination location (values 1-4). Figure
3.1 shows a 5-by-5 grid world inhabited by a taxi agent. There are four specially-
designated locations in this world, marked as R(ed), B(lue), G(reen), and Y(ellow).
The taxi problem is episodic. In each episode, the taxi starts in a randomly-chosen
square. There is a passenger at one of the four locations (chosen randomly), and that
passenger wishes to be transported to one of the four locations (also chosen randomly).
The taxi must go to the passenger’s location, pick up the passenger, go to the destination
location, and put down the passenger there. The episode ends when the passenger is
deposited at the destination location. There are six primitive actions in this domain: (a)
four navigation actions that move the taxi one square North, South, East, or West, (b) a
Pickup action, and (c) a Putdown action. There is a reward of -1 for each action and an
additional reward of +20 for successfully delivering the passenger. There is a reward of
-10 if the taxi attempts to execute the Putdown or Pickup actions illegally.
Figure 3.1: Taxi world domain
This task has a simple hierarchical structure in which there are two main sub-tasks:
Get the passenger and Deliver the passenger. Each of these subtasks in turn involves
the subtask of navigating to one of the four locations and then performing a Pickup or
Putdown action. The temporal abstraction is obvious, for example, the process of navigating to the passenger’s location and picking up the passenger is a temporally extended
action that can take different numbers of steps to complete depending on the distance to
the target. The top level policy (get passenger; deliver passenger) can be expressed very
simply if these temporal abstractions can be employed. Reusing policies is critical in
this domain. For example, if the system could learn how to solve the navigation subtask
once, then the solution could be shared by both the “Get the passenger” and “Deliver
the passenger” subtasks.
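For tabular experiments, the three state variables above are commonly flattened into a single index; the following small helper is an illustrative encoding (not taken from Dietterich's implementation), giving the well-known 500-state space.

```python
def taxi_state_index(taxi_loc, passenger_loc, destination):
    """Flatten (taxi 1-25, passenger 1-5 where 5 = in taxi, destination 1-4)
    into a single index in [0, 25 * 5 * 4)."""
    return ((taxi_loc - 1) * 5 + (passenger_loc - 1)) * 4 + (destination - 1)

n_states = 25 * 5 * 4   # 500 states in the episodic taxi MDP
```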
3.5 The Infinite Mario
Infinite Mario is a reinforcement learning domain developed for the Reinforcement Learning Competition 2009. It is a variant of Nintendo Super Mario Bros, a complete side-scrolling video game with destructible blocks, enemies, fireballs, coins, chasms, and platforms. It requires the player to move towards the right to reach the finish line, earning points and powers along the way by collecting coins, mushrooms, and fire flowers and by killing monsters. The game is interesting because it requires the agent to reason and learn at several levels, from representation to path planning and devising strategies to deal with various components of the environment. Infinite Mario has been implemented on RL-Glue, a standard interface that allows connecting reinforcement learning agents, environments, and experiment programs together. The agent is provided information about the current visual scene through arrays.
3.5.1 States
The state space has no set specification. Only a part of the game is visible at a given
time as shown in Figure 3.2. The visual scene is divided into a two-dimensional [16
x 22] matrix of 352 tiles. At any given time-step the agent has access to a char array
generated by the environment corresponding to the given scene. Each tile (element
in the char array) can have one of the many values that can be used by the agent to
determine if the corresponding tile in the scene is a brick, or contains a coin, etc. The
agent also has access to the locations of different types of monsters (including Mario) on
the scene, along with their vertical and horizontal velocities. The reward structure varies
across different instances of the game. Hence for any visual scene the total number of
possible states is 25^352, since each tile can take one of 25 possible values.
3.5.2 Actions
The actions available to the agent are the same as those available to a human player through a gamepad in a game of Mario. The agent can choose to move right or left or can choose
to stay still. It can jump while moving or standing. It can move at two different speeds.
Figure 3.2: A Mario visual scene
All these actions can be accomplished by setting the values of the action array.
3.5.3 Rewards
The reward structure varies across different MDPs. The agent earns a huge positive
reward when it reaches the finish line. Every step the agent takes in the environment,
it gets a small negative reward. The agent gets some negative reward if it dies before
reaching the finish line and gets some positive reward by collecting coins, mushrooms
and killing monsters. The goal is to reach the finish line which is at a fixed distance from
the starting point such that the total reward earned in an episode is maximized. Hence
in such a domain the underlying reward distribution might provide useful information
for structure identification during gameplay.
CHAPTER 4
Spatial Abstraction
We use a spectral clustering algorithm to find suitable abstractions on the sample trajectories. Spectral clustering was made popular by the works of Shi and Malik (2000)
(Normalized Cut Algorithm), Meila and Shi (2001), Kannan et al. (2000), etc. Although
these methods are known to have many successful applications, they typically work on
a case-by-case basis. Though the spectrum of the Laplacian preserves the structural properties of the graph, the methods used thereafter to cluster the data in the eigenspace of the Laplacian do not guarantee this. For example, Ng et al. (2001) and Shi and Malik (2000) use k-means clustering in the eigenspace of the Laplacian, which will only work if the clusters lie in disjoint convex sets of the underlying eigenspace. Meila and Shi (2001) use projections onto the largest k eigenvectors to partition the data into clusters, which does not preserve the topological properties of the data lying in the eigenspace of the Laplacian. This is because projection methods do not incorporate geometric constraints due to the underlying structure in the eigenspace: points closer in Euclidean distance may be far apart on the manifold.
In this work we want to use a clustering approach that exploits the structural properties in the configuration space of objects as well as the spectral sub-space, quite unlike
earlier methods. Hence we use PCCA+ (see Algorithm 1) as a spatial abstraction algorithm on sampled trajectories. Datasets are represented as an adjacency graph by
defining a pairwise similarity function between pairs of states. Using these similarity
functions for all the data points, we construct an adjacency matrix S. We construct the
Laplacian from the adjacency information. The spectrum of this Laplacian is computed, which encodes the structural properties of the underlying graph. We then find the best transformation of the spectrum, such that the transformed basis aligns itself with the clusters of data points in the eigenspace. Then we use a projection method described in Section 3.3.1 to find the membership of each of the datapoints to a set of special points lying on the transformed basis, which we identify as vertices of a simplex, as described in Section 3.3.1. The spectral gap method is used to estimate the number of clusters k (line 3 in Algorithm 1). This is used to find the simplex in the IR^k eigen-subspace.
Figure 4.1: Simplex First order and Higher order Perturbation
An important point to note here is that while the secondary clustering might look
similar to the one in N-Cut, it is in fact quite different from the N-Cut algorithm. N-cut
performs k-means clustering after projecting the data onto the top-k eigenvectors, while
we assign memberships to the identified vertices in the eigenspace. These vertices need
not coincide with the centroid of the data points.
Algorithm 1 PCCA+
1. Construct L from the similarity matrix S.
2. Compute the first n eigenvalues (in descending order) of L.
3. Choose the first k eigenvalues for which $(e_k - e_{k+1})/(1 - e_{k+1}) > t_c$ (spectral gap threshold). Compute the eigenvectors for the corresponding eigenvalues $(e_1, e_2, \dots, e_k)$ and stack them as column vectors in the matrix Y.
4. Denote the rows of Y as $Y(1), Y(2), \dots, Y(N) \in {\rm IR}^k$.
5. Define $\pi_1$ as the index for which $\|Y(\pi_1)\|_2$ is maximal. Define $\gamma_1 = \mathrm{span}\{Y(\pi_1)\}$.
6. For $i = 2, \dots, k$: define $\pi_i$ as the index for which the distance to the hyperplane $\gamma_{i-1}$, i.e. $\|Y(\pi_i) - \gamma_{i-1}\|_2$, is maximal. Define $\gamma_i = \mathrm{span}\{Y(\pi_1), \dots, Y(\pi_i)\}$. (Here $\|Y(\pi_i) - \gamma_{i-1}\|_2 = \|Y(\pi_i) - \gamma_{i-1}(\gamma_{i-1}^T\gamma_{i-1})^{-1}\gamma_{i-1}^T\, Y(\pi_i)\|_2$.)
Note that for the first order perturbation the simplex is just a linear transformation
around the origin, hence in order to find the vertices of the simplex σ k−1 , we need to find
the k points which could form the convex hull such that the deviation of all the points
from this convex hull is minimized. Hence, we start by finding the datapoint which is
farthest located from the origin (line 5 in Algorithm 1), say Ỹπ1 . Then we proceed by
finding the data point which is farthest located from the first point, say Ỹπ2 . We iterate
this procedure of finding the datapoint which is located farthest from the consecutive
hyperplane constructed by joining the previous data points, until we find k datapoints
(see line 6 in Algorithm 1). These datapoints form the vertices of the simplex. While this approach is superficially similar to that of Thurau et al. (2012), it is in fact very different, since they operate in the data space and we operate in the eigen-subspace.
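The steps of Algorithm 1 and the vertex search just described can be sketched numerically as follows. This is a simplified illustration only, assuming T is a row-stochastic random-walk operator; it is not claimed to match the implementation used for the experiments in this thesis.

```python
import numpy as np

def pcca_plus(T, tc=0.1):
    """Sketch of Algorithm 1 on a row-stochastic random-walk operator T."""
    vals, vecs = np.linalg.eig(T)
    order = np.argsort(-vals.real)                  # eigenvalues in descending order
    vals, vecs = vals.real[order], vecs.real[:, order]

    # Spectral gap: first k with (e_k - e_{k+1}) / (1 - e_{k+1}) > tc
    gaps = (vals[:-1] - vals[1:]) / (1.0 - vals[1:] + 1e-12)
    k = int(np.argmax(gaps > tc)) + 1
    Y = vecs[:, :k]                                 # rows Y(1..N) in IR^k

    # Simplex vertices: farthest point from origin, then farthest from the span so far
    pi = [int(np.argmax(np.linalg.norm(Y, axis=1)))]
    for _ in range(1, k):
        G = Y[pi].T                                 # columns span gamma_{i-1}
        proj = G @ np.linalg.pinv(G.T @ G) @ G.T    # projector onto that subspace
        dist = np.linalg.norm(Y - Y @ proj, axis=1)
        pi.append(int(np.argmax(dist)))

    chi = Y @ np.linalg.inv(Y[pi])                  # memberships: chi = Y A, A^{-1} = Y[pi]
    return chi, k
```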
For example, Figure 4.1 shows a 3-dimensional eigen-subspace with a σ² simplex.
The data points are shown as small black dots. Had the system been a completely
disjoint system with 3 disjoint sets, the simplex would be aligned with the axes, with the
clusters corresponding to the vertices of the simplex (the unit vectors). Since the system
is represented as perturbations around this disjoint system, the first order perturbation
linearly transforms the simplex, as shown. Because the data consists of higher order
perturbations, the datapoints do not exactly map to the vertices of the new simplex,
though their deviation is minimized from this simplex. The clusters for this system are
the vertices of the new simplex. Thus we have clustering as membership of datapoints
to the vertices of the new simplex.
For a graph with pronounced group structure, the datapoints will tend to cluster near
the vertices of the transformed simplex, while for a graph with high connectivity the
datapoints will spread out over the simplex plane. Hence this framework also contains
an intrinsic mechanism to return the information about goodness of clustering, which
is the distribution of the membership functions for a datapoint across various clusters.
Sharp peaks in the value indicate a good clustering, while a more uniform value indicates a weak group structure and hence a bad clustering.
For illustration we show in Figure 4.2 the simplex identified for the 3 room world illustration domain introduced above. We also show the states (marked as small colored dots) as data points in the eigenspace. The vertices of the simplex are shown as black dots. We also show the eigenvalue distribution for the same domain. As we can see, the eigenvalue distribution has its first jump at k = 3. This indicates the presence of a good segmentation threshold for 3-way partitioning.
(a) Simplex    (b) Eigenvalue distribution
Figure 4.2: PCCA+ Numerical analysis for 3 room world domain in Figure 2.1
(a) Domain    (b) PCCA+    (c) Kannan    (d) NCut
Figure 4.3: Comparison of clustering algorithms on the maze domain
(a) Domain
(b) PCCA+
(c) Kannan
(d) NCut
Figure 4.4: Comparison of clustering algorithms on a room-inside-room domain
Figure 4.5: 20 Metastable States. The state space partitioning identifies 20 metastable states. For every passenger position and destination pair the algorithm identifies 1 metastable state; hence 5×4 gives 20 metastable states. The subtasks which can be assigned are a) Pickup if the taxi is on the pickup location, b) Putdown if the taxi is on the destination location, and c) Navigate when the taxi is far from the destination or the pickup location, depending on which subtask is being solved
4.1
Utility of PCCA+
We illustrate the utility of PCCA+ by comparing the partitioning it obtains with other state-of-the-art algorithms on a complex maze domain. We also show its utility in partitioning the state space of the taxi domain by interpreting the partitions as
(a) Configuration state abstraction when the passenger is in the taxi, for different destinations (the last layer in Figure 4.5). Note that when the destination and taxi location are the same there is a lone colored state; this state is part of the abstract state corresponding to the Pickup-Putdown subtask.
(b) Configuration state abstraction when the
destination is Y (blue) and passenger is in
the taxi.
Figure 4.6: Segmentation of the Physical Space in the Taxi Domain
natural abstractions over which subtasks can be defined.¹ In order to examine this we use the full transition matrix, although the experiments for detecting subgoals do not assume any prior availability of the transition matrix. The results are summarized in Figures 4.5 and 4.6. i) In Figure 4.5 each row of states corresponds to a specific drop-off location and passenger location. The final row of pixels corresponds to the passenger being in the taxi. While each of the front rows of pixels gets a unique color, the last row shows fragmentation due to the navigate option. Each sublayer in the last layer has at least one state with the same color as a sublayer in one of the front 4 layers (which are all differently colored); this corresponds to the pickup subtasks. Degeneration in the last layer corresponds to the navigation or putdown task, depending on the location of the taxi. ii) Due to the presence of obstacles in the configuration space, the dynamics corresponding to pickup when the taxi is near the pickup location is much more connected than the dynamics wherein the taxi has to move across the obstacles in the configuration space; see Figure 4.6b. iii) As the number of clusters identified increases, the state space breaks down further, introducing additional metastable states corresponding to other subtasks. iv) We explored more clusters corresponding to the next large local variation in eigenvalues (which occurred at 50 eigenvalues) and observed more degeneration of the state space, introducing more subtasks corresponding to Pickup when the taxi is near other pickup locations. We also observed further clustering of the configuration space, which consequently becomes strictly bounded by the obstacles. v) As the number of clusters increases, the Putdown subtasks degenerate earlier than the Pickup
¹ To see the working of PCCA+ on domains other than spatial abstraction, refer to Appendix A.
subtasks because any putdown task leads to an absorbing state.
CHAPTER 5
Temporal Abstraction
Temporally extended actions are a useful mechanism to reuse experience and provide a high-level description of control. One example of how such policies could be reused is shown in Figure 5.1 for the domain in Figure 4.3. Having learned to solve the first task, the skills (shown by the down arrows) need not be learned again in order to solve the second task. This reduces the size of the problem that needs solving. We explore the use of such actions across a collection of tasks defined over a common domain in this section. We use the Options framework (Sutton et al., 1999) to model these skills or temporal abstractions. In this section we use the partitions of the state space
(a) Subtasks for 1st Task
(b) Subtasks for 2nd Task
Figure 5.1: Two episodes with different starting states. A skill in this context is defined as a transition between rooms. Skills shown by the down arrow are not learned again; instead they are reused from the first task
into abstract states, along with the membership function, to define high-level actions in terms of transitions between these abstract states. We also show how to identify the relevant skills needed to solve a given task. We use the structural information obtained to define behavioral policies for the subtasks independent of the task being solved. We believe that to find a hierarchically optimal solution for the entire task, composed of smaller subtasks, it is not necessary for the agent to solve each subtask optimally. Hence any behavior derived by exploiting the structure present in the knowledge of the domain (from experience over past trajectories and the recent run) can be used. This claim is also strengthened by our observations on experiments run on several domains (chapter 7). Hence, while doing so, we do not have to learn the option policies; rather, these policies are derived from the structures exploited in the state space.
5.1
Subtask options
Each column of the matrix χ returned by the spatial abstraction algorithm is a membership function defining the degree of membership of each state s in the abstract state S_j. The membership function m_{S_i}(s) can be interpreted as the likelihood of a state s belonging to abstract state S_i (Singpurwalla and Booker (2004)). Hence we can formally interpret the rows of the membership matrix as probability distributions over all abstract states. Each element χ_{ij} can be interpreted as the probability of state i belonging to abstract state j. This gives us an elegant method to compose options to exit from a given abstract state. In the case of multiple exits or bottlenecks, PCCA+HRL is also able to compose multiple options, each taking the agent to the respective exit.
5.1.1
Initiation Set I
Given that the agent is in state s, the initiation set of the option is the set of all states i belonging to the same abstract state S as s, i.e. all i such that $\arg\max_j \chi_{ij} = \arg\max_j \chi_{sj} = S$. This is very intuitive: since the mixing time within an abstract state is very low, the agent does not need much deliberation to move within an abstract state. Therefore a given option can be initiated from anywhere within the respective abstract state.
5.1.2
Option Policy µ
Our goal is to construct option policies that take the agent to their respective exits. The membership functions m_{S_j}(s) provide a very formal way to construct such policies. Consider that the agent is in a state s which belongs to the abstract state S_i; we want to construct a policy which would take the agent to the exit of the abstract state S_i joining the abstract state S_j. Membership functions are stochastic functions which
Figure 5.2: Policy for option going from Abstract State 1 to Abstract state 2 for domain
in Figure 2.1
(a) Inner
(b) Outer
Figure 5.3: Room inside a room, the policy for the domain in Figure 4.4
(a) 2rooms
(b) χ̃1 Large Room
(c) χ̃2 Small Room
(d) Object
(e) χ̃1 Without Object
(f) χ̃2 Without Object
(g) χ̃1 With Object
(h) χ̃2 With Object
Figure 5.4: A few more examples of the policy composed using PCCA+HRL
quantify the membership of each state in the respective abstract state. Hence, if the agent follows a stochastic gradient ascent on the membership function m_{S_j}(s) ∀s ∈ S_i, then this policy will take the agent to the exit of abstract state S_i joining abstract state S_j. For example, consider the 3 room world illustration domain: the membership function for abstract state 2 forms a surface as shown in Figure 5.2. Gradient ascent on this membership function takes the agent from abstract state 1 to abstract state 2. Hence we can define the option policy µ(s, a) (the probability of taking action a in state s), which takes the agent from abstract state S_i to abstract state S_j, as a stochastic gradient function as follows:
$$\mu(s, a) = \max\left(\alpha(s)\left(\sum_{s'} P(s, a, s')\, m_{S_j}(s') - m_{S_j}(s)\right),\ 0\right) \quad \forall s \in S_i.$$
where α(s) is a normalization constant that keeps the values of µ in [0, 1], and P(s, a, s') is the transition model of the MDP, which gives the probability of reaching state s' given that the agent took action a in state s.
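As a concrete illustration, the option policy above can be composed directly from the membership matrix and a transition model. The following is a minimal sketch, assuming P is a dense |S| × |A| × |S| array and chi is the |S| × K membership matrix; the function name and array layout are illustrative, not the thesis implementation.

import numpy as np

def compose_option_policy(P, chi, i, j):
    """Option policy mu(s, a) for leaving abstract state S_i towards S_j.

    P   : (S, A, S) transition probabilities P(s, a, s')
    chi : (S, K) membership matrix; chi[:, j] is m_{S_j}(s)
    """
    m_j = chi[:, j]                           # membership in the target abstract state
    # Expected one-step improvement in membership of S_j for every (s, a).
    advantage = P @ m_j - m_j[:, None]        # shape (S, A)
    mu = np.maximum(advantage, 0.0)           # keep only ascent directions
    # alpha(s): per-state normalization so that mu(s, .) sums to 1 where possible.
    totals = mu.sum(axis=1, keepdims=True)
    mu = np.divide(mu, totals, out=np.zeros_like(mu), where=totals > 0)
    mu[chi.argmax(axis=1) != i] = 0.0         # the option is only defined inside S_i
    return mu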
5.1.3
Termination Condition β
The termination condition is a probability function which assigns the probability of terminating the current option at state s. Another interpretation of this probability is the probability of state s being a decision epoch given the option currently being executed. Using this interpretation, for an option which takes the agent from abstract state S_i to abstract state S_j, β can be defined as follows:

$$\beta(s) = \min\left(\frac{\log(m_{S_i}(s))}{\log(m_{S_j}(s))},\ 1\right) \quad \forall s \in S_i$$
There can be other ways to define β, but we use this formulation because it gives a smooth, peaked function with nice mathematical properties (see Figure 5.5).
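For completeness, a minimal sketch of this termination condition, assuming the same chi membership matrix as in the policy sketch above (illustrative, not the exact implementation):

import numpy as np

def termination_condition(chi, i, j, eps=1e-12):
    """beta(s) = min(log m_{S_i}(s) / log m_{S_j}(s), 1) for states in S_i."""
    m_i = np.clip(chi[:, i], eps, 1.0 - eps)  # clip to avoid log(0) and division by zero
    m_j = np.clip(chi[:, j], eps, 1.0 - eps)
    beta = np.minimum(np.log(m_i) / np.log(m_j), 1.0)
    beta[chi.argmax(axis=1) != i] = 1.0       # outside S_i the option always terminates
    return beta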
Figure 5.5: Termination Condition for option going from Abstract State 1 to Abstract
state 2 for domain in Figure 2.1
CHAPTER 6
PCCA+HRL: An Online Framework for Task Solving
We demonstrate here an online method (Algorithm 3) for efficiently finding spatio-temporal abstractions while the agent is following another policy (not necessarily the policy being learned). The key to our subtask discovery is that the subtasks are identified from the experiences of the agent. We propose two methods for dynamically composing option policies and learning the behavior policy over the subtasks identified. Depending upon the availability of memory, one could use either the latter, in case of memory shortage or when performing analysis on large state spaces, or the former, for the normal mode of operation. For all our experiments we use the former method. Both these methods are inspired by the UCT framework, a Monte-Carlo search algorithm that is also called rollout-based. A rollout-based algorithm builds its lookahead tree by repeatedly sampling episodes from the initial state. The tree is built by adding the information gathered during an episode to it in an incremental manner. The generic scheme of rollout-based Monte-Carlo planning is given in Algorithm 2. The algorithm iteratively generates episodes (line 3), and returns the action with the highest average observed long-term reward (line 5). In the procedure UpdateValue the total reward q is used to adjust the estimated value for the given state-action pair at the given depth, together with increasing the counter that stores the number of visits of the state-action pair at the given depth. Episodes are generated by the search function, which selects and effectuates actions recursively until some terminal condition is satisfied. This can be reaching a terminal state, or episodes can be cut at a certain depth (line 8). We use the UCT algorithm because it is more effective than vanilla Monte-Carlo planning, where actions are sampled uniformly, while UCT does a selective sampling of actions.
In the case of the second method, which constructs the transition model from the agent's experience, selective sampling is not just preferred but a requirement. In UCT, in state s at depth d, the action that maximizes $Q_t(s, a, d) + c_{N_{s,d}(t),\, N_{s,a,d}(t)}$ is selected, where $Q_t(s, a, d)$ is the estimated value of action a in state s at depth d and time t, $N_{s,d}(t)$ is the number of times state s has been visited up to time t at depth d, and $N_{s,a,d}(t)$ is the number of times action a was selected when state s has been visited up to time t at depth d. The bias term has the form $c_{t,s} = 2 C_p \sqrt{\frac{\ln t}{s}}$, where $C_p$ is an empirical constant. A variant adaptation of this search method is used in PCCA+HRL. Since we are composing option policies rather than learning them, we replace Q(s, a) with the stochastic gradient function µ(s, a) as constructed in chapter 5, where the particular µ chosen corresponds to the greedy option chosen from the option value function. Hence the search criterion becomes: select, in state s at depth d, the action that maximizes
$$\max\left(\alpha(s)\left(\sum_{s'} P(s, a, s')\, m_{S_j}(s') - m_{S_j}(s)\right),\ 0\right)_t(d) + c_{N_{s,d}(t),\, N_{s,a,d}(t)}.$$
Using this search method we propose the following methods to construct the transition models on which spatio-temporal abstractions can be defined.
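To make the modified selection rule concrete, here is a minimal sketch of UCT-style action selection with the composed µ in place of a learned Q; the constant Cp and the count bookkeeping are illustrative assumptions.

import numpy as np

def select_action(mu_s, n_visits_s, n_visits_sa, Cp=0.7):
    """UCT-style selection: mu(s, .) plus an exploration bonus from visit counts.

    mu_s        : (A,) composed option policy values mu(s, .)
    n_visits_s  : total visits to state s (at this depth)
    n_visits_sa : (A,) visits to each (s, a) pair (at this depth)
    """
    bonus = 2 * Cp * np.sqrt(np.log(n_visits_s + 1) / (n_visits_sa + 1))
    return int(np.argmax(mu_s + bonus))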
Algorithm 2 Monte Carlo Search
1. function MonteCarloPlanning(state)
2.   repeat
3.     search(state, 0)
4.   until Timeout
5.   return bestAction(state, 0)
6. function search(state, depth)
7.   if Terminal(state) then return 0
8.   if Leaf(state, depth) then return Evaluate(state)
9.   action := selectAction(state, depth)
10.  (nextstate, reward) := simulateAction(state, action)
11.  q := reward + γ · search(nextstate, depth + 1)
12.  UpdateValue(state, action, q, depth)
13.  return q
• For every sampled trajectory as described above, we maintain transition counts $\phi^a_{ss'}$ of the number of times the transition $s \xrightarrow{a} s'$ is observed, starting from a prior $\phi_0$. These transition counts are used to populate the local adjacency matrix $D$ and the transition count model $U$ as $D_{\text{posterior}}(s, s') = D_{\text{prior}}(s, s') + \sum_a \phi^a_{ss'}$ and $U_{\text{posterior}}(s, a, s') = U_{\text{prior}}(s, a, s') + \phi^a_{ss'}$, which are updated after each sampled trajectory. For every iteration we then find a suitable spatial abstraction using PCCA+ as described in Algorithm 1. After identifying suitable spatial abstractions (chapter 4) and constructing the corresponding membership functions, we construct subtask options as illustrated in chapter 5, where $P(s, a, s') = \frac{U(s, a, s')}{\sum_{s'} U(s, a, s')}$. We then use an SMDP value learning algorithm to find an optimal policy over options, which uses these skills to reach the goal state efficiently. (A minimal sketch of this update appears after this list.)
• When running on very large domains with limited memory capacity, instead of storing previous transition counts one could define the local adjacency matrix and the transition count model on the current trajectory alone. Although we lose a lot of valuable information this way, it has been observed to work in practice only when the new sampled trajectory is very close to the previously sampled trajectory.
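The following minimal sketch shows the count-based update from the first bullet, assuming a tabular domain with S states and A actions; the class name, the scalar prior, and the dense layout are illustrative choices, not the thesis code.

import numpy as np

class TransitionModel:
    """Count-based local adjacency matrix D and transition count model U."""

    def __init__(self, n_states, n_actions, prior=1e-3):
        self.U = np.full((n_states, n_actions, n_states), prior)  # U_prior
        self.D = np.full((n_states, n_states), prior)             # D_prior

    def update(self, trajectory):
        """trajectory: iterable of (s, a, s') transitions from one sampled episode."""
        for s, a, s_next in trajectory:
            self.U[s, a, s_next] += 1.0            # U_posterior = U_prior + phi
            self.D[s, s_next] += 1.0               # D_posterior = D_prior + sum_a phi

    def P(self):
        """Normalized transition model P(s, a, s') = U / sum_{s'} U."""
        totals = self.U.sum(axis=2, keepdims=True)
        return np.divide(self.U, totals, out=np.zeros_like(self.U), where=totals > 0)

    def T(self):
        """Row-normalized adjacency T(s, s') used as input to PCCA+."""
        totals = self.D.sum(axis=1, keepdims=True)
        return np.divide(self.D, totals, out=np.zeros_like(self.D), where=totals > 0)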
Constructing spatio-temporal abstractions iteratively as described above poses a matching problem for learning policies using SMDP learning techniques. With every sampled trajectory the structure of the transition matrix changes, which in turn changes the spatial abstractions identified. However small this change might be, in order to define the SMDP learning update rule we still need to match the previous options to the new ones. This can be done easily by mapping the vertices of the simplex returned by PCCA+. Consider two iterations which returned the simplex vertices $\tilde{Y}_1$ and $\tilde{Y}_2$ respectively, where $\tilde{Y}_1$ and $\tilde{Y}_2$ are matrices whose rows denote the locations of the vertices. We use the following matching criterion (similar to the definition of membership functions described in Section 3.3.1):
$$\kappa_{12} = \tilde{Y}_1 \tilde{Y}_2^{-1}$$
Using $\kappa_{12}$ we assign vertex i of simplex 1 to vertex j of simplex 2 using the Munkres algorithm (a.k.a. the Hungarian method) (Munkres, 1957), where the match weight between i and j is $\kappa_{12}(i, j)$ (or the distance metric can be defined as $1 - \kappa_{12}(i, j)$).
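A minimal sketch of this matching step, assuming square vertex matrices and using SciPy's Hungarian-algorithm solver (scipy.optimize.linear_sum_assignment); this is an illustrative reconstruction, not the exact code used here.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_simplex_vertices(Y1, Y2):
    """Match vertices of the previous simplex (rows of Y1) to the new one (rows of Y2)."""
    kappa = Y1 @ np.linalg.inv(Y2)                 # kappa_12 = Y1 * Y2^{-1}
    cost = 1.0 - kappa                             # Hungarian solver minimizes cost
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows, cols))                   # old vertex index -> new vertex index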
6.1
Interpretation of subtasks identified
Given an agent following a random walk in some state space, it is more likely to stay
in its current metastable region than make a transition to another metastable region.
Transitions between metastable regions are relatively rare events, and are captured in
terms of subtasks. The subtasks overcome the natural obstacles created by a domain
and are useful over a variety of tasks in the domain. Our framework differs from the
Algorithm 3 PCCA+HRL Online Planning agent
function PCCA+HRL()
Q ⇒ Option Value Function
1. Observe initial state $s_0$
2. Initialize Q arbitrarily (for example, identically zero)
3. Initialize the transition matrix T
4. U = {}
5. For e = 1 to maximum number of episodes
   (a) Membership function χ, simplex vertices $\tilde{Y}$ = PCCA+(T)
   (b) Find all pairs of connected abstract states $C_k = (S_i, S_j)$ from the non-zero entries in $\chi^T T \chi$
   (c) $O'_k = O_k\ \forall k$
   (d) $\forall C_k$ construct $O_k = \{I_k, \mu_k, \beta_k\}$ as described in chapter 5
   (e) Match $O_k\ \forall k$ (the new set of options composed) with the previous set of options $O'_k\ \forall k$ as described in Section 6
   (f) Find k for which $s_0 \in C_k(1)$
   (g) $i = k$; $s_i = s_0$
   (h) while not the end of the episode
       i. $O_i \leftarrow \arg\max_O Q(s_i, O)$ with probability $1 - \epsilon$, or a uniformly random option $O_k = \{I_k, \mu_k, \beta_k\}$ such that $s_i \in I_k$ with probability $\epsilon$
       ii. Update $Q(s_i, O_i)$ using any option value function learning method (we use the SMDP learning method)
       iii. Sample actions according to $\alpha(\mu_i(s) + c_{N_{s,d}(t),\, N_{s,a,d}(t)})$ as described in Section 3.3.1 and follow until option termination; let the termination state be $s_t$
       iv. $\phi^a_{ss'} := \phi^a_{ss'} + \delta^a_{ss'}$, where $\delta^a_{ss'}$ is the indicator function, equal to 1 if action a in state s takes the agent to $s'$
       v. $R^a_{ss'}$ = reward returned while taking action a in state s, taking the system to state $s'$
       vi. $U(s, a, s') := U(s, a, s') + \phi^a_{ss'}\, e^{-\nu |R^a_{ss'}|}$ as described in Section 6
       vii. $D(s, s') = \sum_a U(s, a, s')$; $P(s, a, s') = \frac{U(s, a, s')}{\sum_{s'} U(s, a, s')}$
       viii. $T(s, s') = \frac{D(s, s')}{\sum_{s'} D(s, s')}\quad \forall s, s'$
       ix. $s_i = s_t$
Intrinsic Motivation framework of Barto et al. (2004) in that there is no predefined notion of saliency or interestingness that depends on the particular domain. Our notion of salient events is restricted to rare events that occur during a random walk in the environment by a particular agent. We are, however, able to automatically identify such events, model subtasks around them, and reuse them efficiently in a variety of tasks. As it turns out, these subtasks are precisely the object-oriented subtasks, where the objects are the bottlenecks connecting two metastable regions.
6.2
Extending the Transition Model to include the underlying Reward Structure
The underlying transition structure only encodes the topological properties of the state space. There are other kinds of structures in the environment which we would like an autonomous agent to detect. For example, in the 3 room world illustration domain we would also like the agent to detect the goal state as a structure present in the environment. We have already seen in Figure 4.2b the significance of detecting such structures. We call these kinds of structures functional structures, based on the functional properties of the given task. Along with the transition count, the agent also returns a reward count for each transition. This reward distribution contains all the information we require regarding the functional properties of an environment. Hence it seems natural to extend our notion of the transition model to include the reward distribution, while defining a suitable spatio-temporal abstraction for the same.
The motivation to do so comes from the fact that we would like our spatial abstraction to degenerate at spikes in the reward distribution. This is because we want our hierarchical learning algorithm to learn decision policies at such epochs. Consider, for example, the same 3 room world domain with the goal state as shown; the goal state produces a spike in the reward distribution at the point where it is located. Given that the agent is in the abstract state surrounding the goal state, we want to compose temporal abstraction policies which can take the agent to the goal state. This can only happen when the agent interprets the state joining the goal state and the surrounding abstract
state as an exit or a bottleneck connecting the two abstract states (the second abstract state being the lone goal state). Let $R^a_{ss'}$ be the reward received while transitioning from state s to state s' while taking action a. We modify the local adjacency matrix D defined above as $D_{\text{posterior}}(s, s') = D_{\text{prior}}(s, s') + \sum_a \phi^a_{ss'}\, e^{-\nu |R^a_{ss'}|}$, where ν
is the regularization constant, which can be chosen to balance the relative weights of the underlying reward and transition structure. Using an exponential weighting of rewards in the transition model has various advantages: a) We want the abstraction to degenerate near spikes in the reward function; hence we require that the adjacency information have very low weight at such points. b) The exponential function provides a nice mathematical form which is continuous and differentiable everywhere. c) It returns a value of 1 for zero rewards, hence preserving the transition structure as it is. d) It also allows for easy tuning of the relative weights by changing the parameter ν.
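A minimal sketch of this reward-weighted adjacency update, extending the TransitionModel sketch above (the value of nu here is an illustrative assumption):

import numpy as np

def update_reward_weighted_D(D, phi, R, nu=1.0):
    """D_posterior(s, s') = D_prior(s, s') + sum_a phi[s, a, s'] * exp(-nu * |R[s, a, s']|).

    phi : (S, A, S) transition counts
    R   : (S, A, S) rewards observed for each transition
    """
    return D + np.sum(phi * np.exp(-nu * np.abs(R)), axis=1)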
CHAPTER 7
Experiments
We present here three domains on which we perform our experiments. We compare our method with LCut by Şimşek et al. (2005). We also compare our results with options composed through biased random policies with various termination conditions, along with the availability of primitive actions. In both of these domains, the method proposed here significantly outperforms the other methods.
7.1
2 room Domain
This is a very simple domain in which one can still acquire skills. The domain consists of 2 rooms of unequal sizes, where the agent has to start from the first room and reach a particular goal state in the second room (Figure 7.1a). A typical skill acquired would be to reach the doorway from any state in the first room and to navigate from this doorway to the intended goal state. This is indeed what we observe while using the PCCA+HRL framework. We use the transition and reward structure to create the transition model. Using this we find 3 abstract states, where one abstract state corresponds to the lone goal state itself. Figure 7.1b shows the partitioning obtained after 3 episodes of trajectories of length 3000 each. Figure 7.1c compares the average return for different methods while solving the same task. We plot the average return with respect to the number of decision epochs used.
7.2
Taxi Domain
We use the episodic format for task solving in the Taxi Domain proposed by Dietterich (1998). This is a complex domain with a large number of typical skills that can be acquired. Typical skills would consist of options facilitating the Navigate, Pickup and Putdown subtasks. Again, this is indeed what we observed using PCCA+. We use the episodic version of the taxi domain, where at the beginning of each episode the taxi position, the passenger's position and its intended destination are randomly reset. Figure 7.1d compares the average return for different methods.
(a) 2 Room Domain
(b) Spatial Abstraction using PCCA+
(c) Average Return for 2 room Domain
(d) Average Return for Taxi Domain
Figure 7.1: Experiments
7.3
Mario
Mario is a very complex domain. Due to the large state space (25352 states), we cannot use a tabular representation for the states. The Mario domain is dynamic in nature; at any given instance the observation is a single frame of the entire game. Hence, we define coordinates with Mario as the reference point. We then define a CMAC encoding with hashing to make the state representation of the game tractable. A CMAC uses multiple overlapping tilings of the state space to produce a feature representation for a final linear mapping where all the learning takes place (see Figure 7.2). The overall effect is
Figure 7.2: CMACs involve multiple overlapping tilings of the state space. Here we show two 5 × 5 regular tilings offset and overlaid over a continuous, two-dimensional state space. Any state, such as that shown by the dot, is in exactly one tile of each tiling. A state's tiles are used to represent it in the Sarsa algorithm. The tilings need not be regular grids such as shown here. In particular, they are often hyperplanar slices, the number of which grows sub-exponentially with the dimensionality of the space. CMACs have been widely used in conjunction with reinforcement learning systems.
much like a network with fixed radial basis functions, except that it is particularly efficient computationally (in other respects one would expect RBF networks and similar methods to work just as well). It is important to note that the tilings need not be simple grids. For example, to avoid the "curse of dimensionality," a common trick is to ignore some dimensions in some tilings, i.e., to use hyperplanar slices instead of boxes. A second major trick is "hashing": a consistent random collapsing of a large set of tiles into a much smaller set. Through hashing, memory requirements are often reduced by large factors with little loss of performance. This is possible because high resolution is needed in only a small fraction of the state space. Hashing frees us from the curse of dimensionality in the sense that memory requirements need not be exponential in the number of dimensions. We chose a grid size of 1024 with 2 tilings, hence giving us a state space of size 4096. The value functions are defined on this representation without any approximation. We run PCCA+HRL to autonomously play Mario. It was able to compose subtasks on structures in the game, such as kill the monster, collect the coin, etc. We do not know of any other work which can do this autonomously in an effective manner. We compare our results with the primitive Q-learning technique, as shown in Figure 7.3.
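As a rough illustration of this kind of tile coding with hashing, here is a minimal sketch; the grid geometry, the hash table size, and the 2-D Mario-relative feature layout are illustrative assumptions and not the exact representation used here.

import numpy as np

def tile_features(x, y, n_tilings=2, tiles_per_dim=32, table_size=1024):
    """Hashed tile indices that are active for a 2-D state (x, y) in [0, 1)^2."""
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)    # small diagonal offset per tiling
        ix = int((x + offset) * tiles_per_dim) % tiles_per_dim
        iy = int((y + offset) * tiles_per_dim) % tiles_per_dim
        h = hash((t, ix, iy)) % table_size          # consistent random collapsing of tiles
        active.append(t * table_size + h)
    return active

# A linear value function over the hashed features: one weight per hashed tile.
weights = np.zeros(2 * 1024)
value = weights[tile_features(0.37, 0.81)].sum()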
Figure 7.3: Mario domain: cumulative goal completions
CHAPTER 8
Conclusion and Future Directions
Viewing random walks on MDPs as dynamical systems allows us to decompose the state space along the lines of temporal scales of change. Well-connected parts of the state space are thus identified as metastable states, and we proposed a complete algorithm that not only identifies the metastable states but also learns options to navigate between them. We demonstrated the effectiveness of the approach
on a variety of domains. We also discussed some crucial advantages of our approach
over existing option discovery algorithms. While the approach detects intuitive options, it is possible that under a suitable re-representation of the state space, some of
the metastable regions detected can be identical to each other. We are looking to use
notions of symmetries in MDPs to identify such equivalent metastable regions. Another
promising line of inquiry is to extend our approach to continuous state spaces, taking
cues from the Proto Value Function literature.
APPENDIX A
PCCA+ on other domains
Other spectral clustering methods have had varying amounts of success on different domains, although few of them have been successfully used across multiple domains. The primary reason is that none of them has a principled way of exploiting the structural properties encoded in the Laplacian. We demonstrate the utility of PCCA+ across multiple domains, comparing it with other state-of-the-art methods in each domain. We also compare PCCA+ with the N-Cut method (Shi and Malik (2000)) across all these domains. To the best of our knowledge, this is the first time such an analysis has been performed using PCCA+ on multiple domains while comparing with other state-of-the-art methods.
A.1
Image Segmentation
Figure A.1a shows an image we would like to segment. The procedure for clustering is
(a) Image
(b) PCCA+
(c) N-Cut
Figure A.1: Partitioning into multiple segments. Comparison with N-Cut
as follows
1. Construct a similarity graph G = (V, E) by taking each pixel as a node and connecting each pair of pixels by an edge. The similarity value should reflect the likelihood of two pixels belonging to the same group. We define the similarity matrix in terms of a radial basis function of the brightness of adjacent pixels as follows:

$$W_{ij} = \begin{cases} \exp\left(-\dfrac{\|F_i - F_j\|^2}{\sigma_I^2}\right) & \text{if } \|X_i - X_j\|_2 \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(A.1)}$$
(a) χ˜1
(b) χ˜2
(c) χ˜3
(d) χ˜4
Figure A.2: Membership Functions for some clusters
where $X_i$ and $X_j$ are the spatial locations of the pixels, and $F_i$ and $F_j$ are the brightness intensity values of the pixels. This similarity matrix gives a non-zero value for a pixel i connected to a pixel j located on any of the 8 sites of the square lattice around pixel i. For a colored RGB image, the corresponding grayscale image is used here for image segmentation. Please note that this construction of the similarity matrix is quite different from the construction of the similarity matrix by Shi and Malik (2000); we have only one parameter to adjust, namely the width of the intensity Gaussian, as opposed to six parameters in Shi and Malik (2000). (A minimal sketch of this construction appears at the end of this section.)
2. We find the number of clusters using the spectral gap method, by finding the top k eigenvalues for which $\frac{e_{k+1} - e_k}{1 - e_k} > t_c$, where $t_c$ is the spectral gap threshold.
3. We apply the PCCA+ algorithm (Algorithm 1) to obtain the membership matrix for the graph. Figure A.2 shows the membership functions for some of the clusters identified. Figure A.1 shows the partitioning of the image into discrete segments, where each segment is color coded differently. In the same Figure A.1 we also show the results obtained using N-Cut (by performing k-means in the eigen-space for secondary clustering), carefully choosing the
(a) Image
(b) Image
(c) PCCA+
(d) PCCA+
(e) N-Cut
(f) N-Cut
Figure A.3: Segmentation for some other images
best parameters¹. Note that PCCA+ produces clusters which separate the background from the objects, even though the background is structurally a very complex group, while N-Cut segments the background into different groups.
We also compare the Normalized Cut values of PCCA+ and N-Cut in Table A.1. We observe that PCCA+ gives very good average Normalized Cut values (∼ 0).
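A minimal sketch of the pixel-similarity construction in step 1 above, assuming a grayscale image array; sigma_I and the dense 8-neighbour layout are illustrative choices rather than the exact implementation.

import numpy as np

def pixel_similarity(img, sigma_I=0.1):
    """Similarity W over 8-connected pixel neighbours of a grayscale image."""
    h, w = img.shape
    n = h * w
    W = np.zeros((n, n))
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    j = rr * w + cc
                    W[i, j] = np.exp(-((img[r, c] - img[rr, cc]) ** 2) / sigma_I ** 2)
    return W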
A.2
Text Clustering
Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner. Many clustering methods have been applied to cluster documents into categories, such as k-means MacQueen (1967), naive Bayes or Gaussian mixture models Baker and McCallum (1998); Liu et al. (2002), single-link Jain and
¹ The parameter values for N-Cut are taken from http://note.sonots.com/SciSoftware/NcutImageSegmentation.html, where the author claims that these parameter values for N-Cut produce the best results.
Table A.1: Normalized Cut values for PCCA+ and N-Cut (lower is better)

Image      PCCA+         N-Cut
Bird       2.1616e-05    0.2099
Baby       4.7245e-04    0.0126
Aircraft   0.0029        0.0128
Table A.2: Accuracy measure for TDT2

K     K-means   LSI     LPI     LE      NMF-NCW   PCCA+
2     0.871     0.913   0.963   0.923   0.925     0.9948
3     0.775     0.815   0.884   0.816   0.807     0.9810
4     0.732     0.773   0.843   0.793   0.787     0.9528
5     0.671     0.704   0.780   0.737   0.735     0.9424
6     0.655     0.683   0.760   0.719   0.722     0.9403
7     0.623     0.651   0.724   0.694   0.689     0.9331
8     0.582     0.617   0.693   0.650   0.662     0.7953
9     0.553     0.587   0.661   0.625   0.623     0.859
10    0.545     0.573   0.646   0.615   0.616     0.817
Avg   0.667     0.702   0.657   0.730   0.730     0.913
Dubes (1988), and DBSCAN Ester et al. (1996). From different perspectives, these clustering methods can be classified as agglomerative or divisive, hard or fuzzy, deterministic or stochastic. Typical data clustering tasks are performed directly in the data space. However, the document space is always of very high dimensionality. Due to the curse of dimensionality, it is desirable to first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. The literature on spectral clustering shows its capability to handle highly nonlinear data. Also, its strong connections to differential geometry make it capable of discovering the manifold structure of the document space.
For the experiments, three standard document collections were used: Reuters-21578, 20 Newsgroups and TDT2. The Reuters-21578 corpus² contains 21,578 documents in 135 categories. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups³. The TDT2 corpus⁴ consists of data collected during the first half
² The Reuters-21578 corpus is available at http://www.daviddlewis.com/resources/testcollections/reuters21578/
³ The homepage of the 20 Newsgroups dataset is http://qwone.com/~jason/20Newsgroups/
⁴ The NIST Topic Detection and Tracking corpus is at http://www.nist.gov/speech/tests/tdt/tdt98/index.html
of 1998 and taken from six sources. It consists of 11,201 on-topic documents classified into 96 semantic categories. From the original corpus, the documents appearing in multiple categories are removed. The pruned TDT2 dataset contains 9394 documents from the top 30 categories, Reuters-21578 contains 8293 documents belonging to 65 categories, and 20 Newsgroups contains 18846 documents belonging to 20 groups. All the datasets are document-term matrices where each row represents a document. We apply TF-IDF weighting, which gives the weighted document-term matrix. We then calculate the distance between any two documents using cosine similarity, where each document is a term vector.
Table A.3: Accuracy measure for Reuters

K     K-means   LSI     LPI     LE      NMF-NCW   PCCA+
2     0.989     0.992   0.998   0.998   0.985     0.9715
3     0.974     0.985   0.996   0.996   0.953     0.9794
4     0.959     0.970   0.996   0.996   0.964     0.9801
5     0.948     0.961   0.993   0.993   0.980     0.9693
6     0.945     0.954   0.993   0.992   0.932     0.970
7     0.883     0.903   0.990   0.988   0.921     0.9703
8     0.874     0.890   0.989   0.987   0.908     0.9699
9     0.852     0.870   0.987   0.984   0.895     0.9698
10    0.835     0.850   0.982   0.979   0.898     0.9697
Avg   0.918     0.931   0.982   0.990   0.937     0.9722
Consider a set of documents $x_1, x_2, \ldots, x_n \in \mathbb{R}^m$. Assume each $x_i$ has been normalized to unit length.
1. To construct the adjacency graph, suppose the i-th node corresponds to the document $x_i$. We put an edge between nodes i and j if $x_i$ is among the p nearest neighbors of $x_j$ or $x_j$ is among the p nearest neighbors of $x_i$.
2. We construct the similarity matrix S as follows: if nodes i and j are connected, $S_{ij} = x_i^T x_j$; otherwise, $S_{ij} = 0$. (A minimal sketch of this construction appears after this list.)
3. We apply the PCCA+ algorithm (Algorithm 1) to obtain the membership matrix of the documents to clusters. The maximum-membership criterion is used to cluster documents. We perform three sets of experiments to evaluate the quality of the clusters identified by PCCA+. In the first set of experiments, using the unlabeled dataset, we chose the number of clusters using the eigen-gap measure. It was observed that with this measure the number of clusters was exactly equal to the number of categories in the labeled data. Table A.4 shows the purity measure for a few datasets. In the second set of experiments, we compare PCCA+ with other clustering methods based on LSI Deerwester et al. (1990), the spectral clustering method, LPI Cai et al. (2005), and the Nonnegative Matrix Factorization clustering method Zha et al. (2001); Xu et al. (2003) (refer to Table A.3 and Table A.2). The evaluations are conducted for cluster numbers ranging from two to ten. For each given cluster number k, 50 test runs are conducted on different randomly chosen clusters. The clustering performance is evaluated by comparing the obtained label of each document with that provided by the document corpus. Given a document $x_i$, let $r_i$ and $s_i$ be the obtained cluster label and the label provided by the corpus, respectively (refer to Zha et al. (2001)). The accuracy measure is defined as $\frac{\sum_{i=1}^{n} \delta(s_i, \mathrm{map}(r_i))}{n}$, where n is the total number of documents, δ(x, y) is the delta function, and map(r_i) is the permutation mapping function that maps each cluster label $r_i$ to the equivalent label from the data corpus. We observe that PCCA+ provides better quality clusters than the other clustering techniques in most cases, and comparable clusters in the others. The third set of experiments is performed using the Reuters-21578 dataset; in this experiment the 7 most frequent categories are considered, but we do not remove documents belonging to multiple categories. After removing documents whose label sets or main texts are empty, 8,866 documents are retained, of which only 3.37% are associated with more than one class label. After randomly removing documents with only one label, a text categorization dataset containing 1998 documents is obtained (Table A.5 provides details of the categories used). We run PCCA+ on this dataset and generate the connectivity information across categories using the definition provided in Def 2. The macro-transition operator (normalized), which quantifies the relation between categories, is shown in Table A.6. The interesting observation is that the macro-transition operator has high values across categories of seemingly related nature; for example the pairs Money fx-Trade, Trade-Crude, and Trade-Interest have high values, while the pair Grain-Earn has low connectivity values.
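A minimal sketch of the document similarity-graph construction from steps 1 and 2 above, assuming a TF-IDF matrix X whose rows are already normalized to unit length; the value of p and the dense layout are illustrative choices.

import numpy as np

def document_similarity(X, p=10):
    """Symmetric p-nearest-neighbour cosine similarity graph over documents (rows of X)."""
    cos = X @ X.T                                   # cosine similarity for unit-norm rows
    np.fill_diagonal(cos, -np.inf)                  # exclude self-similarity
    S = np.zeros_like(cos)
    for i in range(cos.shape[0]):
        nn = np.argsort(cos[i])[-p:]                # p nearest neighbours of document i
        S[i, nn] = cos[i, nn]
    S = np.maximum(S, S.T)                          # edge if i is a neighbour of j or vice versa
    return S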
Table A.4: Purity measure for document clusters

Dataset      Number of docs   K    Purity
TDT2         9394             30   0.9344
Reuters      8293             65   0.9694
Newsgroup    18846            20   0.9023
We observe that PCCA+ is competitive with the best algorithm for all values of the number of clusters. We also observe in Table A.4 that PCCA+ achieves a very high purity measure for document clustering.
A.3
Synthetic Datasets
We demonstrate the quality of the clustering obtained using PCCA+ on various synthetic datasets with different structural properties.
Table A.5: Details of the Reuters-21578 top 7 categories used

Category      Number of Docs
Earn          831
Acquisition   482
Money-fx      299
Grain         153
Crude         128
Trade         154
Interest      261
Table A.6: Macro connectivity information for Reuters-21578 top 7 categories

              Earn    Acquisition   Money-fx   Grain   Crude   Trade   Interest
Earn          0.106   0.093         0.031      0.052   0.259   0.255   0.202
Acquisition   0.021   0.104         0.058      0.075   0.243   0.279   0.216
Money-fx      0.005   0.056         0.184      0.074   0.228   0.283   0.166
Grain         0.005   0.038         0.039      0.287   0.189   0.169   0.269
Crude         0.015   0.062         0.06       0.095   0.291   0.268   0.205
Trade         0.013   0.07          0.073      0.083   0.261   0.328   0.169
Interest      0.009   0.043         0.034      0.107   0.164   0.138   0.501
1. Consider the set of datapoints $x_1, x_2, \ldots, x_n \in \mathbb{R}^m$. The similarity matrix is constructed as follows:

$$S_{ij} = \begin{cases} \exp\left(-\dfrac{(x_i - x_j)^T \Sigma^{-1} (x_i - x_j)}{2}\right) & \text{if } \|x_i - x_j\| \le \text{threshold and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

where Σ is the covariance matrix. (A minimal sketch of this construction appears after this list.)
2. We find the number of clusters using the spectral gap method, by finding the top k eigenvalues for which $\frac{e_{k+1} - e_k}{1 - e_k} > t_c$, where $t_c$ is the spectral gap threshold.
3. We apply the PCCA+ algorithm (Algorithm 1) to obtain the membership matrix of the data points to clusters. Figure A.4 shows the clusters obtained using PCCA+. Table A.7 shows the purity measure of the clusters obtained using PCCA+ for the same number of clusters k as present in the labeled data (note that the spectral gap method always identified the same number of clusters as present in the labeled data).
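A minimal sketch of the similarity construction in step 1 above; taking Σ as the covariance of the data and the value of the threshold are illustrative assumptions.

import numpy as np

def build_similarity(X, threshold=1.0):
    """Similarity S_ij = exp(-(x_i - x_j)^T Sigma^{-1} (x_i - x_j) / 2) for nearby pairs."""
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # Sigma taken as the data covariance
    diff = X[:, None, :] - X[None, :, :]                 # pairwise differences, shape (n, n, m)
    maha = np.einsum('ijk,kl,ijl->ij', diff, Sigma_inv, diff)
    S = np.exp(-maha / 2.0)
    S[np.linalg.norm(diff, axis=2) > threshold] = 0.0    # keep only pairs within the threshold
    np.fill_diagonal(S, 0.0)                             # the i != j condition
    return S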
We observe in Table A.7 that PCCA+ obtains clusterings with a very high purity measure, even on datasets with very different structural properties in the original data space. We also plot the simplex identified by PCCA+ for the spiral dataset in Figure A.5. A few observations to note here are: a) the simplex is linearly transformed
(a) Aggregation(7 clusters)
(b) Spiral(3 clusters)
(c) R15(15 clusters)
Figure A.4: Clustering on Synthetic Datasets using PCCA+
Figure A.5: Simplex identified by PCCA+ for the spiral dataset in A.4b. The data points
are shown as colored dots and are clustered around the vertices of the transformed simplex (shown in black dots). The original basis is shown in black
lines.
from its corresponding regular simplex structure, which shows the first-order perturbation, and b) the data points (plotted as colored markers) around the vertices of the transformed simplex show the higher-order perturbations around the simplex structure.
Table A.7: Purity measure for synthetic datasets

Dataset       Number of datapoints   K    Purity
R15           600                    15   99.7
Spiral        312                    3    100.0
Aggregation   788                    7    99.6
REFERENCES
1. Baker, L. D. and A. K. McCallum, Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference
on Research and development in information retrieval, SIGIR ’98. ACM, New York,
NY, USA, 1998. ISBN 1-58113-015-5. URL http://doi.acm.org/10.1145/
290941.290970.
2. Barto, A. G., S. Singh, and N. Chentanez, Intrinsically Motivated Learning of Hierarchical Collections of Skills. In Proceedings of the 2004 International Conference on Development and Learning. 2004, 112–119.
3. Barto, A. G., S. J. Bradtke, and S. P. Singh (1995). Learning to act using real-time
dynamic programming. Artif. Intell., 72(1-2), 81–138. ISSN 0004-3702.
4. Barto, A. G. and S. Mahadevan (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1–2), 41–77.
5. Super Mario Bros. URL http://en.wikipedia.org/wiki/Super_Mario_Bros.
6. Cai, D., X. He, J. Han, and S. Member (2005). Document clustering using locality
preserving indexing. IEEE Transactions on Knowledge and Data Engineering.
7. RL Competition 2009. URL http://2009.rl-competition.org/mario.php.
8. Dean, T. and S.-H. Lin, Decomposition techniques for planning in stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95). Morgan Kaufmann, 1995.
9. Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman
(1990). Indexing by latent semantic analysis. Journal of the american society for
information science, 41(6), 391–407.
10. Dietterich, T. G. (1998). Hierarchical reinforcement learning with the maxq value
function decomposition. Journal of Artificial Intelligence Research, 13, 227–303.
11. Ester, M., H. peter Kriegel, J. S, and X. Xu, A density-based algorithm for discovering
clusters in large spatial databases with noise. AAAI Press, 1996.
12. Hengst, B., Discovering hierarchy in reinforcement learning with HEXQ. In Proceedings of the International Conference on Machine learning, volume 19. ACM Press,
2002.
13. Jain, A. K. and R. C. Dubes, Algorithms for clustering data. Prentice-Hall, Inc., Upper
Saddle River, NJ, USA, 1988. ISBN 0-13-022278-X.
14. Jonsson, A. and A. Barto (2006). Causal Graph Based Decomposition of Factored
MDPs. Journal of Machine Learning Research, 7, 2259–2301.
15. Joshi, M., R. Khobragade, S. Sarda, U. Deshpande, and S. Mohan, Object-oriented
representation and hierarchical reinforcement learning in infinite mario. In ICTAI. 2012.
16. Kannan, R., S. Vempala, and A. Vetta (2000). On clusterings: Good, bad and spectral.
17. Knoblock, C. (1990). Learning abstraction hierarchies for problem solving.
18. Kocsis, L. and C. Szepesvári, Bandit based monte-carlo planning. In Proceedings
of the 17th European conference on Machine Learning, ECML’06. Springer-Verlag,
Berlin, Heidelberg, 2006. ISBN 3-540-45375-X, 978-3-540-45375-8. URL http:
//dx.doi.org/10.1007/11871842_29.
19. Lane, T. and L. Kaelbling (2002). Nearly deterministic abstractions of Markov Decision Processes.
20. Liu, X., Y. Gong, W. Xu, and S. Zhu, Document clustering with cluster refinement and
model selection capabilities. In In Proceedings of the 25th International ACM SIGIR
conference on research and development in information retrieval. ACM Press, 2002.
21. Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. ISSN 0960-3174. URL http://dx.doi.org/10.1007/
s11222-007-9033-z.
22. MacQueen, J. B., Some methods for classification and analysis of multivariate observations. In L. M. L. Cam and J. Neyman (eds.), Proc. of the fifth Berkeley Symposium
on Mathematical Statistics and Probability, volume 1. University of California Press,
1967.
23. Mahadevan, S., Proto-value functions: developmental reinforcement learning. In
ICML ’05: Proceedings of the 22nd international conference on Machine learning.
2005. ISBN 1-59593-180-5.
24. McGovern, A. (2002). Autonomous Discovery of Temporal Abstractions from Interaction with an Environment. Ph.D. thesis, University of Massachusetts - Amherst.
25. Meila, M. and J. Shi, A random walks view of spectral segmentation. 2001.
26. Menache, I., S. Mannor, and N. Shimkin, Q-cut - dynamic discovery of sub-goals in
reinforcement learning. In Machine Learning: ECML 2002, 13th European Conference
on Machine Learning, volume 2430 of LectureNotes in Computer Science. Springer,
2002.
27. Meuleau, N., M. Hauskrecht, K. Kim, L. Peshkin, L. Kaelbling, T. Dean, and
C. Boutilier, Solving very large weakly coupled markov decision processes. In In
Proceedings of the Fifteenth National Conference on Artificial Intelligence. 1998.
28. Munkres, J. (1957). Algorithms for the assignment and transportation problems.
29. Ng, A., M. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm.
In Advances in Neural Information Processing Systems 14. 2001.
30. Shahnaz, F., M. W. Berry, V. P. Pauca, and R. J. Plemmons (2006). Document
clustering using nonnegative matrix factorization. Inf. Process. Manage..
31. Shi, J. and J. Malik (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
32. Şimşek, Ö., A. P. Wolfe, and A. G. Barto, Identifying useful subgoals in reinforcement
learning by local graph partitioning. In Machine Learning, Proceedings of the TwentySecond International Conference (ICML 2005). 2005.
33. Singpurwalla, N. D. and J. M. Booker (2004). Membership functions and probability
measures of fuzzy sets. Journal of the American Statistical Association, 99, 867–877.
34. Sutton, R. and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
35. Sutton, R. S., D. Precup, and S. P. Singh (1999). Between MDPs and semi-MDPs: A
framework for temporal abstraction in reinforcement learning. Artificial Intelligence,
112(1-2), 181–211.
36. Thurau, C., K. Kersting, M. Wahabzada, and C. Bauckhage (2012). Descriptive
matrix factorization for sustainability adopting the principle of opposites. Data Min.
Knowl. Discov..
37. Weber, M., W. Rungsarityotin, and A. Schliep (2002). Characterization of transition
states in conformational dynamics using fuzzy sets. Technical report.
38. Weber, M., W. Rungsarityotin, and A. Schliep (2004). Perron Cluster Analysis and
Its Connection to Graph Partitioning for Noisy Data. Technical Report ZR-04-39, Zuse
Institute Berlin.
39. White, S. and P. Smyth (). A spectral clustering approach to finding communities in
graphs.
40. Wolfe, A. P. and A. G. Barto, Identifying useful subgoals in reinforcement learning by
local graph partitioning. In In Proceedings of the Twenty-Second International Conference on Machine Learning. 2005.
41. Xu, W., X. Liu, and Y. Gong, Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference
on Research and development in informaion retrieval, SIGIR ’03. ACM, New York,
NY, USA, 2003. ISBN 1-58113-646-3. URL http://doi.acm.org/10.1145/
860435.860485.
42. Zha, H., X. He, C. Ding, H. Simon, and M. Gu, Spectral relaxation for k-means
clustering. MIT Press, 2001.
LIST OF PAPERS BASED ON THESIS
1. Peeyush Kumar, Vimal Mathew and Balaraman Ravindran, Heirarichal Decision Making using Spatio-Temporal Abstraction In Reinforcement Learning. Under communication at the Journal of Machine Learning Research, 2013.
2. Vimal Mathew, Peeyush Kumar and Balaraman Ravindran, Abstraction in Reinforcement Learning in Terms of Metastability. European Workshop on Reinforcement Learning, 2012.
3. Peeyush Kumar, Niveditha Narasimhan and Balaraman Ravindran, Spectral Clustering as Mapping to a Simplex. Spectral Workshop, International Conference on Machine Learning, 2013. Under review at Knowledge Discovery and Data Mining, 2013.