Hierarchical Decision Making using Spatio-Temporal Abstraction in Reinforcement Learning

A Project Report submitted by

PEEYUSH KUMAR

in partial fulfilment of the requirements for the award of the degree of

DUAL DEGREE - BACHELOR & MASTER OF TECHNOLOGY

DEPARTMENT OF APPLIED MECHANICS
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
MAY 2013

THESIS CERTIFICATE

This is to certify that the thesis titled Hierarchical Decision Making using Spatio-Temporal Abstraction in Reinforcement Learning, submitted by Peeyush Kumar to the Indian Institute of Technology Madras for the award of the degree of Dual Degree, is a bona fide record of the research work done by him under our supervision. The contents of this thesis, in full or in parts, have not been submitted to any other institute or university for the award of any degree or diploma.

Balaraman Ravindran
Research Guide
Associate Professor
Dept. of Computer Science
IIT Madras, 600 036

M. Manivannan
Research Guide
Associate Professor
Dept. of Applied Mechanics
IIT Madras, 600 036

Place: Chennai
Date: 10th May 2013

ACKNOWLEDGEMENTS

I would like to thank my research advisor, Dr. B. Ravindran, for giving me the opportunity, freedom and encouragement to pursue this research, and for listening patiently to many harebrained ideas. I could not have imagined a better advisor and mentor for my study. He was truly inspiring, and I consider myself very lucky to have worked with him. I would also like to thank Dr. M. Manivannan for giving me the freedom to pursue my research, and for the various interesting discussions I had with him. I would also like to thank my colleagues Niveditha Narasimhan and Vimal Mathew for their contributions to the project. Their invaluable help is sincerely acknowledged.
ABSTRACT

KEYWORDS: Reinforcement learning; Markov decision process; Automated state abstraction; Metastability; Dynamical systems; Subtasks; Skills; Options

Reinforcement learning (RL) is a machine learning framework in which an agent manipulates its environment through a series of actions, receiving a scalar feedback signal, or reward, after each action. The agent learns to select actions so as to maximize the expected total reward. The nature of the rewards and their distribution across the environment determines the task that the agent learns to perform. Identifying useful structures present in a task often provides ways to simplify and speed up learning algorithms, and enables such algorithms to generalize over multiple tasks without relearning policies for the entire task. One approach to using the task structure involves identifying a hierarchical description in terms of abstract states and extended actions between abstract states (Barto and Mahadevan, 2003). The decision policy over the entire state space can then be decomposed over abstract states, simplifying the solution to the current task and allowing for reuse of partial solutions in other related tasks.

This thesis introduces a general principle for automated skill acquisition based on the interaction of a reinforcement learning agent with its environment. We use ideas from dynamical systems to find metastable regions in the state space and associate them with abstract states. Skills are defined in terms of transitions between such abstract states. These skills or subtasks are defined independently of any current task, and we show how they can be efficiently reused across a variety of tasks defined over some common state space. We demonstrate empirically that our method finds effective skills across a variety of domains.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
NOTATION
1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Organization
2 Overview
3 Background
  3.1 A Reinforcement Learning Problem and its Related Random-Walk Operator
  3.2 Spectral Graph Theory
  3.3 A Spectral Clustering Algorithm: PCCA+
    3.3.1 Numerical Analysis
    3.3.2 Options
  3.4 Taxi Domain
  3.5 The Infinite Mario
    3.5.1 States
    3.5.2 Actions
    3.5.3 Rewards
4 Spatial Abstraction
  4.1 Utility of PCCA+
5 Temporal Abstraction
  5.1 Subtask Options
    5.1.1 Initiation Set I
    5.1.2 Option Policy µ
    5.1.3 Termination Condition β
6 PCCA+HRL: An Online Framework for Task Solving
  6.1 Interpretation of Subtasks Identified
  6.2 Extending the Transition Model to Include the Underlying Reward Structure
7 Experiments
  7.1 2 Room Domain
  7.2 Taxi Domain
  7.3 Mario
8 Conclusion and Future Directions
A PCCA+ on Other Domains
  A.1 Image Segmentation
  A.2 Text Clustering
  A.3 Synthetic Datasets

LIST OF TABLES

A.1 Normalized cut values for PCCA+ and N-Cut (lower is better)
A.2 Accuracy measure for TDT2
A.3 Accuracy measure for Reuters
A.4 Purity measure for document clusters
A.5 Details of the Reuters-21578 top 7 categories used
A.6 Macro connectivity information for the Reuters-21578 top 7 categories
A.7 Purity measure for the synthetic dataset

LIST OF FIGURES

1.1 2 room domain
2.1 3 room world illustration domain
2.2 Membership functions for the 3 room world domain in Figure 2.1
2.3 Sample option model
2.4 Policy over meta states. Paths marked in red are part of the optimal policy
2.5 Solution to the illustration problem in Figure 2.1
3.1 Taxi world domain
3.2 A Mario visual scene
4.1 Simplex first-order and higher-order perturbation
4.2 PCCA+ numerical analysis for the 3 room world domain in Figure 2.1
4.3 Comparison of clustering algorithms on the maze domain
4.4 Comparison of clustering algorithms on a room-inside-room domain
4.5 20 metastable states. The state space partitioning identifies 20 metastable states: for every passenger position and destination pair the algorithm identifies 1 metastable state, hence 5 × 4 gives 20 metastable states. The subtasks which can be assigned are (a) Pickup, if the taxi is on the pickup location; (b) Putdown, if the taxi is on the destination location; and (c) Navigate, when the taxi is far from the destination or the pickup location, depending on which subtask is being solved
4.6 Segmentation of the physical space in the Taxi domain
5.1 Two episodes with different starting states. A skill in this context is defined as a transition between rooms. Skills shown by the down arrow are not learned again; instead they are reused from the first task
5.2 Policy for the option going from abstract state 1 to abstract state 2 for the domain in Figure 2.1
5.3 Room inside a room: the policy for the domain in Figure 4.4
5.4 A few more examples of policies composed using PCCA+HRL
5.5 Termination condition for the option going from abstract state 1 to abstract state 2 for the domain in Figure 2.1
7.1 Experiments
7.2 CMACs involve multiple overlapping tilings of the state space. Here we show two 5 × 5 regular tilings offset and overlaid over a continuous, two-dimensional state space. Any state, such as that shown by the dot, is in exactly one tile of each tiling. A state's tiles are used to represent it in the Sarsa algorithm described above. The tilings need not be regular grids such as shown here. In particular, they are often hyperplanar slices, the number of which grows sub-exponentially with the dimensionality of the space. CMACs have been widely used in conjunction with reinforcement learning systems
7.3 Mario domain: cumulative goal completions
A.1 Partitioning into multiple segments. Comparison with N-Cut
A.2 Membership functions for some clusters
A.3 Segmentation for some other images
A.4 Clustering on synthetic datasets using PCCA+
A.5 Simplex identified by PCCA+ for the spiral dataset in Figure A.4b. The data points are shown as colored dots and are clustered around the vertices of the transformed simplex (shown as black dots). The original basis is shown in black lines

ABBREVIATIONS

RL : Reinforcement Learning
MDP : Markov Decision Process
HRL : Hierarchical Reinforcement Learning
AI : Artificial Intelligence
N-Cut : Normalized Cut
PCCA : Perron Cluster Analysis

NOTATION

$R^a_{ss'}$ : Reward achieved while transitioning from state s to state s' while taking action a
$P^a_{ss'}$ : Probability of reaching state s' after a single time step, starting from state s and taking action a
β : Termination condition
µ : Option policy
V : Value function
Q : Action-value function
W : Adjacency matrix
D : Valency matrix
Ỹ : Vertices of the simplex
χ : Membership matrix
m : Membership function
γ : Discount factor
ϵ : Simplex deviation factor
L : Laplacian matrix
T : Transition matrix
λ : Eigenvalue
A : Set of actions
S : Set of states
φ : Transition counts
δ : Temporal-difference error
$R^o_{ss'}$ : Discounted cumulative reward achieved while transitioning from state s to state s' while taking option o
$P^o_{ss'}$ : Probability of reaching state s' at the termination of option o, starting from state s and taking option o

CHAPTER 1

Introduction

Autonomous agents are used in a variety of settings, such as hazardous tasks, repetitive tasks, and assistive roles. Such systems have had many successful applications, for example the Google car, autonomous helicopters, games, and software agents. Learning is a crucial component of autonomy.
Any autonomous agent is required to plan and learn decision policies for controlled action execution in its environment. There has been a lot of work on learning systems in the Artificial Intelligence community; Reinforcement Learning (Sutton and Barto, 1998) is a whole branch dedicated to developing such systems. It is a very powerful way to design autonomous agents because of its generalizing and adaptive capabilities. Learning about structure present in the environment makes RL agents effective learners. In this thesis we use the reinforcement learning framework to build agents which can perform autonomously by identifying and using structures present in the environment while learning policies to solve a given task.

Figure 1.1: 2 room domain

Consider a simple example (Figure 1.1) consisting of two rooms connected by a door. Assume that locations in this environment are described by discrete states as viewed by a learning agent. The agent can take actions North, South, East and West to move around, with each action causing a change in the state of the agent. The set of all states could be partitioned based on the room to which each state belongs, with each room corresponding to an abstract state. The environment can then be described, at a high level, in terms of two interconnected states and transitions between these states. A transition between the two rooms would correspond to a sequence of actions over the lower-level states, leading to a temporal abstraction. Therefore, a set of temporal abstractions can be defined in terms of transitions between abstract states. In the example given, the temporal abstractions correspond to going from one room to another (for example, go to room 2 from room 1 or vice versa). Further, the aggregation of states as we move to a higher level means that the environment can be described in terms of fewer (abstract) states. Solving a task over this smaller state space has computational advantages.
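The room-based aggregation just described can be made concrete with a small sketch. This is purely illustrative; the grid layout, door location, and names below are hypothetical and not taken from the thesis:

```python
# Hypothetical sketch of spatial abstraction in the 2-room world: low-level
# grid states (row, col) are mapped to abstract states (rooms). Columns 0-3
# form room 1, columns 5-8 form room 2, and column 4 is a wall except for a
# single door cell connecting the two rooms.
def abstract_state(cell, door=(2, 4)):
    """Return the abstract state for a low-level grid cell."""
    row, col = cell
    if cell == door:
        return 'door'          # the doorway connects the two rooms
    return 'room1' if col < 4 else 'room2'

# Many low-level states aggregate into just two abstract states plus the door.
assert abstract_state((0, 0)) == 'room1'
assert abstract_state((3, 8)) == 'room2'
assert abstract_state((2, 4)) == 'door'
```

A high-level policy then only needs to decide between transitions such as "go to room 2", rather than choosing a primitive action at every cell.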
By repeating this process of spatio-temporal abstraction, a hierarchy of abstractions can be defined over the environment. Researchers have worked on this framework of using abstraction to simplify problem solving for quite some time (Knoblock, 1990; Dean and Lin, 1995).

1.1 Motivation

The environment contains a lot of structural information. As humans we detect such structures all the time in order to perform efficiently in the environment. For example, consider the simple task of getting a coffee from the snacks room down the corridor. This environment contains many useful structures, such as the door to the corridor, the coffee table, and the coffee machine. Conventional controlled dynamical artificial systems act by learning sequences of primitive actions at individual states in the state space (the policy is of the form "at state s take action a"). Although such policies are adequate for smaller domains, they prove highly inefficient and unnecessary for larger domains and real-world problems, which contain a lot of structural information. Moreover, defining generalizations that transfer such policies across domains proves to be a highly difficult task. Hence we need an approach that addresses these issues.

Markov decision processes (MDPs) are widely used in reinforcement learning (RL) to model controlled dynamical systems as well as sequential decision problems (Barto et al., 1995). Many researchers have developed algorithms that determine optimal or near-optimal decision policies for MDPs. Such algorithms are generally very subjective and task specific. Moreover, algorithms that identify decision policies on MDPs scale poorly with the size of the task. The structure present in a task often provides ways to simplify and speed up such algorithms, and identifying useful structures often provides ways to generalize such algorithms over multiple tasks without relearning policies for the entire task.
One approach to using the task structure involves identifying a hierarchical description in terms of abstract states and extended actions between abstract states (Barto and Mahadevan, 2003). The decision policy over the entire state space can then be decomposed over abstract states, simplifying the solution to the current task and allowing for reuse of partial solutions in other related tasks. While several frameworks have been proposed for hierarchical RL (HRL), in this work we focus on the options framework, though our results generalize to other frameworks as well. We propose a framework, PCCA+HRL, that exploits the structural properties of the underlying transition model while providing methods to generate option models. We show how such generative models speed up learning by removing the requirement of learning options in MDPs.

Hierarchical decomposition exploits task structure by introducing models defined by stand-alone policies (known as macro-actions, temporally extended actions, options, or skills) that can take multiple time steps to execute. Skills can exploit repeating structure by representing subroutines that are executed multiple times during the solution of a task. If a skill has been learned in one task, it can be reused in other tasks that require execution of the same subroutine. Options also enable more efficient exploration by providing high-level behavior that lets a decision maker look ahead to the completion of the associated subroutine.

One of the crucial questions in HRL is where the abstract actions come from. There has been much research on automated temporal abstraction. One approach is to identify bottlenecks (McGovern, 2002; Şimşek et al., 2005) and cache away policies for reaching such bottleneck states as options.
Another approach is to identify strongly connected components of the MDP using clustering methods, spectral or otherwise, and to identify access states that connect such components (Menache et al., 2002; Meuleau et al., 1998). Yet another approach is to identify the structure present in a factored state representation (Hengst, 2002; Jonsson and Barto, 2006) and find options that cause what are normally infrequent changes in the state variables. While these methods have had varying amounts of success, they have certain crucial deficiencies. Bottleneck-based methods have no natural way of identifying in which part of the state space the options are applicable without some form of external knowledge about the problem domain. Spectral methods need some form of regularization in order to make them well conditioned, and this can lead to arbitrary splitting of the state space (e.g., see Figure 4.3).

Exploiting spatial structures in the environment provides an autonomous way to learn object-oriented decision policies. Usually the environment consists of regions (which we will call macro states or metastable states) which are very well connected internally, with sparse connections to other densely connected regions. Spatial abstraction provides a way to group such well-connected states. The idea here is that among densely connected states with low mixing times, it is easy for an autonomous agent to find its way around the region. Hence, given a mechanism to generate good options within these regions, the learning problem reduces to finding decision policies across such metastable states rather than learning policies over the entire state space.

1.2 Objective

In this work we propose a robust approach for automated skill acquisition based on the interaction of an RL agent with its environment. PCCA+HRL addresses all the aforementioned issues associated with previous work.
This framework detects well-connected or metastable regions of the state space using PCCA+, a spectral clustering algorithm. We then propose a very effective way of composing options within the same framework to take us from one metastable region to another. Given these options, we use SMDP value learning over the subtasks to solve the given task. One important contribution of this work is that, while exploiting the structural aspects of the state space, we propose a way of composing options that is robust across tasks. Indeed, we see in Chapter 7 that this method of composing options while exploiting the structure present in the state space gives us a distinct advantage over previous work.

An additional significance of PCCA+HRL is that it does not assume any model of the MDP. The spatial abstraction is detected on sampled trajectories, as shown in Chapter 5. We observe that even with limited sampling experience our method was able to learn reasonably good skills. Such abstractions are particularly useful in situations where exploration is limited by environment costs. PCCA+HRL also provides a way to refine these abstractions on every consecutive run without explicitly reconstructing the entire MDP; since we do not need to learn options, the proposed framework allows us to do this effectively, as shown in Chapter 5. Moreover, with PCCA+HRL we are able to run PCCA+ to find a suitable spatial abstraction on sampled trajectories; the computational time complexity therefore decreases drastically, whereas it is usually quite expensive for other spectral-clustering-based methods.

PCCA+HRL gives us several advantages:

1. The clustering algorithm produces characteristic functions that describe the degree of membership of each state in a given metastable region. These characteristic functions give us a powerful way to naturally compose options.
Chapter 5 shows how we use these characteristic functions to define the subtask dynamics.

2. The PCCA+ algorithm also returns connectivity information between the metastable regions. This allows us to construct an abstract graph of the state space, where each node is a metastable region, thus combining spatial and temporal abstraction meaningfully. This graph serves as the base for deriving good policies very rapidly when faced with a new task. While it is possible to create such graphs with any set of options, typically some ad hoc processing has to be done based on domain knowledge (Lane and Kaelbling, 2002). Even then, the structure of the graphs so derived would depend on the set of options used, and may not reflect the underlying spatial structure completely.

3. Since the clustering algorithm in PCCA+HRL looks for well-connected regions and not bottleneck states, it discovers options that are better aligned to the structure of the state space.

4. PCCA+HRL can acquire skills from sampled trajectories. Hence it does not require any prior model of the MDP. This is particularly useful in situations where the environment model is not available and exploration is costly.

5. It has low computational time complexity compared to other spectral-clustering-based methods, because the time complexity is a function of the length of the sampled trajectory rather than of the entire state space.

6. PCCA+HRL allows us to reuse policies across similar subtasks.

7. Since PCCA+HRL only requires transition information gathered through sampled trajectories, the method can be used with symbolic state representations of domains, as shown in the Mario game (see Section 3.5).

This approach is based on ideas from conformation dynamics. Dynamical systems may possess certain regions of states between which the system switches only once in a while.
The reason is the existence of dynamical bottlenecks which confine the system for very long periods of time in some regions of the phase space, usually referred to as metastable states. Metastability is characterized by the slowly changing dynamics of the system. Analogously, a subtask can be associated with particular dynamics, where the distinction between subtasks is defined in terms of the dynamics associated with each subtask. Hence, if the agent knows a good way of solving a particular subtask (in other words, if the agent exhibits good dynamics which take it to the relevant dynamical bottleneck soon enough), we can learn a recursively optimal policy to solve the entire task (Chapter 6).

1.3 Organization

The rest of this thesis is organized as follows. Chapter 2 gives an overview of PCCA+HRL using a simple example on a 2 room world domain. Chapter 3 gives the background on the problem PCCA+HRL attempts to solve while explaining the various terminologies used in this thesis. In the subsequent chapters, we specify a complete algorithm for detecting metastable regions and for efficiently discovering options for different subtasks by reusing experience gathered while estimating the model parameters. It should be noted that our approach to option discovery can depend on the underlying reward structure or just on the transition structure of the MDP. We demonstrate the utility of PCCA+ in discovering metastable regions, and empirically validate PCCA+HRL on several RL domains, including the Taxi domain and the Infinite Mario project. We observe in Section 7.2 that PCCA+HRL is able to acquire good skills for the Taxi domain which are quite intuitive. Further, we observe in Section 7.3 that PCCA+HRL is able to acquire skills to learn object-oriented policies for Mario to solve the entire task. Joshi et al. (2012) show the significance of learning object-oriented policies.
The contribution of our work is that PCCA+HRL acquires and learns such policies entirely autonomously. We do not know of any other work in hierarchical reinforcement learning which does this as effectively. Another advantage of learning object-oriented policies is the added capability to reuse policies to solve other subtasks. This is indeed what we observe in Mario play, as PCCA+HRL reuses policies across different frames when faced with a similar subtask. This capability also allows us to use a smaller feature representation of the state space, aliasing similar states with the same symbol.

CHAPTER 2

Overview

Figure 2.1: 3 room world illustration domain

In this chapter we give a brief overview of the working of PCCA+HRL. Let us consider a simple 3 room domain, as shown in Figure 2.1, for illustration purposes. The world is composed of rooms separated by walls, marked in black. For the current task, let us assume that the goal of the agent is to start from the tile marked S and reach the goal tile marked G. The agent can move one step north, east, west or south. PCCA+HRL starts by constructing a transition model from a sampled trajectory. We will assume for the current illustration that the agent samples a trajectory using a uniform random walk, but later, in Chapter 5, we introduce a more sophisticated sampling method based on the idea of UCT (Kocsis and Szepesvári, 2006). We then use this sampled random trajectory to find suitable spatial abstractions using the spectral clustering algorithm PCCA+. PCCA+ returns a membership function for every identified abstract state. We plot these membership functions in Figure 2.2. We observe that PCCA+HRL segments the state space into rooms, defined by degrees of membership of states to rooms. One important point to note here is that the rooms are very well connected within, with low mixing times, while there are evidently very few connections across rooms.
Such connections have conventionally been labeled bottlenecks (McGovern, 2002; Şimşek et al., 2005). Note that our method does not explicitly look for bottlenecks; the identification of bottlenecks is intrinsic to the way PCCA+HRL identifies the spatial abstraction. This can be useful when it is required to define the spatial abstraction based on the underlying reward structure along with the transition information. Identifying such a spatial abstraction reduces the problem of learning policies for transitioning from each state (for example, take a step north or take a step south) to learning policies for transitioning from each abstract state (for example, go to room 2). As one can infer, such policies are object-oriented policies, where the objects are special structures detected in the environment.

(a) Room 1 (b) Room 2 (c) Room 3 (d) World Segmentation
Figure 2.2: Membership Functions for the 3 room world domain in Figure 2.1

We now compose options using the membership functions, as demonstrated in Chapter 5. One of the options composed using this method is shown in Figure 2.3. The figure shows the stochastic option policy as arrows, where the length of each arrow is proportional to its respective probability. The colored portion is the initiation set of the option, with the background gradient showing the termination condition of the option (a lighter background implies termination with higher probability, βwhite = 1).

Figure 2.3: Sample Option Model.
Figure 2.4: Policy over meta states. Paths marked in red are part of the optimal policy.

One important point to note here is that the option model is not learned but computed directly from the membership functions. Though there are no guarantees on the optimality of the option policy constructed, it is observed that these option models are generally close to optimal, and sometimes optimal, as shown in Figure 2.5. This is particularly useful for large domains, as we see in Chapter 7.
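The general idea of composing an option from membership functions can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the exact PCCA+HRL procedure: the option initiates where source membership is high, its stochastic policy favours actions whose expected next state has higher destination membership, and it terminates as source membership decays.

```python
import numpy as np

def compose_option(chi_src, chi_dst, P, actions, threshold=0.5):
    """Illustrative sketch: assemble an option from membership functions.

    chi_src, chi_dst : membership degrees (one entry per state) for the
                       source and destination metastable regions.
    P                : P[a][s, s'] one-step transition probabilities per action.
    Returns (initiation set, stochastic policy, termination probabilities).
    """
    n = len(chi_src)
    # Initiation set: states belonging mostly to the source region.
    initiation = {s for s in range(n) if chi_src[s] > threshold}
    # Stochastic policy: weight each action by the expected gain in
    # destination membership after taking it (heuristic, assumed here).
    policy = np.zeros((n, len(actions)))
    for s in range(n):
        gains = np.array([P[a][s] @ chi_dst - chi_dst[s] for a in actions])
        gains = np.maximum(gains, 0.0)
        policy[s] = gains / gains.sum() if gains.sum() > 0 else 1.0 / len(actions)
    # Termination: probability grows as source membership falls,
    # i.e. once the agent has left the source region.
    beta = 1.0 - np.asarray(chi_src)
    return initiation, policy, beta
```

On a 3-state chain with chi_src = [1, 0.5, 0] and chi_dst = [0, 0.5, 1], the policy deterministically prefers the "move right" action in states 0 and 1, and terminates with probability 1 in state 2.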
Given such option models, we can use any of our favorite learning methods over the abstract states to compute a recursively optimal decision policy. For our experiments we use SMDP value learning (Sutton et al., 1999), unless otherwise specified. For this problem the estimated SMDP policy, which is a trivial one, is shown in Figure 2.4. The entire solution to the problem is shown in Figure 2.5. Figure 4.2a shows the solution to reach the abstract state containing the goal state. Since we only used the transition information to find structures in the state space, the spatial abstraction does not give us enough information to reach the goal state once the agent is inside the abstract state containing it. Once there, the agent can use any MDP solver to reach the goal state; this is an easy problem, since all abstract states are densely connected regions. We also plot the entire solution in Figure 4.2b for the case where PCCA+HRL uses the transition as well as the reward structure of the underlying MDP, as discussed in Chapter 5. This way we detect 4 abstract states, with the fourth meta state being the lone goal state itself. In this case the option models for the abstract state containing the goal state can be constructed directly from the information extracted through the structures identified in the state space. The interesting observation here is that we could solve this entire problem with absolutely no learning (except the decision to choose the option to go from abstract state 2 to 3 rather than the option to go from abstract state 2 to 1), while any flat RL method would have to learn the entire decision policy to go from the starting state to the goal state. Such scenarios are frequently seen in large domains, where a subset of the problem, if not all of it, can be solved without explicitly learning decision policies.
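The SMDP value learning step over abstract states can be sketched with the standard SMDP Q-learning update from the options framework (Sutton et al., 1999); the function and variable names below are illustrative, not the thesis's implementation:

```python
def smdp_q_update(Q, s, o, reward, k, s_next, options_at, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning update after executing option o from state s.

    reward : cumulative discounted reward accrued while the option ran,
             i.e. sum_{t=0}^{k-1} gamma^t * r_t
    k      : number of primitive time steps the option took
    Q      : dict mapping (state, option) pairs to values
    """
    best_next = max(Q[(s_next, o2)] for o2 in options_at(s_next))
    # Q(s,o) <- Q(s,o) + alpha * [R + gamma^k * max_o' Q(s',o') - Q(s,o)]
    Q[(s, o)] += alpha * (reward + gamma ** k * best_next - Q[(s, o)])
    return Q
```

Because an option spans k primitive steps, the future value is discounted by gamma**k rather than by a single gamma, which is what distinguishes the SMDP update from the flat Q-learning rule.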
(a) Policy composed from transition structure only (b) Policy composed from transition and reward structure
Figure 2.5: Solution to the illustration problem in Figure 2.1

CHAPTER 3

Background

In this chapter, we provide background on the problem that our framework attempts to solve, and we introduce notation for concepts used throughout the thesis.

3.1 A Reinforcement Learning Problem and its Related Random-Walk Operator

Markov Decision Processes (MDPs) are widely used to model controlled dynamical systems (Barto et al., 1995). Consider a discrete MDP $M = (S, A, P, R)$ with a finite set of discrete states $S$, a finite set of actions $A$, a transition model $P(s, a, s')$ specifying the distribution over future states $s'$ when an action $a$ is performed in state $s$, and a corresponding reward model $R(s, a, s')$ specifying a scalar cost or reward. The set of actions possible at a state $s$ is given by $A(s)$. A policy $\pi : S \times A \to [0, 1]$ specifies a way of behaving. The state value function $V^\pi : S \to \mathbb{R}$ is the expected discounted return when following $\pi$. We denote by $V^* = \max_\pi V^\pi$ the optimal value function. A policy that is greedy with respect to the optimal value function is an optimal policy. Similarly, one can define action-value functions $Q^\pi$ and $Q^*$, which allow the first action choice to differ from $\pi$. Value functions obey recursive sets of equations, called Bellman equations. For example, the state-value function $V^\pi$ obeys

$$V^\pi(s) = \sum_{a} \pi(s, a)\Big[R(s, a) + \gamma \sum_{s'} P^a_{ss'} V^\pi(s')\Big], \quad \forall s \in S.$$

Temporal-difference (TD) methods are incremental learning rules derived from the Bellman equations, which can compute value functions directly without using a model. For example, TD(0) performs the following update on every time step $t$:

$$V(s_t) \leftarrow V(s_t) + \alpha \delta_t, \qquad \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),$$

where $\alpha \in [0, 1)$ is a learning rate parameter, possibly dependent on $t$. Under appropriate conditions, $V \to V^\pi$ as $t \to \infty$.
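The TD(0) rule maps directly to code. A minimal sketch with a dictionary-based value table and a hypothetical one-step transition:

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.95):
    """TD(0): V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma*V(s_{t+1}) - V(s_t)]."""
    delta = r_next + gamma * V[s_next] - V[s]   # temporal-difference error
    V[s] += alpha * delta
    return V

# Toy example: three states, all values initialized to zero; the agent
# moves from state 0 to state 1 and receives reward 1.0.
V = {0: 0.0, 1: 0.0, 2: 0.0}
V = td0_update(V, s=0, r_next=1.0, s_next=1)
# V[0] is now 0.1 * (1.0 + 0.95 * 0.0 - 0.0) = 0.1
```

Note that no transition model P is used anywhere in the update; the value estimate is adjusted purely from the sampled reward and successor state, which is what makes TD methods model-free.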
The state space can also be modeled as a weighted graph G = (V, W). The states S correspond to the vertices V of the graph. The edges of the graph are given by E = {(i, j) : W_ij > 0}. Two states s and s′ are considered adjacent if there exists an action with non-zero probability of transition from s to s′. This adjacency matrix A provides a binary weight (W = A) that respects the topology of the state space (Mahadevan, 2005). The weight matrix on the graph and the adjacency matrix W will be used interchangeably from here onwards. A random-walk operator can be defined on the state space of an MDP as T = D⁻¹W, where W is the adjacency matrix and D is a valency matrix, i.e., a diagonal matrix with entries corresponding to the degree of each vertex (the row sums of W). This random-walk operator models an uncontrolled dynamical system extracted from the MDP.

3.2 Spectral Graph Theory

We want to detect suitable spatial structures in the environment so that we can define temporal abstractions which are aligned to the structure of the underlying MDP. Algebraic methods have proven to be especially effective in treating graphs which are regular and symmetric. It has been observed that the spectrum of a graph, constituted by its eigensystem, provides a suitable embedding that supplies a good picture of the underlying planar graph. Eigenvalues are closely related to almost all major invariants of a graph, linking one extremal property to another. There are various algebraic representations of a graph. We use the Laplacian (L) to represent a graph in matrix form. The Laplacian allows a natural link between discrete representations, such as graphs, and continuous representations, such as vector spaces and manifolds. The most important application of the Laplacian is spectral clustering, which corresponds to a computationally tractable solution to the graph partitioning problem (Luxburg, 2007).
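The random-walk operator T = D⁻¹W is just a row normalization of the weight matrix; a minimal sketch (the example adjacency matrix is illustrative):

```python
import numpy as np

def random_walk_operator(W):
    """T = D^-1 W: row-normalize the (weighted) adjacency matrix, where D is
    the diagonal valency matrix of vertex degrees (the row sums of W)."""
    degrees = W.sum(axis=1)
    return W / degrees[:, None]
```

Each row of T sums to 1, so T is a stochastic matrix describing one step of an uncontrolled random walk on the state graph.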
Hence it provides a very elegant way to detect suitable spatial abstractions in the environment. In the literature there is no unique convention for which matrix exactly is called the "graph Laplacian". A good graph Laplacian has the following properties:
• For every vector f ∈ ℝⁿ we have f′Lf = ½ Σ_{i,j=1}^{n} w_ij (f_i − f_j)².
• L is symmetric and positive semi-definite.
• The smallest eigenvalue of L is 0; the corresponding eigenvector is the constant one vector 1.
• L has n non-negative, real-valued eigenvalues 0 = λ₁ ≤ λ₂ ≤ · · · ≤ λₙ.
For the current discussion we will use the normalized graph Laplacian, defined as L = D⁻¹(D − W), because it is easy to compute from the random-walk operator defined above. In fact, for all our purposes we can replace the Laplacian with the random-walk operator, since T = I − L, where I is the identity matrix.

3.3 A Spectral Clustering Algorithm: PCCA+

Given an algebraic representation of the graph representing an MDP, we want to find suitable abstractions aligned to the underlying structure. We use a spectral clustering algorithm to do this. There have been many successful applications of spectral clustering methods to real-world data, as well as other works on spatial abstraction (e.g., Shi and Malik (2000); White and Smyth; Wolfe and Barto (2005); Shahnaz et al. (2006); Cai et al. (2005)). Central to the idea of spectral clustering is the graph Laplacian, which is obtained from the similarity graph (Luxburg, 2007). There are many tight connections between the topological properties of graphs and the graph Laplacian matrices, which spectral clustering methods exploit to partition the data into clusters. In this work we use a clustering approach that exploits the structural properties in the configuration space of objects as well as in the spectral subspace. We take inspiration from the conformation dynamics literature, where Weber et al. (2004) perform a similar analysis to detect conformations of a dynamical system.
They propose a spectral clustering algorithm, PCCA+, which is based on the principles of Perron Cluster Analysis of the transition structure of the system. We extend their analysis to detect spatial abstractions in autonomous controlled dynamical systems. Using this algorithm gives us several advantages:
• Global as well as local information to define spatial abstractions. It exploits the local structural properties of the underlying data space by using pairwise similarity functions, while using spectral methods to encode the global structural properties.
• A formal notion of macro states as vertices of a simplex in the eigen-subspace of the Laplacian. The clustering is performed by minimizing deviations from a simplex structure and hence does not require any arbitrary regularization term. The clustering procedure does not assume anything about the underlying structure, and the mapping to a simplex is inherently built into the properties of the Laplacian.
• Characteristic functions that describe the degree of membership of each state to a given abstract state. We can interpret the membership functions as the likelihood of a state belonging to a particular abstract state (see Singpurwalla and Booker (2004)). The algorithm can also generate a crisp partitioning of the states into abstract states, as and when required.
• Connectivity information between the abstract states, which is often useful in dynamical systems. For example, one might want the connectivity information between abstract states in order to learn decision policies across such states, as shown in Chapter 6.

3.3.1 Numerical Analysis

For a graph with disjoint components, i.e., one whose similarity matrix can be reduced to a block diagonal form, the Laplacian matrix L̃ (from here on we will refer to the random-walk operator as the Laplacian, as described above) has a block structure, where each block is a matrix corresponding to the Laplacian of the disjoint set of vertices.
Each vertex vᵢ ∈ V can be mapped to the i-th row of the eigenvector matrix Ỹ ∈ ℝ^{n×k}, where n is the number of vertices and k is the number of eigenvectors corresponding to eigenvalues equal to 1. As it turns out, the eigenvectors of the Laplacian matrix can be interpreted as indicators of membership of each object to a suitable disjoint set.

Lemma 1. The eigenvectors of a Laplacian L with a block diagonal structure form the vertices of a simplex σ̂^{k−1} (Weber et al. (2004)).

Proof. Each block matrix has its own Laplacian L̂; since the rows of the Laplacian sum to 1, a vector with all identical elements is an eigenvector of this system. Transforming this to the full Laplacian matrix, the components of each of the eigenvectors [Y₁, Y₂, . . . , Yₖ] of L are pairwise identical for indices corresponding to the same block. Regarding the distinct rows of Y as k points in ℝᵏ, they form the vertices of a simplex, because by definition the convex hull of k distinct points in ℝᵏ forms a simplex σ̂^{k−1}. We call these vertices in the eigenspace ℝᵏ the (first-order) clusters Cₖ.

The Laplacian of a system which exhibits connections across groups can be approximated as a perturbation of the disjoint case:

L̃ = L + εL⁽¹⁾ + ε²L⁽²⁾ + . . .

where L⁽¹⁾, L⁽²⁾, . . . are respectively the first-order and higher-order Laplacian perturbation terms, and ε is the perturbation parameter. From the perturbation analysis of this equation, the perturbations of the eigenvectors and eigenvalues can similarly be written as

Ỹ = Y + εY⁽¹⁾ + O(ε²)
Λ̃ = Λ − εΛ⁽¹⁾ − O(ε²)

Lemma 2. The first-order perturbation term satisfies Y⁽¹⁾ = Y B, where B ∈ ℝ^{k×k} is a linear mapping ℝᵏ ↦ ℝᵏ.

Proof.
Consider the i-th eigenvector: L̃ỹᵢ = λ̃ᵢỹᵢ. Writing this in terms of the perturbation expansion and matching terms of the same order (for the first-order perturbation), we get

Lyᵢ⁽¹⁾ + L⁽¹⁾yᵢ = λᵢyᵢ⁽¹⁾ − λᵢ⁽¹⁾yᵢ

and therefore

(L⁽¹⁾ + λᵢ⁽¹⁾I)yᵢ + (L − λᵢI)yᵢ⁽¹⁾ = 0

Taking the dot product of this equation with another eigenvector yⱼ, we get

⟨(L⁽¹⁾ + λᵢ⁽¹⁾I)yᵢ, yⱼ⟩ + ⟨(L − λᵢI)yᵢ⁽¹⁾, yⱼ⟩ = 0
⟨(L⁽¹⁾ + λᵢ⁽¹⁾I)yᵢ, yⱼ⟩ + ⟨yᵢ⁽¹⁾, (L − λᵢI)yⱼ⟩ = 0

The second term vanishes because (L − λᵢI)yⱼ = 0; hence the first term is zero as well. This implies that (L⁽¹⁾ + λᵢ⁽¹⁾I) is a linear transformation which takes the vector yᵢ and transforms it either perpendicular to itself or onto itself, because ⟨yᵢ, yⱼ⟩ = 0. Also,

yᵢ⁽¹⁾ = (λᵢI − L)⁻¹(L⁽¹⁾ + λᵢ⁽¹⁾I)yᵢ

Hence we get yᵢ⁽¹⁾ = Byᵢ.

This implies that the perturbation of the simplex structure can at most be of order O(ε²) (see Figure 4.1). In other words, the simplex structure is preserved under first-order perturbations, while higher-order perturbations deform it. Hence we have a formal definition of clusters, in the abstract sense, as the vertices Cₖ of this simplex structure.

Def 1. A vertex vᵢ is said to belong to the cluster Cₖ with perfect membership if Y(i, :) = Cₖ.

In soft clustering, a continuous indicator of membership χ̃ᵢ : V ↦ [0, 1] assigns a grade of membership between 0 and 1 to each vertex vᵢ ∈ V. Therefore, a vertex may correspond to different clusters with different grades of membership. For each vertex v ∈ V the sum of the grades of membership with regard to the different clusters is 1, i.e.

Σ_{i=1}^{k} χ̃ᵢ(v) = 1

Each vertex is represented by a vector (χ₁(v), . . . , χₖ(v)) ∈ ℝᵏ. Since these vectors are positive and the partition of unity holds, they lie in the standard simplex σ^{k−1} spanned by the k unit vectors of ℝᵏ.
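Lemma 1 is easy to verify numerically: for a block-diagonal random-walk operator, the rows of the eigenvalue-1 eigenvector matrix Y collapse to one distinct point per block, whichever basis of that eigenspace the solver happens to return. A small sketch (the two-clique graph is illustrative):

```python
import numpy as np

# Two disjoint 3-cliques: the random-walk operator is block diagonal.
W = np.zeros((6, 6))
W[:3, :3] = 1 - np.eye(3)   # clique on vertices 0-2
W[3:, 3:] = 1 - np.eye(3)   # clique on vertices 3-5
T = W / W.sum(axis=1, keepdims=True)

vals, vecs = np.linalg.eig(T)
Y = np.real(vecs[:, np.isclose(vals, 1.0)])  # eigenvectors for eigenvalue 1

# Rows of Y are pairwise identical within a block ...
assert np.allclose(Y[0], Y[1]) and np.allclose(Y[0], Y[2])
assert np.allclose(Y[3], Y[4]) and np.allclose(Y[3], Y[5])
# ... and the two blocks map to distinct points (the simplex vertices).
assert not np.allclose(Y[0], Y[3])
```

The property holds for any basis of the eigenspace, since every eigenvector with eigenvalue 1 is a linear combination of the block indicator vectors and is therefore constant within each block.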
Therefore, clustering can be seen as a simple linear mapping from the rows of Y to the rows of a membership matrix χ̃. The linear mapping is expressed by a regular k × k matrix A:

χ̃ = Ỹ A

This matrix maps the vertices of the simplex contained in the rows of Y onto the vertices of the simplex σ^{k−1}. Therefore, if one finds the indices π₁, . . . , πₖ ∈ [1, N] of the vertices in Y, one can construct the linear mapping as

A⁻¹ = [ Ỹ_{π₁,1} · · · Ỹ_{π₁,k} ; . . . ; Ỹ_{πₖ,1} · · · Ỹ_{πₖ,k} ]

i.e., the k × k matrix whose rows are the simplex-vertex rows Ỹ(π₁, :), . . . , Ỹ(πₖ, :). Weber et al. (2002) show that a solution for A exists if and only if the convex hull of the rows of Ỹ is a simplex. From the perturbation analysis we know that this is the case up to a deviation of order O(ε²). To partition the data, as and when required, we assign each state to the partition numbered P(s) = arg maxₖ χ̃ₖ(s).

To estimate the number of clusters k, the spectral gap is used as an indicator of deviation from the simplex structure; spikes in the eigenvalue distribution indicate the presence of group structure in the graph. The connectivity information between the various clusters can also be recovered from the membership functions.

Def 2. The connectivity information across different clusters is given by

L_macro = χ̃ᵀ L̃ χ̃

where L_macro is the Laplacian on the space of macro states. In this representation each cluster is represented by a single node; the connectivity across clusters is given by L_macro(i, j) for i ≠ j, while the relative connectivity within a cluster is given by L_macro(i, i) for all i.

3.3.2 Options

Our goal is to define temporally abstract policies (i.e., the ability to reason at a higher level than just primitive actions) on the structures identified in the environment. We use the framework of options, which can be viewed as fixed policies with preconditions and termination conditions.
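Given Y and the vertex indices π, the mapping χ̃ = ỸA with A⁻¹ built from the vertex rows is a single matrix solve; for an exactly block-diagonal system it recovers crisp indicator memberships. A sketch (the toy eigenvector matrix is illustrative):

```python
import numpy as np

def memberships(Y, vertex_idx):
    """chi = Y @ A, where A^-1 stacks the simplex-vertex rows of Y."""
    A = np.linalg.inv(Y[vertex_idx, :])
    return Y @ A

# Toy eigenvector matrix: rows constant per block (disjoint two-block system).
Y = np.array([[1.0,  0.5],
              [1.0,  0.5],
              [1.0, -0.5],
              [1.0, -0.5]])
chi = memberships(Y, [0, 2])
partition = chi.argmax(axis=1)   # crisp assignment P(s) = argmax_k chi_k(s)
```

Here χ comes out as an exact 0/1 indicator and each row sums to 1, illustrating the partition-of-unity property from Section 3.3.1.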
Formally, an option is a triple (I, µ, β), where I ⊆ S is the initiation set of the option, µ : S ↦ A is the option policy, and β : S ↦ [0, 1] gives the probability of termination in each state s ∈ S. An MDP endowed with an option set becomes a Semi-Markov Decision Process (SMDP). In each SMDP time step, the agent selects an option available at its current state and follows the option's policy µ until termination according to β. For any Markov policy π : S × O ↦ [0, 1], Bellman equations for value functions in the SMDP exist, and they are direct extensions of the Bellman equations in the MDP case, e.g.:

V^π(s) = Σ_{o∈Oₛ} π(s, o) [ R(s, o) + Σ_{s′} P^o_{ss′} V^π(s′) ]

Here, R(s, o) is the expected discounted reward obtained during the execution of the option:

R(s, o) = E[ r_t + · · · + γ^{k−1} r_{t+k−1} | s_t = s, o_t = o ]

where k is the duration of the option. P^o_{ss′} is the option's transition model, defined similarly.

3.4 Taxi Domain

Dietterich (1998) created the taxi task (Figure 3.1) to demonstrate MAXQ hierarchical reinforcement learning. We use the episodic form of the same domain to illustrate our framework. The taxi problem can be formulated as an episodic MDP with three state variables: the location of the taxi (values 1–25), the passenger location including in the taxi (values 1–5, where 5 means in the taxi), and the destination location (values 1–4). Figure 3.1 shows a 5-by-5 grid world inhabited by a taxi agent. There are four specially designated locations in this world, marked as R(ed), B(lue), G(reen), and Y(ellow). The taxi problem is episodic. In each episode, the taxi starts in a randomly chosen square. There is a passenger at one of the four locations (chosen randomly), and that passenger wishes to be transported to one of the four locations (also chosen randomly). The taxi must go to the passenger's location, pick up the passenger, go to the destination location, and put down the passenger there.
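The (I, µ, β) triple maps directly onto code; a minimal sketch of option execution, with a hypothetical deterministic environment step function (all names are illustrative, not from the thesis):

```python
import random

class Option:
    """An option (I, mu, beta): initiation set, option policy, termination."""
    def __init__(self, initiation, policy, termination):
        self.I = initiation        # states where the option may be invoked
        self.mu = policy           # s -> a
        self.beta = termination    # s -> probability of terminating at s

def run_option(step, option, s, rng=random.random):
    """Follow mu from s until beta says terminate; return the visited states."""
    assert s in option.I
    trajectory = [s]
    while True:
        s = step(s, option.mu(s))
        trajectory.append(s)
        if rng() < option.beta(s):
            return trajectory
```

For instance, on a corridor where `step = lambda s, a: s + 1` and β is 1 at state 3 and 0 elsewhere, running the option from state 0 visits states 0 through 3 and then terminates.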
The episode ends when the passenger is deposited at the destination location. There are six primitive actions in this domain: (a) four navigation actions that move the taxi one square North, South, East, or West, (b) a Pickup action, and (c) a Putdown action. There is a reward of −1 for each action and an additional reward of +20 for successfully delivering the passenger. There is a reward of −10 if the taxi attempts to execute the Putdown or Pickup actions illegally.

Figure 3.1: Taxi world domain

This task has a simple hierarchical structure in which there are two main subtasks: Get the passenger and Deliver the passenger. Each of these subtasks in turn involves the subtask of navigating to one of the four locations and then performing a Pickup or Putdown action. The temporal abstraction is obvious: for example, the process of navigating to the passenger's location and picking up the passenger is a temporally extended action that can take different numbers of steps to complete depending on the distance to the target. The top-level policy (get passenger; deliver passenger) can be expressed very simply if these temporal abstractions can be employed. Reusing policies is critical in this domain. For example, if the system could learn how to solve the navigation subtask once, then the solution could be shared by both the "Get the passenger" and "Deliver the passenger" subtasks.

3.5 The Infinite Mario Domain

Infinite Mario is a reinforcement learning domain developed for the Reinforcement Learning Competition 2009. It is a variant of Nintendo Super Mario Bros, a complete side-scrolling video game with destructible blocks, enemies, fireballs, coins, chasms, and platforms. It requires the player to move right to reach the finish line, earning points and powers along the way by collecting coins, mushrooms, and fire flowers and by killing monsters.
The game is interesting because it requires the agent to reason and learn at several levels: from representation to path planning to devising strategies for dealing with the various components of the environment. Infinite Mario has been implemented on RL-Glue, a standard interface that allows connecting reinforcement learning agents, environments, and experiment programs together. The agent is provided information about the current visual scene through arrays.

3.5.1 States

The state space has no set specification. Only a part of the game is visible at a given time, as shown in Figure 3.2. The visual scene is divided into a two-dimensional [16 × 22] matrix of 352 tiles. At any given time step the agent has access to a char array generated by the environment corresponding to the given scene. Each tile (element in the char array) can take one of many values that the agent can use to determine whether the corresponding tile in the scene is a brick, contains a coin, etc. The agent also has access to the locations of the different types of monsters (including Mario) on the scene, along with their vertical and horizontal velocities. The reward structure varies across different instances of the game. For any visual scene, the total number of possible states is 25³⁵², since each tile can take 25 possible values.

3.5.2 Actions

The actions available to the agent are the same as those available to a human player through the gamepad in a game of Mario. The agent can choose to move right or left, or can choose to stay still. It can jump while moving or standing, and it can move at two different speeds.

Figure 3.2: A Mario visual scene

All these actions are accomplished by setting the values of an action array.

3.5.3 Rewards

The reward structure varies across different MDPs. The agent earns a large positive reward when it reaches the finish line. For every step the agent takes in the environment, it gets a small negative reward.
The agent gets a negative reward if it dies before reaching the finish line, and earns positive rewards by collecting coins and mushrooms and by killing monsters. The goal is to reach the finish line, which is at a fixed distance from the starting point, such that the total reward earned in an episode is maximized. Hence, in such a domain, the underlying reward distribution may provide useful information for structure identification during gameplay.

CHAPTER 4
Spatial Abstraction

We use a spectral clustering algorithm to find suitable abstractions from the sampled trajectories. Spectral clustering was made popular by the works of Shi and Malik (2000) (the Normalized Cut algorithm), Meila and Shi (2001), Kannan et al. (2000), etc. Although these methods are known to have many successful applications, they typically work on a case-by-case basis. Though the spectrum of the Laplacian preserves the structural properties of the graph, the methods used thereafter to cluster the data in the eigenspace of the Laplacian do not guarantee this. For example, Ng et al. (2001) and Shi and Malik (2000) use k-means clustering in the eigenspace of the Laplacian, which only works if the clusters lie in disjoint convex sets of the underlying eigenspace, while Meila and Shi (2001) use projections onto the largest k eigenvectors to partition the data into clusters, which does not preserve the topological properties of the data lying in the eigenspace of the Laplacian. This is because projection methods do not incorporate the geometric constraints due to the underlying structure in the eigenspace: points close in Euclidean distance may be far apart on the manifold. In this work we want a clustering approach that exploits the structural properties in the configuration space of objects as well as in the spectral subspace, quite unlike earlier methods. Hence we use PCCA+ (see Algorithm 1) as the spatial abstraction algorithm on sampled trajectories.
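PCCA+'s first step picks the number of clusters k from the eigenvalue spectrum. Algorithm 1 states this as a thresholded relative-gap test; the sketch below uses the simpler largest-eigengap reading of the same idea, which is one common way to operationalize "spikes in the eigenvalues" (the eigenvalue list is illustrative):

```python
def choose_k(eigvals):
    """eigvals sorted in descending order; return the k at which the
    eigengap e_k - e_{k+1} is largest (the 'first jump' in the spectrum)."""
    gaps = [eigvals[i] - eigvals[i + 1] for i in range(len(eigvals) - 1)]
    return 1 + max(range(len(gaps)), key=gaps.__getitem__)
```

For a spectrum such as [1.0, 0.99, 0.98, 0.4, 0.35] the jump after the third eigenvalue yields k = 3, matching the 3-room world example discussed later.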
Datasets are represented as an adjacency graph by defining a pairwise similarity function between pairs of states. Using these similarity functions for all the data points, we construct a similarity matrix S, and from this adjacency information we construct the Laplacian. The spectrum of this Laplacian is computed, which encodes the structural properties of the underlying graph. We then find the best transformation of the spectrum, such that the transformed basis aligns itself with the clusters of data points in the eigenspace. Then we use the projection method described in Section 3.3.1 to find the membership of each of the data points to a set of special points lying on the transformed basis, which we identify as the vertices of a simplex, as described in Section 3.3.1. The spectral-gap method is used to estimate the number of clusters k (line 3 in Algorithm 1). This is used to find the simplex in the ℝᵏ eigen-subspace.

Figure 4.1: Simplex under first-order and higher-order perturbation

An important point to note here is that while the secondary clustering might look similar to the one in N-Cut, it is in fact quite different from the N-Cut algorithm. N-Cut performs k-means clustering after projecting the data onto the top-k eigenvectors, while we assign memberships to the identified vertices in the eigenspace. These vertices need not coincide with the centroids of the data points.

Algorithm 1 PCCA+
1. Construct L from the similarity matrix S.
2. Compute the first n eigenvalues (in descending order) of L.
3. Choose the first k eigenvalues for which (e_k − e_{k+1})/(1 − e_{k+1}) > t_c (the spectral gap threshold). Compute the eigenvectors corresponding to the eigenvalues (e₁, e₂, . . . , e_k) and stack them as column vectors in a matrix Y.
4. Denote the rows of Y by Y(1), Y(2), . . . , Y(N) ∈ ℝᵏ.
5. Define π₁ as the index for which ∥Y(π₁)∥₂ is maximal. Define γ₁ = span{Y(π₁)}.
6. For i = 2, . . . , k: define πᵢ as the index for which the distance to the hyperplane γ_{i−1}, i.e.
∥Y(πᵢ) − γ_{i−1}∥₂, is maximal. Define γᵢ = span{Y(π₁), . . . , Y(πᵢ)}. The distance to the hyperplane is computed as

∥Y(πᵢ) − γ_{i−1}∥₂ = ∥Y(πᵢ) − γ_{i−1}(γ_{i−1}ᵀ γ_{i−1})⁻¹ γ_{i−1}ᵀ Y(πᵢ)∥₂

treating γ_{i−1} as the matrix whose columns are Y(π₁), . . . , Y(π_{i−1}).

Note that under a first-order perturbation the simplex is just a linear transformation about the origin; hence, in order to find the vertices of the simplex σ^{k−1}, we need to find the k points that could form a convex hull such that the deviation of all the points from this convex hull is minimized. Hence, we start by finding the data point located farthest from the origin (line 5 in Algorithm 1), say Ỹ_{π₁}. Then we proceed by finding the data point located farthest from the first point, say Ỹ_{π₂}. We iterate this procedure of finding the data point located farthest from the hyperplane constructed from the previously found points, until we have k data points (see line 6 in Algorithm 1). These data points form the vertices of the simplex. While this approach is superficially similar to Thurau et al. (2012), it is in fact very different, since they operate in the data space and we operate in the eigen-subspace. For example, Figure 4.1 shows a 3-dimensional eigen-subspace with a σ² simplex. The data points are shown as small black dots. Had the system been a completely disjoint system with 3 disjoint sets, the simplex would be aligned with the axes, with the clusters corresponding to the vertices of the simplex (the unit vectors). Since the system is represented as perturbations around this disjoint system, the first-order perturbation linearly transforms the simplex, as shown. Because the data contains higher-order perturbations, the data points do not map exactly onto the vertices of the new simplex, though their deviation from this simplex is minimized. The clusters for this system are the vertices of the new simplex. Thus we have clustering as membership of data points to the vertices of the new simplex.
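Lines 5–6 of Algorithm 1 amount to a farthest-point search with orthogonal projections; a minimal sketch (the toy Y below is illustrative):

```python
import numpy as np

def find_simplex_vertices(Y, k):
    """Indices pi_1..pi_k of rows of Y forming the simplex vertices: start
    from the row farthest from the origin, then repeatedly take the row
    farthest from the span of the rows found so far."""
    pi = [int(np.argmax(np.linalg.norm(Y, axis=1)))]
    for _ in range(1, k):
        basis = Y[pi].T                       # columns span the hyperplane
        proj = basis @ np.linalg.pinv(basis)  # orthogonal projector onto span
        residual = Y - Y @ proj.T
        pi.append(int(np.argmax(np.linalg.norm(residual, axis=1))))
    return pi
```

On a toy matrix whose rows include the unit vectors e₁, e₂, e₃ plus interior convex combinations, the search returns exactly the unit-vector rows as vertices.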
For a graph with a pronounced group structure, the data points will tend to cluster near the vertices of the transformed simplex, while for a graph with high connectivity the data points will spread out over the simplex plane. Hence this framework also contains an intrinsic mechanism for reporting the goodness of the clustering, namely the distribution of the membership functions of a data point across the various clusters: sharp peaks indicate a good clustering, while a more uniform distribution indicates a weak group structure and hence a poor clustering. For illustration, Figure 4.2 shows the simplex identified for the 3-room world illustration domain introduced above. We also show the states (marked as small colored dots) as data points in the eigenspace; the vertices of the simplex are shown as black dots. We also show the eigenvalue distribution for the same domain. As we can see, the eigenvalue distribution has its first jump at k = 3, which indicates a good segmentation threshold for a 3-way partitioning.

Figure 4.2: PCCA+ numerical analysis for the 3-room world domain in Figure 2.1. (a) Simplex. (b) Eigenvalue distribution.

Figure 4.3: Comparison of clustering algorithms on the maze domain. (a) Domain. (b) PCCA+. (c) Kannan. (d) NCut.

Figure 4.4: Comparison of clustering algorithms on a room-inside-room domain. (a) Domain. (b) PCCA+. (c) Kannan. (d) NCut.

Figure 4.5: 20 metastable states. The state-space partitioning identifies 20 metastable states: for every passenger-position and destination pair the algorithm identifies one metastable state, hence 5 × 4 gives 20 metastable states.
The nature of the subtasks which can be assigned is: (a) Pickup, if the taxi is on the pickup location; (b) Putdown, if the taxi is on the destination location; and (c) Navigate, when the taxi is far from the destination or the pickup location, depending on which subtask is being solved.

4.1 Utility of PCCA+

We illustrate the utility of PCCA+ by comparing the partitioning obtained with other state-of-the-art algorithms on a complex maze domain. We also show its utility in partitioning the state space of the taxi domain by interpreting the partitions as natural abstractions over which subtasks can be defined.¹ In order to examine this we use the full transition matrix, although the experiments for detecting subgoals do not assume any prior availability of the transition matrix. The results are summarized in Figures 4.5 and 4.6.

Figure 4.6: Segmentation of the physical space in the taxi domain. (a) Configuration-space abstraction when the passenger is in the taxi, for different destinations (the last layer in Figure 4.5). Note that when the destination and taxi location are the same there is a lone colored state; this state is part of the abstract state corresponding to the Pickup–Putdown subtask. (b) Configuration-space abstraction when the destination is Y (blue) and the passenger is in the taxi.

i) In Figure 4.5 each row of states corresponds to a specific drop-off location and location of the passenger. The final row of pixels corresponds to the passenger being in the taxi. While each of the front rows of pixels gets a unique color, the last row shows fragmentation due to the Navigate option. Each sublayer in the last layer has at least one state with the same color as a sublayer in one of the front 4 layers (which are all differently colored); this corresponds to the Pickup subtasks. Degeneration in the last layer corresponds to the Navigate or Putdown task, depending on the location of the taxi.
ii) Due to the presence of obstacles in the configuration space, the dynamics corresponding to Pickup when the taxi is near the pickup location is much more connected than the dynamics wherein the taxi has to move across the obstacles in the configuration space; see Figure 4.6b. iii) As the number of clusters identified increases, the state space breaks down further, introducing additional metastable states corresponding to other subtasks. iv) We explored more clusters corresponding to the next large local variation in the eigenvalues (which occurred at 50 eigenvalues) and observed more degeneration of the state space, introducing more subtasks corresponding to Pickup when the taxi is near the other pickup locations. We also observed further clustering of the configuration space, which consequently becomes strictly bounded by the obstacles. v) As the number of clusters increases, the Putdown subtasks degenerate earlier than the Pickup subtasks, because any Putdown task leads to an absorbing state.

¹ To see the working of PCCA+ on domains other than spatial abstraction, refer to Appendix A.

CHAPTER 5
Temporal Abstraction

Temporally extended actions are a useful mechanism to reuse experience and provide a high-level description of control. One example of how such policies can be reused is shown in Figure 5.1 for the domain in Figure 4.3: having learned to solve the first task, the skills (shown by the down arrows) need not be learned again in order to solve the second task. This reduces the size of the problem that needs solving. In this chapter we explore the use of such actions across a collection of tasks defined over a common domain. We use the options framework (Sutton et al., 1999) to model these skills, or temporal abstractions.

Figure 5.1: Two episodes with different starting states. (a) Subtasks for the 1st task. (b) Subtasks for the 2nd task. A skill in this context is defined as a transition between rooms.
Skills shown by the down arrows are not learned again; instead they are reused from the first task. We use the partitions of the state space into abstract states, along with the membership function, to define high-level actions in terms of transitions between these abstract states. We also show how to identify the relevant skills needed to solve a given task. We use the structural information obtained to define behavioral policies for the subtasks, independent of the task being solved. We believe that to find a hierarchically optimal solution for the entire task, composed of smaller subtasks, it is not necessary for the agent to solve each subtask optimally. Hence any behavior derived by exploiting the structure present in the knowledge of the domain (from experience over past trajectories and the recent run) can be used. This claim is also strengthened by our observations from experiments run on several domains (Chapter 7). Hence we do not have to learn the option policies; rather, these policies are derived from the structures discovered in the state space.

5.1 Subtask options

Each column of the matrix χ returned by the spatial abstraction algorithm is a membership function defining the degree of membership of each state s in the abstract state Sⱼ. The membership function m_{Sᵢ}(s) can be interpreted as the likelihood of a state s belonging to abstract state Sᵢ (Singpurwalla and Booker (2004)). Hence we can formally interpret the rows of the membership matrix as probability distributions over the abstract states: each element χᵢⱼ can be interpreted as the probability that state i belongs to abstract state j. This gives us an elegant method for composing options to exit from a given abstract state. In the case of multiple exits or bottlenecks, PCCA+HRL is also able to compose multiple options, each taking the agent to the respective exit.

5.1.1 Initiation Set I

Given that the agent is in state s, the initiation set of the option consists of all states i belonging to the same abstract state as s, i.e., all i such that arg maxⱼ χᵢⱼ = arg maxⱼ χₛⱼ.
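The initiation-set test is a one-liner over the membership matrix; a sketch (the toy χ is illustrative):

```python
import numpy as np

def initiation_set(chi, s):
    """All states whose most-likely abstract state matches that of state s."""
    labels = chi.argmax(axis=1)
    return set(np.flatnonzero(labels == labels[s]))

# Toy membership matrix: states 0-1 mostly in abstract state 0, 2-3 in 1.
chi = np.array([[0.9, 0.1],
                [0.8, 0.2],
                [0.3, 0.7],
                [0.1, 0.9]])
```

For example, `initiation_set(chi, 0)` returns the states sharing state 0's most-likely abstract state.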
This is very intuitive: the mixing time within an abstract state is very low, so the agent does not need much deliberation to move around within an abstract state. Therefore a given option can be initiated from anywhere within the respective abstract state.

5.1.2 Option Policy µ

Our goal is to construct option policies that take the agent to their respective exits. The membership functions m_{Sⱼ}(s) provide a very natural way to construct such policies. Suppose the agent is in a state s belonging to the abstract state Sᵢ, and we want to construct a policy that takes the agent to the exit of abstract state Sᵢ adjoining abstract state Sⱼ. Membership functions are stochastic functions which quantify the membership of each state in the respective abstract state. Hence, if the agent follows a stochastic gradient ascent on the membership function m_{Sⱼ}(s) ∀s ∈ Sᵢ, this policy takes the agent to the exit of abstract state Sᵢ adjoining abstract state Sⱼ. For example, consider the 3-room world illustration domain: the membership function for abstract state 2 forms a surface, as shown in Figure 5.2, and gradient ascent on this membership function takes the agent from abstract state 1 to abstract state 2.

Figure 5.2: Policy for the option going from abstract state 1 to abstract state 2 for the domain in Figure 2.1.

Figure 5.3: Room inside a room: the policy for the domain in Figure 4.4. (a) Inner. (b) Outer.

Figure 5.4: A few more examples of policies composed using PCCA+HRL. (a) 2 rooms. (b) χ̃₁, large room. (c) χ̃₂, small room. (d) Object. (e) χ̃₁ without object. (f) χ̃₂ without object. (g) χ̃₁ with object. (h) χ̃₂ with object.

Hence we can define the option policy µ(s, a) (the probability of taking action a in state s), which takes the agent from abstract state Sᵢ to abstract state Sⱼ, as a stochastic gradient function:

µ(s, a) = max( α(s) [ Σ_{s′} P(s, a, s′) m_{Sⱼ}(s′) − m_{Sⱼ}(s) ], 0 )  ∀s ∈ Sᵢ
where α(s) is a normalization constant keeping the values of µ in [0, 1], and P(s, a, s') is the transition model of the MDP, giving the probability of reaching state s' when the agent takes action a in state s.

5.1.3 Termination Condition β

The termination condition is a probability function which assigns the probability of terminating the current option at state s. Equivalently, it is the probability of state s being a decision epoch given the option currently being executed. Using this interpretation, for an option which takes the agent from abstract state S_i to abstract state S_j, β is defined as:

β(s) = min( log(m_{S_i}(s)) / log(m_{S_j}(s)), 1 )   ∀ s ∈ S_i.

There can be other ways to define β, but we use this formulation because it gives a smooth, peaked function with nice mathematical properties (see Figure 5.5).

Figure 5.5: Termination condition for the option going from Abstract State 1 to Abstract State 2 for the domain in Figure 2.1.

CHAPTER 6

PCCA+HRL: An Online Framework for Task Solving

We demonstrate here an online method (Algorithm 3) for efficiently finding spatio-temporal abstractions while the agent is following another policy (not necessarily the policy being learned). The key to our subtask discovery is that the subtasks are identified from the experiences of the agent. We propose two methods for dynamically composing option policies and learning the behavior policy over the subtasks identified. Depending upon the availability of memory, one can use either the latter, in case of memory shortage or when analyzing large state spaces, or the former, in the normal mode of operation. For all our experiments we use the former method. Both methods are inspired by the UCT framework, a rollout-based Monte-Carlo search algorithm. A rollout-based algorithm builds its lookahead tree by repeatedly sampling episodes from the initial state.
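As a rough illustration of this rollout scheme, the following sketch runs uniform random rollouts on a toy chain MDP and keeps incremental value estimates per state-action pair. The environment, constants and seed are all illustrative assumptions, not part of the thesis implementation:

```python
import random

GOAL, GAMMA, MAX_DEPTH = 4, 0.9, 10   # toy chain MDP: states 0..4

def simulate(state, action):
    """Deterministic chain dynamics; reward 1 only on entering the goal."""
    nxt = max(0, min(GOAL, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def search(state, depth, values, counts):
    """One recursive rollout; updates incremental mean values on the way back."""
    if state == GOAL or depth >= MAX_DEPTH:
        return 0.0
    action = random.choice([-1, 1])   # uniform sampling; UCT replaces this
    nxt, reward = simulate(state, action)
    q = reward + GAMMA * search(nxt, depth + 1, values, counts)
    key = (state, action)
    counts[key] = counts.get(key, 0) + 1                       # visit counter
    values[key] = values.get(key, 0.0) + (q - values.get(key, 0.0)) / counts[key]
    return q

def monte_carlo_planning(state, episodes=2000):
    values, counts = {}, {}
    for _ in range(episodes):
        search(state, 0, values, counts)
    return max((1, -1), key=lambda a: values.get((state, a), float('-inf')))

random.seed(0)
best = monte_carlo_planning(2)   # from the middle of the chain
```

With enough rollouts the estimate for moving toward the goal dominates, so `best` is the rightward action.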
The tree is built by adding the information gathered during an episode to it in an incremental manner. The generic scheme of rollout-based Monte-Carlo planning is given in Algorithm 2. The algorithm iteratively generates episodes (line 3), and returns the action with the highest average observed long-term reward (line 5). In procedure UpdateValue the total reward q is used to adjust the estimated value for the given state-action pair at the given depth, together with increasing the counter that stores the number of visits of the state-action pair at that depth. Episodes are generated by the search function, which selects and effectuates actions recursively until some terminal condition is satisfied. This can be reaching a terminal state, or episodes can be cut at a certain depth (line 8). We use the UCT algorithm because it is more effective than vanilla Monte-Carlo planning, where actions are sampled uniformly; UCT instead performs selective sampling of actions. In the case of the second method for constructing the transition model from the agent's experience, selective sampling is not just preferred but a requirement. In UCT, in state s at depth d, the action that maximizes Q_t(s, a, d) + c_{N_{s,d}(t), N_{s,a,d}(t)} is selected, where Q_t(s, a, d) is the estimated value of action a in state s at depth d and time t, N_{s,d}(t) is the number of times state s has been visited up to time t at depth d, and N_{s,a,d}(t) is the number of times action a was selected when state s was visited, up to time t at depth d. The bias term c_{t,s} has the form c_{t,s} = 2 C_p sqrt(ln t / s), where C_p is an empirical constant. An adapted variant of this search method is used in PCCA+HRL. Since we are composing option policies rather than learning them, we replace Q(s, a) with the stochastic gradient function µ(s, a) constructed in chapter 5, where the particular µ chosen corresponds to the greedy option chosen from the option value function.
Hence the search criterion becomes

arg max_a [ max( α(s) ( Σ_{s'} P(s, a, s') m_{S_j}(s') − m_{S_j}(s) ), 0 ) + c_{N_{s,d}(t), N_{s,a,d}(t)} ].

Using this search method we propose the following ways to construct the transition models on which spatio-temporal abstractions can be defined.

Algorithm 2 Monte Carlo Search
function MonteCarloPlanning(state)
  repeat
    search(state, 0)
  until Timeout
  return bestAction(state, 0)

function search(state, depth)
  if Terminal(state) then return 0
  if Leaf(state, depth) then return Evaluate(state)
  action := selectAction(state, depth)
  (nextstate, reward) := simulateAction(state, action)
  q := reward + γ · search(nextstate, depth + 1)
  UpdateValue(state, action, q, depth)
  return q

• For every sampled trajectory as described above, we maintain transition counts φ^a_{ss'} of the number of times the transition s →(a) s' is observed, starting from a prior φ_0. These transition counts are used to populate the local adjacency matrix D and the transition count model U as

D_posterior(s, s') = D_prior(s, s') + Σ_a φ^a_{ss'},
U_posterior(s, a, s') = U_prior(s, a, s') + φ^a_{ss'},

which are updated after each sampled trajectory. At every iteration we then find a suitable spatial abstraction using PCCA+ as described in Algorithm 1. After identifying suitable spatial abstractions (chapter 4) and constructing the corresponding membership functions, we construct subtask options as illustrated in chapter 5, with P(s, a, s') = U(s, a, s') / Σ_{s'} U(s, a, s'). We then use an SMDP value learning algorithm to find an optimal policy over options, which uses these skills to reach the goal state efficiently.

• When running on very large domains with limited memory capacity, instead of storing previous transition counts, one can define the local adjacency matrix and the transition count model on the current trajectory itself.
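The count-based model update in the first method can be sketched as follows; the state and action counts, the prior pseudo-counts and the sample transitions are all illustrative assumptions:

```python
import numpy as np

# Transition counts phi accumulate into U, and P(s,a,s') is obtained by
# normalising U over successor states; D sums U over actions.
n_s, n_a = 3, 2
U = np.full((n_s, n_a, n_s), 0.1)            # prior pseudo-counts U_prior

trajectory = [(0, 1, 1), (1, 1, 2), (2, 0, 1)]   # observed (s, a, s') samples
for s, a, s2 in trajectory:
    U[s, a, s2] += 1.0                        # U_post = U_prior + phi

P = U / U.sum(axis=2, keepdims=True)          # P(s,a,s') = U / sum_{s'} U
D = U.sum(axis=1)                             # D(s,s') = sum_a U(s,a,s')
```

Each slice `P[s, a]` is now a proper probability distribution over successor states, and the observed transitions dominate the prior.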
Although this way we lose a lot of valuable information, it has been observed to work in practice, though only when each newly sampled trajectory is very close to the previous one.

Constructing spatio-temporal abstractions iteratively as described above poses a matching problem for learning policies using SMDP learning techniques. With every sampled trajectory the structure of the transition matrix changes, which in turn changes the spatial abstractions identified. However small this change might be, in order to define the SMDP learning update rule we still need to match the previous options to the new ones. This can be done easily by mapping the vertices of the simplex returned by PCCA+. Consider two iterations which returned the simplex vertices Ỹ_1 and Ỹ_2 respectively, where Ỹ_1 and Ỹ_2 are matrices whose rows denote the locations of the vertices. We use the following matching criterion (similar to defining membership functions as described in Section 3.3.1): we define κ_12 = Ỹ_1 Ỹ_2^{-1}. Using κ_12 we assign vertex i of simplex 1 to vertex j of simplex 2 using the Munkres algorithm (a.k.a. the Hungarian method) (Munkres, 1957), where the match weight between i and j is κ_12(i, j) (equivalently, the distance metric can be defined as 1 − κ_12(i, j)).

6.1 Interpretation of subtasks identified

Given an agent following a random walk in some state space, it is more likely to stay in its current metastable region than make a transition to another metastable region. Transitions between metastable regions are relatively rare events, and are captured in terms of subtasks. The subtasks overcome the natural obstacles created by a domain and are useful over a variety of tasks in the domain. Our framework differs from the

Algorithm 3 PCCA+HRL Online Planning agent
function PCCA+HRL()   (Q ⇒ option value function)
1. Observe initial state s_0
2. Initialize Q arbitrarily (e.g. identically zero)
3. Initialize the transition matrix T
4. U = {}
5.
For e = 1 to the maximum number of episodes:
 (a) Membership function χ, simplex vertices Ỹ = PCCA+(T)
 (b) Find all pairs of connected abstract states C_k = (S_i, S_j) from the non-zero entries of χ^T T χ
 (c) O'_k = O_k ∀ k
 (d) ∀ C_k construct O_k = {I_k, µ_k, β_k} as described in chapter 5
 (e) Match O_k ∀ k (the new set of options composed) with the previous set of options O'_k ∀ k as described in Section 6
 (f) Find the k for which s_0 ∈ C_k(1)
 (g) i = k; s_i = s_0
 (h) While not the end of the episode:
  i. O_i ← arg max_O Q(s_i, O) w.p. 1 − ε, or a uniformly random option O_k = {I_k, µ_k, β_k} s.t. s_i ∈ I_k w.p. ε
  ii. Update Q(s_i, O_i) using any option value-function learning method (we use the SMDP learning method)
  iii. Sample actions according to α(µ_i(s) + c_{N_{s,d}(t), N_{s,a,d}(t)}) as described in Section 3.3.1 and follow until option termination; return the termination state s_t
  iv. φ^a_{ss'} := φ^a_{ss'} + δ^a_{ss'}, where δ^a_{ss'} is the step δ-function: 1 if action a in state s takes the agent to s'
  v. R^a_{ss'} = reward returned while taking action a in state s, taking the system to state s'
  vi. U(s, a, s') := U(s, a, s') + φ^a_{ss'} e^{−ν |R^a_{ss'}|} as described in Section 6
  vii. D(s, s') = Σ_a U(s, a, s');  P(s, a, s') = U(s, a, s') / Σ_{s'} U(s, a, s')
  viii. T(s, s') = D(s, s') / Σ_{s'} D(s, s'), ∀ s, s'
  ix. s_i = s_t

Intrinsic Motivation framework (Barto et al., 2004) in the absence of a predefined notion of saliency or interestingness that depends on the particular domain. Our notion of salient events is restricted to rare events that occur in a random walk in the environment by a particular agent. We are, however, able to automatically identify such events, model subtasks around them, and reuse them efficiently in a variety of tasks.
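The matching step (e) above can be sketched as follows; for small numbers of abstract states a brute-force search over permutations stands in for the Munkres algorithm, and the toy vertex matrices are illustrative:

```python
import numpy as np
from itertools import permutations

def match_vertices(Y1, Y2):
    """Assign vertices of simplex 1 to vertices of simplex 2 by maximising
    the total weight under kappa_12 = Y1 @ inv(Y2)."""
    kappa = Y1 @ np.linalg.inv(Y2)
    k = kappa.shape[0]
    best = max(permutations(range(k)),
               key=lambda p: sum(kappa[i, p[i]] for i in range(k)))
    return list(best)   # best[i] = vertex of simplex 2 matched to vertex i

# Toy example: simplex 2 is simplex 1 with its vertices listed in reverse.
matched = match_vertices(np.eye(3), np.eye(3)[::-1].copy())
```

For realistic numbers of vertices, the brute-force maximisation would be replaced by a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment`.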
As it turns out, these subtasks are precisely object-oriented subtasks, where the objects are the bottlenecks connecting two metastable regions.

6.2 Extending the Transition Model to include the underlying Reward Structure

The underlying transition structure only encodes the topological properties of the state space. There are other kinds of structure in the environment which we would like an autonomous agent to detect. For example, in the 3 room world illustration domain we would also like the agent to detect the goal state as a structure present in the environment. We have already seen in Figure 4.2b the significance of detecting such structures. We call these functional structures, based on the functional properties of the given task. Along with the transition counts, the agent also returns a reward count for each transition. This reward distribution contains all the information we require regarding the functional properties of an environment. Hence it seems natural to extend our notion of the transition model to include the reward distribution, while defining a suitable spatio-temporal abstraction for the same. The motivation for doing so comes from the fact that we would like our spatial abstraction to degenerate at spikes in the reward distribution, because we want our hierarchical learning algorithm to learn decision policies at such epochs. Consider, for example, the same 3 room world domain with the goal state as shown: the goal state renders a spike in the reward distribution at the point where it is located. Given that the agent is in the abstract state surrounding the goal state, we would want to compose temporal abstraction policies which could take the agent to the goal state. This can only happen when the agent interprets the state joining the goal state and the surrounding abstract state as an exit or bottleneck connecting the two abstract states (the second abstract state being the lone goal state).
Let R^a_{ss'} be the reward received while transitioning from state s to state s' under action a. We modify the local adjacency matrix D defined above as

D_posterior(s, s') = D_prior(s, s') + Σ_a φ^a_{ss'} e^{−ν |R^a_{ss'}|},

where ν is a regularization constant which can be chosen to balance the relative weight of the underlying reward and transition structures. Using the exponential weighting for rewards in the transition model has various advantages: a) We want the abstraction to degenerate near spikes in the reward function, hence we require that the adjacency information have very low weights at such points. b) The exponential function has a nice mathematical form which is continuous and differentiable everywhere. c) It returns a value of 1 for zero rewards, hence preserving the transition structure as it is. d) It allows easy tuning of the relative weights by changing the parameter ν.

CHAPTER 7

Experiments

We present here three domains on which we perform our experiments. We compare our method with LCut by Şimşek et al. (2005). We also compare our results with options composed through biased random policies with various termination conditions, along with the availability of primitive actions. In both domains, the method proposed here significantly outperforms the other approaches.

7.1 2 room Domain

This is a very simple domain in which one can still acquire skills. The domain consists of 2 rooms of unequal sizes, where the agent has to start from the first room and reach a particular goal state in the second room (Figure 7.1a). A typical skill acquired would be to reach the doorway from any state in the first room and to navigate from this doorway to the intended goal state. This is indeed what we observe while using the PCCA+HRL framework. We use the transition and reward structure to create the transition model. Using this we find 3 abstract states, where one abstract state corresponds to the lone goal state itself.
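The reward-weighted count update of Section 6.2 can be sketched as follows; the state count, prior, ν and the sample transitions are illustrative assumptions:

```python
import numpy as np

# Each observed transition contributes a count damped by exp(-nu * |reward|),
# so spikes in the reward distribution weaken adjacency and make the
# spatial abstraction degenerate there.
nu, n = 0.5, 4
D = np.full((n, n), 0.01)                    # D_prior

samples = [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 10.0)]   # (s, s', reward)
for s, s2, r in samples:
    D[s, s2] += np.exp(-nu * abs(r))         # e^{-nu |R|} weighting
```

Zero-reward transitions keep their full count, while the transition into the highly rewarding state 3 is almost disconnected, which is exactly the degeneration the abstraction relies on.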
Figure 7.1b shows the partitioning obtained after 3 episodes, each a trajectory of length 3000. Figure 7.1c compares the average return for the different methods while solving the same task; we plot the average return with respect to the number of decision epochs used.

7.2 Taxi Domain

We use the episodic format for task solving in the Taxi Domain proposed by Dietterich (1998). This is a complex domain with a large number of typical skills that can be acquired; typical skills would consist of options facilitating the Navigate, Pickup and Putdown subtasks. Again, this is indeed what we observe using PCCA+. We use the episodic version of the taxi domain, where at the beginning of each episode the taxi, the passenger's position and the intended destination are randomly reset. Figure 7.1d compares the average return for the different methods.

Figure 7.1: Experiments. (a) 2 Room Domain (b) Spatial Abstraction using PCCA+ (c) Average Return for 2 room Domain (d) Average Return for Taxi Domain

7.3 Mario

Mario is a very complex domain. Due to the large state space (25352 states), we cannot use a tabular representation for the states. The Mario domain is dynamic in nature; at any given instance the state is a single frame of the entire game. Hence, we define coordinates with Mario as the reference point. We then define a CMAC encoding with hashing to make the state representation of the game tractable. A CMAC uses multiple overlapping tilings of the state space to produce a feature representation for a final linear mapping where all the learning takes place. See Figure 7.2. The overall effect is

Figure 7.2: CMACs involve multiple overlapping tilings of the state space. Here we show two 5 × 5 regular tilings offset and overlaid over a continuous, two-dimensional state space. Any state, such as that shown by the dot, is in exactly one tile of each tiling. A state's tiles are used to represent it in the Sarsa algorithm described above.
The tilings need not be regular grids such as shown here. In particular, they are often hyperplanar slices, the number of which grows sub-exponentially with the dimensionality of the space. CMACs have been widely used in conjunction with reinforcement learning systems.

much like a network with fixed radial basis functions, except that it is particularly efficient computationally (in other respects one would expect RBF networks and similar methods to work just as well). It is important to note that the tilings need not be simple grids. For example, to avoid the "curse of dimensionality," a common trick is to ignore some dimensions in some tilings, i.e., to use hyperplanar slices instead of boxes. A second major trick is "hashing": a consistent random collapsing of a large set of tiles into a much smaller set. Through hashing, memory requirements are often reduced by large factors with little loss of performance. This is possible because high resolution is needed in only a small fraction of the state space. Hashing frees us from the curse of dimensionality in the sense that memory requirements need not be exponential in the number of dimensions. We chose a grid size of 1024 with 2 tilings, hence giving us a state space of size 4096. The value functions are defined on this representation without any approximation. We run PCCA+HRL to autonomously play Mario. It was able to compose subtasks on structures in the game, such as kill the monster, collect the coin, etc. We do not know of any other work which could do this autonomously in an effective manner. We compare our results with primitive Q-learning techniques, as shown in Figure 7.3.

Figure 7.3: Mario Domain (# cumulative goal completions)

CHAPTER 8

Conclusion and Future Directions

Viewing random walks on MDPs as dynamical systems allows us to decompose the state space along the lines of temporal scales of change.
Thus, parts of the state space which are well connected were identified as metastable states, and we proposed a complete algorithm that not only identifies the metastable states but also learns options to navigate between them. We demonstrated the effectiveness of the approach on a variety of domains. We also discussed some crucial advantages of our approach over existing option discovery algorithms. While the approach detects intuitive options, it is possible that under a suitable re-representation of the state space, some of the metastable regions detected can be identical to each other. We are looking to use notions of symmetries in MDPs to identify such equivalent metastable regions. Another promising line of inquiry is to extend our approach to continuous state spaces, taking cues from the Proto Value Function literature.

APPENDIX A

PCCA+ on other domains

Other spectral clustering methods have had varying amounts of success on different domains, although few of them have been used successfully across multiple domains. The primary reason for this is that none of them has a principled way of exploiting the structural properties encoded in the Laplacian. We demonstrate the utility of PCCA+ across multiple domains, while comparing it with other state-of-the-art methods in each domain. We also compare PCCA+ with the N-Cut method (Shi and Malik (2000)) across all these domains. To our knowledge, this is the first time such an analysis has been performed using PCCA+ on multiple domains while comparing against other state-of-the-art methods.

A.1 Image Segmentation

Figure A.1a shows an image we would like to segment.

Figure A.1: Partitioning into multiple segments; comparison with N-Cut. (a) Image (b) PCCA+ (c) N-Cut

The procedure for clustering is as follows:

1. Construct a similarity graph G = (V, E) by taking each pixel as a node and connecting each pair of pixels by an edge. The similarity value should reflect the likelihood of 2 pixels belonging to the same group.
We define the similarity matrix in terms of a radial basis function of the brightness of adjacent pixels:

W_ij = exp( −‖F_i − F_j‖² / σ_I² )   if ‖X_i − X_j‖_2 ≤ 1,
W_ij = 0   otherwise,   (A.1)

where X_i and X_j are the spatial locations of the pixels and F_i and F_j are their brightness intensity values. This similarity matrix gives a non-zero value for a pixel i connected to a pixel j located on any of the 8 sites of the square lattice around pixel i. For a colored RGB image, the corresponding grayscale image is used for segmentation. Note that this construction of the similarity matrix is quite different from the construction by Shi and Malik (2000): we have only 1 parameter to adjust, the width σ_I of the intensity Gaussian, as opposed to 6 parameters in Shi and Malik (2000).

Figure A.2: Membership functions for some clusters. (a) χ̃1 (b) χ̃2 (c) χ̃3 (d) χ̃4

2. We find the number of clusters using the spectral gap method, by finding the top k eigenvalues for which (e_{k+1} − e_k)/(1 − e_k) > t_c, where t_c is the spectral gap threshold.

3. We apply the PCCA+ algorithm (Algorithm 1) to obtain the membership matrix for the graph.

We show in Figure A.2 the plot of the membership functions for some of the clusters identified, and in Figure A.1 the partitioning of the image into discrete segments, where each segment is color coded differently. We also show in Figure A.1 the results obtained using N-Cut (by performing k-means in the eigen-space for secondary clustering), with carefully chosen best parameters¹.

Figure A.3: Segmentation for some other images. (a) Image (b) Image (c) PCCA+ (d) PCCA+ (e) N-Cut (f) N-Cut

Note that PCCA+ produces clusters which separate the background, structurally a very complex group, from the objects, while N-Cut segments the background into different groups. We also compare the Normalized Cut values of PCCA+ and N-Cut in Table A.1.
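The similarity construction of Equation A.1 can be sketched as follows, using the 8-neighbourhood described above; the tiny image and the σ_I value are illustrative:

```python
import numpy as np

def similarity_matrix(img, sigma_i=0.1):
    """RBF on pixel brightness, restricted to the 8-neighbourhood of each pixel."""
    h, w = img.shape
    n = h * w
    W = np.zeros((n, n))
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if (di or dj) and 0 <= ni < h and 0 <= nj < w:
                        a, b = i * w + j, ni * w + nj
                        W[a, b] = np.exp(-(img[i, j] - img[ni, nj])**2
                                         / sigma_i**2)
    return W

img = np.array([[0.0, 0.0],    # dark top row
                [1.0, 1.0]])   # bright bottom row
W = similarity_matrix(img)
```

Pixels of equal brightness get similarity 1, while the edge between the two rows gets a similarity that is effectively zero, so the two rows separate cleanly under spectral clustering.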
We observe that PCCA+ gives very good average Normalized Cut values (∼0).

¹ The parameter values for N-Cut are taken from http://note.sonots.com/SciSoftware/NcutImageSegmentation.html, where the author claims that these parameter values for N-Cut produce the best results.

Table A.1: Normalized Cut values for PCCA+ and N-Cut (lower is better)

Image      PCCA+         N-Cut
Bird       2.1616e-005   0.2099
Baby       4.7245e-004   0.0126
Aircraft   0.0029        0.0128

Table A.2: Accuracy measure for TDT2

K     K-means   LSI     LPI     LE      NMF-NCW   PCCA+
2     0.871     0.913   0.963   0.923   0.925     0.9948
3     0.775     0.815   0.884   0.816   0.807     0.9810
4     0.732     0.773   0.843   0.793   0.787     0.9528
5     0.671     0.704   0.780   0.737   0.735     0.9424
6     0.655     0.683   0.760   0.719   0.722     0.9403
7     0.623     0.651   0.724   0.694   0.689     0.9331
8     0.582     0.617   0.693   0.650   0.662     0.7953
9     0.553     0.587   0.661   0.625   0.623     0.859
10    0.545     0.573   0.646   0.615   0.616     0.817
Avg   0.667     0.702   0.657   0.730   0.730     0.913

A.2 Text Clustering

Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner. Many clustering methods have been applied to clustering documents into categories, such as k-means MacQueen (1967), naive Bayes or Gaussian mixture models Baker and McCallum (1998); Liu et al. (2002), single-link Jain and
Dubes (1988), and DBSCAN Ester et al. (1996). From different perspectives, these clustering methods can be classified as agglomerative or divisive, hard or fuzzy, deterministic or stochastic. Typical data clustering tasks are performed directly in the data space. However, the document space is always of very high dimensionality. Because of the curse of dimensionality, it is desirable to first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. The literature on spectral clustering shows its capability to handle highly nonlinear data. Also, its strong connections to differential geometry make it capable of discovering the manifold structure of the document space.

Three standard document collections were used in our experiments: Reuters-21578², 20 Newsgroups³ and TDT2⁴. The Reuters-21578 corpus contains 21,578 documents in 135 categories. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The TDT2 corpus consists of data collected during the first half of 1998 and taken from six sources; it consists of 11,201 on-topic documents classified into 96 semantic categories. From the original corpus, the documents appearing in multiple categories were removed. The pruned TDT2 dataset contains 9394 documents in the top 30 categories, Reuters-21578 contains 8293 documents belonging to 65 categories, and 20 Newsgroups contains 18846 documents belonging to 20 groups.

² The Reuters-21578 corpus is available at http://www.daviddlewis.com/resources/testcollections/reuters21578/
³ The homepage of the 20 Newsgroups dataset is http://qwone.com/~jason/20Newsgroups/
⁴ The NIST Topic Detection and Tracking corpus is at http://www.nist.gov/speech/tests/tdt/tdt98/index.html

Each dataset is a document-term matrix in which each row represents a document. We apply TF-IDF weighting, which gives the weighted document-term matrix. We then calculate the distance between any two documents using cosine similarity, treating each document as a term vector.
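The TF-IDF weighting and cosine similarity preprocessing can be sketched as follows; the tiny count matrix is illustrative:

```python
import numpy as np

def tfidf(counts):
    """Term frequency times inverse document frequency for a count matrix."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)                   # document frequency
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    return Xn @ Xn.T

docs = np.array([[2, 0, 1],     # term counts for 3 short documents
                 [2, 0, 0],
                 [0, 3, 0]])
S = cosine_sim(tfidf(docs))
```

Documents sharing terms end up with high similarity, while documents with disjoint vocabularies get similarity zero, which is the adjacency information the clustering step consumes.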
Table A.3: Accuracy measure for Reuters

K     K-means   LSI     LPI     LE      NMF-NCW   PCCA+
2     0.989     0.992   0.998   0.998   0.985     0.9715
3     0.974     0.985   0.996   0.996   0.953     0.9794
4     0.959     0.970   0.996   0.996   0.964     0.9801
5     0.948     0.961   0.993   0.993   0.980     0.9693
6     0.945     0.954   0.993   0.992   0.932     0.970
7     0.883     0.903   0.990   0.988   0.921     0.9703
8     0.874     0.890   0.989   0.987   0.908     0.9699
9     0.852     0.870   0.987   0.984   0.895     0.9698
10    0.835     0.850   0.982   0.979   0.898     0.9697
Avg   0.918     0.931   0.982   0.990   0.937     0.9722

Consider a set of documents x_1, x_2, . . . , x_n ∈ IR^m. Assume each x_i has been normalized to unit length.

1. To construct the adjacency graph, let the i-th node correspond to the document x_i. We put an edge between nodes i and j if x_i is among the p nearest neighbors of x_j or x_j is among the p nearest neighbors of x_i.

2. We construct the similarity matrix S as follows: if nodes i and j are connected, S_ij = x_i^T x_j; otherwise, S_ij = 0.

3. We apply the PCCA+ algorithm (Algorithm 1) to obtain the membership matrix assigning documents to clusters. The maximum membership criterion is used to cluster documents.

We perform 3 sets of experiments to evaluate the cluster quality identified by PCCA+. In the first set of experiments, using the unlabeled dataset, we chose the number of clusters using the eigen-gap measure. It was observed that with this measure the number of clusters was exactly equal to the number of categories in the labeled data. Table A.4 shows the purity measure for a few datasets. In the second set of experiments, we compare PCCA+ with clustering based on LSI Deerwester et al. (1990), the spectral clustering method LPI Cai et al. (2005), and the Nonnegative Matrix Factorization clustering method Zha et al. (2001); Xu et al. (2003) (refer to Table A.3 and Table A.2). The evaluations are conducted for cluster numbers ranging from two to ten. For each given cluster number k, 50 test runs are conducted on different randomly chosen clusters.
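The p-nearest-neighbour graph construction in steps 1 and 2 above can be sketched as follows; the toy documents are illustrative:

```python
import numpy as np

def knn_similarity(X, p):
    """S_ij = x_i^T x_j when i is among the p nearest neighbours of j or
    vice versa (the symmetric "or" rule), zero otherwise."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalise
    cos = Xn @ Xn.T
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        # p nearest neighbours of i by cosine similarity, excluding i itself
        nbrs = [j for j in np.argsort(-cos[i]) if j != i][:p]
        for j in nbrs:
            S[i, j] = S[j, i] = cos[i, j]
    return S

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
S = knn_similarity(X, p=1)
```

The first two (nearly parallel) documents are strongly connected, while the orthogonal third document is connected only through its single nearest neighbour.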
The clustering performance is evaluated by comparing the obtained label of each document with that provided by the document corpus. Given a document x_i, let r_i and s_i be the obtained cluster label and the label provided by the corpus, respectively (refer Zha et al. (2001)). The accuracy measure is defined as

Accuracy = ( Σ_{i=1}^{n} δ(s_i, map(r_i)) ) / n,

where n is the total number of documents, δ(x, y) is the delta function, and map(r_i) is the permutation mapping function that maps each cluster label r_i to the equivalent label from the data corpus. We observe that PCCA+ provides better quality clusters than the other clustering techniques in most cases, and comparable clusters in the others.

The third set of experiments is performed using the Reuters-21578 dataset. In this experiment the 7 most frequent categories are considered, but we do not remove documents belonging to multiple categories. After removing documents whose label sets or main texts are empty, 8,866 documents are retained, of which only 3.37% are associated with more than one class label. After randomly removing documents with only one label, a text categorization data set containing 1998 documents is obtained (Table A.5 provides details of the categories used). We run PCCA+ on these documents and generate the connectivity information across categories using the definition provided in Def 2. The (normalized) macro-transition operator, which quantifies the relation between categories, is shown in Table A.6. The interesting observation is that the macro-transition operator gives high values across categories of seemingly related nature; for example, the pairs Money-fx/Trade, Trade/Crude and Trade/Interest have high values, while the pair Grain/Earn has low connectivity values.
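The accuracy measure defined above can be sketched as follows; for small label sets a brute-force search over label permutations stands in for the optimal `map(.)` computed by the Munkres algorithm, and the toy labelings are illustrative:

```python
from itertools import permutations

def accuracy(true_labels, cluster_labels):
    """Fraction of documents whose cluster label, after the best possible
    relabelling of clusters, matches the corpus label."""
    labels = sorted(set(true_labels))
    best = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))        # candidate map(.)
        hits = sum(t == mapping[c]
                   for t, c in zip(true_labels, cluster_labels))
        best = max(best, hits)
    return best / len(true_labels)

acc = accuracy([0, 0, 1, 1], [1, 1, 0, 0])   # perfect up to relabelling
```

Swapped cluster names still score 1.0, since accuracy is invariant to the naming of clusters; genuinely mixed clusters score lower.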
Table A.4: Purity measure for document clusters

Dataset     Number of docs   K    Purity
TDT2        9394             30   0.9344
Reuters     8293             65   0.9694
Newsgroup   18846            20   0.9023

We observe that PCCA+ is competitive with the best algorithm for all values of the number of clusters. We also observe in Table A.4 that PCCA+ has a very high purity measure for document clustering.

A.3 Synthetic Datasets

We demonstrate the quality of the clustering obtained using PCCA+ on various synthetic datasets with different structural properties.

Table A.5: Details of the Reuters-21578 top 7 categories used

Category      Number of Docs
Earn          831
Acquisition   482
Money-fx      299
Grain         153
Crude         128
Trade         154
Interest      261

Table A.6: Macro connectivity information for the Reuters-21578 top 7 categories

              Earn    Acquisition  Money-fx  Grain   Crude   Trade   Interest
Earn          0.106   0.093        0.031     0.052   0.259   0.255   0.202
Acquisition   0.021   0.104        0.058     0.075   0.243   0.279   0.216
Money-fx      0.005   0.056        0.184     0.074   0.228   0.283   0.166
Grain         0.005   0.038        0.039     0.287   0.189   0.169   0.269
Crude         0.015   0.062        0.060     0.095   0.291   0.268   0.205
Trade         0.013   0.070        0.073     0.083   0.261   0.328   0.169
Interest      0.009   0.043        0.034     0.107   0.164   0.138   0.501

1. Consider a set of datapoints x_1, x_2, . . . , x_n ∈ IR^m. The similarity matrix is constructed as

S_ij = exp( −(x_i − x_j)^T Σ^{-1} (x_i − x_j) / 2 )   if ‖x_i − x_j‖ ≤ threshold and i ≠ j,
S_ij = 0   otherwise,

where Σ is the covariance matrix.

2. We find the number of clusters using the spectral gap method, by finding the top k eigenvalues for which (e_{k+1} − e_k)/(1 − e_k) > t_c, where t_c is the spectral gap threshold.

3. We apply the PCCA+ algorithm (Algorithm 1) to obtain the membership matrix assigning data points to clusters.

Figure A.4 shows the clusters obtained using PCCA+.
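The thresholded Mahalanobis similarity of step 1 can be sketched as follows; the point set, covariance and threshold are illustrative:

```python
import numpy as np

def similarity(X, Sigma, threshold):
    """Mahalanobis RBF between points, zero beyond the distance threshold
    and on the diagonal."""
    n = len(X)
    Sinv = np.linalg.inv(Sigma)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = X[i] - X[j]
            if i != j and np.linalg.norm(d) <= threshold:
                S[i, j] = np.exp(-d @ Sinv @ d / 2)
    return S

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])   # two near, one far point
S = similarity(X, np.eye(2), threshold=1.0)
```

Only the two nearby points are connected, so the far point forms its own component; the spectral gap on the resulting stochastic matrix then reports the number of such well-separated groups.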
Table A.7 shows the purity measure of the clusters obtained using PCCA+ for the same number of clusters k as present in the labeled data. (Note that the spectral gap method always identified the same number of clusters as present in the labeled data.) We observe in Table A.7 that PCCA+ obtained clusterings with a very high purity measure, even on datasets with very different structural properties in the original data space. We also plot the simplex identified by PCCA+ for the spiral dataset in Figure A.5. A few observations to note here: a) the simplex is linearly transformed from its corresponding regular simplex structure, which shows the first order perturbation, and b) the data points (plotted as colored markers) around the vertices of the transformed simplex show higher order perturbations around the simplex structure.

Figure A.4: Clustering on synthetic datasets using PCCA+. (a) Aggregation (7 clusters) (b) Spiral (3 clusters) (c) R15 (15 clusters)

Figure A.5: Simplex identified by PCCA+ for the spiral dataset in Figure A.4b. The data points are shown as colored dots and are clustered around the vertices of the transformed simplex (shown as black dots). The original basis is shown as black lines.

Table A.7: Purity measure for synthetic datasets

Dataset      Number of datapoints   K    Purity
R15          600                    15   99.7
Spiral       312                    3    100.0
Aggregation  788                    7    99.6

REFERENCES

1. Baker, L. D. and A. K. McCallum, Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98. ACM, New York, NY, USA, 1998. ISBN 1-58113-015-5. URL http://doi.acm.org/10.1145/290941.290970.

2. Barto, A., S. S., and N. C., Intrinsically Motivated Learning of Hierarchical Collections of Skills. In Proceedings of the 2004 International Conference on Development and Learning. 2004, 112–119.

3. Barto, A. G., S. J. Bradtke, and S. P. Singh (1995).
Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1–2), 81–138. ISSN 0004-3702.

4. Barto, A. G. and S. Mahadevan (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13, 41–77.

5. Super Mario Bros. URL http://en.wikipedia.org/wiki/Super_Mario_Bros.

6. Cai, D., X. He, and J. Han (2005). Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering.

7. RL Competition (2009). URL http://2009.rl-competition.org/mario.php.

8. Dean, T. and S.-H. Lin, Decomposition techniques for planning in stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95). Morgan Kaufmann, 1995.

9. Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

10. Dietterich, T. G. (1998). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227–303.

11. Ester, M., H.-P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996.

12. Hengst, B., Discovering hierarchy in reinforcement learning with HEXQ. In Proceedings of the International Conference on Machine Learning, volume 19. ACM Press, 2002.

13. Jain, A. K. and R. C. Dubes, Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988. ISBN 0-13-022278-X.

14. Jonsson, A. and A. Barto (2006). Causal graph based decomposition of factored MDPs. Journal of Machine Learning Research, 7, 2259–2301.

15. Joshi, M., R. Khobragade, S. Sarda, U. Deshpande, and S. Mohan, Object-oriented representation and hierarchical reinforcement learning in Infinite Mario. In ICTAI. 2012.

16. Kannan, R., S. Vempala, and A. Vetta (2000). On clusterings: good, bad and spectral. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS).

17. Knoblock, C. (1990).
Learning abstraction hierarchies for problem solving. In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90).

18. Kocsis, L. and C. Szepesvári, Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06. Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 3-540-45375-X, 978-3-540-45375-8. URL http://dx.doi.org/10.1007/11871842_29.

19. Lane, T. and L. Kaelbling (2002). Nearly deterministic abstractions of Markov decision processes.

20. Liu, X., Y. Gong, W. Xu, and S. Zhu, Document clustering with cluster refinement and model selection capabilities. In Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 2002.

21. von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. ISSN 0960-3174. URL http://dx.doi.org/10.1007/s11222-007-9033-z.

22. MacQueen, J. B., Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman (eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1. University of California Press, 1967.

23. Mahadevan, S., Proto-value functions: developmental reinforcement learning. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning. 2005. ISBN 1-59593-180-5.

24. McGovern, A. (2002). Autonomous Discovery of Temporal Abstractions from Interaction with an Environment. Ph.D. thesis, University of Massachusetts, Amherst.

25. Meila, M. and J. Shi, A random walks view of spectral segmentation. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (AISTATS). 2001.

26. Menache, I., S. Mannor, and N. Shimkin, Q-Cut - dynamic discovery of sub-goals in reinforcement learning. In Machine Learning: ECML 2002, 13th European Conference on Machine Learning, volume 2430 of Lecture Notes in Computer Science. Springer, 2002.

27. Meuleau, N., M. Hauskrecht, K. Kim, L. Peshkin, L. Kaelbling, T. Dean, and C. Boutilier, Solving very large weakly coupled Markov decision processes.
In Proceedings of the Fifteenth National Conference on Artificial Intelligence. 1998.

28. Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1), 32–38.

29. Ng, A., M. Jordan, and Y. Weiss, On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14. 2001.

30. Shahnaz, F., M. W. Berry, V. P. Pauca, and R. J. Plemmons (2006). Document clustering using nonnegative matrix factorization. Information Processing and Management.

31. Shi, J. and J. Malik (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

32. Şimşek, Ö., A. P. Wolfe, and A. G. Barto, Identifying useful subgoals in reinforcement learning by local graph partitioning. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005). 2005.

33. Singpurwalla, N. D. and J. M. Booker (2004). Membership functions and probability measures of fuzzy sets. Journal of the American Statistical Association, 99, 867–877.

34. Sutton, R. and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

35. Sutton, R. S., D. Precup, and S. P. Singh (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.

36. Thurau, C., K. Kersting, M. Wahabzada, and C. Bauckhage (2012). Descriptive matrix factorization for sustainability: adopting the principle of opposites. Data Mining and Knowledge Discovery.

37. Weber, M., W. Rungsarityotin, and A. Schliep (2002). Characterization of transition states in conformational dynamics using fuzzy sets. Technical report.

38. Weber, M., W. Rungsarityotin, and A. Schliep (2004). Perron Cluster Analysis and Its Connection to Graph Partitioning for Noisy Data. Technical Report ZR-04-39, Zuse Institute Berlin.

39. White, S. and P. Smyth (2005). A spectral clustering approach to finding communities in graphs. In Proceedings of the SIAM International Conference on Data Mining.

40. Wolfe, A. P. and A. G.
Barto, Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the Twenty-Second International Conference on Machine Learning. 2005.

41. Xu, W., X. Liu, and Y. Gong, Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03. ACM, New York, NY, USA, 2003. ISBN 1-58113-646-3. URL http://doi.acm.org/10.1145/860435.860485.

42. Zha, H., X. He, C. Ding, H. Simon, and M. Gu, Spectral relaxation for k-means clustering. MIT Press, 2001.

LIST OF PAPERS BASED ON THESIS

1. Peeyush Kumar, Vimal Mathew and Balaraman Ravindran, Hierarchical Decision Making using Spatio-Temporal Abstraction in Reinforcement Learning. Under communication at the Journal of Machine Learning Research, 2013.

2. Vimal Mathew, Peeyush Kumar and Balaraman Ravindran, Abstraction in Reinforcement Learning in Terms of Metastability. European Workshop on Reinforcement Learning, 2012.

3. Peeyush Kumar, Niveditha Narasimhan and Balaraman Ravindran, Spectral Clustering as Mapping to a Simplex. Spectral Workshop, International Conference on Machine Learning, 2013. Under review at Knowledge Discovery and Data Mining, 2013.