Heuristic Search in Two-Player Games – Minimax Principle and Monte Carlo Tree Search
Assoc. Prof. Marko Robnik Šikonja
December 2015

Two-player games: the Nim game
Example: nim 7 (a small state space)

MINIMAX principle
Assume: players have all the information and both are trying to win
Player MAX (tries to maximize its result) plays against player MIN (tries to minimize MAX's result)
Label nodes by level (whose turn it is)
Label leaves: 1 – MAX wins, 0 – MIN wins

MINIMAX computation
Propagate the MIN and MAX values of the bottom levels towards the root according to the minimax principle:
◦ player MAX takes the maximum of its successors
◦ player MIN takes the minimum of its successors
The assigned values represent the player's expectation if she plays the optimal moves, so she chooses her moves according to these values

Minimax on nim 7

Minimax on fixed-depth search spaces
Searching all the way down to the leaves might be infeasible
Search to a fixed depth (time, resources available): n-ply look-ahead
At depth n assign heuristic quality estimates to the nodes and propagate them towards the root
Minimax finds the best strategy with the information available up to depth n
Heuristics, e.g., in chess: the number and positions of pieces

An example: minimax of depth 4

Minimax pseudo code
double minimax (currentNode) {
    if ( isLeaf (currentNode) || depth (currentNode) == maxDepth )
        return heuristicEvaluation (currentNode);
    if ( isMinNode (currentNode) )
        return min ( minimax (child) for each child of currentNode );
    if ( isMaxNode (currentNode) )
        return max ( minimax (child) for each child of currentNode );
}

Minimax analysis
Short-sighted (myopic): a (seemingly) good state at depth n can be misleading
Remedy: selective search ahead of the most promising leaf nodes
Minimax anomalies: violation of the assumptions, difficulty of certain positions for humans

Alpha-beta pruning
Minimax searches all nodes up to depth n – a waste of resources for unpromising nodes
Idea:
◦ depth-first search
◦ for MAX nodes, α is the highest value found so far
◦ for MIN nodes, β is the lowest value found so far
◦ we can safely prune MAX values less than α and MIN values larger than β

α-β pruning in practice

Rules of α-β pruning
Prune the successors of a MIN node whose β is less than or equal to the α of its MAX predecessor
Prune the successors of a MAX node whose α is greater than or equal to the β of its MIN predecessor
Therefore: compare levels m-1 and m+1 to prune level m

α-β pruning: a simple implementation
double alphaBeta (currentNode) {
    if ( isLeaf (currentNode) )
        return heuristicEvaluation (currentNode);
    if ( isMaxNode (currentNode) && alpha (currentNode) >= beta (minAncestor (currentNode)) )
        stopSearchBelow (currentNode);
    if ( isMinNode (currentNode) && beta (currentNode) <= alpha (maxAncestor (currentNode)) )
        stopSearchBelow (currentNode);
}

α-β pruning: actual implementation
• use α and β as parameters
• for a MAX node, store the minimum of the β values of its MIN successors as beta
• for a MIN node, store the maximum of the α values of its MAX successors as alpha
• each interior node therefore stores an alpha and a beta value
• the root is initialized with α = -∞, β = ∞

α-β pruning: the pseudo code
double alpha_beta (current_node, alpha, beta) {
    if ( is_leaf (current_node) )
        return heuristic_evaluation (current_node);
    if ( is_max_node (current_node) ) {
        for each child of current_node {
            alpha = max (alpha, alpha_beta (child, alpha, beta));
            if ( alpha >= beta )
                break;    // cut off the search below current_node
        }
        return alpha;
    }
    if ( is_min_node (current_node) ) {
        for each child of current_node {
            beta = min (beta, alpha_beta (child, alpha, beta));
            if ( beta <= alpha )
                break;    // cut off the search below current_node
        }
        return beta;
    }
}
call: alpha_beta (start_node, -infinity, infinity)
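
To connect the pseudo code above with the earlier nim 7 example, here is a minimal runnable Python sketch of minimax with α-β pruning. It is not from the slides: it assumes a simple Nim variant in which players alternately take 1 or 2 matches from a pile of 7 and the player who takes the last match wins, and it scores leaves 1 for a MAX win and 0 for a MIN win, matching the leaf labels used above.

import math

def children(pile):
    # All positions reachable in one move: take 1 or 2 matches (assumed rules).
    return [pile - take for take in (1, 2) if pile - take >= 0]

def alpha_beta(pile, max_to_move, alpha=-math.inf, beta=math.inf):
    if pile == 0:
        # The player who just moved took the last match and won.
        return 0 if max_to_move else 1
    if max_to_move:
        for child in children(pile):
            alpha = max(alpha, alpha_beta(child, False, alpha, beta))
            if alpha >= beta:
                break  # beta cut-off: the MIN node above never allows this branch
        return alpha
    for child in children(pile):
        beta = min(beta, alpha_beta(child, True, alpha, beta))
        if beta <= alpha:
            break  # alpha cut-off: the MAX node above already has a better option
    return beta

print(alpha_beta(7, max_to_move=True))  # prints 1: the first player can force a win

The same call with a pile of 6 prints 0, since in this variant multiples of 3 are lost for the player to move.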
Game playing programs

Checkers (b ≈ 8, n ≈ 10^20)
Samuel (1959)
◦ minimax, α-β, evaluation of positions
◦ comparable to middle-level players
Chinook (Schaeffer, 1990, 1994–)
◦ minimax, α-β, excellent evaluation of positions, a database of endgames with up to 8 pieces, a database of openings up to 20 moves, forward-pruning heuristics
◦ beats the world champion
Blondie24 (Fogel, 2000)
◦ learns strategies with neural networks and evolutionary methods
◦ comparable to good players

Chess (b ≈ 38, n ≈ 10^120)
◦ minimax, α-β, evaluation of positions, a database of game openings
◦ the best programs search to a depth of around 12
◦ linear dependency between the depth of search and the quality of the player
◦ 1997: Deep Blue beats the world champion
◦ Deep Fritz, Rybka, Junior, Houdini…

Some other games
Go: 19 × 19 board, b ≈ 360, on the level of average players
Go-Moku: 15 × 15, a variant of five-in-a-row, solved
Othello (Reversi): search to depth 50, beats the world champion
Arimaa: chess board and pieces (b ≈ 17281), a computer-unfriendly game
Probabilistic games: backgammon, bridge, tarock, poker; expectiminimax is a probabilistic version of minimax

Monte Carlo tree search

Expectiminimax
As minimax, but used where outcomes are uncertain, e.g., in games where chance plays a certain role (cards, gambling)
Instead of deterministic minimax, compute the expected minimax value
The heuristic evaluations estimate the probabilities of events
Example: estimate the probability that a player possesses a certain card

Monte Carlo Tree Search (MCTS)
For really large search spaces all exhaustive methods are prohibitive
Sampling-based evaluation: make many playouts
Represent the search space in a tree form
Idea:
◦ simulate random moves until a decision is reached in a leaf
◦ propagate the decision towards the root
◦ repeat
◦ evaluate the top-level nodes by the proportion of successful games rooted in them
Proven: MCTS converges to minimax in the limit

Advantages of MCTS
No heuristic needed for state quality estimation, therefore applicable to general game playing and to games without a developed theory
Anytime algorithm
Asymmetrical tree, suitable for games with a high branching factor
Parallel execution (leaf, root, tree)

MCTS breakdown
Several random simulations
Four iterative steps:
◦ selection – choose (top-level) nodes
◦ expansion – add new nodes to the tree
◦ simulation – simulate a game from the selected node
◦ backup – the leaf result is propagated to the root; the visited nodes get new values

MCTS illustration

MCTS “heuristic evaluation”, i.e., node selection
Selecting randomly gives a large variance
Find a balance between exploration and exploitation
UCT approach (Upper Confidence Bounds applied to Trees)

UCT
k = argmax_{i ∈ I} ( w_i / n_i + C · √(ln n / n_i) )
Compute the UCT value for all candidate nodes i (moves), where
◦ w_i – the number of wins from node i
◦ n_i – the number of visits of node i
◦ n – the number of all node visits, n = n_1 + n_2 + …
◦ C – a coefficient tuning the explore/exploit ratio; theoretically √2, in practice determined empirically
Select the node (move) with the maximal UCT value, or the move with the maximal n_i

Details of MCTS
Light (random) and heavy (heuristic, knowledge-based) playouts; heavy playouts are not necessarily better
Adding a heuristic term to the UCT formula, e.g., b_i / n_i, where b_i is a heuristic score of the move
Speed-up of the initial phase, where moves are mostly random, e.g., in games where a move can result from several positions
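
As a concrete illustration of the UCT selection step described above, here is a minimal Python sketch; the Node class, its field names, and the default C = √2 are assumptions for illustration, not part of the slides.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    wins: int = 0        # w_i: wins observed in playouts through this node
    visits: int = 0      # n_i: number of times this node was visited
    children: list = field(default_factory=list)

def uct_select(parent, C=math.sqrt(2)):
    # Pick the child maximizing w_i/n_i + C * sqrt(ln n / n_i).
    n = sum(child.visits for child in parent.children)  # n: all node visits
    def uct(child):
        if child.visits == 0:
            return math.inf  # unvisited moves are always tried first
        return child.wins / child.visits + C * math.sqrt(math.log(n) / child.visits)
    return max(parent.children, key=uct)

In a full MCTS loop this function implements the selection step; the expansion, simulation and backup steps then update wins and visits after each playout.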
A use case: the Tressette card game

Tressette – rules
2, 3 or 4 players
4 suits (colors) of 10 cards each
Strength of cards:
◦ 3, 2, ace, king, knight, knave
◦ 7, 6, 5, 4 (no points)
Points:
◦ ace: 1 point
◦ 3, 2, king, knight, knave: ⅓ of a point each
◦ 7, 6, 5, 4: 0 points
◦ taking the last trick (round): 1 point
A rule: follow the suit if possible

Tressette – a two-player game
Each player receives 10 cards
20 cards remain face-down in the pile (the stochastic element)
After each trick the players each take one card from the pile and show it to the opponent
Gradually we get more and more information; after round 10 we have all the information

Tressette – a strategy
Altogether there are 11⅔ points; 6 are needed to win
Simple heuristics:
◦ players try to capture the aces (their own or the opponent's), the 2s and 3s, and the face cards (king, knight, knave)
◦ players try to capture the last point
A completely automatic approach based on expectiminimax and Monte Carlo tree search

Tressette two-player game agent
Goal: an automatic player
Time limit: 1 second of CPU time on a server for each move
Master's thesis work of Žan Kafol

Expectiminimax
In each round: generate all possible configurations of the opponent's hand and the pile
Too many combinations for an exhaustive search
Stop at a certain level and apply a heuristic estimate of the position
It is difficult to find a good heuristic estimate of the state

Minimax
From the 10th round on we have complete information
Search exhaustively with minimax
Use alpha-beta pruning

MCTS
While the information is still incomplete, run MCTS with the UCT selector for the allowed time
Select the best-evaluated move

MCTS against a random player

MCTS against human players
37.7% winning rate; the average human achieves 38.9%

Improvement to MCTS
Idea: use human game traces to set the initial quality value of each top-level node, then apply MCTS from this point forward
Finding exactly the same game position is highly unlikely
Problem: how to find similar games?

Extracting human skill
16 million recorded moves; use only good players
Define game features to find similar games: the number of aces, 2s and 3s, sequences…
Extract the moves taken in circumstances similar to the given position
Use k-d trees

Using human skill
1st approach:
◦ use kNN to select the best general move
◦ specialize the selected generalized move to the given situation
2nd approach:
◦ count winning moves and use them as a prior for UCT (see the sketch at the end of this transcription)
Only a small improvement over basic MCTS: there is not enough time to do enough MCTS iterations within the given time limit

MCTS as a general mechanism
MCTS goes beyond games
MCTS in machine learning, for example for feature selection
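
To illustrate the second approach above, here is a minimal Python sketch of blending human-trace statistics into UCT as virtual wins and visits; the parameter names and the blending scheme are assumptions for illustration, not the thesis implementation.

import math

def uct_with_prior(wins, visits, prior_wins, prior_visits, n_total, C=math.sqrt(2)):
    # wins, visits: statistics gathered by the MCTS playouts
    # prior_wins, prior_visits: winning-move counts extracted from human games
    # n_total: total visits over all sibling nodes
    # Blend the prior in as if those human games had already been simulated.
    w = wins + prior_wins
    v = visits + prior_visits
    if v == 0:
        return math.inf  # no playouts and no prior: explore this move first
    return w / v + C * math.sqrt(math.log(max(n_total, 1)) / v)

As playouts accumulate, visits grows and the influence of the human prior fades, so the search gradually trusts its own statistics more than the extracted traces.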