Heuristic Search in Two Player
Games – Minimax Principle
and Monte Carlo Tree Search
Assoc Prof Marko Robnik Šikonja
December 2015
NIM game example
Two-player games
Nim game
nim 7 example (small state space)
MINIMAX principle
Assume: players have complete information and both are trying to win
Player MAX (trying to maximize its result) plays against player MIN
(trying to minimize MAX’s result)
Label nodes by level (whose turn it is)
Label leaves: 1 – MAX wins, 0 – MIN wins
MINIMAX computation
Propagate the MIN and MAX values of the bottom levels towards the root
according to the minimax principle:
◦ Player MAX takes maximum of its successors
◦ Player MIN takes minimum of its successors
The assigned values represent the player’s expectation if she plays the
optimal moves, so she chooses her moves according to these values
Minimax on nim 7
Minimax on fixed-depth search spaces
Searching all the way down to the leaves might be infeasible
Search to a fixed depth (depending on the time and resources available)
n-ply look-ahead
At depth n, assign heuristic quality estimates to the nodes and propagate
them towards the root
Minimax finds the best strategy given the information up to depth n
Heuristics, e.g., in chess: the number and positions of pieces
An example: minimax of depth 4
Minimax pseudo code
double minimax (currentNode) {
  // stop at a leaf or when the depth limit is reached
  if ( isLeaf (currentNode) || depth (currentNode) == MaxDepth )
    return heuristicEvaluation (currentNode);
  if ( isMinNode (currentNode) )  // MIN takes the minimum of its successors
    return min ( minimax (child) for each child of currentNode );
  else                            // MAX takes the maximum of its successors
    return max ( minimax (child) for each child of currentNode );
}
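As a concrete, runnable counterpart to the pseudocode, here is a minimal Python sketch of minimax on a simple Nim variant (one pile, a move removes 1 or 2 objects, and the player taking the last object wins); these rules are an illustrative assumption and may differ from the nim 7 example in the figures:

def minimax(pile, is_max):
    # return 1 if MAX wins with optimal play from this state, 0 otherwise
    if pile == 0:
        # the previous player took the last object and won,
        # so the player to move now has lost
        return 0 if is_max else 1
    results = [minimax(pile - take, not is_max)
               for take in (1, 2) if take <= pile]
    # MAX takes the maximum of its successors, MIN the minimum
    return max(results) if is_max else min(results)

print(minimax(7, True))  # 1: MAX wins by taking 1 and leaving a pile of 6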
Minimax analysis
Short-sighted (myopic)
A (seemingly) good state at depth n can be misleading
Remedy: selective search ahead of the most promising leaf nodes
Minimax anomalies: violation of assumptions, difficulty of certain
positions for humans
Alpha-beta pruning
Minimax searches all nodes up to depth n
Waste of resources for unpromising nodes
idea:
◦ Depth-first search
◦ For MAX nodes, α is the highest value found so far
◦ For MIN nodes, β is the lowest value found so far
◦ We can safely prune MAX values less than α and MIN values larger than β
α-β pruning in practice
Rules of α-β pruning
Prune the successors of a MIN node whose β is less than or equal to the α of
its MAX predecessor
Prune the successors of a MAX node whose α is larger than or equal to the β
of its MIN predecessor
Therefore: compare levels m-1 and m+1 to prune level m
α-β pruning: a simple implementation
double alphaBeta (currentNode) {
  if ( isLeaf (currentNode) )
    return heuristicEvaluation (currentNode);
  if ( isMaxNode (currentNode) &&
       alpha (currentNode) >= beta (minAncestor (currentNode)) )
    stopSearchBelow (currentNode);
  if ( isMinNode (currentNode) &&
       beta (currentNode) <= alpha (maxAncestor (currentNode)) )
    stopSearchBelow (currentNode);
}
α-β pruning: actual implementation
• Use α and β as parameters
• For a MAX node, store the maximum of the values of its MIN successors as alpha
• For a MIN node, store the minimum of the values of its MAX successors as beta
• Each interior node therefore stores an alpha and a beta value
• The root is initialized with α = -∞, β = ∞
α-β pruning: the pseudo code
double alpha_beta (current_node, alpha, beta) {
  if ( is_leaf (current_node) )
    return heuristic_evaluation (current_node);
  if ( is_max_node (current_node) ) {
    for each child of current_node {
      alpha = max (alpha, alpha_beta (child, alpha, beta));
      if ( alpha >= beta )
        break;   // cut off the search below the remaining children
    }
    return alpha;
  }
  if ( is_min_node (current_node) ) {
    for each child of current_node {
      beta = min (beta, alpha_beta (child, alpha, beta));
      if ( beta <= alpha )
        break;   // cut off the search below the remaining children
    }
    return beta;
  }
}
call: alpha_beta (start_node, -infinity, infinity)
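The same recursion as a minimal runnable Python sketch; the game tree is given as nested lists whose leaves are heuristic values (an illustrative representation, not the lecture’s code):

import math

def alpha_beta(node, alpha, beta, is_max):
    if not isinstance(node, list):      # a leaf carries its heuristic value
        return node
    if is_max:
        for child in node:
            alpha = max(alpha, alpha_beta(child, alpha, beta, False))
            if alpha >= beta:           # cut off the remaining children
                break
        return alpha
    else:
        for child in node:
            beta = min(beta, alpha_beta(child, alpha, beta, True))
            if beta <= alpha:           # cut off the remaining children
                break
        return beta

# a depth-2 tree: MAX at the root, three MIN nodes below
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alpha_beta(tree, -math.inf, math.inf, True))  # 3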
Checkers
(b ≈ 8, n ≈ 10^20)
(Samuel, 1959)
◦ minimax, α-β, evaluation of positions
◦ Comparable to middle-level players
Chinook (Schaeffer, 1990, 1994-)
◦ minimax, α-β, excellent evaluation of positions, database of endgames with
up to 8 pieces, database of openings up to 20 moves, forward pruning
heuristics
◦ Beats world champion
Blondie24 (Fogel, 2000)
◦ learns strategies with neural networks and evolutionary methods
◦ Comparable to good players
Chess
(b ≈ 38, n ≈ 10^120)
◦ minimax, α-β, evaluation of positions, database of game openings
◦ Best programs search to a depth of around 12
◦ Linear dependency between the depth of search and the quality of the player
◦ 1997: Deep Blue beats the world champion
◦ DeepFritz, Rybka, Junior, Houdini…
Some other games
Go: 19 x 19 board, b ≈ 360, on the level of average players
Go-Moku: 15 x 15, a variant of 5 in a row, solved
Othello (reversi): search to depth 50, beats the world champion
Arimaa: chess board and pieces (b ≈ 17,281), a computer-unfriendly game
Probabilistic games: backgammon, bridge, tarock, poker
expectiminimax: a probabilistic version of minimax
Monte Carlo tree search
Expectiminimax
as minimax, but used where outcomes are uncertain, e.g., in games where
chance plays a certain role (cards, gambling)
instead of deterministic minimax values, compute expected values
the heuristic evaluations estimate the probabilities of events
example: estimate the probability that a player is in possession of a
certain card
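A minimal runnable sketch of expectiminimax in Python; the node encoding (('max', …), ('min', …), ('chance', [(probability, child), …]), bare numbers as leaves) is an illustrative assumption:

def expectiminimax(node):
    if not isinstance(node, tuple):      # a leaf: heuristic value
        return node
    kind, children = node
    if kind == 'max':
        return max(expectiminimax(c) for c in children)
    if kind == 'min':
        return min(expectiminimax(c) for c in children)
    # chance node: expected value over the possible outcomes
    return sum(p * expectiminimax(c) for p, c in children)

# MAX chooses between a safe move worth 2 and a gamble worth 5 or -3
tree = ('max', [2, ('chance', [(0.5, 5), (0.5, -3)])])
print(expectiminimax(tree))  # 2: the gamble's expectation is only 1.0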
Monte Carlo Tree Search (MCTS)
for really large search spaces all exhaustive methods are prohibitive
sampling-based evaluation: run many playouts
represent the search space in a tree form
idea:
◦ simulate random moves until a decision is reached in a leaf
◦ propagate the decision towards the root
◦ repeat
◦ evaluate top-level nodes by the proportion of successful games rooted in them
proven: MCTS converges to minimax in the limit
Advantages of MCTS
no heuristic is needed for state quality estimation, therefore applicable to
general game playing and to games without a developed theory
anytime algorithm
asymmetrical tree, suitable for games with high branching factor
parallel execution (leaf, root, tree)
MCTS breakdown
several random simulations
four iterative steps
◦ selection – choose (top-level) nodes
◦ expansion – add new nodes to the tree
◦ simulation – simulate a game from the selected node
◦ backup – the leaf result is propagated to the root; visited nodes get new values
MCTS illustration
MCTS “heuristic evaluation”, i.e., node selection
randomly – large variance
find a balance between exploration and exploitation
UCT approach (Upper Confidence Bounds applied to Trees)
UCT
k = argmax_{i ∈ I} ( w_i / n_i + C · √( ln n / n_i ) )
• compute the UCT score for all candidate nodes i (moves), where
w_i – number of wins from node i
n_i – number of visits of node i
n – number of all node visits, n = n_1 + n_2 + …
C – a coefficient to tune the explore/exploit ratio, theoretically √2, in
practice determined empirically
select the node (move) with the maximal UCT score, or the move with the
maximal n_i
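Putting the four MCTS steps and the UCT selector together, here is a compact runnable Python sketch on the simple Nim variant assumed earlier (pile of 7, remove 1 or 2 objects, last take wins); all names are illustrative, not taken from the lecture:

import math, random

class Node:
    def __init__(self, pile, to_move, parent=None):
        self.pile, self.to_move, self.parent = pile, to_move, parent
        self.children, self.wins, self.visits = [], 0, 0
        self.untried = [t for t in (1, 2) if t <= pile]

def uct_child(node, c=math.sqrt(2)):
    # UCT: exploitation (win rate) + exploration (rarely visited children)
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(pile, iterations=5000):
    root = Node(pile, to_move=1)
    for _ in range(iterations):
        node = root
        # 1. selection: descend through fully expanded nodes via UCT
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. expansion: add one new child, if any move is untried
        if node.untried:
            take = node.untried.pop()
            child = Node(node.pile - take, -node.to_move, parent=node)
            node.children.append(child)
            node = child
        # 3. simulation: random playout to the end of the game
        pile_left, player = node.pile, node.to_move
        while pile_left > 0:
            pile_left -= random.choice([t for t in (1, 2) if t <= pile_left])
            player = -player
        winner = -player            # the player who took the last object
        # 4. backup: propagate the result towards the root
        while node is not None:
            node.visits += 1
            if winner == -node.to_move:  # win for the player who moved here
                node.wins += 1
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)  # most visited move
    return pile - best.pile          # number of objects to take

print(mcts(7))  # usually 1: leaving a pile of 6 loses for the opponent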
Details of MCTS
light (random) and heavy (heuristic, knowledge-based) playouts
heavy playouts are not necessarily better
adding a heuristic term to the UCT formula, e.g., b_i / n_i, where b_i is a
heuristic score of the move (see the sketch below)
speed-up of the initial phase where moves are mostly random, e.g., in games
where a move can result from several positions
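The heuristic bias term could enter the selection score as sketched below; the function name and the per-node heuristic score b are assumptions for illustration:

import math

def uct_with_bias(parent_visits, wins, visits, b, c=math.sqrt(2)):
    # standard UCT plus a heuristic bias b / n_i that fades away
    # as the node accumulates visits
    return (wins / visits
            + c * math.sqrt(math.log(parent_visits) / visits)
            + b / visits)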
A use case: Tressette card game
Tressette – rules
2, 3 or 4 players
4 suits (colors) of 10 cards each
Strength of cards
◦ 3, 2, ace, king, knight, knave
◦ 7, 6, 5, 4 (no points)
Points:
◦ ace: 1 point
◦ 3, 2, king, knight, knave: ⅓ of a point
◦ 7, 6, 5, 4: 0 points
◦ the last trick (round) taken: 1 point
a rule: follow the suit if possible
Tressette – a two-player game
each player receives 10 cards
20 cards remain face-down in the pile (stochastic element)
after each trick the players draw one card from the pile and show it to
the opponent
gradually we get more and more information
after round 10 we have all the information
Tressette – a strategy
altogether there are 11⅔ points (4 aces × 1 + 20 counting cards × ⅓ + 1 for
the last trick); 6 points are needed to win
simple heuristics:
◦ players try to capture aces (their own or the opponent’s), 2s, 3s, and
faces (king, knight and knave)
◦ players try to capture the last point
a completely automatic approach based on expectiminimax and Monte Carlo
tree search
Tressette 2-player game agent
goal: automatic player
time limit: 1 sec of CPU time on a server for each move
Master’s thesis work of Žan Kafol
Expectiminimax
in each round:
generate all possible configurations of the opponent’s hand and the pile
too many combinations for exhaustive search
stop at a certain level and apply a heuristic estimate of the position
it is difficult to find a good heuristic estimate of the state
Minimax
from the 10th round on we have complete information
search exhaustively with minimax
use alpha-beta pruning
MCTS
when information is still incomplete
run MCTS with the UCT selector for the allowed time
select the best evaluated move
MCTS against a random player
MCTS against human players
37.7% winning rate, average human 38.9%
Improvement to MCTS
idea: use human game traces to set the initial quality value of each
top-level node
apply MCTS from this point forward
finding exactly the same game positions is highly unlikely
problem: how to find similar games?
Extracting human skill
16 million recorded moves
use only good players
define game features to find similar games:
number of aces, 2s, and 3s, sequences…
extract moves taken in circumstances similar to the given position
use kd-trees
Using human skill
1st approach
◦ use kNN to select the best general move
◦ specialize the selected general move to the given situation
2nd approach
◦ count winning moves and use them as a prior for UCT (see the sketch below)
only a small improvement over basic MCTS: not enough time to run enough
MCTS iterations within the given time limit
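A minimal sketch of the prior idea, assuming the counts extracted from similar human games are injected as virtual wins and visits of the top-level nodes before the MCTS iterations start (illustrative names, reusing the Node fields from the MCTS sketch above):

def seed_with_prior(top_level_nodes, prior_wins, prior_visits):
    # virtual statistics from human game traces; MCTS then refines
    # them with real playouts within the remaining time budget
    for node, w, v in zip(top_level_nodes, prior_wins, prior_visits):
        node.wins += w
        node.visits += v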
MCTS as a general mechanism
MCTS goes beyond games
MCTS in machine learning, for example for feature selection