Description:

  • Instead of casting out the tree, we try full assignment of variables and iterate it
  • Combines 2 ideas:
    • Evaluation by rollouts:
      • play multiple games to termination from a state s (using simple, fast or random policy) and count nb of wins/ nb of games explored
      • as law of large number, the fraction of wins correlates the true value of the position
    • Selective search: explore parts of the tree that help improve the decision at the root, regardless of depth
      • Should allocate rollouts to more promising (more win rate) and uncertain nodes (less rollouts,ie less nodes explored)
      • so we defined new heuristic, Upper Confidence Bound (UCB1), that combines promising and uncertain idea
        • where:
          • is the number of rollouts of node
          • is the total utility (wins) for Player at parent of node

Algorithm:

  • Using the ideas above
  • Repeat until out of time:
    • Selection: recursively apply UCB to choose a path down to a leaf node
      • child grandchild leaf node
    • Expansion: add a new child to
    • Simulation: run a rollout from
    • Backpropagation: update and function counts from back up to the root
  • Choose the action leading to the child with highest
  • The value of a node, , is a weighted sum of child values
  • As , the vast majority of rollouts are concentrated in the best child(ren), so the weighted average tends to max/min
  • No optimality is guaranteed