Description:

  • The agent must also decide how to collect experiences
  • Challenges: how to explore and minimize regret

Exploration vs Exploitation:

  • Need a way for the agent to act optimally while still exploring at the same time
  • Use ε to denote the probability of taking a random action: ε-greedy exploration (see the first sketch after this list)
    • At every step, act randomly with probability ε; otherwise act on the current policy
  • Eventually the agent explores the whole space and learns enough over time:
    • so ε should be lowered over time
    • or explore with an exploration function
  • When to explore:
    • random actions: explore a fixed amount
    • better idea: explore areas whose badness is not yet established, eventually stop exploring
  • Exploration function:
    • Takes a value estimate $u$ and a visit count $n$, and returns an optimistic utility, e.g. $f(u, n) = u + k/n$
      • ex: use with Q-learning, modifying the update from
        $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a')]$
      • to (see the second sketch after this list):
        $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} f(Q(s',a'), N(s',a'))]$
    • this propagates the “bonus” back to states that lead to unknown states as well!
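
A minimal sketch of ε-greedy action selection with a decaying ε. It assumes Q-values are stored in a dictionary keyed by (state, action); the names (`epsilon_greedy_action`, `decay`, `epsilon_min`) and constants are illustrative, not from any particular library:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # current Q-value estimates, keyed by (state, action)

def epsilon_greedy_action(state, actions, epsilon):
    """With probability epsilon act randomly (explore), otherwise act greedily (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Lower epsilon over time so the agent eventually stops acting randomly.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, picking actions with epsilon_greedy_action(...) ...
    epsilon = max(epsilon_min, epsilon * decay)
```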
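
And a sketch of the modified Q-learning update with an exploration function. Here $f(u, n) = u + k/(n+1)$ is assumed (the +1 only avoids division by zero for unvisited pairs), and the constants are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)  # value estimates Q(s, a)
N = defaultdict(int)    # visit counts N(s, a)
alpha, gamma, k = 0.5, 0.9, 2.0  # learning rate, discount, exploration bonus weight (illustrative)

def f(u, n):
    """Exploration function: optimistic utility for value estimate u with visit count n."""
    return u + k / (n + 1)  # +1 avoids division by zero for never-visited pairs

def q_update(s, a, r, s_next, actions):
    """Q-learning update where the max over next actions uses f(Q, N) instead of Q."""
    N[(s, a)] += 1
    target = r + gamma * max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```

Because the bonus enters the update target of the predecessor state, states that lead to under-visited states also look more attractive, which is the propagation noted above.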

Regret:

  • Even if you learn the optimal policy, you still make mistakes along the way!
  • Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
  • Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
  • Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
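
As a toy illustration of the definition (all numbers below are made up), cumulative regret can be computed by comparing the rewards the agent actually received against what an optimal agent would have expected at each step:

```python
# Regret = sum over steps of (optimal expected reward - reward actually received).
optimal_reward_per_step = 1.0                  # assumed known for this toy example
rewards_received = [0.2, 0.5, 0.9, 1.0, 1.0]   # hypothetical rewards collected while learning

regret = sum(optimal_reward_per_step - r for r in rewards_received)
print(regret)  # ~1.4: the total cost of the youthful suboptimality
```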

Exploration: