Need a way for the agent to act optimally while also exploring at the same time
ϵ-greedy: use ϵ as the probability of taking a random action
At every step, act randomly with probability ϵ, otherwise act on the current policy
Eventually the agent will have explored the space and learned enough, so exploration should decrease over time:
either lower ϵ over time (see the sketch below)
or explore with an exploration function
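A minimal Python sketch of ϵ-greedy action selection with a decaying ϵ; the Q-table dictionary, toy state/actions, and decay schedule are illustrative assumptions, not fixed by these notes:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon act randomly, otherwise act on the current (greedy) policy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Toy usage (assumed setup): one state, two actions, epsilon lowered over time.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
epsilon = 1.0
for step in range(1000):
    a = epsilon_greedy_action(Q, "s0", ["left", "right"], epsilon)
    epsilon = max(0.05, epsilon * 0.995)  # decay epsilon, keep a small exploration floor
```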
When to explore:
random actions: explore a fixed amount
better idea: explore areas whose badness is not yet established, eventually stop exploring
Exploration function:
Take a value estimate u and a visit count n, return an optimistic utility, ex: f(u,n)=u+k/n
ex: use on Q-learning, modify the update from Q(s,a) ←_α R(s,a,s′) + γ max_{a′} Q(s′,a′)
to: Q(s,a) ←_α R(s,a,s′) + γ max_{a′} f(Q(s′,a′), N(s′,a′))
this propagates the “bonus” back to states that lead to unknown states as well!
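A minimal Python sketch of this modified Q-learning update with f(u,n) = u + k/n; the dictionary-based Q and N tables, the value of k, and treating unvisited pairs as having one visit are assumptions for illustration:

```python
from collections import defaultdict

alpha, gamma, k = 0.5, 0.9, 1.0
Q = defaultdict(float)   # Q[(s, a)]: current value estimate
N = defaultdict(int)     # N[(s, a)]: visit count

def f(u, n):
    # Optimistic utility f(u, n) = u + k/n; clamping n to at least 1 is a simplification
    # so never-visited pairs just receive the largest finite bonus.
    return u + k / max(n, 1)

def q_update(s, a, r, s_next, actions):
    # Q(s,a) <-_alpha R(s,a,s') + gamma * max_a' f(Q(s',a'), N(s',a'))
    N[(s, a)] += 1
    target = r + gamma * max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Toy usage: one transition in a two-action world.
q_update("s0", "right", 1.0, "s1", ["left", "right"])
```

Because the bonus is folded into the update target, states that lead to under-explored states inherit part of the bonus as well.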
Regret:
Even if you learn the optimal policy, you still make mistakes along the way!
Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
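A small worked sketch of cumulative regret: sum the per-step gap between the optimal expected reward and the reward actually collected while learning. The reward traces below are hypothetical, chosen only to show two learners that both converge while random exploration pays more along the way:

```python
# Cumulative regret: total gap between optimal expected reward and what was actually earned.
def cumulative_regret(optimal_rewards, actual_rewards):
    return sum(opt - got for opt, got in zip(optimal_rewards, actual_rewards))

# Hypothetical per-step reward traces: both learners end up acting optimally (reward 1),
# but random exploration takes longer to get there, so its regret is higher.
optimal        = [1, 1, 1, 1, 1, 1]
random_explore = [0, 0, 0, 1, 1, 1]
explore_fn     = [0, 1, 1, 1, 1, 1]
print(cumulative_regret(optimal, random_explore))  # 3
print(cumulative_regret(optimal, explore_fn))      # 1
```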