Description:

  • The agent must also decide how to collect experiences
  • Challenges: how to explore and minimize regret

Exploration vs Exploitation:

  • Need a way for the agent to act optimally while still exploring at the same time
  • Use ε to denote the probability of taking a random action: ε-greedy exploration (see the first sketch after this list)
    • At every step, act randomly with probability ε; otherwise act on the current policy
  • Eventually the agent explores the whole space and learns enough over time:
    • so ε should be lowered over time
    • or explore with an exploration function
  • When to explore:
    • random actions: explore a fixed amount
    • better idea: explore areas whose badness is not yet established, eventually stop exploring
  • Exploration function:
    • Takes a value estimate $u$ and a visit count $n$, and returns an optimistic utility, e.g. $f(u, n) = u + k/n$
      • ex: use with Q-learning, modifying the update from
        $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a')]$
      • to (see the second sketch after this list):
        $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} f(Q(s',a'), N(s',a'))]$
    • this propagates the “bonus” back to states that lead to unknown states as well!
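
A minimal sketch of ε-greedy action selection with a decaying ε. It assumes Q-values are stored in a dictionary keyed by (state, action); the names (`epsilon_greedy_action`, `decay`, `epsilon_min`) and constants are illustrative, not from any particular library:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # current Q-value estimates, keyed by (state, action)

def epsilon_greedy_action(state, actions, epsilon):
    """With probability epsilon act randomly (explore), otherwise act greedily (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Lower epsilon over time so the agent eventually stops acting randomly.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, picking actions with epsilon_greedy_action(...) ...
    epsilon = max(epsilon_min, epsilon * decay)
```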
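
And a sketch of the modified Q-learning update with an exploration function. Here $f(u, n) = u + k/(n+1)$ is assumed (the +1 only avoids division by zero for unvisited pairs), and the constants are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)  # value estimates Q(s, a)
N = defaultdict(int)    # visit counts N(s, a)
alpha, gamma, k = 0.5, 0.9, 2.0  # learning rate, discount, exploration bonus weight (illustrative)

def f(u, n):
    """Exploration function: optimistic utility for value estimate u with visit count n."""
    return u + k / (n + 1)  # +1 avoids division by zero for never-visited pairs

def q_update(s, a, r, s_next, actions):
    """Q-learning update where the max over next actions uses f(Q, N) instead of Q."""
    N[(s, a)] += 1
    target = r + gamma * max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```

Because the bonus enters the update target of the predecessor state, states that lead to under-visited states also look more attractive, which is the propagation noted above.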

Regret:

  • Even if you learn the optimal policy, you still make mistakes along the way!
  • Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
  • Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
  • Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
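
As a toy illustration of the definition (all numbers below are made up), cumulative regret can be computed by comparing the rewards the agent actually received against what an optimal agent would have expected at each step:

```python
# Regret = sum over steps of (optimal expected reward - reward actually received).
optimal_reward_per_step = 1.0                  # assumed known for this toy example
rewards_received = [0.2, 0.5, 0.9, 1.0, 1.0]   # hypothetical rewards collected while learning

regret = sum(optimal_reward_per_step - r for r in rewards_received)
print(regret)  # ~1.4: the total cost of the youthful suboptimality
```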

Exploration: