Description:

  • Maximize the expected sum of rewards; the process is unpredictable, so the agent must learn from experience by interacting with the environment
  • Online learning
    • Receive feedback in the form of rewards
    • Agent’s utility is defined by the reward function
    • Must (learn to) act so as to maximize expected rewards
    • All learning is based on observed samples of outcomes
  • We don't know the true transition probabilities or the reward function

Episode:

  • A sequence of samples (state, action, reward, next state) collected while the agent interacts with the environment, from the start state until a terminal state
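
A minimal sketch (not from the notes) of collecting one episode of samples; the ChainEnv toy environment and its reset()/step() interface are illustrative assumptions:

```python
import random

class ChainEnv:
    """Toy environment (illustrative only): states 0..4, reward +1 on reaching state 4."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        # 'right' usually moves forward, but the outcome is noisy (the process is unpredictable)
        move = 1 if (a == "right" and random.random() < 0.8) else -1
        self.s = max(0, min(4, self.s + move))
        reward = 1.0 if self.s == 4 else 0.0
        done = self.s == 4
        return self.s, reward, done

def collect_episode(env, policy, max_steps=100):
    """Run one episode and return the observed (s, a, r, s') samples."""
    samples = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        samples.append((s, a, r, s_next))
        if done:
            break
        s = s_next
    return samples

episode = collect_episode(ChainEnv(), policy=lambda s: random.choice(["left", "right"]))
```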

Utilities of sequences:

  • A reward is given to the agent at each step
  • Rewards should increase as the agent gets closer to the goal (similar to a heuristic)
  • Rewards should be given along the way rather than only at the end; otherwise the agent has no incentive to move in the right direction.
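
Written out, the simplest (undiscounted, additive) utility of a reward sequence is just the sum of the rewards:

```latex
U([r_0, r_1, r_2, \dots]) = r_0 + r_1 + r_2 + \dots
```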

Discounting:

  • Rewards received sooner are preferred over rewards received later
    • Discounting also helps the algorithms converge
  • Each reward received t steps in the future is reduced by a discounting factor γ^t, where t is the number of steps until it is received (see the formula below)
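
With a discount factor γ (assuming 0 < γ ≤ 1), the utility of a reward sequence becomes:

```latex
U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots = \sum_{t \ge 0} \gamma^t r_t
```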

Solution to infinite utilities:

  • If the game can last forever, additive rewards can become infinite and the agent can go on accumulating reward forever
  • Finite horizon:
    • Terminate episodes after a fixed number of steps T (e.g. a lifetime)
    • Similar to depth-limited search
    • Gives non-stationary policies (the optimal action depends on the time left)
  • Discounting:
    • A smaller γ means a smaller “horizon” and a more short-term focus
  • Absorbing state:
    • Guarantee that for every policy, a terminal state will eventually be reached
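
Why discounting fixes infinite utilities: if every reward is bounded by some R_max and γ < 1, the discounted sum is a geometric series and stays finite:

```latex
\left| \sum_{t \ge 0} \gamma^t r_t \right| \;\le\; \sum_{t \ge 0} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1 - \gamma}
```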

Passive Reinforcement Learning

Active Reinforcement Learning