Description:

  • Estimate the transition and reward functions from the samples collected during exploration, then use these estimates to solve the MDP as usual with value or policy iteration.
  • Generates an approximation of the transition function, $\hat{T}(s, a, s')$, by keeping counts of the number of times it arrives in each state $s'$ after entering each q-state $(s, a)$ (see the counting sketch below).
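
As a minimal illustration (the names `transition_counts` and `record_transition` are assumed here, not from the original notes), the counts can be kept in a nested counter keyed by q-state:

```python
from collections import defaultdict, Counter

# (s, a) -> Counter mapping each observed successor s' to its count
transition_counts = defaultdict(Counter)

def record_transition(s, a, s_next):
    """Record one experienced transition (s, a, s') during exploration."""
    transition_counts[(s, a)][s_next] += 1
```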

Steps:

  • Step 1: Learn empirical MDP model
    • Count outcomes $s'$ for each $s, a$
    • Normalize to give an estimate of $\hat{T}(s, a, s')$
    • Discover each $\hat{R}(s, a, s')$ when we experience $(s, a, s')$
  • Step 2: Solve the learned MDP (a value-iteration sketch follows this list)
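
A hedged sketch of Step 2, assuming the estimates have already been built: here `T_hat[(s, a)]` is taken to be a dict mapping each successor `s_next` to its estimated probability, `R_hat[(s, a, s_next)]` holds the discovered rewards, and every q-state is assumed to have been visited at least once. These names and the helper `value_iteration` are illustrative, not from the original notes.

```python
def value_iteration(states, actions, T_hat, R_hat, gamma=0.9, iterations=100):
    """Run value iteration on the learned (estimated) MDP."""
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        # Bellman update using the estimated model in place of the true one
        V = {
            s: max(
                sum(p * (R_hat[(s, a, s_next)] + gamma * V[s_next])
                    for s_next, p in T_hat[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return V
```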

Exploration:

  • The agent can then generate the approximate transition function $\hat{T}$ upon request by normalizing the counts it has collected: dividing the count for each observed tuple $(s, a, s')$ by the sum of the counts for all instances where the agent was in q-state $(s, a)$ (see the sketch after this list).
  • Normalization of counts scales them such that they sum to one, allowing them to be interpreted as probabilities
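
One possible way to write this normalization, reusing the hypothetical `transition_counts` structure sketched earlier (it assumes the q-state $(s, a)$ has been visited at least once, so the total is nonzero):

```python
def estimate_transition(s, a, transition_counts):
    """Return estimated probabilities over successors s' for q-state (s, a)."""
    counts = transition_counts[(s, a)]
    total = sum(counts.values())
    # Dividing each count by the total makes the values sum to one,
    # so they can be read as probabilities.
    return {s_next: n / total for s_next, n in counts.items()}
```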