Estimate the transition and reward functions from samples collected during exploration, then use these estimates to solve the MDP normally with value or policy iteration.
The agent generates an approximation of the transition function, T^(s,a,s′), by keeping counts of the number of times it arrives in each state s′ after entering each q-state (s,a).
Steps:
Step 1: Learn empirical MDP model
Count outcomes s′ for each q-state (s,a)
Normalize to give an estimate of T^(s,a,s′)
Discover each R^(s,a,s′) when we experience (s,a,s′)
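As a concrete sketch of this counting-and-normalization procedure, consider the following Python example. It assumes a simple tabular setting with hashable states and actions; the class and method names are illustrative, not taken from any particular library.

```python
from collections import defaultdict

class EmpiricalMDPModel:
    """Empirical MDP model: estimates T^ and R^ from observed transitions."""

    def __init__(self):
        # counts[(s, a)][s'] = number of times taking a in s led to s'
        self.counts = defaultdict(lambda: defaultdict(int))
        # rewards[(s, a, s')] = reward observed for that transition
        self.rewards = {}

    def record(self, s, a, s_next, r):
        """Record one experienced transition (s, a, s', r)."""
        self.counts[(s, a)][s_next] += 1
        self.rewards[(s, a, s_next)] = r

    def transition_prob(self, s, a, s_next):
        """Estimate T^(s, a, s') by normalizing the counts for q-state (s, a)."""
        total = sum(self.counts[(s, a)].values())
        if total == 0:
            return 0.0  # q-state (s, a) has never been visited
        return self.counts[(s, a)][s_next] / total

    def reward(self, s, a, s_next):
        """Return R^(s, a, s'), discovered when (s, a, s') was experienced."""
        return self.rewards.get((s, a, s_next), 0.0)


# Example usage with hypothetical states and actions:
model = EmpiricalMDPModel()
model.record('A', 'east', 'B', -1.0)
model.record('A', 'east', 'B', -1.0)
model.record('A', 'east', 'C', -1.0)
print(model.transition_prob('A', 'east', 'B'))  # 2/3, since B was reached 2 of 3 times
```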
The agent can then generate the approximate transition function T^ upon request by normalizing the counts it has collected: dividing the count for each observed tuple (s,a,s′) by the sum of the counts for all instances where the agent was in q-state (s,a).
Normalizing the counts scales them so that they sum to one, allowing them to be interpreted as probabilities.
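For example, suppose the agent has entered q-state (s,a) ten times (hypothetical counts), arriving in state x six times and state y four times. Normalizing gives T^(s,a,x) = 6/10 = 0.6 and T^(s,a,y) = 4/10 = 0.4, which can be read directly as estimated transition probabilities.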