A model-free way of doing Policy Evaluation, mimicking Bellman updates with running sample averages
Idea: learn from every experience
Update V(s) each time we experience a transition (s,a,s’,r)
Likely outcomes s′ will contribute updates more often than unlikely ones, so frequent successors naturally carry more weight in the running average
TD learning answers the question of how to compute this weighted average without knowing the weights: keep an exponential moving average of the samples r + γV(s′), so V(s) ← (1−α)V(s) + α[r + γV(s′)]
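A minimal sketch of this update, assuming a hypothetical `env` with `reset()` and `step(a)` returning `(s', r, done)`, and a fixed `policy` function; the learning rate `alpha` plays the role of the moving-average weight:

```python
def td_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, episodes=500):
    """TD(0) policy evaluation: update V(s) after every observed transition
    using an exponential moving average of the samples r + gamma * V(s')."""
    V = {}  # unseen states default to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                     # follow the fixed policy pi
            s_next, r, done = env.step(a)     # observe one transition (s, a, s', r)
            sample = r + gamma * V.get(s_next, 0.0)               # one-step sample of the Bellman backup
            V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # exponential moving average
            s = s_next
    return V
```

A fixed alpha keeps the estimate responsive to recent samples; decaying alpha over time lets the average settle down.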
Problem:
If we want to turn values into a new policy, we are stuck: without T and R we cannot compute Q-values, and policy extraction needs them
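To see why, here is what extracting a greedy policy from V would require, written as a sketch with hypothetical T(s, a, s') and R(s, a, s') functions and explicit state/action sets; it needs exactly the model a model-free learner never sees:

```python
def greedy_policy_from_values(V, states, actions, T, R, gamma=0.9):
    """Greedy policy extraction: pi(s) = argmax_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s')).
    This requires the transition model T and reward function R, which we do not
    have in the model-free setting -- hence the problem stated above."""
    pi = {}
    for s in states:
        q_values = {
            a: sum(T(s, a, s2) * (R(s, a, s2) + gamma * V.get(s2, 0.0)) for s2 in states)
            for a in actions
        }
        pi[s] = max(q_values, key=q_values.get)  # pick the action with the highest Q-value
    return pi
```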