A model-free way of doing Policy Evaluation, mimicking Bellman updates with running sample averages
Idea: learn from every experience
Update V(s) each time we experience a transition (s,a,s’,r)
Likely outcomes s′ will contribute updates more often than unlikely ones, so frequent successors naturally carry more weight in the running average
TD learning answers the question of how to compute this weighted average without knowing the weights: keep an exponential moving average of the samples r + γV(s′), so V(s) ← (1−α)V(s) + α[r + γV(s′)]
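A minimal sketch of this update, assuming a hypothetical `env` with `reset()` and `step(a)` returning `(s', r, done)`, and a fixed `policy` function; the learning rate `alpha` plays the role of the moving-average weight:

```python
def td_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, episodes=500):
    """TD(0) policy evaluation: update V(s) after every observed transition
    using an exponential moving average of the samples r + gamma * V(s')."""
    V = {}  # unseen states default to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                     # follow the fixed policy pi
            s_next, r, done = env.step(a)     # observe one transition (s, a, s', r)
            sample = r + gamma * V.get(s_next, 0.0)               # one-step sample of the Bellman backup
            V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # exponential moving average
            s = s_next
    return V
```

A fixed alpha keeps the estimate responsive to recent samples; decaying alpha over time lets the average settle down.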
Problem:
If we want to turn values into a new policy, we are stuck: without T and R we cannot compute Q-values, and policy extraction needs them
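To see why, here is what extracting a greedy policy from V would require, written as a sketch with hypothetical T(s, a, s') and R(s, a, s') functions and explicit state/action sets; it needs exactly the model a model-free learner never sees:

```python
def greedy_policy_from_values(V, states, actions, T, R, gamma=0.9):
    """Greedy policy extraction: pi(s) = argmax_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s')).
    This requires the transition model T and reward function R, which we do not
    have in the model-free setting -- hence the problem stated above."""
    pi = {}
    for s in states:
        q_values = {
            a: sum(T(s, a, s2) * (R(s, a, s2) + gamma * V.get(s2, 0.0)) for s2 in states)
            for a in actions
        }
        pi[s] = max(q_values, key=q_values.get)  # pick the action with the highest Q-value
    return pi
```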