How can an agent learn to act given only indirect, delayed rewards or penalties as feedback?
Consider a robot learning to act in its environment.
Robot's sensors report the state of the environment (e.g., cameras, sonars).
Robot can perform actions to alter this state (e.g., move forward, turn).
Each action may generate a reward or penalty indicating the desirability of the resulting state.
Goals are defined by a reward function.
Robot's task is to learn a control policy that maximizes rewards accumulated over time.
Robot learns by actively exploring the environment, performing actions, and observing their consequences.
Agent can perceive a set S of distinct states.
Agent can choose from a set A of actions to perform.
Agent must learn a target function π : S -> A that maps from states to optimal actions.
Supervised training is not possible, because the optimal actions π(s) are unknown.
Only a sequence of immediate reward values is available to the agent.
Rewards that happen later are less important than rewards that happen sooner.
Goal: Learn to choose actions that maximize the cumulative reward over time
r_0 + γ r_1 + γ^2 r_2 + γ^3 r_3 + . . .
where the discount factor 0 ≤ γ < 1 determines the relative value of immediate vs. delayed rewards.
If γ = 0, only the immediate reward matters; delayed rewards are irrelevant.
As γ approaches 1, immediate and delayed rewards become nearly equally important.
Future rewards are discounted exponentially by their delay.
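As a concrete illustration of exponential discounting, here is a minimal sketch that computes the discounted cumulative reward for a finite reward sequence (the reward values and γ = 0.9 are assumptions chosen only for this example):

    def discounted_return(rewards, gamma):
        # Computes r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence.
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # With gamma = 0.9, a reward of 100 received two steps from now
    # is worth 0.81 * 100 = 81 today.
    print(discounted_return([0, 0, 100], 0.9))   # 81.0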
Problem of credit assignment: how to determine which of the agent's actions are responsible for eventual rewards?
Agent faces a tradeoff between exploration of unknown states and actions (to gain new information) and exploitation of learned states and actions (to maximize its cumulative reward).
Environment described by a reward function r and a state transition function δ:
r_t = r(s_t, a_t) is the reward received for performing action a_t in state s_t
s_{t+1} = δ(s_t, a_t) is the new state resulting from performing action a_t in state s_t
r and δ may be nondeterministic or unknown to the agent (i.e., agent may be unable to predict the results of its actions).
The discounted cumulative reward V^π for a particular policy π specifies the total reward achievable by the agent by following the policy from any given starting state:
V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + . . .
Example: if the current state is s and two policies a and b have V^a(s) = 81 and V^b(s) = 99, then the agent would be better off following policy b, because it would achieve a higher reward in the long run.
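When r and δ are deterministic and known, V^π(s) can be approximated simply by following the policy forward and summing the discounted rewards. A minimal sketch (the toy one-goal environment below is an assumption made only for illustration):

    def value_of_policy(policy, r, delta, s, gamma=0.9, horizon=100):
        # Approximate V^pi(s) by rolling the policy forward for a finite horizon
        # and accumulating gamma^t * r_t.
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            total += discount * r(s, a)
            s = delta(s, a)
            discount *= gamma
        return total

    # Toy example: performing "go" in state 0 earns a reward of 100 and moves
    # the agent to state 1, where no further reward is available.
    r     = lambda s, a: 100 if (s == 0 and a == "go") else 0
    delta = lambda s, a: 1
    print(value_of_policy(lambda s: "go", r, delta, s=0))   # 100.0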
The optimal policy π* is the one that achieves the greatest cumulative reward for all states.
V*(s) is the value function of the optimal policy.
V*(s) gives the maximum possible cumulative reward that the agent can obtain by starting in state s.
Example (assuming γ = 0.9): V*(s_yellow) = 0 + γ·100 + γ^2·0 + γ^3·0 + . . . = 90
V*(s_green) = 0 + γ·0 + γ^2·100 + γ^3·0 + . . . = 81
If an agent had perfect knowledge of the cumulative reward function V*, the immediate reward function r, and the state transition function δ, it could always choose the best action π*(s) from any state s as follows:
π*(s) = the action a that maximizes r(s, a) + γ V*(δ (s, a))
However, robotic agents generally do not have perfect knowledge of r or δ, because it is usually not possible to predict in advance the exact outcome of applying an arbitrary action to an arbitrary state.
Suppose we define an evaluation function for state/action pairs:
Q(s, a) = r(s, a) + γ V*(δ(s, a))
Q is the reward received immediately upon performing action a in state s, plus the value (discounted by γ) of following the optimal policy thereafter.
If an agent had perfect knowledge of the evaluation function Q, it could always choose the best action π*(s) from any state s as follows:
π*(s) = the action a that maximizes Q(s, a)
The agent does not need to know r, δ, or V* explicitly, because this knowledge is "hidden inside" the knowledge of Q.
Knowing V* means knowing only about states; knowing Q means knowing about states and actions.
Example (again with γ = 0.9):
Q(s, a_red) = 0 + γ × 81 = 72.9
Q(s, a_green) = 0 + γ × 100 = 90
Q(s, a_blue) = 0 + γ × 100 = 90
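Given a table of Q values, acting optimally reduces to a lookup and a max. A small sketch using the (hypothetical) values above:

    # Hypothetical Q table for a single state s, matching the example values above.
    Q = {("s", "a_red"): 72.9, ("s", "a_green"): 90.0, ("s", "a_blue"): 90.0}

    def greedy_action(Q, s, actions):
        # pi*(s) = the action a that maximizes Q(s, a); ties are broken by list order.
        return max(actions, key=lambda a: Q[(s, a)])

    print(greedy_action(Q, "s", ["a_red", "a_green", "a_blue"]))   # a_green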
How can the agent learn the Q function?
V*(s) is just the maximum of Q(s, a) over all actions a that are applicable to state s.
Using this, we can rewrite the definition of Q as follows:
Q(s, a) = r(s, a) + γ V*(δ(s, a))
        = r(s, a) + γ max_{a'} Q(δ(s, a), a')
where a' ranges over all actions applicable in the new state δ(s, a).
We can use this equation as the basis for iteratively approximating the Q function.
Let Q denote the true Q function (i.e., the target function).
Let Q' denote an approximation to Q (i.e., the current hypothesis).
We can represent Q' by a lookup table with one entry for each state/action pair.
Algorithm:
1. For each state/action pair (s, a), initialize the table entry Q'(s, a) to zero.
2. Observe the current state s.
3. Repeat forever:
   a. Select an action a and execute it.
   b. Receive the immediate reward r.
   c. Observe the new state s_new.
   d. Update the table entry: Q'(s, a) <-- r + γ max_{a'} Q'(s_new, a').
   e. s <-- s_new.
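A minimal tabular implementation of this algorithm is sketched below. The corridor environment at the end is an assumption made only to exercise the code, and step 3a here simply selects actions uniformly at random (the probabilistic selection rule discussed later could be substituted):

    import random
    from collections import defaultdict

    def q_learning(states, actions, r, delta, is_goal, gamma=0.9, episodes=1000):
        # Step 1: initialize every table entry Q'(s, a) to zero.
        Q = defaultdict(float)
        for _ in range(episodes):
            # Step 2 / training episode: start in a random (non-goal) state.
            s = random.choice(states)
            while not is_goal(s):
                # Step 3a: select an action and execute it (here: uniformly at random).
                a = random.choice(actions)
                # Steps 3b-3c: receive the immediate reward and observe the new state.
                reward, s_new = r(s, a), delta(s, a)
                # Step 3d: Q'(s, a) <-- r + gamma * max_a' Q'(s_new, a').
                Q[(s, a)] = reward + gamma * max(Q[(s_new, a2)] for a2 in actions)
                # Step 3e: move to the new state.
                s = s_new
        return Q

    # Toy usage: a 4-cell corridor; moving right from cell 2 into cell 3 (the goal) earns 100.
    states, actions = [0, 1, 2], ["left", "right"]
    r     = lambda s, a: 100 if (s == 2 and a == "right") else 0
    delta = lambda s, a: min(s + 1, 3) if a == "right" else max(s - 1, 0)
    Q = q_learning(states, actions, r, delta, is_goal=lambda s: s == 3)
    print(round(Q[(0, "right")], 1))   # about 81.0 with gamma = 0.9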
Q learning is closely related to dynamic programming.
Requires discrete states and actions representable by a lookup table.
A training episode consists of starting the agent in a random state and letting it explore until it reaches a goal state.
Each training episode propagates information backwards through the Q' table, from later states to earlier states.
Q' will converge to Q in the limit, provided that the environment is a deterministic Markov decision process with bounded rewards and every state/action pair is visited infinitely often.
Error in Q' table diminishes over time, because each Q' update depends only in part on error-prone Q' estimates (reduced by a factor of γ), with the remainder depending on error-free observed immediate rewards.
How to choose each action in step 3a above?
Not a good idea for agent to always select action with highest Q' value.
This approach limits the amount of exploration done by the agent.
Overcommits to exploitation of high Q' values found early in training.
Convergence requires that all state/action transitions be sampled repeatedly.
Better approach: Select actions probabilistically according to Q' values.
Actions with higher Q' values are more likely to be chosen.
All actions have some chance of being chosen.
Example: Probability of choosing action a in state s: P(a | s) = k^Q'(s, a) / ∑_{a'} k^Q'(s, a'), where k > 0 is a constant.
k parameter acts like a knob that can be used to adjust the balance between exploration and exploitation.
Values of k close to 1: all actions are roughly equally likely to be chosen (exploration)
Larger values of k: actions with larger Q' values are much more likely to be chosen (exploitation)
k can be varied dynamically with the number of training iterations.
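A sketch of this probabilistic selection rule (the value of k is an assumption; any k > 1 trades off exploration against exploitation as described above):

    import random

    def select_action(Q, s, actions, k=2.0):
        # P(a | s) is proportional to k ** Q'(s, a): k near 1 makes all actions
        # nearly equally likely; larger k strongly favors high-Q' actions.
        weights = [k ** Q[(s, a)] for a in actions]
        return random.choices(actions, weights=weights)[0]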
Ability to generalize to unseen states or actions can be achieved by replacing the discrete lookup table by some type of continuous function approximator.
Example: Represent Q' function as a backpropagation neural network that takes s and a as input, and produces Q'(s, a) as output.
Training examples can be generated by using the earlier update rule:
Q'(s, a) <-- r + γ max_{a'} Q'(s_new, a')
Convergence is guaranteed only for discrete, lookup-table representations of Q'.
Another (probably better) way to represent Q' as a neural network: train a separate network for each action, or use a single network that takes only the state s as input and produces one Q'(s, a) output for each action a.
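A minimal sketch of that second design: a single one-hidden-layer network mapping a state feature vector to one Q' value per action, trained by gradient descent (backpropagation) toward the bootstrapped target r + γ max_{a'} Q'(s_new, a'). The layer sizes and learning rate are illustrative assumptions, not values prescribed by the text:

    import numpy as np

    class QNetwork:
        # One-hidden-layer network: state feature vector in, one Q' value per action out.
        def __init__(self, n_inputs, n_actions, n_hidden=16, lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_actions))
            self.b2 = np.zeros(n_actions)
            self.lr = lr

        def predict(self, s):
            # s is a numpy array of state features; returns the vector of Q'(s, a) values.
            self._h = np.tanh(s @ self.W1 + self.b1)
            return self._h @ self.W2 + self.b2

        def update(self, s, a, target):
            # One backpropagation step moving Q'(s, a) toward the target
            # r + gamma * max_a' Q'(s_new, a'), minimizing the squared error.
            q = self.predict(s)
            err = q[a] - target
            grad_out = np.zeros_like(q)
            grad_out[a] = err
            dW2 = np.outer(self._h, grad_out)
            db2 = grad_out
            dh  = (grad_out @ self.W2.T) * (1.0 - self._h ** 2)
            dW1 = np.outer(s, dh)
            db1 = dh
            for p, g in ((self.W1, dW1), (self.b1, db1), (self.W2, dW2), (self.b2, db2)):
                p -= self.lr * g

Training examples would be generated exactly as in the update rule above: for each observed (s, a, r, s_new), the target passed to update is r plus γ times the maximum of the network's predictions for s_new.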
What if reward function r and state transition function δ are nondeterministic?
Q can be generalized to nondeterministic environments using expected values of r, δ, and V*:
Q(s, a) = E[r(s, a)] + γ E[V*(δ(s, a))]
        = E[r(s, a)] + γ ∑_{s'} P(s' | s, a) V*(s')
        = E[r(s, a)] + γ ∑_{s'} P(s' | s, a) max_{a'} Q(s', a')
where P(s' | s, a) is the probability of state s' resulting from applying action a to state s.
Second term is the average of the best Q values attainable from each possible outcome of applying a to s, weighted by likelihood.
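If the expected reward and the transition probabilities were known, this expected-value form of Q could be evaluated directly; a small sketch (the tables below are hypothetical):

    def expected_q(s, a, expected_r, P, Q, actions, gamma=0.9):
        # Q(s, a) = E[r(s, a)] + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')
        backup = sum(prob * max(Q[(s2, a2)] for a2 in actions)
                     for s2, prob in P[(s, a)].items())
        return expected_r[(s, a)] + gamma * backup

    # Hypothetical transition: action "a" in state "s" leads to "s1" with
    # probability 0.8 and to "s2" with probability 0.2, earning no immediate reward.
    P = {("s", "a"): {"s1": 0.8, "s2": 0.2}}
    expected_r = {("s", "a"): 0.0}
    Q = {("s1", "a"): 100.0, ("s2", "a"): 0.0}
    print(expected_q("s", "a", expected_r, P, Q, actions=["a"]))   # 0 + 0.9 * 80 = 72.0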
Training rule:
Q'_n(s, a) <-- (1 − α_n) Q'_{n−1}(s, a) + α_n [ r + γ max_{a'} Q'_{n−1}(s_new, a') ]
where α_n = 1 / (1 + visits_n(s, a))
and visits_n(s, a) = the number of times this state/action pair has been visited, up to and including the current update.
The α_n parameter decreases over time, so updates gradually become smaller.
α_n = 1 is equivalent to the deterministic training rule.
α_n = 0 would prevent Q' from changing at all.
This gives a decaying weighted average of the current Q' value and the revised estimate.
This rule is still guaranteed to converge to the true Q function, provided every state/action pair continues to be visited infinitely often.
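A sketch of one step of this decaying-α update for a lookup-table Q' (the surrounding interaction loop, which supplies s, a, the observed reward, and s_new, is assumed to exist elsewhere):

    from collections import defaultdict

    Q      = defaultdict(float)   # Q'(s, a) table, initialized to zero
    visits = defaultdict(int)     # number of updates made to each (s, a) pair so far

    def nondeterministic_update(s, a, reward, s_new, actions, gamma=0.9):
        # alpha_n = 1 / (1 + visits_n(s, a)) shrinks as (s, a) is updated more often,
        # so later (noisy) observations change Q'(s, a) less and less.
        visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + visits[(s, a)])
        revised = reward + gamma * max(Q[(s_new, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * revised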