How can an agent learn to act given only indirect, delayed rewards or penalties as feedback?
Consider a robot learning to act in its environment.
Robot's sensors report the state of the environment (e.g., cameras, sonars).
Robot can perform actions to alter this state (e.g., move forward, turn).
Each action may generate a reward or penalty indicating the desirability of the resulting state.
Goals are defined by a reward function.
Robot's task is to learn a control policy that maximizes rewards accumulated over time.
Robot learns by actively exploring the environment, performing actions, and observing their consequences.
Agent can perceive a set S of distinct states.
Agent can choose from a set A of actions to perform.
Agent must learn a target function π : S -> A that maps from states to optimal actions.
Supervised training is not possible, because the optimal actions π(s) are unknown.
Only a sequence of immediate reward values is available to the agent.
Rewards that happen later are less important than rewards that happen sooner.
Goal: Learn to choose actions that maximize the cumulative reward over time
r_0 + γ r_1 + γ^2 r_2 + γ^3 r_3 + . . .
where the discount factor 0 ≤ γ < 1 determines the relative value of immediate vs. delayed rewards.
If γ = 0, only the immediate reward matters; delayed rewards are irrelevant.
As γ approaches 1, immediate and delayed rewards become nearly equally important.
Future rewards are discounted exponentially by their delay.
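As a concrete illustration of exponential discounting, here is a minimal sketch that computes the discounted cumulative reward for a finite reward sequence (the reward values and γ = 0.9 are assumptions chosen only for this example):

    def discounted_return(rewards, gamma):
        # Computes r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence.
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # With gamma = 0.9, a reward of 100 received two steps from now
    # is worth 0.81 * 100 = 81 today.
    print(discounted_return([0, 0, 100], 0.9))   # 81.0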
Problem of credit assignment: how to determine which of the agent's actions are responsible for eventual rewards?
Agent faces a tradeoff between exploration of unknown states and actions (to gain new information) and exploitation of learned states and actions (to maximize its cumulative reward).
Environment described by a reward function r and a state transition function δ:
r_t = r(s_t, a_t) is the reward received for performing action a_t in state s_t
s_{t+1} = δ(s_t, a_t) is the new state resulting from performing action a_t in state s_t
r and δ may be nondeterministic or unknown to the agent (i.e., agent may be unable to predict the results of its actions).
The discounted cumulative reward V^π for a particular policy π specifies the total reward achievable by the agent by following the policy from any given starting state:
V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + . . .
Example: if the current state is s and two policies a and b have V^a(s) = 81 and V^b(s) = 99, then the agent would be better off following policy b, because it would achieve a higher reward in the long run.
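When r and δ are deterministic and known, V^π(s) can be approximated simply by following the policy forward and summing the discounted rewards. A minimal sketch (the toy one-goal environment below is an assumption made only for illustration):

    def value_of_policy(policy, r, delta, s, gamma=0.9, horizon=100):
        # Approximate V^pi(s) by rolling the policy forward for a finite horizon
        # and accumulating gamma^t * r_t.
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            total += discount * r(s, a)
            s = delta(s, a)
            discount *= gamma
        return total

    # Toy example: performing "go" in state 0 earns a reward of 100 and moves
    # the agent to state 1, where no further reward is available.
    r     = lambda s, a: 100 if (s == 0 and a == "go") else 0
    delta = lambda s, a: 1
    print(value_of_policy(lambda s: "go", r, delta, s=0))   # 100.0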
The optimal policy π* is the one that achieves the greatest cumulative reward for all states.
V*(s) is the value function of the optimal policy.
V*(s) gives the maximum possible cumulative reward that the agent can obtain by starting in state s.
Example (assuming γ = 0.9): V*(s_yellow) = 0 + γ·100 + γ^2·0 + γ^3·0 + . . . = 90
V*(s_green) = 0 + γ·0 + γ^2·100 + γ^3·0 + . . . = 81
If an agent had perfect knowledge of the cumulative reward function V*, the immediate reward function r, and the state transition function δ, it could always choose the best action π*(s) from any state s as follows:
π*(s) = the action a that maximizes r(s, a) + γ V*(δ (s, a))
However, robotic agents generally do not have perfect knowledge of r or δ, because it is usually not possible to predict in advance the exact outcome of applying an arbitrary action to an arbitrary state.
Suppose we define an evaluation function for state/action pairs:
Q(s, a) = r(s, a) + γ V*(δ(s, a))
Q is the reward received immediately upon performing action a in state s, plus the value (discounted by γ) of following the optimal policy thereafter.
If an agent had perfect knowledge of the evaluation function Q, it could always choose the best action π*(s) from any state s as follows:
π*(s) = the action a that maximizes Q(s, a)
The agent does not need to know r, δ, or V* explicitly, because this knowledge is "hidden inside" the knowledge of Q.
Knowing V* means knowing only about states; knowing Q means knowing about states and actions.
Example (again with γ = 0.9):
Q(s, a_red) = 0 + γ × 81 = 72.9
Q(s, a_green) = 0 + γ × 100 = 90
Q(s, a_blue) = 0 + γ × 100 = 90
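Given a table of Q values, acting optimally reduces to a lookup and a max. A small sketch using the (hypothetical) values above:

    # Hypothetical Q table for a single state s, matching the example values above.
    Q = {("s", "a_red"): 72.9, ("s", "a_green"): 90.0, ("s", "a_blue"): 90.0}

    def greedy_action(Q, s, actions):
        # pi*(s) = the action a that maximizes Q(s, a); ties are broken by list order.
        return max(actions, key=lambda a: Q[(s, a)])

    print(greedy_action(Q, "s", ["a_red", "a_green", "a_blue"]))   # a_green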
How can the agent learn the Q function?
V*(s) is just the maximum of Q(s, a) over all actions a that are applicable to state s.
Using this, we can rewrite the definition of Q as follows:
Q(s, a) = r(s, a) + γ V*(δ(s, a))
        = r(s, a) + γ max_{a'} Q(δ(s, a), a')
where a' ranges over all actions applicable in the new state δ(s, a).
We can use this equation as the basis for iteratively approximating the Q function.
Let Q denote the true Q function (i.e., the target function).
Let Q' denote an approximation to Q (i.e., the current hypothesis).
We can represent Q' by a lookup table with one entry for each state/action pair.
Algorithm:
1. For each state/action pair (s, a), initialize the table entry Q'(s, a) to zero.
2. Observe the current state s.
3. Repeat forever:
   a. Select an action a and execute it.
   b. Receive the immediate reward r.
   c. Observe the new state s_new.
   d. Update the table entry: Q'(s, a) <-- r + γ max_{a'} Q'(s_new, a').
   e. s <-- s_new.
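A minimal tabular implementation of this algorithm is sketched below. The corridor environment at the end is an assumption made only to exercise the code, and step 3a here simply selects actions uniformly at random (the probabilistic selection rule discussed later could be substituted):

    import random
    from collections import defaultdict

    def q_learning(states, actions, r, delta, is_goal, gamma=0.9, episodes=1000):
        # Step 1: initialize every table entry Q'(s, a) to zero.
        Q = defaultdict(float)
        for _ in range(episodes):
            # Step 2 / training episode: start in a random (non-goal) state.
            s = random.choice(states)
            while not is_goal(s):
                # Step 3a: select an action and execute it (here: uniformly at random).
                a = random.choice(actions)
                # Steps 3b-3c: receive the immediate reward and observe the new state.
                reward, s_new = r(s, a), delta(s, a)
                # Step 3d: Q'(s, a) <-- r + gamma * max_a' Q'(s_new, a').
                Q[(s, a)] = reward + gamma * max(Q[(s_new, a2)] for a2 in actions)
                # Step 3e: move to the new state.
                s = s_new
        return Q

    # Toy usage: a 4-cell corridor; moving right from cell 2 into cell 3 (the goal) earns 100.
    states, actions = [0, 1, 2], ["left", "right"]
    r     = lambda s, a: 100 if (s == 2 and a == "right") else 0
    delta = lambda s, a: min(s + 1, 3) if a == "right" else max(s - 1, 0)
    Q = q_learning(states, actions, r, delta, is_goal=lambda s: s == 3)
    print(round(Q[(0, "right")], 1))   # about 81.0 with gamma = 0.9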
Q learning is closely related to dynamic programming.
Requires discrete states and actions representable by a lookup table.
A training episode consists of starting the agent in a random state and letting it explore until it reaches a goal state.
Each training episode propagates information backwards through the Q' table, from later states to earlier states.
Q' will converge to Q in the limit, provided that the environment is a deterministic Markov decision process with bounded rewards and every state/action pair is visited infinitely often.
Error in Q' table diminishes over time, because each Q' update depends only in part on error-prone Q' estimates (reduced by a factor of γ), with the remainder depending on error-free observed immediate rewards.
How to choose each action in step 3a above?
Not a good idea for agent to always select action with highest Q' value.
This approach limits the amount of exploration done by the agent.
Overcommits to exploitation of high Q' values found early in training.
Convergence requires that all state/action transitions be sampled repeatedly.
Better approach: Select actions probabilistically according to Q' values.
Actions with higher Q' values are more likely to be chosen.
All actions have some chance of being chosen.
Example: Probability of choosing action a in state s: P(a | s) = k^Q'(s, a) / ∑_{a'} k^Q'(s, a'), where k > 0 is a constant.
k parameter acts like a knob that can be used to adjust the balance between exploration and exploitation.
Values of k close to 1: all actions are roughly equally likely to be chosen (exploration)
Larger values of k: actions with larger Q' values are much more likely to be chosen (exploitation)
k can be varied dynamically with the number of training iterations.
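A sketch of this probabilistic selection rule (the value of k is an assumption; any k > 1 trades off exploration against exploitation as described above):

    import random

    def select_action(Q, s, actions, k=2.0):
        # P(a | s) is proportional to k ** Q'(s, a): k near 1 makes all actions
        # nearly equally likely; larger k strongly favors high-Q' actions.
        weights = [k ** Q[(s, a)] for a in actions]
        return random.choices(actions, weights=weights)[0]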
Ability to generalize to unseen states or actions can be achieved by replacing the discrete lookup table by some type of continuous function approximator.
Example: Represent Q' function as a backpropagation neural network that takes s and a as input, and produces Q'(s, a) as output.
Training examples can be generated by using the earlier update rule:
Q'(s, a) <-- r + γ max_{a'} Q'(s_new, a')
Convergence is guaranteed only for discrete, lookup-table representations of Q'.
Another (probably better) way to represent Q' as a neural network: train a separate network for each action, or use a single network that takes only the state s as input and produces one Q'(s, a) output for each action a.
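A minimal sketch of that second design: a single one-hidden-layer network mapping a state feature vector to one Q' value per action, trained by gradient descent (backpropagation) toward the bootstrapped target r + γ max_{a'} Q'(s_new, a'). The layer sizes and learning rate are illustrative assumptions, not values prescribed by the text:

    import numpy as np

    class QNetwork:
        # One-hidden-layer network: state feature vector in, one Q' value per action out.
        def __init__(self, n_inputs, n_actions, n_hidden=16, lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_actions))
            self.b2 = np.zeros(n_actions)
            self.lr = lr

        def predict(self, s):
            # s is a numpy array of state features; returns the vector of Q'(s, a) values.
            self._h = np.tanh(s @ self.W1 + self.b1)
            return self._h @ self.W2 + self.b2

        def update(self, s, a, target):
            # One backpropagation step moving Q'(s, a) toward the target
            # r + gamma * max_a' Q'(s_new, a'), minimizing the squared error.
            q = self.predict(s)
            err = q[a] - target
            grad_out = np.zeros_like(q)
            grad_out[a] = err
            dW2 = np.outer(self._h, grad_out)
            db2 = grad_out
            dh  = (grad_out @ self.W2.T) * (1.0 - self._h ** 2)
            dW1 = np.outer(s, dh)
            db1 = dh
            for p, g in ((self.W1, dW1), (self.b1, db1), (self.W2, dW2), (self.b2, db2)):
                p -= self.lr * g

Training examples would be generated exactly as in the update rule above: for each observed (s, a, r, s_new), the target passed to update is r plus γ times the maximum of the network's predictions for s_new.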
What if reward function r and state transition function δ are nondeterministic?
Q can be generalized to nondeterministic environments using expected values of r, δ, and V*:
Q(s, a) = E[r(s, a)] + γ E[V*(δ(s, a))]
        = E[r(s, a)] + γ ∑_{s'} P(s' | s, a) V*(s')
        = E[r(s, a)] + γ ∑_{s'} P(s' | s, a) max_{a'} Q(s', a')
where P(s' | s, a) is the probability of state s' resulting from applying action a to state s.
Second term is the average of the best Q values attainable from each possible outcome of applying a to s, weighted by likelihood.
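If the expected reward and the transition probabilities were known, this expected-value form of Q could be evaluated directly; a small sketch (the tables below are hypothetical):

    def expected_q(s, a, expected_r, P, Q, actions, gamma=0.9):
        # Q(s, a) = E[r(s, a)] + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')
        backup = sum(prob * max(Q[(s2, a2)] for a2 in actions)
                     for s2, prob in P[(s, a)].items())
        return expected_r[(s, a)] + gamma * backup

    # Hypothetical transition: action "a" in state "s" leads to "s1" with
    # probability 0.8 and to "s2" with probability 0.2, earning no immediate reward.
    P = {("s", "a"): {"s1": 0.8, "s2": 0.2}}
    expected_r = {("s", "a"): 0.0}
    Q = {("s1", "a"): 100.0, ("s2", "a"): 0.0}
    print(expected_q("s", "a", expected_r, P, Q, actions=["a"]))   # 0 + 0.9 * 80 = 72.0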
Training rule:
Q'_n(s, a) <-- (1 − α_n) Q'_{n−1}(s, a) + α_n [ r + γ max_{a'} Q'_{n−1}(s_new, a') ]
where α_n = 1 / (1 + visits_n(s, a))
and visits_n(s, a) = the number of times this state/action pair has been visited, up to and including the current update.
The α_n parameter decreases over time, so updates gradually become smaller.
α_n = 1 is equivalent to the deterministic training rule.
α_n = 0 would prevent Q' from changing at all.
This gives a decaying weighted average of the current Q' value and the revised estimate.
This rule is still guaranteed to converge to the true Q function, provided every state/action pair continues to be visited infinitely often.
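A sketch of one step of this decaying-α update for a lookup-table Q' (the surrounding interaction loop, which supplies s, a, the observed reward, and s_new, is assumed to exist elsewhere):

    from collections import defaultdict

    Q      = defaultdict(float)   # Q'(s, a) table, initialized to zero
    visits = defaultdict(int)     # number of updates made to each (s, a) pair so far

    def nondeterministic_update(s, a, reward, s_new, actions, gamma=0.9):
        # alpha_n = 1 / (1 + visits_n(s, a)) shrinks as (s, a) is updated more often,
        # so later (noisy) observations change Q'(s, a) less and less.
        visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + visits[(s, a)])
        revised = reward + gamma * max(Q[(s_new, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * revised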