Can computers be taught to do interesting things?
This has been a central question of AI from the very beginning.
Arthur Samuel's checkers player (mid-1950s) was the first significant machine learning program.
Interest in machine learning was strong in the 1950s and 60s, weaker in the 70s, and reborn in the early 80s; that rebirth continues today.
Highly interdisciplinary
Sphinx speech recognition systems (CMU) learn speaker-specific strategies for recognizing phonemes using neural networks and hidden Markov models.
NavLab computer-controlled vehicles (CMU) learned to drive unassisted on public highways by observing human drivers (the ALVINN system).
SKICAT astronomical cataloging system (JPL/Caltech) learned to classify objects from the second Palomar Observatory Sky Survey using decision trees (3 terabytes of image data).
TD-Gammon backgammon program (Gerald Tesauro/IBM) learned to play backgammon at the grandmaster level by playing over 1 million practice games against itself using reinforcement learning; it became the world's top backgammon program.
Percepts are used to improve the agent's future behavior, as well as serving as the basis for its current actions.
An agent can learn by observing its own decision-making process.
Learning may occur by simple memorization or may involve higher-level processes such as analogy.
We will focus on inductive learning: constructing a function called a hypothesis from a set of input/output examples.
The hypothesis represents the agent's current knowledge about how to respond in any given situation.
Most learning agents can be described in terms of four components:
The Performance System selects actions in response to percepts.
The Critic monitors the performance of the agent and generates feedback that can be used to improve performance.
The Learning Algorithm uses feedback to update the internal hypothesis used by the Performance System.
The Experiment Generator uses the updated hypothesis to generate new problems for the system to explore in order to maximize its rate of learning.
So far, our agent designs have focused on just the Performance System.
Performance standard should be external, so that the agent must adjust its performance to fit the standard, not the other way around.
The Experiment Generator forces the agent to explore new situations. Otherwise, the agent would keep doing whatever actions it has determined are the best so far, even though there may be better untried choices.
The choice of learning algorithm depends strongly on the choice of representation for the knowledge to be learned (i.e. the hypothesis), and vice versa.
Trade-off between expressiveness and efficiency. More expressive representation schemes (e.g. first-order logic) lead to heavier penalties in terms of computation time and number of examples needed to learn a reasonable function.
A precise teacher (supervised learning).
Example: individual checkers board states with correct moves provided by the teacher.
A vague teacher (indirect reinforcement).
Example: the final win/lose outcome of a game.
No teacher (unsupervised learning). Still possible to learn regularities in perceptual inputs even in the absence of a teacher.
Reinforcement learning raises the issue of credit assignment:
How to assign credit or blame for the final outcome to each move?
Example: optimal moves can be followed by poor moves, leading to loss of the game.
Learner is given a series of training examples <x1, f(x1)>, <x2, f(x2)>, ...
Learner returns a hypothesis h that approximates the target function f.
Suppose we have a small set of training examples <x, f(x)>, plotted as points in the (x, f(x)) plane.
Several candidate hypotheses could be drawn through them; consider three curves, G, R, and B:
Hypothesis G is too simple: it passes through only one point.
Hypotheses R and B approximate the training examples equally well, but differ in how they assign values to unknown inputs. Since f is unknown, no reason to prefer B over R.
In general, if two hypotheses approximate f equally well, no a priori reason to prefer one over the other.
Learning algorithms that do exhibit a preference for one hypothesis over another are said to have an inductive bias.
Example: one common type of inductive bias is called Occam's Razor: Prefer the simplest hypothesis that fits the data. This bias would prefer hypothesis B above.
Without some type of inductive bias, generalization would be impossible. The agent would be unable to decide how to respond to situations it has never seen before.
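As a concrete illustration (a minimal sketch with made-up data points; NumPy assumed available), a line and a degree-4 polynomial can fit the same five training examples about equally well yet disagree sharply on an unseen input. Occam's Razor prefers the line:

    import numpy as np

    # Hypothetical training examples <x, f(x)> (made-up, roughly linear data).
    xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    ys = np.array([0.1, 1.1, 1.9, 3.2, 3.9])

    line = np.polyfit(xs, ys, deg=1)     # simple hypothesis: a straight line
    quartic = np.polyfit(xs, ys, deg=4)  # complex hypothesis: interpolates all 5 points

    # Both hypotheses reproduce the training data closely...
    print(np.polyval(line, xs))
    print(np.polyval(quartic, xs))

    # ...but they can assign very different values to an unseen input.
    print(np.polyval(line, 10.0), np.polyval(quartic, 10.0))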
Hard question: How do we know when the learned function is good enough?
We want the agent to perform well in novel situations, but we won't know what those situations are until they happen.
Use the following methodology to evaluate a hypothesis's ability to generalize:
1. Collect a large set of examples.
2. Divide them into two disjoint sets: a training set and a test set.
3. Apply the learning algorithm to the training set, generating a hypothesis h.
4. Measure the percentage of test-set examples correctly predicted by h.
5. Repeat steps 1-4 for different sizes of training sets and different randomly selected training sets of each size.
Critical assumption: Distribution of training examples is identical to distribution of testing examples.
This makes theoretical results easier to obtain, but is not always valid in practice.
It is best to use freshly-generated examples each time in step 5, rather than a different mixture of the same examples, but this is often difficult to do in practice.
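A minimal sketch of this methodology in Python (learn is an assumed stand-in for the learning algorithm under evaluation; it maps a list of <x, f(x)> pairs to a hypothesis h):

    import random

    def estimate_generalization(examples, learn, train_fraction=0.8, trials=20):
        # Repeatedly split the examples into disjoint training and test sets,
        # learn a hypothesis h from the training set, and measure its
        # accuracy on the held-out test set (steps 1-5 above).
        scores = []
        for _ in range(trials):
            shuffled = random.sample(examples, len(examples))  # fresh random split
            cut = int(train_fraction * len(shuffled))
            train, test = shuffled[:cut], shuffled[cut:]
            h = learn(train)
            correct = sum(1 for x, y in test if h(x) == y)
            scores.append(correct / len(test))
        return sum(scores) / len(scores)

Note that, as cautioned above, this sketch re-splits the same pool of examples on each trial rather than generating fresh examples.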
Must identify:
Task to be learned
Performance measure to be improved
Training experience to use
Exact type of knowledge to be learned
A representation for this knowledge
A learning algorithm
Task: playing checkers
Performance measure: % of games won in world checkers tournament
Training experience: games played against itself
Knowledge to be learned
Target function V : Board --> Reals (static evaluation function)
This is a nonoperational definition. Agent must learn an operational approximation h to this ideal target function.
Representation of the learned h function
h(b) = w0 + w1 BP(b) + w2 RP(b) + w3 BK(b) + w4 RK(b) + w5 BT(b) + w6 RT(b)
where BP(b)/RP(b) count the black/red pieces on board b, BK(b)/RK(b) the black/red kings, and BT(b)/RT(b) the black/red pieces under threat (capturable on the next turn).
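A minimal sketch of this representation in Python (BP, RP, BK, RK, BT, RT are assumed feature-extraction helpers operating on a board b):

    def h(b, w):
        # Linear static evaluation with seven weights w = [w0, ..., w6].
        features = [1, BP(b), RP(b), BK(b), RK(b), BT(b), RT(b)]
        return sum(wi * fi for wi, fi in zip(w, features))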
Learning algorithm
How to generate accurate training examples <b, Vtrain(b)>?
Easy to assign correct target values to final board states.
Example: <[BP=3, RP=0, BK=1, RK=0, BT=0, RT=0], +100>
Good way to pick target values for intermediate states:
Vtrain(b) <-- h(Successor(b))
where Successor(b) = board state following program's move and opponent's response.
This approach uses h itself to generate training values for improving h.
h tends to be more accurate for board states closer to the end of the game.
Under certain conditions, h can be proven to converge toward perfect estimates of Vtrain
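A minimal sketch of this bootstrapping step (Successor is an assumed helper returning the board after the program's move and the opponent's reply; h is sketched above):

    def training_value(b, w):
        # Vtrain(b) <-- h(Successor(b)): label an intermediate board with the
        # current hypothesis's estimate of the position two half-moves later.
        return h(Successor(b), w)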
How to adjust the feature weights w0, ..., w6?
Least Mean Squares (LMS) algorithm:
Do repeatedly:
1. Select a training example <b, Vtrain(b)> at random.
2. Compute the error on that example: error(b) = Vtrain(b) - h(b).
3. For each board feature fi, update its weight wi:
   wi <-- wi + η · fi · error(b)
where η is a small constant (e.g. 0.1) that moderates the size of the weight update.
This process minimizes the squared error E:
E = ∑ [Vtrain(b) - h(b)]^2, summed over all training examples <b, Vtrain(b)>
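A minimal sketch of one LMS sweep in Python (features(b) is an assumed helper returning [1, BP(b), RP(b), BK(b), RK(b), BT(b), RT(b)]):

    import random

    def lms_pass(examples, w, features, eta=0.1):
        # One sweep of the LMS update over <b, Vtrain(b)> pairs. Each update
        # nudges h(b) toward Vtrain(b), descending the squared-error surface E.
        for b, v_train in random.sample(examples, len(examples)):
            fs = features(b)
            error = v_train - sum(wi * fi for wi, fi in zip(w, fs))
            for i in range(len(w)):
                w[i] += eta * fs[i] * error
        return w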
The Final Design
Performance System uses h to choose next move at each step in the game.
Critic generates new training examples <b, Vtrain(b)> based on current hypothesis and observed performance of agent.
Learning Algorithm updates feature weights according to training examples provided by Critic, using LMS algorithm, and outputs new hypothesis h.
Experiment Generator always proposes same problem (i.e. initial checkers board state), but in principle could propose other board states to practice on.
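A minimal sketch tying the four components into a single training loop (initial_board, play_game, and critic_examples are hypothetical helpers; lms_pass is sketched above):

    def train(w, features, generations=1000):
        for _ in range(generations):
            b0 = initial_board()                  # Experiment Generator: same start state each time
            trace = play_game(b0, w, features)    # Performance System: play one game using h
            examples = critic_examples(trace, w)  # Critic: produce <b, Vtrain(b)> pairs
            w = lms_pass(examples, w, features)   # Learning Algorithm: LMS weight update
        return w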
Assumes that the ideal target function V is representable within a hypothesis space of linear, continuous-valued evaluation functions over six board features (seven coefficients w0, ..., w6).
Performs hill-climbing search (actually, gradient descent) in hypothesis space.
Machine learning can be viewed generally as a search through a space of hypotheses defined by some underlying representation (e.g. linear functions, logical descriptions, decision trees, artificial neural networks, etc.).
Learning algorithms rely on structure of hypothesis space.
Different hypothesis representations are appropriate for different kinds of target functions.