Can computers be taught to do interesting things?
This has been a central question of AI from the very beginning.
Arthur Samuel's checkers player (mid-1950s) was the first significant machine learning program.
Interest in machine learning was strong in the 1950s and 60s, weaker in the 70s, and reborn in the early 80s; that rebirth continues today.
Highly interdisciplinary
Sphinx speech recognition systems (CMU) learn speaker-specific strategies for recognizing phonemes using neural networks and hidden Markov models.
NavLab computer-controlled vehicles (CMU) learned to drive unassisted on public highways by observing human drivers (the ALVINN system).
SKICAT astronomical cataloging system (JPL/Caltech) learned to classify objects from the second Palomar Observatory Sky Survey using decision trees (3 terabytes of image data).
TD-Gammon backgammon program (Gerald Tesauro/IBM) learned to play backgammon at the grandmaster level by playing over 1 million practice games against itself using reinforcement learning; it became the world's top backgammon program.
Percepts are used to improve the agent's future behavior, as well as serving as the basis for its current actions.
An agent can learn by observing its own decision-making process.
Learning may occur by simple memorization or may involve higher-level processes such as analogy.
We will focus on inductive learning: constructing a function called a hypothesis from a set of input/output examples.
The hypothesis represents the agent's current knowledge about how to respond in any given situation.
Most learning agents can be described in terms of four components:
The Performance System selects actions in response to percepts.
The Critic monitors the performance of the agent and generates feedback that can be used to improve performance.
The Learning Algorithm uses feedback to update the internal hypothesis used by the Performance System.
The Experiment Generator uses the updated hypothesis to generate new problems for the system to explore in order to maximize its rate of learning.
So far, our agent designs have focused on just the Performance System.
Performance standard should be external, so that the agent must adjust its performance to fit the standard, not the other way around.
The Experiment Generator forces the agent to explore new situations. Otherwise, the agent would keep doing whatever actions it has determined are the best so far, even though there may be better untried choices.
The choice of learning algorithm depends strongly on the choice of representation for the knowledge to be learned (i.e. the hypothesis), and vice versa.
Trade-off between expressiveness and efficiency. More expressive representation schemes (e.g. first-order logic) lead to heavier penalties in terms of computation time and number of examples needed to learn a reasonable function.
A precise teacher (supervised learning).
Example: individual checkers board states with correct moves provided by the teacher.
A vague teacher (indirect reinforcement).
Example: the final win/lose outcome of a game.
No teacher (unsupervised learning). Still possible to learn regularities in perceptual inputs even in the absence of a teacher.
Reinforcement learning raises the issue of credit assignment:
How to assign credit or blame for the final outcome to each move?
Example: optimal moves can be followed by poor moves, leading to loss of the game.
Learner is given a series of training examples <x1, f(x1)>, <x2, f(x2)>, ...
Learner returns a hypothesis h that approximates the target function f.
Suppose we have a small set of training examples <x, f(x)>, plotted as points in the (x, f(x)) plane.
Several candidate hypotheses could be drawn through them; consider three curves, G, R, and B:
Hypothesis G is too simple: it passes through only one point.
Hypotheses R and B approximate the training examples equally well, but differ in how they assign values to unknown inputs. Since f is unknown, no reason to prefer B over R.
In general, if two hypotheses approximate f equally well, no a priori reason to prefer one over the other.
Learning algorithms that do exhibit a preference for one hypothesis over another are said to have an inductive bias.
Example: one common type of inductive bias is called Occam's Razor: Prefer the simplest hypothesis that fits the data. This bias would prefer hypothesis B above.
Without some type of inductive bias, generalization would be impossible. The agent would be unable to decide how to respond to situations it has never seen before.
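As a concrete illustration (a minimal sketch with made-up data points; NumPy assumed available), a line and a degree-4 polynomial can fit the same five training examples about equally well yet disagree sharply on an unseen input. Occam's Razor prefers the line:

    import numpy as np

    # Hypothetical training examples <x, f(x)> (made-up, roughly linear data).
    xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    ys = np.array([0.1, 1.1, 1.9, 3.2, 3.9])

    line = np.polyfit(xs, ys, deg=1)     # simple hypothesis: a straight line
    quartic = np.polyfit(xs, ys, deg=4)  # complex hypothesis: interpolates all 5 points

    # Both hypotheses reproduce the training data closely...
    print(np.polyval(line, xs))
    print(np.polyval(quartic, xs))

    # ...but they can assign very different values to an unseen input.
    print(np.polyval(line, 10.0), np.polyval(quartic, 10.0))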
Hard question: How do we know when the learned function is good enough?
We want the agent to perform well in novel situations, but we won't know what those situations are until they happen.
Use the following methodology to evaluate a hypothesis's ability to generalize:
1. Collect a large set of examples.
2. Divide them into two disjoint sets: a training set and a test set.
3. Apply the learning algorithm to the training set, generating a hypothesis h.
4. Measure the percentage of test-set examples correctly predicted by h.
5. Repeat steps 1-4 for different sizes of training sets and different randomly selected training sets of each size.
Critical assumption: Distribution of training examples is identical to distribution of testing examples.
This makes theoretical results easier to obtain, but is not always valid in practice.
It is best to use freshly-generated examples each time in step 5, rather than a different mixture of the same examples, but this is often difficult to do in practice.
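A minimal sketch of this methodology in Python (learn is an assumed stand-in for the learning algorithm under evaluation; it maps a list of <x, f(x)> pairs to a hypothesis h):

    import random

    def estimate_generalization(examples, learn, train_fraction=0.8, trials=20):
        # Repeatedly split the examples into disjoint training and test sets,
        # learn a hypothesis h from the training set, and measure its
        # accuracy on the held-out test set (steps 1-5 above).
        scores = []
        for _ in range(trials):
            shuffled = random.sample(examples, len(examples))  # fresh random split
            cut = int(train_fraction * len(shuffled))
            train, test = shuffled[:cut], shuffled[cut:]
            h = learn(train)
            correct = sum(1 for x, y in test if h(x) == y)
            scores.append(correct / len(test))
        return sum(scores) / len(scores)

Note that, as cautioned above, this sketch re-splits the same pool of examples on each trial rather than generating fresh examples.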
Must identify:
Task to be learned
Performance measure to be improved
Training experience to use
Exact type of knowledge to be learned
A representation for this knowledge
A learning algorithm
Task: playing checkers
Performance measure: % of games won in world checkers tournament
Training experience: games played against itself
Knowledge to be learned
Target function V : Board --> Reals (static evaluation function)
This is a nonoperational definition. Agent must learn an operational approximation h to this ideal target function.
Representation of the learned h function
h(b) = w0 + w1 BP(b) + w2 RP(b) + w3 BK(b) + w4 RK(b) + w5 BT(b) + w6 RT(b)
where BP(b)/RP(b) count the black/red pieces on board b, BK(b)/RK(b) the black/red kings, and BT(b)/RT(b) the black/red pieces under threat (capturable on the next turn).
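A minimal sketch of this representation in Python (BP, RP, BK, RK, BT, RT are assumed feature-extraction helpers operating on a board b):

    def h(b, w):
        # Linear static evaluation with seven weights w = [w0, ..., w6].
        features = [1, BP(b), RP(b), BK(b), RK(b), BT(b), RT(b)]
        return sum(wi * fi for wi, fi in zip(w, features))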
Learning algorithm
How to generate accurate training examples <b, Vtrain(b)>?
Easy to assign correct target values to final board states.
Example: <[BP=3, RP=0, BK=1, RK=0, BT=0, RT=0], +100>
Good way to pick target values for intermediate states:
Vtrain(b) <-- h(Successor(b))
where Successor(b) = board state following program's move and opponent's response.
This approach uses h itself to generate training values for improving h.
h tends to be more accurate for board states closer to the end of the game.
Under certain conditions, h can be proven to converge toward perfect estimates of Vtrain
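A minimal sketch of this bootstrapping step (Successor is an assumed helper returning the board after the program's move and the opponent's reply; h is sketched above):

    def training_value(b, w):
        # Vtrain(b) <-- h(Successor(b)): label an intermediate board with the
        # current hypothesis's estimate of the position two half-moves later.
        return h(Successor(b), w)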
How to adjust the feature weights w0, ..., w6?
Least Mean Squares (LMS) algorithm:
Do repeatedly:
1. Select a training example <b, Vtrain(b)> at random.
2. Compute the error on that example: error(b) = Vtrain(b) - h(b).
3. For each board feature fi, update its weight wi:
   wi <-- wi + η · fi · error(b)
where η is a small constant (e.g. 0.1) that moderates the size of the weight update.
This process minimizes the squared error E:
E = ∑ [Vtrain(b) - h(b)]^2, summed over all training examples <b, Vtrain(b)>
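A minimal sketch of one LMS sweep in Python (features(b) is an assumed helper returning [1, BP(b), RP(b), BK(b), RK(b), BT(b), RT(b)]):

    import random

    def lms_pass(examples, w, features, eta=0.1):
        # One sweep of the LMS update over <b, Vtrain(b)> pairs. Each update
        # nudges h(b) toward Vtrain(b), descending the squared-error surface E.
        for b, v_train in random.sample(examples, len(examples)):
            fs = features(b)
            error = v_train - sum(wi * fi for wi, fi in zip(w, fs))
            for i in range(len(w)):
                w[i] += eta * fs[i] * error
        return w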
The Final Design
Performance System uses h to choose next move at each step in the game.
Critic generates new training examples <b, Vtrain(b)> based on current hypothesis and observed performance of agent.
Learning Algorithm updates feature weights according to training examples provided by Critic, using LMS algorithm, and outputs new hypothesis h.
Experiment Generator always proposes same problem (i.e. initial checkers board state), but in principle could propose other board states to practice on.
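A minimal sketch tying the four components into a single training loop (initial_board, play_game, and critic_examples are hypothetical helpers; lms_pass is sketched above):

    def train(w, features, generations=1000):
        for _ in range(generations):
            b0 = initial_board()                  # Experiment Generator: same start state each time
            trace = play_game(b0, w, features)    # Performance System: play one game using h
            examples = critic_examples(trace, w)  # Critic: produce <b, Vtrain(b)> pairs
            w = lms_pass(examples, w, features)   # Learning Algorithm: LMS weight update
        return w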
Assumes that the ideal target function V is representable within a hypothesis space of linear, continuous-valued evaluation functions over six board features (seven coefficients w0, ..., w6).
Performs hill-climbing search (actually, gradient descent) in hypothesis space.
Machine learning can be viewed generally as a search through a space of hypotheses defined by some underlying representation (e.g. linear functions, logical descriptions, decision trees, artificial neural networks, etc.).
Learning algorithms rely on structure of hypothesis space.
Different hypothesis representations are appropriate for different kinds of target functions.