Decision Trees


Some successful applications of decision trees:

Example:  Decision tree for the concept PlayTennis

<Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong> classified as No
 

In general, decision trees are disjunctions of conjunctions of constraints on attribute values.
 

The tree above is equivalent to the following logical expression:

(Outlook=Sunny ^ Humidity=Normal) v (Outlook=Overcast) v (Outlook=Rain ^ Wind=Weak)
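As a sanity check, the expression can be encoded directly. A minimal Python sketch (the function name play_tennis and the string encodings are illustrative, not from the source) classifies the instance above as No:

    def play_tennis(outlook, humidity, wind):
        # Disjunction of conjunctions read off the paths to the Yes leaves;
        # note that Temp is not tested anywhere in the tree.
        return ((outlook == "Sunny" and humidity == "Normal")
                or outlook == "Overcast"
                or (outlook == "Rain" and wind == "Weak"))

    # <Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong>
    print(play_tennis("Sunny", "High", "Strong"))   # False, i.e. classified as No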


Constructing Decision Trees


Idea: Determine which attribute of the examples makes the most difference in their classification and begin there.
 

Inductive bias: Prefer shallower trees that ask fewer questions.
 

Example: Boolean function: first bit or second bit on.
 
Instance   Bits   Classification
A          000    -
B          001    -
C          010    +
D          011    +
E          100    +
F          101    +
G          110    +
H          111    +

We can think of each bit position as an attribute (with value on or off).  Which attribute divides the examples best?

Bits 1 and 2 are equally good starting points, but Bit 3 is less desirable.
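To see this concretely, here is a small Python sketch (encoding taken from the table above) that tallies the positive and negative counts in the partition induced by each bit:

    # Instances from the table: (bit string, classification)
    examples = [("000", "-"), ("001", "-"), ("010", "+"), ("011", "+"),
                ("100", "+"), ("101", "+"), ("110", "+"), ("111", "+")]

    for bit in range(3):                                  # Bit1, Bit2, Bit3
        on  = [c for b, c in examples if b[bit] == "1"]
        off = [c for b, c in examples if b[bit] == "0"]
        print(f"Bit{bit + 1}: on = {on.count('+')}+/{on.count('-')}-,"
              f" off = {off.count('+')}+/{off.count('-')}-")

    # Bit1: on = 4+/0-, off = 2+/2-
    # Bit2: on = 4+/0-, off = 2+/2-
    # Bit3: on = 3+/1-, off = 3+/1-

Bit1 and Bit2 each make one branch pure immediately; Bit3 leaves both branches mixed.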

Let's start with Bit1.  The branch for Bit1 = on is already complete (all of its examples are positive), so no more work is necessary there.  We still need to complete the branch for Bit1 = off.

Consider Bit2 vs. Bit3 for the remaining examples (A, B, C, D).  Bit2 is better: it separates the positives (C, D) from the negatives (A, B) perfectly, while Bit3 leaves both branches mixed:

This decision tree is equivalent to the following logical expression:

Bit1(on) v (Bit1(off) ^ Bit2(on))



Example: Boolean function: first and last bits equal.
 
Instance   Bits   Classification
A          000    +
B          001    -
C          010    +
D          011    -
E          100    -
F          101    +
G          110    -
H          111    +

Bit1, Bit2, and Bit3 look equally good at the beginning, but it turns out that Bit2 is actually worse in the long run:

These trees are equivalent to the following logical expressions:

(Bit1(on) ^ Bit3(on)) v (Bit1(off) ^ Bit3(off))

(Bit3(on) ^ Bit1(on)) v (Bit3(off) ^ Bit1(off))

(Bit2(on) ^ Bit1(on) ^ Bit3(on)) v (Bit2(on) ^ Bit1(off) ^ Bit3(off)) v (Bit2(off) ^ Bit3(on) ^ Bit1(on)) v (Bit2(off) ^ Bit3(off) ^ Bit1(off))
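As a quick check, a Python sketch (encoding as in the table above) confirms that all three expressions classify the eight instances identically; the Bit2-rooted tree simply needs an extra test on every path:

    examples = {"000": "+", "001": "-", "010": "+", "011": "-",
                "100": "-", "101": "+", "110": "-", "111": "+"}

    def tree_bit1(b):   # Bit1 (or equivalently Bit3) at the root
        return (b[0] == "1" and b[2] == "1") or (b[0] == "0" and b[2] == "0")

    def tree_bit2(b):   # Bit2 at the root: both branches still need Bit1 and Bit3
        return ((b[1] == "1" and b[0] == "1" and b[2] == "1")
                or (b[1] == "1" and b[0] == "0" and b[2] == "0")
                or (b[1] == "0" and b[2] == "1" and b[0] == "1")
                or (b[1] == "0" and b[2] == "0" and b[0] == "0"))

    for bits, label in examples.items():
        assert tree_bit1(bits) == tree_bit2(bits) == (label == "+")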
 

Constructing a decision tree in this manner is a hill-climbing search through a hypothesis space of decision trees.
 

The tree with the Bit2 attribute at its root corresponds to a local (suboptimal) maximum.


How to determine best attribute in general?


Use entropy to compute information gain.

Entropy measures heterogeneity of a set of examples.

Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

where p(+) and p(-) are the proportions of positive and negative examples in S, respectively.
 

Example:

Entropy([9+, 5-]) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940
 

Entropy = 0 implies that all examples in S have the same classification.

Entropy = 1 implies equal numbers of positive and negative examples.
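A short Python sketch of the entropy computation (written for a list of per-class counts, so it also covers the general n-class case discussed below; the function name is illustrative):

    import math

    def entropy(counts):
        # Entropy of a set of examples, given the count of examples in each class.
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    print(entropy([9, 5]))    # about 0.940 -- the [9+, 5-] example above
    print(entropy([7, 7]))    # 1.0  -- equal numbers of positive and negative examples
    print(entropy([14, 0]))   # 0.0  -- all examples have the same classification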

Interpretation of entropy: the expected number of bits needed to encode the classification of a randomly drawn member of S (using an optimal, shortest-length code).


General case: n possible values for the classification:

Entropy(S) = Σi=1..n  - pi log2 pi

where pi is the proportion of examples in S belonging to class i.

Range of possible values:  0  <=  Entropy(S)  <=  log2 n

Information gain measures the reduction in entropy caused by partitioning the examples according to a particular attribute:

Gain(S, A)  =  Entropy(S)  -  Σv in Values(A)  (|Sv| / |S|) Entropy(Sv)

where Values(A) is the set of possible values for attribute A and Sv is the subset of examples with attribute A = v.

The sum term is just the weighted average of the entropies of the partitioned examples (weighted by relative partition size).


Example: Learning the concept of PlayTennis

Values(Wind) = { Weak, Strong }

S:  [9+, 5-]

SWeak:  [6+, 2-]

SStrong:  [3+, 3-]

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(SWeak) - (6/14) Entropy(SStrong)

                        = 0.940 - (8/14) 0.811 - (6/14) 1.00

                        = 0.048
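The same number can be checked with a short, self-contained Python sketch of the gain formula (the function names are illustrative):

    import math

    def entropy(counts):
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    def gain(s_counts, subset_counts):
        # Entropy of S minus the size-weighted entropies of the subsets Sv.
        total = sum(s_counts)
        return entropy(s_counts) - sum(sum(sv) / total * entropy(sv)
                                       for sv in subset_counts)

    # S = [9+, 5-], SWeak = [6+, 2-], SStrong = [3+, 3-]
    print(gain([9, 5], [[6, 2], [3, 3]]))   # about 0.048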

Humidity versus Wind:

Similar computations for Outlook and Temp yield:

Resulting partially-constructed tree:


Decision Tree Learning Algorithm

DTL(Examples, TargetAttribute, Attributes)
    /*  Examples are the training examples. TargetAttribute is the attribute whose value is to be predicted by the tree.  Attributes is a list of other attributes that may be tested by the learned decision tree.  Returns a decision tree that correctly classifies the given Examples.  */
     
  • create a Root node for the tree
  • if all Examples are positive, return the single-node tree Root, with label = Yes
  • if all Examples are negative, return the single-node tree Root, with label = No
  • if Attributes is empty, return the single-node tree Root, with label = most common value of TargetAttribute in Examples
  • else begin
    • A  <--  the attribute from Attributes with the highest information gain with respect to Examples
    • Make A the decision attribute for Root
    • for each possible value v of A {
      • add a new tree branch below Root, corresponding to the test A = v
      • let Examplesv be the subset of Examples that have value v for attribute A
      • if Examplesv is empty then
        • add a leaf node below this new branch with label = most common value of TargetAttribute in Examples
      • else
        • add the subtree DTL(Examplesv, TargetAttribute, Attributes - { A })
      }
    end
  • return Root


DTL performs hill-climbing search using information gain as a heuristic.
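Below is a compact Python sketch of DTL, under a few assumptions not in the source: each example is a dict mapping attribute names (including the target) to values, leaves are represented by bare labels, internal nodes by nested dicts, and branch values are taken from the examples present, so the pseudocode's empty-subset case does not arise here.

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return sum(-(n / total) * math.log2(n / total)
                   for n in Counter(labels).values())

    def information_gain(examples, attribute, target):
        labels = [e[target] for e in examples]
        remainder = 0.0
        for v in set(e[attribute] for e in examples):
            subset = [e[target] for e in examples if e[attribute] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - remainder

    def dtl(examples, target, attributes):
        labels = [e[target] for e in examples]
        if len(set(labels)) == 1:                      # all examples agree
            return labels[0]
        if not attributes:                             # no attributes left to test
            return Counter(labels).most_common(1)[0][0]
        # Choose the attribute with the highest information gain
        a = max(attributes, key=lambda attr: information_gain(examples, attr, target))
        tree = {a: {}}
        for v in set(e[a] for e in examples):
            subset = [e for e in examples if e[a] == v]
            tree[a][v] = dtl(subset, target, [x for x in attributes if x != a])
        return tree

    # Example: the "first bit or second bit on" data from earlier
    data = [{"Bit1": b[0], "Bit2": b[1], "Bit3": b[2], "Class": c}
            for b, c in [("000", "-"), ("001", "-"), ("010", "+"), ("011", "+"),
                         ("100", "+"), ("101", "+"), ("110", "+"), ("111", "+")]]
    print(dtl(data, "Class", ["Bit1", "Bit2", "Bit3"]))
    # e.g. {'Bit1': {'1': '+', '0': {'Bit2': {'1': '+', '0': '-'}}}}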


Characteristics of DTL


Overfitting


Decision Tree Pruning


Continuous-Valued Attributes


Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals

Example:  Suppose we have training examples with the following Temp values:
 
Temp         40    48    60    72    80    90
PlayTennis   No    No    Yes   Yes   Yes   No

Split into two intervals:  Temp < val  and  Temp > val

This defines a new boolean attribute Temp>val

How should the threshold val be chosen?

Consider the boundary cases, i.e., midpoints between adjacent examples whose classifications differ:  val = (48+60)/2 = 54  and  val = (80+90)/2 = 85

Possible attributes:  Temp>54 , Temp>85

Choose attribute with largest information gain (as before):  Temp>54

This approach can be generalized to multiple thresholds.
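A short Python sketch of this procedure on the Temp data above (threshold candidates are the midpoints between adjacent examples whose classifications differ; the variable names are illustrative):

    import math

    def entropy(labels):
        total = len(labels)
        return sum(-(labels.count(c) / total) * math.log2(labels.count(c) / total)
                   for c in set(labels))

    temps  = [40, 48, 60, 72, 80, 90]
    labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

    # Candidate thresholds: midpoints where the classification changes
    candidates = [(temps[i] + temps[i + 1]) / 2
                  for i in range(len(temps) - 1) if labels[i] != labels[i + 1]]

    for val in candidates:                              # 54.0 and 85.0
        below = [c for t, c in zip(temps, labels) if t < val]
        above = [c for t, c in zip(temps, labels) if t > val]
        g = (entropy(labels)
             - len(below) / len(labels) * entropy(below)
             - len(above) / len(labels) * entropy(above))
        print(f"Temp>{val}: gain = {g:.3f}")

    # Temp>54.0: gain = 0.459
    # Temp>85.0: gain = 0.191   -> Temp>54 is chosen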


Missing Attributes


Gain Ratio and Split Information



Reference: Machine Learning, Chapter 3, Tom M. Mitchell, McGraw-Hill, 1997.