Perceptron
Kristin P. Bennett, Donghui Wu
Department of Mathematical Sciences, Rensselaer Polytechnic Institute
bennek@rpi.edu, wud2@rpi.edu

Nello Cristianini
Dept of Engineering Mathematics, University of Bristol
nello.cristianini@bristol.ac.uk

John Shawe-Taylor
Dept of Computer Science, Royal Holloway, University of London
jst@dcs.rhbnc.ac.uk
Perceptron Decision Trees
Definition 0.1 Perceptron Decision Trees (PDT) are decision trees in which each internal node is associated with a hyperplane in general position in the input space, i.e. each decision is based on a linear combination of attributes instead of a single attribute.
[Figure: a two-class sample of x and o points partitioned by three hyperplanes w1, w2, w3, one at each decision node of the tree]
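To make Definition 0.1 concrete, here is a minimal Python sketch of how a PDT classifies a point (the class and field names are my own, not from the slides): each internal node stores a hyperplane $(w, b)$, and a point descends left or right according to the sign of $w^T x - b$.

```python
import numpy as np

class PDTNode:
    """One node of a perceptron decision tree (hypothetical structure)."""
    def __init__(self, w=None, b=0.0, left=None, right=None, label=None):
        self.w, self.b = w, b            # hyperplane w^T x - b = 0 (internal nodes)
        self.left, self.right = left, right
        self.label = label               # class label (leaf nodes only)

    def classify(self, x):
        # Leaves return their class label.
        if self.label is not None:
            return self.label
        # Internal nodes test a linear combination of all attributes,
        # not a single attribute as in an axis-parallel tree.
        if np.dot(self.w, x) - self.b >= 0:
            return self.right.classify(x)
        return self.left.classify(x)
```

A conventional axis-parallel decision tree is the special case in which each $w$ has a single nonzero component.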
Which Decision Is Better?
– Capacity Control of Linear Classifier
Construct a linear classifier $f(x) = w^T x - b$, i.e. find a vector $w \in \mathbb{R}^n$ and a scalar $b$ such that
$$f(x_i) = w^T x_i - b \ge 1 \quad \text{if } y_i = 1, \qquad f(x_i) = w^T x_i - b \le -1 \quad \text{if } y_i = -1, \qquad i = 1, \ldots, m.$$
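The "better" decision in the slides' sense is the one satisfying these constraints with the larger geometric margin $\gamma = 1/\|w\|$. As an illustration (the library choice is mine, not the slides'), a hard-margin SVM finds such a maximal-margin $(w, b)$ on linearly separable data; the sketch below approximates it with scikit-learn's SVC and a very large C.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels y in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM, i.e. the constraints
# w^T x_i - b >= +1 for y_i = +1 and w^T x_i - b <= -1 for y_i = -1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_[0]
b = -clf.intercept_[0]            # sklearn uses f(x) = w^T x + intercept

margin = 1.0 / np.linalg.norm(w)  # geometric margin gamma = 1 / ||w||
print(w, b, margin)
```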
Large Margin PDTs Generalize Better
Theorem 0.1 Suppose we are able to classify an m-sample of labeled examples using a perceptron decision tree, and suppose that the tree obtained contains K decision nodes with margins $\gamma_i$ at node $i$. Then with probability greater than $1 - \delta$ we can bound the generalization error to be less than
$$\frac{130 R^2}{m}\left( D \log(4em)\log(4m) + \log\frac{(4m)^{K+1}\,2^K}{(K+1)\,\delta} \right),$$
where $D = \sum_{i=1}^{K} \frac{1}{\gamma_i^2}$.
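Under my reading of the formula as reconstructed above (the constants are recovered from a garbled slide, so treat them as indicative), a small helper makes the trade-off visible: larger margins shrink $D = \sum_i 1/\gamma_i^2$ and hence the bound, while more decision nodes K inflate it. The function name is hypothetical.

```python
import math

def pdt_bound(margins, R, m, delta):
    """Evaluate the generalization bound above for given node margins."""
    K = len(margins)
    D = sum(1.0 / g**2 for g in margins)        # D = sum_i 1/gamma_i^2
    complexity = (D * math.log(4 * math.e * m) * math.log(4 * m)
                  + math.log((4 * m) ** (K + 1) * 2 ** K / ((K + 1) * delta)))
    return 130 * R**2 / m * complexity

# Larger margins at the same tree size give a smaller bound:
print(pdt_bound([0.5, 0.5, 0.5], R=1.0, m=10000, delta=0.05))
print(pdt_bound([0.1, 0.1, 0.1], R=1.0, m=10000, delta=0.05))
```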
The Baseline Algorithms for Comparison
Algorithm 0.1 (Basic OC1 Algorithm) Start with the root node; while a node remains to be split:
• Optimize the decision based on some splitting criterion.
– Randomized search for a local minimum.
– Re-starts to jump out of local minima.
• Partition the node into two or more child nodes based on the decision.
Prune the tree if necessary.
Splitting criterion: Twoing Rule (goodness measure)
$$TwoingValue = \frac{|T_L|}{n} \cdot \frac{|T_R|}{n} \left( \sum_{i=1}^{k} \left| \frac{|L_i|}{|T_L|} - \frac{|R_i|}{|T_R|} \right| \right)^2$$
Twoing Rule
Splitting criterion - Twoing Rule (Breiman et al., 1984):
$$TwoingValue = \frac{|T_L|}{n} \cdot \frac{|T_R|}{n} \left( \sum_{i=1}^{k} \left| \frac{|L_i|}{|T_L|} - \frac{|R_i|}{|T_R|} \right| \right)^2$$
where
n = |T_L| + |T_R| - total number of instances at the current node
k - number of classes (2 for two-class problems)
|T_L| - number of instances on the left of the split, i.e. $w^T x - b \ge 0$
|T_R| - number of instances on the right of the split, i.e. $w^T x - b < 0$
|L_i| - number of instances of class i on the left of the split
|R_i| - number of instances of class i on the right of the split
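With all the quantities defined, a minimal Python sketch of the rule may help (the function and variable names are mine):

```python
import numpy as np

def twoing_value(y_left, y_right, classes):
    """Twoing goodness measure of a split, per the formula above:
    (|TL|/n)(|TR|/n) * (sum_i | |Li|/|TL| - |Ri|/|TR| |)^2."""
    TL, TR = len(y_left), len(y_right)
    n = TL + TR
    if TL == 0 or TR == 0:          # degenerate split: nothing separated
        return 0.0
    s = sum(abs(np.sum(y_left == c) / TL - np.sum(y_right == c) / TR)
            for c in classes)
    return (TL / n) * (TR / n) * s ** 2

# Instances with w^T x - b >= 0 go left, the rest right (as defined above).
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0
left = X @ w - b >= 0
print(twoing_value(y[left], y[~left], classes=[-1, 1]))  # 1.0: a pure split
```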
Enlarging the Margins of PDT
Algorithms for producing large margin PDTs:
• Post-process existing trees (FAT): find the optimal separating hyperplane at each node of the existing tree.
• Incorporate the margin into the splitting criterion (MOC1): maximize TwoingValue + C * CurrentMargin (see the sketch after this list).
• Incorporate the margin into the goodness measure (MOC2): modified Twoing Rule.
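A minimal sketch of the MOC1-style criterion, combining the twoing_value function sketched earlier with the current margin of a candidate hyperplane (the trade-off constant C and the margin estimate are my placeholders, not the slides' exact definitions):

```python
import numpy as np

def moc1_score(X_node, y_node, w, b, classes, C=0.1):
    """MOC1-style criterion: TwoingValue + C * CurrentMargin.
    Uses twoing_value from the Twoing Rule sketch above."""
    left = X_node @ w - b >= 0
    # Current margin: smallest distance from the hyperplane to any
    # instance at this node, |w^T x - b| / ||w||.
    margin = np.min(np.abs(X_node @ w - b)) / np.linalg.norm(w)
    return twoing_value(y_node[left], y_node[~left], classes) + C * margin
```

The randomized OC1 search can maximize this score in place of the plain twoing value, trading a little split purity for a larger margin.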
The OC1-PDT and FAT-PDT
IDEA: Post-process the existing OC1-PDT, replacing the OC1 decision with the optimal separating hyperplane at each decision node.
[Figure: a two-class sample of x and o points, showing the original OC1 hyperplane and the maximal-margin FAT hyperplane that induces the same split]
The FAT Algorithm
1. Construct a decision tree using OC1; call it OC1-PDT.
2. Starting from the root of OC1-PDT, traverse all the non-leaf nodes. At each node:
• Relabel the points at the node with $\omega^T x - b \ge 0$ as superclass right, and the other points at this node as superclass left.
• Find the perceptron (optimal separating hyperplane) $f(x) = \omega^{*T} x - b^*$ which separates superclasses right and left perfectly with maximal margin.
• Replace the original perceptron with the new one (see the sketch below).
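A minimal sketch of the per-node step, reusing the hard-margin SVC idea from earlier (the function name and library choice are mine):

```python
import numpy as np
from sklearn.svm import SVC

def fatten_node(X_node, w, b):
    """One FAT step: keep the partition induced by the OC1 hyperplane,
    but realize it with a maximal-margin hyperplane."""
    # Superclass labels: +1 = right (w^T x - b >= 0), -1 = left.
    superclass = np.where(X_node @ w - b >= 0, 1, -1)
    if len(np.unique(superclass)) < 2:      # nothing to separate
        return w, b
    # The superclasses are linearly separable by construction, so a
    # hard-margin SVM (approximated by a large C) finds the optimal
    # separating hyperplane between them.
    clf = SVC(kernel="linear", C=1e6).fit(X_node, superclass)
    return clf.coef_[0], -clf.intercept_[0]  # (w*, b*) replaces (w, b)
```

Because each node keeps the partition of the training points it had under OC1, the tree's training predictions are unchanged; only the margins grow.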
FAT Generalizes Better Than OC1
10-fold cross validation results: FAT vs OC1
[Scatter plot: FAT 10-CV average accuracy (y-axis) vs. OC1 10-CV average accuracy (x-axis), both ranging 65-100, with an x = y reference line; statistically significant differences are marked]