Vowpal Wabbit: A Machine Learning System
John Langford
Microsoft Research
http://hunch.net/~vw/
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Why does Vowpal Wabbit exist?
1. Prove research.
2. Curiosity.
3. Perfectionist.
4. Solve problem better.
A user base becomes addictive
1. Mailing list of >400
2. The official strawman for large scale logistic regression @ NIPS :-)
3.
An example
wget http://hunch.net/~jl/VW_raw.tar.gz
vw -c rcv1.train.raw.txt -b 22 --ngram 2 --skips 4 -l 0.25 --binary
provides stellar performance in 12 seconds.
Surface details
1. BSD license, automated test suite, github repository.
2. VW supports all I/O modes: executable, library, port, daemon, service (see next).
3. VW has a reasonable++ input format: sparse, dense, namespaces, etc...
4. Mostly C++, but bindings in other languages of varying maturity (python good).
5. A substantial user base + developer base. Thanks to many who have helped.
VW service
http://tinyurl.com/vw-azureml
Problem: How to deploy model for large scale use?
Solution:
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems?
4. solve interactive problems?
Using all the data: Step 1
Small RAM + lots of data ⇒ Online Learning
Active research area; 4-5 papers related to online learning algorithms in VW.
Using all the data: Step 2
1. 3.2 × 10^6 labeled emails.
2. 433167 users.
3. ~40 × 10^6 unique tokens.
How do we construct a spam filter which is personalized, yet uses global information?
Bad answer: Construct a global filter + 433167 personalized filters using a conventional hashmap to specify features. This might require 433167 × 40 × 10^6 × 4 ~ 70 Terabytes of RAM.
Using Hashing
Use hashing to predict according to: ⟨w, φ(x)⟩ + ⟨w, φ_u(x)⟩
A text document (email) x is tokenized and duplicated into a bag of words x_l:
NEU, Votre, Apotheke, en, ligne, Euro, ...
plus user-specific copies:
USER123_NEU, USER123_Votre, USER123_Apotheke, USER123_en, USER123_ligne, USER123_Euro, ...
A hash function h then maps the tokens into a sparse vector x_h (e.g. indices 323, 5235, 123, 626, 232, ...), and classification is by w^T x_h.
(in VW: specify the userid as a feature and use -q)
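The duplicate-and-hash scheme above can be sketched in a few lines of Python. The hash function (md5), bit width, and userid prefix here are illustrative stand-ins, not VW's internals:

```python
import hashlib

def hashed_features(tokens, userid, b=26):
    """Map each token and its user-duplicated copy into a 2^b-dim sparse vector."""
    dim = 1 << b
    x = {}
    for tok in tokens:
        for t in (tok, f"{userid}_{tok}"):  # global copy + personalized copy
            idx = int(hashlib.md5(t.encode()).hexdigest(), 16) % dim
            x[idx] = x.get(idx, 0.0) + 1.0  # collisions simply add
    return x

def predict(w, x):
    """Linear score w . x over the sparse representation."""
    return sum(w.get(i, 0.0) * v for i, v in x.items())
```

A single weight vector over the 2^b hashed indices then serves both the global filter and all personalized filters at once.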
Results
[Bar chart: spam misclassification rate relative to a baseline as a function of the number of hash bits b (18 to 26), for global-hashed and personalized variants.]
2^26 parameters = 64M parameters = 256MB of RAM.
A 270Kx savings in RAM requirements.
Applying for a fellowship in 1997
Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!
The worst part: he had a point.
Using all the data: Step 3
Given 2.1 Terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i?
17B Examples, 16M parameters, 1K nodes.
How long does it take?
70 minutes = 500M features/second: faster than the IO bandwidth of a single machine ⇒ faster than all possible single machine linear learning algorithms.
MPI-style AllReduce
Allreduce initial state: seven nodes hold the values 7, 5, 6, 1, 2, 3, 4.
Create a binary tree over the nodes.
Reducing, step 1: leaves send values up; each internal node holds a partial sum (8 and 13).
Reducing, step 2: the root holds the total, 28.
Broadcast: the total is passed back down the tree until every node holds 28.
AllReduce = Reduce + Broadcast
Properties:
1. How long does it take? O(1) time(*)
2. How much bandwidth? O(1) bits(*)
3. How hard to program? Very easy
(*) When done right.
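The reduce-then-broadcast pattern can be simulated in a few lines of Python. The tree layout and node values below reproduce the slide's example; a real implementation moves these messages over sockets:

```python
def root_of(tree):
    """The one node that is nobody's child."""
    children = {c for kids in tree.values() for c in kids}
    return next(n for n in tree if n not in children)

def allreduce_sum(tree, values):
    """tree: dict node -> list of children; values: dict node -> number.
    Reduce partial sums up the binary tree, then broadcast the total down,
    so every node ends up holding the global sum."""
    def reduce_up(node):
        return values[node] + sum(reduce_up(c) for c in tree.get(node, []))
    total = reduce_up(root_of(tree))
    return {node: total for node in values}  # broadcast step
```

With the slide's tree (root 7 over internal nodes 5 and 6, leaves 1-4), every node ends with 28.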
An Example Algorithm: Weight averaging
n = AllReduce(1)
While (pass number < max)
  1. While (examples left)
     1.1 Do online update.
  2. AllReduce(weights)
  3. For each weight: w ← w/n
Code tour
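The averaging loop above can be sketched with simulated workers. The squared-loss SGD update, learning rate, and data layout are illustrative choices; the structure (local online passes, allreduce of weights, divide by n) follows the pseudocode:

```python
def average_weights_training(shards, passes=2, lr=0.1, dim=3):
    """Each shard does online SGD updates locally; after each pass the
    weights are summed (the AllReduce) and divided by the node count n."""
    n = len(shards)                       # n = AllReduce(1)
    w = [[0.0] * dim for _ in range(n)]   # per-node weight vectors
    for _ in range(passes):
        for k, shard in enumerate(shards):
            for x, y in shard:            # online update: squared-loss gradient
                pred = sum(wi * xi for wi, xi in zip(w[k], x))
                w[k] = [wi + lr * (y - pred) * xi for wi, xi in zip(w[k], x)]
        summed = [sum(w[k][j] for k in range(n)) for j in range(dim)]  # AllReduce(weights)
        w = [[s / n for s in summed] for _ in range(n)]                # w <- w/n
    return w[0]
```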
What is Hadoop AllReduce?
[Diagram: the program is moved to the nodes holding the data.]
1. Map job moves program to data.
2. Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing allreduce.
3. Speculative execution: In a busy cluster, one node is often slow. Use speculative execution to start additional mappers.
Robustness & Speedup
[Plot: speedup per method (Average_10, Min_10, Max_10) vs. number of nodes from 10 to 100, against the ideal linear speedup.]
Splice Site Recognition
[Plot: auPRC (0.2 to 0.55) vs. iteration (0 to 50) for Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS.]
Splice Site Recognition
[Plot: auPRC (0.1 to 0.6) vs. effective number of passes over data (0 to 20) for L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
Applying Machine Learning in Practice
1. Ignore mismatch. Often faster.
2. Understand problem and find more suitable tool. Often better.
Importance-Weighted Classification
- Given training data {(x_1, y_1, c_1), ..., (x_n, y_n, c_n)}, produce a classifier h : X → {0, 1}.
- Unknown underlying distribution D over X × {0, 1} × [0, ∞).
- Find h with small expected cost:
  ℓ(h, D) = E_{(x,y,c)~D} [ c · 1(h(x) ≠ y) ]
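The expected cost above has a direct empirical counterpart, which is all the training objective is; a minimal sketch:

```python
def iw_cost(h, data):
    """Empirical importance-weighted cost: the average of c * 1(h(x) != y)
    over examples (x, y, c), mirroring the expectation in the definition."""
    return sum(c * (1 if h(x) != y else 0) for x, y, c in data) / len(data)
```

For instance, with ham examples carrying cost 10 and spam cost 1, a classifier that marks everything spam pays heavily for each misclassified ham.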
Where does this come up?
1. Spam Prediction (Ham predicted as Spam much
worse than Spam predicted as Ham.)
2. Distribution Shifts (Optimize search engine
results for monetizing queries.)
3. Boosting (Reweight problem examples for
residual learning.)
4. Large Scale Learning (Downsample common
class and importance weight to compensate.)
Multiclass Classification
Distribution D over X × Y, where Y = {1, ..., k}.
Find a classifier h : X → Y minimizing the multi-class loss on D:
ℓ_k(h, D) = Pr_{(x,y)~D} [ h(x) ≠ y ]
1. Categorization: Which of k things is it?
2. Actions: Which of k choices should be made?
Use in VW
Multiclass label format: Label [Importance] ['Tag]
Methods:
oaa k: the baseline one-against-all prediction. O(k) time.
ect k: error correcting tournament. O(log(k)) time.
log_multi n: adaptive log time. O(log(n)) time.
One-Against-All (OAA)
Create k binary problems, one per class. For class i predict "Is the label i or not?"
(x, y) ↦ { (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k)) }
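The mapping above is mechanical enough to sketch directly. The learner interface here (per-class scorer functions) is an illustrative stand-in for VW's binary base learner:

```python
def oaa_train_data(examples, k):
    """Map each multiclass example (x, y) to k binary examples:
    problem i asks 'is the label i?'."""
    problems = {i: [] for i in range(1, k + 1)}
    for x, y in examples:
        for i in range(1, k + 1):
            problems[i].append((x, 1 if y == i else 0))
    return problems

def oaa_predict(scorers, x):
    """Predict the class whose binary scorer is most confident on x."""
    return max(scorers, key=lambda i: scorers[i](x))
```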
The inconsistency problem
Given an optimal binary classifier, one-against-all doesn't produce an optimal multiclass classifier.
Example: Prob(label|features) = (1/2 − δ, 1/4 + δ/2, 1/4 + δ/2). In each binary problem (1v23, 2v13, 3v12) the true-label probability is below 1/2, so each optimal binary predictor answers "not my class" and no class is selected.
Solution: always use one-against-all regression.
Cost-sensitive Multiclass
Distribution D over X × [0, 1]^k, where a vector in [0, 1]^k specifies the cost of each of the k choices.
Find a classifier h : X → {1, ..., k} minimizing the expected cost:
cost(c, D) = E_{(x,c)~D} [ c_{h(x)} ]
1. Is this packet {normal, error, attack}?
2. A subroutine used later...
Use in VW
Label information via sparse vector.
A test example:
|Namespace Feature
A test example with only classes 1,2,4 valid:
1: 2: 4: |Namespace Feature
A training example with only classes 1,2,4 valid:
1:0.4 2:3.1 4:2.2 |Namespace Feature
Methods:
csoaa k: cost-sensitive OAA prediction. O(k) time.
csoaa_ldf: label-dependent features OAA.
wap_ldf: LDF weighted-all-pairs.
Code Tour
Let's take a break
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
Structured Prediction
= joint prediction with joint loss
Example: Machine Translation
Simple Example: Part of Speech Tagging
Pierre      Vinken      ,      61      years   old
Proper N.   Proper N.   Comma  Number  Noun    Adj.
How can you best do structured prediction?
We care about:
1. Programming complexity. Most structured predictions are not addressed with structured learning algorithms, because it is too complex to do so.
2. Prediction accuracy. It had better work well.
3. Train speed. Debug/development productivity + maximum data input.
4. Test speed. Application efficiency.
A program complexity comparison
[Bar chart, log scale from 1 to 1000: lines of code for POS tagging in CRFSGD, CRF++, S-SVM, and Search.]
Accuracy (per tag)
[Plot: part of speech tagging accuracy (tuned hps) vs. training time (minutes, log scale, 1m to 1h+) for VW Search, VW Search (own fts), VW Classification, CRF SGD, CRF++, Str. Perceptron, Structured SVM, and Str. SVM (DEMI-DCD); accuracies range from 90.7 to 96.6.]
Prediction (test-time) Speed
[Bar chart: thousands of tokens per second on POS and NER for VW Search, VW Search (own fts), CRF SGD, CRF++, Str. Perceptron, Structured SVM, and Str. SVM (DEMI-DCD); speeds range from 5.3 to 285.]
How do you program?
Sequential_RUN(examples)
  for i = 1 to len(examples) do
    prediction ← predict(examples[i], examples[i].label)
    if output.good then
      output ' ' prediction
    end if
  end for
In essence, write the decoder, providing a little bit of side information for training.
Seq_Detection(examples, false_negative_loss)
  Let max_label = 1, max_prediction = 1
  for i = 1 to len(examples) do
    max_label ← max(max_label, examples[i].label)
  end for
  for i = 1 to len(examples) do
    max_prediction ← max(max_prediction, predict(examples[i], examples[i].label))
  end for
  if max_label > max_prediction then
    loss(false_negative_loss)
  else if max_label < max_prediction then
    loss(1)
  else
    loss(0)
  end if
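The pseudocode above translates almost line-for-line into runnable Python. Here predict is passed in as a stub standing in for the learned predictor, and the function returns the loss instead of reporting it:

```python
def seq_detection(examples, false_negative_loss, predict):
    """examples: list of (features, label) pairs.
    predict(features, label) stands in for the learned predictor."""
    max_label = max([1] + [label for _, label in examples])
    max_prediction = max([1] + [predict(x, label) for x, label in examples])
    if max_label > max_prediction:
        return false_negative_loss   # missed the highest label entirely
    elif max_label < max_prediction:
        return 1                     # over-predicted: false positive
    else:
        return 0                     # matched
```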
How does it work?
An Application of Learning to Search algorithms
(e.g. Searn/DAgger).
Decoder run many times at train time to optimize
predict(...) for loss(...).
A Search Space
[Diagram: a tree of states branching from a Start State down to End States; each transition is a Predict call and each End State carries a Loss.]
Learning to Search
[Diagram: from the Start State, roll out to an End State, collapse the remaining alternatives, and charge the rollout's Loss.]
TDOLR program equivalence
Theorem: Every algorithm which:
1. Always terminates.
2. Takes as input relevant feature information X.
3. Makes 0+ calls to predict.
4. Reports loss on termination.
defines a search space, and such an algorithm exists for every search space.
An outline
1. How?
1.1 Programming
1.2 Learning to Search
1.3 Equivalence
2. Other Results
Named Entity Recognition
Is this word part of an organization, person, or not?
[Plot: named entity recognition F-score (per entity, tuned hps) vs. training time from 10s to 10m; F-scores range from 73.3 to 80.0 across the compared systems.]
Entity Relation
Goal: find the Entities and then find their Relations
Method           Entity F1   Relation F1   Train Time
Structured SVM   88.00       50.04         300 seconds
L2S              92.51       52.03         13 seconds
Requires about 100 LOC.
Dependency Parsing
Goal: Find the dependency structure of words in a sentence.
Method                   Dependency accuracy
Redshift (single beam)   89.25
L2S                      90.27
Redshift (beam search)   91.32
Requires about 450 LOC.
A demonstration
wget http://bilbo.cs.uiuc.edu/~kchang10/tmp/wsj.vw.zip
vw -b 24 -d wsj.train.vw -c --search_task sequence --search 45
  --search_alpha 1e-8 --search_neighbor_features -1:w,1:w
  --affix -1w,+1w -f foo.reg; vw -t -i foo.reg wsj.test.vw
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
Examples of Interactive Learning
Repeatedly:
1. A user comes to Microsoft (with history of previous
visits, IP address, data related to an account)
2. Microsoft chooses information to present (urls, ads, news
stories)
3. The user reacts to the presented information (clicks on
something, clicks, comes back and clicks again,...)
Microsoft wants to interactively choose content and use the user feedback to improve future content choices.
Another Example: Clinical Decision Making
Repeatedly:
1. A patient comes to a doctor
with symptoms, medical history,
test results
2. The doctor chooses a treatment
3. The patient responds to it
The doctor wants a policy for
choosing targeted treatments for
individual patients.
The Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward r_a ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
The Direct method
Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).
Example: Deployed policy always takes a1 on x1 and a2 on x2.
Observed rewards:
        a1    a2
x1      .8    ?
x2      ?     .2
The learned predictor fills in the unobserved entries, and new observations refine the estimates. Comparing against the truth:
Observed/Estimated/True:
        a1          a2
x1      .8/.8/.8    ?/.514/1
x2      .3/.3/.3    .2/.014/.2
Basic observation 1: Generalization insufficient.
Basic observation 2: Exploration required.
Basic observation 3: Errors ≠ exploration.
The Evaluation Problem
Let π : X → A be a policy mapping features to actions. How do we evaluate it?
Method 1: Deploy algorithm in the world. Very Expensive!
The Importance Weighting Trick
Let π : X → A be a policy mapping features to actions. How do we evaluate it?
One answer: Collect T exploration samples (x, a, r_a, p_a), where
  x = context
  a = action
  r_a = reward for action
  p_a = probability of action a
then evaluate:
  Value(π) = Average( r_a · 1(π(x) = a) / p_a )
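The estimator is one line of code; the policy and sample format below are illustrative:

```python
def ips_value(policy, samples):
    """Inverse-propensity estimate of a policy's value from exploration
    samples (x, a, r_a, p_a): average of r_a * 1(policy(x) = a) / p_a."""
    return sum(r * (1 if policy(x) == a else 0) / p
               for x, a, r, p in samples) / len(samples)
```

Unbiasedness can be checked by weighting each possible logged sample by its logging probability: with rewards (0.5, 1) and probabilities (1/4, 3/4), the expected estimate is 0.5 for a policy that always picks action 1, and 1.0 for one that always picks action 2.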
The Importance Weighting Trick
Theorem: For all policies π, for all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:
  E_{(x, r⃗)~D} [ r_{π(x)} ] = E[ Value(π) ]
Proof: E_{a~p} [ r_a 1(π(x) = a) / p_a ] = Σ_a p_a · r_a 1(π(x) = a) / p_a = r_{π(x)}
Example:
  Action        1       2
  Reward        0.5     1
  Probability   1/4     3/4
  Estimate      2 | 0   0 | 4/3
(If action 1 is sampled, the estimate is 2 for a policy choosing action 1 and 0 for one choosing action 2; if action 2 is sampled, it is 0 and 4/3 respectively.)
How do you test things?
Use format: action:cost:probability | features
Example:
1:1:0.5 | tuesday year million short compan vehicl line stat nanc commit exchang plan corp subsid credit issu debt pay gold bureau prelimin ren billion telephon time draw basic relat le spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip
...
How do you train?
Reduce to cost-sensitive classification.
vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25
Progressive 0/1 loss: 0.04582
vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.05065
vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.04679
Reminder: Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward r_a ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
What does learning mean? Efficiently competing with some large reference class of policies Π = {π : X → A}:
Regret = max_{π∈Π} average_t ( r_{π(x)} − r_a )
Building an Algorithm
Let Q_1 = uniform distribution. For t = 1, ..., T:
1. The world produces some context x ∈ X
2. Draw π ~ Q_t
3. The learner chooses an action a ∈ A using π(x).
4. The world reacts with reward r_a ∈ [0, 1]
5. Update Q_{t+1}
What is a good Q_t?
- Exploration: Q_t allows discovery of good policies
- Exploitation: Q_t small on bad policies
How do you find Q_t?
By reduction ... [details complex, but coded]
losses on CCAT RCV1 problem
loss
0.12
0.1
0.08
0.06
0.04
0.02
0
C
B*
C
er
ov
nU
Li
t
rs
fi
u-
ta
dy
ee
gr
sep
How long does this take?
[Bar chart, log scale from 10 to 10^6 seconds: running times on the CCAT RCV1 problem for epsilon-greedy, tau-first, LinUCB, Cover, and CB variants.]
Further reading
VW wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki
Search: NYU large scale learning class
NIPS tutorial: http://hunch.net/~jl/interact.pdf
Bibliography
Release: Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
L2Search: H. Daume, J. Langford, and S. Ross, Efficient programmable learning to search, http://arxiv.org/abs/1406.1837
Explore: A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, http://arxiv.org/abs/1402.0555
Terascale: A. Agarwal, O. Chapelle, M. Dudik, J. Langford, A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198

Bibliography: Parallel other
Average: G. Mann et al., Efficient large-scale distributed training of conditional maximum entropy models, NIPS 2009.
Ov. Av.: M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.
D. Mini: O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, http://arxiv.org/abs/1012.1367