slides - Applied Machine Learning Reading Group, MIT Media Lab

Transcription

Machine listening
and reading at scale
Brian Whitman
co-Founder & CTO, The Echo Nest
@bwhitman / [email protected]
Hello! I “teach computers to listen to music”
Columbia University, NYC
MIT Media Lab
finishing my dissertation
Elias Pampalk
[Figure: "Islands of Music" — Retrieving Music by Rhythmic Similarity. A self-organizing map arranges songs (identified by short names like "dancingqueen", "californication", "macarena") into islands of rhythmically similar pieces, using squared Euclidean distance over tempo features; bpm labels from 110 to 130 mark the map. Caption: "Fig. 2. Islands of Music representing a 7x7 SOM with 77 pieces of music. The artists and full titles of the pieces, which are represented by the short identifiers here, can be found on the web page or in the thesis."]

From the thesis conclusion: "The Islands of Music have not yet reached a level which would suggest their commercial usage; however, they demonstrate the possibility of such systems and serve well as a tool for further research. Any part of the system can be modified or replaced and the resulting effects can easily be evaluated using the graphical user interface. Furthermore, the feature extraction technique and the visualization technique developed for this thesis can both be applied separately to a broad range of applications."
Music retrieval and “music intelligence” applications
➔ Hit prediction ("creepy")
➔ Taste / recommendation
➔ Automated composition
➔ Classification, tagging
➔ Feature extraction (key, tempo)
➔ Search & retrieval
➔ Review regression
[Scatter plots: audio-derived ratings against AMG ratings and against Pitchfork ratings, with a randomly-selected-AMG-ratings baseline for comparison; the reported values are .147 [.080] for AMG and .127 [.082] for Pitchfork.]
Set A
Britney Spears
Backstreet Boys
Christina Aguilera
Set B
Alice in Chains
Korn
Faith no More
Set C
Chris Isaak
Bob Dylan
Crowded House
The Echo Nest 2007
Somerville
2 people
2 computers
Lots of ideas
1,000,000 documents
10,000 artists
100,000 songs
The Echo Nest 2011
Somerville, NY, SF, LDN
35 people
300 computers
150m people / month
5,000,000,000 documents
2,000,000 artists
35,000,000 songs
Audio features:
➔ musical features, Jehan ’05++
➔ publicly available, ~5s / song on cloud machines
[Figure: the analysis pipeline — auditory spectrogram → segmentation (beat markers, segments over ~2 sec.) → per-segment pitch features (chroma, C through B), timbre features, and a tempogram / tempo spectrum with candidate tempos around 60–240 bpm.]
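The pitch features in the pipeline above are chroma vectors: spectrogram energy folded into 12 pitch classes. A minimal sketch of that fold, assuming a magnitude spectrogram and per-bin center frequencies (the function name and the max-normalization are illustrative, not the production analyzer):

```python
import numpy as np

def chroma_from_spectrogram(mags, freqs):
    """Fold spectrogram bin magnitudes into 12 pitch classes (0 = C .. 11 = B).

    mags:  (n_bins, n_frames) magnitude spectrogram
    freqs: (n_bins,) center frequency of each bin in Hz
    """
    chroma = np.zeros((12, mags.shape[1]))
    for i, f in enumerate(freqs):
        if f < 20:  # skip sub-audible bins
            continue
        # MIDI note number -> pitch class (MIDI 60 = C4 -> class 0)
        midi = 69 + 12 * np.log2(f / 440.0)
        chroma[int(round(midi)) % 12] += mags[i]
    # normalize each frame to unit max so loudness doesn't dominate
    peaks = chroma.max(axis=0, keepdims=True)
    return chroma / np.maximum(peaks, 1e-9)
```

A pure 440 Hz tone, for example, lands entirely in pitch class 9 (A).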
Text features:
➔ “community metadata” Whitman ’02++
➔ web crawls, blogs, reviews, news: “top terms”
➔ constantly updated, >1m pages/day
n2 Term          Score    np Term          Score    adj Term       Score
dancing queen    0.0707   dancing queen    0.0875   perky          0.8157
mamma mia        0.0622   mamma mia        0.0553   nonviolent     0.7178
disco era        0.0346   benny            0.0399   swedish        0.2991
winner takes     0.0307   chess            0.0390   international  0.2010
chance on        0.0297   its chorus       0.0389   inner          0.1776
swedish pop      0.0296   vous             0.0382   consistent     0.1508
my my            0.0290   the invitations  0.0377   bitter         0.0871
s enduring       0.0287   voulez           0.0377   classified     0.0735
and gimme        0.0280   something's      0.0374   junior         0.0664
enduring appeal  0.0280   priscilla        0.0369   produced       0.0616

Table 4.2: Top 10 terms of various types for ABBA. The score is TF-IDF for adj (adjective), and gaussian-weighted TF-IDF for term types n2 (bigrams) and np (noun phrases). Parsing artifacts are left alone.
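The community-metadata terms above are scored by TF-IDF against a background corpus. A toy unigram-only sketch of that scoring (the real system also extracts bigrams and noun phrases, and gaussian-weights some term types; names here are illustrative):

```python
import math
import re
from collections import Counter

def top_terms(artist_docs, background_docs, k=5):
    """Rank an artist's salient terms by TF-IDF against a background corpus.

    artist_docs:     list of crawled-text strings about one artist
    background_docs: list of strings for the whole corpus (IDF source)
    """
    tokenize = lambda s: re.findall(r"[a-z']+", s.lower())
    # document frequency over the background corpus
    n_docs = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(tokenize(doc)))
    # term frequency over the artist's documents
    tf = Counter()
    for doc in artist_docs:
        tf.update(tokenize(doc))
    total = sum(tf.values()) or 1
    scored = {
        t: (n / total) * math.log(n_docs / (1 + df[t]))
        for t, n in tf.items()
    }
    return sorted(scored.items(), key=lambda kv: -kv[1])[:k]
```

Terms common across the whole corpus (like "disco" in a pop corpus) score near zero, while terms concentrated on one artist float to the top.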
Metadata features:
➔ Explicit categories, tags, similars from partners
➔ Physical sales data
➔ Other ontologies
Per capita sales for BON JOVI
Per capita sales for YOUNG JEEZY
Per capita sales for JUSTIN BIEBER
Per capita sales for NEON TREES
Biggest database of artist data ever:
➔ billions of datapoints about artists
➔ web crawls: {terms: [(funky, 0.34), (jazz, 0.95)]}
➔ metadata: {style:jazz, location:Boston}
➔ user data: {id:LIX2923ND, likes:[Phonat, Apparat]}
➔ analytics: {hotttnesss: 0.45, chart_position... etc}
Normalize everything to a “term” in a big index:
➔ words get indexed w/ their probabilities as TF
➔ artist names get resolved to IDs in a graph
➔ numerical data goes out to a tree or clustered
➔ optimizes for fast query, distributed is easy
➔ “boosts” set in real time per app or trained
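One way the "everything is a term" normalization could look, as a toy sketch — `TermIndex`, the bucketing of numeric fields, and the query-time boost mechanism are all illustrative stand-ins, not the production index:

```python
# Toy "everything is a term" index: words carry probabilities as term
# weights, numeric fields are bucketed into terms, and per-app "boosts"
# scale each term's contribution at query time.
class TermIndex:
    def __init__(self):
        self.postings = {}  # term -> {artist_id: weight}

    def add_term(self, artist_id, term, weight=1.0):
        self.postings.setdefault(term, {})[artist_id] = weight

    def add_numeric(self, artist_id, field, value, bucket):
        # e.g. tempo 127.3 with bucket 10 -> term "tempo:120"
        self.add_term(artist_id, f"{field}:{int(value // bucket) * bucket}")

    def query(self, terms, boosts=None):
        boosts = boosts or {}
        scores = {}
        for t in terms:
            for aid, w in self.postings.get(t, {}).items():
                scores[aid] = scores.get(aid, 0.0) + w * boosts.get(t, 1.0)
        return sorted(scores.items(), key=lambda kv: -kv[1])
```

Because every field becomes a weighted term in one inverted index, the query path is uniform and easy to distribute across shards.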
“Song to song” similarity
➔ Catalogs at >10m tracks each (multiple clients!)
➔ Existing cultural metadata coverage 80% ish
➔ “Happy birthday to <your name>” x 1,200
➔ Turn around live API in < week
➔ Bottlenecks all in infrastructure: disk speed, data
storage, DB access / inserts, queues
Merge what we know
➔ If we have text data, filter on artist similars
➔ Sort filtered X,000 tracks in memory, cache
➔ Acoustic features use statistics of EN Analyze
➔ QA process (interns + spot checks)
➔ Evaluation is customer feedback
Hardest problem?
➔ Evaluating distance metrics?
➔ Selecting audio features?
➔ Sharding / replicating key-value stores?
➔ Tuning, QA?
➔ Matching artist and track titles
Echoprint & ENMFP
➔ Both massive LSH problems
➔ Different audio features >> n-bit codes >> DB
ENMFP
[Figure: the same analysis pipeline — auditory spectrogram → segmentation → pitch (chroma) features over ~2 sec., plus timbre features per segment.]
Chroma → VQ → a 10-bit # (0–1023) per segment; 3 in a row → one 30-bit # per segment.
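The "10-bit # ... 3 in a row ... one 30-bit # per segment" step is just bit-packing. A sketch, assuming one VQ codeword per segment (function names are illustrative):

```python
def pack_code(vq_indices):
    """Pack three consecutive 10-bit VQ codewords (0-1023) into one
    30-bit code, as the ENMFP slide describes."""
    a, b, c = vq_indices
    assert all(0 <= v < 1024 for v in (a, b, c))
    return (a << 20) | (b << 10) | c

def codes_for_track(vq_sequence):
    """'3 in a row -> 1 30-bit # per seg': one packed code per
    overlapping window of three segment codewords."""
    return [pack_code(vq_sequence[i:i + 3])
            for i in range(len(vq_sequence) - 2)]
```

The overlapping windows mean a single segment error only corrupts three codes, leaving the rest of the track's codes intact for matching.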
Echoprint
Whitened 8-band subband analysis
Onset detection
Each onset (time + magnitudes) hashed to a 20-bit key
Server
All codes go into a big inverted index:
  123984 → TR5904 50ms, TR1283 1840ms, TR1293 680ms
  6940   → TR1283 120ms, TR7348 860ms, TR5909 650ms
  21823  → ...
A query pulls out the top-ranking TRs that share the most codes with the query, and does this fast. Time offsets are then used to rank "close matches" (almost all of them).
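A toy sketch of that server-side lookup — overlap counting, then a time-offset histogram to score the best candidates. The names and the 100ms offset bucket are illustrative assumptions, not the production server:

```python
from collections import Counter, defaultdict

# inverted index: code -> list of (track_id, time_ms) postings
index = defaultdict(list)

def add_track(track_id, codes):
    for code, t_ms in codes:
        index[code].append((track_id, t_ms))

def match(query_codes, top_k=5):
    """Rank tracks by overlapping codes, then score the top candidates
    by how consistently their time offsets line up with the query."""
    overlap = Counter()
    for code, _ in query_codes:
        for track_id, _ in index[code]:
            overlap[track_id] += 1
    results = []
    for track_id, _ in overlap.most_common(top_k):
        # histogram of (reference time - query time) differences: a true
        # match piles up in one bucket even if the query starts mid-song
        offsets = Counter()
        for code, q_t in query_codes:
            for tid, r_t in index[code]:
                if tid == track_id:
                    offsets[(r_t - q_t) // 100] += 1  # 100ms buckets
        results.append((track_id, max(offsets.values())))
    return sorted(results, key=lambda kv: -kv[1])
```

Note the offset histogram is what makes a query taken from the middle of a song still match: all its offsets differ from the reference by the same constant.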
Evaluation
[Partially cut-off excerpt from the fingerprint evaluation write-up, "Query codes to songs": the reference database holds known tracks, each with metadata (artist, album, track name); evaluation involves taking an unknown audio query and finding the corresponding track. Each track is split into 60-second sections overlapping by 30 seconds, to avoid bias introduced when longer songs dominate. The codes for each segment are stored in an inverted index (an Apache-based data store's query handler provides fast lookup of document IDs), with the track ID plus segment number as the document ID. A typical query has about 800 hash keys; in practice there is rarely one document with significantly more matches than all others, but the top matches contain the actual match if it exists. A histogram of time-offset differences per candidate in the result set informs the "true score," which checks that the codes occur in order even if the query Q is taken from a different section of the song.]
100,000 tracks alongside a set of 2,000 tracks that are known to be not in the set of 100,000. We compute queries on audio pulled from selections of the audio files — for example, 30 seconds from the middle, downsampled to 96kHz, decoded using various MP3 decoders, or with their volume adjusted. This lets us compute metrics such as the number of false positives fp (the wrong song was identified), false negatives fn (the query was not matched to the correct track), false accepts fa (a known non-database query was matched to a track), true positives tp, and true negatives tn. Using these measures we compute a probability of error P(E) that weights the size of the database (100,000 as D) and the size of the known non-database tracks (2,000 as N) with the false reject rate Rr and the false accept rate Ra:
Our fingerprint databases now have >75m tracks and serve hundreds of millions of queries a day.
[Cut-off excerpt continues: documents are ordered by their true score; if multiple documents from the same track are in the list, the one with the highest score is kept. The top result is returned as a positive match if its score is significantly higher than the scores of all others in the result list.]

P(E) = (D / (D + N)) · Rr + (N / (D + N)) · Ra    (1)

where

Rr = (fp + fn) / (tp + fp + fn)    (2)

Ra = fa / (fa + tn)    (3)
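The P(E) metric is straightforward to compute from the raw evaluation counts; a sketch (the example counts in the usage note are made up, not published numbers):

```python
def probability_of_error(fp, fn, fa, tp, tn, D=100_000, N=2_000):
    """P(E) per the evaluation equations (1)-(3): weight the false-reject
    and false-accept rates by the relative sizes of the database (D) and
    the known-non-database set (N)."""
    Rr = (fp + fn) / (tp + fp + fn)  # false reject rate, eq. (2)
    Ra = fa / (fa + tn)              # false accept rate,  eq. (3)
    return (D / (D + N)) * Rr + (N / (D + N)) * Ra  # eq. (1)
```

For instance, with made-up counts fp=1, fn=1, tp=98, fa=0, tn=100: Rr = 0.02, Ra = 0, and P(E) ≈ 0.0196.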
We have published results of various P(E):

Manipulation                                 P(E)
30 second WAV file                           0.0109
60 second WAV file                           0.0030
60 second recompressed MP3 file              0.0136
30 second recompressed MP3 file              0.0163
30 second recompressed MP3 file at 96kbps    0.0260
Listener analytics
- Data from our crawlers and partners
- 100m -ish profiles
- Help artists find fans, etc.
userID      likes "The Beatles"   smokes   gender   average song tempo   in friend cluster 393   WOEID   ...
LI394AXT9   0.8                   N        M        139.4                —                       —       ...
LI484ART3   —                     —        —        —                    —                       —       ...
LI578TR4E   0.4                   Y        F        —                    1.0                     59403   ...
100m users
1m features
- 100m x 1m matrix M
- <1% non-zero
- All operations must be iterative, parallel
- Dream:
[value, confidence] = prediction(listener, feature)
userID      likes "The Beatles"   smokes     gender     average song tempo   in friend cluster 393   WOEID   ...
LI394AXT9   0.8                   N          M          139.4                —                       —       ...
LI484ART3   [0.1, 0.9]            [N, 0.4]   [M, 0.9]   [83, 0.3]            —                       —       ...
LI578TR4E   0.4                   Y          F          —                    1.0                     59403   ...
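A toy sketch of the sparse store and the dream `[value, confidence] = prediction(listener, feature)` interface. The prediction heuristic here (neighbor averaging with a crude confidence) is purely illustrative, not the real system:

```python
# With <1% of a 100m x 1m matrix non-zero, store only observed cells.
class ListenerMatrix:
    def __init__(self):
        self.rows = {}  # user_id -> {feature: value}

    def set(self, user_id, feature, value):
        self.rows.setdefault(user_id, {})[feature] = value

    def prediction(self, user_id, feature):
        """Return (value, confidence) for one cell of the matrix."""
        me = self.rows.get(user_id, {})
        if feature in me:
            return me[feature], 1.0  # observed: full confidence
        # neighbors = users overlapping on at least one observed feature
        votes = [row[feature] for uid, row in self.rows.items()
                 if uid != user_id and feature in row
                 and set(row) & set(me)]
        if not votes:
            return None, 0.0
        value = sum(votes) / len(votes)
        confidence = len(votes) / (len(votes) + 1)  # crude shrinkage
        return value, confidence
```

At real scale this per-query scan is exactly what you cannot do; the slide's point is that every operation must instead be iterative and parallel over the sparse matrix.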
RLSC
[Reconstructed from a garbled two-column excerpt of the dissertation:]

RLSC (regularized least-squares classification) can be seen, roughly, as an SVM in which every training observation becomes a support vector. Given a fixed set of training observations, we create the kernel matrix K that represents the data, add the regularization term, and invert (always possible because of the addition of the regularization term). For non-linearly separable data we substitute a gaussian kernel:

    K(x1, x2) = e^(-|x1 - x2|^2 / (2 sigma^2))

Kernel functions can be viewed as a "distance function": they take the high-dimensional points in your input feature space and represent the data as some distance between points. There is a bit of engineering in choosing the right kernel function, as it should reflect structure in your data. The kernel matrix K (containing all the kernel evaluations) should be positive semi-definite; that is, all eigenvalues of K are non-negative.

To create machines for each possible output class, simply multiply the stored inverse by a truth vector y, where y is 1 for each observation in the class and -1 otherwise:

    c = (K + (I/C))^-1 * y

We can then classify new data by projecting a test point x_test through the kernel and c:

    f(x_test) = c · K(x_test, x)

Intuitively, this is your classifier multiplied by every training point evaluated through the kernel against your new test point. The key property is that the solution c is linear in y: we compute and store the inverse matrix (K + (I/C))^-1 once, and new classifiers are then simple matrix multiplications with new truth vectors. Discriminating amongst n classes requires n machines in a one-vs-all scheme, but with RLSC the training time for each new class problem after the single inversion is just a matrix multiply. (In practice we use iterative optimization rather than a direct inverse.)
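A minimal dense-numpy sketch of RLSC as described above — direct inversion, fine for small n and exactly the thing that breaks at 100m rows:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """K(x1, x2) = exp(-|x1 - x2|^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def rlsc_train(X, Y, C=10.0, sigma=1.0):
    """Solve c = (K + I/C)^-1 Y. Y may hold one +1/-1 column per class,
    so adding a class is just another matrix multiply against the same
    stored inverse."""
    K = gaussian_kernel(X, X, sigma)
    inv = np.linalg.inv(K + np.eye(len(X)) / C)
    return inv @ Y

def rlsc_classify(X_train, c, X_test, sigma=1.0):
    """f(x_test) = c . K(x_test, x) over every training x."""
    return gaussian_kernel(X_test, X_train, sigma) @ c
```

Trained on two well-separated clusters with +1/-1 labels, test points near each cluster come out with the matching sign.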
But
(1) wait, I can't compute a 100m^2 kernel matrix
- I mean, I can, but it'll take a while.
(2) the inversion? Fast iterative SPD solvers?
(3) data normalization:
- 20% of features are probabilities
- 20% are boolean
- 30% are indexes
- the rest are all over the place: tempo, key, term frequencies, age (!) etc
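On the "fast iterative SPD solvers" question: conjugate gradient solves (K + I/C)c = y for a symmetric positive-definite system using only matrix-vector products, so the inverse is never formed and K never has to be materialized if the matvec streams over shards. A generic sketch:

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given only a
    function computing A @ v."""
    x = np.zeros_like(b)
    r = b - matvec(x)   # residual
    p = r.copy()        # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Each iteration costs one matvec, which is what makes the approach parallelizable over a distributed kernel matrix in a way a direct inversion is not.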
The platform
The power of APIs
We even give it away
Big Data & Connected Data!
SecondHandSongs
- cover songs
7digital
- 30s audio
Last.fm
- song tags
- similar songs
Echo Nest
API dump
Musicmetric
- artist popularity
- user location
musiXmatch
- lyrics
Musicbrainz
- years
Grooveshark
- playlists
- playcounts
Discovr
iHeartRadio
Scrobbyl
tastebuds.fm
MTV Music Meter
Music Hunter
Vib Ribboff
Music Hack Days
Developers are the future of music