slides - Applied Machine Learning Reading Group, MIT Media Lab
Transcription
Machine listening and reading at scale
Brian Whitman, co-founder & CTO, The Echo Nest
@bwhitman / [email protected]

Hello! I "teach computers to listen to music"
Columbia University, NYC
MIT Media Lab (finishing my dissertation)

[Figure: Elias Pampalk's "Islands of Music" / "Retrieving Music by Rhythmic Similarity": a self-organizing map clustering songs by rhythmic similarity (squared Euclidean distance), annotated with tempo contours from 110 to 130 bpm. Fig. 2 caption: "Islands of Music representing a 7x7 SOM with 77 pieces of music. The artists and full titles of the pieces, which are represented by the short identifiers here, can be found on the web page or in the thesis."]

From the thesis conclusion: "The Islands of Music have not yet reached a level which would suggest their commercial usage; however, they demonstrate the possibility of such systems and serve well as a tool for further research. Any part of the system can be modified or replaced and the resulting effects can easily be evaluated using the graphical user interface. Furthermore, the feature extraction technique and the visualization technique developed for this thesis can both be applied separately to a broad range of applications."

Music retrieval and "music intelligence" applications (creepy):
➔ Hit prediction
➔ Taste / recommendation
➔ Automated composition
➔ Classification, tagging
➔ Feature extraction (key, tempo)
➔ Search & retrieval
➔ Review regression

[Plots: predicting review scores from audio: audio-derived ratings vs. AMG ratings and vs. Pitchfork ratings, alongside randomly selected baselines; reported figures .147 [.080] and .127 [.082].]

Set A: Britney Spears, Backstreet Boys, Christina Aguilera
Set B: Alice in Chains, Korn, Faith No More
Set C: Chris Isaak, Bob Dylan, Crowded House

The Echo Nest, 2007: Somerville. 2 people, 2 computers, lots of ideas. 1,000,000 documents, 10,000 artists, 100,000 songs.

The Echo Nest, 2011: Somerville, NY, SF, LDN. 35 people, 300 computers, 150m people / month. 5,000,000,000 documents, 2,000,000 artists, 35,000,000 songs.

Audio features:
➔ musical features, Jehan '05++
➔ publicly available, ~5s / song on cloud machines
[Figure: the audio analysis pipeline: auditory spectrogram; segmentation with beat markers; per-segment pitch features (chroma, C through B) and timbre features; tempogram / tempo spectrum (60-240 bpm).]

Text features:
➔ "community metadata", Whitman '02++
➔ web crawls, blogs, reviews, news: "top terms"
➔ constantly updated, >1m pages/day

Table 4.2: Top 10 terms of various types for ABBA. The score is TF-IDF for adj (adjective), and Gaussian-weighted TF-IDF for term types n2 (bigrams) and np (noun phrases). Parsing artifacts are left alone.

n2 (bigram)       Score     np (noun phrase)   Score     adj (adjective)   Score
dancing queen     0.0707    dancing queen      0.0875    perky             0.8157
mamma mia         0.0622    mamma mia          0.0553    nonviolent        0.7178
disco era         0.0346    benny              0.0399    swedish           0.2991
winner takes      0.0307    chess              0.0390    international     0.2010
chance on         0.0297    its chorus         0.0389    inner             0.1776
swedish pop       0.0296    vous               0.0382    consistent        0.1508
my my             0.0290    the invitations    0.0377    bitter            0.0871
s enduring        0.0287    voulez             0.0377    classified        0.0735
and gimme         0.0280    something's        0.0374    junior            0.0664
enduring appeal   0.0280    priscilla          0.0369    produced          0.0616

Metadata features:
➔ explicit categories, tags, similars from partners
➔ physical sales data
➔ other ontologies

[Maps: per capita sales for Bon Jovi, Young Jeezy, Justin Bieber, and Neon Trees.]

Biggest database of artist data ever:
➔ billions of datapoints about artists
➔ web crawls: {terms: [(funky,0.34),(jazz:0.95)]}
➔ metadata: {style:jazz, location:Boston}
➔ user data: {id:LIX2923ND, likes:[Phonat, Apparat]}
➔ analytics: {hotttnesss: 0.45, chart_position...
etc}

Normalize everything to a "term" in a big index:
➔ words get indexed with their probabilities as TF
➔ artist names get resolved to IDs in a graph
➔ numerical data goes out to a tree or gets clustered
➔ optimizes for fast query; distributed is easy
➔ "boosts" set in real time per app or trained

"Song to song" similarity:
➔ catalogs at >10m tracks each (multiple clients!)
➔ existing cultural metadata coverage is 80%-ish
➔ "Happy birthday to <your name>" x 1,200
➔ turn around a live API in < a week
➔ bottlenecks are all in infrastructure: disk speed, data storage, DB access / inserts, queues

Merge what we know:
➔ if we have text data, filter on artist similars
➔ sort the filtered X,000 tracks in memory, cache
➔ acoustic features use statistics of EN Analyze
➔ QA process (interns + spot checks)
➔ evaluation is customer feedback

Hardest problem?
➔ Evaluating distance metrics?
➔ Selecting audio features?
➔ Sharding / replicating key-value stores?
➔ Tuning, QA?
➔ Matching artist and track titles

Echoprint & ENMFP:
➔ both massive LSH problems
➔ different audio features >> n-bit codes >> DB

ENMFP:
[Figure: auditory spectrogram, segmentation, and per-segment pitch (chroma) and timbre features.]
➔ chroma -> VQ 10-bit # (0-1023)
➔ 3 in a row = 1 30-bit # per segment

Echoprint:
➔ whitened 8-band subband analysis
➔ onset detection
➔ each onset (time + magnitudes) hashed to a 20-bit key

Server: all codes go into a big inverted index, e.g.:

123984: TR5904 @50ms, TR1283 @120ms
6940:   TR1283 @1840ms, TR7348 @860ms
21823:  TR1293 @680ms, TR5909 @650ms

A query pulls out the top-ranking TRs that have the most codes overlapping with the query, fast.
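The lookup described above (fixed-width codes into an inverted index of (track, time) postings, then refined by a histogram of time-offset differences) can be sketched roughly as follows. The class and function names, the bit-packing order, and the 50 ms bucket size are illustrative assumptions, not the production implementation:

```python
from collections import defaultdict, Counter

def pack_code(vq_indices):
    """Pack three consecutive 10-bit VQ indices (0-1023) into one 30-bit
    code, as in the ENMFP slide (hypothetical packing order)."""
    a, b, c = vq_indices
    return (a << 20) | (b << 10) | c

class FingerprintIndex:
    """Toy inverted index: code -> list of (track_id, time_offset_ms)."""

    def __init__(self):
        self.index = defaultdict(list)

    def add(self, track_id, codes):
        # codes: iterable of (code, offset_ms) pairs for one track
        for code, offset in codes:
            self.index[code].append((track_id, offset))

    def query(self, codes, top_k=2):
        # First count raw code overlaps per track, then refine with a
        # histogram of time-offset differences: a true match has many
        # codes whose (db_offset - query_offset) land in the same bucket,
        # i.e. the codes occur in order even if the query starts mid-song.
        overlap = Counter()
        offset_buckets = defaultdict(Counter)
        for code, q_offset in codes:
            for track_id, db_offset in self.index.get(code, []):
                overlap[track_id] += 1
                bucket = (db_offset - q_offset) // 50  # 50 ms buckets
                offset_buckets[track_id][bucket] += 1
        # "True score": matches in the single best-aligned offset bucket.
        scored = [(max(offset_buckets[t].values()), t)
                  for t, _ in overlap.most_common(top_k)]
        scored.sort(reverse=True)
        return scored

# Demo: a query taken from 1 second into TR1283 still aligns.
idx = FingerprintIndex()
idx.add("TR1283", [(pack_code((1, 2, 3)), 0), (pack_code((4, 5, 6)), 100)])
idx.add("TR5904", [(pack_code((1, 2, 3)), 50)])
res = idx.query([(pack_code((1, 2, 3)), 1000), (pack_code((4, 5, 6)), 1100)])
# best match: TR1283
```

At scale the posting lists live in a distributed store rather than a Python dict, but the overlap-then-offset-histogram scoring is the same shape.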
Time offsets are used for 'close matches' (almost all of them).

Evaluation: matching query codes to songs. The reference database contains known tracks, each track T with metadata (artist, album, track name); identification involves taking an unknown audio query and matching it to the corresponding track in the reference database. Each track is split into 60-second sections overlapping by 30 seconds, to avoid the bias introduced when longer songs produce more candidate hits for a set of query hashes. The codes for each 60-second segment are stored as documents in an inverted index, with the track ID plus the segment number used as the document ID; the underlying data store uses Apache Solr to provide a fast lookup of document IDs.

A typical query has about 800 hash keys. The query returns the segments with the most matches. In practice we find that there is rarely one result with significantly more matches than all the others; however, the top matches contain the actual match if it exists. We build a histogram of all time-offset differences t per track in the result set, and use the totals of the buckets to inform the "true score": this checks that the codes occur in order even if the query Q is a subsection of the song and thus has a different absolute position.

The test set is 100,000 tracks alongside a set of 2,000 tracks known not to be in the 100,000. We compute queries on audio pulled from selections of the audio files: for example, 30 seconds from the middle, resampled to 96kHz, decoded using various MP3 decoders, or with volume adjusted. This lets us compute metrics such as the number of false positives fp (the wrong song was identified), false negatives fn (the query was not matched to the correct track), false accepts fa (a known non-database query was matched to a track), true positives tp, and true negatives tn. From these we compute a probability of error P(E) that weights the size of the database (100,000 as D) and the size of the known non-database set (2,000 as N) with the false reject rate Rr and the false accept rate Ra.

Our FPs now have >75m tracks and serve hundreds of millions of queries a day.

Documents are ordered by their true score.
If multiple segments from the same track are in the result list, we keep only the document with the highest score; the top result is returned as a positive match if its score is significantly higher than the scores of all other documents in the list.

P(E) = (D / (D + N)) * Rr + (N / (D + N)) * Ra    (1)

where

Rr = (fp + fn) / (tp + fp + fn)    (2)

Ra = fa / (fa + tn)    (3)

with D the number of database tracks (100,000), N the number of known non-database tracks (2,000), Rr the false reject rate, and Ra the false accept rate.

We have published results [2] of various P(E):

Manipulation                                  P(E)
30 second WAV file                            0.0109
60 second WAV file                            0.0030
60 second recompressed MP3 file               0.0136
30 second recompressed MP3 file               0.0163
30 second recompressed MP3 file at 96kbps     0.0260

5. REFERENCES
[1] Daniel P.W. Ellis, Brian Whitman, Tristan Jehan, and...

Listener analytics
- Data from our crawlers and partners
- 100m-ish profiles
- Help artists find fans & etc.

userID     likes "The Beatles"  smokes  gender  avg song tempo  in friend cluster 393  WOEID
LI394AXT9  0.8                  N       M       139.4           ...                    ...
LI484ART3  ...                  ...     ...     ...             ...                    ...
LI578TR4E  0.4                  Y       F       ...             1.0                    59403

100m users x 1m features:
- a 100m x 1m matrix M, <1% non-zero
- all operations must be iterative and parallel
- the dream: [value, confidence] = prediction(listener, feature)

The same table with the missing entries filled in as [value, confidence] predictions, e.g. for LI484ART3: likes "The Beatles" [0.1, 0.9], smokes [N, 0.4], gender [M, 0.9], average song tempo [83, 0.3].
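Stepping back to the fingerprint evaluation: the probability-of-error metric P(E) combines the false-reject and false-accept rates, weighted by the sizes of the in-database and non-database query pools. A minimal sketch; the example counts are made up, not from the published evaluation:

```python
def probability_of_error(tp, fp, fn, fa, tn, D=100_000, N=2_000):
    """P(E) = (D/(D+N)) * Rr + (N/(D+N)) * Ra, where Rr is the false-reject
    rate over in-database queries, Ra the false-accept rate over known
    non-database queries, and D and N are the two pool sizes."""
    Rr = (fp + fn) / (tp + fp + fn)  # false reject rate
    Ra = fa / (fa + tn)              # false accept rate
    return (D / (D + N)) * Rr + (N / (D + N)) * Ra

# Hypothetical counts: 1,000 of 100,000 in-database queries missed or wrong,
# 20 of 2,000 non-database queries wrongly accepted -> P(E) of about 0.01.
p = probability_of_error(tp=99_000, fp=500, fn=500, fa=20, tn=1_980)
```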
. . ⌥. 5.11, 2 of tybymatrix. from Equation Kf(where (x1 , xy 2is)the = e. . 1 for each will be able to classify new data by projecting a test point xtest through tion K and then c: I (5 (K Kernel + )c = y, (5.20) functio ble parameter. functions can be viewed as a ‘distance 2C l the high-dimensionality points in your input feature space and re t that in practice same with different problems, we f (xtestremains ) = cK(xthe (5.21) test , x) data as some distance between points. There is a bit of engineering n the denominator. t kernel function, as the function should reflect structure in your da ion, this can be interpreted as your classifier multiplied by every point uss an alternate kernel for music analysis. For now it should be not a evaluated through the kernel against your new test point. In effect, (roughly) an SVM where all datapoints are SVs seen RLSC ces (the ⌅ ⌅ matrix K that contains all the kernel evaluations) shou as equivalent to an SVM in which every training observation be- requires ofthea each kernel matrix all non-negativ obs rtsitive vector.semi-definite; The vector c forinversion each weights observation’s thatclass is, all eigenvalues of imporKofare ultant classifying his approach isfunction. that in Equationare 5.20simple the solution c is linear in - new classifiers matrix mults I 1 y. We compute and store the inverse matrix (K + ) (this is -1 -1 ⇥ C through K ininversion: c classes = (K+(I/C)) * y (x)problems, with the newstored f (x) class training time is linear the amount of n. TrainI ecause of the addition of the regularization term C ), then for a discriminate amongst n classes either requires n SVMs in a one-vs-all But (1) wait, i can’t compute a 100m^2 kernel matrix - i mean, I can, but it’ll take a while. (2) the inversion? Fast iterative SPD solvers? (3) data normalization: - 20% of features are probabilities - 20% are boolean - 30% are indexes - the rest are all over the place: tempo, key, term frequencies, age (!) 
The platform. The power of APIs. We even give it away.

Big Data & Connected Data! The Echo Nest API dump, plus:
- SecondHandSongs: cover songs
- 7digital: 30s audio
- Last.fm: song tags, similar songs
- Musicmetric: artist popularity, user location
- musiXmatch: lyrics
- Musicbrainz: years
- Grooveshark: playlists, playcounts

Discovr, iHeartRadio, Scrobbyl, tastebuds.fm, MTV Music Meter, Music Hunter, Vib Ribboff

Music Hack Days

Developers are the future of music