Learning the meaning of music
Brian Whitman, MIT Media Lab, April 14, 2005

Committee:
Barry Vercoe, Professor of Media Arts & Sciences, Massachusetts Institute of Technology [Music Mind and Machine]
Daniel P.W. Ellis, Assistant Professor of Electrical Engineering, Columbia University [LabROSA]
Deb Roy, Associate Professor of Media Arts & Sciences, Massachusetts Institute of Technology [Cognitive Machines]

NEWTON v. DIAMOND
[James Newton, “Choir” from Axum, ECM Recordings]
[Beastie Boys, “Pass the Mic” from Check Your Head, Grand Royal]
[Excerpt of the first amended complaint: James W. Newton, Jr. dba Janew Music v. Michael Diamond, Adam Horovitz and Adam Yauch, dba Beastie Boys, et al., Case No. CV 00-04909-NM (MANx), copyright infringement, 17 U.S.C. §101 et seq., demand for jury trial]

[Chart: daily counts, January 11 through April 9]
[M.I.A., “Galang” from Arular, XL Recordings]
{ “my favorite song” · “i hate this song” · “four black women in rural arkansas” · “from sri lanka” · “about the falklands war” · “romantic and sweet” · “loud and obnoxious” · “sounds like early XTC” · #1 in the country · “reminds me of my ex-girlfriend” }

USER MODEL: the same descriptions, with the ones the model can ground checked off (✓ “four black women in rural arkansas”, ✓ “from sri lanka”, ✓ “about the falklands war”, ✓ #1 in the country).

[Diagram: Perceptual features + Community Metadata → “Semantic projection” (Penny, RLSC) → Interpretation: “My favorite song,” “Romantic electronic music,” “Sounds like old XTC,” ★★★★★]

Contributions
1 Music retrieval problems
2 Meaning
3 Contextual & perceptual analysis
4 Learning the meaning
5 “Semantic Basis Functions”

1 Music retrieval problems
a Christmas
b Semantic / Signal approaches
c Recommendation

Music Retrieval
◆ Field for the organization and classification of musical data
◆ Score level, audio level, contextual level
◆ Most popular: “genre ID,” playlist generation, segmentation

Music Retrieval [Jehan 2005]
Referential: Genre ID, Style ID, Preference, Artist ID
Absolutist: Audio similarity, Structure extraction (verse/chorus/bridge), Energy, Beat / tempo, Query by humming, Transcription, Key finding

From [Cooper, Foote 2003] (2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 19-22, 2003, New Paltz, NY):

…mented training data. In contrast, our technique uses the digital audio to model itself for both segmentation and clustering. Tzanetakis and Cook [6] discuss “audio thumbnailing” using a segmentation-based method in which short segments near segmentation boundaries are concatenated. This is similar to “time-based compression” of speech [7]. In contrast, we use complete segments for summaries, and we do not alter playback speed. Previous work by the authors has also used similarity matrices for excerpting, without an explicit segmentation step [8]. The present method results in a structural characterization, and is far more likely to start or end the summary excerpts on actual segment boundaries. We have also presented an earlier version of this approach, however with less complete validation [4].

2.2. Media Segmentation, Clustering, & Similarity Analysis
Our clustering approach is inspired by methods developed for segmenting still images [9].
Using color, texture, or spatial similarity measures, a similarity matrix is computed between pixel pairs. This similarity matrix is then factorized into eigenvectors and eigenvalues. Ideally, the foreground and background pixels exhibit within-class similarity and between-class dissimilarity, so thresholding the eigenvector corresponding to the largest eigenvalue can classify the pixels into foreground and background. In contrast, we employ a related technique to cluster time-ordered data. Gong and Liu have presented an SVD-based method for video summarization [10], factorizing a rectangular time-feature matrix rather than a square similarity matrix. Cutler and Davis use affinity matrices to analyze periodic motion using a correlation-based method [11].

3. SIMILARITY ANALYSIS
Similarity analysis is a non-parametric technique for studying the global structure of time-ordered streams.

3.1. Constructing the similarity matrix
First, we calculate 80-bin spectrograms from the short-time Fourier transform (STFT) of 0.05-second non-overlapping frames in the source audio. Each frame is Hamming-windowed, and the logarithm of the magnitude of the FFT is binned into an 80-dimensional vector. […] squares along the main diagonal; brighter rectangular regions off the main diagonal indicate similarity between segments.

3.2. Audio Segmentation […]

[Figure 2: Top: the similarity matrix computed from the song “The Magical Mystery Tour” by The Beatles. Bottom: the time-indexed novelty score produced by correlating the checkerboard kernel along the main diagonal of the similarity matrix.]

“Unsemantic” music-IR (what works)
[Casey 2002] [Goto 2002-04] [Tzanetakis 2001]

Genre classification confusion matrix [Tzanetakis 2001] (columns are the true genre):

           classic  country  disco  hiphop  jazz  rock
classic       86       2       0      4      18     1
country        1      57       5      1      12    13
disco          0       6      55      4       0     5
hiphop         0      15      28     90       4    18
jazz           7       1       0      0      37    12
rock           6      19      11      0      27    48

Classical music classification confusion matrix [Whitman and Smaragdis 2002]:

             choral  orchestral  piano  string 4tet
choral          99       10        16       12
orchestral       0       53         2        5
piano            1       20        75        3
string 4tet      0       17         7       80

[Style ID results over R&B, Contemporary Country, IDM, Classical, Rap, Heavy Metal]

[Diagram: genre futility, on a scale from 100% signal to 100% context (personalization)]
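Where the [Cooper, Foote 2003] excerpt above describes the method in prose, the following is a minimal numpy sketch of the same idea: a cosine similarity matrix over spectrogram frames, and a novelty score from correlating a checkerboard kernel along the diagonal. The frame representation, kernel width, and taper are illustrative choices, not the exact settings of that paper.

```python
import numpy as np

def similarity_matrix(frames):
    """frames: (n_frames, n_bins) log-magnitude spectrogram slices."""
    norms = np.linalg.norm(frames, axis=1, keepdims=True) + 1e-12
    unit = frames / norms
    return unit @ unit.T          # cosine similarity between all frame pairs

def novelty_score(S, width=16):
    """Correlate a checkerboard kernel along the main diagonal of S."""
    half = width // 2
    sign = np.ones((width, width))
    sign[:half, half:] = -1       # off-diagonal quadrants are negative
    sign[half:, :half] = -1
    taper = np.outer(np.hanning(width), np.hanning(width))
    kernel = sign * taper
    n = S.shape[0]
    score = np.zeros(n)
    for i in range(half, n - half):
        patch = S[i - half:i + half, i - half:i + half]
        score[i] = np.sum(patch * kernel)
    return score                  # peaks suggest segment boundaries
```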
2 Meaning
a Musical meaning
b Grounding
c Our approach

Meaning: relationship between perception and interpretation.
[musical score excerpt]
“About the Falklands war” → Correspondence
Correspondence: connection between representation and content.
Musical “story”: lyrics, discussion. Explicit correspondence: instruments, score, etc.

Meaning: relationship between perception and interpretation.
[musical score excerpt]
Similar artists; genres: rock, pop, world; styles: IDM, math rock → Reference
Reference: connection between music and other music.
Similar artists, styles, genres.

Meaning: relationship between perception and interpretation.
[musical score excerpt]
“#1 in America”; buzz, trends; influencing → Significance
Significance: aggregated cultural preference, “meaningful.”
Charts, popularity, critical review.

Meaning: relationship between perception and interpretation.
[musical score excerpt]
“funky, loud, romantic”; “reminds me of my father”; usage and behavior → Reaction
Reaction: effect of music on listener (personal significance).
Personal comments, reviews. Usage patterns, ratings.

[Mueller] [Miller] [All Media Guide]
[Duygulu, Barnard, de Freitas, Forsyth 2002: image regions annotated with words such as sky, sun, waves, cat, grass, jet, plane, tiger, sea]
[Roy, Hsiao, Mavridis, Gorniak 2001-05]

From [Slaney 2002]:

2. THE EXISTING SYSTEMS
There are many multimedia retrieval systems that use a combination of words and/or examples to retrieve audio (and video) for users. An effective way to find an image of the space shuttle is to enter the words “space shuttle jpg” into a text-based web search engine. The original Google system did not know about images but, fortunately, many people created web pages with the phrase “space shuttle” that contained a JPEG image of the shuttle. More recently, both Google and AltaVista for images, and Compusonics for audio, have built systems that automate these searches. They allow people to look for images and sound based on nearby words. The SAR work expands those search techniques by considering the acoustic and semantic similarity of sounds…

We build a system that has these functions, called SAR (semantic-audio retrieval), by learning the connection between a semantic space and an auditory space. Semantic space maps words into a high-dimensional space. Acoustic space describes sounds by a multidimensional vector. In general, the connection between these two spaces will be many to many: horse sounds, for example, might include footsteps and neighs. We [train on] tracks comprising 330 minutes of audio recordings of animal sounds. In addition, the concatenated name of the CD (e.g., “Horses I”) and track description (“One horse eating hay and moving around”) form a unique semantic label for each track. The audio from the CD track and the liner notes form a pair of acoustic and semantic documents used to train the SAR system.

Figure 1 shows one half of SAR: how to retrieve sounds from words. Annotations that describe sounds are clustered within a hierarchical semantic model that uses multinomial models. The sound files, or acoustic documents, that correspond to each node in the semantic hierarchy are modeled with Gaussian mixture models (GMMs). Given a semantic request, SAR identifies the portion of the semantic space that best fits the request and then measures the likelihood that each sound in the database fits the probabilistic model…

Figure 1: SAR models all of semantic space with a hierarchical collection of multinomial models; each portion in the semantic model is linked to equivalent sound documents in acoustic space (e.g., Horse → Trot, Step, Whinny) with a GMM.

Figure 2: SAR describes with words an audio query by partitioning the audio space with a set of hierarchical acoustic models and then linking each set of audio files (or documents) to a probability model in semantic space.
Grounding [All Media Guide]

[System diagram: source → packing → meaning extraction → application. Audio DSP → semantic basis functions → query by description, recommendation; Community Metadata (NLP, statistics) → reaction prediction, long-distance song effects. Sources: chat, reviews, explicits, charts, usage, edited info, tagging, WordNet(s). Methods: cluster, RLSC, LSA, SVM, tf-idf, HMM.]

3 Contextual & perceptual analysis
a “Community Metadata”
b Usage mining
c Community identification
d Perceptual analysis

“angry loud guitars” / “The best album of 2004” / WQHT adds / “My ex-girlfriend’s favorite song”

Community Metadata [Whitman, Lawrence 2002]
Search for the target context (artist name, song name) → pages for context → parsing, position → POS tagging, NP chunking → terms per type for context → TF-IDF → Gaussian smoothing.

Each webtext term t is scored against its document frequency f_d (a code sketch follows at the end of this section). Plain TF-IDF:

s(f_t, f_d) = f_t / f_d

and a Gaussian-smoothed variant that rewards mid-frequency terms:

s(f_t, f_d) = f_t · e^(−(log(f_d) − μ)² / (2σ²))

[Chart: scoring accuracy for TF-IDF vs. Gaussian-smoothed weights across term types (unigram, bigram, noun phrase, adjectives, artist terms), 0%-90%]

Artist similarity from page counts, with a popularity correction:

S(a, b) = (C(a, b) / C(b)) · (1 − (C(a) − C(b)) / C(c))

Peer-to-peer crawling. [Ellis, Whitman, Berenzweig, Lawrence 2002] [Berenzweig, Logan, Ellis, Whitman 2003]

Evaluation: top-rank agreement
Survey (self)    54%
Audio (anchor)   20%
Audio (MFCC)     22%
Expert (AMG)     28%
Playlists        27%
Collections      23%
Webtext          19%
Baseline         12%

Community Identification
[Chart: community metadata weights for “funky,” “heavy metal,” “loud,” “cello,” “romantic” across artists (Aerosmith, ABBA, Madonna, Portishead, …)]
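A minimal Python sketch of the Community Metadata scoring and artist similarity equations above; the μ and σ defaults are illustrative placeholders, not the values used in [Whitman, Lawrence 2002].

```python
import math

def tfidf_score(f_t, f_d):
    """f_t: term frequency for this context; f_d: document frequency."""
    return f_t / f_d

def gaussian_score(f_t, f_d, mu=0.0, sigma=1.0):
    """Favor mid-frequency terms: very rare and very common terms shrink."""
    return f_t * math.exp(-((math.log(f_d) - mu) ** 2) / (2 * sigma ** 2))

def artist_similarity(C_ab, C_a, C_b, C_c):
    """Co-occurrence similarity with a popularity correction.
    C_ab: pages mentioning both artists; C_a, C_b: page counts for each;
    C_c: page count of the most popular artist in the set."""
    return (C_ab / C_b) * (1 - (C_a - C_b) / C_c)
```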
[Figure 3-7: Mel scale: mels vs. frequency in Hz.]

◆ Audio features: not too specific
◆ High expressivity at a low rate
◆ No assumptions other than biological
◆ “The sound of the sound”

[Figure 3-8: Penny V1, V2 and V3 for the first 60 seconds of “Shipbuilding.”]

Audio representation
To compute modulation cepstra we start with MFCCs at a cepstral frame rate (of between 5 Hz and 100 Hz), returning a vector of 13 bins per audio frame. We then stack successive time samples for each MFCC bin into 64-point vectors and take a second Fourier transform on these per-dimension temporal energy envelopes. We aggregate the result into six modulation channels (sketched in code below).

Modulation cepstra [Ellis / Whitman 2004]: the FFT of the MFCC, mixed to 6 “channels” covering the modulation ranges 0-1.5 Hz, 1.5-3 Hz, 3-6 Hz, 6-12 Hz, 12-25 Hz and 25-50 Hz.

[Figure: MFCC spectrogram, FFT mixing, and the resulting Penny frames]

[Figure 3-10: Evaluation of five features in a 1-in-20 artist ID task: MFCC 5 Hz, MFCC 20 Hz, Penny 20 Hz, Penny 5 Hz, PSD 5 Hz; accuracy 0%-45%.]

Penny is still a valuable feature for us: low data rate and time representation. Because of the overlap in the Fourier analysis of the cepstral frames, the Penny data rate is a fraction of the cepstral rate. In the usual implementation (Penny with a cepstral frame rate of 5 Hz, 300 MFCC frames per minute) we end up with 45 Penny frames per minute of audio. Even if MFCCs outperform at equal cepstral analysis rates, Penny needs far less actual data. {Featurefight}
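A minimal sketch of the modulation-cepstra (“Penny”) idea described above: a second FFT over the temporal envelope of each MFCC bin, aggregated into six modulation ranges. The MFCC front end is borrowed from librosa for brevity; the exact rates, stacking, and windowing in the thesis differ.

```python
import numpy as np
import librosa

MOD_EDGES_HZ = [0, 1.5, 3, 6, 12, 25, 50]   # six modulation channels

def penny(y, sr, cepstral_rate=100.0, stack=64):
    hop = int(sr / cepstral_rate)            # MFCC frame rate ~ cepstral_rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    frames = []
    for start in range(0, mfcc.shape[1] - stack + 1, stack // 2):
        block = mfcc[:, start:start + stack]          # (13, 64) envelope chunk
        spec = np.abs(np.fft.rfft(block, axis=1))     # FFT per MFCC bin
        freqs = np.fft.rfftfreq(stack, d=1.0 / cepstral_rate)
        chans = [spec[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
                 for lo, hi in zip(MOD_EDGES_HZ[:-1], MOD_EDGES_HZ[1:])]
        frames.append(np.concatenate(chans))          # one (13 * 6,) vector
    return np.array(frames)
```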
4 Learning the meaning
a SVM / Regularized least-squares classification
b A note on computation
c Evaluations

[Figure 5-1: a binary classification problem in feature space, with class labels such as rock, dance, rap]

The power of the SVM lies in the representer theorem, where a high-dimensional x can be represented fully by a generalized dot product (in a Reproducing Kernel Hilbert Space [7]) between x_i and x_j using a kernel function K(x_i, x_j). The binary classification problem shown in figure 5-1 could be classified by a hyperplane learned by an SVM. However, non-linearly separable data need us to consider a new topology, and we can substitute in a kernel function that represents data as

K_f(x_1, x_2) = e^(−|x_1 − x_2|² / σ²)

where σ is a tunable parameter. Kernel functions can be viewed as a “distance function” between all the high-dimensionality points in your input feature space.

SVM / Kernel methods

[Figure: Perception (audio): l frames of d dimensions, against Reaction (community metadata): c classes such as “aerosmith,” “this song sucks,” “loud guitars,” “falklands war,” “funky”]

Multiclass
◆ Most machine learning classifiers have compute time linear in c
◆ For higher accuracy, DAG or 1-vs-1 classifiers are required: c(c−1) classifiers!
◆ We need to scale to over 30,000 c
◆ Bias, incorrect ground truth, unimportant truth

RLSC
◆ Substitute square loss for hinge loss in the SVM problem
◆ Easily graspable linear algebra solution
◆ “An SVM where the experimenter defines the support vectors”
◆ New classes can be added after training, and each is a simple matrix multiplication!

We usually use the Gaussian kernel, defined as

K_f(x_1, x_2) = e^(−(|x_1 − x_2|)² / σ²)    (1)

where |x_1 − x_2| is the conventional Euclidean distance between two points, and σ is a parameter we keep at 0.5. Training an RLSC system then consists of solving the system of linear equations

(K + I/C) c = y    (2)

where C is a user-supplied regularization constant which we keep at 10. The resulting real-valued classification function f is

f(x) = Σ_{i=1..l} c_i K(x, x_i)    (3)

with f(x) ∼ P(term_i | audio_x). The crucial property of RLSC is that if we store the inverse matrix (K + I/C)^(−1), computed once from the training data and kept in memory, then for a new right-hand side y (i.e. a new class whose values we are trying to predict) we can compute the new classifier c on the fly via a simple matrix multiplication. This is very well suited to problems with few observations and a large set of classes (see the sketch below).

[Charts: compute time vs. number of classes (n = 1000) and memory/disk allocation vs. observations, RLSC vs. SVM]
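A minimal numpy sketch of RLSC as described above: invert the regularized kernel matrix once, then obtain each new per-term classifier with a single matrix multiply. The parameter values follow the text (σ = 0.5, C = 10); everything else is an illustrative implementation choice.

```python
import numpy as np

def gaussian_gram(X, sigma=0.5):
    """K_f(x1, x2) = exp(-|x1 - x2|^2 / sigma^2) over all training pairs."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0) / sigma**2)

class RLSC:
    def __init__(self, X, C=10.0, sigma=0.5):
        self.X, self.sigma = X, sigma
        K = gaussian_gram(X, sigma)
        l = K.shape[0]
        self.K_inv = np.linalg.inv(K + np.eye(l) / C)  # stored once

    def train_class(self, y):
        """y: +1/-1 truth vector for one term. One multiply per new class."""
        return self.K_inv @ y                          # the c vector

    def predict(self, c, x_new):
        """f(x) = sum_i c_i K(x, x_i)."""
        d2 = np.sum((self.X - x_new)**2, axis=1)
        return c @ np.exp(-d2 / self.sigma**2)
```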
Optimizations
◆ Creating the kernel K: the Gaussian kernel can be computed in half the operations, since the kernel matrix K (which by definition is symmetric positive definite) only requires the lower triangle to be stored; the computation is easily parallelizable or vectorized.
◆ Solving the system of equations via Cholesky: K^(−1) = (LL^T)^(−1), where L is derived from the Cholesky decomposition; K is always symmetric positive definite because of the regularization term. There are algorithms for computing both the Cholesky decomposition and the inverse factorization in place. Alternatives: iterative gradient (conjugate gradient), pseudoinverse, etc.
◆ On a single 4 GB machine, l < 40,000: ((l·(l+1))/2)·4 bytes = 3.2 GB
◆ Accuracy of the classifier increases as l goes up
◆ Random subsampling on the observation space over each node available
In our implementations, we use the single-precision LAPACK inverse of a packed lower triangular matrix (SPPTRI).

Query by description [Whitman and Rifkin 2002]
[Figure: query-by-description classifier outputs over 5,000 frames for “quiet,” “loud,” “funky” and “lonesome”]

Good terms          Bad terms
Electronic   33%    Annoying     0%
Digital      29%    Dangerous    0%
Gloomy       29%    Fictional    0%
Unplugged    30%    Magnetic     0%
Acoustic     23%    Pretentious  1%
Dark         17%    Gator        0%
Female       32%    Breaky       0%
Romantic     23%    Sexy         1%
Vocal        18%    Wicked       0%
Happy        13%    Lyrical      0%
Classical    27%    Worldwide    2%
Baseline = 0.14%

◆ Collect all terms through Community Metadata as ground truth against the corresponding artist feature space: the artist (broad) level!
◆ Evaluation: on a held-out test set of audio (with known labels), how well does each classifier predict its label?
◆ In the evaluation model, bias is countered: accuracy of positive association times accuracy of negative association = “P(a) overall accuracy”

Linguistic Experts for Parameter Discovery
Given a set of “grounded” single terms, we present our method for uncovering parameter spaces between terms and learning the knobs to vary along them. The model states that certain knowledge can be computed from sensory input while other knowledge is intrinsic, supplied by a “linguistic expert”: if we hear “quiet” audio, we would need to know which terms are antonymially related before learning a space between them.

[Whitman and Ellis 2004]
To evaluate the models we compute the testing gram matrix and check each learned c against each audio frame in the test set. If P(a_p) is the overall positive accuracy (i.e. given an audio frame, the probability that a positive association to a term is predicted) and P(a_n) indicates overall negative accuracy, P(a) is defined as P(a_p)·P(a_n). This measure gives us a tangible feeling for how our models are working against the held-out test set and is useful for grounded term prediction and the review trimming experiment below. However, to rigorously evaluate our term model's performance in a review generation task, we note that this value has an undesirable dependence on the prior probability of each term label and rewards classifiers with a very high natural df, often by chance. Instead, for this task we use a model of relative entropy, using the Kullback-Leibler (K-L) distance to a random-guess probability distribution. We use the K-L distance in a two-class problem described by the four trial counts in a confusion matrix:

              “funky”  “not funky”
funky            a          b
not funky        c          d

KL = (a/N)·log(N·a / ((a+b)(a+c)))
   + (b/N)·log(N·b / ((a+b)(b+d)))
   + (c/N)·log(N·c / ((a+c)(c+d)))
   + (d/N)·log(N·d / ((b+d)(c+d)))    (3)
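A minimal sketch of the two evaluation measures above: the bias-countered P(a) = P(a_p)·P(a_n), and the K-L bits of a term model measured against random guessing on its confusion counts a, b, c, d. Base-2 logarithms are assumed here so that the result is in bits, matching the "K-L bits" reported in the table below.

```python
import math

def p_a(tp, fn, tn, fp):
    """P(a_p): positive recall; P(a_n): negative recall; P(a) is the product."""
    return (tp / (tp + fn)) * (tn / (tn + fp))

def kl_bits(a, b, c, d):
    """K-L distance of confusion counts [[a, b], [c, d]] from independence."""
    N = a + b + c + d
    def term(x, row, col):
        return 0.0 if x == 0 else (x / N) * math.log2(N * x / (row * col))
    return (term(a, a + b, a + c) + term(b, a + b, b + d) +
            term(c, c + d, a + c) + term(d, c + d, b + d))
```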
" d Nd + log N (b + d) (c + d) Table 2. Selected top-performing models of adjective and “funky” “not funky” noun funky phrase terms used bto predict new reviews of music a not funky c d with their corresponding bits of information from the K-L distance measure. (3) 5 Semantic Basis Functions a Anchor models b Music intelligence evaluation c Media intelligence vectors of size m × n, V is the right singular matrix matrix of size n × n, and Σ is a diagonal matrix of the singular values σk . The highest singular value will be in the upper left of the diagonal matrix Σ and in descending order from the top-left. For w of the a matrix A when the λ is an if and covariance matrixAw input=ofλw. AAT(λ , Uare and VTeigenvalues: will be equivalent foreigenvalue the non-zero only ifeigenvalued det(A − λI) = 0.) vectors. To reduce rank of the observation matrix A we simply choose the top r vectors of U and the top r singular values in Σ. We use the singular value decomposition (SVD) [33] to compute the eigenvectors and To compute a weight matrix w from the decomposition we multiply our (cropped) eigenvalues: eigenvectors by a scaled version of our (cropped) A= UΣVTsingular values: [74] (6.3) w= √ Σ−1 UT (6.4) Here, if A is of size m × n, U is the left singular matrix composed of the singular This w will now be of size r × m. To project your original data (or new data) through vectors of size m × n, V is the right singular matrix matrix of size n × n, and Σ is the weight matrix you simply multiply w by A, resulting in a whitened and rank rea diagonal the rsingular values σk . The singularprojected value will be in the duced matrix matrix f of of size × n. To ‘resynthesize’ rank highest reduced matrices through upperwleft thecompute diagonal Σ andthis in new descending youof first w−1matrix and multiply iw by f . order from the top-left. For T the covariance matrix input of AA , U and multiply. VT will be equivalent for the non-zero where × is a per-element The divergence measure [PCA] here is The intuition behind PCA isrank to reduce dimensionality of an set; by the eigenvalued vectors. To reduce of thethe observation matrix Aobservation we simply choose creasing thetofollowing two update rules: [Lee, Seunggiven 1999]needed the eigenvectors regenerate top r ordering vectors of U and the top r singular valuesthe in matrix Σ. and ‘trimming’ only the top Originalthe rate of lossy compression. The compression is the experimenter can choose T that V achieved through analysis of the correlated dimensions so that dimensions move W · To compute a weight matrix w from the decomposition we multiply our (cropped) W·H in the same direction are minimized. Geometrically,H the = SVDH (and, by extension, PCA) × eigenvectors by a scaled version of our (cropped) singular values: [74] T between is explained as the top r best rotations of your input data space so that variance W ·1 √ the dimensions are maximized. V T −1 UT · H w = Σ (6.4) W·H ! = W=W× 1 · HT r, NMF = 6.2.5 NMF This w will now be of size r × m. To project your original data (or new data) through the weight matrix you simply multiply w by A, resulting in a whitened and rank refactorization [44] isof that enforces am × n(NMF) matrix all 1.decomposition VQNon-negative PCA duced matrixwhere f of matrix size1r is × n. To ‘resynthesize’ ranka matrix reduced matrices projected through a positivity constraint on the bases. Given a positive input matrix V of size m × n, it −1 w youis first compute w and multiply this new iw by f . 
Statistical Rank Reduction: VQ, PCA, NMF

[NMF]
Non-negative matrix factorization (NMF) [44] is a matrix decomposition that enforces a positivity constraint on the bases. Given a positive input matrix V of size m × n, it is factorized into two matrices, W of size m × r and H of size r × n, where r ≤ m, and the error of the reconstruction W·H ≈ V is minimized. The divergence measure here is decreasing under the following two update rules [Lee, Seung 1999]:

H ← H × (W^T · (V / (W·H))) / (W^T · 1)
W ← W × ((V / (W·H)) · H^T) / (1 · H^T)

where × and / are per-element operations and 1 is an m × n matrix of all ones.

[Heisele, Serre, Pontil, Vetter, Poggio 2001] Component experts (left eye, nose, mouth, each a linear SVM) are shifted over a 58 × 58 window; we also provide the combination classifier (another linear SVM) with the precise positions of the detected components relative to the upper left corner of the window. Overall we have three values per component classifier that are propagated to the combination classifier: the maximum output of the component classifier and the image coordinates of that maximum, (O_1, X_1, Y_1, …, O_14, X_14, Y_14).

Figure 2: System overview of the component-based classifier. 1. Shift a 58 × 58 window over the input image. 2. Shift component experts over the 58 × 58 window. 3. For each component k, determine its maximum output within a search region and its location: (O_k, X_k, Y_k). 4. Final decision: face / background.

[Berenzweig, Ellis, Lawrence 2002] [Apple 2003] [Whitman 2003]

[Figure 1: Comparison of the top five bases for each type of decomposition (PCA, NMF, semantic), trained from a set of five-second power spectral density frames. The PCA weights aim to maximize variance, the NMF weights try to find separable additive parts, and the semantic weights map the best possible labels to the generalized observations.]

[Figure 3: Confusion matrices for the four experiments. Top: no dimensionality reduction, and PCA with r = 10. Bottom: NMF with r = 10, and semantic rank reduction with r = 10. Lighter points indicate that examples from artists on the x-axis were thought to be by artists on the y-axis.]

…training across the board, with perhaps the NMF hurting the accuracy versus not having a reduced-rank representation at all. For the test case, results vary widely: PCA shows a slight edge over no reduction in the per-observation metric, while NMF appears to hurt.

Semantic Rank Reduction
Basis extraction: set community metadata terms against the observed audio.
“What the community hears” / “What the community thinks” / “What are the most important things to a community?”
Sorted class P(a) outputs: Electronic 33%, Digital 29%, Gloomy 29%, Unplugged 30%, Acoustic 23%, Dark 17%, Female 32%, Romantic 23%, Vocal 18%, Happy 13%, Classical 27%.

Semantic Basis Functions
The experimenter chooses r from the sorted class P(a) outputs. New audio is represented as the predicted community reaction to the signal:
“Electronic” 0.45, “Digital” 0.21, “Gloomy” −0.12, “Unplugged” −0.45, “Acoustic” 0.84
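A minimal sketch of semantic basis functions as just described: keep the r best-grounded term classifiers (ranked by P(a)) and represent new audio as the vector of their real-valued outputs, the predicted community reaction. It reuses the RLSC sketch from earlier; the dictionary plumbing is an illustrative assumption.

```python
import numpy as np

def semantic_basis(term_classifiers, p_a_scores, r):
    """term_classifiers: {term: c vector}; p_a_scores: {term: P(a)}.
    Returns the r terms with the highest P(a) and their classifiers."""
    best = sorted(p_a_scores, key=p_a_scores.get, reverse=True)[:r]
    return [(t, term_classifiers[t]) for t in best]

def project(rlsc, basis, x_new):
    """New audio frame -> r-dimensional semantic feature vector."""
    return np.array([rlsc.predict(c, x_new) for _, c in basis])
```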
Artist ID: rare true ground truth in music-IR!
Test: given a set of c artists / classes, with training data for each, how many of a set of n songs can be placed in the right class in testing?
◆ Album effect: learning producers instead of musical content
◆ Time-aware: the “Madonna” problem
◆ Data density / overfitting: sensitive to the rate of the feature and the amount of data per class
◆ Features or learning?

Evaluation: artist ID accuracy, 1-in-20, obs = 8000
Semantic rank reduction   67.1%
No rank reduction         22.2%
PCA                       24.6%
NMF                       19.5%
Baseline (random)          3.9%

7.2.1 Images and Video

…informed by the probabilities p(i) of each symbol i in X. More “surprising” symbols in a message need more bits to encode, as they are less often seen. This equation commonly gives an upper bound for compression ratios and is often studied from an artistic standpoint. [54] In this model the signal contains all the information: its significance is defined by its self-similarity and redundancy, a very absolutist view. However, we intend instead to consider the meaning of those bits; by working with other domains, different packing schemes, and methods for synthesizing new data from these significantly semantically-attached representations, we bring meaning back into the notion of information.

Figure 7-3: Top terms for community metadata vectors associated with the image at left.
adj term (score): religious 1.4, human 0.36, simple 0.21, beautiful 0.13, free 0.10, small 0.33
np term (score): in austrailia 0.003, light and shadow 0.003, this incredibly beautiful country 0.002, sunsets 0.002, god’s creations 0.002, hope 0.002, the southeast portion 0.002

High term type accuracy        Low term type accuracy
sea        np   20%            antiquarian  adj  0%
pure       adj  18.7%          boston       np   0%
pacific    np   17.1%          library      np   0%
cloudy     adj  17.1%          analytical   adj  0%
air        np   17.1%          disclaimer   np   0%
colorful   adj  11.1%          generation   np   0%
Perceptual Text Analysis

We look to the success of our grounded term models for insights into the musicality of text, and develop a “review trimming” system that summarizes reviews and retains only the most descriptive content. The trimmed reviews can then be fed into further textual understanding systems or read directly by the listener. We used two separate evaluation techniques to test the strength of our term predictions: the recall-product P(a) measure described above, and the K-L distance.

A simulation established that a random association of these two datasets gives a correlation coefficient of magnitude smaller than r = 0.080 with 95% confidence. Thus, these results indicate a very significant correlation between the automatic and ground-truth ratings. The Pitchfork model did not fare as well, with r = 0.127 (baseline of r = 0.082 with 95% confidence). Figure 1 shows the scatter plot/histograms for each experiment; we see that the audio predictions are mainly bunched around the mean of the ground-truth ratings and have a much smaller variance. Visually, it is hard to judge how well the review information has been captured. However, the correlation values demonstrate that the automatic analysis is indeed finding and exploiting informative features.

To trim a review we create a grounding sum operated on a sentence s of word length n,

g(s) = (Σ_{i=0..n} P(a_i)) / n    (4)

where a perfectly grounded sentence (in which the predictive quality of each term on new music has 100% precision) scores 100%. This upper bound is virtually impossible in a grammatically correct sentence, and we usually see g(s) of {0.1% .. 10%}. The user sets a threshold and the system simply removes sentences under the threshold.

Table 3. Selected sentences and their g(s) in a review trimming experiment. From Pitchfork’s review of Air’s “10,000 Hz Legend.”

g(s)     Sentence
3.170%   The drums that kick in midway are also decidedly more similar to Air’s previous work.
2.257%   But at first, it’s all Beck: a harmonica solo, folky acoustic strumming, Beck’s distinctive, marble-mouthed vocals, and tolls ringing in the background.
2.186%   But with lines such as, “We need to use envelope filters/ To say how we feel,” the track is also an oddly beautiful lament.
1.361%   The beat, meanwhile, is cut from the exact same mold as The Virgin Suicides, from the dark, ambling pace all the way down to the angelic voices coalescing in the background.
0.584%   After listing off his feelings, the male computerized voice receives an abrupt retort from a female computerized voice: “Well, I really think you should quit smoking.”
0.449%   I wouldn’t say she was a lost cause, but my girlfriend needed The doctor like I needed, well, a girlfriend.
0.304%   She’s taken to the Pixies, and I’ve taken to, um, lots of sex.
0.298%   Needless to say, we became well acquainted with the album, which both of us were already fond of to begin with.
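A minimal sketch of the review trimmer above: score each sentence by the mean P(a) of its grounded terms and drop sentences under a user-set threshold. The `term_pa` mapping and the single-word tokenization are simplifying assumptions; the real system also scores noun phrases.

```python
def g(sentence, term_pa):
    """Grounding sum of a sentence: mean P(a) over its words."""
    words = sentence.lower().split()
    return sum(term_pa.get(w, 0.0) for w in words) / max(len(words), 1)

def trim_review(sentences, term_pa, threshold=0.01):
    """Keep only sentences whose grounding sum g(s) clears the threshold."""
    return [s for s in sentences if g(s, term_pa) >= threshold]
```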
[Chart: percentage of review kept vs. g(s) threshold (2.0 down to 0.2), for Pitchfork and AMG reviews]

[June of 44, “Four Great Points”]
June of 44’s fourth full-length is their most experimental effort to date -- fractured melodies and dub-like rhythms collide in a noisy atmosphere rich in detail, adorned with violins, trumpet, severe phasing effects, and even a typewriter. - Jason Ankeny (4.15)

[Arovane, “Tides”]
The homeless lady who sits outside the yuppie coffee bar on the corner of my street assures passers-by that the end is coming. I think she’s desperate to convey her message. Though the United States is saber-rattling with the People’s Republic of China, it seems that everyone has overcome their millennial tension, and the eve of destruction has turned to a morning of devil-may-care optimism. Collectively, we’re overjoyed that, without much effort or awareness, we kicked the Beast’s ass. The Beast, as prophesied by some locust-muncher out in the Negev Desert thousands of years ago, was supposed to arrive last year and annihilate us before being mightily smote by our Lord and Savior Jesus Christ. I missed this. Living as I do in America’s capital, the seat of iniquity and corruption, I should have had ring-side seats to the most righteous beatdown of all time. I even missed witnessing the Rapture, the faithful’s assumption to the right hand of God that was supposed to occur just before Satan’s saurian shredded all of creation.... [it goes on like this for a while] - Paul Cooper (0.862)

Perceptual Text Analysis

Problems and Future
◆ “Human” meaning vs. “computer” meaning: the junior problem
◆ Target scale: artist vs. album vs. song
◆ Better audio representation
◆ Other multimedia domains
◆ Human evaluation: community modeling, query by description, similarity / recommendation

Thanks: Barry Vercoe & the MMM group, esp. Youngmoo, Paris, Keith, Michael Casey, Judy, Mike Mandel, Wei, Victor, John, Nyssim, Rebecca, Kristie, Tamara Hearn. Dan Ellis & Columbia: Adam Berenzweig, Ani Nenkova, Noemie Elhadad. Deb Roy. Ben Recht, Ryan Rifkin, Jason, Mary, Tristan, Rob A, Hugo S, Ryan McKinley, Aggelos, Gemma & Ayah & Tad & Limor, Hyun, Cameron, Peter G. Dan P., Chris C, Dan A, Andy L., Barbara, Push, Beth Logan. ex-NECI: Steve Lawrence, Gary Flake, Lee Giles, David Waltz. Kelly Dobson, Noah Vawter, Ethan Bordeaux, Scott Katz, Tania & Ruth, Lauren Kroiz. Drew Daniel, Kurt Ralske, Lukasz L., Douglas Repetto. Bruce Whitman, Craig John and Keith Fullerton Whitman. Stanley and Albert (mules), Wilbur (cat), Sara Whitman and Robyn Belair. Sofie Lexington Whitman.

Selected Publications:
Whitman, Brian and Daniel P.W. Ellis. “Automatic Record Reviews.” In Proceedings of ISMIR 2004, the 5th International Conference on Music Information Retrieval. October 10-14, 2004, Barcelona, Spain.
Berenzweig, Adam, Beth Logan, Daniel Ellis and Brian Whitman. “A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures.” Computer Music Journal, Summer 2004, 28(2), pp. 63-76.
Whitman, Brian. “Semantic Rank Reduction of Music Audio.” In Proceedings of the 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). October 19-22, 2003, New Paltz, NY, pp. 135-138.
Whitman, Brian, Deb Roy and Barry Vercoe.
“Learning Word Meanings and Descriptive Parameter Spaces from Music.” In Proceedings of the HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-Linguistic Data. May 26-31, 2003, Edmonton, Alberta, Canada.
Whitman, Brian and Ryan Rifkin. “Musical Query-by-Description as a Multiclass Learning Problem.” In Proceedings of the IEEE Multimedia Signal Processing Conference. December 8-11, 2002, St. Thomas, USA.
Ellis, Daniel, Brian Whitman, Adam Berenzweig and Steve Lawrence. “The Quest For Ground Truth in Musical Artist Similarity.” In Proceedings of the 3rd International Conference on Music Information Retrieval. October 13-17, 2002, Paris, France.
Whitman, Brian and Paris Smaragdis. “Combining Musical and Cultural Features for Intelligent Style Detection.” In Proceedings of the 3rd International Conference on Music Information Retrieval. October 13-17, 2002, Paris, France.
Whitman, Brian and Steve Lawrence. “Inferring Descriptions and Similarity for Music from Community Metadata.” In “Voices of Nature,” Proceedings of the 2002 International Computer Music Conference, pp. 591-598. September 16-21, 2002, Göteborg, Sweden.

Questions?