Prerequisites Tutorial: Music Information Retrieval Basic high school math Probability and Statistics Linear Algebra Computer programming Basic music theory George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 226 G. Tzanetakis 2 / 226 Textbook This tutorial is based on slide material created for an online course and associated textbook. Video recordings of some of the lectures , more detailed slides and draft of the book (all under heavy construction) can be found at: [http://marsyas.cs.uvic.ca/mirBook] The goal of the tutorial is to familiarize researchers engaged in multimedia information retrieval (typically image and video, sometimes audio/speech) with the work done in music information retrieval. I will be covering some background in audio digital signal processing that is needed but will assume familiarity with standard machine learning/data mining. Current draft on webpage - will be frequently updated and dated. Also there is much more material for the remaining chapters that exists in other documents, papers, etc that I have that I will be editing and trasferring in the future. MIR Unfortunately we are stuck with kind of similar acronyms. G. Tzanetakis 3 / 226 G. Tzanetakis 4 / 226 Education and Academic Work Experience 1997 BSc in Computer Science (CS), University of Crete, Greece 1999 MA in CS, Princeton University, USA 2002 PhD in CS, Princeton University, USA 2003 PostDoc in CS, Carnegie Mellon University, USA 2004 Assistant Professor in CS, Univ. of Victoria, Canada 2010 Associate Professor in CS, Univ. of Victoria, Canada 2010 Canada Research Chair (Tier II) in Computer Analysis of Audio and Music Music theory, saxophone and piano performance, composition, improvisation both in conservatory and academic settings Main focus of my research has been Music Information Retrieval (MIR) Involved from the early days of the field Have published papers in almost every ISMIR conference Organized ISMIR in 2006 Tutorials on MIR in several conferences G. Tzanetakis 5 / 226 Research G. Tzanetakis 6 / 226 Work Experience beyond Academia Inherently inter-disciplinary and cross-disciplinary work. Connecting theme: making computers better understand music to create more effective interactions with musicians and listeners. Audio analysis is challenging due to large volume of data - did big data before it became fashionable. Music Information Retrieval Digital Signal Processing Machine Learning Human-Computer Interaction Software Engineering G. Tzanetakis Many internships in research labs throughout studies. Several consulting jobs while in academia. A few representative examples: Moodlogic Inc (2000). Designed and developed one of the earliest audio fingerprinting systems (patented) 100000 users matching to 1.5 million songs Teligence Inc (2005). Automatic male/female voice discrimination for voice messages used in popular phone dating sites - processing of 20000+ recordings per day. Artifical Intelligence Multimedia Robotics Visualization Programming Languages 7 / 226 G. Tzanetakis 8 / 226 Marsyas Visiting Scientist at Google Research (6 months) Things I worked on (of course as part of larger teams): Cover Song Detection (applied to every uploaded YouTube video). 
100 hours of video are uploaded to YouTube every minute Content ID scans over 250 years of video every day - 15 million references Audio Fingerprinting (part of Android Jelly Bean) Named inventor on 6 pending US patents related to audio matching and fingerprinting Music Analysis, Retrieval and Synthesis for Audio Signals Open source in C++ with Python Bindings Started by me in 1999 - core team approximately 4-5 developers Approximately 400 downloads per month Many projects in industry and academia State-of-the-art performance while frequently orders of magnitude faster than other systems G. Tzanetakis 9 / 226 History of MIR before computers G. Tzanetakis Brief History of computer MIR Pre-history (< 2000): scattered papers in various communities. Symbolic processing mostly in digital libraries and information retrieval venues and audio processing (less explored) mostly in acoustics and DSP venues. The birth 2000: first International symposium on Music Information Retrieval (ISMIR) with funding from NSF Digital Libraries II initiative organized by J. Stephen Downie, Time Crawford and Don Byrd. First contact between the symbolic and the audio side. 2000-2006 Rapid growth 2006-2014 Slower growth and steady state How did a listener encounter a new piece of music throughout history ? Live performance Music Notation Physical recording Radio G. Tzanetakis 10 / 226 11 / 226 G. Tzanetakis 12 / 226 Conceptual MIR dimensions I Conceptual MIR dimensions II Data sources: Audio Track metadata Score Lyrics Reviews Ratings Download patterns Micro-blogging Stages Representation/Hearing Analysis/Learning Interaction/Action Specificity Audio fingerprinting Common score performance Cover song detection Artist identification Genre classification Recommendation ? G. Tzanetakis 13 / 226 MIR Tasks G. Tzanetakis Digital Audio Recordings Recordings in analog media (like vinyl or magnetic tape) degrade over time Digital audio representations theoretically can remain accurate without any loss of information through copying of patterns of bits. MIR requires a distilling information from an extremely large amount of data Digitally storing 3 minutes of audio requires approximately 16 million numbers. A tempo extraction program must somehow convert these to a single numerical estimate of the tempo. Similarity retrieval, playlists, recommendation Classification and clustering Tag annotation Rhythm, melody, chords Music transcription and source separation Query by humming Symbolic MIR Segmentation, structure, alignment Watermarking, fingerprinting and cover song detection G. Tzanetakis 14 / 226 15 / 226 G. Tzanetakis 16 / 226 Production and Perception of Periodic Sounds Pitch Perception Pitch When the same sound is repeated more than 10-20 times per second instead of it being perceived as a sequence of individual sound events it is fused into a single sonic event with a property we call pitch that is related to the underlying period of repetition. Note that this fusion is something that our perception does rather than reflect some underlying singal change other than the decrease of the repetition period. Animal sound generation and perception The sound generation and perception systems of animals have evolved to help them survive in their environment. From an evolutionary perspective the intentional sounds generated by animals should be distinct from the random sounds of the environment. 
Repetition Repetition is a key property of sounds that can make them more identifiable as coming from other animals (predators, prey, potential mates) and therefore animal hearing systems have evolved to be good at detecting periodic sounds. G. Tzanetakis 17 / 226 Time-Frequency Representations 18 / 226 Spectrum Music Notation When listening to mixtures of sounds (including music) we are interested in when specific sounds take place (time) and what is their source of origin (pitch, timbre). This is also reflected in music notation which fundamentally represents time from left to right and pitch from bottom to top. G. Tzanetakis G. Tzanetakis Informal definition of Spectrum A fundamental concept in DSP is the notion of a spectrum. Informally complex sounds such as the ones produced by musical instruments and their combinations can be modeled as linear combinations of simple elementary sinusoidal signals with different frequencies. A spectrum shows how “much” each such basis sinusoidal component contributes to the overall mixture. It can be used to extract information about the sound such as its perceived pitch or what instrument(s) are playing. A spectrum corresponds to a short snapshot of the sound in time. 19 / 226 G. Tzanetakis 20 / 226 Spectrum example Spectrograms Spectrum of a tenor saxophone note G. Tzanetakis Spectrograms Music and sound change over time. A spectrum does not provide any information about the time evolution of different frequencies. It just shows the relative contribution of each frequency to the mixture signal over the duration analyzed. In order to capture the time evolution of sound and music the standard approach is to segment the audio signal into small chunks (called windows or frames) and calculate the spectrum for each of these windows. The assumption is that during the relatively short period of analysis (typically less than a second) there is not much change and therefore the calculated short-time spectrum is an accurate representation of the underlying signal. The resulting sequence of spectra over time is called a spectrogram. 21 / 226 Examples of spectrograms 22 / 226 Waterfall spectrogram view Waterfall display using sndpeek Spectrogram of a few tenor saxophone notes G. Tzanetakis G. Tzanetakis 23 / 226 G. Tzanetakis 24 / 226 Why is DSP important for MIR ? DSP for MIR A large amount of MIR research deals with audio signals. Audio signals are represented digitally as very long sequences of numbers. Digital Signal Processing techniques are essential in extracting information from audio signals. The mathematical ideas behind DSP are amazing. For example it is through DSP that you can understand how any sound that you can hear can be expressed as a sum of sine waves or represented as a long sequence of 1’s and 0’s. G. Tzanetakis Digital Signal Processing is a large field and therefore impossible to cover adequately in this course. The main goal of the lectures focusing on DSP will be to provide you with some intuition behind the main concepts and techniques that form the foundation of many MIR algorithms. I hope that they serve as a seed for growing a long term passion and interest for DSP and the textbook provides some pointers for further reading. 25 / 226 Sinusoids 26 / 226 What is a sinusoid ? Family of elementary signals that have a particular shape/pattern of repetition. 
sin(ωt) and cosin(ωt) are particular examples of sinusoids that can be described by the more general equation: We start our exposition with discussing sinusoids which are elementary signals that are crucial in understading both DSP concepts and the mathematical notation used to understand them. Our ultimate goal of the DSP lectures is to make equations such as less intimidating and more meaningfull: Z ∞ X (f ) = x(t)e −j2πft dt (1) x(t) = sin(ωt + φ) (2) where ω is the frequency and φ is the phase. There is an infinite number of continuous periodic signals that belong to the sinusoid family. Each is characterized by three numbers: the amplitude the frequency and the phase. −∞ G. Tzanetakis G. Tzanetakis 27 / 226 G. Tzanetakis 28 / 226 4 motivating viewpoints for sinusoids Solutions to the differential equations that describe simple systems of vibration Family of signals that pass “unchanged” through LTI systems Phasors (rotating vectors) providing geometric intution about DSP concepts and notation Basis functions of the Fourier Transform Figure : Simple sinusoids G. Tzanetakis 29 / 226 Linear Time Invariant Systems 30 / 226 Sinusoids and LTI Systems Definition Systems are transformations of signals. They take a input a signal x(t) and produce a corresponding output signal y (t). Example: y (t) = [x(t)]2 + 5. When a sinusoids of frequency ω goes through a LTI system it “stays” in the family of sinusoids of frequency ω i.e only the amplitude and the phase are changed by the system. Because of linearity this implies that if a complex signal is a sum of sinusoids of different frequencies then the system output will not contain any new frequencies. The behavior of the system can be completely understood by simply analyzing how it responds to elementary sinusoids. Examples of LTI systems in music: guitar boy, vocal tract, outer ear, concert hall. LTI Systems Linearity means that one can calculate the output of the system to the sum of two input signals by summing the system outputs for each input signal individually. Formally if y1 (t) = S{x1 (t)} and y2 (t) = S{x2 (t)} then S{x1 (t) + x2 (t)} = ysum (t) = y1 (t) + y2 (t). Time invariance shift in input results in shift in output. G. Tzanetakis G. Tzanetakis 31 / 226 G. Tzanetakis 32 / 226 Thinking in circles Projecting a phasor The projection of the rotating vector or phasor on the x-axis is a cosine wave and on the y-axis a sine wave. Key insight Think of sinusoidal signal as a vector rotating at a constant speed in the plane (phasor) rather than a single valued signal that goes up and down. Amplitude = Length Frequency = Speed Phase = Angle at time t G. Tzanetakis 33 / 226 Notating a phasor G. Tzanetakis 34 / 226 Multiplication by j Complex numbers An elegant notation system for describing and manipulating rotating vectors. Multiplication by j is an operation of rotation in the plane. You can think of it as rotate +90 degrees counter-clockwise. Two successive rotations by +90 degrees bring us to the negative real axis, hence j 2 = −1. This geometric viewpoint shows that there is nothing imaginary or strange about complex numbers. x + jy where x is called the real part and y is called the imaginary part. If we represent a sinusoid as a rotating vector then using complex number notation we can simply write: cos(ωt) + jsin(ωt) G. Tzanetakis 35 / 226 G. 
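To make the phasor picture concrete, here is a minimal numpy sketch (an added illustration, not code from the tutorial) that builds a rotating vector e^{jωt} and checks that its projections on the real and imaginary axes are the cosine and sine described above, and that multiplying by j rotates it by 90 degrees. The sampling rate and frequency are arbitrary choices.

import numpy as np

fs = 8000.0                                 # sampling rate in Hz (arbitrary choice)
f0 = 440.0                                  # phasor frequency in Hz (arbitrary choice)
t = np.arange(0, 0.01, 1.0 / fs)            # 10 ms of time instants
phasor = np.exp(1j * 2 * np.pi * f0 * t)    # rotating vector e^{jωt}

# Projection on the real axis is a cosine wave, on the imaginary axis a sine wave.
assert np.allclose(phasor.real, np.cos(2 * np.pi * f0 * t))
assert np.allclose(phasor.imag, np.sin(2 * np.pi * f0 * t))

# Multiplying by j rotates the vector by +90 degrees; doing it twice negates it (j^2 = -1).
assert np.allclose(1j * (1j * phasor), -phasor)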
Tzanetakis 36 / 226 Adding sinusoids of the same frequency I Adding sinusoids of the same frequency II Geometric view of the property that sinusoids (phasors) of a particular frequency ω are closed under addition. G. Tzanetakis 37 / 226 Negative frequencies and phasors G. Tzanetakis 38 / 226 Book that inspired this DSP exposition A Digital Signal Processing Primer by Ken Steiglitz G. Tzanetakis 39 / 226 G. Tzanetakis 40 / 226 Summary Sampling Discretize continuous signal by taking regular measurements in time (discretization of the measurements is called quantization) Notation: fs is sampling rate in Hz, ωs is sampling rate in radians per second Sampling a sinusoid - only frequencies below half the sampling rate (Nyquist frequency) will be accurately reprsented after sampling For sinusoid at ω0 then all frequencies ω0 + kωs are aliases Sinusoidal signals are fundamental in understanding DSP Representing them as phasors (i.e vectors rotating at a constant speed) can help understand intuitively several concepts in DSP Complex numbers are an elegant system for expressing rotations and can be used to notate phasors in a way that leverages our knowledge of algebra Thinking this way makes e jωt more intuitive. G. Tzanetakis 41 / 226 Phasor view of aliasing G. Tzanetakis Frequency Domain Illustration of sampling at a high sampling rate compared to the phasor frequency, sampling at the Nyquist rate, and slightly above. Numbers indicate the discrete samples of the continuous phasor rotation. Each sample is a complex number. G. Tzanetakis 42 / 226 Any periodic sound can be represented as a sum of sinusoids (or equivalently phasors) This representation is called a frequency domain representation and the linear combination coefficients are called the spectrum Commonly used variants: Fourier Series, Discrete Fourier Transform, the z-transform, and the classical continuous Fourier Transform These transforms provide procedures for obtaining the linear combination weights of the frequency domain from the signal in time domain (as well as the inverse direction) 43 / 226 G. Tzanetakis 44 / 226 2D coordinate system Inner product (projection) properties A vector ~v in 2-dimensional space can be written as a combination of the 2 unit vectors in each coordinate direction. The inner product operation < v̂ , ŵ > corresponds to the ~ . It is the sum of P projection of ~v onto w the products of like coordinates < v , w >= vx wx + vy wy = N−1 i=0 vi wi . G. Tzanetakis < ~x , ~y >= 0 then the vectors are orthogonal ~ >=< ~u , w ~ > + < ~v , w ~ > distributive law < ~u + ~v , w Basis vectors are orthogonal and have length 1 < ~v , ~v >= vx2 + vy2 is the square of the vector length In order to have the inner product with self be the square of length for vectors of complex numbers we have to slightly change the definition by using the complex conjugate. P ∗ ∗ < v , w >= vx wx∗ + vy wy∗ = N−1 i=0 vi wi . where () denotes the complex conjugate of a number. 45 / 226 Hilbert Spaces 46 / 226 Orthogonal Coordinate System Key idea Generalize the notion of a Euclidean space with finite dimensions to other types of spaces for which a suitable notion of an inner product can be defined. These spaces can have an infitite number of dimensions but as long as we have an appropriate definition of a projection operator/inner product we can reuse a lof the notation and concepts familiar from Euclidean space. For example a space we will investigate are all continuous functions that are periodic with an interval [0, T ]. G. Tzanetakis G. 
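As a small numerical illustration of the aliasing statement in the sampling summary above (an added sketch, not part of the original slides): a sinusoid at f0 and one at f0 + fs produce exactly the same samples, so they cannot be distinguished after sampling. The particular frequencies are arbitrary.

import numpy as np

fs = 1000.0                                        # sampling rate in Hz (arbitrary)
f0 = 60.0                                          # sinusoid frequency in Hz (arbitrary)
n = np.arange(32)                                  # discrete sample indices
x_original = np.cos(2 * np.pi * f0 * n / fs)
x_alias = np.cos(2 * np.pi * (f0 + fs) * n / fs)   # an alias of f0

# The two frequencies are indistinguishable from their samples.
assert np.allclose(x_original, x_alias)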
Tzanetakis We need an orthogonal coordinate system, i.e. a projection (inner product) operator and an orthogonal basis, for each space we are interested in. The Fourier Series, the Discrete Fourier Transform, the z-transform and the continuous Fourier Transform can all be defined by specifying what projection operator to use and what basis elements to use. 47 / 226 G. Tzanetakis 48 / 226 Discrete Fourier Transform Introduction DFT Our input is a finite, length N segment of a digital signal x[0], . . . , x[N − 1]. The DFT is an abstract mathematical transformation and the Fast Fourier Transform (FFT) is a very efficient algorithm for computing it. The FFT is at the heart of digital signal processing and a lot of MIR systems utilize it one way or another. It is applied on sequences of N samples of a digital signal. Similarly to the Fourier Series we will define it using the components of an orthogonal coordinate system: an inner product and a set of basis elements. G. Tzanetakis 49 / 226 DFT basis elements Definition The inner product is what one would expect: < x, y > = Σ_{t=0}^{N−1} x[t] y*[t] Switch the frequency interval from [−ωs/2, +ωs/2] to [0, ωs) as they are equivalent. One possibility would be all phasors in that frequency range: e^{jtω} for 0 ≤ ω < ωs. It turns out we just need N phasors in that range: N frequencies spaced evenly from 0 to the sampling frequency. In radians per sample these are 0, 2π/N, 2(2π/N), . . . , (N − 1)(2π/N), and the corresponding basis is: e^{jtk2π/N} for 0 ≤ k ≤ N − 1 Note: using the definition of the inner product above one can show that these basis elements are indeed orthogonal, i.e. the inner product between any two distinct elements is zero. G. Tzanetakis 50 / 226 The Discrete Fourier Transform (DFT) Definition The DFT can be obtained by projecting the signal onto the basis elements using the inner product definition: X[k] = < x, e^{jtk2π/N} > = Σ_{t=0}^{N−1} x[t] e^{−jtk2π/N} Definition The inverse DFT expresses the time domain signal as a complex weighted sum of the N phasors, with the spectrum X[k] as the weights: x[t] = (1/N) Σ_{k=0}^{N−1} X[k] e^{jtk2π/N} G. Tzanetakis 51 / 226 G. Tzanetakis 52 / 226 Matrix-vector formulation One can view the DFT as a transformation of a sequence of N complex numbers into a different sequence of N complex numbers, and it can be expressed in matrix-vector notation. If we use x and X to denote the N dimensional vectors with components x[t] and X[k] respectively, and define the N × N matrix F by [F]_{k,t} = e^{−jtk2π/N}, then we can write X = Fx and x = F^{−1}X. G. Tzanetakis 53 / 226 Circular Domain The bins of the DFT are numbered 0, . . . , N − 1 but correspond to frequencies in a circular domain covering [−ωs/2, +ωs/2]. G. Tzanetakis 54 / 226 The discrete frequency domain Since the N bins together span the sampling rate, dividing the bin index by N gives the frequency as a fraction of the sampling rate. In the case shown in the figure (N = 16) the bins correspond to the following fractions of the sampling rate: 0, 1/16, 2/16, . . . , 8/16, −7/16, −6/16, . . . , −1/16. G. Tzanetakis 55 / 226 DFT frequency mapping example Example of a 2048-point DFT with a 44100 Hz sampling rate:

bin    frequency (Hz)
0      0
1      21.5
2      43.1
...    ...
1024   22050
1025   -22028.5
1026   -22007
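To connect the projection view of the DFT with practice, the following numpy sketch (an added illustration, not code from the tutorial) computes one DFT coefficient as an inner product with a basis phasor, checks it against numpy's FFT, verifies that two distinct basis phasors are orthogonal, and reproduces the bin-to-frequency mapping of the 2048-point example above.

import numpy as np

N = 2048
fs = 44100.0
t = np.arange(N)
x = np.cos(2 * np.pi * 5 * t / N) + 0.5 * np.random.randn(N)   # arbitrary test signal

def dft_coefficient(x, k):
    """X[k] as the projection of x onto the basis phasor e^{j 2π k t / N}."""
    basis = np.exp(1j * 2 * np.pi * k * t / N)
    return np.sum(x * np.conj(basis))        # inner product <x, basis>

assert np.allclose(dft_coefficient(x, 5), np.fft.fft(x)[5])

# Two distinct basis phasors are orthogonal under this inner product.
b3 = np.exp(1j * 2 * np.pi * 3 * t / N)
b7 = np.exp(1j * 2 * np.pi * 7 * t / N)
assert np.isclose(np.sum(b3 * np.conj(b7)), 0.0, atol=1e-6)

# Bin-to-frequency mapping on the circular domain (bins above N/2 are negative).
freqs = np.fft.fftfreq(N, d=1.0 / fs)
print(freqs[1], freqs[1025])                 # about 21.5 Hz and -22028.5 Hz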
Tzanetakis 56 / 226 Fast Fourier Transform Summary I Sampling a phasor introduces aliasing which means that multiple frequencies (the aliases) are indistinguishable from each other based on the samples We can extend the concept of an orthogonal coordinate system beyond Euclidean space vectors By appropriate definitions of a projection operator (inner product) and basis elements we can formulate transformations from the time domain to the frequency domain such as the Fourier Series and the Discrete Fourier Transform The FFT is a fast implementation of the DFT Straight implementation of the DFT requires O(N 2 ) arithmetic operations. Divide and conquer: do two N/2 DFTs and then merge the results O(NlogN) much faster when N is not small. G. Tzanetakis 57 / 226 Summary II 58 / 226 Music Notation Music notation systems typically encode information about discrete musical pitch (notes on a piano) and timing. Any signal of interest can be expressed as a weighted sum (with complex coefficients) of basis elements that are phasors. The complex coefficients that act as coordinates are called the spectrum of the signal We can obtain the spectrum by projecting the time domain signal to the phasor basis elemnts The DFT output contains frequency between [−ωs /2, +ωs /2] using a circular domain. For real signal the coefficients of the negative frequencies are symmatric to the positive frequencies and carry no additional information. G. Tzanetakis G. Tzanetakis 59 / 226 G. Tzanetakis 60 / 226 Terminology Psychoacoustics The term pitch is used in different ways in the literature which can result in some confusion. Definition The scientific study of sound perception. Perceptual Pitch: is a perceived quality of sound that can be ordered from “low” to “high”. Musical Pitch: refers to a discrete finite set of perceived pitches that are played on musical instruments Measured Pitch: is a calculated quantity of a sound using an algorithm that tries to match the perceived pitch. Monophonic: refers to a piece of music in which a single sound source (instrument or voice) is playing and only one pitch is heard at any particular time instance. G. Tzanetakis Frequently testing the limits of perception: Frequency range 20Hz-20000Hz Intensity (0dB-120dB) Masking Missing fundamental (presence of harmonics at integer multiples of fundamental give the impression of “missing” pitch) 61 / 226 Origins of Psychoacoustics G. Tzanetakis 62 / 226 Pitch Detection Pitch is a PERCEPTUAL attribute correlated but not equivalent to fundamental frequency. Simple pitch detection algorithms most deal with fundamental frequency estimation but more sophisticated ones take into account knowledge about the human auditory system. Pythagoras of Samos established a connection between perception (music intervals) and physical measurable quantities (string lengths) using the monochord. Time Domain Frequency Domain Perceptual G. Tzanetakis 63 / 226 G. Tzanetakis 64 / 226 Time-domain Zerocrossings AutoCorrelation In autocorrelation the signal is delayed and multiplied with itself for different time lags l. The autocorrelation functions has peaks at the lags in which the signal is self-similar. Zero-crossings are sensitive to noise so frequency low-pass filtering is utilized. Definition rx [l] = N−1 X x[n]x[n + l] l = 0, 1, . . . , L − 1 n=0 Efficient Computation Figure : C4 Sine [Sound] X [f ] = DFT {X (t)} S[f ] = X [f ]X ∗ [f ] R[l] = DFT −1 {S[f ]} Figure : C4 Clarient [Sound] G. Tzanetakis 65 / 226 Autocorrelation examples G. 
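The autocorrelation approach above, including the efficient DFT-based computation, can be sketched in a few lines of numpy; this is an added, deliberately crude illustration (fixed search range, no interpolation), not the tutorial's own implementation.

import numpy as np

def autocorrelation_f0(x, fs, fmin=50.0, fmax=1000.0):
    """Crude fundamental frequency estimate from the strongest autocorrelation peak."""
    n = len(x)
    X = np.fft.rfft(x, 2 * n)                   # zero-pad to avoid circular wrap-around
    r = np.fft.irfft(X * np.conj(X))[:n]        # r[l] = sum_n x[n] x[n + l]
    lag_min = int(fs / fmax)                    # shortest lag to consider
    lag_max = int(fs / fmin)                    # longest lag to consider
    lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / lag

fs = 44100.0
t = np.arange(int(0.05 * fs)) / fs              # a 50 ms analysis frame
x = np.sin(2 * np.pi * 261.63 * t)              # C4 sine, as in the figures above
print(autocorrelation_f0(x, fs))                # close to 261.63 Hz (lag quantization)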
Tzanetakis 66 / 226 Average Magnitude Difference Function The average magnitude difference function also shifts the signal but instead of multiplication uses subtraction to detect periodicities as nulls. No multiplications make it efficient for DSP chips and real-time processing. Definition AMDF (m) = N−1 X |x[n] − x[n + m]|k n=0 Figure : C4 Sine G. Tzanetakis Figure : C4 Clarinet Note 67 / 226 G. Tzanetakis 68 / 226 AMDF Examples Frequency Domain Pitch Detection Figure : C4 Sine Figure : C4 Sine Fundamental frequency (as well as pitch) will correspond to peaks in the spectrum (not necessarily the highest though). Figure : C4 Clarinet Note G. Tzanetakis Figure : C4 Clarinet Note 69 / 226 Plotting over time G. Tzanetakis 70 / 226 Modern pitch detection Modern pitch detection algorithm are based on the basic approaches we have presented but with various enhancements and extra steps to make them more effective for the signals of interest. Open source and free implementations available. Figure : Spectrogram YIN from the “yin” and “yang” of oriental philosophy that alludes to the interplay between autocorrelation and cancellation. SWIPE a sawtooh waveform inspired pitch estimator based on matching spectra Figure : Correlogram [Sound] G. Tzanetakis 71 / 226 G. Tzanetakis 72 / 226 Pitch Perception Duplex theory of pitch perception Proposed by J.C.R Licklider in 1951 (also a realy visionary regarding the future of computers) One perception but two overlapping mechanisms Pitch is not just fundamental frequency Periodicity or harmonicity or both ? How can perceived pitch be measured ? A common approach is to adjust sine wave until match In 1924 Fletcher observed that one can still hear a pitch when playing harmonic partials missing the fundamental frequency (i.e bass notes with small radio) G. Tzanetakis Counting cycles of a period < 800Hz Place of excitation along basilar membrane > 1600Hz 73 / 226 The human auditory system 74 / 226 Auditory Models Incoming sound generates a wave in the fluid filled cochlea (causing the basilar membrane to be displaced - 15000 inner hair cells). Originally it was thought that the chochlea acted as a frequency analyzer similar to the Fourier transform and the perceived pitch was based on the place of highest excitation. Evidence from both perception and biophysics showed that pitch perception can not be explained solely by the place theory. G. Tzanetakis G. Tzanetakis From “On the importance of time: a temporal representation of sound” by Malcolm Slaney and R. F. Lyon. 75 / 226 G. Tzanetakis 76 / 226 Perceptual Pitch Scales Musical Pitch Attempt to quantify the perception of frequency Typically obtained through just noticeable difference (JND) experiments using sine waves All agree that perception is linear in frequency below a certain breakpoint and logarithmic above it, but disagree on what that breakpoint is (popular choices include 1000, 700, 625 and 228) Examples: Mel, Bark, ERB G. Tzanetakis In many styles of music a set of finite and discrete frequencies are used rather than the whole frequency continuum. The fundamental unit that is subdivided is the octave (ratio of 2 in frequency). 
Tuning systems subdivide the octave logarithmically into distinct intervals Tension between harmonic ratios for consonant intervals, desire to modulate to different keys, regularlity, and presence of pure fifths (ratio of 1.5 or 3:2) 77 / 226 Pitch Helix 78 / 226 From frequency to musical pitch Sketch of a simple pitch detection algorithm Perform the FFT on a short segment of audio typically around 10-20 milliseoncds Select the bin with the highest peak Convert the bin index k to a frequency f in Hertz: Pitch perception has two dimesions: Height: naturally organizes pitches from low to high Chroma: represents the inherent circularity of pitch (octaves) Linear pitch (i.e log(frequency)) can be wrapped around a cylinder to mode the octave equivalence. G. Tzanetakis G. Tzanetakis f = k ∗ (Sr /N) where Sr is the sampling rate, and N is the FFT size. Map the value in Hertz to a MIDI note number m = 69 + 12log2 (f /440) 79 / 226 G. Tzanetakis 80 / 226 Chant analysis Query by Humming (QBH) Computational Ethnomusicology Transition from oral to written transmission Study how diverse recitation traditions having their origin in primarily non-notated melodies later became codified Cantillion - joint work with Daniel Biro [Link] Users sings a melody [Musart QBH examples] Computer searches a database of refererence tracks for a track that contains the melody Monophonic pitch extraction is the first step Many more challenges: difficult queries, variations, tempo changes, partial matches, efficient indexing Commercial implementation: Midomi/SoundHound Academic search for classical music: Musipedia G. Tzanetakis 81 / 226 Summary 82 / 226 State-space representation Key idea Model everything you want to know about a process of interest that changes behavior over time as a vector of numbers indexed by time There are many fundamental frequency estimation (sometimes also called pitch detection) algorithms It is important to distinguish between fundamental frequency, measured pitch and perceived pitch F0 estimation algortihms can roughly be categorized as time-domain, frequency-domain and perceptual Query-by-humming requires a monophonic pitch extraction step Chant analysis is another more academic application G. Tzanetakis G. Tzanetakis 83 / 226 G. Tzanetakis 84 / 226 Representations for music tracks Short-time Fourier Transform A music track can be represented as a: Trajectory of feature vectors over time Cloud (or bag) of feature vectors (unordered, time ordering lost) Single feature vector (or point in N-dimensional space) G. Tzanetakis 85 / 226 G. Tzanetakis 86 / 226 Windowing Repetition introduces discontinuities at the boundaries of the repeated portions that cause artifacts in the DFT computation. The impact of these artifacts can be reduced by windowing. (a) Basis function G. Tzanetakis 87 / 226 G. Tzanetakis (b) Time domain waveform (c) Windowed sinusoid (d) Windowed waveform 88 / 226 History The big picture Reducing information through frequency summarization and temporal summarization. Origins of audio features for music processing lay in speech proceesing Also they have been informed by work in characterizating timbre Eventually features that are music specific such as chroma vectors and rhyhtmic pattern descriptors were added Unlike measured pitch, audio features do not necessarily have a direct perceptual correlate G. Tzanetakis 89 / 226 Frequency Summarization G. Tzanetakis 90 / 226 Centroid The “center of gravity” of the spectrum. Correlates with pitch and “brightness”. 
Cn = (Σ_{k=0}^{N−1} k |Xn[k]|) / (Σ_{k=0}^{N−1} |Xn[k]|) where n is the frame index, N is the DFT size, and |Xn[k]| is the magnitude spectrum at bin k. G. Tzanetakis 91 / 226 G. Tzanetakis 92 / 226 Rolloff The frequency Rn below which 85% of the energy in the magnitude spectrum is concentrated: Σ_{k=0}^{Rn−1} |Xn[k]| = 0.85 Σ_{k=0}^{N−1} |Xn[k]| G. Tzanetakis 93 / 226 Mel-Frequency Cepstral Coefficients Widely used in automatic speech recognition (ASR) as they provide a somewhat speaker/pitch invariant representation of phonemes. G. Tzanetakis 94 / 226 Cepstrum A measure of periodicity of the frequency response plot. S(e^{jθ}) = H(e^{jθ}) E(e^{jθ}), where H is a linear filter and E is an excitation, so log(|S(e^{jθ})|) = log(|H(e^{jθ})|) + log(|E(e^{jθ})|). This homomorphic transformation turns the convolution of two signals into the sum of their cepstra. It aims to deconvolve the signal (low coefficients model the filter shape, high order coefficients the excitation with a possible F0). G. Tzanetakis 95 / 226 DCT Strong energy compaction, i.e. few coefficients are required to reconstruct most of the energy of the original signal. For certain types of signals it approximates the Karhunen-Loeve transform (the theoretically optimal orthogonal basis). "Low" coefficients represent most of the signal and higher ones can be discarded, i.e. set to 0; MFCCs keep the first 13-20. The MDCT (overlap-based) is used in MP3, AAC, and Vorbis audio compression. G. Tzanetakis 96 / 226 Temporal Summarization A variety of terms have been used to describe methods that summarize a sequence of feature values over time: texture windows, aggregates, modulation features (when detecting modulation), dynamic features (∆), temporal feature integration, fluctuation patterns, pooling (from Neural Networks terminology), and song-level summarization (when the summarization is across the whole track). A texture window of size M starting at feature index n, T[n] = (F[n − M + 1], . . . , F[n]), can be summarized by its mean and standard deviation. Typical frame size: 10-20 msecs; typical texture window size: 1-3 secs. G. Tzanetakis 97 / 226 Figure: Centroid contour, mean centroid, and centroid standard deviation over frames for a Beatles and a Debussy excerpt. [Sound] [Sound] G. Tzanetakis 98 / 226 Pitch Histograms Pitch Histograms of two Jazz pieces (left column) and two Irish Folk music pieces (right column). Average the amplitudes of DFT bins mapping to the same MIDI note number (different averaging shapes can be used, for example triangles or Gaussians). If desired, "fold" the resulting histogram, collapsing bins that belong to the same pitch class into one. Frequently more than 12 bins per octave are used to account for tuning and performance variations. Alternatively, multiple pitch detection can be performed and the detected pitches added to a histogram. G. Tzanetakis 99 / 226 G. Tzanetakis 100 / 226 Pitch Helix and Chroma Chroma Profiles Chroma profile: 12 bins, start with A, chromatic spacing. Chroma of C4 sine G. Tzanetakis 101 / 226 Chroma Profiles G.
Tzanetakis Chroma of C4 clarinet 102 / 226 Summary Chroma profile: 12 bins, start with A, chromatic spacing Sine melody The Short Time Fourier Transform with windowing forms the basis of extracting time-frequency representations (magnitude spectrograms) from audio signals The process of audio feature extraction consists of summarizing in various ways the information in the frequency dimension and across the time dimension Originally audio features used in MIR were inspired by automatic speech recognition (MFCCs) and phychological investigations of timbre (centroid, rolloff, flux). Additional features capturing information specific to music such as Chroma and Pitch Histograms have been proposed. Clarinet melody [Sound] G. Tzanetakis 103 / 226 G. Tzanetakis 104 / 226 Introduction History of Music Notation Earliest known form of music notation in cuneiform Sumerian tablet around 2000 BC. Initially a mnemonic aid to oral instruction, performance and transmission it evolved into a codified set of conventions that transformed how music was created, distributed and consumed across time and space. Notation can be viewed as a visual representation of instructions for how to perform an instrument. Tablature notation for example is specific to stringed instruments. Primary focus of traditional musicology Music notation and theory are complex topics that can take many years to master This presentation barely scratches the surface of the subject The main goal is to provide enough background for students with no formal music training to be able to read and understand MIR papers that use terminology from music notation and theory It is never too late to get some formal music training G. Tzanetakis 105 / 226 Western Common Music Notation G. Tzanetakis Notating rhythm Originally used in European Classical Music is currently used in many genres around the world Mainly encodes pitch and timing (to a certain degree designed for keyboard instruments) Considerable freedom in interpretation Five staff lines G. Tzanetakis 106 / 226 Symbols indicate relative durations in terms of multiples (or fractions) of underlying regular pulse If tempo is specified then exact durations can be computed (for example the first symbol would last 60 seconds / 85 BPM = 0.706 seconds) A different set of symbols is used to indicate rests Numbers under symbols indicate the duration in terms of eighth notes. Each measure is subdivded into 2 half notes, 4 quarter notes, 8 eighth notes. 107 / 226 G. Tzanetakis 108 / 226 Time signature and measures Notating pitches Clef sign anchors the five staff lines to a particular pitch Note symbols are either placed on staff lines or between staff lines. Successive note symbols (one between lines followed by one on a staff line or the other way around) correspond to successive white notes on a keyboard. Invisible staff lines extend above and below Measure (or bar) lines indicate regular groupings of notes Time signature shows the rhythmic content of each measure Compound rhythms consists of smaller rhythmic units G. Tzanetakis 109 / 226 Notating pitches G. Tzanetakis 110 / 226 Repeat signs and structure Repeat signs and other notation conventions can be thought of as a “proto” programming language providing looping constructs and goto statements Hierarchical structure is common i.e ABAA form Structure = segmentation + similarity G. Tzanetakis 111 / 226 G. Tzanetakis 112 / 226 Structure of Naima by J. 
Coltrane Intervals Intervals are pairs of pitches Melodic when the pitches are played in succession Harmonic when the pitches are played simultaneously Uniquely characterized by number of semitones (although typically named using a more complex system) (microtuning also possible) G. Tzanetakis 113 / 226 Naming of intervals G. Tzanetakis 114 / 226 Scales and Modes The most common naming convention for intervals uses two attributes to describe them: quality and number. A scale is a sequence of intervals typically consisting of whole tones and semitones and spanning an octave. Diatonic scales are the ones that can be played using only the white keys on a piano. They are called modes and have ancient greek names. Quality Quality: perfect, major, minor, augmented, diminished. Number Number: unison, second, third, fifth, sixth, seventh, octave and is based on counting staff positions G. Tzanetakis 115 / 226 G. Tzanetakis 116 / 226 Enharmonic Spelling Major/Minor Scales The naming of intervals (and absolute pitches) is not unique meaning that the same exact note can have two different names as in C # and Db. Similarly the same interval can be a minor third or an augmented second. The spelling comes from the role an interval plays as part of a scale as well as the historical tuning practice of having different frequency ratios for enharmonic intervals. The scales used in composed Western classical music are primarily the major and minor scales. The harmonic minor scale has an augmented second (A) that occurs between the 6th and 7th tone. G. Tzanetakis 117 / 226 Chords 118 / 226 Root, Inversions, Voicings A chord is a set of two or more notes that sound simultaneously. A chord label can also be applied to a music excerpt (typically a measure) by inferring, using various rules of harmony, what theoretical chord would sound “good” with the underlying music material. The basis of the western classical and pop music chord system is the triad consisting of three notes. Different naming schemes are used for chords. Jazz and Pop music frequently use naming based on triad with additional modifiers for the non-triad notes. G. Tzanetakis G. Tzanetakis 119 / 226 The lowest note of a chord in its “default” position is called the root. Inversions occur when the lowest note of a chord is different than the root. Voicings are different arrangements of the chord notes that can include repeated notes as well as octaves. G. Tzanetakis 120 / 226 Chord Progressions and Harmony Jazz Lead Sheets Sequences of chords are called chord progressions. Certain progressions are more common than others and also indicate the key of a piece. Frequently chords are constructed from subsets of notes from a particular scale. The root of the scale is called the tonic and defines the key of the piece. For example a piece in C Major will mostly consist of chords formed by the notes of the C major scale. Modulation refers to a change in key. Chords have specific qualities and functions which are studied in Harmonic analysis. G. Tzanetakis 121 / 226 TuneDex G. Tzanetakis G. Tzanetakis 122 / 226 Pianoroll 123 / 226 G. Tzanetakis 124 / 226 MIDI Lilypond Musical Instrument Digital Interface MIDI is both a communication protocol (and associated file format) as well as a hardware connector specification that allows the exchange of information between electronic musical instruments and computers. It was developed in the early 80s and was mostly designed with keyboard instruments in mind. Essentially piano-roll representation of music. G. 
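Since MIDI encodes pitch as note numbers, it may help to see the frequency-to-MIDI conversion used earlier in the simple pitch detection sketch (m = 69 + 12 log2(f/440)) alongside its inverse; the code below is an added illustration that assumes A4 = 440 Hz as the tuning reference and the common C4 = 60 octave naming convention.

import numpy as np

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def hz_to_midi(f, reference=440.0):
    """MIDI note number (possibly fractional) for a frequency in Hz."""
    return 69 + 12 * np.log2(f / reference)

def midi_to_hz(m, reference=440.0):
    """Frequency in Hz for a MIDI note number."""
    return reference * 2.0 ** ((m - 69) / 12.0)

def midi_to_name(m):
    """Note name for a (rounded) MIDI note number, using the C4 = 60 convention."""
    m = int(round(m))
    return NOTE_NAMES[m % 12] + str(m // 12 - 1)

print(hz_to_midi(261.63))   # about 60.0 (middle C)
print(midi_to_hz(69))       # 440.0 (A4)
print(midi_to_name(60))     # C4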
Tzanetakis Music engraving program Text language for input that is complied Encodes much more than just notes and duration in order to produce a visual musical score Produces beautiful looking scores and is free 125 / 226 Music XML 126 / 226 jSymbolic - jMIR Software in Java for extracting high level musical features from symbolic music representations, specifically MIDI files Features capture aspects of instrumentation, texture, rhythm, dynamics, pitch statistics, melody, and chords Part of jMIR a more general package for MIR including audio, lyrics, web feature extraction as well as a classification engine Extensible Markup Language (XML) format for interchanging information about scores Supported by more than a 170 notation, score writing applications Proprietary but open specification Hard to read but comprehensive G. Tzanetakis G. Tzanetakis 127 / 226 G. Tzanetakis 128 / 226 music21 music 21 pitch/duration distribution Distribution of pitches and note duration for a Chopin Mazurka using music21. Set of tools written in Python for computer aided musicology Corpora included is a great feature Works with MusicXML, MIDI Example: add german name (i.e., B=B, B=H, A= Ais) under each note of a Bach chorale G. Tzanetakis 129 / 226 Query-by-Example 130 / 226 Bag-of-frames Distance Measures In this apporach to music recommendation, the user provides a query (or seed) music track as input and the system returns a list of tracks that is ranked by their audio-based similarity to the query. [Query (Mr. Jones by Talking Head) ] Top 3 ranked list results using different feature sets: [Spectral] [Rhythm] [Combined] G. Tzanetakis G. Tzanetakis 131 / 226 Each track can be represented as a bag of feature vectors modeled by a probability density function. Therefore we need some way of measuring the similarity between distributions. Definition The Kullback/Leibler(KL) diverence, also known as the relative entropy, between two probability density functions f (x) and g (x) is Z f (x) D(f ||g ) = f (x)log dx g (x) G. Tzanetakis 132 / 226 KL divergence and Earth Movers distance Some examples The symmetric KL divergence can be formed by taking the average of the divergences D(f ||g ) and D(g ||f ). For Gaussians the KL diverence has a closed form solution but for other distribution models such as Gaussian Mixture Models no such closed form exists. Monte Carlo estimation can be used in these cases. Another common possibility is to use Earth Movers Distance. Informally, if the distributions are interpreted as two different ways of piling up dirt, the EMD is the minimum cost of turning one pile into the other; where the cost is assumed to be amount of dirt moved times the distance by which it is moved. G. Tzanetakis 133 / 226 More extreme examples G. Tzanetakis HipHop Reggae Piano World [Query] [Query] [Query] [Query] [Results] [Results] [Results] [Results] G. Tzanetakis 134 / 226 Genres In these examples I tried to find queries that were atypical and I could not think of good matches in my collection. I find the results fascinating as they reveal, to some extent, what aspects the system is capturing. African Dreamer (Supertramp) Idle Chatter (Computer Music) Tuva throat singing Single vector per track (72 dimensions) with euclidean distance over max/min normalized feature vectors. Results from 2000 experiment using 3500 mp3 clips each 30 second long from my personal collection (diverse but no Kenny G). 
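The last example above uses a single 72-dimensional feature vector per track with Euclidean distance over max/min normalized features; a minimal sketch of that ranking step follows (the feature vectors here are random placeholders standing in for real extracted features).

import numpy as np

rng = np.random.default_rng(0)
features = rng.random((3500, 72))        # placeholder: one 72-dim vector per track

# Max/min normalize every feature dimension to [0, 1] so no single feature dominates.
lo, hi = features.min(axis=0), features.max(axis=0)
normalized = (features - lo) / (hi - lo)

def rank_by_similarity(query_index, top_n=3):
    """Indices of the top_n tracks closest to the query track in Euclidean distance."""
    distances = np.linalg.norm(normalized - normalized[query_index], axis=1)
    order = np.argsort(distances)
    return order[order != query_index][:top_n]   # exclude the query itself

print(rank_by_similarity(0))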
[Query] [Query] [Query] [Query] [Results] [Results] [Results] [Results] Definition Genres are categorical labels used by humans to organize music into distinct categories. During the age of physical recordings they were also used to physically organize the spatial layout in music stores. Genres are fluid and change over time. Top level genres are not as subjective but more specific genres can be very specific to the music listeners of that genre (for example subgenres of heavy metal or electronic dance music). Check out [Ushkur’s Guide to Electronic Music]. 135 / 226 G. Tzanetakis 136 / 226 Automatic Musical Genre Classification Where do Genres come from ? Artists: for example bluegrass originates from the Blue Grass Boys named after Kentucky, “the Blue Grass State”. Records: for example free jazz from Ornette Coleman’s 1960 album of the same name Lyrics: Old-school DJ Lovebug Starski claims to have coined the term hip-hop by rhyming “hip-hop, hippy to the hippy hop-bop”. Record labels: Industrial named after Throbbing Gristle’s imprint. Journalists: Rhythm and blues when Jerry Wexler, a Billboard editor, began using it instead of “Race Records”. G. Tzanetakis Given as input an audio recording of a track without any associated meta-data determine what genre it belongs to from a set of predefined genre labels Four stages: Ground truth acquisition Audio feature extraction Song representation and classification Evaluation 137 / 226 Scanning the dial user study G. Tzanetakis 138 / 226 Scanning the dial - results (Perrott and Gjerdingen, 2008) At ceiling performance (3000 ms) participants agreed with the genres assigned by music companies about 70% of the time (that does not mean they were wrong). Even at 250 milliseconds prediction (43%) was significantly better than chance (10%). Inspired by circular radio dials - how long does it take to decide whether to listen to a particular channel or scan for another one ? Study conducted in 1999 (still early days of digital music, would have been very difficult to conduct with analog media). Snippets (250, 325, 400, 475, 3000 milliseconds) were played to 52 subjects. Blues Classical Country Dance Jazz G. Tzanetakis Latin Pop R&B Rap Rock [250 msec collage] [3000 msec collage] Classical,HipHop,Jazz 139 / 226 G. Tzanetakis 140 / 226 Ground Truth Acquistion Audio Feature Extraction Most common approach use “authoritative source” such as Amazon or All Music Guide. Custom hierarchy that is rationally defined (the Esperando of music genres) Gjerdingen “scan-the-dial” user study - no perfect agreement with ground truth - 70% was the best User study involving multiple subjects - use majority as ground truth and investigate how much inter-subject agreement there is Clusters of listeners possibly utilizing external sources of information - different notions of genres G. Tzanetakis Timbral Features (Spectral, MFCC) Rhythmic features Pitch content features 141 / 226 Track Representation and Classification I 142 / 226 Track Representation and Classification II If each track is represented as a sequence of feature vectors one possibility is to perform short-term classification in smaller segments and then aggregate the results using majority or weighted majority voting. Check [Genremeter from 2000] Song level features are the easiest approach. In this approach each track is represented by a single aggregate feature vector that characterizes it. Each “genre” is represented by the set of feature vectors of the tracks in training set that are labeled with that genre. 
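A deliberately simplified sketch of this song-level setup, assuming every track has already been reduced to a single feature vector with a genre label (random placeholders are used here, and the k-nearest-neighbour classifier is just one arbitrary example of an off-the-shelf classifier):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.random((200, 72))             # placeholder song-level feature vectors
y = rng.integers(0, 10, size=200)     # placeholder labels for 10 "genres"

clf = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(clf, X, y, cv=5)   # cross-validated classification accuracy
print(scores.mean())                        # near chance (0.1) for this random data

In a real evaluation the cross-validation folds would also respect artist filtering, as discussed below, to avoid the album effect.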
Standard data mining classifiers can be applied without any modification if this approach is used. G. Tzanetakis G. Tzanetakis In bag-of-frames each track is modeled as a probability density function and the distance between pdf’s need to be estimated. KL-divergence can be used either in closed form or numerically approximated. Monte Carlo methods can also be used. 143 / 226 G. Tzanetakis 144 / 226 Comparison of human and automatic genre classification Comparison of human and automatic genre classification (Lippens et al., 2004) (Lippens et al., 2004) G. Tzanetakis 145 / 226 Evaluation I G. Tzanetakis MIREX results MIREX The Music Information Retrieval Evaluation eXchange is an annual event in which different MIR algorithms contributed by groups from around the world are evaluated using a variety of metrics on different tasks which include several audio-based classification tasks. Best performing algorithms in MIREX audio classification tasks in 2009 and 2013. Part of improvement might be due to overfitting - universal background model. Task Genre Genre (Latin) Audio Mood Composer Audio-based classification evaluation is done using standard classification metrics such as classification accuracy. Artist filtering i.e tracks by the same artists are either all allocated to the training set or all allocated to the testing set when performing cross-validation to avoid the “album” effect. G. Tzanetakis 146 / 226 147 / 226 G. Tzanetakis Tracks 7000 3227 600 2772 Classes 10 10 5 11 2009 66.41 65.17 58.2 53.25 2013 76.23 77.32 68.33 69.70 148 / 226 Issues with automatic genre classification Issues with automatic genre classification Ill-defined problem There is too much subjectivity in how genre is perceived and exclusive allocation does not make sense in many cases. Evaluation metrics based on ground truth do not take into account how mistakes would be perceived by humans (the WTF factor). Mini-genres are more useful but also more subjective. Dataset saturation Public datasets are important for comparison of different systems. The GTZAN dataset has been used a lot in genre classification but has many limitations as it is rather small, was collected informally for personal research, and has issues such as correpted files and duplicates. Introduction of a new curated dataset would make comparison with previous systems harder. Glass ceiling: it has been observed that there genre classification results have improved only incrementally and using varations of the “classic” audio descriptors and “classic” classifiers only results in very minor changes. Open questions: can new higher level descriptors, better machine learning improve classification ? What would the human glass ceiling be ? Is it useful ? A lot of music available both in physical media and digital stores is already annotated by genre. With powerful music recommendation systems and no physical stores why bother ? It can also be viewed as a special case of tag annotation. G. Tzanetakis 149 / 226 Genre Hierarchies G. Tzanetakis 150 / 226 Bayesian Aggregation - Bayesian Networks P(y1 , . . . yN ) = N Y P(yi |yparents(i) ) (3) i=1 G. Tzanetakis 151 / 226 G. Tzanetakis 152 / 226 Baysian Aggregation - Raw Accuracy Music Emotion Single label, multi-class “raw accuracy” (38 classes). Using symbolic data (from MIREX 2005). 
Categorical each selection is classified into an “emotion” class (basically classification) Emotion variation detection the emotion is “tracked” continuously within a music selection Emotion recognition predict arousal and valence (or some other continuous space) for each music piece G. Tzanetakis 153 / 226 MIREX Mood Clusters G. Tzanetakis G. Tzanetakis 154 / 226 Valence-arousal emotion space 155 / 226 G. Tzanetakis 156 / 226 Emotion space prediction Tags Support vector regression nonlinearly maps the input feature vectors to a higher dimensional feature space using the kernel trick and yields prediction functions based on the support vectors. It is an extension of the well known support vector classification algorithm. Definition A tag is a short phrase or word that can be used to characterize a piece of music. Examples: “bouncy”, “heavy metal”, or “hand drums”. Tags can be related to instruments, genres, amotions, moods, usages, geographic origins, musicological terms, or anything the users decide. Similarly to a text index, a music index associated music documents to tags. A document can be a song, an album, an artist, a record label, etc. We consider songs/tracks to be our musical documents. G. Tzanetakis 157 / 226 Music Index 158 / 226 Tag research terminology Vocabulary happy pop a capella saxophone s1 .8 .7 .1 0 s2 .2 0 .1 .7 Cold-start problem: songs that are not annotated can not be retrieved. Popularity bias: songs (in the short head tend to be annotated more thoroughly than unpopular songs (in the long tail). Strong labeling versus weak labeling. Extensible or fixed vocabulary. Structured or unstructured vocabulary. s3 .6 .1 .5 .9 A query can either be a list of tags or a song. Using the music index the system can return a playlist of songs that somehow “match” the specified tags. G. Tzanetakis G. Tzanetakis 159 / 226 Note: Evaluation is a big challenge due to subjectivity. Tags generalize classification labels G. Tzanetakis 160 / 226 Many thanks to Tagging a song Material for these slides was generously provided by: Doug Turnbull Emanule Coviello Mohamed Sordo G. Tzanetakis 161 / 226 Tagging multiple songs G. Tzanetakis G. Tzanetakis 162 / 226 Text query 163 / 226 G. Tzanetakis 164 / 226 Sources of Tags Human participation: Surveys Social Tags Games Survey Pandora: a team of approximately 50 expert music reviewers (each with a degree in music and 200 hours of training) annotate songs using a structured vocabulary of between 150 and 200 tags. Tags are “objective” i.e there is a high degree of inter-reviewer agreement. Between 2000 and 2010, Pandora annotated about 750, 000 songs. Annotation takes approximately 20-30 minutes. CAL500: one song from 500 unique artists, each annod by a minimum of 3 nonexpert reviewers using a structured vocabulary of 174 tags. Standard dataset of training and evaluating tag-based retrieval systems. Automatic: Text mining Autotagging G. Tzanetakis 165 / 226 Harvesting social tags G. Tzanetakis 166 / 226 Last.fm tags for Adele Last.fm is a music discovery Web site that allows users to contribute social tags through a text box in their audio player interface. It is an example of crowd sourcing. In 2007, 40 million active users built up a vocabulary of 960, 000 free-text tags and used it to annotate millions of songs. All data available through public web API. Tags typically annotate artists rather than sons. Problems with multiple spelling, polysemous tags (such as progressive). G. Tzanetakis 167 / 226 G. 
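One practical consequence of the spelling and polysemy problems noted above is that raw social tags usually need some cleaning before they can serve as ground truth; the sketch below shows a very simple normalization step (the merge table is an invented example, not Last.fm data, and polysemous tags such as "progressive" cannot be resolved this way).

import re

# Hand-picked spelling merges: purely illustrative, not an official vocabulary.
VARIANTS = {
    'hiphop': 'hip hop',
    'hip-hop': 'hip hop',
    'rnb': 'r&b',
    'female vocalist': 'female vocalists',
}

def normalize_tag(tag):
    """Lower-case, collapse underscores and whitespace, then merge known variants."""
    tag = tag.lower().strip()
    tag = re.sub(r'_+', ' ', tag)
    tag = re.sub(r'\s+', ' ', tag)
    return VARIANTS.get(tag, tag)

print(normalize_tag('Hip-Hop'))    # hip hop
print(normalize_tag('  RnB '))     # r&b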
Tzanetakis 168 / 226 Playing Annotation Games Tag-a-tune In ISMIR 2007, music annotation games were presented for the first time: ListenGame, Tag-a-Tune, and MajorMiner. ListenGame uses a structured vocabulary and is real time. Tag-a-Tune and MajorMiner are inspired by the ESP Game for image tagging. In this approach the players listen to a track and are asked to enter “free text” tags until they both enter the same tag. This results in an extensible vocabulary. G. Tzanetakis 169 / 226 Mining web documents G. Tzanetakis 170 / 226 cal500.sness.net There are many text sources of information associated with a music track. These include artist biographies, album reviews, song reviews, social media posts, and personal blogs. The set of documents associated with a song is typically processed by text mining techniques resulting in a vector space representation which can then be used as input to data mining/machine learning techniques (text mining will be covered in more detail in a future lecture). G. Tzanetakis 171 / 226 G. Tzanetakis 172 / 226 Audio feature extraction Bag of words for text Audio features for tagging are typically very similar to the ones used for audio classification i.e statistics of the short-time magnitude spectrum over different time scales. G. Tzanetakis 173 / 226 Bag of words for audio G. Tzanetakis 174 / 226 Multi-label classification (with twists) “Classic” classification is single label and multi-class. In multi-label classification each instance can be assigned more than one label. Tag annotation can be viewed as multi-label classification with some additional twists: Synonyms (female voice, woman singing) Subpart relations (string quartet, classical) Sparse (only a small subset of tags applies to each song) Noisy Useful because: Cold start problem Query-by-keywords G. Tzanetakis 175 / 226 G. Tzanetakis 176 / 226 Machine Learning for Tag Annotation Tag models A straightforward approach is to treat each tag independently as a classification problem. G. Tzanetakis Identify songs associated with tag t Merge all features either directly or by model merging Estimate p(x|t) 177 / 226 Direct multi-label classifiers G. Tzanetakis 178 / 226 Tag co-occurence Alternatives to individual tag classifiers: K-NN multi-label classifier - straightforward extension that requires strategy for label merging (union or intersection are possibilities) Multi-layer perceptron - simple train directly with multi-label ground truth G. Tzanetakis 179 / 226 G. Tzanetakis 180 / 226 Stacking G. Tzanetakis Stacking II 181 / 226 How stacking can help ? G. Tzanetakis 182 / 226 Other terms/variants The main idea behind stacking i.e using the output of a classification stage as the input to a subsequent classification stage has been proposed under several different names: Correction approach (using binary outputs) Anchor classification (for example classification into artists used as a feature for genre classification) Semantic space retrieval Cascaded classification (in computer vision) Stacked generalization (in the classification) Context modeling (in autotagging) Cost-sensitive stacking (variant) G. Tzanetakis 183 / 226 G. Tzanetakis 184 / 226 Datasets Combining taggers/bag of systems There are several datasets that have been used to train and evaluate auto-tagging. They differ in the amount of data they contain, and the source of the ground truth tag information. 
Datasets
There are several datasets that have been used to train and evaluate auto-tagging. They differ in the amount of data they contain and in the source of the ground-truth tag information: MajorMiner, Magnatagatune, CAL500 (the most widely used one), CAL10K, MediaEval.
Reproducibility: a common dataset is not enough; ideally, exact details about the cross-validation folding process and the evaluation scripts should also be included.

Combining taggers / bag of systems

Magnatagatune
26K sound clips from magnatune.com. Human annotation from the Tag-a-tune game. Audio features from the Echo Nest. 230 artists, 183 tags.

CAL-10K Dataset
Number of tracks: 10866. Tags: 1053 (genre and acoustic tags). Tags/track: min = 2, max = 25, µ = 10.9, σ = 4.57, median = 11.
Most used tags: major key tonality (4547), acoustic rhythm guitars (2296), a vocal-centric aesthetic (2163), extensive vamping (2130).
Least used tags: cocky lyrics (1), psychedelic rock influences (1), breathy vocal sound (1), well-articulated trombone solo (1), lead flute (1).
Tags collected using a survey. Available at: http://cosmal.ucsd.edu/cal/projects/AnnRet/

Tagging evaluation metrics
The inputs to an autotagging evaluation metric are the predicted tags (a #tracks by #tags binary matrix) or tag affinities (a #tracks by #tags matrix of reals) and the associated ground truth (a binary matrix). One possibility would be to convert the matrices into vectors and then use classification evaluation metrics. This approach has the disadvantage that popular tags will dominate, and performance on less-frequent tags (which one could argue are more important) will be irrelevant. The asymmetry between positives and negatives also makes classification accuracy not a very good metric; retrieval metrics are better choices. If the output of the auto-tagging system is affinities, then many metrics require binarization. Common binarization variants: select the k top-scoring tags for each track, or threshold each column of tag affinities to match the tag priors of the training set. Validation schemes are similar to classification: cross-validation, repeated cross-validation, and bootstrapping.

Annotation vs retrieval
The common approach is to treat each tag column separately and then average across tags (retrieval), or alternatively to treat each track row separately and average across tracks (annotation).

Annotation Metrics
Based on counting TP, FP, TN, FN: precision, recall, F-measure.

Annotation Metrics based on rank
When using affinities it is possible to use rank correlation metrics: Spearman's rank correlation coefficient ρ, Kendall's tau τ.

Retrieval measures - Mean Average Precision
Precision at N is the number of relevant songs retrieved in the top N divided by N. Rather than choosing a single N, one can average precision over different values of N and then take the mean over a set of queries (tags).

Retrieval measures - AUC-ROC

Stacking results I
Stacking results II
Stacking results III
Stacking results IV
Stacking results V

MIREX Tag Annotation Task
The Music Information Retrieval Evaluation Exchange (MIREX) audio tag annotation task started in 2008.
MajorMiner dataset (2300 tracks, 45 tags). Mood tag dataset (6490 tracks, 135 tags). 10-second clips. 3-fold cross-validation.
Binary relevance metrics: F-measure, precision, recall. Affinity ranking metrics: AUC-ROC, Precision at 3, 6, 9, 12, 15.

MIREX 2012 F-measure

MIREX 2012 AUC-ROC

History of MIREX tagging
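To make the retrieval/annotation distinction above concrete, here is a small sketch, assuming scikit-learn and random placeholder matrices oriented as tracks by tags. It computes per-tag AUC-ROC and average precision averaged across tags (the retrieval view), and per-track precision/recall after keeping the top k tags per track (the annotation view, with the top-k binarization variant mentioned above).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
truth = (rng.random((500, 10)) < 0.2).astype(int)   # tracks x tags, binary ground truth (placeholder)
affinity = rng.random((500, 10))                    # tracks x tags, real-valued affinities (placeholder)

# Retrieval view: evaluate each tag column separately, then average across tags.
auc_per_tag = [roc_auc_score(truth[:, t], affinity[:, t]) for t in range(truth.shape[1])]
ap_per_tag = [average_precision_score(truth[:, t], affinity[:, t]) for t in range(truth.shape[1])]
print("mean AUC-ROC:", np.mean(auc_per_tag), "MAP:", np.mean(ap_per_tag))

# Annotation view: binarize by keeping the k top-scoring tags per track,
# then compute precision/recall per track row and average across tracks.
k = 3
pred = np.zeros_like(truth)
np.put_along_axis(pred, np.argsort(-affinity, axis=1)[:, :k], 1, axis=1)
tp = (pred & truth).sum(axis=1)
precision = tp / k
recall = tp / np.maximum(truth.sum(axis=1), 1)
print("annotation precision:", precision.mean(), "recall:", recall.mean())
```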
Open questions
Should the tag annotations be sanitized or should the machine learning part handle it?
Do auto-taggers generalize outside their collections?
Stacking seems to improve results (even though one paper has shown no improvement). How does stacking perform when dealing with synonyms, antonyms, and noisy annotations? Why?
How can multiple sources of tags be combined?

Future work
Weak labeling: in most cases the absence of a tag does NOT imply that the tag would not be considered valid by most users.
Explore a continuous grading of semi-supervised learning where the distinction between supervised and unsupervised is not binary.
Explore feature clustering of untagged instances.
Include additional sources of information (separate from tags) such as artist, genre, album; multiple instance learning approaches (for example if genre information is available at the album level); statistical relational learning.

Future work
The lukewarm start problem: what if some tags are known for the testing data but not all? Missing-label types of approaches such as EM; Markov logic inference in structured data.
Other ideas: online learning, where tags enter the system incrementally and individually rather than all at the same time or for a particular instance; taking into account user behavior when interacting with a tag system; personalization vs the crowd: would clustering users based on their tagging make sense?

Extracting Context Information
Fan pages, artist personal web pages, music portals such as http://allmusic.com, Web 2.0 APIs (Last.fm, Pandora, Echonest), playlists, peer-to-peer networks.
General observation: there is a lot of information available, but it is noisy and not all of it is of good quality.

Guess the genre
Yeah I'm out that Brooklyn, now I'm down in TriBeCa
Right next to Deniro, but I'll be hood forever
I'm the new Sinatra, and since I made it here
I can make it anywhere, yeah, they love me everywhere
I used to cop in Harlem, all of my Dominicano's
Right there up on Broadway, pull me back to that McDonald's
Took it to my stashbox, 560 State St.
Catch me in the kitchen like a Simmons with them Pastry's
Cruisin' down 8th St., off white Lexus
Drivin' so slow, but BK is from Texas
"Empire State of Mind", Jay Z

Guess the genre
You know a dream is like a river ever changin' as it flows
and a dreamer's just a vessel that must follow where it goes
trying to learn from what's behind you and never knowing what's in store
makes each day a constant battle just to stay between the shores
and i will sail my vessel 'til the river runs dry
like a bird upon the wind these waters are my sky
"River", Garth Brooks

Song Lyrics
Lyrics give information about the semantics of a piece of music. They can also reveal aspects such as the artist's cultural background and style. A typical lyric analysis system has the following stages and uses techniques from text retrieval and bioinformatics:
Query: a search engine is used to find web pages likely to contain lyrics.
Text extraction: removal of HTML tags, conversion to lowercase.
Alignment: determine pages that contain the same lyrics, alignment of word pairs.
Evaluation: ground truth can be obtained from CD covers.
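A minimal sketch of the "text extraction" stage of such a lyric-harvesting pipeline, using only the Python standard library: strip HTML tags, decode entities, normalize whitespace, and lowercase. The regex-based tag removal and the sample page are simplifications for illustration; a real system would use a proper HTML parser.

```python
import re
from html import unescape

def extract_text(html: str) -> str:
    """Crude text extraction: drop HTML tags, decode entities, normalize whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", html)   # remove HTML tags
    text = unescape(text)                  # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip().lower()

page = "<html><body><h1>The River</h1><p>You know a dream is like a river...</p></body></html>"
print(extract_text(page))
# the river you know a dream is like a river...
```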
Country of Origin
One possibility is to look specifically into artist pages and bios (for example in Last.fm, Freebase, Wikipedia) for occurrences of words indicating geographic origin. An alternative approach is to query a search engine with pairs of artist names and country names. A simple approach is to just count the number of page hits. A more complicated approach uses ideas from text information retrieval, i.e., term frequency - inverse document frequency (TF-IDF).

TF-IDF
Definition: Document Frequency (DF) $df_{t,a}$ is the total number of Web pages retrieved for artist $a$ on which term $t$ occurs at least once.
Definition: Term Frequency (TF) $tf_{t,a}$ is the total number of occurrences of term $t$ in all pages retrieved for $a$.
$\mathrm{TFIDF}_{t,a} = \ln(1 + tf_{t,a}) \times \ln\left(1 + \frac{n}{df_t}\right)$

TF-IDF motivation
A measure of how important a word is to a document within a corpus. The motivation of TF-IDF is to increase the "weight" of $t$ if it occurs frequently in the web pages retrieved for $a$ (for example $t$ = 'Greece', $a$ = 'Vangelis', or $t$ = 'and', $a$ = 'Vangelis'), but decrease it if $t$ occurs in a large number of documents retrieved for all $a$ (for example $t$ = 'and').
Evaluation: precision and recall (for example, in a study by Schedl et al. page count estimates achieve 23% precision while TF-IDF achieves 71%).

Bag of words and vector space model
Each document is represented as an unordered set of its words (or terms such as noun phrases), ignoring structure and grammar rules. Each term $t$ describing a document $d$ is assigned a "weight" $w_{t,d}$, for example based on its frequency of occurrence. The set of weights over all terms of interest is the feature vector and is typically sparse, meaning that many of the weights are zero. For music it is common to aggregate all relevant pages into a single "virtual" document. Part-of-speech taggers can be used to create terms from a list of words.

Similarity based on term profiles
The weight of a term $t$ for artist $a$ can be computed using TF-IDF schemes such as
$w(t,a) = \frac{tf(t,a)}{df(t)}$ or $w_g(t,a) = tf(t,a)\, e^{-\frac{(\log(df(t)) - \mu)^2}{2\sigma^2}}$.
The similarity between artists can be computed from the overlap between their corresponding term profiles, i.e., the sum of the weights of all the terms that occur in both term profiles. Cosine similarity is another popular choice. A similar approach can be used for any textual source of information, such as lyrics or blog posts.

Cosine similarity
The idea behind cosine similarity is to not take into account the length of the associated documents when computing similarity, but only the relative frequency of each term:
$sim(a_i, a_j) = \cos(\theta) = \frac{\sum_{t \in T} w(t, a_i)\, w(t, a_j)}{\sqrt{\sum_{t \in T} w(t, a_i)^2}\,\sqrt{\sum_{t \in T} w(t, a_j)^2}}$
In general, the idea of a vector space and the associated feature vectors can be used in many different MIR tasks by varying the source of information and the desired outcome. Examples include language identification, song structure detection, and thematic categorization. Finally, text-based features can be combined with audio features.
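A small sketch of the TF-IDF weighting and cosine similarity formulas above, with made-up term counts. The normalizer $n$ is assumed here to be the total number of retrieved pages, which the slides leave implicit.

```python
import math

def tfidf(tf, df, n):
    """ln(1 + tf) * ln(1 + n / df), following the TFIDF formula above."""
    return math.log(1 + tf) * math.log(1 + n / df)

def cosine(u, v):
    """Cosine similarity between two sparse term-profile dicts."""
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in terms)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

n_pages = 1000                                    # assumed total number of retrieved pages
df = {"greece": 40, "and": 990, "synth": 120}     # toy document frequencies
tf_a = {"greece": 30, "and": 400, "synth": 25}    # toy term frequencies for artist a_i
tf_b = {"and": 380, "synth": 5}                   # toy term frequencies for artist a_j

w_a = {t: tfidf(tf_a[t], df[t], n_pages) for t in tf_a}
w_b = {t: tfidf(tf_b[t], df[t], n_pages) for t in tf_b}
print(cosine(w_a, w_b))
```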
Collaborative Tags
Tag overlap can be used as a measure of similarity between artists (or tracks). In a study using Last.fm data it was found that similar artists share on average 10 tags, compared to 4 for arbitrary pairs.
Pros: smaller vocabulary, focused on the domain, available at the level of individual tracks.
Cons: requires a large and active user community; coverage in the "long tail" can be low or non-existent.

Introduction
The basic idea behind co-occurrence approaches is that the occurrence of two pieces of music, artists, or tags within the same context indicates similarity.

Page Counts
Similarity using a search engine and the associated page counts (with some restrictions to constrain the results to music):
$sim(a_i, a_j) = \frac{pc(a_i, a_j)}{\min(pc(a_i), pc(a_j))}$
$sim(a_i, a_j) = \frac{1}{2}\left(\frac{pc(a_i, a_j)}{pc(a_i)} + \frac{pc(a_i, a_j)}{pc(a_j)}\right)$

Example

Audio Features
Basic idea: capture the energy of different pitches or chroma values. Approaches: summarize FFT bins; design a filterbank (constant-Q); detect multiple pitches.

Onset calculation
Downsampling, first-order difference, half-wave rectification.

Chroma Representation
From the 88-note pitch profile (histogram), compute a 12-dimensional chroma (or pitch class) vector by summing all the octave-separated elements.

Chroma Normalization
$v \leftarrow v / \|v\|_1$ where $\|v\|_1 = \sum_{i=1}^{12} |v(i)|$
Replace with a flat chroma vector if $\|v\|_1$ is below a threshold, for silent passages.

Other topics not covered in this tutorial
Chord detection; structure analysis; music transcription and sound source separation; computational ethnomusicology; audio watermarking and fingerprinting; graphical user interfaces and visualization.
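As a closing illustration of the chroma computation and normalization described above, here is a minimal sketch: fold an 88-bin piano-range pitch profile into 12 pitch classes, then L1-normalize, substituting a flat vector for near-silent frames. The mapping of bin 0 to A0 (MIDI note 21) is an assumption for illustration, not stated in the slides.

```python
import numpy as np

def chroma_from_pitch_profile(pitch_profile, silence_threshold=1e-6):
    pitch_profile = np.asarray(pitch_profile, dtype=float)   # 88 values, assumed A0..C8
    chroma = np.zeros(12)
    for i, energy in enumerate(pitch_profile):
        midi = 21 + i                      # bin 0 assumed to correspond to MIDI 21 (A0)
        chroma[midi % 12] += energy        # sum the octave-separated elements
    norm = np.abs(chroma).sum()            # ||v||_1
    if norm < silence_threshold:
        return np.full(12, 1.0 / 12.0)     # flat chroma vector for silent passages
    return chroma / norm

print(chroma_from_pitch_profile(np.random.rand(88)))
```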