Tutorial: Music Information Retrieval
George Tzanetakis
University of Victoria
2014

Prerequisites
Basic high school math
Probability and Statistics
Linear Algebra
Computer programming
Basic music theory
Textbook
This tutorial is based on slide material created for an online
course and associated textbook. Video recordings of some of
the lectures, more detailed slides, and a draft of the book (all
under heavy construction) can be found at:
[http://marsyas.cs.uvic.ca/mirBook]
The goal of the tutorial is to familiarize researchers engaged in
multimedia information retrieval (typically image and video,
sometimes audio/speech) with the work done in music
information retrieval. I will be covering some background in
audio digital signal processing that is needed but will assume
familiarity with standard machine learning/data mining.
Current draft on webpage - will be frequently updated and
dated. There is also much more material for the remaining
chapters in other documents, papers, etc. that I will be
editing and transferring in the future.
MIR
Unfortunately we are stuck with kind of similar acronyms.
Education and Academic Work Experience
1997 BSc in Computer Science (CS), University of Crete,
Greece
1999 MA in CS, Princeton University, USA
2002 PhD in CS, Princeton University, USA
2003 PostDoc in CS, Carnegie Mellon University, USA
2004 Assistant Professor in CS, Univ. of Victoria, Canada
2010 Associate Professor in CS, Univ. of Victoria, Canada
2010 Canada Research Chair (Tier II) in Computer
Analysis of Audio and Music
Music theory, saxophone and piano performance,
composition, improvisation both in conservatory and
academic settings
Main focus of my research has been Music Information
Retrieval (MIR)
Involved from the early days of the field
Have published papers in almost every ISMIR conference
Organized ISMIR in 2006
Tutorials on MIR in several conferences
Research
Inherently inter-disciplinary and cross-disciplinary work.
Connecting theme: making computers better understand
music to create more effective interactions with musicians and
listeners. Audio analysis is challenging due to large volume of
data - did big data before it became fashionable.
Music Information Retrieval
Digital Signal Processing
Machine Learning
Human-Computer Interaction
Software Engineering
Artificial Intelligence
Multimedia
Robotics
Visualization
Programming Languages

Work Experience beyond Academia
Many internships in research labs throughout studies. Several
consulting jobs while in academia. A few representative
examples:
Moodlogic Inc (2000). Designed and developed one of
the earliest audio fingerprinting systems (patented) -
100,000 users matching to 1.5 million songs
Teligence Inc (2005). Automatic male/female voice
discrimination for voice messages used in popular phone
dating sites - processing of 20,000+ recordings per day.
Visiting Scientist at Google Research (6 months)
Things I worked on (of course as part of larger teams):
Cover Song Detection (applied to every uploaded
YouTube video).
100 hours of video are uploaded to YouTube every minute
Content ID scans over 250 years of video every day - 15
million references
Audio Fingerprinting (part of Android Jelly Bean)
Named inventor on 6 pending US patents related to audio
matching and fingerprinting

Marsyas
Music Analysis, Retrieval and Synthesis for Audio Signals
Open source in C++ with Python Bindings
Started by me in 1999 - core team approximately 4-5
developers
Approximately 400 downloads per month
Many projects in industry and academia
State-of-the-art performance while frequently orders of
magnitude faster than other systems
History of MIR before computers
How did a listener encounter a new piece of music throughout
history ?
Live performance
Music Notation
Physical recording
Radio

Brief History of computer MIR
Pre-history (< 2000): scattered papers in various
communities. Symbolic processing mostly in digital
libraries and information retrieval venues, and audio
processing (less explored) mostly in acoustics and DSP
venues.
The birth, 2000: the first International Symposium on Music
Information Retrieval (ISMIR), with funding from the NSF
Digital Libraries II initiative, organized by J. Stephen
Downie, Tim Crawford and Don Byrd. First contact
between the symbolic and the audio side.
2000-2006: Rapid growth
2006-2014: Slower growth and steady state
Conceptual MIR dimensions I
Data sources:
Audio
Track metadata
Score
Lyrics
Reviews
Ratings
Download patterns
Micro-blogging
Stages:
Representation/Hearing
Analysis/Learning
Interaction/Action

Conceptual MIR dimensions II
Specificity:
Audio fingerprinting
Common score performance
Cover song detection
Artist identification
Genre classification
Recommendation ?
MIR Tasks
Similarity retrieval, playlists, recommendation
Classification and clustering
Tag annotation
Rhythm, melody, chords
Music transcription and source separation
Query by humming
Symbolic MIR
Segmentation, structure, alignment
Watermarking, fingerprinting and cover song detection

Digital Audio Recordings
Recordings in analog media (like vinyl or magnetic tape)
degrade over time
Digital audio representations theoretically can remain
accurate without any loss of information through copying
of patterns of bits.
MIR requires distilling information from an extremely
large amount of data
Digitally storing 3 minutes of audio requires
approximately 16 million numbers. A tempo extraction
program must somehow convert these to a single
numerical estimate of the tempo.
Production and Perception of Periodic Sounds
Animal sound generation and perception
The sound generation and perception systems of animals have
evolved to help them survive in their environment. From an
evolutionary perspective the intentional sounds generated by
animals should be distinct from the random sounds of the
environment.
Repetition
Repetition is a key property of sounds that can make them
more identifiable as coming from other animals (predators,
prey, potential mates) and therefore animal hearing systems
have evolved to be good at detecting periodic sounds.

Pitch Perception
Pitch
When the same sound is repeated more than 10-20 times per
second, instead of being perceived as a sequence of individual
sound events it is fused into a single sonic event with a
property we call pitch that is related to the underlying period
of repetition. Note that this fusion is something our perception
does; it does not reflect any underlying change in the signal
other than the decrease of the repetition period.
Time-Frequency Representations
Music Notation
When listening to mixtures of sounds (including music) we are
interested in when specific sounds take place (time) and what
is their source of origin (pitch, timbre). This is also reflected
in music notation, which fundamentally represents time from
left to right and pitch from bottom to top.

Spectrum
Informal definition of Spectrum
A fundamental concept in DSP is the notion of a spectrum.
Informally, complex sounds such as the ones produced by
musical instruments and their combinations can be modeled as
linear combinations of simple elementary sinusoidal signals
with different frequencies. A spectrum shows how "much"
each such basis sinusoidal component contributes to the
overall mixture. It can be used to extract information about
the sound such as its perceived pitch or what instrument(s)
are playing. A spectrum corresponds to a short snapshot of
the sound in time.
Spectrum example
Spectrum of a tenor saxophone note

Spectrograms
Music and sound change over time. A spectrum does not
provide any information about the time evolution of different
frequencies. It just shows the relative contribution of each
frequency to the mixture signal over the duration analyzed.
In order to capture the time evolution of sound and music the
standard approach is to segment the audio signal into small
chunks (called windows or frames) and calculate the spectrum
for each of these windows. The assumption is that during the
relatively short period of analysis (typically less than a second)
there is not much change and therefore the calculated
short-time spectrum is an accurate representation of the
underlying signal. The resulting sequence of spectra over time
is called a spectrogram.
Examples of spectrograms
Waterfall spectrogram view
Waterfall display using sndpeek
Spectrogram of a few tenor saxophone notes
Why is DSP important for MIR ?
A large amount of MIR research deals with audio signals.
Audio signals are represented digitally as very long
sequences of numbers.
Digital Signal Processing techniques are essential in
extracting information from audio signals.
The mathematical ideas behind DSP are amazing. For
example it is through DSP that you can understand how
any sound that you can hear can be expressed as a sum of
sine waves or represented as a long sequence of 1’s and
0’s.

DSP for MIR
Digital Signal Processing is a large field and therefore
impossible to cover adequately in this course. The main goal
of the lectures focusing on DSP will be to provide you with
some intuition behind the main concepts and techniques that
form the foundation of many MIR algorithms. I hope that they
serve as a seed for growing a long term passion and interest
for DSP and the textbook provides some pointers for further
reading.
Sinusoids
We start our exposition by discussing sinusoids, which are
elementary signals that are crucial in understanding both DSP
concepts and the mathematical notation used to describe
them. The ultimate goal of the DSP lectures is to make
equations such as the following less intimidating and more
meaningful:

X(f) = ∫_{−∞}^{+∞} x(t) e^{−j2πft} dt    (1)

What is a sinusoid ?
Family of elementary signals that have a particular
shape/pattern of repetition. sin(ωt) and cos(ωt) are particular
examples of sinusoids that can be described by the more
general equation:

x(t) = sin(ωt + φ)    (2)

where ω is the frequency and φ is the phase. There is an
infinite number of continuous periodic signals that belong to
the sinusoid family. Each is characterized by three numbers:
the amplitude, the frequency and the phase.
4 motivating viewpoints for sinusoids
Solutions to the differential equations that describe
simple systems of vibration
Family of signals that pass “unchanged” through LTI
systems
Phasors (rotating vectors) providing geometric intuition
about DSP concepts and notation
Basis functions of the Fourier Transform
Figure : Simple sinusoids
Linear Time Invariant Systems
Definition
Systems are transformations of signals. They take as input a
signal x(t) and produce a corresponding output signal y(t).
Example: y(t) = [x(t)]² + 5.

LTI Systems
Linearity means that one can calculate the output of the
system to the sum of two input signals by summing the system
outputs for each input signal individually. Formally, if
y1(t) = S{x1(t)} and y2(t) = S{x2(t)} then
S{x1(t) + x2(t)} = ysum(t) = y1(t) + y2(t). Time invariance:
a shift in the input results in the same shift in the output.

Sinusoids and LTI Systems
When a sinusoid of frequency ω goes through an LTI system it
"stays" in the family of sinusoids of frequency ω, i.e. only the
amplitude and the phase are changed by the system. Because
of linearity this implies that if a complex signal is a sum of
sinusoids of different frequencies then the system output will
not contain any new frequencies. The behavior of the system
can be completely understood by simply analyzing how it
responds to elementary sinusoids. Examples of LTI systems in
music: guitar body, vocal tract, outer ear, concert hall.
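A minimal sketch of this property (NumPy/SciPy assumed; the filter, frequency and sampling rate are arbitrary choices): a sinusoid passed through an LTI filter comes out as a sinusoid of the same frequency, with only its amplitude and phase changed.

import numpy as np
from scipy.signal import lfilter

fs = 8000                       # sampling rate (Hz), arbitrary
t = np.arange(fs) / fs          # one second of samples
f0 = 440.0                      # input sinusoid frequency (Hz)
x = np.sin(2 * np.pi * f0 * t)

b = np.ones(5) / 5              # a simple LTI system: 5-point moving average
y = lfilter(b, [1.0], x)

# Estimate amplitude and phase at f0 by projecting onto a complex phasor
# (skipping the short filter transient at the start).
phasor = np.exp(-2j * np.pi * f0 * t)
cx = 2 * np.mean(x[100:] * phasor[100:])
cy = 2 * np.mean(y[100:] * phasor[100:])
print("input  amplitude %.3f phase %+.3f rad" % (abs(cx), np.angle(cx)))
print("output amplitude %.3f phase %+.3f rad" % (abs(cy), np.angle(cy)))
# The output differs only by a gain and a phase shift; no new frequencies.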
Thinking in circles
Key insight
Think of a sinusoidal signal as a vector rotating at a constant
speed in the plane (a phasor) rather than a single-valued signal
that goes up and down.
Amplitude = Length
Frequency = Speed
Phase = Angle at time t

Projecting a phasor
The projection of the rotating vector or phasor on the x-axis is
a cosine wave and on the y-axis a sine wave.
Notating a phasor
Complex numbers
An elegant notation system for describing and manipulating
rotating vectors.
x + jy
where x is called the real part and y is called the imaginary
part. If we represent a sinusoid as a rotating vector then using
complex number notation we can simply write:
cos(ωt) + j sin(ωt)

Multiplication by j
Multiplication by j is an operation of rotation in the plane. You
can think of it as rotating +90 degrees counter-clockwise. Two
successive rotations by +90 degrees bring us to the negative
real axis, hence j² = −1. This geometric viewpoint shows that
there is nothing imaginary or strange about complex numbers.
Adding sinusoids of the same frequency I
Adding sinusoids of the same frequency II
Geometric view of the property that sinusoids (phasors) of a
particular frequency ω are closed under addition.
Negative frequencies and phasors
Book that inspired this DSP exposition
A Digital Signal Processing
Primer by Ken Steiglitz
Summary
Sinusoidal signals are fundamental in understanding DSP
Representing them as phasors (i.e. vectors rotating at a
constant speed) can help understand intuitively several
concepts in DSP
Complex numbers are an elegant system for expressing
rotations and can be used to notate phasors in a way that
leverages our knowledge of algebra
Thinking this way makes e^{jωt} more intuitive.

Sampling
Discretize a continuous signal by taking regular
measurements in time (discretization of the
measurements themselves is called quantization)
Notation: fs is the sampling rate in Hz, ωs is the sampling
rate in radians per second
Sampling a sinusoid - only frequencies below half the
sampling rate (the Nyquist frequency) will be accurately
represented after sampling
For a sinusoid at ω0, all frequencies ω0 + kωs are aliases
Phasor view of aliasing
Illustration of sampling at a high sampling rate compared to
the phasor frequency, sampling at the Nyquist rate, and
slightly above. Numbers indicate the discrete samples of the
continuous phasor rotation. Each sample is a complex number.

Frequency Domain
Any periodic sound can be represented as a sum of
sinusoids (or equivalently phasors)
This representation is called a frequency domain
representation and the linear combination coefficients are
called the spectrum
Commonly used variants: Fourier Series, Discrete Fourier
Transform, the z-transform, and the classical continuous
Fourier Transform
These transforms provide procedures for obtaining the
linear combination weights of the frequency domain from
the signal in time domain (as well as the inverse direction)
2D coordinate system
A vector v in 2-dimensional space can be written as a
combination of the 2 unit vectors in each coordinate direction.
The inner product operation ⟨v, w⟩ corresponds to the
projection of v onto w. It is the sum of the products of like
coordinates:

⟨v, w⟩ = vx wx + vy wy = Σ_{i=0}^{N−1} vi wi

Inner product (projection) properties
If ⟨x, y⟩ = 0 then the vectors are orthogonal
⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩ (distributive law)
Basis vectors are orthogonal and have length 1
⟨v, v⟩ = vx² + vy² is the square of the vector length
In order to have the inner product with self be the square
of the length for vectors of complex numbers we have to
slightly change the definition by using the complex
conjugate:

⟨v, w⟩ = vx wx* + vy wy* = Σ_{i=0}^{N−1} vi wi*

where ()* denotes the complex conjugate of a number.
Hilbert Spaces
Key idea
Generalize the notion of a Euclidean space with finite
dimensions to other types of spaces for which a suitable notion
of an inner product can be defined. These spaces can have an
infinite number of dimensions, but as long as we have an
appropriate definition of a projection operator/inner product
we can reuse a lot of the notation and concepts familiar from
Euclidean space. For example, one space we will investigate is
all continuous functions that are periodic with an interval [0, T].

Orthogonal Coordinate System
We need an orthogonal coordinate system, i.e. a projection
(inner product) operator and an orthogonal basis, for each
space we are interested in.
The Fourier Series, the Discrete Fourier Transform, the
z-transform and the continuous Fourier Transform can all be
defined by specifying what projection operator to use and what
basis elements to use.
Discrete Fourier Transform Introduction
The DFT is an abstract mathematical transformation
and the Fast Fourier Transform (FFT) is a very efficient
algorithm for computing it
The FFT is at the heart of digital signal processing and a
lot of MIR systems utilize it one way or another
It is applied on sequences of N samples of a digital signal
Similarly to the Fourier Series we will define it using the
components of an orthogonal coordinate system: an inner
product and a set of basis elements

DFT
Our input is a finite, length N segment of a digital signal
x[0], . . . , x[N − 1].
Definition
The inner product is what one would expect:

⟨x, y⟩ = Σ_{t=0}^{N−1} x[t] y*[t]

Switch the frequency interval from [−ωs/2, +ωs/2] to [0, ωs) as
they are equivalent. One possibility would be all phasors in
that frequency range: e^{jtω} for 0 ≤ ω < ωs. It turns out we
just need N phasors in that range.
DFT basis elements
We need N frequencies spaced in the range from 0 to the
sampling frequency. If we use radians per sample then we have
0, 2π/N, 2(2π/N), . . . , (N − 1)(2π/N). The corresponding basis
is:

Definition
e^{jtk2π/N} for 0 ≤ k ≤ N − 1

Note: using the definition of the inner product above one can
show that these basis elements are indeed orthogonal, i.e. the
inner product between any two distinct elements is zero.

The Discrete Fourier Transform (DFT)
Definition
The DFT can be obtained by projecting the signal onto the
basis elements using the inner product definition:

X[k] = ⟨x, e^{jtk2π/N}⟩ = Σ_{t=0}^{N−1} x[t] e^{−jtk2π/N}

Definition
The inverse DFT expresses the time domain signal as a
complex weighted sum of N phasors with spectrum X[k]:

x[t] = (1/N) Σ_{k=0}^{N−1} X[k] e^{jtk2π/N}
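A minimal sketch of the projection view of the DFT (NumPy assumed; the test signal is arbitrary): projecting a length-N signal onto the N phasor basis elements reproduces np.fft.fft, and the inverse sum reconstructs the signal.

import numpy as np

N = 16
x = np.random.default_rng(0).standard_normal(N)   # any length-N signal
t = np.arange(N)

# Forward DFT by projection onto e^{j t k 2*pi/N}
X = np.array([np.sum(x * np.exp(-2j * np.pi * k * t / N)) for k in range(N)])
print(np.allclose(X, np.fft.fft(x)))               # True: projection = DFT

# Inverse DFT: complex weighted sum of the phasors divided by N
x_rec = np.array([np.sum(X * np.exp(2j * np.pi * np.arange(N) * n / N)) / N
                  for n in range(N)])
print(np.allclose(x_rec, x))                       # True: perfect reconstruction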
Matrix-vector formulation
One can view the DFT as a way to transform a sequence of N
complex numbers into a different sequence of N complex
numbers. The DFT can be expressed in matrix-vector
notation. If we use x and X to denote the N dimensional
vectors with components x[t] and X[k] respectively, and define
the N × N matrix F by

[F]_{k,t} = e^{−jtk2π/N}

then we can write:

X = Fx   and   x = F^{−1}X

Circular Domain
The bins of the DFT are numbered 0, . . . , N − 1 but correspond
to frequencies between [−ωs/2, +ωs/2].
The discrete frequency domain
The bins of the DFT are numbered 0, . . . , N − 1 but correspond
to frequencies between [−ωs/2, +ωs/2]. Since N corresponds to
the sampling rate, we need to divide by N to get the
frequencies in terms of fractions of the sampling rate. So in
the case shown in the figure (a 16-point DFT) we would have
the following frequencies (fractions of the sampling rate):

0, 1/16, 2/16, . . . , 8/16, −7/16, −6/16, . . . , −1/16

DFT frequency mapping example
Example of a 2048-point DFT with 44100 Hz sampling rate:

bin    frequency (Hz)
0      0
1      21.5
2      43.1
...    ...
1024   22050
1025   -22028.5
1026   -22007
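A minimal sketch of this bin-to-frequency mapping (NumPy assumed; the bin list just mirrors the table above): bins past N/2 wrap around to negative frequencies.

import numpy as np

def bin_to_hz(k, N=2048, sr=44100):
    f = k * sr / N
    return f if k <= N // 2 else f - sr      # bins past Nyquist are negative

for k in [0, 1, 2, 1024, 1025, 1026]:
    print(k, round(bin_to_hz(k), 1))
# 0 0.0, 1 21.5, 2 43.1, 1024 22050.0, 1025 -22028.5, 1026 -22006.9
# np.fft.fftfreq(2048, d=1/44100) gives the same mapping.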
Fast Fourier Transform
The FFT is a fast implementation of the DFT
A straightforward implementation of the DFT requires O(N²)
arithmetic operations.
Divide and conquer: do two N/2-point DFTs and then merge
the results
O(N log N) - much faster when N is not small.

Summary I
Sampling a phasor introduces aliasing, which means that
multiple frequencies (the aliases) are indistinguishable
from each other based on the samples
We can extend the concept of an orthogonal coordinate
system beyond Euclidean space vectors
By appropriate definitions of a projection operator
(inner product) and basis elements we can formulate
transformations from the time domain to the frequency
domain such as the Fourier Series and the Discrete
Fourier Transform
Summary II
Any signal of interest can be expressed as a weighted sum
(with complex coefficients) of basis elements that are
phasors.
The complex coefficients that act as coordinates are
called the spectrum of the signal
We can obtain the spectrum by projecting the time
domain signal onto the phasor basis elements
The DFT output contains frequencies between
[−ωs/2, +ωs/2] using a circular domain. For real signals
the coefficients of the negative frequencies are symmetric
to the positive frequencies and carry no additional
information.

Music Notation
Music notation systems typically encode information about
discrete musical pitch (notes on a piano) and timing.
Terminology
The term pitch is used in different ways in the literature, which
can result in some confusion.
Perceptual Pitch: a perceived quality of sound that can be
ordered from "low" to "high".
Musical Pitch: refers to a discrete finite set of perceived
pitches that are played on musical instruments
Measured Pitch: a calculated quantity of a sound using an
algorithm that tries to match the perceived pitch.
Monophonic: refers to a piece of music in which a single sound
source (instrument or voice) is playing and only
one pitch is heard at any particular time instance.

Psychoacoustics
Definition
The scientific study of sound perception.
Frequently testing the limits of perception:
Frequency range 20Hz-20000Hz
Intensity (0dB-120dB)
Masking
Missing fundamental (presence of harmonics at integer
multiples of the fundamental gives the impression of the
"missing" pitch)
Origins of Psychoacoustics
Pythagoras of Samos established a connection between
perception (music intervals) and physical measurable
quantities (string lengths) using the monochord.

Pitch Detection
Pitch is a PERCEPTUAL attribute correlated with but not
equivalent to fundamental frequency. Simple pitch detection
algorithms mostly deal with fundamental frequency estimation
but more sophisticated ones take into account knowledge
about the human auditory system.
Time Domain
Frequency Domain
Perceptual
Time-domain Zerocrossings
Zero-crossings are sensitive to noise so low-pass filtering is
frequently utilized.
Figure : C4 Sine [Sound]
Figure : C4 Clarinet [Sound]

AutoCorrelation
In autocorrelation the signal is delayed and multiplied with
itself for different time lags l. The autocorrelation function
has peaks at the lags at which the signal is self-similar.
Definition

r_x[l] = Σ_{n=0}^{N−1} x[n] x[n + l],   l = 0, 1, . . . , L − 1

Efficient Computation
X[f] = DFT{x(t)}
S[f] = X[f] X*[f]
R[l] = DFT^{−1}{S[f]}
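A minimal sketch of autocorrelation-based F0 estimation using the FFT route above (NumPy assumed; the test tone, search range and frame length are arbitrary choices, not part of the original slides).

import numpy as np

def f0_autocorrelation(x, sr, fmin=50.0, fmax=1000.0):
    n = len(x)
    X = np.fft.rfft(x, 2 * n)                # zero-pad to avoid circular wrap
    r = np.fft.irfft(X * np.conj(X))[:n]     # autocorrelation for lags 0..n-1
    lo, hi = int(sr / fmax), int(sr / fmin)  # plausible lag range
    lag = lo + np.argmax(r[lo:hi])           # strongest non-trivial peak
    return sr / lag

sr = 22050
t = np.arange(int(0.1 * sr)) / sr
x = np.sin(2 * np.pi * 261.6 * t)            # C4 sine test tone
print(round(f0_autocorrelation(x, sr), 1))   # close to 261.6 Hz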
Autocorrelation examples
Average Magnitude Difference Function
The average magnitude difference function also shifts the
signal but instead of multiplication uses subtraction to detect
periodicities as nulls. The absence of multiplications makes it
efficient for DSP chips and real-time processing.
Definition

AMDF(m) = Σ_{n=0}^{N−1} |x[n] − x[n + m]|^k
Figure : C4 Sine
Figure : C4 Clarinet Note
AMDF Examples
Figure : C4 Sine
Figure : C4 Clarinet Note

Frequency Domain Pitch Detection
Fundamental frequency (as well as pitch) will correspond to
peaks in the spectrum (not necessarily the highest one though).
Figure : C4 Sine
Figure : C4 Clarinet Note
Plotting over time
Figure : Spectrogram
Figure : Correlogram
[Sound]

Modern pitch detection
Modern pitch detection algorithms are based on the basic
approaches we have presented but with various enhancements
and extra steps to make them more effective for the signals of
interest. Open source and free implementations are available.
YIN: from the "yin" and "yang" of oriental philosophy,
alluding to the interplay between autocorrelation and
cancellation.
SWIPE: a sawtooth waveform inspired pitch estimator
based on matching spectra
Pitch Perception
Pitch is not just fundamental frequency
Periodicity or harmonicity or both ?
How can perceived pitch be measured ? A common
approach is to adjust a sine wave until it matches
In 1924 Fletcher observed that one can still hear a pitch
when playing harmonic partials missing the fundamental
frequency (i.e. bass notes on small radios)

Duplex theory of pitch perception
Proposed by J.C.R. Licklider in 1951 (also a real visionary
regarding the future of computers)
One perception but two overlapping mechanisms:
Counting cycles of a period: < 800 Hz
Place of excitation along the basilar membrane: > 1600 Hz
The human auditory system
Incoming sound generates a wave in the fluid-filled cochlea,
causing the basilar membrane to be displaced (about 15000
inner hair cells). Originally it was thought that the cochlea
acted as a frequency analyzer similar to the Fourier transform
and that the perceived pitch was based on the place of highest
excitation. Evidence from both perception and biophysics
showed that pitch perception can not be explained solely by
the place theory.

Auditory Models
From "On the importance of time: a temporal representation
of sound" by Malcolm Slaney and R. F. Lyon.
Perceptual Pitch Scales
Attempt to quantify the perception of frequency
Typically obtained through just noticeable difference
(JND) experiments using sine waves
All agree that perception is linear in frequency below a
certain breakpoint and logarithmic above it, but disagree
on what that breakpoint is (popular choices include 1000,
700, 625 and 228 Hz)
Examples: Mel, Bark, ERB

Musical Pitch
In many styles of music a set of finite and discrete
frequencies is used rather than the whole frequency
continuum.
The fundamental unit that is subdivided is the octave
(ratio of 2 in frequency).
Tuning systems subdivide the octave logarithmically into
distinct intervals
Tension between harmonic ratios for consonant intervals,
the desire to modulate to different keys, regularity, and
the presence of pure fifths (ratio of 1.5 or 3:2)
Pitch Helix
Pitch perception has two dimensions:
Height: naturally organizes pitches from low to high
Chroma: represents the inherent circularity of pitch (octaves)
Linear pitch (i.e. log(frequency)) can be wrapped around a
cylinder to model the octave equivalence.

From frequency to musical pitch
Sketch of a simple pitch detection algorithm:
Perform the FFT on a short segment of audio, typically
around 10-20 milliseconds
Select the bin with the highest peak
Convert the bin index k to a frequency f in Hertz:

f = k * (Sr / N)

where Sr is the sampling rate, and N is the FFT size.
Map the value in Hertz to a MIDI note number:

m = 69 + 12 log2(f / 440)
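A minimal sketch of the simple pitch detector outlined above (NumPy assumed; the A4 test tone and frame length are arbitrary choices): FFT of a short frame, pick the strongest bin, convert to Hz and MIDI.

import numpy as np

def naive_pitch(frame, sr):
    N = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(N)))
    k = 1 + np.argmax(mag[1:])             # skip the DC bin
    f = k * sr / N                         # bin index -> Hz
    midi = 69 + 12 * np.log2(f / 440.0)    # Hz -> MIDI note number
    return f, midi

sr = 44100
t = np.arange(2048) / sr                   # ~46 ms frame
frame = np.sin(2 * np.pi * 440.0 * t)      # A4 test tone
print(naive_pitch(frame, sr))              # ~430.7 Hz (limited by the ~21.5 Hz
                                           # bin resolution), MIDI ~68.6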
Chant analysis
Computational Ethnomusicology
Transition from oral to written transmission
Study how diverse recitation traditions having their origin
in primarily non-notated melodies later became codified
Cantillion - joint work with Daniel Biro [Link]

Query by Humming (QBH)
The user sings a melody [Musart QBH examples]
The computer searches a database of reference tracks for a
track that contains the melody
Monophonic pitch extraction is the first step
Many more challenges: difficult queries, variations, tempo
changes, partial matches, efficient indexing
Commercial implementation: Midomi/SoundHound
Academic search engine for classical music: Musipedia
Summary
There are many fundamental frequency estimation
(sometimes also called pitch detection) algorithms
It is important to distinguish between fundamental
frequency, measured pitch and perceived pitch
F0 estimation algorithms can roughly be categorized as
time-domain, frequency-domain and perceptual
Query-by-humming requires a monophonic pitch
extraction step
Chant analysis is another more academic application

State-space representation
Key idea
Model everything you want to know about a process of
interest that changes behavior over time as a vector of
numbers indexed by time
Representations for music tracks
A music track can be represented as a:
Trajectory of feature vectors over time
Cloud (or bag) of feature vectors (unordered, time
ordering lost)
Single feature vector (or point in N-dimensional space)

Short-time Fourier Transform
Windowing
Repetition introduces discontinuities at the boundaries of the
repeated portions that cause artifacts in the DFT computation.
The impact of these artifacts can be reduced by windowing.
Figure: (a) Basis function (b) Time domain waveform (c) Windowed sinusoid (d) Windowed waveform
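A minimal sketch of a windowed STFT magnitude spectrogram (NumPy assumed; the Hann window, frame size, hop size and test tone are arbitrary choices).

import numpy as np

def stft_magnitude(x, win_size=1024, hop=512):
    window = np.hanning(win_size)          # reduces boundary discontinuities
    frames = [x[i:i + win_size] * window
              for i in range(0, len(x) - win_size, hop)]
    return np.abs(np.array([np.fft.rfft(f) for f in frames]))  # frames x bins

sr = 22050
x = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)   # one second of A4
S = stft_magnitude(x)
print(S.shape)                             # (num_frames, 513)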
History
The origins of audio features for music processing lie in
speech processing
They have also been informed by work on characterizing
timbre
Eventually features that are music specific, such as chroma
vectors and rhythmic pattern descriptors, were added
Unlike measured pitch, audio features do not necessarily
have a direct perceptual correlate

The big picture
Reducing information through frequency summarization and
temporal summarization.
Frequency Summarization
Centroid
The "center of gravity" of the spectrum. Correlates with pitch
and "brightness".

C_n = ( Σ_{k=0}^{N−1} k |X_n[k]| ) / ( Σ_{k=0}^{N−1} |X_n[k]| )

where n is the frame index, N is the DFT size, and |X_n[k]| is
the magnitude spectrum at bin k.
Rolloff
The frequency R_n below which 85% of the energy in the
magnitude spectrum is concentrated:

Σ_{k=0}^{R_n − 1} |X_n[k]| = 0.85 · Σ_{k=0}^{N−1} |X_n[k]|

Mel-Frequency Cepstral Coefficients
Widely used in automatic speech recognition (ASR) as they
provide a somewhat speaker/pitch invariant representation of
phonemes.
Cepstrum
Measure of periodicity of the frequency response plot
S(e^{jθ}) = H(e^{jθ}) E(e^{jθ}) where H is a linear filter and E is
an excitation
log(|S(e^{jθ})|) = log(|H(e^{jθ})|) + log(|E(e^{jθ})|)
Homomorphic transformation - the convolution of two
signals becomes equivalent to the sum of their cepstra
Aims to deconvolve the signal (low coefficients model the
filter shape, high order coefficients the excitation with
possible F0)

DCT
Strong energy compaction, i.e. few coefficients are required
to reconstruct most of the energy of the original signal
For certain types of signals it approximates the
Karhunen-Loeve transform (the theoretically optimal
orthogonal basis)
"Low" coefficients represent most of the signal and higher
ones can be discarded, i.e. set to 0
MFCCs keep the first 13-20 coefficients
The MDCT (overlap-based) is used in MP3, AAC, and Vorbis
audio compression
Temporal Summarization
A variety of terms have been used to describe methods that
summarize a sequence of feature values over time. Typical
frame size (10-20 msecs), texture window size (1-3 secs).
Texture windows
Aggregates
Modulation features (when detecting modulation)
Dynamic features (∆)
Temporal feature integration
Fluctuation patterns
Pooling (from Neural Networks terminology)
Song-level (when summarization is across the track)

A texture window of size M, starting at feature index n, can
be summarized by the mean and standard deviation of

T[n] = (F[n − M + 1], . . . , F[n])
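A minimal sketch of texture-window summarization (NumPy assumed; the window size and the random stand-in for frame-level features are arbitrary choices): each window of M frame-level feature vectors is summarized by its mean and standard deviation.

import numpy as np

def texture_summary(features, M=43):
    # features: (num_frames, num_dims); returns (num_windows, 2*num_dims)
    out = []
    for n in range(M - 1, len(features)):
        window = features[n - M + 1:n + 1]          # T[n] = (F[n-M+1],...,F[n])
        out.append(np.concatenate([window.mean(axis=0), window.std(axis=0)]))
    return np.array(out)

frames = np.random.default_rng(0).standard_normal((500, 4))  # fake features
print(texture_summary(frames).shape)                          # (458, 8)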
Figure: Centroid over time for Beatles and Debussy excerpts: (e) Centroid, (f) Mean Centroid, (g) Std Centroid
[Sound] [Sound]
Pitch Histograms
Average the amplitudes of DFT bins mapping to the same
MIDI note number (different averaging shapes can be
used, for example triangles or Gaussians)
If desired, "fold" the resulting histogram, collapsing bins
that belong to the same pitch class into one
Frequently more than 12 bins per octave are used to account
for tuning and performance variations
Alternatively, multiple pitch detection can be performed
and the detected pitches added to a histogram

Figure: Pitch Histograms of two Jazz pieces (left column) and
two Irish Folk music pieces (right column)
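A minimal sketch of the computation outlined above (NumPy assumed; simple rounding is used instead of triangular or Gaussian averaging shapes, and the C4 test tone is an arbitrary choice): accumulate DFT bin magnitudes into a MIDI-note histogram, then fold it into 12 pitch classes.

import numpy as np

def pitch_histogram(mag, sr, N):
    hist = np.zeros(128)
    freqs = np.arange(1, len(mag)) * sr / N               # skip the DC bin
    midi = np.rint(69 + 12 * np.log2(freqs / 440.0)).astype(int)
    valid = (midi >= 0) & (midi < 128)
    np.add.at(hist, midi[valid], mag[1:][valid])          # sum magnitude per note
    return hist

sr, N = 22050, 4096
x = np.sin(2 * np.pi * 261.6 * np.arange(N) / sr)         # C4 test tone
mag = np.abs(np.fft.rfft(x * np.hanning(N)))
h = pitch_histogram(mag, sr, N)
folded = np.array([h[pc::12].sum() for pc in range(12)])  # fold to pitch classes
print(np.argmax(h), np.argmax(folded))                    # 60 (C4), 0 (C)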
Pitch Helix and Chroma
Chroma Profiles
Chroma profile: 12 bins, start with A, chromatic spacing
Chroma of C4 sine
Chroma Profiles
Chroma of C4 clarinet
Chroma profile: 12 bins, start with A, chromatic spacing
Sine melody
Clarinet melody
[Sound]

Summary
The Short Time Fourier Transform with windowing forms
the basis of extracting time-frequency representations
(magnitude spectrograms) from audio signals
The process of audio feature extraction consists of
summarizing in various ways the information in the
frequency dimension and across the time dimension
Originally audio features used in MIR were inspired by
automatic speech recognition (MFCCs) and psychological
investigations of timbre (centroid, rolloff, flux). Additional
features capturing information specific to music, such as
Chroma and Pitch Histograms, have been proposed.
Introduction
Primary focus of traditional musicology
Music notation and theory are complex topics that can
take many years to master
This presentation barely scratches the surface of the
subject
The main goal is to provide enough background for
students with no formal music training to be able to read
and understand MIR papers that use terminology from
music notation and theory
It is never too late to get some formal music training

History of Music Notation
The earliest known form of music notation is a cuneiform
Sumerian tablet from around 2000 BC.
Initially a mnemonic aid to oral instruction, performance
and transmission, it evolved into a codified set of
conventions that transformed how music was created,
distributed and consumed across time and space.
Notation can be viewed as a visual representation of
instructions for how to perform on an instrument. Tablature
notation, for example, is specific to stringed instruments.
Western Common Music Notation
Originally used in European Classical Music, it is currently
used in many genres around the world
Mainly encodes pitch and timing (to a certain degree
designed for keyboard instruments)
Considerable freedom in interpretation
Five staff lines

Notating rhythm
Symbols indicate relative durations in terms of multiples
(or fractions) of an underlying regular pulse
If the tempo is specified then exact durations can be
computed (for example the first symbol would last 60
seconds / 85 BPM = 0.706 seconds)
A different set of symbols is used to indicate rests
Numbers under the symbols indicate the duration in terms of
eighth notes. Each measure is subdivided into 2 half
notes, 4 quarter notes, or 8 eighth notes.
Time signature and measures
Measure (or bar) lines indicate regular groupings of notes
The time signature shows the rhythmic content of each
measure
Compound rhythms consist of smaller rhythmic units

Notating pitches
A clef sign anchors the five staff lines to a particular pitch
Note symbols are either placed on staff lines or between
staff lines.
Successive note symbols (one between lines followed by
one on a staff line or the other way around) correspond to
successive white notes on a keyboard.
Invisible staff lines extend above and below
Repeat signs and structure
Repeat signs and other notation conventions can be
thought of as a “proto” programming language providing
looping constructs and goto statements
Hierarchical structure is common i.e ABAA form
Structure = segmentation + similarity
Structure of Naima by J. Coltrane
Intervals
Intervals are pairs of pitches
Melodic when the pitches are played in succession
Harmonic when the pitches are played simultaneously
Uniquely characterized by number of semitones (although
typically named using a more complex system)
(microtuning also possible)
Naming of intervals
The most common naming convention for intervals uses two
attributes to describe them: quality and number.
Quality: perfect, major, minor, augmented, diminished.
Number: unison, second, third, fourth, fifth, sixth, seventh,
octave; based on counting staff positions

Scales and Modes
A scale is a sequence of intervals, typically consisting of whole
tones and semitones and spanning an octave. Diatonic scales
are the ones that can be played using only the white keys on a
piano. They are called modes and have ancient Greek names.
Enharmonic Spelling
The naming of intervals (and absolute pitches) is not unique,
meaning that the same exact note can have two different
names, as in C# and Db. Similarly, the same interval can be a
minor third or an augmented second. The spelling comes from
the role an interval plays as part of a scale as well as the
historical tuning practice of having different frequency ratios
for enharmonic intervals.

Major/Minor Scales
The scales used in composed Western classical music are
primarily the major and minor scales. The harmonic minor
scale has an augmented second (A) that occurs between the
6th and 7th tone.
Chords
A chord is a set of two or more notes that sound
simultaneously. A chord label can also be applied to a music
excerpt (typically a measure) by inferring, using various rules
of harmony, what theoretical chord would sound "good" with
the underlying music material. The basis of the Western
classical and pop music chord system is the triad, consisting of
three notes. Different naming schemes are used for chords.
Jazz and Pop music frequently use naming based on the triad
with additional modifiers for the non-triad notes.

Root, Inversions, Voicings
The lowest note of a chord in its "default" position is called
the root. Inversions occur when the lowest note of a chord is
different than the root. Voicings are different arrangements
of the chord notes that can include repeated notes as well as
octaves.
Chord Progressions and Harmony
Sequences of chords are called chord progressions. Certain
progressions are more common than others and also indicate
the key of a piece. Frequently chords are constructed from
subsets of notes from a particular scale. The root of the scale
is called the tonic and defines the key of the piece. For
example a piece in C Major will mostly consist of chords
formed by the notes of the C major scale. Modulation refers
to a change in key. Chords have specific qualities and
functions which are studied in harmonic analysis.

Jazz Lead Sheets
TuneDex
Pianoroll
MIDI
Musical Instrument Digital Interface
MIDI is both a communication protocol (and associated file
format) as well as a hardware connector specification that
allows the exchange of information between electronic musical
instruments and computers. It was developed in the early 80s
and was mostly designed with keyboard instruments in mind.
Essentially a piano-roll representation of music.

Lilypond
Music engraving program
Text language for input that is compiled
Encodes much more than just notes and durations in order
to produce a visual musical score
Produces beautiful looking scores and is free
Music XML
Extensible Markup Language (XML) format for
interchanging information about scores
Supported by more than 170 notation and score-writing
applications
Proprietary but open specification
Hard to read but comprehensive

jSymbolic - jMIR
Software in Java for extracting high level musical features
from symbolic music representations, specifically MIDI files
Features capture aspects of instrumentation, texture,
rhythm, dynamics, pitch statistics, melody, and chords
Part of jMIR, a more general package for MIR including
audio, lyrics, and web feature extraction as well as a
classification engine
music21
Set of tools written in Python for computer aided
musicology
The included corpora are a great feature
Works with MusicXML, MIDI
Example: add the German name (i.e., B♭=B, B=H, A♯=Ais)
under each note of a Bach chorale (see the sketch below)

music21 pitch/duration distribution
Distribution of pitches and note durations for a Chopin
Mazurka using music21.
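A hedged sketch of the German-name example mentioned above, using music21 (the corpus path, the recurse/addLyric calls and the Pitch.german attribute are assumed from the music21 documentation, not from the original slides).

from music21 import corpus

chorale = corpus.parse('bach/bwv66.6')        # a chorale from the built-in corpus
for n in chorale.recurse().getElementsByClass('Note'):
    n.addLyric(n.pitch.german)                # e.g. B-flat -> "B", B -> "H"
chorale.show('text')                          # or chorale.show() for notation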
Query-by-Example
Bag-of-frames Distance Measures
In this apporach to music recommendation, the user provides a
query (or seed) music track as input and the system returns a
list of tracks that is ranked by their audio-based similarity to
the query.
[Query (Mr. Jones by Talking Head) ]
Top 3 ranked list results
using different feature sets:
[Spectral]
[Rhythm]
[Combined]
G. Tzanetakis
G. Tzanetakis
131 / 226
Each track can be represented as a bag of feature vectors
modeled by a probability density function. Therefore we need
some way of measuring the similarity between distributions.
Definition
The Kullback-Leibler (KL) divergence, also known as the
relative entropy, between two probability density functions
f(x) and g(x) is

D(f || g) = ∫ f(x) log( f(x) / g(x) ) dx
KL divergence and Earth Mover's distance
The symmetric KL divergence can be formed by taking the
average of the divergences D(f||g) and D(g||f). For
Gaussians the KL divergence has a closed form solution, but for
other distribution models such as Gaussian Mixture Models no
such closed form exists. Monte Carlo estimation can be used
in these cases.
Another common possibility is to use the Earth Mover's
Distance. Informally, if the distributions are interpreted as
two different ways of piling up dirt, the EMD is the minimum
cost of turning one pile into the other, where the cost is
assumed to be the amount of dirt moved times the distance by
which it is moved.
Some examples
Single vector per track (72 dimensions) with Euclidean
distance over max/min normalized feature vectors. Results
from a 2000 experiment using 3500 mp3 clips, each 30 seconds
long, from my personal collection (diverse but no Kenny G).
HipHop [Query] [Results]
Reggae [Query] [Results]
Piano [Query] [Results]
World [Query] [Results]

More extreme examples
In these examples I tried to find queries that were atypical and
I could not think of good matches in my collection. I find the
results fascinating as they reveal, to some extent, what
aspects the system is capturing.
African [Query] [Results]
Dreamer (Supertramp) [Query] [Results]
Idle Chatter (Computer Music) [Query] [Results]
Tuva throat singing [Query] [Results]

Genres
Definition
Genres are categorical labels used by humans to organize
music into distinct categories. During the age of physical
recordings they were also used to physically organize the
spatial layout in music stores.
Genres are fluid and change over time. Top level genres are
not as subjective, but more specific genres can be very specific
to the music listeners of that genre (for example subgenres of
heavy metal or electronic dance music). Check out [Ushkur’s
Guide to Electronic Music].
Where do Genres come from ?
Artists: for example bluegrass originates from the Blue
Grass Boys, named after Kentucky, “the Blue Grass
State”.
Records: for example free jazz from Ornette Coleman’s
1960 album of the same name
Lyrics: Old-school DJ Lovebug Starski claims to have
coined the term hip-hop by rhyming “hip-hop, hippy to
the hippy hop-bop”.
Record labels: Industrial named after Throbbing
Gristle’s imprint.
Journalists: Rhythm and blues when Jerry Wexler, a
Billboard editor, began using it instead of “Race
Records”.

Automatic Musical Genre Classification
Given as input an audio recording of a track, without any
associated meta-data, determine what genre it belongs to from
a set of predefined genre labels.
Four stages:
Ground truth acquisition
Audio feature extraction
Song representation and classification
Evaluation
Scanning the dial user study
Inspired by circular radio dials - how long does it take to
decide whether to listen to a particular channel or scan for
another one ? The study was conducted in 1999 (still early
days of digital music; it would have been very difficult to
conduct with analog media). Snippets (250, 325, 400, 475,
3000 milliseconds) were played to 52 subjects.
Genres: Blues, Classical, Country, Dance, Jazz, Latin, Pop,
R&B, Rap, Rock
[250 msec collage] [3000 msec collage] Classical, HipHop, Jazz

Scanning the dial - results
(Perrott and Gjerdingen, 2008)
At ceiling performance (3000 ms) participants agreed with the
genres assigned by music companies about 70% of the time
(that does not mean they were wrong). Even at 250
milliseconds prediction accuracy (43%) was significantly better
than chance (10%).
Ground Truth Acquisition
Most common approach: use an “authoritative source” such
as Amazon or the All Music Guide.
Custom hierarchy that is rationally defined (the
Esperanto of music genres)
Gjerdingen “scan-the-dial” user study - no perfect
agreement with ground truth - 70% was the best
User study involving multiple subjects - use the majority as
ground truth and investigate how much inter-subject
agreement there is
Clusters of listeners possibly utilizing external sources of
information - different notions of genres

Audio Feature Extraction
Timbral Features (Spectral, MFCC)
Rhythmic features
Pitch content features
Track Representation and Classification I
Song level features are the easiest approach. In this approach
each track is represented by a single aggregate feature vector
that characterizes it. Each “genre” is represented by the set
of feature vectors of the tracks in the training set that are
labeled with that genre. Standard data mining classifiers can
be applied without any modification if this approach is used.

Track Representation and Classification II
If each track is represented as a sequence of feature vectors,
one possibility is to perform short-term classification on smaller
segments and then aggregate the results using majority or
weighted majority voting. Check [Genremeter from 2000]
In bag-of-frames each track is modeled as a probability density
function and the distance between pdfs needs to be estimated.
The KL-divergence can be used either in closed form or
numerically approximated; Monte Carlo methods can also be
used.
Comparison of human and automatic genre classification
(Lippens et al., 2004)
Evaluation I
Audio-based classification evaluation is done using standard
classification metrics such as classification accuracy. Artist
filtering, i.e. tracks by the same artist are either all allocated to
the training set or all allocated to the testing set when
performing cross-validation, is used to avoid the “album” effect.
MIREX
The Music Information Retrieval Evaluation eXchange is an
annual event in which different MIR algorithms contributed by
groups from around the world are evaluated using a variety of
metrics on different tasks, which include several audio-based
classification tasks.

MIREX results
Best performing algorithms in MIREX audio classification
tasks in 2009 and 2013. Part of the improvement might be due
to overfitting - universal background model.

Task           Tracks  Classes  2009   2013
Genre          7000    10       66.41  76.23
Genre (Latin)  3227    10       65.17  77.32
Audio Mood     600     5        58.2   68.33
Composer       2772    11       53.25  69.70
Issues with automatic genre classification
Ill-defined problem: There is too much subjectivity in
how genre is perceived and exclusive allocation does not
make sense in many cases. Evaluation metrics based on
ground truth do not take into account how mistakes
would be perceived by humans (the WTF factor).
Mini-genres are more useful but also more subjective.
Dataset saturation: Public datasets are important for
comparison of different systems. The GTZAN dataset has
been used a lot in genre classification but has many
limitations as it is rather small, was collected informally
for personal research, and has issues such as corrupted
files and duplicates. Introduction of a new curated dataset
would make comparison with previous systems harder.
Glass ceiling: it has been observed that genre
classification results have improved only incrementally, and
using variations of the “classic” audio descriptors and
“classic” classifiers only results in very minor changes.
Open questions: can new higher level descriptors or better
machine learning improve classification ? What would the
human glass ceiling be ?
Is it useful ? A lot of music available both in physical
media and digital stores is already annotated by genre.
With powerful music recommendation systems and no
physical stores why bother ? It can also be viewed as a
special case of tag annotation.
Genre Hierarchies
Bayesian Aggregation - Bayesian Networks

P(y1, . . . , yN) = Π_{i=1}^{N} P(yi | y_parents(i))    (3)
Bayesian Aggregation - Raw Accuracy
Single label, multi-class “raw accuracy” (38 classes). Using
symbolic data (from MIREX 2005).

Music Emotion
Categorical: each selection is classified into an “emotion”
class (basically classification)
Emotion variation detection: the emotion is “tracked”
continuously within a music selection
Emotion recognition: predict arousal and valence (or some
other continuous space) for each music piece
MIREX Mood Clusters
Valence-arousal emotion space
Emotion space prediction
Support vector regression nonlinearly maps the input feature
vectors to a higher dimensional feature space using the kernel
trick and yields prediction functions based on the support
vectors. It is an extension of the well known support vector
classification algorithm.
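A hedged sketch of this idea (scikit-learn assumed): two support vector regressors predict valence and arousal from audio feature vectors. The features and annotations here are synthetic stand-ins, not real data.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))                   # 200 tracks, 20 features
valence = X[:, 0] * 0.5 + rng.normal(0, 0.1, 200)    # fake annotations
arousal = X[:, 1] * 0.5 + rng.normal(0, 0.1, 200)

valence_model = SVR(kernel='rbf').fit(X[:150], valence[:150])
arousal_model = SVR(kernel='rbf').fit(X[:150], arousal[:150])

print(valence_model.score(X[150:], valence[150:]),   # R^2 on held-out tracks
      arousal_model.score(X[150:], arousal[150:]))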
Tags
Definition
A tag is a short phrase or word that can be used to
characterize a piece of music. Examples: “bouncy”, “heavy
metal”, or “hand drums”. Tags can be related to instruments,
genres, emotions, moods, usages, geographic origins,
musicological terms, or anything the users decide.
Similarly to a text index, a music index associates music
documents with tags. A document can be a song, an album, an
artist, a record label, etc. We consider songs/tracks to be our
musical documents.
Music Index
Vocabulary    s1   s2   s3
happy         .8   .2   .6
pop           .7   0    .1
a capella     .1   .1   .5
saxophone     0    .7   .9

A query can either be a list of tags or a song. Using the music
index the system can return a playlist of songs that somehow
“match” the specified tags.
Note:
Evaluation is a big challenge due to subjectivity.
Tags generalize classification labels

Tag research terminology
Cold-start problem: songs that are not annotated can
not be retrieved.
Popularity bias: songs in the short head tend to be
annotated more thoroughly than unpopular songs in the
long tail.
Strong labeling versus weak labeling.
Extensible or fixed vocabulary.
Structured or unstructured vocabulary.
Many thanks to
Material for these slides was generously provided by:
Doug Turnbull
Emanuele Coviello
Mohamed Sordo

Tagging a song
Tagging multiple songs
Text query
Sources of Tags
Human participation:
Surveys
Social Tags
Games
Automatic:
Text mining
Autotagging

Survey
Pandora: a team of approximately 50 expert music reviewers
(each with a degree in music and 200 hours of training)
annotate songs using a structured vocabulary of between 150
and 200 tags. Tags are “objective”, i.e. there is a high degree
of inter-reviewer agreement. Between 2000 and 2010, Pandora
annotated about 750,000 songs. Annotation takes
approximately 20-30 minutes per song.
CAL500: one song from each of 500 unique artists, each
annotated by a minimum of 3 nonexpert reviewers using a
structured vocabulary of 174 tags. Standard dataset for
training and evaluating tag-based retrieval systems.
Harvesting social tags
Last.fm is a music discovery Web site that allows users to
contribute social tags through a text box in their audio player
interface. It is an example of crowdsourcing. In 2007, 40
million active users built up a vocabulary of 960,000 free-text
tags and used it to annotate millions of songs. All data is
available through a public web API. Tags typically annotate
artists rather than songs. Problems with multiple spellings,
polysemous tags (such as progressive).

Last.fm tags for Adele
Playing Annotation Games
At ISMIR 2007, music annotation games were presented for
the first time: ListenGame, Tag-a-Tune, and MajorMiner.
ListenGame uses a structured vocabulary and is real time.
Tag-a-Tune and MajorMiner are inspired by the ESP Game for
image tagging. In this approach the players listen to a track
and are asked to enter “free text” tags until they both enter
the same tag. This results in an extensible vocabulary.

Tag-a-tune
Mining web documents
There are many text sources of information associated with a
music track. These include artist biographies, album reviews,
song reviews, social media posts, and personal blogs. The set
of documents associated with a song is typically processed by
text mining techniques resulting in a vector space
representation which can then be used as input to data
mining/machine learning techniques (text mining will be
covered in more detail in a future lecture).

cal500.sness.net
Audio feature extraction
Audio features for tagging are typically very similar to the ones
used for audio classification, i.e. statistics of the short-time
magnitude spectrum over different time scales.

Bag of words for text
Bag of words for audio
Multi-label classification (with twists)
“Classic” classification is single label and multi-class. In
multi-label classification each instance can be assigned more
than one label. Tag annotation can be viewed as multi-label
classification with some additional twists:
Synonyms (female voice, woman singing)
Subpart relations (string quartet, classical)
Sparse (only a small subset of tags applies to each song)
Noisy
Useful because:
Cold start problem
Query-by-keywords
Machine Learning for Tag Annotation
A straightforward approach is to treat each tag independently
as a classification problem.

Tag models
Identify the songs associated with tag t
Merge all their features either directly or by model merging
Estimate p(x|t)
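A hedged sketch of the "one classifier per tag" approach (scikit-learn assumed; the features, labels, classifier choice and tag count are arbitrary stand-ins): each tag gets its own binary classifier, and its predicted probability serves as a tag affinity.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 30))            # 300 tracks, 30 audio features
Y = (rng.random((300, 5)) < 0.2).astype(int)  # fake binary labels for 5 tags

tag_models = [LogisticRegression(max_iter=1000).fit(X, Y[:, t])
              for t in range(Y.shape[1])]

# Tag affinities for a new track: probability of each tag being "on"
x_new = rng.standard_normal((1, 30))
affinities = np.array([m.predict_proba(x_new)[0, 1] for m in tag_models])
print(affinities.round(2))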
Direct multi-label classifiers
Tag co-occurrence
Alternatives to individual tag classifiers:
K-NN multi-label classifier: a straightforward extension
that requires a strategy for label merging (union or
intersection are possibilities)
Multi-layer perceptron: simply train directly with
multi-label ground truth
G. Tzanetakis
179 / 226
G. Tzanetakis
180 / 226
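A minimal sketch of the second alternative, a multi-layer perceptron trained directly on multi-label ground truth, using scikit-learn; random data stands in for real audio features and tags.

```python
# Illustrative sketch: an MLP fit directly on a binary multi-label matrix.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 26))                        # stand-in feature vectors
Y = (rng.random(size=(200, 5)) < 0.2).astype(int)     # sparse binary tag matrix

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
clf.fit(X[:150], Y[:150])
print(clf.predict(X[150:]).shape)                     # (#test tracks, #tags)
```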
Stacking
G. Tzanetakis
Stacking II
181 / 226
How stacking can help?
G. Tzanetakis
182 / 226
Other terms/variants
The main idea behind stacking, i.e. using the output of a
classification stage as the input to a subsequent classification
stage, has been proposed under several different names
(a small sketch follows the list):
Correction approach (using binary outputs)
Anchor classification (for example, classification into
artists used as a feature for genre classification)
Semantic space retrieval
Cascaded classification (in computer vision)
Stacked generalization (in the classification literature)
Context modeling (in autotagging)
Cost-sensitive stacking (variant)
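A minimal sketch of the stacking idea, assuming a feature matrix X and binary tag matrix Y; logistic regression is an arbitrary choice for both stages, and every tag is assumed to have positive and negative examples. In practice the first-stage affinities fed to the second stage are usually produced with cross-validation to avoid overfitting.

```python
# Illustrative sketch of stacking: first-stage per-tag classifiers produce
# affinities, which become the input features of the second stage.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stacked(X, Y):
    stage1 = [LogisticRegression(max_iter=1000).fit(X, Y[:, t])
              for t in range(Y.shape[1])]
    A = np.column_stack([c.predict_proba(X)[:, 1] for c in stage1])   # tag affinities
    stage2 = [LogisticRegression(max_iter=1000).fit(A, Y[:, t])
              for t in range(Y.shape[1])]
    return stage1, stage2

def predict_stacked(stage1, stage2, X):
    A = np.column_stack([c.predict_proba(X)[:, 1] for c in stage1])
    return np.column_stack([c.predict_proba(A)[:, 1] for c in stage2])
```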
G. Tzanetakis
183 / 226
G. Tzanetakis
184 / 226
Datasets
Combining taggers/bag of systems
There are several datasets that have been used to train and
evaluate auto-tagging systems. They differ in the amount of data
they contain and in the source of the ground-truth tag information:
Major Miner
Magnatagatune
CAL500 (the most widely used one)
CAL10K
MediaEval
Reproducibility: a common dataset is not enough; ideally, exact
details about the cross-validation folding process and the
evaluation scripts should also be included.
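One simple way to make the folding reproducible, sketched below: fix the random seed and publish the resulting fold indices alongside the dataset (the file name and fold count here are arbitrary).

```python
# Illustrative sketch: fix the random seed and save the fold indices so others
# can reuse exactly the same cross-validation split.
import json
import numpy as np
from sklearn.model_selection import KFold

n_tracks = 500                                   # e.g. the size of CAL500
kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = [{"train": tr.tolist(), "test": te.tolist()}
         for tr, te in kf.split(np.arange(n_tracks))]
with open("folds.json", "w") as f:
    json.dump(folds, f)
```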
G. Tzanetakis
185 / 226
G. Tzanetakis
186 / 226
Magnatagatune
26K sound clips from magnatune.com
Human annotation from the Tag-a-tune game
Audio features from the Echo Nest
230 artists
183 tags
G. Tzanetakis
187 / 226
CAL-10K Dataset
Number of tracks: 10866
Tags: 1053 (genre and acoustic tags)
Tags/Track: min = 2, max = 25, µ = 10.9, σ = 4.57, median = 11
Most used tags: major key tonality (4547), acoustic rhythm guitars (2296),
a vocal-centric aesthetic (2163), extensive vamping (2130)
Least used tags: cocky lyrics (1), psychedelic rock influences (1),
breathy vocal sound (1), well-articulated trombone solo (1), lead flute (1)
Tags collected using a survey
Available at:
http://cosmal.ucsd.edu/cal/projects/AnnRet/
G. Tzanetakis
188 / 226
Tagging evaluation metrics
Annotation vs retrieval
The inputs to an autotagging evaluation metric are the
predicted tags (a #tags by #tracks binary matrix) or tag
affinities (a #tags by #tracks matrix of reals) and the
associated ground truth (a binary matrix).
One possibility would be to convert the matrices into vectors and
then use classification evaluation metrics. This approach has
the disadvantage that popular tags dominate and
performance on less-frequent tags (which one could argue are
more important) has little influence. Therefore, the common
approach is to treat each tag column separately and then
average across tags (retrieval), or alternatively to treat each
track row separately and average across tracks (annotation).
Asymmetry between positives and negatives makes
classification accuracy not a very good metric; retrieval
metrics are better choices. If the output of the auto-tagging
system is affinities, then many metrics require binarization.
Common binarization variants: select the k top-scoring tags for
each track, or threshold each column of tag affinities so as to
match the tag priors in the training set.
Validation schemes are similar to classification: cross-validation,
repeated cross-validation, and bootstrapping.
G. Tzanetakis
189 / 226
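A sketch of the two binarization variants, assuming an affinity matrix with one row per track and one column per tag; the function names are my own.

```python
# Illustrative sketch of the two binarization variants discussed above.
import numpy as np

def top_k_per_track(A, k=10):
    # keep only the k top-scoring tags of each track
    B = np.zeros_like(A, dtype=int)
    top = np.argsort(-A, axis=1)[:, :k]
    for i, cols in enumerate(top):
        B[i, cols] = 1
    return B

def threshold_to_priors(A, priors):
    # one threshold per tag so that the fraction of positive tracks matches
    # that tag's prior in the training set
    B = np.zeros_like(A, dtype=int)
    for t, p in enumerate(priors):
        thresh = np.quantile(A[:, t], 1.0 - p)
        B[:, t] = (A[:, t] >= thresh).astype(int)
    return B
```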
Annotation Metrics
Based on counting TP, FP, TN, FN:
Precision
Recall
F-measure
G. Tzanetakis
190 / 226
Annotation Metrics based on rank
When using affinities it is possible to use rank correlation
metrics:
Spearman’s rank correlation coefficient ρ
Kendall’s tau τ
G. Tzanetakis
191 / 226
G. Tzanetakis
192 / 226
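A sketch of the counting-based annotation metrics, computed per track and then averaged across tracks; Y_true and Y_pred are binary (#tracks, #tags) arrays. This is illustrative code, not the official evaluation script.

```python
# Illustrative sketch: per-track ("annotation") precision, recall and F-measure.
import numpy as np

def annotation_prf(Y_true, Y_pred):
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=1)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=1)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=1)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision.mean(), recall.mean(), f.mean()
```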
Retrieval measures - Mean Average Precision
Retrieval measures - AUC-ROC
Precision at N is the number of relevant songs among the top
N retrieved, divided by N. Rather than choosing a single N, one
can average the precision over different values of N and then
take the mean over a set of queries (tags), giving Mean Average
Precision.
G. Tzanetakis
193 / 226
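A sketch of the per-tag retrieval measures, average precision and AUC-ROC, computed for each tag (treated as a query) and then averaged across tags, using scikit-learn's metric functions; illustrative only.

```python
# Illustrative sketch: MAP and mean AUC-ROC over tags.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(Y_true, affinities):
    aps, aucs = [], []
    for t in range(Y_true.shape[1]):
        y = Y_true[:, t]
        if 0 < y.sum() < len(y):                 # skip degenerate tags
            aps.append(average_precision_score(y, affinities[:, t]))
            aucs.append(roc_auc_score(y, affinities[:, t]))
    return np.mean(aps), np.mean(aucs)
```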
Stacking results I
G. Tzanetakis
G. Tzanetakis
194 / 226
Stacking results II
195 / 226
G. Tzanetakis
196 / 226
Stacking results III
G. Tzanetakis
Stacking results IV
197 / 226
Stacking results V
G. Tzanetakis
198 / 226
MIREX Tag Annotation Task
The Music Information Retrieval Evaluation eXchange
(MIREX) audio tag annotation task started in 2008:
MajorMiner dataset (2300 tracks, 45 tags)
Mood tag dataset (6490 tracks, 135 tags)
10-second clips
3-fold cross-validation
Binary relevance (F-measure, precision, recall)
Affinity ranking (AUC-ROC, Precision at 3, 6, 9, 12, 15)
G. Tzanetakis
199 / 226
G. Tzanetakis
200 / 226
MIREX 2012 F-measure
G. Tzanetakis
MIREX 2012 AUC-ROC
201 / 226
History of MIREX tagging
G. Tzanetakis
202 / 226
Open questions
Should the tag annotations be sanitized or should the
machine learning part handle it?
Do auto-taggers generalize outside their collections?
Stacking seems to improve results (even though one
paper has shown no improvement). How does stacking
perform when dealing with synonyms, antonyms, or noisy
annotations? Why?
How can multiple sources of tags be combined?
G. Tzanetakis
203 / 226
G. Tzanetakis
204 / 226
Future work
Future work
Weak labeling: in most cases the absence of a tag does NOT
imply that the tag would not be considered valid by most users
Explore a continuous grading of semi-supervised learning
where the distinction between supervised and
unsupervised is not binary
Explore feature clustering of untagged instances
Include additional sources of information (separate from
tags) such as artist, genre, album
Multiple instance learning approaches (for example when
genre information is available at the album level)
Statistical relational learning
G. Tzanetakis
The lukewarm start problem: what if some tags are known for
the testing data but not all?
Missing-label approaches such as EM
Markov logic inference in structured data
Other ideas:
Online learning where tags enter the system incrementally
and individually rather than all at the same time or for a
particular instance
Taking into account user behavior when interacting with
a tag system
Personalization vs Crowd: would clustering users based on
their tagging make sense?
205 / 226
G. Tzanetakis
206 / 226
Extracting Context Information
Fan pages
Artist personal web pages
Music portals such as http://allmusic.com
Web 2.0 APIs (Last.fm, Pandora, Echonest)
Playlists
Peer-to-Peer Networks
General observation: there is a lot of information available but
it is noisy and not all of it is of good quality.
G. Tzanetakis
207 / 226
Guess the genre
Yeah I’m out that Brooklyn, now I’m down in TriBeCa
Right next to Deniro, but I’ll be hood forever
I’m the new Sinatra, and since I made it here
I can make it anywhere, yeah, they love me everywhere
I used to cop in Harlem, all of my Dominicano’s
Right there up on Broadway, pull me back to that McDonald’s
Took it to my stashbox, 560 State St.
Catch me in the kitchen like a Simmons with them Pastry’s
Cruisin’ down 8th St., off white Lexus
Drivin’ so slow, but BK is from Texas
Empire State of Mind, Jay Z
G. Tzanetakis
208 / 226
Guess the genre
You know a dream is like a river
ever changin’ as it flows
and a dreamer’s just a vessel
that must follow where it goes
trying to learn from what’s behind you
and never knowing what’s in store
makes each day a constant battle
just to stay between the shores
and i will sail my vessel
’til the river runs dry
like a bird upon the wind
these waters are my sky
River, Garth Brooks
G. Tzanetakis
209 / 226
Song Lyrics
Lyrics give information about the semantics of a piece of
music. They can also reveal aspects such as the artist’s
cultural background and style. A typical lyric analysis system
has the following stages and uses techniques from text
retrieval and bioinformatics:
Query: use a search engine to find web pages likely to
contain lyrics
Text extraction: removal of HTML tags, conversion to
lowercase
Alignment: determine pages that contain the same
lyrics, alignment of word pairs
Evaluation: ground truth can be obtained from CD
covers
G. Tzanetakis
210 / 226
Country of Origin
One possibility is to look specifically into artist pages and bios
(for example in Last.fm, Freebase, Wikipedia) for occurrences of
words indicating geographic origin.
An alternative approach is to query a search engine with pairs
of artist names and country names. A simple approach is to
count the number of page hits. A more sophisticated approach
uses ideas from text information retrieval, i.e. term
frequency - inverse document frequency (TF-IDF).
G. Tzanetakis
211 / 226
TF-IDF
Definition
Document Frequency (DF) df_{t,a} is the total number of Web
pages retrieved for artist a on which term t occurs at least
once.
Definition
Term Frequency (TF) tf_{t,a} is the total number of occurrences
of term t in all pages retrieved for a.
TFIDF_{t,a} = ln(1 + tf_{t,a}) × ln(1 + n / df_t)
G. Tzanetakis
212 / 226
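A sketch implementing the TF-IDF weight above; the data structure (a mapping from artists to lists of tokenized pages) is my own assumption, not part of the tutorial.

```python
# Illustrative sketch of TFIDF(t, a) = ln(1 + tf(t, a)) * ln(1 + n / df(t)).
import math
from collections import Counter

def tfidf_weights(pages_per_artist, artist, vocabulary):
    n = sum(len(pages) for pages in pages_per_artist.values())     # total pages
    tf = Counter(t for page in pages_per_artist[artist] for t in page)
    df = Counter(t for pages in pages_per_artist.values()
                 for page in pages for t in set(page))
    return {t: math.log(1 + tf[t]) * math.log(1 + n / df[t])
            for t in vocabulary if df[t] > 0}
```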
TF-IDF motivation
Measure of how important a word is to a document within a
corpus.
The motivation of TF-IDF is to increase the “weight” of t if it
occurs frequently in the webpages retrieved for a (for example
t = ’Greece’, a = ’Vangelis’ or t = ’and’, a = ’Vangelis’), but to
decrease it if t occurs in a large number of documents
retrieved for all a (for example t = ’and’).
Evaluation: precision and recall (for example, in a study by
Schedl et al. page-count estimates achieve 23% precision while
TF-IDF achieves 71%).
G. Tzanetakis
213 / 226
Bag of words and vector space model
Each document is represented as an unordered set of its words
(or terms such as noun phrases), ignoring structure and
grammar rules. Each term t describing a document d is
assigned a “weight” w_{t,d}, for example based on frequency of
occurrence. The set of weights over all terms of interest is the
feature vector and is typically sparse, meaning that many of
the weights are zero.
For music it is common to aggregate all relevant pages into a
single “virtual” document. Part-of-speech taggers can be used
to create terms from a list of words.
G. Tzanetakis
214 / 226
Similarity based on term profiles
The weight of a term t for artist a can be computed using
TFIDF schemes such as
w(t, a) = tf(t, a) / df(t)
w_g(t, a) = tf(t, a) · e^(−(log(df(t)) − µ)² / (2σ²))
The similarity between artists can be computed by the overlap
between their corresponding term profiles, i.e. the sum of the
weights of all the terms that occur in both term profiles.
Cosine similarity is another popular choice. A similar approach
can be used for any textual source of information such as lyrics
or blog posts.
G. Tzanetakis
215 / 226
Cosine similarity
The idea behind cosine similarity is to not take into account
the length of the associated documents when computing
similarity but only the relative frequency of each term.
sim(a_i, a_j) = cos(θ) = Σ_{t∈T} w(t, a_i) w(t, a_j) / ( √(Σ_{t∈T} w(t, a_i)²) · √(Σ_{t∈T} w(t, a_j)²) )
In general the idea of a vector space and the associated feature
vectors can be used in many different MIR tasks by varying
the source of information used and the desired outcome.
Examples include language identification, song structure
detection, and thematic categorization. Finally, text-based
features can be combined with audio features.
G. Tzanetakis
216 / 226
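A sketch of the cosine similarity between two artist term profiles as in the formula above, with profiles represented as dictionaries mapping terms to (e.g. TF-IDF) weights; the example values are invented.

```python
# Illustrative sketch of cosine similarity between two artist term profiles.
import math

def cosine_similarity(profile_i, profile_j):
    common = set(profile_i) & set(profile_j)
    dot = sum(profile_i[t] * profile_j[t] for t in common)
    norm_i = math.sqrt(sum(w * w for w in profile_i.values()))
    norm_j = math.sqrt(sum(w * w for w in profile_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)

print(cosine_similarity({"greece": 2.1, "synth": 1.3}, {"greece": 1.7, "film": 0.9}))
```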
Collaborative Tags
Tag overlap can be used as a measure of similarity between
artists (or tracks). In a study using Last.fm data it was found
that similar artists share on average 10 tags, compared to 4 for
arbitrary pairs.
Pros: smaller vocabulary, focused on the music domain,
annotations at the level of individual tracks.
Cons: requires a large and active user community; coverage in
the “long tail” can be low or non-existent.
Introduction
The basic idea behind co-occurrence approaches is that the
occurrence of two pieces of music, artists, or tags within the
same context indicates similarity.
G. Tzanetakis
217 / 226
Page Counts
G. Tzanetakis
218 / 226
Introduction
Similarity using a search engine and associated page counts
(with some restrictions to constrain results to music):
sim_1(a_i, a_j) = pc(a_i, a_j) / min(pc(a_i), pc(a_j))
sim_2(a_i, a_j) = (1/2) · ( pc(a_i, a_j) / pc(a_i) + pc(a_i, a_j) / pc(a_j) )
G. Tzanetakis
219 / 226
G. Tzanetakis
220 / 226
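A sketch of the two page-count similarities above; obtaining the actual page counts from a (suitably restricted) search engine is left abstract, and the counts in the example are made up.

```python
# Illustrative sketch of the page-count co-occurrence similarities.
def sim1(pc_i, pc_j, pc_ij):
    return pc_ij / min(pc_i, pc_j)

def sim2(pc_i, pc_j, pc_ij):
    return 0.5 * (pc_ij / pc_i + pc_ij / pc_j)

print(sim2(120000, 45000, 9000))    # pc(artist A), pc(artist B), pc(both together)
```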
Example
Audio Features
Basic idea: capture the energy of different pitches or chroma
values
Summarize FFT bins
Design a filterbank (constant-Q)
Detect multiple pitches
G. Tzanetakis
221 / 226
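A sketch of the first two options using the librosa library (my own choice of toolkit); a synthetic tone replaces a real recording so the snippet runs without an audio file.

```python
# Illustrative sketch: chroma from summarized FFT bins vs. a constant-Q filterbank.
import librosa

sr = 22050
y = librosa.tone(440.0, sr=sr, duration=2.0)            # 2 seconds of A4
chroma_fft = librosa.feature.chroma_stft(y=y, sr=sr)    # summarize FFT bins
chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr)     # constant-Q filterbank
print(chroma_fft.shape, chroma_cqt.shape)               # (12, #frames)
```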
Onset calculation
Downsampling
First-order difference
Half-wave rectification
G. Tzanetakis
222 / 226
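A sketch of these three steps applied to an energy envelope; computing the envelope itself (e.g. from the magnitude spectrum) is assumed to have been done already.

```python
# Illustrative sketch of the onset novelty steps: downsample, difference, rectify.
import numpy as np

def novelty(envelope, hop=4):
    env = np.asarray(envelope)[::hop]     # downsampling
    diff = np.diff(env)                   # first-order difference
    return np.maximum(diff, 0.0)          # half-wave rectification
```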
Chroma Representation
From the 88-note pitch profile (histogram), compute a
12-dimensional chroma (or pitch class) vector by summing all
the octave-separated elements.
G. Tzanetakis
223 / 226
G. Tzanetakis
224 / 226
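A sketch of this folding, assuming the 88-bin profile covers the piano range starting at MIDI note 21 (A0).

```python
# Illustrative sketch: fold an 88-bin pitch profile into a 12-bin chroma vector.
import numpy as np

def pitch_profile_to_chroma(profile):
    chroma = np.zeros(12)
    for i, value in enumerate(profile):   # i = 0 corresponds to MIDI note 21 (A0)
        chroma[(21 + i) % 12] += value    # same pitch class in every octave
    return chroma
```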
Chroma Normalization
v ← v / ||v||_1, where ||v||_1 = Σ_{i=1}^{12} |v(i)|
Replace v with a flat chroma vector if ||v||_1 is below a
threshold (silent passages).
G. Tzanetakis
225 / 226
Other topics not covered in this tutorial
Chord Detection
Structure Analysis
Music transcription and sound source separation
Computational Ethnomusicology
Audio watermarking and fingerprinting
Graphical user interfaces and visualization
G. Tzanetakis
226 / 226
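To close, a sketch of the chroma normalization above; the silence threshold is an arbitrary example value.

```python
# Illustrative sketch: L1-normalize a chroma vector, flat fallback for silence.
import numpy as np

def normalize_chroma(v, threshold=1e-4):
    norm = np.sum(np.abs(v))              # ||v||_1
    if norm < threshold:
        return np.full(12, 1.0 / 12.0)    # flat chroma for silent passages
    return v / norm
```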