Oral Session 7 Recommendation & Listeners 437 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 438 15th International Society for Music Information Retrieval Conference (ISMIR 2014) TASTE SPACE VERSUS THE WORLD: AN EMBEDDING ANALYSIS OF LISTENING HABITS AND GEOGRAPHY Joshua L. Moore, Thorsten Joachims Cornell University, Dept. of Computer Science {jlmo|tj}@cs.cornell.edu ABSTRACT Douglas Turnbull Ithaca College, Dept. of Computer Science [email protected] Hauger et al. have matched to cities and other geographic descriptors as well. Our goal in this work is to use embedding methods to enable a more thorough analysis of geographic and cultural patterns in this data by embedding cities and the artists from track plays in those cities into a joint space. The resulting taste space gives us a way to directly measure city/city, city/artist, and artist/artist affinities. After verifying the predictive fidelity of the learned taste space, we explore the surprisingly clear segmentations in taste space across geographic, cultural, and linguistic borders. In particular, we find that the taste space of cities gives us a remarkably clear image of some cultural and linguistic phenomena that transcend geography. Probabilistic embedding methods provide a principled way of deriving new spatial representations of discrete objects from human interaction data. The resulting assignment of objects to positions in a continuous, low-dimensional space not only provides a compact and accurate predictive model, but also a compact and flexible representation for understanding the data. In this paper, we demonstrate how probabilistic embedding methods reveal the “taste space” in the recently released Million Musical Tweets Dataset (MMTD), and how it transcends geographic space. In particular, by embedding cities around the world along with preferred artists, we are able to distill information about cultural and geographical differences in listening patterns into spatial representations. These representations yield a similarity metric among city pairs, artist pairs, and cityartist pairs, which can then be used to draw conclusions about the similarities and contrasts between taste space and geographic location. 2. RELATED WORK Embeddings methods have been applied to many different modeling and information retrieval tasks. In the field of music IR, these models have been used for tag prediction and song similarity metrics, as in the work of Weston et al. [7]. However, instead of a prediction task such as this, we intend to focus on data analysis tasks. Therefore, we rely on generative models like those proposed in our previous work [5, 6] and by Aizenberg et al [1]. Our prior work uses models which rely on sequences of songs augmented with social tags [5] or per-user song sequences with temporal dynamics [6]. The aim of this work differs from that of our previous work in that we are interested in aggregate global patterns and not in any particular playlist-related task, so we do not adopt the notion of song sequences. We also are concerned with geographic differences in listening patterns, and so we ignore individual users in favor of embedding entire cities into the space. Aizenberg et al. utilize generative models like those in our work for purposes of building a recommendation engine for music from Internet radio data on the web. 
However, their work focuses on building a powerful recommendation system using freely available data, and does not focus on the use of the resulting models for data analysis, nor do they concern themselves with geographic data. The data set which we will use throughout this work was published by Hauger et al. [3]. The authors of this work crawled Twitter for 17 months, looking for tweets which carried certain key words, phrases, or hashtags in order to find posts which signal that a user is listening to a track and for which the text of the tweet could be matched to a particular artist and track. In addition, the data was selected for only tweets with geographical tags (in the form 1. INTRODUCTION Embedding methods are a type of machine learning algorithm for distilling large amounts of data about discrete objects into a continuous and semantically meaningful representation. These methods can be applied even when only contextual information about the objects, such as cooccurrence statistics or usage data, is available. For this reason and due to the easy interpretability of the resulting models, embeddings have become popular for tasks in many fields, including natural language processing, information retrieval, and music information retrieval. Recently, embeddings have been shown to be a useful tool for analyzing trends in music listening histories [6]. In this paper, we learn embeddings that give insight into how music preferences relate to geographic and cultural boundaries. Our input data is the Million Musical Tweets Dataset (MMTD), which was recently collected and curated by Hauger et al. [3]. This dataset consists of over a million tweets containing track plays and rich geographical information in the form of globe coordinates, which c Joshua L. Moore, Thorsten Joachims, Douglas Turnbull. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Joshua L. Moore, Thorsten Joachims, Douglas Turnbull. “Taste Space Versus the World: an Embedding Analysis of Listening Habits and Geography”, 15th International Society for Music Information Retrieval Conference, 2014. 439 15th International Society for Music Information Retrieval Conference (ISMIR 2014) of GPS coordinates), and temporal data was retained. The final product is a large data set of geographically and temporally tagged music plays. In their work, the authors emphasize the collection of this impressive data set and a thorough description of the properties of the data set. The authors do add some analyses of the data, but the geographic analysis is limited to only a few examples of coarse patterns found in the data. The primary contribution of our work over the work presented in that paper is to greatly extend the scope of the geographic analysis, presenting a much clearer and more exhaustive view of the differences in musical taste across regions, countries, and languages. Finally, we describe how geographic information can be useful for various music IR tasks. Knopke [4] also discusses how geospatial data can be exploited for music marketing and musicological research. We use embedding as a tool to further explore these topics. Others, such as Lamere’s Roadtrip Mixtape 1 app, have developed systems that use a listeners location to generate a playlist of relevant music by local artists. (X,Y,p) = max X,Y,p = max −||X(ci)−Y (ai)||22 +pai −log(Z(ai)). We solve this optimization problem using a Stochastic Gradient Descent approach. 
First, each embedding vector X(·) and Y (·) is randomly initialized to a point in the unit ball in Rd (for the chosen dimension d). Then, the model parameters are updated in sequential stochastic gradient steps until convergence. The partition function Z(·) presents an optimization challenge, in that a naı̈ve optimization strategy requires O(|A|2 ) time for each pass over the data. For this work, we used our C++ implementation of the efficient training method employed in [6], an approximate method that estimates the partition function for efficient training. This implementation is available by request, and will later be available on the project website, http://lme.joachims.org. 3.1 Interpretation of Embedding Space As defined above, the model gives us a joint space in which both cities and artists are represented through their respective embedding vectors X(·) and Y (·). Related works have found such embedding spaces to be rich with semantic significance, compactly condensing the patterns present in the training data. Distances in embedding space reveal relationships between objects, and visual or spatial inspection of the resulting models quickly reveals a great deal of segmentation in the space. In particular, joint embeddings yield similarity metrics among the various types of embedded objects, even though individual dimensions in the embedding space have no explicit meaning (e.g. the embeddings are rotation invariant). In our case, this specifically entails the following three measures of similarity: City to Artist: this is the only similarity metric explicitly formulated in the model, and it reflects the distribution Pr(a|c) that we directly observe data for. In particular, we directly optimize the positions of cities and artists so that cities have a high probability of listening to artists which they were observed playing in the dataset. This requires placing the city and artist nearby in the embedding space, so proximity in the embedding space can be interpreted as an affinity between a city and an artist. Artist to Artist: due to the learned conditional probability distributions’ being constrained by the metric space, two artists which are placed near each other in the space will have a similar probability mass in each city’s distribution. This implies a kind of exchangeability or similarity, since any city which is likely to listen to one artist is likely to listen to the other in the model distribution. City to City: finally, the form of similarity on which we will most rely in this work is that among cities. Again due to the metric space, two nearby cities will assign similar masses to each artist, and so will have very similar distributions over artists in the model. This implies a similarity in musical taste or preferred artists between two cities. The third type of similarity will form the basis for most of the analyses in this paper. In particular, we are interested The embedding model used in this paper is similar to the one used in our previous work [6]. However, the following analysis focuses on geographical patterns instead of temporal dynamics and trends. In particular, we focus on the relationships among cities and artists, and so we elect to condense the geographical information in a tweet down to the city from which it came. Similarly, we discard the track name from each tweet and use only the artist for the song. This leads to a joint embedding of cities and artists. 
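To make the three similarity measures of Section 3.1 concrete, the sketch below shows how city/artist, artist/artist, and city/city affinities can be read off a trained model as (negative) squared Euclidean distances between embedding vectors. This is a minimal illustration, not the authors' released implementation: the embedding matrices X and Y, the popularity terms p, and the name lists are stand-ins for the outputs of a model trained as described above.

```python
import numpy as np

def nearest(query_vec, matrix, names, k=5):
    """Return the k names whose embedding vectors lie closest to query_vec
    (for a city query against the city matrix, the query city itself comes first)."""
    d2 = np.sum((matrix - query_vec) ** 2, axis=1)   # squared Euclidean distances
    return [names[i] for i in np.argsort(d2)[:k]]

# Toy stand-ins for a trained model: X holds city embeddings, Y artist embeddings,
# p the per-artist popularity bias (all assumed, for illustration only).
rng = np.random.default_rng(0)
d = 10
cities = ["Paris", "Brussels", "Chicago", "Atlanta"]
artists = ["Artist A", "Artist B", "Artist C"]
X = rng.normal(size=(len(cities), d))
Y = rng.normal(size=(len(artists), d))
p = rng.normal(size=len(artists))

city_idx = {c: i for i, c in enumerate(cities)}
artist_idx = {a: i for i, a in enumerate(artists)}

# City-to-city similarity: nearest cities in taste space.
print(nearest(X[city_idx["Paris"]], X, cities, k=3))

# Artist-to-artist similarity: nearest artists in taste space.
print(nearest(Y[artist_idx["Artist A"]], Y, artists, k=2))

# City-to-artist affinity: the link-function score -||X(c) - Y(a)||^2 + p_a,
# so a higher score corresponds to a higher model probability Pr(a|c).
c = X[city_idx["Chicago"]]
scores = -np.sum((Y - c) ** 2, axis=1) + p
print([artists[i] for i in np.argsort(-scores)])
```

The same distance computations drive the nearest-neighbor queries and clusterings used in the analyses later in the paper.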
3. PROBABILISTIC EMBEDDING MODEL

At the core of the embedding model lies a probabilistic link function that connects the observed data to the underlying semantic space. Intuitively, the link function we use states that the probability Pr(a|c) of a given city c playing a given artist a decreases with the squared Euclidean distance ||X(c) − Y(a)||₂² between that city and that artist in an embedding space of a chosen dimension d. X(c) and Y(a) are the embedding locations of city c and artist a, respectively. Similar to previous works, we also incorporate a popularity bias term p_a for each artist to model global popularity. More formally, the probability for a city c to play an artist a is:

Pr(a|c) = \frac{\exp(-\|X(c) - Y(a)\|_2^2 + p_a)}{\sum_{a' \in A} \exp(-\|X(c) - Y(a')\|_2^2 + p_{a'})}.

The sum in the denominator is over the set A of artists. This sum is known as the partition function, denoted Z(·), and serves to normalize the distribution over artists. Determining the embedding locations X(c) and Y(a) for all cities and artists (and the popularity terms p_a) is the learning problem the embedding method must solve. To fit a model to the data, we maximize the log-likelihood formed by the sum of log-probabilities log Pr(a_i|c_i) over the training pairs (c_i, a_i) ∈ D:

(X, Y, p) = \arg\max_{X,Y,p} \sum_{(c_i, a_i) \in D} \log \Pr(a_i \mid c_i).

1 http://labs.echonest.com/CityServer/roadtrip.html

4.1 Quantitative Evaluation of the Model

Before we inspect our model in order to make qualitative claims about the patterns in the data, we first wish to evaluate it on a quantitative basis. This is essential in order to confirm that the model accurately captures the relations among cities and artists, which will offer validation for the conclusions we draw later in the work.

4.1.1 Evaluating Model Fidelity

First, we considered the performance of the model in terms of perplexity, which is a reformulation of the log-likelihood objective outside of a log scale. This is a commonly used measure of performance in other areas of research where models similar to ours are used, such as natural language processing [2]. The perplexity p is related to the average log-likelihood L by the transformation p = exp(−L). Our baseline is the unigram distribution, which assumes that Pr(a|c) is directly proportional to the number of tweets artist a received in the entire data set, independent of the city. Estimating the unigram distribution from the training set and using it to calculate the perplexity on the validation set yielded a perplexity of 589 (very similar to the perplexity attained when estimating this distribution from the training set and calculating the perplexity on the training set itself). Our model offered a great improvement over this – the 100-dimensional model yielded a perplexity on the validation set of 290, while the 2-dimensional model reached a perplexity of 357. This improvement suggests that our model has captured a significant amount of useful information from the data.

Figure 1: Precision at k of our model, a cosine similarity baseline, a tweet count ranking baseline, and a random baseline on a city/artist tweet prediction task.

in the connection between the metric space of cities in the embedding space and another metric space: the one formed by the geographic distribution of cities on the Earth's surface. As we will see, these two spaces differ greatly, and the taste space of cities gives us a clear image of some cultural and linguistic phenomena that transcend geography.

4.
EXPERIMENTS We use the MMTD data set presented by Hauger et al. [3]. This data set contains nearly 1.1 million tweets with geographical data. We pre-process the data by condensing each tweet to a city/artist pair, which results in a city/artist affinity matrix used to train the model. Next, we discard all cities and artists which have not appeared at least 100 times in the data, as well as all cities for which fewer than 30 distinct users tweeted from that city. The post-processed data contains 1,017 distinct cities and 1,499 distinct artists. 4.1.2 Evaluating Predictive Accuracy Second, we created a task to evaluate the predictive power of our model. To this end, we split the data chronologically into two halves, and further divided the first half into a training set and a validation set. Using the first half of the data, we trained a 100-dimensional model. Our goal is to use this model to predict which new artists various cities will begin listening to in the second half of the data. We accomplish this by considering, for each city, the set of artists which had no observed tweets in that city in the first half of the data. We then sorted these artists by their score in the model – namely, for city c and artist a, the function −||X(c) − Y (a)||22 + pa . Using this ordering as a ranking function, we calculated the precision at k of our ranking for various values of k, where an artist is considered to be relevant if that artist receives at least one tweet from that city in the second half of the data. We average the results of each city’s ranking. We compare the performance of our model on this task to three baselines. First, we consider a random ranking of all the artists which a city has not yet tweeted. Second, we sort the yet untweeted artists by their raw global tweet count in the first half of the data – which we label the unigram baseline. Third, we use the raw artist tweet counts for a city’s nearest neighbor city in the first half of data to rank untweeted artists for that city. In this case, the nearest For choosing model parameters, we randomly selected 80% of the tweets for the training set, and the remaining 20% for the validation set. This resulted in a training set of 390,077 tweets and a validation set of 97,592 tweets. We used the validation set both to determine stopping criteria for the optimization as well as to choose the initial stochastic gradient step size η0 from the set {0.25, 0.1, 0.05, 0.01} and to evaluate the quality of models of dimension {2, 50, 100}. The optimal step size varied from model to model, but the 100-dimensional model consistently out-performed the others (although the difference between it and the 50dimensional model was small). We will analyze the data through the trained embedding models, both through spatial analyses (i.e. nearest neighbor queries and clusterings) and through visual inspection. In general, the high-dimensional model better captures the data, and so we will use it when direct visual inspection is not required. But first, we evaluate the quality of the model through quantitative means. 441 15th International Society for Music Information Retrieval Conference (ISMIR 2014) neighbor is not determined using our embedding but rather based on the maximum cosine similarity between the vector of artist tweet counts for the city and the vectors of tweet count for all other cities. The results can be seen in Figure 1. 
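The ranking evaluation described above can be reproduced in a few lines: for each city, artists unseen in the first half of the data are ranked by the model score −||X(c) − Y(a)||² + p_a, and precision at k is the fraction of the top-k artists that the city actually tweets in the second half. The sketch below is an illustrative re-implementation under assumed inputs (embedding matrices, popularity terms, and per-city sets of observed artist indices), not the authors' evaluation code; the use_popularity flag mirrors the variant discussed next, where the popularity terms are dropped at ranking time.

```python
import numpy as np

def precision_at_k(X, Y, p, seen_first_half, seen_second_half, k=10, use_popularity=True):
    """Average precision@k over cities for predicting newly tweeted artists.

    X: (n_cities, d) city embeddings; Y: (n_artists, d) artist embeddings;
    p: (n_artists,) popularity terms; seen_*: list of sets of artist indices per city.
    """
    precisions = []
    for c in range(X.shape[0]):
        candidates = np.array([a for a in range(Y.shape[0]) if a not in seen_first_half[c]])
        if candidates.size == 0:
            continue
        scores = -np.sum((Y[candidates] - X[c]) ** 2, axis=1)
        if use_popularity:
            scores = scores + p[candidates]
        top = candidates[np.argsort(-scores)[:k]]
        relevant = sum(1 for a in top if a in seen_second_half[c])
        precisions.append(relevant / k)
    return float(np.mean(precisions))

# Tiny synthetic example, just to show the expected call signature.
rng = np.random.default_rng(1)
X, Y, p = rng.normal(size=(3, 8)), rng.normal(size=(20, 8)), rng.normal(size=20)
first = [set(rng.choice(20, 5, replace=False)) for _ in range(3)]
second = [set(rng.choice(20, 8, replace=False)) for _ in range(3)]
print(precision_at_k(X, Y, p, first, second, k=5))
```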
At k = 1, our model correctly guesses an artist that a city will later tweet with 64% accuracy, compared to 46% for the cosine similarity, 42% for unigram and around 5% for the random baseline. This advantage is consistent as k increases, with our method attaining about 24% precision at 100, compared to 18% for unigram and 14% for cosine similarity. We also show the performance of the same model at this task when popularity terms are excluded from the scoring function at ranking time. Interestingly, the performance in this case is still quite good. We see precision at 1 of about 51% in this case, with the gap between this method and the method with popularity terms growing smaller as k increases. This suggests that proximity in the space is very meaningful, which is an important validation of the analyses to follow. Finally, the good performance on this task invites an application of the space to making marketing predictions – which cities are prone to pick up on which artists in the near future? – but we leave this for future work. 4.2 Visual Inspection of the Embedding Space gleaned. However, higher dimensional models are able to achieve perplexities on the validation set which far exceed those of lower dimensional models. For example, as mentioned before, our best performing 2-dimensional model attains a validation perplexity of 357, while our best performing 100-dimensional model attains a perplexity of 290 on the validation set. This suggests that higher dimensional models capture more of the nuanced patterns present in the data. On the other hand, simple plotting is no longer sufficient to inspect high-dimensional data – we must resort to alternative methods, for example, clustering and nearest neighbor queries. First, in Figure 3, we present the results of using k-means clustering in the city space of the 100-dimensional model. The common algorithm for solving the k-means clustering problem is known to be prone to getting stuck in local optima, and in fact can be difficult to validate properly. We attempted to overcome these problems by using cross validation and repeated random restarts. Specifically, we used 10-fold cross-validation on the set of all cities in order to find a validation objective for each candidate value of k from 2 to 20. Then, we selected the parameter k by choosing the largest value for which no larger value offers more than a 5% improvement over the immediately previous value. In Figure 2 we present plots of the two-dimensional embedding space, with labels for some key cities (left) and artists (right). Note that the two plots are separated by city and artists only for readability, and that all points lie in the same space. In this figure, we can already see a striking segmentation in city space, with extreme distinction between, e.g., Brazilian cities, Southeast Asian cities, and American cities. We can also already see distinct regional and cultural groupings in some ways – the U.S. cities largely form a gradient, with Chicago, Atlanta, Washington, D.C., and Philadelphia in the middle, Cleveland and Detroit on one edge of the cluster, and New York and Los Angeles on the opposite edge. Interestingly, Toronto is also on the edge of the U.S. cluster, and on the same edge where New York and Los Angeles – arguably the most “international” of the U.S. cities shown here – end up. 
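For the visual inspections discussed in this section, a two-dimensional model can simply be plotted; a minimal sketch is shown below, assuming 2-D embedding matrices and label lists from a trained d = 2 model (the labels here are placeholders). As in Figure 2, cities and artists are drawn in separate panels only for readability, with shared axes because all points live in the same space.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed inputs: embeddings from a d = 2 model and their labels (placeholders here).
rng = np.random.default_rng(2)
X2 = rng.normal(size=(6, 2))                      # city embeddings
Y2 = rng.normal(size=(4, 2))                      # artist embeddings
cities = ["City A", "City B", "City C", "City D", "City E", "City F"]
artists = ["Artist A", "Artist B", "Artist C", "Artist D"]

fig, (ax_c, ax_a) = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
ax_c.scatter(X2[:, 0], X2[:, 1], s=10)
for (x, y), name in zip(X2, cities):
    ax_c.annotate(name, (x, y), fontsize=8)
ax_c.set_title("Cities")

ax_a.scatter(Y2[:, 0], Y2[:, 1], s=10, color="tab:orange")
for (x, y), name in zip(Y2, artists):
    ax_a.annotate(name, (x, y), fontsize=8)
ax_a.set_title("Artists")

plt.tight_layout()
plt.show()
```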
It is also interesting to note that the space has a very clear segmentation in terms of genre – just as clear as embeddings produced in previous work from songs alone [5] or songs and individual users [6]. Of course, this does not translate into an effective user model – surely there are many users in Recife, Brazil that would quickly tire of a radio station inspired by Linkin Park – but we believe it is still a meaningful phenomenon. Specifically, this suggests that the taste of the average listener can vary dramatically from one city to the next, even within the same country. More surprisingly, this variation in the average user is so dramatic that cities themselves can form nearly as coherent a taste space as individual users, as the genre segmentation is barely any less clear than in other authors’ work with user modeling. 4.3 Higher-dimensional Models Once the value of k was chosen, we tried to overcome the problem of local optima by running the clustering algorithm 10 times on the entire set of cities with that value of k and different random initializations, finally choosing the trial with the best objective value. This process resulted in optimal k values ranging from 6 to 13. Smaller values resulted in some clusterings with granularity too coarse to see interesting patterns, while larger values were noisy and produced unstable clusterings. Ultimately, we found that k = 9 was a good trade-off. Additionally, in Table 1, we obtain a complementary view of the 100-dimensional embedding by listing the results of nearest-neighbor queries for some well-known, hand-selected cities. These queries give us an alternative perspective of the city space, pointing out similarities that may not be apparent from the clustering alone. By combining these views, we can start to see many interesting patterns arise: The French-speaking supercluster: French-speaking cities form an extremely tight cluster, as can also be seen in the 2-dimensional embedding in Figure 2. Virtually every French city is part of this cluster, as well as Frenchspeaking cities in nearby European countries, such as Brussels and Geneva. Indeed even beyond the top 10 listed in Table 1, almost all of the top 100 nearest neighbors for Paris are French-speaking. Language is almost certainly the biggest factor in this effect, but if we consider the countries near France, we see that despite linguistic divides, in the clustering, many cities in the U.K. still group closely with Dutch cities and even Spanish cities. Furthermore, this grouping can be seen in every view of the data – in the two-dimensional space, the clustering, and the nearest neighbor queries. It should be noted that in our own trials clustering the data, the French cluster is one of the first Directly visualizing two-dimensional models can give us striking images from which rough patterns can be easily 442 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 2: The joint city/artist space with some key cities and artists labeled. Figure 3: A k-means clustering of cities around the world with k = 9. 
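The clustering shown in Figure 3 depends on the model-selection procedure described above: cross-validated k-means objectives for k = 2 to 20, a "no further improvement beyond 5%" rule for picking k, and repeated random restarts for the final clustering. The sketch below is one plausible reading of that procedure, using scikit-learn's KMeans on an assumed matrix of 100-dimensional city embeddings; details such as the exact validation objective may differ from the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def choose_k(city_vecs, k_values=range(2, 21), n_folds=10, tol=0.05, seed=0):
    """Pick the largest k whose step from the previous k still improves the
    cross-validated objective by more than `tol` (5%)."""
    mean_obj = {}
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for k in k_values:
        objs = []
        for train_idx, val_idx in kf.split(city_vecs):
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(city_vecs[train_idx])
            objs.append(-km.score(city_vecs[val_idx]))  # held-out sum of squared distances
        mean_obj[k] = np.mean(objs)

    ks = sorted(mean_obj)
    best = ks[0]
    for prev, cur in zip(ks, ks[1:]):
        if (mean_obj[prev] - mean_obj[cur]) / mean_obj[prev] > tol:
            best = cur          # this k still buys a > 5% improvement
    return best, mean_obj

def final_clustering(city_vecs, k, n_restarts=10, seed=0):
    """Rerun k-means on all cities with several restarts and keep the best objective."""
    best_km = None
    for r in range(n_restarts):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed + r).fit(city_vecs)
        if best_km is None or km.inertia_ < best_km.inertia_:
            best_km = km
    return best_km.labels_

# Example with random stand-in embeddings (the paper uses 1,017 cities x 100 dims).
city_vecs = np.random.default_rng(3).normal(size=(200, 100))
k, _ = choose_k(city_vecs, k_values=range(2, 8), n_folds=5)
labels = final_clustering(city_vecs, k)
print(k, np.bincount(labels))
```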
Kuala Lumpur: Kulim, Sungai Lembing, Ipoh, Kuching, Sunway City, Seremban, Seri Kembangan, Taman Cheras Hartamas, Kuantan, Selayang
Paris: Boulogne-Billancourt, Brussels, Rennes, Lille, Aix-en-Provence, Limoges, Amiens, Marseille, Geneva, Grenoble
Singapore: Hougang, Seng Kang, USJ9, Subang, Kota Bahru, Bangkok, Alam Damai, Kota Padawan, Glenmarie, Budapest
Los Angeles, CA: Grand Prairie, TX; Ontario, CA; Riverside, CA; Sacramento, CA; Salinas, CA; Paterson, NJ; San Bernardino, CA; Inglewood, CA; Modesto, CA; Pomona, CA
Chicago, IL: Buffalo, NY; Clarksville, TN; Cleveland, OH; Durham, NC; Birmingham, AL; Flint, MI; Montgomery, AL; Nashville, TN; Jackson, MS; Paterson, NJ
São Paulo: Osasco, Jundiaí, Carapicuíba, Ribeirão Pires, Shinjuku, Vargem Grande Paulista, Santa Maria, Itapevi, Cascavel, Embu das Artes
Brooklyn, NY: Minneapolis, MN; Winston-Salem, NC; Arlington, VA; Waterbury, CT; Washington, DC; Syracuse, NY; Jersey City, NJ; Louisville, KY; Tallahassee, FL; Ontario, CA
Atlanta, GA: Savannah, GA; Tallahassee, FL; Cleveland, OH; Washington, DC; Memphis, TN; Flint, MI; Huntsville, AL; Montgomery, AL; Jackson, MS; Lafayette, LA
Madrid: Sevilla; Granada; Barcelona; Murcia; Sorocaba; Ponta Grossa; Huntington Beach, CA; Istanbul; Vigo; Oxford
Amsterdam: Eindhoven, Tilburg, Emmen, Nijmegen, Enschede, Zwolle, Amersfoort, Maastricht, Antwerp, Coventry
Sydney: Toronto; Denver, CO; Windhoek; Angers; Rialto, CA; Hamilton; Rotterdam; Ottawa; London - Tower Hamlets; London - Southwark
Montréal: Montpellier; Geneva; Raleigh, NC; Limoges; Angers; Ontario, CA; Anchorage, AK; Nice; Lyon; Rennes

Table 1: Nearest neighbor query results in 100-dimensional city space, listed as each query city followed by its nearest neighbors. Brooklyn was chosen over New York, NY due to having more tweets in the data set. In addition, only result cities with population at least 100,000 are displayed.

Country         Least typical               Most typical
Brazil          Criciúma, Santa Catarina    Itapevi, São Paulo
Canada          Surrey, BC                  Toronto, ON
Netherlands     Leiden                      Emmen
Mexico          Campeche, CM                Cuauhtémoc, DF
Indonesia       Panunggangan Barat          RW 02
France          Bordeaux                    Mantes-la-Jolie, Île-de-France
United States   Huntington Beach, CA        Jackson, MS
Malaysia        Kota Damansara              Kuala Lumpur
United Kingdom  Wolverhampton, England      London Borough of Camden
Russia          Ufa                         Podgory
Spain           Álora, Andalusia            Barcelona

Table 2: Most and least typical cities in taste profile for various countries.

typical taste profiles for that country. The results are shown in Table 2. We can see a few interesting patterns here. First, in Brazil, the most typical city is an outlying city near the city of São Paulo, while the least typical is a city in Santa Catarina, the second southernmost state in Brazil, which is also less populous than the southernmost, Rio Grande do Sul, which was also well represented in the data. In Canada, the least typical city is an edge city on Vancouver's east side, while the most typical is the largest city, Toronto. In France, the most typical city is in Île-de-France, not too far from Paris. We also see in England that the least typical city is Wolverhampton, an edge city of Birmingham towards England's industrial north, while the most typical is a borough of London.

clusters to become apparent, as well as one of the most consistent to appear.
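Table 2 can be reproduced directly from the city embeddings once each city is tagged with its country: average the embedding vectors per country, then rank that country's cities by distance to that average. A minimal sketch under assumed inputs (an embedding matrix plus parallel lists of city names and country labels) follows; the ten-city minimum per country matches the description in Section 4.4.

```python
import numpy as np
from collections import defaultdict

def typicality(X, city_names, country_labels, min_cities=10):
    """For each country with at least `min_cities` cities, return the city closest to
    (most typical) and farthest from (least typical) the country's mean embedding."""
    by_country = defaultdict(list)
    for i, country in enumerate(country_labels):
        by_country[country].append(i)

    results = {}
    for country, idx in by_country.items():
        if len(idx) < min_cities:
            continue
        vecs = X[idx]
        center = vecs.mean(axis=0)
        dists = np.linalg.norm(vecs - center, axis=1)
        results[country] = {
            "most_typical": city_names[idx[int(np.argmin(dists))]],
            "least_typical": city_names[idx[int(np.argmax(dists))]],
        }
    return results

# Toy example with made-up labels, just to show the expected input format.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 100))
names = [f"city_{i}" for i in range(30)]
countries = ["Brazil"] * 12 + ["France"] * 12 + ["Canada"] * 6
print(typicality(X, names, countries, min_cities=10))
```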
We can also see that the French cluster is indeed a linguistic and cultural one which is not just due to geographic proximity: although Montreal has several nearest neighbors in North America, it is present in the French group in the k-means clustering (as is Quebec City) and is also very close to many French-speaking cities in Europe, such as Geneva and Lyon. We can also see that Abidjan, Ivory Coast joins the French k-means cluster, as do Dakar in Senegal, Les Abymes in Guadeloupe and Le Lamentin and Fort-de-France in Martinique – all cities in countries which are members of the Francophonie. Australia: Here again, despite the relatively tight geographical proximity of Australia and Southeast Asia, and the geographic isolation of Australia from North America, Australian cities tend to group closely with Canadian cities and some cities in the United Kingdom. One way of seeing this is the fact that Sydney’s nearest neighbors include Toronto, Hamilton, Ontario, Ottawa, and two of London’s boroughs. In addition, other cities in Australia also belong to a cluster that mainly includes cities in the Commonwealth (e.g., U.K., Canada). Cultural divides in the United States: the cities in the U.S. tend to form at least two distinct subgroups in terms of listening patterns. One group contains many cities in the Southeast and Midwest, as well as a few cities on the southern edge of what some might call the Northeast (Philadelphia, for example). The other group consists primarily of cities in the Northeast, on the West Coast, and in the Southwest of the country, including most of the cities in Texas. Intuitively, there are two results that might be surprising to some here. The first is that the listening patterns of Chicago tend to cluster with listening patterns in the South and the rest of the Midwest, and not those of very large cities on the coasts (after all, Chicago is the third-largest city in the country). The second is that Texas groups with the West Coast and Northeast, and not with the Southeast, which would be considered by many to be more culturally similar in many ways. 5. CONCLUSIONS In this work, we learned probabilistic embeddings of the Million Musical Tweets Dataset, a large corpus of tweets containing track plays which has rich geographical information for each play. Through the use of embeddings, we were able to easily process a large amount of data and sift through it visually and with spatial analysis in order to uncover examples of how musical taste conforms to or transcends geography, language, and culture. Our findings reflect that differences in culture and language, as well as historical affinities among countries otherwise separated by vast distances, can be seen very clearly in the differences in taste among average listeners from one region to the next. More generally, this paper shows how nuanced patterns in large collections of preference data can be condensed into a taste space, which provides a powerful tool for discovering complex relationships. Acknowledgments: This work was supported by NSF grants IIS-1217485, IIS-1217686, IIS-1247696, and an NSF Graduate Research Fellowship. 6. REFERENCES [1] N. Aizenberg, Y. Koren, and O. Somekh. Build your own music recommender by modeling internet radio streams. In WWW, pages 1–10. ACM, 2012. [2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. JMLR, 3:1137–1155, 2003. [3] D. Hauger, M. Schedl, A. Košir, and M. Tkalčič. 
The million musical tweets dataset - what we can learn from microblogs. In ISMIR, 2013. [4] I. Knopke. Geospatial location of music and sound files for music information retrieval. ISMIR, 2005. [5] J. L. Moore, S. Chen, T. Joachims, and D. Turnbull. Learning to embed songs and tags for playlist prediction. In ISMIR, 2012. 4.4 Most and least typical cities We can also consider the relation of individual cities to their member countries. For this analysis, we considered all the countries which have at least 10 cities represented in the data. Then for each country we calculated the average position in embedding space of cities in that country. With this average city position, we can then measure the distance of individual cities from the mean of cities in their country and find the cities which have the most and least [6] J. L. Moore, Shuo Chen, T. Joachims, and D. Turnbull. Taste over time: the temporal dynamics of user preferences. In ISMIR, 2013. [7] J. Weston, S. Bengio, and P. Hamel. Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval. JNMR, 40(4):337–348, 2011. 444 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ENHANCING COLLABORATIVE FILTERING MUSIC RECOMMENDATION BY BALANCING EXPLORATION AND EXPLOITATION Zhe Xing, Xinxi Wang, Ye Wang School of Computing, National University of Singapore {xing-zhe,wangxinxi,wangye}@comp.nus.edu.sg ABSTRACT Collaborative filtering (CF) techniques have shown great success in music recommendation applications. However, traditional collaborative-filtering music recommendation algorithms work in a greedy way, invariably recommending songs with the highest predicted user ratings. Such a purely exploitative strategy may result in suboptimal performance over the long term. Using a novel reinforcement learning approach, we introduce exploration into CF and try to balance between exploration and exploitation. In order to learn users’ musical tastes, we use a Bayesian graphical model that takes account of both CF latent factors and recommendation novelty. Moreover, we designed a Bayesian inference algorithm to efficiently estimate the posterior rating distributions. In music recommendation, this is the first attempt to remedy the greedy nature of CF approaches. Results from both simulation experiments and user study show that our proposed approach significantly improves recommendation performance. 1. INTRODUCTION In the field of music recommendation, content-based approaches and collaborative filtering (CF) approaches have been the prevailing recommendation strategies. Contentbased algorithms [1, 9] analyze acoustic features of the songs that the user has rated highly in the past and recommend only songs that have high degrees of acoustic similarity. On the other hand, collaborative filtering (CF) algorithms [7, 13] assume that people tend to get good recommendations from someone with similar preferences, and the user’s ratings are predicted according to his neighbors’ ratings. These two traditional recommendation approaches, however, share a weakness. Working in a greedy way, they always generate “safe” recommendations by selecting songs with the highest predicted user ratings. Such a purely exploitative strategy may result in suboptimal performance over the long term due to the lack of exploration. The reason is that user preference c Zhe Xing, Xinxi Wang, Ye Wang. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). 
Attribution: Zhe Xing, Xinxi Wang, Ye Wang. “Enhancing collaborative filtering music recommendation by balancing exploration and exploitation”, 15th International Society for Music Information Retrieval Conference, 2014. is only estimated based on the current knowledge available in the recommender system. As a result, uncertainty always exists in the predicted user ratings and may give rise to a situation where some of the non-greedy options deemed almost as good as the greedy ones are actually better than them. Without exploration, however, we will never know which ones are better. With the appropriate amount of exploration, the recommender system could gain more knowledge about the user’s true preferences before exploiting them. Our previous work [12] tried to mitigate the greedy problem in content-based music recommendation, but no work has addressed this problem in the CF context. We thus aim to develop a CF-based music recommendation algorithm that can strike a balance between exploration and exploitation and enhance long-term recommendation performance. To do so, we introduce exploration into collaborative filtering by formulating the music recommendation problem as a reinforcement learning task called n-armed bandit problem. A Bayesian graphical model taking account of both collaborative filtering latent factors and recommendation novelty is proposed to learn the user preferences. The lack of efficiency becomes a major challenge, however, when we adopt an off-the-shelf Markov Chain Monte Carlo (MCMC) sampling algorithm for the Bayesian posterior estimation. We are thus prompted to design a much faster sampling algorithm for Bayesian inference. We carried out both simulation experiments and a user study to show the efficiency and effectiveness of the proposed approach. Contributions of this paper are summarized as follows: • To the best of our knowledge, this is the first work in music recommendation to temper CF’s greedy nature by investigating the exploration-exploitation trade-off using a reinforcement learning approach. • Compared to an off-the-shelf MCMC algorithm, a much more efficient sampling algorithm is proposed to speed up Bayesian posterior estimation. • Experimental results show that our proposed approach enhances the performance of CF-based music recommendation significantly. 2. RELATED WORK Based on the assumption that people tend to receive good recommendations from others with similar preferences, col- 445 15th International Society for Music Information Retrieval Conference (ISMIR 2014) laborative filtering (CF) techniques come in two categories: memory-based CF and model-based CF. Memory-based CF algorithms [3, 8] first search for neighbors who have similar rating histories to the target user. Then the target user’s ratings can be predicted according to his neighbors’ ratings. Model-based CF algorithms [7, 14] use various models and machine learning techniques to discover latent factors that account for the observed ratings. Our previous work [12] proposed a reinforcement learning approach to balance exploration and exploitation in music recommendation. However, this work is based on a content-based approach. One major drawback of the personalized user rating model is that low-level audio features are used to represent the content of songs. This purely content-based approach is not satisfactory due to the semantic gap between low-level audio features and high-level user preferences. 
Moreover, it is difficult to determine which underlying acoustic features are effective in music recommendation scenarios, as these features were not originally designed for music recommendation. Another shortcoming is that songs recommended by content-based methods often lack variety, because they are all acoustically similar to each other. Ideally, users should be provided with a range of genres rather than a homogeneous set. While no work has attempted to address the greedy problem of CF approaches in the music recommendation context, Karimi et al. tried to investigate it in other recommendation applications [4, 5]. However, their active learning approach merely explores items to optimize the prediction accuracy on a pre-determined test set [4]. No attention is paid to the exploration-exploitation trade-off problem. In their other work, the recommendation process is split into two steps [5]. In the exploration step, they select an item that brings maximum change to the user parameters, and then in the exploitation step, they pick the item based on the current parameters. The work takes balancing exploration and exploitation into consideration, but only in an ad hoc way. In addition, their approach is evaluated using only an offline and pre-determined dataset. In the end, their algorithm is not practical for deployment in online recommender systems due to its low efficiency. 3. PROPOSED APPROACH We first present a simple matrix factorization model for collaborative filtering (CF) music recommendation. Then, we point out major limitations of this traditional CF algorithm and describe our proposed approach in detail. song a song feature vector vj ∈ Rf , j = 1, 2, ..., n. For a given song j, vj measures the extent to which the song contains the latent factors. For a given user i, ui measures the extent to which he likes these latent factors. The user rating can thus be approximated by the inner product of the two vectors: r̂ij = uTi vj (1) To learn the latent feature vectors, the system minimizes the following regularized squared error on the training set: (i,j)∈I (rij − uTi vj )2 + λ( m nui ui 2 + i=1 n nvj vj 2 ) (2) j=1 where I is the index set of all known ratings, λ a regularization parameter, nui the number of ratings by user i, and nvj the number of ratings of song j. We use the alternating least squares (ALS) [14] technique to minimize Eq. (2). However, this traditional CF recommendation approach has two major drawbacks. (I) It fails to take recommendation novelty into consideration. For a user, the novelty of a song changes with each listening. (II) It works greedily, always recommending songs with the highest predicted mean ratings, while a better approach may be to actively explore a user’s preferences rather than to merely exploit available rating information [12]. To address these drawbacks, we propose a reinforcement learning approach to CF-based music recommendation. 3.2 A Reinforcement Learning Approach Music recommendation is an interactive process. The system repeatedly choose among n different songs to recommend. After each recommendation, it receives a rating feedback (or reward) chosen from an unknown probability distribution, and its goal is to maximize user satisfaction, i.e., the expected total reward, in the long run. Similarly, reinforcement learning explores an environment and takes actions to maximize the cumulative reward. It is thus fitting to treat music recommendation as a well-studied reinforcement learning task called n-armed bandit. 
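Before turning to the bandit formulation in detail, the sketch below gives a concrete reference for the matrix-factorization model of Eqs. (1)-(2): one round of alternating least squares under the weighted-lambda regularization described above (the λ·n_ui and λ·n_vj scaling of [14]). It is an illustrative re-derivation on synthetic data with assumed shapes, not the system's production code; λ = 0.025 is borrowed from the tuning reported in Section 4.2.

```python
import numpy as np

def als_step(R, mask, U, V, lam):
    """One ALS pass for the weighted-lambda-regularized objective:
    sum over observed (i, j) of (r_ij - u_i^T v_j)^2
    + lam * (sum_i n_ui ||u_i||^2 + sum_j n_vj ||v_j||^2)."""
    f = U.shape[1]
    # Update each user vector with the song vectors held fixed.
    for i in range(U.shape[0]):
        cols = np.where(mask[i])[0]
        if cols.size == 0:
            continue
        Vi = V[cols]                                    # (n_ui, f)
        A = Vi.T @ Vi + lam * cols.size * np.eye(f)
        b = Vi.T @ R[i, cols]
        U[i] = np.linalg.solve(A, b)
    # Update each song vector with the user vectors held fixed.
    for j in range(V.shape[0]):
        rows = np.where(mask[:, j])[0]
        if rows.size == 0:
            continue
        Uj = U[rows]
        A = Uj.T @ Uj + lam * rows.size * np.eye(f)
        b = Uj.T @ R[rows, j]
        V[j] = np.linalg.solve(A, b)
    return U, V

# Synthetic example: 50 users x 40 songs, f = 5 latent factors.
rng = np.random.default_rng(5)
m, n, f, lam = 50, 40, 5, 0.025
R_true = rng.normal(size=(m, f)) @ rng.normal(size=(f, n))
mask = rng.random((m, n)) < 0.3                          # ~30% of entries observed
R = np.where(mask, R_true, 0.0)
U, V = rng.normal(size=(m, f)), rng.normal(size=(n, f))
for _ in range(10):
    U, V = als_step(R, mask, U, V, lam)
rmse = np.sqrt(np.mean((R_true[mask] - (U @ V.T)[mask]) ** 2))
print(f"RMSE on observed entries: {rmse:.3f}")
```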
The n-armed bandit problem assumes a slot machine with n levers. Pulling a lever generates a payoff from the unknown probability distribution of the lever. The objective is to maximize the expected total payoff over a given number of action selections, say, over 1000 plays. 3.2.1 Modeling User Rating To address drawback (I) in Section 3.1, we assume that a song’s rating is affected by two factors: CF score, the extent to which the user likes the song in terms of each CF latent factor, and novelty score, the dynamically changing novelty of the song. From Eq. (1), we define the CF score as: 3.1 Matrix Factorization for Collaborative Filtering Suppose we have m users and n songs in the music recommender system. Let R = {rij }m×n denote the user-song rating matrix, where each element rij represents the rating of song j given by user i. Matrix factorization characterizes users and songs by vectors of latent factors. Every user is associated with a user feature vector ui ∈ Rf , i = 1, 2, ..., m, and every UCF = θ T v (3) where vector θ indicates the user’s preferences for different CF latent factors and v is the song feature vector 446 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ݀Ͳ learned by the ALS CF algorithm. For the novelty score, we adopt the formula used in [12]: UN = 1 − e−t/s τ θ s v R t N Figure 1: Bayesian Graphical Model. ing probability dependency is defined as follows: R|v, t, θ, s, σ 2 ∼ N (θ T v(1 − e−t/s ), σ 2 ) (5) θ|σ ∼ N (0, a0 σ I) 2 2 Given the variability in musical taste and memory strength, each user is associated with a pair of parameters Ω = (θ, s), to be learned from the user’s rating history. More technical details will be explained in Section 3.2.2. Since the predicted user ratings always carry uncertainty, we assume them to be random variables rather than fixed numbers. Let Rj denote the rating of song j given by the target user, and Rj follows an unknown probability distribution. We assume that the expectation of Rj is the Uj defined in Eq. (5). Thus, the expected rating of song j can be estimated as: E[Rj ] = Uj = (θ T vj )(1 − e−tj /s ) ܿͲ ܾͲ (4) where t is the time elapsed since when the song was last heard, s the relative strength of the user’s memory, and e−t/s the well-known forgetting curve. The formula assumes that a song’s novelty decreases immediately when listened and gradually recovers with time. (For more details on the novelty definition, please refer to [12].) We thus model the final user rating by combining these two scores: U = UCF UN = (θ T v)(1 − e−t/s ) ܽͲ ݁Ͳ (6) Traditional recommendation strategy will first obtain the vj and tj of each song in the system to compute the expected rating using Eq. (6) and then recommend the song with the highest expected rating. We call this a greedy recommendation as the system is exploiting its current knowledge of the user ratings. By selecting one of the nongreedy recommendations and gathering more user feedback, the system explores further and gains more knowledge about the user preferences. A greedy recommendation may maximize the expected reward in the current iteration but would result in suboptimal performance over the long term. This is because several non-greedy recommendations may be deemed nearly as good but come with substantial variance (or uncertainty), and it is thus possible that some of them are actually better than the greedy recommendation. Without exploration, however, we will never know which ones they are. 
Therefore, to counter the greedy nature of CF (drawback II), we introduce exploration into music recommendation to balance exploitation. To do so, we adopt one of the state-of-the-art algorithms, Bayesian Upper Confidence Bounds (Bayes-UCB) [6]. In Bayes-UCB, the expected reward U_j is a random variable rather than a fixed value. Given the target user's rating history D, the posterior distribution of U_j, denoted p(U_j | D), needs to be estimated. Then the song with the highest fixed-level quantile value of p(U_j | D) is recommended to the target user.

3.2.2 Bayesian Graphical Model

To estimate the posterior distribution of U, we adopt the Bayesian model (Figure 1) used in [12]. The corresponding probability dependencies are the observation model and the prior on θ given above (Eqs. (7)-(8)), together with the priors

s \sim \mathrm{Gamma}(b_0, c_0), \quad (9)
\tau = 1/\sigma^2 \sim \mathrm{Gamma}(d_0, e_0). \quad (10)

Here I is the f × f identity matrix, N denotes a Gaussian distribution parameterized by mean and variance, and Gamma denotes a Gamma distribution parameterized by shape and rate. θ, s, and τ are parameters; a_0, b_0, c_0, d_0, and e_0 are hyperparameters of the priors. At the current iteration h + 1, we have gathered the history of h observed recommendations D_h = {(v_i, t_i, r_i)}_{i=1}^{h}. Given that each user in our model is described as Ω = (θ, s), we have, according to Bayes' theorem:

p(\Omega \mid D_h) \propto p(\Omega)\, p(D_h \mid \Omega). \quad (11)

Then the posterior probability density function (PDF) of the expected rating U_j of song j can be estimated as:

p(U_j \mid D_h) = \int p(U_j \mid \Omega)\, p(\Omega \mid D_h)\, d\Omega. \quad (12)

Since Eq. (11) has no closed-form solution, we are unable to directly estimate the posterior PDF in Eq. (12). We thus turn to a Markov Chain Monte Carlo (MCMC) algorithm to adequately sample the parameters Ω = (θ, s). We then substitute every parameter sample into Eq. (6) to obtain a sample of U_j. Finally, the posterior PDF in Eq. (12) can be approximated by the histogram of the samples of U_j. After estimating the posterior PDF of each song's expected rating, we follow the Bayes-UCB approach [6] to recommend the song j* that maximizes the quantile function:

j^* = \arg\max_{j=1,\dots,|S|} Q\big(\alpha, p(U_j \mid D_h)\big), \quad (13)

where α = 1 − 1/(h + 1), |S| is the total number of songs in the recommender system, and the quantile function Q returns the value x such that Pr(U_j ≤ x | D_h) = α. The pseudo-code of our algorithm is presented in Algorithm 1.

3.3 Efficient Sampling Algorithm

Bayesian inference is very slow with an off-the-shelf MCMC sampling algorithm because it takes a long time for the Markov chain to converge. In response, we previously proposed an approximate Bayesian model using piecewise linear approximation [12]. However, not only is the original
We select 20,000 songs with top listening counts and 100,000 users who have listened to the most songs. Since this collection of listening history is a form of implicit feedback data, we use the approach proposed in [11] to perform negative sampling. The detailed statistics of the final dataset are shown in Table 1. (15) ri (1 − e −ti /s )viT (16) i=1 Similarly, the conditional distribution p(τ |D, θ, s) remains a Gamma distribution and can be derived as: p(τ |D, θ, s) ∝ p(τ )p(θ|τ )p(s) N p(ri |vi , ti , θ, s, τ ) 4.2 Learning CF Latent Factors i=1 ∝ p(τ )p(θ|τ ) N i=1 First, we determine the optimal value of λ, the regularization parameter, and f , the dimensionality of the latent feature vectors. We randomly split the dataset into three disjoint parts: training set (80%), validation set (10%), and test set (10%). Training set is used to learn the CF latent factors, and the convergence criteria of the ALS algorithm is achieved when the change in root mean square p(ri |vi , ti , θ, s, τ ) 1 T 2 −1 exp(−e0 τ ) × exp − θ (a0 σ I) θ × ∝τ 2 N √ −N 2 1 T −ti /s σ 2π exp − 2 ri − θ vi (1 − e ) 2σ i=1 d0 −1 ∝ τ α−1 exp(−βτ ) ∝ Gamma (α, β) # MH Step tmp i=1 (19) Draw u ∼ U nif orm(0, 1); if u < α then stmp = y; end if end for s(t+1) = stmp ; end for (1 − e−ti /s )2 vi viT N 2 θT θ 1 ri − θ T vi (1 − e−ti /s ) + 2a0 2 i=1 Initialize θ, s, τ ; for t = 1 → M axIteration do Sample θ (t+1) ∼ p(θ|D, τ (t) , s(t) ); Sample τ (t+1) ∼ p(τ |D, θ (t+1) , s(t) ); stmp = s(t) ; for i = 1 → K do Draw y ∼N (stmp , 1); (t+1) (t+1) ,τ ) , 1 ; α = min p(sp(y|D,θ (t+1) (t+1) |D,θ ,τ ) where μ and Σ, respectively the mean and covariance of the multivariate Gaussian distribution, satisfy: Σ−1 = Λ = τ (18) Algorithm 2 Gibbs Sampling for Bayesian Inference 1 p(ri |vi , ti , θ, s, τ ) ∝ exp − θ T (a0 σ 2 I)−1 θ 2 i=1 N 2 1 − 2 ri − θ T vi (1 − e−ti /s ) ×exp 2σ i=1 1 (14) ∝ exp − θ T Λθ + η T θ ∝ N (μ, Σ) 2 N f +N 2 The conditional distribution p(s|D, θ, τ ) has no closed form expression. We thus adopt the Metropolis-Hastings (MH) algorithm [2] with a proposal distribution q(st+1 |st ) = N (st , 1) to draw samples of s. Our detailed Gibbs sampling process is presented in Algorithm 2. N % Density 1.035% where α and β are respectively the shape and rate of the Gamma distribution and satisfy: i=1 ∝ p(θ|τ ) # Observations 20,699,820 Table 1: Size of the dataset. Density is the percentage of entries in the user-song matrix that have observations. Bayesian model altered, tuning the numerous (hyper)parameters is also tedious. In this paper, we present a better way to improve efficiency. Since it is simple to sample from a conditional distribution, we develop a specific Gibbs sampling algorithm to hasten convergence. Given N training samples D = {vi , ti , ri }N i=1 , the conditional distribution p(θ|D, τ, s) is still a Gaussian distribution and can be obtained as follows: N # Songs 20,000 1 (17) 448 http://labrosa.ee.columbia.edu/millionsong/tasteprofile 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 2: Prediction accuracy of sampling algorithms. Figure 3: Efficiency comparison of sampling algorithms. (T imeM CM C = 538.762s and T imeGibbs = 0.579s when T rainingSetSize = 1000). error (RMSE) on the validation set is less than 10−4 . Then we use the learned latent factors to predict the ratings on the test set. We first fix f = 55 and vary λ from 0.005 to 0.1; minimal RMSE is achieved at λ = 0.025. We then fix λ = 0.025 and vary f from 10 to 80, and f = 75 yields minimal RMSE. 
Therefore, we adopt the optimal value λ = 0.025 and f = 75 to perform the final ALS CF algorithm and obtain the learned latent feature vector of each song in our dataset. These vectors will later be used for reinforcement learning. Figure 4: Online evaluation platform. 4.3 Efficiency Study To show that our Gibbs sampling algorithm makes Bayesian inference significantly more efficient, we conduct simulation experiments to compare it with an off-the-shelf MCMC algorithm developed in JAGS 2 . We implemented the Gibbs algorithm in C++, which JAGS uses, for a fair comparison. For each data point di ∈ {(vi , ti , ri )}ni=1 in the simulation experiments, vi is randomly chosen from the latent feature vectors learned by the ALS CF algorithm. ti is randomly sampled from unif orm(50, 2592000), i.e. between a time gap of 50 seconds and one month. ri is calculated using Eq. (6) where elements of θ are sampled from N (0, 1) and s from unif orm(100, 1000). To determine the burn-in and sample size of the two algorithms and to ensure they draw samples equally effectively, we first check to see if they converge to a similar level. We generate a test set of 300 data points and vary the size of the training set to gauge the prediction accuracy. We set K = 5 in the MH step of our Gibbs algorithm. While our Gibbs algorithm achieves reasonable accuracy with burn-in = 20 and sample size = 100, the MCMC algorithm gives comparable results only when both parameters are 10000. Figure 2 shows their prediction accuracies averaged over 10 trials. With burn-in and sample size determined, we then conduct an efficiency study of the two algorithms. We vary the training set size from 1 to 1000 and record the time they take to finish the sampling process. We use a computer with Intel Core i7-2600 CPU @ 3.40Ghz and 8GB RAM. The efficiency comparison result is shown in Figure 3. We can see that computation time of both two sampling algorithms grows linearly with the training set size. However, our proposed Gibbs sampling algorithm is hundreds of times faster than MCMC, 2 http://mcmc-jags.sourceforge.net/ 449 suggesting that our proposed approach is practical for deployment in online recommender systems. 4.4 User Study In an online user study, we compare the effectiveness of our proposed recommendation algorithm, Bayes-UCB-CF, with that of two baseline algorithms: (1) Greedy algorithm, representing the traditional recommendation strategy without exploration-exploitation trade-off. (2) BayesUCB-Content algorithm [12], which also adopts the BayesUCB technique but is content-based instead of CF-based. We do not perform offline evaluation because it cannot capture the effect of the elapsed time t in our rating model and the interactiveness of our approach. Eighteen undergraduate and graduate students (9 females and 9 males, age 19 to 29) are invited to participate in the user study. The subject pool covers a variety of majors of study and nationalities, including American, Chinese, Korean, Malaysian, Singaporean and Iranian. Subjects receive a small payment for their participation. The user study takes place over the course of two weeks in April 2014 on a user evaluation website we constructed (Figure 4). The three algorithms evaluated are randomly assigned to numbers 1-3 to avoid bias. For each algorithm, 200 recommendations are evaluated using a rating scale from 1 to 5. Subjects are reminded to take breaks frequently to avoid fatigue. To minimize the carryover effect, subjects cannot evaluate two different algorithms in one day. 
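To make the recommendation step concrete, the sketch below strings together the rating model of Eq. (6) and the Bayes-UCB selection rule of Eq. (13): given posterior samples of (θ, s), here drawn from stand-in distributions rather than the Gibbs sampler of Algorithm 2, each candidate song's expected-rating samples are formed and the song with the highest 1 − 1/(h+1) quantile is recommended. It is a schematic illustration with assumed inputs, not the evaluated system.

```python
import numpy as np

def bayes_ucb_pick(theta_samples, s_samples, song_vecs, elapsed, h):
    """Recommend the song whose posterior expected rating has the largest
    alpha-quantile, with alpha = 1 - 1/(h+1) as in Bayes-UCB.

    theta_samples: (n_samples, f) posterior samples of theta
    s_samples:     (n_samples,)   posterior samples of the memory strength s
    song_vecs:     (n_songs, f)   CF latent vectors v_j
    elapsed:       (n_songs,)     seconds since each song was last heard (t_j)
    """
    alpha = 1.0 - 1.0 / (h + 1)
    cf = theta_samples @ song_vecs.T                        # (n_samples, n_songs), theta^T v
    novelty = 1.0 - np.exp(-elapsed[None, :] / s_samples[:, None])
    U = cf * novelty                                        # samples of Eq. (6)
    quantiles = np.quantile(U, alpha, axis=0)               # one alpha-quantile per song
    return int(np.argmax(quantiles))

# Stand-in posterior samples and catalogue, for illustration only.
rng = np.random.default_rng(6)
n_samples, f, n_songs = 200, 75, 500
theta_samples = rng.normal(size=(n_samples, f))
s_samples = rng.uniform(100, 1000, size=n_samples)
song_vecs = rng.normal(size=(n_songs, f))
elapsed = rng.uniform(50, 2_592_000, size=n_songs)          # 50 s to one month, as in Sec. 4.3
print("recommend song index:", bayes_ucb_pick(theta_samples, s_samples, song_vecs, elapsed, h=10))
```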
For the user study, Bayes-UCB-CF’s hyperparameters are set as: a0 = 10, b0 = 3, c0 = 0.01, d0 = 0.001 and e0 = 0.001. Since maximizing the total expected rating is the main objective of a music recommender system, we thus compare the cumulative average rating of the three algorithms. Figure 5 shows the average rating and standard error of 15th International Society for Music Information Retrieval Conference (ISMIR 2014) the manuscript. This project is funded by the National Research Foundation (NRF) and managed through the multi-agency Interactive & Digital Media Programme Office (IDMPO) hosted by the Media Development Authority of Singapore (MDA) under Centre of Social Media Innovations for Communities (COSMIC). 7. REFERENCES [1] P. Cano, M. Koppenberger, and N. Wack. Content-based music audio recommendation. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 211– 212. ACM, 2005. [2] S. Chib and E. Greenberg. Understanding the metropolishastings algorithm. The American Statistician, 49(4):327– 335, 1995. [3] J. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual ACM international conference on SIGIR, pages 230–237. ACM, 1999. Figure 5: Recommendation performance comparison. each algorithm from the beginning till the n-th recommendation iteration. We can see that our proposed Bayes-UCBCF algorithm significantly outperforms Bayes-UCB-Content, suggesting that the latter still fails to bridge the semantic gap between high-level user preferences and low-level audio features. T-tests show that Bayes-UCB-CF starts to significantly outperform the Greedy baseline after the 46th iteration (pvalue < 0.0472). In fact, Greedy’s performance decays rapidly after the 60th iteration while others continue to improve. Because Greedy solely exploits, it is quickly trapped at a local optima, repeatedly recommending the few songs with initial good ratings. As a result, the novelty of those songs plummets, and users become bored. Greedy will introduce new songs after collecting many low ratings, only to be soon trapped into a new local optima. By contrast, our Bayes-UCB-CF algorithm balances exploration and exploitation and thus significantly improves the recommendation performance. [4] R. Karimi, C. Freudenthaler, A. Nanopoulos, and L. SchmidtThieme. Active learning for aspect model in recommender systems. In Symposium on Computational Intelligence and Data Mining, pages 162–167. IEEE, 2011. [5] R. Karimi, C. Freudenthaler, A. Nanopoulos, and L. SchmidtThieme. Non-myopic active learning for recommender systems based on matrix factorization. In International Conference on Information Reuse and Integration, pages 299–303. IEEE, 2011. [6] E. Kaufmann, O. Cappé, and A. Garivier. On bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592– 600, 2012. [7] N. Koenigstein, G. Dror, and Y. Koren. Yahoo! music recommendations: modeling music ratings with temporal dynamics and item taxonomy. In Proceedings of the fifth ACM conference on Recommender systems, pages 165–172. ACM, 2011. [8] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. Grouplens: applying collaborative filtering to usenet news. Communications of the ACM, 40(3):77–87, 1997. 5. 
CONCLUSION We present a novel reinforcement learning approach to music recommendation that remedies the greedy nature of the collaborative filtering approaches by balancing exploitation with exploration. A Bayesian graphical model incorporating both the CF latent factors and novelty is used to learn user preferences. We also develop an efficient sampling algorithm to speed up Bayesian inference. In music recommendation, our work is the first attempt to investigate the exploration-exploitation trade-off and to address the greedy problem in CF-based approaches. Results from simulation experiments and user study have shown that our proposed algorithm significantly improves recommendation performance over the long term. To further improve recommendation performance, we plan to deploy a hybrid model that combines content-based and CF-based approaches in the proposed framework. 6. ACKNOWLEDGEMENT We thank the subjects in our user study for their participation. We are also grateful to Haotian “Sam” Fang for proofreading [9] B. Logan. Music recommendation from song sets. In ISMIR, 2004. [10] B. McFee, T. Bertin-Mahieux, D. P.W. Ellis, and G. R.G. Lanckriet. The million song dataset challenge. In Proceedings of international conference companion on World Wide Web, pages 909–916. ACM, 2012. [11] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In Eighth IEEE International Conference on Data Mining, pages 502– 511. IEEE, 2008. [12] X. Wang, Y. Wang, D. Hsu, and Y. Wang. Exploration in interactive personalized music recommendation: A reinforcement learning approach. arXiv preprint arXiv:1311.6355, 2013. [13] K. Yoshii and M. Goto. Continuous plsi and smoothing techniques for hybrid music recommendation. In ISMIR, pages 339–344, 2009. [14] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In Algorithmic Aspects in Information and Management, pages 337– 348. Springer, 2008. 450 15th International Society for Music Information Retrieval Conference (ISMIR 2014) IMPROVING MUSIC RECOMMENDER SYSTEMS: WHAT CAN WE LEARN FROM RESEARCH ON MUSIC TASTES? Audrey Laplante École de bibliothéconomie et des sciences de l’information, Université de Montréal [email protected] the taste profiles of these users. One of the principal limitations of RS based on CF is that, before they could gather sufficient information about the preferences of a user, they perform poorly. This corresponds to the welldocumented new user cold-start problem. One way to ease this problem would be to try to enrich the taste profile of a new user by relying on other types of information that are known to be correlated with music preferences. More recently, it has become increasingly common for music RS to encourage users to create a personal profile, or to allow them to connect to the system with a general social network site account (for instance, Deezer users can connect with their Facebook or their Google+ account). Music RS thus have access to a wider array of information regarding new users. Research on music tastes can provide insights into how to take advantage of this information. More than a decade ago, similar reasoning led Uitdenbogerd and Schyndel [2] to review the literature on the subject to identify the factors affecting music tastes. 
In 2003, however, a paper published by Rentfrow and Gosling [3] on the relationship between music and personality generated a renewed interest for music tastes among researchers, which translated into a sharp increase in research on this topic. In this paper, we propose to review the recent literature on music preferences from social psychology and sociology of music to identify the correlates of music tastes and to understand how music tastes are formed and evolve through time. We first explain the process by which we identified and selected the articles and books reviewed. We then present the structure and the correlates of music preferences based on the literature review. We conclude with a brief discussion on the implications of these findings for music RS design. ABSTRACT The success of a music recommender system depends on its ability to predict how much a particular user will like or dislike each item in its catalogue. However, such predictions are difficult to make accurately due to the complex nature of music tastes. In this paper, we review the literature on music tastes from social psychology and sociology of music to identify the correlates of music tastes and to understand how music tastes are formed and evolve through time. Research shows associations between music preferences and a wide variety of sociodemographic and individual characteristics, including personality traits, values, ethnicity, gender, social class, and political orientation. It also reveals the importance of social influences on music tastes, more specifically from family and peers, as well as the central role of music tastes in the construction of personal and social identities. Suggestions for the design of music recommender systems are made based on this literature review. 1. INTRODUCTION The success of a music recommender system (RS) depends on its ability to propose the right music, to the right user, at the right moment. This, however, is an extremely complex task. A wide variety of factors influence the development of music preferences, thus making it difficult for systems to predict how likely a particular user is to like or dislike a piece of music. This probably explains why music RS are often based on collaborative filtering (CF): it allows systems to uncover complex patterns in preferences that would be difficult to model based on musical attributes [1]. However, in order to make those predictions as accurate as possible, these systems need to collect a considerable amount of information about the music preferences of each user. To do so, they elicit explicit feedback from users, inviting them to rate, ban, or love songs, albums, or artists. They also collect implicit feedback, most often in the form of purchase or listening history data (including songs skipped) of individual users. These pieces of information are combined to form the user’s music taste profile, which allows the systems to identify like-minded users and to recommend music based on 2. METHODS We used two databases to identify the literature on music preferences, one in psychology, PsycINFO (Ovid), and one in sociology, Sociological Abstracts (ProQuest). We used the thesaurus of each database to find the descriptors that were used to represent the two concepts of interest (i.e., music, preferences), which led to the queries presented in Table 1. © Audrey Laplante Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Audrey Laplante. 
“Improving Music Recommender Systems: What Can We Learn from Research on Music Tastes?”, 15th International Society for Music Information Retrieval Conference, 2014. PsycINFO music AND preferences Sociological Abstracts: (music OR "music/musical") AND ("preference/preferences" OR preferences) Table 1. Queries used to retrieve articles in databases 451 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Both searches were limited to the subject heading field. We also limited the search to peer-reviewed publications and to articles published in 1999 or later to focus on the articles published during the last 15 years. This yielded 155 articles in PsycINFO and 38 articles in Sociological Abstracts. Additional articles and books were identified through chaining (i.e., by following citations in retrieved articles), which allowed us to add a few important documents that had been published before 1999. Considering the limited space and the large number of documents on music tastes, further selection was needed. After ensuring that all aspects were covered, we rejected articles with a narrow focus (e.g., articles focusing on a specific music genre or personality trait). For topics on which there were several publications, we retained articles with the highest number to citations based on Google Scholar. We also decided to exclude articles on the relationship between music preferences and the functions of music to concentrate on individual characteristics. with varimax rotation on participants’ ratings. This allowed them to uncover a factor structure of music preferences, composed of four dimensions, which they labeled Reflective and Complex, Intense and Rebellious, Upbeat and Conventional, and Energetic and Rhythmic. Table 2 shows the genres most strongly associated with each dimension. To verify the generalizability of this structure across samples, they replicated the study with 1,384 students of the same university, and examined the music libraries of individual users in a peer-to-peer music service. This allowed them to confirm the robustness of the model. Music-preference dimension 3. REVIEW OF LITERATURE ON MUSIC TASTES Research shows that people, especially adolescents, use their music tastes as a social badge through which they convey who they are, or rather how they would like to be perceived [4, 5]. This indicates that people consider that music preferences reflect personality, values, and beliefs. In the same line, people often make inferences about the personality of others based on their music preferences, as revealed by a study in which music was found to be the main topic of conversation between two young adults who are given the task of getting to know each other [6]. The same study showed that these inferences are often accurate: people can correctly infer several psychological characteristics based on one’s music preferences, which suggests that they have an intuitive knowledge of the relationships that exist between music preferences and personality. Several researchers have studied these relationships systematically to identify the correlates of music preferences that pertain to personality and demographic characteristics, values and beliefs, and social influences and stratification. 
Genres most strongly associated Reflective and Complex Blues, Jazz, Classical, Folk Intense and Rebellious Rock, Alternative, Heavy metal Upbeat and Conventional Country, Sound tracks, Religious, Pop Energetic and Rhythmic Rap/hip-hop, Soul/funk, Electronica/dance Table 2. Music-preference dimensions of Rentfrow and Gosling (2003). Several other researchers replicated Rentfrow and Gosling’s study with other populations and slightly different methodologies. To name a few, [8] surveyed 2,334 Dutch adolescents aged 12–19; [9] surveyed 268 Japanese college students; [10, 11] surveyed 422 and 170 German students, respectively; and [12] surveyed 358 Canadian students. Although there is a considerable degree of similarity in the results across these studies, there also appears to be a few inconsistencies. Firstly, the number of factors varies: while 4 studies revealed a 4factor structure [3, 8-10], one found 5 factors [11], and another, 9 factors1 [12]. These differences could potentially be explained by the fact that researchers used different music preference tests: the selection of the genres to include in these tests depends on the listening habits of the target population and thus needs to be adapted. The grouping of genres also varies. In the 4 above-mentioned studies in which a 4-factor structure was found, rock and metal music were consistently grouped together. However, techno/electronic was not always grouped with the same genres: while it was grouped with rap, hip-hop, and soul music in 3 studies, it was grouped with popular music in the study with the Dutch sample [8]. Similarly, religious music was paired with popular music in Rentfrow and Gosling’s study, but was paired with classical and jazz music in the 3 other studies. These discrepancies could come from the fact that some music genres might have different connotations in different cultures. It can also be added that music genres are problematic in themselves: they are broad, inconsistent, and ill-defined. To 3.1 Dimensions of music tastes There are numerous music genres and subgenres. However, as mentioned in [7], attitudes toward genres are not isolated from one another: there are genres that seem to go together while others seem to oppose. Therefore, to reduce the number of variables, prior to attempting to identify the correlates of music preferences, most researchers start by examining the nature of music preferences to identify the principal dimensions. The approach of Rentfrow and Gosling [3] is representative of the work of several researchers. To uncover the underlying structure of music preferences, they first asked 1,704 students from an American university to indicate their liking of 14 different music genres using a 7-point Likert scale. This questionnaire was called the Short Test Of Music Preferences (STOMP). They then performed factor analysis by means of principal-components analysis 1 For this study, the researchers started with 30 genres, as opposed to others who used between 11 and 21 genres. 452 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Energetic and Rhythmic music and Extraversion, which was found in most other studies [3, 8, 10], was not found with the Japanese sample. solve these problems, Rentfrow and colleagues [13, 14] replicated the study yet again, but used 52 music excerpts representing 26 different genres to measure music preferences instead of a list of music genres. The resulting structure was slightly different. 
It was composed of 5 factors labeled Mellow, Unpretentious, Sophisticated, Intense, and Contemporary (MUSIC). This approach also allowed them to examine the ties between the factors and the musical attributes. To do so, they asked non-experts to rate each music excerpt according to various attributes (i.e., auditory features, affect, energy level, perceived complexity) and used this information to identify the musical attributes that were more strongly associated with each factor. 3.3 Values and Beliefs Fewer recent studies have focused on the relationship between music preferences and values or beliefs compared to personality. Nevertheless, several correlates of music preferences were found in this area, from political orientation to religion to vegetarianism [15]. 3.3.1 Political Orientation In the 1980s, Peterson and Christenson [16] surveyed 259 American university students on their music preferences and political orientation. They found that liberalism was positively associated with liking jazz, reggae, soul, or hardcore punk, whereas linking 70s rock or 80s rock was negatively related to liberalism. They also uncovered a relationship between heavy metal and political alienation: heavy metal fans were significantly more likely than others to check off the “Don’t know/don’t care” box in response to the question about their political orientation. More recently, Rentfrow and Gosling [3] found that political conservatism was positively associated with liking Upbeat & Conventional music (e.g., popular music), whereas political liberalism was positively associated with liking Energetic and Rhythmic (e.g., rap, hip-hop) or Reflective and Complex (e.g., classical, jazz) music, although the last two correlations were weak. North and Hargreaves [15], who surveyed 2,532 British individuals, and Gardikiotis and Baltzis [17], who surveyed 606 Greek college students, also found that people who liked classical music, opera, and blues were more likely to have liberal, pro-social beliefs (e.g., public health care, protection of the environment, taking care of the most vulnerable). In contrast, fans of hip-hop, dance, and DJ-based music were found to be among the least likely groups to hold liberal beliefs (e.g., increased taxation to pay for public services, public health care) [15]. As we can see, liking jazz and classical music was consistently associated with liberalism, but no such clear patterns of associations emerged for other music genres, which suggests that further research is needed. 3.2 Personality traits Several researchers have examined the relationship between music preferences and personality traits [3, 8-10] using the 5-factor model of personality, commonly called the “Big Five” dimensions of personality (i.e., Extraversion, Emotional Stability, Agreeableness, Conscientiousness, and Openness to Experience). Rentfrow and Gosling [3] were the first to conduct a large-scale study focusing on this aspect, involving more than 3,000 participants. In addition to taking the STOMP test for measuring their music preferences, participants had to complete 6 personality tests, including the Big Five Inventory. The analysis of the results revealed associations between some personality traits and the 4 dimensions of music preferences. 
For instance, they found that liking Reflective and Complex music (e.g., classical, jazz) or Intense and Rebellious music (e.g., rock, metal) was positively related to Openness to Experience; and liking Upbeat and Conventional music (e.g., popular music) or Energetic and Rhythmic music (e.g., rap, hip-hop) was positively correlated with extraversion. Emotional Stability was the only personality dimension that had no significant correlation with any of the music-preference dimensions. Openness and Extraversion were the best predictors of music preferences. As mentioned previously, since researchers use different genres and thus find different music-preference dimensions, comparing results from various studies is problematic. Nonetheless, subsequent studies seem to confirm most of Rentfrow and Gosling’s findings. Delsing et al. [8] studied Dutch adolescents and found a similar pattern of associations between personality and music preferences dimensions. Only two correlations did not match. However, it should be noted that the correlations were generally lower, a disparity the authors attribute to the age difference between the two samples (college student vs. adolescents): adolescents being more influenced than young adults by their peers, personality might have a lesser effect on their music preferences. Brown [9] found fewer significant correlations when studying Japanese university students. The strongest correlations concerned Openness, which was positively associated with liking Reflective and Complex music (e.g., classical, jazz) and negatively related to liking Energetic and Rhythmic music (e.g., hip-hop/rap). The positive correlation between 3.3.2 Religious Beliefs There are very few studies that examined the link between music preferences and religion. The only recent one we could find was the study by North and Hargreaves previously mentioned [15]. Their analysis revealed that fans of western, classical music, disco, and musicals were the most likely to be religious; whereas fans of dance, indie, or DJ-based music were least likely to be religious. They also found a significant relation between music preferences and the religion affiliation of people. Fans of rock, musicals, or adult pop were more likely to be Protestant; fans of opera or country/western were more likely to be Catholic; and fans of R&B and hip-hop/rap were more likely to adhere to other religions. Another older study used the 1993 General Social Survey to examine the attitude of American adults towards heavy metal and rap music and found that people who attended religious services were more likely to dislike heavy metal 453 15th International Society for Music Information Retrieval Conference (ISMIR 2014) (no such association was found with rap music) [18]. Considering that religious beliefs vary across cultures, further studies are needed to discern a clear pattern of associations between music preferences and religion. Black Stylists cluster was composed of fans of hip-hop and reggae who were largely black, with some South Asian representation. By contrast, the Hard Rockers, who like heavy metal and alternative music, were almost exclusively white. 3.4 Demographic Variables 3.4.3 Age Most researchers who study music preferences draw their participants from the student population of the university where they work. As a result, samples are mostly homogenous in terms of age, which explains the small number of studies that focused on the relationship between age and music preferences. 
Age was found to be significantly associated with music preferences. For instance, [23] compared the music preferences of different age groups and found that there were only two genres—rock and country—that appeared in the five most highly rated genres of both the 18-24 year olds and the 55-64 year olds. While the favourite genres of younger adults were rap, metal, rock, country, and blues; older adults preferred gospel, country, mood/easy listening, rock, and classical/chamber music. [15] also found a correlation between age and preferences for certain music genres. Unsurprisingly, their analysis revealed that people who liked what could be considered trendy music genres (e.g., hiphop/rap, DJ-based music, dance, indie, chart pop) were more likely to be young, whereas people who liked more conventional music genres (e.g., classical music, sixties pop, musicals, country) were more likely to be older. [24] conducted a study involving more than 250,000 participants and found that the interest for music genres associated with the Intense (e.g., rock, heavy metal, punk) and the Contemporary (e.g., rap, funk, reggae) musicpreference dimensions decreases with age, whereas the interest for music genres associated with the Unpretentious (e.g., pop, country) and the Sophisticated (e.g., classical, folk, jazz) dimensions increases. Some researchers have also looked at the trajectory of music tastes. Studies on the music preferences of children and adolescents revealed that as they get older, adolescents tend to move away from mainstream rock and pop, although these genres remain popular throughout adolescence [7]. Research has also demonstrated that music tastes are already fairly stable in early adolescence and further crystallize in late adolescence or early adulthood [25, 26]. Using data from the American national Survey of Public Participation in the Arts (SPPA) of 1982, 1992, and 2002, [23] examined the relationship between age and music tastes, with a focus on older age. They looked at the number of genres liked per age group and found that in young adulthood, people had fairly narrow tastes. Their tastes expand into middle age (i.e., 55 year old), to then narrow again, suggesting that people disengage from music in older age. They also found that although music genres that are popular among younger adults change from generation to generation; they remain much more stable among older people. 3.4.1 Gender Several studies have revealed associations between gender and music tastes. It was found that women were more likely to be fans of chart pop or other types of easy listening music (e.g., country) [7, 12, 15, 19, 20], whereas men were more likely to prefer rock and heavy metal [12, 15, 19, 20]. This is not to say that women do not like rock: in Colley’s study [19], which focused on gender differences in music tastes, rock was the second most highly rated music genre among women: the average rating for women was 4.1 (on an 8-point scale from 0 to 7) vs. 4.8 for men. There was, however, a much greater gap in the attitudes towards popular music between men and women, who attributed 3.17 and 4.62 on average, respectively. This was the genre for which gender difference was the most pronounced. Lastly, it is worth mentioning that most studies did not find any significant gender difference for rap [7, 12, 19], which indicates that music in this category appeals to both sexes. This is a surprising result considering the misogynistic message conveyed by many rap songs. 
Christenson and Roberts [7] speculated that this could be due to the fact that men appreciate rap for its subversive lyrics while women appreciate it for its danceability. 3.4.2 Race and Ethnicity Very few studies have examined the ties between music preferences and race and ethnicity. In the 1970s, a survey of 919 American college students revealed that, among the demographic characteristics, race was the strongest predictor of music preferences [21]. In a book published in 1998 [7], Christenson and Roberts affirmed that racial and ethnic origins of fans of a music genre mirror those of its musicians. To support their affirmation, they reported the results of a survey of adolescents conducted in the 1990s by Carol Dykers in which 7% of black adolescents reported rap as their favourite music genre, compared with 13% of white adolescents. On the other hand, 25% of white adolescents indicated either rock or heavy metal as their favourite genre, whereas these two genres had been only mentioned by a very small number of black adolescents (less than 5% for heavy metal). North & Hargreaves [15] also found a significant relationship between ethnic background and music preferences. This study was conducted more recently (in 2007), with British adults, and with a more diversified sample in terms of ethnic origins. Interestingly, they found that a high proportion of the respondents who were from an Asian background liked R&B, dance, and hip-hop/rap, which seems to challenge Christenson and Roberts’ affirmation. [22] who studied 3,393 Canadian adolescents, performed a cluster analysis to group respondents according to their music preferences. They then examined the correlates of each music-taste cluster. The analysis revealed a different ethnic composition for different clusters. For instance, the 3.4.4 Education Education was also found to be significantly correlated to music preferences. [15] found that individuals who held a master’s degree or a Ph.D. were most likely to like opera, 454 15th International Society for Music Information Retrieval Conference (ISMIR 2014) jazz, classical music, or blues; whereas fans of country, musicals, or 1960s pop were most likely to have a lower level of education. [27] studied 325 adolescents and their parents and also found an association between higher education and a taste for classical and jazz music. Parents with lower education were more likely to like popular music and to dislike classical and jazz music. 3.5 Social influences As mentioned before, research established that people use their music preferences as a social badge that conveys information about their personality, values, and beliefs. But music does not only play a role in the construction of personal identity. It is also important to social identity. Music preferences can also act as a social badge that indicates membership in a social group or a social class. 3.5.1 Peers and Parents Considering the importance adolescents ascribe to both friendship and music, it is not surprising to learn that social groups often identify with music subcultures during adolescence [4]. Therefore, it seems legitimate to posit that in the process of forming their social identity, adolescents may adopt music preferences similar to that of other members of the social group to which they belong or they aspire to belong. This hypothesis seems to be confirmed by recent studies. 
[28] examined the music preferences of 566 Dutch adolescents who formed 283 samesex friendship dyads and found a high degree of similarity in the music preferences of mutual friends. Since they surveyed the same participants one year after the initial survey, they could also examine the role of music preferences in the formation of new friendships and found that adolescents who had similar music preferences were more likely to become friends, as long as their music preferences were not associated with the most mainstream dimensions. In the same line, Boer and colleagues [29] conducted three studies (two laboratory experiments involving German participants and one field study involving Hong Kong university students) to examine the relationship between similarity in music preferences and social attraction. They found that people were more likely to be attracted to others who shared their music tastes because it suggests that they might also share the same values. Adolescents were also found to be influenced by the music tastes of their parents. ter Bogt and colleagues [27] studied the music tastes of 325 adolescents and their parents. Their analysis revealed some significant correlations. The adolescents whose parents liked classical or jazz music were also more likely to appreciate these music genres. Parents’ preferences for popular music were associated with a preference for popular and dance music in their adolescent children. Parents were also found to pass on their liking of rock music to their adolescent daughters but not to their sons. One possible explanation for the influence of parents on their children’s music tastes is that since family members live under the same roof, children are almost inevitably exposed to the favourite music of their parents. 455 3.5.2 Social Class In La Distinction [30], Bourdieu proposed a social stratification of tastes and cultural practices according to which a taste for highbrow music or other cultural products (and a disdain for lowbrow culture) is considered the expression of a high status. Recent research, however, suggests that a profound transformation in the tastes of the elite has occurred. In an article published in 1996, Peterson and Kern [31] reported the results of a study of the musical tastes of Americans based on data from the Survey of Public Participation in the Arts of 1982 and 1992. Their analysis revealed that far from being snobbish in their tastes, individuals with a high occupational status had eclectic tastes which spanned across the lowbrow/highbrow spectrum. In fact, people of high status were found to be more omnivorous than others, and their level of omnivorousness has increased over time. This highly cited study has motivated several other researchers to study the link between social class and music preferences. Similar studies were conducted in other countries, notably in France [32], Spain [33], and the Netherlands [34], and yielded similar results. 4. IMPLICATION FOR MUSIC RECOMMENDER SYSTEM DESIGN A review of the literature on music tastes revealed many interesting findings that could be used to improve music RS. Firstly, we saw that researchers had been able to uncover the underlying structure of music preferences, which is composed of 4 or 5 factors. The main advantage for music RS is that these factors are fairly stable across populations and time, as opposed to genres, which are inconsistent and ill-defined. 
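To make this concrete, the following minimal sketch (our illustration, not taken from any of the reviewed studies) shows how a recommender could score a new user on the four dimensions of Table 2 from a handful of genre ratings. The genre-to-dimension mapping follows Table 2; the data structures and function names are assumptions.

```python
# Genre groupings copied from Table 2 (Rentfrow and Gosling, 2003)
DIMENSIONS = {
    "Reflective and Complex": ["blues", "jazz", "classical", "folk"],
    "Intense and Rebellious": ["rock", "alternative", "heavy metal"],
    "Upbeat and Conventional": ["country", "soundtracks", "religious", "pop"],
    "Energetic and Rhythmic": ["rap/hip-hop", "soul/funk", "electronica/dance"],
}

def dimension_scores(genre_ratings):
    """Average a user's genre ratings (e.g. a 1-7 Likert scale) within each
    dimension; a dimension with no rated genres gets None."""
    scores = {}
    for dim, genres in DIMENSIONS.items():
        rated = [genre_ratings[g] for g in genres if g in genre_ratings]
        scores[dim] = sum(rated) / len(rated) if rated else None
    return scores

# Hypothetical ratings gathered from a new user at sign-up
user = {"jazz": 6, "classical": 7, "rock": 5, "pop": 2, "rap/hip-hop": 3}
print(dimension_scores(user))
```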
As suggested by Rentfrow, Goldberg, and Levitin themselves [13], music RS could characterize the music preferences of their users by calculating a score for each dimension. Secondly, some personality dimensions were found to be correlated to music preferences. In most studies, Openness to experience was the strongest predictor of music tastes. It was positively related to liking Reflective and Complex music (e.g., jazz and classical) and, to a lesser extent, to Intense and Rebellious music (e.g., rock, heavy metal). This could indicate that users who like these music genres are more open to new music than other users. RS could take that into account and adapt the novelty level accordingly. Finally, the demographic correlates of music preferences (e.g., age, gender, education, race), as well as religion and political orientation, could help ease the new user cold-start problem. As mentioned in the introduction, many music RS invite new users to create a profile and/or allow them to connect with a social networking site account, in which they have a profile. These profiles contain various types of information about users. Music RS could combine such information to make inferences about the music preferences of new users. In the same line, information about the education and the occupation of a user could be used to identify potential high-status, omnivore users. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 5. CONCLUSION The abundant research on music tastes in sociology and social psychology has been mostly overlooked by music RS developers. This review of selected literature on the topic allowed us to present the patterns of associations between music preferences and demographic characteristics, personality traits, values and beliefs. It also revealed the importance of social influences on music tastes and the role music plays in the construction of individual and social identities. [16] [17] [18] [19] 6. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [20] Y. Koren: “Factor in the neighbors: Scalable and accurate collaborative filtering,” ACM Transactions on Knowledge Discovery Data, Vol. 4, No. 1, pp. 1-24, 2010. A. Uitdenbogerd and R. V. Schyndel: "A review of factors affecting music recommender success," ISMIR 2002: Proceedings of the Third International Conference on Music Information Retrieval, M. Fingerhut, ed., pp. 204208, Paris, France: IRCAM - Centre Pompidou, 2002. P. J. Rentfrow and S. D. Gosling: “The do re mi's of everyday life: The structure and personality correlates of music preferences.,” Journal of Personality and Social Psychology, Vol. 84, No. 6, pp. 1236-1256, 2003. S. Frith: Sound effects : youth, leisure, and the politics of rock'n'roll, New York: Pantheon Books, 1981. A. C. North and D. J. Hargreaves: “Music and adolescent identity,” Music Education Research, Vol. 1, No. 1, pp. 75-92, 1999. P. J. Rentfrow and S. D. Gosling: “Message in a ballad: the role of music preferences in interpersonal perception,” Psychological Science, Vol. 17, No. 3, pp. 236-242, 2006. P. G. Christenson and D. F. Roberts: It's not only rock & roll : popular music in the lives of adolescents, Cresskill: Hampton Press, 1998. M. J. M. H. Delsing, T. F. M. ter Bogt, R. C. M. E. Engels, and W. H. J. Meeus: “Adolescents' music preferences and personality characteristics,” European Journal of Personality, Vol. 22, No. 2, pp. 109-130, 2008. R. 
Brown: “Music preferences and personality among Japanese university students,” International Journal of Psychology, Vol. 47, No. 4, pp. 259-268, 2012. A. Langmeyer, A. Guglhor-Rudan, and C. Tarnai: “What do music preferences reveal about personality? A crosscultural replication using self-ratings and ratings of music samples,” Journal of Individual Differences, Vol. 33, No. 2, pp. 119-130, 2012. T. Schäfer and P. Sedlmeier: “From the functions of music to music preference,” Psychology of Music, Vol. 37, No. 3, pp. 279-300, 2009. D. George, K. Stickle, R. Faith, and A. Wopnford: “The association between types of music enjoyed and cognitive, behavioral, and personality factors of those who listen,” Psychomusicology, Vol. 19, No. 2, pp. 32-56, 2007. P. J. Rentfrow, L. R. Goldberg, and D. J. Levitin: “The structure of musical preferences: A five-factor model,” Journal of Personality and Social Psychology, Vol. 100, No. 6, pp. 1139-1157, 2011. P. J. Rentfrow, L. R. Goldberg, D. J. Stillwell, M. Kosinski, S. D. Gosling, and D. J. Levitin: “The song remains the same: A replication and extension of the music model,” Music Perception, Vol. 30, No. 2, pp. 161185, 2012. A. C. North and D. J. Hargreaves, Jr.: “Lifestyle correlates of musical preference: 1. Relationships, living [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] 456 arrangements, beliefs, and crime, ” Psychology of Music, Vol 35m No. 1, pp. 58-87, 2007. J. B. Peterson and P. G. Christenson: “Political orientation and music preference in the 1980s,” Popular Music and Society, Vol. 11, No. 4, pp. 1-17, 1987. A. Gardikiotis and A. Baltzis: “'Rock music for myself and justice to the world!': Musical identity, values, and music preferences,” Psychology of Music, Vol. 40, No. 2, pp. 143-163, 2012. J. Lynxwiler and D. Gay: “Moral boundaries and deviant music: public attitudes toward heavy metal and rap,” Deviant Behavior, Vol. 21, No. 1, pp. 63-85, 2000. A. Colley: “Young people's musical taste: relationship with gender and gender-related traits,” Journal of Applied Social Psychology, Vol. 38, No. 8, pp. 2039-2055, 2008. P. G. Christenson and J. B. Peterson: “Genre and gender in the structure of music preferences,” Communication Research, Vol. 15, No. 3, pp. 282-301, 1988. R. S. Denisoff and M. H. Levine: “Youth and popular music: A test of the taste culture hypothesis,” Youth & Society, Vol. 4, No. 2, pp. 237-255, 1972. J. Tanner, M. Asbridge, and S. Wortley: “Our favourite melodies: musical consumption and teenage lifestyles,” British Journal of Sociology, Vol. 59, No. 1, pp. 117-144, 2008. J. Harrison and J. Ryan: “Musical taste and ageing,” Ageing & Society, Vol. 30, No. 4, pp. 649-669, 2010. A. Bonneville-Roussy, P. J. Rentfrow, M. K. Xu, and J. Potter: “Music through the ages: Trends in musical engagement and preferences from adolescence through middle adulthood, ” American Psychological Association, pp. 703-717, 2013. M. B. Holbrook and R. M. Schindler: “Some exploratory findings on the development of musical tastes,” Journal of Consumer Research, Vol. 16, No. 1, pp. 119-124, 1989. J. Hemming: “Is there a peak in popular music preference at a certain song-specific age? A replication of Holbrook & Schindler’s 1989 study,” Musicae Scientiae, Vol. 17, No. 3, pp. 293-304, 2013. T. F. M. ter Bogt, M. J. M. H. Delsing, M. van Zalk, P. G. Christenson, and W. H. J. Meeus: “Intergenerational Continuity of Taste: Parental and Adolescent Music Preferences,” Social Forces, Vol. 90, No. 1, pp. 297-319, 2011. M. H. W. Selfhout, S. 
J. T. Branje, T. F. M. ter Bogt, and W. H. J. Meeus: “The role of music preferences in early adolescents’ friendship formation and stability,” Journal of Adolescence, Vol. 32, No. 1, pp. 95-107, 2009. D. Boer, R. Fischer, M. Strack, M. H. Bond, E. Lo, and J. Lam: “How shared preferences in music create bonds between people: Values as the missing link,” Personality and Social Psychology Bulletin, Vol. 37, No. 9, pp. 11591171, 2011. P. Bourdieu: La distinction: critique sociale du jugement, Paris: Éditions de minuit, 1979. R. A. Peterson and R. M. Kern: “Changing Highbrow Taste: From Snob to Omnivore,” American Sociological Review, Vol. 61, No. 5, pp. 900-907, 1996. P. Coulangeon and Y. Lemel: “Is ‘distinction’ really outdated? Questioning the meaning of the omnivorization of musical taste in contemporary France,” Poetics, Vol. 35, No. 2-3, pp. 93-111, 2007. J. López-Sintas, M. E. Garcia-Alvarez, and N. Filimon: “Scale and periodicities of recorded music consumption: reconciling Bourdieu's theory of taste with facts,” The Sociological Review, Vol. 56, No. 1, pp. 78-101, 2008. K. van Eijck: “Social Differentiation in Musical Taste Patterns,” Social Forces, Vol. 79, No. 3, pp. 1163-1185, 2001. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) SOCIAL MUSIC IN CARS Sally Jo Cunningham, David M. Nichols, David Bainbridge, Hasan Ali Department of Computer Science, University of Waikato, New Zealand {sallyjo, d.nichols, davidb}@waikato.ac.nz, [email protected] Wayne: “I think we’ll go with a little Bohemian Rhapsody, gentlemen” Garth: “Good call” Wayne's World (1992) ABSTRACT large majority of drivers in the United States declare they sing aloud when driving”. Walsh provides the most detailed discussion of the social aspects of music in cars, noting the interaction with conversation (particularly through volume levels) and music’s role in filling “chasms of silence” [21]. Issues of impression management [9, 21] (music I like but wouldn’t want others to know I like) are more acute in the confined environment of a car and vary depending on the social relationships between the occupants [21]. Music selections are often the result of negotiations between the passengers and the driver [14, 21], where the driver typically has privileged access to the audio controls. Bull [6] reports a particularly interesting example of the intersection between the private environment of personal portable devices and the social environment of a car with passengers: Jim points to the problematic nature of joint listening in the automobile due to differing musical tastes. The result is that he plays his iPod through the car radio whilst his children listen to theirs independently or playfully in ‘harmony’ resulting in multiple soundworlds in the same space. Here, although the children have personal devices they try to synchronize the playback so that they can experience the same song at the same time; even though their activity will occur in the context of another piece of music on the car audio system. Alternative methods for sharing include explicit (and implicit) recommendation, as in Push!Music [15], and physical sharing of earbuds [3]. Bull [6] also highlights another aspect of music in cars: selection activities that occur prior to a journey. The classic ‘roadtrip’ activity of choosing music to accompany a long drive is also noted: “drivers would intentionally set up and prepare for their journey by explicitly selecting music to accompany the protracted journey “on the road”” [21]. 
Sound Pryer [18] is a joint-listening prototype that enables drivers to ‘pry’ into the music playing in other cars. This approach emphasizes driving as a social practice, though it focuses on inter-driver relationships rather than those involving passengers. Sound Pryer can also be thought of as a transfer of some of the mobile music sharing concepts in the tunA system [2] to the car setting. This paper builds an understanding of how music is currently experienced by a social group travelling together in a car—how songs are chosen for playing, how music both reflects and influences the group’s mood and social interaction, who supplies the music, the hardware/software that supports song selection and presentation. This fine-grained context emerges from a qualitative analysis of a rich set of ethnographic data (participant observations and interviews) focusing primarily on the experience of in-car music on moderate length and long trips. We suggest features and functionality for music software to enhance the social experience when travelling in cars, and prototype and test a user interface based on design suggestions drawn from the data. 1. INTRODUCTION Automobile travel occupies a significant space in modern Western lives and culture. The car can become a ‘homefrom-home’ for commuters in their largely solitary travels, and for groups of people (friends, families, work colleagues) in both long and short journeys [20]. Music is commonly seen as a natural feature of automotive travel, and as cars become increasingly computerized [17] the opportunities are increased for providing music tailored to the specific characteristics of a given journey. To achieve this goal, however, we must first come to a more fine-grained understanding of these car-based everyday music experiences. To that end, this paper explores the role of music in supporting the ‘peculiar sociality’ [20] of car travel. 2. BACKGROUND Most work investigating the experience of music in cars focuses on single-users, (e.g. [4], [5]). Solo drivers are free to create their own audio environment: “the car is a space of performance and communication where drivers report being in dialogue with the radio or singing in their own auditized/privatized space” [5]. Walsh [21] notes that “a © S.J. Cunningham, D.M. Nichols, D. Bainbridge, H. Ali. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S.J. Cunningham, D.M. Nichols, D. Bainbridge, H. Ali.. “Social Music in Cars”, 15th International Society for Music Information Retrieval Conference, 2014. 457 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Driver distraction is known to be a significant factor in vehicle accidents and has led to legislation around the world restricting the use of mobile phones whilst driving. In addition to distraction effects caused by operating audio devices there are the separate issues of how the music itself affects the driver. Driving style can be influenced by genre, volume and tempo of music [10]: “at high levels, fast and loud music has been shown to divert attention [from driving]” [11], although drivers frequently use music to relax [11]. Several reports indicate that drivers use music to relieve boredom on long or familiar routes [1, 21], e.g. “as repetitious scenery encourages increasing disinterest … the personalized sounds of travel assume a greater role in allowing the driver-occupants respite via intermitting the sonic activity during protracted driving stints” [21]. 
Many accidents are caused by driver drowsiness; when linked with physiological sensors to assess the driver’s state, music can be used to assist in maintaining an appropriate level of driver vigilance [16]. Music can also counteract driver vigilance by masking external sounds and auditory warnings, particularly for older drivers where agerelated hearing loss is more likely to occur [19]. In summary, music fulfils a variety of different roles in affecting the mental state of the driver. It competes and interacts with passenger conversation, the external environmental and with audio functions from the increasingly computerized driving interface of the car. When passengers are present, the selection and playing of music is a social activity that requires negotiation between the occupants of the vehicle. people participating in a trip ranged from one to five (Table 2). Of the 69 total travelers across the nineteen journeys, 45 were male and 24 were female. One set of travelers were all female, 7 were all male, and the remainder (11) were mixed gender. Table 1. Demographics of student investigators Male Female National Origin Count 17 5 Age Range: 20 - 27 NZ/Australia China Mid-East Other 5 13 3 1 Grounded Theory methods [13] were used to analyze the student summaries of their participant observations and interviews. This present paper teases out the social behaviors that influence, and are influenced by, music played during group car travel. Supporting evidence drawn from the ethnographic data is presented below in italics. Table 2. Number of travelers in observed journeys 1 2 3 4 5 1 0 7 7 4 4. MUSIC BEHAVIOR IN CAR TRAVEL This section explores: the physical car environment and the reported car audio devices; the different reported roles of the driver; observed behaviors surrounding the choice of songs and the setting of volume; music and driving safety; ordering of songs that are selected to be played; and the ‘activities’ that music supports and influences. 3. DATA COLLECTION AND METHODOLOGY 4.1 Pre-trip Activities Our research uses data collected in a third year university Human Computer Interaction (HCI) course in which students design and prototype a system for the set application, where their designs are informed by an ethnographic investigations into behavior associated with the application domain. This present paper focuses on the ethnographic data collected that relates to music and car travel, as gathered by 22 student investigators (Table 1). All data gathering for this study occurred within New Zealand. To explore the problem of designing a system to support groups of people in selecting and playing music while traveling, The students performed participant observations, with the observations focusing on how the music is chosen for playing, how the music fits in with the other activities being conducted, who supplies the music, and how/who changes the songs or alters the volume. The students then explored subjective social music experiences through autoethnographies [8] and interviews of friends. The data comprises 19 participant observations, two self-interviews, and four interviews (approximately 45 printed pages). Of the 19 participant observations, four were of short drives (10 to 30 minutes), 14 were lengthier trips (50 minutes to 2 hours), and one was a classic ‘road trip’ (7 hours). 
The number of The owner of a car often keeps personal music on hand in the vehicle (CDs, an MP3 player loaded with ‘car music’) as well as carrying along a mobile or MP3 player loaded with his/her music collection). If only the owner’s music is played on the trip, then that person should, logically, also manage the selection of songs during the journey. Unfortunately the owner of the car is also often the driver as well— and so safety may be compromised when the driver is actively involved in choosing and ordering songs for play. Passengers are also likely to have on hand a mobile or MP3 player, and for longer trips may select CDs to share. If two or more people contribute music to be played on the journey, the challenge then becomes to bring all the songs together onto a single device—otherwise they experience the hassle of juggling several players. A consequence of merging collections, however, is that no one person will be familiar with the full set of songs, making on-the-road construction of playlists more difficult (particularly given the impoverished display surface of most MP3 players). A simple pooling of songs from the passengers’ and driver’s personal music devices is unlikely to provide an efficiently utilizable source for selection of songs for a specific journey. The music that an individual listens to during 458 15th International Society for Music Information Retrieval Conference (ISMIR 2014) a usual day’s activities may not be suitable for a particular trip, or indeed for any car journey. People tend to tailor their listening to the activity at hand [7], and so songs that are perfect ‘gym music’ or ‘study music’ may not have the appropriate tempo, mood, or emotional tenor. Further, an individual’s music collection may include ‘guilty pleasures’ that s/he may not want others to become aware of [9]: What mainly made [him] less comfortable in providing music that he likes is because he did [not] want to destroy the hyper atmosphere in the car as a result of the mostly energetic songs being played throughout the trip. His taste is mostly doom and death metal, with harsh emotion and so will create a bleak atmosphere in the car. Conversely, loud, fast tempo music can adversely affect safety ([As the driver, I] changed the volume very high… my body was shaking with the song. I stepped on the accelerator in my car; The driver [was] seen to increase the speed when the songs he liked is on). • Listening to music can be the main source of entertainment during a trip, as the driver and passengers focus on the songs played. • Songs need not be listened to passively; travelers may engage in group sing-alongs, with the music providing support for their ‘performances’. These sessions may be loud and include over-the-top emotive renditions for the amusement of the singer and the group, and be accompanied by clapping and ‘dancing’ in the seats (The participants would sing along to the lyrics of the songs, and also sometimes dance along to the music, laughing and smiling throughout it). • A particular song may spark a conversation about the music—to identify a song (they would know what song they wanted to hear but they would not know the artist or name of the song. When this happened, they would … try to think of the artist name together) or to discuss other aspects of the artist/song/genre/etc (‘In the air tonight, Phil Collins!’ Ann asked Joan and I, ‘did you know that it’s top of the charts at the moment’ … There was conversation about Phil Collins re-releasing his music.) 
A lively debate can surround the choice and ordering of the songs to play, if playlists are created during the trip itself. • Music can provide a background to conversation; at this point the travelers pay little or no attention to the songs but they mask traffic noises (when we were chatting… no one really cared what was on as long as there was some ambient sound). By providing ‘filler’ for awkward silences, music is particularly useful in supporting conversations among groups who don’t know each other particularly well (it seemed more natural to talk when there was music to break the silence). For shorter trips, music might serve only one or two of these social purposes—playing as background to a debate over where to eat, for example. On longer journeys, the focus of group attention and activity is likely to shift over time, and with that shift the role of the music will vary as well: At some times it would be the focus activity, with everyone having input on what song to choose and then singing along. While at other times the group just wanted to talk with each other and so the music was turned right down and became background music… 4.2 Physical Environment and Audio Equipment The travel described in the participant observations primarily occurred in standard sized cars with two seating areas, comfortably seating at most two people in the front and three in the rear sections. In this environment physical movement is constrained. If the audio device controller is fixed in place then not everyone can easily reach it or view its display; if the controller is a handheld device, then it must be passed around (and even then it may be awkward to move the controller between the two sections). As is typical of student vehicles in New Zealand, the cars tended to be older (10+ years) and so were less likely to include sophisticated audio options such as configurable speakers and built-in MP3 systems. The range of audio equipment reported included radio, built-in CD player, portable CD player, stand-alone MP3 player plus speakers, and MP3 player connected to the car audio system. The overwhelming preference evinced in this study is for devices that give more fine-grained control over song selection (i.e., MP3 players over CD players, CD players over radio). The disadvantages of radio are that music choice is by station rather than by song, reception can be disrupted if the car travels out of range, and most channels include ads. On the other hand, radio can provide news and talkback, to break up a longer journey. 4.3 Music in Support of Journey Social Activities Music is seen as integral to the group experience on a trip; it would be unacceptable and anti-social for the car’s occupants to simply each listen to their individual MP3 player, for example. We identify a wide variety of ways that travelers select songs so as to support group social activities during travel: • Music can contribute to driving safety, by playing songs that will reduce driver drowsiness and keep the driver focused (music… can liven up a drive and keep you entertained or awake much longer). For passengers, it can reduce the tedium associated with trips through uninteresting or too-familiar scenery (music can reduce the boredom for you and your friends with the journey). 4.4 Selecting and Ordering Songs The physical music device plays a significant role in determining who chooses the music on a car trip. 
If the device is fixed (typically in the center of the dashboard), then it is easily accessible only by the driver or front passenger—and so they are likely to have primary responsibility for choosing, or arbitrating the choice, of songs. The driver is often 459 15th International Society for Music Information Retrieval Conference (ISMIR 2014) the owner of the vehicle, and in that case is likely to be assertive at decision points (Since I was the driver, I was basically the DJ. I would select the CD and the song to be played. I also changed the song if I didn’t like it even if others in the car did.). Given the small display surfaces of most music devices and the complexity of interactions with those devices, it is likely that safety is compromised when the driver acts as DJ. Consider, for example: I select some remixed trance music from the second CD at odd slots of the playlist, and then insert some pop songs from other CDs in the rest of the slots of the list. … I manually change the play order to random. Also I disable the volume protect. And enable the max volume that from the subwoofer due to the noises from the outside of my car … If the music system has a hand-held controller, then the responsibility for song selection can move through the car. At any one point, however, a single individual will assume responsibility for music management. Friends are often familiar with each other’s tastes, and so decisions can be made amicably with little or no consultation (I felt comfortable in choosing the music because they were mostly friends and I knew what kind of music they were all into and what music some friends were not into…). Imposing one’s will might go against the sense of a group experience and social expectations (…having the last word means it could cause problems between friends), or alternatively close ties might make unilateral decisions more acceptable (I did occasionally get fed up from their music and put back my music again without even asking them for permission, you know we are all friends.). As noted in Section 4.1, song selection on the fly can be difficult because the chooser may not be familiar with the complete base collection, or because the base collection includes songs not suited to the current mood of the trip. A common strategy is to listen to the first few seconds of a song, and if it is unacceptable then to skip to the song that comes up ‘next’ in the CD / shuffle / predetermined playlist. This strategy provides a choppy listening experience, but does have the advantage of simplicity: a song is skipped if any one person in the car expresses an objection to it. It may, however, be embarrassing to ask for a change if one is not in current possession of the control device. Song-by-song selection is appropriate for shorter trips, as the setup time for a playlist may be longer than the journey itself. Suggesting and ordering songs can also be a part of the fun of the event and engage travelers socially (My friends would request any songs that they would like to hear, and the passenger in control of the iPod acted like a human playlist; trying to memorise the requests in order and playing them as each song finished.) For longer trips, a set of pre-created playlists or mixes (supporting the expected moods or phases of the journey) can create a smoother travel experience. A diverse set of playlists may be necessary to match the range of social mu- sic behaviors reported in Section 4.2. 
Even with careful pre-planning, however, a song may be rejected at time of play for personal, idiosyncratic reasons (for example, one participant skips particular songs … associated with particular memories and events so I don’t like to listen to them while driving for example). 4.5 Music Volume Sound volume is likely to change during a trip, signaling a change in the mood of the gathering, an alteration in the group focus, or to intensify / downplay the effects of a given song. Participant observations included the following reasons for altering sound levels: to focus group attention on a particular song (louder); for the group to sing along with a song (louder); to switch the focus of group activity from the music to conversation (softer); to ‘energize’ the mood of the group (louder); to calm the group mood, and particularly to permit passengers to sleep (softer); and to move the group focus from conversation back to the music, particularly when conversation falters (louder). Clearly the ability to modulate volume to fit to the current activity or mood is crucial. A finer control than is currently available would be desirable, as often speaker placement means perceived volume depends on one’s seat in the car ([he] asked the driver to turn the bass down … because the bass effect was too strong, and the driver … think[s] the bass is fine in the front). Further, the physical division of a car into separate rows of seats and its restriction of passenger movement can encourage separate activity ‘zones’ (for example, front seats / back seats)—and the appropriate volume for the music can differ between seating areas: One of our friends who sets beside the driver is paying more attentions on the music, the rest 3 of us set in the back were communicate a lot more, and didn’t paying too much attention on the music… the front people can hear the music a lot more clear then the people sets in the back, and it’s harder for the front people to join the communication with the back people because he need to turn his head around for the chat sometimes. 5. IMPLICATIONS FOR A SOCIAL AUDIO SYSTEM FOR CAR TRAVEL Leveraging upon music information retrieval capabilities, we now describe how our findings can inform the design of software specially targeted for song selection during car trips—personified, the software we seek in essence acts as a music host. In general a playlist generator [12] for song selection coupled with access to a distributed network of self-contained digital music libraries for storing, organizing, and retrieving items (the collections of songs the various people travelling have) are useful building blocks to developing such software; however, to achieve a digital music host, what is needed ultimately goes beyond this. 460 15th International Society for Music Information Retrieval Conference (ISMIR 2014) In broad terms, we envisage a software application with two phases: initial configuration and responsive adaptation. During configuration, the application gathers the pool of songs for the trip from the individuals’ devices, taking into account preferences such as which songs they wish to keep private and which types of songs (genre, artist, tempo, etc.) that they wish to have considered for the trip playlist. The users are then prompted to enter the approximate length of the upcoming road trip, and an initial playlist is constructed based on the user preferences and pool of songs. 
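A minimal sketch of what this configuration phase might look like in code is given below. The Song fields (owner, private, genre), the shared-pool filtering, and the fill-to-trip-length heuristic are illustrative assumptions, not part of the prototype described in this paper.

```python
import random
from dataclasses import dataclass

@dataclass
class Song:
    title: str
    artist: str
    genre: str
    duration_s: int   # track length in seconds
    owner: str        # whose device contributed the track
    private: bool     # owner excluded it from sharing

def pool_songs(devices, allowed_genres):
    """Gather the shared songs from every traveller's device,
    dropping private tracks and genres excluded for this trip."""
    pool = []
    for device in devices:            # device: list of Song objects
        for song in device:
            if song.private or song.genre not in allowed_genres:
                continue
            pool.append(song)
    return pool

def initial_playlist(pool, trip_minutes):
    """Build a first-cut playlist roughly matching the stated trip length."""
    random.shuffle(pool)
    playlist, total = [], 0
    for song in pool:
        if total >= trip_minutes * 60:
            break
        playlist.append(song)
        total += song.duration_s
    return playlist
```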
During the trip, the application can make use of a variety of inputs to dynamically adjust the sequence of songs played. Here significant gains can be made from inventive uses of MIR techniques coupled with temporal and spatial information–even data sensors from the car. For instance, if the application noticed the driver speeding for that section of road it could alter the selection of the next song to one that is quieter with a slower tempo (beat detection); alternatively, triggered by the detection of the conversation lapsing into silence (noise cancelling) the next song played could be altered to be one labeled with a higher “interest” value (tagged, for instance, using semantic web technologies, and captured in the playlist as metadata). News sourced from a radio signal (whichever is currently in range) can be interspersed with the songs being played. As evidenced by our analysis, the role of the driver/owner of the car takes on special significance in terms of the interface and interaction design. As the host of the vehicle, there is a perception that they are more closely linked to the software (the digital music host) that is making the decision over what to play next. While it is not a strict requirement of the software, for the majority of situations it will be an instinctive decision that the key audio device used to play the songs on the trip will be the one owned by the driver. For the adaptive phase of the software then, there is a certain irony that the driver (for reasons of driving safely) has less opportunity to influence the song selection during the trip. To address this imbalance, an aspect the software could support is the prioritization of input from the “master” application at noted times that are deemed safe (such when the car is stationary). More prosaically, the travellers will requires support in tweaking the playlist as the trip progresses. We developed and tested a prototype of this aspect of the system, to evaluate the design’s potential. The existing behaviors explored in Section 3 suggest that this system should be targeted at tablet devices rather than smaller mobiles: while the device should be lightweight enough to be easily passed between passengers in a vehicle, the users should be able to clearly see the screen details from an arm’s length, and controls should be large and spaced to minimize input error. Figure 1 presents screenshots for primary functionality of our prototype: the view of the trip playlist, which features the current song in context with the preceding and succeeding songs (Figure 1a); the lyrics display for the current song, sized to be viewable by all (Figure 1b); and a screen allowing selected songs to be easily inserted into different points in the playlist (Figure 1c). While it was tempting on a technical level to include mobile-based wireless voting (using their smart phones) to move the currently playing item up or down as an expression of like/dislike (relevance feedback), we recognize that face-to-face discussion and argument over songs is often a source of enjoyment and bonding for fellow travelers—and so we deliberately support only manual playlist manipulation. Figure 1a. Playlist view. Figure 1b. Lyrics view for the active song. Figure 1c. After searching for a song, ‘smart options’ for inserting the song into the current section of the playlist. Given the practical and safety difficulties in evaluating our prototype system in a moving car, we instead used a stationary simulation. 
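As a rough illustration of the adaptive phase described above, the selector below reacts to vehicle speed and cabin noise when choosing the next track. The sensor inputs, the thresholds, and the tempo_bpm / interest metadata fields are hypothetical; this is a sketch of the idea, not the prototype's implementation.

```python
def pick_next_song(queue, speed_kmh, speed_limit_kmh, cabin_noise_db):
    """Choose the next track from the pending queue with simple
    sensor-driven rules: calm the music when the driver is speeding,
    raise the 'interest' of the music when conversation has gone quiet.

    Each song is a dict with assumed keys:
    'title', 'tempo_bpm', and 'interest' (a 0-1 tag-derived score).
    """
    if not queue:
        return None
    if speed_kmh > speed_limit_kmh:
        # Quieter, slower-tempo track while the driver is speeding.
        return min(queue, key=lambda s: s["tempo_bpm"])
    if cabin_noise_db < 45:
        # Conversation has lapsed: prefer a high-"interest" track.
        return max(queue, key=lambda s: s["interest"])
    # Otherwise keep the pre-planned order.
    return queue[0]
```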
Two groups of four high school aged 461 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [7] S.J. Cunningham, D. Bainbridge, A. Falconer. More males participated in the evaluation, with each trial consisting of approximately 30 minutes in which they listened to songs on a pre-prepared playlist, both collaboratively and individually selected additional songs, inserted them into the playlist, and viewed lyrics to sing along. The researchers took manual notes of the simulations, and participants engaged in focus group discussions post-simulation. While the participants found the prototype to be generally usable (though usability tweaks were identified), we identified worrying episodes in which the drivers switched focus from the wheel to the tablet. While we recognize that behavior may be different in a simulation than in real driving conditions, we also saw strong evidence from the ethnographic data that drivers—particularly young, male drivers—can prioritize song selection over road safety. Further design iterations must recognize that drivers will inevitably seize control of a car’s music system, and so should prioritize design that supports fast, one-handed interactions. of an art than a science: playlist and mix construction. Proceedings of ISMIR ’06, Vancouver, 2006. [8] S.J. Cunningham, M. Jones: “Autoethnography: a tool for practice and education,” Proceedings of the 6th New Zealand International Conference on ComputerHuman Interaction (CHINZ 2005), 1-8, 2005. [9] S.J. Cunningham, M. Jones, S. Jones: “Organizing digital music for use: an examination of personal music collections”. Proceedings of ISMIR’04, Barcelona, 447-454, 2004. [10] B.H. Dalton, D.G. Behm: “Effects of noise and music on human and task performance: A systematic review,” Occupational Ergonomics, 7:3, 143-152, 2007. [11] N. Dibben, V.J. Williamson: “An exploratory survey of in-vehicle music listening,” Psychology of Music, 35: 4, 571-589, 2007. 6. CONCLUSIONS [12] A. Flexer, D. Schnitzer, M. Gasser, G. Widmer. “Playlist generation using start and end songs”, Proceedings of ISMIR’08, 173-178, 2008. The primary contribution of this paper is understanding of social music behavior of small groups of people while on ‘road trips’, developed through a qualitative analysis of ethnographic data (participant observations and interviews). We prototyped and evaluated the more prosaic aspects of a system to support social music listening on road trips, and suggest further extensions—including sensorbased input to modify the trip playlist—for future research. [13] B. Glaser, A. Strauss: The Discovery of Grounded Theory: Strategies for Qualitative Research, Chicago, 1967. [14] A. E. Greasley, A. Lamont: “Exploring engagement with music in everyday life using experience sampling methodology,” Musicae Scientiae, 15: 45, 45-71, 2011. 7. REFERENCES [15] M. Håkansson, M. Rost, L.E. Holmquist: “Gifts from friends and strangers: a study of mobile music sharing,” Proceedings of ECSCW’07, 311-330, 2007. [1] K.P. Åkesson, A. Nilsson: “Designing Leisure Applications for the Mundane Car-Commute,” Personal and Ubiquitous Computing, 6:3, 176–187, 2002. [16] C. Hasegawa, K. Oguri: “The effects of specific musical stimuli on driver’s drowsiness,” Proceedings of the Intelligent Transportation Systems Conference (ITSC’06), 817-822, 2006. [2] A. Bassoli, J. Moore, S. Agamanolis: “tunA: Socialising Music Sharing on the Move,” In K. O'Hara and B. 
Brown (eds.), Consuming Music Together: Social and Collaborative Aspects of Music Consumption Technologies. Springer, 151-172, 2007. [17] O. Juhlin: Social Media on the Road: The Future of Car Based Computing, Springer, London, 2010. [18] M. Östergren, O. Juhlin: “Car Drivers Using Sound Pryer – Joint Music Listening in Traffic Encounters,” In K. O'Hara and B. Brown (eds.), Consuming Music Together: Social and Collaborative Aspects of Music Consumption Technologies. Springer, 173-190, 2006. [3] T. Bickford: “Earbuds Are Good for Sharing: Children’s Sociable Uses of Headphones at a Vermont Primary School,” In J. Stanyek and S. Gopinath (eds.), The Handbook of Mobile Music Studies, Oxford University Press, 2011. [19] E.B. Slawinski, J.F. McNeil: (2002) “Age, Music, and Driving Performance: Detection of External Warning Sounds in Vehicles,” Psychomusicology, 18, 123-31, 2002. [4] M. Bull: “Soundscapes of the car: a critical study of automobile habitation,” In M. Bull and L. Back, (eds.) The Auditory Culture Reader, Berg, 357–374, 2003. [5] M. Bull: “Automobility and the power of sound”, Theory, Culture & Society, 21:4/5, 243–259, 2004. [20] Urry, J. “Inhabiting the car,” The Sociological Review, 54, 17–31, 2006. [6] M. Bull: “Investigating the culture of mobile listening: from Walkman to iPod,” In K. O'Hara and B. Brown (eds.), Consuming Music Together. Springer, 131–149, 2006. [21] M.J. Walsh: “Driving to the beat of one’s own hum: Automobility and musical listening,” In N. K. Denzin (ed.) Studies in Symbolic Interaction, 35, 201-221, 2010. 462 Poster Session 3 463 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 464 15th International Society for Music Information Retrieval Conference (ISMIR 2014) A COMBINED THEMATIC AND ACOUSTIC APPROACH FOR A MUSIC RECOMMENDATION SERVICE IN TV COMMERCIALS Mohamed Morchid, Richard Dufour, Georges Linarès LIA - University of Avignon (France) {mohamed.morchid, richard.dufour, georges.linares}@univ-avignon.fr ABSTRACT ber of musics. For these reasons, the need for an automatic song recommandation system, to illustrate advertisements, becomes a critical subject for companies. Most of modern advertisements contain a song to illustrate the commercial message. The success of a product, and its economic impact, can be directly linked to this choice. Finding the most appropriate song is usually made manually. Nonetheless, a single person is not able to listen and choose the best music among millions. The need for an automatic system for this particular task becomes increasingly critical. This paper describes the LIA music recommendation system for advertisements using both textual and acoustic features. This system aims at providing a song to a given commercial video and was evaluated in the context of the MediaEval 2013 Soundtrack task [14]. The goal of this task is to predict the most suitable soundtrack from a list of candidate songs, given a TV commercial. The organizers provide a development dataset including multimedia features. The initial assumption of the proposed system is that commercials which sell the same type of product, should also share the same music rhythm. A two-fold system is proposed: find commercials with close subjects in order to determine the mean rhythm of this subset, and then extract, from the candidate songs, the music which better corresponds to this mean rhythm. In this paper, an automatic system for songs recommandation is proposed. 
The proposed approach combines both textual (web pages) and audio (acoustic) features to select, among a large number of songs, the most appropriate and relevant music knowing the commercial content. The first step of the proposed system is to represent commercials into a thematic space built from a Latent Dirichlet Allocation (LDA) [4]. This pre-processing subtask uses the related textual content of the commercial. Then, acoustic features of each song are extracted to find a set of the most relevant songs for a given commercial. An appropriate benchmark is needed to evaluate the effectiveness of the proposed recommandation system. For these reasons, the proposed system is evaluated in the context of the challenging MediaEval 2013 Soundtrack task for commercials [10]. Indeed, the MusiClef task seeks to make this process automated by taking into account both context- and content-based information about the video, the brand, and the music. The main difficulty of this task is to find the set of relevant features that best describes the most appropriate song for a video. Next section describes related work in topic space modeling for information retrieval and music tasks. Section 3 presents the proposed music recommandation system using both textual content and acoustic features related to musics from commercials. Section 4 explains in details the unsupervised Latent Dirichlet Allocation (LDA) technique, while Section 4.2 describes how the acoustic features are used to evaluate the proximity of a music to a commercial. Finally, experiments are presented in Section 5, while Section 6 gives conclusions and perspectives. 1. INTRODUCTION The success of a product or a service essentially depends of the way to present it. Thus, companies pay much attention to choose the most appropriate advertisement that will make a difference in the customer choice. The advertisers have different media possibilities, such as journal paper, radio, TV or Internet. In this context, they can exploit the audio media (TV, radio...) to attract listeners using a song related to the commercial. The choice of an appropriate song is crucial and can have a significant economic impact [5,18]. Usually, this choice is made by a human expert. Nonetheless, while millions of musics exist, a human agent could only choose a song among a limited subset. This choice could then be inappropriate, or simply not the best one, since the agent could not search into a large num- 2. RELATED WORKS Latent Dirichlet Allocation (LDA) [4] is widely used in several tasks of information retrieval such as classification or keywords extraction. However, this unsupervised method is not much considered in the music processing tasks. Next sections describe related works using LDA techniques with text corpora (Section 2.1) and in the context of music tasks (Section 2.3). c Mohamed Morchid, Richard Dufour, Georges Linarès. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Mohamed Morchid, Richard Dufour, Georges Linarès. “A Combined Thematic and Acoustic Approach for a Music Recommendation Service in TV Commercials”, 15th International Society for Music Information Retrieval Conference, 2014. 465 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2.1 Topic modeling ment d of a corpus D, a first parameter θ is drawn according to a Dirichlet law of parameter α. A second parameter φ is drawn according to the same Dirichlet law of parameter β. 
Then, to generate each word w of the document d, a latent topic z is drawn from a multinomial distribution parameterized by θ, and, given this topic z, the word is drawn from a multinomial distribution with parameters φ. The parameter θ is drawn for every document from the same prior parameter α, which ties the documents together [4]. Several methods have been proposed by Information Retrieval (IR) researchers to build topic spaces, such as Latent Semantic Analysis/Indexing (LSA/LSI) [2, 6], which uses a singular value decomposition (SVD) to reduce the dimensionality of the space. This method was improved by [11], which proposed probabilistic LSA/LSI (pLSA/pLSI). The pLSI approach models each word in a document as a sample from a mixture model whose components are multinomial random variables that can be viewed as representations of topics. The method has demonstrated its performance on various tasks, such as sentence [3] or keyword [24] extraction. In spite of its effectiveness, pLSI has two main drawbacks. The distribution of topics in pLSI is indexed by the training documents, so the number of parameters grows with the size of the training set, and the model is therefore prone to overfitting, a major issue in IR tasks such as document clustering. A tempering heuristic can be used to smooth the pLSI parameters and obtain acceptable predictive performance; nonetheless, the authors of [20] showed that overfitting can occur even when tempering is used. As a result, IR researchers proposed Latent Dirichlet Allocation (LDA) [4] to overcome these two drawbacks: the number of LDA parameters does not grow with the size of the training corpus, and LDA is far less prone to overfitting. LDA is a generative model which treats a document, seen as a bag of words [21], as a mixture of latent topics. In contrast to a multinomial mixture model, LDA associates a topic with each word occurrence in the document rather than a single topic with the whole document, so a document can switch topics from one word to the next. The word occurrences are nevertheless connected by a latent variable which controls how closely the document follows its overall topic distribution. Each latent topic is characterized by a distribution over word probabilities. pLSI and LDA have been shown to generally outperform LSI on IR tasks [12]. Moreover, LDA provides a direct estimate of the relevance of a topic given a word set or a document, such as the web pages used in the proposed system.

2.2 Gibbs sampling

Several techniques have been proposed to estimate the LDA parameters, such as Variational Methods [4], Expectation-Propagation [17] or Gibbs Sampling [8]. Gibbs Sampling is a special case of Markov-chain Monte Carlo (MCMC) [7] and yields a simple algorithm for approximate inference in high-dimensional models such as LDA [9]. It sidesteps the difficulty of directly and exactly estimating the parameters that maximize the likelihood of the whole data collection $W = \{\vec{w}_m\}_{m=1}^{M}$, defined as

$P(W \mid \vec{\alpha}, \vec{\beta}) = \prod_{m=1}^{M} P(\vec{w}_m \mid \vec{\alpha}, \vec{\beta})$,

given the Dirichlet parameters $\vec{\alpha}$ and $\vec{\beta}$. The first use of Gibbs Sampling for estimating LDA is reported in [8], and a more comprehensive description of the method can be found in [9]; the reader may refer to these papers for a deeper understanding of this sampling technique.
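To make the estimation step concrete, here is a compact collapsed Gibbs sampler for LDA on a toy corpus. It is a didactic sketch rather than the MALLET implementation used later in the paper; `docs` is assumed to be a list of documents given as lists of integer word ids, and the resampling step uses the standard collapsed conditional p(z = k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ).

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): document-topic and topic-word distributions."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))      # document-topic counts
    n_kw = np.zeros((n_topics, vocab_size))     # topic-word counts
    n_k = np.zeros(n_topics)                    # tokens per topic
    z = []                                      # topic assignment of every token
    for d, doc in enumerate(docs):              # random initialisation
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                     # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                     # resample and restore the counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + vocab_size * beta)
    return theta, phi
```

Here theta corresponds to the per-document topic proportions P(z | d) and phi to the per-topic word distributions P(w | z) of the LDA formalism.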
2.3 Topic modeling and Music

Topic modeling has already been used in music processing. In [13], the authors presented a system which learns musical keys as key-profiles; their approach considers a song as a random mixture of key-profiles. In [25], the authors described a classification method to assign a label to an unseen piece of music: LDA is used to build a topic space from music tags in order to estimate the probability of every tag belonging to each music genre, and each piece is then labeled with a genre given its tags. The purpose of the approach proposed here is instead to find a set of relevant songs for a TV commercial.

3. PROPOSED APPROACH

The goal of the proposed automatic system is to recommend a set of songs given a TV commercial. The system uses external knowledge to find these songs: a set of TV commercials, each associated with a song and a set of web pages (see [14] for more details about the MediaEval 2013 Soundtrack task). The idea behind the proposed approach is to assume that two commercials sharing the same subjects or interests also share the same kind of songs. The main issue is then to find commercials in the external dataset whose subjects are close to those of the commercials in the test set. As described in Section 2.1, a document can be represented as a set of latent topics; thus, two documents sharing the same topics can be seen as thematically close.

Figure 1. The LDA formalism: for each document, topic proportions θ are drawn from a Dirichlet prior α; each word w is generated by first drawing a topic z from θ and then drawing w from the word distribution φ (itself drawn from a Dirichlet prior β) of that topic.

Figure 1 presents the LDA formalism. In the next sections, the topic space representation, the mapping of a commercial into this representation to evaluate both V^d and V^t, the computed similarity score, and finally the soundtrack prediction process for a TV commercial are described.

Figure 2. Global architecture of the proposed system: a test commercial C^1 and the development commercials {C^d, S^d} are mapped to topic vectors; cosine similarity selects the development commercials with the highest similarity to C^1; the mean rhythm pattern of their soundtracks is compared, again by cosine similarity, to the candidate songs S^t of the test set to return the 5 nearest soundtracks.

4. TOPIC REPRESENTATION OF A TV COMMERCIAL

Let us consider a corpus D from the development set of TV commercials with a word vocabulary V = {w_1, ..., w_N} of size N. A topic representation of corpus D is then built using Latent Dirichlet Allocation (LDA) [4]. At the end of the LDA analysis, a topic space m of n topics is obtained with, for each topic z, the probability of each word w of V given z, and, for the entire model m, the probability of each topic z given the model m. Each TV commercial from both the development and test sets is mapped into the topic space (see Figure 3) to obtain a vector representation (V^d and V^t) of the web pages related to the commercial, computed as follows:

$V^d[i](C_j^d) = P(z_i \mid C_j^d)$

where $P(z_i \mid C_j^d)$ is the probability that topic z_i is generated by the web pages of the commercial C_j^d, estimated using Gibbs sampling as described in Section 2.2. In the same way, V^t is estimated with the same topic space from the web pages of the test-set commercials C_j^t (see Figure 3).
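As a minimal illustration of this mapping, assuming per-token topic assignments for a commercial's web pages are available (for instance from the Gibbs sampler of Section 2.2), V^d can be estimated as the smoothed fraction of tokens assigned to each topic. The smoothing constant is an illustrative choice, not a value taken from the paper.

```python
import numpy as np

def topic_vector(token_topics, n_topics, alpha=0.1):
    """Map one commercial's web-page tokens to its topic-space vector.
    token_topics: iterable of topic ids, one per token, e.g. the final
    Gibbs-sampling assignments for the concatenated web pages."""
    counts = np.bincount(np.asarray(list(token_topics)), minlength=n_topics).astype(float)
    v = counts + alpha            # light Dirichlet smoothing
    return v / v.sum()            # V[i] ~ P(z_i | commercial)

# Example: 8 tokens distributed among 4 topics
v_d = topic_vector([0, 0, 1, 3, 3, 3, 2, 0], n_topics=4)
```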
Figure 3. Mapping of a TV commercial into the topic space: each topic z_1, ..., z_n is a distribution P(w | z) over the vocabulary, and the web pages of a commercial are summarized by the vector V^d = (V^d[1], ..., V^d[n]).

Basically, the proposed three-step system first maps each TV commercial from the test and development sets into a topic space learnt with the LDA algorithm. A TV commercial from the test set is then linked to the TV commercials of the development set that share a set of close topics. Moreover, each commercial of the development set is associated with a song; as a result, a commercial from the test set is related to a subset of songs from the development set considered thematically close to its textual content. The second step estimates a list of candidate songs (see Figure 2) using audio features from the subset of thematically close songs identified during the first step; this subset of songs is used to estimate the rhythm pattern of the ideal song for this commercial. The last step retrieves, from all candidate songs of the test set, the song closest to the rhythm pattern estimated during the previous step.

In detail, the development set D is composed of TV commercials C^d, each with a soundtrack S^d and a vector representation V^d for the d-th commercial. In the same manner, the test set T is composed of TV commercials C^t with, for the t-th one, a vector representation V^t and a soundtrack S^t to predict:

$D = \{C^d, V^d, S^d\}_{d=1,\ldots,D}$, $\quad T = \{C^t, V^t, S^t_k\}_{k=1,\ldots,5000;\; t=1,\ldots,T}$.

A similarity score $\{\alpha_{d,t}\}_{d=1,\ldots,D;\; t=1,\ldots,T}$ is then computed for each commercial C^d of the development set given a commercial C^t of the test set.

4.1 Similarity measure

Each commercial from both the development and test sets is mapped into the topic space to produce a vector representation, respectively V^d and V^t. Then, given a TV commercial C^1 from the test set T, a subset of TV commercials from the development set D is selected according to their thematic proximity to C^1. To estimate the similarity between C^1 and the commercials of the development set, the cosine metric α is used:

$\mathrm{cosine}(V^d, V^t) = \alpha_{d,t} = \frac{\sum_{i=1}^{n} V^d[i]\, V^t[i]}{\sqrt{\sum_{i=1}^{n} V^d[i]^2}\, \sqrt{\sum_{i=1}^{n} V^t[i]^2}}$   (3)

This metric is used to extract the subset of commercials from D that are thematically close to C^1.

The beat, key and harmonic-pattern features of each soundtrack are extracted with the Ircam software available at [1]; more information about feature extraction from songs is given in [14]. As an outcome, each commercial is represented by a rhythm pattern vector of size 58 (10 song features and 48 rhythm-pattern values). From the soundtracks of the l nearest commercials in D, a mean rhythm vector $\bar{S}$ is computed as

$\bar{S} = \frac{1}{l} \sum_{d \in l} S^d$.

Finally, the cosine measure between this mean rhythm $\bar{S}$ and each candidate soundtrack of the test set T, $\mathrm{cosine}(\bar{S}, S^t)$ for $t \in T$, is used to find, among all the candidates, the 5 songs with the closest rhythm pattern.
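The following sketch strings these pieces together: cosine similarity in topic space to select the l nearest development commercials, the mean rhythm vector of their soundtracks, and a final cosine ranking of the candidate songs. The array shapes and the value of l used here are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(v_test, V_dev, S_dev, S_candidates, l=10, top_n=5):
    """v_test: topic vector of the test commercial, shape (n,)
    V_dev: topic vectors of the development commercials, shape (D, n)
    S_dev: rhythm-pattern vectors of their soundtracks, shape (D, 58)
    S_candidates: rhythm-pattern vectors of the candidate songs, shape (K, 58)
    Returns the indices of the top_n candidate songs."""
    # 1) l development commercials closest in topic space (eq. 3)
    alphas = np.array([cosine(v_test, v_d) for v_d in V_dev])
    nearest = np.argsort(alphas)[::-1][:l]
    # 2) mean rhythm pattern of their soundtracks
    s_bar = S_dev[nearest].mean(axis=0)
    # 3) rank candidate songs by rhythm similarity to the mean pattern
    scores = np.array([cosine(s_bar, s_k) for s_k in S_candidates])
    return np.argsort(scores)[::-1][:top_n]
```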
4.2 Rhythm pattern The cosine measure, presented in previous section, is also used to evaluate the similarity between a mean rhythm pattern vector S d of a song, and all the candidate songs Skt of the test set. Rhythm pattern of a song <?xml version="1.0" ?> <rhythmdescription> <media>363445_sum.wav</media> <description> <bpm_mean>99.982723</bpm_mean> <bpm_std>0.047869</bpm_std> <meter>22.000000</meter> <perc>47.023527</perc> <perc_norm>1.910985</perc_norm> <complex>29.630575</complex> <complex_norm>0.652134</complex_norm> <speed>2.660229</speed> <speed_norm>1.201633</speed_norm> <periodicity>0.900763</periodicity> <rhythmpattern>0.124231 ... 0.098873</rhythmpattern> </description> </rhythmdescription> (a) 5. EXPERIMENTS AND RESULTS Rhythm pattern vector bmp_mean bmp_std meter perc perc_norm complex complex_norm speed speed_norm peiodicity { rhythmpattern_1 rhythmpattern_2 Previous sections described the proposed automatic music recommandation system for TV commercials. This system is decomposed into three sub-processes. The first one maps the commercials into a topic space to evaluate the proximity of a commercial from the test set and all commercials from the development set. Then, the mean rhythm pattern of the thematically close commercials is computed. Finally, this rhythm pattern is computed with all ones from the test set of candidate songs to find a set of relevant musics. ... rhythmpattern_48 5.1 Experimental protocol (b) The first step of the proposed approach, detailed in previous section, maps TV commercial textual content into a topic space of size n (n = 500). This one is learnt from a LDA in a large corpus of documents. Section 4 describes the corpus D of web pages. This corpus contains 10, 724 Web pages related to brands of the commercials contained in D. This corpus is composed of 44, 229, 747 words for a vocabulary of 4, 476, 153 unique words. More details about this text corpus, and the way to collect it, is explained into [14]. The first step of the proposed approach is to map each commercial textual content into a topic space learnt from a latent Dirichlet allocation (LDA). During the experiments, the MALLET tool is used [16] to perform a topic model. The proposed system is evaluated in the MediaEval 2013 MusiClef benchmark [14]. The aim of this task is to predict, for each video of the test set, the most suitable soundtrack from 5,000 candidate songs. The dataset is split into 3 sets. The development set contains multimodal information on 392 commercials (various metadata including Youtube uploader comments, audio features, video features, web pages and text features). The test set is a set of 55 videos to which a song should be associated using the recommandation set of 5,000 soundtracks (30 seconds long excerpts). Figure 4. Rhythm pattern of a song from the development set in xml (a) and vector (b) representations. In details, each commercial from D is related with a soundtrack that is represented with a rhythm pattern vector. The organizers provide for each song contained into the MusicClef 2013 dataset: • video features (MPEG-7 Motion Activity and Scalable Color Descriptor [15]), • web pages about the respective brands and music artists, • music features: − MFCC or BLF [22], − PS209 [19], − beat, key, harmonic pattern extracted with the Ircam software [1]. In our experiments, 10 rhythm features of songs are used (speed, percussion, . . . , periodicity) as shown in Figure 4. 
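A sketch of how a 58-dimensional representation could be assembled from a rhythm-description file like the one shown in Figure 4, using the standard-library XML parser. The tag names are taken from the figure; the exact layout of the real MusiClef files is an assumption here and may need adjusting.

```python
import xml.etree.ElementTree as ET

SCALAR_TAGS = ["bpm_mean", "bpm_std", "meter", "perc", "perc_norm",
               "complex", "complex_norm", "speed", "speed_norm", "periodicity"]

def rhythm_vector(xml_path):
    """Parse a rhythm-description XML file into a 10 + 48 = 58-dim vector."""
    desc = ET.parse(xml_path).getroot().find("description")
    scalars = [float(desc.find(tag).text) for tag in SCALAR_TAGS]
    pattern = [float(x) for x in desc.find("rhythmpattern").text.split()]
    assert len(pattern) == 48, "expected a 48-bin rhythm pattern"
    return scalars + pattern
```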
These features of beat, key or harmonic pattern are 468 15th International Society for Music Information Retrieval Conference (ISMIR 2014) and songs extraction (rhythm pattern estimation of the ideal songs for a commercial from the test set). Moreover, this promising approach, combining thematic representation of the textual content of a set of web pages describing a TV commercial and acoustic features, shows the relevance of topic-based representation in automatic recommandation using external resources (development set). The choice of a relevant song to describe the idea behind a commercial, is a challenging task when the framework does not take into account relevant features related to: 5.2 Experimental metrics For each video in the test set, a ranked list of 5 candidate songs should be proposed. The song prediction evaluation is manually performed using the Amazon Mechanical Turk platform. This novel task is non-trivial in terms of “ground truth”, that is why human ratings for evaluation are used. Three scores have been computed from our system output. Let V be the full collection of test set videos, and let sr (v) be the average suitability score for the audio file suggested at rank r for the video v. Then, the evaluation measures are computed as follows: • mood, such as harmonic content, harmonic progressions and timbre, • Average suitability score of the first-ranked song: |V | 1 s1 (vi ) V • music rhythm, such as musical style, texture, spectral centroid, or tempo. i=1 • Average suitability score for the full top-5: |V | 1 1 V 5 sr (vi ) The proposed automatic music recommendation system is limited by this small number (58) of features which not describe all music aspects. For these reasons, in future works, we plan to use others features, such as the song lyrics or the audio transcription of the TV commercials, and evaluate the effectiveness of the proposed hybrid framework into other information retrieval tasks such as classification of music genre or music clustering. i=1 • Weighted average suitability score of the full top5. Here, we apply a weighted harmonic mean score instead of an arithmetic mean: |V | 5r=1 sr (vi ) 1 V i=1 5 r=1 sr (vi ) r The previously presented measures are used to study both rating and ranking aspects of the results. 7. REFERENCES 5.3 Results [1] Ircam. analyse-synthse: Software. In http://anasynth.ircam.fr/home/software., Accessed: Sept. 2013. The measures defined in the previous section are used to evaluate the effectiveness of songs selected to be associated to TV commercials from the test set. The proposed topic space-based approach is evaluated in the same way, and obtained the results detailed thereafter: [2] J.R. Bellegarda. A latent semantic analysis framework for large-span language modeling. In Fifth European Conference on Speech Communication and Technology, 1997. [3] J.R. Bellegarda. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8):1279–1296, 2000. • First rank average score: 2.16 • Top 5 average score (arithmetic mean): 2.24 • Top 5 average score (harmonic mean, taking rank into account): 2.22 [4] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003. Considering that human judges rate the predicted songs from 1 (very poor) to 4 (very well), we can consider that our system is slightly better than the mean evaluation score (2) no matter the metric considered. 
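One way to read the three evaluation measures (a plain average of the rank-1 suitability ratings, an arithmetic mean over the full top-5, and a per-video rank-discounted ratio) is sketched below. This is a hedged reconstruction, not the organizers' official scoring code; `scores[v]` is an assumed list [s_1(v), ..., s_5(v)] of human ratings for video v.

```python
def first_rank_avg(scores):
    """Average suitability of the first-ranked song over all test videos.
    scores: list of per-video lists [s_1(v), ..., s_5(v)]."""
    return sum(s[0] for s in scores) / len(scores)

def top5_avg(scores):
    """Arithmetic mean of all top-5 suitability scores."""
    return sum(sum(s[:5]) for s in scores) / (5 * len(scores))

def top5_weighted(scores):
    """Rank-weighted variant: for each video, the ratio of the plain sum
    of scores to the rank-discounted sum sum_r s_r / r, averaged over
    videos (one plausible reading of the measure described in the text)."""
    total = 0.0
    for s in scores:
        total += sum(s[:5]) / sum(sr / r for r, sr in enumerate(s[:5], start=1))
    return total / len(scores)
```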
While the system proposed in [23] is clearly different from ours, results are very similar. This shows the difficulty to build an automatic song recommendation system for TV commercials, the evaluation being also a critical point to discuss. [5] Claudia Bullerjahn. The effectiveness of music in television commercials. Food Preferences and Taste: Continuity and Change, 2:207, 1997. [6] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990. [7] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984. 6. CONCLUSIONS AND PERSPECTIVES In this paper, an automatic system to assign a soundtrack to a TV commercial has been proposed. This system combines two media: textual commercial content and audio rhythm pattern. The proposed approach obtains good results in spite of the fact that the system is automatic and unsupervised. Indeed, both subtasks are unsupervised (LDA learning and commercials mapping into the topic space) [8] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004. 469 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [9] Gregor Heinrich. Parameter estimation for text analysis. Web: http://www. arbylon. net/publications/textest. pdf, 2005. [10] Nina Hoeberichts. Music and advertising: The effect of music in television commercials on consumer attitudes. Bachelor Thesis, 2012. [11] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI ’ 99, page 21. Citeseer, 1999. [12] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, 2001. [13] Diane Hu and Lawrence K Saul. A probabilistic topic model for unsupervised learning of musical keyprofiles. In ISMIR, pages 441–446, 2009. [23] Han Su, Fang-Fei Kuo, Chu-Hsiang Chiu, Yen-Ju Chou, and Man-Kwan Shan. Mediaeval 2013: Soundtrack selection for commercials based on content correlation modeling. In MediaEval 2013, volume 1043 of CEUR Workshop Proceedings. CEUR-WS.org, 2013. [24] Y. Suzuki, F. Fukumoto, and Y. Sekiguchi. Keyword extraction using term-domain interdependence for dictation of radio news. In 17th international conference on Computational linguistics, volume 2, pages 1272– 1276. ACL, 1998. [25] Chao Zhen and Jieping Xu. Multi-modal music genre classification approach. In Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on, volume 8, pages 398–402. IEEE, 2010. [14] Cynthia C. S. Liem, Nicola Orio, Geoffroy Peeters, and Markus Scheld. MusiClef 2013: Soundtrack Selection for Commercials. In MediaEval, 2013. [15] Bangalore S Manjunath, Philippe Salembier, and Thomas Sikora. Introduction to MPEG-7: multimedia content description interface, volume 1. John Wiley & Sons, 2002. [16] Andrew Kachites McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002. [17] Thomas Minka and John Lafferty. Expectationpropagation for the generative aspect model. In Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pages 352–359. Morgan Kaufmann Publishers Inc., 2002. [18] C Whan Park and S Mark Young. 
Consumer response to television commercials: The impact of involvement and background music on brand attitude formation. Journal of Marketing Research, pages 11–24, 1986. [19] Tim Pohle, Dominik Schnitzer, Markus Schedl, Peter Knees, and Gerhard Widmer. On rhythm and general music similarity. In ISMIR, pages 525–530, 2009. [20] Alexandrin Popescul, David M Pennock, and Steve Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 437–444. Morgan Kaufmann Publishers Inc., 2001. [21] G. Salton. Automatic text processing: the transformation. Analysis and Retrieval of Information by Computer, 1989. [22] Klaus Seyerlehner, Gerhard Widmer, and Tim Pohle. Fusing block-level features for music similarity estimation. In Proc. of the 13th Int. Conference on Digital Audio Effects (DAFx-10), pages 225–232, 2010. 470 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ARE POETRY AND LYRICS ALL THAT DIFFERENT? Abhishek Singhi Daniel G. Brown University of Waterloo Cheriton School of Computer Science {asinghi,dan.brown}@uwaterloo.ca and the word. The list of relevant synonyms obtained after pruning was used to obtain the probability distribution over words. A key requirement of our study is that there exists a difference, albeit a hazy one, between poetry and lyrics. Poetry attracts a more educated and sensitive audience while lyrics are written for the masses. Poetry, unlike lyrics, is often structurally more constrained, adhering to a particular meter and style. Lyrics are often written keeping the music in mind while poetry is written against a silent background. Lyrics, unlike poetry, often repeat lines and segments, causing us to believe that lyricists tend to pick more rhymable adjectives; of course, some poetic forms also repeat lines, such as the villanelle. For twenty different concepts we compare adjectives which are more likely to be used in lyrics rather than poetry and vice versa. ABSTRACT We hypothesize that different genres of writing use different adjectives for the same concept. We test our hypothesis on lyrics, articles and poetry. We use the English Wikipedia and over 13,000 news articles from four leading newspapers for the article data set. Our lyrics data set consists of lyrics of more than 10,000 songs by 56 popular English singers, and our poetry dataset is made up of more than 20,000 poems from 60 famous poets. We find the probability distribution of synonymous adjectives in all the three different categories and use it to predict if a document is an article, lyrics or poetry given its set of adjectives. We achieve an accuracy level of 67% for lyrics, 80% for articles and 57% for poetry. Using these probability distribution we show that adjectives more likely to be used in lyrics are more rhymable than those more likely to be used in poetry, but they do not differ significantly in their semantic orientations. Furthermore we show that our algorithm is successfully able to detect poetic lyricists like Bob Dylan from non-poetic ones like Bryan Adams, as their lyrics are more often misclassified as poetry. 1. INTRODUCTION The choice of a particular word, from a set of words that can instead be used, depends on the context we use it in, and on the artistic decision of the authors. 
We believe that for a given concept, the words that are more likely to be used in lyrics will be different from the ones which are more likely to be used in articles or poems, because lyricists have different objectives typically. We test our hypothesis on adjective usage in these categories of documents. We use adjectives, as a majority have synonyms that can be used depending on context. To our surprise, just the adjective usage is sufficient to separate documents quite effectively. Finding the synonyms of a word is still an open problem. We used three different sources to obtain synonyms for a word – the WordNet, Wikipedia and an online thesaurus. We prune synonyms, obtained from the three sources, which fall below an experimentally determined threshold for the semantic distance between the synonyms Figure 1. The bold-faced words are the adjectives our algorithm takes into account while classifying a document, which in this case in a snippet of lyrics by the Backstreet Boys. We use a bag of words model for the adjectives, where we do not care about their relative positions in the text, but only their frequencies. Finding synonyms of a given word is a vital step in our approach and since it is still considered a difficult task improvement in synonyms finding approaches will lead to an improvement in our classification accuracy. Our algorithm has a linear run time as it scans through the document once to come up with the prediction, giving us an accuracy of 68% overall. Lyricists with a relatively high percentage of lyrics misclassified as poetry tend to be recognized for their poetic style, such as Bob Dylan and Annie Lennox. 2. RELATED WORK © Abhishek Singhi, Daniel G. Brown. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Abhishek Singhi, Daniel G. Brown. “Are Poetry And Lyrics All That Different?”, 15th International Society for Music Information Retrieval Conference, 2014. We do not know of any work on the classification of documents based on the adjective usage into lyrics, poetry or articles nor are we aware of any computational 471 15th International Society for Music Information Retrieval Conference (ISMIR 2014) edited by experts. Both of these are extremely rich sources of data on many topics. To remove the influence of the presence of articles about poems and lyrics in Wikipedia we set the pruning threshold frequency of adjectives to a high value, and we ensured that the articles were not about poetry or music. work which discerns poetic from non-poetic lyricists. Previous works have used adjectives for various purposes like sentiment analysis [1]. Furthermore in Music Information Retrieval, work on poetry has focused on poetry translator, automatic poetry generation. Chesley et al. [1] classifies blog posts according to sentiment using verb classes and adjective polarity, achieving accuracy levels of 72.4% on objective posts, 84.2% for positive posts, and 80.3% for negative posts. Entwisle et al. [2] analyzes the free verbal productions of ninth-grade males and females and conclude that girls use more adjectives than boys but fail to reveal differential use of qualifiers by social class. Smith et al. [13] use of tf-idf weighting to find typical phrases and rhyme pairs in song lyrics and conclude that the typical number one hits, on average, are more clichéd. Nichols et al. 
[14] studies the relationship between lyrics and melody on a large symbolic database of popular music and conclude that songwriters tend to align salient notes with salient lyrics. There is some existing work on automatic generation of synonyms. Zhou et al. [3] extracts synonyms using three sources - a monolingual dictionary, a bilingual corpus and a monolingual corpus, and use a weighted ensemble to combine the synonyms produced from the three sources. They get improved results when compared to the manually built thesauri, WordNet and Roget. Christian et al. [4] describe an approach for using Wikipedia to automatically build a dictionary of named entities and their synonyms. They were able to extract a large amount of entities with a high precision, and the synonyms found were mostly relevant, but in some cases the number of synonyms was very high. Niemi et al. [5] add new synonyms to the existing synsets of the Finnish WordNet using Wikipedia’s links between the articles of the same topic in Finnish and English. As to computational poetry, Jiang et al. [6] use statistical machine translation to generate Chinese couplets while Genzel et al. [7] use statistical machine translation to translate poetry keeping the rhyme and meter constraints. 3.2 Lyrics We took more than 10,000 lyrics from 56 very popular English singers. Both the authors listen to English music and hence it was easy to come up with a list which included singers from many popular genres with diverse backgrounds. We focus on English-language popular music in our study, because it is the closest to “universally” popular music, due to the strength of the music industry in English-speaking countries. We do not know if our work would generalize to non-English Language songs. Our data set includes lyrics from the US, Canada, UK and Ireland. 3.3 Poetry We took more than 20,000 poems from more than 60 famous poets, like Robert Frost, William Blake and John Keats, over the last three hundred years. We selected the top poets from Poem Hunter [19]. We selected a wide time range for the poets, as many of the most famous English poets are from that time period. None of the poetry selected were translations from another language. Most of the poets in our dataset are poets from North America and Europe. We believe that our training data, is representative of the mean, as a majority of poetry and poetic style are inspired by the work of these few extremely famous poets. 3.4 Test Data For the purpose of document classification we took 100 from each category, ensuring that they were not present in the training set. While collecting the test data we ensured the diversity, the lyrics and poets came from different genres and artists and the articles covered different topics and were selected from different newspapers. To determine poetic lyricists from non-poetic ones we took eight of each of the two types of lyricists, none of whom were present in our lyrics data sets. We ensured that the poetic lyricists we selected were indeed poetic by looking up popular news articles or ensuring that they were poet along with being lyricists. Our list for poetic lyricists included Bob Dylan and Annie Lennox etc. while the non-poetic ones included Bryan Adams and Michael Jackson. 3. DATA SET The training set consists of articles, lyrics and poetry and is used to calculate the probability distribution of adjectives in the three different types of documents. 
We use these probability distributions in our document classification algorithms, to identify poetic from non-poetic lyricists and to determine adjectives more likely to be used in lyrics rather than poetry and vice versa. 3.1 Articles 4. METHOD We take the English Wikipedia and over 13,000 news articles from four major newspapers as our article data set. Wikipedia, an enormous and freely available data set is These are the main steps in our method: 472 15th International Society for Music Information Retrieval Conference (ISMIR 2014) above, and the document(s) to be classified, calculates the score of the document being an article, lyrics or poetry, and labels it with the class with the highest score. The algorithm takes a single pass along the whole document and identifies adjectives using WordNet. For each word in the document we check its presence in our word list. If found, we add the probability to the score, with a special penalty of -1 for adjectives never found in the training set and a special bonus of +1 for words with probability 1. The penalty and boosting values used in the algorithm were determined experimentally. Surprisingly, this simple approach gives us much better accuracy rates than Naïve Bayes, which we thought would be a good option since it is widely used in classification tasks like spam filtering. We have decent accuracy rates with this simple, naïve algorithm; one future task could be to come up with a better classifier. 1) Finding the synonyms of all the words in the training data set. 2) Finding the probability distribution of word for all the three types of documents. 3) The document classification algorithm. 4.1 Extracting Synonyms We extract the synonyms for a term from three sources: WordNet, Wikipedia and an online thesaurus. WordNet is a large lexical database of English where words are grouped into sets of cognitive synonyms (synsets) together based on their meanings. WordNet interlinks not just word forms but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. The synonyms returned by WordNet need some pruning. We use Wikipedia redirects to discover terms that are mostly synonymous. It returns a large number of words, which might not be synonyms, so we need to prune the results. This method has been widely used for obtaining the synonyms of named entities e.g. [4], but we get decent results for adjectives too. We also used an online Thesaurus that lists words grouped together according to similarity of meaning. Though it gives very accurate synonyms, pruning is necessary to get better results. We prune synonyms obtained from the three sources, which fall below an experimentally determined threshold for the semantic distance between the synonyms and the word. To calculate the semantic similarity distance between words we use the method described by Pirro et al. [8]. Extracting synonyms for a given word is an open problem and with improvement in this area our algorithm will achieve better classification accuracy levels. 5. RESULTS First, we look at the classification accuracies between lyrics, articles and poems obtained by our classifier. We show that the adjectives used in lyrics are much more rhymable than the ones used in poems but they do not differ significantly in their semantic orientations. Furthermore, our algorithm is able to identify poetic lyricists from non-poetic ones using the word distributions, calculated in earlier section. 
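A minimal sketch of the scoring rule just described, under assumptions about the ambiguous details (whether the +1 bonus adds to or replaces the probability, and whether "never found in training" is judged globally or per class, both treated per class here); the adjective extraction with WordNet is assumed to have been done already.

```python
def classify(adjectives, class_probs):
    """adjectives: list of adjectives found in the document.
    class_probs: dict mapping a class name ('lyrics', 'article', 'poetry')
    to a dict {adjective: P(adjective | its synonym group, class)}.
    Returns the class with the highest score."""
    scores = {}
    for label, probs in class_probs.items():
        score = 0.0
        for adj in adjectives:
            p = probs.get(adj)
            if p is None:
                score -= 1.0        # penalty for an adjective unseen in training
            elif p == 1.0:
                score += p + 1.0    # stated +1 bonus, read here as added on top
            else:
                score += p
        scores[label] = score
    return max(scores, key=scores.get)
```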
We also compare adjectives for a given concepts which are more likely to be used in lyrics rather than poetry and vice versa. 5.1 Document Classification Our test set consists of the text of 100 each of our three categories. Using our algorithm with the adjective distributions we get an accuracy of 67% for lyrics, 80% for articles and 57% for poems. The confusion matrix, Table 1 we find the best accuracy for articles. This might be because of the enormous size of the article training set which consisted of all English Wikipedia articles. A slightly more number of articles get misclassified as lyrics than poetry. Surprisingly, a large number of misclassified poems get classified as articles rather than poetry, but most misclassified lyrics get classified as poems. 4.2 Probability Distribution We believe that the choice of an adjective to express a given concept depends on the genre of writing: adjectives used in lyrics will be different from ones used in poems or in articles. We calculate the probability of a specific adjective for each of the three document types. First, WordNet is used to identify the adjectives in our training sets. For each adjective we compute the frequency of that were in the training set and the frequency of it and its synonyms; the ratio of these is the frequency with which that adjective represents its synonym group in that class of writing. We exclude adjectives that occur infrequently (fewer than 5 times in our lyrics/poetry set or 50 in articles). The enormous size of the Wikipedia justifies the high threshold value. 5.2 Adjective Usage in Lyrics versus Poems Poetry is written against a silent background while lyrics are often written keeping the melody, rhythm, instrumentation, the quality of the singer’s voice and other qualities of the recording in mind. Furthermore, unlike most poetry, lyrics include repeated lines. This led us to believe the adjectives which were more likely to be used in lyrics rather than poetry would be more rhymable. We counted the number of words an adjective in our lyrics and poetry list rhymes with from the website rhymezone.com. The values are tabulated in Table 2. 4.3 Document classification algorithm We use a simple linear time algorithm which takes as input the probability distributions for adjectives, calculated 473 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Our algorithm consistently misclassifies a large fraction of the lyrics of such poetic lyricists as poetry while the percentage of misclassified lyrics as poetry for the non-poetic lyricists is significantly much less. These values for poetic and non-poetic lyricists are tabulated in table 4 and table 5 respectively. Poetic Lyricists % of lyrics misclassified as poetry Bob Dylan 42% Ed Sheeran 50% Ani Di Franco 29% Annie Lennox 32% Bill Callahan 34% Bruce Springsteen 29% Stephen Sondheim 40% Morrissey 29% Average misclassification 36% rate From the values in Table 2, we can clearly see that the adjectives which are more likely to be used in lyrics to be much more rhymable than the adjectives which are more likely to be used in poetry. Actual Lyrics Articles Poems Lyrics 67 11 10 Predicted Articles 11 80 33 Poems 22 6 57 Table 1. The confusion matrix for document classification. Many lyrics are categorized as poems, and many poems as articles. Mean Median 25th percentile 75th percentile Lyrics 33.2 11 2 38 Poetry 22.9 5 0 24 Table 4. Percentage of misclassified lyrics as poetry for poetic lyricists. Table 2. 
Statistical values for the number of words an adjective rhymes with. Mean Median 25th percentile 75th percentile Lyrics -.05 0.0 -0.27 0.13 Non-Poetic Lyricists Poetry -.053 0.0 -0.27 0.13 Bryan Adams Michael Jackson Drake Backstreet Boys Radiohead Stevie Wonder Led Zeppelin Kesha Average misclassification rate Table 3. Statistical values for the semantic orientation of adjectives used in lyrics and poetry. We were also interested in finding if the adjectives used in lyrics and poetry differed significantly in their semantic orientations. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity. We calculated the semantic orientations, which take a value between -1 and +1, using SentiWordNet, of all the adjectives in the lyrics and poetry list, the values are in Table 3. They show no difference between adjectives in poetry and those in lyrics. % of lyrics misclassified as poetry 14% 22% 7% 23% 26% 17% 8% 18% 17% Table 5. Percentage of misclassified lyrics as poetry for non-poetic lyricists. From the values in table 4 and 5 we see that there is a clear separation between the misclassification rate between poetic and non-poetic lyricists. The maximum misclassification rate for the non-poetic lyricists i.e. 26% is less than the minimum mis-classification rate for poetic lyricists i.e. 29%. Furthermore the difference in average misclassification rate between the two groups of lyricists is 19%. Hence our simple algorithm can accurately identify poetic lyricists from non-poetic ones, based only on adjective usage. 5.3 Poetic vs non-Poetic Lyricists There are lyricists like Bob Dylan [15], Ani DiFranco [16], and Stephen Sondheim [17,18], whose lyrics are considered to be poetic, or indeed, who are published poets in some cases. The lyrics of such poetic lyricists possibly could be structurally more constrained than a majority of the lyrics or might adhere to a particular meter and style. While selecting the poetic lyricists we ensured that popular articles supported our claim or by going to their Wikipedia page and ensuring that they were poets along with being lyricists and hence the influence of their poetry on lyrics. 5.4 Concept representation in Lyrics vs Poetry We compare adjective uses for common concepts. To represent physical beauty we are more likely to use words like “sexy” and “hot” in lyrics but “gorgeous” and “handsome” in poetry. For 20 of these, results are tabulated in Table 6. The difference could possibly be because unlike lyrics, which are written for the masses, poetry is generally written for people who are interested in literature. It 474 15th International Society for Music Information Retrieval Conference (ISMIR 2014) has been shown that the typical number one hits, on average, are more clichéd [13]. Lyrics Poetry proud, arrogant, cocky haughty, imperious sexy, hot, beautiful, cute gorgeous, handsome merry, ecstatic, elated happy, blissful, joyous heartbroken, brokenhearted sad, sorrowful, dismal real genuine smart wise, intelligent bad, shady lousy, immoral, dishonest mad, outrageous wrathful, furious royal noble, aristocratic, regal pissed angry, bitter greedy selfish cheesy poor, worthless lethal, dangerous, fatal mortal, harmful, destructive afraid, nervous frightened, cowardly, timid jealous envious, covetous lax, sloppy lenient, indifferent weak, fragile feeble, powerless black ebon naïve, ignorant innocent, guileless, callow corny dull, stale 7. 
CONCLUSION Our key finding is that the choice of synonym for even a small number of adjectives are sufficient to reliably identify genre of documents. In accordance with our hypothesis, we show that there exist differences in the kind of adjectives used in different genres of writing. We calculate the probability distribution of adjectives over the three kinds of documents and using this distribution and a simple algorithm we are able to distinguish among lyrics, poetry and article with an accuracy of 67%, 57% and 80% respectively. Adjectives likely to be used in lyrics are more rhymable than the ones used in poetry. This might be because lyrics are written keeping in mind the melody, rhythm, instrumentation, quality of the singer’s voice and other qualities of the recording while poetry is without such concerns. There is no significant difference in the semantic orientation of adjectives which are more likely to be used in lyrics and those which are more likely to be used in poetry. Using the probability distributions, obtained from training data, we present adjectives more likely to be used in lyrics rather than poetry and vice versa for twenty common concepts. Using the probability distributions and our algorithm we show that we can discern poetic lyricists from nonpoetic ones. Our algorithm consistently misclassifies a majority of the lyrics of such poetic lyricists as poetry while the percentage of misclassified lyrics as poetry for the non-poetic lyricists is significantly much less. Calculating the probability distribution of adjectives over the various document types is a vital step in our method which in turn depends on the synonyms extracted for an adjective. Synonym extraction is still an open problem and with improvements in it our algorithm will give better accuracy levels. We extract synonyms from three different sources – Wikipeia, WordNet and an online Thesaurus, and prune the results based on the semantic similarity between the adjectives and the obtained synonyms. We use a simple naïve algorithm, which gives us better result than Naïve Bayes. An extension to the work can be coming up with an improved version of the algorithm with better accuracy levels. Future works can use a larger dataset for lyrics and poetry (we have an enormous dataset for articles) to come up with better probability distribution for the two document types or to identify parts of speech that effectively separates genres of writing. Our work here can be extended to different genres of writings like prose, fiction etc. to analyze the adjective usage in those writings. It would be interesting to do similar work for verbs and discern if different words, representing the same action, are used in different genres of writings. Table 6. For twenty different concepts, we compare adjectives which are more likely to be used in lyrics rather than poetry and vice versa. 6. APPLICATIONS The algorithm developed has many practical applications in Music Information Retrieval (MIR). They could be used for automatic poetry/lyrics generation to identify adjectives more likely to be used in a particular type of document. As we have shown we can analyze documents, analyze how lyrical, poetic or article-like a document is. For lyricists or poets we can come up with alternate better adjectives to make a document fit its genre better. 
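To make the pipeline of Sections 4.2 and 4.3 concrete, the sketch below shows one way the per-class adjective distributions and a simple linear-time genre scorer could be implemented. It is a minimal illustration under stated assumptions, not the authors' code: the data structures (adj_counts, synonym_sets), the log-probability scoring rule, and the extract_adjectives helper mentioned in the comment are hypothetical choices made for exposition.

```python
# Hypothetical sketch of the adjective-distribution genre scorer (Sections 4.2-4.3).
import math

def adjective_distribution(adj_counts, synonym_sets, min_count=5):
    """adj_counts: dict of adjective frequencies in one training corpus
    (lyrics, poetry, or articles); synonym_sets: dict mapping each adjective
    to its synonym group (e.g. from WordNet).  Returns, per adjective, the
    share of its synonym group that this adjective accounts for."""
    dist = {}
    for adj, count in adj_counts.items():
        if count < min_count:                  # prune rare adjectives (5 / 50 thresholds in the text)
            continue
        group = set(synonym_sets.get(adj, set())) | {adj}
        group_count = sum(adj_counts.get(a, 0) for a in group)
        if group_count:
            dist[adj] = count / group_count
    return dist

def score_document(doc_adjectives, class_dists, eps=1e-6):
    """Linear-time scoring: accumulate log-probabilities of the document's
    adjectives under each class distribution; return the best-scoring class."""
    scores = {}
    for label, dist in class_dists.items():
        scores[label] = sum(math.log(dist.get(a, eps)) for a in doc_adjectives)
    return max(scores, key=scores.get)

# e.g. class_dists = {"lyrics": lyr_dist, "poetry": poe_dist, "articles": art_dist}
# label = score_document(extract_adjectives(text), class_dists)   # extract_adjectives is assumed
```

The same per-class scores could also serve the suggestion above of rating how lyrical, poetic, or article-like a draft is, by reporting the scores themselves rather than only the winning label.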
Using the word distributions we can come up with a better measure of distance between documents where the weights are assigned to a word depending on its probability of usage in a particular type of document. And, of course, our work here can be extended to different genres of writings like prose or fiction. 8. ACKNOWLEDGEMENTS Our research is supported by a grant from the Natural Sciences and Engineering Research Council of Canada to DGB. 475 15th International Society for Music Information Retrieval Conference (ISMIR 2014) sources and Evaluation (LREC ‘06), pages 417–422, 2006. 9. REFERENCES [1] P. Chesley, B. Vincent., L. Xu, and R. Srihari, “Using Verbs and Adjectives to Automatically Classify Blog Sentiment”, Training, volume 580, number 263, pages 233, 2006. [13] A.G. Smith, C. X. S. Zee and A. L. Uitdenbogerd, “In your eyes: Identifying cliché in song lyrics”, in Proceedings of the Australasian Language Technology Association Workshop, pages 88–96, 2012. [2] D.R. Entwisle and C. Garvey, “Verbal productivity and adjective usage”, Language and Speech, volume 15, number 3, pages 288-298, 1972. [14] E. Nichols, D. Morris, S. Basu, S. Christopher, “Relationships between lyrics and melody in popular music”, in Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR ’09), 2009. [3] H. Wu and M. Zhou, “Optimizing Synonym Extraction Using Monolingual and Bilingual Resources”, in Proceedings of the 2nd International Workshop on Paraphrasing, volume 16, pages 72-79, 2003. [15] K. Negus, Bob Dylan, Equinox London, 2008. [16] A. DiFranco, Verses, Seven Stories, 2007. [4] C. Bohn and K. Norvag, “Extracting Named Entities and Synonyms from Wikipedia”, in Proceedings of 24th IEEE International Conference on Advanced Information Networking and Applications, (AINA ‘10), pages 1300– 1307. [17] S. Sondheim, Look, I Made a Hat! New York: Knopf, 2011. [18] S. Sondheim, Finishing the Hat, New York: Knopf, 2010. [5] J. Niemi, K. Linden and M. Hyvarinen, “Using a bilingual resource to add synonyms to a wordnet:FinnWordNet and Wikipedia as an example”, in Proceedings of the Global WordNet Association , pages 227– 231, 2012. [19] http://www.poemhunter.com. [6] L. Jiang and M. Zhou, “Generating Chinese couplets using a statistical MT approach”, in Proceedings of the 22nd International Conference on Computational Linguistics, pages 377–384, 2008. [7] D. Genzel, J. Uszkoreit and F. Och, “Poetic statistical machine translation: rhyme and meter”, in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 158–166, 2010. [8] G. Pirro and J. Euzenat, “A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness”, in Proceedings of the 9th International Semantic Web Conference (ISWC ‘10), pages 615-630, 2010. [9] G. Miller, “WordNet: A Lexical Database for English”, Communications of the ACM, volume 38, number 11, pages 39-41, 1995. [10] G. Miller and F. Christiane, WordNet: An Electronic Lexical Database, 1998. [11] S. Baccianella, A. Esuli, and F. Sebastiani, “SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining”, in Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC ‘10) , pages 2200–2204, 2010. [12] A. Esuli and F. 
Sebastiani, “SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining”, in Proceedings of the 5th Conference on Language Re- 476 15th International Society for Music Information Retrieval Conference (ISMIR 2014) SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING DEEP RECURRENT NEURAL NETWORKS Po-Sen Huang† , Minje Kim‡ , Mark Hasegawa-Johnson† , Paris Smaragdis†‡§ † Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA ‡ Department of Computer Science, University of Illinois at Urbana-Champaign, USA § Adobe Research, USA {huang146, minje, jhasegaw, paris}@illinois.edu ABSTRACT Joint Discriminative Training Monaural source separation is important for many real world applications. It is challenging since only single channel information is available. In this paper, we explore using deep recurrent neural networks for singing voice separation from monaural recordings in a supervised setting. Deep recurrent neural networks with different temporal connections are explored. We propose jointly optimizing the networks for multiple source signals by including the separation step as a nonlinear operation in the last layer. Different discriminative training objectives are further explored to enhance the source to interference ratio. Our proposed system achieves the state-of-the-art performance, 2.30∼2.48 dB GNSDR gain and 4.32∼5.42 dB GSIR gain compared to previous models, on the MIR-1K dataset. 1. INTRODUCTION Monaural source separation is important for several realworld applications. For example, the accuracy of automatic speech recognition (ASR) can be improved by separating noise from speech signals [10]. The accuracy of chord recognition and pitch estimation can be improved by separating singing voice from music [7]. However, current state-of-the-art results are still far behind human capability. The problem of monaural source separation is even more challenging since only single channel information is available. In this paper, we focus on singing voice separation from monaural recordings. Recently, several approaches have been proposed to utilize the assumption of the low rank and sparsity of the music and speech signals, respectively [7, 13, 16, 17]. However, this strong assumption may not always be true. For example, the drum sounds may lie in the sparse subspace instead of being low rank. In addition, all these models can be viewed as linear transformations in the spectral domain. Mixture Signal STFT Magnitude Spectra Phase Spectra Evaluation ISTFT DNN/DRNN Time Frequency Masking Discriminative Training Estimated Magnitude Spectra Figure 1. Proposed framework. With the recent development of deep learning, without imposing additional constraints, we can further extend the model expressibility by using multiple nonlinear layers and learn the optimal hidden representations from data. In this paper, we explore the use of deep recurrent neural networks for singing voice separation from monaural recordings in a supervised setting. We explore different deep recurrent neural network architectures along with the joint optimization of the network and a soft masking function. Moreover, different training objectives are explored to optimize the networks. The proposed framework is shown in Figure 1. The organization of this paper is as follows: Section 2 discusses the relation to previous work. 
Section 3 introduces the proposed methods, including the deep recurrent neural networks, joint optimization of deep learning models and a soft time-frequency masking function, and different training objectives. Section 4 presents the experimental setting and results using the MIR-1K dateset. We conclude the paper in Section 5. 2. RELATION TO PREVIOUS WORK Several previous approaches utilize the constraints of low rank and sparsity of the music and speech signals, respectively, for singing voice separation tasks [7, 13, 16, 17]. Such strong assumption for the signals might not always be true. Furthermore, in the separation stage, these models can be viewed as a single-layer linear network, predicting the clean spectra via a linear transform. To further improve the expressibility of these linear models, in this paper, we use deep learning models to learn the representations from c Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis. “Singing-Voice Separation From Monaural Recordings Using Deep Recurrent Neural Networks”, 15th International Society for Music Information Retrieval Conference, 2014. 477 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ... ... ... ... L-layer sRNN ... ... ... ... ... ... L 2 1 1 time ... ... ... ... l ... 1-layer RNN L-layer DRNN L 1 time time Figure 2. Deep Recurrent Neural Networks (DRNNs) architectures: Arrows represent connection matrices. Black, white, and grey circles represent input frames, hidden states, and output frames, respectively. (Left): standard recurrent neural networks; (Middle): L intermediate layer DRNN with recurrent connection at the l-th layer. (Right): L intermediate layer DRNN with recurrent connections at all levels (called stacked RNN). data, without enforcing low rank and sparsity constraints. By exploring deep architectures, deep learning approaches are able to discover the hidden structures and features at different levels of abstraction from data [5]. Deep learning methods have been applied to a variety of applications and yielded many state of the art results [2, 4, 8]. Recently, deep learning techniques have been applied to related tasks such as speech enhancement and ideal binary mask estimation [1, 9–11, 15]. In the ideal binary mask estimation task, Narayanan and Wang [11] and Wang and Wang [15] proposed a two-stage framework using deep neural networks. In the first stage, the authors use d neural networks to predict each output dimension separately, where d is the target feature dimension; in the second stage, a classifier (one layer perceptron or an SVM) is used for refining the prediction given the output from the first stage. However, the proposed framework is not scalable when the output dimension is high. For example, if we want to use spectra as targets, we would have 513 dimensions for a 1024-point FFT. It is less desirable to train such large number of neural networks. In addition, there are many redundancies between the neural networks in neighboring frequencies. In our approach, we propose a general framework that can jointly predict all feature dimensions at the same time using one neural network. Furthermore, since the outputs of the prediction are often smoothed out by time-frequency masking functions, we explore jointly training the masking function with the networks. Maas et al. 
proposed using a deep RNN for robust automatic speech recognition tasks [10]. Given a noisy signal x, the authors apply a DRNN to learn the clean speech y. In the source separation scenario, we found that modeling one target source in the denoising framework is suboptimal compared to the framework that models all sources. In addition, we can use the information and constraints from different prediction outputs to further perform masking and discriminative training.

3. PROPOSED METHODS

3.1 Deep Recurrent Neural Networks

To capture the contextual information among audio signals, one way is to concatenate neighboring features together as input features to the deep neural network. However, the number of parameters increases rapidly according to the input dimension. Hence, the size of the concatenating window is limited. A recurrent neural network (RNN) can be considered as a DNN with indefinitely many layers, which introduce the memory from previous time steps. The potential weakness for RNNs is that RNNs lack hierarchical processing of the input at the current time step. To further provide the hierarchical information through multiple time scales, deep recurrent neural networks (DRNNs) are explored [3, 12]. DRNNs can be explored in different schemes as shown in Figure 2. The left of Figure 2 is a standard RNN, folded out in time. The middle of Figure 2 is an L intermediate layer DRNN with temporal connection at the l-th layer. The right of Figure 2 is an L intermediate layer DRNN with full temporal connections (called stacked RNN (sRNN) in [12]).

Formally, we can define different schemes of DRNNs as follows. Suppose there is an L intermediate layer DRNN with the recurrent connection at the l-th layer, the l-th hidden activation at time t is defined as:

    h_t^l = f_h(x_t, h_{t-1}^l) = \phi_l( U^l h_{t-1}^l + W^l \phi_{l-1}( W^{l-1} \cdots \phi_1( W^1 x_t ) ) ),   (1)

and the output, y_t, can be defined as:

    y_t = f_o(h_t^l) = W^L \phi_{L-1}( W^{L-1} \cdots \phi_l( W^l h_t^l ) ),   (2)

where x_t is the input to the network at time t, \phi_l is an element-wise nonlinear function, W^l is the weight matrix for the l-th layer, and U^l is the weight matrix for the recurrent connection at the l-th layer. The output layer is a linear layer.

The stacked RNNs have multiple levels of transition functions, defined as:

    h_t^l = f_h(h_t^{l-1}, h_{t-1}^l) = \phi_l( U^l h_{t-1}^l + W^l h_t^{l-1} ),   (3)

where h_t^l is the hidden state of the l-th layer at time t. U^l and W^l are the weight matrices for the hidden activation at time t-1 and the lower level activation h_t^{l-1}, respectively. When l = 1, the hidden activation is computed using h_t^0 = x_t. Function \phi_l(\cdot) is a nonlinear function, and we empirically found that using the rectified linear unit f(x) = max(0, x) [2] performs better compared to using a sigmoid or tanh function. For a DNN, the temporal weight matrix U^l is a zero matrix.

3.2 Model Architecture

Figure 3. Proposed neural network architecture.

At time t, the training input, x_t, of the network is the concatenation of features from a mixture within a window. We use magnitude spectra as features in this paper. The output targets, y_{1t} and y_{2t}, and output predictions, ŷ_{1t} and ŷ_{2t}, of the network are the magnitude spectra of different sources.
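As a concrete reading of Eqs. (1)-(3), the following NumPy sketch steps a DRNN with a single recurrent layer, and a stacked RNN, forward by one frame. The layer sizes, random initialization, and the convention that the last matrix in W is the linear output layer are illustrative assumptions for exposition, not the authors' trained system.

```python
# Illustrative NumPy sketch of the DRNN recurrences in Eqs. (1)-(3).
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                  # the rectified linear unit used for phi_l

def drnn_step(x_t, h_prev, W, U_l, l, phi=relu):
    """One time step of a DRNN with a recurrent connection at layer l.
    W = [W^1, ..., W^L, W_out] feed-forward matrices (last one linear output),
    U_l = recurrent matrix U^l, h_prev = h_{t-1}^l."""
    a = x_t
    for k in range(l - 1):                     # layers below the recurrent layer
        a = phi(W[k] @ a)
    h_t = phi(U_l @ h_prev + W[l - 1] @ a)     # Eq. (1)
    y = h_t
    for k in range(l, len(W) - 1):             # layers above the recurrent layer
        y = phi(W[k] @ y)
    return W[-1] @ y, h_t                      # linear output, Eq. (2)

def srnn_step(x_t, h_prev, W, U, phi=relu):
    """One time step of a stacked RNN with recurrence at every layer, Eq. (3).
    Here W holds only the hidden-layer matrices; the output layer is omitted."""
    h_t, below = [], x_t
    for k in range(len(W)):
        h = phi(U[k] @ h_prev[k] + W[k] @ below)
        h_t.append(h)
        below = h
    return h_t

# Toy shapes (assumed): three 1000-unit hidden layers on 513-bin magnitude spectra.
rng = np.random.default_rng(0)
W = [rng.standard_normal((1000, 513))] + [rng.standard_normal((1000, 1000)) for _ in range(2)]
U = [rng.standard_normal((1000, 1000)) for _ in range(3)]
h1 = srnn_step(rng.standard_normal(513), [np.zeros(1000) for _ in range(3)], W, U)
```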
Since our goal is to separate one of the sources from a mixture, instead of learning only one of the sources as the target, we adapt the framework from [9] to model all of the different sources simultaneously. Figure 3 shows an example of the architecture.

Moreover, we find it useful to further smooth the source separation results with a time-frequency masking technique, for example binary time-frequency masking or soft time-frequency masking [7, 9]. The time-frequency masking function enforces the constraint that the sum of the prediction results is equal to the original mixture. Given the input features, x_t, from the mixture, we obtain the output predictions ŷ_{1t} and ŷ_{2t} through the network. The soft time-frequency mask m_t is defined as follows:

    m_t(f) = \frac{|\hat{y}_{1t}(f)|}{|\hat{y}_{1t}(f)| + |\hat{y}_{2t}(f)|},   (4)

where f ∈ {1, . . . , F} represents different frequencies. Once a time-frequency mask m_t is computed, it is applied to the magnitude spectra z_t of the mixture signals to obtain the estimated separation spectra ŝ_{1t} and ŝ_{2t}, which correspond to sources 1 and 2, as follows:

    \hat{s}_{1t}(f) = m_t(f) \, z_t(f),
    \hat{s}_{2t}(f) = (1 - m_t(f)) \, z_t(f),   (5)

where f ∈ {1, . . . , F} represents different frequencies.

The time-frequency masking function can be viewed as a layer in the neural network as well. Instead of training the network and applying the time-frequency masking to the results separately, we can jointly train the deep learning models with the time-frequency masking functions. We add an extra layer to the original output of the neural network as follows:

    \tilde{y}_{1t} = \frac{|\hat{y}_{1t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|} \odot z_t,
    \tilde{y}_{2t} = \frac{|\hat{y}_{2t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|} \odot z_t,   (6)

where the operator \odot denotes element-wise multiplication (the Hadamard product). In this way, we can integrate the constraints into the network and optimize the network with the masking function jointly. Note that although this extra layer is a deterministic layer, the network weights are optimized for the error metric between and among ỹ_{1t}, ỹ_{2t} and y_{1t}, y_{2t}, using back-propagation. To further smooth the predictions, we can apply masking functions to ỹ_{1t} and ỹ_{2t}, as in Eqs. (4) and (5), to obtain the estimated separation spectra s̃_{1t} and s̃_{2t}. The time-domain signals are reconstructed by taking the inverse short-time Fourier transform (ISTFT) of the estimated magnitude spectra together with the original mixture phase spectra.

3.3 Training Objectives

Given the output predictions ŷ_{1t} and ŷ_{2t} (or ỹ_{1t} and ỹ_{2t}) of the original sources y_{1t} and y_{2t}, we explore optimizing the neural network parameters by minimizing the squared error and the generalized Kullback-Leibler (KL) divergence criteria, as follows:

    J_{MSE} = \| \hat{y}_{1t} - y_{1t} \|_2^2 + \| \hat{y}_{2t} - y_{2t} \|_2^2   (7)

and

    J_{KL} = D(y_{1t} \| \hat{y}_{1t}) + D(y_{2t} \| \hat{y}_{2t}),   (8)

where the measure D(A||B) is defined as:

    D(A \| B) = \sum_i \left( A_i \log \frac{A_i}{B_i} - A_i + B_i \right).   (9)

D(·||·) reduces to the KL divergence when \sum_i A_i = \sum_i B_i = 1, so that A and B can be regarded as probability distributions. Furthermore, minimizing Eqs. (7) and (8) increases the similarity between the predictions and the targets.
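The joint masking layer of Eq. (6) and the two criteria of Eqs. (7)-(9) translate directly into array operations. The sketch below is an illustration rather than the authors' implementation: the epsilon guards and the librosa-based reconstruction hint in the closing comments are assumptions about tooling, not details given in the paper.

```python
# Sketch of the soft masking layer and the training criteria in Eqs. (4)-(9).
import numpy as np

def soft_mask_layer(y1_hat, y2_hat, z, eps=1e-12):
    """Joint-mask output layer, Eq. (6): scale the mixture magnitude spectrum z
    by the relative magnitude of each prediction, so the two outputs sum to z."""
    denom = np.abs(y1_hat) + np.abs(y2_hat) + eps
    return np.abs(y1_hat) / denom * z, np.abs(y2_hat) / denom * z

def mse_objective(y1_pred, y2_pred, y1, y2):
    """Squared-error criterion, Eq. (7)."""
    return np.sum((y1_pred - y1) ** 2) + np.sum((y2_pred - y2) ** 2)

def gkl(a, b, eps=1e-12):
    """Generalized KL divergence D(A||B), Eq. (9); eps avoids log(0)."""
    a, b = a + eps, b + eps
    return np.sum(a * np.log(a / b) - a + b)

def kl_objective(y1_pred, y2_pred, y1, y2):
    """Generalized-KL criterion, Eq. (8)."""
    return gkl(y1, y1_pred) + gkl(y2, y2_pred)

# Reconstruction with the mixture phase (Section 3.2) could then use, for example:
#   S_mix = librosa.stft(mixture, n_fft=1024, hop_length=512)
#   v1 = librosa.istft(y1_tilde * np.exp(1j * np.angle(S_mix)), hop_length=512)
```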
Since one of the goals in source separation problems is to have high signal to interference ratio (SIR), we explore discriminative objective functions that not only increase the similarity between the prediction and its target, but also decrease the similarity between the prediction and the targets of other sources, as follows: ||ŷ1t −y1t ||22 −γ||ŷ1t −y2t ||22 +||ŷ2t −y2t ||22 −γ||ŷ2t −y1t ||22 (10) and D(y1t ||ŷ1t )−γD(y1t ||ŷ2t )+D(y2t ||ŷ2t )−γD(y2t ||ŷ1t ), (11) where γ is a constant chosen by the performance on the development set. 4. EXPERIMENTS 4.1 Setting Our system is evaluated using the MIR-1K dataset [6]. 1 A thousand song clips are encoded with a sample rate of 16 KHz, with durations from 4 to 13 seconds. The clips were extracted from 110 Chinese karaoke songs performed by both male and female amateurs. There are manual annotations of the pitch contours, lyrics, indices and types for unvoiced frames, and the indices of the vocal and non-vocal frames. Note that each clip contains the singing voice and the background music in different channels. Only the singing voice and background music are used in our experiments. Following the evaluation framework in [13, 17], we use 175 clips sung by one male and one female singer (‘abjones’ and ‘amy’) as the training and development set. 2 The remaining 825 clips of 17 singers are used for testing. For each clip, we mixed the singing voice and the background music with equal energy (i.e. 0 dB SNR). The goal is to separate the singing voice from the background music. To quantitatively evaluate source separation results, we use Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR), and Source to Distortion Ratio (SDR) by BSS-EVAL 3.0 metrics [14]. The Normalized SDR (NSDR) is defined as: NSDR(v̂, v, x) = SDR(v̂, v) − SDR(x, v), (GNSDR), Global SIR (GSIR), and Global SAR (GSAR), which are the weighted means of the NSDRs, SIRs, SARs, respectively, over all test clips weighted by their length. Higher values of SDR, SAR, and SIR represent better separation quality. The suppression of the interfering source is reflected in SIR. The artifacts introduced by the separation process are reflected in SAR. The overall performance is reflected in SDR. For training the network, in order to increase the variety of training samples, we circularly shift (in the time domain) the singing voice signals and mix them with the background music. In the experiments, we use magnitude spectra as input features to the neural network. The spectral representation is extracted using a 1024-point short time Fourier transform (STFT) with 50% overlap. Empirically, we found that using log-mel filterbank features or log power spectrum provide worse performance. For our proposed neural networks, we optimize our models by back-propagating the gradients with respect to the training objectives. The limited-memory Broyden-FletcherGoldfarb-Shanno (L-BFGS) algorithm is used to train the models from random initialization. We set the maximum epoch to 400 and select the best model according to the development set. The sound examples and more details of this work are available online. 3 4.2 Experimental Results In this section, we compare different deep learning models from several aspects, including the effect of different input context sizes, the effect of different circular shift steps, the effect of different output formats, the effect of different deep recurrent neural network structures, and the effect of the discriminative training objectives. 
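For completeness, the discriminative criterion of Eq. (10) and the NSDR/GNSDR aggregation of Eq. (12) can be sketched as follows. The default value of gamma is illustrative (the paper tunes it on the development set), and the suggestion to obtain per-clip SDR values from a BSS-EVAL implementation such as mir_eval is an assumption about tooling rather than the authors' exact evaluation scripts.

```python
# Sketch of the discriminative squared-error objective (Eq. (10)) and the
# NSDR/GNSDR evaluation aggregation (Eq. (12) and Section 4.1).
import numpy as np

def discriminative_mse(y1_pred, y2_pred, y1, y2, gamma=0.05):
    """Eq. (10): reward similarity to the correct source and penalize
    similarity to the other source's target; gamma is a tuned hyper-parameter."""
    return (np.sum((y1_pred - y1) ** 2) - gamma * np.sum((y1_pred - y2) ** 2)
            + np.sum((y2_pred - y2) ** 2) - gamma * np.sum((y2_pred - y1) ** 2))

def nsdr(sdr_separated, sdr_mixture):
    """Eq. (12): SDR improvement over using the raw mixture as the estimate.
    SDR values could come from a BSS-EVAL implementation, e.g. mir_eval."""
    return sdr_separated - sdr_mixture

def gnsdr(nsdrs, clip_lengths):
    """Global NSDR: mean of per-clip NSDRs weighted by clip length."""
    nsdrs, w = np.asarray(nsdrs, float), np.asarray(clip_lengths, float)
    return float(np.sum(nsdrs * w) / np.sum(w))
```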
For simplicity, unless mentioned explicitly, we report the results using 3 hidden layers of 1000 hidden units neural networks with the mean squared error criterion, joint masking training, and 10K samples as the circular shift step size using features with a context window size of 3 frames. We denote the DRNN-k as the DRNN with the recurrent connection at the k-th hidden layer. We select the models based on the GNSDR results on the development set. First, we explore the case of using single frame features, and the cases of concatenating neighboring 1 and 2 frames as features (context window sizes 1, 3, and 5, respectively). Table 1 reports the results using DNNs with context window sizes 1, 3, and 5. We can observe that concatenating neighboring 1 frame provides better results compared with the other cases. Hence, we fix the context window size to be 3 in the following experiments. Table 2 shows the difference between different circular shift step sizes for deep neural networks. We explore the cases without circular shift and the circular shift with a step size of {50K, 25K, 10K} samples. We can observe that the separation performance improves when the number of training samples increases (i.e. the step size of circular (12) where v̂ is the resynthesized singing voice, v is the original clean singing voice, and x is the mixture. NSDR is for estimating the improvement of the SDR between the preprocessed mixture x and the separated singing voice v̂. We report the overall performance via Global NSDR 1 https://sites.google.com/site/unvoicedsoundseparation/mir-1k Four clips, abjones 5 08, abjones 5 09, amy 9 08, amy 9 09, are used as the development set for adjusting hyper-parameters. 2 3 480 https://sites.google.com/site/deeplearningsourceseparation/ 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Model (context window size) GNSDR GSIR GSAR DNN (1) 6.63 10.81 9.77 DNN (3) 6.93 10.99 10.15 DNN (5) 6.84 10.80 10.18 Table 1. Results with input features concatenated from different context window sizes. Model GNSDR GSIR (circular shift step size) DNN (no shift) 6.30 9.97 DNN (50,000) 6.62 10.46 DNN (25,000) 6.86 11.01 DNN (10,000) 6.93 10.99 GSAR 9.99 10.07 10.00 10.15 Table 4. The results of different architectures and different objective functions. The “MSE” denotes the mean squared error and the “KL” denotes the generalized KL divergence criterion. Table 2. Results with different circular shift step sizes. Model (num. of output GNSDR GSIR sources, joint mask) DNN (1, no) 5.64 8.87 DNN (2, no) 6.44 9.08 DNN (2, yes) 6.93 10.99 Model (objective) GNSDR GSIR GSAR DNN (MSE) 6.93 10.99 10.15 DRNN-1 (MSE) 7.11 11.74 9.93 DRNN-2 (MSE) 7.27 11.98 9.99 DRNN-3 (MSE) 7.14 11.48 10.15 sRNN (MSE) 7.09 11.72 9.88 DNN (KL) 7.06 11.34 10.07 DRNN-1 (KL) 7.09 11.48 10.05 DRNN-2 (KL) 7.27 11.35 10.47 DRNN-3 (KL) 7.10 11.14 10.34 sRNN (KL) 7.16 11.50 10.11 Model GNSDR GSIR GSAR DNN 6.93 10.99 10.15 DRNN-1 7.11 11.74 9.93 DRNN-2 7.27 11.98 9.99 DRNN-3 7.14 11.48 10.15 sRNN 7.09 11.72 9.88 DNN + discrim 7.09 12.11 9.67 DRNN-1 + discrim 7.21 12.76 9.56 DRNN-2 + discrim 7.45 13.08 9.68 DRNN-3 + discrim 7.09 11.69 10.00 sRNN + discrim 7.15 12.79 9.39 GSAR 9.73 11.26 10.15 Table 3. Deep neural network output layer comparison using single source as a target and using two sources as targets (with and without joint mask training). In the “joint mask” training, the network training objective is computed after time-frequency masking. shift decreases). 
Since the improvement is relatively small when we further increase the number of training samples, we fix the circular shift size to be 10K samples. Table 3 presents the results with different output layer formats. We compare using single source as a target (row 1) and using two sources as targets in the output layer (row 2 and row 3). We observe that modeling two sources simultaneously provides better performance. Comparing row 2 and row 3 in Table 3, we observe that using the joint mask training further improves the results. Table 4 presents the results of different deep recurrent neural network architectures (DNN, DRNN with different recurrent connections, and sRNN) and the results of different objective functions. We can observe that the models with the generalized KL divergence provide higher GSARs, but lower GSIRs, compared to the models with the mean squared error objective. Both objective functions provide similar GNSDRs. For different network architectures, we can observe that DRNN with recurrent connection at the second hidden layer provides the best results. In addition, all the DRNN models achieve better results compared to DNN models by utilizing temporal information. Table 5 presents the results of different deep recurrent neural network architectures (DNN, DRNN with different recurrent connections, and sRNN) with and without discriminative training. We can observe that discriminative training improves GSIR, but decreases GSAR. Overall, GNSDR is slightly improved. 481 Table 5. The comparison for the effect of discriminative training using different architectures. The “discrim” denotes the models with discriminative training. Finally, we compare our best results with other previous work under the same setting. Table 6 shows the results with unsupervised and supervised settings. Our proposed models achieve 2.30∼2.48 dB GNSDR gain, 4.32∼5.42 dB GSIR gain with similar GSAR performance, compared with the RNMF model [13]. An example of the separation results is shown in Figure 4. 5. CONCLUSION AND FUTURE WORK In this paper, we explore using deep learning models for singing voice separation from monaural recordings. Specifically, we explore different deep learning architectures, including deep neural networks and deep recurrent neural networks. We further enhance the results by jointly optimizing a soft mask function with the networks and exploring the discriminative training criteria. Overall, our proposed models achieve 2.30∼2.48 dB GNSDR gain and 4.32∼5.42 dB GSIR gain, compared to the previous proposed methods, while maintaining similar GSARs. Our proposed models can also be applied to many other applications such as main melody extraction. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) (a) Mixutre (b) Clean vocal (c) Recovered vocal (d) Clean music (e) Recovered music Figure 4. (a) The mixture (singing voice and music accompaniment) magnitude spectrogram (in log scale) for the clip Ani 1 01 in MIR-1K; (b) (d) The groundtruth spectrograms for the two sources; (c) (e) The separation results from our proposed model (DRNN-2 + discrim). Unsupervised Model GNSDR GSIR GSAR RPCA [7] 3.15 4.43 11.09 RPCAh [16] 3.25 4.52 11.10 RPCAh + FASST [16] 3.84 6.22 9.19 Supervised Model GNSDR GSIR GSAR MLRR [17] 3.85 5.63 10.70 RNMF [13] 4.97 7.66 10.03 DRNN-2 7.27 11.98 9.99 DRNN-2 + discrim 7.45 13.08 9.68 [6] C.-L. Hsu and J.-S.R. Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. 
IEEE Transactions on Audio, Speech, and Language Processing, 18(2):310 –319, Feb. 2010. [7] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. HasegawaJohnson. Singing-voice separation from monaural recordings using robust principal component analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60, 2012. [8] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In ACM International Conference on Information and Knowledge Management (CIKM), 2013. Table 6. Comparison between our models and previous proposed approaches. The “discrim” denotes the models with discriminative training. 6. ACKNOWLEDGEMENT [9] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Deep learning for monaural speech separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014. [10] A. L. Maas, Q. V Le, T. M O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng. Recurrent neural networks for noise reduction in robust ASR. In INTERSPEECH, 2012. We thank the authors in [13] for providing their trained [11] A. Narayanan and D. Wang. Ideal ratio mask estimation using model for comparison. This research was supported by deep neural networks for robust speech recognition. In ProU.S. ARL and ARO under grant number W911NF-09-1ceedings of the IEEE International Conference on Acoustics, 0383. This work used the Extreme Science and EngineerSpeech, and Signal Processing. IEEE, 2013. ing Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575. [12] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. In International Conference on Learning Representations, 2014. 7. REFERENCES [1] N. Boulanger-Lewandowski, G. Mysore, and M. Hoffman. Exploiting long-term temporal dependencies in NMF using recurrent neural networks with application to source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014. [2] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), 2011. [3] M. Hermans and B. Schrauwen. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems, pages 190–198, 2013. [4] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97, Nov. 2012. [13] P. Sprechmann, A. Bronstein, and G. Sapiro. Real-time online singing voice separation from monaural recordings using robust low-rank modeling. In Proceedings of the 13th International Society for Music Information Retrieval Conference, 2012. [14] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. Audio, Speech, and Language Processing, IEEE Transactions on, 14(4):1462 –1469, July 2006. [15] Y. Wang and D. Wang. Towards scaling up classificationbased speech separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(7):1381–1390, 2013. [16] Y.-H. Yang. On sparse and low-rank matrix decomposition for singing voice separation. In ACM Multimedia, 2012. [17] Y.-H. Yang. 
Low-rank representation of both singing voice and music accompaniment via learned dictionaries. In Proceedings of the 14th International Society for Music Information Retrieval Conference, November 4-8 2013. [5] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504 – 507, 2006. 482 15th International Society for Music Information Retrieval Conference (ISMIR 2014) IMPACT OF LISTENING BEHAVIOR ON MUSIC RECOMMENDATION Katayoun Farrahi Goldsmiths, University of London London, UK [email protected] Markus Schedl, Andreu Vall, David Hauger, Marko Tkalčič Johannes Kepler University Linz, Austria [email protected] ABSTRACT The next generation of music recommendation systems will be increasingly intelligent and likely take into account user behavior for more personalized recommendations. In this work we consider user behavior when making recommendations with features extracted from a user’s history of listening events. We investigate the impact of listener’s behavior by considering features such as play counts, “mainstreaminess”, and diversity in music taste on the performance of various music recommendation approaches. The underlying dataset has been collected by crawling social media (specifically Twitter) for listening events. Each user’s listening behavior is characterized into a three dimensional feature space consisting of play count, “mainstreaminess” (i.e. the degree to which the observed user listens to currently popular artists), and diversity (i.e. the diversity of genres the observed user listens to). Drawing subsets of the 28,000 users in our dataset, according to these three dimensions, we evaluate whether these dimensions influence figures of merit of various music recommendation approaches, in particular, collaborative filtering (CF) and CF enhanced by cultural information such as users located in the same city or country. 1. INTRODUCTION Early attempts in collaborative filtering (CF) recommender systems for music content have generally treated all users as equivalent in the algorithm [1]. The predicted score (i.e. the likelihood that the observed user would like the observed music piece) was a weighted average of the K nearest neighbors in a given similarity space [8]. The only way the users were treated differently was the weight, which reflected the similarity between users. However, users’ behavior in the consumption of music (and other multimedia material in general) has more dimensions than just ratings. Recently, there has been an increase of research in music consumption behavior and recommender systems that draw inspiration from psychology research on personality. Personality accounts for the individual difference in users in their behavioral styles [9]. Studies showed that personality affects rating behavior [6], music genre preferences [11] and taste diversity both in music [11] and other domains (e.g. movies in [2]). The aforementioned work inspired us to investigate how user features intuitively derived from personality traits affect the performance of a CF recommender system in the music domain. We chose three user features that are arguably proxies of various personality traits for user clustering and fine-tuning of the CF recommender system. The chosen features are play counts, mainstreaminess and diversity. Play count is a measure of how often the observed user engages in music listening (intuitively related to extraversion). 
Mainstreaminess is a measure that describes to what degree the observed user prefers currently popular songs or artists over non-popular (and is intuitively related to openness and agreeableness). The diversity feature is a measure of how diverse the observed user’s spectrum of listened music is (intuitively related to openness). In this paper, we consider the music listening behavior of a set of 28,000 users, obtained by crawling and analyzing microblogs. By characterizing users across a three dimensional space of play count, mainstreaminess, and diversity, we group users and evaluate various recommendation algorithms across these behavioral features. The goal is to determine whether or not the evaluated behavioral features influence the recommendation algorithms, and if so which directions are most promising. Overall, we find that recommending with collaborative filtering enhanced by continent and country information generally performs best. We also find that recommendations for users with large play counts, higher diversity and mainstreaminess values are better. 2. RELATED WORK The presented work stands at the crossroads of personalityinspired user features and recommender systems based on collaborative filtering. Among various models of personality, the Five-factor model (FFM) is the most widely used and is composed of the following traits: openness, conscientiousness, extraversion, agreeableness and neuroticism [9]. The personality theory inspired several works in the field of recommender systems. For example, Pu et al. [6] showed that user rating behavior is correlated with personality factors. Tkalčič et al. [13] used FFM factors to calculate similarities in a CF recommender system for images. A study by Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. c 2014 International Society for Music Information Retrieval. 483 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Rentfrow et al. [11] showed that scoring high on certain personality traits is correlated with genre preferences and other listening preferences like diversity. Chen et al. [2] argue that people who score high in openness to new experiences prefer more diverse recommendations than people who score low. The last two studies explore the relations between personality and diversity. In fact, the study of diversity in recommending items has become popular after the publishing of two popular books, The Long Tail [4] and The Filter Bubble [10]. However, most of the work was focused on the trade-off between recommending diverse and similar items (e.g. in [7]). In our work, we treat diversity not as a way of presenting music items but as a user feature, which is a novel way of addressing the usage of diversity in recommender systems. The presented work builds on collaborative filtering (CF) techniques that are well established in the recommender systems domain [1]. CF methods have been improved using context information when available [3]. Recently, [12] incorporated geospatial context to improve music recommendations on a dataset gathered through microblog crawling [5]. In the presented work, we advance this work by including personality-inspired user features. 3. 
USER BEHAVIOR MODELING 3.1 Dataset We use the “Million Musical Tweets Dataset” 1 (MMTD) dataset of music listening activities inferred from microblogs. This dataset is freely available [5], and contains approximately 1,100,000 listening events of 215,000 users listening to a total of 134,000 unique songs by 25,000 artists, collected from Twitter. The data was acquired crawling Twitter and identifying music listening events in tweets, using several databases and rule-based filters. Among others, the dataset contains information on location for each post, which enables location-aware analyses and recommendations. Location is provided both as GPS coordinates and semantic identifiers, including continent, country, state, county, and city. The MMTD contains a large number of users with only a few listening events. These users are not suitable for reliable recommendation and evaluation. Therefore, we consider a subset of users who had at least five listening events over different artists. This subset consists of 28,000 users. Basic statistics of the data used in all experiments are given in Table 1. The second column shows the total amount of the entities in the corresponding first row, whereas the right-most six columns show principal statistics based on the number of tweets. 3.2 Behavioral Features Each user is defined by a set of three behavioral features: play count, diversity, and mainstreaminess, defined next. These features are used to group users and to determine how they influence the recommendation process. 1 http://www.cp.jku.at/datasets/MMTD 484 Play count The play count of a user, P (u), is a measure of the quantity of listening events for a user u. It is computed as the total number of listening events recorded over all time for a given user. Diversity The diversity of a user, D(u), can be thought of as a measure which captures the range of listening tastes by the user. It is computed as the total number of unique genres associated with all of the artists listened to by a given user. Genre information was obtained by gathering the top tags from Last.fm for each artist in the collection. We then identified genres within these tags by matching the tags to a selection of 20 genres indicated by Allmusic.com. Mainstreaminess The mainstreaminess M (u) is a measure of how mainstream a user u is in terms of her/his listening behavior. It reflects the share of most popular artists within all the artists user u has listened to. Users that listen mostly to artists that are popular in a given time window tend to have high M (u), while users who listen more to artists that are rarely among the most popular ones tend to score low. For each time window i ∈ {1 . . . I} within the dataset (where I is the number of all time windows in the dataset) we calculated the set of the most popular artists Ai . We calculated the most popular artists in an observed time period as follows. For the given period we sorted the artists by the aggregate of the listening events they received in a decreasing order. Then, the top k artists, that cover at least 50% of all the listening events of the observed period are regarded as popular artists. For each user u in a given time window i we counted the number of play counts of popular artists Pip (u) and normalized it with all the play counts of that user in the observed time window Pia (u). 
The final value M (u) was aggregated by averaging the partial values for each time window: 1 Pip (u) M (u) = I i=1 Pia (u) I (1) In our experiments, we investigated time windows of six months and twelve months. Table 3 shows the correlation between individual user features. No significant correlation was found, except for the mainstreaminess using an interval of six months and an interval of twelve months, which is expected. 3.3 User Groups Each user is characterized by a three dimensional feature vector consisting of M (u), D(u), P (u). The distribution of users across these features are illustrated in Figures 1 and 2. In Figure 3, mainstreaminess is considered with a 6 month interval. The results illustrate the even distribution of users across these features. Therefore, for grouping users, we consider each feature individually and divide users between groups considering a threshold. For mainstreaminess, we consider the histogram of M (u) (Figure 2 for a 6 month (top) and 12 month (bottom)) in making the groups. We consider 2 different cases for grouping users. First, we divide the users into 2 groups according to the median value (referred to as M6(12)-median-G1(2)). 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Level Users Artists Tracks Continents Countries States Counties Cities Amount 27,778 21,397 108,676 7 166 872 3557 15123 Min. 5 1 1 9 1 1 1 1 1st Qu. 7 1 1 4,506 12 7 2 1 Median 10 2 1 101,400 71 40 10 5 Mean 27.69 35.95 7.08 109,900.00 4,633.00 882.00 216.20 50.86 3rd Qu. 17 9 4 142,200 555 195 41 16 Max. 89,320 11,850 2,753 374,300 151,600 148,900 191,900 148,900 Table 1. Basic dataset characteristics, where “Amount” is the number of items, and the statistics correspond to the values of the data. P-top10 P-mid5k P-bottom22k P-G1 P-G2 P-G3 D-G1 D-G2 D-G3 M6-03-G1 M6-03-G2 M6-median-G1 M6-median-G2 M12-05-G1 M12-05-G2 M12-median-G1 M12-median-G2 RB 10.28 1.33 0.64 0.45 0.65 1.08 0.64 0.73 0.93 0.50 1.34 0.35 1.25 1.35 0.36 0.36 1.34 Ccnt 11.75 1.75 0.92 0.67 1.32 2.04 0.85 0.93 1.63 0.88 2.73 0.58 2.49 2.02 0.59 0.62 2.09 Ccry 11.1 2.25 1.10 0.72 1.34 2.02 1.16 1.05 1.49 0.95 2.43 0.62 2.89 2.27 0.69 0.71 2.33 Csta 5.70 2.43 1.03 0.68 1.01 1.88 1.04 1.23 1.56 0.96 2.22 0.65 2.25 2.25 0.61 0.64 2.34 Ccty 5.70 1.46 0.77 0.44 0.69 1.30 0.87 0.84 0.93 0.64 1.49 0.48 1.47 1.50 0.41 0.43 1.57 Ccit 5.70 1.96 1.07 0.56 0.92 1.73 0.88 1.02 1.41 0.88 2.00 0.61 1.97 1.93 0.57 0.59 2.01 CF 11.22 4.47 1.85 1.13 1.71 3.51 2.22 2.04 2.49 1.76 3.36 1.35 3.14 2.90 1.30 1.41 3.10 CCcnt 10.74 4.59 1.95 1.26 1.78 3.60 2.24 2.21 2.56 1.84 3.50 1.46 3.27 3.02 1.38 1.50 3.24 CCcry 10.47 4.51 1.95 1.17 1.77 3.59 2.16 2.20 2.59 1.84 3.50 1.45 3.29 3.04 1.38 1.50 3.26 CCsta 5.89 3.56 1.56 0.78 1.32 2.90 1.59 1.68 2.03 1.43 2.81 1.04 2.67 2.47 1.01 1.10 2.66 CCcty 5.89 1.96 0.96 0.26 0.80 1.68 0.97 0.98 1.08 0.81 1.67 0.56 1.66 1.54 0.52 0.56 1.67 CCcit 5.89 2.56 1.16 0.35 0.89 2.16 0.93 1.08 1.54 1.00 2.08 0.66 2.07 1.94 0.66 0.71 2.10 Table 2. Maximum F-score for all combinations of methods and user sets. C refers to the CULT approaches, CC to CF CULT; cnt indicates continent, cry country, sta state, cty county, and cit city. The best performing recommenders for a given group are in bold. Second, we divide users into 2 groups for which borders are defined by a mainstreaminess of 0.3 and 0.5, respectively, for the 6 month case and the 12 month case (referred to as M6(12)-03(05)-G1(2)). 
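A minimal sketch of the mainstreaminess computation follows, assuming listening events are grouped per time window as (user, artist) pairs. The helper names, the data layout, and the handling of windows in which a user has no plays (skipped here, whereas Eq. (1) averages over all I windows) are simplifying assumptions rather than the authors' implementation.

```python
# Illustrative computation of the mainstreaminess feature, Eq. (1).
from collections import Counter

def popular_artists(events_in_window, coverage=0.5):
    """Top artists jointly covering at least `coverage` of the window's
    listening events (Section 3.2)."""
    counts = Counter(artist for _, artist in events_in_window)
    total, covered, popular = sum(counts.values()), 0, set()
    for artist, c in counts.most_common():
        popular.add(artist)
        covered += c
        if covered >= coverage * total:
            break
    return popular

def mainstreaminess(events_by_window, user):
    """Average, over time windows, of the user's share of plays that go to
    that window's popular artists; empty windows are skipped in this sketch."""
    shares = []
    for window_events in events_by_window:          # one list of (user, artist) per window
        popular = popular_artists(window_events)
        user_plays = [a for u, a in window_events if u == user]
        if user_plays:
            shares.append(sum(a in popular for a in user_plays) / len(user_plays))
    return sum(shares) / len(shares) if shares else 0.0
```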
These values were chosen by considering the histograms in Figure 2 and choosing values which naturally grouped users. For the diversity, we create 3 groups according to the 0.33 and 0.67 percentiles (referred to as D-G1(2,3)). For play counts, we consider 2 different groupings. The first is the same as for diversity, i.e. dividing groups according to the 0.33 and 0.67 percentiles (referred to as P-G1(2,3)). The second splits the users according to the accumulative play counts into the following groups, each of which accounts for approximately a third of all play counts: top 10 users, mid 5,000 users, bottom 22,000 users (referred to as Ptop10(mid5k,bottom22k)). D(u) M(u) (12 mo.) P(u) D(u) 0.069 0.292 M(u) (6 mo.) 0.119 0.837 0.021 P(u) 0.292 0.013 - Table 3. Feature correlations. Note due to the symmetry of these featuers, mainstreaminess is presented for 6 months on one dimension and 12 months on another. Overall, none of the features are highly correlated other than the mainstreaminess 6 and 12 month features, which is expected. 4. RECOMMENDATION MODELS In the considered music recommendation models, each user u ∈ U is represented by a list of artists listened to A(u). All approaches determine for a given seed user u a number K of most similar neighbors VK (u), and recommend the artists listened to by these VK (u), excluding the artists 485 A(u) already known by u. The recommended artists R(u) 1 for user u are computed as R(u) = v∈VK (u) A(v) \ A(u) K and VK (u) = argmaxK v∈U \{u} sim(u, v), where argmaxv denotes the K users v with highest similarities to u. In considering geographical information for user-context models, we investigate the following approaches, which differ in the way this similarity term sim(u, v) is computed. The following approaches were investigated: CULT: In the cultural approach, we select the neighbors for the seed user only according to a geographical similarity computed by means of the Jaccard index on listening distributions over semantic locations. We consider as such 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 5 2000 10 4 10 1500 # users # users 3 10 2 1000 10 500 1 10 0 0 0 10 0 0.5 1 1.5 play count 2 2.5 3 4 0.2 0.4 0.6 0.8 mainstreaminess 6mo. 1 0.2 0.4 0.6 0.8 mainstreaminess 12mo. 1 2000 x 10 5000 1500 # users # users 4000 3000 1000 2000 500 1000 0 0 5 10 diversity 15 0 0 20 Figure 2. Histogram of mainstreaminess considering a time interval of (top) 6 months and (bottom) 12 months. Figure 1. Histogram of (top) play counts (note the log scale on the y-axis) and (bottom) diversity over users. semantic categories continent, country, state, county, and city. For each user, we obtain the relevant locations by computing the relative frequencies of his listening events over all locations. To exclude the aforementioned geoentities that are unlikely to contribute to the user’s cultural circle, we retain only locations at which the user has listened to music with a frequency above his own average 2 . On the corresponding listening vectors over locations of two users u and v, we compute the Jaccard index to obtain sim(u, v). Depending on the location category user similarities are computed on, we distinguish CULT continent, CULT country, CULT state, CULT county, and CULT city. CF: We also consider a user-based collaborative filtering approach. Given the artist play counts of seed user u as a vector P (u) over all artists in the corpus, we first omit the artists that occur in the test set (i.e. 
we set to 0 the play count values for artists we want our algorithm to predict). We then normalize P (u) so that its Euclidean norm equals 1 and compute similarities sim(u, v) as the inner product between P (u) and P (v). CF CULT: This approach works by combining the CF similarity matrix with the CULT similarity matrix via pointwise multiplication, in order to incorporate both music preference and cultural information. RB: For comparison, we implemented a random baseline model that randomly picks K users and recommends 2 This way we exclude, for instance, locations where the user might have spent only a few days during vacation. 486 the artists they listened to. The similarity function can thus be considered sim(u, v) = rand [0,1]. 5. EVALUATION 5.1 Experimental Setup For experiments, we perform 10-fold cross validation on the user level. For each user, we predict 10% of the artists based on the remaining 90% used for training. We compute precision, recall, and F-measure by averaging the results over all folds per user and all users in the dataset. To compare the performance between approaches, we use a parameter N for the number of recommended artists, and adapt dynamically the number of neighbors K to be considered for the seed user u. This is necessary since we do not know how many artists should be predicted for a given user (this number varies over users and approaches). To determine a suited value of K for a given recommendation approach and a given N , we start the approach with K = 1 and iteratively increase K until the number of recommended artists equals or exceeds N . In the latter case, we sort the returned artists according to their overall popularity among the K neighbors and recommend the top N . 5.2 Results Table 2 depicts the maximum F-score (over all values of N ) for each combination of user set and method. We decided to report the maximum F-scores, because recall and 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 20 3 US−P−G3−RB US−P−G3−CULT_continent US−P−G3−CULT_country US−P−G3−CULT_state US−P−G3−CULT_county US−P−G3−CULT_city US−P−G3−CF US−P−G3−CF_CULT_continent US−P−G3−CF_CULT_country US−P−G3−CF_CULT_state US−P−G3−CF_CULT_county US−P−G3−CF_CULT_city 2.5 15 precision diversity 2 10 1.5 1 5 0.5 0 0 0 0 10 5 20 30 40 recall 50 60 70 10 play count Figure 4. Recommendation performance of investigated methods on user group P-G3. 1 mainstreaminess 6 10 0.8 In terms of play counts, we observe as the user has a larger number of events in the dataset, the performance increases significantly (P-G3 and P-top10). This can be explained by the fact that more comprehensive user models can be created for users about whom we know more, which in turn yields better recommendations. Also in terms of diversity, there are performance differences across groups given a particular recommender algorithm. Especially between the high diversity listeners D-G3 and low diversity listeners D-G1, results differ substantially. This can be explained by the fact that it is easier to find a considerable amount of like-minded users for seeds who have a diverse music taste, in technical terms, less sparse A(u) vector. When considering mainstreaminess, taking either a 6 month or 12 month interval does not appear to have a significant impact on recommendation performance. There are minor differences depending on the recommendation algorithm. 
However, in general, the groups with larger mainstreaminess (M6-03-G2, M6-med-G2, M12-med-G2) always performed much better for all approaches than the groups with smaller mainstreaminess. It hence seems easier to satisfy users with a mainstream music taste than users with diverging taste. 0.6 0.4 0.2 0 0 5 10 diversity 15 20 Figure 3. Users plot as a function of (top) D(u) vs P (u) and (bottom) M (u) (6 months) vs D(u). Note the log scale for P (u) only. These figures illustrate the widespread, even distribution of users across the feature space. precision show an inverse characteristics over N . Since the F-score equals the harmonic mean of precision and recall, it is less influenced by variations of N , nevertheless aggregate performance in a meaningful way. We further plot precision/recall-curves for several cases reported in Table 2. In Figure 4, we present the results of all of the recommendation algorithms for one group on the play counts. For this case, the CF approach with integrated continent and country information performed best, followed by the CF approach. Predominantly, these three methods outperformed all of the other approaches for the various groups, which is also apparent in Table 2. The only exception was the P-top10 case, where the CULT continent approach outperformed CF approaches. However, considering the small number of users in this subset (10), the difference of one percentage point between CULT continent and CF CULT continent is not significant. We observe the CF approach with the addition of the continent and country information are very good recommenders in general for the data we are using. Now we are interested to know how the recommendations performed across user groups and respective features. 6. CONCLUSIONS AND FUTURE WORK In this paper, we consider the role of user listening behavior related to the history of listening events in order to evaluate how this may effect music recommendation, particularly considering the direction of personalization. We investigate three user characteristics, play count, mainstreaminess, and diversity, and form groups of users along these dimensions. We evaluate several different recommendation algorithms, particularly collaborative filtering (CF), and CF augmented by location information. We find the CF and CF approaches augmented by continent and country information about the listener to outperform the other methods. We also find recommendation algorithms for users with large play counts, higher diversity, and higher 487 15th International Society for Music Information Retrieval Conference (ISMIR 2014) mainstreaminess have better performance. As part of future work, we will investigate content-based music recommendation models as well as combinations of content-based, CF-based, and location-based models. Additional characteristics of the user, such as age, gender, or musical education, will be addressed, too. 3 US−P−G1−RB US−P−G1−CF US−P−G1−CF_CULT_continent US−P−G1−CF_CULT_country US−P−G2−RB US−P−G2−CF US−P−G2−CF_CULT_continent US−P−G2−CF_CULT_country US−P−G3−RB US−P−G3−CF US−P−G3−CF_CULT_continent US−P−G3−CF_CULT_country 2.5 precision 2 1.5 1 7. ACKNOWLEDGMENTS This research is supported by the Austrian Science Funds (FWF): P22856 and P25655, and by the EU FP7 project no. 601166 (“PHENICX”). 8. REFERENCES [1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005. 
Figure 5. Precision vs. recall for the play count (top), diversity (middle), and mainstreaminess with a 12-month interval (bottom) experiments, over groups and various recommendation approaches.

TOWARDS SEAMLESS NETWORK MUSIC PERFORMANCE: PREDICTING AN ENSEMBLE'S EXPRESSIVE DECISIONS FOR DISTRIBUTED PERFORMANCE
Bogdan Vera, Queen Mary University of London, Centre for Digital Music, [email protected]
Elaine Chew, Queen Mary University of London, Centre for Digital Music, [email protected]

ABSTRACT
Internet performance faces the challenge of network latency. One proposed solution is music prediction, wherein musical events are predicted in advance and transmitted to distributed musicians ahead of the network delay.
We present a context-aware music prediction system focusing on expressive timing: a Bayesian network that incorporates stylistic model selection and linear conditional gaussian distributions on variables representing proportional tempo change. The system can be trained using rehearsals of distributed or co-located ensembles. We evaluate the model by comparing its prediction accuracy to two others: one employing only linear conditional dependencies between expressive timing nodes but no stylistic clustering, and one using only independent distributions for timing changes. The three models are tested on performances of a custom-composed piece that is played ten times, each in one of two styles. The results are promising, with the proposed system outperforming the other two. In predictable parts of the performance, the system with conditional dependencies and stylistic clustering achieves errors of 15ms; in more difficult sections, the errors rise to 100ms; and, in unpredictable sections, the error is too great for seamless timing emulation. Finally, we discuss avenues for further research and propose the use of predictive timing cues using our system. 1. INTRODUCTION Ensemble performance between remote musicians playing over the Internet is generally made difficult or impossible by high latencies in data transmission [3] [5]. While many composers and musicians have chosen to treat latency as a feature of network music, performance of conventional music, such as that of classical repertoire, remains extremely difficult in network scenarios. Audio latency frequently results in progressively decreasing tempo c Bogdan Vera, Elaine Chew. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Bogdan Vera, Elaine Chew. “Towards Seamless Network Music Performance: Predicting an Ensemble’s Expressive Decisions for Distributed Performance”, 15th International Society for Music Information Retrieval Conference, 2014. 489 and difficulty in synchronizing. One aspect that has received less attention than the latency is the lack of visual contact when performing over the internet. Visual cues can be transmitted via video, but such data is at least as slow as audio, and was previously found to not be of significant use for transmitting synchronization cues even when the audio had an acceptable latency [6]. Since the start of network music research, several researchers have posited theoretically that music prediction could be the solution to network latency (see, for example, Chafe [2]). Ideally, if the music can be predicted ahead of time with sufficient accuracy, then it can be replicated at all connected end-points with no apparent latency. Recent efforts have made limited progress towards this goal. One example is a system for predicting tabla drumming patterns [12], and recent proposals by Alexandraki [1]. Both assume that the tempo of the piece will be at least locally smooth and, in the case Alexandraki’s system, timing alterations are always based on one reference recording. In many styles of music, such as romantic classical music, the tempo can vary widely, with musicians interacting on fine-scale note-to-note timing changes and using visual cues to synchronize. The tempo cannot be expected to always evolve in the exact same way as one previous performance, rather the musicians significantly improvise timing deviations to some constraints. 
In this paper we propose a system for predicting timing in network performance in real time, loosely inspired by Raphael’s approach based on Bayesian networks [11]. We propose and test a way to incorporate abstract notions of expressive context within a probabilistic framework, making use of time series clustering. Flossman et al. [8] employed similar ideas when they extended the YQX model for expressive offline rendering of music by using conditional gaussian distributions to link expressive predictions over time. Our model contains an extra layer of stylistic abstraction and is applied to modeling and real-time tracking of one performer or ensemble’s expressive choice at the inter-onset interval level. We also describe how the method could be used for predicting musical timing in network performance, and discuss ideas for further work. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2. MOTIVATION on those from preceding events, making the timing changes dependent on both musicians’ previous timing choices, while also allowing the system to respond to the interplay between the two musicians. Secondly, we abstract different ways of performing the piece by summarizing these larger scale differences in an unsupervised manner in a new discrete node in the network: a stylistic cluster node. Our goal is to use observable sources of information during a live performance to predict the timing of future notes so as to counter the effects of network latency. The sources of information we can use include the timing of previous notes and the intensity with which the notes are played. The core idea is reminiscent of Raphael’s approach to automatic accompaniment [11], which uses a Bayesian network relating note onset times, tempo and its change over time. In Raphael’s model, changes in tempo and local note timing are represented as independent gaussian variables, with distributions estimated from rehearsals. During a performance, the system generates an accompaniment that emulates the rehearsals by applying similar alterations of timing and tempo at each note event in the performance. The model has been demonstrated in live performances and proven to be successful, however as long as the system generates musically plausible expression in the accompaniment, it is difficult to determine an error value, as it is simply meant to follow a musician and replicate a performance style established in rehearsals. An underlying assumption of this statistical model is that the solo musician leading the performance tends to perform the piece with the same expressive style each time. In an ensemble performance scenario, two-way communication exists between musicians. The requirement for the system to simply ‘follow’ is no longer enough. As a step towards tighter ensemble, we set as a goal a stringent accuracy requirement for our prediction system: to have errors small enough−no higher than 20-40ms−as to be indistinguishable from the normal fluctuations in ensemble playing. Note that actual playing may have higher errors, even in ideal conditions, due to occasional mistakes and fluctuations in motor control. The same ensemble might also explore a variety of ways to perform a piece expressively. When expressive possibilities are explored during rehearsals, the practices establish a common ‘vocabulary’ for possible variations in timing that the musicians can then anticipate. Another goal of our system is to account for several distinct ways of applying expression to the same piece. 
This is accomplished in two ways. Like Flossman et al. [8], we deliberately encode the context of the local expression by introducing dependencies between the expressive tempo changes at each time step. We additionally propose and test a form of model selection using discrete variables that represent the chosen stylistic mode of the expression. For example, given two samples exhibiting the same tempo change, one may be part of a longer term tempo increase, while another may be part of an elastic time-stretching gesture. Knowing the stylistic context for a tempo change will allow us to better predict its trajectory. 3.1 Linear Gaussian Conditional Timing Prediction Our goal is to predict the timing of events such as notes, chords, articulations, and rests. In particular, we wish to determine the time until the next event given the score information and a timing model. We collapse all chords into single events. Assume that the performance evolves according to the following equations, tn+1 = sn ln + tn , and sn+1 = sn · δn , (1) where tn is the onset time of the n-th event, sn is the corresponding inter-beat period, ln is the length of the event in beats, and δn is a proportional change in beat duration that is drawn from the gaussian distributions Δn . For simplicity, there is no distinction between tempo and local timing in our model, though it could be extended to include this separation. Because δn ’s reflect proportional change in beat duration, prediction of future beat durations are done on a logarithmic scale: log2 sn+1 = log2 sn + log2 δn . log(tempo) = log(1/sn ), thus log sn as well, has been shown in recent research to be a more consistent measure of tempo variation in expressive performance [4]. The parameters of the Δn distributions are predicted during the performance from previous observations, such as δn−1 . Thus, each inter-beat interval, sn , is shaped from event to event by the random changes, δn . The conditional dependencies between the random variables are illustrated in Figure 1. The first and last layers in the network, labeled P1 and P2 in the diagram, are the observed onset times. The 3rd layer, labeled ‘Composite’ following Raphael’s terminology, embodies the time and tempo information at each event, regardless of which ensemble musician is playing, and it is on this layer that our model focuses. The 2nd layer, Expression, consists of the variables Δn . The Δn variables are conditioned upon their predecessors, using any number of previous timing changes as input; formally, they are represented by linear conditional gaussian distributions [9]. Let there be a Bayesian network node with a normal distribution Y . We can condition Y on its k continuous parents C = {C1 , . . . , Ck } and discrete parents D = {D1 , . . . , Dk } by using a linear regression model to predict the mean and variance of Y given the values of C and D. The following equation describes the conditional probability of Y given only continuous parent k nodes: P (Y |C = c) = N (β0 + βi ci , σ 2 ). 3. CONTEXTUALIZING TIMING PREDICTION We combine two techniques to implement ensemble performance prediction. First, we condition the expressive ‘update’ distributions characterizing temporal expression i=1 490 15th International Society for Music Information Retrieval Conference (ISMIR 2014) than attempting to construct a universal model for mapping score to performance. 
As a result, the amount of training data will generally be much smaller as we may only use the most recent recorded and annotated rehearsals of the ensemble. The next section describes a clustering method we use to account for large-scale differences in timing. 3.2 Unsupervised Stylistic Characterization Although we could add a large number of previous inputs to each of the Δn nodes, we cannot tractably condition these variables’ distributions on potentially hundreds of previous observations. This would require a large amount of training data to estimate the parameters in a meaningful way. Instead, we propose to summarize larger-scale expression using a small number of discrete nodes representing the stylistic mode. For example, a musician may play the same section of music in few distinct ways, and a listener may describe it as ‘static’, ‘swingy’ or ‘loose’. If these playing styles could be classified in real time, prediction could be improved by considering this stylistic context. Our ultimate goal is to perform this segmentally on a piece of music, discovering distinct stylistic choices that occured in the ensemble’s rehearsals. In this paper, we present the first steps towards this goal: we characterize the style of the entire performance using a single discrete stylistic node. The stylistic node is shown at the top of Figure 1. In our model this node links to all of the Δn nodes in the piece, so that each of the Δn ’s is now linearly dependent on the previous timing changes with weights that are dependent on the stylistic node. Assuming that each Δn node is linked to one previous one, the parameters of the Δn distributions are then predicted at run-time using Figure 1: A section of the graphical model. Round nodes are continuous gaussian variables, and the square node (S) is a discrete stylistic cluster node. This is the equation for both continuous and discrete parents: k P (Y |D = d, C = c) = N (βd,0 + βd,j cj , σd2 ). j=1 Simply speaking, the mean and variance of each linear conditional gaussian node is calculated from the values of its continuous and discrete parent nodes. The mean is derived through linear regression from its continuous parents’ values with one weight matrix per configuration of its discrete parents. The use of conditional gaussian distributions means that rather than having fixed statistics for how the timing should occur at each point, the parameters for the timing distributions are predicted in real time from previous observations using linear regression. This simple linear relationship provides a means of predicting the extent of temporal expression as an ongoing gesture. For example, if the performance is slowing down, the model can capture the rate of slowdown, or a sharp tempo turnaround if this occurred during rehearsals. Our network music approach involves interaction between two actual musicians rather than a musician and a computer. Thus, each event observed is a ‘real’ event, and we update the Δn probability distributions at each step during run-time with the present actions of the musicians themselves. Unlike a system playing in automatic accompaniment or an expressive rendering system, our system is never left to play on its own, and its task is simply to continue from the musicians’ choices, leaving less opportunity for errors to accumulate. 
Additionally, we can correct the musicians’ intended timing by compensating for latency post-hoc - this implies that we can make predictions that emulate what the musicians would have done without the interference of the latency. We may also choose the number of previous changes to consider. Experience shows that adding up to 3 previous inputs improves the performance moderately, but the performance decreases thereafter with more inputs. For simplicity, we currently use only one previous input, which provides the most significant step improvement. In constrast to a similar approach by Flossman et al. [8], we do not attempt to link score features to the performance; we only consider the local context of their temporal expression. Our goal is to capture the essence of one particular ensemble’s interpretation of a particular piece rather P (Δt |S = s, Δt−1 = δ) = N (βs,0 + βs,1 δ, σs2 ), where S is the style node. To predict note events, we can simply take the means of the Δn distributions, and use Equation 1 to find the onset time of the next event given the current one. To use this model, we must first discover the distinct ways (if any) in which the rehearsing musicians perform the piece. We apply k-means clustering to the log(δn ) time series obtained from each rehearsal. We find the optimal number of clusters by using the Bayes Information Criterion (BIC) as described by Pelleg and Moore [10]. Note that other methods exist for estimating an optimal number of clusters. To train the Bayesian network, a training set is generated containing all of the δn values for each rehearsal as well as the cluster to which each time series is allocated. We then use the algorithm by Murphy [9] to find all the parameters of the linear conditional nodes. Note that all of the nodes are observable and we have training data for the Δn . During the performance, the system can update its belief about the stylistic node’s value from the note timings that have been observed at any point; we do not need to re-cluster the performance, as the network has learned the relationships between the Δn ’s and the stylistic node. We 491 15th International Society for Music Information Retrieval Conference (ISMIR 2014) We evaluated the system using a ‘leave-one-out’ approach, where out of the 20 performances we always trained on 19 of them and tested on the remaining one. We always used one previous input to the Δn nodes, using the actual observations in the performances rather than our predictions (like the extended YQX), simulating the process of live performance. We evaluated the prediction accuracy by measuring timing errors, which we define as the absolute difference between the true event times and those predicted by the model (in seconds). The training performances were clustered correctly in all cases, dividing the dataset into the two styles, with the first 10 performances being grouped with cluster 1 and the second 10 becoming part of cluster 2. Figure 3 shows the stylistic inference process. In the matrix, performances are arranged as rows, with events on the x-axis. Recall that we predict the time between events rather than just notes. So, we also consider the timing of rests, and chords are combined into single events rather than individual notes. The colors indicate the inferred value of the style node: grey for Style 1 and white for Style 2. We see that the system correctly infers the stylistic cluster of each performance within the first 19 events. 
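The event-by-event belief update behind these results can be written compactly for this model: since every Δ node is observed, exact message passing over the network reduces to multiplying per-style likelihoods of the observed tempo changes. The sketch below is our own illustration of that reduction, assuming the paper's choice of a single previous input; the per-style regression parameters (beta0, beta1, var) are assumed to come from the training step, and all names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def update_style_posterior(prior, style_params, prev_delta, delta):
    """One event's update of the belief over the stylistic cluster node S.

    prior        -- dict: style -> current probability
    style_params -- dict: style -> (beta0, beta1, var) of its linear conditional gaussian
    prev_delta   -- previously observed log2 tempo change (the continuous parent)
    delta        -- newly observed log2 tempo change
    """
    scores = {}
    for s, (b0, b1, var) in style_params.items():
        mean = b0 + b1 * prev_delta              # style-specific regression mean
        scores[s] = prior[s] * gaussian_pdf(delta, mean, var)
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

# Example use (deltas and fitted_params are placeholders for observed log2 changes
# and the parameters estimated from the clustered rehearsals):
# belief = {1: 0.5, 2: 0.5}                      # flat prior over two styles
# for prev, cur in zip(deltas, deltas[1:]):
#     belief = update_style_posterior(belief, fitted_params, prev, cur)
```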
In many cases the classification assigns the performance to the correct cluster after only two events. We use the message passing algorithm of Bayesian networks to infer the most likely state of the node. As the performance progresses, the belief about the state of the node is gradually established. Intuitively, the system arrives at a stable answer after some observations; otherwise, the overall style is ambiguous. The state of the node is then used to place future predictions into some higher-level context. The next section shows that the prediction performance is improved by using the stylistic node to select the best regression parameters for predicting the subsequent timing changes, which can be thought of as a form of model selection.

4. EVALUATION

4.1 Methodology
In this section we present an evaluation of the basic form of our model. Evaluation of such predictive models remains a challenge because testing in live performance requires further work on performance tracking and optimization, while offline testing necessitates a large number of annotated performances from the same ensemble. We present initial results on a small dataset; in our future work we will study real-time performances of more complex pieces. We evaluate the performance of three models: one uses linear conditional nodes and a stylistic cluster node; the second uses only linear conditional nodes; and the third has independent gaussian distributions for the Δ variables. Our dataset consists of 20 performances by one pianist of the short custom-composed piece shown in Figure 2. Notice that we have not added any dynamics or tempo-related markings; the interpretation is left entirely to the musicians. While this is not an ensemble piece, the performances are sufficient to test the prediction accuracy of our model in various conditions. In this simple example, we consider only the composite layer in the model, without P1 and P2.

Figure 3: Matrix showing the most likely style state after each event's observed δ. Performances 1-10 are in Style 1, and 11-20 are in Style 2. Classification result: grey = Style 1, white = Style 2.

Figure 4 shows the tempo information for the dataset. Figure 4(a) shows the inter-beat period contours of all of the performances, while Figure 4(b) shows boxplots (indicating the mean and variability) of the period at each musical event, for the entire dataset and for the two clusters.

4.2 Results
Figure 5a and Figure 5b show the performance of the models, measured using the mean absolute error averaged over events in each performance, and over performances for each event, respectively. We also show a detailed 'zoomed-in' plot of the errors between events 20-84 in Figure 5c to make the different models' mean errors clearer. For network music performance, we would want to predict at least as far forward as needed to counter the network (and other system) latency. As some inter-event time differences may be shorter than the latency, we may occasionally need to predict more than one event ahead. The model with stylistic clustering and linear conditional nodes performed best, followed by the one with only linear conditional nodes, then the model with independent Δn nodes.

Figure 2: Custom-composed piano test piece. The piece was played on an M-Audio AXIOM MIDI keyboard in one of two expressive styles decided beforehand, ten times for each style.
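As a concrete reading of this evaluation protocol, the sketch below scores one held-out performance by one-step-ahead onset prediction, following Equation 1 and the convention of feeding the model the actual previous observation. It is our own illustration with hypothetical names (onset_errors, predict_delta); the fitted regression, and optionally the style-dependent parameter choice, are hidden behind the predict_delta callable, and the indexing convention is ours rather than the paper's.

```python
import math

def onset_errors(onsets, beats, predict_delta):
    """Absolute one-step-ahead timing errors (in seconds) for one performance.

    onsets        -- observed onset times t_0 ... t_N in seconds
    beats         -- score length l_n of each event in beats
    predict_delta -- callable(prev_delta) -> predicted log2 tempo change, e.g.
                     the mean of the fitted (possibly style-selected) regression
    """
    # Observed inter-beat periods, s_n = (t_{n+1} - t_n) / l_n.
    periods = [(onsets[i + 1] - onsets[i]) / beats[i] for i in range(len(onsets) - 1)]
    errors = []
    for n in range(2, len(onsets) - 1):
        # Last fully observed proportional change, on a log scale.
        prev_delta = math.log2(periods[n - 1] / periods[n - 2])
        # Predicted period for the current event, then the next onset time
        # via t_{n+1} = t_n + s_n * l_n (Equation 1).
        pred_period = periods[n - 1] * 2 ** predict_delta(prev_delta)
        pred_onset = onsets[n] + pred_period * beats[n]
        errors.append(abs(pred_onset - onsets[n + 1]))
    return errors

# Example: a purely linear predictor fitted elsewhere, delta_hat = b0 + b1 * prev.
# errs = onset_errors(onsets, beats, lambda prev: b0 + b1 * prev)
```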
We used IRCAM's Antescofo score follower [7] for live tracking of the performance in our system, and for annotation of the note and chord events. The log-period plots for every performance in the dataset are shown in Figure 4a. The changes in log-period per event are shown in Figure 4b, and we also show the same changes for the data in each cluster found, to demonstrate the difference between the two playing styles.

Figure 4: Tempo data. (a) Log-period per event for every performance in the dataset. (b) Boxplots showing the median and variability of the log-period change at each event. Top: unclustered data, Middle: first centroid, Bottom: second centroid.

In all cases the errors were higher for the second style (the latter 10 performances), which was much looser than the first. The mean absolute errors for each model, considering all of the events in all of the performances, are summarized in Table 1.

Table 1: Overall Timing Errors for Each Model
Model                        Mean Abs. Error
Independent                  69.8 ms
Conditional                  57.4 ms
Clustering and Conditional   48.5 ms

Observe in Figure 5b that some parts of the performance were very difficult to predict. For example, we note high prediction errors in the first 12 events of the piece and one large spike in the error at the end of the piece. These are 1-bar and 2-bar long chords, for which musicians in an ensemble would have to use visual gestures or other information to synchronize. We would not expect any prediction system to do better than a musician anticipating the same timing without any form of extra-musical information. We discuss potential applications of music prediction for virtual cueing in the next section. The use of clustering and conditional timing distributions reduced the error for the events which were poorly predicted with independent timing distributions. For much of the piece the mean error was as low as 15 ms, but even for these predictable parts of the performance, the models with conditional distributions and clustering lowered the error, as can be seen from Figure 5c.

Figure 5: Mean absolute error per event, comparing the model with stylistic clustering and conditional nodes, the model with conditional nodes only, and the model with no clustering and no conditioning. (a) Mean absolute error for each performance. (b) Mean absolute error per event, over the whole performance. (c) A 'zoomed-in' view of the errors between events 20-84.

5. CONCLUSIONS AND FUTURE WORK
We have outlined a novel approach to network music prediction using a Bayesian network incorporating contextual inference and linear gaussian conditional distributions. In an evaluation comparing the model with stylistic clustering and linear conditional nodes, one with only linear conditional nodes without clustering, and one with indepen-
The end goal of this research is to implement and evaluate network music performance systems based on the prediction model. Whether music prediction can ever be precise enough to allow seamless network performance remains an open question. Important questions arise in pur- 5. CONCLUSIONS AND FUTURE WORK Model Independent Conditional Clustering and Conditional We have outlined a novel approach to network music prediction using a Bayesian network incorporating contextual inference and linear gaussian conditional distributions. In an evaluation comparing the model with stylistic clustering and linear conditional nodes, one with only linear conditional nodes without clustering, and one with indepen- Mean Abs. Error 69.8ms 57.4ms 48.5ms Table 1: Overall Timing Errors for Each Model 493 15th International Society for Music Information Retrieval Conference (ISMIR 2014) suit of this goal: how much should the system lead the musicians to help them stay in time without making the performance artificial? Predicting musical timing with sufficient accuracy will open up interesting avenues for network music research, especially when we consider parallel research into predicting other information such as intensity and even pitch information, but whether any musician would truly want to let a machine impersonate them expressively remains to be seen, which is why we propose that a ‘minimally-invasive’ conductor-like approach to regulating tempo would be more appropriate than complete audio prediction. accuracy the timing in sections of a piece requiring temporal coordination, then we could help musicians synchronize by providing them with perfectly simultaneous predicted cues. We regard the use of predictive virtual cues as less invasive to networked ensembles than complete predictive sonification. In situations where the audio latency is low enough for performance to be feasible but video latency is still too high for effective transmission of gestural cues, predictive sonification may be omitted completely, and virtual cues could be implemented as a regulating factor. 5.1 The Bayesian Network This research was funded in part by the Engineering and Physical Sciences Research Council. 6. ACKNOWLEDGEMENTS It would be straightforward to extend our model by implementing prediction of timing from other forms of expression that tend to correlate with tempo. For example, using event loudness in the prediction would simply require the addition of another layer of variables in the Bayesian network and conditioning the timing variables on these nodes as well. 7. REFERENCES [1] C. Alexandraki and R. Bader. Using computer accompaniment to assist networked music performance. In Proc. of the AES 53rd Conference on Semantic Audio, London, UK, 2013. [2] C. Chafe. Tapping into the internet as an acoustical/musical medium. Contemporary Music Review, 28, Issue 4:413–420, 2010. 5.2 Capturing Style Much work remains to expand on the characterization of stylistic mode. As previously mentioned, we plan to explore segmental stylistic characterization, considering different contextual information for each part of the performance. In our current model we use only one stylistic node. This may be a plausible for a small segment of music, but in a longer performance the choice of performance style may vary over time. If the predicted performance starts within one style but changes to another, the model is ill-informed to predict the parameters. 
In our future work we would like to extend the model to capture such stylistic tendencies over time. One approach would require presegmentation of the piece based on the choice of expressive choices during the reharsal stage, and introduction of one stylistic node per segment. The prediction context would then be local to each part of the performance. We may then, for example, have causal conditional dependencies between the stylistic nodes in each segment of the piece, which would allow the system to both infer the style within a part of the performance from what is being played and from the previous stylistic choices. In practice, a musician or ensemble’s rehearsals may not comprise of completely distinct interpretations; however, capturing expression contextually will likely offer a larger degree of freedom to the musicians in an internet performance, who may then explore a greater variety of temporal and other articulations. 5.3 Virtual Cueing Virtual cueing forms an additional application of interest. As mentioned at the start of the paper, visual communication is generally absent or otherwise delayed in network music performance. If we could predict with reasonable [3] C. Chafe and M. Gurevich. Network time delay and ensemble accuracy: Effects of latency, asymmetry. In Proc. of the 117th Audio Engineering Society Convention, 2004. [4] E. Chew and C. Callender. Conceptual and experiential representations of tempo: Effects on expressive performance comparisons. In Proc. of the 4th International Conference on Mathematics and Compution in Music, pages 76–87, 2013. [5] E. Chew, A. Sawchuk, C. Tanoue, and R. Zimmermann. Segmental tempo analysis of performances in user-centered experiments in the distributed immersive performance project. In Proc. of the Sound and Music Computing Conference, 2005. [6] E. Chew, R. Zimmermann, A. Sawchuk, C. Kyriakakis, and C. Papadopolous. Musical interaction at a distance: Distributed immersive performance. In Proc. of the 4th Open Workshop of MUSICNETWORK, Barcelona, 2004. [7] A. Cont. Antescofo: Anticipatory synchronization and control of interactive parameters in computer music. In Proc. of the International Computer Music Conference, 2008. [8] S. Flossmann, M. Grachten, and G. Widmer. Guide to Computing for Expressive Music Performance, chapter Expressive Performance Rendering with Probabilistic Models, pages 75– 98. Springer Verlag, 2013. [9] K. P. Murphy. Fitting a conditional linear gaussian distribution. Technical report, University of British Columbia, 1998. [10] D. Pelleg and A. W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proc. of the Seventeenth International Conference on Machine Learning, pages 727–734, 2000. [11] C. Raphael. Music plus one and machine learning. In Proc. of the 27th International Conference on Machine Learning, pages 21–28, 2010. [12] M. Sarkar. Tablanet: a real-time online musical collaboration system for indian percussion. Master’s thesis, MIT, 2007. 494 15th International Society for Music Information Retrieval Conference (ISMIR 2014) DETECTION OF MOTOR CHANGES IN VIOLIN PLAYING BY EMG SIGNALS Ling-Chi Hsu, Yu-Lin Wang, Yi-Ju Lin, Alvin W.Y. Su Department of CSIE, National Cheng-Kung University, Taiwan [email protected]; [email protected]; [email protected]; [email protected] Cheryl D. 
Metcalf Faculty of Health Sciences, University of Southampton, United Kingdom [email protected] explored the dynamic pressures to analyze how pianists depressed the piano keys and hold them down during playing. The pressure measurement advances the evaluation of the keystroke in piano playing [4-5]. The use of muscle activity via electromyography (EMG) signals allows further investigation into the motor control sequences that produce the music. EMG is a technique which evaluates the electrical activity of the muscle by recording the electrical potentials when muscles generate an electrical voltage during activation, which results in a movement or coordinated action. EMG is generally recorded in two protocols; invasive electromyography (IEMG) and surface electromyography (SEMG). IEMG is used to measure deep muscles and discrete positions using a fine-wire needle; however, it is not a preferable model for subjects due to the invasiveness and being less repetitive. Compared to IEMG, SEMG has the following characteristics: (1) it is noninvasive; (2) it provides global information; (3) it is comparatively simple and inexpensive; (4) it is applicable by non-medical personnel; and (5) it can be used over a longer time during work and sport activities [6]. Therefore, the SEMG is suitable for use within biomechanics and movement analysis, and was used in this paper. For the analysis of musical performance, EMG has been used to evaluate behavioral changes of the fingers [7-8], upper limbs [9-10] shoulder [11-12] and wrist [13] in piano, violin, cello and drum players. The EMG method allows for differentiating the variations and reproducibility of muscular activities in individual players. Comparing the EMG activity between expert pianists and novice players [7-14] has also been studied. There have been many approaches developed for segmentation of EMG signals [15]. Prior EMG segmentation techniques were mainly used to detect the time period for a certain muscle contraction, but we found that the potential variations from various muscles maybe different during a movement. It causes the conventional EMG segmentation to fail to extract the accurate timing of movement in instrument playing. In this paper, the timing activation of the muscle group is assessed, and the changes in motor control of players during performance are investigated. We propose a system with the function of concurrently recording the audio signal and behavioral changes (EMG) while playing an instrument. This work is particularly focused on violin playing, which is considered difficult to segment with the ABSTRACT Playing a music instrument relies on the harmonious body movements. Motor sequences are trained to achieve the perfect performances in musicians. Thus, the information from audio signal is not enough to understand the sensorimotor programming in players. Recently, the investigation of muscular activities of players during performance has attracted our interests. In this work, we propose a multi-channel system that records the audio sounds and electromyography (EMG) signal simultaneously and also develop algorithms to analyze the music performance and discover its relation to player’s motor sequences. The movement segment was first identified by the information of audio sounds, and the direction of violin bowing was detected by the EMG signal. Six features were introduced to reveal the variations of muscular activities during violin playing. 
With the additional information of the audio signal, the proposed work could efficiently extract the period and detect the direction of motor changes in violin bowing. Therefore, the proposed work could provide a better understanding of how players activate the muscles to organize the multi-joint movement during violin performance. 1. INTRODUCTION For musicians, their motor skills must be honed by many hours of daily practice to maintain the performing quality. Motor sequences are trained to achieve the perfect performances. Playing a musical instrument relies on the harmonious coordination of body movements, arm and fingers. This is fundamental to understanding the neurophysiological mechanisms that underpin learning. It therefore becomes important to understand the sensorimotor programming in players. In the late 20th century, Harding et al. [1] directly measured the force between player’s fingers and piano keys with different skill levels. Engel et al. [2] found there is an anticipatory change of sequential hand movements in pianists. Parlitz et al. [3] © L.C. Hsu, Y.J. Lin, Y.L. Wang, A.W.Y. Su, C.D. Metcalf. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: L.C. Hsu, Y.J. Lin, Y.L. Wang, A.W.Y. Su, C.D. Metcalf. “Detection Of Motor Changes In Violin Playing By EMG Signals”, 15th International Society for Music Information Retrieval Conference, 2014. 495 15th International Society for Music Information Retrieval Conference (ISMIR 2014) soft onsets of the notes. The segment with body movements was first identified by the information of audio sounds. It is believed that if there is an audio signal, then there is a corresponding movement. Six features were then introduced to EMG signals to discover the variation of movements. This work identifies the individual movement segments, i.e. up-bowing and down-bowing, during violin playing. Thus, how motor systems operated in musicians and affected during performance could be explored using this methodology. This paper is organized as follows. The multi-channel signal recording system and its experimental protocol are shown in section 2. In section 3, we introduce the proposed algorithms for segmenting the EMG signal with additional audio information. The experimental results are shown in section 4 and the conclusion and future work are given in section 5. Two seconds resting time was given between the two consecutive movements. The EMG sampling rate was 1000Hz. The electrodes attached on the surface of the player’s skin as shown Figure 2. In this study, the direction of violin bowing, i.e. upbowing and down-bowing, is detected by the corresponding muscle activity (EMG signal). The total of 8 muscles in the upper limb and body is measured in our system. Figure 3 shows the 8-channel EMG signals of up-bowing movement, and potential variations were shown in all channels when bowing. Three types of variations were observed and grouped: (1) Channel#1 to Channel#6: it is seen that the trend of six channels is similar; additionally, the average noise floor between channel#3 and channel#6 are lower than others; finally, we choose channel#6 because the position is convenient to place the electrode. (2) Channel#7: the channel involving the most noise. (3) Channel#8: although it has more noise than Channel#1 to Channel#6, it is the important part when we have a whole-bowing movement. 2. 
AUDIO SOUNDS AND BIOSIGNAL RECORDING SYSTEM
This work proposes a multi-channel signal recording system capable of recording audio and EMG signals concurrently. The system is illustrated in Figure 1 and comprises: (a) a signal pre-amplifier acquisition board, (b) an analog-to-digital signal processing unit, and (c) a host system.

Figure 1. The proposed multi-channel recording system for recording the audio signal and EMG concurrently.

The violin signal was recorded in a chamber; the microphone was placed 30 cm from the player and the sampling rate was 44100 Hz. With this real violin recording, the sound inevitably contains noise and artifacts. Furthermore, there are three subjects in the experiment database. The violinists played music while being recorded, and each participant was requested to press one string during playing. This experiment included two tasks for performance evaluation, and each task contained 10 movements. The movements for task#1 and task#2 are defined as follows.

Movements for task#1: (1) The player presses the 2nd string and then is idle for 2 s (beginning the bow at the frog). (2) Pulls the bow from the frog to the tip for 4 s (whole bow down). (3) Pulls the whole bow up for 4 s.

Figure 2. The placement of the electrodes attached on the player's skin [16, 17].

Movements for task#2: (1) The player presses the 3rd string and then is idle for 2 s (beginning the bow at the tip). (2) Pulls the whole bow up for 4 s. (3) Pulls the whole bow down for 4 s.

Figure 3. The 8-channel EMG signals of up-down bowing movements.

To reduce the computation and retain the variety of features, only channel#6 and channel#8 were thereafter used for further analysis. Figure 4 shows the EMG signals of channel#6 and channel#8 during down-bowing.

Figure 4. The EMG signals of the triceps (channel#6) and pectoralis (channel#8) during down-bowing movements.
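Since the audio (44100 Hz) and EMG (1000 Hz) streams are acquired concurrently but at different rates, later stages need to map intervals located in one stream onto the other. The following sketch shows one way this mapping could be done; it is our own illustration rather than the authors' code, the function names are hypothetical, and it assumes both streams share a common start time, which is what concurrent acquisition on one board is meant to guarantee.

```python
AUDIO_SR = 44100   # audio sampling rate in Hz, as stated above
EMG_SR = 1000      # EMG sampling rate in Hz

def audio_interval_to_emg(onset_sample, offset_sample):
    """Convert an interval found in the audio stream to EMG sample indices."""
    onset_t = onset_sample / AUDIO_SR            # seconds from the common start
    offset_t = offset_sample / AUDIO_SR
    return int(round(onset_t * EMG_SR)), int(round(offset_t * EMG_SR))

def cut_emg_segment(emg_channel, onset_sample, offset_sample):
    """Return the samples of one EMG channel spanned by an audio onset/offset pair."""
    lo, hi = audio_interval_to_emg(onset_sample, offset_sample)
    return emg_channel[lo:hi]
```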
3. METHOD
The following section introduces the proposed algorithm for detecting the bowing states during violin playing. The proposed system is capable of recording audio and EMG signals concurrently, and in this study a bowing state detection algorithm was developed and implemented on the embedded system. The flowchart of the proposed method is shown in Figure 5.

Figure 5. Flowchart of the proposed system.

The EMG signals were segmented according to the violin sounds. Then, six features were computed to detect the direction of the bowing movements. For analyzing the audio signal, the window size of a frame is 2048 samples and the hop size is 256 samples.

3.1 Onset/Transition/Offset detection
This section elaborates on the state detection of the audio sounds. The states of the audio sounds are defined as Onset, Transition and Offset in this study. The Onset is the beginning of the bowing; the Transition is the timing at which the next bowing movement occurs; the Offset is the end of the bowing; and the Sustain is the duration of the note segment. Both frequency and spatial features were calculated and used as the inputs to our developed finite state machine (FSM). The diagram of the proposed FSM is illustrated in Figure 6. The output of the FSM identifies the result of note detection and is further used for EMG segmentation.

Figure 6. The state diagram of the audio sounds.

The violin signal was analyzed in both the frequency and time domains. For frequency analysis, the violin signal was first transformed by the short-time Fourier transform. The inverse correlation (IC) was then applied to estimate the possible note onset period. The inverse correlation (IC) coefficients are computed from the correlation coefficients of two consecutive discrete Fourier transform spectra [18]. A support vector machine (SVM), denoted as SVMic (1), was applied for detecting the accurate timing of onsets. SVM is a popular methodology, with high speed and simple implementation, for classification and regression analysis [19].

SVMic = 0 for a non-transition, 1 for a transition.  (1)

For spatial analysis, the amplitude envelope (AE) was used to detect the segment of the sound data. AE is evaluated as the maximum value of a frame. There are two similar classifiers, called SVMae1 (2) and SVMae2 (3). SVMae1 is used to identify the possible onsets and SVMae2 is used to identify the possible offsets.

SVMae1 = 0 for a non-onset, 1 for an onset.  (2)

SVMae2 = 0 for a non-offset, 1 for an offset.  (3)

Figure 7 shows (a) a segment of the audio sounds with one sequence of down-bowing and up-bowing, while Figure 7(b) and (c) display the results of IC and AE, respectively. During the bowing state, the IC value is extremely small compared to the non-bowing state, so IC seems to be a good index for identifying whether the violin is being played or not. However, it can be seen that a time deviation is introduced if the system simply applies a hard threshold, e.g. 0.3. Alternatively, the AE value becomes larger in the playing state, but the issue of time deviation is also present for this feature if a hard threshold is applied. After calculating the IC and AE values, their variation is used as one set of input data for the SVMs. The time period of each data point is 100 ms. Therefore, SVMic, SVMae1 and SVMae2 are designed to detect the most plausible timing of onset, transition and offset.

Figure 7. (a) The audio sounds of down-bowing and up-bowing; (b) the results of IC; (c) the results of AE.
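The two audio features can be computed with a few lines of frame-based processing. The sketch below is a minimal illustration under our own naming: AE is taken as the frame maximum as described above, while IC is approximated as 1 minus the Pearson correlation of two consecutive magnitude spectra, since the exact formulation lives in [18]. The 100 ms variation windows that actually feed SVMic, SVMae1 and SVMae2 are omitted.

```python
import numpy as np

WIN, HOP = 2048, 256          # analysis window and hop size stated above

def ic_ae_features(audio):
    """Per-frame amplitude envelope (AE) and inverse correlation (IC).

    audio -- 1-D numpy array of audio samples.
    Returns two arrays of length n_frames.
    """
    window = np.hanning(WIN)
    n_frames = 1 + (len(audio) - WIN) // HOP
    ae = np.zeros(n_frames)
    ic = np.ones(n_frames)                      # first frame has no predecessor
    prev_mag = None
    for i in range(n_frames):
        frame = audio[i * HOP: i * HOP + WIN]
        ae[i] = np.max(np.abs(frame))           # amplitude envelope of the frame
        mag = np.abs(np.fft.rfft(frame * window))
        if prev_mag is not None:
            r = np.corrcoef(mag, prev_mag)[0, 1]
            ic[i] = 1.0 - r                     # small while the bow sustains a note
        prev_mag = mag
    return ae, ic
```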
3.2 Detection of bowing direction
In each movement, there are one onset, one offset, and several transitions. However, the total number of transitions will differ from the number of notes. After the detection of the bowing states is completed, the durations between onset and offset are used for segmenting the EMG signals of the triceps (channel#6) and pectoralis (channel#8). For each note duration, there are three cases: (1) the duration from the onset to the first transition; (2) the duration from the current transition to the next transition; (3) the duration from the last transition to the offset. This note duration extracted from the audio sound is called an active frame, and the active frames are of varying lengths.

For each active frame, six features from [20] were applied to calculate the variations of the EMG signal while bowing. The features are:
- Mean absolute value (MAV)
- Mean absolute value slope (MAVS)
- Zero crossings (ZC)
- Slope sign changes (SSC)
- Waveform length (WL)
- Correlation variation (CV)

Here, the active frame is experimentally divided into 20 segments for calculating MAV and WL; thus each active frame has 20 values of MAV and WL. For CV, we calculate the auto-correlation and cross-correlation of channel#6 and channel#8, and therefore there are 3 values of CV for each active frame. Table 1 lists the number of each feature for each channel.

Table 1. The number of each feature per channel
Feature   MAV   MAVS   ZC   SSC   WL
Number    20    19     1    1     20

A more detailed description of the applied features can be found in [20]. Figure 8 displays the triceps EMG signal of one active frame (8 s ~ 16 s) and the results calculated by MAV, MAVS, ZC, SSC and WL. It can be seen that variations are exhibited for the six features in violin playing with a down-up bowing movement.

Figure 8. One down-up bowing movement and its six features: (a) the down-bowing movement, (b) the up-bowing movement.

The detection of the bowing direction is also determined by an SVM classifier, denoted as SVMdir (3). For SVMdir, a total of 125 inputs are used (61 inputs each for channel#6 and channel#8, plus 3 values of CV), and it identifies whether the active EMG frame is in the up-bowing or down-bowing state.

SVMdir = 0 for up-bowing, 1 for down-bowing.  (3)

3.3 Performance evaluation
In our experiment, 10-fold cross-validation is used for SVMic, SVMae and SVMdir, and the performance evaluation calculates the accuracy (4), precision (5), recall (6) and F-score (7) of each detection function.

Accuracy = (TruePositive + TrueNegative) / (Positive + Negative)  (4)
Precision = TruePositive / (TruePositive + FalsePositive)  (5)
Recall = TruePositive / (TruePositive + FalseNegative)  (6)
F-score = 2 · Precision · Recall / (Precision + Recall)  (7)

A true positive means the movement was correctly detected; a false positive is a falsely detected movement; and a false negative is a missed detection.

4. EXPERIMENTAL RESULTS
In this section, the efficiency of the proposed SVMs is examined. An example of the proposed EMG segmentation is then compared to the prior work [15]. Finally, the averaged and overall simulation results are given.

4.1 The performance of the SVM classifications
To illustrate that both the proposed IC and AE effectively identify the sound states of onset and offset, respectively, Figure 9 shows the trend of the IC and AE values in one down-up bowing movement, using the classification results of SVMic, SVMae1 and SVMae2. Table 2 shows that, with the given FSM, the detection rates of onsets, transitions and offsets are 90%, 100% and 100%, respectively.

Figure 9. The results of the 3 classifiers: (a) onsets, (b) transitions, (c) offsets.

violin signal of task#1 with three movements. Figure 11 (b) and (c) are the EMG segmentations of our proposed method and of [15], respectively. Channel#6 is used in this example to illustrate a sample output. It is believed that if there is an audio signal, then there is a corresponding movement. It can be seen that the results segmented by [15], without the additional information of the audio signal, could not precisely identify the segments of the movements during bowing. However, the proposed method is based on the information from the audio signals and clearly identifies the segments of behavioral changes during violin playing.

Figure 11. (a) The violin signal; (b) the proposed EMG segmentations; (c) the EMG segmentations of [15].

4.3 The simulation results
The detection results for the violin bowing direction are given in Table 3, where accuracy, precision, recall and F-score are presented.

Table 3. The detection results of the bowing direction: (1) on the ground-truth active frames; (2) on the extracted active frames.
            (1)       (2)
Accuracy    85%       87.5%
Precision   76.92%    82.61%
Recall      100%      95%
F-score     86.96%    88.37%

The average detection results were shown to have excellent performance, with an accuracy of 85%~87.5%.
The results show that the proposed method efficiently identifies the bowing direction in violin playing. Table 2. The detection results of the bowing states with the given FSM. Onset Transition Offset 90.00% 100% 100% Accuracy 90.00% 100% 100% Precision 90.00% 100% 100% Recall 90.00% 100% 100% F-score Figure 10 shows the distribution of active EMG frames during up-bowing and down-bowing states, and it displays the distribution of MAV, MAVS and WL. The SVMdir classifies the data with 85% accuracy. 5. CONCLUSION AND FUTURE WORK The proposed biomechanical system for recording the audio sounds and EMG signals during playing an instrument was developed. The proposed method not only extracts the segment during movement and detects the moving direction of bowing, but with the additional information of violin sounds, changes in muscle activity as an element of motor control, could be efficiently detected when compared to the prior EMG segmentation (without any sound information). To the authors’ knowledge, this is the first study which proposes such concept. Figure 10. (a) The original distribution of up-bowing and down-bowing EMG frames; (b) the results of SVMdir classification. 4.2 EMG segmentation The results of EMG segmentation and its comparison to [15] are both illustrated in Figure 11. Figure 11 shows the 499 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [11] A. Fjellman-Wiklund and H. Grip, J. S. Karlsson et al.: “EMG trapezius muscle activity pattern in string players: Part I—is there variability in the playing technique?,” International journal of industrial ergonomics, vol. 33, no. 4, pp. 347-356, 2004. [12] J. G. Bloemsaat, R. G. Meulenbroek, and G. P. Van Galen: “Differential effects of mental load on proximal and distal arm muscle activity,” Experimental brain research, vol. 167, no. 4, pp. 622-634, 2005. Future work will improve the detection rate of onset, transition and offset to extract the period of an active frame more precisely. The detection of the bowing direction will be also improved. Furthermore, the relationship between the musical sounds and the muscular activities of players in musical performance will be observed and analyzed. By measuring the music and the player’s muscular activity, better insights can be made into the neurophysiological control during musical performances and may even prevent players from the injuries as greater insights into these mechanisms are made. [13] S. Fujii and T. Moritani: “Spike shape analysis of surface electromyographic activity in wrist flexor and extensor muscles of the world's fastest drummer,” Neuroscience letters, vol. 514, no. 2, pp. 185-188, 2012. 6. REFERENCES [1] DC. Harding, KD. Brandt, and BM. Hillberry: “Minimization of finger joint forces and tendon tensions in pianists,” Med. Probl. Perform Art pp.103-104, 1989. [14] S. Furuya and H. Kinoshita: “Organization of the upper limb movement for piano key-depression differs between expert pianists and novice players,” Experimental brain research, vol. 185, no. 4, pp. 581-593, 2008. [15] P. Mazurkiewicz, “Automatic Segmentation of EMG Signals Based on Wavelet Representation,” Advances in Soft Computing Volume 45, 2007, pp 589-595 [16] Bodybuilding is lifestyle! "Chest - Bodybuilding is lifstyle!"http://www.bodybuildingislifestyle.com/che st/. [17] Bodybuilding is lifestyle! "Chest - Bodybuilding is lifestyle!" http://www.bodybuildingislifestyle.com/hamstrings/. [2] KC. Engel, M. Flanders, and JF. 
Soechting: “Anticipatory and sequential motor control in piano playing,” Exp Brain Res. pp. 189-199, 1997. [3] D. Parlitz, T. Peschel, and E. Altenmuller: “Assessment of dynamic finger forces in pianists: Effects of training and expertise,” J. Biomech. pp.1063-1067, 1998. [4] H. Kinoshita , S. Furuya , T. Aoki, and E. Altenmüller E.: “Loudness control in pianists as exemplified in keystroke force measurements on different touches,” J Acoust Soc Am. pp. 2959-69, 2007. [5] AE. Minetti, LP. Ardigò, and T. McKee: “Keystroke dynamics and timing: Accuracy, precision and difference between hands in pianist's performance,” J Biomech. pp. 3738-43, 2007. [6] R. Merletti and P. Parker: Electromyography: physiology, engineering, and noninvasive applications, Wiley-IEEE Press, 2004. [7] C.-J. Lai, R.-C. Chan, and T.-F. Yang et al.: “EMG changes during graded isometric exercise in pianists: comparison with non-musicians,” Journal of the Chinese Medical Association, vol. 71, no. 11, pp. 571-575, 2008. [8] M. Candidi, L. M. Sacheli, and I. Mega et al.: “Somatotopic mapping of piano fingering errors in sensorimotor experts: TMS studies in pianists and visually trained musically naïves,” Cerebral Cortex, vol. 24, no. 2, pp. 435-443, 2014. [9] S. Furuya, T. Aoki, and H. Nakahara et al.: “Individual differences in the biomechanical effect of loudness and tempo on upper-limb movements during repetitive piano keystrokes,” Human movement science, vol. 31, no. 1, pp. 26-39, 2012. [10] D. L. Rickert and M. Halaki, K. A. Ginn et al.: “The use of fine-wire EMG to investigate shoulder muscle recruitment patterns during cello bowing: The results of a pilot study,” Journal of Electromyography and Kinesiology, vol. 23, no. 6, pp. 1261-1268, 2013. 500 [18] WJJ. Boo, Y. Wang, and A. Loscos, “A violin music transcriber for personalized learning,” pp 2081-2084, IEEE International Conference on Multimedia and Expo, 2006. [19] BE. Boser, IM. Guyon and VN. Vapnik: “A training algorithm for optimal margin classifiers,” In Fifth Annual Workshop on Computational Learning Theory, ACM 1992. [20] AJ. Andrews: “Finger movement classification using forearm EMG signals,” M. Sc. dissertation, Queen's University, Kingston, ON, Canada, 2008. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) AUTOMATIC KEY PARTITION BASED ON TONAL ORGANIZATION INFORMATION OF CLASSICAL MUSIC Lam Wang Kong, Tan Lee Department of Electronic Engineering The Chinese University of Hong Kong Hong Kong SAR, China {wklam,tanlee}@ee.cuhk.edu.hk ABSTRACT they are the beginning or ending chords? Seems it is not, as B major chord is normally not a member chord of C major and vice versa. It seems that there must be a key change in the middle. But how would you find out the point of key change, and how does the key change? With the help of the tonal grammar tree analysis in §2.1, a good estimate of the key path can be obtained. To start with, we assume that the excerpt consists of harmonically complete phrase(s) and the chord labels are free from errors. There are some existing algorithms to estimate the key based on chord progression. These algorithms can be classified into two categories: statistical-based and rule-based approach. Hidden Markov model is very often used in the statistical approach. Lee & Stanley [7] extracted key information by performing harmonic analysis on symbolic training data and estimated the model parameters from them. 
They built 24 key-specific HMMs (all major and minor keys) for recognizing a single global key which has the highest likelihood. Raphael & Stoddard [11] performed harmonic analysis on pitch and rhythm. They divided the music into a fixed musical period, usually a measure, and associate a key and chord to each of period. They performed functional analysis of chord progression to determine the key. Unlabeled MIDI files were used to train the transition and output distributions of HMM. Instead of recognizing the global key, it can track the local key. Catteau et al. [2] described a probabilistic framework for simultaneous chord and key recognition. Instead of using training data, Lerdahl’s representation of tonal space [8] were used as a distance metric to model the key and chord transition probabilities. Shenoy et al. [15] proposed a rule-based approach for determining the key from chord sequence. They created a reference vector for each of the 12 major and minor keys, including the possible chords within the key. Higher weights were assigned to primary chords (tonic, subdominant and dominant chords). The chord vector obtained from audio data were compared against the reference vector using weighted cosine similarity. The pattern with the highest rank is chosen as the selected global key. This paper uses a rule-based approach to model tonal harmony. A context-free dependency structure is used to exhaust all the possible combinations of key paths, and the best one is selected according to music knowledge. The main objective of this research is to exploit this tonal context-free dependency structure in order to partition an excerpt of classical music into several key sections. Key information is a useful information for tonal music analysis. It is related to chord progressions, which follows some specific structures and rules. In this paper, we describe a generative account of chord progression consisting of phrase-structure grammar rules proposed by Martin Rohrmeier. With some modifications, these rules can be used to partition a chord symbol sequence into different key areas, if modulation occurs. Exploiting tonal grammar rules, the most musically sensible key partition of chord sequence is derived. Some examples of classical music excerpts are evaluated. This rule-based system is compared against another system which is based on dynamic programming of harmonic-hierarchy information. Using Kostka-Payne corpus as testing data, the experimental result shows that our system is better in terms of key detection accuracy. 1. INTRODUCTION Chord progression is the foundation of harmony in tonal music and it can determine the key. The key involves certain melodic tendencies and harmonic relations that maintain the tonic as the centre of attention [4]. Key is an indicator of the musical style or character. For example, the key C major is related to innocence and pureness, whereas F minor is related to depression or funereal lament [16]. Key detection is useful for music analysis. A classical music piece may have several modulations (key changes). A change of key means a change of tonal center, the adoption of a different tone to which all the other tones are to be related [10]. Key change allows tonal music to convey a sense of long-range motion and drama [17]. Keys and chord labels are interdependent. Even if the chord labels are free from errors, obtaining the key path is often a non-trivial task. 
For example, if a music excerpt has been analyzed with the chord sequence [B , F, Gmin , Amin , G, C], how would you analyze its key? Is it a phrase entirely in B major or C major, as c Lam Wang Kong, Tan Lee. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Lam Wang Kong, Tan Lee. “Automatic key partition based on Tonal Organization Information of Classical Music”, 15th International Society for Music Information Retrieval Conference, 2014. 501 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Functional level Scale degree level T R → DR T DR → SR D T R → T R DR XR → XR XR phrase → T R Added rules for scale degree level: S → ii (minor) T → I IV V I | V I IV I | I bII I D → I V , after S or D(V ) TR DR SR XR T D tonic region dominant region predominant region any specific region tonic function dominant function S X D() X/Y I, III... ii, vi... T →I T → I IV I S → IV D → V | vii T → vi | III D → V II (minor) S → ii (major) S → V I | bII (minor) X → D(X) X D(X) → V /X | vii/X Figure 1. Example of a tonal grammar tree (single key) predominant function any specific function secondary dominant X of Y chord major chords minor chords Table 1. Rules (top) and labels (bottom) used in our system 2. TONAL THEORY OF CLASSICAL MUSIC 2.1 Schenkerian analysis and formalization Figure 3. Flow diagram of our key partitioning system To interpret the structure of the tonal music, Schenkerian analysis [14] is used. The input is assumed to be classical music with one or more tonal centre (tonal region). Each tonal centre can be elaborated into tonic – dominant – tonic regions [1]. The dominant region can be further elaborated into predominant-dominant regions. Each region can be recursively elaborated to form a tonal grammar tree. We can derive the key information by referring to the top of the tree, which groups the chord sequence into a tonal region. Context-free grammar can be used to formalize this tree structure. A list of generative syntax is proposed by Rohrmeier [13] in the form of V → w. V is a single nonterminal symbol, while w is a string of terminals and/or non-terminals. Chord symbols (eg. IV ) are represented by terminals. They are the leaves of the grammar tree. Tonal functions (eg. T for tonic) or regions (eg. T R for tonic region) are represented by non-terminals. They can be the internal nodes or the root of the grammar tree. For instance, the rule D → V | vii indicates that the V or vii chord can be represented by the dominant function. The rule S → ii (major) indicates that ii chord can be represented by the predominant function only when the current key is major. Originally Rohrmeier has proposed 28 rules. Some of them were modified to suit classical music and were listed in Table 1. Based on this set of rules, Cocke–Younger–Kasami parsing algorithm [18] is used to construct a tonal grammar tree. If a music input is harmonically valid, a single tonal grammar tree can be built like in Figure 1. Else some scattered tree branches are resulted and cannot be connected to one single root. 2.2 Modulation In Rohrmeier’s generative syntax of tonal harmony, modulation is formalized as a new local tonic [13]. Each functional region (new key section) is grouped as a single nontonic chord in the original passage, and they may relate this (elaborated) chord to the neighbouring chords. In this research we have a more general view of modulation. 
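As a side note to §2.1, the tree-building step can be made concrete with a toy CYK parser over chord symbols. The rule set below is a deliberately simplified, illustrative subset of Table 1, restricted to unary and binary rules; the unary "region from function" rules are an assumption added here so that the toy grammar parses at all, and this is a sketch, not the grammar actually used in the system.

```python
# TR/DR/SR are tonic/dominant/predominant regions, T/D/S the corresponding functions.
TERMINAL_RULES = {'T': {'I', 'vi'}, 'D': {'V', 'vii'}, 'S': {'IV', 'ii'}}
UNARY_RULES = {'TR': {'T'}, 'DR': {'D'}, 'SR': {'S'}}          # assumed, not in Table 1
BINARY_RULES = {'TR': {('DR', 'T'), ('TR', 'DR')}, 'DR': {('SR', 'D')}}


def cyk_parse(chords, goal='TR'):
    """Return True if the chord-symbol sequence can be grouped into a single
    tonic region under the toy grammar above (cf. the tree building of Sec. 2.1)."""
    n = len(chords)
    chart = [[set() for _ in range(n)] for _ in range(n)]      # chart[i][j]: spans i..j

    def unary_closure(cell):
        changed = True
        while changed:
            changed = False
            for left, rights in UNARY_RULES.items():
                if left not in cell and cell & rights:
                    cell.add(left)
                    changed = True

    for i, chord in enumerate(chords):
        chart[i][i] = {f for f, symbols in TERMINAL_RULES.items() if chord in symbols}
        unary_closure(chart[i][i])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for split in range(i, j):
                for left, pairs in BINARY_RULES.items():
                    for b, c in pairs:
                        if b in chart[i][split] and c in chart[split + 1][j]:
                            chart[i][j].add(left)
            unary_closure(chart[i][j])
    return goal in chart[0][n - 1]


# cyk_parse(['IV', 'V', 'I'])  -> True  (predominant-dominant-tonic groups into one TR)
# cyk_parse(['V', 'IV', 'I'])  -> False (the retrogression cannot be grouped)
```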
As a music theorist, Reger had published a book Modulation, showing how to modulate from C major / minor to every other key [12]. Modulation to every other key is possible, but modulation to harmonically closer keys is more common [10]. For instance, if the music is originally in C major, it is more probable to modulate to G major instead of B major. Lerdahl’s chordal distance [8] is used to measure the distance between different keys. Here Rohrmeier’s modulation rules in [13] are not used. Instead, a tonal grammar tree is built for each new key section, and the key path with the best score is chosen. Any key changes explainable by tonicization (temporary borrowing of chords from other keys), such as the chords [I V/V V I], is not considered as a modulation. Figure 2 shows an example of tonal grammar tree with modulation, from E minor to D minor. It is presented by two disjunct trees. 3. SYSTEM BUILDING BLOCKS 3.1 Overview The proposed key partitioning system is shown as in Figure 3. This system takes a sequence of chord labels (e.g. A minor, E major) and outputs the best key path. The path may consist of only one key, or several keys. For example, [F F F F F F] or [Am Am Am C C C] (m indicates minor 502 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 2. Example of a tonal grammar tree with modulation Path no. 1 2 chords, other chords are major) are both valid key paths. The Tonal Grammar Tree mentioned in §2.1 is the main tool used in this system. 3 4 3.2 Algorithm for key partitioning Each key section is assumed to have at least one tonic chord. The top of each grammar tree must be TR (tonic region), so the key section is a complete tonal grammar tree by itself. Furthermore, the minimum length of each key section is assumed to be 3 chords. However, if no valid paths can be found, key sections with only 2 chords are also considered. The algorithm is as follows: Gm Gm B B Gm Gm B B Key paths Gm Am Gm C B Am B C Am C Am C Am C Am C Table 2. All valid key paths in the example 6. If no valid paths can be found, go back to step 4 and change the requirement to “at least 2 chords”. Else proceed to step 7. 7. Evaluate the path score of all valid paths and select the one with the highest score to be the best key path. A simple example is used to illustrate this process. The input chord sequence is [B F Gm Am G C]. Incomplete trees with the keys (B, F, Gm, Am, G, C) are built. As all the trees are incomplete, proceed to step 3 and the accumulated length is calculated. The B major tree is shown in Figure 4 as an example. Other five trees (F, Gm, Am, G, C) were built in the same fashion. Either key sections 1-3 or 1-4 of Bmajor are valid key sections as they can all be grouped into a single TR and they have at least 3 chords. Then all the valid key paths were found and they are listed in Table 2. All the path scores were evaluated by the equation (1) of the next section. 1. In a chord sequence, hypothesize any of the chord label as the tonic of a key. Derive the tonal grammar tree of each key. 2. Find if there is any key that can build a single complete tree for the entire sequence. If yes, limit the valid paths to these single-key paths and go to step 7. This phrase is assumed to have a single key only. Else go to next step. 3. For each chord label in the sequence, find the maximum possible accumulated chord sequence length of each key section (up to that label). 
Determine if this sequence is breakable at that label (The secondary dominant chord is dependent on the subsequent chord. For example, the tonicization segment V/V V cannot be broken in the middle, as V/V is dependent on V chord). 3.3 Formulation We have several criteria for choosing the best key path. A good choice of a key section should be rich in tonic and dominant chords, as they are the most important chords to define and establish a key [10]. It is more preferable if the key section starts and ends with the tonic chord, and with less tonicizations as a simpler explanation is better than a complicated one. In a music excerpt, less modulations and modulations to closer keys are preferred. We formulate 4. Find out all possible key sections with at least 3 chords including at least one tonic chord. 5. Find out all valid paths traversing all the possible key sections, from beginning to end, in a brute-force manner. 503 15th International Society for Music Information Retrieval Conference (ISMIR 2014) added by the experienced musician 2 . All the chord types have been mapped to their roots: major or minor. There are 25 excerpts with a single key and 21 excerpts with key changes (one to four key changes). The longest excerpt has 47 chords whereas the shortest excerpt has 8 chords. The instrumentation ranges from solo piano to orchestral. As we assume the input chord sequence to be harmonically complete, the last chord of excerpts 9, 14 and 15 were truncated as they are the starting chord of another phrase. There are 866 chords in total. For every excerpt, the partitioning algorithm in §3.2 is used to obtain the best path. Figure 4. The incomplete B major Tree 4.2 Baseline system these criteria with equation (1): To the best of author’s knowledge, there is currently no key partitioning algorithm directly use chord labels as input. To compare the performance of our key partitioning system, another system based on Krumhansl’s harmonichierarchy information and dynamic programming were set up. Krumhansl’s key profile has been used in many note-based key tracking systems such as [3, 9]. Here Krumhansl’s harmonic-hierarchy ratings (listed in Chapter 7 of [6]) are used to obtain the perceptual closeness of a chord in a particular key. A higher rating corresponds to a higher tendency to be part of the key. As a fair comparison, the number of chords in a key section is restricted to be at least three, which is the same in our system. To prevent fluctuations of the key, a penalty term D(x, y) is imposed on key changes. The multiplicative constant of penalty term α is determined experimentally to give the best result. The best key path is found iteratively by the dynamic programming technique presented by equations (2) and (3): Stotal = aStd − bSton − cScost + dSstend − eSsect (1) where S td is the no. of tonic and dominant chords, Ston is the total number of tonicization steps. For example, in chord progression V/V/ii V/ii ii, the first chord has two steps, while the second chord has one step. Ston = 2 + 1 + 0 = 3. Scost is the total modulation cost: the total tonal distance of each modulation measured by Lerdahl’s distance defined in [8]. Sstend indicates whether the excerpt starts and ends with tonic or not. Ssect is the total number of key sections. If a key section has only 2 chords, it is counted as 3 in Ssect as a penalty. These parameters control how well chords fit in a key section against how often the modulation occurs. 
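Once the five statistics of a candidate path are known, the score of equation (1) is straightforward to compute. The following is a minimal sketch, using the coefficient values reported just below and assuming the caller has already normalised the statistics across the candidate paths.

```python
def path_score(S_td, S_ton, S_cost, S_stend, S_sect,
               coeffs=(1.0, 0.4, 2.0, 2.0, 0.4)):
    """Path score of Eq. (1). The five arguments are the normalised statistics
    of a candidate key path; the default coefficients are the (a, b, c, d, e)
    values reported just below."""
    a, b, c, d, e = coeffs
    return a * S_td - b * S_ton - c * S_cost + d * S_stend - e * S_sect


# The path with the highest score is selected, e.g.
#   best_path = max(valid_paths, key=lambda p: path_score(*path_statistics(p)))
# where path_statistics is a hypothetical helper computing the five statistics.
```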
S td , Ston and Sstend maximizes fitness of the chord sequence to a key section. Scost and Ssect induce penalty whenever modulation occurs. The parameters S td , Ston , Scost , Sstend and Ssect are normalized so that their mean and standard deviation are 0 and 1 respectively. All the coefficients, namely a, b, c, d, e, are determined experimentally, although a slightly different set of values does not have a large effect on the key partitioning results. They are set at [a, b, c, d, e] = [1, 0.4, 2, 2, 0.4]. Key structure is generally thought to be hierarchical. An excerpt may have one level of large-scale key changes and another level of tonicizations [17], and the boundary is not well-defined. So it seemed fair to adjust these parameters in order to match the level of key changes labeled by the ground truth. The key path with the highest Stotal is chosen as the best path. Ax [1] = Hx [1] Ax [n − 1] + Hx [n], Ax [n] = max Ay [n − 1] + Hx [n] − αD(x, y) ∀x, y ∈ K, where y = x (2) 3 (3) Hx [n] is the harmonic-hierarchy rating of the nth chord with the key x. Ax [n] is the accumulated key strength of the nth chord when the current key is x. K is the set of all possible keys. D(x, y) is the distance between keys x, y based on the key distance in [6] derived from multidimensional scaling. The best path can be found by obtaining the largest Ax of the last chord and tracking all the way back to Ax [1]. The same Kostka-Payne corpus chord labels were used to test this baseline system. The best result was obtained by setting α = 4.5. 4. EXPERIMENTS 4.1 Settings To test the system, we have chosen the Kostka-Payne corpus, which contains classical music excerpts in a theory book [5]. This selection has 46 excerpts, covering compositions of many famous composers. They serve as representative examples of classical music in common practice period (around 1650-1900). All of the excerpts were examined. This corpus has ground truth key information labeled by David Temperley 1 . The mode (major or minor) of the key was labeled by an experienced musician. The chord labels are also available from the website, with the mode 1 ∀x ∈ K 4.3 Results The key partitioning result of our proposed system and the baseline system were compared against the ground truth provided by Temperley. Four kinds of result metrics were used. The average matching score is shown in Figure 5. 2 All the chord and key labels can be found here: https://drive.google.com/file/d/0B0Td6LwTULvMVJ6MFcyYWsxVzQ/edit?usp=sharing http://www.theory.esm.rochester.edu/temperley/kp-stats/ 504 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Key relation Dominant Supertonic Relative Parallel Minor 3rd Major 3rd Leading tone Tritone total no. 35 32 11 11 9 8 3 2 % 32.7 29.9 10.3 10.3 8.4 7.5 2.8 1.9 Table 3. Eight categories of the 107 error labels Figure 5. Key partitioning result, with 95% confidence interval chord symbols F major B major Exact indicates the exact matches between the obtained key path and the ground truth. As modulation is a gradual process, the exact location of key changes may not be definitive. It is more meaningful to consider Inexact. For inexact, the obtained key is also considered as correct if it matches the key of the previous or next chord. MIREX refers to the MIREX 2014 Audio key detection evaluation standard 3 . Harmonically close keys will be given a partial point. Perfect fifth is awarded with 0.5 points, relative minor/ major 0.3 points, whereas parallel major/ minor 0.2 points. 
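For reference, the partial-credit scheme just described can be written as a small scoring function. This is one reading of the weighting (tonics as pitch classes 0-11, fifths accepted in either direction), not the official MIREX evaluation code.

```python
def mirex_key_score(estimated, reference):
    """Partial-credit score for a (tonic_pitch_class, mode) pair, with mode in
    {'major', 'minor'} and tonic pitch classes 0-11."""
    est_pc, est_mode = estimated
    ref_pc, ref_mode = reference
    if estimated == reference:
        return 1.0
    if est_mode == ref_mode and (est_pc - ref_pc) % 12 in (5, 7):
        return 0.5                      # perfect-fifth relation
    if est_mode != ref_mode:
        minor_pc, major_pc = (est_pc, ref_pc) if est_mode == 'minor' else (ref_pc, est_pc)
        if (minor_pc + 3) % 12 == major_pc:
            return 0.3                  # relative major/minor (e.g. A minor vs. C major)
        if est_pc == ref_pc:
            return 0.2                  # parallel major/minor
    return 0.0
```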
This is useful as sometimes a chord progression may be explainable by two different related keys. MIREX in refers to the MIREX standard, but with the addition that the points of previous or next chord will also be considered and the maximum point will be chosen as the matching score of that chord. The proposed system outperforms the baseline system by about 18% for exact or inexact matching and 0.1 points for MIREX-related scores. It shows that our knowledgebased tonal grammar tree system is better than the baseline system which is based on perceptual closeness. Tonal structural information is exploited, so we have a better understanding of the chord progression and modulations. Gm ii vi C V V/V F I V B IV I Gm ii vi C V V/V F I V Table 4. Analysis with two different keys Modulations between keys that are supertonicallyrelated (differs by 2 semitones) or relative major / minor have a similar problem as the dominant key modulation. Many common chords are shared among both keys, so it is easy to confuse these two keys. It is worth to mention that nine of the supertonically-related errors came from excerpt 45. In Temperley’s key labels, the whole excerpt is labeled as C major with measures 10-12 considered as a passage of tonicization. However, in [5], it was written that “Measures 10-12 can be analyzed in terms of secondary functions or as a modulation”. If the measures 10-12 are considered as a modulation to D minor, then the analysis of these nine chords is correct. The parallel key modulation, for example from C major to C minor, has a different problem. Sometimes composers tend to start the phrase with a new mode (major or minor) without much preparation, as the tonic is the same. Fluctuation between major and minor of the same key has always been common [10]. When the phrase information is absent, the exact position of modulation cannot be found by the proposed system. In another way, there may exist some ornament notes that obscure the real identity of a chord, so that the chord symbol analyzed acoustically is different from the chord symbol analyzed structurally or grammatically. For example, in Figure 6, the first two bars should be analyzed as IV 6 -viiφ7 -I progression in A major. However, the C of the I chord is delayed to the next chord. The appoggiatura B made the I chord sound as a i chord, the tonic minor chord instead. Similarly, the last two bars should be analyzed as IV 6/5 -viio7 -i in F minor. However, the passing note A made the i chord sound as a I chord, the original A is delayed to the next chord. In these two cases, the key derived by the last chord in the progression is in conflict with the other chords. Hence the key will be recognized wrongly if the acoustic chord symbol is provided instead of the structural chord symbol. 4.4 Error analysis The ground truth key information are compared against the key labels generated by the proposed algorithm. 17 boundary errors were detected, ie. the key label of the previous or next chord was recognized instead. In classical music, modulation is usually not a sudden event. It occurs gradually through several pivot chords (chords common to both keys) [10]. Therefore it is sometimes subjective to determine the boundary between two key sections. It may not be a wrong labeling if the boundary is different from the ground truth. Other types of error are listed in Table 3. The most common error is the misclassification as dominant key, which is the closest related key [10]. It shares many common chords with the tonic key. 
From Table 4, the same chord sequence can be analyzed by two keys that are dominantly-related. Although the B major analysis contains more tonicizations, the resultant score disadvantage may be outweighed by the cost of key changes, if it is followed by a B major section. 3 Semitone difference 7 2 3 0 3 4 1 6 5. DIFFICULTIES The biggest problem of this research is lack of labeled data. To the best of our knowledge, large chord label database http://www.music-ir.org/mirex/wiki/2014:Audio Key Detection 505 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [3] E. Gómez and P. Herrera. Estimating The Tonality Of Polyphonic Audio Files: Cognitive Versus Machine Learning Modelling Strategies. In ISMIR, pages 1–4, 2004. [4] B. Hyer. Key (i). In S. Sadie, editor, The New Grove Dictionary of Music and Musicians. Macmillan Publishers, London, 1980. Figure 6. Excerpt from Mozart’s Piano Concerto no. 23, 2nd movement [5] S. M. Kostka and D. Payne. Workbook for tonal harmony, with an introduction to twentieth-century music. McGraw-Hill, New York, 3rd ed. edition, 1995. for classical music is absent. The largest database we could find is the Kostka-Payne corpus used in this paper. In the future, we may consider manually label more music pieces to check if the system works generally well in classical music. Moreover, key partitioning is sometimes subjective to listener’s perception. In some cases, there are several pivot chords to establish the new key center. “Ground truth” boundaries of key sections are sometimes set arbitrarily. Or there are several sets of acceptable and sensible partitions of key sections. This problem is yet to be studied. Inconsistency between acoustic and structural chord symbols mentioned in §4.4 is also yet to be solved. For any rulebased systems, exceptions may occur. Composers may deliberately break some traditions in the creative process. It is not possible to handle all these exceptional cases. [6] C. L. Krumhansl. Cognitive Foundations of Musical Pitch. Oxford University Press, New York, 1990. [7] K. Lee and M. Slaney. Acoustic Chord Transcription and Key Extraction From Audio Using Key-Dependent HMMs Trained on Synthesized Audio. In Array, editor, Ieee Transactions On Audio Speech And Language Processing, volume 16, pages 291–301. Ieee, 2008. [8] F. Lerdahl. Tonal pitch space. Oxford University Press, Oxford, 2001. [9] H. Papadopoulos and G. Peeters. Local Key Estimation From an Audio Signal Relying on Harmonic and Metrical Structures. IEEE Transactions on Audio, Speech, and Language Processing, 20(4):1297–1312, May 2012. 6. FUTURE WORK AND CONCLUSION We have only considered major and minor chords in this paper. As dominant 7th and diminished chords are common in classical music, we may consider expanding the chord type selection to make chord labels more accurate. The current system assumes chord labels to be free of errors. We plan to study the method of key tracking in the presence of chord label errors. Then we may incorporate this system to the chord classification system for audio key detection, as the key and chord progression is interdependent. Currently the input phrases must be complete in order to make this tree building process work. We plan to find the key partition method for incomplete input phrases. A more efficient algorithm for tree building process, instead of brute-force, is yet to be discovered. Then less trees are required to be built. 
In this paper, we have discussed the uses of tonal grammar to partition key sections of classical music. The proposed system outperforms the baseline system which uses dynamic programming on Krumhansl’s harmonichierarchy ratings. This tonal grammar is useful for tonal classical music information retrieval and hopefully more uses can be found. 7. REFERENCES [1] A. Cadwallader and D. Gagné. Analysis of Tonal Music: A Schenkerian Approach. Oxford University Press, Oxford, 1998. [2] B. Catteau, J. Martens, and M. Leman. A probabilistic framework for audio-based tonal key and chord recognition. Advances in Data Analysis, (2005):1–8, 2007. [10] W. Piston. Harmony. W. W. Norton, New York, rev. ed. edition, 1948. [11] C. Raphael and J. Stoddard. Functional harmonic analysis using probabilistic models. Computer Music Journal, pages 45–52, 2004. [12] M. Reger. Modulation. Dover Publications, Mineola, N.Y., dover ed. edition, 2007. [13] M. Rohrmeier. Towards a generative syntax of tonal harmony. Journal of Mathematics and Music, 5(1):35– 53, Mar. 2011. [14] H. Schenker. Free Composition. Longman, New York, London, 1979. [15] A. Shenoy and R. Mohapatra. Key determination of acoustic musical signals. 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), pages 1771–1774, 2004. [16] R. Steblin. A history of key characteristics in the eighteenth and early nineteenth centuries. University of Rochester Press, Rochester, NY, 2nd edition, 2002. [17] D. Temperley. The cognition of basic musical structures. MIT Press, Cambridge, Mass., 2001. [18] D. H. Younger. Recognition and parsing of contextfree languages in time n3 . Information and Control, 10(2):189–208, 1967. 506 15th International Society for Music Information Retrieval Conference (ISMIR 2014) BAYESIAN SINGING-VOICE SEPARATION Po-Kai Yang, Chung-Chien Hsu and Jen-Tzung Chien Department of Electrical and Computer Engineering, National Chiao Tung University, Taiwan {niceallen.cm01g, chien.cm97g, jtchien}@nctu.edu.tw ABSTRACT and background music should be collected. But, it is more practical to conduct the unsupervised learning for blind source separation by using only the mixed test data. In [13], the repeating structure of the spectrogram of the mixed music signal was extracted and applied for separation of music and voice. The repeating components from accompaniment signal were separated from the nonrepeating components from vocal signal. A binary timefrequency masking was applied to identify the repeating background accompaniment. In [9], a robust principal component analysis was proposed to decompose the spectrogram of mixed signal into a low-rank matrix for accompaniment signal and a sparse matrix for vocal signal. System performance was improved by imposing the harmonicity constraints [22]. A pitch extraction algorithm was inspired by the computational auditory scene analysis [3] and was applied to extract the harmonic components of singing voice. This paper presents a Bayesian nonnegative matrix factorization (NMF) approach to extract singing voice from background music accompaniment. Using this approach, the likelihood function based on NMF is represented by a Poisson distribution and the NMF parameters, consisting of basis and weight matrices, are characterized by the exponential priors. A variational Bayesian expectationmaximization algorithm is developed to learn variational parameters and model parameters for monaural source separation. 
A clustering algorithm is performed to establish two groups of bases: one is for singing voice and the other is for background music. Model complexity is controlled by adaptively selecting the number of bases for different mixed signals according to the variational lower bound. Model regularization is tackled through the uncertainty modeling via variational inference based on marginal likelihood. The experimental results on MIR-1K database show that the proposed method performs better than various unsupervised separation algorithms in terms of the global normalized source to distortion ratio. In general, the issue of singing-voice separation is seen as a single-channel source separation problem which could be solved by using the learning approach based on the nonnegative matrix factorization (NMF) [10, 19]. Using NMF, a nonnegative matrix is factorized into a product of a basis matrix and a weight matrix which are nonnegative [10]. NMF can be directly applied in Fourier spectrogram domain for audio signal processing. In [7], the nonnegative sparse coding was proposed to conduct sparse learning for overcomplete representation based on NMF. Such sparse coding provides efficient and robust solution to NMF. However, how to determine the regularization parameter for sparse representation is a key issue for NMF. In addition, the time-varying envelopes of spectrogram convey important information. In [16], one dimensional convolutive NMF was proposed to extract the bases, which considered the dependencies across successive columns of input spectrogram, and was applied for supervised singlechannel speech separation. In [14], two dimensional NMF was proposed to discover fundamental bases for blind musical instrument separation in presence of harmonic variations from piano and trumpet. Number of bases was empirically determined. Nevertheless, the selection of the number of bases is known as a model selection problem in signal processing and machine learning. How to tackle this regularization issue plays an important role to assure generalization for future data in ill-posed condition [1]. 1. INTRODUCTION Singing voice conveys important information of a song. This information is practical for many music-related applications including singer identification [11], music emotion annotation [21], melody extraction, lyric recognition and lyric synchronization [6]. However, singing voice is usually mixed with background accompaniment in a music signal. How to extract the singing voice from a singlechannel mixed signal is known as a crucial issue for music information retrieval. Some approaches have been proposed to deal with single-channel singing-voice separation. There are two categories of approaches to source separation: supervised learning [2] and unsupervised learning [8, 9, 13, 22]. Supervised approach conducts the singlechannel source separation given by the labeled training data from different sources. In the application of singingvoice separation, the separate training data of singing voice c Po-Kai Yang, Chung-Chien Hsu and Jen-Tzung Chien. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Po-Kai Yang, Chung-Chien Hsu and Jen-Tzung Chien. “BAYESIAN SINGING-VOICE SEPARATION”, 15th International Society for Music Information Retrieval Conference, 2014. Basically, uncertainty modeling via probabilistic framework is helpful to improve model regularization for NMF. 
507 15th International Society for Music Information Retrieval Conference (ISMIR 2014) the approximated data BW The uncertainties in singing-voice separation may come from improper model assumption, incorrect model order and possible noise interference, nonstationary environment, reverberant distortion. Under probabilistic framework, nonnegative spectral signals are drawn from probability distributions. The nonnegative parameters are also represented by prior distributions. Bayesian learning is introduced to deal with uncertainty decoding and build a robust source separation by maximizing the marginal likelihood over the randomness of model parameters. In [15], Bayesian NMF (BNMF) was proposed for image feature extraction based on the assumption of Gaussian likelihood and exponential prior. In the BNMF [4], an approximate Bayesian inference based on variational Bayesian (VB) algorithm using Poisson likelihood for observation data and Gamma prior for model parameters was proposed for image reconstruction. Implementation cost was demanding due to the numerical calculation of shape parameter. Although NMF was presented for singing-voice separation in [19, 23], the regularization issue was ignored and the sensitivity of system performance due to uncertain model and ill-posed condition was serious. This paper presents a new model-based singing-voice separation. The novelties of this paper are twofold. The first one is to develop Bayesian approach to unsupervised singing-voice separation. Model uncertainty is compensated to improve the performance of source separation of vocal signal and background accompaniment signal. Number of bases is adaptively determined from the mixed signal according to the variational lower bound of the logarithm of a marginal likelihood over NMF basis and weight matrices. The second one is the theoretical contribution in Bayesian NMF. We construct a new Bayesian NMF where the likelihood function in NMF is drawn from Poisson distribution and the model parameters are characterized by exponential distributions. A closed-form solution to hyperparameters using the VB expectation-maximization (EM) [5] algorithm is derived for ease of implementation and computation. This BNMF is connected to standard NMF with sparseness constraint. But, using the BNMF, the regularization parameters or hyperparameters are optimally estimated from training data without empirical selection from validation data. Beyond the approaches in [4, 15], the proposed BNMF completely considers the dependencies of the variational objective on hyperparameters and derives the analytical solution to singing-voice separation. (Xmn log m,n Xmn + [BW]mn − Xmn ) [BW]mn (1) 2.1 Maximum Likelihood Factorization NMF approximation is revisited by introducing the probabilistic framework based on maximum likelihood (ML) theory. The nonnegative latent variable Zmkn is embedded in data entry Xmn by Xmn = k Zmkn and is represented by a Poisson distribution with mean Bmk Wkn , i.e. Zmkn ∼ Pois(Zmkn ; Bmk Wkn ) [4]. Log likelihood function of data matrix X given parameters Θ is expressed by log p(X|B, W) = log = Pois(Xmn ; m,n Bmk Wkn ) k (Xmn log[BW]mn − [BW]mn − log Γ(Xmn + 1)) (2) m,n where Γ(·) is the gamma function. Maximizing the log likelihood function in Eq. (2) based on Poisson distribution is equivalent to minimizing the KL divergence between X and BW in Eq. (1). This ML problem with missing variables Z = {Zmkn } can be solved according to EM algorithm. 
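As noted immediately below, the M-step updates coincide with the multiplicative rules of standard NMF [10]. A minimal sketch of these KL-divergence updates, assuming a nonnegative magnitude spectrogram X, is given here as an illustration, not the authors' implementation.

```python
import numpy as np


def kl_nmf(X, K, n_iter=100, eps=1e-12, seed=0):
    """Multiplicative-update NMF of Lee and Seung [10] minimising the KL
    divergence of Eq. (1). X is a nonnegative (M x N) magnitude spectrogram,
    K the number of bases."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    B = rng.random((M, K)) + eps
    W = rng.random((K, N)) + eps
    for _ in range(n_iter):
        V = B @ W + eps
        B *= ((X / V) @ W.T) / (W.sum(axis=1)[None, :] + eps)
        V = B @ W + eps
        W *= (B.T @ (X / V)) / (B.sum(axis=0)[:, None] + eps)
    return B, W
```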
In E step, the expectation function of the log likelihood of data X and latent variable Z given new parameters B(τ +1) and W(τ +1) is calculated with respect to Z under current parameters B(τ ) and W(τ ) . In M step, we maximize the resulting auxiliary function to obtain the updating of NMF parameters which is equivalent to that of standard NMF in [10]. 2.2 Bayesian Factorization ML estimation is prone to find an over-trained model [1]. To improve model regularization, Bayesian approach is introduced to establish NMF for single-source separation. ML NMF was improved by considering the priors of basis matrix B and weight matrix W for Bayesian NMF (BNMF). Different specifications of likelihood function and prior distribution result in different solutions with different inference procedures. In [15], the approximation error of Xmn using k Bmk Wkn is modeled by a zeromean Gaussian distribution Xmn ∼ N (Xmn ; Bmk Wkn , σ 2 ) (3) k with the variance parameter σ 2 which is distributed by an inverse gamma prior. The priors of nonnegative Bmk and Wkn are modeled by the exponential distributions 2. NONNEGATIVE MATRIX FACTORIZATION Lee and Seung [10] proposed the standard NMF where no probabilistic distribution was assumed. Given a nonnega×N tive data matrix X ∈ RM , NMF aims to decompose + data matrix X into a product of two nonnegative matrices ×K B ∈ RM and W ∈ RK×N . The (m, n)-th + + entry of X is approximated by Xmn ≈ [BW]mn = k Bmk Wkn . NMF parameters Θ = {B, W} consist of basis matrix B and weight matrix W. The approximation based on NMF is optimized by minimizing the Kullback-Leibler (KL) divergence DKL (X BW) between the observed data X and Bmk ∼ Exp(Bmk ; λbmk ), Wkn ∼ Exp(Wkn ; λw kn ) (4) where Exp(x; θ) = θ exp(−θx), with means (λbmk )−1 and −1 (λw , respectively. Typically, the larger the exponenkn ) tial hyperparameter θ is involved, the sparser the exponential distribution is shaped. The sparsity of basis parameter Bmk and weight parameter Wkn is controlled by hyperparameters λbmk and λw kn , respectively. In [15], the hyperparameters {λbmk , λw } kn were fixed and empirically determined. The Gaussian likelihood does not adhere to 508 15th International Society for Music Information Retrieval Conference (ISMIR 2014) the assumption of nonnegative data matrix X. The other weakness in the BNMF [15] is that the exponential distribution is not conjugate prior to the Gaussian likelihood function for NMF. There was no closed-form solution. The parameters Θ = {B, W, σ 2 } were accordingly estimated by Gibbs sampling procedure where a sequence of posterior samples of Θ was drawn by the corresponding conditional posterior probabilities. Cemgil [4] proposed the BNMF for image reconstruction based on the Poisson likelihood function as given in Eq. (2) and the gamma priors for basis and weight matrices. The gamma distribution, represented by a shape parameter and a scale parameter, is known as the conjugate prior to Poisson likelihood function. Variational Bayesian (VB) inference procedure was developed for NMF implementation. However, the shape parameter was implemented by the numerical solution. The computation cost was relatively high. Some dependencies of variational lower bound on model parameters were ignored in [4]. The resulting parameters did not reach true optimum of variational objective. evidence function is meaningful to act as an objective for model selection which balances the tradeoff between data fitness and model complexity [1]. 
In the singing-voice separation based on NMF, this objective is used to judge which number of bases K should be selected. The selected number is adaptive to fit different experimental conditions with varying lengths and the variations from different singers, genders, songs, genres, instruments and music accompaniments. Model regularization is tackled accordingly. But, using NMF without Bayesian treatment, the number of bases was fixed and empirically determined. 3.2 Variational Bayesian Inference The exact Bayesian solution to optimization problem in Eq. (6) does not exist because the posterior probability of three latent variables {Z, B, W} given the observed mixtures X could not be factorized. To deal with this issue, the variational Bayesian expectation-maximization (VB-EM) algorithm is developed to implement Poisson-Exponential BNMF. VB-EM algorithm applies the Jensen’s inequality and maximizes the lower bound of the logarithm of marginal likelihood 3. NEW BAYESIAN FACTORIZATION This study aims to find an analytical solution to full Bayesian NMF by considering all dependencies of variational lower bound on regularization parameters. Regularization parameters are optimally estimated. log p(X|Θ) ≥ Z DKL (X||BW) + m,k + λw kn Wkn In VB-E step, a general solution to variational distribution qj of an individual latent variable j ∈ {Z, B, W} is obtained by [1] log q̂j ∝ Eq(i=j) [log p(X, Z, B, W|Θ)]. (8) (5) Given the variational distributions defined by k,n p(X|Z, B, W)p(Z|B, W)p(B, W|Θ)dBdW (7) 3.2.1 VB-E Step where the terms independent of Bmk and Wkn are treated as constants. Notably, the regularization terms (2nd and 3rd terms) in this objective are nonnegative and seen as the 1 regularizers [18] which are controlled by hyperparameters {λbmk , λw kn }. These regularizers impose sparseness in the estimated MAP parameters. However, MAP estimates are seen as point estimates. The randomness of parameters is not considered in model construction. To conduct full Bayesian treatment, BNMF is developed by maximizing the marginal likelihood p(X|Θ) over latent variables Z as well as NMF parameters {B, W} p(X, Z, B, W|Θ) q(Z, B, W) where H[·] is an entropy function. The factorized variational distribution q(Z, B, W) = q(Z)q(B)q(W) is assumed to approximate the true posterior distribution p(Z, B, W|X, Θ). In accordance with the Bayesian perspective and the spirit of standard NMF, we adopt the Poisson distribution as likelihood function and the exponential distribution as conjugate prior for NMF parameters Bmk and Wkn with hyperparameters λbmk and λw kn , respectively. Maximum a posteriori (MAP) estimates of parameters Θ = {B, W} are obtained by maximizing the posterior distribution or minimizing − log p(B, W|X) which is arranged as a regularized KL divergence between X and BW λbmk Bmk q(Z, B, W) log × dBdW = Eq [log p(X, Z, B, W|Θ)] + H[q(Z, B, W)] 3.1 Bayesian Objectives b q(Bmk ) = Gam(Bmk ; αbmk , βmk ) w w q(Wkn ) = Gam(Wkn ; αkn , βkn ) (9) q(Zmkn ) = Mult(Zmkn ; Pmkn ) b b w w the variational parameters {αmk , βmk , αkn , βkn , Pmkn } in three distributions are estimated by α̂bmk =1+ Zmkn , b β̂mk = n α̂w kn = 1 + m P̂mkn (6) Z w Zmkn , β̂kn = −1 Wkn + λbmk n Bmk + λw kn −1 (10) k exp(log Bmk + log Wkn ) = j exp(log Bmj + log Wjn ) where the expectation function Eq [·] is replaced by · for simplicity. By substituting the variational distribution into and estimating the sparsity-controlled hyperparameters or regularization parameters Θ = {λbmk , λw mk }. 
The resulting 509 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Eq. (7), the variational lower bound is obtained by BL = − + Bmk Wkn m,n,k (− log Γ(Xmn + 1) − m,n + log Bmk k Zmkn + n m,k + Zmkn log P̂mkn ) log Wkn (log λbmk − λbmk Bmk ) + Zmkn m k,n m,k + the Gaussian-Exponential BNMF in [15] and the PoissonGamma BNMF in [4]. The superiorities of the proposed method to the BNMFs in [15, 4] are twofold. First, assuming the exponential priors provides a BNMF approach with tractable solution as given in Eq. (13). Gibbs sampling in [15] and Newton’s solution in [4] are computationally expensive. Second, the dependencies of three terms of the variational lower bound in Eq. (11) on hyperparameters λbmk or λw kn are all considered in finding the true optimum while some dependencies were ignored in the solution to Poisson-Gamma BNMF [4]. Also, the observations in Gaussian-Exponential BNMF [15] were not constrained to be nonnegative. w (log λw kn − λkn Wkn ) k,n b (−(α̂bmk − 1)Ψ(α̂bmk ) + log β̂mk + α̂bmk + log Γ(α̂bmk )) m,k + w w w w (−(α̂w kn − 1)Ψ(α̂kn ) + log β̂kn + α̂kn + log Γ(α̂kn )) k,n (11) where Ψ(·) is the derivative of the log gamma function, and is known as a digamma function. 4. EXPERIMENTS 4.1 Experimental Setup 3.2.2 VB-M Step We used the MIR-1Kdataset [8] to evaluate the proposed method for unsupervised singing-voice separation from background music accompaniment. The dataset consisted of 1000 song clips extracted from 110 Chinese karaoke pop songs performed by 8 female and 11 male amateurs. Each clip recorded at 16 KHz sampling frequency with the duration ranging from 4 to 13 seconds. Since the music accompaniment and the singing voice were recorded at left and right channels, we followed [8, 9, 13] and simulated three different sets of monaural mixtures at signal-to-musicratios (SMRs) of 5, 0, and -5 dB where the singing-voice was treated as signal and the accompaniment was treated as music. The separation problem was tackled in the shorttime Fourier transform (STFT) domain. The 1024-point STFT was calculated to obtain the Fourier magnitude spectrograms with frame duration of 40 ms and frame shift of 10 ms. In the implementation of BNMF, ML-NMF was adopted as the initialization and 50 iterations were run to find the posterior means of basis and weight parameters. To evaluate the performance of singing-voice separation, we measure the signal-to-distortion ratio (SDR) [20] and then calculate the normalized SDR (NSDR) and the global NSDR (GNSDR) as In VB-M step, the optimal regularization parameters Θ = {λbmk , λw kn } are derived by maximizing Eq. (11) with respect to Θ and yielding b ∂ log βmk 1 ∂BL = b − Bmk + =0 b b ∂λmk λmk ∂λmk w ∂ log βkn 1 ∂BL = w − Wkn + = 0. w w ∂λkn λkn ∂λkn (12) Accordingly, the solution to BNMF hyperparameters is derived by solving a quadratic equation where nonnegative constraint is considered to find positive values of hyperparameters by λ̂bmk 1 = 2 λ̂w kn = 1 2 − Wkn + n − m ( Bmk + Wkn )2 +4 n Wkn Bmk m Bmk 2 ( Bmk ) + 4 Wkn m (13) n b b w w where Bmk = αmk βmk and Wkn = αkn βkn are obtained as the means of gamma distributions. VB-E step and VB-M step are alternatively and iteratively performed to estimate BNMF parameters Θ with convergence. It is meaningful to select the best number of bases (K) with the largest lower bound of the log marginal likelihood which integrates out the parameters of weight and basis matrices. 
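Putting the VB-E updates of Eq. (10) and the hyperparameter updates of Eq. (13) together, one possible reading of the resulting VB-EM loop is sketched below. The Gamma-posterior expectations ⟨B⟩ = αβ and ⟨log B⟩ = ψ(α) + log β are standard assumptions; the closed form for the λ parameters is our reconstruction of Eq. (13) and should be checked against the original paper. This is an illustrative sketch, not the authors' code.

```python
import numpy as np
from scipy.special import digamma


def bnmf_poisson_exp(X, K, n_iter=50, eps=1e-12, seed=0):
    """One reading of the VB-EM updates for the Poisson-Exponential BNMF
    (Eqs. (10) and (13)). X is a nonnegative (M x N) magnitude spectrogram,
    K the number of bases."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    EB = rng.random((M, K)) + eps        # posterior means <B>, <W>
    EW = rng.random((K, N)) + eps
    ElogB, ElogW = np.log(EB), np.log(EW)
    lam_b = np.ones((M, K))              # exponential-prior hyperparameters
    lam_w = np.ones((K, N))
    for _ in range(n_iter):
        # VB-E step, Eq. (10): multinomial responsibilities yield sum_n <Z_mkn>
        GB, GW = np.exp(ElogB), np.exp(ElogW)
        D = GB @ GW + eps
        Sz_b = GB * ((X / D) @ GW.T)     # (M, K): sum over n of <Z_mkn>
        Sz_w = GW * (GB.T @ (X / D))     # (K, N): sum over m of <Z_mkn>
        alpha_b = 1.0 + Sz_b
        beta_b = 1.0 / (EW.sum(axis=1)[None, :] + lam_b)
        EB, ElogB = alpha_b * beta_b, digamma(alpha_b) + np.log(beta_b)
        alpha_w = 1.0 + Sz_w
        beta_w = 1.0 / (EB.sum(axis=0)[:, None] + lam_w)
        EW, ElogW = alpha_w * beta_w, digamma(alpha_w) + np.log(beta_w)
        # VB-M step: positive root of the quadratic given by our reading of Eq. (13)
        Sw = EW.sum(axis=1)[None, :]
        lam_b = 0.5 * (-Sw + np.sqrt(Sw ** 2 + 4.0 * Sw / EB))
        Sb = EB.sum(axis=0)[:, None]
        lam_w = 0.5 * (-Sb + np.sqrt(Sb ** 2 + 4.0 * Sb / EW))
    return EB, EW, lam_b, lam_w
```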
NSDR(V̂, V, X) = SDR(V̂, V) − SDR(X, V) Ñ n=1 ln NSDR(V̂n , Vn , Xn ) GNSDR(V̂, V, X) = Ñ n=1 ln 3.3 Poisson-Exponential Bayesian NMF (14) where V̂, V, X denote the estimated singing voice, the original clean singing voice, and the mixture signal, respectively, Ñ is the total number of the clips and ln is the length of the nth clip. NSDR is used to measure the improvement of SDR between the estimated singing voice V̂ and the mixture signal X. GNSDR is used to calculate the overall separation performance by taking the weighted mean of the NSDRs. To the best of our knowledge, this is the first study where a Bayesian approach is developed for singing-voice separation. The uncertainties in singing-voice separation due to a variety of singers, songs and instruments could be compensated. Model selection problem is tackled as well. In this study, total number of basis vectors K is adaptively selected for individual mixed signal according to the variational lower bound in Eq. (11) with the converged variab b w w tional parameters {α̂mk , β̂mk , α̂kn , β̂kn , P̂mkn } and model b w parameters {λ̂mk , λ̂kn }. Considering the pairs of likelihood function and prior distribution in NMF, the proposed method is also called the Poisson-Exponential BNMF which is different from 4.2 Unsupervised Singing-Voice Separation We implemented the unsupervised singing-voice separation where total number of bases (K) and the grouping of these bases into vocal source and music source were both 510 15th International Society for Music Information Retrieval Conference (ISMIR 2014) SMR:−5dB GNSDR(dB) 4 3.38 3 2.4 2.1 2 1.51 2.81 K-means clustering NMF clustering 1.3 1 0 −1 Hsu Huang Yang Rafii[12] Rafii[13] BNMF1 BNMF2 GNSDR(dB) NMF (50) 2.47 2.97 BNMF (adaptive) 2.92 3.25 Table 1. Comparison of GNSDR at SMR = 0 dB using NMF with fixed number of bases {30, 40, 50} and BNMF with adaptive number of bases. SMR:0dB 3 2.37 2.76 2 0 NMF (40) 2.58 3.13 −0.51 4 1 NMF (30) 2.69 3.15 2.7 2.92 3.25 1.7 0.91 Hsu Huang Yang Rafii[12] Rafii[13] BNMF1 BNMF2 SMR:5dB GNSDR(dB) 4 3 2.57 2.58 2.57 2.1 2 2.12 1.3 1 0.17 0 Hsu Huang Yang Rafii[12] Rafii[13] BNMF1 BNMF2 Figure 1. Performance comparison using BNMF1 (Kmeans clustering) and BNMF2 (NMF-clustering) and five competitive methods (Hsu [8], Huang [9], Yang [22], Rafii [12], Rafii [13]) in terms of GNSDR under various SMRs. Figure 2. Histogram of the selected number of bases using BNMF under various SMRs. learned from test data in an unsupervised way. No training data were required. Model complexity based on K was determined in accordance with the variational lower bound of log marginal likelihood in Eq. (11) while the grouping of bases for two sources was simply performed via the clustering algorithms using the estimated basis vectors in B or equivalently from the estimated variational parameters b b {αmk , βmk }. Following [17], we conducted the K-means clustering algorithm based on the basis vectors B in Melfrequency cepstral coefficient (MFCC) domain. Each basis vector was first transformed to the Mel-scaled spectrum by applying 20 overlapping triangle filters spaced on the Mel scale. Then, we took the logarithm and applied the discrete cosine transform to obtain nine MFCCs. Finally, we normalized each coefficient to zero mean and unit variance. The K-means clustering algorithm was applied to partition the feature set into two clusters through an iterative procedure until convergence. However, it is more meaningful to conduct NMF-based clustering for the proposed BNMF method. 
To do so, we transformed the basis vectors B into Mel-scaled spectrum to form the Mel-scaled basis matrix. ML-NMF was applied to factorize this Mel-scaled basis matrix into two matrices B̃ of size N -by-2 and W̃ of size 2-by-K. The soft mask scheme based on Wiener gain was applied to smooth the separation of B into basis vectors for vocal signal and music signal. This same soft mask was performed for the separation of mixed signal X into vocal signal and music signal based on the K-means clustering and NMF clustering. Finally, the separated singing voice and music accompaniment signals were obtained by the overlap-and-add method using the original phase. 4.3 Experimental Results The unsupervised single-channel separation using BNMFs (BNMF1 using K-means clustering and BNMF2 using NMF clustering) and the other five competitive systems (Hsu [8], Huang [9], Yang [22], Rafii [12], Rafii [13]) is compared in terms of GNSDR as depicted in Figure 1. Using K-means clustering in MFCC domain, the resulting BNMF1 outperforms the other five methods under SMRs of 0 dB and -5 dB while the results using Huang [9] and Yang [22] perform better than BNMF1 under 5 dB condition. This is because the methods in [9, 22] used additional pre- and/or post-processing techniques as provided in [13, 22] which were not applied in BNMF1 and BNMF2. Nevertheless, using BNMF factorization with NMF clustering (BNMF2), the overall evaluation consistently achieves around 0.33∼0.57 dB relative improvement in GNSDR compared with BNMF1 including the SMR condition at 5dB. In addition, we evaluate the effect on the adaptive basis selection using BNMF. Table 1 reports the comparison of BNMF1 and BNMF2 with adaptive basis selection and ML-NMF with fixed number of bases under SMR of 0 dB. Two clustering methods were also carried out for NMF with different K. BNMF factorization combined with NMF clustering achieves the best performance in this comparison. Figure 2 shows the histogram of the selected number of bases K using BNMF. It is obvious that this adaptive basis selection plays an important role to find suitable amount of bases to fit different experimental conditions. 511 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 5. CONCLUSIONS We proposed a new unsupervised Bayesian nonnegative matrix factorization approach to extract the singing voice from background music accompaniment and illustrated the novelty on an analytical and true optimum solution to the Poisson-Exponential BNMF. Through the VB-EM inference procedure, the proposed method automatically selected different number of bases to fit various experimental conditions. We conducted two clustering algorithms to find the grouping of bases into vocal and music sources. Experimental results showed the consistent improvement of using BNMF factorization with NMF clustering over the other singing-voice separation methods in terms of GNSDR. In future works, the proposed BNMF shall be extended to multi-layer source separation and applied to detect unknown number of sources. 6. REFERENCES [1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer Science, 2006. [2] N. Boulanger-Lewandowski, G. J. Mysore, and M. Hoffman. Exploiting long-term temporal dependencies in NMF using recurrent neural networks with application to source separation. In Proc. of ICASSP, pages 337–344, 2014. [10] D. D. Lee and H. S. Seung. Algorithms for nonnegative matrix factorization. Advances in Neural Information Processing Systems, pages 556–562, 2000. [11] A. 
Mesaros, T. Virtanen, and A. Klapuri. Singer identification in polyphonic music using vocal separation and pattern recognition methods. In Proc. of Annual Conference of International Society for Music Information Retrieval, pages 375–378, 2007. [12] Z. Rafii and B. Pardo. A simple music/voice separation method based on the extraction of the repeating musical structure. In Proc. of ICASSP, pages 221–224, 2011. [13] Z. Rafii and B. Pardo. Repeating pattern extraction technique (REPET): A simple method for music/voice separation. IEEE Transactions on Audio, Speech, Language Processing, 21(1):73–84, Jan. 2013. [14] M. N. Schmidt and M. Morup. Non-negative matrix factor 2-D deconvolution for blind single channel source separation. In Proc. of ICA, pages 700–707, 2006. [15] M. N. Schmidt, O. Winther, and L. K. Hansen. Bayesian non-negative matrix factorization. In Proc. of ICA, pages 540–547, 2009. [3] A. S. Bregman. Auditory Scene Analysis: the Perceptual Organization of Sound. MIT Press, 1990. [16] P. Smaragdis. Convolutive speech bases and their application to speech separation. IEEE Transactions on Audio, Speech, Language Processing, 15(1):1–12, 2007. [4] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, (Article ID 785152), 2009. [17] M. Spiertz and V. Gnann. Source-Filter based clustering for monaural blind source separation. In Proc. of International Conference on Digital Audio Effects, pages 1–4, 2009. [5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society (B), 39(1):1–38, 1977. [6] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno. Lyricsynchronizer: automatic synchronization system between musical audio signals and lyrics. IEEE Journal of Selected Topics in Signal Processing, 5(6):1252– 1261, 2011. [7] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5:1457–1469, 2004. [18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 58(1):267–288, 1996. [19] S. Vembu and S. Baumann. Separation of vocals from polyphonic audio recordings. In Proc. of ISMIR, pages 375–378, 2005. [20] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. IEEE Transaction on Audio, Speech and Language Processing, 14(4):1462–1469, 2006. [21] D. Yang and W. Lee. Disambiguating music emotion using software agents. In Proc. of ISMIR, pages 52–57, 2004. [8] C.-L. Hsu and J.-S. R. Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Transactions on Audio, Speech, Language Processing, 18(2):310–319, 2010. [22] Y.-H. Yang. On sparse and low-rank matrix decomposition for singing voice separation. In Proc. of ACM International Conference on Multimedia, pages 757– 760, 2012. [9] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson. Singing-voice separation from monaural recordings using robust principal component analysis. In Proc. of ICASSP, pages 57–60, 2012. [23] B. Zhu, W. Li, R. Li, and X. Xue. Multi-stage nonnegative matrix factorization for monaural singing voice separation. IEEE Transactions on Audio, Speech, Language Processing, 21(10):2096–2107, 2013. 
512 15th International Society for Music Information Retrieval Conference (ISMIR 2014) PROBABILISTIC EXTRACTION OF BEAT POSITIONS FROM A BEAT ACTIVATION FUNCTION Filip Korzeniowski, Sebastian Böck, and Gerhard Widmer Department of Computational Perception Johannes Kepler University, Linz, Austria [email protected] ABSTRACT beat at each audio position. A post-processing step selects from these activations positions to be reported as beats. However, this method struggles to find the correct beats when confronted with ambiguous activations. We contribute a new, probabilistic method for this purpose. Although we designed the method for audio with a steady pulse, we show that using the proposed method the beat tracker achieves better results even for datasets containing music with varying tempo. The remainder of the paper is organised as follows: Section 2 reviews the beat tracker our method is based on. In Section 3 we present our approach, describe the structure of our model and show how we infer beat positions. Section 4 describes the setup of our experiments, while we show their results in Section 5. Finally, we conclude our work in Section 6. We present a probabilistic way to extract beat positions from the output (activations) of the neural network that is at the heart of an existing beat tracker. The method can serve as a replacement for the greedy search the beat tracker currently uses for this purpose. Our experiments show improvement upon the current method for a variety of data sets and quality measures, as well as better results compared to other state-of-the-art algorithms. 1. INTRODUCTION Rhythm and pulse lay the foundation of the vast majority of musical works. Percussive instruments like rattles, stampers and slit drums have been used for thousands of years to accompany and enhance rhythmic movements or dances. Maybe this deep connection between movement and sound enables humans to easily tap to the pulse of a musical piece, accenting its beats. The computer, however, has difficulties determining the position of the beats in an audio stream, lacking the intuition humans developed over thousands of years. Beat tracking is the task of locating beats within an audio stream of music. Literature on beat tracking suggests many possible applications: practical ones such as automatic time-stretching or correction of recorded audio, but also as a support for further music analysis like segmentation or pattern discovery [4]. Several musical aspects hinder tracking beats reliably: syncopation, triplets and offbeat rhythms create rhythmical ambiguousness that is difficult to resolve; varying tempo increases musical expressivity, but impedes finding the correct beat times. The multitude of existing beat tracking algorithms work reasonably well for a subset of musical works, but often fail for pieces that are difficult to handle, as [11] showed. In this paper, we further improve upon the beat tracker presented in [2]. The existing algorithm uses a neural network to detect beats in the audio. The output of this neural network, called activations, indicates the likelihood of a 2. BASE METHOD In this section, we will briefly review the approach presented in [2]. For a detailed discourse we refer the reader to the respective publication. First, we will outline how the algorithm processes the signal to emphasise onsets. We will then focus on the neural network used in the beat tracker and its output in Section 2.2. 
After this, Section 3 will introduce the probabilistic method we propose to find beats in the output activations of the neural network. 2.1 Signal Processing The algorithm derives from the signal three logarithmically filtered power spectrograms with window sizes W of 1024, 2048 and 4096 samples each. The windows are placed 441 samples apart, which results in a frame rate of fr = 100 frames per second for audio sampled at 44.1kHz. We transform the spectra using a logarithmic function to better match the human perception of loudness, and filter them using 3 overlapping triangular filters per octave. Additionally, we compute the first order difference for each of the spectra in order to emphasise onsets. Since longer frame windows tend to smear spectral magnitude values in time, we compute the difference to the last, second to last, and third to last frame, depending on the window size W . Finally, we discard all negative values. c Filip Korzeniowski, Sebastian Böck, Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Filip Korzeniowski, Sebastian Böck, Gerhard Widmer. “Probabilistic Extraction of Beat Positions from a Beat Activation Function”, 15th International Society for Music Information Retrieval Conference, 2014. 513 0.6 0.06 0.5 0.05 0.4 Activation Activation 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 0.3 0.2 0.1 0.0 0.04 0.03 0.02 0.01 0 2 4 6 8 0.00 10 Time [s] (a) Activations of a piece from the Ballroom dataset 0 2 4 6 8 10 Time [s] (b) Activations of a piece from the SMC dataset Figure 1. Activations of pieces from two different datasets. The activations are shown in blue, with green, dotted lines showing the ground truth beat annotations. On the left, distinct peaks indicate the presence of beats. The prominent rhythmical structure of ballroom music enables the neural network to easily discern frames that contain beats from those that do not. On the right, many peaks in the activations do not correspond to beats, while some beats lack distinguished peaks in the activations. In this piece, a single woodwind instrument is playing a solo melody. Its soft onsets and lack of percussive instruments make detecting beats difficult. 2.2 Neural Network Our classifier consists of a bidirectional recurrent neural network of Long Short-Term Memory (LSTM) units, called bidirectional Long Short-Term Memory (BLSTM) recurrent neural network [10]. The input units are fed with the log-filtered power spectra and their corresponding positive first order differences. We use three fully connected hidden layers of 25 LSTM units each. The output layer consists of a single sigmoid neuron. Its value remains within [0, 1], with higher values indicating the presence of a beat at the given frame. After we initialise the network weights randomly, the training process adapts them using standard gradient descent with back propagation and early stopping. We obtain training data using 8-fold cross validation, and randomly choose 15% of the training data to create a validation set. If the learning process does not improve classification on this validation set for 20 training epochs, we stop it and choose the best performing neural network as final model. For more details on the network and the learning process, we refer the reader to [2]. The neural network’s output layer yields activations for every feature frame of an audio signal. We will formally represent this computation as mathematical function. 
Let N be the number of feature frames for a piece, and N≤N = {1, 2, . . . , N } the set of all frame indices. Furthermore, let υn be the feature vector (the log-filtered power spectra and corresponding differences) of the nth audio frame, and Υ = (υ1 , υ2 , . . . , υN ) denote all feature vectors computed for a piece. We represent the neural network as a function Ψ : N≤N → [0, 1] , (1) such that Ψ(n; Υ) is the activation value for the nth frame when the network processes the feature vectors Υ. We will call this function “activations” in the following. Depending on the type of music the audio contains, the activations show clear (or, less clear) peaks at beat positions. Figure 1 depicts the first 10 seconds of activations 514 for two different songs, together with ground truth beat annotations. In Fig. 1a, the peaks in the activations clearly correspond to beats. For such simple cases, thresholding should suffice to extract beat positions. However, we often have to deal with activations as those in Fig. 1b, with many spurious and/or missing peaks. In the following section, we will propose a new method for extracting beat positions from such activations. 3. PROBABILISTIC EXTRACTION OF BEAT POSITIONS Figure 1b shows the difficulty in deriving the position of beats from the output of the neural network. A greedy local search, as used in the original system, runs into problems when facing ambiguous activations. It struggles to correct previous beat position estimates even if the ambiguity resolves later in the piece. We therefore tackle this problem using a probabilistic model that allows us to globally optimise the beat sequence. Probabilistic models are a frequently used to process time-series data, and are therefore popular in beat tracking (e.g. [3, 9, 12, 13, 14]). Most systems favour generative time-series models like hidden Markov models (HMMs), Kalman filters, or particle filters as natural choices for this problem. For a more complete overview of available beat trackers using various methodologies and their results on a challenging dataset we refer the reader to [11]. In this paper, we use a different approach: our model represents each beat with its own random variable. We model time as dimension in the sample space of our random variables as opposed to a concept of time driving a random process in discrete steps. Therefore, all activations are available at any time, instead of one at a time when thinking of time-series data. For each musical piece we create a model that differs from those of other pieces. Different pieces have different lengths, so the random variables are defined over different sample spaces. Each piece contains a different number of beats, which is why each model consists of a different 15th International Society for Music Information Retrieval Conference (ISMIR 2014) where each yn is in the domain defined by the input features. Although Y is formally a random variable with a distribution P (Y ), its value is always given by the concrete features extracted from the audio. The model’s structure requires us to define dependencies between the variables as conditional probabilities. Assuming these dependencies are the same for each beat but the first, we need to define Y X1 X2 ··· XK P (X1 | Y ) P (Xk | Xk−1 , Y ) . Figure 2. The model depicted as Bayesian network. Each Xk corresponds to a beat and models its position. Y represents the feature vectors of a signal. 
If we wanted to compute the joint probability of the model, we would also need to define P (Y ) – an impossible task. Since, as we will elaborate later, we are only interested in P (X1:K | Y ) 1 , and Y is always given, we can leave this aside. number of random variables. The idea to model beat positions directly as random variables is similar to the HMM-based method presented in [14]. However, we formulate our model as a Bayesian network with the observations as topmost node. This allows us to directly utilise the whole observation sequence for each beat variable, without potentially violating assumptions that need to hold for HMMs (especially those regarding the observation sequence). Also, our model uses only a single factor to determine potential beat positions in the audio – the output of a neural network – whereas [14] utilises multiple features on different levels to detect beats and downbeats. 3.2 Probability Functions Except for X1 , two random variables influence each Xk : the previous beat Xk−1 and the features Y . Intuitively, the former specifies the spacing between beats and thus the rough position of the beat compared to the previous one. The latter indicates to what extent the features confirm the presence of a beat at this position. We will define both as individual factors that together determine the conditional probabilities. 3.2.1 Beat Spacing 3.1 Model Structure The pulse of a musical piece spaces its beats evenly in time. Here, we assume a steady pulse throughout the piece and model the relationship between beats as factor favouring their regular placement according to this pulse. Future work will relax this assumption and allow for varying pulses. Even when governed by a steady pulse, the position of beats is far from rigid: slight modulations in tempo add musical expressivity and are mostly artistic elements intended by performers. We therefore allow a certain deviation from the pulse. As [3] suggests, tempo changes are perceived relatively rather than absolutely, i.e. halving the tempo should be equally probable as doubling it. Hence, we use the logarithm to base 2 to define the intermediate factor Φ̃ and factor Φ, our beat spacing model. Let x and x be consecutive beat positions and x > x , we define Φ̃ (x, x ) = φ log2 (x − x ) ; log2 (τ ) , στ2 , (4) Φ̃ (x, x ) if 0 < x − x < 2τ Φ (x, x ) = , (5) 0 else As mentioned earlier, we create individual models for each piece, following the common structure described in this section. Figure 2 gives an overview of our system, depicted as Bayesian network. Each Xk is a random variable modelling the position of the k th beat. Its domain are all positions within the length of a piece. By position we mean the frame index of the activation function – since we extract features with a frame rate of fr = 100 frames per second, we discretise the continuous time space to 100 positions per second. Formally, the number of possible positions per piece is determined by N , the number of frames. Each Xk is then defined as random variable with domain N≤N , the natural numbers smaller or equal to N : Xk ∈ N≤N with 1 ≤ k ≤ K, (2) where K is the number of beats in the piece. We estimate this quantity by detecting the dominant interval τ of a piece using an autocorrelation-based method on the smoothed activation function of the neural network (see [2] for details). Here, we restrict the possible intervals to a range [τl ..τu ], with both bounds learned from data. 
Assuming a steady tempo and a continuous beat throughout the piece, we simply compute K = N/τ . Y models the features extracted from the input audio. If we divide the signal into N frames, Y is a sequence of vectors: Y ∈ {(y1 , . . . , yN )} , and where φ x; μ, σ 2 is the probability density function of a Gaussian distribution with mean μ and variance σ 2 , τ is the dominant inter-beat interval of the piece, and στ2 represents the allowed tempo variance. Note how we restrict the non-zero range of Φ: on one hand, to prevent computing the logarithm of negative values, and on the other hand, to reduce the number of computations. (3) 1 515 We use Xm:n to denote all Xk with indices m to n 15th International Society for Music Information Retrieval Conference (ISMIR 2014) The factor yields high values when x and x are spaced approximately τ apart. It thus favours beat positions that correspond to the detected dominant interval, allowing for minor variations. Having defined the beat spacing factor, we will now elaborate on the activation vector that connects the model to the audio signal. a dynamic programming method similar to the well known Viterbi algorithm [15] to obtain the values of interest. We adapt the standard Viterbi algorithm to fit the structure of model by changing the definition of the “Viterbi variables” δ to δ1 (x) = P (X1 = x | Υ) and P (Xk = x | Xk−1 = x , Υ) · δk−1 (x ), δk (x) = max 3.2.2 Beat Activations x 2 where x, x ∈ N≤N . The backtracking pointers are set accordingly. P (x∗1:K | Υ) gives us the probability of the beat sequence given the data. We use this to determine how well the deducted beat structure fits the features and in consequence the activations. However, we cannot directly compare the probabilities of beat sequences with different numbers of beats: the more random variables a model has, the smaller the probability of a particular value configuration, since there are more possible configurations. We thus normalise the probability by dividing by K, the number of beats. With this in mind, we try different values for the dominant interval τ to obtain multiple beat sequences, and choose the one with the highest normalised probability. Specifically, we run our method with multiples of τ (1/2, 2/3, 1, 3/2, 2) to compensate for errors when detecting the dominant interval. The neural network’s activations Ψ indicate how likely each frame n ∈ N≤N is a beat position. We directly use this factor in the definition of the conditional probability distributions. With both factors in place we can continue to define the conditional probability distributions that complete our probabilistic model. 3.2.3 Conditional Probabilities The conditional probability distribution P (Xk | Xk−1 , Y ) combines both factors presented in the previous sections. It follows the intuition we outlined at the beginning of Section 3.2 and molds it into the formal framework as P (Xk | Xk−1 , Y ) = Ψ (Xk ; Y ) · Φ (Xk , Xk−1 ) . Xk Ψ (Xk ; Y ) · Φ (Xk , Xk−1 ) (6) The case of X1 , the first beat, is slightly different. There is no previous beat to determine its rough position using the beat spacing factor. But, since we assume that there is a steady and continuous pulse throughout the audio, we can conclude that its position lies within the first interval from the beginning of the audio. This corresponds to a uniform distribution in the range [0, τ ], which we define as beat position factor for the first beat as 1/τ if 0 ≤ x < τ, . (7) Φ1 (x) = 0 else 4. 
EXPERIMENTS In this section we will describe the setup of our experiments: which data we trained and tested the system on, and which evaluation metrics we chose to quantify how well our beat tracker performs. 4.1 Data We ensure the comparability of our method by using three freely available data sets for beat tracking: the Ballroom dataset [8,13]; the Hainsworth dataset [9]; the SMC dataset [11]. The order of this listing indicates the difficulty associated with each of the datasets. The Ballroom dataset consists of dance music with strong and steady rhythmic patterns. The Hainsworth dataset includes of a variety of musical genres, some considered easier to track (like pop/rock, dance), others more difficult (classical, jazz). The pieces in the SMC dataset were specifically selected to challenge existing beat tracking algorithms. We evaluate our beat tracker using 8-fold cross validation, and balance the splits according to dataset. This means that each split consists of roughly the same relative number of pieces from each dataset. This way we ensure that all training and test splits represent the same distribution of data. All training and testing phases use the same splits. The same training sets are used to learn the neural network and to set parameters of the probabilistic model (lower and upper bounds τl and τu for dominant interval estimation and στ ). The test phase feeds the resulting tracker with data from the corresponding test split. After detecting the beats The conditional probability for X1 is then P (X1 | Y ) = Ψ (X1 ; Y ) · Φ1 (X1 ) . X1 Ψ (X1 ; Y ) · Φ1 (X1 ) (8) The conditional probability functions fully define our probabilistic model. In the following section, we show how we can use this model to infer the position of beats present in a piece of music. 3.3 Inference We want to infer values x∗1:K for X1:K that maximise the probability of the beat sequence given Y = Υ, that is x∗1:K = argmax P (X1:K | Υ) . (9) x1:K Each x∗k corresponds to the position of the k th beat. Υ are the feature vectors computed for a specific piece. We use 2 technically, it is not a likelihood in the probabilistic sense – it just yields higher values if the network thinks that the frame contains a beat than if not 516 15th International Society for Music Information Retrieval Conference (ISMIR 2014) for all pieces, we group the results according to the original datasets in order to present comparable results. SMC F Cg CMLt AMLt 0.545 0.497 0.436 0.402 0.442 0.360 0.580 0.431 F Cg CMLt AMLt 0.840 0.837 0.718 0.717 0.784 0.763 0.875 0.811 Degara* [7] Klapuri* [12] Davies* [6] - - 0.629 0.620 0.609 0.815 0.793 0.763 Ballroom F Cg CMLt AMLt Proposed Böck [1, 2] 0.903 0.889 0.864 0.857 0.833 0.796 0.910 0.831 Krebs [13] Klapuri [12] Davies [6] 0.855 0.728 0.764 0.772 0.651 0.696 0.786 0.539 0.574 0.865 0.817 0.864 Proposed Böck [1, 2] 4.2 Evaluation Metrics A multitude of evaluation metrics exist for beat tracking algorithms. Some accent different aspects of a beat tracker’s performance, some capture similar properties. For a comprehensive review and a detailed elaboration on each of the metrics, we refer the reader to [5]. Here, we restrict ourselves to the following four quantities, but will publish further results on our website 3 . Hainsworth Proposed Böck [1, 2] F-measure The standard measure often used in information retrieval tasks. Beats count as correct if detected within ±70ms of the annotation. 
Cemgil Measure that uses a Gaussian error window with σ = 40ms instead of a binary decision based on a tolerance window. It also incorporates false positives and false negatives. CMLt The percentage of correctly detected beats at the correct metrical level. The tolerance window is set to 17.5% of the current inter-beat interval. Table 1. Beat tracking results for the three datasets. F stands for F-measure and Cg for the Cemgil metric. Results marked with a star skip the first five seconds of each piece and are thus better by about 0.01 for each metric, in our experience. AMLt Similar to CMLt, but allows for different metrical levels like double tempo, half tempo, and off-beat. In contrast to common practice 4 , we do not skip the first 5 seconds of each audio signal for evaluation. Although skipping might make sense for on-line algorithms, it does not for off-line beat trackers. considerably. Our beat tracker also performs better than the other algorithms, where metrics were available. The proposed model assumes a stable tempo throughout a piece. This assumption holds for certain kinds of music (like most of pop, rock and dance), but does not for others (like jazz or classical). We estimated the variability of the tempo of a piece using the standard deviation of the local beat tempo. We computed the local beat tempo based on the inter-beat interval derived from the ground truth annotations. The results indicate that most pieces have a steady pulse: 90% show a standard deviation lower than 8.61 bpm. This, of course, depends on the dataset, with 97% of the ballroom pieces having a deviation below 8.61 bpm, 89% of the Hainsworth dataset but only 67.7% of the SMC data. We expect our approach to yield inferior results for pieces with higher tempo variability than for those with a more constant pulse. To test this, we computed Pearson’s correlation coefficient between tempo variability and AMLt value. The obtained value of ρ = -0.46 indicates that our expectation holds, although the relationship is not linear, as a detailed examination showed. Obviously, multiple other factors also influence the results. Note, however, that although the tempo of pieces from the SMC dataset varies most, it is this dataset where we observed the strongest improvement compared to the original approach. Figure 3 compares the beat detections obtained with the proposed method to those computed by the original approach. It exemplifies the advantage of a globally optimised beat sequence compared to a greedy local search. 5. RESULTS Table 1 shows the results of our experiments. We obtained the raw beat detections on the Ballroom dataset for [6, 12, 13] from the authors of [13] and evaluated them using our framework. The results are thus directly comparable to those of our method. For the Hainsworth dataset, we collected results for [6, 7, 12] from [7], who does skip the first 5 seconds of each piece in the evaluation. In our experience, this increases the numbers obtained for each metric by about 0.01. The approaches of [6, 7] do not require any training. In [12], some parameters are set up based on a separate dataset consisting of pieces from a variety of genres. [13] is a system that is specialised for and thus only trained on the Ballroom dataset. We did not include results of other algorithms for the SMC dataset, although available in [11]. This dataset did not exist at the time most beat trackers were crafted, so the authors could not train or adapt their algorithms in order to cope with such difficult data. 
Our method improves upon the original algorithm [1, 2] for each of the datasets and for all evaluation metrics. While F-Measure and Cemgil metric rises only marginally (except for the SMC dataset), CMLt and AMLt improves 3 http://www.cp.jku.at/people/korzeniowski/ismir2014 4 As implemented in the MatLab toolbox for the evaluation of beat trackers presented in [5] 517 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 0.06 Activation 0.05 0.04 0.03 0.02 0.01 0.00 0 5 10 15 20 Time [s] Figure 3. Beat detections for the same piece as shown in Fig. 1b obtained using the proposed method (red, up arrows) compared to those computed by the original approach (purple, down arrows). The activation function is plotted solid blue, ground truth annotations are represented by vertical dashed green lines. Note how the original method is not able to correctly align the first 10 seconds, although it does so for the remaining piece. Globally optimising the beat sequence via back-tracking allows us to infer the correct beat times, even if the peaks in the activation function are ambiguous at the beginning. 6. CONCLUSION AND FUTURE WORK Mary University of London, Centre for Digital Music, Tech. Rep. C4DM-TR-09-06, 2009. We proposed a probabilistic method to extract beat positions from the activations of a neural network trained for beat tracking. Our method improves upon the simple approach used in the original algorithm for this purpose, as our experiments showed. In this work we assumed close to constant tempo throughout a piece of music. This assumption holds for most of the available data. Our method also performs reasonably well on difficult datasets containing tempo changes, such as the SMC dataset. Nevertheless we believe that extending the presented method in a way that enables tracking pieces with varying tempo will further improve the system’s performance. [6] M. E. P. Davies and M. D. Plumbley. Context-Dependent Beat Tracking of Musical Audio. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1009–1020, Mar. 2007. [7] N. Degara, E. A. Rua, A. Pena, S. Torres-Guijarro, M. E. P. Davies, and M. D. Plumbley. Reliability-Informed Beat Tracking of Musical Signals. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):290–301, Jan. 2012. [8] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, 2006. [9] S. W. Hainsworth and M. D. Macleod. Particle Filtering Applied to Musical Tempo Tracking. EURASIP Journal on Advances in Signal Processing, 2004(15):2385–2395, Nov. 2004. ACKNOWLEDGEMENTS This work is supported by the European Union Seventh Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591). 7. REFERENCES [1] MIREX 2013 beat tracking results. http://nema.lis. illinois.edu/nema_out/mirex2013/results/ abt/, 2013. [2] S. Böck and M. Schedl. Enhanced Beat Tracking With Context-Aware Neural Networks. In Proceedings of the 14th International Conference on Digital Audio Effects (DAFx11), Paris, France, Sept. 2011. [3] A. T. Cemgil, B. Kappen, P. Desain, and H. Honing. On tempo tracking: Tempogram Representation and Kalman filtering. Journal of New Music Research, 28:4:259–273, 2001. [4] T. Collins, S. Böck, F. Krebs, and G. Widmer. Bridging the Audio-Symbolic Gap: The Discovery of Repeated Note Content Directly from Polyphonic Music Audio. 
In Proceedings of the Audio Engineering Society’s 53rd Conference on Semantic Audio, London, 2014. [5] M. E. P. Davies, N. Degara, and M. D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. Queen [10] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computing, 9(8):1735–1780, Nov. 1997. [11] A. Holzapfel, M. E. P. Davies, J. R. Zapata, J. a. L. Oliveira, and F. Gouyon. Selective Sampling for Beat Tracking Evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 20(9):2539–2548, Nov. 2012. [12] A. P. Klapuri, A. J. Eronen, and J. T. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):342–355, 2006. [13] F. Krebs, S. Böck, and G. Widmer. Rhythmic Pattern Modeling for Beat and Downbeat Tracking in Musical Audio. In Proc. of the 14th International Conference on Music Information Retrieval (ISMIR), 2013. [14] G. Peeters and H. Papadopoulos. Simultaneous beat and downbeat-tracking using a probabilistic framework: theory and large-scale evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 19(6):1754–1769, 2011. [15] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. 518 15th International Society for Music Information Retrieval Conference (ISMIR 2014) GEOGRAPHICAL REGION MAPPING SCHEME BASED ON MUSICAL PREFERENCES Sanghoon Jun Korea University Seungmin Rho Sungkyul University Eenjun Hwang Korea University [email protected] [email protected] [email protected] ences. In the case of two regions near the border of the two countries, the people might show very different music preferences from those living in a region far from the border but in the same country. The degree of preference differences can be varied because of the difference in the sizes of the countries. Furthermore, the water bodies that cover 71% of the Earth’s surface can lead to a disjunction of the differences. Music from countries that have a high cultural influence might gain global popularity. For instance, pop music from the United States is very popular all over the world. Countries that have a common cultural background might have similar musical preferences irrespective of the geographical distance between them. Language is another important factor that can lead to different countries, such as the US and the UK, having similar popular music charts. For these reasons, predicting musical preferences on the basis of geographical proximity can lead to incorrect results. In this paper, we present a scheme for constructing a music map where regions are positioned close to one another depending on the musical preferences of their populations. That is, regions such as cities in a traditional map are rearranged in the music map such that regions with similar musical preferences are close to one another. As a result, regions with similar musical preferences are concentrated in the music map and regions with distinct musical preferences are far away from the group. The rest of this paper is organized as follows: In Section 2, we present a brief overview of the related works. Section 3 presents the scheme for mapping a geographical region to a new music space. Section 4 describes the experiments that we performed and some of the results. In the last section, we conclude the paper with directions for future work. 
ABSTRACT Many countries and cities in the world tend to have different types of preferred or popular music, such as pop, K-pop, and reggae. Music-related applications utilize geographical proximity for evaluating the similarity of music preferences between two regions. Sometimes, this can lead to incorrect results due to other factors such as culture and religion. To solve this problem, in this paper, we propose a scheme for constructing a music map in which regions are positioned close to one another depending on the similarity of the musical preferences of their populations. That is, countries or cities in a traditional map are rearranged in the music map such that regions with similar musical preferences are close to one another. To do this, we collect users’ music play history and extract popular artists and tag information from the collected data. Similarities among regions are calculated using the tags and their frequencies. And then, an iterative algorithm for rearranging the regions into a music map is applied. We present a method for constructing the music map along with some experimental results. 1. INTRODUCTION To recommend suitable music pieces to users, various methods have been proposed and one of them is the joint consideration of music and location information. In general, users in the same place tend to listen to similar kinds of music and this is shown by the statistics of music listening history. Context-aware computing utilizes this human tendency to recommend songs to a user. However, the current approach of exploring geographical proximity for obtaining a user’s music preferences might have several limitations due to various factors such as region scale, culture, religion, and language. That is, neighboring regions can show significant differences in music listening statistics and vice versa. In fact, the geographical distance between two regions is not always proportional to the degree of difference in music preferences. For instance, assume that there are two neighboring countries having different music prefer- 2. RELATED WORK Many studies have tried to utilize location information for various music-related applications such as music search and recommendation. Kaminskas et al. presented a context-aware music recommender system that suggests music items on the basis of the users’ contextual conditions, such as the users’ mood or location [1]. They defined the term “place of interest (POI)” and considered the selection of suitable music tracks on the basis of the POI. In [2], Schedl et al. presented a music recommenda- © Sanghoon Jun, Seungmin Rho, Eenjun Hwang. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sanghoon Jun, Seungmin Rho, Eenjun Hwang. “Geographical Region Mapping Scheme Based On Musical Preferences”, 15th International Society for Music Information Retrieval Conference, 2014. 
519 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Data Collection Extract music listening data and location - Messages - GPS information - Profile location Define region and group data Mapping regions to 2-dimensional space Space Representation Generate Gaussian mixture model Extract popular artists from groups - Geo.TopArtist - Artist.TopTags Generate map Extract tag statistics from popular artists PM Similarity Measurement SL HKTC GP TLTD NA MW Calculate similarities of region pairs MS VN BJ UG NG AS SZ AX MM FK FI MZ MO GN LY TG TO CN IQ SDBB JM LSMF BO BQ LC MKCM TH YE BT TV OM LA FO MD KY PW GUZM AO BWPS NU IO NO GI MAFR PH SC PA MC TJMG AZ BN ML CW MT TF PT NI RS VC SX SB AE SG DO GY PEGS AMKP ISSVMY EH AL AQ HN CLLUNL HU IN EG HR BH MQ MX CO AR ZA MR ES CA HM GRCC SKEE PY IM IE DKCX GB TR PR RE DZ VGAU PKTTUS DJ CH QA CI ECIT LV IL CR AT CZ BEVINZ LT UY UM PN VE BA GD VU ME BF FJ SS RO KZ SI SA SJ PL ZW SO GT TZ FMSE GW BG NE MNIDCY VA UA GG MV BV BYBD UZ GE BR CKTN ST CU TKDEPF GL HT LB AW JO AG WF RU TM PG LK BS MU KG AD CG KE YT BI NR SR AF CV LI SM NP CD KR MHNC KI IR JE SY KN WS KH MP ET AI TW GA GM KW GF BL GH KM - Google maps - Geocoder dimensional (3D) visualization model [5]. Using the music similarity model, they provided new tools for exploring and interacting with a music collection. In [6], Knees et al. presented a user interface that creates a virtual landscape for music collection. By extracting features from audio signals and clustering the music pieces, they created a 3D island landscape. In [7], Pampalk et al. presented a system that facilitates the exploration of music libraries. By estimating the perceived sound similarities, music pieces are organized on a two-dimensional (2D) map so that similar pieces are located close to one another. In [8], Rauber et al. proposed an approach to automatically create an organization of music collection based on sound similarities. A 3D visualization of music collection offers an interface for an interactive exploration of large music repositories. Space Mapping ER RW SH Generate similarity matrix NF DM BZ GQ SN LR JP CF BM Figure 1. Overall scheme 3. GEOGRAPHICAL REGION MAPPING In this paper, we propose a scheme for geographical region mapping on the basis of the musical preferences of the people residing in these regions. The proposed scheme consists of three parts as shown in Figure 1. Firstly, the music listening history and the related location data are collected from Twitter. After defining regions, the collected data are refined to tag the statistics per region by querying popular artists and their popularities from last.fm. Similarities between the defined regions are calculated and stored in the similarity matrix. The similarity matrix is represented into a 2D space by using an iterative algorithm. Then, a Gaussian mixture model (GMM) is generated for constructing the music map on the basis of the relative location of the regions. Figure 2. Collected data from twitter tion algorithm that combines information on the music content, music context, and user context by using a data set of geo-located music listing activities. In [3], Schedl et al derived and analyzed culture-specific music listening patterns by collecting music listening patterns of different countries (cities). They utilized social microblog such as Twitter and its tags in order to collect musicrelated information and measure the similarities between artists. Jun et al. 
presented a music recommender that considers personal and general musical predilections on the basis of time and location [4]. They analyzed massive social network streams from twitter and extracted the music listening histories. On the basis of a statistical analysis of the time and location, a collection of songs is selected and blended using automatic mixing techniques. These location-aware methods show a reasonable music search and recommendation performance when the range of the placeGof interest is small. However, the aforementioned problems might occur when the location range increases. Furthermore, these methods do not consider the case where remote regions have similar music preferences, which is often the case. On the basis of these observations, in this paper, we propose a new data structure called a “music map”, where regions with similar musical preferences are located close to one another. Some pioneering studies to represent music by using visualization techniques have been reported. Lamere et al. presented an application for exploring and discovering new music by using a three- 3.1 Music Listen History and Location Collection By analyzing the music listening history and location data, we can find out the music type that is popular in a certain city or country. In order to construct a music map, we need to collect the music listening history and location information on a global scale. To do this, we utilize last.fm, which is a popular music database. However, last.fm has several limitations related to the coverage of the global music listening history. The most critical one is that the database provides the listening data of a particular country only. In other words, we cannot obtain the data for a detailed region. Users in some countries (not all countries) use last.fm, and it does not contain sufficient data to cover the preferences of all the regions of these countries. Because of this, we observed that popular music in the real world does not always match with the last.fm data. On the other hand, an explosive number of messages are generated all over the world through Twitter. Twitter is one of the most popular social network services. In this study, we use Twitter for collecting a massive amount of music listening history data. By filtering music-related messages from Twitter, we can collect various types of 520 15th International Society for Music Information Retrieval Conference (ISMIR 2014) #nowplaying #np #music #soundcloud #musicfans #listenlive #hiphop #musicmondays #pandora #mp3 #itunes #newmusic 1 AU AT BE CA CL CZ DK EE FI FR DE GR HU IS IE IL IT JP KR LU MX NL NZ NO PL PT SK SI ES SE CH TR UK US Table 1. Music-related hashtags. <Phrase A> by < Phrase B> < Phrase A> - < Phrase B > < Phrase A > / < Phrase B > “< Phrase A >” - < Phrase B > Table 2. Typical syntax for parsing song title and artist music-related information, such as artist name, song title, and the published location. Figure 2 shows the distribution of the collected music-related tweets from around the world. We used the Tweet Stream provided through a Twitter application processing interface (API) for collecting tweets. In order to select only the music-related tweets, we used music-related hashtags. Hashtags are very useful for searching the relevant tweets or for grouping tweets on the basis of topics. As shown in Table 1, we used the music-related hashtag lists that have been defined in [4]. Music-related tweet messages contain musical information such as song title and artist name. 
These textual data are represented in various forms. In particular, we considered the patterns shown in Table 2 for finding the artist names and the song titles. We employed a local MusicBrainz [9] server to validate the artist names. For collecting location information, we gathered global positioning system (GPS) data that are included in tweet messages. However, we observed that the number of tweets that contain GPS data is quite small considering the total number of tweets. To solve this, we collected the profile location of the user who published a tweet message. Profile location contains the text address of the country or the city of the user. We employed the Google Geocoding API [10] for validating the location name and converting the address to GPS coordinates. 0.9 0.8 0.7 0.6 0.5 AU ATBE CA CL CZDKEE FI FRDE GRHU IS IE IL IT JP KR LU MXNL NZNOPL PTSK SI ES SE CH TRUK US Figure 3. Tag similarity matrix of 34 countries = { , … , } (2) where n is the number of referred artists. Also, using an artist name, we can collect his/her tag list. For a region r, we construct a set Tr of top tags by querying top tags to last.fm using the artist names of the region r as follows: ! = {"#$!%&!"( ) ' … ' "#$!%&!"( )| * } = {$ , … $+ } (3) where getTopTags(a) returns a list of top tags of artist a and m is the number of collected tags for the region r. We define a function RTC(r, t) that calculates the total count of tag t in region r using the following equation: -!(., $) = /1 *23 × "#$!"%0$( , $) (4) Here, getTagCount(a, t) returns the count of tag t for the artist a in last.fm. In the same vein, RTC can return a set of tag counts when the second argument is a tag set T. -!(., 4) = {-!(., $ ), … , -!(., $+ )|$ * 4} (5) 3.2 Region Definition and Tag Representation 3.3 Similarity Measurement Using the collected GPS information, we created a set of regions on the basis of the city or country. For grouping data by city name or country name, the collected GPS information is converted into its corresponding city or country name. In this study, we got 1327 cities or 198 countries from the music listening history collected through TwitterU For each region, we collect two sets Ar and ACr of referred artist names and their play counts, respectively: To construct a music map of regions, we need a measurement for estimating musical similarity. In this paper, we assume that music proximity between regions is closely related to the artists and their tags because the musical characteristics of a region can be explained by the artists’ tags of the region. In particular, in order to measure the similarity among the regions represented by the tag groups, we employed a cosine similarity measurement as shown in the following equation: = { , … , } (1) 521 !56(. , .7 ) = 89:(; ,4< )×89:(> ,4< ) ?89:@; ,43; A?×?89:@> ,43> A? (6) 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 1 Japan 0.8 0.6 (a) iteration = 1 Korea, republic of (b) iteration = 100 0.4 0.2 0 0 (c) iteration = 400 Israel Slovenia Poland Greece Luxembourg Germany Italy Turkey Chile Czech republic Spain Austria NewNetherlands zealand United kingdom Slovakia Switzerland Hungary Mexico Belgium Estonia Ireland United states Australia Denmark Portugal Canada Sweden Norway Iceland France 0.2 0.4 0.6 0.8 Finland 1 Figure 5. Gaussian mixture model of 34 countries (d) iteration = 1000 Figure 4. 
Example of mapped space in iterations 4B = 4; C 4> 1 JP (7) 0.8 The cosine similarities of all possible pairs of regions were calculated and stored in the tag similarity matrix TSM. Hence, if there were m regions in the collection, we obtained a TSM of m × m. A sample TSM for 34 countries is shown in Figure 3. 0.6 SI KR 0.4 0.2 SE 3.4 2D Space Mapping 0 On the basis of the TSM, we generated a 2D space for a music map by converting tag similarities between regions into proper metric for 2D space mapping. In this paper, this conversion is done approximately using an iterative algorithm. The proposed algorithm is based on the computational model such as a self-organizing map and an artificial neural network algorithm. By using an iterative phase, the algorithm gradually separates the regions in inverse proportion to the tag similarity. -0.2 -0.2 0.2 FI PL GR LU DE IT TR CL CZ AT NZNLESSK UK CH BEHUMX EE IE US DK AU PT CA IS NO FR 0.4 0.6 0.8 1 1.2 Figure 6. Music map of 34 countries 7 HD(.E , . ) = I@J(.E) G J(. )A + (L(.E ) G L(. ))7 (9) where x(ri) and y(ri) returns x and y positions of the region ri in 2D space, respectively. In order for TD and ED to have same value as much as possible, the following equation is applied 3.4.1 Initialization In the initialization phase, 2D space is generated where X-axis and Y-axis of the space have ranges from 0 to 1. Each region is randomly placed on the 2D space. We observed that our random initialization does not provide deterministic result of the 2D space mapping. J(. ) = J(. ) + M($)(HD(.E , . ) G !D(.E , . )) (N(O )PN(1 )) QR(O ,1 ) (10) 3.4.2 Iterations L(. ) = L(. ) + M($)(HD(.E , . ) G !D(.E , . )) In each iteration, a region in the 2D space is randomly selected and the tag distance TD between the selected region rs and any other region ri is computed using the similarity matrix. !D(.E , . ) = 1 G !56(.E , . ) 0 IL (S(O )PS(1 )) QR(O ,1 ) (11) (8) Here, ©(t) is a learning rate in t-th iteration. The learning rate is monotonically decreased during iteration according to the following equation M($) = MT exp(G$/!) Subsequently, Euclidean distances ED between the selected region rs and other region ri is computed using the following equation 522 (12) 15th International Society for Music Information Retrieval Conference (ISMIR 2014) GN Average |ED-TD| 0.6 SL BJ DM 0.5 KW 0.4 BZ HK 0.1 CF 0 100 200 300 400 500 Iterations Figure 7. Average difference of distances in iterations ©0 denotes the initial learning rate, and T represents the total number of iterations. After each iteration, regions having higher TD are located far away from the selected region and regions having lower TD are located closer. Figure 4 shows examples of the mapped space after iterations. AI LY BB FK SB CN KN BI HT FIKH MZ VA ET KR EH MV BY MFGD TC PH JOAD PG UZ KZ MAPN SE LCMM MH CG GE IN FR FJBQ NENO BT BVFO BO MNDE RW IL SI SJAQ BR MG ML ME MS BL TW VE KY OM EG PL VU EC LU BE LT UY SM HR CZ SS GS IT ATNL TF RO DK MTCR MR GR CI BG IS TR AR GL LV BD TK ZA CH MX GT CC CO CXUMTN SO SK VG UA ALBAPE EE CL IM IE HU GB GQ VC KG KP ES PT IDCYRSPY MQ NR DJ ST CA HM PK TO AU PR RE GI NZ US CUPF AZ AMQA CW CD VI BS WF MP BF ZW SCGU MD HN SVMY RU TJ NU TM TT SG SX LB AE BN CK AO MO DO CV MC LA IO SY NI LK GA TV BH ZMSA FM IR GW JM GY DZ LS GP GH NP CM PA AG GG JP SZ SN TZ PWTH YE ER LI SR AF AX AW LR GM NG AS TG GF VN JE Figure 8. Music map of 239 countries Hh Fv V(i) = {J(. ), L(. 
)} 0 1Z 8 (1 ) \ Ro In In After 2D space mapping, the regions are mapped such that regions having similar music preferences are placed close to one another. As a result, they form distinct crowds in the 2D space. In contrast, regions having unique preferences are placed apart from the crowds. To represent them as a map, a 2D distribution on the space is not sufficient. In this paper, in order to represent the information like a real world map, we employed the GMM. The Gaussian with diagonal matrix is constructed using the following equations: &(i) = IQ PM KM 3.5 Space Representation 1Z W(i) = X 8 0 MW BW MU SH 0.2 0 YT NC NFKI SDPS KE TD TL NA MK 0.3 BM WS UG Nu Db Ro Hh El ElHh RoIn Ro El El Po In El El Ro El El In ElIn Jz Ro El Ro RoRo Ro Ro El In Po Ro Po El Po Po Ro Po RoRo Po Po El ElElPoElPoPo Po Ro Ro Ro Ro Po El Ro Po Jz Po RoRo Po PoRoPo RoRo Po Po Jz Ro Po Ro Ro Ro Ro Ro Po Ro Ro Ro Ro Ro Ro Ro Po Ro Po Ro Ro Po Ro Po Ro Ro Po Ro Po Po El Ro Ro Ro Po Po Ro Ro Po Ro Po Ro Ro Po Po Ro Po Po Po Po Po El Jp Ro Ro Ro Ro Po Po Po Ro Ro Ro Ro Po Po El Po Hh Po Po Po Po Ro Po Ro Ro RoRo Po Ro RoRo Ro PoPo Po Hh Ro Ro PoPoPo PoPo Ro Po Po PoPo El PoRo Ro Po Po Po Ro Ro Po Po Ro Po PoPo Ro Po Ro Po In Ro Fv Hh Ro Po In Ro Ro Jp El Po Hh In In Po Ro In In Hh El Hh Hh Ro Hh In Hh In Jp Pu El Hh Hh In Po Hh In In (13) Po(Pop), El(electronic), Hh(Hip-Hop), Ro(rock), Jz(jazz), In(Indie), Jp(japanese), Fv(female vocalists), Db(Drum and bass), Pu(punk), Nu(Nu Metal) (14) Figure 9. Top tags of music map. or a small island on the basis of their distribution. As a result, the mapped result is visualized as a music map having an appearance similar to that of a real world map. An example of a music map for 34 countries is shown in Figure 6. Although the generated music map contains less information than the contour graph of GMM, it could be more intuitive to the casual users to understand the relations between regions in terms of music preferences. (15) Here, n is total number of regions and nn(ri) returns the number of neighboring regions of region ri in the 2D space. To model the GMM in the crowded area of 2D space, mixing proportion p(i) is adjusted based on the number of neighbors nn(ri). In other words, nn(ri) has a higher value when p(i) is crowded and it reduces the proportion of i-th Gaussian. It helps to prevent Gaussian from over-height. An example of generated GMM is shown in Figure 5. To generate a music map using the GMM, the probabilistic density function (pdf) of the GMM is simplified by applying a threshold. By projecting the GMM on the 2D plane after applying the threshold to the pdf, the boundaries of the GMM are created. We empirically found that the threshold value 0 gives an appropriate boundary. A boundary represents regions as a continent 4. EXPERIMENT 4.1 Experiment Setup To collect the music-related tweets, we gathered the tweet streams from the Twitter server in real time in order to collect the music information of Twitter users. During one week, we collected 4.57 million tweets that had the hashtags listed in Table 1. After filtering the tweets through regular expressions, 1.56 million music listening history records were collected. We got 1327 cities or 198 523 15th International Society for Music Information Retrieval Conference (ISMIR 2014) a music map according to the tag similarities. The possible application domains of the proposed scheme span a broad range—from music collection, browsing services, and music marketing tools, to a worldwide music trend analysis. 
countries from the music listening history collected through Twitter. We collected the lists of the top artists for 249 countries from last.fm. For these countries, 2735 artists and their top tags were collected from last.fm. 4.2 Differences of ED and TD In the proposed scheme, the iterative algorithm gradually reduces the difference between ED and TD, as mentioned above. In order to show that the algorithm reduces the difference and moves the regions appropriately, the average difference between ED and TD is measured in each iteration. Figure 7 shows the average distances during 500 iterations. The early phases in the computation show high average distance differences due to the random initialization. As the iteration proceeds, the average distance differences are gradually reduced and converged. 6. ACKNOWLEDGEMENT This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF2013R1A1A2012627) and the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-1001) supervised by the NIPA (National IT Industry Promotion Agency) 7. REFERENCES 4.3 Map Generation for 249 Countries In order to evaluate the effectiveness of the proposed scheme, we defined a region group that contained 249 countries. After collecting the music listening history from Twitter and last.fm, we generated a music map by using the proposed scheme. Figure 8 shows the resulting music map. We observed that the map consisted of a big island (continent) and a few small islands. In the center of the big island, countries that had a high musical influence, such as the US and the UK, were located. On the other hand, countries having unique music preferences such as Japan and Hong Kong were formed as small islands and located far away from the big island. 4.4 Top Tag Representation A music map is based on the musical preferences between regions, and these preferences were calculated on the basis of the similarities of the musical tags. In the last experiment, we first find out the top tag of each country and show the distribution of the top tags in the music map. Figure 9 shows the top tags of the map in Figure 8. In the map, “Rock” and “Pop”, which are the most popular tags in the collected data, are located in the center and occupies a significant portion of the big island. On the north side of the big island, “Electronic” tag is located and in the south, “Indie” tag is placed. The “Pop” tag, which is popular in almost every country, is located throughout the map. 5. CONCLUSION In this paper, we proposed a scheme for constructing a music map in which regions such as cities and countries are located close to one another depending on the musical preferences of the people residing in them. To do this, we collected the music play history and extracted the popular artists and tag information from Twitter and last.fm. A similarity matrix for each region pair was calculated by using the tags and their frequencies. By applying an iterative algorithm and GMM, we reorganized the regions into 524 [1] M. Kaminskas and F. Ricci, “Location-Adapted Music Recommendation Using Tags,” in User Modeling, Adaption and Personalization, Springer Berlin Heidelberg, 2011, pp. 183–194. [2] M. Schedl and D. Schnitzer, “Location-Aware Music Artist Recommendation,” in MultiMedia Modeling, Springer International Publishing, 2014, pp. 205–213. [3] M. Schedl and D. 
Hauger, “Mining Microblogs to Infer Music Artist Similarity and Cultural Listening Patterns,” in Proceedings of the 21st International Conference Companion on World Wide Web, New York, USA, 2012, pp. 877–886. [4] S. Jun, D. Kim, M. Jeon, S. Rho, and E. Hwang, “Social mix: automatic music recommendation and mixing scheme based on social network analysis,” Journal of Supercomputing, pp. 1–22, Apr. 2014. [5] P. Lamere and D. Eck, “Using 3d visualizations to explore and discover music,” in in Int. Conference on Music Information Retrieval, 2007. [6] P. Knees, M. Schedl, T. Pohle, and G. Widmer, “Exploring Music Collections in Virtual Landscapes,” IEEE MultiMedia, vol. 14, no. 3, pp. 46–54, Jul. 2007. [7] E. Pampalk, A. Rauber, and D. Merkl, “Contentbased Organization and Visualization of Music Archives,” in Proceedings of the Tenth ACM International Conference on Multimedia, New York, NY, USA, 2002, pp. 570–579. [8] A. Rauber, E. Pampalk, and D. Merkl, “The SOMenhanced JukeBox: Organization and Visualization of Music Collections Based on Perceptual Models,” Journal of New Music Research, vol. 32, no. 2, pp. 193–210, 2003. [9] “MusicBrainz - The Open Music Encyclopedia.” [Online]. Available: http://musicbrainz.org/. [Accessed: 03-May-2014]. [10] ˈThe Google Geocoding API” [Online]. Available: https://developers.google.com/maps/documentation/ geocoding/. [Accessed: 03-May-2014] 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ON COMPARATIVE STATISTICS FOR LABELLING TASKS: WHAT CAN WE LEARN FROM MIREX ACE 2013? John Ashley Burgoyne Universiteit van Amsterdam [email protected] W. Bas de Haas Universiteit Utrecht [email protected] ABSTRACT For mirex 2013, the evaluation of audio chord estimation (ace) followed a new scheme. Using chord vocabularies of differing complexity as well as segmentation measures, the new scheme provides more information than the ace evaluations from previous years. With this new information, however, comes new interpretive challenges. What are the correlations among different songs and, more importantly, different submissions across the new measures? Performance falls off for all submissions as the vocabularies increase in complexity, but does it do so directly in proportion to the number of more complex chords, or are certain algorithms indeed more robust? What are the outliers, songalgorithm pairs where the performance was substantially higher or lower than would be predicted, and how can they be explained? Answering these questions requires moving beyond the Friedman tests that have most often been used to compare algorithms to a richer underlying model. We propose a logistic-regression approach for generating comparative statistics for mirex ace, supported with generalised estimating equations (gees) to correct for repeated measures. We use the mirex 2013 ace results as a case study to illustrate our proposed method, including some of interesting aspects of the evaluation that might not apparent from the headline results alone. 1. INTRODUCTION Automatic chord estimation (ace) has a long tradition within the music information retrieval (mir) community, and chord transcriptions are generally recognised as a useful mid-level representation in academia as well as in industry. For instance, in an academic context it has been shown that chords are interesting for addressing musicological hypotheses [3,13], and that they can be used as a mid-level feature to aid in retrieval tasks like cover-song detection [7,10 ]. 
Johan Pauwels is no longer affiliated with stms. Data and source code to reproduce this paper, including all statistics and figures, are available from http://bitbucket.org/jaburgoyne/ismir-2014.

© John Ashley Burgoyne, W. Bas de Haas, Johan Pauwels. Licensed under a Creative Commons Attribution 4.0 International License (cc by 4.0). Attribution: John Ashley Burgoyne, W. Bas de Haas, Johan Pauwels. “On comparative statistics for labelling tasks: What can we learn from mirex ace 2013?”, 15th International Society for Music Information Retrieval Conference, 2014.

In an industrial setting, music start-ups like Riffstation 1 and Chordify 2 use ace in their music teaching tools, and at the time of writing, Chordify attracts more than 2 million unique visitors every month [6]. In order to compare different algorithmic approaches in an impartial setting, the Music Information Retrieval Evaluation eXchange (mirex) introduced an annual ace task in 2008. Since then, between 11 and 18 algorithms have been submitted each year by between 6 and 13 teams. Despite the fact that ace algorithms are used outside of academic environments, and even though the number of mirex participants has decreased slightly over the last three years, the problem of automatic chord estimation is nowhere near solved.

1 http://www.riffstation.com/
2 http://chordify.net

Automatically extracted chord sequences have classically been evaluated by calculating the chord symbol recall (csr), which reflects the proportion of correctly labelled chords in a single song, and a weighted chord symbol recall (wcsr), which weights the average csr of a set of songs by their length. On fresh validation data, the best-performing algorithms in 2013 achieved a wcsr of only 75 percent, and that only when the range of possible chords was restricted exclusively to the 25 major, minor and “no-chord” labels; the figure drops to 60 percent when the evaluation is extended to include seventh chords (see Table 1).

Algorithm  # Types    I   II  III   IV    V   VI  VII  VIII
ko2            7     76   74   72   60   58   84   79    89
nmsd2         10     75   71   69   59   57   82   79    86
cb4           13     76   72   70   59   57   85   80    90
nmsd1         10     74   71   69   58   56   83   79    86
cb3           13     76   72   70   58   56   85   81    89
ko1            7     75   71   69   54   52   83   80    88
pp4            5     69   66   64   51   49   83   78    87
pp3            2     70   68   65   50   48   83   82    84
cf2           10     71   67   65   49   47   83   83    83
ng1            2     71   67   65   49   46   82   79    86
ng2            5     67   63   61   44   43   82   81    83
sb8            2      9    7    6    5    5   51   92    35

Table 1. Number of supported chord types and mirex results on the Billboard 2013 test set for all 2013 ace submissions. I: root only; II: major-minor vocabulary; III: major-minor vocabulary with inversions; IV: major-minor vocabulary with sevenths; V: major-minor vocabulary with sevenths and inversions; VI: mean segmentation score; VII: under-segmentation; VIII: over-segmentation. Adapted from the mirex Wiki.

mirex is a terrific platform for evaluating the performance of ace algorithms, but by 2010 it was already being recognised that the metrics could be improved. At that time, they included only csr and wcsr using a vocabulary of 12 major chords, 12 minor chords and a “no-chord” label. At ismir 2010, a group of ten researchers met to discuss their dissatisfaction. In the resulting ‘Utrecht Agreement’, 3 it was proposed that future evaluations should include more diverse chord vocabularies, such as seventh chords and inversions, as the 25-chord vocabulary was considered a rather coarse representation of tonal harmony. Furthermore, the group agreed that it was important to include a measure of segmentation quality in addition to csr and wcsr. At approximately the same time, Christopher Harte proposed a formalisation of measures that implemented the aspirations indicated in the Utrecht agreement [8]. Recently, Pauwels and Peeters reformulated and extended Harte’s work with the precise aim of handling differences in chord vocabulary between annotated ground truth and algorithmic output on one hand, and among the output of different algorithms on the other hand [15]. They also performed a rigorous re-evaluation of all mirex ace submissions from 2010 to 2012. As of mirex 2013, these revised evaluation procedures, including the chord-sequence segmentation evaluation suggested by Harte [8] and Mauch [12], have been adopted in the context of the mirex ace task.

3 http://www.music-ir.org/mirex/wiki/The_Utrecht_Agreement_on_Chord_Evaluation

mirex ace evaluation has also typically included comparative statistics to help determine whether the differences in performance between pairs of algorithms are statistically significant. Traditionally, Friedman’s anova has been used for this purpose, accompanied by Tukey’s Honest Significant Difference tests for each pair of algorithms. Friedman’s anova is equivalent to a standard two-way anova with the actual measurements (in our case wcsr or directional Hamming distance [dhd], the new segmentation measure) replaced by the rank of each treatment (in our case, each algorithm) on that measure within each block (in our case, for each song) [11]. The rank transformation makes Friedman’s anova an excellent ‘one size fits all’ approach that can be applied with minimal regard to the underlying distribution of the data, but these benefits come with costs. Like any nonparametric test, Friedman’s anova can be less powerful than parametric alternatives where the distribution is known, and the rank transformation can obscure information inherent to the underlying measurement, magnifying trivial differences and neutralising significant inter-correlations. But there is no need to pay the costs of Friedman’s anova for evaluating chord estimation. Fundamentally, wcsr is a proportion, specifically the expected proportion of audio frames that an estimation algorithm will label correctly, and as such, it fits naturally into logistic regression (i.e., a logit model). Likewise, dhd is constrained to fall between 0 and 100 percent, and thus it is also suitable for the same type of analysis. The remainder of this paper describes how logistic regression can be used to compare chord estimation algorithms, using mirex results from 2013 to illustrate four key benefits: easier interpretation, greater statistical power, built-in correlation estimates for identifying relationships among algorithms, and better detection of outliers.
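Since csr and wcsr are nothing more than proportions of correctly labelled frames (per song, and pooled with song-length weights across songs), a minimal sketch of how they are assembled from per-song frame counts may help fix the quantities that the models below operate on; the variable names and numbers are hypothetical rather than taken from the mirex tooling.

```python
import numpy as np

def csr(correct_frames, total_frames):
    """Chord symbol recall for one song: the proportion of correctly labelled frames."""
    return correct_frames / total_frames

def wcsr(correct_frames, total_frames):
    """Weighted chord symbol recall over a set of songs.

    One entry per song in each argument.  Weighting each song's csr by its
    length is the same as pooling the frame counts.
    """
    correct_frames = np.asarray(correct_frames, dtype=float)
    total_frames = np.asarray(total_frames, dtype=float)
    return correct_frames.sum() / total_frames.sum()

# Hypothetical example: three songs of different lengths.
correct, total = [800, 1500, 300], [1000, 2000, 1000]
print([csr(c, t) for c, t in zip(correct, total)])   # per-song csr: 0.8, 0.75, 0.3
print(wcsr(correct, total))                          # length-weighted: 0.65
```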
2. LOGISTIC REGRESSION WITH GEES

Proportions cannot be distributed normally because they are supported exclusively on [0, 1], and thus they present challenges for traditional techniques of statistical analysis. Logit models are designed to handle these challenges without sacrificing the simplicity of the usual linear function relating parameters and covariates [1, ch. 4]:

\pi(x; \beta) = \frac{e^{x^{\top}\beta}}{1 + e^{x^{\top}\beta}},   (1)

or equivalently

\log \frac{\pi(x; \beta)}{1 - \pi(x; \beta)} = x^{\top}\beta,   (2)

where π represents the relative frequency of ‘success’ given the values of covariates in x and parameters β. In the case of a basic model for mirex ace, x would identify the algorithm and π would be the relative frequency of correct chord labels for that algorithm (i.e., wcsr). In the case of data like ace results, where there are proportions p_i of correct labels over n_i analysis frames rather than binary successes or failures, i indexing all combinations of individual songs and algorithms, logistic regression assumes that each p_i represents the observed proportion of successes among n_i conditionally-independent binary observations, or more formally, that the p_i are distributed binomially:

f_{P \mid N, X}(p \mid n, x; \beta) = \binom{n}{pn}\, \pi^{pn} (1 - \pi)^{(1 - p)n}.   (3)

The expected value for each p_i is naturally π_i = π(x_i; β), the overall relative frequency of success given x_i:

\mathrm{E}[P \mid N, X] = \pi(x; \beta).   (4)

Logistic regression models are most often fit by the maximum-likelihood technique, i.e., one is seeking a vector β̂ to maximise the log-likelihood given the data:

\ell_{P \mid N, X}(\beta; p, n, X) = \sum_i \left[ \log \binom{n_i}{p_i n_i} + p_i n_i \log \pi_i + (1 - p_i) n_i \log(1 - \pi_i) \right].   (5)

One thus solves the system of likelihood equations for β, whereby the gradient of Equation 5 is set to zero:

\nabla_{\beta}\, \ell_{P \mid N, X}(\beta; p, n, X) = \sum_i (p_i - \pi_i)\, n_i x_i = 0   (6)

and so

\sum_i p_i n_i x_i = \sum_i \pi_i n_i x_i.   (7)

In the case of mirex ace evaluation, each x_i is simply an indicator vector to partition the data by algorithm, and thus β̂ is the parameter vector for which π_i equals the song-length-weighted mean over all p_i for that algorithm.

2.1 Quasi-Binomial Models

Under a strict logit model, the variance of each p_i is inversely proportional to n_i:

\operatorname{var}[P \mid N, X] = \frac{1}{n}\, \pi(1 - \pi).   (8)

Equation 8 only holds, however, if the estimates of chord labels for each audio frame are independent. For ace, this is unrealistic: only the most naïve algorithms treat every frame independently. Some kind of time-dependence structure is standard, most frequently a hidden Markov model or some close derivative thereof. Hence one would expect that the variance of wcsr estimates should be rather larger than the basic logit model would suggest. This type of problem is extremely common across disciplines, so much so that it has been given a name, over-dispersion, and some authors go so far as to state that ‘unless there are good external reasons for relying on the binomial assumption [of independence], it seems wise to be cautious and to assume that over-dispersion is present to some extent unless and until it is shown to be absent’ [14, p. 125]. One standard approach to handling over-dispersion is to use a so-called quasi-likelihood [1, §4.7]. In case of logistic regression, this typically entails a modification to the assumption on the distribution of the p_i that includes an additional dispersion parameter φ. The expected values are the same as a standard binomial model, but

\operatorname{var}[P \mid N, X] = \frac{\phi}{n}\, \pi(1 - \pi).   (9)

These models are known as quasi-likelihood models because one loses a closed-form solution for the actual probability distribution f_{P|N,X}; one knows only that the p_i behave something like binomially-distributed variables, with identical means but proportionally more variance. The parameter estimates β̂ and predictions π(·; β̂) for a quasi-binomial model are the same as ordinary logistic regression, but the estimated variance-covariance matrices are scaled by the estimated dispersion parameter φ̂ (and likewise the standard errors are scaled by its square root). The dispersion parameter is estimated so that the theoretical variance matches the empirical variance in the data, and because of the form of Equation 9, it renders any scaling considerations for the n_i moot. Other approaches to handling over-dispersion include beta-binomial models [1, §13.3] and beta regression [5], but we prefer the simplicity of the quasi-likelihood model.

2.2 Generalised Estimating Equations (gees)

The quasi-binomial model achieves most of what one would be looking for when evaluating ace for mirex: it handles proportions naturally, is consistent with the weighted averaging used to compute wcsr, and adjusts for over-dispersion in a way that also eliminates any worries about scaling. Nonetheless, it is slightly over-conservative for evaluating ace. As discussed earlier, quasi-binomial models are necessary to account for over-dispersion, and one important source of over-dispersion in these data is the lack of independence of chord estimates from most algorithms within the same song. mirex exhibits another important violation of the independence assumption, however: all algorithms are tested on the same sets of songs, and some songs are clearly more difficult than others. Put differently, one does not expect the algorithms to perform completely independently of one another on the same song but rather expects a certain correlation in performance across the set of songs. By taking that correlation into account, one can improve the precision of estimates, particularly the precision of pair-wise comparisons [1, §10.1].

A relatively straightforward variant of quasi-likelihood known as generalised estimating equations (gees) incorporates this type of correlation [1, ch. 11]. With the gee approach, rather than predicting each p_i individually, one predicts complete vectors of proportions p_i for each relevant group, much as Friedman’s test seeks to estimate ranks within each group. For ace, the groups are songs, and thus one considers the observations to be vectors p_i, one for each song, where p_{ij} represents the csr or segmentation score for algorithm j on song i. Analogous to the case of ordinary quasi-binomial or logistic regression,

\mathrm{E}[P_j \mid N, X_j] = \pi(x_j; \beta).   (10)

Likewise, analogous to the quasi-binomial variance,

\operatorname{var}[P_j \mid N, X_j] = \frac{\phi}{n}\, \pi_j(1 - \pi_j).   (11)

Because the gee approach is concerned with vector-valued estimates rather than point estimates, it also involves estimating a full variance-covariance matrix. In addition to β and φ, the approach requires a further vector of parameters α and an a priori assumption on the correlation structure of the P_j in the form of a function R(α) that yields a correlation matrix. (One might, for example, assume that the P_j are exchangeable, i.e., that every pair shares a common correlation coefficient.) Then if B is a diagonal matrix such that B_{jj} = var[P_j | N, X_j],

\operatorname{cov}[P \mid N, X] = B^{1/2} R(\alpha)\, B^{1/2}.   (12)

If all of the P_j are uncorrelated with each other, then this formula reduces to the basic quasi-binomial model, which assumes a diagonal covariance matrix. The final step of gee estimation adjusts Equation 12 according to the actual correlations observed in the data, and as such, gees are quite robust in practice even when the a priori assumptions about the correlation structure are incorrect [1, §11.4.2].
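As a concrete illustration of Equations 5-9 (and not the code behind the published results), the sketch below exploits the fact that each x_i is an indicator vector: solving Equation 7 reduces to song-length-weighted means per algorithm, after which a dispersion estimate can be read off the residuals. The Pearson-based estimator for φ is a standard quasi-likelihood choice and is assumed here rather than stated in the text.

```python
import numpy as np

def fit_indicator_logit(p, n, algorithm):
    """Closed-form fit of the per-algorithm (quasi-)binomial logit model.

    p, n, algorithm have one entry per (song, algorithm) pair: the proportion
    of correctly labelled frames, the number of frames, and an algorithm label.
    Because x_i is an indicator vector, pi_hat for an algorithm is the
    song-length-weighted mean of its p_i (its wcsr) and beta_hat is the
    corresponding log-odds.  phi_hat is the usual Pearson-based dispersion
    estimate for a quasi-binomial model (an assumption of this sketch).
    """
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    algorithm = np.asarray(algorithm)
    labels = np.unique(algorithm)

    pi_hat = {a: np.average(p[algorithm == a], weights=n[algorithm == a])
              for a in labels}
    beta_hat = {a: np.log(pi_hat[a] / (1.0 - pi_hat[a])) for a in labels}

    mu = np.array([pi_hat[a] for a in algorithm])
    pearson = np.sum(n * (p - mu) ** 2 / (mu * (1.0 - mu)))
    phi_hat = pearson / (p.size - labels.size)
    return pi_hat, beta_hat, phi_hat
```

A full gee fit with an exchangeable working correlation, as in Equations 10-12, would additionally pool the per-song vectors of proportions; implementations of gees are available in standard statistical packages.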
3. ILLUSTRATIVE RESULTS

mirex ace 2013 evaluated 12 algorithms according to a battery of eight rubrics (wcsr on five harmonic vocabularies and three segmentation measures) on each of three different data sets (the Isophonics set, including music from the Beatles, Queen, and Zweieck [12], and two versions of the McGill Billboard set, including music from the American pop charts [4]). There is insufficient space to present the results of logistic regression on all combinations, and so we will focus on a single one of the data sets, the Billboard 2013 test set. In some cases, logistic regression allows us to speak to all measures (11 592 observations), but in general, we will also restrict ourselves to discussing the newest and most challenging of the harmonic vocabularies for wcsr: Vocabulary V (1932 observations), which includes major chords, minor chords, major sevenths, minor sevenths, dominant sevenths, and the complete set of inversions of all of the above. We are interested in four key questions.

1. How do pairwise comparisons under logistic regression compare to pairwise comparisons with Friedman’s anova? Is logistic regression more powerful?
2. Are there differences among algorithms as the harmonic vocabularies get more difficult, or is the drop in performance uniform? In other words, is there a benefit to continuing with so many vocabularies?
3. Are all ace algorithms making similar mistakes, or do they vary in their strengths and weaknesses?
4. Which algorithm-song pairs exhibited unexpectedly good or bad performance, and is there anything to be learned from these observations?

3.1 Pairwise Comparisons

The boxplots in Figure 1 give a more detailed view of the performance of each algorithm than Table 1. The figure is restricted to Vocabulary V, with the algorithms in descending order by wcsr. Figure 1a comes from Friedman’s anova weighted by song length, and thus its y-axis reflects not csr directly but the per-song ranks with respect to csr. Figure 1b comes from quasi-binomial regression estimated with gees, as described in Section 2. Its y-axis does reflect per-song csr.

[Figure 1: two boxplot panels over the twelve algorithms; (a) Friedman’s anova, y-axis “Rank per Song (1 low; 12 high)”; (b) Logistic Regression, y-axis “Chord-Symbol Recall (CSR)”.]
Figure 1. Boxplots and compact letter displays for the mirex ace 2013 results on the Billboard 2013 test set with vocabulary V (seventh chords and inversions), weighted by song length. Bold lines represent medians and filled dots means. N = 161 songs per algorithm. Given the respective models, there are insufficient data to distinguish among algorithms sharing a letter, correcting to hold the fdr at α = .005. Although Friedman’s anova detects 2 more significant pairwise differences than logistic regression (45 vs. 43), it operates on a different scale than csr and misorders algorithms relative to wcsr.

Above the boxplots, all significant pairwise differences are recorded as a compact letter display. In the interest of reproducible research, we used a stricter α = .005 threshold for reporting pairwise comparisons with the more contemporary false-discovery-rate (fdr) approach of Benjamini and Hochberg, as opposed to more traditional Tukey tests at α = .05 [2, 9]. Within either of the subfigures, the difference in performance between two algorithms that share any letter in the compact letter display is not statistically significant. Overall, Friedman’s anova found 2 more significant pairwise differences than logistic regression.

3.2 Effect of Vocabulary

To test the utility of the new evaluation vocabularies, we ran both Friedman anovas (ranked separately for each vocabulary) and logistic regressions and looked for significant interactions among the algorithm, inversions (present or absent from the vocabulary) and the complexity of the vocabulary (root only, major-minor, or major-minor with 7ths). Under Friedman’s anova, there was a significant Algorithm × Complexity interaction, F(22, 9440) = 3.21, p < .001. The logistic regression model identified a significant three-way Algorithm × Complexity × Inversions interaction, χ²(12) = 37.35, p < .001, but the additional interaction with inversions should be interpreted with care: only one algorithm (cf2) attempts to recognise inversions.

3.3 Correlation Matrices

Table 2 presents the inter-correlations of wcsr between algorithms, rank-transformed (Spearman’s correlations, analogous to Friedman’s anova) in the upper triangle, and in the lower triangle, as estimated from logistic regression with gees. Significant correlations are marked, again controlling the fdr at α = .005.

         ko2    nmsd2  cb4    nmsd1  cb3    ko1    pp4    pp3    cf2    ng1    ng2    sb8
ko2      –      .25*   .41*   .30*   .34*   −.04   −.22   −.49*  .09    −.54*  .09    −.32*
nmsd2    .07    –      .39*   .60*   .10    −.42*  .08    −.46*  .19    −.42*  .17    −.44*
cb4      .11    −.01   –      .53*   .76*   −.51*  −.16   −.61*  .24*   −.60*  .17    −.44*
nmsd1    −.05   .49*   .12    –      .42*   −.51*  .06    −.53*  .42*   −.56*  .16    −.52*
cb3      .10    −.25*  .47*   −.17   –      −.29*  −.07   −.37*  .17    −.41*  −.03   −.46*
ko1      .03    −.20   −.46*  −.45*  −.19   –      −.05   .68*   −.49*  .68*   −.50*  .00
pp4      −.41*  −.19   −.30*  −.08   −.26*  −.10   –      .22    .06    .04    −.09   −.32*
pp3      −.44*  −.36*  −.48*  −.45*  −.14   .42*   .37*   –      −.51*  .85*   −.54*  .08
cf2      −.03   .00    .09    .27*   −.08   −.41*  −.03   −.48*  –      −.47*  .50*   −.33*
ng1      −.35*  −.33*  −.38*  −.44*  −.17   .50*   .00    .66*   −.48*  –      −.40*  .08
ng2      .05    .02    .08    .17    −.16   −.52*  .05    −.48*  .48*   −.40*  –      −.16
sb8      −.01   −.06   −.09   −.10   −.08   .05    −.03   .04    −.14   −.10   −.11   –

Table 2. Pearson’s correlations on the coefficients from logistic regression (wcsr) for the Billboard 2013 test set with vocabulary V (lower triangle); Spearman’s correlations for the same data (upper triangle). N = 161 songs per cell. Starred correlations are significant at α = .005, controlling for the fdr. A set of algorithms (viz., ko1, pp3, ng1, and sb8) stands out for negative correlations with the top performers; in general, these algorithms did not attempt to recognise seventh chords.

Positive correlations do not necessarily imply that the algorithms perform similarly; rather it implies that they find the same songs relatively easy or difficult.
Negative correlations imply that songs that one algorithm finds difficult are relatively easy for the other algorithm.

3.4 Outliers

To identify outliers, we considered all evaluations on the Billboard 2013 test set and examined the distribution of residuals. Chauvenet’s criterion for outliers in a sample of this size is to lie more than 4.09 standard deviations from the mean [16, §6.2]. Under Friedman’s anova, Chauvenet’s criterion identified 7 extreme data points. These are all for algorithm sb8, a submission with a programming bug that erroneously returned alternating C- and B-major chords regardless of the song, on songs that were so difficult for most other algorithms that the essentially random approach of the bug did better. Under the logistic regression model, the criterion identified 26 extreme points. Here, the unexpected behaviour was primarily for songs that are tuned a quarter-tone off from standard tuning (A4 = 440 Hz). The ground truth necessarily is ‘rounded off’ to standard tuning in one direction or the other, but in cases where an otherwise high-performing algorithm happened to round off in the opposite direction, the performance is markedly low.

4. DISCUSSION

We were surprised to find that in terms of distinguishing between algorithms, Friedman’s anova was in fact more powerful than logistic regression, detecting a few extra significant pairs. Nonetheless, the two approaches yield substantially equivalent broad conclusions: that a group of top performers – cb3, cb4, ko2, nmsd1, and nmsd2 – are statistically indistinguishable from each other, with ko1 also indistinguishable from the lower end of this group. Moreover, having now benefited from years of study, wcsr is a reasonably intuitive and well-motivated measure of ace performance, and it is awkward to have to work on Friedman’s rank scale instead, especially since it ultimately ranks the algorithms’ overall performance in a slightly different order than the headline wcsr-based results.

Friedman’s anova did exhibit less power for our question about interactions between algorithms and differing chord vocabularies. Again, wcsr as a unit and as a concept is highly meaningful for chord estimation, and there is a conceptual loss from rank transformation. Given the rank transformation, Friedman’s anova can only be sensitive to reconfigurations of relative performance as the vocabularies become more difficult; logistic regression can also be sensitive to different effect sizes across algorithms even when their relative ordering remains the same. It was encouraging to see that under either statistical model, there was a benefit to evaluating with multiple vocabularies.

That encouraged us to examine the inter-correlations for the performance of the algorithms. Figure 2 summarises the original correlation matrix in Table 2 more visually by using the correlations from logistic regression as the basis of a hierarchical clustering. Two clear groups emerge, both from the clustering and from minding negative correlations in the original matrix: one relatively low-performing group including ko1, pp3, ng1, and sb8, and one relatively high-performing group including all others but for perhaps pp4, which does not seem to correlate strongly with any other algorithm. The shape of the equivalent tree based on Spearman’s correlations is similar but for joining pp4 with sb8 instead of the high-performing group.
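In outline, the tree in Figure 2 can be reproduced with standard tools by treating 1 − r as Pearson’s distance and applying complete-linkage agglomerative clustering to the algorithm-by-algorithm correlation matrix. The sketch below assumes a symmetric correlation matrix (for instance, the lower triangle of Table 2 mirrored) and illustrates the procedure; it is not the authors’ script.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def cluster_algorithms(corr):
    """Complete-linkage clustering of algorithms from a correlation matrix.

    corr : square, symmetric array of inter-algorithm correlations in [-1, 1].
    Returns a scipy linkage matrix that dendrogram() can draw.
    """
    dist = 1.0 - np.asarray(corr, dtype=float)   # Pearson's distance
    np.fill_diagonal(dist, 0.0)                  # exact zeros on the diagonal
    return linkage(squareform(dist, checks=False), method="complete")

# Usage (hypothetical names): dendrogram(cluster_algorithms(corr), labels=algorithm_names)
# draws a tree analogous to Figure 2.
```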
Table 1 uncovers the secret behind the low performers: ko1 excepted, none of the low-performing algorithms attempt to recognise seventh chords, which comprise 29 percent of all chords under Vocabulary V. Furthermore, we performed an additional evaluation of seventh chords only, in the style of [15] and using their software available online. 4 From the resulting low score of ko1, we can deduce that this algorithm is able to recognise seventh chords in theory, but that it was most likely trained on the relatively seventh-poor Isophonics corpus (only 15 percent of all chords). ko2 is the same algorithm trained directly on the mirex Billboard training corpus, and with that training, it becomes a top performer.

4 https://github.com/jpauwels/MusOOEvaluator

[Figure 2: hierarchical-clustering dendrogram over the twelve algorithms; y-axis “Pearson’s Distance” (0.0 to 0.8).]
Figure 2. Hierarchical clustering of algorithms based on wcsr for the Billboard 2013 test set with vocabulary V, Pearson’s distance as derived from the estimated correlation matrix under logistic regression, and complete linkage. The group of algorithms that is negatively correlated with the top performers appears at the left. pp4 stands out as the most idiosyncratic performer.

Our analysis of outliers again showed Friedman’s anova to be less powerful than logistic regression, as one would expect given the range restrictions of the rank transformation. But here also the more important advantage of logistic regression is the ability to work on the wcsr scale. Outliers under the logistic regression model are also points that have an unusually strong effect on the reported results. In our analysis, they highlight the practical consequences of the well-known problem of atypically-tuned commercial recordings. Although we would not propose deleting outliers, it is sobering to know that tuning problems may be having an outsized effect on our headline evaluation figures. It might be worth considering allowing algorithms their best score in keys up to a semitone above or below the ground truth.

Overall, we have shown that as ace becomes more established and its evaluation more thorough, it is useful to use a subtler statistical model for comparative analysis. We recommend that future mirex ace evaluations use logistic regression in preference to Friedman’s anova. It preserves the natural units and scales of wcsr and segmentation analysis, is more powerful for many (although not all) statistical tests, and when augmented with gees, it allows for a detailed correlational analysis of which algorithms tend to have problems with the same songs as others and which have perhaps genuinely broken innovative ground. This is by no means to suggest that Friedman’s test is a bad test in general – its near-universal applicability makes it an excellent choice in many circumstances, including many other mirex evaluations – but for ace, we believe that the extra understanding logistic regression can offer may help researchers predict which techniques are most promising for breaking the current performance plateau.

5. REFERENCES

[1] A. Agresti. Categorical Data Analysis. Wiley, New York, 2nd edition, 2007.
[2] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B, 57(1):289–300, 1995.
[3] J. A. Burgoyne. Stochastic Processes and Database-Driven Musicology. PhD thesis, McGill U., Montréal, QC, 2012.
[4] J. A. Burgoyne, J. Wild, and I. Fujinaga. An expert ground-truth set for audio chord recognition and music analysis. In Proc. Int. Soc. Music Inf. Retr., pages 633–38, Miami, FL, 2011.
[5] S. Ferrari and F. Cribari-Neto. Beta regression for modelling rates and proportions. J. Appl. Stat., 31(7):799–815, 2004.
[6] W. B. de Haas, J. P. Magalhães, D. ten Heggeler, G. Bekenkamp, and T. Ruizendaal. Chordify: Chord transcription for the masses. Demo at the Int. Soc. Music Inf. Retr. Conf., Curitiba, Brazil, 2012.
[7] W. B. de Haas, J. P. Magalhães, R. C. Veltkamp, and F. Wiering. Harmtrace: Improving harmonic similarity estimation using functional harmony analysis. In Proc. Int. Soc. Music Inf. Retr., pages 67–72, Miami, FL, 2011.
[8] C. Harte. Towards Automatic Extraction of Harmony Information from Music Signals. PhD thesis, Queen Mary, U. London, 2010.
[9] V. E. Johnson. Revised standards for statistical evidence. P. Nat’l Acad. Sci. USA, 110(48):19313–17, 2013.
[10] M. Khadkevich and M. Omologo. Large-scale cover song identification using chord profiles. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 233–38, Curitiba, Brazil, 2013.
[11] M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li. Applied Linear Statistical Models. McGraw-Hill, Boston, MA, 5th edition, 2005.
[12] M. Mauch. Automatic Chord Transcription from Audio Using Computational Models of Musical Context. PhD thesis, Queen Mary, U. London, 2010.
[13] M. Mauch, S. Dixon, C. Harte, M. Casey, and B. Fields. Discovering chord idioms through Beatles and Real Book songs. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 255–58, Vienna, Austria, 2007.
[14] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, Boca Raton, FL, 2nd edition, 1989.
[15] J. Pauwels and G. Peeters. Evaluating automatically estimated chord sequences. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 749–53, Vancouver, British Columbia, 2013.
[16] J. R. Taylor. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books, Sausalito, CA, 2nd edition, 1997.

MERGED-OUTPUT HMM FOR PIANO FINGERING OF BOTH HANDS

Eita Nakamura, National Institute of Informatics, Tokyo 101-8430, Japan, [email protected]
Nobutaka Ono, National Institute of Informatics, Tokyo 101-8430, Japan, [email protected]
Shigeki Sagayama, Meiji University, Tokyo 164-8525, Japan, [email protected]

ABSTRACT

This paper discusses a piano fingering model for both hands and its applications. One of our motivations behind the study is automating piano reduction from ensemble scores. For this, quantifying the difficulty of piano performance is important, for which a fingering model of both hands should be relevant. Such a fingering model is proposed that is based on a merged-output hidden Markov model and can be applied to scores in which the voice part for each hand is not indicated. The model is applied to the decision of fingering for both hands and to voice-part separation, the automation of which is itself of great use and was previously difficult. A measure of difficulty of performance based on the fingering model is also proposed and yields reasonable results.
1. INTRODUCTION Music arrangement is one of the most important musical activities, and its automation certainly has attractive applications. One common form is piano arrangement of ensemble scores, whose purposes are, among others, to enable pianists to enjoy a wider variety of pieces and to accompany other instruments by substituting the role of orchestra. While certain piano reductions have high technicality and musicality as in the examples by Liszt [8], those for vocal scores of operas and reduction scores of orchestra accompaniments are often faithful to the original scores in most parts. The most faithful reduction score is obtained by gathering every note in the original score, but the result can be too difficult to perform, and arrangement such as deleting notes is often in order. In general, the difficulty of a reduction score can be reduced by arrangement, but then the fidelity also decreases. If one can quantify the performance difficulty and the fidelity to the original score, the problem of “minimal” piano reduction can be considered as an optimization problem of the fidelity given constraints on the performance difficulty. A method for guitar arrangement based on probabilistic model with a similar formalization is proposed in Ref. [5]. This paper is a step toward a realization of piano reduction algorithm based on the formalization. The playability of piano passages is discussed in Refs. [3, 2] in connection with automatic piano arrangement. There, constraints such as the maximal number of notes in each hand, the maximal interval being played, say, 10th, and the minimal time interval of a repeated note are considered. Although these constraints are simple and effective to some extent, the actual situation is more complicated as manifested in the fact that, for example, the playability can change with tempos and players can arpeggiate chords that cannot be played simultaneously. In addition, the playability can depend on the technical level of players [3]. Given these problems, it seems appropriate to consider performance difficulty that takes values in a range. There are various measures and causes of performance difficulty including player’s movements and notational complexity of the score [12, 1, 15]. Here we focus on the difficulty of player’s movements, particularly piano fingering, which is presumably one of the most important factors. The difficulty of fingering is closely related to the decision of fingering [4, 7, 13, 16]. Given the current situation that a method of determining the fingering costs from first principles is not established, however, it is also effective to take a statistical approach, and consider the naturalness of fingering in terms of probability obtained from actual fingering data. With a statistical model of fingering, the most natural fingering can be determined, and one can quantify the difficulty of fingering in terms of naturalness. This will be explained in Secs. 2 and 3. The practical importance of piano fingering and its applications are discussed in Ref. [17]. Since voice parts played by both hands are not a priori separated or indicated in the original ensemble score, a fingering model must be applicable in such a situation. Thus, a fingering model for both hands and an algorithm to separate voice parts are necessary. We propose such a model and an algorithm based on merged-output hidden Markov model (HMM), which is suited for modeling multi-voicepart structured phenomena [10, 11]. 
Since multi-voice-part structure of music is common and voice-part separation can be applied for a wide range of information processing, the results are itself of great importance. 2. MODEL FOR PIANO FINGERING FOR BOTH HANDS c Eita Nakamura, Nobutaka Ono, Shigeki Sagayama. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Eita Nakamura, Nobutaka Ono, Shigeki Sagayama. “ Merged-Output HMM for Piano Fingering of Both Hands ”, 15th International Society for Music Information Retrieval Conference, 2014. 2.1 Model for one hand Before discussing the piano fingering model for both hands, let us discuss the fingering model for one hand. Piano 531 15th International Society for Music Information Retrieval Conference (ISMIR 2014) fingering models and algorithms for decision of fingering have been studied in Refs. [13, 16, 4, 18, 19, 20, 7]. Here we extend the model in Ref. [19] to including chords. Piano fingering for one hand, say, the right hand, is indicated by associating a finger number fn = 1, · · · , 5 (1 = thumb, 2 = the index finger, · · · , 5 = the little finger) to each note pn in a score 1 , where n = 1, · · · , N indexes notes in the score and N is the number of notes. We consider the probability of a fingering sequence f1:N = N (fn )N n=1 given a score, or a pitch sequence, p1:N = (pn )n=1 , which is written as P (f1:N |p1:N ). As explained in detail in Sec. 3.1, an algorithm for fingering decision can be obtained by estimating the most probable candidate fˆ1:N = argmax P (f1:N |p1:N ). The fingering of a particular note f1:N is more influenced by neighboring notes than notes that are far away in score position. Dependence on neighboring notes is most simply described by that on adjacent notes, and it can be incorporated with a Markov model. It also has advantages in efficiency in maximizing probability and setting model parameters. Although the probability of fingering may depend on inter-onset intervals between notes, the dependence is not considered here for simplicity. As proposed in Ref. [18, 19], the fingering model can be constructed with an HMM. Supposing that notes in score are generated by finger movements and the resulting performed pitches, their probability is represented with the probability that a finger would be used after another finger P (fn |fn−1 ), and the probability that a pitch would result from succeeding two used fingers. The former is called the transition probability, and the latter output probability. The output probability of pitch depends on the previous pitch in addition to the corresponding used fingers, and it is described with a conditional probability P (pn |pn−1 , fn−1 , fn ). In terms of these probabilities, the probability of notes and fingerings is given as P (p1:N , f1:N ) = N P (pn |pn−1 , fn−1 , fn )P (fn |fn−1 ), n=1 (1) where the initial probabilities are written as P (f1 |f0 ) ≡ P (f1 ) and P (p1 |p0 , f0 , f1 ) ≡ P (p1 |f1 ). The probability P (f1:N |p1:N ) can also be given accordingly. To train the model efficiently, we assume some reasonable constraints on the parameters. First we assume that the probability depends on pitches only through their geometrical positions on the keyboard which is represented as a two-dimensional lattice (Fig. 1). We also assume the translational symmetry in the x-direction and the time inversion symmetry for the output probability. 
If the coordinate on the keyboard is written as (p) = ( x (p), y (p)), the assumptions mean that the output probability has a form P (p |p, f, f ) = F ( x (p ) − x (p), y (p ) − y (p); f, f ), and it satisfies F ( x (p ) − x (p), y (p ) − y (p); f, f ) = F ( x (p) − x (p ), y (p) − y (p ); f , f ). A model for each hand can be obtained in this way, and it is written as Fη ( x (p ) − x (p), y (p ) − y (p); f, f ) with η = L, R. 1 We do not consider the so-called finger substitution in this paper. 532 Figure 1. Keyboard lattice. Each key on a keyboard is represented by a point of a two-dimensional lattice. It is further assumed that these probabilities are related by reflection in the x-direction, which yields FL ( x (p ) − x (p), y (p )− y (p); f, f ) = FR ( x (p )− x (p), y (p )− y (p); f, f ). The above model can be extended to be applied for passages with chords, by converting a polyphonic passage to a monophonic passage by virtually arpeggiating the chords [7]. Here, notes in a chord are ordered from low pitch to high pitch. The parameter values can be obtained from fingering data. 2.2 Model for both hands Now let us consider the fingering of both hands in the situation that it is unknown a priori which of the notes are to be played by the left or right hand. The problem can be stated as associating the fingering information (ηn , fn )N n=1 for the pitch sequence p1:N , where ηn = L, R indicates the hand with which the n-th note is played. One might think to build a model of both hands by simply extending the one-hand model and using (ηn , fn ) as a latent variable. However, this is not an effective model as far as it is a first-order Markov model since, for example, probabilistic constraints between two successive notes by the right hand cannot be directly incorporated when they are interrupted by other notes of the left hand. Using higher-order Markov models leads to the problem of increasing number of parameters that is hard to train as well as the increasing computational cost. The underlying problem is that the model cannot capture the structure of dependencies that is stronger among notes in each hand than those across hands. Recently an HMM, called merged-output HMM, is proposed that is suited for describing such voice-part-structured phenomena [10, 11]. The basic idea is to construct a model for both hands by starting with two parallel HMMs, called part HMMs, each of which corresponds to the HMM for fingering of each hand, and then merging the outputs of the part HMMs. Assuming that only one of the part HMMs transits and outputs an observed symbol at each time, the state space of the merged-output HMM is given as a triplet k = (η, fL , fR ) of the hand information η = L, R and fingerings of both hands: η indicate which of the HMMs transits, and fL and fR indicate the current states of the part HMMs. Let the transition and output probabilities 15th International Society for Music Information Retrieval Conference (ISMIR 2014) of the part HMMs be aηf f = Pη (f |f ) and bηf f ( ) = Fη ( ; f, f ) (η = L, R). Then the transition and output probabilities of the merged-output HMM are given as αL aL fL f δ fR fR , η = L; αR aR fR f δ fL fL , η = R, L akk = bkk ( ) = R bL fL f ( ), η = L; bR fR f ( ), η = R, L R 0.12 0.1 0.08 0.06 (2) 0.04 (3) 0.02 0 -40 where δ denotes Kronecker’s delta. Here, αL,R represent the probability of choosing which of the hands to play the note, and practically, they satisfy αL ∼ αR ∼ 1/2. As shown in Ref. 
[11], certain interaction factors can be introduced to Eqs. (2) and (3). Although such interactions may be important in the future [14], we confine ourselves to the case of no interactions in this paper for simplicity. By estimating the most probable sequence k̂1:N , both the optimal configuration of hands η̂1:N , which yields a voice-part separation, and that of fingers (fˆL , fˆR )1:N are obtained. For details of inference algorithms and other aspects of merged-output HMM, see Ref. [11]. αL aL p L p δp R p R , η = L; αR aR p R p δp L p L , η = R, L R bx (y) = δy,pη . 20 40 Figure 2. Histograms of pitch transitions in piano scores for each hand. 3. APPLICATIONS OF THE FINGERING MODEL 3.1 Algorithm for decision of fingering A direct application of the model explained in Secs. 2.1 and 2.2 is the decision of fingering. The algorithm can be derived by applying the Viterbi algorithm. For one hand, the derived algorithm is similar as the one in Ref. [19], but we reevaluated the accuracy since the present model can be applied for polyphonic passages and the details of the models are different. For evaluation, we prepared manually labeled fingerings of classical piano pieces and compared them to the one estimated with the algorithm. The test pieces were Nos. 1, 2, 3, and 8 of Bach’s two-voice inventions, and the introduction and exposition parts from Beethoven’s 8th piano sonata in C minor. The training and test of the algorithm was done with the leave-one-out cross validation method for each piece. To avoid zero frequencies in the training, we added a uniform count of 0.1 for every bin. The averaged accuracy was 56.0% (resp. 55.4%) for the right (resp. left) hand where the number of notes was 5202 (resp. 5539). Since the training data was not big, and we had much higher rate of more than 70% for closed test, the accuracy may improve if a larger set of training data is given. The results were better than the reported values in Ref. [19]. The reason would be that the constraints of the model in the reference was too strong, which is relaxed in the present model. For detailed analysis of the estimation errors, see Ref. [19]. The model explained in the previous section involves both hands and the used hand and fingers are modeled simultaneously. We can alternatively consider the problem of associating fingerings of both hands as first separating voice parts for both hands, and then associating fingerings for notes in each voice part. In this subsection, a simple model that can be used for voice-part separation is given. The model is also based on a simpler merged-output HMM, and it yields more efficient algorithm for voice-part separation. We consider a merged-output HMM with a hidden state x = (η, pL , pR ), where η = L, R indicates the voice part, and pL,R describes the pitch played in each voice part. If the pitch sequence in the score is denoted by (yn )n , the transition and output probabilities are written as axx = 0 Interval [semitone] 2.3 Model for voice-part separation -20 (4) (5) Here the transition probability aL,R pp describes the pitch sequence in each voice part directly, without any information on fingerings. The corresponding distributions can be obtained from actual data of piano pieces, as shown in Fig. 2. 3.2 Voice-part separation So far we have considered a model of pitches and horizontal intervals for voice-part separation. The voice-partseparation algorithm can be derived by applying the Viterbi algorithm to the above model. 
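To make the voice-part-separation idea concrete, the following is a minimal dynamic-programming sketch in the spirit of the Viterbi decoding described here. It assumes hypothetical per-hand interval log-probability tables (interval_logprob) standing in for distributions such as those in Fig. 2, tracks which hand plays each note together with the most recent note in the other hand, and deliberately omits the vertical-interval constraint and the more efficient formulation discussed next.

```python
import math

def separate_hands(pitches, interval_logprob, hand_logprob=math.log(0.5)):
    """Dynamic-programming hand assignment for a note sequence (sketch).

    pitches          : MIDI pitch numbers in score order.
    interval_logprob : {'L': {...}, 'R': {...}} mapping a pitch interval
                       (current pitch minus the previous pitch in the same
                       hand) to a log-probability; hypothetical stand-ins for
                       per-hand interval distributions.
    hand_logprob     : log alpha_eta, the cost of choosing either hand (~log 1/2).

    State after note n: (hand playing note n, index of the most recent note in
    the other hand, or None).  Each hand's chain only advances when that hand
    plays, mirroring the merged-output construction.
    """
    UNSEEN = -20.0  # floor log-probability for intervals not in the table

    def step(hand, prev_pitch, pitch):
        if prev_pitch is None:                      # first note of this hand
            return 0.0
        return interval_logprob[hand].get(pitch - prev_pitch, UNSEEN)

    trellis = [{('L', None): (hand_logprob, None),
                ('R', None): (hand_logprob, None)}]
    for n in range(1, len(pitches)):
        frame = {}
        for state, (score, _) in trellis[n - 1].items():
            hand, other = state
            for new_hand in ('L', 'R'):
                if new_hand == hand:
                    prev_pitch, new_other = pitches[n - 1], other
                else:
                    prev_pitch = None if other is None else pitches[other]
                    new_other = n - 1
                cand = score + hand_logprob + step(new_hand, prev_pitch, pitches[n])
                key = (new_hand, new_other)
                if key not in frame or cand > frame[key][0]:
                    frame[key] = (cand, state)
        trellis.append(frame)

    # Backtrace the best final state to recover the hand sequence.
    state = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    hands = []
    for n in range(len(pitches) - 1, -1, -1):
        hands.append(state[0])
        state = trellis[n][state][1]
    return hands[::-1]
```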
In fact, a voice part in the score played by one hand is also constrained by vertical intervals since it is physically difficult to play a chord containing an interval far wider than a octave by one hand. The constraint on the vertical intervals can also be introduced in terms of probability. Voice-part separation between two hands can be done with the model described in Sec. 2.3, and the algorithm can be obtained by the Viterbi algorithm. In fact, we can derive a more efficient estimation algorithm which is effectively equivalent since the model has noiseless observations as in Eq. (5). It is obtained by minimizing the following potential with respect to the variables {(ηn , hn )}, hn = 0, 1, · · · , Nh for 533 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Table 1. Error rates of the voice-part-separation algorithms.The 0-HMM (resp. 1-HMM, 2-HMM) indicates the algorithm with the zeroth-order (resp. first-order, second-order) HMM. Pieces # Notes 0-HMM [%] 1-HMM [%] 2-HMM [%] Merged-output HMM [%] Bach (15 pcs) 9638 5.1 5.3 6.1 1.9 Beethoven (2 pcs) 18144 13.0 11.1 11.5 9.28 Chopin (5 pcs) 8508 5.7 4.0 4.29 3.8 Debussy (3 pcs) 3360 17.8 14.8 14.8 18.7 Total 39650 9.9 8.5 8.9 7.1 each note: V (η, h) = − ln Q(ηn−1 , hn−1 ; ηn , hn ), (6) n Q(ηn−1 , hn−1 ; ηn , hn ) (ηn ) αηn ayn−1 ,yn δhn ,hn−1 +1 , = (ηn ) , αηn ayn−2−h ,y δ n−1 n hn ,0 (a) Passage in Bach’s two-voice invention No. 1. ηn = ηn−1 ; ηn = ηn−1 . (7) Here hn is necessary to memorize the current state of the voice part opposite of ηn . The minimization of the potential can be done with dynamic programming incrementally for each n. The estimation result is the same as the one with the Viterbi algorithm applied to the model when Nh is sufficiently large, and we confirmed that Nh = 50 is sufficient to provide a good approximation. The algorithm was evaluated by applying it to several classical piano pieces. The used pieces were all pieces of Bach’s two-voice inventions, the first two piano sonatas by Beethoven, Chopin’s Etude Op. 10 Nos. 1–5, and the first three pieces in the first book of Debussy’s Préludes. For comparison, we also evaluated algorithms based on lowerorder HMMs. The zeroth-order model with transition and output probabilities P (η) and P (p|η) is almost equivalent to the keyboard splitting method, the first-order model with P (η |η) and P (δp|η, η ) and the second-order model are simple applications of HMMs whose latent variables are hand informations η = L, R. The results are shown in Table 1. In total, the mergedoutput HMM yielded the lowest error rate, with which relatively accurate voice part separation can be done. On the other hand, there were less changes in results for the lower-order HMMs, showing that the effectiveness of the merged-output HMM. In Debussy’s pieces, the error rates were relatively high since the pieces necessitate complex fingerings with wide movements of the hands. An example of the voice-part separation result is shown in Fig. 3. (b) Piano role representation of the voice-part separation result. Two voice parts are colored red and blue. Figure 3. Example of a voice-part separation result. notes in the time range of [t − Δt/2, t + Δt/2], and f (t) be the corresponding fingerings, where Δt is a width of the time range to define the time rate. Then it is given as D(t) = − ln P (p(t), f (t))/Δt. 
(8) Since the minimal time interval of successive notes are about a few 10 milli seconds and it is hard to imagine that difficulty is strongly influenced by notes that are separated more than 10 seconds, it is natural to set Δt within these extremes. The right-hand side is given by Eq. (1). It is possible to calculate D(t) for a score without indicated fingerings by replacing f (t) with the estimated fingerings fˆ(t) with the model in Sec. 2. In addition to the difficulty for both hands, that for each hand DL,R (t) can also be defined similarly. Fig. 4 shows some examples of DL,R (t) calculated for several piano pieces. Here Δt was set to 1 sec. Although it is not easy to evaluate the quantity in a strict way, the results seems reasonable and reflects generic intuition of difficulty. The invention by Bach that can be played by beginners yields DL,R that are less than about 10, the example of Beethoven’s sonata which requires middle-level technicality has DL,R around 20 to 30, and Chopin’s Fantasie Impromptu which involves fast passages and difficult fingerings has DL,R up to about 40. It is also worthy of noting that relatively difficult passages such as the fast chromatique passage of the right hand in the introduction of Beethoven’s sonata and ornaments in the right hand of the 3.3 Quantitative measure of difficulty of performance A measure of performance difficulty based on the naturalness of the fingerings can be obtained by the probabilistic fingering model. Although global structures in scores may influence the difficulty, we concentrate on the effect of local structures. It is supposed that the difficulty is additive with regard to performed notes and an increasing function of tempo. A quantity satisfying these conditions is the time rate of probabilistic cost. Let p(t) denote the sequence of 534 15th International Society for Music Information Retrieval Conference (ISMIR 2014) (a) Difficulty for right hand DR (b) Difficulty for left hand DL Figure 4. Examples of DR and DL . The red (resp. green, blue) line is for Bach’s two-voice invention No.=1, (resp. Introduction and exposition parts of the first movement of Beethoven’s eighth piano sonata, Chopin’s Fantasie Impromptu). pp. 10–12, 2012. slow part of the Fantasie Impromptu are also captured in terms of DR . [2] S.-C. Chiu et al., “Automatic system for the arrangement of piano reductions,” Proc. AdMIRe, 2009. 4. CONCLUSIONS In this paper, we considered a piano fingering model of both hands and its applications especially toward a piano reduction algorithm. First we reviewed a piano fingering model for one hand based on HMM, and then constructed a model for both hands based on merged-output HMM. Next we applied the model for constructing an algorithm for fingering decision and voice-part-separation algorithm and obtained a measure of performance difficulty. The algorithm for fingering decision yielded better results than the previously proposed one by a modification in details of the model. The results of voice-part separation is quite good and encouraging. The proposed measure of performance difficulty successfully captures the dependence on tempos and complexity of pitches and finger movements. The next step to construct a piano reduction algorithm according to the formalization mentioned in the Introduction is to quantify the fidelity of the arranged score to the original score and to integrate it with the constraints of performance difficulty. The fidelity can be described with edit probability, similarly as in Ref. 
[5], and an arrangement model can be obtained by integrating the fingering model with the edit probability. We are currently working on these issues and the results will be reported elsewhere. 5. ACKNOWLEDGMENTS This work is supported in part by Grant-in-Aid for Scientific Research from Japan Society for the Promotion of Science, No. 23240021, No. 26240025 (S.S. and N.O.), and No. 25880029 (E.N.). [3] K. Fujita et al., “A proposal for piano score generation that considers proficiency from multiple part (in Japanese),” Tech. Rep. SIGMUS, MUS-77, pp. 47–52, 2008. [4] M. Hart and E. Tsai, “Finding optimal piano fingerings,” The UMAP Journal, 21(1), pp. 167–177, 2000. [5] G. Hori et al., “Input-output HMM applied to automatic arrangement for guitars,” J. Information Processing, 21(2), pp. 264–271, 2013. [6] Z. Ghahramani and M. Jordan, “Factorial Hidden Markov Models,” Machine Learning, 29, pp. 245–273, 1997. [7] A. Al Kasimi et al., “A simple algorithm for automatic generation of polyphonic piano fingerings,” ISMIR, pp. 355–356, 2007. [8] F. Liszt, Musikalische Werke, Serie IV, Breitkopf & Härtel, 1922. [9] J. Musafia, The Art of Fingering in Piano Playing, MCA Music, 1971. [10] E. Nakamura et al., “Merged-output hidden Markov model and its applications to score following and hand separation of polyphonic keyboard music (in Japanese),” Tech. Rep. SIGMUS, 2013-EC-27, 15, 2013. 6. REFERENCES [11] E. Nakamura et al., “Merged-output hidden Markov model for score following of MIDI performance with ornaments, desynchronized voices, repeats and skips,” to appear in Proc. ICMC, 2014. [1] S.-C. Chiu and M.-S. Chen, “A study on difficulty level recognition of piano sheet music,” Proc. AdMIRe, [12] C. Palmer, “Music performance,” Ann. Rev. Psychol., 48, pp. 115–138, 1997. 535 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [13] R. Parncutt et al., “An ergonomic model of keyboard fingering for melodic fragments,” Music Perception, 14(4), pp. 341–382, 1997. [14] R. Parncutt et al., “Interdependence of right and left hands in sight-read, written, and rehearsed fingerings of parallel melodic piano music,” Australian J. of Psychology, 51(3), pp. 204–210, 1999. [15] V. Sébastien et al., “Score analyzer: Automatically determining scores difficulty level for instrumental elearning,” Proc. ISMIR, 2012. [16] H. Sekiguchi and S. Eiho, “Generating and displaying the human piano performance,” 40(6), pp. 167–177, 1999. [17] Y. Takegawa et al., “Design and implementation of a real-time fingering detection system for piano performance,” Proc. ICMC, pp. 67–74, 2006. [18] Y. Yonebayashi et al., “Automatic determination of piano fingering based on hidden Markov model (in Japanese),” Tech. Rep. SIGMUS, 2006-05-13, pp. 7– 12, 2006. [19] Y. Yonebayashi et al., “Automatic decision of piano fingering based on hidden Markov models,” IJCAI, pp. 2915–2921, 2007. [20] Y. Yonebayashi et al., “Automatic piano fingering decision based on hidden Markov models with latent variables in consideration of natural hand motions (in Japanese),” Tech. Rep. SIGMUS, MUS-71-29, pp. 179– 184, 2007. 
536 15th International Society for Music Information Retrieval Conference (ISMIR 2014) MODELING RHYTHM SIMILARITY FOR ELECTRONIC DANCE MUSIC Maria Panteli Niels Bogaards Aline Honingh University of Amsterdam, Elephantcandy, University of Amsterdam, Amsterdam, Netherlands Amsterdam, Netherlands Amsterdam, Netherlands [email protected] [email protected] [email protected] ABSTRACT A model for rhythm similarity in electronic dance music (EDM) is presented in this paper. Rhythm in EDM is built on the concept of a ‘loop’, a repeating sequence typically associated with a four-measure percussive pattern. The presented model calculates rhythm similarity between segments of EDM in the following steps. 1) Each segment is split in different perceptual rhythmic streams. 2) Each stream is characterized by a number of attributes, most notably: attack phase of onsets, periodicity of rhythmic elements, and metrical distribution. 3) These attributes are combined into one feature vector for every segment, after which the similarity between segments can be calculated. The stages of stream splitting, onset detection and downbeat detection have been evaluated individually, and a listening experiment was conducted to evaluate the overall performance of the model with perceptual ratings of rhythm similarity. &#(#*#&' (#&' '#)#&&#&( $% !" $% Figure 1: Example of a common (even) EDM rhythm [2]. 1. INTRODUCTION Music similarity has attracted research from multidisciplinary domains including tasks of music information retrieval and music perception and cognition. Especially for rhythm, studies exist on identifying and quantifying rhythm properties [16, 18], as well as establishing rhythm similarity metrics [12]. In this paper, rhythm similarity is studied with a focus on Electronic Dance Music (EDM), a genre with various and distinct rhythms [2]. EDM is an umbrella term consisting of the ‘four on the floor’ genres such as techno, house, trance, and the ‘breakbeat-driven’ genres such as jungle, drum ‘n’ bass, breaks etc. In general, four on the floor genres are characterized by a four-beat steady bass-drum pattern whereas breakbeat-driven exploit irregularity by emphasizing the metrically weak locations [2]. However, rhythm in EDM exhibits multiple types of subtle variations and embellishments. The goal of the present study is to develop a rhythm similarity model that captures these embellishments and allows for a fine inter-song rhythm similarity. c Maria Panteli, Niels Bogaards, Aline Honingh. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Maria Panteli, Niels Bogaards, Aline Honingh. “Modeling rhythm similarity for electronic dance music”, 15th International Society for Music Information Retrieval Conference, 2014. 537 The model focuses on content-based analysis of audio recordings. A large and diverse literature deals with the challenges of audio rhythm similarity. These include, amongst other, approaches to onset detection [1], tempo estimation [9,25], rhythmic representations [15,24], and feature extraction for automatic rhythmic pattern description and genre classification [5, 12, 20]. Specific to EDM, [4] study rhythmic and timbre features for automatic genre classification, and [6] investigate temporal and structural features for music generation. In this paper, an algorithm for rhythm similarity based on EDM characteristics and perceptual rhythm attributes is presented. 
The methodology for extracting rhythmic elements from an audio segment and a summary of the features extracted is provided. The steps of the algorithm are evaluated individually. Similarity predictions of the model are compared to perceptual ratings and further considerations are discussed. 2. METHODOLOGY Structural changes in an EDM track typically consist of an evolution of timbre and rhythm as opposed to a versechorus division. Segmentation is firstly performed to split the signal into meaningful excerpts. The algorithm developed in [21] is used, which segments the audio signal based on timbre features (since timbre is important in EDM structure [2]) and musical heuristics. EDM rhythm is expressed via the ‘loop’, a repeating pattern associated with a particular (often percussive) instrument or instruments [2]. Rhythm information can be extracted by evaluating characteristics of the loop: First, the rhythmic pattern is often presented as a combination of instrument sounds (eg. Figure 1), thus exhibiting a certain ‘rhythm polyphony’ [3]. To analyze this, the signal is split into the so-called rhythmic streams. Then, to describe the underlying rhythm, features are extracted for each stream based on three attributes: a) The attack phase of the onsets is considered to describe if the pattern is performed on 15th International Society for Music Information Retrieval Conference (ISMIR 2014) segmentation feature extraction rhythmic streams detection onset detection attack characterization feature vector metrical periodicity metricaldistribution distribution feature extraction similarity stream # 1 stream # 2 stream # 3 Figure 2: Overview of methodology. percussive or non-percussive instruments. Although this is typically viewed as a timbre attribute, the percussiveness of a sound is expected to influence the perception of rhythm [16]. b) The repetition of rhythmic sequences of the pattern are described by evaluating characteristics of different levels of onsets’ periodicity. c) The metrical structure of the pattern is characterized via features extracted from the metrical profile [24] of onsets. Based on the above, a feature vector is extracted for each segment and is used to measure rhythm similarity. Inter-segment similarity is evaluated with perceptual ratings collected via a specifically designed experiment. An overview of the methodology is shown in Figure 2 and details for each step are provided in the sections below. Part of the algorithm is implemented using the MIRToolbox [17]. 2.1 Rhythmic Streams Several instruments contribute to the rhythmic pattern of an EDM track. Most typical examples include combinations of bass drum, snare and hi-hat (eg. Figure 1). This is mainly a functional rather than a strictly instrumental division, and in EDM one finds various instrument sounds to take the role of bass, snare and hi-hat. In describing rhythm, it is essential to distinguish between these sources since each contributes differently to rhythm perception [11]. Following this, [15, 24] describe rhythmic patterns of latin dance music in two prefixed frequency bands (low and high frequencies), and [9] represents drum patterns as two components, the bass and snare drum pattern, calculated via non-negative matrix factorization of the spectrogram. In [20], rhythmic events are split based on their perceived loudness and brightness, where the latter is defined as a function of the spectral centroid. 
In the current study, rhythmic streams are extracted with respect to the frequency domain and loudness pattern. In particular, the Short Time Fourier Transform of the signal is computed and logarithmic magnitude spectra are assigned to bark bands, resulting in a total of 24 bands for a 44.1 kHz sampling rate. Synchronous masking is modeled using the spreading function of [23], and temporal masking is modeled with a smoothing window of 50 ms. This representation is hereafter referred to as the loudness envelope and denoted by Lb for bark bands b = 1, ..., 24. A self-similarity matrix is computed from this 24-band representation, indicating the bands that exhibit a similar loudness pattern. The novelty approach of [8] is applied to the 24 × 24 similarity matrix to detect adjacent bands that should be grouped into the same rhythmic stream. The peak locations P of the novelty curve define the index of the bark band that marks the beginning of a new stream, i.e., if P = {pi ∈ {1, ..., 24} | i = 1, ..., I} for a total number of peaks I, then stream Si consists of the bark bands b given by

Si = {b | b ∈ [pi, pi+1 − 1]} for i = 1, ..., I − 1, and SI = {b | b ∈ [pI, 24]} for i = I. (1)

An upper limit of 6 streams is considered, based on the approach of [22] that uses a total of 6 bands for onset detection and on [14] that suggests a total of three or four bands for meter analysis. The notion of rhythmic stream here is similar to the notion of 'accent band' in [14], with the difference that each rhythmic stream is formed from a variable number of adjacent bark bands. Detecting a rhythmic stream does not necessarily imply separating the instruments, since two instruments playing the same rhythm should be grouped into the same rhythmic stream. The proposed approach does not distinguish instruments that lie in the same bark band. The advantage is that the number of streams and the frequency range of each stream do not need to be predetermined but are rather estimated from the spectral representation of each song. This benefits the analysis of electronic dance music by not imposing any constraints on the possible instrument sounds that contribute to the characteristic rhythmic pattern.

2.1.1 Onset Detection

To extract onset candidates, the loudness envelope per bark band and its derivative are normalized and summed, with more weight on the loudness than on its derivative, i.e.,

Ob(n) = (1 − λ) Nb(n) + λ N′b(n) (2)

where Nb is the normalized loudness envelope Lb, N′b the normalized derivative of Lb, n = 1, ..., N the frame number for a total of N frames, and λ < 0.5 the weighting factor. This is similar to the approach described by Equation 3 in [14] with reduced λ, and is computed prior to the summation over the different streams, as suggested in [14, 22]. Onsets are detected via peak extraction within each stream, where the (rhythmic) content of stream i is defined as

Ri = Σb∈Si Ob (3)

with Si as in Equation 1 and Ob as in Equation 2. This onset detection approach incorporates methodological concepts similar to the positively evaluated algorithms for the task of audio onset detection [1] in MIREX 2012, and for tempo estimation [14] in the review of [25].
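The stream grouping of Equation (1) and the per-stream onset-detection function of Equations (2) and (3) amount to only a few array operations. The sketch below (Python/NumPy) is illustrative only: the normalization scheme, the peak-pruning rule used to enforce the six-stream limit, and the concrete value of λ are assumptions, since the paper does not fix them.

```python
import numpy as np
from scipy.signal import find_peaks

def group_bands_into_streams(novelty, n_bands=24, max_streams=6):
    """Eq. (1): novelty peaks mark the first bark band of each rhythmic stream."""
    peaks, props = find_peaks(novelty, prominence=0)
    if len(peaks) > max_streams:                    # six-stream limit; pruning by peak
        keep = np.argsort(props["prominences"])[::-1][:max_streams]   # prominence is
        peaks = np.sort(peaks[keep])                                   # an assumption
    p = list(peaks + 1)                             # 1-based bark-band indices
    return [list(range(p[i], (p[i + 1] - 1 if i + 1 < len(p) else n_bands) + 1))
            for i in range(len(p))]

def onset_detection_functions(L, streams, lam=0.2):
    """Eqs. (2)-(3): weighted sum of the normalized loudness envelope and its
    derivative per band, then summed over the bands of each stream.
    L has shape (24, n_frames); lam < 0.5 (its exact value is not given)."""
    def minmax(x):                                  # per-band min-max scaling (assumption)
        lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
        return (x - lo) / np.where(hi > lo, hi - lo, 1)
    N = minmax(L)                                   # normalized loudness envelope
    Nd = minmax(np.diff(L, axis=1, prepend=L[:, :1]))   # normalized derivative
    O = (1 - lam) * N + lam * Nd                    # Eq. (2)
    return [O[np.array(bands) - 1].sum(axis=0) for bands in streams]  # Eq. (3): R_i
```

Onsets are then obtained by peak extraction on each stream content Ri, as described in the paragraph above.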
Figure 3: Detection of rhyhmic streams using the novelty approach; first a bark-band spectrogram is computed, then its self-similarity matrix, and then the novelty [7] is applied where the novelty peaks define the stream boundaries. 2.2 Feature Extraction The onsets in each stream represent the rhythmic elements of the signal. To model the underlying rhythm, features are extracted from each stream, based on three attributes, namely, characterization of attack, periodicity, and metrical distribution of onsets. These are combined to a feature vector that serves for measuring inter-segment similarity. The sections below describe the feature extraction process in detail. 2.2.1 Attack Characterization To distinguish between percussive and non-percussive patterns, features are extracted that characterize the attack phase of the onsets. In particular, the attack time and attack slope are considered, among other, essential in modeling the perceived attack time [10]. The attack slope was also used in modeling pulse clarity [16]. In general, onsets from percussive sounds have a short attack time and steep attack slope, whereas non-percussive sounds have longer attack time and gradually increasing attack slope. For all onsets in all streams, the attack time and attack slope is extracted and split in two clusters; the ‘slow’ (non-percussive) and ‘fast’ (percussive) attack phase onsets. Here, it is assumed that both percussive and nonpercussive onsets can be present in a given segment, hence splitting in two clusters is superior to, e.g., computing the average. The mean and standard deviation of the two clusters of the attack time and attack slope (a total of 8 features) is output to the feature vector. Lag duration of maximum autocorrelation: The location (in time) of the second highest peak (the first being at lag 0) of the autocorrelation curve normalized by the bar duration. It measures whether the strongest periodicity occurs in every bar (i.e. feature value = 1), or every half bar (i.e. feature value = 0.5) etc. Amplitude of maximum autocorrelation: The amplitude of the second highest peak of the autocorrelation curve normalized by the amplitude of the peak at lag 0. It measures whether the pattern is repeated in exactly the same way (i.e. feature value = 1) or somewhat in a similar way (i.e. feature value < 1) etc. Harmonicity of peaks: This is the harmonicity as defined in [16] with adaptation to the reference lag l0 corresponding to the beat duration and additional weighting of the harmonicity value by the total number of peaks of the autocorrelation curve. This feature measures whether rhythmic periodicities occur in harmonic relation to the beat (i.e. feature value = 1) or inharmonic (i.e. feature value = 0). Flatness: Measures whether the autocorrelation curve is smooth or spiky and is suitable for distinguishing between periodic patterns (i.e. feature value = 0), and nonperiodic (i.e. feature value = 1). Entropy: Another measure of the ‘peakiness’ of autocorrelation [16], suitable for distinguishing between ‘clear’ repetitions (i.e. distribution with narrow peaks and hence feature value close to 0) and unclear repetitions (i.e. wide peaks and hence feature value increased). 2.2.3 Metrical Distribution 2.2.2 Periodicity To model the metrical aspects of the rhythmic pattern, the metrical profile [24] is extracted. 
For this, the downbeat is detected as described in Section 2.2.4, onsets per stream are quantized assuming a 44 meter and 16-th note resolution [2], and the pattern is collapsed to a total of 4 bars. The latter is in agreement with the length of a musical phrase in EDM being usually in multiples of 4, i.e., 4-bar, 8-bar, or 16-bar phrase [2]. The metrical profile of a given stream is thus presented as a vector of 64 bins (4 bars × 4 beats × 4 sixteenth notes per beat) with real values ranging between 0 (no onset) to 1 (maximum onset strength) as shown in Figure 5. For each rhythmic stream, a metrical pro- One of the most characteristic style elements in the musical structure of EDM is repetition; the loop, and consequently the rhythmic sequence(s), are repeating patterns. To analyze this, the periodicity of the onset detection function per stream is computed via autocorrelation and summed across all streams. The maximum delay taken into account is proportional to the bar duration. This is calculated assuming a steady tempo and 44 meter throughout the EDM track [2]. The tempo estimation algorithm of [21] is used. From the autocorrelation curve (cf. Figure 4), a total of 5 features are extracted: 539 15th International Society for Music Information Retrieval Conference (ISMIR 2014) tions are made: Assumption 1: Strong beats of the meter are more likely to be emphasized across all rhythmic streams. Assumption 2: The downbeat is often introduced by an instrument in the low frequencies, i.e. a bass or a kick drum [2, 13]. Considering the above, the onsets per stream are quantized assuming a 44 meter, 16-th note resolution, and a set of downbeat candidates (in this case the onsets that lie within one bar length counting from the beginning of the segment). For each downbeat candidate, hierarchical weights [18] that emphasize the strong beats of the meter as indicated by Assumption 1, are applied to the quantized patterns. Note, there is one pattern for each rhythmic stream. The patterns are then summed by applying more weight to the pattern of the low-frequency stream as indicated by Assumption 2. Finally, the candidate whose quantized pattern was weighted most, is chosen as the downbeat. 1.2 1 Bar 1 Beat 1 0.8 Normalized amplitude 0.6 0.4 0.2 0 −0.2 −0.4 −0.6 0 0.5 1 1.5 2 2.5 Lag (s) Figure 4: Autocorrelation of onsets indicating high periodicities of 1 bar and 1 beat duration. 3. EVALUATION Figure 5: Metrical profile of the rhythm in Figure 1 assuming for simplicity a 2-bar length and constant amplitude. One of the greatest challenges of music similarity evaluation is the definition of a ground truth. In some cases, objective evaluation is possible, where a ground truth is defined on a quantifiable criterion, i.e., rhythms from a particular genre are similar [5]. In other cases, music similarity is considered to be influenced by the perception of the listener and hence subjective evaluation is more suitable [19]. Objective evaluation in the current study is not preferable since different rhythms do not necessarily conform to different genres or subgenres 1 . Therefore a subjective evaluation is used where predictions of rhythm similarity are compared to perceptual ratings collected via a listening experiment (cf. Section 3.4). Details of the evaluation of rhythmic stream, onset, and downbeat detection are provided in Sections 3.1 - 3.3. A subset of the annotations used in the evaluation of the latter is available online 2 . file is computed and the following features are extracted. 
Features are computed per stream and averaged across all streams. Syncopation: Measures the strength of the events lying on the weak locations of the meter. The syncopation model of [18] is used with adaptation to account for the amplitude (onset strength) of the syncopated note. Three measures of syncopation are considered that apply hierarchical weights with, respectively, sixteenth note, eighth note, and quarter note resolution. Symmetry: Denotes the ratio of the number of onsets in the second half of the pattern that appear in exactly the same position in the first half of the pattern [6]. Density: Is the ratio of the number of onsets over the possible total number of onsets of the pattern (in this case 64). Fullness: Measures the onsets’ strength of the pattern. It describes the ratio of the sum of onsets’ strength over the maximum strength multiplied by the possible total number of onsets (in this case 64). Centre of Gravity: Denotes the position in the pattern where the most and strongest onsets occur (i.e., indicates whether most onsets appear at the beginning or at the end of the pattern etc.). Aside from these features, the metrical profile (cf. Figure 5) is also added to the final feature vector. This was found to improve results in [24]. In the current approach, the metrical profile is provided per stream, restricted to a total of 4 streams, and output in the final feature vector in order of low to high frequency content streams. 3.1 Rhythmic Streams Evaluation The number of streams is evaluated with perceptual annotations. For this, a subset of 120 songs from a total of 60 artists (2 songs per artist) from a variety of EDM genres and subgenres was selected. For each song, segmentation was applied using the algorithm of [21] and a characteristic segment was selected. Four subjects were asked to evaluate the number of rhythmic streams they perceive in each segment, choosing between 1 to 6, where rhythmic stream was defined as a stream of unique rhythm. For 106 of the 120 segments, the subjects’ responses’ standard deviation was significantly small. The estimated number of rhythmic streams matched the mean of the subject’s response distribution with an accuracy of 93%. 2.2.4 Downbeat Detection 1 Although some rhythmic patterns are characteristic to an EDM genre or subgenre, it is not generally true that these are unique and invariant. 2 https://staff.fnwi.uva.nl/a.k.honingh/rhythm_ similarity.html The downbeat detection algorithm uses information from the metrical structure and musical heuristics. Two assump- 540 15th International Society for Music Information Retrieval Conference (ISMIR 2014) r -0.17 0.48 0.33 0.69 0.70 3.2 Onset Detection Evaluation Onset detection is evaluated with a set of 25 MIDI and corresponding audio excerpts, specifically created for this purpose. In this approach, onsets are detected per stream, therefore onset annotations should also be provided per stream. For a number of different EDM rhythms, MIDI files were created with the constraint that each MIDI instrument performs a unique rhythmic pattern therefore represents a unique stream, and were converted to audio. The onsets estimated from the audio were compared to the annotations of the MIDI file using the evaluation measures of the MIREX Onset Detection task 3 . For this, no stream alignment is performed but rather onsets from all streams are grouped to a single set. For 25 excerpts, an F -measure of 85%, presicion of 85%, and recall of 86% are obtained with a tolerance window of 50 ms. 
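The pattern-level descriptors defined in Section 2.2.3 above reduce to simple operations on the 64-bin metrical profile. The sketch below assumes onset strengths in [0, 1] on a 4-bar, sixteenth-note grid as described there; the syncopation measure (which needs the hierarchical weights of [18]) is omitted, and the denominator of the symmetry ratio as well as the exact reading of the centre of gravity are interpretations rather than quotations from the paper.

```python
import numpy as np

def metrical_pattern_features(profile):
    """profile: 64-bin metrical profile (4 bars x 16 sixteenth-note positions),
    onset strengths between 0 (no onset) and 1 (maximum strength)."""
    p = np.asarray(profile, dtype=float)
    n = p.size                                  # 64 possible onset positions
    onsets = p > 0
    density = onsets.sum() / n                  # onsets over possible onsets
    fullness = p.sum() / n                      # summed strength over (max strength * n)
    first, second = onsets[: n // 2], onsets[n // 2:]
    # onsets of the 2nd half recurring at the same position in the 1st half [6];
    # normalizing by the number of onsets in the 2nd half is an assumption
    symmetry = (first & second).sum() / max(int(second.sum()), 1)
    # centre of gravity read as the strength-weighted mean position in the pattern
    centre = float((np.arange(n) * p).sum() / max(p.sum(), 1e-9))
    return {"density": density, "fullness": fullness,
            "symmetry": symmetry, "centre_of_gravity": centre}
```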
Inaccuracies in onset detection are due (on average) to doubled than merged onsets, because usually more streams (and hence more onsets) are detected. features attack characterization periodicity metrical distribution excl. metrical profile metrical distribution incl. metrical profile all Table 1: Pearson’s correlation r and p-values between the model’s predictions and perceptual ratings of rhythm similarity for different sets of features. rating, and all ratings being consistent, i.e., rated similarity was not deviating more than 1 point scale. The mean of the ratings was utilized as the ground truth rating per pair. For each pair, similarity can be calculated via applying a distance metric to the feature vectors of the underlying segments. In this preliminary analysis, the cosine distance was considered. Pearson’s correlation was used to compare the annotated and predicted ratings of similarity. This was applied for different sets of features as indicated in Table 1. A maximum correlation of 0.7 was achieved when all features were presented. The non-zero correlation hypothesis was not rejected (p > 0.05) for the attack characterization features indicating non-significant correlation with the (current set of) perceptual ratings. The periodicity features are correlated with r = 0.48, showing a strong link with perceptual rhythm similarity. The metrical distribution features indicate a correlation increase of 0.36 when the metrical profile is included in the feature vector. This is in agreement with the finding of [24]. As an alternative evaluation measure, the model’s predictions and perceptual ratings were transformed to a binary scale (i.e., 0 being dissimilar and 1 being similar) and their output was compared. The model’s predictions matched the perceptual ratings with an accuracy of 64%. Hence the model matches the perceptual similarity ratings at not only relative (i.e., Pearson’s correlation) but also absolute way, when a binary scale similarity is considered. 3.3 Downbeat Detection Evaluation To evaluate the downbeat the subset of 120 segments described in Section 3.1 was used. For each segment the annotated downbeat was compared to the estimated one with a tolerance window of 50 ms. An accuracy of 51% was achieved. Downbeat detection was also evaluated at the beat-level, i.e., estimating whether the downbeat corresponds to one of the four beats of the meter (instead of off-beat positions). This gave an accuracy of 59%, meaning that in the other cases the downbeat was detected on the off-beat positions. For some EDM tracks it was observed that high degree of periodicity compensates for a wrongly estimated downbeat. The overall results of the similarity predictions of the model (Section 3.4) indicate only a minor increase when the correct (annotated) downbeats are taken into account. It is hence concluded that the downbeat detection algorithm does not have great influence on the current results of the model. 3.4 Mapping Model Predictions to Perceptual Ratings of Similarity 4. DISCUSSION AND FUTURE WORK In the evaluation of the model, the following considerations are made. High correlation of 0.69 was achieved when the metrical profile, output per stream, was added to the feature vector. An alternative experiment tested the correlation when considering the metrical profile as a whole, i.e., as a sum across all streams. This gave a correlation of only 0.59 indicating the importance of stream separation and hence the advantage of the model to account for this. 
A maximum correlation of 0.7 was reported, taking into account the downbeat detection being 51% of the cases correct. Although regularity in EDM sometimes compensates for this, model’s predictions can be improved with a more robust downbeat detection. Features of periodicity (Section 2.2.2) and metrical distribution (Section 2.2.3) were extracted assuming a 44 meter, and 16-th note resolution throughout the segment. This is generally true for EDM, but exceptions do exist [2]. The The model’s predictions were evaluated with perceptual ratings of rhythm similarity collected via a listening experiment. Pairwise comparisons of a small set of segments representing various rhythmic patterns of EDM were presented. Subjects were asked to rate the perceived rhythm similarity, choosing from a four point scale, and report also the confidence of their rating. From a preliminary collection of experiment data, 28 pairs (representing a total of 18 unique music segments) were selected for further analysis. These were rated from a total of 28 participants, with mean age 27 years old and standard deviation 7.3. The 50% of the participants received formal musical training, 64% was familiar with EDM and 46% had experience as EDM musician/producer. The selected pairs were rated between 3 to 5 times, with all participants reporting confidence in their 3 p 0.22 0.00 0.01 0.00 0.00 www.MIREX.org 541 15th International Society for Music Information Retrieval Conference (ISMIR 2014) assumptions could be relaxed to analyze EDM with ternary divisions or no 44 meter, or expanded to other music styles with similar structure. The correlation reported in Section 3.4 is computed from a preliminary set of experiment data. More ratings are currently collected and a regression analysis and tuning of the model is considered in future work. [11] T. D. Griffiths and J. D. Warren. What is an auditory object? Nature Reviews Neuroscience, 5(11):887–892, 2004. 5. CONCLUSION [13] J. A. Hockman, M. E. P. Davies, and I. Fujinaga. One in the Jungle: Downbeat Detection in Hardcore, Jungle, and Drum and Bass. In ISMIR, 2012. A model of rhythm similarity for Electronic Dance Music has been presented. The model extracts rhythmic features from audio segments and computes similarity by comparing their feature vectors. A method for rhythmic stream detection is proposed that estimates the number and range of frequency bands from the spectral representation of each segment rather than a fixed division. Features are extracted from each stream, an approach shown to benefit the analysis. Similarity predictions of the model match perceptual ratings with a correlation of 0.7. Future work will fine-tune predictions based on a perceptual rhythm similarity model. 6. REFERENCES [1] S. Böck, A. Arzt, K. Florian, and S. Markus. Online real-time onset detection with recurrent neural networks. In International Conference on Digital Audio Effects, 2012. [2] M. J. Butler. Unlocking the Groove. Indiana University Press, Bloomington and Indianapolis, 2006. [3] E. Cambouropoulos. Voice and Stream: Perceptual and Computational Modeling of Voice Separation. Music Perception, 26(1):75–94, 2008. [4] D. Diakopoulos, O. Vallis, J. Hochenbaum, J. Murphy, and A. Kapur. 21st Century Electronica: MIR Techniques for Classification and Performance. In ISMIR, 2009. [12] C. Guastavino, F. Gómez, G. Toussaint, F. Marandola, and E. Gómez. Measuring Similarity between Flamenco Rhythmic Patterns. Journal of New Music Research, 38(2):129–138, June 2009. [14] A. Klapuri, A. J. 
Eronen, and J. T. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech and Language Processing, 14(1):342–355, January 2006. [15] F. Krebs, S. Böck, and G. Widmer. Rhythmic pattern modeling for beat and downbeat tracking in musical audio. In ISMIR, 2013. [16] O. Lartillot, T. Eerola, P. Toiviainen, and J. Fornari. Multi-feature Modeling of Pulse Clarity: Design, Validation and Optimization. In ISMIR, 2008. [17] O. Lartillot and P. Toiviainen. A Matlab Toolbox for Musical Feature Extraction From Audio. In International Conference on Digital Audio Effects, 2007. [18] H. C. Longuet-Higgins and C. S. Lee. The Rhythmic Interpretation of Monophonic Music. Music Perception: An Interdisciplinary Journal, 1(4):424–441, 1984. [19] A. Novello, M. M. F. McKinney, and A. Kohlrausch. Perceptual Evaluation of Inter-song Similarity in Western Popular Music. Journal of New Music Research, 40(1):1–26, March 2011. [20] J. Paulus and A. Klapuri. Measuring the Similarity of Rhythmic Patterns. In ISMIR, 2002. [5] S. Dixon, F. Gouyon, and G. Widmer. Towards Characterisation of Music via Rhythmic Patterns. In ISMIR, 2004. [21] B. Rocha, N. Bogaards, and A. Honingh. Segmentation and Timbre Similarity in Electronic Dance Music. In Sound and Music Computing Conference, 2013. [6] A. Eigenfeldt and P. Pasquier. Evolving Structures for Electronic Dance Music. In Genetic and Evolutionary Computation Conference, 2013. [22] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. The Journal of the Acoustical Society of America, 103(1):588–601, January 1998. [7] J. Foote and S. Uchihashi. The beat spectrum: a new approach to rhythm analysis. In ICME, 2001. [23] M. R. Schroeder, B. S. Atal, and J. L. Hall. Optimizing digital speech coders by exploiting masking properties of the human ear. The Journal of the Acoustical Society of America, pages 1647–1652, 1979. [8] J. T. Foote. Media segmentation using self-similarity decomposition. In Electronic Imaging. International Society for Optics and Photonics, 2003. [9] D. Gärtner. Tempo estimation of urban music using tatum grid non-negative matrix factorization. In ISMIR, 2013. [10] J. W. Gordon. The perceptual attack time of musical tones. The Journal of the Acoustical Society of America, 82(1):88–105, 1987. [24] L. M. Smith. Rhythmic similarity using metrical profile matching. In International Computer Music Conference, 2010. [25] J. R. Zapata and E. Gómez. Comparative Evaluation and Combination of Audio Tempo Estimation Approaches. In Audio Engineering Society Conference, 2011. 542 15th International Society for Music Information Retrieval Conference (ISMIR 2014) MUSE: A MUSIC RECOMMENDATION MANAGEMENT SYSTEM Martin Przyjaciel-Zablocki, Thomas Hornung, Alexander Schätzle, Sven Gauß, Io Taxidou, Georg Lausen Department of Computer Science, University of Freiburg zablocki,hornungt,schaetzle,gausss,taxidou,[email protected] ABSTRACT Evaluating music recommender systems is a highly repetitive, yet non-trivial, task. But it has the advantage over other domains that recommended songs can be evaluated immediately by just listening to them. In this paper, we present M U S E – a music recommendation management system – for solving the typical tasks of an in vivo evaluation. M U S E provides the typical offthe-shelf evaluation algorithms, offers an online evaluation system with automatic reporting, and by integrating online streaming services also a legal possibility to evaluate the quality of recommended songs in real time. 
Finally, it has a built-in user management system that conforms with state-of-the-art privacy standards. New recommender algorithms can be plugged in comfortably and evaluations can be configured and managed online. 1. INTRODUCTION One of the hallmarks of a good recommender system is a thorough and significant evaluation of the proposed algorithm(s) [6]. One way to do this is to use an offline dataset like The Million Song Dataset [1] and split some part of the data set as training data and run the evaluation on top of the remainder of the data. This approach is meaningful for features that are already available for the dataset, such as e.g. tag prediction for new songs. However, some aspects of recommending songs are inherently subjective, such as serendipity [12], and thus the evaluation of such algorithms can only be done in vivo, i.e. with real users not in an artificial environment. When conducting an in vivo evaluation, there are some typical issues that need to be considered: User management. While registering for evaluations, users should be able to provide some context information about them to guide the assignment in groups for A/B testing. Privacy & Security. User data is highly sensitive, and high standards have to be met wrt. who is allowed to access c Martin Przyjaciel-Zablocki, Thomas Hornung, Alexan der Schätzle, Sven Gauß, Io Taxidou, Georg Lausen. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Martin Przyjaciel-Zablocki, Thomas Hornung, Alexander Schätzle, Sven Gauß, Io Taxidou, Georg Lausen. “MuSe: A Music Recommendation Management System”, 15th International Society for Music Information Retrieval Conference, 2014. the data. Also, an evaluation framework needs to ensure that user data cannot be compromised. Group selection. Users are divided into groups for A/B testing, e.g. based on demographic criteria like age or gender. Then, recommendations for group A are provided by a baseline algorithm, and for group B by the new algorithm. Playing songs. Unlike other domains, e.g. books, users can give informed decisions by just listening to a song. Thus, to assess a recommended song, it should be possible to play the song directly during the evaluation. Evaluation monitoring. During an evaluation, it is important to have an overview of how each algorithm performs so far, and how many and how often users participate. Evaluation metrics. Evaluation results are put into graphs that contain information about the participants and the performance of the evaluated new recommendation algorithm. Baseline algorithms. Results of an evaluation are often judged by improvements over a baseline algorithm, e.g. a collaborative filtering algorithm [10]. In this paper, we present M U S E – a music recommendation management system – that takes care of all the regular tasks that are involved in conducting an in vivo evaluation. Please note that M U S E can be used to perform in vivo evaluations of arbitrary music recommendation algorithms. An instance of M U S E that conforms with state-ofthe-art privacy standards is accessible by using the link below, a documentation is available on the M U S E website 2 . muse.informatik.uni-freiburg.de The remainder of the paper is structured as follows: After a discussion of related work in Section 2, we give an overview of our proposed music recommendation management system in Section 3 with some insights in our evaluation framework in Section 4. 
Included recommenders are presented in Section 5, and we conclude with an outlook on future work in Section 6. 2. RELATED WORK The related work is divided in three parts: (1) music based frameworks for recommendations, (2) recommenders’ evaluation, (3) libraries and platforms for developing and plugin recommenders. Music recommendation has attracted a lot of interest from the scientific community since it has many real life applications and bears multiple challenges. An overview 2 M U S E - Music Sensing in a Social Context: dbis.informatik.uni-freiburg.de/MuSe 543 15th International Society for Music Information Retrieval Conference (ISMIR 2014) incorporate algorithms in the framework, integrate plugins, make configurations and visualize the results. However, our system offers additionally real-time online evaluations of different recommenders, while incorporating end users in the evaluation process. A case study of using Apache Mahout, a library for distributed recommenders based on MapReduce can be found in [15]. Their study provides insights into the development and evaluation of distributed algorithms based on Mahout. To the best of our knowledge, this is the first system that incorporates such a variety of characteristics and offers a full solution for music recommenders development and evaluation, while highly involving the end users. of factors affecting music recommender systems and challenges that emerge both for the users’ and the recommenders side are highlighted in [17]. Improving music recommendations has attracted equal attention. In [7, 12], we built and evaluated a weighted hybrid recommender prototype that incorporates different techniques for music recommendations. We used Youtube for playing songs but due to a complex process of identifying and matching songs, together with some legal issues, such an approach is no longer feasible. Music platforms are often combined with social media where users can interact with objects maintaining relationships. Authors in [2] leverage this rich information to improve music recommendations by viewing recommendations as a ranking problem. The next class of related work concerns evaluation of recommenders. An overview of existing systems and methods can be found in [16]. In this study, recommenders are evaluated based on a set of properties relevant for different applications and evaluation metrics are introduced to compare algorithms. Both offline and online evaluation with real users are conducted, discussing how to draw valuable conclusion. A second review on collaborative recommender systems specifically can be found in [10]. It consists the first attempt to compare and evaluate user tasks, types of analysis, datasets, recommendation quality and attributes. Empirical studies along with classification of existing evaluation metrics and introduction of new ones provide insights into the suitability and biases of such metrics in different settings. In the same context, researchers value the importance of user experience in the evaluation of recommender systems. In [14] a model is developed for assessing the perceived recommenders quality of users leading to more effective and satisfying systems. Similar approaches are followed in [3, 4] where authors highlight the need for user-centric systems and high involvement of users in the evaluation process. 
Relevant to our study is the work in [9] which recognizes the importance for online user evaluation, while implementing such evaluations simultaneously by the same user in different systems. 3. MUSE OVERVIEW We propose M U S E: a web-based music recommendation management system, built around the idea of recommenders that can be plugged in. With this in mind, M U S E is based on three main system design pillars: Extensibility. The whole infrastructure is highly extensible, thus new recommendation techniques but also other functionalities can be added as modular components. Reusability. Typical tasks required for evaluating music recommendations (e.g. managing user accounts, playing and rating songs) are already provided by M U S E in accordance with current privacy standards. Comparability. By offering one common evaluation framework we aim to reduce side-effects of different systems that might influence user ratings, improving both comparability and validity of in-vivo experiments. A schematic overview of the whole system is depicted in Fig. 1. The M U S E Server is the core of our music recommendation management system enabling the communication between all components. It coordinates the interaction with pluggable recommenders, maintains the data in three different repositories and serves the requests from multiple M U S E clients. Next, we will give some insights in the architecture of M U S E by explaining the most relevant components and their functionalities. The last class of related work refers to platforms and libraries for developing and selecting recommenders. The authors of [6] proposed LensKit, an open-source library that offers a set of baseline recommendation algorithms including an evaluation framework. MyMediaLite [8] is a library that offers state of the art algorithms for collaborative filtering in particular. The API offers the possibility for new recommender algorithm’s development and methods for importing already trained models. Both provide a good foundation for comparing different research results, but without a focus on in vivo evaluations of music recommenders, thus they don’t offer e.g. capabilities to play and rate songs or manage users. A patent in [13] describes a portal extension with recommendation engines via interfaces, where results are retrieved by a common recommendation manager. A more general purpose recommenders framework [5] which is close to our system, allows using and comparing different recommendation methods on provided datasets. An API offers the possibility to develop and 3.1 Web-based User Interface Unlike traditional recommender domains like e-commerce, where the process of consuming and rating items takes up to several weeks, recommending music exhibits a highly dynamic nature raising new challenges and opportunities for recommender systems. Ratings can be given on the fly and incorporated immediately into the recommending process, just by listening to a song. However, this requires a reliable and legal solution for playing a large variety of songs. M U S E benefits from a tight integration of Spotify 3 , a music streaming provider that allows listening to millions of songs for free. Thus, recommended songs can be embedded directly into the user interface, allowing to listen and rate them in a user-friendly way as shown in Fig. 2. 
3 544 A Spotify account is needed to play songs 15th International Society for Music Information Retrieval Conference (ISMIR 2014) MuSeServer PluggableRecommender1 Recommendation ListBuilder MuSeClient REST Webservice WebbasedUserInterface Recommendation List Administration Track Manager Music Repository MusicRetrievalEngine User Profile Evaluation Framework User Context Wrapper/ Connector1 XML Social Connector1 Charts Last.fmAPI SocialNetworks XML Last.fmAPI HTML AJAX Spotify Connector WebServices HTML UserProfileEngine User Manager Recommender1 DataRepositories Coordinator Recommendation Model Recommender Interface Recommender RecommenderManager others JSON SpotifyAPI Figure 1. Muse – Music Recommendation Management System Overview 3.2 Data Repositories Although recommenders in M U S E work independently of each other and may even have their own recommendation model with additional data, all music recommenders have access to three global data structures. The first one is the Music Repository that stores songs with their meta data. Only songs in this database can be recommended, played and rated. The Music Retrieval Engine periodically collects new songs and meta data from Web Services, e.g. chart lists or Last.fm. It can be easily extended by new sources of information like audio analysis features from the Million Song Dataset [1], that can be requested periodically or dynamically. Each recommender can access all data stored in the Music Repository. The second repository stores the User Profile, hence it also contains personal data. In order to comply with German data privacy requirements only restricted access is granted for both, recommenders and evaluation analyses. The last repository collects the User Context, e.g. which songs a user has listened to with the corresponding rating for the respective recommender. Access with anonymized user IDs is granted for all recommenders and evaluation analyses. Finally, both userrelated repositories can be enriched by the User Profile Engine that fetches data from other sources like social networks. Currently, the retrieval of listening profiles of publicly available data from Last.fm and Facebook is supported. Figure 2. Songs can be played & rated In order to make sure that users can obtain recommendations without having to be long-time M U S E users, we ask for some contextual information during the registration process. Each user has to provide coarse-grained demographic and preference information, namely the user’s spoken languages, year of birth, and optionally a Last.fm user name. In Section 5, we will present five different approaches that utilize those information to overcome the cold start problem. Beyond that, these information is also exploited for dividing users into groups for A/B testing. 3.3 Recommender Manager Fig. 3 shows the settings pane of a user. Note, that this window is available only for those users, who are not participating in an evaluation. It allows to browse all available recommenders and compare them based on meta data provided with each recommender. Moreover, it is also possible to control how recommendations from different recommenders are amalgamated to one list. To this end, a summary is shown that illustrates the interplay of novelty, accuracy, serendipity and diversity. Changes are applied and reflected in the list of recommendations directly. The Recommender Manager has to coordinate the interaction of recommenders with users and the access to the data. 
This process can be summarized as follows: • It coordinates access to the repositories, forwards user request for new recommendations, and receives generated recommendations. • It composes a list of recommendations by amalgamating recommendations from different recommenders into one list based on individual user settings. 545 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 4.1 Evaluation Setup The configuration of an evaluation consists of three steps (cf. Fig. 4): (1) A new evaluation has to be scheduled, i.e. a start and end date for the evaluation period has to be specified. (2) The number and setup of groups for A/B testing has to be defined, where up to six different groups are supported. For each group an available recommender can be associated with the possibility of hybrid combinations of recommenders if desired. (3) The group placement strategy based on e.g. age, gender and spoken languages is required. As new participants might join the evaluation over time, an online algorithm maintains a uniform distribution with respect to the specified criteria. After the setup is completed, a preview illustrates how group distributions would resemble based on a sample of registered users. Figure 3. Users can choose from available recommenders • A panel for administrative users allows enabling, disabling and adding of recommenders that implement the interface described in Section 3.4. Moreover, even composing hybrid recommenders is supported. 3.4 Pluggable Recommender A cornerstone of M U S E is its support for plugging in recommenders easily. The goal was to design a rather simple and compact interface enabling other developers to implement new recommenders with enough flexibility to incorporate existing approaches as well. This is achieved by a predefined Java interface that has to be implemented for any new recommender. It defines the interplay between the M U S E Recommender Manager and its pluggable recommenders by (1) providing methods to access all three data repositories, (2) forwarding requests for recommendations and (3) receiving recommended items. Hence, new recommenders do not have to be implemented within M U S E in order to be evaluated, it suffices to use the interface to provide a mapping of inputs and outputs 4 . Figure 4. Evaluation setup via Web interface While an evaluation is running, both registered users and new ones are asked to participate after they login to M U S E. If a user joins an evaluation, he will be assigned to a group based on the placement strategy defined during the setup and all ratings are considered for the evaluation. So far, the following types of ratings can be discerned: Song rating. The user can provide three ratings for the quality of the recommended song (“love”, “like”, and “dislike”). Each of these three rating options is mapped to a numerical score internally, which is then used as basis for the analysis of each recommender. List rating. The user can also provide ratings for the entire list of recommendations that is shown to him on a fivepoint Likert scale, visualized by stars. Question. To measure other important aspects of a recommendation like its novelty or serendipity, an additional field with a question can be configured that contains either a yes/no button or a five-point Likert scale. The user may also decide not to rate some of the recommendations. 
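Returning briefly to the group placement in step (3) above: the paper states only that an online algorithm keeps the groups uniformly distributed with respect to the configured criteria; the actual MUSE strategy is not published. The greedy balancer below is therefore just one plausible reading, with invented attribute names, shown only to make the idea concrete.

```python
from collections import defaultdict

class GreedyGroupAssigner:
    """Assigns each newly joining participant to the A/B group that currently
    minimizes imbalance over the configured criteria (illustrative sketch only)."""

    def __init__(self, n_groups, criteria):
        self.criteria = criteria                      # functions: user -> category
        self.counts = [defaultdict(int) for _ in range(n_groups)]
        self.sizes = [0] * n_groups

    def assign(self, user):
        def load(g):                                  # group size plus per-criterion counts
            return self.sizes[g] + sum(self.counts[g][c(user)] for c in self.criteria)
        g = min(range(len(self.sizes)), key=load)
        self.sizes[g] += 1
        for c in self.criteria:
            self.counts[g][c(user)] += 1
        return g

# usage sketch: two groups balanced on age bracket and gender (hypothetical fields)
assigner = GreedyGroupAssigner(2, [lambda u: u["age"] // 10, lambda u: u["gender"]])
group = assigner.assign({"age": 27, "gender": "f", "languages": ["de", "en"]})
```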
In order to reduce the number of non-rated recommendations in evaluations, the rating results can only be submitted when at least 50% of the recommendations are rated. Upon submitting the rating results, the user gets a new list with recommended songs. 4. EVALUATION FRAMEWORK There are two types of experiments to measure the performance of recommenders: (1) offline evaluations based on historical data and (2) in vivo evaluations where users can evaluate recommendations online. Since music is of highly subjective nature with many yet unknown correlations, we believe that in vivo evaluations have the advantage of also capturing subtle effects on the user during the evaluation. Since new songs can be rated within seconds by a user, such evaluations are a good fit for the music domain. M U S E addresses the typical issues that are involved in conducting an in-vivo evaluation and thus allows researches to focus on the actual recommendation algorithm. This section gives a brief overview of how evaluations are created, monitored and analyzed. 4 4.2 Monitoring Evaluations Running in vivo evaluations as a black box is undesirable, since potential issues might be discovered only after the More details can be found on our project website. 546 15th International Society for Music Information Retrieval Conference (ISMIR 2014) age of 14 and 20 [11]. The Annual Charts Recommender exploits this insight and recommends those songs, which were popular during this time. This means, when a user indicates 1975 as his year of birth, he will be assigned to the music context of years 1989 to 1995, and obtain recommendations from that context. The recommendation ranking is defined by the charts position in the corresponding annual charts, where the following function is used to map the charts position to a score, with cs as the position of song s in charts c and n is the maximum rank of charts c: evaluation is finished. Also, it is favorable to have an overview of the current state, e.g. if there are enough participants, and how the recommenders perform so far. M U S E provides comprehensive insights via an administrative account into running evaluations as it offers an easy accessible visualization of the current state with plots. Thus, adjustments like adding a group or changing the runtime of the evaluation can be made while the evaluation is still running. 1 score(s) = −log( cs ) n (1) Country Charts Recommender. Although music taste is subject to diversification across countries, songs that a user has started to listen to and appreciate oftentimes have peaked in others countries months before. This latency aspect as well as an inter-country view on songs provide a good foundation for serendipity and diversity. The source of information for this recommender is the spoken languages, provided during registration, which are mapped to a set of countries for which we collect the current charts. Suppose there is a user a with only one country A assigned to his spoken languages, and CA the set of charts songs for A. Then, the set CR of possible recommendations for a is defined as follows, where L is the set of all countries: " CR = ( C X ) \ CA X∈L Figure 5. Evaluation results are visualized dynamically The score for a song s ∈ CR is defined by the average charts position across all countries, where Function (1) is used for mapping the charts position into a score. City Charts Recommender. While music tastes differ across countries, they may likewise differ across cities in the same country. 
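Before turning to the city-level variant, the two chart-based recommenders just described can be made concrete. Function (1) maps a chart position cs (with n the maximum rank of chart c) to score(s) = −log(cs / n), and the Country Charts candidate set is the union of all countries' charts minus the user's own charts, ranked by the average mapped position. The data layout below is hypothetical, and averaging the mapped scores (rather than the raw positions) is an interpretation.

```python
import math

def chart_score(position, n):
    """Function (1): score(s) = -log(c_s / n); rank 1 scores highest, rank n scores 0."""
    return -math.log(position / n)

def country_chart_candidates(charts, user_countries):
    """charts: {country: {song: position}}. Returns the candidate songs C_R of the
    Country Charts Recommender, best-ranked first."""
    own = set().union(*(set(charts.get(c, {})) for c in user_countries))
    scores = {}
    for country, entries in charts.items():
        n = max(entries.values())                    # maximum rank of this chart
        for song, pos in entries.items():
            if song not in own:
                scores.setdefault(song, []).append(chart_score(pos, n))
    return sorted(scores, key=lambda s: -sum(scores[s]) / len(scores[s]))
```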
We exploit this idea by the City Charts Recommender, hence it can be seen as a more granular variant of the Country Charts Recommender. The set of recommendations CR is now composed based on the city charts from those countries a user was assigned to. Hereby, the ranking of songs in that set is not only defined by the average charts position, but also by the number of cities where the song occurs in the charts: The fewer cities a song appears in, the more “exceptional” and thus relevant it is. Social Neighborhood Recommender. Social Networks are, due to their growing rates, an excellent source for contextual knowledge about users, which in turn can be utilized for better recommendations. In this approach, we use the underlying social graph of Last.fm to generate recommendations based on user’s Last.fm neighborhood which can be retrieved by our User Profile Engine. To compute recommendations for a user a, we select his five closest neighbors, an information that is estimated by Last.fm internally. Next, for each of them, we retrieve its recent top 20 songs and thus get five sets of songs, namely N1 ...N5 . Since that alone would provide already known songs in general, we define the set NR of possible recommendations as follows, where Na is the set of at most 25 songs a 4.3 Analyzing Evaluations For all evaluations, including running and finished ones, a result overview can be accessed that shows results in a graphical way to make them easier and quicker to grasp (c.f. Fig. 5). The plots are implemented in a dynamic fashion allowing to adjust, e.g., the zoom-level or the displayed information as desired. They include a wide range of metrics like group distribution, number of participants over time, averaged ratings, mean absolute error, accuracy per recommender, etc. Additionally, the complete dataset or particular plotting data can be downloaded in CSV format. 5. RECOMMENDATION TECHNIQUES M U S E comes with two types of recommenders out-of-thebox. The first type includes traditional algorithms, i.e. Contend Based and Collaborative Filtering [10] that can be used as baseline for comparison. The next type of recommenders is geared towards overcoming the cold start problem by (a) exploiting information provided during registration (Annual, Country, and City Charts recommender), or (b) leveraging knowledge from social networks (Social Neighborhood and Social Tags recommender). Annual Charts Recommender. Studies have shown, that the apex of evolving music taste is reached between the 547 15th International Society for Music Information Retrieval Conference (ISMIR 2014) user a recently listened to and appreciated: NR = ( " [4] Paolo Cremonesi, Franca Garzotto, Sara Negro, Alessandro Vittorio Papadopoulos, and Roberto Turrin. Looking for ”good” recommendations: A comparative evaluation of recommender systems. In INTERACT (3), pages 152–168, 2011. N i ) \ Na 1≤i≤5 Social Tags Recommender. Social Networks collect an enormous variety of data describing not only users but also items. One common way of characterising songs is based on tags that are assigned to them in a collaborative manner. Our Social Tag Recommender utilizes such tags to discover new genres which are related to songs a user liked in the past. At first, we determine his recent top ten songs including their tags from Last.fm. We merge all those tags and filter out the most popular ones like “rock” or “pop” to avoid getting only obvious recommendations. 
By counting the frequency of the remaining tags, we determine the three most common thus relevant ones. For the three selected tags, we use again Last.fm to retrieve songs where the selected tags were assigned to most frequently. To test our evaluation framework as well as to assess the performance of our five recommenders we conducted an in vivo evaluation with M U S E. As a result 48 registered users rated a total of 1567 song recommendations confirming the applicability of our system for in vivo evaluations. Due to space limitations, we decided to omit a more detailed discussion of the results. 6. CONCLUSION M U S E puts the fun back in developing new algorithms for music recommendations by taking the burden from the researcher to spent cumbersome time on programming yet another evaluation tool. The module-based architecture offers the flexibility to immediately test novel approaches, whereas the web-based user-interface gives control and insight into running in vivo evaluations. We tested M U S E with a case study confirming the applicability and stability of our proposed music recommendation management system. As future work, we envision to increase the flexibility of setting up evaluations, add more metrics to the result overview, and to develop further connectors for social networks and other web services to enrich the user’s context while preserving data privacy. 7. REFERENCES [1] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR, 2011. [2] Jiajun Bu, Shulong Tan, Chun Chen, Can Wang, Hao Wu, Lijun Zhang 0005, and Xiaofei He. Music recommendation by unified hypergraph: combining social media information and music content. In ACM Multimedia, pages 391–400, 2010. [3] Li Chen and Pearl Pu. User evaluation framework of recommender systems. In Workshop on Social Recommender Systems (SRS’10) at IUI, volume 10, 2010. [5] Aviram Dayan, Guy Katz, Naseem Biasdi, Lior Rokach, Bracha Shapira, Aykan Aydin, Roland Schwaiger, and Radmila Fishel. Recommenders benchmark framework. In RecSys, pages 353–354, 2011. [6] Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John Riedl. Rethinking the recommender research ecosystem: reproducibility, openness, and LensKit. In RecSys, pages 133–140, 2011. [7] Simon Franz, Thomas Hornung, Cai-Nicolas Ziegler, Martin Przyjaciel-Zablocki, Alexander Schätzle, and Georg Lausen. On weighted hybrid track recommendations. In ICWE, pages 486–489, 2013. [8] Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Mymedialite: a free recommender system library. In RecSys, pages 305– 308, 2011. [9] Conor Hayes and Pádraig Cunningham. An on-line evaluation framework for recommender systems. Trinity College Dublin, Dep. of Computer Science, 2002. [10] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John Riedl. Evaluating collaborative filtering recommender systems. In ACM Trans. Inf. Syst., pages 5–53, 2004. [11] Morris B Holbrook and Robert M Schindler. Some exploratory findings on the development of musical tastes. Journal of Consumer Research, pages 119–124, 1989. [12] Thomas Hornung, Cai-Nicolas Ziegler, Simon Franz, Martin Przyjaciel-Zablocki, Alexander Schätzle, and Georg Lausen. Evaluating Hybrid Music Recommender Systems. In WI, pages 57–64, 2013. [13] Stefan Liesche, Andreas Nauerz, and Martin Welsch. Extendable recommender framework for web-based systems, 2008. US Patent App. 12/209,808. [14] Pearl Pu, Li Chen, and Rong Hu. 
A user-centric evaluation framework for recommender systems. In RecSys ’11, pages 157–164, New York, NY, USA, 2011. ACM. [15] Carlos E Seminario and David C Wilson. Case study evaluation of mahout as a recommender platform. In RecSys, 2012. [16] Guy Shani and Asela Gunawardana. Evaluating recommendation systems. In Recommender Systems Handbook, pages 257–297, 2011. [17] Alexandra L. Uitdenbogerd and Ron G. van Schyndel. A review of factors affecting music recommender success. In ISMIR, 2002. 548 15th International Society for Music Information Retrieval Conference (ISMIR 2014) TEMPO- AND TRANSPOSITION-INVARIANT IDENTIFICATION OF PIECE AND SCORE POSITION Andreas Arzt1 , Gerhard Widmer1,2 , Reinhard Sonnleitner1 1 Department of Computational Perception, Johannes Kepler University, Linz, Austria 2 Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria [email protected] ABSTRACT goal is to identify different versions of one and the same song, mostly in order to detect cover versions in popular music. A common way to solve this task, especially for classical music, is to use an audio matching algorithm (see e.g. [10]). Here, all the scores are first transformed into audio files (or a suitable in-between representation), and then aligned to the query in question, most commonly with algorithms based on dynamic programming techniques. A limitation of this approach is that relatively large queries are needed (e.g. 20 seconds), to achieve good retrieval results. Another problem is computational cost. To cope with this, in [8] clever indexing strategies were presented that greatly reduce the computation time. In [2] an approach is presented that tries to solve the task in the symbolic domain instead. First, the query is transformed into a symbolic list of note events via an audio transcription algorithm. Then, a globally tempo-invariant fingerprinting method is used to query the database and identify matching positions. In this way even for queries with lengths of only a few seconds very robust retrieval results can be achieved. A downside is that this method depends on automatic music transcription, which in general is an unsolved problem. In [2] a state of the art transcription system for piano music is used, thus limiting the approach to piano music only, at least for the time being. In addition, we identified two other limitations of this algorithm, which we tackle in this paper. First, the approach depends on the performer playing the piece in the correct key and the correct octave (i.e. in the same key and octave as it is stored in the database). In music it is quite common to transpose a piece of music according to specific circumstances, e.g. a singer preferring to sing in a specific range. Secondly, while this algorithm works very well for small queries, larger queries with local tempo changes within the query tend to be problematic. Of course these limitations were already discussed in the literature for other approaches, see e.g. [10] for tempo- and transposition-invariant audio matching. In this paper we present solutions to both problems by proposing (1) a transposition-invariant fingerprinting method for symbolic music representations which uses an additional verification step that largely compensates for the general loss in discriminative power, and (2) a simple but effective tracking method that essentially achieves not only global, but also local invariance to tempo changes. 
We present an algorithm that, given a very small snippet of an audio performance and a database of musical scores, quickly identifies the piece and the position in the score. The algorithm is both tempo- and transposition-invariant. We approach the problem by extending an existing tempoinvariant symbolic fingerprinting method, replacing the absolute pitch information in the fingerprints with a relative representation. Not surprisingly, this leads to a big decrease in the discriminative power of the fingerprints. To overcome this problem, we propose an additional verification step to filter out the introduced noise. Finally, we present a simple tracking algorithm that increases the retrieval precision for longer queries. Experiments show that both modifications improve the results, and make the new algorithm usable for a wide range of applications. 1. INTRODUCTION Efficient algorithms for content-based retrieval play an important role in many areas of music retrieval. A well known example are audio fingerprinting algorithms, which permit the retrieval of all audio files from the database that are (almost) exact replicas of a given example query (a short audio excerpt). For this task there exist efficient algorithms that are in everyday commercial use (see e.g. [4], [13]). A related task, relevant especially in the world of classical music, is the following: given a short audio excerpt of a performance of a piece, identify both the piece (i.e. the musical score the performance is based on), and the position within the piece. For example, when presented with an audio excerpt of Vladimir Horowitz playing Chopin’s Nocturne Op. 55 No. 1, the goal is to return the name and data of the piece (Nocturne Op. 55 No. 1 by Chopin) rather than identifying the exact audio recording. Hence, the database for this task does not contain audio recordings, but symbolic representations of musical scores. This is related to version identification (see [11] for an overview), where the c Andreas Arzt1 , Gerhard Widmer1,2 , Reinhard Sonnleitner1 . Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Andreas Arzt1 , Gerhard Widmer1,2 , Reinhard Sonnleitner1 . “Tempo- and Transposition-invariant Identification of Piece and Score Position”, 15th International Society for Music Information Retrieval Conference, 2014. 549 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2. TEMPO-INVARIANT FINGERPRINTING The basis of our algorithm is a fingerprinting method presented in [2] (which in turn is based on [13]) that is invariant to the global tempo of both the query and the entries in the database. In this section we will give a brief summary of this algorithm. Then we will show how to make it transposition-invariant (Section 3) and how to make it invariant to local tempo changes (Section 4). 2.1 Building the Score Database In [2] a fingerprinting algorithm was introduced that is invariant to global tempo differences between the query and the scores in the database. Each score is represented as an ordered list of [ontime, pitch] pairs, which in turn are extracted from MIDI files with a suitable but constant tempo for the whole piece. For each score, fingerprint tokens are generated and stored in a database. Tokens are created from triplets of noteon events according to some constraints to make them tempo invariant. A fixed event e is paired with the first n1 events with a distance of at least d seconds “in the future” of e. 
This results in n1 event pairs. For each of these pairs this step is repeated with the n2 future events with a distance of at least d seconds. This finally results in n1 ∗ n2 event triplets. In our experiments we used the values d = 0.05 seconds and n1 = n2 = 5 (i.e. for each event 25 tokens are created). The pair creation steps are constrained to notes which are at most 2 octaves apart. Given such a triplet consisting of the events e1, e2 and e3, the time difference td_{1,2} between e1 and e2 and the time difference td_{2,3} between e2 and e3 are computed. To get a tempo-independent fingerprint token, the ratio of the time differences is computed: tdr = td_{2,3} / td_{1,2}. This finally leads to a fingerprint token dbtoken = [pitch1 : pitch2 : pitch3 : tdr] : pieceID : time : td_{1,2}, with the hash key being [pitch1 : pitch2 : pitch3 : tdr], pieceID the identifier of the piece, and time the onset time of e1. The tokens in our database are unique, i.e. we only insert the generated token if an equivalent one does not exist yet.

2.2 Querying the Database

Before querying the database, the query (an audio snippet of a performance) has to be transformed into a symbolic representation. The algorithm we use to transcribe musical note onsets from an audio signal is based on the system described in [3]. The result of this step is a possibly very noisy list of [ontime, pitch] pairs. This list is processed in exactly the same fashion as above, resulting in a list of tokens of the form qtoken = [qpitch1 : qpitch2 : qpitch3 : qtdr] : qtime : qtd_{1,2}. Then, all the tokens which match hash keys of the query tokens are extracted from the database (we allow a maximal deviation of the ratio of the time differences of 15%). For querying, the general idea is to find regions in the database of scores which share a continuous sequence of tokens with the query. To quickly identify these regions we use the histogram approach presented in [2] and [13]. This is a computationally inexpensive way of finding these sequences by sorting the matched tokens into a histogram with a bin width of 1 second such that peaks appear at the start points of these regions (i.e. the start point where the query matches a database position). We also included the restriction that each query token can only be sorted at most once into each bin of the histogram, effectively preventing excessively high scores for sequences of repeated patterns in a brief period of time. The matching score for each score position is computed as the number of tokens in the respective histogram bin. In addition, we can also compute a tempo estimate, i.e. the tempo of the performance compared to the tempo in the score, by taking the mean of the ratios of td_{1,2} and qtd_{1,2} of the respective matching query and database tokens that were sorted into the bin in question. We will use this information for the tracking approach presented in Section 4.

3. TRANSPOSITION-INVARIANT FINGERPRINTS

3.1 General Approach

In the algorithm described above, the pitches in the hash keys are represented as absolute values. Thus, if a performer decides to transpose a piece by an arbitrary number of semi-tones, any identification attempt by the algorithm must fail. To overcome this problem, we suggest a simple, relative representation of the pitch values, which makes the algorithm invariant to linear transpositions. Instead of using 3 absolute pitch values, we replace them by 2 differences, pd1 = pitch2 − pitch1 and pd2 = pitch3 − pitch2, resulting in a hash key [pd1 : pd2 : tdr].
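To make the token construction concrete, the following is a minimal Python sketch of the triplet generation just described, covering both the absolute-pitch hash key and the transposition-invariant variant. It is an illustration rather than the authors' implementation: the function and variable names are ours, and the quantisation of the ratio tdr for hashing (simple rounding here) is an assumption; the paper instead allows a 15% deviation of the ratio at query time.

```python
from collections import namedtuple

Note = namedtuple("Note", ["onset", "pitch"])  # onset in seconds, pitch as MIDI number


def future_events(notes, i, d, n):
    """Indices of up to n events starting at least d seconds after notes[i]
    and lying within two octaves of it (the pairing constraints described above)."""
    out = []
    for j in range(i + 1, len(notes)):
        if notes[j].onset - notes[i].onset >= d and abs(notes[j].pitch - notes[i].pitch) <= 24:
            out.append(j)
            if len(out) == n:
                break
    return out


def make_tokens(notes, piece_id, d=0.05, n1=5, n2=5, transposition_invariant=True):
    """Generate fingerprint tokens (hash_key, piece_id, time, td12, pitch1) from note triplets."""
    notes = sorted(notes, key=lambda n: n.onset)
    tokens = []
    for i, e1 in enumerate(notes):
        for j in future_events(notes, i, d, n1):
            e2 = notes[j]
            for k in future_events(notes, j, d, n2):
                e3 = notes[k]
                td12 = e2.onset - e1.onset
                td23 = e3.onset - e2.onset
                tdr = td23 / td12  # tempo-independent ratio of the time differences
                if transposition_invariant:
                    key = (e2.pitch - e1.pitch, e3.pitch - e2.pitch, round(tdr, 2))
                else:
                    key = (e1.pitch, e2.pitch, e3.pitch, round(tdr, 2))
                # the absolute pitch of e1 is kept in the token value; it is used later
                # for the verification step of the transposition-invariant variant
                tokens.append((key, piece_id, e1.onset, td12, e1.pitch))
    return tokens
```

With n1 = n2 = 5 this yields up to 25 tokens per note event, as stated in the text; a real system would additionally quantise tdr more carefully so that the 15% matching tolerance can be realised via the hash lookup.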
For use in Section 3.2 below we additionally store pitch1 , the absolute pitch of the first note, in the token value. In every other aspect the algorithm works in the same way as the purely tempo-invariant version described above. Of course this kind of transposition invariance cannot come for free as the resulting fingerprints will not be as discriminative as before. This has two important direct consequences: (1) the retrieval accuracy will suffer, and (2) for every query a lot more matching tokens are found in the database, thus the runtime for each query increases (see Section 5). 3.2 De-noising the Results: Token Verification To compensate for the loss in discriminative power we propose an additional step before accepting a database token as a match to the query. The general idea is taken from [9] and was first used in a music context by [12]. It is based on a verification step for each returned token that looks at the context within the query and the context at the returned position the database. Each token dbtoken that was returned in response to a qtoken can be used to project the query (i.e. the notes identified from the query audio snippet by the transcription algorithm) to the possibly matching position in the score indicated by the dbtoken. The intuition then is that at 550 15th International Society for Music Information Retrieval Conference (ISMIR 2014) The basic idea is to create virtual ‘agents’ for positions in the result sets. Each agent has a current hypothesis of the piece, the position within the piece and the tempo, and a score based on the results of the sub-queries. The agents are updated, if possible, with newly arriving data. In doing so, agents that represent positions that successively occur in result sets will accumulate higher scores than agents that represent positions that only occurred once or twice by chance, and are most probably false positives. More precisely, we iterate over all sub-queries and perform the following steps in each iteration: true matching positions we will find a majority of the notes from the query at their expected positions in the score. This will permit us to more reliably decide if the match of hash keys is a false positive or an actual match. To do this, we need to compute the pitch shift and the tempo difference between the query and the potential position in the database. The pitch shift is computed as the difference of the pitch1 of qtoken and dbtoken. The difference in tempo is computed as the ratio of td1,2 of the two tokens. This information can now in turn be used to compute the expected time and pitch for each query note at the current score position hypothesis. We actually do not do this for the whole query, but only for a window of w = 10 notes, centred at the event e1 of the query, and we exclude the notes e1 , e2 and e3 from this list (as they were already used to come up with the match in the first place). We now take these w notes and check if they appear in the database as would be expected. In this search we are strict on the pitch value, but allow for a window of ±100 ms with regards to the actual time in the database. If we can confirm that a certain percentage of notes from the query appears in the database as expected (in the experiments we used 0.8), we finally accept the query token as an actual match. As this approach is computationally expensive, we actually compute the results in two steps: we first do ‘normal’ fingerprinting without the verification step and only keep the top 5% of the results. 
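The verification step described above can be sketched as follows. This is our own simplified rendering, not the authors' code: the token objects are assumed to expose the values stored in the fingerprint tokens (the absolute pitch of e1, the time difference td_{1,2} and the score onset time), and the linear scan over the score notes stands in for whatever indexed lookup a real implementation would use.

```python
from collections import namedtuple

Note = namedtuple("Note", ["onset", "pitch"])
Token = namedtuple("Token", ["pitch1", "td12", "time"])  # assumed stored token values


def verify_match(query_notes, score_notes, q_e1_idx, q_token, db_token,
                 w=10, time_tol=0.1, min_ratio=0.8):
    """Project the query context around e1 onto the candidate score position and
    count how many notes are confirmed (strict on pitch, +/- 100 ms on time)."""
    pitch_shift = db_token.pitch1 - q_token.pitch1      # transposition between query and score
    tempo = db_token.td12 / q_token.td12                 # tempo ratio between score and query
    q_e1 = query_notes[q_e1_idx]
    lo = max(0, q_e1_idx - w // 2)
    context = query_notes[lo:lo + w]                     # for brevity, e1-e3 are not excluded here
    confirmed = 0
    for q_note in context:
        expected_time = db_token.time + (q_note.onset - q_e1.onset) * tempo
        expected_pitch = q_note.pitch + pitch_shift
        if any(n.pitch == expected_pitch and abs(n.onset - expected_time) <= time_tol
               for n in score_notes):
            confirmed += 1
    return confirmed / max(1, len(context)) >= min_ratio
```

Only candidate positions whose context passes this check are kept, which is what restores most of the discriminative power lost by the relative pitch representation.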
We then perform the verification step on these results only and recompute the scores. On our dataset this effectively more than halves the computation time.

• Normalise Scores: First the scores of the positions in the result set of the sub-query are normalised by dividing them by their median. This makes sure that each iteration has approximately the same influence on the tracking process.

• Update Agents: For every agent, we look for a matching position in the result set of the sub-query (i.e. a position that approximately fits the extrapolated position of the agent, given the old position, the tempo, and the elapsed time). The position, the tempo and the score of the agent are updated with the new data from the matching result of the sub-query. If we do not find a matching position in the result set, we update the agent with a score of 0, and the extrapolated position is taken as the new hypothesis. If a matching position is found, the accumulated score is updated in a fashion such that scores from further in the past have a smaller impact than more recent ones. Each agent has a ring buffer s of size 50, in which the scores of the individual sub-queries are stored. The accumulated score of the agent is then calculated as score_acc = Σ_{i=1}^{50} s_i/(1 + log i), where s_1 is the most recent score.

• Create Agents: Each sub-query result that was not used to update an existing agent is used to initialise a new agent at the respective score position (i.e. in the first iteration up to 100 agents are created).

• Remove obsolete Agents: Finally, agents with low scores are removed. In our implementation we simply remove agents that are older than 10 iterations and are not part of the current top 25 agents.

At each point in time the agents are ordered by score_acc and can be seen as hypotheses about the current position in the database of pieces. Thus, in the case of a single long query, the agents with the highest accumulated scores are returned in the end. In an on-line scenario, where an audio stream is constantly being monitored by the fingerprinting system, the current top hypotheses can be returned after each performed update (i.e.
after each processed subquery). 551 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 5. EVALUATION We tested the algorithms with different query lengths: 10, 15, 20 and 25 notes (automatically transcribed from the audio query). For each of the query lengths, we generated 2500 queries by picking random points in the performances of our test database, and used them as input for the proposed algorithms. Duplicate retrieval results (i.e. positions that have the exact same note content; also, duplicate piece IDs for the experiments on piece-level) are removed from the result set. Table 2 shows the results of the original tempo-invariant (but not pitch-invariant) algorithm on our dataset. Here, we present results for two categories: correctly identified pieces, and correctly identified piece and position in the score. For both categories we give the percentage of correct results at rank 1, and the mean reciprocal rank. This experiment basically confirms the results that were reported in [2] on a larger database (more than twice as large), for which a slight drop in performance is expected. In addition, for the experiments with the transpositioninvariant fingerprinting method, we transposed each score randomly by between -11 and +11 semitones – although strictly speaking this was not necessary, as the transpositioninvariant algorithm returns exactly the same (large) set of tokens for un-transposed and transposed queries or scores. Table 3 gives the results of the transposition-invariant method on these queries, both without (left) and with the verification step (right). As expected, the use of pitchinvariant fingerprints without additional verification causes a big decrease in retrieval precision (compare left half of Table 3 with Table 2). Furthermore, the loss in discriminative power of the fingerprint tokens also results in an increased number of tokens returned for every query, which has a direct influence on the runtime of the algorithm (last row in Table 3). The proposed verification step solves the precision problem, at least to some extent, and in our opinion makes the approach usable. Of course this does not come for free, as the runtime increases slightly. We also tried to use the verification step with the original tempo-invariant algorithm but were not able to improve on the retrieval results. At least on our test data the tempoinvariant fingerprints are discriminative enough to mostly avoid false positives. Finally, Table 4 gives the results on slightly longer queries for both the original tempo-invariant and the new tempoand transposition-invariant algorithm. As can be seen, for the detection of the exact position in the score, using no tracking, the results based on queries with length 100 notes are worse than those for queries with only 50 notes, i.e. more information leads to worse results. This is caused by local tempo changes within the query, which break the histogram approach for finding sequences of matching tokens. As shown on the right hand side for both fingerprinting types in Table 4, the approach of splitting longer queries into shorter ones and tracking the results takes care of this problem. Please note that for the tracking approach we check if the position hypotheses after the last tracking step match the correct position in the score. Thus, as this is an 5.1 Dataset Description For the evaluation of the proposed algorithms a ground truth is needed. 
We need exact alignments of performances (recordings) of classical music to their respective scores such that we know exactly when each note given in the score is actually played in the performance. This data can either be generated by a computer program or by extensive manual annotation but both ways are prone to errors. Luckily, we have access to two unique datasets where professional pianists played performances on a computercontrolled piano 1 and thus every action (e.g. key presses, pedal movements) was recorded. The first dataset (see [14]) consists of performances of the first movements of 13 Mozart piano sonatas by Roland Batik. The second, much larger, dataset consists of nearly the complete solo piano works by Chopin performed by Nikita Magaloff [7]. For the latter set we do not have the original audio files and thus replayed the symbolic performance data on a Yamaha N2 hybrid piano and recorded the resulting performances. As we have both symbolic and audio information about the performances, we know the exact timing of each played note in the audio files. To build the score database we converted the sheet music to MIDI files with a constant tempo such that the overall duration of the file is similar to a ‘normal’ performance of the piece. In addition to these two datasets the score database includes the complete Beethoven piano sonatas, two symphonies by Beethoven, and various other piano pieces. To this data we have no ground truth, but this is irrelevant since we do not actively query for them with performance data in our evaluation runs. See Table 1 for an overview of the complete dataset. 5.2 Results For the evaluation we follow the procedure from [2]. A score position X is considered correct if it marks the beginning (+/- 1.5 seconds) of a score section that is identical in note content, over a time span the length of the query (but at least 20 notes), to the note content of the ‘real’ score situation corresponding to the audio segment that the system was just listening to. We can establish this as we have the correct alignment between performance time and score positions — our ground truth). This complex definition is necessary because musical pieces may contain repeated sections or phrases, and it is impossible for the system (or anyone else, for that matter) to guess the ‘true’ one out of a set of identical passages matching the current performance snippet, given just that performance snippet as input. We acknowledge that a measurement of musical time in a score in terms of seconds is rather unusual. But as the MIDI tempos in our database generally are set in a meaningful way, this seemed the best decision to make errors comparable over different pieces, with different time signatures – it would not be very meaningful to, e.g. compare errors in bars or beats over different pieces. 1 Bösendorfer SE 290 552 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Data Description Chopin Corpus Mozart Corpus Additional Pieces Total Score Database Number of Pieces Notes in Score 154 325,263 13 42,049 159 574,926 326 942,238 Testset Notes in Performance Performance Duration 326,501 9:38:36 42,095 1:23:56 – – Table 1. Database and Testset Overview. In the database, all the pieces are included. As we only have performances aligned to the scores for the Chopin and the Mozart corpus, only these are included in the test set to query the database. 
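As a rough illustration of this correctness criterion, the following sketch compares the note content at a returned score position with the note content at the ground-truth position. It is a simplification (our own helper names, pitch-sequence comparison only, and the ±1.5 second tolerance applied as a direct time check) rather than the exact evaluation code used by the authors.

```python
from collections import namedtuple

Note = namedtuple("Note", ["onset", "pitch"])  # score notes: onset in seconds, MIDI pitch


def is_position_correct(score_notes, returned_time, true_time, query_len, tol=1.5):
    """Accept the returned position if it is within +/- tol seconds of the true
    position, or if it starts a passage whose note content is identical to the
    passage at the true position (a repeated section), over max(query_len, 20) notes."""
    if abs(returned_time - true_time) <= tol:
        return True
    span = max(query_len, 20)

    def pitches_from(start_time):
        following = sorted((n for n in score_notes if n.onset >= start_time),
                           key=lambda n: n.onset)
        return tuple(n.pitch for n in following[:span])

    return pitches_from(returned_time) == pitches_from(true_time)
```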
Query Length in Notes                         10     15     20     25
Correct Piece as Top Match                    0.6    0.82   0.88   0.91
Correct Piece Mean Reciprocal Rank (MRR)      0.68   0.86   0.91   0.93
Correct Position as Top Match                 0.53   0.72   0.77   0.79
Correct Position Mean Reciprocal Rank (MRR)   0.60   0.79   0.83   0.85
Mean Query Length in Seconds                  1.47   2.26   3.16   3.82
Mean Query Execution Time in Seconds          0.02   0.06   0.11   0.16

Table 2. Results for different query sizes of the original tempo-invariant piece and score position identification algorithm on the test database at the piece level (upper half) and on the score position level (lower half). Each estimate is based on 2500 random audio queries. For both categories the percentage of correct detections at rank 1 and the mean reciprocal rank (MRR) are given. Additionally, the mean length of the query in seconds and the mean execution time for a query is shown.

Without Verification
Query Length in Notes                  10     15     20     25
Correct Piece as Top Match             0.30   0.40   0.41   0.40
Correct Piece MRR                      0.36   0.47   0.50   0.49
Correct Position as Top Match          0.23   0.33   0.32   0.32
Correct Position MRR                   0.29   0.40   0.41   0.40
Mean Query Length in Seconds           1.47   2.26   3.16   3.82
Mean Query Execution Time in Seconds   0.10   0.32   0.62   0.91

With Verification
Query Length in Notes                  10     15     20     25
Correct Piece as Top Match             0.43   0.63   0.71   0.75
Correct Piece MRR                      0.49   0.69   0.76   0.79
Correct Position as Top Match          0.33   0.51   0.57   0.60
Correct Position MRR                   0.41   0.59   0.66   0.69
Mean Query Length in Seconds           1.47   2.26   3.16   3.82
Mean Query Execution Time in Seconds   0.12   0.38   0.72   1.09

Table 3. Results for different query sizes of the proposed tempo- and transposition-invariant piece and score position identification algorithm on the test database without (first block) and with (second block) the proposed verification step. Each estimate is based on 2500 random audio queries. The upper half shows recognition results on the piece level, the lower half on the score position level. For both categories the percentage of correct detections at rank 1 and the mean reciprocal rank (MRR) are given. Additionally, the mean length of the query in seconds and the mean execution time for a query is shown.

Tempo-invariant
                                       No Tracking       Tracking
Query Length in Notes                  50      100       50      100
Correct Piece as Top Match             0.95    0.96      0.98    1
Correct Piece MRR                      0.97    0.98      0.99    1
Correct Position as Top Match          0.78    0.73      0.87    0.88
Correct Position MRR                   0.85    0.81      0.89    0.90
Mean Query Length in Seconds           7.62    15.03     7.62    15.03
Mean Query Execution Time in Seconds   0.42    0.92      0.49    1.08

Tempo- and Pitch-invariant
                                       No Tracking       Tracking
Query Length in Notes                  50      100       50      100
Correct Piece as Top Match             0.81    0.79      0.92    0.98
Correct Piece MRR                      0.85    0.82      0.94    0.99
Correct Position as Top Match          0.64    0.59      0.77    0.83
Correct Position MRR                   0.72    0.66      0.82    0.86
Mean Query Length in Seconds           7.62    15.03     7.62    15.03
Mean Query Execution Time in Seconds   2.71    6.11      3.21    7.09

Table 4. Results of the proposed tracking algorithm on the test database for both the original tempo-invariant algorithm (first block) and the new tempo- and transposition-invariant approach (second block), including the verification step. For the category 'No Tracking', the query was fed directly to the fingerprinting algorithm. For 'Tracking', the queries were split into sub-queries with a window size of 15 notes and a hop size of 5 notes, and the individual results were tracked by our proof-of-concept multi-agent approach. Evaluation of the tracking approach is based on finding the endpoint of a query (see text). Each estimate is based on 2500 random audio queries. The upper half shows recognition results on the piece level, the lower half on the score position level. For both categories the percentage of correct detections at rank 1 and the mean reciprocal rank (MRR) are given. Additionally, the mean length of the query in seconds and the mean execution time for a query is shown.
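The tables report two figures per category: the fraction of queries whose correct answer is ranked first, and the mean reciprocal rank. For reference, a minimal sketch of computing both from per-query ranked result lists is given below; the function and variable names are ours.

```python
def rank1_and_mrr(ranked_results, correct):
    """Rank-1 accuracy and mean reciprocal rank (MRR) over a set of queries.
    ranked_results: one ranked candidate list per query.
    correct: one set of acceptable answers per query."""
    top1_hits, reciprocal_ranks = 0, []
    for candidates, accept in zip(ranked_results, correct):
        rank = next((i + 1 for i, c in enumerate(candidates) if c in accept), None)
        if rank == 1:
            top1_hits += 1
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(ranked_results)
    return top1_hits / n, sum(reciprocal_ranks) / n


# example: three queries, with the correct answer at ranks 1, 2 and not retrieved at all
acc, mrr = rank1_and_mrr([["a", "b"], ["c", "a"], ["d"]], [{"a"}, {"a"}, {"a"}])
print(acc, mrr)  # 0.333..., (1 + 0.5 + 0) / 3 = 0.5
```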
553 15th International Society for Music Information Retrieval Conference (ISMIR 2014) on-line algorithm, we are not interested in the start position of the query in the score, but in the endpoint, i.e. if the query was tracked successfully, and the correct current position is returned. Even the causal approach leads to a high percentage of correct results with both the original and the tempo- and pitch-invariant fingerprinting algorithm. Most of the remaining mistakes happen because (very) similar parts within one and the same piece are confused. [2] A. Arzt, S. Böck, and G. Widmer. Fast identification of piece and score position via symbolic fingerprinting. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2012. 6. CONCLUSIONS [4] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of algorithms for audio fingerprinting. In Proceedings of the IEEE International Workshop on Multimedia Signal Processing (MMSP), 2002. [3] S. Böck and M. Schedl. Polyphonic piano note transcription with recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. 6.1 Applications The proposed algorithm is useful in a wide range of applications. As a retrieval algorithm it enables fast and robust (inter- and intra-document) searching and browsing in large collections of musical scores and corresponding performances. Furthermore, we believe that the algorithm is not limited to retrieval tasks in classical music, but may be of use for cover version identification in general, and possibly many other tasks. For example, it was already successfully applied in the field of symbolic music processing to find repeating motifs and sections in complex musical scores [5]. Currently, the algorithm is mainly used in an on-line scenario (see [1]). In connection with a score following algorithm it can act as a ‘piano music companion’. The system is able to recognise arbitrary pieces of classical piano music, identify the position in the score and track the progress of the performer. This enables a wide range of applications for musicians and for consumers of classical music. [5] T. Collins, A. Arzt, S. Flossmann, and G. Widmer. Siarct-cfp: Improving precision and the discovery of inexact musical patterns in point-set representations. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2013. [6] S. Dixon. Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research, 30(1):39–58, 2001. [7] S. Flossmann, W. Goebl, M. Grachten, B. Niedermayer, and G. Widmer. The Magaloff project: An interim report. Journal of New Music Research, 39(4):363–377, 2010. [8] F. Kurth and M. Müller. Efficient index-based audio matching. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):382–395, 2008. [9] D. Lang, D. W. Hogg, K. Mierle, M. Blanton, and S. Roweis. Astrometry. net: Blind astrometric calibration of arbitrary astronomical images. The Astronomical Journal, 139(5):1782, 2010. 6.2 Future Work In its current state the algorithm is able to recognise the correct piece and the score position even for very short queries of piano music. It is invariant to both tempo differences and transpositions and can be used in on-line contexts (i.e. to monitor audio streams and at any time report what it is listening to) and as an off-line retrieval algorithm. 
The main direction for future work is to lift the restriction to piano music and make it applicable to all kinds of classical music, even orchestral music. The limiting component at the moment is the transcription algorithm, which is only trained on piano sounds. 7. ACKNOWLEDGMENTS This research is supported by the Austrian Science Fund (FWF) under project number Z159 and the EU FP7 Project PHENICX (grant no. 601166). [10] M. Müller, F. Kurth, and M. Clausen. Audio matching via chroma-based statistical features. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2005. [11] J. Serra, E. Gómez, and P. Herrera. Audio cover song identification and similarity: background, approaches, evaluation, and beyond. In Z. W. Ras and A. A. Wieczorkowska, editors, Advances in Music Information Retrieval, pages 307–332. Springer, 2010. [12] R. Sonnleitner and G. Widmer. Quad-based audio fingerprinting robust to time and frequency scaling. In Proceedings of the International Conference on Digital Audio Effects, 2014. [13] A. Wang. An industrial strength audio search algorithm. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2003. 8. REFERENCES [1] A. Arzt, S. Böck, S. Flossmann, H. Frostel, M. Gasser, and G. Widmer. The complete classical music companion v0. 9. In Proceedings of the 53rd AES Conference on Semantic Audio, 2014. [14] G. Widmer. Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence, 146(2):129– 148, 2003. 554 15th International Society for Music Information Retrieval Conference (ISMIR 2014) GENDER IDENTIFICATION AND AGE ESTIMATION OF USERS BASED ON MUSIC METADATA Chun-Hung Lu Jyh-Shing Roger Jang Ming-Ju Wu Innovative Digitech-Enabled Applications Computer Science Department Computer Science Department & Services Institute (IDEAS), National Tsing Hua University National Taiwan University Institute for Information Industry, Taipei, Taiwan Hsinchu, Taiwan Taipei, Taiwan [email protected] [email protected] [email protected] ABSTRACT Identity unknown Music recommendation is a crucial task in the field of music information retrieval. However, users frequently withhold their real-world identity, which creates a negative impact on music recommendation. Thus, the proposed method recognizes users’ real-world identities based on music metadata. The approach is based on using the tracks most frequently listened to by a user to predict their gender and age. Experimental results showed that the approach achieved an accuracy of 78.87% for gender identification and a mean absolute error of 3.69 years for the age estimation of 48403 users, demonstrating its effectiveness and feasibility, and paving the way for improving music recommendation based on such personal information. Music metadata of the user Top-1 track Top-2 track Top-3 track Artist name Paul Anka The Platters Johnny Cash … … Song title You Are My Destiny Only You I Love You Because … Input Our system Output Gender: male Age: 65 1. INTRODUCTION Figure 1. Illustration of the proposed system using a real example. Amid the rapid growth of digital music and mobile devices, numerous online music services (e.g., Last.fm, 7digital, Grooveshark, and Spotify) provide music recommendations to assist users in selecting songs. Most music-recommendation systems are based on content- and collaborative-based approaches [15]. 
For content-based approaches [2, 8, 9], recommendations are made according to the audio similarity of songs. By contrast, collaborative-based approaches involve recommending music for a target user according to matched listening patterns that are analyzed from massive users [1, 13]. Because music preferences of users relate to their real-world identities [12], several collaborative-based approaches consider identification factors such as age and gender for music recommendation [14]. However, online music services may experience difficulty obtaining such information. Conversely, music metadata (listening history) is generally available. This motivated us to recognize users’ real-world identities based on music metadata. Figure 1 illustrates the proposed system. In this preliminary study, we focused on predicting gender and age according to the most listened songs. In particular, gender identification was treated as a binary-classification problem, whereas age estimation was considered a regression problem. Two features were applied for both gender identification and age estimation tasks. The first feature, TF*IDF, is a widely used feature representation in natural language processing [16]. Because the music metadata of each user can be considered directly as a document, gender identification can be viewed as a document categorization problem. In addition, TF*IDF is generally applied with latent semantic indexing (LSI) to reduce feature dimension. Consequently, this serves as the baseline feature in this study. The second feature, the Gaussian super vector (GSV) [3], is a robust feature representation for speaker verification. In general, the GSV is used to model acoustic features such as MFCCs. In this study, music metadata was translated into proposed hotness features (a bag-of-features representation) and could be modeled using the GSV. The concept of the GSV can be described as follows. First, c Ming-Ju Wu, Jyh-Shing Roger Jang, Chun-Hung Lu. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Ming-Ju Wu, Jyh-Shing Roger Jang, Chun-Hung Lu. “Gender Identification and Age Estimation of Users Based on Music Metadata”, 15th International Society for Music Information Retrieval Conference, 2014. 555 15th International Society for Music Information Retrieval Conference (ISMIR 2014) a universal background model (UBM) is trained using a Gaussian mixture model (GMM) to represent the global music preference of users. A user-specific GMM can then be obtained using the maximum a posteriori (MAP) adaptation from the UBM. Finally, the mean vectors of the user-specific GMM are applied as GSV features. The remainder of this paper is organized as follows: Section 2 describes the related literature, and Section 3 introduces the TF*IDF; the GSV is explained in Section 4, and the experimental results are presented in Section 5; finally, Section 6 provides the conclusion of this study. User identity Gender Top level Age ǥ Semantic gap Music metadata Middle level Mood Music metadata Artist Genre ǥ Semantic gap Features 2. RELATED LITERATURE Low level Timbre Rhythm ǥ Machine learning has been widely applied to music information retrieval (MIR), a vital task of which is content-based music classification [5, 11]. 
For example, the annual Music Information Retrieval Evaluation eXchange (MIREX) competition has been held since 2004, at which some of the most popular competition tasks have included music genre classification, music mood classification, artist identification, and tag annotation. The purpose of content-based music classification is to recognize semantic music attributes from audio signals. Generally, songs are represented by features with different aspects such as timbre and rhythm. Classifiers are used to identify the relationship between low-level features and mid-level music metadata. However, little work has been done on predicting personal traits based on music metadata [7]. Figure 2 shows a comparison of our approach and content-based music classification. At the top level, user identity provides a basic description of users. At the middle level, music metadata provides a description of music. A semantic gap exists between music metadata and user identity. Beyond content-based music classification, our approach serves as a bridge between them. This enables online music services to recognize unknown users more effectively and, consequently, improve their music recommendations.

Figure 2. Comparison of our approach and content-based music classification.

3. TF*IDF FEATURE REPRESENTATION

The music metadata of each user can be considered a document. The TF*IDF describes the relative importance of an artist for a specific document. LSI is then applied for dimensionality reduction.

3.1 TF*IDF

Let the document (music metadata) of each user in the training set be denoted as

di = {t1, t2, · · · , tn}, di ∈ D    (1)

where tn is the artist name of the top-n listened to song of user i. D is the collection of all documents in the training set. The TF*IDF representation is composed of the term frequency (TF) and inverse document frequency (IDF). TF indicates the importance of an artist for a particular document, whereas IDF indicates the discriminative power of an artist among documents. The TF*IDF can be expressed as

tfidf_{i,n} = tf_{i,n} × log(|D| / df_n)    (2)

where tf_{i,n} is the frequency of tn in di, and df_n represents the number of documents in which tn appears:

df_n = |{d : d ∈ D and tn ∈ d}|    (3)

3.2 Latent Semantic Indexing

The TF*IDF representation scheme leads to high feature dimensionality because the feature dimension is equal to the number of artists. Therefore, LSI is generally applied to transform data into a lower-dimensional semantic space. Let W be the TF*IDF representation of D, where each column represents document di. The LSI performs singular value decomposition (SVD) as follows:

W ≈ U Σ V^T    (4)

where U and V represent terms and documents in the semantic space, respectively. Σ is a diagonal matrix with corresponding singular values. Σ^{−1} U^T can be used to transform new documents into the lower-dimensional semantic space.

4. GSV FEATURE REPRESENTATION

This section introduces the proposed hotness features and explains how to generate the GSV features based on hotness features.

4.1 Hotness Feature Extraction

We assumed each artist tn may exude various degrees of hotness to different genders and ages. For example, the count (the number of times) of Justin Bieber that occurs in users' top listened to songs of the training set was 845, where 649 was from the female class and 196 was from the male class. We could define the hotness of Justin Bieber for females as 76.80% (649/845) and that for males as 23.20% (196/845). Consequently, a user tends to be a female if her top listened to songs relate mostly to Justin Bieber. The age and gender characteristics of a user can therefore be obtained by computing the hotness features of relevant artists.

Let D be divided into classes C according to users' genders or ages:

C1 ∪ C2 ∪ · · · ∪ Cp = D    (5)
C1 ∩ C2 ∩ · · · ∩ Cp = ∅    (6)

where p is the number of classes. Here, p is 2 for gender identification and 51 (the range of age) for age estimation. The hotness feature of each artist tn is defined as hn:

hn = [c_{n,1}/α, c_{n,2}/α, · · · , c_{n,p}/α]^T    (7)

where c_{n,p} is the count of artist tn in Cp, and α = Σ_{l=1}^{p} c_{n,l} is the count of artist tn in all classes. Next, each document in (1) can be transformed to a p × n matrix x, which describes the gender and age characteristics of a user:

x = [h1, h2, · · · , hn]    (8)

Because the form of x can be considered a bag-of-features, the GSV can be applied directly.

4.2 GSV Feature Extraction

Figure 3 is a flowchart of the GSV feature extraction, which can be divided into offline and online stages. At the offline stage, the goal is to construct a UBM [10] to represent the global hotness features, which are then used as prior knowledge for each user at the online stage. First, hotness features are extracted for all music metadata in the training set. The UBM is then constructed through a GMM estimated using the EM (expectation-maximization) algorithm. Specifically, the UBM evaluates the likelihood of a given feature vector x as follows:

f(x|θ) = Σ_{k=1}^{K} w_k N(x|m_k, r_k)    (9)

where θ = (w_1, ..., w_K, m_1, ..., m_K, r_1, ..., r_K) is a set of parameters, with w_k denoting the mixture gain for the kth mixture component, subject to the constraint Σ_{k=1}^{K} w_k = 1, and N(x|m_k, r_k) denoting the Gaussian density function with a mean vector m_k and a covariance matrix r_k. This bag-of-features model is based on the assumption that similar users have similar global artist characteristics.

At the online stage, the MAP adaptation [6] is used to produce an adapted GMM for a specific user. Specifically, MAP attempts to determine the parameter θ in the parameter space Θ that maximizes the posterior probability given the training data x and hyperparameter ω, as follows:

θ_MAP = arg max_θ f(x|θ) g(θ|ω)    (10)

where f(x|θ) is the probability density function (PDF) for the observed data x given the parameter θ, and g(θ|ω) is the prior PDF given the hyperparameter ω. Finally, for each user, the mean vectors of the adapted GMM are stacked to form a new feature vector called GSV. Because the adapted GMM is obtained using MAP adaptation over the UBM, it is generally more robust than directly modeling the feature vectors by using GMM without any prior knowledge.

Figure 3. Flowchart of the GSV feature extraction (offline stage: training set metadata → hotness feature extraction → ML estimation → UBM; online stage: a user's metadata → hotness feature extraction → MAP adaptation → GSV).

5. EXPERIMENTAL RESULTS

This section describes data collection, experimental settings, and experimental results.

5.1 Data Collection

The Last.fm API was applied for data set collection, because it allows anyone to access data including albums, tracks, users, events, and tags. First, we collected user IDs through the User.getFriends function. Second, the User.getInfo function was applied to each user for obtaining their age and gender information. Finally, the User.getTopTracks function was applied to acquire at most the top-50 tracks listened to by a user. The track information included song titles and artist names, but only artist names were used for feature extraction in this preliminary study.
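The hotness-feature and GSV construction of Sections 4.1 and 4.2 can be sketched as follows. This is a hedged illustration, not the authors' implementation: it uses scikit-learn's GaussianMixture as a stand-in for the UBM training (the paper does not state which GMM implementation was used), the common relevance-factor form of MAP mean adaptation, a uniform default for unseen artists, and function names of our own choosing.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def build_hotness_table(train_docs, train_classes, p):
    """Per-artist hotness vectors (Eqn (7)): class counts of the artist over its total count.
    train_docs: list of artist-name lists; train_classes: class index per user."""
    counts = {}
    for artists, c in zip(train_docs, train_classes):
        for a in artists:
            counts.setdefault(a, np.zeros(p))[c] += 1
    return {a: v / v.sum() for a, v in counts.items()}


def hotness_features(artists, hotness_table, p):
    """Stack the hotness vectors of a user's top artists into a p x n matrix (Eqn (8))."""
    cols = [hotness_table.get(a, np.full(p, 1.0 / p)) for a in artists]  # unseen artists: uniform
    return np.column_stack(cols)


def train_ubm(all_user_matrices, n_components=2, seed=0):
    """Fit the UBM: a GMM over the pooled hotness vectors of all training users (Eqn (9))."""
    pooled = np.concatenate([x.T for x in all_user_matrices])  # rows are p-dimensional vectors
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(pooled)


def gsv(ubm, user_matrix, relevance=16.0):
    """MAP-adapt the UBM means to one user and stack them into a GSV (Eqn (10))."""
    X = user_matrix.T                                   # (n_artists, p)
    resp = ubm.predict_proba(X)                         # responsibilities p(k | x)
    n_k = resp.sum(axis=0)                              # soft counts per component
    ex_k = (resp.T @ X) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]          # adaptation coefficients
    adapted_means = alpha * ex_k + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()                        # e.g. 2 x 2 = 4 or 2 x 51 = 102 dims
```

With two mixture components and p = 2 or p = 51 this yields GSV dimensionalities of 4 and 102, matching the figures reported in Section 5.2 below.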
The final collected data set included 96807 users, in which each user had at least 40 top tracks as well as complete gender and age information. According to the users' country codes, they were from 211 countries (or regions such as Hong Kong). The ratio of countries is shown in Figure 4. The majority were Western countries. The gender ratio is shown in Figure 5, in which approximately one-third of users (33.79%) were female and two-thirds (66.21%) were male. The age distribution of users is shown in Figure 6. The distribution was a skewed normal distribution and most users were young people. Figure 7 shows the count of each artist that occurred in the users' top listened songs. Among 133938 unique artists in the data set, the ranking of popularity presents a power-law distribution. This demonstrates that a few artists dominate the top listened songs. Although the majority of artists are not popular for all users, this does not indicate that they are unimportant, because their hotness could be discriminative over ages and gender.

Figure 4. Ratio of countries of the collected data set (legend: United Kingdom, United States, Brazil, Germany, Russia, Italy, Poland, Netherlands, Belgium, Canada, Australia, Spain, France, Sweden, Mexico, Others).
Figure 5. Gender ratio of the collected data set (33.79% female, 66.21% male).
Figure 6. Age distribution of the collected data set.
Figure 7. Count of artists of users' top listened songs. Ranking of popularity presents a power-law distribution.

5.2 Experimental Settings

The data set was equally divided into two subsets, the training (48404) and test (48403) sets. The open-source Python library Gensim was applied for the TF*IDF and LSI implementation. We followed the default setting of Gensim, which maintained 200 latent dimensions for the TF*IDF. A support vector machine (SVM) tool, LIBSVM [4], was applied as the classifier. The SVM extension, support vector regression (SVR), was applied as the regressor, which has been observed in many cases to be superior to existing regression approaches. The RBF kernel with γ = 8 was applied to the SVM and SVR. For the UBM parameters, two Gaussian mixture components were experimentally applied (similar results can be obtained when using a different number of mixture components). Consequently, the numbers of dimensions of GSV features for gender identification and age estimation were 4 (2×2) and 102 (2×51), respectively.
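A compact sketch of the classification and regression set-up described in this section is given below. It uses scikit-learn's SVC and SVR (which are built on LIBSVM) as stand-ins for the LIBSVM tool, with the RBF kernel and γ = 8 as stated; all other settings, the clipping of predicted ages to 15–65, and the function names are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.metrics import accuracy_score, mean_absolute_error


def train_and_evaluate(X_train, y_gender_train, y_age_train,
                       X_test, y_gender_test, y_age_test, gamma=8):
    """Gender identification (binary SVM) and age estimation (SVR) on user feature
    vectors such as the GSVs sketched earlier; RBF kernel with gamma = 8."""
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_gender_train)
    reg = SVR(kernel="rbf", gamma=gamma).fit(X_train, y_age_train)
    gender_pred = clf.predict(X_test)
    # one simple way to keep predicted ages in the 15-65 range mentioned in the text
    age_pred = np.clip(reg.predict(X_test), 15, 65)
    return (accuracy_score(y_gender_test, gender_pred),
            mean_absolute_error(y_age_test, age_pred))
```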
5.3 Gender Identification

The accuracy was 78.87% and 78.21% for GSV and TF*IDF + LSI features, respectively. This indicates that both features are adequate for such a task. Despite the low dimensionality of GSV (4), it was superior to the high dimensionality of TF*IDF + LSI (200). This indicates the effectiveness of GSV use and the proposed hotness features. Figures 8 and 9 respectively show the confusion matrix of using GSV and TF*IDF + LSI features. Both features yielded higher accuracies for the male class than for the female class. A possible explanation is that a portion of the female users' listening preferences were similar to those of the male users. The classifier tended to favor the majority class (male), resulting in many female instances with incorrect predictions. The age difference can also be regarded for further analysis. Figure 10 shows the gender identification results of the two features over various ages. Both features tended to have lower accuracies between the ages of 25 and 40 years, implying that a user whose age is between 25 and 40 years seems to have more blurred gender boundaries than do users below 25 years and above 40 years.

Figure 8. Confusion matrix of gender identification by using GSV features:
          Predicted male    Predicted female
Male      86.40% (27690)    13.60% (4358)
Female    35.89% (5870)     64.11% (10485)

Figure 9. Confusion matrix of gender identification by using TF*IDF + LSI features:
          Predicted male    Predicted female
Male      92.14% (29529)    7.86% (2519)
Female    49.10% (8030)     50.90% (8325)

Figure 10. Gender identification results for various ages.

5.4 Age Estimation

Table 1 shows the performance comparison for age estimation. The mean absolute error (MAE) was applied as the performance index. The range of the predicted ages of the SVR is between 15 and 65 years. The experimental results show that the MAE is 3.69 and 4.25 years for GSV and TF*IDF + LSI, respectively. The GSV describes the age characteristics of a user and utilizes prior knowledge from the UBM; therefore, the GSV features are superior to those of the TF*IDF + LSI. For further analysis, gender difference was also considered. Notably, the MAE of females is less than that of males for both GSV and TF*IDF + LSI features. In particular, the MAE differences between males and females are approximately 1.8 for both features, implying that females have more distinct age divisions than males do.

Method         MAE    MAE (male)    MAE (female)
GSV            3.69   4.31          2.48
TF*IDF+LSI     4.25   4.86          3.05

Table 1. Performance comparison for age estimation.

6. CONCLUSION AND FUTURE WORK

This study confirmed the possibility of predicting users' age and gender based on music metadata. Three of the findings are summarized as follows.

• GSV features are superior to those of TF*IDF + LSI for both gender identification and age estimation tasks.
• Males tend to exhibit higher accuracy than females do in gender identification, whereas females are more predictable than males in age estimation.
• The experimental results indicate that gender identification is influenced by age, and vice versa. This suggests that an implicit relationship may exist between them.

Future work could include utilizing the proposed approach to improve music recommendation systems. We will also explore the possibility of recognizing deeper social aspects of user identities, such as occupation and education level.

7. ACKNOWLEDGEMENT

This study is conducted under the "NSC 102-3114-Y-307-026 A Research on Social Influence and Decision Support Analytics" of the Institute for Information Industry which is subsidized by the National Science Council.

8. REFERENCES

[1] L. Barrington, R. Oda, and G. Lanckriet. Smarter than genius? human evaluation of music recommender systems. In Proceedings of the International Symposium on Music Information Retrieval, pages 357–362, 2009.
[2] D. Bogdanov, M. Haro, F.
Fuhrmann, E. Gomez, and P. Herrera. Content-based music recommendation based on user preference examples. In Proceedings of the ACM Conf. on Recommender Systems. Workshop on Music Recommendation and Discovery, 2010. [3] W. M. Campbell, D. E. Sturim, and D. A. Reynolds. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–311, May 2006. [4] C. C. Chang and C. J. Lin. Libsvm: A library for support vector machine, 2010. [12] A. Uitdenbogerd and R. V. Schnydel. A review of factors affecting music recommender success. In Proceedings of the International Symposium on Music Information Retrieval, pages 204–208, 2002. [13] B. Xu, J. Bu, C. Chen, and D. Cai. An exploration of improving collaborative recommender systems via user-item subgroups. In Proceedings of the 21st international conference on World Wide Web, pages 21–30, 2012. [14] Billy Yapriady and AlexandraL. Uitdenbogerd. Combining demographic data with collaborative filtering for automatic music recommendation. In Knowledge-Based Intelligent Information and Engineering Systems, volume 3684 of Lecture Notes in Computer Science, pages 201–207. 2005. [15] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno. Hybrid collaborative and content-based music recommendation using probabilistic model with latent user preferences. In Proceedings of the International Symposium on Music Information Retrieval, pages 296–301, 2006. [16] W. Zhang, T. Yoshida, and X. Tang. A comparative study of tf*idf, lsi and multi-words for text classification. Expert Systems with Applications, 38(3):2758–2765, 2011. [5] Z. Fu, G. Lu, K. M. Ting, and D. Zhang. A survey of audio-based music classification and annotation. IEEE Trans. Multimedia., 13(2):303–319, Apr. 2011. [6] J. L. Gauvain and C. H. Lee. Maximum a posteriori estimation for multivariate gaussian mixture observations of markov chains. IEEE Trans. Audio, Speech, Lang. Process., 2(2):291–298, Apr. 1994. [7] Jen-Yu Liu and Yi-Hsuan Yang. Inferring personal traits from music listening history. In Proceedings of the Second International ACM Workshop on Music Information Retrieval with User-centered and Multimodal Strategies, MIRUM ’12, pages 31–36, New York, NY, USA, 2012. ACM. [8] B. McFee, L. Barrington, and G. Lanckriet. Learning content similarity for music recommendation. IEEE Trans. Audio, Speech, Lang. Process., 20(8):2207–2218, Oct. 2012. [9] A. V. D. Orrd, S. Dieleman, and B. Benjamin. Deep content-based music recommendation. In Advances in Neural Information Processing Systems, 2013. [10] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted gaussian mixture models. Digital Signal Process, 10(13):19–41, Jan. 2000. [11] B. L. Sturm. A survey of evaluation in music genre recognition. In Proceedings of the Adaptive Multimedia Retrieval, 2012. 560 15th International Society for Music Information Retrieval Conference (ISMIR 2014) INFORMATION-THEORETIC MEASURES OF MUSIC LISTENING BEHAVIOUR Daniel Boland, Roderick Murray-Smith School of Computing Science, University of Glasgow, United Kingdom [email protected]; [email protected] ABSTRACT We identify the entropy of music features as a metric for characterising music listening behaviour. This measure can be used to produce time-series analyses of user behaviour, allowing for the identification of events where this behaviour changed. In a case study, the date when a user adopted a different music retrieval system is detected. 
These detailed analyses of listening behaviour can support user studies or provide implicit relevance feedback to music retrieval. More broad analyses are performed across the 10, 000 playlists. A Mutual Information based feature selection algorithm is employed to identify music features relevant to how users create playlists. This user-centred feature selection can sanity-check the choice of features in MIR. The information-theoretic approach introduced here is applicable to any discretisable feature set and distinct in being based solely upon actual user behaviour rather than assumed ground-truth. With the techniques described here, MIR researchers can perform quantitative yet user-centred evaluations of their music features and retrieval systems. We present an information-theoretic approach to the measurement of users’ music listening behaviour and selection of music features. Existing ethnographic studies of music use have guided the design of music retrieval systems however are typically qualitative and exploratory in nature. We introduce the SPUD dataset, comprising 10, 000 handmade playlists, with user and audio stream metadata. With this, we illustrate the use of entropy for analysing music listening behaviour, e.g. identifying when a user changed music retrieval system. We then develop an approach to identifying music features that reflect users’ criteria for playlist curation, rejecting features that are independent of user behaviour. The dataset and the code used to produce it are made available. The techniques described support a quantitative yet user-centred approach to the evaluation of music features and retrieval systems, without assuming objective ground truth labels. 1.1 Understanding Users 1. INTRODUCTION User studies have provided insights about user behaviour in retrieving and listening to music and highlighted the lack of consideration in MIR about actual user needs. In 2003, Cunningham et al. bemoaned that development of music retrieval systems relied on “anecdotal evidence of user needs, intuitive feelings for user information seeking behavior, and a priori assumptions of typical usage scenarios” [5]. While the number of user studies has grown, the situation has been slow to improve. A review conducted a decade later noted that approaches to system evaluation still ignore the findings of user studies [12]. This issue is stated more strongly by Schedl and Flexer, describing systems-centric evaluations that “completely ignore user context and user properties, even though they clearly influence the result” [15]. Even systems-centric work, such as the development of music classifiers, must consider the user-specific nature of MIR. Downie termed this the multiexperiential challenge, and noted that “Music ultimately exists in the mind of its perceiver” [6]. Despite all of this, the assumption of an objective ground truth for music genre, mood etc. is common [4], with evaluations focusing on these rather than considering users. It is clear that much work remains in placing the user at the centre of MIR. Understanding how users interact with music retrieval systems is of fundamental importance to the field of Music Information Retrieval (MIR). The design and evaluation of such systems is conditioned upon assumptions about users, their listening behaviours and their interpretation of music. While user studies have offered guidance to the field thus far, they are mostly exploratory and qualitative [20]. 
The availability of quantitative metrics would support the rapid evaluation and optimisation of music retrieval. In this work, we develop an information-theoretic approach to measuring users’ music listening behaviour, with a view to informing the development of music retrieval systems. To demonstrate the use of these measures, we compiled ‘Streamable Playlists with User Data’ (SPUD) – a dataset comprising 10, 000 playlists from Last.fm 1 produced by 3351 users, with track metadata including audio streams from Spotify. 2 We combine the dataset with the mood and genre classification of Syntonetic’s Moodagent, 3 yielding a range of intuitive music features to serve as examples. c Daniel Boland, Roderick Murray-Smith. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Daniel Boland, Roderick MurraySmith. “Information-Theoretic Measures of Music Listening Behaviour”, 15th International Society for Music Information Retrieval Conference, 2014. 1 . http://www.last.fm 2 . http://www.spotify.com 3 . http://www.moodagent.com Last accessed: 30/04/14 561 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 1.2 Evaluation in MIR The lack of robust evaluations in the field of MIR was identified by Futrelle and Downie as early as 2003 [8]. They noted the lack of any standardised evaluations and in particular that MIR research commonly had an “emphasis on basic research over application to, and involvement with, users.” In an effort to address these failings, the Music Information Retrieval Evaluation Exchange (MIREX) was established [7]. MIREX provides a standardised framework of evaluation for a range of MIR problems using common metrics and datasets, and acts as the benchmark for the field. While the focus on this benchmark has done a great deal towards the standardisation of evaluations, it has distracted research from evaluations with real users. A large amount of evaluative work in MIR focuses on the performance of classifiers, typically of mood or genre classes. A thorough treatment of the typical approaches to evaluation and their shortcomings is given by Sturm [17]. We note that virtually all such evaluations seek to circumvent involving users, instead relying on a ‘ground truth’ which is assumed to be objective. An example of a widely used ground truth dataset is GTZAN, a small collection of music with the author’s genre annotations. Even were the objectivity of such annotations to be assumed, such datasets can be subject to confounding factors and mislabellings as shown by Sturm [16]. Schedl et al. also observe that MIREX evaluations involve assessors’ own subjective annotations as ground truth [15]. Figure 1. Distribution of playlist lengths within the SPUD dataset. The distribution peaks around a playlist length of 12 songs. There is a long tail of lengthy playlists. 2. THE SPUD DATASET The SPUD dataset of 10, 000 playlists was produced by scraping from Last.fm users who were active throughout March and April, 2014. The tracks for each playlist are also associated with a Spotify stream, with scraped metadata, such as artist, popularity, duration etc. The number of unique tracks in the dataset is 271, 389 from 3351 users. The distribution of playlist lengths is shown in Figure 1. We augment the dataset with proprietary mood and genre features produced by Syntonetic’s Moodagent. We do this to provide high-level and intuitive features which can be used as examples to illustrate the techniques being discussed. 
It is clear that many issues remain with genre and mood classification [18] and the results in this work should be interpreted with this in mind. Our aim in this work is not to identify which features are best for music classification but to contribute an approach for gaining an additional perspective on music features. Another dataset of playlists AOTM-2011 is published [13] however the authors only give fragments of playlists where songs are also present in the Million Song Dataset (MSD) [1]. The MSD provides music features for a million songs but only a small fraction of songs in AOTM-2011 were matched in MSD. Our SPUD dataset is distinct in maintaining complete playlists and having time-series data of songs listened to. 1.3 User-Centred Approaches There remains a need for robust, standardised evaluations featuring actual users of MIR systems, with growing calls for a more user-centric approach. Schedl and Flexer made the broad case for “putting the user in the center of music information retrieval”, concerning not only user-centred development but also the need for evaluative experiments which control independent variables that may affect dependent variables [14]. We note that there is, in particular, a need for quantitative dependent variables for user-centred evaluations. For limited tasks such as audio similarity or genre classification, existing dependent variables may be sufficient. If the field of MIR is to concern itself with the development of complete music retrieval systems, their interfaces, interaction techniques, and the needs of a variety of users, then additional metrics are required. Within the field of HCI it is typical to use qualitative methods such as the think-aloud protocol [9] or Likert-scale questionnaires such as the NASA Task Load Index (TLX) [10]. Given that the purpose of a Music Retrieval system is to support the user’s retrieval of music, a dependent variable to measure this ability is desirable. Such a measure cannot be acquired independently of users – the definition of musical relevance is itself subjective. Users now have access to ‘Big Music’ – online collections with millions of songs, yet it is unclear how to evaluate their ability to retrieve this music. The information-theoretic methodology introduced in this work aims to quantify the exploration, diversity and underlying mental models of users’ music retrieval. 3. MEASURING MUSIC LISTENING BEHAVIOUR When evaluating a music retrieval system, or performing a user study, it would be useful to quantify the musiclistening behaviour of users. Studying this behaviour over time would enable the identification of how different music retrieval systems influence user behaviour. Quantifying listening behaviour would also provide a dependent variable for use in MIR evaluations. We introduce entropy as one such quantitative measure, capturing how a user’s music-listening relates to the music features of their songs. 562 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 3.1 Entropy For each song being played by a user, the value of a given music feature can be taken as a random variable X. The entropy H(X) of this variable indicates the uncertainty about the value of that feature over multiple songs in a listening session. This entropy measure gives a scale from a feature’s value never changing, through to every level of the feature being equally likely. The more a user constrains their music selection by a particular feature, e.g. 
mood or album, the lower the entropy is over those features. The entropy for a feature is defined as:

H(X) = − Σ_{x∈X} p(x) log2[p(x)] , (1)

where x is every possible level of the feature X and the distribution p(x) is estimated from the songs in the listening session. The resulting entropy value is measured in bits, though it can be normalised by dividing by the maximum entropy log2[|X|]. Estimating entropy in this way can be done for any set of features, though it requires that they are discretised to an appropriate number of levels. For example, if a music listening session is dominated by songs of a particular tempo, the distribution over values of a TEMPO feature would be very biased. The entropy H(TEMPO) would thus be very low. Conversely, if users used shuffle or listened to music irrespective of tempo, then the entropy H(TEMPO) would tend towards the average entropy of the whole collection.

3.2 Applying a Window Function
Many research questions regarding a user's music listening behaviour concern the change in that behaviour over time. An evaluation of a music retrieval interface might hypothesise that users will be empowered to explore a more diverse range of music. Musicologists may be interested to study how listening behaviour has changed over time and which events precede such changes. It is thus of interest to extend Eqn (1) to define a measure of entropy which is also a function of time:

H(X, t) = H(w(X, t)) , (2)

where w(X, t) is a window function taking n samples of X around time t. In this paper we use a rectangular window function with n = 20, assuming that most albums will have fewer tracks than this. The entropy at any given point is limited to the maximum possible H(X, t) = log2[n], i.e. where each of the n points has a unique value. An example of the change in entropy for a music feature over time is shown in Figure 2. In this case H(ALBUM) is shown, as this will be 0 for album-based listening and at maximum for exploratory or radio-like listening.

Figure 2. Windowed entropy over albums shows a user's album-based music listening over time. Each point represents 20 track plays. The black line depicts mean entropy, calculated using locally weighted regression [3] with 95% CI of the mean shaded. A changepoint is detected around Feb. 2010, as the user began using online radio (light blue).

3.3 Changepoints in Music Retrieval
Having produced a time-series analysis of music-listening behaviour, we are now able to identify events which caused changes in this behaviour. In order to identify changepoints in the listening history, we apply the 'Pruned Exact Linear Time' (PELT) algorithm [11]. The time-series is partitioned in a way that reduces a cost function of changes in the mean and variance of the entropy. Changepoints can be of use in user studies; for example, in Figure 2 the user explained in an interview that the detected changepoint occurred when they switched to using online radio. There is a brief return to album-based listening after the changepoint – users' music retrieval behaviour can be a mixture of different retrieval models. Changepoint detection can also be a user-centred dependent variable in evaluating music retrieval interfaces, i.e. does listening behaviour change as the interface changes? Further examples of user studies are available with the SPUD dataset.
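As a concrete reading of Eqns (1) and (2), here is a minimal Python sketch (separate from the R scripts distributed with SPUD) of session entropy and its windowed version. The toy listening history is invented, and |X| is taken as the number of levels observed in the window:

```python
import math
from collections import Counter

def entropy(values, normalise=False):
    """Shannon entropy (bits) of a sequence of discrete feature values, as in Eqn (1)."""
    counts = Counter(values)
    total = len(values)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if normalise and len(counts) > 1:
        h /= math.log2(len(counts))  # divide by the maximum entropy log2|X|
    return h

def windowed_entropy(values, n=20):
    """Entropy over a sliding rectangular window of n plays, as in Eqn (2)."""
    return [entropy(values[t:t + n]) for t in range(len(values) - n + 1)]

# Invented history: album-based listening followed by varied, radio-like listening.
session = ["album_A"] * 40 + [f"album_{i % 15}" for i in range(60)]
series = windowed_entropy(session, n=20)
print(round(series[0], 2), round(series[-1], 2))  # low during album listening, high later
```

The resulting time series is exactly the kind of signal that a changepoint detector such as PELT (Section 3.3) would then be run over.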
It is important to note that while trends in mean entropy can be identified, the entropy of music listening is itself quite a noisy signal – it is unlikely that a user will maintain a single music-listening behaviour over a large period of time. Periods of album listening (low or zero entropy) can be seen through the time-series, even after the overall trend is towards shuffle or radio-like music listening.

3.4 Identifying Listening Style
The style of music retrieval that the user is engaging in can be inferred using the entropy measures. Where the entropy for a given music feature is low, the user's listening behaviour can be characterised by that feature, i.e. we can be certain about that feature's level. Alternatively, where a feature has high entropy, the user is not 'using' that feature in their retrieval. When a user opts to use shuffle-based playback, i.e. the random selection of tracks, there is the unique case that entropy across all features will tend towards the maximum. In many cases, feature entropies have high covariance, e.g. songs on an album will have the same artist and similar features. We did not include other features in Figure 2 as the same pattern was apparent.

4. SELECTING FEATURES FROM PLAYLISTS
Identifying which music features best describe a range of playlists is not only useful for playlist recommendation, but also provides an insight into how users organise and think about music. Music recommendation and playlist generation typically work on the basis of genre, mood and popularity, and we investigate which of these features is supported by actual user behaviour. As existing retrieval systems are based upon these features, there is a potential 'chicken-and-egg' effect where the features which best describe user playlists are those which users are currently exposed to in existing retrieval interfaces.

4.1 Mutual Information
Information-theoretic measures can be used to identify to what degree a feature shares information with class labels. For a feature X and a class label Y, the mutual information I(X; Y) between these two can be given as:

I(X; Y) = H(X) − H(X | Y) , (3)

that is, the entropy of the feature H(X) minus the entropy of that feature if the class is known, H(X | Y). By taking membership of playlists as a class label, we can determine how much we can know about a song's features if we know what playlist it is in. When using mutual information to compare clusterings in this way, care must be taken to account for random chance mutual information [19]. We adapt this approach to focus on how much the feature entropy is reduced, and normalise accordingly:

AMI(X; Y) = (I(X; Y) − E[I(X; Y)]) / (H(X) − E[I(X; Y)]) , (4)

where AMI(X; Y) is the adjusted mutual information and E[I(X; Y)] is the expectation of the mutual information, i.e. that due to random chance. The AMI gives a normalised measure of how much of the feature's entropy is explained by the playlist. When AMI = 1, the feature level is known exactly if the playlist is known; when AMI = 0, nothing about the feature is known if the playlist is known.
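A small sketch of Eqns (3) and (4): mutual information is estimated from empirical counts, and the chance term E[I(X; Y)] is approximated by permuting the playlist labels, whereas the paper relies on the analytic correction of Vinh et al. [19]. The toy feature and playlist arrays are invented:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X;Y) = H(X) - H(X|Y), estimated from empirical counts (Eqn 3)."""
    h_x_given_y = sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))
    return entropy(x) - h_x_given_y

def adjusted_mutual_information(x, y, n_perm=200, seed=0):
    """AMI (Eqn 4) with a permutation estimate of the chance-level E[I(X;Y)]."""
    rng = np.random.default_rng(seed)
    expected = np.mean([mutual_information(x, rng.permutation(y)) for _ in range(n_perm)])
    return (mutual_information(x, y) - expected) / (entropy(x) - expected)

# Invented data: a discretised ROCK feature and playlist membership for 300 tracks.
rng = np.random.default_rng(1)
playlist = np.repeat(np.arange(10), 30)      # 10 playlists of 30 tracks each
rock = (playlist < 5).astype(int)            # feature strongly tied to playlist membership
rock[rng.integers(0, 300, 40)] ^= 1          # plus some noise
print(round(adjusted_mutual_information(rock, playlist), 2))
```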
4.2 Linking Features to Playlists
We analysed the AMI between the 10,000 playlists in the SPUD dataset and a variety of high-level music features. The ranking of some of these features is given in Figure 3.

Figure 3. Features are ranked by their Adjusted Mutual Information with playlist membership. Playlists are distinguished more by whether they contain ROCK or ANGRY music than by whether they contain POPULAR or WORLD.

Our aim is only to illustrate this approach, as any results are only as reliable as the underlying features. With this in mind, the features ROCK and ANGRY had the most uncertainty explained by playlist membership. While the values may seem small, they are calculated over many playlists, which may combine moods, genres and other criteria. As these features change most between playlists (rather than within them), they are the most useful for characterising the differences between playlists. The DURATION feature ranked higher than expected; further investigation revealed playlists that combined lengthy DJ mixes. It is perhaps unsurprising that playlists were not well characterised by whether they included WORLD music. It is of interest that TEMPO was not one of the highest ranked features, illustrating the style of insights available when using this approach. Further investigation is required to determine whether playlists are not based on tempo as much as is often assumed or if this result is due to the peculiarities of the proprietary perceptual tempo detection.

4.3 Feature Selection
Features can be selected using information-theoretic measures, with a rigorous treatment of the field given by Brown et al. [2]. They define a unifying framework within which to discuss methods for selecting a subset of features using mutual information. This is done by defining a J criterion for a feature:

J(f_n) = I(f_n; C | S) . (5)

This gives a measure of how much information the feature shares with playlists given some previously selected features, and can be used as a greedy feature selection algorithm. Intuitively, features should be selected that are relevant to the classes but that are also not redundant with regard to previously selected features. A range of estimators for I(f_n; C | S) are discussed in [2]. As a demonstration of the feature selection approach we have described, we apply it to the features depicted in Figure 3, selecting features to minimise redundancy. The selected subset of features in rank order is: ROCK, DURATION, POPULARITY, TENDER and JOY. It is notable that ANGRY had an AMI that was almost the same as ROCK, but it is redundant if ROCK is included. Unsurprisingly, the second feature selected is from a different source than the first – the duration information from Spotify adds to that used to produce the Syntonetic mood and genre features. Reducing redundancy in the selected features in this way yields a very different ordering, though one that may give a clearer insight into the factors behind playlist construction.
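A sketch of the greedy loop implied by the J criterion in Eqn (5), using a naive estimate of I(f; C | S) that conditions on the joint value of the already-selected, discretised features; Brown et al. [2] discuss better-behaved estimators, and the feature names and toy data below are invented:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    h_cond = sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))
    return entropy(x) - h_cond

def conditional_mi(f, c, conditioning):
    """Naive I(f; C | S): condition on the joint value of the selected features."""
    if not conditioning:
        return mutual_information(f, c)
    joint = np.array(["|".join(map(str, row)) for row in np.stack(conditioning, axis=1)])
    return sum((joint == v).mean() * mutual_information(f[joint == v], c[joint == v])
               for v in np.unique(joint))

def greedy_select(features, classes, k=3):
    """Greedy forward selection by J(f) = I(f; C | S), Eqn (5)."""
    remaining = dict(features)
    chosen_names, chosen_arrays = [], []
    while remaining and len(chosen_names) < k:
        name = max(remaining, key=lambda n: conditional_mi(remaining[n], classes, chosen_arrays))
        chosen_names.append(name)
        chosen_arrays.append(remaining.pop(name))
    return chosen_names

# Invented example: ANGRY duplicates ROCK, DURATION is mostly noise.
rng = np.random.default_rng(0)
playlists = np.repeat(np.arange(6), 50)
rock = (playlists % 2).astype(int)
angry = rock.copy()
duration = rng.integers(0, 3, playlists.size)
print(greedy_select({"ROCK": rock, "ANGRY": angry, "DURATION": duration}, playlists, k=2))
```

Because ANGRY carries no information beyond ROCK, it is passed over once ROCK is selected, mirroring the redundancy behaviour described above.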
5. DISCUSSION
While we reiterate that this work only uses a specific set of music features and user base, we consider our results to be encouraging. It is clear that the use of entropy can provide a detailed time-series analysis of user behaviour and could prove a valuable tool for MIR evaluation. Similarly, the use of adjusted mutual information allows MIR researchers to directly link work on acquiring music features to the ways in which users interact with music. In this section we consider how the information-theoretic techniques described in this work can inform the field of MIR.

5.1 User-Centred Feature Selection
The feature selection shown in this paper is done directly from the user data. In contrast, feature selection is usually performed using classifier wrappers with ground-truth class labels such as genre. The use of genre is based on the assumption that it supports the way users currently organise music, and features are selected based on these labels. This has led to issues including classifiers being trained on factors that are confounded with these labels and that are not of relevance to genre or users [18]. Our approach selects features independently of the choice of classifier, in what is termed a 'filter' approach. The benefit of doing this is that a wide range of features can be quickly filtered at relatively little computational expense. While the classifier 'wrapper' approach may achieve greater performance, it is more computationally expensive and more likely to suffer from overfitting. The key benefit of filtering features based on user behaviour is that it provides a perspective on music features that is free from assumptions about users and music ground truth. This user-centred perspective provides a sanity-check for music features and classification – if a feature does not reflect the ways in which users organise their music, then how useful is it for music retrieval?

5.2 When To Learn
The information-theoretic measures presented offer an implicit relevance feedback for music retrieval. While we have considered the entropy of features as reflecting user behaviour, this behaviour is conditioned upon the existing music retrieval interfaces being used. For example, after issuing a query and receiving results, the user selects relevant songs from those results. If the entropy of a feature for those selected songs is small relative to the result set, then this feature is implicitly relevant to the retrieval. The identification of shuffle and explorative behaviour provides some context for this implicit relevance feedback. Music which is listened to in a seemingly random fashion may represent an absent or disengaged user, adding noise to attempts to weight recommender systems or build a user profile. At the very least, where entropy is high across all features, those features do not reflect the user's mental model for their music retrieval. The detection of shuffle or high-entropy listening states thus provides a useful data hygiene measure when interpreting listening data.

5.3 Engagement
The entropy measures capture how much each feature is being 'controlled' by the user when selecting their music. We have shown that it spans a scale from a user choosing to listen to something specific to the user yielding control to radio or shuffle. Considering entropy over many features in this way gives a high-dimensional vector representing the user's engagement with music. Different styles of music retrieval occupy different points in this space, commonly the two extremes of listening to a specific album or just shuffling. There is an opportunity for music retrieval that has the flexibility to support users engaging and applying control over music features only insofar as they desire to. An example of this would be a shuffle mode that allowed users to bias it to varying degrees, or to some extent, the feedback mechanism in recommender systems.

5.4 Open Source
The SPUD dataset is made available for download at: http://www.dcs.gla.ac.uk/~daniel/spud/ Example R scripts for importing data from SPUD and producing the analyses and plots in this paper are included.
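Separately from those released scripts, the implicit relevance-feedback idea from Section 5.2 can be sketched in a few lines of Python; the feature values, the result set and the 0.5 threshold below are invented for illustration:

```python
import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def implicitly_relevant(result_set, selected, feature, ratio=0.5):
    """Flag a feature as implicitly relevant when its entropy over the user's
    selections is small relative to its entropy over the whole result set."""
    h_results = entropy([song[feature] for song in result_set])
    h_selected = entropy([song[feature] for song in selected])
    return h_results > 0 and h_selected < ratio * h_results

# Invented result set: the user picks only the rock songs from a mixed result list.
results = [{"genre": g, "tempo": t} for g, t in
           zip(["rock", "jazz", "rock", "pop", "rock", "jazz"],
               [120, 90, 140, 100, 125, 95])]
selected = [s for s in results if s["genre"] == "rock"]
print(implicitly_relevant(results, selected, "genre"))  # True: genre constrained the choice
print(implicitly_relevant(results, selected, "tempo"))  # False: tempo still varied
```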
The code used to scrape this dataset is available under the MIT open source license, and can be accessed at: http://www.github.com/dcboland/ The MoodAgent features are commercially sensitive, thus not included in the SPUD dataset. At present, industry is far better placed to provide such large scale analyses of music data than academia. Even with user data and the required computational power, large-scale music analyses require licensing arrangements with content providers, presenting a serious challenge to academic MIR research. Our adoption of commercially provided features has allowed us to demonstrate our information-theoretic approach, and we distribute the audio stream links, however it is unlikely that many MIR researchers will have the resources to replicate all of these large scale analyses. The CoSound 4 project is an example of industry collaborating with academic research and state bodies to navigate the complex issues of music licensing and large-scale analysis. 6. CONCLUSION This work introduces an information-theoretic approach to the study of users’ music listening behaviour. The case is made for a more user-focused yet quantitative approach to evaluation in MIR. We described the use of entropy to produce time-series analyses of user behaviour, and showed how changes in music-listening style can be detected. An example is given where a user started using online radio, having higher entropy in their listening. We introduced the use of adjusted mutual information to establish which music features are linked to playlist organisation. These techniques provide a quantitative approach to user studies and ground feature selection in user behaviour, contributing tools to support the user-centred future of MIR. 4 . http://www.cosound.dk/ Last accessed: 30/04/14 565 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ACKNOWLEDGEMENTS This work was supported in part by Bang & Olufsen and the Danish Council for Strategic Research of the Danish Agency for Science Technology and Innovation under the CoSound project, case number 11-115328. This publication only reflects the authors’ views. 7. REFERENCES [1] T Bertin-Mahieux, D. P Ellis, B Whitman, and P Lamere. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval, Miami, Florida, 2011. [2] G Brown, A Pocock, M.-J Zhao, and M Luján. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13:27–66, 2012. [3] W. S Cleveland and S. J Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988. [4] A Craft and G Wiggins. How many beans make five? the consensus problem in music-genre classification and a new evaluation method for single-genre categorisation systems. In Proceedings of the 8th International Conference on Music Information Retrieval, Vienna, Austria, 2007. [5] S. J Cunningham, N Reeves, and M Britland. An ethnographic study of music information seeking: implications for the design of a music digital library. In Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries, Houston, Texas, 2003. [6] J. S Downie. Music Information Retrieval. Annual Review of Information Science and Technology, 37(1):295–340, January 2003. [7] J. S Downie. The Music Information Retrieval Evaluation eXchange (MIREX). D-Lib Magazine, 12(12):795–825, 2006. 
[8] J Futrelle and J. S Downie. Interdisciplinary Research Issues in Music Information Retrieval: ISMIR 2000 2002. Journal of New Music Research, 32(2):121–131, 2003. [9] J. D Gould and C Lewis. Designing for usability: key principles and what designers think. Communications of the ACM, 28(3):300–311, 1985. [10] S. G Hart. NASA-task load index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society 50th Annual Meeting, San Francisco, California, 2006. [11] R Killick, P Fearnhead, and I. A Eckley. Optimal Detection of Changepoints With a Linear Computational Cost. Journal of the American Statistical Association, 107(500):1590–1598, 2012. [12] J. H Lee and S. J Cunningham. The Impact (or Nonimpact) of User Studies in Music Information Retrieval. In Proceedings of the 13th International Conference for Music Information Retrieval, Porto, Portugal, 2012. [13] B McFee and G Lanckriet. Hypergraph models of playlist dialects. In Proceedings of the 13th International Conference for Music Information Retrieval, Porto, Portugal, 2012. [14] M Schedl and A Flexer. Putting the User in the Center of Music Information Retrieval. In Proceedings of the 13th International Conference on Music Information Retrieval, Porto, Portugal, 2012. [15] M Schedl, A Flexer, and J Urbano. The neglected user in music information retrieval research. Journal of Intelligent Information Systems, 41(3):523–539, 2013. [16] B. L Sturm. An Analysis of the GTZAN Music Genre Dataset. In Proceedings of the 2nd International ACM Workshop on Music Information Retrieval with Usercentered and Multimodal Strategies, MIRUM ’12, New York, USA, 2012. [17] B. L Sturm. Classification accuracy is not enough. Journal of Intelligent Information Systems, 41(3):371– 406, 2013. [18] B. L Sturm. A simple method to determine if a music information retrieval system is a horse. IEEE Transactions on Multimedia, 2014. [19] N. X Vinh, J Epps, and J Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854, 2010. [20] D. M Weigl and C Guastavino. User studies in the music information retrieval literature. In Proceedings of the 12th International Conference for Music Information Retrieval, Miami, Florida, 2011. 566 15th International Society for Music Information Retrieval Conference (ISMIR 2014) EVALUATION FRAMEWORK FOR AUTOMATIC SINGING TRANSCRIPTION Emilio Molina, Ana M. Barbancho, Lorenzo J. Tardón, Isabel Barbancho Universidad de Málaga, ATIC Research Group, Andalucı́a Tech, ETSI Telecomunicación, Campus de Teatinos s/n, 29071 Málaga, SPAIN [email protected], [email protected], [email protected], [email protected] ABSTRACT In this paper, we analyse the evaluation strategies used in previous works on automatic singing transcription, and we present a novel, comprehensive and freely available evaluation framework for automatic singing transcription. This framework consists of a cross-annotated dataset and a set of extended evaluation measures, which are integrated in a Matlab toolbox. The presented evaluation measures are based on standard MIREX note-tracking measures, but they provide extra information about the type of errors made by the singing transcriber. Finally, a practical case of use is presented, in which the evaluation framework has been used to perform a comparison in detail of several state-of-the-art singing transcribers. 1. 
INTRODUCTION Singing transcription refers to the automatic conversion of a recorded singing signal into a symbolic representation (e.g. a MIDI file) by applying signal-processing methods [1]. One of its renowned applications is query-byhumming [5], but other types of applications also are related to this task, like singing tutors [2], computer games (e.g. Singstar 1 ), etc. In general, singing transcription is considered a specific case of melody transcription (also called note tracking), which is more general problem. However, singing transcription not only relates to melody transcription but also to speech recognition, and still nowadays it is a challenging problem even in the case of monophonic signals without accompaniment [3]. In the literature, various approaches for singing transcription can be found. A simple but commonly referenced approach was proposed by McNab in 1996 [4], and it relied on several handcrafted pitch-based and energy-based segmentation methods. Later, in 2001 Haus et al. used a similar approach with some rules to deal with intonation issues [5], and in 2002, Clarisse et al. [6] contributed with an auditory model, leading to later improved systems 1 http://www.singstar.com c Emilio Molina, Ana M. Barbancho, Lorenzo J. Tardón, Isabel Barbancho. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Emilio Molina, Ana M. Barbancho, Lorenzo J. Tardón, Isabel Barbancho. “Evaluation framework for automatic singing transcription”, 15th International Society for Music Information Retrieval Conference, 2014. such as [7] (later included in MAMI project 2 and today in SampleSumo products 3 ). Additionally, other more recent approaches use hidden Markov models (HMM) to detect note-events in singing voice [8, 9, 11]. One of the most representative HMM-based singing transcribers was published by Ryynänen in 2004 [9]. More recently, in 2013, another probabilistic approach for singing transcription has been proposed in [3], also leading to relevant results. Regarding the evaluation methodologies used in these works (see Sections 2.1 and 3.1 for a review), there is not a standard methodology. In this paper, we present a comprehensive evaluation framework for singing transcription. This framework consists of a cross-annotated dataset (Section 2) and a novel, compact set of evaluation measures (Section 3), which report information about the type of errors made by the singing transcriber. These measures have been integrated in a freely available Matlab toolbox (see Section 3.3). Then, we present a practical case in which the evaluation framework has been used to perform a comparison in detail of several state-of-the-art singing transcribers (Section 4). Finally, some relevant conclusions are presented in Section 5 2. DATASETS In this section, we review the evaluation datasets used in prior works on singing transcription , and we describe the proposed evaluation dataset and our strategy for groundtruth annotation. 2.1 Datasets used in prior works In Table 1, we present the datasets used in some relevant works on singing transcription. Note that none of the datasets fully represents the possible contexts in which singing transcription might be applied, since they are either too small (e.g. [5,6]), either very specific in style (e.g. [11] for opera and [3] for flamenco), or either they use an annotation strategy that may be subjective (e.g. [5, 6]), or only valid for very good performances in rhythm and intonation (e.g. [8, 9]). 
In addition, only the flamenco dataset used in [3] is freely available. 2.2 Proposed dataset In this section we describe the music collection, as well as the annotation strategy used to build the ground-truth. 2 3 567 http://www.ipem.ugent.be/MAMI http://www.samplesumo.com 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Author Year Dataset size Audio quality Music style McNab [4] Haus & Pollastri [5] Clarisse et al. [6] Viitaniemi et al. [8] Ryynänen et al. [9] Mulder et al. [7] 1996 2001 20 short melodies 22 short melodies 66 melodies (120 minutes) Low & moderated noise Low & moderated noise High quality (studio conditions) Popular and scales Popular 2004 52 melo. (1354 notes) Good & moderated noise Popular songs Kumar et al. [10] Krige et al. [11] 2007 47 songs (2513 notes) 13842 notes Good Gómez & Bonada [3] 2013 Indian music Opera lessons & scales Flamenco songs 2002 2003 2004 2008 72 excerpts (2803 notes) High quality but strong reverberation Good & slightly noisy Folk songs & scales Singing style NONE Syllables: ’na-na’... Singing with & without lyrics Singing, humming & whistling Syllables, singing & whistling Syllables: /la/ /da/ /na/ Syllables Lyrics & ornaments Ground-truth (GT) annotation strategy Tunning devs. annotated in GT Freely available Annotated by one musician Annotation by one musician Original score used as ground-truth No No No No No No Team of musicologists Manual annot. of vowel onsets [REf] Time alignment using Viterbi Musicians team (cross-annotation) No No No No No No Yes Yes Table 1. Review of the evaluation datasets used in prior works on singing transcription. Some details about the dataset are not provided in some cases, so certain fields can not be expressed in the same units (e.g. dataset size). 2.2.1 Music collection musicians were given a set of instructions about the specific criteria to annotate the singing melody: The proposed dataset consists of 38 melodies sung by adult and child untrained singers, recorded in mono with a sample rate of 44100Hz and a resolution of 16 bits. Generally, the recordings are not clean and some background noise is present. The duration of the excerpts ranges from 15 to 86 seconds and the total duration of the whole dataset is 1154 seconds. This music collection can be broken down into three categories, according to the type of singer: • Ornaments such as pitch bending at the beginning of the notes or vibratos are not considered independent notes. This criterion is based on Vocaloid’s 6 approach, where ornaments are not modelled with extra notes. • Portamento between two notes does not produce an extra third note (again, this is the criteria used in Vocaloid). • The onsets are placed at the beginning of voiced segments and in each clear change of pitch or phoneme. In the case of ’l’, ’m’, ’n’ voiced consonants + vowel (e.g. ’la’), the onset is not placed at the beginning of the consonant but at the beginning of the vowel. • Children (our own recordings 4 ): 14 melodies of traditional children songs (557 seconds) sung by 8 different children (5-11 years old). • Adult male: 13 pop melodies (315 seconds) sung by 8 different adult male untrained singers. These recordings were randomly chosen from the public dataset MTG-QBH 5 [12]. • The pitch of each note is annotated with cents resolution as perceived by the team of experts. Note that we annotate the tuning deviation for each independent note. 3. 
EVALUATION MEASURES • Adult female: 11 pop melodies (281 seconds) sung by 5 different adult female untrained singers, also taken from MTG-QBH dataset. Note that in this collection the pitch and the loudness can be unstable, and well performed vibratos are not frequent. In this section, we describe the evaluation measures used in prior works on automatic singing transcription, and we present the proposed ones. 3.1 Evaluation measures used in prior works 2.2.2 Ground-truth: annotation strategy The described music collection has been manually annotated to build the ground truth 4 . First, we have transcribed the audio recordings with a baseline algorithm (Section 4.2), and then all the transcription errors have been corrected by an expert musician with more than 10 years of music training. Then, a second expert musician (with 7 years of music training) checked all the annotations until both musicians agreed in their correctness. The transcription errors were corrected by listening, at the same time, to the synthesized transcription and the original audio. The 4 5 In Table 2, we review the evaluation measures used in some relevant works on singing transcription. In some cases, only the note and/or frame error is provided as a compact, representative measure [5, 9], whereas other approaches provide extra information about the type of errors made by the system using dynamic time warping (DTW) [6] or Viterbi-based alignment [11]. In our case, we have taken the most relevant aspects of these approaches and we added some novel ideas in order to define a novel, compact and comprehensive set of evaluations. Available at http://www.atic.uma.es/ismir2014singing http://mtg.upf.edu/download/datasets/mtg-qbh 6 568 http://www.vocaloid.com 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Author McNab Haus & Pollastri [5] Year 1996 Evaluation measures NONE Rate of note pitch errors (segmentation errors are not considered) DTW-based measurement of various note errors, e.g. insertions deletions and substitutions. Frame-based errors. Do not report information about type of errors made. Note-based and frame-based errors. Do not report information about type of errors made. DTW-based measurement of various note errors, e.g. insertions deletions and substitutions. Onset detection errors (pitch and durations are ignored). Viterbi-based measurement of deletions, insertions and substitutions (typical evaluation in speech recognition). MIREX measures for audio melody extraction and note-tracking. Do not report information about type of errors made. are N GT and N TR , respectively. Regarding the expressions used in the for correct notes, we have used Precision, Recall and F-measure, which are defined as follow: Table 2. Evaluation measures used in prior works on singing transcription. Finally, in the case of segmentation errors (Section 3.2.5), we also compute the mean number of notes tagged as X in the transcription for each note tagged as X in the groundtruth. This magnitude has been expressed as a ratio: 2001 Clarisse et al. [6] 2002 Viitaniemi et al. [8] 2003 Ryynänen et al. [9] 2004 Mulder et al. [7] 2004 Kumar et al. [10] 2007 Krige et al. [11] 2008 Gómez & Bonada [3] 2013 In this section, we firstly present the notation and some needed definitions that are used in the rest of sections, and then we describe the evaluation measures used to quantify the proportion of correctly transcribed notes. 
Finally, we present a set of novel evaluation measures that independently report the importance of each type of error. In Figure 1 we show an example of the types of errors considered.

Figure 1. Examples of the different proposed measures (piano-roll comparison of a transcription against the ground truth, MIDI notes 59-62, illustrating the COnPOff, COnP, COn, OBOn, OBP, S, M, ND and PU cases together with the ±50 ms onset, ±50 cent pitch and ±20% duration tolerances).

3.2 Proposed measures

3.2.1 Notation
The i-th note of the ground truth is noted as n_i^GT, and the j-th note of the transcription is noted as n_j^TR; N^GT and N^TR denote the total number of notes in the ground truth and the transcription. For the correct-note categories, Precision, Recall and F-measure are computed as:

CXPrecision = N_CX^TR / N^TR , (1)
CXRecall = N_CX^GT / N^GT , (2)
CXF-measure = 2 · CXPrecision · CXRecall / (CXPrecision + CXRecall) , (3)

where CX makes reference to the specific category of correct note: Correct Onset & Pitch & Offset (X = COnPOff), Correct Onset & Pitch (X = COnP) or Correct Onset (X = COn). Finally, N_CX^GT and N_CX^TR are the total number of matching CX conditions in the ground truth and the transcription, respectively. Regarding the measures used for errors, we have computed the Error Rate with respect to N^GT or with respect to N^TR, as follows:

XRate^GT = N_X^GT / N^GT , (4)
XRate^TR = N_X^TR / N^TR , (5)

and, for segmentation errors (Section 3.2.5), the ratio:

XRatio = N_X^TR / N_X^GT . (6)

3.2.2 Definition of correct onset/pitch/offset
The definitions of correctly transcribed notes (given in Section 3.2.3) consist of combinations of three independent conditions: correct onset, correct pitch and correct offset. We have defined these conditions according to MIREX (Multiple F0 estimation and tracking and Audio Onset Detection tasks), as follows:
• Correct Onset: if the onset of a transcribed note n_j^TR is within a ±50 ms range of the onset of a ground-truth note n_i^GT, i.e.
onset(n_j^TR) ∈ [onset(n_i^GT) − 50 ms, onset(n_i^GT) + 50 ms] , (7)
then we consider that n_i^GT has a correct onset with respect to n_j^TR.
• Correct Pitch: if the pitch of a transcribed note n_j^TR is within a ±0.5 semitone range of the pitch of a ground-truth note n_i^GT, i.e.
pitch(n_j^TR) ∈ [pitch(n_i^GT) − 0.5 st, pitch(n_i^GT) + 0.5 st] , (8)
then we consider that n_i^GT has a correct pitch with respect to n_j^TR.
• Correct Offset: if the offsets of the ground-truth note n_i^GT and the transcribed note n_j^TR are within a range of ±20% of the duration of n_i^GT or ±50 ms, whichever is larger, i.e.
offset(n_j^TR) ∈ [offset(n_i^GT) − OffRan, offset(n_i^GT) + OffRan] , (9)
where OffRan = max(50 ms, 0.2 · duration(n_i^GT)), then we consider that n_i^GT has a correct offset with respect to n_j^TR.

3.2.3 Correctly transcribed notes
with correct onset and offset, but wrong pitch. In order to detect them, firstly we find all ground-truth notes with correct onset and offset, taking into account that one ground-truth note can only be associated with one transcribed note. Then, we remove all notes previously tagged as COnPOff (Section 3.2.3). The reported measure is the rate of OBP notes in the ground-truth:
The definition of "correct note" should be useful to measure the suitability of a given singing transcriber for a specific application. However, different applications may require a different definition of correct note.
Therefore, we have chosen three different definitions of correct note as defined in MIREX: • Correct onset, pitch and offset (COnPOff): This is a standard correctness criteria, since it is used in MIREX (Multiple F0 estimation and tracking task), and it is the most restrictive one. The note nGT is assumed to be cori rectly transcribed into the note nTj R if it has correct onset, correct pitch and correct offset (as defined in Section 3.2.2). In addition, one ground truth note nGT can only be i associated with one transcribed note nTj R . In our evaluation framework, we report Precision, Recall and F-measure as defined in Section 3.2.1: OBPRateGT • Only-Bad-Offset (OBOff): A ground-truth note nGT is i labelled as OBOn if it has been transcribed into a note nTj R with correct pitch and onset, but wrong offset. In order to detect them, firstly we find all ground-truth notes with correct pitch and onset, taking into account that one groundtruth note can only be associated with one transcribed note. Then, we remove all notes previously tagged as COnPOff (Section 3.2.3). The reported measure is the rate of OBOff notes in the ground-truth: COnPOffPrecision , COnPOffRecall and COnPOffF-measure . OBOffRateGT • Correct Onset, Pitch (COnP): This criteria is also used in MIREX, but it is less restrictive since it just considers onset and pitch, and ignores the offset value. Therefore, in COnP criteria, a note nGT is assumed to be correctly i transcribed into the note nTj R if it has correct onset and correct pitch. In addition, one ground truth note nGT can i only be associated with one transcribed note nTj R . In our evaluation framework, we report Precision, Recall and Fmeasure: COnPPrecision , COnPRecall and COnPF-measure . 3.2.5 Incorrect notes with segmentation errors Segmentation errors refer to the case in which sung notes are incorrectly split or merged during the transcription. Depending on the final application, certain types of segmentation errors may not be important (e.g. frame-based systems for query-by-humming are not affected by splits), but they can lead to problems in many other situations. Therefore, we have defined two evaluation measures which are informative about the segmentation errors made by the singing transcriber. • Split (S): A split note is a ground truth note nGT that i is incorrectly segmented into different consecutive notes nTj1R , nTj2R · · · nTjnR . Two requirements are needed in a split: (1) the set of transcribed notes nTj1R , nTj2R , . . . nTjnR must overlap at least the 40% of nGT in time (pitch is igi nored), and (2) nGT must overlap at least the 40% of every i note nTj1R , nTj2R , . . . nTjnR in time (again, pitch is ignored). These requirements are needed to ensure a consistent relationship between ground truth and transcribed notes. The specific reported measures are: • Correct Onset (COn): Additionally, we have included the evaluation criteria used in MIREX Audio Onset Detection task. In this case, a note nGT is assumed to be correctly i transcribed into the note nTj R if it has correct onset. In addition, one ground truth note nGT can only be associated i with one transcribed note nTj R . In our evaluation framework, we report Precision, Recall and F-measure: COnPOffPrecision , COnPOffRecall and COnPOffF-measure . 3.2.4 Incorrect notes with one single error In addition, we have included some novel evaluation measures to identify the notes that are close to be correctly transcribed, but they fail in one single aspect. 
These measures are useful to identify specific weaknesses of a given singing transcriber. The proposed categories are: • Only-Bad-Onset (OBOn): A ground-truth note nGT is i labelled as OBOn if it has been transcribed into a note nTj R with correct pitch and offset, but wrong onset. In order to detect them, firstly we find all ground-truth notes with correct pitch and offset, taking into account that one groundtruth note can only be associated with one transcribed note. Then, we remove all notes previously tagged as COnPOff (Section 3.2.3). The reported measure is the rate of OBOn notes in the ground-truth: OBOnRateGT SRateGT and SRatio Note that in this case SRatio > 1. • Merged (M): A set of consecutive ground-truth notes GT GT nGT i1 , ni2 , · · · nin are considered to be merged if they all are transcribed into the same note nTj R . This is the complementary case of split. Again, two requirements must be true to consider a group of merged notes: (1) the set of GT GT ground truth notes nGT i1 ,ni2 , . . . nin must overlap the TR 40% of nj in time (pitch is ignored), and (2) nTj R must GT GT overlap the 40% of every note nGT i1 ,ni2 , . . . nin in time (again, pitch is ignored). The specific reported measures are: MRateGT and MRatio • Only-Bad-Pitch (OBP): A ground-truth note nGT is lai belled as OBP if it has been transcribed into a note nTj R Note that in this case MRatio < 1. 570 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 3.2.6 Incorrect notes with voicing errors the experiment, we have used a binary provided by the authors of the algorithm. Method (b): Ryynänen (2008) [13]. We have used the method for automatic transcription of melody, bass line and chords in polyphonic music published by Ryynänen in 2008 [13], although we only focus on melody transcription. It is the last evolution of the original HMM-based monophonic singing transcriber [9]. For the experiment, we have used a binary provided by the authors of the algorithm. Method (c): Melotranscript 4 (based on Mulder 2004 [7]). It is the commercial version derived from the research carried out by Mulder et al. [7]. It is based on an auditory model. For the experiment, we have used the demo version available in SampleSumo website 3 . Voicing errors happen when an unvoiced sound produces a false transcribed note (spurious note), or when a sung note is not transcribed at all (non-detected note). This situation is commonly associated to a bad performance of the voicing stage within the singing transcriber. We have defined two categories: • Spurious notes (PU): A spurious note is a transcribed note nTj R that does not overlap at all (neither in time nor in pitch) any note in the ground truth. The associated reported measure is: PURateTR • Non-detected notes (ND): A ground-truth note nGT is i non-detected if it does not overlap at all (neither in time nor in pitch) any transcribed note. The associated reported measure is: NDRateGT 4.2 Baseline algorithm 3.3 Proposed Matlab toolbox The presented evaluation measures have been implemented in a freely available Matlab toolbox 4 , which consists of a set of functions and structures, as well as a graphical user interface to visually analyse the performance of the evaluated singing transcriber. The main function of our toolbox is evaluation.m, which receives the ground-truth and the transcription of an audio clip as inputs, and it outputs the results of all the evaluation measures. 
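As an informal illustration of what such an evaluation computes, the following Python sketch (it is not the Matlab toolbox) applies the tolerance checks of Eqns (7)-(9) and the Precision/Recall/F-measure of Eqns (1)-(3); notes are assumed to be (onset, offset, pitch) tuples in seconds and MIDI, and a greedy one-to-one matching stands in for whatever association strategy the toolbox uses:

```python
def correct(gt, tr, check_pitch=True, check_offset=True):
    """MIREX-style tolerance checks of Eqns (7)-(9) for one (onset, offset, pitch) pair."""
    on_ok = abs(tr[0] - gt[0]) <= 0.05                        # +/- 50 ms onset
    p_ok = (not check_pitch) or abs(tr[2] - gt[2]) <= 0.5     # +/- 0.5 semitones
    off_range = max(0.05, 0.2 * (gt[1] - gt[0]))              # +/- 20% duration or 50 ms
    off_ok = (not check_offset) or abs(tr[1] - gt[1]) <= off_range
    return on_ok and p_ok and off_ok

def prf(ground_truth, transcription, **kwargs):
    """Greedy one-to-one matching, then Precision, Recall and F-measure, Eqns (1)-(3)."""
    matched, hits = set(), 0
    for gt in ground_truth:
        for j, tr in enumerate(transcription):
            if j not in matched and correct(gt, tr, **kwargs):
                matched.add(j)
                hits += 1
                break
    precision = hits / len(transcription) if transcription else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Invented melody: the second note is sung a bit flat, the third is held too long,
# and one spurious note is transcribed at the end.
gt = [(0.00, 0.50, 60), (0.50, 1.00, 62), (1.00, 1.60, 64)]
tr = [(0.02, 0.52, 60), (0.51, 1.02, 61.2), (1.00, 1.80, 64), (2.00, 2.30, 70)]
print("COnPOff", prf(gt, tr))
print("COnP   ", prf(gt, tr, check_offset=False))
print("COn    ", prf(gt, tr, check_pitch=False, check_offset=False))
```

The three calls loosen the criteria one condition at a time, so the scores increase from COnPOff to COn as described in Section 3.2.3.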
In addition, we have included a function called listnotes.m, which receives as inputs the ground-truth, the transcription and the category X to be listed, and it outputs a list (in a two-columns format: onset time-offset time) of all the notes in the ground-truth tagged as X category. This information is useful to isolate the problematic audio excerpts for further analysis. Finally, we have implemented a graphical user interface, where the ground-truth and the transcription of a given audio clip can be compared using a piano-roll representation. This interface also allows the user to highlight notes tagged as X (e.g. COnPOff, S, etc.). 4. PRACTICAL USE OF THE PROPOSED TOOLBOX In this section, we describe a practical case of use in which the presented evaluation framework has been used to perform an improved comparative study of several state-ofthe-art singing transcribers (presented in Section 4.1). In addition, a simple, easily reproducible baseline approach has been included in this comparative study. Finally, we show and discuss the obtained results. 4.1 Compared algorithms We have compared three state-of-the-art algorithms for singing transcription: Method (a): Gómez & Bonada (2013) [3]. It consists of three main steps: tuning-frequency estimation, transcription into short notes, and an iterative process involving note consolidation and refinement of the tuning frequency. For 571 According to [8], the simplest possible segmentation consists of simply rounding a rough pitch estimate to the closest MIDI note ni and taking all pitch changes as note boundaries. The proposed baseline method is based on such idea, and it uses Yin [14] to extract the F0 and aperiodicity at frame-level. A frame is classified as unvoiced if its aperiodicity is under < 0.4. Finally, all notes shorter than 100ms are discarded. 4.3 Results & discussion In Figure 2 we show the results of our comparative analysis. Regarding the F-measure of correct notes (COnPOff, COnP and COn), methods (a) and (c) attains similar values, whereas method (b) performs slightly worse. In addition, it seems that method (a) is slightly superior to method (c) for onset detection, but method (c) is superior when pitch and offset values must be also estimated. In all cases, the baseline is clearly worse than the rest of methods. In addition, we observed that the rate of notes with incorrect onset (OBOn) is equally high (20%) in all methods. After analysing the specific recordings, we concluded that onset detection within a range of ±50ms is very restrictive in the case of singing voice with lyrics, since many onsets are not clear even for an expert musician (as proved during the ground-truth building). Moreover, we also observed that all methods, and especially method (a), have problems with pitch bendings at the beginning of the notes, since they tend to split them. Regarding the segmentation and voicing errors, we realised that method (a) tends to split notes, whereas method (b) tends to merge notes. This information, easily provided by our evaluation framework, may be useful to improve specific weaknesses of the algorithms during the development stage. Finally, we also realised that method (b) is worse than method (a) and (c) in terms of voicing. To sum up, method (c) seems to be the best one in most measures, mainly due to a better performance in segmentation and voicing. However, method (a) is very appropriate for onset detection. 
Finally, although method (b) works clearly better than the baseline, it has a poor performance due to errors in segmentation (mainly merged notes) and voicing (mainly spurious notes).

[Figure 2. Comparison in detail of several state-of-the-art singing transcription systems using the presented evaluation framework: Precision, Recall and F-measure for COnPOff, COnP and COn, ground-truth rates for OBOn, OBP, OBOff, Split and Merged, and rates for Spurious and Non-detected notes, for the baseline and methods (a) Gómez & Bonada, (b) Ryynänen and (c) Melotranscript; the x-axis shows the measure value (0 to 0.8).]

5. CONCLUSIONS
In this paper, we have presented an evaluation framework for singing transcription. It consists of a cross-annotated dataset of 1154 seconds and a novel set of evaluation measures, able to report the type of errors made by the system. Both the dataset and a Matlab toolbox including the presented evaluation measures are freely available 4. In order to show the utility of the work presented in this paper, we have performed a detailed comparative study of three state-of-the-art singing transcribers plus a baseline method, leading to relevant information about the performance of each method. In the future, we plan to expand our evaluation dataset in order to make it comparable to other datasets 7 used in MIREX (e.g. MIR-1K or MIR-QBSH).

6. ACKNOWLEDGEMENTS
This work has been funded by the Ministerio de Economía y Competitividad of the Spanish Government under Project No. TIN2013-47276-C6-2-R and by the Junta de Andalucía under Project No. P11-TIC-7154. The work has been done at Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech.

7. REFERENCES
[1] M. Ryynänen, "Singing transcription," in Signal Processing Methods for Music Transcription (A. Klapuri and M. Davy, eds.), pp. 361–390, Springer Science + Business Media LLC, 2006.
Conference on Acoustics, Speech and Signal Processing ICASSP, pp. 744–748, 2013.
[3] E. Gómez and J. Bonada, "Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing," Computer Music Journal, vol. 37, no. 2, pp. 73–90, 2013.
[4] R. J. McNab, L. A. Smith, and I. H. Witten, "Signal Processing for Melody Transcription," Proceedings of the 19th Australasian Computer Science Conference, vol. 18, no. 4, pp. 301–307, 1996.
[5] G. Haus and E. Pollastri, "An audio front end for query-by-humming systems," in Proceedings of the 2nd International Symposium on Music Information Retrieval ISMIR, pp. 65–72, 2001.
[6] L. P. Clarisse, J. P. Martens, M. Lesaffre, B. D. Baets, H. D. Meyer, and M. Leman, "An Auditory Model Based Transcriber of Singing Sequences," in Proceedings of the 3rd International Conference on Music Information Retrieval ISMIR, pp. 116–123, 2002.
[7] T. De Mulder, J. P. Martens, M. Lesaffre, M. Leman, B. De Baets, and H. De Meyer, "Recent improvements of an auditory model based front-end for the transcription of vocal queries," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2004), Montreal, Quebec, Canada, May 17–21, Vol. IV, pp. 257–260, 2004.
[2] E. Molina, I. Barbancho, E. Gómez, A. M. Barbancho, and L. J. Tardón, "Fundamental frequency alignment vs.
note-based melodic similarity for singing voice assessment,” in Proceedings of the 2013 IEEE International 7 [8] T. Viitaniemi, A. Klapuri, and A. Eronen, “A probabilistic model for the transcription of single-voice melodies,” in Proceedings of the 2003 Finnish Signal Processing Symposium FINSIG03, pp. 59–63, 2003. [9] M. Ryynänen and A. Klapuri, “Modelling of note events for singing transcription,” in Proceedings of ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing SAPA, (Jeju, Korea), Oct. 2004. [10] P. Kumar, M. Joshi, S. Hariharan, and P. Rao, “Sung Note Segmentation for a Query-by-Humming System”. In Intl Joint Conferences on Artificial Intelligence (IJCAI), 2007. [11] W. Krige, T. Herbst, and T. Niesler, “Explicit transition modelling for automatic singing transcription,” Journal of New Music Research, vol. 37, no. 4, pp. 311–324, 2008. [12] J. Salamon, J. Serrá and E. Gómez, “Tonal Representations for Music Retrieval: From Version Identification to Query-by-Humming”, International Journal of Multimedia Information Retrieval, special issue on Hybrid Music Information Retrieval, vol. 2, no. 1, pp. 45–58, 2013. [13] M. P. Ryynänen and A. P. Klapuri, “Automatic Transcription of Melody, Bass Line, and Chords in Polyphonic Music,” in Computer Music Journal, vol.32, no. 3, 2008. [14] A. De Cheveigné and H. Kawahara: “YIN, a fundamental frequency estimator for speech and music,” Journal of the Acoustic Society of America, Vol. 111, No. 4, pp. 1917-1930, 2002. http://mirlab.org/dataSet/public/ 572 15th International Society for Music Information Retrieval Conference (ISMIR 2014) WHAT IS THE EFFECT OF AUDIO QUALITY ON THE ROBUSTNESS OF MFCCs AND CHROMA FEATURES? Julián Urbano, Dmitry Bogdanov, Perfecto Herrera, Emilia Gómez and Xavier Serra Music Technology Group, Universitat Pompeu Fabra Barcelona, Spain {julian.urbano,dmitry.bogdanov,perfecto.herrera,emilia.gomez,xavier.serra}@upf.edu ABSTRACT Many MIR tasks such as classification, similarity, autotagging, recommendation, cover identification and audio fingerprinting, audio-to-score alignment, audio segmentation, key and chord estimation, and instrument detection are at least partially based on them. As they pervade the literature on MIR, we analyzed the effect of audio encoding and signal analysis parameters on the robustness of MFCCs and chroma. To this end, we run two different audio analysis tools over a diverse collection of 400 music tracks. We then compute several indicators that quantify the robustness and stability of the resulting features and estimate the practical implications for a general task like genre classification. Music Information Retrieval is largely based on descriptors computed from audio signals, and in many practical applications they are to be computed on music corpora containing audio files encoded in a variety of lossy formats. Such encodings distort the original signal and therefore may affect the computation of descriptors. This raises the question of the robustness of these descriptors across various audio encodings. We examine this assumption for the case of MFCCs and chroma features. In particular, we analyze their robustness to sampling rate, codec, bitrate, frame size and music genre. Using two different audio analysis tools over a diverse collection of music tracks, we compute several statistics to quantify the robustness of the resulting descriptors, and then estimate the practical effects for a sample task like genre classification. 2. 
DESCRIPTORS

2.1 Mel-Frequency Cepstrum Coefficients
MFCCs are inherited from the speech domain [18], and they have been extensively used to summarize the spectral content of music signals within an analysis frame. MFCCs are widely used in tasks like music similarity [1,12], music classification [6] (in particular, genre), autotagging [13], preference learning for music recommendation [19, 24], cover identification and audio segmentation [17]. There is no standard algorithm to compute MFCCs, and a number of variants have been proposed [8] and adapted for MIR applications. MFCCs are commonly computed as follows. The first step consists in windowing the input signal and computing its magnitude spectrum with the Fourier transform. We then apply a filterbank with critical (mel) band spacing of the filters and bandwidths. Energy values are obtained for the output of each filter, followed by a logarithm transformation. We finally compute a discrete cosine transform to the set of log-energy values to obtain the final set of coefficients. The number of mel bands and the frequency interval on which they are computed may vary among implementations. The low order coefficients account for the slowly changing spectral envelope, while the higher order coefficients describe the fast variations of the spectrum shape, including pitch information. The first coefficient is typically discarded in MIR applications because it does not provide information about the spectral shape; it reflects the overall energy in mel bands.

1. INTRODUCTION
A significant amount of research in Music Information Retrieval (MIR) is based on descriptors computed from audio signals. In many cases, research corpora contain music files encoded in a lossless format. In some situations, datasets are distributed without their original music corpus, so researchers have to gather audio files themselves. In many other cases, audio descriptors are distributed instead of the audio files. In the end, MIR research is thus based on corpora that very well may use different audio encodings, all under the assumption that audio descriptors are robust to these variations and the final MIR algorithms are not affected. This possible lack of robustness poses serious questions regarding the reproducibility of MIR research and its applicability. For instance, whether algorithms trained with lossless audio files can generalize to lossy encodings; or whether a minimum audio bitrate should be required in datasets that distribute descriptors instead of audio files. In this paper we examine the assumption of robustness of music descriptors across different audio encodings on the example of Mel-frequency cepstral coefficients (MFCCs) and chroma features. They are among the most popular music descriptors used in MIR research, as they respectively capture timbre and tonal information.

2.2 Chroma
c J. Urbano, D. Bogdanov, P. Herrera, E. Gómez and X. Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Urbano, D. Bogdanov, P. Herrera, E. Gómez and X. Serra. "What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Features?", 15th International Society for Music Information Retrieval Conference, 2014.
Chroma features represent the spectral energy distribution within an analysis frame, summarized into 12 semitones across octaves in equal-tempered scale.
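Returning briefly to the MFCC pipeline of Section 2.1 (window, magnitude spectrum, mel filterbank, log, DCT), the following numpy sketch computes one analysis frame; the 40 mel bands echo the FB-40 style implementations mentioned in this paper, but the band edges, window and lack of normalisation are illustrative choices rather than the parameterisation of either tool studied here:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sr, f_min=0.0, f_max=None):
    """Triangular filters with centres equally spaced on the mel scale."""
    f_max = f_max or sr / 2.0
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(1, n_bands + 1):
        left, centre, right = bins[b - 1], bins[b], bins[b + 1]
        for k in range(left, centre):
            fb[b - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[b - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc_frame(frame, sr, n_bands=40, n_coeffs=13):
    """MFCCs of a single frame: window -> |FFT| -> mel energies -> log -> DCT-II."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    energies = mel_filterbank(n_bands, len(frame), sr) @ spectrum
    log_energies = np.log(energies + 1e-10)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), np.arange(n_bands) + 0.5) / n_bands)
    return dct @ log_energies

# A 46 ms frame (2048 samples) of a 440 Hz tone at 44100 Hz, as a toy input.
sr, size = 44100, 2048
t = np.arange(size) / sr
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs[1:4].round(2))  # the first coefficient is usually discarded
```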
Chroma captures the pitch class distribution of an input signal, typically used 573 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Since our goal here is not to compare tools, we refer to them simply as Lib1 and Lib2 throughout the paper. Lib1 and Lib2 provide by default two different implementations of MFCCs, both of which compute cepstral coefficients on 40 mel bands, resembling the MFCC FB-40 implementation [8, 22] but on different frequency intervals. Lib1 covers a wider frequency range of 0-11000 Hz with mel bin centers being equally spaced on the mel scale in this range, while Lib2 covers a frequency range of 666364 Hz. We compute the first 13 MFCCs in both systems and discard the first coefficient. In the case of chroma, Lib1 analyzes a frequency range of 40-5000 Hz based on Fourier transform and estimates tuning frequency. Lib2 uses a Constant Q Transform and analyzes the frequency range 65-2093 Hz assuming tuning frequency of 440 Hz, but it does not account for harmonics of the detected peaks. We compute 12-dimensional chroma features. Genre. Robustness may depend as well on the music genre of songs. For instance, as the most dramatic change that perceptual coders introduce is that of filtering out highfrequency spectral content, genres that make use of very high-frequency sounds (e.g. cymbals and electronic tones) should show a more detrimental effect than genres not including them (e.g. country, blues and classical). for key and chord estimation [7, 9], music similarity and cover identification [20], classification [6], segmentation and summarization [5, 17], and synchronization [16]. Several approaches exist for chroma feature extraction, including the following steps. The signal is first analyzed with a high frequency resolution in order to obtain its frequency domain representation. The main frequency components (e.g. spectral peaks) are mapped onto pitch classes according to an estimated tuning frequency. For most approaches, a frequency value partially contributes to a set of “sub-harmonic” fundamental frequency (pitch) candidates. The chroma vector is computed with a given interval resolution (number of bins per octave) and is finally post-processed to obtain the final chroma representation. Timbre invariance is achieved by different transformations such as spectral whitening [9] or cepstrum liftering [15]. 3. EXPERIMENTAL DESIGN 3.1 Factors Affecting Robustness We identified several factors that could have an effect on the robustness of audio descriptors, from the perspective of their audio encoding (codec, bitrate and sampling rate), analysis parameters (frame/hop size and audio analysis tool) and the musical characteristics of the songs (genre). SRate. The sampling rate at which an audio signal is encoded may affect robustness when using very high frequency rates. We study standard 44100 and 22050 Hz. Codec. Perceptual audio coders may also affect descriptors because they introduce perturbations to the original audio signal, in particular by reducing high-frequency content, blurring the attacks, and smoothing the spectral envelope. In our experiments, we chose one lossless and two lossy audio codecs: WAV, MP3 CBR and MP3 VBR. BRate. Different audio codecs allow different bitrates depending on the sampling rate, so we can not combine all codecs with all bitrates. The following combinations are permitted and used in our study: • WAV: 1411 Kbps. • MP3 CBR at 22050 Hz: 64, 96, 128 and 160 Kbps. 
• MP3 CBR at 44100 Hz: 64, 96, 128, 160, 192, 256 and 320 Kbps. • MP3 VBR: 6 (100-130 Kbps), 4 (140-185 Kbps), 2 (170-210 Kbps) and 0 (220-260 Kbps). FSize. We considered a variety of frame sizes for spectral analysis: 23.2, 46.4, 92.9, 185.8, 371.5 and 743.0 ms. That is, we used frame sizes of 1024, 2048, 4096, 8192, 16384 and 32768 samples for signals with sampling rate of 44100 Hz, and the halved values (512, 1024, 2048, 4096, 8192 and 16384 samples) in the case of 22050 Hz. Audio analysis tool. The specific software used to compute descriptors may have an effect on their robustness due to parameterizations (e.g. frequency ranges) and other implementation details. We use two state-of-the-art and open source tools publicly available online: Essentia 2.0.1 1 [2] and QM Vamp Plugins 1.7 for Sonic Annotator 0.7 2 [3]. 3.2 Data We created an ad-hoc corpus of music for this study, containing 400 different music tracks (30 seconds excerpts) by 395 different artists, uniformly covering 10 music genres (blues, classical, country, disco/funk/soul, electronic, jazz, rap/hip-hop, reggae, rock and rock’n’roll). All 400 tracks are encoded from their original CD at a 44100 Hz sampling rate using the lossless FLAC audio codec. We converted all lossless tracks in our corpus into various audio formats in accordance with the factors identified above, taking into account all possible combinations of sampling rate, codec and bitrate. Audio conversion was done using the FFmpeg 0.8.3 3 converter, which includes the LAME codec for MP3 joint stereo mode (Lavf53.21.1 ). Afterwards, we analyzed the original lossless files and their lossy versions using both Lib1 and Lib2. In the case of Lib1, both MFCCs and chroma features were computed for all different frame sizes with the hop size equal to half the frame size. MFCCs were computed similarly in the case of Lib2, but chroma features only allow a fixed frame size of 16384 samples (we selected a hop size of 2048 samples). In all cases, we summarize the framewise feature vectors with the mean of each coefficient. 3.3 Indicators of Robustness We computed several indicators of the robustness of MFCCs and chroma, each measuring the difference between the descriptors computed with the original lossless audio clips and the descriptors computed with their lossy versions. We blocked by tool, sampling rate and frame size under the assumption that these factors are not mixed in practice within the same application. For two arbitrary 1 http://essentia.upf.edu http://vamp-plugins.org/plugin-doc/ qm-vamp-plugins.html 2 3 574 http://www.ffmpeg.org 15th International Society for Music Information Retrieval Conference (ISMIR 2014) vectors x and y (each containing n = 12 MFCC or chroma values) from a lossless and a lossy version, we compute five indicators to measure how different they are. Relative error δ. It is computed as the average relative difference across coefficients. This indicator can be easily interpreted as the percentage error between coefficients, and it is of especial interest for tasks in which coefficients are used as features to train some model. related to genre (main effects confounded with two-factor interactions) [14]. We ran an ANOVA analysis on these models to estimate variance components, which indicate the contribution of each factor to the total variance, that is, their impact on the robustness of the audio descriptors. Table 1 shows the results for MFCCs. 
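Before turning to the results in Table 1, here is a minimal sketch of the five robustness indicators for a single pair of lossless/lossy feature vectors; δ follows the definition just given, and the remaining four (ε, r, ρ, θ) follow the definitions of Section 3.3 below, including the joint ranking used for Spearman's ρ. Function and variable names are ours.

```python
# Five robustness indicators for one lossless/lossy pair of 12-dim vectors.
import numpy as np
from scipy.stats import pearsonr, rankdata

def indicators(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # Relative error delta: average relative difference across coefficients.
    delta = np.mean(np.abs(x - y) / np.maximum(np.abs(x), np.abs(y)))
    # Euclidean distance epsilon between the two vectors.
    eps = np.linalg.norm(x - y)
    # Pearson's r on the raw coefficients.
    r = pearsonr(x, y)[0]
    # Spearman's rho: Pearson's r after ranking all values of x and y jointly.
    ranks = rankdata(np.concatenate([x, y]))
    rho = pearsonr(ranks[:n], ranks[n:])[0]
    # Cosine similarity theta between the two vectors.
    theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return delta, eps, r, rho, theta
```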
As shown by the mean scores, the descriptors computed by Lib1 and Lib2 are similarly robust (note that ε scores are not directly comparable across tools because they are not normalized; actual MFCCs in Lib1 are orders of magnitude larger than in Lib2). Both correlation coefficients r and ρ, as well as cosine similarity θ, are extremely high, indicating that the shape of the feature vectors is largely preserved. However, the average error across coefficients is as high as δ ≈ 6.1% at 22050 Hz and δ ≈ 6.7% at 44100 Hz. When focusing on the stability of the descriptors, we see that the implementation in Lib2 is generally more stable because the distributions have less variance, except for δ and ρ at 22050 Hz. The decomposition in variance components indicates that the choice of frame size is irrelevant in general (low σ̂²FSize scores), and that the largest part of the variability depends on the particular characteristics of the music pieces (very high σ̂²Track + σ̂²residual scores). For Lib2 in particular, this means that controlling encodings or analysis parameters does not increase robustness significantly when the sampling rate is 22050 Hz; it depends almost exclusively on the specific music pieces. On the other hand, the combination of codec and bitrate has a quite large effect in Lib1. For instance, about 42% of the variability in Euclidean distances is due to the BRate:Codec interaction effect. This means that an appropriate selection of the codec and bitrate of the audio files leads to significantly more robust descriptors. At 44100 Hz both tools are clearly affected by the BRate:Codec effect as well, especially Lib1. Figure 1 compares the distributions of δ scores for each tool. We can see that Lib1 has indeed large variance across groups, but small variance within groups, as opposed to Lib2. The robustness of Lib1 seems to converge to δ ≈ 3% at 256 Kbps, and the descriptors are clearly more stable with larger bitrates (smaller within-group variance). On the other hand, the average robustness of Lib2 converges to δ ≈ 5% at 160-192 Kbps, and stabil-
[Figure 1: distributions of δ per MP3 codec-bitrate combination (CBR.64–CBR.320, VBR.6–VBR.0) at 44100 Hz; panels "MFCCs Lib1 44100 Hz" and "MFCCs Lib2 44100 Hz"; see caption below.]
For simplicity, we followed a hierarchical analysis for each combination of sampling rate, tool, feature and robustness indicator. We are first interested in the mean of the score distributions, which tells us the expected robustness in each case (e.g. a low ε mean score suggests that the descriptor is robust because it does not differ much between the lossless and the lossy versions). But we are also interested in the stability of the descriptor, that is, the variance of the distribution. For instance, a descriptor might be robust on average but not below 192 Kbps, or robust only with a frame size of 2048. To gain a deeper understanding of the variations in the indicators, we fitted a random effects model to study the effects of codec, bitrate and frame size [14]. The specific models included the FSize and Codec main effects, and the bitrate was modeled as nested within the Codec effect (BRate:Codec); all interactions among them were also fitted. Finally, we included the Genre and Track main effects to estimate the specific variability due to inherent differences among the music pieces themselves.
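As a rough, simplified stand-in for this analysis (the remaining modeling choices are described in the next paragraph), the sketch below partitions the variability of one indicator with a fixed-effects ANOVA in statsmodels. Unlike the random effects model of [14] it only reports sums-of-squares shares, and for simplicity the nested BRate:Codec structure is collapsed into a single encoding factor; the data frame and its column names are our assumptions.

```python
# Simplified, fixed-effects stand-in for the variance decomposition: partition
# the sums of squares of one robustness indicator over frame size, encoding
# (codec-bitrate combination) and track. Column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("robustness_scores.csv")   # one row per track/encoding/frame size
model = smf.ols("delta ~ C(FSize) * C(Encoding) + C(Track)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)
# Share of total variability attributable to each term (compare with Table 1).
print(100 * anova["sum_sq"] / anova["sum_sq"].sum())
```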
We did not consider any Genre or Track interactions because they can not be controlled in a real-world application, so their effects are all confounded with the residual effect. Note though that this residual does not account for any random error (in fact, there is no random error in this model); it accounts for high-order interactions associated with Genre and Track that are irrelevant for our purposes. This results in a Resolution V design for the factors of interest (main effects unconfounded with two- or three-factor interactions) and a Resolution III design for musical factors CBR.192 3.4 Analysis CBR.160 Euclidean distance ε. The Euclidean distance between the two vectors, which is especially relevant for tasks that compute distances between pairs of songs, such as in music similarity or other tasks that use techniques like clustering. Pearson’s r. The common parametric correlation coefficient between the two vectors, ranging from -1 to 1. Spearman’s ρ. A non-parametric correlation coefficient, equal to the Pearson’s r correlation after transforming all coefficients to their corresponding ranks in x ∪ y. Cosine similarity θ. The angle between both vectors. It is is similar to ε, but it is normalized between 0 and 1. We have 400 tracks×19 BRate:Codec×6 FSize=45600 datapoints for MFCCs with Lib1, MFCCs with Lib2, and chroma with Lib1. For chroma with Lib2 there is just one FSize, which yields 7600 datapoints. This adds up to 144400 datapoints for each indicators, 722000 overall. CBR.128 |xi −yi | max(|xi |,|yi |) CBR.96 1 n CBR.128 δ(x, y) = 4. RESULTS Figure 1. Distributions of δ scores for different combinations of MP3 codec and bitrate at 44100 Hz, and for both audio analysis tools. Blue crosses mark the sample means. Outliers are rather uniformly distributed across genres. 575 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2 σ̂F Size 2 σ̂Codec Lib2 Lib1 2 σ̂BRate:Codec 2 σ̂F Size×Codec 2 σ̂F Size×(BRate:Codec) 2 σ̂Genre 2 σ̂T rack 2 σ̂residual Grand mean Total variance Standard deviation 2 σ̂F Size 2 σ̂Codec 2 σ̂BRate:Codec 2 σ̂F Size×Codec 2 σ̂F Size×(BRate:Codec) 2 σ̂Genre 2 σ̂T rack 2 σ̂residual Grand mean Total variance Standard deviation δ 1.08 0 31.25 0 4.87 0.99 19.76 42.05 0.0591 0.0032 0.0567 1.17 0 4.91 0 0.96 4.21 52.34 36.41 0.0622 0.0040 0.0631 ε 3.03 0 42.13 0 11.71 4.53 5.84 32.75 1.6958 3.4641 1.8612 0.32 0 6.01 0 0.43 14.68 61.05 17.51 0.0278 0.0015 0.0391 22050 Hz r 1.73 0 21.61 0 12.36 3.92 6.46 53.92 0.9999 1.8e-7 0.0004 0.16 0 2.32 0 0.03 2.84 32.07 62.57 0.9999 8.9e-8 0.0003 ρ 0 0 8.38 0 1.23 0.08 11.59 78.72 0.9977 3.2e-5 0.0056 0.24 0 0.74 0 0.04 0.61 66.10 32.27 0.9955 0.0002 0.0131 θ 1.74 0 21.49 0 13.21 3.80 5.73 54.03 0.9999 1.5e-7 0.0004 0.18 0 3.14 0 0.09 4.41 41.26 50.92 0.9999 3.5e-8 0.0002 δ 0.21 0 46.98 0 7.37 1.12 10.12 34.19 0.0682 0.0081 0.0897 0.25 0 23.46 0 7.17 0.37 27.33 41.42 0.0656 0.0055 0.0740 ε 0.09 0 41.77 0.20 18.25 0.52 3.91 35.26 1.8820 11.44 3.3835 0 0 24.23 0 8.09 5.37 14.10 48.21 0.0342 0.0034 0.0587 44100 Hz r 0.01 0 22.52 0.07 17.98 0.90 2.65 55.87 0.9998 1.6e-6 0.0013 0 0 14.27 0 10.35 0.50 6.55 68.32 0.9998 6.4e-7 0.0008 ρ 0 0 24.03 0.05 10.85 0.32 5.23 59.52 0.9939 0.0005 0.0214 0 0 13.31 0 6.34 0 13.32 67.03 0.9947 0.0002 0.0150 θ 0 0 21.51 0.06 18.02 0.89 2.59 56.92 0.9998 1.4e-6 0.0012 0 0 15.02 0 10.86 0.48 5.53 68.11 0.9999 4.8e-7 0.0007 Table 1. Variance components in the distributions of robustness of MFCCs for Lib1 (top) and Lib2 (bottom). 
Each component represents the percentage of total variance due to each effect (e.g. σ̂²FSize = 3.03 indicates that 3.03% of the variability in the robustness indicator is due to differences across frame sizes; σ̂²x = 0 when the effect is so extremely small that the estimate is slightly below zero). All interactions with the Genre and Track main effects are confounded with the residual effect. The last rows show the grand mean, total variance and standard deviation of the distributions.
ity remains virtually the same beyond 96 Kbps. These plots confirm that the MFCC implementation in Lib1 is nearly twice as robust and stable when the encoding is homogeneous in the corpus, while the implementation in Lib2 is less robust but more stable with heterogeneous encodings. The FSize effect is negligible, indicating that the choice of frame size does not affect the robustness of MFCCs in general. However, in several cases we can observe large σ̂²FSize×(BRate:Codec) scores, meaning that for some codec-bitrate combinations it does matter. An in-depth analysis shows that these differences only occur at 64 Kbps though (small frame sizes are more robust); differences are very small otherwise. Finally, the small σ̂²Genre scores indicate that robustness is similar across music genres. A similar analysis was conducted to assess the robustness and stability of chroma features. Even though the correlation indicators are generally high as well, Table 2 shows that chroma vectors do not preserve the shape as well as MFCCs do. When looking at individual coefficients, the relative errors are similarly δ ≈ 6% in Lib1, but they are greatly reduced in Lib2, especially at 44100 Hz. In fact, the chroma implementation in Lib2 is more robust and stable according to all indicators⁴. For Lib1, virtually all the variability in the distributions is due to the Track and residual effects, meaning that chroma is similarly robust across encodings, analysis parameters and genre. For Lib2, we can similarly observe that errors in the correlation indicators depend almost entirely on the Track effect, but δ and ε depend mostly on the codec-bitrate combination. This indicates that, although chroma vectors preserve their shape, the individual components vary significantly across encodings; we observed that increasing the bitrate leads to larger coefficients overall. This suggests that normalizing the chroma coefficients could dramatically improve the distributions of δ and ε. We tried the parameter normalization=2 to have Lib2 normalize chroma vectors to unit maximum. As expected, the effects of codec and bitrate are removed after normalization, and most of the variability is due to the Track effect. The correlation indicators are practically unaltered after normalization.
5. ROBUSTNESS IN GENRE CLASSIFICATION
The previous section provided indicators of robustness that can be easily understood. However, they can be hard to interpret because in the end we are interested in the robustness of the various algorithms that make use of these features; whether δ = 5% is large or not depends on how MFCCs and chroma are used in practice. To investigate this question we consider a music genre classification task. For each sampling rate, codec, bitrate and tool we trained one SVM model with a radial basis kernel using MFCCs and another using chroma. For MFCCs we used a standard frame size of 2048, and for chroma we set 4096 in Lib1 and the fixed 16384 in Lib2.
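A hedged sketch of this classification setup with scikit-learn (our choice of library, not necessarily the authors'): an RBF-kernel SVM on per-track mean descriptor vectors, evaluated with the random sub-sampling scheme described in the next paragraph. The placeholder data stands in for the real descriptors and genre labels.

```python
# RBF-kernel SVM on per-track mean descriptors, 100 random 320/80 splits.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ShuffleSplit, cross_val_score

X = np.random.randn(400, 12)                # placeholder for 400 mean descriptor vectors
genres = np.repeat(np.arange(10), 40)       # placeholder for 10 genres x 40 tracks

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
splits = ShuffleSplit(n_splits=100, train_size=320, test_size=80, random_state=0)
scores = cross_val_score(clf, X, genres, cv=splits)
print(scores.mean())                        # mean accuracy over the 100 random trials
```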
We did random sub-sampling validation with 100 random trials for each model, using 320 tracks for training and the remaining 80 for testing. We first investigate whether a particular choice of encoding is likely to classify better when fixed across training and test sets. Table 3 shows the results for a selection of encodings at 44100 Hz. Within the same tool and descriptor, differences across encodings are quite small, approximately 0.02. In particular, for MFCCs and Lib1 an ANOVA analysis suggests that differences are signifi- 4 Even though these distributions include all frame sizes in Lib1 but only 16384 in Lib2, the FSize effect is negligible in Lib1, meaning that these indicators are still comparable across implementations 576 Lib2 Lib1 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2 σ̂F Size 2 σ̂Genre 2 σ̂T rack 2 σ̂residual Grand Mean Total variance Standard deviation 2 σ̂Codec 2 σ̂BRate:Codec 2 σ̂Genre 2 σ̂T rack 2 σ̂residual Grand mean Total variance Standard deviation δ 1.68 2.81 20.69 74.82 0.0610 0.0046 0.0682 63.62 0.71 0.25 19.29 16.14 0.0346 0.0004 0.0195 ε 2.77 2.75 19.27 75.21 0.0545 0.0085 0.0924 34.55 0.23 15.87 32.77 16.58 0.0031 5e-6 0.0022 22050 Hz r 0.20 1.29 17.75 80.75 0.9554 0.0276 0.1663 0 0 2.90 96.71 0.38 0.9915 0.0002 0.0135 ρ 0.15 1.47 18.52 79.86 0.9366 0.0293 0.1713 0 0 4.05 92.75 3.20 0.9766 0.0007 0.0270 θ 0.38 0.81 16.63 82.17 0.9920 0.0014 0.0373 0 0 7.95 91.80 0.25 0.9998 6.1e-8 0.0002 δ 2.37 3.12 22.28 72.22 0.0588 0.0048 0.0695 32.32 61.80 0.62 3.27 1.98 2.6e-2 4.6e-4 0.0213 ε 2.42 2.61 20.78 74.19 0.0521 0.0082 0.0904 21.59 39.51 9.98 13.79 15.13 2.2e-3 4.8e-6 0.0022 44100 Hz r 0.24 1.17 18.81 79.79 0.9549 0.0286 0.1691 0 0.01 3.43 94.24 2.32 0.9989 3.7e-6 0.0019 ρ 0.34 1.25 19.92 78.49 0.9375 0.0298 0.1725 0 0.03 1.33 93.04 5.60 0.9928 0.0001 0.0122 θ 0.50 0.85 18.64 80.01 0.9922 0.0013 0.0355 0 0.04 3.66 77.00 19.30 1 1.8e-9 4.2e-5 Lib2 Lib1 Table 2. Variance components in the distributions of robustness of Chroma for Lib1 (top) and Lib2 (bottom), similar to Table 1. The Codec main effect and all its interactions are not shown for Lib1 because all variance components are estimated as 0. Note that the FSize main effect and all its interactions are omitted for Lib2 because it is fixed to 16384. MFCCs Chroma MFCCs Chroma 64 .383 .275 .335 .320 96 .384 .281 .329 .325 128 .401 .288 .332 .320 160 .403 .261 .341 .323 192 .395 .278 .336 .325 256 .402 .278 .336 .319 320 WAV .394 .393 .284 .291 .344 .335 .320 .313 trate compression, mostly due to distortions at high frequencies. They estimated squared Pearson’s correlation between MFCCs computed on original lossless audio and its MP3 derivatives, using 4 different MFCC implementations. All implementations were found to be robust at bitrates of at least 128 Kbps, with r2 > 0.95, but a significant loss in robustness was observed at 64 Kbps in some of the implementations. The most robust MFCC implementation had a highest frequency of 4600 Hz, while the least robust implementation included frequencies up to 11025 Hz. Their music corpus contained only 46 songs though, clearly limiting their results. In our experiments, all encodings show r2 > 0.99. However, we note that Pearson’s r is very sensible to outliers with such small samples. This is the case of the first MFCC coefficients, which are orders of magnitude larger than the last coefficients. 
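A toy numeric illustration of this sensitivity, with invented values (the point is elaborated in the next paragraph): when one coefficient dominates, Pearson's r stays close to 1 even if the remaining coefficients disagree, whereas a rank-based correlation does not hide the disagreement.

```python
# Invented numbers: a dominant first coefficient drives Pearson's r to ~1.
import numpy as np
from scipy.stats import pearsonr, spearmanr

small = np.arange(1, 12) * 0.01
x = np.concatenate(([500.0], small))         # lossless-like vector
y = np.concatenate(([505.0], small[::-1]))   # lossy-like vector, small coefficients reversed
print(round(pearsonr(x, y)[0], 4))    # ~1.0, driven by the first coefficient
print(round(spearmanr(x, y)[0], 4))   # clearly lower (negative here)
```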
This makes r extremely large simply because the first coefficients are remotely similar; most of the variability between feature vectors is explained because of the first coefficient. This is clear in our Table 1, where r ≈ 1 and variance is nearly 0. To minimize this sensibility to outliers, we also included the non-parametric Spearman’s ρ correlation coefficient as well as the cosine similarity. In our case, the tool with the larger frequency range was shown to be more robust under homogeneous encodings, while the shorter range was more stable under heterogeneous conditions. Hamawaki et al. [10] analyzed differences in the distribution of MFCCs for different bitrates using a corpus of 2513 MP3 files of Japanese and Korean pop songs with bitrates between 96 and 192 Kbps. Following a music similarity task, they compared differences in the top-10 ranked results when using MFCCs derived from WAV audio, its MP3 encoded versions, and the mixture of MFCCs from different sources. They found that the correlation of the results deteriorates smoothly as the bitrate decreases, while ranking on a set of MFCCs derived from different formats revealed uncorrelated results. We similarly observed that the differences between MFCCs of the original WAV files and its MP3 versions decrease smoothly with bitrate. Jensen et al. [12] measured the effect of audio encoding on performance of an instrument classifier using MFCCs. Table 3. Mean classification accuracy over 100 trials when training and testing with the same encoding (MP3 CBR and WAV only) at 44100 Hz. cant, F (7, 693) = 2.34, p = 0.023; a multiple comparisons analysis reveals that 64 Kbps is significantly worse than the best (160 Kbps). In terms of chroma, differences are again statistically significant, F (7, 693) = 3.71, p < 0.001; 160 Kbps is this time significantly worse that most of the others. With Lib2 differences are not significant for MFCCs, F (7, 693) = 1.07, p = 0.378. No difference is found for chroma either, F (7, 693) = 0.67, p = 0.702. Overall, despite some pairwise comparisons are significantly different, there is no particular encoding that clearly outperforms the others; the observed differences are probably just Type I errors. There is no clear correlation either between bitrate and accuracy. We then investigate whether a particular choice of encoding for training is likely to produce better results when the target test set has a fixed encoding. For MFCCs and Lib1 there is no significant difference in any but one case (testing with 160 Kbps is worst when training with 64 Kbps). For chroma there are a few cases where 160 Kbps is again significantly worse than others, but we attribute these to Type I errors as well. Although not significantly so, the best result is always obtained when the training set has the same encoding as the target test set. With Lib2 there is no significant difference for MFCCs or chroma. Overall, we do not observe a correlation either between training and test encodings. Due to space constrains, we do not discuss results for VBR or 22050 Hz, but the same general conclusions can be drawn nonetheless. 6. DISCUSSION Sigurdsson et al. [21] suggested that MFCCs are sensitive to the spectral perturbations that result from low bi- 577 15th International Society for Music Information Retrieval Conference (ISMIR 2014) They compared MFCCs computed from MP3 files at only 32-64 Kbps, observing a decrease in performance when using a different encoder for training and test sets. 
In contrast, performance did not change significantly when using the same encoder. For genre classification with MFCCs, our results showed no differences in either case. We note though that the bitrates we considered are much larger. Uemura et al. [23] examined the effect of bitrate on chord recognition using chroma features with an SVM classifier. They observed no obvious correlation between encoding and estimation results; the best results were even obtained with very low bitrates for some codecs. Our results on genre classification with chroma largely agree in this case as well; the best results with Lib2 were also obtained by low bitrates. Casey et al. [4] evaluated the effect of lossy encodings on genre classification tasks using audio spectrum projection features. They found a small but statistically significant decrease in accuracy for bitrates of 32 and 96 Kbps. In our experiments, we do not observe these differences, although the lowest bitrate we consider is 64 Kbps. Jacobson et al. [11] also investigated the robustness of onset detection methods to lossy MP3 encoding. They found statistically significant changes in accuracy only at bitrates lower than 32 Kbps. Our results showed that MFCCs and chroma features, as computed by Lib1 and Lib2, are generally robust and stable within reasonable limits. Some differences have been noted between tools though, largely attributable to the different frequency ranges they employ. Nonetheless, it is evident that certain combinations of codec and bitrate may require a re-parameterization of some descriptors to improve or even maintain robustness. In practice, these parameterizations affect the performance and applicability of algorithms, so a balance between performance, robustness and generalizability should be sought. These considerations are of major importance when collecting audio files for some dataset, as a minimum audio quality might be needed for some descriptors. 7. CONCLUSIONS In this paper we have studied the robustness of two common audio descriptors used in Music Information Retrieval, namely MFCCs and chroma, to different audio encodings and analysis parameters. Using a varied corpora of music pieces and two different audio analysis tools we have confirmed that MFFCs are robust to frame/hop sizes and lossy encoding provided that a minimum bitrate of approximately 160 Kbps is used. Chroma features were shown to be even more robust, as the codec and bitrates had virtually no effect on the computed descriptors. This is somewhat expected given that chroma does not capture information as fine-grained as MFCCs do, and that lossy compression does not alter the perceived tonality. We did find subtle differences between implementations of these audio features, which call for further research on standardizing algorithms and parameterizations to maximize their robustness while maintaining their effectiveness in the various tasks they are used in. The immediate line for future work includes the analysis of other features and tools. 8. ACKNOWLEDGMENTS This work is partially supported by an A4U postdoctoral grant and projects SIGMUS (TIN2012-36650), CompMusic (ERC 267583), PHENICX (ICT-2011.8.2) and GiantSteps (ICT-2013-10). 9. REFERENCES [1] J.J. Aucouturier, F. Pachet, and M. Sandler. “The way it sounds”: timbre models for analysis and retrieval of music signals. IEEE Trans. Multimedia, 2005. [2] D. Bogdanov, N. Wack, et al. ESSENTIA: an audio analysis library for music information retrieval. In ISMIR, 2013. [3] C. Cannam, M.O. Jewell, C. 
Rhodes, M. Sandler, and M. d’Inverno. Linked data and you: bringing music research software into the semantic web. J. New Music Res., 2010. [4] M. Casey, B. Fields, et al. The effects of lossy audio encoding on genre classification tasks. In AES, 2008. [5] W. Chai. Semantic segmentation and summarization of music: methods based on tonality and recurrent structure. IEEE Signal Processing Magazine, 2006. [6] D. Ellis. Classifying music audio with timbral and chroma features. In ISMIR, 2007. [7] T. Fujishima. Realtime chord recognition of musical sound: a system using common lisp music. In ICMC, 1999. [8] T. Ganchev, N. Fakotakis, and G. Kokkinakis. Comparative evaluation of various MFCC implementations on the speaker verification task. In SPECOM, 2005. [9] E. Gómez. Tonal description of music audio signals. PhD thesis, Universitat Pompeu Fabra, 2006. [10] S. Hamawaki, S. Funasawa, et al. Feature analysis and normalization approach for robust content-based music retrieval to encoded audio with different bit rates. In MMM, 2008. [11] K. Jacobson, M. Davies, and M. Sandler. The effects of lossy audio encoding on onset detection tasks. In AES, 2008. [12] J.H. Jensen, M.G. Christensen, D. Ellis, and S.H. Jensen. Quantitative analysis of a common audio similarity measure. IEEE TASLP, 2009. [13] B. McFee, L. Barrington, and G. Lanckriet. Learning content similarity for music recommendation. IEEE TASLP, 2012. [14] D.C. Montgomery. Design and Analysis of Experiments. Wiley & Sons, 2009. [15] M. Müller and S. Ewert. Towards timbre-invariant audio features for harmony-based music. IEEE TASLP, 2010. [16] M. Müller, H. Mattes, and F. Kurth. An efficient multiscale approach to audio synchronization. In ISMIR, 2006. [17] J. Paulus, M. Müller, and A Klapuri. Audio-based music structure analysis. In ISMIR, 2010. [18] L.R. Rabiner and R.W. Schafer. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing. 2007. [19] J. Reed and C. Lee. Preference music ratings prediction using tokenization and minimum classification error training. IEEE TASLP, 2011. [20] J. Serrà, E. Gómez, and P. Herrera. Audio cover song identification and similarity: background, approaches, evaluation, and beyond. In Z. Raś and A.A. Wieczorkowska, editors, Advances in Music Information Retrieval. Springer, 2010. [21] S. Sigurdsson, K.B. Petersen, and T. Lehn-Schiler. Mel Frequency Cepstral Coefficients: an evaluation of robustness of MP3 encoded music. In ISMIR, 2006. [22] M. Slaney. Auditory toolbox. Interval Research Corporation, Technical Report, 1998. http://engineering. purdue.edu/˜malcolm/interval/1998-010/. [23] A. Uemura, K. Ishikura, and J. Katto. Effects of audio compression on chord recognition. In MMM, 2014. [24] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and HG. Okuno. An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model. IEEE TASLP, 2008. 578 15th International Society for Music Information Retrieval Conference (ISMIR 2014) MUSIC INFORMATION BEHAVIORS AND SYSTEM PREFERENCES OF UNIVERSITY STUDENTS IN HONG KONG Xiao Hu University of Hong Kong Jin Ha Lee University of Washington Leanne Ka Yan Wong University of Hong Kong [email protected] [email protected] [email protected] ern and Eastern cultures. Before the handover to the Chinese government in 1997, Hong Kong had been ruled by the British government for 100 years. 
This had resulted in a heavy influence of Western culture, although much of the Chinese cultural heritage has also been preserved well in Hong Kong. The cultural influences of Hong Kong to the neighboring regions in Asia were significant, especially in the pre-handover era. In fact, in the 80s and throughout the 90s, Cantopop (Cantonese popular music, sometimes referred to as HK-pop) was widely popular across many Asian countries, and produced many influential artists such as Leslie Cheung, Anita Mui, Andy Lau, and so on [2]. In the post-handover era, there has been an influx of cultural products from mainland China which is significantly affecting the popular culture of Hong Kong [8]. The cultural history and influences of Hong Kong, especially paired with the significance of Cantopop, makes it an interesting candidate to explore among many non-Western cultures. Of the populations in Hong Kong, we specifically wanted to investigate young adults on their music information needs and behaviors. They represent a vibrant population who are not only heavily exposed to and fast adopters of new ideas, but also represent the future workforce and consumers. University students in Hong Kong are mostly digital natives (i.e., grew up with access to computers and the Internet from an early age) with rich experience of seeking and listening to digital music. Additionally the fact that they are influenced by both Western and Eastern cultures, and exposed to both global and local music make them worthy of exploring as a particular group of music users.1 2 There have been a few related studies which investigated music information users in Hong Kong. Lai and Chan [5] surveyed information needs of users in an academic music library setting. They found that the frequencies of using score and multimedia were higher than using electronic journal databases, books, and online journals. Nettamo et al. [9] compared users in New York City and those in Hong Kong in using their mobile devices for music-related tasks. Their results showed that users’ envi- ABSTRACT This paper presents a user study on music information needs and behaviors of university students in Hong Kong. A mix of quantitative and qualitative methods was used. A survey was completed by 101 participants and supplemental interviews were conducted in order to investigate users’ music information related activities. We found that university students in Hong Kong listened to music frequently and mainly for the purposes of entertainment, singing and playing instruments, and stress reduction. This user group often searches for music with multiple methods, but common access points like genre and time period were rarely used. Sharing music with people in their online social networks such as Facebook and Weibo was a common activity. Furthermore, the popularity of smartphones prompted the need for streaming music and mobile music applications. We also examined users’ preferences on music services available in Hong Kong such as YouTube and KKBox, as well as the characteristics liked and disliked by the users. The results not only offer insights into non-Western users’ music behaviors but also for designing online music services for young music listeners in Hong Kong. 1. INTRODUCTION AND RELATED WORK Seeking music and music information is prevalent in our everyday life as music is an indispensable element for many people [1]. People in Hong Kong are not an exception. 
Hong Kong has the second highest penetration rate of broadband Internet access in Asia, following South Korea1. Consequently, Hong Kongers are increasingly using various online music information services to seek and listen to music, including iTunes, YouTube, Kugou, Sogou and Baidu2. However, our current understanding of their music information needs and behaviors are still lacking, as few studies explored user populations in Hong Kong, or in any non-Western cultures. Hong Kong is a unique location that merges the West© Xiao Hu, Jin Ha Lee, Leanne Ka Yan Wong. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Xiao Hu, Jin Ha Lee, Leanne Ka Yan Wong. “Music Information Behaviors and System Preferences of University Students in Hong Kong”, 15th International Society for Music Information Retrieval Conference, 2014. 1 http://www.itu.int/ITU-D/ICTEYE/Reporting/Dynamic ReportWizard.aspx 2 579 http://hk.epochtimes.com/b5/11/10/20/145162.htm 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ronment and context greatly influenced their behaviors, and there were cultural differences in consuming and managing mobile music between the two user groups. Our study investigates everyday music information behaviors of university students in Hong Kong, and thus the scope is broader than these studies. In addition to music information needs and behaviors, this study also examines the characteristics of popular music services adopted by university students in Hong Kong, in order to investigate their strengths and weaknesses. Recommendations for designing music services are proposed based on the results. This study will improve our understanding on music information behaviors of the target population and contribute to the design of music services that can better serve target users. depth explanations to support the survey findings. Faceto-face interviews were carried out individually with five participants from three different universities. The interviews were conducted in Cantonese, the mother tongue of the interviewees, and were later transcribed and translated to English. Each interview lasted up to approximately 20 minutes. 3. SURVEY DATA ANALYSIS Of the 167 survey responses collected, 101 complete responses were analyzed in this study. All the survey participants were university students in Hong Kong. Among them, 58.4% of were female and 41.6% of them were male. They were all born between 1988 and 1994, and most of them (88.1%) were born between 1989 and 1992. Therefore, they were in their early 20s when the survey was taken in 2013. Nearly all of them (98.0%) were undergraduates majoring Science/Engineering (43.6%), Social Sciences/Humanities (54.0%) and Other (2.0%). 2. METHODS A mix of quantitative and qualitative methods was used in order to triangulate our results. We conducted a survey in order to collect general information about target users’ music information needs, seeking behaviors, and opinions on commonly used music services. Afterwards, follow-up face-to-face interviews of a smaller user group were conducted to collect in-depth explanations on the themes and patterns discovered in the survey results. Prior to the formal survey and interviews, pilot tests were carried out with a smaller group of university students to ensure that the questions were well-constructed and students were able to understand and answer them without major issues. 
3.1 Music Preferences In order to find out participants’ preferred music genres, they were asked to select and rank up to five of their favorite music genres from a list of 25 genres covering most Western music genres. To ensure that the participants understand the different genres, titles and artist names of example songs representative of each genre were provided. The results are shown in Table 1 where each cell represents the number of times each genre was mentioned with the rank corresponding in the column. Pop was the most preferred genre among the participants, followed by R&B/Soul and Rock. We also aggregated the results by assigning reversely proportional weights to the ranks (1st: 5 points, and 5th: 1 point). The most popular music genres among the participants were Pop (311 pts), R&B/Soul (204 pts), Rock (109 pts), Gospel (88 pts) and Jazz (86 pts). 2.1 Survey The survey was conducted as an online questionnaire. The questionnaire instrument was adapted from the one used in [6] and [7], with modifications to fit the multilingual and multicultural environment. Seventeen questions about the use of popular music services were added to the questionnaire. The survey was implemented with LimeSurvey, an open-source survey application, and consisted of five parts: demographic information, music preference, music seeking behaviors, music collection management, and opinions on preferred music services. Completing the survey took approximately 30 minutes, and each participant was offered a chance to enter his/her name for a raffle to win one of the three supermarket gift coupons of HKD50, if they wished. The target population was students (both undergraduate and graduate) from the eight universities sponsored by the government of Hong Kong Special Administrative Region. The sample was recruited using Facebook due to its popularity among university students in Hong Kong. Survey invitations were posted on Facebook, initially through the list of friends of the authors, and then further disseminated by chain-referrals. 1st 2nd 3rd 4th 5th Pop 43 14 9 4 5 75 74.2% R&B 7 29 11 7 6 60 59.4% Total Total (%) Rock 6 9 10 3 7 35 34.7% Gospel 9 6 2 4 5 26 25.7% Jazz 6 8 3 5 5 27 26.7% Table 1. Preferences on music genres Moreover, as both Chinese and English are official languages of Hong Kong, participants were also asked to rank their preferences on languages of lyrics. The five options were English, Cantonese, Mandarin, Japanese and Korean. The last three were included due to popularity of songs from nearby countries/regions in Hong Kong, including mainland China and Taiwan (Mandarin), Japan (Japanese), and Korea (Korean). As shown in Table 2, English was in fact highly preferred, followed by Cantonese. Mandarin was mostly ranked at the second or third 2.2 Interviews Semi-structured interviews were conducted after the survey data were collected and analyzed, in order to seek in- 580 15th International Society for Music Information Retrieval Conference (ISMIR 2014) place, while Korean and Japanese were ranked lower. We also aggregated the answers and found that the most popular languages in songs are English (394 points), Cantonese (296 points), and Mandarin (223 points). 1st 2nd 3rd 4th 5th Total Known-item search was the most common type of music information seeking; nearly all respondents (95.1%) sought music information for the identification/verification of musical works, artist and lyrics, and about half of them do so at least a few times a week. 
Obtaining background information was also a strong reason; over 90% of the participants sought music to learn more about music artists (97.0%) as well as music (94.1%), and approximately half of them (53.5% and 40.6%, respectively) sought this kind of music information at least two or three times a month. When asked which sources stimulated or influenced their music information needs, all 101 participants acknowledged online video clips (e.g. YouTube) and TV shows/movies. This suggests that the influence of other media using music is quite significant which echoes the finding that associative metadata in music seeking was important for the university population in the United States [6]. Also over 70% of the participants’ music needs were influenced by music heard in public places, advertisement/commercial, radio show, or family members’/friends’ home. As for the metadata used in searching for music, performer was the most popular access point with 80.2% of positive responses, followed by the title of work(s) (65.3%) and some words of lyrics (62.4%). Other common types of metadata such as genre and time period were only used by a few respondents (33.7% and 29.7%, respectively). Particularly for genre, the proportion is significantly lower than 62.7% as found in the prior survey of university population in the United States [6]. This is perhaps related to the exposure to different music genres in Hong Kong, and the phenomenon that Hong Kongers music listeners tend to emphasize an affinity with friends while Americans (New Yorkers) are more likely to use music to highlight their individual personalities [9]. Moreover, participants responded that they would also seek music based on other users’ opinions: 57.4% by recommendations from other people and 52.5% by popularity. The proportion for popularity is also fairly larger than the 31% in [6]. This shows that the social aspect is a crucial factor affecting participants’ music seeking behaviors. Of the different types of people, friends and family members (91.1%) and people on their social network websites (e.g. Facebook, Weibo) (89.1%) were the ones whom they most likely ask for help when searching for music. In addition, they turned to the Internet more frequently than friends and family members. Thirty-nine percent of them sought help on social network websites at least a few times a week while only 23.8% turned to friends/family members at least a few times a week. On the other hand, when asked which physical places they go to in order to search for music or music information, 82.18% said that they would find music in family members’ or friends’ home, which was higher than going to record stores (75.3%), libraries (70.3%), and academic Total (%) English 46 27 16 3 2 94 93.1% Cantonese 31 20 15 7 2 75 74.3% Mandarin 13 23 20 2 2 60 59.4% Korean 6 15 6 16 14 57 56.4% Japanese 5 5 10 16 16 52 51.5% Table 2. Preferences on languages of song lyrics 3.2 Music Seeking Behaviors When asked about the type of music information they have ever searched, most participants indicated preferences on audio: MP3s and music videos (98.0%), music recordings (e.g., CDs, vinyl records, tapes) (94.1%), and music multimedia in other formats (e.g., Blue-ray, DVD, VHS) (88.1%). Written forms of music information were sought by fewer respondents: books on music (73.2%), music magazines (69.3%), and academic music journals (63.4%). 
Approximately one out of three participants even responded that they have never sought music magazines (30.7%) or academic music journals (36.6%). As for the frequency of search, 41.6% of respondents indicated that they sought MP3s and music videos at least a few times a week, compared to only 18.8% for music recordings (e.g., CDs, vinyl records, tapes) and 24.8% for music multimedia in other formats (e.g., Blue-ray, DVD). Moreover, 98.0% of participants responded that they had searched for music information on the Internet. Among them, almost all (99.0%) answered that they had downloaded free music online, and 95.0% responded that they had listened to streaming music or online radio. This clearly indicates that participants sought digital music more often through online channels than offline or physical materials. However, even though 77.8% of respondents had visited online music store, only 69.7% of them had purchased any electronic music files or albums. Not surprisingly, participants preferred free music resources. Music was certainly a popular element of entertainment in the lives of the participants. When asked why they sought music, all participants included entertainment in their answers. Also, a large proportion (83.0%) indicated that they sought music for entertainment at least a few times a week. Furthermore, 97.0% of respondents search for music information for singing or playing a musical instrument for fun. This proportion is significantly higher than the results from the previous survey of university population in the United States (32.8% for singing and 31.9% for playing a musical instrument) [6]. In addition, 78.2% of our respondents do this at least two or three times a month. We conjecture that this is most likely due to the popularity of karaoke in Hong Kong. 581 15th International Society for Music Information Retrieval Conference (ISMIR 2014) institutions (64.4%). Overall, these data show that users’ social networks, and especially online networks are important for their music searching process. (56.9%). Only a few respondents (9.8%) were unsatisfied with certain features of YouTube such as advanced search, relevance of search results, and navigation. It is surprising to see that five respondents rated YouTube negatively on the aspect of price. We suspect they might have associated this aspect with the price of purchasing digital music from certain music channels on YouTube, or the indirect cost of having to watch ads. However, we did not have the means to identify these respondents to verify the reasons behind their ratings. 3.3 Music Collection Management More participants were managing a digital collection (40.6%) than a physical one (25.7%). On average, each respondent estimated that he/she managed 900 pieces of digital music and 94 pieces of music in physical formats. This shows that managing digital music is more popular among participants, although the units that they typically associate with digital versus physical items might differ (e.g., digital file vs. physical album). We also found that students tended to manage their music collections with simple methods. Over half of the respondents (50.0% for music in physical formats and 56.1% for digital music) manage their music collection by artist name. Participants sometimes also organized their digital collections by album title (17.7%), but rarely by format type (3.9%) and never by record label. 
More participants indicated they did not organize their music at all for their physical music collection (19.2%) than their digital music collection (2.4%). When they did organize their physical music collection, they would use album title (11.5%) and genre (11.5%). Overall, organizing the collection did not seem to be one of the users’ primary activities related to music information. YouTube KKBox 3.4 Preferred Music Services Respondents gave a variety of responses regarding their most frequently visited music services: YouTube (51.5%), KKBox (26.7%), and iTunes (14.9%) were the most popular ones. KKBox is a large cloud-based music service provider founded in Taiwan, very popular in the region and sometimes referred to as “Asian Spotify.” YouTube, which provides free online streaming music video, was almost twice as popular as the second most favored music service, KKBox. The popularity of YouTube was also observed in Lee and Waterman’s survey of 520 music users in 2012 [7]. Their respondents ranked Pandora as the most preferred service, followed by YouTube as the second. The participants were also asked to evaluate their favorite music services. Specifically, they were asked to indicate their level of satisfaction using a 5-point Likert scale on 15 different aspects on search function, search results and system utility. Table 3 shows the percentage of positive (aggregation of “somewhat satisfied” and “very satisfied) and negative (aggregation of “somewhat unsatisfied” and “very unsatisfied”) ratings among users who chose each of the three services as their most favored one. For those who selected YouTube as their most frequently used service, they indicated that they were especially satisfied with its keyword search function (74.5%), recommendation of keywords (70.6%), variety of available music information (60.8%) and attractive interface 582 utility search results search function P N P N iTunes P N keyword search 74.5 7.8 29.6 7.4 13.3 0.0 advanced search 54.9 9.8 44.4 18.5 46.7 6.7 content-based search 51.0 7.8 44.4 29.6 66.7 13.3 auto-correction 49.0 7.8 29.6 29.6 20.0 33.3 keywords suggestion 70.6 3.9 40.7 25.9 20.0 53.3 number of results 52.9 7.8 40.7 22.2 6.7 33.3 relevance 47.1 9.8 48.1 18.5 13.3 33.3 accuracy 49.0 7.8 44.4 18.5 33.3 26.7 price of the service 39.2 9.8 25.9 25.9 33.3 20.0 accessibility 52.9 7.8 22.2 37.0 26.7 20.0 navigation 52.9 9.8 18.5 29.6 6.7 20.0 variety of available music information 60.8 7.8 22.2 22.2 26.7 13.3 music recommendation 52.9 7.8 33.3 22.2 53.3 20.0 interface attractiveness 56.9 3.9 33.3 7.4 40.0 20.0 music sharing 47.1 3.9 40.7 7.4 40.0 20.0 Table 3 User ratings of three most preferred music services (“P”: positive; “N”: negative, in percentage) The level of satisfaction for KKBox was lower than that of YouTube. Nearly half of the participants who use KKBox were satisfied with its relevance of results (48.1%), advanced search function (44.4%) and contentbased search function (44.4%). The aspects of KKBox that participants did not like included the lack of accessibility (37.0%), content-based search function (29.6%), and auto-correction (29.6%). Interestingly, the contentbased search function in KKBox was controversial among the participants. Some participants liked it probably because it was a novel feature that few music services had; while others were not satisfied with it, perhaps due to fact that current performance of audio content-based technologies have yet to meet users’ expectation. 
Only 15 participants rated iTunes as their most frequently used music service. Their opinions on iTunes were mixed. Its content-based search function and music recommendations were valued by 66.7% and 53.3% of the 15 participants, respectively. The data seem to suggest that audio content-based technologies in iTunes performed better than KKBox, but this must be verified with a larger sample in future work. On the other hand, over 15th International Society for Music Information Retrieval Conference (ISMIR 2014) half of the respondents gave negative response to the keyword suggestion function in iTunes. Moreover, the auto-correction, number of search results, and relevance of search results also received negative responses by one third of the respondents. These functions are related to the content of music collection in iTunes, and thus we suspect that the coverage of iTunes perhaps did not meet the expectations of young listeners in Hong Kong, as much as the other two services did. 4.3 24/7 Online Music Listening Participants in this study preferred listening to or watching streaming music services rather than downloading music. Downloading an mp3 file of a song usually takes about a half minute with a broadband connection and slightly longer with a wireless connection. Interviewees commented that downloading just added an extra step which was inconvenient to them. Apart from the web, smart mobile devices are becoming ubiquitous which is also affecting people’s mode of music listening. According to Mobilezine4, 87% of Hong Kongers aged between 15 and 64 own a smart device. According to Phneah [10], 55% of Hong Kong youths think that the use of smartphones dominates their lives as they are unable to stop using smartphones even in restrooms, and many sleep next to it. As expected, university students in Hong Kong are accustomed to having 24/7 access to streaming music on their smartphones. 4. THEMES/TRENDS FROM INTERVIEWS 4.1 Multiple Music Information Searching Strategies Interviewees searched for music using not only music services like YouTube or KKBox, but also generalpurpose search engines, such as Google and Yahoo!. Most often, a simple keyword search with the song title or artist name was conducted when locating music in these music services. However, more complicated searches such as those using lyrics and the name of composer are not supported by most existing music services. In this case, search engines had to be used. For example, if the desired song title and artist name are unknown or inaccurate, interviewees would search for them on Google or Yahoo! with any information they know about the song. The search often directed them to the right piece of metadata which then allowed them to conduct a search in YouTube or other music services. As expected, this does not always lead to successful results; one participant said “when I did not know the song title or artist name, I tried singing the song to Google voice search, but the result was not satisfactory.” 5. IMPLICATIONS FOR MUSIC SERVICES 5.1 Advanced Search A simple keyword search may not be sufficient to accommodate users who want to search for music with various metadata, not only with song titles, but also performer’s names, lyrics, and so on. For example, if a user wants to locate songs with the word “lotus” in the lyrics, they would simply use “lotus” as the search keyword. 
However, the search functions in various music services generally are not intelligent enough to understand the semantic differences among the band named Lotus and the word “lotus” in lyrics, not to mention which role the band Lotus might have played (e.g., performer, composer, or both). As a result, users have to conduct preliminary searches in web search engines as an extra step when attempting to locate the desired song. Many users will appreciate having an advanced search function with specific fields in music services that allow them to conduct lyric search with “lotus” rather than a general keyword search. 4.2 Use of Online Social Networks Online social network services are increasingly popular among people in Hong Kong. According to an online survey conducted with 387 Hong Kong residents in March 20113, the majority of the respondents visited Facebook (92%), read blogs (77%) and even wrote blog posts (52%). Social media provides a convenient way for people to connect with in Hong Kong where maintaining a work-life balance can be quite challenging. University students in Hong Kong are also avid social media users. They prefer communicating and sharing information with others using online social networks for the efficiency and flexibility. Naturally, it also serves as a convenient channel for sharing music recommendations and discussing music-related topics. Relying on others was considered an important way to search for music: “Normally, I will consider others’ opinions first. There are just way too many songs, so it helps find good music much more easily.”, “I love other people’s comments, especially when they have the same view as me!” 5.2 Mood Search Participants showed great interests in the feeling or emotion in music, as they perceived the meaning of songs were mostly about particular emotions. Terms such as “positive”, “optimistic”, and “touching” were used to describe the meaning of music during the interviews. Therefore, music services that can support searching by mood terms may be useful. Music emotion or mood has been recognized as an important access point for music [3]. A cross-cultural study by Hu and Lee [4] points out that listeners from different cultural backgrounds have different music mood judgments and they tend to agree more with users from the 3 4 Hong Kong has the second highest smartphone penetration in the world: http://mobilezine.asia/2013/01/hong-kong-has-the-secondhighest-smartphone-penetration-in-the-world/. Hong Kong social media use higher than United States: http://travel.cnn.com/hong-kong/life/hong-kong-social-media-usehigher-united-states-520745. 583 15th International Society for Music Information Retrieval Conference (ISMIR 2014) same cultural background than users from other cultures. This cultural difference must be taken into account when establishing mood metadata for music services. 7. ACKNOWLEDGEMENT The study was partially supported by a seed basic research project in University of Hong Kong. The authors extend special thanks to Patrick Ho Ming Chan for assisting in data collection. 5.3 Connection with Social Media Social media play a significant role in sharing and discussing music among university students in Hong Kong. YouTube makes it easy for people to share videos in various online social communities such as Facebook, Twitter and Google Plus. Furthermore, users can view the shared YouTube videos directly on Facebook which makes it even more convenient. This is one of the key reasons our participants preferred YouTube. 
However, music services like iTunes have yet to adopt this strategy. For our study population, linking social network to music services would certainly enhance user experience and help promote music as well. 8. REFERENCES [1] M. A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney: “Content-Based Music Information Retrieval: Current Directions and Future Challenges,” Proceedings of the IEEE, 96 (4), pp. 668-696, 2008. [2] S. Y. Chow: “Before and after the Fall: Mapping Hong Kong Cantopop in the Global Era,” LEWI Working Paper Series, 63, 2007. [3] X. Hu: “Music and mood: Where theory and reality meet,” Proceedings of iConference. 2010. 5.4 Smartphone Application Many participants are listening to streaming music with their smartphones, and thus naturally, offering music apps for smart devices will be critical for music services. Both YouTube and iTunes offer smartphone apps. Moreover, instant messaging applications, such as WhatsApp, is found as the most common reason for using smartphones among Hong Kongers [10]. To further improve the user experience, music-related smartphone apps may consider incorporating online instant messaging capabilities. [4] X. Hu and J. H. Lee: “A Cross-cultural Study of Music Mood Perception between American and Chinese Listeners,” Proceedings of the ISMIR, pp.535-540, 2012. [5] K. Lai and K. Chan: “Do you know your music users' needs? A library user survey that helps enhance a user-centered music collection.” The Journal of Academic Librarianship, 36(1), pp.63-69, 2010. 6. CONCLUSION [6] J. H. Lee and S. J. Downie: “Survey of music information needs, uses, and seeking behaviours: Preliminary findings,” Proceedings of the ISMIR, pp. 441-446, 2004. Music is essential for many university students in Hong Kong. They listen to music frequently for the purpose of entertainment and relaxation, to help reduce stress in their extremely tense daily lives. Currently, there does not exist a single music service that can fulfill all or most of their music information needs, and thus they often use multiple tools for specific searches. Furthermore, sharing and acquiring music from friends and acquaintances was a key activity, mainly done on online social networks. Comparing our findings to those of previous studies revealed some cultural differences between Hong Kongers and Americans, such as Hong Kongers relying more on popularity and significantly less on genres in music search. With the prevalence of smartphones, students are increasingly becoming “demanding” as they get accustomed to accessing music anytime and anywhere. Streaming music and music apps for smartphones are becoming increasingly common. The most popular music service among university students in Hong Kong was YouTube due to its convenience, user-friendly interface, and requiring no payment to use their service. In order to further improve the design of music services, we recommended providing an advanced search function, emotion/moodbased search, social network connection, smartphone apps as well as access to high quality digital music which will help fulfill users’ needs. [7] J. H. Lee and M. N. Waterman: “Understanding user requirements for music information services,” Proceedings of the ISMIR, pp. 253-258, 2012. [8] B. T. McIntyre, C. C. W. Sum, and Z. Weiyu: “Cantopop: The voice of Hong Kong,” Journal of Asian Pacific Communication, 12 (2), pp. 217-243, 2002. [9] E. Nettamo, M. Norhamo, and J. 
Häkkilä: “A crosscultural study of mobile music: Retrieval, management and consumption,” Proceedings of OzCHI 2006, pp. 87-94, 2006. [10] J. Phneah: “Worrying signals as smartphone addiction soars,” The Standard. Retrieved from http://www. thestandard.com.hk/news_detail.asp?pp_cat=30&art _id=132763&sid=39444767&con_type=1, 2013. [11] V. M. Steelman: “Intraoperative music therapy: Effects on anxiety, blood pressure,” Association of Operating Room Nurses Journal, 52(5), pp. 10261034, 1990. 584 15th International Society for Music Information Retrieval Conference (ISMIR 2014) LYRICSRADAR: A LYRICS RETRIEVAL SYSTEM BASED ON LATENT TOPICS OF LYRICS Shoto Sasaki∗1 Kazuyoshi Yoshii∗∗2 Tomoyasu Nakano∗∗∗3 Masataka Goto∗∗∗4 Shigeo Morishima∗5 ∗ Waseda University ɹ ∗∗ Kyoto University ∗∗∗ National Institute of Advanced Industrial Science and Technology (AIST) 1 joudanjanai-ss[at]akane.waseda.jp 2 yoshii[at]kuis.kyoto-u.ac.jp 3,4 (t.nakano, m.goto)[at]aist.go.jp 5 shigeo[at]waseda.jp ABSTRACT This paper presents a lyrics retrieval system called LyricsRadar that enables users to interactively browse song lyrics by visualizing their topics. Since conventional lyrics retrieval systems are based on simple word search, those systems often fail to reflect user’s intention behind a query when a word given as a query can be used in different contexts. For example, the wordʠtearsʡcan appear not only in sad songs (e.g., feel heartrending), but also in happy songs (e.g., weep for joy). To overcome this limitation, we propose to automatically analyze and visualize topics of lyrics by using a well-known text analysis method called latent Dirichlet allocation (LDA). This enables LyricsRadar to offer two types of topic visualization. One is the topic radar chart that visualizes the relative weights of five latent topics of each song on a pentagon-shaped chart. The other is radar-like arrangement of all songs in a two-dimensional space in which song lyrics having similar topics are arranged close to each other. The subjective experiments using 6,902 Japanese popular songs showed that our system can appropriately navigate users to lyrics of interests. Figure 1. Overview of topic modeling of LyricsRadar. approaches analyzed the text of lyrics by using natural language processing to classify lyrics according to emotions, moods, and genres [2, 3, 11, 19]. Automatic topic detection [6] and semantic analysis [1] of song lyrics have also been proposed. Lyrics can be used to retrieve songs [5] [10], visualize music archives [15], recommend songs [14], and generate slideshows whose images are matched with lyrics [16]. Some existing web services for lyrics retrieval are based on social tags, such as “love” and “graduation”. Those services are useful, but it is laborious to put appropriate tags by hands and it is not easy to find a song whose tags are also put to many other songs. Macrae et al. showed that online lyrics are inaccurate and proposed a ranking method that considers their accuracy [13]. Lyrics are also helpful for music interfaces: LyricSynchronizer [8] and VocaRefiner [18], for example, show the lyrics of a song so that a user can click a word to change the current playback position and the position for recording, respectively. Latent topics behind lyrics, however, were not exploited to find favorite lyrics. 1. INTRODUCTION Some listeners regard lyrics as essential when listening to popular music. 
It has not been easy, however, for listeners to find songs with their favorite lyrics using existing music information retrieval systems; they usually encounter songs with their favorite lyrics by chance while listening to music. The goal of this research is to help listeners who consider lyrics important to encounter songs with unfamiliar but interesting lyrics. Although there have been previous lyrics-based approaches to music information retrieval, they have not provided an interface that enables users to interactively browse the lyrics of many songs while seeing the latent topics behind those lyrics. We call these latent topics lyrics topics. Several

c Shoto Sasaki, Kazuyoshi Yoshii, Tomoyasu Nakano, Masataka Goto, Shigeo Morishima. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Shoto Sasaki, Kazuyoshi Yoshii, Tomoyasu Nakano, Masataka Goto, Shigeo Morishima. LyricsRadar: A Lyrics Retrieval System Based on Latent Topics of Lyrics, 15th International Society for Music Information Retrieval Conference, 2014.

We therefore propose a lyrics retrieval system, LyricsRadar, that analyzes the lyrics topics by using a machine learning technique called latent Dirichlet allocation (LDA) and visualizes those topics to help users find their favorite lyrics interactively (Fig. 1). A single word can have different lyrics topics. For example, "diet" may have at least two: when it is used with words related to meals, vegetables, and fat, its lyrics topic could be estimated by the LDA as "food and health"; on the other hand, when it is used with words like government, law, and elections, "politics" could be estimated. Although the LDA can estimate various lyrics topics, five typical topics common to all lyrics in a given database were chosen. The lyrics of each song are represented by their unique ratios of these five topics, which are displayed as a pentagon-shaped chart called the topic radar chart. This chart makes it easy to guess the meaning of lyrics before listening to the song. Furthermore, users can directly change the shape of this chart as a query to retrieve lyrics having a similar shape. In LyricsRadar, all the lyrics are embedded in a two-dimensional space, mapped automatically based on the ratios of the five lyrics topics. The positions are such that lyrics in close proximity have similar ratios. Users can navigate in this plane by mouse operation and discover lyrics that are located very close to their favorite lyrics.

Figure 2. Example display of LyricsRadar.

2.1 Visualization based on the topic of lyrics

LyricsRadar has the following two visualization functions: (1) the topic radar chart; and (2) a mapping to the two-dimensional plane. Figure 2 shows an example display of our interface. The topic radar chart shown in the upper-left corner of Figure 2 is a pentagon-shaped chart that expresses the ratios of the five topics of the lyrics. Each colored dot displayed in the two-dimensional plane shown in Figure 2 indicates the relative location of a song's lyrics in the database. We call these colored dot representations of lyrics lyrics dots.
Users can see the lyrics, their title and artist name, and the topic ratios by clicking a lyrics dot placed in the 2D space; this supports discovering lyrics interactively. While the lyrics mapping helps the user understand the lyrics topic through relative location in the map, the topic radar chart helps the user grasp the image of the lyrics intuitively through the shape of the chart. We explain each of these in the following subsections.

2. FUNCTIONALITY OF LYRICSRADAR

LyricsRadar provides a graphical user interface that assists users in navigating a two-dimensional space intuitively and interactively to come across a target song. This space is generated automatically by analyzing, with LDA, the topics that appear in common across the lyrics of the many musical pieces in the database. The latent meaning of lyrics is also visualized by the topic radar chart based on the combination of topic ratios. Lyrics that are similar to a user's preference (target) can be intuitively discovered by clicking the topic radar chart or the lyrics represented by dots. This approach cannot be achieved at all by the conventional method, which directly searches for a song by the keywords or phrases appearing in its lyrics. Since linguistic expressions of the topic are not necessary, users can find a target song intuitively even when they do not have any knowledge about the lyrics.

2.1.1 Topic radar chart

The values of the lyrics topics are computed and visualized as the topic radar chart, which is pentagon-shaped. Each vertex of the pentagon corresponds to a distinct topic, and the predominant words of each topic (e.g., "heart", "world", and "life" for topic 3) are also displayed at the five corners of the pentagon, as shown in Figure 2. The predominant words help the user guess the meaning of each topic. The center of the topic radar chart indicates a ratio value of 0 for a lyrics topic, in the same manner as a common radar chart. Since the sum of the five components is a constant value, if the ratio of one topic stands out, the user will see it clearly. It is easy to grasp the topic of selected lyrics visually and to make an intuitive comparison between lyrics. Furthermore, the number of topics in this interface is set to five to strike a balance between the operability of the interface and the variety of topics¹.
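As a concrete illustration of the idea behind the topic radar chart and of retrieving lyrics whose charts have a similar "shape", the following sketch treats each song as a five-dimensional topic-ratio vector that sums to 1.0 and ranks songs by Euclidean distance to a query ratio. This is an illustrative sketch, not the authors' implementation; all variable names and the example numbers are hypothetical.

```python
import numpy as np

def normalize_ratios(weights):
    """Rescale five non-negative topic weights so they sum to 1.0,
    matching the constraint of the pentagon-shaped topic radar chart."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()

def rank_by_chart_shape(query_ratios, song_ratios):
    """Return song indices sorted by how closely their five topic ratios
    match the query 'shape' (smaller distance = more similar)."""
    q = normalize_ratios(query_ratios)
    dists = np.linalg.norm(song_ratios - q, axis=1)
    return np.argsort(dists)

# Hypothetical example: a query chart emphasizing topic 3 ("heart", "world", "life").
song_ratios = np.array([[0.10, 0.20, 0.50, 0.10, 0.10],
                        [0.30, 0.30, 0.10, 0.20, 0.10],
                        [0.05, 0.15, 0.60, 0.10, 0.10]])
query = [0.1, 0.1, 0.6, 0.1, 0.1]
print(rank_by_chart_shape(query, song_ratios))  # prints [2 0 1]
```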
2.1.2 Plane-mapped lyrics

The lyrics of musical pieces are mapped onto a two-dimensional plane in which musical pieces with almost the same topic ratios are placed close to each other. Each musical piece is represented by a colored dot whose RGB components correspond to the five topic values compressed to three axes. The space can be scaled so that either the local or the global structure around each musical piece can be observed. The distribution of lyrics about a specific topic can be recognized from the color of the dots. The dimensionality reduction used for both the mapping and the coloring is t-SNE [9]. When a user mouses over a point in the space, it is colored pink, and meta-information (the title, the artist, and the topic radar chart) appears simultaneously. By repeated mouseovers, the lyrics and the names of their artist and songwriter are updated continuously. Using this approach, other lyrics with topics similar to the input lyrics can be discovered. The lyrics map can be moved and zoomed by dragging the mouse or using specific keyboard operations. Furthermore, it is possible to visualize a lyrics map specialized to an artist or songwriter, which are associated with lyrics as metadata. When an artist name is chosen, as shown on the right side of Figure 3, the points of that artist's lyrics turn yellow; similarly, when a songwriter is chosen, the points of that songwriter's lyrics change to orange. While this is somewhat equivalent to lyrics retrieval using the artist or songwriter as a query, our innovation is that a user can intuitively grasp how artists and songwriters are distributed based on the ratios of the given topics. Although music retrieval by artist is very popular in conventional systems, retrieval by songwriter has not received much attention. However, for lyrics retrieval, searching by songwriter makes it easier to discover songs with one's favorite lyrics, because each songwriter has his or her own lyrics vocabulary. Moreover, our system can perform a topic analysis for a specific artist. Intuitively, similar artists are located and colored close to each other depending on their topic ratios. An artist is colored based on a topic ratio in the same way as the lyrics. In Figure 4, the size of a circle is proportional to the number of musical pieces each artist has. In this way, other artists similar to one's favorite artist can easily be discovered.

Figure 3. An example display of lyrics by a selected artist.

Figure 4. Mapping of 487 English artists.

2.2 Lyrics retrieval using topic of lyrics

In LyricsRadar, in addition to the ability to traverse and explore a map to find lyrics, we also propose a way to directly enter a topic ratio as an intuitive expression of one's latent feeling. More specifically, we consider the topic radar chart as an input interface and provide a means by which a user can specify the ratios of the five topics directly to search for lyrics very close to one's latent image. This interface can satisfy a search in which a user would like to find lyrics that contain more of a given topic, using the representative words of each topic as a guide. Figure 5 shows an example in which one of the five topics is increased by a mouse drag; the balance of the five topic ratios then changes because the sum of the five components is equal to 1.0. A user can repeat these processes, updating topic ratios or navigating the point in the space interactively, until finding interesting lyrics. As with the above subsections, we have substantiated our claims for a more intuitive and exploratory lyrics retrieval system.

¹ If the number of topics were increased, more subdivided and exacting semantic content could be represented; however, the operation would become more complicated for the user.

Figure 5. An example of the direct manipulation of the topic ratio on the topic radar chart. Each topic ratio can be increased by dragging the mouse.

3. IMPLEMENTATION OF LYRICSRADAR

Figure 6. Graphical representation of the latent Dirichlet allocation (LDA).

LyricsRadar uses LDA [4] for the topic analysis of lyrics. LDA is a typical topic modeling method based on machine learning. Since LDA assigns each word that constitutes the lyrics to a topic independently, the lyrics include a variety of topics according to the variation of the words they contain. In our system, K typical topics that underlie many lyrics in the database are estimated, and the ratio of each topic is calculated for each song's lyrics with unsupervised learning. As a result, the appearance probability of each word in every topic can be calculated, and the typical representative words for each topic can be determined at the same time.
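Before the formal model below, the following sketch illustrates how per-song topic ratios (such as those LDA produces) could be turned into the two visualizations of Section 2.1.2 — a t-SNE plane and an RGB coloring — using scikit-learn. It is a minimal sketch on synthetic data, not the authors' implementation; the parameter choices and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
topic_ratios = rng.dirichlet(np.ones(5), size=200)  # 200 songs x 5 topic ratios

# 2-D coordinates for the lyrics map: nearby points have similar topic ratios.
xy = TSNE(n_components=2, init="random", perplexity=30,
          random_state=0).fit_transform(topic_ratios)

# 3-D compression for coloring: rescale each axis to [0, 1] and use as RGB.
xyz = TSNE(n_components=3, method="exact", init="random",
           random_state=0).fit_transform(topic_ratios)
rgb = (xyz - xyz.min(axis=0)) / (xyz.max(axis=0) - xyz.min(axis=0))

print(xy.shape, rgb.shape)  # (200, 2) (200, 3)
```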
3.1 LDA for lyrics

The observed data that we consider for LDA are D independent lyrics X = {X_1, ..., X_D}. The lyrics X_d consist of the word series X_d = {x_{d,1}, ..., x_{d,N_d}} of length N_d. The size of the whole vocabulary that appears in the lyrics is V, and x_{d,n} is a V-dimensional "1-of-K" vector (a vector with one element equal to 1 and all other elements equal to 0). The latent variable (i.e., the topic series) of the observed lyrics X_d is Z_d = {z_{d,1}, ..., z_{d,N_d}}. The number of topics is K, so z_{d,n} is a K-dimensional 1-of-K vector. Hereafter, all latent variables of the D lyrics are denoted Z = {Z_1, ..., Z_D}. Figure 6 shows a graphical representation of the LDA model used in this paper. The full joint distribution is given by

p(X, Z, \pi, \phi) = p(X \mid Z, \phi)\, p(Z \mid \pi)\, p(\pi)\, p(\phi)   (1)

where π indicates the mixing weights of the multiple topics of the lyrics (D K-dimensional vectors) and φ indicates the unigram probability of each topic (K V-dimensional vectors). The first two terms are likelihood functions, whereas the other two terms are prior distributions. The likelihood functions themselves are defined as

p(X \mid Z, \phi) = \prod_{d=1}^{D} \prod_{n=1}^{N_d} \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\,z_{d,n,k}\, x_{d,n,v}}   (2)

p(Z \mid \pi) = \prod_{d=1}^{D} \prod_{n=1}^{N_d} \prod_{k=1}^{K} \pi_{d,k}^{\,z_{d,n,k}}   (3)

We then introduce conjugate priors as

p(\pi) = \prod_{d=1}^{D} \mathrm{Dir}(\pi_d \mid \alpha^{(0)}) = \prod_{d=1}^{D} C(\alpha^{(0)}) \prod_{k=1}^{K} \pi_{d,k}^{\,\alpha_k^{(0)} - 1}   (4)

p(\phi) = \prod_{k=1}^{K} \mathrm{Dir}(\phi_k \mid \beta^{(0)}) = \prod_{k=1}^{K} C(\beta^{(0)}) \prod_{v=1}^{V} \phi_{k,v}^{\,\beta_v^{(0)} - 1}   (5)

where p(π) and p(φ) are products of Dirichlet distributions, α^{(0)} and β^{(0)} are hyperparameters, and C(α^{(0)}) and C(β^{(0)}) are normalization factors calculated as

C(x) = \frac{\Gamma(\hat{x})}{\Gamma(x_1) \cdots \Gamma(x_I)}, \qquad \hat{x} = \sum_{i=1}^{I} x_i   (6)

Also note that π is the topic mixture ratio of the lyrics, which is used for the topic radar chart after normalization. The appearance probability φ of the vocabulary in each topic was used to select the highly representative words that are strongly correlated with each topic of the topic radar chart.

3.2 Training of LDA

The lyrics database contains 6902 Japanese popular songs (J-POP) and 5351 English popular songs. Each of these songs includes more than 100 words. The J-POP songs are selected from our own database, and the English songs are from the Music Lyrics Database v.1.2.7². The J-POP database has 1847 artists and 2285 songwriters, and the English database has 398 artists. For the topic analysis per artist, 2484 J-POP artists and 487 English artists, all of whose songs include at least 100 words, are selected. The 26229 words in J-POP and 35634 words in English that appear more than ten times across all lyrics are used for the value V, the size of the vocabulary of the lyrics. For the J-POP lyrics, MeCab [17] was used for morphological analysis; the noun, verb, and adjective components were extracted, and the original and inflected forms were counted as one word. For the English lyrics, we use the Full-Text Stopwords list in MySQL³ to remove commonly used words. However, words that appear often across many lyrics are inconvenient for topic analysis; to lower the importance of such words, they were weighted by inverse document frequency (idf). In training the LDA, the number of topics K is set to 5, and all initial values of the hyperparameters α^{(0)} and β^{(0)} were set to 1.

Figure 7. Results of our evaluation experiment to evaluate topic analysis; the score of (1) was the closest to 1.0, showing our approach to be effective.
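As a rough, non-authoritative sketch of this training setup — K = 5 topics, symmetric priors of 1, and a vocabulary restricted to frequent terms — one could use scikit-learn as below. The toy lyrics, the min_df cut, and the omission of MeCab tokenization and idf weighting are simplifications for illustration; this is not the authors' code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One whitespace-tokenized string per song (e.g., MeCab output for J-POP,
# or stopword-filtered English text); toy data for illustration only.
lyrics = [
    "love heart tears heart world",
    "government law election law vote",
    "love tears heart night",
    "law election government policy",
]

# Keep words occurring in at least two documents, loosely mirroring the
# paper's "appears more than ten times" vocabulary cut.
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(lyrics)

lda = LatentDirichletAllocation(
    n_components=5,        # K = 5 topics, as in the topic radar chart
    doc_topic_prior=1.0,   # alpha^(0) = 1
    topic_word_prior=1.0,  # beta^(0) = 1
    random_state=0,
)
theta = lda.fit_transform(X)                       # per-song topic weights
theta = theta / theta.sum(axis=1, keepdims=True)   # normalized topic ratios

# Representative words per topic, analogous to the pentagon's vertex labels.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```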
4. EVALUATION EXPERIMENTS

To verify the validity of the topic analysis results (as related to the topic radar chart and the mapping of lyrics) in LyricsRadar, we conducted a subjective evaluation experiment. There were 17 subjects (all Japanese speakers) aged from 21 to 32. We used the results of LDA for the lyrics of the 6902 J-POP songs described in Section 3.2.

4.1 Evaluation of topic analysis

Our evaluation here attempted to verify that the topic ratio determined by the topic analysis of LDA could appropriately represent the latent meaning of lyrics. Furthermore, when the lyrics of a song are selected, their location relative to other lyrics by the same artist or songwriter in the space is investigated.

4.1.1 Experimental method

In our experiment, the lyrics of a song are selected at random in the space as basis lyrics, and target lyrics of four songs are selected for comparison according to the following conditions:

(1) The lyrics closest to the basis lyrics on the lyrics map
(2) The lyrics closest to the basis lyrics with the same songwriter
(3) The lyrics closest to the basis lyrics with the same artist
(4) The lyrics selected at random

Each subject evaluated the similarity of the impression received from the two lyrics using a five-step scale (1: closest, 2: somehow close, 3: neutral, 4: somehow far, and 5: most far), comparing the basis lyrics and one of the target lyrics after seeing the basis lyrics. The presentation order was random for each subject. Furthermore, each subject described the reason for their evaluation score.

4.1.2 Experimental results

The average scores of the five-step evaluation for the four target lyrics over all subjects are shown in Figure 7. As expected, the lyrics closest to the basis lyrics on the lyrics map were evaluated as the closest in terms of the impression of the basis lyrics, because the score of (1) was closest to 1.0. The results for target lyrics (2) and (3) were both close to 3.0: the lyrics closest to the basis lyrics by the same songwriter or artist were mostly judged as "3: neutral." Finally, the lyrics selected at random (4) were appropriately judged to be far. Among the subjects' comments about the reasons for their decisions, we obtained responses such as a sense of the season, positive versus negative, love, relationships, color, light versus dark, subjective versus objective, and tension. Responses differed greatly from one subject to the next. For example, some judged the impression only by the similarity of the seasonal feeling of the lyrics. Trial usage of LyricsRadar has shown that it is a useful tool for users.

4.2 Evaluation of the number of topics

The perplexity used for the quality assessment of a language model was computed for each number of topics. The more complicated the model is, the higher the perplexity becomes; therefore, we can estimate that the performance of the language model is good when the value of the perplexity is low. We calculated perplexity as

\mathrm{perplexity}(X) = \exp\!\left( - \frac{\sum_{d=1}^{D} \log p(X_d)}{\sum_{d=1}^{D} N_d} \right)   (7)

In the case where the number of topics K is five, the perplexity is 1150, which is still high.

² "Music Lyrics Database v.1.2.7," http://www.odditysoftware.com/page-datasales1.htm.
³ "Full-Text Stopwords in MySQL," http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html.

[2] C. Laurier et al.: "Multimodal Music Mood Classification Using Audio and Lyrics," Proceedings of ICMLA 2008, pp. 688–693, 2008.

[3] C.
McKay et al.: “Evaluating the genre classification performance of lyrical features relative to audio, symbolic and cultural features,” Proceedings of ISMIR 2008, pp. 213–218, 2008. [4] D. M. Blei et al.: “Latent Dirichlet Allocation,” Journal of Machine Learning Research Vol.3, pp. 993–1022, 2003. [5] E. Brochu and N. de Freitas: ““Name That Song!”: A Probabilistic Approach to Querying on Music and Text,” Proceedings of NIPS 2003, pp. 1505–1512, 2003. [6] F. Kleedorfer et al.: “Oh Oh Oh Whoah! Towards Automatic Topic Detection In Song Lyrics,” Proceedings of ISMIR 2008, pp. 287–292, 2008. Figure 8. Perplexity for the number of topics. On the other hand, because Miller showed that the number of objects human can hold in his working memory is 7 ± 2 [7], the number of topics should be 1 to 5 in order to obtain information naturally. So we decided to show five topics in the topic radar chart. Figure 8 shows calculation results of perplexity for each topic number. Blue points represent perplexity for LDA applied to lyrics and red points represent perplexity for LDA applied to each artist. Orange bar indicates the range of human capacity for processing information. Since there exists a tradeoff between the number of topics and operability, we found that five is appropriate number of topics. 5. CONCLUSIONS In this paper, we propose LyricsRadar, an interface to assist a user to come across favorite lyrics interactively. Conventionally lyrics were retrieved by titles, artist names, or keywords. Our main contribution is to visualize lyrics in the latent meaning level based on a topic model by LDA. By seeing the pentagon-style shape of Topic Radar Chart, a user can intuitively recognize the meaning of given lyrics. The user can also directly manipulate this shape to discover target lyrics even when the user does not know any keyword or any query. Also the topic ratio of focused lyrics can be mapped to a point in the two dimensional space which visualizes the relative location to all the lyrics in our lyrics database and enables the user to navigate similar lyrics by controlling the point directly. For future work, user adaptation is inevitable task because every user has an individual preference, as well as improvements to topic analysis by using hierarchical topic analysis [12]. Furthermore, to realize the retrieval interface corresponding to a minor topic of lyrics, a future challenge is to consider the visualization method that can reflect more numbers of topics by keeping an easy-to-use interactivity. Acknowledgment: This research was supported in part by OngaCREST, CREST, JST. 6. REFERENCES [1] B. Logan et al.: “Semantic Analysis of Song Lyrics,” Proceedings of IEEE ICME 2004 Vol.2, pp. 827–830, 2004. [7] G. A. Miller: “The magical number seven, plus or minus two: Some limits on our capacity for processing information,” Journal of the Psychological Review Vol.63(2), pp. 81– 97, 1956. [8] H. Fujihara et al.: “LyricSynchronizer: Automatic Synchronization System between Musical Audio Signals and Lyrics,” Journal of IEEE Selected Topics in Signal Processing, Vol.5, No.6, pp. 1252–1261, 2011. [9] L. Maaten and G. E. Hinton: “Visualizing High-Dimensional Data Using t-SNE,” Journal of Machine Learning Research, Vol.9, pp. 2579–2605, 2008. [10] M. Müller et al.: “Lyrics-based Audio Retrieval and Multimodal Navigation in Music Collections,” Proceedings of ECDL 2007, pp. 112–123, 2007. [11] M. V. Zaanen and P. 
Kanters: “Automatic Mood Classification Using TF*IDF Based on Lyrics,” Proceedings of ISMIR 2010, pp. 75–80, 2010. [12] R. Adams et al.: “Tree-Structured Stick Breaking Processes for Hierarchical Data,” Proceedings of NIPS 2010, pp. 19– 27, 2010. [13] R. Macrae and S. Dixon: “Ranking Lyrics for Online Search,” Proceedings of ISMIR 2012, pp. 361–366, 2012. [14] R. Takahashi et al.: “Building and combining document and music spaces for music query-by-webpage system,” Proceedings of Interspeech 2008, pp. 2020–2023, 2008. [15] R. Neumayer and A. Rauber: “Multi-modal Music Information Retrieval: Visualisation and Evaluation of Clusterings by Both Audio and Lyrics,” Proceedings of RAO 2007, pp. 70– 89, 2007. [16] S. Funasawa et al.: “Automated Music Slideshow Generation Using Web Images Based on Lyrics,” Proceedings of ISMIR 2010, pp. 63–68, 2010. [17] T. Kudo: “MeCab: Yet Another Part-of-Speech and Morphological Analyzer,” http://mecab.googlecode.com/ svn/trunk/mecab/doc/index.html. [18] T. Nakano and M. Goto: “VocaRefiner: An Interactive Singing Recording System with Integration of Multiple Singing Recordings,” Proceedings of SMC 2013, pp. 115– 122, 2013. [19] Y. Hu et al.: “Lyric-based Song Emotion Detection with Affective Lexicon and Fuzzy Clustering Method,” Proceedings of ISMIR 2009, pp. 122–128, 2009. 590 15th International Society for Music Information Retrieval Conference (ISMIR 2014) JAMS: A JSON ANNOTATED MUSIC SPECIFICATION FOR REPRODUCIBLE MIR RESEARCH Eric J. Humphrey1,* , Justin Salamon1,2 , Oriol Nieto1 , Jon Forsyth1 , Rachel M. Bittner1 , and Juan P. Bello1 1 2 Music and Audio Research Lab, New York University, New York Center for Urban Science and Progress, New York University, New York ABSTRACT Meanwhile, the interests and requirements of the community are continually evolving, thus testing the practical limitations of lab-files. By our count, there are three unfolding research trends that are demanding more of a given annotation format: The continued growth of MIR is motivating more complex annotation data, consisting of richer information, multiple annotations for a given task, and multiple tasks for a given music signal. In this work, we propose JAMS, a JSON-based music annotation format capable of addressing the evolving research requirements of the community, based on the three core principles of simplicity, structure and sustainability. It is designed to support existing data while encouraging the transition to more consistent, comprehensive, well-documented annotations that are poised to be at the crux of future MIR research. Finally, we provide a formal schema, software tools, and popular datasets in the proposed format to lower barriers to entry, and discuss how now is a crucial time to make a concerted effort toward sustainable annotation standards. • Comprehensive annotation data: Rich annotations, like the Billboard dataset [2], require new, contentspecific conventions, increasing the complexity of the software necessary to decode it and the burden on the researcher to use it; such annotations can be so complex, in fact, it becomes necessary to document how to understand and parse the format [5]. • Multiple annotations for a given task: The experience of music can be highly subjective, at which point the notion of “ground truth” becomes tenuous. Recent work in automatic chord estimation [8] shows that multiple reference annotations should be embraced, as they can provide important insight into system evaluation, as well as into the task itself. 1. 
INTRODUCTION Music annotations —the collection of observations made by one or more agents about an acoustic music signal— are an integral component of content-based Music Information Retrieval (MIR) methodology, and are necessary for designing, evaluating, and comparing computational systems. For clarity, we define the scope of an annotation as corresponding to time scales at or below the level of a complete song, such as semantic descriptors (tags) or time-aligned chords labels. Traditionally, the community has relied on plain text and custom conventions to serialize this data to a file for the purposes of storage and dissemination, collectively referred to as “lab-files”. Despite a lack of formal standards, lab-files have been, and continue to be, the preferred file format for a variety of MIR tasks, such as beat or onset estimation, chord estimation, or segmentation. ∗ Please • Multiple concepts for a given signal: Although systems are classically developed to accomplish a single task, there is ongoing discussion toward integrating information across various musical concepts [12]. This has already yielded measurable benefits for the joint estimation of chords and downbeats [9] or chords and segments [6], where leveraging multiple information sources for the same input signal can lead to improved performance. It has long been acknowledged that lab-files cannot be used to these ends, and various formats and technologies have been previously proposed to alleviate these issues, such as RDF [3], HDF5 [1], or XML [7]. However, none of these formats have been widely embraced by the community. We contend that the weak adoption of any alternative format is due to the combination of several factors. For example, new tools can be difficult, if not impossible, to integrate into a research workflow because of compatibility issues with a preferred development platform or programming environment. Additionally, it is a common criticism that the syntax or data model of these alternative formats is non-obvious, verbose, or otherwise confusing. This is especially problematic when researchers must handle for- direct correspondence to [email protected] c Eric J. Humphrey, Justin Salamon, Oriol Nieto, Jon Forsyth, Rachel M. Bittner, Juan P. Bello. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Eric J. Humphrey, Justin Salamon, Oriol Nieto, Jon Forsyth, Rachel M. Bittner, Juan P. Bello. “JAMS: A JSON Annotated Music Specification for Reproducible MIR Research”, 15th International Society for Music Information Retrieval Conference, 2014. 591 15th International Society for Music Information Retrieval Conference (ISMIR 2014) mat conversions. Taken together, the apparent benefits to conversion are outweighed by the tangible costs. In this paper, we propose a JSON Annotated Music Specification (JAMS) to meet the changing needs of the MIR community, based on three core design tenets: simplicity, structure, and sustainability. This is achieved by combining the advantages of lab-files with lessons learned from previously proposed formats. The resulting JAMS files are human-readable, easy to drop into existing workflows, and provide solutions to the research trends outlined previously. We further address classical barriers to adoption by providing tools for easy use with Python and MATLAB, and by offering an array of popular datasets as JAMS files online. 
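For readers unfamiliar with the lab-files discussed above, the sketch below shows the flavor of the format and a generic parser. The three-column chord layout and the file contents are illustrative assumptions (lab-file conventions vary by task and dataset), and the parser is a hedged sketch rather than any particular tool's implementation.

```python
# A typical chord lab-file is plain text with one observation per line,
# e.g. (illustrative content only):
#
#   0.000000   2.612267   N
#   2.612267  11.459070   B:maj
#   11.459070 12.921927   E:maj
#
# i.e., start time, end time, and a label separated by whitespace.

def parse_lab(path):
    """Parse a whitespace-delimited lab-file into (start, end, label) tuples,
    skipping blank lines and '#' comment lines."""
    observations = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            start, end, label = line.split(None, 2)
            observations.append((float(start), float(end), label))
    return observations
```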
The remainder of this paper is organized as follows: Section 2 identifies three valuable components of an annotation format by considering prior technologies; Section 3 formally introduces JAMS, detailing how it meets these design criteria and describing the proposed specification by example; Section 4 addresses practical issues and concerns in an informal FAQ-style, touching on usage tools, provided datasets, and some practical shortcomings; and lastly, we close with a discussion of next steps and perspectives for the future in Section 5. where X, Y, and Z correspond to “artist”, “album”, and “title”, respectively 1 ; parsing rules, such as “lines beginning with ‘#’ are to be ignored as comments”; auxiliary websites or articles, decoupled from the annotations themselves, to provide critical information such as syntax, conventions, or methodology. Alternative representations are able to manage more complex data via standardized markup and named entities, such as fields in the case of RDF or JSON, or IDs, attributes and tags for XML. 2. CORE DESIGN PRINCIPLES 3. INTRODUCING JAMS In order to craft an annotation format that might serve the community into the foreseeable future, it is worthwhile to consolidate the lessons learned from both the relative success of lab-files and the challenges faced by alternative formats into a set of principles that might guide our design. With this in mind, we offer that usability, and thus the likelihood of adoption, is a function of three criteria: So far, we have identified several goals for a music annotation format: a data structure that matches the document model; a lightweight markup syntax; support for multiple annotations, multiple tasks, and rich metadata; easy workflow integration; cross-language compliance; and the use of pre-existing technologies for stability. To find our answer, we need only to look to the web development community, who have already identified a technology that meets these requirements. JavaScript Object Notation (JSON) 2 has emerged as the serialization format of the Internet, now finding native support in almost every modern programming language. Notably, it was designed to be maximally efficient and human readable, and is capable of representing complex data structures with little overhead. JSON is, however, only a syntax, and it is necessary to define formal standards outlining how it should be used for a given purpose. To this end, we define a specification on top of JSON (JAMS), tailored to the needs of MIR researchers. 2.1 Simplicity The value of simplicity is demonstrated by lab-files in two specific ways. First, the contents are represented in a format that is intuitive, such that the document model clearly matches the data structure and is human-readable, i.e. uses a lightweight syntax. This is a particular criticism of RDF and XML, which can be verbose compared to plain text. Second, lab-files are conceptually easy to incorporate into research workflows. The choice of an alternative file format can be a significant hurdle if it is not widely supported, as is the case with RDF, or the data model of the document does not match the data model of the programming language, as with XML. 2.3 Sustainability Recently in MIR, a more concerted effort has been made toward sustainable research methods, which we see positively impacting annotations in two ways. 
First, there is considerable value to encoding methodology and metadata directly in an annotation, as doing so makes it easier to both support and maintain the annotation while also enabling direct analyses of this additional information. Additionally, it is unnecessary for the MIR community to develop every tool and utility ourselves; we should instead leverage well-supported technologies from larger communities when possible.

2.2 Structure

It is important to recognize that lab-files developed as a way to serialize tabular data (i.e. arrays) in a language-independent manner. Though lab-files excel at this particular use case, they lack the structure required to encode complex data such as hierarchies or mix different data types, such as scalars, strings, multidimensional arrays, etc. This is a known limitation, and the community has devised a variety of ad hoc strategies to cope with it: folder trees and naming conventions, such as "{X}/{Y}/{Z}.lab",

1 http://www.isophonics.net/content/reference-annotations
2 http://www.json.org/

3.1 A Walk-through Example

Perhaps the clearest way to introduce the JAMS specification is by example. Figure 1 provides the contents of a hypothetical JAMS file, consisting of nearly valid³ JSON syntax and color-coded by concept. JSON syntax will be familiar to those with a background in C-style languages, as it uses square brackets ("[ ]") to denote arrays (alternatively, lists or vectors), and curly brackets ("{ }") to denote objects (alternatively, dictionaries, structs, or hash maps). Defining some further conventions for the purpose of illustration, we use single quotes to indicate field names, italics when referring to concepts, and consistent colors for the same data structures. Using this diagram, we will now step through the hierarchy, referring back to relevant components as concepts are introduced.

3 The sole exception is the use of ellipses ("...") as continuation characters, indicating that more information could be included.

Figure 1. Diagram illustrating the structure of the JAMS specification. (The figure lays out the hypothetical JAMS file as color-coded, nearly valid JSON, with components labeled A–M: task arrays for 'tag', 'beat', 'chord', and 'melody', their annotations, data, and annotation_metadata — including curator and annotator — plus the top-level file_metadata, identifiers, and sandbox objects for "With a Little Help from My Friends" by The Beatles.)

3.1.1 The JAMS Object

A JAMS file consists of one top-level object, indicated by the outermost bounding box. This is the primary container for all information corresponding to a music signal, consisting of several task-array pairs, an object for file metadata, and an object for sandbox. A task-array is a list of annotations corresponding to a given task name, and may contain zero, one, or many annotations for that task. The format of each array is specific to the kind of annotations it will contain; we will address this in more detail in Section 3.1.2. The file metadata object (K) is a dictionary containing basic information about the music signal, or file, that was annotated. In addition to the fields given in the diagram, we also include an unconstrained identifiers object (L), for storing unique identifiers in various namespaces, such as the EchoNest or YouTube. Note that we purposely do not store information about the recording's audio encoding, as a JAMS file is format-agnostic. In other words, we assume that any sample rate or perceptual codec conversions will have no effect on the annotation, within a practical tolerance. Lastly, the JAMS object also contains a sandbox, an unconstrained object to be used as needed. In this way, the specification carves out such space for any unforeseen or otherwise relevant data; however, as the name implies, no guarantee is made as to the existence or consistency of this information. We do this in the hope that the specification will not be unnecessarily restrictive, and that commonly "sandboxed" information might become part of the specification in the future.

3.1.2 Annotations

An annotation (B) consists of all the information that is provided by a single annotator about a single task for a single music signal. Independent of the task, an annotation comprises three sub-components: an array of objects for data (C), an annotation metadata object (E), and an annotation-level sandbox. For clarity, a task-array (A) may contain multiple annotations (B). Importantly, a data array contains the primary annotation information, such as its chord sequence, beat locations, etc., and is the information that would normally be stored in a lab-file. Though all data containers are functionally equivalent, each may consist of only one object type, specific to the given task. Considering the different types of musical attributes annotated for MIR research, we divide them into four fundamental categories:

1. Attributes that exist as a single observation for the entire music signal, e.g. tags.
2. Attributes that consist of sparse events occurring at specific times, e.g. beats or onsets.
3. Attributes that span a certain time range, such as chords or sections.
4. Attributes that comprise a dense time series, such as discrete-time fundamental frequency values for melody extraction.

Table 1. Currently supported tasks and types in JAMS.
  observation: tag, genre, mood
  event:       beat, onset
  range:       chord, segment, key, note, source
  time series: melody, pitch, pattern
These four types form the most atomic data structures, and will be revisited in greater detail in Section 3.1.3. The important takeaway here, however, is that data arrays are not allowed to mix fundamental types. Following [10], an annotation metadata object is defined to encode information about what has been annotated, who created the annotations, with what tools, etc. Specifically, corpus provides the name of the dataset to which the annotation belongs; version tracks the version of this particular annotation; annotation rules describes the protocol followed during the annotation process; annotation tools describes the tools used to create the annotation; validation specifies to what extent the annotation was verified and is reliable; data source details how the annotation was obtained, such as manual annotations, online aggregation, game with a purpose, etc.; curator (F) is itself an object with two subfields, name and email, for the contact person responsible for the annotation; and annotator (G) is another unconstrained object, which is intended to capture information about the source of the annotation. While complete metadata are strongly encouraged in practice, currently only version and curator are mandatory in the specification. diagram, the value of time is a scalar quantity (0.237), whereas the value of label is a string (‘1’), indicating metrical position. A range (I) is useful for representing musical attributes that span an interval of time, such as chords or song segments (e.g. intro, verse, chorus). It is an object that consists of three observations: start, end, and label. The time series (J) atomic type is useful for representing musical attributes that are continuous in nature, such as fundamental frequency over time. It is an object composed of four elements: value, time, confidence and label. The first three fields are arrays of numerical values, while label is an observation. 3.2 The JAMS Schema 3.1.3 Datatypes Having progressed through the JAMS hierarchy, we now introduce the four atomic data structures, out of which an annotation can be constructed: observation, event, range and time series. For clarity, the data array (A) of a tag annotation is a list of observation objects; the data array of a beat annotation is a list of event objects; the data array of a chord annotation is a list of range objects; and the data array of a melody annotation is a list of time series objects. The current space of supported tasks is provided in Table 1. Of the four types, an observation (D) is the most atomic, and used to construct the other three. It is an object that has one primary field, value, and two optional fields, confidence and secondary value. The value and secondary value fields may take any simple primitive, such as a string, numerical value, or boolean, whereas the confidence field stores a numerical confidence estimate for the observation. A secondary value field is provided for flexibility in the event that an observation requires an additional level of specificity, as is the case in hierarchical segmentation [11]. An event (H) is useful for representing musical attributes that occur at sparse moments in time, such as beats or onsets. It is a container that holds two observations, time and label. Referring to the first beat annotation in the 594 The description in the previous sections provides a highlevel understanding of the proposed specification, but the only way to describe it without ambiguity is through formal representation. 
To accomplish this, we provide a JSON schema 4 , a specification itself written in JSON that uses a set of reserved keywords to define valid data structures. In addition to the expected contents of the JSON file, the schema can specify which fields are required, which are optional, and the type of each field (e.g. numeric, string, boolean, array or object). A JSON schema is concise, precise, and human readable. Having defined a proper JSON schema, an added benefit of JAMS is that a validator can verify whether or not a piece of JSON complies with a given schema. In this way, researchers working with JAMS files can easily and confidently test the integrity of a dataset. There are a number of JSON schema validator implementations freely available online in a variety of languages, including Python, Java, C, JavaScript, Perl, and more. The JAMS schema is included in the public software repository (cf. Section 4), which also provides a static URL to facilitate directly accessing the schema from the web within a workflow. 4. JAMS IN PRACTICE While we contend that the use and continued development of JAMS holds great potential for the many reasons outlined previously, we acknowledge that specifications and standards are myriad, and it can be difficult to ascertain the benefits or shortcomings of one’s options. In the interest of encouraging adoption and the larger discussion of 4 http://json-schema.org/ 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 4.4 What datasets are already JAMS-compliant? standards in the field, we would like to address practical concerns directly. To further lower the barrier to entry and simplify the process of integrating JAMS into a pre-existing workflow, we have collected some of the more popular datasets in the community and converted them to the JAMS format, linked via the public repository. The following is a partial list of converted datasets: Isophonics (beat, chord, key, segment); Billboard (chord); SALAMI (segment, pattern); RockCorpus (chord, key); tmc323 (chords); Cal500 (tag); Cal10k (tag); ADC04 (melody); and MIREX05 (melody). 4.1 How is this any different than X? The biggest advantage of JAMS is found in its capacity to consistently represent rich information with no additional effort from the parser and minimal markup overhead. Compared to XML or RDF, JSON parsers are extremely fast, which has contributed in no small part to its widespread adoption. These efficiency gains are coupled with the fact that JAMS makes it easier to manage large data collections by keeping all annotations for a given song in the same place. 4.5 Okay, but my data is in a different format – now what? We realize that it is impractical to convert every dataset to JAMS, and provide a collection of Python scripts that can be used to convert lab-files to JAMS. In lieu of direct interfaces, alternative formats can first be converted to labfiles and translated to JAMS thusly. 4.2 What kinds of things can I do with JAMS that I can’t already do with Y? JAMS can enable much richer evaluation by including multiple, possibly conflicting, reference annotations and directly embedding information about an annotation’s origin. A perfect example of this is found in the Rock Corpus Dataset [4], consisting of annotations by two expert musicians: one, a guitarist, and the other, a pianist. 
Sources of disagreement in the transcriptions often stem from differences of opinion resulting from familiarity with their principal instrument, where the voicing of a chord that makes sense on piano is impossible for a guitarist, and vice versa. Similarly, it is also easier to develop versatile MIR systems that combine information across tasks, as that information is naturally kept together. Another notable benefit of JAMS is that it can serve as a data representation for algorithm outputs for a variety of tasks. For example, JAMS could simplify MIREX submissions by keeping all machine predictions for a given team together as a single submission, streamlining evaluations, where the annotation sandbox and annotator metadata can be used to keep track of algorithm parameterizations. This enables the comparison of many references against many algorithmic outputs, potentially leading to a deeper insight into a system’s performance. 4.6 My MIR task doesn’t really fit with JAMS. 4.3 So how would this interface with my workflow? Thanks to the widespread adoption of JSON, the vast majority of languages already offer native JSON support. In most cases, this means it is possible to go from a JSON file to a programmatic data structure in your language of choice in a single line of code using tools you didn’t have to write. To make this experience even simpler, we additionally provide two software libraries, for Python and MATLAB. In both instances, a lightweight software wrapper is provided to enable a seamless experience with JAMS, allowing IDEs and interpreters to make use of autocomplete and syntax checking. Notably, this allows us to provide convenience functionality for creating, populating, and saving JAMS objects, for which examples and sample code are provided with the software library 5 . 5 https://github.com/urinieto/jams 595 That’s not a question, but it is a valid point and one worth discussing. While this first iteration of JAMS was designed to be maximally useful across a variety of tasks, there are two broad reasons why JAMS might not work for a given dataset or task. One, a JAMS annotation only considers information at the temporal granularity of a single audio file and smaller, independently of all other audio files in the world. Therefore, extrinsic relationships, such as cover songs or music similarity, won’t directly map to the specification because the concept is out of scope. The other, more interesting, scenario is that a given use case requires functionality we didn’t plan for and, as a result, JAMS doesn’t yet support. To be perfectly clear, the proposed specification is exactly that –a proposal– and one under active development. Born out of an internal need, this initial release focuses on tasks with which the authors are familiar, and we realize the difficulty in solving a global problem in a single iteration. As will be discussed in greater detail in the final section, the next phase on our roadmap is to solicit feedback and input from the community at large to assess and improve upon the specification. If you run into an issue, we would love to hear about your experience. 4.7 This sounds promising, but nothing’s perfect. There must be shortcomings. Indeed, there are two practical limits that should be mentioned. Firstly, JAMS is not designed for features or signal level statistics. That said, JSON is still a fantastic, crosslanguage syntax for serializing data, and may further serve a given workflow. 
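To make the workflow point of Section 4.3 concrete: because JAMS is plain JSON, any language's standard JSON support suffices to read it, and a JSON-schema validator can check integrity. The sketch below uses Python's json module and the third-party jsonschema package with a deliberately simplified, hypothetical schema fragment; it does not reproduce the official JAMS schema or the project's own library API.

```python
import json
import jsonschema

# A drastically simplified stand-in for the real JAMS schema, illustration only.
MINI_SCHEMA = {
    "type": "object",
    "required": ["file_metadata"],
    "properties": {
        "file_metadata": {
            "type": "object",
            "required": ["artist", "title", "duration"],
            "properties": {"duration": {"type": "number"}},
        },
        "beat": {"type": "array"},
    },
}

jam = {
    "file_metadata": {"artist": "The Beatles",
                      "title": "With a Little Help from My Friends",
                      "duration": 159.11},
    "beat": [{"data": [{"time": {"value": 0.237}, "label": {"value": "1"}}],
              "annotation_metadata": {"version": "0.0.1"},
              "sandbox": {}}],
    "sandbox": {},
}

jsonschema.validate(instance=jam, schema=MINI_SCHEMA)   # raises if invalid
text = json.dumps(jam, indent=2)                        # serialize for storage
assert json.loads(text)["beat"][0]["data"][0]["time"]["value"] == 0.237
```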
As for practical concerns, it is a known issue that parsing large JSON objects can be slow in MATLAB. We’ve worked to make this no worse than reading current lab-files, but speed and efficiency are not touted benefits of MATLAB. This may become a bigger issue as JAMS files become more complete over time, but we are 15th International Society for Music Information Retrieval Conference (ISMIR 2014) actively exploring various engineering solutions to address this concern. [3] Chris Cannam, Christian Landone, Mark B Sandler, and Juan Pablo Bello. The sonic visualiser: A visualisation platform for semantic descriptors from musical signals. In Proc. of the 7th International Society for Music Information Retrieval Conference, pages 324– 327, 2006. 5. DISCUSSION AND FUTURE PERSPECTIVES In this paper, we have proposed a JSON format for music annotations to address the evolving needs of the MIR community by keeping multiple annotations for multiple tasks alongside rich metadata in the same file. We do so in the hopes that the community can begin to easily leverage this depth of information, and take advantage of ubiquitous serialization technology (JSON) in a consistent manner across MIR. The format is designed to be intuitive and easy to integrate into existing workflows, and we provide software libraries and pre-converted datasets to lower barriers to entry. Beyond practical considerations, JAMS has potential to transform the way researchers approach and use music annotations. One of the more pressing issues facing the community at present is that of dataset curation and access. It is our hope that by associating multiple annotations for multiple tasks to an audio signal with retraceable metadata, such as identifiers or URLs, it might be easier to create freely available datasets with better coverage across tasks. Annotation tools could serve music content found freely on the Internet and upload this information to a common repository, ideally becoming something like a Freebase 6 for MIR. Furthermore, JAMS provides a mechanism to handle multiple concurrent perspectives, rather than forcing the notion of an objective truth. Finally, we recognize that any specification proposal is incomplete without an honest discussion of feasibility and adoption. The fact remains that JAMS arose from the combination of needs within our group and an observation of wider applicability. We have endeavored to make the specification maximally useful with minimal overhead, but appreciate that community standards require iteration and feedback. This current version is not intended to be the definitive answer, but rather a starting point from which the community can work toward a solution as a collective. Other professional communities, such as the IEEE, convene to discuss standards, and perhaps a similar process could become part of the ISMIR tradition as we continue to embrace the pursuit of reproducible research practices. 6. REFERENCES [1] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proc. of the 12th International Society for Music Information Retrieval Conference, pages 591–596, 2011. [4] Trevor De Clercq and David Temperley. A corpus analysis of rock harmony. Popular Music, 30(1):47–70, 2011. [5] W Bas de Haas and John Ashley Burgoyne. Parsing the billboard chord transcriptions. University of Utrecht, Tech. Rep, 2012. [6] Matthias Mauch, Katy Noland, and Simon Dixon. Using musical structure to enhance automatic chord transcription. In Proc. 
of the 10th International Society for Music Information Retrieval Conference, pages 231– 236, 2009. [7] Cory McKay, Rebecca Fiebrink, Daniel McEnnis, Beinan Li, and Ichiro Fujinaga. Ace: A framework for optimizing music classification. In Proc. of the 6th International Society for Music Information Retrieval Conference, pages 42–49, 2005. [8] Yizhao Ni, Matthew McVicar, Raul Santos-Rodriguez, and Tijl De Bie. Understanding effects of subjectivity in measuring chord estimation accuracy. Audio, Speech, and Language Processing, IEEE Transactions on, 21(12):2607–2615, 2013. [9] Hélène Papadopoulos and Geoffroy Peeters. Joint estimation of chords and downbeats from an audio signal. Audio, Speech, and Language Processing, IEEE Transactions on, 19(1):138–152, 2011. [10] G. Peeters and K. Fort. Towards a (better) definition of annotated MIR corpora. In Proc. of the 13th International Society for Music Information Retrieval Conference, pages 25–30, Porto, Portugal, Oct. 2012. [11] Jordan Bennett Louis Smith, John Ashley Burgoyne, Ichiro Fujinaga, David De Roure, and J Stephen Downie. Design and creation of a large-scale database of structural annotations. In Proc. of the 12th International Society for Music Information Retrieval Conference, pages 555–560, 2011. [12] Emmanuel Vincent, Stanislaw A Raczynski, Nobutaka Ono, Shigeki Sagayama, et al. A roadmap towards versatile mir. In Proc. of the 11th International Society for Music Information Retrieval Conference, pages 662– 664, 2010. [2] John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga. An expert ground truth set for audio chord recognition and music analysis. In Proc. of the 12th International Society for Music Information Retrieval Conference, pages 633–638, 2011. 6 http://www.freebase.com 596 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ON THE CHANGING REGULATIONS OF PRIVACY AND PERSONAL INFORMATION IN MIR Pierre Saurel Université Paris-Sorbonne Francis Rousseaux IRCAM Marc Danger ADAMI pierre.saurel @paris-sorbonne.fr francis.rousseaux @ircam.fr mdanger @adami.fr labs (for MIR teaching, for multicultural emotion comparisons, or for MIR user requirement purposes) the identification of legal issues becomes essential or strategic. Legal issues related to copyright and Intellectual Property have already been identified and expressed into Digital Rights Management by the MIR community [2], [7], when those related to security, business models and right to access have been expressed by Information Access [4], [11]. Privacy is another important legal issue. To address it properly one needs first to classify the personal data and processes. A naive classification appears when you quickly look at the kind of personal data MIR deals with: User’s comments, evaluation, annotation and music recommendations are obvious personal data as long as they are published under their name or pseudo; Addresses allowing identification of a device or an instrument and Media Access Control addresses are linked to personal data; Any information allowing identification of a natural person, as some MIR processes do, shall be qualified as personal data and processing of personal data. But the legal professionals do not unanimously approve this classification. For instance the Court of Appeal in Paris judged in two decisions (2007/04/27 and 2007/05/15) that the Internet Protocol address is not a personal data. 
ABSTRACT In recent years, MIR research has continued to focus more and more on user feedback, human subjects data, and other forms of personal information. Concurrently, the European Union has adopted new, stringent regulations to take effect in the coming years regarding how such information can be collected, stored and manipulated, with equally strict penalties for being found in violation of the law. Here, we provide a summary of these changes, consider how they relate to our data sources and research practices, and identify promising methodologies that may serve researchers well, both in order to be in compliance with the law and conduct more subject-friendly research. We additionally provide a case study of how such changes might affect a recent human subjects project on the topic of style, and conclude with a few recommendations for the near future. This paper is not intended to be legal advice: our personal legal interpretations are strictly mentioned for illustration purpose, and reader should seek proper legal counsel. 1. INTRODUCTION The International Society for Music Information Retrieval addresses a wide range of scientific, technical and social challenges, dealing with processing, searching, organizing and accessing music-related data and digital sounds through many aspects, considering real scale usecases and designing innovative applications, exceeding its academic-only initiatory aims. Some recent Music Information Retrieval tools and algorithms aim to attribute authorship and to characterize the structure of style, to reproduce the user’s style and to manipulate one’s style as a content [8], [1]. They deal for instance with active listening, authoring or personalised reflexive feedback. These tools will allow identification of users in the big data: authors, listeners, performers. As the emerging MIR scientific community leads to industrial applications of interest to the international business (start-up, Majors, content providers, platforms) and to experimentations involving many users in living 2. WHAT ARE PROCESSES OF PERSONAL DATA AND HOW THEY ARE REGULATED A careful consideration of the applicable law of personal data is necessary to elaborate a proper classification of MIR personal data processes taking the different international regulations into account. 2.1 Europe vs. United States: two legal approaches Europe regulates data protection through one of the highest State Regulations in the world [3], [9] when the United States lets contractors organize data protection through agreements supported by consideration and entered into voluntarily by the parties. These two approaches are deeply divergent. United States lets companies specify their own rules with their consumers while Europe enforces a unique regulated framework on all companies providing services to European citizens. For instance any company in the United States can define how long they keep the personal data, when the regulations in Europe would specify a maximum length of time the personal © Pierre Saurel, Francis Rousseaux, Marc Danger. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Pierre Saurel, Francis Rousseaux, Marc Danger. “On the Changing Regulations of Privacy and Personal Information in MIR”, 15th International Society for Music Information Retrieval Conference, 2014. 597 15th International Society for Music Information Retrieval Conference (ISMIR 2014) data is to be stored. And this applies to any company offering the same service. 
A prohibition is at the heart of the European Commission’s Directive on Data Protection (95/46/CE – The Directive) [3]. The transfer of personal data to nonEuropean Union countries that do not meet the European Union adequacy standard for privacy protection is strictly forbidden [3, article 25]1. The divergent legal approaches and this prohibition alone would outlaw the proposal by American companies of many of their IT services to European citizens. In response the U.S. Department of Commerce and the European Commission developed the Safe Harbor Framework (SHF) [6], [14]. Any nonEuropean organization is free to self-certify with the SHF and join. A new Proposal for a Regulation on the protection of individuals with regard to the processing of personal data was adopted the 12 March 2014 by the European Parliament [9]. The Directive allows adjustments from one European country to another and therefore diversity of implementation in Europe when the regulation is directly enforceable and should therefore be implemented directly and in the same way in all countries of the European Union. This regulation should apply in 2016. This regulation enhances data protection and sanctions to anyone who does not comply with the obligations laid down in the Regulation. For instance [9, article 79] the supervisory authority will impose, as a possible sanction, a fine of up to one hundred million Euros or up to 5% of the annual worldwide turnover in case of an enterprise. Complying with Safe Harbor is the easiest way for an organization using MIR processing to fulfill the high level European standard about personal data, to operate worldwide and to avoid prosecution regarding personal data. As explained below any non-European organization may enter the US – EU SHF’s requirement and publicly declare that they do so. In that case the organization must develop a data privacy policy that conforms to the seven Safe Harbor Principles (SHP) [14]. First of all organizations must identify personal data and personal data processes. Then they apply the SHP to these data and processes. By joining the SHF, organizations must implement procedures and modify their own information system whether paper or electronic. Organizations must notify (P1) individuals about the purposes for which they collect and use information about them, to whom the information can be disclosed and the choices and means offered for limiting its disclosure. Organizations must explain how they can be contacted with any complaints. Individuals should have the choice (P2) (opt out) whether their personal information is disclosed or not to a third party. In case of sensitive information explicit choice (opt in) must be given. A transfer to a third party (P3) is only possible if the individual made a choice and if the third party subscribed to the SHP or was subject to any adequacy finding regarding to the ED. Individuals must have access (P4) to personal information about them and be able to correct, amend or delete this information. Organizations must take reasonable precautions (P5) to prevent loss, misuse, disclosure, alteration or destruction of the personal information. Personal information collected must be relevant (P6: data integrity) for the purpose for which it is to be used. Sanctions (P7 enforcement) ensure compliance by the organization. There must be a procedure for verifying the implementation of the SHP and the obligation to remedy problems arising out of a failure to comply with the SHP. 
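For bookkeeping purposes, the seven principles can be treated as a simple checklist. The sketch below is purely illustrative: an internal audit aid written for this discussion, not part of the Safe Harbor Framework itself and not a statement of legal compliance.

    # Illustrative only: a minimal bookkeeping structure for tracking which of the
    # seven Safe Harbor Principles an organization has documented procedures for.
    SAFE_HARBOR_PRINCIPLES = {
        "P1": "Notice",
        "P2": "Choice (opt out; opt in for sensitive information)",
        "P3": "Onward transfer to third parties",
        "P4": "Access, correction, amendment and deletion",
        "P5": "Security precautions",
        "P6": "Data integrity (relevance to the purpose of use)",
        "P7": "Enforcement, verification and remediation",
    }

    def missing_principles(documented):
        """Return the principles for which no procedure has been documented yet."""
        return {code: name for code, name in SAFE_HARBOR_PRINCIPLES.items()
                if code not in documented}

    # Example: an organization that has so far covered only notice, access and security.
    print(missing_principles({"P1", "P4", "P5"}))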
2.2 Data protection applies to any information concerning an identifiable natural person Until French law applied the 95/46/CE European Directive, personal data was only defined considering sets of data containing the name of a natural person. This definition has been extended; the 95/46/CE European Directive (ED) defines ‘personal data’ [3, article 2] as: “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity”. For instance the identification of an author through the structure of his style as depending on his mental, cultural or social identity is a process that must comply with the European data privacy principles. 3. CLASSIFICATION FOR MIR PERSONAL DATA PROCESSING Considering the legal definition of personal data we can now propose a less naive classification of MIR processes and data into three sets: (i) nominative data, (ii) data leading to an easy identification of a natural person and (iii) data leading indirectly to the identification of a natural person through a complex process. 3.1 Nominative data and data leading easily to the identification of a natural person 2.3 Safe Harbor is the Framework ISMIR affiliates need not to pay a fine up to hundreds million Euros The first set of processes deals with all the situations giving the name of a natural person directly. The second set deals with the cases of a direct or an indirect identification easily done for instance through devices. In these two sets we find that the most obvious set of data concerns the “Personal Music Libraries” and “recommendations”. Looking at the topics that characterize 1 Argentina, Australia, Canada, State of Israel, New Zealand, United States – Transfer of Air Passenger Name Record (PNR) Data, United States – Safe Harbor, Eastern Republic of Uruguay are, to date, the only non-European third countries ensuring an adequate level of protection: http://ec.europa.eu/justice/data-protection/document/internationaltransfers/adequacy/index_en.htm 598 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ISMIR papers from year 2000 to 2013, we find more than 30 papers and posters dealing with those topics as their main topic. Can one recommend music to a user or analyze their personal library without tackling privacy? sonal data in case of a simple direct or indirect identification process. 4.1 Trends in terms of use and innovative technology Databases of personal data are no more clearly identified. We can view the situation as combining five aspects, which lead to new scientific problems concerning MIR personal data processing. Data Sources Explosion. The number of databases for retrieving information is growing dramatically. Applications are also data sources. Spotify for instance provides a live flow of music consumption information from millions of users. Data from billions of sensors will soon be added. This profusion of data does not mean quality. Accessible does not mean legal or acceptable for a user. Those considerations are essential to build reliable and sustainable systems. Crossing & Reconciling Data. Data sources are no longer isolated islands. Once the user can be identified (cookie, email, customer id), it is possible to match, aggregate and remix data that was previously isolated. 
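As a small illustration of the reconciliation risk described in the previous paragraph, the sketch below joins two hypothetical data sources on a shared customer identifier; all field names and values are invented for the example.

    # Two hypothetical data sources that look harmless in isolation: listening
    # events keyed by a customer id, and a CRM export keyed by the same id.
    listening_log = [
        {"customer_id": "c42", "track": "Track A", "city": "Lyon"},
        {"customer_id": "c42", "track": "Track B", "city": "Lyon"},
    ]
    crm_export = [
        {"customer_id": "c42", "email": "[email protected]", "name": "J. Doe"},
    ]

    # Once a shared identifier (cookie, email, customer id) exists, the sources can
    # be matched, aggregated and remixed into a profile of an identifiable person.
    crm_by_id = {row["customer_id"]: row for row in crm_export}
    profiles = {}
    for event in listening_log:
        person = crm_by_id.get(event["customer_id"])
        if person is None:
            continue
        profile = profiles.setdefault(person["name"], {"email": person["email"], "tracks": []})
        profile["tracks"].append(event["track"])

    print(profiles)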
Time Dimension. The web has a good memory that humans are generally not familiar with. Data can be public one day and be considered as very private 3 years later. Many users forget they posted a picture after a student party. And the picture has the misfortune to crop up again when you apply for a job. And it is not only a question of human memory: Minute traces collected one day can be exploited later and provide real information. Permanent Changes. The general instability of the data sources, technical formats and flows, applications and use is another strong characteristic of the situation. The impact on personal data is very likely. If the architecture of the systems changes a lot and frequently, the social norms also change. Users today publicly share information that they would have considered totally private a few years earlier. And the opposite could be the case. User Understandability and Control. Because of the complexity of changing systems and complex interactions users will less and less control over their information. This lack of control is caused by the characteristics of the systems and by the mistakes and the misunderstandings of human users. The affair of the private Facebook messages appearing suddenly on timeline (Sept. 2012) is significant. Facebook indicates that there was no bug. Those messages were old wall posts that are now more visible with the new interface. This is a combination of bad user understanding and fast moving systems. 3.2 Data leading to the identification of a natural person through a complex process The third set of personal data deals with cases when a natural person is indirectly identifiable using a complex process, like some of the MIR processes. Can one work on “Classification” or “Learning”, producing 130 publications (accepted contributions at ISMIR from year 2000 to year 2013) without considering users throughout their tastes or style? The processes used under these headings belong for the most part to this third set. Looking directly at the data without any sophisticated tool does not allow any identification of the natural person. On the contrary, using some MIR algorithms or machine learning can lead to indirect identifications [12]. Most of the time these non-linear methods use inputs to build new data which are outputs or data stored inside the algorithm, like weights for instance in a neural net. 3.3 The legal criteria of the costs and the amount of time required for identification This third set of personal data is not as homogeneous as it seems to be at first glance. Can we compare sets of data that lead to an identification of a natural person through a complex process? The European Proposal for a Regulation specifies the concept of “identifiability”. It tries to define legal criteria to decide if an identifiable set of data is or is not personal data. It considers the identification process [9, recital 23] as a relative one depending on the means used for that identification: “To determine whether a person is identifiable, account should be taken of all the means reasonably likely to be used either by the controller or by any other person to identify or single out the individual directly or indirectly. 
To ascertain whether means are reasonably likely to be used to identify the individual, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration both available technology at the time of the processing and technological development.” But under what criteria should we, as MIR practitioners, specify when a set of data allows an easy identification and belongs to the second set or, on the contrary, is too complex or reaches a too uncertain identification so that we would not legally say that these are personal data? To answer these questions, we must be able to compare MIR processes with new criteria. 4.2 The case of an Apache Hadoop File System (AHFS) on which some machine learning is applied Everyone produces data and personal data without being always aware that they provide data revealing their identification. When a user tags / rates musical items [13], he gives personal information. If a music recommender ex- 4. MANAGING THE TWO FIRST SETS On an example chosen to be problematic (but increasingly common in the industry), we show how to manage per- 599 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ploits this user data without integrating privacy concepts, he faces legal issues and strong discontent from the users. The data volume has increased faster than “Moore’s law”: This is what is meant by “Big Data”. New data is generally unstructured and traditional database systems such as Relational Database Management Systems cannot handle the volume of data produced by users & machines & sensors. This challenge was the main drive for Google to define a new technology: the Apache Hadoop File System (AHFS). Within this framework, data and computational activities are distributed on a very large number of servers. Data is not loaded for computation, nor the results stored. Here, the algorithm is close to the data. This situation leads to the epistemological problem of separability into the field of MIR personal data processing: are all MIR algorithms (and for instance the authorship attribution algorithms) separable into data and processes? An answer to this question is required for any algorithm to be able to identify the set of personal data it deals with. Now, let us consider a machine learning classifier/recommender trained on user data. In this sense, the algorithm is inseparable from the data it uses to function. And, if the machine is internalizing identifiable information from a set of users in a certain state (let say EU), it is then in violation to share the resulting function in a non-adequate country (let say Brazil) the EU if it was trained in, say, the US. solution has gained widespread international recognition, and was recently recognized as a global privacy standard. According to its Canadian inventor 1, is PbD based on seven Foundation Principles (FP): PbD “is an approach to protect privacy by embedding it into the design specifications of technologies, business practices, and physical infrastructures. That means building in privacy up front – right into the design specifications and architecture of new systems and processes. PbD is predicated on the idea that, at the outset, technology is inherently neutral. As much as it can be used to chip away at privacy, it can also be enlisted to protect privacy. 
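To make the point that such a trained model can be inseparable from its training data, here is a minimal sketch in plain NumPy (not any system discussed in the paper) of a nearest-neighbour recommender whose fitted state literally contains the users' raw feature vectors; shipping the trained object to another jurisdiction therefore ships the data along with it. All example data is invented.

    import numpy as np

    class NearestNeighbourRecommender:
        """Toy model: recommends the favourite item of the most similar user."""

        def fit(self, user_features, favourite_items):
            # The "trained model" is essentially a copy of the training data:
            # the user feature vectors are stored verbatim inside the object.
            self.user_features = np.asarray(user_features, dtype=float)
            self.favourite_items = list(favourite_items)
            return self

        def recommend(self, query_features):
            distances = np.linalg.norm(
                self.user_features - np.asarray(query_features, dtype=float), axis=1)
            return self.favourite_items[int(np.argmin(distances))]

    # Invented per-user listening statistics, e.g. collected from EU users.
    model = NearestNeighbourRecommender().fit(
        user_features=[[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
        favourite_items=["artist_A", "artist_B"],
    )
    print(model.recommend([0.8, 0.2, 0.0]))
    # Transferring `model` elsewhere transfers `model.user_features` with it.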
The same is true of processes and physical infrastructure”: Proactive not Reactive (FP1): the PbD approach is based on proactive measures anticipating and preventing privacy invasive events before they occur; Privacy as the Default Setting (FP2): the default rules seek to deliver the maximum degree of privacy; Privacy embedded into Design (FP3): Privacy is embedded into the architecture of IT systems and business practices; Full Functionality – Positive Sum, not Zero-Sum (FP4): PbD seeks to accommodate all legitimate interests and objectives (security, etc.) in a “win-win” manner; End-to-End Security – Full Lifecycle Protection (FP5): security measures are essential to privacy, from start to finish; Visibility and Transparency — Keep it Open (FP6): PbD is subject to independent verification. Its component parts and operations remain visible and transparent, to users and providers alike; Respect for User Privacy — Keep it User-Centric (FP7): PbD requires architects and operators to keep the interests of the individual uppermost. At the time of digital data exchange through networks, PbD is a key-concept in legacy [10]. In Europe, where this domain has been directly inspired by the Canadian experience, the EU 2 affirms: “PbD means that privacy and data protection are embedded throughout the entire life cycle of technologies, from the early design stage to their deployment, use and ultimate disposal”. 4.3 Analyzing the multinational AHFS case Regarding to the European regulation rules [3, art. 25], you may not transfer personal data collected in Europe to a non-adequate State (see list of adequate countries above). If you build a multinational AHFS system, you may collect data in Europe and in US depending on the way you localized the AHFS servers. The European data may not be transferred to Brazil. Even the classifier would not legally be used in Brazil as long as it internalizes some identifiable European personal information. In practice one should then localize the AHFS files and machine-learning processes to make sure no identifiable data will be transferred from one country with a specific regulation to another with another regulation about personal data. We call these systems “heterarchical” due to the blended situation of a hierarchical system (the global AHFS management) and the need of a heterogeneous local regulation. To manage properly the global AHFS system we need a first analysis of the system dispatching the different files on the right legal places. Privacy by Design (PbD) is a useful methodology to do so. 4.5 Prospects for a MIR Privacy by Design PbD is a reference for designing systems and processing involving personal data, enforced by the new European proposal for a Regulation [9, art. 23]. It becomes a method for these designs whereby it includes signal analysis methods and may interest MIR developers. This proposal leads to new questions, such as the following: Is PbD a universal methodological solution about personal data for all MIR projects? Most of ISMIR contributions are still research oriented which doesn’t mean 4.4 Foundations Principals of Privacy by Design PbD was first developed by Ontario’s Information and Privacy Commissioner, Dr. Ann Cavoukian, in the 1990s, at the very birth of the future big data phenomenon. This 1 http://www.ipc.on.ca/images/Resources/7foundationalprinciples.pdf “Safeguarding Privacy in a Connected World – A European Data Protection Framework for the 21st Century” COM (2012) 9 final. 
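As a schematic illustration of the "dispatch the files to the right legal places" idea of Section 4.3, the sketch below routes records to per-jurisdiction stores before any processing and checks transfers against an adequacy list. The country codes and the adequacy list (a few EU members plus some of the adequate countries named in the earlier footnote) are illustrative placeholders, not legal guidance.

    # Illustrative only: keep one logical store (e.g. an HDFS namespace) per legal
    # zone and refuse exports of EU-collected data to non-adequate destinations.
    EU = {"FR", "DE", "AT", "NL"}                         # illustrative EU subset
    ADEQUATE_FOR_EU_DATA = EU | {"CA", "IL", "NZ", "UY"}  # illustrative adequacy list

    stores = {}

    def legal_zone(record):
        return "EU" if record["collected_in"] in EU else record["collected_in"]

    def ingest(record):
        stores.setdefault(legal_zone(record), []).append(record)

    def can_export(zone, destination_country):
        if zone != "EU":
            return True  # other zones are governed by their own (contractual) rules
        return destination_country in ADEQUATE_FOR_EU_DATA

    ingest({"collected_in": "FR", "user_id": "u1", "plays": 12})
    ingest({"collected_in": "US", "user_id": "u2", "plays": 7})
    print(sorted(stores))            # ['EU', 'US']: data stays in its zone by default
    print(can_export("EU", "BR"))    # False: no transfer of EU data to Brazil
    print(can_export("EU", "CA"))    # True under this illustrative adequacy list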
2 600 15th International Society for Music Information Retrieval Conference (ISMIR 2014) that they fulfill the two specific exceptions [9, art. 83]1. To say more about that intersection, we need to survey the ISMIR scientific production, throughout the main FPs. FP6 (transparency) and FP7 (user-centric) are usually respected among the MIR community as source code and processing are often (i) delivered under GNU like licensing allowing audit and traceability (ii) user-friendly. However, as long as PbD is not embedded, FP3 cannot be fulfilled and accordingly FP2 (default setting), FP5 (endto-end), FP4 (full functionality) and FP1 (proactive) cannot be fulfilled even. Without any PbD embedded into Design, there are no default settings (FP2), you cannot follow an end-to-end approach (FP5), you cannot define full functionality regarding to personal data (FP4) nor be proactive. Principle of pro-activity (FP1) is the key. Fulfilling FP1 you define the default settings (FP2), be fully functional (FP4) and define an end-to-end process (FP5). In brief is PbD useful to MIR developers even if it is not the definitive martingale! This situation leads to a new scientific problem: Is there an absolute criterion about the identifiability of personal data extracted from a set of data with a MIR process? What characterizes a maximal subset from the big data that could not ever be computed by any Turing machine to identify a natural person with any algorithm? 5.2 What about the foundational separation in computer science between data and process? Computer science is based on a strict separation between data and process (dual as these two categories are interchangeable at any time; data can be activated as a process and a process can be treated as a data). We may wonder about the possibility of maintaining the data/process separation paradigm if i) the data stick to the process and ii) the legal regulation leads to a location of the data in the legal system in which those data were produced. 6. CONCLUSION 5. EXPLORING THE THIRD SET 6.1 When some process lead to direct or indirect personal data identification “Identifiability” is the potentiality of a set of data to lead to the identification of its source. A set of data should be qualified as being personal data if the cost and the amount of time required for identification are reasonable. These new criteria are a step forward since the qualification is not an absolute one anymore and depends specifically on the state of the art. Methodological Recommendations. MIR researchers could first audit their algorithm and data, and check if they are able to identify a natural person (two first sets of our classification). If so they could use the SHF which could already be an industrial challenge for instance regarding Cyber Security (P5). Using the PbD methodology certainly leads to operational solutions in these situations. 5.1 Available technology and technological development to take into account at this present moment 6.2 When some process may lead to indirect personal data identification through some complex process Changes in Information Technology lead to a shift in the approach of data management: from computational to data exploration. The main question is “What to look for?” Many companies build new tools to “make the data speak”. This is the case considering the trend of personalized marketing. Engineers using big data build systems that produce new personal dataflow. Is it possible to stabilize these changes through standardization of metadata? 
Is it possible to develop a standardization of metadata which could ease the classification of MIR processing of personal data into identifying and non-identifying processes. Many of the MIR methods are stochastic, probabilistic or designed to cost and more generally non-deterministic. On the contrary the European legal criteria [9, recital 23] (see above § 3.3) to decide whether a data is personal or not (the third set) seem to be much to deterministic to fit the effective new practices about machine learning on personal data. In many circumstances, the MIR community develops new personal data on the fly, using the whole available range of data analysis and data building algorithm. Then researchers could apply the PbD methodology, to insure that no personal data is lost during the system design. Here PbD is not a universal solution because the time when data (on the one hand) and processing (on the other hand) were functionally independent, formally and semantically separated, has ended. Nowadays, MIR researchers currently use algorithms that support effective decision, supervised or not, without introducing ‘pure’ data or ‘pure’ processing, but building up acceptable solutions together with machine learning [5] or heuristic knowledge that cannot be reduced to data or processing: The third set of personal data may appear, and raise theoretical scientific problems. Political Opportunities. The MIR community has a political role to play in the data privacy domain, by explaining to lawyers —joining expert groups in the US, UE or elsewhere— what we are doing and how we overlap with the tradition in style description, turning it into a computed style genetic, which radically questions the analysis of data privacy traditions, cultures and tools. 1 (i) these processing cannot be fulfilled otherwise and (ii) data permitting the identification are kept separately from the other information, or when the bodies conducting these data respect three conditions: (i) consent of the data subject, (ii) publication of personal data is necessary and (iii) data are made public 601 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Future Scientific Works. In addition to methodological and political ones, we face purely scientific challenges, which constitute our research program for future works. Under what criteria should we, as MIR practitioners, specify when a set of data allows an easy identification and belongs to the second set or on the contrary is too complex or allows a too uncertain identification so that we would say that these are not personal data? What characterizes a maximal subset from the big data that could not ever be computed by any Turing machine to identify a natural person with any algorithm? [10] V. Reding: “The European Data Protection Framework for the Twenty-first century”, International Data Privacy Law, volume 2, issue 3, pp.119-129, 2012. [11] A. Seeger: “I Found It, How Can I Use It? - Dealing With the Ethical and Legal Constraints of Information Access”, Proceedings of the International Symposium on Music Information Retrieval, 2003. [12] A.B. Slavkovic, A. Smith: “Special Issue on Statistical and Learning-Theoretic Challenges in Data Privacy”, Journal of Privacy and Confidentiality, Vol. 4, Issue 1, pp. 1-243, 2012. 7. REFERENCES [1] S. Argamon, K. Burns, S. Dubnov (Eds): The Structure of Style, Springer-Verlag, 2010. [13] P. Symeonidis, M. Ruxanda, A. Nanopoulos, Y. 
Manolopoulos: “Ternary Semantic Analysis of Social Tags for Personalized Music Recommendation”, Proceedings of the International Symposium on Music Information Retrieval, 2008.

[2] C. Barlas: “Beating Babel - Identification, Metadata and Rights”, Invited Talk, Proceedings of the International Symposium on Music Information Retrieval, 2002.

[14] U.S. – EU Safe Harbor: http://www.export.gov/safeharbor/eu/eg_main_018365.asp

[3] Directive (95/46/EC) of 24 October 1995, Official Journal L 281, 23/11/1995, P. 0031-0050: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:en:HTML

[4] J.S. Downie, J. Futrelle, D. Tcheng: “The International Music Information Retrieval Systems Evaluation Laboratory: Governance, Access and Security”, Proceedings of the International Symposium on Music Information Retrieval, 2004.

[5] A. Gkoulalas-Divanis, Y. Saygin, Vassilios S. Verykios: “Special Issue on Privacy and Security Issues in Data Mining and Machine Learning”, Transactions on Data Privacy, Vol. 4, Issue 3, pp. 127-187, December 2011.

[6] D. Greer: “Safe Harbor - A Framework that Works”, International Data Privacy Law, Vol. 1, Issue 3, pp. 143-148, 2011.

[7] M. Levering: “Intellectual Property Rights in Musical Works: Overview, Digital Library Issues and Related Initiatives”, Invited Talk, Proceedings of the International Symposium on Music Information Retrieval, 2000.

[8] F. Pachet, P. Roy: “Hit Song Science is Not Yet a Science”, Proceedings of the International Symposium on Music Information Retrieval, 2008.

[9] Proposal for a Regulation on the protection of individuals with regard to the processing of personal data, adopted on 12 March 2014 by the European Parliament: http://www.europarl.europa.eu/sides/getDoc.do?type=TA&reference=P7-TA-2014-0212&language=EN

A MULTI-MODEL APPROACH TO BEAT TRACKING CONSIDERING HETEROGENEOUS MUSIC STYLES

Sebastian Böck, Florian Krebs and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University, Linz, Austria
[email protected]

ABSTRACT

this task. Most systems then determine the most predominant tempo from these periodicities and subsequently determine the beat times using multiple-agent approaches [8, 12], dynamic programming [6, 10], hidden Markov models (HMM) [7, 16, 18], or recurrent neural networks (RNN) [2]. Other systems operate directly on the input features and jointly determine the tempo and phase of the beats using dynamic Bayesian networks (DBN) [3, 14, 17, 21]. One of the most common problems of beat tracking systems is “octave errors”, meaning that a system detects beats at double or half the rate of the ground truth tempo. For human tappers this generally does not constitute a problem, as can be seen when comparing beat tracking results at different metrical levels [6]. Hainsworth and Macleod stated that beat tracking systems will have to be style specific in the future in order to improve the state-of-the-art [14]. This is consistent with the finding of Krebs et al. [17], who showed on a dataset of Ballroom music that the beat tracking performance can be improved by incorporating style-specific knowledge, especially by resolving the octave error. While approaches have been proposed which combined multiple existing features for beat tracking [22], no one has so far combined several models specialised on different musical styles to improve the overall performance.
In this paper, we propose a multi-model approach to fuse information of different models that have been specialised on heterogeneous music styles. The model is based on the recurrent neural network (RNN) beat tracking system proposed in [2] and can be easily adapted to any music style without further parameter tweaking, only by providing a corresponding beat-annotated dataset. Further, we propose an additional dynamic Bayesian network stage based on the work of Whiteley et al. [21] which jointly infers the tempo and the beat phase from the beat activations of the RNN stage. In this paper we present a new beat tracking algorithm which extends an existing state-of-the-art system with a multi-model approach to represent different music styles. The system uses multiple recurrent neural networks, which are specialised on certain musical styles, to estimate possible beat positions. It chooses the model with the most appropriate beat activation function for the input signal and jointly models the tempo and phase of the beats from this activation function with a dynamic Bayesian network. We test our system on three big datasets of various styles and report performance gains of up to 27% over existing stateof-the-art methods. Under certain conditions the system is able to match even human tapping performance. 1. INTRODUCTION AND RELATED WORK The automatic inference of the metrical structure in music is a fundamental problem in the music information retrieval field. In this line, beat tracking deals with finding the most salient level of this metrical grid, the beat. The beat consists of a sequence of regular time instants which usually invokes human reactions like foot tapping. During the last years, beat tracking algorithms have considerably improved in performance. But still they are far from being considered on par with human beat tracking abilities – especially for music styles which do not have simple metrical and rhythmic structures. Most methods for beat tracking extract some features from the audio signal as a first step. As features, commonly low-level features such as amplitude envelopes [20] or spectral features [2], mid-level features like onsets either in discretised [8,12] or continuous form [6,10,16,18], chord changes [12,18] or combinations thereof with higher level features such as rhythmic patterns [17] or metrical relations [11] are used. The feature extraction is usually followed by a stage that determines periodicities within the extracted features sequences. Autocorrelation [2, 9, 12] and comb filters [6, 20] are commonly used techniques for 2. PROPOSED METHOD The new beat tracking algorithm is based on the state-ofthe-art approach presented by Böck and Schedl in [2]. We extend their system to be able to better deal with heterogeneous music styles and combine it with a dynamic Bayesian network similar to the ones presented in [21] and [17]. The basic structure is depicted in Figure 1 and consists of the following elements: first the audio signal is preprocessed and fed into multiple neural network beat track- c Sebastian Böck, Florian Krebs and Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sebastian Böck, Florian Krebs and Gerhard Widmer. “A Multi-Model Approach to Beat Tracking Considering Heterogeneous Music Styles”, 15th International Society for Music Information Retrieval Conference, 2014. 
603 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2.2 Multiple parallel neural networks ing modules. Each of the modules is trained on different audio material and outputs a different beat activation function when activated with a musical signal. These functions are then fed into a module which chooses the most appropriate model and passes its activation function to a dynamic Bayesian network to infer the actual beat positions. At the core of the new approach, multiple neural networks are used to determine possible beat locations in the audio signal. As outlined previously, these networks are trained on material with different music styles to be able to better detect the beats in heterogeneous music styles. As networks we chose the same recurrent neural network (RNN) topology as in [2] with three bidirectional hidden layers with 25 long short-term memory (LSTM) units per layer. For training of the networks, standard gradient descent with error backpropagation and a learning rate of 1e−4 is used. We initialise the network weights with a Gaussian distribution with mean 0 and standard deviation of 0.1. We use early stopping with a disjoint validation set to stop training if no improvement over 20 epochs can be observed. Reference Network Model 1 Signal Preprocessing Model 2 Model Switcher Dynamic Bayesian Network Beats • • • Model N Figure 1. Overview of the new multi-model beat tracking system. One reference network is trained on the complete dataset until the stopping criterion is reached for the first time. We use this point during the training phase to diverge the specialised models from the reference network. Theoretically, a single network large enough should be able to model all the different music styles simultaneously, but unfortunately this optimal solution is hardly achievable. The main reason for this is the difficulty to choose an absolutely balanced training set with an evenly distributed set of beats over all the different dimensions relevant for detecting beats. These include rhythmic patterns [17, 20], harmonic aspects and many other features. To overcome this limitation, we split the available training data into multiple parts. Each part should represent a more homogeneous subset than the whole set so that the networks are able to specialise on the dominant aspects of this subset. It seems reasonable to assume that humans do something similar when tracking beats [4]. Depending on the style of the music, the rhythmic patterns present, the instrumentation, the timbre, they apply their musical knowledge to chose one of their “learned” models and then decide which musical events are beats or not. Our approach mimics this behaviour by learning multiple distinct models. Afterwards, all networks are fine-tuned with a reduced learning rate of 1e−5 on either the complete set or the individual subsets (cf. Section 3.1) with the above mentioned stopping criterion. Given N subsets, N + 1 models are generated. The output functions of the network models represent the beat probability at each time frame. Instead of tracking the beats with an autocorrelation function as described in the original work, the beat activation functions of the different models are fed into the next model-selection stage. 2.3 Model selection The purpose of this stage is to select a model which outputs a better beat activation function than the reference model when activated with a signal. 
Compared to the reference model, the specialised models produce better predictions on input data which is similar to that used for fine-tuning, but worse predictions on signals dissimilar to the training data. This behaviour can be seen in Figure 2, where the specialised model produces higher beat activation values at the beat locations and lower values elsewhere.

Table 1 illustrates the impact on the Ballroom subset, where the relative gain of the best specialised model compared to the reference model (+1.7%) is lower than the penalties of the other models (−2.3% to −6.3%). The fact that the performance degradation of the unsuitable specialised models is greater than the gain of the most suitable model allows us to use a very simple but effective method to choose the best model.

2.1 Signal pre-processing

All neural networks share the same signal pre-processing step, which is very similar to the work in [2]. As inputs to the different neural networks, the logarithmically filtered and scaled spectrograms of three parallel Short Time Fourier Transforms (STFT) obtained for different window lengths and their positive first order differences are used. The system works with a constant frame rate fr of 100 frames per second. Window lengths of 23.2 ms, 46.4 ms and 92.9 ms are used and the resulting spectrogram bins of the discrete Fourier transforms are filtered with overlapping triangular filters to have a frequency resolution of three bands per octave. To put all resulting magnitude values into a positive range we add 1 before taking the logarithm.

To select the best performing model, all network outputs of the fine-tuned networks are compared with the output of the reference network (which was trained on the whole training set) and the one yielding the lowest mean squared difference is selected as the final one and its output is fed into the final beat tracking stage.

The DBN we use is closely related to the one proposed in [21], adapted to our specific needs. Instead of modelling whole bars, we only model one beat period which reduces the size of the search space. Additionally we do not model rhythmic patterns explicitly and leave this higher level analysis to the neural networks. This finally leads to a DBN which consists of two hidden variables, the tempo ω and the position φ inside a beat period. In order to infer the hidden variables from an audio signal, we have to specify three entities: a transition model which describes the transitions between the hidden variables, an observation model which takes the beat activations from the neural network and transforms them into probabilities suitable for the DBN, and the initial distribution which encodes prior knowledge about the hidden variables. For computational ease we discretise the tempo-beat space to be able to use standard hidden Markov model (HMM) [19] algorithms for inference.

Figure 2. Example beat activations for a 4 seconds ballroom snippet. Red is the reference network's activations, black the selected model and blue a discarded one. Green dashed vertical lines denote the annotated beat positions.

                  F-measure   Cemgil   AMLc    AMLt
    SMC *           0.834     0.807    0.664   0.767
    Hainsworth *    0.867     0.839    0.694   0.793
    Ballroom *      0.904     0.872    0.777   0.853
    Reference       0.887     0.855    0.748   0.831
    Multi-model     0.897     0.866    0.759   0.841

Table 1. Performance of differently specialised models (marked with asterisks, fine-tuned on the SMC, Hainsworth and Ballroom subsets) on the Ballroom subset compared to the reference model and the network selected by the multi-model selection stage.

2.4.1 Transition model

The beat period is discretised into Φ = 640 equidistant cells and φ ∈ {1, ..., Φ}.
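The selection rule just described (pick the fine-tuned network whose activation function deviates least, in the mean-squared sense, from the reference network's output) can be stated in a few lines. The sketch below assumes the activation functions are already available as arrays of frame-wise beat probabilities and is a simplified illustration, not the authors' code.

    import numpy as np

    def select_model(reference_activation, candidate_activations):
        """Return the index of the candidate activation function with the lowest
        mean squared difference to the reference network's output."""
        reference = np.asarray(reference_activation, dtype=float)
        errors = [np.mean((np.asarray(c, dtype=float) - reference) ** 2)
                  for c in candidate_activations]
        return int(np.argmin(errors))

    # Invented toy activations (frame-wise beat probabilities in [0, 1]).
    reference = [0.1, 0.8, 0.2, 0.7, 0.1]
    candidates = [
        [0.0, 0.9, 0.1, 0.8, 0.0],   # close to the reference -> selected
        [0.5, 0.5, 0.5, 0.5, 0.5],   # dissimilar to the reference -> rejected
    ]
    print("selected model:", select_model(reference, candidates))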
We refer to the unit of the variable φ (position inside a beat period) as pib. φk at audio frame k is then computed by

φk = (φk−1 + ωk−1 − 1) mod Φ + 1.   (1)

The tempo space is discretised into Ω = 23 equidistant cells, which cover the tempo range up to 215 beats per minute (BPM). The unit of the tempo variable ω is pib per audio frame. As we want to restrict ω to integer values (to stay within the φ grid at transitions), we need a high resolution of φ in order to get a high resolution of ω. Based on experiments with the training set, we set the tempo space to ω ∈ {6, ..., Ω}, where ω = 6 is equivalent to a minimum tempo of 6 × 60 × fr /Φ ≈ 56 BPM. As in [21] we only allow for three tempo transitions at time frame k: it stays constant, it accelerates, or it decelerates:

\omega_k = \begin{cases} \omega_{k-1} & \text{with } P(\omega_k \mid \omega_{k-1}) = 1 - p_\omega \\ \omega_{k-1} + 1 & \text{with } P(\omega_k \mid \omega_{k-1}) = p_\omega/2 \\ \omega_{k-1} - 1 & \text{with } P(\omega_k \mid \omega_{k-1}) = p_\omega/2 \end{cases} \quad (2)

Transitions to tempi outside of the allowed range are not allowed by setting the corresponding transition probabilities to zero. The probability of a tempo change pω was set to 0.002.

2.4 Dynamic Bayesian network

Independent of whether only one or multiple neural networks are used, the approach of Böck and Schedl [2] has a fundamental shortcoming: the final peak-picking stage does not try to find a global optimum when selecting the final locations of the beats. It rather determines the dominant tempo of the piece (or a segment of certain length) and then aligns the beat positions according to this tempo by simply choosing the best start position and then progressively locating the beats at positions with the highest activation function values in a certain region around the pre-determined position. To allow a greater responsiveness to tempo changes, this chosen region must not be too small. However, this also introduces a weakness to the algorithm, because the tracking stage can easily get distracted by a few misaligned beats and needs some time to recover from this fault. The activation function depicted in Figure 2 has two of these spurious detections around frames 100 and 200. To circumvent this problem, we feed the output of the chosen neural network model into a dynamic Bayesian network (DBN) which jointly infers tempo and phase of a beat sequence. Another advantage of this new method is that we are able to model both beat and non-beat states, which was shown to perform superior to the case where only beat states are modelled [7].

2.4.2 Observation model

Since the beat activation function a produced by the neural network is limited to the range [0, 1] and shows high values at beat positions and low values at non-beat positions, we use the activation function directly as state-conditional observation distributions (similar to [7]). We define the observation likelihood as

P(a_k \mid \phi_k) = \begin{cases} a_k & \text{if } 1 \le \phi_k \le \Phi/\lambda \\ (1 - a_k)/(\lambda - 1) & \text{otherwise} \end{cases} \quad (3)

λ ∈ [Φ/(Φ−1), Φ] is a parameter that controls the proportion of the beat interval which is considered as beat and non-beat location.

3.2 Performance measures
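To show how Equations (1)–(3) translate into concrete numbers for the discretised (φ, ω) state space, here is a small NumPy sketch. It mirrors the definitions above with the stated parameter values, but it is a simplified re-implementation for illustration, not the authors' code.

    import numpy as np

    PHI = 640                    # discrete positions inside one beat period (pib)
    OMEGA = np.arange(6, 24)     # allowed tempo states in pib per frame (6 ... 23)
    P_OMEGA_CHANGE = 0.002       # probability of a tempo change, Eq. (2)
    LAM = 16                     # lambda in Eq. (3)

    def advance_position(phi, omega):
        """Position update of Eq. (1), with phi in {1, ..., PHI}."""
        return (phi + omega - 1) % PHI + 1

    def tempo_transition_prob(omega_prev, omega_next):
        """Eq. (2): the tempo stays constant, accelerates or decelerates by one state."""
        if omega_next == omega_prev:
            return 1.0 - P_OMEGA_CHANGE
        if abs(omega_next - omega_prev) == 1 and omega_next in OMEGA:
            return P_OMEGA_CHANGE / 2.0
        return 0.0   # transitions outside the allowed range are forbidden

    def observation_likelihood(activation, phi):
        """Eq. (3): beat states cover the first PHI / LAM positions of the beat period."""
        if 1 <= phi <= PHI / LAM:
            return activation
        return (1.0 - activation) / (LAM - 1.0)

    print(advance_position(phi=635, omega=10))              # wraps around the beat period
    print(tempo_transition_prob(10, 11), tempo_transition_prob(10, 13))
    print(observation_likelihood(0.9, phi=5), observation_likelihood(0.9, phi=300))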
Smaller values of λ (a higher proportion of beat locations and a smaller proportion of non-beat locations) are especially important for higher tempi, as the DBN visits only a few position states of a beat interval and could possibly miss the beginning of a beat. On the other hand, higher values of λ (a smaller proportion of beat locations) lead to less accurate beat tracking, as the activations are blurred in the state domain of the DBN. On our training set we achieved the best results with the value λ = 16.

In line with almost all other publications on the topic of beat tracking, we report the following scores:

F-measure: counts the number of true positive (correctly located beats within a tolerance window of ±70 ms), false positive and false negative detections;

P-score: measures the tracking accuracy by the correlation of the detections and the annotations, considering deviations within 20% of the annotated beat interval as correct;

2.4.3 Initial state distribution

The initial state distribution is normally used to incorporate any prior knowledge about the hidden states, such as tempo distributions. In this paper, we use a uniform distribution over all states, for simplicity and ease of generalisation.

Cemgil: places a Gaussian function with a standard deviation of 40 ms around the annotations and then measures the tracking accuracy by summing up the scores of the detected beats on this function, normalising it by the overall length of the annotations or detections, whichever is greater;

2.4.4 Inference

We are interested in the sequence of hidden variables φ1:K and ω1:K that maximise the posterior probability of the hidden variables given the observations (activations a1:K). Combining the discrete states of φ and ω into one state vector xk = [φk, ωk], we can compute the maximum a-posteriori state sequence x∗1:K by

x^{*}_{1:K} = \arg\max_{x_{1:K}} p(x_{1:K} \mid a_{1:K})   (4)

CMLc & CMLt: measure the longest continuously correct segment (CMLc) or all correctly tracked beats (CMLt) at the correct metrical level. A beat is considered correct if it is reported within a 17.5% tempo and phase tolerance, and the same applies for the previously detected beat;

AMLc & AMLt: like CMLc & CMLt, but additionally allow offbeat and double/half as well as triple/third tempo variations of the annotated beats;

Equation 4 can be computed efficiently using the well-known Viterbi algorithm [19]. Finally, the set of beat times B is determined by the set of time frames k which were assigned to a beat position (B = {k : φk < φk−1}). In our experiments we found that the beat detection becomes less accurate if the part of the beat interval which is considered as beat-state is too large (i.e. smaller values of λ). Therefore we determine the final beat times by looking for the highest beat activation value inside the beat-state window W = {k : φk ≤ Φ/λ}.

D & Dg: the information gain (D) and global information gain (Dg) are phase-agnostic measures comparing the annotations with the detections (and vice versa), building an error histogram and then calculating the Kullback-Leibler divergence w.r.t. a uniform histogram.

A more detailed description of the evaluation methods can be found in [5]. However, since we only investigate offline algorithms, we do not skip the first five seconds for evaluation.

3. EVALUATION

3.3 Results & Discussion

For the development and evaluation of the algorithm we used some well-known datasets. This allows for highest comparability with previously published results of state-of-the-art algorithms.
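As a concrete example of the first of these scores, the following sketch computes the F-measure with the ±70 ms tolerance window mentioned above, matching each annotation to at most one detection. It is a simplified re-implementation for illustration, not the evaluation code of [5].

    def beat_f_measure(detections, annotations, tolerance=0.070):
        """F-measure for beat tracking: a detection is a true positive if it lies
        within +/- tolerance seconds of a not-yet-matched annotation."""
        unmatched = sorted(annotations)
        true_pos = 0
        for d in sorted(detections):
            for i, a in enumerate(unmatched):
                if abs(d - a) <= tolerance:
                    true_pos += 1
                    del unmatched[i]      # each annotation can be matched only once
                    break
        false_pos = len(detections) - true_pos
        false_neg = len(unmatched)
        if true_pos == 0:
            return 0.0
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        return 2 * precision * recall / (precision + recall)

    # Toy example: one detection is 50 ms early and the last beat is missed.
    annotations = [0.50, 1.00, 1.50, 2.00]
    detections = [0.45, 1.02, 1.48]
    print(round(beat_f_measure(detections, annotations), 3))   # 0.857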
Table 2 lists the performance results of the reference implementation, Böck’s BeatTracker.2013, and the various extensions proposed in this paper for all datasets. All results are obtained with 8-fold cross validation with previously defined splittings, ensuring that no pieces are used both for training or parameter tuning and testing purposes. Additionally, we compare our new approach to published statof-the-art results on the Hainsworth and Ballroom datasets. 3.1 Datasets As training material for our system, the datasets introduced in [13–15] are used. They are called Ballroom, Hainsworth and SMC respectively. To show the ability of our new algorithm to adapt to various music styles, a very simple approach of splitting the complete dataset into multiple subsets according to the original source was chosen. Although far from optimal – both the SMC and Hainsworth datasets contain heterogeneous music styles – we still consider this a valid choice, since any “better” splitting would allow the system to adapt even further to heterogeneous styles and in turn lead to better results. At least the three sets have a somehow different focus regarding the music styles present. 3.3.1 Multi-model extension As can be seen, the use of the multi-model extension almost always improves the results over the implementation it is based on, especially on the SMC set. The gain in performance on the Ballroom set was expected, since Krebs et al. already showed that modelling rhythmic patterns helps to increase the overall detection accuracy [17]. Although we did not split the set according to the individual rhythmic patterns, the overall style of ballroom music can be considered unique enough to be distinct from the other music 606 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Ballroom BeatTracker.2013 [1, 2] — Multi-Model — DBN — Multi-Model + DBN Krebs et al. [17] Zapata et al. [22] † Hainsworth BeatTracker.2013 [1, 2] — Multi-Model — DBN — Multi-Model + DBN Zapata et al. [22] † Davies et al. [6] Peeters & Papadopoulos [18] Degara et al. [7] Human tapper [6] ‡ SMC BeatTracker.2013 [1, 2] — Multi-Model — DBN — Multi-Model + DBN Zapata et al. [22] † F-measure P-score Cemgil CMLc CMLt AMLc AMLt D Dg 0.887 0.897 0.903 0.910 0.855 0.767 0.863 0.875 0.876 0.881 0.839 0.735 0.855 0.866 0.838 0.845 0.772 0.672 0.719 0.740 0.792 0.800 0.745 0.586 0.795 0.814 0.825 0.830 0.786 0.607 0.748 0.759 0.873 0.885 0.818 0.824 0.831 0.841 0.915 0.924 0.865 0.860 3.404 3.480 3.427 3.469 2.499 2.750 2.596 2.674 2.275 2.352 1.681 1.187 0.832 0.832 0.843 0.840 0.710 - 0.843 0.847 0.867 0.865 0.732 - 0.712 0.716 0.711 0.707 0.589 - 0.618 0.617 0.696 0.696 0.569 0.548 0.547 0.561 0.528 0.756 0.761 0.808 0.803 0.642 0.612 0.628 0.629 0.812 0.655 0.652 0.759 0.760 0.709 0.681 0.703 0.719 0.575 0.807 0.809 0.883 0.881 0.824 0.789 0.831 0.815 0.874 2.167 2.171 2.251 2.268 2.057 - 1.468 1.490 1.481 1.466 0.880 - 0.497 0.514 0.516 0.529 0.369 0.598 0.617 0.622 0.630 0.460 0.402 0.415 0.404 0.415 0.285 0.238 0.257 0.294 0.296 0.115 0.360 0.389 0.415 0.428 0.158 0.279 0.296 0.378 0.383 0.239 0.436 0.467 0.550 0.567 0.397 1.263 1.324 1.426 1.460 0.879 0.416 0.467 0.504 0.531 0.126 Table 2. Performance of the proposed algorithm on the Ballroom [13], Hainsworth [14] and SMC [15] datasets. BeatTracker is the reference implementation our Multi-Model and dynamic Bayesian network (DBN) extensions are built on. The results marked with † are obtained with Essentia’s implementation of the multi-feature beat tracker. 
1 ‡ denotes causal (i.e. online) processing, all listed algorithms use non-causal analysis (i.e. offline processing) with the best results in bold. styles present in the other sets and the salient features can be exploited successfully by the multi-model approach. 3.3.2 Dynamic Bayesian network extension As already indicated in the original paper [2] (and described earlier in Section 2.4), the original BeatTracker can be easily distracted by some misaligned beats and then needs some time to recover from any failure. The newly adapted dynamic Bayesian network beat tracking stage does not suffer from this shortcoming by searching for the globally best beat locations. The use of the DBN boosts the performance on all datasets for almost all evaluation measures. Interestingly, the Cemgil accuracy is degraded by using the DBN stage. This might be explained by the fact that the discretisation grid of the beat period beat positions becomes too coarse for low tempi (cf. Section 2.4.4) and therefore yields inaccurate beat detections, which especially affect the Cemgil accuracy. This is one of the issues that needs to be resolved in the future, especially for lower tempi where the penalty is the highest. Davies et al. [6] also list performance results of a human tapper on the same dataset. However it must be noted that these were obtained by online real-time tapping, hence they cannot be compared directly to the system presented. However, the system of Davies et al. can also be switched to causal mode (and thus being comparable to a human tapper). In this mode it achieved performance reduced by approximately 10% [6]. Adding the same amount to the reported tapping results of 0.528 CMLc and 0.575 AMlc suggests that our system is capable of performing as good as humans when continuous tapping is required. On the Ballroom set we achieve higher results than the particularly specialised system of Krebs et al. [17]. Since our DBN approach is a simplified variant of their model, it can be assumed that the relatively low scores of the Cemgil accuracy and the information gain are due to the same reason – the coarse discretisation of the beat or bar states. Nonetheless, comparing the continuity scores (which have higher tolerance thresholds) we can still report an average increase in performance of more than 5%. 3.3.3 Comparison with other methods 4. CONCLUSIONS & OUTLOOK Our new system set side by side with other state-of-the-art algorithms draws a clear picture. It outperforms all of them considerably – independently of the dataset and evaluation measure chosen. Especially the high performance boosts of the CMLc and CMLt scores on the Hainworth dataset highlight the ability to track the beats at the correct metrical level significantly more often than any other method. In this paper we have presented a new beat tracking system which is able to improve over existing algorithms by incorporating multiple models which were trained on different music styles and combining it with a dynamic Bayesian 1 607 http://essentia.upf.edu, v2.0.1 15th International Society for Music Information Retrieval Conference (ISMIR 2014) network for the final inference of the beats. The combination of these two extensions yields a performance boost – depending on the dataset and evaluation measures chosen – of up to 27% relative, matching human tapping results under certain conditions. It outperforms other state-of-theart algorithms in tracking the beats at the correct metrical level by 20%. 
We showed that the specialisation on a certain musical style helps to improve the overall performance, although the method for splitting the available data into sets of different styles and then selecting the most appropriate model is rather simple. For the future we will investigate more advanced techniques for the selection of suitable data for the creation of the specialised models, e.g. splitting the datasets according to dance styles as performed by Krebs et al. [17] or applying unsupervised clustering techniques. We also expect better results from more advanced model selection methods. One possible approach could be to feed the individual model activations to the dynamic Bayesian network and let it choose among them. Finally, the Bayesian network could be tuned towards using a finer beat positions grid and thus reporting the beats at more appropriate times than just selecting the position of the highest activation reported by the neural network model. 5. ACKNOWLEDGMENTS This work is supported by the European Union Seventh Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591) and the Austrian Science Fund (FWF) project Z159. 6. REFERENCES [1] MIREX 2013 beat tracking results. http://nema.lis. illinois.edu/nema_out/mirex2013/results/ abt/, 2013. [2] S. Böck and M. Schedl. Enhanced Beat Tracking with Context-Aware Neural Networks. In Proceedings of the 14th International Conference on Digital Audio Effects (DAFx11), pages 135–139, Paris, France, September 2011. [3] A. T. Cemgil, H. Kappen, P. Desain, and H. Honing. On tempo tracking: Tempogram Representation and Kalman filtering. Journal of New Music Research, 28:4:259–273, 2001. [4] N. Collins. Towards a style-specific basis for computational beat tracking. In Proceedings of the 9th International Conference on Music Perception and Cognition (ICMPC9), pages 461–467, Bologna, Italy, 2006. [5] M. E. P. Davies, N. Degara, and M. D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. Technical Report C4DM-TR-09-06, Centre for Digital Music, Queen Mary University of London, 2009. [6] M. E. P. Davies and M. D. Plumbley. Context-dependent beat tracking of musical audio. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1009–1020, March 2007. [7] N. Degara, E. Argones-Rúa, A. Pena, S. Torres-Guijarro, M. E. P. Davies, and M. D. Plumbley. Reliability-informed beat tracking of musical signals. IEEE Transactions on Audio, Speech and Language Processing, 20(1):290–301, January 2012. [8] S. Dixon. Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research, 30:39–58, 2001. [9] D. Eck. Beat tracking using an autocorrelation phase matrix. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), volume 4, pages 1313–1316, Honolulu, Hawaii, USA, April 2007. [10] D. P. W. Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 2007:51–60, 2007. [11] A. Gkiokas, V. Katsouros, G. Carayannis, and T. Stafylakis. Music tempo estimation and beat tracking by applying source separation and metrical relations. In Proceedings of the 37th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), pages 421–424, Kyoto, Japan, March 2012. [12] M. Goto and Y. Muraoka. Beat tracking based on multipleagent architecture a real-time beat tracking system for audio signals. 
In Proceedings of the International Conference on Multiagent Systems, pages 103–110, 1996. [13] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, September 2006. [14] S. Hainsworth and M. Macleod. Particle filtering applied to musical tempo tracking. EURASIP J. Appl. Signal Process., 15:2385–2395, January 2004. [15] A. Holzapfel, M. E. P. Davies, J. R. Zapata, J. Oliveira, and F. Gouyon. Selective sampling for beat tracking evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 20(9):2539–2548, November 2012. [16] A. Klapuri, A. Eronen, and J. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):342–355, January 2006. [17] F. Krebs, S. Böck, and G. Widmer. Rhythmic pattern modeling for beat and downbeat tracking in musical audio. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), pages 227–232, Curitiba, Brazil, November 2013. [18] G. Peeters and H. Papadopoulos. Simultaneous beat and downbeat-tracking using a probabilistic framework: Theory and large-scale evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 19(6):1754–1769, 2011. [19] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257–286, 1989. [20] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. The Journal of the Acoustical Society of America, 103(1):588–601, 1998. [21] N. Whiteley, A. Cemgil, and S. Godsill. Bayesian modelling of temporal structure in musical audio. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), pages 29–34, Victoria, BC, Canada, October 2006. [22] J. R. Zapata, M. E. P. Davies, and E. Gómez. Multi-feature beat tracking. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):816–825, April 2014. 608 Oral Session 8 Source Separation 609 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 610 15th International Society for Music Information Retrieval Conference (ISMIR 2014) EXTENDING HARMONIC-PERCUSSIVE SEPARATION OF AUDIO SIGNALS Jonathan Driedger1 , Meinard Müller1 , Sascha Disch2 1 International Audio Laboratories Erlangen 2 Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany {jonathan.driedger,meinard.mueller}@audiolabs-erlangen.de, [email protected] ABSTRACT In recent years, methods to decompose an audio signal into a harmonic and a percussive component have received a lot of interest and are frequently applied as a processing step in a variety of scenarios. One problem is that the computed components are often not of purely harmonic or percussive nature but also contain noise-like sounds that are neither clearly harmonic nor percussive. Furthermore, depending on the parameter settings, one often can observe a leakage of harmonic sounds into the percussive component and vice versa. In this paper we present two extensions to a state-of-the-art harmonic-percussive separation procedure to target these problems. First, we introduce a separation factor parameter into the decomposition process that allows for tightening separation results and for enforcing the components to be clearly harmonic or percussive. 
As second contribution, inspired by the classical sines+transients+noise (STN) audio model, this novel concept is exploited to add a third residual component to the decomposition which captures the sounds that lie in between the clearly harmonic and percussive sounds of the audio signal. Figure 1. (a): Input audio signal x. (b): Spectrogram X. (c): Spectrogram of the harmonic component Xh (left), the residual component Xr (middle) and the percussive component Xp (right). (d): Waveforms of the harmonic component xh (left), the residual component xr (middle) and the percussive component xp (right). 1. INTRODUCTION harmonic sounds have a horizontal structure in a spectrogram representation of the input signal, while percussive sounds form vertical structures. By iteratively diffusing the spectrogram once in horizontal and once in vertical direction, the harmonic and percussive elements are enhanced, respectively. The two enhanced representations are then compared, and entries in the original spectral representation are assigned to either the harmonic or the percussive component according to the dominating enhanced spectrogram. Finally, the two components are transformed back to the time-domain. Following the same idea, Fitzgerald [5] replaces the diffusion step by a much simpler median filtering strategy, which turns out to yield similar results while having a much lower computational complexity. A drawback of the aforementioned approaches is that the computed decompositions are often not very tight in the sense that the harmonic and percussive components may still contain some non-harmonic and non-percussive residues, respectively. This is mainly because of two reasons. First, sounds that are neither of clearly harmonic nor of clearly percussive nature such as applause, rain, or the sound of a heavily distorted guitar are often more or less The task of decomposing an audio signal into its harmonic and its percussive component has received large interest in recent years. This is mainly because for many applications it is useful to consider just the harmonic or the percussive portion of an input signal. Harmonic-percussive separation has been applied, for example, for audio remixing [9], improving the quality of chroma features [14], tempo estimation [6], or time-scale modification [2, 4]. Several decomposition algorithms have been proposed. In [3], the percussive component is modeled by detecting portions in the input signal which have a rather noisy phase behavior. The harmonic component is then computed by the difference of the original signal and the computed percussive component. In [10], the crucial observation is that c Jonathan Driedger, Meinard Müller, and Sascha Disch. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jonathan Driedger, Meinard Müller, and Sascha Disch. “Extending Harmonic-Percussive Separation of Audio Signals”, 15th International Society for Music Information Retrieval Conference, 2014. 611 15th International Society for Music Information Retrieval Conference (ISMIR 2014) randomly distributed among the two components. Second, depending on the parameter setting, harmonic sounds often leak into the percussive component and the other way around. Finding suitable parameters which yield satisfactory results often involves a delicate trade-off between a leakage in one or the other direction. In this paper, we propose two extensions to [5] that lead towards more flexible and refined decompositions. 
First, we introduce the concept of a separation factor (Section 2). This novel parameter allows for tightening decomposition results by enforcing the harmonic and percussive components to contain just the clearly harmonic and percussive sounds of the input signal, respectively, and therefore attenuates the aforementioned problems. Second, we exploit this concept to add a third residual component that captures all sounds in the input audio signal which are neither clearly harmonic nor percussive (see Figure 1). This kind of decomposition is inspired by the classical sines+transients+noise (STN) audio model [8, 11], which aims at resynthesizing a given audio signal in terms of a parameterized set of sine waves, transient sounds, and shaped white noise. While a first methodology to compute such a decomposition follows rather directly from the concept of a separation factor, we also propose a more involved iterative decomposition procedure. Building on concepts proposed in [13], this procedure allows for a more refined adjustment of the decomposition results (Section 3.3). Finally, we evaluate our proposed procedures based on objective evaluation measures as well as subjective listening tests (Section 4). Note that this paper has an accompanying website [1] where you can find all audio examples discussed in this paper.

2. TIGHTENED HARMONIC-PERCUSSIVE SEPARATION

The first steps of our proposed decomposition procedure for tightening the harmonic and the percussive component are the same as in [5], which we now summarize. Given an input audio signal x, our goal is to compute a harmonic component xh and a percussive component xp such that xh and xp contain the clearly harmonic and percussive sounds of x, respectively. To achieve this goal, first a spectrogram X of the signal x is computed by applying a short-time Fourier transform (STFT)

X(t, k) = \sum_{n=0}^{N-1} w(n) \, x(n + tH) \, \exp(-2\pi i k n / N)

with t ∈ [0 : T−1] and k ∈ [0 : K], where T is the number of frames, K = N/2 is the frequency index corresponding to the Nyquist frequency, N is the frame size and length of the discrete Fourier transform, w is a sine-window function, and H is the hopsize (we usually set H = N/4). A crucial observation is that, looking at one frequency band of the magnitude spectrogram Y = |X| (one row of Y), harmonic components stay rather constant while percussive structures show up as peaks. Contrarily, in one frame (one column of Y), percussive components tend to be equally distributed while the harmonic components stand out. By applying a median filter to Y once in horizontal and once in vertical direction, we get a harmonically enhanced magnitude spectrogram Ỹh and a magnitude spectrogram Ỹp with enhanced percussive content:

Ỹh(t, k) := median(Y(t − ℓh, k), ..., Y(t + ℓh, k)),
Ỹp(t, k) := median(Y(t, k − ℓp), ..., Y(t, k + ℓp)),

for ℓh, ℓp ∈ N, where 2ℓh + 1 and 2ℓp + 1 are the lengths of the median filters, respectively. Now, extending [5], we introduce an additional parameter β ∈ R, β ≥ 1, called the separation factor. We assume an entry of the original spectrogram X(t, k) to be part of the clearly harmonic or percussive component if Ỹh(t, k)/Ỹp(t, k) > β or Ỹp(t, k)/Ỹh(t, k) ≥ β, respectively. Intuitively, for a sound to be included in the harmonic component it is required to stand out from the percussive portion of the signal by at least a factor of β, and vice versa for the percussive component. Using this principle, we can define binary masks Mh and Mp:

Mh(t, k) := Ỹh(t, k) / (Ỹp(t, k) + ε) > β,
Mp(t, k) := Ỹp(t, k) / (Ỹh(t, k) + ε) ≥ β,

where ε is a small constant to avoid division by zero, and the operators > and ≥ yield a binary result from {0, 1}. Applying these masks to the original spectrogram X yields the spectrograms for the harmonic and the percussive component:

Xh(t, k) := X(t, k) · Mh(t, k),
Xp(t, k) := X(t, k) · Mp(t, k).

These spectrograms can then be brought back to the time domain by applying an “inverse” short-time Fourier transform, see [7]. This yields the desired signals xh and xp.
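To make the procedure above concrete, the following is a minimal Python sketch of the tightened separation, assuming scipy. The function name tightened_hp_separation, the default filter half-lengths, and the use of scipy's default Hann window (rather than the sine window above) are our own simplifications, not the authors' implementation:

    import numpy as np
    from scipy.signal import stft, istft
    from scipy.ndimage import median_filter

    def tightened_hp_separation(x, sr, N=1024, beta=2.0, l_h=17, l_p=17, eps=1e-10):
        """Median-filtering HP separation with a separation factor beta,
        following the procedure summarized above (illustrative sketch;
        filter lengths and window choice are simplifications)."""
        H = N // 4
        _, _, X = stft(x, fs=sr, nperseg=N, noverlap=N - H)
        Y = np.abs(X)                                  # magnitude spectrogram, shape (K+1, T)
        Y_h = median_filter(Y, size=(1, 2 * l_h + 1))  # horizontal (time) filter -> harmonic
        Y_p = median_filter(Y, size=(2 * l_p + 1, 1))  # vertical (frequency) filter -> percussive
        M_h = (Y_h / (Y_p + eps)) > beta               # binary masks via the separation factor
        M_p = (Y_p / (Y_h + eps)) >= beta
        _, x_h = istft(X * M_h, fs=sr, nperseg=N, noverlap=N - H)
        _, x_p = istft(X * M_p, fs=sr, nperseg=N, noverlap=N - H)
        return x_h, x_p, X, M_h, M_p

Setting beta=1.0 essentially recovers the baseline median-filtering separation of [5], since every spectrogram entry is then assigned to whichever enhanced spectrogram dominates.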
Choosing a separation factor β > 1 tightens the separation result of the procedure by preventing sounds which are neither clearly harmonic nor percussive from being included in the components. In Figure 2a, for example, you see the spectrogram of a sound mixture of a violin (clearly harmonic), castanets (clearly percussive), and applause (noise-like, and neither harmonic nor percussive). The sound of the violin manifests itself as clear horizontal structures, while one clap of the castanets is visible as a clear vertical structure in the middle of the spectrogram. The sound of the applause, however, does not form any kind of directed structure and is spread all over the spectrum. When decomposing this audio signal with a separation factor of β=1, which basically yields the procedure proposed in [5], the applause is more or less equally distributed among the harmonic and the percussive component, see Figure 2b. However, when choosing β=3, only the clearly horizontal and vertical structures are preserved in Xh and Xp, respectively, and the applause is no longer contained in the two components, see Figure 2c.

Figure 2. (a): Original spectrogram X. (b): Spectrograms Xh (left) and Xp (right) for β = 1. (c): Spectrograms Xh (left) and Xp (right) for β = 3.

the original signal are not necessarily equal. Our proposed approach yields a decomposition of the signal. The three components always add up to the original signal again. The separation factor β hereby constitutes a flexible handle to adjust the sound characteristics of the components.

3. HARMONIC-PERCUSSIVE-RESIDUAL SEPARATION

3.2 Influence of the Parameters

Figure 3. Energy distribution between the harmonic, residual, and percussive components for different frame sizes N and separation factors β. (a): Harmonic components. (b): Residual components. (c): Percussive components.

The main parameters of our decomposition procedure are the lengths of the median filters, the frame size N used to compute the STFT, and the separation factor β. Intuitively, the lengths of the filters specify the minimal sizes of horizontal and vertical structures which should be considered as harmonic and percussive sounds in the STFT of x, respectively. Our experiments have shown that the filter lengths actually do not influence the decomposition too much as long as no extreme values are chosen, see also [1]. The frame size N, on the other hand, pushes the overall energy of the input signal towards one of the components. For large frame sizes, the short percussive sounds lose influence in the spectral representation and more energy is assigned to the harmonic component. This results in a leakage of some percussive sounds to the harmonic component.
Vice versa, for small frame sizes the low frequency resolution often leads to a blurring of horizontal structures, and harmonic sounds tend to leak into the percussive component. The separation factor β shows a different behavior to the previous parameters. The larger its value, the clearer becomes the harmonic and percussive nature of the components xh and xp . Meanwhile, also the portion of the signal that is assigned to the residual component xr increases. To illustrate this behavior, let us consider a first synthetic example where we apply our proposed procedure to the mixture of a violin (clearly harmonic), castanets (clearly percussive), and applause (neither harmonic nor percussive), all sampled at 22050 Hertz and having the same energy. In Figure 3, we visualized the relative energy distribution of the three components for varying frame sizes N and separation factors β, while fixing the length of the median filters to be always equivalent to 200 milliseconds in horizontal direction and 500 Hertz in vertical direction, see also [1]. Since the energy of all three signals is normalized, potential leakage between the components is indicated by components that have either more or less than a third of the overall energy assigned. Considering Fitzgerald’s procedure [5] as a baseline (β=1), we can investigate In Section 3.1 we show how harmonic-percussive separation can be extended with a third residual component. Afterwards, in Section 3.2, we show how the parameters of the proposed procedure influence the decomposition results. Finally, in Section 3.3, we present an iterative decomposition procedure which allows for a more flexible adjustment of the decomposition results. 3.1 Basic Procedure and Related Work The concept presented in Section 2 allows us to extend the decomposition procedure with a third component xr , called the residual component. It contains the portion of the input signal x that is neither part of the harmonic component xh nor the percussive components xp . To compute xr , we define the binary mask Mr (t, k) := 1 − Mh (t, k) + Mp (t, k) , apply it to X, and transform the resulting spectrogram Xr back to the time-domain (note that the masks Mh and Mp are disjoint). This decomposition into three components is inspired by the STN audio model. Here, an audio signal is analyzed to yield parameters for sinusoidal, transient, and noise components which can then be used to approximately resynthesize the original signal [8, 11]. While the main application of the STN model lies in the field of low bitrate audio coding, the estimated parameters can also be used to synthesize just the sinusoidal, the transient, or the noise component of the approximated signal. The harmonic, the percussive, and the residual component resulting from our proposed decomposition procedure are often perceptually similar to the STN components. However, our proposed procedure is conceptually different. STN modeling aims for a parametrization of the given audio signal. While the estimated parameters constitute a compact approximation of the input signal, this approximation and 613 15th International Society for Music Information Retrieval Conference (ISMIR 2014) its behavior by looking at the first columns of the matrices in Figure 3. While the residual component has zero energy in this setting, one can observe by listening that the applause is more or less equally distributed between the harmonic and the percussive component for medium frame sizes. 
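Continuing the sketch introduced after Section 2 (and under the same assumptions), the energy distribution discussed here and visualized in Figure 3 can be tabulated roughly as follows; the function hpr_energy_shares and its default grids are ours:

    import numpy as np
    from scipy.signal import istft

    def hpr_energy_shares(x, sr, frame_sizes=(128, 256, 1024, 4096),
                          betas=(1.0, 1.5, 2.0, 3.0)):
        """Relative energy of the harmonic, residual and percussive components
        over frame size N and separation factor beta, in the spirit of the
        Figure 3 experiment (uses the tightened_hp_separation sketch above)."""
        shares = {}
        for N in frame_sizes:
            for beta in betas:
                x_h, x_p, X, M_h, M_p = tightened_hp_separation(x, sr, N=N, beta=beta)
                M_r = ~(M_h | M_p)                    # residual mask: 1 - (M_h + M_p)
                H = N // 4
                _, x_r = istft(X * M_r, fs=sr, nperseg=N, noverlap=N - H)
                e = np.array([np.sum(x_h ** 2), np.sum(x_r ** 2), np.sum(x_p ** 2)])
                shares[(N, beta)] = e / e.sum()       # fractions (harmonic, residual, percussive)
        return shares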
This is also reflected in Figure 3a/c by the energy being split up roughly into equal portions. For very large N , most of the signal’s energy moves towards the harmonic component (value close to one in Figure 3a for β=1, N =4096), while for very small N , the energy is shifted towards the percussive component (value close to one in Figure 3c for β=1, N =128). With increasing β, one can observe how the energy gathered in the harmonic and the percussive component flows towards the residual component (decreasing values in Figure 3a/c and increasing values in Figure 3b for increasing β). Listening to the decomposition results shows that the harmonic and the percussive component thereby become more and more extreme in their respective characteristics. For medium frame sizes, this allows us to find settings that lead to decompositions in which the harmonic component contains the violin, the percussive component contains the castanets, and the residual contains the applause. This is reflected by Figure 3, where for N =1024 and β=2 the three sound components all hold roughly one third of the overall energy. For very large or very small frame sizes it is not possible to get such a good decomposition. For example, considering β=1 and N =4096, we already observed that the harmonic component holds most of the signal’s energy and also contains some of the percussive sounds. However, already for small β > 1 these percussive sounds are shifted towards the residual component (see the large amount of energy assigned to the residual in Figure 3b for β=1.5, N =4096). Furthermore, also the energy from the percussive component moves towards the residual. The large frame size therefore results in a very clear harmonic component while the residual holds both the percussive as well as all other non-harmonic sounds, leaving the percussive component virtually empty. For very small N the situation is exactly the other way around. This observation can be exploited to define a refined decomposition procedure which we discuss in the next section. Figure 4. Overview of the refined procedure. (a): Input signal x. (b): First run of the decomposition procedure using a large frame size Nh and a separation factor βh . (c): Second run of the decomposition procedure using a small frame size Np and a separation factor βp . Figure 5. Energy distribution between the harmonic, residual, and percussive components for different separation factors βh and βp . (a): Harmonic components. (b): Residual components. (c): Percussive components. presented in Section 3.1. So far, although it is possible to find a good combination of N and β such that both the harmonic as well as the percussive component represent the respective characteristics of the input signal well (see Section 3.2), the computation of the two components is still coupled. It is therefore not clear how to adjust the content of the harmonic and the percussive component independently. Having made the observation that large N lead to good harmonic but poor percussive/residual components for β>1, while small N lead to good percussive components but poor harmonic/residual components for β>1, we build on the idea from Tachibana et al. [13] and compute the decomposition in two iterations. Here, the goal is to decouple the computation of the harmonic component from the computation of the percussive component. First, the harmonic component is extracted by applying our basic procedure with a large frame size Nh and a separation facfirst tor βh >1, yielding xfirst and xfirst p . 
In a second run, h , xr 3.3 Iterative Procedure In [13], Tachibana et al. described a method for the extraction of human singing voice from music recordings. In this algorithm, the singing voice is estimated by iteratively applying the harmonic-percussive decomposition procedure described in [9] first to the input signal and afterwards again to one of the resulting components. This yields a decomposition of the input signal into three components, one of which containing the estimate of the singing voice. The core idea of this algorithm is to perform the two harmonicpercussive separations on spectrograms with two different time-frequency resolutions. In particular, one of the spectrograms is based on a large frame size and the other on a small frame size. Using this idea, we now extend our proposed harmonic-percussive-residual separation procedure 614 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Violin -3.10 -5.85 Castanets -2.93 Applause -3.04 3.58 − HPR-IO HPR-I HPR HP-I HP BL HPR-IO HPR-I SAR HPR HP-I HP BL HPR-IO HPR-I SIR HPR HP-I HP BL SDR 0.08 8.23 7.65 8.85 -3.10 -5.09 1.08 17.69 14.58 21.65 274.25 8.33 9.44 8.82 8.78 9.11 2.86 8.29 9.14 9.28 -2.93 10.45 22.34 20.66 24.41 274.25 8.14 4.07 8.49 9.50 9.44 -7.03 4.25 4.93 5.00 -3.04 6.06 − 14.69 8.41 12.80 9.04 274.25 − -6.85 6.95 5.93 7.69 Table 1. Objective evaluation measures. All values are given in dB. the procedure is applied again to the sum xfirst + xfirst r p , this time using a small frame size Np and a second separation factor βp >1. This yields the components xsecond , xsecond r h second and xp . Finally, we define the output components of the procedure to be the castanets, and the applause signal represent the characteristics that we would like to capture in the harmonic, the percussive, and the residual components, respectively, we treated the decomposition task of this mixture as a source separation problem. In an optimal decomposition the harmonic component would contain the original violin signal, the percussive component the castanets signal, and the residual component the applause. To evaluate the decomposition quality, we computed the source to distortion ratios (SDR), the source to interference ratios (SIR), and the source to artifacts ratios (SAR) [15] for the decomposition results of the following procedures. As a baseline (BL), we simply considered the original mixture as an estimate for all three sources. Furthermore, we applied the standard harmonic-percussive separation procedure by Fitzgerald [5] (HP) with the frame size set to N =1024, the HP method applied iteratively (HP-I) with Nh =4096 and Np =256, the proposed basic harmonic-percussive-residual separation procedure (HPR) as described in Section 3.1 with N =1024 and β=2, and the proposed iterative harmonic-percussive-residual separation procedure (HPR-I) as described in Section 3.3 with Nh =4096, Np =256, and βh =βp =2. As a final method, we also considered HPR-I with separation factor βh =3 and βp =2.5, which were optimized manually for the task at hand (HPR-IO). The filter lengths in all procedures were always fixed to be equivalent to 200 milliseconds in time direction and 500 Hertz in frequency direction. Decomposition results for all procedures can be found at [1]. The results are listed in Table 1. All values are given in dB and higher values indicate better results. 
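A rough sketch of the two-pass procedure and of the evaluation setup described above, again reusing the hypothetical tightened_hp_separation helper from the Section 2 sketch. The helper names, the trimming to a common length, and the use of the mir_eval package (one publicly available implementation of the BSS Eval measures of Vincent et al. [15]) are our own choices; the parameter defaults mirror the values reported in the text:

    import numpy as np
    from scipy.signal import istft

    def hpr(x, sr, N, beta):
        """One pass of harmonic/residual/percussive separation
        (reuses the tightened_hp_separation sketch from Section 2)."""
        x_h, x_p, X, M_h, M_p = tightened_hp_separation(x, sr, N=N, beta=beta)
        _, x_r = istft(X * ~(M_h | M_p), fs=sr, nperseg=N, noverlap=N - N // 4)
        return x_h, x_r, x_p

    def hpr_iterative(x, sr, N_h=4096, N_p=256, beta_h=2.0, beta_p=2.0):
        """Two-pass HPR-I: a large-frame pass fixes the harmonic component,
        a small-frame pass on the remainder fixes the percussive component."""
        L = len(x)
        h1, r1, p1 = hpr(x, sr, N_h, beta_h)          # first run, large frames
        rest = (r1 + p1)[:L]
        h2, r2, p2 = hpr(rest, sr, N_p, beta_p)       # second run, small frames
        x_h, x_r, x_p = h1[:L], (h2 + r2)[:L], p2[:L]
        return x_h, x_r, x_p

    # SDR / SIR / SAR against the known violin, castanets and applause stems:
    # import mir_eval
    # sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    #     np.vstack([violin, castanets, applause]), np.vstack([x_h, x_p, x_r]))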
As expected, BL yields rather low SDR and SIR values for all components, while the SAR values are excellent since there are no artifacts present in the original mixture. The method HP yields low evaluation measures as well. However, these values are to be taken with care since HP decomposes the input mixture in just a harmonic and a percussive component. The applause is therefore not estimated explicitly and, as also discussed in Section 2, randomly distributed among the harmonic and percussive component. It is therefore clear that especially the SIR values are low in comparison to the other procedures since the applause heavily interferes with the remaining two sources in the computed components. When looking at HP-I, the benefit of having a third component becomes clear. Although here the residual component does not capture the applause very well (SDR of −7.03 dB) this already suf- second + xsecond , xp := xsecond . xh := xfirst h , xr := xh r p For an overview of the procedure see Figure 4. While fixing the values of Nh and Np to a small and a large frame size, respectively (in our experiments we chose Nh =4096 and Np =256), the separation factors βh and βp yield handles that give simple and independent control over the harmonic and percussive component. Figure 5, which is based on the same audio example as Figure 3, shows the energy distribution among the three components for different combinations of βh and βp , see also [1]. For the harmonic components (Figure 5a) we see that the portion of the signals energy contained in this component is independent of βp and can be controlled purely by βh . This is a natural consequence from the fact that in our proposed procedure the harmonic component is always computed directly from the input signal x and βp does not influence its computation at all. However, we can also observe that the energy contained in the percussive component (Figure 5c) is fairly independent of βh and can be controlled almost solely by βp . Listening to the decomposition results confirms these observations. Our proposed iterative procedure therefore allows to adjust the harmonic and the percussive component almost independently what significantly simplifies the process of finding an appropriate parameter setting for a given input signal. Note that in principle it would also be possible to choose βh =βp =1, resulting in an iterative application of Fitzgerald’s method [5]. However, as discussed in Section 3.2, Fitzgerald’s method suffers from component leakage when using very large or small frame sizes. Therefore, most of the input signal’s energy will be assigned to the harmonic component in the first iteration of the algorithm, while most of the remaining portion of the signal is assigned to the percussive component in the second iteration. This leads to a very weak, although not empty, residual component. 4. EVALUATION In a first experiment, we applied objective evaluation measures to our running example. Assuming that the violin, 615 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Item name Description CastanetsViolinApplause Heavy Synthetic mixture of a violin, castanets and applause. Recording of heavily distorted guitars, a bass and drums. Excerpt from My Leather, My Fur, My Nails by the band Stepdad. Regular beat played on bongos. Monophonic melody played on a glockenspiel. Excerpt from “Gute Nacht” by Franz Schubert which is part of the Winterreise song cycle. It is a duet of a male singer and piano. 
Stepdad Bongo Glockenspiel Winterreise tics of the fine structure must remain constant on the large scale” [12]. In our opinion this is not a bad description of what one can hear in residual components. Acknowledgments: This work has been supported by the German Research Foundation (DFG MU 2686/6-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer IIS. Table 2. List of audio excerpts. 5. REFERENCES [1] J. Driedger, M. Müller, and S. Disch. Accompanying website: Extending harmonic-percussive separation of audio signals. http://www.audiolabs-erlangen.de/resources/ 2014-ISMIR-ExtHPSep/. fices to yield SDR and SIR values clearly above the baseline for the estimates of the violin and the castanets. The separation quality further improves when considering the results of our proposed method HPR. Here the evaluation yields high values for all measures and components. The very high SIR values are particularly noticeable since they indicate that the three sources are separated very clearly with very little leakage between the components. This confirms our claim that our proposed concept of a separation factor allows for tightening decomposition results as described in Section 2. The results of HPR-I are very similar to the results for the basic procedure HPR. However, listening to the decomposition reveals that the harmonic and the percussive component still contain some slight residue sounds of the applause. Slightly increasing the separation factors to βh =3 and βp =2.5 (HPR-IO) eliminates these residues and further increases the evaluation measures. This straight-forward adjustment is possible since the two separation factors βh and βp constitute independent handles to adjust the content of the harmonic and percussive component, what demonstrates the flexibility of our proposed procedure. The above described experiment constitutes a first case study for the objective evaluation of our proposed decomposition procedures, based on an artificially mixed example. To also evaluate these procedures on real-world audio data, we additionally performed an informal subjective listening tests with several test participants. To this end, we applied our procedures to the set of audio excerpts listed in Table 2. Among the excerpts are complex sound mixtures as well as purely percussive and harmonic signals, see also [1]. Raising the question whether the computed harmonic and percussive components meet the expectation of representing the clearly harmonic or percussive portions of the audio excerpts, respectively, the performed listening test confirmed our hypothesis. It furthermore turned out that βh =βp =2, Nh =4096 and Np =256 seems to be a setting for our iterative procedure which robustly yields good decomposition results, rather independent of the input signal. Regarding the residual component, it was often described to sound like a sound texture by the test participants, which is a very interesting observation. Although there is no clear definition of what a sound texture exactly is, literature states “sound texture is like wallpaper: it can have local structure and randomness, but the characteris- [2] J. Driedger, M. Müller, and S. Ewert. Improving time-scale modification of music signals using harmonic-percussive separation. Signal Processing Letters, IEEE, 21(1):105–109, 2014. [3] C. Duxbury, M. Davies, and M. Sandler. Separation of transient information in audio using multiresolution analysis techniques. 
In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, 12 2001. [4] C. Duxbury, M. Davies, and M. Sandler. Improved time-scaling of musical audio using phase locking at transients. In Audio Engineering Society Convention 112, 4 2002. [5] D. Fitzgerald. Harmonic/percussive separation using medianfiltering. In Proceedings of the International Conference on Digital Audio Effects (DAFx), pages 246–253, Graz, Austria, 2010. [6] A. Gkiokas, V. Katsouros, G. Carayannis, and T. Stafylakis. Music tempo estimation and beat tracking by applying source separation and metrical relations. In ICASSP, pages 421–424, 2012. [7] D. W. Griffin and J. S. Lim. Signal estimation from modified shorttime Fourier transform. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(2):236–243, 1984. [8] S. N. Levine and J. O. Smith III. A sines+transients+noise audio representation for data compression and time/pitch scale modications. In Proceedings of the 105th Audio Engineering Society Convention, 1998. [9] N. Ono, K. Miyamoto, H. Kameoka, and S. Sagayama. A real-time equalizer of harmonic and percussive components in music signals. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 139–144, Philadelphia, Pennsylvania, USA, 2008. [10] N. Ono, K. Miyamoto, J. LeRoux, H. Kameoka, and S. Sagayama. Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In European Signal Processing Conference, pages 240–244, Lausanne, Switzerland, 2008. [11] A. Petrovsky, E. Azarov, and A. Petrovsky. Hybrid signal decomposition based on instantaneous harmonic parameters and perceptually motivated wavelet packets for scalable audio coding. Signal Processing, 91(6):1489–1504, 2011. [12] N. Saint-Arnaud and K. Popat. Computational auditory scene analysis. chapter Analysis and synthesis of sound textures, pages 293–308. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1998. [13] H. Tachibana, N. Ono, and S. Sagayama. Singing voice enhancement in monaural music signals based on two-stage harmonic/percussive sound separation on multiple resolution spectrograms. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):228–237, January 2013. [14] Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono, and S. Sagayama. HMM-based approach for automatic chord detection using refined acoustic features. In ICASSP, pages 5518–5521, 2010. [15] E. Vincent, R. Gribonval, and C. Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006. 616 15th International Society for Music Information Retrieval Conference (ISMIR 2014) SINGING VOICE SEPARATION USING SPECTRO-TEMPORAL MODULATION FEATURES Frederick Yen Yin-Jyun Luo Master Program of SMIT National Chiao-Tung University, Taiwan Tai-Shih Chi Dept. of Elec. & Comp. Engineering National Chiao-Tung University, Taiwan {fredyen.smt01g,fredom.smt02g} @nctu.edu.tw [email protected] ABSTRACT An auditory-perception inspired singing voice separation algorithm for monaural music recordings is proposed in this paper. Under the framework of computational auditory scene analysis (CASA), the music recordings are first transformed into auditory spectrograms. 
After extracting the spectral-temporal modulation contents of the timefrequency (T-F) units through a two-stage auditory model, we define modulation features pertaining to three categories in music audio signals: vocal, harmonic, and percussive. The T-F units are then clustered into three categories and the singing voice is synthesized from T-F units in the vocal category via time-frequency masking. The algorithm was tested using the MIR-1K dataset and demonstrated comparable results to other unsupervised masking approaches. Meanwhile, the set of novel features gives a possible explanation on how the auditory cortex analyzes and identifies singing voice in music audio mixtures. 1. INTRODUCTION Over the past decade, the task of singing voice separation has gained much attention due to improvements in digital audio technologies. In the research field of music information retrieval (MIR), separated vocal signals or accompanying music signals can be of great use in many applications, such as singer identification, pitch extraction, and music genre classification. During the past few years, many algorithms have been proposed for this challenging task. These algorithms can be categorized into unsupervised and supervised approaches. The unsupervised approaches do not contain any training mechanism in the algorithms. For instance, Durrieu et al. used a source/filter signal model with nonnegative matrix factorization (NMF) to perform source separation [5] and Fitzgerald et al. used median filtering and factorization techniques to separate harmonic and percussive components in audio signals [7]. Some other unsupervised methods considered structural characteristics of vocals and music accompaniments in several domains for separation. For example, Pardo and Rafii proposed REPET which views the accompaniments as repeating background signals and vocals as the varying information lying on top of them [16]. Tachibana et al. pro© Frederick Yen, Yin-Jyun Luo, Tai-Shih Chi. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Frederick Yen, Yin-Jyun Luo, TaiShih Chi. “Singing Voice Separation using Spectro-Temporal Modulation Features”, 15th International Society for Music Information Retrieval Conference, 2014. 617 posed the separation technique, HPSS, to remove the harmonic and percussive instruments sequentially in a two-stage framework by considering the nature of fluctuations of audio signals [19]. Huang et al. used RPCA to present accompaniments in low-rank subspace and vocal in sparse representation [8]. In addition, some unsupervised CASA-based systems were proposed for singing voice separation by finding singing dominant regions on the spectrograms using pitch and harmonic information. For instance, Li and Wang proposed a CASA system obtaining binary masks using pitch-based inference [13]. Hsu and Jang extended the work and proposed a system for separating both voiced and unvoiced singing segments from the music mixtures [9]. Although training mechanisms were seen in these two systems, they were only for detecting voiced and unvoiced segments, but not for separation. In contrast, there were approaches based on supervised learning techniques. For example, Vembu et al. used vocal/non-vocal SVM and neural-network (NN) classifiers for vocal-nonvocal segmentation [20]. Ozerov et al. used a vocal/non-vocal classifier based on Bayesian modeling [15]. Another group of methods combined RPCA with training mechanisms. 
For instance, Yang’s low-rank representation method decomposed vocals and accompaniments using pre-trained low-rank matrices [22] and Sprechmann et al. proposed a real-time method using low-rank modeling with neural networks [17]. Although these supervised learning methods demonstrated very high performance, they usually offer a weaker conception of generality. Music instruments produce signals with various kinds of fluctuations such that they can be briefly categorized into two groups, percussive and harmonic. Signals produced by percussive instruments are more consistent along the spectral axis and by harmonic instruments are more consistent along the temporal axis with little or no fluctuations. These two categories occupy a large proportion of a spectrogram with mainly vertical and horizontal lines. To extend this sense into a more general form, the fluctuations can be viewed as a sum of sinusoid modulations along the spectral axis and the temporal axis. If a signal has nearly zero modulation along one of the two axes, its energy is smoothly distributed along that axis. Conversely, if a signal has a high frequency of modulation along one axis, then its energy becomes scattered along that axis. Therefore, if one can decipher the modulation status of a signal, one may be able to identify the instrument type of the signal. An algorithm utilizing mo- 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ributed over 5.3 octaves with the 24 filters/octave frequency resolution. These constant-Q filters mimic the frequency selectivity of the cochlea. Outputs of these filters are then transformed through a non-linear compression stage, a lateral inhibitory network (LIN), and a halfwave rectifier cascaded with a low-pass filter. The nonlinear compression stage models the saturation caused by inner hair cells, the LIN models the spectral masking effect, and the following stage serves as an envelope extractor to model the temporal dynamic reduction along the auditory pathway to the midbrain. Outputs of the module from different stages are formulated below: Figure 1. Stages of the cochlear module, adopted from [2]. dulation information can be seen in [1], where Barker et al. combined the modulation spectrogram (MS) with nonnegative tensor factorization (NTF) to perform speech separation from mixtures of speech and music. Although the above mentioned engineering approaches produce promising results, human’s tremendous ability in sound streams separation makes a biomimetic approach interesting to investigate. Based on neurophysiological evidences, it is suggested that neurons of the auditory cortex (A1) respond to both spectral modulations and temporal modulations of the input sounds. Accordingly, a computational auditory model was proposed to model A1 neurons as spectro-temporal modulation filters [2]. This concept of spectro-temporal modulation decomposition has inspired many approaches in various engineering topics, such as using spectro-temporal modulation features for speaker recognition [12], robust speech recognition [18], voice activity detection [10], and sound segregation [6]. Since modulations are important for music signal categorization, this modulation-decomposition auditory model is used as a pre-processing stage for singing voice separation in this paper. 
Our proposed unsupervised algorithm adapts this two-stage auditory model, which decodes the spectro-temporal modulations of a T-F unit, to extract modulation based features and performs singing voice separation under the CASA framework. This paper is organized as follows. A brief review of the auditory model is presented in Section 2. Section 3 describes the proposed method. Section 4 shows evaluation and results. Lastly, Section 5 draws the conclusion. c d f g $ hj k$l m (1) c7 d f g @no c d fA ho q (2) c^ d f g rsnt c7 d fd u (3) cv d f g c^ d f ho wl y (4) where z is the input signal; k$l m is the impulse response of the cochlear filter with center frequency m; hj denotes convolution in time; 炽 is the nonlinear compression function; no is the partial derivative of ; q is the membrane leakage low-pass filter; wl y g Po~ is the integration window with the time constant y to model current leakage of the midbrain; is the step function. Detailed descriptions of the cochlear module can be found in [2]. The output cv d f of the module is the auditory spectrogram, which represents the neuron activities along time and log-frequency axis. In this work, we bypass the non-linear compression stage by assuming input sounds are properly normalized without triggering the highvolume saturation effect of the inner hair cells. 2. SPECTRO-TEMPORAL AUDITORY MODEL A neuro-physiological auditory model is used to extract the modulation features. The model consists of an early cochlear (ear) module and a central auditory cortex (A1) module. 2.2 Cortical Module The second module simulates the neural responses of the auditory cortex (A1). The auditory spectrogram cv d f is analyzed by cortical neurons which are modeled by two-dimensional filters tuned to different spectrotemporal modulations. The rate parameter (in Hz) characterizes the velocity of local spectro-temporal envelope 2.1 Cochlear Module As shown in Figure 1, the input sound goes through 128 overlapping asymmetric constant-Q band-pass filters (] ^_` a b ) whose center frequencies are uniformly dist- 618 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 3. Block diagram of the proposed algorithm. harmonic, percussive and vocal. Harmonic components have steady energy distributions over time and have clear formant structures over frequency. Each percussive component has impulsive energy concentrated in a short period of time and has no obvious harmonic structure. Vocal components possess harmonic structure and their energy is distributed along various time periods. Interpreting the above statements from the rate-scale perspective, several general properties can be drawn. Harmonic components can be usually regarded as having low rate and high scale modulations. It means that they have relatively slow energy change along time and rapid energy change along the log-frequency axis due to the harmonic structures. In contrast, percussive components typically show quick energy change along time and energy spreading along the whole log-frequency axis, such that they possess high rate and low scale modulations. Vocal components are often recognized as a mix version of the harmonic and percussive components with characteristics sometimes considered more similar to harmonics. Different types of singing or vocal expression can result in various values of rate and scale. Figure 4 shows some examples of rate-scale plots of components from the three categories. 
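As a simplified stand-in for the cortical multi-resolution filter bank of [2] (not the model actually used in this paper), the rate-scale content of a local patch of the auditory spectrogram can be approximated by its 2-D Fourier transform; the function below and its argument names are ours:

    import numpy as np

    def rate_scale_energy(patch, frame_rate, bins_per_octave=24):
        """Crude rate-scale analysis of a T-F neighbourhood: the 2-D Fourier
        transform of a log-frequency spectrogram patch, as a simplified
        stand-in for the cortical filter bank of Chi et al. [2].
        patch : array of shape (channels, frames), channels at 24 per octave.
        Returns modulation energy indexed by (scale in cyc/oct, rate in Hz)."""
        P = patch - patch.mean()                       # remove DC before the transform
        M = np.fft.fftshift(np.fft.fft2(P))
        energy = np.abs(M) ** 2
        rates = np.fft.fftshift(np.fft.fftfreq(patch.shape[1], d=1.0 / frame_rate))
        scales = np.fft.fftshift(np.fft.fftfreq(patch.shape[0], d=1.0 / bins_per_octave))
        return energy, scales, rates

In this crude picture, energy concentrated at low rates and high scales indicates sustained harmonic structure, energy at high rates and low scales indicates transient, percussive structure, and the two half-planes of the transform separate upward- from downward-moving patterns, mirroring the qualitative description above.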
Given an auditory spectrogram cb * transformed from an input music signalz, the rate-scale plots of the T-F units are generated. As a pre-process, in order to prevent extracting trivial data from nearly inaudible T-F units of the auditory spectrogram, we leave out the T-F units that have energy less than 1% of the maximum energy of the whole auditory spectrogram. With the rest of the T-F units, we obtain the rate-scale plot of each unit and proceed to the feature extraction stage. For each rate-scale plot, the total energies of the negative and positive rate side are compared. The side with greater energy is determined as the dominant plot. From the dominant plot, we extract 11 features as shown in Table 1. The features are selected by observing the ratescale plots with some intuitive assumptions of the physical properties which distinguish between harmonic, percussive and vocal. The first 10 features are obtained by computing the energy ratio of two different areas on the rate-scale plot. For example, as shown in Table 1, the first feature is the ratio of the total modulation energy of scale = 1 to the total modulation energy of scale = 0.25. The low scales, such as 0.25 and 0.5, capture the degree of the Figure 2. Rate-scale outputs of the cortical module to two T-F units of the auditory spectrogram of the 'Ani_2_03.wav' vocal track in MIR-1K [9]. variation along the temporal axis. The scale parameter (in cycle/octave) characterizes the density of the local spectro-temporal envelope variation along the logfrequency axis. Furthermore, the cortical neurons are found sensitive to the direction of the spectro-temporal envelope. It is characterized by the sign of the rate parameter in this model, with negative for the upward direction and positive for the downward direction. From functional point of view, this module performs a spectro-temporal multi-resolution analysis on the input auditory spectrogram in various rate-scale combinations. Outputs of various cortical neurons to a single T-F unit of the spectrogram demonstrate the local spectro-temporal modulation contents of the unit in terms of the rate, scale and directionality parameters. Figure 2 shows rate-scale outputs of two T-F units in an auditory spectrogram of a vocal clip. The rate-scale output is referred to as the rate-scale plot in this paper. The rate and scale indices are P7 andG , respectively. The strong responses of the plots correspond to the variations of singing pitch envelopes resolved by the rate and scale parameters and the moving direction of the pitch. Detailed description of the cortical module is available in [3]. 3. PROPOSED METHOD A schematic diagram of the proposed algorithm is shown in Figure 3. The following sections will discuss each part in details. 3.1 Feature Extraction According to the spectral and temporal behaviors observed on the auditory spectrogram, components of a musical piece are characterized into three categories, 619 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Scale 1 : 0.25 2 : 0.25 4 : 0.25 8 : 0.25 (0.25, 2, 4) (0.25, 2, 4) (0.25, 2, 4) (0.25, 0.5) : all (1, 2) : all (4, 8) : all (0.25) Rate all all all all (1, 2) : (0.25, 0.5, 1, 2, 16, 32) (0.25, 0.5) : (0.25, 0.5, 1, 2, 16, 32) (16, 32) : (0.25, 0.5, 1, 2, 16, 32) all all all all Table 1. Eleven extracted modulation energy features Figure 4. (a) Rate-scale plot from the vocal track of ‘Ani_4_07’ in MIR-1K. 
The modulation energy is mostly concentrated in the middle and high scales for a unit with a clear harmonic structure. (b) Rate-scale plots from the accompanying music track of ‘Ani_4_07’. The upper plot shows energy concentrating at low rates for a sustained unit. The lower plot shows energy concentrating at high rates for a transient unit. flatness of the formant structure while the high scales, such as 1, 2, 4 and 8, capture the harmonicity with different frequency spacing between harmonics. Therefore, the first four features can be thought as descriptors which distinguish harmonic from percussive using spectral information. The fifth to the seventh features capture temporal information which can distinguish sustained units from transient units. The feature values are saved as feature vectors and then grouped as a feature matrix * for clustering, where is the number of features and is the number of total valid units in the auditory spectrogram. 3.2 Unsupervised Clustering In the unsupervised clustering stage, a spectrogram is divided into three parts and clustering is performed for each part. Based on hearing perception, the frequency resolution is higher at lower frequencies while the temporal resolution is higher at higher frequencies [14]. Due to the frequency resolution of the constant-Q cochlear filters/channels in the auditory model, the auditory spectrogram can only resolve about ten harmonics [11]. To handle different resolutions, the spectrogram is separated into three sub-spectrograms with overlapped frequency ranges. The three sub-spectrograms consist of channel 1 to channel 60, channel 46 to channel 75, and channel 61 to channel 128, respectively, with overlaps of 15 channels. 620 The clustering step is performed using the EM algorithm to group data into three unlabelled clusters. The EM algorithm assigns a probability set to each T-F unit showing its likelihood of belonging to each cluster. Note that in spectrogram representations, the sound sources are superimposed on top of each other. It implies that one TF unit may contain energy from more than one source. Therefore, in this work, if one T-F unit has a probability set in which the second highest probability is higher than 5%, that particular T-F unit will also be labelled to the second high probability cluster. It means one unit may eventually appear in more than one cluster. The parameter 5% was empirically determined. Each of the three sub-spectrograms is clustered into three groups. Total of nine groups are generated and merged back into three whole spectrograms by comparing the correlations of the overlapped channels between different groups. Each of the three whole spectrograms represents the extracted harmonic, percussive, and vocal part of the music mixture. With no prior information about the labels of the three whole spectrograms, the effective mean rate-scale plot of each spectrogram is examined. The effective mean ratescale plot is the mean of rate-scale plots of the T-F units with energy higher than 20% of the maximum energy in that spectrogram. The total modulation energy of rate = 1, 2 Hz and scale = 0.25, 2, 4 cycle/octave is calculated from the effective mean rate-scale plot and referred to as Ev, which is used as the criterion to select the vocal spectrogram. The one with the maximum Ev value is picked as the vocal spectrogram since Ev catches modulations related to the formant structure (scale = 0.25), the harmonic structure (scale = 2 and 4) and the singing rate (rate = 1 and 2) of singing voices. 
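As an illustration of the clustering stage described above, the following sketch uses scikit-learn's GaussianMixture as one standard EM implementation (not necessarily the authors' own); the function name, the feature-matrix orientation and the random seed are ours, while the 5% secondary-membership rule follows the text:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def cluster_tf_units(features, second_prob=0.05, seed=0):
        """EM clustering of valid T-F units into three unlabelled groups, with
        the soft secondary-membership rule described above (sketch only).

        features : array of shape (n_units, n_features), the 11 modulation
                   energy ratios per valid T-F unit.
        Returns a boolean membership matrix of shape (n_units, 3): each unit
        belongs to its best cluster and, if its second-best posterior exceeds
        second_prob, to that cluster as well."""
        gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=seed)
        post = gmm.fit(features).predict_proba(features)     # responsibilities, (n_units, 3)
        member = np.zeros_like(post, dtype=bool)
        order = np.argsort(post, axis=1)
        rows = np.arange(len(post))
        member[rows, order[:, -1]] = True                    # best cluster
        second = order[:, -2]
        runner_up = post[rows, second] > second_prob
        member[rows[runner_up], second[runner_up]] = True    # soft second membership
        return member

The nine groups obtained from the three sub-spectrograms are then merged by correlating their overlapping channels, and the merged spectrogram with the largest Ev value is taken as the vocal estimate, as described above.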
The vocal spectrogram is then synthesized to an estimated signal using the auditory model toolbox [24]. The nonlinear operation of the envelope extractor in the cochlear module makes perfect synthesis impossible, thus causing a general result of loss of higher frequencies of the signal. Detailed computations are shown in [2]. 4. EVALUATION RESULTS The MIR-1K [9] is used as the evaluation dataset. It cont- 15th International Society for Music Information Retrieval Conference (ISMIR 2014) mance to the masking-based REPET in all SNR conditions. When compared with the subspace RPCA method, our proposed method has comparable performance only in the -5 dB SNR condition. These results demonstrate the effectiveness of the spectral-temporal modulation features for analyzing music mixtures. As this proposed method only applies a simple EM algorithm for clustering, harmonic mismatches and artificial noises are yet to be discussed. The future work will be focused on applying more advanced classifiers for more accurate separations and adopting a two-stage mechanism like HPSS to discard percussive and harmonic components sequentially. The other potential work is to implement the proposed spectro-temporal modulation based method in the Fourier spectrogram domain [4] to mitigate synthesis errors injected by the projection-based reconstruction process of the auditory model. Figure 5. GNSDR comparison at voice-to-music ratio of -5, 0, and 5 dB with existing methods. ains 1000 WAV files of karaoke clips sung by amateur singers. The length of each clip is around 4~13 seconds. The vocal and music accompaniment parts were recorded in the right and the left channels separately. In this experiment, we mixed two channels in -5, 0, 5 dB SNR (signal to noise ratio, i.e., vocal to music accompaniment ratio) for test. To assess the quality of separation, the source-todistortion ratio (SDR) [21] is used as the objective measure. The ratios are computed by the BSS Eval toolbox v3.0 [23]. Following [9], we compute the normalized SDR (NSDR) and the weighted average of NSDR, the global NSDR (GNSDR), with the weighting proportional to the length of each file. To have a fair comparison, we compare our method with other unsupervised methods, which extract vocal clips only through one major stage. The compared algorithms are listed below: I. II. III. 6. ACKNOWLEDGEMENTS This research is supported by the National Science Council, Taiwan under Grant No NSC 102-2220-E-009-049 and the Biomedical Electronics Translational Research Center, NCTU. 7. REFERENCES [1] T. Barker and T. Virtanen, "Non-negative tensor factorization of modulation spectrograms for monaural sound source separation," Proc. of Interspeech, pp. 827-831, 2013. [2] T. Chi, P. Ru, and S. A. Shamma, "Multiresolution spectrotemporal analysis of complex sounds," J. Acoust. Soc. Am., Vol. 118, No. 2, pp. 887-906, 2005. [3] T. Chi, Y. Gao, M. C. Guyton, P. Ru, and S. Shamma, "Spectro-temporal modulation transfer functions and speech intelligibility," J. Acoust. Soc. Am., Vol. 106, No. 5, pp. 2719-2732, 1999. Hsu: the approach proposed in [9] that performs unvoiced sound separation combined with the pitch-based inference method in [13]. R (REPET with soft masking): the approach proposed in [16] that computes a repeating background structure and extract vocal with soft time-frequency masking. RPCA: a matrix decomposition method applying robust principal component analysis proposed by Huang et al. [8]. [4] T.-S. Chi and C.-C. 
Hsu, "Multiband analysis and synthesis of spectro-temporal modulations of Fourier spectrogram," J. Acoust. Soc. Am., Vol. 129, No. 5, pp. EL190-EL196, 2011. From Figure 5, we can observe that the proposed method has the highest performance tied with RPCA in the -5 dB SNR condition. In 0 and 5 dB SNR conditions, the performance of the proposed method is comparable to the performance of REPET. [5] J.-L. Durrieu, B. David, and G. Richard, "A musically motivated mid-level representation for pitch estimation and musical audio source separation, "IEEE J. of Selected Topics on Signal Process.," Vol. 5, No. 6, pp. 1180-1191, 2011. [6] M. Elhilali and S. A. Shamma, "A cocktail party with a cortical twist: how cortical mechanisms contribute to sound segregation, " J. Acoust. Soc. Am., Vol. 124, No. 6, pp. 3751-3771, 2008. 5. CONCLUSION In this paper, we propose a singing voice separation method utilizing the spectral-temporal modulations as clustering features. Based on the energy distributions on the rate-scale plots of T-F units, the vocal signal is extracted from the auditory spectrogram and the separation performance is evaluated using the MIR-1K dataset. Our proposed CASA-based masking method outperforms the CASA-based system in [9] and has comparable perfor- [7] D. FitzGerald and M. Gainza, "Single channel vocal separation using median filtering and factorization techniques," ISAST Trans. on Electron. and Signal Process., Vol. 4, No. 1, pp. 62-73 (ISSN 1797-2329), 2010. 621 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Based on Two-stage Harmonic/Percussive Sound Separation on Multiple Resolution Spectrograms," IEEE/ACM Trans. on Audio, Speech, and Language Process., Vol. 22, No. 1, pp. 228-237, 2014. [8] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," Porc. IEEE Int. Conf. on Acoust., Speech and Signal Process., pp. 57-60, 2012. [20] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 337–344, 2005. [9] C.-L. Hsu and J.-S. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. on Audio, Speech, and Language Process., Vol. 18, No. 2, pp. 310-319, 2010. [21] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. on Audio, Speech, and Language Process., Vol. 14, No. 4, pp. 1462-1469, 2006. [10] C.-C. Hsu, T.-E. Lin, J.-H. Chen, and T.-S. Chi, "Voice activity detection based on frequency modulation of harmonics," IEEE Int. Conf. on Acoust. , Speech and Signal Process., pp. 6679-6683, 2013. [22] Y. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 427-432, 2013. [11] D. Klein, and S. A. Shamma, "The case of the missing pitch templates: how harmonic templates emerge in the early auditory system," J. Acoust. Soc. Am., Vol. 107, No. 5, pp. 2631-2644, 2000. [23] http://bass-db.gforge.inria.fr/bss_eval/ [24] http://www.isr.umd.edu/Labs/NSL/nsl.html [12] H. Lei, B. T. Meyer, and N. Mirghafori, "Spectrotemporal Gabor features for speaker recognition," IEEE Int. Conf. on Acoust., Speech and Signal Process., pp. 4241-4244, 2012. [13] Y. Li and D. 
Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. on Audio, Speech, and Language Process., Vol. 15, No. 4, pp. 1475-1487, 2007. [14] B. C. J. Moore: An Introduction to the Psychology of Hearing 5th Ed., Academic Press, 2003. [15] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, “Adaptation of Bayesian models for single channel source separation and its application to voice / music separation in popular songs, "IEEE Trans. on Audio, Speech, and Language Process.," special issue on Blind Signal Proc. for Speech and Audio Applications, Vol. 15, No. 5, pp. 1564-1578, 2007. [16] Z. Rafii and B. Pardo, "REpeating Pattern Extraction Technique (REPET): A Simple Method for Music/Voice Separation," IEEE Trans. on Audio, Speech, and Language Process., Vol. 21, No. 1, pp. 73-84, 2013. [17] P. Sprechmann, A. Bronstein, and G. Sapiro, "Realtime online singing voice separation from monaural recordings using robust low-rank modeling," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 67–72, 2012. [18] R. M. Stern and N. Norgan, "Hearing is believing: biologically inspired methods for robust automatic speech recognition," IEEE Signal Process. Mag., Vol. 29, No. 6, pp. 34–43, 2012. [19] H. Tachibana, N. Ono, and S. Sagayama, "Singing Voice Enhancement in Monaural Music Signals 622 15th International Society for Music Information Retrieval Conference (ISMIR 2014) HARMONIC-TEMPORAL FACTOR DECOMPOSITION INCORPORATING MUSIC PRIOR INFORMATION FOR INFORMED MONAURAL SOURCE SEPARATION Tomohiko Nakamura† , Kotaro Shikata† , Norihiro Takamune† , Hirokazu Kameoka†‡ † Graduate School of Information Science and Technology, The University of Tokyo. ‡ NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation. {nakamura,k-shikata,takamune,kameoka}@hil.t.u-tokyo.ac.jp ABSTRACT For monaural source separation two main approaches have thus far been adopted. One approach involves applying non-negative matrix factorization (NMF) to an observed magnitude spectrogram, interpreted as a non-negative matrix. The other approach is based on the concept of computational auditory scene analysis (CASA). A CASAbased approach called the “harmonic-temporal clustering (HTC)” aims to cluster the time-frequency components of an observed signal based on a constraint designed according to the local time-frequency structure common in many sound sources (such as harmonicity and the continuity of frequency and amplitude modulations). This paper proposes a new approach for monaural source separation called the “Harmonic-Temporal Factor Decomposition (HTFD)” by introducing a spectrogram model that combines the features of the models employed in the NMF and HTC approaches. We further describe some ideas how to design the prior distributions for the present model to incorporate musically relevant information into the separation scheme. 1. INTRODUCTION Monaural source separation is a process in which the signals of concurrent sources are estimated from a monaural polyphonic signal and is one of fundamental objectives offering a wide range of applications such as music information retrieval, music transcription and audio editing. While we can use spatial cues for blind source separation with multichannel inputs, for monaural source separation we need other cues instead of the spatial cues. For monaural source separation two main approaches have thus far been adopted. One approach is based on the concept of computational auditory scene analysis (e.g., [7]). 
The auditory scene analysis process described by Bregman [1] involves grouping elements that are likely to have originated from the same source into a perceptual structure called an auditory stream. In [8, 10], an attempt has been made to imitate this process by clustering timefrequency components based on a constraint designed according to the auditory grouping cues (such as the har- monicity and the coherences and continuities of amplitude and frequency modulations). This method is called “harmonic-temporal clustering (HTC).” The other approach involves applying non-negative matrix factorization (NMF) to an observed magnitude spectrogram (time-frequency representation) interpreted as a non-negative matrix [19]. The idea behind this approach is that the spectrum at each frame is assumed to be represented as a weighted sum of a limited number of common spectral templates. Since the spectral templates and the mixing weights should both be non-negative, this implies that an observed spectrogram is modeled as the product of two non-negative matrices. Thus, factorizing an observed spectrogram into the product of two non-negative matrices allows us to estimate the unknown spectral templates constituting the observed spectra and decompose the observed spectra into components associated with the estimated spectral templates. The two approaches described above rely on different clues for making separation possible. Roughly speaking, the former approach focuses on the local time-frequency structure of each source, while the latter approach focuses on a relatively global structure of music spectrograms (such a property that a music signal typically consists of a limited number of recurring note events). Rather than discussing which clues are more useful, we believe that both of these clues can be useful for achieving a reliable monaural source separation algorithm. This belief has led us to develop a new model and method for monaural source separation that combine the features of both HTC and NMF. We call the present method “harmonic-temporal factor decomposition (HTFD).” The present model is formulated as a probabilistic generative model in such a way that musically relevant information can be flexibly incorporated into the prior distributions of the model parameters. Given the recent progress of state-of-the-art methods for a variety of music information retrieval (MIR)-related tasks such as audio key detection, audio chord detection, and audio beat tracking, information such as key, chord and beat extracted from the given signal can potentially be utilized as reliable and useful prior information for source separation. The inclusion of auxiliary information in the separation scheme is referred to as informed source separation and is gaining increasing momentum in recent years (see e.g., among others, [5,15,18,20]). This paper further describes some ideas how to design the prior distributions for the present model to incorporate musically relevant information. We henceforth denote the normal, Dirichlet and Poisson c Tomohiko Nakamura† , Kotaro Shikata† , Norihiro Takamune† , Hirokazu Kameoka†‡ . Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Tomohiko Nakamura† , Kotaro Shikata† , Norihiro Takamune† , Hirokazu Kameoka†‡ . “Harmonictemporal factor decomposition incorporating music prior information for informed monaural source separation”, 15th International Society for Music Information Retrieval Conference, 2014. 
623 15th International Society for Music Information Retrieval Conference (ISMIR 2014) distributions by N, Dir and Pois, respectively. at time t is expressed as a harmonically-spaced Gaussian mixture function. If we assume the additivity of power spectra, the power spectrogram of a superposition of K pitched sounds is given by the sum of Eq. (8) over k. It should be noted that this model is identical to the one employed in the HTC approach [8]. Although we have defined the spectrogram model above in continuous time and continuous log-frequency, we actually obtain observed spectrograms as a discrete timefrequency representation through computer implementations. Thus, we henceforth use Yl,m := Y(xl , tm ) to denote an observed spectrogram where xl (l = 1, . . . , L) and tm (m = 1, . . . , M) stand for the uniformly-quantized logfrequency points and time points, respectively. We will also use the notation Ωk,m and ak,n,m to indicate Ωk (tm ) and ak,n (tm ). 2. SPECTROGRAM MODEL OF MUSIC SIGNAL 2.1 Wavelet transform of source signal model As in [8], this section derives the continuous wavelet transform of a source signal. Let us first consider as a signal model for the sound of the kth pitch the analytic signal representation of a pseudo-periodic signal given by N ak,n (u)ej(nθk (u)+ϕk,n ) , (1) fk (u) = n=1 where u denotes the time, nθk (u) + ϕk,n the instantaneous phase of the n-th harmonic and ak,n (u) the instantaneous amplitude. This signal model implicitly ensures not to violate the ‘harmonicity’ and ‘coherent frequency modulation’ constraints of the auditory grouping cues. Now, let the wavelet basis function be defined by u − t 1 , (2) ψ ψα,t (u) = √ α 2πα where α is the scale parameter such that α > 0, t the shift parameter and ψ(u) the mother wavelet with the center frequency of 1 satisfying the admissibility condition. ψα,t (u) can thus be used to measure the component of period α at time t. The continuous wavelet transform of fk (u) is then defined by ∞ N ak,n (u)ej(nθk (u)+ϕk,n ) ψ∗α,t (u)du. (3) Wk (log α1 , t) = 2.2 Incorporating source-filter model The generating processes of many sound sources in real world can be explained fairly well by the source-filter theory. In this section, we follow the idea described in [12] to incorporate the source-filter model into the above model. Let us assume that each signal fk (u) within a short-time segment is an output of an all-pole system. That is, if we use fk,m [i] to denote the discrete-time representation of fk (u) within a short-time segment centered at time tm , fk,m [i] can be described as P βk,m [p] fk,m [i − p] + k,m [i], (9) βk,m [0] fk,m [i] = −∞ n=1 Since the dominant part of ψ∗α,t (u) is typically localized around time t, the result of the integral in Eq. (3) shall depend only on the values of θk (u) and ak,n (u) near t. By taking this into account, we replace θk (t) and ak,n (t) with zero- and first-order approximations around time t: ak,n (u) ak,n (t), θk (u) θk (t) + θ̇k (t)(u − t). (4) p=1 where i, k,m [i], and βk,m [p] (p = 0, . . . , P) denote the discrete-time index, an excitation signal, and the autoregressive (AR) coefficients, respectively. As we have already assumed in 2.1 that the F0 of fk,m [i] is eΩk,m , to make the assumption consistent, the F0 of the excitation signal k,m [i] must also be eΩk,m . We thus define k,m [i] as N Ωk,m vk,n,m e jne iu0 , (10) k,m [i] = Note that the variable θ̇k (u) corresponds to the instantaneous fundamental frequency (F0 ). 
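As a concrete illustration of the signal model in Eq. (1), the following minimal Python sketch (our own function and variable names, assuming a constant F0 and constant harmonic amplitudes) synthesises such an analytic pseudo-periodic signal; time-varying a_{k,n}(u) or theta_k(u) would simply replace the scalar arguments by arrays over the time axis u.

import numpy as np

def pseudo_periodic_signal(f0, amplitudes, phases, duration, fs=16000.0):
    """Analytic pseudo-periodic signal in the spirit of Eq. (1) for one pitch k."""
    u = np.arange(int(duration * fs)) / fs        # time axis u
    theta = 2.0 * np.pi * f0 * u                  # phase whose derivative is the fundamental (angular) frequency
    f = np.zeros_like(u, dtype=complex)
    for n, (a, phi) in enumerate(zip(amplitudes, phases), start=1):
        # n-th harmonic: a_{k,n} * exp(j(n*theta_k(u) + phi_{k,n}))
        f += a * np.exp(1j * (n * theta + phi))
    return f

# Example: an A4-like tone with three harmonics.
signal = pseudo_periodic_signal(440.0, amplitudes=[1.0, 0.5, 0.25],
                                phases=[0.0, 0.0, 0.0], duration=0.5)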
By undertaking the above approximations, applying the Parseval’s theorem, and putting x = log(1/α) and Ωk (t) = log θ̇k (t), we can further write Eq. (3) as N ak,n (t)Ψ∗ (ne−x+Ωk (t) )e j(nθk (t)+ϕk,n ) , (5) Wk (x, t) = n=1 where u0 denotes the sampling period of the discrete-time representation and vk,n,m denotes the complex amplitude of the nth partial. By applying the discrete-time Fourier transform (DTFT) to Eq. (9) and putting Bk,m (z) := βk,m [0] − βk,m [1]z−1 · · · − βk,m [P]z−P , we obtain √ N 2π vk,n,m δ(ω − neΩk,m u0 ), (11) Fk,m (ω) = Bk,m (e jω ) n=1 n=1 where x denotes log-frequency and Ψ the Fourier transform of ψ. Since the function Ψ can be chosen arbitrarily, as with [8], we employ the following unimodal real function whose maximum is taken at ω = 1: ⎧ (log ω)2 ⎪ ⎪e− 4σ2 ⎨ (ω > 0) . Ψ(ω) = ⎪ (6) ⎪ ⎩0 (ω ≤ 0) Eq. (5) can then be written as N (x−Ωk (t)−log n)2 ak,n (t)e− 4σ2 e j(nθk (t)+ϕk,n ) . Wk (x, t) = where Fk,m denotes the DTFT of fk,m , ω the normalized angular frequency, and δ the Dirac delta function. The inverse DTFT of Eq. (11) gives us another expression of fk,m [i]: N Ωk,m vk,n,m fk,m [i] = (12) e jne iu0 . Ωk,m jne u0 ) n=1 Bk,m (e By comparing Eq. (12) and the discrete-time representation of Eq. (1), we can associate the parameters of the source filter model defined above with the parameters introduced in 2.1 through the explicit relationship: vk,n,m . (13) |ak,n,m | = Bk,m (e jneΩk,m u0 ) (7) n=1 If we now assume that the time-frequency components are sparsely distributed so that the partials rarely overlap each other, |Wk (x, t)|2 is given approximately as N (x−Ωk (t)−log n)2 |ak,n (t)|2 e− 2σ2 . (8) |Wk (x, t)|2 n=1 2.3 Constraining model parameters The key assumption behind the NMF model is that the spectra of the sound of a particular pitch is expressed as This assumption means that the power spectra of the partials can approximately be considered additive. Note that a cutting plane of the spectrogram model given by Eq. (8) 624 15th International Society for Music Information Retrieval Conference (ISMIR 2014) a multiplication of time-independent and time-dependent factors. In order to extend the NMF model to a more reasonable one, we consider it important to clarify which factors involved in the spectra should be assumed to be timedependent and which factors should not. For example, the F0 must be assumed to vary in time during vibrato or portamento. Of course, the scale of the spectrum should also be assumed to be time-varying (as with the NMF model). On the other hand, the timbre of an instrument can be considered relatively static throughout an entire piece of music. We can reflect these assumptions in the present model in the following way. For convenience of the following analysis, we factorize |ak,n,m | into the product of two variables, wk,n,m and Uk,m |ak,n,m | = wk,n,m Uk,m . (14) wk,n,m can be interpreted as the relative power of the nth harmonic and Uk,m as the time-varying normalized ampli tude of the sound of the kth pitch such that k,m Uk,m = 1. In the same way, let us put vk,n,m as vk,n,m = w̃k,n,m Uk,m . (15) 7000 Frequency [Hz] 3929 1238 695 390 0 0.37 0.73 1.1 1.46 1.83 Time [s] Figure 1. Power spectrogram of a violin vibrato sound. p(w|w̃, β, Ω) expressed by the Dirac delta function w̃k,n,m . δ wk,n,m − p(w|w̃, β, Ω) = Bk (e jneΩk,m u0 ) (20) k,n,m The conditional distribution p(w|β, Ω) can thus be obtained by defining the distribution p(w̃) and marginalizing over w̃. 
If we now assume that the complex amplitude w̃k,n,m follows a circular complex normal distribution (21) w̃k,n,m ∼ NC (w̃k,n,m ; 0, ν2 ), n = 1, . . . , N, Since the all-pole spectrum 1/|Bk,m (e jω )|2 is related to the timbre of the sound of the kth pitch, we want to constrain it to be time-invariant. This can be done simply by eliminating the subscript m. Eq. (13) can thus be rewritten as w̃k,n,m . (16) wk,n,m = Bk (e jneΩk,m u0 ) where NC (z; 0, ξ 2 ) = e−|z| /ξ /(πξ2 ), we can show, as in [12], that wk,n,m follows a Rayleigh distribution: 2 2 wk,n,m ∼ Rayleigh(wk,n,m ; ν/|Bk (e jne Ωk,m u 0 )|), (22) −z2 /(2ξ2 ) where Rayleigh(z; ξ) = (z/ξ )e . This defines the conditional distribution p(w|β, Ω). The F0 of stringed and wind instruments often varies continuously over time with musical expressions such as vibrato. For example, the F0 of a violin sound varies periodically around the note frequency during vibrato, as depicted in Fig. 1. Let us denote the standard log-F0 corresponding to the kth note by μk . To appropriately describe the variability of an F0 contour in both the global and local time scales, we design a prior distribution for Ωk := (Ωk,1 , Ωk,2 , . . . , Ωk,M )T by employing the productof-experts (PoE) [6] concept using two probability distributions. First, we design a distribution qg (Ωk ) describing how likely Ωk,1 , . . . , Ωk,L stay near μk . Second, we design another distribution ql (Ωk ) describing how likely Ωk,1 , . . . , Ωk,L are locally continuous along time. Here we define qg (Ωk ) and ql (Ωk ) as 2 We can use Ωk,m as is, since it is already dependent on m. To sum up, we obtain a spectrogram model Xl,m as ⎞ ⎛ N K (x −Ω −log n)2 ⎟ ⎜⎜⎜ 2 ⎟⎟⎟ − l k,m 2 ⎜ 2σ Ck,l,m , Ck,l,m = ⎜⎝ wk,n,m e Xl,m = ⎟⎠ Uk,m , n=1 k=1 2205 Hk,l,m (17) where Ck,l,m stands for the spectrogram of the kth pitch. If we denote the term insidethe parenthesis by Hk,l,m , Xl,m can be rewritten as Xl,m = k Hk,l,m Uk,m and so the relation to the NMF model may become much clearer. 2.4 Formulating probabilistic model Since the assumptions and approximations we made so far do not always hold exactly in reality, an observed spectrogram Yl,m may diverge from Xl,m even though the parameters are optimally determined. One way to simplify the process by which this kind of deviation occurs would be to assume a probability distribution of Yl,m with the expected value of Xl,m . Here, we assume that Yl,m follows a Poisson distribution with mean Xl,m Yl,m ∼ Pois(Yl,m ; Xl,m ), (18) qg (Ωk ) = N(Ωk ; μk 1 M , υ2k I M ), ql (Ωk ) = N(Ω k ; 0 M , τ2k D−1 ), (23) (24) ⎤ ⎡ ⎢⎢⎢ 1 −1 0 0 · · · 0 ⎥⎥⎥ ⎢⎢⎢−1 2 −1 0 · · · 0 ⎥⎥⎥ ⎥ ⎢⎢⎢ ⎢⎢⎢ 0 −1 2 −1 · · · 0 ⎥⎥⎥⎥⎥ ⎢ (25) D = ⎢⎢⎢ .. . . . . . . .. ⎥⎥⎥⎥⎥ , ⎢⎢⎢ . . . . . ⎥⎥ ⎥⎥⎥ ⎢⎢⎢ ⎢⎢⎣ 0 · · · 0 −1 2 −1⎥⎥⎦ 0 · · · 0 0 −1 1 where I M denotes an M × M identity matrix, D an M × M band matrix, 1 M an M-dimensional all-one vector, and 0 M an M-dimensional all-zero vector, respectively. υk denotes the standard deviation from mean μk , and τk the standard deviation of the F0 jumps between adjacent frames. The prior distribution of Ωk is then derived as (26) p(Ωk ) ∝ qg (Ωk )αg ql (Ωk )αl where αg and αl are the hyperparameters that weigh the where Pois(z; ξ) = ξ z e−ξ /Γ(z). This defines our likelihood function Pois(Yl,m ; Xl,m ), (19) p(Y|θ) = l,m where Y denotes the set consisting of Yl,m and Θ the entire set consisting of the unknown model parameters. 
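For concreteness, a minimal sketch (our own names, not the authors' implementation) of the product-of-experts prior on the F0 contour defined in Eqs. (23)–(26): it builds the band matrix D of Eq. (25) and evaluates the unnormalised log-prior. The combined precision (alpha_g / upsilon_k^2) I_M + (alpha_l / tau_k^2) D is the matrix that reappears in the update of Omega_k later in the paper.

import numpy as np

def second_difference_matrix(M):
    """The M x M band matrix D of Eq. (25)."""
    D = 2.0 * np.eye(M) - np.eye(M, k=1) - np.eye(M, k=-1)
    D[0, 0] = D[-1, -1] = 1.0
    return D

def log_f0_prior(omega_k, mu_k, upsilon_k, tau_k, alpha_g, alpha_l):
    """Unnormalised log of the PoE prior of Eq. (26):
    alpha_g * log q_g(Omega_k) + alpha_l * log q_l(Omega_k), constants dropped."""
    M = len(omega_k)
    D = second_difference_matrix(M)
    log_qg = -0.5 * np.sum((omega_k - mu_k) ** 2) / upsilon_k ** 2   # stays near mu_k
    log_ql = -0.5 * (omega_k @ D @ omega_k) / tau_k ** 2             # local continuity along time
    return alpha_g * log_qg + alpha_l * log_ql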
It should be noted that the maximization of the Poisson likelihood with respect to Xl,m amounts to optimally fitting Xl,m to Yl,m by using the I-divergence as the fitting criterion. Eq. (16) implicitly defines the conditional distribution 625 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Ak = (Ak,1 , . . . , Ak,M )T . Here we introduce Dirichlet distributions: Ak ∼ Dir( Ak ; γ(A) R ∼ Dir(R; γ(R) ), (28) k ), " ξi (A) (A) (A) T where Dir(z; ξ) ∝ i zi , γk := (γk,1 , . . . , γk,M ) , and (R) T γ(R) := (γ1(R) , . . . , γ(R) K ) . For p(R), we set γk at a reasonably high value if the kth pitch is contained in the scale and (A) < 1 so that the Dirichlet vice versa. For p( Ak ), we set γk,m distribution becomes a sparsity inducing distribution. contributions of qg (Ωk ) and ql (Ωk ) to the prior distribution. 2.5 Relation to other models It should be noted that the present model is related to other models proposed previously. If we do not assume a parametric model for Hk,l,m and treat each Hk,l,m itself as the parameter, the spectrogram model Xl,m can be seen as an NMF model with timevarying basis spectra, as in [14]. In addition to this assumption, if we assume that Hk,l,m is time-invariant (i.e., Hk,l,m = Hk,l ), Xl,m reduces to the regular NMF model [19]. Furthermore, if we assume each basis spectrum to have a harmonic structure, Xl,m becomes equivalent to the harmonic NMF model [16, 21]. If we assume that Ωk,m is equal over time m, Xl,m reduces to a model similar to the ones described in [17, 22]. Furthermore, if we describe Uk,m using a parametric function of m, Xl,m becomes equivalent to the HTC model [8, 10]. With a similar motivation, Hennequin et al. developed an extension to the NMF model defined in the short-time Fourier transform domain to allow the F0 of each basis spectrum to be time-varying [4]. 4. PARAMETER ESTIMATION ALGORITHM Given an observed power spectrogram Y := {Yl,m }l,m , we would like to find the estimates of Θ := {Ω, w, β, V, R, A} that maximizes the posterior density p(Θ|Y) ∝ p(Y|Θ)p(Θ). We therefore consider the problem of maximizing L(Θ) := ln p(Y|Θ) + ln p(Θ), (29) with respect to Θ where # $ Yl,m ln Xl,m − Xl,m (30) ln p(Y|Θ) = c l,m ln p(Θ) = ln p(w|β, Ω) + 3. INCORPORATION OF AUXILIARY INFORMATION 3.1 Use of musically relevant information We consider using side-information obtained with the state-of-the-art methods for MIR-related tasks including key detection, chord detection and beat tracking to assist source separation. When multiple types of side-information are obtained for a specific parameter, we can combine the use of the mixture-of-experts and PoE [6] concepts according to the “AND” and “OR” conditions we design. For example, pitch occurrences typically depend on both the chord and key of a piece of music. Thus, when the chord and key information are obtained, we may use the product-of-experts concept to define a prior distribution for the parameters governing the likeliness of the occurrences of the pitches. In the next subsection, we describe specifically how to design the prior distributions. + ln p(R) + ln p(Ωk ) k ln p( Ak ). (31) k =c denotes equality up to constant terms. Since the first term of Eq. (30) involves summation over k and n, analytically solving the current maximization problem is intractable. However, we can develop a computationally efficient algorithm for finding a locally optimal solution based on the auxiliary function concept, by using a similar idea described in [8, 12]. 
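To make the relation between the spectrogram model and the fitting criterion explicit, the sketch below builds X_{l,m} of Eq. (17) from (w, Omega, U) and evaluates the Poisson log-likelihood of Eq. (30) up to constants, i.e. the I-divergence criterion mentioned above. It is a minimal illustration with our own array layout, not the authors' implementation.

import numpy as np

def model_spectrogram(x, omega, w, U, sigma):
    """Spectrogram model X[l, m] of Eq. (17).
    x: (L,) log-frequency grid, omega: (K, M) log-F0 contours,
    w: (K, N, M) harmonic weights, U: (K, M) normalised amplitudes."""
    K, N, M = w.shape
    n = np.arange(1, N + 1)
    # distance of each log-frequency bin from each partial centre Omega_{k,m} + log n
    d = (x[None, :, None, None]
         - omega[:, None, None, :]
         - np.log(n)[None, None, :, None])           # shape (K, L, N, M)
    H = np.sum(w[:, None, :, :] ** 2 * np.exp(-d ** 2 / (2 * sigma ** 2)), axis=2)  # (K, L, M)
    return np.sum(H * U[:, None, :], axis=0)          # (L, M)

def neg_poisson_loglik(Y, X, eps=1e-12):
    """Minus log p(Y|Theta) of Eq. (30) up to terms constant in X;
    minimising it is the I-divergence fit of X to Y."""
    return np.sum(X - Y * np.log(X + eps))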
When applying an auxiliary function approach to a certain maximization problem, the first step is to define a lower bound function for the objective function. As mentioned earlier, the difficulty with the current maximization problem lies in the first term in Eq. (30) . By using the fact that the logarithm function is a concave function, we can invoke the Jensen’s inequality 3.2 Designing prior distributions The likeliness of the pitch occurrences in popular and classical western music usually depend on the key or the chord used in that piece. The likeliness of the pitch occurrences can be described as a probability distribution over the relative energies of the sounds of the individual pitches. Since the number of times each note is activated is usually limited, inducing sparsity to the temporal activation of each note event would facilitate the source separation. The likeliness of the number of times each note is activated can be described as well as a probability distribution over the temporal activations of the sound of each pitch. To allow for designing such prior distributions, we dethe pitchcompose Uk,m as the product of two variables: wise relative energy Rk = m Uk,m (i.e. k Rk = 1), and the pitch-wise normalized amplitude Ak,m = Uk,m /Rk (i.e. m Ak,m = 1). Hence, we can write (27) Uk,m = Rk Ak,m . This decomposition allows us to incorporate different kinds of prior information into our model by separately defining prior distributions over R = (R1 , . . . , RK )T and Yl,m ln Xl,m ≥ Yl,m λk,n,l,m ln w2k,n,m e− k,n (xl −Ωk,m −log n)2 2σ2 λk,n,l,m Uk,m , (32) to obtain a lower bound function, where λk,n,l,m is a positive variable that sums to unity: k,n λk,n,l,m = 1. Equality of (32) holds if and only if λk,n,l,m = w2k,n,m e− (xl −Ωk,m −log n)2 2σ2 Xl,m Uk,m . (33) Although one may notice that the second term in Eq. (30) is nonlinear in Ωk,m , the summation of Xl,m over fairly well using the integral % ∞ l can be approximated X(x, t )dx, since X is the sum of the values at the m l,m l −∞ sampled points X(x1 , tm ), . . . , X(xL , tm ) with an equal interval, say Δ x . Hence, ∞ 1 Xl,m X(x, tm )dx Δ x −∞ l ∞ (x−Ω −log n)2 k,m 1 2 = wk,n,m Uk,m e− 2σ2 dx Δ x k,n −∞ 626 15th International Society for Music Information Retrieval Conference (ISMIR 2014) √ 2πσ Uk,m w2k,n,m . Δx n k (34) 5555 Frequency [Hz] = This approximation implies that the second term in Eq. (30) depends little on Ωk,m . An auxiliary function can thus be written as + L (Θ, λ) = Yl,m λk,n,l,m ln w2k,n,m e− −ln n)2 (xl −Ωk,m 2σ2 λk,n,l,m l,m k,n √ 2πσ − Uk,m w2k,n,m + ln p(Θ). Δx m n k c Uk,m 1750 551 174 2 4 6 8 Time [s] (35) Figure 2. Power spectrogram of a mixed audio signal of three violin vibrato sounds (D4, F4 and A4). We can derive update equations for the model parameters, using the above auxiliary function. By setting at zero the partial derivative of L+ (Θ, λ) with respect to each of the model parameters, we obtain l Yl,m λk,n,l,m + 1/2 , (36) w2k,n,m ← √ Ω 2πRk Ak,m σ/Δ x + ν2 /(2|Bk (e jne k,m u0 )|2 ) ⎛ ⎞−1 ⎜⎜⎜ αl ⎟⎟⎟ α g Ωk ← ⎜⎜⎜⎝ 2 D + 2 I M + diag( pk,n,l )⎟⎟⎟⎠ τ υk n,l ⎛ ⎞ ⎜⎜⎜ αg ⎟⎟⎟ (37) × ⎜⎜⎝⎜μk 2 1 M + (xl − ln n)pk,n,l ⎟⎟⎠⎟ , υk n,l (R) l,m Yl,m n λk,n,l,m + γk − 1 Rk ∝ , (38) 2 m,n Ak,m wk,m,n (A) l Yl,m n λk,n,l,m + γk,m − 1 Ak,m ∝ , (39) Rk n w2k,m,n ' 1 & pk,n,l := 2 Yl,1 λk,n,l,1 , Yl,2 λk,n,l,2 , · · · , Yl,M λk,n,l,M , (40) σ were artificially made by mixing D4, F4 and A4 violin vibrato sounds from the RWC instrument database [3]. In this paper, the F0 of the pitch name A4 was set at 440 Hz. 
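For reference, the chromatic grid of standard log-F0s used in these experiments, mu_k = ln 55 + ln 2 · (k − 1)/12 for k = 1, ..., 73 (A1 to A7, stated below), can be reproduced directly; the short sketch also checks that A4 lands at 440 Hz.

import numpy as np

K = 73                                            # chromatic grid A1, A#1, ..., A7
k = np.arange(1, K + 1)
mu = np.log(55.0) + np.log(2.0) * (k - 1) / 12.0  # standard log-F0 mu_k of the k-th note
f0 = np.exp(mu)                                   # fundamental frequencies in Hz

assert np.isclose(f0[36], 440.0)                  # k = 37 (A4) is 440 Hz
assert np.isclose(f0[-1], 3520.0)                 # k = 73 (A7) is 3520 Hz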
The power spectrogram of the mixed signal is shown in Fig. 2. To convert the signal into a spectrogram, we employed the fast approximate continuous wavelet transform [9] with a 16 ms time-shift interval. {xl }l ranged 55 to 7040 Hz per 10 cent. The parameters of HTFD were = (1 − 3.96 × 10−6 )1I , (τk , vk ) = (0.83, 1.25) set at γ(A) k for all k, (N, K, σ, αg , α s ) = (8, 73, 0.02, 1, 1), and γ(R) = (1−2.4×10−3 )1K . {μk }k ranged A1 to A7 with a chromatic interval, i.e. μk = ln(55) + ln(2) × (k − 1)/12. The number of NMF bases were set at three. The parameter updates of both HTFD and NMF were stopped at 100 iterations. While the estimates of spectrograms obtained with NMF were flat and the vibrato spectra seemed to be averaged (Fig. 3 (a)), those obtained with HTFD tracked the F0 contours of the vibrato sounds appropriately (Fig. 3 (b)), and clear vibrato sounds were contained in the separated audio signals by HTFD. where diag( p) converts a vector p into a diagonal matrix with the elements of p on the main diagonal. As for the update equations for the AR coefficients β, we can invoke the method described in [23] with a slight modification, since the terms in the auxiliary function that depend on β has the similar form as the objective function defined in [23]. It can be shown that L+ can be increased by the following updates (the details are omitted owing to space limitations): (41) hk ← Ĉk (βk )βk , βk ← Ck−1 hk , 5.2 Separation using key information We next examined whether the prior information of a sound improve source separation accuracy. The key of the sound used in 5.1, was assumed as D major. The key information was incorporated in the estimation scheme by setting γk(R) = 1 − 2.4 × 10−3 for the pitch indices that are not contained in the D major scale and γk(R) = 1−3.0×10−3 for the pitch indices contained in that scale. The other conditions were the same as 5.1. With HTFD without using the key information, the estimated activations of the pitch indices that were not contained in the scale, in particular D4, were high as illustrated in Fig. 4 (a). In contrast, those estimated activations with HTFD using the key information were suppressed as shown in Fig. 4 (b). These results thus support strongly that incorporating prior information improve the source separation accuracy. where Ck and Ĉk (βk ) are (P + 1) × (P + 1) Toeplitz matrices, whose (p, q)-th elements are 2 1 wk,m,n Ck,p,q = cos[(p − q)neΩk,m u0 ], MN m,n 2ν 1 1 Ĉk,p,q (βk ) = cos[(p − q)neΩk,m u0 ]. jne MN m,n |Bk (e Ωk,m u0 )|2 (42) 5. EXPERIMENTS 5.3 Transposing from one key to another Here we show some results of an experiment on automatic key transposition [11] using HTFD. The aim of key transposition is to change the key of a musical piece to another key. We separated the spectrogram of a polyphonic sound into spectrograms of individual pitches using HFTD, transposed the pitches of the subset of the separated components, added all the spectrograms together to construct a pitch-modified polyphonic spectrogram, and constructed a In the following preliminary experiments, we simplified HTFD by omitting the source filter model and assuming the time-invariance of wk,m,n . 
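As an illustration of the key-informed prior used in Sec. 5.2, the sketch below marks which indices of the chromatic A1–A7 grid belong to the D major scale and assigns the two gamma^(R) values quoted there. The helper name and the pitch-class encoding are ours; only the scale membership and the numeric values come from the text.

import numpy as np

# Pitch classes of the D major scale (C = 0, C# = 1, ..., B = 11): D E F# G A B C#.
D_MAJOR = {2, 4, 6, 7, 9, 11, 1}

def key_informed_gamma_R(K=73, in_scale=1 - 3.0e-3, out_of_scale=1 - 2.4e-3):
    """Dirichlet hyperparameters gamma^(R) over the chromatic grid,
    chosen by D major scale membership (values as quoted in Sec. 5.2)."""
    pitch_class = (9 + np.arange(K)) % 12          # index k = 1 is A1, i.e. pitch class 9
    return np.where(np.isin(pitch_class, list(D_MAJOR)), in_scale, out_of_scale)

gamma_R = key_informed_gamma_R()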
5.1 F0 tracking of violin sound To confirm whether HTFD can track the F0 contour of a sound, we compared HTFD with NMF with the Idivergence, by using a 16 kHz-sampled audio signal which 627 15th International Society for Music Information Retrieval Conference (ISMIR 2014) A♭4 311 554 311 F4 Pitch 554 988 Frequency [Hz] 988 Frequency [Hz] Frequency [Hz] 988 554 D4 D♭4 311 Time 0 2 4 Time [s] 6 0 2 4 Time [s] 6 0 2 4 Time [s] 6 (a) Without key information (a) Estimates of spectrograms and F0 contours (orange lines) obtained with HTFD 311 554 311 F4 Pitch 554 A♭4 988 Frequency [Hz] 988 Frequency [Hz] Frequency [Hz] 988 554 D4 D♭4 311 Time 0 2 4 Time [s] 6 0 2 4 Time [s] 6 0 2 4 Time [s] 6 (b) With key information (b) Estimates of spectrograms obtained with NMF Figure 4. Temporal activations of Figure 3. Estimated spectrogram models by harmonic-temporal factor decomposi- A3–A4 estimated with HTFD using tion (HTFD) and non-negative matrix factorization (NMF). In left-to-right fashion, and without using prior information of the key. The red curves represent the spectrogram models are for D4, F4 and A4. the temporal activations of D4. time-domain signal from the modified spectrogram using the method described in [13]. For the key transposition, we adopted a simple way: To transpose, for example, from A major scale to A natural minor scale, we changed the pitches of the separated spectrograms corresponding to C, F and G to C, F and G, respectively. Some results are demonstrated in http://hil.t. u-tokyo.ac.jp/˜nakamura/demo/HTFD.html. 6. CONCLUSION This paper proposed a new approach for monaural source separation called the “Harmonic-Temporal Factor Decomposition (HTFD)” by introducing a spectrogram model that combines the features of the models employed in the NMF and HTC approaches. We further described some ideas how to design the prior distributions for the present model to incorporate musically relevant information into the separation scheme. 7. ACKNOWLEDGEMENTS This work was supported by JSPS Grant-in-Aid for Young Scientists B Grant Number 26730100. 8. REFERENCES [1] A. S. Bregman: Auditory Scene Analysis, MIT Press, Cambridge, 1990. [2] J. S. Downie, D. Byrd, and T. Crawford: “Ten years of ISMIR: Reflections on challenges and opportunities,” Proc. ISMIR, pp. 13–18, 2009. [3] M. Goto: “Development of the RWC Music Database,” Proc. ICA, pp. l–553–556, 2004. [4] R. Hennequin, R. Badeau, and B. David: “Time-dependent parametric and harmonic templates in non-negative matrix factorization,” Proc. DAFx, pp. 246–253, 2010. [5] R. Hennequin, B. David, and R. Badeau: “Score informed audio source separation using a parametric model of nonnegative spectrogram,” Proc. ICASSP, pp. 45–48, 2011. [6] G. E. Hinton: “Training products of experts by minimizing contrastive divergence,” Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002. [7] G. Hu, and D. L. Wang: “An auditory scene analysis approach to monaural speech segregation,” Topics in Acoust. Echo and Noise Contr., pp. 485–515, 2006. [8] H. Kameoka: Statistical Approach to Multipitch Analysis, PhD thesis, The University of Tokyo, Mar. 2007. [9] H. Kameoka, T. Tabaru, T. Nishimoto, and S. Sagayama: (Patent) Signal processing method and unit, in Japanese, Nov. 2008. [10] H. Kameoka, T. Nishimoto, and S. Sagayama: “A multipitch analyzer based on harmonic temporal structured clustering,” IEEE Trans. ASLP, vol. 15, no. 3, pp. 982–994, 2007. [11] H. Kameoka, J. Le Roux, Y. Ohishi, and K. 
Kashino: “Music Factorizer: A note-by-note editing interface for music waveforms,” IPSJ SIG Tech. Rep., 2009-MUS-81-9, in Japanese, Jul. 2009. [12] H. Kameoka: “Statistical speech spectrum model incorporating all-pole vocal tract model and F0 contour generating process model,” IEICE Tech. Rep., vol. 110, no. 297, SP2010-74, pp. 29–34, in Japanese, Nov. 2010. [13] T. Nakamura and H. Kameoka: “Fast signal reconstruction from magnitude spectrogram of continuous wavelet transform based on spectrogram consistency,” Proc. DAFx, 40, to appear, 2014. [14] M. Nakano, J. Le Roux, H. Kameoka, Y. Kitano, N. Ono, and S. Sagayama: “Nonnegative matrix factorization with Markov-chained bases for modeling time-varying patterns in music spectrograms,” Proc. LVA/ICA, pp. 149–156, 2010. [15] A. Ozerov, C. Févotte, R. Blouet, and J. L. Durrieu: “Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation,” Proc. ICASSP., pp. 257–260, 2011. [16] S. A. Raczyński, N. Ono, and S. Sagayama: “Multipitch analysis with harmonic nonnegative matrix approximation,” Proc. ISMIR, pp. 381–386, 2007. [17] D. Sakaue, T. Otsuka, K. Itoyama, and H. G. Okuno: “Bayesian nonnegative harmonic-temporal factorization and its application to multipitch analysis,” Proc. ISMIR, pp. 91– 96, 2012. [18] U. Simsekli and A. T. Cemgil: “Score guided musical source separation using generalized coupled tensor factorization,” Proc. EUSIPCO, pp. 2639–2643, 2012. [19] P. Smaragdis and J. C. Brown: “Non-negative matrix factorization for polyphonic music transcription,” Proc. WASPAA, pp. 177–180, 2003. [20] P. Smaragdis and G. J. Mysore: “Separation by ”humming”: User-guided sound extraction from monophonic mixtures,” Proc. WASPAA, pp. 69–72, 2009. [21] E. Vincent, N. Bertin, and R. Badeau: “Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription,” Proc. ICASSP, pp. 109–112, 2008. [22] K. Yoshii and M. Goto: “Infinite latent harmonic allocation: A nonparametric Bayesian approach to multipitch analysis,” Proc. ISMIR, pp. 309–314, 2010. [23] A. El-Jaroudi, J. Makhoul: “Discrete all-pole modeling,” IEEE Trans. SP, vol. 39, no. 2, pp. 411–423, 1991. 628 Oral Session 9 Rhythm & Beat 629 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 630 15th International Society for Music Information Retrieval Conference (ISMIR 2014) DESIGN AND EVALUATION OF ONSET DETECTORS USING DIFFERENT FUSION POLICIES Mi Tian, György Fazekas, Dawn A. A. Black, Mark Sandler Centre for Digital Music, Queen Mary University of London {m.tian, g.fazekas, dawn.black, mark.sandler}@qmul.ac.uk ABSTRACT Note onset detection is one of the most investigated tasks in Music Information Retrieval (MIR) and various detection methods have been proposed in previous research. The primary aim of this paper is to investigate different fusion policies to combine existing onset detectors, thus achieving better results. Existing algorithms are fused using three strategies, first by combining different algorithms, second, by using the linear combination of detection functions, and third, by using a late decision fusion approach. Large scale evaluation was carried out on two published datasets and a new percussion database composed of Chinese traditional instrument samples. 
An exhaustive search through the parameter space was used enabling a systematic analysis of the impact of each parameter, as well as reporting the most generally applicable parameter settings for the onset detectors and the fusion. We demonstrate improved results attributed to both fusion and the optimised parameter settings. 1. INTRODUCTION The automatic detection of onset events is an essential part in many music signal analysis schemes and has various applications in content-based music processing. Different approaches have been investigated for onset detection in recent years [1,2]. As the main contribution of this paper, we present new onset detectors using different fusion policies, with improved detection rates relying on recent research in the MIR community. We also investigate different configurations of onset detection and fusion parameters, aiming to provide a reference for configuring onset detection systems. The focus of ongoing onset detection work is typically targeting Western musical instruments. Apart from using two published datasets, a new database is incorporated into our evaluation, collecting percussion ensembles of Jingju, also denoted as Peking Opera or Beijing Opera, a major genre of Chinese traditional music 1 . By including this dataset, we aim at increasing the diversity of instrument categories in the evaluation of onset detectors, as well as extending the research to include non-Western music types. The goal of this paper can be summarised as follows: i) to evaluate fusion methods in comparison with the baseline algorithms, as well as a state-of-the-art method 2 ; ii) to investigate which fusion policies and which pair-wise combinations of onset detectors yield the most improvement over standard techniques; iii) to find the best performing configurations by searching through the multi-dimensional parameter space, hence identifying emerging patterns in the performances of different parameter settings, showing good results across different datasets; iv) to investigate the performance difference in Western and non-Western percussive instrument datasets. In the next section, we present a review of related work. Descriptions of the datasets used in this experiment are given in Section 3. In Section 4, we introduce different fusion strategies. Relevant post-processing and peak-picking procedures, as well as the parameter search process will be discussed in Section 5. Section 6 presents the results, with a detailed analysis and discussion of the performance of the fusion methods. Finally, the last section summarises our findings and provides directions for future work. 2. RELATED WORK Many onset detection algorithms and systems have been proposed in recent years. Common approaches using energy or phase information derived from the input signal include the high frequency content (HFC) and complex domain (CD) methods. See [1, 6] for detailed reviews and [9] for further improvements. Pitch contours and harmonicity information can also be indicators for onset events [7]. These methods shows some superiority over energy based ones in case of soft onsets. Onset detection systems using machine learning techniques have also been gaining popularity in recent years 3 . The winner of MIREX 2013 audio onset detection task utilises convolutional neural networks to classify and distinguish onsets from non-onset events in the spectrogram [13]. The data-driven nature of these methods makes the c Mi Tian, György Fazekas, Dawn A. A. Black, Mark San dler. 
Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Mi Tian, György Fazekas, Dawn A. A. Black, Mark Sandler. “Design and Evaluation of Onset Detectors Using Different Fusion Policies”, 15th International Society for Music Information Retrieval Conference, 2014. 1 http://en.wikipedia.org/wiki/Peking_opera Machine learning-based methods are excluded from this study to limit the scope of our work. 3 http://www.music-ir.org/mirex/wiki/2013: Audio_Onset_Detection 2 631 15th International Society for Music Information Retrieval Conference (ISMIR 2014) with 732 onsets. We also use NPP onsets from the first two datasets to form the fourth one, providing a direct comparison with the Chinese NPP instruments. All stimuli are mono signals sampled at 44.1kHz 6 and 16 bits per sample, having 3349 onsets in total. detection less dependent on onset types, though a computationally expensive training process is required. A promising approach for onset detection lies in the fusion of multiple detection methods. Zhou et al. proposed a system integrating two detection methods selected according to properties of the target onsets [17]. In [10], pitch, energy and phase information are considered in parallel for the detection of pitched onsets. Another fusion strategy is to combine peak score information to form new estimations of the onset events [8]. Albeit fusion has been used in previous work, there is a lack of systematic evaluation of fusion strategies and applications in the current literature. This paper focusses on the assessment of different fusion policies, from feature-level and detection function-level fusion to higher level decision fusion. The success of an onset detection algorithm largely depends on the signal processing methods used to extract salient features from the audio that emphasise the features characterising onset events as well as smoothing the noise in the detection function. Various signal processing techniques have been introduced in recent studies, such as vibrato suppression [3] and adaptive thresholding [1]. In [14], adaptive whitening is presented where each STFT bins magnitude is divided by the an average peak for that bin accumulated over time. This paper also investigates the performances of some commonly used signal processing modules within onset detection systems. 4. FUSION EXPERIMENT 3. DATASETS In this study, we use two previously released evaluation datasets and a newly created one. The first published dataset comes from [1], containing 23 audio tracks with a total duration of 190 seconds and having 1058 onsets. These are classified into four groups: pitched non-percussive (PNP), e.g. bowed strings, 93 onsets, pitched percussive (PP), e.g. piano, 482 onsets 4 , non-pitched percussive (NPP), e.g. drums, 212 onsets, and complex mixtures (CM), e.g. pop singing music, 271 onsets. The second set comes from [2] which is composed of 30 samples 5 of 10 second audio tracks, containing 1559 onsets in total, covering also four categories: PNP (233 onsets in total), PP (152 onsets), NPP (115 onsets), CM (1059 onsets). The use of these datasets enables us to test the algorithms on a range of different instruments and onset types, and provides for direct comparison with published work. The combined dataset used in the evaluation of our work is composed of these two sets. The third dataset consists of recordings of the four major percussion instruments in Jingju: bangu (clapper- drum), daluo (gong-1), naobo (cymbals), and xiaoluo (gong-2). 
The samples are manually mixed using individual recordings of these instruments, with possibly simultaneous onsets, to closely reproduce real-world conditions. See [15] for more details on the instrument types and the dataset. This dataset includes 10 samples of 30-second excerpts with 732 onsets.
4 A 7-onset discrepancy (482 instead of 489) from the reference paper is reported by the original author due to revisions of the annotations. 5 Only a subset of the dataset presented in the original paper was received from the author for the evaluation in this paper. 6 Some audio files were upsampled to obtain a uniform dataset.
The aim of information fusion is to merge information from heterogeneous sources to reduce the uncertainty of inferences [11]. In our study, six spectral-based onset detection algorithms are considered as baselines for fusion: high frequency content (HFC), spectral difference (SD), complex domain (CD), broadband energy rise (BER), phase deviation (PD), outlined in [1], and SuperFlux (SF) from recent work [4]. We also developed and included in the fusion a method based on Linear Predictive Coding [12], where the LPC coefficients are computed using the Levinson-Durbin recursion and the onset detection function is derived from the LPC error signal. Three fusion policies are used in our experiments: i) feature-level fusion, ii) fusion using the linear combination of detection functions, and iii) decision fusion by selecting and merging onset candidates. All pairwise combinations of the baseline algorithms are amenable to the latter two fusion policies. However, not all algorithms can be meaningfully combined using feature-level fusion. For example, CD can be considered an existing combination of SD and PD, so combining CD with either of these two at the feature level is not sensible. In this study, 10 feature-level fusion, 13 linear combination based fusion and 15 decision fusion based methods are tested. These are compared to the 7 original methods, giving us 45 detectors in total. In the following, we describe the specific fusion policies. We assume familiarity with onset detection principles and refrain from describing these details; please see [1] for a tutorial.
4.1 Feature-level Fusion In feature-level fusion, multiple algorithms are combined to compute fused features. For conciseness, we provide only one example combining BER and SF, denoted BERSF, utilising the vibrato suppression capability of SF [4] for detecting soft onsets, as well as the good performance of BER for detecting percussive onsets with sharp energy bursts [1]. Here, we use BER to mask the SF detection function as described by Equation (1). In essence, SF is used directly when there is evidence for a sharp energy rise; otherwise it is further smoothed using a median filter: ODF(n) = SF(n) if BER(n) > γ, and ODF(n) = λ SF′(n) otherwise, (1) where γ is an experimentally defined threshold, λ is a weighting constant set to 0.9 and SF′(n) is the median-filtered detection function with a window size of 3 frames.
4.2 Linear Combination of Detection Functions In this method, two time-aligned detection functions are used and their weighted linear combination is computed to form a new detection function, as shown in Equation (2): ODF(n) = w ODF1(n) + (1 − w) ODF2(n), (2) where ODF1 and ODF2 are two normalised detection functions and w is a weighting coefficient (0 ≤ w ≤ 1).
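The two fusion rules just defined can be written compactly; the sketch below is a minimal illustration (function names are ours), with the 3-frame median filter and λ = 0.9 taken from the text.

import numpy as np
from scipy.ndimage import median_filter

def bersf_feature_fusion(sf, ber, gamma, lam=0.9):
    """Feature-level BERSF fusion of Eq. (1): keep SF(n) where BER(n)
    indicates a sharp energy rise, otherwise use lambda times the
    median-filtered SF (window of 3 frames)."""
    sf_smooth = median_filter(np.asarray(sf, dtype=float), size=3)
    return np.where(np.asarray(ber) > gamma, sf, lam * sf_smooth)

def linear_fusion(odf1, odf2, w):
    """Linear combination of Eq. (2); both detection functions are
    assumed to be time-aligned and normalised, with 0 <= w <= 1."""
    return w * np.asarray(odf1) + (1.0 - w) * np.asarray(odf2)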
4.3 Decision Fusion 5.2.2 Backtracking This fusion method operates at a later stage and combines prior decisions of two detectors. Post-processing and peak picking are applied separately yielding two lists of onset candidates. Onsets from the two lists occurring within a fixed temporal tolerance window will be merged and accepted. Let T S1 and T S2 be the lists of onset locations given by two different detectors, i and j be indexes of onsets in the candidate lists and δ the tolerance time window. The final onset locations are generated using the fusion strategy described by Algorithm 1. In case of many musical instruments, onsets have longer transients without a sharp burst of energy rise. This may cause energy based detection functions to exhibit peaks after the perceived onset locations. Vos and Rasch conclude that onsets are perceived when the envelope reaches a level of roughly 6-15 dB below the maximum level of the tones [16]. Using this rationale, we trace the onset locations from the detected peak position back to a hypothesised earlier “perceived” location. The backtracking procedure is based on measuring relative differences in the detection function, as illustrated by Algorithm 2, where θ is the threshold used as a stopping condition. We use the implementation available in the QM Vamp Plugins. Algorithm 1 Onset decision fusion 1: procedure D ECISION F USION(T S1 , T S2 ) 2: I, J ← 0 : len(T S1 ) − 1, 0 : len(T S2 ) − 1 3: T S ← empty list 4: for all i, j in product(I, J) do 5: if abs(T S1 [i] − T S2 [j]) < δ then 6: insert sorted: T S ← mean(T S1 [i], T S2 [j]) 7: degree polynomial on the detection function around local maxima using a least squares method, following the QM Vamp Plugins 7 . The coefficients a and c of the quadratic equation y = ax2 + bx + c are used to detect both sharper peaks, under the condition a > tha , and peaks with a higher magnitude, when c > thc . The corresponding thresholds are computed from a single sensitivity parameter called threshold using tha = (100 − threshold)/1000 for the quadratic term and thc = (100 − threshold)/1500 for the constant term. The linear term b can be ignored. Algorithm 2 Backtracking Require: idx: index of a peak location in the ODF 1: procedure BACKTRACKING(idx, ODF, θ) 2: δ, γ ← 0 3: while idx > 1 do 4: δ ← ODF [idx] − ODF [idx − 1] 5: if δ < γ ∗ θ then 6: break 7: idx ← idx − 1 8: γ←δ 9: return idx return T S 5. PEAK PICKING AND PARAMETER SEARCH 5.1 Smoothing and Thresholding Post-processing is an optional stage to reduce noise that interferes with the selection of maxima in the detection function. In this study, three post-processing blocks are used: i) DC removal and normalisation, ii) zero-phase low-pass filtering and iii) adaptive thresholding. In conventional normalisation, data is scaled using a fixed constant. Here we use a normalisation coefficient computed by weighting the input exponentially. After removing constant offsets, the detection function is normalised using the coefficient AlphaNorm calculated by Equation (3): 5.3 Parameter Search An exhaustive search is carried out to find the configurations in the parameter space yielding the best detection rates. 
The following parameters and settings, related to the onset detection and fusion stages, are evaluated: i) adaptive whitening (wht) on/off; ii) detection sensitivity (threshold), ranging from 0.1 to 1.0 with an increment of 0.1; iii) backtracking threshold (θ), ranging from 0.4 to 2.4 with 8 equal subdivisions (the upper bound is set to an empirical value 2.4 in the experiment since the tracking will not go beyond the previous valley); iv) linear combination coefficient (w), ranging from 0.0 to 1.0 with an increment of 0.1; v) tolerance window length (δ) for decision fusion, ranging from 0.01 to 0.05 (in second) having 8 subdivisions. This gives a 5-dimensional space and all combinations of all possible values described above are evaluated. This results in 180 configurations in case of standard detectors and feature-level fusion, 1980 in case of linear fusion and 1620 for decision fusion. The configurations are described 1 |ODF (n)|α α AlphaN orm = (3) len(ODF ) A low-pass filter is applied to the detection function to reduce noise. To avoid introducing delays, a zero phase filter is employed at this stage. Finally, adaptive thresholding using a moving median filter is applied following Bello [1], to avoid the common pitfalls of using a fixed threshold for peak picking. n 5.2 Peak Picking 5.2.1 Polynomial Fitting The use of polynomial fitting allows for assessing the shape and magnitude of peaks separately. Here we fit a second- 7 633 http://www.vamp-plugins.org 15th International Society for Music Information Retrieval Conference (ISMIR 2014) using the Vamp Plugin Ontology 8 and the resulting RDF files are used by Sonic Annotator [5] to configure the detectors. The test result will thus give us not only the overall performance of each onset detector, but also uncover their strengths and limitations across different datasets and parameter settings. 6. EVALUATION AND RESULTS 6.1 Analysis of Overall Performance Figure 1 provides an overview of the results, showing the F-measure for the top 12 detectors in our study 9 . Detectors are ranked by the median showing the overall performance increase due to fusion across the entire range of parameter settings. Due to space limitations, only a subset of the results are reported in this paper. The complete result set for all tested detectors under all configurations on different datasets is available online 10 , together with Vamp plugins of all tested onset detectors. The names of the fusion algorithms come from the abbreviations of the constituent methods, while the numbers represent the fusion policy: 0: feature-level fusion, 1: linear combination of detection functions and 2: decision fusion. CDSF-1 yields improved F-measure for the combined dataset by 3.06% and 6.14% compared to the two original methods SF and CD respectively. Smaller interquartile ranges (IQRs) observed in case of CD, SD and HFC based methods show they have less dependency on the configuration. BERSF-2 and BERSF-1 vary the most in performance, also reflected from their IQRs. In case of BERSF2, the best performance is obtained using the widest considered tolerance window (0.05s), with modest sensitivity (40%). However, decreasing the tolerance window size has an adverse effect on the performance, yielding one of the lowest detection rates caused by the significant drop of recall. In case of BERSF-1, a big discrepancy between the best and worst performing configurations can be observed. 
This is partly because the highest sensitivity setting has a negative effect on SF causing very low precision. Table 1 shows the results ranked by F-measure, precision and recall with corresponding standard deviations for the ten best detectors as well as all baseline methods. Standard deviations are computed over the results for all configurations in each dataset. SF is ranked in the best performing ten, thus it is excluded from the baseline. Nine out of the top ten detectors are fusion methods. CDSF-1 performs the best for all datasets (including CHN-NPP and WESNPP that are not listed in the table) while BERSF yields the second best performance in the combined, WES-NPP and JPB datasets. Corresponding parameter settings for the combined dataset are given in Table 2. Fusion policies may perform differently in the evaluation. In case of feature-level fusion, we compared how combined methods score relative to their constituents. The Figure 1. F-meaure of all configurations for the top 12 detectors. (Min, first and third quartile and max value of the data are represented by the bottom bar of the whiskers, bottom and upper borders of the boxes and upper bar of the whiskers respectively. Median is shown by the red line) method CDSF-1 BERSF-1 BERSF-2 BERSF-0 CDSF-2 SF CDBER-1 BERSD-1 HFCCD-1 CDBER-2 mean std median mode threshold 10.0 10.0 40.0 30.0 50.0 20.0 10.0 10.0 20.0 50.0 25.90 15.01 20.0 10.0 θ 2.15 2.40 2.15 2.40 2.40 2.40 2.40 2.40 1.15 1.15 2.100 0.4848 2.15 2.40 wht off off off off off off off off off off off w 0.20 0.30 n/a n/a n/a n/a 0.50 0.60 0.50 n/a 0.4200 0.1470 0.50 0.50 δ (s) n/a n/a 0.05 n/a 0.05 n/a n/a n/a n/a 0.05 0.05 0.00 0.05 0.05 Table 2. Parameter settings for the ten best performing detectors, threshold: overall detection sensitivity; θ: backtracking threshold; wht: adaptive whitening; w: linear combination coefficient; δ: tolerance window size. performances vary between datasets, with only HFCBER0 outperforming both HFC and BER on the combined and SB datasets in terms of mean F-measure. However, five perform better than their two constitutes on JPB, two on CHN-NPP and five on WES-NPP dataset (these results are published online). A more detailed analysis of these performance differences constitutes future work. When comparing linear fusion of detection functions with decision fusion, the former performs better across all datasets in all but one cases, the fusion of HFC and BER. Even in this case, linear fusion yields close performance in terms of mean F-measure. Interesting observations also emerge for particular methods on certain datasets. The linear fusion based detectors involving LPC and PD (SDPD1 and LPCPD-1) show better performances in the case of the CHN-NPP dataset compared to their performances on other datasets as well those given by their constituent methods (please see table online). Further analysis, for instance, by looking at statistical significance of these observations is required to identify relevant instrument properties. When comparing BERSF-2, CDSF-2 and CDBER-2 to the other detectors in Table 1, notably higher standard deviations in recall and F-measure are shown, indicating this 8 http://www.omras2.org/VampOntology Due to different post-processing stages, the results reported here may diverge from previously published results. 
10 http://isophonics.net/onset-fusion 9 634 15th International Society for Music Information Retrieval Conference (ISMIR 2014) method F (combined) P (combined) R (combined) F (sb) P (sb) R (sb) F (jpb) P (jpb) R (jpb) CDSF-1 BERSF-1 BERSF-2 BERSF-0 CDSF-2 SF CDBER-1 BERSD-1 HFCCD-1 CDBER-2 0.8580 0.0613 0.8559 0.0941 0.8528 0.1684 0.8451 0.0722 0.8392 0.1537 0.8274 0.0719 0.8145 0.0809 0.8073 0.0792 0.8032 0.0472 0.7967 0.2231 0.7966 0.0492 0.7883 0.0942 0.7795 0.0466 0.7712 0.0412 0.7496 0.0658 0.6537 0.1084 0.9054 0.1195 0.8857 0.1363 0.8901 0.1411 0.8638 0.1200 0.8970 0.1129 0.8313 0.1209 0.8210 0.1276 0.8163 0.1311 0.8512 0.1179 0.8423 0.1404 0.8509 0.1164 0.7776 0.1184 0.8354 0.1269 0.8011 0.1225 0.7671 0.1103 0.5775 0.1008 0.8153 0.0609 0.8280 0.0866 0.8186 0.2028 0.8272 0.0701 0.7884 0.1855 0.8234 0.0657 0.8080 0.0792 0.7986 0.0812 0.7603 0.0734 0.7558 0.2398 0.7489 0.0672 0.7994 0.1001 0.7305 0.0733 0.7436 0.0898 0.7330 0.1061 0.7530 0.2235 0.8194 0.0598 0.8126 0.0961 0.8088 0.1677 0.8025 0.0723 0.7892 0.1758 0.8126 0.0744 0.7877 0.0829 0.7843 0.0828 0.7802 0.0448 0.7605 0.2279 0.7692 0.0467 0.7626 0.0974 0.7604 0.0450 0.7411 0.0375 0.7243 0.0657 0.6143 0.1093 0.8455 0.1165 0.8191 0.1306 0.8729 0.1470 0.8185 0.1134 0.8336 0.1251 0.8191 0.1241 0.7972 0.1295 0.7985 0.1358 0.8387 0.1239 0.8140 0.1607 0.8361 0.1191 0.7521 0.1166 0.8311 0.1326 0.7818 0.1291 0.7494 0.1069 0.5230 0.0688 0.7949 0.0681 0.8062 0.0988 0.7536 0.2055 0.7870 0.0744 0.7493 0.2014 0.8063 0.0737 0.7785 0.0893 0.7707 0.0915 0.7293 0.0765 0.7138 0.2384 0.7123 0.0709 0.7138 0.1119 0.7009 0.0785 0.7044 0.0844 0.7009 0.1019 0.7308 0.2302 0.9286 0.0649 0.9283 0.0925 0.9230 0.1724 0.9175 0.0747 0.9165 0.1344 0.8488 0.0704 0.8560 0.0793 0.8420 0.0756 0.8416 0.0511 0.8498 0.2291 0.8320 0.0535 0.8254 0.0920 0.8210 0.0491 0.8159 0.0496 0.7913 0.0662 0.7114 0.1115 0.9748 0.1241 0.9718 0.1463 0.9637 0.1310 0.9712 0.1322 0.9642 0.1001 0.8290 0.1177 0.8678 0.1253 0.8310 0.1252 0.8376 0.1101 0.8853 0.1273 0.8692 0.1128 0.7968 0.1226 0.8202 0.1190 0.8082 0.1138 0.8041 0.1164 0.6513 0.1536 0.8865 0.0525 0.8885 0.0710 0.8856 0.2011 0.8694 0.0658 0.8732 0.1690 0.8694 0.0558 0.8446 0.0667 0.8532 0.0685 0.8456 0.0705 0.8170 0.2494 0.7979 0.0636 0.8561 0.0851 0.8217 0.0676 0.8236 0.1002 0.7788 0.1118 0.7836 0.2158 CD BER SD HFC LPC PD Table 1. F-measure (F), Precision (P) and Recall (R) for dataset combined, SB, JPB for detectors under best performing configurations from the parameter search, with corresponding standard deviations over different configurations. statistic mean std median Combined 0.7731 0.0587 0.7818 SB 0.7438 0.0579 0.7595 JPB 0.8183 0.0628 0.8226 CHN-NPP 0.8527 0.1206 0.8956 WES-NPP 0.8358 0.0641 0.8580 Table 3. Statistics for F-measure of the ten detectors with their best performances from Table 1 for different datasets fusion policy is more sensitive to the choice of parameters. A possible improvement in this fusion policy would be to make the size of the tolerance window dependent on the magnitude of relevant peaks of the detection functions. The results also vary across different datasets. Table 3 summarises F-measure statistics computed over the detectors listed in Table 1 at their best setting for each datasets used in this paper. In comparison with SB, the JPB dataset exhibits higher F-measure. This dataset has larger diversity in terms of the length of tracks and the level of complexity, while the SB dataset mainly consists of complex mixture (CM) onsets type. 
Both the Chinese and Western NPP onset classes provide noticeably higher detection rates than the mixed-type datasets, though the CHN-NPP set shows the largest standard deviation, suggesting a greater variation in performance between the different detectors for these instruments. Apart from aiming at optimal overall detection results, it is also useful to consider when and how a certain onset detector exhibits its best performance, which constitutes future work.

6.2 Parameter Specifications

For general datasets a low detection sensitivity value is favourable, which is supported by the fact that 30 out of the 45 tested methods yield their best performances with a sensitivity lower than 50% (see online). In 23 out of all cases, the value of the backtracking threshold was the highest considered in our study (2.4) when the detectors yield their best performances for the combined dataset, and it was unanimously at a high value for all other datasets, including the percussive ones. This suggests that in many cases, the perceived onset is better characterised by the valley of the detection function prior to the detected peak. Note that even at a higher threshold, the onset location would not be traced back further than the valley preceding the peak detected in our algorithm. An interesting direction for future work would thus be, given this observation, to take into account the properties of human perception.

Adaptive whitening had to be turned off for the majority of detectors to provide good performance for all datasets. This indicates that the method does not improve onset detection performance in general, although it is available in most onset detectors in the Vamp plugin library. The value of the tolerance window was always 0.05 s for best performance in our study, suggesting that the temporal precision of the different detectors varies significantly, which requires a fairly wide decision horizon for successful combination.

Figure 2 shows how two parameters influence the performance of the onset detector CDSF-1. The figure illustrates the true positive rate (i.e., correct detections relative to the number of target onsets) and the false positive rate (i.e., false detections relative to the number of detected onsets); better performance is indicated by the curve shifting upwards and leftwards. All parameters except the linear combination coefficient (w) and the detection sensitivity (threshold) are fixed at their optimal values. We can observe that the value of the linear combination coefficient is around 0.2 for best performance. This suggests that the detector works best when taking the majority of the contribution from SF. With the threshold increasing from 10.0% to 60.0%, the true positive rate increases at the cost of picking more false onsets, thus a lower sensitivity is preferred in this case. The poorest performance in the case of the linear fusion policy occurs in general when the linear combination coefficient overly favours one constituent detector, or when the sensitivity (threshold) is too high and the backtracking threshold (θ) is at its lowest value.

Figure 2. Performances of the CDSF-1 onset detector under different w (labelled on each curve) and threshold (annotated in the side box) settings.

Proc. of the 14th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2013.

[4] S. Böck and G. Widmer. Maximum filter vibrato suppression for onset detection. In Proc. of the 16th Int. Conf.
on Digital Audio Effects (DAFx), 2013. [5] C. Cannam, M.O. Jewell, C. Rhodes, M. Sandler, and M. d’Inverno. Linked data and you: Bringing music research software into the semantic web. Journal of New Music Research, 2010. [6] N. Collins. A comparison of sound onset detection algorithms with emphasis on psychoacoustically motivated detection functions. In Proc. of the 118th Convention of the Audio Engineering Society, 2005. 7. CONCLUSION AND FUTURE WORK In this work, we applied several fusion techniques to aid the music onset detection task. Different fusion policies were tested and compared to their constituent methods, including the state-of-the-art SuperFlux method. A large scale evaluation was performed on two published datasets showing improvements as a result of fusion, without extra computational cost, or the need for a large amount of training data as in the case of machine learning based methods. A parameter search was used to find the optimal settings for each detector to yield the best performance. We found that some of the best performing configurations do not match the default settings of some previously published algorithms. This suggests that in some cases, better performance can be achieved just by finding better settings which work best overall for a given type of audio even without changing the algorithms. In future work, a possible improvement in case of late decision fusion is to take the magnitude of the peaks into account when combining detected onsets, essentially treating the value as an estimation confidence. We will investigate the dependency of the selection of onset detectors on the type and the quality of the input music signal. We also intend to carry out more rigorous statistical analyses with significance tests for the reported results. More parameters could be included in the search to study their strengths as well as how they influence each other under different configurations. Another interesting direction is to incorporate more Non-Western music types as detection target and design algorithms using instrument specific priors. 8. REFERENCES [1] J.P. Bello, L. Daudet, S. Abdallan, C. Duxbury, and M. Davies. A tutorial on onset detection in music signals. In IEEE Transactions on Audio, Speech, and Language Processing, volume 13, 2005. [2] S. Böck, F. Krebs, and M. Schedl. Evaluating the online capabilities of onset detection methods. In Proc. of the 13th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2012. [3] S. Böck and G. Widmer. Local group delay based vibrato and tremolo suppression for onset detection. In [7] N. Collins. Using a pitch detector for onset detection. In Proc. of the 6th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2005. [8] N. Degara-Quintela, A. Pena, and S. Torres-Guijarro. A comparison of score-level fusion rules for onset detection in music signals. In Proc. of the 10th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2009. [9] S. Dixon. Onset detection revisited. In Proc. of the 9 th Int. Conference on Digital Audio Effects (DAFx’06), 2006. [10] A. Holzapfel, Y. Stylianou, A.C. Gedik, and B. Bozkurt. Three dimensions of pitched instrument onset detection. IEEE Transactions on Audio, Speech, and Language Processing, 2010. [11] L.A. Klein. Sensor and data fusion: a tool for information assessment and decision making. SPIE, 2004. [12] J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), 1975. [13] J. Schlüter and S. Böck. 
Musical onset detection with convolutional neural networks. In 6th Int. Workshop on Machine Learning and Music (MML), 2013. [14] D. Stowell and M. Plumbley. Adaptive whitening for improved real-time audio onset detection. In Proceedings of the International Computer Music Conference (ICMC), 2007. [15] M. Tian, A. Srinivasamurthy, M. Sandler, and X. Serra. A study of instrument-wise onset detection in beijing opera percussion ensembles. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014. [16] J. Vos and R. Rasch. The perceptual onset of musical tones. Perception & Psychophysics, 29(4), 1981. [17] R. Zhou, M. Mattavellii, and G. Zoia. Music onset detection based on resonator time frequency image. IEEE Transactions on Audio, Speech, and Language Processing, 2008. 636 15th International Society for Music Information Retrieval Conference (ISMIR 2014) EVALUATING THE EVALUATION MEASURES FOR BEAT TRACKING Sebastian Böck Department of Computational Perception Johannes Kepler University, Linz, Austria [email protected] Mathew E. P. Davies Sound and Music Computing Group INESC TEC, Porto, Portugal [email protected] ABSTRACT This measurement of performance can happen via subjective listening test, where human judgements are used to determine beat tracking performance [3], to discover: how perceptually accurate the beat estimates are when mixed with the input audio. Alternatively, objective evaluation measures can be used to compare beat times with ground truth annotations [4], to determine: how consistent the beat estimates are with the ground truth according to some mathematical relationship. While undertaking listening tests and annotating beat locations are both extremely time-consuming tasks, the apparent advantage of the objective approach is that once ground truth annotations have been determined, they can easily be re-used without the need for repeated listening experiments. However, the usefulness of any given objective accuracy score (of which there are many [4]) is contingent on its ability to reflect human judgement of beat tracking performance. Furthermore, for the entire objective evaluation process to be meaningful, we must rely on the inherent accuracy of the ground truth annotations. In this paper we work under the assumption that musically trained experts can provide meaningful ground truth annotations and rather focus on the properties of the objective evaluation measures. The main question we seek to address is: to what extent do existing objective accuracy scores reflect subjective human judgement of beat tracking performance? In order to answer this question, even in principle, we must first verify that human listeners can make reliable judgements of beat tracking performance. While very few studies exist, we can find supporting evidence suggesting human judgements of beat tracking accuracy are highly repeatable [3] and that human listeners can reliably disambiguate accurate from inaccurate beat click sequences mixed with music signals [11]. The analysis we present involves the use of a test database for which we have a set of estimated beat locations, annotated ground truth and human subjective judgements of beat tracking performance. Access to all of these components (via the results of existing research [12, 17]) allows us to examine the correlation between objective accuracy scores, obtained by comparing the beat estimates to the ground truth, with human listener judgements. 
To the best of our knowledge this is the first study of this type for musical beat tracking. The remainder of this paper is structured as follows. In Section 2 we summarise the objective beat tracking evaluation measures used in this paper. In Section 3 we describe The evaluation of audio beat tracking systems is normally addressed in one of two ways. One approach is for human listeners to judge performance by listening to beat times mixed as clicks with music signals. The more common alternative is to compare beat times against ground truth annotations via one or more of the many objective evaluation measures. However, despite a large body of work in audio beat tracking, there is currently no consensus over which evaluation measure(s) to use, meaning multiple accuracy scores are typically reported. In this paper, we seek to evaluate the evaluation measures by examining the relationship between objective accuracy scores and human judgements of beat tracking performance. First, we present the raw correlation between objective scores and subjective ratings, and show that evaluation measures which allow alternative metrical levels appear more correlated than those which do not. Second, we explore the effect of parameterisation of objective evaluation measures, and demonstrate that correlation is maximised for smaller tolerance windows than those currently used. Our analysis suggests that true beat tracking performance is currently being overestimated via objective evaluation. 1. INTRODUCTION Evaluation is a critical element of music information retrieval (MIR) [16]. Its primary use is a mechanism to determine the individual and comparative performance of algorithms for given MIR tasks towards improving them in light of identified strengths and weaknesses. Each year many different MIR systems are formally evaluated within the MIREX initiative [6]. In the context of beat tracking, the concept and purpose of evaluation can be addressed in several ways. For example, to measure reaction time across changing tempi [2], to identify challenging musical properties for beat trackers [9] or to drive the composition of new test datasets [10]. However, as with other MIR tasks, evaluation in beat tracking is most commonly used to estimate the performance of one or more algorithms on a test dataset. c Mathew E. P. Davies, Sebastian Böck. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Mathew E. P. Davies, Sebastian Böck. “Evaluating the evaluation measures for beat tracking”, 15th International Society for Music Information Retrieval Conference, 2014. 637 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ceding tolerance window. In addition, a separate threshold requires that the estimated inter-beat-interval should be close to the IAI. In practice both thresholds are set at ±17.5% of the IAI. In [4], two basic conditions consider the ratio of the longest continuously correct region to the length of the excerpt (CMLc), and the total proportion of correct regions (CMLt). In addition, the AMLc and AMLt versions allow for additional interpretations of the annotations to be considered accurate. As specified above, we reduce these four to two principal accuracy scores. To prevent any ambiguity, we rename these accuracy scores Continuity-C (CMLc) and Continuity-T (CMLt). the comparison between subjective ratings and objective scores of beat tracking accuracy. Finally, in Section 4 we present discussion and areas for future work. 2. 
BEAT TRACKING EVALUATION MEASURES In this section we present a brief summary each of the evaluation measures from [4]. While nine different approaches were presented in [4], we reduce them to seven by only presenting the underlying approaches for comparing a set of beats with a set of annotations (i.e. ignoring alternate metrical interpretations). We consider the inclusion of different metrical interpretations of the annotations to be a separate process which can be applied to any of these evaluation measures (as in [5, 8, 15]), rather than a specific property of one particular approach. To this end, we choose three evaluation conditions: Annotated – comparing beats to annotations, Annotated+Offbeat – including the “off-beat” of the annotations for comparison against beats and Annotated+Offbeat+D/H – including the off-beat and both double and half the tempo of the annotations. This doubling and halving has been commonly used in beat tracking evaluation to attempt to reflect the inherent ambiguity in music over which metrical level to tap the beat [13]. The set of seven basic evaluation measures are summarised below: Information Gain : this method performs a two-way comparison of estimated beat times to annotations and vice-versa. In each case, a histogram of timing errors is created and from this the Information Gain is calculated as the Kullback-Leibler divergence from a uniform histogram. The default number of bins used in the histogram is 40. 3. SUBJECTIVE VS. OBJECTIVE COMPARISON 3.1 Test Dataset To facilitate the comparison of objective evaluation scores and subjective ratings we require a test dataset of audio examples for which we have both annotated ground truth beat locations and a set of human judgements of beat tracking performance for a beat tracking algorithm. For this purpose we use the test dataset from [17] which contains 48 audio excerpts (each 15s in duration). The excerpts were selected from the MillionSongSubset [1] according to a measurement of mutual agreement between a committee of five state of the art beat tracking algorithms. They cover a range from very low mutual agreement – shown to be indicative of beat tracking difficulty, up to very high mutual agreement – shown to be easier for beat tracking algorithms [10]. In [17] a listening experiment was conducted where a set of 22 participants listened to these audio examples mixed with clicks corresponding to automatic beat estimates and rated on a 1 to 5 scale how well they considered the clicks represented the beats present in the music. For each excerpt these beat times were the output of the beat tracker which most agreed with the remainder of the five committee members from [10]. Analysis of the subjective ratings and measurements of mutual agreement revealed low agreement to be indicative of poor subjective performance. In a later study, these audio excerpts were used as one test set in a beat tapping experiment, where participants tapped the beat using a custom piece of software [12]. In order to compare the mutual agreement between tappers with their global performance against the ground truth, a musical expert annotated ground truth beat locations. The tempi range from 62 BPM (beats per minute) up to 181 BPM and, with the exception of two excerpts, all are in 4/4 time. 
Of the remaining two excerpts, one is in 3/4 time and F-measure : accuracy is determined through the proportion of hits, false positives and false negatives for a given annotated musical excerpt, where hits count as beat estimates which fall within a pre-defined tolerance window around individual ground truth annotations, false positives are extra beat estimates, and false negatives are missed annotations. The default value for the tolerance window is ±0.07s. PScore : accuracy is measured as the normalised sum of the cross-correlation between two impulse trains, one corresponding to estimated beat locations, and the other to ground truth annotations. The cross-correlation is limited to the range covering 20% of the median interannotation-interval (IAI). Cemgil : a Gaussian error function is placed around each ground truth annotation and accuracy is measured as the sum of the “errors” of the closest beat to each annotation, normalised by whichever is greater, the number of beats or annotations. The standard deviation of this Gaussian is set at 0.04s. Goto : the annotation interval-normalised timing error is measured between annotations and beat estimates, and a binary measure of accuracy is determined based on whether a region covering 25% of the annotations continuously meets three conditions – the maximum error is less than ±17.5% of the IAI, and the mean and standard deviation of the error are within ±10% of the IAI. Continuity-based : a given beat is considered accurate if it falls within a tolerance window placed around an annotation and that the previous beat also falls within the pre- 638 15th International Society for Music Information Retrieval Conference (ISMIR 2014) F−measure (0.77) ratings A 4 ratings A+O Cemgil (0.79) 4 2 0 4 2 0.5 (0.85) 1 0 Goto (0.52) 4 2 0.5 (0.74) 1 0 Continuity−C (0.68) 4 2 0.5 (0.84) 1 0 Continuity−T (0.68) 4 2 0.5 (0.51) 1 0 Inf. Gain (0.85) 4 2 0.5 (0.65) 1 0 2 0.5 (0.61) 1 0 4 4 4 4 4 4 2 2 2 2 2 2 2 0 0.5 (0.85) 1 4 0.5 (0.82) 1 4 2 0 0 1 0 0.5 (0.85) 1 4 2 0.5 accuracy 0 1 0 0.5 (0.41) 1 4 2 0.5 accuracy 0 1 0 0.5 (0.86) 1 4 2 0.5 accuracy 0 1 0 0.5 (0.84) 1 1 0 5 4 2 0.5 accuracy 0 (0.85) 4 2 0.5 accuracy 0 5 (0.86) 4 A+O+D/H ratings PScore (0.72) 2 0.5 accuracy 1 0 5 bits Figure 1. Subjective ratings vs. objective accuracy scores for different evaluation measures. The rows indicate different evaluation conditions. (top row) Annotated, (middle row) Annotated+Offbeat, and (bottom row) Annotated+Offbeat+D/H. For each scatter plot, the linear correlation coefficient is provided. Comparing each individual measure across these evaluation conditions, reveals that Information Gain is least affected by the inclusion of additional interpretations of the annotations, and hence most robust to ambiguity over metrical level. Referring to the F-measure and PScore columns of Figure 1 we see that the “vertical” structure close to accuracies of 0.66 and 0.5 respectively is mapped across to 1 for the Annotated+Offbeat+D/H condition. This pattern is also reflected for Goto, Continuity-C and Continuity-T which also determine beat tracking accuracy according to fixed tolerance windows, i.e. a beat falling anywhere inside a tolerance window is perfectly accurate. However, the fact that a fairly uniform range of subjective ratings between 3 and 5 (i.e. “fair” to “excellent” [17]) exists for apparently perfect objective scores indicates a potential mismatch and over-estimation of beat tracking accuracy. 
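The linear correlation coefficients reported in each panel of Figure 1 can, in principle, be reproduced from per-excerpt pairs of objective accuracy scores and mean listener ratings. The short sketch below shows that computation on made-up numbers; it assumes the scores and ratings are already available as aligned lists and is not tied to the evaluation code used in the study.

import numpy as np

def pearson_correlation(objective_scores, subjective_ratings):
    # Linear (Pearson) correlation between per-excerpt accuracy scores and
    # the mean subjective rating given to the same excerpts.
    x = np.asarray(objective_scores, dtype=float)
    y = np.asarray(subjective_ratings, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical values for a handful of excerpts: an accuracy score in [0, 1]
# and mean ratings on the 1-to-5 scale used in the listening experiment.
accuracy = [0.95, 0.66, 0.40, 1.00, 0.82, 0.55]
ratings  = [4.6,  3.1,  2.0,  4.2,  4.0,  2.7]
print(round(pearson_correlation(accuracy, ratings), 2))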
While a better visual correlation appears to exist in the scatter plots of Cemgil and Information Gain, this is not reflected in the correlation values (at least not for the Annotated+Offbeat+D/H condition). The use a Gaussian instead of a “top-hat” style tolerance window for Cemgil provides more information regarding the precise localisation of beats to annotations and hence does not have this clustering at the maximum performance. The Information Gain measure does not use tolerance windows at all, instead it measures beat tracking accuracy in terms of the temporal dependence between beats and annotations, and thus shows a similar behaviour. the other was deemed to have no beat at all, and therefore no beats were annotated. In the context of this paper, this set of ground truth beat annotations provides the final element required to evaluate the evaluation measures, since we now have: i) automatically estimated beat locations, ii) subjective ratings corresponding to these beats and iii) ground truth annotations to which the estimated beat locations can be compared. We use each of the seven evaluation measures described in Section 2 to obtain the objective accuracy scores according to the three versions of the annotations: Annotated, Annotated+Offbeat and Annotated+Offbeat+D/H. Since all excerpts are short, and we are evaluating the output of an offline beat tracking algorithm, we remove the startup condition from [4] where beat times in the first five seconds are ignored. 3.2 Results 3.2.1 Correlation Analysis To investigate the relationship between the objective accuracy scores and subjective ratings, we present scatter plots in Figure 1. The title of each individual scatter plot includes the linear correlation coefficient which we interpret as an indicator of the validity of a given evaluation measure in the context of this dataset. The highest overall correlation (0.86) occurs for Continuity-C when the offbeat and double/half conditions are included. However, for all but Goto, the correlation is greater than 0.80 once these additional evaluation criteria are included. It is important to note only Continuity-C and Continuity-T explicitly include these conditions in [4]. Since Goto provides a binary assessment of beat tracking performance, it is unlikely to be highly correlated with the subjective ratings from [17] where participants were explicitly required to use a five point scale rather than a good/bad response concerning beat tracking performance. Nevertheless, we retain it to maintain consistency with [4]. 3.2.2 The Effect of Parameterisation For the initial correlation analysis, we only considered the default parameterisation of each evaluation measure as specified in [4]. However, to only interpret the validity of the evaluation measures in this way presupposes that they have already been optimally parameterised. We now explore whether this is indeed the case, by calculating the objective accuracy scores (under each evaluation condition) as a function of a threshold parameter for each measure. 639 15th International Society for Music Information Retrieval Conference (ISMIR 2014) F−measure PScore 1 Cemgil 1 Goto 1 Continuity−C 1 1 Continuity−T Inf. Gain 1 5 0.5 0.5 0.5 0.5 0.5 bits accuracy 4 0.5 3 2 1 correlation 0 0 0.05 0.1 0 0 0.5 0 0 0.05 0.1 0 0 0.5 0 0 0.5 0 0 0.5 0 0 1 1 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0.05 0.1 threshold (s) 0 0 0.5 threshold 0 0 0.05 0.1 threshold (s) 0 0 0.5 threshold 0 0 0.5 threshold 0 0 0.5 threshold 0 0 50 100 50 100 num bins Figure 2. 
(top row) Beat tracking accuracy as a function of threshold (or number of bins for Information Gain) per evaluation measure. (bottom row) Correlation between subjective ratings and accuracy scores as a function of threshold (or number of bins). In each plot the solid line indicates the Annotated condition, the dashed–dotted line shows Annotated+Offbeat and the dashed line shows Annotated+Offbeat+D/H. For each evaluation measure, the default parameteristation from [4] is shown by a dotted vertical line. We then re-compute the subjective vs. objective correlation. We adopt the following parameter ranges as follows: F-measure PScore Cemgil Goto Continuity-C Continuity-T Information Gain F-measure : the size of the tolerance window increases from ±0.001s to ±0.1s. PScore : the width of the cross-correlation increases from 0.01 to 0.5 times the median IAI. Cemgil : the standard deviation of the Gaussian error function grows from 0.001s to 0.1s. Default Parameters Max. Correlation Parameters 0.070s 0.200 0.040s 0.175 0.175 0.175 40 0.049s 0.110 0.051s 0.100 0.095 0.090 38 Table 1. Comparison of default parameters per evaluation measure with those which provide the maximum correlation with subjective ratings in the Annotated+Offbeat+D/H condition. Goto : to allow a similar one-dimensional representation, we make all three parameters identical and vary them from ±0.005 to ±0.5 times the IAI. Continuity-based : the size of the tolerance window increases from ±0.005 to ±0.5 times the IAI. after which the correlation soon reaches its maximum and then reduces. Comparing these change points with the dotted vertical lines (which show the default parameters) we see that correlation is maximised for smaller (i.e. more restrictive) parameters than those currently used. By finding the point of maximum correlation in each of the plots in the bottom row of Figure 2 we can identify the parameters which yield the highest correlation between objective accuracy and subjective ratings. These are shown for the Annotated+Offbeat+D/H evaluation condition in Table 1 for which the correlation is typically highest. Returning to the plots in the top row of Figure 2 we can then read off the corresponding objective accuracy with the default and then maximum correlation parameters. These accuracy scores are shown in Table 2. From these Tables we see that it is only Cemgil whose default parameterisation is lower than that which maximises the correlation. However this does not apply for the Annotated only condition which is implemented in [4]. While there is a small difference for Information Gain, in- Information Gain : we vary the number of bins in multiples of 2 from 2 up to 100. In the top row of Figure 2 the objective accuracy scores as a function of different parameterisations are shown. The plots in the bottom row show the corresponding correlations with subjective ratings. In each plot the dotted vertical line indicates the default parameters. From the top row plots we can observe the expected trend that, as the size of the tolerance window increases so the objective accuracy scores increase. For the case of Information Gain the beat error histograms become increasingly sparse due to having more histogram bins than observations, hence the entropy reduces and the information gain increases. In addition, Information Gain does not have a maximum value of 1, but instead, log2 of the number of histogram bins [4]. 
Looking at the effect of correlation with subjective ratings in the bottom row of Figure 2, we see that for most evaluation measures there is rapid increase in the correlation as the tolerance windows grow from very small sizes 640 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Annotated Default Max Corr. Params Params F-measure PScore Cemgil Goto Continuity-C Continuity-T Information Gain 0.673 0.653 0.596 0.583 0.518 0.526 3.078 0.607 0.580 0.559 0.563 0.488 0.505 2.961 Annotated+Offbeat Default Max Corr. Params Params 0.764 0.753 0.681 0.667 0.605 0.624 3.187 0.738 0.694 0.702 0.646 0.570 0.587 3.187 Annotated+Offbeat+D/H Default Max Corr. Params Params 0.834 0.860 0.739 0.938 0.802 0.837 3.259 0.797 0.792 0.779 0.813 0.732 0.754 3.216 Table 2. Summary of objective beat tracking accuracy under the three evaluation conditions: Annotated, Annotated+Offbeat and Annotated+Offbeat+D/H per evaluation measure. Accuracy is reported using the default parameterisation from [4] and also using the parameterisation which provides maximal correlation to the subjective ratings. For Information Gain only performance is measured in bits. might argue that the apparent glass ceiling of around 80% for beat tracking [10] (using Continuity-T for the Annotated+Offbeat+D/H condition) may in fact be closer to 75%, or perhaps lower still. In terms of external evidence to support our findings, a perceptual study evaluating human tapping ability [7] used a tolerance window of ±10% of the IAI, which is much closer to our “maximum correlation” Continuity-T parameter of ±9% than the default value of ±17.5% of the IAI. spection of Figure 2 shows that it is unaffected by varying the number of histogram bins in terms of the correlation. In addition, the inclusion of the extra evaluation criteria also leads to a negligible difference in reported accuracy. Therefore Information Gain is most robust to parameter sensitivity and metrical ambiguity. For the other evaluation measures the inclusion of the Annotated+Offbeat and the Annotated+Offbeat+D/H (in particular) leads to more pronounced differences. The highest overall correlation between objective accuracy scores and subjective ratings (0.89) occurs for Continuity-T for a tolerance window of ±9% of the IAI rather than the default value of ±17.5%. Referring again to Table 2 we see that this smaller tolerance window causes a drop in reported accuracy from 0.837 to 0.754. Indeed a similar drop in performance can be observed for most evaluation measures. Before making recommendations to the MIR community with regard to how beat tracking evaluation should be conducted in the future, we should first revisit the makeup of the dataset to assess the scope from which we can draw conclusions. All excerpts are just 15s in duration, and therefore not only much shorter than complete songs, but also significantly shorter than most annotated excerpts in existing datasets (e.g. 40s in [10]). Therefore, based on our results, we cannot yet claim that our subjective vs. objective correlations will hold for evaluating longer excerpts. We can reasonably speculate that an evaluation across overlapping 15s windows could provide some local information about beat tracking performance for longer pieces, however this is currently not how beat tracking evaluation is addressed. Instead, a single score of accuracy is normally reported regardless of excerpt length. 
With the exception of [3] we are unaware of any other research where subjective beat tracking performance has been measured across full songs. 4. DISCUSSION Based on the analysis of objective accuracy scores and subjective ratings on this dataset of 48 excerpts, we can infer that: i) a higher correlation typically exists when the Annotated+Offbeat and/or Annotated+Offbeat+D/H conditions are included, and ii) for the majority of existing evaluation measures, this correlation is maximised for a more restrictive parameterisation than the default parameters which are currently used [4]. A strict following of the results presented here would promote either the use of Continuty-T for the Annotated+Offbeat+D/H condition with a smaller tolerance window, or Information Gain since it is most resilient to these variable evaluation conditions while maintaining a high subjective vs. objective correlation. If we are to extrapolate these results to all existing work in the beat tracking literature this would imply that any papers reporting only performance for the Annotated condition using F-measure and PScore may not be as representative of subjective ratings (and hence true performance) as they could be by incorporating additional evaluation conditions. In addition, we could infer that most presented accuracy scores (irrespective of evaluation measure or evaluation condition) are somewhat inflated due to the use of artificially generous parameterisations. On this basis, we Regarding the composition of our dataset, we should also be aware that the excerpts were chosen in an unsupervised data-driven manner. Since they were sampled from a much larger collection of excerpts [1] we do not believe there is any intrinsic bias in their distribution other than any which might exist across the composition of the MillionSongSubset itself. The downside of this unsupervised sampling is that we do not have full control over exploring specific interesting beat tracking conditions such as off-beat tapping, expressive timing, the effect of related metrical levels and non-4/4 time-signatures. We can say that for the few test examples where the evaluated beat tracker tapped the off-beat (shown as zero accuracy points in the Anno- 641 15th International Society for Music Information Retrieval Conference (ISMIR 2014) tated condition but non-zero for the Annotated+Offbeat condition in Figure 1), were not rated as “bad”. Likewise, there did not appear to be a strong preference over a single metrical level. Interestingly, the ratings for the unannotatable excerpt were among the lowest across the dataset. Overall, we consider this to be a useful pilot study which we intend to follow up in future work with a more targeted experiment across a much larger musical collection. In addition, we will also explore the potential for using bootstrapping measures from Text-IR [14] which have also been used for the evaluation of evaluation measures. Based on these outcomes, we hope to be in a position to make stronger recommendations concerning how best to conduct beat tracking evaluation, ideally towards a single unambiguous measurement of beat tracking accuracy. However, we should remain open to the possibility that different evaluation measures may be more appropriate than others and that this could depend on several factors, including: the goal of the evaluation; the types of beat tracking systems evaluated; how the ground truth was annotated; and the make up of the test dataset. 
To summarise, we believe the main contribution of this paper is to further raise the profile and importance of evaluation in MIR, and to encourage researchers to more strongly consider the properties of evaluation measures, rather than merely reporting accuracy scores and assuming them to be valid and correct. If we are to improve underlying analysis methods through iterative evaluation and refinement of algorithms, it is critical to optimise performance according to meaningful evaluation methodologies targeted towards specific scientific questions. While the analysis presented here has only been applied in the context of beat tracking, we believe there is scope for similar subjective vs. objective comparisons in other MIR topics such as chord recognition or structural segmentation, where subjective assessments should be obtainable via similar listening experiments to those used here. 5. ACKNOWLEDGMENTS This research was partially funded by the Media Arts and Technologies project (MAT), NORTE-07-0124-FEDER000061, financed by the North Portugal Regional Operational Programme (ON.2–O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT) as well as FCT post-doctoral grant SFRH/BPD/88722/2012. It was also supported by the European Union Seventh Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591). 6. REFERENCES [1] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of 12th International Society for Music Information Retrieval Conference, pages 591–596, 2011. [2] N. Collins. Towards Autonomous Agents for Live Computer Music: Realtime Machine Listening and Interactive Music Systems. PhD thesis, Centre for Music and Science, Faculty of Music, Cambridge University, 2006. [3] R. B. Dannenberg. Toward automated holistic beat tracking, music analysis, and understanding. In Proceedings of 6th International Conference on Music Information Retrieval, pages 366–373, 2005. [4] M. E. P. Davies, N. Degara, and M. D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. Technical Report C4DM-TR-09-06, Queen Mary University of London, Centre for Digital Music, 2009. [5] S. Dixon. Evaluation of audio beat tracking system beatroot. Journal of New Music Research, 36(1):39–51, 2007. [6] J. S. Downie. The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4):247–255, 2008. [7] C. Drake, A. Penel, and E. Bigand. Tapping in time with mechanically and expressively performed music. Music Perception, 18(1):1–23, 2000. [8] M. Goto and Y. Muraoka. Issues in evaluating beat tracking systems. In Working Notes of the IJCAI-97 Workshop on Issues in AI and Music - Evaluation and Assessment, pages 9–16, 1997. [9] P. Grosche, M. Müller, and C. S. Sapp. What Makes Beat Tracking Difficult? A Case Study on Chopin Mazurkas. In Proceedings of the 11th International Society for Music Information Retrieval Conference, pages 649–654, 2010. [10] A. Holzapfel, M. E. P. Davies, J. R. Zapata, J. Oliveira, and F. Gouyon. Selective sampling for beat tracking evaluation. IEEE Transactions on Audio, Speech and Language Processing, 20(9):2539–2460, 2012. [11] J. R. Iversen and A. D. Patel. 
The beat alignment test (BAT): Surveying beat processing abilities in the general population. In Proceedings of the 10th International Conference on Music Perception and Cognition, pages 465–468, 2008. [12] M. Miron, F. Gouyon, M. E. P. Davies, and A. Holzapfel. Beat-Station: A real-time rhythm annotation software. In Proceedings of the Sound and Music Computing Conference, pages 729–734, 2013. [13] D. Moelants and M. McKinney. Tempo perception and musical content: what makes a piece fast, slow or temporally ambiguous? In Proceedings of the 8th International Conference on Music Perception and Cognition, pages 558–562, 2004. [14] T. Sakai. Evaluating evaluation metrics based on the bootstrap. In Proceedings of the International ACM SIGIR conference on research and development in information retrieval, pages 525–532, 2006. [15] A. M. Stark. Musicians and Machines: Bridging the Semantic Gap in Live Performance. PhD thesis, Centre for Digital Music, Queen Mary University of London, 2011. [16] J. Urbano, M. Schedl, and X. Serra. Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems, 41(3):345–369, 2013. [17] J. R. Zapata, A. Holzapfel, M. E. P. Davies, J. L. Oliveira, and F. Gouyon. Assigning a confidence threshold on automatic beat annotation in large datasets. In Proceedings of 13th International Society for Music Information Retrieval Conference, pages 157–162, 2012. 642 15th International Society for Music Information Retrieval Conference (ISMIR 2014) IMPROVING RHYTHMIC TRANSCRIPTIONS VIA PROBABILITY MODELS APPLIED POST-OMR Maura Church Applied Math, Harvard University and Google Inc. Michael Scott Cuthbert Music and Theater Arts M.I.T. [email protected] [email protected] ticularly in searches such as chord progressions that rely on accurate recognition of multiple musical staves. ABSTRACT Despite many improvements in the recognition of graphical elements, even the best implementations of Optical Music Recognition (OMR) introduce inaccuracies in the resultant score. These errors, particularly rhythmic errors, are time consuming to fix. Most musical compositions repeat rhythms between parts and at various places throughout the score. Information about rhythmic selfsimilarity, however, has not previously been used in OMR systems. Understandably, the bulk of OMR research has focused on improving the algorithms for recognizing graphical primitives and converting them to musical objects based on their relationships on the staves. Improving score accuracy using musical knowledge (models of tonality, meter, form) has largely been relegated to “future work” sections and when discussed has focused on localized structures such as beams and measures and requires access to the “guts” of a recognition engine (see Section 6.2.2 in [9]). Improvements to score accuracy based on the output of OMR systems using multiple OMR engines have been suggested [2] and when implemented yielded results that were more accurate than individual OMR engines, though the results were not statistically significant compared to the best commercial systems [1]. Improving the accuracy of an OMR score using musical knowledge and a single engine’s output alone remains an open field. This paper describes and implements methods for using the prior probabilities for rhythmic similarities in scores produced by a commercial OMR system to correct rhythmic errors which cause a contradiction between the notes of a measure and the underlying time signature. 
Comparing the OMR output and post-correction results to hand-encoded scores of 37 polyphonic pieces and movements (mostly drawn from the classical repertory), the system reduces incorrect rhythms by an average of 19% (min: 2%, max: 36%). This paper proposes using rhythmic repetition and similarity within a score to create a model where measurelevel metrical errors can be fixed using correctly recognized (or at least metrically consistent) measures found in other places in the same score, creating a self-healing method for post-OMR processing conditioned on probabilities based on rhythmic similarity and statistics of symbolic misidentification. The paper includes a public release of an implementation of the model in music21 and also suggests future refinements and applications to pitch correction that could further improve the accuracy of OMR systems. 1. INTRODUCTION 2. PRIOR PROBABILITIES OF DISTANCE Millions of paper copies of musical scores are found in libraries and archival collections and hundreds of thousands of scores have already been scanned as PDFs in repositories such as IMSLP [5]. A scan of a score cannot, however, be searched or manipulated musically, so Optical Music Recognition (OMR) software is necessary to transform an image of a score into symbolic formats (see [7] for a recent synthesis of relevant work and extensive bibliography; only the most relevant citations from this work are included here). Projects such as Peachnote [10] show both the feasibility of recognizing large bodies of scores and also the limitations that errors introduce, par© Maura Church Michael Scott Cuthbert. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Maura Church and Michael Scott Cuthbert. “Improving Rhythmic Transcriptions via Probability Models Applied Post-OMR”, 15th International Society for Music Information Retrieval Conference, 2014. 643 Most Western musical scores, excepting those in certain post-common practice styles (e.g., Boulez, Cage), use and gain cohesion through a limited rhythmic vocabulary across measures. Rhythms are often repeated immediately or after a fixed distance (e.g., after a 2, 4, or 8 measure distance). In a multipart score, different instruments often employ the same rhythms in a measure or throughout a passage. From a parsed musical score, it is not difficult to construct a hash of the sequence of durations in each measure of each part (hereafter simply called “measure”; “measure stack” will refer to measures sounding together across all parts); if grace notes are handled separately, and interior voices are flattened (e.g., using the music21 chordify method) then hash-key collisions will only occur in the rare cases where two graphically distinct symbols equate to the same length in quarter notes (such as a dotted-triplet eighth note and a normal eighth). 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Within each part, the prior probability that a measure m0 will have the same rhythm as the measure n bars later (or earlier) can be computed (the prior-based-on-distance, or PrD). Similarly, the prior probability that, within a measure stack, part p will have the same rhythm as part q can also be computed (the prior-based-on-part, or PrP). computing these values independently for each OMR system and quality of scan, such work is beyond the scope of the current paper. 
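As a rough illustration of the per-measure rhythm hashing described above, the sketch below collects the quarter-length sequence of each measure of each part with music21 and uses the tuple itself as the hash key. This is a simplified reading of the approach — the function name is ours, grace notes are not treated specially, and it is not the released omr.correctors implementation.

from music21 import converter

def measure_rhythm_hashes(path):
    # Map (part_index, measure_number) -> tuple of quarter lengths in that measure.
    score = converter.parse(path)
    hashes = {}
    for p_idx, part in enumerate(score.parts):
        flat = part.chordify()  # flatten interior voices, as suggested in the text
        for measure in flat.getElementsByClass('Measure'):
            durations = tuple(float(n.duration.quarterLength)
                              for n in measure.notesAndRests)
            # The duration tuple itself serves as the hash key; grace notes are
            # not handled separately in this sketch.
            hashes[(p_idx, measure.number)] = durations
    return hashes

# Two measures (possibly in different parts) share a rhythm exactly when their
# tuples compare equal, which is the basis for computing the PrD and PrP priors.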
Therefore, we use Rossant and Bloch’s recognition rates, adjusting them for the differences between working with individual symbols (such as dots and note stems) and symbolic objects (such as dotted-eighth and quarter notes). The values used in this model are thus: c = .003, o = .009, a = .004, v = .016.1 As will become clear, more accurate measures would only improve the results given below. Subtracting these probabilities from 1.0, the rate of equality, e, is .968. Figure 1 shows these two priors for the violin I and viola parts of the first movement of Mozart K525 (Eine kleine Nachtmusik). Individual parts have their own characteristic shapes; for instance, the melodic violin I (top left), shows less rhythmic similarity overall than the viola (bot. left). This difference results from the greater rhythmic variety of the violin I part compared to the viola part. Moments of large-scale repetition such as between the exposition and recapitulation, however, are easily visible as spikes in the PrD graph for violin I. (Possible refinements to the model taking into account localized similarities are given at the end of this paper.) The PrP graphs (right) show that both parts are more similar to the violoncello part than to any other part. However, the viola is more similar to the cello (and to violin II) that violin I is to any other part. 3.2 Aggregate Change Distances The similarity of two measures can be calculated in a number of different ways, including the earth mover distance, the Hamming distance, and the minimum Levenshtein or edit distance. The nature of the change probabilities obtained from Rossant and Bloch along with the inherent difficulties of finding the one-to-one correspondence of input and output objects required for other methods, made Levenshtein distance the most feasible method. The probability that certain changes would occur in a given originally scanned measure (source, S) to transform it into the OMR output measure (destination, D) is determined by finding, through an implementation of edit distance, values for i, j, k, l, and m (for number of class changes, omissions, additions, value changes, and unchanged elements) that maximize: pS, D = c i o j a k v l e m (1) Equation (1), the prior-based-on-changes or PrC, can be used to derive a probability of rhythmic change due to OMR errors between any two arbitrary measures, but the model employed here concerns itself with measures with incorrect rhythms, or flagged measures. 3.3 Flagged Measures Let FPi be the set of flagged measures for part Pi, that is, measures whose total durations do not correspond to the total duration implied by the currently active time signature, and F = {FP1, …, FPj} for a score with j parts. (Measure stacks where each measure number is in F can be removed as probable pickup or otherwise intended incomplete measures, and long stretches of measures in F in all parts can be attributed to incorrectly identified time signatures and reevaluated, though neither of these refinements is used in this model). It is possible for rhythms within a measure to be incorrectly recognized without the entire measure being in F; though this problem only arises in the rare case where two rhythmic errors cancel out each other (as in a dotted quarter read as a quarter with an eighth read as a quarter in the same measure). Figure 1. Priors based on distance (l. in measure separation) and part (r.) for the violin I (top) and viola (bot.) parts in Mozart, K525. 3. 
PRIOR PROBABILITIES OF CHANGE 3.1 Individual Change Probabilities The probability that any given musical glyph will be read correctly or incorrectly is dependent on the quality of scan, the quality of original print, the OMR engine used, and the type of repertory. One possible generalization used in the literature [8] is to classify errors as class confusion (e.g., rest for note, with probability of occurring c), omissions (e.g., of whole symbols or of dots, tuplet marks: probability o), additions (a), and general value confusion (e.g., quarter for eighth: v). Other errors, such as sharp for natural or tie for slur, do not affect rhythmic accuracy. Although accuracy would be improved by 1 Rossant and Bloch give probabilities of change given that an error has occurred. The numbers given here are renormalizations of those error rates after removing the prior probability that an error has taken place. 644 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 4. INTEGRATING THE PRIORS For each m FPi, the measure n in part Pi with the highest likelihood of representing the prototype source rhythm before OMR errors were introduced is the source measure SD that maximizes the product of the prior-based-ondistance, that is, the horizontal model, and the priorbased-on-changes: SD = argmax(PrDn PrCn) n F. Figure 2. Mozart, K525 I, in OMR (l.) and scanned (r.) versions. (2) prior based on changes is much smaller (4 10-9). Violin I is not considered as a source since its measure has also been flagged as incorrect. Therefore the viola’s measure is used for SP. (In the highly unlikely case of equal probabilities, a single measure is chosen arbitrarily) Similarly, for each m in FP the measure t in the measure stack corresponding to m, with the highest likelihood of being the source rhythm for m, is the source measure SP that maximizes the product of the prior-based-on-part, that is, the vertical model, and the prior-based-on-changes: SP = argmax(PrPt PrCt) t F. A similar search is done for the other (unflagged) measures in the rest of the violin II part in order to find SD. In this case, the probability of SP exceeds that of SD, so the viola measure’s rhythm is, correctly, used for violin II. (3) Since the two priors PrD and PrP have not been normalized in any way, the best match from SD and SP can be obtained by simply taking the maximum of the two: S = argmax(P(m)) m in [SD, SP] 6. IMPLEMENTATION The model developed above was implemented using conversion and score manipulation routines from the opensource Python-based toolkit, music21 [4] and has been contributed back to the toolkit as the omr.correctors module in v.1.9 and above. Example 1 demonstrates a round-trip in MusicXML of a raw OMR score to a postprocessed score. (4) Given the assumption that the time signature and barlines have accurately been obtained and that each measure originally contained notes and rests whose total durations matched the underlying meter, we do not need to be concerned with whether S is a “better” solution for correcting m than the rhythms currently in m, since the probability of a flagged measure being correct is zero. Thus any solution has a higher likelihood of being correct than what was already there. (Real-world implementations, however, may wish to place a lower bound on P(S) to avoid substitutions that are below a minimum threshold to prevent errors being added that would be harder to fix than the original.) 
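Equations (2)–(4) amount to two argmax searches followed by taking the better of the two candidates. The sketch below restates that selection rule in plain Python; the callables prior_distance, prior_part and prior_changes are hypothetical stand-ins for PrD, PrP and PrC, and the function is an illustration rather than code from the released module.

def best_source_rhythm(m, part_measures, flagged_in_part,
                       stack_rhythms, flagged_parts,
                       prior_distance, prior_part, prior_changes):
    # Pick the most likely source rhythm for flagged measure m (Eqs. 2-4).
    destination = part_measures[m]   # the metrically inconsistent OMR rhythm

    # Eq. (2): horizontal model -- other measures of the same part, skipping flagged ones.
    horizontal = [(prior_distance(abs(n - m)) * prior_changes(rhythm, destination), rhythm)
                  for n, rhythm in part_measures.items()
                  if n != m and n not in flagged_in_part]

    # Eq. (3): vertical model -- the same measure stack in the other parts.
    vertical = [(prior_part(p) * prior_changes(rhythm, destination), rhythm)
                for p, rhythm in stack_rhythms.items()
                if p not in flagged_parts]

    # Eq. (4): the priors are unnormalised, so the best overall candidate is
    # simply the one with the larger product of prior and change probability.
    candidates = horizontal + vertical
    if not candidates:
        return None
    return max(candidates, key=lambda scored: scored[0])[1]

Run with priors like those in Section 3, this mirrors the behaviour described for the K525 example, where the viola measure wins because the product of its part prior and change prior exceeds every horizontal candidate.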
from music21 import * s = converter.parse('/tmp/k525omrIn.xml') sc = omr.correctors.ScoreCorrector(s) s2 = sc.run() s2.write('xml', fp='/tmp/k525post.xml') Example 1. Python/music21 code for correcting OMR errors in Mozart K525, I. Figure 3, below, shows the types of errors that the model is able, and in some cases unable, to correct. 5. EXAMPLE 7. RESULTS In this example from Mozart K525, mvmt. 1, measure stack 17, measures in both Violin I and Violin II have been flagged as containing rhythmic errors (marked in purple in Figure 2). Nine scores of four-movement quartets by Mozart (5),1 Haydn (1), and Beethoven (4) were used for the primary evaluation. (Mozart K525, mvmt. 1 was used as a test score for development and testing but not for evaluation.) Scanned scores came from out-of-copyright editions (mainly Breitkopf & Härtel) via IMSLP and were converted to MusicXML using SmartScore X2 Pro (v.10.5.5). Ground truth encodings in MuseData and MusicXML formats came via the music21 corpus originally from the Stanford’s CCARH repertories [6] and Project Gutenberg. Both the OMR software and our implementation of the method, described below, can identify the violin lines as containing rhythmic errors, but neither can know that an added dot in each part has caused the error. The vertical model (PrP * PrC) will look to the viola and cello parts for corrections to the violin parts. Violin II and viola share five rhythms (e5) and only one omission of a dot is required to transform the viola rhythm into violin II (o1), for a PrC of 0.0076. The prior on similarities between violin II and viola (PrP) is 0.57, so the complete probability of this transformation is 0.0043. The prior on similarities between violin II and cello is slightly higher, 0.64, but the 1 Mozart K156 is a three-movement quartet, however, both the ground truth and the OMR versions include the abandoned first version of the Adagio as a fourth movement. 645 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 3: Comparison of Mozart K525 I, mm. 35–39 in the original scan (top), SmartScore OMR output (middle), and after post-OMR processing (bot.). Flags 1–3 were corrected successfully; Flags 4 and 5 result in metrically plausible but incorrect emendations. The model was able to preserve the correct pitches for Flags 2 (added quarter rest) and Flag 3 (added augmentation dot). Flag 1 (omitted eighth note) is considered correct in this evaluation, based solely on rhythm, even though the pitch of the reconstructed eighth note is not correct. The proportion of suggestions taken from the horizontal (PrD) and vertical models (PrP) depended significantly on the number of parts in the piece. In Mozart K525 quartet, 72% of the suggestions came from the horizontal model while for the Schubert symphony (fourteen parts), only 39% came from the horizontal model. The pre-processed OMR movement was aligned with the ground truth by finding the minimum edit distance between measure hashes. This step was necessary for the many cases where the OMR version contained a different number of measures than the ground truth. The number of differences between the two versions of the same movement was recorded. A total of 29,728 measures with 7,196 flagged measures were examined. Flag rates ranged from 0.6% to 79.2% with a weighed mean of 24.2% and median of 21.7%. 8. 
APPLICATIONS The model has broad applications for improving the accuracy of scores already converted via OMR, but it would have greater impact as an element of an improved user experience within existing software. Used to its full potential, the model could help systems provide suggestions as users examine flagged measures. Even a small scale implementation could greatly improve the lengthy errorcorrecting process that currently must take place before a score is useable. See Figure 4 for an example interface. The model was then run on each OMR movement and the number of differences with the ground truth was recorded again. (In order to make the outputted score useful for performers and researchers, we added a simple algorithm to preserve as much pitch information as possible from the original measure.) From 2.1% to 36.1% of flagged measures were successfully corrected, with a weighed mean of 18.8% and median of 18.0%: a substantial improvement over the original OMR output. Manually checking the pre- and post-processed OMR scores against the ground truth showed that the highest rates of differences came from scores where single-pitch repetitions (tremolos) were spelled out in one source and written in abbreviated form in another; such differences could be corrected for in future versions. There was no significant correlation between the percentage of measures originally flagged and the correction rate (r = .17, p > .31). Figure 4. A sample interface improvement using the model described. The model was also run on two scores outside the classical string quartet repertory to test its further relevance. On a fourteenth-century vocal work (transcribed into modern notation), Gloria: Clemens Deus artifex and the first movement of Schubert’s “Unfinished” symphony, the results were similar to the previous findings (16.8% and 18.7% error reduction, respectively). A similar model to the one proposed here could also be integrated into OMR software to offer suggestions for pitch corrections if the user selects a measure that was not flagged for rhythmic errors. Integration within OMR software would also potentially give the model access to 646 15th International Society for Music Information Retrieval Conference (ISMIR 2014) rejected interpretations for measures that may become more plausible when rhythmic similarity within a piece is taken into account. [2] D. Byrd, M. Schindele: “Prospects for improving OMR with multiple recognizers,” Proc. ISMIR, Vol. 7, pp. 41–47, 2006. The model could be expanded to take into account spatial separation between glyphs as part of the probabilities. Simple extensions such as ignoring measures that are likely pickups or correcting wrong time signatures and missed barlines (resulting in double-length measures) have already been mentioned. Autocorrelation matrices, which would identify repeating sections such as recapitulations and rondo returns, would improve the prior-basedon-distance metric. Although the model runs quickly on small scores (in far less than the time to run OMR despite the implementation being written in an interpreted language), on larger scores the O(len(F) len(Part)) complexity of the horizontal model could become a problem (though correction of the lengthy Schubert score took less than ten minutes on an i7 MacBook Air). Because the prior-based-on-distance tends to fall off quickly, examining only a fixed-sized window worth of measures around each flagged measure would offer substantial speed-ups. [3] D. Byrd, J. G. 
Simonsen, “Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images,” http://www.informatics.indiana.edu/donbyrd/Papers/ OMRStandardTestbed_Final.pdf, in progress. [4] M. Cuthbert and C. Ariza: “music21: A Toolkit for Computer-Aided Musicology and Symbolic Music Data,” Proc. ISMIR, Vol. 11, pp. 637–42, 2010. [5] E. Guo et al.: Petrucci Music Library, imslp.org, 2006–. [6] W. Hewlett, et al.: MuseData: an Electronic Library of Classical Music Scores, musedata.org, 1994, 2000. [7] A. Rebelo, et al.: “Optical music recognition: Stateof-the-art and open issues,” International Journal of Multimedia Information Retrieval, Vol. 1, No. 3, pp. 173–190, 2012. Longer scores and scores with more parts offered more possibilities for high-probability correcting measures. Thus we encourage the creators of OMR competitions and standard OMR test examples [3] to include entire scores taken from standard repertories in their evaluation sets. [8] F. Rossant and I. Bloch, “A fuzzy model for optical recognition of musical scores,” Fuzzy sets and systems, Vol. 141, No. 2, pp. 165–201, 2004. [9] F. Rossant, I. Bloch: “Robust and adaptive OMR system including fuzzy modeling, fusion of musical rules, and possible error detection,” EURASIP Journal on Advances in Signal Processing, 2007. The potential of post-OMR processing based on musical knowledge is still largely untapped. Models of tonal behavior could identify transposing instruments and thus create better linkages between staves across systems that vary in the number of parts displayed. Misidentifications of time signatures, clefs, ties, and dynamics could also be reduced through comparison across parts and with similar sections in scores. While more powerful algorithms for graphical recognition will always be necessary, substantial improvements can be made quickly with the selective deployment of musical knowledge. [10] V. Viro: “Peachnote: Music score search and analysis platform,” Proc. ISMIR, Vol. 12, pp. 359– 362, 2011. 9. ACKNOWLEDGEMENTS The authors thank the Radcliffe Institute of Harvard University, the National Endowment for the Humanities/Digging into Data Challenge, the Thomas Temple Hoopes Prize at Harvard, and the School of Humanities, Arts, and Social Sciences, MIT, for research support, four anonymous readers for suggestions, and Margo Levine, Beth Chen, and Suzie Clark of Harvard’s Applied Math and Music departments for advice and encouragement. 10. REFERENCES [1] E. P. Bugge, et al.: “Using sequence alignment and voting to improve optical music recognition from multiple recognizers,” Proc. ISMIR, Vol. 12, pp. 405–410, 2011. 647 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 648 15th International Society for Music Information Retrieval Conference (ISMIR 2014) CLASSIFYING EEG RECORDINGS OF RHYTHM PERCEPTION Sebastian Stober, Daniel J. Cameron and Jessica A. Grahn Brain and Mind Institute, Department of Psychology, Western University, London, ON, Canada {sstober,dcamer25,jgrahn}@uwo.ca ABSTRACT Electroencephalography (EEG) recordings of rhythm perception might contain enough information to distinguish different rhythm types/genres or even identify the rhythms themselves. In this paper, we present first classification results using deep learning techniques on EEG data recorded within a rhythm perception study in Kigali, Rwanda. 
We tested 13 adults, mean age 21, who performed three behavioral tasks using rhythmic tone sequences derived from either East African or Western music. For the EEG testing, 24 rhythms – half East African and half Western with identical tempo and based on a 2-bar 12/8 scheme – were each repeated for 32 seconds. During presentation, the participants’ brain waves were recorded via 14 EEG channels. We applied stacked denoising autoencoders and convolutional neural networks on the collected data to distinguish African and Western rhythms on a group and individual participant level. Furthermore, we investigated how far these techniques can be used to recognize the individual rhythms. 1. INTRODUCTION Musical rhythm occurs in all human societies and is related to many phenomena, such as the perception of a regular emphasis (i.e., beat), and the impulse to move one’s body. However, the brain mechanisms underlying musical rhythm are not fully understood. Moreover, musical rhythm is a universal human phenomenon, but differs between human cultures, and the influence of culture on the processing of rhythm in the brain is uncharacterized. In order to study the influence of culture on rhythm processing, we recruited participants in East Africa and Canada to test their ability to perceive and produce rhythms derived from East African and Western music. Besides behavioral tasks, which have already been discussed in [4], the East African participants also underwent electroencephalography (EEG) recording while listening to East African and Western musical rhythms thus enabling us to study the neural mechanisms underlying rhythm perception. We were interested in differences between neuronal entrainment to the periodicities in East African versus Western rhythms for participants from those respective cultures. Entrainment was defined as c Sebastian Stober, Daniel J. Cameron and Jessica A. Grahn. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sebastian Stober, Daniel J. Cameron and Jessica A. Grahn. “Classifying EEG Recordings of Rhythm Perception”, 15th International Society for Music Information Retrieval Conference, 2014. 649 the magnitudes of steady state evoked potentials (SSEPs) at frequencies related to the metrical structure of rhythms. A similar approach has been used previously to study entrainment to rhythms [17, 18]. But it is also possible to look at the collected EEG data from an information retrieval perspective by asking questions like How well can we tell from the EEG whether a participant listened to an East African or Western rhythm? or Can we even say from a few seconds of EEG data which rhythm somebody listened to? Note that answering such question does not necessarily require an understanding of the underlying processes. Hence, we have attempted to let a machine figure out how best to represent and classify the EEG recordings employing recently developed deep learning techniques. In the following, we will review related work in Section 2, describe the data acquisition and pre-processing in Section 3 present our experimental findings in Section 4, and discuss further steps in Section 5. 2. RELATED WORK Previous research demonstrates that culture influences perception of the metrical structure (the temporal structure of strong and weak positions in rhythms) of musical rhythms in infants [20] and in adults [16]. However, few studies have investigated differences in brain responses underlying the cultural influence on rhythm perception. 
One study found that participants performed better on a recall task for culturally familiar compared to unfamiliar music, yet found no influence of cultural familiarity on neural activations while listening to the music while undergoing functional magnetic resonance imaging (fMRI) [15]. Many studies have used EEG and magnoencephalography (MEG) to investigate brain responses to auditory rhythms. Oscillatory neural activity in the gamma (20-60 Hz) frequency band is sensitive to accented tones in a rhythmic sequence and anticipates isochronous tones [19]. Oscillations in the beta (20-30 Hz) band increase in anticipation of strong tones in a non-isochronous sequence [5, 6, 10]. Another approach has measured the magnitude of SSEPs (reflecting neural oscillations entrained to the stimulus) while listening to rhythmic sequences [17, 18]. Here, enhancement of SSEPs was found for frequencies related to the metrical structure of the rhythm (e.g., the frequency of the beat). In contrast to these studies investigating the oscillatory activity in the brain, other studies have used EEG to investigate event-related potentials (ERPs) in responses to tones occurring in rhythmic sequences. This approach has been used to show distinct sensitivity to perturbations of the rhythmic pat- 15th International Society for Music Information Retrieval Conference (ISMIR 2014) tern vs. the metrical structure in rhythmic sequences [7], and to suggest that similar responses persist even when attention is diverted away from the rhythmic stimulus [12]. In the field of music information retrieval (MIR), retrieval based on brain wave recordings is still a very young and unexplored domain. So far, research has mainly focused on emotion recognition from EEG recordings (e.g., [3, 14]). For rhythms, however, Vlek et al. [23] already showed that imagined auditory accents can be recognized from EEG. They asked ten subjects to listen to and later imagine three simple metric patterns of two, three and four beats on top of a steady metronome click. Using logistic regression to classify accented versus unaccented beats, they obtained an average single-trial accuracy of 70% for perception and 61% for imagery. These results are very encouraging to further investigate the possibilities for retrieving information about the perceived rhythm from EEG recordings. In the field of deep learning, there has been a recent increase of works involving music data. However, MIR is still largely under-represented here. To our knowledge, no prior work has been published yet on using deep learning to analyze EEG recordings related to music perception and cognition. However, there are some first attempts to process EEG recordings with deep learning techniques. Wulsin et al. [24] used deep belief nets (DBNs) to detect anomalies related to epilepsy in EEG recordings of 11 subjects by classifying individual “channel-seconds”, i.e., onesecond chunks from a single EEG channel without further information from other channels or about prior values. Their classifier was first pre-trained layer by layer as an autoencoder on unlabelled data, followed by a supervised fine-tuning with backpropagation on a much smaller labeled data set. They found that working on raw, unprocessed data (sampled at 256Hz) led to a classification accuracy comparable to handcrafted features. Langkvist et al. [13] similarly employed DBNs combined with a hidden Markov model (HMM) to classify different sleep stages. 
Their data for 25 subjects comprises EEG as well as recordings of eye movements and skeletal muscle activity. Again, the data was segmented into one-second chunks. Here, a DBN on raw data showed a classification accuracy close to one using 28 hand-selected features. 3. DATA ACQUISITION & PRE-PROCESSING 3.1 Stimuli African rhythm stimuli were derived from recordings of traditional East African music [1]. The author (DC) composed the Western rhythmic stimuli. Rhythms were presented as sequences of sine tones that were 100ms in duration with intensity ramped up/down over the first/final 50ms and a pitch of either 375 or 500 Hz. All rhythms had a temporal structure of 12 equal units, in which each unit could contain a sound or not. For each rhythmic stimulus, two individual rhythmic sequences were overlaid – each at a different pitch. For each cultural type of rhythm, there were 2 groups of 3 individual rhythms for which rhythms could be overlaid with the others in their group. Because an individual rhythm could be one 650 Table 1. Rhythmic sequences in groups of three that pairings were based on. All ‘x’s denote onsets. Larger, bold ‘X’s denote the beginning of a 12 unit cycle (downbeat). Western Rhythms 1 X x x x 2 X x 3 X x x x x x x x x x x X x x x x x X x x x x x X x x 4 X x x 5 X x x x 6 X x x x x x x X x x x x x X x x x x x x x x x X x x x x x x x x x x x x x x x x x x x x x x x x x x x x x East African Rhythms 1 X 2 X 3 X x x x x x x x x x x x x x x X x x X x X 4 X 5 X 6 X x x x x x x x x X x x x x x x x X x x x x x x X x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x of two pitches/sounds, this made for a total of 12 rhythmic stimuli from each culture, each used for all tasks. Furthermore, rhythmic stimuli could be one of two tempi: having a minimum inter-onset interval of 180 or 240ms. 3.2 Study Description Sixteen East African participants were recruited in Kigali, Rwanda (3 female, mean age: 23 years, mean musical training: 3.4 years, mean dance training: 2.5 years). Thirteen of these participated in the EEG portion of the study as well as the behavioral portion. All participants were over the age of 18, had normal hearing, and had spent the majority of their lives in East Africa. They all gave informed consent prior to participating and were compensated for their participation, as per approval by the ethics boards at the Centre Hospitalier Universitaire de Kigali and the University of Western Ontario. After completion of the behavioral tasks, electrodes were placed on the participant’s scalp. They were instructed to sit with eyes closed and without moving for the duration of the recording, and to maintain their attention on the auditory stimuli. All rhythms were repeated for 32 seconds, presented in counterbalanced blocks (all East African rhythms then all Western rhythms, or vice versa), and with randomized order within blocks. All 12 rhythms of each type were presented – all at the same tempo (fast tempo for subjects 1–3 and 7–9, and slow tempo for the others). Each rhythm was preceded by 4 seconds of silence. EEG was recorded via a portable Grass EEG system using 14 channels at a sampling rate of 400Hz and impedances were kept below 10kΩ. 3.3 Data Pre-Processing EEG recordings are usually very noisy. They contain artifacts caused by muscle activity such as eye blinking as well as possible drifts in the impedance of the individual electrodes over the course of a recording. 
Furthermore, the recording equipment is very sensitive and easily picks up interferences from the surroundings. For instance, in this experiment, the power supply dominated the frequency band around 50Hz. All these issues have led to the common practice to invest a lot of effort 15th International Society for Music Information Retrieval Conference (ISMIR 2014) into pre-processing EEG data, often even manually rejecting single frames or channels. In contrast to this, we decided to put only little manual work into cleaning the data and just removed obviously bad channels, thus leaving the main work to the deep learning techniques. After bad channel removal, 12 channels remained for subjects 1–5 and 13 for subjects 6–13. We followed the common practice in machine learning to partition the data into training, validation (or model selection) and test sets. To this end, we split each 32s-long trial recording into three non-overlapping pieces. The first four seconds were used for the validation dataset. The rationale behind this was that we expected that the participants would need a few seconds in the beginning of each trial to get used to the new rhythm. Thus, the data would be less suited for training but might still be good enough to estimate the model accuracy on unseen data. The next 24 seconds were used for training and the remaining four seconds for testing. The data was finally converted into the input format required by the neural networks to be learned. 1 If the network just took the raw EEG data, each waveform was normalized to a maximum amplitude of 1 and then split into equally sized frames matching the size of the network’s input layer. No windowing function was applied and the frames overlapped by 75% of their length. If the network was designed to process the frequency spectrum, the processing involved: 1. computing the short-time Fourier transform (STFT) with given window length of 64 samples and 75% overlap, 2. computing the log amplitude, 3. scaling linearly to a maximum of 1 (per sequence), 4. (optionally) cutting of all frequency bins above the number requested by the network, 5. splitting the data into frames matching the network’s input dimensionality with a given hop size of 5 to control the overlap. As the classes were perfectly balanced for both tasks, we chose the accuracy, i.e., the percentage of correctly classified instances, as evaluation measure. Accuracy can be measured on several levels. The network predicts a class label for each input frame. Each frame is a segment from the time sequence of a single EEG channel. Finally, for each trial, several channels were recorded. Hence, it is natural to also measure accuracy also at the sequence (i.e, channel) and trial level. There are many ways to aggregate frame label predictions into a prediction for a channel or a trial. We tested the following three ways to compute a score for each class: • plain: sum of all 0-or-1 outputs per class • fuzzy: sum of all raw output activations per class • probabilistic: sum of log output activations per class While the latter approach which gathers the log likelihoods from all frames worked best for a softmax output layer, it usually performed worse than the fuzzy approach for the DLSVM output layer with its hinge loss (see below). The plain approach worked best when the frame accuracy was close to the chance level for the binary classification task. 
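To make the three aggregation schemes concrete, here is a minimal sketch (not the authors' pylearn2 code; the array shapes and function names are assumptions) of how per-frame outputs could be pooled into a single channel- or trial-level prediction.

import numpy as np

def aggregate(frame_outputs, scheme="fuzzy"):
    # frame_outputs: array of shape (n_frames, n_classes) holding the
    # network's raw output activations for every frame of one channel/trial.
    if scheme == "plain":          # sum of 0-or-1 votes per class
        votes = np.zeros_like(frame_outputs)
        votes[np.arange(len(frame_outputs)),
              frame_outputs.argmax(axis=1)] = 1.0
        return votes.sum(axis=0)
    if scheme == "fuzzy":          # sum of raw output activations per class
        return frame_outputs.sum(axis=0)
    if scheme == "probabilistic":  # sum of log output activations per class
        return np.log(np.clip(frame_outputs, 1e-12, None)).sum(axis=0)
    raise ValueError(scheme)

# The channel- or trial-level prediction is then the argmax of the
# aggregated scores, e.g.: label = aggregate(frame_outputs, "plain").argmax()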
Hence, we chose the plain aggregation scheme whenever the frame accuracy was below 52% on the validation set and otherwise the fuzzy approach. We expected significant inter-individual differences and therefore made learning good individual models for the participants our priority. We then tested configuration that worked well for individuals on three groups – all participants as well as one group for each tempo, containing 6 and 7 subjects respectively. 4.1 Classification into African and Western Rhythms 4.1.1 Multi-Layer Perceptron with Pre-Trained Layers Here, the number of retained frequency bins and the input length were considered as hyper-parameters. 4. EXPERIMENTS & FINDINGS All experiments were implemented using Theano [2] and pylearn2 [8]. 2 The computations were run on a dedicated 12-core workstation with two Nvidia graphics cards – a Tesla C2075 and a Quadro 2000. As the first retrieval task, we focused on recognizing whether a participant had listened to an East African or Western rhythm (Section 4.1). This binary classification task is most likely much easier than the second task – trying to predict one out of 24 rhythms (Section 4.2). Unfortunately, due to the block design of the study, it was not possible to train a classifier for the tempo. Trying to do so would yield a classifier that “cheated” by just recognizing the inter-individual differences because every participant only listened to stimuli of the same tempo. 1 Most of the processing was implemented through the librosa library available at https://github.com/bmcfee/librosa/. 2 The code to run the experiments is publicly available as supplementary material of this paper at http://dx.doi.org/10.6084/m9. figshare.1108287 651 Motivated by the existing deep learning approaches for EEG data (cf. Section 2), we choose to pre-train a MLP as an autoencoder for individual channel-seconds – or similar fixedlength chunks – drawn from all subjects. In particular, we trained a stacked denoising autoencoder (SDA) as proposed in [22] where each individual input was set to 0 with a corruption probability of 0.2. We tested several structural configurations, varying the input sample rate (400Hz or down-sampled to 100Hz), the number of layers, and the number of neurons in each layer. The quality of the different models was measured as the mean squared reconstruction error (MSRE). Table 2 gives an overview of the reconstruction quality for selected configurations. All SDAs were trained with tied weights, i.e., the weight matrix of each decoder layer equals the transpose of the respective encoder layer’s weight matrix. Each layer was trained with stochastic gradient descent (SGD) on minibatches of 100 examples for a maximum of 100 epochs with an initial learning rate of 0.05 and exponential decay. In order to turn a pre-trained SDA into a multilayer perceptron (MLP) for classification, we replaced the decoder part of the SDA with a DLSVM layer as proposed in [21]. 3 This special kind of output layer for classification uses the hinge 3 We used the experimental implementation for pylearn2 provided by Kyle Kastner at https://github.com/kastnerkyle/pylearn2/ blob/svm_layer/pylearn2/models/mlp.py 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Table 2. MSRE and classification accuracy for selected SDA (top, A-F) and CNN (bottom, G-I) configurations. 
neural network configuration id (sample rate, input format, hidden layer sizes) MSRE train MLP Classification Accuracy (for frames, channels and trials) in % test indiv. subjects fast (1–3, 7–9) slow (4–6, 10–13) all (1–13) A 100Hz, 100 samples, 50-25-10 (SDA for subject 2) 4.35 4.17 61.1 65.5 72.4 58.7 60.6 61.1 53.7 56.0 59.5 53.5 56.6 60.3 B 100Hz, 100 samples, 50-25-10 3.19 3.07 58.1 62.0 66.7 58.1 60.7 61.1 53.5 57.7 57.1 52.1 53.5 54.5 C 100Hz, 100 samples, 50-25 1.00 0.96 61.7 65.9 71.2 58.6 62.3 63.2 54.4 56.4 57.1 53.4 54.8 56.4 D 400Hz, 100 samples, 50-25-10 0.54 0.53 51.7 58.9 62.2 50.3 50.6 50.0 50.0 51.8 51.2 50.1 50.2 50.0 E 400Hz, 100 samples, 50-25 0.36 0.34 60.8 65.9 71.8 56.3 58.6 66.0 52.0 55.0 56.0 49.9 50.1 56.1 F 400Hz, 80 samples, 50-25-10 0.33 0.32 52.0 59.9 62.5 52.3 53.9 54.9 50.5 53.5 55.4 50.2 51.0 50.3 G 100Hz, 100 samples, 2 conv. layers 62.0 63.9 67.6 57.1 57.9 59.7 49.9 50.2 50.0 51.7 52.8 52.9 H 100Hz, 200 samples, 2 conv. layers 64.0 64.8 67.9 58.2 58.5 61.1 49.5 49.6 50.6 50.9 50.2 50.6 I 400Hz, 1s freq. spectrum (33 bins), 2 conv. layers 69.5 70.8 74.7 58.1 58.0 59.0 53.8 54.5 53.0 53.7 53.9 52.6 J 400Hz, 2s freq. spectrum (33 bins), 2 conv. layers 72.2 72.6 77.6 57.6 57.5 60.4 52.9 52.9 54.8 53.1 53.5 52.3 Figure 1. Boxplot of the frame-level accuracy for each individual subject aggregated over all configurations. 5 loss as cost function and replaces the commonly applied softmax. We observed much smoother learning curves and a slightly increased accuracy when using this cost function for optimization together with rectification as non-linearity in the hidden layers. For training, we used SGD with dropout regularization [9] and momentum, a high initial learning rate of 0.1 and exponential decay over each epoch. After training for 100 epochs on minibatches of size 100, we selected the network that maximized the accuracy on the validation dataset. We found that the dropout regularization worked really well and largely avoided over-fitting to the training data. In some cases, even a better performance on the test data could be observed. The obtained mean accuracies for the selected SDA configurations are also shown in Table 2 for MLPs trained for individual subjects as well as for the three groups. As Figure 1 illustrates, there were significant individual differences between the subjects. Whilst learning good classifiers appeared to be easy for subject 9, it was much harder for subjects 5 and 13. As expected, the performance for the groups was inferior. Best results were obtained for the “fast” group, which comprised only 6 subjects including 2 and 9 who were amongst the easiest to classify. We found that two factors had a strong impact on the MSRE: the amount of (lossy) compression through the autoencoder’s bottleneck and the amount of information the 5 Boxes show the 25th to 75th percentiles with a mark for the median within, whiskers span to furthest values within the 1.5 interquartile range, remaining outliers are shown as crossbars. network processes at a time. Configurations A, B and D had the highest compression ratio (10:1). C and E lacked the third autoencoder layer and thus only compressed at 4:1 and with a lower resulting MSRE. F had exactly twice the compression ratio as C and E. While the difference in the MSRE was remarkable between F and C, it was much less so between F and E – and even compared to D. This could be explained by the four times higher sample rate of D–F. 
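As a rough illustration of the tied-weight denoising autoencoder layers behind configurations A–F, the following sketch (a PyTorch-style reformulation under our own assumptions, not the authors' Theano/pylearn2 implementation; the ReLU non-linearity and the initialization here are illustrative choices) shows one such layer with the 0.2 input-corruption probability described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedDenoisingAE(nn.Module):
    # One denoising-autoencoder layer with tied weights: the decoder
    # uses the transpose of the encoder weight matrix.
    def __init__(self, n_in, n_hidden, corruption=0.2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_in) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_hidden))
        self.b_dec = nn.Parameter(torch.zeros(n_in))
        self.corruption = corruption

    def encode(self, x):
        return torch.relu(F.linear(x, self.W, self.b_enc))

    def forward(self, x):
        # Randomly zero out inputs (denoising criterion), then reconstruct.
        mask = (torch.rand_like(x) > self.corruption).float()
        h = self.encode(x * mask)
        return F.linear(h, self.W.t(), self.b_dec)

# Training one layer (sketch): minimise the mean squared reconstruction
# error (MSRE) against the uncorrupted input, e.g. for a 100-sample input
# frame and a 50-unit hidden layer as in configurations A-C:
# layer = TiedDenoisingAE(100, 50)
# loss = F.mse_loss(layer(x), x)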
Whilst A–E processed the same amount of samples at a time, the input for A–C contained much more information as they were looking at 1s of the signal in contrast to only 250ms. Judging from the MSRE, the longer time span appears to be harder to compress. This makes sense as EEG usually contains most information in the lower frequencies and higher sampling rates do not necessarily mean more content. Furthermore, with growing size of the input frames, the variety of observable signal patterns increases and they become harder to approximate. Figure 2 illustrates the difference between two reconstructions of the same 4s raw EEG input segment using configurations B and D. In this specific example, the MSRE for B is ten times as high compared to D and the loss of detail in the reconstruction is clearly visible. However, D can only see 250ms of the signal at a time whereas B processes one channel-second. Configuration A had the highest MSRE as it was only trained on data from subject 2 but needed to process all other subjects as well. Very surprisingly, the respective MLP produced much better predictions than B, which had identical structure. It is not clear what caused this effect. One explanation could be that the data from subject 2 was cleaner than for other participants as it also led to one amongst the best individual classification accuracies. 6 This could have led to more suitable features learned by the SDA. In general, the two-hidden-layer models worked better than the threehidden-layer ones. Possibly, the compression caused by the third hidden layer was just too much. Apart from this, it was hard to make out a clear “winner” between A, C and E. There seemed to be a trade-off between the accuracy of the reconstruction (by choosing a smaller window size and/or higher sampling rate) and learning more suitable features 6 Most of the model/learning parameters were selected by training just on subject 2. 652 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 1.0 0.25 0.5 0.0 0.5 1.0 0s 1s 2s 3s 1.0 0.00 4s 0.25 0.5 0.0 0.5 1.0 0s 1s 2s 3s 0.00 4s Figure 2. Input (blue) and its reconstruction (red) for the same 4s sequence from the test data. The background color indicates the squared sample error. Top: Configuration B (100Hz) with MSRE 6.43. Bottom: Configuration D (400Hz) with MSRE 0.64. (The bottom signals shows more higher-frequency information due to the four-times higher sampling rate.) Table 3. Structural parameters of the CNN configurations. input convolutional layer 1 convolutional layer 2 id dim. shape patterns pool stride shape patterns pool stride G 100x1 15x1 1 70x1 H 200x1 25x1 10 7 1 151x1 10 7 1 I 22x33 1x33 20 5 1 9x1 10 5 1 J 47x33 1x33 20 5 1 9x1 10 5 1 10 7 10 7 the fast stimuli (for subjects 2 and 9) and the slow ones (for subject 12) respectively. For each pair, we trained a classifier with configuration J using all but the two rhythms of the pair. 7 Due to the amount of computation required, we trained only for 3 epochs each. With the learned classifiers, the mean frame-level accuracy over all 144 rhythm pairs was 82.6%, 84.5% and 79.3% for subject 2, 9 and 12 respectively. These value were only slightly below those shown in Figure 1, which we considered very remarkable after only 3 training epochs. 1 for recognizing the rhythm type at a larger time scale. This led us to try a different approach using convolutional neural networks (CNNs) as, e.g., described in [11]. 
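The leave-two-out pairing used in the cross-trial check described above can be sketched as follows; the rhythm identifiers are placeholders, not the actual stimulus labels.

from itertools import product

# Hypothetical identifiers for the 12 East African and 12 Western rhythms.
east_african = ['ea_%02d' % i for i in range(1, 13)]
western = ['we_%02d' % i for i in range(1, 13)]

def cross_trial_splits():
    # Yield all 144 (East African, Western) test pairs together with the
    # rhythms used for training, i.e. everything except the held-out pair.
    for ea, we in product(east_african, western):
        train = [r for r in east_african + western if r not in (ea, we)]
        yield (ea, we), train

# Sanity check: 12 x 12 = 144 pairs in total.
assert sum(1 for _ in cross_trial_splits()) == 144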
4.2 Identifying Individual Rhythms 4.1.2 Convolutional Neural Network We decided on a general layout consisting of two convolutional layers where the first layer was supposed to pick up beat-related patterns while the second would learn to recognize higher-level structures. Again, a DLSVM layer was used for the output and the rectifier non-linearity in the hidden layers. The structural parameters are listed in Table 3. As pooling operation, the maximum was applied. Configurations G and H processed the same raw input as A–F whereas I and J took the frequency spectrum as input (using all 33 bins). All networks were trained for 20 epochs using SGD with a momentum of 0.5 and an exponential decaying learning rate initialized at 0.1. The obtained accuracy values are listed in Table 2 (bottom). Whilst G and H produced results comparable to A–F, the spectrum-based CNNs, I and J, clearly outperformed all other configurations for the individual subjects. For all but subjects 5 and 11, they showed the highest frame-level accuracy (c.f. Figure 1). For subjects 2, 9 and 12, the trial classification accuracy was even higher than 90% (not shown). 4.1.3 Cross-Trial Classification In order to rule out the possibility that the classifiers just recognized the individual trials – and not the rhythms – by coincidental idiosyncrasies and artifacts unrelated to rhythm perception, we additionally conducted a cross-trial classification experiment. Here, we only considered all subjects with frame-level accuracies above 80% in the earlier experiments – i.e., subjects 2, 9 and 12. We formed 144 rhythm pairs by combining each East African with each Western rhythm from 653 Recognizing the correct rhythm amongst 24 candidates was a much harder task than the previous one – especially as all candidates had the same meter and tempo. The chance level for 24 evenly balanced classes was only 4.17%. We used again configuration J as our best known solution so far and trained an individual classifier for each subject. As Figure 3 shows, the accuracy on the 2s input frames was at least twice the chance level. Considering that these results were obtained without any parameter tuning, there is probably still much room for improvements. Especially, similarities amongst the stimuli should be considered as well. 5. CONCLUSIONS AND OUTLOOK We obtained encouraging first results for classifying chunks of 1-2s recorded from a single EEG channel into East African or Western rhythms using convolutional neural networks (CNNs) and multilayer perceptrons (MLPs) pre-trained as stacked denoising autoencoders (SDAs). As it turned out, some configurations of the SDA (D and F) were especially suited to recognize unwanted artifacts like spikes in the waveforms through the reconstruction error. This could be elaborated in the future to automatically discard bad segments during preprocessing. Further, the classification accuracy for individual rhythms was significantly above chance level and encourages more research in this direction. As this has been an initial and by no means exhaustive exploration of the model- and leaning parameter space, there seems to be a lot more potential – especially in CNNs processing the frequency spectrum – and 7 Deviating from the description given in Section 3.3, we used the first 4s of each recording for validation and the remaining 28s for training as the test set consisted of full 32s from separate recordings in this special case. 
15th International Society for Music Information Retrieval Conference (ISMIR 2014) subject 1 2 3 4 5 6 7 8 9 10 11 12 13 mean accuracy 15.8% 9.9% 12.0% 21.4% 10.3% 13.9% 16.2% 11.0% 11.0% 10.3% 9.2% 17.4% 8.3% 12.8% precision @3 31.5% 29.9% 26.5% 48.2% 28.3% 27.4% 41.2% 27.8% 28.5% 33.2% 24.7% 39.9% 20.7% 31.4% mean reciprocal rank 0.31 0.27 0.27 0.42 0.26 0.28 0.36 0.27 0.28 0.30 0.25 0.36 0.23 0.30 Figure 3. Confusion matrix for all subjects (left) and per-subject performance (right) for predicting the rhythm (24 classes). we will continue to look for better designs than those considered here. We are also planning to create publicly available data sets and benchmarks to attract more attention to these challenging tasks from the machine learning and information retrieval communities. As expected, individual differences were very high. For some participants, we were able to obtain accuracies above 90%, but for others, it was already hard to reach even 60%. We hope that by studying the models learned by the classifiers, we may shed some light on the underlying processes and gain more understanding on why these differences occur and where they originate. Also, our results still come with a grain of salt: We were able to rule out side effects on a trial level by successfully replicating accuracies across trials. But due to the study’s block design, there remains still the chance that unwanted external factors interfered with one of the two blocks while being absent during the other one. Here, the analysis of the learned models could help to strengthen our confidence in the results. The study is currently being repeated with North America participants and we are curious to see whether we can replicate our findings. Furthermore, we want to extend our focus by also considering more complex and richer stimuli such as audio recordings of rhythms with realistic instrumentation instead of artificial sine tones. Acknowledgments: This work was supported by a fellowship within the Postdoc-Program of the German Academic Exchange Service (DAAD), by the Natural Sciences and Engineering Research Council of Canada (NSERC), through the Western International Research Award R4911A07, and by an AUCC Students for Development Award. 6. REFERENCES [1] G.F. Barz. Music in East Africa: experiencing music, expressing culture. Oxford University Press, 2004. [2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proc. of the Python for Scientific Computing Conference (SciPy), 2010. [3] R. Cabredo, R.S. Legaspi, P.S. Inventado, and M. Numao. An emotion model for music using brain waves. In ISMIR, pages 265–270, 2012. [4] D.J. Cameron, J. Bentley, and J.A. Grahn. Cross-cultural influences on rhythm processing: Reproduction, discrimination, and beat tapping. Frontiers in Human Neuroscience, to appear. [5] T. Fujioka, L.J. Trainor, E.W. Large, and B. Ross. Beta and gamma rhythms in human auditory cortex during musical beat processing. Annals of the New York Academy of Sciences, 1169(1):89–92, 2009. [6] T. Fujioka, L.J. Trainor, E.W. Large, and B. Ross. Internalized timing of isochronous sounds is represented in neuromagnetic beta oscillations. The Journal of Neuroscience, 32(5):1791–1802, 2012. [7] E. Geiser, E. Ziegler, L. Jancke, and M. Meyer. Early electrophysiological correlates of meter and rhythm processing in music perception. Cortex, 45(1):93–102, 2009. [8] I.J. Goodfellow, D. Warde-Farley, P. 
Lamblin, V. Dumoulin, M. Mirza, R. Pascanu, J. Bergstra, F. Bastien, and Y. Bengio. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013. [9] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012. [10] J.R. Iversen, B.H. Repp, and A.D. Patel. Top-down control of rhythm perception modulates early auditory responses. Annals of the New York Academy of Sciences, 1169(1):58–73, 2009. [11] A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012. [12] O. Ladinig, H. Honing, G. Háden, and I. Winkler. Probing attentive and preattentive emergent meter in adult listeners without extensive music training. Music Perception, 26(4):377–386, 2009. [13] M. Längkvist, L. Karlsson, and M. Loutfi. Sleep stage classification using unsupervised feature learning. Advances in Artificial Neural Systems, 2012:5:5–5:5, Jan 2012. [14] Y.-P. Lin, T.-P. Jung, and J.-H. Chen. EEG dynamics during music appreciation. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual Int. Conf. of the IEEE, pages 5316–5319, 2009. [15] S.J. Morrison, S.M. Demorest, E.H. Aylward, S.C. Cramer, and K.R. Maravilla. Fmri investigation of cross-cultural music comprehension. Neuroimage, 20(1):378–384, 2003. [16] S.J. Morrison, S.M. Demorest, and L.A. Stambaugh. Enculturation effects in music cognition the role of age and music complexity. Journal of Research in Music Education, 56(2):118–129, 2008. [17] S. Nozaradan, I. Peretz, M. Missal, and A. Mouraux. Tagging the neuronal entrainment to beat and meter. The Journal of Neuroscience, 31(28):10234–10240, 2011. [18] S. Nozaradan, I. Peretz, and A. Mouraux. Selective neuronal entrainment to the beat and meter embedded in a musical rhythm. The Journal of Neuroscience, 32(49):17572–17581, 2012. [19] J.S. Snyder and E.W. Large. Gamma-band activity reflects the metric structure of rhythmic tone sequences. Cognitive brain research, 24(1):117–126, 2005. [20] G. Soley and E.E. Hannon. Infants prefer the musical meter of their own culture: a cross-cultural comparison. Developmental psychology, 46(1):286, 2010. [21] Y. Tang. Deep Learning using Linear Support Vector Machines. arXiv preprint arXiv:1306.0239, 2013. [22] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, Dec 2010. [23] R.J. Vlek, R.S. Schaefer, C.C.A.M. Gielen, J.D.R. Farquhar, and P. Desain. Shared mechanisms in perception and imagery of auditory accents. Clinical Neurophysiology, 122(8):1526–1532, Aug 2011. [24] D.F. Wulsin, J.R. Gupta, R. Mani, J.A. Blanco, and B. Litt. Modeling electroencephalography waveforms with semi-supervised deep belief nets: fast classification and anomaly measurement. Journal of Neural Engineering, 8(3):036015, Jun 2011. 
MIREX Oral Session

TEN YEARS OF MIREX: REFLECTIONS, CHALLENGES AND OPPORTUNITIES
J. Stephen Downie, University of Illinois, jdownie@illinois.edu
Xiao Hu, University of Hong Kong, xiaoxhu@hku.hk
Jin Ha Lee, University of Washington, jinhalee@uw.edu
Kahyun Choi, University of Illinois, ckahyu2@illinois.edu
Sally Jo Cunningham, University of Waikato, sallyjo@waikato.ac.nz
Yun Hao, University of Illinois, yunhao2@illinois.edu

ABSTRACT
The Music Information Retrieval Evaluation eXchange (MIREX) has been run annually since 2005, with the October 2014 plenary marking its tenth iteration. By 2013, MIREX has evaluated approximately 2000 individual music information retrieval (MIR) algorithms for a wide range of tasks over 37 different test collections. MIREX has involved researchers from over 29 different countries with a median of 109 individual participants per year. This paper summarizes the history of MIREX from its earliest planning meeting in 2001 to the present. It reflects upon the administrative, financial, and technological challenges MIREX has faced and describes how those challenges have been surmounted. We propose new funding models, a distributed evaluation framework, and more holistic user experience evaluation tasks (some evolutionary, some revolutionary) for the continued success of MIREX. We hope that this paper will inspire MIR community members to contribute their ideas so MIREX can have many more successful years to come.

1. INTRODUCTION
Music Information Retrieval Evaluation eXchange (MIREX) is an annual evaluation campaign managed by the International Music Information Retrieval Systems Evalu