Oral Session 7 Recommendation & Listeners 437 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 438 15th International Society for Music Information Retrieval Conference (ISMIR 2014) TASTE SPACE VERSUS THE WORLD: AN EMBEDDING ANALYSIS OF LISTENING HABITS AND GEOGRAPHY Joshua L. Moore, Thorsten Joachims Cornell University, Dept. of Computer Science {jlmo|tj}@cs.cornell.edu ABSTRACT Douglas Turnbull Ithaca College, Dept. of Computer Science [email protected] Hauger et al. have matched to cities and other geographic descriptors as well. Our goal in this work is to use embedding methods to enable a more thorough analysis of geographic and cultural patterns in this data by embedding cities and the artists from track plays in those cities into a joint space. The resulting taste space gives us a way to directly measure city/city, city/artist, and artist/artist affinities. After verifying the predictive fidelity of the learned taste space, we explore the surprisingly clear segmentations in taste space across geographic, cultural, and linguistic borders. In particular, we find that the taste space of cities gives us a remarkably clear image of some cultural and linguistic phenomena that transcend geography. Probabilistic embedding methods provide a principled way of deriving new spatial representations of discrete objects from human interaction data. The resulting assignment of objects to positions in a continuous, low-dimensional space not only provides a compact and accurate predictive model, but also a compact and flexible representation for understanding the data. In this paper, we demonstrate how probabilistic embedding methods reveal the “taste space” in the recently released Million Musical Tweets Dataset (MMTD), and how it transcends geographic space. In particular, by embedding cities around the world along with preferred artists, we are able to distill information about cultural and geographical differences in listening patterns into spatial representations. These representations yield a similarity metric among city pairs, artist pairs, and cityartist pairs, which can then be used to draw conclusions about the similarities and contrasts between taste space and geographic location. 2. RELATED WORK Embeddings methods have been applied to many different modeling and information retrieval tasks. In the field of music IR, these models have been used for tag prediction and song similarity metrics, as in the work of Weston et al. [7]. However, instead of a prediction task such as this, we intend to focus on data analysis tasks. Therefore, we rely on generative models like those proposed in our previous work [5, 6] and by Aizenberg et al [1]. Our prior work uses models which rely on sequences of songs augmented with social tags [5] or per-user song sequences with temporal dynamics [6]. The aim of this work differs from that of our previous work in that we are interested in aggregate global patterns and not in any particular playlist-related task, so we do not adopt the notion of song sequences. We also are concerned with geographic differences in listening patterns, and so we ignore individual users in favor of embedding entire cities into the space. Aizenberg et al. utilize generative models like those in our work for purposes of building a recommendation engine for music from Internet radio data on the web. 
However, their work focuses on building a powerful recommendation system using freely available data, and does not focus on the use of the resulting models for data analysis, nor do they concern themselves with geographic data. The data set which we will use throughout this work was published by Hauger et al. [3]. The authors of this work crawled Twitter for 17 months, looking for tweets which carried certain key words, phrases, or hashtags in order to find posts which signal that a user is listening to a track and for which the text of the tweet could be matched to a particular artist and track. In addition, the data was selected for only tweets with geographical tags (in the form 1. INTRODUCTION Embedding methods are a type of machine learning algorithm for distilling large amounts of data about discrete objects into a continuous and semantically meaningful representation. These methods can be applied even when only contextual information about the objects, such as cooccurrence statistics or usage data, is available. For this reason and due to the easy interpretability of the resulting models, embeddings have become popular for tasks in many fields, including natural language processing, information retrieval, and music information retrieval. Recently, embeddings have been shown to be a useful tool for analyzing trends in music listening histories [6]. In this paper, we learn embeddings that give insight into how music preferences relate to geographic and cultural boundaries. Our input data is the Million Musical Tweets Dataset (MMTD), which was recently collected and curated by Hauger et al. [3]. This dataset consists of over a million tweets containing track plays and rich geographical information in the form of globe coordinates, which c Joshua L. Moore, Thorsten Joachims, Douglas Turnbull. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Joshua L. Moore, Thorsten Joachims, Douglas Turnbull. “Taste Space Versus the World: an Embedding Analysis of Listening Habits and Geography”, 15th International Society for Music Information Retrieval Conference, 2014. 439 15th International Society for Music Information Retrieval Conference (ISMIR 2014) of GPS coordinates), and temporal data was retained. The final product is a large data set of geographically and temporally tagged music plays. In their work, the authors emphasize the collection of this impressive data set and a thorough description of the properties of the data set. The authors do add some analyses of the data, but the geographic analysis is limited to only a few examples of coarse patterns found in the data. The primary contribution of our work over the work presented in that paper is to greatly extend the scope of the geographic analysis, presenting a much clearer and more exhaustive view of the differences in musical taste across regions, countries, and languages. Finally, we describe how geographic information can be useful for various music IR tasks. Knopke [4] also discusses how geospatial data can be exploited for music marketing and musicological research. We use embedding as a tool to further explore these topics. Others, such as Lamere’s Roadtrip Mixtape 1 app, have developed systems that use a listeners location to generate a playlist of relevant music by local artists. (X,Y,p) = max X,Y,p = max −||X(ci)−Y (ai)||22 +pai −log(Z(ai)). We solve this optimization problem using a Stochastic Gradient Descent approach. 
First, each embedding vector X(·) and Y (·) is randomly initialized to a point in the unit ball in Rd (for the chosen dimension d). Then, the model parameters are updated in sequential stochastic gradient steps until convergence. The partition function Z(·) presents an optimization challenge, in that a naı̈ve optimization strategy requires O(|A|2 ) time for each pass over the data. For this work, we used our C++ implementation of the efficient training method employed in [6], an approximate method that estimates the partition function for efficient training. This implementation is available by request, and will later be available on the project website, http://lme.joachims.org. 3.1 Interpretation of Embedding Space As defined above, the model gives us a joint space in which both cities and artists are represented through their respective embedding vectors X(·) and Y (·). Related works have found such embedding spaces to be rich with semantic significance, compactly condensing the patterns present in the training data. Distances in embedding space reveal relationships between objects, and visual or spatial inspection of the resulting models quickly reveals a great deal of segmentation in the space. In particular, joint embeddings yield similarity metrics among the various types of embedded objects, even though individual dimensions in the embedding space have no explicit meaning (e.g. the embeddings are rotation invariant). In our case, this specifically entails the following three measures of similarity: City to Artist: this is the only similarity metric explicitly formulated in the model, and it reflects the distribution Pr(a|c) that we directly observe data for. In particular, we directly optimize the positions of cities and artists so that cities have a high probability of listening to artists which they were observed playing in the dataset. This requires placing the city and artist nearby in the embedding space, so proximity in the embedding space can be interpreted as an affinity between a city and an artist. Artist to Artist: due to the learned conditional probability distributions’ being constrained by the metric space, two artists which are placed near each other in the space will have a similar probability mass in each city’s distribution. This implies a kind of exchangeability or similarity, since any city which is likely to listen to one artist is likely to listen to the other in the model distribution. City to City: finally, the form of similarity on which we will most rely in this work is that among cities. Again due to the metric space, two nearby cities will assign similar masses to each artist, and so will have very similar distributions over artists in the model. This implies a similarity in musical taste or preferred artists between two cities. The third type of similarity will form the basis for most of the analyses in this paper. In particular, we are interested The embedding model used in this paper is similar to the one used in our previous work [6]. However, the following analysis focuses on geographical patterns instead of temporal dynamics and trends. In particular, we focus on the relationships among cities and artists, and so we elect to condense the geographical information in a tweet down to the city from which it came. Similarly, we discard the track name from each tweet and use only the artist for the song. This leads to a joint embedding of cities and artists. 
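To make the three similarity measures of Section 3.1 concrete, the sketch below shows how city/artist, artist/artist, and city/city affinities can be read off a trained model as (negative) squared Euclidean distances between embedding vectors. This is a minimal illustration, not the authors' released implementation: the embedding matrices X and Y, the popularity terms p, and the name lists are stand-ins for the outputs of a model trained as described above.

```python
import numpy as np

def nearest(query_vec, matrix, names, k=5):
    """Return the k names whose embedding vectors lie closest to query_vec
    (for a city query against the city matrix, the query city itself comes first)."""
    d2 = np.sum((matrix - query_vec) ** 2, axis=1)   # squared Euclidean distances
    return [names[i] for i in np.argsort(d2)[:k]]

# Toy stand-ins for a trained model: X holds city embeddings, Y artist embeddings,
# p the per-artist popularity bias (all assumed, for illustration only).
rng = np.random.default_rng(0)
d = 10
cities = ["Paris", "Brussels", "Chicago", "Atlanta"]
artists = ["Artist A", "Artist B", "Artist C"]
X = rng.normal(size=(len(cities), d))
Y = rng.normal(size=(len(artists), d))
p = rng.normal(size=len(artists))

city_idx = {c: i for i, c in enumerate(cities)}
artist_idx = {a: i for i, a in enumerate(artists)}

# City-to-city similarity: nearest cities in taste space.
print(nearest(X[city_idx["Paris"]], X, cities, k=3))

# Artist-to-artist similarity: nearest artists in taste space.
print(nearest(Y[artist_idx["Artist A"]], Y, artists, k=2))

# City-to-artist affinity: the link-function score -||X(c) - Y(a)||^2 + p_a,
# so a higher score corresponds to a higher model probability Pr(a|c).
c = X[city_idx["Chicago"]]
scores = -np.sum((Y - c) ** 2, axis=1) + p
print([artists[i] for i in np.argsort(-scores)])
```

The same distance computations drive the nearest-neighbor queries and clusterings used in the analyses later in the paper.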
3. PROBABILISTIC EMBEDDING MODEL

At the core of the embedding model lies a probabilistic link function that connects the observed data to the underlying semantic space. Intuitively, the link function we use states that the probability Pr(a|c) of a given city c playing a given artist a decreases with the squared Euclidean distance ||X(c) − Y(a)||₂² between that city and that artist in an embedding space of a chosen dimension d. X(c) and Y(a) are the embedding locations of city c and artist a, respectively. Similar to previous works, we also incorporate a popularity bias term p_a for each artist to model global popularity. More formally, the probability for a city c to play an artist a is:

Pr(a|c) = \frac{\exp(-\|X(c) - Y(a)\|_2^2 + p_a)}{\sum_{a' \in A} \exp(-\|X(c) - Y(a')\|_2^2 + p_{a'})}.

The sum in the denominator is over the set A of artists. This sum is known as the partition function, denoted Z(·), and serves to normalize the distribution over artists. Determining the embedding locations X(c) and Y(a) for all cities and artists (and the popularity terms p_a) is the learning problem the embedding method must solve. To fit a model to the data, we maximize the log-likelihood formed by the sum of log-probabilities log Pr(a_i|c_i) over the training pairs (c_i, a_i) ∈ D:

(X, Y, p) = \arg\max_{X,Y,p} \sum_{(c_i, a_i) \in D} \log \Pr(a_i \mid c_i).

1 http://labs.echonest.com/CityServer/roadtrip.html

4.1 Quantitative Evaluation of the Model

Before we inspect our model in order to make qualitative claims about the patterns in the data, we first wish to evaluate it on a quantitative basis. This is essential in order to confirm that the model accurately captures the relations among cities and artists, which will offer validation for the conclusions we draw later in the work.

4.1.1 Evaluating Model Fidelity

First, we considered the performance of the model in terms of perplexity, which is a reformulation of the log-likelihood objective outside of a log scale. This is a commonly used measure of performance in other areas of research where models similar to ours are used, such as natural language processing [2]. The perplexity p is related to the average log-likelihood L by the transformation p = exp(−L). Our baseline is the unigram distribution, which assumes that Pr(a|c) is directly proportional to the number of tweets artist a received in the entire data set, independent of the city. Estimating the unigram distribution from the training set and using it to calculate the perplexity on the validation set yielded a perplexity of 589 (very similar to the perplexity attained when estimating this distribution from the training set and calculating the perplexity on the training set itself). Our model offered a great improvement over this – the 100-dimensional model yielded a perplexity on the validation set of 290, while the 2-dimensional model reached a perplexity of 357. This improvement suggests that our model has captured a significant amount of useful information from the data.

Figure 1: Precision at k of our model, a cosine similarity baseline, a tweet count ranking baseline, and a random baseline on a city/artist tweet prediction task.

in the connection between the metric space of cities in the embedding space and another metric space: the one formed by the geographic distribution of cities on the Earth's surface. As we will see, these two spaces differ greatly, and the taste space of cities gives us a clear image of some cultural and linguistic phenomena that transcend geography.

4.
EXPERIMENTS We use the MMTD data set presented by Hauger et al. [3]. This data set contains nearly 1.1 million tweets with geographical data. We pre-process the data by condensing each tweet to a city/artist pair, which results in a city/artist affinity matrix used to train the model. Next, we discard all cities and artists which have not appeared at least 100 times in the data, as well as all cities for which fewer than 30 distinct users tweeted from that city. The post-processed data contains 1,017 distinct cities and 1,499 distinct artists. 4.1.2 Evaluating Predictive Accuracy Second, we created a task to evaluate the predictive power of our model. To this end, we split the data chronologically into two halves, and further divided the first half into a training set and a validation set. Using the first half of the data, we trained a 100-dimensional model. Our goal is to use this model to predict which new artists various cities will begin listening to in the second half of the data. We accomplish this by considering, for each city, the set of artists which had no observed tweets in that city in the first half of the data. We then sorted these artists by their score in the model – namely, for city c and artist a, the function −||X(c) − Y (a)||22 + pa . Using this ordering as a ranking function, we calculated the precision at k of our ranking for various values of k, where an artist is considered to be relevant if that artist receives at least one tweet from that city in the second half of the data. We average the results of each city’s ranking. We compare the performance of our model on this task to three baselines. First, we consider a random ranking of all the artists which a city has not yet tweeted. Second, we sort the yet untweeted artists by their raw global tweet count in the first half of the data – which we label the unigram baseline. Third, we use the raw artist tweet counts for a city’s nearest neighbor city in the first half of data to rank untweeted artists for that city. In this case, the nearest For choosing model parameters, we randomly selected 80% of the tweets for the training set, and the remaining 20% for the validation set. This resulted in a training set of 390,077 tweets and a validation set of 97,592 tweets. We used the validation set both to determine stopping criteria for the optimization as well as to choose the initial stochastic gradient step size η0 from the set {0.25, 0.1, 0.05, 0.01} and to evaluate the quality of models of dimension {2, 50, 100}. The optimal step size varied from model to model, but the 100-dimensional model consistently out-performed the others (although the difference between it and the 50dimensional model was small). We will analyze the data through the trained embedding models, both through spatial analyses (i.e. nearest neighbor queries and clusterings) and through visual inspection. In general, the high-dimensional model better captures the data, and so we will use it when direct visual inspection is not required. But first, we evaluate the quality of the model through quantitative means. 441 15th International Society for Music Information Retrieval Conference (ISMIR 2014) neighbor is not determined using our embedding but rather based on the maximum cosine similarity between the vector of artist tweet counts for the city and the vectors of tweet count for all other cities. The results can be seen in Figure 1. 
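The ranking evaluation described above can be reproduced in a few lines: for each city, artists unseen in the first half of the data are ranked by the model score −||X(c) − Y(a)||² + p_a, and precision at k is the fraction of the top-k artists that the city actually tweets in the second half. The sketch below is an illustrative re-implementation under assumed inputs (embedding matrices, popularity terms, and per-city sets of observed artist indices), not the authors' evaluation code; the use_popularity flag mirrors the variant discussed next, where the popularity terms are dropped at ranking time.

```python
import numpy as np

def precision_at_k(X, Y, p, seen_first_half, seen_second_half, k=10, use_popularity=True):
    """Average precision@k over cities for predicting newly tweeted artists.

    X: (n_cities, d) city embeddings; Y: (n_artists, d) artist embeddings;
    p: (n_artists,) popularity terms; seen_*: list of sets of artist indices per city.
    """
    precisions = []
    for c in range(X.shape[0]):
        candidates = np.array([a for a in range(Y.shape[0]) if a not in seen_first_half[c]])
        if candidates.size == 0:
            continue
        scores = -np.sum((Y[candidates] - X[c]) ** 2, axis=1)
        if use_popularity:
            scores = scores + p[candidates]
        top = candidates[np.argsort(-scores)[:k]]
        relevant = sum(1 for a in top if a in seen_second_half[c])
        precisions.append(relevant / k)
    return float(np.mean(precisions))

# Tiny synthetic example, just to show the expected call signature.
rng = np.random.default_rng(1)
X, Y, p = rng.normal(size=(3, 8)), rng.normal(size=(20, 8)), rng.normal(size=20)
first = [set(rng.choice(20, 5, replace=False)) for _ in range(3)]
second = [set(rng.choice(20, 8, replace=False)) for _ in range(3)]
print(precision_at_k(X, Y, p, first, second, k=5))
```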
At k = 1, our model correctly guesses an artist that a city will later tweet with 64% accuracy, compared to 46% for the cosine similarity, 42% for unigram and around 5% for the random baseline. This advantage is consistent as k increases, with our method attaining about 24% precision at 100, compared to 18% for unigram and 14% for cosine similarity. We also show the performance of the same model at this task when popularity terms are excluded from the scoring function at ranking time. Interestingly, the performance in this case is still quite good. We see precision at 1 of about 51% in this case, with the gap between this method and the method with popularity terms growing smaller as k increases. This suggests that proximity in the space is very meaningful, which is an important validation of the analyses to follow. Finally, the good performance on this task invites an application of the space to making marketing predictions – which cities are prone to pick up on which artists in the near future? – but we leave this for future work. 4.2 Visual Inspection of the Embedding Space gleaned. However, higher dimensional models are able to achieve perplexities on the validation set which far exceed those of lower dimensional models. For example, as mentioned before, our best performing 2-dimensional model attains a validation perplexity of 357, while our best performing 100-dimensional model attains a perplexity of 290 on the validation set. This suggests that higher dimensional models capture more of the nuanced patterns present in the data. On the other hand, simple plotting is no longer sufficient to inspect high-dimensional data – we must resort to alternative methods, for example, clustering and nearest neighbor queries. First, in Figure 3, we present the results of using k-means clustering in the city space of the 100-dimensional model. The common algorithm for solving the k-means clustering problem is known to be prone to getting stuck in local optima, and in fact can be difficult to validate properly. We attempted to overcome these problems by using cross validation and repeated random restarts. Specifically, we used 10-fold cross-validation on the set of all cities in order to find a validation objective for each candidate value of k from 2 to 20. Then, we selected the parameter k by choosing the largest value for which no larger value offers more than a 5% improvement over the immediately previous value. In Figure 2 we present plots of the two-dimensional embedding space, with labels for some key cities (left) and artists (right). Note that the two plots are separated by city and artists only for readability, and that all points lie in the same space. In this figure, we can already see a striking segmentation in city space, with extreme distinction between, e.g., Brazilian cities, Southeast Asian cities, and American cities. We can also already see distinct regional and cultural groupings in some ways – the U.S. cities largely form a gradient, with Chicago, Atlanta, Washington, D.C., and Philadelphia in the middle, Cleveland and Detroit on one edge of the cluster, and New York and Los Angeles on the opposite edge. Interestingly, Toronto is also on the edge of the U.S. cluster, and on the same edge where New York and Los Angeles – arguably the most “international” of the U.S. cities shown here – end up. 
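For the visual inspections discussed in this section, a two-dimensional model can simply be plotted; a minimal sketch is shown below, assuming 2-D embedding matrices and label lists from a trained d = 2 model (the labels here are placeholders). As in Figure 2, cities and artists are drawn in separate panels only for readability, with shared axes because all points live in the same space.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed inputs: embeddings from a d = 2 model and their labels (placeholders here).
rng = np.random.default_rng(2)
X2 = rng.normal(size=(6, 2))                      # city embeddings
Y2 = rng.normal(size=(4, 2))                      # artist embeddings
cities = ["City A", "City B", "City C", "City D", "City E", "City F"]
artists = ["Artist A", "Artist B", "Artist C", "Artist D"]

fig, (ax_c, ax_a) = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
ax_c.scatter(X2[:, 0], X2[:, 1], s=10)
for (x, y), name in zip(X2, cities):
    ax_c.annotate(name, (x, y), fontsize=8)
ax_c.set_title("Cities")

ax_a.scatter(Y2[:, 0], Y2[:, 1], s=10, color="tab:orange")
for (x, y), name in zip(Y2, artists):
    ax_a.annotate(name, (x, y), fontsize=8)
ax_a.set_title("Artists")

plt.tight_layout()
plt.show()
```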
It is also interesting to note that the space has a very clear segmentation in terms of genre – just as clear as embeddings produced in previous work from songs alone [5] or songs and individual users [6]. Of course, this does not translate into an effective user model – surely there are many users in Recife, Brazil that would quickly tire of a radio station inspired by Linkin Park – but we believe it is still a meaningful phenomenon. Specifically, this suggests that the taste of the average listener can vary dramatically from one city to the next, even within the same country. More surprisingly, this variation in the average user is so dramatic that cities themselves can form nearly as coherent a taste space as individual users, as the genre segmentation is barely any less clear than in other authors’ work with user modeling. 4.3 Higher-dimensional Models Once the value of k was chosen, we tried to overcome the problem of local optima by running the clustering algorithm 10 times on the entire set of cities with that value of k and different random initializations, finally choosing the trial with the best objective value. This process resulted in optimal k values ranging from 6 to 13. Smaller values resulted in some clusterings with granularity too coarse to see interesting patterns, while larger values were noisy and produced unstable clusterings. Ultimately, we found that k = 9 was a good trade-off. Additionally, in Table 1, we obtain a complementary view of the 100-dimensional embedding by listing the results of nearest-neighbor queries for some well-known, hand-selected cities. These queries give us an alternative perspective of the city space, pointing out similarities that may not be apparent from the clustering alone. By combining these views, we can start to see many interesting patterns arise: The French-speaking supercluster: French-speaking cities form an extremely tight cluster, as can also be seen in the 2-dimensional embedding in Figure 2. Virtually every French city is part of this cluster, as well as Frenchspeaking cities in nearby European countries, such as Brussels and Geneva. Indeed even beyond the top 10 listed in Table 1, almost all of the top 100 nearest neighbors for Paris are French-speaking. Language is almost certainly the biggest factor in this effect, but if we consider the countries near France, we see that despite linguistic divides, in the clustering, many cities in the U.K. still group closely with Dutch cities and even Spanish cities. Furthermore, this grouping can be seen in every view of the data – in the two-dimensional space, the clustering, and the nearest neighbor queries. It should be noted that in our own trials clustering the data, the French cluster is one of the first Directly visualizing two-dimensional models can give us striking images from which rough patterns can be easily 442 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 2: The joint city/artist space with some key cities and artists labeled. Figure 3: A k-means clustering of cities around the world with k = 9. 
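The clustering shown in Figure 3 depends on the model-selection procedure described above: cross-validated k-means objectives for k = 2 to 20, a "no further improvement beyond 5%" rule for picking k, and repeated random restarts for the final clustering. The sketch below is one plausible reading of that procedure, using scikit-learn's KMeans on an assumed matrix of 100-dimensional city embeddings; details such as the exact validation objective may differ from the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def choose_k(city_vecs, k_values=range(2, 21), n_folds=10, tol=0.05, seed=0):
    """Pick the largest k whose step from the previous k still improves the
    cross-validated objective by more than `tol` (5%)."""
    mean_obj = {}
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for k in k_values:
        objs = []
        for train_idx, val_idx in kf.split(city_vecs):
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(city_vecs[train_idx])
            objs.append(-km.score(city_vecs[val_idx]))  # held-out sum of squared distances
        mean_obj[k] = np.mean(objs)

    ks = sorted(mean_obj)
    best = ks[0]
    for prev, cur in zip(ks, ks[1:]):
        if (mean_obj[prev] - mean_obj[cur]) / mean_obj[prev] > tol:
            best = cur          # this k still buys a > 5% improvement
    return best, mean_obj

def final_clustering(city_vecs, k, n_restarts=10, seed=0):
    """Rerun k-means on all cities with several restarts and keep the best objective."""
    best_km = None
    for r in range(n_restarts):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed + r).fit(city_vecs)
        if best_km is None or km.inertia_ < best_km.inertia_:
            best_km = km
    return best_km.labels_

# Example with random stand-in embeddings (the paper uses 1,017 cities x 100 dims).
city_vecs = np.random.default_rng(3).normal(size=(200, 100))
k, _ = choose_k(city_vecs, k_values=range(2, 8), n_folds=5)
labels = final_clustering(city_vecs, k)
print(k, np.bincount(labels))
```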
Kuala Lumpur: Kulim, Sungai Lembing, Ipoh, Kuching, Sunway City, Seremban, Seri Kembangan, Taman Cheras Hartamas, Kuantan, Selayang
Paris: Boulogne-Billancourt, Brussels, Rennes, Lille, Aix-en-Provence, Limoges, Amiens, Marseille, Geneva, Grenoble
Singapore: Hougang, Seng Kang, USJ9, Subang, Kota Bahru, Bangkok, Alam Damai, Kota Padawan, Glenmarie, Budapest
Los Angeles, CA: Grand Prairie, TX; Ontario, CA; Riverside, CA; Sacramento, CA; Salinas, CA; Paterson, NJ; San Bernardino, CA; Inglewood, CA; Modesto, CA; Pomona, CA
Chicago, IL: Buffalo, NY; Clarksville, TN; Cleveland, OH; Durham, NC; Birmingham, AL; Flint, MI; Montgomery, AL; Nashville, TN; Jackson, MS; Paterson, NJ
São Paulo: Osasco, Jundiaí, Carapicuíba, Ribeirão Pires, Shinjuku, Vargem Grande Paulista, Santa Maria, Itapevi, Cascavel, Embu das Artes
Brooklyn, NY: Minneapolis, MN; Winston-Salem, NC; Arlington, VA; Waterbury, CT; Washington, DC; Syracuse, NY; Jersey City, NJ; Louisville, KY; Tallahassee, FL; Ontario, CA
Atlanta, GA: Savannah, GA; Tallahassee, FL; Cleveland, OH; Washington, DC; Memphis, TN; Flint, MI; Huntsville, AL; Montgomery, AL; Jackson, MS; Lafayette, LA
Madrid: Sevilla; Granada; Barcelona; Murcia; Sorocaba; Ponta Grossa; Huntington Beach, CA; Istanbul; Vigo; Oxford
Amsterdam: Eindhoven, Tilburg, Emmen, Nijmegen, Enschede, Zwolle, Amersfoort, Maastricht, Antwerp, Coventry
Sydney: Toronto; Denver, CO; Windhoek; Angers; Rialto, CA; Hamilton; Rotterdam; Ottawa; London - Tower Hamlets; London - Southwark
Montréal: Montpellier; Geneva; Raleigh, NC; Limoges; Angers; Ontario, CA; Anchorage, AK; Nice; Lyon; Rennes

Table 1: Nearest neighbor query results in 100-dimensional city space, listed as each query city followed by its nearest neighbors. Brooklyn was chosen over New York, NY due to having more tweets in the data set. In addition, only result cities with population at least 100,000 are displayed.

Country         Least typical               Most typical
Brazil          Criciúma, Santa Catarina    Itapevi, São Paulo
Canada          Surrey, BC                  Toronto, ON
Netherlands     Leiden                      Emmen
Mexico          Campeche, CM                Cuauhtémoc, DF
Indonesia       Panunggangan Barat          RW 02
France          Bordeaux                    Mantes-la-Jolie, Île-de-France
United States   Huntington Beach, CA        Jackson, MS
Malaysia        Kota Damansara              Kuala Lumpur
United Kingdom  Wolverhampton, England      London Borough of Camden
Russia          Ufa                         Podgory
Spain           Álora, Andalusia            Barcelona

Table 2: Most and least typical cities in taste profile for various countries.

typical taste profiles for that country. The results are shown in Table 2. We can see a few interesting patterns here. First, in Brazil, the most typical city is an outlying city near the city of São Paulo, while the least typical is a city in Santa Catarina, the second southernmost state in Brazil, which is also less populous than the southernmost, Rio Grande do Sul, which was also well represented in the data. In Canada, the least typical city is an edge city on Vancouver's east side, while the most typical is the largest city, Toronto. In France, the most typical city is in Île-de-France, not too far from Paris. We also see in England that the least typical city is Wolverhampton, an edge city of Birmingham towards England's industrial north, while the most typical is a borough of London.

clusters to become apparent, as well as one of the most consistent to appear.
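Table 2 can be reproduced directly from the city embeddings once each city is tagged with its country: average the embedding vectors per country, then rank that country's cities by distance to that average. A minimal sketch under assumed inputs (an embedding matrix plus parallel lists of city names and country labels) follows; the ten-city minimum per country matches the description in Section 4.4.

```python
import numpy as np
from collections import defaultdict

def typicality(X, city_names, country_labels, min_cities=10):
    """For each country with at least `min_cities` cities, return the city closest to
    (most typical) and farthest from (least typical) the country's mean embedding."""
    by_country = defaultdict(list)
    for i, country in enumerate(country_labels):
        by_country[country].append(i)

    results = {}
    for country, idx in by_country.items():
        if len(idx) < min_cities:
            continue
        vecs = X[idx]
        center = vecs.mean(axis=0)
        dists = np.linalg.norm(vecs - center, axis=1)
        results[country] = {
            "most_typical": city_names[idx[int(np.argmin(dists))]],
            "least_typical": city_names[idx[int(np.argmax(dists))]],
        }
    return results

# Toy example with made-up labels, just to show the expected input format.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 100))
names = [f"city_{i}" for i in range(30)]
countries = ["Brazil"] * 12 + ["France"] * 12 + ["Canada"] * 6
print(typicality(X, names, countries, min_cities=10))
```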
We can also see that the French cluster is indeed a linguistic and cultural one which is not just due to geographic proximity: although Montreal has several nearest neighbors in North America, it is present in the French group in the k-means clustering (as is Quebec City) and is also very close to many French-speaking cities in Europe, such as Geneva and Lyon. We can also see that Abidjan, Ivory Coast joins the French k-means cluster, as do Dakar in Senegal, Les Abymes in Guadeloupe and Le Lamentin and Fort-de-France in Martinique – all cities in countries which are members of the Francophonie. Australia: Here again, despite the relatively tight geographical proximity of Australia and Southeast Asia, and the geographic isolation of Australia from North America, Australian cities tend to group closely with Canadian cities and some cities in the United Kingdom. One way of seeing this is the fact that Sydney’s nearest neighbors include Toronto, Hamilton, Ontario, Ottawa, and two of London’s boroughs. In addition, other cities in Australia also belong to a cluster that mainly includes cities in the Commonwealth (e.g., U.K., Canada). Cultural divides in the United States: the cities in the U.S. tend to form at least two distinct subgroups in terms of listening patterns. One group contains many cities in the Southeast and Midwest, as well as a few cities on the southern edge of what some might call the Northeast (Philadelphia, for example). The other group consists primarily of cities in the Northeast, on the West Coast, and in the Southwest of the country, including most of the cities in Texas. Intuitively, there are two results that might be surprising to some here. The first is that the listening patterns of Chicago tend to cluster with listening patterns in the South and the rest of the Midwest, and not those of very large cities on the coasts (after all, Chicago is the third-largest city in the country). The second is that Texas groups with the West Coast and Northeast, and not with the Southeast, which would be considered by many to be more culturally similar in many ways. 5. CONCLUSIONS In this work, we learned probabilistic embeddings of the Million Musical Tweets Dataset, a large corpus of tweets containing track plays which has rich geographical information for each play. Through the use of embeddings, we were able to easily process a large amount of data and sift through it visually and with spatial analysis in order to uncover examples of how musical taste conforms to or transcends geography, language, and culture. Our findings reflect that differences in culture and language, as well as historical affinities among countries otherwise separated by vast distances, can be seen very clearly in the differences in taste among average listeners from one region to the next. More generally, this paper shows how nuanced patterns in large collections of preference data can be condensed into a taste space, which provides a powerful tool for discovering complex relationships. Acknowledgments: This work was supported by NSF grants IIS-1217485, IIS-1217686, IIS-1247696, and an NSF Graduate Research Fellowship. 6. REFERENCES [1] N. Aizenberg, Y. Koren, and O. Somekh. Build your own music recommender by modeling internet radio streams. In WWW, pages 1–10. ACM, 2012. [2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. JMLR, 3:1137–1155, 2003. [3] D. Hauger, M. Schedl, A. Košir, and M. Tkalčič. 
The million musical tweets dataset - what we can learn from microblogs. In ISMIR, 2013. [4] I. Knopke. Geospatial location of music and sound files for music information retrieval. ISMIR, 2005. [5] J. L. Moore, S. Chen, T. Joachims, and D. Turnbull. Learning to embed songs and tags for playlist prediction. In ISMIR, 2012. 4.4 Most and least typical cities We can also consider the relation of individual cities to their member countries. For this analysis, we considered all the countries which have at least 10 cities represented in the data. Then for each country we calculated the average position in embedding space of cities in that country. With this average city position, we can then measure the distance of individual cities from the mean of cities in their country and find the cities which have the most and least [6] J. L. Moore, Shuo Chen, T. Joachims, and D. Turnbull. Taste over time: the temporal dynamics of user preferences. In ISMIR, 2013. [7] J. Weston, S. Bengio, and P. Hamel. Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval. JNMR, 40(4):337–348, 2011. 444 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ENHANCING COLLABORATIVE FILTERING MUSIC RECOMMENDATION BY BALANCING EXPLORATION AND EXPLOITATION Zhe Xing, Xinxi Wang, Ye Wang School of Computing, National University of Singapore {xing-zhe,wangxinxi,wangye}@comp.nus.edu.sg ABSTRACT Collaborative filtering (CF) techniques have shown great success in music recommendation applications. However, traditional collaborative-filtering music recommendation algorithms work in a greedy way, invariably recommending songs with the highest predicted user ratings. Such a purely exploitative strategy may result in suboptimal performance over the long term. Using a novel reinforcement learning approach, we introduce exploration into CF and try to balance between exploration and exploitation. In order to learn users’ musical tastes, we use a Bayesian graphical model that takes account of both CF latent factors and recommendation novelty. Moreover, we designed a Bayesian inference algorithm to efficiently estimate the posterior rating distributions. In music recommendation, this is the first attempt to remedy the greedy nature of CF approaches. Results from both simulation experiments and user study show that our proposed approach significantly improves recommendation performance. 1. INTRODUCTION In the field of music recommendation, content-based approaches and collaborative filtering (CF) approaches have been the prevailing recommendation strategies. Contentbased algorithms [1, 9] analyze acoustic features of the songs that the user has rated highly in the past and recommend only songs that have high degrees of acoustic similarity. On the other hand, collaborative filtering (CF) algorithms [7, 13] assume that people tend to get good recommendations from someone with similar preferences, and the user’s ratings are predicted according to his neighbors’ ratings. These two traditional recommendation approaches, however, share a weakness. Working in a greedy way, they always generate “safe” recommendations by selecting songs with the highest predicted user ratings. Such a purely exploitative strategy may result in suboptimal performance over the long term due to the lack of exploration. The reason is that user preference c Zhe Xing, Xinxi Wang, Ye Wang. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). 
Attribution: Zhe Xing, Xinxi Wang, Ye Wang. “Enhancing collaborative filtering music recommendation by balancing exploration and exploitation”, 15th International Society for Music Information Retrieval Conference, 2014. is only estimated based on the current knowledge available in the recommender system. As a result, uncertainty always exists in the predicted user ratings and may give rise to a situation where some of the non-greedy options deemed almost as good as the greedy ones are actually better than them. Without exploration, however, we will never know which ones are better. With the appropriate amount of exploration, the recommender system could gain more knowledge about the user’s true preferences before exploiting them. Our previous work [12] tried to mitigate the greedy problem in content-based music recommendation, but no work has addressed this problem in the CF context. We thus aim to develop a CF-based music recommendation algorithm that can strike a balance between exploration and exploitation and enhance long-term recommendation performance. To do so, we introduce exploration into collaborative filtering by formulating the music recommendation problem as a reinforcement learning task called n-armed bandit problem. A Bayesian graphical model taking account of both collaborative filtering latent factors and recommendation novelty is proposed to learn the user preferences. The lack of efficiency becomes a major challenge, however, when we adopt an off-the-shelf Markov Chain Monte Carlo (MCMC) sampling algorithm for the Bayesian posterior estimation. We are thus prompted to design a much faster sampling algorithm for Bayesian inference. We carried out both simulation experiments and a user study to show the efficiency and effectiveness of the proposed approach. Contributions of this paper are summarized as follows: • To the best of our knowledge, this is the first work in music recommendation to temper CF’s greedy nature by investigating the exploration-exploitation trade-off using a reinforcement learning approach. • Compared to an off-the-shelf MCMC algorithm, a much more efficient sampling algorithm is proposed to speed up Bayesian posterior estimation. • Experimental results show that our proposed approach enhances the performance of CF-based music recommendation significantly. 2. RELATED WORK Based on the assumption that people tend to receive good recommendations from others with similar preferences, col- 445 15th International Society for Music Information Retrieval Conference (ISMIR 2014) laborative filtering (CF) techniques come in two categories: memory-based CF and model-based CF. Memory-based CF algorithms [3, 8] first search for neighbors who have similar rating histories to the target user. Then the target user’s ratings can be predicted according to his neighbors’ ratings. Model-based CF algorithms [7, 14] use various models and machine learning techniques to discover latent factors that account for the observed ratings. Our previous work [12] proposed a reinforcement learning approach to balance exploration and exploitation in music recommendation. However, this work is based on a content-based approach. One major drawback of the personalized user rating model is that low-level audio features are used to represent the content of songs. This purely content-based approach is not satisfactory due to the semantic gap between low-level audio features and high-level user preferences. 
Moreover, it is difficult to determine which underlying acoustic features are effective in music recommendation scenarios, as these features were not originally designed for music recommendation. Another shortcoming is that songs recommended by content-based methods often lack variety, because they are all acoustically similar to each other. Ideally, users should be provided with a range of genres rather than a homogeneous set. While no work has attempted to address the greedy problem of CF approaches in the music recommendation context, Karimi et al. tried to investigate it in other recommendation applications [4, 5]. However, their active learning approach merely explores items to optimize the prediction accuracy on a pre-determined test set [4]. No attention is paid to the exploration-exploitation trade-off problem. In their other work, the recommendation process is split into two steps [5]. In the exploration step, they select an item that brings maximum change to the user parameters, and then in the exploitation step, they pick the item based on the current parameters. The work takes balancing exploration and exploitation into consideration, but only in an ad hoc way. In addition, their approach is evaluated using only an offline and pre-determined dataset. In the end, their algorithm is not practical for deployment in online recommender systems due to its low efficiency. 3. PROPOSED APPROACH We first present a simple matrix factorization model for collaborative filtering (CF) music recommendation. Then, we point out major limitations of this traditional CF algorithm and describe our proposed approach in detail. song a song feature vector vj ∈ Rf , j = 1, 2, ..., n. For a given song j, vj measures the extent to which the song contains the latent factors. For a given user i, ui measures the extent to which he likes these latent factors. The user rating can thus be approximated by the inner product of the two vectors: r̂ij = uTi vj (1) To learn the latent feature vectors, the system minimizes the following regularized squared error on the training set: (i,j)∈I (rij − uTi vj )2 + λ( m nui ui 2 + i=1 n nvj vj 2 ) (2) j=1 where I is the index set of all known ratings, λ a regularization parameter, nui the number of ratings by user i, and nvj the number of ratings of song j. We use the alternating least squares (ALS) [14] technique to minimize Eq. (2). However, this traditional CF recommendation approach has two major drawbacks. (I) It fails to take recommendation novelty into consideration. For a user, the novelty of a song changes with each listening. (II) It works greedily, always recommending songs with the highest predicted mean ratings, while a better approach may be to actively explore a user’s preferences rather than to merely exploit available rating information [12]. To address these drawbacks, we propose a reinforcement learning approach to CF-based music recommendation. 3.2 A Reinforcement Learning Approach Music recommendation is an interactive process. The system repeatedly choose among n different songs to recommend. After each recommendation, it receives a rating feedback (or reward) chosen from an unknown probability distribution, and its goal is to maximize user satisfaction, i.e., the expected total reward, in the long run. Similarly, reinforcement learning explores an environment and takes actions to maximize the cumulative reward. It is thus fitting to treat music recommendation as a well-studied reinforcement learning task called n-armed bandit. 
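Before turning to the bandit formulation in detail, the sketch below gives a concrete reference for the matrix-factorization model of Eqs. (1)-(2): one round of alternating least squares under the weighted-lambda regularization described above (the λ·n_ui and λ·n_vj scaling of [14]). It is an illustrative re-derivation on synthetic data with assumed shapes, not the system's production code; λ = 0.025 is borrowed from the tuning reported in Section 4.2.

```python
import numpy as np

def als_step(R, mask, U, V, lam):
    """One ALS pass for the weighted-lambda-regularized objective:
    sum over observed (i, j) of (r_ij - u_i^T v_j)^2
    + lam * (sum_i n_ui ||u_i||^2 + sum_j n_vj ||v_j||^2)."""
    f = U.shape[1]
    # Update each user vector with the song vectors held fixed.
    for i in range(U.shape[0]):
        cols = np.where(mask[i])[0]
        if cols.size == 0:
            continue
        Vi = V[cols]                                    # (n_ui, f)
        A = Vi.T @ Vi + lam * cols.size * np.eye(f)
        b = Vi.T @ R[i, cols]
        U[i] = np.linalg.solve(A, b)
    # Update each song vector with the user vectors held fixed.
    for j in range(V.shape[0]):
        rows = np.where(mask[:, j])[0]
        if rows.size == 0:
            continue
        Uj = U[rows]
        A = Uj.T @ Uj + lam * rows.size * np.eye(f)
        b = Uj.T @ R[rows, j]
        V[j] = np.linalg.solve(A, b)
    return U, V

# Synthetic example: 50 users x 40 songs, f = 5 latent factors.
rng = np.random.default_rng(5)
m, n, f, lam = 50, 40, 5, 0.025
R_true = rng.normal(size=(m, f)) @ rng.normal(size=(f, n))
mask = rng.random((m, n)) < 0.3                          # ~30% of entries observed
R = np.where(mask, R_true, 0.0)
U, V = rng.normal(size=(m, f)), rng.normal(size=(n, f))
for _ in range(10):
    U, V = als_step(R, mask, U, V, lam)
rmse = np.sqrt(np.mean((R_true[mask] - (U @ V.T)[mask]) ** 2))
print(f"RMSE on observed entries: {rmse:.3f}")
```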
The n-armed bandit problem assumes a slot machine with n levers. Pulling a lever generates a payoff from the unknown probability distribution of the lever. The objective is to maximize the expected total payoff over a given number of action selections, say, over 1000 plays. 3.2.1 Modeling User Rating To address drawback (I) in Section 3.1, we assume that a song’s rating is affected by two factors: CF score, the extent to which the user likes the song in terms of each CF latent factor, and novelty score, the dynamically changing novelty of the song. From Eq. (1), we define the CF score as: 3.1 Matrix Factorization for Collaborative Filtering Suppose we have m users and n songs in the music recommender system. Let R = {rij }m×n denote the user-song rating matrix, where each element rij represents the rating of song j given by user i. Matrix factorization characterizes users and songs by vectors of latent factors. Every user is associated with a user feature vector ui ∈ Rf , i = 1, 2, ..., m, and every UCF = θ T v (3) where vector θ indicates the user’s preferences for different CF latent factors and v is the song feature vector 446 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ݀Ͳ learned by the ALS CF algorithm. For the novelty score, we adopt the formula used in [12]: UN = 1 − e−t/s τ θ s v R t N Figure 1: Bayesian Graphical Model. ing probability dependency is defined as follows: R|v, t, θ, s, σ 2 ∼ N (θ T v(1 − e−t/s ), σ 2 ) (5) θ|σ ∼ N (0, a0 σ I) 2 2 Given the variability in musical taste and memory strength, each user is associated with a pair of parameters Ω = (θ, s), to be learned from the user’s rating history. More technical details will be explained in Section 3.2.2. Since the predicted user ratings always carry uncertainty, we assume them to be random variables rather than fixed numbers. Let Rj denote the rating of song j given by the target user, and Rj follows an unknown probability distribution. We assume that the expectation of Rj is the Uj defined in Eq. (5). Thus, the expected rating of song j can be estimated as: E[Rj ] = Uj = (θ T vj )(1 − e−tj /s ) ܿͲ ܾͲ (4) where t is the time elapsed since when the song was last heard, s the relative strength of the user’s memory, and e−t/s the well-known forgetting curve. The formula assumes that a song’s novelty decreases immediately when listened and gradually recovers with time. (For more details on the novelty definition, please refer to [12].) We thus model the final user rating by combining these two scores: U = UCF UN = (θ T v)(1 − e−t/s ) ܽͲ ݁Ͳ (6) Traditional recommendation strategy will first obtain the vj and tj of each song in the system to compute the expected rating using Eq. (6) and then recommend the song with the highest expected rating. We call this a greedy recommendation as the system is exploiting its current knowledge of the user ratings. By selecting one of the nongreedy recommendations and gathering more user feedback, the system explores further and gains more knowledge about the user preferences. A greedy recommendation may maximize the expected reward in the current iteration but would result in suboptimal performance over the long term. This is because several non-greedy recommendations may be deemed nearly as good but come with substantial variance (or uncertainty), and it is thus possible that some of them are actually better than the greedy recommendation. Without exploration, however, we will never know which ones they are. 
Therefore, to counter the greedy nature of CF (drawback II), we introduce exploration into music recommendation to balance exploitation. To do so, we adopt one of the state-of-the-art algorithms, Bayesian Upper Confidence Bounds (Bayes-UCB) [6]. In Bayes-UCB, the expected reward U_j is a random variable rather than a fixed value. Given the target user's rating history D, the posterior distribution of U_j, denoted p(U_j | D), needs to be estimated. Then the song with the highest fixed-level quantile value of p(U_j | D) is recommended to the target user.

3.2.2 Bayesian Graphical Model

To estimate the posterior distribution of U, we adopt the Bayesian model (Figure 1) used in [12]. The corresponding probability dependencies are the observation model and the prior on θ given above (Eqs. (7)-(8)), together with the priors

s \sim \mathrm{Gamma}(b_0, c_0), \quad (9)
\tau = 1/\sigma^2 \sim \mathrm{Gamma}(d_0, e_0). \quad (10)

Here I is the f × f identity matrix, N denotes a Gaussian distribution parameterized by mean and variance, and Gamma denotes a Gamma distribution parameterized by shape and rate. θ, s, and τ are parameters; a_0, b_0, c_0, d_0, and e_0 are hyperparameters of the priors. At the current iteration h + 1, we have gathered the history of h observed recommendations D_h = {(v_i, t_i, r_i)}_{i=1}^{h}. Given that each user in our model is described as Ω = (θ, s), we have, according to Bayes' theorem:

p(\Omega \mid D_h) \propto p(\Omega)\, p(D_h \mid \Omega). \quad (11)

Then the posterior probability density function (PDF) of the expected rating U_j of song j can be estimated as:

p(U_j \mid D_h) = \int p(U_j \mid \Omega)\, p(\Omega \mid D_h)\, d\Omega. \quad (12)

Since Eq. (11) has no closed-form solution, we are unable to directly estimate the posterior PDF in Eq. (12). We thus turn to a Markov Chain Monte Carlo (MCMC) algorithm to adequately sample the parameters Ω = (θ, s). We then substitute every parameter sample into Eq. (6) to obtain a sample of U_j. Finally, the posterior PDF in Eq. (12) can be approximated by the histogram of the samples of U_j. After estimating the posterior PDF of each song's expected rating, we follow the Bayes-UCB approach [6] to recommend the song j* that maximizes the quantile function:

j^* = \arg\max_{j=1,\dots,|S|} Q\big(\alpha, p(U_j \mid D_h)\big), \quad (13)

where α = 1 − 1/(h + 1), |S| is the total number of songs in the recommender system, and the quantile function Q returns the value x such that Pr(U_j ≤ x | D_h) = α. The pseudo-code of our algorithm is presented in Algorithm 1.

3.3 Efficient Sampling Algorithm

Bayesian inference is very slow with an off-the-shelf MCMC sampling algorithm because it takes a long time for the Markov chain to converge. In response, we previously proposed an approximate Bayesian model using piecewise linear approximation [12]. However, not only is the original
We select 20,000 songs with top listening counts and 100,000 users who have listened to the most songs. Since this collection of listening history is a form of implicit feedback data, we use the approach proposed in [11] to perform negative sampling. The detailed statistics of the final dataset are shown in Table 1. (15) ri (1 − e −ti /s )viT (16) i=1 Similarly, the conditional distribution p(τ |D, θ, s) remains a Gamma distribution and can be derived as: p(τ |D, θ, s) ∝ p(τ )p(θ|τ )p(s) N p(ri |vi , ti , θ, s, τ ) 4.2 Learning CF Latent Factors i=1 ∝ p(τ )p(θ|τ ) N i=1 First, we determine the optimal value of λ, the regularization parameter, and f , the dimensionality of the latent feature vectors. We randomly split the dataset into three disjoint parts: training set (80%), validation set (10%), and test set (10%). Training set is used to learn the CF latent factors, and the convergence criteria of the ALS algorithm is achieved when the change in root mean square p(ri |vi , ti , θ, s, τ ) 1 T 2 −1 exp(−e0 τ ) × exp − θ (a0 σ I) θ × ∝τ 2 N √ −N 2 1 T −ti /s σ 2π exp − 2 ri − θ vi (1 − e ) 2σ i=1 d0 −1 ∝ τ α−1 exp(−βτ ) ∝ Gamma (α, β) # MH Step tmp i=1 (19) Draw u ∼ U nif orm(0, 1); if u < α then stmp = y; end if end for s(t+1) = stmp ; end for (1 − e−ti /s )2 vi viT N 2 θT θ 1 ri − θ T vi (1 − e−ti /s ) + 2a0 2 i=1 Initialize θ, s, τ ; for t = 1 → M axIteration do Sample θ (t+1) ∼ p(θ|D, τ (t) , s(t) ); Sample τ (t+1) ∼ p(τ |D, θ (t+1) , s(t) ); stmp = s(t) ; for i = 1 → K do Draw y ∼N (stmp , 1); (t+1) (t+1) ,τ ) , 1 ; α = min p(sp(y|D,θ (t+1) (t+1) |D,θ ,τ ) where μ and Σ, respectively the mean and covariance of the multivariate Gaussian distribution, satisfy: Σ−1 = Λ = τ (18) Algorithm 2 Gibbs Sampling for Bayesian Inference 1 p(ri |vi , ti , θ, s, τ ) ∝ exp − θ T (a0 σ 2 I)−1 θ 2 i=1 N 2 1 − 2 ri − θ T vi (1 − e−ti /s ) ×exp 2σ i=1 1 (14) ∝ exp − θ T Λθ + η T θ ∝ N (μ, Σ) 2 N f +N 2 The conditional distribution p(s|D, θ, τ ) has no closed form expression. We thus adopt the Metropolis-Hastings (MH) algorithm [2] with a proposal distribution q(st+1 |st ) = N (st , 1) to draw samples of s. Our detailed Gibbs sampling process is presented in Algorithm 2. N % Density 1.035% where α and β are respectively the shape and rate of the Gamma distribution and satisfy: i=1 ∝ p(θ|τ ) # Observations 20,699,820 Table 1: Size of the dataset. Density is the percentage of entries in the user-song matrix that have observations. Bayesian model altered, tuning the numerous (hyper)parameters is also tedious. In this paper, we present a better way to improve efficiency. Since it is simple to sample from a conditional distribution, we develop a specific Gibbs sampling algorithm to hasten convergence. Given N training samples D = {vi , ti , ri }N i=1 , the conditional distribution p(θ|D, τ, s) is still a Gaussian distribution and can be obtained as follows: N # Songs 20,000 1 (17) 448 http://labrosa.ee.columbia.edu/millionsong/tasteprofile 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 2: Prediction accuracy of sampling algorithms. Figure 3: Efficiency comparison of sampling algorithms. (T imeM CM C = 538.762s and T imeGibbs = 0.579s when T rainingSetSize = 1000). error (RMSE) on the validation set is less than 10−4 . Then we use the learned latent factors to predict the ratings on the test set. We first fix f = 55 and vary λ from 0.005 to 0.1; minimal RMSE is achieved at λ = 0.025. We then fix λ = 0.025 and vary f from 10 to 80, and f = 75 yields minimal RMSE. 
Therefore, we adopt the optimal value λ = 0.025 and f = 75 to perform the final ALS CF algorithm and obtain the learned latent feature vector of each song in our dataset. These vectors will later be used for reinforcement learning. Figure 4: Online evaluation platform. 4.3 Efficiency Study To show that our Gibbs sampling algorithm makes Bayesian inference significantly more efficient, we conduct simulation experiments to compare it with an off-the-shelf MCMC algorithm developed in JAGS 2 . We implemented the Gibbs algorithm in C++, which JAGS uses, for a fair comparison. For each data point di ∈ {(vi , ti , ri )}ni=1 in the simulation experiments, vi is randomly chosen from the latent feature vectors learned by the ALS CF algorithm. ti is randomly sampled from unif orm(50, 2592000), i.e. between a time gap of 50 seconds and one month. ri is calculated using Eq. (6) where elements of θ are sampled from N (0, 1) and s from unif orm(100, 1000). To determine the burn-in and sample size of the two algorithms and to ensure they draw samples equally effectively, we first check to see if they converge to a similar level. We generate a test set of 300 data points and vary the size of the training set to gauge the prediction accuracy. We set K = 5 in the MH step of our Gibbs algorithm. While our Gibbs algorithm achieves reasonable accuracy with burn-in = 20 and sample size = 100, the MCMC algorithm gives comparable results only when both parameters are 10000. Figure 2 shows their prediction accuracies averaged over 10 trials. With burn-in and sample size determined, we then conduct an efficiency study of the two algorithms. We vary the training set size from 1 to 1000 and record the time they take to finish the sampling process. We use a computer with Intel Core i7-2600 CPU @ 3.40Ghz and 8GB RAM. The efficiency comparison result is shown in Figure 3. We can see that computation time of both two sampling algorithms grows linearly with the training set size. However, our proposed Gibbs sampling algorithm is hundreds of times faster than MCMC, 2 http://mcmc-jags.sourceforge.net/ 449 suggesting that our proposed approach is practical for deployment in online recommender systems. 4.4 User Study In an online user study, we compare the effectiveness of our proposed recommendation algorithm, Bayes-UCB-CF, with that of two baseline algorithms: (1) Greedy algorithm, representing the traditional recommendation strategy without exploration-exploitation trade-off. (2) BayesUCB-Content algorithm [12], which also adopts the BayesUCB technique but is content-based instead of CF-based. We do not perform offline evaluation because it cannot capture the effect of the elapsed time t in our rating model and the interactiveness of our approach. Eighteen undergraduate and graduate students (9 females and 9 males, age 19 to 29) are invited to participate in the user study. The subject pool covers a variety of majors of study and nationalities, including American, Chinese, Korean, Malaysian, Singaporean and Iranian. Subjects receive a small payment for their participation. The user study takes place over the course of two weeks in April 2014 on a user evaluation website we constructed (Figure 4). The three algorithms evaluated are randomly assigned to numbers 1-3 to avoid bias. For each algorithm, 200 recommendations are evaluated using a rating scale from 1 to 5. Subjects are reminded to take breaks frequently to avoid fatigue. To minimize the carryover effect, subjects cannot evaluate two different algorithms in one day. 
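To make the recommendation step concrete, the sketch below strings together the rating model of Eq. (6) and the Bayes-UCB selection rule of Eq. (13): given posterior samples of (θ, s), here drawn from stand-in distributions rather than the Gibbs sampler of Algorithm 2, each candidate song's expected-rating samples are formed and the song with the highest 1 − 1/(h+1) quantile is recommended. It is a schematic illustration with assumed inputs, not the evaluated system.

```python
import numpy as np

def bayes_ucb_pick(theta_samples, s_samples, song_vecs, elapsed, h):
    """Recommend the song whose posterior expected rating has the largest
    alpha-quantile, with alpha = 1 - 1/(h+1) as in Bayes-UCB.

    theta_samples: (n_samples, f) posterior samples of theta
    s_samples:     (n_samples,)   posterior samples of the memory strength s
    song_vecs:     (n_songs, f)   CF latent vectors v_j
    elapsed:       (n_songs,)     seconds since each song was last heard (t_j)
    """
    alpha = 1.0 - 1.0 / (h + 1)
    cf = theta_samples @ song_vecs.T                        # (n_samples, n_songs), theta^T v
    novelty = 1.0 - np.exp(-elapsed[None, :] / s_samples[:, None])
    U = cf * novelty                                        # samples of Eq. (6)
    quantiles = np.quantile(U, alpha, axis=0)               # one alpha-quantile per song
    return int(np.argmax(quantiles))

# Stand-in posterior samples and catalogue, for illustration only.
rng = np.random.default_rng(6)
n_samples, f, n_songs = 200, 75, 500
theta_samples = rng.normal(size=(n_samples, f))
s_samples = rng.uniform(100, 1000, size=n_samples)
song_vecs = rng.normal(size=(n_songs, f))
elapsed = rng.uniform(50, 2_592_000, size=n_songs)          # 50 s to one month, as in Sec. 4.3
print("recommend song index:", bayes_ucb_pick(theta_samples, s_samples, song_vecs, elapsed, h=10))
```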
For the user study, Bayes-UCB-CF’s hyperparameters are set as: a0 = 10, b0 = 3, c0 = 0.01, d0 = 0.001 and e0 = 0.001. Since maximizing the total expected rating is the main objective of a music recommender system, we thus compare the cumulative average rating of the three algorithms. Figure 5 shows the average rating and standard error of 15th International Society for Music Information Retrieval Conference (ISMIR 2014) the manuscript. This project is funded by the National Research Foundation (NRF) and managed through the multi-agency Interactive & Digital Media Programme Office (IDMPO) hosted by the Media Development Authority of Singapore (MDA) under Centre of Social Media Innovations for Communities (COSMIC). 7. REFERENCES [1] P. Cano, M. Koppenberger, and N. Wack. Content-based music audio recommendation. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 211– 212. ACM, 2005. [2] S. Chib and E. Greenberg. Understanding the metropolishastings algorithm. The American Statistician, 49(4):327– 335, 1995. [3] J. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual ACM international conference on SIGIR, pages 230–237. ACM, 1999. Figure 5: Recommendation performance comparison. each algorithm from the beginning till the n-th recommendation iteration. We can see that our proposed Bayes-UCBCF algorithm significantly outperforms Bayes-UCB-Content, suggesting that the latter still fails to bridge the semantic gap between high-level user preferences and low-level audio features. T-tests show that Bayes-UCB-CF starts to significantly outperform the Greedy baseline after the 46th iteration (pvalue < 0.0472). In fact, Greedy’s performance decays rapidly after the 60th iteration while others continue to improve. Because Greedy solely exploits, it is quickly trapped at a local optima, repeatedly recommending the few songs with initial good ratings. As a result, the novelty of those songs plummets, and users become bored. Greedy will introduce new songs after collecting many low ratings, only to be soon trapped into a new local optima. By contrast, our Bayes-UCB-CF algorithm balances exploration and exploitation and thus significantly improves the recommendation performance. [4] R. Karimi, C. Freudenthaler, A. Nanopoulos, and L. SchmidtThieme. Active learning for aspect model in recommender systems. In Symposium on Computational Intelligence and Data Mining, pages 162–167. IEEE, 2011. [5] R. Karimi, C. Freudenthaler, A. Nanopoulos, and L. SchmidtThieme. Non-myopic active learning for recommender systems based on matrix factorization. In International Conference on Information Reuse and Integration, pages 299–303. IEEE, 2011. [6] E. Kaufmann, O. Cappé, and A. Garivier. On bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592– 600, 2012. [7] N. Koenigstein, G. Dror, and Y. Koren. Yahoo! music recommendations: modeling music ratings with temporal dynamics and item taxonomy. In Proceedings of the fifth ACM conference on Recommender systems, pages 165–172. ACM, 2011. [8] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. Grouplens: applying collaborative filtering to usenet news. Communications of the ACM, 40(3):77–87, 1997. 5. 
CONCLUSION We present a novel reinforcement learning approach to music recommendation that remedies the greedy nature of the collaborative filtering approaches by balancing exploitation with exploration. A Bayesian graphical model incorporating both the CF latent factors and novelty is used to learn user preferences. We also develop an efficient sampling algorithm to speed up Bayesian inference. In music recommendation, our work is the first attempt to investigate the exploration-exploitation trade-off and to address the greedy problem in CF-based approaches. Results from simulation experiments and user study have shown that our proposed algorithm significantly improves recommendation performance over the long term. To further improve recommendation performance, we plan to deploy a hybrid model that combines content-based and CF-based approaches in the proposed framework. 6. ACKNOWLEDGEMENT We thank the subjects in our user study for their participation. We are also grateful to Haotian “Sam” Fang for proofreading [9] B. Logan. Music recommendation from song sets. In ISMIR, 2004. [10] B. McFee, T. Bertin-Mahieux, D. P.W. Ellis, and G. R.G. Lanckriet. The million song dataset challenge. In Proceedings of international conference companion on World Wide Web, pages 909–916. ACM, 2012. [11] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In Eighth IEEE International Conference on Data Mining, pages 502– 511. IEEE, 2008. [12] X. Wang, Y. Wang, D. Hsu, and Y. Wang. Exploration in interactive personalized music recommendation: A reinforcement learning approach. arXiv preprint arXiv:1311.6355, 2013. [13] K. Yoshii and M. Goto. Continuous plsi and smoothing techniques for hybrid music recommendation. In ISMIR, pages 339–344, 2009. [14] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In Algorithmic Aspects in Information and Management, pages 337– 348. Springer, 2008. 450 15th International Society for Music Information Retrieval Conference (ISMIR 2014) IMPROVING MUSIC RECOMMENDER SYSTEMS: WHAT CAN WE LEARN FROM RESEARCH ON MUSIC TASTES? Audrey Laplante École de bibliothéconomie et des sciences de l’information, Université de Montréal [email protected] the taste profiles of these users. One of the principal limitations of RS based on CF is that, before they could gather sufficient information about the preferences of a user, they perform poorly. This corresponds to the welldocumented new user cold-start problem. One way to ease this problem would be to try to enrich the taste profile of a new user by relying on other types of information that are known to be correlated with music preferences. More recently, it has become increasingly common for music RS to encourage users to create a personal profile, or to allow them to connect to the system with a general social network site account (for instance, Deezer users can connect with their Facebook or their Google+ account). Music RS thus have access to a wider array of information regarding new users. Research on music tastes can provide insights into how to take advantage of this information. More than a decade ago, similar reasoning led Uitdenbogerd and Schyndel [2] to review the literature on the subject to identify the factors affecting music tastes. 
In 2003, however, a paper published by Rentfrow and Gosling [3] on the relationship between music and personality generated a renewed interest for music tastes among researchers, which translated into a sharp increase in research on this topic. In this paper, we propose to review the recent literature on music preferences from social psychology and sociology of music to identify the correlates of music tastes and to understand how music tastes are formed and evolve through time. We first explain the process by which we identified and selected the articles and books reviewed. We then present the structure and the correlates of music preferences based on the literature review. We conclude with a brief discussion on the implications of these findings for music RS design. ABSTRACT The success of a music recommender system depends on its ability to predict how much a particular user will like or dislike each item in its catalogue. However, such predictions are difficult to make accurately due to the complex nature of music tastes. In this paper, we review the literature on music tastes from social psychology and sociology of music to identify the correlates of music tastes and to understand how music tastes are formed and evolve through time. Research shows associations between music preferences and a wide variety of sociodemographic and individual characteristics, including personality traits, values, ethnicity, gender, social class, and political orientation. It also reveals the importance of social influences on music tastes, more specifically from family and peers, as well as the central role of music tastes in the construction of personal and social identities. Suggestions for the design of music recommender systems are made based on this literature review. 1. INTRODUCTION The success of a music recommender system (RS) depends on its ability to propose the right music, to the right user, at the right moment. This, however, is an extremely complex task. A wide variety of factors influence the development of music preferences, thus making it difficult for systems to predict how likely a particular user is to like or dislike a piece of music. This probably explains why music RS are often based on collaborative filtering (CF): it allows systems to uncover complex patterns in preferences that would be difficult to model based on musical attributes [1]. However, in order to make those predictions as accurate as possible, these systems need to collect a considerable amount of information about the music preferences of each user. To do so, they elicit explicit feedback from users, inviting them to rate, ban, or love songs, albums, or artists. They also collect implicit feedback, most often in the form of purchase or listening history data (including songs skipped) of individual users. These pieces of information are combined to form the user’s music taste profile, which allows the systems to identify like-minded users and to recommend music based on 2. METHODS We used two databases to identify the literature on music preferences, one in psychology, PsycINFO (Ovid), and one in sociology, Sociological Abstracts (ProQuest). We used the thesaurus of each database to find the descriptors that were used to represent the two concepts of interest (i.e., music, preferences), which led to the queries presented in Table 1. © Audrey Laplante Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Audrey Laplante. 
“Improving Music Recommender Systems: What Can We Learn from Research on Music Tastes?”, 15th International Society for Music Information Retrieval Conference, 2014. PsycINFO music AND preferences Sociological Abstracts: (music OR "music/musical") AND ("preference/preferences" OR preferences) Table 1. Queries used to retrieve articles in databases 451 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Both searches were limited to the subject heading field. We also limited the search to peer-reviewed publications and to articles published in 1999 or later to focus on the articles published during the last 15 years. This yielded 155 articles in PsycINFO and 38 articles in Sociological Abstracts. Additional articles and books were identified through chaining (i.e., by following citations in retrieved articles), which allowed us to add a few important documents that had been published before 1999. Considering the limited space and the large number of documents on music tastes, further selection was needed. After ensuring that all aspects were covered, we rejected articles with a narrow focus (e.g., articles focusing on a specific music genre or personality trait). For topics on which there were several publications, we retained articles with the highest number to citations based on Google Scholar. We also decided to exclude articles on the relationship between music preferences and the functions of music to concentrate on individual characteristics. with varimax rotation on participants’ ratings. This allowed them to uncover a factor structure of music preferences, composed of four dimensions, which they labeled Reflective and Complex, Intense and Rebellious, Upbeat and Conventional, and Energetic and Rhythmic. Table 2 shows the genres most strongly associated with each dimension. To verify the generalizability of this structure across samples, they replicated the study with 1,384 students of the same university, and examined the music libraries of individual users in a peer-to-peer music service. This allowed them to confirm the robustness of the model. Music-preference dimension 3. REVIEW OF LITERATURE ON MUSIC TASTES Research shows that people, especially adolescents, use their music tastes as a social badge through which they convey who they are, or rather how they would like to be perceived [4, 5]. This indicates that people consider that music preferences reflect personality, values, and beliefs. In the same line, people often make inferences about the personality of others based on their music preferences, as revealed by a study in which music was found to be the main topic of conversation between two young adults who are given the task of getting to know each other [6]. The same study showed that these inferences are often accurate: people can correctly infer several psychological characteristics based on one’s music preferences, which suggests that they have an intuitive knowledge of the relationships that exist between music preferences and personality. Several researchers have studied these relationships systematically to identify the correlates of music preferences that pertain to personality and demographic characteristics, values and beliefs, and social influences and stratification. 
Genres most strongly associated Reflective and Complex Blues, Jazz, Classical, Folk Intense and Rebellious Rock, Alternative, Heavy metal Upbeat and Conventional Country, Sound tracks, Religious, Pop Energetic and Rhythmic Rap/hip-hop, Soul/funk, Electronica/dance Table 2. Music-preference dimensions of Rentfrow and Gosling (2003). Several other researchers replicated Rentfrow and Gosling’s study with other populations and slightly different methodologies. To name a few, [8] surveyed 2,334 Dutch adolescents aged 12–19; [9] surveyed 268 Japanese college students; [10, 11] surveyed 422 and 170 German students, respectively; and [12] surveyed 358 Canadian students. Although there is a considerable degree of similarity in the results across these studies, there also appears to be a few inconsistencies. Firstly, the number of factors varies: while 4 studies revealed a 4factor structure [3, 8-10], one found 5 factors [11], and another, 9 factors1 [12]. These differences could potentially be explained by the fact that researchers used different music preference tests: the selection of the genres to include in these tests depends on the listening habits of the target population and thus needs to be adapted. The grouping of genres also varies. In the 4 above-mentioned studies in which a 4-factor structure was found, rock and metal music were consistently grouped together. However, techno/electronic was not always grouped with the same genres: while it was grouped with rap, hip-hop, and soul music in 3 studies, it was grouped with popular music in the study with the Dutch sample [8]. Similarly, religious music was paired with popular music in Rentfrow and Gosling’s study, but was paired with classical and jazz music in the 3 other studies. These discrepancies could come from the fact that some music genres might have different connotations in different cultures. It can also be added that music genres are problematic in themselves: they are broad, inconsistent, and ill-defined. To 3.1 Dimensions of music tastes There are numerous music genres and subgenres. However, as mentioned in [7], attitudes toward genres are not isolated from one another: there are genres that seem to go together while others seem to oppose. Therefore, to reduce the number of variables, prior to attempting to identify the correlates of music preferences, most researchers start by examining the nature of music preferences to identify the principal dimensions. The approach of Rentfrow and Gosling [3] is representative of the work of several researchers. To uncover the underlying structure of music preferences, they first asked 1,704 students from an American university to indicate their liking of 14 different music genres using a 7-point Likert scale. This questionnaire was called the Short Test Of Music Preferences (STOMP). They then performed factor analysis by means of principal-components analysis 1 For this study, the researchers started with 30 genres, as opposed to others who used between 11 and 21 genres. 452 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Energetic and Rhythmic music and Extraversion, which was found in most other studies [3, 8, 10], was not found with the Japanese sample. solve these problems, Rentfrow and colleagues [13, 14] replicated the study yet again, but used 52 music excerpts representing 26 different genres to measure music preferences instead of a list of music genres. The resulting structure was slightly different. 
It was composed of 5 factors labeled Mellow, Unpretentious, Sophisticated, Intense, and Contemporary (MUSIC). This approach also allowed them to examine the ties between the factors and the musical attributes. To do so, they asked non-experts to rate each music excerpt according to various attributes (i.e., auditory features, affect, energy level, perceived complexity) and used this information to identify the musical attributes that were more strongly associated with each factor. 3.3 Values and Beliefs Fewer recent studies have focused on the relationship between music preferences and values or beliefs compared to personality. Nevertheless, several correlates of music preferences were found in this area, from political orientation to religion to vegetarianism [15]. 3.3.1 Political Orientation In the 1980s, Peterson and Christenson [16] surveyed 259 American university students on their music preferences and political orientation. They found that liberalism was positively associated with liking jazz, reggae, soul, or hardcore punk, whereas linking 70s rock or 80s rock was negatively related to liberalism. They also uncovered a relationship between heavy metal and political alienation: heavy metal fans were significantly more likely than others to check off the “Don’t know/don’t care” box in response to the question about their political orientation. More recently, Rentfrow and Gosling [3] found that political conservatism was positively associated with liking Upbeat & Conventional music (e.g., popular music), whereas political liberalism was positively associated with liking Energetic and Rhythmic (e.g., rap, hip-hop) or Reflective and Complex (e.g., classical, jazz) music, although the last two correlations were weak. North and Hargreaves [15], who surveyed 2,532 British individuals, and Gardikiotis and Baltzis [17], who surveyed 606 Greek college students, also found that people who liked classical music, opera, and blues were more likely to have liberal, pro-social beliefs (e.g., public health care, protection of the environment, taking care of the most vulnerable). In contrast, fans of hip-hop, dance, and DJ-based music were found to be among the least likely groups to hold liberal beliefs (e.g., increased taxation to pay for public services, public health care) [15]. As we can see, liking jazz and classical music was consistently associated with liberalism, but no such clear patterns of associations emerged for other music genres, which suggests that further research is needed. 3.2 Personality traits Several researchers have examined the relationship between music preferences and personality traits [3, 8-10] using the 5-factor model of personality, commonly called the “Big Five” dimensions of personality (i.e., Extraversion, Emotional Stability, Agreeableness, Conscientiousness, and Openness to Experience). Rentfrow and Gosling [3] were the first to conduct a large-scale study focusing on this aspect, involving more than 3,000 participants. In addition to taking the STOMP test for measuring their music preferences, participants had to complete 6 personality tests, including the Big Five Inventory. The analysis of the results revealed associations between some personality traits and the 4 dimensions of music preferences. 
For instance, they found that liking Reflective and Complex music (e.g., classical, jazz) or Intense and Rebellious music (e.g., rock, metal) was positively related to Openness to Experience; and liking Upbeat and Conventional music (e.g., popular music) or Energetic and Rhythmic music (e.g., rap, hip-hop) was positively correlated with extraversion. Emotional Stability was the only personality dimension that had no significant correlation with any of the music-preference dimensions. Openness and Extraversion were the best predictors of music preferences. As mentioned previously, since researchers use different genres and thus find different music-preference dimensions, comparing results from various studies is problematic. Nonetheless, subsequent studies seem to confirm most of Rentfrow and Gosling’s findings. Delsing et al. [8] studied Dutch adolescents and found a similar pattern of associations between personality and music preferences dimensions. Only two correlations did not match. However, it should be noted that the correlations were generally lower, a disparity the authors attribute to the age difference between the two samples (college student vs. adolescents): adolescents being more influenced than young adults by their peers, personality might have a lesser effect on their music preferences. Brown [9] found fewer significant correlations when studying Japanese university students. The strongest correlations concerned Openness, which was positively associated with liking Reflective and Complex music (e.g., classical, jazz) and negatively related to liking Energetic and Rhythmic music (e.g., hip-hop/rap). The positive correlation between 3.3.2 Religious Beliefs There are very few studies that examined the link between music preferences and religion. The only recent one we could find was the study by North and Hargreaves previously mentioned [15]. Their analysis revealed that fans of western, classical music, disco, and musicals were the most likely to be religious; whereas fans of dance, indie, or DJ-based music were least likely to be religious. They also found a significant relation between music preferences and the religion affiliation of people. Fans of rock, musicals, or adult pop were more likely to be Protestant; fans of opera or country/western were more likely to be Catholic; and fans of R&B and hip-hop/rap were more likely to adhere to other religions. Another older study used the 1993 General Social Survey to examine the attitude of American adults towards heavy metal and rap music and found that people who attended religious services were more likely to dislike heavy metal 453 15th International Society for Music Information Retrieval Conference (ISMIR 2014) (no such association was found with rap music) [18]. Considering that religious beliefs vary across cultures, further studies are needed to discern a clear pattern of associations between music preferences and religion. Black Stylists cluster was composed of fans of hip-hop and reggae who were largely black, with some South Asian representation. By contrast, the Hard Rockers, who like heavy metal and alternative music, were almost exclusively white. 3.4 Demographic Variables 3.4.3 Age Most researchers who study music preferences draw their participants from the student population of the university where they work. As a result, samples are mostly homogenous in terms of age, which explains the small number of studies that focused on the relationship between age and music preferences. 
Age was found to be significantly associated with music preferences. For instance, [23] compared the music preferences of different age groups and found that there were only two genres—rock and country—that appeared in the five most highly rated genres of both the 18-24 year olds and the 55-64 year olds. While the favourite genres of younger adults were rap, metal, rock, country, and blues; older adults preferred gospel, country, mood/easy listening, rock, and classical/chamber music. [15] also found a correlation between age and preferences for certain music genres. Unsurprisingly, their analysis revealed that people who liked what could be considered trendy music genres (e.g., hiphop/rap, DJ-based music, dance, indie, chart pop) were more likely to be young, whereas people who liked more conventional music genres (e.g., classical music, sixties pop, musicals, country) were more likely to be older. [24] conducted a study involving more than 250,000 participants and found that the interest for music genres associated with the Intense (e.g., rock, heavy metal, punk) and the Contemporary (e.g., rap, funk, reggae) musicpreference dimensions decreases with age, whereas the interest for music genres associated with the Unpretentious (e.g., pop, country) and the Sophisticated (e.g., classical, folk, jazz) dimensions increases. Some researchers have also looked at the trajectory of music tastes. Studies on the music preferences of children and adolescents revealed that as they get older, adolescents tend to move away from mainstream rock and pop, although these genres remain popular throughout adolescence [7]. Research has also demonstrated that music tastes are already fairly stable in early adolescence and further crystallize in late adolescence or early adulthood [25, 26]. Using data from the American national Survey of Public Participation in the Arts (SPPA) of 1982, 1992, and 2002, [23] examined the relationship between age and music tastes, with a focus on older age. They looked at the number of genres liked per age group and found that in young adulthood, people had fairly narrow tastes. Their tastes expand into middle age (i.e., 55 year old), to then narrow again, suggesting that people disengage from music in older age. They also found that although music genres that are popular among younger adults change from generation to generation; they remain much more stable among older people. 3.4.1 Gender Several studies have revealed associations between gender and music tastes. It was found that women were more likely to be fans of chart pop or other types of easy listening music (e.g., country) [7, 12, 15, 19, 20], whereas men were more likely to prefer rock and heavy metal [12, 15, 19, 20]. This is not to say that women do not like rock: in Colley’s study [19], which focused on gender differences in music tastes, rock was the second most highly rated music genre among women: the average rating for women was 4.1 (on an 8-point scale from 0 to 7) vs. 4.8 for men. There was, however, a much greater gap in the attitudes towards popular music between men and women, who attributed 3.17 and 4.62 on average, respectively. This was the genre for which gender difference was the most pronounced. Lastly, it is worth mentioning that most studies did not find any significant gender difference for rap [7, 12, 19], which indicates that music in this category appeals to both sexes. This is a surprising result considering the misogynistic message conveyed by many rap songs. 
Christenson and Roberts [7] speculated that this could be due to the fact that men appreciate rap for its subversive lyrics while women appreciate it for its danceability. 3.4.2 Race and Ethnicity Very few studies have examined the ties between music preferences and race and ethnicity. In the 1970s, a survey of 919 American college students revealed that, among the demographic characteristics, race was the strongest predictor of music preferences [21]. In a book published in 1998 [7], Christenson and Roberts affirmed that racial and ethnic origins of fans of a music genre mirror those of its musicians. To support their affirmation, they reported the results of a survey of adolescents conducted in the 1990s by Carol Dykers in which 7% of black adolescents reported rap as their favourite music genre, compared with 13% of white adolescents. On the other hand, 25% of white adolescents indicated either rock or heavy metal as their favourite genre, whereas these two genres had been only mentioned by a very small number of black adolescents (less than 5% for heavy metal). North & Hargreaves [15] also found a significant relationship between ethnic background and music preferences. This study was conducted more recently (in 2007), with British adults, and with a more diversified sample in terms of ethnic origins. Interestingly, they found that a high proportion of the respondents who were from an Asian background liked R&B, dance, and hip-hop/rap, which seems to challenge Christenson and Roberts’ affirmation. [22] who studied 3,393 Canadian adolescents, performed a cluster analysis to group respondents according to their music preferences. They then examined the correlates of each music-taste cluster. The analysis revealed a different ethnic composition for different clusters. For instance, the 3.4.4 Education Education was also found to be significantly correlated to music preferences. [15] found that individuals who held a master’s degree or a Ph.D. were most likely to like opera, 454 15th International Society for Music Information Retrieval Conference (ISMIR 2014) jazz, classical music, or blues; whereas fans of country, musicals, or 1960s pop were most likely to have a lower level of education. [27] studied 325 adolescents and their parents and also found an association between higher education and a taste for classical and jazz music. Parents with lower education were more likely to like popular music and to dislike classical and jazz music. 3.5 Social influences As mentioned before, research established that people use their music preferences as a social badge that conveys information about their personality, values, and beliefs. But music does not only play a role in the construction of personal identity. It is also important to social identity. Music preferences can also act as a social badge that indicates membership in a social group or a social class. 3.5.1 Peers and Parents Considering the importance adolescents ascribe to both friendship and music, it is not surprising to learn that social groups often identify with music subcultures during adolescence [4]. Therefore, it seems legitimate to posit that in the process of forming their social identity, adolescents may adopt music preferences similar to that of other members of the social group to which they belong or they aspire to belong. This hypothesis seems to be confirmed by recent studies. 
[28] examined the music preferences of 566 Dutch adolescents who formed 283 samesex friendship dyads and found a high degree of similarity in the music preferences of mutual friends. Since they surveyed the same participants one year after the initial survey, they could also examine the role of music preferences in the formation of new friendships and found that adolescents who had similar music preferences were more likely to become friends, as long as their music preferences were not associated with the most mainstream dimensions. In the same line, Boer and colleagues [29] conducted three studies (two laboratory experiments involving German participants and one field study involving Hong Kong university students) to examine the relationship between similarity in music preferences and social attraction. They found that people were more likely to be attracted to others who shared their music tastes because it suggests that they might also share the same values. Adolescents were also found to be influenced by the music tastes of their parents. ter Bogt and colleagues [27] studied the music tastes of 325 adolescents and their parents. Their analysis revealed some significant correlations. The adolescents whose parents liked classical or jazz music were also more likely to appreciate these music genres. Parents’ preferences for popular music were associated with a preference for popular and dance music in their adolescent children. Parents were also found to pass on their liking of rock music to their adolescent daughters but not to their sons. One possible explanation for the influence of parents on their children’s music tastes is that since family members live under the same roof, children are almost inevitably exposed to the favourite music of their parents. 455 3.5.2 Social Class In La Distinction [30], Bourdieu proposed a social stratification of tastes and cultural practices according to which a taste for highbrow music or other cultural products (and a disdain for lowbrow culture) is considered the expression of a high status. Recent research, however, suggests that a profound transformation in the tastes of the elite has occurred. In an article published in 1996, Peterson and Kern [31] reported the results of a study of the musical tastes of Americans based on data from the Survey of Public Participation in the Arts of 1982 and 1992. Their analysis revealed that far from being snobbish in their tastes, individuals with a high occupational status had eclectic tastes which spanned across the lowbrow/highbrow spectrum. In fact, people of high status were found to be more omnivorous than others, and their level of omnivorousness has increased over time. This highly cited study has motivated several other researchers to study the link between social class and music preferences. Similar studies were conducted in other countries, notably in France [32], Spain [33], and the Netherlands [34], and yielded similar results. 4. IMPLICATION FOR MUSIC RECOMMENDER SYSTEM DESIGN A review of the literature on music tastes revealed many interesting findings that could be used to improve music RS. Firstly, we saw that researchers had been able to uncover the underlying structure of music preferences, which is composed of 4 or 5 factors. The main advantage for music RS is that these factors are fairly stable across populations and time, as opposed to genres, which are inconsistent and ill-defined. 
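To make this concrete, the following minimal sketch (our illustration, not taken from any of the reviewed studies) shows how a recommender could score a new user on the four dimensions of Table 2 from a handful of genre ratings. The genre-to-dimension mapping follows Table 2; the data structures and function names are assumptions.

```python
# Genre groupings copied from Table 2 (Rentfrow and Gosling, 2003)
DIMENSIONS = {
    "Reflective and Complex": ["blues", "jazz", "classical", "folk"],
    "Intense and Rebellious": ["rock", "alternative", "heavy metal"],
    "Upbeat and Conventional": ["country", "soundtracks", "religious", "pop"],
    "Energetic and Rhythmic": ["rap/hip-hop", "soul/funk", "electronica/dance"],
}

def dimension_scores(genre_ratings):
    """Average a user's genre ratings (e.g. a 1-7 Likert scale) within each
    dimension; a dimension with no rated genres gets None."""
    scores = {}
    for dim, genres in DIMENSIONS.items():
        rated = [genre_ratings[g] for g in genres if g in genre_ratings]
        scores[dim] = sum(rated) / len(rated) if rated else None
    return scores

# Hypothetical ratings gathered from a new user at sign-up
user = {"jazz": 6, "classical": 7, "rock": 5, "pop": 2, "rap/hip-hop": 3}
print(dimension_scores(user))
```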
As suggested by Rentfrow, Goldberg, and Levitin themselves [13], music RS could characterize the music preferences of their users by calculating a score for each dimension. Secondly, some personality dimensions were found to be correlated to music preferences. In most studies, Openness to experience was the strongest predictor of music tastes. It was positively related to liking Reflective and Complex music (e.g., jazz and classical) and, to a lesser extent, to Intense and Rebellious music (e.g., rock, heavy metal). This could indicate that users who like these music genres are more open to new music than other users. RS could take that into account and adapt the novelty level accordingly. Finally, the demographic correlates of music preferences (e.g., age, gender, education, race), as well as religion and political orientation, could help ease the new user cold-start problem. As mentioned in the introduction, many music RS invite new users to create a profile and/or allow them to connect with a social networking site account, in which they have a profile. These profiles contain various types of information about users. Music RS could combine such information to make inferences about the music preferences of new users. In the same line, information about the education and the occupation of a user could be used to identify potential high-status, omnivore users. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 5. CONCLUSION The abundant research on music tastes in sociology and social psychology has been mostly overlooked by music RS developers. This review of selected literature on the topic allowed us to present the patterns of associations between music preferences and demographic characteristics, personality traits, values and beliefs. It also revealed the importance of social influences on music tastes and the role music plays in the construction of individual and social identities. [16] [17] [18] [19] 6. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [20] Y. Koren: “Factor in the neighbors: Scalable and accurate collaborative filtering,” ACM Transactions on Knowledge Discovery Data, Vol. 4, No. 1, pp. 1-24, 2010. A. Uitdenbogerd and R. V. Schyndel: "A review of factors affecting music recommender success," ISMIR 2002: Proceedings of the Third International Conference on Music Information Retrieval, M. Fingerhut, ed., pp. 204208, Paris, France: IRCAM - Centre Pompidou, 2002. P. J. Rentfrow and S. D. Gosling: “The do re mi's of everyday life: The structure and personality correlates of music preferences.,” Journal of Personality and Social Psychology, Vol. 84, No. 6, pp. 1236-1256, 2003. S. Frith: Sound effects : youth, leisure, and the politics of rock'n'roll, New York: Pantheon Books, 1981. A. C. North and D. J. Hargreaves: “Music and adolescent identity,” Music Education Research, Vol. 1, No. 1, pp. 75-92, 1999. P. J. Rentfrow and S. D. Gosling: “Message in a ballad: the role of music preferences in interpersonal perception,” Psychological Science, Vol. 17, No. 3, pp. 236-242, 2006. P. G. Christenson and D. F. Roberts: It's not only rock & roll : popular music in the lives of adolescents, Cresskill: Hampton Press, 1998. M. J. M. H. Delsing, T. F. M. ter Bogt, R. C. M. E. Engels, and W. H. J. Meeus: “Adolescents' music preferences and personality characteristics,” European Journal of Personality, Vol. 22, No. 2, pp. 109-130, 2008. R. 
Brown: “Music preferences and personality among Japanese university students,” International Journal of Psychology, Vol. 47, No. 4, pp. 259-268, 2012. A. Langmeyer, A. Guglhor-Rudan, and C. Tarnai: “What do music preferences reveal about personality? A crosscultural replication using self-ratings and ratings of music samples,” Journal of Individual Differences, Vol. 33, No. 2, pp. 119-130, 2012. T. Schäfer and P. Sedlmeier: “From the functions of music to music preference,” Psychology of Music, Vol. 37, No. 3, pp. 279-300, 2009. D. George, K. Stickle, R. Faith, and A. Wopnford: “The association between types of music enjoyed and cognitive, behavioral, and personality factors of those who listen,” Psychomusicology, Vol. 19, No. 2, pp. 32-56, 2007. P. J. Rentfrow, L. R. Goldberg, and D. J. Levitin: “The structure of musical preferences: A five-factor model,” Journal of Personality and Social Psychology, Vol. 100, No. 6, pp. 1139-1157, 2011. P. J. Rentfrow, L. R. Goldberg, D. J. Stillwell, M. Kosinski, S. D. Gosling, and D. J. Levitin: “The song remains the same: A replication and extension of the music model,” Music Perception, Vol. 30, No. 2, pp. 161185, 2012. A. C. North and D. J. Hargreaves, Jr.: “Lifestyle correlates of musical preference: 1. Relationships, living [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] 456 arrangements, beliefs, and crime, ” Psychology of Music, Vol 35m No. 1, pp. 58-87, 2007. J. B. Peterson and P. G. Christenson: “Political orientation and music preference in the 1980s,” Popular Music and Society, Vol. 11, No. 4, pp. 1-17, 1987. A. Gardikiotis and A. Baltzis: “'Rock music for myself and justice to the world!': Musical identity, values, and music preferences,” Psychology of Music, Vol. 40, No. 2, pp. 143-163, 2012. J. Lynxwiler and D. Gay: “Moral boundaries and deviant music: public attitudes toward heavy metal and rap,” Deviant Behavior, Vol. 21, No. 1, pp. 63-85, 2000. A. Colley: “Young people's musical taste: relationship with gender and gender-related traits,” Journal of Applied Social Psychology, Vol. 38, No. 8, pp. 2039-2055, 2008. P. G. Christenson and J. B. Peterson: “Genre and gender in the structure of music preferences,” Communication Research, Vol. 15, No. 3, pp. 282-301, 1988. R. S. Denisoff and M. H. Levine: “Youth and popular music: A test of the taste culture hypothesis,” Youth & Society, Vol. 4, No. 2, pp. 237-255, 1972. J. Tanner, M. Asbridge, and S. Wortley: “Our favourite melodies: musical consumption and teenage lifestyles,” British Journal of Sociology, Vol. 59, No. 1, pp. 117-144, 2008. J. Harrison and J. Ryan: “Musical taste and ageing,” Ageing & Society, Vol. 30, No. 4, pp. 649-669, 2010. A. Bonneville-Roussy, P. J. Rentfrow, M. K. Xu, and J. Potter: “Music through the ages: Trends in musical engagement and preferences from adolescence through middle adulthood, ” American Psychological Association, pp. 703-717, 2013. M. B. Holbrook and R. M. Schindler: “Some exploratory findings on the development of musical tastes,” Journal of Consumer Research, Vol. 16, No. 1, pp. 119-124, 1989. J. Hemming: “Is there a peak in popular music preference at a certain song-specific age? A replication of Holbrook & Schindler’s 1989 study,” Musicae Scientiae, Vol. 17, No. 3, pp. 293-304, 2013. T. F. M. ter Bogt, M. J. M. H. Delsing, M. van Zalk, P. G. Christenson, and W. H. J. Meeus: “Intergenerational Continuity of Taste: Parental and Adolescent Music Preferences,” Social Forces, Vol. 90, No. 1, pp. 297-319, 2011. M. H. W. Selfhout, S. 
J. T. Branje, T. F. M. ter Bogt, and W. H. J. Meeus: “The role of music preferences in early adolescents’ friendship formation and stability,” Journal of Adolescence, Vol. 32, No. 1, pp. 95-107, 2009. D. Boer, R. Fischer, M. Strack, M. H. Bond, E. Lo, and J. Lam: “How shared preferences in music create bonds between people: Values as the missing link,” Personality and Social Psychology Bulletin, Vol. 37, No. 9, pp. 11591171, 2011. P. Bourdieu: La distinction: critique sociale du jugement, Paris: Éditions de minuit, 1979. R. A. Peterson and R. M. Kern: “Changing Highbrow Taste: From Snob to Omnivore,” American Sociological Review, Vol. 61, No. 5, pp. 900-907, 1996. P. Coulangeon and Y. Lemel: “Is ‘distinction’ really outdated? Questioning the meaning of the omnivorization of musical taste in contemporary France,” Poetics, Vol. 35, No. 2-3, pp. 93-111, 2007. J. López-Sintas, M. E. Garcia-Alvarez, and N. Filimon: “Scale and periodicities of recorded music consumption: reconciling Bourdieu's theory of taste with facts,” The Sociological Review, Vol. 56, No. 1, pp. 78-101, 2008. K. van Eijck: “Social Differentiation in Musical Taste Patterns,” Social Forces, Vol. 79, No. 3, pp. 1163-1185, 2001. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) SOCIAL MUSIC IN CARS Sally Jo Cunningham, David M. Nichols, David Bainbridge, Hasan Ali Department of Computer Science, University of Waikato, New Zealand {sallyjo, d.nichols, davidb}@waikato.ac.nz, [email protected] Wayne: “I think we’ll go with a little Bohemian Rhapsody, gentlemen” Garth: “Good call” Wayne's World (1992) ABSTRACT large majority of drivers in the United States declare they sing aloud when driving”. Walsh provides the most detailed discussion of the social aspects of music in cars, noting the interaction with conversation (particularly through volume levels) and music’s role in filling “chasms of silence” [21]. Issues of impression management [9, 21] (music I like but wouldn’t want others to know I like) are more acute in the confined environment of a car and vary depending on the social relationships between the occupants [21]. Music selections are often the result of negotiations between the passengers and the driver [14, 21], where the driver typically has privileged access to the audio controls. Bull [6] reports a particularly interesting example of the intersection between the private environment of personal portable devices and the social environment of a car with passengers: Jim points to the problematic nature of joint listening in the automobile due to differing musical tastes. The result is that he plays his iPod through the car radio whilst his children listen to theirs independently or playfully in ‘harmony’ resulting in multiple soundworlds in the same space. Here, although the children have personal devices they try to synchronize the playback so that they can experience the same song at the same time; even though their activity will occur in the context of another piece of music on the car audio system. Alternative methods for sharing include explicit (and implicit) recommendation, as in Push!Music [15], and physical sharing of earbuds [3]. Bull [6] also highlights another aspect of music in cars: selection activities that occur prior to a journey. The classic ‘roadtrip’ activity of choosing music to accompany a long drive is also noted: “drivers would intentionally set up and prepare for their journey by explicitly selecting music to accompany the protracted journey “on the road”” [21]. 
Sound Pryer [18] is a joint-listening prototype that enables drivers to ‘pry’ into the music playing in other cars. This approach emphasizes driving as a social practice, though it focuses on inter-driver relationships rather than those involving passengers. Sound Pryer can also be thought of as a transfer of some of the mobile music sharing concepts in the tunA system [2] to the car setting. This paper builds an understanding of how music is currently experienced by a social group travelling together in a car—how songs are chosen for playing, how music both reflects and influences the group’s mood and social interaction, who supplies the music, the hardware/software that supports song selection and presentation. This fine-grained context emerges from a qualitative analysis of a rich set of ethnographic data (participant observations and interviews) focusing primarily on the experience of in-car music on moderate length and long trips. We suggest features and functionality for music software to enhance the social experience when travelling in cars, and prototype and test a user interface based on design suggestions drawn from the data. 1. INTRODUCTION Automobile travel occupies a significant space in modern Western lives and culture. The car can become a ‘homefrom-home’ for commuters in their largely solitary travels, and for groups of people (friends, families, work colleagues) in both long and short journeys [20]. Music is commonly seen as a natural feature of automotive travel, and as cars become increasingly computerized [17] the opportunities are increased for providing music tailored to the specific characteristics of a given journey. To achieve this goal, however, we must first come to a more fine-grained understanding of these car-based everyday music experiences. To that end, this paper explores the role of music in supporting the ‘peculiar sociality’ [20] of car travel. 2. BACKGROUND Most work investigating the experience of music in cars focuses on single-users, (e.g. [4], [5]). Solo drivers are free to create their own audio environment: “the car is a space of performance and communication where drivers report being in dialogue with the radio or singing in their own auditized/privatized space” [5]. Walsh [21] notes that “a © S.J. Cunningham, D.M. Nichols, D. Bainbridge, H. Ali. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S.J. Cunningham, D.M. Nichols, D. Bainbridge, H. Ali.. “Social Music in Cars”, 15th International Society for Music Information Retrieval Conference, 2014. 457 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Driver distraction is known to be a significant factor in vehicle accidents and has led to legislation around the world restricting the use of mobile phones whilst driving. In addition to distraction effects caused by operating audio devices there are the separate issues of how the music itself affects the driver. Driving style can be influenced by genre, volume and tempo of music [10]: “at high levels, fast and loud music has been shown to divert attention [from driving]” [11], although drivers frequently use music to relax [11]. Several reports indicate that drivers use music to relieve boredom on long or familiar routes [1, 21], e.g. “as repetitious scenery encourages increasing disinterest … the personalized sounds of travel assume a greater role in allowing the driver-occupants respite via intermitting the sonic activity during protracted driving stints” [21]. 
Many accidents are caused by driver drowsiness; when linked with physiological sensors to assess the driver’s state, music can be used to assist in maintaining an appropriate level of driver vigilance [16]. Music can also counteract driver vigilance by masking external sounds and auditory warnings, particularly for older drivers where agerelated hearing loss is more likely to occur [19]. In summary, music fulfils a variety of different roles in affecting the mental state of the driver. It competes and interacts with passenger conversation, the external environmental and with audio functions from the increasingly computerized driving interface of the car. When passengers are present, the selection and playing of music is a social activity that requires negotiation between the occupants of the vehicle. people participating in a trip ranged from one to five (Table 2). Of the 69 total travelers across the nineteen journeys, 45 were male and 24 were female. One set of travelers were all female, 7 were all male, and the remainder (11) were mixed gender. Table 1. Demographics of student investigators Male Female National Origin Count 17 5 Age Range: 20 - 27 NZ/Australia China Mid-East Other 5 13 3 1 Grounded Theory methods [13] were used to analyze the student summaries of their participant observations and interviews. This present paper teases out the social behaviors that influence, and are influenced by, music played during group car travel. Supporting evidence drawn from the ethnographic data is presented below in italics. Table 2. Number of travelers in observed journeys 1 2 3 4 5 1 0 7 7 4 4. MUSIC BEHAVIOR IN CAR TRAVEL This section explores: the physical car environment and the reported car audio devices; the different reported roles of the driver; observed behaviors surrounding the choice of songs and the setting of volume; music and driving safety; ordering of songs that are selected to be played; and the ‘activities’ that music supports and influences. 3. DATA COLLECTION AND METHODOLOGY 4.1 Pre-trip Activities Our research uses data collected in a third year university Human Computer Interaction (HCI) course in which students design and prototype a system for the set application, where their designs are informed by an ethnographic investigations into behavior associated with the application domain. This present paper focuses on the ethnographic data collected that relates to music and car travel, as gathered by 22 student investigators (Table 1). All data gathering for this study occurred within New Zealand. To explore the problem of designing a system to support groups of people in selecting and playing music while traveling, The students performed participant observations, with the observations focusing on how the music is chosen for playing, how the music fits in with the other activities being conducted, who supplies the music, and how/who changes the songs or alters the volume. The students then explored subjective social music experiences through autoethnographies [8] and interviews of friends. The data comprises 19 participant observations, two self-interviews, and four interviews (approximately 45 printed pages). Of the 19 participant observations, four were of short drives (10 to 30 minutes), 14 were lengthier trips (50 minutes to 2 hours), and one was a classic ‘road trip’ (7 hours). 
The number of The owner of a car often keeps personal music on hand in the vehicle (CDs, an MP3 player loaded with ‘car music’) as well as carrying along a mobile or MP3 player loaded with his/her music collection). If only the owner’s music is played on the trip, then that person should, logically, also manage the selection of songs during the journey. Unfortunately the owner of the car is also often the driver as well— and so safety may be compromised when the driver is actively involved in choosing and ordering songs for play. Passengers are also likely to have on hand a mobile or MP3 player, and for longer trips may select CDs to share. If two or more people contribute music to be played on the journey, the challenge then becomes to bring all the songs together onto a single device—otherwise they experience the hassle of juggling several players. A consequence of merging collections, however, is that no one person will be familiar with the full set of songs, making on-the-road construction of playlists more difficult (particularly given the impoverished display surface of most MP3 players). A simple pooling of songs from the passengers’ and driver’s personal music devices is unlikely to provide an efficiently utilizable source for selection of songs for a specific journey. The music that an individual listens to during 458 15th International Society for Music Information Retrieval Conference (ISMIR 2014) a usual day’s activities may not be suitable for a particular trip, or indeed for any car journey. People tend to tailor their listening to the activity at hand [7], and so songs that are perfect ‘gym music’ or ‘study music’ may not have the appropriate tempo, mood, or emotional tenor. Further, an individual’s music collection may include ‘guilty pleasures’ that s/he may not want others to become aware of [9]: What mainly made [him] less comfortable in providing music that he likes is because he did [not] want to destroy the hyper atmosphere in the car as a result of the mostly energetic songs being played throughout the trip. His taste is mostly doom and death metal, with harsh emotion and so will create a bleak atmosphere in the car. Conversely, loud, fast tempo music can adversely affect safety ([As the driver, I] changed the volume very high… my body was shaking with the song. I stepped on the accelerator in my car; The driver [was] seen to increase the speed when the songs he liked is on). • Listening to music can be the main source of entertainment during a trip, as the driver and passengers focus on the songs played. • Songs need not be listened to passively; travelers may engage in group sing-alongs, with the music providing support for their ‘performances’. These sessions may be loud and include over-the-top emotive renditions for the amusement of the singer and the group, and be accompanied by clapping and ‘dancing’ in the seats (The participants would sing along to the lyrics of the songs, and also sometimes dance along to the music, laughing and smiling throughout it). • A particular song may spark a conversation about the music—to identify a song (they would know what song they wanted to hear but they would not know the artist or name of the song. When this happened, they would … try to think of the artist name together) or to discuss other aspects of the artist/song/genre/etc (‘In the air tonight, Phil Collins!’ Ann asked Joan and I, ‘did you know that it’s top of the charts at the moment’ … There was conversation about Phil Collins re-releasing his music.) 
A lively debate can surround the choice and ordering of the songs to play, if playlists are created during the trip itself. • Music can provide a background to conversation; at this point the travelers pay little or no attention to the songs but they mask traffic noises (when we were chatting… no one really cared what was on as long as there was some ambient sound). By providing ‘filler’ for awkward silences, music is particularly useful in supporting conversations among groups who don’t know each other particularly well (it seemed more natural to talk when there was music to break the silence). For shorter trips, music might serve only one or two of these social purposes—playing as background to a debate over where to eat, for example. On longer journeys, the focus of group attention and activity is likely to shift over time, and with that shift the role of the music will vary as well: At some times it would be the focus activity, with everyone having input on what song to choose and then singing along. While at other times the group just wanted to talk with each other and so the music was turned right down and became background music… 4.2 Physical Environment and Audio Equipment The travel described in the participant observations primarily occurred in standard sized cars with two seating areas, comfortably seating at most two people in the front and three in the rear sections. In this environment physical movement is constrained. If the audio device controller is fixed in place then not everyone can easily reach it or view its display; if the controller is a handheld device, then it must be passed around (and even then it may be awkward to move the controller between the two sections). As is typical of student vehicles in New Zealand, the cars tended to be older (10+ years) and so were less likely to include sophisticated audio options such as configurable speakers and built-in MP3 systems. The range of audio equipment reported included radio, built-in CD player, portable CD player, stand-alone MP3 player plus speakers, and MP3 player connected to the car audio system. The overwhelming preference evinced in this study is for devices that give more fine-grained control over song selection (i.e., MP3 players over CD players, CD players over radio). The disadvantages of radio are that music choice is by station rather than by song, reception can be disrupted if the car travels out of range, and most channels include ads. On the other hand, radio can provide news and talkback, to break up a longer journey. 4.3 Music in Support of Journey Social Activities Music is seen as integral to the group experience on a trip; it would be unacceptable and anti-social for the car’s occupants to simply each listen to their individual MP3 player, for example. We identify a wide variety of ways that travelers select songs so as to support group social activities during travel: • Music can contribute to driving safety, by playing songs that will reduce driver drowsiness and keep the driver focused (music… can liven up a drive and keep you entertained or awake much longer). For passengers, it can reduce the tedium associated with trips through uninteresting or too-familiar scenery (music can reduce the boredom for you and your friends with the journey). 4.4 Selecting and Ordering Songs The physical music device plays a significant role in determining who chooses the music on a car trip. 
If the device is fixed (typically in the center of the dashboard), then it is easily accessible only by the driver or front passenger—and so they are likely to have primary responsibility for choosing, or arbitrating the choice, of songs. The driver is often 459 15th International Society for Music Information Retrieval Conference (ISMIR 2014) the owner of the vehicle, and in that case is likely to be assertive at decision points (Since I was the driver, I was basically the DJ. I would select the CD and the song to be played. I also changed the song if I didn’t like it even if others in the car did.). Given the small display surfaces of most music devices and the complexity of interactions with those devices, it is likely that safety is compromised when the driver acts as DJ. Consider, for example: I select some remixed trance music from the second CD at odd slots of the playlist, and then insert some pop songs from other CDs in the rest of the slots of the list. … I manually change the play order to random. Also I disable the volume protect. And enable the max volume that from the subwoofer due to the noises from the outside of my car … If the music system has a hand-held controller, then the responsibility for song selection can move through the car. At any one point, however, a single individual will assume responsibility for music management. Friends are often familiar with each other’s tastes, and so decisions can be made amicably with little or no consultation (I felt comfortable in choosing the music because they were mostly friends and I knew what kind of music they were all into and what music some friends were not into…). Imposing one’s will might go against the sense of a group experience and social expectations (…having the last word means it could cause problems between friends), or alternatively close ties might make unilateral decisions more acceptable (I did occasionally get fed up from their music and put back my music again without even asking them for permission, you know we are all friends.). As noted in Section 4.1, song selection on the fly can be difficult because the chooser may not be familiar with the complete base collection, or because the base collection includes songs not suited to the current mood of the trip. A common strategy is to listen to the first few seconds of a song, and if it is unacceptable then to skip to the song that comes up ‘next’ in the CD / shuffle / predetermined playlist. This strategy provides a choppy listening experience, but does have the advantage of simplicity: a song is skipped if any one person in the car expresses an objection to it. It may, however, be embarrassing to ask for a change if one is not in current possession of the control device. Song-by-song selection is appropriate for shorter trips, as the setup time for a playlist may be longer than the journey itself. Suggesting and ordering songs can also be a part of the fun of the event and engage travelers socially (My friends would request any songs that they would like to hear, and the passenger in control of the iPod acted like a human playlist; trying to memorise the requests in order and playing them as each song finished.) For longer trips, a set of pre-created playlists or mixes (supporting the expected moods or phases of the journey) can create a smoother travel experience. A diverse set of playlists may be necessary to match the range of social mu- sic behaviors reported in Section 4.2. 
Even with careful pre-planning, however, a song may be rejected at time of play for personal, idiosyncratic reasons (for example, one participant skips particular songs … associated with particular memories and events so I don’t like to listen to them while driving for example). 4.5 Music Volume Sound volume is likely to change during a trip, signaling a change in the mood of the gathering, an alteration in the group focus, or to intensify / downplay the effects of a given song. Participant observations included the following reasons for altering sound levels: to focus group attention on a particular song (louder); for the group to sing along with a song (louder); to switch the focus of group activity from the music to conversation (softer); to ‘energize’ the mood of the group (louder); to calm the group mood, and particularly to permit passengers to sleep (softer); and to move the group focus from conversation back to the music, particularly when conversation falters (louder). Clearly the ability to modulate volume to fit to the current activity or mood is crucial. A finer control than is currently available would be desirable, as often speaker placement means perceived volume depends on one’s seat in the car ([he] asked the driver to turn the bass down … because the bass effect was too strong, and the driver … think[s] the bass is fine in the front). Further, the physical division of a car into separate rows of seats and its restriction of passenger movement can encourage separate activity ‘zones’ (for example, front seats / back seats)—and the appropriate volume for the music can differ between seating areas: One of our friends who sets beside the driver is paying more attentions on the music, the rest 3 of us set in the back were communicate a lot more, and didn’t paying too much attention on the music… the front people can hear the music a lot more clear then the people sets in the back, and it’s harder for the front people to join the communication with the back people because he need to turn his head around for the chat sometimes. 5. IMPLICATIONS FOR A SOCIAL AUDIO SYSTEM FOR CAR TRAVEL Leveraging upon music information retrieval capabilities, we now describe how our findings can inform the design of software specially targeted for song selection during car trips—personified, the software we seek in essence acts as a music host. In general a playlist generator [12] for song selection coupled with access to a distributed network of self-contained digital music libraries for storing, organizing, and retrieving items (the collections of songs the various people travelling have) are useful building blocks to developing such software; however, to achieve a digital music host, what is needed ultimately goes beyond this. 460 15th International Society for Music Information Retrieval Conference (ISMIR 2014) In broad terms, we envisage a software application with two phases: initial configuration and responsive adaptation. During configuration, the application gathers the pool of songs for the trip from the individuals’ devices, taking into account preferences such as which songs they wish to keep private and which types of songs (genre, artist, tempo, etc.) that they wish to have considered for the trip playlist. The users are then prompted to enter the approximate length of the upcoming road trip, and an initial playlist is constructed based on the user preferences and pool of songs. 
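A minimal sketch of what this configuration phase might look like in code is given below. The Song fields (owner, private, genre), the shared-pool filtering, and the fill-to-trip-length heuristic are illustrative assumptions, not part of the prototype described in this paper.

```python
import random
from dataclasses import dataclass

@dataclass
class Song:
    title: str
    artist: str
    genre: str
    duration_s: int   # track length in seconds
    owner: str        # whose device contributed the track
    private: bool     # owner excluded it from sharing

def pool_songs(devices, allowed_genres):
    """Gather the shared songs from every traveller's device,
    dropping private tracks and genres excluded for this trip."""
    pool = []
    for device in devices:            # device: list of Song objects
        for song in device:
            if song.private or song.genre not in allowed_genres:
                continue
            pool.append(song)
    return pool

def initial_playlist(pool, trip_minutes):
    """Build a first-cut playlist roughly matching the stated trip length."""
    random.shuffle(pool)
    playlist, total = [], 0
    for song in pool:
        if total >= trip_minutes * 60:
            break
        playlist.append(song)
        total += song.duration_s
    return playlist
```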
During the trip, the application can make use of a variety of inputs to dynamically adjust the sequence of songs played. Here significant gains can be made from inventive uses of MIR techniques coupled with temporal and spatial information–even data sensors from the car. For instance, if the application noticed the driver speeding for that section of road it could alter the selection of the next song to one that is quieter with a slower tempo (beat detection); alternatively, triggered by the detection of the conversation lapsing into silence (noise cancelling) the next song played could be altered to be one labeled with a higher “interest” value (tagged, for instance, using semantic web technologies, and captured in the playlist as metadata). News sourced from a radio signal (whichever is currently in range) can be interspersed with the songs being played. As evidenced by our analysis, the role of the driver/owner of the car takes on special significance in terms of the interface and interaction design. As the host of the vehicle, there is a perception that they are more closely linked to the software (the digital music host) that is making the decision over what to play next. While it is not a strict requirement of the software, for the majority of situations it will be an instinctive decision that the key audio device used to play the songs on the trip will be the one owned by the driver. For the adaptive phase of the software then, there is a certain irony that the driver (for reasons of driving safely) has less opportunity to influence the song selection during the trip. To address this imbalance, an aspect the software could support is the prioritization of input from the “master” application at noted times that are deemed safe (such when the car is stationary). More prosaically, the travellers will requires support in tweaking the playlist as the trip progresses. We developed and tested a prototype of this aspect of the system, to evaluate the design’s potential. The existing behaviors explored in Section 3 suggest that this system should be targeted at tablet devices rather than smaller mobiles: while the device should be lightweight enough to be easily passed between passengers in a vehicle, the users should be able to clearly see the screen details from an arm’s length, and controls should be large and spaced to minimize input error. Figure 1 presents screenshots for primary functionality of our prototype: the view of the trip playlist, which features the current song in context with the preceding and succeeding songs (Figure 1a); the lyrics display for the current song, sized to be viewable by all (Figure 1b); and a screen allowing selected songs to be easily inserted into different points in the playlist (Figure 1c). While it was tempting on a technical level to include mobile-based wireless voting (using their smart phones) to move the currently playing item up or down as an expression of like/dislike (relevance feedback), we recognize that face-to-face discussion and argument over songs is often a source of enjoyment and bonding for fellow travelers—and so we deliberately support only manual playlist manipulation. Figure 1a. Playlist view. Figure 1b. Lyrics view for the active song. Figure 1c. After searching for a song, ‘smart options’ for inserting the song into the current section of the playlist. Given the practical and safety difficulties in evaluating our prototype system in a moving car, we instead used a stationary simulation. 
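As a rough illustration of the adaptive phase described above, the selector below reacts to vehicle speed and cabin noise when choosing the next track. The sensor inputs, the thresholds, and the tempo_bpm / interest metadata fields are hypothetical; this is a sketch of the idea, not the prototype's implementation.

```python
def pick_next_song(queue, speed_kmh, speed_limit_kmh, cabin_noise_db):
    """Choose the next track from the pending queue with simple
    sensor-driven rules: calm the music when the driver is speeding,
    raise the 'interest' of the music when conversation has gone quiet.

    Each song is a dict with assumed keys:
    'title', 'tempo_bpm', and 'interest' (a 0-1 tag-derived score).
    """
    if not queue:
        return None
    if speed_kmh > speed_limit_kmh:
        # Quieter, slower-tempo track while the driver is speeding.
        return min(queue, key=lambda s: s["tempo_bpm"])
    if cabin_noise_db < 45:
        # Conversation has lapsed: prefer a high-"interest" track.
        return max(queue, key=lambda s: s["interest"])
    # Otherwise keep the pre-planned order.
    return queue[0]
```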
Two groups of four high school aged 461 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [7] S.J. Cunningham, D. Bainbridge, A. Falconer. More males participated in the evaluation, with each trial consisting of approximately 30 minutes in which they listened to songs on a pre-prepared playlist, both collaboratively and individually selected additional songs, inserted them into the playlist, and viewed lyrics to sing along. The researchers took manual notes of the simulations, and participants engaged in focus group discussions post-simulation. While the participants found the prototype to be generally usable (though usability tweaks were identified), we identified worrying episodes in which the drivers switched focus from the wheel to the tablet. While we recognize that behavior may be different in a simulation than in real driving conditions, we also saw strong evidence from the ethnographic data that drivers—particularly young, male drivers—can prioritize song selection over road safety. Further design iterations must recognize that drivers will inevitably seize control of a car’s music system, and so should prioritize design that supports fast, one-handed interactions. of an art than a science: playlist and mix construction. Proceedings of ISMIR ’06, Vancouver, 2006. [8] S.J. Cunningham, M. Jones: “Autoethnography: a tool for practice and education,” Proceedings of the 6th New Zealand International Conference on ComputerHuman Interaction (CHINZ 2005), 1-8, 2005. [9] S.J. Cunningham, M. Jones, S. Jones: “Organizing digital music for use: an examination of personal music collections”. Proceedings of ISMIR’04, Barcelona, 447-454, 2004. [10] B.H. Dalton, D.G. Behm: “Effects of noise and music on human and task performance: A systematic review,” Occupational Ergonomics, 7:3, 143-152, 2007. [11] N. Dibben, V.J. Williamson: “An exploratory survey of in-vehicle music listening,” Psychology of Music, 35: 4, 571-589, 2007. 6. CONCLUSIONS [12] A. Flexer, D. Schnitzer, M. Gasser, G. Widmer. “Playlist generation using start and end songs”, Proceedings of ISMIR’08, 173-178, 2008. The primary contribution of this paper is understanding of social music behavior of small groups of people while on ‘road trips’, developed through a qualitative analysis of ethnographic data (participant observations and interviews). We prototyped and evaluated the more prosaic aspects of a system to support social music listening on road trips, and suggest further extensions—including sensorbased input to modify the trip playlist—for future research. [13] B. Glaser, A. Strauss: The Discovery of Grounded Theory: Strategies for Qualitative Research, Chicago, 1967. [14] A. E. Greasley, A. Lamont: “Exploring engagement with music in everyday life using experience sampling methodology,” Musicae Scientiae, 15: 45, 45-71, 2011. 7. REFERENCES [15] M. Håkansson, M. Rost, L.E. Holmquist: “Gifts from friends and strangers: a study of mobile music sharing,” Proceedings of ECSCW’07, 311-330, 2007. [1] K.P. Åkesson, A. Nilsson: “Designing Leisure Applications for the Mundane Car-Commute,” Personal and Ubiquitous Computing, 6:3, 176–187, 2002. [16] C. Hasegawa, K. Oguri: “The effects of specific musical stimuli on driver’s drowsiness,” Proceedings of the Intelligent Transportation Systems Conference (ITSC’06), 817-822, 2006. [2] A. Bassoli, J. Moore, S. Agamanolis: “tunA: Socialising Music Sharing on the Move,” In K. O'Hara and B. 
Brown (eds.), Consuming Music Together: Social and Collaborative Aspects of Music Consumption Technologies. Springer, 151-172, 2007. [17] O. Juhlin: Social Media on the Road: The Future of Car Based Computing, Springer, London, 2010. [18] M. Östergren, O. Juhlin: “Car Drivers Using Sound Pryer – Joint Music Listening in Traffic Encounters,” In K. O'Hara and B. Brown (eds.), Consuming Music Together: Social and Collaborative Aspects of Music Consumption Technologies. Springer, 173-190, 2006. [3] T. Bickford: “Earbuds Are Good for Sharing: Children’s Sociable Uses of Headphones at a Vermont Primary School,” In J. Stanyek and S. Gopinath (eds.), The Handbook of Mobile Music Studies, Oxford University Press, 2011. [19] E.B. Slawinski, J.F. McNeil: (2002) “Age, Music, and Driving Performance: Detection of External Warning Sounds in Vehicles,” Psychomusicology, 18, 123-31, 2002. [4] M. Bull: “Soundscapes of the car: a critical study of automobile habitation,” In M. Bull and L. Back, (eds.) The Auditory Culture Reader, Berg, 357–374, 2003. [5] M. Bull: “Automobility and the power of sound”, Theory, Culture & Society, 21:4/5, 243–259, 2004. [20] Urry, J. “Inhabiting the car,” The Sociological Review, 54, 17–31, 2006. [6] M. Bull: “Investigating the culture of mobile listening: from Walkman to iPod,” In K. O'Hara and B. Brown (eds.), Consuming Music Together. Springer, 131–149, 2006. [21] M.J. Walsh: “Driving to the beat of one’s own hum: Automobility and musical listening,” In N. K. Denzin (ed.) Studies in Symbolic Interaction, 35, 201-221, 2010. 462 Poster Session 3 463 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 464 15th International Society for Music Information Retrieval Conference (ISMIR 2014) A COMBINED THEMATIC AND ACOUSTIC APPROACH FOR A MUSIC RECOMMENDATION SERVICE IN TV COMMERCIALS Mohamed Morchid, Richard Dufour, Georges Linarès LIA - University of Avignon (France) {mohamed.morchid, richard.dufour, georges.linares}@univ-avignon.fr ABSTRACT ber of musics. For these reasons, the need for an automatic song recommandation system, to illustrate advertisements, becomes a critical subject for companies. Most of modern advertisements contain a song to illustrate the commercial message. The success of a product, and its economic impact, can be directly linked to this choice. Finding the most appropriate song is usually made manually. Nonetheless, a single person is not able to listen and choose the best music among millions. The need for an automatic system for this particular task becomes increasingly critical. This paper describes the LIA music recommendation system for advertisements using both textual and acoustic features. This system aims at providing a song to a given commercial video and was evaluated in the context of the MediaEval 2013 Soundtrack task [14]. The goal of this task is to predict the most suitable soundtrack from a list of candidate songs, given a TV commercial. The organizers provide a development dataset including multimedia features. The initial assumption of the proposed system is that commercials which sell the same type of product, should also share the same music rhythm. A two-fold system is proposed: find commercials with close subjects in order to determine the mean rhythm of this subset, and then extract, from the candidate songs, the music which better corresponds to this mean rhythm. In this paper, an automatic system for songs recommandation is proposed. 
The proposed approach combines both textual (web pages) and audio (acoustic) features to select, among a large number of songs, the most appropriate and relevant music knowing the commercial content. The first step of the proposed system is to represent commercials into a thematic space built from a Latent Dirichlet Allocation (LDA) [4]. This pre-processing subtask uses the related textual content of the commercial. Then, acoustic features of each song are extracted to find a set of the most relevant songs for a given commercial. An appropriate benchmark is needed to evaluate the effectiveness of the proposed recommandation system. For these reasons, the proposed system is evaluated in the context of the challenging MediaEval 2013 Soundtrack task for commercials [10]. Indeed, the MusiClef task seeks to make this process automated by taking into account both context- and content-based information about the video, the brand, and the music. The main difficulty of this task is to find the set of relevant features that best describes the most appropriate song for a video. Next section describes related work in topic space modeling for information retrieval and music tasks. Section 3 presents the proposed music recommandation system using both textual content and acoustic features related to musics from commercials. Section 4 explains in details the unsupervised Latent Dirichlet Allocation (LDA) technique, while Section 4.2 describes how the acoustic features are used to evaluate the proximity of a music to a commercial. Finally, experiments are presented in Section 5, while Section 6 gives conclusions and perspectives. 1. INTRODUCTION The success of a product or a service essentially depends of the way to present it. Thus, companies pay much attention to choose the most appropriate advertisement that will make a difference in the customer choice. The advertisers have different media possibilities, such as journal paper, radio, TV or Internet. In this context, they can exploit the audio media (TV, radio...) to attract listeners using a song related to the commercial. The choice of an appropriate song is crucial and can have a significant economic impact [5,18]. Usually, this choice is made by a human expert. Nonetheless, while millions of musics exist, a human agent could only choose a song among a limited subset. This choice could then be inappropriate, or simply not the best one, since the agent could not search into a large num- 2. RELATED WORKS Latent Dirichlet Allocation (LDA) [4] is widely used in several tasks of information retrieval such as classification or keywords extraction. However, this unsupervised method is not much considered in the music processing tasks. Next sections describe related works using LDA techniques with text corpora (Section 2.1) and in the context of music tasks (Section 2.3). c Mohamed Morchid, Richard Dufour, Georges Linarès. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Mohamed Morchid, Richard Dufour, Georges Linarès. “A Combined Thematic and Acoustic Approach for a Music Recommendation Service in TV Commercials”, 15th International Society for Music Information Retrieval Conference, 2014. 465 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2.1 Topic modeling ment d of a corpus D, a first parameter θ is drawn according to a Dirichlet law of parameter α. A second parameter φ is drawn according to the same Dirichlet law of parameter β. 
Then, to generate each word w of the document d, a latent topic z is drawn from a multinomial distribution parameterized by θ, and, given this topic z, the word is drawn from a multinomial distribution with parameters φ. The parameter θ is drawn for every document from the same prior parameter α, which ties the documents together [4]. Several methods have been proposed by Information Retrieval (IR) researchers to build topic spaces, such as Latent Semantic Analysis/Indexing (LSA/LSI) [2, 6], which uses a singular value decomposition (SVD) to reduce the dimensionality of the space. This method was improved by [11], which proposed probabilistic LSA/LSI (pLSA/pLSI). The pLSI approach models each word in a document as a sample from a mixture model whose components are multinomial random variables that can be viewed as representations of topics. The method has demonstrated its performance on various tasks, such as sentence [3] or keyword [24] extraction. In spite of its effectiveness, pLSI has two main drawbacks. The distribution of topics in pLSI is indexed by the training documents, so the number of parameters grows with the size of the training set, and the model is therefore prone to overfitting, a major issue in IR tasks such as document clustering. A tempering heuristic can be used to smooth the pLSI parameters and obtain acceptable predictive performance; nonetheless, the authors of [20] showed that overfitting can occur even when tempering is used. As a result, IR researchers proposed Latent Dirichlet Allocation (LDA) [4] to overcome these two drawbacks: the number of LDA parameters does not grow with the size of the training corpus, and LDA is far less prone to overfitting. LDA is a generative model which treats a document, seen as a bag of words [21], as a mixture of latent topics. In contrast to a multinomial mixture model, LDA associates a topic with each word occurrence in the document rather than a single topic with the whole document, so a document can switch topics from one word to the next. The word occurrences are nevertheless connected by a latent variable which controls how closely the document follows its overall topic distribution. Each latent topic is characterized by a distribution over word probabilities. pLSI and LDA have been shown to generally outperform LSI on IR tasks [12]. Moreover, LDA provides a direct estimate of the relevance of a topic given a word set or a document, such as the web pages used in the proposed system.

2.2 Gibbs sampling

Several techniques have been proposed to estimate the LDA parameters, such as Variational Methods [4], Expectation-Propagation [17] or Gibbs Sampling [8]. Gibbs Sampling is a special case of Markov-chain Monte Carlo (MCMC) [7] and yields a simple algorithm for approximate inference in high-dimensional models such as LDA [9]. It sidesteps the difficulty of directly and exactly estimating the parameters that maximize the likelihood of the whole data collection $W = \{\vec{w}_m\}_{m=1}^{M}$, defined as

$P(W \mid \vec{\alpha}, \vec{\beta}) = \prod_{m=1}^{M} P(\vec{w}_m \mid \vec{\alpha}, \vec{\beta})$,

given the Dirichlet parameters $\vec{\alpha}$ and $\vec{\beta}$. The first use of Gibbs Sampling for estimating LDA is reported in [8], and a more comprehensive description of the method can be found in [9]; the reader may refer to these papers for a deeper understanding of this sampling technique.
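To make the estimation step concrete, here is a compact collapsed Gibbs sampler for LDA on a toy corpus. It is a didactic sketch rather than the MALLET implementation used later in the paper; `docs` is assumed to be a list of documents given as lists of integer word ids, and the resampling step uses the standard collapsed conditional p(z = k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ).

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): document-topic and topic-word distributions."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))      # document-topic counts
    n_kw = np.zeros((n_topics, vocab_size))     # topic-word counts
    n_k = np.zeros(n_topics)                    # tokens per topic
    z = []                                      # topic assignment of every token
    for d, doc in enumerate(docs):              # random initialisation
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                     # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                     # resample and restore the counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + vocab_size * beta)
    return theta, phi
```

Here theta corresponds to the per-document topic proportions P(z | d) and phi to the per-topic word distributions P(w | z) of the LDA formalism.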
2.3 Topic modeling and Music

Topic modeling has already been used in music processing. In [13], the authors presented a system which learns musical keys as key-profiles; their approach considers a song as a random mixture of key-profiles. In [25], the authors described a classification method to assign a label to an unseen piece of music: LDA is used to build a topic space from music tags in order to estimate the probability of every tag belonging to each music genre, and each piece is then labeled with a genre given its tags. The purpose of the approach proposed here is instead to find a set of relevant songs for a TV commercial.

3. PROPOSED APPROACH

The goal of the proposed automatic system is to recommend a set of songs given a TV commercial. The system uses external knowledge to find these songs: a set of TV commercials, each associated with a song and a set of web pages (see [14] for more details about the MediaEval 2013 Soundtrack task). The idea behind the proposed approach is to assume that two commercials sharing the same subjects or interests also share the same kind of songs. The main issue is then to find commercials in the external dataset whose subjects are close to those of the commercials in the test set. As described in Section 2.1, a document can be represented as a set of latent topics; thus, two documents sharing the same topics can be seen as thematically close.

Figure 1. The LDA formalism: for each document, topic proportions θ are drawn from a Dirichlet prior α; each word w is generated by first drawing a topic z from θ and then drawing w from the word distribution φ (itself drawn from a Dirichlet prior β) of that topic.

Figure 1 presents the LDA formalism. In the next sections, the topic space representation, the mapping of a commercial into this representation to evaluate both V^d and V^t, the computed similarity score, and finally the soundtrack prediction process for a TV commercial are described.

Figure 2. Global architecture of the proposed system: a test commercial C^1 and the development commercials {C^d, S^d} are mapped to topic vectors; cosine similarity selects the development commercials with the highest similarity to C^1; the mean rhythm pattern of their soundtracks is compared, again by cosine similarity, to the candidate songs S^t of the test set to return the 5 nearest soundtracks.

4. TOPIC REPRESENTATION OF A TV COMMERCIAL

Let us consider a corpus D from the development set of TV commercials with a word vocabulary V = {w_1, ..., w_N} of size N. A topic representation of corpus D is then built using Latent Dirichlet Allocation (LDA) [4]. At the end of the LDA analysis, a topic space m of n topics is obtained with, for each topic z, the probability of each word w of V given z, and, for the entire model m, the probability of each topic z given the model m. Each TV commercial from both the development and test sets is mapped into the topic space (see Figure 3) to obtain a vector representation (V^d and V^t) of the web pages related to the commercial, computed as follows:

$V^d[i](C_j^d) = P(z_i \mid C_j^d)$

where $P(z_i \mid C_j^d)$ is the probability that topic z_i is generated by the web pages of the commercial C_j^d, estimated using Gibbs sampling as described in Section 2.2. In the same way, V^t is estimated with the same topic space from the web pages of the test-set commercials C_j^t (see Figure 3).
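As a minimal illustration of this mapping, assuming per-token topic assignments for a commercial's web pages are available (for instance from the Gibbs sampler of Section 2.2), V^d can be estimated as the smoothed fraction of tokens assigned to each topic. The smoothing constant is an illustrative choice, not a value taken from the paper.

```python
import numpy as np

def topic_vector(token_topics, n_topics, alpha=0.1):
    """Map one commercial's web-page tokens to its topic-space vector.
    token_topics: iterable of topic ids, one per token, e.g. the final
    Gibbs-sampling assignments for the concatenated web pages."""
    counts = np.bincount(np.asarray(list(token_topics)), minlength=n_topics).astype(float)
    v = counts + alpha            # light Dirichlet smoothing
    return v / v.sum()            # V[i] ~ P(z_i | commercial)

# Example: 8 tokens distributed among 4 topics
v_d = topic_vector([0, 0, 1, 3, 3, 3, 2, 0], n_topics=4)
```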
Figure 3. Mapping of a TV commercial into the topic space: each topic z_1, ..., z_n is a distribution P(w | z) over the vocabulary, and the web pages of a commercial are summarized by the vector V^d = (V^d[1], ..., V^d[n]).

Basically, the proposed three-step system first maps each TV commercial from the test and development sets into a topic space learnt with the LDA algorithm. A TV commercial from the test set is then linked to the TV commercials of the development set that share a set of close topics. Moreover, each commercial of the development set is associated with a song; as a result, a commercial from the test set is related to a subset of songs from the development set considered thematically close to its textual content. The second step estimates a list of candidate songs (see Figure 2) using audio features from the subset of thematically close songs identified during the first step; this subset of songs is used to estimate the rhythm pattern of the ideal song for this commercial. The last step retrieves, from all candidate songs of the test set, the song closest to the rhythm pattern estimated during the previous step.

In detail, the development set D is composed of TV commercials C^d, each with a soundtrack S^d and a vector representation V^d for the d-th commercial. In the same manner, the test set T is composed of TV commercials C^t with, for the t-th one, a vector representation V^t and a soundtrack S^t to predict:

$D = \{C^d, V^d, S^d\}_{d=1,\ldots,D}$, $\quad T = \{C^t, V^t, S^t_k\}_{k=1,\ldots,5000;\; t=1,\ldots,T}$.

A similarity score $\{\alpha_{d,t}\}_{d=1,\ldots,D;\; t=1,\ldots,T}$ is then computed for each commercial C^d of the development set given a commercial C^t of the test set.

4.1 Similarity measure

Each commercial from both the development and test sets is mapped into the topic space to produce a vector representation, respectively V^d and V^t. Then, given a TV commercial C^1 from the test set T, a subset of TV commercials from the development set D is selected according to their thematic proximity to C^1. To estimate the similarity between C^1 and the commercials of the development set, the cosine metric α is used:

$\mathrm{cosine}(V^d, V^t) = \alpha_{d,t} = \frac{\sum_{i=1}^{n} V^d[i]\, V^t[i]}{\sqrt{\sum_{i=1}^{n} V^d[i]^2}\, \sqrt{\sum_{i=1}^{n} V^t[i]^2}}$   (3)

This metric is used to extract the subset of commercials from D that are thematically close to C^1.

The beat, key and harmonic-pattern features of each soundtrack are extracted with the Ircam software available at [1]; more information about feature extraction from songs is given in [14]. As an outcome, each commercial is represented by a rhythm pattern vector of size 58 (10 song features and 48 rhythm-pattern values). From the soundtracks of the l nearest commercials in D, a mean rhythm vector $\bar{S}$ is computed as

$\bar{S} = \frac{1}{l} \sum_{d \in l} S^d$.

Finally, the cosine measure between this mean rhythm $\bar{S}$ and each candidate soundtrack of the test set T, $\mathrm{cosine}(\bar{S}, S^t)$ for $t \in T$, is used to find, among all the candidates, the 5 songs with the closest rhythm pattern.
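The following sketch strings these pieces together: cosine similarity in topic space to select the l nearest development commercials, the mean rhythm vector of their soundtracks, and a final cosine ranking of the candidate songs. The array shapes and the value of l used here are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(v_test, V_dev, S_dev, S_candidates, l=10, top_n=5):
    """v_test: topic vector of the test commercial, shape (n,)
    V_dev: topic vectors of the development commercials, shape (D, n)
    S_dev: rhythm-pattern vectors of their soundtracks, shape (D, 58)
    S_candidates: rhythm-pattern vectors of the candidate songs, shape (K, 58)
    Returns the indices of the top_n candidate songs."""
    # 1) l development commercials closest in topic space (eq. 3)
    alphas = np.array([cosine(v_test, v_d) for v_d in V_dev])
    nearest = np.argsort(alphas)[::-1][:l]
    # 2) mean rhythm pattern of their soundtracks
    s_bar = S_dev[nearest].mean(axis=0)
    # 3) rank candidate songs by rhythm similarity to the mean pattern
    scores = np.array([cosine(s_bar, s_k) for s_k in S_candidates])
    return np.argsort(scores)[::-1][:top_n]
```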
4.2 Rhythm pattern The cosine measure, presented in previous section, is also used to evaluate the similarity between a mean rhythm pattern vector S d of a song, and all the candidate songs Skt of the test set. Rhythm pattern of a song <?xml version="1.0" ?> <rhythmdescription> <media>363445_sum.wav</media> <description> <bpm_mean>99.982723</bpm_mean> <bpm_std>0.047869</bpm_std> <meter>22.000000</meter> <perc>47.023527</perc> <perc_norm>1.910985</perc_norm> <complex>29.630575</complex> <complex_norm>0.652134</complex_norm> <speed>2.660229</speed> <speed_norm>1.201633</speed_norm> <periodicity>0.900763</periodicity> <rhythmpattern>0.124231 ... 0.098873</rhythmpattern> </description> </rhythmdescription> (a) 5. EXPERIMENTS AND RESULTS Rhythm pattern vector bmp_mean bmp_std meter perc perc_norm complex complex_norm speed speed_norm peiodicity { rhythmpattern_1 rhythmpattern_2 Previous sections described the proposed automatic music recommandation system for TV commercials. This system is decomposed into three sub-processes. The first one maps the commercials into a topic space to evaluate the proximity of a commercial from the test set and all commercials from the development set. Then, the mean rhythm pattern of the thematically close commercials is computed. Finally, this rhythm pattern is computed with all ones from the test set of candidate songs to find a set of relevant musics. ... rhythmpattern_48 5.1 Experimental protocol (b) The first step of the proposed approach, detailed in previous section, maps TV commercial textual content into a topic space of size n (n = 500). This one is learnt from a LDA in a large corpus of documents. Section 4 describes the corpus D of web pages. This corpus contains 10, 724 Web pages related to brands of the commercials contained in D. This corpus is composed of 44, 229, 747 words for a vocabulary of 4, 476, 153 unique words. More details about this text corpus, and the way to collect it, is explained into [14]. The first step of the proposed approach is to map each commercial textual content into a topic space learnt from a latent Dirichlet allocation (LDA). During the experiments, the MALLET tool is used [16] to perform a topic model. The proposed system is evaluated in the MediaEval 2013 MusiClef benchmark [14]. The aim of this task is to predict, for each video of the test set, the most suitable soundtrack from 5,000 candidate songs. The dataset is split into 3 sets. The development set contains multimodal information on 392 commercials (various metadata including Youtube uploader comments, audio features, video features, web pages and text features). The test set is a set of 55 videos to which a song should be associated using the recommandation set of 5,000 soundtracks (30 seconds long excerpts). Figure 4. Rhythm pattern of a song from the development set in xml (a) and vector (b) representations. In details, each commercial from D is related with a soundtrack that is represented with a rhythm pattern vector. The organizers provide for each song contained into the MusicClef 2013 dataset: • video features (MPEG-7 Motion Activity and Scalable Color Descriptor [15]), • web pages about the respective brands and music artists, • music features: − MFCC or BLF [22], − PS209 [19], − beat, key, harmonic pattern extracted with the Ircam software [1]. In our experiments, 10 rhythm features of songs are used (speed, percussion, . . . , periodicity) as shown in Figure 4. 
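A sketch of how a 58-dimensional representation could be assembled from a rhythm-description file like the one shown in Figure 4, using the standard-library XML parser. The tag names are taken from the figure; the exact layout of the real MusiClef files is an assumption here and may need adjusting.

```python
import xml.etree.ElementTree as ET

SCALAR_TAGS = ["bpm_mean", "bpm_std", "meter", "perc", "perc_norm",
               "complex", "complex_norm", "speed", "speed_norm", "periodicity"]

def rhythm_vector(xml_path):
    """Parse a rhythm-description XML file into a 10 + 48 = 58-dim vector."""
    desc = ET.parse(xml_path).getroot().find("description")
    scalars = [float(desc.find(tag).text) for tag in SCALAR_TAGS]
    pattern = [float(x) for x in desc.find("rhythmpattern").text.split()]
    assert len(pattern) == 48, "expected a 48-bin rhythm pattern"
    return scalars + pattern
```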
These features of beat, key or harmonic pattern are 468 15th International Society for Music Information Retrieval Conference (ISMIR 2014) and songs extraction (rhythm pattern estimation of the ideal songs for a commercial from the test set). Moreover, this promising approach, combining thematic representation of the textual content of a set of web pages describing a TV commercial and acoustic features, shows the relevance of topic-based representation in automatic recommandation using external resources (development set). The choice of a relevant song to describe the idea behind a commercial, is a challenging task when the framework does not take into account relevant features related to: 5.2 Experimental metrics For each video in the test set, a ranked list of 5 candidate songs should be proposed. The song prediction evaluation is manually performed using the Amazon Mechanical Turk platform. This novel task is non-trivial in terms of “ground truth”, that is why human ratings for evaluation are used. Three scores have been computed from our system output. Let V be the full collection of test set videos, and let sr (v) be the average suitability score for the audio file suggested at rank r for the video v. Then, the evaluation measures are computed as follows: • mood, such as harmonic content, harmonic progressions and timbre, • Average suitability score of the first-ranked song: |V | 1 s1 (vi ) V • music rhythm, such as musical style, texture, spectral centroid, or tempo. i=1 • Average suitability score for the full top-5: |V | 1 1 V 5 sr (vi ) The proposed automatic music recommendation system is limited by this small number (58) of features which not describe all music aspects. For these reasons, in future works, we plan to use others features, such as the song lyrics or the audio transcription of the TV commercials, and evaluate the effectiveness of the proposed hybrid framework into other information retrieval tasks such as classification of music genre or music clustering. i=1 • Weighted average suitability score of the full top5. Here, we apply a weighted harmonic mean score instead of an arithmetic mean: |V | 5r=1 sr (vi ) 1 V i=1 5 r=1 sr (vi ) r The previously presented measures are used to study both rating and ranking aspects of the results. 7. REFERENCES 5.3 Results [1] Ircam. analyse-synthse: Software. In http://anasynth.ircam.fr/home/software., Accessed: Sept. 2013. The measures defined in the previous section are used to evaluate the effectiveness of songs selected to be associated to TV commercials from the test set. The proposed topic space-based approach is evaluated in the same way, and obtained the results detailed thereafter: [2] J.R. Bellegarda. A latent semantic analysis framework for large-span language modeling. In Fifth European Conference on Speech Communication and Technology, 1997. [3] J.R. Bellegarda. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8):1279–1296, 2000. • First rank average score: 2.16 • Top 5 average score (arithmetic mean): 2.24 • Top 5 average score (harmonic mean, taking rank into account): 2.22 [4] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003. Considering that human judges rate the predicted songs from 1 (very poor) to 4 (very well), we can consider that our system is slightly better than the mean evaluation score (2) no matter the metric considered. 
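One way to read the three evaluation measures (a plain average of the rank-1 suitability ratings, an arithmetic mean over the full top-5, and a per-video rank-discounted ratio) is sketched below. This is a hedged reconstruction, not the organizers' official scoring code; `scores[v]` is an assumed list [s_1(v), ..., s_5(v)] of human ratings for video v.

```python
def first_rank_avg(scores):
    """Average suitability of the first-ranked song over all test videos.
    scores: list of per-video lists [s_1(v), ..., s_5(v)]."""
    return sum(s[0] for s in scores) / len(scores)

def top5_avg(scores):
    """Arithmetic mean of all top-5 suitability scores."""
    return sum(sum(s[:5]) for s in scores) / (5 * len(scores))

def top5_weighted(scores):
    """Rank-weighted variant: for each video, the ratio of the plain sum
    of scores to the rank-discounted sum sum_r s_r / r, averaged over
    videos (one plausible reading of the measure described in the text)."""
    total = 0.0
    for s in scores:
        total += sum(s[:5]) / sum(sr / r for r, sr in enumerate(s[:5], start=1))
    return total / len(scores)
```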
While the system proposed in [23] is clearly different from ours, results are very similar. This shows the difficulty to build an automatic song recommendation system for TV commercials, the evaluation being also a critical point to discuss. [5] Claudia Bullerjahn. The effectiveness of music in television commercials. Food Preferences and Taste: Continuity and Change, 2:207, 1997. [6] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990. [7] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984. 6. CONCLUSIONS AND PERSPECTIVES In this paper, an automatic system to assign a soundtrack to a TV commercial has been proposed. This system combines two media: textual commercial content and audio rhythm pattern. The proposed approach obtains good results in spite of the fact that the system is automatic and unsupervised. Indeed, both subtasks are unsupervised (LDA learning and commercials mapping into the topic space) [8] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004. 469 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [9] Gregor Heinrich. Parameter estimation for text analysis. Web: http://www. arbylon. net/publications/textest. pdf, 2005. [10] Nina Hoeberichts. Music and advertising: The effect of music in television commercials on consumer attitudes. Bachelor Thesis, 2012. [11] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI ’ 99, page 21. Citeseer, 1999. [12] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, 2001. [13] Diane Hu and Lawrence K Saul. A probabilistic topic model for unsupervised learning of musical keyprofiles. In ISMIR, pages 441–446, 2009. [23] Han Su, Fang-Fei Kuo, Chu-Hsiang Chiu, Yen-Ju Chou, and Man-Kwan Shan. Mediaeval 2013: Soundtrack selection for commercials based on content correlation modeling. In MediaEval 2013, volume 1043 of CEUR Workshop Proceedings. CEUR-WS.org, 2013. [24] Y. Suzuki, F. Fukumoto, and Y. Sekiguchi. Keyword extraction using term-domain interdependence for dictation of radio news. In 17th international conference on Computational linguistics, volume 2, pages 1272– 1276. ACL, 1998. [25] Chao Zhen and Jieping Xu. Multi-modal music genre classification approach. In Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on, volume 8, pages 398–402. IEEE, 2010. [14] Cynthia C. S. Liem, Nicola Orio, Geoffroy Peeters, and Markus Scheld. MusiClef 2013: Soundtrack Selection for Commercials. In MediaEval, 2013. [15] Bangalore S Manjunath, Philippe Salembier, and Thomas Sikora. Introduction to MPEG-7: multimedia content description interface, volume 1. John Wiley & Sons, 2002. [16] Andrew Kachites McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002. [17] Thomas Minka and John Lafferty. Expectationpropagation for the generative aspect model. In Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pages 352–359. Morgan Kaufmann Publishers Inc., 2002. [18] C Whan Park and S Mark Young. 
Consumer response to television commercials: The impact of involvement and background music on brand attitude formation. Journal of Marketing Research, pages 11–24, 1986. [19] Tim Pohle, Dominik Schnitzer, Markus Schedl, Peter Knees, and Gerhard Widmer. On rhythm and general music similarity. In ISMIR, pages 525–530, 2009. [20] Alexandrin Popescul, David M Pennock, and Steve Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 437–444. Morgan Kaufmann Publishers Inc., 2001. [21] G. Salton. Automatic text processing: the transformation. Analysis and Retrieval of Information by Computer, 1989. [22] Klaus Seyerlehner, Gerhard Widmer, and Tim Pohle. Fusing block-level features for music similarity estimation. In Proc. of the 13th Int. Conference on Digital Audio Effects (DAFx-10), pages 225–232, 2010. 470 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ARE POETRY AND LYRICS ALL THAT DIFFERENT? Abhishek Singhi Daniel G. Brown University of Waterloo Cheriton School of Computer Science {asinghi,dan.brown}@uwaterloo.ca and the word. The list of relevant synonyms obtained after pruning was used to obtain the probability distribution over words. A key requirement of our study is that there exists a difference, albeit a hazy one, between poetry and lyrics. Poetry attracts a more educated and sensitive audience while lyrics are written for the masses. Poetry, unlike lyrics, is often structurally more constrained, adhering to a particular meter and style. Lyrics are often written keeping the music in mind while poetry is written against a silent background. Lyrics, unlike poetry, often repeat lines and segments, causing us to believe that lyricists tend to pick more rhymable adjectives; of course, some poetic forms also repeat lines, such as the villanelle. For twenty different concepts we compare adjectives which are more likely to be used in lyrics rather than poetry and vice versa. ABSTRACT We hypothesize that different genres of writing use different adjectives for the same concept. We test our hypothesis on lyrics, articles and poetry. We use the English Wikipedia and over 13,000 news articles from four leading newspapers for the article data set. Our lyrics data set consists of lyrics of more than 10,000 songs by 56 popular English singers, and our poetry dataset is made up of more than 20,000 poems from 60 famous poets. We find the probability distribution of synonymous adjectives in all the three different categories and use it to predict if a document is an article, lyrics or poetry given its set of adjectives. We achieve an accuracy level of 67% for lyrics, 80% for articles and 57% for poetry. Using these probability distribution we show that adjectives more likely to be used in lyrics are more rhymable than those more likely to be used in poetry, but they do not differ significantly in their semantic orientations. Furthermore we show that our algorithm is successfully able to detect poetic lyricists like Bob Dylan from non-poetic ones like Bryan Adams, as their lyrics are more often misclassified as poetry. 1. INTRODUCTION The choice of a particular word, from a set of words that can instead be used, depends on the context we use it in, and on the artistic decision of the authors. 
We believe that for a given concept, the words that are more likely to be used in lyrics will be different from the ones which are more likely to be used in articles or poems, because lyricists have different objectives typically. We test our hypothesis on adjective usage in these categories of documents. We use adjectives, as a majority have synonyms that can be used depending on context. To our surprise, just the adjective usage is sufficient to separate documents quite effectively. Finding the synonyms of a word is still an open problem. We used three different sources to obtain synonyms for a word – the WordNet, Wikipedia and an online thesaurus. We prune synonyms, obtained from the three sources, which fall below an experimentally determined threshold for the semantic distance between the synonyms Figure 1. The bold-faced words are the adjectives our algorithm takes into account while classifying a document, which in this case in a snippet of lyrics by the Backstreet Boys. We use a bag of words model for the adjectives, where we do not care about their relative positions in the text, but only their frequencies. Finding synonyms of a given word is a vital step in our approach and since it is still considered a difficult task improvement in synonyms finding approaches will lead to an improvement in our classification accuracy. Our algorithm has a linear run time as it scans through the document once to come up with the prediction, giving us an accuracy of 68% overall. Lyricists with a relatively high percentage of lyrics misclassified as poetry tend to be recognized for their poetic style, such as Bob Dylan and Annie Lennox. 2. RELATED WORK © Abhishek Singhi, Daniel G. Brown. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Abhishek Singhi, Daniel G. Brown. “Are Poetry And Lyrics All That Different?”, 15th International Society for Music Information Retrieval Conference, 2014. We do not know of any work on the classification of documents based on the adjective usage into lyrics, poetry or articles nor are we aware of any computational 471 15th International Society for Music Information Retrieval Conference (ISMIR 2014) edited by experts. Both of these are extremely rich sources of data on many topics. To remove the influence of the presence of articles about poems and lyrics in Wikipedia we set the pruning threshold frequency of adjectives to a high value, and we ensured that the articles were not about poetry or music. work which discerns poetic from non-poetic lyricists. Previous works have used adjectives for various purposes like sentiment analysis [1]. Furthermore in Music Information Retrieval, work on poetry has focused on poetry translator, automatic poetry generation. Chesley et al. [1] classifies blog posts according to sentiment using verb classes and adjective polarity, achieving accuracy levels of 72.4% on objective posts, 84.2% for positive posts, and 80.3% for negative posts. Entwisle et al. [2] analyzes the free verbal productions of ninth-grade males and females and conclude that girls use more adjectives than boys but fail to reveal differential use of qualifiers by social class. Smith et al. [13] use of tf-idf weighting to find typical phrases and rhyme pairs in song lyrics and conclude that the typical number one hits, on average, are more clichéd. Nichols et al. 
[14] studies the relationship between lyrics and melody on a large symbolic database of popular music and conclude that songwriters tend to align salient notes with salient lyrics. There is some existing work on automatic generation of synonyms. Zhou et al. [3] extracts synonyms using three sources - a monolingual dictionary, a bilingual corpus and a monolingual corpus, and use a weighted ensemble to combine the synonyms produced from the three sources. They get improved results when compared to the manually built thesauri, WordNet and Roget. Christian et al. [4] describe an approach for using Wikipedia to automatically build a dictionary of named entities and their synonyms. They were able to extract a large amount of entities with a high precision, and the synonyms found were mostly relevant, but in some cases the number of synonyms was very high. Niemi et al. [5] add new synonyms to the existing synsets of the Finnish WordNet using Wikipedia’s links between the articles of the same topic in Finnish and English. As to computational poetry, Jiang et al. [6] use statistical machine translation to generate Chinese couplets while Genzel et al. [7] use statistical machine translation to translate poetry keeping the rhyme and meter constraints. 3.2 Lyrics We took more than 10,000 lyrics from 56 very popular English singers. Both the authors listen to English music and hence it was easy to come up with a list which included singers from many popular genres with diverse backgrounds. We focus on English-language popular music in our study, because it is the closest to “universally” popular music, due to the strength of the music industry in English-speaking countries. We do not know if our work would generalize to non-English Language songs. Our data set includes lyrics from the US, Canada, UK and Ireland. 3.3 Poetry We took more than 20,000 poems from more than 60 famous poets, like Robert Frost, William Blake and John Keats, over the last three hundred years. We selected the top poets from Poem Hunter [19]. We selected a wide time range for the poets, as many of the most famous English poets are from that time period. None of the poetry selected were translations from another language. Most of the poets in our dataset are poets from North America and Europe. We believe that our training data, is representative of the mean, as a majority of poetry and poetic style are inspired by the work of these few extremely famous poets. 3.4 Test Data For the purpose of document classification we took 100 from each category, ensuring that they were not present in the training set. While collecting the test data we ensured the diversity, the lyrics and poets came from different genres and artists and the articles covered different topics and were selected from different newspapers. To determine poetic lyricists from non-poetic ones we took eight of each of the two types of lyricists, none of whom were present in our lyrics data sets. We ensured that the poetic lyricists we selected were indeed poetic by looking up popular news articles or ensuring that they were poet along with being lyricists. Our list for poetic lyricists included Bob Dylan and Annie Lennox etc. while the non-poetic ones included Bryan Adams and Michael Jackson. 3. DATA SET The training set consists of articles, lyrics and poetry and is used to calculate the probability distribution of adjectives in the three different types of documents. 
We use these probability distributions in our document classification algorithms, to identify poetic from non-poetic lyricists and to determine adjectives more likely to be used in lyrics rather than poetry and vice versa. 3.1 Articles 4. METHOD We take the English Wikipedia and over 13,000 news articles from four major newspapers as our article data set. Wikipedia, an enormous and freely available data set is These are the main steps in our method: 472 15th International Society for Music Information Retrieval Conference (ISMIR 2014) above, and the document(s) to be classified, calculates the score of the document being an article, lyrics or poetry, and labels it with the class with the highest score. The algorithm takes a single pass along the whole document and identifies adjectives using WordNet. For each word in the document we check its presence in our word list. If found, we add the probability to the score, with a special penalty of -1 for adjectives never found in the training set and a special bonus of +1 for words with probability 1. The penalty and boosting values used in the algorithm were determined experimentally. Surprisingly, this simple approach gives us much better accuracy rates than Naïve Bayes, which we thought would be a good option since it is widely used in classification tasks like spam filtering. We have decent accuracy rates with this simple, naïve algorithm; one future task could be to come up with a better classifier. 1) Finding the synonyms of all the words in the training data set. 2) Finding the probability distribution of word for all the three types of documents. 3) The document classification algorithm. 4.1 Extracting Synonyms We extract the synonyms for a term from three sources: WordNet, Wikipedia and an online thesaurus. WordNet is a large lexical database of English where words are grouped into sets of cognitive synonyms (synsets) together based on their meanings. WordNet interlinks not just word forms but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. The synonyms returned by WordNet need some pruning. We use Wikipedia redirects to discover terms that are mostly synonymous. It returns a large number of words, which might not be synonyms, so we need to prune the results. This method has been widely used for obtaining the synonyms of named entities e.g. [4], but we get decent results for adjectives too. We also used an online Thesaurus that lists words grouped together according to similarity of meaning. Though it gives very accurate synonyms, pruning is necessary to get better results. We prune synonyms obtained from the three sources, which fall below an experimentally determined threshold for the semantic distance between the synonyms and the word. To calculate the semantic similarity distance between words we use the method described by Pirro et al. [8]. Extracting synonyms for a given word is an open problem and with improvement in this area our algorithm will achieve better classification accuracy levels. 5. RESULTS First, we look at the classification accuracies between lyrics, articles and poems obtained by our classifier. We show that the adjectives used in lyrics are much more rhymable than the ones used in poems but they do not differ significantly in their semantic orientations. Furthermore, our algorithm is able to identify poetic lyricists from non-poetic ones using the word distributions, calculated in earlier section. 
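A minimal sketch of the scoring rule just described, under assumptions about the ambiguous details (whether the +1 bonus adds to or replaces the probability, and whether "never found in training" is judged globally or per class, both treated per class here); the adjective extraction with WordNet is assumed to have been done already.

```python
def classify(adjectives, class_probs):
    """adjectives: list of adjectives found in the document.
    class_probs: dict mapping a class name ('lyrics', 'article', 'poetry')
    to a dict {adjective: P(adjective | its synonym group, class)}.
    Returns the class with the highest score."""
    scores = {}
    for label, probs in class_probs.items():
        score = 0.0
        for adj in adjectives:
            p = probs.get(adj)
            if p is None:
                score -= 1.0        # penalty for an adjective unseen in training
            elif p == 1.0:
                score += p + 1.0    # stated +1 bonus, read here as added on top
            else:
                score += p
        scores[label] = score
    return max(scores, key=scores.get)
```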
We also compare adjectives for a given concepts which are more likely to be used in lyrics rather than poetry and vice versa. 5.1 Document Classification Our test set consists of the text of 100 each of our three categories. Using our algorithm with the adjective distributions we get an accuracy of 67% for lyrics, 80% for articles and 57% for poems. The confusion matrix, Table 1 we find the best accuracy for articles. This might be because of the enormous size of the article training set which consisted of all English Wikipedia articles. A slightly more number of articles get misclassified as lyrics than poetry. Surprisingly, a large number of misclassified poems get classified as articles rather than poetry, but most misclassified lyrics get classified as poems. 4.2 Probability Distribution We believe that the choice of an adjective to express a given concept depends on the genre of writing: adjectives used in lyrics will be different from ones used in poems or in articles. We calculate the probability of a specific adjective for each of the three document types. First, WordNet is used to identify the adjectives in our training sets. For each adjective we compute the frequency of that were in the training set and the frequency of it and its synonyms; the ratio of these is the frequency with which that adjective represents its synonym group in that class of writing. We exclude adjectives that occur infrequently (fewer than 5 times in our lyrics/poetry set or 50 in articles). The enormous size of the Wikipedia justifies the high threshold value. 5.2 Adjective Usage in Lyrics versus Poems Poetry is written against a silent background while lyrics are often written keeping the melody, rhythm, instrumentation, the quality of the singer’s voice and other qualities of the recording in mind. Furthermore, unlike most poetry, lyrics include repeated lines. This led us to believe the adjectives which were more likely to be used in lyrics rather than poetry would be more rhymable. We counted the number of words an adjective in our lyrics and poetry list rhymes with from the website rhymezone.com. The values are tabulated in Table 2. 4.3 Document classification algorithm We use a simple linear time algorithm which takes as input the probability distributions for adjectives, calculated 473 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Our algorithm consistently misclassifies a large fraction of the lyrics of such poetic lyricists as poetry while the percentage of misclassified lyrics as poetry for the non-poetic lyricists is significantly much less. These values for poetic and non-poetic lyricists are tabulated in table 4 and table 5 respectively. Poetic Lyricists % of lyrics misclassified as poetry Bob Dylan 42% Ed Sheeran 50% Ani Di Franco 29% Annie Lennox 32% Bill Callahan 34% Bruce Springsteen 29% Stephen Sondheim 40% Morrissey 29% Average misclassification 36% rate From the values in Table 2, we can clearly see that the adjectives which are more likely to be used in lyrics to be much more rhymable than the adjectives which are more likely to be used in poetry. Actual Lyrics Articles Poems Lyrics 67 11 10 Predicted Articles 11 80 33 Poems 22 6 57 Table 1. The confusion matrix for document classification. Many lyrics are categorized as poems, and many poems as articles. Mean Median 25th percentile 75th percentile Lyrics 33.2 11 2 38 Poetry 22.9 5 0 24 Table 4. Percentage of misclassified lyrics as poetry for poetic lyricists. Table 2. 
Statistical values for the number of words an adjective rhymes with. Mean Median 25th percentile 75th percentile Lyrics -.05 0.0 -0.27 0.13 Non-Poetic Lyricists Poetry -.053 0.0 -0.27 0.13 Bryan Adams Michael Jackson Drake Backstreet Boys Radiohead Stevie Wonder Led Zeppelin Kesha Average misclassification rate Table 3. Statistical values for the semantic orientation of adjectives used in lyrics and poetry. We were also interested in finding if the adjectives used in lyrics and poetry differed significantly in their semantic orientations. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity. We calculated the semantic orientations, which take a value between -1 and +1, using SentiWordNet, of all the adjectives in the lyrics and poetry list, the values are in Table 3. They show no difference between adjectives in poetry and those in lyrics. % of lyrics misclassified as poetry 14% 22% 7% 23% 26% 17% 8% 18% 17% Table 5. Percentage of misclassified lyrics as poetry for non-poetic lyricists. From the values in table 4 and 5 we see that there is a clear separation between the misclassification rate between poetic and non-poetic lyricists. The maximum misclassification rate for the non-poetic lyricists i.e. 26% is less than the minimum mis-classification rate for poetic lyricists i.e. 29%. Furthermore the difference in average misclassification rate between the two groups of lyricists is 19%. Hence our simple algorithm can accurately identify poetic lyricists from non-poetic ones, based only on adjective usage. 5.3 Poetic vs non-Poetic Lyricists There are lyricists like Bob Dylan [15], Ani DiFranco [16], and Stephen Sondheim [17,18], whose lyrics are considered to be poetic, or indeed, who are published poets in some cases. The lyrics of such poetic lyricists possibly could be structurally more constrained than a majority of the lyrics or might adhere to a particular meter and style. While selecting the poetic lyricists we ensured that popular articles supported our claim or by going to their Wikipedia page and ensuring that they were poets along with being lyricists and hence the influence of their poetry on lyrics. 5.4 Concept representation in Lyrics vs Poetry We compare adjective uses for common concepts. To represent physical beauty we are more likely to use words like “sexy” and “hot” in lyrics but “gorgeous” and “handsome” in poetry. For 20 of these, results are tabulated in Table 6. The difference could possibly be because unlike lyrics, which are written for the masses, poetry is generally written for people who are interested in literature. It 474 15th International Society for Music Information Retrieval Conference (ISMIR 2014) has been shown that the typical number one hits, on average, are more clichéd [13]. Lyrics Poetry proud, arrogant, cocky haughty, imperious sexy, hot, beautiful, cute gorgeous, handsome merry, ecstatic, elated happy, blissful, joyous heartbroken, brokenhearted sad, sorrowful, dismal real genuine smart wise, intelligent bad, shady lousy, immoral, dishonest mad, outrageous wrathful, furious royal noble, aristocratic, regal pissed angry, bitter greedy selfish cheesy poor, worthless lethal, dangerous, fatal mortal, harmful, destructive afraid, nervous frightened, cowardly, timid jealous envious, covetous lax, sloppy lenient, indifferent weak, fragile feeble, powerless black ebon naïve, ignorant innocent, guileless, callow corny dull, stale 7. 
CONCLUSION Our key finding is that the choice of synonym for even a small number of adjectives are sufficient to reliably identify genre of documents. In accordance with our hypothesis, we show that there exist differences in the kind of adjectives used in different genres of writing. We calculate the probability distribution of adjectives over the three kinds of documents and using this distribution and a simple algorithm we are able to distinguish among lyrics, poetry and article with an accuracy of 67%, 57% and 80% respectively. Adjectives likely to be used in lyrics are more rhymable than the ones used in poetry. This might be because lyrics are written keeping in mind the melody, rhythm, instrumentation, quality of the singer’s voice and other qualities of the recording while poetry is without such concerns. There is no significant difference in the semantic orientation of adjectives which are more likely to be used in lyrics and those which are more likely to be used in poetry. Using the probability distributions, obtained from training data, we present adjectives more likely to be used in lyrics rather than poetry and vice versa for twenty common concepts. Using the probability distributions and our algorithm we show that we can discern poetic lyricists from nonpoetic ones. Our algorithm consistently misclassifies a majority of the lyrics of such poetic lyricists as poetry while the percentage of misclassified lyrics as poetry for the non-poetic lyricists is significantly much less. Calculating the probability distribution of adjectives over the various document types is a vital step in our method which in turn depends on the synonyms extracted for an adjective. Synonym extraction is still an open problem and with improvements in it our algorithm will give better accuracy levels. We extract synonyms from three different sources – Wikipeia, WordNet and an online Thesaurus, and prune the results based on the semantic similarity between the adjectives and the obtained synonyms. We use a simple naïve algorithm, which gives us better result than Naïve Bayes. An extension to the work can be coming up with an improved version of the algorithm with better accuracy levels. Future works can use a larger dataset for lyrics and poetry (we have an enormous dataset for articles) to come up with better probability distribution for the two document types or to identify parts of speech that effectively separates genres of writing. Our work here can be extended to different genres of writings like prose, fiction etc. to analyze the adjective usage in those writings. It would be interesting to do similar work for verbs and discern if different words, representing the same action, are used in different genres of writings. Table 6. For twenty different concepts, we compare adjectives which are more likely to be used in lyrics rather than poetry and vice versa. 6. APPLICATIONS The algorithm developed has many practical applications in Music Information Retrieval (MIR). They could be used for automatic poetry/lyrics generation to identify adjectives more likely to be used in a particular type of document. As we have shown we can analyze documents, analyze how lyrical, poetic or article-like a document is. For lyricists or poets we can come up with alternate better adjectives to make a document fit its genre better. 
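To make the pipeline of Sections 4.2 and 4.3 concrete, the sketch below shows one way the per-class adjective distributions and a simple linear-time genre scorer could be implemented. It is a minimal illustration under stated assumptions, not the authors' code: the data structures (adj_counts, synonym_sets), the log-probability scoring rule, and the extract_adjectives helper mentioned in the comment are hypothetical choices made for exposition.

```python
# Hypothetical sketch of the adjective-distribution genre scorer (Sections 4.2-4.3).
import math

def adjective_distribution(adj_counts, synonym_sets, min_count=5):
    """adj_counts: dict of adjective frequencies in one training corpus
    (lyrics, poetry, or articles); synonym_sets: dict mapping each adjective
    to its synonym group (e.g. from WordNet).  Returns, per adjective, the
    share of its synonym group that this adjective accounts for."""
    dist = {}
    for adj, count in adj_counts.items():
        if count < min_count:                  # prune rare adjectives (5 / 50 thresholds in the text)
            continue
        group = set(synonym_sets.get(adj, set())) | {adj}
        group_count = sum(adj_counts.get(a, 0) for a in group)
        if group_count:
            dist[adj] = count / group_count
    return dist

def score_document(doc_adjectives, class_dists, eps=1e-6):
    """Linear-time scoring: accumulate log-probabilities of the document's
    adjectives under each class distribution; return the best-scoring class."""
    scores = {}
    for label, dist in class_dists.items():
        scores[label] = sum(math.log(dist.get(a, eps)) for a in doc_adjectives)
    return max(scores, key=scores.get)

# e.g. class_dists = {"lyrics": lyr_dist, "poetry": poe_dist, "articles": art_dist}
# label = score_document(extract_adjectives(text), class_dists)   # extract_adjectives is assumed
```

The same per-class scores could also serve the suggestion above of rating how lyrical, poetic, or article-like a draft is, by reporting the scores themselves rather than only the winning label.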
Using the word distributions we can come up with a better measure of distance between documents where the weights are assigned to a word depending on its probability of usage in a particular type of document. And, of course, our work here can be extended to different genres of writings like prose or fiction. 8. ACKNOWLEDGEMENTS Our research is supported by a grant from the Natural Sciences and Engineering Research Council of Canada to DGB. 475 15th International Society for Music Information Retrieval Conference (ISMIR 2014) sources and Evaluation (LREC ‘06), pages 417–422, 2006. 9. REFERENCES [1] P. Chesley, B. Vincent., L. Xu, and R. Srihari, “Using Verbs and Adjectives to Automatically Classify Blog Sentiment”, Training, volume 580, number 263, pages 233, 2006. [13] A.G. Smith, C. X. S. Zee and A. L. Uitdenbogerd, “In your eyes: Identifying cliché in song lyrics”, in Proceedings of the Australasian Language Technology Association Workshop, pages 88–96, 2012. [2] D.R. Entwisle and C. Garvey, “Verbal productivity and adjective usage”, Language and Speech, volume 15, number 3, pages 288-298, 1972. [14] E. Nichols, D. Morris, S. Basu, S. Christopher, “Relationships between lyrics and melody in popular music”, in Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR ’09), 2009. [3] H. Wu and M. Zhou, “Optimizing Synonym Extraction Using Monolingual and Bilingual Resources”, in Proceedings of the 2nd International Workshop on Paraphrasing, volume 16, pages 72-79, 2003. [15] K. Negus, Bob Dylan, Equinox London, 2008. [16] A. DiFranco, Verses, Seven Stories, 2007. [4] C. Bohn and K. Norvag, “Extracting Named Entities and Synonyms from Wikipedia”, in Proceedings of 24th IEEE International Conference on Advanced Information Networking and Applications, (AINA ‘10), pages 1300– 1307. [17] S. Sondheim, Look, I Made a Hat! New York: Knopf, 2011. [18] S. Sondheim, Finishing the Hat, New York: Knopf, 2010. [5] J. Niemi, K. Linden and M. Hyvarinen, “Using a bilingual resource to add synonyms to a wordnet:FinnWordNet and Wikipedia as an example”, in Proceedings of the Global WordNet Association , pages 227– 231, 2012. [19] http://www.poemhunter.com. [6] L. Jiang and M. Zhou, “Generating Chinese couplets using a statistical MT approach”, in Proceedings of the 22nd International Conference on Computational Linguistics, pages 377–384, 2008. [7] D. Genzel, J. Uszkoreit and F. Och, “Poetic statistical machine translation: rhyme and meter”, in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 158–166, 2010. [8] G. Pirro and J. Euzenat, “A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness”, in Proceedings of the 9th International Semantic Web Conference (ISWC ‘10), pages 615-630, 2010. [9] G. Miller, “WordNet: A Lexical Database for English”, Communications of the ACM, volume 38, number 11, pages 39-41, 1995. [10] G. Miller and F. Christiane, WordNet: An Electronic Lexical Database, 1998. [11] S. Baccianella, A. Esuli, and F. Sebastiani, “SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining”, in Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC ‘10) , pages 2200–2204, 2010. [12] A. Esuli and F. 
Sebastiani, “SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining”, in Proceedings of the 5th Conference on Language Re- 476 15th International Society for Music Information Retrieval Conference (ISMIR 2014) SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING DEEP RECURRENT NEURAL NETWORKS Po-Sen Huang† , Minje Kim‡ , Mark Hasegawa-Johnson† , Paris Smaragdis†‡§ † Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA ‡ Department of Computer Science, University of Illinois at Urbana-Champaign, USA § Adobe Research, USA {huang146, minje, jhasegaw, paris}@illinois.edu ABSTRACT Joint Discriminative Training Monaural source separation is important for many real world applications. It is challenging since only single channel information is available. In this paper, we explore using deep recurrent neural networks for singing voice separation from monaural recordings in a supervised setting. Deep recurrent neural networks with different temporal connections are explored. We propose jointly optimizing the networks for multiple source signals by including the separation step as a nonlinear operation in the last layer. Different discriminative training objectives are further explored to enhance the source to interference ratio. Our proposed system achieves the state-of-the-art performance, 2.30∼2.48 dB GNSDR gain and 4.32∼5.42 dB GSIR gain compared to previous models, on the MIR-1K dataset. 1. INTRODUCTION Monaural source separation is important for several realworld applications. For example, the accuracy of automatic speech recognition (ASR) can be improved by separating noise from speech signals [10]. The accuracy of chord recognition and pitch estimation can be improved by separating singing voice from music [7]. However, current state-of-the-art results are still far behind human capability. The problem of monaural source separation is even more challenging since only single channel information is available. In this paper, we focus on singing voice separation from monaural recordings. Recently, several approaches have been proposed to utilize the assumption of the low rank and sparsity of the music and speech signals, respectively [7, 13, 16, 17]. However, this strong assumption may not always be true. For example, the drum sounds may lie in the sparse subspace instead of being low rank. In addition, all these models can be viewed as linear transformations in the spectral domain. Mixture Signal STFT Magnitude Spectra Phase Spectra Evaluation ISTFT DNN/DRNN Time Frequency Masking Discriminative Training Estimated Magnitude Spectra Figure 1. Proposed framework. With the recent development of deep learning, without imposing additional constraints, we can further extend the model expressibility by using multiple nonlinear layers and learn the optimal hidden representations from data. In this paper, we explore the use of deep recurrent neural networks for singing voice separation from monaural recordings in a supervised setting. We explore different deep recurrent neural network architectures along with the joint optimization of the network and a soft masking function. Moreover, different training objectives are explored to optimize the networks. The proposed framework is shown in Figure 1. The organization of this paper is as follows: Section 2 discusses the relation to previous work. 
Section 3 introduces the proposed methods, including the deep recurrent neural networks, joint optimization of deep learning models and a soft time-frequency masking function, and different training objectives. Section 4 presents the experimental setting and results using the MIR-1K dateset. We conclude the paper in Section 5. 2. RELATION TO PREVIOUS WORK Several previous approaches utilize the constraints of low rank and sparsity of the music and speech signals, respectively, for singing voice separation tasks [7, 13, 16, 17]. Such strong assumption for the signals might not always be true. Furthermore, in the separation stage, these models can be viewed as a single-layer linear network, predicting the clean spectra via a linear transform. To further improve the expressibility of these linear models, in this paper, we use deep learning models to learn the representations from c Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis. “Singing-Voice Separation From Monaural Recordings Using Deep Recurrent Neural Networks”, 15th International Society for Music Information Retrieval Conference, 2014. 477 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ... ... ... ... L-layer sRNN ... ... ... ... ... ... L 2 1 1 time ... ... ... ... l ... 1-layer RNN L-layer DRNN L 1 time time Figure 2. Deep Recurrent Neural Networks (DRNNs) architectures: Arrows represent connection matrices. Black, white, and grey circles represent input frames, hidden states, and output frames, respectively. (Left): standard recurrent neural networks; (Middle): L intermediate layer DRNN with recurrent connection at the l-th layer. (Right): L intermediate layer DRNN with recurrent connections at all levels (called stacked RNN). data, without enforcing low rank and sparsity constraints. By exploring deep architectures, deep learning approaches are able to discover the hidden structures and features at different levels of abstraction from data [5]. Deep learning methods have been applied to a variety of applications and yielded many state of the art results [2, 4, 8]. Recently, deep learning techniques have been applied to related tasks such as speech enhancement and ideal binary mask estimation [1, 9–11, 15]. In the ideal binary mask estimation task, Narayanan and Wang [11] and Wang and Wang [15] proposed a two-stage framework using deep neural networks. In the first stage, the authors use d neural networks to predict each output dimension separately, where d is the target feature dimension; in the second stage, a classifier (one layer perceptron or an SVM) is used for refining the prediction given the output from the first stage. However, the proposed framework is not scalable when the output dimension is high. For example, if we want to use spectra as targets, we would have 513 dimensions for a 1024-point FFT. It is less desirable to train such large number of neural networks. In addition, there are many redundancies between the neural networks in neighboring frequencies. In our approach, we propose a general framework that can jointly predict all feature dimensions at the same time using one neural network. Furthermore, since the outputs of the prediction are often smoothed out by time-frequency masking functions, we explore jointly training the masking function with the networks. Maas et al. 
proposed using a deep RNN for robust automatic speech recognition tasks [10]. Given a noisy signal x, the authors apply a DRNN to learn the clean speech y. In the source separation scenario, we found that modeling one target source in the denoising framework is suboptimal compared to the framework that models all sources. In addition, we can use the information and constraints from different prediction outputs to further perform masking and discriminative training.

3. PROPOSED METHODS

3.1 Deep Recurrent Neural Networks

To capture the contextual information among audio signals, one way is to concatenate neighboring features together as input features to the deep neural network. However, the number of parameters increases rapidly according to the input dimension. Hence, the size of the concatenating window is limited. A recurrent neural network (RNN) can be considered as a DNN with indefinitely many layers, which introduce the memory from previous time steps. The potential weakness for RNNs is that RNNs lack hierarchical processing of the input at the current time step. To further provide the hierarchical information through multiple time scales, deep recurrent neural networks (DRNNs) are explored [3, 12]. DRNNs can be explored in different schemes as shown in Figure 2. The left of Figure 2 is a standard RNN, folded out in time. The middle of Figure 2 is an L intermediate layer DRNN with temporal connection at the l-th layer. The right of Figure 2 is an L intermediate layer DRNN with full temporal connections (called stacked RNN (sRNN) in [12]).

Formally, we can define different schemes of DRNNs as follows. Suppose there is an L intermediate layer DRNN with the recurrent connection at the l-th layer, the l-th hidden activation at time t is defined as:

    h_t^l = f_h(x_t, h_{t-1}^l) = \phi_l( U^l h_{t-1}^l + W^l \phi_{l-1}( W^{l-1} \cdots \phi_1( W^1 x_t ) ) ),   (1)

and the output, y_t, can be defined as:

    y_t = f_o(h_t^l) = W^L \phi_{L-1}( W^{L-1} \cdots \phi_l( W^l h_t^l ) ),   (2)

where x_t is the input to the network at time t, \phi_l is an element-wise nonlinear function, W^l is the weight matrix for the l-th layer, and U^l is the weight matrix for the recurrent connection at the l-th layer. The output layer is a linear layer.

The stacked RNNs have multiple levels of transition functions, defined as:

    h_t^l = f_h(h_t^{l-1}, h_{t-1}^l) = \phi_l( U^l h_{t-1}^l + W^l h_t^{l-1} ),   (3)

where h_t^l is the hidden state of the l-th layer at time t. U^l and W^l are the weight matrices for the hidden activation at time t-1 and the lower level activation h_t^{l-1}, respectively. When l = 1, the hidden activation is computed using h_t^0 = x_t. Function \phi_l(\cdot) is a nonlinear function, and we empirically found that using the rectified linear unit f(x) = max(0, x) [2] performs better compared to using a sigmoid or tanh function. For a DNN, the temporal weight matrix U^l is a zero matrix.

3.2 Model Architecture

Figure 3. Proposed neural network architecture.

At time t, the training input, x_t, of the network is the concatenation of features from a mixture within a window. We use magnitude spectra as features in this paper. The output targets, y_{1t} and y_{2t}, and output predictions, ŷ_{1t} and ŷ_{2t}, of the network are the magnitude spectra of different sources.
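As a concrete reading of Eqs. (1)-(3), the following NumPy sketch steps a DRNN with a single recurrent layer, and a stacked RNN, forward by one frame. The layer sizes, random initialization, and the convention that the last matrix in W is the linear output layer are illustrative assumptions for exposition, not the authors' trained system.

```python
# Illustrative NumPy sketch of the DRNN recurrences in Eqs. (1)-(3).
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                  # the rectified linear unit used for phi_l

def drnn_step(x_t, h_prev, W, U_l, l, phi=relu):
    """One time step of a DRNN with a recurrent connection at layer l.
    W = [W^1, ..., W^L, W_out] feed-forward matrices (last one linear output),
    U_l = recurrent matrix U^l, h_prev = h_{t-1}^l."""
    a = x_t
    for k in range(l - 1):                     # layers below the recurrent layer
        a = phi(W[k] @ a)
    h_t = phi(U_l @ h_prev + W[l - 1] @ a)     # Eq. (1)
    y = h_t
    for k in range(l, len(W) - 1):             # layers above the recurrent layer
        y = phi(W[k] @ y)
    return W[-1] @ y, h_t                      # linear output, Eq. (2)

def srnn_step(x_t, h_prev, W, U, phi=relu):
    """One time step of a stacked RNN with recurrence at every layer, Eq. (3).
    Here W holds only the hidden-layer matrices; the output layer is omitted."""
    h_t, below = [], x_t
    for k in range(len(W)):
        h = phi(U[k] @ h_prev[k] + W[k] @ below)
        h_t.append(h)
        below = h
    return h_t

# Toy shapes (assumed): three 1000-unit hidden layers on 513-bin magnitude spectra.
rng = np.random.default_rng(0)
W = [rng.standard_normal((1000, 513))] + [rng.standard_normal((1000, 1000)) for _ in range(2)]
U = [rng.standard_normal((1000, 1000)) for _ in range(3)]
h1 = srnn_step(rng.standard_normal(513), [np.zeros(1000) for _ in range(3)], W, U)
```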
Since our goal is to separate one of the sources from a mixture, instead of learning only one of the sources as the target, we adapt the framework from [9] to model all of the different sources simultaneously. Figure 3 shows an example of the architecture.

Moreover, we find it useful to further smooth the source separation results with a time-frequency masking technique, for example binary time-frequency masking or soft time-frequency masking [7, 9]. The time-frequency masking function enforces the constraint that the sum of the prediction results is equal to the original mixture. Given the input features, x_t, from the mixture, we obtain the output predictions ŷ_{1t} and ŷ_{2t} through the network. The soft time-frequency mask m_t is defined as follows:

    m_t(f) = \frac{|\hat{y}_{1t}(f)|}{|\hat{y}_{1t}(f)| + |\hat{y}_{2t}(f)|},   (4)

where f ∈ {1, . . . , F} represents different frequencies. Once a time-frequency mask m_t is computed, it is applied to the magnitude spectra z_t of the mixture signals to obtain the estimated separation spectra ŝ_{1t} and ŝ_{2t}, which correspond to sources 1 and 2, as follows:

    \hat{s}_{1t}(f) = m_t(f) \, z_t(f),
    \hat{s}_{2t}(f) = (1 - m_t(f)) \, z_t(f),   (5)

where f ∈ {1, . . . , F} represents different frequencies.

The time-frequency masking function can be viewed as a layer in the neural network as well. Instead of training the network and applying the time-frequency masking to the results separately, we can jointly train the deep learning models with the time-frequency masking functions. We add an extra layer to the original output of the neural network as follows:

    \tilde{y}_{1t} = \frac{|\hat{y}_{1t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|} \odot z_t,
    \tilde{y}_{2t} = \frac{|\hat{y}_{2t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|} \odot z_t,   (6)

where the operator \odot denotes element-wise multiplication (the Hadamard product). In this way, we can integrate the constraints into the network and optimize the network with the masking function jointly. Note that although this extra layer is a deterministic layer, the network weights are optimized for the error metric between and among ỹ_{1t}, ỹ_{2t} and y_{1t}, y_{2t}, using back-propagation. To further smooth the predictions, we can apply masking functions to ỹ_{1t} and ỹ_{2t}, as in Eqs. (4) and (5), to obtain the estimated separation spectra s̃_{1t} and s̃_{2t}. The time-domain signals are reconstructed by taking the inverse short-time Fourier transform (ISTFT) of the estimated magnitude spectra together with the original mixture phase spectra.

3.3 Training Objectives

Given the output predictions ŷ_{1t} and ŷ_{2t} (or ỹ_{1t} and ỹ_{2t}) of the original sources y_{1t} and y_{2t}, we explore optimizing the neural network parameters by minimizing the squared error and the generalized Kullback-Leibler (KL) divergence criteria, as follows:

    J_{MSE} = \| \hat{y}_{1t} - y_{1t} \|_2^2 + \| \hat{y}_{2t} - y_{2t} \|_2^2   (7)

and

    J_{KL} = D(y_{1t} \| \hat{y}_{1t}) + D(y_{2t} \| \hat{y}_{2t}),   (8)

where the measure D(A||B) is defined as:

    D(A \| B) = \sum_i \left( A_i \log \frac{A_i}{B_i} - A_i + B_i \right).   (9)

D(·||·) reduces to the KL divergence when \sum_i A_i = \sum_i B_i = 1, so that A and B can be regarded as probability distributions. Furthermore, minimizing Eqs. (7) and (8) increases the similarity between the predictions and the targets.
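The joint masking layer of Eq. (6) and the two criteria of Eqs. (7)-(9) translate directly into array operations. The sketch below is an illustration rather than the authors' implementation: the epsilon guards and the librosa-based reconstruction hint in the closing comments are assumptions about tooling, not details given in the paper.

```python
# Sketch of the soft masking layer and the training criteria in Eqs. (4)-(9).
import numpy as np

def soft_mask_layer(y1_hat, y2_hat, z, eps=1e-12):
    """Joint-mask output layer, Eq. (6): scale the mixture magnitude spectrum z
    by the relative magnitude of each prediction, so the two outputs sum to z."""
    denom = np.abs(y1_hat) + np.abs(y2_hat) + eps
    return np.abs(y1_hat) / denom * z, np.abs(y2_hat) / denom * z

def mse_objective(y1_pred, y2_pred, y1, y2):
    """Squared-error criterion, Eq. (7)."""
    return np.sum((y1_pred - y1) ** 2) + np.sum((y2_pred - y2) ** 2)

def gkl(a, b, eps=1e-12):
    """Generalized KL divergence D(A||B), Eq. (9); eps avoids log(0)."""
    a, b = a + eps, b + eps
    return np.sum(a * np.log(a / b) - a + b)

def kl_objective(y1_pred, y2_pred, y1, y2):
    """Generalized-KL criterion, Eq. (8)."""
    return gkl(y1, y1_pred) + gkl(y2, y2_pred)

# Reconstruction with the mixture phase (Section 3.2) could then use, for example:
#   S_mix = librosa.stft(mixture, n_fft=1024, hop_length=512)
#   v1 = librosa.istft(y1_tilde * np.exp(1j * np.angle(S_mix)), hop_length=512)
```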
Since one of the goals in source separation problems is to have high signal to interference ratio (SIR), we explore discriminative objective functions that not only increase the similarity between the prediction and its target, but also decrease the similarity between the prediction and the targets of other sources, as follows: ||ŷ1t −y1t ||22 −γ||ŷ1t −y2t ||22 +||ŷ2t −y2t ||22 −γ||ŷ2t −y1t ||22 (10) and D(y1t ||ŷ1t )−γD(y1t ||ŷ2t )+D(y2t ||ŷ2t )−γD(y2t ||ŷ1t ), (11) where γ is a constant chosen by the performance on the development set. 4. EXPERIMENTS 4.1 Setting Our system is evaluated using the MIR-1K dataset [6]. 1 A thousand song clips are encoded with a sample rate of 16 KHz, with durations from 4 to 13 seconds. The clips were extracted from 110 Chinese karaoke songs performed by both male and female amateurs. There are manual annotations of the pitch contours, lyrics, indices and types for unvoiced frames, and the indices of the vocal and non-vocal frames. Note that each clip contains the singing voice and the background music in different channels. Only the singing voice and background music are used in our experiments. Following the evaluation framework in [13, 17], we use 175 clips sung by one male and one female singer (‘abjones’ and ‘amy’) as the training and development set. 2 The remaining 825 clips of 17 singers are used for testing. For each clip, we mixed the singing voice and the background music with equal energy (i.e. 0 dB SNR). The goal is to separate the singing voice from the background music. To quantitatively evaluate source separation results, we use Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR), and Source to Distortion Ratio (SDR) by BSS-EVAL 3.0 metrics [14]. The Normalized SDR (NSDR) is defined as: NSDR(v̂, v, x) = SDR(v̂, v) − SDR(x, v), (GNSDR), Global SIR (GSIR), and Global SAR (GSAR), which are the weighted means of the NSDRs, SIRs, SARs, respectively, over all test clips weighted by their length. Higher values of SDR, SAR, and SIR represent better separation quality. The suppression of the interfering source is reflected in SIR. The artifacts introduced by the separation process are reflected in SAR. The overall performance is reflected in SDR. For training the network, in order to increase the variety of training samples, we circularly shift (in the time domain) the singing voice signals and mix them with the background music. In the experiments, we use magnitude spectra as input features to the neural network. The spectral representation is extracted using a 1024-point short time Fourier transform (STFT) with 50% overlap. Empirically, we found that using log-mel filterbank features or log power spectrum provide worse performance. For our proposed neural networks, we optimize our models by back-propagating the gradients with respect to the training objectives. The limited-memory Broyden-FletcherGoldfarb-Shanno (L-BFGS) algorithm is used to train the models from random initialization. We set the maximum epoch to 400 and select the best model according to the development set. The sound examples and more details of this work are available online. 3 4.2 Experimental Results In this section, we compare different deep learning models from several aspects, including the effect of different input context sizes, the effect of different circular shift steps, the effect of different output formats, the effect of different deep recurrent neural network structures, and the effect of the discriminative training objectives. 
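For completeness, the discriminative criterion of Eq. (10) and the NSDR/GNSDR aggregation of Eq. (12) can be sketched as follows. The default value of gamma is illustrative (the paper tunes it on the development set), and the suggestion to obtain per-clip SDR values from a BSS-EVAL implementation such as mir_eval is an assumption about tooling rather than the authors' exact evaluation scripts.

```python
# Sketch of the discriminative squared-error objective (Eq. (10)) and the
# NSDR/GNSDR evaluation aggregation (Eq. (12) and Section 4.1).
import numpy as np

def discriminative_mse(y1_pred, y2_pred, y1, y2, gamma=0.05):
    """Eq. (10): reward similarity to the correct source and penalize
    similarity to the other source's target; gamma is a tuned hyper-parameter."""
    return (np.sum((y1_pred - y1) ** 2) - gamma * np.sum((y1_pred - y2) ** 2)
            + np.sum((y2_pred - y2) ** 2) - gamma * np.sum((y2_pred - y1) ** 2))

def nsdr(sdr_separated, sdr_mixture):
    """Eq. (12): SDR improvement over using the raw mixture as the estimate.
    SDR values could come from a BSS-EVAL implementation, e.g. mir_eval."""
    return sdr_separated - sdr_mixture

def gnsdr(nsdrs, clip_lengths):
    """Global NSDR: mean of per-clip NSDRs weighted by clip length."""
    nsdrs, w = np.asarray(nsdrs, float), np.asarray(clip_lengths, float)
    return float(np.sum(nsdrs * w) / np.sum(w))
```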
For simplicity, unless mentioned explicitly, we report the results using 3 hidden layers of 1000 hidden units neural networks with the mean squared error criterion, joint masking training, and 10K samples as the circular shift step size using features with a context window size of 3 frames. We denote the DRNN-k as the DRNN with the recurrent connection at the k-th hidden layer. We select the models based on the GNSDR results on the development set. First, we explore the case of using single frame features, and the cases of concatenating neighboring 1 and 2 frames as features (context window sizes 1, 3, and 5, respectively). Table 1 reports the results using DNNs with context window sizes 1, 3, and 5. We can observe that concatenating neighboring 1 frame provides better results compared with the other cases. Hence, we fix the context window size to be 3 in the following experiments. Table 2 shows the difference between different circular shift step sizes for deep neural networks. We explore the cases without circular shift and the circular shift with a step size of {50K, 25K, 10K} samples. We can observe that the separation performance improves when the number of training samples increases (i.e. the step size of circular (12) where v̂ is the resynthesized singing voice, v is the original clean singing voice, and x is the mixture. NSDR is for estimating the improvement of the SDR between the preprocessed mixture x and the separated singing voice v̂. We report the overall performance via Global NSDR 1 https://sites.google.com/site/unvoicedsoundseparation/mir-1k Four clips, abjones 5 08, abjones 5 09, amy 9 08, amy 9 09, are used as the development set for adjusting hyper-parameters. 2 3 480 https://sites.google.com/site/deeplearningsourceseparation/ 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Model (context window size) GNSDR GSIR GSAR DNN (1) 6.63 10.81 9.77 DNN (3) 6.93 10.99 10.15 DNN (5) 6.84 10.80 10.18 Table 1. Results with input features concatenated from different context window sizes. Model GNSDR GSIR (circular shift step size) DNN (no shift) 6.30 9.97 DNN (50,000) 6.62 10.46 DNN (25,000) 6.86 11.01 DNN (10,000) 6.93 10.99 GSAR 9.99 10.07 10.00 10.15 Table 4. The results of different architectures and different objective functions. The “MSE” denotes the mean squared error and the “KL” denotes the generalized KL divergence criterion. Table 2. Results with different circular shift step sizes. Model (num. of output GNSDR GSIR sources, joint mask) DNN (1, no) 5.64 8.87 DNN (2, no) 6.44 9.08 DNN (2, yes) 6.93 10.99 Model (objective) GNSDR GSIR GSAR DNN (MSE) 6.93 10.99 10.15 DRNN-1 (MSE) 7.11 11.74 9.93 DRNN-2 (MSE) 7.27 11.98 9.99 DRNN-3 (MSE) 7.14 11.48 10.15 sRNN (MSE) 7.09 11.72 9.88 DNN (KL) 7.06 11.34 10.07 DRNN-1 (KL) 7.09 11.48 10.05 DRNN-2 (KL) 7.27 11.35 10.47 DRNN-3 (KL) 7.10 11.14 10.34 sRNN (KL) 7.16 11.50 10.11 Model GNSDR GSIR GSAR DNN 6.93 10.99 10.15 DRNN-1 7.11 11.74 9.93 DRNN-2 7.27 11.98 9.99 DRNN-3 7.14 11.48 10.15 sRNN 7.09 11.72 9.88 DNN + discrim 7.09 12.11 9.67 DRNN-1 + discrim 7.21 12.76 9.56 DRNN-2 + discrim 7.45 13.08 9.68 DRNN-3 + discrim 7.09 11.69 10.00 sRNN + discrim 7.15 12.79 9.39 GSAR 9.73 11.26 10.15 Table 3. Deep neural network output layer comparison using single source as a target and using two sources as targets (with and without joint mask training). In the “joint mask” training, the network training objective is computed after time-frequency masking. shift decreases). 
Since the improvement is relatively small when we further increase the number of training samples, we fix the circular shift size to be 10K samples. Table 3 presents the results with different output layer formats. We compare using single source as a target (row 1) and using two sources as targets in the output layer (row 2 and row 3). We observe that modeling two sources simultaneously provides better performance. Comparing row 2 and row 3 in Table 3, we observe that using the joint mask training further improves the results. Table 4 presents the results of different deep recurrent neural network architectures (DNN, DRNN with different recurrent connections, and sRNN) and the results of different objective functions. We can observe that the models with the generalized KL divergence provide higher GSARs, but lower GSIRs, compared to the models with the mean squared error objective. Both objective functions provide similar GNSDRs. For different network architectures, we can observe that DRNN with recurrent connection at the second hidden layer provides the best results. In addition, all the DRNN models achieve better results compared to DNN models by utilizing temporal information. Table 5 presents the results of different deep recurrent neural network architectures (DNN, DRNN with different recurrent connections, and sRNN) with and without discriminative training. We can observe that discriminative training improves GSIR, but decreases GSAR. Overall, GNSDR is slightly improved. 481 Table 5. The comparison for the effect of discriminative training using different architectures. The “discrim” denotes the models with discriminative training. Finally, we compare our best results with other previous work under the same setting. Table 6 shows the results with unsupervised and supervised settings. Our proposed models achieve 2.30∼2.48 dB GNSDR gain, 4.32∼5.42 dB GSIR gain with similar GSAR performance, compared with the RNMF model [13]. An example of the separation results is shown in Figure 4. 5. CONCLUSION AND FUTURE WORK In this paper, we explore using deep learning models for singing voice separation from monaural recordings. Specifically, we explore different deep learning architectures, including deep neural networks and deep recurrent neural networks. We further enhance the results by jointly optimizing a soft mask function with the networks and exploring the discriminative training criteria. Overall, our proposed models achieve 2.30∼2.48 dB GNSDR gain and 4.32∼5.42 dB GSIR gain, compared to the previous proposed methods, while maintaining similar GSARs. Our proposed models can also be applied to many other applications such as main melody extraction. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) (a) Mixutre (b) Clean vocal (c) Recovered vocal (d) Clean music (e) Recovered music Figure 4. (a) The mixture (singing voice and music accompaniment) magnitude spectrogram (in log scale) for the clip Ani 1 01 in MIR-1K; (b) (d) The groundtruth spectrograms for the two sources; (c) (e) The separation results from our proposed model (DRNN-2 + discrim). Unsupervised Model GNSDR GSIR GSAR RPCA [7] 3.15 4.43 11.09 RPCAh [16] 3.25 4.52 11.10 RPCAh + FASST [16] 3.84 6.22 9.19 Supervised Model GNSDR GSIR GSAR MLRR [17] 3.85 5.63 10.70 RNMF [13] 4.97 7.66 10.03 DRNN-2 7.27 11.98 9.99 DRNN-2 + discrim 7.45 13.08 9.68 [6] C.-L. Hsu and J.-S.R. Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. 
IEEE Transactions on Audio, Speech, and Language Processing, 18(2):310 –319, Feb. 2010. [7] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. HasegawaJohnson. Singing-voice separation from monaural recordings using robust principal component analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60, 2012. [8] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In ACM International Conference on Information and Knowledge Management (CIKM), 2013. Table 6. Comparison between our models and previous proposed approaches. The “discrim” denotes the models with discriminative training. 6. ACKNOWLEDGEMENT [9] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Deep learning for monaural speech separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014. [10] A. L. Maas, Q. V Le, T. M O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng. Recurrent neural networks for noise reduction in robust ASR. In INTERSPEECH, 2012. We thank the authors in [13] for providing their trained [11] A. Narayanan and D. Wang. Ideal ratio mask estimation using model for comparison. This research was supported by deep neural networks for robust speech recognition. In ProU.S. ARL and ARO under grant number W911NF-09-1ceedings of the IEEE International Conference on Acoustics, 0383. This work used the Extreme Science and EngineerSpeech, and Signal Processing. IEEE, 2013. ing Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575. [12] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. In International Conference on Learning Representations, 2014. 7. REFERENCES [1] N. Boulanger-Lewandowski, G. Mysore, and M. Hoffman. Exploiting long-term temporal dependencies in NMF using recurrent neural networks with application to source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014. [2] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), 2011. [3] M. Hermans and B. Schrauwen. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems, pages 190–198, 2013. [4] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97, Nov. 2012. [13] P. Sprechmann, A. Bronstein, and G. Sapiro. Real-time online singing voice separation from monaural recordings using robust low-rank modeling. In Proceedings of the 13th International Society for Music Information Retrieval Conference, 2012. [14] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. Audio, Speech, and Language Processing, IEEE Transactions on, 14(4):1462 –1469, July 2006. [15] Y. Wang and D. Wang. Towards scaling up classificationbased speech separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(7):1381–1390, 2013. [16] Y.-H. Yang. On sparse and low-rank matrix decomposition for singing voice separation. In ACM Multimedia, 2012. [17] Y.-H. Yang. 
Low-rank representation of both singing voice and music accompaniment via learned dictionaries. In Proceedings of the 14th International Society for Music Information Retrieval Conference, November 4-8 2013. [5] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504 – 507, 2006. 482 15th International Society for Music Information Retrieval Conference (ISMIR 2014) IMPACT OF LISTENING BEHAVIOR ON MUSIC RECOMMENDATION Katayoun Farrahi Goldsmiths, University of London London, UK [email protected] Markus Schedl, Andreu Vall, David Hauger, Marko Tkalčič Johannes Kepler University Linz, Austria [email protected] ABSTRACT The next generation of music recommendation systems will be increasingly intelligent and likely take into account user behavior for more personalized recommendations. In this work we consider user behavior when making recommendations with features extracted from a user’s history of listening events. We investigate the impact of listener’s behavior by considering features such as play counts, “mainstreaminess”, and diversity in music taste on the performance of various music recommendation approaches. The underlying dataset has been collected by crawling social media (specifically Twitter) for listening events. Each user’s listening behavior is characterized into a three dimensional feature space consisting of play count, “mainstreaminess” (i.e. the degree to which the observed user listens to currently popular artists), and diversity (i.e. the diversity of genres the observed user listens to). Drawing subsets of the 28,000 users in our dataset, according to these three dimensions, we evaluate whether these dimensions influence figures of merit of various music recommendation approaches, in particular, collaborative filtering (CF) and CF enhanced by cultural information such as users located in the same city or country. 1. INTRODUCTION Early attempts in collaborative filtering (CF) recommender systems for music content have generally treated all users as equivalent in the algorithm [1]. The predicted score (i.e. the likelihood that the observed user would like the observed music piece) was a weighted average of the K nearest neighbors in a given similarity space [8]. The only way the users were treated differently was the weight, which reflected the similarity between users. However, users’ behavior in the consumption of music (and other multimedia material in general) has more dimensions than just ratings. Recently, there has been an increase of research in music consumption behavior and recommender systems that draw inspiration from psychology research on personality. Personality accounts for the individual difference in users in their behavioral styles [9]. Studies showed that personality affects rating behavior [6], music genre preferences [11] and taste diversity both in music [11] and other domains (e.g. movies in [2]). The aforementioned work inspired us to investigate how user features intuitively derived from personality traits affect the performance of a CF recommender system in the music domain. We chose three user features that are arguably proxies of various personality traits for user clustering and fine-tuning of the CF recommender system. The chosen features are play counts, mainstreaminess and diversity. Play count is a measure of how often the observed user engages in music listening (intuitively related to extraversion). 
Mainstreaminess is a measure that describes to what degree the observed user prefers currently popular songs or artists over non-popular (and is intuitively related to openness and agreeableness). The diversity feature is a measure of how diverse the observed user’s spectrum of listened music is (intuitively related to openness). In this paper, we consider the music listening behavior of a set of 28,000 users, obtained by crawling and analyzing microblogs. By characterizing users across a three dimensional space of play count, mainstreaminess, and diversity, we group users and evaluate various recommendation algorithms across these behavioral features. The goal is to determine whether or not the evaluated behavioral features influence the recommendation algorithms, and if so which directions are most promising. Overall, we find that recommending with collaborative filtering enhanced by continent and country information generally performs best. We also find that recommendations for users with large play counts, higher diversity and mainstreaminess values are better. 2. RELATED WORK The presented work stands at the crossroads of personalityinspired user features and recommender systems based on collaborative filtering. Among various models of personality, the Five-factor model (FFM) is the most widely used and is composed of the following traits: openness, conscientiousness, extraversion, agreeableness and neuroticism [9]. The personality theory inspired several works in the field of recommender systems. For example, Pu et al. [6] showed that user rating behavior is correlated with personality factors. Tkalčič et al. [13] used FFM factors to calculate similarities in a CF recommender system for images. A study by Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. c 2014 International Society for Music Information Retrieval. 483 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Rentfrow et al. [11] showed that scoring high on certain personality traits is correlated with genre preferences and other listening preferences like diversity. Chen et al. [2] argue that people who score high in openness to new experiences prefer more diverse recommendations than people who score low. The last two studies explore the relations between personality and diversity. In fact, the study of diversity in recommending items has become popular after the publishing of two popular books, The Long Tail [4] and The Filter Bubble [10]. However, most of the work was focused on the trade-off between recommending diverse and similar items (e.g. in [7]). In our work, we treat diversity not as a way of presenting music items but as a user feature, which is a novel way of addressing the usage of diversity in recommender systems. The presented work builds on collaborative filtering (CF) techniques that are well established in the recommender systems domain [1]. CF methods have been improved using context information when available [3]. Recently, [12] incorporated geospatial context to improve music recommendations on a dataset gathered through microblog crawling [5]. In the presented work, we advance this work by including personality-inspired user features. 3. 
USER BEHAVIOR MODELING 3.1 Dataset We use the “Million Musical Tweets Dataset” 1 (MMTD) dataset of music listening activities inferred from microblogs. This dataset is freely available [5], and contains approximately 1,100,000 listening events of 215,000 users listening to a total of 134,000 unique songs by 25,000 artists, collected from Twitter. The data was acquired crawling Twitter and identifying music listening events in tweets, using several databases and rule-based filters. Among others, the dataset contains information on location for each post, which enables location-aware analyses and recommendations. Location is provided both as GPS coordinates and semantic identifiers, including continent, country, state, county, and city. The MMTD contains a large number of users with only a few listening events. These users are not suitable for reliable recommendation and evaluation. Therefore, we consider a subset of users who had at least five listening events over different artists. This subset consists of 28,000 users. Basic statistics of the data used in all experiments are given in Table 1. The second column shows the total amount of the entities in the corresponding first row, whereas the right-most six columns show principal statistics based on the number of tweets. 3.2 Behavioral Features Each user is defined by a set of three behavioral features: play count, diversity, and mainstreaminess, defined next. These features are used to group users and to determine how they influence the recommendation process. 1 http://www.cp.jku.at/datasets/MMTD 484 Play count The play count of a user, P (u), is a measure of the quantity of listening events for a user u. It is computed as the total number of listening events recorded over all time for a given user. Diversity The diversity of a user, D(u), can be thought of as a measure which captures the range of listening tastes by the user. It is computed as the total number of unique genres associated with all of the artists listened to by a given user. Genre information was obtained by gathering the top tags from Last.fm for each artist in the collection. We then identified genres within these tags by matching the tags to a selection of 20 genres indicated by Allmusic.com. Mainstreaminess The mainstreaminess M (u) is a measure of how mainstream a user u is in terms of her/his listening behavior. It reflects the share of most popular artists within all the artists user u has listened to. Users that listen mostly to artists that are popular in a given time window tend to have high M (u), while users who listen more to artists that are rarely among the most popular ones tend to score low. For each time window i ∈ {1 . . . I} within the dataset (where I is the number of all time windows in the dataset) we calculated the set of the most popular artists Ai . We calculated the most popular artists in an observed time period as follows. For the given period we sorted the artists by the aggregate of the listening events they received in a decreasing order. Then, the top k artists, that cover at least 50% of all the listening events of the observed period are regarded as popular artists. For each user u in a given time window i we counted the number of play counts of popular artists Pip (u) and normalized it with all the play counts of that user in the observed time window Pia (u). 
The final value M (u) was aggregated by averaging the partial values for each time window: 1 Pip (u) M (u) = I i=1 Pia (u) I (1) In our experiments, we investigated time windows of six months and twelve months. Table 3 shows the correlation between individual user features. No significant correlation was found, except for the mainstreaminess using an interval of six months and an interval of twelve months, which is expected. 3.3 User Groups Each user is characterized by a three dimensional feature vector consisting of M (u), D(u), P (u). The distribution of users across these features are illustrated in Figures 1 and 2. In Figure 3, mainstreaminess is considered with a 6 month interval. The results illustrate the even distribution of users across these features. Therefore, for grouping users, we consider each feature individually and divide users between groups considering a threshold. For mainstreaminess, we consider the histogram of M (u) (Figure 2 for a 6 month (top) and 12 month (bottom)) in making the groups. We consider 2 different cases for grouping users. First, we divide the users into 2 groups according to the median value (referred to as M6(12)-median-G1(2)). 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Level Users Artists Tracks Continents Countries States Counties Cities Amount 27,778 21,397 108,676 7 166 872 3557 15123 Min. 5 1 1 9 1 1 1 1 1st Qu. 7 1 1 4,506 12 7 2 1 Median 10 2 1 101,400 71 40 10 5 Mean 27.69 35.95 7.08 109,900.00 4,633.00 882.00 216.20 50.86 3rd Qu. 17 9 4 142,200 555 195 41 16 Max. 89,320 11,850 2,753 374,300 151,600 148,900 191,900 148,900 Table 1. Basic dataset characteristics, where “Amount” is the number of items, and the statistics correspond to the values of the data. P-top10 P-mid5k P-bottom22k P-G1 P-G2 P-G3 D-G1 D-G2 D-G3 M6-03-G1 M6-03-G2 M6-median-G1 M6-median-G2 M12-05-G1 M12-05-G2 M12-median-G1 M12-median-G2 RB 10.28 1.33 0.64 0.45 0.65 1.08 0.64 0.73 0.93 0.50 1.34 0.35 1.25 1.35 0.36 0.36 1.34 Ccnt 11.75 1.75 0.92 0.67 1.32 2.04 0.85 0.93 1.63 0.88 2.73 0.58 2.49 2.02 0.59 0.62 2.09 Ccry 11.1 2.25 1.10 0.72 1.34 2.02 1.16 1.05 1.49 0.95 2.43 0.62 2.89 2.27 0.69 0.71 2.33 Csta 5.70 2.43 1.03 0.68 1.01 1.88 1.04 1.23 1.56 0.96 2.22 0.65 2.25 2.25 0.61 0.64 2.34 Ccty 5.70 1.46 0.77 0.44 0.69 1.30 0.87 0.84 0.93 0.64 1.49 0.48 1.47 1.50 0.41 0.43 1.57 Ccit 5.70 1.96 1.07 0.56 0.92 1.73 0.88 1.02 1.41 0.88 2.00 0.61 1.97 1.93 0.57 0.59 2.01 CF 11.22 4.47 1.85 1.13 1.71 3.51 2.22 2.04 2.49 1.76 3.36 1.35 3.14 2.90 1.30 1.41 3.10 CCcnt 10.74 4.59 1.95 1.26 1.78 3.60 2.24 2.21 2.56 1.84 3.50 1.46 3.27 3.02 1.38 1.50 3.24 CCcry 10.47 4.51 1.95 1.17 1.77 3.59 2.16 2.20 2.59 1.84 3.50 1.45 3.29 3.04 1.38 1.50 3.26 CCsta 5.89 3.56 1.56 0.78 1.32 2.90 1.59 1.68 2.03 1.43 2.81 1.04 2.67 2.47 1.01 1.10 2.66 CCcty 5.89 1.96 0.96 0.26 0.80 1.68 0.97 0.98 1.08 0.81 1.67 0.56 1.66 1.54 0.52 0.56 1.67 CCcit 5.89 2.56 1.16 0.35 0.89 2.16 0.93 1.08 1.54 1.00 2.08 0.66 2.07 1.94 0.66 0.71 2.10 Table 2. Maximum F-score for all combinations of methods and user sets. C refers to the CULT approaches, CC to CF CULT; cnt indicates continent, cry country, sta state, cty county, and cit city. The best performing recommenders for a given group are in bold. Second, we divide users into 2 groups for which borders are defined by a mainstreaminess of 0.3 and 0.5, respectively, for the 6 month case and the 12 month case (referred to as M6(12)-03(05)-G1(2)). 
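A minimal sketch of the mainstreaminess computation follows, assuming listening events are grouped per time window as (user, artist) pairs. The helper names, the data layout, and the handling of windows in which a user has no plays (skipped here, whereas Eq. (1) averages over all I windows) are simplifying assumptions rather than the authors' implementation.

```python
# Illustrative computation of the mainstreaminess feature, Eq. (1).
from collections import Counter

def popular_artists(events_in_window, coverage=0.5):
    """Top artists jointly covering at least `coverage` of the window's
    listening events (Section 3.2)."""
    counts = Counter(artist for _, artist in events_in_window)
    total, covered, popular = sum(counts.values()), 0, set()
    for artist, c in counts.most_common():
        popular.add(artist)
        covered += c
        if covered >= coverage * total:
            break
    return popular

def mainstreaminess(events_by_window, user):
    """Average, over time windows, of the user's share of plays that go to
    that window's popular artists; empty windows are skipped in this sketch."""
    shares = []
    for window_events in events_by_window:          # one list of (user, artist) per window
        popular = popular_artists(window_events)
        user_plays = [a for u, a in window_events if u == user]
        if user_plays:
            shares.append(sum(a in popular for a in user_plays) / len(user_plays))
    return sum(shares) / len(shares) if shares else 0.0
```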
These values were chosen by considering the histograms in Figure 2 and choosing values which naturally grouped users. For the diversity, we create 3 groups according to the 0.33 and 0.67 percentiles (referred to as D-G1(2,3)). For play counts, we consider 2 different groupings. The first is the same as for diversity, i.e. dividing groups according to the 0.33 and 0.67 percentiles (referred to as P-G1(2,3)). The second splits the users according to the accumulative play counts into the following groups, each of which accounts for approximately a third of all play counts: top 10 users, mid 5,000 users, bottom 22,000 users (referred to as Ptop10(mid5k,bottom22k)). D(u) M(u) (12 mo.) P(u) D(u) 0.069 0.292 M(u) (6 mo.) 0.119 0.837 0.021 P(u) 0.292 0.013 - Table 3. Feature correlations. Note due to the symmetry of these featuers, mainstreaminess is presented for 6 months on one dimension and 12 months on another. Overall, none of the features are highly correlated other than the mainstreaminess 6 and 12 month features, which is expected. 4. RECOMMENDATION MODELS In the considered music recommendation models, each user u ∈ U is represented by a list of artists listened to A(u). All approaches determine for a given seed user u a number K of most similar neighbors VK (u), and recommend the artists listened to by these VK (u), excluding the artists 485 A(u) already known by u. The recommended artists R(u) 1 for user u are computed as R(u) = v∈VK (u) A(v) \ A(u) K and VK (u) = argmaxK v∈U \{u} sim(u, v), where argmaxv denotes the K users v with highest similarities to u. In considering geographical information for user-context models, we investigate the following approaches, which differ in the way this similarity term sim(u, v) is computed. The following approaches were investigated: CULT: In the cultural approach, we select the neighbors for the seed user only according to a geographical similarity computed by means of the Jaccard index on listening distributions over semantic locations. We consider as such 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 5 2000 10 4 10 1500 # users # users 3 10 2 1000 10 500 1 10 0 0 0 10 0 0.5 1 1.5 play count 2 2.5 3 4 0.2 0.4 0.6 0.8 mainstreaminess 6mo. 1 0.2 0.4 0.6 0.8 mainstreaminess 12mo. 1 2000 x 10 5000 1500 # users # users 4000 3000 1000 2000 500 1000 0 0 5 10 diversity 15 0 0 20 Figure 2. Histogram of mainstreaminess considering a time interval of (top) 6 months and (bottom) 12 months. Figure 1. Histogram of (top) play counts (note the log scale on the y-axis) and (bottom) diversity over users. semantic categories continent, country, state, county, and city. For each user, we obtain the relevant locations by computing the relative frequencies of his listening events over all locations. To exclude the aforementioned geoentities that are unlikely to contribute to the user’s cultural circle, we retain only locations at which the user has listened to music with a frequency above his own average 2 . On the corresponding listening vectors over locations of two users u and v, we compute the Jaccard index to obtain sim(u, v). Depending on the location category user similarities are computed on, we distinguish CULT continent, CULT country, CULT state, CULT county, and CULT city. CF: We also consider a user-based collaborative filtering approach. Given the artist play counts of seed user u as a vector P (u) over all artists in the corpus, we first omit the artists that occur in the test set (i.e. 
we set to 0 the play count values for artists we want our algorithm to predict). We then normalize P (u) so that its Euclidean norm equals 1 and compute similarities sim(u, v) as the inner product between P (u) and P (v). CF CULT: This approach works by combining the CF similarity matrix with the CULT similarity matrix via pointwise multiplication, in order to incorporate both music preference and cultural information. RB: For comparison, we implemented a random baseline model that randomly picks K users and recommends 2 This way we exclude, for instance, locations where the user might have spent only a few days during vacation. 486 the artists they listened to. The similarity function can thus be considered sim(u, v) = rand [0,1]. 5. EVALUATION 5.1 Experimental Setup For experiments, we perform 10-fold cross validation on the user level. For each user, we predict 10% of the artists based on the remaining 90% used for training. We compute precision, recall, and F-measure by averaging the results over all folds per user and all users in the dataset. To compare the performance between approaches, we use a parameter N for the number of recommended artists, and adapt dynamically the number of neighbors K to be considered for the seed user u. This is necessary since we do not know how many artists should be predicted for a given user (this number varies over users and approaches). To determine a suited value of K for a given recommendation approach and a given N , we start the approach with K = 1 and iteratively increase K until the number of recommended artists equals or exceeds N . In the latter case, we sort the returned artists according to their overall popularity among the K neighbors and recommend the top N . 5.2 Results Table 2 depicts the maximum F-score (over all values of N ) for each combination of user set and method. We decided to report the maximum F-scores, because recall and 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 20 3 US−P−G3−RB US−P−G3−CULT_continent US−P−G3−CULT_country US−P−G3−CULT_state US−P−G3−CULT_county US−P−G3−CULT_city US−P−G3−CF US−P−G3−CF_CULT_continent US−P−G3−CF_CULT_country US−P−G3−CF_CULT_state US−P−G3−CF_CULT_county US−P−G3−CF_CULT_city 2.5 15 precision diversity 2 10 1.5 1 5 0.5 0 0 0 0 10 5 20 30 40 recall 50 60 70 10 play count Figure 4. Recommendation performance of investigated methods on user group P-G3. 1 mainstreaminess 6 10 0.8 In terms of play counts, we observe as the user has a larger number of events in the dataset, the performance increases significantly (P-G3 and P-top10). This can be explained by the fact that more comprehensive user models can be created for users about whom we know more, which in turn yields better recommendations. Also in terms of diversity, there are performance differences across groups given a particular recommender algorithm. Especially between the high diversity listeners D-G3 and low diversity listeners D-G1, results differ substantially. This can be explained by the fact that it is easier to find a considerable amount of like-minded users for seeds who have a diverse music taste, in technical terms, less sparse A(u) vector. When considering mainstreaminess, taking either a 6 month or 12 month interval does not appear to have a significant impact on recommendation performance. There are minor differences depending on the recommendation algorithm. 
However, in general, the groups with larger mainstreaminess (M6-03-G2, M6-med-G2, M12-med-G2) always performed much better for all approaches than the groups with smaller mainstreaminess. It hence seems easier to satisfy users with a mainstream music taste than users with diverging taste. 0.6 0.4 0.2 0 0 5 10 diversity 15 20 Figure 3. Users plot as a function of (top) D(u) vs P (u) and (bottom) M (u) (6 months) vs D(u). Note the log scale for P (u) only. These figures illustrate the widespread, even distribution of users across the feature space. precision show an inverse characteristics over N . Since the F-score equals the harmonic mean of precision and recall, it is less influenced by variations of N , nevertheless aggregate performance in a meaningful way. We further plot precision/recall-curves for several cases reported in Table 2. In Figure 4, we present the results of all of the recommendation algorithms for one group on the play counts. For this case, the CF approach with integrated continent and country information performed best, followed by the CF approach. Predominantly, these three methods outperformed all of the other approaches for the various groups, which is also apparent in Table 2. The only exception was the P-top10 case, where the CULT continent approach outperformed CF approaches. However, considering the small number of users in this subset (10), the difference of one percentage point between CULT continent and CF CULT continent is not significant. We observe the CF approach with the addition of the continent and country information are very good recommenders in general for the data we are using. Now we are interested to know how the recommendations performed across user groups and respective features. 6. CONCLUSIONS AND FUTURE WORK In this paper, we consider the role of user listening behavior related to the history of listening events in order to evaluate how this may effect music recommendation, particularly considering the direction of personalization. We investigate three user characteristics, play count, mainstreaminess, and diversity, and form groups of users along these dimensions. We evaluate several different recommendation algorithms, particularly collaborative filtering (CF), and CF augmented by location information. We find the CF and CF approaches augmented by continent and country information about the listener to outperform the other methods. We also find recommendation algorithms for users with large play counts, higher diversity, and higher 487 15th International Society for Music Information Retrieval Conference (ISMIR 2014) mainstreaminess have better performance. As part of future work, we will investigate content-based music recommendation models as well as combinations of content-based, CF-based, and location-based models. Additional characteristics of the user, such as age, gender, or musical education, will be addressed, too. 3 US−P−G1−RB US−P−G1−CF US−P−G1−CF_CULT_continent US−P−G1−CF_CULT_country US−P−G2−RB US−P−G2−CF US−P−G2−CF_CULT_continent US−P−G2−CF_CULT_country US−P−G3−RB US−P−G3−CF US−P−G3−CF_CULT_continent US−P−G3−CF_CULT_country 2.5 precision 2 1.5 1 7. ACKNOWLEDGMENTS This research is supported by the Austrian Science Funds (FWF): P22856 and P25655, and by the EU FP7 project no. 601166 (“PHENICX”). 8. REFERENCES [1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005. 
Figure 5. Precision vs. recall for the play count (top), diversity (middle), and mainstreaminess with a 12-month interval (bottom) experiments, over groups and various recommendation approaches.

TOWARDS SEAMLESS NETWORK MUSIC PERFORMANCE: PREDICTING AN ENSEMBLE'S EXPRESSIVE DECISIONS FOR DISTRIBUTED PERFORMANCE
Bogdan Vera, Queen Mary University of London, Centre for Digital Music, [email protected]
Elaine Chew, Queen Mary University of London, Centre for Digital Music, [email protected]

ABSTRACT
Internet performance faces the challenge of network latency. One proposed solution is music prediction, wherein musical events are predicted in advance and transmitted to distributed musicians ahead of the network delay.
We present a context-aware music prediction system focusing on expressive timing: a Bayesian network that incorporates stylistic model selection and linear conditional gaussian distributions on variables representing proportional tempo change. The system can be trained using rehearsals of distributed or co-located ensembles. We evaluate the model by comparing its prediction accuracy to two others: one employing only linear conditional dependencies between expressive timing nodes but no stylistic clustering, and one using only independent distributions for timing changes. The three models are tested on performances of a custom-composed piece that is played ten times, each in one of two styles. The results are promising, with the proposed system outperforming the other two. In predictable parts of the performance, the system with conditional dependencies and stylistic clustering achieves errors of 15ms; in more difficult sections, the errors rise to 100ms; and, in unpredictable sections, the error is too great for seamless timing emulation. Finally, we discuss avenues for further research and propose the use of predictive timing cues using our system. 1. INTRODUCTION Ensemble performance between remote musicians playing over the Internet is generally made difficult or impossible by high latencies in data transmission [3] [5]. While many composers and musicians have chosen to treat latency as a feature of network music, performance of conventional music, such as that of classical repertoire, remains extremely difficult in network scenarios. Audio latency frequently results in progressively decreasing tempo c Bogdan Vera, Elaine Chew. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Bogdan Vera, Elaine Chew. “Towards Seamless Network Music Performance: Predicting an Ensemble’s Expressive Decisions for Distributed Performance”, 15th International Society for Music Information Retrieval Conference, 2014. 489 and difficulty in synchronizing. One aspect that has received less attention than the latency is the lack of visual contact when performing over the internet. Visual cues can be transmitted via video, but such data is at least as slow as audio, and was previously found to not be of significant use for transmitting synchronization cues even when the audio had an acceptable latency [6]. Since the start of network music research, several researchers have posited theoretically that music prediction could be the solution to network latency (see, for example, Chafe [2]). Ideally, if the music can be predicted ahead of time with sufficient accuracy, then it can be replicated at all connected end-points with no apparent latency. Recent efforts have made limited progress towards this goal. One example is a system for predicting tabla drumming patterns [12], and recent proposals by Alexandraki [1]. Both assume that the tempo of the piece will be at least locally smooth and, in the case Alexandraki’s system, timing alterations are always based on one reference recording. In many styles of music, such as romantic classical music, the tempo can vary widely, with musicians interacting on fine-scale note-to-note timing changes and using visual cues to synchronize. The tempo cannot be expected to always evolve in the exact same way as one previous performance, rather the musicians significantly improvise timing deviations to some constraints. 
In this paper we propose a system for predicting timing in network performance in real time, loosely inspired by Raphael’s approach based on Bayesian networks [11]. We propose and test a way to incorporate abstract notions of expressive context within a probabilistic framework, making use of time series clustering. Flossman et al. [8] employed similar ideas when they extended the YQX model for expressive offline rendering of music by using conditional gaussian distributions to link expressive predictions over time. Our model contains an extra layer of stylistic abstraction and is applied to modeling and real-time tracking of one performer or ensemble’s expressive choice at the inter-onset interval level. We also describe how the method could be used for predicting musical timing in network performance, and discuss ideas for further work. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2. MOTIVATION on those from preceding events, making the timing changes dependent on both musicians’ previous timing choices, while also allowing the system to respond to the interplay between the two musicians. Secondly, we abstract different ways of performing the piece by summarizing these larger scale differences in an unsupervised manner in a new discrete node in the network: a stylistic cluster node. Our goal is to use observable sources of information during a live performance to predict the timing of future notes so as to counter the effects of network latency. The sources of information we can use include the timing of previous notes and the intensity with which the notes are played. The core idea is reminiscent of Raphael’s approach to automatic accompaniment [11], which uses a Bayesian network relating note onset times, tempo and its change over time. In Raphael’s model, changes in tempo and local note timing are represented as independent gaussian variables, with distributions estimated from rehearsals. During a performance, the system generates an accompaniment that emulates the rehearsals by applying similar alterations of timing and tempo at each note event in the performance. The model has been demonstrated in live performances and proven to be successful, however as long as the system generates musically plausible expression in the accompaniment, it is difficult to determine an error value, as it is simply meant to follow a musician and replicate a performance style established in rehearsals. An underlying assumption of this statistical model is that the solo musician leading the performance tends to perform the piece with the same expressive style each time. In an ensemble performance scenario, two-way communication exists between musicians. The requirement for the system to simply ‘follow’ is no longer enough. As a step towards tighter ensemble, we set as a goal a stringent accuracy requirement for our prediction system: to have errors small enough−no higher than 20-40ms−as to be indistinguishable from the normal fluctuations in ensemble playing. Note that actual playing may have higher errors, even in ideal conditions, due to occasional mistakes and fluctuations in motor control. The same ensemble might also explore a variety of ways to perform a piece expressively. When expressive possibilities are explored during rehearsals, the practices establish a common ‘vocabulary’ for possible variations in timing that the musicians can then anticipate. Another goal of our system is to account for several distinct ways of applying expression to the same piece. 
This is accomplished in two ways. Like Flossman et al. [8], we deliberately encode the context of the local expression by introducing dependencies between the expressive tempo changes at each time step. We additionally propose and test a form of model selection using discrete variables that represent the chosen stylistic mode of the expression. For example, given two samples exhibiting the same tempo change, one may be part of a longer term tempo increase, while another may be part of an elastic time-stretching gesture. Knowing the stylistic context for a tempo change will allow us to better predict its trajectory. 3.1 Linear Gaussian Conditional Timing Prediction Our goal is to predict the timing of events such as notes, chords, articulations, and rests. In particular, we wish to determine the time until the next event given the score information and a timing model. We collapse all chords into single events. Assume that the performance evolves according to the following equations, tn+1 = sn ln + tn , and sn+1 = sn · δn , (1) where tn is the onset time of the n-th event, sn is the corresponding inter-beat period, ln is the length of the event in beats, and δn is a proportional change in beat duration that is drawn from the gaussian distributions Δn . For simplicity, there is no distinction between tempo and local timing in our model, though it could be extended to include this separation. Because δn ’s reflect proportional change in beat duration, prediction of future beat durations are done on a logarithmic scale: log2 sn+1 = log2 sn + log2 δn . log(tempo) = log(1/sn ), thus log sn as well, has been shown in recent research to be a more consistent measure of tempo variation in expressive performance [4]. The parameters of the Δn distributions are predicted during the performance from previous observations, such as δn−1 . Thus, each inter-beat interval, sn , is shaped from event to event by the random changes, δn . The conditional dependencies between the random variables are illustrated in Figure 1. The first and last layers in the network, labeled P1 and P2 in the diagram, are the observed onset times. The 3rd layer, labeled ‘Composite’ following Raphael’s terminology, embodies the time and tempo information at each event, regardless of which ensemble musician is playing, and it is on this layer that our model focuses. The 2nd layer, Expression, consists of the variables Δn . The Δn variables are conditioned upon their predecessors, using any number of previous timing changes as input; formally, they are represented by linear conditional gaussian distributions [9]. Let there be a Bayesian network node with a normal distribution Y . We can condition Y on its k continuous parents C = {C1 , . . . , Ck } and discrete parents D = {D1 , . . . , Dk } by using a linear regression model to predict the mean and variance of Y given the values of C and D. The following equation describes the conditional probability of Y given only continuous parent k nodes: P (Y |C = c) = N (β0 + βi ci , σ 2 ). 3. CONTEXTUALIZING TIMING PREDICTION We combine two techniques to implement ensemble performance prediction. First, we condition the expressive ‘update’ distributions characterizing temporal expression i=1 490 15th International Society for Music Information Retrieval Conference (ISMIR 2014) than attempting to construct a universal model for mapping score to performance. 
As a result, the amount of training data will generally be much smaller as we may only use the most recent recorded and annotated rehearsals of the ensemble. The next section describes a clustering method we use to account for large-scale differences in timing. 3.2 Unsupervised Stylistic Characterization Although we could add a large number of previous inputs to each of the Δn nodes, we cannot tractably condition these variables’ distributions on potentially hundreds of previous observations. This would require a large amount of training data to estimate the parameters in a meaningful way. Instead, we propose to summarize larger-scale expression using a small number of discrete nodes representing the stylistic mode. For example, a musician may play the same section of music in few distinct ways, and a listener may describe it as ‘static’, ‘swingy’ or ‘loose’. If these playing styles could be classified in real time, prediction could be improved by considering this stylistic context. Our ultimate goal is to perform this segmentally on a piece of music, discovering distinct stylistic choices that occured in the ensemble’s rehearsals. In this paper, we present the first steps towards this goal: we characterize the style of the entire performance using a single discrete stylistic node. The stylistic node is shown at the top of Figure 1. In our model this node links to all of the Δn nodes in the piece, so that each of the Δn ’s is now linearly dependent on the previous timing changes with weights that are dependent on the stylistic node. Assuming that each Δn node is linked to one previous one, the parameters of the Δn distributions are then predicted at run-time using Figure 1: A section of the graphical model. Round nodes are continuous gaussian variables, and the square node (S) is a discrete stylistic cluster node. This is the equation for both continuous and discrete parents: k P (Y |D = d, C = c) = N (βd,0 + βd,j cj , σd2 ). j=1 Simply speaking, the mean and variance of each linear conditional gaussian node is calculated from the values of its continuous and discrete parent nodes. The mean is derived through linear regression from its continuous parents’ values with one weight matrix per configuration of its discrete parents. The use of conditional gaussian distributions means that rather than having fixed statistics for how the timing should occur at each point, the parameters for the timing distributions are predicted in real time from previous observations using linear regression. This simple linear relationship provides a means of predicting the extent of temporal expression as an ongoing gesture. For example, if the performance is slowing down, the model can capture the rate of slowdown, or a sharp tempo turnaround if this occurred during rehearsals. Our network music approach involves interaction between two actual musicians rather than a musician and a computer. Thus, each event observed is a ‘real’ event, and we update the Δn probability distributions at each step during run-time with the present actions of the musicians themselves. Unlike a system playing in automatic accompaniment or an expressive rendering system, our system is never left to play on its own, and its task is simply to continue from the musicians’ choices, leaving less opportunity for errors to accumulate. 
Additionally, we can correct the musicians’ intended timing by compensating for latency post-hoc - this implies that we can make predictions that emulate what the musicians would have done without the interference of the latency. We may also choose the number of previous changes to consider. Experience shows that adding up to 3 previous inputs improves the performance moderately, but the performance decreases thereafter with more inputs. For simplicity, we currently use only one previous input, which provides the most significant step improvement. In constrast to a similar approach by Flossman et al. [8], we do not attempt to link score features to the performance; we only consider the local context of their temporal expression. Our goal is to capture the essence of one particular ensemble’s interpretation of a particular piece rather P (Δt |S = s, Δt−1 = δ) = N (βs,0 + βs,1 δ, σs2 ), where S is the style node. To predict note events, we can simply take the means of the Δn distributions, and use Equation 1 to find the onset time of the next event given the current one. To use this model, we must first discover the distinct ways (if any) in which the rehearsing musicians perform the piece. We apply k-means clustering to the log(δn ) time series obtained from each rehearsal. We find the optimal number of clusters by using the Bayes Information Criterion (BIC) as described by Pelleg and Moore [10]. Note that other methods exist for estimating an optimal number of clusters. To train the Bayesian network, a training set is generated containing all of the δn values for each rehearsal as well as the cluster to which each time series is allocated. We then use the algorithm by Murphy [9] to find all the parameters of the linear conditional nodes. Note that all of the nodes are observable and we have training data for the Δn . During the performance, the system can update its belief about the stylistic node’s value from the note timings that have been observed at any point; we do not need to re-cluster the performance, as the network has learned the relationships between the Δn ’s and the stylistic node. We 491 15th International Society for Music Information Retrieval Conference (ISMIR 2014) We evaluated the system using a ‘leave-one-out’ approach, where out of the 20 performances we always trained on 19 of them and tested on the remaining one. We always used one previous input to the Δn nodes, using the actual observations in the performances rather than our predictions (like the extended YQX), simulating the process of live performance. We evaluated the prediction accuracy by measuring timing errors, which we define as the absolute difference between the true event times and those predicted by the model (in seconds). The training performances were clustered correctly in all cases, dividing the dataset into the two styles, with the first 10 performances being grouped with cluster 1 and the second 10 becoming part of cluster 2. Figure 3 shows the stylistic inference process. In the matrix, performances are arranged as rows, with events on the x-axis. Recall that we predict the time between events rather than just notes. So, we also consider the timing of rests, and chords are combined into single events rather than individual notes. The colors indicate the inferred value of the style node: grey for Style 1 and white for Style 2. We see that the system correctly infers the stylistic cluster of each performance within the first 19 events. 
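The event-by-event belief update behind these results can be written compactly for this model: since every Δ node is observed, exact message passing over the network reduces to multiplying per-style likelihoods of the observed tempo changes. The sketch below is our own illustration of that reduction, assuming the paper's choice of a single previous input; the per-style regression parameters (beta0, beta1, var) are assumed to come from the training step, and all names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def update_style_posterior(prior, style_params, prev_delta, delta):
    """One event's update of the belief over the stylistic cluster node S.

    prior        -- dict: style -> current probability
    style_params -- dict: style -> (beta0, beta1, var) of its linear conditional gaussian
    prev_delta   -- previously observed log2 tempo change (the continuous parent)
    delta        -- newly observed log2 tempo change
    """
    scores = {}
    for s, (b0, b1, var) in style_params.items():
        mean = b0 + b1 * prev_delta              # style-specific regression mean
        scores[s] = prior[s] * gaussian_pdf(delta, mean, var)
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

# Example use (deltas and fitted_params are placeholders for observed log2 changes
# and the parameters estimated from the clustered rehearsals):
# belief = {1: 0.5, 2: 0.5}                      # flat prior over two styles
# for prev, cur in zip(deltas, deltas[1:]):
#     belief = update_style_posterior(belief, fitted_params, prev, cur)
```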
In many cases the classification assigns the performance to the correct cluster after only two events. We use the message passing algorithm of Bayesian networks to infer the most likely state of the node. As the performance progresses, the belief about the state of the node is gradually established. Intuitively, the system arrives at a stable answer after some observations; otherwise, the overall style is ambiguous. The state of the node is then used to place future predictions into some higher-level context. The next section shows that the prediction performance is improved by using the stylistic node to select the best regression parameters for predicting the subsequent timing changes, which can be thought of as a form of model selection.

4. EVALUATION

4.1 Methodology
In this section we present an evaluation of the basic form of our model. Evaluation of such predictive models remains a challenge because testing in live performance requires further work on performance tracking and optimization, while offline testing necessitates a large number of annotated performances from the same ensemble. We present initial results on a small dataset; in our future work we will study real-time performances of more complex pieces. We evaluate the performance of three models: one uses linear conditional nodes and a stylistic cluster node; the second uses only linear conditional nodes; and the third has independent gaussian distributions for the Δ variables. Our dataset consists of 20 performances by one pianist of the short custom-composed piece shown in Figure 2. Notice that we have not added any dynamics or tempo-related markings; the interpretation is left entirely to the musicians. While this is not an ensemble piece, the performances are sufficient to test the prediction accuracy of our model in various conditions. In this simple example, we consider only the composite layer in the model, without P1 and P2.

Figure 3: Matrix showing the most likely style state after each event's observed δ. Performances 1-10 are in Style 1, and 11-20 are in Style 2. Classification result: grey = Style 1, white = Style 2.

Figure 4 shows the tempo information for the dataset. Figure 4(a) shows the inter-beat period contours of all of the performances, while Figure 4(b) shows boxplots (indicating the mean and variability) of the period at each musical event, for the entire dataset and for the two clusters.

4.2 Results
Figure 5a and Figure 5b show the performance of the models, measured using the mean absolute error averaged over events in each performance, and over performances for each event, respectively. We also show a detailed 'zoomed-in' plot of the errors between events 20-84 in Figure 5c to make the different models' mean errors clearer. For network music performance, we would want to predict at least as far forward as needed to counter the network (and other system) latency. As some inter-event time differences may be shorter than the latency, we may occasionally need to predict more than one event ahead. The model with stylistic clustering and linear conditional nodes performed best, followed by the one with only linear conditional nodes, then the model with independent Δn nodes.

Figure 2: Custom-composed piano test piece. The piece was played on an M-Audio AXIOM MIDI keyboard in one of two expressive styles decided beforehand, ten times for each style.
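As a concrete reading of this evaluation protocol, the sketch below scores one held-out performance by one-step-ahead onset prediction, following Equation 1 and the convention of feeding the model the actual previous observation. It is our own illustration with hypothetical names (onset_errors, predict_delta); the fitted regression, and optionally the style-dependent parameter choice, are hidden behind the predict_delta callable, and the indexing convention is ours rather than the paper's.

```python
import math

def onset_errors(onsets, beats, predict_delta):
    """Absolute one-step-ahead timing errors (in seconds) for one performance.

    onsets        -- observed onset times t_0 ... t_N in seconds
    beats         -- score length l_n of each event in beats
    predict_delta -- callable(prev_delta) -> predicted log2 tempo change, e.g.
                     the mean of the fitted (possibly style-selected) regression
    """
    # Observed inter-beat periods, s_n = (t_{n+1} - t_n) / l_n.
    periods = [(onsets[i + 1] - onsets[i]) / beats[i] for i in range(len(onsets) - 1)]
    errors = []
    for n in range(2, len(onsets) - 1):
        # Last fully observed proportional change, on a log scale.
        prev_delta = math.log2(periods[n - 1] / periods[n - 2])
        # Predicted period for the current event, then the next onset time
        # via t_{n+1} = t_n + s_n * l_n (Equation 1).
        pred_period = periods[n - 1] * 2 ** predict_delta(prev_delta)
        pred_onset = onsets[n] + pred_period * beats[n]
        errors.append(abs(pred_onset - onsets[n + 1]))
    return errors

# Example: a purely linear predictor fitted elsewhere, delta_hat = b0 + b1 * prev.
# errs = onset_errors(onsets, beats, lambda prev: b0 + b1 * prev)
```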
We used IRCAM's Antescofo score follower [7] for live tracking of the performance in our system, and for annotation of the note and chord events. The log-period plots for every performance in the dataset are shown in Figure 4a. The changes in log-period per event are shown in Figure 4b, and we also show the same changes for the data in each cluster found, to demonstrate the difference between the two playing styles.

Figure 4: Tempo data. (a) Log-period per event for every performance in the dataset. (b) Boxplots showing the median and variability of the log-period change at each event. Top: unclustered data, Middle: first centroid, Bottom: second centroid.

In all cases the errors were higher for the second style (the latter 10 performances), which was much looser than the first. The mean absolute errors for each model, considering all of the events in all of the performances, are summarized in Table 1.

Table 1: Overall Timing Errors for Each Model
Model                        Mean Abs. Error
Independent                  69.8 ms
Conditional                  57.4 ms
Clustering and Conditional   48.5 ms

Observe in Figure 5b that some parts of the performance were very difficult to predict. For example, we note high prediction errors in the first 12 events of the piece and one large spike in the error at the end of the piece. These are 1-bar and 2-bar long chords, for which musicians in an ensemble would have to use visual gestures or other information to synchronize. We would not expect any prediction system to do better than a musician anticipating the same timing without any form of extra-musical information. We discuss potential applications of music prediction for virtual cueing in the next section. The use of clustering and conditional timing distributions reduced the error for the events which were poorly predicted with independent timing distributions. For much of the piece the mean error was as low as 15 ms, but even for these predictable parts of the performance, the models with conditional distributions and clustering lowered the error, as can be seen from Figure 5c.

Figure 5: Mean absolute error per event, comparing the model with stylistic clustering and conditional nodes, the model with conditional nodes only, and the model with no clustering and no conditioning. (a) Mean absolute error for each performance. (b) Mean absolute error per event, over the whole performance. (c) A 'zoomed-in' view of the errors between events 20-84.

5. CONCLUSIONS AND FUTURE WORK
We have outlined a novel approach to network music prediction using a Bayesian network incorporating contextual inference and linear gaussian conditional distributions. In an evaluation comparing the model with stylistic clustering and linear conditional nodes, one with only linear conditional nodes without clustering, and one with indepen-
The end goal of this research is to implement and evaluate network music performance systems based on the prediction model. Whether music prediction can ever be precise enough to allow seamless network performance remains an open question. Important questions arise in pur- 5. CONCLUSIONS AND FUTURE WORK Model Independent Conditional Clustering and Conditional We have outlined a novel approach to network music prediction using a Bayesian network incorporating contextual inference and linear gaussian conditional distributions. In an evaluation comparing the model with stylistic clustering and linear conditional nodes, one with only linear conditional nodes without clustering, and one with indepen- Mean Abs. Error 69.8ms 57.4ms 48.5ms Table 1: Overall Timing Errors for Each Model 493 15th International Society for Music Information Retrieval Conference (ISMIR 2014) suit of this goal: how much should the system lead the musicians to help them stay in time without making the performance artificial? Predicting musical timing with sufficient accuracy will open up interesting avenues for network music research, especially when we consider parallel research into predicting other information such as intensity and even pitch information, but whether any musician would truly want to let a machine impersonate them expressively remains to be seen, which is why we propose that a ‘minimally-invasive’ conductor-like approach to regulating tempo would be more appropriate than complete audio prediction. accuracy the timing in sections of a piece requiring temporal coordination, then we could help musicians synchronize by providing them with perfectly simultaneous predicted cues. We regard the use of predictive virtual cues as less invasive to networked ensembles than complete predictive sonification. In situations where the audio latency is low enough for performance to be feasible but video latency is still too high for effective transmission of gestural cues, predictive sonification may be omitted completely, and virtual cues could be implemented as a regulating factor. 5.1 The Bayesian Network This research was funded in part by the Engineering and Physical Sciences Research Council. 6. ACKNOWLEDGEMENTS It would be straightforward to extend our model by implementing prediction of timing from other forms of expression that tend to correlate with tempo. For example, using event loudness in the prediction would simply require the addition of another layer of variables in the Bayesian network and conditioning the timing variables on these nodes as well. 7. REFERENCES [1] C. Alexandraki and R. Bader. Using computer accompaniment to assist networked music performance. In Proc. of the AES 53rd Conference on Semantic Audio, London, UK, 2013. [2] C. Chafe. Tapping into the internet as an acoustical/musical medium. Contemporary Music Review, 28, Issue 4:413–420, 2010. 5.2 Capturing Style Much work remains to expand on the characterization of stylistic mode. As previously mentioned, we plan to explore segmental stylistic characterization, considering different contextual information for each part of the performance. In our current model we use only one stylistic node. This may be a plausible for a small segment of music, but in a longer performance the choice of performance style may vary over time. If the predicted performance starts within one style but changes to another, the model is ill-informed to predict the parameters. 
In our future work we would like to extend the model to capture such stylistic tendencies over time. One approach would require presegmentation of the piece based on the choice of expressive choices during the reharsal stage, and introduction of one stylistic node per segment. The prediction context would then be local to each part of the performance. We may then, for example, have causal conditional dependencies between the stylistic nodes in each segment of the piece, which would allow the system to both infer the style within a part of the performance from what is being played and from the previous stylistic choices. In practice, a musician or ensemble’s rehearsals may not comprise of completely distinct interpretations; however, capturing expression contextually will likely offer a larger degree of freedom to the musicians in an internet performance, who may then explore a greater variety of temporal and other articulations. 5.3 Virtual Cueing Virtual cueing forms an additional application of interest. As mentioned at the start of the paper, visual communication is generally absent or otherwise delayed in network music performance. If we could predict with reasonable [3] C. Chafe and M. Gurevich. Network time delay and ensemble accuracy: Effects of latency, asymmetry. In Proc. of the 117th Audio Engineering Society Convention, 2004. [4] E. Chew and C. Callender. Conceptual and experiential representations of tempo: Effects on expressive performance comparisons. In Proc. of the 4th International Conference on Mathematics and Compution in Music, pages 76–87, 2013. [5] E. Chew, A. Sawchuk, C. Tanoue, and R. Zimmermann. Segmental tempo analysis of performances in user-centered experiments in the distributed immersive performance project. In Proc. of the Sound and Music Computing Conference, 2005. [6] E. Chew, R. Zimmermann, A. Sawchuk, C. Kyriakakis, and C. Papadopolous. Musical interaction at a distance: Distributed immersive performance. In Proc. of the 4th Open Workshop of MUSICNETWORK, Barcelona, 2004. [7] A. Cont. Antescofo: Anticipatory synchronization and control of interactive parameters in computer music. In Proc. of the International Computer Music Conference, 2008. [8] S. Flossmann, M. Grachten, and G. Widmer. Guide to Computing for Expressive Music Performance, chapter Expressive Performance Rendering with Probabilistic Models, pages 75– 98. Springer Verlag, 2013. [9] K. P. Murphy. Fitting a conditional linear gaussian distribution. Technical report, University of British Columbia, 1998. [10] D. Pelleg and A. W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proc. of the Seventeenth International Conference on Machine Learning, pages 727–734, 2000. [11] C. Raphael. Music plus one and machine learning. In Proc. of the 27th International Conference on Machine Learning, pages 21–28, 2010. [12] M. Sarkar. Tablanet: a real-time online musical collaboration system for indian percussion. Master’s thesis, MIT, 2007. 494 15th International Society for Music Information Retrieval Conference (ISMIR 2014) DETECTION OF MOTOR CHANGES IN VIOLIN PLAYING BY EMG SIGNALS Ling-Chi Hsu, Yu-Lin Wang, Yi-Ju Lin, Alvin W.Y. Su Department of CSIE, National Cheng-Kung University, Taiwan [email protected]; [email protected]; [email protected]; [email protected] Cheryl D. 
Metcalf Faculty of Health Sciences, University of Southampton, United Kingdom [email protected] explored the dynamic pressures to analyze how pianists depressed the piano keys and hold them down during playing. The pressure measurement advances the evaluation of the keystroke in piano playing [4-5]. The use of muscle activity via electromyography (EMG) signals allows further investigation into the motor control sequences that produce the music. EMG is a technique which evaluates the electrical activity of the muscle by recording the electrical potentials when muscles generate an electrical voltage during activation, which results in a movement or coordinated action. EMG is generally recorded in two protocols; invasive electromyography (IEMG) and surface electromyography (SEMG). IEMG is used to measure deep muscles and discrete positions using a fine-wire needle; however, it is not a preferable model for subjects due to the invasiveness and being less repetitive. Compared to IEMG, SEMG has the following characteristics: (1) it is noninvasive; (2) it provides global information; (3) it is comparatively simple and inexpensive; (4) it is applicable by non-medical personnel; and (5) it can be used over a longer time during work and sport activities [6]. Therefore, the SEMG is suitable for use within biomechanics and movement analysis, and was used in this paper. For the analysis of musical performance, EMG has been used to evaluate behavioral changes of the fingers [7-8], upper limbs [9-10] shoulder [11-12] and wrist [13] in piano, violin, cello and drum players. The EMG method allows for differentiating the variations and reproducibility of muscular activities in individual players. Comparing the EMG activity between expert pianists and novice players [7-14] has also been studied. There have been many approaches developed for segmentation of EMG signals [15]. Prior EMG segmentation techniques were mainly used to detect the time period for a certain muscle contraction, but we found that the potential variations from various muscles maybe different during a movement. It causes the conventional EMG segmentation to fail to extract the accurate timing of movement in instrument playing. In this paper, the timing activation of the muscle group is assessed, and the changes in motor control of players during performance are investigated. We propose a system with the function of concurrently recording the audio signal and behavioral changes (EMG) while playing an instrument. This work is particularly focused on violin playing, which is considered difficult to segment with the ABSTRACT Playing a music instrument relies on the harmonious body movements. Motor sequences are trained to achieve the perfect performances in musicians. Thus, the information from audio signal is not enough to understand the sensorimotor programming in players. Recently, the investigation of muscular activities of players during performance has attracted our interests. In this work, we propose a multi-channel system that records the audio sounds and electromyography (EMG) signal simultaneously and also develop algorithms to analyze the music performance and discover its relation to player’s motor sequences. The movement segment was first identified by the information of audio sounds, and the direction of violin bowing was detected by the EMG signal. Six features were introduced to reveal the variations of muscular activities during violin playing. 
With the additional information of the audio signal, the proposed work could efficiently extract the period and detect the direction of motor changes in violin bowing. Therefore, the proposed work could provide a better understanding of how players activate the muscles to organize the multi-joint movement during violin performance. 1. INTRODUCTION For musicians, their motor skills must be honed by many hours of daily practice to maintain the performing quality. Motor sequences are trained to achieve the perfect performances. Playing a musical instrument relies on the harmonious coordination of body movements, arm and fingers. This is fundamental to understanding the neurophysiological mechanisms that underpin learning. It therefore becomes important to understand the sensorimotor programming in players. In the late 20th century, Harding et al. [1] directly measured the force between player’s fingers and piano keys with different skill levels. Engel et al. [2] found there is an anticipatory change of sequential hand movements in pianists. Parlitz et al. [3] © L.C. Hsu, Y.J. Lin, Y.L. Wang, A.W.Y. Su, C.D. Metcalf. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: L.C. Hsu, Y.J. Lin, Y.L. Wang, A.W.Y. Su, C.D. Metcalf. “Detection Of Motor Changes In Violin Playing By EMG Signals”, 15th International Society for Music Information Retrieval Conference, 2014. 495 15th International Society for Music Information Retrieval Conference (ISMIR 2014) soft onsets of the notes. The segment with body movements was first identified by the information of audio sounds. It is believed that if there is an audio signal, then there is a corresponding movement. Six features were then introduced to EMG signals to discover the variation of movements. This work identifies the individual movement segments, i.e. up-bowing and down-bowing, during violin playing. Thus, how motor systems operated in musicians and affected during performance could be explored using this methodology. This paper is organized as follows. The multi-channel signal recording system and its experimental protocol are shown in section 2. In section 3, we introduce the proposed algorithms for segmenting the EMG signal with additional audio information. The experimental results are shown in section 4 and the conclusion and future work are given in section 5. Two seconds resting time was given between the two consecutive movements. The EMG sampling rate was 1000Hz. The electrodes attached on the surface of the player’s skin as shown Figure 2. In this study, the direction of violin bowing, i.e. upbowing and down-bowing, is detected by the corresponding muscle activity (EMG signal). The total of 8 muscles in the upper limb and body is measured in our system. Figure 3 shows the 8-channel EMG signals of up-bowing movement, and potential variations were shown in all channels when bowing. Three types of variations were observed and grouped: (1) Channel#1 to Channel#6: it is seen that the trend of six channels is similar; additionally, the average noise floor between channel#3 and channel#6 are lower than others; finally, we choose channel#6 because the position is convenient to place the electrode. (2) Channel#7: the channel involving the most noise. (3) Channel#8: although it has more noise than Channel#1 to Channel#6, it is the important part when we have a whole-bowing movement. 2. 
AUDIO SOUNDS AND BIOSIGNAL RECORDING SYSTEM
This work proposes a multi-channel signal recording system capable of recording audio and EMG signals concurrently. The system is illustrated in Figure 1 and comprises: (a) a signal pre-amplifier acquisition board, (b) an analog-to-digital signal processing unit, and (c) a host system.

Figure 1. The proposed multi-channel recording system for recording the audio signal and EMG concurrently.

The violin signal was recorded in a chamber; the microphone was placed 30 cm from the player and the sampling rate was 44100 Hz. With this real violin recording, the sound inevitably contains noise and artifacts. Furthermore, there are three subjects in the experiment database. The violinists played music while being recorded, and each participant was requested to press one string during playing. This experiment included two tasks for performance evaluation, and each task contained 10 movements. The movements for task#1 and task#2 are defined as follows.

Movements for task#1: (1) The player presses the 2nd string and then is idle for 2 s (beginning the bow at the frog). (2) Pulls the bow from the frog to the tip for 4 s (whole bow down). (3) Pulls the whole bow up for 4 s.

Figure 2. The placement of the electrodes attached on the player's skin [16, 17].

Movements for task#2: (1) The player presses the 3rd string and then is idle for 2 s (beginning the bow at the tip). (2) Pulls the whole bow up for 4 s. (3) Pulls the whole bow down for 4 s.

Figure 3. The 8-channel EMG signals of up-down bowing movements.

To reduce the computation and retain the variety of features, only channel#6 and channel#8 were thereafter used for further analysis. Figure 4 shows the EMG signals of channel#6 and channel#8 during down-bowing.

Figure 4. The EMG signals of the triceps (channel#6) and pectoralis (channel#8) during down-bowing movements.
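Since the audio (44100 Hz) and EMG (1000 Hz) streams are acquired concurrently but at different rates, later stages need to map intervals located in one stream onto the other. The following sketch shows one way this mapping could be done; it is our own illustration rather than the authors' code, the function names are hypothetical, and it assumes both streams share a common start time, which is what concurrent acquisition on one board is meant to guarantee.

```python
AUDIO_SR = 44100   # audio sampling rate in Hz, as stated above
EMG_SR = 1000      # EMG sampling rate in Hz

def audio_interval_to_emg(onset_sample, offset_sample):
    """Convert an interval found in the audio stream to EMG sample indices."""
    onset_t = onset_sample / AUDIO_SR            # seconds from the common start
    offset_t = offset_sample / AUDIO_SR
    return int(round(onset_t * EMG_SR)), int(round(offset_t * EMG_SR))

def cut_emg_segment(emg_channel, onset_sample, offset_sample):
    """Return the samples of one EMG channel spanned by an audio onset/offset pair."""
    lo, hi = audio_interval_to_emg(onset_sample, offset_sample)
    return emg_channel[lo:hi]
```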
3. METHOD
The following section introduces the proposed algorithm for detecting the bowing states during violin playing. The proposed system is capable of recording audio and EMG signals concurrently, and in this study a bowing state detection algorithm was developed and implemented on the embedded system. The flowchart of the proposed method is shown in Figure 5.

Figure 5. Flowchart of the proposed system.

The EMG signals were segmented according to the violin sounds. Then, six features were computed to detect the direction of the bowing movements. For analyzing the audio signal, the window size of a frame is 2048 samples and the hop size is 256 samples.

3.1 Onset/Transition/Offset detection
This section elaborates on the state detection of the audio sounds. The states of the audio sounds are defined as Onset, Transition and Offset in this study. The Onset is the beginning of the bowing; the Transition is the timing at which the next bowing movement occurs; the Offset is the end of the bowing; and the Sustain is the duration of the note segment. Both frequency and spatial features were calculated and used as the inputs to our developed finite state machine (FSM). The diagram of the proposed FSM is illustrated in Figure 6. The output of the FSM identifies the result of note detection and is further used for EMG segmentation.

Figure 6. The state diagram of the audio sounds.

The violin signal was analyzed in both the frequency and time domains. For frequency analysis, the violin signal was first transformed by the short-time Fourier transform. The inverse correlation (IC) was then applied to estimate the possible note onset period. The inverse correlation (IC) coefficients are computed from the correlation coefficients of two consecutive discrete Fourier transform spectra [18]. A support vector machine (SVM), denoted as SVMic (1), was applied for detecting the accurate timing of onsets. SVM is a popular methodology, with high speed and simple implementation, for classification and regression analysis [19].

SVMic = 0 for a non-transition, 1 for a transition.  (1)

For spatial analysis, the amplitude envelope (AE) was used to detect the segment of the sound data. AE is evaluated as the maximum value of a frame. There are two similar classifiers, called SVMae1 (2) and SVMae2 (3). SVMae1 is used to identify the possible onsets and SVMae2 is used to identify the possible offsets.

SVMae1 = 0 for a non-onset, 1 for an onset.  (2)

SVMae2 = 0 for a non-offset, 1 for an offset.  (3)

Figure 7 shows (a) a segment of the audio sounds with one sequence of down-bowing and up-bowing, while Figure 7(b) and (c) display the results of IC and AE, respectively. During the bowing state, the IC value is extremely small compared to the non-bowing state, so IC seems to be a good index for identifying whether the violin is being played or not. However, it can be seen that a time deviation is introduced if the system simply applies a hard threshold, e.g. 0.3. Alternatively, the AE value becomes larger in the playing state, but the issue of time deviation is also present for this feature if a hard threshold is applied. After calculating the IC and AE values, their variation is used as one set of input data for the SVMs. The time period of each data point is 100 ms. Therefore, SVMic, SVMae1 and SVMae2 are designed to detect the most plausible timing of onset, transition and offset.

Figure 7. (a) The audio sounds of down-bowing and up-bowing; (b) the results of IC; (c) the results of AE.
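The two audio features can be computed with a few lines of frame-based processing. The sketch below is a minimal illustration under our own naming: AE is taken as the frame maximum as described above, while IC is approximated as 1 minus the Pearson correlation of two consecutive magnitude spectra, since the exact formulation lives in [18]. The 100 ms variation windows that actually feed SVMic, SVMae1 and SVMae2 are omitted.

```python
import numpy as np

WIN, HOP = 2048, 256          # analysis window and hop size stated above

def ic_ae_features(audio):
    """Per-frame amplitude envelope (AE) and inverse correlation (IC).

    audio -- 1-D numpy array of audio samples.
    Returns two arrays of length n_frames.
    """
    window = np.hanning(WIN)
    n_frames = 1 + (len(audio) - WIN) // HOP
    ae = np.zeros(n_frames)
    ic = np.ones(n_frames)                      # first frame has no predecessor
    prev_mag = None
    for i in range(n_frames):
        frame = audio[i * HOP: i * HOP + WIN]
        ae[i] = np.max(np.abs(frame))           # amplitude envelope of the frame
        mag = np.abs(np.fft.rfft(frame * window))
        if prev_mag is not None:
            r = np.corrcoef(mag, prev_mag)[0, 1]
            ic[i] = 1.0 - r                     # small while the bow sustains a note
        prev_mag = mag
    return ae, ic
```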
3.2 Detection of bowing direction
In each movement, there are one onset, one offset, and several transitions. However, the total number of transitions will differ from the number of notes. After the detection of the bowing states is completed, the durations between onset and offset are used for segmenting the EMG signals of the triceps (channel#6) and pectoralis (channel#8). For each note duration, there are three cases: (1) the duration from the onset to the first transition; (2) the duration from the current transition to the next transition; (3) the duration from the last transition to the offset. This note duration extracted from the audio sound is called an active frame, and the active frames are of varying lengths.

For each active frame, six features from [20] were applied to calculate the variations of the EMG signal while bowing. The features are:
- Mean absolute value (MAV)
- Mean absolute value slope (MAVS)
- Zero crossings (ZC)
- Slope sign changes (SSC)
- Waveform length (WL)
- Correlation variation (CV)

Here, the active frame is experimentally divided into 20 segments for calculating MAV and WL; thus each active frame has 20 values of MAV and WL. For CV, we calculate the auto-correlation and cross-correlation of channel#6 and channel#8, and therefore there are 3 values of CV for each active frame. Table 1 lists the number of each feature for each channel.

Table 1. The number of each feature per channel
Feature   MAV   MAVS   ZC   SSC   WL
Number    20    19     1    1     20

A more detailed description of the applied features can be found in [20]. Figure 8 displays the triceps EMG signal of one active frame (8 s ~ 16 s) and the results calculated by MAV, MAVS, ZC, SSC and WL. It can be seen that variations are exhibited for the six features in violin playing with a down-up bowing movement.

Figure 8. One down-up bowing movement and its six features: (a) the down-bowing movement, (b) the up-bowing movement.

The detection of the bowing direction is also determined by an SVM classifier, denoted as SVMdir (3). For SVMdir, a total of 125 inputs are used (61 inputs each for channel#6 and channel#8, plus 3 values of CV), and it identifies whether the active EMG frame is in the up-bowing or down-bowing state.

SVMdir = 0 for up-bowing, 1 for down-bowing.  (3)

3.3 Performance evaluation
In our experiment, 10-fold cross-validation is used for SVMic, SVMae and SVMdir, and the performance evaluation calculates the accuracy (4), precision (5), recall (6) and F-score (7) of each detection function.

Accuracy = (TruePositive + TrueNegative) / (Positive + Negative)  (4)
Precision = TruePositive / (TruePositive + FalsePositive)  (5)
Recall = TruePositive / (TruePositive + FalseNegative)  (6)
F-score = 2 · Precision · Recall / (Precision + Recall)  (7)

A true positive means the movement was correctly detected; a false positive is a falsely detected movement; and a false negative is a missed detection.

4. EXPERIMENTAL RESULTS
In this section, the efficiency of the proposed SVMs is examined. An example of the proposed EMG segmentation is then compared to the prior work [15]. Finally, the averaged and overall simulation results are given.

4.1 The performance of the SVM classifications
To illustrate that both the proposed IC and AE effectively identify the sound states of onset and offset, respectively, Figure 9 shows the trend of the IC and AE values in one down-up bowing movement, using the classification results of SVMic, SVMae1 and SVMae2. Table 2 shows that, with the given FSM, the detection rates of onsets, transitions and offsets are 90%, 100% and 100%, respectively.

Figure 9. The results of the 3 classifiers: (a) onsets, (b) transitions, (c) offsets.

violin signal of task#1 with three movements. Figure 11 (b) and (c) are the EMG segmentations of our proposed method and of [15], respectively. Channel#6 is used in this example to illustrate a sample output. It is believed that if there is an audio signal, then there is a corresponding movement. It can be seen that the results segmented by [15], without the additional information of the audio signal, could not precisely identify the segments of the movements during bowing. However, the proposed method is based on the information from the audio signals and clearly identifies the segments of behavioral changes during violin playing.

Figure 11. (a) The violin signal; (b) the proposed EMG segmentations; (c) the EMG segmentations of [15].

4.3 The simulation results
The detection results for the violin bowing direction are given in Table 3, where accuracy, precision, recall and F-score are presented.

Table 3. The detection results of the bowing direction: (1) on the ground-truth active frames; (2) on the extracted active frames.
            (1)       (2)
Accuracy    85%       87.5%
Precision   76.92%    82.61%
Recall      100%      95%
F-score     86.96%    88.37%

The average detection results were shown to have excellent performance, with an accuracy of 85%~87.5%.
The results show that the proposed method efficiently identifies the bowing direction in violin playing. Table 2. The detection results of the bowing states with the given FSM. Onset Transition Offset 90.00% 100% 100% Accuracy 90.00% 100% 100% Precision 90.00% 100% 100% Recall 90.00% 100% 100% F-score Figure 10 shows the distribution of active EMG frames during up-bowing and down-bowing states, and it displays the distribution of MAV, MAVS and WL. The SVMdir classifies the data with 85% accuracy. 5. CONCLUSION AND FUTURE WORK The proposed biomechanical system for recording the audio sounds and EMG signals during playing an instrument was developed. The proposed method not only extracts the segment during movement and detects the moving direction of bowing, but with the additional information of violin sounds, changes in muscle activity as an element of motor control, could be efficiently detected when compared to the prior EMG segmentation (without any sound information). To the authors’ knowledge, this is the first study which proposes such concept. Figure 10. (a) The original distribution of up-bowing and down-bowing EMG frames; (b) the results of SVMdir classification. 4.2 EMG segmentation The results of EMG segmentation and its comparison to [15] are both illustrated in Figure 11. Figure 11 shows the 499 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [11] A. Fjellman-Wiklund and H. Grip, J. S. Karlsson et al.: “EMG trapezius muscle activity pattern in string players: Part I—is there variability in the playing technique?,” International journal of industrial ergonomics, vol. 33, no. 4, pp. 347-356, 2004. [12] J. G. Bloemsaat, R. G. Meulenbroek, and G. P. Van Galen: “Differential effects of mental load on proximal and distal arm muscle activity,” Experimental brain research, vol. 167, no. 4, pp. 622-634, 2005. Future work will improve the detection rate of onset, transition and offset to extract the period of an active frame more precisely. The detection of the bowing direction will be also improved. Furthermore, the relationship between the musical sounds and the muscular activities of players in musical performance will be observed and analyzed. By measuring the music and the player’s muscular activity, better insights can be made into the neurophysiological control during musical performances and may even prevent players from the injuries as greater insights into these mechanisms are made. [13] S. Fujii and T. Moritani: “Spike shape analysis of surface electromyographic activity in wrist flexor and extensor muscles of the world's fastest drummer,” Neuroscience letters, vol. 514, no. 2, pp. 185-188, 2012. 6. REFERENCES [1] DC. Harding, KD. Brandt, and BM. Hillberry: “Minimization of finger joint forces and tendon tensions in pianists,” Med. Probl. Perform Art pp.103-104, 1989. [14] S. Furuya and H. Kinoshita: “Organization of the upper limb movement for piano key-depression differs between expert pianists and novice players,” Experimental brain research, vol. 185, no. 4, pp. 581-593, 2008. [15] P. Mazurkiewicz, “Automatic Segmentation of EMG Signals Based on Wavelet Representation,” Advances in Soft Computing Volume 45, 2007, pp 589-595 [16] Bodybuilding is lifestyle! "Chest - Bodybuilding is lifstyle!"http://www.bodybuildingislifestyle.com/che st/. [17] Bodybuilding is lifestyle! "Chest - Bodybuilding is lifestyle!" http://www.bodybuildingislifestyle.com/hamstrings/. [2] KC. Engel, M. Flanders, and JF. 
Soechting: “Anticipatory and sequential motor control in piano playing,” Exp Brain Res. pp. 189-199, 1997. [3] D. Parlitz, T. Peschel, and E. Altenmuller: “Assessment of dynamic finger forces in pianists: Effects of training and expertise,” J. Biomech. pp.1063-1067, 1998. [4] H. Kinoshita , S. Furuya , T. Aoki, and E. Altenmüller E.: “Loudness control in pianists as exemplified in keystroke force measurements on different touches,” J Acoust Soc Am. pp. 2959-69, 2007. [5] AE. Minetti, LP. Ardigò, and T. McKee: “Keystroke dynamics and timing: Accuracy, precision and difference between hands in pianist's performance,” J Biomech. pp. 3738-43, 2007. [6] R. Merletti and P. Parker: Electromyography: physiology, engineering, and noninvasive applications, Wiley-IEEE Press, 2004. [7] C.-J. Lai, R.-C. Chan, and T.-F. Yang et al.: “EMG changes during graded isometric exercise in pianists: comparison with non-musicians,” Journal of the Chinese Medical Association, vol. 71, no. 11, pp. 571-575, 2008. [8] M. Candidi, L. M. Sacheli, and I. Mega et al.: “Somatotopic mapping of piano fingering errors in sensorimotor experts: TMS studies in pianists and visually trained musically naïves,” Cerebral Cortex, vol. 24, no. 2, pp. 435-443, 2014. [9] S. Furuya, T. Aoki, and H. Nakahara et al.: “Individual differences in the biomechanical effect of loudness and tempo on upper-limb movements during repetitive piano keystrokes,” Human movement science, vol. 31, no. 1, pp. 26-39, 2012. [10] D. L. Rickert and M. Halaki, K. A. Ginn et al.: “The use of fine-wire EMG to investigate shoulder muscle recruitment patterns during cello bowing: The results of a pilot study,” Journal of Electromyography and Kinesiology, vol. 23, no. 6, pp. 1261-1268, 2013. 500 [18] WJJ. Boo, Y. Wang, and A. Loscos, “A violin music transcriber for personalized learning,” pp 2081-2084, IEEE International Conference on Multimedia and Expo, 2006. [19] BE. Boser, IM. Guyon and VN. Vapnik: “A training algorithm for optimal margin classifiers,” In Fifth Annual Workshop on Computational Learning Theory, ACM 1992. [20] AJ. Andrews: “Finger movement classification using forearm EMG signals,” M. Sc. dissertation, Queen's University, Kingston, ON, Canada, 2008. 15th International Society for Music Information Retrieval Conference (ISMIR 2014) AUTOMATIC KEY PARTITION BASED ON TONAL ORGANIZATION INFORMATION OF CLASSICAL MUSIC Lam Wang Kong, Tan Lee Department of Electronic Engineering The Chinese University of Hong Kong Hong Kong SAR, China {wklam,tanlee}@ee.cuhk.edu.hk ABSTRACT they are the beginning or ending chords? Seems it is not, as B major chord is normally not a member chord of C major and vice versa. It seems that there must be a key change in the middle. But how would you find out the point of key change, and how does the key change? With the help of the tonal grammar tree analysis in §2.1, a good estimate of the key path can be obtained. To start with, we assume that the excerpt consists of harmonically complete phrase(s) and the chord labels are free from errors. There are some existing algorithms to estimate the key based on chord progression. These algorithms can be classified into two categories: statistical-based and rule-based approach. Hidden Markov model is very often used in the statistical approach. Lee & Stanley [7] extracted key information by performing harmonic analysis on symbolic training data and estimated the model parameters from them. 
They built 24 key-specific HMMs (all major and minor keys) for recognizing a single global key which has the highest likelihood. Raphael & Stoddard [11] performed harmonic analysis on pitch and rhythm. They divided the music into a fixed musical period, usually a measure, and associate a key and chord to each of period. They performed functional analysis of chord progression to determine the key. Unlabeled MIDI files were used to train the transition and output distributions of HMM. Instead of recognizing the global key, it can track the local key. Catteau et al. [2] described a probabilistic framework for simultaneous chord and key recognition. Instead of using training data, Lerdahl’s representation of tonal space [8] were used as a distance metric to model the key and chord transition probabilities. Shenoy et al. [15] proposed a rule-based approach for determining the key from chord sequence. They created a reference vector for each of the 12 major and minor keys, including the possible chords within the key. Higher weights were assigned to primary chords (tonic, subdominant and dominant chords). The chord vector obtained from audio data were compared against the reference vector using weighted cosine similarity. The pattern with the highest rank is chosen as the selected global key. This paper uses a rule-based approach to model tonal harmony. A context-free dependency structure is used to exhaust all the possible combinations of key paths, and the best one is selected according to music knowledge. The main objective of this research is to exploit this tonal context-free dependency structure in order to partition an excerpt of classical music into several key sections. Key information is a useful information for tonal music analysis. It is related to chord progressions, which follows some specific structures and rules. In this paper, we describe a generative account of chord progression consisting of phrase-structure grammar rules proposed by Martin Rohrmeier. With some modifications, these rules can be used to partition a chord symbol sequence into different key areas, if modulation occurs. Exploiting tonal grammar rules, the most musically sensible key partition of chord sequence is derived. Some examples of classical music excerpts are evaluated. This rule-based system is compared against another system which is based on dynamic programming of harmonic-hierarchy information. Using Kostka-Payne corpus as testing data, the experimental result shows that our system is better in terms of key detection accuracy. 1. INTRODUCTION Chord progression is the foundation of harmony in tonal music and it can determine the key. The key involves certain melodic tendencies and harmonic relations that maintain the tonic as the centre of attention [4]. Key is an indicator of the musical style or character. For example, the key C major is related to innocence and pureness, whereas F minor is related to depression or funereal lament [16]. Key detection is useful for music analysis. A classical music piece may have several modulations (key changes). A change of key means a change of tonal center, the adoption of a different tone to which all the other tones are to be related [10]. Key change allows tonal music to convey a sense of long-range motion and drama [17]. Keys and chord labels are interdependent. Even if the chord labels are free from errors, obtaining the key path is often a non-trivial task. 
For example, if a music excerpt has been analyzed with the chord sequence [B , F, Gmin , Amin , G, C], how would you analyze its key? Is it a phrase entirely in B major or C major, as c Lam Wang Kong, Tan Lee. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Lam Wang Kong, Tan Lee. “Automatic key partition based on Tonal Organization Information of Classical Music”, 15th International Society for Music Information Retrieval Conference, 2014. 501 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Functional level Scale degree level T R → DR T DR → SR D T R → T R DR XR → XR XR phrase → T R Added rules for scale degree level: S → ii (minor) T → I IV V I | V I IV I | I bII I D → I V , after S or D(V ) TR DR SR XR T D tonic region dominant region predominant region any specific region tonic function dominant function S X D() X/Y I, III... ii, vi... T →I T → I IV I S → IV D → V | vii T → vi | III D → V II (minor) S → ii (major) S → V I | bII (minor) X → D(X) X D(X) → V /X | vii/X Figure 1. Example of a tonal grammar tree (single key) predominant function any specific function secondary dominant X of Y chord major chords minor chords Table 1. Rules (top) and labels (bottom) used in our system 2. TONAL THEORY OF CLASSICAL MUSIC 2.1 Schenkerian analysis and formalization Figure 3. Flow diagram of our key partitioning system To interpret the structure of the tonal music, Schenkerian analysis [14] is used. The input is assumed to be classical music with one or more tonal centre (tonal region). Each tonal centre can be elaborated into tonic – dominant – tonic regions [1]. The dominant region can be further elaborated into predominant-dominant regions. Each region can be recursively elaborated to form a tonal grammar tree. We can derive the key information by referring to the top of the tree, which groups the chord sequence into a tonal region. Context-free grammar can be used to formalize this tree structure. A list of generative syntax is proposed by Rohrmeier [13] in the form of V → w. V is a single nonterminal symbol, while w is a string of terminals and/or non-terminals. Chord symbols (eg. IV ) are represented by terminals. They are the leaves of the grammar tree. Tonal functions (eg. T for tonic) or regions (eg. T R for tonic region) are represented by non-terminals. They can be the internal nodes or the root of the grammar tree. For instance, the rule D → V | vii indicates that the V or vii chord can be represented by the dominant function. The rule S → ii (major) indicates that ii chord can be represented by the predominant function only when the current key is major. Originally Rohrmeier has proposed 28 rules. Some of them were modified to suit classical music and were listed in Table 1. Based on this set of rules, Cocke–Younger–Kasami parsing algorithm [18] is used to construct a tonal grammar tree. If a music input is harmonically valid, a single tonal grammar tree can be built like in Figure 1. Else some scattered tree branches are resulted and cannot be connected to one single root. 2.2 Modulation In Rohrmeier’s generative syntax of tonal harmony, modulation is formalized as a new local tonic [13]. Each functional region (new key section) is grouped as a single nontonic chord in the original passage, and they may relate this (elaborated) chord to the neighbouring chords. In this research we have a more general view of modulation. 
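As a side note to §2.1, the tree-building step can be made concrete with a toy CYK parser over chord symbols. The rule set below is a deliberately simplified, illustrative subset of Table 1, restricted to unary and binary rules; the unary "region from function" rules are an assumption added here so that the toy grammar parses at all, and this is a sketch, not the grammar actually used in the system.

```python
# TR/DR/SR are tonic/dominant/predominant regions, T/D/S the corresponding functions.
TERMINAL_RULES = {'T': {'I', 'vi'}, 'D': {'V', 'vii'}, 'S': {'IV', 'ii'}}
UNARY_RULES = {'TR': {'T'}, 'DR': {'D'}, 'SR': {'S'}}          # assumed, not in Table 1
BINARY_RULES = {'TR': {('DR', 'T'), ('TR', 'DR')}, 'DR': {('SR', 'D')}}


def cyk_parse(chords, goal='TR'):
    """Return True if the chord-symbol sequence can be grouped into a single
    tonic region under the toy grammar above (cf. the tree building of Sec. 2.1)."""
    n = len(chords)
    chart = [[set() for _ in range(n)] for _ in range(n)]      # chart[i][j]: spans i..j

    def unary_closure(cell):
        changed = True
        while changed:
            changed = False
            for left, rights in UNARY_RULES.items():
                if left not in cell and cell & rights:
                    cell.add(left)
                    changed = True

    for i, chord in enumerate(chords):
        chart[i][i] = {f for f, symbols in TERMINAL_RULES.items() if chord in symbols}
        unary_closure(chart[i][i])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for split in range(i, j):
                for left, pairs in BINARY_RULES.items():
                    for b, c in pairs:
                        if b in chart[i][split] and c in chart[split + 1][j]:
                            chart[i][j].add(left)
            unary_closure(chart[i][j])
    return goal in chart[0][n - 1]


# cyk_parse(['IV', 'V', 'I'])  -> True  (predominant-dominant-tonic groups into one TR)
# cyk_parse(['V', 'IV', 'I'])  -> False (the retrogression cannot be grouped)
```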
As a music theorist, Reger had published a book Modulation, showing how to modulate from C major / minor to every other key [12]. Modulation to every other key is possible, but modulation to harmonically closer keys is more common [10]. For instance, if the music is originally in C major, it is more probable to modulate to G major instead of B major. Lerdahl’s chordal distance [8] is used to measure the distance between different keys. Here Rohrmeier’s modulation rules in [13] are not used. Instead, a tonal grammar tree is built for each new key section, and the key path with the best score is chosen. Any key changes explainable by tonicization (temporary borrowing of chords from other keys), such as the chords [I V/V V I], is not considered as a modulation. Figure 2 shows an example of tonal grammar tree with modulation, from E minor to D minor. It is presented by two disjunct trees. 3. SYSTEM BUILDING BLOCKS 3.1 Overview The proposed key partitioning system is shown as in Figure 3. This system takes a sequence of chord labels (e.g. A minor, E major) and outputs the best key path. The path may consist of only one key, or several keys. For example, [F F F F F F] or [Am Am Am C C C] (m indicates minor 502 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 2. Example of a tonal grammar tree with modulation Path no. 1 2 chords, other chords are major) are both valid key paths. The Tonal Grammar Tree mentioned in §2.1 is the main tool used in this system. 3 4 3.2 Algorithm for key partitioning Each key section is assumed to have at least one tonic chord. The top of each grammar tree must be TR (tonic region), so the key section is a complete tonal grammar tree by itself. Furthermore, the minimum length of each key section is assumed to be 3 chords. However, if no valid paths can be found, key sections with only 2 chords are also considered. The algorithm is as follows: Gm Gm B B Gm Gm B B Key paths Gm Am Gm C B Am B C Am C Am C Am C Am C Table 2. All valid key paths in the example 6. If no valid paths can be found, go back to step 4 and change the requirement to “at least 2 chords”. Else proceed to step 7. 7. Evaluate the path score of all valid paths and select the one with the highest score to be the best key path. A simple example is used to illustrate this process. The input chord sequence is [B F Gm Am G C]. Incomplete trees with the keys (B, F, Gm, Am, G, C) are built. As all the trees are incomplete, proceed to step 3 and the accumulated length is calculated. The B major tree is shown in Figure 4 as an example. Other five trees (F, Gm, Am, G, C) were built in the same fashion. Either key sections 1-3 or 1-4 of Bmajor are valid key sections as they can all be grouped into a single TR and they have at least 3 chords. Then all the valid key paths were found and they are listed in Table 2. All the path scores were evaluated by the equation (1) of the next section. 1. In a chord sequence, hypothesize any of the chord label as the tonic of a key. Derive the tonal grammar tree of each key. 2. Find if there is any key that can build a single complete tree for the entire sequence. If yes, limit the valid paths to these single-key paths and go to step 7. This phrase is assumed to have a single key only. Else go to next step. 3. For each chord label in the sequence, find the maximum possible accumulated chord sequence length of each key section (up to that label). 
Determine if this sequence is breakable at that label (The secondary dominant chord is dependent on the subsequent chord. For example, the tonicization segment V/V V cannot be broken in the middle, as V/V is dependent on V chord). 3.3 Formulation We have several criteria for choosing the best key path. A good choice of a key section should be rich in tonic and dominant chords, as they are the most important chords to define and establish a key [10]. It is more preferable if the key section starts and ends with the tonic chord, and with less tonicizations as a simpler explanation is better than a complicated one. In a music excerpt, less modulations and modulations to closer keys are preferred. We formulate 4. Find out all possible key sections with at least 3 chords including at least one tonic chord. 5. Find out all valid paths traversing all the possible key sections, from beginning to end, in a brute-force manner. 503 15th International Society for Music Information Retrieval Conference (ISMIR 2014) added by the experienced musician 2 . All the chord types have been mapped to their roots: major or minor. There are 25 excerpts with a single key and 21 excerpts with key changes (one to four key changes). The longest excerpt has 47 chords whereas the shortest excerpt has 8 chords. The instrumentation ranges from solo piano to orchestral. As we assume the input chord sequence to be harmonically complete, the last chord of excerpts 9, 14 and 15 were truncated as they are the starting chord of another phrase. There are 866 chords in total. For every excerpt, the partitioning algorithm in §3.2 is used to obtain the best path. Figure 4. The incomplete B major Tree 4.2 Baseline system these criteria with equation (1): To the best of author’s knowledge, there is currently no key partitioning algorithm directly use chord labels as input. To compare the performance of our key partitioning system, another system based on Krumhansl’s harmonichierarchy information and dynamic programming were set up. Krumhansl’s key profile has been used in many note-based key tracking systems such as [3, 9]. Here Krumhansl’s harmonic-hierarchy ratings (listed in Chapter 7 of [6]) are used to obtain the perceptual closeness of a chord in a particular key. A higher rating corresponds to a higher tendency to be part of the key. As a fair comparison, the number of chords in a key section is restricted to be at least three, which is the same in our system. To prevent fluctuations of the key, a penalty term D(x, y) is imposed on key changes. The multiplicative constant of penalty term α is determined experimentally to give the best result. The best key path is found iteratively by the dynamic programming technique presented by equations (2) and (3): Stotal = aStd − bSton − cScost + dSstend − eSsect (1) where S td is the no. of tonic and dominant chords, Ston is the total number of tonicization steps. For example, in chord progression V/V/ii V/ii ii, the first chord has two steps, while the second chord has one step. Ston = 2 + 1 + 0 = 3. Scost is the total modulation cost: the total tonal distance of each modulation measured by Lerdahl’s distance defined in [8]. Sstend indicates whether the excerpt starts and ends with tonic or not. Ssect is the total number of key sections. If a key section has only 2 chords, it is counted as 3 in Ssect as a penalty. These parameters control how well chords fit in a key section against how often the modulation occurs. 
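Once the five statistics of a candidate path are known, the score of equation (1) is straightforward to compute. The following is a minimal sketch, using the coefficient values reported just below and assuming the caller has already normalised the statistics across the candidate paths.

```python
def path_score(S_td, S_ton, S_cost, S_stend, S_sect,
               coeffs=(1.0, 0.4, 2.0, 2.0, 0.4)):
    """Path score of Eq. (1). The five arguments are the normalised statistics
    of a candidate key path; the default coefficients are the (a, b, c, d, e)
    values reported just below."""
    a, b, c, d, e = coeffs
    return a * S_td - b * S_ton - c * S_cost + d * S_stend - e * S_sect


# The path with the highest score is selected, e.g.
#   best_path = max(valid_paths, key=lambda p: path_score(*path_statistics(p)))
# where path_statistics is a hypothetical helper computing the five statistics.
```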
S td , Ston and Sstend maximizes fitness of the chord sequence to a key section. Scost and Ssect induce penalty whenever modulation occurs. The parameters S td , Ston , Scost , Sstend and Ssect are normalized so that their mean and standard deviation are 0 and 1 respectively. All the coefficients, namely a, b, c, d, e, are determined experimentally, although a slightly different set of values does not have a large effect on the key partitioning results. They are set at [a, b, c, d, e] = [1, 0.4, 2, 2, 0.4]. Key structure is generally thought to be hierarchical. An excerpt may have one level of large-scale key changes and another level of tonicizations [17], and the boundary is not well-defined. So it seemed fair to adjust these parameters in order to match the level of key changes labeled by the ground truth. The key path with the highest Stotal is chosen as the best path. Ax [1] = Hx [1] Ax [n − 1] + Hx [n], Ax [n] = max Ay [n − 1] + Hx [n] − αD(x, y) ∀x, y ∈ K, where y = x (2) 3 (3) Hx [n] is the harmonic-hierarchy rating of the nth chord with the key x. Ax [n] is the accumulated key strength of the nth chord when the current key is x. K is the set of all possible keys. D(x, y) is the distance between keys x, y based on the key distance in [6] derived from multidimensional scaling. The best path can be found by obtaining the largest Ax of the last chord and tracking all the way back to Ax [1]. The same Kostka-Payne corpus chord labels were used to test this baseline system. The best result was obtained by setting α = 4.5. 4. EXPERIMENTS 4.1 Settings To test the system, we have chosen the Kostka-Payne corpus, which contains classical music excerpts in a theory book [5]. This selection has 46 excerpts, covering compositions of many famous composers. They serve as representative examples of classical music in common practice period (around 1650-1900). All of the excerpts were examined. This corpus has ground truth key information labeled by David Temperley 1 . The mode (major or minor) of the key was labeled by an experienced musician. The chord labels are also available from the website, with the mode 1 ∀x ∈ K 4.3 Results The key partitioning result of our proposed system and the baseline system were compared against the ground truth provided by Temperley. Four kinds of result metrics were used. The average matching score is shown in Figure 5. 2 All the chord and key labels can be found here: https://drive.google.com/file/d/0B0Td6LwTULvMVJ6MFcyYWsxVzQ/edit?usp=sharing http://www.theory.esm.rochester.edu/temperley/kp-stats/ 504 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Key relation Dominant Supertonic Relative Parallel Minor 3rd Major 3rd Leading tone Tritone total no. 35 32 11 11 9 8 3 2 % 32.7 29.9 10.3 10.3 8.4 7.5 2.8 1.9 Table 3. Eight categories of the 107 error labels Figure 5. Key partitioning result, with 95% confidence interval chord symbols F major B major Exact indicates the exact matches between the obtained key path and the ground truth. As modulation is a gradual process, the exact location of key changes may not be definitive. It is more meaningful to consider Inexact. For inexact, the obtained key is also considered as correct if it matches the key of the previous or next chord. MIREX refers to the MIREX 2014 Audio key detection evaluation standard 3 . Harmonically close keys will be given a partial point. Perfect fifth is awarded with 0.5 points, relative minor/ major 0.3 points, whereas parallel major/ minor 0.2 points. 
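For reference, the partial-credit scheme just described can be written as a small scoring function. This is one reading of the weighting (tonics as pitch classes 0-11, fifths accepted in either direction), not the official MIREX evaluation code.

```python
def mirex_key_score(estimated, reference):
    """Partial-credit score for a (tonic_pitch_class, mode) pair, with mode in
    {'major', 'minor'} and tonic pitch classes 0-11."""
    est_pc, est_mode = estimated
    ref_pc, ref_mode = reference
    if estimated == reference:
        return 1.0
    if est_mode == ref_mode and (est_pc - ref_pc) % 12 in (5, 7):
        return 0.5                      # perfect-fifth relation
    if est_mode != ref_mode:
        minor_pc, major_pc = (est_pc, ref_pc) if est_mode == 'minor' else (ref_pc, est_pc)
        if (minor_pc + 3) % 12 == major_pc:
            return 0.3                  # relative major/minor (e.g. A minor vs. C major)
        if est_pc == ref_pc:
            return 0.2                  # parallel major/minor
    return 0.0
```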
This is useful as sometimes a chord progression may be explainable by two different related keys. MIREX in refers to the MIREX standard, but with the addition that the points of previous or next chord will also be considered and the maximum point will be chosen as the matching score of that chord. The proposed system outperforms the baseline system by about 18% for exact or inexact matching and 0.1 points for MIREX-related scores. It shows that our knowledgebased tonal grammar tree system is better than the baseline system which is based on perceptual closeness. Tonal structural information is exploited, so we have a better understanding of the chord progression and modulations. Gm ii vi C V V/V F I V B IV I Gm ii vi C V V/V F I V Table 4. Analysis with two different keys Modulations between keys that are supertonicallyrelated (differs by 2 semitones) or relative major / minor have a similar problem as the dominant key modulation. Many common chords are shared among both keys, so it is easy to confuse these two keys. It is worth to mention that nine of the supertonically-related errors came from excerpt 45. In Temperley’s key labels, the whole excerpt is labeled as C major with measures 10-12 considered as a passage of tonicization. However, in [5], it was written that “Measures 10-12 can be analyzed in terms of secondary functions or as a modulation”. If the measures 10-12 are considered as a modulation to D minor, then the analysis of these nine chords is correct. The parallel key modulation, for example from C major to C minor, has a different problem. Sometimes composers tend to start the phrase with a new mode (major or minor) without much preparation, as the tonic is the same. Fluctuation between major and minor of the same key has always been common [10]. When the phrase information is absent, the exact position of modulation cannot be found by the proposed system. In another way, there may exist some ornament notes that obscure the real identity of a chord, so that the chord symbol analyzed acoustically is different from the chord symbol analyzed structurally or grammatically. For example, in Figure 6, the first two bars should be analyzed as IV 6 -viiφ7 -I progression in A major. However, the C of the I chord is delayed to the next chord. The appoggiatura B made the I chord sound as a i chord, the tonic minor chord instead. Similarly, the last two bars should be analyzed as IV 6/5 -viio7 -i in F minor. However, the passing note A made the i chord sound as a I chord, the original A is delayed to the next chord. In these two cases, the key derived by the last chord in the progression is in conflict with the other chords. Hence the key will be recognized wrongly if the acoustic chord symbol is provided instead of the structural chord symbol. 4.4 Error analysis The ground truth key information are compared against the key labels generated by the proposed algorithm. 17 boundary errors were detected, ie. the key label of the previous or next chord was recognized instead. In classical music, modulation is usually not a sudden event. It occurs gradually through several pivot chords (chords common to both keys) [10]. Therefore it is sometimes subjective to determine the boundary between two key sections. It may not be a wrong labeling if the boundary is different from the ground truth. Other types of error are listed in Table 3. The most common error is the misclassification as dominant key, which is the closest related key [10]. It shares many common chords with the tonic key. 
From Table 4, the same chord sequence can be analyzed by two keys that are dominantly-related. Although the B major analysis contains more tonicizations, the resultant score disadvantage may be outweighed by the cost of key changes, if it is followed by a B major section. 3 Semitone difference 7 2 3 0 3 4 1 6 5. DIFFICULTIES The biggest problem of this research is lack of labeled data. To the best of our knowledge, large chord label database http://www.music-ir.org/mirex/wiki/2014:Audio Key Detection 505 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [3] E. Gómez and P. Herrera. Estimating The Tonality Of Polyphonic Audio Files: Cognitive Versus Machine Learning Modelling Strategies. In ISMIR, pages 1–4, 2004. [4] B. Hyer. Key (i). In S. Sadie, editor, The New Grove Dictionary of Music and Musicians. Macmillan Publishers, London, 1980. Figure 6. Excerpt from Mozart’s Piano Concerto no. 23, 2nd movement [5] S. M. Kostka and D. Payne. Workbook for tonal harmony, with an introduction to twentieth-century music. McGraw-Hill, New York, 3rd ed. edition, 1995. for classical music is absent. The largest database we could find is the Kostka-Payne corpus used in this paper. In the future, we may consider manually label more music pieces to check if the system works generally well in classical music. Moreover, key partitioning is sometimes subjective to listener’s perception. In some cases, there are several pivot chords to establish the new key center. “Ground truth” boundaries of key sections are sometimes set arbitrarily. Or there are several sets of acceptable and sensible partitions of key sections. This problem is yet to be studied. Inconsistency between acoustic and structural chord symbols mentioned in §4.4 is also yet to be solved. For any rulebased systems, exceptions may occur. Composers may deliberately break some traditions in the creative process. It is not possible to handle all these exceptional cases. [6] C. L. Krumhansl. Cognitive Foundations of Musical Pitch. Oxford University Press, New York, 1990. [7] K. Lee and M. Slaney. Acoustic Chord Transcription and Key Extraction From Audio Using Key-Dependent HMMs Trained on Synthesized Audio. In Array, editor, Ieee Transactions On Audio Speech And Language Processing, volume 16, pages 291–301. Ieee, 2008. [8] F. Lerdahl. Tonal pitch space. Oxford University Press, Oxford, 2001. [9] H. Papadopoulos and G. Peeters. Local Key Estimation From an Audio Signal Relying on Harmonic and Metrical Structures. IEEE Transactions on Audio, Speech, and Language Processing, 20(4):1297–1312, May 2012. 6. FUTURE WORK AND CONCLUSION We have only considered major and minor chords in this paper. As dominant 7th and diminished chords are common in classical music, we may consider expanding the chord type selection to make chord labels more accurate. The current system assumes chord labels to be free of errors. We plan to study the method of key tracking in the presence of chord label errors. Then we may incorporate this system to the chord classification system for audio key detection, as the key and chord progression is interdependent. Currently the input phrases must be complete in order to make this tree building process work. We plan to find the key partition method for incomplete input phrases. A more efficient algorithm for tree building process, instead of brute-force, is yet to be discovered. Then less trees are required to be built. 
In this paper, we have discussed the uses of tonal grammar to partition key sections of classical music. The proposed system outperforms the baseline system which uses dynamic programming on Krumhansl’s harmonichierarchy ratings. This tonal grammar is useful for tonal classical music information retrieval and hopefully more uses can be found. 7. REFERENCES [1] A. Cadwallader and D. Gagné. Analysis of Tonal Music: A Schenkerian Approach. Oxford University Press, Oxford, 1998. [2] B. Catteau, J. Martens, and M. Leman. A probabilistic framework for audio-based tonal key and chord recognition. Advances in Data Analysis, (2005):1–8, 2007. [10] W. Piston. Harmony. W. W. Norton, New York, rev. ed. edition, 1948. [11] C. Raphael and J. Stoddard. Functional harmonic analysis using probabilistic models. Computer Music Journal, pages 45–52, 2004. [12] M. Reger. Modulation. Dover Publications, Mineola, N.Y., dover ed. edition, 2007. [13] M. Rohrmeier. Towards a generative syntax of tonal harmony. Journal of Mathematics and Music, 5(1):35– 53, Mar. 2011. [14] H. Schenker. Free Composition. Longman, New York, London, 1979. [15] A. Shenoy and R. Mohapatra. Key determination of acoustic musical signals. 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), pages 1771–1774, 2004. [16] R. Steblin. A history of key characteristics in the eighteenth and early nineteenth centuries. University of Rochester Press, Rochester, NY, 2nd edition, 2002. [17] D. Temperley. The cognition of basic musical structures. MIT Press, Cambridge, Mass., 2001. [18] D. H. Younger. Recognition and parsing of contextfree languages in time n3 . Information and Control, 10(2):189–208, 1967. 506 15th International Society for Music Information Retrieval Conference (ISMIR 2014) BAYESIAN SINGING-VOICE SEPARATION Po-Kai Yang, Chung-Chien Hsu and Jen-Tzung Chien Department of Electrical and Computer Engineering, National Chiao Tung University, Taiwan {niceallen.cm01g, chien.cm97g, jtchien}@nctu.edu.tw ABSTRACT and background music should be collected. But, it is more practical to conduct the unsupervised learning for blind source separation by using only the mixed test data. In [13], the repeating structure of the spectrogram of the mixed music signal was extracted and applied for separation of music and voice. The repeating components from accompaniment signal were separated from the nonrepeating components from vocal signal. A binary timefrequency masking was applied to identify the repeating background accompaniment. In [9], a robust principal component analysis was proposed to decompose the spectrogram of mixed signal into a low-rank matrix for accompaniment signal and a sparse matrix for vocal signal. System performance was improved by imposing the harmonicity constraints [22]. A pitch extraction algorithm was inspired by the computational auditory scene analysis [3] and was applied to extract the harmonic components of singing voice. This paper presents a Bayesian nonnegative matrix factorization (NMF) approach to extract singing voice from background music accompaniment. Using this approach, the likelihood function based on NMF is represented by a Poisson distribution and the NMF parameters, consisting of basis and weight matrices, are characterized by the exponential priors. A variational Bayesian expectationmaximization algorithm is developed to learn variational parameters and model parameters for monaural source separation. 
A clustering algorithm is performed to establish two groups of bases: one is for singing voice and the other is for background music. Model complexity is controlled by adaptively selecting the number of bases for different mixed signals according to the variational lower bound. Model regularization is tackled through the uncertainty modeling via variational inference based on marginal likelihood. The experimental results on MIR-1K database show that the proposed method performs better than various unsupervised separation algorithms in terms of the global normalized source to distortion ratio. In general, the issue of singing-voice separation is seen as a single-channel source separation problem which could be solved by using the learning approach based on the nonnegative matrix factorization (NMF) [10, 19]. Using NMF, a nonnegative matrix is factorized into a product of a basis matrix and a weight matrix which are nonnegative [10]. NMF can be directly applied in Fourier spectrogram domain for audio signal processing. In [7], the nonnegative sparse coding was proposed to conduct sparse learning for overcomplete representation based on NMF. Such sparse coding provides efficient and robust solution to NMF. However, how to determine the regularization parameter for sparse representation is a key issue for NMF. In addition, the time-varying envelopes of spectrogram convey important information. In [16], one dimensional convolutive NMF was proposed to extract the bases, which considered the dependencies across successive columns of input spectrogram, and was applied for supervised singlechannel speech separation. In [14], two dimensional NMF was proposed to discover fundamental bases for blind musical instrument separation in presence of harmonic variations from piano and trumpet. Number of bases was empirically determined. Nevertheless, the selection of the number of bases is known as a model selection problem in signal processing and machine learning. How to tackle this regularization issue plays an important role to assure generalization for future data in ill-posed condition [1]. 1. INTRODUCTION Singing voice conveys important information of a song. This information is practical for many music-related applications including singer identification [11], music emotion annotation [21], melody extraction, lyric recognition and lyric synchronization [6]. However, singing voice is usually mixed with background accompaniment in a music signal. How to extract the singing voice from a singlechannel mixed signal is known as a crucial issue for music information retrieval. Some approaches have been proposed to deal with single-channel singing-voice separation. There are two categories of approaches to source separation: supervised learning [2] and unsupervised learning [8, 9, 13, 22]. Supervised approach conducts the singlechannel source separation given by the labeled training data from different sources. In the application of singingvoice separation, the separate training data of singing voice c Po-Kai Yang, Chung-Chien Hsu and Jen-Tzung Chien. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Po-Kai Yang, Chung-Chien Hsu and Jen-Tzung Chien. “BAYESIAN SINGING-VOICE SEPARATION”, 15th International Society for Music Information Retrieval Conference, 2014. Basically, uncertainty modeling via probabilistic framework is helpful to improve model regularization for NMF. 
507 15th International Society for Music Information Retrieval Conference (ISMIR 2014) the approximated data BW The uncertainties in singing-voice separation may come from improper model assumption, incorrect model order and possible noise interference, nonstationary environment, reverberant distortion. Under probabilistic framework, nonnegative spectral signals are drawn from probability distributions. The nonnegative parameters are also represented by prior distributions. Bayesian learning is introduced to deal with uncertainty decoding and build a robust source separation by maximizing the marginal likelihood over the randomness of model parameters. In [15], Bayesian NMF (BNMF) was proposed for image feature extraction based on the assumption of Gaussian likelihood and exponential prior. In the BNMF [4], an approximate Bayesian inference based on variational Bayesian (VB) algorithm using Poisson likelihood for observation data and Gamma prior for model parameters was proposed for image reconstruction. Implementation cost was demanding due to the numerical calculation of shape parameter. Although NMF was presented for singing-voice separation in [19, 23], the regularization issue was ignored and the sensitivity of system performance due to uncertain model and ill-posed condition was serious. This paper presents a new model-based singing-voice separation. The novelties of this paper are twofold. The first one is to develop Bayesian approach to unsupervised singing-voice separation. Model uncertainty is compensated to improve the performance of source separation of vocal signal and background accompaniment signal. Number of bases is adaptively determined from the mixed signal according to the variational lower bound of the logarithm of a marginal likelihood over NMF basis and weight matrices. The second one is the theoretical contribution in Bayesian NMF. We construct a new Bayesian NMF where the likelihood function in NMF is drawn from Poisson distribution and the model parameters are characterized by exponential distributions. A closed-form solution to hyperparameters using the VB expectation-maximization (EM) [5] algorithm is derived for ease of implementation and computation. This BNMF is connected to standard NMF with sparseness constraint. But, using the BNMF, the regularization parameters or hyperparameters are optimally estimated from training data without empirical selection from validation data. Beyond the approaches in [4, 15], the proposed BNMF completely considers the dependencies of the variational objective on hyperparameters and derives the analytical solution to singing-voice separation. (Xmn log m,n Xmn + [BW]mn − Xmn ) [BW]mn (1) 2.1 Maximum Likelihood Factorization NMF approximation is revisited by introducing the probabilistic framework based on maximum likelihood (ML) theory. The nonnegative latent variable Zmkn is embedded in data entry Xmn by Xmn = k Zmkn and is represented by a Poisson distribution with mean Bmk Wkn , i.e. Zmkn ∼ Pois(Zmkn ; Bmk Wkn ) [4]. Log likelihood function of data matrix X given parameters Θ is expressed by log p(X|B, W) = log = Pois(Xmn ; m,n Bmk Wkn ) k (Xmn log[BW]mn − [BW]mn − log Γ(Xmn + 1)) (2) m,n where Γ(·) is the gamma function. Maximizing the log likelihood function in Eq. (2) based on Poisson distribution is equivalent to minimizing the KL divergence between X and BW in Eq. (1). This ML problem with missing variables Z = {Zmkn } can be solved according to EM algorithm. 
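As noted immediately below, the M-step updates coincide with the multiplicative rules of standard NMF [10]. A minimal sketch of these KL-divergence updates, assuming a nonnegative magnitude spectrogram X, is given here as an illustration, not the authors' implementation.

```python
import numpy as np


def kl_nmf(X, K, n_iter=100, eps=1e-12, seed=0):
    """Multiplicative-update NMF of Lee and Seung [10] minimising the KL
    divergence of Eq. (1). X is a nonnegative (M x N) magnitude spectrogram,
    K the number of bases."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    B = rng.random((M, K)) + eps
    W = rng.random((K, N)) + eps
    for _ in range(n_iter):
        V = B @ W + eps
        B *= ((X / V) @ W.T) / (W.sum(axis=1)[None, :] + eps)
        V = B @ W + eps
        W *= (B.T @ (X / V)) / (B.sum(axis=0)[:, None] + eps)
    return B, W
```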
In E step, the expectation function of the log likelihood of data X and latent variable Z given new parameters B(τ +1) and W(τ +1) is calculated with respect to Z under current parameters B(τ ) and W(τ ) . In M step, we maximize the resulting auxiliary function to obtain the updating of NMF parameters which is equivalent to that of standard NMF in [10]. 2.2 Bayesian Factorization ML estimation is prone to find an over-trained model [1]. To improve model regularization, Bayesian approach is introduced to establish NMF for single-source separation. ML NMF was improved by considering the priors of basis matrix B and weight matrix W for Bayesian NMF (BNMF). Different specifications of likelihood function and prior distribution result in different solutions with different inference procedures. In [15], the approximation error of Xmn using k Bmk Wkn is modeled by a zeromean Gaussian distribution Xmn ∼ N (Xmn ; Bmk Wkn , σ 2 ) (3) k with the variance parameter σ 2 which is distributed by an inverse gamma prior. The priors of nonnegative Bmk and Wkn are modeled by the exponential distributions 2. NONNEGATIVE MATRIX FACTORIZATION Lee and Seung [10] proposed the standard NMF where no probabilistic distribution was assumed. Given a nonnega×N tive data matrix X ∈ RM , NMF aims to decompose + data matrix X into a product of two nonnegative matrices ×K B ∈ RM and W ∈ RK×N . The (m, n)-th + + entry of X is approximated by Xmn ≈ [BW]mn = k Bmk Wkn . NMF parameters Θ = {B, W} consist of basis matrix B and weight matrix W. The approximation based on NMF is optimized by minimizing the Kullback-Leibler (KL) divergence DKL (X BW) between the observed data X and Bmk ∼ Exp(Bmk ; λbmk ), Wkn ∼ Exp(Wkn ; λw kn ) (4) where Exp(x; θ) = θ exp(−θx), with means (λbmk )−1 and −1 (λw , respectively. Typically, the larger the exponenkn ) tial hyperparameter θ is involved, the sparser the exponential distribution is shaped. The sparsity of basis parameter Bmk and weight parameter Wkn is controlled by hyperparameters λbmk and λw kn , respectively. In [15], the hyperparameters {λbmk , λw } kn were fixed and empirically determined. The Gaussian likelihood does not adhere to 508 15th International Society for Music Information Retrieval Conference (ISMIR 2014) the assumption of nonnegative data matrix X. The other weakness in the BNMF [15] is that the exponential distribution is not conjugate prior to the Gaussian likelihood function for NMF. There was no closed-form solution. The parameters Θ = {B, W, σ 2 } were accordingly estimated by Gibbs sampling procedure where a sequence of posterior samples of Θ was drawn by the corresponding conditional posterior probabilities. Cemgil [4] proposed the BNMF for image reconstruction based on the Poisson likelihood function as given in Eq. (2) and the gamma priors for basis and weight matrices. The gamma distribution, represented by a shape parameter and a scale parameter, is known as the conjugate prior to Poisson likelihood function. Variational Bayesian (VB) inference procedure was developed for NMF implementation. However, the shape parameter was implemented by the numerical solution. The computation cost was relatively high. Some dependencies of variational lower bound on model parameters were ignored in [4]. The resulting parameters did not reach true optimum of variational objective. evidence function is meaningful to act as an objective for model selection which balances the tradeoff between data fitness and model complexity [1]. 
In the singing-voice separation based on NMF, this objective is used to judge which number of bases K should be selected. The selected number is adaptive to fit different experimental conditions with varying lengths and the variations from different singers, genders, songs, genres, instruments and music accompaniments. Model regularization is tackled accordingly. But, using NMF without Bayesian treatment, the number of bases was fixed and empirically determined. 3.2 Variational Bayesian Inference The exact Bayesian solution to optimization problem in Eq. (6) does not exist because the posterior probability of three latent variables {Z, B, W} given the observed mixtures X could not be factorized. To deal with this issue, the variational Bayesian expectation-maximization (VB-EM) algorithm is developed to implement Poisson-Exponential BNMF. VB-EM algorithm applies the Jensen’s inequality and maximizes the lower bound of the logarithm of marginal likelihood 3. NEW BAYESIAN FACTORIZATION This study aims to find an analytical solution to full Bayesian NMF by considering all dependencies of variational lower bound on regularization parameters. Regularization parameters are optimally estimated. log p(X|Θ) ≥ Z DKL (X||BW) + m,k + λw kn Wkn In VB-E step, a general solution to variational distribution qj of an individual latent variable j ∈ {Z, B, W} is obtained by [1] log q̂j ∝ Eq(i=j) [log p(X, Z, B, W|Θ)]. (8) (5) Given the variational distributions defined by k,n p(X|Z, B, W)p(Z|B, W)p(B, W|Θ)dBdW (7) 3.2.1 VB-E Step where the terms independent of Bmk and Wkn are treated as constants. Notably, the regularization terms (2nd and 3rd terms) in this objective are nonnegative and seen as the 1 regularizers [18] which are controlled by hyperparameters {λbmk , λw kn }. These regularizers impose sparseness in the estimated MAP parameters. However, MAP estimates are seen as point estimates. The randomness of parameters is not considered in model construction. To conduct full Bayesian treatment, BNMF is developed by maximizing the marginal likelihood p(X|Θ) over latent variables Z as well as NMF parameters {B, W} p(X, Z, B, W|Θ) q(Z, B, W) where H[·] is an entropy function. The factorized variational distribution q(Z, B, W) = q(Z)q(B)q(W) is assumed to approximate the true posterior distribution p(Z, B, W|X, Θ). In accordance with the Bayesian perspective and the spirit of standard NMF, we adopt the Poisson distribution as likelihood function and the exponential distribution as conjugate prior for NMF parameters Bmk and Wkn with hyperparameters λbmk and λw kn , respectively. Maximum a posteriori (MAP) estimates of parameters Θ = {B, W} are obtained by maximizing the posterior distribution or minimizing − log p(B, W|X) which is arranged as a regularized KL divergence between X and BW λbmk Bmk q(Z, B, W) log × dBdW = Eq [log p(X, Z, B, W|Θ)] + H[q(Z, B, W)] 3.1 Bayesian Objectives b q(Bmk ) = Gam(Bmk ; αbmk , βmk ) w w q(Wkn ) = Gam(Wkn ; αkn , βkn ) (9) q(Zmkn ) = Mult(Zmkn ; Pmkn ) b b w w the variational parameters {αmk , βmk , αkn , βkn , Pmkn } in three distributions are estimated by α̂bmk =1+ Zmkn , b β̂mk = n α̂w kn = 1 + m P̂mkn (6) Z w Zmkn , β̂kn = −1 Wkn + λbmk n Bmk + λw kn −1 (10) k exp(log Bmk + log Wkn ) = j exp(log Bmj + log Wjn ) where the expectation function Eq [·] is replaced by · for simplicity. By substituting the variational distribution into and estimating the sparsity-controlled hyperparameters or regularization parameters Θ = {λbmk , λw mk }. 
The resulting 509 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Eq. (7), the variational lower bound is obtained by BL = − + Bmk Wkn m,n,k (− log Γ(Xmn + 1) − m,n + log Bmk k Zmkn + n m,k + Zmkn log P̂mkn ) log Wkn (log λbmk − λbmk Bmk ) + Zmkn m k,n m,k + the Gaussian-Exponential BNMF in [15] and the PoissonGamma BNMF in [4]. The superiorities of the proposed method to the BNMFs in [15, 4] are twofold. First, assuming the exponential priors provides a BNMF approach with tractable solution as given in Eq. (13). Gibbs sampling in [15] and Newton’s solution in [4] are computationally expensive. Second, the dependencies of three terms of the variational lower bound in Eq. (11) on hyperparameters λbmk or λw kn are all considered in finding the true optimum while some dependencies were ignored in the solution to Poisson-Gamma BNMF [4]. Also, the observations in Gaussian-Exponential BNMF [15] were not constrained to be nonnegative. w (log λw kn − λkn Wkn ) k,n b (−(α̂bmk − 1)Ψ(α̂bmk ) + log β̂mk + α̂bmk + log Γ(α̂bmk )) m,k + w w w w (−(α̂w kn − 1)Ψ(α̂kn ) + log β̂kn + α̂kn + log Γ(α̂kn )) k,n (11) where Ψ(·) is the derivative of the log gamma function, and is known as a digamma function. 4. EXPERIMENTS 4.1 Experimental Setup 3.2.2 VB-M Step We used the MIR-1Kdataset [8] to evaluate the proposed method for unsupervised singing-voice separation from background music accompaniment. The dataset consisted of 1000 song clips extracted from 110 Chinese karaoke pop songs performed by 8 female and 11 male amateurs. Each clip recorded at 16 KHz sampling frequency with the duration ranging from 4 to 13 seconds. Since the music accompaniment and the singing voice were recorded at left and right channels, we followed [8, 9, 13] and simulated three different sets of monaural mixtures at signal-to-musicratios (SMRs) of 5, 0, and -5 dB where the singing-voice was treated as signal and the accompaniment was treated as music. The separation problem was tackled in the shorttime Fourier transform (STFT) domain. The 1024-point STFT was calculated to obtain the Fourier magnitude spectrograms with frame duration of 40 ms and frame shift of 10 ms. In the implementation of BNMF, ML-NMF was adopted as the initialization and 50 iterations were run to find the posterior means of basis and weight parameters. To evaluate the performance of singing-voice separation, we measure the signal-to-distortion ratio (SDR) [20] and then calculate the normalized SDR (NSDR) and the global NSDR (GNSDR) as In VB-M step, the optimal regularization parameters Θ = {λbmk , λw kn } are derived by maximizing Eq. (11) with respect to Θ and yielding b ∂ log βmk 1 ∂BL = b − Bmk + =0 b b ∂λmk λmk ∂λmk w ∂ log βkn 1 ∂BL = w − Wkn + = 0. w w ∂λkn λkn ∂λkn (12) Accordingly, the solution to BNMF hyperparameters is derived by solving a quadratic equation where nonnegative constraint is considered to find positive values of hyperparameters by λ̂bmk 1 = 2 λ̂w kn = 1 2 − Wkn + n − m ( Bmk + Wkn )2 +4 n Wkn Bmk m Bmk 2 ( Bmk ) + 4 Wkn m (13) n b b w w where Bmk = αmk βmk and Wkn = αkn βkn are obtained as the means of gamma distributions. VB-E step and VB-M step are alternatively and iteratively performed to estimate BNMF parameters Θ with convergence. It is meaningful to select the best number of bases (K) with the largest lower bound of the log marginal likelihood which integrates out the parameters of weight and basis matrices. 
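Putting the VB-E updates of Eq. (10) and the hyperparameter updates of Eq. (13) together, one possible reading of the resulting VB-EM loop is sketched below. The Gamma-posterior expectations ⟨B⟩ = αβ and ⟨log B⟩ = ψ(α) + log β are standard assumptions; the closed form for the λ parameters is our reconstruction of Eq. (13) and should be checked against the original paper. This is an illustrative sketch, not the authors' code.

```python
import numpy as np
from scipy.special import digamma


def bnmf_poisson_exp(X, K, n_iter=50, eps=1e-12, seed=0):
    """One reading of the VB-EM updates for the Poisson-Exponential BNMF
    (Eqs. (10) and (13)). X is a nonnegative (M x N) magnitude spectrogram,
    K the number of bases."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    EB = rng.random((M, K)) + eps        # posterior means <B>, <W>
    EW = rng.random((K, N)) + eps
    ElogB, ElogW = np.log(EB), np.log(EW)
    lam_b = np.ones((M, K))              # exponential-prior hyperparameters
    lam_w = np.ones((K, N))
    for _ in range(n_iter):
        # VB-E step, Eq. (10): multinomial responsibilities yield sum_n <Z_mkn>
        GB, GW = np.exp(ElogB), np.exp(ElogW)
        D = GB @ GW + eps
        Sz_b = GB * ((X / D) @ GW.T)     # (M, K): sum over n of <Z_mkn>
        Sz_w = GW * (GB.T @ (X / D))     # (K, N): sum over m of <Z_mkn>
        alpha_b = 1.0 + Sz_b
        beta_b = 1.0 / (EW.sum(axis=1)[None, :] + lam_b)
        EB, ElogB = alpha_b * beta_b, digamma(alpha_b) + np.log(beta_b)
        alpha_w = 1.0 + Sz_w
        beta_w = 1.0 / (EB.sum(axis=0)[:, None] + lam_w)
        EW, ElogW = alpha_w * beta_w, digamma(alpha_w) + np.log(beta_w)
        # VB-M step: positive root of the quadratic given by our reading of Eq. (13)
        Sw = EW.sum(axis=1)[None, :]
        lam_b = 0.5 * (-Sw + np.sqrt(Sw ** 2 + 4.0 * Sw / EB))
        Sb = EB.sum(axis=0)[:, None]
        lam_w = 0.5 * (-Sb + np.sqrt(Sb ** 2 + 4.0 * Sb / EW))
    return EB, EW, lam_b, lam_w
```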
NSDR(V̂, V, X) = SDR(V̂, V) − SDR(X, V) Ñ n=1 ln NSDR(V̂n , Vn , Xn ) GNSDR(V̂, V, X) = Ñ n=1 ln 3.3 Poisson-Exponential Bayesian NMF (14) where V̂, V, X denote the estimated singing voice, the original clean singing voice, and the mixture signal, respectively, Ñ is the total number of the clips and ln is the length of the nth clip. NSDR is used to measure the improvement of SDR between the estimated singing voice V̂ and the mixture signal X. GNSDR is used to calculate the overall separation performance by taking the weighted mean of the NSDRs. To the best of our knowledge, this is the first study where a Bayesian approach is developed for singing-voice separation. The uncertainties in singing-voice separation due to a variety of singers, songs and instruments could be compensated. Model selection problem is tackled as well. In this study, total number of basis vectors K is adaptively selected for individual mixed signal according to the variational lower bound in Eq. (11) with the converged variab b w w tional parameters {α̂mk , β̂mk , α̂kn , β̂kn , P̂mkn } and model b w parameters {λ̂mk , λ̂kn }. Considering the pairs of likelihood function and prior distribution in NMF, the proposed method is also called the Poisson-Exponential BNMF which is different from 4.2 Unsupervised Singing-Voice Separation We implemented the unsupervised singing-voice separation where total number of bases (K) and the grouping of these bases into vocal source and music source were both 510 15th International Society for Music Information Retrieval Conference (ISMIR 2014) SMR:−5dB GNSDR(dB) 4 3.38 3 2.4 2.1 2 1.51 2.81 K-means clustering NMF clustering 1.3 1 0 −1 Hsu Huang Yang Rafii[12] Rafii[13] BNMF1 BNMF2 GNSDR(dB) NMF (50) 2.47 2.97 BNMF (adaptive) 2.92 3.25 Table 1. Comparison of GNSDR at SMR = 0 dB using NMF with fixed number of bases {30, 40, 50} and BNMF with adaptive number of bases. SMR:0dB 3 2.37 2.76 2 0 NMF (40) 2.58 3.13 −0.51 4 1 NMF (30) 2.69 3.15 2.7 2.92 3.25 1.7 0.91 Hsu Huang Yang Rafii[12] Rafii[13] BNMF1 BNMF2 SMR:5dB GNSDR(dB) 4 3 2.57 2.58 2.57 2.1 2 2.12 1.3 1 0.17 0 Hsu Huang Yang Rafii[12] Rafii[13] BNMF1 BNMF2 Figure 1. Performance comparison using BNMF1 (Kmeans clustering) and BNMF2 (NMF-clustering) and five competitive methods (Hsu [8], Huang [9], Yang [22], Rafii [12], Rafii [13]) in terms of GNSDR under various SMRs. Figure 2. Histogram of the selected number of bases using BNMF under various SMRs. learned from test data in an unsupervised way. No training data were required. Model complexity based on K was determined in accordance with the variational lower bound of log marginal likelihood in Eq. (11) while the grouping of bases for two sources was simply performed via the clustering algorithms using the estimated basis vectors in B or equivalently from the estimated variational parameters b b {αmk , βmk }. Following [17], we conducted the K-means clustering algorithm based on the basis vectors B in Melfrequency cepstral coefficient (MFCC) domain. Each basis vector was first transformed to the Mel-scaled spectrum by applying 20 overlapping triangle filters spaced on the Mel scale. Then, we took the logarithm and applied the discrete cosine transform to obtain nine MFCCs. Finally, we normalized each coefficient to zero mean and unit variance. The K-means clustering algorithm was applied to partition the feature set into two clusters through an iterative procedure until convergence. However, it is more meaningful to conduct NMF-based clustering for the proposed BNMF method. 
To do so, we transformed the basis vectors B into Mel-scaled spectrum to form the Mel-scaled basis matrix. ML-NMF was applied to factorize this Mel-scaled basis matrix into two matrices B̃ of size N -by-2 and W̃ of size 2-by-K. The soft mask scheme based on Wiener gain was applied to smooth the separation of B into basis vectors for vocal signal and music signal. This same soft mask was performed for the separation of mixed signal X into vocal signal and music signal based on the K-means clustering and NMF clustering. Finally, the separated singing voice and music accompaniment signals were obtained by the overlap-and-add method using the original phase. 4.3 Experimental Results The unsupervised single-channel separation using BNMFs (BNMF1 using K-means clustering and BNMF2 using NMF clustering) and the other five competitive systems (Hsu [8], Huang [9], Yang [22], Rafii [12], Rafii [13]) is compared in terms of GNSDR as depicted in Figure 1. Using K-means clustering in MFCC domain, the resulting BNMF1 outperforms the other five methods under SMRs of 0 dB and -5 dB while the results using Huang [9] and Yang [22] perform better than BNMF1 under 5 dB condition. This is because the methods in [9, 22] used additional pre- and/or post-processing techniques as provided in [13, 22] which were not applied in BNMF1 and BNMF2. Nevertheless, using BNMF factorization with NMF clustering (BNMF2), the overall evaluation consistently achieves around 0.33∼0.57 dB relative improvement in GNSDR compared with BNMF1 including the SMR condition at 5dB. In addition, we evaluate the effect on the adaptive basis selection using BNMF. Table 1 reports the comparison of BNMF1 and BNMF2 with adaptive basis selection and ML-NMF with fixed number of bases under SMR of 0 dB. Two clustering methods were also carried out for NMF with different K. BNMF factorization combined with NMF clustering achieves the best performance in this comparison. Figure 2 shows the histogram of the selected number of bases K using BNMF. It is obvious that this adaptive basis selection plays an important role to find suitable amount of bases to fit different experimental conditions. 511 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 5. CONCLUSIONS We proposed a new unsupervised Bayesian nonnegative matrix factorization approach to extract the singing voice from background music accompaniment and illustrated the novelty on an analytical and true optimum solution to the Poisson-Exponential BNMF. Through the VB-EM inference procedure, the proposed method automatically selected different number of bases to fit various experimental conditions. We conducted two clustering algorithms to find the grouping of bases into vocal and music sources. Experimental results showed the consistent improvement of using BNMF factorization with NMF clustering over the other singing-voice separation methods in terms of GNSDR. In future works, the proposed BNMF shall be extended to multi-layer source separation and applied to detect unknown number of sources. 6. REFERENCES [1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer Science, 2006. [2] N. Boulanger-Lewandowski, G. J. Mysore, and M. Hoffman. Exploiting long-term temporal dependencies in NMF using recurrent neural networks with application to source separation. In Proc. of ICASSP, pages 337–344, 2014. [10] D. D. Lee and H. S. Seung. Algorithms for nonnegative matrix factorization. Advances in Neural Information Processing Systems, pages 556–562, 2000. [11] A. 
Mesaros, T. Virtanen, and A. Klapuri. Singer identification in polyphonic music using vocal separation and pattern recognition methods. In Proc. of Annual Conference of International Society for Music Information Retrieval, pages 375–378, 2007. [12] Z. Rafii and B. Pardo. A simple music/voice separation method based on the extraction of the repeating musical structure. In Proc. of ICASSP, pages 221–224, 2011. [13] Z. Rafii and B. Pardo. Repeating pattern extraction technique (REPET): A simple method for music/voice separation. IEEE Transactions on Audio, Speech, Language Processing, 21(1):73–84, Jan. 2013. [14] M. N. Schmidt and M. Morup. Non-negative matrix factor 2-D deconvolution for blind single channel source separation. In Proc. of ICA, pages 700–707, 2006. [15] M. N. Schmidt, O. Winther, and L. K. Hansen. Bayesian non-negative matrix factorization. In Proc. of ICA, pages 540–547, 2009. [3] A. S. Bregman. Auditory Scene Analysis: the Perceptual Organization of Sound. MIT Press, 1990. [16] P. Smaragdis. Convolutive speech bases and their application to speech separation. IEEE Transactions on Audio, Speech, Language Processing, 15(1):1–12, 2007. [4] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, (Article ID 785152), 2009. [17] M. Spiertz and V. Gnann. Source-Filter based clustering for monaural blind source separation. In Proc. of International Conference on Digital Audio Effects, pages 1–4, 2009. [5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society (B), 39(1):1–38, 1977. [6] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno. Lyricsynchronizer: automatic synchronization system between musical audio signals and lyrics. IEEE Journal of Selected Topics in Signal Processing, 5(6):1252– 1261, 2011. [7] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5:1457–1469, 2004. [18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 58(1):267–288, 1996. [19] S. Vembu and S. Baumann. Separation of vocals from polyphonic audio recordings. In Proc. of ISMIR, pages 375–378, 2005. [20] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. IEEE Transaction on Audio, Speech and Language Processing, 14(4):1462–1469, 2006. [21] D. Yang and W. Lee. Disambiguating music emotion using software agents. In Proc. of ISMIR, pages 52–57, 2004. [8] C.-L. Hsu and J.-S. R. Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Transactions on Audio, Speech, Language Processing, 18(2):310–319, 2010. [22] Y.-H. Yang. On sparse and low-rank matrix decomposition for singing voice separation. In Proc. of ACM International Conference on Multimedia, pages 757– 760, 2012. [9] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson. Singing-voice separation from monaural recordings using robust principal component analysis. In Proc. of ICASSP, pages 57–60, 2012. [23] B. Zhu, W. Li, R. Li, and X. Xue. Multi-stage nonnegative matrix factorization for monaural singing voice separation. IEEE Transactions on Audio, Speech, Language Processing, 21(10):2096–2107, 2013. 
512 15th International Society for Music Information Retrieval Conference (ISMIR 2014) PROBABILISTIC EXTRACTION OF BEAT POSITIONS FROM A BEAT ACTIVATION FUNCTION Filip Korzeniowski, Sebastian Böck, and Gerhard Widmer Department of Computational Perception Johannes Kepler University, Linz, Austria [email protected] ABSTRACT beat at each audio position. A post-processing step selects from these activations positions to be reported as beats. However, this method struggles to find the correct beats when confronted with ambiguous activations. We contribute a new, probabilistic method for this purpose. Although we designed the method for audio with a steady pulse, we show that using the proposed method the beat tracker achieves better results even for datasets containing music with varying tempo. The remainder of the paper is organised as follows: Section 2 reviews the beat tracker our method is based on. In Section 3 we present our approach, describe the structure of our model and show how we infer beat positions. Section 4 describes the setup of our experiments, while we show their results in Section 5. Finally, we conclude our work in Section 6. We present a probabilistic way to extract beat positions from the output (activations) of the neural network that is at the heart of an existing beat tracker. The method can serve as a replacement for the greedy search the beat tracker currently uses for this purpose. Our experiments show improvement upon the current method for a variety of data sets and quality measures, as well as better results compared to other state-of-the-art algorithms. 1. INTRODUCTION Rhythm and pulse lay the foundation of the vast majority of musical works. Percussive instruments like rattles, stampers and slit drums have been used for thousands of years to accompany and enhance rhythmic movements or dances. Maybe this deep connection between movement and sound enables humans to easily tap to the pulse of a musical piece, accenting its beats. The computer, however, has difficulties determining the position of the beats in an audio stream, lacking the intuition humans developed over thousands of years. Beat tracking is the task of locating beats within an audio stream of music. Literature on beat tracking suggests many possible applications: practical ones such as automatic time-stretching or correction of recorded audio, but also as a support for further music analysis like segmentation or pattern discovery [4]. Several musical aspects hinder tracking beats reliably: syncopation, triplets and offbeat rhythms create rhythmical ambiguousness that is difficult to resolve; varying tempo increases musical expressivity, but impedes finding the correct beat times. The multitude of existing beat tracking algorithms work reasonably well for a subset of musical works, but often fail for pieces that are difficult to handle, as [11] showed. In this paper, we further improve upon the beat tracker presented in [2]. The existing algorithm uses a neural network to detect beats in the audio. The output of this neural network, called activations, indicates the likelihood of a 2. BASE METHOD In this section, we will briefly review the approach presented in [2]. For a detailed discourse we refer the reader to the respective publication. First, we will outline how the algorithm processes the signal to emphasise onsets. We will then focus on the neural network used in the beat tracker and its output in Section 2.2. 
After this, Section 3 will introduce the probabilistic method we propose to find beats in the output activations of the neural network. 2.1 Signal Processing The algorithm derives from the signal three logarithmically filtered power spectrograms with window sizes W of 1024, 2048 and 4096 samples each. The windows are placed 441 samples apart, which results in a frame rate of fr = 100 frames per second for audio sampled at 44.1kHz. We transform the spectra using a logarithmic function to better match the human perception of loudness, and filter them using 3 overlapping triangular filters per octave. Additionally, we compute the first order difference for each of the spectra in order to emphasise onsets. Since longer frame windows tend to smear spectral magnitude values in time, we compute the difference to the last, second to last, and third to last frame, depending on the window size W . Finally, we discard all negative values. c Filip Korzeniowski, Sebastian Böck, Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Filip Korzeniowski, Sebastian Böck, Gerhard Widmer. “Probabilistic Extraction of Beat Positions from a Beat Activation Function”, 15th International Society for Music Information Retrieval Conference, 2014. 513 0.6 0.06 0.5 0.05 0.4 Activation Activation 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 0.3 0.2 0.1 0.0 0.04 0.03 0.02 0.01 0 2 4 6 8 0.00 10 Time [s] (a) Activations of a piece from the Ballroom dataset 0 2 4 6 8 10 Time [s] (b) Activations of a piece from the SMC dataset Figure 1. Activations of pieces from two different datasets. The activations are shown in blue, with green, dotted lines showing the ground truth beat annotations. On the left, distinct peaks indicate the presence of beats. The prominent rhythmical structure of ballroom music enables the neural network to easily discern frames that contain beats from those that do not. On the right, many peaks in the activations do not correspond to beats, while some beats lack distinguished peaks in the activations. In this piece, a single woodwind instrument is playing a solo melody. Its soft onsets and lack of percussive instruments make detecting beats difficult. 2.2 Neural Network Our classifier consists of a bidirectional recurrent neural network of Long Short-Term Memory (LSTM) units, called bidirectional Long Short-Term Memory (BLSTM) recurrent neural network [10]. The input units are fed with the log-filtered power spectra and their corresponding positive first order differences. We use three fully connected hidden layers of 25 LSTM units each. The output layer consists of a single sigmoid neuron. Its value remains within [0, 1], with higher values indicating the presence of a beat at the given frame. After we initialise the network weights randomly, the training process adapts them using standard gradient descent with back propagation and early stopping. We obtain training data using 8-fold cross validation, and randomly choose 15% of the training data to create a validation set. If the learning process does not improve classification on this validation set for 20 training epochs, we stop it and choose the best performing neural network as final model. For more details on the network and the learning process, we refer the reader to [2]. The neural network’s output layer yields activations for every feature frame of an audio signal. We will formally represent this computation as mathematical function. 
Let N be the number of feature frames for a piece, and N≤N = {1, 2, . . . , N } the set of all frame indices. Furthermore, let υn be the feature vector (the log-filtered power spectra and corresponding differences) of the nth audio frame, and Υ = (υ1 , υ2 , . . . , υN ) denote all feature vectors computed for a piece. We represent the neural network as a function Ψ : N≤N → [0, 1] , (1) such that Ψ(n; Υ) is the activation value for the nth frame when the network processes the feature vectors Υ. We will call this function “activations” in the following. Depending on the type of music the audio contains, the activations show clear (or, less clear) peaks at beat positions. Figure 1 depicts the first 10 seconds of activations 514 for two different songs, together with ground truth beat annotations. In Fig. 1a, the peaks in the activations clearly correspond to beats. For such simple cases, thresholding should suffice to extract beat positions. However, we often have to deal with activations as those in Fig. 1b, with many spurious and/or missing peaks. In the following section, we will propose a new method for extracting beat positions from such activations. 3. PROBABILISTIC EXTRACTION OF BEAT POSITIONS Figure 1b shows the difficulty in deriving the position of beats from the output of the neural network. A greedy local search, as used in the original system, runs into problems when facing ambiguous activations. It struggles to correct previous beat position estimates even if the ambiguity resolves later in the piece. We therefore tackle this problem using a probabilistic model that allows us to globally optimise the beat sequence. Probabilistic models are a frequently used to process time-series data, and are therefore popular in beat tracking (e.g. [3, 9, 12, 13, 14]). Most systems favour generative time-series models like hidden Markov models (HMMs), Kalman filters, or particle filters as natural choices for this problem. For a more complete overview of available beat trackers using various methodologies and their results on a challenging dataset we refer the reader to [11]. In this paper, we use a different approach: our model represents each beat with its own random variable. We model time as dimension in the sample space of our random variables as opposed to a concept of time driving a random process in discrete steps. Therefore, all activations are available at any time, instead of one at a time when thinking of time-series data. For each musical piece we create a model that differs from those of other pieces. Different pieces have different lengths, so the random variables are defined over different sample spaces. Each piece contains a different number of beats, which is why each model consists of a different 15th International Society for Music Information Retrieval Conference (ISMIR 2014) where each yn is in the domain defined by the input features. Although Y is formally a random variable with a distribution P (Y ), its value is always given by the concrete features extracted from the audio. The model’s structure requires us to define dependencies between the variables as conditional probabilities. Assuming these dependencies are the same for each beat but the first, we need to define Y X1 X2 ··· XK P (X1 | Y ) P (Xk | Xk−1 , Y ) . Figure 2. The model depicted as Bayesian network. Each Xk corresponds to a beat and models its position. Y represents the feature vectors of a signal. 
If we wanted to compute the joint probability of the model, we would also need to define P (Y ) – an impossible task. Since, as we will elaborate later, we are only interested in P (X1:K | Y ) 1 , and Y is always given, we can leave this aside. number of random variables. The idea to model beat positions directly as random variables is similar to the HMM-based method presented in [14]. However, we formulate our model as a Bayesian network with the observations as topmost node. This allows us to directly utilise the whole observation sequence for each beat variable, without potentially violating assumptions that need to hold for HMMs (especially those regarding the observation sequence). Also, our model uses only a single factor to determine potential beat positions in the audio – the output of a neural network – whereas [14] utilises multiple features on different levels to detect beats and downbeats. 3.2 Probability Functions Except for X1 , two random variables influence each Xk : the previous beat Xk−1 and the features Y . Intuitively, the former specifies the spacing between beats and thus the rough position of the beat compared to the previous one. The latter indicates to what extent the features confirm the presence of a beat at this position. We will define both as individual factors that together determine the conditional probabilities. 3.2.1 Beat Spacing 3.1 Model Structure The pulse of a musical piece spaces its beats evenly in time. Here, we assume a steady pulse throughout the piece and model the relationship between beats as factor favouring their regular placement according to this pulse. Future work will relax this assumption and allow for varying pulses. Even when governed by a steady pulse, the position of beats is far from rigid: slight modulations in tempo add musical expressivity and are mostly artistic elements intended by performers. We therefore allow a certain deviation from the pulse. As [3] suggests, tempo changes are perceived relatively rather than absolutely, i.e. halving the tempo should be equally probable as doubling it. Hence, we use the logarithm to base 2 to define the intermediate factor Φ̃ and factor Φ, our beat spacing model. Let x and x be consecutive beat positions and x > x , we define Φ̃ (x, x ) = φ log2 (x − x ) ; log2 (τ ) , στ2 , (4) Φ̃ (x, x ) if 0 < x − x < 2τ Φ (x, x ) = , (5) 0 else As mentioned earlier, we create individual models for each piece, following the common structure described in this section. Figure 2 gives an overview of our system, depicted as Bayesian network. Each Xk is a random variable modelling the position of the k th beat. Its domain are all positions within the length of a piece. By position we mean the frame index of the activation function – since we extract features with a frame rate of fr = 100 frames per second, we discretise the continuous time space to 100 positions per second. Formally, the number of possible positions per piece is determined by N , the number of frames. Each Xk is then defined as random variable with domain N≤N , the natural numbers smaller or equal to N : Xk ∈ N≤N with 1 ≤ k ≤ K, (2) where K is the number of beats in the piece. We estimate this quantity by detecting the dominant interval τ of a piece using an autocorrelation-based method on the smoothed activation function of the neural network (see [2] for details). Here, we restrict the possible intervals to a range [τl ..τu ], with both bounds learned from data. 
Assuming a steady tempo and a continuous beat throughout the piece, we simply compute K = N/τ . Y models the features extracted from the input audio. If we divide the signal into N frames, Y is a sequence of vectors: Y ∈ {(y1 , . . . , yN )} , and where φ x; μ, σ 2 is the probability density function of a Gaussian distribution with mean μ and variance σ 2 , τ is the dominant inter-beat interval of the piece, and στ2 represents the allowed tempo variance. Note how we restrict the non-zero range of Φ: on one hand, to prevent computing the logarithm of negative values, and on the other hand, to reduce the number of computations. (3) 1 515 We use Xm:n to denote all Xk with indices m to n 15th International Society for Music Information Retrieval Conference (ISMIR 2014) The factor yields high values when x and x are spaced approximately τ apart. It thus favours beat positions that correspond to the detected dominant interval, allowing for minor variations. Having defined the beat spacing factor, we will now elaborate on the activation vector that connects the model to the audio signal. a dynamic programming method similar to the well known Viterbi algorithm [15] to obtain the values of interest. We adapt the standard Viterbi algorithm to fit the structure of model by changing the definition of the “Viterbi variables” δ to δ1 (x) = P (X1 = x | Υ) and P (Xk = x | Xk−1 = x , Υ) · δk−1 (x ), δk (x) = max 3.2.2 Beat Activations x 2 where x, x ∈ N≤N . The backtracking pointers are set accordingly. P (x∗1:K | Υ) gives us the probability of the beat sequence given the data. We use this to determine how well the deducted beat structure fits the features and in consequence the activations. However, we cannot directly compare the probabilities of beat sequences with different numbers of beats: the more random variables a model has, the smaller the probability of a particular value configuration, since there are more possible configurations. We thus normalise the probability by dividing by K, the number of beats. With this in mind, we try different values for the dominant interval τ to obtain multiple beat sequences, and choose the one with the highest normalised probability. Specifically, we run our method with multiples of τ (1/2, 2/3, 1, 3/2, 2) to compensate for errors when detecting the dominant interval. The neural network’s activations Ψ indicate how likely each frame n ∈ N≤N is a beat position. We directly use this factor in the definition of the conditional probability distributions. With both factors in place we can continue to define the conditional probability distributions that complete our probabilistic model. 3.2.3 Conditional Probabilities The conditional probability distribution P (Xk | Xk−1 , Y ) combines both factors presented in the previous sections. It follows the intuition we outlined at the beginning of Section 3.2 and molds it into the formal framework as P (Xk | Xk−1 , Y ) = Ψ (Xk ; Y ) · Φ (Xk , Xk−1 ) . Xk Ψ (Xk ; Y ) · Φ (Xk , Xk−1 ) (6) The case of X1 , the first beat, is slightly different. There is no previous beat to determine its rough position using the beat spacing factor. But, since we assume that there is a steady and continuous pulse throughout the audio, we can conclude that its position lies within the first interval from the beginning of the audio. This corresponds to a uniform distribution in the range [0, τ ], which we define as beat position factor for the first beat as 1/τ if 0 ≤ x < τ, . (7) Φ1 (x) = 0 else 4. 
EXPERIMENTS In this section we will describe the setup of our experiments: which data we trained and tested the system on, and which evaluation metrics we chose to quantify how well our beat tracker performs. 4.1 Data We ensure the comparability of our method by using three freely available data sets for beat tracking: the Ballroom dataset [8,13]; the Hainsworth dataset [9]; the SMC dataset [11]. The order of this listing indicates the difficulty associated with each of the datasets. The Ballroom dataset consists of dance music with strong and steady rhythmic patterns. The Hainsworth dataset includes of a variety of musical genres, some considered easier to track (like pop/rock, dance), others more difficult (classical, jazz). The pieces in the SMC dataset were specifically selected to challenge existing beat tracking algorithms. We evaluate our beat tracker using 8-fold cross validation, and balance the splits according to dataset. This means that each split consists of roughly the same relative number of pieces from each dataset. This way we ensure that all training and test splits represent the same distribution of data. All training and testing phases use the same splits. The same training sets are used to learn the neural network and to set parameters of the probabilistic model (lower and upper bounds τl and τu for dominant interval estimation and στ ). The test phase feeds the resulting tracker with data from the corresponding test split. After detecting the beats The conditional probability for X1 is then P (X1 | Y ) = Ψ (X1 ; Y ) · Φ1 (X1 ) . X1 Ψ (X1 ; Y ) · Φ1 (X1 ) (8) The conditional probability functions fully define our probabilistic model. In the following section, we show how we can use this model to infer the position of beats present in a piece of music. 3.3 Inference We want to infer values x∗1:K for X1:K that maximise the probability of the beat sequence given Y = Υ, that is x∗1:K = argmax P (X1:K | Υ) . (9) x1:K Each x∗k corresponds to the position of the k th beat. Υ are the feature vectors computed for a specific piece. We use 2 technically, it is not a likelihood in the probabilistic sense – it just yields higher values if the network thinks that the frame contains a beat than if not 516 15th International Society for Music Information Retrieval Conference (ISMIR 2014) for all pieces, we group the results according to the original datasets in order to present comparable results. SMC F Cg CMLt AMLt 0.545 0.497 0.436 0.402 0.442 0.360 0.580 0.431 F Cg CMLt AMLt 0.840 0.837 0.718 0.717 0.784 0.763 0.875 0.811 Degara* [7] Klapuri* [12] Davies* [6] - - 0.629 0.620 0.609 0.815 0.793 0.763 Ballroom F Cg CMLt AMLt Proposed Böck [1, 2] 0.903 0.889 0.864 0.857 0.833 0.796 0.910 0.831 Krebs [13] Klapuri [12] Davies [6] 0.855 0.728 0.764 0.772 0.651 0.696 0.786 0.539 0.574 0.865 0.817 0.864 Proposed Böck [1, 2] 4.2 Evaluation Metrics A multitude of evaluation metrics exist for beat tracking algorithms. Some accent different aspects of a beat tracker’s performance, some capture similar properties. For a comprehensive review and a detailed elaboration on each of the metrics, we refer the reader to [5]. Here, we restrict ourselves to the following four quantities, but will publish further results on our website 3 . Hainsworth Proposed Böck [1, 2] F-measure The standard measure often used in information retrieval tasks. Beats count as correct if detected within ±70ms of the annotation. 
Cemgil Measure that uses a Gaussian error window with σ = 40ms instead of a binary decision based on a tolerance window. It also incorporates false positives and false negatives. CMLt The percentage of correctly detected beats at the correct metrical level. The tolerance window is set to 17.5% of the current inter-beat interval. Table 1. Beat tracking results for the three datasets. F stands for F-measure and Cg for the Cemgil metric. Results marked with a star skip the first five seconds of each piece and are thus better by about 0.01 for each metric, in our experience. AMLt Similar to CMLt, but allows for different metrical levels like double tempo, half tempo, and off-beat. In contrast to common practice 4 , we do not skip the first 5 seconds of each audio signal for evaluation. Although skipping might make sense for on-line algorithms, it does not for off-line beat trackers. considerably. Our beat tracker also performs better than the other algorithms, where metrics were available. The proposed model assumes a stable tempo throughout a piece. This assumption holds for certain kinds of music (like most of pop, rock and dance), but does not for others (like jazz or classical). We estimated the variability of the tempo of a piece using the standard deviation of the local beat tempo. We computed the local beat tempo based on the inter-beat interval derived from the ground truth annotations. The results indicate that most pieces have a steady pulse: 90% show a standard deviation lower than 8.61 bpm. This, of course, depends on the dataset, with 97% of the ballroom pieces having a deviation below 8.61 bpm, 89% of the Hainsworth dataset but only 67.7% of the SMC data. We expect our approach to yield inferior results for pieces with higher tempo variability than for those with a more constant pulse. To test this, we computed Pearson’s correlation coefficient between tempo variability and AMLt value. The obtained value of ρ = -0.46 indicates that our expectation holds, although the relationship is not linear, as a detailed examination showed. Obviously, multiple other factors also influence the results. Note, however, that although the tempo of pieces from the SMC dataset varies most, it is this dataset where we observed the strongest improvement compared to the original approach. Figure 3 compares the beat detections obtained with the proposed method to those computed by the original approach. It exemplifies the advantage of a globally optimised beat sequence compared to a greedy local search. 5. RESULTS Table 1 shows the results of our experiments. We obtained the raw beat detections on the Ballroom dataset for [6, 12, 13] from the authors of [13] and evaluated them using our framework. The results are thus directly comparable to those of our method. For the Hainsworth dataset, we collected results for [6, 7, 12] from [7], who does skip the first 5 seconds of each piece in the evaluation. In our experience, this increases the numbers obtained for each metric by about 0.01. The approaches of [6, 7] do not require any training. In [12], some parameters are set up based on a separate dataset consisting of pieces from a variety of genres. [13] is a system that is specialised for and thus only trained on the Ballroom dataset. We did not include results of other algorithms for the SMC dataset, although available in [11]. This dataset did not exist at the time most beat trackers were crafted, so the authors could not train or adapt their algorithms in order to cope with such difficult data. 
Our method improves upon the original algorithm [1, 2] for each of the datasets and for all evaluation metrics. While F-Measure and Cemgil metric rises only marginally (except for the SMC dataset), CMLt and AMLt improves 3 http://www.cp.jku.at/people/korzeniowski/ismir2014 4 As implemented in the MatLab toolbox for the evaluation of beat trackers presented in [5] 517 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 0.06 Activation 0.05 0.04 0.03 0.02 0.01 0.00 0 5 10 15 20 Time [s] Figure 3. Beat detections for the same piece as shown in Fig. 1b obtained using the proposed method (red, up arrows) compared to those computed by the original approach (purple, down arrows). The activation function is plotted solid blue, ground truth annotations are represented by vertical dashed green lines. Note how the original method is not able to correctly align the first 10 seconds, although it does so for the remaining piece. Globally optimising the beat sequence via back-tracking allows us to infer the correct beat times, even if the peaks in the activation function are ambiguous at the beginning. 6. CONCLUSION AND FUTURE WORK Mary University of London, Centre for Digital Music, Tech. Rep. C4DM-TR-09-06, 2009. We proposed a probabilistic method to extract beat positions from the activations of a neural network trained for beat tracking. Our method improves upon the simple approach used in the original algorithm for this purpose, as our experiments showed. In this work we assumed close to constant tempo throughout a piece of music. This assumption holds for most of the available data. Our method also performs reasonably well on difficult datasets containing tempo changes, such as the SMC dataset. Nevertheless we believe that extending the presented method in a way that enables tracking pieces with varying tempo will further improve the system’s performance. [6] M. E. P. Davies and M. D. Plumbley. Context-Dependent Beat Tracking of Musical Audio. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1009–1020, Mar. 2007. [7] N. Degara, E. A. Rua, A. Pena, S. Torres-Guijarro, M. E. P. Davies, and M. D. Plumbley. Reliability-Informed Beat Tracking of Musical Signals. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):290–301, Jan. 2012. [8] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, 2006. [9] S. W. Hainsworth and M. D. Macleod. Particle Filtering Applied to Musical Tempo Tracking. EURASIP Journal on Advances in Signal Processing, 2004(15):2385–2395, Nov. 2004. ACKNOWLEDGEMENTS This work is supported by the European Union Seventh Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591). 7. REFERENCES [1] MIREX 2013 beat tracking results. http://nema.lis. illinois.edu/nema_out/mirex2013/results/ abt/, 2013. [2] S. Böck and M. Schedl. Enhanced Beat Tracking With Context-Aware Neural Networks. In Proceedings of the 14th International Conference on Digital Audio Effects (DAFx11), Paris, France, Sept. 2011. [3] A. T. Cemgil, B. Kappen, P. Desain, and H. Honing. On tempo tracking: Tempogram Representation and Kalman filtering. Journal of New Music Research, 28:4:259–273, 2001. [4] T. Collins, S. Böck, F. Krebs, and G. Widmer. Bridging the Audio-Symbolic Gap: The Discovery of Repeated Note Content Directly from Polyphonic Music Audio. 
In Proceedings of the Audio Engineering Society’s 53rd Conference on Semantic Audio, London, 2014. [5] M. E. P. Davies, N. Degara, and M. D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. Queen [10] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computing, 9(8):1735–1780, Nov. 1997. [11] A. Holzapfel, M. E. P. Davies, J. R. Zapata, J. a. L. Oliveira, and F. Gouyon. Selective Sampling for Beat Tracking Evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 20(9):2539–2548, Nov. 2012. [12] A. P. Klapuri, A. J. Eronen, and J. T. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):342–355, 2006. [13] F. Krebs, S. Böck, and G. Widmer. Rhythmic Pattern Modeling for Beat and Downbeat Tracking in Musical Audio. In Proc. of the 14th International Conference on Music Information Retrieval (ISMIR), 2013. [14] G. Peeters and H. Papadopoulos. Simultaneous beat and downbeat-tracking using a probabilistic framework: theory and large-scale evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 19(6):1754–1769, 2011. [15] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. 518 15th International Society for Music Information Retrieval Conference (ISMIR 2014) GEOGRAPHICAL REGION MAPPING SCHEME BASED ON MUSICAL PREFERENCES Sanghoon Jun Korea University Seungmin Rho Sungkyul University Eenjun Hwang Korea University [email protected] [email protected] [email protected] ences. In the case of two regions near the border of the two countries, the people might show very different music preferences from those living in a region far from the border but in the same country. The degree of preference differences can be varied because of the difference in the sizes of the countries. Furthermore, the water bodies that cover 71% of the Earth’s surface can lead to a disjunction of the differences. Music from countries that have a high cultural influence might gain global popularity. For instance, pop music from the United States is very popular all over the world. Countries that have a common cultural background might have similar musical preferences irrespective of the geographical distance between them. Language is another important factor that can lead to different countries, such as the US and the UK, having similar popular music charts. For these reasons, predicting musical preferences on the basis of geographical proximity can lead to incorrect results. In this paper, we present a scheme for constructing a music map where regions are positioned close to one another depending on the musical preferences of their populations. That is, regions such as cities in a traditional map are rearranged in the music map such that regions with similar musical preferences are close to one another. As a result, regions with similar musical preferences are concentrated in the music map and regions with distinct musical preferences are far away from the group. The rest of this paper is organized as follows: In Section 2, we present a brief overview of the related works. Section 3 presents the scheme for mapping a geographical region to a new music space. Section 4 describes the experiments that we performed and some of the results. In the last section, we conclude the paper with directions for future work. 
ABSTRACT Many countries and cities in the world tend to have different types of preferred or popular music, such as pop, K-pop, and reggae. Music-related applications utilize geographical proximity for evaluating the similarity of music preferences between two regions. Sometimes, this can lead to incorrect results due to other factors such as culture and religion. To solve this problem, in this paper, we propose a scheme for constructing a music map in which regions are positioned close to one another depending on the similarity of the musical preferences of their populations. That is, countries or cities in a traditional map are rearranged in the music map such that regions with similar musical preferences are close to one another. To do this, we collect users’ music play history and extract popular artists and tag information from the collected data. Similarities among regions are calculated using the tags and their frequencies. And then, an iterative algorithm for rearranging the regions into a music map is applied. We present a method for constructing the music map along with some experimental results. 1. INTRODUCTION To recommend suitable music pieces to users, various methods have been proposed and one of them is the joint consideration of music and location information. In general, users in the same place tend to listen to similar kinds of music and this is shown by the statistics of music listening history. Context-aware computing utilizes this human tendency to recommend songs to a user. However, the current approach of exploring geographical proximity for obtaining a user’s music preferences might have several limitations due to various factors such as region scale, culture, religion, and language. That is, neighboring regions can show significant differences in music listening statistics and vice versa. In fact, the geographical distance between two regions is not always proportional to the degree of difference in music preferences. For instance, assume that there are two neighboring countries having different music prefer- 2. RELATED WORK Many studies have tried to utilize location information for various music-related applications such as music search and recommendation. Kaminskas et al. presented a context-aware music recommender system that suggests music items on the basis of the users’ contextual conditions, such as the users’ mood or location [1]. They defined the term “place of interest (POI)” and considered the selection of suitable music tracks on the basis of the POI. In [2], Schedl et al. presented a music recommenda- © Sanghoon Jun, Seungmin Rho, Eenjun Hwang. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sanghoon Jun, Seungmin Rho, Eenjun Hwang. “Geographical Region Mapping Scheme Based On Musical Preferences”, 15th International Society for Music Information Retrieval Conference, 2014. 
519 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Data Collection Extract music listening data and location - Messages - GPS information - Profile location Define region and group data Mapping regions to 2-dimensional space Space Representation Generate Gaussian mixture model Extract popular artists from groups - Geo.TopArtist - Artist.TopTags Generate map Extract tag statistics from popular artists PM Similarity Measurement SL HKTC GP TLTD NA MW Calculate similarities of region pairs MS VN BJ UG NG AS SZ AX MM FK FI MZ MO GN LY TG TO CN IQ SDBB JM LSMF BO BQ LC MKCM TH YE BT TV OM LA FO MD KY PW GUZM AO BWPS NU IO NO GI MAFR PH SC PA MC TJMG AZ BN ML CW MT TF PT NI RS VC SX SB AE SG DO GY PEGS AMKP ISSVMY EH AL AQ HN CLLUNL HU IN EG HR BH MQ MX CO AR ZA MR ES CA HM GRCC SKEE PY IM IE DKCX GB TR PR RE DZ VGAU PKTTUS DJ CH QA CI ECIT LV IL CR AT CZ BEVINZ LT UY UM PN VE BA GD VU ME BF FJ SS RO KZ SI SA SJ PL ZW SO GT TZ FMSE GW BG NE MNIDCY VA UA GG MV BV BYBD UZ GE BR CKTN ST CU TKDEPF GL HT LB AW JO AG WF RU TM PG LK BS MU KG AD CG KE YT BI NR SR AF CV LI SM NP CD KR MHNC KI IR JE SY KN WS KH MP ET AI TW GA GM KW GF BL GH KM - Google maps - Geocoder dimensional (3D) visualization model [5]. Using the music similarity model, they provided new tools for exploring and interacting with a music collection. In [6], Knees et al. presented a user interface that creates a virtual landscape for music collection. By extracting features from audio signals and clustering the music pieces, they created a 3D island landscape. In [7], Pampalk et al. presented a system that facilitates the exploration of music libraries. By estimating the perceived sound similarities, music pieces are organized on a two-dimensional (2D) map so that similar pieces are located close to one another. In [8], Rauber et al. proposed an approach to automatically create an organization of music collection based on sound similarities. A 3D visualization of music collection offers an interface for an interactive exploration of large music repositories. Space Mapping ER RW SH Generate similarity matrix NF DM BZ GQ SN LR JP CF BM Figure 1. Overall scheme 3. GEOGRAPHICAL REGION MAPPING In this paper, we propose a scheme for geographical region mapping on the basis of the musical preferences of the people residing in these regions. The proposed scheme consists of three parts as shown in Figure 1. Firstly, the music listening history and the related location data are collected from Twitter. After defining regions, the collected data are refined to tag the statistics per region by querying popular artists and their popularities from last.fm. Similarities between the defined regions are calculated and stored in the similarity matrix. The similarity matrix is represented into a 2D space by using an iterative algorithm. Then, a Gaussian mixture model (GMM) is generated for constructing the music map on the basis of the relative location of the regions. Figure 2. Collected data from twitter tion algorithm that combines information on the music content, music context, and user context by using a data set of geo-located music listing activities. In [3], Schedl et al derived and analyzed culture-specific music listening patterns by collecting music listening patterns of different countries (cities). They utilized social microblog such as Twitter and its tags in order to collect musicrelated information and measure the similarities between artists. Jun et al. 
presented a music recommender that considers personal and general musical predilections on the basis of time and location [4]. They analyzed massive social network streams from twitter and extracted the music listening histories. On the basis of a statistical analysis of the time and location, a collection of songs is selected and blended using automatic mixing techniques. These location-aware methods show a reasonable music search and recommendation performance when the range of the placeGof interest is small. However, the aforementioned problems might occur when the location range increases. Furthermore, these methods do not consider the case where remote regions have similar music preferences, which is often the case. On the basis of these observations, in this paper, we propose a new data structure called a “music map”, where regions with similar musical preferences are located close to one another. Some pioneering studies to represent music by using visualization techniques have been reported. Lamere et al. presented an application for exploring and discovering new music by using a three- 3.1 Music Listen History and Location Collection By analyzing the music listening history and location data, we can find out the music type that is popular in a certain city or country. In order to construct a music map, we need to collect the music listening history and location information on a global scale. To do this, we utilize last.fm, which is a popular music database. However, last.fm has several limitations related to the coverage of the global music listening history. The most critical one is that the database provides the listening data of a particular country only. In other words, we cannot obtain the data for a detailed region. Users in some countries (not all countries) use last.fm, and it does not contain sufficient data to cover the preferences of all the regions of these countries. Because of this, we observed that popular music in the real world does not always match with the last.fm data. On the other hand, an explosive number of messages are generated all over the world through Twitter. Twitter is one of the most popular social network services. In this study, we use Twitter for collecting a massive amount of music listening history data. By filtering music-related messages from Twitter, we can collect various types of 520 15th International Society for Music Information Retrieval Conference (ISMIR 2014) #nowplaying #np #music #soundcloud #musicfans #listenlive #hiphop #musicmondays #pandora #mp3 #itunes #newmusic 1 AU AT BE CA CL CZ DK EE FI FR DE GR HU IS IE IL IT JP KR LU MX NL NZ NO PL PT SK SI ES SE CH TR UK US Table 1. Music-related hashtags. <Phrase A> by < Phrase B> < Phrase A> - < Phrase B > < Phrase A > / < Phrase B > “< Phrase A >” - < Phrase B > Table 2. Typical syntax for parsing song title and artist music-related information, such as artist name, song title, and the published location. Figure 2 shows the distribution of the collected music-related tweets from around the world. We used the Tweet Stream provided through a Twitter application processing interface (API) for collecting tweets. In order to select only the music-related tweets, we used music-related hashtags. Hashtags are very useful for searching the relevant tweets or for grouping tweets on the basis of topics. As shown in Table 1, we used the music-related hashtag lists that have been defined in [4]. Music-related tweet messages contain musical information such as song title and artist name. 
These textual data are represented in various forms. In particular, we considered the patterns shown in Table 2 for finding the artist names and the song titles. We employed a local MusicBrainz [9] server to validate the artist names. For collecting location information, we gathered global positioning system (GPS) data that are included in tweet messages. However, we observed that the number of tweets that contain GPS data is quite small considering the total number of tweets. To solve this, we collected the profile location of the user who published a tweet message. Profile location contains the text address of the country or the city of the user. We employed the Google Geocoding API [10] for validating the location name and converting the address to GPS coordinates. 0.9 0.8 0.7 0.6 0.5 AU ATBE CA CL CZDKEE FI FRDE GRHU IS IE IL IT JP KR LU MXNL NZNOPL PTSK SI ES SE CH TRUK US Figure 3. Tag similarity matrix of 34 countries = { , … , } (2) where n is the number of referred artists. Also, using an artist name, we can collect his/her tag list. For a region r, we construct a set Tr of top tags by querying top tags to last.fm using the artist names of the region r as follows: ! = {"#$!%&!"( ) ' … ' "#$!%&!"( )| * } = {$ , … $+ } (3) where getTopTags(a) returns a list of top tags of artist a and m is the number of collected tags for the region r. We define a function RTC(r, t) that calculates the total count of tag t in region r using the following equation: -!(., $) = /1 *23 × "#$!"%0$( , $) (4) Here, getTagCount(a, t) returns the count of tag t for the artist a in last.fm. In the same vein, RTC can return a set of tag counts when the second argument is a tag set T. -!(., 4) = {-!(., $ ), … , -!(., $+ )|$ * 4} (5) 3.2 Region Definition and Tag Representation 3.3 Similarity Measurement Using the collected GPS information, we created a set of regions on the basis of the city or country. For grouping data by city name or country name, the collected GPS information is converted into its corresponding city or country name. In this study, we got 1327 cities or 198 countries from the music listening history collected through TwitterU For each region, we collect two sets Ar and ACr of referred artist names and their play counts, respectively: To construct a music map of regions, we need a measurement for estimating musical similarity. In this paper, we assume that music proximity between regions is closely related to the artists and their tags because the musical characteristics of a region can be explained by the artists’ tags of the region. In particular, in order to measure the similarity among the regions represented by the tag groups, we employed a cosine similarity measurement as shown in the following equation: = { , … , } (1) 521 !56(. , .7 ) = 89:(; ,4< )×89:(> ,4< ) ?89:@; ,43; A?×?89:@> ,43> A? (6) 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 1 Japan 0.8 0.6 (a) iteration = 1 Korea, republic of (b) iteration = 100 0.4 0.2 0 0 (c) iteration = 400 Israel Slovenia Poland Greece Luxembourg Germany Italy Turkey Chile Czech republic Spain Austria NewNetherlands zealand United kingdom Slovakia Switzerland Hungary Mexico Belgium Estonia Ireland United states Australia Denmark Portugal Canada Sweden Norway Iceland France 0.2 0.4 0.6 0.8 Finland 1 Figure 5. Gaussian mixture model of 34 countries (d) iteration = 1000 Figure 4. 
Example of mapped space in iterations 4B = 4; C 4> 1 JP (7) 0.8 The cosine similarities of all possible pairs of regions were calculated and stored in the tag similarity matrix TSM. Hence, if there were m regions in the collection, we obtained a TSM of m × m. A sample TSM for 34 countries is shown in Figure 3. 0.6 SI KR 0.4 0.2 SE 3.4 2D Space Mapping 0 On the basis of the TSM, we generated a 2D space for a music map by converting tag similarities between regions into proper metric for 2D space mapping. In this paper, this conversion is done approximately using an iterative algorithm. The proposed algorithm is based on the computational model such as a self-organizing map and an artificial neural network algorithm. By using an iterative phase, the algorithm gradually separates the regions in inverse proportion to the tag similarity. -0.2 -0.2 0.2 FI PL GR LU DE IT TR CL CZ AT NZNLESSK UK CH BEHUMX EE IE US DK AU PT CA IS NO FR 0.4 0.6 0.8 1 1.2 Figure 6. Music map of 34 countries 7 HD(.E , . ) = I@J(.E) G J(. )A + (L(.E ) G L(. ))7 (9) where x(ri) and y(ri) returns x and y positions of the region ri in 2D space, respectively. In order for TD and ED to have same value as much as possible, the following equation is applied 3.4.1 Initialization In the initialization phase, 2D space is generated where X-axis and Y-axis of the space have ranges from 0 to 1. Each region is randomly placed on the 2D space. We observed that our random initialization does not provide deterministic result of the 2D space mapping. J(. ) = J(. ) + M($)(HD(.E , . ) G !D(.E , . )) (N(O )PN(1 )) QR(O ,1 ) (10) 3.4.2 Iterations L(. ) = L(. ) + M($)(HD(.E , . ) G !D(.E , . )) In each iteration, a region in the 2D space is randomly selected and the tag distance TD between the selected region rs and any other region ri is computed using the similarity matrix. !D(.E , . ) = 1 G !56(.E , . ) 0 IL (S(O )PS(1 )) QR(O ,1 ) (11) (8) Here, ©(t) is a learning rate in t-th iteration. The learning rate is monotonically decreased during iteration according to the following equation M($) = MT exp(G$/!) Subsequently, Euclidean distances ED between the selected region rs and other region ri is computed using the following equation 522 (12) 15th International Society for Music Information Retrieval Conference (ISMIR 2014) GN Average |ED-TD| 0.6 SL BJ DM 0.5 KW 0.4 BZ HK 0.1 CF 0 100 200 300 400 500 Iterations Figure 7. Average difference of distances in iterations ©0 denotes the initial learning rate, and T represents the total number of iterations. After each iteration, regions having higher TD are located far away from the selected region and regions having lower TD are located closer. Figure 4 shows examples of the mapped space after iterations. AI LY BB FK SB CN KN BI HT FIKH MZ VA ET KR EH MV BY MFGD TC PH JOAD PG UZ KZ MAPN SE LCMM MH CG GE IN FR FJBQ NENO BT BVFO BO MNDE RW IL SI SJAQ BR MG ML ME MS BL TW VE KY OM EG PL VU EC LU BE LT UY SM HR CZ SS GS IT ATNL TF RO DK MTCR MR GR CI BG IS TR AR GL LV BD TK ZA CH MX GT CC CO CXUMTN SO SK VG UA ALBAPE EE CL IM IE HU GB GQ VC KG KP ES PT IDCYRSPY MQ NR DJ ST CA HM PK TO AU PR RE GI NZ US CUPF AZ AMQA CW CD VI BS WF MP BF ZW SCGU MD HN SVMY RU TJ NU TM TT SG SX LB AE BN CK AO MO DO CV MC LA IO SY NI LK GA TV BH ZMSA FM IR GW JM GY DZ LS GP GH NP CM PA AG GG JP SZ SN TZ PWTH YE ER LI SR AF AX AW LR GM NG AS TG GF VN JE Figure 8. Music map of 239 countries Hh Fv V(i) = {J(. ), L(. 
)} 0 1Z 8 (1 ) \ Ro In In After 2D space mapping, the regions are mapped such that regions having similar music preferences are placed close to one another. As a result, they form distinct crowds in the 2D space. In contrast, regions having unique preferences are placed apart from the crowds. To represent them as a map, a 2D distribution on the space is not sufficient. In this paper, in order to represent the information like a real world map, we employed the GMM. The Gaussian with diagonal matrix is constructed using the following equations: &(i) = IQ PM KM 3.5 Space Representation 1Z W(i) = X 8 0 MW BW MU SH 0.2 0 YT NC NFKI SDPS KE TD TL NA MK 0.3 BM WS UG Nu Db Ro Hh El ElHh RoIn Ro El El Po In El El Ro El El In ElIn Jz Ro El Ro RoRo Ro Ro El In Po Ro Po El Po Po Ro Po RoRo Po Po El ElElPoElPoPo Po Ro Ro Ro Ro Po El Ro Po Jz Po RoRo Po PoRoPo RoRo Po Po Jz Ro Po Ro Ro Ro Ro Ro Po Ro Ro Ro Ro Ro Ro Ro Po Ro Po Ro Ro Po Ro Po Ro Ro Po Ro Po Po El Ro Ro Ro Po Po Ro Ro Po Ro Po Ro Ro Po Po Ro Po Po Po Po Po El Jp Ro Ro Ro Ro Po Po Po Ro Ro Ro Ro Po Po El Po Hh Po Po Po Po Ro Po Ro Ro RoRo Po Ro RoRo Ro PoPo Po Hh Ro Ro PoPoPo PoPo Ro Po Po PoPo El PoRo Ro Po Po Po Ro Ro Po Po Ro Po PoPo Ro Po Ro Po In Ro Fv Hh Ro Po In Ro Ro Jp El Po Hh In In Po Ro In In Hh El Hh Hh Ro Hh In Hh In Jp Pu El Hh Hh In Po Hh In In (13) Po(Pop), El(electronic), Hh(Hip-Hop), Ro(rock), Jz(jazz), In(Indie), Jp(japanese), Fv(female vocalists), Db(Drum and bass), Pu(punk), Nu(Nu Metal) (14) Figure 9. Top tags of music map. or a small island on the basis of their distribution. As a result, the mapped result is visualized as a music map having an appearance similar to that of a real world map. An example of a music map for 34 countries is shown in Figure 6. Although the generated music map contains less information than the contour graph of GMM, it could be more intuitive to the casual users to understand the relations between regions in terms of music preferences. (15) Here, n is total number of regions and nn(ri) returns the number of neighboring regions of region ri in the 2D space. To model the GMM in the crowded area of 2D space, mixing proportion p(i) is adjusted based on the number of neighbors nn(ri). In other words, nn(ri) has a higher value when p(i) is crowded and it reduces the proportion of i-th Gaussian. It helps to prevent Gaussian from over-height. An example of generated GMM is shown in Figure 5. To generate a music map using the GMM, the probabilistic density function (pdf) of the GMM is simplified by applying a threshold. By projecting the GMM on the 2D plane after applying the threshold to the pdf, the boundaries of the GMM are created. We empirically found that the threshold value 0 gives an appropriate boundary. A boundary represents regions as a continent 4. EXPERIMENT 4.1 Experiment Setup To collect the music-related tweets, we gathered the tweet streams from the Twitter server in real time in order to collect the music information of Twitter users. During one week, we collected 4.57 million tweets that had the hashtags listed in Table 1. After filtering the tweets through regular expressions, 1.56 million music listening history records were collected. We got 1327 cities or 198 523 15th International Society for Music Information Retrieval Conference (ISMIR 2014) a music map according to the tag similarities. The possible application domains of the proposed scheme span a broad range—from music collection, browsing services, and music marketing tools, to a worldwide music trend analysis. 
countries from the music listening history collected through Twitter. We collected the lists of the top artists for 249 countries from last.fm. For these countries, 2735 artists and their top tags were collected from last.fm. 4.2 Differences of ED and TD In the proposed scheme, the iterative algorithm gradually reduces the difference between ED and TD, as mentioned above. In order to show that the algorithm reduces the difference and moves the regions appropriately, the average difference between ED and TD is measured in each iteration. Figure 7 shows the average distances during 500 iterations. The early phases in the computation show high average distance differences due to the random initialization. As the iteration proceeds, the average distance differences are gradually reduced and converged. 6. ACKNOWLEDGEMENT This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF2013R1A1A2012627) and the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-1001) supervised by the NIPA (National IT Industry Promotion Agency) 7. REFERENCES 4.3 Map Generation for 249 Countries In order to evaluate the effectiveness of the proposed scheme, we defined a region group that contained 249 countries. After collecting the music listening history from Twitter and last.fm, we generated a music map by using the proposed scheme. Figure 8 shows the resulting music map. We observed that the map consisted of a big island (continent) and a few small islands. In the center of the big island, countries that had a high musical influence, such as the US and the UK, were located. On the other hand, countries having unique music preferences such as Japan and Hong Kong were formed as small islands and located far away from the big island. 4.4 Top Tag Representation A music map is based on the musical preferences between regions, and these preferences were calculated on the basis of the similarities of the musical tags. In the last experiment, we first find out the top tag of each country and show the distribution of the top tags in the music map. Figure 9 shows the top tags of the map in Figure 8. In the map, “Rock” and “Pop”, which are the most popular tags in the collected data, are located in the center and occupies a significant portion of the big island. On the north side of the big island, “Electronic” tag is located and in the south, “Indie” tag is placed. The “Pop” tag, which is popular in almost every country, is located throughout the map. 5. CONCLUSION In this paper, we proposed a scheme for constructing a music map in which regions such as cities and countries are located close to one another depending on the musical preferences of the people residing in them. To do this, we collected the music play history and extracted the popular artists and tag information from Twitter and last.fm. A similarity matrix for each region pair was calculated by using the tags and their frequencies. By applying an iterative algorithm and GMM, we reorganized the regions into 524 [1] M. Kaminskas and F. Ricci, “Location-Adapted Music Recommendation Using Tags,” in User Modeling, Adaption and Personalization, Springer Berlin Heidelberg, 2011, pp. 183–194. [2] M. Schedl and D. Schnitzer, “Location-Aware Music Artist Recommendation,” in MultiMedia Modeling, Springer International Publishing, 2014, pp. 205–213. [3] M. Schedl and D. 
Hauger, “Mining Microblogs to Infer Music Artist Similarity and Cultural Listening Patterns,” in Proceedings of the 21st International Conference Companion on World Wide Web, New York, USA, 2012, pp. 877–886. [4] S. Jun, D. Kim, M. Jeon, S. Rho, and E. Hwang, “Social mix: automatic music recommendation and mixing scheme based on social network analysis,” Journal of Supercomputing, pp. 1–22, Apr. 2014. [5] P. Lamere and D. Eck, “Using 3d visualizations to explore and discover music,” in in Int. Conference on Music Information Retrieval, 2007. [6] P. Knees, M. Schedl, T. Pohle, and G. Widmer, “Exploring Music Collections in Virtual Landscapes,” IEEE MultiMedia, vol. 14, no. 3, pp. 46–54, Jul. 2007. [7] E. Pampalk, A. Rauber, and D. Merkl, “Contentbased Organization and Visualization of Music Archives,” in Proceedings of the Tenth ACM International Conference on Multimedia, New York, NY, USA, 2002, pp. 570–579. [8] A. Rauber, E. Pampalk, and D. Merkl, “The SOMenhanced JukeBox: Organization and Visualization of Music Collections Based on Perceptual Models,” Journal of New Music Research, vol. 32, no. 2, pp. 193–210, 2003. [9] “MusicBrainz - The Open Music Encyclopedia.” [Online]. Available: http://musicbrainz.org/. [Accessed: 03-May-2014]. [10] ˈThe Google Geocoding API” [Online]. Available: https://developers.google.com/maps/documentation/ geocoding/. [Accessed: 03-May-2014] 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ON COMPARATIVE STATISTICS FOR LABELLING TASKS: WHAT CAN WE LEARN FROM MIREX ACE 2013? John Ashley Burgoyne Universiteit van Amsterdam [email protected] W. Bas de Haas Universiteit Utrecht [email protected] ABSTRACT For mirex 2013, the evaluation of audio chord estimation (ace) followed a new scheme. Using chord vocabularies of differing complexity as well as segmentation measures, the new scheme provides more information than the ace evaluations from previous years. With this new information, however, comes new interpretive challenges. What are the correlations among different songs and, more importantly, different submissions across the new measures? Performance falls off for all submissions as the vocabularies increase in complexity, but does it do so directly in proportion to the number of more complex chords, or are certain algorithms indeed more robust? What are the outliers, songalgorithm pairs where the performance was substantially higher or lower than would be predicted, and how can they be explained? Answering these questions requires moving beyond the Friedman tests that have most often been used to compare algorithms to a richer underlying model. We propose a logistic-regression approach for generating comparative statistics for mirex ace, supported with generalised estimating equations (gees) to correct for repeated measures. We use the mirex 2013 ace results as a case study to illustrate our proposed method, including some of interesting aspects of the evaluation that might not apparent from the headline results alone. 1. INTRODUCTION Automatic chord estimation (ace) has a long tradition within the music information retrieval (mir) community, and chord transcriptions are generally recognised as a useful mid-level representation in academia as well as in industry. For instance, in an academic context it has been shown that chords are interesting for addressing musicological hypotheses [3,13], and that they can be used as a mid-level feature to aid in retrieval tasks like cover-song detection [7,10 ]. 
Johan Pauwels is no longer affiliated with stms. Data and source code to reproduce this paper, including all statistics and figures, are available from http://bitbucket.org/jaburgoyne/ismir-2014.

© John Ashley Burgoyne, W. Bas de Haas, Johan Pauwels. Licensed under a Creative Commons Attribution 4.0 International License (cc by 4.0). Attribution: John Ashley Burgoyne, W. Bas de Haas, Johan Pauwels. “On comparative statistics for labelling tasks: What can we learn from mirex ace 2013?”, 15th International Society for Music Information Retrieval Conference, 2014.

In an industrial setting, music start-ups like Riffstation 1 and Chordify 2 use ace in their music teaching tools, and at the time of writing, Chordify attracts more than 2 million unique visitors every month [6]. In order to compare different algorithmic approaches in an impartial setting, the Music Information Retrieval Evaluation eXchange (mirex) introduced an annual ace task in 2008. Since then, between 11 and 18 algorithms have been submitted each year by between 6 and 13 teams. Despite the fact that ace algorithms are used outside of academic environments, and even though the number of mirex participants has decreased slightly over the last three years, the problem of automatic chord estimation is nowhere near solved.

1 http://www.riffstation.com/
2 http://chordify.net

Automatically extracted chord sequences have classically been evaluated by calculating the chord symbol recall (csr), which reflects the proportion of correctly labelled chords in a single song, and a weighted chord symbol recall (wcsr), which weights the average csr of a set of songs by their length. On fresh validation data, the best-performing algorithms in 2013 achieved a wcsr of only 75 percent, and that only when the range of possible chords was restricted exclusively to the 25 major, minor and “no-chord” labels; the figure drops to 60 percent when the evaluation is extended to include seventh chords (see Table 1).

Algorithm  # Types    I   II  III   IV    V   VI  VII  VIII
ko2            7     76   74   72   60   58   84   79    89
nmsd2         10     75   71   69   59   57   82   79    86
cb4           13     76   72   70   59   57   85   80    90
nmsd1         10     74   71   69   58   56   83   79    86
cb3           13     76   72   70   58   56   85   81    89
ko1            7     75   71   69   54   52   83   80    88
pp4            5     69   66   64   51   49   83   78    87
pp3            2     70   68   65   50   48   83   82    84
cf2           10     71   67   65   49   47   83   83    83
ng1            2     71   67   65   49   46   82   79    86
ng2            5     67   63   61   44   43   82   81    83
sb8            2      9    7    6    5    5   51   92    35

Table 1. Number of supported chord types and mirex results on the Billboard 2013 test set for all 2013 ace submissions. I: root only; II: major-minor vocabulary; III: major-minor vocabulary with inversions; IV: major-minor vocabulary with sevenths; V: major-minor vocabulary with sevenths and inversions; VI: mean segmentation score; VII: under-segmentation; VIII: over-segmentation. Adapted from the mirex Wiki.

mirex is a terrific platform for evaluating the performance of ace algorithms, but by 2010 it was already being recognised that the metrics could be improved. At that time, they included only csr and wcsr using a vocabulary of 12 major chords, 12 minor chords and a “no-chord” label. At ismir 2010, a group of ten researchers met to discuss their dissatisfaction. In the resulting ‘Utrecht Agreement’, 3 it was proposed that future evaluations should include more diverse chord vocabularies, such as seventh chords and inversions, as the 25-chord vocabulary was considered a rather coarse representation of tonal harmony. Furthermore, the group agreed that it was important to include a measure of segmentation quality in addition to csr and wcsr. At approximately the same time, Christopher Harte proposed a formalisation of measures that implemented the aspirations indicated in the Utrecht agreement [8]. Recently, Pauwels and Peeters reformulated and extended Harte’s work with the precise aim of handling differences in chord vocabulary between annotated ground truth and algorithmic output on one hand, and among the output of different algorithms on the other hand [15]. They also performed a rigorous re-evaluation of all mirex ace submissions from 2010 to 2012. As of mirex 2013, these revised evaluation procedures, including the chord-sequence segmentation evaluation suggested by Harte [8] and Mauch [12], have been adopted in the context of the mirex ace task.

3 http://www.music-ir.org/mirex/wiki/The_Utrecht_Agreement_on_Chord_Evaluation

mirex ace evaluation has also typically included comparative statistics to help determine whether the differences in performance between pairs of algorithms are statistically significant. Traditionally, Friedman’s anova has been used for this purpose, accompanied by Tukey’s Honest Significant Difference tests for each pair of algorithms. Friedman’s anova is equivalent to a standard two-way anova with the actual measurements (in our case wcsr or directional Hamming distance [dhd], the new segmentation measure) replaced by the rank of each treatment (in our case, each algorithm) on that measure within each block (in our case, for each song) [11]. The rank transformation makes Friedman’s anova an excellent ‘one size fits all’ approach that can be applied with minimal regard to the underlying distribution of the data, but these benefits come with costs. Like any nonparametric test, Friedman’s anova can be less powerful than parametric alternatives where the distribution is known, and the rank transformation can obscure information inherent to the underlying measurement, magnifying trivial differences and neutralising significant inter-correlations. But there is no need to pay the costs of Friedman’s anova for evaluating chord estimation. Fundamentally, wcsr is a proportion, specifically the expected proportion of audio frames that an estimation algorithm will label correctly, and as such, it fits naturally into logistic regression (i.e., a logit model). Likewise, dhd is constrained to fall between 0 and 100 percent, and thus it is also suitable for the same type of analysis. The remainder of this paper describes how logistic regression can be used to compare chord estimation algorithms, using mirex results from 2013 to illustrate four key benefits: easier interpretation, greater statistical power, built-in correlation estimates for identifying relationships among algorithms, and better detection of outliers.
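Since csr and wcsr are nothing more than proportions of correctly labelled frames (per song, and pooled with song-length weights across songs), a minimal sketch of how they are assembled from per-song frame counts may help fix the quantities that the models below operate on; the variable names and numbers are hypothetical rather than taken from the mirex tooling.

```python
import numpy as np

def csr(correct_frames, total_frames):
    """Chord symbol recall for one song: the proportion of correctly labelled frames."""
    return correct_frames / total_frames

def wcsr(correct_frames, total_frames):
    """Weighted chord symbol recall over a set of songs.

    One entry per song in each argument.  Weighting each song's csr by its
    length is the same as pooling the frame counts.
    """
    correct_frames = np.asarray(correct_frames, dtype=float)
    total_frames = np.asarray(total_frames, dtype=float)
    return correct_frames.sum() / total_frames.sum()

# Hypothetical example: three songs of different lengths.
correct, total = [800, 1500, 300], [1000, 2000, 1000]
print([csr(c, t) for c, t in zip(correct, total)])   # per-song csr: 0.8, 0.75, 0.3
print(wcsr(correct, total))                          # length-weighted: 0.65
```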
2. LOGISTIC REGRESSION WITH GEES

Proportions cannot be distributed normally because they are supported exclusively on [0, 1], and thus they present challenges for traditional techniques of statistical analysis. Logit models are designed to handle these challenges without sacrificing the simplicity of the usual linear function relating parameters and covariates [1, ch. 4]:

\pi(x; \beta) = \frac{e^{x^{\top}\beta}}{1 + e^{x^{\top}\beta}},   (1)

or equivalently

\log \frac{\pi(x; \beta)}{1 - \pi(x; \beta)} = x^{\top}\beta,   (2)

where π represents the relative frequency of ‘success’ given the values of covariates in x and parameters β. In the case of a basic model for mirex ace, x would identify the algorithm and π would be the relative frequency of correct chord labels for that algorithm (i.e., wcsr). In the case of data like ace results, where there are proportions p_i of correct labels over n_i analysis frames rather than binary successes or failures, i indexing all combinations of individual songs and algorithms, logistic regression assumes that each p_i represents the observed proportion of successes among n_i conditionally-independent binary observations, or more formally, that the p_i are distributed binomially:

f_{P \mid N, X}(p \mid n, x; \beta) = \binom{n}{pn}\, \pi^{pn} (1 - \pi)^{(1 - p)n}.   (3)

The expected value for each p_i is naturally π_i = π(x_i; β), the overall relative frequency of success given x_i:

\mathrm{E}[P \mid N, X] = \pi(x; \beta).   (4)

Logistic regression models are most often fit by the maximum-likelihood technique, i.e., one is seeking a vector β̂ to maximise the log-likelihood given the data:

\ell_{P \mid N, X}(\beta; p, n, X) = \sum_i \left[ \log \binom{n_i}{p_i n_i} + p_i n_i \log \pi_i + (1 - p_i) n_i \log(1 - \pi_i) \right].   (5)

One thus solves the system of likelihood equations for β, whereby the gradient of Equation 5 is set to zero:

\nabla_{\beta}\, \ell_{P \mid N, X}(\beta; p, n, X) = \sum_i (p_i - \pi_i)\, n_i x_i = 0   (6)

and so

\sum_i p_i n_i x_i = \sum_i \pi_i n_i x_i.   (7)

In the case of mirex ace evaluation, each x_i is simply an indicator vector to partition the data by algorithm, and thus β̂ is the parameter vector for which π_i equals the song-length-weighted mean over all p_i for that algorithm.

2.1 Quasi-Binomial Models

Under a strict logit model, the variance of each p_i is inversely proportional to n_i:

\operatorname{var}[P \mid N, X] = \frac{1}{n}\, \pi(1 - \pi).   (8)

Equation 8 only holds, however, if the estimates of chord labels for each audio frame are independent. For ace, this is unrealistic: only the most naïve algorithms treat every frame independently. Some kind of time-dependence structure is standard, most frequently a hidden Markov model or some close derivative thereof. Hence one would expect that the variance of wcsr estimates should be rather larger than the basic logit model would suggest. This type of problem is extremely common across disciplines, so much so that it has been given a name, over-dispersion, and some authors go so far as to state that ‘unless there are good external reasons for relying on the binomial assumption [of independence], it seems wise to be cautious and to assume that over-dispersion is present to some extent unless and until it is shown to be absent’ [14, p. 125]. One standard approach to handling over-dispersion is to use a so-called quasi-likelihood [1, §4.7]. In case of logistic regression, this typically entails a modification to the assumption on the distribution of the p_i that includes an additional dispersion parameter φ. The expected values are the same as a standard binomial model, but

\operatorname{var}[P \mid N, X] = \frac{\phi}{n}\, \pi(1 - \pi).   (9)

These models are known as quasi-likelihood models because one loses a closed-form solution for the actual probability distribution f_{P|N,X}; one knows only that the p_i behave something like binomially-distributed variables, with identical means but proportionally more variance. The parameter estimates β̂ and predictions π(·; β̂) for a quasi-binomial model are the same as ordinary logistic regression, but the estimated variance-covariance matrices are scaled by the estimated dispersion parameter φ̂ (and likewise the standard errors are scaled by its square root). The dispersion parameter is estimated so that the theoretical variance matches the empirical variance in the data, and because of the form of Equation 9, it renders any scaling considerations for the n_i moot. Other approaches to handling over-dispersion include beta-binomial models [1, §13.3] and beta regression [5], but we prefer the simplicity of the quasi-likelihood model.

2.2 Generalised Estimating Equations (gees)

The quasi-binomial model achieves most of what one would be looking for when evaluating ace for mirex: it handles proportions naturally, is consistent with the weighted averaging used to compute wcsr, and adjusts for over-dispersion in a way that also eliminates any worries about scaling. Nonetheless, it is slightly over-conservative for evaluating ace. As discussed earlier, quasi-binomial models are necessary to account for over-dispersion, and one important source of over-dispersion in these data is the lack of independence of chord estimates from most algorithms within the same song. mirex exhibits another important violation of the independence assumption, however: all algorithms are tested on the same sets of songs, and some songs are clearly more difficult than others. Put differently, one does not expect the algorithms to perform completely independently of one another on the same song but rather expects a certain correlation in performance across the set of songs. By taking that correlation into account, one can improve the precision of estimates, particularly the precision of pair-wise comparisons [1, §10.1].

A relatively straightforward variant of quasi-likelihood known as generalised estimating equations (gees) incorporates this type of correlation [1, ch. 11]. With the gee approach, rather than predicting each p_i individually, one predicts complete vectors of proportions p_i for each relevant group, much as Friedman’s test seeks to estimate ranks within each group. For ace, the groups are songs, and thus one considers the observations to be vectors p_i, one for each song, where p_{ij} represents the csr or segmentation score for algorithm j on song i. Analogous to the case of ordinary quasi-binomial or logistic regression,

\mathrm{E}[P_j \mid N, X_j] = \pi(x_j; \beta).   (10)

Likewise, analogous to the quasi-binomial variance,

\operatorname{var}[P_j \mid N, X_j] = \frac{\phi}{n}\, \pi_j(1 - \pi_j).   (11)

Because the gee approach is concerned with vector-valued estimates rather than point estimates, it also involves estimating a full variance-covariance matrix. In addition to β and φ, the approach requires a further vector of parameters α and an a priori assumption on the correlation structure of the P_j in the form of a function R(α) that yields a correlation matrix. (One might, for example, assume that the P_j are exchangeable, i.e., that every pair shares a common correlation coefficient.) Then if B is a diagonal matrix such that B_{jj} = var[P_j | N, X_j],

\operatorname{cov}[P \mid N, X] = B^{1/2} R(\alpha)\, B^{1/2}.   (12)

If all of the P_j are uncorrelated with each other, then this formula reduces to the basic quasi-binomial model, which assumes a diagonal covariance matrix. The final step of gee estimation adjusts Equation 12 according to the actual correlations observed in the data, and as such, gees are quite robust in practice even when the a priori assumptions about the correlation structure are incorrect [1, §11.4.2].
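As a concrete illustration of Equations 5-9 (and not the code behind the published results), the sketch below exploits the fact that each x_i is an indicator vector: solving Equation 7 reduces to song-length-weighted means per algorithm, after which a dispersion estimate can be read off the residuals. The Pearson-based estimator for φ is a standard quasi-likelihood choice and is assumed here rather than stated in the text.

```python
import numpy as np

def fit_indicator_logit(p, n, algorithm):
    """Closed-form fit of the per-algorithm (quasi-)binomial logit model.

    p, n, algorithm have one entry per (song, algorithm) pair: the proportion
    of correctly labelled frames, the number of frames, and an algorithm label.
    Because x_i is an indicator vector, pi_hat for an algorithm is the
    song-length-weighted mean of its p_i (its wcsr) and beta_hat is the
    corresponding log-odds.  phi_hat is the usual Pearson-based dispersion
    estimate for a quasi-binomial model (an assumption of this sketch).
    """
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    algorithm = np.asarray(algorithm)
    labels = np.unique(algorithm)

    pi_hat = {a: np.average(p[algorithm == a], weights=n[algorithm == a])
              for a in labels}
    beta_hat = {a: np.log(pi_hat[a] / (1.0 - pi_hat[a])) for a in labels}

    mu = np.array([pi_hat[a] for a in algorithm])
    pearson = np.sum(n * (p - mu) ** 2 / (mu * (1.0 - mu)))
    phi_hat = pearson / (p.size - labels.size)
    return pi_hat, beta_hat, phi_hat
```

A full gee fit with an exchangeable working correlation, as in Equations 10-12, would additionally pool the per-song vectors of proportions; implementations of gees are available in standard statistical packages.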
3. ILLUSTRATIVE RESULTS

mirex ace 2013 evaluated 12 algorithms according to a battery of eight rubrics (wcsr on five harmonic vocabularies and three segmentation measures) on each of three different data sets (the Isophonics set, including music from the Beatles, Queen, and Zweieck [12], and two versions of the McGill Billboard set, including music from the American pop charts [4]). There is insufficient space to present the results of logistic regression on all combinations, and so we will focus on a single one of the data sets, the Billboard 2013 test set. In some cases, logistic regression allows us to speak to all measures (11 592 observations), but in general, we will also restrict ourselves to discussing the newest and most challenging of the harmonic vocabularies for wcsr: Vocabulary V (1932 observations), which includes major chords, minor chords, major sevenths, minor sevenths, dominant sevenths, and the complete set of inversions of all of the above. We are interested in four key questions.

1. How do pairwise comparisons under logistic regression compare to pairwise comparisons with Friedman’s anova? Is logistic regression more powerful?
2. Are there differences among algorithms as the harmonic vocabularies get more difficult, or is the drop in performance uniform? In other words, is there a benefit to continuing with so many vocabularies?
3. Are all ace algorithms making similar mistakes, or do they vary in their strengths and weaknesses?
4. Which algorithm-song pairs exhibited unexpectedly good or bad performance, and is there anything to be learned from these observations?

3.1 Pairwise Comparisons

The boxplots in Figure 1 give a more detailed view of the performance of each algorithm than Table 1. The figure is restricted to Vocabulary V, with the algorithms in descending order by wcsr. Figure 1a comes from Friedman’s anova weighted by song length, and thus its y-axis reflects not csr directly but the per-song ranks with respect to csr. Figure 1b comes from quasi-binomial regression estimated with gees, as described in Section 2. Its y-axis does reflect per-song csr.

[Figure 1: two boxplot panels over the twelve algorithms; (a) Friedman’s anova, y-axis “Rank per Song (1 low; 12 high)”; (b) Logistic Regression, y-axis “Chord-Symbol Recall (CSR)”.]
Figure 1. Boxplots and compact letter displays for the mirex ace 2013 results on the Billboard 2013 test set with vocabulary V (seventh chords and inversions), weighted by song length. Bold lines represent medians and filled dots means. N = 161 songs per algorithm. Given the respective models, there are insufficient data to distinguish among algorithms sharing a letter, correcting to hold the fdr at α = .005. Although Friedman’s anova detects 2 more significant pairwise differences than logistic regression (45 vs. 43), it operates on a different scale than csr and misorders algorithms relative to wcsr.

Above the boxplots, all significant pairwise differences are recorded as a compact letter display. In the interest of reproducible research, we used a stricter α = .005 threshold for reporting pairwise comparisons with the more contemporary false-discovery-rate (fdr) approach of Benjamini and Hochberg, as opposed to more traditional Tukey tests at α = .05 [2, 9]. Within either of the subfigures, the difference in performance between two algorithms that share any letter in the compact letter display is not statistically significant. Overall, Friedman’s anova found 2 more significant pairwise differences than logistic regression.

3.2 Effect of Vocabulary

To test the utility of the new evaluation vocabularies, we ran both Friedman anovas (ranked separately for each vocabulary) and logistic regressions and looked for significant interactions among the algorithm, inversions (present or absent from the vocabulary) and the complexity of the vocabulary (root only, major-minor, or major-minor with 7ths). Under Friedman’s anova, there was a significant Algorithm × Complexity interaction, F(22, 9440) = 3.21, p < .001. The logistic regression model identified a significant three-way Algorithm × Complexity × Inversions interaction, χ²(12) = 37.35, p < .001, but the additional interaction with inversions should be interpreted with care: only one algorithm (cf2) attempts to recognise inversions.

3.3 Correlation Matrices

Table 2 presents the inter-correlations of wcsr between algorithms, rank-transformed (Spearman’s correlations, analogous to Friedman’s anova) in the upper triangle, and in the lower triangle, as estimated from logistic regression with gees. Significant correlations are marked, again controlling the fdr at α = .005.

         ko2    nmsd2  cb4    nmsd1  cb3    ko1    pp4    pp3    cf2    ng1    ng2    sb8
ko2      –      .25*   .41*   .30*   .34*   −.04   −.22   −.49*  .09    −.54*  .09    −.32*
nmsd2    .07    –      .39*   .60*   .10    −.42*  .08    −.46*  .19    −.42*  .17    −.44*
cb4      .11    −.01   –      .53*   .76*   −.51*  −.16   −.61*  .24*   −.60*  .17    −.44*
nmsd1    −.05   .49*   .12    –      .42*   −.51*  .06    −.53*  .42*   −.56*  .16    −.52*
cb3      .10    −.25*  .47*   −.17   –      −.29*  −.07   −.37*  .17    −.41*  −.03   −.46*
ko1      .03    −.20   −.46*  −.45*  −.19   –      −.05   .68*   −.49*  .68*   −.50*  .00
pp4      −.41*  −.19   −.30*  −.08   −.26*  −.10   –      .22    .06    .04    −.09   −.32*
pp3      −.44*  −.36*  −.48*  −.45*  −.14   .42*   .37*   –      −.51*  .85*   −.54*  .08
cf2      −.03   .00    .09    .27*   −.08   −.41*  −.03   −.48*  –      −.47*  .50*   −.33*
ng1      −.35*  −.33*  −.38*  −.44*  −.17   .50*   .00    .66*   −.48*  –      −.40*  .08
ng2      .05    .02    .08    .17    −.16   −.52*  .05    −.48*  .48*   −.40*  –      −.16
sb8      −.01   −.06   −.09   −.10   −.08   .05    −.03   .04    −.14   −.10   −.11   –

Table 2. Pearson’s correlations on the coefficients from logistic regression (wcsr) for the Billboard 2013 test set with vocabulary V (lower triangle); Spearman’s correlations for the same data (upper triangle). N = 161 songs per cell. Starred correlations are significant at α = .005, controlling for the fdr. A set of algorithms (viz., ko1, pp3, ng1, and sb8) stands out for negative correlations with the top performers; in general, these algorithms did not attempt to recognise seventh chords.

Positive correlations do not necessarily imply that the algorithms perform similarly; rather it implies that they find the same songs relatively easy or difficult.
Negative correlations imply that songs that one algorithm finds difficult are relatively easy for the other algorithm.

3.4 Outliers

To identify outliers, we considered all evaluations on the Billboard 2013 test set and examined the distribution of residuals. Chauvenet’s criterion for outliers in a sample of this size is to lie more than 4.09 standard deviations from the mean [16, §6.2]. Under Friedman’s anova, Chauvenet’s criterion identified 7 extreme data points. These are all for algorithm sb8, a submission with a programming bug that erroneously returned alternating C- and B-major chords regardless of the song, on songs that were so difficult for most other algorithms that the essentially random approach of the bug did better. Under the logistic regression model, the criterion identified 26 extreme points. Here, the unexpected behaviour was primarily for songs that are tuned a quarter-tone off from standard tuning (A4 = 440 Hz). The ground truth necessarily is ‘rounded off’ to standard tuning in one direction or the other, but in cases where an otherwise high-performing algorithm happened to round off in the opposite direction, the performance is markedly low.

4. DISCUSSION

We were surprised to find that in terms of distinguishing between algorithms, Friedman’s anova was in fact more powerful than logistic regression, detecting a few extra significant pairs. Nonetheless, the two approaches yield substantially equivalent broad conclusions: that a group of top performers – cb3, cb4, ko2, nmsd1, and nmsd2 – are statistically indistinguishable from each other, with ko1 also indistinguishable from the lower end of this group. Moreover, having now benefited from years of study, wcsr is a reasonably intuitive and well-motivated measure of ace performance, and it is awkward to have to work on Friedman’s rank scale instead, especially since it ultimately ranks the algorithms’ overall performance in a slightly different order than the headline wcsr-based results.

Friedman’s anova did exhibit less power for our question about interactions between algorithms and differing chord vocabularies. Again, wcsr as a unit and as a concept is highly meaningful for chord estimation, and there is a conceptual loss from rank transformation. Given the rank transformation, Friedman’s anova can only be sensitive to reconfigurations of relative performance as the vocabularies become more difficult; logistic regression can also be sensitive to different effect sizes across algorithms even when their relative ordering remains the same. It was encouraging to see that under either statistical model, there was a benefit to evaluating with multiple vocabularies.

That encouraged us to examine the inter-correlations for the performance of the algorithms. Figure 2 summarises the original correlation matrix in Table 2 more visually by using the correlations from logistic regression as the basis of a hierarchical clustering. Two clear groups emerge, both from the clustering and from minding negative correlations in the original matrix: one relatively low-performing group including ko1, pp3, ng1, and sb8, and one relatively high-performing group including all others but for perhaps pp4, which does not seem to correlate strongly with any other algorithm. The shape of the equivalent tree based on Spearman’s correlations is similar but for joining pp4 with sb8 instead of the high-performing group.
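In outline, the tree in Figure 2 can be reproduced with standard tools by treating 1 − r as Pearson’s distance and applying complete-linkage agglomerative clustering to the algorithm-by-algorithm correlation matrix. The sketch below assumes a symmetric correlation matrix (for instance, the lower triangle of Table 2 mirrored) and illustrates the procedure; it is not the authors’ script.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def cluster_algorithms(corr):
    """Complete-linkage clustering of algorithms from a correlation matrix.

    corr : square, symmetric array of inter-algorithm correlations in [-1, 1].
    Returns a scipy linkage matrix that dendrogram() can draw.
    """
    dist = 1.0 - np.asarray(corr, dtype=float)   # Pearson's distance
    np.fill_diagonal(dist, 0.0)                  # exact zeros on the diagonal
    return linkage(squareform(dist, checks=False), method="complete")

# Usage (hypothetical names): dendrogram(cluster_algorithms(corr), labels=algorithm_names)
# draws a tree analogous to Figure 2.
```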
Table 1 uncovers the secret behind the low performers: ko1 excepted, none of the low-performing algorithms attempt to recognise seventh chords, which comprise 29 percent of all chords under Vocabulary V. Furthermore, we performed an additional evaluation of seventh chords only, in the style of [15] and using their software available online. 4 From the resulting low score of ko1, we can deduce that this algorithm is able to recognise seventh chords in theory, but that it was most likely trained on the relatively seventh-poor Isophonics corpus (only 15 percent of all chords). ko2 is the same algorithm trained directly on the mirex Billboard training corpus, and with that training, it becomes a top performer.

4 https://github.com/jpauwels/MusOOEvaluator

[Figure 2: hierarchical-clustering dendrogram over the twelve algorithms; y-axis “Pearson’s Distance” (0.0 to 0.8).]
Figure 2. Hierarchical clustering of algorithms based on wcsr for the Billboard 2013 test set with vocabulary V, Pearson’s distance as derived from the estimated correlation matrix under logistic regression, and complete linkage. The group of algorithms that is negatively correlated with the top performers appears at the left. pp4 stands out as the most idiosyncratic performer.

Our analysis of outliers again showed Friedman’s anova to be less powerful than logistic regression, as one would expect given the range restrictions of the rank transformation. But here also the more important advantage of logistic regression is the ability to work on the wcsr scale. Outliers under the logistic regression model are also points that have an unusually strong effect on the reported results. In our analysis, they highlight the practical consequences of the well-known problem of atypically-tuned commercial recordings. Although we would not propose deleting outliers, it is sobering to know that tuning problems may be having an outsized effect on our headline evaluation figures. It might be worth considering allowing algorithms their best score in keys up to a semitone above or below the ground truth.

Overall, we have shown that as ace becomes more established and its evaluation more thorough, it is useful to use a subtler statistical model for comparative analysis. We recommend that future mirex ace evaluations use logistic regression in preference to Friedman’s anova. It preserves the natural units and scales of wcsr and segmentation analysis, is more powerful for many (although not all) statistical tests, and when augmented with gees, it allows for a detailed correlational analysis of which algorithms tend to have problems with the same songs as others and which have perhaps genuinely broken innovative ground. This is by no means to suggest that Friedman’s test is a bad test in general – its near-universal applicability makes it an excellent choice in many circumstances, including many other mirex evaluations – but for ace, we believe that the extra understanding logistic regression can offer may help researchers predict which techniques are most promising for breaking the current performance plateau.

5. REFERENCES

[1] A. Agresti. Categorical Data Analysis. Wiley, New York, 2nd edition, 2007.
[2] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B, 57(1):289–300, 1995.
[3] J. A. Burgoyne. Stochastic Processes and Database-Driven Musicology. PhD thesis, McGill U., Montréal, QC, 2012.
[4] J. A. Burgoyne, J. Wild, and I. Fujinaga. An expert ground-truth set for audio chord recognition and music analysis. In Proc. Int. Soc. Music Inf. Retr., pages 633–38, Miami, FL, 2011.
[5] S. Ferrari and F. Cribari-Neto. Beta regression for modelling rates and proportions. J. Appl. Stat., 31(7):799–815, 2004.
[6] W. B. de Haas, J. P. Magalhães, D. ten Heggeler, G. Bekenkamp, and T. Ruizendaal. Chordify: Chord transcription for the masses. Demo at the Int. Soc. Music Inf. Retr. Conf., Curitiba, Brazil, 2012.
[7] W. B. de Haas, J. P. Magalhães, R. C. Veltkamp, and F. Wiering. Harmtrace: Improving harmonic similarity estimation using functional harmony analysis. In Proc. Int. Soc. Music Inf. Retr., pages 67–72, Miami, FL, 2011.
[8] C. Harte. Towards Automatic Extraction of Harmony Information from Music Signals. PhD thesis, Queen Mary, U. London, 2010.
[9] V. E. Johnson. Revised standards for statistical evidence. P. Nat’l Acad. Sci. USA, 110(48):19313–17, 2013.
[10] M. Khadkevich and M. Omologo. Large-scale cover song identification using chord profiles. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 233–38, Curitiba, Brazil, 2013.
[11] M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li. Applied Linear Statistical Models. McGraw-Hill, Boston, MA, 5th edition, 2005.
[12] M. Mauch. Automatic Chord Transcription from Audio Using Computational Models of Musical Context. PhD thesis, Queen Mary, U. London, 2010.
[13] M. Mauch, S. Dixon, C. Harte, M. Casey, and B. Fields. Discovering chord idioms through Beatles and Real Book songs. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 255–58, Vienna, Austria, 2007.
[14] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, Boca Raton, FL, 2nd edition, 1989.
[15] J. Pauwels and G. Peeters. Evaluating automatically estimated chord sequences. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 749–53, Vancouver, British Columbia, 2013.
[16] J. R. Taylor. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books, Sausalito, CA, 2nd edition, 1997.

MERGED-OUTPUT HMM FOR PIANO FINGERING OF BOTH HANDS

Eita Nakamura, National Institute of Informatics, Tokyo 101-8430, Japan, [email protected]
Nobutaka Ono, National Institute of Informatics, Tokyo 101-8430, Japan, [email protected]
Shigeki Sagayama, Meiji University, Tokyo 164-8525, Japan, [email protected]

ABSTRACT

This paper discusses a piano fingering model for both hands and its applications. One of our motivations behind the study is automating piano reduction from ensemble scores. For this, quantifying the difficulty of piano performance is important, for which a fingering model of both hands should be relevant. Such a fingering model is proposed that is based on a merged-output hidden Markov model and can be applied to scores in which the voice part for each hand is not indicated. The model is applied to the decision of fingering for both hands and to voice-part separation, the automation of which is itself of great use and was previously difficult. A measure of difficulty of performance based on the fingering model is also proposed and yields reasonable results.
1. INTRODUCTION Music arrangement is one of the most important musical activities, and its automation certainly has attractive applications. One common form is piano arrangement of ensemble scores, whose purposes are, among others, to enable pianists to enjoy a wider variety of pieces and to accompany other instruments by substituting the role of orchestra. While certain piano reductions have high technicality and musicality as in the examples by Liszt [8], those for vocal scores of operas and reduction scores of orchestra accompaniments are often faithful to the original scores in most parts. The most faithful reduction score is obtained by gathering every note in the original score, but the result can be too difficult to perform, and arrangement such as deleting notes is often in order. In general, the difficulty of a reduction score can be reduced by arrangement, but then the fidelity also decreases. If one can quantify the performance difficulty and the fidelity to the original score, the problem of “minimal” piano reduction can be considered as an optimization problem of the fidelity given constraints on the performance difficulty. A method for guitar arrangement based on probabilistic model with a similar formalization is proposed in Ref. [5]. This paper is a step toward a realization of piano reduction algorithm based on the formalization. The playability of piano passages is discussed in Refs. [3, 2] in connection with automatic piano arrangement. There, constraints such as the maximal number of notes in each hand, the maximal interval being played, say, 10th, and the minimal time interval of a repeated note are considered. Although these constraints are simple and effective to some extent, the actual situation is more complicated as manifested in the fact that, for example, the playability can change with tempos and players can arpeggiate chords that cannot be played simultaneously. In addition, the playability can depend on the technical level of players [3]. Given these problems, it seems appropriate to consider performance difficulty that takes values in a range. There are various measures and causes of performance difficulty including player’s movements and notational complexity of the score [12, 1, 15]. Here we focus on the difficulty of player’s movements, particularly piano fingering, which is presumably one of the most important factors. The difficulty of fingering is closely related to the decision of fingering [4, 7, 13, 16]. Given the current situation that a method of determining the fingering costs from first principles is not established, however, it is also effective to take a statistical approach, and consider the naturalness of fingering in terms of probability obtained from actual fingering data. With a statistical model of fingering, the most natural fingering can be determined, and one can quantify the difficulty of fingering in terms of naturalness. This will be explained in Secs. 2 and 3. The practical importance of piano fingering and its applications are discussed in Ref. [17]. Since voice parts played by both hands are not a priori separated or indicated in the original ensemble score, a fingering model must be applicable in such a situation. Thus, a fingering model for both hands and an algorithm to separate voice parts are necessary. We propose such a model and an algorithm based on merged-output hidden Markov model (HMM), which is suited for modeling multi-voicepart structured phenomena [10, 11]. 
Since multi-voice-part structure of music is common and voice-part separation can be applied for a wide range of information processing, the results are itself of great importance. 2. MODEL FOR PIANO FINGERING FOR BOTH HANDS c Eita Nakamura, Nobutaka Ono, Shigeki Sagayama. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Eita Nakamura, Nobutaka Ono, Shigeki Sagayama. “ Merged-Output HMM for Piano Fingering of Both Hands ”, 15th International Society for Music Information Retrieval Conference, 2014. 2.1 Model for one hand Before discussing the piano fingering model for both hands, let us discuss the fingering model for one hand. Piano 531 15th International Society for Music Information Retrieval Conference (ISMIR 2014) fingering models and algorithms for decision of fingering have been studied in Refs. [13, 16, 4, 18, 19, 20, 7]. Here we extend the model in Ref. [19] to including chords. Piano fingering for one hand, say, the right hand, is indicated by associating a finger number fn = 1, · · · , 5 (1 = thumb, 2 = the index finger, · · · , 5 = the little finger) to each note pn in a score 1 , where n = 1, · · · , N indexes notes in the score and N is the number of notes. We consider the probability of a fingering sequence f1:N = N (fn )N n=1 given a score, or a pitch sequence, p1:N = (pn )n=1 , which is written as P (f1:N |p1:N ). As explained in detail in Sec. 3.1, an algorithm for fingering decision can be obtained by estimating the most probable candidate fˆ1:N = argmax P (f1:N |p1:N ). The fingering of a particular note f1:N is more influenced by neighboring notes than notes that are far away in score position. Dependence on neighboring notes is most simply described by that on adjacent notes, and it can be incorporated with a Markov model. It also has advantages in efficiency in maximizing probability and setting model parameters. Although the probability of fingering may depend on inter-onset intervals between notes, the dependence is not considered here for simplicity. As proposed in Ref. [18, 19], the fingering model can be constructed with an HMM. Supposing that notes in score are generated by finger movements and the resulting performed pitches, their probability is represented with the probability that a finger would be used after another finger P (fn |fn−1 ), and the probability that a pitch would result from succeeding two used fingers. The former is called the transition probability, and the latter output probability. The output probability of pitch depends on the previous pitch in addition to the corresponding used fingers, and it is described with a conditional probability P (pn |pn−1 , fn−1 , fn ). In terms of these probabilities, the probability of notes and fingerings is given as P (p1:N , f1:N ) = N P (pn |pn−1 , fn−1 , fn )P (fn |fn−1 ), n=1 (1) where the initial probabilities are written as P (f1 |f0 ) ≡ P (f1 ) and P (p1 |p0 , f0 , f1 ) ≡ P (p1 |f1 ). The probability P (f1:N |p1:N ) can also be given accordingly. To train the model efficiently, we assume some reasonable constraints on the parameters. First we assume that the probability depends on pitches only through their geometrical positions on the keyboard which is represented as a two-dimensional lattice (Fig. 1). We also assume the translational symmetry in the x-direction and the time inversion symmetry for the output probability. 
If the coordinate on the keyboard is written as (p) = ( x (p), y (p)), the assumptions mean that the output probability has a form P (p |p, f, f ) = F ( x (p ) − x (p), y (p ) − y (p); f, f ), and it satisfies F ( x (p ) − x (p), y (p ) − y (p); f, f ) = F ( x (p) − x (p ), y (p) − y (p ); f , f ). A model for each hand can be obtained in this way, and it is written as Fη ( x (p ) − x (p), y (p ) − y (p); f, f ) with η = L, R. 1 We do not consider the so-called finger substitution in this paper. 532 Figure 1. Keyboard lattice. Each key on a keyboard is represented by a point of a two-dimensional lattice. It is further assumed that these probabilities are related by reflection in the x-direction, which yields FL ( x (p ) − x (p), y (p )− y (p); f, f ) = FR ( x (p )− x (p), y (p )− y (p); f, f ). The above model can be extended to be applied for passages with chords, by converting a polyphonic passage to a monophonic passage by virtually arpeggiating the chords [7]. Here, notes in a chord are ordered from low pitch to high pitch. The parameter values can be obtained from fingering data. 2.2 Model for both hands Now let us consider the fingering of both hands in the situation that it is unknown a priori which of the notes are to be played by the left or right hand. The problem can be stated as associating the fingering information (ηn , fn )N n=1 for the pitch sequence p1:N , where ηn = L, R indicates the hand with which the n-th note is played. One might think to build a model of both hands by simply extending the one-hand model and using (ηn , fn ) as a latent variable. However, this is not an effective model as far as it is a first-order Markov model since, for example, probabilistic constraints between two successive notes by the right hand cannot be directly incorporated when they are interrupted by other notes of the left hand. Using higher-order Markov models leads to the problem of increasing number of parameters that is hard to train as well as the increasing computational cost. The underlying problem is that the model cannot capture the structure of dependencies that is stronger among notes in each hand than those across hands. Recently an HMM, called merged-output HMM, is proposed that is suited for describing such voice-part-structured phenomena [10, 11]. The basic idea is to construct a model for both hands by starting with two parallel HMMs, called part HMMs, each of which corresponds to the HMM for fingering of each hand, and then merging the outputs of the part HMMs. Assuming that only one of the part HMMs transits and outputs an observed symbol at each time, the state space of the merged-output HMM is given as a triplet k = (η, fL , fR ) of the hand information η = L, R and fingerings of both hands: η indicate which of the HMMs transits, and fL and fR indicate the current states of the part HMMs. Let the transition and output probabilities 15th International Society for Music Information Retrieval Conference (ISMIR 2014) of the part HMMs be aηf f = Pη (f |f ) and bηf f ( ) = Fη ( ; f, f ) (η = L, R). Then the transition and output probabilities of the merged-output HMM are given as αL aL fL f δ fR fR , η = L; αR aR fR f δ fL fL , η = R, L akk = bkk ( ) = R bL fL f ( ), η = L; bR fR f ( ), η = R, L R 0.12 0.1 0.08 0.06 (2) 0.04 (3) 0.02 0 -40 where δ denotes Kronecker’s delta. Here, αL,R represent the probability of choosing which of the hands to play the note, and practically, they satisfy αL ∼ αR ∼ 1/2. As shown in Ref. 
[11], certain interaction factors can be introduced to Eqs. (2) and (3). Although such interactions may be important in the future [14], we confine ourselves to the case of no interactions in this paper for simplicity. By estimating the most probable sequence k̂1:N , both the optimal configuration of hands η̂1:N , which yields a voice-part separation, and that of fingers (fˆL , fˆR )1:N are obtained. For details of inference algorithms and other aspects of merged-output HMM, see Ref. [11]. αL aL p L p δp R p R , η = L; αR aR p R p δp L p L , η = R, L R bx (y) = δy,pη . 20 40 Figure 2. Histograms of pitch transitions in piano scores for each hand. 3. APPLICATIONS OF THE FINGERING MODEL 3.1 Algorithm for decision of fingering A direct application of the model explained in Secs. 2.1 and 2.2 is the decision of fingering. The algorithm can be derived by applying the Viterbi algorithm. For one hand, the derived algorithm is similar as the one in Ref. [19], but we reevaluated the accuracy since the present model can be applied for polyphonic passages and the details of the models are different. For evaluation, we prepared manually labeled fingerings of classical piano pieces and compared them to the one estimated with the algorithm. The test pieces were Nos. 1, 2, 3, and 8 of Bach’s two-voice inventions, and the introduction and exposition parts from Beethoven’s 8th piano sonata in C minor. The training and test of the algorithm was done with the leave-one-out cross validation method for each piece. To avoid zero frequencies in the training, we added a uniform count of 0.1 for every bin. The averaged accuracy was 56.0% (resp. 55.4%) for the right (resp. left) hand where the number of notes was 5202 (resp. 5539). Since the training data was not big, and we had much higher rate of more than 70% for closed test, the accuracy may improve if a larger set of training data is given. The results were better than the reported values in Ref. [19]. The reason would be that the constraints of the model in the reference was too strong, which is relaxed in the present model. For detailed analysis of the estimation errors, see Ref. [19]. The model explained in the previous section involves both hands and the used hand and fingers are modeled simultaneously. We can alternatively consider the problem of associating fingerings of both hands as first separating voice parts for both hands, and then associating fingerings for notes in each voice part. In this subsection, a simple model that can be used for voice-part separation is given. The model is also based on a simpler merged-output HMM, and it yields more efficient algorithm for voice-part separation. We consider a merged-output HMM with a hidden state x = (η, pL , pR ), where η = L, R indicates the voice part, and pL,R describes the pitch played in each voice part. If the pitch sequence in the score is denoted by (yn )n , the transition and output probabilities are written as axx = 0 Interval [semitone] 2.3 Model for voice-part separation -20 (4) (5) Here the transition probability aL,R pp describes the pitch sequence in each voice part directly, without any information on fingerings. The corresponding distributions can be obtained from actual data of piano pieces, as shown in Fig. 2. 3.2 Voice-part separation So far we have considered a model of pitches and horizontal intervals for voice-part separation. The voice-partseparation algorithm can be derived by applying the Viterbi algorithm to the above model. 
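To make the voice-part-separation idea concrete, the following is a minimal dynamic-programming sketch in the spirit of the Viterbi decoding described here. It assumes hypothetical per-hand interval log-probability tables (interval_logprob) standing in for distributions such as those in Fig. 2, tracks which hand plays each note together with the most recent note in the other hand, and deliberately omits the vertical-interval constraint and the more efficient formulation discussed next.

```python
import math

def separate_hands(pitches, interval_logprob, hand_logprob=math.log(0.5)):
    """Dynamic-programming hand assignment for a note sequence (sketch).

    pitches          : MIDI pitch numbers in score order.
    interval_logprob : {'L': {...}, 'R': {...}} mapping a pitch interval
                       (current pitch minus the previous pitch in the same
                       hand) to a log-probability; hypothetical stand-ins for
                       per-hand interval distributions.
    hand_logprob     : log alpha_eta, the cost of choosing either hand (~log 1/2).

    State after note n: (hand playing note n, index of the most recent note in
    the other hand, or None).  Each hand's chain only advances when that hand
    plays, mirroring the merged-output construction.
    """
    UNSEEN = -20.0  # floor log-probability for intervals not in the table

    def step(hand, prev_pitch, pitch):
        if prev_pitch is None:                      # first note of this hand
            return 0.0
        return interval_logprob[hand].get(pitch - prev_pitch, UNSEEN)

    trellis = [{('L', None): (hand_logprob, None),
                ('R', None): (hand_logprob, None)}]
    for n in range(1, len(pitches)):
        frame = {}
        for state, (score, _) in trellis[n - 1].items():
            hand, other = state
            for new_hand in ('L', 'R'):
                if new_hand == hand:
                    prev_pitch, new_other = pitches[n - 1], other
                else:
                    prev_pitch = None if other is None else pitches[other]
                    new_other = n - 1
                cand = score + hand_logprob + step(new_hand, prev_pitch, pitches[n])
                key = (new_hand, new_other)
                if key not in frame or cand > frame[key][0]:
                    frame[key] = (cand, state)
        trellis.append(frame)

    # Backtrace the best final state to recover the hand sequence.
    state = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    hands = []
    for n in range(len(pitches) - 1, -1, -1):
        hands.append(state[0])
        state = trellis[n][state][1]
    return hands[::-1]
```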
In fact, a voice part in the score played by one hand is also constrained by vertical intervals since it is physically difficult to play a chord containing an interval far wider than a octave by one hand. The constraint on the vertical intervals can also be introduced in terms of probability. Voice-part separation between two hands can be done with the model described in Sec. 2.3, and the algorithm can be obtained by the Viterbi algorithm. In fact, we can derive a more efficient estimation algorithm which is effectively equivalent since the model has noiseless observations as in Eq. (5). It is obtained by minimizing the following potential with respect to the variables {(ηn , hn )}, hn = 0, 1, · · · , Nh for 533 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Table 1. Error rates of the voice-part-separation algorithms.The 0-HMM (resp. 1-HMM, 2-HMM) indicates the algorithm with the zeroth-order (resp. first-order, second-order) HMM. Pieces # Notes 0-HMM [%] 1-HMM [%] 2-HMM [%] Merged-output HMM [%] Bach (15 pcs) 9638 5.1 5.3 6.1 1.9 Beethoven (2 pcs) 18144 13.0 11.1 11.5 9.28 Chopin (5 pcs) 8508 5.7 4.0 4.29 3.8 Debussy (3 pcs) 3360 17.8 14.8 14.8 18.7 Total 39650 9.9 8.5 8.9 7.1 each note: V (η, h) = − ln Q(ηn−1 , hn−1 ; ηn , hn ), (6) n Q(ηn−1 , hn−1 ; ηn , hn ) (ηn ) αηn ayn−1 ,yn δhn ,hn−1 +1 , = (ηn ) , αηn ayn−2−h ,y δ n−1 n hn ,0 (a) Passage in Bach’s two-voice invention No. 1. ηn = ηn−1 ; ηn = ηn−1 . (7) Here hn is necessary to memorize the current state of the voice part opposite of ηn . The minimization of the potential can be done with dynamic programming incrementally for each n. The estimation result is the same as the one with the Viterbi algorithm applied to the model when Nh is sufficiently large, and we confirmed that Nh = 50 is sufficient to provide a good approximation. The algorithm was evaluated by applying it to several classical piano pieces. The used pieces were all pieces of Bach’s two-voice inventions, the first two piano sonatas by Beethoven, Chopin’s Etude Op. 10 Nos. 1–5, and the first three pieces in the first book of Debussy’s Préludes. For comparison, we also evaluated algorithms based on lowerorder HMMs. The zeroth-order model with transition and output probabilities P (η) and P (p|η) is almost equivalent to the keyboard splitting method, the first-order model with P (η |η) and P (δp|η, η ) and the second-order model are simple applications of HMMs whose latent variables are hand informations η = L, R. The results are shown in Table 1. In total, the mergedoutput HMM yielded the lowest error rate, with which relatively accurate voice part separation can be done. On the other hand, there were less changes in results for the lower-order HMMs, showing that the effectiveness of the merged-output HMM. In Debussy’s pieces, the error rates were relatively high since the pieces necessitate complex fingerings with wide movements of the hands. An example of the voice-part separation result is shown in Fig. 3. (b) Piano role representation of the voice-part separation result. Two voice parts are colored red and blue. Figure 3. Example of a voice-part separation result. notes in the time range of [t − Δt/2, t + Δt/2], and f (t) be the corresponding fingerings, where Δt is a width of the time range to define the time rate. Then it is given as D(t) = − ln P (p(t), f (t))/Δt. 
(8) Since the minimal time interval of successive notes are about a few 10 milli seconds and it is hard to imagine that difficulty is strongly influenced by notes that are separated more than 10 seconds, it is natural to set Δt within these extremes. The right-hand side is given by Eq. (1). It is possible to calculate D(t) for a score without indicated fingerings by replacing f (t) with the estimated fingerings fˆ(t) with the model in Sec. 2. In addition to the difficulty for both hands, that for each hand DL,R (t) can also be defined similarly. Fig. 4 shows some examples of DL,R (t) calculated for several piano pieces. Here Δt was set to 1 sec. Although it is not easy to evaluate the quantity in a strict way, the results seems reasonable and reflects generic intuition of difficulty. The invention by Bach that can be played by beginners yields DL,R that are less than about 10, the example of Beethoven’s sonata which requires middle-level technicality has DL,R around 20 to 30, and Chopin’s Fantasie Impromptu which involves fast passages and difficult fingerings has DL,R up to about 40. It is also worthy of noting that relatively difficult passages such as the fast chromatique passage of the right hand in the introduction of Beethoven’s sonata and ornaments in the right hand of the 3.3 Quantitative measure of difficulty of performance A measure of performance difficulty based on the naturalness of the fingerings can be obtained by the probabilistic fingering model. Although global structures in scores may influence the difficulty, we concentrate on the effect of local structures. It is supposed that the difficulty is additive with regard to performed notes and an increasing function of tempo. A quantity satisfying these conditions is the time rate of probabilistic cost. Let p(t) denote the sequence of 534 15th International Society for Music Information Retrieval Conference (ISMIR 2014) (a) Difficulty for right hand DR (b) Difficulty for left hand DL Figure 4. Examples of DR and DL . The red (resp. green, blue) line is for Bach’s two-voice invention No.=1, (resp. Introduction and exposition parts of the first movement of Beethoven’s eighth piano sonata, Chopin’s Fantasie Impromptu). pp. 10–12, 2012. slow part of the Fantasie Impromptu are also captured in terms of DR . [2] S.-C. Chiu et al., “Automatic system for the arrangement of piano reductions,” Proc. AdMIRe, 2009. 4. CONCLUSIONS In this paper, we considered a piano fingering model of both hands and its applications especially toward a piano reduction algorithm. First we reviewed a piano fingering model for one hand based on HMM, and then constructed a model for both hands based on merged-output HMM. Next we applied the model for constructing an algorithm for fingering decision and voice-part-separation algorithm and obtained a measure of performance difficulty. The algorithm for fingering decision yielded better results than the previously proposed one by a modification in details of the model. The results of voice-part separation is quite good and encouraging. The proposed measure of performance difficulty successfully captures the dependence on tempos and complexity of pitches and finger movements. The next step to construct a piano reduction algorithm according to the formalization mentioned in the Introduction is to quantify the fidelity of the arranged score to the original score and to integrate it with the constraints of performance difficulty. The fidelity can be described with edit probability, similarly as in Ref. 
[5], and an arrangement model can be obtained by integrating the fingering model with the edit probability. We are currently working on these issues and the results will be reported elsewhere. 5. ACKNOWLEDGMENTS This work is supported in part by Grant-in-Aid for Scientific Research from Japan Society for the Promotion of Science, No. 23240021, No. 26240025 (S.S. and N.O.), and No. 25880029 (E.N.). [3] K. Fujita et al., “A proposal for piano score generation that considers proficiency from multiple part (in Japanese),” Tech. Rep. SIGMUS, MUS-77, pp. 47–52, 2008. [4] M. Hart and E. Tsai, “Finding optimal piano fingerings,” The UMAP Journal, 21(1), pp. 167–177, 2000. [5] G. Hori et al., “Input-output HMM applied to automatic arrangement for guitars,” J. Information Processing, 21(2), pp. 264–271, 2013. [6] Z. Ghahramani and M. Jordan, “Factorial Hidden Markov Models,” Machine Learning, 29, pp. 245–273, 1997. [7] A. Al Kasimi et al., “A simple algorithm for automatic generation of polyphonic piano fingerings,” ISMIR, pp. 355–356, 2007. [8] F. Liszt, Musikalische Werke, Serie IV, Breitkopf & Härtel, 1922. [9] J. Musafia, The Art of Fingering in Piano Playing, MCA Music, 1971. [10] E. Nakamura et al., “Merged-output hidden Markov model and its applications to score following and hand separation of polyphonic keyboard music (in Japanese),” Tech. Rep. SIGMUS, 2013-EC-27, 15, 2013. 6. REFERENCES [11] E. Nakamura et al., “Merged-output hidden Markov model for score following of MIDI performance with ornaments, desynchronized voices, repeats and skips,” to appear in Proc. ICMC, 2014. [1] S.-C. Chiu and M.-S. Chen, “A study on difficulty level recognition of piano sheet music,” Proc. AdMIRe, [12] C. Palmer, “Music performance,” Ann. Rev. Psychol., 48, pp. 115–138, 1997. 535 15th International Society for Music Information Retrieval Conference (ISMIR 2014) [13] R. Parncutt et al., “An ergonomic model of keyboard fingering for melodic fragments,” Music Perception, 14(4), pp. 341–382, 1997. [14] R. Parncutt et al., “Interdependence of right and left hands in sight-read, written, and rehearsed fingerings of parallel melodic piano music,” Australian J. of Psychology, 51(3), pp. 204–210, 1999. [15] V. Sébastien et al., “Score analyzer: Automatically determining scores difficulty level for instrumental elearning,” Proc. ISMIR, 2012. [16] H. Sekiguchi and S. Eiho, “Generating and displaying the human piano performance,” 40(6), pp. 167–177, 1999. [17] Y. Takegawa et al., “Design and implementation of a real-time fingering detection system for piano performance,” Proc. ICMC, pp. 67–74, 2006. [18] Y. Yonebayashi et al., “Automatic determination of piano fingering based on hidden Markov model (in Japanese),” Tech. Rep. SIGMUS, 2006-05-13, pp. 7– 12, 2006. [19] Y. Yonebayashi et al., “Automatic decision of piano fingering based on hidden Markov models,” IJCAI, pp. 2915–2921, 2007. [20] Y. Yonebayashi et al., “Automatic piano fingering decision based on hidden Markov models with latent variables in consideration of natural hand motions (in Japanese),” Tech. Rep. SIGMUS, MUS-71-29, pp. 179– 184, 2007. 
536 15th International Society for Music Information Retrieval Conference (ISMIR 2014) MODELING RHYTHM SIMILARITY FOR ELECTRONIC DANCE MUSIC Maria Panteli Niels Bogaards Aline Honingh University of Amsterdam, Elephantcandy, University of Amsterdam, Amsterdam, Netherlands Amsterdam, Netherlands Amsterdam, Netherlands [email protected] [email protected] [email protected] ABSTRACT A model for rhythm similarity in electronic dance music (EDM) is presented in this paper. Rhythm in EDM is built on the concept of a ‘loop’, a repeating sequence typically associated with a four-measure percussive pattern. The presented model calculates rhythm similarity between segments of EDM in the following steps. 1) Each segment is split in different perceptual rhythmic streams. 2) Each stream is characterized by a number of attributes, most notably: attack phase of onsets, periodicity of rhythmic elements, and metrical distribution. 3) These attributes are combined into one feature vector for every segment, after which the similarity between segments can be calculated. The stages of stream splitting, onset detection and downbeat detection have been evaluated individually, and a listening experiment was conducted to evaluate the overall performance of the model with perceptual ratings of rhythm similarity. &#(#*#&' (#&' '#)#&&#&( $% !" $% Figure 1: Example of a common (even) EDM rhythm [2]. 1. INTRODUCTION Music similarity has attracted research from multidisciplinary domains including tasks of music information retrieval and music perception and cognition. Especially for rhythm, studies exist on identifying and quantifying rhythm properties [16, 18], as well as establishing rhythm similarity metrics [12]. In this paper, rhythm similarity is studied with a focus on Electronic Dance Music (EDM), a genre with various and distinct rhythms [2]. EDM is an umbrella term consisting of the ‘four on the floor’ genres such as techno, house, trance, and the ‘breakbeat-driven’ genres such as jungle, drum ‘n’ bass, breaks etc. In general, four on the floor genres are characterized by a four-beat steady bass-drum pattern whereas breakbeat-driven exploit irregularity by emphasizing the metrically weak locations [2]. However, rhythm in EDM exhibits multiple types of subtle variations and embellishments. The goal of the present study is to develop a rhythm similarity model that captures these embellishments and allows for a fine inter-song rhythm similarity. c Maria Panteli, Niels Bogaards, Aline Honingh. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Maria Panteli, Niels Bogaards, Aline Honingh. “Modeling rhythm similarity for electronic dance music”, 15th International Society for Music Information Retrieval Conference, 2014. 537 The model focuses on content-based analysis of audio recordings. A large and diverse literature deals with the challenges of audio rhythm similarity. These include, amongst other, approaches to onset detection [1], tempo estimation [9,25], rhythmic representations [15,24], and feature extraction for automatic rhythmic pattern description and genre classification [5, 12, 20]. Specific to EDM, [4] study rhythmic and timbre features for automatic genre classification, and [6] investigate temporal and structural features for music generation. In this paper, an algorithm for rhythm similarity based on EDM characteristics and perceptual rhythm attributes is presented. 
The methodology for extracting rhythmic elements from an audio segment and a summary of the features extracted is provided. The steps of the algorithm are evaluated individually. Similarity predictions of the model are compared to perceptual ratings and further considerations are discussed. 2. METHODOLOGY Structural changes in an EDM track typically consist of an evolution of timbre and rhythm as opposed to a versechorus division. Segmentation is firstly performed to split the signal into meaningful excerpts. The algorithm developed in [21] is used, which segments the audio signal based on timbre features (since timbre is important in EDM structure [2]) and musical heuristics. EDM rhythm is expressed via the ‘loop’, a repeating pattern associated with a particular (often percussive) instrument or instruments [2]. Rhythm information can be extracted by evaluating characteristics of the loop: First, the rhythmic pattern is often presented as a combination of instrument sounds (eg. Figure 1), thus exhibiting a certain ‘rhythm polyphony’ [3]. To analyze this, the signal is split into the so-called rhythmic streams. Then, to describe the underlying rhythm, features are extracted for each stream based on three attributes: a) The attack phase of the onsets is considered to describe if the pattern is performed on 15th International Society for Music Information Retrieval Conference (ISMIR 2014) segmentation feature extraction rhythmic streams detection onset detection attack characterization feature vector metrical periodicity metricaldistribution distribution feature extraction similarity stream # 1 stream # 2 stream # 3 Figure 2: Overview of methodology. percussive or non-percussive instruments. Although this is typically viewed as a timbre attribute, the percussiveness of a sound is expected to influence the perception of rhythm [16]. b) The repetition of rhythmic sequences of the pattern are described by evaluating characteristics of different levels of onsets’ periodicity. c) The metrical structure of the pattern is characterized via features extracted from the metrical profile [24] of onsets. Based on the above, a feature vector is extracted for each segment and is used to measure rhythm similarity. Inter-segment similarity is evaluated with perceptual ratings collected via a specifically designed experiment. An overview of the methodology is shown in Figure 2 and details for each step are provided in the sections below. Part of the algorithm is implemented using the MIRToolbox [17]. 2.1 Rhythmic Streams Several instruments contribute to the rhythmic pattern of an EDM track. Most typical examples include combinations of bass drum, snare and hi-hat (eg. Figure 1). This is mainly a functional rather than a strictly instrumental division, and in EDM one finds various instrument sounds to take the role of bass, snare and hi-hat. In describing rhythm, it is essential to distinguish between these sources since each contributes differently to rhythm perception [11]. Following this, [15, 24] describe rhythmic patterns of latin dance music in two prefixed frequency bands (low and high frequencies), and [9] represents drum patterns as two components, the bass and snare drum pattern, calculated via non-negative matrix factorization of the spectrogram. In [20], rhythmic events are split based on their perceived loudness and brightness, where the latter is defined as a function of the spectral centroid. 
In the current study, rhythmic streams are extracted with respect to the frequency domain and loudness pattern. In particular, the Short Time Fourier Transform of the signal is computed and logarithmic magnitude spectra are assigned to bark bands, resulting in a total of 24 bands for a 44.1 kHz sampling rate. Synchronous masking is modeled using the spreading function of [23], and temporal masking is modeled with a smoothing window of 50 ms. This representation is hereafter referred to as the loudness envelope and denoted by Lb for bark bands b = 1, ..., 24. A self-similarity matrix is computed from this 24-band representation, indicating the bands that exhibit a similar loudness pattern. The novelty approach of [8] is applied to the 24 × 24 similarity matrix to detect adjacent bands that should be grouped into the same rhythmic stream. The peak locations P of the novelty curve define the index of the bark band that marks the beginning of a new stream, i.e., if P = {pi ∈ {1, ..., 24} | i = 1, ..., I} for a total number of peaks I, then stream Si consists of the bark bands b given by

Si = {b | b ∈ [pi, pi+1 − 1]} for i = 1, ..., I − 1, and SI = {b | b ∈ [pI, 24]} for i = I. (1)

An upper limit of 6 streams is considered, based on the approach of [22] that uses a total of 6 bands for onset detection and on [14] that suggests a total of three or four bands for meter analysis. The notion of rhythmic stream here is similar to the notion of 'accent band' in [14], with the difference that each rhythmic stream is formed from a variable number of adjacent bark bands. Detecting a rhythmic stream does not necessarily imply separating the instruments, since two instruments playing the same rhythm should be grouped into the same rhythmic stream. The proposed approach does not distinguish instruments that lie in the same bark band. The advantage is that the number of streams and the frequency range of each stream do not need to be predetermined but are rather estimated from the spectral representation of each song. This benefits the analysis of electronic dance music by not imposing any constraints on the possible instrument sounds that contribute to the characteristic rhythmic pattern.

2.1.1 Onset Detection

To extract onset candidates, the loudness envelope per bark band and its derivative are normalized and summed, with more weight on the loudness than on its derivative, i.e.,

Ob(n) = (1 − λ) Nb(n) + λ N′b(n) (2)

where Nb is the normalized loudness envelope Lb, N′b the normalized derivative of Lb, n = 1, ..., N the frame number for a total of N frames, and λ < 0.5 the weighting factor. This is similar to the approach described by Equation 3 in [14] with reduced λ, and is computed prior to the summation over the different streams, as suggested in [14, 22]. Onsets are detected via peak extraction within each stream, where the (rhythmic) content of stream i is defined as

Ri = Σb∈Si Ob (3)

with Si as in Equation 1 and Ob as in Equation 2. This onset detection approach incorporates methodological concepts similar to the positively evaluated algorithms for the task of audio onset detection [1] in MIREX 2012, and for tempo estimation [14] in the review of [25].
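The stream grouping of Equation (1) and the per-stream onset-detection function of Equations (2) and (3) amount to only a few array operations. The sketch below (Python/NumPy) is illustrative only: the normalization scheme, the peak-pruning rule used to enforce the six-stream limit, and the concrete value of λ are assumptions, since the paper does not fix them.

```python
import numpy as np
from scipy.signal import find_peaks

def group_bands_into_streams(novelty, n_bands=24, max_streams=6):
    """Eq. (1): novelty peaks mark the first bark band of each rhythmic stream."""
    peaks, props = find_peaks(novelty, prominence=0)
    if len(peaks) > max_streams:                    # six-stream limit; pruning by peak
        keep = np.argsort(props["prominences"])[::-1][:max_streams]   # prominence is
        peaks = np.sort(peaks[keep])                                   # an assumption
    p = list(peaks + 1)                             # 1-based bark-band indices
    return [list(range(p[i], (p[i + 1] - 1 if i + 1 < len(p) else n_bands) + 1))
            for i in range(len(p))]

def onset_detection_functions(L, streams, lam=0.2):
    """Eqs. (2)-(3): weighted sum of the normalized loudness envelope and its
    derivative per band, then summed over the bands of each stream.
    L has shape (24, n_frames); lam < 0.5 (its exact value is not given)."""
    def minmax(x):                                  # per-band min-max scaling (assumption)
        lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
        return (x - lo) / np.where(hi > lo, hi - lo, 1)
    N = minmax(L)                                   # normalized loudness envelope
    Nd = minmax(np.diff(L, axis=1, prepend=L[:, :1]))   # normalized derivative
    O = (1 - lam) * N + lam * Nd                    # Eq. (2)
    return [O[np.array(bands) - 1].sum(axis=0) for bands in streams]  # Eq. (3): R_i
```

Onsets are then obtained by peak extraction on each stream content Ri, as described in the paragraph above.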
Figure 3: Detection of rhyhmic streams using the novelty approach; first a bark-band spectrogram is computed, then its self-similarity matrix, and then the novelty [7] is applied where the novelty peaks define the stream boundaries. 2.2 Feature Extraction The onsets in each stream represent the rhythmic elements of the signal. To model the underlying rhythm, features are extracted from each stream, based on three attributes, namely, characterization of attack, periodicity, and metrical distribution of onsets. These are combined to a feature vector that serves for measuring inter-segment similarity. The sections below describe the feature extraction process in detail. 2.2.1 Attack Characterization To distinguish between percussive and non-percussive patterns, features are extracted that characterize the attack phase of the onsets. In particular, the attack time and attack slope are considered, among other, essential in modeling the perceived attack time [10]. The attack slope was also used in modeling pulse clarity [16]. In general, onsets from percussive sounds have a short attack time and steep attack slope, whereas non-percussive sounds have longer attack time and gradually increasing attack slope. For all onsets in all streams, the attack time and attack slope is extracted and split in two clusters; the ‘slow’ (non-percussive) and ‘fast’ (percussive) attack phase onsets. Here, it is assumed that both percussive and nonpercussive onsets can be present in a given segment, hence splitting in two clusters is superior to, e.g., computing the average. The mean and standard deviation of the two clusters of the attack time and attack slope (a total of 8 features) is output to the feature vector. Lag duration of maximum autocorrelation: The location (in time) of the second highest peak (the first being at lag 0) of the autocorrelation curve normalized by the bar duration. It measures whether the strongest periodicity occurs in every bar (i.e. feature value = 1), or every half bar (i.e. feature value = 0.5) etc. Amplitude of maximum autocorrelation: The amplitude of the second highest peak of the autocorrelation curve normalized by the amplitude of the peak at lag 0. It measures whether the pattern is repeated in exactly the same way (i.e. feature value = 1) or somewhat in a similar way (i.e. feature value < 1) etc. Harmonicity of peaks: This is the harmonicity as defined in [16] with adaptation to the reference lag l0 corresponding to the beat duration and additional weighting of the harmonicity value by the total number of peaks of the autocorrelation curve. This feature measures whether rhythmic periodicities occur in harmonic relation to the beat (i.e. feature value = 1) or inharmonic (i.e. feature value = 0). Flatness: Measures whether the autocorrelation curve is smooth or spiky and is suitable for distinguishing between periodic patterns (i.e. feature value = 0), and nonperiodic (i.e. feature value = 1). Entropy: Another measure of the ‘peakiness’ of autocorrelation [16], suitable for distinguishing between ‘clear’ repetitions (i.e. distribution with narrow peaks and hence feature value close to 0) and unclear repetitions (i.e. wide peaks and hence feature value increased). 2.2.3 Metrical Distribution 2.2.2 Periodicity To model the metrical aspects of the rhythmic pattern, the metrical profile [24] is extracted. 
For this, the downbeat is detected as described in Section 2.2.4, onsets per stream are quantized assuming a 44 meter and 16-th note resolution [2], and the pattern is collapsed to a total of 4 bars. The latter is in agreement with the length of a musical phrase in EDM being usually in multiples of 4, i.e., 4-bar, 8-bar, or 16-bar phrase [2]. The metrical profile of a given stream is thus presented as a vector of 64 bins (4 bars × 4 beats × 4 sixteenth notes per beat) with real values ranging between 0 (no onset) to 1 (maximum onset strength) as shown in Figure 5. For each rhythmic stream, a metrical pro- One of the most characteristic style elements in the musical structure of EDM is repetition; the loop, and consequently the rhythmic sequence(s), are repeating patterns. To analyze this, the periodicity of the onset detection function per stream is computed via autocorrelation and summed across all streams. The maximum delay taken into account is proportional to the bar duration. This is calculated assuming a steady tempo and 44 meter throughout the EDM track [2]. The tempo estimation algorithm of [21] is used. From the autocorrelation curve (cf. Figure 4), a total of 5 features are extracted: 539 15th International Society for Music Information Retrieval Conference (ISMIR 2014) tions are made: Assumption 1: Strong beats of the meter are more likely to be emphasized across all rhythmic streams. Assumption 2: The downbeat is often introduced by an instrument in the low frequencies, i.e. a bass or a kick drum [2, 13]. Considering the above, the onsets per stream are quantized assuming a 44 meter, 16-th note resolution, and a set of downbeat candidates (in this case the onsets that lie within one bar length counting from the beginning of the segment). For each downbeat candidate, hierarchical weights [18] that emphasize the strong beats of the meter as indicated by Assumption 1, are applied to the quantized patterns. Note, there is one pattern for each rhythmic stream. The patterns are then summed by applying more weight to the pattern of the low-frequency stream as indicated by Assumption 2. Finally, the candidate whose quantized pattern was weighted most, is chosen as the downbeat. 1.2 1 Bar 1 Beat 1 0.8 Normalized amplitude 0.6 0.4 0.2 0 −0.2 −0.4 −0.6 0 0.5 1 1.5 2 2.5 Lag (s) Figure 4: Autocorrelation of onsets indicating high periodicities of 1 bar and 1 beat duration. 3. EVALUATION Figure 5: Metrical profile of the rhythm in Figure 1 assuming for simplicity a 2-bar length and constant amplitude. One of the greatest challenges of music similarity evaluation is the definition of a ground truth. In some cases, objective evaluation is possible, where a ground truth is defined on a quantifiable criterion, i.e., rhythms from a particular genre are similar [5]. In other cases, music similarity is considered to be influenced by the perception of the listener and hence subjective evaluation is more suitable [19]. Objective evaluation in the current study is not preferable since different rhythms do not necessarily conform to different genres or subgenres 1 . Therefore a subjective evaluation is used where predictions of rhythm similarity are compared to perceptual ratings collected via a listening experiment (cf. Section 3.4). Details of the evaluation of rhythmic stream, onset, and downbeat detection are provided in Sections 3.1 - 3.3. A subset of the annotations used in the evaluation of the latter is available online 2 . file is computed and the following features are extracted. 
Features are computed per stream and averaged across all streams. Syncopation: Measures the strength of the events lying on the weak locations of the meter. The syncopation model of [18] is used with adaptation to account for the amplitude (onset strength) of the syncopated note. Three measures of syncopation are considered that apply hierarchical weights with, respectively, sixteenth note, eighth note, and quarter note resolution. Symmetry: Denotes the ratio of the number of onsets in the second half of the pattern that appear in exactly the same position in the first half of the pattern [6]. Density: Is the ratio of the number of onsets over the possible total number of onsets of the pattern (in this case 64). Fullness: Measures the onsets’ strength of the pattern. It describes the ratio of the sum of onsets’ strength over the maximum strength multiplied by the possible total number of onsets (in this case 64). Centre of Gravity: Denotes the position in the pattern where the most and strongest onsets occur (i.e., indicates whether most onsets appear at the beginning or at the end of the pattern etc.). Aside from these features, the metrical profile (cf. Figure 5) is also added to the final feature vector. This was found to improve results in [24]. In the current approach, the metrical profile is provided per stream, restricted to a total of 4 streams, and output in the final feature vector in order of low to high frequency content streams. 3.1 Rhythmic Streams Evaluation The number of streams is evaluated with perceptual annotations. For this, a subset of 120 songs from a total of 60 artists (2 songs per artist) from a variety of EDM genres and subgenres was selected. For each song, segmentation was applied using the algorithm of [21] and a characteristic segment was selected. Four subjects were asked to evaluate the number of rhythmic streams they perceive in each segment, choosing between 1 to 6, where rhythmic stream was defined as a stream of unique rhythm. For 106 of the 120 segments, the subjects’ responses’ standard deviation was significantly small. The estimated number of rhythmic streams matched the mean of the subject’s response distribution with an accuracy of 93%. 2.2.4 Downbeat Detection 1 Although some rhythmic patterns are characteristic to an EDM genre or subgenre, it is not generally true that these are unique and invariant. 2 https://staff.fnwi.uva.nl/a.k.honingh/rhythm_ similarity.html The downbeat detection algorithm uses information from the metrical structure and musical heuristics. Two assump- 540 15th International Society for Music Information Retrieval Conference (ISMIR 2014) r -0.17 0.48 0.33 0.69 0.70 3.2 Onset Detection Evaluation Onset detection is evaluated with a set of 25 MIDI and corresponding audio excerpts, specifically created for this purpose. In this approach, onsets are detected per stream, therefore onset annotations should also be provided per stream. For a number of different EDM rhythms, MIDI files were created with the constraint that each MIDI instrument performs a unique rhythmic pattern therefore represents a unique stream, and were converted to audio. The onsets estimated from the audio were compared to the annotations of the MIDI file using the evaluation measures of the MIREX Onset Detection task 3 . For this, no stream alignment is performed but rather onsets from all streams are grouped to a single set. For 25 excerpts, an F -measure of 85%, presicion of 85%, and recall of 86% are obtained with a tolerance window of 50 ms. 
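The pattern-level descriptors defined in Section 2.2.3 above reduce to simple operations on the 64-bin metrical profile. The sketch below assumes onset strengths in [0, 1] on a 4-bar, sixteenth-note grid as described there; the syncopation measure (which needs the hierarchical weights of [18]) is omitted, and the denominator of the symmetry ratio as well as the exact reading of the centre of gravity are interpretations rather than quotations from the paper.

```python
import numpy as np

def metrical_pattern_features(profile):
    """profile: 64-bin metrical profile (4 bars x 16 sixteenth-note positions),
    onset strengths between 0 (no onset) and 1 (maximum strength)."""
    p = np.asarray(profile, dtype=float)
    n = p.size                                  # 64 possible onset positions
    onsets = p > 0
    density = onsets.sum() / n                  # onsets over possible onsets
    fullness = p.sum() / n                      # summed strength over (max strength * n)
    first, second = onsets[: n // 2], onsets[n // 2:]
    # onsets of the 2nd half recurring at the same position in the 1st half [6];
    # normalizing by the number of onsets in the 2nd half is an assumption
    symmetry = (first & second).sum() / max(int(second.sum()), 1)
    # centre of gravity read as the strength-weighted mean position in the pattern
    centre = float((np.arange(n) * p).sum() / max(p.sum(), 1e-9))
    return {"density": density, "fullness": fullness,
            "symmetry": symmetry, "centre_of_gravity": centre}
```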
Inaccuracies in onset detection are due (on average) to doubled than merged onsets, because usually more streams (and hence more onsets) are detected. features attack characterization periodicity metrical distribution excl. metrical profile metrical distribution incl. metrical profile all Table 1: Pearson’s correlation r and p-values between the model’s predictions and perceptual ratings of rhythm similarity for different sets of features. rating, and all ratings being consistent, i.e., rated similarity was not deviating more than 1 point scale. The mean of the ratings was utilized as the ground truth rating per pair. For each pair, similarity can be calculated via applying a distance metric to the feature vectors of the underlying segments. In this preliminary analysis, the cosine distance was considered. Pearson’s correlation was used to compare the annotated and predicted ratings of similarity. This was applied for different sets of features as indicated in Table 1. A maximum correlation of 0.7 was achieved when all features were presented. The non-zero correlation hypothesis was not rejected (p > 0.05) for the attack characterization features indicating non-significant correlation with the (current set of) perceptual ratings. The periodicity features are correlated with r = 0.48, showing a strong link with perceptual rhythm similarity. The metrical distribution features indicate a correlation increase of 0.36 when the metrical profile is included in the feature vector. This is in agreement with the finding of [24]. As an alternative evaluation measure, the model’s predictions and perceptual ratings were transformed to a binary scale (i.e., 0 being dissimilar and 1 being similar) and their output was compared. The model’s predictions matched the perceptual ratings with an accuracy of 64%. Hence the model matches the perceptual similarity ratings at not only relative (i.e., Pearson’s correlation) but also absolute way, when a binary scale similarity is considered. 3.3 Downbeat Detection Evaluation To evaluate the downbeat the subset of 120 segments described in Section 3.1 was used. For each segment the annotated downbeat was compared to the estimated one with a tolerance window of 50 ms. An accuracy of 51% was achieved. Downbeat detection was also evaluated at the beat-level, i.e., estimating whether the downbeat corresponds to one of the four beats of the meter (instead of off-beat positions). This gave an accuracy of 59%, meaning that in the other cases the downbeat was detected on the off-beat positions. For some EDM tracks it was observed that high degree of periodicity compensates for a wrongly estimated downbeat. The overall results of the similarity predictions of the model (Section 3.4) indicate only a minor increase when the correct (annotated) downbeats are taken into account. It is hence concluded that the downbeat detection algorithm does not have great influence on the current results of the model. 3.4 Mapping Model Predictions to Perceptual Ratings of Similarity 4. DISCUSSION AND FUTURE WORK In the evaluation of the model, the following considerations are made. High correlation of 0.69 was achieved when the metrical profile, output per stream, was added to the feature vector. An alternative experiment tested the correlation when considering the metrical profile as a whole, i.e., as a sum across all streams. This gave a correlation of only 0.59 indicating the importance of stream separation and hence the advantage of the model to account for this. 
A maximum correlation of 0.7 was reported, taking into account the downbeat detection being 51% of the cases correct. Although regularity in EDM sometimes compensates for this, model’s predictions can be improved with a more robust downbeat detection. Features of periodicity (Section 2.2.2) and metrical distribution (Section 2.2.3) were extracted assuming a 44 meter, and 16-th note resolution throughout the segment. This is generally true for EDM, but exceptions do exist [2]. The The model’s predictions were evaluated with perceptual ratings of rhythm similarity collected via a listening experiment. Pairwise comparisons of a small set of segments representing various rhythmic patterns of EDM were presented. Subjects were asked to rate the perceived rhythm similarity, choosing from a four point scale, and report also the confidence of their rating. From a preliminary collection of experiment data, 28 pairs (representing a total of 18 unique music segments) were selected for further analysis. These were rated from a total of 28 participants, with mean age 27 years old and standard deviation 7.3. The 50% of the participants received formal musical training, 64% was familiar with EDM and 46% had experience as EDM musician/producer. The selected pairs were rated between 3 to 5 times, with all participants reporting confidence in their 3 p 0.22 0.00 0.01 0.00 0.00 www.MIREX.org 541 15th International Society for Music Information Retrieval Conference (ISMIR 2014) assumptions could be relaxed to analyze EDM with ternary divisions or no 44 meter, or expanded to other music styles with similar structure. The correlation reported in Section 3.4 is computed from a preliminary set of experiment data. More ratings are currently collected and a regression analysis and tuning of the model is considered in future work. [11] T. D. Griffiths and J. D. Warren. What is an auditory object? Nature Reviews Neuroscience, 5(11):887–892, 2004. 5. CONCLUSION [13] J. A. Hockman, M. E. P. Davies, and I. Fujinaga. One in the Jungle: Downbeat Detection in Hardcore, Jungle, and Drum and Bass. In ISMIR, 2012. A model of rhythm similarity for Electronic Dance Music has been presented. The model extracts rhythmic features from audio segments and computes similarity by comparing their feature vectors. A method for rhythmic stream detection is proposed that estimates the number and range of frequency bands from the spectral representation of each segment rather than a fixed division. Features are extracted from each stream, an approach shown to benefit the analysis. Similarity predictions of the model match perceptual ratings with a correlation of 0.7. Future work will fine-tune predictions based on a perceptual rhythm similarity model. 6. REFERENCES [1] S. Böck, A. Arzt, K. Florian, and S. Markus. Online real-time onset detection with recurrent neural networks. In International Conference on Digital Audio Effects, 2012. [2] M. J. Butler. Unlocking the Groove. Indiana University Press, Bloomington and Indianapolis, 2006. [3] E. Cambouropoulos. Voice and Stream: Perceptual and Computational Modeling of Voice Separation. Music Perception, 26(1):75–94, 2008. [4] D. Diakopoulos, O. Vallis, J. Hochenbaum, J. Murphy, and A. Kapur. 21st Century Electronica: MIR Techniques for Classification and Performance. In ISMIR, 2009. [12] C. Guastavino, F. Gómez, G. Toussaint, F. Marandola, and E. Gómez. Measuring Similarity between Flamenco Rhythmic Patterns. Journal of New Music Research, 38(2):129–138, June 2009. [14] A. Klapuri, A. J. 
Eronen, and J. T. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech and Language Processing, 14(1):342–355, January 2006. [15] F. Krebs, S. Böck, and G. Widmer. Rhythmic pattern modeling for beat and downbeat tracking in musical audio. In ISMIR, 2013. [16] O. Lartillot, T. Eerola, P. Toiviainen, and J. Fornari. Multi-feature Modeling of Pulse Clarity: Design, Validation and Optimization. In ISMIR, 2008. [17] O. Lartillot and P. Toiviainen. A Matlab Toolbox for Musical Feature Extraction From Audio. In International Conference on Digital Audio Effects, 2007. [18] H. C. Longuet-Higgins and C. S. Lee. The Rhythmic Interpretation of Monophonic Music. Music Perception: An Interdisciplinary Journal, 1(4):424–441, 1984. [19] A. Novello, M. M. F. McKinney, and A. Kohlrausch. Perceptual Evaluation of Inter-song Similarity in Western Popular Music. Journal of New Music Research, 40(1):1–26, March 2011. [20] J. Paulus and A. Klapuri. Measuring the Similarity of Rhythmic Patterns. In ISMIR, 2002. [5] S. Dixon, F. Gouyon, and G. Widmer. Towards Characterisation of Music via Rhythmic Patterns. In ISMIR, 2004. [21] B. Rocha, N. Bogaards, and A. Honingh. Segmentation and Timbre Similarity in Electronic Dance Music. In Sound and Music Computing Conference, 2013. [6] A. Eigenfeldt and P. Pasquier. Evolving Structures for Electronic Dance Music. In Genetic and Evolutionary Computation Conference, 2013. [22] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. The Journal of the Acoustical Society of America, 103(1):588–601, January 1998. [7] J. Foote and S. Uchihashi. The beat spectrum: a new approach to rhythm analysis. In ICME, 2001. [23] M. R. Schroeder, B. S. Atal, and J. L. Hall. Optimizing digital speech coders by exploiting masking properties of the human ear. The Journal of the Acoustical Society of America, pages 1647–1652, 1979. [8] J. T. Foote. Media segmentation using self-similarity decomposition. In Electronic Imaging. International Society for Optics and Photonics, 2003. [9] D. Gärtner. Tempo estimation of urban music using tatum grid non-negative matrix factorization. In ISMIR, 2013. [10] J. W. Gordon. The perceptual attack time of musical tones. The Journal of the Acoustical Society of America, 82(1):88–105, 1987. [24] L. M. Smith. Rhythmic similarity using metrical profile matching. In International Computer Music Conference, 2010. [25] J. R. Zapata and E. Gómez. Comparative Evaluation and Combination of Audio Tempo Estimation Approaches. In Audio Engineering Society Conference, 2011. 542 15th International Society for Music Information Retrieval Conference (ISMIR 2014) MUSE: A MUSIC RECOMMENDATION MANAGEMENT SYSTEM Martin Przyjaciel-Zablocki, Thomas Hornung, Alexander Schätzle, Sven Gauß, Io Taxidou, Georg Lausen Department of Computer Science, University of Freiburg zablocki,hornungt,schaetzle,gausss,taxidou,[email protected] ABSTRACT Evaluating music recommender systems is a highly repetitive, yet non-trivial, task. But it has the advantage over other domains that recommended songs can be evaluated immediately by just listening to them. In this paper, we present M U S E – a music recommendation management system – for solving the typical tasks of an in vivo evaluation. M U S E provides the typical offthe-shelf evaluation algorithms, offers an online evaluation system with automatic reporting, and by integrating online streaming services also a legal possibility to evaluate the quality of recommended songs in real time. 
Finally, it has a built-in user management system that conforms with state-of-the-art privacy standards. New recommender algorithms can be plugged in comfortably and evaluations can be configured and managed online. 1. INTRODUCTION One of the hallmarks of a good recommender system is a thorough and significant evaluation of the proposed algorithm(s) [6]. One way to do this is to use an offline dataset like The Million Song Dataset [1] and split some part of the data set as training data and run the evaluation on top of the remainder of the data. This approach is meaningful for features that are already available for the dataset, such as e.g. tag prediction for new songs. However, some aspects of recommending songs are inherently subjective, such as serendipity [12], and thus the evaluation of such algorithms can only be done in vivo, i.e. with real users not in an artificial environment. When conducting an in vivo evaluation, there are some typical issues that need to be considered: User management. While registering for evaluations, users should be able to provide some context information about them to guide the assignment in groups for A/B testing. Privacy & Security. User data is highly sensitive, and high standards have to be met wrt. who is allowed to access c Martin Przyjaciel-Zablocki, Thomas Hornung, Alexan der Schätzle, Sven Gauß, Io Taxidou, Georg Lausen. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Martin Przyjaciel-Zablocki, Thomas Hornung, Alexander Schätzle, Sven Gauß, Io Taxidou, Georg Lausen. “MuSe: A Music Recommendation Management System”, 15th International Society for Music Information Retrieval Conference, 2014. the data. Also, an evaluation framework needs to ensure that user data cannot be compromised. Group selection. Users are divided into groups for A/B testing, e.g. based on demographic criteria like age or gender. Then, recommendations for group A are provided by a baseline algorithm, and for group B by the new algorithm. Playing songs. Unlike other domains, e.g. books, users can give informed decisions by just listening to a song. Thus, to assess a recommended song, it should be possible to play the song directly during the evaluation. Evaluation monitoring. During an evaluation, it is important to have an overview of how each algorithm performs so far, and how many and how often users participate. Evaluation metrics. Evaluation results are put into graphs that contain information about the participants and the performance of the evaluated new recommendation algorithm. Baseline algorithms. Results of an evaluation are often judged by improvements over a baseline algorithm, e.g. a collaborative filtering algorithm [10]. In this paper, we present M U S E – a music recommendation management system – that takes care of all the regular tasks that are involved in conducting an in vivo evaluation. Please note that M U S E can be used to perform in vivo evaluations of arbitrary music recommendation algorithms. An instance of M U S E that conforms with state-ofthe-art privacy standards is accessible by using the link below, a documentation is available on the M U S E website 2 . muse.informatik.uni-freiburg.de The remainder of the paper is structured as follows: After a discussion of related work in Section 2, we give an overview of our proposed music recommendation management system in Section 3 with some insights in our evaluation framework in Section 4. 
Included recommenders are presented in Section 5, and we conclude with an outlook on future work in Section 6. 2. RELATED WORK The related work is divided in three parts: (1) music based frameworks for recommendations, (2) recommenders’ evaluation, (3) libraries and platforms for developing and plugin recommenders. Music recommendation has attracted a lot of interest from the scientific community since it has many real life applications and bears multiple challenges. An overview 2 M U S E - Music Sensing in a Social Context: dbis.informatik.uni-freiburg.de/MuSe 543 15th International Society for Music Information Retrieval Conference (ISMIR 2014) incorporate algorithms in the framework, integrate plugins, make configurations and visualize the results. However, our system offers additionally real-time online evaluations of different recommenders, while incorporating end users in the evaluation process. A case study of using Apache Mahout, a library for distributed recommenders based on MapReduce can be found in [15]. Their study provides insights into the development and evaluation of distributed algorithms based on Mahout. To the best of our knowledge, this is the first system that incorporates such a variety of characteristics and offers a full solution for music recommenders development and evaluation, while highly involving the end users. of factors affecting music recommender systems and challenges that emerge both for the users’ and the recommenders side are highlighted in [17]. Improving music recommendations has attracted equal attention. In [7, 12], we built and evaluated a weighted hybrid recommender prototype that incorporates different techniques for music recommendations. We used Youtube for playing songs but due to a complex process of identifying and matching songs, together with some legal issues, such an approach is no longer feasible. Music platforms are often combined with social media where users can interact with objects maintaining relationships. Authors in [2] leverage this rich information to improve music recommendations by viewing recommendations as a ranking problem. The next class of related work concerns evaluation of recommenders. An overview of existing systems and methods can be found in [16]. In this study, recommenders are evaluated based on a set of properties relevant for different applications and evaluation metrics are introduced to compare algorithms. Both offline and online evaluation with real users are conducted, discussing how to draw valuable conclusion. A second review on collaborative recommender systems specifically can be found in [10]. It consists the first attempt to compare and evaluate user tasks, types of analysis, datasets, recommendation quality and attributes. Empirical studies along with classification of existing evaluation metrics and introduction of new ones provide insights into the suitability and biases of such metrics in different settings. In the same context, researchers value the importance of user experience in the evaluation of recommender systems. In [14] a model is developed for assessing the perceived recommenders quality of users leading to more effective and satisfying systems. Similar approaches are followed in [3, 4] where authors highlight the need for user-centric systems and high involvement of users in the evaluation process. 
Relevant to our study is the work in [9] which recognizes the importance for online user evaluation, while implementing such evaluations simultaneously by the same user in different systems. 3. MUSE OVERVIEW We propose M U S E: a web-based music recommendation management system, built around the idea of recommenders that can be plugged in. With this in mind, M U S E is based on three main system design pillars: Extensibility. The whole infrastructure is highly extensible, thus new recommendation techniques but also other functionalities can be added as modular components. Reusability. Typical tasks required for evaluating music recommendations (e.g. managing user accounts, playing and rating songs) are already provided by M U S E in accordance with current privacy standards. Comparability. By offering one common evaluation framework we aim to reduce side-effects of different systems that might influence user ratings, improving both comparability and validity of in-vivo experiments. A schematic overview of the whole system is depicted in Fig. 1. The M U S E Server is the core of our music recommendation management system enabling the communication between all components. It coordinates the interaction with pluggable recommenders, maintains the data in three different repositories and serves the requests from multiple M U S E clients. Next, we will give some insights in the architecture of M U S E by explaining the most relevant components and their functionalities. The last class of related work refers to platforms and libraries for developing and selecting recommenders. The authors of [6] proposed LensKit, an open-source library that offers a set of baseline recommendation algorithms including an evaluation framework. MyMediaLite [8] is a library that offers state of the art algorithms for collaborative filtering in particular. The API offers the possibility for new recommender algorithm’s development and methods for importing already trained models. Both provide a good foundation for comparing different research results, but without a focus on in vivo evaluations of music recommenders, thus they don’t offer e.g. capabilities to play and rate songs or manage users. A patent in [13] describes a portal extension with recommendation engines via interfaces, where results are retrieved by a common recommendation manager. A more general purpose recommenders framework [5] which is close to our system, allows using and comparing different recommendation methods on provided datasets. An API offers the possibility to develop and 3.1 Web-based User Interface Unlike traditional recommender domains like e-commerce, where the process of consuming and rating items takes up to several weeks, recommending music exhibits a highly dynamic nature raising new challenges and opportunities for recommender systems. Ratings can be given on the fly and incorporated immediately into the recommending process, just by listening to a song. However, this requires a reliable and legal solution for playing a large variety of songs. M U S E benefits from a tight integration of Spotify 3 , a music streaming provider that allows listening to millions of songs for free. Thus, recommended songs can be embedded directly into the user interface, allowing to listen and rate them in a user-friendly way as shown in Fig. 2. 
3 544 A Spotify account is needed to play songs 15th International Society for Music Information Retrieval Conference (ISMIR 2014) MuSeServer PluggableRecommender1 Recommendation ListBuilder MuSeClient REST Webservice WebbasedUserInterface Recommendation List Administration Track Manager Music Repository MusicRetrievalEngine User Profile Evaluation Framework User Context Wrapper/ Connector1 XML Social Connector1 Charts Last.fmAPI SocialNetworks XML Last.fmAPI HTML AJAX Spotify Connector WebServices HTML UserProfileEngine User Manager Recommender1 DataRepositories Coordinator Recommendation Model Recommender Interface Recommender RecommenderManager others JSON SpotifyAPI Figure 1. Muse – Music Recommendation Management System Overview 3.2 Data Repositories Although recommenders in M U S E work independently of each other and may even have their own recommendation model with additional data, all music recommenders have access to three global data structures. The first one is the Music Repository that stores songs with their meta data. Only songs in this database can be recommended, played and rated. The Music Retrieval Engine periodically collects new songs and meta data from Web Services, e.g. chart lists or Last.fm. It can be easily extended by new sources of information like audio analysis features from the Million Song Dataset [1], that can be requested periodically or dynamically. Each recommender can access all data stored in the Music Repository. The second repository stores the User Profile, hence it also contains personal data. In order to comply with German data privacy requirements only restricted access is granted for both, recommenders and evaluation analyses. The last repository collects the User Context, e.g. which songs a user has listened to with the corresponding rating for the respective recommender. Access with anonymized user IDs is granted for all recommenders and evaluation analyses. Finally, both userrelated repositories can be enriched by the User Profile Engine that fetches data from other sources like social networks. Currently, the retrieval of listening profiles of publicly available data from Last.fm and Facebook is supported. Figure 2. Songs can be played & rated In order to make sure that users can obtain recommendations without having to be long-time M U S E users, we ask for some contextual information during the registration process. Each user has to provide coarse-grained demographic and preference information, namely the user’s spoken languages, year of birth, and optionally a Last.fm user name. In Section 5, we will present five different approaches that utilize those information to overcome the cold start problem. Beyond that, these information is also exploited for dividing users into groups for A/B testing. 3.3 Recommender Manager Fig. 3 shows the settings pane of a user. Note, that this window is available only for those users, who are not participating in an evaluation. It allows to browse all available recommenders and compare them based on meta data provided with each recommender. Moreover, it is also possible to control how recommendations from different recommenders are amalgamated to one list. To this end, a summary is shown that illustrates the interplay of novelty, accuracy, serendipity and diversity. Changes are applied and reflected in the list of recommendations directly. The Recommender Manager has to coordinate the interaction of recommenders with users and the access to the data. 
This process can be summarized as follows: • It coordinates access to the repositories, forwards user request for new recommendations, and receives generated recommendations. • It composes a list of recommendations by amalgamating recommendations from different recommenders into one list based on individual user settings. 545 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 4.1 Evaluation Setup The configuration of an evaluation consists of three steps (cf. Fig. 4): (1) A new evaluation has to be scheduled, i.e. a start and end date for the evaluation period has to be specified. (2) The number and setup of groups for A/B testing has to be defined, where up to six different groups are supported. For each group an available recommender can be associated with the possibility of hybrid combinations of recommenders if desired. (3) The group placement strategy based on e.g. age, gender and spoken languages is required. As new participants might join the evaluation over time, an online algorithm maintains a uniform distribution with respect to the specified criteria. After the setup is completed, a preview illustrates how group distributions would resemble based on a sample of registered users. Figure 3. Users can choose from available recommenders • A panel for administrative users allows enabling, disabling and adding of recommenders that implement the interface described in Section 3.4. Moreover, even composing hybrid recommenders is supported. 3.4 Pluggable Recommender A cornerstone of M U S E is its support for plugging in recommenders easily. The goal was to design a rather simple and compact interface enabling other developers to implement new recommenders with enough flexibility to incorporate existing approaches as well. This is achieved by a predefined Java interface that has to be implemented for any new recommender. It defines the interplay between the M U S E Recommender Manager and its pluggable recommenders by (1) providing methods to access all three data repositories, (2) forwarding requests for recommendations and (3) receiving recommended items. Hence, new recommenders do not have to be implemented within M U S E in order to be evaluated, it suffices to use the interface to provide a mapping of inputs and outputs 4 . Figure 4. Evaluation setup via Web interface While an evaluation is running, both registered users and new ones are asked to participate after they login to M U S E. If a user joins an evaluation, he will be assigned to a group based on the placement strategy defined during the setup and all ratings are considered for the evaluation. So far, the following types of ratings can be discerned: Song rating. The user can provide three ratings for the quality of the recommended song (“love”, “like”, and “dislike”). Each of these three rating options is mapped to a numerical score internally, which is then used as basis for the analysis of each recommender. List rating. The user can also provide ratings for the entire list of recommendations that is shown to him on a fivepoint Likert scale, visualized by stars. Question. To measure other important aspects of a recommendation like its novelty or serendipity, an additional field with a question can be configured that contains either a yes/no button or a five-point Likert scale. The user may also decide not to rate some of the recommendations. 
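Returning briefly to the group placement in step (3) above: the paper states only that an online algorithm keeps the groups uniformly distributed with respect to the configured criteria; the actual MUSE strategy is not published. The greedy balancer below is therefore just one plausible reading, with invented attribute names, shown only to make the idea concrete.

```python
from collections import defaultdict

class GreedyGroupAssigner:
    """Assigns each newly joining participant to the A/B group that currently
    minimizes imbalance over the configured criteria (illustrative sketch only)."""

    def __init__(self, n_groups, criteria):
        self.criteria = criteria                      # functions: user -> category
        self.counts = [defaultdict(int) for _ in range(n_groups)]
        self.sizes = [0] * n_groups

    def assign(self, user):
        def load(g):                                  # group size plus per-criterion counts
            return self.sizes[g] + sum(self.counts[g][c(user)] for c in self.criteria)
        g = min(range(len(self.sizes)), key=load)
        self.sizes[g] += 1
        for c in self.criteria:
            self.counts[g][c(user)] += 1
        return g

# usage sketch: two groups balanced on age bracket and gender (hypothetical fields)
assigner = GreedyGroupAssigner(2, [lambda u: u["age"] // 10, lambda u: u["gender"]])
group = assigner.assign({"age": 27, "gender": "f", "languages": ["de", "en"]})
```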
In order to reduce the number of non-rated recommendations in evaluations, the rating results can only be submitted when at least 50% of the recommendations are rated. Upon submitting the rating results, the user gets a new list with recommended songs. 4. EVALUATION FRAMEWORK There are two types of experiments to measure the performance of recommenders: (1) offline evaluations based on historical data and (2) in vivo evaluations where users can evaluate recommendations online. Since music is of highly subjective nature with many yet unknown correlations, we believe that in vivo evaluations have the advantage of also capturing subtle effects on the user during the evaluation. Since new songs can be rated within seconds by a user, such evaluations are a good fit for the music domain. M U S E addresses the typical issues that are involved in conducting an in-vivo evaluation and thus allows researches to focus on the actual recommendation algorithm. This section gives a brief overview of how evaluations are created, monitored and analyzed. 4 4.2 Monitoring Evaluations Running in vivo evaluations as a black box is undesirable, since potential issues might be discovered only after the More details can be found on our project website. 546 15th International Society for Music Information Retrieval Conference (ISMIR 2014) age of 14 and 20 [11]. The Annual Charts Recommender exploits this insight and recommends those songs, which were popular during this time. This means, when a user indicates 1975 as his year of birth, he will be assigned to the music context of years 1989 to 1995, and obtain recommendations from that context. The recommendation ranking is defined by the charts position in the corresponding annual charts, where the following function is used to map the charts position to a score, with cs as the position of song s in charts c and n is the maximum rank of charts c: evaluation is finished. Also, it is favorable to have an overview of the current state, e.g. if there are enough participants, and how the recommenders perform so far. M U S E provides comprehensive insights via an administrative account into running evaluations as it offers an easy accessible visualization of the current state with plots. Thus, adjustments like adding a group or changing the runtime of the evaluation can be made while the evaluation is still running. 1 score(s) = −log( cs ) n (1) Country Charts Recommender. Although music taste is subject to diversification across countries, songs that a user has started to listen to and appreciate oftentimes have peaked in others countries months before. This latency aspect as well as an inter-country view on songs provide a good foundation for serendipity and diversity. The source of information for this recommender is the spoken languages, provided during registration, which are mapped to a set of countries for which we collect the current charts. Suppose there is a user a with only one country A assigned to his spoken languages, and CA the set of charts songs for A. Then, the set CR of possible recommendations for a is defined as follows, where L is the set of all countries: " CR = ( C X ) \ CA X∈L Figure 5. Evaluation results are visualized dynamically The score for a song s ∈ CR is defined by the average charts position across all countries, where Function (1) is used for mapping the charts position into a score. City Charts Recommender. While music tastes differ across countries, they may likewise differ across cities in the same country. 
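Before turning to the city-level variant, the two chart-based recommenders just described can be made concrete. Function (1) maps a chart position cs (with n the maximum rank of chart c) to score(s) = −log(cs / n), and the Country Charts candidate set is the union of all countries' charts minus the user's own charts, ranked by the average mapped position. The data layout below is hypothetical, and averaging the mapped scores (rather than the raw positions) is an interpretation.

```python
import math

def chart_score(position, n):
    """Function (1): score(s) = -log(c_s / n); rank 1 scores highest, rank n scores 0."""
    return -math.log(position / n)

def country_chart_candidates(charts, user_countries):
    """charts: {country: {song: position}}. Returns the candidate songs C_R of the
    Country Charts Recommender, best-ranked first."""
    own = set().union(*(set(charts.get(c, {})) for c in user_countries))
    scores = {}
    for country, entries in charts.items():
        n = max(entries.values())                    # maximum rank of this chart
        for song, pos in entries.items():
            if song not in own:
                scores.setdefault(song, []).append(chart_score(pos, n))
    return sorted(scores, key=lambda s: -sum(scores[s]) / len(scores[s]))
```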
We exploit this idea by the City Charts Recommender, hence it can be seen as a more granular variant of the Country Charts Recommender. The set of recommendations CR is now composed based on the city charts from those countries a user was assigned to. Hereby, the ranking of songs in that set is not only defined by the average charts position, but also by the number of cities where the song occurs in the charts: The fewer cities a song appears in, the more “exceptional” and thus relevant it is. Social Neighborhood Recommender. Social Networks are, due to their growing rates, an excellent source for contextual knowledge about users, which in turn can be utilized for better recommendations. In this approach, we use the underlying social graph of Last.fm to generate recommendations based on user’s Last.fm neighborhood which can be retrieved by our User Profile Engine. To compute recommendations for a user a, we select his five closest neighbors, an information that is estimated by Last.fm internally. Next, for each of them, we retrieve its recent top 20 songs and thus get five sets of songs, namely N1 ...N5 . Since that alone would provide already known songs in general, we define the set NR of possible recommendations as follows, where Na is the set of at most 25 songs a 4.3 Analyzing Evaluations For all evaluations, including running and finished ones, a result overview can be accessed that shows results in a graphical way to make them easier and quicker to grasp (c.f. Fig. 5). The plots are implemented in a dynamic fashion allowing to adjust, e.g., the zoom-level or the displayed information as desired. They include a wide range of metrics like group distribution, number of participants over time, averaged ratings, mean absolute error, accuracy per recommender, etc. Additionally, the complete dataset or particular plotting data can be downloaded in CSV format. 5. RECOMMENDATION TECHNIQUES M U S E comes with two types of recommenders out-of-thebox. The first type includes traditional algorithms, i.e. Contend Based and Collaborative Filtering [10] that can be used as baseline for comparison. The next type of recommenders is geared towards overcoming the cold start problem by (a) exploiting information provided during registration (Annual, Country, and City Charts recommender), or (b) leveraging knowledge from social networks (Social Neighborhood and Social Tags recommender). Annual Charts Recommender. Studies have shown, that the apex of evolving music taste is reached between the 547 15th International Society for Music Information Retrieval Conference (ISMIR 2014) user a recently listened to and appreciated: NR = ( " [4] Paolo Cremonesi, Franca Garzotto, Sara Negro, Alessandro Vittorio Papadopoulos, and Roberto Turrin. Looking for ”good” recommendations: A comparative evaluation of recommender systems. In INTERACT (3), pages 152–168, 2011. N i ) \ Na 1≤i≤5 Social Tags Recommender. Social Networks collect an enormous variety of data describing not only users but also items. One common way of characterising songs is based on tags that are assigned to them in a collaborative manner. Our Social Tag Recommender utilizes such tags to discover new genres which are related to songs a user liked in the past. At first, we determine his recent top ten songs including their tags from Last.fm. We merge all those tags and filter out the most popular ones like “rock” or “pop” to avoid getting only obvious recommendations. 
By counting the frequency of the remaining tags, we determine the three most common thus relevant ones. For the three selected tags, we use again Last.fm to retrieve songs where the selected tags were assigned to most frequently. To test our evaluation framework as well as to assess the performance of our five recommenders we conducted an in vivo evaluation with M U S E. As a result 48 registered users rated a total of 1567 song recommendations confirming the applicability of our system for in vivo evaluations. Due to space limitations, we decided to omit a more detailed discussion of the results. 6. CONCLUSION M U S E puts the fun back in developing new algorithms for music recommendations by taking the burden from the researcher to spent cumbersome time on programming yet another evaluation tool. The module-based architecture offers the flexibility to immediately test novel approaches, whereas the web-based user-interface gives control and insight into running in vivo evaluations. We tested M U S E with a case study confirming the applicability and stability of our proposed music recommendation management system. As future work, we envision to increase the flexibility of setting up evaluations, add more metrics to the result overview, and to develop further connectors for social networks and other web services to enrich the user’s context while preserving data privacy. 7. REFERENCES [1] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR, 2011. [2] Jiajun Bu, Shulong Tan, Chun Chen, Can Wang, Hao Wu, Lijun Zhang 0005, and Xiaofei He. Music recommendation by unified hypergraph: combining social media information and music content. In ACM Multimedia, pages 391–400, 2010. [3] Li Chen and Pearl Pu. User evaluation framework of recommender systems. In Workshop on Social Recommender Systems (SRS’10) at IUI, volume 10, 2010. [5] Aviram Dayan, Guy Katz, Naseem Biasdi, Lior Rokach, Bracha Shapira, Aykan Aydin, Roland Schwaiger, and Radmila Fishel. Recommenders benchmark framework. In RecSys, pages 353–354, 2011. [6] Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John Riedl. Rethinking the recommender research ecosystem: reproducibility, openness, and LensKit. In RecSys, pages 133–140, 2011. [7] Simon Franz, Thomas Hornung, Cai-Nicolas Ziegler, Martin Przyjaciel-Zablocki, Alexander Schätzle, and Georg Lausen. On weighted hybrid track recommendations. In ICWE, pages 486–489, 2013. [8] Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Mymedialite: a free recommender system library. In RecSys, pages 305– 308, 2011. [9] Conor Hayes and Pádraig Cunningham. An on-line evaluation framework for recommender systems. Trinity College Dublin, Dep. of Computer Science, 2002. [10] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John Riedl. Evaluating collaborative filtering recommender systems. In ACM Trans. Inf. Syst., pages 5–53, 2004. [11] Morris B Holbrook and Robert M Schindler. Some exploratory findings on the development of musical tastes. Journal of Consumer Research, pages 119–124, 1989. [12] Thomas Hornung, Cai-Nicolas Ziegler, Simon Franz, Martin Przyjaciel-Zablocki, Alexander Schätzle, and Georg Lausen. Evaluating Hybrid Music Recommender Systems. In WI, pages 57–64, 2013. [13] Stefan Liesche, Andreas Nauerz, and Martin Welsch. Extendable recommender framework for web-based systems, 2008. US Patent App. 12/209,808. [14] Pearl Pu, Li Chen, and Rong Hu. 
A user-centric evaluation framework for recommender systems. In RecSys ’11, pages 157–164, New York, NY, USA, 2011. ACM. [15] Carlos E Seminario and David C Wilson. Case study evaluation of mahout as a recommender platform. In RecSys, 2012. [16] Guy Shani and Asela Gunawardana. Evaluating recommendation systems. In Recommender Systems Handbook, pages 257–297, 2011. [17] Alexandra L. Uitdenbogerd and Ron G. van Schyndel. A review of factors affecting music recommender success. In ISMIR, 2002. 548 15th International Society for Music Information Retrieval Conference (ISMIR 2014) TEMPO- AND TRANSPOSITION-INVARIANT IDENTIFICATION OF PIECE AND SCORE POSITION Andreas Arzt1 , Gerhard Widmer1,2 , Reinhard Sonnleitner1 1 Department of Computational Perception, Johannes Kepler University, Linz, Austria 2 Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria [email protected] ABSTRACT goal is to identify different versions of one and the same song, mostly in order to detect cover versions in popular music. A common way to solve this task, especially for classical music, is to use an audio matching algorithm (see e.g. [10]). Here, all the scores are first transformed into audio files (or a suitable in-between representation), and then aligned to the query in question, most commonly with algorithms based on dynamic programming techniques. A limitation of this approach is that relatively large queries are needed (e.g. 20 seconds), to achieve good retrieval results. Another problem is computational cost. To cope with this, in [8] clever indexing strategies were presented that greatly reduce the computation time. In [2] an approach is presented that tries to solve the task in the symbolic domain instead. First, the query is transformed into a symbolic list of note events via an audio transcription algorithm. Then, a globally tempo-invariant fingerprinting method is used to query the database and identify matching positions. In this way even for queries with lengths of only a few seconds very robust retrieval results can be achieved. A downside is that this method depends on automatic music transcription, which in general is an unsolved problem. In [2] a state of the art transcription system for piano music is used, thus limiting the approach to piano music only, at least for the time being. In addition, we identified two other limitations of this algorithm, which we tackle in this paper. First, the approach depends on the performer playing the piece in the correct key and the correct octave (i.e. in the same key and octave as it is stored in the database). In music it is quite common to transpose a piece of music according to specific circumstances, e.g. a singer preferring to sing in a specific range. Secondly, while this algorithm works very well for small queries, larger queries with local tempo changes within the query tend to be problematic. Of course these limitations were already discussed in the literature for other approaches, see e.g. [10] for tempo- and transposition-invariant audio matching. In this paper we present solutions to both problems by proposing (1) a transposition-invariant fingerprinting method for symbolic music representations which uses an additional verification step that largely compensates for the general loss in discriminative power, and (2) a simple but effective tracking method that essentially achieves not only global, but also local invariance to tempo changes. 
We present an algorithm that, given a very small snippet of an audio performance and a database of musical scores, quickly identifies the piece and the position in the score. The algorithm is both tempo- and transposition-invariant. We approach the problem by extending an existing tempoinvariant symbolic fingerprinting method, replacing the absolute pitch information in the fingerprints with a relative representation. Not surprisingly, this leads to a big decrease in the discriminative power of the fingerprints. To overcome this problem, we propose an additional verification step to filter out the introduced noise. Finally, we present a simple tracking algorithm that increases the retrieval precision for longer queries. Experiments show that both modifications improve the results, and make the new algorithm usable for a wide range of applications. 1. INTRODUCTION Efficient algorithms for content-based retrieval play an important role in many areas of music retrieval. A well known example are audio fingerprinting algorithms, which permit the retrieval of all audio files from the database that are (almost) exact replicas of a given example query (a short audio excerpt). For this task there exist efficient algorithms that are in everyday commercial use (see e.g. [4], [13]). A related task, relevant especially in the world of classical music, is the following: given a short audio excerpt of a performance of a piece, identify both the piece (i.e. the musical score the performance is based on), and the position within the piece. For example, when presented with an audio excerpt of Vladimir Horowitz playing Chopin’s Nocturne Op. 55 No. 1, the goal is to return the name and data of the piece (Nocturne Op. 55 No. 1 by Chopin) rather than identifying the exact audio recording. Hence, the database for this task does not contain audio recordings, but symbolic representations of musical scores. This is related to version identification (see [11] for an overview), where the c Andreas Arzt1 , Gerhard Widmer1,2 , Reinhard Sonnleitner1 . Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Andreas Arzt1 , Gerhard Widmer1,2 , Reinhard Sonnleitner1 . “Tempo- and Transposition-invariant Identification of Piece and Score Position”, 15th International Society for Music Information Retrieval Conference, 2014. 549 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2. TEMPO-INVARIANT FINGERPRINTING The basis of our algorithm is a fingerprinting method presented in [2] (which in turn is based on [13]) that is invariant to the global tempo of both the query and the entries in the database. In this section we will give a brief summary of this algorithm. Then we will show how to make it transposition-invariant (Section 3) and how to make it invariant to local tempo changes (Section 4). 2.1 Building the Score Database In [2] a fingerprinting algorithm was introduced that is invariant to global tempo differences between the query and the scores in the database. Each score is represented as an ordered list of [ontime, pitch] pairs, which in turn are extracted from MIDI files with a suitable but constant tempo for the whole piece. For each score, fingerprint tokens are generated and stored in a database. Tokens are created from triplets of noteon events according to some constraints to make them tempo invariant. A fixed event e is paired with the first n1 events with a distance of at least d seconds “in the future” of e. 
This results in n1 event pairs. For each of these pairs this step is repeated with the n2 future events with a distance of at least d seconds. This finally results in n1 ∗ n2 event triplets. In our experiments we used the values d = 0.05 seconds and n1 = n2 = 5 (i.e. for each event 25 tokens are created). The pair creation steps are constrained to notes which are at most 2 octaves apart. Given such a triplet consisting of the events e1, e2 and e3, the time difference td_{1,2} between e1 and e2 and the time difference td_{2,3} between e2 and e3 are computed. To get a tempo-independent fingerprint token, the ratio of the time differences is computed: tdr = td_{2,3} / td_{1,2}. This finally leads to a fingerprint token dbtoken = [pitch1 : pitch2 : pitch3 : tdr] : pieceID : time : td_{1,2}, with the hash key being [pitch1 : pitch2 : pitch3 : tdr], pieceID the identifier of the piece, and time the onset time of e1. The tokens in our database are unique, i.e. we only insert the generated token if an equivalent one does not exist yet.

2.2 Querying the Database

Before querying the database, the query (an audio snippet of a performance) has to be transformed into a symbolic representation. The algorithm we use to transcribe musical note onsets from an audio signal is based on the system described in [3]. The result of this step is a possibly very noisy list of [ontime, pitch] pairs. This list is processed in exactly the same fashion as above, resulting in a list of tokens of the form qtoken = [qpitch1 : qpitch2 : qpitch3 : qtdr] : qtime : qtd_{1,2}. Then, all the tokens which match hash keys of the query tokens are extracted from the database (we allow a maximal deviation of the ratio of the time differences of 15%). For querying, the general idea is to find regions in the database of scores which share a continuous sequence of tokens with the query. To quickly identify these regions we use the histogram approach presented in [2] and [13]. This is a computationally inexpensive way of finding these sequences by sorting the matched tokens into a histogram with a bin width of 1 second such that peaks appear at the start points of these regions (i.e. the start point where the query matches a database position). We also included the restriction that each query token can only be sorted at most once into each bin of the histogram, effectively preventing excessively high scores for sequences of repeated patterns in a brief period of time. The matching score for each score position is computed as the number of tokens in the respective histogram bin. In addition, we can also compute a tempo estimate, i.e. the tempo of the performance compared to the tempo in the score, by taking the mean of the ratios of td_{1,2} and qtd_{1,2} of the respective matching query and database tokens that were sorted into the bin in question. We will use this information for the tracking approach presented in Section 4.

3. TRANSPOSITION-INVARIANT FINGERPRINTS

3.1 General Approach

In the algorithm described above, the pitches in the hash keys are represented as absolute values. Thus, if a performer decides to transpose a piece by an arbitrary number of semi-tones, any identification attempt by the algorithm must fail. To overcome this problem, we suggest a simple, relative representation of the pitch values, which makes the algorithm invariant to linear transpositions. Instead of using 3 absolute pitch values, we replace them by 2 differences, pd1 = pitch2 − pitch1 and pd2 = pitch3 − pitch2, resulting in a hash key [pd1 : pd2 : tdr].
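To make the token construction concrete, the following is a minimal Python sketch of the triplet generation just described, covering both the absolute-pitch hash key and the transposition-invariant variant. It is an illustration rather than the authors' implementation: the function and variable names are ours, and the quantisation of the ratio tdr for hashing (simple rounding here) is an assumption; the paper instead allows a 15% deviation of the ratio at query time.

```python
from collections import namedtuple

Note = namedtuple("Note", ["onset", "pitch"])  # onset in seconds, pitch as MIDI number


def future_events(notes, i, d, n):
    """Indices of up to n events starting at least d seconds after notes[i]
    and lying within two octaves of it (the pairing constraints described above)."""
    out = []
    for j in range(i + 1, len(notes)):
        if notes[j].onset - notes[i].onset >= d and abs(notes[j].pitch - notes[i].pitch) <= 24:
            out.append(j)
            if len(out) == n:
                break
    return out


def make_tokens(notes, piece_id, d=0.05, n1=5, n2=5, transposition_invariant=True):
    """Generate fingerprint tokens (hash_key, piece_id, time, td12, pitch1) from note triplets."""
    notes = sorted(notes, key=lambda n: n.onset)
    tokens = []
    for i, e1 in enumerate(notes):
        for j in future_events(notes, i, d, n1):
            e2 = notes[j]
            for k in future_events(notes, j, d, n2):
                e3 = notes[k]
                td12 = e2.onset - e1.onset
                td23 = e3.onset - e2.onset
                tdr = td23 / td12  # tempo-independent ratio of the time differences
                if transposition_invariant:
                    key = (e2.pitch - e1.pitch, e3.pitch - e2.pitch, round(tdr, 2))
                else:
                    key = (e1.pitch, e2.pitch, e3.pitch, round(tdr, 2))
                # the absolute pitch of e1 is kept in the token value; it is used later
                # for the verification step of the transposition-invariant variant
                tokens.append((key, piece_id, e1.onset, td12, e1.pitch))
    return tokens
```

With n1 = n2 = 5 this yields up to 25 tokens per note event, as stated in the text; a real system would additionally quantise tdr more carefully so that the 15% matching tolerance can be realised via the hash lookup.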
For use in Section 3.2 below we additionally store pitch1 , the absolute pitch of the first note, in the token value. In every other aspect the algorithm works in the same way as the purely tempo-invariant version described above. Of course this kind of transposition invariance cannot come for free as the resulting fingerprints will not be as discriminative as before. This has two important direct consequences: (1) the retrieval accuracy will suffer, and (2) for every query a lot more matching tokens are found in the database, thus the runtime for each query increases (see Section 5). 3.2 De-noising the Results: Token Verification To compensate for the loss in discriminative power we propose an additional step before accepting a database token as a match to the query. The general idea is taken from [9] and was first used in a music context by [12]. It is based on a verification step for each returned token that looks at the context within the query and the context at the returned position the database. Each token dbtoken that was returned in response to a qtoken can be used to project the query (i.e. the notes identified from the query audio snippet by the transcription algorithm) to the possibly matching position in the score indicated by the dbtoken. The intuition then is that at 550 15th International Society for Music Information Retrieval Conference (ISMIR 2014) The basic idea is to create virtual ‘agents’ for positions in the result sets. Each agent has a current hypothesis of the piece, the position within the piece and the tempo, and a score based on the results of the sub-queries. The agents are updated, if possible, with newly arriving data. In doing so, agents that represent positions that successively occur in result sets will accumulate higher scores than agents that represent positions that only occurred once or twice by chance, and are most probably false positives. More precisely, we iterate over all sub-queries and perform the following steps in each iteration: true matching positions we will find a majority of the notes from the query at their expected positions in the score. This will permit us to more reliably decide if the match of hash keys is a false positive or an actual match. To do this, we need to compute the pitch shift and the tempo difference between the query and the potential position in the database. The pitch shift is computed as the difference of the pitch1 of qtoken and dbtoken. The difference in tempo is computed as the ratio of td1,2 of the two tokens. This information can now in turn be used to compute the expected time and pitch for each query note at the current score position hypothesis. We actually do not do this for the whole query, but only for a window of w = 10 notes, centred at the event e1 of the query, and we exclude the notes e1 , e2 and e3 from this list (as they were already used to come up with the match in the first place). We now take these w notes and check if they appear in the database as would be expected. In this search we are strict on the pitch value, but allow for a window of ±100 ms with regards to the actual time in the database. If we can confirm that a certain percentage of notes from the query appears in the database as expected (in the experiments we used 0.8), we finally accept the query token as an actual match. As this approach is computationally expensive, we actually compute the results in two steps: we first do ‘normal’ fingerprinting without the verification step and only keep the top 5% of the results. 
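The verification step described above can be sketched as follows. This is our own simplified rendering, not the authors' code: the token objects are assumed to expose the values stored in the fingerprint tokens (the absolute pitch of e1, the time difference td_{1,2} and the score onset time), and the linear scan over the score notes stands in for whatever indexed lookup a real implementation would use.

```python
from collections import namedtuple

Note = namedtuple("Note", ["onset", "pitch"])
Token = namedtuple("Token", ["pitch1", "td12", "time"])  # assumed stored token values


def verify_match(query_notes, score_notes, q_e1_idx, q_token, db_token,
                 w=10, time_tol=0.1, min_ratio=0.8):
    """Project the query context around e1 onto the candidate score position and
    count how many notes are confirmed (strict on pitch, +/- 100 ms on time)."""
    pitch_shift = db_token.pitch1 - q_token.pitch1      # transposition between query and score
    tempo = db_token.td12 / q_token.td12                 # tempo ratio between score and query
    q_e1 = query_notes[q_e1_idx]
    lo = max(0, q_e1_idx - w // 2)
    context = query_notes[lo:lo + w]                     # for brevity, e1-e3 are not excluded here
    confirmed = 0
    for q_note in context:
        expected_time = db_token.time + (q_note.onset - q_e1.onset) * tempo
        expected_pitch = q_note.pitch + pitch_shift
        if any(n.pitch == expected_pitch and abs(n.onset - expected_time) <= time_tol
               for n in score_notes):
            confirmed += 1
    return confirmed / max(1, len(context)) >= min_ratio
```

Only candidate positions whose context passes this check are kept, which is what restores most of the discriminative power lost by the relative pitch representation.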
We then perform the verification step on these results only and recompute the scores. On our dataset this effectively more than halves the computation time.

• Normalise Scores: First the scores of the positions in the result set of the sub-query are normalised by dividing them by their median. This makes sure that each iteration has approximately the same influence on the tracking process.

• Update Agents: For every agent, we look for a matching position in the result set of the sub-query (i.e. a position that approximately fits the extrapolated position of the agent, given the old position, the tempo, and the elapsed time). The position, the tempo and the score of the agent are updated with the new data from the matching result of the sub-query. If we do not find a matching position in the result set, we update the agent with a score of 0, and the extrapolated position is taken as the new hypothesis. If a matching position is found, the accumulated score is updated in a fashion such that scores from further in the past have a smaller impact than more recent ones. Each agent has a ring buffer s of size 50, in which the scores of the individual sub-queries are stored. The accumulated score of the agent is then calculated as score_acc = Σ_{i=1}^{50} s_i/(1 + log i), where s_1 is the most recent score.

• Create Agents: Each sub-query result that was not used to update an existing agent is used to initialise a new agent at the respective score position (i.e. in the first iteration up to 100 agents are created).

• Remove obsolete Agents: Finally, agents with low scores are removed. In our implementation we simply remove agents that are older than 10 iterations and are not part of the current top 25 agents.

At each point in time the agents are ordered by score_acc and can be seen as hypotheses about the current position in the database of pieces. Thus, in the case of a single long query, the agents with the highest accumulated scores are returned in the end. In an on-line scenario, where an audio stream is constantly being monitored by the fingerprinting system, the current top hypotheses can be returned after each performed update (i.e.
after each processed subquery). 551 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 5. EVALUATION We tested the algorithms with different query lengths: 10, 15, 20 and 25 notes (automatically transcribed from the audio query). For each of the query lengths, we generated 2500 queries by picking random points in the performances of our test database, and used them as input for the proposed algorithms. Duplicate retrieval results (i.e. positions that have the exact same note content; also, duplicate piece IDs for the experiments on piece-level) are removed from the result set. Table 2 shows the results of the original tempo-invariant (but not pitch-invariant) algorithm on our dataset. Here, we present results for two categories: correctly identified pieces, and correctly identified piece and position in the score. For both categories we give the percentage of correct results at rank 1, and the mean reciprocal rank. This experiment basically confirms the results that were reported in [2] on a larger database (more than twice as large), for which a slight drop in performance is expected. In addition, for the experiments with the transpositioninvariant fingerprinting method, we transposed each score randomly by between -11 and +11 semitones – although strictly speaking this was not necessary, as the transpositioninvariant algorithm returns exactly the same (large) set of tokens for un-transposed and transposed queries or scores. Table 3 gives the results of the transposition-invariant method on these queries, both without (left) and with the verification step (right). As expected, the use of pitchinvariant fingerprints without additional verification causes a big decrease in retrieval precision (compare left half of Table 3 with Table 2). Furthermore, the loss in discriminative power of the fingerprint tokens also results in an increased number of tokens returned for every query, which has a direct influence on the runtime of the algorithm (last row in Table 3). The proposed verification step solves the precision problem, at least to some extent, and in our opinion makes the approach usable. Of course this does not come for free, as the runtime increases slightly. We also tried to use the verification step with the original tempo-invariant algorithm but were not able to improve on the retrieval results. At least on our test data the tempoinvariant fingerprints are discriminative enough to mostly avoid false positives. Finally, Table 4 gives the results on slightly longer queries for both the original tempo-invariant and the new tempoand transposition-invariant algorithm. As can be seen, for the detection of the exact position in the score, using no tracking, the results based on queries with length 100 notes are worse than those for queries with only 50 notes, i.e. more information leads to worse results. This is caused by local tempo changes within the query, which break the histogram approach for finding sequences of matching tokens. As shown on the right hand side for both fingerprinting types in Table 4, the approach of splitting longer queries into shorter ones and tracking the results takes care of this problem. Please note that for the tracking approach we check if the position hypotheses after the last tracking step match the correct position in the score. Thus, as this is an 5.1 Dataset Description For the evaluation of the proposed algorithms a ground truth is needed. 
We need exact alignments of performances (recordings) of classical music to their respective scores such that we know exactly when each note given in the score is actually played in the performance. This data can either be generated by a computer program or by extensive manual annotation but both ways are prone to errors. Luckily, we have access to two unique datasets where professional pianists played performances on a computercontrolled piano 1 and thus every action (e.g. key presses, pedal movements) was recorded. The first dataset (see [14]) consists of performances of the first movements of 13 Mozart piano sonatas by Roland Batik. The second, much larger, dataset consists of nearly the complete solo piano works by Chopin performed by Nikita Magaloff [7]. For the latter set we do not have the original audio files and thus replayed the symbolic performance data on a Yamaha N2 hybrid piano and recorded the resulting performances. As we have both symbolic and audio information about the performances, we know the exact timing of each played note in the audio files. To build the score database we converted the sheet music to MIDI files with a constant tempo such that the overall duration of the file is similar to a ‘normal’ performance of the piece. In addition to these two datasets the score database includes the complete Beethoven piano sonatas, two symphonies by Beethoven, and various other piano pieces. To this data we have no ground truth, but this is irrelevant since we do not actively query for them with performance data in our evaluation runs. See Table 1 for an overview of the complete dataset. 5.2 Results For the evaluation we follow the procedure from [2]. A score position X is considered correct if it marks the beginning (+/- 1.5 seconds) of a score section that is identical in note content, over a time span the length of the query (but at least 20 notes), to the note content of the ‘real’ score situation corresponding to the audio segment that the system was just listening to. We can establish this as we have the correct alignment between performance time and score positions — our ground truth). This complex definition is necessary because musical pieces may contain repeated sections or phrases, and it is impossible for the system (or anyone else, for that matter) to guess the ‘true’ one out of a set of identical passages matching the current performance snippet, given just that performance snippet as input. We acknowledge that a measurement of musical time in a score in terms of seconds is rather unusual. But as the MIDI tempos in our database generally are set in a meaningful way, this seemed the best decision to make errors comparable over different pieces, with different time signatures – it would not be very meaningful to, e.g. compare errors in bars or beats over different pieces. 1 Bösendorfer SE 290 552 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Data Description Chopin Corpus Mozart Corpus Additional Pieces Total Score Database Number of Pieces Notes in Score 154 325,263 13 42,049 159 574,926 326 942,238 Testset Notes in Performance Performance Duration 326,501 9:38:36 42,095 1:23:56 – – Table 1. Database and Testset Overview. In the database, all the pieces are included. As we only have performances aligned to the scores for the Chopin and the Mozart corpus, only these are included in the test set to query the database. 
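As a rough illustration of this correctness criterion, the following sketch compares the note content at a returned score position with the note content at the ground-truth position. It is a simplification (our own helper names, pitch-sequence comparison only, and the ±1.5 second tolerance applied as a direct time check) rather than the exact evaluation code used by the authors.

```python
from collections import namedtuple

Note = namedtuple("Note", ["onset", "pitch"])  # score notes: onset in seconds, MIDI pitch


def is_position_correct(score_notes, returned_time, true_time, query_len, tol=1.5):
    """Accept the returned position if it is within +/- tol seconds of the true
    position, or if it starts a passage whose note content is identical to the
    passage at the true position (a repeated section), over max(query_len, 20) notes."""
    if abs(returned_time - true_time) <= tol:
        return True
    span = max(query_len, 20)

    def pitches_from(start_time):
        following = sorted((n for n in score_notes if n.onset >= start_time),
                           key=lambda n: n.onset)
        return tuple(n.pitch for n in following[:span])

    return pitches_from(returned_time) == pitches_from(true_time)
```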
Query Length in Notes                         10     15     20     25
Correct Piece as Top Match                    0.6    0.82   0.88   0.91
Correct Piece Mean Reciprocal Rank (MRR)      0.68   0.86   0.91   0.93
Correct Position as Top Match                 0.53   0.72   0.77   0.79
Correct Position Mean Reciprocal Rank (MRR)   0.60   0.79   0.83   0.85
Mean Query Length in Seconds                  1.47   2.26   3.16   3.82
Mean Query Execution Time in Seconds          0.02   0.06   0.11   0.16

Table 2. Results for different query sizes of the original tempo-invariant piece and score position identification algorithm on the test database at the piece level (upper half) and on the score position level (lower half). Each estimate is based on 2500 random audio queries. For both categories the percentage of correct detections at rank 1 and the mean reciprocal rank (MRR) are given. Additionally, the mean length of the query in seconds and the mean execution time for a query is shown.

Without Verification
Query Length in Notes                  10     15     20     25
Correct Piece as Top Match             0.30   0.40   0.41   0.40
Correct Piece MRR                      0.36   0.47   0.50   0.49
Correct Position as Top Match          0.23   0.33   0.32   0.32
Correct Position MRR                   0.29   0.40   0.41   0.40
Mean Query Length in Seconds           1.47   2.26   3.16   3.82
Mean Query Execution Time in Seconds   0.10   0.32   0.62   0.91

With Verification
Query Length in Notes                  10     15     20     25
Correct Piece as Top Match             0.43   0.63   0.71   0.75
Correct Piece MRR                      0.49   0.69   0.76   0.79
Correct Position as Top Match          0.33   0.51   0.57   0.60
Correct Position MRR                   0.41   0.59   0.66   0.69
Mean Query Length in Seconds           1.47   2.26   3.16   3.82
Mean Query Execution Time in Seconds   0.12   0.38   0.72   1.09

Table 3. Results for different query sizes of the proposed tempo- and transposition-invariant piece and score position identification algorithm on the test database without (first block) and with (second block) the proposed verification step. Each estimate is based on 2500 random audio queries. The upper half shows recognition results on the piece level, the lower half on the score position level. For both categories the percentage of correct detections at rank 1 and the mean reciprocal rank (MRR) are given. Additionally, the mean length of the query in seconds and the mean execution time for a query is shown.

Tempo-invariant
                                       No Tracking       Tracking
Query Length in Notes                  50      100       50      100
Correct Piece as Top Match             0.95    0.96      0.98    1
Correct Piece MRR                      0.97    0.98      0.99    1
Correct Position as Top Match          0.78    0.73      0.87    0.88
Correct Position MRR                   0.85    0.81      0.89    0.90
Mean Query Length in Seconds           7.62    15.03     7.62    15.03
Mean Query Execution Time in Seconds   0.42    0.92      0.49    1.08

Tempo- and Pitch-invariant
                                       No Tracking       Tracking
Query Length in Notes                  50      100       50      100
Correct Piece as Top Match             0.81    0.79      0.92    0.98
Correct Piece MRR                      0.85    0.82      0.94    0.99
Correct Position as Top Match          0.64    0.59      0.77    0.83
Correct Position MRR                   0.72    0.66      0.82    0.86
Mean Query Length in Seconds           7.62    15.03     7.62    15.03
Mean Query Execution Time in Seconds   2.71    6.11      3.21    7.09

Table 4. Results of the proposed tracking algorithm on the test database for both the original tempo-invariant algorithm (first block) and the new tempo- and transposition-invariant approach (second block), including the verification step. For the category 'No Tracking', the query was fed directly to the fingerprinting algorithm. For 'Tracking', the queries were split into sub-queries with a window size of 15 notes and a hop size of 5 notes, and the individual results were tracked by our proof-of-concept multi-agent approach. Evaluation of the tracking approach is based on finding the endpoint of a query (see text). Each estimate is based on 2500 random audio queries. The upper half shows recognition results on the piece level, the lower half on the score position level. For both categories the percentage of correct detections at rank 1 and the mean reciprocal rank (MRR) are given. Additionally, the mean length of the query in seconds and the mean execution time for a query is shown.
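The tables report two figures per category: the fraction of queries whose correct answer is ranked first, and the mean reciprocal rank. For reference, a minimal sketch of computing both from per-query ranked result lists is given below; the function and variable names are ours.

```python
def rank1_and_mrr(ranked_results, correct):
    """Rank-1 accuracy and mean reciprocal rank (MRR) over a set of queries.
    ranked_results: one ranked candidate list per query.
    correct: one set of acceptable answers per query."""
    top1_hits, reciprocal_ranks = 0, []
    for candidates, accept in zip(ranked_results, correct):
        rank = next((i + 1 for i, c in enumerate(candidates) if c in accept), None)
        if rank == 1:
            top1_hits += 1
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(ranked_results)
    return top1_hits / n, sum(reciprocal_ranks) / n


# example: three queries, with the correct answer at ranks 1, 2 and not retrieved at all
acc, mrr = rank1_and_mrr([["a", "b"], ["c", "a"], ["d"]], [{"a"}, {"a"}, {"a"}])
print(acc, mrr)  # 0.333..., (1 + 0.5 + 0) / 3 = 0.5
```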
553 15th International Society for Music Information Retrieval Conference (ISMIR 2014) on-line algorithm, we are not interested in the start position of the query in the score, but in the endpoint, i.e. if the query was tracked successfully, and the correct current position is returned. Even the causal approach leads to a high percentage of correct results with both the original and the tempo- and pitch-invariant fingerprinting algorithm. Most of the remaining mistakes happen because (very) similar parts within one and the same piece are confused. [2] A. Arzt, S. Böck, and G. Widmer. Fast identification of piece and score position via symbolic fingerprinting. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2012. 6. CONCLUSIONS [4] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of algorithms for audio fingerprinting. In Proceedings of the IEEE International Workshop on Multimedia Signal Processing (MMSP), 2002. [3] S. Böck and M. Schedl. Polyphonic piano note transcription with recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. 6.1 Applications The proposed algorithm is useful in a wide range of applications. As a retrieval algorithm it enables fast and robust (inter- and intra-document) searching and browsing in large collections of musical scores and corresponding performances. Furthermore, we believe that the algorithm is not limited to retrieval tasks in classical music, but may be of use for cover version identification in general, and possibly many other tasks. For example, it was already successfully applied in the field of symbolic music processing to find repeating motifs and sections in complex musical scores [5]. Currently, the algorithm is mainly used in an on-line scenario (see [1]). In connection with a score following algorithm it can act as a ‘piano music companion’. The system is able to recognise arbitrary pieces of classical piano music, identify the position in the score and track the progress of the performer. This enables a wide range of applications for musicians and for consumers of classical music. [5] T. Collins, A. Arzt, S. Flossmann, and G. Widmer. Siarct-cfp: Improving precision and the discovery of inexact musical patterns in point-set representations. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2013. [6] S. Dixon. Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research, 30(1):39–58, 2001. [7] S. Flossmann, W. Goebl, M. Grachten, B. Niedermayer, and G. Widmer. The Magaloff project: An interim report. Journal of New Music Research, 39(4):363–377, 2010. [8] F. Kurth and M. Müller. Efficient index-based audio matching. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):382–395, 2008. [9] D. Lang, D. W. Hogg, K. Mierle, M. Blanton, and S. Roweis. Astrometry. net: Blind astrometric calibration of arbitrary astronomical images. The Astronomical Journal, 139(5):1782, 2010. 6.2 Future Work In its current state the algorithm is able to recognise the correct piece and the score position even for very short queries of piano music. It is invariant to both tempo differences and transpositions and can be used in on-line contexts (i.e. to monitor audio streams and at any time report what it is listening to) and as an off-line retrieval algorithm. 
The main direction for future work is to lift the restriction to piano music and make it applicable to all kinds of classical music, even orchestral music. The limiting component at the moment is the transcription algorithm, which is only trained on piano sounds. 7. ACKNOWLEDGMENTS This research is supported by the Austrian Science Fund (FWF) under project number Z159 and the EU FP7 Project PHENICX (grant no. 601166). [10] M. Müller, F. Kurth, and M. Clausen. Audio matching via chroma-based statistical features. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2005. [11] J. Serra, E. Gómez, and P. Herrera. Audio cover song identification and similarity: background, approaches, evaluation, and beyond. In Z. W. Ras and A. A. Wieczorkowska, editors, Advances in Music Information Retrieval, pages 307–332. Springer, 2010. [12] R. Sonnleitner and G. Widmer. Quad-based audio fingerprinting robust to time and frequency scaling. In Proceedings of the International Conference on Digital Audio Effects, 2014. [13] A. Wang. An industrial strength audio search algorithm. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2003. 8. REFERENCES [1] A. Arzt, S. Böck, S. Flossmann, H. Frostel, M. Gasser, and G. Widmer. The complete classical music companion v0. 9. In Proceedings of the 53rd AES Conference on Semantic Audio, 2014. [14] G. Widmer. Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence, 146(2):129– 148, 2003. 554 15th International Society for Music Information Retrieval Conference (ISMIR 2014) GENDER IDENTIFICATION AND AGE ESTIMATION OF USERS BASED ON MUSIC METADATA Chun-Hung Lu Jyh-Shing Roger Jang Ming-Ju Wu Innovative Digitech-Enabled Applications Computer Science Department Computer Science Department & Services Institute (IDEAS), National Tsing Hua University National Taiwan University Institute for Information Industry, Taipei, Taiwan Hsinchu, Taiwan Taipei, Taiwan [email protected] [email protected] [email protected] ABSTRACT Identity unknown Music recommendation is a crucial task in the field of music information retrieval. However, users frequently withhold their real-world identity, which creates a negative impact on music recommendation. Thus, the proposed method recognizes users’ real-world identities based on music metadata. The approach is based on using the tracks most frequently listened to by a user to predict their gender and age. Experimental results showed that the approach achieved an accuracy of 78.87% for gender identification and a mean absolute error of 3.69 years for the age estimation of 48403 users, demonstrating its effectiveness and feasibility, and paving the way for improving music recommendation based on such personal information. Music metadata of the user Top-1 track Top-2 track Top-3 track Artist name Paul Anka The Platters Johnny Cash … … Song title You Are My Destiny Only You I Love You Because … Input Our system Output Gender: male Age: 65 1. INTRODUCTION Figure 1. Illustration of the proposed system using a real example. Amid the rapid growth of digital music and mobile devices, numerous online music services (e.g., Last.fm, 7digital, Grooveshark, and Spotify) provide music recommendations to assist users in selecting songs. Most music-recommendation systems are based on content- and collaborative-based approaches [15]. 
For content-based approaches [2, 8, 9], recommendations are made according to the audio similarity of songs. By contrast, collaborative-based approaches involve recommending music for a target user according to matched listening patterns that are analyzed from massive users [1, 13]. Because music preferences of users relate to their real-world identities [12], several collaborative-based approaches consider identification factors such as age and gender for music recommendation [14]. However, online music services may experience difficulty obtaining such information. Conversely, music metadata (listening history) is generally available. This motivated us to recognize users’ real-world identities based on music metadata. Figure 1 illustrates the proposed system. In this preliminary study, we focused on predicting gender and age according to the most listened songs. In particular, gender identification was treated as a binary-classification problem, whereas age estimation was considered a regression problem. Two features were applied for both gender identification and age estimation tasks. The first feature, TF*IDF, is a widely used feature representation in natural language processing [16]. Because the music metadata of each user can be considered directly as a document, gender identification can be viewed as a document categorization problem. In addition, TF*IDF is generally applied with latent semantic indexing (LSI) to reduce feature dimension. Consequently, this serves as the baseline feature in this study. The second feature, the Gaussian super vector (GSV) [3], is a robust feature representation for speaker verification. In general, the GSV is used to model acoustic features such as MFCCs. In this study, music metadata was translated into proposed hotness features (a bag-of-features representation) and could be modeled using the GSV. The concept of the GSV can be described as follows. First, c Ming-Ju Wu, Jyh-Shing Roger Jang, Chun-Hung Lu. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Ming-Ju Wu, Jyh-Shing Roger Jang, Chun-Hung Lu. “Gender Identification and Age Estimation of Users Based on Music Metadata”, 15th International Society for Music Information Retrieval Conference, 2014. 555 15th International Society for Music Information Retrieval Conference (ISMIR 2014) a universal background model (UBM) is trained using a Gaussian mixture model (GMM) to represent the global music preference of users. A user-specific GMM can then be obtained using the maximum a posteriori (MAP) adaptation from the UBM. Finally, the mean vectors of the user-specific GMM are applied as GSV features. The remainder of this paper is organized as follows: Section 2 describes the related literature, and Section 3 introduces the TF*IDF; the GSV is explained in Section 4, and the experimental results are presented in Section 5; finally, Section 6 provides the conclusion of this study. User identity Gender Top level Age ǥ Semantic gap Music metadata Middle level Mood Music metadata Artist Genre ǥ Semantic gap Features 2. RELATED LITERATURE Low level Timbre Rhythm ǥ Machine learning has been widely applied to music information retrieval (MIR), a vital task of which is content-based music classification [5, 11]. 
For example, the annual Music Information Retrieval Evaluation eXchange (MIREX) competition has been held since 2004, at which some of the most popular competition tasks have included music genre classification, music mood classification, artist identification, and tag annotation. The purpose of content-based music classification is to recognize semantic music attributes from audio signals. Generally, songs are represented by features with different aspects such as timbre and rhythm. Classifiers are used to identify the relationship between low-level features and mid-level music metadata. However, little work has been done on predicting personal traits based on music metadata [7]. Figure 2 shows a comparison of our approach and content-based music classification. At the top level, user identity provides a basic description of users. At the middle level, music metadata provides a description of music. A semantic gap exists between music metadata and user identity. Beyond content-based music classification, our approach serves as a bridge between them. This enables online music services to recognize unknown users more effectively and, consequently, improve their music recommendations.

Figure 2. Comparison of our approach and content-based music classification.

3. TF*IDF FEATURE REPRESENTATION

The music metadata of each user can be considered a document. The TF*IDF describes the relative importance of an artist for a specific document. LSI is then applied for dimensionality reduction.

3.1 TF*IDF

Let the document (music metadata) of each user in the training set be denoted as

di = {t1, t2, · · · , tn}, di ∈ D    (1)

where tn is the artist name of the top-n listened to song of user i. D is the collection of all documents in the training set. The TF*IDF representation is composed of the term frequency (TF) and inverse document frequency (IDF). TF indicates the importance of an artist for a particular document, whereas IDF indicates the discriminative power of an artist among documents. The TF*IDF can be expressed as

tfidf_{i,n} = tf_{i,n} × log(|D| / df_n)    (2)

where tf_{i,n} is the frequency of tn in di, and df_n represents the number of documents in which tn appears:

df_n = |{d : d ∈ D and tn ∈ d}|    (3)

3.2 Latent Semantic Indexing

The TF*IDF representation scheme leads to high feature dimensionality because the feature dimension is equal to the number of artists. Therefore, LSI is generally applied to transform data into a lower-dimensional semantic space. Let W be the TF*IDF representation of D, where each column represents document di. The LSI performs singular value decomposition (SVD) as follows:

W ≈ U Σ V^T    (4)

where U and V represent terms and documents in the semantic space, respectively. Σ is a diagonal matrix with corresponding singular values. Σ^{−1} U^T can be used to transform new documents into the lower-dimensional semantic space.

4. GSV FEATURE REPRESENTATION

This section introduces the proposed hotness features and explains how to generate the GSV features based on hotness features.

4.1 Hotness Feature Extraction

We assumed each artist tn may exude various degrees of hotness to different genders and ages. For example, the count (the number of times) of Justin Bieber that occurs in users' top listened to songs of the training set was 845, where 649 was from the female class and 196 was from the male class. We could define the hotness of Justin Bieber for females as 76.80% (649/845) and that for males as 23.20% (196/845). Consequently, a user tends to be a female if her top listened to songs relate mostly to Justin Bieber. The age and gender characteristics of a user can therefore be obtained by computing the hotness features of relevant artists.

Let D be divided into classes C according to users' genders or ages:

C1 ∪ C2 ∪ · · · ∪ Cp = D    (5)
C1 ∩ C2 ∩ · · · ∩ Cp = ∅    (6)

where p is the number of classes. Here, p is 2 for gender identification and 51 (the range of age) for age estimation. The hotness feature of each artist tn is defined as hn:

hn = [c_{n,1}/α, c_{n,2}/α, · · · , c_{n,p}/α]^T    (7)

where c_{n,p} is the count of artist tn in Cp, and α = Σ_{l=1}^{p} c_{n,l} is the count of artist tn in all classes. Next, each document in (1) can be transformed to a p × n matrix x, which describes the gender and age characteristics of a user:

x = [h1, h2, · · · , hn]    (8)

Because the form of x can be considered a bag-of-features, the GSV can be applied directly.

4.2 GSV Feature Extraction

Figure 3 is a flowchart of the GSV feature extraction, which can be divided into offline and online stages. At the offline stage, the goal is to construct a UBM [10] to represent the global hotness features, which are then used as prior knowledge for each user at the online stage. First, hotness features are extracted for all music metadata in the training set. The UBM is then constructed through a GMM estimated using the EM (expectation-maximization) algorithm. Specifically, the UBM evaluates the likelihood of a given feature vector x as follows:

f(x|θ) = Σ_{k=1}^{K} w_k N(x|m_k, r_k)    (9)

where θ = (w_1, ..., w_K, m_1, ..., m_K, r_1, ..., r_K) is a set of parameters, with w_k denoting the mixture gain for the kth mixture component, subject to the constraint Σ_{k=1}^{K} w_k = 1, and N(x|m_k, r_k) denoting the Gaussian density function with a mean vector m_k and a covariance matrix r_k. This bag-of-features model is based on the assumption that similar users have similar global artist characteristics.

At the online stage, the MAP adaptation [6] is used to produce an adapted GMM for a specific user. Specifically, MAP attempts to determine the parameter θ in the parameter space Θ that maximizes the posterior probability given the training data x and hyperparameter ω, as follows:

θ_MAP = arg max_θ f(x|θ) g(θ|ω)    (10)

where f(x|θ) is the probability density function (PDF) for the observed data x given the parameter θ, and g(θ|ω) is the prior PDF given the hyperparameter ω. Finally, for each user, the mean vectors of the adapted GMM are stacked to form a new feature vector called GSV. Because the adapted GMM is obtained using MAP adaptation over the UBM, it is generally more robust than directly modeling the feature vectors by using GMM without any prior knowledge.

Figure 3. Flowchart of the GSV feature extraction (offline stage: training set metadata → hotness feature extraction → ML estimation → UBM; online stage: a user's metadata → hotness feature extraction → MAP adaptation → GSV).

5. EXPERIMENTAL RESULTS

This section describes data collection, experimental settings, and experimental results.

5.1 Data Collection

The Last.fm API was applied for data set collection, because it allows anyone to access data including albums, tracks, users, events, and tags. First, we collected user IDs through the User.getFriends function. Second, the User.getInfo function was applied to each user for obtaining their age and gender information. Finally, the User.getTopTracks function was applied to acquire at most the top-50 tracks listened to by a user. The track information included song titles and artist names, but only artist names were used for feature extraction in this preliminary study.
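The hotness-feature and GSV construction of Sections 4.1 and 4.2 can be sketched as follows. This is a hedged illustration, not the authors' implementation: it uses scikit-learn's GaussianMixture as a stand-in for the UBM training (the paper does not state which GMM implementation was used), the common relevance-factor form of MAP mean adaptation, a uniform default for unseen artists, and function names of our own choosing.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def build_hotness_table(train_docs, train_classes, p):
    """Per-artist hotness vectors (Eqn (7)): class counts of the artist over its total count.
    train_docs: list of artist-name lists; train_classes: class index per user."""
    counts = {}
    for artists, c in zip(train_docs, train_classes):
        for a in artists:
            counts.setdefault(a, np.zeros(p))[c] += 1
    return {a: v / v.sum() for a, v in counts.items()}


def hotness_features(artists, hotness_table, p):
    """Stack the hotness vectors of a user's top artists into a p x n matrix (Eqn (8))."""
    cols = [hotness_table.get(a, np.full(p, 1.0 / p)) for a in artists]  # unseen artists: uniform
    return np.column_stack(cols)


def train_ubm(all_user_matrices, n_components=2, seed=0):
    """Fit the UBM: a GMM over the pooled hotness vectors of all training users (Eqn (9))."""
    pooled = np.concatenate([x.T for x in all_user_matrices])  # rows are p-dimensional vectors
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(pooled)


def gsv(ubm, user_matrix, relevance=16.0):
    """MAP-adapt the UBM means to one user and stack them into a GSV (Eqn (10))."""
    X = user_matrix.T                                   # (n_artists, p)
    resp = ubm.predict_proba(X)                         # responsibilities p(k | x)
    n_k = resp.sum(axis=0)                              # soft counts per component
    ex_k = (resp.T @ X) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]          # adaptation coefficients
    adapted_means = alpha * ex_k + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()                        # e.g. 2 x 2 = 4 or 2 x 51 = 102 dims
```

With two mixture components and p = 2 or p = 51 this yields GSV dimensionalities of 4 and 102, matching the figures reported in Section 5.2 below.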
The final collected data set included 96807 users, in which each user had at least 40 top tracks as well as complete gender and age information. According to the users' country codes, they were from 211 countries (or regions such as Hong Kong). The ratio of countries is shown in Figure 4. The majority were Western countries. The gender ratio is shown in Figure 5, in which approximately one-third of users (33.79%) were female and two-thirds (66.21%) were male. The age distribution of users is shown in Figure 6. The distribution was a skewed normal distribution and most users were young people. Figure 7 shows the count of each artist that occurred in the users' top listened songs. Among 133938 unique artists in the data set, the ranking of popularity presents a power-law distribution. This demonstrates that a few artists dominate the top listened songs. Although the majority of artists are not popular for all users, this does not indicate that they are unimportant, because their hotness could be discriminative over ages and gender.

Figure 4. Ratio of countries of the collected data set (legend: United Kingdom, United States, Brazil, Germany, Russia, Italy, Poland, Netherlands, Belgium, Canada, Australia, Spain, France, Sweden, Mexico, Others).
Figure 5. Gender ratio of the collected data set (33.79% female, 66.21% male).
Figure 6. Age distribution of the collected data set.
Figure 7. Count of artists of users' top listened songs. Ranking of popularity presents a power-law distribution.

5.2 Experimental Settings

The data set was equally divided into two subsets, the training (48404) and test (48403) sets. The open-source Python library Gensim was applied for the TF*IDF and LSI implementation. We followed the default setting of Gensim, which maintained 200 latent dimensions for the TF*IDF. A support vector machine (SVM) tool, LIBSVM [4], was applied as the classifier. The SVM extension, support vector regression (SVR), was applied as the regressor, which has been observed in many cases to be superior to existing regression approaches. The RBF kernel with γ = 8 was applied to the SVM and SVR. For the UBM parameters, two Gaussian mixture components were experimentally applied (similar results can be obtained when using a different number of mixture components). Consequently, the numbers of dimensions of GSV features for gender identification and age estimation were 4 (2×2) and 102 (2×51), respectively.
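A compact sketch of the classification and regression set-up described in this section is given below. It uses scikit-learn's SVC and SVR (which are built on LIBSVM) as stand-ins for the LIBSVM tool, with the RBF kernel and γ = 8 as stated; all other settings, the clipping of predicted ages to 15–65, and the function names are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.metrics import accuracy_score, mean_absolute_error


def train_and_evaluate(X_train, y_gender_train, y_age_train,
                       X_test, y_gender_test, y_age_test, gamma=8):
    """Gender identification (binary SVM) and age estimation (SVR) on user feature
    vectors such as the GSVs sketched earlier; RBF kernel with gamma = 8."""
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_gender_train)
    reg = SVR(kernel="rbf", gamma=gamma).fit(X_train, y_age_train)
    gender_pred = clf.predict(X_test)
    # one simple way to keep predicted ages in the 15-65 range mentioned in the text
    age_pred = np.clip(reg.predict(X_test), 15, 65)
    return (accuracy_score(y_gender_test, gender_pred),
            mean_absolute_error(y_age_test, age_pred))
```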
5.3 Gender Identification

The accuracy was 78.87% and 78.21% for GSV and TF*IDF + LSI features, respectively. This indicates that both features are adequate for such a task. Despite the low dimensionality of GSV (4), it was superior to the high dimensionality of TF*IDF + LSI (200). This indicates the effectiveness of GSV use and the proposed hotness features. Figures 8 and 9 respectively show the confusion matrix of using GSV and TF*IDF + LSI features. Both features yielded higher accuracies for the male class than for the female class. A possible explanation is that a portion of the female users' listening preferences were similar to those of the male users. The classifier tended to favor the majority class (male), resulting in many female instances with incorrect predictions. The age difference can also be regarded for further analysis. Figure 10 shows the gender identification results of the two features over various ages. Both features tended to have lower accuracies between the ages of 25 and 40 years, implying that a user whose age is between 25 and 40 years seems to have more blurred gender boundaries than do users below 25 years and above 40 years.

Figure 8. Confusion matrix of gender identification by using GSV features:
          Predicted male    Predicted female
Male      86.40% (27690)    13.60% (4358)
Female    35.89% (5870)     64.11% (10485)

Figure 9. Confusion matrix of gender identification by using TF*IDF + LSI features:
          Predicted male    Predicted female
Male      92.14% (29529)    7.86% (2519)
Female    49.10% (8030)     50.90% (8325)

Figure 10. Gender identification results for various ages.

5.4 Age Estimation

Table 1 shows the performance comparison for age estimation. The mean absolute error (MAE) was applied as the performance index. The range of the predicted ages of the SVR is between 15 and 65 years. The experimental results show that the MAE is 3.69 and 4.25 years for GSV and TF*IDF + LSI, respectively. The GSV describes the age characteristics of a user and utilizes prior knowledge from the UBM; therefore, the GSV features are superior to those of the TF*IDF + LSI. For further analysis, gender difference was also considered. Notably, the MAE of females is less than that of males for both GSV and TF*IDF + LSI features. In particular, the MAE differences between males and females are approximately 1.8 for both features, implying that females have more distinct age divisions than males do.

Method         MAE    MAE (male)    MAE (female)
GSV            3.69   4.31          2.48
TF*IDF+LSI     4.25   4.86          3.05

Table 1. Performance comparison for age estimation.

6. CONCLUSION AND FUTURE WORK

This study confirmed the possibility of predicting users' age and gender based on music metadata. Three of the findings are summarized as follows.

• GSV features are superior to those of TF*IDF + LSI for both gender identification and age estimation tasks.
• Males tend to exhibit higher accuracy than females do in gender identification, whereas females are more predictable than males in age estimation.
• The experimental results indicate that gender identification is influenced by age, and vice versa. This suggests that an implicit relationship may exist between them.

Future work could include utilizing the proposed approach to improve music recommendation systems. We will also explore the possibility of recognizing deeper social aspects of user identities, such as occupation and education level.

7. ACKNOWLEDGEMENT

This study is conducted under the "NSC 102-3114-Y-307-026 A Research on Social Influence and Decision Support Analytics" of the Institute for Information Industry which is subsidized by the National Science Council.

8. REFERENCES

[1] L. Barrington, R. Oda, and G. Lanckriet. Smarter than genius? human evaluation of music recommender systems. In Proceedings of the International Symposium on Music Information Retrieval, pages 357–362, 2009.
[2] D. Bogdanov, M. Haro, F.
Fuhrmann, E. Gomez, and P. Herrera. Content-based music recommendation based on user preference examples. In Proceedings of the ACM Conf. on Recommender Systems. Workshop on Music Recommendation and Discovery, 2010. [3] W. M. Campbell, D. E. Sturim, and D. A. Reynolds. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–311, May 2006. [4] C. C. Chang and C. J. Lin. Libsvm: A library for support vector machine, 2010. [12] A. Uitdenbogerd and R. V. Schnydel. A review of factors affecting music recommender success. In Proceedings of the International Symposium on Music Information Retrieval, pages 204–208, 2002. [13] B. Xu, J. Bu, C. Chen, and D. Cai. An exploration of improving collaborative recommender systems via user-item subgroups. In Proceedings of the 21st international conference on World Wide Web, pages 21–30, 2012. [14] Billy Yapriady and AlexandraL. Uitdenbogerd. Combining demographic data with collaborative filtering for automatic music recommendation. In Knowledge-Based Intelligent Information and Engineering Systems, volume 3684 of Lecture Notes in Computer Science, pages 201–207. 2005. [15] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno. Hybrid collaborative and content-based music recommendation using probabilistic model with latent user preferences. In Proceedings of the International Symposium on Music Information Retrieval, pages 296–301, 2006. [16] W. Zhang, T. Yoshida, and X. Tang. A comparative study of tf*idf, lsi and multi-words for text classification. Expert Systems with Applications, 38(3):2758–2765, 2011. [5] Z. Fu, G. Lu, K. M. Ting, and D. Zhang. A survey of audio-based music classification and annotation. IEEE Trans. Multimedia., 13(2):303–319, Apr. 2011. [6] J. L. Gauvain and C. H. Lee. Maximum a posteriori estimation for multivariate gaussian mixture observations of markov chains. IEEE Trans. Audio, Speech, Lang. Process., 2(2):291–298, Apr. 1994. [7] Jen-Yu Liu and Yi-Hsuan Yang. Inferring personal traits from music listening history. In Proceedings of the Second International ACM Workshop on Music Information Retrieval with User-centered and Multimodal Strategies, MIRUM ’12, pages 31–36, New York, NY, USA, 2012. ACM. [8] B. McFee, L. Barrington, and G. Lanckriet. Learning content similarity for music recommendation. IEEE Trans. Audio, Speech, Lang. Process., 20(8):2207–2218, Oct. 2012. [9] A. V. D. Orrd, S. Dieleman, and B. Benjamin. Deep content-based music recommendation. In Advances in Neural Information Processing Systems, 2013. [10] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted gaussian mixture models. Digital Signal Process, 10(13):19–41, Jan. 2000. [11] B. L. Sturm. A survey of evaluation in music genre recognition. In Proceedings of the Adaptive Multimedia Retrieval, 2012. 560 15th International Society for Music Information Retrieval Conference (ISMIR 2014) INFORMATION-THEORETIC MEASURES OF MUSIC LISTENING BEHAVIOUR Daniel Boland, Roderick Murray-Smith School of Computing Science, University of Glasgow, United Kingdom [email protected]; [email protected] ABSTRACT We identify the entropy of music features as a metric for characterising music listening behaviour. This measure can be used to produce time-series analyses of user behaviour, allowing for the identification of events where this behaviour changed. In a case study, the date when a user adopted a different music retrieval system is detected. 
These detailed analyses of listening behaviour can support user studies or provide implicit relevance feedback to music retrieval. More broad analyses are performed across the 10, 000 playlists. A Mutual Information based feature selection algorithm is employed to identify music features relevant to how users create playlists. This user-centred feature selection can sanity-check the choice of features in MIR. The information-theoretic approach introduced here is applicable to any discretisable feature set and distinct in being based solely upon actual user behaviour rather than assumed ground-truth. With the techniques described here, MIR researchers can perform quantitative yet user-centred evaluations of their music features and retrieval systems. We present an information-theoretic approach to the measurement of users’ music listening behaviour and selection of music features. Existing ethnographic studies of music use have guided the design of music retrieval systems however are typically qualitative and exploratory in nature. We introduce the SPUD dataset, comprising 10, 000 handmade playlists, with user and audio stream metadata. With this, we illustrate the use of entropy for analysing music listening behaviour, e.g. identifying when a user changed music retrieval system. We then develop an approach to identifying music features that reflect users’ criteria for playlist curation, rejecting features that are independent of user behaviour. The dataset and the code used to produce it are made available. The techniques described support a quantitative yet user-centred approach to the evaluation of music features and retrieval systems, without assuming objective ground truth labels. 1.1 Understanding Users 1. INTRODUCTION User studies have provided insights about user behaviour in retrieving and listening to music and highlighted the lack of consideration in MIR about actual user needs. In 2003, Cunningham et al. bemoaned that development of music retrieval systems relied on “anecdotal evidence of user needs, intuitive feelings for user information seeking behavior, and a priori assumptions of typical usage scenarios” [5]. While the number of user studies has grown, the situation has been slow to improve. A review conducted a decade later noted that approaches to system evaluation still ignore the findings of user studies [12]. This issue is stated more strongly by Schedl and Flexer, describing systems-centric evaluations that “completely ignore user context and user properties, even though they clearly influence the result” [15]. Even systems-centric work, such as the development of music classifiers, must consider the user-specific nature of MIR. Downie termed this the multiexperiential challenge, and noted that “Music ultimately exists in the mind of its perceiver” [6]. Despite all of this, the assumption of an objective ground truth for music genre, mood etc. is common [4], with evaluations focusing on these rather than considering users. It is clear that much work remains in placing the user at the centre of MIR. Understanding how users interact with music retrieval systems is of fundamental importance to the field of Music Information Retrieval (MIR). The design and evaluation of such systems is conditioned upon assumptions about users, their listening behaviours and their interpretation of music. While user studies have offered guidance to the field thus far, they are mostly exploratory and qualitative [20]. 
The availability of quantitative metrics would support the rapid evaluation and optimisation of music retrieval. In this work, we develop an information-theoretic approach to measuring users’ music listening behaviour, with a view to informing the development of music retrieval systems. To demonstrate the use of these measures, we compiled ‘Streamable Playlists with User Data’ (SPUD) – a dataset comprising 10, 000 playlists from Last.fm 1 produced by 3351 users, with track metadata including audio streams from Spotify. 2 We combine the dataset with the mood and genre classification of Syntonetic’s Moodagent, 3 yielding a range of intuitive music features to serve as examples. c Daniel Boland, Roderick Murray-Smith. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Daniel Boland, Roderick MurraySmith. “Information-Theoretic Measures of Music Listening Behaviour”, 15th International Society for Music Information Retrieval Conference, 2014. 1 . http://www.last.fm 2 . http://www.spotify.com 3 . http://www.moodagent.com Last accessed: 30/04/14 561 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 1.2 Evaluation in MIR The lack of robust evaluations in the field of MIR was identified by Futrelle and Downie as early as 2003 [8]. They noted the lack of any standardised evaluations and in particular that MIR research commonly had an “emphasis on basic research over application to, and involvement with, users.” In an effort to address these failings, the Music Information Retrieval Evaluation Exchange (MIREX) was established [7]. MIREX provides a standardised framework of evaluation for a range of MIR problems using common metrics and datasets, and acts as the benchmark for the field. While the focus on this benchmark has done a great deal towards the standardisation of evaluations, it has distracted research from evaluations with real users. A large amount of evaluative work in MIR focuses on the performance of classifiers, typically of mood or genre classes. A thorough treatment of the typical approaches to evaluation and their shortcomings is given by Sturm [17]. We note that virtually all such evaluations seek to circumvent involving users, instead relying on a ‘ground truth’ which is assumed to be objective. An example of a widely used ground truth dataset is GTZAN, a small collection of music with the author’s genre annotations. Even were the objectivity of such annotations to be assumed, such datasets can be subject to confounding factors and mislabellings as shown by Sturm [16]. Schedl et al. also observe that MIREX evaluations involve assessors’ own subjective annotations as ground truth [15]. Figure 1. Distribution of playlist lengths within the SPUD dataset. The distribution peaks around a playlist length of 12 songs. There is a long tail of lengthy playlists. 2. THE SPUD DATASET The SPUD dataset of 10, 000 playlists was produced by scraping from Last.fm users who were active throughout March and April, 2014. The tracks for each playlist are also associated with a Spotify stream, with scraped metadata, such as artist, popularity, duration etc. The number of unique tracks in the dataset is 271, 389 from 3351 users. The distribution of playlist lengths is shown in Figure 1. We augment the dataset with proprietary mood and genre features produced by Syntonetic’s Moodagent. We do this to provide high-level and intuitive features which can be used as examples to illustrate the techniques being discussed. 
It is clear that many issues remain with genre and mood classification [18] and the results in this work should be interpreted with this in mind. Our aim in this work is not to identify which features are best for music classification but to contribute an approach for gaining an additional perspective on music features. Another dataset of playlists AOTM-2011 is published [13] however the authors only give fragments of playlists where songs are also present in the Million Song Dataset (MSD) [1]. The MSD provides music features for a million songs but only a small fraction of songs in AOTM-2011 were matched in MSD. Our SPUD dataset is distinct in maintaining complete playlists and having time-series data of songs listened to. 1.3 User-Centred Approaches There remains a need for robust, standardised evaluations featuring actual users of MIR systems, with growing calls for a more user-centric approach. Schedl and Flexer made the broad case for “putting the user in the center of music information retrieval”, concerning not only user-centred development but also the need for evaluative experiments which control independent variables that may affect dependent variables [14]. We note that there is, in particular, a need for quantitative dependent variables for user-centred evaluations. For limited tasks such as audio similarity or genre classification, existing dependent variables may be sufficient. If the field of MIR is to concern itself with the development of complete music retrieval systems, their interfaces, interaction techniques, and the needs of a variety of users, then additional metrics are required. Within the field of HCI it is typical to use qualitative methods such as the think-aloud protocol [9] or Likert-scale questionnaires such as the NASA Task Load Index (TLX) [10]. Given that the purpose of a Music Retrieval system is to support the user’s retrieval of music, a dependent variable to measure this ability is desirable. Such a measure cannot be acquired independently of users – the definition of musical relevance is itself subjective. Users now have access to ‘Big Music’ – online collections with millions of songs, yet it is unclear how to evaluate their ability to retrieve this music. The information-theoretic methodology introduced in this work aims to quantify the exploration, diversity and underlying mental models of users’ music retrieval. 3. MEASURING MUSIC LISTENING BEHAVIOUR When evaluating a music retrieval system, or performing a user study, it would be useful to quantify the musiclistening behaviour of users. Studying this behaviour over time would enable the identification of how different music retrieval systems influence user behaviour. Quantifying listening behaviour would also provide a dependent variable for use in MIR evaluations. We introduce entropy as one such quantitative measure, capturing how a user’s music-listening relates to the music features of their songs. 562 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 3.1 Entropy For each song being played by a user, the value of a given music feature can be taken as a random variable X. The entropy H(X) of this variable indicates the uncertainty about the value of that feature over multiple songs in a listening session. This entropy measure gives a scale from a feature’s value never changing, through to every level of the feature being equally likely. The more a user constrains their music selection by a particular feature, e.g. 
mood or album, the lower the entropy is over those features. The entropy for a feature is defined as:

H(X) = − Σ_{x∈X} p(x) log2[p(x)] , (1)

where x is every possible level of the feature X and the distribution p(x) is estimated from the songs in the listening session. The resulting entropy value is measured in bits, though it can be normalised by dividing by the maximum entropy log2[|X|]. Estimating entropy in this way can be done for any set of features, though it requires that they are discretised to an appropriate number of levels. For example, if a music listening session is dominated by songs of a particular tempo, the distribution over values of a TEMPO feature would be very biased. The entropy H(TEMPO) would thus be very low. Conversely, if users used shuffle or listened to music irrespective of tempo, then the entropy H(TEMPO) would tend towards the average entropy of the whole collection.

3.2 Applying a Window Function
Many research questions regarding a user's music listening behaviour concern the change in that behaviour over time. An evaluation of a music retrieval interface might hypothesise that users will be empowered to explore a more diverse range of music. Musicologists may be interested to study how listening behaviour has changed over time and which events precede such changes. It is thus of interest to extend Eqn (1) to define a measure of entropy which is also a function of time:

H(X, t) = H(w(X, t)) , (2)

where w(X, t) is a window function taking n samples of X around time t. In this paper we use a rectangular window function with n = 20, assuming that most albums will have fewer tracks than this. The entropy at any given point is limited to the maximum possible H(X, t) = log2[n], i.e. where each of the n points has a unique value. An example of the change in entropy for a music feature over time is shown in Figure 2. In this case H(ALBUM) is shown, as this will be 0 for album-based listening and at maximum for exploratory or radio-like listening.

Figure 2. Windowed entropy over albums shows a user's album-based music listening over time. Each point represents 20 track plays. The black line depicts mean entropy, calculated using locally weighted regression [3] with 95% CI of the mean shaded. A changepoint is detected around Feb. 2010, as the user began using online radio (light blue).

3.3 Changepoints in Music Retrieval
Having produced a time-series analysis of music-listening behaviour, we are now able to identify events which caused changes in this behaviour. In order to identify changepoints in the listening history, we apply the 'Pruned Exact Linear Time' (PELT) algorithm [11]. The time-series is partitioned in a way that reduces a cost function of changes in the mean and variance of the entropy. Changepoints can be of use in user studies; for example, in Figure 2 the user explained in an interview that the detected changepoint occurred when they switched to using online radio. There is a brief return to album-based listening after the changepoint – users' music retrieval behaviour can be a mixture of different retrieval models. Changepoint detection can also be a user-centred dependent variable in evaluating music retrieval interfaces, i.e. does listening behaviour change as the interface changes? Further examples of user studies are available with the SPUD dataset.
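As a concrete reading of Eqns (1) and (2), here is a minimal Python sketch (separate from the R scripts distributed with SPUD) of session entropy and its windowed version. The toy listening history is invented, and |X| is taken as the number of levels observed in the window:

```python
import math
from collections import Counter

def entropy(values, normalise=False):
    """Shannon entropy (bits) of a sequence of discrete feature values, as in Eqn (1)."""
    counts = Counter(values)
    total = len(values)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if normalise and len(counts) > 1:
        h /= math.log2(len(counts))  # divide by the maximum entropy log2|X|
    return h

def windowed_entropy(values, n=20):
    """Entropy over a sliding rectangular window of n plays, as in Eqn (2)."""
    return [entropy(values[t:t + n]) for t in range(len(values) - n + 1)]

# Invented history: album-based listening followed by varied, radio-like listening.
session = ["album_A"] * 40 + [f"album_{i % 15}" for i in range(60)]
series = windowed_entropy(session, n=20)
print(round(series[0], 2), round(series[-1], 2))  # low during album listening, high later
```

The resulting time series is exactly the kind of signal that a changepoint detector such as PELT (Section 3.3) would then be run over.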
It is important to note that while trends in mean entropy can be identified, the entropy of music listening is itself quite a noisy signal – it is unlikely that a user will maintain a single music-listening behaviour over a large period of time. Periods of album listening (low or zero entropy) can be seen through the time-series, even after the overall trend is towards shuffle or radio-like music listening.

3.4 Identifying Listening Style
The style of music retrieval that the user is engaging in can be inferred using the entropy measures. Where the entropy for a given music feature is low, the user's listening behaviour can be characterised by that feature, i.e. we can be certain about that feature's level. Alternatively, where a feature has high entropy, the user is not 'using' that feature in their retrieval. When a user opts to use shuffle-based playback, i.e. the random selection of tracks, there is the unique case that entropy across all features will tend towards the maximum. In many cases, feature entropies have high covariance, e.g. songs on an album will have the same artist and similar features. We did not include other features in Figure 2 as the same pattern was apparent.

4. SELECTING FEATURES FROM PLAYLISTS
Identifying which music features best describe a range of playlists is not only useful for playlist recommendation, but also provides an insight into how users organise and think about music. Music recommendation and playlist generation typically work on the basis of genre, mood and popularity, and we investigate which of these features is supported by actual user behaviour. As existing retrieval systems are based upon these features, there is a potential 'chicken-and-egg' effect where the features which best describe user playlists are those which users are currently exposed to in existing retrieval interfaces.

4.1 Mutual Information
Information-theoretic measures can be used to identify to what degree a feature shares information with class labels. For a feature X and a class label Y, the mutual information I(X; Y) between these two can be given as:

I(X; Y) = H(X) − H(X | Y) , (3)

that is, the entropy of the feature H(X) minus the entropy of that feature if the class is known, H(X | Y). By taking membership of playlists as a class label, we can determine how much we can know about a song's features if we know what playlist it is in. When using mutual information to compare clusterings in this way, care must be taken to account for random chance mutual information [19]. We adapt this approach to focus on how much the feature entropy is reduced, and normalise accordingly:

AMI(X; Y) = (I(X; Y) − E[I(X; Y)]) / (H(X) − E[I(X; Y)]) , (4)

where AMI(X; Y) is the adjusted mutual information and E[I(X; Y)] is the expectation of the mutual information, i.e. that due to random chance. The AMI gives a normalised measure of how much of the feature's entropy is explained by the playlist. When AMI = 1, the feature level is known exactly if the playlist is known; when AMI = 0, nothing about the feature is known if the playlist is known.
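A small sketch of Eqns (3) and (4): mutual information is estimated from empirical counts, and the chance term E[I(X; Y)] is approximated by permuting the playlist labels, whereas the paper relies on the analytic correction of Vinh et al. [19]. The toy feature and playlist arrays are invented:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X;Y) = H(X) - H(X|Y), estimated from empirical counts (Eqn 3)."""
    h_x_given_y = sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))
    return entropy(x) - h_x_given_y

def adjusted_mutual_information(x, y, n_perm=200, seed=0):
    """AMI (Eqn 4) with a permutation estimate of the chance-level E[I(X;Y)]."""
    rng = np.random.default_rng(seed)
    expected = np.mean([mutual_information(x, rng.permutation(y)) for _ in range(n_perm)])
    return (mutual_information(x, y) - expected) / (entropy(x) - expected)

# Invented data: a discretised ROCK feature and playlist membership for 300 tracks.
rng = np.random.default_rng(1)
playlist = np.repeat(np.arange(10), 30)      # 10 playlists of 30 tracks each
rock = (playlist < 5).astype(int)            # feature strongly tied to playlist membership
rock[rng.integers(0, 300, 40)] ^= 1          # plus some noise
print(round(adjusted_mutual_information(rock, playlist), 2))
```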
4.2 Linking Features to Playlists
We analysed the AMI between the 10,000 playlists in the SPUD dataset and a variety of high-level music features. The ranking of some of these features is given in Figure 3.

Figure 3. Features are ranked by their Adjusted Mutual Information with playlist membership. Playlists are distinguished more by whether they contain ROCK or ANGRY music than by whether they contain POPULAR or WORLD.

Our aim is only to illustrate this approach, as any results are only as reliable as the underlying features. With this in mind, the features ROCK and ANGRY had the most uncertainty explained by playlist membership. While the values may seem small, they are calculated over many playlists, which may combine moods, genres and other criteria. As these features change most between playlists (rather than within them), they are the most useful for characterising the differences between playlists. The DURATION feature ranked higher than expected; further investigation revealed playlists that combined lengthy DJ mixes. It is perhaps unsurprising that playlists were not well characterised by whether they included WORLD music. It is of interest that TEMPO was not one of the highest ranked features, illustrating the style of insights available when using this approach. Further investigation is required to determine whether playlists are not based on tempo as much as is often assumed or if this result is due to the peculiarities of the proprietary perceptual tempo detection.

4.3 Feature Selection
Features can be selected using information-theoretic measures, with a rigorous treatment of the field given by Brown et al. [2]. They define a unifying framework within which to discuss methods for selecting a subset of features using mutual information. This is done by defining a J criterion for a feature:

J(f_n) = I(f_n; C | S) . (5)

This gives a measure of how much information the feature shares with playlists given some previously selected features, and can be used as a greedy feature selection algorithm. Intuitively, features should be selected that are relevant to the classes but that are also not redundant with regard to previously selected features. A range of estimators for I(f_n; C | S) are discussed in [2]. As a demonstration of the feature selection approach we have described, we apply it to the features depicted in Figure 3, selecting features to minimise redundancy. The selected subset of features in rank order is: ROCK, DURATION, POPULARITY, TENDER and JOY. It is notable that ANGRY had an AMI that was almost the same as ROCK, but it is redundant if ROCK is included. Unsurprisingly, the second feature selected is from a different source than the first – the duration information from Spotify adds to that used to produce the Syntonetic mood and genre features. Reducing redundancy in the selected features in this way yields a very different ordering, though one that may give a clearer insight into the factors behind playlist construction.
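A sketch of the greedy loop implied by the J criterion in Eqn (5), using a naive estimate of I(f; C | S) that conditions on the joint value of the already-selected, discretised features; Brown et al. [2] discuss better-behaved estimators, and the feature names and toy data below are invented:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    h_cond = sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))
    return entropy(x) - h_cond

def conditional_mi(f, c, conditioning):
    """Naive I(f; C | S): condition on the joint value of the selected features."""
    if not conditioning:
        return mutual_information(f, c)
    joint = np.array(["|".join(map(str, row)) for row in np.stack(conditioning, axis=1)])
    return sum((joint == v).mean() * mutual_information(f[joint == v], c[joint == v])
               for v in np.unique(joint))

def greedy_select(features, classes, k=3):
    """Greedy forward selection by J(f) = I(f; C | S), Eqn (5)."""
    remaining = dict(features)
    chosen_names, chosen_arrays = [], []
    while remaining and len(chosen_names) < k:
        name = max(remaining, key=lambda n: conditional_mi(remaining[n], classes, chosen_arrays))
        chosen_names.append(name)
        chosen_arrays.append(remaining.pop(name))
    return chosen_names

# Invented example: ANGRY duplicates ROCK, DURATION is mostly noise.
rng = np.random.default_rng(0)
playlists = np.repeat(np.arange(6), 50)
rock = (playlists % 2).astype(int)
angry = rock.copy()
duration = rng.integers(0, 3, playlists.size)
print(greedy_select({"ROCK": rock, "ANGRY": angry, "DURATION": duration}, playlists, k=2))
```

Because ANGRY carries no information beyond ROCK, it is passed over once ROCK is selected, mirroring the redundancy behaviour described above.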
5. DISCUSSION
While we reiterate that this work only uses a specific set of music features and user base, we consider our results to be encouraging. It is clear that the use of entropy can provide a detailed time-series analysis of user behaviour and could prove a valuable tool for MIR evaluation. Similarly, the use of adjusted mutual information allows MIR researchers to directly link work on acquiring music features to the ways in which users interact with music. In this section we consider how the information-theoretic techniques described in this work can inform the field of MIR.

5.1 User-Centred Feature Selection
The feature selection shown in this paper is done directly from the user data. In contrast, feature selection is usually performed using classifier wrappers with ground-truth class labels such as genre. The use of genre is based on the assumption that it supports the way users currently organise music, and features are selected based on these labels. This has led to issues including classifiers being trained on factors that are confounded with these labels and that are not of relevance to genre or users [18]. Our approach selects features independently of the choice of classifier, in what is termed a 'filter' approach. The benefit of doing this is that a wide range of features can be quickly filtered at relatively little computational expense. While the classifier 'wrapper' approach may achieve greater performance, it is more computationally expensive and more likely to suffer from overfitting. The key benefit of filtering features based on user behaviour is that it provides a perspective on music features that is free from assumptions about users and music ground truth. This user-centred perspective provides a sanity-check for music features and classification – if a feature does not reflect the ways in which users organise their music, then how useful is it for music retrieval?

5.2 When To Learn
The information-theoretic measures presented offer an implicit relevance feedback for music retrieval. While we have considered the entropy of features as reflecting user behaviour, this behaviour is conditioned upon the existing music retrieval interfaces being used. For example, after issuing a query and receiving results, the user selects relevant songs from those results. If the entropy of a feature for those selected songs is small relative to the result set, then this feature is implicitly relevant to the retrieval. The identification of shuffle and explorative behaviour provides some context for this implicit relevance feedback. Music which is listened to in a seemingly random fashion may represent an absent or disengaged user, adding noise to attempts to weight recommender systems or build a user profile. At the very least, where entropy is high across all features, those features do not reflect the user's mental model for their music retrieval. The detection of shuffle or high-entropy listening states thus provides a useful data hygiene measure when interpreting listening data.

5.3 Engagement
The entropy measures capture how much each feature is being 'controlled' by the user when selecting their music. We have shown that it spans a scale from a user choosing to listen to something specific to the user yielding control to radio or shuffle. Considering entropy over many features in this way gives a high-dimensional vector representing the user's engagement with music. Different styles of music retrieval occupy different points in this space, commonly the two extremes of listening to a specific album or just shuffling. There is an opportunity for music retrieval that has the flexibility to support users engaging and applying control over music features only insofar as they desire to. An example of this would be a shuffle mode that allowed users to bias it to varying degrees, or to some extent, the feedback mechanism in recommender systems.

5.4 Open Source
The SPUD dataset is made available for download at: http://www.dcs.gla.ac.uk/~daniel/spud/ Example R scripts for importing data from SPUD and producing the analyses and plots in this paper are included.
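Separately from those released scripts, the implicit relevance-feedback idea from Section 5.2 can be sketched in a few lines of Python; the feature values, the result set and the 0.5 threshold below are invented for illustration:

```python
import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def implicitly_relevant(result_set, selected, feature, ratio=0.5):
    """Flag a feature as implicitly relevant when its entropy over the user's
    selections is small relative to its entropy over the whole result set."""
    h_results = entropy([song[feature] for song in result_set])
    h_selected = entropy([song[feature] for song in selected])
    return h_results > 0 and h_selected < ratio * h_results

# Invented result set: the user picks only the rock songs from a mixed result list.
results = [{"genre": g, "tempo": t} for g, t in
           zip(["rock", "jazz", "rock", "pop", "rock", "jazz"],
               [120, 90, 140, 100, 125, 95])]
selected = [s for s in results if s["genre"] == "rock"]
print(implicitly_relevant(results, selected, "genre"))  # True: genre constrained the choice
print(implicitly_relevant(results, selected, "tempo"))  # False: tempo still varied
```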
The code used to scrape this dataset is available under the MIT open source license, and can be accessed at: http://www.github.com/dcboland/ The MoodAgent features are commercially sensitive, thus not included in the SPUD dataset. At present, industry is far better placed to provide such large scale analyses of music data than academia. Even with user data and the required computational power, large-scale music analyses require licensing arrangements with content providers, presenting a serious challenge to academic MIR research. Our adoption of commercially provided features has allowed us to demonstrate our information-theoretic approach, and we distribute the audio stream links, however it is unlikely that many MIR researchers will have the resources to replicate all of these large scale analyses. The CoSound 4 project is an example of industry collaborating with academic research and state bodies to navigate the complex issues of music licensing and large-scale analysis. 6. CONCLUSION This work introduces an information-theoretic approach to the study of users’ music listening behaviour. The case is made for a more user-focused yet quantitative approach to evaluation in MIR. We described the use of entropy to produce time-series analyses of user behaviour, and showed how changes in music-listening style can be detected. An example is given where a user started using online radio, having higher entropy in their listening. We introduced the use of adjusted mutual information to establish which music features are linked to playlist organisation. These techniques provide a quantitative approach to user studies and ground feature selection in user behaviour, contributing tools to support the user-centred future of MIR. 4 . http://www.cosound.dk/ Last accessed: 30/04/14 565 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ACKNOWLEDGEMENTS This work was supported in part by Bang & Olufsen and the Danish Council for Strategic Research of the Danish Agency for Science Technology and Innovation under the CoSound project, case number 11-115328. This publication only reflects the authors’ views. 7. REFERENCES [1] T Bertin-Mahieux, D. P Ellis, B Whitman, and P Lamere. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval, Miami, Florida, 2011. [2] G Brown, A Pocock, M.-J Zhao, and M Luján. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13:27–66, 2012. [3] W. S Cleveland and S. J Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988. [4] A Craft and G Wiggins. How many beans make five? the consensus problem in music-genre classification and a new evaluation method for single-genre categorisation systems. In Proceedings of the 8th International Conference on Music Information Retrieval, Vienna, Austria, 2007. [5] S. J Cunningham, N Reeves, and M Britland. An ethnographic study of music information seeking: implications for the design of a music digital library. In Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries, Houston, Texas, 2003. [6] J. S Downie. Music Information Retrieval. Annual Review of Information Science and Technology, 37(1):295–340, January 2003. [7] J. S Downie. The Music Information Retrieval Evaluation eXchange (MIREX). D-Lib Magazine, 12(12):795–825, 2006. 
[8] J Futrelle and J. S Downie. Interdisciplinary Research Issues in Music Information Retrieval: ISMIR 2000 2002. Journal of New Music Research, 32(2):121–131, 2003. [9] J. D Gould and C Lewis. Designing for usability: key principles and what designers think. Communications of the ACM, 28(3):300–311, 1985. [10] S. G Hart. NASA-task load index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society 50th Annual Meeting, San Francisco, California, 2006. [11] R Killick, P Fearnhead, and I. A Eckley. Optimal Detection of Changepoints With a Linear Computational Cost. Journal of the American Statistical Association, 107(500):1590–1598, 2012. [12] J. H Lee and S. J Cunningham. The Impact (or Nonimpact) of User Studies in Music Information Retrieval. In Proceedings of the 13th International Conference for Music Information Retrieval, Porto, Portugal, 2012. [13] B McFee and G Lanckriet. Hypergraph models of playlist dialects. In Proceedings of the 13th International Conference for Music Information Retrieval, Porto, Portugal, 2012. [14] M Schedl and A Flexer. Putting the User in the Center of Music Information Retrieval. In Proceedings of the 13th International Conference on Music Information Retrieval, Porto, Portugal, 2012. [15] M Schedl, A Flexer, and J Urbano. The neglected user in music information retrieval research. Journal of Intelligent Information Systems, 41(3):523–539, 2013. [16] B. L Sturm. An Analysis of the GTZAN Music Genre Dataset. In Proceedings of the 2nd International ACM Workshop on Music Information Retrieval with Usercentered and Multimodal Strategies, MIRUM ’12, New York, USA, 2012. [17] B. L Sturm. Classification accuracy is not enough. Journal of Intelligent Information Systems, 41(3):371– 406, 2013. [18] B. L Sturm. A simple method to determine if a music information retrieval system is a horse. IEEE Transactions on Multimedia, 2014. [19] N. X Vinh, J Epps, and J Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854, 2010. [20] D. M Weigl and C Guastavino. User studies in the music information retrieval literature. In Proceedings of the 12th International Conference for Music Information Retrieval, Miami, Florida, 2011. 566 15th International Society for Music Information Retrieval Conference (ISMIR 2014) EVALUATION FRAMEWORK FOR AUTOMATIC SINGING TRANSCRIPTION Emilio Molina, Ana M. Barbancho, Lorenzo J. Tardón, Isabel Barbancho Universidad de Málaga, ATIC Research Group, Andalucı́a Tech, ETSI Telecomunicación, Campus de Teatinos s/n, 29071 Málaga, SPAIN [email protected], [email protected], [email protected], [email protected] ABSTRACT In this paper, we analyse the evaluation strategies used in previous works on automatic singing transcription, and we present a novel, comprehensive and freely available evaluation framework for automatic singing transcription. This framework consists of a cross-annotated dataset and a set of extended evaluation measures, which are integrated in a Matlab toolbox. The presented evaluation measures are based on standard MIREX note-tracking measures, but they provide extra information about the type of errors made by the singing transcriber. Finally, a practical case of use is presented, in which the evaluation framework has been used to perform a comparison in detail of several state-of-the-art singing transcribers. 1. 
INTRODUCTION Singing transcription refers to the automatic conversion of a recorded singing signal into a symbolic representation (e.g. a MIDI file) by applying signal-processing methods [1]. One of its renowned applications is query-byhumming [5], but other types of applications also are related to this task, like singing tutors [2], computer games (e.g. Singstar 1 ), etc. In general, singing transcription is considered a specific case of melody transcription (also called note tracking), which is more general problem. However, singing transcription not only relates to melody transcription but also to speech recognition, and still nowadays it is a challenging problem even in the case of monophonic signals without accompaniment [3]. In the literature, various approaches for singing transcription can be found. A simple but commonly referenced approach was proposed by McNab in 1996 [4], and it relied on several handcrafted pitch-based and energy-based segmentation methods. Later, in 2001 Haus et al. used a similar approach with some rules to deal with intonation issues [5], and in 2002, Clarisse et al. [6] contributed with an auditory model, leading to later improved systems 1 http://www.singstar.com c Emilio Molina, Ana M. Barbancho, Lorenzo J. Tardón, Isabel Barbancho. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Emilio Molina, Ana M. Barbancho, Lorenzo J. Tardón, Isabel Barbancho. “Evaluation framework for automatic singing transcription”, 15th International Society for Music Information Retrieval Conference, 2014. such as [7] (later included in MAMI project 2 and today in SampleSumo products 3 ). Additionally, other more recent approaches use hidden Markov models (HMM) to detect note-events in singing voice [8, 9, 11]. One of the most representative HMM-based singing transcribers was published by Ryynänen in 2004 [9]. More recently, in 2013, another probabilistic approach for singing transcription has been proposed in [3], also leading to relevant results. Regarding the evaluation methodologies used in these works (see Sections 2.1 and 3.1 for a review), there is not a standard methodology. In this paper, we present a comprehensive evaluation framework for singing transcription. This framework consists of a cross-annotated dataset (Section 2) and a novel, compact set of evaluation measures (Section 3), which report information about the type of errors made by the singing transcriber. These measures have been integrated in a freely available Matlab toolbox (see Section 3.3). Then, we present a practical case in which the evaluation framework has been used to perform a comparison in detail of several state-of-the-art singing transcribers (Section 4). Finally, some relevant conclusions are presented in Section 5 2. DATASETS In this section, we review the evaluation datasets used in prior works on singing transcription , and we describe the proposed evaluation dataset and our strategy for groundtruth annotation. 2.1 Datasets used in prior works In Table 1, we present the datasets used in some relevant works on singing transcription. Note that none of the datasets fully represents the possible contexts in which singing transcription might be applied, since they are either too small (e.g. [5,6]), either very specific in style (e.g. [11] for opera and [3] for flamenco), or either they use an annotation strategy that may be subjective (e.g. [5, 6]), or only valid for very good performances in rhythm and intonation (e.g. [8, 9]). 
In addition, only the flamenco dataset used in [3] is freely available. 2.2 Proposed dataset In this section we describe the music collection, as well as the annotation strategy used to build the ground-truth. 2 3 567 http://www.ipem.ugent.be/MAMI http://www.samplesumo.com 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Author Year Dataset size Audio quality Music style McNab [4] Haus & Pollastri [5] Clarisse et al. [6] Viitaniemi et al. [8] Ryynänen et al. [9] Mulder et al. [7] 1996 2001 20 short melodies 22 short melodies 66 melodies (120 minutes) Low & moderated noise Low & moderated noise High quality (studio conditions) Popular and scales Popular 2004 52 melo. (1354 notes) Good & moderated noise Popular songs Kumar et al. [10] Krige et al. [11] 2007 47 songs (2513 notes) 13842 notes Good Gómez & Bonada [3] 2013 Indian music Opera lessons & scales Flamenco songs 2002 2003 2004 2008 72 excerpts (2803 notes) High quality but strong reverberation Good & slightly noisy Folk songs & scales Singing style NONE Syllables: ’na-na’... Singing with & without lyrics Singing, humming & whistling Syllables, singing & whistling Syllables: /la/ /da/ /na/ Syllables Lyrics & ornaments Ground-truth (GT) annotation strategy Tunning devs. annotated in GT Freely available Annotated by one musician Annotation by one musician Original score used as ground-truth No No No No No No Team of musicologists Manual annot. of vowel onsets [REf] Time alignment using Viterbi Musicians team (cross-annotation) No No No No No No Yes Yes Table 1. Review of the evaluation datasets used in prior works on singing transcription. Some details about the dataset are not provided in some cases, so certain fields can not be expressed in the same units (e.g. dataset size). 2.2.1 Music collection musicians were given a set of instructions about the specific criteria to annotate the singing melody: The proposed dataset consists of 38 melodies sung by adult and child untrained singers, recorded in mono with a sample rate of 44100Hz and a resolution of 16 bits. Generally, the recordings are not clean and some background noise is present. The duration of the excerpts ranges from 15 to 86 seconds and the total duration of the whole dataset is 1154 seconds. This music collection can be broken down into three categories, according to the type of singer: • Ornaments such as pitch bending at the beginning of the notes or vibratos are not considered independent notes. This criterion is based on Vocaloid’s 6 approach, where ornaments are not modelled with extra notes. • Portamento between two notes does not produce an extra third note (again, this is the criteria used in Vocaloid). • The onsets are placed at the beginning of voiced segments and in each clear change of pitch or phoneme. In the case of ’l’, ’m’, ’n’ voiced consonants + vowel (e.g. ’la’), the onset is not placed at the beginning of the consonant but at the beginning of the vowel. • Children (our own recordings 4 ): 14 melodies of traditional children songs (557 seconds) sung by 8 different children (5-11 years old). • Adult male: 13 pop melodies (315 seconds) sung by 8 different adult male untrained singers. These recordings were randomly chosen from the public dataset MTG-QBH 5 [12]. • The pitch of each note is annotated with cents resolution as perceived by the team of experts. Note that we annotate the tuning deviation for each independent note. 3. 
EVALUATION MEASURES • Adult female: 11 pop melodies (281 seconds) sung by 5 different adult female untrained singers, also taken from MTG-QBH dataset. Note that in this collection the pitch and the loudness can be unstable, and well performed vibratos are not frequent. In this section, we describe the evaluation measures used in prior works on automatic singing transcription, and we present the proposed ones. 3.1 Evaluation measures used in prior works 2.2.2 Ground-truth: annotation strategy The described music collection has been manually annotated to build the ground truth 4 . First, we have transcribed the audio recordings with a baseline algorithm (Section 4.2), and then all the transcription errors have been corrected by an expert musician with more than 10 years of music training. Then, a second expert musician (with 7 years of music training) checked all the annotations until both musicians agreed in their correctness. The transcription errors were corrected by listening, at the same time, to the synthesized transcription and the original audio. The 4 5 In Table 2, we review the evaluation measures used in some relevant works on singing transcription. In some cases, only the note and/or frame error is provided as a compact, representative measure [5, 9], whereas other approaches provide extra information about the type of errors made by the system using dynamic time warping (DTW) [6] or Viterbi-based alignment [11]. In our case, we have taken the most relevant aspects of these approaches and we added some novel ideas in order to define a novel, compact and comprehensive set of evaluations. Available at http://www.atic.uma.es/ismir2014singing http://mtg.upf.edu/download/datasets/mtg-qbh 6 568 http://www.vocaloid.com 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Author McNab Haus & Pollastri [5] Year 1996 Evaluation measures NONE Rate of note pitch errors (segmentation errors are not considered) DTW-based measurement of various note errors, e.g. insertions deletions and substitutions. Frame-based errors. Do not report information about type of errors made. Note-based and frame-based errors. Do not report information about type of errors made. DTW-based measurement of various note errors, e.g. insertions deletions and substitutions. Onset detection errors (pitch and durations are ignored). Viterbi-based measurement of deletions, insertions and substitutions (typical evaluation in speech recognition). MIREX measures for audio melody extraction and note-tracking. Do not report information about type of errors made. are N GT and N TR , respectively. Regarding the expressions used in the for correct notes, we have used Precision, Recall and F-measure, which are defined as follow: Table 2. Evaluation measures used in prior works on singing transcription. Finally, in the case of segmentation errors (Section 3.2.5), we also compute the mean number of notes tagged as X in the transcription for each note tagged as X in the groundtruth. This magnitude has been expressed as a ratio: 2001 Clarisse et al. [6] 2002 Viitaniemi et al. [8] 2003 Ryynänen et al. [9] 2004 Mulder et al. [7] 2004 Kumar et al. [10] 2007 Krige et al. [11] 2008 Gómez & Bonada [3] 2013 In this section, we firstly present the notation and some needed definitions that are used in the rest of sections, and then we describe the evaluation measures used to quantify the proportion of correctly transcribed notes. 
Finally, we present a set of novel evaluation measures that independently report the importance of each type of error. In Figure 1 we show an example of the types of errors considered.

Figure 1. Examples of the different proposed measures (piano-roll comparison of a transcription against the ground truth, MIDI notes 59-62, illustrating the COnPOff, COnP, COn, OBOn, OBP, S, M, ND and PU cases together with the ±50 ms onset, ±50 cent pitch and ±20% duration tolerances).

3.2 Proposed measures

3.2.1 Notation
The i-th note of the ground truth is noted as n_i^GT, and the j-th note of the transcription is noted as n_j^TR; N^GT and N^TR denote the total number of notes in the ground truth and the transcription. For the correct-note categories, Precision, Recall and F-measure are computed as:

CXPrecision = N_CX^TR / N^TR , (1)
CXRecall = N_CX^GT / N^GT , (2)
CXF-measure = 2 · CXPrecision · CXRecall / (CXPrecision + CXRecall) , (3)

where CX makes reference to the specific category of correct note: Correct Onset & Pitch & Offset (X = COnPOff), Correct Onset & Pitch (X = COnP) or Correct Onset (X = COn). Finally, N_CX^GT and N_CX^TR are the total number of matching CX conditions in the ground truth and the transcription, respectively. Regarding the measures used for errors, we have computed the Error Rate with respect to N^GT or with respect to N^TR, as follows:

XRate^GT = N_X^GT / N^GT , (4)
XRate^TR = N_X^TR / N^TR , (5)

and, for segmentation errors (Section 3.2.5), the ratio:

XRatio = N_X^TR / N_X^GT . (6)

3.2.2 Definition of correct onset/pitch/offset
The definitions of correctly transcribed notes (given in Section 3.2.3) consist of combinations of three independent conditions: correct onset, correct pitch and correct offset. We have defined these conditions according to MIREX (Multiple F0 estimation and tracking and Audio Onset Detection tasks), as follows:
• Correct Onset: if the onset of a transcribed note n_j^TR is within a ±50 ms range of the onset of a ground-truth note n_i^GT, i.e.
onset(n_j^TR) ∈ [onset(n_i^GT) − 50 ms, onset(n_i^GT) + 50 ms] , (7)
then we consider that n_i^GT has a correct onset with respect to n_j^TR.
• Correct Pitch: if the pitch of a transcribed note n_j^TR is within a ±0.5 semitone range of the pitch of a ground-truth note n_i^GT, i.e.
pitch(n_j^TR) ∈ [pitch(n_i^GT) − 0.5 st, pitch(n_i^GT) + 0.5 st] , (8)
then we consider that n_i^GT has a correct pitch with respect to n_j^TR.
• Correct Offset: if the offsets of the ground-truth note n_i^GT and the transcribed note n_j^TR are within a range of ±20% of the duration of n_i^GT or ±50 ms, whichever is larger, i.e.
offset(n_j^TR) ∈ [offset(n_i^GT) − OffRan, offset(n_i^GT) + OffRan] , (9)
where OffRan = max(50 ms, 0.2 · duration(n_i^GT)), then we consider that n_i^GT has a correct offset with respect to n_j^TR.

3.2.3 Correctly transcribed notes
with correct onset and offset, but wrong pitch. In order to detect them, firstly we find all ground-truth notes with correct onset and offset, taking into account that one ground-truth note can only be associated with one transcribed note. Then, we remove all notes previously tagged as COnPOff (Section 3.2.3). The reported measure is the rate of OBP notes in the ground-truth:
The definition of "correct note" should be useful to measure the suitability of a given singing transcriber for a specific application. However, different applications may require a different definition of correct note.
Therefore, we have chosen three different definitions of correct note as defined in MIREX: • Correct onset, pitch and offset (COnPOff): This is a standard correctness criteria, since it is used in MIREX (Multiple F0 estimation and tracking task), and it is the most restrictive one. The note nGT is assumed to be cori rectly transcribed into the note nTj R if it has correct onset, correct pitch and correct offset (as defined in Section 3.2.2). In addition, one ground truth note nGT can only be i associated with one transcribed note nTj R . In our evaluation framework, we report Precision, Recall and F-measure as defined in Section 3.2.1: OBPRateGT • Only-Bad-Offset (OBOff): A ground-truth note nGT is i labelled as OBOn if it has been transcribed into a note nTj R with correct pitch and onset, but wrong offset. In order to detect them, firstly we find all ground-truth notes with correct pitch and onset, taking into account that one groundtruth note can only be associated with one transcribed note. Then, we remove all notes previously tagged as COnPOff (Section 3.2.3). The reported measure is the rate of OBOff notes in the ground-truth: COnPOffPrecision , COnPOffRecall and COnPOffF-measure . OBOffRateGT • Correct Onset, Pitch (COnP): This criteria is also used in MIREX, but it is less restrictive since it just considers onset and pitch, and ignores the offset value. Therefore, in COnP criteria, a note nGT is assumed to be correctly i transcribed into the note nTj R if it has correct onset and correct pitch. In addition, one ground truth note nGT can i only be associated with one transcribed note nTj R . In our evaluation framework, we report Precision, Recall and Fmeasure: COnPPrecision , COnPRecall and COnPF-measure . 3.2.5 Incorrect notes with segmentation errors Segmentation errors refer to the case in which sung notes are incorrectly split or merged during the transcription. Depending on the final application, certain types of segmentation errors may not be important (e.g. frame-based systems for query-by-humming are not affected by splits), but they can lead to problems in many other situations. Therefore, we have defined two evaluation measures which are informative about the segmentation errors made by the singing transcriber. • Split (S): A split note is a ground truth note nGT that i is incorrectly segmented into different consecutive notes nTj1R , nTj2R · · · nTjnR . Two requirements are needed in a split: (1) the set of transcribed notes nTj1R , nTj2R , . . . nTjnR must overlap at least the 40% of nGT in time (pitch is igi nored), and (2) nGT must overlap at least the 40% of every i note nTj1R , nTj2R , . . . nTjnR in time (again, pitch is ignored). These requirements are needed to ensure a consistent relationship between ground truth and transcribed notes. The specific reported measures are: • Correct Onset (COn): Additionally, we have included the evaluation criteria used in MIREX Audio Onset Detection task. In this case, a note nGT is assumed to be correctly i transcribed into the note nTj R if it has correct onset. In addition, one ground truth note nGT can only be associated i with one transcribed note nTj R . In our evaluation framework, we report Precision, Recall and F-measure: COnPOffPrecision , COnPOffRecall and COnPOffF-measure . 3.2.4 Incorrect notes with one single error In addition, we have included some novel evaluation measures to identify the notes that are close to be correctly transcribed, but they fail in one single aspect. 
These measures are useful to identify specific weaknesses of a given singing transcriber. The proposed categories are: • Only-Bad-Onset (OBOn): A ground-truth note nGT is i labelled as OBOn if it has been transcribed into a note nTj R with correct pitch and offset, but wrong onset. In order to detect them, firstly we find all ground-truth notes with correct pitch and offset, taking into account that one groundtruth note can only be associated with one transcribed note. Then, we remove all notes previously tagged as COnPOff (Section 3.2.3). The reported measure is the rate of OBOn notes in the ground-truth: OBOnRateGT SRateGT and SRatio Note that in this case SRatio > 1. • Merged (M): A set of consecutive ground-truth notes GT GT nGT i1 , ni2 , · · · nin are considered to be merged if they all are transcribed into the same note nTj R . This is the complementary case of split. Again, two requirements must be true to consider a group of merged notes: (1) the set of GT GT ground truth notes nGT i1 ,ni2 , . . . nin must overlap the TR 40% of nj in time (pitch is ignored), and (2) nTj R must GT GT overlap the 40% of every note nGT i1 ,ni2 , . . . nin in time (again, pitch is ignored). The specific reported measures are: MRateGT and MRatio • Only-Bad-Pitch (OBP): A ground-truth note nGT is lai belled as OBP if it has been transcribed into a note nTj R Note that in this case MRatio < 1. 570 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 3.2.6 Incorrect notes with voicing errors the experiment, we have used a binary provided by the authors of the algorithm. Method (b): Ryynänen (2008) [13]. We have used the method for automatic transcription of melody, bass line and chords in polyphonic music published by Ryynänen in 2008 [13], although we only focus on melody transcription. It is the last evolution of the original HMM-based monophonic singing transcriber [9]. For the experiment, we have used a binary provided by the authors of the algorithm. Method (c): Melotranscript 4 (based on Mulder 2004 [7]). It is the commercial version derived from the research carried out by Mulder et al. [7]. It is based on an auditory model. For the experiment, we have used the demo version available in SampleSumo website 3 . Voicing errors happen when an unvoiced sound produces a false transcribed note (spurious note), or when a sung note is not transcribed at all (non-detected note). This situation is commonly associated to a bad performance of the voicing stage within the singing transcriber. We have defined two categories: • Spurious notes (PU): A spurious note is a transcribed note nTj R that does not overlap at all (neither in time nor in pitch) any note in the ground truth. The associated reported measure is: PURateTR • Non-detected notes (ND): A ground-truth note nGT is i non-detected if it does not overlap at all (neither in time nor in pitch) any transcribed note. The associated reported measure is: NDRateGT 4.2 Baseline algorithm 3.3 Proposed Matlab toolbox The presented evaluation measures have been implemented in a freely available Matlab toolbox 4 , which consists of a set of functions and structures, as well as a graphical user interface to visually analyse the performance of the evaluated singing transcriber. The main function of our toolbox is evaluation.m, which receives the ground-truth and the transcription of an audio clip as inputs, and it outputs the results of all the evaluation measures. 
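As an informal illustration of what such an evaluation computes, the following Python sketch (it is not the Matlab toolbox) applies the tolerance checks of Eqns (7)-(9) and the Precision/Recall/F-measure of Eqns (1)-(3); notes are assumed to be (onset, offset, pitch) tuples in seconds and MIDI, and a greedy one-to-one matching stands in for whatever association strategy the toolbox uses:

```python
def correct(gt, tr, check_pitch=True, check_offset=True):
    """MIREX-style tolerance checks of Eqns (7)-(9) for one (onset, offset, pitch) pair."""
    on_ok = abs(tr[0] - gt[0]) <= 0.05                        # +/- 50 ms onset
    p_ok = (not check_pitch) or abs(tr[2] - gt[2]) <= 0.5     # +/- 0.5 semitones
    off_range = max(0.05, 0.2 * (gt[1] - gt[0]))              # +/- 20% duration or 50 ms
    off_ok = (not check_offset) or abs(tr[1] - gt[1]) <= off_range
    return on_ok and p_ok and off_ok

def prf(ground_truth, transcription, **kwargs):
    """Greedy one-to-one matching, then Precision, Recall and F-measure, Eqns (1)-(3)."""
    matched, hits = set(), 0
    for gt in ground_truth:
        for j, tr in enumerate(transcription):
            if j not in matched and correct(gt, tr, **kwargs):
                matched.add(j)
                hits += 1
                break
    precision = hits / len(transcription) if transcription else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Invented melody: the second note is sung a bit flat, the third is held too long,
# and one spurious note is transcribed at the end.
gt = [(0.00, 0.50, 60), (0.50, 1.00, 62), (1.00, 1.60, 64)]
tr = [(0.02, 0.52, 60), (0.51, 1.02, 61.2), (1.00, 1.80, 64), (2.00, 2.30, 70)]
print("COnPOff", prf(gt, tr))
print("COnP   ", prf(gt, tr, check_offset=False))
print("COn    ", prf(gt, tr, check_pitch=False, check_offset=False))
```

The three calls loosen the criteria one condition at a time, so the scores increase from COnPOff to COn as described in Section 3.2.3.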
In addition, we have included a function called listnotes.m, which receives as inputs the ground-truth, the transcription and the category X to be listed, and it outputs a list (in a two-columns format: onset time-offset time) of all the notes in the ground-truth tagged as X category. This information is useful to isolate the problematic audio excerpts for further analysis. Finally, we have implemented a graphical user interface, where the ground-truth and the transcription of a given audio clip can be compared using a piano-roll representation. This interface also allows the user to highlight notes tagged as X (e.g. COnPOff, S, etc.). 4. PRACTICAL USE OF THE PROPOSED TOOLBOX In this section, we describe a practical case of use in which the presented evaluation framework has been used to perform an improved comparative study of several state-ofthe-art singing transcribers (presented in Section 4.1). In addition, a simple, easily reproducible baseline approach has been included in this comparative study. Finally, we show and discuss the obtained results. 4.1 Compared algorithms We have compared three state-of-the-art algorithms for singing transcription: Method (a): Gómez & Bonada (2013) [3]. It consists of three main steps: tuning-frequency estimation, transcription into short notes, and an iterative process involving note consolidation and refinement of the tuning frequency. For 571 According to [8], the simplest possible segmentation consists of simply rounding a rough pitch estimate to the closest MIDI note ni and taking all pitch changes as note boundaries. The proposed baseline method is based on such idea, and it uses Yin [14] to extract the F0 and aperiodicity at frame-level. A frame is classified as unvoiced if its aperiodicity is under < 0.4. Finally, all notes shorter than 100ms are discarded. 4.3 Results & discussion In Figure 2 we show the results of our comparative analysis. Regarding the F-measure of correct notes (COnPOff, COnP and COn), methods (a) and (c) attains similar values, whereas method (b) performs slightly worse. In addition, it seems that method (a) is slightly superior to method (c) for onset detection, but method (c) is superior when pitch and offset values must be also estimated. In all cases, the baseline is clearly worse than the rest of methods. In addition, we observed that the rate of notes with incorrect onset (OBOn) is equally high (20%) in all methods. After analysing the specific recordings, we concluded that onset detection within a range of ±50ms is very restrictive in the case of singing voice with lyrics, since many onsets are not clear even for an expert musician (as proved during the ground-truth building). Moreover, we also observed that all methods, and especially method (a), have problems with pitch bendings at the beginning of the notes, since they tend to split them. Regarding the segmentation and voicing errors, we realised that method (a) tends to split notes, whereas method (b) tends to merge notes. This information, easily provided by our evaluation framework, may be useful to improve specific weaknesses of the algorithms during the development stage. Finally, we also realised that method (b) is worse than method (a) and (c) in terms of voicing. To sum up, method (c) seems to be the best one in most measures, mainly due to a better performance in segmentation and voicing. However, method (a) is very appropriate for onset detection. 
Finally, although method (b) works clearly better than the baseline, it has a poor performance due to errors in segmentation (mainly merged notes) and voicing (mainly spurious notes).

[Figure 2. Comparison in detail of several state-of-the-art singing transcription systems using the presented evaluation framework: Precision, Recall and F-measure for COnPOff, COnP and COn, ground-truth rates for OBOn, OBP, OBOff, Split and Merged, and rates for Spurious and Non-detected notes, for the baseline and methods (a) Gómez & Bonada, (b) Ryynänen and (c) Melotranscript; the x-axis shows the measure value (0 to 0.8).]

5. CONCLUSIONS
In this paper, we have presented an evaluation framework for singing transcription. It consists of a cross-annotated dataset of 1154 seconds and a novel set of evaluation measures, able to report the type of errors made by the system. Both the dataset and a Matlab toolbox including the presented evaluation measures are freely available 4. In order to show the utility of the work presented in this paper, we have performed a detailed comparative study of three state-of-the-art singing transcribers plus a baseline method, leading to relevant information about the performance of each method. In the future, we plan to expand our evaluation dataset in order to make it comparable to other datasets 7 used in MIREX (e.g. MIR-1K or MIR-QBSH).

6. ACKNOWLEDGEMENTS
This work has been funded by the Ministerio de Economía y Competitividad of the Spanish Government under Project No. TIN2013-47276-C6-2-R and by the Junta de Andalucía under Project No. P11-TIC-7154. The work has been done at Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech.

7. REFERENCES
[1] M. Ryynänen, "Singing transcription," in Signal Processing Methods for Music Transcription (A. Klapuri and M. Davy, eds.), pp. 361–390, Springer Science + Business Media LLC, 2006.
Conference on Acoustics, Speech and Signal Processing ICASSP, pp. 744–748, 2013.
[3] E. Gómez and J. Bonada, "Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing," Computer Music Journal, vol. 37, no. 2, pp. 73–90, 2013.
[4] R. J. McNab, L. A. Smith, and I. H. Witten, "Signal Processing for Melody Transcription," Proceedings of the 19th Australasian Computer Science Conference, vol. 18, no. 4, pp. 301–307, 1996.
[5] G. Haus and E. Pollastri, "An audio front end for query-by-humming systems," in Proceedings of the 2nd International Symposium on Music Information Retrieval ISMIR, pp. 65–72, 2001.
[6] L. P. Clarisse, J. P. Martens, M. Lesaffre, B. D. Baets, H. D. Meyer, and M. Leman, "An Auditory Model Based Transcriber of Singing Sequences," in Proceedings of the 3rd International Conference on Music Information Retrieval ISMIR, pp. 116–123, 2002.
[7] T. De Mulder, J. P. Martens, M. Lesaffre, M. Leman, B. De Baets, and H. De Meyer, "Recent improvements of an auditory model based front-end for the transcription of vocal queries," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2004), Montreal, Quebec, Canada, May 17–21, Vol. IV, pp. 257–260, 2004.
[2] E. Molina, I. Barbancho, E. Gómez, A. M. Barbancho, and L. J. Tardón, "Fundamental frequency alignment vs.
note-based melodic similarity for singing voice assessment,” in Proceedings of the 2013 IEEE International 7 [8] T. Viitaniemi, A. Klapuri, and A. Eronen, “A probabilistic model for the transcription of single-voice melodies,” in Proceedings of the 2003 Finnish Signal Processing Symposium FINSIG03, pp. 59–63, 2003. [9] M. Ryynänen and A. Klapuri, “Modelling of note events for singing transcription,” in Proceedings of ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing SAPA, (Jeju, Korea), Oct. 2004. [10] P. Kumar, M. Joshi, S. Hariharan, and P. Rao, “Sung Note Segmentation for a Query-by-Humming System”. In Intl Joint Conferences on Artificial Intelligence (IJCAI), 2007. [11] W. Krige, T. Herbst, and T. Niesler, “Explicit transition modelling for automatic singing transcription,” Journal of New Music Research, vol. 37, no. 4, pp. 311–324, 2008. [12] J. Salamon, J. Serrá and E. Gómez, “Tonal Representations for Music Retrieval: From Version Identification to Query-by-Humming”, International Journal of Multimedia Information Retrieval, special issue on Hybrid Music Information Retrieval, vol. 2, no. 1, pp. 45–58, 2013. [13] M. P. Ryynänen and A. P. Klapuri, “Automatic Transcription of Melody, Bass Line, and Chords in Polyphonic Music,” in Computer Music Journal, vol.32, no. 3, 2008. [14] A. De Cheveigné and H. Kawahara: “YIN, a fundamental frequency estimator for speech and music,” Journal of the Acoustic Society of America, Vol. 111, No. 4, pp. 1917-1930, 2002. http://mirlab.org/dataSet/public/ 572 15th International Society for Music Information Retrieval Conference (ISMIR 2014) WHAT IS THE EFFECT OF AUDIO QUALITY ON THE ROBUSTNESS OF MFCCs AND CHROMA FEATURES? Julián Urbano, Dmitry Bogdanov, Perfecto Herrera, Emilia Gómez and Xavier Serra Music Technology Group, Universitat Pompeu Fabra Barcelona, Spain {julian.urbano,dmitry.bogdanov,perfecto.herrera,emilia.gomez,xavier.serra}@upf.edu ABSTRACT Many MIR tasks such as classification, similarity, autotagging, recommendation, cover identification and audio fingerprinting, audio-to-score alignment, audio segmentation, key and chord estimation, and instrument detection are at least partially based on them. As they pervade the literature on MIR, we analyzed the effect of audio encoding and signal analysis parameters on the robustness of MFCCs and chroma. To this end, we run two different audio analysis tools over a diverse collection of 400 music tracks. We then compute several indicators that quantify the robustness and stability of the resulting features and estimate the practical implications for a general task like genre classification. Music Information Retrieval is largely based on descriptors computed from audio signals, and in many practical applications they are to be computed on music corpora containing audio files encoded in a variety of lossy formats. Such encodings distort the original signal and therefore may affect the computation of descriptors. This raises the question of the robustness of these descriptors across various audio encodings. We examine this assumption for the case of MFCCs and chroma features. In particular, we analyze their robustness to sampling rate, codec, bitrate, frame size and music genre. Using two different audio analysis tools over a diverse collection of music tracks, we compute several statistics to quantify the robustness of the resulting descriptors, and then estimate the practical effects for a sample task like genre classification. 2. 
DESCRIPTORS

2.1 Mel-Frequency Cepstrum Coefficients
MFCCs are inherited from the speech domain [18], and they have been extensively used to summarize the spectral content of music signals within an analysis frame. MFCCs are widely used in tasks like music similarity [1,12], music classification [6] (in particular, genre), autotagging [13], preference learning for music recommendation [19, 24], cover identification and audio segmentation [17]. There is no standard algorithm to compute MFCCs, and a number of variants have been proposed [8] and adapted for MIR applications. MFCCs are commonly computed as follows. The first step consists in windowing the input signal and computing its magnitude spectrum with the Fourier transform. We then apply a filterbank with critical (mel) band spacing of the filters and bandwidths. Energy values are obtained for the output of each filter, followed by a logarithm transformation. We finally compute a discrete cosine transform to the set of log-energy values to obtain the final set of coefficients. The number of mel bands and the frequency interval on which they are computed may vary among implementations. The low order coefficients account for the slowly changing spectral envelope, while the higher order coefficients describe the fast variations of the spectrum shape, including pitch information. The first coefficient is typically discarded in MIR applications because it does not provide information about the spectral shape; it reflects the overall energy in mel bands.

1. INTRODUCTION
A significant amount of research in Music Information Retrieval (MIR) is based on descriptors computed from audio signals. In many cases, research corpora contain music files encoded in a lossless format. In some situations, datasets are distributed without their original music corpus, so researchers have to gather audio files themselves. In many other cases, audio descriptors are distributed instead of the audio files. In the end, MIR research is thus based on corpora that very well may use different audio encodings, all under the assumption that audio descriptors are robust to these variations and the final MIR algorithms are not affected. This possible lack of robustness poses serious questions regarding the reproducibility of MIR research and its applicability. For instance, whether algorithms trained with lossless audio files can generalize to lossy encodings; or whether a minimum audio bitrate should be required in datasets that distribute descriptors instead of audio files. In this paper we examine the assumption of robustness of music descriptors across different audio encodings on the example of Mel-frequency cepstral coefficients (MFCCs) and chroma features. They are among the most popular music descriptors used in MIR research, as they respectively capture timbre and tonal information.

2.2 Chroma
c J. Urbano, D. Bogdanov, P. Herrera, E. Gómez and X. Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Urbano, D. Bogdanov, P. Herrera, E. Gómez and X. Serra. "What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Features?", 15th International Society for Music Information Retrieval Conference, 2014.
Chroma features represent the spectral energy distribution within an analysis frame, summarized into 12 semitones across octaves in equal-tempered scale.
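Returning briefly to the MFCC pipeline of Section 2.1 (window, magnitude spectrum, mel filterbank, log, DCT), the following numpy sketch computes one analysis frame; the 40 mel bands echo the FB-40 style implementations mentioned in this paper, but the band edges, window and lack of normalisation are illustrative choices rather than the parameterisation of either tool studied here:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sr, f_min=0.0, f_max=None):
    """Triangular filters with centres equally spaced on the mel scale."""
    f_max = f_max or sr / 2.0
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(1, n_bands + 1):
        left, centre, right = bins[b - 1], bins[b], bins[b + 1]
        for k in range(left, centre):
            fb[b - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[b - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc_frame(frame, sr, n_bands=40, n_coeffs=13):
    """MFCCs of a single frame: window -> |FFT| -> mel energies -> log -> DCT-II."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    energies = mel_filterbank(n_bands, len(frame), sr) @ spectrum
    log_energies = np.log(energies + 1e-10)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), np.arange(n_bands) + 0.5) / n_bands)
    return dct @ log_energies

# A 46 ms frame (2048 samples) of a 440 Hz tone at 44100 Hz, as a toy input.
sr, size = 44100, 2048
t = np.arange(size) / sr
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs[1:4].round(2))  # the first coefficient is usually discarded
```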
Chroma captures the pitch class distribution of an input signal, typically used 573 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Since our goal here is not to compare tools, we refer to them simply as Lib1 and Lib2 throughout the paper. Lib1 and Lib2 provide by default two different implementations of MFCCs, both of which compute cepstral coefficients on 40 mel bands, resembling the MFCC FB-40 implementation [8, 22] but on different frequency intervals. Lib1 covers a wider frequency range of 0-11000 Hz with mel bin centers being equally spaced on the mel scale in this range, while Lib2 covers a frequency range of 666364 Hz. We compute the first 13 MFCCs in both systems and discard the first coefficient. In the case of chroma, Lib1 analyzes a frequency range of 40-5000 Hz based on Fourier transform and estimates tuning frequency. Lib2 uses a Constant Q Transform and analyzes the frequency range 65-2093 Hz assuming tuning frequency of 440 Hz, but it does not account for harmonics of the detected peaks. We compute 12-dimensional chroma features. Genre. Robustness may depend as well on the music genre of songs. For instance, as the most dramatic change that perceptual coders introduce is that of filtering out highfrequency spectral content, genres that make use of very high-frequency sounds (e.g. cymbals and electronic tones) should show a more detrimental effect than genres not including them (e.g. country, blues and classical). for key and chord estimation [7, 9], music similarity and cover identification [20], classification [6], segmentation and summarization [5, 17], and synchronization [16]. Several approaches exist for chroma feature extraction, including the following steps. The signal is first analyzed with a high frequency resolution in order to obtain its frequency domain representation. The main frequency components (e.g. spectral peaks) are mapped onto pitch classes according to an estimated tuning frequency. For most approaches, a frequency value partially contributes to a set of “sub-harmonic” fundamental frequency (pitch) candidates. The chroma vector is computed with a given interval resolution (number of bins per octave) and is finally post-processed to obtain the final chroma representation. Timbre invariance is achieved by different transformations such as spectral whitening [9] or cepstrum liftering [15]. 3. EXPERIMENTAL DESIGN 3.1 Factors Affecting Robustness We identified several factors that could have an effect on the robustness of audio descriptors, from the perspective of their audio encoding (codec, bitrate and sampling rate), analysis parameters (frame/hop size and audio analysis tool) and the musical characteristics of the songs (genre). SRate. The sampling rate at which an audio signal is encoded may affect robustness when using very high frequency rates. We study standard 44100 and 22050 Hz. Codec. Perceptual audio coders may also affect descriptors because they introduce perturbations to the original audio signal, in particular by reducing high-frequency content, blurring the attacks, and smoothing the spectral envelope. In our experiments, we chose one lossless and two lossy audio codecs: WAV, MP3 CBR and MP3 VBR. BRate. Different audio codecs allow different bitrates depending on the sampling rate, so we can not combine all codecs with all bitrates. The following combinations are permitted and used in our study: • WAV: 1411 Kbps. • MP3 CBR at 22050 Hz: 64, 96, 128 and 160 Kbps. 
• MP3 CBR at 44100 Hz: 64, 96, 128, 160, 192, 256 and 320 Kbps. • MP3 VBR: 6 (100-130 Kbps), 4 (140-185 Kbps), 2 (170-210 Kbps) and 0 (220-260 Kbps). FSize. We considered a variety of frame sizes for spectral analysis: 23.2, 46.4, 92.9, 185.8, 371.5 and 743.0 ms. That is, we used frame sizes of 1024, 2048, 4096, 8192, 16384 and 32768 samples for signals with sampling rate of 44100 Hz, and the halved values (512, 1024, 2048, 4096, 8192 and 16384 samples) in the case of 22050 Hz. Audio analysis tool. The specific software used to compute descriptors may have an effect on their robustness due to parameterizations (e.g. frequency ranges) and other implementation details. We use two state-of-the-art and open source tools publicly available online: Essentia 2.0.1 1 [2] and QM Vamp Plugins 1.7 for Sonic Annotator 0.7 2 [3]. 3.2 Data We created an ad-hoc corpus of music for this study, containing 400 different music tracks (30 seconds excerpts) by 395 different artists, uniformly covering 10 music genres (blues, classical, country, disco/funk/soul, electronic, jazz, rap/hip-hop, reggae, rock and rock’n’roll). All 400 tracks are encoded from their original CD at a 44100 Hz sampling rate using the lossless FLAC audio codec. We converted all lossless tracks in our corpus into various audio formats in accordance with the factors identified above, taking into account all possible combinations of sampling rate, codec and bitrate. Audio conversion was done using the FFmpeg 0.8.3 3 converter, which includes the LAME codec for MP3 joint stereo mode (Lavf53.21.1 ). Afterwards, we analyzed the original lossless files and their lossy versions using both Lib1 and Lib2. In the case of Lib1, both MFCCs and chroma features were computed for all different frame sizes with the hop size equal to half the frame size. MFCCs were computed similarly in the case of Lib2, but chroma features only allow a fixed frame size of 16384 samples (we selected a hop size of 2048 samples). In all cases, we summarize the framewise feature vectors with the mean of each coefficient. 3.3 Indicators of Robustness We computed several indicators of the robustness of MFCCs and chroma, each measuring the difference between the descriptors computed with the original lossless audio clips and the descriptors computed with their lossy versions. We blocked by tool, sampling rate and frame size under the assumption that these factors are not mixed in practice within the same application. For two arbitrary 1 http://essentia.upf.edu http://vamp-plugins.org/plugin-doc/ qm-vamp-plugins.html 2 3 574 http://www.ffmpeg.org 15th International Society for Music Information Retrieval Conference (ISMIR 2014) vectors x and y (each containing n = 12 MFCC or chroma values) from a lossless and a lossy version, we compute five indicators to measure how different they are. Relative error δ. It is computed as the average relative difference across coefficients. This indicator can be easily interpreted as the percentage error between coefficients, and it is of especial interest for tasks in which coefficients are used as features to train some model. related to genre (main effects confounded with two-factor interactions) [14]. We ran an ANOVA analysis on these models to estimate variance components, which indicate the contribution of each factor to the total variance, that is, their impact on the robustness of the audio descriptors. Table 1 shows the results for MFCCs. 
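Before turning to the results in Table 1, here is a minimal sketch of the five robustness indicators for a single pair of lossless/lossy feature vectors; δ follows the definition just given, and the remaining four (ε, r, ρ, θ) follow the definitions of Section 3.3 below, including the joint ranking used for Spearman's ρ. Function and variable names are ours.

```python
# Five robustness indicators for one lossless/lossy pair of 12-dim vectors.
import numpy as np
from scipy.stats import pearsonr, rankdata

def indicators(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # Relative error delta: average relative difference across coefficients.
    delta = np.mean(np.abs(x - y) / np.maximum(np.abs(x), np.abs(y)))
    # Euclidean distance epsilon between the two vectors.
    eps = np.linalg.norm(x - y)
    # Pearson's r on the raw coefficients.
    r = pearsonr(x, y)[0]
    # Spearman's rho: Pearson's r after ranking all values of x and y jointly.
    ranks = rankdata(np.concatenate([x, y]))
    rho = pearsonr(ranks[:n], ranks[n:])[0]
    # Cosine similarity theta between the two vectors.
    theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return delta, eps, r, rho, theta
```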
As shown by the mean scores, the descriptors computed by Lib1 and Lib2 are similarly robust (note that ε scores are not directly comparable across tools because they are not normalized; actual MFCCs in Lib1 are orders of magnitude larger than in Lib2). Both correlation coefficients r and ρ, as well as cosine similarity θ, are extremely high, indicating that the shape of the feature vectors is largely preserved. However, the average error across coefficients is as high as δ ≈ 6.1% at 22050 Hz and δ ≈ 6.7% at 44100 Hz. When focusing on the stability of the descriptors, we see that the implementation in Lib2 is generally more stable because the distributions have less variance, except for δ and ρ at 22050 Hz. The decomposition in variance components indicates that the choice of frame size is irrelevant in general (low σ̂²FSize scores), and that the largest part of the variability depends on the particular characteristics of the music pieces (very high σ̂²Track + σ̂²residual scores). For Lib2 in particular, this means that controlling encodings or analysis parameters does not increase robustness significantly when the sampling rate is 22050 Hz; it depends almost exclusively on the specific music pieces. On the other hand, the combination of codec and bitrate has a quite large effect in Lib1. For instance, about 42% of the variability in Euclidean distances is due to the BRate:Codec interaction effect. This means that an appropriate selection of the codec and bitrate of the audio files leads to significantly more robust descriptors. At 44100 Hz both tools are clearly affected by the BRate:Codec effect as well, especially Lib1. Figure 1 compares the distributions of δ scores for each tool. We can see that Lib1 has indeed large variance across groups, but small variance within groups, as opposed to Lib2. The robustness of Lib1 seems to converge to δ ≈ 3% at 256 Kbps, and the descriptors are clearly more stable with larger bitrates (smaller within-group variance). On the other hand, the average robustness of Lib2 converges to δ ≈ 5% at 160-192 Kbps, and stabil-
[Figure 1: distributions of δ per MP3 codec-bitrate combination (CBR.64–CBR.320, VBR.6–VBR.0) at 44100 Hz; panels "MFCCs Lib1 44100 Hz" and "MFCCs Lib2 44100 Hz"; see caption below.]
For simplicity, we followed a hierarchical analysis for each combination of sampling rate, tool, feature and robustness indicator. We are first interested in the mean of the score distributions, which tells us the expected robustness in each case (e.g. a low ε mean score suggests that the descriptor is robust because it does not differ much between the lossless and the lossy versions). But we are also interested in the stability of the descriptor, that is, the variance of the distribution. For instance, a descriptor might be robust on average but not below 192 Kbps, or robust only with a frame size of 2048. To gain a deeper understanding of the variations in the indicators, we fitted a random effects model to study the effects of codec, bitrate and frame size [14]. The specific models included the FSize and Codec main effects, and the bitrate was modeled as nested within the Codec effect (BRate:Codec); all interactions among them were also fitted. Finally, we included the Genre and Track main effects to estimate the specific variability due to inherent differences among the music pieces themselves.
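As a rough, simplified stand-in for this analysis (the remaining modeling choices are described in the next paragraph), the sketch below partitions the variability of one indicator with a fixed-effects ANOVA in statsmodels. Unlike the random effects model of [14] it only reports sums-of-squares shares, and for simplicity the nested BRate:Codec structure is collapsed into a single encoding factor; the data frame and its column names are our assumptions.

```python
# Simplified, fixed-effects stand-in for the variance decomposition: partition
# the sums of squares of one robustness indicator over frame size, encoding
# (codec-bitrate combination) and track. Column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("robustness_scores.csv")   # one row per track/encoding/frame size
model = smf.ols("delta ~ C(FSize) * C(Encoding) + C(Track)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)
# Share of total variability attributable to each term (compare with Table 1).
print(100 * anova["sum_sq"] / anova["sum_sq"].sum())
```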
We did not consider any Genre or Track interactions because they can not be controlled in a real-world application, so their effects are all confounded with the residual effect. Note though that this residual does not account for any random error (in fact, there is no random error in this model); it accounts for high-order interactions associated with Genre and Track that are irrelevant for our purposes. This results in a Resolution V design for the factors of interest (main effects unconfounded with two- or three-factor interactions) and a Resolution III design for musical factors CBR.192 3.4 Analysis CBR.160 Euclidean distance ε. The Euclidean distance between the two vectors, which is especially relevant for tasks that compute distances between pairs of songs, such as in music similarity or other tasks that use techniques like clustering. Pearson’s r. The common parametric correlation coefficient between the two vectors, ranging from -1 to 1. Spearman’s ρ. A non-parametric correlation coefficient, equal to the Pearson’s r correlation after transforming all coefficients to their corresponding ranks in x ∪ y. Cosine similarity θ. The angle between both vectors. It is is similar to ε, but it is normalized between 0 and 1. We have 400 tracks×19 BRate:Codec×6 FSize=45600 datapoints for MFCCs with Lib1, MFCCs with Lib2, and chroma with Lib1. For chroma with Lib2 there is just one FSize, which yields 7600 datapoints. This adds up to 144400 datapoints for each indicators, 722000 overall. CBR.128 |xi −yi | max(|xi |,|yi |) CBR.96 1 n CBR.128 δ(x, y) = 4. RESULTS Figure 1. Distributions of δ scores for different combinations of MP3 codec and bitrate at 44100 Hz, and for both audio analysis tools. Blue crosses mark the sample means. Outliers are rather uniformly distributed across genres. 575 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2 σ̂F Size 2 σ̂Codec Lib2 Lib1 2 σ̂BRate:Codec 2 σ̂F Size×Codec 2 σ̂F Size×(BRate:Codec) 2 σ̂Genre 2 σ̂T rack 2 σ̂residual Grand mean Total variance Standard deviation 2 σ̂F Size 2 σ̂Codec 2 σ̂BRate:Codec 2 σ̂F Size×Codec 2 σ̂F Size×(BRate:Codec) 2 σ̂Genre 2 σ̂T rack 2 σ̂residual Grand mean Total variance Standard deviation δ 1.08 0 31.25 0 4.87 0.99 19.76 42.05 0.0591 0.0032 0.0567 1.17 0 4.91 0 0.96 4.21 52.34 36.41 0.0622 0.0040 0.0631 ε 3.03 0 42.13 0 11.71 4.53 5.84 32.75 1.6958 3.4641 1.8612 0.32 0 6.01 0 0.43 14.68 61.05 17.51 0.0278 0.0015 0.0391 22050 Hz r 1.73 0 21.61 0 12.36 3.92 6.46 53.92 0.9999 1.8e-7 0.0004 0.16 0 2.32 0 0.03 2.84 32.07 62.57 0.9999 8.9e-8 0.0003 ρ 0 0 8.38 0 1.23 0.08 11.59 78.72 0.9977 3.2e-5 0.0056 0.24 0 0.74 0 0.04 0.61 66.10 32.27 0.9955 0.0002 0.0131 θ 1.74 0 21.49 0 13.21 3.80 5.73 54.03 0.9999 1.5e-7 0.0004 0.18 0 3.14 0 0.09 4.41 41.26 50.92 0.9999 3.5e-8 0.0002 δ 0.21 0 46.98 0 7.37 1.12 10.12 34.19 0.0682 0.0081 0.0897 0.25 0 23.46 0 7.17 0.37 27.33 41.42 0.0656 0.0055 0.0740 ε 0.09 0 41.77 0.20 18.25 0.52 3.91 35.26 1.8820 11.44 3.3835 0 0 24.23 0 8.09 5.37 14.10 48.21 0.0342 0.0034 0.0587 44100 Hz r 0.01 0 22.52 0.07 17.98 0.90 2.65 55.87 0.9998 1.6e-6 0.0013 0 0 14.27 0 10.35 0.50 6.55 68.32 0.9998 6.4e-7 0.0008 ρ 0 0 24.03 0.05 10.85 0.32 5.23 59.52 0.9939 0.0005 0.0214 0 0 13.31 0 6.34 0 13.32 67.03 0.9947 0.0002 0.0150 θ 0 0 21.51 0.06 18.02 0.89 2.59 56.92 0.9998 1.4e-6 0.0012 0 0 15.02 0 10.86 0.48 5.53 68.11 0.9999 4.8e-7 0.0007 Table 1. Variance components in the distributions of robustness of MFCCs for Lib1 (top) and Lib2 (bottom). 
Each component represents the percentage of total variance due to each effect (e.g. σ̂²FSize = 3.03 indicates that 3.03% of the variability in the robustness indicator is due to differences across frame sizes; σ̂²x = 0 when the effect is so extremely small that the estimate is slightly below zero). All interactions with the Genre and Track main effects are confounded with the residual effect. The last rows show the grand mean, total variance and standard deviation of the distributions.
ity remains virtually the same beyond 96 Kbps. These plots confirm that the MFCC implementation in Lib1 is nearly twice as robust and stable when the encoding is homogeneous in the corpus, while the implementation in Lib2 is less robust but more stable with heterogeneous encodings. The FSize effect is negligible, indicating that the choice of frame size does not affect the robustness of MFCCs in general. However, in several cases we can observe large σ̂²FSize×(BRate:Codec) scores, meaning that for some codec-bitrate combinations it does matter. An in-depth analysis shows that these differences only occur at 64 Kbps though (small frame sizes are more robust); differences are very small otherwise. Finally, the small σ̂²Genre scores indicate that robustness is similar across music genres. A similar analysis was conducted to assess the robustness and stability of chroma features. Even though the correlation indicators are generally high as well, Table 2 shows that chroma vectors do not preserve the shape as well as MFCCs do. When looking at individual coefficients, the relative errors are similarly δ ≈ 6% in Lib1, but they are greatly reduced in Lib2, especially at 44100 Hz. In fact, the chroma implementation in Lib2 is more robust and stable according to all indicators⁴. For Lib1, virtually all the variability in the distributions is due to the Track and residual effects, meaning that chroma is similarly robust across encodings, analysis parameters and genre. For Lib2, we can similarly observe that errors in the correlation indicators depend almost entirely on the Track effect, but δ and ε depend mostly on the codec-bitrate combination. This indicates that, although chroma vectors preserve their shape, the individual components vary significantly across encodings; we observed that increasing the bitrate leads to larger coefficients overall. This suggests that normalizing the chroma coefficients could dramatically improve the distributions of δ and ε. We tried the parameter normalization=2 to have Lib2 normalize chroma vectors to unit maximum. As expected, the effects of codec and bitrate are removed after normalization, and most of the variability is due to the Track effect. The correlation indicators are practically unaltered after normalization.
5. ROBUSTNESS IN GENRE CLASSIFICATION
The previous section provided indicators of robustness that can be easily understood. However, they can be hard to interpret because in the end we are interested in the robustness of the various algorithms that make use of these features; whether δ = 5% is large or not depends on how MFCCs and chroma are used in practice. To investigate this question we consider a music genre classification task. For each sampling rate, codec, bitrate and tool we trained one SVM model with a radial basis kernel using MFCCs and another using chroma. For MFCCs we used a standard frame size of 2048, and for chroma we set 4096 in Lib1 and the fixed 16384 in Lib2.
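A hedged sketch of this classification setup with scikit-learn (our choice of library, not necessarily the authors'): an RBF-kernel SVM on per-track mean descriptor vectors, evaluated with the random sub-sampling scheme described in the next paragraph. The placeholder data stands in for the real descriptors and genre labels.

```python
# RBF-kernel SVM on per-track mean descriptors, 100 random 320/80 splits.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ShuffleSplit, cross_val_score

X = np.random.randn(400, 12)                # placeholder for 400 mean descriptor vectors
genres = np.repeat(np.arange(10), 40)       # placeholder for 10 genres x 40 tracks

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
splits = ShuffleSplit(n_splits=100, train_size=320, test_size=80, random_state=0)
scores = cross_val_score(clf, X, genres, cv=splits)
print(scores.mean())                        # mean accuracy over the 100 random trials
```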
We did random sub-sampling validation with 100 random trials for each model, using 320 tracks for training and the remaining 80 for testing. We first investigate whether a particular choice of encoding is likely to classify better when fixed across training and test sets. Table 3 shows the results for a selection of encodings at 44100 Hz. Within the same tool and descriptor, differences across encodings are quite small, approximately 0.02. In particular, for MFCCs and Lib1 an ANOVA analysis suggests that differences are signifi- 4 Even though these distributions include all frame sizes in Lib1 but only 16384 in Lib2, the FSize effect is negligible in Lib1, meaning that these indicators are still comparable across implementations 576 Lib2 Lib1 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2 σ̂F Size 2 σ̂Genre 2 σ̂T rack 2 σ̂residual Grand Mean Total variance Standard deviation 2 σ̂Codec 2 σ̂BRate:Codec 2 σ̂Genre 2 σ̂T rack 2 σ̂residual Grand mean Total variance Standard deviation δ 1.68 2.81 20.69 74.82 0.0610 0.0046 0.0682 63.62 0.71 0.25 19.29 16.14 0.0346 0.0004 0.0195 ε 2.77 2.75 19.27 75.21 0.0545 0.0085 0.0924 34.55 0.23 15.87 32.77 16.58 0.0031 5e-6 0.0022 22050 Hz r 0.20 1.29 17.75 80.75 0.9554 0.0276 0.1663 0 0 2.90 96.71 0.38 0.9915 0.0002 0.0135 ρ 0.15 1.47 18.52 79.86 0.9366 0.0293 0.1713 0 0 4.05 92.75 3.20 0.9766 0.0007 0.0270 θ 0.38 0.81 16.63 82.17 0.9920 0.0014 0.0373 0 0 7.95 91.80 0.25 0.9998 6.1e-8 0.0002 δ 2.37 3.12 22.28 72.22 0.0588 0.0048 0.0695 32.32 61.80 0.62 3.27 1.98 2.6e-2 4.6e-4 0.0213 ε 2.42 2.61 20.78 74.19 0.0521 0.0082 0.0904 21.59 39.51 9.98 13.79 15.13 2.2e-3 4.8e-6 0.0022 44100 Hz r 0.24 1.17 18.81 79.79 0.9549 0.0286 0.1691 0 0.01 3.43 94.24 2.32 0.9989 3.7e-6 0.0019 ρ 0.34 1.25 19.92 78.49 0.9375 0.0298 0.1725 0 0.03 1.33 93.04 5.60 0.9928 0.0001 0.0122 θ 0.50 0.85 18.64 80.01 0.9922 0.0013 0.0355 0 0.04 3.66 77.00 19.30 1 1.8e-9 4.2e-5 Lib2 Lib1 Table 2. Variance components in the distributions of robustness of Chroma for Lib1 (top) and Lib2 (bottom), similar to Table 1. The Codec main effect and all its interactions are not shown for Lib1 because all variance components are estimated as 0. Note that the FSize main effect and all its interactions are omitted for Lib2 because it is fixed to 16384. MFCCs Chroma MFCCs Chroma 64 .383 .275 .335 .320 96 .384 .281 .329 .325 128 .401 .288 .332 .320 160 .403 .261 .341 .323 192 .395 .278 .336 .325 256 .402 .278 .336 .319 320 WAV .394 .393 .284 .291 .344 .335 .320 .313 trate compression, mostly due to distortions at high frequencies. They estimated squared Pearson’s correlation between MFCCs computed on original lossless audio and its MP3 derivatives, using 4 different MFCC implementations. All implementations were found to be robust at bitrates of at least 128 Kbps, with r2 > 0.95, but a significant loss in robustness was observed at 64 Kbps in some of the implementations. The most robust MFCC implementation had a highest frequency of 4600 Hz, while the least robust implementation included frequencies up to 11025 Hz. Their music corpus contained only 46 songs though, clearly limiting their results. In our experiments, all encodings show r2 > 0.99. However, we note that Pearson’s r is very sensible to outliers with such small samples. This is the case of the first MFCC coefficients, which are orders of magnitude larger than the last coefficients. 
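A toy numeric illustration of this sensitivity, with invented values (the point is elaborated in the next paragraph): when one coefficient dominates, Pearson's r stays close to 1 even if the remaining coefficients disagree, whereas a rank-based correlation does not hide the disagreement.

```python
# Invented numbers: a dominant first coefficient drives Pearson's r to ~1.
import numpy as np
from scipy.stats import pearsonr, spearmanr

small = np.arange(1, 12) * 0.01
x = np.concatenate(([500.0], small))         # lossless-like vector
y = np.concatenate(([505.0], small[::-1]))   # lossy-like vector, small coefficients reversed
print(round(pearsonr(x, y)[0], 4))    # ~1.0, driven by the first coefficient
print(round(spearmanr(x, y)[0], 4))   # clearly lower (negative here)
```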
This makes r extremely large simply because the first coefficients are remotely similar; most of the variability between feature vectors is explained because of the first coefficient. This is clear in our Table 1, where r ≈ 1 and variance is nearly 0. To minimize this sensibility to outliers, we also included the non-parametric Spearman’s ρ correlation coefficient as well as the cosine similarity. In our case, the tool with the larger frequency range was shown to be more robust under homogeneous encodings, while the shorter range was more stable under heterogeneous conditions. Hamawaki et al. [10] analyzed differences in the distribution of MFCCs for different bitrates using a corpus of 2513 MP3 files of Japanese and Korean pop songs with bitrates between 96 and 192 Kbps. Following a music similarity task, they compared differences in the top-10 ranked results when using MFCCs derived from WAV audio, its MP3 encoded versions, and the mixture of MFCCs from different sources. They found that the correlation of the results deteriorates smoothly as the bitrate decreases, while ranking on a set of MFCCs derived from different formats revealed uncorrelated results. We similarly observed that the differences between MFCCs of the original WAV files and its MP3 versions decrease smoothly with bitrate. Jensen et al. [12] measured the effect of audio encoding on performance of an instrument classifier using MFCCs. Table 3. Mean classification accuracy over 100 trials when training and testing with the same encoding (MP3 CBR and WAV only) at 44100 Hz. cant, F (7, 693) = 2.34, p = 0.023; a multiple comparisons analysis reveals that 64 Kbps is significantly worse than the best (160 Kbps). In terms of chroma, differences are again statistically significant, F (7, 693) = 3.71, p < 0.001; 160 Kbps is this time significantly worse that most of the others. With Lib2 differences are not significant for MFCCs, F (7, 693) = 1.07, p = 0.378. No difference is found for chroma either, F (7, 693) = 0.67, p = 0.702. Overall, despite some pairwise comparisons are significantly different, there is no particular encoding that clearly outperforms the others; the observed differences are probably just Type I errors. There is no clear correlation either between bitrate and accuracy. We then investigate whether a particular choice of encoding for training is likely to produce better results when the target test set has a fixed encoding. For MFCCs and Lib1 there is no significant difference in any but one case (testing with 160 Kbps is worst when training with 64 Kbps). For chroma there are a few cases where 160 Kbps is again significantly worse than others, but we attribute these to Type I errors as well. Although not significantly so, the best result is always obtained when the training set has the same encoding as the target test set. With Lib2 there is no significant difference for MFCCs or chroma. Overall, we do not observe a correlation either between training and test encodings. Due to space constrains, we do not discuss results for VBR or 22050 Hz, but the same general conclusions can be drawn nonetheless. 6. DISCUSSION Sigurdsson et al. [21] suggested that MFCCs are sensitive to the spectral perturbations that result from low bi- 577 15th International Society for Music Information Retrieval Conference (ISMIR 2014) They compared MFCCs computed from MP3 files at only 32-64 Kbps, observing a decrease in performance when using a different encoder for training and test sets. 
In contrast, performance did not change significantly when using the same encoder. For genre classification with MFCCs, our results showed no differences in either case. We note though that the bitrates we considered are much larger. Uemura et al. [23] examined the effect of bitrate on chord recognition using chroma features with an SVM classifier. They observed no obvious correlation between encoding and estimation results; the best results were even obtained with very low bitrates for some codecs. Our results on genre classification with chroma largely agree in this case as well; the best results with Lib2 were also obtained by low bitrates. Casey et al. [4] evaluated the effect of lossy encodings on genre classification tasks using audio spectrum projection features. They found a small but statistically significant decrease in accuracy for bitrates of 32 and 96 Kbps. In our experiments, we do not observe these differences, although the lowest bitrate we consider is 64 Kbps. Jacobson et al. [11] also investigated the robustness of onset detection methods to lossy MP3 encoding. They found statistically significant changes in accuracy only at bitrates lower than 32 Kbps. Our results showed that MFCCs and chroma features, as computed by Lib1 and Lib2, are generally robust and stable within reasonable limits. Some differences have been noted between tools though, largely attributable to the different frequency ranges they employ. Nonetheless, it is evident that certain combinations of codec and bitrate may require a re-parameterization of some descriptors to improve or even maintain robustness. In practice, these parameterizations affect the performance and applicability of algorithms, so a balance between performance, robustness and generalizability should be sought. These considerations are of major importance when collecting audio files for some dataset, as a minimum audio quality might be needed for some descriptors. 7. CONCLUSIONS In this paper we have studied the robustness of two common audio descriptors used in Music Information Retrieval, namely MFCCs and chroma, to different audio encodings and analysis parameters. Using a varied corpora of music pieces and two different audio analysis tools we have confirmed that MFFCs are robust to frame/hop sizes and lossy encoding provided that a minimum bitrate of approximately 160 Kbps is used. Chroma features were shown to be even more robust, as the codec and bitrates had virtually no effect on the computed descriptors. This is somewhat expected given that chroma does not capture information as fine-grained as MFCCs do, and that lossy compression does not alter the perceived tonality. We did find subtle differences between implementations of these audio features, which call for further research on standardizing algorithms and parameterizations to maximize their robustness while maintaining their effectiveness in the various tasks they are used in. The immediate line for future work includes the analysis of other features and tools. 8. ACKNOWLEDGMENTS This work is partially supported by an A4U postdoctoral grant and projects SIGMUS (TIN2012-36650), CompMusic (ERC 267583), PHENICX (ICT-2011.8.2) and GiantSteps (ICT-2013-10). 9. REFERENCES [1] J.J. Aucouturier, F. Pachet, and M. Sandler. “The way it sounds”: timbre models for analysis and retrieval of music signals. IEEE Trans. Multimedia, 2005. [2] D. Bogdanov, N. Wack, et al. ESSENTIA: an audio analysis library for music information retrieval. In ISMIR, 2013. [3] C. Cannam, M.O. Jewell, C. 
Rhodes, M. Sandler, and M. d’Inverno. Linked data and you: bringing music research software into the semantic web. J. New Music Res., 2010. [4] M. Casey, B. Fields, et al. The effects of lossy audio encoding on genre classification tasks. In AES, 2008. [5] W. Chai. Semantic segmentation and summarization of music: methods based on tonality and recurrent structure. IEEE Signal Processing Magazine, 2006. [6] D. Ellis. Classifying music audio with timbral and chroma features. In ISMIR, 2007. [7] T. Fujishima. Realtime chord recognition of musical sound: a system using common lisp music. In ICMC, 1999. [8] T. Ganchev, N. Fakotakis, and G. Kokkinakis. Comparative evaluation of various MFCC implementations on the speaker verification task. In SPECOM, 2005. [9] E. Gómez. Tonal description of music audio signals. PhD thesis, Universitat Pompeu Fabra, 2006. [10] S. Hamawaki, S. Funasawa, et al. Feature analysis and normalization approach for robust content-based music retrieval to encoded audio with different bit rates. In MMM, 2008. [11] K. Jacobson, M. Davies, and M. Sandler. The effects of lossy audio encoding on onset detection tasks. In AES, 2008. [12] J.H. Jensen, M.G. Christensen, D. Ellis, and S.H. Jensen. Quantitative analysis of a common audio similarity measure. IEEE TASLP, 2009. [13] B. McFee, L. Barrington, and G. Lanckriet. Learning content similarity for music recommendation. IEEE TASLP, 2012. [14] D.C. Montgomery. Design and Analysis of Experiments. Wiley & Sons, 2009. [15] M. Müller and S. Ewert. Towards timbre-invariant audio features for harmony-based music. IEEE TASLP, 2010. [16] M. Müller, H. Mattes, and F. Kurth. An efficient multiscale approach to audio synchronization. In ISMIR, 2006. [17] J. Paulus, M. Müller, and A Klapuri. Audio-based music structure analysis. In ISMIR, 2010. [18] L.R. Rabiner and R.W. Schafer. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing. 2007. [19] J. Reed and C. Lee. Preference music ratings prediction using tokenization and minimum classification error training. IEEE TASLP, 2011. [20] J. Serrà, E. Gómez, and P. Herrera. Audio cover song identification and similarity: background, approaches, evaluation, and beyond. In Z. Raś and A.A. Wieczorkowska, editors, Advances in Music Information Retrieval. Springer, 2010. [21] S. Sigurdsson, K.B. Petersen, and T. Lehn-Schiler. Mel Frequency Cepstral Coefficients: an evaluation of robustness of MP3 encoded music. In ISMIR, 2006. [22] M. Slaney. Auditory toolbox. Interval Research Corporation, Technical Report, 1998. http://engineering. purdue.edu/˜malcolm/interval/1998-010/. [23] A. Uemura, K. Ishikura, and J. Katto. Effects of audio compression on chord recognition. In MMM, 2014. [24] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and HG. Okuno. An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model. IEEE TASLP, 2008. 578 15th International Society for Music Information Retrieval Conference (ISMIR 2014) MUSIC INFORMATION BEHAVIORS AND SYSTEM PREFERENCES OF UNIVERSITY STUDENTS IN HONG KONG Xiao Hu University of Hong Kong Jin Ha Lee University of Washington Leanne Ka Yan Wong University of Hong Kong [email protected] [email protected] [email protected] ern and Eastern cultures. Before the handover to the Chinese government in 1997, Hong Kong had been ruled by the British government for 100 years. 
This had resulted in a heavy influence of Western culture, although much of the Chinese cultural heritage has also been preserved well in Hong Kong. The cultural influences of Hong Kong to the neighboring regions in Asia were significant, especially in the pre-handover era. In fact, in the 80s and throughout the 90s, Cantopop (Cantonese popular music, sometimes referred to as HK-pop) was widely popular across many Asian countries, and produced many influential artists such as Leslie Cheung, Anita Mui, Andy Lau, and so on [2]. In the post-handover era, there has been an influx of cultural products from mainland China which is significantly affecting the popular culture of Hong Kong [8]. The cultural history and influences of Hong Kong, especially paired with the significance of Cantopop, makes it an interesting candidate to explore among many non-Western cultures. Of the populations in Hong Kong, we specifically wanted to investigate young adults on their music information needs and behaviors. They represent a vibrant population who are not only heavily exposed to and fast adopters of new ideas, but also represent the future workforce and consumers. University students in Hong Kong are mostly digital natives (i.e., grew up with access to computers and the Internet from an early age) with rich experience of seeking and listening to digital music. Additionally the fact that they are influenced by both Western and Eastern cultures, and exposed to both global and local music make them worthy of exploring as a particular group of music users.1 2 There have been a few related studies which investigated music information users in Hong Kong. Lai and Chan [5] surveyed information needs of users in an academic music library setting. They found that the frequencies of using score and multimedia were higher than using electronic journal databases, books, and online journals. Nettamo et al. [9] compared users in New York City and those in Hong Kong in using their mobile devices for music-related tasks. Their results showed that users’ envi- ABSTRACT This paper presents a user study on music information needs and behaviors of university students in Hong Kong. A mix of quantitative and qualitative methods was used. A survey was completed by 101 participants and supplemental interviews were conducted in order to investigate users’ music information related activities. We found that university students in Hong Kong listened to music frequently and mainly for the purposes of entertainment, singing and playing instruments, and stress reduction. This user group often searches for music with multiple methods, but common access points like genre and time period were rarely used. Sharing music with people in their online social networks such as Facebook and Weibo was a common activity. Furthermore, the popularity of smartphones prompted the need for streaming music and mobile music applications. We also examined users’ preferences on music services available in Hong Kong such as YouTube and KKBox, as well as the characteristics liked and disliked by the users. The results not only offer insights into non-Western users’ music behaviors but also for designing online music services for young music listeners in Hong Kong. 1. INTRODUCTION AND RELATED WORK Seeking music and music information is prevalent in our everyday life as music is an indispensable element for many people [1]. People in Hong Kong are not an exception. 
Hong Kong has the second highest penetration rate of broadband Internet access in Asia, following South Korea1. Consequently, Hong Kongers are increasingly using various online music information services to seek and listen to music, including iTunes, YouTube, Kugou, Sogou and Baidu2. However, our current understanding of their music information needs and behaviors are still lacking, as few studies explored user populations in Hong Kong, or in any non-Western cultures. Hong Kong is a unique location that merges the West© Xiao Hu, Jin Ha Lee, Leanne Ka Yan Wong. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Xiao Hu, Jin Ha Lee, Leanne Ka Yan Wong. “Music Information Behaviors and System Preferences of University Students in Hong Kong”, 15th International Society for Music Information Retrieval Conference, 2014. 1 http://www.itu.int/ITU-D/ICTEYE/Reporting/Dynamic ReportWizard.aspx 2 579 http://hk.epochtimes.com/b5/11/10/20/145162.htm 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ronment and context greatly influenced their behaviors, and there were cultural differences in consuming and managing mobile music between the two user groups. Our study investigates everyday music information behaviors of university students in Hong Kong, and thus the scope is broader than these studies. In addition to music information needs and behaviors, this study also examines the characteristics of popular music services adopted by university students in Hong Kong, in order to investigate their strengths and weaknesses. Recommendations for designing music services are proposed based on the results. This study will improve our understanding on music information behaviors of the target population and contribute to the design of music services that can better serve target users. depth explanations to support the survey findings. Faceto-face interviews were carried out individually with five participants from three different universities. The interviews were conducted in Cantonese, the mother tongue of the interviewees, and were later transcribed and translated to English. Each interview lasted up to approximately 20 minutes. 3. SURVEY DATA ANALYSIS Of the 167 survey responses collected, 101 complete responses were analyzed in this study. All the survey participants were university students in Hong Kong. Among them, 58.4% of were female and 41.6% of them were male. They were all born between 1988 and 1994, and most of them (88.1%) were born between 1989 and 1992. Therefore, they were in their early 20s when the survey was taken in 2013. Nearly all of them (98.0%) were undergraduates majoring Science/Engineering (43.6%), Social Sciences/Humanities (54.0%) and Other (2.0%). 2. METHODS A mix of quantitative and qualitative methods was used in order to triangulate our results. We conducted a survey in order to collect general information about target users’ music information needs, seeking behaviors, and opinions on commonly used music services. Afterwards, follow-up face-to-face interviews of a smaller user group were conducted to collect in-depth explanations on the themes and patterns discovered in the survey results. Prior to the formal survey and interviews, pilot tests were carried out with a smaller group of university students to ensure that the questions were well-constructed and students were able to understand and answer them without major issues. 
3.1 Music Preferences In order to find out participants’ preferred music genres, they were asked to select and rank up to five of their favorite music genres from a list of 25 genres covering most Western music genres. To ensure that the participants understand the different genres, titles and artist names of example songs representative of each genre were provided. The results are shown in Table 1 where each cell represents the number of times each genre was mentioned with the rank corresponding in the column. Pop was the most preferred genre among the participants, followed by R&B/Soul and Rock. We also aggregated the results by assigning reversely proportional weights to the ranks (1st: 5 points, and 5th: 1 point). The most popular music genres among the participants were Pop (311 pts), R&B/Soul (204 pts), Rock (109 pts), Gospel (88 pts) and Jazz (86 pts). 2.1 Survey The survey was conducted as an online questionnaire. The questionnaire instrument was adapted from the one used in [6] and [7], with modifications to fit the multilingual and multicultural environment. Seventeen questions about the use of popular music services were added to the questionnaire. The survey was implemented with LimeSurvey, an open-source survey application, and consisted of five parts: demographic information, music preference, music seeking behaviors, music collection management, and opinions on preferred music services. Completing the survey took approximately 30 minutes, and each participant was offered a chance to enter his/her name for a raffle to win one of the three supermarket gift coupons of HKD50, if they wished. The target population was students (both undergraduate and graduate) from the eight universities sponsored by the government of Hong Kong Special Administrative Region. The sample was recruited using Facebook due to its popularity among university students in Hong Kong. Survey invitations were posted on Facebook, initially through the list of friends of the authors, and then further disseminated by chain-referrals. 1st 2nd 3rd 4th 5th Pop 43 14 9 4 5 75 74.2% R&B 7 29 11 7 6 60 59.4% Total Total (%) Rock 6 9 10 3 7 35 34.7% Gospel 9 6 2 4 5 26 25.7% Jazz 6 8 3 5 5 27 26.7% Table 1. Preferences on music genres Moreover, as both Chinese and English are official languages of Hong Kong, participants were also asked to rank their preferences on languages of lyrics. The five options were English, Cantonese, Mandarin, Japanese and Korean. The last three were included due to popularity of songs from nearby countries/regions in Hong Kong, including mainland China and Taiwan (Mandarin), Japan (Japanese), and Korea (Korean). As shown in Table 2, English was in fact highly preferred, followed by Cantonese. Mandarin was mostly ranked at the second or third 2.2 Interviews Semi-structured interviews were conducted after the survey data were collected and analyzed, in order to seek in- 580 15th International Society for Music Information Retrieval Conference (ISMIR 2014) place, while Korean and Japanese were ranked lower. We also aggregated the answers and found that the most popular languages in songs are English (394 points), Cantonese (296 points), and Mandarin (223 points). 1st 2nd 3rd 4th 5th Total Known-item search was the most common type of music information seeking; nearly all respondents (95.1%) sought music information for the identification/verification of musical works, artist and lyrics, and about half of them do so at least a few times a week. 
Obtaining background information was also a strong reason; over 90% of the participants sought music to learn more about music artists (97.0%) as well as music (94.1%), and approximately half of them (53.5% and 40.6%, respectively) sought this kind of music information at least two or three times a month. When asked which sources stimulated or influenced their music information needs, all 101 participants acknowledged online video clips (e.g. YouTube) and TV shows/movies. This suggests that the influence of other media using music is quite significant which echoes the finding that associative metadata in music seeking was important for the university population in the United States [6]. Also over 70% of the participants’ music needs were influenced by music heard in public places, advertisement/commercial, radio show, or family members’/friends’ home. As for the metadata used in searching for music, performer was the most popular access point with 80.2% of positive responses, followed by the title of work(s) (65.3%) and some words of lyrics (62.4%). Other common types of metadata such as genre and time period were only used by a few respondents (33.7% and 29.7%, respectively). Particularly for genre, the proportion is significantly lower than 62.7% as found in the prior survey of university population in the United States [6]. This is perhaps related to the exposure to different music genres in Hong Kong, and the phenomenon that Hong Kongers music listeners tend to emphasize an affinity with friends while Americans (New Yorkers) are more likely to use music to highlight their individual personalities [9]. Moreover, participants responded that they would also seek music based on other users’ opinions: 57.4% by recommendations from other people and 52.5% by popularity. The proportion for popularity is also fairly larger than the 31% in [6]. This shows that the social aspect is a crucial factor affecting participants’ music seeking behaviors. Of the different types of people, friends and family members (91.1%) and people on their social network websites (e.g. Facebook, Weibo) (89.1%) were the ones whom they most likely ask for help when searching for music. In addition, they turned to the Internet more frequently than friends and family members. Thirty-nine percent of them sought help on social network websites at least a few times a week while only 23.8% turned to friends/family members at least a few times a week. On the other hand, when asked which physical places they go to in order to search for music or music information, 82.18% said that they would find music in family members’ or friends’ home, which was higher than going to record stores (75.3%), libraries (70.3%), and academic Total (%) English 46 27 16 3 2 94 93.1% Cantonese 31 20 15 7 2 75 74.3% Mandarin 13 23 20 2 2 60 59.4% Korean 6 15 6 16 14 57 56.4% Japanese 5 5 10 16 16 52 51.5% Table 2. Preferences on languages of song lyrics 3.2 Music Seeking Behaviors When asked about the type of music information they have ever searched, most participants indicated preferences on audio: MP3s and music videos (98.0%), music recordings (e.g., CDs, vinyl records, tapes) (94.1%), and music multimedia in other formats (e.g., Blue-ray, DVD, VHS) (88.1%). Written forms of music information were sought by fewer respondents: books on music (73.2%), music magazines (69.3%), and academic music journals (63.4%). 
Approximately one out of three participants even responded that they have never sought music magazines (30.7%) or academic music journals (36.6%). As for the frequency of search, 41.6% of respondents indicated that they sought MP3s and music videos at least a few times a week, compared to only 18.8% for music recordings (e.g., CDs, vinyl records, tapes) and 24.8% for music multimedia in other formats (e.g., Blue-ray, DVD). Moreover, 98.0% of participants responded that they had searched for music information on the Internet. Among them, almost all (99.0%) answered that they had downloaded free music online, and 95.0% responded that they had listened to streaming music or online radio. This clearly indicates that participants sought digital music more often through online channels than offline or physical materials. However, even though 77.8% of respondents had visited online music store, only 69.7% of them had purchased any electronic music files or albums. Not surprisingly, participants preferred free music resources. Music was certainly a popular element of entertainment in the lives of the participants. When asked why they sought music, all participants included entertainment in their answers. Also, a large proportion (83.0%) indicated that they sought music for entertainment at least a few times a week. Furthermore, 97.0% of respondents search for music information for singing or playing a musical instrument for fun. This proportion is significantly higher than the results from the previous survey of university population in the United States (32.8% for singing and 31.9% for playing a musical instrument) [6]. In addition, 78.2% of our respondents do this at least two or three times a month. We conjecture that this is most likely due to the popularity of karaoke in Hong Kong. 581 15th International Society for Music Information Retrieval Conference (ISMIR 2014) institutions (64.4%). Overall, these data show that users’ social networks, and especially online networks are important for their music searching process. (56.9%). Only a few respondents (9.8%) were unsatisfied with certain features of YouTube such as advanced search, relevance of search results, and navigation. It is surprising to see that five respondents rated YouTube negatively on the aspect of price. We suspect they might have associated this aspect with the price of purchasing digital music from certain music channels on YouTube, or the indirect cost of having to watch ads. However, we did not have the means to identify these respondents to verify the reasons behind their ratings. 3.3 Music Collection Management More participants were managing a digital collection (40.6%) than a physical one (25.7%). On average, each respondent estimated that he/she managed 900 pieces of digital music and 94 pieces of music in physical formats. This shows that managing digital music is more popular among participants, although the units that they typically associate with digital versus physical items might differ (e.g., digital file vs. physical album). We also found that students tended to manage their music collections with simple methods. Over half of the respondents (50.0% for music in physical formats and 56.1% for digital music) manage their music collection by artist name. Participants sometimes also organized their digital collections by album title (17.7%), but rarely by format type (3.9%) and never by record label. 
More participants indicated they did not organize their music at all for their physical music collection (19.2%) than their digital music collection (2.4%). When they did organize their physical music collection, they would use album title (11.5%) and genre (11.5%). Overall, organizing the collection did not seem to be one of the users’ primary activities related to music information. YouTube KKBox 3.4 Preferred Music Services Respondents gave a variety of responses regarding their most frequently visited music services: YouTube (51.5%), KKBox (26.7%), and iTunes (14.9%) were the most popular ones. KKBox is a large cloud-based music service provider founded in Taiwan, very popular in the region and sometimes referred to as “Asian Spotify.” YouTube, which provides free online streaming music video, was almost twice as popular as the second most favored music service, KKBox. The popularity of YouTube was also observed in Lee and Waterman’s survey of 520 music users in 2012 [7]. Their respondents ranked Pandora as the most preferred service, followed by YouTube as the second. The participants were also asked to evaluate their favorite music services. Specifically, they were asked to indicate their level of satisfaction using a 5-point Likert scale on 15 different aspects on search function, search results and system utility. Table 3 shows the percentage of positive (aggregation of “somewhat satisfied” and “very satisfied) and negative (aggregation of “somewhat unsatisfied” and “very unsatisfied”) ratings among users who chose each of the three services as their most favored one. For those who selected YouTube as their most frequently used service, they indicated that they were especially satisfied with its keyword search function (74.5%), recommendation of keywords (70.6%), variety of available music information (60.8%) and attractive interface 582 utility search results search function P N P N iTunes P N keyword search 74.5 7.8 29.6 7.4 13.3 0.0 advanced search 54.9 9.8 44.4 18.5 46.7 6.7 content-based search 51.0 7.8 44.4 29.6 66.7 13.3 auto-correction 49.0 7.8 29.6 29.6 20.0 33.3 keywords suggestion 70.6 3.9 40.7 25.9 20.0 53.3 number of results 52.9 7.8 40.7 22.2 6.7 33.3 relevance 47.1 9.8 48.1 18.5 13.3 33.3 accuracy 49.0 7.8 44.4 18.5 33.3 26.7 price of the service 39.2 9.8 25.9 25.9 33.3 20.0 accessibility 52.9 7.8 22.2 37.0 26.7 20.0 navigation 52.9 9.8 18.5 29.6 6.7 20.0 variety of available music information 60.8 7.8 22.2 22.2 26.7 13.3 music recommendation 52.9 7.8 33.3 22.2 53.3 20.0 interface attractiveness 56.9 3.9 33.3 7.4 40.0 20.0 music sharing 47.1 3.9 40.7 7.4 40.0 20.0 Table 3 User ratings of three most preferred music services (“P”: positive; “N”: negative, in percentage) The level of satisfaction for KKBox was lower than that of YouTube. Nearly half of the participants who use KKBox were satisfied with its relevance of results (48.1%), advanced search function (44.4%) and contentbased search function (44.4%). The aspects of KKBox that participants did not like included the lack of accessibility (37.0%), content-based search function (29.6%), and auto-correction (29.6%). Interestingly, the contentbased search function in KKBox was controversial among the participants. Some participants liked it probably because it was a novel feature that few music services had; while others were not satisfied with it, perhaps due to fact that current performance of audio content-based technologies have yet to meet users’ expectation. 
Only 15 participants rated iTunes as their most frequently used music service. Their opinions on iTunes were mixed. Its content-based search function and music recommendations were valued by 66.7% and 53.3% of the 15 participants, respectively. The data seem to suggest that audio content-based technologies in iTunes performed better than KKBox, but this must be verified with a larger sample in future work. On the other hand, over 15th International Society for Music Information Retrieval Conference (ISMIR 2014) half of the respondents gave negative response to the keyword suggestion function in iTunes. Moreover, the auto-correction, number of search results, and relevance of search results also received negative responses by one third of the respondents. These functions are related to the content of music collection in iTunes, and thus we suspect that the coverage of iTunes perhaps did not meet the expectations of young listeners in Hong Kong, as much as the other two services did. 4.3 24/7 Online Music Listening Participants in this study preferred listening to or watching streaming music services rather than downloading music. Downloading an mp3 file of a song usually takes about a half minute with a broadband connection and slightly longer with a wireless connection. Interviewees commented that downloading just added an extra step which was inconvenient to them. Apart from the web, smart mobile devices are becoming ubiquitous which is also affecting people’s mode of music listening. According to Mobilezine4, 87% of Hong Kongers aged between 15 and 64 own a smart device. According to Phneah [10], 55% of Hong Kong youths think that the use of smartphones dominates their lives as they are unable to stop using smartphones even in restrooms, and many sleep next to it. As expected, university students in Hong Kong are accustomed to having 24/7 access to streaming music on their smartphones. 4. THEMES/TRENDS FROM INTERVIEWS 4.1 Multiple Music Information Searching Strategies Interviewees searched for music using not only music services like YouTube or KKBox, but also generalpurpose search engines, such as Google and Yahoo!. Most often, a simple keyword search with the song title or artist name was conducted when locating music in these music services. However, more complicated searches such as those using lyrics and the name of composer are not supported by most existing music services. In this case, search engines had to be used. For example, if the desired song title and artist name are unknown or inaccurate, interviewees would search for them on Google or Yahoo! with any information they know about the song. The search often directed them to the right piece of metadata which then allowed them to conduct a search in YouTube or other music services. As expected, this does not always lead to successful results; one participant said “when I did not know the song title or artist name, I tried singing the song to Google voice search, but the result was not satisfactory.” 5. IMPLICATIONS FOR MUSIC SERVICES 5.1 Advanced Search A simple keyword search may not be sufficient to accommodate users who want to search for music with various metadata, not only with song titles, but also performer’s names, lyrics, and so on. For example, if a user wants to locate songs with the word “lotus” in the lyrics, they would simply use “lotus” as the search keyword. 
However, the search functions in various music services generally are not intelligent enough to understand the semantic differences among the band named Lotus and the word “lotus” in lyrics, not to mention which role the band Lotus might have played (e.g., performer, composer, or both). As a result, users have to conduct preliminary searches in web search engines as an extra step when attempting to locate the desired song. Many users will appreciate having an advanced search function with specific fields in music services that allow them to conduct lyric search with “lotus” rather than a general keyword search. 4.2 Use of Online Social Networks Online social network services are increasingly popular among people in Hong Kong. According to an online survey conducted with 387 Hong Kong residents in March 20113, the majority of the respondents visited Facebook (92%), read blogs (77%) and even wrote blog posts (52%). Social media provides a convenient way for people to connect with in Hong Kong where maintaining a work-life balance can be quite challenging. University students in Hong Kong are also avid social media users. They prefer communicating and sharing information with others using online social networks for the efficiency and flexibility. Naturally, it also serves as a convenient channel for sharing music recommendations and discussing music-related topics. Relying on others was considered an important way to search for music: “Normally, I will consider others’ opinions first. There are just way too many songs, so it helps find good music much more easily.”, “I love other people’s comments, especially when they have the same view as me!” 5.2 Mood Search Participants showed great interests in the feeling or emotion in music, as they perceived the meaning of songs were mostly about particular emotions. Terms such as “positive”, “optimistic”, and “touching” were used to describe the meaning of music during the interviews. Therefore, music services that can support searching by mood terms may be useful. Music emotion or mood has been recognized as an important access point for music [3]. A cross-cultural study by Hu and Lee [4] points out that listeners from different cultural backgrounds have different music mood judgments and they tend to agree more with users from the 3 4 Hong Kong has the second highest smartphone penetration in the world: http://mobilezine.asia/2013/01/hong-kong-has-the-secondhighest-smartphone-penetration-in-the-world/. Hong Kong social media use higher than United States: http://travel.cnn.com/hong-kong/life/hong-kong-social-media-usehigher-united-states-520745. 583 15th International Society for Music Information Retrieval Conference (ISMIR 2014) same cultural background than users from other cultures. This cultural difference must be taken into account when establishing mood metadata for music services. 7. ACKNOWLEDGEMENT The study was partially supported by a seed basic research project in University of Hong Kong. The authors extend special thanks to Patrick Ho Ming Chan for assisting in data collection. 5.3 Connection with Social Media Social media play a significant role in sharing and discussing music among university students in Hong Kong. YouTube makes it easy for people to share videos in various online social communities such as Facebook, Twitter and Google Plus. Furthermore, users can view the shared YouTube videos directly on Facebook which makes it even more convenient. This is one of the key reasons our participants preferred YouTube. 
However, music services like iTunes have yet to adopt this strategy. For our study population, linking social network to music services would certainly enhance user experience and help promote music as well. 8. REFERENCES [1] M. A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney: “Content-Based Music Information Retrieval: Current Directions and Future Challenges,” Proceedings of the IEEE, 96 (4), pp. 668-696, 2008. [2] S. Y. Chow: “Before and after the Fall: Mapping Hong Kong Cantopop in the Global Era,” LEWI Working Paper Series, 63, 2007. [3] X. Hu: “Music and mood: Where theory and reality meet,” Proceedings of iConference. 2010. 5.4 Smartphone Application Many participants are listening to streaming music with their smartphones, and thus naturally, offering music apps for smart devices will be critical for music services. Both YouTube and iTunes offer smartphone apps. Moreover, instant messaging applications, such as WhatsApp, is found as the most common reason for using smartphones among Hong Kongers [10]. To further improve the user experience, music-related smartphone apps may consider incorporating online instant messaging capabilities. [4] X. Hu and J. H. Lee: “A Cross-cultural Study of Music Mood Perception between American and Chinese Listeners,” Proceedings of the ISMIR, pp.535-540, 2012. [5] K. Lai and K. Chan: “Do you know your music users' needs? A library user survey that helps enhance a user-centered music collection.” The Journal of Academic Librarianship, 36(1), pp.63-69, 2010. 6. CONCLUSION [6] J. H. Lee and S. J. Downie: “Survey of music information needs, uses, and seeking behaviours: Preliminary findings,” Proceedings of the ISMIR, pp. 441-446, 2004. Music is essential for many university students in Hong Kong. They listen to music frequently for the purpose of entertainment and relaxation, to help reduce stress in their extremely tense daily lives. Currently, there does not exist a single music service that can fulfill all or most of their music information needs, and thus they often use multiple tools for specific searches. Furthermore, sharing and acquiring music from friends and acquaintances was a key activity, mainly done on online social networks. Comparing our findings to those of previous studies revealed some cultural differences between Hong Kongers and Americans, such as Hong Kongers relying more on popularity and significantly less on genres in music search. With the prevalence of smartphones, students are increasingly becoming “demanding” as they get accustomed to accessing music anytime and anywhere. Streaming music and music apps for smartphones are becoming increasingly common. The most popular music service among university students in Hong Kong was YouTube due to its convenience, user-friendly interface, and requiring no payment to use their service. In order to further improve the design of music services, we recommended providing an advanced search function, emotion/moodbased search, social network connection, smartphone apps as well as access to high quality digital music which will help fulfill users’ needs. [7] J. H. Lee and M. N. Waterman: “Understanding user requirements for music information services,” Proceedings of the ISMIR, pp. 253-258, 2012. [8] B. T. McIntyre, C. C. W. Sum, and Z. Weiyu: “Cantopop: The voice of Hong Kong,” Journal of Asian Pacific Communication, 12 (2), pp. 217-243, 2002. [9] E. Nettamo, M. Norhamo, and J. 
Häkkilä: “A crosscultural study of mobile music: Retrieval, management and consumption,” Proceedings of OzCHI 2006, pp. 87-94, 2006. [10] J. Phneah: “Worrying signals as smartphone addiction soars,” The Standard. Retrieved from http://www. thestandard.com.hk/news_detail.asp?pp_cat=30&art _id=132763&sid=39444767&con_type=1, 2013. [11] V. M. Steelman: “Intraoperative music therapy: Effects on anxiety, blood pressure,” Association of Operating Room Nurses Journal, 52(5), pp. 10261034, 1990. 584 15th International Society for Music Information Retrieval Conference (ISMIR 2014) LYRICSRADAR: A LYRICS RETRIEVAL SYSTEM BASED ON LATENT TOPICS OF LYRICS Shoto Sasaki∗1 Kazuyoshi Yoshii∗∗2 Tomoyasu Nakano∗∗∗3 Masataka Goto∗∗∗4 Shigeo Morishima∗5 ∗ Waseda University ɹ ∗∗ Kyoto University ∗∗∗ National Institute of Advanced Industrial Science and Technology (AIST) 1 joudanjanai-ss[at]akane.waseda.jp 2 yoshii[at]kuis.kyoto-u.ac.jp 3,4 (t.nakano, m.goto)[at]aist.go.jp 5 shigeo[at]waseda.jp ABSTRACT This paper presents a lyrics retrieval system called LyricsRadar that enables users to interactively browse song lyrics by visualizing their topics. Since conventional lyrics retrieval systems are based on simple word search, those systems often fail to reflect user’s intention behind a query when a word given as a query can be used in different contexts. For example, the wordʠtearsʡcan appear not only in sad songs (e.g., feel heartrending), but also in happy songs (e.g., weep for joy). To overcome this limitation, we propose to automatically analyze and visualize topics of lyrics by using a well-known text analysis method called latent Dirichlet allocation (LDA). This enables LyricsRadar to offer two types of topic visualization. One is the topic radar chart that visualizes the relative weights of five latent topics of each song on a pentagon-shaped chart. The other is radar-like arrangement of all songs in a two-dimensional space in which song lyrics having similar topics are arranged close to each other. The subjective experiments using 6,902 Japanese popular songs showed that our system can appropriately navigate users to lyrics of interests. Figure 1. Overview of topic modeling of LyricsRadar. approaches analyzed the text of lyrics by using natural language processing to classify lyrics according to emotions, moods, and genres [2, 3, 11, 19]. Automatic topic detection [6] and semantic analysis [1] of song lyrics have also been proposed. Lyrics can be used to retrieve songs [5] [10], visualize music archives [15], recommend songs [14], and generate slideshows whose images are matched with lyrics [16]. Some existing web services for lyrics retrieval are based on social tags, such as “love” and “graduation”. Those services are useful, but it is laborious to put appropriate tags by hands and it is not easy to find a song whose tags are also put to many other songs. Macrae et al. showed that online lyrics are inaccurate and proposed a ranking method that considers their accuracy [13]. Lyrics are also helpful for music interfaces: LyricSynchronizer [8] and VocaRefiner [18], for example, show the lyrics of a song so that a user can click a word to change the current playback position and the position for recording, respectively. Latent topics behind lyrics, however, were not exploited to find favorite lyrics. 1. INTRODUCTION Some listeners regard lyrics as essential when listening to popular music. 
It has not been easy, however, for listeners to find songs with their favorite lyrics using existing music information retrieval systems; they usually encounter songs with their favorite lyrics by chance while listening to music. The goal of this research is to help listeners who consider lyrics important to encounter songs with unfamiliar but interesting lyrics. Although there have been previous lyrics-based approaches to music information retrieval, they have not provided an interface that enables users to interactively browse the lyrics of many songs while seeing the latent topics behind those lyrics. We call these latent topics lyrics topics. Several

c Shoto Sasaki, Kazuyoshi Yoshii, Tomoyasu Nakano, Masataka Goto, Shigeo Morishima. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Shoto Sasaki, Kazuyoshi Yoshii, Tomoyasu Nakano, Masataka Goto, Shigeo Morishima. LyricsRadar: A Lyrics Retrieval System Based on Latent Topics of Lyrics, 15th International Society for Music Information Retrieval Conference, 2014.

We therefore propose a lyrics retrieval system, LyricsRadar, that analyzes the lyrics topics by using a machine learning technique called latent Dirichlet allocation (LDA) and visualizes those topics to help users find their favorite lyrics interactively (Fig. 1). A single word can have different lyrics topics. For example, "diet" may have at least two: when it is used with words related to meals, vegetables, and fat, its lyrics topic could be estimated by the LDA as "food and health"; on the other hand, when it is used with words like government, law, and elections, "politics" could be estimated. Although the LDA can estimate various lyrics topics, five typical topics common to all lyrics in a given database were chosen. The lyrics of each song are represented by their unique ratios of these five topics, which are displayed as a pentagon-shaped chart called the topic radar chart. This chart makes it easy to guess the meaning of lyrics before listening to the song. Furthermore, users can directly change the shape of this chart as a query to retrieve lyrics having a similar shape. In LyricsRadar, all the lyrics are embedded in a two-dimensional space, mapped automatically based on the ratios of the five lyrics topics. The positions are such that lyrics in close proximity have similar ratios. Users can navigate in this plane by mouse operation and discover lyrics that are located very close to their favorite lyrics.

Figure 2. Example display of LyricsRadar.

2.1 Visualization based on the topic of lyrics

LyricsRadar has the following two visualization functions: (1) the topic radar chart; and (2) a mapping to the two-dimensional plane. Figure 2 shows an example display of our interface. The topic radar chart shown in the upper-left corner of Figure 2 is a pentagon-shaped chart that expresses the ratios of the five topics of the lyrics. Each colored dot displayed in the two-dimensional plane shown in Figure 2 indicates the relative location of a song's lyrics in the database. We call these colored dot representations of lyrics lyrics dots.
Users can see the lyrics, their title and artist name, and the topic ratios by clicking a lyrics dot placed in the 2D space; this supports discovering lyrics interactively. While the lyrics mapping helps the user understand the lyrics topic through relative location in the map, the topic radar chart helps the user grasp the image of the lyrics intuitively through the shape of the chart. We explain each of these in the following subsections.

2. FUNCTIONALITY OF LYRICSRADAR

LyricsRadar provides a graphical user interface that assists users in navigating a two-dimensional space intuitively and interactively to come across a target song. This space is generated automatically by analyzing, with LDA, the topics that appear in common across the lyrics of the many musical pieces in the database. The latent meaning of lyrics is also visualized by the topic radar chart based on the combination of topic ratios. Lyrics that are similar to a user's preference (target) can be intuitively discovered by clicking the topic radar chart or the lyrics represented by dots. This approach cannot be achieved at all by the conventional method, which directly searches for a song by the keywords or phrases appearing in its lyrics. Since linguistic expressions of the topic are not necessary, users can find a target song intuitively even when they do not have any knowledge about the lyrics.

2.1.1 Topic radar chart

The values of the lyrics topics are computed and visualized as the topic radar chart, which is pentagon-shaped. Each vertex of the pentagon corresponds to a distinct topic, and the predominant words of each topic (e.g., "heart", "world", and "life" for topic 3) are also displayed at the five corners of the pentagon, as shown in Figure 2. The predominant words help the user guess the meaning of each topic. The center of the topic radar chart indicates a ratio value of 0 for a lyrics topic, in the same manner as a common radar chart. Since the sum of the five components is a constant value, if the ratio of one topic stands out, the user will see it clearly. It is easy to grasp the topic of selected lyrics visually and to make an intuitive comparison between lyrics. Furthermore, the number of topics in this interface is set to five to strike a balance between the operability of the interface and the variety of topics¹.
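As a concrete illustration of the idea behind the topic radar chart and of retrieving lyrics whose charts have a similar "shape", the following sketch treats each song as a five-dimensional topic-ratio vector that sums to 1.0 and ranks songs by Euclidean distance to a query ratio. This is an illustrative sketch, not the authors' implementation; all variable names and the example numbers are hypothetical.

```python
import numpy as np

def normalize_ratios(weights):
    """Rescale five non-negative topic weights so they sum to 1.0,
    matching the constraint of the pentagon-shaped topic radar chart."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()

def rank_by_chart_shape(query_ratios, song_ratios):
    """Return song indices sorted by how closely their five topic ratios
    match the query 'shape' (smaller distance = more similar)."""
    q = normalize_ratios(query_ratios)
    dists = np.linalg.norm(song_ratios - q, axis=1)
    return np.argsort(dists)

# Hypothetical example: a query chart emphasizing topic 3 ("heart", "world", "life").
song_ratios = np.array([[0.10, 0.20, 0.50, 0.10, 0.10],
                        [0.30, 0.30, 0.10, 0.20, 0.10],
                        [0.05, 0.15, 0.60, 0.10, 0.10]])
query = [0.1, 0.1, 0.6, 0.1, 0.1]
print(rank_by_chart_shape(query, song_ratios))  # prints [2 0 1]
```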
2.1.2 Plane-mapped lyrics

The lyrics of musical pieces are mapped onto a two-dimensional plane in which musical pieces with almost the same topic ratios are placed close to each other. Each musical piece is represented by a colored dot whose RGB components correspond to the five topic values compressed to three axes. The space can be scaled so that either the local or the global structure around each musical piece can be observed. The distribution of lyrics about a specific topic can be recognized from the color of the dots. The dimensionality reduction used for both the mapping and the coloring is t-SNE [9]. When a user mouses over a point in the space, it is colored pink, and meta-information (the title, the artist, and the topic radar chart) appears simultaneously. By repeated mouseovers, the lyrics and the names of their artist and songwriter are updated continuously. Using this approach, other lyrics with topics similar to the input lyrics can be discovered. The lyrics map can be moved and zoomed by dragging the mouse or using specific keyboard operations. Furthermore, it is possible to visualize a lyrics map specialized to an artist or songwriter, which are associated with lyrics as metadata. When an artist name is chosen, as shown on the right side of Figure 3, the points of that artist's lyrics turn yellow; similarly, when a songwriter is chosen, the points of that songwriter's lyrics change to orange. While this is somewhat equivalent to lyrics retrieval using the artist or songwriter as a query, our innovation is that a user can intuitively grasp how artists and songwriters are distributed based on the ratios of the given topics. Although music retrieval by artist is very popular in conventional systems, retrieval by songwriter has not received much attention. However, for lyrics retrieval, searching by songwriter makes it easier to discover songs with one's favorite lyrics, because each songwriter has his or her own lyrics vocabulary. Moreover, our system can perform a topic analysis for a specific artist. Intuitively, similar artists are located and colored close to each other depending on their topic ratios. An artist is colored based on a topic ratio in the same way as the lyrics. In Figure 4, the size of a circle is proportional to the number of musical pieces each artist has. In this way, other artists similar to one's favorite artist can easily be discovered.

Figure 3. An example display of lyrics by a selected artist.

Figure 4. Mapping of 487 English artists.

2.2 Lyrics retrieval using topic of lyrics

In LyricsRadar, in addition to the ability to traverse and explore a map to find lyrics, we also propose a way to directly enter a topic ratio as an intuitive expression of one's latent feeling. More specifically, we consider the topic radar chart as an input interface and provide a means by which a user can specify the ratios of the five topics directly to search for lyrics very close to one's latent image. This interface can satisfy a search in which a user would like to find lyrics that contain more of a given topic, using the representative words of each topic as a guide. Figure 5 shows an example in which one of the five topics is increased by a mouse drag; the balance of the five topic ratios then changes because the sum of the five components is equal to 1.0. A user can repeat these processes, updating topic ratios or navigating the point in the space interactively, until finding interesting lyrics. As with the above subsections, we have substantiated our claims for a more intuitive and exploratory lyrics retrieval system.

¹ If the number of topics were increased, more subdivided and exacting semantic content could be represented; however, the operation would become more complicated for the user.

Figure 5. An example of the direct manipulation of the topic ratio on the topic radar chart. Each topic ratio can be increased by dragging the mouse.

3. IMPLEMENTATION OF LYRICSRADAR

Figure 6. Graphical representation of the latent Dirichlet allocation (LDA).

LyricsRadar uses LDA [4] for the topic analysis of lyrics. LDA is a typical topic modeling method based on machine learning. Since LDA assigns each word that constitutes the lyrics to a topic independently, the lyrics include a variety of topics according to the variation of the words they contain. In our system, K typical topics that underlie many lyrics in the database are estimated, and the ratio of each topic is calculated for each song's lyrics with unsupervised learning. As a result, the appearance probability of each word in every topic can be calculated, and the typical representative words for each topic can be determined at the same time.
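Before the formal model below, the following sketch illustrates how per-song topic ratios (such as those LDA produces) could be turned into the two visualizations of Section 2.1.2 — a t-SNE plane and an RGB coloring — using scikit-learn. It is a minimal sketch on synthetic data, not the authors' implementation; the parameter choices and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
topic_ratios = rng.dirichlet(np.ones(5), size=200)  # 200 songs x 5 topic ratios

# 2-D coordinates for the lyrics map: nearby points have similar topic ratios.
xy = TSNE(n_components=2, init="random", perplexity=30,
          random_state=0).fit_transform(topic_ratios)

# 3-D compression for coloring: rescale each axis to [0, 1] and use as RGB.
xyz = TSNE(n_components=3, method="exact", init="random",
           random_state=0).fit_transform(topic_ratios)
rgb = (xyz - xyz.min(axis=0)) / (xyz.max(axis=0) - xyz.min(axis=0))

print(xy.shape, rgb.shape)  # (200, 2) (200, 3)
```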
3.1 LDA for lyrics

The observed data that we consider for LDA are D independent lyrics X = {X_1, ..., X_D}. The lyrics X_d consist of the word series X_d = {x_{d,1}, ..., x_{d,N_d}} of length N_d. The size of the whole vocabulary that appears in the lyrics is V, and x_{d,n} is a V-dimensional "1-of-K" vector (a vector with one element equal to 1 and all other elements equal to 0). The latent variable (i.e., the topic series) of the observed lyrics X_d is Z_d = {z_{d,1}, ..., z_{d,N_d}}. The number of topics is K, so z_{d,n} is a K-dimensional 1-of-K vector. Hereafter, all latent variables of the D lyrics are denoted Z = {Z_1, ..., Z_D}. Figure 6 shows a graphical representation of the LDA model used in this paper. The full joint distribution is given by

p(X, Z, \pi, \phi) = p(X \mid Z, \phi)\, p(Z \mid \pi)\, p(\pi)\, p(\phi)   (1)

where π indicates the mixing weights of the multiple topics of the lyrics (D K-dimensional vectors) and φ indicates the unigram probability of each topic (K V-dimensional vectors). The first two terms are likelihood functions, whereas the other two terms are prior distributions. The likelihood functions themselves are defined as

p(X \mid Z, \phi) = \prod_{d=1}^{D} \prod_{n=1}^{N_d} \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\,z_{d,n,k}\, x_{d,n,v}}   (2)

p(Z \mid \pi) = \prod_{d=1}^{D} \prod_{n=1}^{N_d} \prod_{k=1}^{K} \pi_{d,k}^{\,z_{d,n,k}}   (3)

We then introduce conjugate priors as

p(\pi) = \prod_{d=1}^{D} \mathrm{Dir}(\pi_d \mid \alpha^{(0)}) = \prod_{d=1}^{D} C(\alpha^{(0)}) \prod_{k=1}^{K} \pi_{d,k}^{\,\alpha_k^{(0)} - 1}   (4)

p(\phi) = \prod_{k=1}^{K} \mathrm{Dir}(\phi_k \mid \beta^{(0)}) = \prod_{k=1}^{K} C(\beta^{(0)}) \prod_{v=1}^{V} \phi_{k,v}^{\,\beta_v^{(0)} - 1}   (5)

where p(π) and p(φ) are products of Dirichlet distributions, α^{(0)} and β^{(0)} are hyperparameters, and C(α^{(0)}) and C(β^{(0)}) are normalization factors calculated as

C(x) = \frac{\Gamma(\hat{x})}{\Gamma(x_1) \cdots \Gamma(x_I)}, \qquad \hat{x} = \sum_{i=1}^{I} x_i   (6)

Also note that π is the topic mixture ratio of the lyrics, which is used for the topic radar chart after normalization. The appearance probability φ of the vocabulary in each topic was used to select the highly representative words that are strongly correlated with each topic of the topic radar chart.

3.2 Training of LDA

The lyrics database contains 6902 Japanese popular songs (J-POP) and 5351 English popular songs. Each of these songs includes more than 100 words. The J-POP songs are selected from our own database, and the English songs are from the Music Lyrics Database v.1.2.7². The J-POP database has 1847 artists and 2285 songwriters, and the English database has 398 artists. For the topic analysis per artist, 2484 J-POP artists and 487 English artists, all of whose songs include at least 100 words, are selected. The 26229 words in J-POP and 35634 words in English that appear more than ten times across all lyrics are used for the value V, the size of the vocabulary of the lyrics. For the J-POP lyrics, MeCab [17] was used for morphological analysis; the noun, verb, and adjective components were extracted, and the original and inflected forms were counted as one word. For the English lyrics, we use the Full-Text Stopwords list in MySQL³ to remove commonly used words. However, words that appear often across many lyrics are inconvenient for topic analysis; to lower the importance of such words, they were weighted by inverse document frequency (idf). In training the LDA, the number of topics K is set to 5, and all initial values of the hyperparameters α^{(0)} and β^{(0)} were set to 1.

Figure 7. Results of our evaluation experiment to evaluate topic analysis; the score of (1) was the closest to 1.0, showing our approach to be effective.
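As a rough, non-authoritative sketch of this training setup — K = 5 topics, symmetric priors of 1, and a vocabulary restricted to frequent terms — one could use scikit-learn as below. The toy lyrics, the min_df cut, and the omission of MeCab tokenization and idf weighting are simplifications for illustration; this is not the authors' code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One whitespace-tokenized string per song (e.g., MeCab output for J-POP,
# or stopword-filtered English text); toy data for illustration only.
lyrics = [
    "love heart tears heart world",
    "government law election law vote",
    "love tears heart night",
    "law election government policy",
]

# Keep words occurring in at least two documents, loosely mirroring the
# paper's "appears more than ten times" vocabulary cut.
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(lyrics)

lda = LatentDirichletAllocation(
    n_components=5,        # K = 5 topics, as in the topic radar chart
    doc_topic_prior=1.0,   # alpha^(0) = 1
    topic_word_prior=1.0,  # beta^(0) = 1
    random_state=0,
)
theta = lda.fit_transform(X)                       # per-song topic weights
theta = theta / theta.sum(axis=1, keepdims=True)   # normalized topic ratios

# Representative words per topic, analogous to the pentagon's vertex labels.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```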
4. EVALUATION EXPERIMENTS

To verify the validity of the topic analysis results (as related to the topic radar chart and the mapping of lyrics) in LyricsRadar, we conducted a subjective evaluation experiment. There were 17 subjects (all Japanese speakers) aged from 21 to 32. We used the results of LDA for the lyrics of the 6902 J-POP songs described in Section 3.2.

4.1 Evaluation of topic analysis

Our evaluation here attempted to verify that the topic ratio determined by the topic analysis of LDA could appropriately represent the latent meaning of lyrics. Furthermore, when the lyrics of a song are selected, their location relative to other lyrics by the same artist or songwriter in the space is investigated.

4.1.1 Experimental method

In our experiment, the lyrics of a song are selected at random in the space as basis lyrics, and target lyrics of four songs are selected for comparison according to the following conditions:

(1) The lyrics closest to the basis lyrics on the lyrics map
(2) The lyrics closest to the basis lyrics with the same songwriter
(3) The lyrics closest to the basis lyrics with the same artist
(4) The lyrics selected at random

Each subject evaluated the similarity of the impression received from the two lyrics using a five-step scale (1: closest, 2: somehow close, 3: neutral, 4: somehow far, and 5: most far), comparing the basis lyrics and one of the target lyrics after seeing the basis lyrics. The presentation order was random for each subject. Furthermore, each subject described the reason for their evaluation score.

4.1.2 Experimental results

The average scores of the five-step evaluation for the four target lyrics over all subjects are shown in Figure 7. As expected, the lyrics closest to the basis lyrics on the lyrics map were evaluated as the closest in terms of the impression of the basis lyrics, because the score of (1) was closest to 1.0. The results for target lyrics (2) and (3) were both close to 3.0: the lyrics closest to the basis lyrics by the same songwriter or artist were mostly judged as "3: neutral." Finally, the lyrics selected at random (4) were appropriately judged to be far. Among the subjects' comments about the reasons for their decisions, we obtained responses such as a sense of the season, positive versus negative, love, relationships, color, light versus dark, subjective versus objective, and tension. Responses differed greatly from one subject to the next. For example, some judged the impression only by the similarity of the seasonal feeling of the lyrics. Trial usage of LyricsRadar has shown that it is a useful tool for users.

4.2 Evaluation of the number of topics

The perplexity used for the quality assessment of a language model was computed for each number of topics. The more complicated the model is, the higher the perplexity becomes; therefore, we can estimate that the performance of the language model is good when the value of the perplexity is low. We calculated perplexity as

\mathrm{perplexity}(X) = \exp\!\left( - \frac{\sum_{d=1}^{D} \log p(X_d)}{\sum_{d=1}^{D} N_d} \right)   (7)

In the case where the number of topics K is five, the perplexity is 1150, which is still high.

² "Music Lyrics Database v.1.2.7," http://www.odditysoftware.com/page-datasales1.htm.
³ "Full-Text Stopwords in MySQL," http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html.

[2] C. Laurier et al.: "Multimodal Music Mood Classification Using Audio and Lyrics," Proceedings of ICMLA 2008, pp. 688–693, 2008.

[3] C.
McKay et al.: “Evaluating the genre classification performance of lyrical features relative to audio, symbolic and cultural features,” Proceedings of ISMIR 2008, pp. 213–218, 2008. [4] D. M. Blei et al.: “Latent Dirichlet Allocation,” Journal of Machine Learning Research Vol.3, pp. 993–1022, 2003. [5] E. Brochu and N. de Freitas: ““Name That Song!”: A Probabilistic Approach to Querying on Music and Text,” Proceedings of NIPS 2003, pp. 1505–1512, 2003. [6] F. Kleedorfer et al.: “Oh Oh Oh Whoah! Towards Automatic Topic Detection In Song Lyrics,” Proceedings of ISMIR 2008, pp. 287–292, 2008. Figure 8. Perplexity for the number of topics. On the other hand, because Miller showed that the number of objects human can hold in his working memory is 7 ± 2 [7], the number of topics should be 1 to 5 in order to obtain information naturally. So we decided to show five topics in the topic radar chart. Figure 8 shows calculation results of perplexity for each topic number. Blue points represent perplexity for LDA applied to lyrics and red points represent perplexity for LDA applied to each artist. Orange bar indicates the range of human capacity for processing information. Since there exists a tradeoff between the number of topics and operability, we found that five is appropriate number of topics. 5. CONCLUSIONS In this paper, we propose LyricsRadar, an interface to assist a user to come across favorite lyrics interactively. Conventionally lyrics were retrieved by titles, artist names, or keywords. Our main contribution is to visualize lyrics in the latent meaning level based on a topic model by LDA. By seeing the pentagon-style shape of Topic Radar Chart, a user can intuitively recognize the meaning of given lyrics. The user can also directly manipulate this shape to discover target lyrics even when the user does not know any keyword or any query. Also the topic ratio of focused lyrics can be mapped to a point in the two dimensional space which visualizes the relative location to all the lyrics in our lyrics database and enables the user to navigate similar lyrics by controlling the point directly. For future work, user adaptation is inevitable task because every user has an individual preference, as well as improvements to topic analysis by using hierarchical topic analysis [12]. Furthermore, to realize the retrieval interface corresponding to a minor topic of lyrics, a future challenge is to consider the visualization method that can reflect more numbers of topics by keeping an easy-to-use interactivity. Acknowledgment: This research was supported in part by OngaCREST, CREST, JST. 6. REFERENCES [1] B. Logan et al.: “Semantic Analysis of Song Lyrics,” Proceedings of IEEE ICME 2004 Vol.2, pp. 827–830, 2004. [7] G. A. Miller: “The magical number seven, plus or minus two: Some limits on our capacity for processing information,” Journal of the Psychological Review Vol.63(2), pp. 81– 97, 1956. [8] H. Fujihara et al.: “LyricSynchronizer: Automatic Synchronization System between Musical Audio Signals and Lyrics,” Journal of IEEE Selected Topics in Signal Processing, Vol.5, No.6, pp. 1252–1261, 2011. [9] L. Maaten and G. E. Hinton: “Visualizing High-Dimensional Data Using t-SNE,” Journal of Machine Learning Research, Vol.9, pp. 2579–2605, 2008. [10] M. Müller et al.: “Lyrics-based Audio Retrieval and Multimodal Navigation in Music Collections,” Proceedings of ECDL 2007, pp. 112–123, 2007. [11] M. V. Zaanen and P. 
Kanters: “Automatic Mood Classification Using TF*IDF Based on Lyrics,” Proceedings of ISMIR 2010, pp. 75–80, 2010. [12] R. Adams et al.: “Tree-Structured Stick Breaking Processes for Hierarchical Data,” Proceedings of NIPS 2010, pp. 19– 27, 2010. [13] R. Macrae and S. Dixon: “Ranking Lyrics for Online Search,” Proceedings of ISMIR 2012, pp. 361–366, 2012. [14] R. Takahashi et al.: “Building and combining document and music spaces for music query-by-webpage system,” Proceedings of Interspeech 2008, pp. 2020–2023, 2008. [15] R. Neumayer and A. Rauber: “Multi-modal Music Information Retrieval: Visualisation and Evaluation of Clusterings by Both Audio and Lyrics,” Proceedings of RAO 2007, pp. 70– 89, 2007. [16] S. Funasawa et al.: “Automated Music Slideshow Generation Using Web Images Based on Lyrics,” Proceedings of ISMIR 2010, pp. 63–68, 2010. [17] T. Kudo: “MeCab: Yet Another Part-of-Speech and Morphological Analyzer,” http://mecab.googlecode.com/ svn/trunk/mecab/doc/index.html. [18] T. Nakano and M. Goto: “VocaRefiner: An Interactive Singing Recording System with Integration of Multiple Singing Recordings,” Proceedings of SMC 2013, pp. 115– 122, 2013. [19] Y. Hu et al.: “Lyric-based Song Emotion Detection with Affective Lexicon and Fuzzy Clustering Method,” Proceedings of ISMIR 2009, pp. 122–128, 2009. 590 15th International Society for Music Information Retrieval Conference (ISMIR 2014) JAMS: A JSON ANNOTATED MUSIC SPECIFICATION FOR REPRODUCIBLE MIR RESEARCH Eric J. Humphrey1,* , Justin Salamon1,2 , Oriol Nieto1 , Jon Forsyth1 , Rachel M. Bittner1 , and Juan P. Bello1 1 2 Music and Audio Research Lab, New York University, New York Center for Urban Science and Progress, New York University, New York ABSTRACT Meanwhile, the interests and requirements of the community are continually evolving, thus testing the practical limitations of lab-files. By our count, there are three unfolding research trends that are demanding more of a given annotation format: The continued growth of MIR is motivating more complex annotation data, consisting of richer information, multiple annotations for a given task, and multiple tasks for a given music signal. In this work, we propose JAMS, a JSON-based music annotation format capable of addressing the evolving research requirements of the community, based on the three core principles of simplicity, structure and sustainability. It is designed to support existing data while encouraging the transition to more consistent, comprehensive, well-documented annotations that are poised to be at the crux of future MIR research. Finally, we provide a formal schema, software tools, and popular datasets in the proposed format to lower barriers to entry, and discuss how now is a crucial time to make a concerted effort toward sustainable annotation standards. • Comprehensive annotation data: Rich annotations, like the Billboard dataset [2], require new, contentspecific conventions, increasing the complexity of the software necessary to decode it and the burden on the researcher to use it; such annotations can be so complex, in fact, it becomes necessary to document how to understand and parse the format [5]. • Multiple annotations for a given task: The experience of music can be highly subjective, at which point the notion of “ground truth” becomes tenuous. Recent work in automatic chord estimation [8] shows that multiple reference annotations should be embraced, as they can provide important insight into system evaluation, as well as into the task itself. 1. 
INTRODUCTION Music annotations —the collection of observations made by one or more agents about an acoustic music signal— are an integral component of content-based Music Information Retrieval (MIR) methodology, and are necessary for designing, evaluating, and comparing computational systems. For clarity, we define the scope of an annotation as corresponding to time scales at or below the level of a complete song, such as semantic descriptors (tags) or time-aligned chords labels. Traditionally, the community has relied on plain text and custom conventions to serialize this data to a file for the purposes of storage and dissemination, collectively referred to as “lab-files”. Despite a lack of formal standards, lab-files have been, and continue to be, the preferred file format for a variety of MIR tasks, such as beat or onset estimation, chord estimation, or segmentation. ∗ Please • Multiple concepts for a given signal: Although systems are classically developed to accomplish a single task, there is ongoing discussion toward integrating information across various musical concepts [12]. This has already yielded measurable benefits for the joint estimation of chords and downbeats [9] or chords and segments [6], where leveraging multiple information sources for the same input signal can lead to improved performance. It has long been acknowledged that lab-files cannot be used to these ends, and various formats and technologies have been previously proposed to alleviate these issues, such as RDF [3], HDF5 [1], or XML [7]. However, none of these formats have been widely embraced by the community. We contend that the weak adoption of any alternative format is due to the combination of several factors. For example, new tools can be difficult, if not impossible, to integrate into a research workflow because of compatibility issues with a preferred development platform or programming environment. Additionally, it is a common criticism that the syntax or data model of these alternative formats is non-obvious, verbose, or otherwise confusing. This is especially problematic when researchers must handle for- direct correspondence to [email protected] c Eric J. Humphrey, Justin Salamon, Oriol Nieto, Jon Forsyth, Rachel M. Bittner, Juan P. Bello. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Eric J. Humphrey, Justin Salamon, Oriol Nieto, Jon Forsyth, Rachel M. Bittner, Juan P. Bello. “JAMS: A JSON Annotated Music Specification for Reproducible MIR Research”, 15th International Society for Music Information Retrieval Conference, 2014. 591 15th International Society for Music Information Retrieval Conference (ISMIR 2014) mat conversions. Taken together, the apparent benefits to conversion are outweighed by the tangible costs. In this paper, we propose a JSON Annotated Music Specification (JAMS) to meet the changing needs of the MIR community, based on three core design tenets: simplicity, structure, and sustainability. This is achieved by combining the advantages of lab-files with lessons learned from previously proposed formats. The resulting JAMS files are human-readable, easy to drop into existing workflows, and provide solutions to the research trends outlined previously. We further address classical barriers to adoption by providing tools for easy use with Python and MATLAB, and by offering an array of popular datasets as JAMS files online. 
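For readers unfamiliar with the lab-files discussed above, the sketch below shows the flavor of the format and a generic parser. The three-column chord layout and the file contents are illustrative assumptions (lab-file conventions vary by task and dataset), and the parser is a hedged sketch rather than any particular tool's implementation.

```python
# A typical chord lab-file is plain text with one observation per line,
# e.g. (illustrative content only):
#
#   0.000000   2.612267   N
#   2.612267  11.459070   B:maj
#   11.459070 12.921927   E:maj
#
# i.e., start time, end time, and a label separated by whitespace.

def parse_lab(path):
    """Parse a whitespace-delimited lab-file into (start, end, label) tuples,
    skipping blank lines and '#' comment lines."""
    observations = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            start, end, label = line.split(None, 2)
            observations.append((float(start), float(end), label))
    return observations
```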
The remainder of this paper is organized as follows: Section 2 identifies three valuable components of an annotation format by considering prior technologies; Section 3 formally introduces JAMS, detailing how it meets these design criteria and describing the proposed specification by example; Section 4 addresses practical issues and concerns in an informal FAQ-style, touching on usage tools, provided datasets, and some practical shortcomings; and lastly, we close with a discussion of next steps and perspectives for the future in Section 5. where X, Y, and Z correspond to “artist”, “album”, and “title”, respectively 1 ; parsing rules, such as “lines beginning with ‘#’ are to be ignored as comments”; auxiliary websites or articles, decoupled from the annotations themselves, to provide critical information such as syntax, conventions, or methodology. Alternative representations are able to manage more complex data via standardized markup and named entities, such as fields in the case of RDF or JSON, or IDs, attributes and tags for XML. 2. CORE DESIGN PRINCIPLES 3. INTRODUCING JAMS In order to craft an annotation format that might serve the community into the foreseeable future, it is worthwhile to consolidate the lessons learned from both the relative success of lab-files and the challenges faced by alternative formats into a set of principles that might guide our design. With this in mind, we offer that usability, and thus the likelihood of adoption, is a function of three criteria: So far, we have identified several goals for a music annotation format: a data structure that matches the document model; a lightweight markup syntax; support for multiple annotations, multiple tasks, and rich metadata; easy workflow integration; cross-language compliance; and the use of pre-existing technologies for stability. To find our answer, we need only to look to the web development community, who have already identified a technology that meets these requirements. JavaScript Object Notation (JSON) 2 has emerged as the serialization format of the Internet, now finding native support in almost every modern programming language. Notably, it was designed to be maximally efficient and human readable, and is capable of representing complex data structures with little overhead. JSON is, however, only a syntax, and it is necessary to define formal standards outlining how it should be used for a given purpose. To this end, we define a specification on top of JSON (JAMS), tailored to the needs of MIR researchers. 2.1 Simplicity The value of simplicity is demonstrated by lab-files in two specific ways. First, the contents are represented in a format that is intuitive, such that the document model clearly matches the data structure and is human-readable, i.e. uses a lightweight syntax. This is a particular criticism of RDF and XML, which can be verbose compared to plain text. Second, lab-files are conceptually easy to incorporate into research workflows. The choice of an alternative file format can be a significant hurdle if it is not widely supported, as is the case with RDF, or the data model of the document does not match the data model of the programming language, as with XML. 2.3 Sustainability Recently in MIR, a more concerted effort has been made toward sustainable research methods, which we see positively impacting annotations in two ways. 
First, there is considerable value to encoding methodology and metadata directly in an annotation, as doing so makes it easier to both support and maintain the annotation while also enabling direct analyses of this additional information. Additionally, it is unnecessary for the MIR community to develop every tool and utility ourselves; we should instead leverage well-supported technologies from larger communities when possible.

2.2 Structure

It is important to recognize that lab-files developed as a way to serialize tabular data (i.e. arrays) in a language-independent manner. Though lab-files excel at this particular use case, they lack the structure required to encode complex data such as hierarchies or mix different data types, such as scalars, strings, multidimensional arrays, etc. This is a known limitation, and the community has devised a variety of ad hoc strategies to cope with it: folder trees and naming conventions, such as "{X}/{Y}/{Z}.lab",

1 http://www.isophonics.net/content/reference-annotations
2 http://www.json.org/

3.1 A Walk-through Example

Perhaps the clearest way to introduce the JAMS specification is by example. Figure 1 provides the contents of a hypothetical JAMS file, consisting of nearly valid³ JSON syntax and color-coded by concept. JSON syntax will be familiar to those with a background in C-style languages, as it uses square brackets ("[ ]") to denote arrays (alternatively, lists or vectors), and curly brackets ("{ }") to denote objects (alternatively, dictionaries, structs, or hash maps). Defining some further conventions for the purpose of illustration, we use single quotes to indicate field names, italics when referring to concepts, and consistent colors for the same data structures. Using this diagram, we will now step through the hierarchy, referring back to relevant components as concepts are introduced.

3 The sole exception is the use of ellipses ("...") as continuation characters, indicating that more information could be included.

Figure 1. Diagram illustrating the structure of the JAMS specification. (The figure lays out the hypothetical JAMS file as color-coded, nearly valid JSON, with components labeled A–M: task arrays for 'tag', 'beat', 'chord', and 'melody', their annotations, data, and annotation_metadata — including curator and annotator — plus the top-level file_metadata, identifiers, and sandbox objects for "With a Little Help from My Friends" by The Beatles.)

3.1.1 The JAMS Object

A JAMS file consists of one top-level object, indicated by the outermost bounding box. This is the primary container for all information corresponding to a music signal, consisting of several task-array pairs, an object for file metadata, and an object for sandbox. A task-array is a list of annotations corresponding to a given task name, and may contain zero, one, or many annotations for that task. The format of each array is specific to the kind of annotations it will contain; we will address this in more detail in Section 3.1.2. The file metadata object (K) is a dictionary containing basic information about the music signal, or file, that was annotated. In addition to the fields given in the diagram, we also include an unconstrained identifiers object (L), for storing unique identifiers in various namespaces, such as the EchoNest or YouTube. Note that we purposely do not store information about the recording's audio encoding, as a JAMS file is format-agnostic. In other words, we assume that any sample rate or perceptual codec conversions will have no effect on the annotation, within a practical tolerance. Lastly, the JAMS object also contains a sandbox, an unconstrained object to be used as needed. In this way, the specification carves out such space for any unforeseen or otherwise relevant data; however, as the name implies, no guarantee is made as to the existence or consistency of this information. We do this in the hope that the specification will not be unnecessarily restrictive, and that commonly "sandboxed" information might become part of the specification in the future.

3.1.2 Annotations

An annotation (B) consists of all the information that is provided by a single annotator about a single task for a single music signal. Independent of the task, an annotation comprises three sub-components: an array of objects for data (C), an annotation metadata object (E), and an annotation-level sandbox. For clarity, a task-array (A) may contain multiple annotations (B). Importantly, a data array contains the primary annotation information, such as its chord sequence, beat locations, etc., and is the information that would normally be stored in a lab-file. Though all data containers are functionally equivalent, each may consist of only one object type, specific to the given task. Considering the different types of musical attributes annotated for MIR research, we divide them into four fundamental categories:

1. Attributes that exist as a single observation for the entire music signal, e.g. tags.
2. Attributes that consist of sparse events occurring at specific times, e.g. beats or onsets.
3. Attributes that span a certain time range, such as chords or sections.
4. Attributes that comprise a dense time series, such as discrete-time fundamental frequency values for melody extraction.

Table 1. Currently supported tasks and types in JAMS.
  observation: tag, genre, mood
  event:       beat, onset
  range:       chord, segment, key, note, source
  time series: melody, pitch, pattern
These four types form the most atomic data structures, and will be revisited in greater detail in Section 3.1.3. The important takeaway here, however, is that data arrays are not allowed to mix fundamental types. Following [10], an annotation metadata object is defined to encode information about what has been annotated, who created the annotations, with what tools, etc. Specifically, corpus provides the name of the dataset to which the annotation belongs; version tracks the version of this particular annotation; annotation rules describes the protocol followed during the annotation process; annotation tools describes the tools used to create the annotation; validation specifies to what extent the annotation was verified and is reliable; data source details how the annotation was obtained, such as manual annotations, online aggregation, game with a purpose, etc.; curator (F) is itself an object with two subfields, name and email, for the contact person responsible for the annotation; and annotator (G) is another unconstrained object, which is intended to capture information about the source of the annotation. While complete metadata are strongly encouraged in practice, currently only version and curator are mandatory in the specification. diagram, the value of time is a scalar quantity (0.237), whereas the value of label is a string (‘1’), indicating metrical position. A range (I) is useful for representing musical attributes that span an interval of time, such as chords or song segments (e.g. intro, verse, chorus). It is an object that consists of three observations: start, end, and label. The time series (J) atomic type is useful for representing musical attributes that are continuous in nature, such as fundamental frequency over time. It is an object composed of four elements: value, time, confidence and label. The first three fields are arrays of numerical values, while label is an observation. 3.2 The JAMS Schema 3.1.3 Datatypes Having progressed through the JAMS hierarchy, we now introduce the four atomic data structures, out of which an annotation can be constructed: observation, event, range and time series. For clarity, the data array (A) of a tag annotation is a list of observation objects; the data array of a beat annotation is a list of event objects; the data array of a chord annotation is a list of range objects; and the data array of a melody annotation is a list of time series objects. The current space of supported tasks is provided in Table 1. Of the four types, an observation (D) is the most atomic, and used to construct the other three. It is an object that has one primary field, value, and two optional fields, confidence and secondary value. The value and secondary value fields may take any simple primitive, such as a string, numerical value, or boolean, whereas the confidence field stores a numerical confidence estimate for the observation. A secondary value field is provided for flexibility in the event that an observation requires an additional level of specificity, as is the case in hierarchical segmentation [11]. An event (H) is useful for representing musical attributes that occur at sparse moments in time, such as beats or onsets. It is a container that holds two observations, time and label. Referring to the first beat annotation in the 594 The description in the previous sections provides a highlevel understanding of the proposed specification, but the only way to describe it without ambiguity is through formal representation. 
To accomplish this, we provide a JSON schema 4 , a specification itself written in JSON that uses a set of reserved keywords to define valid data structures. In addition to the expected contents of the JSON file, the schema can specify which fields are required, which are optional, and the type of each field (e.g. numeric, string, boolean, array or object). A JSON schema is concise, precise, and human readable. Having defined a proper JSON schema, an added benefit of JAMS is that a validator can verify whether or not a piece of JSON complies with a given schema. In this way, researchers working with JAMS files can easily and confidently test the integrity of a dataset. There are a number of JSON schema validator implementations freely available online in a variety of languages, including Python, Java, C, JavaScript, Perl, and more. The JAMS schema is included in the public software repository (cf. Section 4), which also provides a static URL to facilitate directly accessing the schema from the web within a workflow. 4. JAMS IN PRACTICE While we contend that the use and continued development of JAMS holds great potential for the many reasons outlined previously, we acknowledge that specifications and standards are myriad, and it can be difficult to ascertain the benefits or shortcomings of one’s options. In the interest of encouraging adoption and the larger discussion of 4 http://json-schema.org/ 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 4.4 What datasets are already JAMS-compliant? standards in the field, we would like to address practical concerns directly. To further lower the barrier to entry and simplify the process of integrating JAMS into a pre-existing workflow, we have collected some of the more popular datasets in the community and converted them to the JAMS format, linked via the public repository. The following is a partial list of converted datasets: Isophonics (beat, chord, key, segment); Billboard (chord); SALAMI (segment, pattern); RockCorpus (chord, key); tmc323 (chords); Cal500 (tag); Cal10k (tag); ADC04 (melody); and MIREX05 (melody). 4.1 How is this any different than X? The biggest advantage of JAMS is found in its capacity to consistently represent rich information with no additional effort from the parser and minimal markup overhead. Compared to XML or RDF, JSON parsers are extremely fast, which has contributed in no small part to its widespread adoption. These efficiency gains are coupled with the fact that JAMS makes it easier to manage large data collections by keeping all annotations for a given song in the same place. 4.5 Okay, but my data is in a different format – now what? We realize that it is impractical to convert every dataset to JAMS, and provide a collection of Python scripts that can be used to convert lab-files to JAMS. In lieu of direct interfaces, alternative formats can first be converted to labfiles and translated to JAMS thusly. 4.2 What kinds of things can I do with JAMS that I can’t already do with Y? JAMS can enable much richer evaluation by including multiple, possibly conflicting, reference annotations and directly embedding information about an annotation’s origin. A perfect example of this is found in the Rock Corpus Dataset [4], consisting of annotations by two expert musicians: one, a guitarist, and the other, a pianist. 
Sources of disagreement in the transcriptions often stem from differences of opinion resulting from familiarity with their principal instrument, where the voicing of a chord that makes sense on piano is impossible for a guitarist, and vice versa. Similarly, it is also easier to develop versatile MIR systems that combine information across tasks, as that information is naturally kept together. Another notable benefit of JAMS is that it can serve as a data representation for algorithm outputs for a variety of tasks. For example, JAMS could simplify MIREX submissions by keeping all machine predictions for a given team together as a single submission, streamlining evaluations, where the annotation sandbox and annotator metadata can be used to keep track of algorithm parameterizations. This enables the comparison of many references against many algorithmic outputs, potentially leading to a deeper insight into a system’s performance. 4.6 My MIR task doesn’t really fit with JAMS. 4.3 So how would this interface with my workflow? Thanks to the widespread adoption of JSON, the vast majority of languages already offer native JSON support. In most cases, this means it is possible to go from a JSON file to a programmatic data structure in your language of choice in a single line of code using tools you didn’t have to write. To make this experience even simpler, we additionally provide two software libraries, for Python and MATLAB. In both instances, a lightweight software wrapper is provided to enable a seamless experience with JAMS, allowing IDEs and interpreters to make use of autocomplete and syntax checking. Notably, this allows us to provide convenience functionality for creating, populating, and saving JAMS objects, for which examples and sample code are provided with the software library 5 . 5 https://github.com/urinieto/jams 595 That’s not a question, but it is a valid point and one worth discussing. While this first iteration of JAMS was designed to be maximally useful across a variety of tasks, there are two broad reasons why JAMS might not work for a given dataset or task. One, a JAMS annotation only considers information at the temporal granularity of a single audio file and smaller, independently of all other audio files in the world. Therefore, extrinsic relationships, such as cover songs or music similarity, won’t directly map to the specification because the concept is out of scope. The other, more interesting, scenario is that a given use case requires functionality we didn’t plan for and, as a result, JAMS doesn’t yet support. To be perfectly clear, the proposed specification is exactly that –a proposal– and one under active development. Born out of an internal need, this initial release focuses on tasks with which the authors are familiar, and we realize the difficulty in solving a global problem in a single iteration. As will be discussed in greater detail in the final section, the next phase on our roadmap is to solicit feedback and input from the community at large to assess and improve upon the specification. If you run into an issue, we would love to hear about your experience. 4.7 This sounds promising, but nothing’s perfect. There must be shortcomings. Indeed, there are two practical limits that should be mentioned. Firstly, JAMS is not designed for features or signal level statistics. That said, JSON is still a fantastic, crosslanguage syntax for serializing data, and may further serve a given workflow. 
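To make the workflow point of Section 4.3 concrete: because JAMS is plain JSON, any language's standard JSON support suffices to read it, and a JSON-schema validator can check integrity. The sketch below uses Python's json module and the third-party jsonschema package with a deliberately simplified, hypothetical schema fragment; it does not reproduce the official JAMS schema or the project's own library API.

```python
import json
import jsonschema

# A drastically simplified stand-in for the real JAMS schema, illustration only.
MINI_SCHEMA = {
    "type": "object",
    "required": ["file_metadata"],
    "properties": {
        "file_metadata": {
            "type": "object",
            "required": ["artist", "title", "duration"],
            "properties": {"duration": {"type": "number"}},
        },
        "beat": {"type": "array"},
    },
}

jam = {
    "file_metadata": {"artist": "The Beatles",
                      "title": "With a Little Help from My Friends",
                      "duration": 159.11},
    "beat": [{"data": [{"time": {"value": 0.237}, "label": {"value": "1"}}],
              "annotation_metadata": {"version": "0.0.1"},
              "sandbox": {}}],
    "sandbox": {},
}

jsonschema.validate(instance=jam, schema=MINI_SCHEMA)   # raises if invalid
text = json.dumps(jam, indent=2)                        # serialize for storage
assert json.loads(text)["beat"][0]["data"][0]["time"]["value"] == 0.237
```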
As for practical concerns, it is a known issue that parsing large JSON objects can be slow in MATLAB. We’ve worked to make this no worse than reading current lab-files, but speed and efficiency are not touted benefits of MATLAB. This may become a bigger issue as JAMS files become more complete over time, but we are 15th International Society for Music Information Retrieval Conference (ISMIR 2014) actively exploring various engineering solutions to address this concern. [3] Chris Cannam, Christian Landone, Mark B Sandler, and Juan Pablo Bello. The sonic visualiser: A visualisation platform for semantic descriptors from musical signals. In Proc. of the 7th International Society for Music Information Retrieval Conference, pages 324– 327, 2006. 5. DISCUSSION AND FUTURE PERSPECTIVES In this paper, we have proposed a JSON format for music annotations to address the evolving needs of the MIR community by keeping multiple annotations for multiple tasks alongside rich metadata in the same file. We do so in the hopes that the community can begin to easily leverage this depth of information, and take advantage of ubiquitous serialization technology (JSON) in a consistent manner across MIR. The format is designed to be intuitive and easy to integrate into existing workflows, and we provide software libraries and pre-converted datasets to lower barriers to entry. Beyond practical considerations, JAMS has potential to transform the way researchers approach and use music annotations. One of the more pressing issues facing the community at present is that of dataset curation and access. It is our hope that by associating multiple annotations for multiple tasks to an audio signal with retraceable metadata, such as identifiers or URLs, it might be easier to create freely available datasets with better coverage across tasks. Annotation tools could serve music content found freely on the Internet and upload this information to a common repository, ideally becoming something like a Freebase 6 for MIR. Furthermore, JAMS provides a mechanism to handle multiple concurrent perspectives, rather than forcing the notion of an objective truth. Finally, we recognize that any specification proposal is incomplete without an honest discussion of feasibility and adoption. The fact remains that JAMS arose from the combination of needs within our group and an observation of wider applicability. We have endeavored to make the specification maximally useful with minimal overhead, but appreciate that community standards require iteration and feedback. This current version is not intended to be the definitive answer, but rather a starting point from which the community can work toward a solution as a collective. Other professional communities, such as the IEEE, convene to discuss standards, and perhaps a similar process could become part of the ISMIR tradition as we continue to embrace the pursuit of reproducible research practices. 6. REFERENCES [1] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proc. of the 12th International Society for Music Information Retrieval Conference, pages 591–596, 2011. [4] Trevor De Clercq and David Temperley. A corpus analysis of rock harmony. Popular Music, 30(1):47–70, 2011. [5] W Bas de Haas and John Ashley Burgoyne. Parsing the billboard chord transcriptions. University of Utrecht, Tech. Rep, 2012. [6] Matthias Mauch, Katy Noland, and Simon Dixon. Using musical structure to enhance automatic chord transcription. In Proc. 
of the 10th International Society for Music Information Retrieval Conference, pages 231– 236, 2009. [7] Cory McKay, Rebecca Fiebrink, Daniel McEnnis, Beinan Li, and Ichiro Fujinaga. Ace: A framework for optimizing music classification. In Proc. of the 6th International Society for Music Information Retrieval Conference, pages 42–49, 2005. [8] Yizhao Ni, Matthew McVicar, Raul Santos-Rodriguez, and Tijl De Bie. Understanding effects of subjectivity in measuring chord estimation accuracy. Audio, Speech, and Language Processing, IEEE Transactions on, 21(12):2607–2615, 2013. [9] Hélène Papadopoulos and Geoffroy Peeters. Joint estimation of chords and downbeats from an audio signal. Audio, Speech, and Language Processing, IEEE Transactions on, 19(1):138–152, 2011. [10] G. Peeters and K. Fort. Towards a (better) definition of annotated MIR corpora. In Proc. of the 13th International Society for Music Information Retrieval Conference, pages 25–30, Porto, Portugal, Oct. 2012. [11] Jordan Bennett Louis Smith, John Ashley Burgoyne, Ichiro Fujinaga, David De Roure, and J Stephen Downie. Design and creation of a large-scale database of structural annotations. In Proc. of the 12th International Society for Music Information Retrieval Conference, pages 555–560, 2011. [12] Emmanuel Vincent, Stanislaw A Raczynski, Nobutaka Ono, Shigeki Sagayama, et al. A roadmap towards versatile mir. In Proc. of the 11th International Society for Music Information Retrieval Conference, pages 662– 664, 2010. [2] John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga. An expert ground truth set for audio chord recognition and music analysis. In Proc. of the 12th International Society for Music Information Retrieval Conference, pages 633–638, 2011. 6 http://www.freebase.com 596 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ON THE CHANGING REGULATIONS OF PRIVACY AND PERSONAL INFORMATION IN MIR Pierre Saurel Université Paris-Sorbonne Francis Rousseaux IRCAM Marc Danger ADAMI pierre.saurel @paris-sorbonne.fr francis.rousseaux @ircam.fr mdanger @adami.fr labs (for MIR teaching, for multicultural emotion comparisons, or for MIR user requirement purposes) the identification of legal issues becomes essential or strategic. Legal issues related to copyright and Intellectual Property have already been identified and expressed into Digital Rights Management by the MIR community [2], [7], when those related to security, business models and right to access have been expressed by Information Access [4], [11]. Privacy is another important legal issue. To address it properly one needs first to classify the personal data and processes. A naive classification appears when you quickly look at the kind of personal data MIR deals with: User’s comments, evaluation, annotation and music recommendations are obvious personal data as long as they are published under their name or pseudo; Addresses allowing identification of a device or an instrument and Media Access Control addresses are linked to personal data; Any information allowing identification of a natural person, as some MIR processes do, shall be qualified as personal data and processing of personal data. But the legal professionals do not unanimously approve this classification. For instance the Court of Appeal in Paris judged in two decisions (2007/04/27 and 2007/05/15) that the Internet Protocol address is not a personal data. 
ABSTRACT In recent years, MIR research has continued to focus more and more on user feedback, human subjects data, and other forms of personal information. Concurrently, the European Union has adopted new, stringent regulations to take effect in the coming years regarding how such information can be collected, stored and manipulated, with equally strict penalties for being found in violation of the law. Here, we provide a summary of these changes, consider how they relate to our data sources and research practices, and identify promising methodologies that may serve researchers well, both in order to be in compliance with the law and conduct more subject-friendly research. We additionally provide a case study of how such changes might affect a recent human subjects project on the topic of style, and conclude with a few recommendations for the near future. This paper is not intended to be legal advice: our personal legal interpretations are strictly mentioned for illustration purpose, and reader should seek proper legal counsel. 1. INTRODUCTION The International Society for Music Information Retrieval addresses a wide range of scientific, technical and social challenges, dealing with processing, searching, organizing and accessing music-related data and digital sounds through many aspects, considering real scale usecases and designing innovative applications, exceeding its academic-only initiatory aims. Some recent Music Information Retrieval tools and algorithms aim to attribute authorship and to characterize the structure of style, to reproduce the user’s style and to manipulate one’s style as a content [8], [1]. They deal for instance with active listening, authoring or personalised reflexive feedback. These tools will allow identification of users in the big data: authors, listeners, performers. As the emerging MIR scientific community leads to industrial applications of interest to the international business (start-up, Majors, content providers, platforms) and to experimentations involving many users in living 2. WHAT ARE PROCESSES OF PERSONAL DATA AND HOW THEY ARE REGULATED A careful consideration of the applicable law of personal data is necessary to elaborate a proper classification of MIR personal data processes taking the different international regulations into account. 2.1 Europe vs. United States: two legal approaches Europe regulates data protection through one of the highest State Regulations in the world [3], [9] when the United States lets contractors organize data protection through agreements supported by consideration and entered into voluntarily by the parties. These two approaches are deeply divergent. United States lets companies specify their own rules with their consumers while Europe enforces a unique regulated framework on all companies providing services to European citizens. For instance any company in the United States can define how long they keep the personal data, when the regulations in Europe would specify a maximum length of time the personal © Pierre Saurel, Francis Rousseaux, Marc Danger. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Pierre Saurel, Francis Rousseaux, Marc Danger. “On the Changing Regulations of Privacy and Personal Information in MIR”, 15th International Society for Music Information Retrieval Conference, 2014. 597 15th International Society for Music Information Retrieval Conference (ISMIR 2014) data is to be stored. And this applies to any company offering the same service. 
A prohibition is at the heart of the European Commission’s Directive on Data Protection (95/46/CE – The Directive) [3]. The transfer of personal data to nonEuropean Union countries that do not meet the European Union adequacy standard for privacy protection is strictly forbidden [3, article 25]1. The divergent legal approaches and this prohibition alone would outlaw the proposal by American companies of many of their IT services to European citizens. In response the U.S. Department of Commerce and the European Commission developed the Safe Harbor Framework (SHF) [6], [14]. Any nonEuropean organization is free to self-certify with the SHF and join. A new Proposal for a Regulation on the protection of individuals with regard to the processing of personal data was adopted the 12 March 2014 by the European Parliament [9]. The Directive allows adjustments from one European country to another and therefore diversity of implementation in Europe when the regulation is directly enforceable and should therefore be implemented directly and in the same way in all countries of the European Union. This regulation should apply in 2016. This regulation enhances data protection and sanctions to anyone who does not comply with the obligations laid down in the Regulation. For instance [9, article 79] the supervisory authority will impose, as a possible sanction, a fine of up to one hundred million Euros or up to 5% of the annual worldwide turnover in case of an enterprise. Complying with Safe Harbor is the easiest way for an organization using MIR processing to fulfill the high level European standard about personal data, to operate worldwide and to avoid prosecution regarding personal data. As explained below any non-European organization may enter the US – EU SHF’s requirement and publicly declare that they do so. In that case the organization must develop a data privacy policy that conforms to the seven Safe Harbor Principles (SHP) [14]. First of all organizations must identify personal data and personal data processes. Then they apply the SHP to these data and processes. By joining the SHF, organizations must implement procedures and modify their own information system whether paper or electronic. Organizations must notify (P1) individuals about the purposes for which they collect and use information about them, to whom the information can be disclosed and the choices and means offered for limiting its disclosure. Organizations must explain how they can be contacted with any complaints. Individuals should have the choice (P2) (opt out) whether their personal information is disclosed or not to a third party. In case of sensitive information explicit choice (opt in) must be given. A transfer to a third party (P3) is only possible if the individual made a choice and if the third party subscribed to the SHP or was subject to any adequacy finding regarding to the ED. Individuals must have access (P4) to personal information about them and be able to correct, amend or delete this information. Organizations must take reasonable precautions (P5) to prevent loss, misuse, disclosure, alteration or destruction of the personal information. Personal information collected must be relevant (P6: data integrity) for the purpose for which it is to be used. Sanctions (P7 enforcement) ensure compliance by the organization. There must be a procedure for verifying the implementation of the SHP and the obligation to remedy problems arising out of a failure to comply with the SHP. 
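For bookkeeping purposes, the seven principles can be treated as a simple checklist. The sketch below is purely illustrative: an internal audit aid written for this discussion, not part of the Safe Harbor Framework itself and not a statement of legal compliance.

    # Illustrative only: a minimal bookkeeping structure for tracking which of the
    # seven Safe Harbor Principles an organization has documented procedures for.
    SAFE_HARBOR_PRINCIPLES = {
        "P1": "Notice",
        "P2": "Choice (opt out; opt in for sensitive information)",
        "P3": "Onward transfer to third parties",
        "P4": "Access, correction, amendment and deletion",
        "P5": "Security precautions",
        "P6": "Data integrity (relevance to the purpose of use)",
        "P7": "Enforcement, verification and remediation",
    }

    def missing_principles(documented):
        """Return the principles for which no procedure has been documented yet."""
        return {code: name for code, name in SAFE_HARBOR_PRINCIPLES.items()
                if code not in documented}

    # Example: an organization that has so far covered only notice, access and security.
    print(missing_principles({"P1", "P4", "P5"}))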
2.2 Data protection applies to any information concerning an identifiable natural person Until French law applied the 95/46/CE European Directive, personal data was only defined considering sets of data containing the name of a natural person. This definition has been extended; the 95/46/CE European Directive (ED) defines ‘personal data’ [3, article 2] as: “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity”. For instance the identification of an author through the structure of his style as depending on his mental, cultural or social identity is a process that must comply with the European data privacy principles. 3. CLASSIFICATION FOR MIR PERSONAL DATA PROCESSING Considering the legal definition of personal data we can now propose a less naive classification of MIR processes and data into three sets: (i) nominative data, (ii) data leading to an easy identification of a natural person and (iii) data leading indirectly to the identification of a natural person through a complex process. 3.1 Nominative data and data leading easily to the identification of a natural person 2.3 Safe Harbor is the Framework ISMIR affiliates need not to pay a fine up to hundreds million Euros The first set of processes deals with all the situations giving the name of a natural person directly. The second set deals with the cases of a direct or an indirect identification easily done for instance through devices. In these two sets we find that the most obvious set of data concerns the “Personal Music Libraries” and “recommendations”. Looking at the topics that characterize 1 Argentina, Australia, Canada, State of Israel, New Zealand, United States – Transfer of Air Passenger Name Record (PNR) Data, United States – Safe Harbor, Eastern Republic of Uruguay are, to date, the only non-European third countries ensuring an adequate level of protection: http://ec.europa.eu/justice/data-protection/document/internationaltransfers/adequacy/index_en.htm 598 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ISMIR papers from year 2000 to 2013, we find more than 30 papers and posters dealing with those topics as their main topic. Can one recommend music to a user or analyze their personal library without tackling privacy? sonal data in case of a simple direct or indirect identification process. 4.1 Trends in terms of use and innovative technology Databases of personal data are no more clearly identified. We can view the situation as combining five aspects, which lead to new scientific problems concerning MIR personal data processing. Data Sources Explosion. The number of databases for retrieving information is growing dramatically. Applications are also data sources. Spotify for instance provides a live flow of music consumption information from millions of users. Data from billions of sensors will soon be added. This profusion of data does not mean quality. Accessible does not mean legal or acceptable for a user. Those considerations are essential to build reliable and sustainable systems. Crossing & Reconciling Data. Data sources are no longer isolated islands. Once the user can be identified (cookie, email, customer id), it is possible to match, aggregate and remix data that was previously isolated. 
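As a small illustration of the reconciliation risk described in the previous paragraph, the sketch below joins two hypothetical data sources on a shared customer identifier; all field names and values are invented for the example.

    # Two hypothetical data sources that look harmless in isolation: listening
    # events keyed by a customer id, and a CRM export keyed by the same id.
    listening_log = [
        {"customer_id": "c42", "track": "Track A", "city": "Lyon"},
        {"customer_id": "c42", "track": "Track B", "city": "Lyon"},
    ]
    crm_export = [
        {"customer_id": "c42", "email": "[email protected]", "name": "J. Doe"},
    ]

    # Once a shared identifier (cookie, email, customer id) exists, the sources can
    # be matched, aggregated and remixed into a profile of an identifiable person.
    crm_by_id = {row["customer_id"]: row for row in crm_export}
    profiles = {}
    for event in listening_log:
        person = crm_by_id.get(event["customer_id"])
        if person is None:
            continue
        profile = profiles.setdefault(person["name"], {"email": person["email"], "tracks": []})
        profile["tracks"].append(event["track"])

    print(profiles)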
Time Dimension. The web has a good memory that humans are generally not familiar with. Data can be public one day and be considered as very private 3 years later. Many users forget they posted a picture after a student party. And the picture has the misfortune to crop up again when you apply for a job. And it is not only a question of human memory: Minute traces collected one day can be exploited later and provide real information. Permanent Changes. The general instability of the data sources, technical formats and flows, applications and use is another strong characteristic of the situation. The impact on personal data is very likely. If the architecture of the systems changes a lot and frequently, the social norms also change. Users today publicly share information that they would have considered totally private a few years earlier. And the opposite could be the case. User Understandability and Control. Because of the complexity of changing systems and complex interactions users will less and less control over their information. This lack of control is caused by the characteristics of the systems and by the mistakes and the misunderstandings of human users. The affair of the private Facebook messages appearing suddenly on timeline (Sept. 2012) is significant. Facebook indicates that there was no bug. Those messages were old wall posts that are now more visible with the new interface. This is a combination of bad user understanding and fast moving systems. 3.2 Data leading to the identification of a natural person through a complex process The third set of personal data deals with cases when a natural person is indirectly identifiable using a complex process, like some of the MIR processes. Can one work on “Classification” or “Learning”, producing 130 publications (accepted contributions at ISMIR from year 2000 to year 2013) without considering users throughout their tastes or style? The processes used under these headings belong for the most part to this third set. Looking directly at the data without any sophisticated tool does not allow any identification of the natural person. On the contrary, using some MIR algorithms or machine learning can lead to indirect identifications [12]. Most of the time these non-linear methods use inputs to build new data which are outputs or data stored inside the algorithm, like weights for instance in a neural net. 3.3 The legal criteria of the costs and the amount of time required for identification This third set of personal data is not as homogeneous as it seems to be at first glance. Can we compare sets of data that lead to an identification of a natural person through a complex process? The European Proposal for a Regulation specifies the concept of “identifiability”. It tries to define legal criteria to decide if an identifiable set of data is or is not personal data. It considers the identification process [9, recital 23] as a relative one depending on the means used for that identification: “To determine whether a person is identifiable, account should be taken of all the means reasonably likely to be used either by the controller or by any other person to identify or single out the individual directly or indirectly. 
To ascertain whether means are reasonably likely to be used to identify the individual, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration both available technology at the time of the processing and technological development.” But under what criteria should we, as MIR practitioners, specify when a set of data allows an easy identification and belongs to the second set or, on the contrary, is too complex or reaches a too uncertain identification so that we would not legally say that these are personal data? To answer these questions, we must be able to compare MIR processes with new criteria. 4.2 The case of an Apache Hadoop File System (AHFS) on which some machine learning is applied Everyone produces data and personal data without being always aware that they provide data revealing their identification. When a user tags / rates musical items [13], he gives personal information. If a music recommender ex- 4. MANAGING THE TWO FIRST SETS On an example chosen to be problematic (but increasingly common in the industry), we show how to manage per- 599 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ploits this user data without integrating privacy concepts, he faces legal issues and strong discontent from the users. The data volume has increased faster than “Moore’s law”: This is what is meant by “Big Data”. New data is generally unstructured and traditional database systems such as Relational Database Management Systems cannot handle the volume of data produced by users & machines & sensors. This challenge was the main drive for Google to define a new technology: the Apache Hadoop File System (AHFS). Within this framework, data and computational activities are distributed on a very large number of servers. Data is not loaded for computation, nor the results stored. Here, the algorithm is close to the data. This situation leads to the epistemological problem of separability into the field of MIR personal data processing: are all MIR algorithms (and for instance the authorship attribution algorithms) separable into data and processes? An answer to this question is required for any algorithm to be able to identify the set of personal data it deals with. Now, let us consider a machine learning classifier/recommender trained on user data. In this sense, the algorithm is inseparable from the data it uses to function. And, if the machine is internalizing identifiable information from a set of users in a certain state (let say EU), it is then in violation to share the resulting function in a non-adequate country (let say Brazil) the EU if it was trained in, say, the US. solution has gained widespread international recognition, and was recently recognized as a global privacy standard. According to its Canadian inventor 1, is PbD based on seven Foundation Principles (FP): PbD “is an approach to protect privacy by embedding it into the design specifications of technologies, business practices, and physical infrastructures. That means building in privacy up front – right into the design specifications and architecture of new systems and processes. PbD is predicated on the idea that, at the outset, technology is inherently neutral. As much as it can be used to chip away at privacy, it can also be enlisted to protect privacy. 
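To make the point that such a trained model can be inseparable from its training data, here is a minimal sketch in plain NumPy (not any system discussed in the paper) of a nearest-neighbour recommender whose fitted state literally contains the users' raw feature vectors; shipping the trained object to another jurisdiction therefore ships the data along with it. All example data is invented.

    import numpy as np

    class NearestNeighbourRecommender:
        """Toy model: recommends the favourite item of the most similar user."""

        def fit(self, user_features, favourite_items):
            # The "trained model" is essentially a copy of the training data:
            # the user feature vectors are stored verbatim inside the object.
            self.user_features = np.asarray(user_features, dtype=float)
            self.favourite_items = list(favourite_items)
            return self

        def recommend(self, query_features):
            distances = np.linalg.norm(
                self.user_features - np.asarray(query_features, dtype=float), axis=1)
            return self.favourite_items[int(np.argmin(distances))]

    # Invented per-user listening statistics, e.g. collected from EU users.
    model = NearestNeighbourRecommender().fit(
        user_features=[[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
        favourite_items=["artist_A", "artist_B"],
    )
    print(model.recommend([0.8, 0.2, 0.0]))
    # Transferring `model` elsewhere transfers `model.user_features` with it.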
The same is true of processes and physical infrastructure”: Proactive not Reactive (FP1): the PbD approach is based on proactive measures anticipating and preventing privacy invasive events before they occur; Privacy as the Default Setting (FP2): the default rules seek to deliver the maximum degree of privacy; Privacy embedded into Design (FP3): Privacy is embedded into the architecture of IT systems and business practices; Full Functionality – Positive Sum, not Zero-Sum (FP4): PbD seeks to accommodate all legitimate interests and objectives (security, etc.) in a “win-win” manner; End-to-End Security – Full Lifecycle Protection (FP5): security measures are essential to privacy, from start to finish; Visibility and Transparency — Keep it Open (FP6): PbD is subject to independent verification. Its component parts and operations remain visible and transparent, to users and providers alike; Respect for User Privacy — Keep it User-Centric (FP7): PbD requires architects and operators to keep the interests of the individual uppermost. At the time of digital data exchange through networks, PbD is a key-concept in legacy [10]. In Europe, where this domain has been directly inspired by the Canadian experience, the EU 2 affirms: “PbD means that privacy and data protection are embedded throughout the entire life cycle of technologies, from the early design stage to their deployment, use and ultimate disposal”. 4.3 Analyzing the multinational AHFS case Regarding to the European regulation rules [3, art. 25], you may not transfer personal data collected in Europe to a non-adequate State (see list of adequate countries above). If you build a multinational AHFS system, you may collect data in Europe and in US depending on the way you localized the AHFS servers. The European data may not be transferred to Brazil. Even the classifier would not legally be used in Brazil as long as it internalizes some identifiable European personal information. In practice one should then localize the AHFS files and machine-learning processes to make sure no identifiable data will be transferred from one country with a specific regulation to another with another regulation about personal data. We call these systems “heterarchical” due to the blended situation of a hierarchical system (the global AHFS management) and the need of a heterogeneous local regulation. To manage properly the global AHFS system we need a first analysis of the system dispatching the different files on the right legal places. Privacy by Design (PbD) is a useful methodology to do so. 4.5 Prospects for a MIR Privacy by Design PbD is a reference for designing systems and processing involving personal data, enforced by the new European proposal for a Regulation [9, art. 23]. It becomes a method for these designs whereby it includes signal analysis methods and may interest MIR developers. This proposal leads to new questions, such as the following: Is PbD a universal methodological solution about personal data for all MIR projects? Most of ISMIR contributions are still research oriented which doesn’t mean 4.4 Foundations Principals of Privacy by Design PbD was first developed by Ontario’s Information and Privacy Commissioner, Dr. Ann Cavoukian, in the 1990s, at the very birth of the future big data phenomenon. This 1 http://www.ipc.on.ca/images/Resources/7foundationalprinciples.pdf “Safeguarding Privacy in a Connected World – A European Data Protection Framework for the 21st Century” COM (2012) 9 final. 
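As a schematic illustration of the "dispatch the files to the right legal places" idea of Section 4.3, the sketch below routes records to per-jurisdiction stores before any processing and checks transfers against an adequacy list. The country codes and the adequacy list (a few EU members plus some of the adequate countries named in the earlier footnote) are illustrative placeholders, not legal guidance.

    # Illustrative only: keep one logical store (e.g. an HDFS namespace) per legal
    # zone and refuse exports of EU-collected data to non-adequate destinations.
    EU = {"FR", "DE", "AT", "NL"}                         # illustrative EU subset
    ADEQUATE_FOR_EU_DATA = EU | {"CA", "IL", "NZ", "UY"}  # illustrative adequacy list

    stores = {}

    def legal_zone(record):
        return "EU" if record["collected_in"] in EU else record["collected_in"]

    def ingest(record):
        stores.setdefault(legal_zone(record), []).append(record)

    def can_export(zone, destination_country):
        if zone != "EU":
            return True  # other zones are governed by their own (contractual) rules
        return destination_country in ADEQUATE_FOR_EU_DATA

    ingest({"collected_in": "FR", "user_id": "u1", "plays": 12})
    ingest({"collected_in": "US", "user_id": "u2", "plays": 7})
    print(sorted(stores))            # ['EU', 'US']: data stays in its zone by default
    print(can_export("EU", "BR"))    # False: no transfer of EU data to Brazil
    print(can_export("EU", "CA"))    # True under this illustrative adequacy list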
2 600 15th International Society for Music Information Retrieval Conference (ISMIR 2014) that they fulfill the two specific exceptions [9, art. 83]1. To say more about that intersection, we need to survey the ISMIR scientific production, throughout the main FPs. FP6 (transparency) and FP7 (user-centric) are usually respected among the MIR community as source code and processing are often (i) delivered under GNU like licensing allowing audit and traceability (ii) user-friendly. However, as long as PbD is not embedded, FP3 cannot be fulfilled and accordingly FP2 (default setting), FP5 (endto-end), FP4 (full functionality) and FP1 (proactive) cannot be fulfilled even. Without any PbD embedded into Design, there are no default settings (FP2), you cannot follow an end-to-end approach (FP5), you cannot define full functionality regarding to personal data (FP4) nor be proactive. Principle of pro-activity (FP1) is the key. Fulfilling FP1 you define the default settings (FP2), be fully functional (FP4) and define an end-to-end process (FP5). In brief is PbD useful to MIR developers even if it is not the definitive martingale! This situation leads to a new scientific problem: Is there an absolute criterion about the identifiability of personal data extracted from a set of data with a MIR process? What characterizes a maximal subset from the big data that could not ever be computed by any Turing machine to identify a natural person with any algorithm? 5.2 What about the foundational separation in computer science between data and process? Computer science is based on a strict separation between data and process (dual as these two categories are interchangeable at any time; data can be activated as a process and a process can be treated as a data). We may wonder about the possibility of maintaining the data/process separation paradigm if i) the data stick to the process and ii) the legal regulation leads to a location of the data in the legal system in which those data were produced. 6. CONCLUSION 5. EXPLORING THE THIRD SET 6.1 When some process lead to direct or indirect personal data identification “Identifiability” is the potentiality of a set of data to lead to the identification of its source. A set of data should be qualified as being personal data if the cost and the amount of time required for identification are reasonable. These new criteria are a step forward since the qualification is not an absolute one anymore and depends specifically on the state of the art. Methodological Recommendations. MIR researchers could first audit their algorithm and data, and check if they are able to identify a natural person (two first sets of our classification). If so they could use the SHF which could already be an industrial challenge for instance regarding Cyber Security (P5). Using the PbD methodology certainly leads to operational solutions in these situations. 5.1 Available technology and technological development to take into account at this present moment 6.2 When some process may lead to indirect personal data identification through some complex process Changes in Information Technology lead to a shift in the approach of data management: from computational to data exploration. The main question is “What to look for?” Many companies build new tools to “make the data speak”. This is the case considering the trend of personalized marketing. Engineers using big data build systems that produce new personal dataflow. Is it possible to stabilize these changes through standardization of metadata? 
Is it possible to develop a standardization of metadata which could ease the classification of MIR processing of personal data into identifying and non-identifying processes. Many of the MIR methods are stochastic, probabilistic or designed to cost and more generally non-deterministic. On the contrary the European legal criteria [9, recital 23] (see above § 3.3) to decide whether a data is personal or not (the third set) seem to be much to deterministic to fit the effective new practices about machine learning on personal data. In many circumstances, the MIR community develops new personal data on the fly, using the whole available range of data analysis and data building algorithm. Then researchers could apply the PbD methodology, to insure that no personal data is lost during the system design. Here PbD is not a universal solution because the time when data (on the one hand) and processing (on the other hand) were functionally independent, formally and semantically separated, has ended. Nowadays, MIR researchers currently use algorithms that support effective decision, supervised or not, without introducing ‘pure’ data or ‘pure’ processing, but building up acceptable solutions together with machine learning [5] or heuristic knowledge that cannot be reduced to data or processing: The third set of personal data may appear, and raise theoretical scientific problems. Political Opportunities. The MIR community has a political role to play in the data privacy domain, by explaining to lawyers —joining expert groups in the US, UE or elsewhere— what we are doing and how we overlap with the tradition in style description, turning it into a computed style genetic, which radically questions the analysis of data privacy traditions, cultures and tools. 1 (i) these processing cannot be fulfilled otherwise and (ii) data permitting the identification are kept separately from the other information, or when the bodies conducting these data respect three conditions: (i) consent of the data subject, (ii) publication of personal data is necessary and (iii) data are made public 601 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Future Scientific Works. In addition to methodological and political ones, we face purely scientific challenges, which constitute our research program for future works. Under what criteria should we, as MIR practitioners, specify when a set of data allows an easy identification and belongs to the second set or on the contrary is too complex or allows a too uncertain identification so that we would say that these are not personal data? What characterizes a maximal subset from the big data that could not ever be computed by any Turing machine to identify a natural person with any algorithm? [10] V. Reding: “The European Data Protection Framework for the Twenty-first century”, International Data Privacy Law, volume 2, issue 3, pp.119-129, 2012. [11] A. Seeger: “I Found It, How Can I Use It? - Dealing With the Ethical and Legal Constraints of Information Access”, Proceedings of the International Symposium on Music Information Retrieval, 2003. [12] A.B. Slavkovic, A. Smith: “Special Issue on Statistical and Learning-Theoretic Challenges in Data Privacy”, Journal of Privacy and Confidentiality, Vol. 4, Issue 1, pp. 1-243, 2012. 7. REFERENCES [1] S. Argamon, K. Burns, S. Dubnov (Eds): The Structure of Style, Springer-Verlag, 2010. [13] P. Symeonidis, M. Ruxanda, A. Nanopoulos, Y. 
Manolopoulos: “Ternary Semantic Analysis of Social Tags for Personalized Music Recommendation”, Proceedings of the International Symposium on Music Information Retrieval, 2008.

[2] C. Barlas: “Beating Babel - Identification, Metadata and Rights”, Invited Talk, Proceedings of the International Symposium on Music Information Retrieval, 2002.

[14] U.S. – EU Safe Harbor: http://www.export.gov/safeharbor/eu/eg_main_018365.asp

[3] Directive (95/46/EC) of 24 October 1995, Official Journal L 281, 23/11/1995, P. 0031-0050: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:en:HTML

[4] J.S. Downie, J. Futrelle, D. Tcheng: “The International Music Information Retrieval Systems Evaluation Laboratory: Governance, Access and Security”, Proceedings of the International Symposium on Music Information Retrieval, 2004.

[5] A. Gkoulalas-Divanis, Y. Saygin, Vassilios S. Verykios: “Special Issue on Privacy and Security Issues in Data Mining and Machine Learning”, Transactions on Data Privacy, Vol. 4, Issue 3, pp. 127-187, December 2011.

[6] D. Greer: “Safe Harbor - A Framework that Works”, International Data Privacy Law, Vol. 1, Issue 3, pp. 143-148, 2011.

[7] M. Levering: “Intellectual Property Rights in Musical Works: Overview, Digital Library Issues and Related Initiatives”, Invited Talk, Proceedings of the International Symposium on Music Information Retrieval, 2000.

[8] F. Pachet, P. Roy: “Hit Song Science is Not Yet a Science”, Proceedings of the International Symposium on Music Information Retrieval, 2008.

[9] Proposal for a Regulation on the protection of individuals with regard to the processing of personal data, adopted on 12 March 2014 by the European Parliament: http://www.europarl.europa.eu/sides/getDoc.do?type=TA&reference=P7-TA-2014-0212&language=EN

A MULTI-MODEL APPROACH TO BEAT TRACKING CONSIDERING HETEROGENEOUS MUSIC STYLES

Sebastian Böck, Florian Krebs and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University, Linz, Austria
[email protected]

ABSTRACT

this task. Most systems then determine the most predominant tempo from these periodicities and subsequently determine the beat times using multiple-agent approaches [8, 12], dynamic programming [6, 10], hidden Markov models (HMM) [7, 16, 18], or recurrent neural networks (RNN) [2]. Other systems operate directly on the input features and jointly determine the tempo and phase of the beats using dynamic Bayesian networks (DBN) [3, 14, 17, 21]. One of the most common problems of beat tracking systems is “octave errors”, meaning that a system detects beats at double or half the rate of the ground truth tempo. For human tappers this generally does not constitute a problem, as can be seen when comparing beat tracking results at different metrical levels [6]. Hainsworth and Macleod stated that beat tracking systems will have to be style specific in the future in order to improve the state-of-the-art [14]. This is consistent with the finding of Krebs et al. [17], who showed on a dataset of Ballroom music that the beat tracking performance can be improved by incorporating style-specific knowledge, especially by resolving the octave error. While approaches have been proposed which combined multiple existing features for beat tracking [22], no one has so far combined several models specialised on different musical styles to improve the overall performance.
In this paper, we propose a multi-model approach to fuse information of different models that have been specialised on heterogeneous music styles. The model is based on the recurrent neural network (RNN) beat tracking system proposed in [2] and can be easily adapted to any music style without further parameter tweaking, only by providing a corresponding beat-annotated dataset. Further, we propose an additional dynamic Bayesian network stage based on the work of Whiteley et al. [21] which jointly infers the tempo and the beat phase from the beat activations of the RNN stage. In this paper we present a new beat tracking algorithm which extends an existing state-of-the-art system with a multi-model approach to represent different music styles. The system uses multiple recurrent neural networks, which are specialised on certain musical styles, to estimate possible beat positions. It chooses the model with the most appropriate beat activation function for the input signal and jointly models the tempo and phase of the beats from this activation function with a dynamic Bayesian network. We test our system on three big datasets of various styles and report performance gains of up to 27% over existing stateof-the-art methods. Under certain conditions the system is able to match even human tapping performance. 1. INTRODUCTION AND RELATED WORK The automatic inference of the metrical structure in music is a fundamental problem in the music information retrieval field. In this line, beat tracking deals with finding the most salient level of this metrical grid, the beat. The beat consists of a sequence of regular time instants which usually invokes human reactions like foot tapping. During the last years, beat tracking algorithms have considerably improved in performance. But still they are far from being considered on par with human beat tracking abilities – especially for music styles which do not have simple metrical and rhythmic structures. Most methods for beat tracking extract some features from the audio signal as a first step. As features, commonly low-level features such as amplitude envelopes [20] or spectral features [2], mid-level features like onsets either in discretised [8,12] or continuous form [6,10,16,18], chord changes [12,18] or combinations thereof with higher level features such as rhythmic patterns [17] or metrical relations [11] are used. The feature extraction is usually followed by a stage that determines periodicities within the extracted features sequences. Autocorrelation [2, 9, 12] and comb filters [6, 20] are commonly used techniques for 2. PROPOSED METHOD The new beat tracking algorithm is based on the state-ofthe-art approach presented by Böck and Schedl in [2]. We extend their system to be able to better deal with heterogeneous music styles and combine it with a dynamic Bayesian network similar to the ones presented in [21] and [17]. The basic structure is depicted in Figure 1 and consists of the following elements: first the audio signal is preprocessed and fed into multiple neural network beat track- c Sebastian Böck, Florian Krebs and Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sebastian Böck, Florian Krebs and Gerhard Widmer. “A Multi-Model Approach to Beat Tracking Considering Heterogeneous Music Styles”, 15th International Society for Music Information Retrieval Conference, 2014. 
603 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 2.2 Multiple parallel neural networks ing modules. Each of the modules is trained on different audio material and outputs a different beat activation function when activated with a musical signal. These functions are then fed into a module which chooses the most appropriate model and passes its activation function to a dynamic Bayesian network to infer the actual beat positions. At the core of the new approach, multiple neural networks are used to determine possible beat locations in the audio signal. As outlined previously, these networks are trained on material with different music styles to be able to better detect the beats in heterogeneous music styles. As networks we chose the same recurrent neural network (RNN) topology as in [2] with three bidirectional hidden layers with 25 long short-term memory (LSTM) units per layer. For training of the networks, standard gradient descent with error backpropagation and a learning rate of 1e−4 is used. We initialise the network weights with a Gaussian distribution with mean 0 and standard deviation of 0.1. We use early stopping with a disjoint validation set to stop training if no improvement over 20 epochs can be observed. Reference Network Model 1 Signal Preprocessing Model 2 Model Switcher Dynamic Bayesian Network Beats • • • Model N Figure 1. Overview of the new multi-model beat tracking system. One reference network is trained on the complete dataset until the stopping criterion is reached for the first time. We use this point during the training phase to diverge the specialised models from the reference network. Theoretically, a single network large enough should be able to model all the different music styles simultaneously, but unfortunately this optimal solution is hardly achievable. The main reason for this is the difficulty to choose an absolutely balanced training set with an evenly distributed set of beats over all the different dimensions relevant for detecting beats. These include rhythmic patterns [17, 20], harmonic aspects and many other features. To overcome this limitation, we split the available training data into multiple parts. Each part should represent a more homogeneous subset than the whole set so that the networks are able to specialise on the dominant aspects of this subset. It seems reasonable to assume that humans do something similar when tracking beats [4]. Depending on the style of the music, the rhythmic patterns present, the instrumentation, the timbre, they apply their musical knowledge to chose one of their “learned” models and then decide which musical events are beats or not. Our approach mimics this behaviour by learning multiple distinct models. Afterwards, all networks are fine-tuned with a reduced learning rate of 1e−5 on either the complete set or the individual subsets (cf. Section 3.1) with the above mentioned stopping criterion. Given N subsets, N + 1 models are generated. The output functions of the network models represent the beat probability at each time frame. Instead of tracking the beats with an autocorrelation function as described in the original work, the beat activation functions of the different models are fed into the next model-selection stage. 2.3 Model selection The purpose of this stage is to select a model which outputs a better beat activation function than the reference model when activated with a signal. 
Compared to the reference model, the specialised models produce better predictions on input data which is similar to that used for fine-tuning, but worse predictions on signals dissimilar to the training data. This behaviour can be seen in Figure 2, where the specialised model produces higher beat activation values at the beat locations and lower values elsewhere.

Table 1 illustrates the impact on the Ballroom subset, where the relative gain of the best specialised model compared to the reference model (+1.7%) is lower than the penalties of the other models (−2.3% to −6.3%). The fact that the performance degradation of the unsuitable specialised models is greater than the gain of the most suitable model allows us to use a very simple but effective method to choose the best model.

2.1 Signal pre-processing

All neural networks share the same signal pre-processing step, which is very similar to the work in [2]. As inputs to the different neural networks, the logarithmically filtered and scaled spectrograms of three parallel Short Time Fourier Transforms (STFT) obtained for different window lengths and their positive first order differences are used. The system works with a constant frame rate fr of 100 frames per second. Window lengths of 23.2 ms, 46.4 ms and 92.9 ms are used and the resulting spectrogram bins of the discrete Fourier transforms are filtered with overlapping triangular filters to have a frequency resolution of three bands per octave. To put all resulting magnitude values into a positive range we add 1 before taking the logarithm.

To select the best performing model, all network outputs of the fine-tuned networks are compared with the output of the reference network (which was trained on the whole training set) and the one yielding the lowest mean squared difference is selected as the final one and its output is fed into the final beat tracking stage.

The DBN we use is closely related to the one proposed in [21], adapted to our specific needs. Instead of modelling whole bars, we only model one beat period which reduces the size of the search space. Additionally we do not model rhythmic patterns explicitly and leave this higher level analysis to the neural networks. This finally leads to a DBN which consists of two hidden variables, the tempo ω and the position φ inside a beat period. In order to infer the hidden variables from an audio signal, we have to specify three entities: a transition model which describes the transitions between the hidden variables, an observation model which takes the beat activations from the neural network and transforms them into probabilities suitable for the DBN, and the initial distribution which encodes prior knowledge about the hidden variables. For computational ease we discretise the tempo-beat space to be able to use standard hidden Markov model (HMM) [19] algorithms for inference.

Figure 2. Example beat activations for a 4 seconds ballroom snippet. Red is the reference network's activations, black the selected model and blue a discarded one. Green dashed vertical lines denote the annotated beat positions.

                  F-measure   Cemgil   AMLc    AMLt
    SMC *           0.834     0.807    0.664   0.767
    Hainsworth *    0.867     0.839    0.694   0.793
    Ballroom *      0.904     0.872    0.777   0.853
    Reference       0.887     0.855    0.748   0.831
    Multi-model     0.897     0.866    0.759   0.841

Table 1. Performance of differently specialised models (marked with asterisks, fine-tuned on the SMC, Hainsworth and Ballroom subsets) on the Ballroom subset compared to the reference model and the network selected by the multi-model selection stage.

2.4.1 Transition model

The beat period is discretised into Φ = 640 equidistant cells and φ ∈ {1, ..., Φ}.
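The selection rule just described (pick the fine-tuned network whose activation function deviates least, in the mean-squared sense, from the reference network's output) can be stated in a few lines. The sketch below assumes the activation functions are already available as arrays of frame-wise beat probabilities and is a simplified illustration, not the authors' code.

    import numpy as np

    def select_model(reference_activation, candidate_activations):
        """Return the index of the candidate activation function with the lowest
        mean squared difference to the reference network's output."""
        reference = np.asarray(reference_activation, dtype=float)
        errors = [np.mean((np.asarray(c, dtype=float) - reference) ** 2)
                  for c in candidate_activations]
        return int(np.argmin(errors))

    # Invented toy activations (frame-wise beat probabilities in [0, 1]).
    reference = [0.1, 0.8, 0.2, 0.7, 0.1]
    candidates = [
        [0.0, 0.9, 0.1, 0.8, 0.0],   # close to the reference -> selected
        [0.5, 0.5, 0.5, 0.5, 0.5],   # dissimilar to the reference -> rejected
    ]
    print("selected model:", select_model(reference, candidates))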
We refer to the unit of the variable φ (position inside a beat period) as pib. φk at audio frame k is then computed by

φk = (φk−1 + ωk−1 − 1) mod Φ + 1.   (1)

The tempo space is discretised into Ω = 23 equidistant cells, which cover the tempo range up to 215 beats per minute (BPM). The unit of the tempo variable ω is pib per audio frame. As we want to restrict ω to integer values (to stay within the φ grid at transitions), we need a high resolution of φ in order to get a high resolution of ω. Based on experiments with the training set, we set the tempo space to ω ∈ {6, ..., Ω}, where ω = 6 is equivalent to a minimum tempo of 6 × 60 × fr /Φ ≈ 56 BPM. As in [21] we only allow for three tempo transitions at time frame k: it stays constant, it accelerates, or it decelerates:

\omega_k = \begin{cases} \omega_{k-1} & \text{with } P(\omega_k \mid \omega_{k-1}) = 1 - p_\omega \\ \omega_{k-1} + 1 & \text{with } P(\omega_k \mid \omega_{k-1}) = p_\omega/2 \\ \omega_{k-1} - 1 & \text{with } P(\omega_k \mid \omega_{k-1}) = p_\omega/2 \end{cases} \quad (2)

Transitions to tempi outside of the allowed range are not allowed by setting the corresponding transition probabilities to zero. The probability of a tempo change pω was set to 0.002.

2.4 Dynamic Bayesian network

Independent of whether only one or multiple neural networks are used, the approach of Böck and Schedl [2] has a fundamental shortcoming: the final peak-picking stage does not try to find a global optimum when selecting the final locations of the beats. It rather determines the dominant tempo of the piece (or a segment of certain length) and then aligns the beat positions according to this tempo by simply choosing the best start position and then progressively locating the beats at positions with the highest activation function values in a certain region around the pre-determined position. To allow a greater responsiveness to tempo changes, this chosen region must not be too small. However, this also introduces a weakness to the algorithm, because the tracking stage can easily get distracted by a few misaligned beats and needs some time to recover from this fault. The activation function depicted in Figure 2 has two of these spurious detections around frames 100 and 200. To circumvent this problem, we feed the output of the chosen neural network model into a dynamic Bayesian network (DBN) which jointly infers tempo and phase of a beat sequence. Another advantage of this new method is that we are able to model both beat and non-beat states, which was shown to perform superior to the case where only beat states are modelled [7].

2.4.2 Observation model

Since the beat activation function a produced by the neural network is limited to the range [0, 1] and shows high values at beat positions and low values at non-beat positions, we use the activation function directly as state-conditional observation distributions (similar to [7]). We define the observation likelihood as

P(a_k \mid \phi_k) = \begin{cases} a_k & \text{if } 1 \le \phi_k \le \Phi/\lambda \\ (1 - a_k)/(\lambda - 1) & \text{otherwise} \end{cases} \quad (3)

λ ∈ [Φ/(Φ−1), Φ] is a parameter that controls the proportion of the beat interval which is considered as beat and non-beat location.

3.2 Performance measures
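To show how Equations (1)–(3) translate into concrete numbers for the discretised (φ, ω) state space, here is a small NumPy sketch. It mirrors the definitions above with the stated parameter values, but it is a simplified re-implementation for illustration, not the authors' code.

    import numpy as np

    PHI = 640                    # discrete positions inside one beat period (pib)
    OMEGA = np.arange(6, 24)     # allowed tempo states in pib per frame (6 ... 23)
    P_OMEGA_CHANGE = 0.002       # probability of a tempo change, Eq. (2)
    LAM = 16                     # lambda in Eq. (3)

    def advance_position(phi, omega):
        """Position update of Eq. (1), with phi in {1, ..., PHI}."""
        return (phi + omega - 1) % PHI + 1

    def tempo_transition_prob(omega_prev, omega_next):
        """Eq. (2): the tempo stays constant, accelerates or decelerates by one state."""
        if omega_next == omega_prev:
            return 1.0 - P_OMEGA_CHANGE
        if abs(omega_next - omega_prev) == 1 and omega_next in OMEGA:
            return P_OMEGA_CHANGE / 2.0
        return 0.0   # transitions outside the allowed range are forbidden

    def observation_likelihood(activation, phi):
        """Eq. (3): beat states cover the first PHI / LAM positions of the beat period."""
        if 1 <= phi <= PHI / LAM:
            return activation
        return (1.0 - activation) / (LAM - 1.0)

    print(advance_position(phi=635, omega=10))              # wraps around the beat period
    print(tempo_transition_prob(10, 11), tempo_transition_prob(10, 13))
    print(observation_likelihood(0.9, phi=5), observation_likelihood(0.9, phi=300))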
Smaller values of λ (a higher proportion of beat locations and a smaller proportion of non-beat locations) are especially important for higher tempi, as the DBN visits only a few position states of a beat interval and could possibly miss the beginning of a beat. On the other hand, higher values of λ (a smaller proportion of beat locations) lead to less accurate beat tracking, as the activations are blurred in the state domain of the DBN. On our training set we achieved the best results with the value λ = 16.

In line with almost all other publications on the topic of beat tracking, we report the following scores:

F-measure: counts the number of true positive (correctly located beats within a tolerance window of ±70 ms), false positive and false negative detections;

P-score: measures the tracking accuracy by the correlation of the detections and the annotations, considering deviations within 20% of the annotated beat interval as correct;

2.4.3 Initial state distribution

The initial state distribution is normally used to incorporate any prior knowledge about the hidden states, such as tempo distributions. In this paper, we use a uniform distribution over all states, for simplicity and ease of generalisation.

Cemgil: places a Gaussian function with a standard deviation of 40 ms around the annotations and then measures the tracking accuracy by summing up the scores of the detected beats on this function, normalising it by the overall length of the annotations or detections, whichever is greater;

2.4.4 Inference

We are interested in the sequence of hidden variables φ1:K and ω1:K that maximise the posterior probability of the hidden variables given the observations (activations a1:K). Combining the discrete states of φ and ω into one state vector xk = [φk, ωk], we can compute the maximum a-posteriori state sequence x∗1:K by

x^{*}_{1:K} = \arg\max_{x_{1:K}} p(x_{1:K} \mid a_{1:K})   (4)

CMLc & CMLt: measure the longest continuously correct segment (CMLc) or all correctly tracked beats (CMLt) at the correct metrical level. A beat is considered correct if it is reported within a 17.5% tempo and phase tolerance, and the same applies for the previously detected beat;

AMLc & AMLt: like CMLc & CMLt, but additionally allow offbeat and double/half as well as triple/third tempo variations of the annotated beats;

Equation 4 can be computed efficiently using the well-known Viterbi algorithm [19]. Finally, the set of beat times B is determined by the set of time frames k which were assigned to a beat position (B = {k : φk < φk−1}). In our experiments we found that the beat detection becomes less accurate if the part of the beat interval which is considered as beat-state is too large (i.e. smaller values of λ). Therefore we determine the final beat times by looking for the highest beat activation value inside the beat-state window W = {k : φk ≤ Φ/λ}.

D & Dg: the information gain (D) and global information gain (Dg) are phase-agnostic measures comparing the annotations with the detections (and vice versa), building an error histogram and then calculating the Kullback-Leibler divergence w.r.t. a uniform histogram.

A more detailed description of the evaluation methods can be found in [5]. However, since we only investigate offline algorithms, we do not skip the first five seconds for evaluation.

3. EVALUATION

3.3 Results & Discussion

For the development and evaluation of the algorithm we used some well-known datasets. This allows for highest comparability with previously published results of state-of-the-art algorithms.
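As a concrete example of the first of these scores, the following sketch computes the F-measure with the ±70 ms tolerance window mentioned above, matching each annotation to at most one detection. It is a simplified re-implementation for illustration, not the evaluation code of [5].

    def beat_f_measure(detections, annotations, tolerance=0.070):
        """F-measure for beat tracking: a detection is a true positive if it lies
        within +/- tolerance seconds of a not-yet-matched annotation."""
        unmatched = sorted(annotations)
        true_pos = 0
        for d in sorted(detections):
            for i, a in enumerate(unmatched):
                if abs(d - a) <= tolerance:
                    true_pos += 1
                    del unmatched[i]      # each annotation can be matched only once
                    break
        false_pos = len(detections) - true_pos
        false_neg = len(unmatched)
        if true_pos == 0:
            return 0.0
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        return 2 * precision * recall / (precision + recall)

    # Toy example: one detection is 50 ms early and the last beat is missed.
    annotations = [0.50, 1.00, 1.50, 2.00]
    detections = [0.45, 1.02, 1.48]
    print(round(beat_f_measure(detections, annotations), 3))   # 0.857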
Table 2 lists the performance results of the reference implementation, Böck’s BeatTracker.2013, and the various extensions proposed in this paper for all datasets. All results are obtained with 8-fold cross validation with previously defined splittings, ensuring that no pieces are used both for training or parameter tuning and testing purposes. Additionally, we compare our new approach to published statof-the-art results on the Hainsworth and Ballroom datasets. 3.1 Datasets As training material for our system, the datasets introduced in [13–15] are used. They are called Ballroom, Hainsworth and SMC respectively. To show the ability of our new algorithm to adapt to various music styles, a very simple approach of splitting the complete dataset into multiple subsets according to the original source was chosen. Although far from optimal – both the SMC and Hainsworth datasets contain heterogeneous music styles – we still consider this a valid choice, since any “better” splitting would allow the system to adapt even further to heterogeneous styles and in turn lead to better results. At least the three sets have a somehow different focus regarding the music styles present. 3.3.1 Multi-model extension As can be seen, the use of the multi-model extension almost always improves the results over the implementation it is based on, especially on the SMC set. The gain in performance on the Ballroom set was expected, since Krebs et al. already showed that modelling rhythmic patterns helps to increase the overall detection accuracy [17]. Although we did not split the set according to the individual rhythmic patterns, the overall style of ballroom music can be considered unique enough to be distinct from the other music 606 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Ballroom BeatTracker.2013 [1, 2] — Multi-Model — DBN — Multi-Model + DBN Krebs et al. [17] Zapata et al. [22] † Hainsworth BeatTracker.2013 [1, 2] — Multi-Model — DBN — Multi-Model + DBN Zapata et al. [22] † Davies et al. [6] Peeters & Papadopoulos [18] Degara et al. [7] Human tapper [6] ‡ SMC BeatTracker.2013 [1, 2] — Multi-Model — DBN — Multi-Model + DBN Zapata et al. [22] † F-measure P-score Cemgil CMLc CMLt AMLc AMLt D Dg 0.887 0.897 0.903 0.910 0.855 0.767 0.863 0.875 0.876 0.881 0.839 0.735 0.855 0.866 0.838 0.845 0.772 0.672 0.719 0.740 0.792 0.800 0.745 0.586 0.795 0.814 0.825 0.830 0.786 0.607 0.748 0.759 0.873 0.885 0.818 0.824 0.831 0.841 0.915 0.924 0.865 0.860 3.404 3.480 3.427 3.469 2.499 2.750 2.596 2.674 2.275 2.352 1.681 1.187 0.832 0.832 0.843 0.840 0.710 - 0.843 0.847 0.867 0.865 0.732 - 0.712 0.716 0.711 0.707 0.589 - 0.618 0.617 0.696 0.696 0.569 0.548 0.547 0.561 0.528 0.756 0.761 0.808 0.803 0.642 0.612 0.628 0.629 0.812 0.655 0.652 0.759 0.760 0.709 0.681 0.703 0.719 0.575 0.807 0.809 0.883 0.881 0.824 0.789 0.831 0.815 0.874 2.167 2.171 2.251 2.268 2.057 - 1.468 1.490 1.481 1.466 0.880 - 0.497 0.514 0.516 0.529 0.369 0.598 0.617 0.622 0.630 0.460 0.402 0.415 0.404 0.415 0.285 0.238 0.257 0.294 0.296 0.115 0.360 0.389 0.415 0.428 0.158 0.279 0.296 0.378 0.383 0.239 0.436 0.467 0.550 0.567 0.397 1.263 1.324 1.426 1.460 0.879 0.416 0.467 0.504 0.531 0.126 Table 2. Performance of the proposed algorithm on the Ballroom [13], Hainsworth [14] and SMC [15] datasets. BeatTracker is the reference implementation our Multi-Model and dynamic Bayesian network (DBN) extensions are built on. The results marked with † are obtained with Essentia’s implementation of the multi-feature beat tracker. 
1 ‡ denotes causal (i.e. online) processing, all listed algorithms use non-causal analysis (i.e. offline processing) with the best results in bold. styles present in the other sets and the salient features can be exploited successfully by the multi-model approach. 3.3.2 Dynamic Bayesian network extension As already indicated in the original paper [2] (and described earlier in Section 2.4), the original BeatTracker can be easily distracted by some misaligned beats and then needs some time to recover from any failure. The newly adapted dynamic Bayesian network beat tracking stage does not suffer from this shortcoming by searching for the globally best beat locations. The use of the DBN boosts the performance on all datasets for almost all evaluation measures. Interestingly, the Cemgil accuracy is degraded by using the DBN stage. This might be explained by the fact that the discretisation grid of the beat period beat positions becomes too coarse for low tempi (cf. Section 2.4.4) and therefore yields inaccurate beat detections, which especially affect the Cemgil accuracy. This is one of the issues that needs to be resolved in the future, especially for lower tempi where the penalty is the highest. Davies et al. [6] also list performance results of a human tapper on the same dataset. However it must be noted that these were obtained by online real-time tapping, hence they cannot be compared directly to the system presented. However, the system of Davies et al. can also be switched to causal mode (and thus being comparable to a human tapper). In this mode it achieved performance reduced by approximately 10% [6]. Adding the same amount to the reported tapping results of 0.528 CMLc and 0.575 AMlc suggests that our system is capable of performing as good as humans when continuous tapping is required. On the Ballroom set we achieve higher results than the particularly specialised system of Krebs et al. [17]. Since our DBN approach is a simplified variant of their model, it can be assumed that the relatively low scores of the Cemgil accuracy and the information gain are due to the same reason – the coarse discretisation of the beat or bar states. Nonetheless, comparing the continuity scores (which have higher tolerance thresholds) we can still report an average increase in performance of more than 5%. 3.3.3 Comparison with other methods 4. CONCLUSIONS & OUTLOOK Our new system set side by side with other state-of-the-art algorithms draws a clear picture. It outperforms all of them considerably – independently of the dataset and evaluation measure chosen. Especially the high performance boosts of the CMLc and CMLt scores on the Hainworth dataset highlight the ability to track the beats at the correct metrical level significantly more often than any other method. In this paper we have presented a new beat tracking system which is able to improve over existing algorithms by incorporating multiple models which were trained on different music styles and combining it with a dynamic Bayesian 1 607 http://essentia.upf.edu, v2.0.1 15th International Society for Music Information Retrieval Conference (ISMIR 2014) network for the final inference of the beats. The combination of these two extensions yields a performance boost – depending on the dataset and evaluation measures chosen – of up to 27% relative, matching human tapping results under certain conditions. It outperforms other state-of-theart algorithms in tracking the beats at the correct metrical level by 20%. 
We showed that the specialisation on a certain musical style helps to improve the overall performance, although the method for splitting the available data into sets of different styles and then selecting the most appropriate model is rather simple. For the future we will investigate more advanced techniques for the selection of suitable data for the creation of the specialised models, e.g. splitting the datasets according to dance styles as performed by Krebs et al. [17] or applying unsupervised clustering techniques. We also expect better results from more advanced model selection methods. One possible approach could be to feed the individual model activations to the dynamic Bayesian network and let it choose among them. Finally, the Bayesian network could be tuned towards using a finer beat positions grid and thus reporting the beats at more appropriate times than just selecting the position of the highest activation reported by the neural network model. 5. ACKNOWLEDGMENTS This work is supported by the European Union Seventh Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591) and the Austrian Science Fund (FWF) project Z159. 6. REFERENCES [1] MIREX 2013 beat tracking results. http://nema.lis. illinois.edu/nema_out/mirex2013/results/ abt/, 2013. [2] S. Böck and M. Schedl. Enhanced Beat Tracking with Context-Aware Neural Networks. In Proceedings of the 14th International Conference on Digital Audio Effects (DAFx11), pages 135–139, Paris, France, September 2011. [3] A. T. Cemgil, H. Kappen, P. Desain, and H. Honing. On tempo tracking: Tempogram Representation and Kalman filtering. Journal of New Music Research, 28:4:259–273, 2001. [4] N. Collins. Towards a style-specific basis for computational beat tracking. In Proceedings of the 9th International Conference on Music Perception and Cognition (ICMPC9), pages 461–467, Bologna, Italy, 2006. [5] M. E. P. Davies, N. Degara, and M. D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. Technical Report C4DM-TR-09-06, Centre for Digital Music, Queen Mary University of London, 2009. [6] M. E. P. Davies and M. D. Plumbley. Context-dependent beat tracking of musical audio. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1009–1020, March 2007. [7] N. Degara, E. Argones-Rúa, A. Pena, S. Torres-Guijarro, M. E. P. Davies, and M. D. Plumbley. Reliability-informed beat tracking of musical signals. IEEE Transactions on Audio, Speech and Language Processing, 20(1):290–301, January 2012. [8] S. Dixon. Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research, 30:39–58, 2001. [9] D. Eck. Beat tracking using an autocorrelation phase matrix. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), volume 4, pages 1313–1316, Honolulu, Hawaii, USA, April 2007. [10] D. P. W. Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 2007:51–60, 2007. [11] A. Gkiokas, V. Katsouros, G. Carayannis, and T. Stafylakis. Music tempo estimation and beat tracking by applying source separation and metrical relations. In Proceedings of the 37th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), pages 421–424, Kyoto, Japan, March 2012. [12] M. Goto and Y. Muraoka. Beat tracking based on multipleagent architecture a real-time beat tracking system for audio signals. 
In Proceedings of the International Conference on Multiagent Systems, pages 103–110, 1996. [13] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, September 2006. [14] S. Hainsworth and M. Macleod. Particle filtering applied to musical tempo tracking. EURASIP J. Appl. Signal Process., 15:2385–2395, January 2004. [15] A. Holzapfel, M. E. P. Davies, J. R. Zapata, J. Oliveira, and F. Gouyon. Selective sampling for beat tracking evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 20(9):2539–2548, November 2012. [16] A. Klapuri, A. Eronen, and J. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):342–355, January 2006. [17] F. Krebs, S. Böck, and G. Widmer. Rhythmic pattern modeling for beat and downbeat tracking in musical audio. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), pages 227–232, Curitiba, Brazil, November 2013. [18] G. Peeters and H. Papadopoulos. Simultaneous beat and downbeat-tracking using a probabilistic framework: Theory and large-scale evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 19(6):1754–1769, 2011. [19] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257–286, 1989. [20] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. The Journal of the Acoustical Society of America, 103(1):588–601, 1998. [21] N. Whiteley, A. Cemgil, and S. Godsill. Bayesian modelling of temporal structure in musical audio. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), pages 29–34, Victoria, BC, Canada, October 2006. [22] J. R. Zapata, M. E. P. Davies, and E. Gómez. Multi-feature beat tracking. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):816–825, April 2014. 608 Oral Session 8 Source Separation 609 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 610 15th International Society for Music Information Retrieval Conference (ISMIR 2014) EXTENDING HARMONIC-PERCUSSIVE SEPARATION OF AUDIO SIGNALS Jonathan Driedger1 , Meinard Müller1 , Sascha Disch2 1 International Audio Laboratories Erlangen 2 Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany {jonathan.driedger,meinard.mueller}@audiolabs-erlangen.de, [email protected] ABSTRACT In recent years, methods to decompose an audio signal into a harmonic and a percussive component have received a lot of interest and are frequently applied as a processing step in a variety of scenarios. One problem is that the computed components are often not of purely harmonic or percussive nature but also contain noise-like sounds that are neither clearly harmonic nor percussive. Furthermore, depending on the parameter settings, one often can observe a leakage of harmonic sounds into the percussive component and vice versa. In this paper we present two extensions to a state-of-the-art harmonic-percussive separation procedure to target these problems. First, we introduce a separation factor parameter into the decomposition process that allows for tightening separation results and for enforcing the components to be clearly harmonic or percussive. 
As second contribution, inspired by the classical sines+transients+noise (STN) audio model, this novel concept is exploited to add a third residual component to the decomposition which captures the sounds that lie in between the clearly harmonic and percussive sounds of the audio signal. Figure 1. (a): Input audio signal x. (b): Spectrogram X. (c): Spectrogram of the harmonic component Xh (left), the residual component Xr (middle) and the percussive component Xp (right). (d): Waveforms of the harmonic component xh (left), the residual component xr (middle) and the percussive component xp (right). 1. INTRODUCTION harmonic sounds have a horizontal structure in a spectrogram representation of the input signal, while percussive sounds form vertical structures. By iteratively diffusing the spectrogram once in horizontal and once in vertical direction, the harmonic and percussive elements are enhanced, respectively. The two enhanced representations are then compared, and entries in the original spectral representation are assigned to either the harmonic or the percussive component according to the dominating enhanced spectrogram. Finally, the two components are transformed back to the time-domain. Following the same idea, Fitzgerald [5] replaces the diffusion step by a much simpler median filtering strategy, which turns out to yield similar results while having a much lower computational complexity. A drawback of the aforementioned approaches is that the computed decompositions are often not very tight in the sense that the harmonic and percussive components may still contain some non-harmonic and non-percussive residues, respectively. This is mainly because of two reasons. First, sounds that are neither of clearly harmonic nor of clearly percussive nature such as applause, rain, or the sound of a heavily distorted guitar are often more or less The task of decomposing an audio signal into its harmonic and its percussive component has received large interest in recent years. This is mainly because for many applications it is useful to consider just the harmonic or the percussive portion of an input signal. Harmonic-percussive separation has been applied, for example, for audio remixing [9], improving the quality of chroma features [14], tempo estimation [6], or time-scale modification [2, 4]. Several decomposition algorithms have been proposed. In [3], the percussive component is modeled by detecting portions in the input signal which have a rather noisy phase behavior. The harmonic component is then computed by the difference of the original signal and the computed percussive component. In [10], the crucial observation is that c Jonathan Driedger, Meinard Müller, and Sascha Disch. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jonathan Driedger, Meinard Müller, and Sascha Disch. “Extending Harmonic-Percussive Separation of Audio Signals”, 15th International Society for Music Information Retrieval Conference, 2014. 611 15th International Society for Music Information Retrieval Conference (ISMIR 2014) randomly distributed among the two components. Second, depending on the parameter setting, harmonic sounds often leak into the percussive component and the other way around. Finding suitable parameters which yield satisfactory results often involves a delicate trade-off between a leakage in one or the other direction. In this paper, we propose two extensions to [5] that lead towards more flexible and refined decompositions. 
First, we introduce the concept of a separation factor (Section 2). This novel parameter allows for tightening decomposition results by enforcing the harmonic and percussive components to contain just the clearly harmonic and percussive sounds of the input signal, respectively, and therefore attenuates the aforementioned problems. Second, we exploit this concept to add a third residual component that captures all sounds in the input audio signal which are neither clearly harmonic nor percussive (see Figure 1). This kind of decomposition is inspired by the classical sines+transients+noise (STN) audio model [8, 11], which aims at resynthesizing a given audio signal in terms of a parameterized set of sine waves, transient sounds, and shaped white noise. While a first methodology to compute such a decomposition follows rather directly from the concept of a separation factor, we also propose a more involved iterative decomposition procedure. Building on concepts proposed in [13], this procedure allows for a more refined adjustment of the decomposition results (Section 3.3). Finally, we evaluate our proposed procedures based on objective evaluation measures as well as subjective listening tests (Section 4). Note that this paper has an accompanying website [1] where you can find all audio examples discussed in this paper.

2. TIGHTENED HARMONIC-PERCUSSIVE SEPARATION

The first steps of our proposed decomposition procedure for tightening the harmonic and the percussive component are the same as in [5], which we now summarize. Given an input audio signal x, our goal is to compute a harmonic component xh and a percussive component xp such that xh and xp contain the clearly harmonic and percussive sounds of x, respectively. To achieve this goal, first a spectrogram X of the signal x is computed by applying a short-time Fourier transform (STFT)

X(t, k) = \sum_{n=0}^{N-1} w(n) \, x(n + tH) \, \exp(-2\pi i k n / N)

with t ∈ [0 : T−1] and k ∈ [0 : K], where T is the number of frames, K = N/2 is the frequency index corresponding to the Nyquist frequency, N is the frame size and length of the discrete Fourier transform, w is a sine-window function, and H is the hopsize (we usually set H = N/4). A crucial observation is that, looking at one frequency band of the magnitude spectrogram Y = |X| (one row of Y), harmonic components stay rather constant while percussive structures show up as peaks. Contrarily, in one frame (one column of Y), percussive components tend to be equally distributed while the harmonic components stand out. By applying a median filter to Y once in horizontal and once in vertical direction, we get a harmonically enhanced magnitude spectrogram Ỹh and a magnitude spectrogram Ỹp with enhanced percussive content:

Ỹh(t, k) := median(Y(t − ℓh, k), ..., Y(t + ℓh, k)),
Ỹp(t, k) := median(Y(t, k − ℓp), ..., Y(t, k + ℓp)),

for ℓh, ℓp ∈ N, where 2ℓh + 1 and 2ℓp + 1 are the lengths of the median filters, respectively. Now, extending [5], we introduce an additional parameter β ∈ R, β ≥ 1, called the separation factor. We assume an entry of the original spectrogram X(t, k) to be part of the clearly harmonic or percussive component if Ỹh(t, k)/Ỹp(t, k) > β or Ỹp(t, k)/Ỹh(t, k) ≥ β, respectively. Intuitively, for a sound to be included in the harmonic component it is required to stand out from the percussive portion of the signal by at least a factor of β, and vice versa for the percussive component. Using this principle, we can define binary masks Mh and Mp:

Mh(t, k) := Ỹh(t, k) / (Ỹp(t, k) + ε) > β,
Mp(t, k) := Ỹp(t, k) / (Ỹh(t, k) + ε) ≥ β,

where ε is a small constant to avoid division by zero, and the operators > and ≥ yield a binary result from {0, 1}. Applying these masks to the original spectrogram X yields the spectrograms for the harmonic and the percussive component:

Xh(t, k) := X(t, k) · Mh(t, k),
Xp(t, k) := X(t, k) · Mp(t, k).

These spectrograms can then be brought back to the time domain by applying an “inverse” short-time Fourier transform, see [7]. This yields the desired signals xh and xp.
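To make the procedure above concrete, the following is a minimal Python sketch of the tightened separation, assuming scipy. The function name tightened_hp_separation, the default filter half-lengths, and the use of scipy's default Hann window (rather than the sine window above) are our own simplifications, not the authors' implementation:

    import numpy as np
    from scipy.signal import stft, istft
    from scipy.ndimage import median_filter

    def tightened_hp_separation(x, sr, N=1024, beta=2.0, l_h=17, l_p=17, eps=1e-10):
        """Median-filtering HP separation with a separation factor beta,
        following the procedure summarized above (illustrative sketch;
        filter lengths and window choice are simplifications)."""
        H = N // 4
        _, _, X = stft(x, fs=sr, nperseg=N, noverlap=N - H)
        Y = np.abs(X)                                  # magnitude spectrogram, shape (K+1, T)
        Y_h = median_filter(Y, size=(1, 2 * l_h + 1))  # horizontal (time) filter -> harmonic
        Y_p = median_filter(Y, size=(2 * l_p + 1, 1))  # vertical (frequency) filter -> percussive
        M_h = (Y_h / (Y_p + eps)) > beta               # binary masks via the separation factor
        M_p = (Y_p / (Y_h + eps)) >= beta
        _, x_h = istft(X * M_h, fs=sr, nperseg=N, noverlap=N - H)
        _, x_p = istft(X * M_p, fs=sr, nperseg=N, noverlap=N - H)
        return x_h, x_p, X, M_h, M_p

Setting beta=1.0 essentially recovers the baseline median-filtering separation of [5], since every spectrogram entry is then assigned to whichever enhanced spectrogram dominates.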
Choosing a separation factor β > 1 tightens the separation result of the procedure by preventing sounds which are neither clearly harmonic nor percussive from being included in the components. In Figure 2a, for example, you see the spectrogram of a sound mixture of a violin (clearly harmonic), castanets (clearly percussive), and applause (noise-like, and neither harmonic nor percussive). The sound of the violin manifests itself as clear horizontal structures, while one clap of the castanets is visible as a clear vertical structure in the middle of the spectrogram. The sound of the applause, however, does not form any kind of directed structure and is spread all over the spectrum. When decomposing this audio signal with a separation factor of β=1, which basically yields the procedure proposed in [5], the applause is more or less equally distributed among the harmonic and the percussive component, see Figure 2b. However, when choosing β=3, only the clearly horizontal and vertical structures are preserved in Xh and Xp, respectively, and the applause is no longer contained in the two components, see Figure 2c.

Figure 2. (a): Original spectrogram X. (b): Spectrograms Xh (left) and Xp (right) for β = 1. (c): Spectrograms Xh (left) and Xp (right) for β = 3.

the original signal are not necessarily equal. Our proposed approach yields a decomposition of the signal. The three components always add up to the original signal again. The separation factor β hereby constitutes a flexible handle to adjust the sound characteristics of the components.

3. HARMONIC-PERCUSSIVE-RESIDUAL SEPARATION

3.2 Influence of the Parameters

Figure 3. Energy distribution between the harmonic, residual, and percussive components for different frame sizes N and separation factors β. (a): Harmonic components. (b): Residual components. (c): Percussive components.

The main parameters of our decomposition procedure are the lengths of the median filters, the frame size N used to compute the STFT, and the separation factor β. Intuitively, the lengths of the filters specify the minimal sizes of horizontal and vertical structures which should be considered as harmonic and percussive sounds in the STFT of x, respectively. Our experiments have shown that the filter lengths actually do not influence the decomposition too much as long as no extreme values are chosen, see also [1]. The frame size N, on the other hand, pushes the overall energy of the input signal towards one of the components. For large frame sizes, the short percussive sounds lose influence in the spectral representation and more energy is assigned to the harmonic component. This results in a leakage of some percussive sounds to the harmonic component.
Vice versa, for small frame sizes the low frequency resolution often leads to a blurring of horizontal structures, and harmonic sounds tend to leak into the percussive component. The separation factor β shows a different behavior to the previous parameters. The larger its value, the clearer becomes the harmonic and percussive nature of the components xh and xp . Meanwhile, also the portion of the signal that is assigned to the residual component xr increases. To illustrate this behavior, let us consider a first synthetic example where we apply our proposed procedure to the mixture of a violin (clearly harmonic), castanets (clearly percussive), and applause (neither harmonic nor percussive), all sampled at 22050 Hertz and having the same energy. In Figure 3, we visualized the relative energy distribution of the three components for varying frame sizes N and separation factors β, while fixing the length of the median filters to be always equivalent to 200 milliseconds in horizontal direction and 500 Hertz in vertical direction, see also [1]. Since the energy of all three signals is normalized, potential leakage between the components is indicated by components that have either more or less than a third of the overall energy assigned. Considering Fitzgerald’s procedure [5] as a baseline (β=1), we can investigate In Section 3.1 we show how harmonic-percussive separation can be extended with a third residual component. Afterwards, in Section 3.2, we show how the parameters of the proposed procedure influence the decomposition results. Finally, in Section 3.3, we present an iterative decomposition procedure which allows for a more flexible adjustment of the decomposition results. 3.1 Basic Procedure and Related Work The concept presented in Section 2 allows us to extend the decomposition procedure with a third component xr , called the residual component. It contains the portion of the input signal x that is neither part of the harmonic component xh nor the percussive components xp . To compute xr , we define the binary mask Mr (t, k) := 1 − Mh (t, k) + Mp (t, k) , apply it to X, and transform the resulting spectrogram Xr back to the time-domain (note that the masks Mh and Mp are disjoint). This decomposition into three components is inspired by the STN audio model. Here, an audio signal is analyzed to yield parameters for sinusoidal, transient, and noise components which can then be used to approximately resynthesize the original signal [8, 11]. While the main application of the STN model lies in the field of low bitrate audio coding, the estimated parameters can also be used to synthesize just the sinusoidal, the transient, or the noise component of the approximated signal. The harmonic, the percussive, and the residual component resulting from our proposed decomposition procedure are often perceptually similar to the STN components. However, our proposed procedure is conceptually different. STN modeling aims for a parametrization of the given audio signal. While the estimated parameters constitute a compact approximation of the input signal, this approximation and 613 15th International Society for Music Information Retrieval Conference (ISMIR 2014) its behavior by looking at the first columns of the matrices in Figure 3. While the residual component has zero energy in this setting, one can observe by listening that the applause is more or less equally distributed between the harmonic and the percussive component for medium frame sizes. 
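Continuing the sketch introduced after Section 2 (and under the same assumptions), the energy distribution discussed here and visualized in Figure 3 can be tabulated roughly as follows; the function hpr_energy_shares and its default grids are ours:

    import numpy as np
    from scipy.signal import istft

    def hpr_energy_shares(x, sr, frame_sizes=(128, 256, 1024, 4096),
                          betas=(1.0, 1.5, 2.0, 3.0)):
        """Relative energy of the harmonic, residual and percussive components
        over frame size N and separation factor beta, in the spirit of the
        Figure 3 experiment (uses the tightened_hp_separation sketch above)."""
        shares = {}
        for N in frame_sizes:
            for beta in betas:
                x_h, x_p, X, M_h, M_p = tightened_hp_separation(x, sr, N=N, beta=beta)
                M_r = ~(M_h | M_p)                    # residual mask: 1 - (M_h + M_p)
                H = N // 4
                _, x_r = istft(X * M_r, fs=sr, nperseg=N, noverlap=N - H)
                e = np.array([np.sum(x_h ** 2), np.sum(x_r ** 2), np.sum(x_p ** 2)])
                shares[(N, beta)] = e / e.sum()       # fractions (harmonic, residual, percussive)
        return shares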
This is also reflected in Figure 3a/c by the energy being split up roughly into equal portions. For very large N , most of the signal’s energy moves towards the harmonic component (value close to one in Figure 3a for β=1, N =4096), while for very small N , the energy is shifted towards the percussive component (value close to one in Figure 3c for β=1, N =128). With increasing β, one can observe how the energy gathered in the harmonic and the percussive component flows towards the residual component (decreasing values in Figure 3a/c and increasing values in Figure 3b for increasing β). Listening to the decomposition results shows that the harmonic and the percussive component thereby become more and more extreme in their respective characteristics. For medium frame sizes, this allows us to find settings that lead to decompositions in which the harmonic component contains the violin, the percussive component contains the castanets, and the residual contains the applause. This is reflected by Figure 3, where for N =1024 and β=2 the three sound components all hold roughly one third of the overall energy. For very large or very small frame sizes it is not possible to get such a good decomposition. For example, considering β=1 and N =4096, we already observed that the harmonic component holds most of the signal’s energy and also contains some of the percussive sounds. However, already for small β > 1 these percussive sounds are shifted towards the residual component (see the large amount of energy assigned to the residual in Figure 3b for β=1.5, N =4096). Furthermore, also the energy from the percussive component moves towards the residual. The large frame size therefore results in a very clear harmonic component while the residual holds both the percussive as well as all other non-harmonic sounds, leaving the percussive component virtually empty. For very small N the situation is exactly the other way around. This observation can be exploited to define a refined decomposition procedure which we discuss in the next section. Figure 4. Overview of the refined procedure. (a): Input signal x. (b): First run of the decomposition procedure using a large frame size Nh and a separation factor βh . (c): Second run of the decomposition procedure using a small frame size Np and a separation factor βp . Figure 5. Energy distribution between the harmonic, residual, and percussive components for different separation factors βh and βp . (a): Harmonic components. (b): Residual components. (c): Percussive components. presented in Section 3.1. So far, although it is possible to find a good combination of N and β such that both the harmonic as well as the percussive component represent the respective characteristics of the input signal well (see Section 3.2), the computation of the two components is still coupled. It is therefore not clear how to adjust the content of the harmonic and the percussive component independently. Having made the observation that large N lead to good harmonic but poor percussive/residual components for β>1, while small N lead to good percussive components but poor harmonic/residual components for β>1, we build on the idea from Tachibana et al. [13] and compute the decomposition in two iterations. Here, the goal is to decouple the computation of the harmonic component from the computation of the percussive component. First, the harmonic component is extracted by applying our basic procedure with a large frame size Nh and a separation facfirst tor βh >1, yielding xfirst and xfirst p . 
In a second run, h , xr 3.3 Iterative Procedure In [13], Tachibana et al. described a method for the extraction of human singing voice from music recordings. In this algorithm, the singing voice is estimated by iteratively applying the harmonic-percussive decomposition procedure described in [9] first to the input signal and afterwards again to one of the resulting components. This yields a decomposition of the input signal into three components, one of which containing the estimate of the singing voice. The core idea of this algorithm is to perform the two harmonicpercussive separations on spectrograms with two different time-frequency resolutions. In particular, one of the spectrograms is based on a large frame size and the other on a small frame size. Using this idea, we now extend our proposed harmonic-percussive-residual separation procedure 614 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Violin -3.10 -5.85 Castanets -2.93 Applause -3.04 3.58 − HPR-IO HPR-I HPR HP-I HP BL HPR-IO HPR-I SAR HPR HP-I HP BL HPR-IO HPR-I SIR HPR HP-I HP BL SDR 0.08 8.23 7.65 8.85 -3.10 -5.09 1.08 17.69 14.58 21.65 274.25 8.33 9.44 8.82 8.78 9.11 2.86 8.29 9.14 9.28 -2.93 10.45 22.34 20.66 24.41 274.25 8.14 4.07 8.49 9.50 9.44 -7.03 4.25 4.93 5.00 -3.04 6.06 − 14.69 8.41 12.80 9.04 274.25 − -6.85 6.95 5.93 7.69 Table 1. Objective evaluation measures. All values are given in dB. the procedure is applied again to the sum xfirst + xfirst r p , this time using a small frame size Np and a second separation factor βp >1. This yields the components xsecond , xsecond r h second and xp . Finally, we define the output components of the procedure to be the castanets, and the applause signal represent the characteristics that we would like to capture in the harmonic, the percussive, and the residual components, respectively, we treated the decomposition task of this mixture as a source separation problem. In an optimal decomposition the harmonic component would contain the original violin signal, the percussive component the castanets signal, and the residual component the applause. To evaluate the decomposition quality, we computed the source to distortion ratios (SDR), the source to interference ratios (SIR), and the source to artifacts ratios (SAR) [15] for the decomposition results of the following procedures. As a baseline (BL), we simply considered the original mixture as an estimate for all three sources. Furthermore, we applied the standard harmonic-percussive separation procedure by Fitzgerald [5] (HP) with the frame size set to N =1024, the HP method applied iteratively (HP-I) with Nh =4096 and Np =256, the proposed basic harmonic-percussive-residual separation procedure (HPR) as described in Section 3.1 with N =1024 and β=2, and the proposed iterative harmonic-percussive-residual separation procedure (HPR-I) as described in Section 3.3 with Nh =4096, Np =256, and βh =βp =2. As a final method, we also considered HPR-I with separation factor βh =3 and βp =2.5, which were optimized manually for the task at hand (HPR-IO). The filter lengths in all procedures were always fixed to be equivalent to 200 milliseconds in time direction and 500 Hertz in frequency direction. Decomposition results for all procedures can be found at [1]. The results are listed in Table 1. All values are given in dB and higher values indicate better results. 
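A rough sketch of the two-pass procedure and of the evaluation setup described above, again reusing the hypothetical tightened_hp_separation helper from the Section 2 sketch. The helper names, the trimming to a common length, and the use of the mir_eval package (one publicly available implementation of the BSS Eval measures of Vincent et al. [15]) are our own choices; the parameter defaults mirror the values reported in the text:

    import numpy as np
    from scipy.signal import istft

    def hpr(x, sr, N, beta):
        """One pass of harmonic/residual/percussive separation
        (reuses the tightened_hp_separation sketch from Section 2)."""
        x_h, x_p, X, M_h, M_p = tightened_hp_separation(x, sr, N=N, beta=beta)
        _, x_r = istft(X * ~(M_h | M_p), fs=sr, nperseg=N, noverlap=N - N // 4)
        return x_h, x_r, x_p

    def hpr_iterative(x, sr, N_h=4096, N_p=256, beta_h=2.0, beta_p=2.0):
        """Two-pass HPR-I: a large-frame pass fixes the harmonic component,
        a small-frame pass on the remainder fixes the percussive component."""
        L = len(x)
        h1, r1, p1 = hpr(x, sr, N_h, beta_h)          # first run, large frames
        rest = (r1 + p1)[:L]
        h2, r2, p2 = hpr(rest, sr, N_p, beta_p)       # second run, small frames
        x_h, x_r, x_p = h1[:L], (h2 + r2)[:L], p2[:L]
        return x_h, x_r, x_p

    # SDR / SIR / SAR against the known violin, castanets and applause stems:
    # import mir_eval
    # sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    #     np.vstack([violin, castanets, applause]), np.vstack([x_h, x_p, x_r]))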
As expected, BL yields rather low SDR and SIR values for all components, while the SAR values are excellent since there are no artifacts present in the original mixture. The method HP yields low evaluation measures as well. However, these values are to be taken with care since HP decomposes the input mixture in just a harmonic and a percussive component. The applause is therefore not estimated explicitly and, as also discussed in Section 2, randomly distributed among the harmonic and percussive component. It is therefore clear that especially the SIR values are low in comparison to the other procedures since the applause heavily interferes with the remaining two sources in the computed components. When looking at HP-I, the benefit of having a third component becomes clear. Although here the residual component does not capture the applause very well (SDR of −7.03 dB) this already suf- second + xsecond , xp := xsecond . xh := xfirst h , xr := xh r p For an overview of the procedure see Figure 4. While fixing the values of Nh and Np to a small and a large frame size, respectively (in our experiments we chose Nh =4096 and Np =256), the separation factors βh and βp yield handles that give simple and independent control over the harmonic and percussive component. Figure 5, which is based on the same audio example as Figure 3, shows the energy distribution among the three components for different combinations of βh and βp , see also [1]. For the harmonic components (Figure 5a) we see that the portion of the signals energy contained in this component is independent of βp and can be controlled purely by βh . This is a natural consequence from the fact that in our proposed procedure the harmonic component is always computed directly from the input signal x and βp does not influence its computation at all. However, we can also observe that the energy contained in the percussive component (Figure 5c) is fairly independent of βh and can be controlled almost solely by βp . Listening to the decomposition results confirms these observations. Our proposed iterative procedure therefore allows to adjust the harmonic and the percussive component almost independently what significantly simplifies the process of finding an appropriate parameter setting for a given input signal. Note that in principle it would also be possible to choose βh =βp =1, resulting in an iterative application of Fitzgerald’s method [5]. However, as discussed in Section 3.2, Fitzgerald’s method suffers from component leakage when using very large or small frame sizes. Therefore, most of the input signal’s energy will be assigned to the harmonic component in the first iteration of the algorithm, while most of the remaining portion of the signal is assigned to the percussive component in the second iteration. This leads to a very weak, although not empty, residual component. 4. EVALUATION In a first experiment, we applied objective evaluation measures to our running example. Assuming that the violin, 615 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Item name Description CastanetsViolinApplause Heavy Synthetic mixture of a violin, castanets and applause. Recording of heavily distorted guitars, a bass and drums. Excerpt from My Leather, My Fur, My Nails by the band Stepdad. Regular beat played on bongos. Monophonic melody played on a glockenspiel. Excerpt from “Gute Nacht” by Franz Schubert which is part of the Winterreise song cycle. It is a duet of a male singer and piano. 
Stepdad Bongo Glockenspiel Winterreise tics of the fine structure must remain constant on the large scale” [12]. In our opinion this is not a bad description of what one can hear in residual components. Acknowledgments: This work has been supported by the German Research Foundation (DFG MU 2686/6-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer IIS. Table 2. List of audio excerpts. 5. REFERENCES [1] J. Driedger, M. Müller, and S. Disch. Accompanying website: Extending harmonic-percussive separation of audio signals. http://www.audiolabs-erlangen.de/resources/ 2014-ISMIR-ExtHPSep/. fices to yield SDR and SIR values clearly above the baseline for the estimates of the violin and the castanets. The separation quality further improves when considering the results of our proposed method HPR. Here the evaluation yields high values for all measures and components. The very high SIR values are particularly noticeable since they indicate that the three sources are separated very clearly with very little leakage between the components. This confirms our claim that our proposed concept of a separation factor allows for tightening decomposition results as described in Section 2. The results of HPR-I are very similar to the results for the basic procedure HPR. However, listening to the decomposition reveals that the harmonic and the percussive component still contain some slight residue sounds of the applause. Slightly increasing the separation factors to βh =3 and βp =2.5 (HPR-IO) eliminates these residues and further increases the evaluation measures. This straight-forward adjustment is possible since the two separation factors βh and βp constitute independent handles to adjust the content of the harmonic and percussive component, what demonstrates the flexibility of our proposed procedure. The above described experiment constitutes a first case study for the objective evaluation of our proposed decomposition procedures, based on an artificially mixed example. To also evaluate these procedures on real-world audio data, we additionally performed an informal subjective listening tests with several test participants. To this end, we applied our procedures to the set of audio excerpts listed in Table 2. Among the excerpts are complex sound mixtures as well as purely percussive and harmonic signals, see also [1]. Raising the question whether the computed harmonic and percussive components meet the expectation of representing the clearly harmonic or percussive portions of the audio excerpts, respectively, the performed listening test confirmed our hypothesis. It furthermore turned out that βh =βp =2, Nh =4096 and Np =256 seems to be a setting for our iterative procedure which robustly yields good decomposition results, rather independent of the input signal. Regarding the residual component, it was often described to sound like a sound texture by the test participants, which is a very interesting observation. Although there is no clear definition of what a sound texture exactly is, literature states “sound texture is like wallpaper: it can have local structure and randomness, but the characteris- [2] J. Driedger, M. Müller, and S. Ewert. Improving time-scale modification of music signals using harmonic-percussive separation. Signal Processing Letters, IEEE, 21(1):105–109, 2014. [3] C. Duxbury, M. Davies, and M. Sandler. Separation of transient information in audio using multiresolution analysis techniques. 
In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, 12 2001. [4] C. Duxbury, M. Davies, and M. Sandler. Improved time-scaling of musical audio using phase locking at transients. In Audio Engineering Society Convention 112, 4 2002. [5] D. Fitzgerald. Harmonic/percussive separation using medianfiltering. In Proceedings of the International Conference on Digital Audio Effects (DAFx), pages 246–253, Graz, Austria, 2010. [6] A. Gkiokas, V. Katsouros, G. Carayannis, and T. Stafylakis. Music tempo estimation and beat tracking by applying source separation and metrical relations. In ICASSP, pages 421–424, 2012. [7] D. W. Griffin and J. S. Lim. Signal estimation from modified shorttime Fourier transform. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(2):236–243, 1984. [8] S. N. Levine and J. O. Smith III. A sines+transients+noise audio representation for data compression and time/pitch scale modications. In Proceedings of the 105th Audio Engineering Society Convention, 1998. [9] N. Ono, K. Miyamoto, H. Kameoka, and S. Sagayama. A real-time equalizer of harmonic and percussive components in music signals. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 139–144, Philadelphia, Pennsylvania, USA, 2008. [10] N. Ono, K. Miyamoto, J. LeRoux, H. Kameoka, and S. Sagayama. Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In European Signal Processing Conference, pages 240–244, Lausanne, Switzerland, 2008. [11] A. Petrovsky, E. Azarov, and A. Petrovsky. Hybrid signal decomposition based on instantaneous harmonic parameters and perceptually motivated wavelet packets for scalable audio coding. Signal Processing, 91(6):1489–1504, 2011. [12] N. Saint-Arnaud and K. Popat. Computational auditory scene analysis. chapter Analysis and synthesis of sound textures, pages 293–308. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1998. [13] H. Tachibana, N. Ono, and S. Sagayama. Singing voice enhancement in monaural music signals based on two-stage harmonic/percussive sound separation on multiple resolution spectrograms. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):228–237, January 2013. [14] Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono, and S. Sagayama. HMM-based approach for automatic chord detection using refined acoustic features. In ICASSP, pages 5518–5521, 2010. [15] E. Vincent, R. Gribonval, and C. Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006. 616 15th International Society for Music Information Retrieval Conference (ISMIR 2014) SINGING VOICE SEPARATION USING SPECTRO-TEMPORAL MODULATION FEATURES Frederick Yen Yin-Jyun Luo Master Program of SMIT National Chiao-Tung University, Taiwan Tai-Shih Chi Dept. of Elec. & Comp. Engineering National Chiao-Tung University, Taiwan {fredyen.smt01g,fredom.smt02g} @nctu.edu.tw [email protected] ABSTRACT An auditory-perception inspired singing voice separation algorithm for monaural music recordings is proposed in this paper. Under the framework of computational auditory scene analysis (CASA), the music recordings are first transformed into auditory spectrograms. 
After extracting the spectral-temporal modulation contents of the timefrequency (T-F) units through a two-stage auditory model, we define modulation features pertaining to three categories in music audio signals: vocal, harmonic, and percussive. The T-F units are then clustered into three categories and the singing voice is synthesized from T-F units in the vocal category via time-frequency masking. The algorithm was tested using the MIR-1K dataset and demonstrated comparable results to other unsupervised masking approaches. Meanwhile, the set of novel features gives a possible explanation on how the auditory cortex analyzes and identifies singing voice in music audio mixtures. 1. INTRODUCTION Over the past decade, the task of singing voice separation has gained much attention due to improvements in digital audio technologies. In the research field of music information retrieval (MIR), separated vocal signals or accompanying music signals can be of great use in many applications, such as singer identification, pitch extraction, and music genre classification. During the past few years, many algorithms have been proposed for this challenging task. These algorithms can be categorized into unsupervised and supervised approaches. The unsupervised approaches do not contain any training mechanism in the algorithms. For instance, Durrieu et al. used a source/filter signal model with nonnegative matrix factorization (NMF) to perform source separation [5] and Fitzgerald et al. used median filtering and factorization techniques to separate harmonic and percussive components in audio signals [7]. Some other unsupervised methods considered structural characteristics of vocals and music accompaniments in several domains for separation. For example, Pardo and Rafii proposed REPET which views the accompaniments as repeating background signals and vocals as the varying information lying on top of them [16]. Tachibana et al. pro© Frederick Yen, Yin-Jyun Luo, Tai-Shih Chi. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Frederick Yen, Yin-Jyun Luo, TaiShih Chi. “Singing Voice Separation using Spectro-Temporal Modulation Features”, 15th International Society for Music Information Retrieval Conference, 2014. 617 posed the separation technique, HPSS, to remove the harmonic and percussive instruments sequentially in a two-stage framework by considering the nature of fluctuations of audio signals [19]. Huang et al. used RPCA to present accompaniments in low-rank subspace and vocal in sparse representation [8]. In addition, some unsupervised CASA-based systems were proposed for singing voice separation by finding singing dominant regions on the spectrograms using pitch and harmonic information. For instance, Li and Wang proposed a CASA system obtaining binary masks using pitch-based inference [13]. Hsu and Jang extended the work and proposed a system for separating both voiced and unvoiced singing segments from the music mixtures [9]. Although training mechanisms were seen in these two systems, they were only for detecting voiced and unvoiced segments, but not for separation. In contrast, there were approaches based on supervised learning techniques. For example, Vembu et al. used vocal/non-vocal SVM and neural-network (NN) classifiers for vocal-nonvocal segmentation [20]. Ozerov et al. used a vocal/non-vocal classifier based on Bayesian modeling [15]. Another group of methods combined RPCA with training mechanisms. 
For instance, Yang’s low-rank representation method decomposed vocals and accompaniments using pre-trained low-rank matrices [22] and Sprechmann et al. proposed a real-time method using low-rank modeling with neural networks [17]. Although these supervised learning methods demonstrated very high performance, they usually offer a weaker conception of generality. Music instruments produce signals with various kinds of fluctuations such that they can be briefly categorized into two groups, percussive and harmonic. Signals produced by percussive instruments are more consistent along the spectral axis and by harmonic instruments are more consistent along the temporal axis with little or no fluctuations. These two categories occupy a large proportion of a spectrogram with mainly vertical and horizontal lines. To extend this sense into a more general form, the fluctuations can be viewed as a sum of sinusoid modulations along the spectral axis and the temporal axis. If a signal has nearly zero modulation along one of the two axes, its energy is smoothly distributed along that axis. Conversely, if a signal has a high frequency of modulation along one axis, then its energy becomes scattered along that axis. Therefore, if one can decipher the modulation status of a signal, one may be able to identify the instrument type of the signal. An algorithm utilizing mo- 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ributed over 5.3 octaves with the 24 filters/octave frequency resolution. These constant-Q filters mimic the frequency selectivity of the cochlea. Outputs of these filters are then transformed through a non-linear compression stage, a lateral inhibitory network (LIN), and a halfwave rectifier cascaded with a low-pass filter. The nonlinear compression stage models the saturation caused by inner hair cells, the LIN models the spectral masking effect, and the following stage serves as an envelope extractor to model the temporal dynamic reduction along the auditory pathway to the midbrain. Outputs of the module from different stages are formulated below: Figure 1. Stages of the cochlear module, adopted from [2]. dulation information can be seen in [1], where Barker et al. combined the modulation spectrogram (MS) with nonnegative tensor factorization (NTF) to perform speech separation from mixtures of speech and music. Although the above mentioned engineering approaches produce promising results, human’s tremendous ability in sound streams separation makes a biomimetic approach interesting to investigate. Based on neurophysiological evidences, it is suggested that neurons of the auditory cortex (A1) respond to both spectral modulations and temporal modulations of the input sounds. Accordingly, a computational auditory model was proposed to model A1 neurons as spectro-temporal modulation filters [2]. This concept of spectro-temporal modulation decomposition has inspired many approaches in various engineering topics, such as using spectro-temporal modulation features for speaker recognition [12], robust speech recognition [18], voice activity detection [10], and sound segregation [6]. Since modulations are important for music signal categorization, this modulation-decomposition auditory model is used as a pre-processing stage for singing voice separation in this paper. 
Our proposed unsupervised algorithm adapts this two-stage auditory model, which decodes the spectro-temporal modulations of a T-F unit, to extract modulation based features and performs singing voice separation under the CASA framework. This paper is organized as follows. A brief review of the auditory model is presented in Section 2. Section 3 describes the proposed method. Section 4 shows evaluation and results. Lastly, Section 5 draws the conclusion. c d f g $ hj k$l m (1) c7 d f g @no c d fA ho q (2) c^ d f g rsnt c7 d fd u (3) cv d f g c^ d f ho wl y (4) where z is the input signal; k$l m is the impulse response of the cochlear filter with center frequency m; hj denotes convolution in time; 炽 is the nonlinear compression function; no is the partial derivative of ; q is the membrane leakage low-pass filter; wl y g Po~ is the integration window with the time constant y to model current leakage of the midbrain; is the step function. Detailed descriptions of the cochlear module can be found in [2]. The output cv d f of the module is the auditory spectrogram, which represents the neuron activities along time and log-frequency axis. In this work, we bypass the non-linear compression stage by assuming input sounds are properly normalized without triggering the highvolume saturation effect of the inner hair cells. 2. SPECTRO-TEMPORAL AUDITORY MODEL A neuro-physiological auditory model is used to extract the modulation features. The model consists of an early cochlear (ear) module and a central auditory cortex (A1) module. 2.2 Cortical Module The second module simulates the neural responses of the auditory cortex (A1). The auditory spectrogram cv d f is analyzed by cortical neurons which are modeled by two-dimensional filters tuned to different spectrotemporal modulations. The rate parameter (in Hz) characterizes the velocity of local spectro-temporal envelope 2.1 Cochlear Module As shown in Figure 1, the input sound goes through 128 overlapping asymmetric constant-Q band-pass filters (] ^_` a b ) whose center frequencies are uniformly dist- 618 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 3. Block diagram of the proposed algorithm. harmonic, percussive and vocal. Harmonic components have steady energy distributions over time and have clear formant structures over frequency. Each percussive component has impulsive energy concentrated in a short period of time and has no obvious harmonic structure. Vocal components possess harmonic structure and their energy is distributed along various time periods. Interpreting the above statements from the rate-scale perspective, several general properties can be drawn. Harmonic components can be usually regarded as having low rate and high scale modulations. It means that they have relatively slow energy change along time and rapid energy change along the log-frequency axis due to the harmonic structures. In contrast, percussive components typically show quick energy change along time and energy spreading along the whole log-frequency axis, such that they possess high rate and low scale modulations. Vocal components are often recognized as a mix version of the harmonic and percussive components with characteristics sometimes considered more similar to harmonics. Different types of singing or vocal expression can result in various values of rate and scale. Figure 4 shows some examples of rate-scale plots of components from the three categories. 
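As a simplified stand-in for the cortical multi-resolution filter bank of [2] (not the model actually used in this paper), the rate-scale content of a local patch of the auditory spectrogram can be approximated by its 2-D Fourier transform; the function below and its argument names are ours:

    import numpy as np

    def rate_scale_energy(patch, frame_rate, bins_per_octave=24):
        """Crude rate-scale analysis of a T-F neighbourhood: the 2-D Fourier
        transform of a log-frequency spectrogram patch, as a simplified
        stand-in for the cortical filter bank of Chi et al. [2].
        patch : array of shape (channels, frames), channels at 24 per octave.
        Returns modulation energy indexed by (scale in cyc/oct, rate in Hz)."""
        P = patch - patch.mean()                       # remove DC before the transform
        M = np.fft.fftshift(np.fft.fft2(P))
        energy = np.abs(M) ** 2
        rates = np.fft.fftshift(np.fft.fftfreq(patch.shape[1], d=1.0 / frame_rate))
        scales = np.fft.fftshift(np.fft.fftfreq(patch.shape[0], d=1.0 / bins_per_octave))
        return energy, scales, rates

In this crude picture, energy concentrated at low rates and high scales indicates sustained harmonic structure, energy at high rates and low scales indicates transient, percussive structure, and the two half-planes of the transform separate upward- from downward-moving patterns, mirroring the qualitative description above.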
Given an auditory spectrogram cb * transformed from an input music signalz, the rate-scale plots of the T-F units are generated. As a pre-process, in order to prevent extracting trivial data from nearly inaudible T-F units of the auditory spectrogram, we leave out the T-F units that have energy less than 1% of the maximum energy of the whole auditory spectrogram. With the rest of the T-F units, we obtain the rate-scale plot of each unit and proceed to the feature extraction stage. For each rate-scale plot, the total energies of the negative and positive rate side are compared. The side with greater energy is determined as the dominant plot. From the dominant plot, we extract 11 features as shown in Table 1. The features are selected by observing the ratescale plots with some intuitive assumptions of the physical properties which distinguish between harmonic, percussive and vocal. The first 10 features are obtained by computing the energy ratio of two different areas on the rate-scale plot. For example, as shown in Table 1, the first feature is the ratio of the total modulation energy of scale = 1 to the total modulation energy of scale = 0.25. The low scales, such as 0.25 and 0.5, capture the degree of the Figure 2. Rate-scale outputs of the cortical module to two T-F units of the auditory spectrogram of the 'Ani_2_03.wav' vocal track in MIR-1K [9]. variation along the temporal axis. The scale parameter (in cycle/octave) characterizes the density of the local spectro-temporal envelope variation along the logfrequency axis. Furthermore, the cortical neurons are found sensitive to the direction of the spectro-temporal envelope. It is characterized by the sign of the rate parameter in this model, with negative for the upward direction and positive for the downward direction. From functional point of view, this module performs a spectro-temporal multi-resolution analysis on the input auditory spectrogram in various rate-scale combinations. Outputs of various cortical neurons to a single T-F unit of the spectrogram demonstrate the local spectro-temporal modulation contents of the unit in terms of the rate, scale and directionality parameters. Figure 2 shows rate-scale outputs of two T-F units in an auditory spectrogram of a vocal clip. The rate-scale output is referred to as the rate-scale plot in this paper. The rate and scale indices are P7 andG , respectively. The strong responses of the plots correspond to the variations of singing pitch envelopes resolved by the rate and scale parameters and the moving direction of the pitch. Detailed description of the cortical module is available in [3]. 3. PROPOSED METHOD A schematic diagram of the proposed algorithm is shown in Figure 3. The following sections will discuss each part in details. 3.1 Feature Extraction According to the spectral and temporal behaviors observed on the auditory spectrogram, components of a musical piece are characterized into three categories, 619 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Scale 1 : 0.25 2 : 0.25 4 : 0.25 8 : 0.25 (0.25, 2, 4) (0.25, 2, 4) (0.25, 2, 4) (0.25, 0.5) : all (1, 2) : all (4, 8) : all (0.25) Rate all all all all (1, 2) : (0.25, 0.5, 1, 2, 16, 32) (0.25, 0.5) : (0.25, 0.5, 1, 2, 16, 32) (16, 32) : (0.25, 0.5, 1, 2, 16, 32) all all all all Table 1. Eleven extracted modulation energy features Figure 4. (a) Rate-scale plot from the vocal track of ‘Ani_4_07’ in MIR-1K. 
The modulation energy is mostly concentrated in the middle and high scales for a unit with a clear harmonic structure. (b) Rate-scale plots from the accompanying music track of ‘Ani_4_07’. The upper plot shows energy concentrating at low rates for a sustained unit. The lower plot shows energy concentrating at high rates for a transient unit. flatness of the formant structure while the high scales, such as 1, 2, 4 and 8, capture the harmonicity with different frequency spacing between harmonics. Therefore, the first four features can be thought as descriptors which distinguish harmonic from percussive using spectral information. The fifth to the seventh features capture temporal information which can distinguish sustained units from transient units. The feature values are saved as feature vectors and then grouped as a feature matrix * for clustering, where is the number of features and is the number of total valid units in the auditory spectrogram. 3.2 Unsupervised Clustering In the unsupervised clustering stage, a spectrogram is divided into three parts and clustering is performed for each part. Based on hearing perception, the frequency resolution is higher at lower frequencies while the temporal resolution is higher at higher frequencies [14]. Due to the frequency resolution of the constant-Q cochlear filters/channels in the auditory model, the auditory spectrogram can only resolve about ten harmonics [11]. To handle different resolutions, the spectrogram is separated into three sub-spectrograms with overlapped frequency ranges. The three sub-spectrograms consist of channel 1 to channel 60, channel 46 to channel 75, and channel 61 to channel 128, respectively, with overlaps of 15 channels. 620 The clustering step is performed using the EM algorithm to group data into three unlabelled clusters. The EM algorithm assigns a probability set to each T-F unit showing its likelihood of belonging to each cluster. Note that in spectrogram representations, the sound sources are superimposed on top of each other. It implies that one TF unit may contain energy from more than one source. Therefore, in this work, if one T-F unit has a probability set in which the second highest probability is higher than 5%, that particular T-F unit will also be labelled to the second high probability cluster. It means one unit may eventually appear in more than one cluster. The parameter 5% was empirically determined. Each of the three sub-spectrograms is clustered into three groups. Total of nine groups are generated and merged back into three whole spectrograms by comparing the correlations of the overlapped channels between different groups. Each of the three whole spectrograms represents the extracted harmonic, percussive, and vocal part of the music mixture. With no prior information about the labels of the three whole spectrograms, the effective mean rate-scale plot of each spectrogram is examined. The effective mean ratescale plot is the mean of rate-scale plots of the T-F units with energy higher than 20% of the maximum energy in that spectrogram. The total modulation energy of rate = 1, 2 Hz and scale = 0.25, 2, 4 cycle/octave is calculated from the effective mean rate-scale plot and referred to as Ev, which is used as the criterion to select the vocal spectrogram. The one with the maximum Ev value is picked as the vocal spectrogram since Ev catches modulations related to the formant structure (scale = 0.25), the harmonic structure (scale = 2 and 4) and the singing rate (rate = 1 and 2) of singing voices. 
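As an illustration of the clustering stage described above, the following sketch uses scikit-learn's GaussianMixture as one standard EM implementation (not necessarily the authors' own); the function name, the feature-matrix orientation and the random seed are ours, while the 5% secondary-membership rule follows the text:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def cluster_tf_units(features, second_prob=0.05, seed=0):
        """EM clustering of valid T-F units into three unlabelled groups, with
        the soft secondary-membership rule described above (sketch only).

        features : array of shape (n_units, n_features), the 11 modulation
                   energy ratios per valid T-F unit.
        Returns a boolean membership matrix of shape (n_units, 3): each unit
        belongs to its best cluster and, if its second-best posterior exceeds
        second_prob, to that cluster as well."""
        gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=seed)
        post = gmm.fit(features).predict_proba(features)     # responsibilities, (n_units, 3)
        member = np.zeros_like(post, dtype=bool)
        order = np.argsort(post, axis=1)
        rows = np.arange(len(post))
        member[rows, order[:, -1]] = True                    # best cluster
        second = order[:, -2]
        runner_up = post[rows, second] > second_prob
        member[rows[runner_up], second[runner_up]] = True    # soft second membership
        return member

The nine groups obtained from the three sub-spectrograms are then merged by correlating their overlapping channels, and the merged spectrogram with the largest Ev value is taken as the vocal estimate, as described above.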
The vocal spectrogram is then synthesized to an estimated signal using the auditory model toolbox [24]. The nonlinear operation of the envelope extractor in the cochlear module makes perfect synthesis impossible, thus causing a general result of loss of higher frequencies of the signal. Detailed computations are shown in [2]. 4. EVALUATION RESULTS The MIR-1K [9] is used as the evaluation dataset. It cont- 15th International Society for Music Information Retrieval Conference (ISMIR 2014) mance to the masking-based REPET in all SNR conditions. When compared with the subspace RPCA method, our proposed method has comparable performance only in the -5 dB SNR condition. These results demonstrate the effectiveness of the spectral-temporal modulation features for analyzing music mixtures. As this proposed method only applies a simple EM algorithm for clustering, harmonic mismatches and artificial noises are yet to be discussed. The future work will be focused on applying more advanced classifiers for more accurate separations and adopting a two-stage mechanism like HPSS to discard percussive and harmonic components sequentially. The other potential work is to implement the proposed spectro-temporal modulation based method in the Fourier spectrogram domain [4] to mitigate synthesis errors injected by the projection-based reconstruction process of the auditory model. Figure 5. GNSDR comparison at voice-to-music ratio of -5, 0, and 5 dB with existing methods. ains 1000 WAV files of karaoke clips sung by amateur singers. The length of each clip is around 4~13 seconds. The vocal and music accompaniment parts were recorded in the right and the left channels separately. In this experiment, we mixed two channels in -5, 0, 5 dB SNR (signal to noise ratio, i.e., vocal to music accompaniment ratio) for test. To assess the quality of separation, the source-todistortion ratio (SDR) [21] is used as the objective measure. The ratios are computed by the BSS Eval toolbox v3.0 [23]. Following [9], we compute the normalized SDR (NSDR) and the weighted average of NSDR, the global NSDR (GNSDR), with the weighting proportional to the length of each file. To have a fair comparison, we compare our method with other unsupervised methods, which extract vocal clips only through one major stage. The compared algorithms are listed below: I. II. III. 6. ACKNOWLEDGEMENTS This research is supported by the National Science Council, Taiwan under Grant No NSC 102-2220-E-009-049 and the Biomedical Electronics Translational Research Center, NCTU. 7. REFERENCES [1] T. Barker and T. Virtanen, "Non-negative tensor factorization of modulation spectrograms for monaural sound source separation," Proc. of Interspeech, pp. 827-831, 2013. [2] T. Chi, P. Ru, and S. A. Shamma, "Multiresolution spectrotemporal analysis of complex sounds," J. Acoust. Soc. Am., Vol. 118, No. 2, pp. 887-906, 2005. [3] T. Chi, Y. Gao, M. C. Guyton, P. Ru, and S. Shamma, "Spectro-temporal modulation transfer functions and speech intelligibility," J. Acoust. Soc. Am., Vol. 106, No. 5, pp. 2719-2732, 1999. Hsu: the approach proposed in [9] that performs unvoiced sound separation combined with the pitch-based inference method in [13]. R (REPET with soft masking): the approach proposed in [16] that computes a repeating background structure and extract vocal with soft time-frequency masking. RPCA: a matrix decomposition method applying robust principal component analysis proposed by Huang et al. [8]. [4] T.-S. Chi and C.-C. 
Hsu, "Multiband analysis and synthesis of spectro-temporal modulations of Fourier spectrogram," J. Acoust. Soc. Am., Vol. 129, No. 5, pp. EL190-EL196, 2011. From Figure 5, we can observe that the proposed method has the highest performance tied with RPCA in the -5 dB SNR condition. In 0 and 5 dB SNR conditions, the performance of the proposed method is comparable to the performance of REPET. [5] J.-L. Durrieu, B. David, and G. Richard, "A musically motivated mid-level representation for pitch estimation and musical audio source separation, "IEEE J. of Selected Topics on Signal Process.," Vol. 5, No. 6, pp. 1180-1191, 2011. [6] M. Elhilali and S. A. Shamma, "A cocktail party with a cortical twist: how cortical mechanisms contribute to sound segregation, " J. Acoust. Soc. Am., Vol. 124, No. 6, pp. 3751-3771, 2008. 5. CONCLUSION In this paper, we propose a singing voice separation method utilizing the spectral-temporal modulations as clustering features. Based on the energy distributions on the rate-scale plots of T-F units, the vocal signal is extracted from the auditory spectrogram and the separation performance is evaluated using the MIR-1K dataset. Our proposed CASA-based masking method outperforms the CASA-based system in [9] and has comparable perfor- [7] D. FitzGerald and M. Gainza, "Single channel vocal separation using median filtering and factorization techniques," ISAST Trans. on Electron. and Signal Process., Vol. 4, No. 1, pp. 62-73 (ISSN 1797-2329), 2010. 621 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Based on Two-stage Harmonic/Percussive Sound Separation on Multiple Resolution Spectrograms," IEEE/ACM Trans. on Audio, Speech, and Language Process., Vol. 22, No. 1, pp. 228-237, 2014. [8] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," Porc. IEEE Int. Conf. on Acoust., Speech and Signal Process., pp. 57-60, 2012. [20] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 337–344, 2005. [9] C.-L. Hsu and J.-S. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. on Audio, Speech, and Language Process., Vol. 18, No. 2, pp. 310-319, 2010. [21] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. on Audio, Speech, and Language Process., Vol. 14, No. 4, pp. 1462-1469, 2006. [10] C.-C. Hsu, T.-E. Lin, J.-H. Chen, and T.-S. Chi, "Voice activity detection based on frequency modulation of harmonics," IEEE Int. Conf. on Acoust. , Speech and Signal Process., pp. 6679-6683, 2013. [22] Y. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 427-432, 2013. [11] D. Klein, and S. A. Shamma, "The case of the missing pitch templates: how harmonic templates emerge in the early auditory system," J. Acoust. Soc. Am., Vol. 107, No. 5, pp. 2631-2644, 2000. [23] http://bass-db.gforge.inria.fr/bss_eval/ [24] http://www.isr.umd.edu/Labs/NSL/nsl.html [12] H. Lei, B. T. Meyer, and N. Mirghafori, "Spectrotemporal Gabor features for speaker recognition," IEEE Int. Conf. on Acoust., Speech and Signal Process., pp. 4241-4244, 2012. [13] Y. Li and D. 
Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. on Audio, Speech, and Language Process., Vol. 15, No. 4, pp. 1475-1487, 2007. [14] B. C. J. Moore: An Introduction to the Psychology of Hearing 5th Ed., Academic Press, 2003. [15] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, “Adaptation of Bayesian models for single channel source separation and its application to voice / music separation in popular songs, "IEEE Trans. on Audio, Speech, and Language Process.," special issue on Blind Signal Proc. for Speech and Audio Applications, Vol. 15, No. 5, pp. 1564-1578, 2007. [16] Z. Rafii and B. Pardo, "REpeating Pattern Extraction Technique (REPET): A Simple Method for Music/Voice Separation," IEEE Trans. on Audio, Speech, and Language Process., Vol. 21, No. 1, pp. 73-84, 2013. [17] P. Sprechmann, A. Bronstein, and G. Sapiro, "Realtime online singing voice separation from monaural recordings using robust low-rank modeling," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 67–72, 2012. [18] R. M. Stern and N. Norgan, "Hearing is believing: biologically inspired methods for robust automatic speech recognition," IEEE Signal Process. Mag., Vol. 29, No. 6, pp. 34–43, 2012. [19] H. Tachibana, N. Ono, and S. Sagayama, "Singing Voice Enhancement in Monaural Music Signals 622 15th International Society for Music Information Retrieval Conference (ISMIR 2014) HARMONIC-TEMPORAL FACTOR DECOMPOSITION INCORPORATING MUSIC PRIOR INFORMATION FOR INFORMED MONAURAL SOURCE SEPARATION Tomohiko Nakamura† , Kotaro Shikata† , Norihiro Takamune† , Hirokazu Kameoka†‡ † Graduate School of Information Science and Technology, The University of Tokyo. ‡ NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation. {nakamura,k-shikata,takamune,kameoka}@hil.t.u-tokyo.ac.jp ABSTRACT For monaural source separation two main approaches have thus far been adopted. One approach involves applying non-negative matrix factorization (NMF) to an observed magnitude spectrogram, interpreted as a non-negative matrix. The other approach is based on the concept of computational auditory scene analysis (CASA). A CASAbased approach called the “harmonic-temporal clustering (HTC)” aims to cluster the time-frequency components of an observed signal based on a constraint designed according to the local time-frequency structure common in many sound sources (such as harmonicity and the continuity of frequency and amplitude modulations). This paper proposes a new approach for monaural source separation called the “Harmonic-Temporal Factor Decomposition (HTFD)” by introducing a spectrogram model that combines the features of the models employed in the NMF and HTC approaches. We further describe some ideas how to design the prior distributions for the present model to incorporate musically relevant information into the separation scheme. 1. INTRODUCTION Monaural source separation is a process in which the signals of concurrent sources are estimated from a monaural polyphonic signal and is one of fundamental objectives offering a wide range of applications such as music information retrieval, music transcription and audio editing. While we can use spatial cues for blind source separation with multichannel inputs, for monaural source separation we need other cues instead of the spatial cues. For monaural source separation two main approaches have thus far been adopted. One approach is based on the concept of computational auditory scene analysis (e.g., [7]). 
The auditory scene analysis process described by Bregman [1] involves grouping elements that are likely to have originated from the same source into a perceptual structure called an auditory stream. In [8, 10], an attempt has been made to imitate this process by clustering timefrequency components based on a constraint designed according to the auditory grouping cues (such as the har- monicity and the coherences and continuities of amplitude and frequency modulations). This method is called “harmonic-temporal clustering (HTC).” The other approach involves applying non-negative matrix factorization (NMF) to an observed magnitude spectrogram (time-frequency representation) interpreted as a non-negative matrix [19]. The idea behind this approach is that the spectrum at each frame is assumed to be represented as a weighted sum of a limited number of common spectral templates. Since the spectral templates and the mixing weights should both be non-negative, this implies that an observed spectrogram is modeled as the product of two non-negative matrices. Thus, factorizing an observed spectrogram into the product of two non-negative matrices allows us to estimate the unknown spectral templates constituting the observed spectra and decompose the observed spectra into components associated with the estimated spectral templates. The two approaches described above rely on different clues for making separation possible. Roughly speaking, the former approach focuses on the local time-frequency structure of each source, while the latter approach focuses on a relatively global structure of music spectrograms (such a property that a music signal typically consists of a limited number of recurring note events). Rather than discussing which clues are more useful, we believe that both of these clues can be useful for achieving a reliable monaural source separation algorithm. This belief has led us to develop a new model and method for monaural source separation that combine the features of both HTC and NMF. We call the present method “harmonic-temporal factor decomposition (HTFD).” The present model is formulated as a probabilistic generative model in such a way that musically relevant information can be flexibly incorporated into the prior distributions of the model parameters. Given the recent progress of state-of-the-art methods for a variety of music information retrieval (MIR)-related tasks such as audio key detection, audio chord detection, and audio beat tracking, information such as key, chord and beat extracted from the given signal can potentially be utilized as reliable and useful prior information for source separation. The inclusion of auxiliary information in the separation scheme is referred to as informed source separation and is gaining increasing momentum in recent years (see e.g., among others, [5,15,18,20]). This paper further describes some ideas how to design the prior distributions for the present model to incorporate musically relevant information. We henceforth denote the normal, Dirichlet and Poisson c Tomohiko Nakamura† , Kotaro Shikata† , Norihiro Takamune† , Hirokazu Kameoka†‡ . Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Tomohiko Nakamura† , Kotaro Shikata† , Norihiro Takamune† , Hirokazu Kameoka†‡ . “Harmonictemporal factor decomposition incorporating music prior information for informed monaural source separation”, 15th International Society for Music Information Retrieval Conference, 2014. 
623 15th International Society for Music Information Retrieval Conference (ISMIR 2014) distributions by N, Dir and Pois, respectively. at time t is expressed as a harmonically-spaced Gaussian mixture function. If we assume the additivity of power spectra, the power spectrogram of a superposition of K pitched sounds is given by the sum of Eq. (8) over k. It should be noted that this model is identical to the one employed in the HTC approach [8]. Although we have defined the spectrogram model above in continuous time and continuous log-frequency, we actually obtain observed spectrograms as a discrete timefrequency representation through computer implementations. Thus, we henceforth use Yl,m := Y(xl , tm ) to denote an observed spectrogram where xl (l = 1, . . . , L) and tm (m = 1, . . . , M) stand for the uniformly-quantized logfrequency points and time points, respectively. We will also use the notation Ωk,m and ak,n,m to indicate Ωk (tm ) and ak,n (tm ). 2. SPECTROGRAM MODEL OF MUSIC SIGNAL 2.1 Wavelet transform of source signal model As in [8], this section derives the continuous wavelet transform of a source signal. Let us first consider as a signal model for the sound of the kth pitch the analytic signal representation of a pseudo-periodic signal given by N ak,n (u)ej(nθk (u)+ϕk,n ) , (1) fk (u) = n=1 where u denotes the time, nθk (u) + ϕk,n the instantaneous phase of the n-th harmonic and ak,n (u) the instantaneous amplitude. This signal model implicitly ensures not to violate the ‘harmonicity’ and ‘coherent frequency modulation’ constraints of the auditory grouping cues. Now, let the wavelet basis function be defined by u − t 1 , (2) ψ ψα,t (u) = √ α 2πα where α is the scale parameter such that α > 0, t the shift parameter and ψ(u) the mother wavelet with the center frequency of 1 satisfying the admissibility condition. ψα,t (u) can thus be used to measure the component of period α at time t. The continuous wavelet transform of fk (u) is then defined by ∞ N ak,n (u)ej(nθk (u)+ϕk,n ) ψ∗α,t (u)du. (3) Wk (log α1 , t) = 2.2 Incorporating source-filter model The generating processes of many sound sources in real world can be explained fairly well by the source-filter theory. In this section, we follow the idea described in [12] to incorporate the source-filter model into the above model. Let us assume that each signal fk (u) within a short-time segment is an output of an all-pole system. That is, if we use fk,m [i] to denote the discrete-time representation of fk (u) within a short-time segment centered at time tm , fk,m [i] can be described as P βk,m [p] fk,m [i − p] + k,m [i], (9) βk,m [0] fk,m [i] = −∞ n=1 Since the dominant part of ψ∗α,t (u) is typically localized around time t, the result of the integral in Eq. (3) shall depend only on the values of θk (u) and ak,n (u) near t. By taking this into account, we replace θk (t) and ak,n (t) with zero- and first-order approximations around time t: ak,n (u) ak,n (t), θk (u) θk (t) + θ̇k (t)(u − t). (4) p=1 where i, k,m [i], and βk,m [p] (p = 0, . . . , P) denote the discrete-time index, an excitation signal, and the autoregressive (AR) coefficients, respectively. As we have already assumed in 2.1 that the F0 of fk,m [i] is eΩk,m , to make the assumption consistent, the F0 of the excitation signal k,m [i] must also be eΩk,m . We thus define k,m [i] as N Ωk,m vk,n,m e jne iu0 , (10) k,m [i] = Note that the variable θ̇k (u) corresponds to the instantaneous fundamental frequency (F0 ). 
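As a concrete illustration of the signal model in Eq. (1), the following minimal Python sketch (our own function and variable names, assuming a constant F0 and constant harmonic amplitudes) synthesises such an analytic pseudo-periodic signal; time-varying a_{k,n}(u) or theta_k(u) would simply replace the scalar arguments by arrays over the time axis u.

import numpy as np

def pseudo_periodic_signal(f0, amplitudes, phases, duration, fs=16000.0):
    """Analytic pseudo-periodic signal in the spirit of Eq. (1) for one pitch k."""
    u = np.arange(int(duration * fs)) / fs        # time axis u
    theta = 2.0 * np.pi * f0 * u                  # phase whose derivative is the fundamental (angular) frequency
    f = np.zeros_like(u, dtype=complex)
    for n, (a, phi) in enumerate(zip(amplitudes, phases), start=1):
        # n-th harmonic: a_{k,n} * exp(j(n*theta_k(u) + phi_{k,n}))
        f += a * np.exp(1j * (n * theta + phi))
    return f

# Example: an A4-like tone with three harmonics.
signal = pseudo_periodic_signal(440.0, amplitudes=[1.0, 0.5, 0.25],
                                phases=[0.0, 0.0, 0.0], duration=0.5)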
By undertaking the above approximations, applying the Parseval’s theorem, and putting x = log(1/α) and Ωk (t) = log θ̇k (t), we can further write Eq. (3) as N ak,n (t)Ψ∗ (ne−x+Ωk (t) )e j(nθk (t)+ϕk,n ) , (5) Wk (x, t) = n=1 where u0 denotes the sampling period of the discrete-time representation and vk,n,m denotes the complex amplitude of the nth partial. By applying the discrete-time Fourier transform (DTFT) to Eq. (9) and putting Bk,m (z) := βk,m [0] − βk,m [1]z−1 · · · − βk,m [P]z−P , we obtain √ N 2π vk,n,m δ(ω − neΩk,m u0 ), (11) Fk,m (ω) = Bk,m (e jω ) n=1 n=1 where x denotes log-frequency and Ψ the Fourier transform of ψ. Since the function Ψ can be chosen arbitrarily, as with [8], we employ the following unimodal real function whose maximum is taken at ω = 1: ⎧ (log ω)2 ⎪ ⎪e− 4σ2 ⎨ (ω > 0) . Ψ(ω) = ⎪ (6) ⎪ ⎩0 (ω ≤ 0) Eq. (5) can then be written as N (x−Ωk (t)−log n)2 ak,n (t)e− 4σ2 e j(nθk (t)+ϕk,n ) . Wk (x, t) = where Fk,m denotes the DTFT of fk,m , ω the normalized angular frequency, and δ the Dirac delta function. The inverse DTFT of Eq. (11) gives us another expression of fk,m [i]: N Ωk,m vk,n,m fk,m [i] = (12) e jne iu0 . Ωk,m jne u0 ) n=1 Bk,m (e By comparing Eq. (12) and the discrete-time representation of Eq. (1), we can associate the parameters of the source filter model defined above with the parameters introduced in 2.1 through the explicit relationship: vk,n,m . (13) |ak,n,m | = Bk,m (e jneΩk,m u0 ) (7) n=1 If we now assume that the time-frequency components are sparsely distributed so that the partials rarely overlap each other, |Wk (x, t)|2 is given approximately as N (x−Ωk (t)−log n)2 |ak,n (t)|2 e− 2σ2 . (8) |Wk (x, t)|2 n=1 2.3 Constraining model parameters The key assumption behind the NMF model is that the spectra of the sound of a particular pitch is expressed as This assumption means that the power spectra of the partials can approximately be considered additive. Note that a cutting plane of the spectrogram model given by Eq. (8) 624 15th International Society for Music Information Retrieval Conference (ISMIR 2014) a multiplication of time-independent and time-dependent factors. In order to extend the NMF model to a more reasonable one, we consider it important to clarify which factors involved in the spectra should be assumed to be timedependent and which factors should not. For example, the F0 must be assumed to vary in time during vibrato or portamento. Of course, the scale of the spectrum should also be assumed to be time-varying (as with the NMF model). On the other hand, the timbre of an instrument can be considered relatively static throughout an entire piece of music. We can reflect these assumptions in the present model in the following way. For convenience of the following analysis, we factorize |ak,n,m | into the product of two variables, wk,n,m and Uk,m |ak,n,m | = wk,n,m Uk,m . (14) wk,n,m can be interpreted as the relative power of the nth harmonic and Uk,m as the time-varying normalized ampli tude of the sound of the kth pitch such that k,m Uk,m = 1. In the same way, let us put vk,n,m as vk,n,m = w̃k,n,m Uk,m . (15) 7000 Frequency [Hz] 3929 1238 695 390 0 0.37 0.73 1.1 1.46 1.83 Time [s] Figure 1. Power spectrogram of a violin vibrato sound. p(w|w̃, β, Ω) expressed by the Dirac delta function w̃k,n,m . δ wk,n,m − p(w|w̃, β, Ω) = Bk (e jneΩk,m u0 ) (20) k,n,m The conditional distribution p(w|β, Ω) can thus be obtained by defining the distribution p(w̃) and marginalizing over w̃. 
If we now assume that the complex amplitude w̃k,n,m follows a circular complex normal distribution (21) w̃k,n,m ∼ NC (w̃k,n,m ; 0, ν2 ), n = 1, . . . , N, Since the all-pole spectrum 1/|Bk,m (e jω )|2 is related to the timbre of the sound of the kth pitch, we want to constrain it to be time-invariant. This can be done simply by eliminating the subscript m. Eq. (13) can thus be rewritten as w̃k,n,m . (16) wk,n,m = Bk (e jneΩk,m u0 ) where NC (z; 0, ξ 2 ) = e−|z| /ξ /(πξ2 ), we can show, as in [12], that wk,n,m follows a Rayleigh distribution: 2 2 wk,n,m ∼ Rayleigh(wk,n,m ; ν/|Bk (e jne Ωk,m u 0 )|), (22) −z2 /(2ξ2 ) where Rayleigh(z; ξ) = (z/ξ )e . This defines the conditional distribution p(w|β, Ω). The F0 of stringed and wind instruments often varies continuously over time with musical expressions such as vibrato. For example, the F0 of a violin sound varies periodically around the note frequency during vibrato, as depicted in Fig. 1. Let us denote the standard log-F0 corresponding to the kth note by μk . To appropriately describe the variability of an F0 contour in both the global and local time scales, we design a prior distribution for Ωk := (Ωk,1 , Ωk,2 , . . . , Ωk,M )T by employing the productof-experts (PoE) [6] concept using two probability distributions. First, we design a distribution qg (Ωk ) describing how likely Ωk,1 , . . . , Ωk,L stay near μk . Second, we design another distribution ql (Ωk ) describing how likely Ωk,1 , . . . , Ωk,L are locally continuous along time. Here we define qg (Ωk ) and ql (Ωk ) as 2 We can use Ωk,m as is, since it is already dependent on m. To sum up, we obtain a spectrogram model Xl,m as ⎞ ⎛ N K (x −Ω −log n)2 ⎟ ⎜⎜⎜ 2 ⎟⎟⎟ − l k,m 2 ⎜ 2σ Ck,l,m , Ck,l,m = ⎜⎝ wk,n,m e Xl,m = ⎟⎠ Uk,m , n=1 k=1 2205 Hk,l,m (17) where Ck,l,m stands for the spectrogram of the kth pitch. If we denote the term insidethe parenthesis by Hk,l,m , Xl,m can be rewritten as Xl,m = k Hk,l,m Uk,m and so the relation to the NMF model may become much clearer. 2.4 Formulating probabilistic model Since the assumptions and approximations we made so far do not always hold exactly in reality, an observed spectrogram Yl,m may diverge from Xl,m even though the parameters are optimally determined. One way to simplify the process by which this kind of deviation occurs would be to assume a probability distribution of Yl,m with the expected value of Xl,m . Here, we assume that Yl,m follows a Poisson distribution with mean Xl,m Yl,m ∼ Pois(Yl,m ; Xl,m ), (18) qg (Ωk ) = N(Ωk ; μk 1 M , υ2k I M ), ql (Ωk ) = N(Ω k ; 0 M , τ2k D−1 ), (23) (24) ⎤ ⎡ ⎢⎢⎢ 1 −1 0 0 · · · 0 ⎥⎥⎥ ⎢⎢⎢−1 2 −1 0 · · · 0 ⎥⎥⎥ ⎥ ⎢⎢⎢ ⎢⎢⎢ 0 −1 2 −1 · · · 0 ⎥⎥⎥⎥⎥ ⎢ (25) D = ⎢⎢⎢ .. . . . . . . .. ⎥⎥⎥⎥⎥ , ⎢⎢⎢ . . . . . ⎥⎥ ⎥⎥⎥ ⎢⎢⎢ ⎢⎢⎣ 0 · · · 0 −1 2 −1⎥⎥⎦ 0 · · · 0 0 −1 1 where I M denotes an M × M identity matrix, D an M × M band matrix, 1 M an M-dimensional all-one vector, and 0 M an M-dimensional all-zero vector, respectively. υk denotes the standard deviation from mean μk , and τk the standard deviation of the F0 jumps between adjacent frames. The prior distribution of Ωk is then derived as (26) p(Ωk ) ∝ qg (Ωk )αg ql (Ωk )αl where αg and αl are the hyperparameters that weigh the where Pois(z; ξ) = ξ z e−ξ /Γ(z). This defines our likelihood function Pois(Yl,m ; Xl,m ), (19) p(Y|θ) = l,m where Y denotes the set consisting of Yl,m and Θ the entire set consisting of the unknown model parameters. 
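For concreteness, a minimal sketch (our own names, not the authors' implementation) of the product-of-experts prior on the F0 contour defined in Eqs. (23)–(26): it builds the band matrix D of Eq. (25) and evaluates the unnormalised log-prior. The combined precision (alpha_g / upsilon_k^2) I_M + (alpha_l / tau_k^2) D is the matrix that reappears in the update of Omega_k later in the paper.

import numpy as np

def second_difference_matrix(M):
    """The M x M band matrix D of Eq. (25)."""
    D = 2.0 * np.eye(M) - np.eye(M, k=1) - np.eye(M, k=-1)
    D[0, 0] = D[-1, -1] = 1.0
    return D

def log_f0_prior(omega_k, mu_k, upsilon_k, tau_k, alpha_g, alpha_l):
    """Unnormalised log of the PoE prior of Eq. (26):
    alpha_g * log q_g(Omega_k) + alpha_l * log q_l(Omega_k), constants dropped."""
    M = len(omega_k)
    D = second_difference_matrix(M)
    log_qg = -0.5 * np.sum((omega_k - mu_k) ** 2) / upsilon_k ** 2   # stays near mu_k
    log_ql = -0.5 * (omega_k @ D @ omega_k) / tau_k ** 2             # local continuity along time
    return alpha_g * log_qg + alpha_l * log_ql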
It should be noted that the maximization of the Poisson likelihood with respect to Xl,m amounts to optimally fitting Xl,m to Yl,m by using the I-divergence as the fitting criterion. Eq. (16) implicitly defines the conditional distribution 625 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Ak = (Ak,1 , . . . , Ak,M )T . Here we introduce Dirichlet distributions: Ak ∼ Dir( Ak ; γ(A) R ∼ Dir(R; γ(R) ), (28) k ), " ξi (A) (A) (A) T where Dir(z; ξ) ∝ i zi , γk := (γk,1 , . . . , γk,M ) , and (R) T γ(R) := (γ1(R) , . . . , γ(R) K ) . For p(R), we set γk at a reasonably high value if the kth pitch is contained in the scale and (A) < 1 so that the Dirichlet vice versa. For p( Ak ), we set γk,m distribution becomes a sparsity inducing distribution. contributions of qg (Ωk ) and ql (Ωk ) to the prior distribution. 2.5 Relation to other models It should be noted that the present model is related to other models proposed previously. If we do not assume a parametric model for Hk,l,m and treat each Hk,l,m itself as the parameter, the spectrogram model Xl,m can be seen as an NMF model with timevarying basis spectra, as in [14]. In addition to this assumption, if we assume that Hk,l,m is time-invariant (i.e., Hk,l,m = Hk,l ), Xl,m reduces to the regular NMF model [19]. Furthermore, if we assume each basis spectrum to have a harmonic structure, Xl,m becomes equivalent to the harmonic NMF model [16, 21]. If we assume that Ωk,m is equal over time m, Xl,m reduces to a model similar to the ones described in [17, 22]. Furthermore, if we describe Uk,m using a parametric function of m, Xl,m becomes equivalent to the HTC model [8, 10]. With a similar motivation, Hennequin et al. developed an extension to the NMF model defined in the short-time Fourier transform domain to allow the F0 of each basis spectrum to be time-varying [4]. 4. PARAMETER ESTIMATION ALGORITHM Given an observed power spectrogram Y := {Yl,m }l,m , we would like to find the estimates of Θ := {Ω, w, β, V, R, A} that maximizes the posterior density p(Θ|Y) ∝ p(Y|Θ)p(Θ). We therefore consider the problem of maximizing L(Θ) := ln p(Y|Θ) + ln p(Θ), (29) with respect to Θ where # $ Yl,m ln Xl,m − Xl,m (30) ln p(Y|Θ) = c l,m ln p(Θ) = ln p(w|β, Ω) + 3. INCORPORATION OF AUXILIARY INFORMATION 3.1 Use of musically relevant information We consider using side-information obtained with the state-of-the-art methods for MIR-related tasks including key detection, chord detection and beat tracking to assist source separation. When multiple types of side-information are obtained for a specific parameter, we can combine the use of the mixture-of-experts and PoE [6] concepts according to the “AND” and “OR” conditions we design. For example, pitch occurrences typically depend on both the chord and key of a piece of music. Thus, when the chord and key information are obtained, we may use the product-of-experts concept to define a prior distribution for the parameters governing the likeliness of the occurrences of the pitches. In the next subsection, we describe specifically how to design the prior distributions. + ln p(R) + ln p(Ωk ) k ln p( Ak ). (31) k =c denotes equality up to constant terms. Since the first term of Eq. (30) involves summation over k and n, analytically solving the current maximization problem is intractable. However, we can develop a computationally efficient algorithm for finding a locally optimal solution based on the auxiliary function concept, by using a similar idea described in [8, 12]. 
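To make the relation between the spectrogram model and the fitting criterion explicit, the sketch below builds X_{l,m} of Eq. (17) from (w, Omega, U) and evaluates the Poisson log-likelihood of Eq. (30) up to constants, i.e. the I-divergence criterion mentioned above. It is a minimal illustration with our own array layout, not the authors' implementation.

import numpy as np

def model_spectrogram(x, omega, w, U, sigma):
    """Spectrogram model X[l, m] of Eq. (17).
    x: (L,) log-frequency grid, omega: (K, M) log-F0 contours,
    w: (K, N, M) harmonic weights, U: (K, M) normalised amplitudes."""
    K, N, M = w.shape
    n = np.arange(1, N + 1)
    # distance of each log-frequency bin from each partial centre Omega_{k,m} + log n
    d = (x[None, :, None, None]
         - omega[:, None, None, :]
         - np.log(n)[None, None, :, None])           # shape (K, L, N, M)
    H = np.sum(w[:, None, :, :] ** 2 * np.exp(-d ** 2 / (2 * sigma ** 2)), axis=2)  # (K, L, M)
    return np.sum(H * U[:, None, :], axis=0)          # (L, M)

def neg_poisson_loglik(Y, X, eps=1e-12):
    """Minus log p(Y|Theta) of Eq. (30) up to terms constant in X;
    minimising it is the I-divergence fit of X to Y."""
    return np.sum(X - Y * np.log(X + eps))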
When applying an auxiliary function approach to a certain maximization problem, the first step is to define a lower bound function for the objective function. As mentioned earlier, the difficulty with the current maximization problem lies in the first term in Eq. (30) . By using the fact that the logarithm function is a concave function, we can invoke the Jensen’s inequality 3.2 Designing prior distributions The likeliness of the pitch occurrences in popular and classical western music usually depend on the key or the chord used in that piece. The likeliness of the pitch occurrences can be described as a probability distribution over the relative energies of the sounds of the individual pitches. Since the number of times each note is activated is usually limited, inducing sparsity to the temporal activation of each note event would facilitate the source separation. The likeliness of the number of times each note is activated can be described as well as a probability distribution over the temporal activations of the sound of each pitch. To allow for designing such prior distributions, we dethe pitchcompose Uk,m as the product of two variables: wise relative energy Rk = m Uk,m (i.e. k Rk = 1), and the pitch-wise normalized amplitude Ak,m = Uk,m /Rk (i.e. m Ak,m = 1). Hence, we can write (27) Uk,m = Rk Ak,m . This decomposition allows us to incorporate different kinds of prior information into our model by separately defining prior distributions over R = (R1 , . . . , RK )T and Yl,m ln Xl,m ≥ Yl,m λk,n,l,m ln w2k,n,m e− k,n (xl −Ωk,m −log n)2 2σ2 λk,n,l,m Uk,m , (32) to obtain a lower bound function, where λk,n,l,m is a positive variable that sums to unity: k,n λk,n,l,m = 1. Equality of (32) holds if and only if λk,n,l,m = w2k,n,m e− (xl −Ωk,m −log n)2 2σ2 Xl,m Uk,m . (33) Although one may notice that the second term in Eq. (30) is nonlinear in Ωk,m , the summation of Xl,m over fairly well using the integral % ∞ l can be approximated X(x, t )dx, since X is the sum of the values at the m l,m l −∞ sampled points X(x1 , tm ), . . . , X(xL , tm ) with an equal interval, say Δ x . Hence, ∞ 1 Xl,m X(x, tm )dx Δ x −∞ l ∞ (x−Ω −log n)2 k,m 1 2 = wk,n,m Uk,m e− 2σ2 dx Δ x k,n −∞ 626 15th International Society for Music Information Retrieval Conference (ISMIR 2014) √ 2πσ Uk,m w2k,n,m . Δx n k (34) 5555 Frequency [Hz] = This approximation implies that the second term in Eq. (30) depends little on Ωk,m . An auxiliary function can thus be written as + L (Θ, λ) = Yl,m λk,n,l,m ln w2k,n,m e− −ln n)2 (xl −Ωk,m 2σ2 λk,n,l,m l,m k,n √ 2πσ − Uk,m w2k,n,m + ln p(Θ). Δx m n k c Uk,m 1750 551 174 2 4 6 8 Time [s] (35) Figure 2. Power spectrogram of a mixed audio signal of three violin vibrato sounds (D4, F4 and A4). We can derive update equations for the model parameters, using the above auxiliary function. By setting at zero the partial derivative of L+ (Θ, λ) with respect to each of the model parameters, we obtain l Yl,m λk,n,l,m + 1/2 , (36) w2k,n,m ← √ Ω 2πRk Ak,m σ/Δ x + ν2 /(2|Bk (e jne k,m u0 )|2 ) ⎛ ⎞−1 ⎜⎜⎜ αl ⎟⎟⎟ α g Ωk ← ⎜⎜⎜⎝ 2 D + 2 I M + diag( pk,n,l )⎟⎟⎟⎠ τ υk n,l ⎛ ⎞ ⎜⎜⎜ αg ⎟⎟⎟ (37) × ⎜⎜⎝⎜μk 2 1 M + (xl − ln n)pk,n,l ⎟⎟⎠⎟ , υk n,l (R) l,m Yl,m n λk,n,l,m + γk − 1 Rk ∝ , (38) 2 m,n Ak,m wk,m,n (A) l Yl,m n λk,n,l,m + γk,m − 1 Ak,m ∝ , (39) Rk n w2k,m,n ' 1 & pk,n,l := 2 Yl,1 λk,n,l,1 , Yl,2 λk,n,l,2 , · · · , Yl,M λk,n,l,M , (40) σ were artificially made by mixing D4, F4 and A4 violin vibrato sounds from the RWC instrument database [3]. In this paper, the F0 of the pitch name A4 was set at 440 Hz. 
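For reference, the chromatic grid of standard log-F0s used in these experiments, mu_k = ln 55 + ln 2 · (k − 1)/12 for k = 1, ..., 73 (A1 to A7, stated below), can be reproduced directly; the short sketch also checks that A4 lands at 440 Hz.

import numpy as np

K = 73                                            # chromatic grid A1, A#1, ..., A7
k = np.arange(1, K + 1)
mu = np.log(55.0) + np.log(2.0) * (k - 1) / 12.0  # standard log-F0 mu_k of the k-th note
f0 = np.exp(mu)                                   # fundamental frequencies in Hz

assert np.isclose(f0[36], 440.0)                  # k = 37 (A4) is 440 Hz
assert np.isclose(f0[-1], 3520.0)                 # k = 73 (A7) is 3520 Hz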
The power spectrogram of the mixed signal is shown in Fig. 2. To convert the signal into a spectrogram, we employed the fast approximate continuous wavelet transform [9] with a 16 ms time-shift interval. {xl }l ranged 55 to 7040 Hz per 10 cent. The parameters of HTFD were = (1 − 3.96 × 10−6 )1I , (τk , vk ) = (0.83, 1.25) set at γ(A) k for all k, (N, K, σ, αg , α s ) = (8, 73, 0.02, 1, 1), and γ(R) = (1−2.4×10−3 )1K . {μk }k ranged A1 to A7 with a chromatic interval, i.e. μk = ln(55) + ln(2) × (k − 1)/12. The number of NMF bases were set at three. The parameter updates of both HTFD and NMF were stopped at 100 iterations. While the estimates of spectrograms obtained with NMF were flat and the vibrato spectra seemed to be averaged (Fig. 3 (a)), those obtained with HTFD tracked the F0 contours of the vibrato sounds appropriately (Fig. 3 (b)), and clear vibrato sounds were contained in the separated audio signals by HTFD. where diag( p) converts a vector p into a diagonal matrix with the elements of p on the main diagonal. As for the update equations for the AR coefficients β, we can invoke the method described in [23] with a slight modification, since the terms in the auxiliary function that depend on β has the similar form as the objective function defined in [23]. It can be shown that L+ can be increased by the following updates (the details are omitted owing to space limitations): (41) hk ← Ĉk (βk )βk , βk ← Ck−1 hk , 5.2 Separation using key information We next examined whether the prior information of a sound improve source separation accuracy. The key of the sound used in 5.1, was assumed as D major. The key information was incorporated in the estimation scheme by setting γk(R) = 1 − 2.4 × 10−3 for the pitch indices that are not contained in the D major scale and γk(R) = 1−3.0×10−3 for the pitch indices contained in that scale. The other conditions were the same as 5.1. With HTFD without using the key information, the estimated activations of the pitch indices that were not contained in the scale, in particular D4, were high as illustrated in Fig. 4 (a). In contrast, those estimated activations with HTFD using the key information were suppressed as shown in Fig. 4 (b). These results thus support strongly that incorporating prior information improve the source separation accuracy. where Ck and Ĉk (βk ) are (P + 1) × (P + 1) Toeplitz matrices, whose (p, q)-th elements are 2 1 wk,m,n Ck,p,q = cos[(p − q)neΩk,m u0 ], MN m,n 2ν 1 1 Ĉk,p,q (βk ) = cos[(p − q)neΩk,m u0 ]. jne MN m,n |Bk (e Ωk,m u0 )|2 (42) 5. EXPERIMENTS 5.3 Transposing from one key to another Here we show some results of an experiment on automatic key transposition [11] using HTFD. The aim of key transposition is to change the key of a musical piece to another key. We separated the spectrogram of a polyphonic sound into spectrograms of individual pitches using HFTD, transposed the pitches of the subset of the separated components, added all the spectrograms together to construct a pitch-modified polyphonic spectrogram, and constructed a In the following preliminary experiments, we simplified HTFD by omitting the source filter model and assuming the time-invariance of wk,m,n . 
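As an illustration of the key-informed prior used in Sec. 5.2, the sketch below marks which indices of the chromatic A1–A7 grid belong to the D major scale and assigns the two gamma^(R) values quoted there. The helper name and the pitch-class encoding are ours; only the scale membership and the numeric values come from the text.

import numpy as np

# Pitch classes of the D major scale (C = 0, C# = 1, ..., B = 11): D E F# G A B C#.
D_MAJOR = {2, 4, 6, 7, 9, 11, 1}

def key_informed_gamma_R(K=73, in_scale=1 - 3.0e-3, out_of_scale=1 - 2.4e-3):
    """Dirichlet hyperparameters gamma^(R) over the chromatic grid,
    chosen by D major scale membership (values as quoted in Sec. 5.2)."""
    pitch_class = (9 + np.arange(K)) % 12          # index k = 1 is A1, i.e. pitch class 9
    return np.where(np.isin(pitch_class, list(D_MAJOR)), in_scale, out_of_scale)

gamma_R = key_informed_gamma_R()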
5.1 F0 tracking of violin sound To confirm whether HTFD can track the F0 contour of a sound, we compared HTFD with NMF with the Idivergence, by using a 16 kHz-sampled audio signal which 627 15th International Society for Music Information Retrieval Conference (ISMIR 2014) A♭4 311 554 311 F4 Pitch 554 988 Frequency [Hz] 988 Frequency [Hz] Frequency [Hz] 988 554 D4 D♭4 311 Time 0 2 4 Time [s] 6 0 2 4 Time [s] 6 0 2 4 Time [s] 6 (a) Without key information (a) Estimates of spectrograms and F0 contours (orange lines) obtained with HTFD 311 554 311 F4 Pitch 554 A♭4 988 Frequency [Hz] 988 Frequency [Hz] Frequency [Hz] 988 554 D4 D♭4 311 Time 0 2 4 Time [s] 6 0 2 4 Time [s] 6 0 2 4 Time [s] 6 (b) With key information (b) Estimates of spectrograms obtained with NMF Figure 4. Temporal activations of Figure 3. Estimated spectrogram models by harmonic-temporal factor decomposi- A3–A4 estimated with HTFD using tion (HTFD) and non-negative matrix factorization (NMF). In left-to-right fashion, and without using prior information of the key. The red curves represent the spectrogram models are for D4, F4 and A4. the temporal activations of D4. time-domain signal from the modified spectrogram using the method described in [13]. For the key transposition, we adopted a simple way: To transpose, for example, from A major scale to A natural minor scale, we changed the pitches of the separated spectrograms corresponding to C, F and G to C, F and G, respectively. Some results are demonstrated in http://hil.t. u-tokyo.ac.jp/˜nakamura/demo/HTFD.html. 6. CONCLUSION This paper proposed a new approach for monaural source separation called the “Harmonic-Temporal Factor Decomposition (HTFD)” by introducing a spectrogram model that combines the features of the models employed in the NMF and HTC approaches. We further described some ideas how to design the prior distributions for the present model to incorporate musically relevant information into the separation scheme. 7. ACKNOWLEDGEMENTS This work was supported by JSPS Grant-in-Aid for Young Scientists B Grant Number 26730100. 8. REFERENCES [1] A. S. Bregman: Auditory Scene Analysis, MIT Press, Cambridge, 1990. [2] J. S. Downie, D. Byrd, and T. Crawford: “Ten years of ISMIR: Reflections on challenges and opportunities,” Proc. ISMIR, pp. 13–18, 2009. [3] M. Goto: “Development of the RWC Music Database,” Proc. ICA, pp. l–553–556, 2004. [4] R. Hennequin, R. Badeau, and B. David: “Time-dependent parametric and harmonic templates in non-negative matrix factorization,” Proc. DAFx, pp. 246–253, 2010. [5] R. Hennequin, B. David, and R. Badeau: “Score informed audio source separation using a parametric model of nonnegative spectrogram,” Proc. ICASSP, pp. 45–48, 2011. [6] G. E. Hinton: “Training products of experts by minimizing contrastive divergence,” Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002. [7] G. Hu, and D. L. Wang: “An auditory scene analysis approach to monaural speech segregation,” Topics in Acoust. Echo and Noise Contr., pp. 485–515, 2006. [8] H. Kameoka: Statistical Approach to Multipitch Analysis, PhD thesis, The University of Tokyo, Mar. 2007. [9] H. Kameoka, T. Tabaru, T. Nishimoto, and S. Sagayama: (Patent) Signal processing method and unit, in Japanese, Nov. 2008. [10] H. Kameoka, T. Nishimoto, and S. Sagayama: “A multipitch analyzer based on harmonic temporal structured clustering,” IEEE Trans. ASLP, vol. 15, no. 3, pp. 982–994, 2007. [11] H. Kameoka, J. Le Roux, Y. Ohishi, and K. 
Kashino: “Music Factorizer: A note-by-note editing interface for music waveforms,” IPSJ SIG Tech. Rep., 2009-MUS-81-9, in Japanese, Jul. 2009. [12] H. Kameoka: “Statistical speech spectrum model incorporating all-pole vocal tract model and F0 contour generating process model,” IEICE Tech. Rep., vol. 110, no. 297, SP2010-74, pp. 29–34, in Japanese, Nov. 2010. [13] T. Nakamura and H. Kameoka: “Fast signal reconstruction from magnitude spectrogram of continuous wavelet transform based on spectrogram consistency,” Proc. DAFx, 40, to appear, 2014. [14] M. Nakano, J. Le Roux, H. Kameoka, Y. Kitano, N. Ono, and S. Sagayama: “Nonnegative matrix factorization with Markov-chained bases for modeling time-varying patterns in music spectrograms,” Proc. LVA/ICA, pp. 149–156, 2010. [15] A. Ozerov, C. Févotte, R. Blouet, and J. L. Durrieu: “Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation,” Proc. ICASSP., pp. 257–260, 2011. [16] S. A. Raczyński, N. Ono, and S. Sagayama: “Multipitch analysis with harmonic nonnegative matrix approximation,” Proc. ISMIR, pp. 381–386, 2007. [17] D. Sakaue, T. Otsuka, K. Itoyama, and H. G. Okuno: “Bayesian nonnegative harmonic-temporal factorization and its application to multipitch analysis,” Proc. ISMIR, pp. 91– 96, 2012. [18] U. Simsekli and A. T. Cemgil: “Score guided musical source separation using generalized coupled tensor factorization,” Proc. EUSIPCO, pp. 2639–2643, 2012. [19] P. Smaragdis and J. C. Brown: “Non-negative matrix factorization for polyphonic music transcription,” Proc. WASPAA, pp. 177–180, 2003. [20] P. Smaragdis and G. J. Mysore: “Separation by ”humming”: User-guided sound extraction from monophonic mixtures,” Proc. WASPAA, pp. 69–72, 2009. [21] E. Vincent, N. Bertin, and R. Badeau: “Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription,” Proc. ICASSP, pp. 109–112, 2008. [22] K. Yoshii and M. Goto: “Infinite latent harmonic allocation: A nonparametric Bayesian approach to multipitch analysis,” Proc. ISMIR, pp. 309–314, 2010. [23] A. El-Jaroudi, J. Makhoul: “Discrete all-pole modeling,” IEEE Trans. SP, vol. 39, no. 2, pp. 411–423, 1991. 628 Oral Session 9 Rhythm & Beat 629 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 630 15th International Society for Music Information Retrieval Conference (ISMIR 2014) DESIGN AND EVALUATION OF ONSET DETECTORS USING DIFFERENT FUSION POLICIES Mi Tian, György Fazekas, Dawn A. A. Black, Mark Sandler Centre for Digital Music, Queen Mary University of London {m.tian, g.fazekas, dawn.black, mark.sandler}@qmul.ac.uk ABSTRACT Note onset detection is one of the most investigated tasks in Music Information Retrieval (MIR) and various detection methods have been proposed in previous research. The primary aim of this paper is to investigate different fusion policies to combine existing onset detectors, thus achieving better results. Existing algorithms are fused using three strategies, first by combining different algorithms, second, by using the linear combination of detection functions, and third, by using a late decision fusion approach. Large scale evaluation was carried out on two published datasets and a new percussion database composed of Chinese traditional instrument samples. 
An exhaustive search through the parameter space was used enabling a systematic analysis of the impact of each parameter, as well as reporting the most generally applicable parameter settings for the onset detectors and the fusion. We demonstrate improved results attributed to both fusion and the optimised parameter settings. 1. INTRODUCTION The automatic detection of onset events is an essential part in many music signal analysis schemes and has various applications in content-based music processing. Different approaches have been investigated for onset detection in recent years [1,2]. As the main contribution of this paper, we present new onset detectors using different fusion policies, with improved detection rates relying on recent research in the MIR community. We also investigate different configurations of onset detection and fusion parameters, aiming to provide a reference for configuring onset detection systems. The focus of ongoing onset detection work is typically targeting Western musical instruments. Apart from using two published datasets, a new database is incorporated into our evaluation, collecting percussion ensembles of Jingju, also denoted as Peking Opera or Beijing Opera, a major genre of Chinese traditional music 1 . By including this dataset, we aim at increasing the diversity of instrument categories in the evaluation of onset detectors, as well as extending the research to include non-Western music types. The goal of this paper can be summarised as follows: i) to evaluate fusion methods in comparison with the baseline algorithms, as well as a state-of-the-art method 2 ; ii) to investigate which fusion policies and which pair-wise combinations of onset detectors yield the most improvement over standard techniques; iii) to find the best performing configurations by searching through the multi-dimensional parameter space, hence identifying emerging patterns in the performances of different parameter settings, showing good results across different datasets; iv) to investigate the performance difference in Western and non-Western percussive instrument datasets. In the next section, we present a review of related work. Descriptions of the datasets used in this experiment are given in Section 3. In Section 4, we introduce different fusion strategies. Relevant post-processing and peak-picking procedures, as well as the parameter search process will be discussed in Section 5. Section 6 presents the results, with a detailed analysis and discussion of the performance of the fusion methods. Finally, the last section summarises our findings and provides directions for future work. 2. RELATED WORK Many onset detection algorithms and systems have been proposed in recent years. Common approaches using energy or phase information derived from the input signal include the high frequency content (HFC) and complex domain (CD) methods. See [1, 6] for detailed reviews and [9] for further improvements. Pitch contours and harmonicity information can also be indicators for onset events [7]. These methods shows some superiority over energy based ones in case of soft onsets. Onset detection systems using machine learning techniques have also been gaining popularity in recent years 3 . The winner of MIREX 2013 audio onset detection task utilises convolutional neural networks to classify and distinguish onsets from non-onset events in the spectrogram [13]. The data-driven nature of these methods makes the c Mi Tian, György Fazekas, Dawn A. A. Black, Mark San dler. 
Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Mi Tian, György Fazekas, Dawn A. A. Black, Mark Sandler. “Design and Evaluation of Onset Detectors Using Different Fusion Policies”, 15th International Society for Music Information Retrieval Conference, 2014. 1 http://en.wikipedia.org/wiki/Peking_opera Machine learning-based methods are excluded from this study to limit the scope of our work. 3 http://www.music-ir.org/mirex/wiki/2013: Audio_Onset_Detection 2 631 15th International Society for Music Information Retrieval Conference (ISMIR 2014) with 732 onsets. We also use NPP onsets from the first two datasets to form the fourth one, providing a direct comparison with the Chinese NPP instruments. All stimuli are mono signals sampled at 44.1kHz 6 and 16 bits per sample, having 3349 onsets in total. detection less dependent on onset types, though a computationally expensive training process is required. A promising approach for onset detection lies in the fusion of multiple detection methods. Zhou et al. proposed a system integrating two detection methods selected according to properties of the target onsets [17]. In [10], pitch, energy and phase information are considered in parallel for the detection of pitched onsets. Another fusion strategy is to combine peak score information to form new estimations of the onset events [8]. Albeit fusion has been used in previous work, there is a lack of systematic evaluation of fusion strategies and applications in the current literature. This paper focusses on the assessment of different fusion policies, from feature-level and detection function-level fusion to higher level decision fusion. The success of an onset detection algorithm largely depends on the signal processing methods used to extract salient features from the audio that emphasise the features characterising onset events as well as smoothing the noise in the detection function. Various signal processing techniques have been introduced in recent studies, such as vibrato suppression [3] and adaptive thresholding [1]. In [14], adaptive whitening is presented where each STFT bins magnitude is divided by the an average peak for that bin accumulated over time. This paper also investigates the performances of some commonly used signal processing modules within onset detection systems. 4. FUSION EXPERIMENT 3. DATASETS In this study, we use two previously released evaluation datasets and a newly created one. The first published dataset comes from [1], containing 23 audio tracks with a total duration of 190 seconds and having 1058 onsets. These are classified into four groups: pitched non-percussive (PNP), e.g. bowed strings, 93 onsets, pitched percussive (PP), e.g. piano, 482 onsets 4 , non-pitched percussive (NPP), e.g. drums, 212 onsets, and complex mixtures (CM), e.g. pop singing music, 271 onsets. The second set comes from [2] which is composed of 30 samples 5 of 10 second audio tracks, containing 1559 onsets in total, covering also four categories: PNP (233 onsets in total), PP (152 onsets), NPP (115 onsets), CM (1059 onsets). The use of these datasets enables us to test the algorithms on a range of different instruments and onset types, and provides for direct comparison with published work. The combined dataset used in the evaluation of our work is composed of these two sets. The third dataset consists of recordings of the four major percussion instruments in Jingju: bangu (clapper- drum), daluo (gong-1), naobo (cymbals), and xiaoluo (gong-2). 
The samples are manually mixed using individual recordings of these instruments, with possibly simultaneous onsets, to closely reproduce real-world conditions. See [15] for more details on the instrument types and the dataset. This dataset includes 10 samples of 30-second excerpts with 732 onsets.
4 A 7-onset discrepancy (482 instead of 489) from the reference paper is reported by the original author due to revisions of the annotations. 5 Only a subset of the dataset presented in the original paper was received from the author for the evaluation in this paper. 6 Some audio files were upsampled to obtain a uniform dataset.
The aim of information fusion is to merge information from heterogeneous sources to reduce the uncertainty of inferences [11]. In our study, six spectral-based onset detection algorithms are considered as baselines for fusion: high frequency content (HFC), spectral difference (SD), complex domain (CD), broadband energy rise (BER), phase deviation (PD), outlined in [1], and SuperFlux (SF) from recent work [4]. We also developed and included in the fusion a method based on Linear Predictive Coding [12], where the LPC coefficients are computed using the Levinson-Durbin recursion and the onset detection function is derived from the LPC error signal. Three fusion policies are used in our experiments: i) feature-level fusion, ii) fusion using the linear combination of detection functions, and iii) decision fusion by selecting and merging onset candidates. All pairwise combinations of the baseline algorithms are amenable to the latter two fusion policies. However, not all algorithms can be meaningfully combined using feature-level fusion. For example, CD can be considered an existing combination of SD and PD, so combining CD with either of these two at the feature level is not sensible. In this study, 10 feature-level fusion, 13 linear combination based fusion and 15 decision fusion based methods are tested. These are compared to the 7 original methods, giving us 45 detectors in total. In the following, we describe the specific fusion policies. We assume familiarity with onset detection principles and refrain from describing these details; please see [1] for a tutorial.
4.1 Feature-level Fusion In feature-level fusion, multiple algorithms are combined to compute fused features. For conciseness, we provide only one example combining BER and SF, denoted BERSF, utilising the vibrato suppression capability of SF [4] for detecting soft onsets, as well as the good performance of BER for detecting percussive onsets with sharp energy bursts [1]. Here, we use BER to mask the SF detection function as described by Equation (1). In essence, SF is used directly when there is evidence for a sharp energy rise; otherwise it is further smoothed using a median filter: ODF(n) = SF(n) if BER(n) > γ, and ODF(n) = λ SF′(n) otherwise, (1) where γ is an experimentally defined threshold, λ is a weighting constant set to 0.9 and SF′(n) is the median-filtered detection function with a window size of 3 frames.
4.2 Linear Combination of Detection Functions In this method, two time-aligned detection functions are used and their weighted linear combination is computed to form a new detection function, as shown in Equation (2): ODF(n) = w ODF1(n) + (1 − w) ODF2(n), (2) where ODF1 and ODF2 are two normalised detection functions and w is a weighting coefficient (0 ≤ w ≤ 1).
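The two fusion rules just defined can be written compactly; the sketch below is a minimal illustration (function names are ours), with the 3-frame median filter and λ = 0.9 taken from the text.

import numpy as np
from scipy.ndimage import median_filter

def bersf_feature_fusion(sf, ber, gamma, lam=0.9):
    """Feature-level BERSF fusion of Eq. (1): keep SF(n) where BER(n)
    indicates a sharp energy rise, otherwise use lambda times the
    median-filtered SF (window of 3 frames)."""
    sf_smooth = median_filter(np.asarray(sf, dtype=float), size=3)
    return np.where(np.asarray(ber) > gamma, sf, lam * sf_smooth)

def linear_fusion(odf1, odf2, w):
    """Linear combination of Eq. (2); both detection functions are
    assumed to be time-aligned and normalised, with 0 <= w <= 1."""
    return w * np.asarray(odf1) + (1.0 - w) * np.asarray(odf2)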
4.3 Decision Fusion 5.2.2 Backtracking This fusion method operates at a later stage and combines prior decisions of two detectors. Post-processing and peak picking are applied separately yielding two lists of onset candidates. Onsets from the two lists occurring within a fixed temporal tolerance window will be merged and accepted. Let T S1 and T S2 be the lists of onset locations given by two different detectors, i and j be indexes of onsets in the candidate lists and δ the tolerance time window. The final onset locations are generated using the fusion strategy described by Algorithm 1. In case of many musical instruments, onsets have longer transients without a sharp burst of energy rise. This may cause energy based detection functions to exhibit peaks after the perceived onset locations. Vos and Rasch conclude that onsets are perceived when the envelope reaches a level of roughly 6-15 dB below the maximum level of the tones [16]. Using this rationale, we trace the onset locations from the detected peak position back to a hypothesised earlier “perceived” location. The backtracking procedure is based on measuring relative differences in the detection function, as illustrated by Algorithm 2, where θ is the threshold used as a stopping condition. We use the implementation available in the QM Vamp Plugins. Algorithm 1 Onset decision fusion 1: procedure D ECISION F USION(T S1 , T S2 ) 2: I, J ← 0 : len(T S1 ) − 1, 0 : len(T S2 ) − 1 3: T S ← empty list 4: for all i, j in product(I, J) do 5: if abs(T S1 [i] − T S2 [j]) < δ then 6: insert sorted: T S ← mean(T S1 [i], T S2 [j]) 7: degree polynomial on the detection function around local maxima using a least squares method, following the QM Vamp Plugins 7 . The coefficients a and c of the quadratic equation y = ax2 + bx + c are used to detect both sharper peaks, under the condition a > tha , and peaks with a higher magnitude, when c > thc . The corresponding thresholds are computed from a single sensitivity parameter called threshold using tha = (100 − threshold)/1000 for the quadratic term and thc = (100 − threshold)/1500 for the constant term. The linear term b can be ignored. Algorithm 2 Backtracking Require: idx: index of a peak location in the ODF 1: procedure BACKTRACKING(idx, ODF, θ) 2: δ, γ ← 0 3: while idx > 1 do 4: δ ← ODF [idx] − ODF [idx − 1] 5: if δ < γ ∗ θ then 6: break 7: idx ← idx − 1 8: γ←δ 9: return idx return T S 5. PEAK PICKING AND PARAMETER SEARCH 5.1 Smoothing and Thresholding Post-processing is an optional stage to reduce noise that interferes with the selection of maxima in the detection function. In this study, three post-processing blocks are used: i) DC removal and normalisation, ii) zero-phase low-pass filtering and iii) adaptive thresholding. In conventional normalisation, data is scaled using a fixed constant. Here we use a normalisation coefficient computed by weighting the input exponentially. After removing constant offsets, the detection function is normalised using the coefficient AlphaNorm calculated by Equation (3): 5.3 Parameter Search An exhaustive search is carried out to find the configurations in the parameter space yielding the best detection rates. 
The following parameters and settings, related to the onset detection and fusion stages, are evaluated: i) adaptive whitening (wht) on/off; ii) detection sensitivity (threshold), ranging from 0.1 to 1.0 with an increment of 0.1; iii) backtracking threshold (θ), ranging from 0.4 to 2.4 with 8 equal subdivisions (the upper bound is set to an empirical value 2.4 in the experiment since the tracking will not go beyond the previous valley); iv) linear combination coefficient (w), ranging from 0.0 to 1.0 with an increment of 0.1; v) tolerance window length (δ) for decision fusion, ranging from 0.01 to 0.05 (in second) having 8 subdivisions. This gives a 5-dimensional space and all combinations of all possible values described above are evaluated. This results in 180 configurations in case of standard detectors and feature-level fusion, 1980 in case of linear fusion and 1620 for decision fusion. The configurations are described 1 |ODF (n)|α α AlphaN orm = (3) len(ODF ) A low-pass filter is applied to the detection function to reduce noise. To avoid introducing delays, a zero phase filter is employed at this stage. Finally, adaptive thresholding using a moving median filter is applied following Bello [1], to avoid the common pitfalls of using a fixed threshold for peak picking. n 5.2 Peak Picking 5.2.1 Polynomial Fitting The use of polynomial fitting allows for assessing the shape and magnitude of peaks separately. Here we fit a second- 7 633 http://www.vamp-plugins.org 15th International Society for Music Information Retrieval Conference (ISMIR 2014) using the Vamp Plugin Ontology 8 and the resulting RDF files are used by Sonic Annotator [5] to configure the detectors. The test result will thus give us not only the overall performance of each onset detector, but also uncover their strengths and limitations across different datasets and parameter settings. 6. EVALUATION AND RESULTS 6.1 Analysis of Overall Performance Figure 1 provides an overview of the results, showing the F-measure for the top 12 detectors in our study 9 . Detectors are ranked by the median showing the overall performance increase due to fusion across the entire range of parameter settings. Due to space limitations, only a subset of the results are reported in this paper. The complete result set for all tested detectors under all configurations on different datasets is available online 10 , together with Vamp plugins of all tested onset detectors. The names of the fusion algorithms come from the abbreviations of the constituent methods, while the numbers represent the fusion policy: 0: feature-level fusion, 1: linear combination of detection functions and 2: decision fusion. CDSF-1 yields improved F-measure for the combined dataset by 3.06% and 6.14% compared to the two original methods SF and CD respectively. Smaller interquartile ranges (IQRs) observed in case of CD, SD and HFC based methods show they have less dependency on the configuration. BERSF-2 and BERSF-1 vary the most in performance, also reflected from their IQRs. In case of BERSF2, the best performance is obtained using the widest considered tolerance window (0.05s), with modest sensitivity (40%). However, decreasing the tolerance window size has an adverse effect on the performance, yielding one of the lowest detection rates caused by the significant drop of recall. In case of BERSF-1, a big discrepancy between the best and worst performing configurations can be observed. 
This is partly because the highest sensitivity setting has a negative effect on SF causing very low precision. Table 1 shows the results ranked by F-measure, precision and recall with corresponding standard deviations for the ten best detectors as well as all baseline methods. Standard deviations are computed over the results for all configurations in each dataset. SF is ranked in the best performing ten, thus it is excluded from the baseline. Nine out of the top ten detectors are fusion methods. CDSF-1 performs the best for all datasets (including CHN-NPP and WESNPP that are not listed in the table) while BERSF yields the second best performance in the combined, WES-NPP and JPB datasets. Corresponding parameter settings for the combined dataset are given in Table 2. Fusion policies may perform differently in the evaluation. In case of feature-level fusion, we compared how combined methods score relative to their constituents. The Figure 1. F-meaure of all configurations for the top 12 detectors. (Min, first and third quartile and max value of the data are represented by the bottom bar of the whiskers, bottom and upper borders of the boxes and upper bar of the whiskers respectively. Median is shown by the red line) method CDSF-1 BERSF-1 BERSF-2 BERSF-0 CDSF-2 SF CDBER-1 BERSD-1 HFCCD-1 CDBER-2 mean std median mode threshold 10.0 10.0 40.0 30.0 50.0 20.0 10.0 10.0 20.0 50.0 25.90 15.01 20.0 10.0 θ 2.15 2.40 2.15 2.40 2.40 2.40 2.40 2.40 1.15 1.15 2.100 0.4848 2.15 2.40 wht off off off off off off off off off off off w 0.20 0.30 n/a n/a n/a n/a 0.50 0.60 0.50 n/a 0.4200 0.1470 0.50 0.50 δ (s) n/a n/a 0.05 n/a 0.05 n/a n/a n/a n/a 0.05 0.05 0.00 0.05 0.05 Table 2. Parameter settings for the ten best performing detectors, threshold: overall detection sensitivity; θ: backtracking threshold; wht: adaptive whitening; w: linear combination coefficient; δ: tolerance window size. performances vary between datasets, with only HFCBER0 outperforming both HFC and BER on the combined and SB datasets in terms of mean F-measure. However, five perform better than their two constitutes on JPB, two on CHN-NPP and five on WES-NPP dataset (these results are published online). A more detailed analysis of these performance differences constitutes future work. When comparing linear fusion of detection functions with decision fusion, the former performs better across all datasets in all but one cases, the fusion of HFC and BER. Even in this case, linear fusion yields close performance in terms of mean F-measure. Interesting observations also emerge for particular methods on certain datasets. The linear fusion based detectors involving LPC and PD (SDPD1 and LPCPD-1) show better performances in the case of the CHN-NPP dataset compared to their performances on other datasets as well those given by their constituent methods (please see table online). Further analysis, for instance, by looking at statistical significance of these observations is required to identify relevant instrument properties. When comparing BERSF-2, CDSF-2 and CDBER-2 to the other detectors in Table 1, notably higher standard deviations in recall and F-measure are shown, indicating this 8 http://www.omras2.org/VampOntology Due to different post-processing stages, the results reported here may diverge from previously published results. 
10 http://isophonics.net/onset-fusion 9 634 15th International Society for Music Information Retrieval Conference (ISMIR 2014) method F (combined) P (combined) R (combined) F (sb) P (sb) R (sb) F (jpb) P (jpb) R (jpb) CDSF-1 BERSF-1 BERSF-2 BERSF-0 CDSF-2 SF CDBER-1 BERSD-1 HFCCD-1 CDBER-2 0.8580 0.0613 0.8559 0.0941 0.8528 0.1684 0.8451 0.0722 0.8392 0.1537 0.8274 0.0719 0.8145 0.0809 0.8073 0.0792 0.8032 0.0472 0.7967 0.2231 0.7966 0.0492 0.7883 0.0942 0.7795 0.0466 0.7712 0.0412 0.7496 0.0658 0.6537 0.1084 0.9054 0.1195 0.8857 0.1363 0.8901 0.1411 0.8638 0.1200 0.8970 0.1129 0.8313 0.1209 0.8210 0.1276 0.8163 0.1311 0.8512 0.1179 0.8423 0.1404 0.8509 0.1164 0.7776 0.1184 0.8354 0.1269 0.8011 0.1225 0.7671 0.1103 0.5775 0.1008 0.8153 0.0609 0.8280 0.0866 0.8186 0.2028 0.8272 0.0701 0.7884 0.1855 0.8234 0.0657 0.8080 0.0792 0.7986 0.0812 0.7603 0.0734 0.7558 0.2398 0.7489 0.0672 0.7994 0.1001 0.7305 0.0733 0.7436 0.0898 0.7330 0.1061 0.7530 0.2235 0.8194 0.0598 0.8126 0.0961 0.8088 0.1677 0.8025 0.0723 0.7892 0.1758 0.8126 0.0744 0.7877 0.0829 0.7843 0.0828 0.7802 0.0448 0.7605 0.2279 0.7692 0.0467 0.7626 0.0974 0.7604 0.0450 0.7411 0.0375 0.7243 0.0657 0.6143 0.1093 0.8455 0.1165 0.8191 0.1306 0.8729 0.1470 0.8185 0.1134 0.8336 0.1251 0.8191 0.1241 0.7972 0.1295 0.7985 0.1358 0.8387 0.1239 0.8140 0.1607 0.8361 0.1191 0.7521 0.1166 0.8311 0.1326 0.7818 0.1291 0.7494 0.1069 0.5230 0.0688 0.7949 0.0681 0.8062 0.0988 0.7536 0.2055 0.7870 0.0744 0.7493 0.2014 0.8063 0.0737 0.7785 0.0893 0.7707 0.0915 0.7293 0.0765 0.7138 0.2384 0.7123 0.0709 0.7138 0.1119 0.7009 0.0785 0.7044 0.0844 0.7009 0.1019 0.7308 0.2302 0.9286 0.0649 0.9283 0.0925 0.9230 0.1724 0.9175 0.0747 0.9165 0.1344 0.8488 0.0704 0.8560 0.0793 0.8420 0.0756 0.8416 0.0511 0.8498 0.2291 0.8320 0.0535 0.8254 0.0920 0.8210 0.0491 0.8159 0.0496 0.7913 0.0662 0.7114 0.1115 0.9748 0.1241 0.9718 0.1463 0.9637 0.1310 0.9712 0.1322 0.9642 0.1001 0.8290 0.1177 0.8678 0.1253 0.8310 0.1252 0.8376 0.1101 0.8853 0.1273 0.8692 0.1128 0.7968 0.1226 0.8202 0.1190 0.8082 0.1138 0.8041 0.1164 0.6513 0.1536 0.8865 0.0525 0.8885 0.0710 0.8856 0.2011 0.8694 0.0658 0.8732 0.1690 0.8694 0.0558 0.8446 0.0667 0.8532 0.0685 0.8456 0.0705 0.8170 0.2494 0.7979 0.0636 0.8561 0.0851 0.8217 0.0676 0.8236 0.1002 0.7788 0.1118 0.7836 0.2158 CD BER SD HFC LPC PD Table 1. F-measure (F), Precision (P) and Recall (R) for dataset combined, SB, JPB for detectors under best performing configurations from the parameter search, with corresponding standard deviations over different configurations. statistic mean std median Combined 0.7731 0.0587 0.7818 SB 0.7438 0.0579 0.7595 JPB 0.8183 0.0628 0.8226 CHN-NPP 0.8527 0.1206 0.8956 WES-NPP 0.8358 0.0641 0.8580 Table 3. Statistics for F-measure of the ten detectors with their best performances from Table 1 for different datasets fusion policy is more sensitive to the choice of parameters. A possible improvement in this fusion policy would be to make the size of the tolerance window dependent on the magnitude of relevant peaks of the detection functions. The results also vary across different datasets. Table 3 summarises F-measure statistics computed over the detectors listed in Table 1 at their best setting for each datasets used in this paper. In comparison with SB, the JPB dataset exhibits higher F-measure. This dataset has larger diversity in terms of the length of tracks and the level of complexity, while the SB dataset mainly consists of complex mixture (CM) onsets type. 
Both the Chinese and Western NPP onset classes provide noticeably higher detection rates than the mixed-type datasets, though the CHN-NPP set shows the largest standard deviation, suggesting a greater variation in performance between the different detectors for these instruments. Apart from aiming at optimal overall detection results, it is also useful to consider when and how a certain onset detector exhibits its best performance, which constitutes future work.

6.2 Parameter Specifications

For general datasets a low detection sensitivity value is favourable, which is supported by the fact that 30 out of the 45 tested methods yield their best performances with a sensitivity lower than 50% (see online). In 23 out of all cases, the value of the backtracking threshold was the highest considered in our study (2.4) when the detectors yield their best performances for the combined dataset, and it was unanimously at a high value for all other datasets, including the percussive ones. This suggests that in many cases, the perceived onset is better characterised by the valley of the detection function prior to the detected peak. Note that even at a higher threshold, the onset location would not be traced back further than the valley preceding the peak detected in our algorithm. An interesting direction for future work would thus be, given this observation, to take into account the properties of human perception.

Adaptive whitening had to be turned off for the majority of detectors to provide good performance for all datasets. This indicates that the method does not improve onset detection performance in general, although it is available in most onset detectors in the Vamp plugin library. The value of the tolerance window was always 0.05 s for best performance in our study, suggesting that the temporal precision of the different detectors varies significantly, which requires a fairly wide decision horizon for successful combination.

Figure 2 shows how two parameters influence the performance of the onset detector CDSF-1. The figure illustrates the true positive rate (i.e., correct detections relative to the number of target onsets) and the false positive rate (i.e., false detections relative to the number of detected onsets); better performance is indicated by the curve shifting upwards and leftwards. All parameters except the linear combination coefficient (w) and the detection sensitivity (threshold) are fixed at their optimal values. We can observe that the value of the linear combination coefficient is around 0.2 for best performance. This suggests that the detector works best when taking the majority of the contribution from SF. With the threshold increasing from 10.0% to 60.0%, the true positive rate increases at the cost of picking more false onsets, thus a lower sensitivity is preferred in this case. The poorest performance in the case of the linear fusion policy occurs in general when the linear combination coefficient overly favours one constituent detector, or when the sensitivity (threshold) is too high and the backtracking threshold (θ) is at its lowest value.

Figure 2. Performances of the CDSF-1 onset detector under different w (labelled on each curve) and threshold (annotated in the side box) settings.

Proc. of the 14th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2013.

[4] S. Böck and G. Widmer. Maximum filter vibrato suppression for onset detection. In Proc. of the 16th Int. Conf.
on Digital Audio Effects (DAFx), 2013. [5] C. Cannam, M.O. Jewell, C. Rhodes, M. Sandler, and M. d’Inverno. Linked data and you: Bringing music research software into the semantic web. Journal of New Music Research, 2010. [6] N. Collins. A comparison of sound onset detection algorithms with emphasis on psychoacoustically motivated detection functions. In Proc. of the 118th Convention of the Audio Engineering Society, 2005. 7. CONCLUSION AND FUTURE WORK In this work, we applied several fusion techniques to aid the music onset detection task. Different fusion policies were tested and compared to their constituent methods, including the state-of-the-art SuperFlux method. A large scale evaluation was performed on two published datasets showing improvements as a result of fusion, without extra computational cost, or the need for a large amount of training data as in the case of machine learning based methods. A parameter search was used to find the optimal settings for each detector to yield the best performance. We found that some of the best performing configurations do not match the default settings of some previously published algorithms. This suggests that in some cases, better performance can be achieved just by finding better settings which work best overall for a given type of audio even without changing the algorithms. In future work, a possible improvement in case of late decision fusion is to take the magnitude of the peaks into account when combining detected onsets, essentially treating the value as an estimation confidence. We will investigate the dependency of the selection of onset detectors on the type and the quality of the input music signal. We also intend to carry out more rigorous statistical analyses with significance tests for the reported results. More parameters could be included in the search to study their strengths as well as how they influence each other under different configurations. Another interesting direction is to incorporate more Non-Western music types as detection target and design algorithms using instrument specific priors. 8. REFERENCES [1] J.P. Bello, L. Daudet, S. Abdallan, C. Duxbury, and M. Davies. A tutorial on onset detection in music signals. In IEEE Transactions on Audio, Speech, and Language Processing, volume 13, 2005. [2] S. Böck, F. Krebs, and M. Schedl. Evaluating the online capabilities of onset detection methods. In Proc. of the 13th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2012. [3] S. Böck and G. Widmer. Local group delay based vibrato and tremolo suppression for onset detection. In [7] N. Collins. Using a pitch detector for onset detection. In Proc. of the 6th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2005. [8] N. Degara-Quintela, A. Pena, and S. Torres-Guijarro. A comparison of score-level fusion rules for onset detection in music signals. In Proc. of the 10th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2009. [9] S. Dixon. Onset detection revisited. In Proc. of the 9 th Int. Conference on Digital Audio Effects (DAFx’06), 2006. [10] A. Holzapfel, Y. Stylianou, A.C. Gedik, and B. Bozkurt. Three dimensions of pitched instrument onset detection. IEEE Transactions on Audio, Speech, and Language Processing, 2010. [11] L.A. Klein. Sensor and data fusion: a tool for information assessment and decision making. SPIE, 2004. [12] J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), 1975. [13] J. Schlüter and S. Böck. 
Musical onset detection with convolutional neural networks. In 6th Int. Workshop on Machine Learning and Music (MML), 2013. [14] D. Stowell and M. Plumbley. Adaptive whitening for improved real-time audio onset detection. In Proceedings of the International Computer Music Conference (ICMC), 2007. [15] M. Tian, A. Srinivasamurthy, M. Sandler, and X. Serra. A study of instrument-wise onset detection in beijing opera percussion ensembles. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014. [16] J. Vos and R. Rasch. The perceptual onset of musical tones. Perception & Psychophysics, 29(4), 1981. [17] R. Zhou, M. Mattavellii, and G. Zoia. Music onset detection based on resonator time frequency image. IEEE Transactions on Audio, Speech, and Language Processing, 2008. 636 15th International Society for Music Information Retrieval Conference (ISMIR 2014) EVALUATING THE EVALUATION MEASURES FOR BEAT TRACKING Sebastian Böck Department of Computational Perception Johannes Kepler University, Linz, Austria [email protected] Mathew E. P. Davies Sound and Music Computing Group INESC TEC, Porto, Portugal [email protected] ABSTRACT This measurement of performance can happen via subjective listening test, where human judgements are used to determine beat tracking performance [3], to discover: how perceptually accurate the beat estimates are when mixed with the input audio. Alternatively, objective evaluation measures can be used to compare beat times with ground truth annotations [4], to determine: how consistent the beat estimates are with the ground truth according to some mathematical relationship. While undertaking listening tests and annotating beat locations are both extremely time-consuming tasks, the apparent advantage of the objective approach is that once ground truth annotations have been determined, they can easily be re-used without the need for repeated listening experiments. However, the usefulness of any given objective accuracy score (of which there are many [4]) is contingent on its ability to reflect human judgement of beat tracking performance. Furthermore, for the entire objective evaluation process to be meaningful, we must rely on the inherent accuracy of the ground truth annotations. In this paper we work under the assumption that musically trained experts can provide meaningful ground truth annotations and rather focus on the properties of the objective evaluation measures. The main question we seek to address is: to what extent do existing objective accuracy scores reflect subjective human judgement of beat tracking performance? In order to answer this question, even in principle, we must first verify that human listeners can make reliable judgements of beat tracking performance. While very few studies exist, we can find supporting evidence suggesting human judgements of beat tracking accuracy are highly repeatable [3] and that human listeners can reliably disambiguate accurate from inaccurate beat click sequences mixed with music signals [11]. The analysis we present involves the use of a test database for which we have a set of estimated beat locations, annotated ground truth and human subjective judgements of beat tracking performance. Access to all of these components (via the results of existing research [12, 17]) allows us to examine the correlation between objective accuracy scores, obtained by comparing the beat estimates to the ground truth, with human listener judgements. 
To the best of our knowledge this is the first study of this type for musical beat tracking. The remainder of this paper is structured as follows. In Section 2 we summarise the objective beat tracking evaluation measures used in this paper. In Section 3 we describe The evaluation of audio beat tracking systems is normally addressed in one of two ways. One approach is for human listeners to judge performance by listening to beat times mixed as clicks with music signals. The more common alternative is to compare beat times against ground truth annotations via one or more of the many objective evaluation measures. However, despite a large body of work in audio beat tracking, there is currently no consensus over which evaluation measure(s) to use, meaning multiple accuracy scores are typically reported. In this paper, we seek to evaluate the evaluation measures by examining the relationship between objective accuracy scores and human judgements of beat tracking performance. First, we present the raw correlation between objective scores and subjective ratings, and show that evaluation measures which allow alternative metrical levels appear more correlated than those which do not. Second, we explore the effect of parameterisation of objective evaluation measures, and demonstrate that correlation is maximised for smaller tolerance windows than those currently used. Our analysis suggests that true beat tracking performance is currently being overestimated via objective evaluation. 1. INTRODUCTION Evaluation is a critical element of music information retrieval (MIR) [16]. Its primary use is a mechanism to determine the individual and comparative performance of algorithms for given MIR tasks towards improving them in light of identified strengths and weaknesses. Each year many different MIR systems are formally evaluated within the MIREX initiative [6]. In the context of beat tracking, the concept and purpose of evaluation can be addressed in several ways. For example, to measure reaction time across changing tempi [2], to identify challenging musical properties for beat trackers [9] or to drive the composition of new test datasets [10]. However, as with other MIR tasks, evaluation in beat tracking is most commonly used to estimate the performance of one or more algorithms on a test dataset. c Mathew E. P. Davies, Sebastian Böck. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Mathew E. P. Davies, Sebastian Böck. “Evaluating the evaluation measures for beat tracking”, 15th International Society for Music Information Retrieval Conference, 2014. 637 15th International Society for Music Information Retrieval Conference (ISMIR 2014) ceding tolerance window. In addition, a separate threshold requires that the estimated inter-beat-interval should be close to the IAI. In practice both thresholds are set at ±17.5% of the IAI. In [4], two basic conditions consider the ratio of the longest continuously correct region to the length of the excerpt (CMLc), and the total proportion of correct regions (CMLt). In addition, the AMLc and AMLt versions allow for additional interpretations of the annotations to be considered accurate. As specified above, we reduce these four to two principal accuracy scores. To prevent any ambiguity, we rename these accuracy scores Continuity-C (CMLc) and Continuity-T (CMLt). the comparison between subjective ratings and objective scores of beat tracking accuracy. Finally, in Section 4 we present discussion and areas for future work. 2. 
BEAT TRACKING EVALUATION MEASURES In this section we present a brief summary each of the evaluation measures from [4]. While nine different approaches were presented in [4], we reduce them to seven by only presenting the underlying approaches for comparing a set of beats with a set of annotations (i.e. ignoring alternate metrical interpretations). We consider the inclusion of different metrical interpretations of the annotations to be a separate process which can be applied to any of these evaluation measures (as in [5, 8, 15]), rather than a specific property of one particular approach. To this end, we choose three evaluation conditions: Annotated – comparing beats to annotations, Annotated+Offbeat – including the “off-beat” of the annotations for comparison against beats and Annotated+Offbeat+D/H – including the off-beat and both double and half the tempo of the annotations. This doubling and halving has been commonly used in beat tracking evaluation to attempt to reflect the inherent ambiguity in music over which metrical level to tap the beat [13]. The set of seven basic evaluation measures are summarised below: Information Gain : this method performs a two-way comparison of estimated beat times to annotations and vice-versa. In each case, a histogram of timing errors is created and from this the Information Gain is calculated as the Kullback-Leibler divergence from a uniform histogram. The default number of bins used in the histogram is 40. 3. SUBJECTIVE VS. OBJECTIVE COMPARISON 3.1 Test Dataset To facilitate the comparison of objective evaluation scores and subjective ratings we require a test dataset of audio examples for which we have both annotated ground truth beat locations and a set of human judgements of beat tracking performance for a beat tracking algorithm. For this purpose we use the test dataset from [17] which contains 48 audio excerpts (each 15s in duration). The excerpts were selected from the MillionSongSubset [1] according to a measurement of mutual agreement between a committee of five state of the art beat tracking algorithms. They cover a range from very low mutual agreement – shown to be indicative of beat tracking difficulty, up to very high mutual agreement – shown to be easier for beat tracking algorithms [10]. In [17] a listening experiment was conducted where a set of 22 participants listened to these audio examples mixed with clicks corresponding to automatic beat estimates and rated on a 1 to 5 scale how well they considered the clicks represented the beats present in the music. For each excerpt these beat times were the output of the beat tracker which most agreed with the remainder of the five committee members from [10]. Analysis of the subjective ratings and measurements of mutual agreement revealed low agreement to be indicative of poor subjective performance. In a later study, these audio excerpts were used as one test set in a beat tapping experiment, where participants tapped the beat using a custom piece of software [12]. In order to compare the mutual agreement between tappers with their global performance against the ground truth, a musical expert annotated ground truth beat locations. The tempi range from 62 BPM (beats per minute) up to 181 BPM and, with the exception of two excerpts, all are in 4/4 time. 
Of the remaining two excerpts, one is in 3/4 time and F-measure : accuracy is determined through the proportion of hits, false positives and false negatives for a given annotated musical excerpt, where hits count as beat estimates which fall within a pre-defined tolerance window around individual ground truth annotations, false positives are extra beat estimates, and false negatives are missed annotations. The default value for the tolerance window is ±0.07s. PScore : accuracy is measured as the normalised sum of the cross-correlation between two impulse trains, one corresponding to estimated beat locations, and the other to ground truth annotations. The cross-correlation is limited to the range covering 20% of the median interannotation-interval (IAI). Cemgil : a Gaussian error function is placed around each ground truth annotation and accuracy is measured as the sum of the “errors” of the closest beat to each annotation, normalised by whichever is greater, the number of beats or annotations. The standard deviation of this Gaussian is set at 0.04s. Goto : the annotation interval-normalised timing error is measured between annotations and beat estimates, and a binary measure of accuracy is determined based on whether a region covering 25% of the annotations continuously meets three conditions – the maximum error is less than ±17.5% of the IAI, and the mean and standard deviation of the error are within ±10% of the IAI. Continuity-based : a given beat is considered accurate if it falls within a tolerance window placed around an annotation and that the previous beat also falls within the pre- 638 15th International Society for Music Information Retrieval Conference (ISMIR 2014) F−measure (0.77) ratings A 4 ratings A+O Cemgil (0.79) 4 2 0 4 2 0.5 (0.85) 1 0 Goto (0.52) 4 2 0.5 (0.74) 1 0 Continuity−C (0.68) 4 2 0.5 (0.84) 1 0 Continuity−T (0.68) 4 2 0.5 (0.51) 1 0 Inf. Gain (0.85) 4 2 0.5 (0.65) 1 0 2 0.5 (0.61) 1 0 4 4 4 4 4 4 2 2 2 2 2 2 2 0 0.5 (0.85) 1 4 0.5 (0.82) 1 4 2 0 0 1 0 0.5 (0.85) 1 4 2 0.5 accuracy 0 1 0 0.5 (0.41) 1 4 2 0.5 accuracy 0 1 0 0.5 (0.86) 1 4 2 0.5 accuracy 0 1 0 0.5 (0.84) 1 1 0 5 4 2 0.5 accuracy 0 (0.85) 4 2 0.5 accuracy 0 5 (0.86) 4 A+O+D/H ratings PScore (0.72) 2 0.5 accuracy 1 0 5 bits Figure 1. Subjective ratings vs. objective accuracy scores for different evaluation measures. The rows indicate different evaluation conditions. (top row) Annotated, (middle row) Annotated+Offbeat, and (bottom row) Annotated+Offbeat+D/H. For each scatter plot, the linear correlation coefficient is provided. Comparing each individual measure across these evaluation conditions, reveals that Information Gain is least affected by the inclusion of additional interpretations of the annotations, and hence most robust to ambiguity over metrical level. Referring to the F-measure and PScore columns of Figure 1 we see that the “vertical” structure close to accuracies of 0.66 and 0.5 respectively is mapped across to 1 for the Annotated+Offbeat+D/H condition. This pattern is also reflected for Goto, Continuity-C and Continuity-T which also determine beat tracking accuracy according to fixed tolerance windows, i.e. a beat falling anywhere inside a tolerance window is perfectly accurate. However, the fact that a fairly uniform range of subjective ratings between 3 and 5 (i.e. “fair” to “excellent” [17]) exists for apparently perfect objective scores indicates a potential mismatch and over-estimation of beat tracking accuracy. 
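The linear correlation coefficients reported in each panel of Figure 1 can, in principle, be reproduced from per-excerpt pairs of objective accuracy scores and mean listener ratings. The short sketch below shows that computation on made-up numbers; it assumes the scores and ratings are already available as aligned lists and is not tied to the evaluation code used in the study.

import numpy as np

def pearson_correlation(objective_scores, subjective_ratings):
    # Linear (Pearson) correlation between per-excerpt accuracy scores and
    # the mean subjective rating given to the same excerpts.
    x = np.asarray(objective_scores, dtype=float)
    y = np.asarray(subjective_ratings, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical values for a handful of excerpts: an accuracy score in [0, 1]
# and mean ratings on the 1-to-5 scale used in the listening experiment.
accuracy = [0.95, 0.66, 0.40, 1.00, 0.82, 0.55]
ratings  = [4.6,  3.1,  2.0,  4.2,  4.0,  2.7]
print(round(pearson_correlation(accuracy, ratings), 2))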
While a better visual correlation appears to exist in the scatter plots of Cemgil and Information Gain, this is not reflected in the correlation values (at least not for the Annotated+Offbeat+D/H condition). The use a Gaussian instead of a “top-hat” style tolerance window for Cemgil provides more information regarding the precise localisation of beats to annotations and hence does not have this clustering at the maximum performance. The Information Gain measure does not use tolerance windows at all, instead it measures beat tracking accuracy in terms of the temporal dependence between beats and annotations, and thus shows a similar behaviour. the other was deemed to have no beat at all, and therefore no beats were annotated. In the context of this paper, this set of ground truth beat annotations provides the final element required to evaluate the evaluation measures, since we now have: i) automatically estimated beat locations, ii) subjective ratings corresponding to these beats and iii) ground truth annotations to which the estimated beat locations can be compared. We use each of the seven evaluation measures described in Section 2 to obtain the objective accuracy scores according to the three versions of the annotations: Annotated, Annotated+Offbeat and Annotated+Offbeat+D/H. Since all excerpts are short, and we are evaluating the output of an offline beat tracking algorithm, we remove the startup condition from [4] where beat times in the first five seconds are ignored. 3.2 Results 3.2.1 Correlation Analysis To investigate the relationship between the objective accuracy scores and subjective ratings, we present scatter plots in Figure 1. The title of each individual scatter plot includes the linear correlation coefficient which we interpret as an indicator of the validity of a given evaluation measure in the context of this dataset. The highest overall correlation (0.86) occurs for Continuity-C when the offbeat and double/half conditions are included. However, for all but Goto, the correlation is greater than 0.80 once these additional evaluation criteria are included. It is important to note only Continuity-C and Continuity-T explicitly include these conditions in [4]. Since Goto provides a binary assessment of beat tracking performance, it is unlikely to be highly correlated with the subjective ratings from [17] where participants were explicitly required to use a five point scale rather than a good/bad response concerning beat tracking performance. Nevertheless, we retain it to maintain consistency with [4]. 3.2.2 The Effect of Parameterisation For the initial correlation analysis, we only considered the default parameterisation of each evaluation measure as specified in [4]. However, to only interpret the validity of the evaluation measures in this way presupposes that they have already been optimally parameterised. We now explore whether this is indeed the case, by calculating the objective accuracy scores (under each evaluation condition) as a function of a threshold parameter for each measure. 639 15th International Society for Music Information Retrieval Conference (ISMIR 2014) F−measure PScore 1 Cemgil 1 Goto 1 Continuity−C 1 1 Continuity−T Inf. Gain 1 5 0.5 0.5 0.5 0.5 0.5 bits accuracy 4 0.5 3 2 1 correlation 0 0 0.05 0.1 0 0 0.5 0 0 0.05 0.1 0 0 0.5 0 0 0.5 0 0 0.5 0 0 1 1 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0.05 0.1 threshold (s) 0 0 0.5 threshold 0 0 0.05 0.1 threshold (s) 0 0 0.5 threshold 0 0 0.5 threshold 0 0 0.5 threshold 0 0 50 100 50 100 num bins Figure 2. 
(top row) Beat tracking accuracy as a function of threshold (or number of bins for Information Gain) per evaluation measure. (bottom row) Correlation between subjective ratings and accuracy scores as a function of threshold (or number of bins). In each plot the solid line indicates the Annotated condition, the dashed–dotted line shows Annotated+Offbeat and the dashed line shows Annotated+Offbeat+D/H. For each evaluation measure, the default parameteristation from [4] is shown by a dotted vertical line. We then re-compute the subjective vs. objective correlation. We adopt the following parameter ranges as follows: F-measure PScore Cemgil Goto Continuity-C Continuity-T Information Gain F-measure : the size of the tolerance window increases from ±0.001s to ±0.1s. PScore : the width of the cross-correlation increases from 0.01 to 0.5 times the median IAI. Cemgil : the standard deviation of the Gaussian error function grows from 0.001s to 0.1s. Default Parameters Max. Correlation Parameters 0.070s 0.200 0.040s 0.175 0.175 0.175 40 0.049s 0.110 0.051s 0.100 0.095 0.090 38 Table 1. Comparison of default parameters per evaluation measure with those which provide the maximum correlation with subjective ratings in the Annotated+Offbeat+D/H condition. Goto : to allow a similar one-dimensional representation, we make all three parameters identical and vary them from ±0.005 to ±0.5 times the IAI. Continuity-based : the size of the tolerance window increases from ±0.005 to ±0.5 times the IAI. after which the correlation soon reaches its maximum and then reduces. Comparing these change points with the dotted vertical lines (which show the default parameters) we see that correlation is maximised for smaller (i.e. more restrictive) parameters than those currently used. By finding the point of maximum correlation in each of the plots in the bottom row of Figure 2 we can identify the parameters which yield the highest correlation between objective accuracy and subjective ratings. These are shown for the Annotated+Offbeat+D/H evaluation condition in Table 1 for which the correlation is typically highest. Returning to the plots in the top row of Figure 2 we can then read off the corresponding objective accuracy with the default and then maximum correlation parameters. These accuracy scores are shown in Table 2. From these Tables we see that it is only Cemgil whose default parameterisation is lower than that which maximises the correlation. However this does not apply for the Annotated only condition which is implemented in [4]. While there is a small difference for Information Gain, in- Information Gain : we vary the number of bins in multiples of 2 from 2 up to 100. In the top row of Figure 2 the objective accuracy scores as a function of different parameterisations are shown. The plots in the bottom row show the corresponding correlations with subjective ratings. In each plot the dotted vertical line indicates the default parameters. From the top row plots we can observe the expected trend that, as the size of the tolerance window increases so the objective accuracy scores increase. For the case of Information Gain the beat error histograms become increasingly sparse due to having more histogram bins than observations, hence the entropy reduces and the information gain increases. In addition, Information Gain does not have a maximum value of 1, but instead, log2 of the number of histogram bins [4]. 
Looking at the effect of correlation with subjective ratings in the bottom row of Figure 2, we see that for most evaluation measures there is rapid increase in the correlation as the tolerance windows grow from very small sizes 640 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Annotated Default Max Corr. Params Params F-measure PScore Cemgil Goto Continuity-C Continuity-T Information Gain 0.673 0.653 0.596 0.583 0.518 0.526 3.078 0.607 0.580 0.559 0.563 0.488 0.505 2.961 Annotated+Offbeat Default Max Corr. Params Params 0.764 0.753 0.681 0.667 0.605 0.624 3.187 0.738 0.694 0.702 0.646 0.570 0.587 3.187 Annotated+Offbeat+D/H Default Max Corr. Params Params 0.834 0.860 0.739 0.938 0.802 0.837 3.259 0.797 0.792 0.779 0.813 0.732 0.754 3.216 Table 2. Summary of objective beat tracking accuracy under the three evaluation conditions: Annotated, Annotated+Offbeat and Annotated+Offbeat+D/H per evaluation measure. Accuracy is reported using the default parameterisation from [4] and also using the parameterisation which provides maximal correlation to the subjective ratings. For Information Gain only performance is measured in bits. might argue that the apparent glass ceiling of around 80% for beat tracking [10] (using Continuity-T for the Annotated+Offbeat+D/H condition) may in fact be closer to 75%, or perhaps lower still. In terms of external evidence to support our findings, a perceptual study evaluating human tapping ability [7] used a tolerance window of ±10% of the IAI, which is much closer to our “maximum correlation” Continuity-T parameter of ±9% than the default value of ±17.5% of the IAI. spection of Figure 2 shows that it is unaffected by varying the number of histogram bins in terms of the correlation. In addition, the inclusion of the extra evaluation criteria also leads to a negligible difference in reported accuracy. Therefore Information Gain is most robust to parameter sensitivity and metrical ambiguity. For the other evaluation measures the inclusion of the Annotated+Offbeat and the Annotated+Offbeat+D/H (in particular) leads to more pronounced differences. The highest overall correlation between objective accuracy scores and subjective ratings (0.89) occurs for Continuity-T for a tolerance window of ±9% of the IAI rather than the default value of ±17.5%. Referring again to Table 2 we see that this smaller tolerance window causes a drop in reported accuracy from 0.837 to 0.754. Indeed a similar drop in performance can be observed for most evaluation measures. Before making recommendations to the MIR community with regard to how beat tracking evaluation should be conducted in the future, we should first revisit the makeup of the dataset to assess the scope from which we can draw conclusions. All excerpts are just 15s in duration, and therefore not only much shorter than complete songs, but also significantly shorter than most annotated excerpts in existing datasets (e.g. 40s in [10]). Therefore, based on our results, we cannot yet claim that our subjective vs. objective correlations will hold for evaluating longer excerpts. We can reasonably speculate that an evaluation across overlapping 15s windows could provide some local information about beat tracking performance for longer pieces, however this is currently not how beat tracking evaluation is addressed. Instead, a single score of accuracy is normally reported regardless of excerpt length. 
With the exception of [3] we are unaware of any other research where subjective beat tracking performance has been measured across full songs. 4. DISCUSSION Based on the analysis of objective accuracy scores and subjective ratings on this dataset of 48 excerpts, we can infer that: i) a higher correlation typically exists when the Annotated+Offbeat and/or Annotated+Offbeat+D/H conditions are included, and ii) for the majority of existing evaluation measures, this correlation is maximised for a more restrictive parameterisation than the default parameters which are currently used [4]. A strict following of the results presented here would promote either the use of Continuty-T for the Annotated+Offbeat+D/H condition with a smaller tolerance window, or Information Gain since it is most resilient to these variable evaluation conditions while maintaining a high subjective vs. objective correlation. If we are to extrapolate these results to all existing work in the beat tracking literature this would imply that any papers reporting only performance for the Annotated condition using F-measure and PScore may not be as representative of subjective ratings (and hence true performance) as they could be by incorporating additional evaluation conditions. In addition, we could infer that most presented accuracy scores (irrespective of evaluation measure or evaluation condition) are somewhat inflated due to the use of artificially generous parameterisations. On this basis, we Regarding the composition of our dataset, we should also be aware that the excerpts were chosen in an unsupervised data-driven manner. Since they were sampled from a much larger collection of excerpts [1] we do not believe there is any intrinsic bias in their distribution other than any which might exist across the composition of the MillionSongSubset itself. The downside of this unsupervised sampling is that we do not have full control over exploring specific interesting beat tracking conditions such as off-beat tapping, expressive timing, the effect of related metrical levels and non-4/4 time-signatures. We can say that for the few test examples where the evaluated beat tracker tapped the off-beat (shown as zero accuracy points in the Anno- 641 15th International Society for Music Information Retrieval Conference (ISMIR 2014) tated condition but non-zero for the Annotated+Offbeat condition in Figure 1), were not rated as “bad”. Likewise, there did not appear to be a strong preference over a single metrical level. Interestingly, the ratings for the unannotatable excerpt were among the lowest across the dataset. Overall, we consider this to be a useful pilot study which we intend to follow up in future work with a more targeted experiment across a much larger musical collection. In addition, we will also explore the potential for using bootstrapping measures from Text-IR [14] which have also been used for the evaluation of evaluation measures. Based on these outcomes, we hope to be in a position to make stronger recommendations concerning how best to conduct beat tracking evaluation, ideally towards a single unambiguous measurement of beat tracking accuracy. However, we should remain open to the possibility that different evaluation measures may be more appropriate than others and that this could depend on several factors, including: the goal of the evaluation; the types of beat tracking systems evaluated; how the ground truth was annotated; and the make up of the test dataset. 
To summarise, we believe the main contribution of this paper is to further raise the profile and importance of evaluation in MIR, and to encourage researchers to more strongly consider the properties of evaluation measures, rather than merely reporting accuracy scores and assuming them to be valid and correct. If we are to improve underlying analysis methods through iterative evaluation and refinement of algorithms, it is critical to optimise performance according to meaningful evaluation methodologies targeted towards specific scientific questions. While the analysis presented here has only been applied in the context of beat tracking, we believe there is scope for similar subjective vs. objective comparisons in other MIR topics such as chord recognition or structural segmentation, where subjective assessments should be obtainable via similar listening experiments to those used here. 5. ACKNOWLEDGMENTS This research was partially funded by the Media Arts and Technologies project (MAT), NORTE-07-0124-FEDER000061, financed by the North Portugal Regional Operational Programme (ON.2–O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT) as well as FCT post-doctoral grant SFRH/BPD/88722/2012. It was also supported by the European Union Seventh Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591). 6. REFERENCES [1] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of 12th International Society for Music Information Retrieval Conference, pages 591–596, 2011. [2] N. Collins. Towards Autonomous Agents for Live Computer Music: Realtime Machine Listening and Interactive Music Systems. PhD thesis, Centre for Music and Science, Faculty of Music, Cambridge University, 2006. [3] R. B. Dannenberg. Toward automated holistic beat tracking, music analysis, and understanding. In Proceedings of 6th International Conference on Music Information Retrieval, pages 366–373, 2005. [4] M. E. P. Davies, N. Degara, and M. D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. Technical Report C4DM-TR-09-06, Queen Mary University of London, Centre for Digital Music, 2009. [5] S. Dixon. Evaluation of audio beat tracking system beatroot. Journal of New Music Research, 36(1):39–51, 2007. [6] J. S. Downie. The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4):247–255, 2008. [7] C. Drake, A. Penel, and E. Bigand. Tapping in time with mechanically and expressively performed music. Music Perception, 18(1):1–23, 2000. [8] M. Goto and Y. Muraoka. Issues in evaluating beat tracking systems. In Working Notes of the IJCAI-97 Workshop on Issues in AI and Music - Evaluation and Assessment, pages 9–16, 1997. [9] P. Grosche, M. Müller, and C. S. Sapp. What Makes Beat Tracking Difficult? A Case Study on Chopin Mazurkas. In Proceedings of the 11th International Society for Music Information Retrieval Conference, pages 649–654, 2010. [10] A. Holzapfel, M. E. P. Davies, J. R. Zapata, J. Oliveira, and F. Gouyon. Selective sampling for beat tracking evaluation. IEEE Transactions on Audio, Speech and Language Processing, 20(9):2539–2460, 2012. [11] J. R. Iversen and A. D. Patel. 
The beat alignment test (BAT): Surveying beat processing abilities in the general population. In Proceedings of the 10th International Conference on Music Perception and Cognition, pages 465–468, 2008. [12] M. Miron, F. Gouyon, M. E. P. Davies, and A. Holzapfel. Beat-Station: A real-time rhythm annotation software. In Proceedings of the Sound and Music Computing Conference, pages 729–734, 2013. [13] D. Moelants and M. McKinney. Tempo perception and musical content: what makes a piece fast, slow or temporally ambiguous? In Proceedings of the 8th International Conference on Music Perception and Cognition, pages 558–562, 2004. [14] T. Sakai. Evaluating evaluation metrics based on the bootstrap. In Proceedings of the International ACM SIGIR conference on research and development in information retrieval, pages 525–532, 2006. [15] A. M. Stark. Musicians and Machines: Bridging the Semantic Gap in Live Performance. PhD thesis, Centre for Digital Music, Queen Mary University of London, 2011. [16] J. Urbano, M. Schedl, and X. Serra. Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems, 41(3):345–369, 2013. [17] J. R. Zapata, A. Holzapfel, M. E. P. Davies, J. L. Oliveira, and F. Gouyon. Assigning a confidence threshold on automatic beat annotation in large datasets. In Proceedings of 13th International Society for Music Information Retrieval Conference, pages 157–162, 2012. 642 15th International Society for Music Information Retrieval Conference (ISMIR 2014) IMPROVING RHYTHMIC TRANSCRIPTIONS VIA PROBABILITY MODELS APPLIED POST-OMR Maura Church Applied Math, Harvard University and Google Inc. Michael Scott Cuthbert Music and Theater Arts M.I.T. [email protected] [email protected] ticularly in searches such as chord progressions that rely on accurate recognition of multiple musical staves. ABSTRACT Despite many improvements in the recognition of graphical elements, even the best implementations of Optical Music Recognition (OMR) introduce inaccuracies in the resultant score. These errors, particularly rhythmic errors, are time consuming to fix. Most musical compositions repeat rhythms between parts and at various places throughout the score. Information about rhythmic selfsimilarity, however, has not previously been used in OMR systems. Understandably, the bulk of OMR research has focused on improving the algorithms for recognizing graphical primitives and converting them to musical objects based on their relationships on the staves. Improving score accuracy using musical knowledge (models of tonality, meter, form) has largely been relegated to “future work” sections and when discussed has focused on localized structures such as beams and measures and requires access to the “guts” of a recognition engine (see Section 6.2.2 in [9]). Improvements to score accuracy based on the output of OMR systems using multiple OMR engines have been suggested [2] and when implemented yielded results that were more accurate than individual OMR engines, though the results were not statistically significant compared to the best commercial systems [1]. Improving the accuracy of an OMR score using musical knowledge and a single engine’s output alone remains an open field. This paper describes and implements methods for using the prior probabilities for rhythmic similarities in scores produced by a commercial OMR system to correct rhythmic errors which cause a contradiction between the notes of a measure and the underlying time signature. 
Comparing the OMR output and post-correction results to hand-encoded scores of 37 polyphonic pieces and movements (mostly drawn from the classical repertory), the system reduces incorrect rhythms by an average of 19% (min: 2%, max: 36%). This paper proposes using rhythmic repetition and similarity within a score to create a model where measurelevel metrical errors can be fixed using correctly recognized (or at least metrically consistent) measures found in other places in the same score, creating a self-healing method for post-OMR processing conditioned on probabilities based on rhythmic similarity and statistics of symbolic misidentification. The paper includes a public release of an implementation of the model in music21 and also suggests future refinements and applications to pitch correction that could further improve the accuracy of OMR systems. 1. INTRODUCTION 2. PRIOR PROBABILITIES OF DISTANCE Millions of paper copies of musical scores are found in libraries and archival collections and hundreds of thousands of scores have already been scanned as PDFs in repositories such as IMSLP [5]. A scan of a score cannot, however, be searched or manipulated musically, so Optical Music Recognition (OMR) software is necessary to transform an image of a score into symbolic formats (see [7] for a recent synthesis of relevant work and extensive bibliography; only the most relevant citations from this work are included here). Projects such as Peachnote [10] show both the feasibility of recognizing large bodies of scores and also the limitations that errors introduce, par© Maura Church Michael Scott Cuthbert. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Maura Church and Michael Scott Cuthbert. “Improving Rhythmic Transcriptions via Probability Models Applied Post-OMR”, 15th International Society for Music Information Retrieval Conference, 2014. 643 Most Western musical scores, excepting those in certain post-common practice styles (e.g., Boulez, Cage), use and gain cohesion through a limited rhythmic vocabulary across measures. Rhythms are often repeated immediately or after a fixed distance (e.g., after a 2, 4, or 8 measure distance). In a multipart score, different instruments often employ the same rhythms in a measure or throughout a passage. From a parsed musical score, it is not difficult to construct a hash of the sequence of durations in each measure of each part (hereafter simply called “measure”; “measure stack” will refer to measures sounding together across all parts); if grace notes are handled separately, and interior voices are flattened (e.g., using the music21 chordify method) then hash-key collisions will only occur in the rare cases where two graphically distinct symbols equate to the same length in quarter notes (such as a dotted-triplet eighth note and a normal eighth). 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Within each part, the prior probability that a measure m0 will have the same rhythm as the measure n bars later (or earlier) can be computed (the prior-based-on-distance, or PrD). Similarly, the prior probability that, within a measure stack, part p will have the same rhythm as part q can also be computed (the prior-based-on-part, or PrP). computing these values independently for each OMR system and quality of scan, such work is beyond the scope of the current paper. 
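As a rough illustration of the per-measure rhythm hashing described above, the sketch below collects the quarter-length sequence of each measure of each part with music21 and uses the tuple itself as the hash key. This is a simplified reading of the approach — the function name is ours, grace notes are not treated specially, and it is not the released omr.correctors implementation.

from music21 import converter

def measure_rhythm_hashes(path):
    # Map (part_index, measure_number) -> tuple of quarter lengths in that measure.
    score = converter.parse(path)
    hashes = {}
    for p_idx, part in enumerate(score.parts):
        flat = part.chordify()  # flatten interior voices, as suggested in the text
        for measure in flat.getElementsByClass('Measure'):
            durations = tuple(float(n.duration.quarterLength)
                              for n in measure.notesAndRests)
            # The duration tuple itself serves as the hash key; grace notes are
            # not handled separately in this sketch.
            hashes[(p_idx, measure.number)] = durations
    return hashes

# Two measures (possibly in different parts) share a rhythm exactly when their
# tuples compare equal, which is the basis for computing the PrD and PrP priors.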
Therefore, we use Rossant and Bloch’s recognition rates, adjusting them for the differences between working with individual symbols (such as dots and note stems) and symbolic objects (such as dotted-eighth and quarter notes). The values used in this model are thus: c = .003, o = .009, a = .004, v = .016.1 As will become clear, more accurate measures would only improve the results given below. Subtracting these probabilities from 1.0, the rate of equality, e, is .968. Figure 1 shows these two priors for the violin I and viola parts of the first movement of Mozart K525 (Eine kleine Nachtmusik). Individual parts have their own characteristic shapes; for instance, the melodic violin I (top left), shows less rhythmic similarity overall than the viola (bot. left). This difference results from the greater rhythmic variety of the violin I part compared to the viola part. Moments of large-scale repetition such as between the exposition and recapitulation, however, are easily visible as spikes in the PrD graph for violin I. (Possible refinements to the model taking into account localized similarities are given at the end of this paper.) The PrP graphs (right) show that both parts are more similar to the violoncello part than to any other part. However, the viola is more similar to the cello (and to violin II) that violin I is to any other part. 3.2 Aggregate Change Distances The similarity of two measures can be calculated in a number of different ways, including the earth mover distance, the Hamming distance, and the minimum Levenshtein or edit distance. The nature of the change probabilities obtained from Rossant and Bloch along with the inherent difficulties of finding the one-to-one correspondence of input and output objects required for other methods, made Levenshtein distance the most feasible method. The probability that certain changes would occur in a given originally scanned measure (source, S) to transform it into the OMR output measure (destination, D) is determined by finding, through an implementation of edit distance, values for i, j, k, l, and m (for number of class changes, omissions, additions, value changes, and unchanged elements) that maximize: pS, D = c i o j a k v l e m (1) Equation (1), the prior-based-on-changes or PrC, can be used to derive a probability of rhythmic change due to OMR errors between any two arbitrary measures, but the model employed here concerns itself with measures with incorrect rhythms, or flagged measures. 3.3 Flagged Measures Let FPi be the set of flagged measures for part Pi, that is, measures whose total durations do not correspond to the total duration implied by the currently active time signature, and F = {FP1, …, FPj} for a score with j parts. (Measure stacks where each measure number is in F can be removed as probable pickup or otherwise intended incomplete measures, and long stretches of measures in F in all parts can be attributed to incorrectly identified time signatures and reevaluated, though neither of these refinements is used in this model). It is possible for rhythms within a measure to be incorrectly recognized without the entire measure being in F; though this problem only arises in the rare case where two rhythmic errors cancel out each other (as in a dotted quarter read as a quarter with an eighth read as a quarter in the same measure). Figure 1. Priors based on distance (l. in measure separation) and part (r.) for the violin I (top) and viola (bot.) parts in Mozart, K525. 3. 
PRIOR PROBABILITIES OF CHANGE 3.1 Individual Change Probabilities The probability that any given musical glyph will be read correctly or incorrectly is dependent on the quality of scan, the quality of original print, the OMR engine used, and the type of repertory. One possible generalization used in the literature [8] is to classify errors as class confusion (e.g., rest for note, with probability of occurring c), omissions (e.g., of whole symbols or of dots, tuplet marks: probability o), additions (a), and general value confusion (e.g., quarter for eighth: v). Other errors, such as sharp for natural or tie for slur, do not affect rhythmic accuracy. Although accuracy would be improved by 1 Rossant and Bloch give probabilities of change given that an error has occurred. The numbers given here are renormalizations of those error rates after removing the prior probability that an error has taken place. 644 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 4. INTEGRATING THE PRIORS For each m FPi, the measure n in part Pi with the highest likelihood of representing the prototype source rhythm before OMR errors were introduced is the source measure SD that maximizes the product of the prior-based-ondistance, that is, the horizontal model, and the priorbased-on-changes: SD = argmax(PrDn PrCn) n F. Figure 2. Mozart, K525 I, in OMR (l.) and scanned (r.) versions. (2) prior based on changes is much smaller (4 10-9). Violin I is not considered as a source since its measure has also been flagged as incorrect. Therefore the viola’s measure is used for SP. (In the highly unlikely case of equal probabilities, a single measure is chosen arbitrarily) Similarly, for each m in FP the measure t in the measure stack corresponding to m, with the highest likelihood of being the source rhythm for m, is the source measure SP that maximizes the product of the prior-based-on-part, that is, the vertical model, and the prior-based-on-changes: SP = argmax(PrPt PrCt) t F. A similar search is done for the other (unflagged) measures in the rest of the violin II part in order to find SD. In this case, the probability of SP exceeds that of SD, so the viola measure’s rhythm is, correctly, used for violin II. (3) Since the two priors PrD and PrP have not been normalized in any way, the best match from SD and SP can be obtained by simply taking the maximum of the two: S = argmax(P(m)) m in [SD, SP] 6. IMPLEMENTATION The model developed above was implemented using conversion and score manipulation routines from the opensource Python-based toolkit, music21 [4] and has been contributed back to the toolkit as the omr.correctors module in v.1.9 and above. Example 1 demonstrates a round-trip in MusicXML of a raw OMR score to a postprocessed score. (4) Given the assumption that the time signature and barlines have accurately been obtained and that each measure originally contained notes and rests whose total durations matched the underlying meter, we do not need to be concerned with whether S is a “better” solution for correcting m than the rhythms currently in m, since the probability of a flagged measure being correct is zero. Thus any solution has a higher likelihood of being correct than what was already there. (Real-world implementations, however, may wish to place a lower bound on P(S) to avoid substitutions that are below a minimum threshold to prevent errors being added that would be harder to fix than the original.) 
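Equations (2)–(4) amount to two argmax searches followed by taking the better of the two candidates. The sketch below restates that selection rule in plain Python; the callables prior_distance, prior_part and prior_changes are hypothetical stand-ins for PrD, PrP and PrC, and the function is an illustration rather than code from the released module.

def best_source_rhythm(m, part_measures, flagged_in_part,
                       stack_rhythms, flagged_parts,
                       prior_distance, prior_part, prior_changes):
    # Pick the most likely source rhythm for flagged measure m (Eqs. 2-4).
    destination = part_measures[m]   # the metrically inconsistent OMR rhythm

    # Eq. (2): horizontal model -- other measures of the same part, skipping flagged ones.
    horizontal = [(prior_distance(abs(n - m)) * prior_changes(rhythm, destination), rhythm)
                  for n, rhythm in part_measures.items()
                  if n != m and n not in flagged_in_part]

    # Eq. (3): vertical model -- the same measure stack in the other parts.
    vertical = [(prior_part(p) * prior_changes(rhythm, destination), rhythm)
                for p, rhythm in stack_rhythms.items()
                if p not in flagged_parts]

    # Eq. (4): the priors are unnormalised, so the best overall candidate is
    # simply the one with the larger product of prior and change probability.
    candidates = horizontal + vertical
    if not candidates:
        return None
    return max(candidates, key=lambda scored: scored[0])[1]

Run with priors like those in Section 3, this mirrors the behaviour described for the K525 example, where the viola measure wins because the product of its part prior and change prior exceeds every horizontal candidate.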
from music21 import * s = converter.parse('/tmp/k525omrIn.xml') sc = omr.correctors.ScoreCorrector(s) s2 = sc.run() s2.write('xml', fp='/tmp/k525post.xml') Example 1. Python/music21 code for correcting OMR errors in Mozart K525, I. Figure 3, below, shows the types of errors that the model is able, and in some cases unable, to correct. 5. EXAMPLE 7. RESULTS In this example from Mozart K525, mvmt. 1, measure stack 17, measures in both Violin I and Violin II have been flagged as containing rhythmic errors (marked in purple in Figure 2). Nine scores of four-movement quartets by Mozart (5),1 Haydn (1), and Beethoven (4) were used for the primary evaluation. (Mozart K525, mvmt. 1 was used as a test score for development and testing but not for evaluation.) Scanned scores came from out-of-copyright editions (mainly Breitkopf & Härtel) via IMSLP and were converted to MusicXML using SmartScore X2 Pro (v.10.5.5). Ground truth encodings in MuseData and MusicXML formats came via the music21 corpus originally from the Stanford’s CCARH repertories [6] and Project Gutenberg. Both the OMR software and our implementation of the method, described below, can identify the violin lines as containing rhythmic errors, but neither can know that an added dot in each part has caused the error. The vertical model (PrP * PrC) will look to the viola and cello parts for corrections to the violin parts. Violin II and viola share five rhythms (e5) and only one omission of a dot is required to transform the viola rhythm into violin II (o1), for a PrC of 0.0076. The prior on similarities between violin II and viola (PrP) is 0.57, so the complete probability of this transformation is 0.0043. The prior on similarities between violin II and cello is slightly higher, 0.64, but the 1 Mozart K156 is a three-movement quartet, however, both the ground truth and the OMR versions include the abandoned first version of the Adagio as a fourth movement. 645 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Figure 3: Comparison of Mozart K525 I, mm. 35–39 in the original scan (top), SmartScore OMR output (middle), and after post-OMR processing (bot.). Flags 1–3 were corrected successfully; Flags 4 and 5 result in metrically plausible but incorrect emendations. The model was able to preserve the correct pitches for Flags 2 (added quarter rest) and Flag 3 (added augmentation dot). Flag 1 (omitted eighth note) is considered correct in this evaluation, based solely on rhythm, even though the pitch of the reconstructed eighth note is not correct. The proportion of suggestions taken from the horizontal (PrD) and vertical models (PrP) depended significantly on the number of parts in the piece. In Mozart K525 quartet, 72% of the suggestions came from the horizontal model while for the Schubert symphony (fourteen parts), only 39% came from the horizontal model. The pre-processed OMR movement was aligned with the ground truth by finding the minimum edit distance between measure hashes. This step was necessary for the many cases where the OMR version contained a different number of measures than the ground truth. The number of differences between the two versions of the same movement was recorded. A total of 29,728 measures with 7,196 flagged measures were examined. Flag rates ranged from 0.6% to 79.2% with a weighed mean of 24.2% and median of 21.7%. 8. 
APPLICATIONS The model has broad applications for improving the accuracy of scores already converted via OMR, but it would have greater impact as an element of an improved user experience within existing software. Used to its full potential, the model could help systems provide suggestions as users examine flagged measures. Even a small scale implementation could greatly improve the lengthy errorcorrecting process that currently must take place before a score is useable. See Figure 4 for an example interface. The model was then run on each OMR movement and the number of differences with the ground truth was recorded again. (In order to make the outputted score useful for performers and researchers, we added a simple algorithm to preserve as much pitch information as possible from the original measure.) From 2.1% to 36.1% of flagged measures were successfully corrected, with a weighed mean of 18.8% and median of 18.0%: a substantial improvement over the original OMR output. Manually checking the pre- and post-processed OMR scores against the ground truth showed that the highest rates of differences came from scores where single-pitch repetitions (tremolos) were spelled out in one source and written in abbreviated form in another; such differences could be corrected for in future versions. There was no significant correlation between the percentage of measures originally flagged and the correction rate (r = .17, p > .31). Figure 4. A sample interface improvement using the model described. The model was also run on two scores outside the classical string quartet repertory to test its further relevance. On a fourteenth-century vocal work (transcribed into modern notation), Gloria: Clemens Deus artifex and the first movement of Schubert’s “Unfinished” symphony, the results were similar to the previous findings (16.8% and 18.7% error reduction, respectively). A similar model to the one proposed here could also be integrated into OMR software to offer suggestions for pitch corrections if the user selects a measure that was not flagged for rhythmic errors. Integration within OMR software would also potentially give the model access to 646 15th International Society for Music Information Retrieval Conference (ISMIR 2014) rejected interpretations for measures that may become more plausible when rhythmic similarity within a piece is taken into account. [2] D. Byrd, M. Schindele: “Prospects for improving OMR with multiple recognizers,” Proc. ISMIR, Vol. 7, pp. 41–47, 2006. The model could be expanded to take into account spatial separation between glyphs as part of the probabilities. Simple extensions such as ignoring measures that are likely pickups or correcting wrong time signatures and missed barlines (resulting in double-length measures) have already been mentioned. Autocorrelation matrices, which would identify repeating sections such as recapitulations and rondo returns, would improve the prior-basedon-distance metric. Although the model runs quickly on small scores (in far less than the time to run OMR despite the implementation being written in an interpreted language), on larger scores the O(len(F) len(Part)) complexity of the horizontal model could become a problem (though correction of the lengthy Schubert score took less than ten minutes on an i7 MacBook Air). Because the prior-based-on-distance tends to fall off quickly, examining only a fixed-sized window worth of measures around each flagged measure would offer substantial speed-ups. [3] D. Byrd, J. G. 
Simonsen, “Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images,” http://www.informatics.indiana.edu/donbyrd/Papers/ OMRStandardTestbed_Final.pdf, in progress. [4] M. Cuthbert and C. Ariza: “music21: A Toolkit for Computer-Aided Musicology and Symbolic Music Data,” Proc. ISMIR, Vol. 11, pp. 637–42, 2010. [5] E. Guo et al.: Petrucci Music Library, imslp.org, 2006–. [6] W. Hewlett, et al.: MuseData: an Electronic Library of Classical Music Scores, musedata.org, 1994, 2000. [7] A. Rebelo, et al.: “Optical music recognition: Stateof-the-art and open issues,” International Journal of Multimedia Information Retrieval, Vol. 1, No. 3, pp. 173–190, 2012. Longer scores and scores with more parts offered more possibilities for high-probability correcting measures. Thus we encourage the creators of OMR competitions and standard OMR test examples [3] to include entire scores taken from standard repertories in their evaluation sets. [8] F. Rossant and I. Bloch, “A fuzzy model for optical recognition of musical scores,” Fuzzy sets and systems, Vol. 141, No. 2, pp. 165–201, 2004. [9] F. Rossant, I. Bloch: “Robust and adaptive OMR system including fuzzy modeling, fusion of musical rules, and possible error detection,” EURASIP Journal on Advances in Signal Processing, 2007. The potential of post-OMR processing based on musical knowledge is still largely untapped. Models of tonal behavior could identify transposing instruments and thus create better linkages between staves across systems that vary in the number of parts displayed. Misidentifications of time signatures, clefs, ties, and dynamics could also be reduced through comparison across parts and with similar sections in scores. While more powerful algorithms for graphical recognition will always be necessary, substantial improvements can be made quickly with the selective deployment of musical knowledge. [10] V. Viro: “Peachnote: Music score search and analysis platform,” Proc. ISMIR, Vol. 12, pp. 359– 362, 2011. 9. ACKNOWLEDGEMENTS The authors thank the Radcliffe Institute of Harvard University, the National Endowment for the Humanities/Digging into Data Challenge, the Thomas Temple Hoopes Prize at Harvard, and the School of Humanities, Arts, and Social Sciences, MIT, for research support, four anonymous readers for suggestions, and Margo Levine, Beth Chen, and Suzie Clark of Harvard’s Applied Math and Music departments for advice and encouragement. 10. REFERENCES [1] E. P. Bugge, et al.: “Using sequence alignment and voting to improve optical music recognition from multiple recognizers,” Proc. ISMIR, Vol. 12, pp. 405–410, 2011. 647 15th International Society for Music Information Retrieval Conference (ISMIR 2014) This Page Intentionally Left Blank 648 15th International Society for Music Information Retrieval Conference (ISMIR 2014) CLASSIFYING EEG RECORDINGS OF RHYTHM PERCEPTION Sebastian Stober, Daniel J. Cameron and Jessica A. Grahn Brain and Mind Institute, Department of Psychology, Western University, London, ON, Canada {sstober,dcamer25,jgrahn}@uwo.ca ABSTRACT Electroencephalography (EEG) recordings of rhythm perception might contain enough information to distinguish different rhythm types/genres or even identify the rhythms themselves. In this paper, we present first classification results using deep learning techniques on EEG data recorded within a rhythm perception study in Kigali, Rwanda. 
We tested 13 adults, mean age 21, who performed three behavioral tasks using rhythmic tone sequences derived from either East African or Western music. For the EEG testing, 24 rhythms – half East African and half Western with identical tempo and based on a 2-bar 12/8 scheme – were each repeated for 32 seconds. During presentation, the participants’ brain waves were recorded via 14 EEG channels. We applied stacked denoising autoencoders and convolutional neural networks on the collected data to distinguish African and Western rhythms on a group and individual participant level. Furthermore, we investigated how far these techniques can be used to recognize the individual rhythms. 1. INTRODUCTION Musical rhythm occurs in all human societies and is related to many phenomena, such as the perception of a regular emphasis (i.e., beat), and the impulse to move one’s body. However, the brain mechanisms underlying musical rhythm are not fully understood. Moreover, musical rhythm is a universal human phenomenon, but differs between human cultures, and the influence of culture on the processing of rhythm in the brain is uncharacterized. In order to study the influence of culture on rhythm processing, we recruited participants in East Africa and Canada to test their ability to perceive and produce rhythms derived from East African and Western music. Besides behavioral tasks, which have already been discussed in [4], the East African participants also underwent electroencephalography (EEG) recording while listening to East African and Western musical rhythms thus enabling us to study the neural mechanisms underlying rhythm perception. We were interested in differences between neuronal entrainment to the periodicities in East African versus Western rhythms for participants from those respective cultures. Entrainment was defined as c Sebastian Stober, Daniel J. Cameron and Jessica A. Grahn. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sebastian Stober, Daniel J. Cameron and Jessica A. Grahn. “Classifying EEG Recordings of Rhythm Perception”, 15th International Society for Music Information Retrieval Conference, 2014. 649 the magnitudes of steady state evoked potentials (SSEPs) at frequencies related to the metrical structure of rhythms. A similar approach has been used previously to study entrainment to rhythms [17, 18]. But it is also possible to look at the collected EEG data from an information retrieval perspective by asking questions like How well can we tell from the EEG whether a participant listened to an East African or Western rhythm? or Can we even say from a few seconds of EEG data which rhythm somebody listened to? Note that answering such question does not necessarily require an understanding of the underlying processes. Hence, we have attempted to let a machine figure out how best to represent and classify the EEG recordings employing recently developed deep learning techniques. In the following, we will review related work in Section 2, describe the data acquisition and pre-processing in Section 3 present our experimental findings in Section 4, and discuss further steps in Section 5. 2. RELATED WORK Previous research demonstrates that culture influences perception of the metrical structure (the temporal structure of strong and weak positions in rhythms) of musical rhythms in infants [20] and in adults [16]. However, few studies have investigated differences in brain responses underlying the cultural influence on rhythm perception. 
One study found that participants performed better on a recall task for culturally familiar compared to unfamiliar music, yet found no influence of cultural familiarity on neural activations while listening to the music while undergoing functional magnetic resonance imaging (fMRI) [15]. Many studies have used EEG and magnoencephalography (MEG) to investigate brain responses to auditory rhythms. Oscillatory neural activity in the gamma (20-60 Hz) frequency band is sensitive to accented tones in a rhythmic sequence and anticipates isochronous tones [19]. Oscillations in the beta (20-30 Hz) band increase in anticipation of strong tones in a non-isochronous sequence [5, 6, 10]. Another approach has measured the magnitude of SSEPs (reflecting neural oscillations entrained to the stimulus) while listening to rhythmic sequences [17, 18]. Here, enhancement of SSEPs was found for frequencies related to the metrical structure of the rhythm (e.g., the frequency of the beat). In contrast to these studies investigating the oscillatory activity in the brain, other studies have used EEG to investigate event-related potentials (ERPs) in responses to tones occurring in rhythmic sequences. This approach has been used to show distinct sensitivity to perturbations of the rhythmic pat- 15th International Society for Music Information Retrieval Conference (ISMIR 2014) tern vs. the metrical structure in rhythmic sequences [7], and to suggest that similar responses persist even when attention is diverted away from the rhythmic stimulus [12]. In the field of music information retrieval (MIR), retrieval based on brain wave recordings is still a very young and unexplored domain. So far, research has mainly focused on emotion recognition from EEG recordings (e.g., [3, 14]). For rhythms, however, Vlek et al. [23] already showed that imagined auditory accents can be recognized from EEG. They asked ten subjects to listen to and later imagine three simple metric patterns of two, three and four beats on top of a steady metronome click. Using logistic regression to classify accented versus unaccented beats, they obtained an average single-trial accuracy of 70% for perception and 61% for imagery. These results are very encouraging to further investigate the possibilities for retrieving information about the perceived rhythm from EEG recordings. In the field of deep learning, there has been a recent increase of works involving music data. However, MIR is still largely under-represented here. To our knowledge, no prior work has been published yet on using deep learning to analyze EEG recordings related to music perception and cognition. However, there are some first attempts to process EEG recordings with deep learning techniques. Wulsin et al. [24] used deep belief nets (DBNs) to detect anomalies related to epilepsy in EEG recordings of 11 subjects by classifying individual “channel-seconds”, i.e., onesecond chunks from a single EEG channel without further information from other channels or about prior values. Their classifier was first pre-trained layer by layer as an autoencoder on unlabelled data, followed by a supervised fine-tuning with backpropagation on a much smaller labeled data set. They found that working on raw, unprocessed data (sampled at 256Hz) led to a classification accuracy comparable to handcrafted features. Langkvist et al. [13] similarly employed DBNs combined with a hidden Markov model (HMM) to classify different sleep stages. 
Their data for 25 subjects comprises EEG as well as recordings of eye movements and skeletal muscle activity. Again, the data was segmented into one-second chunks. Here, a DBN on raw data showed a classification accuracy close to one using 28 hand-selected features. 3. DATA ACQUISITION & PRE-PROCESSING 3.1 Stimuli African rhythm stimuli were derived from recordings of traditional East African music [1]. The author (DC) composed the Western rhythmic stimuli. Rhythms were presented as sequences of sine tones that were 100ms in duration with intensity ramped up/down over the first/final 50ms and a pitch of either 375 or 500 Hz. All rhythms had a temporal structure of 12 equal units, in which each unit could contain a sound or not. For each rhythmic stimulus, two individual rhythmic sequences were overlaid – each at a different pitch. For each cultural type of rhythm, there were 2 groups of 3 individual rhythms for which rhythms could be overlaid with the others in their group. Because an individual rhythm could be one 650 Table 1. Rhythmic sequences in groups of three that pairings were based on. All ‘x’s denote onsets. Larger, bold ‘X’s denote the beginning of a 12 unit cycle (downbeat). Western Rhythms 1 X x x x 2 X x 3 X x x x x x x x x x x X x x x x x X x x x x x X x x 4 X x x 5 X x x x 6 X x x x x x x X x x x x x X x x x x x x x x x X x x x x x x x x x x x x x x x x x x x x x x x x x x x x x East African Rhythms 1 X 2 X 3 X x x x x x x x x x x x x x x X x x X x X 4 X 5 X 6 X x x x x x x x x X x x x x x x x X x x x x x x X x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x of two pitches/sounds, this made for a total of 12 rhythmic stimuli from each culture, each used for all tasks. Furthermore, rhythmic stimuli could be one of two tempi: having a minimum inter-onset interval of 180 or 240ms. 3.2 Study Description Sixteen East African participants were recruited in Kigali, Rwanda (3 female, mean age: 23 years, mean musical training: 3.4 years, mean dance training: 2.5 years). Thirteen of these participated in the EEG portion of the study as well as the behavioral portion. All participants were over the age of 18, had normal hearing, and had spent the majority of their lives in East Africa. They all gave informed consent prior to participating and were compensated for their participation, as per approval by the ethics boards at the Centre Hospitalier Universitaire de Kigali and the University of Western Ontario. After completion of the behavioral tasks, electrodes were placed on the participant’s scalp. They were instructed to sit with eyes closed and without moving for the duration of the recording, and to maintain their attention on the auditory stimuli. All rhythms were repeated for 32 seconds, presented in counterbalanced blocks (all East African rhythms then all Western rhythms, or vice versa), and with randomized order within blocks. All 12 rhythms of each type were presented – all at the same tempo (fast tempo for subjects 1–3 and 7–9, and slow tempo for the others). Each rhythm was preceded by 4 seconds of silence. EEG was recorded via a portable Grass EEG system using 14 channels at a sampling rate of 400Hz and impedances were kept below 10kΩ. 3.3 Data Pre-Processing EEG recordings are usually very noisy. They contain artifacts caused by muscle activity such as eye blinking as well as possible drifts in the impedance of the individual electrodes over the course of a recording. 
Furthermore, the recording equipment is very sensitive and easily picks up interferences from the surroundings. For instance, in this experiment, the power supply dominated the frequency band around 50Hz. All these issues have led to the common practice to invest a lot of effort 15th International Society for Music Information Retrieval Conference (ISMIR 2014) into pre-processing EEG data, often even manually rejecting single frames or channels. In contrast to this, we decided to put only little manual work into cleaning the data and just removed obviously bad channels, thus leaving the main work to the deep learning techniques. After bad channel removal, 12 channels remained for subjects 1–5 and 13 for subjects 6–13. We followed the common practice in machine learning to partition the data into training, validation (or model selection) and test sets. To this end, we split each 32s-long trial recording into three non-overlapping pieces. The first four seconds were used for the validation dataset. The rationale behind this was that we expected that the participants would need a few seconds in the beginning of each trial to get used to the new rhythm. Thus, the data would be less suited for training but might still be good enough to estimate the model accuracy on unseen data. The next 24 seconds were used for training and the remaining four seconds for testing. The data was finally converted into the input format required by the neural networks to be learned. 1 If the network just took the raw EEG data, each waveform was normalized to a maximum amplitude of 1 and then split into equally sized frames matching the size of the network’s input layer. No windowing function was applied and the frames overlapped by 75% of their length. If the network was designed to process the frequency spectrum, the processing involved: 1. computing the short-time Fourier transform (STFT) with given window length of 64 samples and 75% overlap, 2. computing the log amplitude, 3. scaling linearly to a maximum of 1 (per sequence), 4. (optionally) cutting of all frequency bins above the number requested by the network, 5. splitting the data into frames matching the network’s input dimensionality with a given hop size of 5 to control the overlap. As the classes were perfectly balanced for both tasks, we chose the accuracy, i.e., the percentage of correctly classified instances, as evaluation measure. Accuracy can be measured on several levels. The network predicts a class label for each input frame. Each frame is a segment from the time sequence of a single EEG channel. Finally, for each trial, several channels were recorded. Hence, it is natural to also measure accuracy also at the sequence (i.e, channel) and trial level. There are many ways to aggregate frame label predictions into a prediction for a channel or a trial. We tested the following three ways to compute a score for each class: • plain: sum of all 0-or-1 outputs per class • fuzzy: sum of all raw output activations per class • probabilistic: sum of log output activations per class While the latter approach which gathers the log likelihoods from all frames worked best for a softmax output layer, it usually performed worse than the fuzzy approach for the DLSVM output layer with its hinge loss (see below). The plain approach worked best when the frame accuracy was close to the chance level for the binary classification task. 
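To make the three aggregation schemes concrete, here is a minimal sketch (not the authors' pylearn2 code; the array shapes and function names are assumptions) of how per-frame outputs could be pooled into a single channel- or trial-level prediction.

import numpy as np

def aggregate(frame_outputs, scheme="fuzzy"):
    # frame_outputs: array of shape (n_frames, n_classes) holding the
    # network's raw output activations for every frame of one channel/trial.
    if scheme == "plain":          # sum of 0-or-1 votes per class
        votes = np.zeros_like(frame_outputs)
        votes[np.arange(len(frame_outputs)),
              frame_outputs.argmax(axis=1)] = 1.0
        return votes.sum(axis=0)
    if scheme == "fuzzy":          # sum of raw output activations per class
        return frame_outputs.sum(axis=0)
    if scheme == "probabilistic":  # sum of log output activations per class
        return np.log(np.clip(frame_outputs, 1e-12, None)).sum(axis=0)
    raise ValueError(scheme)

# The channel- or trial-level prediction is then the argmax of the
# aggregated scores, e.g.: label = aggregate(frame_outputs, "plain").argmax()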
Hence, we chose the plain aggregation scheme whenever the frame accuracy was below 52% on the validation set and otherwise the fuzzy approach. We expected significant inter-individual differences and therefore made learning good individual models for the participants our priority. We then tested configuration that worked well for individuals on three groups – all participants as well as one group for each tempo, containing 6 and 7 subjects respectively. 4.1 Classification into African and Western Rhythms 4.1.1 Multi-Layer Perceptron with Pre-Trained Layers Here, the number of retained frequency bins and the input length were considered as hyper-parameters. 4. EXPERIMENTS & FINDINGS All experiments were implemented using Theano [2] and pylearn2 [8]. 2 The computations were run on a dedicated 12-core workstation with two Nvidia graphics cards – a Tesla C2075 and a Quadro 2000. As the first retrieval task, we focused on recognizing whether a participant had listened to an East African or Western rhythm (Section 4.1). This binary classification task is most likely much easier than the second task – trying to predict one out of 24 rhythms (Section 4.2). Unfortunately, due to the block design of the study, it was not possible to train a classifier for the tempo. Trying to do so would yield a classifier that “cheated” by just recognizing the inter-individual differences because every participant only listened to stimuli of the same tempo. 1 Most of the processing was implemented through the librosa library available at https://github.com/bmcfee/librosa/. 2 The code to run the experiments is publicly available as supplementary material of this paper at http://dx.doi.org/10.6084/m9. figshare.1108287 651 Motivated by the existing deep learning approaches for EEG data (cf. Section 2), we choose to pre-train a MLP as an autoencoder for individual channel-seconds – or similar fixedlength chunks – drawn from all subjects. In particular, we trained a stacked denoising autoencoder (SDA) as proposed in [22] where each individual input was set to 0 with a corruption probability of 0.2. We tested several structural configurations, varying the input sample rate (400Hz or down-sampled to 100Hz), the number of layers, and the number of neurons in each layer. The quality of the different models was measured as the mean squared reconstruction error (MSRE). Table 2 gives an overview of the reconstruction quality for selected configurations. All SDAs were trained with tied weights, i.e., the weight matrix of each decoder layer equals the transpose of the respective encoder layer’s weight matrix. Each layer was trained with stochastic gradient descent (SGD) on minibatches of 100 examples for a maximum of 100 epochs with an initial learning rate of 0.05 and exponential decay. In order to turn a pre-trained SDA into a multilayer perceptron (MLP) for classification, we replaced the decoder part of the SDA with a DLSVM layer as proposed in [21]. 3 This special kind of output layer for classification uses the hinge 3 We used the experimental implementation for pylearn2 provided by Kyle Kastner at https://github.com/kastnerkyle/pylearn2/ blob/svm_layer/pylearn2/models/mlp.py 15th International Society for Music Information Retrieval Conference (ISMIR 2014) Table 2. MSRE and classification accuracy for selected SDA (top, A-F) and CNN (bottom, G-I) configurations. 
neural network configuration id (sample rate, input format, hidden layer sizes) MSRE train MLP Classification Accuracy (for frames, channels and trials) in % test indiv. subjects fast (1–3, 7–9) slow (4–6, 10–13) all (1–13) A 100Hz, 100 samples, 50-25-10 (SDA for subject 2) 4.35 4.17 61.1 65.5 72.4 58.7 60.6 61.1 53.7 56.0 59.5 53.5 56.6 60.3 B 100Hz, 100 samples, 50-25-10 3.19 3.07 58.1 62.0 66.7 58.1 60.7 61.1 53.5 57.7 57.1 52.1 53.5 54.5 C 100Hz, 100 samples, 50-25 1.00 0.96 61.7 65.9 71.2 58.6 62.3 63.2 54.4 56.4 57.1 53.4 54.8 56.4 D 400Hz, 100 samples, 50-25-10 0.54 0.53 51.7 58.9 62.2 50.3 50.6 50.0 50.0 51.8 51.2 50.1 50.2 50.0 E 400Hz, 100 samples, 50-25 0.36 0.34 60.8 65.9 71.8 56.3 58.6 66.0 52.0 55.0 56.0 49.9 50.1 56.1 F 400Hz, 80 samples, 50-25-10 0.33 0.32 52.0 59.9 62.5 52.3 53.9 54.9 50.5 53.5 55.4 50.2 51.0 50.3 G 100Hz, 100 samples, 2 conv. layers 62.0 63.9 67.6 57.1 57.9 59.7 49.9 50.2 50.0 51.7 52.8 52.9 H 100Hz, 200 samples, 2 conv. layers 64.0 64.8 67.9 58.2 58.5 61.1 49.5 49.6 50.6 50.9 50.2 50.6 I 400Hz, 1s freq. spectrum (33 bins), 2 conv. layers 69.5 70.8 74.7 58.1 58.0 59.0 53.8 54.5 53.0 53.7 53.9 52.6 J 400Hz, 2s freq. spectrum (33 bins), 2 conv. layers 72.2 72.6 77.6 57.6 57.5 60.4 52.9 52.9 54.8 53.1 53.5 52.3 Figure 1. Boxplot of the frame-level accuracy for each individual subject aggregated over all configurations. 5 loss as cost function and replaces the commonly applied softmax. We observed much smoother learning curves and a slightly increased accuracy when using this cost function for optimization together with rectification as non-linearity in the hidden layers. For training, we used SGD with dropout regularization [9] and momentum, a high initial learning rate of 0.1 and exponential decay over each epoch. After training for 100 epochs on minibatches of size 100, we selected the network that maximized the accuracy on the validation dataset. We found that the dropout regularization worked really well and largely avoided over-fitting to the training data. In some cases, even a better performance on the test data could be observed. The obtained mean accuracies for the selected SDA configurations are also shown in Table 2 for MLPs trained for individual subjects as well as for the three groups. As Figure 1 illustrates, there were significant individual differences between the subjects. Whilst learning good classifiers appeared to be easy for subject 9, it was much harder for subjects 5 and 13. As expected, the performance for the groups was inferior. Best results were obtained for the “fast” group, which comprised only 6 subjects including 2 and 9 who were amongst the easiest to classify. We found that two factors had a strong impact on the MSRE: the amount of (lossy) compression through the autoencoder’s bottleneck and the amount of information the 5 Boxes show the 25th to 75th percentiles with a mark for the median within, whiskers span to furthest values within the 1.5 interquartile range, remaining outliers are shown as crossbars. network processes at a time. Configurations A, B and D had the highest compression ratio (10:1). C and E lacked the third autoencoder layer and thus only compressed at 4:1 and with a lower resulting MSRE. F had exactly twice the compression ratio as C and E. While the difference in the MSRE was remarkable between F and C, it was much less so between F and E – and even compared to D. This could be explained by the four times higher sample rate of D–F. 
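As a rough illustration of the tied-weight denoising autoencoder layers behind configurations A–F, the following sketch (a PyTorch-style reformulation under our own assumptions, not the authors' Theano/pylearn2 implementation; the ReLU non-linearity and the initialization here are illustrative choices) shows one such layer with the 0.2 input-corruption probability described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedDenoisingAE(nn.Module):
    # One denoising-autoencoder layer with tied weights: the decoder
    # uses the transpose of the encoder weight matrix.
    def __init__(self, n_in, n_hidden, corruption=0.2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_in) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_hidden))
        self.b_dec = nn.Parameter(torch.zeros(n_in))
        self.corruption = corruption

    def encode(self, x):
        return torch.relu(F.linear(x, self.W, self.b_enc))

    def forward(self, x):
        # Randomly zero out inputs (denoising criterion), then reconstruct.
        mask = (torch.rand_like(x) > self.corruption).float()
        h = self.encode(x * mask)
        return F.linear(h, self.W.t(), self.b_dec)

# Training one layer (sketch): minimise the mean squared reconstruction
# error (MSRE) against the uncorrupted input, e.g. for a 100-sample input
# frame and a 50-unit hidden layer as in configurations A-C:
# layer = TiedDenoisingAE(100, 50)
# loss = F.mse_loss(layer(x), x)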
Whilst A–E processed the same amount of samples at a time, the input for A–C contained much more information as they were looking at 1s of the signal in contrast to only 250ms. Judging from the MSRE, the longer time span appears to be harder to compress. This makes sense as EEG usually contains most information in the lower frequencies and higher sampling rates do not necessarily mean more content. Furthermore, with growing size of the input frames, the variety of observable signal patterns increases and they become harder to approximate. Figure 2 illustrates the difference between two reconstructions of the same 4s raw EEG input segment using configurations B and D. In this specific example, the MSRE for B is ten times as high compared to D and the loss of detail in the reconstruction is clearly visible. However, D can only see 250ms of the signal at a time whereas B processes one channel-second. Configuration A had the highest MSRE as it was only trained on data from subject 2 but needed to process all other subjects as well. Very surprisingly, the respective MLP produced much better predictions than B, which had identical structure. It is not clear what caused this effect. One explanation could be that the data from subject 2 was cleaner than for other participants as it also led to one amongst the best individual classification accuracies. 6 This could have led to more suitable features learned by the SDA. In general, the two-hidden-layer models worked better than the threehidden-layer ones. Possibly, the compression caused by the third hidden layer was just too much. Apart from this, it was hard to make out a clear “winner” between A, C and E. There seemed to be a trade-off between the accuracy of the reconstruction (by choosing a smaller window size and/or higher sampling rate) and learning more suitable features 6 Most of the model/learning parameters were selected by training just on subject 2. 652 15th International Society for Music Information Retrieval Conference (ISMIR 2014) 1.0 0.25 0.5 0.0 0.5 1.0 0s 1s 2s 3s 1.0 0.00 4s 0.25 0.5 0.0 0.5 1.0 0s 1s 2s 3s 0.00 4s Figure 2. Input (blue) and its reconstruction (red) for the same 4s sequence from the test data. The background color indicates the squared sample error. Top: Configuration B (100Hz) with MSRE 6.43. Bottom: Configuration D (400Hz) with MSRE 0.64. (The bottom signals shows more higher-frequency information due to the four-times higher sampling rate.) Table 3. Structural parameters of the CNN configurations. input convolutional layer 1 convolutional layer 2 id dim. shape patterns pool stride shape patterns pool stride G 100x1 15x1 1 70x1 H 200x1 25x1 10 7 1 151x1 10 7 1 I 22x33 1x33 20 5 1 9x1 10 5 1 J 47x33 1x33 20 5 1 9x1 10 5 1 10 7 10 7 the fast stimuli (for subjects 2 and 9) and the slow ones (for subject 12) respectively. For each pair, we trained a classifier with configuration J using all but the two rhythms of the pair. 7 Due to the amount of computation required, we trained only for 3 epochs each. With the learned classifiers, the mean frame-level accuracy over all 144 rhythm pairs was 82.6%, 84.5% and 79.3% for subject 2, 9 and 12 respectively. These value were only slightly below those shown in Figure 1, which we considered very remarkable after only 3 training epochs. 1 for recognizing the rhythm type at a larger time scale. This led us to try a different approach using convolutional neural networks (CNNs) as, e.g., described in [11]. 
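The leave-two-out pairing used in the cross-trial check described above can be sketched as follows; the rhythm identifiers are placeholders, not the actual stimulus labels.

from itertools import product

# Hypothetical identifiers for the 12 East African and 12 Western rhythms.
east_african = ['ea_%02d' % i for i in range(1, 13)]
western = ['we_%02d' % i for i in range(1, 13)]

def cross_trial_splits():
    # Yield all 144 (East African, Western) test pairs together with the
    # rhythms used for training, i.e. everything except the held-out pair.
    for ea, we in product(east_african, western):
        train = [r for r in east_african + western if r not in (ea, we)]
        yield (ea, we), train

# Sanity check: 12 x 12 = 144 pairs in total.
assert sum(1 for _ in cross_trial_splits()) == 144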
4.2 Identifying Individual Rhythms 4.1.2 Convolutional Neural Network We decided on a general layout consisting of two convolutional layers where the first layer was supposed to pick up beat-related patterns while the second would learn to recognize higher-level structures. Again, a DLSVM layer was used for the output and the rectifier non-linearity in the hidden layers. The structural parameters are listed in Table 3. As pooling operation, the maximum was applied. Configurations G and H processed the same raw input as A–F whereas I and J took the frequency spectrum as input (using all 33 bins). All networks were trained for 20 epochs using SGD with a momentum of 0.5 and an exponential decaying learning rate initialized at 0.1. The obtained accuracy values are listed in Table 2 (bottom). Whilst G and H produced results comparable to A–F, the spectrum-based CNNs, I and J, clearly outperformed all other configurations for the individual subjects. For all but subjects 5 and 11, they showed the highest frame-level accuracy (c.f. Figure 1). For subjects 2, 9 and 12, the trial classification accuracy was even higher than 90% (not shown). 4.1.3 Cross-Trial Classification In order to rule out the possibility that the classifiers just recognized the individual trials – and not the rhythms – by coincidental idiosyncrasies and artifacts unrelated to rhythm perception, we additionally conducted a cross-trial classification experiment. Here, we only considered all subjects with frame-level accuracies above 80% in the earlier experiments – i.e., subjects 2, 9 and 12. We formed 144 rhythm pairs by combining each East African with each Western rhythm from 653 Recognizing the correct rhythm amongst 24 candidates was a much harder task than the previous one – especially as all candidates had the same meter and tempo. The chance level for 24 evenly balanced classes was only 4.17%. We used again configuration J as our best known solution so far and trained an individual classifier for each subject. As Figure 3 shows, the accuracy on the 2s input frames was at least twice the chance level. Considering that these results were obtained without any parameter tuning, there is probably still much room for improvements. Especially, similarities amongst the stimuli should be considered as well. 5. CONCLUSIONS AND OUTLOOK We obtained encouraging first results for classifying chunks of 1-2s recorded from a single EEG channel into East African or Western rhythms using convolutional neural networks (CNNs) and multilayer perceptrons (MLPs) pre-trained as stacked denoising autoencoders (SDAs). As it turned out, some configurations of the SDA (D and F) were especially suited to recognize unwanted artifacts like spikes in the waveforms through the reconstruction error. This could be elaborated in the future to automatically discard bad segments during preprocessing. Further, the classification accuracy for individual rhythms was significantly above chance level and encourages more research in this direction. As this has been an initial and by no means exhaustive exploration of the model- and leaning parameter space, there seems to be a lot more potential – especially in CNNs processing the frequency spectrum – and 7 Deviating from the description given in Section 3.3, we used the first 4s of each recording for validation and the remaining 28s for training as the test set consisted of full 32s from separate recordings in this special case. 
15th International Society for Music Information Retrieval Conference (ISMIR 2014) subject 1 2 3 4 5 6 7 8 9 10 11 12 13 mean accuracy 15.8% 9.9% 12.0% 21.4% 10.3% 13.9% 16.2% 11.0% 11.0% 10.3% 9.2% 17.4% 8.3% 12.8% precision @3 31.5% 29.9% 26.5% 48.2% 28.3% 27.4% 41.2% 27.8% 28.5% 33.2% 24.7% 39.9% 20.7% 31.4% mean reciprocal rank 0.31 0.27 0.27 0.42 0.26 0.28 0.36 0.27 0.28 0.30 0.25 0.36 0.23 0.30 Figure 3. Confusion matrix for all subjects (left) and per-subject performance (right) for predicting the rhythm (24 classes). we will continue to look for better designs than those considered here. We are also planning to create publicly available data sets and benchmarks to attract more attention to these challenging tasks from the machine learning and information retrieval communities. As expected, individual differences were very high. For some participants, we were able to obtain accuracies above 90%, but for others, it was already hard to reach even 60%. We hope that by studying the models learned by the classifiers, we may shed some light on the underlying processes and gain more understanding on why these differences occur and where they originate. Also, our results still come with a grain of salt: We were able to rule out side effects on a trial level by successfully replicating accuracies across trials. But due to the study’s block design, there remains still the chance that unwanted external factors interfered with one of the two blocks while being absent during the other one. Here, the analysis of the learned models could help to strengthen our confidence in the results. The study is currently being repeated with North America participants and we are curious to see whether we can replicate our findings. Furthermore, we want to extend our focus by also considering more complex and richer stimuli such as audio recordings of rhythms with realistic instrumentation instead of artificial sine tones. Acknowledgments: This work was supported by a fellowship within the Postdoc-Program of the German Academic Exchange Service (DAAD), by the Natural Sciences and Engineering Research Council of Canada (NSERC), through the Western International Research Award R4911A07, and by an AUCC Students for Development Award. 6. REFERENCES [1] G.F. Barz. Music in East Africa: experiencing music, expressing culture. Oxford University Press, 2004. [2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proc. of the Python for Scientific Computing Conference (SciPy), 2010. [3] R. Cabredo, R.S. Legaspi, P.S. Inventado, and M. Numao. An emotion model for music using brain waves. In ISMIR, pages 265–270, 2012. [4] D.J. Cameron, J. Bentley, and J.A. Grahn. Cross-cultural influences on rhythm processing: Reproduction, discrimination, and beat tapping. Frontiers in Human Neuroscience, to appear. [5] T. Fujioka, L.J. Trainor, E.W. Large, and B. Ross. Beta and gamma rhythms in human auditory cortex during musical beat processing. Annals of the New York Academy of Sciences, 1169(1):89–92, 2009. [6] T. Fujioka, L.J. Trainor, E.W. Large, and B. Ross. Internalized timing of isochronous sounds is represented in neuromagnetic beta oscillations. The Journal of Neuroscience, 32(5):1791–1802, 2012. [7] E. Geiser, E. Ziegler, L. Jancke, and M. Meyer. Early electrophysiological correlates of meter and rhythm processing in music perception. Cortex, 45(1):93–102, 2009. [8] I.J. Goodfellow, D. Warde-Farley, P. 
Lamblin, V. Dumoulin, M. Mirza, R. Pascanu, J. Bergstra, F. Bastien, and Y. Bengio. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013. [9] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012. [10] J.R. Iversen, B.H. Repp, and A.D. Patel. Top-down control of rhythm perception modulates early auditory responses. Annals of the New York Academy of Sciences, 1169(1):58–73, 2009. [11] A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012. [12] O. Ladinig, H. Honing, G. Háden, and I. Winkler. Probing attentive and preattentive emergent meter in adult listeners without extensive music training. Music Perception, 26(4):377–386, 2009. [13] M. Längkvist, L. Karlsson, and M. Loutfi. Sleep stage classification using unsupervised feature learning. Advances in Artificial Neural Systems, 2012:5:5–5:5, Jan 2012. [14] Y.-P. Lin, T.-P. Jung, and J.-H. Chen. EEG dynamics during music appreciation. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual Int. Conf. of the IEEE, pages 5316–5319, 2009. [15] S.J. Morrison, S.M. Demorest, E.H. Aylward, S.C. Cramer, and K.R. Maravilla. Fmri investigation of cross-cultural music comprehension. Neuroimage, 20(1):378–384, 2003. [16] S.J. Morrison, S.M. Demorest, and L.A. Stambaugh. Enculturation effects in music cognition the role of age and music complexity. Journal of Research in Music Education, 56(2):118–129, 2008. [17] S. Nozaradan, I. Peretz, M. Missal, and A. Mouraux. Tagging the neuronal entrainment to beat and meter. The Journal of Neuroscience, 31(28):10234–10240, 2011. [18] S. Nozaradan, I. Peretz, and A. Mouraux. Selective neuronal entrainment to the beat and meter embedded in a musical rhythm. The Journal of Neuroscience, 32(49):17572–17581, 2012. [19] J.S. Snyder and E.W. Large. Gamma-band activity reflects the metric structure of rhythmic tone sequences. Cognitive brain research, 24(1):117–126, 2005. [20] G. Soley and E.E. Hannon. Infants prefer the musical meter of their own culture: a cross-cultural comparison. Developmental psychology, 46(1):286, 2010. [21] Y. Tang. Deep Learning using Linear Support Vector Machines. arXiv preprint arXiv:1306.0239, 2013. [22] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, Dec 2010. [23] R.J. Vlek, R.S. Schaefer, C.C.A.M. Gielen, J.D.R. Farquhar, and P. Desain. Shared mechanisms in perception and imagery of auditory accents. Clinical Neurophysiology, 122(8):1526–1532, Aug 2011. [24] D.F. Wulsin, J.R. Gupta, R. Mani, J.A. Blanco, and B. Litt. Modeling electroencephalography waveforms with semi-supervised deep belief nets: fast classification and anomaly measurement. Journal of Neural Engineering, 8(3):036015, Jun 2011. 
MIREX Oral Session

TEN YEARS OF MIREX: REFLECTIONS, CHALLENGES AND OPPORTUNITIES
J. Stephen Downie, University of Illinois, jdownie@illinois.edu
Xiao Hu, University of Hong Kong, xiaoxhu@hku.hk
Jin Ha Lee, University of Washington, jinhalee@uw.edu
Kahyun Choi, University of Illinois, ckahyu2@illinois.edu
Sally Jo Cunningham, University of Waikato, sallyjo@waikato.ac.nz
Yun Hao, University of Illinois, yunhao2@illinois.edu

ABSTRACT
The Music Information Retrieval Evaluation eXchange (MIREX) has been run annually since 2005, with the October 2014 plenary marking its tenth iteration. By 2013, MIREX has evaluated approximately 2000 individual music information retrieval (MIR) algorithms for a wide range of tasks over 37 different test collections. MIREX has involved researchers from over 29 different countries with a median of 109 individual participants per year. This paper summarizes the history of MIREX from its earliest planning meeting in 2001 to the present. It reflects upon the administrative, financial, and technological challenges MIREX has faced and describes how those challenges have been surmounted. We propose new funding models, a distributed evaluation framework, and more holistic user experience evaluation tasks (some evolutionary, some revolutionary) for the continued success of MIREX. We hope that this paper will inspire MIR community members to contribute their ideas so MIREX can have many more successful years to come.

1. INTRODUCTION
Music Information Retrieval Evaluation eXchange (MIREX) is an annual evaluation campaign managed by the International Music Information Retrieval Systems Evalu