ENSEEIHT
2 rue Charles Camichel, BP 7122
31071 Toulouse Cedex 7

Télécommunications Spatiales et Aéronautiques (TéSA)
17 bis, rue Paul Riquet
F-31000 Toulouse, FRANCE

Automatic Music Transcription using Autoregressive Frequency Estimation

14 June 2001

Fredrik Hekland
NTNU, Norwegian University of Science and Technology

Under the direction of:
Corinne Mailhes (ENSEEIHT)
David Bonacci (ENSEEIHT)

Preface

Where and why

With help from the Erasmus student exchange programme, I had the opportunity to spend my fourth year as an engineering student abroad. I chose France because I wanted to learn to speak French and to see parts of Europe I had not seen before. I found ENSEEIHT in Toulouse, which offered Signal Processing, and luckily they accepted my application. Since only the fifth year at this school offered enough courses in signal processing, I followed the option "Traitement du Signal et des Images" even though I was one year short. This also meant that a "Stage", a four-month final project, was ahead of me. I was kindly given the chance to do the Stage at TéSA, a research laboratory at the school. Some possible subjects were presented to me, and after some advice from my supervising professor at NTNU I chose the subject of music transcription.

TéSA (Télécommunications Spatiales et Aéronautiques) is a newly created research laboratory, run as a collaboration between several schools and enterprises. The lab is well equipped with both hardware and software, and has a good library, which made my literature review easy.

The work

I began with a two-week literature review, trying to find previous work in the field and all the necessary background information. Many articles were found with help from google.com and citeseer.nj.nec.com, while the library contained most of the IEEE publications. After an overview of the problems had been gained, a longer period of testing was conducted.
Both synthetic signals and samples from real instruments were used. The different frequency estimators and model order criteria were explored, and it was decided to use the Modified Covariance Method coupled with a simple AIC/MDL order selection criterion for the transcriber. A working monophonic transcriber was built, and some of the findings in the project may permit an extension to polyphonic operation.

Software

Matlab 5.2 and 5.3 were used for all coding and most of the analysis work, and Spectrogram 6.0.4 helped in analysing the instrument samples. Yamaha's free wave editor was indispensable for mixing and manipulating the necessary wav files. The Matlab files referred to in the text are not robust enough to be used for any serious transcribing. For that reason, the code is not available to the public.

Personal outcome

Even though my work is not exactly ground-breaking, I have gained much personally, especially concerning the process of doing research work and writing a report. I now know more about how to proceed and what to do along the way, and certainly some of the pitfalls to avoid. I have seen great development in both my Matlab programming and my signal processing, and I have gained a greater understanding of parametric modelling. Finally, it has been interesting to see what life is like in a laboratory and to observe the cultural differences and similarities between Norway and France. Lastly, it must be mentioned that I have learned a lot of French, starting from zero before my arrival in France and now being able to communicate at a basic level without too much trouble. I hope I will be able to maintain and improve the language in the future.

Acknowledgements

I would like to thank the Director of the laboratory, Prof. Francis Castanié, and the head of "Traitement du Signal et des Images", Dr. Corinne Mailhes, for giving me the opportunity to do this work in the labs. Mdm. Mailhes was also responsible for my Stage, and David Bonacci, who was working on the subject closest to mine, was my most important advisor, giving good ideas and tips. Both deserve thanks.

Thanks to the guys at "Bureau treize" for having received me well, and for accepting me even though my French is at best confusing, and at times incomprehensible. It is a pity that I will not be able to take part in your nonsense. I had a good time with you all the same.

Thanks to all the other people at the lab for helping me out and being nice to me.

Lots of moral support and positive words from Tonje helped me when I needed it most. I love you.

Abstract

The project studies the use of Principal Component AR Frequency Estimation in automatic music transcription and discusses some of the problems arising when using AR models, among them model order selection. Some comparisons with classical Fourier methods are made. A well-functioning monophonic transcriber using the Modified Covariance Method as pitch estimator is implemented in Matlab, and some suggestions for further work are given.
Table of Contents

Preface
Acknowledgements
Abstract
Chapter 1 – Initial theoretical studies
    Introduction
        Presentation of the problem
        The goal of this project
    Literature review
        Papers dealing with musical transcription
        Commercial or free transcription programs available
    A closer look at the challenges
        Common features of instruments
        Problems encountered when estimating pitch
        Other problems related to transcription
        MIDI file format
    Pitch estimators
        Cross-correlation between signal and sinusoids
        Filter banks
        Fourier-based methods
        Autoregressive methods
        Model order selection
            Estimation of the Prediction Error Power
            Order estimation methods based on Singular Values or noise subspace estimation
Chapter 2 – Implementation of a music-transcriber
    Initial testing on real and synthetic musical signals
        Fourier methods, real instruments
        Yule-Walker, synthetic signals
        Modified covariance method, synthetic signals
    Implementation of the transcriber
        The structure of the transcriber
        Transcribing real music
            Flute5.wav – a simple flute solo
            Flute00.wav – A quicker flute solo
            Some different sound files
    Some final words
        Limitations of the transcriber
        Improvement for the transcriber and ideas for future work
        Conclusion
A – Converting between Note, MIDI and Frequency
B – Matlab code for the transcriber
References

Chapter 1 – Initial theoretical studies

Introduction

Presentation of the problem

Music can be represented in many different ways, ranging from simple monophonic PCM files with low sample rates to highly symbolic multitrack representations containing control information for various musical parameters. Every abstraction level has its use, and being able to switch easily to another representation would certainly be useful.
The most obvious application is to help musicians or composers write music by playing the actual instrument rather than writing notes or playing on a keyboard. This could also extend to the analysis of existing music where nothing but the sound recordings exists. An application that could be of greater commercial interest is the possibility of searching for music on the Internet by simply whistling the tune, so-called content-based retrieval or query by audio [McNab], [Vercoe97]. One could also think of a scheme that tracks all radio stations for a particular music style. Areas which would certainly appreciate perfect transcription are the coming standards MPEG-4 for structured audio [Vercoe97] and MPEG-7 for description of audio-visual content.

While the process of converting music from a symbolic format to a waveform representation (synthesis) has evolved over the years, and now gives a fairly realistic sound for a reasonable price, the opposite process (analysis or transcription) is far from being ready for commercialisation. There exist several monophonic solutions that are reported to work well in real time, but as soon as we want to analyse polyphonic music our options are effectively reduced to zero. A quick search on the Internet found some shareware programs claiming to perform transcription (see table 1), but trials showed these programs to perform rather poorly even after substantial parameter tuning.

Speech recognition is very similar to the problem of musical transcription, but while the former has attracted much interest and seen successful applications during the last three decades, research on music recognition has mainly been done by a few individuals with special interests in the subject. One reason for this is the apparent lack of commercial applications. Another is the complexity of the problem compared to speech recognition.
While speech is limited to frequencies between 50 Hz and 4 kHz and the sources all have similar characteristics, musical frequencies range from 20 Hz to 20 kHz and there are many different instrument models. However, the main problem is the fact that western music is built upon harmonic relations (i.e. different instruments playing frequencies that are related by simple ratios), which gives rise to spectral overlapping and possibly complete masking of certain notes. When we think of a symphony orchestra with many musicians playing simultaneously, the task of separating and recognising each one of them from a simple two-channel recording seems (and might prove to be) impossible.

Currently most systems work in a bottom-up fashion, that is, all decisions are based upon the frequency and segmentation information one obtains from the music recording. This works satisfactorily for monophonic music where the notes are well separated in time and frequency. But these systems are unable to correct even the most obvious errors, since they do not possess any knowledge of compositional styles (rules). This problem is addressed by running a top-down engine alongside the normal bottom-up recognising engine, for example by letting the "knowledge-loaded" top-down engine monitor the transcription process and intervene when it disagrees with the estimates found. The common way to implement such systems is the so-called blackboard system, whose name stems from the idea that several 'experts', each having knowledge of a certain parameter, are 'standing in front of a blackboard' and solving the given problem together. These systems are very flexible, since it is easy to add experts and the system can be driven either in a bottom-up correcting fashion or in a top-down predicting fashion.
The goal of this project

Given the restricted time and lack of experience, the goal of this project was to explore the problems related to transcription, review the existing solutions, and on that basis try to implement a monophonic transcriber that works better than the existing affordable programs. Instruments that are non-harmonic in nature, such as percussion instruments, are left out, while any kind of harmonic instrument is targeted. In fact, independence of instrument was wanted, and a possible later extension to polyphonic recognition was also desired. It was therefore decided to try other ways of doing frequency estimation, in the hope of obtaining higher resolution and more precise estimates than what is realisable using standard Fourier methods. A natural choice was to investigate the different parametric methods available, their advantages and disadvantages, and the well-known problem of model order selection. A very limited top-down correcting system is implemented in order to improve the transcription process in the absence of a segmentation system.

The standard MIDI file format was chosen as output format because of its widespread use and its relative simplicity. All the programs were to be built in Matlab, since it provides most of the needed routines and enables rapid development.

Literature review

Papers dealing with musical transcription

Only one article on the use of parametric methods in music recognition was found [Schro00]. There, three algorithms are discussed and analysed with respect to relative frequency precision. It was concluded that the Modified Covariance Method (MODCOVAR) was superior to both the standard Maximum Entropy Method (Yule-Walker) and Prony Spectral Line Estimation. Additionally, the two former methods give us the relative size of the spectral peaks for free. The "Modcovar" method was applied to a short piano recording using a fixed order of 20, showing a promising result. No order estimation was discussed.
Another, more elaborate work is the master's thesis of Anssi Klapuri [Klapuri98], where some of the problems related to automatic transcription are discussed and a system trying to resolve the problem of harmonic overlapping is described. The shortcomings of the purely bottom-up approach when it comes to polyphonic music, and the necessity of employing a top-down "knowledge system", are discussed. The thesis discusses very general problems and their possible solutions, and different techniques for extracting information are presented. Klapuri has also released several papers related to this thesis, directed more towards a specific implementation: [Klapuri01A], [Klapuri01B]. Even more can be found at <http://www.cs.tut.fi/sgn/arg/publications.html> and <http://www.cs.tut.fi/~klap/iiro/>.

Top-down systems are probably the most promising approach to polyphonic music recognition. Although such systems will need a greater understanding of human musical perception and a huge amount of musical knowledge at hand, some systems with limited knowledge have been implemented and show improvements over the usual bottom-up systems. Examples of an 'expert' in a top-down system are the probability of transition between different chords and rules for which notes can be played with which chord. See [Kashino98] and [Martin96].

Two other areas of importance, not yet treated in this project, are segmentation and instrument recognition. Segmentation of events (i.e. notes and pauses) makes it possible to respect the duration of each note, and to avoid including two notes in one frequency analysis frame, which gives better frequency estimates. The master's thesis of T. Jehan deals with this problem [Jehan97] and proposes methods both in the time domain and in the frequency domain. The third chapter in particular, which uses changes in the AR model as the basis for segmentation, could be an interesting future addition to this project.
Recognition of instruments gives us the opportunity to improve the process even more by using instrument models in the frequency analysis, and by automatically setting the correct instruments in the output file. The methods that have been tested are typically some kind of neural network working on cepstral coefficients or on features from log-lag correlograms; see [Brown99] and [Martin98].

Commercial or free transcription programs available

Table 1

Name | URL | Technology
WIDI Recognition System 2.6 | http://www.midi.ru/w2m | FFT based. Polyphonic.
AudioToMidi 1.01 | http://www.midi.ru/AudioToMidi/ | Unsure. Possibly cross-correlation with synthetic sinusoids. Polyphonic.
AmazingMIDI 1.60 | http://www.pluto.dti.ne.jp/~araki/amazingmidi/ | Unsure. Single instrument, polyphonic.
WAV2MID 1.5a | http://www.audioworks.com | Unknown. Monophonic.
DigitalEar 3.0 | http://www.digital-ear.com/index2.html | Unknown. Monophonic.

As said in the introduction, none of these programs offers fully automatic operation and independence of instrument, and even after many parameter adjustments none of the polyphonic ones gives an impressive result, even on monophonic music. A more comprehensive list of existing programs can be found at: <http://www.s-line.de/homepages/gerd_castan/compmus/audio2midi_e.html>

The two monophonic transcribers perform quite well. Digital Ear was difficult to test because the demo version was very restricted. From what could be tested, it seemed a bit less powerful than both the project transcriber and the AudioWorks transcriber. The AudioWorks transcriber performed nearly perfectly with a minimum of parameter setting and excellent speed. Testing the program with the same wav files as in the project showed a performance similar to what was obtained in this project, with fewer parameters. A small bug in the program caused the instrument in the MIDI file always to be a grand piano.
A closer look at the challenges

Common features of instruments

Apart from percussion instruments, the creation of sound in most instruments is based on generating standing waves on strings or in hollow tubes of some geometry. Since the length of the string or tube is normally fixed for a given note, the possible wavelengths are constrained to fulfil the wave equation for the given length, geometry and speed of sound:

$$ \frac{\partial^2 u(x,t)}{\partial x^2} - \frac{1}{v^2} \frac{\partial^2 u(x,t)}{\partial t^2} = 0 \qquad (1) $$

Usually strings are fixed at both ends, while tubes can be open at one or both ends, with either cylindrical or conical inner geometry. These constraints allow only certain modes to exist in the resonator, and for strings and open cylindrical tubes of length L the possible solutions are of the form:

$$ u_k(t) = A_k \sin\left( 2\pi \frac{k v}{2L} t \right), \quad k \in \mathbb{N} \qquad (2) $$

or a sum of cosines if the cylindrical tube is half open. In the conical case, the solution consists of spherical harmonics. A more elaborate explanation can be found at the website [WolfeA].

This gives rise to a harmonic spectrum, with all overtones being integer multiples of the fundamental frequency. Figure 1 shows a typical frequency spectrum, with the fundamental frequency at 392 Hz and the overtones at 2 to 12 times the fundamental frequency. Another typical feature is that most of the energy is located in the lower harmonics, so the higher overtones are weaker. In piano sounds, for example, more than 90% of the energy is contained in the fundamental [Schro00].

In western musical notation, it is the frequency of the fundamental in the harmonic series, also called the pitch¹, that names the note. The relation between the notes is given in (3), with 440 Hz being the so-called chamber note (A4), and k ranging from -48 to 39 for a standard piano. To convert between frequency, note name and MIDI number, see table 4, appendix A.
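The conversion between frequency, MIDI number and note name can be sketched in a few lines of Python. This is a minimal illustration, not the project's Matlab code: it assumes the equal-tempered relation f = 440 Hz · 2^(k/12) with k counted in semitones from A4, the usual convention that A4 is MIDI note 69 and that C4 is MIDI note 60, and the function names are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
A4_HZ = 440.0   # chamber note, as in the text
A4_MIDI = 69    # usual MIDI number of A4 (assumption, not stated in this section)


def semitones_to_freq(k):
    """Frequency in Hz of the note k semitones above (k > 0) or below (k < 0) A4."""
    return A4_HZ * 2.0 ** (k / 12.0)


def freq_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency in Hz."""
    return int(round(A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)))


def midi_to_name(midi):
    """Note name with octave number, using the C4 = 60 convention."""
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)


# The flute note used as an example in the text.
g4 = freq_to_midi(392.0)
print(g4, midi_to_name(g4))
# The two ends of the piano range, k = -48 and k = 39.
print(semitones_to_freq(-48), semitones_to_freq(39))
```

Rounding to the nearest MIDI number is what turns a slightly inaccurate pitch estimate into a note decision, so the same function also shows how much estimation error a transcriber can tolerate before it picks a neighbouring note.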
[Figure 1: A G4 note from a flute. Magnitude [dB] versus frequency [Hz], 0 to 5000 Hz.]

$$ f_k = 440\,\mathrm{Hz} \cdot 2^{k/12}, \quad k \in \mathbb{Z} \qquad (3) $$

¹ The term pitch is not well defined, and at least three different meanings exist: a) Production pitch: the rate at which the excitation device opens and closes (e.g. glottal closure rate in human speech). b) Mathematical pitch: the fundamental of the harmonic series. c) Perception pitch: the actual frequency perceived by the listener. In this project, definition b) is used, since it is directly connected to the notes we are searching for.

Problems encountered when estimating pitch

It is clear that reliable pitch estimation is essential for successful transcription, and the relative frequency error must not exceed 2.5% to avoid picking the note a semitone away from the real note (half a semitone corresponds to a frequency ratio of 2^(1/24), i.e. just under 3%). Further, we want to be able to resolve several instruments playing simultaneously, so we must find as many harmonics as possible, where the harmonics can be closely spaced or even overlapping. Instruments without pitch that have noise-like spectra (percussion/drums) can obviously not be identified by tracking the pitch, and are not treated in this project either.

Another problem is irregularities in the harmonic spectrum. A classic example is the clarinet, where only the odd harmonics are present because of the constraints that the cylindrical half-open resonator imposes on the wave equation. Guitars can sometimes play with the first harmonic absent, or at least very weak. This leads to the phenomenon of virtual pitch, where the human brain deduces the missing fundamental from the distance between the harmonics [Terhardt]. (It is this concept that is used in some headphones to obtain a lower frequency response than what is possible with the given membrane radius.) Another problem related to string instruments is inharmonicity in the spectrum.
This is due to the fact that the string is not infinitely thin but has a certain mass, and when this mass is unevenly distributed some of the harmonics can be displaced from their 'correct' positions [WolfeB].

Even if these exceptions cause problems for the transcription process, they can be countered fairly easily, and the most serious issue is still estimating the pitch of multiple instruments whose harmonic series overlap. This might seem like a special case which only happens from time to time, but in fact it is the rule rather than the exception. The reason for composing music in this fashion is rooted in the way the brain experiences music: entities that are harmonically related are put together in a bigger entity which is experienced as a whole, that is, one cannot distinguish the different notes. This helps reduce the complexity of the listening process and makes it easier to listen to the music. The more the elements in the music are in harmonic relation, the more the human brain groups the elements together, making the music more pleasing to listen to. So what makes listening easier makes the transcription process harder.

Many different approaches exist for estimating the pitch, and many seem to be adapted from methods originally developed for speech processing. Still, most of these methods are made for monophonic music and will perform poorly when applied to polyphonic music. The most widely used is the spectrogram, based on the Short-Time Fourier Transform (STFT). Other examples are the correlogram, filter banks, the Constant Q Transform, the cepstrum, auto-correlation, cross-correlation, zero-crossing detection, the wavelet transform, sinusoidal modelling, AR or ARMA methods, and Prony modelling. Some efforts have been made to improve multipitch estimation by subtracting the spectrum associated with an estimated note and then continuing to search for notes [Klapuri01B].
Some of the methods mentioned above are discussed further down; others can be found in [Klapuri98], [Schro00], [Fitch00]. Even though it is not directly applicable to music, the tutorial paper on signal modelling in speech recognition by Picone is worth reading to get an overview of the techniques available within speech processing [Picone93].

Other problems related to transcription

It is a known fact that it is the attack portion of the note (the transient state before the steady-state harmonics take over) that dictates the way we experience the timbre, and it is probably this region that can provide the information needed to identify the instrument. It is therefore important to determine the location of the onsets and offsets of the notes. This will also help to avoid placing a pitch analysis frame over two different notes, and thereby give more accurate estimates. More importantly, when the model orders rise, segmentation enables us to adjust the analysis windows to cover the whole note, giving us as many data points as possible. Many schemes for segmenting music exist, see [Jehan97], [Klapuri98].

Successfully identifying the instruments would enable us to apply instrument models to the pitch estimation process. This could help in detecting notes with overlapping harmonics, since we would to some extent know the expected amplitude signatures. It would also provide automatic selection of instruments in the output file.

Processing cost and storage requirements are not often discussed in connection with automatic transcription. They might not be an issue in laboratory tests or in professional applications that do not need real-time operation, but if used for music search on the Internet, the transcriber would most probably be implemented as a Java applet that is downloaded from a server.
It is clear that as long as not everybody is blessed with a high-speed Internet connection and the fastest processors, only a limited amount of knowledge and calculation can be put into the applet. Thus, a system requiring as few resources as possible while still being able to do the job would be desirable.

MIDI file format

The MIDI file format is relatively simple, and much documentation can be found on the Internet. It is a binary format, which helps to keep the files small. Both single-track and multi-track files are supported. In brief, the file is organised in chunks. Every chunk starts with a four-byte ID field telling what type of chunk it is and a four-byte length field telling the number of bytes of data following this header. All MIDI files begin with the MThd chunk, which contains the file type, the number of tracks and the timing resolution, followed by one or more MTrk chunks containing meta-data (such as tempo) and the actual music. All events in the track chunk, such as note-on and note-off, are equipped with a delta-time field indicating the time (in MIDI clocks) between the event and the preceding event. This delta-time is stored in a variable-length format, so that short delta-times are represented with fewer bytes than a fixed format would need.

Three types of files exist: type 0, 1 and 2. Type 0 contains only one track, which holds all the events. Type 1 contains one or more tracks, where all tracks are played simultaneously. Type 2 contains one or more tracks, but the tracks are sequentially independent. Type 1 was the natural choice for this project, since an extension to polyphony is accomplished simply by adding MTrk chunks.
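The chunk layout and the variable-length delta-time can be illustrated with a short Python sketch that writes a one-track type 1 file. It is an assumption-laden illustration rather than the transcriber's Matlab output routine: the helper names `vlq`, `chunk` and `midi_file`, the resolution of 480 ticks per quarter note, channel 0 and the velocity of 64 are all choices made here. The status bytes (0x90 note-on, 0x80 note-off) and the end-of-track meta event FF 2F 00 follow the Standard MIDI File format.

```python
import struct


def vlq(value):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [value & 0x7F]                     # last byte: top bit clear
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)    # earlier bytes: top bit set
        value >>= 7
    return bytes(reversed(out))


def chunk(chunk_id, data):
    """Four-byte ID, four-byte big-endian length, then the data."""
    return chunk_id + struct.pack(">I", len(data)) + data


def midi_file(notes, ticks_per_quarter=480):
    """Build a type 1 file with one track from (midi_number, ticks) pairs."""
    events = b""
    for number, ticks in notes:
        events += vlq(0) + bytes([0x90, number, 64])     # note-on, no delay
        events += vlq(ticks) + bytes([0x80, number, 0])  # note-off after `ticks`
    events += vlq(0) + b"\xFF\x2F\x00"                   # end-of-track meta event
    header = struct.pack(">HHH", 1, 1, ticks_per_quarter)  # type, tracks, division
    return chunk(b"MThd", header) + chunk(b"MTrk", events)


# One quarter note G4 followed by one quarter note A4.
data = midi_file([(67, 480), (69, 480)])
print(len(data), data[:4], data[14:18])
```

Adding a second instrument would, in this sketch, amount to appending another `chunk(b"MTrk", ...)` and raising the track count in the header, which is the property that made type 1 attractive for a later polyphonic extension.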
A good overview of the event codes in the specification can be found at <http://crystal.apana.org.au/ghansper/midi_introduction/midi_file_format.html>, while a more textual introduction can be found at <http://jedi.ks.uiuc.edu/~johns/links/music/midifile.html>

Pitch estimators

Cross-correlation between signal and sinusoids

Since we assume that the signal consists of a sinusoid with several harmonics (also sinusoids), we could find these sinusoids by taking the cross-correlation between the signal block and test sinusoids with frequencies given by (3). The frequencies found are those which yield the highest cross-correlation values. These test sinusoids could be synthetic ones or samples of real instruments. This method will of course be somewhat costly in terms of computation, since every possible note must be tested. Further, we cannot reveal notes where all harmonics are masked by a lower note. To be able to do that, we must apply instrument models and compare the correlation value given by (4) with the expected value from the model used; a deviation from the expected value could indicate a hidden note.

$$ c_{xy}(k) = \sum_{l=0}^{N} x(l)\, y(l+k), \quad k = 0, 1, \ldots, k_{max} \qquad (4) $$

Fig. 2 shows an example for the G4 at 392 Hz (the same as in Fig. 1), where the cross-correlation between the signal and 8 sinusoids has been calculated.

[Figure 2: Cross-correlation between a G4 from a flute and 8 sinusoids. c_xy versus sample lag (0 to 30); labelled curves: 392 Hz, 784 Hz, 1176 Hz, 1568 Hz.]

The calculation is rather expensive, since we have to do N multiplications for every possible note (88 on a standard piano, 128 in the full MIDI range), and the lower the note is, the bigger k_max becomes to account for the longer period of the sinusoid.

Filter banks

Some of the earliest attempts to estimate pitch were made with filter banks. One simply separates the spectrum by filtering the signal with several bandpass filters, and then measures the energy present at the output of the different filters.
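As a rough illustration of the filter-bank idea, the Python sketch below runs a short synthetic tone through one bandpass filter per semitone and reports the band with the most output energy. The filter design is an assumption made here (the text does not say which filters the early systems used): each filter is an FIR filter whose impulse response is a Hann-windowed cosine at the note frequency. The sample rate, filter length and function names are likewise hypothetical.

```python
import math


def note_freq(midi):
    """Equal-tempered frequency, assuming A4 = MIDI 69 = 440 Hz."""
    return 440.0 * 2 ** ((midi - 69) / 12)


def band_energy(x, fs, centre_hz, length):
    """Output energy of an FIR bandpass filter (Hann-windowed cosine)."""
    h = [0.5 * (1 - math.cos(2 * math.pi * (i + 0.5) / length))
         * math.cos(2 * math.pi * centre_hz * i / fs)
         for i in range(length)]
    energy = 0.0
    # Only use outputs where the filter fully overlaps the signal.
    for n in range(length - 1, len(x)):
        y = sum(h[i] * x[n - i] for i in range(length))
        energy += y * y
    return energy


def filter_bank(x, fs, midi_numbers, length=200):
    """Energy per semitone band, keyed by MIDI note number."""
    return {m: band_energy(x, fs, note_freq(m), length) for m in midi_numbers}


# Synthetic test tone at 392 Hz, sampled at 4 kHz.
fs = 4000
x = [math.sin(2 * math.pi * 392.0 * n / fs) for n in range(600)]
bank = filter_bank(x, fs, range(60, 72))
strongest = max(bank, key=bank.get)
print(strongest)
```

With a 200-tap filter at 4 kHz the main lobe of each band is wide enough that the neighbouring semitone bands also respond, only more weakly. This illustrates the resolution limit of short filters: separating neighbouring notes more sharply requires longer filters and hence poorer time resolution.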
A better and more promising approach is to use a dyadic filter bank in conjunction with wavelets, see [Karaj99] and [Fitch00].

Fourier-based methods

As mentioned earlier, the spectrogram has been the most popular way to obtain a pitch estimate. To find the frequencies of the sinusoids, one simply picks the peaks in the spectrum. The spectrogram is based on the Short-Time Fourier Transform (5), an extension of the normal Fourier transform in which a small analysis window is moved along the time axis and successive transforms are taken.

$$ S_x(u,f) = \int_{-\infty}^{\infty} x(t)\, w(t-u)\, e^{-i 2\pi f t}\, dt \qquad (5) $$

The spectrogram gives the energy density, similar to the periodogram, and is given by (6):

$$ P_S x(u,f) = \left| S_x(u,f) \right|^2 \qquad (6) $$

Two problems are associated with the spectrogram. The first problem is the windowing function. Windowing introduces side lobes which can mask weaker signals, and trying to reduce the side lobes eventually leads to a wider main lobe. Further, the resolution in time and frequency is limited by Heisenberg's uncertainty theorem, which states that the area of the rectangle defined by the spreads of the windowing function in time and (angular) frequency has a minimum given by (7):

$$ \sigma_t \sigma_f \geq \frac{1}{2} \qquad (7) $$

This windowing of the data clearly limits the obtainable frequency resolution, and we are given the choice between high frequency resolution and high time resolution, but not both. Additionally, there is a minimal time resolution to respect if we want to avoid mixing several notes in the same analysis frame, and to know where a note actually starts and stops. For example, a musical piece played at 200 BPM (beats per minute, or quarter notes per minute) and sampled at 44.1 kHz gives us 60 s / (200 · 4) = 75 ms per sixteenth note, which in turn gives us 44100 · 75 ms ≈ 3308 samples per sixteenth note. A lower sample rate means fewer data points, and if we want to capture ornamental tones², we will have even fewer sample points.
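The arithmetic above, and what it implies for frequency resolution, can be sketched as follows. The code is only a rule-of-thumb illustration under stated assumptions: the bin spacing of a frame of T seconds is taken as 1/T (a rectangular window; real windows resolve somewhat worse), A4 is assumed to be MIDI note 69, and the function names are hypothetical.

```python
A4_HZ, A4_MIDI = 440.0, 69  # A4 = MIDI note 69 is the usual convention (assumed)


def frame_budget(bpm, fs, notes_per_beat=4):
    """Duration [s], length [samples] and bin spacing [Hz] of the shortest note."""
    seconds = 60.0 / (bpm * notes_per_beat)
    samples = seconds * fs
    bin_spacing_hz = 1.0 / seconds       # rectangular-window rule of thumb
    return seconds, samples, bin_spacing_hz


def semitone_spacing_hz(midi):
    """Distance in Hz from a note to the next semitone above it."""
    f = A4_HZ * 2 ** ((midi - A4_MIDI) / 12)
    return f * (2 ** (1 / 12) - 1)


def lowest_separable_midi(bin_spacing_hz):
    """Lowest MIDI note whose semitone spacing is at least one bin wide."""
    for midi in range(128):
        if semitone_spacing_hz(midi) >= bin_spacing_hz:
            return midi
    return None


seconds, samples, spacing = frame_budget(bpm=200, fs=44100)
print(seconds, samples, spacing, lowest_separable_midi(spacing))
```

For the 200 BPM example, a frame the length of one sixteenth note has a bin spacing of roughly 13.3 Hz, which under this rule of thumb is wider than the semitone spacing for every note below MIDI number 58 (A#3). This is the time/frequency trade-off that motivates looking beyond the plain Fourier transform.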
Then to the second problem, namely the linearly spaced frequency bins of the Fourier transform. Musical tones are ordered logarithmically, similar to the sensitivity of human hearing, so the distance between two neighbouring notes is greater at the higher end of the musical scale than at the lower end. That allows for higher time resolution by lowering the frequency resolution for the higher end of the spectrum. The Constant Q Transform obtains such a constant ratio between centre frequency and frequency resolution for each frequency partition by combining several bins of an FFT. This, however, fails to give better time resolution for the higher frequencies, so little is gained.

² Ornamental notes are very short notes not written on the music sheet, but added by the musician.

Autoregressive methods

Since the windowing function is a problem for the resolution, we want to avoid windowing our data. Parametric methods like AR models and eigenvector methods make this possible. We assume our signal to be composed of a number of sinusoids in white noise, which enables us to make use of an eigendecomposition of the estimated auto-correlation matrix. This is possible since the matrix is composed of a signal auto-correlation matrix and a noise auto-correlation matrix, where the eigenvalues associated with the noise are generally small compared to those of the signal. Since the Maximum Likelihood Estimator (MLE) cannot be found analytically for more than one sinusoid [Kay88], we must use a sub-optimal method. The most promising estimator that is not too computationally expensive is the Principal Component AR frequency estimator.

$$\hat{a} = -R_{xx}^{-1}\, r_{xx} \tag{8}$$

Eq. (8) is the standard AR parameter estimation, with the auto-correlation matrix being positive definite Hermitian. This allows for an eigendecomposition with real, positive eigenvalues and orthonormal eigenvectors.
These eigenvectors span the entire auto-correlation space, so we can write the auto-correlation vector as a linear combination of the eigenvectors, r_xx = Σ α_j v_j, which inserted into (8) gives:

$$\hat{a} = -\left( \sum_{i=1}^{M} \frac{1}{\lambda_i}\, v_i v_i^H \right) \left( \sum_{j=1}^{M} \alpha_j v_j \right) \tag{9}$$

Simplifying (the eigenvectors are orthonormal) and discarding the (M-p) smallest eigenvalues, we obtain:

$$\hat{a}_{PC} = -\sum_{i=1}^{p} \frac{\alpha_i}{\lambda_i}\, v_i \tag{10}$$

Finally, the frequency estimates are obtained by picking the peaks of the spectral estimator in (11), which is done by finding the roots of the polynomial formed by the coefficients in (10) and taking the angles of these poles.

$$\hat{P}_{PC}(f) = \frac{\hat{\sigma}^2_{PC}}{\left| 1 + \sum_{k=1}^{p} \hat{a}_{PC}(k)\, e^{-j 2\pi f k} \right|^2} \tag{11}$$

In practice, we can estimate the AR coefficients with the Modified Covariance Method (MODCOV), pick the p poles closest to the unit circle, and again obtain the frequency estimates from the angles of these poles. The modified covariance method has proven to be insensitive to the initial phases of the sinusoids, and the spectral peak shifting due to noise is low. In fact, in the absence of noise the true frequencies are found. Tests show that the Principal Component Method reaches the Cramer-Rao bound for SNR higher than 10 dB. Since the MODCOV method is least-squares optimised without constraints, the poles can end up outside the unit circle, leading to unstable models. This is not a problem since we are only interested in the angles of the poles. Further, the AR spectrum can be calculated from (11).

Model order selection

The choice of model order is crucial for making the AR methods perform acceptably, but unfortunately no perfect method exists. The problem gets even harder as the number of available sample points is reduced. As soon as the possible order is greater than 0.1 times the number of available datapoints, we are dealing with a finite sample; the criteria for larger samples are then no longer valid and need corrections. Most of the methods are based on the maximum likelihood estimator of some parameter with some bias/variance correcting factor.
Estimation of the Prediction Error Power

Most of the existing methods try to fit several model structures and orders to the data, and the order is chosen by minimising a cost function. This cost function is usually based on the Maximum Likelihood Estimate of the prediction error power (PEP) of the model, but since the PEP on average decreases with increasing model order, we have to add a certain penalty to the cost function. The reason is that a higher order leads to higher variance in the prediction coefficient estimates, and thereby higher variance in the PEP. The information-based criteria require two passes over the data: one for calculating the cost function and another for choosing the right order. The PEP is given by:

$$\hat{\sigma}^2_{pep} = E\left\{ \left| x(n) - \hat{x}(n) \right|^2 \right\} = r_{xx}(0) + \sum_{k=1}^{p} a(k)\, r_{xx}(k) \tag{12}$$

The a's in the PEP are calculated using the method actually employed (YW, ModCov (also called FB or Forward-Backward) etc.), and PEP_FB indicates that the modified covariance method is used.

The best known criterion (for AR models) is Akaike's Information Criterion (AIC):

$$\mathrm{AIC}(k) = \arg\min_{k=0,1,\dots,p_{max}} \left[ \ln \hat{\sigma}^2_{pep}(k) + \frac{2k}{N} \right] \tag{13}$$

This criterion is, however, not consistent and tends to overestimate the model order. The minimum description length (MDL) criterion fixes this problem by having a higher penalty factor, and is asymptotically consistent:

$$\mathrm{MDL}(k) = \arg\min_{k=0,1,\dots,p_{max}} \left[ \ln \hat{\sigma}^2_{pep}(k) + \frac{k \ln N}{N} \right] \tag{14}$$

These two criteria impose equal penalty on all unknown parameters (amplitudes, phases, frequencies etc.). Better performance can be obtained with the maximum a posteriori (MAP) criterion, where different penalties can be attributed to the different unknown parameters. For example, for AR models the MAP is equal to the MDL, and for sinusoids with unknown frequencies plus white noise the penalty is 5k/N ln(N). For a development of the MAP, see [Djuric96] and [Djuric98].
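To make the PEP-based criteria concrete, here is a minimal Python sketch, not the Matlab code used in this project. It assumes the Yule-Walker variant: the PEP for every order is obtained from the Levinson-Durbin recursion on a biased auto-correlation estimate, and the order is chosen by minimising ln(PEP) plus a penalty, as in (13) and (14). The function names and the test signal (a 440 Hz sinusoid in noise of variance 0.16, 276 points at 11025 Hz) are illustrative choices.

```python
import math
import random


def autocorr(x, max_lag):
    """Biased auto-correlation estimate r[0..max_lag]."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_pep(r, p_max):
    """Levinson-Durbin recursion; returns the PEP for orders 0..p_max."""
    a = [1.0]            # prediction error filter, a[0] = 1
    pep = [r[0]]         # order 0: the PEP is the signal power
    for k in range(1, p_max + 1):
        acc = sum(a[j] * r[k - j] for j in range(k))
        refl = -acc / pep[-1]                      # reflection coefficient
        a = [1.0] + [a[j] + refl * a[k - j] for j in range(1, k)] + [refl]
        pep.append(pep[-1] * (1.0 - refl * refl))
    return pep


def select_order(pep, n, penalty):
    """Order minimising ln(pep[k]) + penalty * k / n."""
    costs = [math.log(pep[k]) + penalty * k / n for k in range(len(pep))]
    return min(range(len(costs)), key=costs.__getitem__)


# Illustrative test signal: one sinusoid in white noise (fixed seed).
rng = random.Random(0)
fs, n = 11025.0, 276
x = [math.sin(2 * math.pi * 440.0 * t / fs) + rng.gauss(0.0, 0.4)
     for t in range(n)]

pep = levinson_pep(autocorr(x, 30), 30)
aic_order = select_order(pep, n, 2.0)            # penalty 2k/N as in (13)
mdl_order = select_order(pep, n, math.log(n))    # penalty k ln(N)/N as in (14)
```

Because the MDL penalty per parameter is larger than the AIC penalty as soon as N exceeds about 8 datapoints, the MDL never selects a higher order than the AIC on the same PEP curve.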
A problem arises when the number of datapoints is less than ten times the model order, as we are then dealing with finite samples and the criteria given above no longer work. This problem is especially present in small samples, where the criteria can even fail to give a minimum. In larger samples they tend to choose too high orders, as they do not account for the increased variance in the modelling error. The Finite Sample Information Criterion (FSIC) [Broer00] tries to handle this overestimation by changing the penalty factor to better suit the variance of the given AR estimation method and model order.

$$\mathrm{FSIC}(k) = \arg\min_{k=0,1,\dots,p_{max}} \left[ \ln \hat{\sigma}^2_{pep}(k) + \prod_{i=0}^{k} \frac{1+v_i}{1-v_i} - 1 \right] \tag{15}$$

Here the v_i depend on the AR method; for MODCOV they are given by v_i = (N+1.5-1.5i)^-1. The combined information criterion (CIC) uses the maximum of the penalty term in (15) and 3⋅Σv_i, see [Broer00] eq.13.

Order estimation methods based on singular values or noise subspace estimation

Another group of methods for estimating the order consists of those based on determining the singular values of the signal auto-correlation matrix or the covariance matrix. These methods are more specialised, as they are usually based upon the assumption that the signal consists of sinusoids in white noise. This makes it possible to decompose the auto-correlation matrix into a signal matrix and a noise matrix, because the singular values associated with the signal are generally much higher than those of the noise. Additionally, these two subspaces are orthogonal. The drawbacks of these methods based on eigen-calculation are that they need O(p³) operations, and that the true order is not always the best order (at least for autoregressive models, which are AR(∞) when corrupted with white noise). The AIC and MDL criteria based on eigenvalues are given below, and the development can be found in [Wax85].
$$\mathrm{AIC}_{svd}(k) = \arg\min_{k=0,1,\dots,p_{max}} \left[ -2N(p-k) \ln \left\{ \frac{\prod_{l=k+1}^{p} \lambda_l^{1/(p-k)}}{\frac{1}{p-k} \sum_{l=k+1}^{p} \lambda_l} \right\} + 2k(2p-k) \right] \tag{16}$$

$$\mathrm{MDL}_{svd}(k) = \arg\min_{k=0,1,\dots,p_{max}} \left[ -N(p-k) \ln \left\{ \frac{\prod_{l=k+1}^{p} \lambda_l^{1/(p-k)}}{\frac{1}{p-k} \sum_{l=k+1}^{p} \lambda_l} \right\} + \frac{1}{2} k(2p-k) \ln N \right] \tag{17}$$

where k is the estimated order, p is the order of the covariance matrix, N is the number of datapoints used to calculate the covariance matrix, and the λ's are the eigenvalues of the covariance matrix ordered as λ1 > λ2 > .. > λp. The expression in the braces is simply the ratio of the geometric mean to the arithmetic mean of the p-k smallest eigenvalues. Similar to the PEP-based criteria, the AIC tends to overestimate the order, while the MDL has been shown to be consistent. It should be possible to merge the order estimation and the parameter estimation in order to reduce the computational load.

Another possible method is to continuously estimate the noise subspace by means of a QR factorisation of the covariance matrix or even the data matrix itself. The idea is to decompose a matrix X into a square matrix Q with orthonormal columns and an upper-triangular matrix R,

$$XE = QR \tag{18}$$

where E is a permutation matrix that orders the diagonal of R in descending order.

$$R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix} \tag{19}$$

The factorisation in (18) is called a rank-revealing QR factorisation if R22 has a small norm. The dimension of R22 is equal to the rank deficiency of X, or equivalently to the dimension of the noise subspace. The dimension of R11 is then equal to that of the signal subspace. An effective implementation requiring on average O(p²) operations can be found in [Bisc92]. Of course, one can always decompose the signal or covariance matrix, and determine the rank of the noise subspace by finding the smallest singular values. However, this proves to be difficult when the order of the matrix is small.
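As an illustration of the eigenvalue-based criteria, the Python sketch below evaluates the geometric-to-arithmetic mean ratio of the smallest eigenvalues and the two cost functions in the form given by Wax and Kailath. It assumes the eigenvalues have already been computed elsewhere; the function names and the example eigenvalues are hypothetical and only meant to show the mechanics.

```python
import math


def log_gm_am_ratio(tail):
    """ln(geometric mean / arithmetic mean) of a list of positive values."""
    m = len(tail)
    log_gm = sum(math.log(v) for v in tail) / m
    return log_gm - math.log(sum(tail) / m)


def svd_order(eigenvalues, n, criterion="mdl"):
    """Estimate the number of signal components from covariance eigenvalues.

    eigenvalues: positive eigenvalues of the p x p covariance matrix
    n:           number of datapoints used to estimate that matrix
    """
    lam = sorted(eigenvalues, reverse=True)
    p = len(lam)
    costs = []
    for k in range(p):
        # Likelihood term built from the p-k smallest eigenvalues.
        fit = -n * (p - k) * log_gm_am_ratio(lam[k:])
        if criterion == "aic":
            cost = 2.0 * fit + 2.0 * k * (2 * p - k)
        else:
            cost = fit + 0.5 * k * (2 * p - k) * math.log(n)
        costs.append(cost)
    return min(range(p), key=costs.__getitem__)


# Two dominant eigenvalues and a flat noise floor (made-up numbers).
example = [10.0, 8.0, 0.1, 0.1, 0.1, 0.1]
mdl_k = svd_order(example, 100, "mdl")
aic_k = svd_order(example, 100, "aic")
```

When the noise eigenvalues are all equal, the ratio inside the logarithm is one and only the penalty term remains, so both criteria settle on the number of dominant eigenvalues in this example.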
Chapter 2 – Implementation of a music-transcriber

Initial testing on real and synthetic musical signals

Unless otherwise stated, all sound files are sampled at 11025 Hz with 16 bits in one channel. Some single-note sound files were found on the Internet. A series of tones from a piano, plus some single clarinet and saxophone examples, were also found. The most useful was a series of flute tones found at <http://www.phys.unsw.edu.au/music/flute/flute.html> from different flutes, with impedance and spectral measurements available. These sound clips were chosen as the basis for the project. The synthetic signals were created with the script synthsign.m, and the signals are analysed with the script process.m. A description of the different program modules is found in appendix B.

Fourier methods, real instruments

In the beginning some testing was done on single notes from real instruments. This was done to better see the difference between parametric and non-parametric methods, and to get a picture of what instrument spectra look like. Additionally, some modules for file handling and note/frequency decision were built that were also needed for the real transcriber. Synthetic signals were not tested, since it was assumed that as long as the peaks are well separated the correct frequencies are found, and that the resolution is limited by the number of datapoints N in the analysis windows.

A crude periodogram was first implemented, and a simple peak-picking method was used to estimate the frequencies. The peak-picking worked by searching for the maximum value and then deleting a certain number of samples around this maximum. An example from an A4B is shown in Fig.3, with the Matlab output found in Table 2. A problem is that we have to use a threshold and thereby miss some weaker peaks. Instead of a fixed-value threshold, a more adaptive method that estimates the noise floor and uses it as the threshold [Durne98] could be used.
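The peak-picking procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the original Matlab routine: the names pick_peaks and bin_to_hz, the guard width and the threshold value are all hypothetical.

```python
def pick_peaks(spectrum, n_peaks, guard, threshold):
    """Greedy peak picking: take the global maximum, blank its
    neighbourhood (guard bins on each side), and repeat."""
    work = list(spectrum)
    peaks = []
    for _ in range(n_peaks):
        idx = max(range(len(work)), key=work.__getitem__)
        if work[idx] < threshold:
            break                      # everything left is below the threshold
        peaks.append(idx)
        lo = max(0, idx - guard)
        hi = min(len(work), idx + guard + 1)
        for i in range(lo, hi):
            work[i] = float("-inf")    # "delete" the samples around the peak
    return sorted(peaks)


def bin_to_hz(index, sample_rate, n_fft):
    """Centre frequency of an FFT bin."""
    return index * sample_rate / n_fft


spectrum = [0, 1, 5, 1, 0, 0, 2, 9, 2, 0, 0, 0.5, 0]
found = pick_peaks(spectrum, n_peaks=5, guard=1, threshold=1.0)
first_hz = bin_to_hz(41, 11025, 1024)
```

The weakness mentioned in the text is visible here: the result depends on both the guard width and the threshold, and a weak peak (the value 0.5 in the example) is silently dropped. Bin 41 of a 1024-point FFT at 11025 Hz lies at about 441.43 Hz.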
[Figure 3: a4b.wav; magnitude [dB] versus frequency [Hz]]

Peak 1 is at freq 0.000 Hz       Difference from 0 Hz: 0.000 Hz
Peak 2 is at freq 425.954 Hz     Difference from previous peak: 425.954 Hz
Peak 3 is at freq 438.066 Hz     Difference from previous peak: 12.112 Hz
Peak 4 is at freq 864.693 Hz     Difference from previous peak: 426.627 Hz
Peak 5 is at freq 877.142 Hz     Difference from previous peak: 12.449 Hz
Peak 6 is at freq 888.245 Hz     Difference from previous peak: 11.103 Hz
Peak 7 is at freq 1303.432 Hz    Difference from previous peak: 415.187 Hz
Peak 8 is at freq 1314.871 Hz    Difference from previous peak: 11.440 Hz
Peak 9 is at freq 1325.301 Hz    Difference from previous peak: 10.430 Hz
Peak 10 is at freq 1753.274 Hz   Difference from previous peak: 427.972 Hz
Peak 11 is at freq 2200.088 Hz   Difference from previous peak: 446.814 Hz
Table 2

It is clear that the Fourier Transform will work fine for monophonic music, but some smoothing is necessary before the peaks are searched for. A different search routine would be desirable, since one never knows how many points to delete around the peaks. An averaged periodogram with overlapping windows would be a better alternative, as it would reduce both the amplitude and the frequency variance. Matlab provides this as a Welch periodogram. The same sound is tested in Fig.4, and we see a clear improvement over the periodogram: a more precise frequency estimate is obtained, and the spurious peaks present in Table 2 are gone. This file has 18743 samples (1.7 s) and is of course a bit longer than the average note duration.

[Figure 4: A4, Welch, 1024 points FFT, 512 points window, 25% overlap; magnitude [dB] versus frequency [Hz]]

Peak 1 is at freq 441.431 Hz     Difference from 0 Hz: 441.431 Hz
Peak 2 is at freq 872.095 Hz     Difference from previous peak: 430.664 Hz
Peak 3 is at freq 1313.525 Hz    Difference from previous peak: 441.431 Hz
Peak 4 is at freq 1754.956 Hz    Difference from previous peak: 441.431 Hz
Table 3

In music, we can expect the signal to be stationary for about 25 ms, so a window size of that duration is realistic. Fig.5 shows a 28 ms clip (308 points) of the previous A4, and the same frequencies as in Table 3 are found.

[Figure 5: A4, Welch, 1024 points FFT, 512 points window, 25% overlap; magnitude [dB] versus frequency [Hz]]

With such a short window we have a possible resolution of (1/300)*11025 Hz = 36.75 Hz, and a smoothed periodogram makes it even worse. For example, an A1 (55 Hz) would prove difficult to resolve, and overlapping harmonics would be unresolvable.

Yule-Walker, synthetic signals

Even though the Yule-Walker method is reported to perform poorly as a frequency estimator in noisy signals [Kay88], it was implemented, partly because of its computational simplicity and partly because it was the only method available in Matlab 5.2 for estimating the AR coefficients. A 25 ms synthetic noise-free signal corresponding to an A4 with four higher harmonics was created, and the frequencies were estimated by finding the roots of the AR coefficient polynomial. The resulting spectra given by (11) are seen in Fig.6, and the frequencies found are 436.639, 879.691, 1321.294, 1763.093 and 2207.368 Hz. We see that even though the signal is noise-free, the true frequencies are not found. This is due to the 'zeroing' of the auto-correlation values outside the auto-correlation matrix, which smooths and displaces the peaks.
[Figure 6: Welch periodogram of signal, and estimated AR-spectrum; legend: Original, AR(27); x-axis: Frequency [Hz]]

Figure 7 shows the pole-zero plot of the noise-free A4. We see that the poles not modelling the sinusoids are fairly close to the unit circle. This will lead to problems selecting the threshold for which poles to accept. The reason for the relatively high order selected compared to the true order (10) is the same as above: the zeroing of the auto-correlation values. To minimise the modelling error a higher order is necessary. The AIC function (with penalty k set to 1) is shown in figure 8. The same value is obtained for the MDL and FPE, while the eigenvalue-based criteria choose 55, which is far too high. This is probably due to the fact that without noise, the auto-correlation matrix is singular. When the signal is corrupted with noise, the order stays the same, but the poles modelling the noise get even closer to the unit circle.

[Figure 7: Pole-zero plot of the noise-free A4; imaginary part versus real part]

[Figure 8: aic; cost versus model order]

One of the reasons for using AR models instead of Fourier methods was to obtain higher resolution, and thereby to be able to resolve harmonics a semitone apart, or maybe even resolve overlapping harmonics. A 25 ms synthetic signal consisting of a G#4 (415.3 Hz) and an A4 (440 Hz) in white noise with variance 0.16 was created. At a sampling rate of 11025 Hz we have 276 datapoints, which gives a Fourier resolution of (1/276)*11025 = 39.95 Hz. This means that a standard periodogram should not be able to resolve the first harmonics (24.7 Hz apart), while the higher harmonics could be found (>49 Hz apart). Still using the YW method to find the prediction coefficients, we analyse the signal, using a standard AIC with penalty 1 to estimate the order. Fig.9 shows the result, and as predicted we see that the Welch periodogram is unable to resolve the first harmonics. While not visible in this figure, the two peaks are found in the AR model, see fig.10. However, one of the poles is too far from the unit circle and is considered as noise. This limited capability to separate signal and noise probably makes the Yule-Walker method for estimating the poles less attractive, and other methods should be tested.

[Figure 9: Welch periodogram of signal, and estimated AR-spectrum; legend: Original, AR(51); x-axis: Frequency [Hz]]

One could also consider Burg's algorithm, since it has the same computational cost as YW but generally performs better. The problem is the phenomenon of line splitting when the order increases.

On the other hand, one could resolve all of the frequencies in the model, and from those search for possible harmonic series. However, care must be taken to avoid creating non-existent notes, since we see from fig.10 and fig.7 that the poles not associated with the signal are not exactly randomly distributed.

[Figure 10: 415.30Hz + 440Hz, 25ms/276 points; pole plot, imaginary part versus real part]

Modified covariance method, synthetic signals

As reported in [Schro00], Marple's modified covariance method holds promise of better frequency estimates than the traditional Yule-Walker method, and it was claimed that in the absence of noise the true frequencies are found. Matlab 5.2 does not include the algorithm, but 5.3 does, under the name armcov.m.

The same noiseless A4 as used with YW is tested with the ModCov algorithm. We see in fig.11 that the correct model order is found with the AIC when using ModCov to find the PEP. The problem is that some of these poles are displaced more than 0.035 from the unit circle.

[Figure 11: AIC, k=1, PEP from Modcov; cost versus model order]
We obtain 444.87, 899.84, 1332.10, 1764.21 and 2200.45 Hz, which is even worse than YW. This could be due to round-off errors when forming the covariance matrix or inaccuracies when inverting it.

[Figure 12: A4, no noise, ModCov; pole plot, imaginary part versus real part]

To improve the estimates, one could try to increase the order of the model. By doubling the order, we see that the poles are on the unit circle and give the exact frequencies. In fact, just adding two poles, i.e. increasing the order to 12, already gives a better result with a maximum deviation of 0.12 Hz; see fig.13 and fig.14.

[Figure 13: Welch periodogram of signal, and estimated AR-spectrum; legend: Original, AR(12); x-axis: Frequency [Hz]]

[Figure 14: Pole plot for AR(12); imaginary part versus real part]

Again, as with Yule-Walker, the signal with notes a semitone apart is examined (see the Yule-Walker tests above). While using ModCov to estimate the PEP, the MDL, FPE, FSIC and AIC with penalty 2 all give order 21, which is approximately the correct order (20) for the noiseless case, but too low to model all the sinusoids in noise. Using penalty k=1 for the AIC with PEP_FB gives order 76, which successfully finds all the sinusoids with less than 5 Hz error. This overestimation is typical for the AIC. On the other hand, using order 76 gives us five extra sinusoids that are not in harmonic relation to those really existing. This shows that letting the order become too high quickly gives rise to spurious peaks which have to be 'filtered' away by some means. It is clear that this method is not optimal. If we look at fig.9 again, we see that order 51 was chosen for exactly the same signal using AIC with k=1 and PEP_YW.
In fact, this combination seems to work well for selecting the order to use with ModCov, and was most often used when doing the transcription, largely because the calculation using the Levinson algorithm is faster than the ModCov algorithm and thus speeds up the analysis.

[Figure 15: Welch periodogram of signal, and estimated AR-spectrum; legend: Original, AR(76); x-axis: Frequency [Hz]]

Synthetic signals are useful since we have complete knowledge of the signal, but real instruments do not create perfect sinusoids, so it is interesting to test the order estimation on real instrument samples.

Again an A4 was chosen, but this time from a flute. The number of datapoints was 308, which is about 28 ms at a sample rate of 11 kHz, and this should be short enough to catch most of the notes. Fig.17 shows the MDL with k=2 using PEP_FB, and looking at the periodogram in fig.16, we see that the correct order is found again (or more precisely: the correct number of sinusoids compared with the number of peaks). The same holds for CIC and FSIC, while AIC, FPE and the eigenvalue-based methods in (16) and (17) overestimate the order.

[Figure 16: Welch periodogram of signal, and estimated AR-spectrum; legend: Original, AR(38); x-axis: Frequency [Hz]]

[Figure 17: A4 flute, 308 points, PEP_FB, MDL k=2; cost versus model order]

The order chosen in the model is twice the order estimated. This seems to be a good balance between the correct order, which gives too smooth a spectrum, and too high an order, which gives spurious peaks. We see that one extra peak appears, which is tolerable. Experiments showed that model orders of 1.5 to 2 times the number of sinusoids gave good results.
[Figure 18: A4 flute, 308 points, ModCov, Order=38; pole plot, imaginary part versus real part]

Implementation of the transcriber

The structure of the transcriber

The transcriber WavToMid is implemented entirely in Matlab. It takes a wav-file from the subdirectory ".\wav\" as input and gives a midi-file in ".\mid\" as output. The instrument type in the midi-file must be specified manually. For the moment the program first finds all midi notes, and then does the post-processing and writes to a MIDI file. The processing is done block-wise, with each block fixed to 250 samples (about 25 ms). This is done to simplify the program, but it is obvious that better results can be obtained if a proper segmentation is implemented. The size of the block was chosen from the fact that at 200 BPM a sixteenth note is 75 ms long, so a 25 ms window should be able to capture most of the changes in the music. Each block is tested for silence before it is passed to frequency estimation.

After the preliminary testing, it was clear that the Modified Covariance Method was the way to go for the frequency estimation. The frequencies are found by using Matlab's tf2zp.m and calculating the angles of the poles returned. Finding the roots of the polynomials involves an eigenvalue calculation; whether this can be avoided is unknown. Model order selection was most of the time done by AIC_YW with penalty k=1 because of the speed advantage, but using MDL_FB with penalty 2 gives a closer estimate of the number of sinusoids and can in some cases give better results.

After the frequency estimation, harmonic series are searched for in order to find potential fundamental frequencies, which are afterwards converted to the corresponding midi number. At the moment only monophonic music is supported, but an extension to polyphony should be straightforward.
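The last step, going from a set of estimated frequencies to a midi number, can be illustrated with the Python sketch below. It is not the freq2mid.m module of the transcriber: the harmonic-series search shown here (counting how many estimated frequencies lie close to integer multiples of each candidate) and the 3% tolerance are assumptions made for the illustration. The conversion uses the standard relation where midi note 69 is A4 = 440 Hz and each semitone is a factor 2^(1/12).

```python
import math


def freq_to_midi(freq_hz, a4_hz=440.0):
    """Nearest midi note number for a frequency (69 = A4 = 440 Hz)."""
    return int(round(69 + 12 * math.log2(freq_hz / a4_hz)))


def find_fundamental(freqs, tolerance=0.03):
    """Pick the candidate with most frequencies near its integer multiples.

    tolerance is the allowed relative deviation from an exact harmonic
    (a hypothetical value, chosen for this example only).
    """
    best, best_count = None, 0
    for f0 in sorted(freqs):
        count = 0
        for f in freqs:
            ratio = f / f0
            harmonic = round(ratio)
            if harmonic >= 1 and abs(ratio - harmonic) <= tolerance * harmonic:
                count += 1
        if count > best_count:     # strict comparison keeps the lowest candidate on ties
            best, best_count = f0, count
    return best


# Frequencies from Table 3 (Welch periodogram of the flute A4).
table3 = [441.431, 872.095, 1313.525, 1754.956]
f0 = find_fundamental(table3)
note = freq_to_midi(f0)
```

For the Table 3 frequencies the lowest peak explains all four values as harmonics and maps to midi note 69 (A4); a G4 at 392 Hz maps to 67.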
One problem is to decide which note belongs to which instrument.

When the notes are determined, some 'top-down processing' is done. Pauses and notes that are too short are removed. Reverb is removed by trying to detect whether a note continues to sound while another note is present. Finally the midi numbers are converted to binary Midi 1.1 format and written to disk. The relations between the different tempo parameters used in the MIDI specification were not completely understood, so changing the analysis block size is not possible without some manual adjustment of these parameters.

The different modules in the transcriber are shown in fig.19. The shaded boxes are shared with process.m used in the initial testing, while the hatched box is non-essential for the program. It is just an early pitch estimator that calculates the most frequently occurring distance between the frequencies found in the Fourier spectrum, and it is used in the time-frequency plot of the data. The numbers in the boxes indicate the order in which the functions are called. All parameters not set interactively can be found in the main script.

[Figure 19: Module diagram of the transcriber; modules: WavToMid.m, loadfile.m, orderselect.m, ar_cov.m, coeff2freq.m, most_freq.m, freq2mid.m, fixed2var.m, midiwrite.m]

Transcribing real music

Since flute samples were used throughout the preliminary testing, a short flute solo (flute5.wav) was chosen as the reference clip for transcription. It is recorded at 11025 Hz with 8 bits resolution, so the quality is average, with not too many partials and a small amount of reverb present. The tempo is modest, with the shortest notes being around 260 ms. To test the ability to transcribe quicker passages, a few seconds of Bach's "Badinerie" were used (flute00.wav). Here the shortest note is about 100 ms, and there is almost no reverb present. Another quick passage with a lot of reverb was also tested (bachfast11.wav). This is Bach's "Partita in A Minor".
A number of other small clips with different instruments were also tested to see how sensitive the transcriber is to the instrument used.

Flute5.wav – a simple flute solo

[Figure 20: Spectrogram of flute5.wav]

Looking at the spectrogram in fig.20 we could expect to find three to five harmonics, and a visual comparison between this spectrogram and the time-frequency plot of the music clip was used as a benchmark when testing the order selection criteria. Additionally, when a MIDI file was made, the wav-file and the midi-file were compared by listening. In all of the following time-frequency plots, the blue circles indicate the frequencies found from the angles of the poles, while the red crosses are estimates of the pitch calculated with most_freq.m and are not used in the written MIDI file.

Testing the different order selection criteria showed the same as the one-note testing: when we use PEP_YW and directly use the order found, we have to use AIC with k=1 to avoid underestimation. In fact, this seems to be the most practical setting, as it performs very well for different types of instruments and tempos. In fig.21 we see the time-frequency plot with this setting; it is not too far from the real spectrum, and the conversion to MIDI format is perfect. We see, however, that there are some spurious poles where the frequencies are changing. This effect could be reduced if the analysis windows were dynamically adjusted.

[Figure 21: flute5.wav, AIC k=1, PEP_YW, order = 1 x estimated; frequency [Hz] versus block number]

A simpler approach is to 'smooth' the order selection curve by taking the average of the estimated order and some of the last orders found. In fig.22 we see the result of taking the average with the two preceding orders, and the non-continuous areas are better estimated.

[Figure 22: flute5.wav, AIC k=1, PEP_YW, order = 1 x estimated, smoothed; frequency [Hz] versus block number]
However, this solution is probably not a good choice for polyphonic music, since the orders will change more rapidly. Smoothing will then cause instruments not to be detected at once because the order is too low. In this project the method seems to work fine, and it was used most of the time.

Looking at the spectrum in fig.1 again, we see that the higher harmonics get weaker and weaker, and are often modelled too far from the unit circle to be chosen. This phenomenon is also present in speech, and sometimes a 'pre-flattening' filter that emphasises the higher frequencies is used [Picone93]. This filter is most often a one-tap FIR filter of the form H(z) = 1 + a z^-1 with a in [-1, -0.4]. The problem with this filter is that the noise is boosted as well. In fig.23 we see just a minor improvement using a = -0.85. The gains might be higher when dealing with string-based instruments, where the higher harmonics tend to die out quickly.

[Figure 23: flute5.wav, AIC k=1, PEP_YW, order = 1 x estimated/smoothed, flattened; frequency [Hz] versus block number]

Using the method that seemed successful in the one-note case above (with four times the number of sinusoids) misses many of the higher harmonics. That is simply because the signal-to-noise ratio (SNR) is worse in this case, and the number of poles has to be increased further in order to cope with the noise. The SNR will without doubt be important when adding even more instruments, and thereby increasing the order. Using a higher sampling rate (and possibly lowpass filtering to obtain the same signal bandwidth as before) will probably help, since we will then have more datapoints on which to base our modelling.

[Figure 24: flute5.wav, MDL k=2, PEP_FB, order = 2 x estimated; frequency [Hz] versus block number]

This need to adjust the order according to the noise level is not good for our goal of creating an automatic transcriber.
Perhaps an estimate of the SNR could be calculated, and from that a multiplication factor could be selected that, applied to the number of sinusoids, gives the best order to use. This might require PEP_FB in order to have a precise estimate of the number of sinusoids. If we knew the average number of sinusoids and the SNR, we could of course use a fixed order with some success. Eventually, the order could be estimated for bigger blocks, speeding up the analysis.

[Figure 25: Fixed order, AR(32); frequency [Hz] versus block number]

Figure 26 shows the output of a modified version of the transcriber, using the correlation-based pitch tracker described in the section on pitch estimators. Even though the calculations are optimised to use test sinusoids that are as short as possible, the analysis is rather lengthy, and using segmentation to reduce the number of blocks would help here as well. The resulting MIDI file was perfect.

This method will not be able to distinguish harmonics in the same frequency band. The only hope is to apply signal models in order to compare the correlation values found with those expected, assuming overlapping harmonics where the value found is higher than expected.

[Figure 26: Output of the correlation-based pitch tracker; frequency [Hz] versus block number]

Flute00.wav – A quicker flute solo

This clip is quicker than the previous one, with no reverb and more harmonics. However, it proved to be a bit difficult for the transcriber. We see that the time-frequency plot is not too far from the spectrogram, and the conversion to MIDI format is not too bad either. The problems that arise when the tempo is increased appear to be related more to the lack of segmentation and to the post-processing of the notes found. Both PEP_YW with AIC (k=1) and PEP_FB with MDL or CIC work fine.
Figure 27: flute00.wav − AIC k=1, PEPYW, order = 1 x estimated (axes: Blocknr, Hz)

Figure 28

Some different sound files

Fig.29 shows an example file from AudioWorks, this time with a clarinet. Many more harmonics are present, and the chosen order is between 21 and 67. In such a file a fixed order would not work well. A spectrogram of this clip shows weak even harmonics, which is typical for a clarinet. The transcription is on par with the result from AudioWorks' own transcriber.

Using PEPFB in this case is painfully slow, since the maximum allowable order must be set to 70. It is clear that an analysis window matched to each note must be used, since that would reduce the number of order estimations from about 470 to 35 in this example.

Figure 29: clarinet example.wav − AIC k=1, PEPYW, order = 1 x estim/smoo (axes: Blocknr, Hz)

The clip bachfast.wav is another flute solo with a lot of reverb. Energy from up to four notes can be observed simultaneously. This leads to high model orders and puts higher demands on the post-processing to eliminate the echo. Both the spectrogram and the time-frequency plot are cluttered, but the result of the transcription is not too bad and is still on par with the AudioWorks transcriber. Such heavy reverb is a problem for the autoregressive methods, since the number of sinusoids explodes. If such files are expected to be converted successfully, some sort of echo cancellation should be employed.

Figure 30: bachfast.wav − CIC k=3, PEPYW, order = 1.5 x estimated (axes: Blocknr, Hz)

Oboe.wav, being of modest tempo and order, was converted easily.

Figure 31

Some final words

Limitations of the transcriber

All of the music clips tested share some common characteristics:
1. They are all created by tube resonators, which means that all harmonics 'live' equally long.
2. Their minimum frequency is not too low, which means a limited number of harmonics.

The reason for omitting piano and guitar music is twofold. Most importantly, these instruments are seldom monophonic. Another aspect is their lower (possible) fundamental frequencies. Looking at fig.32, which shows an A0 from a grand piano, we see that there are a lot of harmonics and that the higher harmonics die out quickly.

Figure 32

The dying harmonics are not a problem, since the transcriber needs only one harmonic to decide a note. The high number of harmonics is worse, especially if we are dealing with polyphony. It makes it harder to find the best order, and the frequency estimates will be less accurate. To remedy this, we have to increase the number of datapoints by increasing the sampling rate and/or by segmenting the music to form bigger analysis blocks.

The program uses a fixed key-press velocity (loudness) for all notes. This is seldom the case in real music.

The durations of the notes are not rigorously respected, since the analysis is done with fixed blocks. Additionally, the relations between the timing/tempo parameters in the MIDI specification were not completely understood, so some files are converted incorrectly with the standard setup. Some manual adjustment of the parameters fixes the problems.

The speed is also a problem. Reducing the number of calculations by reducing the number of analysis blocks is desirable. In other words, segmentation is needed. Of course, critical parts could also be coded in C.

Different noise levels are not accounted for. Since more noise implies higher AR model orders, some automatic adjustment of the model order according to the noise level should be employed if less manual adjustment is desired.

Improvement for the transcriber and ideas for future work

No further work is planned in this project, at least not on a professional basis. However, some suggestions for further work are given.
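One limitation noted above, the MIDI timing parameters, can be clarified with a short worked example. In a standard MIDI file the header gives the number of ticks per quarter note, and the Set Tempo meta event gives the number of microseconds per quarter note (500000, i.e. 120 beats per minute, when no tempo event is present). A duration in seconds therefore maps to ticks as seconds * 1e6 * ticks_per_quarter / tempo. The sketch below is illustrative only: the function names and the block hop of 512 samples at 11025 Hz are assumptions, not values taken from the transcriber.

```python
def seconds_to_ticks(seconds, ticks_per_quarter=480, tempo_us_per_quarter=500000):
    """Convert a duration in seconds to MIDI ticks.

    ticks_per_quarter is the header division (ticks per quarter note) and
    tempo_us_per_quarter is the Set Tempo value (microseconds per quarter note).
    """
    return round(seconds * 1_000_000 * ticks_per_quarter / tempo_us_per_quarter)


def blocks_to_ticks(n_blocks, hop_samples, fs, **timing):
    """Duration of n_blocks analysis blocks, expressed in MIDI ticks."""
    return seconds_to_ticks(n_blocks * hop_samples / fs, **timing)


if __name__ == "__main__":
    # Assumed analysis setup: hop of 512 samples at 11025 Hz.
    print(blocks_to_ticks(10, 512, 11025.0))
```

If the tempo written to the file and the tempo used in this conversion differ, all note durations are scaled by the same factor.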
Without doubt, the most important modification of the transcriber is to add segmentation, allowing the analysis window to cover the whole note. This has several advantages:
1. More datapoints are available in the analysis window, leading to better frequency and order estimates.
2. Mixing two consecutive notes in the same analysis window is avoided.
3. The note duration will be respected in the MIDI representation.
4. The number of calculations can be reduced, since fewer order estimations are needed.
5. The attack of the note can be analysed to identify the instrument.
6. The relative loudness of each note can easily be determined.

Further improvements would be to implement point 6 above, so that each note's loudness is respected, and to make a GUI for easier testing of the important parameters.

Some ideas that require a bit more research before integration into the transcriber are:

AR modelling in sub-bands. If D. Bonacci's research is successful, we would be able to do AR modelling in sub-bands. This would enable us to use many low-order models in place of one high-order model, possibly making the estimates more reliable (and faster to compute).

Adaptive sequential algorithms. Many adaptive algorithms for spectrum estimation exist that update the spectrum for every arriving data point and require only O(9m) multiplications instead of the usual O(p^2) multiplications [Kalou87]. These algorithms could be used for frequency estimation in real time, possibly adding the benefits of detecting note changes and avoiding order estimations.

Directionality. A stereo signal usually contains information about spatial placement. A method such as MUSIC or Capon's method could use this information to suppress all but one instrument, thus improving polyphonic transcription. Frequency estimation could be done simultaneously.

Phase information. No work on the phase information in a music signal was found, not even work ruling out the possibility.
The idea is that if there is any relation between the phases of the harmonics created in an instrument, one could detect whether a harmonic is mixed with a harmonic from another instrument (having a different phase).

Conclusion

It has been demonstrated that AR frequency estimation is a plausible solution when transcribing music. The AR methods are able to provide higher resolution than their Fourier counterparts, which is important when the number of simultaneous notes increases.

The frequency estimates from Principal Component Frequency Estimation using the Modified Covariance Method are reliable even in the presence of noise, and even if some of the harmonics are completely masked. Fig.33 shows two flutes played simultaneously, where the C5 is completely masked. Having 1000 datapoints and using MDL with PEPFB gives almost the correct number of sinusoids, and using an order of 4 times the number of sinusoids actually makes it possible to find the hidden C5. This is a promising result when considering polyphonic transcription.

Figure 33: C4+C5, 91ms, MDL k=2, PEPFB, 2x (pole plot; axes: Real Part, Imaginary Part)

Order estimation is crucial for optimal performance. The best order is not equal to 2*(number of real sinusoids), but somewhat higher, depending on the amount of noise. Using the PEPFB together with the information criteria MDL or CIC seems to give the best estimate of the number of sinusoids. Using an estimate of the noise level to find a multiplier for this order, and thereby the best model order, then seems to be a way to cope with different SNRs. However, calculating all the orders up to the maximum allowable order for every block using the modified covariance method is slow. A faster and 'less correct', but well-performing, method is used in the project, namely taking the order found from AIC k=1 with PEPYW. This works remarkably well, even for different types of instruments.
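Reading a pole plot such as fig.33 as a set of frequencies can be sketched as follows. A pole at angle θ (radians) corresponds to the frequency θ * fs / (2π), and only poles close to the unit circle are kept as sinusoids. This is a minimal sketch under stated assumptions: the function name, the radius threshold of 0.97 and the sampling rate of 11025 Hz are hypothetical and are not values taken from the transcriber.

```python
import cmath
import math


def poles_to_frequencies(poles, fs, min_radius=0.97):
    """Frequencies (Hz) of upper-half-plane poles with |z| >= min_radius."""
    freqs = []
    for z in poles:
        # Poles of a real AR model come in conjugate pairs; one of each pair suffices.
        if z.imag > 0 and abs(z) >= min_radius:
            freqs.append(cmath.phase(z) * fs / (2.0 * math.pi))
    return sorted(freqs)


if __name__ == "__main__":
    fs = 11025.0  # assumed sampling rate, for illustration only

    def pole(radius, f):
        return cmath.rect(radius, 2.0 * math.pi * f / fs)

    # A strong C4 close to the unit circle and a weak C5 further inside.
    poles = [pole(0.99, 261.626), pole(0.99, 261.626).conjugate(),
             pole(0.80, 523.251), pole(0.80, 523.251).conjugate()]
    print(poles_to_frequencies(poles, fs))                  # strict threshold
    print(poles_to_frequencies(poles, fs, min_radius=0.7))  # relaxed threshold
```

A weak or masked harmonic that is modelled further from the unit circle is lost with a strict threshold, while a relaxed threshold also admits noise poles. This is the trade-off behind raising the model order rather than lowering the threshold.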
A monophonic transcriber has been built, taking a wav-file as input and giving a MIDI-file as output. The program performs on par with the commercially available monophonic transcribers. The program has been built with a possible extension to polyphonic operation in mind.

A – Converting between Note, MIDI and Frequency

The A with MIDI number 21 is called A0, while the next is called A1, etc. The same holds for all the other notes. The table is created from equation 3, where the MIDI number equals k+69.

Note  Nr. Freq.[Hz]  Nr. Freq.[Hz]  Nr. Freq.[Hz]  Nr. Freq.[Hz]  Nr. Freq.[Hz]  Nr. Freq.[Hz]
C       0     8.176   12    16.352   24    32.703   36    65.406   48   130.813   60   261.626
Db      1     8.662   13    17.324   25    34.648   37    69.296   49   138.591   61   277.183
D       2     9.177   14    18.354   26    36.708   38    73.416   50   146.832   62   293.665
Eb      3     9.723   15    19.445   27    38.891   39    77.782   51   155.563   63   311.127
E       4    10.301   16    20.602   28    41.203   40    82.407   52   164.814   64   329.628
F       5    10.913   17    21.827   29    43.654   41    87.307   53   174.614   65   349.228
Gb      6    11.562   18    23.125   30    46.249   42    92.499   54   184.997   66   369.994
G       7    12.250   19    24.500   31    48.999   43    97.999   55   195.998   67   391.995
Ab      8    12.978   20    25.957   32    51.913   44   103.826   56   207.652   68   415.305
A       9    13.750   21    27.500   33    55.000   45   110.000   57   220.000   69   440.000
Bb     10    14.568   22    29.135   34    58.270   46   116.541   58   233.082   70   466.164
B      11    15.434   23    30.868   35    61.735   47   123.471   59   246.942   71   493.883

Note  Nr. Freq.[Hz]  Nr. Freq.[Hz]  Nr. Freq.[Hz]  Nr. Freq.[Hz]  Nr. Freq.[Hz]
C      72   523.251   84  1046.502   96  2093.005  108  4186.009  120  8372.018
Db     73   554.365   85  1108.731   97  2217.461  109  4434.922  121  8869.844
D      74   587.330   86  1174.659   98  2349.318  110  4698.636  122  9397.273
Eb     75   622.254   87  1244.508   99  2489.016  111  4978.032  123  9956.063
E      76   659.255   88  1318.510  100  2637.020  112  5274.041  124 10548.082
F      77   698.456   89  1396.913  101  2793.826  113  5587.652  125 11175.303
Gb     78   739.989   90  1479.978  102  2959.955  114  5919.911  126 11839.822
G      79   783.991   91  1567.982  103  3135.963  115  6271.927  127 12543.854
Ab     80   830.609   92  1661.219  104  3322.438  116  6644.875
A      81   880.000   93  1760.000  105  3520.000  117  7040.000
Bb     82   932.328   94  1864.655  106  3729.310  118  7458.620
B      83   987.767   95  1975.533  107  3951.066  119  7902.133

Table 4

B – Matlab code for the transcriber

References

[Bisc92] C.H.Bischof, M.Shroff, "On Updating Signal Subspaces", IEEE Trans.Sig.Proc, Vol.40, No.1, 1992
[Broer00] P.M.T.Broersen, "Finite Sample Criteria for Autoregressive Order Selection", IEEE Trans.Sig.Proc, Vol.48, No.12, 2000
[Brown99] J. Brown, "Computer identification of musical instruments using pattern recognition with cepstral coefficients as features", MIT Media Labs, 1999
[Dick94] J.R.Dickie, A.K.Nandi, "On the Performance of AR Model Order Selection Methods", Signal Processing VII; Theories and Applications, 1994
[Djuric96] P.M.Djuric, "A Model Selection Rule for Sinusoids in White Gaussian Noise", IEEE Trans.Sig.Proc, Vol.44, No.7, 1996
[Djuric98] P.M.Djuric, "Asymptotic MAP Criteria for Model Selection", IEEE Trans.Sig.Proc, Vol.46, No.10, 1998
[Durne98] M.Durnerin, Operation ASPECT, 1998
[Feldman] J. Feldman, "Derivation of the Wave Equation", http://www.math.ubc.ca/~feldman/apps/wave.pdf
[Fitch00] J.Fitch, W.Shabana, "A Wavelet-based Pitch Detector For Musical Signals", University of Bath, UK
[Fuchs88] J.J.Fuchs, "Estimating the Number of Sinusoids in Additive White Noise", IEEE Trans.ASSP, Vol.36, No.12, 1988
[Jehan97] T. Jehan, "Musical Signal Parameter Estimation", http://www.cnmat.berkeley.edu/~tristan/Thesis/thesis.html, 1997
[Kalou87] N.Kalouptsidis, S.Theodoridis, "Fast Adaptive L-S Algorithms for Power Spectral Estimation", IEEE Trans.ASSP, Vol.35, pp.95-108, May 1987
[Karaj99] M.Karjalainen, T.Tolonen, "Multi-Pitch and Periodicity Analysis Model for Sound Separation and Auditory Scene Analysis", http://citeseer.nj.nec.com/411704.html, 1999
[Kashino95] Kashino, Nakadai, Kinoshita, Tanaka, "Organization of Hierarchical Perceptual Sounds", http://citeseer.nj.nec.com/27731.html, 1995
[Kashino98] Kashino, Nakadai, Kinoshita, Tanaka, "Application of Bayesian Probability Network to Musical Scene Analysis", http://citeseer.nj.nec.com/kashino98application.html, 1998
[Kay88] S.M.Kay, "Modern Spectral Estimation, Theory & Application", Prentice Hall, 1988
[Klapuri01A] A. Klapuri, "Means of Integrating Audio Content Analysis Algorithms", 2001
[Klapuri01B] A. Klapuri, "Multipitch Estimation And Sound Separation By The Spectral Smoothness Principle", 2001
[Klapuri98] A. Klapuri, "Automatic transcription of music", http://www.cs.tut.fi/sgn/arg/music/klapthes.pdf.zip, 1998
[Lee92] H.B.Lee, "Eigenvalues and Eigenvectors of Covariance Matrices for Signals Closely Spaced in Frequency", IEEE Trans.Sig.Proc, Vol.40, No.10, 1992
[Mallat99] S.Mallat, "A Wavelet Tour of Signal Processing", Academic Press
[Martin96] K. Martin, "Automatic Transcription of Simple Polyphonic Music: Robust Front End Processing", Third Joint Meeting of the Acoustical Societies of America and Japan, ftp://sound.media.mit.edu/pub/Papers/kdmTR399.ps.gz, 1996
[Martin98] K. Martin, "Musical instrument identification: A pattern-recognition approach", 136th meeting of the Acoustical Society of America, 1998
[McNab] R. J. McNab, L. A. Smith, I. H. Witten, C. L. Henderson, S. Jo Cunningham, "Towards the Digital Music Library: Tune Retrieval from Acoustic Input", University of Waikato, Hamilton, New Zealand, 1996
[Picone93] J.W.Picone, "Signal Modeling Techniques in Speech Recognition", Proc. of IEEE, Sept. 1993, p1215-1247
[Proakis] J.G.Proakis, D.G.Manolakis, "Digital Signal Processing, principles, algorithms, and applications", Prentice Hall, 1996
[Schro00] T. von Schroeter, "Auto-regressive spectral line analysis of piano tones", 2000
[Terhardt] E. Terhardt, "Psychoacoustics related to musical perception", http://www.mmk.ei.tum.de/persons/ter.html, 1999
[Vercoe97] B.L.Vercoe, W.G.Gardner, E.D.Schreirer, "Structured Audio: Creation, transmission, and Rendering of Parametric Sound Representations", Proc. of IEEE, May 1998, p922-940
[Wax85] M.Wax, T.Kailath, "Detection of Signals by Information Theoretic Criteria", IEEE Trans.ASSP, Vol.33, No.2, 1985
[WolfeA] J. Wolfe, "The University of New South Wales, Australia – Music acoustics group", http://www.phys.unsw.edu.au/music/
[WolfeB] J. Wolfe, "How harmonic are harmonics", http://www.phys.unsw.edu.au/~jw/harmonics.html