ENSEEIHT
2 rue Charles Camichel
BP 7122
31071 Toulouse Cedex 7
Télécommunications Spatiales et Aéronautiques (TéSA)
17 bis, rue Paul Riquet
F-31000 Toulouse
FRANCE
Automatic Music Transcription
using
Autoregressive Frequency Estimation
14 June 2001
Fredrik Hekland
NTNU, Norwegian University of Science and Technology
[email protected]
Under the direction of:
Corinne Mailhes (ENSEEIHT)
David Bonacci (ENSEEIHT)
Preface
Where and why
With help from the Erasmus student exchange programme, I had the opportunity to spend my fourth year as an engineering student abroad. I chose France since I wanted to learn to speak French and see parts of Europe I hadn't seen before. I found ENSEEIHT in Toulouse, which offered Signal Processing, and luckily they accepted my application. Since only the fifth year at this school offered enough courses within signal processing, I followed the option "Traitement du Signal et des Images" even though I was one year short. This also meant that a "Stage", a four-month final project, was ahead of me.
I was kindly given the chance to do the Stage at TéSA, a research laboratory at the school. Some possible subjects were presented to me, and after some counsel from my responsible professor at NTNU I chose the subject regarding music transcription.
TéSA (Télécommunications Spatiales et Aéronautiques) is a newly created research laboratory, a collaboration between several schools and companies. The lab is well equipped with both hardware and software, and has a good library, which made my literature review easy.
The work
In the beginning I spent two weeks doing a literature review, trying to find previous work in the field and all the necessary background information. A lot of articles were found with help from google.com and citeseer.nj.nec.com, while the library contained most of the IEEE publications. After having gained an overview of the problems, a longer period of testing was conducted. Both synthetic signals and samples from real instruments were used. The different frequency estimators and model order criteria were explored, and it was decided to use the Modified Covariance Method coupled with a simple AIC/MDL order selection criterion for the transcriber. A working monophonic transcriber was built, and some of the findings in the project may permit an extension to polyphonic operation.
Software
Matlab 5.2 and 5.3 were used for all coding purposes and most of the analysis work, and Spectrogram 6.0.4 helped analyse the instrument samples. Yamaha's free wave editor was indispensable for mixing and manipulating the necessary wav-files.
The Matlab files referred to in the text are not rugged enough to be used for any serious transcribing. For that reason, the code is not available to the public.
Personal outcome
Even though my work is not exactly ground-breaking, I have gained much personally, especially concerning the process of doing research work and writing a report. I now know better how to proceed and what to do underway, and certainly some of the pitfalls to avoid.
Within both Matlab programming and signal processing I have seen great progress, and within parametric modelling I have gained a deeper understanding.
Finally, it has been interesting to see what life is like in a laboratory and to observe the cultural differences and similarities between Norway and France. At last, it must be mentioned that I have learned a lot of French, starting at ground zero before my arrival in France and now being able to communicate without too much trouble at a basic level. I hope I will be able to maintain and improve the language in the future.
Acknowledgements
I would like to thank the director of the laboratory, Prof. Francis Castanié, and the person responsible for "Traitement du Signal et des Images", Dr. Corinne Mailhes, for giving me the opportunity to do this work in the labs.
Mme Mailhes was also responsible for my Stage, and David Bonacci was the person working on the subject closest to mine and my most important advisor, giving good ideas and tips. Both deserve thanks.
Thanks to the guys at "Bureau treize" for having received me well, and for accepting me even though my French is at best confusing, and at times incomprehensible. It is a pity that I could not take part in your antics; I had a good time with you all the same.
Thanks to all the other people at the lab for helping me out and being nice to me.
Lots of moral support and positive words from Tonje helped me when I needed it most. I love you.
Abstract
This project studies the use of Principal Component AR frequency estimation in automatic music transcription and discusses some of the problems arising when using AR models, among them model order selection. Some comparisons with classical Fourier methods are made.
A well-functioning monophonic transcriber using the Modified Covariance Method as pitch estimator is implemented in Matlab, and some suggestions for further work are given.
Table of Contents
Preface
Acknowledgements
Abstract
Chapter 1 – Initial theoretical studies
  Introduction
    Presentation of the problem
    The goal of this project
  Literature review
    Papers dealing with musical transcription
    Commercial or free transcription programs available
  A closer look at the challenges
    Common features of instruments
    Problems encountered when estimating pitch
    Other problems related to transcription
    MIDI file format
  Pitch estimators
    Cross-correlation between signal and sinusoids
    Filter banks
    Fourier-based methods
    Autoregressive methods
    Model order selection
      Estimation of the Prediction Error Power
      Order estimation methods based on singular values or noise subspace estimation
Chapter 2 – Implementation of a music-transcriber
  Initial testing on real and synthetic musical signals
    Fourier methods, real instruments
    Yule-Walker, synthetic signals
    Modified covariance method, synthetic signals
  Implementation of the transcriber
    The structure of the transcriber
    Transcribing real music
      Flute5.wav – a simple flute solo
      Flute00.wav – a quicker flute solo
      Some different sound files
  Some final words
    Limitations of the transcriber
    Improvement for the transcriber and ideas for future work
    Conclusion
A – Converting between Note, MIDI and Frequency
B – Matlab code for the transcriber
References
Chapter 1 – Initial theoretical studies
Introduction
Presentation of the problem
Music can be represented in many different ways, ranging from simple monophonic PCM files with low sample rates to highly symbolic multitrack representations containing control information for various musical parameters. Every abstraction level has its use, and being able to easily switch to another representation would certainly be useful. The most obvious application is to aid musicians or composers in writing music by playing the actual instrument rather than writing notes or playing on a keyboard. This could also extend to analysis of existing music where nothing but the sound recordings exist. An application that could be of greater commercial interest is the possibility to search for music on the Internet by simply whistling the tune, so-called content-based retrieval or query by audio [McNab], [Vercoe97]. One could also think of a scheme that tracked all radio stations for a particular music style. Areas which would certainly appreciate perfect transcription are the coming standards MPEG-4 for structured audio [Vercoe97] and MPEG-7 for description of audio-visual content.

While the process of converting music from a symbolic format to a waveform representation (synthesis) has evolved over the years, and now gives a fairly realistic sound at a reasonable price, the opposite process (analysis or transcription) is far from ready for commercialisation. There exist several monophonic solutions that are reported to work well in real time, but as soon as we want to analyse polyphonic music our options are effectively reduced to zero. A quick search on the Internet found some shareware programs claiming to perform transcription (see table 1), but trials showed these programs to perform rather poorly even after substantial parameter adjustment.

Speech recognition is very similar to the problem of musical transcription, but while the former has experienced much interest and successful applications during the last three decades, research on music recognition has mainly been done by a few individuals with a special interest in the subject. One reason for this is the apparent lack of commercial applications. Another is the complexity of the problem compared to speech recognition. While speech is limited to frequencies between 50 Hz and 4 kHz and the sources all have similar characteristics, musical frequencies range from 20 Hz to 20 kHz and there are many different instrument models. However, the main problem is the fact that western music is constructed upon harmonic relations (i.e. different instruments playing frequencies that are in simple integer ratios), which gives rise to spectral overlapping and possibly complete masking of certain notes. When we think of a symphonic orchestra with many musicians playing simultaneously, the task of separating and recognising each one of them from a simple two-channel recording seems (and might prove to be) impossible.
Currently most systems work in a bottom-up fashion, that is, all decisions are based upon the frequency and segmentation information one obtains from the music recording. This works satisfactorily for monophonic music where the notes are well separated in time and frequency. But these systems are unable to correct even the most obvious errors, since they don't possess any knowledge of compositional styles (rules). This problem is addressed by putting a top-down engine alongside the normal bottom-up recognition engine, for example letting the "knowledge-loaded" top-down engine monitor the transcription process and intervene when it disagrees with the estimations found. The common way to implement such systems is the so-called blackboard system, whose name stems from the idea that several 'experts', each having knowledge of a certain parameter, are 'standing in front of a blackboard' and together solving the given problem. These systems are very flexible, since it is easy to add experts, and the system can be driven in a bottom-up correcting fashion or in a top-down predicting fashion.
The goal of this project
Given the restricted time and lack of experience, the goal of this project was to explore the problems related to transcription and review the existing solutions, and thereby try to implement a monophonic transcriber that works better than the existing affordable programs. Instruments that are non-harmonic in nature, such as percussion instruments, are left out, while any kind of harmonic instrument is targeted. In fact, independence of instrument was wanted, and additionally a possible later extension to polyphonic recognition was desired. It was therefore decided to try other ways of doing frequency estimation, to possibly obtain higher resolution and more precise estimates than what is realisable using standard Fourier methods. A natural choice was to investigate the different parametric methods available, their advantages and disadvantages, and the well-known problem of model order selection. A very limited top-down correcting system is implemented, so as to improve the transcription process in the absence of a segmentation system.

As output format the standard MIDI file format was chosen, because of its widespread use and its relative simplicity. All the programs were to be built in Matlab, since it provides most of the needed routines and enables rapid development.
Literature review
Papers dealing with musical transcription
Only one article on the use of parametric methods in music recognition was found [Schro00]. Here, three algorithms are discussed and analysed with respect to relative frequency precision. It was concluded that the Modified Covariance Method (MODCOVAR) was superior to both the standard Maximum Entropy Method (Yule-Walker) and Prony Spectral Line Estimation. Additionally, the two former methods give us the relative size of the spectral peaks for free. The "Modcovar" method was applied to a short piano recording using a fixed order of 20, showing a promising result. No order estimation was discussed.
Another and more elaborate work is the master's thesis of Anssi Klapuri [Klapuri98], where some of the problems related to automatic transcription are discussed, and a system trying to resolve the problem of harmonic overlapping is described. The shortcomings of the purely bottom-up approach when it comes to polyphonic music, and the necessity of employing a top-down "knowledge system", are discussed. The thesis treats very general problems and their possible solutions, and different techniques for extracting information are presented. Klapuri has also released several papers related to this thesis, more directed towards a specific implementation; [Klapuri01A], [Klapuri01B]. Even more can be found at <http://www.cs.tut.fi/sgn/arg/publications.html> and <http://www.cs.tut.fi/~klap/iiro/>.
Top-down systems are probably the most promising approach for polyphonic music recognition, and despite the fact that such systems will need a greater understanding of human musical perception and a huge amount of musical knowledge at hand, some systems with limited knowledge have been implemented and show improvements over the usual bottom-up systems. Examples of an 'expert' in a top-down system are the probability of transitions between different chords, and rules for which notes can be played over which chord. See [Kashino98] and [Martin96].
Two other areas of importance, yet untreated in this project, are segmentation and instrument recognition. Segmentation of events (i.e. notes and pauses) gives us the possibility to respect the duration of each note, and to avoid including two notes in one frequency analysis frame, thereby obtaining better frequency estimates. The master's thesis of T. Jehan deals with this problem [Jehan97], and proposes methods both in the time domain and in the frequency domain. Especially the third chapter, using changes in the AR model as a basis for segmentation, could be an interesting future addition to this project. Recognition of instruments gives us the opportunity to improve the process even more, by using instrument models in the frequency analysis and by automatically setting correct instruments in the output file. Methods seen tested are various kinds of neural networks working on cepstral coefficients or on features from log-lag correlograms; see [Brown99] and [Martin98].
Commercial or free transcription programs available
WIDI Recognition System 2.6 (<http://www.midi.ru/w2m>): FFT based. Polyphonic.
AudioToMidi 1.01 (<http://www.midi.ru/AudioToMidi/>): unsure, possibly cross-correlation with synthetic sinusoids. Polyphonic.
AmazingMIDI 1.60 (<http://www.pluto.dti.ne.jp/~araki/amazingmidi/>): unsure. Single instrument, polyphonic.
WAV2MID 1.5a (<http://www.audioworks.com>): unknown. Monophonic.
DigitalEar 3.0 (<http://www.digital-ear.com/index2.html>): unknown. Monophonic.

Table 1: Shareware transcription programs found on the Internet, and the technology they appear to use.
As said in the introduction, none of these programs offers fully automatic operation and independence of instrument, and even after many parameter adjustments, none of the polyphonic ones gives an impressive result even on monophonic music. A more comprehensive list of existing programs can be found at:
<http://www.s-line.de/homepages/gerd_castan/compmus/audio2midi_e.html>

The two monophonic transcribers perform quite well. Digital Ear was difficult to test, because the demo version was very restricted. From what could be tested, it seemed a bit less powerful than both the project transcriber and the AudioWorks transcriber.
The AudioWorks transcriber performed nearly perfectly, with a minimum of parameter setting and excellent speed. Testing the program with the same wav-files as in the project showed a performance similar to what was obtained in this project, with fewer parameters. A small bug in the program caused the instrument in the MIDI file to always be a grand piano.
A closer look at the challenges
Common features of instruments
Apart from percussion instruments, the creation of sound in most instruments is based on generating standing waves on strings or in hollow tubes of some geometry. Since the length of the string or tube is normally fixed for a given note, the possible wavelengths are constrained to fulfil the wave equation for the given length, geometry and speed of sound:
$$\frac{\partial^2 u(x,t)}{\partial x^2} - \frac{1}{v^2}\,\frac{\partial^2 u(x,t)}{\partial t^2} = 0 \qquad (1)$$
Usually strings are fixed at both ends, while tubes can be open at one or both ends, with either circular or conical inner geometry. These constraints only allow certain modes to operate in the resonator, and for strings and circular open tubes the possible solutions are of the form:
$$u_k(t) = A_k \sin\!\left(2\pi f_k t\right), \qquad f_k = \frac{k\,v}{2L}, \quad k = 1, 2, \dots \qquad (2)$$
(2)
eventually as a sum of cosines if the tube is circular half open. As for the conical case, the solution
consists of spherical harmonics. A more elaborate explanation can be found at the website
[WolfeA]
This gives rise to a harmonic spectrum, with all over-harmonics being an integer multiple of the fundamental frequency. Figure 1 shows a typical frequency spectrum, with the fundamental frequency at 392 Hz and all the over-harmonics being 2 to 12 times the fundamental frequency. Another typical feature is that most of the energy is located in the lower harmonics, leading to weaker over-harmonics. For example, in piano sounds more than 90% of the energy is contained in the fundamental [Schro00]. In western musical notation, it is the frequency of the fundamental in the harmonic series, also called the pitch¹, that names the note. The relation between the notes is given in (3), with 440 Hz being the so-called chamber tone (A4), and k ranging from -48 to 39 for a standard piano. To convert between frequency, note name and MIDI number, see table 4, appendix A.
Figure 1: Spectrum of a G4 note from a flute; fundamental f0 = 392 Hz (magnitude in dB versus frequency in Hz).
$$f_k = 440\,\mathrm{Hz} \cdot 2^{k/12}, \qquad k \in \mathbb{Z} \qquad (3)$$
¹ The term pitch is not well defined, and at least three different meanings exist:
a) Production pitch: the rate at which the excitation device opens and closes (e.g. glottal closure rate in human speech).
b) Mathematical pitch: the fundamental of the harmonic series.
c) Perceptual pitch: the actual frequency perceived by the listener.
In this project, definition b) is used, since it is directly connected to the searched notes.
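As an illustration of relation (3), a minimal Matlab sketch of the mapping between an estimated frequency and the nearest tempered note could look as follows (the variable names are illustrative, not the project's actual freq2mid.m; the MIDI convention assigning number 69 to A4 is standard):

    f = 392.0;                        % estimated fundamental in Hz (a G4)
    k = round(12 * log2(f / 440));    % nearest number of semitones from A4
    midinum = 69 + k;                 % MIDI convention: A4 (440 Hz) is 69
    fnote = 440 * 2^(k / 12);         % frequency of the chosen tempered note
    relerr = abs(f - fnote) / fnote;  % relative error of the estimate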
Problems encountered when estimating pitch
It is clear that reliable pitch estimation is essential to successful transcription; the relative frequency error must not exceed 2.5% if we are to avoid picking the note a semitone away from the real note. Further, we want to be able to resolve several instruments playing simultaneously, so we must find as many harmonics as possible, where the harmonics can be closely spaced or even overlapping. Instruments without pitch that have noise-like spectra (percussion/drums) can obviously not be identified by tracking the pitch, and are not treated in this project either.
Another problem is irregularities in the harmonic spectrum. A classic example is the clarinet, where only the odd harmonics are present, because of the constraints the circular half-open resonator imposes on the wave equation.
Guitars can sometimes play without the first harmonic present, or with it very weak. This leads to the phenomenon of virtual pitch, where the human brain deduces the missing fundamental from the distance between the harmonics [Terhardt]. (It is this concept that is used in some headphones to obtain a lower frequency response than what is possible with the given membrane radius.) Another problem related to string instruments is inharmonicity in the spectrum. This is due to the fact that the string isn't infinitely thin, but has a certain mass, and when this mass is unevenly distributed some of the harmonics can be displaced from their 'correct' positions [WolfeB].
Even if these exceptions cause problems for the transcription process, they can be countered fairly easily; the most concerning issue is still estimating the pitch of multiple instruments whose harmonic series overlap. This might seem like a special case which only happens from time to time, but in fact it is the rule rather than the exception. The reason for composing music in this fashion is rooted in the way the brain experiences music: entities related harmonically are put together in a bigger entity which is experienced as a whole, that is, one cannot distinguish the different notes. This helps reduce the complexity of the listening process, and makes it easier to listen to the music. The more the elements in the music are in harmonic relation, the more the human brain groups the elements together, making it more pleasing to listen to. So what makes the listening easier makes the transcription process harder.
Many different approaches exist for estimating the pitch, and many seem to be adapted from methods originally developed for speech processing. Still, most of these methods are made for monophonic music and will perform poorly when applied to polyphonic music. The most used is the spectrogram, based on the Short-Time Fourier Transform (STFT). Other examples are the correlogram, filterbanks, the Constant Q Transform, the cepstrum, auto-correlation, cross-correlation, zero-crossing detection, the wavelet transform, sinusoidal modelling, AR or ARMA methods, and Prony modelling. Some efforts have been made to improve multipitch estimation by subtracting the spectrum associated with an estimated note, and then continuing to search for notes [Klapuri01B]. Some of the methods mentioned above are discussed further down; others can be found in [Klapuri98], [Schro00], [Fitch00]. Even though it is not directly applicable to music, the tutorial paper on signal modelling in speech recognition by Picone is worth a read to get an overview of available techniques within speech processing [Picone93].
Other problems related to transcription
It is a known fact that it is the attack portion of the note (the transient state before the harmonics in steady state take over) that dictates the way we experience the timbre, and it is probably this region that can provide the information needed to identify the instrument. It is therefore important to determine the locations of the on- and off-sets of the notes. This will also help avoid placing a pitch analysis frame over two different notes, and thereby give more accurate estimates. More importantly, when the model order rises, segmentation enables us to adjust the analysis window to cover the whole note, giving us as many datapoints as possible. Many schemes for segmenting music exist; see [Jehan97], [Klapuri98].
Successfully identifying the instruments would enable us to apply instrument models to the pitch estimation process. This could help in detecting notes with overlapping harmonics, since we would to some extent know the expected amplitude signatures. Also, it would provide automatic selection of instruments in the output file.
Processing cost and storage requirements are not often discussed in connection with automatic transcription. They might not be an issue in laboratory tests or in professional applications with no need for real time, but if used for music search on the Internet, the transcriber would most probably be implemented as a Java applet downloaded from a server. It is clear that as long as not everybody is blessed with a high-speed Internet connection and the fastest processors, only a limited amount of knowledge and calculation can be put into the applet. Thus, a system requiring as few resources as possible while still being able to do the job would be desired.
MIDI file format
The MIDI file format is relatively simple, and much documentation can be found on the Internet. It is a binary format, which helps keep the files small. Both single-track and multi-track files are supported. In brief, the file is organised in chunks. Every chunk starts with a four-byte ID field telling what type of chunk it is, and a four-byte length field giving the number of bytes of data following this header. All MIDI files begin with an MThd chunk containing tempo information, followed by one or more MTrk chunks containing meta-data and the actual music. All events in the track chunk, such as note-on and note-off, carry a delta-time field indicating the duration (in MIDI clocks) between the event and the preceding event. This delta-time is stored in a variable-length format, enabling shorter delta-times to be represented with fewer bytes than with a fixed format; a sketch of this encoding is given below.
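As an illustration, a minimal standalone Matlab function sketch of this variable-length encoding (the kind of conversion a helper such as fixed2var.m must perform; the function name here is illustrative):

    function bytes = fixed2var_sketch(value)
    % Encode a non-negative delta-time as a MIDI variable-length quantity:
    % 7 data bits per byte, continuation bit set on all but the last byte.
    bytes = mod(value, 128);                 % least significant 7-bit group
    value = floor(value / 128);
    while value > 0
      bytes = [mod(value, 128), bytes];      % prepend next 7-bit group
      value = floor(value / 128);
    end
    bytes(1:end-1) = bytes(1:end-1) + 128;   % set continuation bits

For example, fixed2var_sketch(200) returns [129 72], i.e. the two bytes 0x81 0x48.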
Three types of files exist: types 0, 1 and 2. A type 0 file contains only one track, with all events merged into it. Type 1 contains one or more tracks, where all tracks are played simultaneously. Type 2 contains one or more tracks, but the tracks are sequentially independent. Type 1 was the natural choice for this project, since an extension to polyphony is then accomplished simply by adding MTrk chunks.
A good overview of the event codes in the specification can be found at <http://crystal.apana.org.au/ghansper/midi_introduction/midi_file_format.html>, while a more textual introduction can be found at <http://jedi.ks.uiuc.edu/~johns/links/music/midifile.html>.
Pitch estimators
Cross-correlation between signal and sinusoids
Since we assume that the signal consists of a fundamental sinusoid with several harmonics (also sinusoids), we could find these sinusoids by taking the cross-correlation between the signal block and test sinusoids with frequencies given by (3). The frequencies found are those that yield the highest cross-correlation values. The test sinusoids could be synthetic or could be samples of real instruments. This method will of course be a bit costly in terms of computation, since every possible note must be tested. Further, we cannot reveal notes whose harmonics are all masked by a lower note. To be able to do that, we must apply instrument models and compare the correlation value given by (4) with the expected value from the model; a deviation from the expected value could indicate a hidden note.
$$c_{xy}(k) = \sum_{l=0}^{N} x(l)\, y(l+k), \qquad k = 0, 1, \dots, k_{max} \qquad (4)$$

Fig. 2 shows an example for the G4 at 392 Hz (the same as in Fig. 1), where the cross-correlation between the signal and 8 sinusoids has been calculated.

The calculation cost is rather high, since we have to do N multiplications for every possible note (128 MIDI notes), and the lower the note is, the bigger k_max gets, to account for the longer sinusoid period.

Figure 2: Cross-correlation between a G4 from a flute and 8 sinusoids (392 Hz, 784 Hz, 1176 Hz, 1568 Hz, and 0.67 of the first 4).
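A minimal Matlab sketch of this test, assuming one analysis block x sampled at 11025 Hz and synthetic unit-amplitude test sinusoids (the candidate note range is an assumption for the example):

    fs = 11025; x = x(:)'; n = 0:length(x)-1;
    midinotes = 36:96;                         % candidate notes (assumption)
    score = zeros(size(midinotes));
    for m = 1:length(midinotes)
      f0 = 440 * 2^((midinotes(m) - 69)/12);   % candidate pitch from (3)
      s = sin(2*pi*f0/fs * n);                 % synthetic test sinusoid
      c = xcorr(x, s);                         % cross-correlation, cf. (4)
      score(m) = max(abs(c));                  % best lag for this note
    end
    [best, idx] = max(score);                  % strongest candidate
    bestnote = midinotes(idx);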
Filter banks
Some of the earliest attempts to estimate pitch were made with filterbanks. One simply divides the spectrum by filtering the signal with several bandpass filters, and then measures the energy present at the different filter outputs. A better and more promising approach is to use a dyadic filterbank in conjunction with wavelets; see [Karaj99] and [Fitch00].
Fourier-based methods
As mentioned earlier, the spectrogram has been the most popular way to obtain a pitch estimate. To find the frequencies of the sinusoids, one simply picks the peaks in the spectrum.
The spectrogram is based on the Short-Time Fourier Transform (5), an extension of the normal Fourier transform where a small analysis window is moved along the time axis and successive transforms are taken.
$$S_x(u,f) = \int_{-\infty}^{+\infty} x(t)\, w(t-u)\, e^{-i 2\pi f t}\, dt \qquad (5)$$
The spectrogram gives the energy density, similar to the periodogram, and is given by (6):

$$P_S(u,f) = \left| S_x(u,f) \right|^2 \qquad (6)$$
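As a sketch of this approach in Matlab, using the classic Signal Processing Toolbox function specgram (renamed spectrogram in later versions), a crude frame-by-frame peak-pick of (6) could read:

    fs = 11025; win = 256;                 % roughly 23 ms analysis window
    [S, F, T] = specgram(x, 1024, fs, hanning(win), win/2);
    P = abs(S).^2;                         % energy density, cf. (6)
    [dummy, idx] = max(P);                 % strongest bin in each frame
    pitchtrack = F(idx);                   % crude per-frame pitch estimate (Hz)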
Two problems are associated with the spectrogram. The first is the windowing function. Windowing introduces side-lobes which can mask weaker signals, and trying to reduce the side-lobes eventually leads to a wider main lobe. Further, the resolution in time and frequency is limited by Heisenberg's uncertainty theorem, which states that the area of the rectangle defined by the windowing function has a minimum given by (7):

$$\Delta t \, \Delta f \ge \frac{1}{2} \qquad (7)$$
This windowing of the data clearly limits the obtainable frequency resolution, and we are given the choice between high frequency resolution or high time resolution, but not both. Additionally, we have some minimal time resolution to respect if we want to avoid mixing several notes in the same analysis frame, and to know where a note actually starts and stops. For example, a musical piece playing at 200 BPM (beats per minute, or quarter notes per minute) sampled at 44.1 kHz gives us 60 s/(200*4) = 75 ms per sixteenth note, which again gives us 44100*75 ms ≈ 3308 samples per sixteenth note. A lower sample rate means fewer datapoints, and if we want to capture ornamental tones², we'll have even fewer sample points.
Then to the second problem, namely the linearly spaced frequency bins of the Fourier transform. Musical tones are ordered in a logarithmic fashion similar to the sensitivity of human hearing, so the distance between two neighbouring notes in the higher end of the musical scale is greater than in the lower end. This allows for higher time resolution by lowering the frequency resolution in the higher end of the spectrum. The Constant Q Transform obtains such a constant ratio between centre frequency and frequency resolution for each frequency partition by combining several bins of an FFT. This however fails to give better time resolution for the higher frequencies, so we haven't gained much.
² Ornamental notes are very short notes not written on the music sheet, but added by the musician.
Autoregressive methods
Since the windowing function is a problem for the resolution, we want to avoid windowing our data. Parametric methods like AR models and eigenvector methods make this possible. We assume our signal to be composed of a number of sinusoids in white noise, which enables us to make use of an eigendecomposition of the estimated auto-correlation matrix. This is possible since the matrix is composed of a signal auto-correlation matrix and a noise auto-correlation matrix, where the eigenvalues associated with the noise are generally small compared to those of the signal. Since the theoretical Maximum Likelihood Estimator (MLE) for more than one sinusoid cannot be found analytically [Kay88], we must use a sub-optimal method. The most promising estimator that is not too computationally expensive is the Principal Component AR frequency estimator.
$$\hat{a} = -R_{xx}^{-1}\, r_{xx} \qquad (8)$$
(8)
Eq. (8) is the standard AR parameter estimation, with the auto-correlation matrix being positive
definite hermitian. This allows for a eigendecomposition with real, positive eigenvalues and
orthonormal eigenvectors. These eigenvectors span the entire auto-correlation space, so we can
write the auto-correlation vector as a linear combination of the eigenvectors, which inserted into (8)
gives:
$$\hat{a} = -\left[\sum_{i=1}^{M} \frac{1}{\lambda_i}\, v_i v_i^H\right] \sum_{j=1}^{M} \alpha_j v_j = -\sum_{i=1}^{M} \frac{\alpha_i}{\lambda_i}\, v_i \qquad (9)$$
Simplifying and discarding the (M-p) smallest eigenvalues, we obtain:
$$\hat{a}_{PC} = -\sum_{i=1}^{p} \frac{\alpha_i}{\lambda_i}\, v_i \qquad (10)$$
Finally, the frequency estimates are obtained by picking the peaks of the spectral estimator in (11), which is done by finding the roots of the polynomial defined by (10) and taking the angles of these poles:
$$\hat{P}_{PC}(f) = \frac{\hat{\sigma}^2_{pc}}{\left| 1 + \sum_{k=1}^{p} \hat{a}_{pc}(k)\, e^{-j 2\pi f k} \right|^2} \qquad (11)$$
In practice, we can estimate the AR coefficients with the Modified Covariance Method (MODCOV), pick the p poles closest to the unit circle, and again obtain the frequency estimates from the angles of these poles.
The modified covariance method has proven to be insensitive to the initial phases of the sinusoids, and spectral peak shifting due to noise is low. In fact, in the absence of noise the true frequencies are found. Tests show that the Principal Component method reaches the Cramér-Rao bound for SNR higher than 10 dB. Since the MODCOV method is least-squares optimised without constraints, the poles can end up outside the unit circle, leading to unstable models. This is not a problem, since we are only interested in the angles of the poles.
Further, the AR spectrum can be calculated from (11). A minimal sketch of the whole estimator is given below.
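A minimal Matlab sketch of this pole-angle estimator, assuming the order p is already given (e.g. by one of the criteria in the next section):

    fs = 11025; p = 12;                    % model order, assumed known here
    a = armcov(x, p);                      % modified covariance fit (Matlab 5.3)
    z = roots(a);                          % poles of the AR model
    z = z(imag(z) > 0);                    % keep one pole per conjugate pair
    [dummy, idx] = sort(abs(abs(z) - 1));  % closest to the unit circle first
    z = z(idx);
    fhat = angle(z) * fs / (2*pi);         % pole angles converted to Hz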
Model order selection
The choice of model order is crucial for making AR methods perform acceptably, but unfortunately no perfect method exists. The problem gets even harder as the number of available sample points is reduced. As soon as the candidate order is greater than 0.1 times the number of available datapoints, we are dealing with a finite sample, and the criteria for larger samples are no longer valid and need corrections. Most of the methods are based on the maximum likelihood estimator of some parameter, with some bias/variance correcting factor.
Estimation of the Prediction Error Power
Most of the existing methods try to fit several model structures and orders to the data, and the order is chosen by minimising a cost function. This cost function is usually based on the Maximum Likelihood estimate of the prediction error power (PEP) of the model, but since the PEP is on average decreasing for increasing model order, we have to add a certain penalty to the cost function. This is because a higher order leads to higher variance in the prediction coefficient estimates, and thereby higher variance in the PEP. The information-based criteria require two passes over the data: one for calculating the cost function and another for choosing the right order.
PEP:
$$\hat{\sigma}^2_{pep} = E\!\left[\left| x(n) - \hat{x}(n) \right|^2\right] = \hat{r}_{xx}(0) + \sum_{k=1}^{p} \hat{a}(k)\, \hat{r}_{xx}(k) \qquad (12)$$
The a's in the PEP are calculated using the actual estimation method (YW, or ModCov, also called FB or Forward-Backward), and PEP_FB indicates that the modified covariance method was used.
The best-known criterion (for AR models) is Akaike's Information Criterion (AIC):
$$\hat{k}_{AIC} = \arg\min_{k=0,1,\dots,p_{max}} \left[ \ln\!\left(\hat{\sigma}^2_{pep}(k)\right) + \frac{2k}{N} \right] \qquad (13)$$
This criterion is however not consistent, and tends to overestimate the model order. The minimum description length (MDL) criterion fixes this by using a higher penalty factor, and is asymptotically consistent:
$$\hat{k}_{MDL} = \arg\min_{k=0,1,\dots,p_{max}} \left[ \ln\!\left(\hat{\sigma}^2_{pep}(k)\right) + \frac{k \ln N}{N} \right] \qquad (14)$$
These two criteria impose an equal penalty on all unknown parameters (amplitudes, phases, frequencies etc.). Better performance can be obtained with the maximum a posteriori (MAP) criterion, where different penalties can be attributed to the different unknown parameters. For AR models the MAP is equal to the MDL, while for sinusoids with unknown frequencies in white noise the penalty is (5k/N) ln N. For a development of the MAP, see [Djuric96] and [Djuric98].
A problem arises when the number of datapoints is less than ten times the model order, as we are then dealing with finite samples and the criteria given above no longer work. The problem is especially present in small samples, where the criteria can even fail to give a minimum. In larger samples they tend to choose too high orders, as they do not account for the increased variance in the modelling error. The Finite Sample Information Criterion (FSIC) [Broer00] tries to handle this overestimation by changing the penalty factor to better suit the variance of the given AR estimation method and model order.
$$\hat{k}_{FSIC} = \arg\min_{k=0,1,\dots,p_{max}} \left[ \ln\!\left(\hat{\sigma}^2_{pep}(k)\right) + \left( \prod_{i=0}^{k} \frac{1+v_i}{1-v_i} - 1 \right) \right] \qquad (15)$$
where the v_i depend on the AR method; for MODCOV it is given as v_i = (N+1.5-1.5i)^(-1). The combined information criterion (CIC) uses the maximum of the penalty in (15) and 3·Σv_i; see [Broer00], eq. 13.
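A sketch of (13) and (14) in Matlab, here with the PEP estimated by the Yule-Walker routine aryule (using armcov instead would give PEP_FB):

    N = length(x); pmax = 40;
    aic = zeros(1, pmax); mdl = zeros(1, pmax);
    for k = 1:pmax
      [a, pep] = aryule(x, k);            % AR fit and prediction error power
      aic(k) = log(pep) + 2*k/N;          % AIC, eq. (13)
      mdl(k) = log(pep) + k*log(N)/N;     % MDL, eq. (14)
    end
    [dummy, kaic] = min(aic);             % order chosen by AIC
    [dummy, kmdl] = min(mdl);             % order chosen by MDL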
Order estimation methods based on singular values or noise subspace estimation
Another group of order estimation methods is based on determining the singular values of the signal auto-correlation or covariance matrix. These methods are more specialised, as they are usually based on the assumption that the signal consists of sinusoids in white noise. This makes it possible to decompose the auto-correlation matrix into a signal matrix and a noise matrix, because the singular values associated with the signal are generally much higher than those of the noise. Additionally, these two subspaces are orthogonal. The drawbacks of these eigen-calculation-based methods are that they need O(p³) operations, and that the true order is not always the best order (at least for autoregressive processes, which are AR(∞) when corrupted with white noise). The AIC and MDL criteria based on eigenvalues are given below, and their development can be found in [Wax85].
$$\hat{k}_{AIC} = \arg\min_{k=0,1,\dots,p_{max}} \left\{ -2N(p-k)\,\ln\!\left[ \frac{\left( \prod_{l=k+1}^{p} \lambda_l \right)^{\frac{1}{p-k}}}{\frac{1}{p-k} \sum_{l=k+1}^{p} \lambda_l} \right] + 2k(2p-k) \right\} \qquad (16)$$

$$\hat{k}_{MDL} = \arg\min_{k=0,1,\dots,p_{max}} \left\{ -N(p-k)\,\ln\!\left[ \frac{\left( \prod_{l=k+1}^{p} \lambda_l \right)^{\frac{1}{p-k}}}{\frac{1}{p-k} \sum_{l=k+1}^{p} \lambda_l} \right] + \frac{1}{2}\, k(2p-k)\ln N \right\} \qquad (17)$$
where k is the candidate order, p is the order of the covariance matrix, N is the number of datapoints used to calculate the covariance matrix, and the λ's are the eigenvalues of the covariance matrix ordered as λ1 > λ2 > ... > λp. The bracketed expression is simply the ratio of the geometric mean to the arithmetic mean of the p-k smallest eigenvalues. Similar to the PEP-based criteria, the AIC tends to overestimate the order, while the MDL has been shown to be consistent. It should be possible to merge the order estimation and the parameter estimation in order to reduce the computational load.
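A sketch of the MDL in (17) in Matlab, assuming a data-matrix estimate of the p x p covariance matrix (this matrix construction is one possible choice, and the data are assumed noisy so that all eigenvalues are strictly positive):

    p = 30; L = length(x) - p + 1;           % covariance order, nr of rows
    X = toeplitz(x(p:end), x(p:-1:1));       % L-by-p data matrix
    lam = -sort(-eig((X'*X)/L));             % eigenvalues in descending order
    mdl = zeros(1, p-1);
    for k = 0:p-2
      tail = lam(k+1:p);                     % the p-k smallest eigenvalues
      gm = exp(mean(log(tail)));             % geometric mean
      am = mean(tail);                       % arithmetic mean
      mdl(k+1) = -L*(p-k)*log(gm/am) + 0.5*k*(2*p-k)*log(L);
    end
    [dummy, i] = min(mdl); khat = i - 1;     % estimated signal subspace dim.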
Another possibility is to continuously estimate the noise subspace by means of a QR factorisation of the covariance matrix, or even of the data matrix itself. The idea is to decompose a matrix X into a matrix Q with orthonormal columns and an upper-triangular matrix R:

$$X E = Q R \qquad (18)$$

where E is a permutation matrix that orders the diagonal of R in descending order.

$$R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix} \qquad (19)$$
The factorisation in (18) is called a rank-revealing QR factorisation if R22 has a small norm. The dimension of R22 equals the rank deficiency of X, or equivalently the dimension of the noise subspace. The dimension of R11 is then equal to that of the signal subspace. An efficient implementation requiring on average O(p²) operations can be found in [Bisc92].
Of course, one can always decompose the signal or covariance matrix and determine the rank of the noise subspace by finding the smallest singular values. However, this proves difficult when the order of the matrix is small.
Chapter 2 – Implementation of a music-transcriber
Initial testing on real and synthetic musical signals
Unless otherwise stated, all sound files are sampled at 11025 Hz with 16 bits in one channel. Some single-note sound files were found on the Internet: a series of tones from a piano, plus some single clarinet and saxophone examples. The most useful was a series of flute tones from different flutes, found at <http://www.phys.unsw.edu.au/music/flute/flute.html>, with impedance and spectral measurements available. These sound clips were chosen as the basis for the project.
The synthetic signals were created with the script synthsign.m, and the signals are analysed with the script process.m. A description of the different program modules is found in appendix B.
Fourier methods, real instruments
In the beginning, some testing was done on single notes from real instruments. This was done to better see the difference between parametric and non-parametric methods, and to get a picture of what instrument spectra look like. Additionally, some modules for file handling and note/frequency decisions were built, which were also needed for the real transcriber. Synthetic signals were not tested, since it was assumed that as long as the peaks are well separated the correct frequencies are found, and that the resolution is limited by the number of datapoints N in the analysis window.
A crude periodogram was implemented first, and a simple peak-picking method was used to estimate the frequencies. The peak-picking worked by searching for the maximum value and then deleting a certain number of samples around this maximum; a sketch is given below. An example from a4b.wav (an A4) is shown in Fig. 3, with the Matlab output found in Table 2. A problem is that we have to use a threshold, and thereby miss some weaker peaks. Instead of a fixed-value threshold, a more adaptive method that estimates the noise floor and uses it as the threshold [Durne98] could be used.
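A Matlab sketch of this naive peak-picking, with Pxx and Faxis standing for the periodogram and its frequency axis (the guard width and threshold are the arbitrary parameters discussed above):

    P = Pxx(:); F = Faxis(:);
    guard = 10;                       % nr of samples to delete on each side
    thresh = max(P) / 1000;           % fixed threshold, -30 dB (arbitrary)
    peaks = [];
    while max(P) > thresh
      [dummy, i] = max(P);            % current strongest peak
      peaks = [peaks, F(i)];
      lo = max(1, i - guard); hi = min(length(P), i + guard);
      P(lo:hi) = 0;                   % delete samples around the maximum
    end
    peaks = sort(peaks);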
Figure 3: Periodogram of a4b.wav (magnitude in dB versus frequency in Hz).
Peak 1 is at freq 0.000 Hz      Difference from 0 Hz: 0.000 Hz
Peak 2 is at freq 425.954 Hz    Difference from previous peak: 425.954 Hz
Peak 3 is at freq 438.066 Hz    Difference from previous peak: 12.112 Hz
Peak 4 is at freq 864.693 Hz    Difference from previous peak: 426.627 Hz
Peak 5 is at freq 877.142 Hz    Difference from previous peak: 12.449 Hz
Peak 6 is at freq 888.245 Hz    Difference from previous peak: 11.103 Hz
Peak 7 is at freq 1303.432 Hz   Difference from previous peak: 415.187 Hz
Peak 8 is at freq 1314.871 Hz   Difference from previous peak: 11.440 Hz
Peak 9 is at freq 1325.301 Hz   Difference from previous peak: 10.430 Hz
Peak 10 is at freq 1753.274 Hz  Difference from previous peak: 427.972 Hz
Peak 11 is at freq 2200.088 Hz  Difference from previous peak: 446.814 Hz

Table 2
It is clear that the Fourier transform will work fine for monophonic music, but a certain smoothing is necessary before the peaks are searched for. A different search routine would be desirable, since one never knows how many points to delete around the peaks.
An averaged periodogram with overlapping windows would be a better alternative, as it would reduce both the amplitude and frequency variance. Matlab provides this as a Welch periodogram. The same sound is tested in Fig. 4, and we see a clear improvement over the plain periodogram: a more precise frequency estimate is obtained, and the spurious peaks present in Table 2 are gone. This file has 18743 samples (1.7 s), and is of course a bit longer than the average note duration.

Figure 4: A4, Welch periodogram (1024-point FFT, 512-point window, 25% overlap).
Peak 1 is at freq 441.431 Hz    Difference from 0 Hz: 441.431 Hz
Peak 2 is at freq 872.095 Hz    Difference from previous peak: 430.664 Hz
Peak 3 is at freq 1313.525 Hz   Difference from previous peak: 441.431 Hz
Peak 4 is at freq 1754.956 Hz   Difference from previous peak: 441.431 Hz

Table 3
In music, we can expect the signal to be stationary for about 25 ms, so a window size of that duration is realistic. Fig. 5 shows a 28 ms clip (308 points) of the previous A4, and the same frequencies as in Table 3 are found.

With such a short window we have a possible resolution of 11025 Hz/300 ≈ 36.75 Hz, and a smoothed periodogram makes it even worse. For example, an A1 (55 Hz) would prove difficult to resolve, and overlapping harmonics would be unresolvable.

Figure 5: Welch periodogram of a 28 ms (308-point) clip of the same A4.
Yule-Walker, synthetic signals
Even though the Yule-Walker method is reported to perform poorly as a frequency estimator in noisy signals [Kay88], it was implemented, partly because of its computational simplicity and partly because it was the only method available for estimating AR coefficients in Matlab 5.2.
A 25 ms synthetic noise-free signal corresponding to an A4 with four over-harmonics was created, and the frequencies were estimated by rooting the AR coefficient polynomial. The resulting spectrum given by (11) is seen in Fig. 6; the frequencies found are 436.639, 879.691, 1321.294, 1763.093 and 2207.368 Hz. We see that even though the signal is noise-free, the true frequencies are not found. This is due to the 'zeroing' of the auto-correlation values outside the auto-correlation matrix, which smoothes and displaces the peaks.
Figure 6: Welch periodogram of the signal and the estimated AR(27) spectrum.
This is probably due to the fact that
without noise, the auto-correlation matrix
is singular.
When the signal is corrupted with noise,
the order stays the same, but the poles
modelling the noise get even closer to the
unit circle.
Figure 7 shows the pole-zero plot of the noise-free A4. We see that the poles not modelling the sinusoids are fairly close to the unit circle. This will lead to problems selecting the threshold for which poles to accept. The reason for the relatively high order selected compared to the true order (10) is the same as above: the zeroing of the auto-correlation values. To minimise the modelling error, a higher order is necessary. The AIC function (with penalty k set to 1) is shown in figure 8. The same value is obtained for the MDL and FPE, while the eigenvalue-based criteria choose 55, which is way too high.
Figure 7: Pole-zero plot of the AR model for the noise-free A4.
Figure 8: AIC (penalty k=1) as a function of model order for the noise-free A4.
One of the reasons for using AR models instead of Fourier methods was to obtain higher resolution, and thereby be able to resolve harmonics a semitone apart, or maybe even resolve overlapping harmonics.
A 25 ms synthetic signal consisting of a G4# (415.3 Hz) and an A4 (440 Hz) in white noise with variance 0.16 was created. At a sampling rate of 11025 Hz we have 276 datapoints, which gives a Fourier resolution of 11025/276 ≈ 39.95 Hz. This means that a standard periodogram should not be able to resolve the first harmonics (24.7 Hz apart), while the over-harmonics could be found (>49 Hz apart).
Still using the YW method to find the prediction coefficients, we analyse the signal, using a standard AIC with penalty 1 to estimate the order. Fig. 9 shows the result, and as predicted we see that the Welch periodogram is unable to resolve the first harmonics. While not visible in this figure, the two peaks are found in the AR model, see fig. 10. However, one of the poles is too far from the unit circle, and is considered as noise. This limited capability to separate signal and noise probably makes the Yule-Walker method for estimating the poles less attractive, and other methods should be tested.

Figure 9: Welch periodogram of the signal and the estimated AR(51) spectrum.
One could also consider Burg's algorithm, since it has the same computational cost as YW but in general performs better. The problem is the phenomenon of line splitting when the order increases.
On the other hand, one could resolve all of the frequencies in the model, and from that search for possible harmonic series. However, care must be taken to avoid the creation of non-existing notes, since we see from figs. 10 and 7 that the poles not associated with the signal are not exactly randomly distributed.
Figure 10: Pole-zero plot for the 415.30 Hz + 440 Hz signal (25 ms, 276 points).
Modified covariance method, synthetic signals
As reported in [Schro00], Marple's modified covariance method holds the promise of better frequency estimates than the traditional Yule-Walker method, and it was claimed that in the absence of noise the true frequencies are found.

Matlab 5.2 does not include the algorithm, but 5.3 does; it is called armcov.m.

The same noiseless A4 as used with YW is tested with the ModCov algorithm. We see in fig. 11 that the correct model order is found with the AIC when using the ModCov to find the PEP. The problem is that some of these poles are displaced more than 0.035 from the unit circle.

Figure 11: AIC (k=1) with PEP from ModCov, as a function of model order.

We obtain 444.87, 899.84, 1332.10, 1764.21 and 2200.45 Hz, which is even worse than YW. This could be due to round-off errors when forming the covariance matrix, or inaccuracies when inverting it.

Figure 12: Pole-zero plot, A4, no noise, ModCov.

To improve the estimates, one could try to increase the order of the model. By doubling the order, we see that the poles land on the unit circle and give the exact frequencies. Fig. 13 shows the result of adding two poles; in fact, just increasing the order to 12 gives a better result, with a maximum deviation of 0.12 Hz, see figs. 13 and 14.
Figure 13: Welch periodogram of the signal and the estimated AR(12) spectrum. Figure 14: The corresponding pole-zero plot.
Again, as with Yule-Walker, the signal with notes a semitone apart is examined (page 21). When using the ModCov to estimate the PEP, the MDL, FPE, FSIC and AIC with penalty 2 all give order 21, which is approximately the correct order (20) for the noiseless case, but too low to model all the sinusoids in noise. Using penalty k=1 for the AIC with PEP_FB gives order 76, which successfully finds all the sinusoids with less than 5 Hz error; see fig. 15. This overestimation is typical for the AIC. On the other hand, using order 76 gives us five extra sinusoids, not in harmonic relation to those really existing. This shows that letting the order become too high quickly gives rise to spurious peaks, which have to be 'filtered' away by some means. It is clear that this method is not optimal. If we look at fig. 9 again, we see that order 51 was chosen for exactly the same signal using AIC(k=1) and PEP_YW. In fact, this combination seems to work fine for selecting the order to use with the ModCov, and was most often used when doing the transcription, mainly because the calculation using the Levinson algorithm is faster than the ModCov algorithm, and thus speeds up the analysis.
Figure 15: Welch periodogram of the signal and the estimated AR(76) spectrum.
Synthetic signals are useful since we have complete knowledge of the signal, but real instruments don't create perfect sinusoids, so it is interesting to test order estimation on real instrument samples.
Figure 16: Welch periodogram of the A4 flute signal and the estimated AR(38) spectrum.
Again an A4 was chosen, but this time from a flute. The number of datapoints was 308, which is about 25 ms at 11 kHz sample rate; this should be short enough to catch most of the notes. Fig. 17 shows the MDL(k=2) using PEP_FB, and looking at the periodogram in fig. 16, we see that the correct order is found again (or more precisely: the correct number of sinusoids compared with the number of peaks). The same holds for the CIC and FSIC, while the AIC, FPE and the eigenvalue-based methods in (16) and (17) overestimate the order.
Figure 17: MDL(k=2) using PEP_FB for the A4 flute (308 points), as a function of model order.
The order chosen for the model is twice the order estimated. This seems to be a good balance between the correct order, which is too smooth, and a too-high order, which gives spurious peaks. We see that one extra peak appears, which is tolerable. Experiments showed that model orders of 1.5 to 2 times the number of sinusoids gave good results.
Figure 18: Pole-zero plot, A4 flute, 308 points, ModCov, order 38.
Implementation of the transcriber
The structure of the transcriber
The transcriber WavToMid is completely implemented in Matlab; it takes a wav-file from the subdirectory ".\wav\" as input and produces a midi-file in ".\mid\" as output. The instrument type in the midi-file must be specified manually. For the moment, the program first finds all midi notes, then does the post-processing and writes the MIDI file.
The processing is done block-wise, with each block fixed to 250 samples (25ms). This is done to simplify the program, but it is obvious that better results can be obtained if a proper segmentation is implemented. The block size was chosen from the fact that at 200 BPM a sixteenth-note is 75ms long, so a 25ms window should be able to capture most of the changes in the music. Each block is tested for silence before it is passed to frequency estimation.
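As an illustration, this fixed segmentation can be sketched as follows (x is the signal returned by loadfile.m; silence_thr is a hypothetical energy threshold, not a name taken from the actual code):

    B = 250;                            % fixed block length, about 25ms
    nblocks = floor(length(x) / B);
    for n = 1:nblocks
        blk = x((n-1)*B+1 : n*B);       % extract one analysis block
        if sum(blk.^2)/B > silence_thr  % simple energy test for silence
            % ... order selection and frequency estimation on blk
        end
    end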
After the preliminary testing, it was clear that the Modified Covariance Method was the way to go for the frequency estimation. The frequencies are found by using Matlab's tf2zp.m and calculating the angles of the poles returned. Finding the roots of the polynomials implies some eigenvalue calculation; whether this can be avoided is unknown. Model order selection was most of the time done by AICYW with penalty k=1 because of the speed advantage, but MDLFB with penalty 2 gives a closer estimate of the number of sinusoids and can in some cases give better results.
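The pole-angle step can be sketched like this (a simplified stand-in for coeff2freq.m; roots.m is used here in place of tf2zp.m, but both amount to an eigenvalue computation):

    function f = coeff2freq_sketch(a, fs)
    % Convert AR coefficients [1 a1 ... ap] to frequency estimates in Hz.
    p = roots(a);                       % poles of the AR model
    p = p(imag(p) > 0);                 % one pole per complex-conjugate pair
    f = sort(angle(p) * fs / (2*pi));   % pole angles -> frequencies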
After the frequency estimation, harmonic series are searched for in order to find potential fundamental frequencies, which are then converted to the corresponding midi-numbers. At the moment only monophonic music is supported, but an extension to polyphony should be straightforward. One problem is to decide which note belongs to which instrument.
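The conversion from fundamental frequency to midi-number follows equation 3 with A4 = 440 Hz at MIDI number 69; a minimal sketch of what freq2mid.m computes:

    function m = freq2mid_sketch(f)
    % Map a fundamental frequency in Hz to the nearest MIDI number.
    m = round(69 + 12 * log2(f / 440));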
When the notes are determined, some 'top-down processing' is done. Pauses and notes that are too short are removed. Reverb is removed by trying to detect whether a note continues to play while another note is present. Finally, the midi numbers are converted to the binary MIDI file format and written to disk. The relations between the different tempo parameters used in the MIDI specification were not completely understood, so changing the analysis block size is not possible without some manual adjustment of these parameters.
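For illustration, the removal of too-short notes could look like the following (the names notes and min_blocks are hypothetical and not taken from the actual code):

    % Keep only notes lasting at least min_blocks analysis blocks.
    keep  = find([notes.nblocks] >= min_blocks);
    notes = notes(keep);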
The different modules in the transcriber are shown in fig.19. The shaded boxes are shared with process.m used in the initial testing, while the hatched box is non-essential for the program: it is just an early pitch-estimator calculating the most frequently occurring distance between the frequencies found in the Fourier spectrum, and it is only used in the time-frequency plot of the data. The numbers in the boxes indicate the order in which the functions are called. All parameters not set interactively can be found in the main script, WavToMid.m.
Figure 19: Module structure of the transcriber – WavToMid.m with the modules loadfile.m, orderselect.m, ar_cov.m, coeff2freq.m, freq2mid.m, most_freq.m, fixed2var.m and midiwrite.m (the numbers in the original diagram give the call order).
Transcribing real music
Since flute samples were used throughout the preliminary testing, a short flute solo (flute5.wav) was chosen as the reference transcription clip. It is recorded at 11025Hz with 8 bits resolution, so the quality is average, with not too many partials present and a small amount of reverb. The tempo is modest, with the shortest notes being around 260ms. To test the ability to transcribe quicker passages, some seconds of Bach's "Badinerie" were used (flute00.wav). Here the shortest note is about 100ms, and there is almost no reverb present. Another quick passage with a lot of reverb was tested (bachfast11.wav): Bach's "Partita in A Minor". A number of other small clips with different instruments were also tested to see how sensitive the transcriber was to the instrument used.
Flute5.wav – a simple flute solo
Figure 20: Spectrogram of flute5.wav.
Looking at the spectrogram in fig.20 we could expect to find three to five harmonics, and a visual comparison between this spectrogram and the time-frequency plot of the music clip was used as a benchmark when testing the order selection criteria. Additionally, when a MIDI file was made, the wav-file and the midi-file were compared by listening.
In all of the following time-frequency plots, the blue circles indicate the frequencies found from the angles of the poles, while the red crosses are estimations of the pitch calculated with most_freq.m and are not used in the written MIDI-file.
Testing the different order selection criteria showed the same as the one-note testing: when using PEPYW and directly using the order found, we have to use AICk=1 to avoid underestimation. In fact, this seems to be the most practical setting, as it performs very well for different types of instruments and tempos. In fig.21 we see the time-frequency plot with this setting; it is not too far from the real spectrum, and the conversion to MIDI format is perfect. We see, however, that there are some spurious poles where the frequencies are changing. This effect could be reduced if the analysis windows were dynamically adjusted.
Figure 21: flute5.wav – AICk=1, PEPYW, order = 1 x estimated (frequency in Hz versus block number).
A simpler approach is to 'smooth' the order selection curve, taking the average of the estimated order and some of the preceding orders. In fig.22 we see the result of averaging with the two preceding orders, and the non-continuous areas are better estimated. However, this solution is probably not a good choice for polyphonic music, since the orders will change more rapidly; smoothing would then lead to instruments not being detected at once because the order is too low. In this project the method seems to work fine, and it was used most of the time.
Figure 22: flute5.wav – AICk=1, PEPYW, order = 1 x estimated, smoothed.
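A minimal sketch of this smoothing, assuming the estimated orders for all blocks are collected in a vector (the averaging length is a parameter in practice):

    function s = smooth_orders_sketch(orders)
    % Average each block's estimated model order with the two
    % preceding estimates to stabilise the order track between blocks.
    s = orders;
    for n = 3:length(orders)
        s(n) = round(mean(orders(n-2:n)));
    end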
Looking at the spectrum in fig.1 again, we see that the over-harmonics get weaker and weaker, and they are often modelled too far from the unit circle to be chosen. This phenomenon is also present in speech, where a 'pre-flattening' filter that emphasises the higher frequencies is sometimes used [Picone93]. This filter is most often a one-tap FIR filter with a in [-1,-0.4]. The problem with this filter is that the noise is boosted as well. In fig.23 we see just a minor improvement using a=-0.85. The gains might be higher when dealing with string-based instruments, where the over-harmonics tend to die out quickly.
Figure 23: flute5.wav – AICk=1, PEPYW, order = 1 x estimated/smoothed, with pre-flattening.
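The pre-flattening filter is the standard one-tap pre-emphasis y(n) = x(n) + a*x(n-1); a sketch with the value used above:

    function y = preflatten_sketch(x, a)
    % One-tap FIR filter emphasising the higher frequencies [Picone93].
    if nargin < 2, a = -0.85; end       % value used in the text
    y = filter([1 a], 1, x);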
Using the method that seemed successful in the one-note case on page 24 (with four times the number of sinusoids) misses many of the over-harmonics. That is simply because the signal-to-noise ratio (SNR) is worse in this case, and the number of poles has to be increased further in order to cope with the noise. The SNR will without doubt be important when adding even more instruments and thereby increasing the order. Using a higher sampling rate (and possibly lowpass filtering to obtain the same signal bandwidth as before) will probably help, since we then have more datapoints on which to base our modelling.
Figure 24: flute5.wav – MDLk=2, PEPFB, order = 2 x estimated.
This need to adjust the order according to the noise level is not good for our goal of creating an automatic transcriber. Perhaps an estimate of the SNR could be calculated, and from that a multiplication factor for the number of sinusoids could be selected, giving the best order to use. This might require the PEPFB in order to have a precise estimate of the number of sinusoids.
If we knew the average number of sinusoids and the SNR, we could of course use a fixed order with some success. Alternatively, the order could be estimated for bigger blocks, speeding up the analysis.
Figure 25: Fixed order – AR(32).
Figure 26 shows the output of a modified version of the transcriber, using the correlation-based pitch tracker described on page 13. Even though the calculations are optimised to use test sinusoids that are as short as possible, the analysis is rather lengthy, and using segmentation to reduce the number of blocks would help here as well. The resulting MIDI file was perfect.
This method will not be able to distinguish harmonics in the same frequency band. The only hope is to apply signal models and compare the correlation values found with those expected, assuming overlapping harmonics where the value found is higher than expected.
Figure 26: Time-frequency plot from the correlation-based pitch tracker.
Flute00.wav – A quicker flute solo
This clip is quicker than the foregoing, with no reverb and more harmonics. However, it proved to be a bit difficult for the transcriber. We see that the time-frequency plot is not too far away from the spectrogram, and the conversion to MIDI format is not too bad either. The problems that arise when the tempo is increased appear to be more related to the lack of segmentation and to the post-processing of the notes found. Both PEPYW with AICk=1 and PEPFB with MDL or CIC work fine.
Figure 27: Spectrogram of flute00.wav.
Figure 28: flute00.wav – AICk=1, PEPYW, order = 1 x estimated.
Some different sound files
Figure 29: clarinet example.wav – AICk=1, PEPYW, order = 1 x estimated/smoothed.
Fig.29 shows an example file from AudioWorks, this time with a clarinet. Many more harmonics are present, and the chosen order is between 21 and 67. For such a file a fixed order would not work well. A spectrogram of this clip shows weak even harmonics, which is typical for a clarinet. The transcription is on par with the result from AudioWorks' own transcriber.
Using PEPFB in this case is painfully slow, since the maximum allowable order must be set to 70. It is clear that an analysis window matched to each note must be used, since that would reduce the number of order estimations from about 470 to 35 in this example.
The clip bachfast.wav is another flute solo with a lot of reverb. Energy from up to four notes can be observed simultaneously. This leads to high model orders, and to higher demands on the post-processing to eliminate the echo. Both the spectrogram and the time-frequency plot are cluttered, but the result of the transcription is not too bad and is still on par with the AudioWorks transcriber. Such heavy reverb is a problem for the autoregressive methods, since the number of sinusoids explodes. If such files are expected to be converted successfully, some sort of echo cancellation should be employed.
Figure 30
Figure 31: bachfast.wav – CICk=3, PEPYW, order = 1.5 x estimated.
Oboe.wav, being of modest tempo and
order, was converted easily.
Some final words
Limitations of the transcriber
All of the music clips tested share some common characteristics:
1. They are all created from tube resonators, which means that all harmonics 'live' equally long.
2. They have a minimum frequency that is not too low, which means a limited number of harmonics.
The reason for omitting piano and guitar music is twofold. Most importantly, these instruments are seldom played monophonically. Another aspect is the lower (possible) fundamental frequencies. Looking at fig.32, which shows an A0 from a grand piano, we see that there are a lot of harmonics and that the higher harmonics die out quickly.
Figure 32: An A0 from a grand piano.
The dying harmonics are not a problem, since the transcriber needs only one harmonic to decide a note. The high number of harmonics is worse, especially if we are dealing with polyphony: it becomes harder to find the best order, and the frequency estimates will be less accurate. To remedy this, we have to increase the number of datapoints by increasing the sampling rate and/or segmenting the music to form bigger analysis blocks.
The program uses a fixed key-press loudness (velocity) for all notes. This is seldom the case for real music.
The duration of the notes is not rigorously respected, since the analysis is done with fixed blocks. Additionally, the relations between the timing/tempo parameters in the MIDI specification were not completely understood, so some files experience incorrect conversion with the standard setup. Some manual adjustment of the parameters fixes the problems.
The speed is also a problem. Reducing the number of calculations by reducing the number of analysis blocks is desired; in other words, segmentation is needed. Of course, critical parts could be coded in C.
Different noise levels are not accounted for. Since more noise implies higher AR model orders, some automatic adjustment of the model order according to the noise level should be employed if less hand-adjusting is desired.
Improvements for the transcriber and ideas for future work
No further work is planned in this project, at least not on a professional basis. However, some suggestions for further work are given.
Without doubt, the most important modification of the transcriber is to implement segmentation, allowing the analysis window to cover the whole note. This has several advantages:
1. More datapoints are available in the analysis window, leading to better frequency and order estimations.
2. Mixing two consecutive notes in the same analysis window is avoided.
3. The note duration will be respected in the MIDI representation.
4. The number of calculations can be reduced, since fewer order estimations are needed.
5. The attack of the note can be analysed to identify the instrument.
6. The relative loudness of each note can easily be determined.
Further work includes implementing point 6 above to respect each note's loudness, and making a GUI for easier testing of the different important parameters.
Some ideas that require a bit more research before integration into the transcriber are:
AR modelling in sub-bands. If D. Bonacci's research is successful, we would be able to do AR modelling in sub-bands. This would enable us to use many low-order models in place of one high-order model, possibly making the estimations more reliable (and faster).
Adaptive sequential algorithms. Many adaptive algorithms for spectrum estimation exist, updating the spectrum for every arriving data point and requiring only O(9m) multiplications instead of the usual O(p²) multiplications [Kalou87]. These algorithms could be used for frequency estimation in real time, possibly adding the benefits of detecting note changes and avoiding order estimations.
Directionality. A stereo signal usually contains information about spatial placement, and using for example MUSIC or Capon's method, this information could be used to suppress all but one instrument, thus improving polyphonic transcription. Frequency estimation could be done simultaneously.
Phase information. No work regarding the phase information in a music signal was found, not even work excluding the possibility. The idea is that if there is any relation between the phases of the harmonics created in the instrument, one could detect whether a harmonic is mixed with a harmonic from another instrument (having a different phase).
Conclusion
It has been demonstrated that AR frequency estimation is a plausible solution for transcribing music. The AR methods are able to provide higher resolution than their Fourier counterparts, something that is important when the number of simultaneous notes increases. The frequency estimates from Principal Component Frequency Estimation using the Modified Covariance Method are reliable even in the presence of noise, and even if some of the harmonics are completely masked. Fig.33 shows two flutes played simultaneously, where the C5 is completely masked. Having 1000 datapoints and using MDL with PEPFB gives almost the correct number of sinusoids, and using an order 4 times the number of sinusoids actually makes it possible to find the hidden C5. This is a promising result when considering polyphonic transcription.
Figure 33: Pole plot (real versus imaginary part) – C4+C5, 91ms, MDLk=2, PEPFB, 2x.
Order estimation is crucial for optimal performance. The best order is not equal to 2 x (number of real sinusoids), but somewhat higher, depending on the amount of noise. Using the PEPFB together with the information criteria MDL or CIC seems to give the best estimate of the number of sinusoids. Then using an estimate of the noise level to find a multiplier for the order found seems to be the way to go in order to cope with different SNRs. However, calculating all the orders up to the maximum allowable order for every block using the modified covariance method is slow. A faster and 'less correct', but well performing, method is utilised in the project, namely using the order found from AICk=1 with PEPYW. This works remarkably well even for different types of instruments.
A monophonic transcriber has been built, taking a wav-file as input and giving a MIDI-file as output. The program performs on par with the commercially available monophonic transcribers, and it has been built with a possible extension to polyphonic operation in mind.
A – Converting between Note, MIDI and Frequency
The A with MIDI number 21 is called A0, the next one A1, and so on; similarly for all the other notes. The table is created from equation 3, where the MIDI number equals k+69.
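The whole table can be reproduced from equation 3, with A4 = 440 Hz at MIDI number 69; for instance:

    m = 0:127;                        % MIDI numbers
    f = 440 * 2.^((m - 69) / 12);     % corresponding frequencies in Hz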
Note  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]
C       0     8.176     12    16.352     24    32.703     36    65.406     48   130.813     60   261.626
Db      1     8.662     13    17.324     25    34.648     37    69.296     49   138.591     61   277.183
D       2     9.177     14    18.354     26    36.708     38    73.416     50   146.832     62   293.665
Eb      3     9.723     15    19.445     27    38.891     39    77.782     51   155.563     63   311.127
E       4    10.301     16    20.602     28    41.203     40    82.407     52   164.814     64   329.628
F       5    10.913     17    21.827     29    43.654     41    87.307     53   174.614     65   349.228
Gb      6    11.562     18    23.125     30    46.249     42    92.499     54   184.997     66   369.994
G       7    12.250     19    24.500     31    48.999     43    97.999     55   195.998     67   391.995
Ab      8    12.978     20    25.957     32    51.913     44   103.826     56   207.652     68   415.305
A       9    13.750     21    27.500     33    55.000     45   110.000     57   220.000     69   440.000
Bb     10    14.568     22    29.135     34    58.270     46   116.541     58   233.082     70   466.164
B      11    15.434     23    30.868     35    61.735     47   123.471     59   246.942     71   493.883

Note  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]
C      72   523.251     84  1046.502     96  2093.005    108  4186.009    120   8372.018
Db     73   554.365     85  1108.731     97  2217.461    109  4434.922    121   8869.844
D      74   587.330     86  1174.659     98  2349.318    110  4698.636    122   9397.273
Eb     75   622.254     87  1244.508     99  2489.016    111  4978.032    123   9956.063
E      76   659.255     88  1318.510    100  2637.020    112  5274.041    124  10548.082
F      77   698.456     89  1396.913    101  2793.826    113  5587.652    125  11175.303
Gb     78   739.989     90  1479.978    102  2959.955    114  5919.911    126  11839.822
G      79   783.991     91  1567.982    103  3135.963    115  6271.927    127  12543.854
Ab     80   830.609     92  1661.219    104  3322.438    116  6644.875
A      81   880.000     93  1760.000    105  3520.000    117  7040.000
Bb     82   932.328     94  1864.655    106  3729.310    118  7458.620
B      83   987.767     95  1975.533    107  3951.066    119  7902.133

Table 4
B – Matlab code for the transcriber
References

[Bisc92]    C.H. Bischof, M. Shroff, "On Updating Signal Subspaces", IEEE Trans. Sig. Proc., Vol. 40, No. 1, 1992.
[Broer00]   P.M.T. Broersen, "Finite Sample Criteria for Autoregressive Order Selection", IEEE Trans. Sig. Proc., Vol. 48, No. 12, 2000.
[Brown99]   J. Brown, "Computer identification of musical instruments using pattern recognition with cepstral coefficients as features", MIT Media Labs, 1999.
[Dick94]    J.R. Dickie, A.K. Nandi, "On the Performance of AR Model Order Selection Methods", Signal Processing VII: Theories and Applications, 1994.
[Djuric96]  P.M. Djuric, "A Model Selection Rule for Sinusoids in White Gaussian Noise", IEEE Trans. Sig. Proc., Vol. 44, No. 7, 1996.
[Djuric98]  P.M. Djuric, "Asymptotic MAP Criteria for Model Selection", IEEE Trans. Sig. Proc., Vol. 46, No. 10, 1998.
[Durne98]   M. Durnerin, "Operation ASPECT", 1998.
[Feldman]   J. Feldman, "Derivation of the Wave Equation", http://www.math.ubc.ca/~feldman/apps/wave.pdf
[Fitch00]   J. Fitch, W. Shabana, "A Wavelet-based Pitch Detector For Musical Signals", University of Bath, UK.
[Fuchs88]   J.J. Fuchs, "Estimating the Number of Sinusoids in Additive White Noise", IEEE Trans. ASSP, Vol. 36, No. 12, 1988.
[Jehan97]   T. Jehan, "Musical Signal Parameter Estimation", http://www.cnmat.berkeley.edu/~tristan/Thesis/thesis.html, 1997.
[Kalou87]   N. Kalouptsidis, S. Theodoridis, "Fast Adaptive L-S Algorithms for Power Spectral Estimation", IEEE Trans. ASSP, Vol. 35, pp. 95-108, May 1987.
[Karaj99]   M. Karjalainen, T. Tolonen, "Multi-Pitch and Periodicity Analysis Model for Sound Separation and Auditory Scene Analysis", http://citeseer.nj.nec.com/411704.html, 1999.
[Kashino95] Kashino, Nakadai, Kinoshita, Tanaka, "Organization of Hierarchical Perceptual Sounds", http://citeseer.nj.nec.com/27731.html, 1995.
[Kashino98] Kashino, Nakadai, Kinoshita, Tanaka, "Application of Bayesian Probability Network to Musical Scene Analysis", http://citeseer.nj.nec.com/kashino98application.html, 1998.
[Kay88]     S.M. Kay, "Modern Spectral Estimation, Theory & Application", Prentice Hall, 1988.
[Klapuri01A] A. Klapuri, "Means of Integrating Audio Content Analysis Algorithms", 2001.
[Klapuri01B] A. Klapuri, "Multipitch Estimation And Sound Separation By The Spectral Smoothness Principle", 2001.
[Klapuri98] A. Klapuri, "Automatic transcription of music", http://www.cs.tut.fi/sgn/arg/music/klapthes.pdf.zip, 1998.
[Lee92]     H.B. Lee, "Eigenvalues and Eigenvectors of Covariance Matrices for Signals Closely Spaced in Frequency", IEEE Trans. Sig. Proc., Vol. 40, No. 10, 1992.
[Mallat99]  S. Mallat, "A Wavelet Tour of Signal Processing", Academic Press, 1999.
[Martin96]  K. Martin, "Automatic Transcription of Simple Polyphonic Music: Robust Front End Processing", Third Joint Meeting of the Acoustical Societies of America and Japan, ftp://sound.media.mit.edu/pub/Papers/kdm-TR399.ps.gz, 1996.
[Martin98]  K. Martin, "Musical instrument identification: A pattern-recognition approach", 136th meeting of the Acoustical Society of America, 1998.
[McNab]     R.J. McNab, L.A. Smith, I.H. Witten, C.L. Henderson, S. Jo Cunningham, "Towards the Digital Music Library: Tune Retrieval from Acoustic Input", University of Waikato, Hamilton, New Zealand, 1996.
[Picone93]  J.W. Picone, "Signal Modeling Techniques in Speech Recognition", Proc. of IEEE, Sept. 1993, pp. 1215-1247.
[Proakis]   J.G. Proakis, D.G. Manolakis, "Digital Signal Processing: Principles, Algorithms, and Applications", Prentice Hall, 1996.
[Schro00]   T. von Schroeter, "Auto-regressive spectral line analysis of piano tones", 2000.
[Terhardt]  E. Terhardt, "Psychoacoustics related to musical perception", http://www.mmk.ei.tum.de/persons/ter.html
[Vercoe97]  B.L. Vercoe, W.G. Gardner, E.D. Schreirer, "Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations", Proc. of IEEE, May 1998, pp. 922-940.
[Wax85]     M. Wax, T. Kailath, "Detection of Signals by Information Theoretic Criteria", IEEE Trans. ASSP, Vol. 33, No. 2, 1985.
[WolfeA]    J. Wolfe, "The University of New South Wales, Australia – Music acoustics group", http://www.phys.unsw.edu.au/music/
[WolfeB]    J. Wolfe, "How harmonic are harmonics", http://www.phys.unsw.edu.au/~jw/harmonics.html