7.2 Analysis - eriksvensson

Transcription

Examensarbete (degree project)
LITH-ITN-MT-EX--04/044--SE

Semi-automatic Music Creation using the Continuous Wavelet Transform and Markov Chains

Thomas Björkvald & Erik Svensson

Degree project in Media Technology, carried out at Linköping Institute of Technology, Campus Norrköping
Department of Science and Technology (Institutionen för teknik och naturvetenskap)
Linköpings Universitet, SE-601 74 Norrköping, Sweden

Supervisor: Niklas Rönnberg
Examiner: Björn Kruse
Norrköping, 2004-05-28
Date: 2004-05-28
Division, Department: Department of Science and Technology (Institutionen för teknik och naturvetenskap)
Language: English
Report category: Examensarbete (degree project)
ISRN: LITH-ITN-MT-EX--04/044--SE
URL for electronic version: http://www.ep.liu.se/exjobb/itn/2004/mt/044/
Title: Semi-automatic Music Creation using the Continuous Wavelet Transform and Markov Chains
Authors: Thomas Björkvald & Erik Svensson
Abstract
A common opinion these days is that every song on the radio sounds exactly the same. It is almost as if they were algorithmically made from standardised templates. But could an automated artificial composer really compete with the human alternative? This thesis seeks the answer to
this question, and describes the implementation of a semi-automatic composer application. The
approach is to imitate the composing style of humans through analysis of the characteristics of
existing sound material and synthesis using the statistics obtained in the analysis.
Important aspects of the work include deciding which information is possible and realistic to
extract from sound files, and a reliable statistical model allowing the synthesis to produce unique
results using only the properties of the input(s).
Classic Fourier analysis presents a straightforward way of determining important characteristics
of signals. More recent techniques, such as the wavelet transform, offer new possibilities in the
analysis, taking the complex research field of sound and music to another level. Markov chains
provide a powerful tool for identifying structural similarities of patterns, which can be used to
synthesise new material.
Sammanfattning
A common opinion is that every song played on the radio sounds exactly the same, almost as if they were made algorithmically from ready-made templates. But could an automated artificial composer seriously compete with the human alternative? This degree project aims to answer this question, and includes the implementation of a semi-automatic application for composition. The approach is to imitate human composing by analysing existing sound material and using its properties for synthesis.
Important questions include which information is possible and realistic to extract from sound files, and how a reliable statistical model, allowing the synthesis to produce unique results based solely on the input data, can be created.
Classic Fourier analysis provides ways of finding important characteristics in signals. More recent techniques, such as the wavelet transform, offer new analysis possibilities and take the complex research field of sound and music to new levels. Markov chains are a powerful tool for identifying structural similarities in patterns, which can be used in synthesis.
Keywords
Music analysis, music synthesis, digital signal processing, continuous wavelet transform, Fourier
analysis, Markov chains, human behaviour.
Table of contents

1 Introduction
  1.1 Background
  1.2 Thesis objectives
  1.3 Nature of the thesis
  1.4 Problems
  1.5 Thesis outline
2 Previous work
  2.1 Introduction
  2.2 Analysis
  2.3 Synthesis
  2.4 Conclusion
3 Method
  3.1 Introduction
  3.2 Analysis
  3.3 Synthesis
  3.4 Conclusion
4 Brief music theory
  4.1 Introduction
  4.2 Tones and octaves
    4.2.1 Tone frequency calculation
  4.3 Frequency ranges
  4.4 Harmonics
  4.5 Chords and scales
    4.5.1 Scales
    4.5.2 Some simple rules
    4.5.3 Chord table
  4.6 Conclusion
5 Analysis
  5.1 Introduction
  5.2 Digital Signal Processing (DSP)
    5.2.1 Time and frequency domain
    5.2.2 Sampling issues
    5.2.3 Aliasing
    5.2.4 Filtering
    5.2.5 Convolution
    5.2.6 Filtering in the frequency domain
    5.2.7 Compressor
  5.3 Frequency analysis
    5.3.1 Fourier Transform (FT)
    5.3.2 Discrete Fourier Transform (DFT)
    5.3.3 Fast Fourier Transform (FFT)
    5.3.4 Time-frequency problem
    5.3.5 Short Time Fourier Transform (STFT)
    5.3.6 Resolution problem
  5.4 Wavelets
    5.4.1 Wavelet history
  5.5 The wavelet theory
    5.5.1 Continuous Wavelet Transform (CWT)
    5.5.2 CWT in the frequency domain
    5.5.3 Visualisation
    5.5.4 Discretisation of the CWT
    5.5.5 More sparse discretisation of the CWT
    5.5.6 Sub-band coding
    5.5.7 Discrete Wavelet Transform (DWT)
    5.5.8 Wavelet families
    5.5.9 Conditions for wavelets
    5.5.10 Wavelets and music
    5.5.11 The Morlet wavelet
  5.6 Conclusion
6 Synthesis
  6.1 Introduction
  6.2 Markov chains
    6.2.1 Statistical model
    6.2.2 Markov chain example
  6.3 Artificial Intelligence (AI)
    6.3.1 The AI research field
    6.3.2 Simulating human behaviour
    6.3.3 AI and music
  6.4 MIDI
    6.4.1 MIDI history
    6.4.2 The MIDI commands
    6.4.3 Standard MIDI file format
    6.4.4 MIDI file example
  6.5 Conclusion
7 Implementation
  7.1 Introduction
  7.2 Analysis
    7.2.1 CWT analysis
    7.2.2 Improving performance
    7.2.3 Fourier spectra
    7.2.4 Normalisation and compressor usage
    7.2.5 Downsampling
    7.2.6 Octave-wise analysis
    7.2.7 Binary threshold
    7.2.8 Holefilling
    7.2.9 Event matrix
    7.2.10 Storing the results
  7.3 Synthesis
    7.3.1 Markov model
    7.3.2 Prefix length
    7.3.3 Tone sequence analysis example
    7.3.4 Creation of new tone sequences
    7.3.5 Controlling the characteristics of the output
  7.4 MIDI representation
    7.4.1 Writing the MIDI format
  7.5 Application process flow
  7.6 Conclusion
8 Results
  8.1 Analysis
  8.2 Synthesis
  8.3 Application screenshots
9 Conclusions
  9.1 Problem formulations
  9.2 Limitations
    9.2.1 No storing of simultaneous tones
    9.2.2 No instrument identification or separation
    9.2.3 MIDI for playback
    9.2.4 No beat detection
    9.2.5 Combining different inputs
  9.3 Thesis separation
  9.4 Artificial intelligence aspects
  9.5 Music theory aspects
  9.6 Final comments
10 Future work
  10.1 Improving performance
  10.2 Improving the features of the analysis
  10.3 Extending the statistical model
  10.4 Enhancing the realism of the synthesis
11 Closing thoughts
12 Bibliography
  12.1 Literature
  12.2 Web
1 Introduction
This thesis has been planned and performed by two students currently finalising their M.Sc. in
Media Technology at Linköping University, Sweden. The education combines classic engineering
courses with more modern areas, such as computer graphics and digital video. While neither of
us has any major musical knowledge (besides strumming out the odd guitar chord now and then),
the one thing we do have in common is our compulsive obsessive record collecting and interest
in all things related to the music scene, be it concerts, gossip or reviews.
1.1 Background
In the world of media technology, the emphasis is mostly put on computer graphics in one way
or another. Therefore, a deeper insight into the research area of sound and music is desirable.
Some of the methods used in computer graphics, for example the texture synthesis theories in
the image-based rendering field, would be interesting to apply in sound-related areas. The concept of Markov chains has been used to statistically generate new note sequences from well-known
symphonies. A natural development of these principles is to combine them with signal processing
and analysis theories, and thereby be able to use recorded music as a source for the Markov chain
(or any other suitable statistical method).
1.2 Thesis objectives
The purpose of this thesis is to generate new music from existing sound material by using frequency analysis and the statistical properties of the analysed information. A simple database containing characteristics from the (several) input sources has to be built. The desired characteristics
are for example beat and tonal sequence. A decision engine will then produce new music using
statistical methods. What this means is that the output will in fact be based on the characteristics
of all the different inputs, but will still be something completely unique.
This work could be reconnected with the area of computer graphics by synthesising "mood" music for virtual environments without having to hire a composer or buy expensive copyrighted material. However, the most obvious usage of these ideas would be to create a program where a great number of "super-hits" are used as input and a new, completely original multi-platinum hit single is the output.
Another aspect of this thesis is the challenge of creating an artificial composer without a genuine
knowledge of music theory. Can the result be enjoyable music or will the engineering approach
prove itself inadequate?
1.3 Nature of the thesis
While the dream result is a software application good enough to compete with the most acclaimed songwriters of today, consideration has to be given to the extreme complexity of sound
and music. Even the simplest piano tune holds a vast wealth of physical properties. Due to this,
and the lack of well-known projects of a similar kind, the thesis has to be viewed as an experiment with no given end product. It is not the final result that is important, but rather the insight into an interesting area of science gained during the process. After all, this is the true essence of the engineering spirit. That being said, the plan is of course to fulfil the thesis objectives, even if that means simplifying them.
1.4 Problems
There are three important aspects that will be considered in this thesis:
What sort of information is possible to extract from an arbitrary piece of recorded music?
The term “information” here includes musical characteristics like tone, tempo etc. A natural first
step is to try to analyse a simple tonal sequence, with just one instrument playing, one tone at a
time. The problem here is to isolate the fundamental frequencies, and identify the tones, in other
words the melody. Appropriate methods can then be used on a larger scale with more instruments playing simultaneously. This results in more complex soundscapes, where it will be necessary to find a way of distinguishing and getting rid of all redundant information (for example the
vocals).
How must the extracted information be represented to be storable?
A suitable representation form needs to be designed, in order to properly store the information.
Here it is absolutely necessary to decide in detail what is to be stored. Is it just the different tones
or sequences of tones? How should chords be treated? What about the characteristics that make a
certain instrument sound special (i.e. harmonics etc.)? Consideration also has to be given to the
aspects of time signature and tempo, should they be stored at all, or should they be decided
manually when combined? Issues here could be for example if two input melodies have different
time signature and/or tempo. Should they be combined at all in this case?
How can the stored information be used to synthesise new material?
The information must be analysed using statistical methods, and combined with a decision engine
to create new patterns of music. The idea implies that the more input there is, the more original the results will be, since the decision engine will have more information to work with.
1.5 Thesis outline
This thesis consists of two major threads: sound analysis and sound synthesis, which are clearly
separated in each chapter. The sound analysis parts consider the aspects of extracting information
from sounds produced by musical instruments. The sound synthesis parts then deal with using
this extracted information to produce new music.
To separate the theories and methods used from how the practical problems were actually solved,
all theories used are explained thoroughly in chapters 4 through 6. The implementation chapter 7
describes how the theories were used in practice.
The result and conclusion chapters 8 and 9 sum up the knowledge obtained along the way, and
address the original thesis objectives and problems declared in sections 1.2 and 1.4, respectively.
2 Previous work
2.1 Introduction
In order to properly understand the thesis problems and what to make of them, a thorough study
of previous work in the fields of sound analysis and synthesis has to be done. Of interest are not
only books and articles, but also websites and software.
2.2 Analysis
Commercial “wav2midi”-software has been available for some time. The purpose of this type of
application is to analyse sampled music in wave-format, and then convert it into MIDI. An example is Recognisoft’s Solo Explorer. This application, along with several other similar programs, has
proved to be quite successful in completing its task. However, as the name "Solo Explorer"
implies, the program only works for a melody line played by a single instrument. [30]
Jehan, a Ph.D. candidate at the MIT Media Laboratory, has written a thesis about the
analysis and synthesis of musical signals. The emphasis here is put on segmentation and frequency estimation. The term segmentation refers to the analysis of so-called musical events,
which is a property that can describe a number of features of the musical tone. For instance, it
can appear as a vowel change in a sung melody, or the percussive sound of a drum. By analysing
the musical events, a lot can be said about the music. Jehan explains two different methods for
the segmentation part; a Fourier-based frequency analysis, which involves normalisation of the
energy of the signal, and a statistically based method. The frequency estimation part of Jehan’s
thesis involves filtering the signal with wavelet-based filters, and then investigating the zero-crossings in order to calculate the frequency. [3]
Chapman, who wrote his Ph.D. thesis at the Meteorology Department at the University of
Reading, has created a small Matlab-based program called Multiana. This program was originally
created for determining the melodies of a guitar and a harmonica in a short piece of music, using
wavelets. It plots the result in a graph containing five octaves of notes, and the resulting coefficients of the wavelet analysis. This way, it graphically visualises the melodic content of the music
signal. [12]
In his B.Sc. thesis for the University of Sheffield, Self describes the wavelet transform in depth,
and how it can be used for music analysis purposes. He also explains how a visualisation of the
transform result can be created, in what is usually called scalograms. [11]
In their article for SIAM Review, Alm et al. describe several different approaches for analysing
the content of sounds produced by musical instruments. The most promising technique provided
is the use of the Continuous Wavelet Transform (CWT) for calculating the scalogram of sound. Alm
et al. also introduce the use of complex wavelets, which because of their oscillatory nature have
similar shape to sound waves from common harmonic instruments. [1]
Matlab's Wavelet Toolbox has a built-in function for performing the CWT analysis, described in the
documentation [21]. From a sound and particularly musical analysis point of view, it is the possibility to use complex Morlet wavelets directly on a signal with just a couple of lines of code that is
the most appealing feature of the Wavelet Toolbox [20].
2.3 Synthesis
On his webpage, Kesteloot presents a very interesting example of the possibilities of using
Markov chains to synthesise written language. The application takes two texts of different languages as input, and builds one Markov chain for each of the texts. By using the probabilities of
the analysed inputs, and interpolating the generated output from one of the Markov chains to the
other, the result is a completely unique text, switching from looking like the first language to
looking like the second one. The application also involves choosing a so-called prefix length, which
decides how many letters are grouped together in the input analysis. A high value results in realistic outputs, while small prefix numbers generate “mumbo-jumbo” sentences. [16]
Another interesting application of Markov chains is the SynText plugin for Microsoft Word. This
plugin creates Markov chains from text documents, and can be used to find reoccurring patterns
and repetitions. [25]
Various people have also used Markov chains to compose musical patterns. Stanford alumni
Maurer presents a brief summary of the field of algorithmic composition on his webpage [24].
In a project report for the University of Derby, Mosley describes an application that correlates
an ordinary sine wave at different frequencies with the waveform being analysed. It is in fact a
simplified version of the idea of wavelets he uses – without ever mentioning the word “wavelet”.
However, the most interesting part of Mosley’s thesis is the MIDI writing routine he provides.
Especially useful are his methods for converting time (in samples) into terms of clicks per quarter
note and tempo, and rewriting this time information into binary format. Both these transformations are necessary for the notes to be played correctly by the MIDI player of choice. [6]
2.4 Conclusion
Evidently, research has been done in the areas of music analysis and synthesis. A common denominator, however, is not as obvious. Consequently, effort has to be put into deciding which of
the relevant techniques are to be studied more closely.
3 Method
3.1 Introduction
In order to fulfil the thesis objectives, the separate parts each needed a suitable solution to their
problems. After all, this is an experimental thesis with no given starting points. Thus, the study
and selection of relevant material was a crucial part of the work. This became particularly important when the chosen theories were to be turned into code and had to be used throughout the
whole implementation process. Changing theory in some part of the code along the way could in
fact result in having to rewrite all of it.
3.2 Analysis
The intention of the analysis part of the thesis was to analyse sampled audio data and determine
the notes played in the music file. The wav2midi-type of software, made for this exact purpose,
was therefore clearly interesting at first. But since all of these programs were commercial products, the source code or even a brief explanation of the ideas behind them was more or less impossible to find. Moreover, these applications generally only worked for a melody line played by a
single instrument. This was not acceptable as it seriously limited the thesis objectives. Therefore,
the wav2midi-principle was quickly abandoned.
Multiana presented the concept of wavelets and a phenomenon referred to as multi-resolution analysis, which became the basis of the analysis part of this thesis. In particular, the drawing function
of the program, plotting the content of the music for each and every note ranging over five octaves, was impressive. Sadly, the code for Multiana was far from efficient, but at least it clearly
showed the possibilities and power of complex wavelets in the area of music analysis.
With this new way of looking at the analysis problems, the focus was put on studying the theoretical nature of the wavelet analysis, how it differed from Fourier analysis, and especially how it
could be used to analyse the sampled audio data. The most important feature of this form of
analysis was that unlike the traditional Fourier analysis, which simply identified the frequency
content of a signal without localising it in time, it also involved the time aspect. All things considered, wavelets were the obvious choice of method for the sound analysis.
3.3 Synthesis
The use of wavelet analysis was chosen after a thorough study of the research field of music
analysis. Markov chains on the other hand, were more or less decided upon from the very beginning, when this thesis was merely an idea toyed with in the back of the authors’ minds. Ever since
hearing about interesting projects involving Markov chains, the opportunity to learn more about
this technique had been desired. Experimenting with various Markov-based applications further
strengthened this gut feeling.
3.4 Conclusion
Even if engineering techniques such as wavelets and Markov chains could be decided upon, they
were not of much use without knowing what to look for in the analysis. These somewhat general
methods needed to be used in conjunction with classical music theory.
4 Brief music theory
4.1 Introduction
For a thesis of this nature, a brief insight into the fascinating world of music was essential. Even
though the work was performed in an engineering spirit, it still involved simple rules and features
of traditional western music. The concept of naming tones and octaves comes down to the fact
that all musical sounds are about frequency. Koniaris’ Understanding Notes and their Notation webpage [18] influenced much of the discussions in this chapter.
4.2 Tones and octaves
The tonal system is divided into octaves and semitones. It is usually claimed that eleven octaves are sufficient to cover the range of human hearing, where each octave consists of twelve semitones. The frequency range of an octave is between f and 2f, meaning that the same semitone in the next octave has exactly double the frequency of the current one. The frequency of a tone being played is called the fundamental frequency.
In order to name the semitones in a distinguishable way, the octave number is written directly
after the corresponding semitone. For instance, an A-tone in the fourth octave would be denoted
“A4”.
Originally, western music was defined as consisting of seven distinct notes: C, D, E, F, G, A, and
B (or H, as it is referred to in some European countries). How does this fit with the twelve-semitone octave system mentioned above? The answer is quite simple: over time it was realised that more semitones were needed, to allow pieces to be transposed to match the limited vocal ranges of singers. To avoid having to rewrite all existing sheet music, the new semitones, located in between the original tones, were denoted with "#"  (pronounced "sharp") or "b"¹ (pronounced "flat"). The "#" was used to raise a tone, while the "b" was used to lower it. This means that "A#" and "Bb" are in fact the same semitone.
Semitone   Notation
0          C
1          C# or Db
2          D
3          D# or Eb
4          E
5          F
6          F# or Gb
7          G
8          G# or Ab
9          A
10         A# or Bb
11         B or Cb
Table 4.1: Semitone order.
¹ "b" is actually not an entirely correct notation, but rather a "computerised" version of the real sign for flat notes.
4.2.1 Tone frequency calculation
With the octave system it is not difficult to calculate the frequency of any given tone. A formula
can be defined as in (4.1).
f = f_0 \cdot 2^{\mathrm{note}/12}     (4.1)

f_0 is the frequency of the C0 note (~16.3516 Hz) and note is the index of the current semitone (the index of f_0 is zero). Using this formula, any given tone can be expressed as a frequency.

To exemplify this, consider the A4 tone. With the index system, this semitone has number 57. Formula (4.1) then yields f = 16.3516 \cdot 2^{57/12} \approx 440 Hz.
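As a small illustration of equation (4.1), the sketch below computes semitone indices and frequencies. It is illustrative only; the note-name table and helper functions are assumptions for the example and are not taken from the thesis implementation.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
F0 = 16.3516  # frequency of C0 in Hz, as in equation (4.1)

def note_to_frequency(index: int) -> float:
    """Frequency of the semitone with the given index (C0 has index 0)."""
    return F0 * 2 ** (index / 12)

def name_to_index(name: str, octave: int) -> int:
    """Semitone index from a note name and octave, e.g. ('A', 4) -> 57."""
    return octave * 12 + NOTE_NAMES.index(name)

if __name__ == "__main__":
    idx = name_to_index("A", 4)                     # 4 * 12 + 9 = 57
    print(idx, round(note_to_frequency(idx), 1))    # 57 440.0
```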
4.3 Frequency ranges
With the human hearing usually divided into eleven octaves, it is interesting to examine in what
ranges popular instruments appear. The instrument spanning most of the spectrum is the piano, which can in fact appear all the way from the lowest (zeroth) to the eighth octave. While this offers great freedom when playing the instrument, it also makes it difficult to analyse, because it appears in regions of the frequency spectrum where other instruments reside. Looking at a standard-tuned guitar for instance, the lowest possible note is an E2 and the highest one is an E6. All of these notes can also be played on a piano. How is it possible to identify which of the instruments is actually playing a certain melody? The frequency ranges for some of the
most popular instruments are listed in Table 4.2.
Instrument    Lowest tone   Highest tone
Piano         A0            C8
Guitar        E2            E6
Bass guitar   B0            D4
Cello         C2            A6
Violin        G3            E7
Table 4.2: Frequency ranges of popular instruments.
4.4 Harmonics
The property that makes instruments sound different is the phenomenon of harmonics. When a
note is played, it is not just the fundamental frequency of the note that is heard, but rather a
combination of this frequency together with a unique number of other semitones, the harmonics.
Consider an E2 being played on a guitar. Besides the fundamental frequency of the E2 tone, the
following harmonics appear:
• +12 semitones (one octave higher, i.e. E3)
• +19 semitones (B3)
• +24 semitones (E4)
• +27.86 semitones (somewhere between G4 and G#4)
• +31 semitones (B4)
• +33.69 semitones (somewhere between C#5 and D5)
• +36 semitones (E5)
These notes, often the same semitone as the fundamental tone but in higher octaves, are played
simultaneously, regardless of what instrument is being used. What makes the instruments sound
different is the fact that the harmonics appear with different strengths depending on the instrument. The harmonics are always weaker than the fundamental, and they always appear in
higher octaves.
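A small numerical sketch (Python, illustrative only) makes the relationship concrete: the semitone offsets above are simply 12·log2(n) for the integer harmonics n = 2, 3, ..., and a tone can be synthesised by summing the fundamental and its harmonics. The amplitude values below are made up for the example, not measured instrument data.

```python
import math
import numpy as np

def harmonic_offsets(n_harmonics: int = 8) -> list:
    """Semitone distance of each integer harmonic from the fundamental: 12*log2(n)."""
    return [12 * math.log2(n) for n in range(2, n_harmonics + 1)]

def synthesise_tone(f0: float, amplitudes: list, fs: int = 44100, seconds: float = 1.0) -> np.ndarray:
    """Sum the fundamental and its integer harmonics, each with its own amplitude."""
    t = np.arange(int(fs * seconds)) / fs
    return sum(a * np.sin(2 * np.pi * f0 * n * t) for n, a in enumerate(amplitudes, start=1))

if __name__ == "__main__":
    print([round(s, 2) for s in harmonic_offsets()])
    # [12.0, 19.02, 24.0, 27.86, 31.02, 33.69, 36.0] -> matches the list above
    e2 = 16.3516 * 2 ** (28 / 12)    # E2 has semitone index 28, ~82.4 Hz
    tone = synthesise_tone(e2, [1.0, 0.5, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03])
```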
4.5 Chords and scales
Individual notes are nice on their own, but in “real” music they are often combined into chords.
Mugglin likens the musical language to the alphabet. The base is then the so-called scale. Each
note is defined as a letter, and by putting notes together from the scale, chords (words) are created. Next, the words are put together to form phrases (musical sentences). Just like with written
language, knowing words is not the same as knowing the language. A natural understanding of
how the words fit together is essential. [28]
4.5.1 Scales
There are many different scales, or languages of music. The simplest one is called the major scale.
It is obtained by playing a certain combination of tones:
{ start, tone, tone, semitone, tone, tone, tone, semitone }
“Tone” means skipping one semitone (and thereby playing the next full note), and “semitone”
means just that; playing the next semitone. By choosing starting note, the scale takes on different
shapes. The C major scale has the nice feature of consisting of all white keys in a keyboard octave, i.e. no sharps or flats (see Table 4.3).
Step:   start   tone   tone   semitone   tone   tone   tone   semitone
Note:   C       D      E      F          G      A      B      C
Table 4.3: C major scale.
Notice how the F tone is the semitone after the E tone, and how the C tone follows the B in the
same way.
For simplicity, the different notes of the scale are usually numbered, as in Table 4.4.
Note:     C   D   E   F   G   A   B   C
Number:   1   2   3   4   5   6   7   1
Table 4.4: Numbers of the major scale.
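The step pattern above can be applied programmatically to any starting note. The following sketch is illustrative only (it uses sharp-only note names, no flats) and is not part of the thesis implementation.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]   # "tone" = 2 semitones, "semitone" = 1

def major_scale(key: str) -> list:
    """Return the seven notes of the major scale starting on `key`, plus the octave."""
    idx = NOTE_NAMES.index(key)
    scale = [key]
    for step in MAJOR_STEPS:
        idx = (idx + step) % 12
        scale.append(NOTE_NAMES[idx])
    return scale

if __name__ == "__main__":
    print(major_scale("C"))   # ['C', 'D', 'E', 'F', 'G', 'A', 'B', 'C'] - Table 4.3
    print(major_scale("D"))   # ['D', 'E', 'F#', 'G', 'A', 'B', 'C#', 'D']
```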
From the major scale, one chord made up of several notes can be created for each note. The
chord based on note 1 (in the C major scale, the C note) is called the "one chord". Assigning Roman numerals to the different chords, a relationship table can be written (see Table 4.5).
Note:    1   2    3     4    5   6
Chord:   I   ii   iii   IV   V   vi
Table 4.5: Chords derived from the major scale.
The chord based on note 7 is a bit tricky, and left out of this brief chord theory. Some of the
chords are written with lowercase numerals, while others are uppercase. The lowercase chords are called minor chords, and their sound is often described as darker and sadder compared to the uppercase ones. In the same way, chord structures can be built for a variety of different scales.
Some examples of other scales are natural minor, harmonic minor and melodic minor.
4.5.2 Some simple rules
Mugglin states some simple rules for creating an enjoyable song:
• Start and end the song on the same note or chord. This establishes the song's beginning and end, giving the melody a more finished sound.
• Do not use too many chords. Many songs have been written using only three chords: I, IV, and V.
• Choose a key, which means deciding which note is note 1. For instance, if a D note is selected as note 1, the song is played in the key of D, and so forth. The selection of note 1 changes the content of the major scale.
4.5.3 Chord table
All chords I to vi for the different notes of a major scale, in any key, are summarised in Table 4.6.
Key    I    ii     iii    IV   V    vi
C      C    Dm     Em     F    G    Am
C#     C#   D#m    Fm     F#   G#   A#m
D      D    Em     F#m    G    A    Bm
D#     D#   Fm     Gm     G#   A#   Cm
E      E    F#m    G#m    A    B    C#m
F      F    Gm     Am     A#   C    Dm
F#     F#   G#m    A#m    B    C#   D#m
G      G    Am     Bm     C    D    Em
G#     G#   A#m    Cm     C#   D#   Fm
A      A    Bm     C#m    D    E    F#m
A#     A#   Cm     Dm     D#   F    Gm
B      B    C#m    D#m    E    F#   G#m
Table 4.6: Full chord table.
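The text does not spell out how each chord is built from its scale; assuming the usual triads stacked on every other scale degree (a chord is minor when its third lies three semitones above the root), the rows of Table 4.6 can be reproduced with a short illustrative sketch like the one below.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]
NUMERALS = ["I", "ii", "iii", "IV", "V", "vi"]

def chord_row(key: str) -> list:
    """Chords I-vi of a major key: a chord on each scale degree, 'm' when its third is minor."""
    idx = NOTE_NAMES.index(key)
    degrees = [idx]
    for step in MAJOR_STEPS:
        degrees.append((degrees[-1] + step) % 12)
    row = []
    for i, numeral in enumerate(NUMERALS):
        root, third = degrees[i], degrees[i + 2]
        minor_third = (third - root) % 12 == 3    # 3 semitones = minor chord (lowercase numeral)
        row.append(NOTE_NAMES[root] + ("m" if minor_third else ""))
    return row

if __name__ == "__main__":
    print(chord_row("C"))    # ['C', 'Dm', 'Em', 'F', 'G', 'Am']
    print(chord_row("D#"))   # ['D#', 'Fm', 'Gm', 'G#', 'A#', 'Cm']
```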
So, for each key there are six (in reality seven) different chords. How does one know which fit
and sound good together? Generally, the seven chords of a key sound quite good together, since
they originate from the same scale. However, Mugglin proposes a simple map, showing how
chord sequences can be chosen to sound natural. When listening to a song, one is constantly trying to guess which chord comes next. This fact can be used whilst composing; the listener wants
to be able to guess the next chord, but at the same time he/she wants to be surprised sometimes.
By using the map, and deliberately “fooling” the listener on occasions, playing a chord that is not
the natural choice of successor, the music created can be varied enough to be exciting, but at the
same time regular enough to be comforting.
4.6 Conclusion
The tone and octave system of western music is built entirely on the frequency characteristics of
tones. By naming twelve semitones in each octave, the human range of hearing can be represented. An important phenomenon is the concept of harmonics, making different instruments
sound different. Scales can be seen as the musical language, and by defining different scales,
chords and different ways of combining them can be derived. Without this knowledge, it is more
or less impossible to direct the analysis towards the desired results.
5 Analysis
5.1 Introduction
The purpose of the analysis part of the thesis is to examine sampled signals, in the form of music.
Thus, this section introduces more or less well-known theories relating to sound and signals: Digital Signal Processing (DSP), Fourier analysis and the wavelet theory. Much of the signal theory is based on Fundamentals of Signals and Systems Using MATLAB by Kamen et al. [4].
5.2 Digital Signal Processing (DSP)
The concept of signals and systems is a very general field of study, and its applications can be
found virtually everywhere, from home appliances to advanced engineering innovations. Sound is
an area well suited to incorporate the signal theory; speech as well as music can be described as
continuous signals. For every time instant the signal has a corresponding amplitude value.
Figure 5.1: 440 Hz signal; the A4 tone.
Continuous signals are sometimes referred to as analogue. However, in order to be able to work
with the analogue signals in a computer environment, the signals need to be sampled, yielding a
discrete version of the signal. Sampling an analogue signal simply means to pick out every n:th
value, thereby discretisising the time variable. Thus, the signal is no longer based on infinitely
many time values. The sampling rate states how many times per second the signal amplitude is
read, and is also called the sampling frequency.
Figure 5.2: Sampled signal.
5.2.1 Time and frequency domain
Traditionally, a signal can be described in two different domains: the time domain and the frequency
domain. The information given in the respective domains is the same, but presented in different
ways; the time domain consists of signal amplitude values for each time instant, while the frequency domain contains the magnitudes of all frequencies without time information.
Figure 5.3: Time (left) and frequency domain (right) representations of a 440 Hz sine signal.
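As a minimal sketch of the two domains (Python/NumPy, not from the thesis; the sampling frequency is an arbitrary choice), the 440 Hz tone of Figures 5.1-5.3 can be sampled and its magnitude spectrum computed with an FFT:

```python
import numpy as np

fs = 8000                                   # sampling frequency in Hz (illustrative choice)
t = np.arange(fs) / fs                      # one second of time instants
x = np.sin(2 * np.pi * 440 * t)             # time domain: an amplitude value for every sample

spectrum = np.abs(np.fft.rfft(x)) / len(x)  # frequency domain: magnitude of each frequency
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
print(freqs[np.argmax(spectrum)])           # 440.0 - the peak sits at the tone's frequency
```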
5.2.2 Sampling issues
The sampling process has one important limitation. If the signal has a highest frequency of f, it
needs to be sampled with a minimum frequency of 2f. This is called the sampling theorem, and the
minimum sampling frequency is often referred to as the Nyquist sampling frequency.
If the sampling theorem requirement is not met, i.e. if the sampling frequency is too low, the
signal cannot be correctly reconstructed from the sampled version. This is a phenomenon referred to as aliasing.
Figure 5.4: 1440 Hz signal sampled at 1000 Hz, which is below the Nyquist frequency of 2880 Hz. This results in the reconstructed signal having a 440 Hz frequency.
5.2.3 Aliasing
Figure 5.3 is not entirely truthful, because it only shows the first half of the frequency spectrum. When transforming a signal from the time domain to its frequency equivalent, half of the result is redundant information: all content above half the sampling frequency is a reflection of the content below it. When sampling with too low a sampling frequency, unwanted frequency components occur among the real frequencies. This is due to the reflection being "pushed" into the first half of the spectrum.
Figure 5.5: FFT spectrum without aliasing. The right peak is a reflection of the left, since the content is mirrored at half the sampling rate, i.e. 500 Hz. The sampling frequency is 1000 Hz, which is above the Nyquist frequency for this 440 Hz signal.
Figure 5.6: The frequency spectrum of the same signal as in Figure 5.5, overlapping and causing aliasing because the sampling frequency of 700 Hz is below the Nyquist frequency. A "ghost" frequency appears at ~265 Hz.
A classic example of aliasing is the "wagon wheel effect". Consider filming a spoked wheel spinning. If the camera registers the motion too slowly, i.e. with too low a sampling rate, the reconstructed film sequence will actually show a wheel that appears to be spinning backwards.
There are a couple of factors to consider in order to avoid aliasing:
• The sampling frequency needs to be equal to or higher than the Nyquist frequency.
• By lowpass filtering the signal, removing the frequency content higher than that of interest before the sampling, the aliasing effect can be decreased.
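The numbers in Figure 5.4 can be verified with a few lines of illustrative code (not thesis code): sampled at 1000 Hz, a 1440 Hz sine produces exactly the same sample values as a 440 Hz sine.

```python
import numpy as np

fs = 1000                                    # sampling frequency, below the Nyquist rate of 2880 Hz
n = np.arange(50)                            # sample indices
high = np.sin(2 * np.pi * 1440 * n / fs)     # the real 1440 Hz signal
alias = np.sin(2 * np.pi * 440 * n / fs)     # the 440 Hz "ghost" it collapses onto
print(np.allclose(high, alias))              # True: the two are indistinguishable after sampling
```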
5.2.4 Filtering
Transforming an input signal x(t) into an output signal y(t) is called filtering. The purpose of this
process can be, for instance, to remove certain unwanted frequencies of a signal. Keeping only the frequencies within a band of interest is called bandpass filtering. By removing all frequencies above a certain threshold, the signal is lowpass filtered. The opposite treatment, removing all frequencies below the threshold, is called highpass filtering.
Filtering can be performed in two different ways; in the time domain by convolution and in the
frequency domain by a simple scalar multiplication.
5.2.5 Convolution
The behaviour of a linear, continuous, time-invariant system with input signal x(t) and output
signal y(t) is described by the convolution integral (5.1).
y(t) = \int_{-\infty}^{\infty} h(\tau)\,x(t - \tau)\,d\tau, or in short, y = h \otimes x     (5.1)

Here, the signal h(t) is the filter.
When working with discrete-time systems, as with sampled signals on computers, the convolution integral becomes a convolution sum (5.2).
y[n] = \sum_{k=-\infty}^{\infty} h[k]\,x[n - k]     (5.2)
What this means is that the convolution sum and integral express the amount of overlap of one
function x, as it is shifted over another function h. Hence, convolution could be described as
“blending” one function with another. At times when the two functions are much alike, the value
of y is large, and naturally it is small when the functions match poorly. [23]
Figure 5.7: Convolution of the signal x with the filter h. The green curve shows the
convolution of the red and blue functions, with the vertical green line indicating the position in time. The grey region is the product h(τ)x(t - τ), its area being precisely the
convolution at the given time instant.
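For finite, sampled signals the convolution sum (5.2) can be evaluated directly. The sketch below is illustrative only; it compares a naive double loop with NumPy's built-in routine on a made-up signal and a simple two-point averaging filter.

```python
import numpy as np

def convolve(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Direct evaluation of the convolution sum (5.2) for finite-length signals."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(h)):
            if 0 <= n - k < len(x):
                y[n] += h[k] * x[n - k]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, 0.5])              # a simple two-point averaging (lowpass) filter
print(convolve(x, h))                 # [0.5 1.5 2.5 3.5 2. ]
print(np.convolve(x, h))              # identical result from the library routine
```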
5.2.6 Filtering in the frequency domain
Although it is convenient to use convolution for filtering, it also introduces an annoying problem
when working with large signals. The computational burden increases rapidly with the
lengths of the signal and filter. For recorded music, several minutes in length with high sample
rate, the situation becomes unbearable.
The number of calculations needed for convolution is Nx · Nh multiplications and (Nx - 1)(Nh - 1)
additions, where Nx is the number of samples in the signal and Nh is the number of filter points.
This can be compared with filtering in the frequency domain where (not counting the cost of
transforming the signal) only as many multiplications as the length of the signal (Nx) are needed.
It might sound like filtering in the frequency domain is the magical answer to the signal processor’s prayers. And this may very well be true for larger signals like sound files, but for shorter
ones, good old-fashioned convolution can actually be faster.
An important difference between filtering in the two domains is the size of the filter. In the time
domain, the filter size is set large enough to get good enough results (which of course depend on
the application). In the frequency domain the filter size remains the same for every filter operation on a certain signal. The filter size is simply the length of the signal, as the filtering consists of
scalar multiplication between the filter and the signal (5.3).
\hat{y}(\omega) = \hat{h}(\omega)\,\hat{x}(\omega)     (5.3)
The signals ŷ , x̂ and the filter ĥ have been transformed into the frequency domain and are
thereby functions of frequency instead of time. If the filter is shorter than the signal (as usually is
the case), it is padded with zeros during the transformation to achieve equal lengths.
A more efficient procedure would be defining the filter directly in the frequency domain, saving
the computational cost of transforming the filter. For filters changing the frequency characteristics of a signal, this can be quite intuitive since working in the frequency domain allows for direct
manipulation of the frequency content. But there is no easy way to define filters changing the
behaviour of a signal over time in terms of frequency operations. In this case, the transformation
from the time domain to the frequency domain is basically necessary.
Bandpass filtering is an example of a straightforward operation in the frequency domain, since
this kind of filtering is nothing more than removing all unwanted frequencies. Here, the filter is
just a binary vector, where ones mean “pass” and zeros mean “stop”. This might introduce some
undesirable effects when the filtered signal is transformed back to the time domain, so a smoother filter (e.g. a Gaussian) is usually preferred.
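Putting (5.3) together with the binary bandpass filter just described gives the following illustrative sketch; the test signal and cut-off frequencies are arbitrary choices, not values used in the thesis.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)   # tone + unwanted component

X = np.fft.rfft(x)                                    # x-hat: the signal in the frequency domain
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
H = ((freqs > 300) & (freqs < 600)).astype(float)     # binary bandpass filter: 1 = pass, 0 = stop
y = np.fft.irfft(H * X, n=len(x))                     # equation (5.3), then the inverse transform

print(np.allclose(y, np.sin(2 * np.pi * 440 * t), atol=1e-6))   # True: only the 440 Hz tone remains
```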
5.2.7 Compressor
A compressor is used to reduce (compress) the dynamic range (i.e. the variation in amplitude) of an
input signal. By setting a threshold level, the compressor will attempt to maintain that level
by turning down everything above it by a certain ratio. As the ratio approaches infinity, the
compressor turns into a limiter. All points of the signal having amplitude below the threshold are
unaffected and all other points will have their amplitude compressed. [2]
The effect of the compressor is that weak amplitudes are augmented and strong ones are attenuated, leaving the signal with more even sound levels. This procedure is often used to dampen
peaks that are too high when recording music, preventing important but less accentuated sounds from being drowned in the mix. The compressor can also be a useful tool in signal processing,
since it can make a noticeable difference for frequency analysis.
Figure 5.8: The effect of a compressor. Using a 2:1 ratio, all signal content above the
threshold is compressed to half the overflow. A n:1 ratio (n being large) gives the effect
of a limiter since all content above the threshold is set to the threshold value.
5.3 Frequency analysis
Analysis of sound and music is not in any way a new field of research. However, this is not the
same as saying that it is well known. In order to decide exactly how to analyse signals and which
factors need to be considered, a thorough study of the available techniques, their differences and
similarities is absolutely necessary. For most engineers, the Fourier Transform is the natural tool for
analysing signal content. But is it the most appropriate method for this thesis? Are there any
other more suitable theories available?
5.3.1 Fourier Transform (FT)
Few scientists have made such an impression on the scientific world as Joseph Fourier. His discoveries have influenced applications in areas such as science, maths, engineering and, perhaps the most significant one, signal processing. The revolution started in the early 19th century with this simple claim from Fourier:
“Any periodic function can be represented as a sum of sines and cosines.”
The Fourier theories have since been recognised as not only a mathematical tool, but also a
physical phenomenon. Sound and music analysis and synthesis are areas in which Fourier research has become a cornerstone. The transformation between the time and frequency domains using the FT is lossless, enabling movement between the two domains without losing information. [11]
Figure 5.9: Transformation between the different domains.
This feature is useful in many applications, for instance the concept of image processing, where an
image can be transformed into its frequency domain using the FT. Many of the filters used in
image processing are frequency-based and the filtering is performed in the frequency domain.
Any signal, which periodically repeats itself, can be described by a sum of well-defined sinusoidal
(sin and cosine) functions. This can be expressed as the so-called Fourier series (5.4).
$f(t) = \frac{1}{2}a_0 + \sum_{k=1}^{\infty}\left(a_k \cos kt + b_k \sin kt\right)$  (5.4)
The Fourier coefficients a0, ak and bk can be defined as in (5.5).
$a_0 = \frac{1}{2\pi}\int_0^{2\pi} f(t)\,dt, \qquad a_k = \frac{1}{\pi}\int_0^{2\pi} f(t)\cos(kt)\,dt, \qquad b_k = \frac{1}{\pi}\int_0^{2\pi} f(t)\sin(kt)\,dt$  (5.5)
By multiplying sinusoidal functions with different amplitudes and adding them together, any periodic signal can be described. Thus, simple sine and cosine signals form the base of the approximated signal. An example of this procedure is given by Figure 5.10.
Figure 5.10: Signal constructed by adding sinusoidal waves with frequencies 440,
880 and 1320 Hz (the last one with all amplitudes shifted by 4).
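A minimal Python/numpy sketch of the idea, assuming an arbitrary sampling rate and duration, constructs the kind of signal shown in Figure 5.10 by summing three sinusoids:

```python
import numpy as np

# Illustrative sketch: a periodic signal built by adding sinusoids at 440, 880
# and 1320 Hz, in the spirit of Figure 5.10. Sampling rate and duration are
# arbitrary assumptions.
fs = 44100                                   # sampling frequency in Hz
t = np.arange(0, 0.01, 1 / fs)               # 10 ms of time samples
signal = (np.sin(2 * np.pi * 440 * t)
          + np.sin(2 * np.pi * 880 * t)
          + np.sin(2 * np.pi * 1320 * t))    # fundamental plus two harmonics
```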
Only multiples of the fundamental frequency of the signal need to be accounted for in the periodic case. Furthermore, in practice merely a finite number of harmonics (multiples) is needed to
approximate f(t) with an acceptable error. Thus, the Fourier series is a sparse way of representing
periodic signals.
However, just being able to handle periodic signals is a big limitation. In reality, entirely periodic
signals are not that common, and it is often necessary to approximate non-periodic ones. To do
this, the Fourier series needs to incorporate all possible frequencies. The Fourier coefficients can
then be described as one complex coefficient (5.6).
$c_k = \frac{1}{2\pi}\int_0^{2\pi} f(t)\,e^{jkt}\,dt$  (5.6)
This gives the complex Fourier series (5.7).
$f(t) = \sum_{k=-\infty}^{\infty} c_k\,e^{-jkt}$  (5.7)
Turning the sum into an integral gives the FT for any non-periodic signal (5.8).
$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-j\omega t}\,dt$  (5.8)
The inverse FT can be defined as in (5.9).
$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\omega)\,e^{j\omega t}\,d\omega$  (5.9)
There are a number of different ways of using the FT, each having its own characteristics.
5.3.2 Discrete Fourier Transform (DFT)
Since computers are discrete machines unable to calculate continuous functions with infinitely
many values, they need a discrete implementation of the FT – the Discrete Fourier Transform. This is
achieved by modifying the transform formula into a sum rather than an integral, and by restricting the frequency interval being used. The signal is then sampled at a regular grid of N points
(5.10).
$F[k] = \sum_{n=0}^{N-1} f[n]\,e^{-j2\pi kn/N}, \qquad k = 0, 1, \ldots, N-1$  (5.10)
Here, 2πk/N and n correspond to the continuous variables ω and t, respectively.
5.3.3 Fast Fourier Transform (FFT)
The Fast Fourier Transform is an interesting and useful development of the DFT. Rather than evaluating the sum in (5.10) directly for every coefficient, which requires on the order of N² operations, the FFT exploits the structure of the complex exponentials and recursively splits the computation into smaller DFTs, reducing the cost to the order of N·log N. The result is identical to the DFT, only computed much faster.
The FFT is implemented in many software applications, including Matlab.
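For illustration, the following Python/numpy sketch evaluates (5.10) directly and checks the result against numpy's FFT routine; the test signal is an arbitrary assumption.

```python
import numpy as np

# A minimal sketch of equation (5.10): a direct O(N^2) DFT, checked against
# numpy's FFT, which computes the same transform in O(N log N). The test
# signal is an arbitrary assumption.
def dft(f):
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)                      # one row of exponentials per coefficient k
    return np.sum(f * np.exp(-2j * np.pi * k * n / N), axis=1)

x = np.random.randn(256)
print(np.allclose(dft(x), np.fft.fft(x)))     # True
```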
5.3.4 Time-frequency problem
The FT, DFT and FFT have one problem in common. Unfortunately, it is a rather large issue. All
these different types of the Fourier transform can accurately distinguish the frequency contents
of an arbitrary signal. However, what if the time location of the frequencies is of interest? None
of the methods give any time information whatsoever. The result of this limitation is that the
methods are only of interest when dealing with a stationary signal, built on constant frequencies,
where the time aspect is not of interest.
5.3.5 Short Time Fourier Transform (STFT)
In order to introduce a time dependency in the FT, and thereby make it able to estimate not only the frequencies but also when they occur, the Short Time Fourier Transform (or Windowed Fourier Analysis) was developed. The principle of the STFT is to introduce a time parameter, so that
the transformation is performed over a limited part of the signal, sort of like looking at the signal
through a window. By choosing a size of the window which makes the signal in it practically stationary, the frequency can be estimated locally, and consequently over the entire signal.
Figure 5.11: The STFT principle. By moving the window over the signal, the local
frequency inside the window can be approximated over the entire signal.
The signal viewed in the window is represented by (5.11).
$x(t)\,w(t-T)$  (5.11)
T is the time location of the centre point of the window and w(t) is the window function. Applying this to equation (5.8) yields the STFT formula (5.12).
$S(\omega, T) = \int_{-\infty}^{\infty} x(t)\,w(t-T)\,e^{-j\omega t}\,dt$  (5.12)
This is in principle a convolution between the signal and the window, and the process can therefore be seen as filtering.
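A minimal Python/numpy sketch of the discrete STFT idea is given below; the Hann window, the window length and the hop size are illustrative assumptions, not parameters from the thesis.

```python
import numpy as np

# A minimal sketch of the STFT principle in (5.12), in discrete form: slide a
# window over the signal and take an FFT of each windowed segment. The Hann
# window, window length and hop size are illustrative assumptions.
def stft(x, win_len=512, hop=256):
    w = np.hanning(win_len)                       # window function w(t)
    frames = [np.fft.rfft(w * x[i:i + win_len])   # local spectrum around time i
              for i in range(0, len(x) - win_len, hop)]
    return np.array(frames)                       # one row per window position

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
S = stft(np.sin(2 * np.pi * 440 * t))
print(S.shape)                                    # (number of windows, win_len // 2 + 1)
```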
5.3.6 Resolution problem
With the STFT being very dependent on the choice of the window form and size, its major disadvantage is that a fixed size of the window cannot suit all contents of the signal. A small window is
good when approximating high frequency parts, but when trying to estimate a low frequency, the
window is far too small to detect the oscillations. In the same way, a large window can be used to
detect low frequency oscillations, but has no chance to detect rapid changes (i.e. high frequencies). Because of the fixed window size, the resolution of the STFT is poor.
Figure 5.12: The fixed window size of the STFT is unable to approximate different
frequencies in the same signal. The red window at the time instant T covers one full
period, but was unable to do so earlier in the signal (illustrated by the dashed grey
window).
5.4 Wavelets
The wavelet theory and principles were originally developed to address the shortcomings of the
FT, and especially the STFT. Where the classic Fourier analysis offers the possibility to reveal the
frequencies of a stationary signal, the wavelet analysis takes the idea one step further; not only
does it determine which frequencies exist in the signal, but also when they occur. The improvement of the resolution for the frequency and time detection of the wavelet transform makes it the
natural successor of the STFT.
5.4.1 Wavelet history
Tracing exactly where, when and by whom the wavelet analysis theory was created is a difficult
task. Yves Meyer, who was one of the first people to obtain public attention for his wavelet
work, once made the following statement, which in short sums up the wavelet situation:
“Tracing the history of wavelets is almost a job for an archeologist, I have found at least 15 distinct roots of the
theory, some going back to the 1930’s”
Meyer realised that a lot of people had actually been using the wavelet principles without knowing about it themselves [11]. The possibly biggest pioneer in the wavelet field was Jean Morlet,
who was actually the first man to use the term “wavelet”. Around the year 1975, while working
for an oil company, he realised that the techniques that were used for searching for underground
oil could be improved. By sending impulses into the ground and analysing their echoes, it was
possible to tell how thick a layer of underground oil was. Originally, Fourier analysis and especially STFT were used for this process, but since these techniques were very time-consuming,
Morlet began searching for another solution.
Looking at the STFT, Morlet decided that keeping the window size fixed was the wrong approach. Instead, he changed the window size while keeping the function (number of oscillations)
fixed. This way, he discovered that stretching the window stretched the function and squeezing
the window compressed the function. The foundation for wavelet theory was created, but Morlet
was not satisfied yet.
In 1981 he teamed up with Alex Grossman, and together they worked on an idea that Morlet
discovered while experimenting with a simple calculator. The idea was the transformation of a
signal into wavelet form and back without losing any information, a lossless transformation between the time domain and the time-frequency domain. [17]
Other famous wavelet people include Ingrid Daubechies, who has created one of the most commonly used families of wavelets, the Daubechies wavelets [29], Stéphane Mallat, who collaborated with Yves Meyer, and Alfréd Haar, who experimented with wavelet ideas as early as 1909 and also gave his name to a wavelet family.
5.5 The wavelet theory
The wavelet analysis is based on the translation and dilation (scaling) of a so-called mother wavelet,
ψ(t). The wavelet function can be described as in (5.13).
$\psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-\tau}{s}\right)$  (5.13)
Here, s is the scale factor, τ is the translation factor, ψ is the mother wavelet and the factor $1/\sqrt{s}$ is for energy normalisation across the scales. [32]
The procedure of a wavelet transform is straightforward; the wavelet, which is a translated and dilated (scaled) version of the mother wavelet, is convolved with the signal. The scale values can be likened to inverted frequencies and range from small to large values. There are no restrictions on how many scales, or which spacing between them, can be used in the transform. A large scale value stretches the mother wavelet, and it will therefore correlate best with the low-frequency content of the signal. In the same way, a small scale value results in a compressed wavelet function, making it well suited for the analysis of high frequency signals.
Figure 5.13: Differently scaled (s = 1, 2, 4) wavelets.
What this actually means is that, unlike the STFT, the wavelet transform uses differently sized
analysis functions in order to maximise the exactness of the analysis, a phenomenon referred to
as multi-resolution analysis. The advantage of this is that for low-frequency content, the frequency resolution is very high, and for the high frequency parts of the signal, the time resolution is
emphasised. This way, the wavelet analysis overcomes the traditional resolution problems of the
STFT. Another advantage of the wavelet analysis compared to the traditional Fourier analysis is
that since the mother wavelet can be defined in infinitely many ways, a wavelet can contain as
many sharp corners and discontinuities as desired. The Fourier analysis is entirely based on using
sinusoids, giving it less freedom and possibilities.
5.5.1 Continuous Wavelet Transform (CWT)
Much like the STFT, the CWT is performed by convolving a signal with a function, in this case
the wavelet declared in (5.13), and the transform is computed separately for different segments of
the time-domain signal. The transform can be seen as a filtering with the wavelet function being
the filter. The big difference between CWT and STFT is that with wavelets, the width of the
“window” (i.e. the wavelet function) changes with the scale value.
$C(s,\tau) = \int x(t)\,\psi^{*}_{s,\tau}(t)\,dt$  (5.14)
where * indicates complex conjugate and ψ is the wavelet function defined in equation (5.13).
Thus, using a scaling function (theoretically) yielding infinitely many scale values, and translating
the wavelet in time is called the Continuous Wavelet Transform. The convolution is performed
once for every scale value defined. This results in the two-dimensional matrix C, with one row per scale and one column for each sample point of the signal. The contents are
coefficients for every scale at every time instant as to how well the corresponding wavelet function matches the signal. By examining the coefficients, the best-fitting scale, and thereby the most
likely frequency content can be decided.
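As a sketch of the procedure (not the thesis implementation), the following Python/numpy code correlates a signal with scaled versions of a mother wavelet; the real-valued "Mexican hat" wavelet and the scale values are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of (5.13)-(5.14), not the thesis implementation: the signal
# is correlated with scaled copies of a mother wavelet, one row of coefficients
# per scale. The real-valued "Mexican hat" wavelet and the scale values are
# arbitrary assumptions.
def mexican_hat(t):
    return (1 - t**2) * np.exp(-t**2 / 2)

def cwt(x, scales, fs):
    C = np.zeros((len(scales), len(x)))
    for i, s in enumerate(scales):
        t = np.arange(-4 * s, 4 * s, 1 / fs)                # support of the scaled wavelet
        psi = mexican_hat(t / s) / np.sqrt(s)               # psi_{s,0}(t) as in (5.13)
        C[i] = np.convolve(x, psi[::-1], mode="same") / fs  # correlation for every translation
    return C                                                # rows: scales, columns: sample points

fs = 1000
x = np.sin(2 * np.pi * 20 * np.arange(0, 1, 1 / fs))
C = cwt(x, scales=[0.005, 0.01, 0.02, 0.04], fs=fs)
```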
5.5.2 CWT in the frequency domain
The formula for performing the CWT in the frequency domain (5.15) is similar to the one for the
time domain.
$\hat{C}(s,\tau) = \int \hat{x}(\omega)\,\hat{\psi}^{*}_{s,\tau}(\omega)\,d\omega$, where  (5.15)
$\hat{\psi}_{s,\tau}(\omega) = \sqrt{s}\,\hat{\psi}(s\omega)\,e^{-j\omega\tau}$  (5.16)
is the Fourier transform of the wavelet function. As can be seen in equation (5.15), the frequency-based CWT does not include any kind of time shift and is therefore a simple scalar multiplication. Another difference lies in the rescaling of the mother wavelet, since a rescaling by s in
the time domain becomes 1/s in the frequency domain. [10]
5.5.3 Visualisation
In order to visualise the CWT, a three-dimensional plot is an easy way to clarify the results of the analysis. By plotting time against the scale values on the x- and y-axis respectively, and the coefficients of the transform on the z-axis, the results are visualised as “mountain peaks”, as illustrated
in Figure 5.14.
Figure 5.14: Coefficients of the CWT performed with the Daubechies 8 wavelet on a
440 Hz sine signal.
5.5.4 Discretisation of the CWT
The CWT is quite a demanding process, since the results cover the analysis of the signal using
every defined scale for every time instant. Computing the CWT on a computer is actually an impossibility, with all computers being of a discrete nature. A discretisation of the CWT is necessary. The wavelet function and the scale values need to be sampled in order to be used in the
transform for a discrete signal.
5.5.5 More sparse discretisation of the CWT
In many practical application examples, having a uniformly sampled time-frequency plane will
cause redundant information. To remove this, speed up the procedure and make the wavelet
transform more manageable, a number of properties can be altered. The most important aspect is
the frequency content. Since the scales correspond to the frequency of the signal it is not necessary to use the same sampling rate for every scale. High scales correspond to low frequencies,
where the analysis does not rely on a high sampling rate. Therefore, by sampling differently for
every scale, a lot of redundant information can be ignored.
5.5.6 Sub-band coding
There are a number of different ways of performing this operation. One popular way is to discretise the scales logarithmically. By using 2 as the base of the logarithm, only the values 2, 4, 8, 16, 32 and so on, i.e. scales derived from the expression s = 2^k, where k is a non-negative integer, are used as scales. This discretisation, which is part of a technique called sub-band coding, also makes it possible to discretise the time axis. The scale changes by a factor of two, resulting in the equivalent of a two times lower frequency. Consequently, the corresponding sampling rate for the time axis can be halved, according to the Nyquist criterion. The dyadic sampling grid (Figure 5.15) represents this method. As the scale factor increases, the frequency being analysed decreases, and a
lower sampling rate is required.
Figure 5.15: The dyadic sampling grid. As the scale value increases, the amount of
necessary sampling points decreases.
5.5.7 Discrete Wavelet Transform (DWT)
The sub-band coding technique is the most common way of calculating the Discrete Wavelet Transform (DWT). By using low- and highpass filters, the signal is divided into two parts, each being
examined with different scales at different frequencies. The collection of filters is often called a
filter bank. The result of the sub-band coding is a number of coefficients, describing the high and
low frequency content of the signal, according to the desired level of the analysis. Stepping up to
a higher level means repeating this process for the lowpass filtered part of the signal.
Figure 5.16: The DWT principle. For every level, a highpass filtered signal describes
the details, and the lowpass equivalent the approximation of the signal.
For every level of the analysis, the resulting coefficients describe the high frequency content (the
details) and the low frequency content (the general approximation) of the signal. Depending on
the intent of the transform, using a sufficient number of levels usually means using far fewer sample points than the discretised CWT, thanks to the dynamic rate of sampling. The DWT is mostly used in image and audio compression methods, and is beyond the scope of this thesis.
5.5.8 Wavelet families
The possibility of using a unique function as mother wavelet is one of the most appealing aspects of the wavelet concept. There are a great number of mother wavelets to choose from, each having its own characteristics and suitable areas of use. It is also fully possible to design new wavelets. A wavelet function can be complex or real, and often has an adjustable parameter for the localised oscillation. The simplest mother wavelet is the Haar wavelet, which looks like a step function.
Figure 5.17: The Haar wavelet.
Another, more advanced wavelet function is the Daubechies family, where an integer denotes the
number of vanishing moments. By adjusting this number n, the functions take on different
shapes. This leads to the wavelets being called “Db n”. For instance, a Daubechies wavelet with
four vanishing moments is called a “Db 4”.
Figure 5.18: The Daubechies wavelet family (top left the Db 2, top right the Db 4,
bottom left the Db 8 and bottom right the Db 16).
5.5.9 Conditions for wavelets
Even though it is fully possible to design new mother wavelets, there are a number of conditions that need to be fulfilled:
The admissibility condition
For a continuous wavelet transform to be invertible, the mother wavelet must satisfy the admissibility condition (5.17).
$\int \frac{|\hat{\psi}(\omega)|^2}{\omega}\,d\omega < \infty$  (5.17)
Here, $\hat{\psi}$ is the FT of the mother wavelet ψ. This is only true when (5.18) is fulfilled.
$|\hat{\psi}(\omega)|^2\Big|_{\omega=0} = 0$  (5.18)
This means that the wavelet must have a bandpass-like spectrum. A resulting zero of the FT at
the zero frequency also means that the average value of the wavelet in the time domain must be
zero (5.19).
$\int \psi(t)\,dt = 0$  (5.19)
It follows that the function is oscillatory, and must be a wave. Hence the name “wavelet”.
If the admissibility condition is fulfilled, the inverse wavelet transform can be defined as in (5.20).
$x(t) = \iint C(s,\tau)\,\psi_{s,\tau}(t)\,d\tau\,ds$  (5.20)
The signal x(t) is exactly reconstructed from the CWT coefficients, using the same wavelet function. Thus, the transformation is lossless.
The regularity conditions
The regularity conditions state that the wavelet function should have some smoothness and concentration in both time and frequency domains, in order to make the wavelet transform decrease
quickly with decreasing scale s. [32]
5.5.10 Wavelets and music
Two related wavelets that have been successfully used in applications related to music and sound
are the Gabor and the Morlet wavelets [1]. These functions are complex and based on exponential functions, making them appropriate for analysing sinusoidal sound signals, in the sense that when the function matches the signal, the resulting coefficient values are quite high, and when it does not match very well, the coefficient values are very low. This contrast makes it easy to distinguish the scales best suited to describe the analysed signal, in comparison with the real-valued wavelets, where the peaks in the resulting coefficients are wider and harder to distinguish.
5.5.11 The Morlet wavelet
The Morlet wavelet can be defined as in (5.21).
$\psi(t) = 2\,e^{-t^2/\alpha^2}\left(e^{j\pi t} - e^{-\pi^2\alpha^2/4}\right)$  (5.21)
Here, α is a parameter controlling the bandwidth of the wavelet.
In the frequency domain, the Morlet wavelet (5.22) is a complex bandpass filter, thus its effect as
a filter would be to limit a signal to a band centred around a certain frequency.
$\hat{\psi}(\omega) = \alpha\,e^{-\alpha^2(\pi^2+\omega^2)/4}\left(e^{\pi\alpha^2\omega/2} - 1\right)$  (5.22)
A modified version of the original complex Morlet wavelet is the Morlet pseudowavelet (5.23).
$\psi_p(t) = \frac{1}{\sqrt{b\pi}}\,e^{-(t^2/b)\,+\,j\omega_0 t}$  (5.23)
Here, ω0 is the centre frequency and b is the bandwidth. By altering the parameters for width and
centre frequency, the wavelet can be designed to match normal sound signals very well. This way,
the Morlet wavelets are usually referred to as “Morlet ω0 - b”. For instance, a Morlet wavelet with
centre frequency 1 and bandwidth 5 is called “Morlet 1-5”.
Figure 5.19: Three different Morlet pseudowavelets (1-1, 1-5, 3-5). The two leftmost
have the same centre frequency, but different bandwidths. The two rightmost have the
same bandwidth, but different centre frequencies.
This version of the wavelet is not, strictly speaking, a theoretically perfect choice of mother wavelet, because it does not meet the admissibility condition. That is, it does not integrate to zero. The
inverse transform is therefore not possible. However, by choosing a centre frequency large
enough, the integral of the pseudowavelet can be made extremely close to zero and the condition
is thereby in principle met.
Despite being a simplification, the pseudowavelet is in fact quite useful for time-frequency display
of signals. Since the FT for the Morlet pseudowavelet, declared by (5.24), is much simpler to define than for the original Morlet Wavelet, it is also faster to use in frequency-based CWT computations. [10]
$\hat{\psi}_p(\omega) = \frac{1}{\sqrt{b\pi}}\,e^{-(\omega-\omega_0)^2/b}$  (5.24)
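As an illustration of the frequency-domain CWT with this pseudowavelet (a Python/numpy sketch, not the thesis's Matlab code; the parameter values ω0 = 5 and b = 2, the sampling rate and the scale values are assumptions):

```python
import numpy as np

# Illustrative Python/numpy sketch (not the thesis's Matlab code) of the
# frequency-domain CWT (5.15) using the pseudowavelet FT (5.24). The parameter
# values (omega0 = 5, b = 2), the sampling rate and the scales are assumptions.
def psi_hat(w, w0=5.0, b=2.0):
    return np.exp(-(w - w0) ** 2 / b) / np.sqrt(b * np.pi)

def cwt_freq(x, scales, fs):
    N = len(x)
    w = 2 * np.pi * np.fft.fftfreq(N, d=1 / fs)     # angular frequency axis
    X = np.fft.fft(x)
    C = np.zeros((len(scales), N), dtype=complex)
    for i, s in enumerate(scales):
        # scaled wavelet in the frequency domain: sqrt(s) * psi_hat(s * w);
        # the inverse FFT evaluates the coefficients for every translation at once
        C[i] = np.fft.ifft(X * np.sqrt(s) * np.conj(psi_hat(s * w)))
    return C

fs = 4000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)
scales = [5 / (2 * np.pi * 220), 5 / (2 * np.pi * 440)]   # scales matching 220 Hz and 440 Hz
print(np.abs(cwt_freq(x, scales, fs)).mean(axis=1))       # the 440 Hz scale clearly dominates
```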
5.6 Conclusion
The classic Fourier analysis is an important concept in the field of signal analysis. However, the
fact that it doesn’t include any time information makes it an inadequate method for this thesis.
With the wavelet analysis, the possibility of examining audio files and estimating frequencies in time presents a straightforward way of determining melodies, the principal aim of the analysis.
From this, the information needs to be combined with the synthesis theories in order to fulfil the
thesis goals.
6 Synthesis
6.1 Introduction
This chapter presents the different theories forming the synthesis part of the thesis. The major
aspect here is the Markov chain and its features. Furthermore, some Artificial Intelligence aspects
are discussed, and a summary of the MIDI file format is given. The combination of these ideas
forms the synthesis method used.
6.2 Markov chains
The Markov chains, and more generally Markov processes, are among the most fundamental objects
in the study of probability. This chapter is influenced by lecture notes by Petterson [8].
6.2.1 Statistical model
Technically, a Markov process is a random process characterised by a lack of memory, and only
depending on the preceding state, a fact known as the Markovian property.
Consider a collection of random variables { Xt } (with the index t running through 0, 1, ...)
where, given the present, the future is conditionally independent of the past. In other words, a
Markov process has the property of (6.1).
$P(X_t = j \mid X_0 = i_0, X_1 = i_1, \ldots, X_{t-1} = i_{t-1}) = P(X_t = j \mid X_{t-1} = i_{t-1})$  (6.1)
Markov processes are continuous, while Markov chains are time-discrete implementations. This
means that a Markov chain implies a fixed size time step; every transition happens after a certain
length of time (6.2).
$P(x_n = a_{i_n} \mid x_{n-1} = a_{i_{n-1}}, \ldots, x_1 = a_{i_1}) = P(x_n = a_{i_n} \mid x_{n-1} = a_{i_{n-1}})$  (6.2)
A Markov process has slightly different transition behaviour. The number of states can be the
same as with the Markov chain, but in each state there are a number of possible events that can
cause a transition. These events take place at random points in time. This makes the Markov
processes continuous.
A first order Markov chain only depends on the previous state, while a higher order process
would depend on a higher number of preceding events. This is the same as defining the Markov
chain as first order, but with a different set of states (6.3).
$P(X_t \mid X_1^{t-1}) = P(X_t \mid X_{t-n}, \ldots, X_{t-1})$  (6.3)
An n-th order Markov chain over the alphabet A is equivalent to a first order chain over the alphabet of n-tuples, $A^n$. In general, an n-th order Markov process can be transformed to a first order Markov process by introducing a new random variable $Y_t = \{X_{t-n}, \ldots, X_{t-1}\}$, yielding (6.4).
$P(X_t \mid X_1^{t-1}) = P(X_t \mid Y_t)$  (6.4)
For instance, consider the sequence MARKOV. A Markov chain of first order would consist of
the states A = { M, A, R, K, O, V }. In this case, a second order chain would use the alphabet
A2 = { MA, AR, RK, KO, OV } but still behave like a first order chain.
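As an illustration of how such a chain can be estimated from data (not code from the thesis), the following Python/numpy sketch counts transitions in a short, made-up sequence of note names and normalises the counts into a first order transition matrix:

```python
import numpy as np

# Illustrative sketch (not thesis code): estimating a first order transition
# matrix from an observed symbol sequence, here a short made-up sequence of
# note names.
sequence = ["C", "E", "G", "E", "C", "E", "G", "G", "C"]
states = sorted(set(sequence))
index = {s: i for i, s in enumerate(states)}

counts = np.zeros((len(states), len(states)))
for a, b in zip(sequence[:-1], sequence[1:]):     # count every observed transition a -> b
    counts[index[a], index[b]] += 1

P = counts / counts.sum(axis=1, keepdims=True)    # normalise each row so that it sums to 1
print(states)
print(P)
```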
To describe a so-called finite state first order Markov chain, three criteria need to be fulfilled:
• The system can be described by a set of finite states, and can be in one and only one state at a given time.
• The probability of a transition from state i to state j, Pij, is given for every possible combination of i and j. The transition probabilities are stationary over the time period of interest, and independent of how state i was reached.
• The initial state of the system or the probability distribution of the initial state is known.
Thus, in order to model a Markov chain, the initial state and the probability distribution of the
different states need to be known. These probabilities can be arranged in a so-called transition
matrix. The expression in (6.5) shows a probability transition matrix for a system of m number of
states.
$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1m} \\ P_{21} & P_{22} & \cdots & P_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ P_{m1} & P_{m2} & \cdots & P_{mm} \end{bmatrix}$$  (6.5)
Pij represents the constant probability of a transition from state Xi at time t to state Xj at time
(t+1). The Markovian property makes P time-invariant. All rows of the transition matrix add to 1,
and all values of Pij are greater than or equal to 0. In this matrix, all possible successive states and their
probabilities are gathered. From this, calculating probabilities for a desired passage of states is a
simple problem involving basic probability operations.
6.2.2 Markov chain example
Suppose that a system has the transition matrix of (6.6).
$$T = \begin{array}{c|ccc} & A & B & C \\ \hline A & 0.2 & 0.6 & 0.2 \\ B & 0.1 & 0.6 & 0.3 \\ C & 0.5 & 0.2 & 0.3 \end{array}$$  (6.6)
This would imply that the probability of reaching state A from state A is 0.2, reaching state B from state A is 0.6 and so on. Furthermore, assume the initial state is given by the vector (6.7).
$S_0 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$  (6.7)
Apparently, the starting state is A. Using the transition matrix it is straightforward to calculate the
state vector after j transitions, as given by (6.8).
$S_j = S_0\,T^{\,j}$  (6.8)
For instance, after 2 transitions the state vector looks like (6.9).
$S_2 = \begin{bmatrix} 0.20 & 0.52 & 0.28 \end{bmatrix}$  (6.9)
This means that after 2 transitions, there is a probability of 0.2 that the current state is A, 0.52 for state B and 0.28 for state C. An interesting phenomenon occurs when comparing the state vectors after 6 and 10 transitions, respectively (6.10).
$S_6 = \begin{bmatrix} 0.23 & 0.49 & 0.28 \end{bmatrix} = S_{10}$  (6.10)
The two state vectors have the exact same probabilities. The system has reached its so-called
steady state distribution, and shows the constant probabilities for the different states, regardless of
the starting position. This happens when the transition matrix T is regular (all components greater than zero for some power of T). A Markov chain represented by such a transition matrix is called a regular Markov chain.
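The example is easy to verify numerically; the following Python/numpy sketch (not thesis code) reproduces (6.9) and (6.10):

```python
import numpy as np

# Numerical check of the example (6.6)-(6.10); an illustrative sketch, not thesis code.
T = np.array([[0.2, 0.6, 0.2],
              [0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3]])
S0 = np.array([1.0, 0.0, 0.0])                    # the chain starts in state A

for j in (2, 6, 10):
    Sj = S0 @ np.linalg.matrix_power(T, j)
    print(j, np.round(Sj, 2))                     # S2 = [0.2 0.52 0.28]; S6 and S10 agree
```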
6.3 Artificial Intelligence (AI)
The study of human intelligence is one of the oldest research fields. Over the last 2000 years, philosophers have tried to understand how learning, understanding and reasoning are, and should be, done. During the 1950s, the development of computers allowed testing of theories and experiments of a more complex nature than what was previously possible, making the research area
more practical and concrete. The computers were initially believed to have an unlimited potential
of intelligence. But while the computers offered countless calculations to be performed in an
instant, many of the theories believed to address the concept of intelligence failed. Combining
intelligence and computers proved to be a major research area of its own.
6.3.1 The AI research field
Russell describes AI research as the quest for the solution to one of the ultimate puzzles: how is
it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict
and manipulate a world far larger and more complicated than itself? Unlike other questions,
where the answer may not even exist, the evidence of the existence of the answer to the AI quest
is clear; just look in the mirror.
To understand the diversity of the AI field, looking at the areas in which it appears is sufficient;
from general-purpose areas like perception and logical reasoning to specific tasks like playing
chess, proving mathematical theorems, writing poetry and much more. However, to nail down
the most important fields of study, four different aspects can be considered:
• Systems that think like humans.
• Systems that act like humans.
• Systems that think rationally.
• Systems that act rationally.
The keyword here is ”rationality”. In AI, the term acknowledges that human behaviour contains irrational mistakes, and that this fact has to be taken into consideration.
Therefore, Russell states:
“A human-centred approach must be an empirical science, involving hypothesis and experimental confirmation. A
rationalist approach involves a combination of mathematics and engineering.” [31]
6.3.2 Simulating human behaviour
As far as the AI elements of this thesis go, the “mistakes” of human behaviour are of interest. In order to synthesise music that avoids having a cold and machine-like feel to it, it is necessary to actually recreate the small errors and random factors that human fabrications may possess. This way, the purpose could be defined as “a system that creates an output that appears to be made by a human”, rather than the four cases above. The study and consequences of human error is a big research area of its own. Usually, this sort of study addresses computer training systems that allow the computer to learn from its own mistakes. Rauterberg discusses examples of this, which go beyond the scope of this thesis [9].
6.3.3 AI and music
Looking at the combination of music and AI, the goal of the research is to make computers behave like skilled musicians. This would mean the ability to perform specialised tasks like composition, analysis, improvisation and so on. As in many other AI areas, the emphasis so far has been
put on these specialised tasks independently from each other. Therefore, current research is looking for a way to integrate the different tasks in a general application.
Since music is not a strict science, but rather the combined effect of both physical and emotional
reactions, it is debatable whether the possibility of a superior music machine is desirable. Most
traditional musicians keep trying to move music away from this type of automatism, while AI
research tries to reduce the gap between computers and music. This conflict makes music and AI
a very interesting combination. [27]
6.4 MIDI
MIDI is an abbreviation for Musical Instrument Digital Interface. To avoid a common
misunderstanding, it is important to realise that MIDI is not a thing that can be owned. It is not a
thing that can be touched. What it is, is actually the name of a communications protocol that
allows electronic musical instruments to interact with each other. Another misunderstanding is
that MIDI was designed to be used as a sound source for video games etc. In reality the MIDI protocol
was created by musicians, for musicians and with the need of the musicians in focus. [13]
6.4.1 MIDI history
The saga of MIDI has its origin back in the days when synthesisers began to gain recognition
from the public as a proper music instrument (read: late 1970s/early 1980s). The breakthrough
synthesiser artists had one major problem; it was hard to perform their music live on stage. In the
studio, they could layer their electronic sounds on top of each other using multiple tracks, but
like everybody else, they only had two hands, which limited the possibility of recreating the music
live.
To solve the problem, synthesiser technicians from various manufacturers met to discuss ideas.
In 1983, their results were revealed at the first North American Music Manufacturers show in
Los Angeles. The demonstration showed how two synthesisers, manufactured by different companies, could be connected with cables. One of the synthesisers was played, and both of them
could be heard. In order to show the two-way nature of the communication, the process was
then reversed in front of an impressed audience.
The MIDI principle is very reminiscent of the way two computers can communicate via modem,
with the difference being that the computers are synthesisers in the MIDI case. The information
being shared is musical in nature, and in its most basic mode tells the synthesiser when to start
and when to stop playing a certain note. Other information possible to share is the volume and
the possible modulation of the note. MIDI information can also be more hardware specific; it
can tell a synthesiser to change sounds, master volume, modulation devices and much more.
Soon it became clear that computers and MIDI would be an ideal combination since they speak
the same binary language. The only problem was the fact that MIDI used a data transmission rate
of 31.25 kBaud, which was different from all computer data rates. To solve this, an interface was
designed, which allowed the computer to talk to MIDI devices. The first companies to establish
themselves in the MIDI-computer market were Apple, Commodore and ATARI. Today, almost all types of computer systems have interfaces for the MIDI protocol. [19]
6.4.2 The MIDI commands
The very basis for the MIDI communication is the byte, or rather the combination of bytes. Each
MIDI command, or MIDI Event, has a specific binary byte sequence, in this chapter expressed in
terms of hexadecimals. Each byte is 8 bits long. The first byte is always the status byte, telling the
MIDI device which function to perform. Encoded within the status byte is the MIDI channel,
ranging from 0 to 15. Thus, MIDI is a 16-channel interface, with the channels being completely independent of each other. Possible actions of the status byte can be Note On, Note Off or
Patch Change. Depending on which of these actions the status byte indicates, a number of different bytes will follow.
Naturally, the most important commands are the Note On and Note Off cases. If a Note On is
sent, the MIDI device is told to begin playing a note. Two additional bytes are required; a pitch
byte, deciding which note will be played, and a velocity byte, which sets the force of the pressed
key. Velocity sensitivity is not supported by all MIDI devices, but the velocity byte is still required to complete a
Note On transmission. A Note Off indication uses the same structure as the Note On command.
As an example, a Note On command could appear as in Table 6.1.
Binary code    Hexadecimal
10010000       90
00111100       3C
01110010       72
Table 6.1: MIDI Note On command.
Here, 90 indicates a Note On command for MIDI channel 0, 3C is the key pressed (translating into the decimal number 60, a C4) and 72 is the velocity with which the key was pressed (the decimal value 114). MIDI defines 128 note numbers, meaning that the key variable ranges between 0 and 127, or in hexadecimal notation 00 and 7F. The velocity variable has the same range. If a Note On command with a velocity of 0 is executed, it is actually interpreted as a Note Off.
When the key is released, a Note Off command is sent to the MIDI device.
Binary code    Hexadecimal
10000000       80
00111100       3C
01100011       23
Table 6.2: MIDI Note Off command.
Just like Note On, it consists of three bytes. The first one (80) indicates a Note Off command for
channel 0, the second one (3C) indicates which note is to be turned off and the last one (23) sets
an off-velocity.
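As an illustration (not thesis code), the following Python sketch assembles the three-byte messages of Tables 6.1 and 6.2; the helper function names are assumptions.

```python
# Illustrative sketch (not thesis code): building the three-byte Note On and
# Note Off messages of Tables 6.1 and 6.2. The function names are assumptions.
def note_on(channel, key, velocity):
    return bytes([0x90 | channel, key, velocity])

def note_off(channel, key, velocity=0x40):
    return bytes([0x80 | channel, key, velocity])

print(note_on(0, 0x3C, 0x72).hex())    # '903c72' - Note On, channel 0, C4, velocity 114
print(note_off(0, 0x3C, 0x23).hex())   # '803c23' - Note Off, channel 0, C4
```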
A Patch Change command instructs the MIDI device which of its built-in sounds should be
played. The General MIDI library is the standard instrument list. Using this standard it does not
matter on which synthesiser a tone is played. For instance, when playing a tone with patch number 0, the sound will always be an Acoustic Grand Piano. The Patch Change command requires only one data byte; the number corresponding to the patch number on the synthesiser.
Binary code    Hexadecimal
11000000       C0
01001010       4A
Table 6.3: MIDI Patch Change command.
This command changes the patch on channel 0 to instrument number 4A (or 74 expressed in
decimal notation), which in the General MIDI case would be a flute. In the General MIDI library, the instruments are divided into 16 different families. This means that within the patch
numbers 1 to 8, the “Piano” family of instruments will always be found, and so on.
6.4.3 Standard MIDI file format
In order to use the MIDI commands or events as defined above, a MIDI file needs to have a
certain appearance. A standard MIDI file consists of different types of so-called chunks; a header
chunk and an arbitrary number of track chunks. A track in a MIDI file can be thought of as the
equivalent on a multi-track tape deck; it may represent a voice or an instrument.
Header chunks
The header chunk appears at the beginning of the MIDI file, and describes the file format.
MIDI File header:
[ 4D 54 68 64 ]
[ 00 00 00 06 ]
[ ff ff ]
[ nn nn ]
[ dd dd ]
The first four bytes translate into the ASCII letters “MThd”, indicating the start of the MIDI
file. The next four bytes represent the header length, always six bytes. The [ ff ff ] information is
the file format. There are three different formats of MIDI files; single-track, synchronous multiple-tracks or asynchronous multiple-tracks. Single-track means that there is only one track, synchronous tracks mean that the tracks all start at the same time, which they do not in the asynchronous case.
[ nn nn ] is the number of tracks in the file, and [ dd dd ] is the number of delta-time ticks per quarter note. Each event is preceded by a delta-time, stating how many ticks after the previous event it should be executed. Delta-time is
a variable-length-encoded value. This format allows large numbers to use as many bytes as they
need, without requiring small numbers to waste bytes by filling with zeros. Some examples of
numbers represented as variable-length quantities are stated in Table 6.4.
Fixed size hexadecimal format    Variable-length format
00000000                         00
00000040                         40
0000007F                         7F
00000080                         81 00
00002000                         C0 00
Table 6.4: Variable-length encoding examples.
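As an illustration (not thesis code), a short Python sketch of the variable-length encoding reproduces the rows of Table 6.4; the function name is an assumption.

```python
# Illustrative sketch (not thesis code): encoding a number as a MIDI
# variable-length quantity. Seven data bits per byte; every byte except the
# last has its top bit set. The function name is an assumption.
def to_vlq(value):
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(out))

for n in (0x00, 0x40, 0x7F, 0x80, 0x2000):
    print(hex(n), to_vlq(n).hex(' '))   # reproduces the rows of Table 6.4
```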
Track chunks
After the header chunk comes one or more track chunk(s). Each track chunk has a header, and
may contain an arbitrary number of MIDI commands. The header for a track is similar to the file
header.
MIDI Track header:
[ 4D 54 72 6B ]
[ xx xx xx xx ]
The first four bytes have the ASCII equivalent of “MTrk”, indicating a MIDI track. The four
bytes after this statement give the length of the track (excluding the track header), stating the
number of bytes occupied by following MIDI events. Each event is preceded by a delta-time. [26]
6.4.4 MIDI file example
To give a proper overview of a standard MIDI file, an example could look like Table 6.5.
MIDI Command                                          Hexadecimal notation
MIDI File Header                                      [ 4D 54 68 64 ]
Number of bytes in header (6)                         [ 00 00 00 06 ]
MIDI File Format (1)                                  [ 00 01 ]
Number of tracks (1)                                  [ 00 01 ]
Ticks per quarter note (96)                           [ 00 60 ]
MIDI Track Header                                     [ 4D 54 72 6B ]
Number of bytes in track (31)                         [ 00 00 00 1F ]
Patch change, program 2, channel 1, deltatime 0       [ 00 C0 02 ]
Note on, channel 1, C4, velocity 64, deltatime 0      [ 00 90 3C 64 ]
Note off, channel 1, C4, velocity 64, deltatime 30    [ 30 80 3C 64 ]
Note on, channel 1, E4, velocity 64, deltatime 0      [ 00 90 40 64 ]
Note off, channel 1, E4, velocity 64, deltatime 30    [ 30 80 40 64 ]
Note on, channel 1, G4, velocity 64, deltatime 0      [ 00 90 43 64 ]
Note off, channel 1, G4, velocity 64, deltatime 30    [ 30 80 43 64 ]
End of Track, deltatime 0                             [ 00 FF 2F 00 ]
Table 6.5: MIDI file example.
Since the C, E and G notes are played one directly after another, the example spells out the notes of a C-major chord. [6]
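For illustration (not thesis code), the bytes of Table 6.5 can be written to disk as a small single-track MIDI file with a few lines of Python; the file name is an assumption.

```python
# Illustrative sketch (not thesis code): writing the bytes of Table 6.5 to disk
# as a small single-track MIDI file. The file name is an assumption.
header = bytes.fromhex("4D 54 68 64 00 00 00 06 00 01 00 01 00 60")
track = bytes.fromhex(
    "4D 54 72 6B 00 00 00 1F"          # track header, 31 bytes of events follow
    "00 C0 02"                         # patch change
    "00 90 3C 64 30 80 3C 64"          # C4 on/off
    "00 90 40 64 30 80 40 64"          # E4 on/off
    "00 90 43 64 30 80 43 64"          # G4 on/off
    "00 FF 2F 00"                      # end of track
)
with open("c_major.mid", "wb") as f:
    f.write(header + track)
```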
6.5 Conclusion
The techniques explained in this section form the basis of the synthesis part. The Markov chain is
used to generate new tone sequences, the AI features make the output appear more “human”,
and the MIDI specifications are used to make it listenable. The implementation of these ideas,
along with the analysis theories described in the previous chapter, forms the application which
accompanies this thesis.
7 Implementation
7.1 Introduction
This chapter will in detail explain how the various theoretical principles were implemented to
form a Matlab-based application.
7.2 Analysis
The idea of the analysis part of the application was to find the tone sequence of the input sound
file. By doing this for several inputs, a database containing statistical information was passed on
to the synthesis.
Figure 7.1: Workflow of the analysis. The dashed grey window symbolises octave
separation.
After studying the wavelet theory and looking at example applications, the focus was put on understanding and improving the Multiana program, and trying to combine its ideas with the functions in Matlab’s wavelet toolbox. Multiana performed a tone analysis, which was very appealing,
but lacked speed and a clear structure as to what type of analysis was actually made.
7.2.1 CWT analysis
An example program on the official Matlab reference guide page, also able to find and identify
frequencies in time, was altered to perform the CWT with the scales being actual tones, rather
than simple integer values. To implement this idea, a formula was used to define the required
frequencies (i.e. tones), and then transform them into scale values for the mother wavelet, as
given by (7.1). [22]
$s = \frac{f_c \cdot f_s}{f}$  (7.1)
fc is the center frequency of the given wavelet, fs is the sampling frequency and f is the frequency
of the current note of interest.
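As an illustration of (7.1) (a Python/numpy sketch, not the thesis's Matlab code), the scale values for a range of equal-tempered tones can be computed as follows; the centre frequency value and the tone range are assumptions (in Matlab the centre frequency of a named wavelet is given by the centfrq function).

```python
import numpy as np

# Illustrative Python/numpy sketch of equation (7.1), not the thesis's Matlab code.
# The centre frequency value and the tone range are assumptions.
fs = 44100                                        # sampling frequency in Hz
fc = 0.8125                                       # assumed wavelet centre frequency (normalised)
notes = 440.0 * 2 ** (np.arange(-12, 13) / 12)    # two octaves of equal-tempered tones around A4
scales = fc * fs / notes                          # one scale value per tone, as in (7.1)
```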
By doing so for every musical tone within five octaves, ranging from around 65 Hz to 2000 Hz, a
CWT analysis could be performed with these adapted scale values. Originally, test sound signals
were created as pure sinusoid signals. The reason for this was that ”clean” signals correlated better with the wavelet, and resulted in higher coefficient values. As a result, the differences between
the various wavelets could be spotted more easily. A pure sinusoidal A4 note can be defined as in (7.2).
$A = \sin(440 \cdot 2\pi \cdot t)$  (7.2)
The built-in Matlab CWT function resulted in a large matrix with coefficients for the 60 different tones, showing how well each wavelet function matched the signal at every point along the signal.
A number of different types of mother wavelets and their characteristics were examined in order
to find the most suitable one. Originally, the test programs used conventional mother wavelets,
for instance the Daubechies wavelet. The results varied for the different wavelet types, and the
CWT seemed very slow. Furthermore, the resulting coefficients did not impress, as shown by
Figure 5.14.
The algorithm did find the scales most likely to represent the tone of the signal, but the “top” of
the resulting coefficients was not narrow enough to decide one unique scale (and tone). The
problem of determining one tone from the coefficient matrix became obvious when plotting the
tone (frequency) versus time representation, where the blacker the colour, the higher the coefficient value (Figure 7.2).
Figure 7.2: 2D time-frequency plot of CWT coefficients using Daubechies 8 wavelet.
The signal being analysed is an A4 (440 Hz).
The use of the Daubechies wavelet was clearly unsatisfactory, and something presenting more distinct results for musical signals was needed. The Morlet wavelet’s similarity to a theoretically perfect (sinusoidal) music signal made it very interesting for this type of analysis. With it being one
of Matlab’s built-in mother wavelets, it could be used directly in the Matlab CWT algorithm. For
the moment, this saved a lot of programming work.
Figure 7.3: Coefficients of the CWT performed with the Morlet 1-5 wavelet on a 440
Hz sine signal.
As seen in Figure 7.3, the resulting coefficients were now improved, and indicated that the nature of a musical signal was best captured by a complex-valued wavelet. The plot of the coefficients from
the CWT showed that the peak clearly gave the most appropriate scale value. Since the scale best
suited to describe the signal was easy to decide from the coefficient matrix, the tone was also easy
to identify directly from the scale value, using equation (7.1). Plotting these results showed a
much better picture of the tonal content, as illustrated in Figure 7.4.
Figure 7.4: 2D time-frequency plot of CWT coefficients using a complex Morlet 1-5
wavelet. The signal being analysed is an A4 (440 Hz).
Knowing that for larger signals, filtering through convolution was a much more demanding process than filtering through scalar multiplication in the frequency domain, performing the wavelet
transform in the time domain was obviously not the optimal choice. So to substantially decrease
the number of computations for the transform, working in the frequency domain was desired. At
this point, using the Matlab wavelet toolbox was no longer an option, since it only included
methods for transforming with convolution.
Fortunately, there were a number of independent wavelet toolboxes for Matlab with the source
code available on the Internet. One of them was YAWTB, which did in fact perform the CWT in
the frequency domain [14]. However, the mother wavelets available in this toolbox were not
nearly as good as the Morlet pseudowavelet for analysing musical signals. Still, the way the transform was performed was of great interest and the principle in itself could be used. Just a couple
of lines of code from YAWTB combined with some wavelet theory and the definition of the
Morlet pseudowavelet in the frequency domain [10] formed the foundation of the CWT function
being used in the final version of the application.
7.2.2 Improving performance
The application was now able to identify which tone was constructed in the sinusoid test signal.
However, to make the application able to analyse “real” music signals, the input files were hereafter selected as mono audio in the formats .wav or .au, since the Matlab environment fully supported the reading of these formats. The motivation for using mono was that the CWT could
only perform analysis on one signal at a time, while stereo sound consisted of two channels (signals).
Performing the CWT calculations for a signal with hundreds of thousands of samples over several octaves (each octave having 12 scales) was extremely demanding, even for a fast computer
with lots of memory. To improve the performance, some sort of skip variable needed to be implemented, making the program analyse the signal only at each skip:th element. But simply performing the calculations this way gave very poor and extremely oscillatory results. It was necessary to skip values in a more intelligent way, looking at the content of the signal before deciding
the step length.
The new approach was to downsample the signal as much as possible without losing important
information. According to the Nyquist criterion, the sample frequency of a signal cannot be
lower than twice the highest included frequency in order to fully describe the contents of the
original signal. Theoretically this meant that if the highest frequency could be found, the CWT
calculations could be performed on the signal downsampled to twice this frequency, still giving
the same results. Since the tones played in the test files often appeared in the fourth or fifth octave, the sampling frequency needed was seldom higher than 2 kHz. This was a major difference
from the CD-quality sampling rate of 44.1 kHz which was used in most signals, meaning that
they could be downsampled up to ten times, making the CWT calculations significantly faster.
In practice, downsampling an audio signal that much produced unlistenable results, and it is usually recommended not to use a sampling frequency lower than five to ten times
the highest frequency. However, in this case, the signals were used only for calculations and were
not reconstructed for listening purposes. The CWT actually produced satisfying results when
downsampled with a sampling frequency of a mere 2.5 times the highest frequency. The gain of
using a higher sampling frequency had to be weighed against the cost of longer computation times. To be able to address this trade-off, the multiplication factor of the highest frequency was made a variable in the application.
7.2.3 Fourier spectra
To decide the optimal downsampling factor, the highest frequency of the input audio signal
needed to be found. Thus, the FFT was performed on the signal, producing a vector of frequency magnitudes. The highest frequency was then the index of the last non-zero element in the
vector (rescaled by sampling frequency and number of samples).
Unfortunately non-zero elements appeared, corresponding to frequencies far above the highest
frequency of any importance at all to the signal content. Only the strongest frequencies were of
interest, since the focus was put on the fundamental tones of the melody, rather than its harmonics. Therefore, a threshold was set, specifying a percentage of the highest magnitude in the FFT
vector. A frequency had to have a corresponding magnitude higher than this threshold, or else it
would not be accounted for. The thresholding process passed on entire octaves, meaning that the
CWT was always performed on n · 12 tones, where n is the number of interesting octaves. The
octave containing the frequency with the highest magnitude would always be analysed.
Figure 7.5: Thresholding using a factor of 50 %. The red window shows the interesting frequencies selected from the FFT spectra. The entire octave(s) containing the interesting frequencies is always used, as seen in the right image.
This threshold could also be used to specify at which depth the analysis would be performed,
since it implicitly set the new sampling frequency. Using a low threshold, more frequencies were
covered, and the CWT found more harmonics. This increased the computation time. If only the
fundamental tones were interesting, the threshold could be set to a high value, saving a lot of
valuable time.
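A minimal Python/numpy sketch of this thresholding step (not thesis code; the threshold value and the test signal are assumptions) could look as follows:

```python
import numpy as np

# Illustrative sketch (not thesis code) of the thresholding step: keep only
# frequencies whose FFT magnitude exceeds a percentage of the strongest peak
# and read off the highest one. Threshold value and test signal are assumptions.
def highest_interesting_frequency(x, fs, threshold=0.5):
    mags = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)          # frequency of each FFT bin in Hz
    strong = freqs[mags >= threshold * mags.max()]     # bins above the threshold
    return strong.max()

fs = 44100
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 880 * t)
print(highest_interesting_frequency(x, fs))            # about 440 Hz; the weak harmonic is cut
```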
7.2.4 Normalisation and compressor usage
Obviously, a real life music signal was not as pure as a computer-generated sine tone, and caused the CWT to produce less clinically perfect results. To rectify this, consideration needed to be given to the range of amplitudes in the input signal. A natural way to avoid this problem would be to normalise the signal, so that its maximum amplitude was set to one, just as in a pure sine signal. However, the internal relationship of the audio levels in the signal also posed problems; the louder tones could cause the CWT to miss the weaker ones. To some extent, reducing the amplitude difference in the signal using a compressor solved this. The compressor found amplitudes higher than a certain threshold value and set all these amplitude values to the threshold, which actually made it a limiter. This was followed by another normalisation, which stretched the amplitude range and made the signal easier to analyse.
Figure 7.6: Uncompressed signal (left) and the same signal compressed with a threshold of 0.3 and then normalised (right).
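A minimal Python/numpy sketch of the limiter-plus-normalisation step (not thesis code; the threshold value 0.3 follows Figure 7.6, and the test signal is an assumption):

```python
import numpy as np

# Illustrative sketch (not thesis code) of the limiter-plus-normalisation step:
# clip amplitudes above a threshold and rescale the result to a peak of one.
# The threshold value 0.3 follows Figure 7.6; the test signal is an assumption.
def limit_and_normalise(x, threshold=0.3):
    y = np.clip(x, -threshold, threshold)    # limiter: everything above the threshold is set to it
    return y / np.max(np.abs(y))             # normalisation: stretch the peak back to one

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 1000) * np.linspace(0.1, 1.0, 1000)
y = limit_and_normalise(x)
```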
7.2.5 Downsampling
From the highest frequency of interest, the signal was downsampled correspondingly. The downsampling factor d can be expressed as given by (7.3).
$d = \left\lfloor \frac{f_s}{k_N \cdot f_h} \right\rfloor$  (7.3)
Here, fs is the original sampling rate of the signal, kN is the multiplication factor for the Nyquist criterion and fh is the highest frequency found in the FFT spectra.
Thus, the downsampling meant that every d:th amplitude value was kept. This procedure introduced a new and disturbing problem. The CWT now found harmonics even in a pure sine signal, which in theory should not produce any harmonics at all. This was due to aliasing, and the signals had to be lowpass filtered before the downsampling. By setting the cutoff at half the desired sampling frequency, the artefacts of the aliasing disappeared.
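As an illustration (a Python/scipy sketch, not the thesis's Matlab code), the downsampling factor from (7.3) can be applied together with an anti-aliasing lowpass filter as follows; kN, the highest frequency and the test signal are assumptions.

```python
import numpy as np
from scipy.signal import decimate

# Illustrative Python/scipy sketch (not the thesis's Matlab code): applying the
# downsampling factor from (7.3). scipy's decimate applies an anti-aliasing
# lowpass filter before keeping every d:th sample, which removes the aliasing
# artefacts described above. kN, the highest frequency and the signal are assumptions.
fs = 44100
kN = 2.5                                  # multiplication factor for the Nyquist criterion
fh = 1760.0                               # highest interesting frequency found in the FFT
d = int(fs / (kN * fh))                   # downsampling factor, here 10

t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)
x_down = decimate(x, d)                   # downsampled signal at fs / d samples per second
```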
7.2.6 Octave-wise analysis
After the adaptive downsampling of the signal, the CWT could be performed faster, and the program was now a completely functional tone detection implementation. Next, the plan was to try
to separate different kinds of instruments from the signal, and analyse them individually. By investigating the frequency ranges of a number of normal “pop music” instruments (bass guitar,
guitar, piano and drums) the conclusion was that their frequency ranges in many cases overlap.
An efficient and intelligent identification of the instruments from the CWT analysis would therefore be hard to implement, which led to an abandonment of these plans. Instrument identification was not a feature necessary for the basic concept of the thesis ideas, but rather something to
implement in a more sophisticated version of the application.
Still, the idea of separating the signal’s frequency spectra into smaller parts was not abandoned
completely. Analysing one octave at a time allowed much more optimisation of the calculations,
due to a number of possibilities:
To use fewer scales.
Since one octave consisted of twelve tones, no CWT calculations were performed with more than
12 scales. This resulted in faster computations and smaller amounts of data to be handled simultaneously. Even though the total number of analysed octaves might not be less than before, the
separation of them eased the computational stress of the CPU.
To downsample according to the highest frequency in the particular octave.
By keeping track of the current octave being analysed, the highest frequency included in this octave was always known. Thus, an optimal downsampling rate could easily be decided for each
analysis, meaning that the lower octaves could be downsampled to a much higher extent. More
downsampled signals gave the CWT analysis less data to analyse, which decreased the computation time.
To avoid performing any calculations for octaves not containing any strong frequencies.
From the FFT analysis, the lowest and highest octaves containing strong frequencies were found.
By examining every octave in between, looking for frequencies strong enough to be accounted
for, the application decided whether or not the particular octave should be CWT analysed. No
unnecessary computations were then performed.
Using these ideas, the signal was analysed with twelve scales for each octave, where the interesting ones were selected from the FFT in the thresholding process. The frequencies of the tones
were known, and could be used to define the corresponding scale values. Based on the highest
frequency, an optimal sampling frequency was set. Each octave could then be downsampled as
much as possible.

dj = ⌊ fs / (kN · fhj) ⌋    (7.4)
The downsampling factor dj was decided for each octave j prior to the CWT analysis. fhj is the
highest frequency for the octave j. The lowpass filtering was then performed for each octave,
removing the frequency content above fhj. The octave separation is illustrated in Figure 7.7, and is
also symbolised by the grey dashed window in Figure 7.1.
Figure 7.7: Adaptive downsampling and interpolation of the CWT results.
The CWT analysis resulted in a matrix for each octave, with coefficients for the different tones.
Since each octave was downsampled differently, the number of columns in the matrices was not
the same. Therefore, the matrices needed to be “smeared” in order to be able to assemble them
back into one big matrix. This was done by making all matrices the same width as the least downsampled one (the octave containing the highest interesting frequency), and interpolating the coefficients. By doing so, the results for each octave could be piled on top of each other, making the
matrix look like it had actually been calculated using one CWT analysis, but much faster. The
procedure was reminiscent of the DWT's sub-band coding technique, with the difference being
that the different levels were smeared to the same size and put back together into one matrix.
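A minimal sketch of the "smearing" step described above is given below, assuming the per-octave CWT results are kept in a cell array (the variable names are invented for the example, and interp1 is used as one possible interpolation method):

    % Sketch: interpolate each octave's coefficient matrix to the width of the
    % least downsampled octave and stack the octaves into one big matrix.
    widths  = cellfun(@(C) size(C, 2), cwtPerOctave);   % columns per octave
    Nmax    = max(widths);                               % width of the least downsampled octave
    stacked = [];
    for j = 1:numel(cwtPerOctave)
        C  = cwtPerOctave{j};                            % 12-by-Nj coefficient matrix
        t  = linspace(0, 1, size(C, 2));                 % coarse time axis
        ti = linspace(0, 1, Nmax);                       % common, fine time axis
        Ci = interp1(t, C.', ti, 'linear').';            % interpolate every scale row
        stacked = [stacked; Ci];                         % pile the octaves on top of each other
    end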
The realisation of the octave separation sped up the CWT part of the application substantially
and it also increased the capacity to handle minutes' worth of high-quality data rather than just
seconds, as was the case prior to the optimisation.
7.2.7 Binary threshold
The CWT analysis proved itself able to accurately find the tones played in a musical piece. It was
also possible to affect the depth of the analysis, by adjusting the threshold selecting interesting
frequencies from the FFT spectra of the input signal. This way, finding harmonics was not a problem for the application, but could result in a slower analysis. However, harmonics were not the primary interest; the aim was rather to find the sequence of the melody's fundamental tones. These resulted in the highest coefficient values of the CWT matrix. To clean up this matrix and get rid of weaker matches, for instance harmonics, the matrix was thresholded into a binary version, setting all values over a certain threshold to one, and the rest to zero. This way, only the strongest
tones were left in the matrix.
Cb = C ≥ T    (7.5)
Cb is the binary matrix obtained by setting all values of the original CWT result matrix C that are greater than or equal to the threshold T to one. An effect of the thresholding was that oscillating coefficients in the
matrix were evened out, due to the binary nature of the operation. The oscillatory behaviour
might be of interest if the aim is to perform some sort of quantisation of the entire analysis, but
since the fundamental tones were the primary interest here, the binary representation was a more
suitable way of looking at the results.
The effect of the binary threshold was apparent when comparing the original matrix to the newly
obtained one (Figure 7.8).
Figure 7.8: Resulting CWT matrix (left) and binary equivalent, using a threshold of
0.2 (right).
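Expressed in Matlab, the binary thresholding of (7.5) is a single logical comparison; the sketch below assumes the CWT result matrix is called C and uses a threshold such as 0.2:

    % Sketch: binary threshold of the CWT coefficient matrix.
    T  = 0.2;                % threshold selecting only the strongest tones
    Cb = double(C >= T);     % ones where the coefficients reach the threshold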
7.2.8 Holefilling
There were problems associated with the binary thresholding. Due to the oscillatory behaviour of
the coefficients, tones could be split up; if the value oscillated lower than the threshold factor,
there appeared "holes" in certain tones. This eventually led to the application misunderstanding
the melody sequence, since a single tone could be interpreted as two individual tones.
Another problem came from the fact that the CWT results sometimes showed very small, wrongly estimated peaks at neighbours of the correctly identified tones.
Figure 7.9: “Holes” and “peaks” from the CWT analysis (left) cleaned up, using an
e-value of 0.1 (right).
Looking at the nature of these artefacts, it was easy to understand that they were actually of the
same type. In the hole case, it was necessary to fill out the matrix with a certain amount of ones,
and in the peak case, ones needed to be removed. This was performed using yet another variable
e, specifying a percentage of the sampling rate of the result matrix. By looping through the binary
matrix for each tone and making sure that the next entry came within the e interval, possible holes in the analysis were found. If the next entry was outside the interval, it was regarded as a
new event. In the same loop, an event was required to be longer than e points to be saved at all.
After performing this thresholding with a suitable value of e, the peak and hole problems were
solved. What remained was a matrix containing only the most important tones, as illustrated in Figure 7.9.
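The sketch below illustrates one way the hole filling and peak removal could be performed for a single tone row of the binary matrix. The helper name and the interpretation of e as a fraction of the row length are assumptions, not the thesis routine:

    % Sketch: bridge short "holes" and remove short "peaks" in one binary tone row.
    function rowOut = cleanToneRow(rowIn, e)
        rowOut = rowIn(:).';                        % force a row vector
        minLen = max(1, round(e * numel(rowOut)));  % the e interval in matrix columns
        idx = find(rowOut);
        for k = 1:numel(idx) - 1                    % fill holes shorter than the interval
            if idx(k+1) - idx(k) <= minLen
                rowOut(idx(k):idx(k+1)) = 1;
            end
        end
        d      = diff([0, rowOut, 0]);              % locate the remaining events
        starts = find(d == 1);
        stops  = find(d == -1) - 1;
        for k = 1:numel(starts)                     % remove events shorter than the interval
            if stops(k) - starts(k) + 1 < minLen
                rowOut(starts(k):stops(k)) = 0;
            end
        end
    end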
7.2.9 Event matrix
To simplify the melody identification and MIDI writing of the analysis results, the binary matrix
was transformed into another matrix. The purpose of this was to view the results as events: which tone was played, and when it started and ended. Every position of the binary matrix containing information (every tone found in the analysis) was translated into a row in a new three-column matrix. That is, every single piece of tone information found, at all time instants, was saved. An
illustration of this is given by Figure 7.10.
Figure 7.10: Binary matrix transformed into an event matrix. The tones are selected
in a time-wise order, and every tone transforms into two rows of the event matrix.
With this convenient way of saving the information, it was easy to sort all events by their start or
stop time. This enabled the definition of each tone played, expressed in tone numbers as used in
MIDI notation, and the exact start and stop time for every event. From this, MIDI files could be written more easily than by looking at the original binary matrix.
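A sketch of the transformation from the binary matrix to the event matrix is given below; the helper name is hypothetical and midiNotes is assumed to map each matrix row to its MIDI note number:

    % Sketch: convert the binary matrix into an event matrix with rows
    % [time, isStart, note], two rows (start and stop) per found tone.
    function events = binaryToEvents(Cb, midiNotes)
        events = [];
        for r = 1:size(Cb, 1)                       % one row per analysed tone
            d      = diff([0, Cb(r, :), 0]);
            starts = find(d == 1);
            stops  = find(d == -1) - 1;
            for k = 1:numel(starts)
                events = [events; starts(k), 1, midiNotes(r); ...
                                  stops(k),  0, midiNotes(r)];
            end
        end
        events = sortrows(events, 1);               % sort the events in time order
    end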
7.2.10 Storing the results
With the event matrix, it was trivial to find the played tone sequence. By simply looking at the start times for all notes, the order in which they were played was saved as a text string. Every saved tone was represented by two characters; the first indicating the semitone, and the second the corresponding octave number. To distinguish a tone from its raised equivalent (i.e. a G from a G#) without using two symbols, the raised ones were denoted with uppercase letters. This resulted
in the following tone range:
[cCdDefFgGaAb]
A typical tone sequence can look like Figure 7.11.
Figure 7.11: Analysed tone sequence.
Figure 7.11 turned into the desired letter representation gives the following result:
[ f4g4a4f4g4a4g4g4g4g4f4f4f4a4g4f4a4g4f4a4g4A4a4g4f4f4g4a4f4g4a4g4g4g4g4f4f4 ]
For every analysed sound file, the resulting tone sequence was saved in a text file, which formed
the “database” of the application. Furthermore, the length of every event was read from the
event matrix and saved in the database file, for later use in the synthesis part.
By having an external file as a database it was possible to open the application with any valid file,
and continue building upon it, rather than starting over with an empty database every time. In
addition, the synthesis part of the application could be started with any previously saved database,
without having to do a new analysis.
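The two-character representation can be sketched as a small conversion routine; the assumption that MIDI note 60 corresponds to c4 is made here only for illustration and may differ from the exact octave numbering used in the application:

    % Sketch: convert a MIDI note number into the letter-plus-octave notation,
    % with uppercase letters denoting raised semitones.
    function s = noteToString(midiNote)
        letters  = 'cCdDefFgGaAb';                  % the twelve semitones of an octave
        semitone = mod(midiNote, 12) + 1;
        octave   = floor(midiNote / 12) - 1;        % assumes MIDI 60 maps to c4
        s = sprintf('%c%d', letters(semitone), octave);
    end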
7.3 Synthesis
The idea of the synthesis was to examine all the tone sequences in the database created in the
analysis part, and from them create a Markov model. This model, based on a singly linked list
representation, was then used to generate an output built on the statistics from all input, in the
form of a Markov chain implementation.
Figure 7.12: Workflow of the synthesis. The rounded boxes symbolise properties that
can be altered.
7.3.1 Markov model
The linked list model required an object-oriented Matlab programming approach, which was a new and exciting experience. With Matlab offering no possibility of creating pointers to objects, a specific class was borrowed from the Data Structures & Algorithms Toolbox for Matlab [15]. At the very basis of the Markov model was an instance of a linked list, called prefixObject.list. To this list, different so-called prefix objects were added. The prefix objects consisted of two things: a prefix, which was a string sequence of tones of a certain length p, and a chain.
Figure 7.13: The list containing prefixObjects.
The chain object pointed to another list object, link.list. It also had a variable called total,
which was an integer noting the number of links in the link list attached to it. The link objects had
a variable called chr, which was the string equivalent of the found tone. They also had a count variable, which was the total number of occurrences of this given tone.
Figure 7.14: Chain object with associated list.
Assembling the different objects gave the model the appearance of Figure 7.15.
Figure 7.15: Full Markov model.
Object name          Variables
prefixObject.list    prefixObjects
prefixObject         prefix, chain
chain                link.list, total
link.list            Link
link                 chr, count
Table 7.1: Object table for Markov model.
The principle of the input analysis was to slide a window of a predefined size over all tone sequences in the database file. The size of the window corresponded to the size of the prefixes of
the model. For every prefix inside the window, a prefix object was created and the following tone
was stored in link.list within the chain object.
This way, every analysed prefix was stored, and with it all possible following tones. If a prefix that
had already been stored was found again, its next tone was added to the same chain. If the tone existed in the chain, the count variable was increased by one, and if the tone was new it was stored
as a new link.
By repeating this procedure for all inputs, the result was a Markov model consisting of all prefixes of length p. Associated with each prefix were the possible successive tones: zero, one or several.
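To make the procedure concrete, the sketch below builds an equivalent prefix model using containers.Map instead of the linked list class from [15]; tones is assumed to be a cell array of two-character tone strings and p the prefix length. This is only an illustration of the idea, not the thesis implementation:

    % Sketch: slide a window of length p over the tone sequence and store every
    % prefix together with its following tone and an occurrence count.
    model = containers.Map('KeyType', 'char', 'ValueType', 'any');
    for i = 1:numel(tones) - p
        prefix    = strjoin(tones(i:i+p-1), '');    % the p tones inside the window
        successor = tones{i+p};                     % the tone following the window
        if ~isKey(model, prefix)
            model(prefix) = containers.Map('KeyType', 'char', 'ValueType', 'double');
        end
        chain = model(prefix);                      % the chain attached to this prefix
        if isKey(chain, successor)
            chain(successor) = chain(successor) + 1;  % existing link: increase count
        else
            chain(successor) = 1;                     % new link for a new tone
        end
    end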
7.3.2 Prefix length
The most important factor during the analysis of the tone sequences was the prefix length. By
being able to change the prefix length, the order of the Markov chain could be altered dynamically. Practically, this was implemented by using a variable-sized queue, with p denoting how many tones were stored simultaneously in the queue. If p was changed, the input was re-analysed with the
new prefix length.
The idea was that by increasing p, and thereby increasing the order of the Markov chain, the output would have more in common with the input sequences, and vice versa. A high value of p
resulted in the application using longer patterns of tones as prefixes. A low value on the other
hand, meant that the application only used a very short pattern of tones as prefixes, resulting in a
more randomly created output.
7.3.3 Tone sequence analysis example
An example of a tone sequence analysis using the prefix length of four (p = 4) is given below.
The tone sequence in this example is the typical output of the analysis described in section 7.2.
[ f4g4a4f4g4a4g4g4g4g4f4f4f4a4g4f4a4g4f4a4g4A4a4g4f4f4g4a4f4g4a4g4g4g4g4f4f4 ]
The first step of the statistical analysis would be to store the first four tones in the queue:
Queue: [ f4g4a4f4 ]
Successor: [ g4 ]
A prefix length of four means that the queue contains four tones at a time, and always stores the
following tone as a possible successor to the current prefix in the queue. In this case, the prefix
[ f4g4a4f4 ] will have the tone [ g4 ] stored as a possible successor, as shown in Figure 7.16.
Figure 7.16: Prefix and followers added to the Markov model.
Next, the [ g4 ] tone is put in the queue, meaning that the first tone is thrown out:
Queue: [ g4a4f4g4 ]
Successor: [ a4 ]
With the current queue now being [ g4a4f4g4 ], the successor [ a4 ] is stored for this prefix. The
procedure is repeated for the entire sequence.
Figure 7.17: All queued prefixes are added to the Markov model.
Passing the entire tone sequence through the queue gives the model the relationship between
prefixes and their following tones shown by Table 7.2.
Prefix ²           Following tone   Occurrences
[ g4f4f4f4 ]       [ a4 ]           1
[ f4f4f4a4 ]       [ g4 ]           1
[ f4f4a4g4 ]       [ f4 ]           1
[ f4a4g4f4 ]       [ a4 ]           2
[ a4g4f4a4 ]       [ g4 ]           2
[ g4f4a4g4 ]       [ f4 ]           1
[ g4f4a4g4 ]       [ A4 ]           1
[ f4a4g4A4 ]       [ a4 ]           1
[ a4g4A4a4 ]       [ g4 ]           1
[ g4A4a4g4 ]       [ f4 ]           1
[ A4a4g4f4 ]       [ f4 ]           1
[ a4g4f4f4 ]       [ g4 ]           1
[ g4f4f4g4 ]       [ a4 ]           1
[ f4f4g4a4 ]       [ f4 ]           1
[ f4g4a4f4 ]       [ g4 ]           2
[ g4a4f4g4 ]       [ a4 ]           2
[ a4f4g4a4 ]       [ g4 ]           2
[ f4g4a4g4 ]       [ g4 ]           2
[ g4a4g4g4 ]       [ g4 ]           2
[ a4g4g4g4 ]       [ g4 ]           2
[ g4g4g4g4 ]       [ f4 ]           2
[ g4g4g4f4 ]       [ f4 ]           2
[ g4g4f4f4 ]       [ f4 ]           1
Table 7.2: Prefixes and followers of the tone analysis example.
² Note that the prefixes are not in the same order as the input is processed.
Only the [ g4f4a4g4 ] prefix has two different tones as possible successors, with the analysis finding an [ f4 ] on one occasion and an [ A4 ] on another. As seen from the number of occurrences, some of the other prefixes have also appeared more than once, but always with the same following tone.
From this information, the statistics are fairly obvious; for all prefixes except [ g4f4a4g4 ], the
following tone is 100 % certain. For this prefix, there is a 50 % chance that the next tone is a
[ f4 ] and a 50 % chance of it being an [ A4 ]. Building a table like this for all input creates an
equivalent of the transition probability matrix commonly used in Markov chain applications.
7.3.4 Creation of new tone sequences
With the prefix model used as the probability source, a Markov chain implementation was used
to generate the output. Again, this was performed using a queue system. If the analysis was performed with a prefix length of p, the Markov chain would use the same prefix length. This meant
that the new queue was of the same size.
The principle is simple; by randomly picking a new tone from the links associated with the current
prefix and putting it in the queue, a new prefix is created, with new possible successors. Using the
example in 7.3.3, the procedure looks like this:
Queue: [ g4f4a4g4 ]
Possible successors: [ f4 ], [ A4 ]
A random number is created, which is used to step through the link list a certain number of
times. If the [ f4 ] is selected, it is inserted into the queue:
Queue: [ f4a4g4f4 ]
Possible successor: [ a4 ]
The queue content is now a new four-tone prefix. In order to find the next successor, the application finds the prefix object associated with this prefix, and looks at the possible following
tones. In this case, the prefix [ f4a4g4f4 ] always leads to the next tone being an [ a4 ].
To guarantee that the output will be longer than one prefix, the starting prefix was always randomly selected from all prefixes in the model having any possible successors. There were two
possible cases of ending the synthesis; when a prefix with no successive tones was currently in
the queue, or when the output length reached a predefined maximum length. This way, the user
was able to control the length of the application output. If the result was satisfying for the user, it
was possible to save the tone sequence as a MIDI file. If not, the output could be shuffled, creating a new sequence.
Every generated tone sequence ended with the same tone as it began with. This was a feature
inherited from most existing real life music, and it made the melodies feel like they had a proper
ending.
Using this simple procedure, the model was used to create the new output tone sequence. Since it
contained information for all input, using it to synthesise new sequences was a statistically safe
way of combining all the input characteristics into one unique output. The synthesis was purely
based on what was found in the analysis, and all generated combinations of tones were guaranteed
to have been found somewhere in the input. If a certain tone combination was common, it was
more likely to appear in the output.
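A corresponding sketch of the generation step, reusing the containers.Map model from the earlier sketch, is given below (maxLen and the weighted random pick are assumptions made for the example):

    % Sketch: generate a new tone sequence by repeatedly picking a successor at
    % random, weighted by its count, and sliding the prefix queue forward.
    allPrefixes = keys(model);
    queue  = allPrefixes{randi(numel(allPrefixes))};   % random starting prefix
    output = queue;
    for n = 1:maxLen
        if ~isKey(model, queue), break; end            % prefix without successors ends the chain
        chain  = model(queue);
        succ   = keys(chain);
        counts = cell2mat(values(chain));
        pick   = find(cumsum(counts) >= rand * sum(counts), 1);
        output = [output, succ{pick}];
        queue  = [queue(3:end), succ{pick}];           % drop the oldest tone (two characters)
    end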
7.3.5 Controlling the characteristics of the output
At the same time as the tone sequences were read from the database, all stored tone lengths were
read into a vector. When a new tone sequence was synthesised and written to a MIDI file, these
lengths were used for deciding the length of each tone through random selection from the values
in the vector. A tone shorter than the shortest value could not be created, and likewise, the highest value could not be exceeded.
But just randomly selecting values for the tone lengths produced highly unrhythmical results,
sounding much like a small child tinkling away on the piano (or any other chosen MIDI instrument). And like the piano playing of small children, this hardly had any musical value at all to
anyone but the parents (or in this case; the programmers). In order for the output to be more
listenable, there was clearly a need to take control of the tempo for the melody being played.
Introducing a variable for tone length variation, it was possible to specify the amount the tone
lengths were allowed to differ from the mean value of the length vector. All values that fell outside the specified range were trimmed. Naturally, it followed that the smaller the variation, the
more alike the tone lengths were. However, a zero percent variation created results that were too
perfect even for a skilled pianist, and at least a few percent of tone length variation was needed.
Setting a small tone length variation introduced an even tempo throughout the new synthesised
“song”, based on the mean value of all lengths from the input. But what if a tempo change was
desired? A logical progression from being able to control the amount of variation was to introduce another variable, which displaced the value the variation was centred around. This could be
set to any value within the range of the length vector. As a result, changing the length of the
tones implicitly set the tempo. If the real tempo (in beats per minute) was of any interest, it could
easily be approximated by dividing 60 by the centre value.
To make the output sound “rhythmically” correct, a quantisation of the length vector was performed. The centre value was divided by two as many times as possible without being smaller
than the shortest time value allowed. The resulting value was used as the quantisation factor, making
all other values multiples of it. This way, all tone lengths were related, making them sound better
together. The quantisation factor was also used for deciding the spacing between tones. This
introduced a more dynamic sound, as the melody became less mechanical.
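The length handling described above can be sketched as follows; lengths is assumed to hold all stored tone lengths (in seconds), variation the allowed relative deviation, and the centre value is here simply the mean. All names are assumptions for the example:

    % Sketch: trim the tone lengths to the allowed variation around the centre
    % value and quantise them so that all lengths become multiples of one factor.
    centre  = mean(lengths);                       % or a user-chosen centre value
    low     = centre * (1 - variation);
    high    = centre * (1 + variation);
    trimmed = min(max(lengths, low), high);        % trim values outside the range
    q = centre;
    while q / 2 >= min(trimmed)                    % halve while still above the shortest length
        q = q / 2;
    end
    quantised = max(q, round(trimmed / q) * q);    % all lengths become multiples of q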
7.4 MIDI representation
Since the analysis part offered possibilities for a wav2midi-conversion of the analysed input, and
the synthesis created new tone sequences based on the same structure, both cases required a writer that transformed the string sequences into actual playable sound. This was performed by using the event matrix and transforming it into a MIDI file.
7.4.1 Writing the MIDI format
The event matrix was of the form defined by (7.6).
[ time   start   note ]
[   ...    ...    ...  ]
[ time   stop    note ]        (7.6)
Here, time is the sample value where the event occurs, start/stop is a one or a zero, respectively, and note is a MIDI note number between 0 and 127. Every note that had a start event also had to have a stop event, resulting in twice as many rows in the event matrix as there were tones.
For this thesis, the method chosen for playing the new synthesised music was of far less importance than the synthesis itself. For simplicity, MIDI was chosen because it was a well-known standard and there was plenty of information about it to be found on the Internet. A useful example of the latter was Mosley's thesis, which provided a complete MIDI writing routine for
Matlab [6]. This MIDI writer was somewhat modified for use with the application described.
All times were recalculated from sample values to MIDI ticks using the formula (7.7).
tMIDI = (ts / fs) · (bpm · ppqn / 60)        (7.7)
Here, tMIDI is time in MIDI ticks, ts is time in sample value, fs is the sampling frequency, bpm is the
tempo in beats per minute and ppqn is the resolution of the MIDI file in ticks per quarter note. By
dividing the sample value by the sampling frequency, time was converted into seconds and all
dependency of the sample rate was removed.
These MIDI tick values were then transformed yet another time, into relative time values, delta-times. A delta-time is the time from the previous event to the current one, and this saved a lot of data since the time between two events rarely was large compared to the absolute time value of an event. When writing the delta-times to a (binary) MIDI file, even more data was saved by utilising a variable-length format, meaning that only the necessary number of bytes (in principle, never more than two) were used.
In order for the notes being played to sound more "human" and less mechanical, the attack velocity of each start event was set to a random value between 100 and 127. This was an AI approach simulating the trivial fact that a human being is not in any way an exact machine. Making the flaws and mistakes of regular persons visible in an application made it more trustworthy – people in general tend not to rely on things that are "too perfect".
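The time conversion of (7.7), the delta-time step and the randomised attack velocities can be sketched in a few lines; eventTimes, nStartEvents and the other names are assumptions, not the routine from [6]:

    % Sketch: convert event times in samples to MIDI ticks, form delta-times and
    % randomise the attack velocity of every start event.
    ticks      = round(eventTimes / fs * bpm * ppqn / 60);  % eq. (7.7)
    deltaTimes = diff([0; ticks(:)]);                       % time since the previous event
    velocities = randi([100 127], nStartEvents, 1);         % "human" attack velocities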
7.5 Application process flow
The flow of the application process is visualised by Figure 7.18.
Figure 7.18: Workflow of the entire application.
7.6 Conclusion
The analysis was performed using a frequency-domain-based CWT algorithm, which downsampled the signal differently for different octaves, analysed it and stored the result in a coefficient
matrix. This matrix was then transformed into an event matrix, storing the different tones in time
order, which resulted in the exact tone sequence found in the sound file. By saving this sequence
in a text file, along with every found tone length, the analysis was completed, and the file was
passed on to the synthesis part.
The purpose of the synthesis part of the thesis was to generate new tone sequences by analysing
the ones found in the analysis part. Through defining a so-called prefix length, the database file
containing all input tone sequences was analysed. All existing prefixes and their possible successors were stored in a Markov model. The statistics for all inputs were collected in the same model.
A Markov chain implementation then used this model and its associated probabilities and created
an entirely unique output purely based on the input tone sequences. By transforming the output
into a MIDI file, it could be listened to. It was also possible to manipulate the output with some
parameters affecting tempo, tone length and variation.
8 Results
The final result of the thesis work was an application that analysed audio files containing music,
stored the tones of the melodies played and statistically synthesised new material based on one or
several inputs. All kinds of output of the application could be represented in the MIDI format,
whether it was an interpretation of the input (i.e. a “wav2midi” conversion) or a synthesised
unique melody. A GUI was implemented to allow easier manipulation of a number of variables, and thereby the data.
8.1 Analysis
To focus the analysis on only the most interesting frequencies, or tones, an examination of the
frequency content of the input signal was performed. From this, the relevant frequencies were
selected, and passed on octave-wise to a continuous wavelet transform (CWT) analysis. The
scales of the CWT were selected to correspond with the tones of the current octave, resulting in
correlation coefficients for every tone at every sample point of the signal. Prior to the CWT, the
signal was downsampled as much as possible without losing information for the particular octave.
By doing so, a lot of computational effort could be avoided.
The data was processed in some ways for easier use and clearer results. Up to this point, the application was more or less a straight up “wav2midi” converter. For synthesis purposes, all found
tone sequences were then saved in a database text file.
8.2 Synthesis
The database file was run through a window of a certain size. For every string of tones in the
window, the following tone was stored, resulting in a statistical model of all window-sized strings
and their successors. This model was used as the source for a Markov chain implementation of any
chosen order, producing new tone sequences. The output was purely based on the database, making it unique but still with structural similarities to the input. By modifying certain parameters, the
characteristics of the synthesised music could be altered.
8.3 Application screenshots
Screenshots of the three different steps of the analysis and the Markov application can be seen in
Figures 8.1 through 8.4.
Figure 8.1: Step 1 of the application.
Figure 8.2: Step 2 of the application.
Figure 8.3: Step 3 of the application.
Figure 8.4: Markov application.
9 Conclusions
As a measure of the success of the thesis work, the objectives can be recapitulated:
“The purpose of this thesis is to generate new music from existing sound material by using frequency analysis and
the statistical properties of the analysed information. (…) What this means is that the new music will in fact be
based on the characteristics of all the different input music, but will still be something completely unique.”
Comparing the finished application with the declared objectives, it is apparent that they are fulfilled. Although some limitations have been applied, the final result is new music, based on the
analysis of all input. However, due to the nature of the thesis, limitations may not always be synonymous with failures. Problems in some areas were expected from the beginning, and all insight
gained is actually a part of the work.
One aspect, which may not have been as predicted, is the distinct separation of the thesis into
two different parts, which proved hard to combine. The analysis part proved to be the major one, consuming most of the project time. In fact, it was complex and interesting enough to
form a thesis of its own. But since there was the synthesis matter left to deal with, the time was
not sufficient to realise the true potential of the wavelet analysis.
9.1 Problem formulations
Another way of confirming the thesis result is to return to the fundamental problem questions,
defined in the very beginning, and see how well they are answered:
What sort of information is possible to extract from an arbitrary piece of recorded music?
The main information retrieved is what tones are played, and when they are played. While this
offers no possibility to estimate the tempo, it gives the position in time and the length of all
tones, which in a way can be seen as tempo. No beat detection is performed, and drum sounds
are not accounted for in the analysis. Harmonics can be found, but they are not stored. The essence is to find the fundamental tones forming the melody of the music.
How must the extracted information be represented to be storable?
Since the focus is put on the melody, it is saved as a text string, simply noting the order in which
the tones were played. Along with this, the length of each tone is stored, for synthesis purposes.
All analysed input is stored in the same text file, forming a database of the available information
for the synthesis.
How can the stored information be used to synthesise new material?
A statistical analysis of the sequences in the database file is performed, in order to make a single
model of all patterns of tones and their possible successors. This way, all input characteristics are
used to form the output, which is completely synthesised from the model. During the synthesis,
the stored lengths are used to decide the lengths of the new tones.
The thesis work has managed to answer the problem questions on which it was built, although
the answers are in some cases simplified compared to what was expected.
9.2 Limitations
Wild ideas are in some ways the necessary fuel for the inspiration when carrying out an experiment like this thesis. But as always, the circumstances will sooner or later bridle the enthusiasm.
There was simply not enough time to implement features nearly as advanced as those desired
from the start. Limitations had to be applied to the objectives, some of them more prominent
than others in the final application.
9.2.1 No storing of simultaneous tones
The Markov model has a major drawback when working in the context of music; the lack of
methods for analysing and synthesising concurrent events. Thinking of the complexity of music
in general, it is quite obvious that simultaneous tones are more of a rule than an exception. To
somewhat compensate for this, the application treats concurrent tones as if they are following
each other. In reality, tones rarely start at the exact same time instant. Even when playing chords
there is likely to be a small gap in time between the start of each tone. And since all tones in a
(not too experimental) song are supposed to harmonise with each other, treating simultaneous
tones as being played one at a time does not corrupt the content of a song. It may slightly change
the melody though.
Not being able to synthesise simultaneous events is a larger problem. To only create output
where no tones are concurrent can never produce anything sounding more advanced than a
simulation of a human playing a simple monophonic tune on, for example, a piano.
9.2.2 No instrument identification or separation
Identifying and separating instruments proved to be a task far more challenging than expected.
The only information found on this subject involved machine learning – training the program in
recognising the sound of certain instruments. As one of the fundamental ideas behind the thesis
was to provide the program with sound files and nothing else, this (in itself interesting) AI approach would be a step in the wrong direction. Besides, using a technique like machine learning
would still mean that the program would just be able to identify instruments that it had been
trained to recognise. Only generalised approaches were of any interest, as the application was not
supposed to “know” anything about the input. All in all, the sum of the limited amount of time
available and the lack of well-known methods for instrument identification led to these ideas being abandoned.
9.2.3 MIDI for playback
Although MIDI can be a powerful tool for musicians, the standard synthesiser on a regular inexpensive soundcard can never reproduce MIDI sounds in such a way that they sound anything like
real instruments. Playing the output from the Markov chain as MIDI does not impress the listener in any way. However, due to the fact that no instrument identification has been performed,
what is being played is unknown, and a cheap synthesiser is as good a replacement for an arbitrary
instrument as any “perfect” simulation of organic sounds.
9.2.4 No beat detection
Looking at the tone lengths and the general structure of the input, the tempo can fairly easily be
found. But as soon as more input is to be analysed, there are problems if the tempo is not the
same. How should the tempos be combined into one value for the output? Should the new
tempo simply be the mean of the inputs, or should it matter which of the inputs has contributed the most to the output? If there are changes in tempo somewhere in any input, should its tones be separated or should the mean value be used in the analysis? There were a lot of uncertainties surrounding these matters and in the end, a completely different approach was used –
looking at the lengths of individual tones and using them to construct the tempo of the new synthesised music.
9.2.5 Combining different inputs
In order to be able to synthesise output that is a combination of all inputs, they have to be in the
same range since the Markov chain does not allow for leaps that were not present in the input.
For example, the sequences [ c4d4e4c4 ] and [ d4c4e4d4 ] are not possible to combine using a
chain of third or second order. They can be used together if the chain is of first order, but on the
other hand this is rarely of any interest. Seeing that input sometimes can be difficult to combine
even when the melodies are played in the same octave, it can easily be understood that there will be
problems if some input is played in a completely different octave. This is not accounted for in the
application, and using inputs that differ much in frequency will most likely result in an output
comprised of only tones from one of the inputs.
9.3 Thesis separation
The major problem with the thesis separation is the link between the two parts. Both of them are
powerful, but unfortunately combining them means losing some of the possibilities. The most
obvious example of this is the loss of information found in the wavelet analysis, i.e. it is not possible to make full use of the ability to actually find simultaneous tones and harmonics.
Due to this, the analysis looks like the more successful of the two parts at first glance. Further
development of it could form a very useful wav2midi-converter, fully able to convert not only
monophonic melodies, but also simultaneous tones and chords played. Practically, this would
mean utilising a quantisation of the coefficient results from the analysis, rather than the binary
thresholding, which removes all information about the actual strengths of the found tones.
The Markov chain implementation is also very useful on its own, especially since it offers the user
a way of dynamically changing the order of the chain. The major problem however, is that the
structure of the Markov chain favours the use of strings and is generally text-based in its appearance. Applying it to language and text synthesis rather than the music representation of this thesis
would probably prove its true strength. Since the analysis part mainly expresses itself with
correlation coefficients and numbers, there is an apparent communication issue causing
problems. Perhaps using a simpler method could have performed an equally suitable analysis for this purpose. With the Markov synthesis only using monophonic input material, the wav2midi principles could have been adequate.
9.4 Artificial intelligence aspects
In order to enjoy the music created with the application, and avoid having a machine-like feel to
the output, certain AI factors have proved themselves very important. The focus during the
planning of the project was put on the pure analysis and structuring of sound information, but
when the synthesis actually started to produce listenable results, the characteristics of the output
needed to be considered. What makes a melody sound like it is played by a human being? Originally, the idea was that the analysis would provide the synthesis with all necessary information for
it to create an output that sounds “real”, without having to modify any parameters of the synthesis. However, since the analysis results had to be somewhat limited, some parameters had to be
introduced in the synthesis to make up for the lost information. By altering these, the output can
take on different forms, giving the user a possibility of creating something very machine-like or
human, whichever is desired. This aspect took the thesis into yet another major research field;
simulation of human behaviour.
9.5 Music theory aspects
The plan originally was to minimise the amount of music theory used in the thesis, and approach
the problem from a very mathematical point of view. These plans have not been abandoned in
the final application, although some bits of music theory were necessary to implement:
Tone frequencies
During the wavelet analysis, the idea is to analyse signals with respect to their frequency content.
To make this relevant for music, all tones searched for during the analysis are translated into pure
frequencies. This requires the analysis to use the normal western tone notation of twelve semitones in each octave, and may restrict the application from being used in more experimental musical settings.
To start and end the synthesis on the same tone
This may seem like a minor detail, but implementing it actually makes the output sound
more like a real melody. This is because it is very uncommon to end melodies on a different note
than the one they started with.
Although no information is stored with regard to chords or harmonics found in the analysis, the
application very rarely assembles notes that fit badly together. Storing the found tones in time
order solves this. Chords are not stored as a unit, but as individual tones. However, since the
individual tone’s successors become the other tones of the chord, they are virtually guaranteed to
sound well together. This is reflected in the synthesis, since no other successor tones than the
ones stored from the analysis are selected.
One major drawback of not using any tempo or beat detection in the analysis is that the synthesis often produces quite unrhythmical melodies. This is something that a more thorough analysis based on theoretical tempo knowledge possibly could have solved.
9.6 Final comments
Generally, the idea of music synthesis based purely on a thorough analysis of input material, and
especially the idea of being able to affect the compositions only by selecting the input and setting
certain parameters, is a mouth-watering prospect. It offers anyone the chance of creating music,
without having any criteria for talent or musical knowledge. While the final result of this thesis
might not offer this sort of complex composition possibility, it clearly presents the high potential
of the principle. Expanding the application to handle polyphonic tones and
melodies would mean a lot of work, but this thesis and the future work proposed by it could be a
good starting point.
The work has involved more or less deep insights into a number of different subjects; music theory, DSP, AI, mathematics and so on. While this fact poses limitations, mostly due to insufficient
time resources (i.e. each field could have been explored more), it also leads to a vast number of
possible future implementations. The realisation of these features would most definitely take the
thesis one step closer to the ultimate aim: an artificial “hit-maker”.
10 Future work
This chapter states possible extensions and ideas for future work gained during the course of the
thesis work.
10.1 Improving performance
• The most apparent solution for improving the performance of the application would be to translate the code into a compiled programming language, such as C or C++. The reason for not doing this directly during the process of the thesis is that the Matlab language contains predefined methods for a number of the most important steps of the application: sampling, Fourier transforms, wavelets, matrix handling and so on. Using another language would require all of these functions to be written, taking up a lot of valuable time.
• Another important improvement could be to separate input files into smaller parts, and then
perform the wavelet analysis at smaller intervals. Some sort of interpolation could be performed, filling out the analysis results in between the different parts. This would in principle
remove the upper limit (hardware dependent) for the size of the files being analysed, and
could also possibly improve the speed of the analysis.
10.2 Improving the features of the analysis
• Perhaps the most important improvement of the analysis would be to implement a proper
instrument identification. By being able to separate the analysis, each instrument could be
analysed and synthesised individually. If the MIDI format is kept, this would mean that the
synthesis could assign the correct instruments to the new melodies, creating a full soundscape
rather than a single monophonic melody.
• Proper instrument identification would also allow drums to be properly analysed, putting
more effort into tempo and beat tracking. This may result in the beat and drum synthesis
forming a separate part of the application. The human voice is also an interesting feature;
what if it was possible to analyse the backing music and the sung melody individually? This
way, the music and the vocal melody could also be synthesised in order to match each other,
along with the drum patterns.
• Since the analysis is able to find simultaneous tones, a natural progression would be the ability to store chords, not only single tones at a time. This would move the application
even further away from its monophonic nature, and make it employ more of the chord and
scale theories.
• A quantisation of the CWT coefficients is already mentioned in sections 7.2.7 and 9.3. By
determining the strengths of all tones, fundamentals as well as harmonics, the analysis could
in a simple way be developed to be a powerful wav2midi-application, since the harmonic features of the tones could be utilised. It would also offer the possibility of synthesising MIDI
based on the strength of the tones in the input, rather than just randomising the velocity variable, and thereby obtaining a more successful simulation of human behaviour.
• Another way of improving the analysis could be to extend it so that it also incorporates looking at the signal in the time domain. Sometimes information about a tone's duration and location in time is easier to derive from this domain than from the time-frequency domain representation of the wavelet transform. A combination could be a way of optimising the analysis results.
• The final application can only deal with one-channel sound files, i.e. mono audio. Of course,
stereo handling is a concern for future work. Should the analysis be performed for each separate channel or should the channels be merged? Analysing them separately could extend the
synthesis, but exactly how should the channel information be used?
• A major issue is the choice of transform. The CWT implementation of this thesis is a sort of
mixture of the CWT and the DWT, i.e. it performs a complete analysis, but only at selected
scales, and it downsamples the signal according to the frequency content of interest. However, the pure DWT may be an alternative, if an equally efficient way of reconstructing signals
from the (fewer) coefficients can be found.
10.3 Extending the statistical model
• In order to reduce the communication problem between the analysis and synthesis parts, the
possibility of storing complete chords would be useful. To do this, the Markov model needs
to be able to handle several tones at a time. The relationship and statistics of not only single
tones, but also full chords would then serve as a source for the synthesis. Again, this would
also allow for more “proper” music theory being used, i.e. different scales and their corresponding chord structures.
• The statistical model could also be improved by transposing the input, so that a connection
between all sequences is guaranteed. The possibility of merging music played in completely
different octaves and/or keys has to be weighed against the loss of important characteristics.
• Another way of looking at the storing of analysed information would be to store the actual
CWT coefficients directly, and thereby avoiding the “stringification” of all analysis results.
This, along with a statistical model better suited to this type of representation than the
Markov chain, which is quite dependent on text-based data, may change the synthesis part
completely.
• To generate an output having the structure of a traditional song, "intelligence" has to be added to the statistical model. Recurring events like verse and chorus have to be found in
the statistical analysis, and used in the synthesis. This could involve both a Markov chain, and
some other structural analysis method specialised in pattern recognition.
10.4 Enhancing the realism of the synthesis
• For simplicity, MIDI was chosen as the format for all synthesis. It is an easy way of
representing the synthesised tone sequences, and making them listenable. However, the
sound of a MIDI representation can in no way be compared to wave-format audio.
Therefore, a more realistic sounding synthesis representation is desirable. One proposal for
this, although probably quite advanced, is to perform the synthesis by using an inverse
transform directly on the CWT coefficients. Of course, this requires that the synthesis
produces an output written as coefficients. If done properly, the coefficients would include all
information about harmonics and thereby the most important characteristics of the
instrument being played. The MIDI format could then be abandoned for “real” sound waves.
The Morlet pseudowavelet used does not meet the admissibility condition, and is therefore
not fully invertible. For this type of synthesis to become a reality, there might be a need for
another type of mother wavelet.
11 Closing thoughts
Ever since the first synthesisers arrived, people have always argued about who is really composing the music when a machine is involved in one way or another. If software is used in the production of the music, is it then the user, the programmer or the software itself that is the composer? Issues like these become more and more relevant as computers become more and more a
natural part of music creation. For this thesis, questions concerning who or what is making the
music are of even greater importance, since a computer is used to create new music, completely
based on the compositions of (usually) other people than the user.
Where should the line between theft and borrowing and/or gathering inspiration be drawn? This
is a common problem in the music industry where people are basically sued on a daily basis for
“stealing” parts of other people’s songs. But what if a machine is doing the stealing? Is it all right
as long as it cannot easily be heard from where a particular part has been taken? This could mean
that it might be okay to use a Markov chain of low order, but not one of higher order. As the
order of the chain increases, the output will be more like the inputs, and when the order becomes high enough, it will in principle be an exact copy of one of the inputs. At this point, the
application has more or less created a cover version of another song and it is time to start paying
royalties to the composer.
Li discusses music creation with machines, and focuses on the almost philosophical question
“can a machine ever be considered the composer of a musical piece?” There is no doubt that
there are music-making machines out there, the application described in this thesis being one of
them. But a machine producing music is not a composer by definition. To be able to answer the
question above, there has to be a distinct definition of what music composition really is. According to Li, composing music must involve intelligence to some extent; the composer must be able
to make its own choices based on some general knowledge. If the composition system uses ad
hoc knowledge directly from the creator it cannot be seen as a composer. In this case, the machine is merely an extension of the builder, not an intelligent entity of its own. [5]
But what about a machine that does not have any knowledge whatsoever and is unable to learn?
In the application accompanying this thesis, all decisions are based on the statistics of previous
decisions made by various composers. Creating patterns with Markov chains is in a way just a
matter of imitating the structure of someone else’s work. This could hardly be seen as composing, could it? Still, given a large enough database of inputs and using a Markov chain of relatively
low order, the output could, at least in theory, be something completely unique. If a machine creates musical pieces that no one has heard before, is it not a composer then, intelligent or not?
Li claims that a machine can only be said to make music if the user is unrelated to the builder and if the machine is autonomous [5]. This is true for the Markov-based music application, if used by an arbitrary person. Admittedly, the software is not completely autonomous in the sense that the user provides the input and can decide the order of the chain. However, the process of combining tone sequences takes place outside the user's control, and seen that way, the application is autonomous. So, the program is fully capable of making music, but can it compose?
Perhaps the answer is no. Perhaps it is a mockery to all hard-working, talented composers out
there to even suggest that a machine in itself could be a composer. But then on the other hand, is
it the music or who composed it that is important? As soon as an artistic work of any kind is
made available to a wider audience, it does not belong to the creator anymore. Sure, he/she can
still receive royalties for it, but since anyone is free to interpret the work, it is out of the creator's control from there on. The harsh truth is that the vision he/she had for the work has become unimportant, no matter how interesting it was.
There was a time when making music with machines was considered of less artistic value than
using "real" organic instruments. Times have surely changed since then, and nowadays only musical
elitists really care about how the music was made. The possibility of seeing the name of a computer or software as composer for a musical piece in the near future is not science fiction at all.
Whether the application created in conjunction with this thesis can compose or not can be left
for the reader to decide. In any case, it does not create anything more advanced than monophonic
melodies. For this to ever be released on a record, a human being needs to include it in a larger
arrangement, possibly containing percussion and vocals. Furthermore, the program does not
really capture the structure of choruses and verses. At best, it can create a melody line, which can
be used in a song put together by a real person. This could mean that the application and a human in combination serve as composer. While this is not the most advanced software available,
there is no other well-known program that poses any concrete threat to human composers. The
day a machine can produce complete songs, ready for release to the record-buying audience, it
might be time to start thinking about another line of work. But until then (and probably far beyond), there will always be a need for the unique machinery of the human ear and mind in the
creation of musical pieces. After all, to the best of our knowledge, no man-made machine will ever
fully understand the complex emotions behind all great works of art.
12 Bibliography
12.1 Literature
[1] Alm, J. F., Walker, J. S. (2002). Time-Frequency Analysis of Musical Instruments, SIAM Review Vol.
44, No. 3.
[2] Bohn, D. (1997). Signal Processing Fundamentals, Rane Corporation.
[3] Jehan, T. (1997). Musical Signal Parameter Estimation, Berkeley University.
[4] Kamen, E., Heck, S. (1997). Fundamentals of Signals and Systems Using MATLAB, Prentice-Hall.
[5] Li, T-C. Who or What is Making the Music: Music Creation in a Machine Age, Faculty of Music,
McGill University.
[6] Mosley, B. (2002). Audio to MIDI conversion, University of Derby.
[7] Nowak, R. (2003). Fast Convolution Using the FFT, The Connexions Project.
[8] Petterson, R. (2001). Föreläsningsanteckningar om Markovkedjor, Växjö Universitet.
[9] Rauterberg, M. Why and what can we learn from human errors?, Advances in Applied Ergonomics,
West Lafayette, USA Publishing.
[10] Sadowsky, J. (1996). Investigation of Signal Characteristics Using the Continuous Wavelet Transform,
Johns Hopkins APL Technical Digest, Volume 17, Number 3.
[11] Self, G. (2001). Wavelets for Sound Analysis and Re-Synthesis, University of Sheffield.
12.2 Web
Last visited 2004-04-28 unless otherwise stated.
[12] Chapman, D. Multiana. Last visited 2003-10-10, no longer available.
http://www.met.rdg.ac.uk/~chapman/spectrum/
[13] Hansper, G. An introduction to MIDI.
http://crystal.apana.org.au/ghansper/midi_introduction/contents.html
[14] Jacques, L., et al. Yet Another Wavelet ToolBox (YAWTB), Institut de Physique Théorique.
http://www.fyma.ucl.ac.be/projects/yawtb/
[15] Keren, Y. Data Structures & Algorithms Toolbox.
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=212&objectType=file
[16] Kesteloot, L. Markov chains.
http://www.teamten.com/lawrence/projects/markov/
[17] Kieft, B. A brief history on wavelets.
http://www.gvsu.edu/math/wavelets/student_work/Kieft/Wavelets%20%20Main%20Page.html
[18] Koniaris, K. Understanding Notes and their Notation.
http://koniaris.com/music/notes/
[19] Lipscomb, E. Introduction into MIDI.
http://www.harmony-central.com/MIDI/Doc/intro.html
[20] The MathWorks Inc. Complex Morlet Wavelets: cmor, Matlab Wavelet Toolbox documentation.
http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/ch06_a37.shtml#40178
[21] The MathWorks Inc. cwt, Matlab Wavelet Toolbox documentation.
http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/cwt.shtml
[22] The MathWorks Inc. scal2freq, Matlab Wavelet Toolbox documentation.
http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/scal2frq.shtml
[23] Mathworld. Convolution, Wolfram Research Inc.
http://mathworld.wolfram.com/Convolution.html
[24] Maurer IV, John A. A Brief History of Algorithmic Composition.
http://ccrma-www.stanford.edu/~blackrse/algorithm.html
[25] Multimedia Education Group. Text Synthesis.
http://www.meg.uct.ac.za/downloads/VBA/textgen.htm
[26] The International MIDI Association. Standard MIDI-File Format Spec. 1.1.
http://www.pgts.com.au/download/txt/midi.txt
[27] Miranda, E.R. An introduction to music and Artificial Intelligence.
http://website.lineone.net/~edandalex/ai-essay.htm
[28] Mugglin, S. Music Theory for Songwriters.
http://members.aol.com/chordmaps/
[29] O'Connor, J.J., Robertson, E.F. Ingrid Daubechies.
http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Daubechies.html
[30] Recognisoft. Wav to MIDI conversion software - Solo Explorer.
http://www.recognisoft.com
[31] Russell, S. Introduction to AI – a modern approach.
http://www.cs.berkeley.edu/~russell/intro.html
[32] Valens, C. A really friendly guide to Wavelets.
http://perso.wanadoo.fr/polyvalens/clemens/wavelets/wavelets.html